Lecture 11
• Model
1
Regression Analysis…
Our problem objective is to analyze the relationship between
interval variables; regression analysis is the tool we will study.
2
Regression Analysis…
3
Model Types…
Deterministic Model: an equation or set of equations that allow us
to fully determine the value of the dependent variable from the
values of the independent variables. Deterministic models are
usually unrealistic.
E.g. is it reasonable to believe that we can determine the selling
price of a house solely on the basis of its size?
Contrast this with…
E.g. do all houses of the same size (measured in square feet) sell
for exactly the same price?
4
A Model…
To create a probabilistic model, we start with a deterministic
model that approximates the relationship we want to model
and add a random term that measures the error of the
deterministic component.
Deterministic Model:
The cost of building a new house is about $100 per square foot
and most lots sell for about $100,000. Hence the approximate
selling price (y) would be:
y = $100,000 + ($100/ft²)(x)
(where x is the size of the house in square feet)
5
A Model…
A model of the relationship between house size (independent
variable) and house price (dependent variable) would be:
[Figure: House Price (vertical axis) vs. House Size (horizontal axis). The line "Price = 100,000 + 100(Size)" reflects a building cost of about $100 per square foot; its intercept reflects that most lots sell for about $100,000.]
In this model, the price of the house is completely determined by the size.
6
A Model…
In real life, however, the selling price will vary even among houses of the same size:
[Figure: House Price vs. House Size, showing prices scattered around the line with lower vs. higher variability and an intercept near $100,000.]
The probabilistic model therefore adds a random error term ɛ:
y = 100,000 + 100x + ɛ
8
Simple Linear Regression Model…
y = β0 + β1x + ɛ
where y is the dependent variable, x is the independent variable, and ɛ is the error variable.
9
Simple Linear Regression Model…
β1 = slope (= rise/run)
β0 = y-intercept
10
Estimating the Coefficients…
In much the same way we base estimates of μ on x̄, we estimate β0 using b0 and β1 using b1, the y-intercept and slope (respectively) of the least squares or regression line given by:
ŷ = b0 + b1x
where b1 = s_xy / s_x² (the sample covariance of x and y divided by the sample variance of x) and b0 = ȳ − b1x̄.
11
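As an illustration of these formulas (not part of the original slides), here is a minimal Python sketch that computes b0 and b1 for a small, made-up experience/bonus data set; the arrays x and y are purely hypothetical.

import numpy as np

# Hypothetical data: years of experience (x) and annual bonus in $1,000s (y)
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([17, 20, 24, 26, 31, 33], dtype=float)

x_bar, y_bar = x.mean(), y.mean()

# Sample covariance of x and y, and sample variance of x (both divided by n - 1)
s_xy = np.sum((x - x_bar) * (y - y_bar)) / (len(x) - 1)
s_xx = np.sum((x - x_bar) ** 2) / (len(x) - 1)

b1 = s_xy / s_xx          # slope of the least squares line
b0 = y_bar - b1 * x_bar   # y-intercept

print(f"y-hat = {b0:.2f} + {b1:.2f} x")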
Example 16.1
The annual bonuses ($1,000s) of six employees with
different years of experience were recorded as
follows. We wish to determine the straight line
relationship between annual bonus and years of
experience.
12
Least Squares Line…
This line minimizes the sum of the squared differences between the points and the line…
13
Example 16.2…
Car dealers across North America use the "Red Book" to help
them determine the value of used cars that their customers trade
in when purchasing new cars.
The book, which is published monthly, lists the trade-in values
for all basic models of cars.
It provides alternative values for each car model according to
its condition and optional features.
The values are determined on the basis of the average paid at recent used-car auctions, the source of supply for many used-car dealers.
14
Example 16.2…
However, the Red Book does not indicate the value determined by
the odometer reading, despite the fact that a critical factor for
used-car buyers is how far the car has been driven.
A dealer recorded the price ($1,000s) and the number of miles (1,000s) on the odometer for a sample of 100 cars (data file Xm16-02).
15
Example 16.2…
Click Data, Data Analysis, Regression
16
Example 16.2…
SUMMARY OUTPUT

Regression Statistics
Multiple R            0.8052
R Square              0.6483
Adjusted R Square     0.6447
Standard Error        0.3265
Observations          100

ANOVA
                 df        SS        MS          F      Significance F
Regression        1     19.26     19.26     180.64            5.75E-24
Residual         98     10.45      0.11
Total            99     29.70

              Coefficients   Standard Error    t Stat     P-value
Intercept          17.25          0.182         94.73    3.57E-98
Odometer         -0.0669         0.0050        -13.44    5.75E-24

Lots of good statistics are calculated for us, but for now all we're interested in is the coefficients.
17
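For anyone who prefers code to Excel's Data Analysis tool, the sketch below shows one way to obtain the same kind of output with scipy; the odometer and price arrays are placeholders, not the Xm16-02 data.

import numpy as np
from scipy import stats

# Placeholder data standing in for Xm16-02: odometer reading (1,000s of miles), price ($1,000s)
odometer = np.array([37.4, 44.8, 45.8, 30.9, 31.7, 34.0])
price    = np.array([14.6, 14.1, 14.0, 15.6, 15.6, 14.7])

res = stats.linregress(odometer, price)   # simple linear regression of price on odometer
print(f"Intercept  {res.intercept:.4f}")
print(f"Slope      {res.slope:.4f}")
print(f"R Square   {res.rvalue ** 2:.4f}")
print(f"p-value    {res.pvalue:.3g}")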
Example 16.2… INTERPRET
The least squares line is ŷ = 17.25 − 0.0669x. The slope, b1 = −0.0669, tells us that for each additional 1,000 miles on the odometer, the selling price decreases on average by $66.90.
18
Example 16.2… INTERPRET
The intercept, b0 = 17.25, would literally be the price of a car with 0 miles on the odometer; however, because 0 lies well outside the range of odometer readings in the sample, it is misleading to interpret the intercept this way.
19
Required Conditions…
For these regression methods to be valid the following four
conditions for the error variable (ɛ) must be met:
• The probability distribution of ɛ is normal.
• The mean of the distribution is 0; that is, E(ɛ) = 0.
• The standard deviation of ɛ is σɛ, which is a constant
regardless of the value of x.
• The value of ɛ associated with any particular value of y is
independent of ɛ associated with any other value of y.
20
Assessing the Model…
The least squares method will always produce a straight line,
even if there is no relationship between the variables, or if the
relationship is something other than linear.
21
Sum of Squares for Error (SSE)…
The sum of squares for error is calculated as:
SSE = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
22
Standard Error of Estimate…
If sε is small, the fit is excellent and the linear model should be
used for forecasting. If sε is large, the model is poor…
But what is small and what is large?
23
Standard Error of Estimate…
sɛ = √(SSE / (n − 2))
To judge whether sɛ is small or large, we compare it with the magnitude of the dependent variable (e.g. with ȳ).
24
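A short sketch, using hypothetical observed and fitted values, of how SSE and the standard error of estimate could be computed:

import numpy as np

# Hypothetical observed values and fitted values from a regression line y-hat = b0 + b1*x
y     = np.array([14.6, 14.1, 14.0, 15.6, 15.6, 14.7])
y_hat = np.array([14.8, 14.3, 14.2, 15.3, 15.2, 15.0])

sse = np.sum((y - y_hat) ** 2)       # sum of squares for error
n = len(y)
s_eps = np.sqrt(sse / (n - 2))       # standard error of estimate for simple regression
print(f"SSE = {sse:.4f}, s_epsilon = {s_eps:.4f}")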
Testing the Slope…
If no linear relationship exists between the two variables, we
would expect the regression line to be horizontal, that is, to
have a slope of zero.
25
Testing the Slope…
We can test hypotheses about the slope using the test statistic:
t = (b1 − β1) / s_b1
which is Student t distributed with ν = n − 2 degrees of freedom, where s_b1 is the standard error of b1.
26
Example 16.4…
Test to determine whether there is a linear relationship between the price and the odometer reading (at the 5% significance level).
We want to test:
H0: β1 = 0
H1: β1 ≠ 0
(if the null hypothesis is true, no linear relationship exists)
The rejection region is:
|t| > tα/2, n−2 = t.025, 98 ≈ 1.984
27
Example 16.4… COMPUTE
From the Excel output, the value of the test statistic is t = −13.44, with a p-value of 5.75E-24 ≈ 0. Since |−13.44| > 1.984 (equivalently, the p-value is far below .05), we reject H0 and conclude that there is overwhelming evidence of a linear relationship between price and odometer reading.
28
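The arithmetic of the slope test can be sketched as follows, taking b1 and its standard error from the rounded Excel output above (so the t value differs slightly from the printed −13.44):

from scipy import stats

b1, s_b1 = -0.0669, 0.0050        # slope estimate and its standard error (rounded, from the output)
n = 100                           # number of observations

t_stat = (b1 - 0) / s_b1                         # test statistic for H0: beta1 = 0
p_value = 2 * stats.t.sf(abs(t_stat), n - 2)     # two-tail p-value with n - 2 degrees of freedom
print(f"t = {t_stat:.2f}, p-value = {p_value:.2e}")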
Testing the Slope…
If we wish to test for a positive or a negative linear relationship, we conduct a one-tail test, i.e. our research hypothesis becomes:
H1: β1 < 0 (testing for a negative slope)
or
H1: β1 > 0 (testing for a positive slope)
29
Coefficient of Determination…
Tests thus far have shown if a linear relationship exists; it is
also useful to measure the strength of the relationship. This is
done by calculating the coefficient of determination (R2).
30
Coefficient of Determination…
As we did with analysis of variance, we can partition the total variation in y into two parts:
Σ(yᵢ − ȳ)² = SSE + SSR
where SSR (the sum of squares for regression) is the variation explained by the model and SSE is the unexplained variation. The coefficient of determination is then
R² = SSR / Σ(yᵢ − ȳ)² = 1 − SSE / Σ(yᵢ − ȳ)²
31
Coefficient of Determination COMPUTE
From the Excel output for Example 16.2, R² = 0.6483.
32
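A sketch of the R² arithmetic, reusing the hypothetical observed and fitted values from the earlier sketch:

import numpy as np

# Hypothetical observed values and fitted values
y     = np.array([14.6, 14.1, 14.0, 15.6, 15.6, 14.7])
y_hat = np.array([14.8, 14.3, 14.2, 15.3, 15.2, 15.0])

sse      = np.sum((y - y_hat) ** 2)       # unexplained variation
ss_total = np.sum((y - y.mean()) ** 2)    # total variation in y
r2 = 1 - sse / ss_total                   # coefficient of determination
print(f"R^2 = {r2:.4f}")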
Coefficient of Determination INTERPRET
This means that 64.83% of the variation in price is explained by the variation in odometer reading; the remaining 35.17% is unexplained.
33
More on Excel’s Output…
An analysis of variance (ANOVA) table for the
simple linear regression model can be given by:
Source of Variation    degrees of freedom    Sums of Squares    Mean Squares         F-Statistic
Regression             1                     SSR                MSR = SSR/1          F = MSR/MSE
Error                  n − 2                 SSE                MSE = SSE/(n − 2)
Total                  n − 1                 Variation in y
34
Chapter 17
Multiple Regression
35
Multiple Regression…
The simple linear regression model was used to analyze how
one interval variable (the dependent variable y) is related to
one other interval variable (the independent variable x).
We expect to develop models that fit the data better than would
a simple linear regression model.
36
The Model…
We now assume we have k independent variables potentially
related to the one dependent variable. This relationship is
represented in this first-order linear equation:
y = β0 + β1x1 + β2x2 + … + βkxk + ɛ
where β0, β1, …, βk are the coefficients and ɛ is the error variable.
In the one variable, two dimensional case we drew a regression
line; here we imagine a response surface.
37
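A minimal sketch of fitting such a first-order model by least squares with k = 2 hypothetical independent variables; in practice we let statistical software do this, as the following slides describe. The X and y arrays below are made up for illustration.

import numpy as np

# Hypothetical data: two independent variables (e.g. age, years of education) and income in $1,000s
X = np.array([[25, 12],
              [38, 16],
              [45, 14],
              [52, 18],
              [30, 13]], dtype=float)
y = np.array([28.0, 55.0, 48.0, 71.0, 33.0])

# Prepend a column of ones for the intercept, then solve the least squares problem
A = np.column_stack([np.ones(len(y)), X])
b, *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"b0 = {b[0]:.2f}, b1 = {b[1]:.2f}, b2 = {b[2]:.2f}")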
Required Conditions…
For these regression methods to be valid the following four
conditions for the error variable (ɛ) must be met:
• The probability distribution of the error variable (ɛ) is normal.
• The mean of the error variable is 0.
• The standard deviation of ɛ is σɛ , which is a constant.
• The errors are independent.
38
Estimating the Coefficients…
39
Regression Analysis Steps…
Use a computer and software to generate the coefficients and the
statistics used to assess the model.
40
Example 17.1
GENERAL SOCIAL SURVEY: VARIABLES THAT
AFFECT INCOME
41
Example 17.1
Here is a list of all the interval variables the General Social Survey
created.
Age (AGE)
Hours of work per week of respondent and of spouse (HRS and SPHRS)
Number of children (CHILDS)
42
Example 17.1
Score on question, Should government reduce
income differences between rich and poor?
(EQWLTH)
Score on question, Should government improve
standard of living of poor people? (HELPPOOR)
Score on question, Should government do more or
less to solve country’s problems? (HELPNOT)
Score on question, Is it government’s responsibility
to help pay for doctor and hospital bills?
(HELPSICK)
43
Example 17.1
Here are the available variables from the General Social Survey of 2012
and the reason why we have selected each one:
44
Example 17.1
Spouse’s hours of work (SPHRS1): It is possible that, if one’s spouse works more and earns more, the other spouse may choose to work less and thus earn less.
45
Example 17.1
Step 2: Use a computer to compute all the coefficients and other
statistics
46
Model Assessment…
We will assess the model in three ways: the standard error of estimate, the coefficient of determination, and the F-test of the analysis of variance (which tests the validity of the model).
47
Standard Error of Estimate…
In multiple regression, the standard error of estimate is defined as:
sɛ = √(SSE / (n − k − 1))
48
Coefficient of Determination…
49
Testing the Validity of the Model…
We test whether the model as a whole is valid:
H0: β1 = β2 = … = βk = 0
H1: at least one βi is not equal to 0
The test statistic, F = MSR/MSE, comes from the analysis of variance (ANOVA) table.
50
Testing the Validity of the Model…
ANOVA table for regression analysis…
Source of Variation    degrees of freedom    Sums of Squares    Mean Squares              F-Statistic
Regression             k                     SSR                MSR = SSR/k               F = MSR/MSE
Error                  n − k − 1             SSE                MSE = SSE/(n − k − 1)
Total                  n − 1
52
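The F statistic in the table above can be computed directly; the sketch below plugs in the sums of squares from the Chapter 16 Excel output (where k = 1) just to show the arithmetic.

from scipy import stats

ssr, sse = 19.26, 10.45      # sums of squares for regression and error (from the earlier output)
n, k = 100, 1                # observations and number of independent variables

msr = ssr / k
mse = sse / (n - k - 1)
F = msr / mse
p_value = stats.f.sf(F, k, n - k - 1)    # Significance F
print(f"F = {F:.2f}, Significance F = {p_value:.2e}")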
Table 17.2… Summary
SSE              R²      F      Assessment of Model
0                1       ∞      Perfect
Σ(yᵢ − ȳ)²       0       0      Invalid
Once we’re satisfied that the model fits the data as well as possible, and that the
required conditions are satisfied, we can interpret and test the individual coefficients
and use the model to predict and estimate…
53
Interpreting the Coefficients*
Intercept
The intercept is b0 = −108,240. This is the average income when all
the independent variables are zero. As we observed in Chapter
16, it is often misleading to try to interpret this value,
particularly if 0 is outside the range of the values of the
independent variables (as is the case here).
Age
The relationship between income and age is described by b1 = 974.
From this number we learn that in this model, for each additional
year of age, income increases on average by $974, assuming that
the other independent variables in this model are held constant.
*in each case we assume all other variables are held constant…
54
Interpreting the Coefficients*
Education
The coefficient b2 = 5,680 specifies that in this sample for each
additional year of education the income increases on
average by $5,680, assuming the constancy of the other
independent variables.
Hours of work
The relationship between annual income and hours of work per week is expressed by b3 = 1,091. We interpret this number as the average increase in annual income for each additional hour of work per week, keeping the other independent variables fixed in this sample.
*in each case we assume all other variables are held constant…
55
Interpreting the Coefficients*
Spouse’s hours of work
The relationship between annual income and a spouse’s hours of work per week is described in this sample by b4 = −250.90, which we interpret to mean that for each additional hour a spouse works per week, income decreases on average by $250.90 when the other variables are constant.
Number of family members earning income
In this dataset the relationship between annual income and the number of family members who earn money is expressed by b5 = 958.5, which tells us that for each additional family member earning money, annual income increases on average by $958.50, assuming that the other independent variables are constant.
*in each case we assume all other variables are held constant…
56
Interpreting the Coefficients*
Number of children
The relationship between annual income and number of children is expressed by b6 = −1,765, which tells us that in this sample, for each additional child, annual income decreases on average by $1,765.
*in each case we assume all other variables are held constant…
57
Testing the Coefficients…
For each independent variable, we can test to determine whether
there is enough evidence of a linear relationship between it and the
dependent variable for the entire population…
H0: βi = 0
H1: βi ≠ 0
(for i = 1, 2, …, k) using the test statistic:
t = (bᵢ − βᵢ) / s_bᵢ
which is Student t distributed with ν = n − k − 1 degrees of freedom, where s_bᵢ is the standard error of bᵢ.
58
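A sketch of the arithmetic for one coefficient; the standard error here is a rough back-calculation from the reported t = 6.28 for age, and the sample size is hypothetical, so treat the numbers as illustrative only.

from scipy import stats

b_i, s_bi = 974.0, 155.0     # coefficient of age and an approximate (back-calculated) standard error
n, k = 1000, 6               # hypothetical sample size; k = 6 independent variables

t_stat = (b_i - 0) / s_bi                           # test statistic for H0: beta_i = 0
p_value = 2 * stats.t.sf(abs(t_stat), n - k - 1)    # two-tail p-value with n - k - 1 df
print(f"t = {t_stat:.2f}, p-value = {p_value:.2e}")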
Testing the Coefficients
Test of β1 (Coefficient of age)
Value of the test statistic: t = 6.28; p-value = 0
59
Testing the Coefficients
Test of β5 (Coefficient of number of earners in family)
Value of the test statistic: t = .31; p-value = .7531
60
Testing the Coefficients INTERPRET
There is sufficient evidence at the 5% significance level to infer that each of the following variables is linearly related to income:
Age
Education
Number of hours of work per week
Spouse’s hours of work
61
Testing the Coefficients INTERPRET
62