STAT 252-Notes-Topic 5-Multiple Linear Regression
• Multiple Linear Regression develops a model where there is only one response variable (y) but
more than one explanatory or predictor variable (x₁, x₂, ..., xₚ):
y = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ + ε
Where:
• y is the response variable.
• x₁, x₂, ..., xₚ are the explanatory variables.
• p is the number of predictor variables.
• E(y) = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ is the deterministic part of the model.
• βᵢ determines the contribution of the explanatory variable xᵢ to the model.
• ε is the random error, which is assumed to be normally distributed with mean 0 and
standard deviation σ.
• The general model above leads to the population multiple linear regression equation, which gives
the mean response at given values of the predictors:
E(y) = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ
Or
μ(y | x₁, x₂, ..., xₚ) = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ
• When the least squares criterion is applied to sample data, we obtain the sample multiple linear
regression equation:
ŷ = β̂₀ + β̂₁x₁ + β̂₂x₂ + ... + β̂ₚxₚ
Or
μ̂(y | x₁, x₂, ..., xₚ) = β̂₀ + β̂₁x₁ + β̂₂x₂ + ... + β̂ₚxₚ
• The y-intercept (β̂₀) is the predicted value of y when all explanatory variables have a value of 0 (x₁ = 0, x₂ = 0, ...,
xₚ = 0).
• The values β̂₁, β̂₂, ..., β̂ₚ are referred to as partial slopes or partial regression coefficients.
• Each β̂ᵢ tells us the change in ŷ per unit increase in xᵢ, holding all other explanatory variables
constant.
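To make the idea of partial slopes concrete, here is a minimal Python sketch (not part of the course materials, using made-up data) showing how the least squares estimates β̂₀, β̂₁, β̂₂ are obtained for two predictors:

import numpy as np

# Hypothetical data: 6 observations, two predictors x1 and x2.
# The first column of 1s corresponds to the intercept beta_0.
X = np.array([[1, 2.0, 30.0],
              [1, 4.0, 50.0],
              [1, 5.0, 55.0],
              [1, 6.0, 60.0],
              [1, 7.0, 70.0],
              [1, 8.0, 90.0]])
y = np.array([150.0, 120.0, 110.0, 100.0, 90.0, 60.0])

# Least squares estimates of (beta_0, beta_1, beta_2)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)

# Fitted values and residuals
y_hat = X @ beta_hat
residuals = y - y_hat

Each estimated coefficient is the change in the predicted response per unit increase in its predictor, holding the other predictor fixed.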
5.2 Inferences Concerning the Overall Usefulness of the Multiple Regression Model
Assumptions required for these inferences:
1. Linearity of the population regression line: The relationship between the variables,
as described by the population regression equation μ(y | x₁, x₂, ..., xₚ) = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ,
must be approximately linear.
2. Equal standard deviations: For each set of values of x₁, x₂, ..., xₚ, the corresponding y-values
must have the same standard deviation σ.
3. Normal populations: For each set of values of x₁, x₂, ..., xₚ, the corresponding y-
values must be normally distributed.
4. No serious outliers: Significant outliers can drastically change the regression model.
Regression Identity: SS_TOTAL = SS_REGR + SS_ERROR
• If the sample multiple linear regression equation fits the data well, then the observed values and
predicted values of the response variable (based on the regression model) will be “close”
together,
• AND thus SS_ERROR will be small relative to SS_TOTAL and SS_REGR will be large relative to SS_TOTAL.
Overall usefulness or significance of the multiple regression model can be determined by:
1. Multiple regression ANOVA F-test
2. Multiple R (Multiple correlation coefficient)
3. Coefficient of multiple determination
Multiple Regression ANOVA Test (F-Test)
Purpose: To test whether a multiple linear regression model is useful for making predictions
Step 1: State the hypotheses:
H₀: β₁ = β₂ = ... = βₚ = 0 (the model is not useful for making predictions)
Hₐ: At least one βᵢ is not zero (the model is useful for making predictions)
Step 2: Decide on the significance level α.
Step 3: Obtain the three sums of squares (SS_TOTAL, SS_REGR and SS_ERROR) and
compute the calculated value of the F-statistic:
F = (SS_REGR / p) / (SS_ERROR / (n − (p + 1))) = MS_REGR / MS_ERROR
Under H₀, this statistic has an F-distribution with df = (p, n − (p + 1)), which gives the P-value for the decision.
Note: Recall that in simple linear regression the Regression df is the number of coefficients being
estimated (y-intercept + slope) minus 1, that is, 2 − 1 = 1. For multiple linear regression, the
coefficients are the y-intercept plus the slopes of the p predictor variables, so there are 1 + p coefficients.
Thus, Regression df = (1 + p) − 1 = p.
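As an illustration, the F-statistic can be computed directly from the sums of squares. The following Python sketch (assuming SciPy is available) reproduces the F-value from the Orion example that follows (SS_REGR = 9088.314, SS_ERROR = 620.232, n = 11, p = 2):

from scipy import stats

def anova_F(ss_regr, ss_error, n, p):
    ms_regr = ss_regr / p                    # Regression df = p
    ms_error = ss_error / (n - (p + 1))      # Error df = n - (p + 1)
    F = ms_regr / ms_error
    p_value = stats.f.sf(F, p, n - (p + 1))  # upper-tail area of the F-distribution
    return F, p_value

F, p_value = anova_F(9088.314, 620.232, n=11, p=2)
print(F, p_value)   # F ≈ 58.61, P ≈ 0.000, matching the SPSS output below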
Multiple R (Multiple Correlation Coefficient)
• Measures the overall correlation between the response variable and all of the explanatory
variables in the model taken together; equivalently, the correlation between the observed and
predicted values of y.
Coefficient of Multiple Determination (R²)
= the fraction or percentage of variation in the observed values of the response variable that
is accounted for by the regression analysis involving more than one explanatory variable
R² = Explained variability / Total variability
R² = SS_REGR / SS_TOTAL = 1 − SS_ERROR / SS_TOTAL = (SS_TOTAL − SS_ERROR) / SS_TOTAL
0 ≤ R² ≤ 1 OR 0% ≤ R² ≤ 100%
This implies that 1 − R² of the variation in the observed values of the response variable is
accounted for by other factors, not the explanatory variables used in the regression analysis.
R²adj = 1 − MS_ERROR / MS_TOTAL
R²adj = 1 − [SS_ERROR / (n − (p + 1))] / [SS_TOTAL / (n − 1)] = 1 − [(n − 1) / (n − (p + 1))] · (SS_ERROR / SS_TOTAL)
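The same sums of squares give R² and R²adj. A short Python sketch using the Orion values from the SPSS output below (SS_TOTAL = 9708.545, SS_ERROR = 620.232, n = 11, p = 2):

def r_squared(ss_total, ss_error):
    # Fraction of variation in y accounted for by the regression
    return 1 - ss_error / ss_total

def r_squared_adj(ss_total, ss_error, n, p):
    # Same idea, but penalized for the number of estimated coefficients
    return 1 - (ss_error / (n - (p + 1))) / (ss_total / (n - 1))

print(r_squared(9708.545, 620.232))             # ≈ 0.936
print(r_squared_adj(9708.545, 620.232, 11, 2))  # ≈ 0.920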
Example: Effect of age and miles driven on the price of Orion cars
The age (in years), miles driven (in thousands), and price (in hundreds of dollars) of a random sample
of 11 Orion cars, along with SPSS output, are shown below.
Checking Assumptions for the Orion Price regression model (SPSS output)
[Residual plots: Residuals vs. Age and Residuals vs. Miles]
SPSS Output
Descriptive Statistics
        Mean      Std. Deviation   N
Price   88.6364   31.15854         11
Age     5.2727    1.42063          11
Miles   53.9091   21.56597         11
Model Summary (b)
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .968a   .936       .920                8.80505
Change Statistics: R Square Change = .936, F Change = 58.612, df1 = 2, df2 = 8, Sig. F Change = .000
a. Predictors: (Constant), Miles, Age
b. Dependent Variable: Price
ANOVA (a)
Model 1      Sum of Squares   df   Mean Square   F        Sig.
Regression   9088.314         2    4544.157      58.612   .000b
Residual     620.232          8    77.529
Total        9708.545         10
a. Dependent Variable: Price
b. Predictors: (Constant), Miles, Age
Coefficients (a)
Model 1      B         Std. Error   Beta    t        Sig.   95% CI for B (Lower, Upper)
(Constant)   183.035   11.348               16.130   .000   (156.868, 209.203)
Age          -9.504    3.874        -.433   -2.453   .040   (-18.438, -.570)
Miles        -.821     .255         -.569   -3.219   .012   (-1.410, -.233)
a. Dependent Variable: Price
Research Problem: Overall Assessment of the Model
>>>>>>>>>>
(a) At the 5% significance level, perform a hypothesis test to determine whether the overall multiple linear
regression model is useful for making predictions, that is, whether the variables age and miles driven,
taken together, are useful for predicting the price of the Orions.
(b) What percentage of the variation in Orion price is explained by the regression model? Determine the
unadjusted percentage.
(c) What percentage of the variation in Orion price is explained by the regression model? Determine the
adjusted percentage and compare it with the unadjusted percentage calculated in part (b).
>>>>>>>>>>
5.3 Inferences Concerning the Usefulness of Particular Predictor Variables: The Multiple
Regression t-test and Confidence Interval for Particular Slopes
• The ANOVA F-test determines whether the overall model is useful in explaining the
relationship between all the variables involved.
• However, the Multiple Regression t-test is required to determine if particular predictor
variables are useful in making predictions.
Calculate the test statistic for each particular predictor variable using computer output:
t = β̂ᵢ / SE(β̂ᵢ)
Decide to reject or not reject H₀ by looking in the t-table at df = n − (p + 1).
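A Python sketch of this t-test and the matching 95% confidence interval, using the Age coefficient from the Orion output below (β̂ = −9.504, SE = 3.874, n = 11, p = 2):

from scipy import stats

beta_hat, se, n, p = -9.504, 3.874, 11, 2
df = n - (p + 1)                       # = 8

t = beta_hat / se                      # test statistic
p_value = 2 * stats.t.sf(abs(t), df)   # two-sided P-value

t_crit = stats.t.ppf(0.975, df)        # t(0.025) = 2.306
ci = (beta_hat - t_crit * se, beta_hat + t_crit * se)

print(t, p_value, ci)  # t ≈ -2.453, P ≈ 0.040, CI ≈ (-18.44, -0.57), matching SPSS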
Example (Orion Prices): Refer to the data set and full SPSS output on previous pages
SPSS Output
Coefficients (a)
Model 1      B         Std. Error   Beta    t        Sig.   95% CI for B (Lower, Upper)
(Constant)   183.035   11.348               16.130   .000   (156.868, 209.203)
Age          -9.504    3.874        -.433   -2.453   .040   (-18.438, -.570)
Miles        -.821     .255         -.569   -3.219   .012   (-1.410, -.233)
a. Dependent Variable: Price
>>>>>>>>>>
(a) At the 5% significance level, test whether the data provide sufficient evidence to conclude that the
number of miles driven, in conjunction with age, is useful for predicting price.
(b) Calculate a 95% confidence interval for the partial slope for miles driven.
>>>>>>>>>>
Compare Age and Miles Driven with respect to Usefulness in making predictions
[Scatterplots: observed Price and Predicted Price vs. Age; observed Price and Predicted Price vs. Miles]
Correlation Matrix: For all variables in the data set for Orion prices
Correlations
                              Price    Age      Miles
Pearson Correlation   Price   1.000    -.924    -.942
                      Age     -.924    1.000    .863
                      Miles   -.942    .863     1.000
Sig. (1-tailed)       Price   .        .000     .000
                      Age     .000     .        .000
                      Miles   .000     .000     .
N                     Price   11       11       11
                      Age     11       11       11
                      Miles   11       11       11
Note the following:
1. Miles driven has a larger t-statistic (in absolute value) than age.
2. Miles driven has a slightly lower P-value than age.
3. Miles driven has a “tighter” confidence interval for the slope than age.
4. Miles driven is more highly correlated with price (r = −0.942) than is age (r = −0.924).
(For the t-tests above, df = n − (p + 1) = 11 − (2 + 1) = 8.)
5.4 Confidence Interval and Prediction Interval for the Response Variable
Confidence Interval for Mean Response (or Conditional Mean) in Multiple Regression
1. For a confidence level of 1 − α, use the t-distribution table to find tα/2 with df = n − (p + 1)
2. Compute the point estimate by using the multiple regression equation. At particular values
of the predictor variables x₁, x₂, ..., xₚ, the point estimate ŷ of the mean response is:
ŷ = β̂₀ + β̂₁x₁ + β̂₂x₂ + ... + β̂ₚxₚ
[Note: Since exact calculation of the standard deviation of the predicted value (s_ŷ) is rather
complicated, we rely on computer output, which gives the point estimate (Fit) and its standard error
(SE Fit). The confidence interval is then ŷ ± tα/2 · SE(Fit).]
Prediction Interval (for a Single Observation) of the Response Variable in Multiple
Regression
1. For a confidence level of 1 − α, use the t-distribution table to find tα/2 with df = n − (p + 1)
2. Compute the point estimate by using the multiple regression equation. At particular values
of the predictor variables x₁, x₂, ..., xₚ, the point estimate ŷ of the response is:
ŷ = β̂₀ + β̂₁x₁ + β̂₂x₂ + ... + β̂ₚxₚ
[Note: Again, since exact calculation of the standard deviation of the prediction error is rather
complicated, we rely on computer output. The prediction interval is wider than the confidence interval
because it must also account for the variability of a single observation around the mean:
ŷ ± tα/2 · √(MS_ERROR + SE(Fit)²).]
Example (Price of Orions against age and miles driven)
Find:
1. A 95% confidence interval for the mean price of Orions that are 5 years old and have been driven
52,000 miles
2. A 95% prediction interval for the price of an Orion (any single observation) that is 5 years old and
has been driven 52,000 miles
MINITAB Output
[See Weiss, Module A, page A-55]
Regression Analysis: Price versus Age, Miles
Analysis of Variance
Source DF SS MS F P
Regression 2 9088.3 4544.2 58.61 0.000
Residual Error 8 620.2 77.5
Total 10 9708.5
Predicted Values for New Observations
New Obs   Fit     SE Fit   95% CI            95% PI
1         92.80   2.74     (86.48, 99.12)    (71.53, 114.06)

Values of Predictors for New Observations
New Obs   Age    Miles
1         5.00   52.0
Find a 95% confidence interval for the mean price of all Orions that are 5 years old and have been driven
52,000 miles.
>>>>>>>>>>
>>>>>>>>>>
Calculate a 95% prediction interval for the price of an Orion (any single observation) that is 5 years old
and has been driven 52,000 miles
1. At df = 8, tα/2 = t0.05/2 = t0.025 = 2.306
2. The point estimate for the price of a 5-year-old Orion that has been driven 52,000
miles is:
ŷ = 183 − 9.50(5) − 0.821(52) ≈ 92.80 (in hundreds of dollars)
>>>>>>>>>>
>>>>>>>>>>
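The interval arithmetic behind the MINITAB output can be verified with a short Python sketch, assuming the usual formulas CI = Fit ± t·SE(Fit) and PI = Fit ± t·√(MS_ERROR + SE(Fit)²):

import math
from scipy import stats

fit, se_fit, ms_error = 92.80, 2.74, 77.5   # Fit, SE Fit, and MSE from the output
df = 8
t_crit = stats.t.ppf(0.975, df)             # 2.306

ci = (fit - t_crit * se_fit, fit + t_crit * se_fit)
pi_half = t_crit * math.sqrt(ms_error + se_fit**2)
pi = (fit - pi_half, fit + pi_half)

print(ci)  # ≈ (86.48, 99.12)
print(pi)  # ≈ (71.54, 114.06), matching MINITAB up to rounding of the inputs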
5.5 Multiple Regression Models Involving Indicator Variables (= Dummy Variables)
• Indicator (dummy) variables are categorical variables used as predictor variables.
• Each is coded as 0 or 1.
Example: The height of a child is modeled using the height of the mother, the height of the father, and
the sex of the child (an indicator variable coded 1 = male, 0 = female).
[Descriptive Statistics table not shown]
Checking Assumptions
[Residual plots: Residuals vs. Height of Mother, Height of Father, and Sex of Child]
[Scatterplot: Height of Child and Predicted Height of Child vs. Sex of Child]
[Model Summary, ANOVA, and Coefficients tables not shown]
Regression equation:
Height of child = 25.588 + 0.377(Height of Mother) + 0.195(Height of Father) + 4.148(Sex)
Prediction:
Suppose a mother is 63 inches and a father is 69 inches
Predicted height of a daughter is:
Height of a daughter = 25.588 + 0.377(63) + 0.195(69) + 4.148(0) = 62.8 inches
Predicted height of a son is:
Height of a son = 25.588 + 0.377(63) + 0.195(69) + 4.148(1) = 67.0 inches
The coefficient 4.148 means that, for given heights of the mother and father, a son has a predicted
height 4.148 inches greater than the predicted height of a daughter.
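A minimal Python sketch of the fitted equation above as a prediction function (the function name is ours, for illustration), with the 0/1 indicator for sex:

def predicted_child_height(mother, father, sex):
    """Heights in inches; sex is the 0/1 indicator (1 = male, 0 = female)."""
    return 25.588 + 0.377 * mother + 0.195 * father + 4.148 * sex

print(predicted_child_height(63, 69, sex=0))  # daughter: ≈ 62.8 inches
print(predicted_child_height(63, 69, sex=1))  # son:      ≈ 67.0 inches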
Adjusted Coefficient of Determination:
R²adj = 1 − [SS_ERROR / (n − (p + 1))] / [SS_TOTAL / (n − 1)]
      = 1 − [101.568 / (20 − (3 + 1))] / [259.470 / (20 − 1)]
      = 1 − 6.348 / 13.656 = 0.535
Note: This is noticeably smaller than the unadjusted coefficient of determination, which is 0.609. This is
because 4 regression coefficients (intercept and 3 slopes) are estimated from a fairly small sample (n = 20),
and the adjustment penalizes the extra coefficients.
Calculate 95% confidence intervals for the partial slopes of the regression equation that relate:
1. Heights of children to the heights of mothers
2. Heights of children to their sex
df = n − (p + 1) = 20 − (3 + 1) = 16
At df = 16, tα/2 = t0.05/2 = t0.025 = 2.120
Heights of children to the heights of mothers:
β̂ᵢ ± tα/2 · SE(β̂ᵢ)
0.377 ± 2.120 · 0.308
0.377 ± 0.653
(−0.276, 1.030)
Note: The confidence interval for the slope that relates heights of children to their sex does not include
negative values (neither endpoint is negative, so the interval excludes 0). This is in agreement with the
greater significance of that slope when the multiple regression t-test was performed.
Does this mean that the heights of children are not related to the heights of their parents?
5.6 Interaction Models in Multiple Regression
• Without interaction, the general model for multiple linear regression is:
y = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ + ε
The predicted response of y to changes in x₁ has the same slope for all values of x₂ (and the
same holds true for all xᵢ variables involved).
• When interaction between variables occurs, the interaction model for multiple linear regression
(for two interacting predictor variables) is:
y = β₀ + β₁x₁ + β₂x₂ + β₃x₁x₂ + ε
Where:
• y is the response variable
• x₁, x₂ are the explanatory (predictor) variables
• E(y) = β₀ + β₁x₁ + β₂x₂ + β₃x₁x₂ is the deterministic part of the model
• β₁ + β₃x₂ represents the change in y for a 1-unit increase in x₁
[Since β₁x₁ + β₃x₁x₂ = x₁(β₁ + β₃x₂)]
• β₂ + β₃x₁ represents the change in y for a 1-unit increase in x₂
[Since β₂x₂ + β₃x₁x₂ = x₂(β₂ + β₃x₁)]
• ε is the random error, which is assumed to be normally distributed with mean 0 and
standard deviation σ
This results in non-parallel lines (often intersecting lines), as illustrated in the sketch below.
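A minimal Python sketch (with made-up coefficients, not from any example in these notes) of why the lines are not parallel: in an interaction model the slope for x₁ depends on the value of x₂.

# Hypothetical fitted coefficients for y = b0 + b1*x1 + b2*x2 + b3*x1*x2
b0, b1, b2, b3 = 10.0, 2.0, 1.5, -0.4

def slope_for_x1(x2):
    # Change in predicted y per 1-unit increase in x1, at a fixed value of x2
    return b1 + b3 * x2

for x2 in (0, 5, 10):
    print(x2, slope_for_x1(x2))  # the slope changes with x2 -> non-parallel lines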
Research Problem Involving an Interaction Term (and Combining all Previous MLR Concepts):
Effect of BMI and Salt Intake (and their Interaction) on Systolic Blood Pressure
It has been hypothesized that increased salt intake associated with greater food intake by obese people
may be the mechanism for the relationship between obesity and high blood pressure. A random sample
of 14 people with high blood pressure was selected, and their body mass index (BMI = body
weight/height², in kg/m²), as a measure of obesity, was measured along with their sodium intake (in 100s of
mg/day). These two variables were used to calculate the interaction term (BMI × sodium intake). Their
systolic blood pressure (SBP) was measured in mm Hg as the response variable. Incomplete SPSS
output is shown below.
Model Summary (b)
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .999a   .997       .997                .586
a. Predictors: (Constant), Interaction, BMI, Salt_intake
b. Dependent Variable: SBP
ANOVA (a)
Model 1      Sum of Squares   df   Mean Square   F          Sig.
Regression   1330.000         3    443.333       1293.138   3.04 × 10⁻¹³
Residual     3.428            10   .343
Total        1333.429         13
a. Dependent Variable: SBP
b. Predictors: (Constant), Interaction, BMI, Salt_intake
Coefficients (a)
Model 1       B         Std. Error   Beta    t        Sig.
(Constant)    108.726   8.168                13.312   .000
BMI           -.218     .285         -.109   -.765    .462
Salt_intake   .892      .350         .496    2.546    .029
Interaction   .015      .006         .612    2.640    .025
a. Dependent Variable: SBP
>>>>>>>>>>
(a) At the 5% significance level, perform a hypothesis test to determine whether the overall multiple
regression model is significant or useful for making predictions about systolic blood pressure (SBP).
Perform ALL steps of the hypothesis test.
(b) At the 5% significance level, perform the most appropriate test to determine whether there is a positive
relationship between salt intake and systolic blood pressure.
(c) Calculate a 95% confidence interval for the slope of the interaction term (representing interaction
between BMI and sodium intake). Using this confidence interval, what conclusion can you make about
the possible interaction between body mass index and sodium intake in their effect on systolic blood
pressure? Explain your answer.
(d) What does this model tell us about the effect of BMI and the relative effects of the 3 predictor variables?
(e) Find the standard error of the model (the standard error of the estimate).
(f) What percentage of the variation in systolic blood pressure is explained by (or accounted for by) the
regression model? (Note: Determine the adjusted percentage.)
(g) Suppose that a person with a body mass index of 40 kg/m² and daily sodium intake of 42 (in 100s of
mg/day) had an observed systolic blood pressure reading of 163 mm Hg. What was the residual or
error of this observation?
(h) Based on the values of the predictor variables given in part (g) (BMI = 40 kg/m², sodium intake = 42
in 100s of mg/day), what is the 95% prediction interval for a single observation response of systolic
blood pressure at those values of the predictor variables? [Note: SE(Fit) = 0.337]
(i) Based on the same values of the predictor variables, what is the 95% confidence interval for the mean
systolic blood pressure at those values of the predictor variables? [Note again: SE(Fit) = 0.337]
>>>>>>>>>>
(j) Compare the length of the prediction interval in part (h) with that of the confidence interval in part (i).
Explain the difference between these two intervals and any difference in their lengths.
Based on the prediction interval in part (h), we can be 95% confident that an individual having the given
values of the predictor variables would have a systolic blood pressure between 161.166 and 164.174
mm Hg; whereas, based on the confidence interval in part (i), we can be 95% confident that the mean
systolic blood pressure of all people having those values of the predictor variables is between 161.919
and 163.421 mm Hg. The confidence interval for the mean response is shorter than the prediction
interval for a single observation because predicting a single observation must also account for the
variability of individuals around the mean.
5.7 Reduced Models and the Extra Sum-of-Squares F-test in Multiple Linear Regression
Full Model = the model which includes all the parameters or predictor variables involved in the research
Reduced Model = a model which hypothesizes that some of the slopes of the predictor variables equal zero;
those terms are taken out of the full model to make the reduced model
F = (Extra SS / Extra df) / (SS_ERROR(Full) / df_ERROR(Full))
where Extra SS = SS_ERROR(Reduced) − SS_ERROR(Full) and Extra df = the number of slopes set to
zero in the reduced model.
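A Python sketch of this test as a function (assuming SciPy is available; the numbers passed in below are hypothetical, while the house-price example that follows supplies real values):

from scipy import stats

def extra_ss_F(ss_error_reduced, ss_error_full, extra_df, df_error_full):
    # ss_error_reduced / ss_error_full: residual SS of the reduced and full models
    # extra_df: number of slopes set to zero in the reduced model
    extra_ss = ss_error_reduced - ss_error_full
    F = (extra_ss / extra_df) / (ss_error_full / df_error_full)
    p_value = stats.f.sf(F, extra_df, df_error_full)
    return F, p_value

# Hypothetical values, for illustration only
print(extra_ss_F(950.0, 400.0, extra_df=2, df_error_full=24))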
Example with Interaction and Indicator Variables & Involving Extra Sum-of-Squares F-test
The table below shows the prices of a random sample of 30 homes, along with the living area, number of
bedrooms, number of rooms, age, and location.
[Data table not shown]
• Indicator variables z₁ and z₂ are defined as:
z₁ = 1 if the home is in the inner suburbs, 0 otherwise
z₂ = 1 if the home is in the outer suburbs, 0 otherwise
(A downtown home has z₁ = z₂ = 0.)
Overall multiple regression model
Selecting some of the above predictor variables, the overall model describing the effect of living area (x₁),
location, and the interaction between living area and location (leaving out the number of bedrooms,
number of rooms and age) is:
y = β₀ + β₁x₁ + β₂z₁ + β₃z₂ + β₄x₁z₁ + β₅x₁z₂ + ε
Setting the indicator variables to their values for each location gives:
Downtown (z₁ = z₂ = 0):
y = β₀ + β₁x₁ + β₂(0) + β₃(0) + β₄x₁(0) + β₅x₁(0) + ε = β₀ + β₁x₁ + ε
Inner suburbs (z₁ = 1, z₂ = 0):
y = β₀ + β₁x₁ + β₂(1) + β₃(0) + β₄x₁(1) + β₅x₁(0) + ε = β₀ + β₂ + (β₁ + β₄)x₁ + ε
Outer suburbs (z₁ = 0, z₂ = 1):
y = β₀ + β₁x₁ + β₂(0) + β₃(1) + β₄x₁(0) + β₅x₁(1) + ε = β₀ + β₃ + (β₁ + β₅)x₁ + ε
Model 1 (Separate Lines Model = Full Model, which includes all predictor variables):
y = β₀ + β₁x₁ + β₂z₁ + β₃z₂ + β₄x₁z₁ + β₅x₁z₂ + ε
OR μ(price | area, location, interaction) = β₀ + β₁·area + β₂z₁ + β₃z₂ + β₄x₁z₁ + β₅x₁z₂
Model 2 (Parallel Lines Model = Reduced model assuming there is no interaction effect):
y = β₀ + β₁x₁ + β₂z₁ + β₃z₂ + ε
OR μ(price | area, location) = β₀ + β₁·area + β₂z₁ + β₃z₂
Explanation: If there is no interaction effect, then β₄ = β₅ = 0, so β₁ = β₁ + β₄ = β₁ + β₅ (the slopes are equal),
and thus the 3 SLR lines are parallel.
Model 3 (Equal Lines Model = Reduced model assuming location and the interaction have no effect):
y = β₀ + β₁x₁ + ε
OR μ(price | area) = β₀ + β₁·area
Explanation: If location and the interaction have no effect, then β₂ = β₃ = β₄ = β₅ = 0, so all three
locations follow the same line y = β₀ + β₁x₁ + ε.
SPSS output:
Model 1 (Separate Lines Model, full model):
Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .943a   .889       .866                4.05994
a. Predictors: (Constant), x1z2, x1, z1, z2, x1z1
Coefficients (a)
Model 1      B        Std. Error   Beta     t        Sig.
(Constant)   8.969    6.078                 1.476    .153
x1           4.807    .397         1.366    12.098   .000
z1           52.122   11.225       2.025    4.643    .000
z2           48.558   7.797        2.231    6.228    .000
x1z1         -3.201   .759         -1.823   -4.218   .000
x1z2         -2.803   .530         -1.836   -5.291   .000
a. Dependent Variable: y
Model 2 (Parallel Lines Model): Effect of area and location (Reduced model assuming there is no
interaction effect, i.e., assuming slopes for interaction = 0)
ANOVA (a)
Model 1      Sum of Squares   df   Mean Square   F        Sig.
Regression   2607.733         3    869.244       23.883   .000b
Residual     946.277          26   36.395
Total        3554.010         29
a. Dependent Variable: y
b. Predictors: (Constant), z2, x1, z1
Coefficients (a)
Model 1      B        Std. Error   Beta   t       Sig.
(Constant)   35.825   5.785               6.193   .000
x1           3.000    .362         .852   8.292   .000
z1           5.189    3.127        .202   1.660   .109
z2           8.142    2.680        .374   3.038   .005
a. Dependent Variable: y
Model 3 (Equal Lines Model): Effect of Area only (Reduced model assuming location and
interaction have no effect, i.e., assuming all slopes for location and interaction = 0)
ANOVA (a)
Model 1      Sum of Squares   df   Mean Square   F        Sig.
Regression   2271.714         1    2271.714      49.605   .000b
Residual     1282.296         28   45.796
Total        3554.010         29
a. Dependent Variable: y
b. Predictors: (Constant), x1
Coefficients (a)
Model 1      B        Std. Error   Beta   t       Sig.
(Constant)   43.732   5.780               7.567   .000
x1           2.814    .400         .799   7.043   .000
a. Dependent Variable: y
(a) At the 5% significance level, perform a hypothesis test to determine whether the overall multiple
regression model is significant or useful for making predictions about house price. Perform ALL steps
of the hypothesis test.
H₀: β₁ = β₂ = β₃ = β₄ = β₅ = 0
[The overall multiple regression model is not useful for making predictions about house price.]
Hₐ: At least one βᵢ is not zero
[The overall multiple regression model is useful for making predictions about house price.]
df(Regression) = p = 5; df(Error) = n − (p + 1) = 30 − (5 + 1) = 24
At df = (5, 24), P < 0.001. There is extremely strong evidence against H₀.
Since P < α (0.05), reject H₀.
Conclusion: At the 5% significance level, the data provide sufficient evidence to conclude that at least
one of the population regression coefficients is not zero OR that the overall regression model is useful for
making predictions about the response variable (house price).
>>>>>>>>>>
(b) At the 5% significance level, perform a hypothesis test to determine if there is interaction between
location and living area in the way that they affect house price, after accounting for area and location.
In other words, test whether the 3 simple regression lines are parallel, that is, whether the slopes are
the same for all 3 lines.
>>>>>>>>>>
Finding the Residual Sum-of-Squares
Suppose you are given that the F-statistic for the Parallel Lines Model is F = 16.7045, but you are not
given the ANOVA table on the previous page for this model. What is the Residual Sum-of-Squares
(SS_ERROR) for this Parallel Lines Model?
>>>>>>>>>>
Comparing the 3 SLR Equations for Downtown, Inner Suburbs, and Outer Suburbs
Using the output to get the overall regression model, we get the following:
ŷ = 8.969 + 4.807x₁ + 52.122z₁ + 48.558z₂ − 3.201x₁z₁ − 2.803x₁z₂
Note: all partial slopes, including those for the interaction terms, are significant.
We can determine the fitted straight line for each location by simplifying the overall model into 3 simple
linear regression equations, as scripted in the sketch below.
Overall model: y = β₀ + β₁x₁ + β₂z₁ + β₃z₂ + β₄x₁z₁ + β₅x₁z₂ + ε
Downtown: y = β₀ + β₁x₁ + ε
ŷ = 8.969 + 4.807x₁
Inner suburbs: y = β₀ + β₂ + (β₁ + β₄)x₁ + ε
ŷ = (8.969 + 52.122) + (4.807 − 3.201)x₁ = 61.091 + 1.606x₁
Outer suburbs: y = β₀ + β₃ + (β₁ + β₅)x₁ + ε
ŷ = (8.969 + 48.558) + (4.807 − 2.803)x₁ = 57.527 + 2.004x₁
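A short Python sketch that recovers the same three fitted lines from the coefficients in the SPSS output above:

# Fitted coefficients from the full (Separate Lines) model
b0, b1, b2, b3, b4, b5 = 8.969, 4.807, 52.122, 48.558, -3.201, -2.803

lines = {
    "Downtown":      (b0,      b1),       # z1 = z2 = 0
    "Inner suburbs": (b0 + b2, b1 + b4),  # z1 = 1, z2 = 0
    "Outer suburbs": (b0 + b3, b1 + b5),  # z1 = 0, z2 = 1
}
for place, (intercept, slope) in lines.items():
    print(f"{place}: y_hat = {intercept:.3f} + {slope:.3f} x1")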
Overall Conclusion:
1. Downtown houses have a much lower baseline price relative to the suburbs, judging by the lower
end of the simple linear regression line (indicated by the low y-intercept).
2. At least some of the slopes are significantly different, so they contribute differently to the model.
3. Downtown prices increase faster than the suburbs as the house size increases. (Based on the
slopes of the simple linear regression equations.)
4. Both types of suburbs (inner and outer) are similar in baseline prices as well as the increase in
price with increasing house size.
5.8 Building Models in Multiple Linear Regression
Define the gender and diameter of the clear zone variables using the following indicator variables:
male = 1 if the subject is male, 0 if female
D1 = 1 if the diameter of the clear zone is 2.5, 0 otherwise
D2 = 1 if the diameter of the clear zone is 3.0, 0 otherwise
D3 = 1 if the diameter of the clear zone is 3.5, 0 otherwise
(A diameter of 4.0 is the baseline: D1 = D2 = D3 = 0.)
Consider the following as the ORIGINAL regression model, with change in refractive error (CRE) as the
response:
μ{CRE | Age, Gender, Diameter} = β₀ + β₁Age + β₂male + β₃D1 + β₄D2 + β₅D3 + β₆(Age × male)
+ β₇(Age × D1) + β₈(Age × D2) + β₉(Age × D3)
+ β₁₀(Age × male × D1) + β₁₁(Age × male × D2) + β₁₂(Age × male × D3)
a) Referring to the original model, in terms of the regression coefficients, what is the effect of age on
mean change in refractive error (CRE), after accounting for gender and diameter? Define this effect in
general, then summarize the effect for each combination of gender and diameter of the clear zone
in the chart below.
Solution:
Logic: For the general effect of age, consider only terms that include age; all terms without age are
excluded, that is, β₀, β₂, β₃, β₄, β₅ are excluded.
The general effect of age on mean CRE is:
μ{CRE | Age + 1, Gender, Diameter} − μ{CRE | Age, Gender, Diameter}
= β₁ + β₆·male + β₇D1 + β₈D2 + β₉D3 + β₁₀(male × D1) + β₁₁(male × D2) + β₁₂(male × D3)
Logic: For the effect of age on each combination below, include only slopes for age by itself or for age in
combination with either gender and/or diameter of the clear zone.
Gender   Diameter of the clear zone   Effect of age on mean CRE
Male     2.5                          β₁ + β₆ + β₇ + β₁₀
Male     3.0                          β₁ + β₆ + β₈ + β₁₁
Male     3.5                          β₁ + β₆ + β₉ + β₁₂
Male     4.0                          β₁ + β₆
Female   2.5                          β₁ + β₇
Female   3.0                          β₁ + β₈
Female   3.5                          β₁ + β₉
Female   4.0                          β₁
b) Modify the original model to specify that the effect of age on the mean of CRE is the same for males
and females with the same diameter of the clear zone; otherwise, the effect of age on the mean of
CRE is possibly different for males and females without having the same diameter of the clear zone.
Just state the constraint(s) needed. You do not have to rewrite the model.
Explanation: The effect of age differs between males and females at the same diameter only through the
Age × male terms, so the constraints needed are β₆ = β₁₀ = β₁₁ = β₁₂ = 0. The effect of age can then
still differ across diameters (through β₇, β₈, β₉).
c) Referring to the original model, write the null and alternative hypotheses, in terms of the coefficients,
to test whether the effect of age is the same for all diameters of the clear zone for females. What is
the distribution of the test statistic under the null hypothesis?
Solution: The effect of age on the mean of CRE is the same for all diameters of the clear zone for
females if β₁ + β₇ = β₁ + β₈ = β₁ + β₉ = β₁.
Therefore, H₀: β₇ = β₈ = β₉ = 0
Hₐ: at least one βᵢ ≠ 0, i = 7, 8, 9
If H₀ is true, the test statistic has an F-distribution with degrees of freedom:
df = [Extra df, df_ERROR(Full)] = [number of selected βᵢ's, n − (p + 1)] = (3, 413 − (12 + 1)) = (3, 400)
d) Referring to the original model, in terms of the regression coefficients, what is the effect of gender
(male vs. female) on the mean CRE, after accounting for age and diameter? Define this effect in
general, then summarize the effect for each diameter of the clear zone in the table below.
Logic: For the general effect of gender, consider only terms that include male.
Solution: The effect of gender (male vs. female) on the mean of CRE is:
μ{CRE | Age, Male, Diameter} − μ{CRE | Age, Female, Diameter}
= μ{CRE | Age, male = 1, Diameter} − μ{CRE | Age, male = 0, Diameter}
= β₂ + β₆·Age + β₁₀(Age × D1) + β₁₁(Age × D2) + β₁₂(Age × D3)
Diameter of the clear zone   Effect of gender (male vs. female) on the mean CRE
2.5                          β₂ + (β₆ + β₁₀)·Age
3.0                          β₂ + (β₆ + β₁₁)·Age
3.5                          β₂ + (β₆ + β₁₂)·Age
4.0                          β₂ + β₆·Age
e) Re-write the original model indicating that gender has no effect on mean CRE.
Solution: Gender has no effect on mean CRE if gender does not appear in the model. Therefore,
μ{CRE | Age, Diameter} = β₀ + β₁Age + β₃D1 + β₄D2 + β₅D3
+ β₇(Age × D1) + β₈(Age × D2) + β₉(Age × D3)