STAT 252 Notes - Topic 5: Multiple Linear Regression

TOPIC 5: MULTIPLE REGRESSION ANALYSIS

5.1 The Multiple Linear Regression Model

• Multiple Linear Regression develops a model with one response variable (y) and more than one
explanatory or predictor variable (x1, x2, ..., xp)

• The general model for multiple linear regression is:

y = 0 + 1x1 + 2 x2 + ... +  p x p + 

Where:
• y is the response variable.
• x1, x2, ..., xp are the explanatory variables.
• p is the number of predictor variables.
• E(y) = β0 + β1x1 + β2x2 + ... + βpxp is the deterministic part of the model.
• βi determines the contribution of the explanatory variable xi to the model.
• ε is the random error, which is assumed to be normally distributed with mean 0 and
standard deviation σ.

• Averaging over the random error term (taking the expected value) gives the population multiple
linear regression equation:
y = β0 + β1x1 + β2x2 + ... + βpxp
Or
μ(y | x1, x2, ..., xp) = β0 + β1x1 + β2x2 + ... + βpxp

• Applying the least squares criterion to sample data gives the sample multiple linear regression
equation:
ŷ = β̂0 + β̂1x1 + β̂2x2 + ... + β̂pxp
Or
μ̂(y | x1, x2, ..., xp) = β̂0 + β̂1x1 + β̂2x2 + ... + β̂pxp

• The y-intercept (β̂0) is the value of ŷ when all explanatory variables have a value of 0 (x1 = 0,
x2 = 0, ..., xp = 0).
• The values β̂1, β̂2, ..., β̂p are referred to as partial slopes or partial regression coefficients.
• Each β̂i tells us the change in y per unit increase in xi, holding all other explanatory variables
constant.
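To make the estimation concrete, here is a minimal sketch of how the least squares estimates can be
computed numerically in Python with NumPy. The data values and variable names are hypothetical,
chosen only for illustration:

import numpy as np

# Hypothetical data: n = 6 observations on two predictors (x1, x2) and a response y.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.9])

# Design matrix: a column of 1s for the intercept, then one column per predictor.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least squares estimates (beta0_hat, beta1_hat, beta2_hat).
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)

# Fitted values and residuals.
y_hat = X @ beta_hat
residuals = y - y_hat

The same design-matrix construction extends to any number of predictors: one column of 1s plus one
column per xi.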

5.2 Inferences Concerning the Overall Usefulness of the Multiple Regression Model

Assumptions for Multiple Regression Inference

Assumptions (Conditions) for Regression Inferences

1. Linearity of the population regression line: The relationship between the variables
as described by the population regression equation y = β0 + β1x1 + β2x2 + ... + βpxp
must be approximately linear.

2. Equal standard deviations (homoscedasticity): The standard deviations of the y-values
must be approximately the same for all sets of values of x1, x2, ..., xp.

3. Normal populations: For each set of values of x1, x2, ..., xp, the corresponding
y-values must be normally distributed.

4. No serious outliers: Significant outliers can drastically change the regression model.

5. Independent observations: The observations of the response variable are
independent of one another. (Note that the observations of the predictor variables
do not need to be independent of one another.)

Note: All assumptions (except independence) can be checked graphically.

Regression Identity for Multiple Linear Regression

Regression Identity:

SSTOTAL = SSREGR + SSERROR


Regression Identity for Degrees of Freedom:

df(SSTOTAL) = df(SSREGR) + df(SSERROR)

Or n − 1 = p + (n − (p + 1))
Where n is the sample size and p is the number of predictor variables

• If the sample multiple linear regression equation fits the data well, then the observed values and
predicted values of the response variable (based on the regression model) will be “close”
together.
• Thus, SSERROR will be small relative to SSTOTAL, and SSREGR will be large relative to SSTOTAL.

Overall usefulness or significance of the multiple regression model can be determined by:
1. Multiple regression ANOVA F-test
2. Multiple R (Multiple correlation coefficient)
3. Coefficient of multiple determination

Multiple Regression ANOVA Test (F-Test)

Purpose: To test whether a multiple linear regression model is useful for making predictions

Assumptions: The assumptions shown above

Step 1: Selection of the test based on the purpose and assumptions

Step 2: The null and alternative hypotheses are:


H0: β1 = β2 = ... = βp = 0
Ha: At least one of the slopes βi is not zero.

Step 3: Obtain the three sums of squares (SSTOTAL, SSREGR and SSERROR) and
compute the calculated value of the F-statistic

ANOVA Table for Multiple Linear Regression

Source of variation   SS        df            MS = SS/df                          F-statistic
Regression            SSREGR    p             MSREGR = SSREGR / p                 F = MSREGR / MSERROR
Error (Residual)      SSERROR   n − (p + 1)   MSERROR = SSERROR / (n − (p + 1))
Total                 SSTOTAL   n − 1

F = [SSREGR / p] / [SSERROR / (n − (p + 1))] = MSREGR / MSERROR

Step 4: Decide to reject or not reject Ho


df = (numerator degrees of freedom, denominator degrees of freedom)
df = (p, n − (p + 1))
(Where n = the number of observations and p = the number of predictor variables)
If P-value ≤ α, reject H0

Step 5: Conclusion in terms of the research problem

Note: Recall that, in general, in simple linear regression, the Regression df is the number of coefficients
(y-intercept + slope) being estimated minus 1, that is 2 – 1 = 1. For multiple linear regression, the
coefficients are the y-intercept plus the slopes of p predictor variables, that is, there are 1 + p coefficients.
Thus, Regression df = (1 + p) – 1 = p
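As a sketch of Steps 3 and 4, the F-statistic and its P-value can be computed from the sums of
squares with SciPy; the function name is illustrative, and the demonstration numbers are taken from
the Orion example that follows.

from scipy import stats

def mlr_f_test(ss_regr, ss_error, n, p):
    """Overall ANOVA F-test for a multiple regression with p predictors."""
    ms_regr = ss_regr / p                         # MS_REGR = SS_REGR / p
    ms_error = ss_error / (n - (p + 1))           # MS_ERROR = SS_ERROR / (n - (p + 1))
    f_stat = ms_regr / ms_error
    p_value = stats.f.sf(f_stat, p, n - (p + 1))  # upper-tail area of the F-distribution
    return f_stat, p_value

# Sums of squares from the Orion example below (n = 11, p = 2):
print(mlr_f_test(9088.314, 620.232, n=11, p=2))   # F ≈ 58.61, P ≈ 0.000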

Multiple R (Multiple Correlation Coefficient)

• Measures the overall correlation between the response variable and the full set of explanatory
variables in the model (equivalently, the correlation between the observed and predicted values of y)

• Multiple R = +√(R²) (see below)

Coefficient of Multiple Determination


Coefficient of multiple determination (R²) = [multiple correlation coefficient]²
[Also called Multiple R²]

= the fraction or percentage of variation in the observed values of the response variable that
is accounted for by the regression analysis involving more than one explanatory variable

R² = Explained variability / Total variability

R² = SSREGR / SSTOTAL = 1 − SSERROR / SSTOTAL = (SSTOTAL − SSERROR) / SSTOTAL

0 ≤ R² ≤ 1 OR 0% ≤ R² ≤ 100%

This implies that 1 − R² of the variation in the observed values of the response variable is
accounted for by factors other than the explanatory variables used in the regression analysis

Adjusted Coefficient of Determination


• If the sample size equals the number of parameters (regression coefficients), then R² = 1, which
can give the impression that the estimated model is a good fit of the population regression model,
even when the estimated model may not give an accurate representation of the real
population model.
• Therefore, the adjusted R² is a more accurate measure of the fit of the model.

Adjusted Coefficient of Determination

R²adj = 1 − MSERROR / MSTOTAL

R²adj = 1 − [SSERROR / (n − (p + 1))] / [SSTOTAL / (n − 1)] = 1 − [(n − 1) / (n − (p + 1))] × (SSERROR / SSTOTAL)
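A small sketch computing both versions from the sums of squares, again using the Orion values that
appear in the SPSS output below (function names are illustrative):

def r_squared(ss_regr, ss_error):
    """Unadjusted R^2 = SS_REGR / SS_TOTAL, using the regression identity."""
    return ss_regr / (ss_regr + ss_error)

def adj_r_squared(ss_regr, ss_error, n, p):
    """Adjusted R^2 = 1 - MS_ERROR / MS_TOTAL."""
    ms_error = ss_error / (n - (p + 1))
    ms_total = (ss_regr + ss_error) / (n - 1)
    return 1 - ms_error / ms_total

# Orion example below (n = 11, p = 2): R^2 ≈ 0.936, adjusted R^2 ≈ 0.920
print(r_squared(9088.314, 620.232))
print(adj_r_squared(9088.314, 620.232, n=11, p=2))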

Example: Effect of age and miles driven on the price of Orion cars
The age, miles driven and price of a random sample of 11 Orion cars along with SPSS output are shown
below.

Car Age (yrs) Miles (1000) Price ($100s)


1 5 57 85
2 4 40 103
3 6 77 70
4 5 60 82
5 5 49 89
6 5 47 98
7 6 58 66
8 6 39 95
9 2 8 169
10 7 69 70
11 7 89 48

Checking Assumptions for the Orion Price regression model (SPSS output)

[Residual plots: residuals vs. Age and residuals vs. Miles]

SPSS Output
Descriptive Statistics
        Mean      Std. Deviation   N
Price   88.6364   31.15854         11
Age     5.2727    1.42063          11
Miles   53.9091   21.56597         11

Model Summary(b)
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .968a   .936       .920                8.80505
Change Statistics: R Square Change = .936, F Change = 58.612, df1 = 2, df2 = 8, Sig. F Change = .000
a. Predictors: (Constant), Miles, Age
b. Dependent Variable: Price

ANOVA(a)
Model          Sum of Squares   df   Mean Square   F        Sig.
1 Regression   9088.314         2    4544.157      58.612   .000b
  Residual     620.232          8    77.529
  Total        9708.545         10
a. Dependent Variable: Price
b. Predictors: (Constant), Miles, Age

Coefficients(a)
Model          Unstandardized B   Std. Error   Standardized Beta   t        Sig.   95% CI Lower   95% CI Upper
1 (Constant)   183.035            11.348                           16.130   .000   156.868        209.203
  Age          -9.504             3.874        -.433               -2.453   .040   -18.438        -.570
  Miles        -.821              .255         -.569               -3.219   .012   -1.410         -.233
a. Dependent Variable: Price

*Suppose that some of the numbers in the output above (highlighted in yellow in the original document) were not given

Research Problem: Overall Assessment of the Model
>>>>>>>>>>
(a) At the 5% significance level, perform a hypothesis test to determine whether the overall multiple linear
regression model is useful for making predictions, that is, whether the variables age and miles driven,
taken together, are useful for predicting the price of the Orions.

(b) What percentage of the variation in Orion price is explained by the regression model? Determine the
unadjusted percentage.

(c) What percentage of the variation in Orion price is explained by the regression model? Determine the
adjusted percentage and compare it with the unadjusted percentage calculated in part (b).

>>>>>>>>>>

5.3 Inferences Concerning the Usefulness of Particular Predictor Variables: The Multiple
Regression t-test and Confidence Interval for Particular Slopes

• The ANOVA F-test determines whether the overall model is useful in explaining the
relationship between all the variables involved.
• However, the Multiple Regression t-test is required to determine if particular predictor
variables are useful in making predictions.

Multiple Regression t-test for the Usefulness of Particular Predictor Variables

State the hypotheses


H0: βi = 0
(Predictor variable xi is not useful in making predictions about the response variable)

Ha: βi ≠ 0 (two-tailed) or βi < 0 (left-tailed) or βi > 0 (right-tailed)

(Predictor variable xi is useful in making predictions about the response variable)

Calculate the test statistic for each particular predictor variable using computer output:

t = β̂i / SE(β̂i)

Decide to reject or not reject H0 by looking in the t-table at df = n − (p + 1)

Interpretation in words in terms of the research problem

Note: t² ≠ F in Multiple Linear Regression, whereas t² = F in Simple Linear Regression.

Confidence Interval for a Slope, βi, in Multiple Regression

1. For a confidence level of 1 − α, use the table of the t-distribution to find tα/2 with
df = n − (p + 1)

2. The endpoints of the confidence interval for βi are:

β̂i ± tα/2 × SE(β̂i)

3. Interpret the confidence interval in terms of the research problem
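A minimal sketch of the t-test and confidence interval for one partial slope, given the coefficient
estimate and its standard error from computer output. The function name is illustrative; the
demonstration numbers are the Miles slope from the Orion output that follows.

from scipy import stats

def slope_inference(beta_hat, se_beta, n, p, conf=0.95):
    """Two-tailed t-test and confidence interval for one partial slope."""
    df = n - (p + 1)
    t_stat = beta_hat / se_beta
    p_value = 2 * stats.t.sf(abs(t_stat), df)      # two-tailed P-value
    t_crit = stats.t.ppf(1 - (1 - conf) / 2, df)   # t_{alpha/2}
    half_width = t_crit * se_beta
    return t_stat, p_value, (beta_hat - half_width, beta_hat + half_width)

# Miles slope from the Orion output below: b = -0.821, SE = 0.255, n = 11, p = 2
print(slope_inference(-0.821, 0.255, n=11, p=2))
# t ≈ -3.22, P ≈ 0.012, 95% CI ≈ (-1.410, -0.233)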

Example (Orion Prices): Refer to the data set and full SPSS output on previous pages

SPSS Output
Coefficients(a)
Model          Unstandardized B   Std. Error   Standardized Beta   t        Sig.   95% CI Lower   95% CI Upper
1 (Constant)   183.035            11.348                           16.130   .000   156.868        209.203
  Age          -9.504             3.874        -.433               -2.453   .040   -18.438        -.570
  Miles        -.821              .255         -.569               -3.219   .012   -1.410         -.233
a. Dependent Variable: Price

>>>>>>>>>>
(a) At the 5% significance level, test whether the data provide sufficient evidence to conclude that the
number of miles driven, in conjunction with age, is useful for predicting price.

(b) Calculate a 95% confidence interval for the partial slope for miles driven.

>>>>>>>>>>

Compare Age and Miles Driven with respect to Usefulness in making predictions

[Line fit plots: Price and Predicted Price vs. Age; Price and Predicted Price vs. Miles]

Correlation Matrix: For all variables in the data set for Orion prices

Correlations
                              Price   Age     Miles
Pearson Correlation   Price   1.000   -.924   -.942
                      Age     -.924   1.000   .863
                      Miles   -.942   .863    1.000
Sig. (1-tailed)       Price   .       .000    .000
                      Age     .000    .       .000
                      Miles   .000    .000    .
N                             11      11      11

Note the following:
1. Miles driven has a larger t-statistic (in absolute value) than age
2. Miles driven has a slightly lower P-value than age
3. Miles driven has a “tighter” confidence interval for the slope than age
4. Miles driven is more highly correlated with price (r = -0.942) than is age (r = -0.924)
(For the t-tests and confidence intervals, df = n − (p + 1) = 11 − (2 + 1) = 8.)
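The correlation matrix above can be reproduced from the raw Orion data; a minimal sketch with NumPy:

import numpy as np

# Orion data from the table above: age (years), miles (1000s), price ($100s).
age = np.array([5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7], dtype=float)
miles = np.array([57, 40, 77, 60, 49, 47, 58, 39, 8, 69, 89], dtype=float)
price = np.array([85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48], dtype=float)

# Pearson correlation matrix; rows/columns in the order price, age, miles.
print(np.corrcoef([price, age, miles]))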

5.4 Confidence Interval and Prediction Interval for the Response Variable

Confidence Interval for Mean Response (or Conditional Mean) in Multiple Regression

1. For a confidence level of 1 – α, use the t-distribution table to find tα/2 with df = n – (p + 1)

2. Compute the point estimate by using the multiple regression equation. At particular values
of the predictor variables x1, x2, ..., xp, the point estimate ŷ of the mean response of the
response variable is found as follows:
ŷ = β̂0 + β̂1x1 + β̂2x2 + ... + β̂pxp

The endpoints of the confidence interval are:

Point estimate or “Fit” ± Critical value × SE(Fit)
OR ŷ ± tα/2 × SE(Fit)
[Note: SE(Fit) = standard deviation of the predicted y-value]

3. Interpret the confidence interval in terms of the research problem

[Note: Since exact calculation of the standard deviation of the predicted y-value is
rather complicated, we usually use computer output to obtain SE(Fit).]

Prediction Interval (for a Single Observation) for the Response Variable in Multiple
Regression

1. For a confidence level of 1 – α, use the t-distribution table to find tα/2 with df = n – (p + 1)

2. Compute the point estimate by using the multiple regression equation. At particular values
of the predictor variables x1, x2, ..., xp, the point estimate ŷ of the response is found as
follows:
ŷ = β̂0 + β̂1x1 + β̂2x2 + ... + β̂pxp

The endpoints of the prediction interval are:

Point estimate or “Fit” ± Critical value × √(MSE + [SE(Fit)]²)
OR ŷ ± tα/2 × √(σ̂² + [SE(Fit)]²)
[Note: SE(Fit) = standard deviation of the predicted y-value]

3. Interpret the prediction interval in terms of the research problem

[Note: Since exact calculation of the standard deviation of the predicted y-value is rather
complicated, we usually use computer output to obtain SE(Fit).]
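Both intervals can be assembled from the Fit, SE(Fit), and MSE reported by the software; a sketch
(the function name is illustrative) using the Orion values from the MINITAB output below:

from scipy import stats

def response_intervals(fit, se_fit, mse, n, p, conf=0.95):
    """CI for the mean response and PI for a single response at fixed x-values."""
    df = n - (p + 1)
    t_crit = stats.t.ppf(1 - (1 - conf) / 2, df)
    ci = (fit - t_crit * se_fit, fit + t_crit * se_fit)
    se_pred = (mse + se_fit ** 2) ** 0.5      # sqrt(MSE + SE(Fit)^2)
    pi = (fit - t_crit * se_pred, fit + t_crit * se_pred)
    return ci, pi

# Orion values from the MINITAB output below: Fit = 92.80, SE(Fit) = 2.74,
# MSE = 77.529, n = 11, p = 2.
ci, pi = response_intervals(92.80, 2.74, 77.529, n=11, p=2)
print(ci)   # ≈ (86.48, 99.12)
print(pi)   # ≈ (71.53, 114.07)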

Example (Price of Orions against age and miles driven)
Find:
1. A 95% confidence interval for the mean price of Orions that are 5 years old and have been driven
52,000 miles
2. A 95% prediction interval for the price of an Orion (any single observation) that is 5 years old and
has been driven 52,000 miles

MINITAB Output
[See Weiss, Module A, page A-55]
Regression Analysis: Price versus Age, Miles

The regression equation is


Price = 183 - 9.50 Age - 0.821 Miles

Predictor Coef SE Coef T P


Constant 183.04 11.35 16.13 0.000
Age -9.504 3.874 -2.45 0.040
Miles -0.8215 0.2552 -3.22 0.012

se = 8.80505 R-Sq = 93.6% R-Sq(adj) = 92.0%

Analysis of Variance

Source DF SS MS F P
Regression 2 9088.3 4544.2 58.61 0.000
Residual Error 8 620.2 77.5
Total 10 9708.5

Predicted Values for New Observations

New
Obs Fit SE Fit 95% CI 95% PI
1 92.80 2.74 (86.48, 99.12) (71.53, 114.06)

Values of Predictors for New Observations

New
Obs Age Miles
1 5.00 52.0

Find a 95% confidence interval for the mean price of all Orions that are 5 years old and have been driven
52,000 miles
>>>>>>>>>>

>>>>>>>>>>
Calculate a 95% prediction interval for the price of an Orion (any single observation) that is 5 years old
and has been driven 52,000 miles

1. At df = 8, tα/2 = t0.05/2 = t0.025 = 2.306
2. The point estimate for the price of a 5-year-old Orion that has been driven 52,000
miles is:
ŷ = 183 − 9.50(5) − 0.821(52) = 92.80 (in hundreds of dollars)
>>>>>>>>>>

>>>>>>>>>>

5.5 Multiple Regression Models Involving Indicator Variables (= Dummy Variables)
• These are categorical variables that are used as predictor variables
• Each is coded as 0 or 1

Example involving an Indicator Variable


Indicator variable = sex of the child (Coded as 0 for female and 1 for male)

Height of Mother Height of Father Sex of Child Height of Child


66 70 1 62.5
66 64 1 69.1
64 68 1 67.1
66 74 1 71.1
64 62 1 67.4
64 67 1 64.9
62 72 1 66.5
62 72 1 66.5
63 71 1 67.5
65 71 1 71.9
63 64 0 58.6
64 67 0 65.3
65 72 0 65.4
59 67 0 60.9
58 66 0 60
63 69 0 62.2
62 69 0 63.4
63 66 0 62.2
63 69 0 59.6
60 66 0 64

Descriptive Statistics
                Mean     Std. Deviation   N
Height_Child    64.805   3.6954           20
Height_Mother   63.10    2.198            20
Height_Father   68.30    3.164            20
Sex_of_Child    .50      .513             20

Checking Assumptions

[Residual plots: residuals vs. Height of Mother, Height of Father, and Sex of Child; line fit plot:
Height of Child and Predicted Height of Child vs. Sex of Child]

Model Summary(b)
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .780a   .609       .535                2.5195
a. Predictors: (Constant), Sex_of_Child, Height_Father, Height_Mother
b. Dependent Variable: Height_Child

ANOVA(a)
Model          Sum of Squares   df   Mean Square   F       Sig.
1 Regression   157.902          3    52.634        8.291   .001b
  Residual     101.568          16   6.348
  Total        259.470          19
a. Dependent Variable: Height_Child
b. Predictors: (Constant), Sex_of_Child, Height_Father, Height_Mother

Coefficients(a)
Model             Unstandardized B   Std. Error   Standardized Beta   t       Sig.   95% CI Lower   95% CI Upper
1 (Constant)      25.588             21.942                           1.166   .261   -20.928        72.104
  Height_Mother   .377               .308         .224                1.224   .239   -.276          1.030
  Height_Father   .195               .190         .167                1.028   .319   -.207          .598
  Sex_of_Child    4.148              1.334        .576                3.108   .007   1.319          6.976
a. Dependent Variable: Height_Child

Regression equation:
Height of child = 25.588 + 0.377(Height of Mother) + 0.195(Height of Father) + 4.148(Sex)

Prediction:
Suppose a mother is 63 inches and a father is 69 inches
Predicted height of a daughter is:
Height of a daughter = 25.588 + 0.377(63) + 0.195(69) + 4.148(0) = 62.8 inches
Predicted height of a son is:
Height of a son = 25.588 + 0.377(63) + 0.195(69) + 4.148(1) = 67.0 inches

The coefficient 4.148 means that for given heights of mothers and fathers, a son will have a predicted
height that is 4.148 inches more than the height of a daughter.
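A minimal sketch of this prediction as a function of the fitted coefficients (the function name is
illustrative):

def predicted_height(mother, father, male):
    """Fitted equation from the SPSS output above; male = 1, female = 0."""
    return 25.588 + 0.377 * mother + 0.195 * father + 4.148 * male

print(predicted_height(63, 69, male=0))   # daughter: ≈ 62.8 inches
print(predicted_height(63, 69, male=1))   # son:      ≈ 67.0 inches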

Adjusted Coefficient of Determination:

R²adj = 1 − [SSERROR / (n − (p + 1))] / [SSTOTAL / (n − 1)]
      = 1 − [101.568 / (20 − (3 + 1))] / [259.470 / (20 − 1)]
      = 1 − 6.348 / 13.6563
      = 0.535

Note: This is fairly different from the coefficient of determination (unadjusted), which is 0.609. This is
because 4 regression coefficients (the intercept and 3 slopes) are being estimated from a fairly small
sample (n = 20).

Calculate 95% confidence intervals for the partial slopes of the regression equation that relate:
1. Heights of children to the heights of mothers
2. Heights of children to their sex

df = n − (p + 1) = 20 − (3 + 1) = 16
At df = 16, tα/2 = t0.05/2 = t0.025 = 2.120

Heights of children to the heights of mothers:
β̂i ± tα/2 × SE(β̂i)
0.377 ± 2.120 × 0.308
0.377 ± 0.6530
(−0.276, 1.030)

Heights of children to their sex:
β̂i ± tα/2 × SE(β̂i)
4.148 ± 2.120 × 1.334
4.148 ± 2.8288
(1.319, 6.976)

Note: The confidence interval for the slope that relates heights of children to their sex does not contain
zero (both endpoints are positive). This agrees with the significance of that slope when the multiple
regression t-test was performed.

Does this mean that the heights of children are not related to the heights of their parents?

5.6 Interaction Models in Multiple Regression

• Without interaction, the general model for multiple linear regression is:
y = β0 + β1x1 + β2x2 + ... + βpxp + ε

The predicted response of y with changes in x1 has the same slope for all values of x2 (and the
same holds true for all xi variables involved).

This results in a parallel-lines model.

• When interaction between variables occurs, the interaction model for multiple linear regression
(for two interacting predictor variables) is:
y = β0 + β1x1 + β2x2 + β3x1x2 + ε
Where:
• y is the response variable
• x1, x2 are the explanatory (predictor) variables
• E(y) = β0 + β1x1 + β2x2 + β3x1x2 is the deterministic part of the model
• β1 + β3x2 represents the change in y for a 1-unit increase in x1
[Since β1x1 + β3x1x2 = x1(β1 + β3x2)]
• β2 + β3x1 represents the change in y for a 1-unit increase in x2
[Since β2x2 + β3x1x2 = x2(β2 + β3x1)]
• ε is the random error, which is assumed to be normally distributed with mean 0 and
standard deviation σ
This results in non-parallel lines (often intersecting lines), as illustrated in the sketch below:
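The defining feature of the interaction model is that the slope on x1 is itself a function of x2. A
small sketch (the function name and β values are hypothetical, chosen only for illustration):

def slope_on_x1(beta1, beta3, x2):
    """Change in E(y) per 1-unit increase in x1 when the model contains x1*x2."""
    return beta1 + beta3 * x2

# With interaction, the slope on x1 shifts with x2:
for x2 in (10, 20, 30):
    print(x2, slope_on_x1(beta1=2.0, beta3=0.5, x2=x2))   # 7.0, 12.0, 17.0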

Research Problem Involving an Interaction Term (and Combining all Previous MLR Concepts):

Effect of BMI and Salt Intake (and their Interaction) on Systolic Blood Pressure
It has been hypothesized that increased salt intake associated with greater food intake by obese people
may be the mechanism for the relationship between obesity and high blood pressure. A random sample
of 14 people with high blood pressure was selected, and their body mass index (BMI) (body
weight/(height)²), as a measure of obesity, was measured along with their sodium intake (in 100s of
mg/day). These two variables were used to calculate the interaction term (BMI × sodium intake). Their
systolic blood pressure (SBP) was measured in mm Hg as the response variable. The raw data are
shown below along with incomplete SPSS output.

BMI (kg/m²)   Sodium intake (100 mg/day)   Interaction   SBP (mm Hg)
30 30 900 143
30 31 930 144
33 32 1056 146
34 35 1190 150
36 36 1296 152
37 37 1369 154
38 38 1444 156
39 39 1521 158
40 41 1640 161
40 42 1680 163
41 43 1763 165
43 44 1892 168
44 45 1980 170
47 49 2303 176

Model Summary(b)
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .999a   .997       .997                .586
a. Predictors: (Constant), Interaction, BMI, Salt_intake
b. Dependent Variable: SBP

ANOVA(a)
Model          Sum of Squares   df   Mean Square   F          Sig.
1 Regression   1330.000         3    443.333       1293.138   3.04 × 10⁻¹³
  Residual     3.428            10   .343
  Total        1333.429         13
a. Dependent Variable: SBP
b. Predictors: (Constant), Interaction, BMI, Salt_intake

Coefficients(a)
Model           Unstandardized B   Std. Error   Standardized Beta   t        Sig.
1 (Constant)    108.726            8.168                            13.312   .000
  BMI           -.218              .285         -.109               -.765    .462
  Salt_intake   .892               .350         .496                2.546    .029
  Interaction   .015               .006         .612                2.640    .025
a. Dependent Variable: SBP
>>>>>>>>>>
(a) At the 5% significance level, perform a hypothesis test to determine whether the overall multiple
regression model is significant or useful for making predictions about systolic blood pressure (SBP).
Perform ALL steps of the hypothesis test.

(b) At the 5% significance level, perform the most appropriate test to determine whether there is a positive
relationship between salt intake and systolic blood pressure.

(c) Calculate a 95% confidence interval for the slope of the interaction term (representing interaction
between BMI and sodium intake). Using this confidence interval, what conclusion can you make about
the possible interaction between body mass index and sodium intake in their effect on systolic blood
pressure? Explain your answer.

(d) What does this model tell us about the effect of BMI and the relative effects of the 3 predictor variables?

(e) Find the standard error of the model (the standard error of the estimate).

(f) What percentage of the variation in systolic blood pressure is explained by (or accounted for by) the
regression model? (Note: Determine the adjusted percentage.)

(g) Suppose that a person with a body mass index of 40 kg/m² and daily sodium intake of 42 (in 100s of
mg/day) had an observed systolic blood pressure reading of 163 mm Hg. What was the residual or
error of this observation?

(h) Based on the values of the predictor variables given in part (g) (BMI = 40 kg/m², sodium intake = 42
(in 100s of mg/day)), what is the 95% prediction interval for all single observation responses of systolic
blood pressure at those values of the predictor variables? [Note: SE(Fit) = 0.337]

(i) Based on the values of the predictor variables given in part (g) (BMI = 40 kg/m², sodium intake = 42
(in 100s of mg/day)), what is the 95% confidence interval for mean systolic blood pressure at those values
of the predictor variables? [Note again: SE(Fit) = 0.337]

>>>>>>>>>>
(j) Compare the length of the prediction interval in part (h) with the confidence interval in part (i). Explain
the difference between these two confidence intervals and explain any possible difference in their
lengths.

Based on the prediction interval in part (h), we can be 95% confident that an individual with the given
values of the predictor variables would have a systolic blood pressure between 161.17 and 164.17 mm Hg;
whereas, based on the confidence interval in part (i), we can be 95% confident that the mean systolic
blood pressure of all people with those values of the predictor variables is between 161.919 and
163.421 mm Hg. The confidence interval for the mean response is shorter than the prediction interval for
a single observation because a mean varies less than an individual observation.

5.7 Reduced Models and the Extra Sum-of-Squares F-test in Multiple Linear Regression

Full Model = model which includes all the parameters or predictor variables involved in the research

Reduced Model = model which hypothesizes that some of the slopes of the predictor variables equal zero,
and thus those terms are taken out of the full model to make a reduced model

Extra-Sum-of-Squares F-test in Multiple Linear Regression

• Also called Partial F-test or Nested F-test

Extra-Sum-of-Squares F-Test in MLR

Null and alternative hypotheses:

H0: All selected β's (slopes) equal 0. (Reduced model)
Ha: Not all selected β's (slopes) equal 0. (Full model)

Calculations for Extra-Sum-of-Squares F-test:

Extra Sum of Squares = SSE(reduced) − SSE(full)

Extra df = dfERROR(reduced) − dfERROR(full)

F = (Extra SS / Extra df) / [SSE(full) / dfERROR(full)]

OR F = {[SSE(reduced) − SSE(full)] / [dfE(reduced) − dfE(full)]} / [SSE(full) / dfE(full)]

Examine the F-distribution at:

df = [Extra df, dfERROR(full)] = [number of selected βi's, n − (p + 1)]

Recall that residual (error) = observed value − estimated value.

Therefore, the residual sum of squares or error sum of squares is:
SSE = Σ(observed value − estimated value)² = Σ(yi − ŷi)²
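A minimal sketch of this test as a function (the name is illustrative); the demonstration values are the
SSE and error df for the full (separate-lines) and reduced (parallel-lines) models from the house-price
example that follows:

from scipy import stats

def extra_ss_f_test(sse_reduced, sse_full, df_reduced, df_full):
    """Extra-sum-of-squares (partial/nested) F-test comparing two models."""
    extra_ss = sse_reduced - sse_full
    extra_df = df_reduced - df_full           # number of slopes set to 0 under H0
    f_stat = (extra_ss / extra_df) / (sse_full / df_full)
    p_value = stats.f.sf(f_stat, extra_df, df_full)
    return f_stat, p_value

# House-price example below: SSE(full) = 395.595 with 24 error df (separate lines),
# SSE(reduced) = 946.277 with 26 error df (parallel lines).
print(extra_ss_f_test(946.277, 395.595, df_reduced=26, df_full=24))   # F ≈ 16.70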

Example with Interaction and Indicator Variables & Involving Extra Sum-of-Squares F-test
The table below shows the prices of a random sample of 30 homes, along with the living area, number of
bedrooms, number of rooms, age, and location.
• Indicator variables z1 and z2 are defined as:

z1 = z2 = 0 for downtown; z1 = 1, z2 = 0 for inner suburbs; z1 = 0, z2 = 1 for outer suburbs

• x1z1 = interaction x1 × z1
• x1z2 = interaction x1 × z2

Price ($1000)   Living area (100s of sq. ft.)   No. of bedrooms   No. of rooms   Age (years)   Location   Location
(y)             (x1)                            (x2)              (x3)           (x4)          (z1)       (z2)       x1z1   x1z2
84 13.8 3 7 10 1 0 13.8 0
93 19 2 7 22 0 1 0 19
83.1 10 2 7 15 0 1 0 10
85.2 15 3 7 12 0 1 0 15
85.2 12 3 7 8 0 1 0 12
85.2 15 3 7 12 0 1 0 15
85.2 12 3 7 8 0 1 0 12
63.3 9.1 3 6 2 0 1 0 9.1
84.3 12.5 3 7 11 0 1 0 12.5
84.3 12.5 3 7 11 0 1 0 12.5
77.4 12 3 7 5 1 0 12 0
92.4 17.9 3 7 18 0 0 0 0
92.4 17.9 3 7 18 0 0 0 0
61.5 9.5 2 5 8 0 0 0 0
88.5 16 3 7 11 0 0 0 0
88.5 16 3 7 11 0 0 0 0
40.6 8 2 5 5 0 0 0 0
81.6 11.8 3 7 8 0 1 0 11.8
86.7 16 3 7 9 1 0 16 0
89.7 16.8 2 7 12 0 0 0 0
86.7 16 3 7 9 1 0 16 0
89.7 16.8 2 7 12 0 0 0 0
75.9 9.5 3 6 6 0 1 0 9.5
78.9 10 3 6 11 1 0 10 0
87.9 16.5 3 7 15 1 0 16.5 0
91 15.1 3 7 8 0 1 0 15.1
92 17.9 3 8 13 0 1 0 17.9
87.9 16.5 3 7 15 1 0 16.5 0
90.9 15 3 7 8 0 1 0 15
91.9 17.8 3 8 13 0 1 0 17.8

Overall multiple regression model
Selecting some of the above predictor variables, the overall model describing the effect of living area,
location and the interaction between living area and location (leaving out the number of bedrooms,
number of rooms and age) is as follows:

Overall (Full) model: y = β0 + β1x1 + β2z1 + β3z2 + β4x1z1 + β5x1z2 + ε


We can determine the fitted straight line for each location by finding 3 simple linear regression equations
based on simplification of the overall model

Downtown: (z1 = z2 = 0)
y = β0 + β1x1 + β2(0) + β3(0) + β4x1(0) + β5x1(0) + ε
y = β0 + β1x1 + ε
Inner suburbs: (z1 = 1, z2 = 0)
y = β0 + β1x1 + β2(1) + β3(0) + β4x1(1) + β5x1(0) + ε
y = β0 + β2 + (β1 + β4)x1 + ε
Outer suburbs: (z1 = 0, z2 = 1)
y = β0 + β1x1 + β2(0) + β3(1) + β4x1(0) + β5x1(1) + ε
y = β0 + β3 + (β1 + β5)x1 + ε

From this we write 3 models:

Model 1 (Separate Lines Model = Full Model, which includes all predictor variables):
y = β0 + β1x1 + β2z1 + β3z2 + β4x1z1 + β5x1z2
OR μ(price | area, location, interaction) = β0 + β1·area + β2z1 + β3z2 + β4x1z1 + β5x1z2

Model 2 (Parallel Lines Model = Reduced model assuming there is no interaction effect):
y = β0 + β1x1 + β2z1 + β3z2
OR μ(price | area, location) = β0 + β1·area + β2z1 + β3z2
Explanation: If there is no interaction effect, then β4 = β5 = 0, so β1 = β1 + β4 = β1 + β5 (the slopes are equal),
and thus the 3 SLR lines are parallel.

Model 3 (Equal Lines Model = Reduced model assuming location and the interaction have no effect):
y = β0 + β1x1
OR μ(price | area) = β0 + β1·area
Explanation: If there is no effect of location or interaction, then β2 = β3 = β4 = β5 = 0, so
β0 = β0 + β2 = β0 + β3 (the y-intercepts are equal) and β1 = β1 + β4 = β1 + β5 (the slopes are equal).
And thus the 3 SLR lines are equal.
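All three models can be fit directly by least squares; a sketch in Python (the function name is
illustrative; the data are transcribed from the table above) that reproduces the SSERROR values in the
SPSS output below:

import numpy as np

# Columns: price y ($1000s), living area x1 (100s sq. ft.), indicators z1, z2.
data = np.array([
    [84.0, 13.8, 1, 0], [93.0, 19.0, 0, 1], [83.1, 10.0, 0, 1], [85.2, 15.0, 0, 1],
    [85.2, 12.0, 0, 1], [85.2, 15.0, 0, 1], [85.2, 12.0, 0, 1], [63.3, 9.1, 0, 1],
    [84.3, 12.5, 0, 1], [84.3, 12.5, 0, 1], [77.4, 12.0, 1, 0], [92.4, 17.9, 0, 0],
    [92.4, 17.9, 0, 0], [61.5, 9.5, 0, 0], [88.5, 16.0, 0, 0], [88.5, 16.0, 0, 0],
    [40.6, 8.0, 0, 0], [81.6, 11.8, 0, 1], [86.7, 16.0, 1, 0], [89.7, 16.8, 0, 0],
    [86.7, 16.0, 1, 0], [89.7, 16.8, 0, 0], [75.9, 9.5, 0, 1], [78.9, 10.0, 1, 0],
    [87.9, 16.5, 1, 0], [91.0, 15.1, 0, 1], [92.0, 17.9, 0, 1], [87.9, 16.5, 1, 0],
    [90.9, 15.0, 0, 1], [91.9, 17.8, 0, 1],
])
y, x1, z1, z2 = data.T

def fit_sse(cols, y):
    """Least squares fit with an intercept; returns coefficients and SS_ERROR."""
    X = np.column_stack([np.ones(len(y))] + cols)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    return b, resid @ resid

_, sse_full = fit_sse([x1, z1, z2, x1 * z1, x1 * z2], y)   # Model 1: separate lines
_, sse_parallel = fit_sse([x1, z1, z2], y)                 # Model 2: parallel lines
_, sse_equal = fit_sse([x1], y)                            # Model 3: equal lines
print(sse_full, sse_parallel, sse_equal)   # ≈ 395.6, 946.3, 1282.3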

SPSS output:
Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .943a   .889       .866                4.05994
a. Predictors: (Constant), x1z2, x1, z1, z2, x1z1

Model 1 (Full Model or Separate Lines Model)

ANOVA(a)
Model          Sum of Squares   df   Mean Square   F        Sig.
1 Regression   3158.414         5    631.683       38.323   .000b
  Residual     395.595          24   16.483
  Total        3554.010         29
a. Dependent Variable: y
b. Predictors: (Constant), x1z2, x1, z1, z2, x1z1

Coefficients(a)
Model          Unstandardized B   Std. Error   Standardized Beta   t        Sig.
1 (Constant)   8.969              6.078                            1.476    .153
  x1           4.807              .397         1.366               12.098   .000
  z1           52.122             11.225       2.025               4.643    .000
  z2           48.558             7.797        2.231               6.228    .000
  x1z1         -3.201             .759         -1.823              -4.218   .000
  x1z2         -2.803             .530         -1.836              -5.291   .000
a. Dependent Variable: y

Model 2 (Parallel Lines Model): Effect of area and location (Reduced model assuming there is no
interaction effect, i.e., assuming the slopes for the interaction terms = 0)

ANOVA(a)
Model          Sum of Squares   df   Mean Square   F        Sig.
1 Regression   2607.733         3    869.244       23.883   .000b
  Residual     946.277          26   36.395
  Total        3554.010         29
a. Dependent Variable: y
b. Predictors: (Constant), z2, x1, z1

Coefficients(a)
Model          Unstandardized B   Std. Error   Standardized Beta   t       Sig.
1 (Constant)   35.825             5.785                            6.193   .000
  x1           3.000              .362         .852                8.292   .000
  z1           5.189              3.127        .202                1.660   .109
  z2           8.142              2.680        .374                3.038   .005
a. Dependent Variable: y

Model 3 (Equal Lines Model): Effect of area only (Reduced model assuming location and
interaction have no effect, i.e., assuming all slopes for location and interaction = 0)

ANOVA(a)
Model          Sum of Squares   df   Mean Square   F        Sig.
1 Regression   2271.714         1    2271.714      49.605   .000b
  Residual     1282.296         28   45.796
  Total        3554.010         29
a. Dependent Variable: y
b. Predictors: (Constant), x1

Coefficients(a)
Model          Unstandardized B   Std. Error   Standardized Beta   t       Sig.
1 (Constant)   43.732             5.780                            7.567   .000
  x1           2.814              .400         .799                7.043   .000
a. Dependent Variable: y

(a) At the 5% significance level, perform a hypothesis test to determine whether the overall multiple
regression model is significant or useful for making predictions about house price. Perform ALL steps
of the hypothesis test.

H0: β1 = β2 = β3 = β4 = β5 = 0
[The overall multiple regression model is not useful for making predictions about house price.]
Ha: At least one βi is not zero
[The overall multiple regression model is useful for making predictions about house price.]

p = number of predictor variables = 5, n = 30 (random sample of 30 homes)

F = [SSREGR / p] / [SSERROR / (n − (p + 1))] = [3158.414 / 5] / [395.595 / (30 − (5 + 1))] = 631.683 / 16.483 = 38.323

df (regression) = p = 5; df (error) = n − (p + 1) = 30 − (5 + 1) = 24
At df = (5, 24), P < 0.001. There is extremely strong evidence against H0.
Since P < α (0.05), reject H0.

Conclusion: At the 5% significance level, the data provide sufficient evidence to conclude that at least
one of the population regression coefficients is not zero OR that the overall regression model is useful for
making predictions about the response variable (house price).

>>>>>>>>>>
(b) At the 5% significance level, perform a hypothesis test to determine if there is interaction between
location and living area in the way that they affect house price, after accounting for area and location.
In other words, test whether the 3 simple regression lines are parallel, that is, whether the slopes are
the same for all 3 lines.

>>>>>>>>>>
Finding the Residual Sum-of-Squares
Suppose you are given that the extra-sum-of-squares F-statistic comparing the Parallel Lines Model to
the Full Model is F = 16.7045, but you are not given the ANOVA table on the previous page for the
Parallel Lines Model. What is the Residual Sum-of-Squares (SSERROR) for the Parallel Lines Model?

16.7045 = {[SSE(reduced) − 395.595] / 2} / (395.595 / 24)
(16.7045)(16.483125) = [SSE(reduced) − 395.595] / 2
275.34236 = SSE(reduced)/2 − 197.7975
275.34236 + 197.7975 = SSE(reduced)/2
SSE(reduced) ≈ 946.28
>>>>>>>>>>
(c) At the 5% significance level, perform a hypothesis test to determine if there is an effect of location
and/or the interaction between location and living area on house price, after accounting for living area.
In other words, test whether the 3 simple regression lines are equal, that is, whether the y-intercepts
and slopes are the same for all 3 lines.

>>>>>>>>>>
Comparing the 3 SLR Equations for Downtown, Inner Suburbs, and Outer Suburbs

Using the output to get the overall regression model, we get the following:
ŷ = 8.969 + 4.807x1 + 52.122z1 + 48.558z2 − 3.201x1z1 − 2.803x1z2
Note: all partial slopes, including those for the interaction terms, are significant.

We can determine the fitted straight line for each location by finding 3 simple linear regression equations
by simplifying the overall model.

Overall model: y = β0 + β1x1 + β2z1 + β3z2 + β4x1z1 + β5x1z2 + ε

Downtown: y = β0 + β1x1 + ε
ŷ = 8.969 + 4.807x1
Inner suburbs: y = β0 + β2 + (β1 + β4)x1 + ε
ŷ = 8.969 + 52.122 + (4.807 − 3.201)x1
ŷ = 61.091 + 1.606x1
Outer suburbs: y = β0 + β3 + (β1 + β5)x1 + ε
ŷ = 8.969 + 48.558 + (4.807 − 2.803)x1
ŷ = 57.527 + 2.004x1
Overall Conclusion:
1. Downtown houses have a much lower baseline price relative to the suburbs (indicated by the much
lower y-intercept of the downtown simple linear regression line).
2. At least some of the slopes are significantly different, so the locations contribute differently to the model.
3. Downtown prices increase faster with house size than suburban prices do (based on the slopes of
the simple linear regression equations).
4. Both types of suburbs (inner and outer) are similar in baseline prices as well as in the increase in
price with increasing house size.

5.8 Building Models in Multiple Linear Regression

Example on Refractive Surgery


Radial keratotomy is a type of refractive surgery in which radial incisions are made in a myopic
(nearsighted) patient’s cornea to reduce the person’s myopia. The incisions extend radially from the
periphery toward the centre of the cornea. A circular central portion of the cornea, known as the clear
zone, remains uncut. A researcher examined the variables associated with the five-year post-surgical
change in refractive error. She selected 413 patients for the study who met strict entry criteria. In fact, four
clear zone sizes were used: 2.5 mm, 3.0 mm, 3.5 mm, and 4.0 mm. The following is the description of
variables under study.

Variable   Description of Variables
Gender     Gender (Male, Female)
Diameter   Diameter of the clear zone (remains uncut): 2.5 mm, 3.0 mm, 3.5 mm, or 4.0 mm
Age        Age of patients (in years)
Depth      Depth of incision (in mm)
CRE        Change in refractive error

Define the gender and diameter of the clear zone variables using the following indicator variables:

Male = 1 for a male and Male = 0 for a female,

D1 = 1 if diameter of the clear zone is 2.5 mm and D1 = 0 otherwise,
D2 = 1 if diameter of the clear zone is 3.0 mm and D2 = 0 otherwise,
D3 = 1 if diameter of the clear zone is 3.5 mm and D3 = 0 otherwise,
(for a 4.0 mm clear zone, D1 = D2 = D3 = 0: the reference category)

Consider the following as the ORIGINAL regression model with change in refractive error (CRE) as the
response:

μ{CRE | Age, Gender, Diameter} = β0 + β1Age + β2Male + β3D1 + β4D2 + β5D3
    + β6(Age × Male) + β7(Age × D1) + β8(Age × D2) + β9(Age × D3)
    + β10(Age × Male × D1) + β11(Age × Male × D2) + β12(Age × Male × D3)

a) Referring to the original model, in terms of the regression coefficients, what is the effect of age on
mean change in refractive error (CRE), after accounting for gender and diameter? Define this effect in
general, then summarize the effect for each combination of gender and diameter of the clear zone in
the chart below.

Solution:
Logic: For the general effect of age, consider only terms that include age; all terms without age are
excluded, that is, β0, β2, β3, β4, β5 are excluded.
The general effect of age on mean CRE is:
μ{CRE | Age + 1, Gender, Diameter} − μ{CRE | Age, Gender, Diameter}
= β1 + β6Male + β7D1 + β8D2 + β9D3 + β10(Male × D1) + β11(Male × D2) + β12(Male × D3)

Logic: For the effect of age on each combination below, include only slopes for age by itself or for age in
combination with either gender and/or diameter of the clear zone.

Therefore, for each combination of gender and diameter, we have:

Gender   Diameter of the clear zone   Effect of age on mean CRE
Male     2.5                          β1 + β6 + β7 + β10
Male     3.0                          β1 + β6 + β8 + β11
Male     3.5                          β1 + β6 + β9 + β12
Male     4.0                          β1 + β6
Female   2.5                          β1 + β7
Female   3.0                          β1 + β8
Female   3.5                          β1 + β9
Female   4.0                          β1

b) Modify the original model to specify that the effect of age on the mean of CRE is the same for males
and females with the same diameter of the clear zone; the effect of age on the mean of CRE may
still differ across diameters of the clear zone. Just state the constraint(s) needed. You do not have to
rewrite the model.

Diameter = 2.5: β1 + β6 + β7 + β10 = β1 + β7, so β6 + β10 = 0
Diameter = 3.0: β1 + β6 + β8 + β11 = β1 + β8, so β6 + β11 = 0
Diameter = 3.5: β1 + β6 + β9 + β12 = β1 + β9, so β6 + β12 = 0
Diameter = 4.0: β1 + β6 = β1, so β6 = 0

Together, these constraints give β6 = β10 = β11 = β12 = 0.

Explanation:

c) Referring to the original model, write the null and alternative hypotheses, in terms of the coefficients,
to test whether the effect of age is the same for all diameters of the clear zone for females. What is
the distribution of the test statistic under the null hypothesis?

Solution: The effect of age on the mean of CRE is the same for all diameters of the clear zone for
females if β1 + β7 = β1 + β8 = β1 + β9 = β1.
Therefore, H0: β7 = β8 = β9 = 0,
HA: at least one βi ≠ 0, i = 7, 8, 9
If H0 is true, the test statistic has an F-distribution with degrees of freedom:
df = [Extra df, dfERROR(Full)] = [number of selected βi's, n − (p + 1)] = (3, 413 − (12 + 1)) = (3, 400)

d) Referring to the original model, in terms of the regression coefficients, what is the effect of gender
(male vs. female) on the mean CRE, after accounting for age and diameter? Define this effect in
general, then summarize the effect for each diameter of the clear zone in the table below.

Logic: For the general effect of gender, consider only terms that include male.

Solution: The effect of gender (male vs. female) on the mean of CRE is:
μ{CRE | Age, Male, Diameter} − μ{CRE | Age, Female, Diameter}
= μ{CRE | Age, Male = 1, Diameter} − μ{CRE | Age, Male = 0, Diameter}
= β2 + β6Age + β10(Age × D1) + β11(Age × D2) + β12(Age × D3)

Diameter of the clear zone   Effect of gender (male vs. female) on the mean CRE
2.5                          β2 + (β6 + β10)Age
3.0                          β2 + (β6 + β11)Age
3.5                          β2 + (β6 + β12)Age
4.0                          β2 + β6Age

e) Re-write the original model indicating that gender has no effect on mean CRE.

Solution: Gender has no effect on mean CRE if there are no gender terms in the model. Therefore,
μ{CRE | Age, Diameter} = β0 + β1Age + β3D1 + β4D2 + β5D3
    + β7(Age × D1) + β8(Age × D2) + β9(Age × D3)

