Multiple Linear Regression
Biostatistics
By
Wondwossen Terefe
Assistant Professor of Biostatistics at Mekele University
Senior Biostatistician at Tulane International Ethiopia
Overview
• LR vs MLR
• MLR with examples
• Assumptions
• General strategy
• Summary
The linear model with a single predictor variable X can easily be extended to two or more predictor variables:

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon$
The General Idea
Simple regression considers the relation between a single explanatory variable and a response variable.
The General Idea
Multiple regression simultaneously considers the influence of multiple explanatory variables on a response variable Y. The intent is to look at the independent effect of each variable while "adjusting out" the influence of potential confounders.
Regression Modeling
• A simple regression model (one independent variable) fits a regression line in two-dimensional space
• A multiple regression model with two explanatory variables fits a regression plane in three-dimensional space
Simple Regression Model
Regression coefficients are estimated by minimizing $\sum \text{residuals}^2$ (i.e., the sum of the squared residuals) to derive this model:

$\hat{Y} = a + bX$
Multiple Regression Model
Again, estimates for the multiple slope coefficients are derived by minimizing $\sum \text{residuals}^2$ to derive this multiple regression model:

$\hat{Y} = a + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k$
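Not part of the original slides: a minimal Python sketch of fitting such a model by least squares, using statsmodels with simulated data; all variable names here are hypothetical.

```python
# A minimal sketch (simulated data): fit Y-hat = a + b1*X1 + b2*X2 by
# minimizing the sum of squared residuals with statsmodels OLS.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 100
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
Y = 2.0 + 0.5 * X1 - 1.2 * X2 + rng.normal(scale=0.8, size=n)

X = sm.add_constant(np.column_stack([X1, X2]))  # adds the intercept column
model = sm.OLS(Y, X).fit()                      # least-squares fit
print(model.params)     # estimated a, b1, b2
print(model.summary())  # coefficients, t tests, R-squared, ANOVA F
```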
Multiple Regression Model
• The intercept α predicts where the regression plane crosses the Y axis
• The slope for variable X1 (β1) predicts the change in Y per unit change in X1, holding X2 constant
• The slope for variable X2 (β2) predicts the change in Y per unit change in X2, holding X1 constant
Multiple Regression Model
A multiple regression model with k independent variables fits a regression "surface" in k + 1 dimensional space (which cannot be visualized).
Categorical Explanatory Variables in Regression Models
• Categorical independent variables can be incorporated into a regression model by converting them into 0/1 ("dummy") variables
• For binary variables, code dummies "0" for "no" and "1" for "yes"
Dummy Variables: More than Two Levels
For categorical variables with k categories, use k-1 dummy variables. For example, SMOKE2 has three levels, initially coded:
0 = non-smoker
1 = former smoker
2 = current smoker
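As an illustration (not from the slides), here is one way to build the k-1 dummies for SMOKE2 in Python with pandas; the data values are hypothetical.

```python
# Sketch: recode the 3-level SMOKE2 variable into k-1 = 2 dummy variables,
# with "non-smoker" (original code 0) as the reference category.
import pandas as pd

df = pd.DataFrame({"SMOKE2": [0, 1, 2, 0, 2, 1]})  # hypothetical data
labels = {0: "non-smoker", 1: "former smoker", 2: "current smoker"}
df["SMOKE2"] = pd.Categorical(
    df["SMOKE2"].map(labels),
    categories=["non-smoker", "former smoker", "current smoker"],
)

# drop_first=True keeps k-1 dummies; the dropped first category
# ("non-smoker") becomes the reference level.
dummies = pd.get_dummies(df["SMOKE2"], prefix="SMOKE2", drop_first=True)
print(dummies)
```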
[Venn diagram: the variance in Y is partitioned into the unique variance explained by X1, the unique variance explained by X2, the common variance explained by both X1 and X2, and the variance NOT explained by X1 and X2.]
A "good" model
[Diagram: in a "good" model, the predictors X1 and X2 each overlap substantially with Y but only minimally with each other.]
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon$
Residuals: $e = Y - \hat{Y}$
Regression Statistics
How good is our model?
$SST = \sum (Y - \bar{Y})^2$ (total sum of squares)
$SSR = \sum (\hat{Y} - \bar{Y})^2$ (regression sum of squares)
$SSE = \sum (Y - \hat{Y})^2$ (error sum of squares)
Coefficient of Determination (R²)
• Also known as the squared multiple correlation coefficient
• Usually report R² rather than R
• R² = % of variance in the DV explained by the combined effects of the IVs
• Analogous to r² in simple regression
Interpretation of R²
$R^2 = \dfrac{SSR}{SST}$
The coefficient of determination is used to judge the adequacy of the regression model.
Adjusted R²
• Adjusted R² is used for estimating explained variance in a population
• As the number of predictors approaches N, R² is inflated
• Hence report both R² and adjusted R², particularly for small N and where results are to be generalised
• If N is small, take more note of adjusted R²
Regression Statistics
$R^2_{adj} = 1 - (1 - R^2)\,\dfrac{n - 1}{n - k - 1}$
n = sample size
k = number of independent variables

Standard error of the estimate:
$S_e^2 = \dfrac{SSE}{n - k - 1} = \dfrac{\sum (Y - \hat{Y})^2}{n - k - 1} = MSE$, so $S_e = \sqrt{MSE}$
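A small sketch (not from the slides) computing these quantities directly from the formulas above; the function name and arguments are hypothetical.

```python
# Sketch: SST, SSR, SSE, R-squared, adjusted R-squared and S_e from the
# formulas above, given observed y, fitted y_hat and k predictors.
import numpy as np

def regression_stats(y, y_hat, k):
    n = len(y)
    sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
    ssr = np.sum((y_hat - y.mean()) ** 2)  # regression sum of squares
    sse = np.sum((y - y_hat) ** 2)         # error sum of squares
    r2 = ssr / sst
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    se = np.sqrt(sse / (n - k - 1))        # S_e = sqrt(MSE)
    return r2, r2_adj, se
```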
ANOVA
$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$
$H_A: \beta_i \neq 0$ for at least one $i$
(ANOVA table columns: df, SS, MS, F, P-value)

Tests for individual coefficients:
$H_0: \beta_i = 0$
$H_1: \beta_i \neq 0$
$t_{(n-k-1)} = \dfrac{b_i - \beta_i}{S_{b_i}}$
Hypotheses Tests for Regression Coefficients
$H_0: \beta_i = 0$
$H_A: \beta_i \neq 0$
$t_{(n-k-1)} = \dfrac{b_i - \beta_i}{S_e(b_i)}$, where $S_e(b_i) = S_e \sqrt{C_{ii}}$ (for simple regression, $S_e(b_1)^2 = S_e^2 / S_{xx}$)
Confidence Interval on Regression Coefficients
$b_i \pm t_{\alpha/2,\; n-k-1}\, S_e(b_i)$
In matrix form, the least-squares estimates are:
$\hat{\boldsymbol{\beta}} = (X'X)^{-1} X'Y$
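A numpy sketch (not from the slides) of this closed-form solution on simulated data; in practice a numerically stabler solver such as np.linalg.lstsq would be preferred.

```python
# Sketch: beta-hat = (X'X)^(-1) X'Y computed directly in matrix form.
import numpy as np

rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0, -0.5])
Y = X @ beta_true + rng.normal(scale=0.5, size=n)

beta_hat = np.linalg.inv(X.T @ X) @ X.T @ Y  # (X'X)^-1 X'Y
print(beta_hat)  # should be close to beta_true
```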
$t_{(n-k-1)} = \dfrac{b_i - \beta_i}{S_e(b_i)}$, with $S_e(b_i)^2 = S_e^2\, C_{ii}$, where $C_{ii}$ is the i-th diagonal element of $(X'X)^{-1}$
Model Development/Building
Types of Model Building
• Forward addition
• Backward elimination
• Standard or direct (simultaneous)
• Hierarchical or sequential
Backward elimination: start with all candidate predictors in the model, then remove the least useful predictor one at a time until only significant predictors remain.
Direct or standard (simultaneous): all predictors are entered into the model at the same time.
Stepwise
• Combines both forward and backward approaches
• At each step, variables may be entered or removed if they meet certain criteria or a certain order:
  - by size of correlation with the dependent variable
  - in order of significance
• Useful for developing the best prediction equation from the smallest number of variables
• Means that redundant predictors will be removed
• Computer driven, and therefore controversial
Stepwise Regression (example trace):
Include X3
Include X6
Include X2
Include X5
Remove X2 (once X5 was inserted into the model, X2 became unnecessary)
Include X7
Remove X7 (it is insignificant)
Stop
Final model includes X3, X5 and X6
(A sketch of such a procedure follows below.)
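A rough Python sketch (not from the slides) of a p-value-driven stepwise procedure like the trace above; the entry/removal thresholds are hypothetical, and real implementations use more careful criteria.

```python
# Sketch of p-value-based stepwise selection. Note: stepwise selection is
# controversial (see above), and this naive loop could cycle on
# pathological data; real packages guard against re-entry.
import pandas as pd
import statsmodels.api as sm

def stepwise(y, X, p_enter=0.05, p_remove=0.10):
    """X: DataFrame of candidate predictors; returns selected column names."""
    selected = []
    while True:
        changed = False
        # Forward step: try adding the candidate with the smallest p-value.
        remaining = [c for c in X.columns if c not in selected]
        pvals = pd.Series(dtype=float)
        for c in remaining:
            m = sm.OLS(y, sm.add_constant(X[selected + [c]])).fit()
            pvals[c] = m.pvalues[c]
        if not pvals.empty and pvals.min() < p_enter:
            selected.append(pvals.idxmin())
            changed = True
        # Backward step: drop the selected variable with the largest p-value.
        if selected:
            m = sm.OLS(y, sm.add_constant(X[selected])).fit()
            worst = m.pvalues.drop("const").max()
            if worst > p_remove:
                selected.remove(m.pvalues.drop("const").idxmax())
                changed = True
        if not changed:
            return selected
```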
Which method?
• Standard: to assess the impact of all IVs simultaneously
• Hierarchical: to test specific hypotheses derived from theory
• Stepwise: if the goal is accurate statistical prediction (computer driven)
Model Selection: The General Case
(Extremely important!)
MLR Model: Basic Assumptions
• Independence: the data of any particular subject are independent of the data of all other subjects
• Normality: in the population, the data on the dependent variable are normally distributed for each of the possible combinations of the levels of the X variables; each of the variables is normally distributed
• Homoscedasticity: in the population, the variances of the dependent variable for each of the possible combinations of the levels of the X variables are equal
• Linearity: in the population, the relation between the dependent variable and each independent variable is linear
Diagnostic Tests For Regressions
[Figure: expected distribution of residuals $\varepsilon_i$ against $X_i$ for a linear model with a normal distribution of residuals (errors).]
Standardized Residuals
$d_i = \dfrac{e_i}{\sqrt{S_e^2}}$
[Figure: standardized residuals plotted against observation number (0 to 25), falling between about -2 and +2.5.]
Normality & homoscedasticity
Normality
• If there is non-normality, there will be heteroscedasticity
Homoscedasticity
• The variance around the regression line is the same throughout the distribution
• Even spread in residual plots
Homoscedasticity
Multicollinearity
• Many health research studies have large numbers of predictor variables
• Problems arise when the various predictors are highly related among themselves (collinear)
• Estimated regression coefficients can change dramatically, depending on whether or not other predictor(s) are included in the model
• Standard errors of regression coefficients can increase, causing non-significant t-tests and wide confidence intervals
• The variables are explaining the same variation in Y
Multicollinearity
• A high degree of multicollinearity produces unacceptable uncertainty (large variance) in regression coefficient estimates (i.e., large sampling variation)
• Detect via:
  - Correlation matrix: are there large correlations among the IVs?
  - Tolerance statistic: if < .2, exclude that variable
  - Variance Inflation Factor (VIF): look for VIF < 5, otherwise exclude the variable (a VIF computation is sketched below)
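A sketch (not from the slides) of computing VIF and tolerance in Python with statsmodels; the predictor names and the collinearity between x1 and x2 are simulated.

```python
# Sketch: tolerance and VIF for each predictor, using statsmodels'
# variance_inflation_factor on simulated data.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.3, size=n)  # deliberately collinear with x1
x3 = rng.normal(size=n)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

for i, name in enumerate(X.columns):
    if name == "const":
        continue
    vif = variance_inflation_factor(X.values, i)
    print(f"{name}: VIF = {vif:.2f}, tolerance = {1 / vif:.3f}")
```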
Scatter Plot
Multicollinearity
If the F-test for significance of regression is significant, but the tests on the individual regression coefficients are not, multicollinearity may be present.
• Partial correlation
• Multiple regression
[Venn diagram: Y with overlapping predictors X1 and X2.]
Partial Correlation
• Measures the strength of association between Y and a predictor, controlling for the other predictor(s)
• The squared partial correlation represents the fraction of the variation in Y that is not explained by the other predictor(s) that is explained by this predictor
Statistical Definition of the Partial F Test
• Research question: does the inclusion/exclusion of the "extra" predictors explain significantly more of the variability in the outcome, compared to the variability that is explained by the predictors already in the model?
• H0: the addition of $X_{p+1}, \ldots, X_{p+k}$ is of no statistical significance for the prediction of Y after controlling for the predictors $X_1, \ldots, X_p$, meaning that $\beta_{p+1} = \beta_{p+2} = \cdots = \beta_{p+k} = 0$
• HA: not all of these extra coefficients are 0
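A sketch (not from the slides) of this partial F test in Python, comparing nested models with statsmodels' anova_lm on simulated data; the variable names are hypothetical.

```python
# Sketch: partial F test comparing a reduced model (x1) with a complete
# model (x1, x2, x3). H0: the extra coefficients (x2, x3) are all zero.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(2)
n = 120
df = pd.DataFrame({"x1": rng.normal(size=n),
                   "x2": rng.normal(size=n),
                   "x3": rng.normal(size=n)})
df["y"] = 1 + 0.8 * df.x1 + 0.4 * df.x2 + rng.normal(size=n)

reduced = smf.ols("y ~ x1", data=df).fit()
complete = smf.ols("y ~ x1 + x2 + x3", data=df).fit()

print(anova_lm(reduced, complete))  # F statistic and p-value for H0
```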
A Suggested Statistical Criterion for Determination of Confounding
• A variable Z might be judged to be a confounder of an X-Y relationship if BOTH of the following are satisfied:
  - its inclusion in a model that already contains X as a predictor has adjusted significance level < .05; and
  - its inclusion in the model alters the estimated regression coefficient for X by 15-20% or more, relative to the model that contains only X as a predictor.
A Suggested Statistical Criterion
for Assessment of Interaction
[Figure 12.26: µY versus X1 (X1 = 1 to 2) at two levels of X2. Panel (a): µY = 18 + 5X1 when X2 = 2 and µY = 30 - 10X1 when X2 = 5; the slopes differ, indicating interaction. Panel (b): µY = 18 + 15X1 when X2 = 2 and µY = 30 + 15X1 when X2 = 5; the lines are parallel, indicating no interaction.]
Quadratic and Second-Order Models
Quadratic effects:
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \varepsilon$

Test statistic:
$F_{obs} = \dfrac{(SSE_r - SSE_c)/(k - g)}{SSE_c/[n - (k + 1)]}$, with $P = P(F \geq F_{obs})$
The P-value is based on the F-distribution with k - g and n - (k + 1) degrees of freedom. Compare the "Model 2" and "Model 3" models using a partial F test.
Dealing with outliers
• Extreme cases should be deleted or modified
• Univariate outliers: detected via initial data screening
• Bivariate outliers: detected via scatterplots
• Multivariate outliers: unusual combinations of predictors...
Multivariate outliers
• Can use Mahalanobis distance or Cook's D as a multivariate outlier screening procedure
• A case may be within the normal range on all variables, but represent a multivariate outlier that unduly influences multivariate test results; e.g., a person who:
  - is 19 years old
  - has 3 children
  - has an undergraduate degree
• Identify and check unusual cases
Detecting Sample Outliers
• Sample leverages
• Standardized residuals
• Cook's distance measure

$\text{Standardized residual} = \dfrac{Y_i - \hat{Y}_i}{s\sqrt{1 - h_i}}$
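A sketch (not from the slides) of extracting these three diagnostics from a fitted OLS model in statsmodels; the data are simulated.

```python
# Sketch: leverages h_i, standardized (internally studentized) residuals
# and Cook's distance from a fitted statsmodels OLS model.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.normal(size=60)
y = 2 + 1.5 * x + rng.normal(size=60)
fit = sm.OLS(y, sm.add_constant(x)).fit()

infl = fit.get_influence()
leverage = infl.hat_matrix_diag               # h_i
std_resid = infl.resid_studentized_internal   # (Y_i - Yhat_i) / (s * sqrt(1 - h_i))
cooks_d = infl.cooks_distance[0]              # Cook's distance for each case
print(np.argsort(cooks_d)[-3:])               # indices of the 3 most influential cases
```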
Table 12.1
Number of regression coefficients (k):   1 or 2   3 or 4   ≥ 5
DMAX:                                    .8       .9       1.0
Standardized Regression Coefficients
• Measures the change in E(Y), in standard deviations, per standard deviation change in Xi, controlling for all other predictors (bi*)
• Allows comparison of variable effects that are independent of units
• Estimated standardized regression coefficients:
$b_i^* = b_i \,\dfrac{s_{X_i}}{s_Y}$
where $b_i$ is the partial regression coefficient and $s_{X_i}$ and $s_Y$ are the sample standard deviations of the two variables
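A sketch (not from the slides) of computing standardized coefficients from an unstandardized fit; the data, with deliberately different predictor scales, are simulated.

```python
# Sketch: b_i* = b_i * (s_Xi / s_Y), computed from an OLS fit.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(4)
df = pd.DataFrame({"x1": rng.normal(scale=10, size=100),
                   "x2": rng.normal(scale=0.1, size=100)})
df["y"] = 3 + 0.2 * df.x1 + 40 * df.x2 + rng.normal(size=100)

fit = sm.OLS(df["y"], sm.add_constant(df[["x1", "x2"]])).fit()
beta_star = fit.params[["x1", "x2"]] * df[["x1", "x2"]].std() / df["y"].std()
print(beta_star)  # comparable across predictors despite different units
```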
Unstandardised regression coefficients
• B = unstandardised regression coefficient
• Used for regression equations
• Used for predicting Y scores
• Can't be compared with one another unless all IVs are measured on the same scale
Standardised regression coefficients
• Beta (β) = standardised regression coefficient
• Used for comparing the relative strength of predictors
• β = r in simple LR, but this is only true in MLR when the IVs are uncorrelated
Relative importance of IVs
• Which IVs are the most important?
• Compare the standardised regression coefficients (β's)
Split sample validation
• It is expected that the results obtained from split sample validations will vary somewhat from the results obtained from the analysis using the full data set. We will use the following as our criteria that the validation verified our analysis and supports the generalizability of our findings:
• First, the overall relationship between the dependent variable and the set of independent variables must be statistically significant for both validation analyses.
• Second, the R² for each validation must be within 5% (plus or minus) of the R² for the model using the full sample.
Split sample validation - 2
• Third, the pattern of statistical significance for the coefficients of the independent variables for both validation analyses must be the same as the pattern for the full analysis, i.e., the same variables are statistically significant or not significant. (A sketch of this procedure follows below.)
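A sketch (not from the slides) of checking these three criteria on a 50/50 split in Python; the data are simulated, and the 5% band is applied to R² as an absolute difference, which is one reading of the criterion.

```python
# Sketch of a split-sample validation loop following the criteria above.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 400
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 1 + 0.6 * df.x1 - 0.4 * df.x2 + rng.normal(size=n)

full = smf.ols("y ~ x1 + x2", data=df).fit()
half = rng.permutation(n) < n // 2  # random 50/50 split
for name, part in [("A", df[half]), ("B", df[~half])]:
    m = smf.ols("y ~ x1 + x2", data=part).fit()
    ok_overall = m.f_pvalue < 0.05                       # criterion 1
    ok_r2 = abs(m.rsquared - full.rsquared) <= 0.05      # criterion 2
    same_sig = all((m.pvalues.drop("Intercept") < 0.05)
                   == (full.pvalues.drop("Intercept") < 0.05))  # criterion 3
    print(name, ok_overall, ok_r2, same_sig)
```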
Research question
• Do number of cigarettes (IV1), exercise (IV2) and cholesterol (IV3) predict CHD mortality (DV)?
[Path diagram: Cigarettes, Exercise and Cholesterol each point to CHD Mortality.]
Research question
• "Does the number of years of psychological study (IV1) and the number of years of counseling experience (IV2) predict clinical psychologists' effectiveness in treating mental illness (DV)?"
[Path diagram: Study and Experience each point to Effectiveness.]
Example
"Does 'ignoring problems' (IV1) and 'worrying' (IV2) predict 'psychological distress' (DV)?"
[Venn diagram: zero-order correlations of .52 (X1 with Y), .32 (X2 with Y) and .34 (X1 with X2).]
Coefficients(a)
Model                 B         Std. Error   Beta    t        Sig.
(Constant)            138.932   4.680                29.687   .000
Worry                 -11.511   1.510        -.464   -7.625   .000
Ignore the Problem    -4.735    1.780        -.162   -2.660   .008
a. Dependent Variable: Psychological Distress
Coefficients(a)
Model                 Zero-order   Partial
Worry                 -.521        -.460
Ignore the Problem    -.325        -.178
a. Dependent Variable: Psychological Distress
[Venn diagram as above, now also showing partial correlations of .46 (X1 with Y) and .18 (X2 with Y).]
Prediction equations
Linear Regression:
Psych. Distress = 119 - 9.50*Ignore
R² = .11
Confidence interval for the slope
$b_1 = -5.44$
The 95% CI: $-6.17 \leq \beta_1 \leq -4.70$
The estimated average consumption of oil is reduced by between 4.70 and 6.17 gallons for each 1 °F increase.
Study
• Participants were children 8 to 12 years old
• Lived in high-violence areas, USA
• Hypothesis: violence and stress lead to internalising behaviour, whereas social support would reduce internalising behaviour
Variables
Predictors:
• Degree of witnessing violence
• Measure of life stress
• Measure of social support
Outcome:
• Internalising behaviour (e.g., depression, anxiety symptoms)
Correlations
Pearson correlations:
                                  Amount violence   Current   Social    Internalizing
                                  witnessed         stress    support   symptoms on CBCL
Amount violence witnessed           -
Current stress                     .050                -
Social support                     .080              -.080       -
Internalizing symptoms on CBCL     .200*             .270**    -.170       -
*. Correlation is significant at the 0.05 level (2-tailed).
**. Correlation is significant at the 0.01 level (2-tailed).
R²
Model Summary
R        R Square   Adjusted R Square   Std. Error of the Estimate
.37(a)   .135       .108                2.2198
a. Predictors: (Constant), Social support, Current stress, Amount violence witnessed
Test for overall significance
• Shows if there is a linear relationship between all of the X variables taken together and Y
• Hypotheses:
  H0: β1 = β2 = ... = βp = 0 (no linear relationships)
  H1: at least one βi ≠ 0 (at least one independent variable affects Y)
Test for overall significance
• The significance test of R² is given by the ANOVA table:

ANOVA(b)
             Sum of Squares   df   Mean Square   F       Sig.
Regression   454.482          1    454.48        19.59   .00(a)
Residual     440.757          19   23.198
Total        895.238          20
a. Predictors: (Constant), Cigarette Consumption per Adult per Day

H0: βi = 0 (no linear relationship)
H1: βi ≠ 0
Interpretation
$\hat{Y} = b_1 X_1 + b_2 X_2 + b_3 X_3 + b_0 = 0.038\,\text{Wit} + 0.273\,\text{Stress} - 0.074\,\text{SocSupp} + 0.477$
• The slopes for Witness and Stress are positive, but the slope for Social Support is negative.
• If you had subjects with identical Stress and Social Support, a one-unit increase in Witness would produce a 0.038-unit increase in Internalising symptoms.
Predictions
If Witness = 20, Stress = 5, and SocSupp = 35, then we would predict that internalising symptoms would be:
$\hat{Y} = 0.038(20) + 0.273(5) - 0.074(35) + 0.477 = 0.012$
• R² = .35
• Two significant IVs (not Social Capital, which was dropped)
• R² = .72
General MLR strategy
• Check assumptions
• Conduct MLR and choose the type
• Interpret the output
• Develop a regression equation
1. Check assumptions
• Assumptions: Xs not correlated, X-Y relations linear, normal distributions, homoscedasticity
• Check histograms (normality)
• Check scatterplots (linearity and outliers)
• Check the correlation table (linearity and collinearity)
• Check influential outlying cases (multivariate outliers)
• Check residual plots
2. Conduct MLR
Conduct a multiple linear regression:
• Standard
• Hierarchical
• Stepwise
• Forward
• Backward
3. Interpret the results
Interpret the technical and psychological meaning of the results, based on:
• Overall amount of Y predicted (R, R², adjusted R², the statistical significance of R)
• Changes in R² and F-change, if hierarchical
• Coefficients for the IVs: standardised and unstandardised regression coefficients for the IVs in each model (β, B)
• Relations between the X predictors (r)
• Zero-order and partial correlations for the IVs
4. Regression equation
• MLR is usually for explanation, sometimes prediction
• If useful, develop a regression equation for the final model
• Interpret the constant and slopes
References
• Kliewer, W., Lepore, S. J., Oskin, D., & Johnson, P. D. (1998). The role of social and cognitive processes in children's adjustment to community violence. Journal of Consulting and Clinical Psychology, 66, 199-209.
• Vemuri, A. W., & Constanza, R. (2006). The role of human, social, built, and natural capital in explaining life satisfaction at the country level: Toward a National Well-Being Index (NWI). Ecological Economics, 58, 119-133.
Goodness-of-Fit and Regression Diagnostics
Cont…
• Our eye "tells" us:
  - a better-fitting relationship between X and Y is quadratic
  - we notice different sizes of discrepancies:
    - some observed Y are close to the fitted Ŷ (e.g., near X = 1 or X = 8)
    - other observed Y are very far from the fitted Ŷ (e.g., near X = 5)
Cont…
• Poor fits of the data to a fitted line can occur for several reasons, and can occur even when the fitted line explains a large proportion (R²) of the total variability in the response:
  - the wrong functional form (link function) was fit
  - extreme values (outliers) exhibit uniquely large discrepancies between observed and fitted values
  - one or more important explanatory variables have been omitted
  - one or more model assumptions have been violated
Cont…
• Consequences of a poor fit include:
  - we learn the wrong biology
  - comparisons of group differences aren't "fair" because they are unduly influenced by a minority
  - comparisons of group means aren't "fair" because we used the wrong standard error
  - predictions are wrong because the fitted model does not apply to the case of interest
Cont…
• Available techniques of goodness-of-fit assessment are of two types; the one illustrated in what follows is case analysis: regression diagnostics for the detection of a poor fit.
Case analysis
b. Assessment of Normality
• Recall what we are assuming with respect to normality:
• Simple linear regression: at each level x of the predictor variable X, the outcomes Y are distributed normal with mean $\mu_{Y|x} = \beta_0 + \beta_1 x$ and constant variance $\sigma^2_{Y|x}$
• Multiple linear regression: at each vector level x = [x1, x2, ..., xp] of the predictor vector X, the outcomes Y are distributed normal with mean $\mu_{Y|x} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$ and constant variance $\sigma^2_{Y|x}$
Some graphical assessments of normality, and what to watch out for: [figures not reproduced]
Cont…
• Guidelines:
  - In practice, the assessment of normality is made after assessment of other model assumption violations. The linear model is often robust to violations of the assumption of normality, and the cure (e.g., transformation of the outcome variable) is often worse than the problem.
  - Consider doing a scatterplot of the residuals. Look for:
    - a bell-shaped pattern
    - centered at zero
    - no gross outliers
c. Cook-Weisberg Test of Heteroscedasticity
• Evidence of a violation of homogeneity (this is heteroscedasticity) is seen when:
  - there is increasing or decreasing variation in the residuals with the fitted values $\hat{Y}$
  - there is increasing or decreasing variation in the residuals with a predictor X
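A sketch (not from the slides): Stata's hettest implements the Cook-Weisberg score test; statsmodels provides the closely related Breusch-Pagan test, used here on simulated data whose error variance grows with X.

```python
# Sketch: Breusch-Pagan test for heteroscedasticity on simulated data.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, size=150)
y = 1 + 2 * x + rng.normal(scale=0.5 + 0.3 * x, size=150)  # variance grows with x

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
print(f"LM p-value = {lm_pvalue:.4f}")  # small p-value => heteroscedasticity
```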
Outlier detection
Robust statistical options when assumptions are violated
• Nonlinearity
  1. Transformation to linearity
  2. Nonlinear regression
• Influential outliers
  1. Robust regression with robust weight functions
  2. rreg y x1 x2
• Heteroskedasticity of residuals
  1. Regression with Huber/White/sandwich variance-covariance estimators
  2. regress y x1 x2, robust
• Residual autocorrelation correction
  1. Autoregression with prais y x1 x2, robust
  2. Newey-West regression
• Nonnormality of residuals
  1. Quantile regression: qreg y x1 x2
  2. Bootstrapping the regression coefficients
(Rough Python counterparts are sketched below.)
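For readers working in Python rather than Stata, here is a hedged sketch of rough statsmodels counterparts for some of the commands above; the mapping is approximate, not exact, and the data are simulated.

```python
# Sketch: approximate statsmodels analogues of the Stata commands above.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 100
X = sm.add_constant(rng.normal(size=(n, 2)))
y = X @ np.array([1.0, 0.5, -0.3]) + rng.standard_t(df=3, size=n)

rlm = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()            # ~ rreg
hc = sm.OLS(y, X).fit(cov_type="HC1")                           # ~ regress ..., robust
nw = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": 2})  # ~ newey
qr = sm.QuantReg(y, X).fit(q=0.5)                               # ~ qreg (median)
print(rlm.params, qr.params)
```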