Metrics Topic 6 Part 1: Multiple Regression
Review the causal analysis framework in topic 4
• 𝑋𝑖 : individual i’s received treatment (such as years of education).
• 𝑌𝑖 : individual i’s outcome measure (such as earnings).
• 𝑌𝑥𝑖 : potential outcome if individual i receives treatment 𝑥.
• Assume that the potential outcome is linear in 𝑥, where 𝑥 takes one of the values 𝑥1 , 𝑥2 , … , 𝑥𝑀 :
𝑌𝑥𝑖 = 𝛽0 + 𝛽1 𝑥 + 𝑒𝑖 , 𝐸 (𝑒𝑖 ) = 0.
• The causal effect of the treatment on the outcome is defined as 𝛽1 = Δ𝑌𝑥𝑖 /Δ𝑥:
one more year of education changes earnings by 𝛽1 units.
• Note that when 𝑥 = 𝑋𝑖 , 𝑌𝑥𝑖 = 𝑌𝑖 : 𝑌𝑖 = 𝛽0 + 𝛽1 𝑋𝑖 + 𝑒𝑖
• In general, 𝐸 (𝑒𝑖 |𝑋𝑖 ) ≠ 0.
➢ Think of 𝑒𝑖 as the innate ability of person i, which determines
person i’s potential outcomes.
➢ On average, those with more education tend to have higher ability:
𝐸 (𝑒𝑖 |𝑋𝑖 = 16) > 𝐸 (𝑒𝑖 |𝑋𝑖 = 9).
➢ The causal effect of raising education from 9 to 16 years is 7𝛽1 . But the measured difference in means is
𝐸 (𝑌𝑖 |𝑋𝑖 = 16) − 𝐸 (𝑌𝑖 |𝑋𝑖 = 9) = 7𝛽1 + 𝐸 (𝑒𝑖 |𝑋𝑖 = 16) − 𝐸 (𝑒𝑖 |𝑋𝑖 = 9) > 7𝛽1
Omitted variable bias, the theory
𝑌 = 𝛽0 + 𝛽1 𝑋 + 𝑢
• The error u includes all omitted variables, other than X, that
influence Y. (u might also reflect heterogeneous causal effects.)
• There are always omitted variables.
• If there are omitted variables that are correlated with X, then LSA#1
is violated and the OLS estimator converges in probability to the
causal parameter plus a bias. This bias is called the “omitted variable
bias.”
• Recall that 𝑐𝑜𝑣(𝑌, 𝑋) = 𝑐𝑜𝑣(𝛽0 + 𝛽1 𝑋 + 𝑢, 𝑋) = 𝛽1 𝑣𝑎𝑟(𝑋) + 𝑐𝑜𝑣(𝑢, 𝑋)
• Then
𝛽̂1 = 𝑠𝑌𝑋 /𝑠𝑋² →ᵖ 𝑐𝑜𝑣(𝑌, 𝑋)/𝑣𝑎𝑟(𝑋) = [𝛽1 𝑣𝑎𝑟(𝑋) + 𝑐𝑜𝑣(𝑢, 𝑋)]/𝑣𝑎𝑟(𝑋) = 𝛽1 + 𝑐𝑜𝑣(𝑢, 𝑋)/𝑣𝑎𝑟(𝑋)
• Equivalently,
𝛽̂1 − 𝛽1 →ᵖ 𝑐𝑜𝑣(𝑢, 𝑋)/𝑣𝑎𝑟(𝑋) ≡ “omitted variable bias”
• There is a downward bias if cov(u,X)<0, and upward bias if cov(u,X)>0.
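To make the bias formula concrete, here is a small simulation sketch in Python (not from the slides; the data-generating process and numbers are invented for illustration) checking that the OLS slope from the short regression is close to 𝛽1 + cov(u, X)/var(X) in a large sample.

import numpy as np

# Hypothetical DGP for illustrating omitted variable bias.
rng = np.random.default_rng(0)
n = 100_000
beta0, beta1 = 1.0, 2.0

w = rng.normal(size=n)              # omitted factor
X = 0.5 * w + rng.normal(size=n)    # X is positively correlated with w
u = w + rng.normal(size=n)          # the error u contains the omitted factor
Y = beta0 + beta1 * X + u

# OLS slope from regressing Y on X only (the "short" regression).
beta1_hat = np.cov(Y, X)[0, 1] / np.var(X, ddof=1)
bias = np.cov(u, X)[0, 1] / np.var(X, ddof=1)

print("beta1_hat              :", round(beta1_hat, 3))
print("beta1 + cov(u,X)/var(X):", round(beta1 + bias, 3))   # nearly identical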
The TestScore-STR example
𝑇𝑒𝑠𝑡𝑆𝑐𝑜𝑟𝑒 = 𝛽0 + 𝛽1 𝑆𝑇𝑅 + 𝑢
• PctEL has a negative effect on TestScore, and thus enters u with a
negative sign
𝑢 = 𝛾𝑃𝑐𝑡𝐸𝐿 + 𝑣, 𝛾<0
• PctEL is positively correlated with STR
𝑐𝑜𝑣(𝑃𝑐𝑡𝐸𝐿, 𝑆𝑇𝑅) > 0
So
cov(u, STR) < 0
• Then there is a downward bias.
• The OLS estimator satisfies 𝛽̂1 − 𝛽1 < 0 in large samples.
Omitted variables that satisfy LSA#1
𝐼𝑛𝑓𝑒𝑐𝑡𝑖𝑜𝑛𝑖 = 𝛽0 + 𝛽1 𝑉𝑎𝑐𝑐𝑖𝑛𝑒𝑖 + 𝑢𝑖
Three ways to overcome omitted variable bias
1. Run a randomized controlled experiment in which treatment (STR) is
randomly assigned: then PctEL is still a determinant of TestScore, but
PctEL is uncorrelated with STR. (This solution is not feasible.)
Difference in means: holding constant omitted factors
• Among districts with comparable PctEL, the effect of class size is smaller than
the overall “test score gap” of 7.4.
The conditional independence assumption and control variables
• Conditional Independence Assumption: the treatment Xi is
independent of the potential outcomes Yxi conditional on Zi .
• Given that 𝑌𝑥𝑖 = 𝛽0 + 𝛽1 𝑥 + 𝑒𝑖 , the above assumption implies Xi is
independent of ei conditional on Zi . This implies
𝐸 (𝑒𝑖 |Xi , 𝑍𝑖 ) = 𝐸(𝑒𝑖 |𝑍𝑖 )
• We can always decompose a random variable as follows
ei = 𝐸 (𝑒𝑖 |Xi , 𝑍𝑖 ) + (𝒆𝒊 − 𝑬(𝒆𝒊 |𝐗 𝐢 , 𝒁𝒊 )) = 𝐸 (𝑒𝑖 |Xi , 𝑍𝑖 ) + 𝒖𝒊
where ui = 𝑒𝑖 − 𝐸 (𝑒𝑖 |Xi , 𝑍𝑖 ) and E(ui |𝑋𝑖 , 𝑍𝑖 ) = 0 by definition.
• Assume 𝐸 (𝑒𝑖 |𝑍𝑖 ) = 𝛾0 + 𝛾1 𝑍𝑖 , then
𝑬(𝒆𝒊 |𝐗 𝐢 , 𝒁𝒊 ) = 𝐸 (𝑒𝑖 |𝑍𝑖 ) = 𝜸𝟎 + 𝜸𝟏 𝒁𝒊 .
• Then ei = 𝛾0 + 𝛾1 𝑍𝑖 + 𝑢𝑖 , 𝑤𝑖𝑡ℎ 𝐸 (𝑢𝑖 |𝑋𝑖 , 𝑍𝑖 ) = 0.
• The original linear causal model becomes a linear regression model
with an additional regressor:
𝑌𝑖 = 𝛽0 + 𝛽1 𝑋𝑖 + 𝑒𝑖 = (𝛽0 + 𝛾0 ) + 𝛽1 𝑋𝑖 + 𝛾1 𝑍𝑖 + 𝑢𝑖 , 𝐸 (𝑢𝑖 |𝑋𝑖 , 𝑍𝑖 ) = 0
• In the new model, Zi is called the “control variable”.
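A minimal simulation sketch (not part of the slides; the DGP and coefficient values are assumptions chosen for illustration): when X and 𝑒𝑖 are independent conditional on Z, adding Z as a control lets OLS recover 𝛽1, while the regression without Z does not.

import numpy as np

# Hypothetical DGP: X and e are independent conditional on Z.
rng = np.random.default_rng(1)
n = 200_000
beta0, beta1, gamma1 = 1.0, 2.0, 3.0

Z = rng.normal(size=n)
X = Z + rng.normal(size=n)            # X depends on Z (plus independent noise)
e = gamma1 * Z + rng.normal(size=n)   # E(e | X, Z) = E(e | Z) = gamma1 * Z
Y = beta0 + beta1 * X + e

# Short regression: Y on X only (suffers from omitted variable bias).
b_short, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), X]), Y, rcond=None)
# Long regression: Y on X and the control variable Z.
b_long, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), X, Z]), Y, rcond=None)

print("slope on X without control Z:", round(b_short[1], 3))   # biased (about 3.5)
print("slope on X with control Z   :", round(b_long[1], 3))    # close to beta1 = 2
print("coefficient on Z            :", round(b_long[2], 3))    # close to gamma1 = 3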
Example of Conditional Independence: Project STAR
• Project STAR (Student-Teacher Achievement Ratio)
• 11,600 kindergartners in 1985-86. The study ran for 4 years, until the
original cohort was in 3rd grade.
– Cost $12 million.
• Upon entering the school system, a student was randomly assigned
to one of three groups within the school:
– regular class (22 – 25 students, no aide)
– regular class + aide (with a full-time aide)
– small class (13 – 17 students)
• Y = Stanford Achievement Test scores
• Teachers were also randomly assigned within a school.
The Data Structure of STAR
• A random sample, or an i.i.d. sample. Use the data structure of STAR
as an example:
{(𝑌𝑖 , 𝑋𝑖 , 𝑆𝑖 )}, 𝑖 = 1, … , 300
• The observed variables are:
– 𝑌𝑖 : student 𝑖’s test score.
– 𝑋𝑖 : a dummy indicating whether student 𝑖 was assigned to a
small class.
– 𝑆𝑖 : a factor that indicates which school student 𝑖 belongs to.
Assume that there are three schools: 𝑆𝑖 = 1,2,3.
– We may think of 𝑆𝑖 in terms of 3 dummies: 𝑆1𝑖 , 𝑆2𝑖 , 𝑆3𝑖 .
• The causal model
𝑌𝑖 = 𝑐 + 𝛽𝑋𝑖 + 𝑢𝑖
where 𝑢𝑖 is the causal error, which includes other determinants of scores
and also possibly reflects heterogeneous treatment effects.
The Implied Linear Regression
• The treatment (class size) is randomly assigned within each school but
not across schools:
– 𝑋𝑖 is independent of the potential scores (𝑌1𝑖 , 𝑌0𝑖 ), conditional
on (𝑆1𝑖 , 𝑆2𝑖 )
– Conditional independence implies conditional mean
independence between 𝑋𝑖 and the causal error 𝑢𝑖 :
𝐸 (𝑢𝑖 |𝑋𝑖 , 𝑆1𝑖 , 𝑆2𝑖 ) = 𝐸 (𝑢𝑖 |𝑆1𝑖 , 𝑆2𝑖 )
– Together with the constant, the two dummies span all three school
effects. This is the so-called “saturated model”: the conditional mean
must be a linear function of the dummies:
𝑢𝑖 = 𝑎 + 𝛾1 𝑆1𝑖 + 𝛾2 𝑆2𝑖 + 𝑒𝑖 , 𝐸 (𝑒𝑖 |𝑋𝑖 , 𝑆1𝑖 , 𝑆2𝑖 ) = 0.
• This implies a linear regression model with school fixed effects as
controls:
𝑌𝑖 = 𝛽0 + 𝛽𝑋𝑖 + 𝛾1 𝑆1𝑖 + 𝛾2 𝑆2𝑖 + 𝑒𝑖 , 𝐸 (𝑒𝑖 |𝑋𝑖 , 𝑆1𝑖 , 𝑆2𝑖 ) = 0
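As a hedged sketch (the sample size, assignment probabilities, and coefficient values below are invented; this is not the actual STAR data), the regression with school dummies as controls can be estimated as follows.

import numpy as np

# Hypothetical STAR-like data: treatment is randomized within each of 3 schools,
# but the treatment probability differs across schools.
rng = np.random.default_rng(2)
n = 300
school = rng.integers(1, 4, size=n)                 # S_i in {1, 2, 3}
p_small = np.array([0.2, 0.5, 0.8])[school - 1]     # assignment probability by school
X = (rng.random(n) < p_small).astype(float)         # small-class dummy

S1 = (school == 1).astype(float)                    # two dummies (plus the constant)
S2 = (school == 2).astype(float)                    # span all three school effects
beta, gamma1, gamma2 = 5.0, 10.0, -4.0
Y = 600 + beta * X + gamma1 * S1 + gamma2 * S2 + rng.normal(scale=15, size=n)

# OLS of Y on a constant, X, and the school dummies.
design = np.column_stack([np.ones(n), X, S1, S2])
coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
print("estimated treatment effect:", round(coef[1], 2))   # close to beta = 5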
The Population Multiple Regression Model
Consider the case of two regressors:
𝑌𝑖 = 𝛽0 + 𝛽1 𝑋1𝑖 + 𝛽2 𝑋2𝑖 + 𝑢𝑖 , i = 1,…,n
Interpretation of coefficients in multiple regression
𝑌𝑖 = 𝛽0 + 𝛽1 𝑋1𝑖 + 𝛽2 𝑋2𝑖 + 𝑢𝑖 , i = 1,…,n, E(𝑢𝑖 |𝑋1𝑖 , 𝑋2𝑖 ) = 0
Consider changing 𝑋1 by Δ𝑋1 while holding 𝑋2 constant:
𝛽1 = Δ𝑌/Δ𝑋1 , holding 𝑋2 constant
The OLS Estimator in Multiple Regression
Regression of TestScore against STR:
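Below is a hedged sketch of computing such an OLS fit with several regressors via the normal equations 𝛽̂ = (X′X)⁻¹X′Y; the file name "caschool.csv" and the column names testscr, str, and el_pct are assumptions about how the California data are stored, not something the slides specify.

import numpy as np
import pandas as pd

# Hedged sketch: OLS in multiple regression via the normal equations.
# The file name and column names below are assumed, not given in the slides.
df = pd.read_csv("caschool.csv")
y = df["testscr"].to_numpy()
X = np.column_stack([np.ones(len(df)),          # constant
                     df["str"].to_numpy(),      # student-teacher ratio
                     df["el_pct"].to_numpy()])  # percent English learners

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)    # (X'X)^{-1} X'y
print("intercept, STR coefficient, PctEL coefficient:", np.round(beta_hat, 3))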
What’s special about multiple regression?
• R² and R̄² (adjusted R²)
• One more least square assumption: LSA #4 (no perfect
multicollinearity)
• F-test (Wald test): testing the joint significance of more than one
regression coefficient; testing restrictions on regression coefficients.
(The F-statistic and the t-statistic are linked when H0 contains only one restriction.)
Motivation: imperfect multicollinearity…
Remark: other than the above three, the theory of multiple regression
is the same as that of simple regression with a single regressor.
Measures of Fit for Multiple Regression
Actual = predicted + residual: 𝑌𝑖 = 𝑌̂𝑖 + 𝑢̂𝑖
SER = sqrt( ∑ᵢ₌₁ⁿ 𝑢̂𝑖² / (n − k − 1) ),  RMSE = sqrt( ∑ᵢ₌₁ⁿ 𝑢̂𝑖² / n )
• R̄² = “adjusted R²” = R² with a degrees-of-freedom adjustment
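A short sketch (simulated data; the sample size and coefficients are made up) showing how SER and RMSE are computed from the OLS residuals using the formulas above.

import numpy as np

# Illustrative computation of SER and RMSE from OLS residuals (simulated data).
rng = np.random.default_rng(3)
n, k = 420, 2                                   # k regressors plus a constant
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([650.0, -2.0, -0.6]) + rng.normal(scale=19, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

SER = np.sqrt(resid @ resid / (n - k - 1))      # degrees-of-freedom adjusted
RMSE = np.sqrt(resid @ resid / n)               # no adjustment
print(f"SER = {SER:.2f}, RMSE = {RMSE:.2f}")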
R²
R² never decreases (and typically increases) when additional regressors are added,
because adding regressors makes the SSR weakly smaller:
∑ᵢ₌₁ⁿ (𝑌𝑖 − 𝑏̂0 − 𝑏̂1 𝑋1𝑖 − 𝑏̂2 𝑋2𝑖 )² ≤ ∑ᵢ₌₁ⁿ (𝑌𝑖 − 𝛽̂0 − 𝛽̂1 𝑋1𝑖 )²
By definition of OLS,
∑ᵢ₌₁ⁿ (𝑌𝑖 − 𝑏̂0 − 𝑏̂1 𝑋1𝑖 − 𝑏̂2 𝑋2𝑖 )² ≡ min over {𝑏0 , 𝑏1 , 𝑏2 } of ∑ᵢ₌₁ⁿ (𝑌𝑖 − 𝑏0 − 𝑏1 𝑋1𝑖 − 𝑏2 𝑋2𝑖 )²
≤ ∑ᵢ₌₁ⁿ (𝑌𝑖 − 𝛽̂0 − 𝛽̂1 𝑋1𝑖 − 0 ⋅ 𝑋2𝑖 )²
R² = 1 − SSR/TSS therefore becomes (weakly) larger when adding more regressors. (TSS is the
same for all regression models.)
R̄²
R̄² (“adjusted R²”) makes some adjustment by “penalizing” the
regression with more regressors – R̄² does not necessarily increase
when adding additional regressors.
Adjusted R²:  R̄² = 1 − [(n − 1)/(n − k − 1)] ⋅ SSR/TSS
• When k ≥ 1, for the same regression, R̄² < R².
• For a regression with only the intercept, R̄² = R² = 0 (in this case k = 0).
• If n is large, R̄² ≈ R².
An example of R² and R̄²
(3) 𝑇𝑒𝑠𝑡𝑆𝑐𝑜𝑟𝑒̂ = 654.2,
R² = 0, R̄² = 0, SER = 19.05, n = 420 (a regression with only the intercept)
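A simulation sketch (the data and the irrelevant regressor x2 are invented) comparing R² and R̄² when an irrelevant regressor is added: R² rises slightly by construction, while R̄² need not.

import numpy as np

# Comparing R^2 and adjusted R^2 when an irrelevant regressor is added.
rng = np.random.default_rng(4)
n = 420
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                  # irrelevant regressor
y = 650 - 2.0 * x1 + rng.normal(scale=19, size=n)

def r2_and_adjusted(y, X):
    """Return (R^2, adjusted R^2) for OLS of y on X (X includes the constant)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ssr = resid @ resid
    tss = ((y - y.mean()) ** 2).sum()
    n_obs, k = X.shape[0], X.shape[1] - 1
    return 1 - ssr / tss, 1 - (n_obs - 1) / (n_obs - k - 1) * ssr / tss

print("y on x1     :", r2_and_adjusted(y, np.column_stack([np.ones(n), x1])))
print("y on x1, x2 :", r2_and_adjusted(y, np.column_stack([np.ones(n), x1, x2])))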
The Least Squares Assumptions for Multiple Regression
Perfect multicollinearity
An example of perfect multicollinearity
• Generate a new variable
STR2=2*STR+1
• Regress TestScore on STR, STR2, and PctEL
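A hedged sketch (STR values are simulated, not the actual California data) of why this regression cannot be run as-is: with STR2 = 2·STR + 1, the regressor matrix including the constant has deficient rank, so X′X is singular and the OLS normal equations have no unique solution; regression software typically drops one of the collinear regressors, as the output later in these notes indicates.

import numpy as np

# STR2 = 2*STR + 1 is an exact linear combination of STR and the constant,
# so the design matrix [1, STR, STR2] is rank deficient. (STR is simulated.)
rng = np.random.default_rng(5)
STR = rng.uniform(14, 26, size=420)
STR2 = 2 * STR + 1

X = np.column_stack([np.ones(STR.size), STR, STR2])
print("number of columns      :", X.shape[1])
print("rank of X              :", np.linalg.matrix_rank(X))   # 2 < 3
print("condition number of X'X:", np.linalg.cond(X.T @ X))    # astronomically large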
The Sampling Distribution of the OLS Estimator
Under the four Least Squares Assumptions,
• The sampling distribution of 𝛽̂1 has mean 𝛽1 :
𝐸(𝛽̂1 ) = 𝛽1
• var(𝛽̂1 ) is inversely proportional to n.
• For large n:
o 𝛽̂1 is consistent: 𝛽̂1 →ᵖ 𝛽1 (law of large numbers)
o (𝛽̂1 − 𝛽1 )/𝑆𝐸(𝛽̂1 ) is approximately N(0,1) (CLT)
o These statements hold for all 𝛽̂𝑗 , 𝑗 = 0,1, … , 𝑘
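A Monte Carlo sketch (the data-generating process is invented) illustrating the first two properties: across repeated samples the mean of 𝛽̂1 is close to 𝛽1, and its variance shrinks roughly in proportion to 1/n.

import numpy as np

# Monte Carlo check: E(beta1_hat) is near beta1 and var(beta1_hat) shrinks like 1/n.
rng = np.random.default_rng(6)
beta0, beta1, reps = 1.0, 2.0, 2000

def beta1_hat(n):
    X = rng.normal(size=n)
    Y = beta0 + beta1 * X + rng.normal(size=n)
    return np.cov(Y, X)[0, 1] / np.var(X, ddof=1)

for n in (50, 500, 5000):
    draws = np.array([beta1_hat(n) for _ in range(reps)])
    print(f"n = {n:5d}   mean = {draws.mean():.3f}   variance = {draws.var():.5f}")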
The dummy variable trap
• The dummy variable trap is a special case of perfect multicollinearity.
• Consider a set of dummy variables, which are mutually exclusive and
exhaustive: there are multiple categories and every observation falls
in one and only one category.
Consider the four class-standing dummies for a college student:
Freshman + Sophomore + Junior + Senior = 1
• If your regression includes all these dummy variables and a constant,
you will have perfect multicollinearity – this is called the dummy
variable trap. Solutions:
1. Omit one of the groups (e.g. Seniors), or
2. Omit the intercept
An example of dummy variable trap
𝑀𝑖 = {1 if 𝑖 is a man; 0 else},   𝑊𝑖 = {1 if 𝑖 is a woman; 0 else}
• 𝑀𝑖 and 𝑊𝑖 are mutually exclusive: an individual cannot be both a man
and a woman.
• 𝑀𝑖 and 𝑊𝑖 are exhaustive: an individual must be either a man or a
woman
• In sum: 𝑀𝑖 + 𝑊𝑖 = 1 for all i.
• Consider the regression model
𝑊𝑎𝑔𝑒𝑖 = 𝛽0 + 𝛽1 𝑀𝑖 + 𝛽2 𝑊𝑖 + 𝑢𝑖
𝐸 (𝑊𝑎𝑔𝑒𝑖 |𝑖 𝑖𝑠 𝑎 𝑚𝑎𝑛) = 𝛽0 + 𝛽1
𝐸 (𝑊𝑎𝑔𝑒𝑖 |𝑖 𝑖𝑠 𝑎 𝑤𝑜𝑚𝑎𝑛) = 𝛽0 + 𝛽2
• 𝛽0 , 𝛽1 , 𝛽2 cannot be separately identified: only 𝛽0 + 𝛽1 and 𝛽0 + 𝛽2 are pinned down by the two conditional means.
Two correct models:
(1) 𝑊𝑎𝑔𝑒𝑖 = 𝛽0 + 𝛽1 𝑀𝑖 + 𝑢𝑖
(2) 𝑊𝑎𝑔𝑒𝑖 = 𝑏1 𝑀𝑖 + 𝑏2 𝑊𝑖 + 𝑢𝑖
Matching the conditional means implied by the two models gives
𝛽0 = 𝑏2 , 𝛽1 = 𝑏1 − 𝑏2 .
The OLS estimators should follow the same relations
𝛽̂0 = 𝑏̂2 , 𝛽̂1 = 𝑏̂1 − 𝑏̂2 .
The corresponding standard errors are also the same:
𝑆𝐸(𝛽̂0 ) = 𝑆𝐸(𝑏̂2 ), 𝑆𝐸(𝛽̂1 ) = 𝑆𝐸(𝑏̂1 − 𝑏̂2 ).
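A quick simulation sketch (wages are simulated; the coefficient values are made up) confirming the relations 𝛽̂0 = 𝑏̂2 and 𝛽̂1 = 𝑏̂1 − 𝑏̂2 numerically.

import numpy as np

# Two parameterizations around the dummy variable trap (simulated wages).
rng = np.random.default_rng(7)
n = 5_000
M = (rng.random(n) < 0.5).astype(float)   # man dummy
W = 1.0 - M                               # woman dummy, so M + W = 1
wage = 15.0 + 2.4 * M + rng.normal(scale=8, size=n)

# Model (1): constant + M (omit W).
m1, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), M]), wage, rcond=None)
# Model (2): M + W, no constant.
m2, *_ = np.linalg.lstsq(np.column_stack([M, W]), wage, rcond=None)

print("beta0_hat =", round(m1[0], 4), "  b2_hat =", round(m2[1], 4))
print("beta1_hat =", round(m1[1], 4), "  b1_hat - b2_hat =", round(m2[0] - m2[1], 4))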
Model 1: OLS, using observations 1-7986
Dependent variable: ahe
Heteroskedasticity-robust standard errors, variant HC1
Coefficient Std. Error t-ratio p-value
const 15.3586 0.133946 114.7 <0.0001 ***
male 2.41405 0.190958 12.64 <0.0001 ***
Imperfect multicollinearity
Implications of imperfect multicollinearity
Theoretically, imperfect multicollinearity is not a problem.
Practically, imperfect multicollinearity implies that one or more of the
regression coefficients will be imprecisely estimated.
• 𝛽1 is the effect of X1 holding X2 constant; but if X1 and X2 are highly
correlated, there is very little variation in X1 once X2 is held constant
– so the data don’t contain much information about what happens
when X1 changes but X2 doesn’t.
• If so, the standard error of the OLS estimator of the coefficient on X1
will be large.
An example of imperfect multicollinearity
• Generate a new variable
STR2=2*STR+1
• Generate another new variable
STR3=STR2+rnorm()
Here rnorm() denotes an i.i.d. N(0,1) random draw.
• Perfect multicollinearity between STR and STR2
• Imperfect multicollinearity between STR and STR3
Corr(STR,STR3)=0.966
Model: OLS, using observations 1-420
Dependent variable: testscr
Omitted due to exact collinearity: str2
Stronger correlation implies more imprecise estimates
• Generate
STR4=STR2+0.001*rnorm()
Corr(STR,STR4)=0.9999
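A simulation sketch (the DGP is invented; compare the STR3 and STR4 constructions above) showing that the closer the added regressor is to exact collinearity with STR, the more the OLS coefficient on STR varies across repeated samples, i.e. the larger its standard error.

import numpy as np

# The closer the added regressor is to exact collinearity with STR, the more
# the OLS coefficient on STR varies across repeated samples (invented DGP).
rng = np.random.default_rng(8)
n, reps = 420, 1000

def sd_of_str_coefficient(noise_sd):
    slopes = []
    for _ in range(reps):
        STR = rng.uniform(14, 26, size=n)
        STRX = 2 * STR + 1 + noise_sd * rng.normal(size=n)   # near-collinear regressor
        y = 700 - 2.0 * STR + rng.normal(scale=15, size=n)
        X = np.column_stack([np.ones(n), STR, STRX])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        slopes.append(beta[1])
    return np.std(slopes)

print("spread of STR coefficient, noise sd = 1.0  :", round(sd_of_str_coefficient(1.0), 3))
print("spread of STR coefficient, noise sd = 0.001:", round(sd_of_str_coefficient(0.001), 3))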
What to do if there is imperfect multicollinearity
• Suppose X is the variable of interest, such as STR.
• Assume another variable W in the regression function is highly
correlated with X.
Summary