Week 10
SI 2020,
SI 2021, Week 10-11, 16-23 November 2021
We label variables in $\mathbf{x}_t$ that are correlated with the error term endogenous; variables that are not are called exogenous
Endogeneity is thus said to occur in a multiple regression model if $E(\mathbf{x}_t \varepsilon_t) \neq 0$; that is, an explanatory variable included in the model is correlated with unobservables relegated to the error term
Endogeneity can arise as a result of
• measurement error
• dynamic model (lagged dependent)
• omitted variable bias
• simultaneity
Measurement Error
Endogeneity – measurement error
• Since both $x_i^*$ and $u_i$ depend on $\upsilon_i$, they are correlated
Endogeneity – dynamic model (lagged dependent + autocorrelation)
Endogeneity problems may arise in a dynamic model that includes a lagged dependent variable:
$y_t = \beta_1 + \beta_2 x_t + \beta_3 y_{t-1} + \varepsilon_t$
As long as we assume that $E(x_t \varepsilon_t) = 0$ and $E(y_{t-1} \varepsilon_t) = 0$ for all $t$, the OLS estimator for $\boldsymbol{\beta}$ is consistent
However, suppose that $\varepsilon_t$ is subject to first-order autocorrelation: $\varepsilon_t = \rho \varepsilon_{t-1} + v_t$
This gives: $y_t = \beta_1 + \beta_2 x_t + \beta_3 y_{t-1} + \rho \varepsilon_{t-1} + v_t$
We also have: $y_{t-1} = \beta_1 + \beta_2 x_{t-1} + \beta_3 y_{t-2} + \varepsilon_{t-1}$
Both $y_{t-1}$ and $\varepsilon_t$ depend on $\varepsilon_{t-1}$, so the error term $\varepsilon_t$ is correlated with $y_{t-1}$
Thus if $\rho \neq 0$, OLS is no longer consistent for the parameters
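The inconsistency can be illustrated with a small simulation. This is a minimal sketch assuming NumPy; the parameter values ($\beta_1 = 1$, $\beta_2 = 0.5$, $\beta_3 = 0.5$, $\rho = 0.5$), the sample size and the function name `simulate_dynamic_ols` are illustrative choices, not taken from the lecture.

```python
import numpy as np

def simulate_dynamic_ols(rho, T=20_000, beta=(1.0, 0.5, 0.5), seed=0):
    """Simulate y_t = b1 + b2*x_t + b3*y_{t-1} + eps_t with
    eps_t = rho*eps_{t-1} + v_t, then return the OLS estimates (b1, b2, b3)."""
    rng = np.random.default_rng(seed)
    b1, b2, b3 = beta
    x = rng.normal(size=T)
    v = rng.normal(size=T)
    eps = np.zeros(T)
    y = np.zeros(T)
    for t in range(1, T):
        eps[t] = rho * eps[t - 1] + v[t]          # AR(1) error term
        y[t] = b1 + b2 * x[t] + b3 * y[t - 1] + eps[t]
    # Regress y_t on a constant, x_t and y_{t-1}
    X = np.column_stack([np.ones(T - 1), x[1:], y[:-1]])
    b, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
    return b

b_no_ac = simulate_dynamic_ols(rho=0.0)   # no autocorrelation
b_ac = simulate_dynamic_ols(rho=0.5)      # autocorrelated errors
print("b3 with rho = 0.0:", b_no_ac[2])
print("b3 with rho = 0.5:", b_ac[2])
```

With $\rho = 0$ the estimate of $\beta_3$ should be close to the true value 0.5; with $\rho = 0.5$ it should stay well above 0.5 however large $T$ is made.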
Endogeneity – omitted variable bias
Consider a wage equation $y_i = \mathbf{x}_{1i}'\boldsymbol{\beta}_1 + x_{2i}\beta_2 + u_i\gamma + \upsilon_i$,
where $x_{2i}$ denotes years of schooling and $u_i$ is an unobserved variable reflecting “ability”. Persons with higher levels of ability tend to have higher wages but are also more likely to have more schooling
Thus: $\gamma > 0$ and $\mathrm{cov}(x_{2i}, u_i) > 0$
Since $u_i$ is unobserved, we end up estimating $y_i = \mathbf{x}_i'\boldsymbol{\beta} + \varepsilon_i$, where
$\mathbf{x}_i' = (\mathbf{x}_{1i}', x_{2i})$
$\boldsymbol{\beta}' = (\boldsymbol{\beta}_1', \beta_2)$
$\varepsilon_i = u_i\gamma + \upsilon_i$
Endogeneity – omitted variable bias
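Estimating $\boldsymbol{\beta}$ by OLS yields (see omitted variable bias discussion, Lecture 4):
$\mathbf{b} = \boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\sum_{i=1}^{N}\mathbf{x}_i u_i \gamma + (\mathbf{X}'\mathbf{X})^{-1}\sum_{i=1}^{N}\mathbf{x}_i \upsilon_i$
Because $\mathrm{cov}(x_{2i}, u_i) > 0$ and $\gamma > 0$, the second term does not vanish as $N$ grows, so $\mathbf{b}$ is inconsistent for $\boldsymbol{\beta}$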
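A minimal simulation sketch of the omitted "ability" problem, assuming NumPy. All numbers (a return to schooling of 0.08, an ability effect of 0.5, the way schooling is generated) are hypothetical and chosen only so that $\gamma > 0$ and $\mathrm{cov}(x_{2i}, u_i) > 0$; other regressors are left out to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50_000
beta2, gamma = 0.08, 0.5      # hypothetical return to schooling and ability effect

ability = rng.normal(size=N)                            # u_i, unobserved in practice
schooling = 12 + 2 * ability + 2 * rng.normal(size=N)   # x_2i, cov(x_2i, u_i) = 2 > 0
log_wage = 1.0 + beta2 * schooling + gamma * ability + rng.normal(size=N)

# OLS of wages on a constant and schooling only (ability is omitted)
X = np.column_stack([np.ones(N), schooling])
b = np.linalg.solve(X.T @ X, X.T @ log_wage)

# Large-sample bias of the slope: gamma * cov(x2, u) / var(x2) = 0.5 * 2 / 8
plim_bias = gamma * 2 / (2**2 + 2**2)
print("OLS slope:", b[1], " true beta2:", beta2, " beta2 + bias:", beta2 + plim_bias)
```

In this design the OLS slope should settle near 0.205 rather than 0.08, which is the upward bias the formula above predicts.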
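Endogeneity – simultaneity
Structural equations:
$y_t = \beta_1 + \beta_2 x_{2t} + \varepsilon_t$
$x_{2t} = y_t + z_{2t}$
Solve to get the reduced form equations:
$x_{2t} = \frac{\beta_1}{1-\beta_2} + \frac{1}{1-\beta_2} z_{2t} + \frac{\varepsilon_t}{1-\beta_2}$
$y_t = \frac{\beta_1}{1-\beta_2} + \frac{\beta_2}{1-\beta_2} z_{2t} + \frac{\varepsilon_t}{1-\beta_2}$
The reduced form equations can be estimated with OLS since $E(z_{2t} \varepsilon_t) = 0$
The structural equation cannot be estimated consistently by OLS: the reduced form shows that $x_{2t}$ depends on $\varepsilon_t$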
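The two-equation system above can be simulated directly from its reduced form. This is a sketch assuming NumPy, with illustrative values $\beta_1 = 1$, $\beta_2 = 0.5$ and unit variances for $z_{2t}$ and $\varepsilon_t$; the helper `slope` is a hypothetical name. The last line recovers $\beta_2$ as the ratio of the two reduced form slopes (often called indirect least squares), which is one way to see why the reduced form is useful.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 50_000
beta1, beta2 = 1.0, 0.5

z = rng.normal(size=T)       # exogenous z_2t
eps = rng.normal(size=T)     # structural error, independent of z

# Generate the data from the reduced form equations
x = (beta1 + z + eps) / (1 - beta2)
y = (beta1 + beta2 * z + eps) / (1 - beta2)

def slope(regressor, outcome):
    """OLS slope from regressing outcome on a constant and one regressor."""
    c = np.cov(regressor, outcome)
    return c[0, 1] / c[0, 0]

b2_ols = slope(x, y)         # OLS on the structural equation
pi_x = slope(z, x)           # reduced form slope, estimates 1 / (1 - beta2)
pi_y = slope(z, y)           # reduced form slope, estimates beta2 / (1 - beta2)
b2_ils = pi_y / pi_x         # ratio of reduced form slopes
print("OLS on structural equation:", b2_ols)
print("ratio of reduced form slopes:", b2_ils)
```

In this design OLS on the structural equation converges to 0.75, not 0.5, while the ratio of reduced form slopes is close to the true $\beta_2$.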
Instrumental variables
where
which can be estimated fairly easily (to get standard errors etc.)
Instrumental variables
Consider the general model $y_t = \mathbf{x}_t'\boldsymbol{\beta} + \varepsilon_t$, where $E(\mathbf{x}_t \varepsilon_t) \neq 0$ for some elements of $\mathbf{x}_t$
Suppose we can find a vector of instruments $\mathbf{z}_t$ with the same dimension as $\mathbf{x}_t$ such that $E(\mathbf{z}_t \varepsilon_t) = 0$;
note: we only have to find instruments for the endogenous explanatory variables; the exogenous explanatory variables can serve as their own instruments
Using matrix notation, the model can be written as $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, where
$\mathbf{y}$ is the $N \times 1$ column of observations on $y_t$,
$\mathbf{X}$ is the $N \times K$ matrix collecting the row vectors $\mathbf{x}_t'$, and
$\boldsymbol{\varepsilon}$ is the $N \times 1$ column of error terms $\varepsilon_t$
The OLS estimator for $\boldsymbol{\beta}$ is $\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$, which is inconsistent
Let $\mathbf{Z}$ be the $N \times K$ matrix of instruments; then, as in the simple case before, the IV estimator is obtained by partially replacing $\mathbf{X}$ with $\mathbf{Z}$: $\hat{\boldsymbol{\beta}}_{IV} = (\mathbf{Z}'\mathbf{X})^{-1}\mathbf{Z}'\mathbf{y}$
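The two estimators above translate almost literally into code. The sketch below assumes NumPy and a made-up data generating process with one endogenous regressor and one instrument (the constant serves as its own instrument); the function names `iv_estimator` and `ols_estimator` and the true slope of 2 are illustrative.

```python
import numpy as np

def ols_estimator(X, y):
    """OLS estimator (X'X)^{-1} X'y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def iv_estimator(Z, X, y):
    """Simple IV estimator (Z'X)^{-1} Z'y; Z and X must both be N x K."""
    return np.linalg.solve(Z.T @ X, Z.T @ y)

rng = np.random.default_rng(3)
N = 20_000
z = rng.normal(size=N)             # instrument: drives x, unrelated to the error
u = rng.normal(size=N)             # unobservable that enters both x and the error
x = z + u + rng.normal(size=N)     # endogenous regressor
eps = u + rng.normal(size=N)       # error term, correlated with x through u
y = 1.0 + 2.0 * x + eps

X = np.column_stack([np.ones(N), x])   # constant + endogenous regressor
Z = np.column_stack([np.ones(N), z])   # constant is its own instrument
b_ols = ols_estimator(X, y)
b_iv = iv_estimator(Z, X, y)
print("OLS slope:", b_ols[1], " IV slope:", b_iv[1])
```

Here the OLS slope converges to $2 + 1/3$ because $\mathrm{cov}(x, \varepsilon) = 1$ and $\mathrm{var}(x) = 3$, while the IV slope should be close to 2.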
Instrumental variables
• In the above discussion the number of instruments, $R$ say, is the same as the number $K$ of explanatory variables (with exogenous explanatory variables as their own instruments): $R = K$; exactly identified
• What if $R < K$? We do not have enough information: no solution, under-identified
• What if $R > K$? We have more information (more equations) than needed: over-identified
• Rather than ignoring relevant information (if $R > K$) we minimize a quadratic in the sample moments; this leads to the Generalized Instrumental Variables Estimator (GIVE) or Two-Stage Least Squares (2SLS), with $\mathbf{Z}$ now $N \times R$:
$\hat{\boldsymbol{\beta}}_{GIVE} = \left(\mathbf{X}'\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{X}\right)^{-1}\mathbf{X}'\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{y}$
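Why the name 2SLS? Consider a structural model and its reduced form
Let the reduced form of the $k$th explanatory variable be $\mathbf{x}_k = \mathbf{Z}\boldsymbol{\pi}_k + \boldsymbol{\upsilon}_k$
First step: the OLS estimate of $\boldsymbol{\pi}_k$ is $\mathbf{b}_k = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{x}_k$
The predicted value $\hat{\mathbf{x}}_k$ of $\mathbf{x}_k$ is: $\hat{\mathbf{x}}_k = \mathbf{Z}\mathbf{b}_k = \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{x}_k$
Second step: estimate the original structural equations by OLS, while replacing all endogenous variables on the right-hand side with their predicted values from the reduced form
Let $\hat{\mathbf{X}}$ be the matrix of predicted values with columns $\hat{\mathbf{x}}_k$; it is thus equal to $\hat{\mathbf{X}} = \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{X}$
The OLS estimator in the second step is given by $\mathbf{b} = (\hat{\mathbf{X}}'\hat{\mathbf{X}})^{-1}\hat{\mathbf{X}}'\mathbf{y}$, which is actually equal to GIVE, since $\hat{\mathbf{X}}'\hat{\mathbf{X}} = \mathbf{X}'\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{X}$ and $\hat{\mathbf{X}}'\mathbf{y} = \mathbf{X}'\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{y}$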
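The equality of the two-step procedure and GIVE can be checked numerically. This is a sketch assuming NumPy and an invented over-identified design ($K = 2$ regressors, $R = 3$ instruments including the constant); the coefficients used to generate the data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 2_000
z1, z2, u = rng.normal(size=(3, N))
x = 0.8 * z1 + 0.5 * z2 + u + rng.normal(size=N)   # endogenous regressor
y = 1.0 + 2.0 * x + u + rng.normal(size=N)         # error contains u

X = np.column_stack([np.ones(N), x])         # N x K with K = 2
Z = np.column_stack([np.ones(N), z1, z2])    # N x R with R = 3 (over-identified)

# GIVE formula: (X'Z (Z'Z)^{-1} Z'X)^{-1} X'Z (Z'Z)^{-1} Z'y
ZtZ_inv = np.linalg.inv(Z.T @ Z)
XtZ = X.T @ Z
b_give = np.linalg.solve(XtZ @ ZtZ_inv @ XtZ.T, XtZ @ ZtZ_inv @ (Z.T @ y))

# Two steps: first-stage fitted values, then OLS of y on the fitted values
X_hat = Z @ (ZtZ_inv @ (Z.T @ X))
b_2sls = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)

print("GIVE:", b_give)
print("2SLS:", b_2sls)
```

The two coefficient vectors agree up to floating-point rounding. Note that only the coefficients coincide; this sketch does not compute standard errors.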
Finding instruments
So, which factors affect schooling but not earnings directly (that is, are unrelated to the unobserved ability/intelligence that determines wages)? Parents’ education? Distance to school?
Once we have identified the instruments, we are ready to estimate the parameters using IV regression
We do this in two stages, hence the name 2SLS (two-stage least squares)
In order to run IV, we need at least one instrument for each endogenous explanatory variable
IV regression (2SLS) runs as follows:
In the first-stage regression, each endogenous explanatory variable is regressed on the instruments and the exogenous explanatory variables; the fitted values are saved
This is done for every endogenous regressor included
In the second-stage regression, the original dependent variable is regressed on the fitted values of the endogenous regressors and the exogenous variables
Wage example
Wage data taken from Card (1995), based on the National Longitudinal Survey of Young
Men
3010 men, wages in 1976
We observe individual characteristics, including experience, race, region, family
background, and so on
We choose a fairly simple specification
First step: always do (and report) OLS; provides a benchmark for what follows
In the equation, if schooling is endogenous, then experience and experience squared are by construction endogenous as well (potential experience is computed from age and years of schooling)
Therefore, at least 3 instruments are required for the 3 endogenous regressors
Wage example
Students who live near a college have on average 0.35 years more schooling
The requirement of relevance can be tested (look at the t-values in the reduced form regression)
Instrument exogeneity is not fully testable; it can be tested partially, and only if there are over-identifying restrictions
We therefore need to argue ‘plausibility’; in cases like this, economic arguments are more valid than statistical ones
Next, we estimate the regression using the instruments
Wage example
IV estimate
Estimated return to schooling is over 13%; large standard error, but significant
The estimate is higher than OLS, but this might just be due to sampling error (the estimate is fairly robust to alternative specifications)
The larger standard error is due to the low correlation between the instruments and the endogenous regressors; note the low $R^2$ in the reduced form
Important Issues
Issues
IV estimates are (much) less precise than OLS estimates; how much less depends on the correlation between the instruments and the endogenous regressors
No $R^2$ is reported for IV; our goal is to produce a consistent estimator of the causal effect, which is what IV tries to do
There is no unique definition of an $R^2$ or adjusted $R^2$ if the model is not estimated by ordinary least squares (OLS)
This reflects that the $R^2$ plays no role at all in comparing alternative estimators
It is possible to use more instruments than required (over-identification)
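The first point above, that the precision of IV depends on how strongly the instruments are correlated with the endogenous regressor, can be illustrated with a small Monte Carlo experiment. This is a sketch assuming NumPy; the design (one regressor, one instrument, first-stage coefficient `pi`) and the function name `mc_sd` are hypothetical.

```python
import numpy as np

def mc_sd(pi, reps=300, N=1000, seed=5):
    """Monte Carlo standard deviations of the OLS and IV slope estimates
    when the first-stage coefficient on the instrument is pi."""
    rng = np.random.default_rng(seed)
    ols, iv = [], []
    for _ in range(reps):
        z = rng.normal(size=N)
        u = rng.normal(size=N)                       # source of endogeneity
        x = pi * z + u + rng.normal(size=N)
        y = 1.0 + 2.0 * x + u + rng.normal(size=N)
        cxy = np.cov(x, y)
        ols.append(cxy[0, 1] / cxy[0, 0])            # OLS slope
        iv.append(np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1])   # IV slope
    return np.std(ols), np.std(iv)

sd_ols_strong, sd_iv_strong = mc_sd(pi=1.0)   # instrument strongly related to x
sd_ols_weak, sd_iv_weak = mc_sd(pi=0.2)       # instrument weakly related to x
print("strong instrument: sd(OLS) =", sd_ols_strong, " sd(IV) =", sd_iv_strong)
print("weak instrument:   sd(OLS) =", sd_ols_weak, " sd(IV) =", sd_iv_weak)
```

The IV estimates vary more across samples than the OLS estimates in both designs, and far more when the instrument is weak. OLS is of course still inconsistent here; it is only more tightly concentrated around the wrong value.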
Sargan test – overall validity of instruments
• The Sargan test can be used to test the overall validity of the instruments, provided the number of instruments exceeds the number of endogenous variables (over-identification).
• The test statistic is $N R^2$ from an auxiliary regression of the IV residuals on the full set of instruments; under the null hypothesis it is asymptotically chi-squared distributed, with degrees of freedom equal to the number of over-identifying restrictions ($R - K$).
• The joint null hypothesis is that the instruments are valid instruments, i.e., uncorrelated with the error term, and that the excluded instruments are correctly excluded from the estimated equation.
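The statistic described above can be computed in a few lines. The sketch below assumes NumPy and simulated data with $K = 2$ regressors and $R = 3$ instruments, so there is one over-identifying restriction; the function names `give` and `sargan` and the data generating process are illustrative. The second data set deliberately violates the null by letting one instrument affect $y$ directly.

```python
import numpy as np

def give(Z, X, y):
    """GIVE / 2SLS coefficients via the fitted values X_hat = Z (Z'Z)^{-1} Z'X."""
    X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
    return np.linalg.solve(X_hat.T @ X, X_hat.T @ y)

def sargan(Z, X, y):
    """Return (N * R^2, R - K), where R^2 comes from the auxiliary regression
    of the IV residuals on the full set of instruments Z."""
    resid = y - X @ give(Z, X, y)
    fitted = Z @ np.linalg.solve(Z.T @ Z, Z.T @ resid)
    r2 = 1.0 - np.sum((resid - fitted) ** 2) / np.sum((resid - resid.mean()) ** 2)
    return len(y) * r2, Z.shape[1] - X.shape[1]

rng = np.random.default_rng(6)
N = 5_000
z1, z2, u = rng.normal(size=(3, N))
x = z1 + z2 + u + rng.normal(size=N)          # endogenous regressor
y = 1.0 + 2.0 * x + u + rng.normal(size=N)    # both instruments are valid
y_bad = y + z2                                # z2 now affects y directly: invalid

X = np.column_stack([np.ones(N), x])
Z = np.column_stack([np.ones(N), z1, z2])

stat_ok, df = sargan(Z, X, y)
stat_bad, _ = sargan(Z, X, y_bad)
print("valid instruments:   N*R^2 =", stat_ok, " df =", df)
print("invalid instrument:  N*R^2 =", stat_bad, " df =", df)
```

With one degree of freedom the 5% critical value of the chi-squared distribution is 3.84. The statistic for `y_bad` should exceed it by a wide margin, while the statistic for `y` exceeds it only with about 5% probability over repeated samples.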
Next Week
Test of Endogeneity