
Week 10

The document discusses endogeneity and how it can arise through measurement error, omitted variables, simultaneity, and dynamic models with lagged dependent variables. It explains that endogeneity violates the Gauss-Markov assumption that regressors are uncorrelated with the error term, which can bias OLS estimates. Several examples are provided to illustrate how endogeneity occurs in economic models and how it impacts coefficient estimates.

Uploaded by

Jerry ma
Copyright
© All Rights Reserved

Eco 401 Econometrics

SI 2020,
SI 2021, Week 10-11, 16-23 November 2021

Dr Syed Kanwar Abbas


Office Location: BS304
Email: [email protected]
Agenda
Last week, we looked at autocorrelation, its consequences, and its solutions. We now turn to
instrumental variable (IV) / 2SLS estimation. At the end of this section, you should be able to answer:

What is the endogeneity problem?
How does endogeneity affect the OLS estimates?
How do we correct for the endogeneity problem?

This lecture is based on Chapter 5 of your textbook by Verbeek (2017).


Endogeneity
Gauss-Markov assumptions
What does BLUE mean?
• Best – minimum variance of the estimator
• Linear – within the class of linear estimators
• Unbiased – the expected value = 'truth': $E(\mathbf{b}) = \boldsymbol{\beta}$
• Estimator
When is OLS BLUE? Under the Gauss-Markov assumptions:
• (A1) mean zero; error terms have mean zero: $E(\boldsymbol{\varepsilon}) = \mathbf{0}$
• (A2) independence; error terms are independent of the explanatory variables
• (A3) homoskedasticity; error terms have the same variance: $V(\varepsilon_i) = \sigma^2$
• (A4) no autocorrelation; error terms are mutually uncorrelated: $\mathrm{cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$
Violation 3: $E(\mathbf{x}_t \varepsilon_t) \neq \mathbf{0}$; endogeneity

Consider a model $y_t = \mathbf{x}_t' \boldsymbol{\beta} + \varepsilon_t$

• So far, we assumed that the error term $\varepsilon_t$ and the explanatory variables $\mathbf{x}_t$ are
contemporaneously uncorrelated: $E(\mathbf{x}_t \varepsilon_t) = \mathbf{0}$
o This condition simply says that the (mean-zero) error term is uncorrelated with
any of the explanatory variables
• We have also concluded that the OLS estimator $\mathbf{b}$ is consistent for $\boldsymbol{\beta}$ even if the
Gauss-Markov conditions (A3) and (A4) are not fulfilled
o The assumption that $\varepsilon_t$ is independent of $\mathbf{x}_t$ may be too strong


Violation 3: $E(\mathbf{x}_t \varepsilon_t) \neq \mathbf{0}$; endogeneity

We label variables in $\mathbf{x}_t$ that are correlated with the error term
endogenous; variables that are not are called exogenous
Endogeneity is thus said to occur in a multiple regression model if
$E(\mathbf{x}_t \varepsilon_t) \neq \mathbf{0}$; endogeneity implies that an explanatory variable included
in the model is correlated with unobservables relegated to the error term
Endogeneity, $E(\mathbf{x}_t \varepsilon_t) \neq \mathbf{0}$, can arise as a result of
• measurement error
• dynamic model (lagged dependent)
• omitted variable bias
• simultaneity
Measurement Error
Endogeneity – measurement error

Data are often measured with error
o reporting errors
o coding errors
Measurement error in the dependent variable $y$ does not cause endogeneity (as long as the error is unrelated to the regressors)
Measurement error in an explanatory variable $x$ does cause endogeneity problems
Endogeneity – measurement error
• Suppose that we do not get a perfect measure of one of our explanatory
variables
• Instead of observing $x_i$ we observe $x_i^* = x_i + \upsilon_i$, where $\upsilon_i$ is the measurement
"noise"
• When we try to estimate the regression $y_i = \alpha + \beta x_i + \varepsilon_i$
• We actually end up estimating
$y_i = \alpha + \beta (x_i^* - \upsilon_i) + \varepsilon_i$
$y_i = \alpha + \beta x_i^* + (\varepsilon_i - \beta \upsilon_i)$
$y_i = \alpha + \beta x_i^* + u_i$, with $u_i = \varepsilon_i - \beta \upsilon_i$

• Since both $x_i^*$ and $u_i$ depend on $\upsilon_i$, they are correlated
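The consequence of this correlation can be illustrated with a small simulation. The sketch below is only an illustration under assumptions that are not in the lecture: "classical" measurement noise (independent of both $x_i$ and $\varepsilon_i$), normally distributed data, and hypothetical names and values (`simulate_attenuation`, `sigma_x`, `sigma_v`, $\beta = 2$). Under those assumptions the OLS slope on the mismeasured regressor converges to $\beta \sigma_x^2 / (\sigma_x^2 + \sigma_\upsilon^2)$ rather than to $\beta$.

```python
import numpy as np

def simulate_attenuation(n=200_000, beta=2.0, sigma_x=1.0, sigma_v=1.0, seed=0):
    """OLS slope of y on a noisy measure of x (all values are illustrative)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, sigma_x, n)      # true regressor (not observed)
    v = rng.normal(0.0, sigma_v, n)      # measurement noise
    eps = rng.normal(0.0, 1.0, n)        # structural error
    y = 1.0 + beta * x + eps
    x_star = x + v                       # what we actually observe
    slope = np.cov(x_star, y)[0, 1] / np.var(x_star, ddof=1)
    limit = beta * sigma_x**2 / (sigma_x**2 + sigma_v**2)
    return slope, limit

slope, limit = simulate_attenuation()
print(f"OLS slope on mismeasured x: {slope:.3f}; theoretical limit: {limit:.3f}")
```

With equal variances for the true regressor and the noise, the estimated slope settles near half of the true coefficient, which is the downward (attenuation) bias.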
Endogeneity – dynamic model (lagged dependent + autocorrelation)
Endogeneity problems may arise in a dynamic model that includes a lagged dependent
variable
$y_t = \beta_1 + \beta_2 x_t + \beta_3 y_{t-1} + \varepsilon_t$
As long as we assume that $E(x_t \varepsilon_t) = 0$ and $E(y_{t-1} \varepsilon_t) = 0$ for all $t$, the OLS estimator
for $\boldsymbol{\beta}$ is consistent
However, suppose that $\varepsilon_t$ is subject to first-order autocorrelation: $\varepsilon_t = \rho \varepsilon_{t-1} + v_t$
This gives: $y_t = \beta_1 + \beta_2 x_t + \beta_3 y_{t-1} + \rho \varepsilon_{t-1} + v_t$
We also have: $y_{t-1} = \beta_1 + \beta_2 x_{t-1} + \beta_3 y_{t-2} + \varepsilon_{t-1}$
Both $y_{t-1}$ and $\varepsilon_t$ depend on $\varepsilon_{t-1}$, so the error term $\varepsilon_t$ is correlated with $y_{t-1}$
Thus if $\rho \neq 0$, OLS is no longer consistent for the parameters
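A simulation makes the inconsistency visible. The sketch below is an illustration only: the parameter values ($\beta_1 = 1$, $\beta_2 = 0.5$, $\beta_3 = 0.5$), the normal distributions, and the function name `ols_lagged_dep` are assumptions made for the example. It estimates the dynamic model by OLS once with $\rho = 0$ and once with $\rho = 0.5$.

```python
import numpy as np

def ols_lagged_dep(rho, T=50_000, b1=1.0, b2=0.5, b3=0.5, seed=0):
    """Simulate y_t = b1 + b2*x_t + b3*y_{t-1} + eps_t with AR(1) errors, then run OLS."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=T)
    v = rng.normal(size=T)
    eps = np.zeros(T)
    y = np.zeros(T)
    for t in range(1, T):
        eps[t] = rho * eps[t - 1] + v[t]              # first-order autocorrelation
        y[t] = b1 + b2 * x[t] + b3 * y[t - 1] + eps[t]
    # Regress y_t on a constant, x_t and y_{t-1}
    X = np.column_stack([np.ones(T - 1), x[1:], y[:-1]])
    coef, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
    return coef

coef_white = ols_lagged_dep(rho=0.0)   # no autocorrelation
coef_ar = ols_lagged_dep(rho=0.5)      # autocorrelated errors
print("beta_3 estimate, rho = 0.0:", round(coef_white[2], 3))
print("beta_3 estimate, rho = 0.5:", round(coef_ar[2], 3))
```

With $\rho = 0$ the estimate of $\beta_3$ stays close to 0.5; with $\rho = 0.5$ it is pushed well above 0.5 even in this large sample, because $y_{t-1}$ and $\varepsilon_t$ share the component $\varepsilon_{t-1}$.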
Endogeneity – omitted variable bias

Some unobservable (or omitted) variable affects both $y$ and $x$

If a relevant variable that is correlated with the included regressors is omitted, OLS becomes
biased; this is a problem for causal interpretations.
Example: wage equation with unobserved ability related to schooling

Consider a wage equation $y_i = \mathbf{x}_{1i}' \boldsymbol{\beta}_1 + x_{2i} \beta_2 + u_i \gamma + \upsilon_i$

where $x_{2i}$ denotes years of schooling, and $u_i$ is an unobserved variable reflecting "ability". Persons with
higher levels of ability tend to have higher wages but are also more likely to have a higher level of
schooling
Thus: $\gamma > 0$ and $\mathrm{cov}(x_{2i}, u_i) > 0$
Since $u_i$ is unobserved, we end up estimating $y_i = \mathbf{x}_i' \boldsymbol{\beta} + \varepsilon_i$, where
$\mathbf{x}_i' = (\mathbf{x}_{1i}', x_{2i})$
$\boldsymbol{\beta}' = (\boldsymbol{\beta}_1', \beta_2)$
$\varepsilon_i = u_i \gamma + \upsilon_i$
Endogeneity – omitted variable bias
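Estimating $\boldsymbol{\beta}$ by OLS yields (see omitted variable bias discussion,
Lecture 4): $\mathbf{b} = \boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1} \sum_{i=1}^{N} \mathbf{x}_i u_i \gamma + (\mathbf{X}'\mathbf{X})^{-1} \sum_{i=1}^{N} \mathbf{x}_i \upsilon_i$

Second term (involving $u_i$): its plim is 0 only if $E(\mathbf{x}_i u_i) = \mathbf{0}$, i.e. $\mathbf{x}_i$ and $u_i$ are orthogonal
Third term (involving $\upsilon_i$): no problem; plim = 0

Assuming $E(\mathbf{x}_i \upsilon_i) = \mathbf{0}$, when $\gamma \neq 0$ consistency of the OLS estimator
requires $E(\mathbf{x}_i u_i) = \mathbf{0}$
That is, the unobserved "ability" should be uncorrelated with
schooling and the other explanatory variables in the model
Assuming $E(x_{2i} u_i) > 0$ and $\gamma > 0$, we thus expect that OLS overestimates the
returns to schooling
Therefore, OLS is biased (and inconsistent) if $u_i$ and $\mathbf{x}_i$ are correlated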
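The direction and size of this bias can be checked in a simulation. Everything numerical below is an assumption made for illustration and is not taken from the lecture or from real wage data: the true return to schooling (`beta2 = 0.07`), the ability effect (`gamma = 0.10`), and the way schooling depends on ability. With a single included regressor, the OLS slope converges to $\beta_2 + \gamma \, \mathrm{cov}(x_{2i}, u_i) / V(x_{2i})$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
beta2, gamma = 0.07, 0.10                      # hypothetical true parameters

ability = rng.normal(size=n)                   # u_i: unobserved in practice
schooling = 12 + 2.0 * ability + rng.normal(scale=2.0, size=n)
log_wage = 1.0 + beta2 * schooling + gamma * ability + rng.normal(scale=0.3, size=n)

# "Short" regression: ability is omitted and ends up in the error term
X_short = np.column_stack([np.ones(n), schooling])
b_short = np.linalg.lstsq(X_short, log_wage, rcond=None)[0]

# "Long" regression: infeasible in practice, shown only as a benchmark
X_long = np.column_stack([np.ones(n), schooling, ability])
b_long = np.linalg.lstsq(X_long, log_wage, rcond=None)[0]

implied_bias = gamma * np.cov(schooling, ability)[0, 1] / np.var(schooling, ddof=1)
print("return to schooling, ability omitted :", round(b_short[1], 4))
print("return to schooling, ability included:", round(b_long[1], 4))
print("beta2 + implied bias                 :", round(beta2 + implied_bias, 4))
```

In this design the omitted-ability regression overstates the return to schooling by roughly 0.025, in line with the bias formula.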
Endogeneity – simultaneity
Simultaneity occurs when $x_t$ not only has an impact on $y_t$, but at the same time $y_t$ has an
impact on $x_t$ (reverse causality)
This situation arises in many economic contexts, such as
o quantity & price, determined by the intersection of demand & supply
o investment & productivity
o sales & advertising
Consider a Keynesian consumption function $y_t = \beta_1 + \beta_2 x_{2t} + \varepsilon_t$
where $y_t$ is consumption, $x_{2t}$ is income, $t = 1, \dots, T$ are periods (years), and $\beta_2 \in (0, 1)$
denotes the marginal propensity to consume
A situation of reverse causality naturally arises when $y_t$ and $x_{2t}$ are determined
simultaneously
The above consumption equation has a causal interpretation describing the impact of
income upon consumption: how much more will people consume if their income
increases by one dollar?
Endogeneity – simultaneity

However, aggregate income is not exogenous; in a closed economy without a
government, income is defined by $x_{2t} = y_t + z_{2t}$, where $z_{2t}$ denotes the
investment level
It says that total income is the sum of total consumption and total investment
These two equations are structural equations since they have a ceteris paribus causal
interpretation
We assume that $z_{2t}$ and $\varepsilon_t$ are uncorrelated: $E(z_{2t} \varepsilon_t) = 0$
This means $z_{2t}$ is exogenous, whereas $y_t$ and $x_{2t}$ are endogenous: jointly and
simultaneously determined within the model
Since $y_t$ influences $x_{2t}$, income $x_{2t}$ and the error term $\varepsilon_t$ are correlated;
the OLS estimate for $\beta_2$ will thus be biased and inconsistent
Endogeneity – simultaneity

Structural equations:
$y_t = \beta_1 + \beta_2 x_{2t} + \varepsilon_t$
$x_{2t} = y_t + z_{2t}$
Solve to get the reduced form equations:
$x_{2t} = \frac{\beta_1}{1 - \beta_2} + \frac{1}{1 - \beta_2} z_{2t} + \frac{\varepsilon_t}{1 - \beta_2}$
$y_t = \frac{\beta_1}{1 - \beta_2} + \frac{\beta_2}{1 - \beta_2} z_{2t} + \frac{\varepsilon_t}{1 - \beta_2}$
The reduced form equations can be estimated with OLS since $E(z_{2t} \varepsilon_t) = 0$; the
structural equation cannot be estimated consistently by OLS
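The two-equation system can be simulated to see what OLS picks up. The sketch below is a hedged illustration: the parameter values ($\beta_1 = 10$, $\beta_2 = 0.6$), the distributions of investment and the error term, and all variable names are assumptions for the example. From the reduced form for income, $\mathrm{cov}(x_{2t}, \varepsilon_t) = \sigma^2 / (1 - \beta_2)$, so the OLS slope converges to $\beta_2 + (1 - \beta_2) \sigma^2 / (V(z_{2t}) + \sigma^2)$; the ratio of the two reduced-form slopes, by contrast, recovers $\beta_2$.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200_000
beta1, beta2 = 10.0, 0.6                       # hypothetical structural parameters
sigma_z, sigma_eps = 2.0, 1.0

z = rng.normal(loc=20.0, scale=sigma_z, size=T)    # investment (exogenous)
eps = rng.normal(scale=sigma_eps, size=T)

x = (beta1 + z + eps) / (1 - beta2)            # income, from its reduced form
y = beta1 + beta2 * x + eps                    # consumption, structural equation

def slope(dep, reg):
    """Simple-regression OLS slope of dep on reg (with intercept)."""
    return np.cov(reg, dep)[0, 1] / np.var(reg, ddof=1)

b2_ols = slope(y, x)                           # structural equation by OLS
plim_ols = beta2 + (1 - beta2) * sigma_eps**2 / (sigma_z**2 + sigma_eps**2)
b2_indirect = slope(y, z) / slope(x, z)        # ratio of reduced-form slopes

print("identity x = y + z holds:", np.allclose(x, y + z))
print("OLS on structural equation:", round(b2_ols, 3), "(limit", round(plim_ols, 3), ")")
print("ratio of reduced-form slopes:", round(b2_indirect, 3))
```

In this design OLS overstates the marginal propensity to consume (about 0.68 instead of 0.6), while the reduced-form route gets close to the true value.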
Instrumental variables

A possible solution to the endogeneity problem is using Instrumental Variable (IV)
techniques
Let us, for exposition purposes, consider the simple model
$y_t = \beta_1 + \beta_2 x_t + \varepsilon_t$, where $E(x_t \varepsilon_t) \neq 0$, so OLS is inconsistent
Now, suppose we can find a variable $z_t$ (which we will call an instrument) that
satisfies the following two conditions
o Exogeneity: $E(z_t \varepsilon_t) = 0$ (instrument uncorrelated with the error)
o Relevance: $\mathrm{cov}(x_t, z_t) \neq 0$ (instrument correlated with the endogenous regressor)
$z_t$ is called an instrumental variable / instrument
Instrumental variables
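Let us now take the covariance with $z_t$ on both sides of
$y_t = \beta_1 + \beta_2 x_t + \varepsilon_t$
to get
$\mathrm{cov}(y_t, z_t) = \beta_2 \, \mathrm{cov}(x_t, z_t) + \mathrm{cov}(z_t, \varepsilon_t)$

where $\mathrm{cov}(x_t, z_t) \neq 0$ (relevance) and $\mathrm{cov}(z_t, \varepsilon_t) = 0$ (exogeneity)

So we can write
$\beta_2 = \frac{\mathrm{cov}(y_t, z_t)}{\mathrm{cov}(x_t, z_t)}$
This (theoretically) determines $\beta_2$; how to estimate it?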
Instrumental variables
• Simply replace the population covariances by the sample
covariances: $\hat{\beta}_{2,IV} = \frac{\sum_t (z_t - \bar{z})(y_t - \bar{y})}{\sum_t (z_t - \bar{z})(x_t - \bar{x})}$
• This is called an instrumental variable (IV) estimator
• Recall that the OLS estimator in this case is (see chapter 2):
$\hat{\beta}_{2,OLS} = b_2 = \frac{\sum_t (x_t - \bar{x})(y_t - \bar{y})}{\sum_t (x_t - \bar{x})(x_t - \bar{x})}$
• The instrument $z_t$ thus replaces $x_t$ twice in the IV formula;
equivalently, IV reduces to OLS if $z_t = x_t$
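The ratio-of-sample-covariances formula is short enough to code directly. The sketch below is an illustration on simulated data; the function name `iv_slope`, the data-generating process, and the true slope of 2 are assumptions for the example. Passing the regressor as its own instrument reproduces the OLS slope, as noted above.

```python
import numpy as np

def iv_slope(y, x, z):
    """Simple IV slope: sum (z - zbar)(y - ybar) / sum (z - zbar)(x - xbar)."""
    zc = z - z.mean()
    return np.sum(zc * (y - y.mean())) / np.sum(zc * (x - x.mean()))

rng = np.random.default_rng(1)
n = 100_000
z = rng.normal(size=n)                          # instrument: unrelated to eps
eps = rng.normal(size=n)
x = 0.8 * z + 0.5 * eps + rng.normal(size=n)    # endogenous: depends on eps
y = 1.0 + 2.0 * x + eps                         # true slope is 2 (assumed)

b_ols = iv_slope(y, x, x)    # z = x gives the OLS slope
b_iv = iv_slope(y, x, z)

print("OLS slope:", round(b_ols, 3))
print("IV slope :", round(b_iv, 3))
```

In this design OLS overshoots the true slope because $x_t$ and $\varepsilon_t$ are positively correlated, while the IV slope stays close to 2.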
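More generally
Consider the model
$y_t = \mathbf{x}_t' \boldsymbol{\beta} + \varepsilon_t$
where
$E(\mathbf{x}_t \varepsilon_t) \neq \mathbf{0}$
for some elements of $\mathbf{x}_t$.

Suppose we can find a vector of instruments $\mathbf{z}_t$, having the same dimensions as $\mathbf{x}_t$,
such that
$E(\mathbf{z}_t \varepsilon_t) = \mathbf{0}$
Then the IV estimator based on these instruments is given by
$\hat{\boldsymbol{\beta}}_{IV} = \left( \sum_t \mathbf{z}_t \mathbf{x}_t' \right)^{-1} \sum_t \mathbf{z}_t y_t$
With one regressor and one instrument (plus a constant), the slope element reduces to
$\hat{\beta}_{2,IV} = \frac{\sum_t (z_t - \bar{z})(y_t - \bar{y})}{\sum_t (z_t - \bar{z})(x_t - \bar{x})}$

Its (asymptotic) covariance matrix is given by
$V(\hat{\boldsymbol{\beta}}_{IV}) = \sigma^2 \left[ \left( \sum_t \mathbf{x}_t \mathbf{z}_t' \right) \left( \sum_t \mathbf{z}_t \mathbf{z}_t' \right)^{-1} \left( \sum_t \mathbf{z}_t \mathbf{x}_t' \right) \right]^{-1}$, where $\sigma^2 = V(\varepsilon_t)$,
which can be estimated fairly easily (to get standard errors etc.)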
Instrumental variables

Consider the general model $y_t = \mathbf{x}_t' \boldsymbol{\beta} + \varepsilon_t$, where $E(\mathbf{x}_t \varepsilon_t) \neq \mathbf{0}$
for some elements of $\mathbf{x}_t$
Suppose we can find a vector of instruments $\mathbf{z}_t$ with the same dimensions as $\mathbf{x}_t$ such that $E(\mathbf{z}_t \varepsilon_t) = \mathbf{0}$;
note: we only have to find instruments for the endogenous explanatory variables; the exogenous
explanatory variables can serve as their own instruments
Using matrix notation, this can be written as $\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}$, where
$\mathbf{y}$ is the $N \times 1$ column of observations for $y_i$,
$\mathbf{X}$ is the $N \times K$ matrix collecting the vectors $\mathbf{x}_i'$, and
$\boldsymbol{\varepsilon}$ is the $N \times 1$ column of observations for $\varepsilon_i$
The OLS estimate for $\boldsymbol{\beta}$ is $\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\mathbf{y}$, which is inconsistent
Let $\mathbf{Z}$ be the $N \times K$ matrix of instruments; then, as in the simple case before, the IV estimator is given by
partially replacing $\mathbf{X}$ with $\mathbf{Z}$ as follows: $\hat{\boldsymbol{\beta}}_{IV} = (\mathbf{Z}'\mathbf{X})^{-1} \mathbf{Z}'\mathbf{y}$
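In matrix form the estimator is a single linear solve. The following is a minimal sketch on simulated data, with assumed names and values (an exogenous control `x1` that serves as its own instrument, an endogenous regressor `x2`, one outside instrument `z`, and true coefficients 1, 0.5 and 2); none of these come from the lecture.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
x1 = rng.normal(size=n)                    # exogenous regressor
z = rng.normal(size=n)                     # outside instrument for x2
eps = rng.normal(size=n)
x2 = 0.5 * x1 + 0.8 * z + 0.5 * eps + rng.normal(size=n)   # endogenous regressor
y = 1.0 + 0.5 * x1 + 2.0 * x2 + eps        # assumed true beta = (1, 0.5, 2)

X = np.column_stack([np.ones(n), x1, x2])  # N x K regressor matrix
Z = np.column_stack([np.ones(n), x1, z])   # N x K instrument matrix: x2 replaced by z

b_ols = np.linalg.solve(X.T @ X, X.T @ y)  # (X'X)^{-1} X'y
b_iv = np.linalg.solve(Z.T @ X, Z.T @ y)   # (Z'X)^{-1} Z'y

print("OLS:", np.round(b_ols, 3))
print("IV :", np.round(b_iv, 3))
```

Using `np.linalg.solve` avoids forming the inverse explicitly, which is the usual numerical practice; the constant and `x1` appear in both `X` and `Z` because exogenous regressors instrument themselves.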
Instrumental variables
• In the above discussion the number of instruments, $R$ say, is the
same as the number $K$ of explanatory variables (with exogenous
explanatory variables as their own instruments): $R = K$; identified
• What if $R < K$? We do not have enough information: no solution,
under-identified
• What if $R > K$? We have more information (more equations) than
needed: over-identified
• Rather than ignoring relevant information (if $R > K$) we minimize a
quadratic in the sample moments; this leads to the Generalized Instrumental
Variables Estimator (GIVE) or Two-Stage Least Squares (2SLS):
$\hat{\boldsymbol{\beta}}_{GIVE} = \left[ \mathbf{X}'\mathbf{Z} (\mathbf{Z}'\mathbf{Z})^{-1} \mathbf{Z}'\mathbf{X} \right]^{-1} \mathbf{X}'\mathbf{Z} (\mathbf{Z}'\mathbf{Z})^{-1} \mathbf{Z}'\mathbf{y}$
where $\mathbf{X}'\mathbf{Z} (\mathbf{Z}'\mathbf{Z})^{-1} \mathbf{Z}'$ replaces $\mathbf{X}'$ in the OLS formula

Instrumental variables

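Why the name 2SLS? Consider a structural model and its reduced form
Let the reduced form of the $k$th explanatory variable be $\mathbf{x}_k = \mathbf{Z} \boldsymbol{\pi}_k + \boldsymbol{\upsilon}_k$
First step: the OLS estimate of $\boldsymbol{\pi}_k$ is $\mathbf{b}_k = (\mathbf{Z}'\mathbf{Z})^{-1} \mathbf{Z}'\mathbf{x}_k$
The predicted value $\hat{\mathbf{x}}_k$ of $\mathbf{x}_k$ is: $\hat{\mathbf{x}}_k = \mathbf{Z} \mathbf{b}_k = \mathbf{Z} (\mathbf{Z}'\mathbf{Z})^{-1} \mathbf{Z}'\mathbf{x}_k$

Second step: estimate the original structural equation by OLS, while replacing all
endogenous variables on the right-hand side with their predicted values from the
reduced form
Let $\hat{\mathbf{X}}$ be the matrix of predicted values with columns $\hat{\mathbf{x}}_k$; it is thus equal to
$\hat{\mathbf{X}} = \mathbf{Z} (\mathbf{Z}'\mathbf{Z})^{-1} \mathbf{Z}'\mathbf{X}$
The OLS estimator in the second step is given by $\mathbf{b} = (\hat{\mathbf{X}}'\hat{\mathbf{X}})^{-1} \hat{\mathbf{X}}'\mathbf{y}$, which is actually
equal to GIVE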
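The equality between the two-step procedure and the one-line GIVE formula can be verified numerically. The sketch below uses simulated, over-identified data (two instruments `z1`, `z2` for one endogenous regressor); the function names `give` and `two_stage`, the data-generating process, and the true slope of 2 are all assumptions for illustration.

```python
import numpy as np

def give(Z, X, y):
    """[X'Z (Z'Z)^{-1} Z'X]^{-1} X'Z (Z'Z)^{-1} Z'y."""
    ZtZ_inv_ZtX = np.linalg.solve(Z.T @ Z, Z.T @ X)   # (Z'Z)^{-1} Z'X
    ZtZ_inv_Zty = np.linalg.solve(Z.T @ Z, Z.T @ y)   # (Z'Z)^{-1} Z'y
    return np.linalg.solve(X.T @ Z @ ZtZ_inv_ZtX, X.T @ Z @ ZtZ_inv_Zty)

def two_stage(Z, X, y):
    """First stage: fitted X from Z. Second stage: OLS of y on fitted X."""
    X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)     # Z (Z'Z)^{-1} Z'X
    return np.linalg.lstsq(X_hat, y, rcond=None)[0]

rng = np.random.default_rng(3)
n = 5_000
z1, z2 = rng.normal(size=n), rng.normal(size=n)
eps = rng.normal(size=n)
x = 0.8 * z1 + 0.8 * z2 + 0.5 * eps + rng.normal(size=n)   # endogenous regressor
y = 1.0 + 2.0 * x + eps

X = np.column_stack([np.ones(n), x])        # K = 2
Z = np.column_stack([np.ones(n), z1, z2])   # R = 3 > K: over-identified

b_give = give(Z, X, y)
b_2sls = two_stage(Z, X, y)
print("GIVE:", np.round(b_give, 4))
print("2SLS:", np.round(b_2sls, 4))
```

Columns of `X` that also appear in `Z` (here the constant) are reproduced exactly by the first stage, so projecting all of `X` is the same as replacing only the endogenous columns. One practical caveat: standard errors printed by a naive second-stage OLS routine are not the correct 2SLS standard errors, because they are based on residuals computed with the fitted rather than the original regressors.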
Finding instruments

To use IV as a consistent estimator for $\boldsymbol{\beta}$ we need valid instruments

Two conditions are required for an instrument $z_t$ to be valid:
o Exogeneity: $E(z_t \varepsilon_t) = 0$ (instrument uncorrelated with the error)
o Relevance: $\mathrm{cov}(x_t, z_t) \neq 0$ (instrument correlated with the endogenous regressor)

Exogeneity is based on economic assumptions or gut feeling; it can only be tested if we
have over-identification ($R > K$): Hausman test
Relevance can be tested by estimating the regression $x_t = \gamma_1 + \gamma_2 z_t + v_t$; the instrument is not
relevant if $\gamma_2 = 0$
Finding instruments is hard.
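The relevance check described above amounts to a t-test on $\gamma_2$ in the first-stage regression. The sketch below is illustrative only: the helper name `first_stage_t`, the simulated data, and the use of conventional (homoskedastic) OLS standard errors are assumptions made for the example.

```python
import numpy as np

def first_stage_t(x, z):
    """OLS of x on a constant and z; returns (gamma2_hat, t-statistic for gamma2 = 0)."""
    Zmat = np.column_stack([np.ones_like(z), z])
    coef, *_ = np.linalg.lstsq(Zmat, x, rcond=None)
    resid = x - Zmat @ coef
    n, k = Zmat.shape
    sigma2 = resid @ resid / (n - k)                 # error variance estimate
    cov = sigma2 * np.linalg.inv(Zmat.T @ Zmat)      # conventional OLS covariance
    return coef[1], coef[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(5)
n = 2_000
z_relevant = rng.normal(size=n)
z_irrelevant = rng.normal(size=n)                    # unrelated to x by construction
x = 0.5 * z_relevant + rng.normal(size=n)

g_rel, t_rel = first_stage_t(x, z_relevant)
g_irr, t_irr = first_stage_t(x, z_irrelevant)
print(f"relevant instrument:   gamma2 = {g_rel:.3f}, t = {t_rel:.1f}")
print(f"irrelevant instrument: gamma2 = {g_irr:.3f}, t = {t_irr:.1f}")
```

When the model contains other exogenous regressors, they would be added to the first-stage regression as well, and with several instruments the corresponding check is a joint F-test on their coefficients.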
Estimating the returns to schooling
Estimating the causal effect of schooling upon earnings has attracted substantial
attention in the literature
Causal: What is the effect on earnings of an exogenous increase in schooling?
OLS estimates tend to be biased, because they reflect differences in unobserved
characteristics of individuals who have attained different levels of schooling (such
as intelligence or specific skills). This is referred to as "ability bias".
Another cause of biased OLS estimates could be measurement error in schooling
(downward bias)
Suppose we want to estimate a wage equation explaining earnings from schooling
and other variables
$y_i = \mathbf{x}_{1i}' \boldsymbol{\beta}_1 + x_{2i} \beta_2 + u_i \gamma + \upsilon_i$, where $x_{2i}$ denotes years of schooling, and $u_i$ is an
unobserved variable reflecting "ability"
Estimating the returns to schooling

So, which factors affect schooling but not earnings directly? (not related to unobserved
ability/intelligence that is determining wages?) Parents’ education? Distance to school?
Once we identify the instruments, we are ready to estimate the parameters using IV regression
We do this in two stages, hence the name 2SLS (Two-Stage Least Squares).
In order to run IV, we need at least one instrument for each endogenous explanatory variable
IV regression (2SLS) runs as follows:
In the first stage regression, endogenous explanatory variables are regressed upon instruments and
exogenous explanatory variables; fitted values are saved
This is done for all the endogenous regressors included
In the second stage regression, the original dependent variable is regressed upon predicted values of
endogenous regressors and exogenous variables
Wage example

Wage data taken from Card (1995), based on the National Longitudinal Survey of Young
Men
3010 men, wages in 1976
We observe individual characteristics, including experience, race, region, family
background, and so on
We choose a fairly simple specification
First step: always do (and report) OLS; provides a benchmark for what follows
In the equation, if schooling is endogenous, then experience and experience squared
are by construction endogenous as well, because experience is computed as age minus years of schooling minus 6
Therefore, at least 3 instruments are required for the 3 endogenous regressors
Wage example

The OLS estimate of the average return to schooling is 7.4%

For 3 endogenous variables, we need 3 instruments: age and age$^2$
for experience and experience squared; college proximity is used for schooling
We estimate the reduced form equation for schooling by regressing it on age,
age$^2$, and college proximity (along with the exogenous regressors)
Wage example: reduced form or first-stage regression

Students who live near a college have on average 0.35 years more schooling
The requirement of relevance can be tested (look at the t-values above)
Instrument exogeneity is not fully testable; it can be tested only if there are over-identifying
restrictions. We need to argue 'plausibility': in cases like this, economic arguments
are more valid than statistical ones
Next, we estimate the regression using the instruments
Wage example
IV estimate

Estimated return to schooling is over 13%; large standard error, but significant
The estimate is higher than OLS, but this might just be due to sampling error
(the estimate is fairly robust to alternative specifications)
The larger standard error is due to the low correlation between the instruments and
the endogenous regressors; note the low $R^2$ in the reduced form
Important Issues
Issues

Any IV estimate requires a choice of instruments that should be motivated; always
mention this choice
The reduced form equation, explaining the endogenous regressors from the exogenous regressors
and instruments, should show a significant effect of the instruments
If this effect is weak, the weak instruments problem (or weak identification) arises
o weak instruments problem: the IV estimator can have very poor properties and
can be severely biased if the instruments exhibit only weak correlation with the
endogenous regressors
o If so, the normal distribution provides a poor approximation to the true
distribution of the IV estimator, even for large samples
o As a result, the standard IV estimator is biased, its standard errors are misleading,
and hypothesis tests are unreliable
Issues

IV estimates are (much) less accurate than OLS (how much depends upon the correlation of the
instruments with the endogenous regressors)
No $R^2$ reporting in IV; our goal is to produce a consistent estimator of the causal effect,
which is what IV tries to do
There is no unique definition of an $R^2$ or adjusted $R^2$ if the model is not estimated by
ordinary least squares (OLS)
This reflects that the $R^2$ plays no role at all in comparing alternative estimators
It is possible to use more instruments than required (overidentification)
Sargan test – overall validity of instruments

• The Sargan test can be used to test the overall validity of instruments, provided the
number of instruments exceeds the number of endogenous variables.

• The test statistic is $N R^2$ of an auxiliary regression of the IV residuals upon the full set of
instruments, which under the null hypothesis has an asymptotic chi-squared distribution with
$R - K$ degrees of freedom.

• The joint null hypothesis is that the instruments are valid instruments, i.e.,
uncorrelated with the error term, and that the excluded instruments are correctly
excluded from the estimated equation.
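The $N R^2$ recipe can be sketched in a few lines. The example below uses simulated data and hypothetical names (`tsls`, `sargan`); the data-generating process, and in particular the "invalid" case in which the instrument `z2` enters the wage-type equation directly, are assumptions chosen to show the statistic at work. With $R = 3$ instruments (including the constant) and $K = 2$ regressors there is one over-identifying restriction, so the statistic is compared with the 5% critical value of a chi-squared distribution with 1 degree of freedom, 3.84.

```python
import numpy as np

def tsls(Z, X, y):
    """2SLS / GIVE coefficients."""
    X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
    return np.linalg.solve(X_hat.T @ X, X_hat.T @ y)

def sargan(Z, X, y):
    """Return (N * R^2 of IV residuals regressed on all instruments, R - K)."""
    e = y - X @ tsls(Z, X, y)                       # residuals use the ORIGINAL X
    fitted = Z @ np.linalg.solve(Z.T @ Z, Z.T @ e)  # auxiliary regression on Z
    r2 = 1.0 - np.sum((e - fitted) ** 2) / np.sum((e - e.mean()) ** 2)
    return len(y) * r2, Z.shape[1] - X.shape[1]

rng = np.random.default_rng(4)
n = 5_000
z1, z2 = rng.normal(size=n), rng.normal(size=n)
eps = rng.normal(size=n)
x = 0.8 * z1 + 0.8 * z2 + 0.5 * eps + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z1, z2])

y_valid = 1.0 + 2.0 * x + eps                 # both instruments correctly excluded
y_invalid = 1.0 + 2.0 * x + 0.5 * z2 + eps    # z2 wrongly excluded from the equation

stat_valid, df = sargan(Z, X, y_valid)
stat_invalid, _ = sargan(Z, X, y_invalid)
print(f"valid instruments:   N*R^2 = {stat_valid:.2f} (df = {df})")
print(f"invalid instruments: N*R^2 = {stat_invalid:.2f} (df = {df})")
```

A large statistic signals that at least one instrument is invalid, but the test cannot say which one, and it has no power if all instruments are invalid in the same direction.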
Next Week
Test of Endogeneity
