This section treats five sources that cause the OLS estimator in (multiple) regression models to be biased and
inconsistent for the causal effect of interest and discusses possible remedies. All five sources imply a violation of
the first least squares assumption presented in Key Concept 6.4.
- omitted variables
- misspecification of the functional form of the regression function
- measurement errors
- sample selection
- simultaneous causality
Besides these threats to the consistency of the estimator, we also briefly discuss causes of inconsistent estimation of OLS standard errors.
Inclusion of additional variables reduces the risk of omitted variable bias but may increase the variance of
the estimator of the coefficient of interest.
We present some guidelines that help in deciding whether to include an additional variable:

1. Specify the coefficient(s) of interest.
2. Identify the most important potential sources of omitted variable bias by using knowledge available before estimating the model. You should end up with a baseline specification and a set of regressors that are questionable.
3. Use different model specifications to test whether questionable regressors have coefficients different from zero.
4. Use tables to provide full disclosure of your results, i.e., present different model specifications that both support your argument and enable the reader to see the effect of including the questionable regressors (see the sketch below).
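As a minimal sketch of steps 3 and 4 — here using the CASchools data from the AER package, with the student-teacher ratio as the regressor of interest and, purely for the sake of the example, english and lunch as the questionable regressors — one could estimate and compare different specifications:

# load data and construct the variables of interest
library(AER)
data("CASchools")
CASchools$STR <- CASchools$students / CASchools$teachers
CASchools$score <- (CASchools$read + CASchools$math) / 2

# baseline specification and a specification including the questionable regressors
base_mod <- lm(score ~ STR, data = CASchools)
ctrl_mod <- lm(score ~ STR + english + lunch, data = CASchools)

# compare the estimated coefficient of interest across specifications
coef(base_mod)["STR"]
coef(ctrl_mod)["STR"]

A table that reports both specifications side by side then lets the reader judge how sensitive the coefficient of interest is to the inclusion of the questionable regressors.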
By now you should be aware of omitted variable bias and its consequences. Key Concept 9.2 gives some guidelines on how to proceed if there are control variables that may help reduce omitted variable bias. If including additional variables to mitigate the bias is not an option because there are no adequate controls, there are different approaches to solve the problem: using panel data (treated in Chapter 10), using instrumental variables regression (treated in Chapter 12), or running a randomized controlled experiment (treated in Chapter 13).
If the population regression function is nonlinear but the estimated regression function is linear, the functional form of the regression model is misspecified. This leads to a bias of the OLS estimator.
A regression suffers from misspecification of the functional form when the functional form of the
estimated regression model differs from the functional form of the population regression function.
Functional form misspecification leads to biased and inconsistent coefficient estimators. A way to detect
functional form misspecification is to plot the estimated regression function and the data. This may also
be helpful to choose the correct functional form.
It is easy to come up with examples of misspecification of the functional form: consider the case where the
population regression function is
$$Y_i = X_i^2$$

but the model specified is

$$Y_i = \beta_0 + \beta_1 X_i + u_i.$$
Clearly, the regression function is misspecified here. We now simulate data and visualize this.
# set seed for reproducibility
set.seed(3)

# simulate a data set where the population regression function is quadratic
X <- runif(100, -5, 5)
Y <- X^2 + rnorm(100)

# estimate a linear regression model and print the coefficients
ms_mod <- lm(Y ~ X)
ms_mod
##
## Call:
## lm(formula = Y ~ X)
##
## Coefficients:
## (Intercept) X
## 8.11363 -0.04684
# plot the data
plot(X, Y,
     pch = 20,
     col = "steelblue")

# add the estimated regression line
abline(ms_mod,
       col = "darkred",
       lwd = 2)
It is evident that the regression errors are relatively small for observations close to X = −3 and X = 3, but that the errors increase for X values closer to zero and even more for values beyond −4 and 4. The consequences are drastic: the intercept is estimated to be 8.1, and for the slope parameter we obtain an estimate that is obviously very close to zero. This issue does not disappear as the number of observations is increased, because OLS is biased and inconsistent due to the misspecification of the regression function.
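That the bias does not vanish in larger samples is easy to check by re-running the simulation with more observations (a minimal sketch, assuming the same data-generating process as above):

# increase the sample size: the slope estimate remains close to zero
set.seed(3)
X <- runif(10000, -5, 5)
Y <- X^2 + rnorm(10000)
coef(lm(Y ~ X))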
When independent variables are measured imprecisely, we speak of errors-in-variables bias. This bias
does not disappear if the sample size is large. If the measurement error has mean zero and is independent
of the affected variable, the OLS estimator of the respective coefficient is biased towards zero.
Suppose you incorrectly measure the single regressor $X_i$, so that there is a measurement error and you observe $\overset{\sim}{X}_i$ instead of $X_i$. Then, instead of estimating the population regression model

$$Y_i = \beta_0 + \beta_1 X_i + u_i,$$

you end up estimating

$$\begin{align*}
Y_i &= \beta_0 + \beta_1 \overset{\sim}{X}_i + \underbrace{\beta_1 (X_i - \overset{\sim}{X}_i) + u_i}_{=\,v_i} \\
Y_i &= \beta_0 + \beta_1 \overset{\sim}{X}_i + v_i,
\end{align*}$$
where $\overset{\sim}{X}_i$ and the error term $v_i$ are correlated. Thus OLS would be biased and inconsistent for the true $\beta_1$ in this example. One can show that the direction and strength of the bias depend on the correlation between the observed regressor $\overset{\sim}{X}_i$ and the measurement error $w_i = \overset{\sim}{X}_i - X_i$.
The classical measurement error model assumes that the measurement error, $w_i$, has zero mean and that it is uncorrelated with the variable, $X_i$, and the error term of the population regression model, $u_i$:

$$\overset{\sim}{X}_i = X_i + w_i, \quad \rho_{w_i, u_i} = 0, \quad \rho_{w_i, X_i} = 0.$$
It then holds that

$$\widehat{\beta}_1 \xrightarrow{p} \frac{\sigma_X^2}{\sigma_X^2 + \sigma_w^2} \beta_1, \tag{9.1}$$

where $\sigma_X^2, \sigma_w^2 > 0$, such that the fraction in (9.1) is smaller than 1. Note that there are two extreme cases:
1. If $\sigma_w^2 = 0$, we have $\widehat{\beta}_1 \xrightarrow{p} \beta_1$: there is no measurement error and OLS is consistent.
2. If $\sigma_w^2 \gg \sigma_X^2$, we have $\widehat{\beta}_1 \xrightarrow{p} 0$. This is the case if the measurement error is so large that there essentially is no information on $X$ in the data that can be used to estimate $\beta_1$.
The most obvious way to deal with errors-in-variables bias is to use an accurately measured X. If this is not possible, instrumental variables regression is an option. One might also deal with the issue by using a mathematical model of the measurement error and adjusting the estimates appropriately: if it is plausible that the classical measurement error model applies, and if there is information that can be used to estimate the ratio in equation (9.1), one can compute an estimate that corrects for the downward bias.
For example, consider two bivariate normally distributed random variables $X, Y$. It is a well-known result that the conditional expectation function of $Y$ given $X$ has the form

$$E(Y \,\vert\, X) = E(Y) + \rho_{X,Y} \frac{\sigma_Y}{\sigma_X} \left[ X - E(X) \right]. \tag{9.2}$$
Thus, for

$$(X, Y) \sim \mathcal{N}\left[\begin{pmatrix} 50 \\ 100 \end{pmatrix}, \begin{pmatrix} 10 & 5 \\ 5 & 10 \end{pmatrix}\right], \tag{9.3}$$

it follows from (9.2) that

$$E(Y_i \,\vert\, X_i) = 100 + 0.5 (X_i - 50) = 75 + 0.5 X_i. \tag{9.4}$$
Now suppose you gather data on $X$ and $Y$, but you can only measure $\overset{\sim}{X}_i = X_i + w_i$ with $w_i \overset{i.i.d.}{\sim} \mathcal{N}(0, 10)$. Since the $w_i$ are independent of the $X_i$, there is no correlation between the $X_i$ and the $w_i$, so we have a case
of the classical measurement error model.
of the classical measurement error model. We now illustrate this example in R using the package mvtnorm (Genz,
Bretz, Miwa, Mi, & Hothorn, 2018).
# load the package 'mvtnorm' and set seed
library(mvtnorm)
set.seed(1)

# simulate bivariate normal data as in (9.3)
dat <- data.frame(
  rmvnorm(1000, c(50, 100),
          sigma = cbind(c(10, 5), c(5, 10))))

# assign column names
colnames(dat) <- c("X", "Y")
We now estimate a simple linear regression of Y on X using this sample data and run the same regression again, but this time with i.i.d. $\mathcal{N}(0, 10)$ errors added to X.
# estimate the model without measurement error
noerror_mod <- lm(Y ~ X, data = dat)

# add measurement error to X and estimate again
dat$X <- dat$X + rnorm(n = 1000, sd = sqrt(10))
error_mod <- lm(Y ~ X, data = dat)

# print the estimated coefficients
noerror_mod$coefficients

## (Intercept)           X 
##  76.3002047   0.4755264

error_mod$coefficients

## (Intercept)           X 
##   87.276004    0.255212
Next, we visualize the results and compare with the population regression function.
# plot sample data
plot(dat$X, dat$Y,
     pch = 20,
     col = "steelblue",
     xlab = "X",
     ylab = "Y")

# add the population regression function
abline(coef = c(75, 0.5),
       col = "darkgreen",
       lwd = 1.5)

# add the estimated regression functions
abline(noerror_mod,
       col = "purple",
       lwd = 1.5)

abline(error_mod,
       col = "darkred",
       lwd = 1.5)

# add legend
legend("topleft",
       bg = "transparent",
       cex = 0.8,
       lty = 1,
       col = c("darkgreen", "purple", "darkred"),
       legend = c("Population", "No measurement error", "Measurement error"))
In the situation without measurement error, the estimated regression function is close to the population regression function. Things are different when we use the mismeasured regressor $\overset{\sim}{X}$: both the estimate for the intercept and the estimate for the coefficient on $\overset{\sim}{X}$ differ considerably from the results obtained using the "clean" data on $X$. In particular, $\widehat{\beta}_1 = 0.255$, so there is a downward bias. We are in the comfortable situation of knowing $\sigma_X^2$ and $\sigma_w^2$, which allows us to correct for the bias using (9.1). We obtain the bias-corrected estimate
$$\frac{\sigma_X^2 + \sigma_w^2}{\sigma_X^2} \cdot \widehat{\beta}_1 = \frac{10 + 10}{10} \cdot 0.255 = 0.51,$$

which is quite close to $\beta_1 = 0.5$, the true coefficient from the population regression function.
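The same correction can be computed directly in R from the fitted model (a minimal sketch; both variances are known to equal 10 by construction in this simulation):

# bias-corrected slope estimate as in (9.1); yields approximately 0.51
(10 + 10) / 10 * error_mod$coefficients[2]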
Bear in mind that the above analysis uses a single sample. Thus one may argue that the results are just a
coincidence. Can you show the contrary using a simulation study?
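One possible simulation study — a sketch under the assumptions of the example above, i.e., the joint distribution (9.3) and classical measurement error with $\sigma_w^2 = 10$ — repeatedly draws samples, adds measurement error, and records the slope estimates:

# simulation study of errors-in-variables bias
library(mvtnorm)
set.seed(1)

B <- 1000
slopes <- numeric(B)

for (b in 1:B) {
  # draw a sample from the bivariate normal distribution (9.3)
  dat <- data.frame(
    rmvnorm(1000, c(50, 100),
            sigma = cbind(c(10, 5), c(5, 10))))
  colnames(dat) <- c("X", "Y")

  # add classical measurement error to X and store the slope estimate
  dat$X <- dat$X + rnorm(1000, sd = sqrt(10))
  slopes[b] <- coef(lm(Y ~ X, data = dat))[2]
}

# the average estimate is close to 0.25 = 0.5 * 10 / (10 + 10), as (9.1)
# predicts, and not to the true coefficient 0.5
mean(slopes)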
There are three cases of sample selection. Only one of them poses a threat to the internal validity of a regression study. The three cases are:

1. Data are missing at random.
2. Data are missing based on the value of a regressor.
3. Data are missing due to a selection process which is related to the dependent variable.
Let us jump back to the example of variables X and Y distributed as stated in equation (9.3) and illustrate all three
cases using R.
If data are missing at random, this is nothing but losing observations. For example, losing 50% of the sample would be the same as never having observed the (randomly chosen) half of the sample. Therefore, missing data do not introduce an estimation bias and "only" lead to less efficient estimators.
# set seed
set.seed(1)

# simulate data
dat <- data.frame(
  rmvnorm(1000, c(50, 100),
          sigma = cbind(c(10, 5), c(5, 10))))
colnames(dat) <- c("X", "Y")

# randomly mark 500 observations to be discarded
id <- sample(1:1000, size = 500)

# plot kept (steelblue) and discarded (gray) observations
plot(dat$X[-id], dat$Y[-id],
     col = "steelblue", pch = 20, cex = 0.8,
     xlab = "X", ylab = "Y")
points(dat$X[id], dat$Y[id],
       cex = 0.8, col = "gray", pch = 20)

# add population regression function and estimated regression lines
abline(coef = c(75, 0.5), col = "darkgreen", lwd = 1.5)
abline(lm(Y ~ X, data = dat), col = "black", lwd = 1.5)
abline(lm(Y ~ X, data = dat[-id, ]), col = "purple", lwd = 1.5)

# add a legend
legend("topleft",
       lty = 1, bg = "transparent", cex = 0.8,
       col = c("darkgreen", "black", "purple"),
       legend = c("Population", "Full sample", "500 obs. randomly selected"))
The gray dots represent the 500 discarded observations. When using the remaining observations, the estimation
results deviate only marginally from the results obtained using the full sample.
Selecting data based on the value of a regressor also has the effect of reducing the sample size and does not introduce estimation bias. We will now drop all observations with X ≥ 45, estimate the model again, and compare.
# set random seed
set.seed(1)

# simulate data
dat <- data.frame(
  rmvnorm(1000, c(50, 100),
          sigma = cbind(c(10, 5), c(5, 10))))
colnames(dat) <- c("X", "Y")

# mark observations to be dropped
id <- which(dat$X >= 45)

# plot kept (steelblue) and dropped (gray) observations
plot(dat$X[-id], dat$Y[-id],
     col = "steelblue", cex = 0.8, pch = 20,
     xlab = "X", ylab = "Y")
points(dat$X[id], dat$Y[id],
       col = "gray", cex = 0.8, pch = 20)

# add population regression function and estimated regression lines
abline(coef = c(75, 0.5), col = "darkgreen", lwd = 1.5)
abline(lm(Y ~ X, data = dat), col = "black", lwd = 1.5)
abline(lm(Y ~ X, data = dat[-id, ]), col = "purple", lwd = 1.5)

# add legend
legend("topleft",
       lty = 1, bg = "transparent", cex = 0.8,
       col = c("darkgreen", "black", "purple"),
       legend = c("Population", "Full sample", "Obs. with X >= 45 dropped"))
Note that although we dropped more than 90% of all observations, the estimated regression function is very close
to the line estimated based on the full sample.
In the third case we face sample selection bias. We can illustrate this by using only observations with $X_i < 55$ and $Y_i > 100$. These observations are easily identified using the function which() and logical operators:
which(dat$X < 55 & dat$Y > 100)
# set random seed
set.seed(1)

# simulate data
dat <- data.frame(
  rmvnorm(1000, c(50, 100),
          sigma = cbind(c(10, 5), c(5, 10))))
colnames(dat) <- c("X", "Y")

# mark the selected observations
id <- which(dat$X < 55 & dat$Y > 100)

# plot discarded (gray) and selected (steelblue) observations
plot(dat$X[-id], dat$Y[-id],
     col = "gray", cex = 0.8, pch = 20,
     xlab = "X", ylab = "Y")
points(dat$X[id], dat$Y[id],
       col = "steelblue", cex = 0.8, pch = 20)

# add population regression function and estimated regression lines
abline(coef = c(75, 0.5), col = "darkgreen", lwd = 1.5)
abline(lm(Y ~ X, data = dat), col = "black", lwd = 1.5)
abline(lm(Y ~ X, data = dat[id, ]), col = "purple", lwd = 1.5)

# add legend
legend("topleft",
       lty = 1, bg = "transparent", cex = 0.8,
       col = c("darkgreen", "black", "purple"),
       legend = c("Population", "Full sample", "Selected sample"))
There are methods that allow one to correct for sample selection bias. However, these methods are beyond the scope of this book and are therefore not considered here. The concept of sample selection bias is summarized in Key Concept 9.5.
Simultaneous Causality
So far we have assumed that the changes in the independent variable X are responsible for changes in the
dependent variable Y . When the reverse is also true, we say that there is simultaneous causality between
X and Y . This reverse causality leads to correlation between X and the error in the population regression
of interest such that the coefficient on X is estimated with bias.
Suppose we are interested in estimating the effect of a 20% increase in cigarette prices on cigarette consumption in the United States using a multiple regression model. This may be investigated using the data set CigarettesSW, which is part of the AER package. CigarettesSW is a panel data set on cigarette consumption for all 48 continental U.S. federal states from 1985 to 1995 and provides data on economic indicators as well as average local prices, taxes, and per capita pack consumption.
After loading the data set, we pick observations for the year 1995 and plot logarithms of the per pack price, price,
against pack consumption, packs, and estimate a simple linear regression model.
data("CigarettesSW")
cigcon_mod
##
## Call:
## lm(formula = log(packs) ~ log(price), data = c1995)
##
## Coefficients:
## (Intercept)   log(price)
##      10.850       -1.213
xlab = "ln(Price)",
ylab = "ln(Consumption)",
col = "steelblue")
abline(cigcon_mod,
col = "darkred",
lwd = 1.5)
Remember from Chapter 8 that, due to the log-log specification, in the population regression the coefficient on the logarithm of price is interpreted as the price elasticity of consumption. The estimated coefficient suggests that a 1% increase in cigarette prices reduces cigarette consumption by about 1.2%, on average. Have we estimated a demand curve? The answer is no: this is a classic example of simultaneous causality, see Key Concept 9.6. The observations are market equilibria which are determined by both changes in supply and changes in demand. Therefore the price is correlated with the error term and the OLS estimator is biased. We can neither estimate a demand curve nor a supply curve consistently using this approach.
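To get a sense of the magnitude at stake — a naive calculation that takes the biased estimate at face value — the implied effect of a 20% price increase can be computed from the log-log estimate:

# naive predicted %-change in pack consumption for a 20% price increase
(1.2^coef(cigcon_mod)[2] - 1) * 100  # roughly -20%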
We will return to this issue in Chapter 12 which treats instrumental variables regression, an approach that allows
consistent estimation when there is simultaneous causality.
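As a preview of that approach — a minimal sketch, not the Chapter 12 analysis itself — one could instrument log(price) with the real sales tax component of the price using ivreg() from the AER package:

# instrumental variables regression: sales tax as instrument for the price
library(AER)
iv_mod <- ivreg(log(packs) ~ log(price) | I((taxs - tax) / cpi), data = c1995)
coef(iv_mod)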
There are two central threats to the computation of consistent OLS standard errors:

1. Heteroskedasticity: if the variance of the population regression error depends on the regressors, the usual standard errors computed under the assumption of homoskedasticity are invalid. Heteroskedasticity-robust standard errors, which can be obtained using vcovHC(), remedy this problem.
2. Serial correlation: if the population regression error is correlated across observations, we have serial correlation. This often happens in applications where repeated observations are used, e.g., in panel data studies. Valid standard errors must then be robust to both heteroskedasticity and serial correlation, e.g., HAC standard errors as computed by vcovHAC() from the sandwich package.
Inconsistent standard errors will produce invalid hypothesis tests and wrong confidence intervals. For example,
when testing the null that some model coefficient is zero, we cannot trust the outcome anymore because the test
may fail to have a size of 5% due to the wrongly computed standard error.
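For instance, heteroskedasticity-robust inference for the cigarette consumption model from above can be obtained with coeftest() from the lmtest package and vcovHC() from the sandwich package (a minimal sketch):

# t-tests based on heteroskedasticity-robust standard errors
library(lmtest)
library(sandwich)
coeftest(cigcon_mod, vcov. = vcovHC(cigcon_mod, type = "HC1"))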
Key Concept 9.7 summarizes all threats to internal validity discussed above.
The five primary threats to internal validity of a multiple regression study are:
1. Omitted variables
2. Misspecification of the functional form
3. Errors in variables (measurement errors in the regressors)
4. Sample selection
5. Simultaneous causality
All these threats lead to failure of the first least squares assumption, so that $E(u_i \,\vert\, X_{1i}, \dots, X_{ki}) \neq 0$.
Furthermore, if one does not adjust for heteroskedasticity and/or serial correlation, incorrect standard
errors may be a threat to internal validity of the study.
References
Genz, A., Bretz, F., Miwa, T., Mi, X., & Hothorn, T. (2018). mvtnorm: Multivariate Normal and t Distributions (Version
1.0-8). Retrieved from https://CRAN.R-project.org/package=mvtnorm