Introduction to Econometrics with R

9.2 Threats to Internal Validity of Multiple Regression Analysis

This section discusses five sources that cause the OLS estimator in (multiple) regression models to be biased and inconsistent for the causal effect of interest, and presents possible remedies. All five sources imply a violation of the first least squares assumption presented in Key Concept 6.4.

This section treats:

omitted variable bias

misspecification of the functional form

measurement errors

missing data and sample selection

simultaneous causality bias

Besides these threats to the consistency of the estimator, we also briefly discuss causes of inconsistent estimation of OLS standard errors.

Omitted Variable Bias

Key Concept 9.2
Omitted Variable Bias: Should I Include More Variables in My Regression?

Inclusion of additional variables reduces the risk of omitted variable bias but may increase the variance of
the estimator of the coefficient of interest.

We present some guidelines that help to decide whether to include an additional variable:

1. Specify the coefficient(s) of interest.

2. Identify the most important potential sources of omitted variable bias by using knowledge available before estimating the model. You should end up with a baseline specification and a set of regressors that are questionable.

3. Use different model specifications to test whether questionable regressors have coefficients different from zero (a minimal illustration follows after this list).

4. Use tables to provide full disclosure of your results, i.e., present different model specifications that both support your argument and enable the reader to see the effect of including questionable regressors.
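To illustrate guideline 3, here is a minimal sketch with simulated data; the variable names (wage, education, ability) and the data-generating process are hypothetical and chosen purely for illustration.

# compare a baseline specification with one that adds a questionable control
set.seed(123)

n <- 500
ability <- rnorm(n)                       # potential source of omitted variable bias
education <- 0.5 * ability + rnorm(n)     # regressor of interest
wage <- 1 + 0.8 * education + 0.6 * ability + rnorm(n)

baseline_mod <- lm(wage ~ education)              # baseline specification
extended_mod <- lm(wage ~ education + ability)    # adds the questionable regressor

# compare the coefficient of interest across specifications
rbind(baseline = coef(baseline_mod)["education"],
      extended = coef(extended_mod)["education"])

If the coefficient on education changes noticeably once ability is included, the questionable regressor matters and both specifications should be reported, as suggested in guideline 4.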

By now you should be aware of omitted variable bias and its consequences. Key Concept 9.2 gives some guidelines on how to proceed if there are control variables that may help to reduce omitted variable bias. If including additional variables to mitigate the bias is not an option because there are no adequate controls, there are several alternative approaches:

use of panel data methods (discussed in Chapter 10)

use of instrumental variables regression (discussed in Chapter 12)

use of a randomized controlled experiment (discussed in Chapter 13)

Misspecification of the Functional Form of the Regression Function

If the population regression function is nonlinear but the estimated regression function is linear, the functional form of the regression model is misspecified. This leads to a bias of the OLS estimator.

Key Concept 9.3
Functional Form Misspecification

A regression suffers from misspecification of the functional form when the functional form of the
estimated regression model differs from the functional form of the population regression function.
Functional form misspecification leads to biased and inconsistent coefficient estimators. A way to detect functional form misspecification is to plot the estimated regression function and the data. This may also be helpful for choosing the correct functional form.

It is easy to come up with examples of misspecification of the functional form: consider the case where the
population regression function is

$$Y_i = X_i^2,$$

but the model used is

$$Y_i = \beta_0 + \beta_1 X_i + u_i.$$

Clearly, the regression function is misspecified here. We now simulate data and visualize this.
# set seed for reproducibility
set.seed(3)

# simulate data set
X <- runif(100, -5, 5)
Y <- X^2 + rnorm(100)

# estimate the regression function
ms_mod <- lm(Y ~ X)
ms_mod

## 
## Call:
## lm(formula = Y ~ X)
## 
## Coefficients:
## (Intercept)            X  
##     8.11363     -0.04684

# plot the data
plot(X, Y, 
     main = "Misspecification of Functional Form", 
     pch = 20, 
     col = "steelblue")

# plot the linear regression line
abline(ms_mod, 
       col = "darkred", 
       lwd = 2)

It is evident that the regression errors are relatively small for observations close to X = −3 and X = 3, but that the errors increase for X values closer to zero and even more for values beyond −4 and 4. The consequences are drastic: the intercept is estimated to be 8.1 and the slope estimate is very close to zero. This issue does not disappear as the number of observations is increased, because OLS is biased and inconsistent due to the misspecification of the regression function.
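Key Concept 9.3 suggests plotting the data to choose the correct functional form. Continuing the simulated example above, a quadratic specification is one candidate; this is a minimal sketch rather than a general recipe.

# estimate a quadratic specification for the simulated data above
quad_mod <- lm(Y ~ X + I(X^2))
coef(quad_mod)

Since the population regression function is $Y_i = X_i^2$, the coefficient on I(X^2) should be close to 1 and the fitted curve should track the data closely, unlike the linear fit.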

Measurement Error and Errors-in-Variables Bias

Key Concept 9.4
Errors-in-Variables Bias

When independent variables are measured imprecisely, we speak of errors-in-variables bias. This bias
does not disappear if the sample size is large. If the measurement error has mean zero and is independent
of the affected variable, the OLS estimator of the respective coefficient is biased towards zero.

Suppose you incorrectly measure the single regressor $X_i$, so that there is a measurement error and you observe $\widetilde{X}_i$ instead of $X_i$. Then, instead of estimating the population regression model

$$Y_i = \beta_0 + \beta_1 X_i + u_i,$$

you end up estimating

$$\begin{align*}
Y_i &= \beta_0 + \beta_1 \widetilde{X}_i + \underbrace{\beta_1 (X_i - \widetilde{X}_i) + u_i}_{=v_i} \\
Y_i &= \beta_0 + \beta_1 \widetilde{X}_i + v_i,
\end{align*}$$

where $\widetilde{X}_i$ and the error term $v_i$ are correlated. Thus OLS would be biased and inconsistent for the true $\beta_1$ in this example. One can show that the direction and strength of the bias depend on the correlation between the observed regressor, $\widetilde{X}_i$, and the measurement error, $w_i = \widetilde{X}_i - X_i$. This correlation in turn depends on the type of measurement error made.

The classical measurement error model assumes that the measurement error, $w_i$, has zero mean and that it is uncorrelated with the variable, $X_i$, and the error term of the population regression model, $u_i$:

$$\widetilde{X}_i = X_i + w_i, \quad \rho_{w_i, u_i} = 0, \quad \rho_{w_i, X_i} = 0.$$

Then it holds that

$$\hat{\beta}_1 \xrightarrow{p} \frac{\sigma_X^2}{\sigma_X^2 + \sigma_w^2} \beta_1, \tag{9.1}$$

which implies inconsistency since $\sigma_X^2, \sigma_w^2 > 0$, so that the fraction in (9.1) is smaller than 1. Note that there are two extreme cases:

1. If there is no measurement error, $\sigma_w^2 = 0$ such that $\hat{\beta}_1 \xrightarrow{p} \beta_1$.

2. If $\sigma_w^2 \gg \sigma_X^2$, we have $\hat{\beta}_1 \xrightarrow{p} 0$. This is the case if the measurement error is so large that there is essentially no information on $X$ in the data that can be used to estimate $\beta_1$.

The most obvious way to deal with errors-in-variables bias is to use an accurately measured X. If this is not possible, instrumental variables regression is an option. One might also deal with the issue by using a mathematical model of the measurement error and adjusting the estimates appropriately: if it is plausible that the classical measurement error model applies and if there is information that can be used to estimate the ratio in equation (9.1), one could compute an estimate that corrects for the downward bias.

For example, consider two bivariate normally distributed random variables $X, Y$. It is a well-known result that the conditional expectation function of $Y$ given $X$ has the form

$$E(Y \vert X) = E(Y) + \rho_{X,Y} \frac{\sigma_Y}{\sigma_X} \left[X - E(X)\right]. \tag{9.2}$$

Thus for

$$(X, Y) \sim \mathcal{N}\left[\begin{pmatrix} 50 \\ 100 \end{pmatrix}, \begin{pmatrix} 10 & 5 \\ 5 & 10 \end{pmatrix}\right], \tag{9.3}$$

according to (9.2), the population regression function is

$$\begin{align*}
Y_i &= 100 + 0.5 (X_i - 50) \\
    &= 75 + 0.5 X_i. \tag{9.4}
\end{align*}$$

Now suppose you gather data on $X$ and $Y$, but that you can only measure $\widetilde{X}_i = X_i + w_i$ with $w_i \overset{i.i.d.}{\sim} \mathcal{N}(0, 10)$. Since the $w_i$ are independent of the $X_i$, there is no correlation between the $X_i$ and the $w_i$, so we have a case of the classical measurement error model. We now illustrate this example in R using the package mvtnorm (Genz, Bretz, Miwa, Mi, & Hothorn, 2018).

# set seed
set.seed(1)

# load the package 'mvtnorm' and simulate bivariate normal data
library(mvtnorm)

dat <- data.frame(
  rmvnorm(1000, c(50, 100), 
          sigma = cbind(c(10, 5), c(5, 10))))

# set column names
colnames(dat) <- c("X", "Y")

We now estimate a simple linear regression of Y on X using this sample data and then run the same regression again, this time with i.i.d. $\mathcal{N}(0, 10)$ measurement errors added to X.

# estimate the model (without measurement error)
noerror_mod <- lm(Y ~ X, data = dat)

# estimate the model (with measurement error in X)
dat$X <- dat$X + rnorm(n = 1000, sd = sqrt(10))
error_mod <- lm(Y ~ X, data = dat)

# print estimated coefficients to console
noerror_mod$coefficients

## (Intercept)           X 
##  76.3002047   0.4755264

error_mod$coefficients

## (Intercept)           X 
##   87.276004    0.255212

Next, we visualize the results and compare with the population regression function.
# plot sample data
plot(dat$X, dat$Y, 
     pch = 20, 
     col = "steelblue", 
     xlab = "X", 
     ylab = "Y")

# add population regression function
abline(coef = c(75, 0.5), 
       col = "darkgreen", 
       lwd = 1.5)

# add estimated regression functions
abline(noerror_mod, 
       col = "purple", 
       lwd = 1.5)

abline(error_mod, 
       col = "darkred", 
       lwd = 1.5)

# add legend
legend("topleft",
       bg = "transparent",
       cex = 0.8,
       lty = 1,
       col = c("darkgreen", "purple", "darkred"), 
       legend = c("Population", "No Errors", "Errors"))

In the situation without measurement error, the estimated regression function is close to the population regression function. Things are different when we use the mismeasured regressor $\widetilde{X}$: both the estimate for the intercept and the estimate for the coefficient on $X$ differ considerably from the results obtained using the “clean” data on $X$. In particular, $\hat{\beta}_1 = 0.255$, so there is a downward bias. We are in the comfortable situation of knowing $\sigma_X^2$ and $\sigma_w^2$, which allows us to correct for the bias using (9.1). We obtain the bias-corrected estimate

$$\frac{\sigma_X^2 + \sigma_w^2}{\sigma_X^2} \cdot \hat{\beta}_1 = \frac{10 + 10}{10} \cdot 0.255 = 0.51,$$

which is quite close to $\beta_1 = 0.5$, the true coefficient from the population regression function.
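In R, the bias correction is a one-line computation based on the slope estimated with the mismeasured regressor:

# bias-corrected estimate using (9.1): scale the attenuated slope by
# (sigma_X^2 + sigma_w^2) / sigma_X^2 = (10 + 10) / 10
(10 + 10) / 10 * error_mod$coefficients["X"]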

Bear in mind that the above analysis uses a single sample. Thus one may argue that the results are just a
coincidence. Can you show the contrary using a simulation study?
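One possible way to do so is a small Monte Carlo study. The sketch below (the number of replications is an arbitrary choice) repeats the measurement error experiment many times and reports the average slope estimate, which should be close to the probability limit implied by (9.1):

# simulation study: repeat the measurement error experiment B times
library(mvtnorm)

set.seed(123)
B <- 1000                    # number of replications (arbitrary choice)
slopes <- numeric(B)

for (b in 1:B) {
  # draw a new sample from the bivariate normal distribution in (9.3)
  d <- data.frame(rmvnorm(1000, c(50, 100),
                          sigma = cbind(c(10, 5), c(5, 10))))
  colnames(d) <- c("X", "Y")
  
  # add classical measurement error to X and store the OLS slope estimate
  d$X <- d$X + rnorm(1000, sd = sqrt(10))
  slopes[b] <- coef(lm(Y ~ X, data = d))[2]
}

# average slope estimate; (9.1) implies a probability limit of
# 0.5 * 10 / (10 + 10) = 0.25
mean(slopes)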

Missing Data and Sample Selection

Key Concept 9.5
Sample Selection Bias
When the sampling process influences the availability of data and when there is a relation of this sampling
process to the dependent variable that goes beyond the dependence on the regressors, we say that there
is a sample selection bias. This bias is due to correlation between one or more regressors and the error
term. Sample selection implies both bias and inconsistency of the OLS estimator.

There are three cases of sample selection. Only one of them poses a threat to internal validity of a regression
study. The three cases are:

1. Data are missing at random.

2. Data are missing based on the value of a regressor.

3. Data are missing due to a selection process which is related to the dependent variable.

Let us jump back to the example of variables X and Y distributed as stated in equation (9.3) and illustrate all three
cases using R.

If data are missing at random, this is nothing but losing observations. For example, losing 50% of the sample would be the same as never having observed the (randomly chosen) discarded half of the sample. Therefore, missing data do not introduce an estimation bias and “only” lead to less efficient estimators.
# set seed
set.seed(1)

# simulate data
dat <- data.frame(
  rmvnorm(1000, c(50, 100), 
          sigma = cbind(c(10, 5), c(5, 10))))

colnames(dat) <- c("X", "Y")

# mark 500 randomly selected observations
id <- sample(1:1000, size = 500)

plot(dat$X[-id], 
     dat$Y[-id], 
     col = "steelblue", 
     pch = 20,
     cex = 0.8,
     xlab = "X",
     ylab = "Y")

points(dat$X[id], 
       dat$Y[id],
       cex = 0.8,
       col = "gray", 
       pch = 20)

# add the population regression function
abline(coef = c(75, 0.5), 
       col = "darkgreen", 
       lwd = 1.5)

# add the estimated regression function for the full sample
abline(noerror_mod)

# estimate model case 1 and add the regression line
dat <- dat[-id, ]

c1_mod <- lm(dat$Y ~ dat$X, data = dat)
abline(c1_mod, col = "purple")

# add a legend
legend("topleft",
       lty = 1,
       bg = "transparent",
       cex = 0.8,
       col = c("darkgreen", "black", "purple"), 
       legend = c("Population", "Full sample", "500 obs. randomly selected"))


The gray dots represent the 500 discarded observations. When using the remaining observations, the estimation results deviate only marginally from the results obtained using the full sample.

Selecting data based on the value of a regressor also reduces the sample size and does not introduce estimation bias. We will now drop all observations with X ≥ 45, estimate the model again, and compare.
# set random seed
set.seed(1)

# simulate data
dat <- data.frame(
  rmvnorm(1000, c(50, 100), 
          sigma = cbind(c(10, 5), c(5, 10))))

colnames(dat) <- c("X", "Y")

# mark observations with X >= 45
id <- which(dat$X >= 45)

plot(dat$X[-id], 
     dat$Y[-id], 
     col = "steelblue",
     cex = 0.8,
     pch = 20, 
     xlab = "X",
     ylab = "Y")

points(dat$X[id], 
       dat$Y[id], 
       col = "gray",
       cex = 0.8,
       pch = 20)

# add population regression function
abline(coef = c(75, 0.5), 
       col = "darkgreen", 
       lwd = 1.5)

# add estimated regression function for full sample
abline(noerror_mod)

# estimate model case 2, add regression line
dat <- dat[-id, ]

c2_mod <- lm(dat$Y ~ dat$X, data = dat)
abline(c2_mod, col = "purple")

# add legend
legend("topleft",
       lty = 1,
       bg = "transparent",
       cex = 0.8,
       col = c("darkgreen", "black", "purple"), 
       legend = c("Population", "Full sample", "Obs. with X < 45"))


Note that although we dropped more than 90% of all observations, the estimated regression function is very close
to the line estimated based on the full sample.

In the third case we face sample selection bias. We can illustrate this by using only observations with $X_i \leq 55$ and $Y_i \geq 100$. These observations are easily identified using the function which() and logical operators:

which(dat$X <= 55 & dat$Y >= 100)
# set random seed
set.seed(1)

# simulate data
dat <- data.frame(
  rmvnorm(1000, c(50, 100), 
          sigma = cbind(c(10, 5), c(5, 10))))

colnames(dat) <- c("X", "Y")

# mark observations with X <= 55 and Y >= 100
id <- which(dat$X <= 55 & dat$Y >= 100)

plot(dat$X[-id], 
     dat$Y[-id], 
     col = "gray",
     cex = 0.8,
     pch = 20, 
     xlab = "X",
     ylab = "Y")

points(dat$X[id], 
       dat$Y[id], 
       col = "steelblue",
       cex = 0.8,
       pch = 20)

# add population regression function
abline(coef = c(75, 0.5), 
       col = "darkgreen", 
       lwd = 1.5)

# add estimated regression function for full sample
abline(noerror_mod)

# estimate model case 3, add regression line
dat <- dat[id, ]

c3_mod <- lm(dat$Y ~ dat$X, data = dat)
abline(c3_mod, col = "purple")

# add legend
legend("topleft",
       lty = 1,
       bg = "transparent",
       cex = 0.8,
       col = c("darkgreen", "black", "purple"), 
       legend = c("Population", "Full sample", "X <= 55 & Y >= 100"))


We see that the selection process leads to biased estimation results.

There are methods that allow correcting for sample selection bias. However, these methods are beyond the scope of this book and are therefore not considered here. The concept of sample selection bias is summarized in Key Concept 9.5.

Simultaneous Causality

Key Concept 9.6
Simultaneous Causality Bias

So far we have assumed that the changes in the independent variable X are responsible for changes in the
dependent variable Y . When the reverse is also true, we say that there is simultaneous causality between
X and Y . This reverse causality leads to correlation between X and the error in the population regression
of interest such that the coefficient on X is estimated with bias.
Suppose we are interested in estimating the effect of a 20% increase in cigarette prices on cigarette consumption in the United States using a multiple regression model. This may be investigated using the data set CigarettesSW, which is part of the AER package. CigarettesSW is a panel data set on cigarette consumption for all 48 continental U.S. federal states from 1985 to 1995 and provides data on economic indicators as well as average local prices, taxes, and per capita pack consumption.

After loading the data set, we pick observations for the year 1995, plot the logarithm of per capita pack consumption, packs, against the logarithm of the per-pack price, price, and estimate a simple linear regression model.

# load the data set
library(AER)

data("CigarettesSW")
c1995 <- subset(CigarettesSW, year == "1995")

# estimate the model
cigcon_mod <- lm(log(packs) ~ log(price), data = c1995)
cigcon_mod

## 
## Call:
## lm(formula = log(packs) ~ log(price), data = c1995)
## 
## Coefficients:
## (Intercept)   log(price)  
##      10.850       -1.213

# plot the estimated regression line and the data
plot(log(c1995$price), log(c1995$packs),
     xlab = "ln(Price)",
     ylab = "ln(Consumption)",
     main = "Demand for Cigarettes",
     pch = 20,
     col = "steelblue")

abline(cigcon_mod, 
       col = "darkred", 
       lwd = 1.5)

Remember from Chapter 8 that, due to the log-log specification, in the population regression the coefficient on the logarithm of price is interpreted as the price elasticity of consumption. The estimated coefficient suggests that a 1% increase in cigarette prices reduces cigarette consumption by about 1.2%, on average. Have we estimated a demand curve? The answer is no: this is a classic example of simultaneous causality (see Key Concept 9.6). The observations are market equilibria, which are determined by both changes in supply and changes in demand. Therefore the price is correlated with the error term and the OLS estimator is biased. We can neither estimate a demand curve nor a supply curve consistently using this approach.

We will return to this issue in Chapter 12, which treats instrumental variables regression, an approach that allows consistent estimation when there is simultaneous causality.
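As a preview of that approach, the following sketch instruments the log price with the real sales tax per pack, constructed from the CigarettesSW columns taxs, tax and cpi; treat this construction and the resulting estimate as illustrative rather than as the definitive analysis, which is developed in Chapter 12.

# construct the real sales tax per pack (assumes columns taxs, tax and cpi)
c1995$salestax <- (c1995$taxs - c1995$tax) / c1995$cpi

# instrumental variables regression using ivreg() from the AER package
cigcon_iv <- ivreg(log(packs) ~ log(price) | salestax, data = c1995)
coef(cigcon_iv)

The idea is that the sales tax shifts supply but is plausibly unrelated to demand shocks, so it isolates price variation that is uncorrelated with the error term.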

Sources of Inconsistency of OLS Standard Errors

There are two central threats to the computation of consistent OLS standard errors:

1. Heteroskedasticity: the implications of heteroskedasticity have been discussed in Chapter 5. Heteroskedasticity-robust standard errors as computed by the function vcovHC() from the package sandwich produce valid standard errors under heteroskedasticity.

2. Serial correlation: if the population regression error is correlated across observations, we have serial correlation. This often happens in applications where repeated observations are used, e.g., in panel data studies. In this case, heteroskedasticity- and autocorrelation-consistent estimators such as those computed by vcovHAC() from the package sandwich can be used to obtain valid standard errors (a usage sketch follows the list).
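As a minimal sketch of how these corrections are applied in practice, consider the cigarette regression cigcon_mod estimated above (a pure cross section, so the HAC correction here merely illustrates the syntax). coeftest() comes from the lmtest package and the variance estimators from sandwich; both are attached when loading AER.

# robust standard errors for the cigarette regression estimated above
library(sandwich)
library(lmtest)

# heteroskedasticity-robust standard errors
coeftest(cigcon_mod, vcov. = vcovHC(cigcon_mod, type = "HC1"))

# heteroskedasticity- and autocorrelation-consistent (HAC) standard errors
coeftest(cigcon_mod, vcov. = vcovHAC(cigcon_mod))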
Inconsistent standard errors produce invalid hypothesis tests and wrong confidence intervals. For example, when testing the null hypothesis that some model coefficient is zero, we can no longer trust the outcome because the test may fail to have a size of 5% due to the wrongly computed standard error.

Key Concept 9.7 summarizes all threats to internal validity discussed above.

Key Concept 9.7
Threats to Internal Validity of a Regression Study

The five primary threats to internal validity of a multiple regression study are:

1. Omitted variables

2. Misspecification of functional form

3. Errors in variables (measurement errors in the regressors)

4. Sample selection

5. Simultaneous causality

All these threats lead to failure of the first least squares assumption

$$E(u_i \vert X_{1i}, \dots, X_{ki}) \neq 0$$

so that the OLS estimator is biased and inconsistent.

Furthermore, if one does not adjust for heteroskedasticity and/or serial correlation, incorrect standard
errors may be a threat to internal validity of the study.

References

Genz, A., Bretz, F., Miwa, T., Mi, X., & Hothorn, T. (2018). mvtnorm: Multivariate Normal and t Distributions (Version
1.0-8). Retrieved from https://CRAN.R-project.org/package=mvtnorm
