
A GMM approach for dealing with missing data on regressors and instruments∗

Jason Abrevaya (Department of Economics, University of Texas)
Stephen G. Donald (Department of Economics, University of Texas)

This version: April 2013

Abstract

Missing data is one of the most common challenges facing empirical researchers. This
paper presents a general GMM framework for dealing with missing data on explanatory
variables or instrumental variables. For a linear-regression model with missing covariate
data, an efficient GMM estimator under minimal assumptions on missingness is proposed.
The estimator, which also allows for a specification test of the missingness assumptions, is
compared to linear imputation methods and a dummy-variable approach commonly used in
empirical research. For an instrumental-variables model with potential missingness of the
instrument, the GMM framework suggests a rich set of instruments that can be used to
improve efficiency. Simulations and empirical examples are provided to compare the GMM
approach with existing approaches.

JEL Classification: C13, C30


Keywords: Missing observations, imputation, projections, GMM, instrumental variables.


We are grateful to Shu Shen for excellent research assistance and to Garry Barrett, Robert Moffitt, and seminar participants at several institutions for their helpful comments. We especially would like to thank Yu-Chin Hsu, who pointed out a mistake in an earlier draft.
1 Introduction

A common feature of many data sets used in empirical research is that of missing information for

certain variables. For example, an explanatory variable may be unavailable for large portions

of the observational units. If the variable with missing observations is considered an important

part of the model, simply omitting the variable from the analysis brings with it the possibility

of substantial omitted variables bias. Alternatively, if the variable in question is considered

important and to be “missing at random,” then a simple way to deal with the problem is

to omit the observations and estimate a model using observations with complete data. This

method could, however, result in a much smaller sample size. A recent paper by Dardanoni

et al. (2011) has shown in a very general setting that there will generally be no gains relative

to using the complete data unless certain types of restrictions are imposed. One approach

aimed at improving precision of estimates is the linear imputation method that replaces a

missing regressor with its predicted value based on a set of covariates (see Dagenais (1973)

and Gourieroux and Monfort (1981)). Another popular approach is what we term a “dummy

variable method” that sets the missing regressor value to zero and uses dummies or indicators

for whether the regressor was missing for the observation.1

To give a sense of the prevalence of missing data in empirical research as well as the

popularity of these two methods of dealing with missing data, Table 1 provides some summary

statistics for four top empirical economics journals (American Economic Review (AER), Journal

of Human Resources (JHR), Journal of Labor Economics (JLE), and Quarterly Journal of

Economics (QJE)) over a three-year period (2006-2008).2 Over half of the empirical papers in

JLE and QJE have a missing-data issue, and nearly 40% of all papers across the four journals

have data missingness. Of the papers with missing data, a large majority (roughly 70%)

report that they have dropped observations due to missing values. Both the “dummy-variable
1 Greene (2008) refers to this method, in the context of a simple regression, as the “modified zero order regression.”
2 To identify data missingness, we searched for the word “missing” within the full text of an article and, if found, read through the data description to check if the author(s) mentioned having observations with missing values. The method(s) used to deal with missingness were inferred from the data description and/or empirical results section.

Table 1: Data missingness in economics journals, 2006-2008

Journal                          Empirical   Papers with         Method of handling missing data^a
                                 papers      missing data        (% of missing-data papers in parentheses)
                                             (% of empirical     Drop           Use indicator   Use an
                                             papers)             observations   variables for   imputation
                                                                                missingness     method^b
American Economic Review^c       191         55 (28.8%)          40 (72.7%)     9 (16.4%)       14 (25.5%)
Journal of Human Resources        94         40 (42.6%)          26 (65.0%)     10 (25.0%)      6 (15.0%)
Journal of Labor Economics        52         26 (50.0%)          18 (69.2%)     4 (15.4%)       5 (19.2%)
Quarterly Journal of Economics    79         41 (51.9%)          29 (70.7%)     8 (19.5%)       10 (24.4%)
Total                            416         162 (38.9%)         113 (69.8%)    31 (19.1%)      35 (21.6%)

^a A given paper may use more than one method, so the percentages add up to more than 100%.
^b This column includes any type of imputation methods (regression-based, using past/future values, etc.).
^c Includes Papers & Proceedings issues.

method” and the “imputation method” are quite common approaches to handling missing data,

with each being used in roughly 20% of the missing-data papers. Except in the case of simple

regression, the dummy-variable method is known to generally lead to biased and inconsistent

estimation (Jones (1996)), yet Table 1 clearly indicates the method’s prominence despite the

inconsistency associated with it.

In this paper, we argue that the types of orthogonality restrictions used in linear impu-

tation methods are more plausible than those used in the dummy variable method and, based

on these restrictions, we develop a Generalized Method of Moments (GMM) procedure. Using

standard results for GMM estimation, we show that there are situations where the GMM es-

timator yields variance reductions relative to the complete-data OLS estimator for some, and

sometimes all, of the coefficients of interest. Also, as a byproduct of the GMM procedure, a

fully robust specification test arises from a standard test of overidentifying restrictions. We also

compare the GMM approach to the linear imputation methods proposed in Dagenais (1973)

and Gourieroux and Monfort (1981) and show that the GMM estimator is generally at least

as efficient as these earlier alternatives. In certain more restrictive situations, which essentially

amount to homoskedasticity-type assumptions, the estimator proposed in Dagenais (1973) is

asymptotically equivalent to GMM. However, the simpler (unweighted) linear imputation es-

timator discussed in Gourieroux and Monfort (1981) has little to recommend it aside from its

computational simplicity. Unlike the GMM estimator, the unweighted imputation estimator

is not necessarily an improvement relative to the complete data estimator. Moreover, the ap-

parent computational simplicity of computing this estimator is possibly offset by the fact that

appropriate standard errors are not obtained from the usual OLS standard error formulae. We

also examine the assumptions implicit in the dummy variable method and note that, as pre-

viously shown by Jones (1996), it is potentially inconsistent even under the assumption that

the regressor is “missing at random.” Moreover, even when the assumptions for consistency are

met, the dummy variable method may actually be less efficient than the complete data method.

Our results provide insight into conditions that are needed for efficiency gains to be possible.

The paper is structured as follows. Section 2 introduces the model, notation, and as-

sumptions. These involve the regression relationship of interest as well as the general linear

projection relationship between the regressors. We then develop a set of moment conditions

for the observed data and show that an optimally weighted GMM estimator that uses these

conditions will bring efficiency gains in general. Section 3 compares the GMM estimator to

estimators previously considered in the literature. Section 4 extends the GMM approach to

situations where there are missing data in an instrumental variables model (either for the in-

strumental variable or the endogenous variable). The full set of instruments implied by the

assumptions on missingness offer the possibility of efficiency gains. Section 5 considers a sim-

ulation study in which the GMM approach is compared to other methods in finite samples.

Section 6 reports results from two empirical examples, the first a standard regression (with

missing covariate) example and the second an instrumental-variables (with missing instrument)

example. Section 7 concludes. Detailed proofs of the paper’s results are provided in a Technical

Appendix (supplemental materials).

2 Model Assumptions and Moment Conditions

Consider the following standard linear regression model

Yi = Xi α0 + Zi0 β0 + εi = Wi0 θ0 + εi i = 1, ..., n (2.1)

where Xi is a (possibly missing) scalar regressor, Zi is a K-vector of (never missing) regressors,

and Wi ≡ (Xi , Zi0 )0 . The first element of Zi is 1; that is, the model is assumed to contain an

intercept. We assume the residual only satisfies the conditions for (2.1) to be a linear projection,

specifically

E(Xi εi ) = 0 and E(Zi εi ) = 0. (2.2)

The variable mi indicates whether or not Xi is missing for observational unit i:



m_i = \begin{cases} 1 & \text{if } X_i \text{ is missing} \\ 0 & \text{if } X_i \text{ is observed} \end{cases}

We assume the existence of a linear projection of Xi onto Zi ,

Xi = Zi0 γ0 + ξi where E(Zi ξi ) = 0. (2.3)

Provided that Xi and the elements of Zi have finite variances and that the variance-covariance

matrix of (Xi , Zi0 ) is nonsingular, the projection in (2.3) is unique and completely general in

the sense that it does not place any restrictions on the joint distribution of (Xi , Zi0 ). Also, no

homoskedasticity assumptions are imposed on ξi or εi , though the nature of the results under

homoskedasticity is discussed below.

Observations with missing Xi are problematic since (2.1) cannot be used directly to con-

struct moment conditions for estimating θ0 ≡ (α0 , β00 )0 — all that we see for such observations

is the combination (Yi , Zi0 ). Note, however, that (2.1) and (2.3) imply

Y_i = Z_i'(\gamma_0 \alpha_0 + \beta_0) + \varepsilon_i + \xi_i \alpha_0 \stackrel{def}{=} Z_i'(\gamma_0 \alpha_0 + \beta_0) + \eta_i.    (2.4)

For this relationship to be useful in estimation, we require an assumption on the missingness

variable mi . This is our version of the “missing at random” (MAR) assumption on mi :

Assumption 1 (i) E(mi Zi εi ) = 0; (ii) E(mi Zi ξi ) = 0; (iii) E(mi Xi εi ) = 0.

Several remarks are in order. First, the complete data estimator (defined explicitly below)

also requires conditions (i) and (iii) (but not (ii)) of Assumption 1 in order to be consistent.

Second, the conditions of Assumption 1 are weaker than assuming that mi is independent of

the unobserved variables and will be satisfied when Zi εi , Zi ξi , and Xi εi are mean independent

of mi . Of course, assuming that mi is statistically independent of (Xi , Zi , εi , ξi ) will imply the

conditions in Assumption 1; such an assumption is generally known as “missing completely at

random” (MCAR). Assumption 1 allows for mi to depend on the explanatory variables and

other unobserved factors under certain conditions; for example, suppose that

mi = 1(h(Zi , vi ) > 0)

for some arbitrary function h so that missingness of Xi depends on the other explanatory

variables as well as an unobserved factor vi . For this missingness mechanism, Assumption 1

will be satisfied when vi is independent of εi and ξi conditional on Wi , along with E(εi |Wi ) = 0

and E(ξi |Zi ) = 0.3

We define a vector of moment functions based upon (2.2), (2.3), and (2.4),

g_i(\alpha, \beta, \gamma) = \begin{pmatrix} (1 - m_i) W_i (Y_i - X_i \alpha - Z_i' \beta) \\ (1 - m_i) Z_i (X_i - Z_i' \gamma) \\ m_i Z_i (Y_i - Z_i'(\gamma \alpha + \beta)) \end{pmatrix} = \begin{pmatrix} g_{1i}(\alpha, \beta, \gamma) \\ g_{2i}(\alpha, \beta, \gamma) \\ g_{3i}(\alpha, \beta, \gamma) \end{pmatrix},    (2.5)

for which the following result holds:

Lemma 1 Under Assumption 1, E(gi (α0 , β0 , γ0 )) = 0.

Lemma 1 implies that the model and Assumption 1 generate a vector of 3K + 1 moment

conditions satisfied by the population parameter values (α0 , β0 , γ0 ). Since there are 2K + 1

parameters, there are K overidentifying restrictions — it is the availability of these overidenti-

fying restrictions that provides a way of more efficiently estimating the parameters of interest.
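For concreteness, the stacked moment vector in (2.5) is straightforward to code. The following minimal sketch (in Python with NumPy; the function and argument names are our own, and missing entries of Xi are assumed to be stored as zeros so that the (1 − mi) weights annihilate them) evaluates the sample mean ḡ(α, β, γ):

import numpy as np

def gbar(alpha, beta, gamma, y, X, Z, m):
    # Sample mean of the stacked moment vector g_i in (2.5).
    # y: (n,) outcome; X: (n,) covariate, zero-filled where missing;
    # Z: (n, K) always-observed regressors (first column all ones);
    # m: (n,) missingness indicator (1 if X_i is missing).
    obs = 1.0 - m
    W = np.column_stack([X, Z])             # W_i = (X_i, Z_i')'
    e1 = y - X * alpha - Z @ beta           # epsilon-type residual
    e2 = X - Z @ gamma                      # projection residual
    e3 = y - Z @ (gamma * alpha + beta)     # eta-type residual
    g1 = (obs * e1)[:, None] * W            # (1 - m_i) W_i e1_i
    g2 = (obs * e2)[:, None] * Z            # (1 - m_i) Z_i e2_i
    g3 = (m * e3)[:, None] * Z              # m_i Z_i e3_i
    return np.concatenate([g1.mean(axis=0), g2.mean(axis=0), g3.mean(axis=0)])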

Indeed, as the following result shows, the use of a subset of the moment conditions that consists
3 See Griliches (1986) for additional discussion on the relationship between mi and the model.

of g1i and either g2i or g3i (but not both) results in an estimator for θ0 that is identical to the

“complete data estimator” using only g1i , given by


\hat\theta_C = \left( \sum_{i=1}^n (1 - m_i) W_i W_i' \right)^{-1} \sum_{i=1}^n (1 - m_i) W_i Y_i.
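The complete data estimator itself is a one-line computation; a minimal sketch under the same conventions as above:

import numpy as np

def theta_complete(y, X, Z, m):
    # Complete-data OLS of y on W = (X, Z) over the rows with m_i = 0.
    keep = (m == 0)
    W = np.column_stack([X, Z])[keep]
    return np.linalg.solve(W.T @ W, W.T @ y[keep])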

Lemma 2 GMM estimators of θ0 = (α0 , β00 )0 based on the moments (g1i (α, β, γ)0 , g2i (α, β, γ)0 )0

or the moments (g1i (α, β, γ)0 , g3i (α, β, γ)0 )0 are identical to the complete data estimator θ̂C .

This result and the general proposition that adding valid moment conditions cannot reduce

asymptotic variance give rise to the possibility of efficiency gains from using the complete set

of moment conditions.

Note that the result in Dardanoni et al. (2011) concerning equivalence between general

imputation methods and the complete data method is based on imputation coefficients possibly

differing between missing and non-missing observations, thus violating (ii) of Assumption 1. In

their setup, one would allow for different γ vectors in the g2i and g3i moments, leading to GMM

also being equivalent to the complete data estimator. Relative to Dardanoni et al. (2011),

the potential efficiency gains from GMM can be seen as coming from the restriction that γ is

the same for missing and non-missing data (as implied by Assumption 1). As shown below,

a byproduct of the GMM framework is a straightforward and robust overidentification test of

this restriction.

We now consider the asymptotic variance of the standard optimally weighted GMM proce-

dure. The optimal weight matrix for such a procedure is the inverse of the variance-covariance

matrix of the moment function evaluated at the true values of the parameters,
 
\Omega = E(g_i(\alpha_0, \beta_0, \gamma_0) g_i(\alpha_0, \beta_0, \gamma_0)') = \begin{pmatrix} \Omega_{11} & \Omega_{12} & 0 \\ \Omega_{12}' & \Omega_{22} & 0 \\ 0 & 0 & \Omega_{33} \end{pmatrix},    (2.6)

where

\Omega_{11} = E((1 - m_i) W_i W_i' \varepsilon_i^2), \qquad \Omega_{22} = E((1 - m_i) Z_i Z_i' \xi_i^2),

\Omega_{12} = E((1 - m_i) W_i Z_i' \varepsilon_i \xi_i), \qquad \Omega_{33} = E(m_i Z_i Z_i' \eta_i^2).

The zero components in (2.6) follow from mi (1 − mi ) = 0.

To implement the optimally weighted GMM procedure, we take sample analogs of the

three blocks and estimate the residuals using a preliminary consistent procedure:

\hat\Omega_{11} = \frac{1}{n} \sum_i (1 - m_i) W_i W_i' \hat\varepsilon_i^2, \qquad \hat\Omega_{22} = \frac{1}{n} \sum_i (1 - m_i) Z_i Z_i' \hat\xi_i^2,

\hat\Omega_{12} = \frac{1}{n} \sum_i (1 - m_i) W_i Z_i' \hat\varepsilon_i \hat\xi_i, \qquad \hat\Omega_{33} = \frac{1}{n} \sum_i m_i Z_i Z_i' \hat\eta_i^2.

In the simulations in Section 5, for instance, ε̂i and ξˆi are estimated from the complete data

regressions of Yi on Wi and Xi on Zi , respectively, and η̂i is estimated from a regression of Yi

on Zi using observations with missing Xi .
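In code, the preliminary residuals and the sample analogs of the blocks can be assembled as follows (a sketch in the conventions above; the helper name and the zero-filling of missing Xi are our choices, not the paper's):

import numpy as np

def omega_hat(y, X, Z, m):
    # Sample analogs of the blocks of Omega in (2.6). Preliminary residuals:
    # eps_hat, xi_hat from complete-data regressions of y on W and X on Z;
    # eta_hat from a regression of y on Z over the missing-X observations.
    n, K = Z.shape
    keep = (m == 0)
    W = np.column_stack([X, Z])
    theta_c = np.linalg.solve(W[keep].T @ W[keep], W[keep].T @ y[keep])
    gamma_c = np.linalg.solve(Z[keep].T @ Z[keep], Z[keep].T @ X[keep])
    pi_m = np.linalg.solve(Z[~keep].T @ Z[~keep], Z[~keep].T @ y[~keep])
    eps = y - W @ theta_c                  # epsilon_hat (used where m_i = 0)
    xi = X - Z @ gamma_c                   # xi_hat (used where m_i = 0)
    eta = y - Z @ pi_m                     # eta_hat (used where m_i = 1)
    obs, mis = keep.astype(float), (~keep).astype(float)
    O11 = (W * (obs * eps ** 2)[:, None]).T @ W / n
    O22 = (Z * (obs * xi ** 2)[:, None]).T @ Z / n
    O12 = (W * (obs * eps * xi)[:, None]).T @ Z / n
    O33 = (Z * (mis * eta ** 2)[:, None]).T @ Z / n
    top = np.block([[O11, O12], [O12.T, O22]])
    zero = np.zeros((2 * K + 1, K))
    return np.block([[top, zero], [zero.T, O33]])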

The optimal two-step GMM estimators, denoted by (α̂, β̂, γ̂), solve

\min_{\alpha, \beta, \gamma} \bar g(\alpha, \beta, \gamma)' \hat\Omega^{-1} \bar g(\alpha, \beta, \gamma),    (2.7)

where \bar g(\alpha, \beta, \gamma) = n^{-1} \sum_{i=1}^n g_i(\alpha, \beta, \gamma) and Ω̂ is the estimator of Ω obtained by plugging Ω̂11, Ω̂22, Ω̂12, and Ω̂33 into (2.6). Although this GMM estimator is nonlinear in its parameters

and therefore requires numerical methods, our simulations show that this type of problem is

well-behaved and can be easily optimized in Stata (Version 11 or later) and other econometrics

packages.
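A minimal two-step implementation, reusing the gbar and omega_hat sketches above together with a general-purpose numerical minimizer, might look as follows; this is an illustrative sketch, not the authors' code. The overidentification statistic in (2.9) below comes out as a byproduct.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

def gmm_two_step(y, X, Z, m):
    # Two-step optimal GMM for (2.7), reusing gbar() and omega_hat() above.
    # Starting values: complete-data OLS for theta and the projection for gamma.
    n, K = Z.shape
    keep = (m == 0)
    W = np.column_stack([X, Z])
    theta0 = np.linalg.solve(W[keep].T @ W[keep], W[keep].T @ y[keep])
    gamma0 = np.linalg.solve(Z[keep].T @ Z[keep], Z[keep].T @ X[keep])
    Om_inv = np.linalg.inv(omega_hat(y, X, Z, m))

    def objective(p):
        # p = (alpha, beta_1..beta_K, gamma_1..gamma_K)
        g = gbar(p[0], p[1:K + 1], p[K + 1:], y, X, Z, m)
        return n * g @ Om_inv @ g          # n * gbar' Omega_hat^{-1} gbar

    res = minimize(objective, np.concatenate([theta0, gamma0]), method="BFGS")
    J = res.fun                            # overidentification statistic, cf. (2.9)
    return res.x, J, chi2.sf(J, df=K)      # estimates, J statistic, p-value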

As is well known, and as stated in Proposition 1 below, the other component in the variance-

covariance matrix for the optimally weighted GMM is the gradient matrix corresponding to the

moment functions. In this instance, this is given by


 
G = \begin{pmatrix} G_{11} & 0 \\ 0 & G_{22} \\ G_{31} & G_{32} \end{pmatrix}    (2.8)

where the components of the matrix are

G_{11} = -E((1 - m_i) W_i W_i'),

G_{22} = -E((1 - m_i) Z_i Z_i'),

G_{31} = \begin{pmatrix} -E(m_i Z_i Z_i' \gamma_0) & -E(m_i Z_i Z_i') \end{pmatrix},

G_{32} = -E(m_i Z_i Z_i' \alpha_0).

The first set of columns represents the expectation of the derivatives of the moment functions

with respect to (α, β′)′ while the second set is related to the derivatives with

respect to γ. For purposes of inference, one can easily estimate the components of G by taking

sample analogs evaluated at the GMM estimates of all the parameters.

Under standard regularity conditions, we have the following result:4

Proposition 1 Under Assumption 1, the estimators (α̂, β̂, γ̂) are consistent and asymptotically normally distributed with asymptotic variance given by (G'Ω^{-1}G)^{-1}. Moreover,

n \bar g(\hat\alpha_G, \hat\beta_G, \hat\gamma_G)' \hat\Omega^{-1} \bar g(\hat\alpha_G, \hat\beta_G, \hat\gamma_G) \xrightarrow{d} \chi^2(K).    (2.9)

The behavior of the objective function in (2.9) gives rise to the possibility of testing

the overidentifying restrictions imposed by Assumption 1. For instance, as discussed above,

Assumption 1 would be violated when Wi εi has nonzero expectation for observations where Xi

is (non-)missing. Similarly, the assumption would be violated when Zi ξi has nonzero expectation

for observations where Xi is (non-)missing. Since, as shown below, improvements in terms of

efficiency can only be achieved when these restrictions are satisfied and imposed on estimation,

the method provides a way of testing the validity of these assumptions.

Using the same notation, the asymptotic variance of the complete data estimator is given

in the following result:

Lemma 3 Under Assumption 1, the complete data estimator has asymptotic variance given by

\mathrm{AVAR}(\sqrt{n}(\hat\theta_C - \theta_0)) = (G_{11} \Omega_{11}^{-1} G_{11})^{-1} = G_{11}^{-1} \Omega_{11} G_{11}^{-1}.

As noted earlier, the fact that the complete data estimator θ̂C is also equivalent to the GMM

estimator of θ0 based on a subset of moment conditions (Lemma 2) implies that the optimally

weighted GMM estimator using all 3K + 1 moment conditions has a variance no larger than
4 By standard regularity conditions, we simply mean the finite variances and nonsingularity that allow for the unique representations in (2.1) and (2.3). These identification conditions will be implicitly assumed throughout the exposition below.

that given in Lemma 3. The remainder of this Section examines in detail when efficiency gains

are possible with the GMM approach.

Some additional notation is needed. First, denote λ = P (mi = 0), so that in the asymp-

totic theory λ represents the asymptotic proportion of data that have non-missing Xi . Also,

using the subscript m (or c) to denote an expectation for observations with missing (or non-

missing) Xi values, define the following quantities:

\Gamma_m = E(Z_i Z_i' \mid m_i = 1) \quad \text{and} \quad \Gamma_c = E(Z_i Z_i' \mid m_i = 0)

\Omega_{\eta\eta m} = E(Z_i Z_i' \eta_i^2 \mid m_i = 1) \quad \text{and} \quad \Omega_{\eta\eta c} = E(Z_i Z_i' \eta_i^2 \mid m_i = 0)

\Omega_{\varepsilon\eta m} = E(Z_i Z_i' \varepsilon_i \eta_i \mid m_i = 1) \quad \text{and} \quad \Omega_{\varepsilon\eta c} = E(Z_i Z_i' \varepsilon_i \eta_i \mid m_i = 0)

\Lambda_{\varepsilon\eta\xi m} = E(Z_i \varepsilon_i \eta_i \xi_i \mid m_i = 1) \quad \text{and} \quad \Lambda_{\varepsilon\eta\xi c} = E(Z_i \varepsilon_i \eta_i \xi_i \mid m_i = 0)

\sigma_{\xi m}^2 = E(\xi_i^2 \mid m_i = 1) \quad \text{and} \quad \sigma_{\xi c}^2 = E(\xi_i^2 \mid m_i = 0)

The following general result characterizes the difference in asymptotic variances between

the efficient GMM estimator and the complete data estimator.

Proposition 2 Under Assumption 1,

\mathrm{AVAR}(\sqrt{n}(\hat\theta_C - \theta_0)) - \mathrm{AVAR}(\sqrt{n}(\hat\theta - \theta_0)) = \begin{pmatrix} A' \\ B' \end{pmatrix} D \begin{pmatrix} A & B \end{pmatrix} \ge 0,

where

A = \Lambda_{\varepsilon\eta\xi c} (\sigma_{\xi c}^2)^{-1},

B = \Omega_{\varepsilon\eta c} \Gamma_c^{-1} - \Lambda_{\varepsilon\eta\xi c} (\sigma_{\xi c}^2)^{-1} \gamma_0',

D = \frac{1}{\lambda^2} \Gamma_c^{-1} \left( \frac{1}{1-\lambda} \Gamma_m^{-1} \Omega_{\eta\eta m} \Gamma_m^{-1} + \frac{1}{\lambda} \Gamma_c^{-1} \Omega_{\eta\eta c} \Gamma_c^{-1} \right)^{-1} \Gamma_c^{-1}.

Since the matrix D is positive definite under fairly general conditions, it becomes straightfor-

ward to consider separately situations in which there is a reduction in variance for estimation of

α0 and situations in which there is a reduction in variance for estimation of β0 . The difference

corresponding to estimation of α0 is given by


(\sigma_{\xi c}^2)^{-2} \Lambda_{\varepsilon\eta\xi c}' D \Lambda_{\varepsilon\eta\xi c} \ge 0

which is equal to zero if and only if Λεηξc = 0. On the other hand, for estimation of β0, the difference is given by B′DB ≥ 0, which is equal to zero if and only if

B = \Omega_{\varepsilon\eta c} \Gamma_c^{-1} - \Lambda_{\varepsilon\eta\xi c} (\sigma_{\xi c}^2)^{-1} \gamma_0' = 0.

Given the definition of ηi, we can write

\Lambda_{\varepsilon\eta\xi c} = E(Z_i \varepsilon_i^2 \xi_i \mid m_i = 0) + \alpha_0 E(Z_i \varepsilon_i \xi_i^2 \mid m_i = 0) \stackrel{def}{=} \Lambda_{\varepsilon\varepsilon\xi c} + \alpha_0 \Lambda_{\varepsilon\xi\xi c}    (2.10)

and

\Omega_{\varepsilon\eta c} = E(Z_i Z_i' \varepsilon_i^2 \mid m_i = 0) + \alpha_0 E(Z_i Z_i' \varepsilon_i \xi_i \mid m_i = 0) \stackrel{def}{=} \Omega_{\varepsilon\varepsilon c} + \alpha_0 \Omega_{\varepsilon\xi c}.    (2.11)

In contrast to the previous literature, this general result suggests that efficiency gains are

possible for both α0 and β0 since the general assumptions do not imply either Λεηξc = 0 or

B = 0. The previous literature (e.g., Dagenais (1973) and Gourieroux and Monfort (1981))

either explicitly or implicitly imposed stronger assumptions on the model and revealed only

the possibility of improvements for the estimation of β0 using imputation methods discussed

below. The above result indicates situations when improvements for α0 will be possible and

why, under more restrictive classical assumptions, improvements will not be available.

A simple more restrictive assumption is given by the following:

Assumption 2 Suppose that E(\varepsilon_i \mid X_i, Z_i, m_i = 0) = 0 and E(\varepsilon_i^2 \mid X_i, Z_i, m_i = 0) = \sigma_{\varepsilon c}^2.

This assumption implies a classical regression model with mean-zero and homoskedastic

residuals, as implied by the Gaussian-like assumptions made in the above-referenced literature.

Under Assumption 2, based upon (2.10), note that Λεηξc = 0 since

E(Z_i \varepsilon_i^2 \xi_i \mid m_i = 0) = E(E(Z_i \varepsilon_i^2 \xi_i \mid X_i, Z_i, m_i = 0) \mid m_i = 0) = \sigma_{\varepsilon c}^2 E(Z_i \xi_i \mid m_i = 0) = 0

and

E(Z_i \varepsilon_i \xi_i^2 \mid m_i = 0) = E(E(Z_i \varepsilon_i \xi_i^2 \mid X_i, Z_i, m_i = 0) \mid m_i = 0) = E(Z_i \xi_i^2 E(\varepsilon_i \mid X_i, Z_i, m_i = 0) \mid m_i = 0) = 0.

Also, note that B simplifies in this case:

E(Z_i Z_i' \varepsilon_i^2 \mid m_i = 0) = \sigma_{\varepsilon c}^2 \Gamma_c

E(Z_i Z_i' \varepsilon_i \xi_i \mid m_i = 0) = E(E(Z_i Z_i' \varepsilon_i \xi_i \mid X_i, Z_i, m_i = 0) \mid m_i = 0) = E(Z_i Z_i' \xi_i E(\varepsilon_i \mid X_i, Z_i, m_i = 0) \mid m_i = 0) = 0

Therefore, under the more restrictive Assumption 2, the following results hold:

Lemma 4 Under Assumptions 1 and 2,

\mathrm{AVAR}(\sqrt{n}(\hat\alpha_C - \alpha_0)) - \mathrm{AVAR}(\sqrt{n}(\hat\alpha - \alpha_0)) = 0

\mathrm{AVAR}(\sqrt{n}(\hat\beta_C - \beta_0)) - \mathrm{AVAR}(\sqrt{n}(\hat\beta - \beta_0)) = (\sigma_{\varepsilon c}^2)^2 D > 0

Consistent with the previous literature, this result shows that under the classical assumptions

on the residual εi there will be no gain in terms of estimating α0 but generally there will be

a gain for β0 . As discussed below, additional assumptions made in the previous literature also

simplify D relative to the general expression in Proposition 2.

Before considering the further restrictions that lead Lemma 4 to coincide with results in

the previous literature, we return to the general case to explore situations where gains for both

α0 and β0 are possible. To illustrate, we consider a specification for the residuals that permits

flexible forms of scale heteroskedasticity:

Assumption 3 εi = σε (Xi , Zi )ui and ξi = σξ (Zi )vi , where (ui , vi ) are jointly i.i.d. and inde-

pendent of (Xi , Zi ), with ui and vi both having mean zero and variance one.

This assumption implies Assumption 2 when σε (Xi , Zi ) = σε and E(ui |vi , Zi , mi = 0) = 0. The

two key terms with respect to estimation of α0 are

Λεεξc = E(Zi σε (Zi0 γ0 + σξ (Zi )vi , Zi )2 u2i σξ (Zi )vi |mi = 0)

Λεξξc = E(Zi σε (Zi0 γ0 + σξ (Zi )vi , Zi )ui σξ (Zi )2 vi2 |mi = 0)

Neither term is zero without further restrictions. The second quantity, Λεξξc , will equal zero

with the additional condition E(ui |Zi , vi , mi = 0) = 0, which essentially corresponds to the

condition that E(εi |Xi , Zi , mi = 0) = 0 as would be the case under Assumption 2. Otherwise,

this term is not necessarily zero. Also, for the first quantity Λεεξc , if heteroskedasticity in εi is

limited to dependence on just Zi so that σε (Xi , Zi ) = σε (Zi ), then

Λεεξc = E(Zi σε (Zi )2 u2i σξ (Zi )vi |mi = 0)

= E(Zi σε (Zi )2 σξ (Zi )E(u2i vi |Zi , mi = 0)|mi = 0)

which will be zero when E(u2i vi |Zi , mi = 0) = 0. For example, a bivariate normal distribution

of (ui , vi ) would satisfy this property.

Considering both Λεεξc and Λεξξc , the following result provides sufficient conditions for no

efficiency gains with respect to α0 :

Lemma 5 If Assumptions 1 and 3 hold and, furthermore, (i) \sigma_\varepsilon(X_i, Z_i) = \sigma_\varepsilon(Z_i), (ii) E(u_i^2 v_i \mid m_i = 0) = 0, and (iii) E(u_i v_i^2 \mid m_i = 0) = 0 hold, then

\mathrm{AVAR}(\sqrt{n}(\hat\alpha_C - \alpha_0)) - \mathrm{AVAR}(\sqrt{n}(\hat\alpha - \alpha_0)) = 0

\mathrm{AVAR}(\sqrt{n}(\hat\beta_C - \beta_0)) - \mathrm{AVAR}(\sqrt{n}(\hat\beta - \beta_0)) = \Gamma_c^{-1} \Omega_{\varepsilon\eta c} D \Omega_{\varepsilon\eta c} \Gamma_c^{-1} > 0

There are still generally improvements for estimation of β0 with the main simplification coming

from the fact that the second term in B is zero. Even when σε (Zi ) and σξ (Zi ) are constants (so

that (i) holds), efficiency gains for α0 are possible when the joint third-moment conditions

in (ii) and/or (iii) are violated.

For the sake of completeness and in order to compare the GMM method and the complete

data method with alternative imputation methods proposed in Dagenais (1973) and Gourieroux

and Monfort (1981), we derive explicit expressions for the asymptotic variances under the

classical assumptions in those papers. To do this, define

σε2 = E(ε2i ), σξ2 = E(ξi2 )

and let Ωεεm , Ωξξm and Ωξξc denote the matrix of moments analogous to the definitions of Ωεεc

and Ωηηm above. With this notation, the following assumption is essentially equivalent to their

assumptions:

Assumption 4 (i) The residuals εi and ξi are conditionally (on (Xi , Zi , mi ) and (Zi , mi ),

respectively) mean zero and homoskedastic, (ii) Γm = Γc = Γ, (iii) Ωεεm = Ωεεc = σε2 Γ,

(iv) Ωξξm = Ωξξc = σξ2 Γ.

Conditions (ii)-(iv) will be satisfied, given (i), when Xi data are MCAR (see, e.g., Gourier-

oux and Monfort (1981) or Nijman and Palm (1988)). Under Assumption 4, the asymptotic-

variance expressions for the complete data estimator simplify to

\mathrm{AVAR}(\sqrt{n}(\hat\alpha_C - \alpha_0)) = \frac{1}{\lambda} \sigma_\varepsilon^2 (\sigma_\xi^2)^{-1}    (2.12)

\mathrm{AVAR}(\sqrt{n}(\hat\beta_C - \beta_0)) = \frac{1}{\lambda} \sigma_\varepsilon^2 \Gamma^{-1} + \frac{1}{\lambda} \sigma_\varepsilon^2 (\sigma_\xi^2)^{-1} \gamma_0 \gamma_0',    (2.13)

while for the GMM estimator,

\mathrm{AVAR}(\sqrt{n}(\hat\alpha - \alpha_0)) = \frac{1}{\lambda} \sigma_\varepsilon^2 (\sigma_\xi^2)^{-1}    (2.14)

\mathrm{AVAR}(\sqrt{n}(\hat\beta - \beta_0)) = \sigma_\varepsilon^2 \left( 1 + \frac{(1-\lambda)\sigma_\xi^2 \alpha_0^2}{\lambda(\sigma_\varepsilon^2 + \sigma_\xi^2 \alpha_0^2)} \right) \Gamma^{-1} + \frac{1}{\lambda} \sigma_\varepsilon^2 (\sigma_\xi^2)^{-1} \gamma_0 \gamma_0'.    (2.15)

Comparing the first terms in (2.13) and (2.15), one can see the factors that affect the efficiency

improvement from GMM estimation of β0 . The result in Lemma 5 regarding estimation of α0

is also evident.

3 Comparison to Other Methods

3.1 Linear Imputation

This section compares the general GMM procedure with the linear imputation method proposed

by Dagenais (1973) and discussed further by Gourieroux and Monfort (1981). First, note that

plugging the projection (2.3) into the regression model (2.1) yields

Y_i = \left( (1 - m_i) X_i + m_i Z_i' \gamma_0 \right) \alpha_0 + Z_i' \beta_0 + \varepsilon_i + m_i \xi_i \alpha_0,    (3.16)

where by Assumption 1 the composite residual εi + mi ξi α0 is orthogonal to the regressors

((1 − mi )Xi , Zi0 , mi Zi0 ). The methods discussed in these earlier papers can be thought of as a

sequential approach where one first estimates γ0 using the second moment condition by

\hat\gamma = \left( \sum_{i=1}^n (1 - m_i) Z_i Z_i' \right)^{-1} \sum_{i=1}^n (1 - m_i) Z_i X_i

and then substitutes it into (3.16), using regression-based methods.

Given an estimate γ̂ of γ0, the operational version of (3.16) is

Y_i = \left( (1 - m_i) X_i + m_i Z_i' \hat\gamma \right) \alpha_0 + Z_i' \beta_0 + \varepsilon_i + m_i \xi_i \alpha_0 + m_i Z_i' (\gamma_0 - \hat\gamma) \alpha_0.    (3.17)

Equation (3.17) is essentially a regression model with X̂i = ((1 − mi )Xi + mi Zi0 γ̂) and Zi as

regressors — that is, Xi is used if it is observed and the linearly imputed value Zi0 γ̂ is used

otherwise. OLS estimation can be used, but even under homoskedasticity assumptions on εi and

ξi the residual will be necessarily heteroskedastic due to: (i) mi ξi α0 appearing for missing Xi ,

and (ii) estimation error in using γ̂ in place of γ0 . Under the conditions of Dagenais (1973) and

Gourieroux and Monfort (1981), which are essentially equivalent to Assumption 4, the residual

variance is

\sigma_\varepsilon^2 + m_i \sigma_\xi^2 \alpha_0^2 + \sigma_\xi^2 \alpha_0^2 m_i Z_i' \left( \sum_{\ell=1}^n (1 - m_\ell) Z_\ell Z_\ell' \right)^{-1} Z_i,    (3.18)

and the covariance across residuals for observations i ≠ j is given by

\sigma_\xi^2 \alpha_0^2 m_i m_j Z_i' \left( \sum_{\ell=1}^n (1 - m_\ell) Z_\ell Z_\ell' \right)^{-1} Z_j.    (3.19)

Dagenais (1973) proposed a FGLS procedure for the estimation of (3.17) where one estimates

(3.18) and (3.19) using preliminary estimates σ̂ε2 , σ̂ξ2 , and α̂ that would be consistent under

the homoskedasticity assumptions. Gourieroux and Monfort (1981) use the label “Generalized

Dagenais (GD) estimator” for this FGLS procedure and the label “Ordinary Dagenais (OD)

estimator” for the standard OLS estimator.
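For reference, the OD estimator is simply OLS after the imputation step; a minimal sketch (names ours) follows. Note that, as emphasized below, the usual OLS standard errors printed for this regression are not valid.

import numpy as np

def od_estimator(y, X, Z, m):
    # Ordinary Dagenais: impute missing X by Z'gamma_hat, then OLS of y on (X_hat, Z).
    keep = (m == 0)
    gamma_hat = np.linalg.solve(Z[keep].T @ Z[keep], Z[keep].T @ X[keep])
    X_hat = np.where(keep, X, Z @ gamma_hat)   # X if observed, imputed value otherwise
    R = np.column_stack([X_hat, Z])
    return np.linalg.solve(R.T @ R, R.T @ y)   # point estimates only; OLS SEs invalid here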

The following result facilitates comparisons between GD and GMM by showing that GD

behaves asymptotically like a GMM estimator with a particular weight matrix:

Proposition 3 (GD and GMM) The GD estimator is asymptotically equivalent to GMM

using all the moments in (2.5) and an estimated weight matrix based on (2.6) where Ω̂12 = 0,

and

\hat\Omega_{11} = \hat\sigma_\varepsilon^2 \frac{1}{n} \sum_i (1 - m_i) W_i W_i', \qquad \hat\Omega_{22} = \hat\sigma_\xi^2 \frac{1}{n} \sum_i (1 - m_i) Z_i Z_i', \qquad \hat\Omega_{33} = (\hat\sigma_\varepsilon^2 + \hat\alpha^2 \hat\sigma_\xi^2) \frac{1}{n} \sum_i m_i Z_i Z_i',

using estimates \hat\sigma_\varepsilon^2, \hat\sigma_\xi^2 and \hat\alpha that have the same limits as those used in GD.

This result suggests that GD behaves like GMM using a weight matrix that is optimal under

the conditions in Assumption 4. When these conditions are satisfied, GMM and GD have the

same asymptotic variance as given in (2.14) and (2.15). When they are violated, as would occur

under heteroskedasticity, the general results for GMM imply that the general version of GMM

laid out in the previous section will be at least as efficient as GD. We explore some of these

gains in simulations that follow.

One can also compare GD to a sequential GMM procedure where one first estimates γ0

using the second set of moment functions and then uses the other two moment functions with

γ0 replaced by γ̂ in the third moment function. Specifically, the second stage uses the moment

vector

\bar g_S(\alpha, \beta, \hat\gamma) \stackrel{def}{=} \frac{1}{n} \sum_i \begin{pmatrix} (1 - m_i) W_i (Y_i - X_i \alpha - Z_i' \beta) \\ m_i Z_i (Y_i - Z_i'(\hat\gamma \alpha + \beta)) \end{pmatrix}.    (3.20)

Under conditional homoskedasticity assumptions on εi and ξi , the optimal weight matrix for

this GMM procedure can be estimated by


\hat\Omega_S^{-1} \stackrel{def}{=} \begin{pmatrix} \hat\Omega_{11} & 0 \\ 0 & \hat\Omega_{33} + \hat\alpha^2 \hat\sigma_\xi^2 \left( \frac{1}{n} \sum_i m_i Z_i Z_i' \right) \left( \frac{1}{n} \sum_i (1 - m_i) Z_i Z_i' \right)^{-1} \left( \frac{1}{n} \sum_i m_i Z_i Z_i' \right) \end{pmatrix}^{-1}    (3.21)

where Ω̂11 , Ω̂33 , α̂2 , and σ̂ξ2 are as in Proposition 3. Note the second term in the lower right

corner comes from the fact that one has used an estimate of γ0 .

Proposition 4 (GDE-SGMM) The GD estimator is asymptotically equivalent to sequential GMM using the moments ḡS(α, β, γ̂) and the weight matrix Ω̂S^{-1} that uses estimates σ̂ε², σ̂ξ², and α̂ that have the same limits as those used in GD.

Again, the weight matrix for this sequential GMM procedure will generally be optimal

only when the homoskedasticity assumptions are valid. This result facilitates a comparison with

the OD estimator, since that procedure is numerically identical to a sequential GMM procedure that uses as a weight matrix (in place of Ω̂S^{-1}) the matrix H′H, where

H = \begin{pmatrix} 1 & 0 & \hat\gamma' \\ 0 & I & I \end{pmatrix}.

Note that

H \bar g_S(\alpha, \beta, \hat\gamma) = \frac{1}{n} \sum_i \begin{pmatrix} \hat X_i (Y_i - \hat X_i \alpha - Z_i' \beta) \\ Z_i (Y_i - \hat X_i \alpha - Z_i' \beta) \end{pmatrix},

which are the normal equations for OLS estimation of (3.17). Even under homoskedasticity

assumptions on εi and ξi , the usual OLS standard errors are invalid due to the presence of

pre-estimation error in the residual. Since, as we argue below, there is little reason to use the

OD estimator, we omit the details of how one would properly estimate its standard errors.

Standard results for FGLS and GMM imply that the optimal GMM estimator will be

at least as efficient as OD and that, under the stronger homoskedasticity assumptions, GD

will be at least as efficient as OD. The efficiency comparisons are summarized in the following

proposition:

Proposition 5 Under Assumption 1,

(i) \mathrm{AVAR}(\sqrt{n}(\hat\alpha - \alpha_0)) \le \mathrm{AVAR}(\sqrt{n}(\hat\alpha_{OD} - \alpha_0)) = \mathrm{AVAR}(\sqrt{n}(\hat\alpha_{GD} - \alpha_0)) = \mathrm{AVAR}(\sqrt{n}(\hat\alpha_C - \alpha_0))

(ii) \mathrm{AVAR}(\sqrt{n}(\hat\beta - \beta_0)) \le \mathrm{AVAR}(\sqrt{n}(\hat\beta_{OD} - \beta_0))

(iii) \mathrm{AVAR}(\sqrt{n}(\hat\beta - \beta_0)) \le \mathrm{AVAR}(\sqrt{n}(\hat\beta_{GD} - \beta_0))

(iv) \mathrm{AVAR}(\sqrt{n}(\hat\beta_{GD} - \beta_0)) \le \mathrm{AVAR}(\sqrt{n}(\hat\beta_{OD} - \beta_0)) if Assumption 4(i) holds

Neither the OD nor the GD estimator brings any improvement with respect to estimating α0, whereas

the results of Section 2 indicate that this is possible in some cases using GMM. With respect

to β0 , one can show that even under the stronger homoskedasticity assumptions the OD es-

timator, unlike the GD estimator, is not guaranteed to bring about efficiency improvements

relative to the complete data estimator. The asymptotic variance for β̂OD under the stronger

homogeneity and homoskedasticity assumptions in Assumption 4 is given by


\sigma_\varepsilon^2 \left( 1 + \frac{(1-\lambda)\sigma_\xi^2 \alpha_0^2}{\lambda \sigma_\varepsilon^2} \right) \Gamma^{-1} + \frac{1}{\lambda} \sigma_\varepsilon^2 (\sigma_\xi^2)^{-1} \gamma_0 \gamma_0'.

This asymptotic variance is at least as large as (2.15) (which is also the variance of GD in this

case) since the difference is


\sigma_\varepsilon^2 \left( 1 + \frac{(1-\lambda)\sigma_\xi^2 \alpha_0^2}{\lambda \sigma_\varepsilon^2} \right) - \sigma_\varepsilon^2 \left( 1 + \frac{(1-\lambda)\sigma_\xi^2 \alpha_0^2}{\lambda(\sigma_\varepsilon^2 + \sigma_\xi^2 \alpha_0^2)} \right) = \frac{(1-\lambda)\sigma_\xi^4 \alpha_0^4}{\lambda(\sigma_\varepsilon^2 + \sigma_\xi^2 \alpha_0^2)} \ge 0.

To see that the asymptotic variance for β̂OD is not even guaranteed to be lower than that of

β̂C , note that5


\mathrm{AVAR}(\sqrt{n}(\hat\beta_C - \beta_0)) - \mathrm{AVAR}(\sqrt{n}(\hat\beta_{OD} - \beta_0)) = \sigma_\varepsilon^2 \left( \frac{1}{\lambda} - \left( 1 + \frac{(1-\lambda)\sigma_\xi^2 \alpha_0^2}{\lambda \sigma_\varepsilon^2} \right) \right) \Gamma^{-1} = \sigma_\varepsilon^2 \frac{(1-\lambda)}{\lambda \sigma_\varepsilon^2} \left( \sigma_\varepsilon^2 - \sigma_\xi^2 \alpha_0^2 \right) \Gamma^{-1} \gtrless 0.

Therefore, despite its convenience, the OD estimator is an inferior alternative. It is less efficient

than GMM and not even guaranteed to improve upon the complete data estimator. The

simulations in Section 5 provide numerical evidence on these points.

3.2 Dummy Variable Method

We can define the “dummy variable method” using (3.16) and separating the intercept from the other variables in Zi, so that Z_i = (1, Z_{2i}')' and \gamma_0 = (\gamma_{10}, \gamma_{20}')', giving

Y_i = (1 - m_i) X_i \alpha_0 + Z_i' \beta_0 + m_i \gamma_{10} \alpha_0 + m_i Z_{2i}' \gamma_{20} \alpha_0 + \varepsilon_i + m_i \xi_i \alpha_0.    (3.22)

Given Assumption 1, equation (3.22) is a valid regression model in the sense that the residual

εi + mi ξi α0 is orthogonal to the regressors. The dummy variable method amounts to running


the regression without the regressors m_i Z_{2i}'. Let \hat\theta_{DM} \equiv (\hat\alpha_{DM}, \hat\beta_{DM}')' denote the dummy variable estimator based on running the regression in (3.22) omitting the regressors m_i Z_{2i}'.6

5 As pointed out by Griliches (1986), the claim by Gourieroux and Monfort (1981) that the unweighted estimator for β0 is at least as efficient as the complete data estimator is in fact an error. The error is the result of a slight mistake in the algebra — that error has been corrected by hand in the version that can be found in JSTOR.
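In code, the dummy-variable regression is easy to construct, which no doubt explains its popularity; a sketch (names ours) follows, though as the proposition below makes clear, this convenience comes at a real cost.

import numpy as np

def dummy_method(y, X, Z, m):
    # Regress y on ((1 - m)X, Z, m): zero out missing X and add a missingness dummy.
    R = np.column_stack([np.where(m == 0, X, 0.0), Z, m])
    return np.linalg.solve(R.T @ R, R.T @ y)   # coefficient on m estimates gamma_10 * alpha_0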

The following proposition (see also Jones (1996)) formally states the result that the dummy

variable method will be subject to omitted-variables bias (and inconsistency) unless certain

restrictions are satisfied:

Proposition 6 The estimators (\hat\alpha_{DM}, \hat\beta_{DM}')' are biased and inconsistent unless (i) α0 = 0 or (ii) γ20 = 0.

The first condition is that Xi is an irrelevant variable in the regression of interest (2.1), in which

case the best solution to the missing-data problem is to drop Xi completely and use all available

data to regress Yi on Zi . The second condition requires that either Z2i is non-existent in the

model in the first place so that the original model (2.1) is a simple linear regression model or

else Z2i is not useful for predicting Xi . If Z2i is non-existent, the dummy-variable estimator is

equivalent to the complete-data estimator.

Even if the conditions of Proposition 6 are met and the dummy method is consistent, it

is still in general difficult to compare the variance with that of the complete data estimator.

Although the complete data estimator results from estimating the unrestricted version of (3.22) by OLS, dropping the irrelevant regressor m_i Z_{2i}' does not necessarily result in efficiency gains when there is conditional heteroskedasticity. To facilitate comparisons (under either (i) or (ii) of Proposition 6) consider the restrictive situation where there is homoskedasticity and homogeneity (Assumption 4) and also the normalization


 
\Gamma = \begin{pmatrix} 1 & 0 \\ 0 & \Gamma_{22} \end{pmatrix}.    (3.23)

Since the first element of Zi is 1, (3.23) amounts to assuming that the regressors in Z2i are

mean zero and will not alter any of the slope coefficients; the intercept in the model (i.e.,

the first element of β0 ) will be altered by this normalization. The asymptotic variance of the

dummy variable estimator, along with efficiency comparisons to the complete data estimator,

are summarized in the following proposition:


6 The regression also yields an estimate of γ10 α0.

Proposition 7 Under Assumptions 1 and 4 and the normalization in (3.23),

(i) when α0 = 0 we have

\mathrm{AVAR}(\sqrt{n}(\hat\alpha_{DM} - \alpha_0)) = \frac{\sigma_\varepsilon^2}{\lambda(\sigma_\xi^2 + (1-\lambda)\gamma_{20}'\Gamma_{22}\gamma_{20})} \le \mathrm{AVAR}(\sqrt{n}(\hat\alpha_C - \alpha_0))

\mathrm{AVAR}(\sqrt{n}(\hat\beta_{DM} - \beta_0)) = \sigma_\varepsilon^2 \begin{pmatrix} \frac{1}{\lambda} & 0 \\ 0 & \Gamma_{22}^{-1} \end{pmatrix} + \frac{\lambda \sigma_\varepsilon^2}{\sigma_\xi^2 + (1-\lambda)\gamma_{20}'\Gamma_{22}\gamma_{20}} \begin{pmatrix} \lambda^{-2}\gamma_{10}^2 & \lambda^{-1}\gamma_{10}\gamma_{20}' \\ \lambda^{-1}\gamma_{10}\gamma_{20} & \gamma_{20}\gamma_{20}' \end{pmatrix} \le \mathrm{AVAR}(\sqrt{n}(\hat\beta_C - \beta_0))

(ii) when γ20 = 0 we have

\mathrm{AVAR}(\sqrt{n}(\hat\alpha_{DM} - \alpha_0)) = \frac{\sigma_\varepsilon^2}{\lambda \sigma_\xi^2} = \mathrm{AVAR}(\sqrt{n}(\hat\alpha_C - \alpha_0))

\mathrm{AVAR}(\sqrt{n}(\hat\beta_{DM} - \beta_0)) = \begin{pmatrix} \sigma_\varepsilon^2 \frac{1}{\lambda} & 0 \\ 0 & (\sigma_\varepsilon^2 + (1-\lambda)\sigma_\xi^2 \alpha_0^2)\Gamma_{22}^{-1} \end{pmatrix} + \frac{\sigma_\varepsilon^2}{\lambda \sigma_\xi^2} \begin{pmatrix} \gamma_{10}^2 & 0 \\ 0 & 0 \end{pmatrix}

Result (i) says that the estimator for α0 will be more efficient than the complete data

estimator when α0 = 0 and γ20 6= 0. (When γ20 = 0 as well, there is no efficiency gain possible

for estimating the coefficient of the missing variable.) One can compare the variance of the

estimator of α0 that would be possible if the Xi were fully observed — the variance under these

conditions would be given by (using F O subscript to denote the estimator with fully observed

data)
\mathrm{AVAR}(\sqrt{n}(\hat\alpha_{FO} - \alpha_0)) = \frac{\sigma_\varepsilon^2}{\sigma_\xi^2}
so then the relative efficiency would be

  !−1
AV AR( n(α̂DM − α0 )) σε2 σε2
√ =   
AV AR( n(α̂F O − α0 )) 2 0
λ σξ + (1 − λ)γ20 Γ22 γ20 σξ2

σξ2
=  
0 Γ γ
λ σξ2 + (1 − λ)γ20 22 20
!−1
0 Γ γ
γ20 22 20
= λ + (1 − λ)λ
σξ2

which depends on λ as well as the signal to noise ratio in the relationship between Xi and Zi . It

is possible that the dummy variable method could be more efficient than the full data method
when γ20′Γ22γ20 is large relative to σξ². Intuitively, when this occurs there is a strong relationship

between Xi and Zi and it is hard to estimate α0 . When α0 = 0, it does not matter whether Xi

is observed or not and apparently there are some gains from entering it as a zero and using a

missing indicator in its place. This result should not be pushed too far, however, since it only

occurs when α0 = 0 — in this instance, it would be better to drop Xi completely and just use

Zi in the regression. Also, one suspects that in most applications in economics the noise σξ² is likely to be large relative to the signal γ20′Γ22γ20 so that the ratio of variances is likely to be

larger than one in practice. The second part of (i) suggests that the estimator of the slopes in

β0 will be more efficient than GMM since7

\frac{\lambda}{\sigma_\xi^2 + (1-\lambda)\gamma_{20}'\Gamma_{22}\gamma_{20}} \gamma_{20}\gamma_{20}' \le \frac{1}{\lambda \sigma_\xi^2} \gamma_{20}\gamma_{20}'.

When γ20 = 0, result (ii) implies that the dummy method is no more efficient than the

complete data estimator for α0 and could be more or less efficient than the complete data

estimator of the β0 slopes since

\frac{\sigma_\varepsilon^2}{\lambda} - \left( \sigma_\varepsilon^2 + (1-\lambda)\sigma_\xi^2 \alpha_0^2 \right) = \frac{(1-\lambda)}{\lambda} \left( \sigma_\varepsilon^2 - \lambda \sigma_\xi^2 \alpha_0^2 \right) \lessgtr 0.

This comparison is equivalent to the efficiency comparison between the unweighted imputation

estimator and the complete data estimator. In this instance, in fact, the dummy method has

the same asymptotic variance as the unweighted imputation estimator and, as such, is less

efficient than the GMM or weighted imputation estimator.

The results of this section suggest little to recommend the dummy variable method for

dealing with missingness. It raises the possibility of bias and inconsistency. As a practical

matter, one may be willing to live with this bias if the method had a lower variance but
7 Also for the intercept (under the reparameterization) there will be an efficiency gain relative to GMM since

\frac{\lambda \lambda^{-2} \gamma_{10}^2}{\sigma_\xi^2 + (1-\lambda)\gamma_{20}'\Gamma_{22}\gamma_{20}} = \frac{\gamma_{10}^2}{\lambda(\sigma_\xi^2 + (1-\lambda)\gamma_{20}'\Gamma_{22}\gamma_{20})} \le \frac{\gamma_{10}^2}{\lambda \sigma_\xi^2}.

even this is not guaranteed. The only situation where one does not sacrifice bias in exchange

for variance improvements is precisely the case where the missing variable can be eliminated

completely.

4 Missing Data in Instrumental Variable Models

The GMM framework to handle missingness can easily be modified to handle other models

for which GMM estimators are commonly used. In this section, we consider extending the

methodology to the case of instrumental-variables models. Section 4.1 considers the case where

the instrumental variable may be missing, whereas Section 4.2 considers the case where the

endogenous variable may be missing. As the latter case turns out to be very similar to the

situation considered in Section 2, the discussion in Section 4.2 will be somewhat brief.

4.1 Missing Instrument Values

This section considers a situation in which an instrumental variable has potentially missing

values. An example of this occurs in Card (1995), where IQ score is used as an instrument

for the “Knowledge of the World of Work” (KWW) test score in a wage regression; IQ score

is missing for about 30% of the sample, and Card (1995) simply omits the observations with

missing data in the IV estimation. Other authors (see, for example, Dahl and DellaVigna

(2009)) have used a dummy variable approach to deal with missing values for an instrument

— instead of dropping observations with missing values, one enters a zero for the missing value

and “compensates” by using dummies for “missingness.”

Consider a simple situation in which there is a single instrument for a single endogenous

regressor and where the instrument may be missing. The model consists of the following

“structural” equation

Y1i = Y2i δ0 + Zi′β0 + εi , E(Zi εi ) = 0, E(Y2i εi ) ≠ 0, (4.24)

where all the variables are observed, and a reduced form (linear projection) for the endogenous

regressor Y2i ,

Y2i = Xi π0 + Zi0 Π0 + vi , E(Xi vi ) = 0, E(Zi vi ) = 0. (4.25)

Missingness of the instrumental variable Xi is denoted by the indicator variable mi (equal to

one if Xi missing). The missing-at-random assumptions required in this context are8

E(mi Xi εi ) = E(mi Xi vi ) = E(mi Xi ξi ) = 0, (4.26)

where ξi is the projection error from the projection of Xi onto Zi (as in (2.3)). Also, Xi is

assumed to be a valid and useful instrument in the sense that

E(Xi εi ) = 0 and π0 ≠ 0.

For this model, one is primarily interested in the estimation of the parameters of (4.24) so that

efficient estimation of the parameters of (4.25) is not of paramount concern. As in Section 2,

we assume that the linear projection in (2.3) exists. The complete data method in this context

amounts to using only the observations for which mi = 0 (Xi not missing) — this is the approach

in Card (1995). Given the missing-at-random assumption (4.26), this is a consistent approach

that asymptotically uses a proportion of data represented by λ = P (mi = 0).

Similar to the arguments in Section 2, one can use (2.3) to write a reduced form linear

projection that is satisfied for the entire sample as

Y2i = (1 − mi )Xi π0 + Zi0 Π0 + mi Zi0 γ0 π0 + vi + π0 mi ξi . (4.27)

Then, partitioning Zi = (1, Z2i′)′, the full-sample reduced form in (4.27) suggests that one use an instrument set that consists of ((1 − mi)Xi, Zi, mi, miZ2i) when estimating (4.24) based on the

entire sample. On the other hand, the dummy variable approach amounts to using the subset

((1 − mi )Xi , Zi , mi ). The interesting question in this case is whether there are benefits from

using the entire sample with either set of instruments relative to just omitting observations

with missing values for the instrument. Intuitively, based on standard results, one cannot

imagine that the “dummy approach” could be better than the approach based on the full set of
8 See Mogstad and Wiswall (2010) for a treatment of missing instrumental variables under alternative assumptions.

instruments ((1 − mi )Xi , Zi , mi , mi Z2i ). To address this question, we compare the properties of

the IV estimators. Clearly each estimator is consistent so it comes down to relative variances.

We use similar notation to previous sections so that (δ̂C , β̂C ) denotes the IV estimator using

complete data where Xi is used to instrument Y2i . Similarly (δ̂D , β̂D ) is the 2SLS estimator

that uses the instrument set ((1 − mi)Xi, Zi, mi) and the entire sample. The 2SLS estimator using the full instrument set (based on (4.27)), ((1 − mi)Xi, Zi, mi, miZ2i′), and the entire sample is denoted by (δ̂F, β̂F).

It seems intuitively clear that (δ̂F , β̂F ) will be at least as efficient as (δ̂D , β̂D ) — what

is less clear is how these estimators perform relative to the complete data IV estimator. The

following result shows that with respect to estimation of δ0 there is no advantage from using

the 2SLS methods and the entire sample and in the case of δ̂D one may actually be worse off

(relative to the complete data estimator) in terms of asymptotic variance.

Proposition 8 If εi, vi, and ξi are conditionally homoskedastic and E(ZiZi′) = I, then

\mathrm{AVAR}(\sqrt{n}(\hat\delta_C - \delta_0)) = \mathrm{AVAR}(\sqrt{n}(\hat\delta_F - \delta_0)) = \sigma_\varepsilon^2 \left( \pi_0^2 \lambda \sigma_\xi^2 \right)^{-1}

\mathrm{AVAR}(\sqrt{n}(\hat\delta_D - \delta_0)) = \sigma_\varepsilon^2 \left( \pi_0^2 \lambda \sigma_\xi^2 \right)^{-1} \left( 1 + \frac{(1-\lambda)\gamma_{20}'\gamma_{20}}{\sigma_\xi^2} \right).

The full instrument estimator and the complete data estimator have the same asymptotic

variance while the dummy method is less efficient by an amount that depends on the amount

of missing data as well as the coefficient γ20 — when this is zero, the dummy variable method

has the same asymptotic variance as the other two estimators. This result suggests that the

method of dealing with missingness in instruments by using zeros for the missing value and mi

alone to compensate is likely to be inferior compared to the method that drops observations

with missing values for the instrument. To reach the same level of efficiency, one must add to

the instrument set the interactions of mi with all the elements of Zi . The following proposition

shows that the latter described method does bring some improvements with respect to the

estimation of β0 :

Proposition 9 If εi, vi, and ξi are conditionally homoskedastic and E(ZiZi′) = I, then

\mathrm{AVAR}(\sqrt{n}(\hat\beta_C - \beta_0)) = \sigma_\varepsilon^2 \left( \frac{1}{\lambda} I + \left( \pi_0^2 \lambda \sigma_\xi^2 \right)^{-1} (\gamma_0 \pi_0 + \Pi_0)(\gamma_0 \pi_0 + \Pi_0)' \right)

\mathrm{AVAR}(\sqrt{n}(\hat\beta_F - \beta_0)) = \sigma_\varepsilon^2 \left( I + \left( \pi_0^2 \lambda \sigma_\xi^2 \right)^{-1} (\gamma_0 \pi_0 + \Pi_0)(\gamma_0 \pi_0 + \Pi_0)' \right)

\mathrm{AVAR}(\sqrt{n}(\hat\beta_D - \beta_0)) = \sigma_\varepsilon^2 \left( I + \left( \pi_0^2 \lambda \sigma_\xi^2 \right)^{-1} \left( 1 + \frac{(1-\lambda)\gamma_{20}'\gamma_{20}}{\sigma_\xi^2} \right) (\gamma_0 \pi_0 + \Pi_0)(\gamma_0 \pi_0 + \Pi_0)' \right).

Comparing these asymptotic variances, the full instrument 2SLS estimator has the lowest

asymptotic variance — it is unequivocally more efficient than the complete IV estimator when

there is a non-negligible portion of missing data. The full instrument estimator is more efficient

than the dummy estimator except when γ20 = 0, in which case the additional instruments in

the full set are useless for Y2i . The comparison between the dummy method and the complete

IV estimator depends on the various parameters of the model — large γ20 tends to make the

complete estimator more efficient, while small λ tends to make the dummy more efficient.

These results have a simple implication. For missing instrument values (where missingness

satisfies our assumptions), the method that is guaranteed to deliver asymptotic efficiency is

the 2SLS estimator with a full set of instruments obtained from interactions of mi and Zi .

Compensating for missingness by simply using the dummy alone is not a good idea unless one

believes the instrument is uncorrelated with the other exogenous variables in the model. In

general, while using the dummy alone may bring a benefit for some coefficients, it may also

come at a cost for the other coefficients.
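Operationally, the recommended procedure is plain 2SLS on the full sample with the enlarged instrument set; a sketch follows (variable names are our own, with missing instrument values assumed stored as zeros):

import numpy as np

def tsls_full_instruments(y1, y2, X, Z2, m):
    # 2SLS for (4.24) on the full sample, with the instrument set
    # ((1 - m)X, Z, m, m*Z2) suggested by the reduced form (4.27).
    # X holds zeros where the instrument is missing (m_i = 1).
    n = len(y1)
    Z = np.column_stack([np.ones(n), Z2])        # Z_i = (1, Z_2i')'
    R = np.column_stack([y2, Z])                 # structural-equation regressors
    S = np.column_stack([(1 - m) * X, Z, m, m[:, None] * Z2])
    P = S @ np.linalg.solve(S.T @ S, S.T @ R)    # first-stage fitted values
    return np.linalg.solve(P.T @ R, P.T @ y1)    # (delta_hat, beta_hat')'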

4.2 Missing Endogenous-Variable Values

We now consider the case where the endogenous regressor Y2i may be missing, and let mi denote

the indicator variable for the missingness of Y2i . Otherwise, we consider the same structural

and reduced-form models as in (4.24) and (4.25), respectively:9

Y1i = Y2i δ0 + Zi0 β0 + εi , E(Zi εi ) = 0, E(Y2i εi ) 6= 0

Y2i = Xi π0 + Zi0 Π0 + vi , E(Xi vi ) = 0, E(Zi vi ) = 0.


9 Although this formulation restricts the instrumental variable Xi to be scalar, it is trivial to extend the GMM estimator below to the case of vector Xi.

In comparing these two equations to their counterparts in the missing-exogenous-variable model

of Section 2 (see (2.1) and (2.3), respectively), there are two key differences: (i) the RHS variable

Y2i is not orthogonal to the first-equation error, and (ii) an additional exogenous variable (Xi )

is orthogonal to both the first- and second-equation errors. It is straightforward to incorporate

both of these differences into the GMM framework.

Let Wi = (Xi , Zi0 )0 denote the full vector of exogenous variables in the model. Then, the

appropriate vector of moment functions (analogous to (2.5)) is given by

h_i(\delta, \beta, \pi, \Pi) = \begin{pmatrix} (1 - m_i) W_i (Y_{1i} - Y_{2i}\delta - Z_i'\beta) \\ m_i W_i (Y_{1i} - X_i \pi \delta - Z_i'(\Pi\delta + \beta)) \\ (1 - m_i) W_i (Y_{2i} - X_i \pi - Z_i'\Pi) \end{pmatrix} = \begin{pmatrix} h_{1i}(\delta, \beta, \pi, \Pi) \\ h_{2i}(\delta, \beta, \pi, \Pi) \\ h_{3i}(\delta, \beta, \pi, \Pi) \end{pmatrix}.    (4.28)
(1 − mi )Wi (Y2i − Xi π − Zi0 Π) h3i (δ, β, π, Π) (4.28)

Note that consistency of the GMM estimator requires the following missing-at-random assumption:10

E(mi Wi εi ) = E(mi Wi vi ) = 0.

There are a total of 3K + 3 moments in (4.28) and 2K + 2 parameters in (δ0 , β0 , π0 , Π0 ), so

that the optimal GMM estimator would yield a test of overidentifying restrictions, analogous

to Proposition 1, with χ2 (K + 1) limiting distribution.

5 Monte Carlo Experiments

In this section, we conduct several simulations to examine the small-sample performance of the

various methods considered in Sections 2 and 3 under different data-generating processes. We

consider a very simple setup with K = 2:

Y_i = X_i \alpha_0 + \beta_1 + \beta_2 Z_{2i} + \sigma_\varepsilon(X_i, Z_i) u_i

X_i = \gamma_1 + \gamma_2 Z_{2i} + \sigma_\xi(Z_i) v_i

\sigma_\varepsilon(X_i, Z_i) = \sqrt{\theta_0 + \theta_1 X_i^2 + \theta_2 Z_i^2}

\sigma_\xi(Z_i) = \sqrt{\delta_0 + \delta_1 Z_i^2}

Z_i \sim N(0, 1)
10 Consistency of the complete-data estimator requires E(mi Wi εi ) = 0.

We fix (β1 , β2 , γ1 , γ2 ) = (1, 1, 1, 1) throughout the experiments. In all but one of the designs

α0 is set to 1; in one design, it is set to 0.1. For each of the designs, we consider a simple

missingness mechanism in which exactly half of the Xi ’s are missing completely at random

(λ = 1/2). We consider a total of eight different designs, with the first five based upon

ui , vi ∼ N (0, 1), vi ⊥ ui

and the following parameter values:

Design 1: α0 = 1, θ0 = δ0 = 10, θ1 = θ2 = δ1 = 0

Design 2: α0 = 0.1, θ0 = δ0 = 10, θ1 = θ2 = δ1 = 0

Design 3: α0 = 1, θ0 = 1, δ0 = 10, θ1 = θ2 = δ1 = 0

Design 4: α0 = 1, θ0 = δ0 = θ2 = δ1 = 1, θ1 = 0

Design 5: α0 = 1, θ0 = δ0 = θ1 = θ2 = δ1 = 1

Designs 1–3 have homoskedastic residuals and are used to illustrate the effect of different values

for the variances and the importance of the missing variable itself. Designs 4 and 5 introduce het-

eroskedasticity, with conditional variances dependent on Z in Design 4 and the main-equation

variance also dependent on X in Design 5. These latter two designs are meant to examine the

potential efficiency gains for α0 indicated by the result in Proposition 2 and the special case

following that proposition.

The remaining three designs are based on

u_i \sim N(0, 1), \qquad v_i = u_i^2 - 1.

Note that ui and vi are independent (and mean zero) conditional on Xi and Zi and also that

E(ui vi ) = 0 but E(u2i vi ) = 2. This setup is meant to investigate the relevance of the third-

moment condition for efficiency gains for α0 discussed after Proposition 2. The parameters for

Designs 6 and 7 are as follows:

Design 6: α0 = 1, θ0 = δ0 = 1, θ1 = θ2 = δ1 = 0

Design 7: α0 = 1, θ0 = δ0 = θ2 = δ1 = 1, θ1 = 0

For Design 8, we consider α0 = 1 and the following exponential forms for the residual standard

deviations:
\sigma_\varepsilon(X_i, Z_i) = \exp\left( \sqrt{0.1 + 0.2 Z_i^2 + 0.1 X_i^2} \right)

\sigma_\xi(Z_i) = \exp\left( \sqrt{0.1 + 0.2 Z_i^2} \right)

This last design illustrates more dramatic efficiency gains that are possible with the GMM

estimator.
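For concreteness, one draw from this kind of design can be generated as follows (a sketch for Design 5; the function name, seed handling, and the zero-filling of missing Xi are our choices, matching the earlier code sketches):

import numpy as np

def simulate_design5(n=400, seed=0):
    # One draw from Design 5 (alpha0 = 1, theta0 = delta0 = theta1 = theta2 = delta1 = 1),
    # with exactly half of the X_i missing completely at random (lambda = 1/2).
    rng = np.random.default_rng(seed)
    Z2 = rng.standard_normal(n)                  # the scalar regressor Z
    u = rng.standard_normal(n)
    v = rng.standard_normal(n)
    X = 1.0 + Z2 + np.sqrt(1.0 + Z2 ** 2) * v    # sigma_xi(Z) = sqrt(delta0 + delta1 Z^2)
    sig_eps = np.sqrt(1.0 + X ** 2 + Z2 ** 2)    # sqrt(theta0 + theta1 X^2 + theta2 Z^2)
    y = X + 1.0 + Z2 + sig_eps * u               # alpha0 = beta1 = beta2 = 1
    m = np.zeros(n)
    m[rng.choice(n, size=n // 2, replace=False)] = 1.0
    Z = np.column_stack([np.ones(n), Z2])
    return y, np.where(m == 0, X, 0.0), Z, m     # zero-fill missing X, as in earlier sketches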

For all simulations, a sample size of n = 400 is used. The results are reported in Tables 2–9.

For a set of 1000 replications for each design, these tables report the bias, variance, and overall

MSE for the estimators of the parameters (α0 , β1 , β2 ). The estimators considered are those

discussed in Sections 2 and 3, namely (i) the complete case estimator, (ii) the dummy variable

estimator, (iii) the unweighted imputation (OD) estimator, (iv) the weighted imputation (GD)

estimator, and (v) the optimal GMM estimator.

Several things stand out in the results. With the exception of the dummy variable method,

none of the methods have much bias for any of the parameters. The dummy variable method

can be very biased. This is most pronounced for β2 in all cases except for the Design 2 where

X is relatively unimportant. There is also substantial bias for both α0 and β1 , especially in

Designs 4–8. The variance of the dummy variable estimator can be larger or smaller than

the complete data estimator, but the overall MSE for this estimator is never smaller than

the complete data estimator with the exception of the case where α0 is small. There is little

evidence to suggest the dummy method has any advantage except in cases where X can be

dropped from the analysis.

Regarding the other estimators, the results for Designs 1–3 support the theoretical results:

the weighted imputation and GMM estimators are roughly equally efficient and do substantially

better for estimating β1 and β2 except in Design 2 where X is relatively unimportant in the

main regression model. The unweighted imputation estimator is relatively inefficient in Designs

1 and 3 for estimating β1 and β2 . In Design 2, where α0 is small, the unweighted estimator does

as well as the weighted estimator and GMM but this is the only instance where this occurs.

For estimating α0 , the unweighted and weighted imputation estimators appear identical.

The results in Design 4–8 show that the GMM estimator is generally the estimation method

with the lowest variance and overall MSE. There are also gains in terms of estimating α0 in

Designs 5–8. The result in Design 4 is consistent with the discussion following Proposition 2,

which suggested that when the conditional variance only depended on Z that there would not

be any improvement for estimating α0 .

The GMM estimator itself seems to be very well behaved from a numerical standpoint.

Using standard Gauss-type iterations, the estimates were found with a very tight convergence

criterion in no more than around ten iterations in any case. The experiments all ran very

quickly (i.e., a few seconds on a standard desktop computer in GAUSS 8.0) despite the fact that

a GMM estimate with nonlinearity had to be computed via numerical methods 1000 times for

each experiment.

Overall, the simulation results suggest the following. First, if one believes the data are homoskedastic, then the two-step linear imputation method is preferred, provided one uses the weighting suggested by Dagenais (1973). Second, the dummy-variable method, though convenient, has little else to recommend it: it can be substantially biased except in cases where one could simply drop X from the analysis and regress on Z alone, and it does not necessarily bring about variance improvements, which are the whole purpose of imputation in the first place. Third, the GMM estimators are numerically stable, bring about variance gains in a variety of cases, both homoskedastic and heteroskedastic, and, as an added bonus, give rise to the possibility of testing the restrictions on the model that generate the efficiency gains.

6 Empirical examples

This section considers application of the GMM and other missing-covariate estimation methods

to datasets with a large amount of missing data on variables of interest. Section 6.1 considers

Table 2: Monte Carlo simulations, Design 1

Estimation method Parameter Bias n*Var MSE


Complete-case method α0 0.001 1.84 0.005
β1 -0.001 22.76 0.057
β2 -0.001 22.26 0.056
Dummy-variable method α0 -0.047 1.88 0.007
β1 0.049 23.44 0.061
β2 0.535 15.88 0.326
Unweighted imputation α0 0.001 1.84 0.005
β1 0.002 22.63 0.057
β2 0.006 22.25 0.056
Weighted imputation α0 0.001 1.84 0.005
β1 0.000 17.65 0.044
β2 0.003 16.93 0.042
GMM (efficient) α0 0.004 1.87 0.005
β1 -0.003 17.79 0.044
β2 0.000 17.10 0.043
Design 1: α0 = 1, θ0 = δ0 = 10, θ1 = θ2 = δ1 = 0

Table 3: Monte Carlo simulations, Design 2

Estimation method Parameter Bias n*Var MSE


Complete-case method α0 0.001 1.84 0.005
β1 -0.001 22.76 0.057
β2 -0.001 22.26 0.056
Dummy-variable method α0 -0.004 1.77 0.004
β1 0.005 22.66 0.057
β2 0.060 10.51 0.030
Unweighted imputation α0 0.001 1.84 0.005
β1 -0.001 13.56 0.034
β2 0.006 11.97 0.030
Weighted imputation α0 0.001 1.84 0.005
β1 -0.001 13.53 0.034
β2 0.006 11.99 0.030
GMM (efficient) α0 0.002 1.89 0.005
β1 -0.002 13.60 0.034
β2 0.006 12.22 0.031
Design 2: α0 = 0.1, θ0 = δ0 = 10, θ1 = θ2 = δ1 = 0

Table 4: Monte Carlo simulations, Design 3

Estimation method Parameter Bias n*Var MSE


Complete-case method α0 0.000 0.18 0.000
β1 0.000 2.28 0.006
β2 0.000 2.23 0.006
Dummy-variable method α0 -0.048 0.28 0.003
β1 0.049 2.97 0.010
β2 0.530 6.95 0.298
Unweighted imputation α0 0.000 0.18 0.000
β1 0.003 10.68 0.027
β2 0.001 11.92 0.030
Weighted imputation α0 0.000 0.18 0.000
β1 0.000 2.17 0.005
β2 0.000 2.12 0.005
GMM (efficient) α0 0.001 0.19 0.000
β1 -0.001 2.18 0.005
β2 -0.001 2.12 0.005
Design 3: α0 = 1, θ0 = 1, δ0 = 10, θ1 = θ2 = δ1 = 0

Table 5: Monte Carlo simulations, Design 4

Estimation method Parameter Bias n*Var MSE


Complete-case method α0 0.004 2.83 0.007
β1 -0.004 7.18 0.018
β2 -0.002 10.85 0.027
Dummy-variable method α0 -0.200 3.09 0.048
β1 0.201 8.11 0.061
β2 0.608 7.67 0.389
Unweighted imputation α0 0.004 2.83 0.007
β1 -0.004 7.11 0.018
β2 0.000 10.87 0.027
Weighted imputation α0 0.004 2.83 0.007
β1 -0.004 6.12 0.015
β2 -0.001 8.83 0.022
GMM (efficient) α0 0.005 2.83 0.007
β1 -0.005 6.10 0.015
β2 -0.002 8.89 0.022
Design 4: α0 = 1, θ0 = δ0 = θ2 = δ1 = 1, θ1 = 0

Table 6: Monte Carlo simulations, Design 5

Estimation method Parameter Bias n*Var MSE


Complete-case method α0 0.011 13.93 0.035
β1 -0.012 19.34 0.049
β2 -0.002 22.65 0.057
Dummy-variable method α0 -0.194 13.94 0.073
β1 0.195 19.89 0.088
β2 0.612 14.63 0.411
Unweighted imputation α0 0.011 13.93 0.035
β1 -0.010 19.33 0.048
β2 -0.001 22.46 0.056
Weighted imputation α0 0.011 13.93 0.035
β1 -0.010 17.78 0.045
β2 0.002 19.46 0.049
GMM (efficient) α0 0.006 12.53 0.031
β1 -0.007 16.27 0.041
β2 -0.002 18.35 0.046
Design 5: α0 = 1, θ0 = δ0 = θ1 = θ2 = δ1 = 1

Table 7: Monte Carlo simulations, Design 6

Estimation method Parameter Bias n*Var MSE


Complete-case method α0 0.003 5.09 0.013
β1 -0.005 2.71 0.007
β2 -0.001 7.24 0.018
Dummy-variable method α0 -0.204 4.30 0.052
β1 0.200 2.88 0.047
β2 0.601 3.94 0.371
Unweighted imputation α0 0.003 5.09 0.013
β1 -0.002 5.91 0.015
β2 -0.001 8.46 0.021
Weighted imputation α0 0.003 5.09 0.013
β1 -0.001 3.09 0.008
β2 -0.001 6.96 0.017
GMM (efficient) α0 0.004 4.88 0.012
β1 -0.007 2.72 0.007
β2 -0.003 6.74 0.017
Design 6: α0 = 1, θ0 = δ0 = 1, θ1 = θ2 = δ1 = 0

Table 8: Monte Carlo simulations, Design 7

Estimation method Parameter Bias n*Var MSE


Complete-case method α0 0.002 6.54 0.016
β1 -0.005 3.99 0.010
β2 -0.001 14.50 0.036
Dummy-variable method α0 -0.114 5.97 0.028
β1 0.110 4.46 0.023
β2 0.556 10.96 0.337
Unweighted imputation α0 0.002 6.54 0.016
β1 -0.002 9.65 0.024
β2 0.002 18.74 0.047
Weighted imputation α0 0.002 6.54 0.016
β1 0.000 4.42 0.011
β2 0.000 13.28 0.033
GMM (efficient) α0 0.001 6.13 0.015
β1 -0.005 3.72 0.009
β2 0.001 12.83 0.032
Design 7: α0 = 1, θ0 = δ0 = θ2 = δ1 = 1, θ1 = 0

Table 9: Monte Carlo simulations, Design 8

Estimation method Parameter Bias n*Var MSE


Complete-case method α0 0.029 148.90 0.373
β1 -0.034 112.47 0.282
β2 -0.001 41.23 0.103
Dummy-variable method α0 -0.112 156.16 0.403
β1 0.110 116.13 0.302
β2 0.574 54.31 0.465
Unweighted imputation α0 0.029 148.90 0.373
β1 -0.035 131.35 0.330
β2 -0.018 96.21 0.241
Weighted imputation α0 0.029 148.90 0.373
β1 -0.033 124.48 0.312
β2 -0.006 69.65 0.174
GMM (efficient) α0 0.002 18.45 0.046
β1 -0.006 22.74 0.057
β2 0.003 24.19 0.060
Design 8: α0 = 1, exponential variance form (see text)

the estimation of regression models using data from the Wisconsin Longitudinal Study, where

a covariate of interest is observed for only about a quarter of sampled individuals. Section 6.2

considers the Card (1995) data (from the National Longitudinal Survey of Young Men) men-

tioned in Section 4.1; for these data, we estimate instrumental-variables regressions where one of

the instrumental variables (IQ score) is missing for about one-third of the observations.

6.1 Regression example — Wisconsin Longitudinal Study

The Wisconsin Longitudinal Study (WLS) has followed a random sample of individuals who

graduated from Wisconsin high schools in 1957. In addition to the original survey, several follow-

up surveys have been used to gather longitudinal information for the sample. For this example,

we focus on a specific data item that is available for only a small fraction of the overall sample.

Specifically, we look at BMI (body mass index) ratings based upon high-school yearbook photos

of the individuals. For several reasons, this high-school BMI rating variable is observed for only

about a quarter of individuals.11 The variable should not be considered missing completely at

random (MCAR) since observability depends on whether a school’s yearbook is available or not

and, therefore, could be related to variables correlated with school identity. The high-school

BMI rating is based on the independent assessment of six individuals, and the variable that we

use should be viewed as a proxy for BMI (or, more accurately, perceived BMI) in high school.12

We consider two regression examples where the high-school BMI rating variable is used

as an explanatory variable for future outcomes. The dependent variables that we consider are

(i) completed years of schooling (as of 1964) and (ii) adult BMI (as reported in 1992-1993).

For the schooling regression, IQ score (also recorded in high school) is used as an additional

covariate; for the adult BMI regression, IQ score and completed years of schooling are used as

additional covariates. The results are reported in Table 10, where estimation is done separately

for men and women and three different methods are considered: (i) the complete-case method,
11
According to the WLS website, yearbooks were available and coded for only 72% of graduates; in addition,
in the release of the data, ratings had not been completed for the full set of available photos.
12
Each rater assigned a relative body mass score from 1 (low) to 11 (high). The variable that we use, named
srbmi in the public-release WLS dataset, is a standardized variable calculated separately for male and female
photos. According to the WLS documentation, this variable is calculated by generating rater-specific z scores,
summing the z scores for a given photo, and dividing by the number of raters.

(ii) the dummy-variable method, and (iii) the GMM method. The schooling regression results

are in the top panel (Panel A), and the adult-BMI regression results are in the bottom panel

(Panel B).
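As an aside, the construction of the srbmi variable described in footnote 12 amounts to the following computation. This is a sketch only; the matrix layout and the treatment of unrated photos are assumptions for illustration, not details taken from the WLS code:

```python
import numpy as np

def standardized_rating(R):
    """R: (photos x raters) matrix of 1-11 body-mass scores, with NaN where
    a rater did not score a photo; each photo is assumed to have at least
    one rating. Following the WLS description: z-score within each rater,
    sum the z scores for each photo, divide by that photo's rater count."""
    z = (R - np.nanmean(R, axis=0)) / np.nanstd(R, axis=0)  # rater-specific z
    n_raters = np.sum(~np.isnan(z), axis=1)
    return np.nansum(z, axis=1) / n_raters
```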

There is a substantial amount of missing data on the high-school BMI rating variable. In

the schooling regression, high-school BMI rating is observed for only 888 of 3,969 men (22.4%)

and 1,107 of 4,276 women (25.9%). As a result, we see that the complete-case method results

in much higher standard errors for the other covariates (e.g., the standard errors on the IQ

variable in Panel A are roughly twice as large as those from either the dummy-variable or GMM

methods). While the dummy-variable method is not guaranteed to be consistent, its estimates of the high-school BMI rating coefficient are quite similar to those of the other methods; as the theory of Sections 2 and 3 suggests, there is little difference in the standard errors for this covariate

across the three methods. While the coefficient estimates for the methods are quite similar in

the education regressions, there are a few differences in the adult BMI regressions. For instance,

the estimated effect of education on adult BMI is -0.2224 (s.e. 0.0552) for the dummy-variable

method and -0.1389 (s.e. 0.0616) for the GMM method.

The dummy-variable method is, of course, not even guaranteed to be consistent under the missingness assumptions that yield GMM consistency. Moreover, the GMM method allows for

an overidentification test of the assumptions being made. The overidentification test statistics

are reported in Table 10 and, under the null hypothesis of correct specification, have limiting

χ² distributions with 2 and 3 degrees of freedom for the education and adult-BMI regressions, respectively. The test

for the adult-BMI regression on the male sample has a p-value of 0.016, casting serious doubt

on the missingness assumptions.13 For this regression, note that the complete-case method

estimate for the IQ coefficient was positive (0.0221) and statistically significant at a 5% level

(s.e. 0.0100); in contrast, the GMM estimate for the IQ coefficient is very close to zero in

magnitude and statistically insignificant. The result of the overidentification test, however,

would caution a researcher against inferring too much from this difference. The test indicates

that the assumptions needed for consistency of GMM are not satisfied; it is also possible that the
13
The probability of seeing at least one of the four p-values below 0.016 (under correct specification in all four cases) is roughly 6.2%.

Table 10: Regression examples, Wisconsin Longitudinal Study data

Panel A Dependent variable = years of education


Men Women
Complete- Dummy- GMM Complete- Dummy- GMM
case variable method case variable method
method method method method
High-school BMI rating 0.0878 0.0889 0.0826 -0.2748 -0.2727 -0.2650
(0.0757) (0.0757) (0.0751) (0.0603) (0.0600) (0.0602)
IQ 0.0644 0.0673 0.0674 0.0473 0.0486 0.0476
(0.0041) (0.0019) (0.0019) (0.0034) (0.0017) (0.0018)
Missing-BMI indicator -0.0174 -0.0867
(0.0729) (0.0557)
Constant 7.3962 7.1067 7.0781 8.6023 8.4702 8.5001
(0.4023) (0.1915) (0.1805) (0.3287) (0.1697) (0.1701)

Test statistic (d.f. 2) 0.866 3.208
(p-value) (0.649) (0.201)

Observations 888 3969 3969 1107 4276 4276


Panel B Dependent variable = adult BMI
Men Women
Complete- Dummy- GMM Complete- Dummy- GMM
case variable method case variable method
method method method method
High-school BMI rating 1.5504 1.5345 1.5577 1.9491 1.9213 2.0204
(0.1693) (0.1699) (0.1675) (0.2213) (0.2196) (0.2101)
IQ 0.0221 -0.0066 0.0002 0.0092 0.0007 0.0051
(0.0100) (0.0058) (0.0062) (0.0130) (0.0070) (0.0075)
Years of education -0.1780 -0.1590 -0.1698 -0.1817 -0.2224 -0.1389
(0.0662) (0.0394) (0.0421) (0.0973) (0.0552) (0.0616)
Missing-BMI indicator -0.1032 0.1666
(0.1583) (0.1947)
Constant 27.8288 30.4810 29.8730 27.4363 28.8575 27.4230
(1.0309) (0.5839) (0.6218) (1.5291) (0.8467) (0.9356)

Test statistic (d.f. 3) 10.316 1.776
(p-value) (0.016) (0.620)

Observations 698 2587 2587 873 2917 2917

complete-case method estimator is itself inconsistent here (e.g., if Assumption (i) and/or (iii)

are violated).

6.2 Instrumental variable example — Card (1995)

This section considers IV estimation of a log-wage regression using the data of Card (1995).

The sample consists of observations on male workers from the National Longitudinal Survey of

Young Men (NLSYM) in 1976. The endogenous variable (Y2) is KWW (an individual's score on the “Knowledge of the World of Work” test), with exogenous variables (Z2) including years of education, years of experience (and its square), an SMSA indicator variable (1 if living in an SMSA in 1976), a South indicator variable (1 if living in the South in 1976), and a black-race indicator variable. IQ score is used as an instrument (X) for KWW, but IQ data are missing

for 923 of the 2,963 observations. The complete-data sample, where IQ and the other variables

are non-missing, has 2,040 observations. An additional specification, in which education is also

treated as endogenous, is considered; for this specification, an indicator variable for living in a

local labor market with a 4-year college is used as an additional instrumental variable (and is

always observed, unlike IQ score).

Table 11 reports the IV estimation results. Three estimators are considered: (i) the

complete-data IV estimator (2,040 observations), (ii) the dummy-variable IV estimator (2,963

observations, using the missingness indicator as an additional instrument), and (iii) the full

IV estimator (2,963 observations, using the missingness indicator and its interactions with Z2

as additional instruments). The first three columns of Table 11 use IQ as an instrument for

KWW, and the second three columns also use the near-4-year-college indicator variable as an

instrument for education.
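To fix ideas, the following sketch shows how the dummy-IV and full-IV instrument sets can be assembled and used in two-stage least squares. The Python variable names are illustrative only: `iq0` denotes IQ set to zero where missing, `m` the missingness indicator, `Z2` the exogenous regressors, and `Z2c` the same with a constant appended:

```python
import numpy as np

def tsls(y, x_end, Z_ex, Z_add):
    """2SLS with endogenous regressor(s) x_end, exogenous regressors Z_ex
    (which serve as their own instruments), and extra instruments Z_add.
    Returns the second-stage coefficients on [x_end, Z_ex]."""
    X = np.column_stack([x_end, Z_ex])
    Z = np.column_stack([Z_add, Z_ex])
    Xhat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]   # first-stage fitted values
    return np.linalg.lstsq(Xhat, y, rcond=None)[0]

# Dummy-IV: zeroed-out IQ plus the missingness indicator as instruments.
# b_dummy = tsls(lwage, kww, Z2c, np.column_stack([iq0, m]))

# Full-IV: also interact the missingness indicator with the exogenous regressors.
# b_full = tsls(lwage, kww, Z2c, np.column_stack([iq0, m, m[:, None] * Z2]))
```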

The results clearly illustrate the greater efficiency associated with the full-IV approach. In the first specification, the complete-data and dummy-IV estimators have very similar standard errors, whereas the full-IV estimator provides efficiency gains (roughly 10-15%) for the coefficient

estimates of both the endogenous (KWW) variable and the exogenous variables. The efficiency gains in the second specification are far more dramatic. For the KWW coefficient, the full-

Table 11: Instrumental variable examples, Card (1995) NLSYM data

Dependent variable = ln(weekly wage)


IQ score as instrument for KWW IQ score as instrument for KWW
and near-4-year-college indicator as
instrument for years of education
Complete Dummy Full Complete Dummy Full
Data Instrument Instrument Data Instrument Instrument
KWW 0.0191 0.0189 0.0204 0.0034 0.0202 0.0278
(0.0051) (0.0059) (0.0046) (0.0218) (0.0146) (0.0097)
Education 0.0367 0.0313 0.0280 0.1061 0.0274 0.0053
(0.0116) (0.0136) (0.0109) (0.0946) (0.0528) (0.0356)
Experience 0.0606 0.0525 0.0503 0.1075 0.0501 0.0363
(0.0126) (0.0113) (0.0099) (0.0647) (0.0316) (0.0219)
Experience squared -0.0019 -0.0016 -0.0016 -0.0030 -0.0016 -0.0013
(0.0005) (0.0004) (0.0004) (0.0015) (0.0006) (0.0004)
Black -0.0633 -0.0683 -0.0590 -0.1247 -0.0612 -0.0184
(0.0385) (0.0412) (0.0342) (0.0910) (0.0752) (0.0523)
SMSA 0.1344 0.1317 0.1295 0.1400 0.1303 0.1216
(0.0201) (0.0181) (0.0173) (0.0214) (0.0202) (0.0186)
South -0.0766 -0.1106 -0.1095 -0.0810 -0.1100 -0.1061
(0.0184) (0.0159) (0.0158) (0.0193) (0.0162) (0.0163)
Constant 4.7336 4.8681 4.8773 4.0223 4.8932 5.0284
(0.0945) (0.0783) (0.0751) (0.9699) (0.4490) (0.3171)
J-statistic 16.8 10.2
(p-value) (0.0184) (0.1189)
Observations 2040 2963 2963 2040 2963 2963

instrument standard error is 0.0097 as compared to the complete-data and dummy-IV standard

errors of 0.0218 and 0.0146, respectively. For the education coefficient, the full-instrument

standard error is 0.0356 as compared to the complete-data and dummy-IV standard errors

of 0.0946 and 0.0528, respectively. Thus, the standard errors on the endogenous variable are

roughly a third lower for the full-IV estimator as compared to the dummy-IV estimator. For the

exogenous variables, the full-IV standard errors are uniformly lower, with the largest efficiency

gains evident for the experience variables and the black indicator.

7 Conclusion

This paper has considered several methods that avoid the problem of dropping observations in

the face of missing data on explanatory variables. We proposed a GMM procedure based on a set of moment conditions in the context of a regression model with a regressor that has missing values. The moment conditions were obtained under minimal additional assumptions on the data, and the method was shown to provide efficiency gains for some of the parameters. The GMM approach was compared with some well-known linear imputation methods; it was shown to be equivalent to an optimal version of such methods under assumptions stronger than those used to justify the GMM approach, and to be potentially more efficient under the more general conditions. The (sub-optimal) unweighted linear imputation method and the commonly used dummy-variable method were found potentially to provide a “cure that is worse than the disease.”

The GMM approach can be extended to other settings where estimation can be naturally

cast into a method-of-moments framework. In Section 4, for instance, the GMM approach was

used to provide estimators for cases in which an instrument or an endogenous regressor might

have missing values. In ongoing work, we are considering the application of GMM methods

to linear panel-data models with missing covariate data. For non-linear models, where the

projection approach is no longer applicable, it appears that strong parametric assumptions on

the relationship between missing and non-missing covariates are required (see, for example,

Conniffe and O’Neill (2009)). Also, the theoretical development here has focused upon the

case of a single missing covariate. The idea of efficient GMM estimation can be extended to

additional missing covariates, as in Muris (2011).

References

Card, D. (1995), “Using geographic variation in college proximity to estimate the return to schooling,” in L.N. Christofides, E.K. Grant, and R. Swidinsky, eds., Aspects of Labour Market Behaviour: Essays in Honour of John Vanderkamp, pp. 102-220. Toronto: University of Toronto Press.

Conniffe, D. and D. O’Neill (2009), “Efficient probit estimation with partially missing covariates,” IZA Discussion Paper No. 4081.

Dagenais, M. C. (1973), “The use of incomplete observations in multiple regression analysis: a generalized least squares approach,” Journal of Econometrics 1, pp. 317-328.

Dahl, G. and S. DellaVigna (2009), “Does movie violence increase violent crime?” Quarterly Journal of Economics 124, pp. 677-734.

Dardanoni, V., S. Modica, and F. Peracchi (2011), “Regression with imputed covariates: a generalized missing-indicator approach,” Journal of Econometrics 162, pp. 362-368.

Gourieroux, C. and A. Monfort (1981), “On the problem of missing observations in linear models,” Review of Economic Studies 48(4), pp. 579-586.

Griliches, Z. (1986), “Economic data issues,” in Z. Griliches and M. Intriligator, eds., Handbook of Econometrics, Vol. III. Amsterdam: North-Holland.

Jones, M. P. (1996), “Indicator and stratification methods for missing explanatory variables in multiple linear regression,” Journal of the American Statistical Association 91(433), pp. 222-230.

Mogstad, M. and M. Wiswall (2010), “Instrumental variables estimation with partially missing instruments,” IZA Discussion Paper No. 4689.

Muris, C. (2011), “Efficient GMM estimation with a general missing data pattern,” mimeo, Simon Fraser University.

Nijman, T. and F. Palm (1988), “Efficiency gains due to using missing data procedures in regression models,” Statistical Papers 29, pp. 249-256.
