The Council for Economic Education (first named the Joint Council on Economic Education and
then the National Council on Economic Education) together with the American Economic
Association Committee on Economic Education have a long history of advancing the use of
econometrics for assessment aimed at increasing the effectiveness of teaching and learning
economics. As described by Welsh (1972), the first formal program was funded by the General
Electric Education Fund and held in 1969 and 1970 at Carnegie-Mellon University, and was a
contributing force behind the establishment of the Journal of Economic Education in 1969 as an
outlet for research on the teaching of economics. As described in Buckles and Highsmith
(1990), in 1987 and 1988 a second econometrics training program was sponsored by the Pew
Charitable Trust and held at Princeton University, with papers featured in the Journal of
Economic Education, Summer 1990. Since that time there have been major advances in
econometrics and its applications in education research -- most notably, in the specification of
the data generating processes and computer programs for estimating the models representing
those processes.
Becker and Baumol (1996) called attention to the importance that Sir Ronald A. Fisher
assigned to "random arrangements" in experimental design. i Fisher designed experiments in
which plots of land were randomly assigned to different fertilizer treatments to assess differences
in yield. By measuring the mean yield on each of several different randomly assigned plots of
land, Fisher eliminated or "averaged out" the effects of nontreatment influences (such as weather
and soil content) so that only the effect of the choice of fertilizer was reflected in differences
among the mean yields. For educational research, Fisher's randomization concept became the
ideal: hypothetical classrooms (or students) are treated as if they could be assigned randomly to
different instructional procedures.
In education research, however, the data are typically not generated by well-defined
experiments employing random sampling procedures. Our ability to extract causal inferences
from an analysis is affected by sample selection procedures and estimation methods. The
absence of experimental data has undoubtedly provided the incentive for the econometricians to
exert their considerable ingenuity to devise powerful methods to identify, separate out, and
evaluate the magnitude of the influence exercised by each of the many variables that determine
the shape of any economic phenomenon. Econometricians have been pioneers in the design of
techniques to deal with missing observations, missing variables, errors in variables, simultaneous
relationships and pooled cross-section and time-series data. In short, and as comprehensively
reviewed by Imbens and Wooldridge (2009), they have found it useful to design an armory of
analytic weapons to deal with the messy and dirty statistics obtained from opportunistic samples,
seeking to extract from them the ceteris paribus relationships that empirical work in the natural
sciences is able to average out with the aid of randomized experiments. At the request of the
Council for Economic Education and the American Economic Association Committee on
Economic Education, I have developed four modules that will enable researchers to employ these
weapons in their empirical studies of educational practices, with special attention given to the
teaching and learning of economics.
MODULE ONE
Although least-squares estimated multiple regression was first employed in policy debate by
George Yule at the turn of the 20th century, economic educators and general educationalists alike
continue to rely on it to extract the effects of unwanted influences. ii This is the first of the four
modules designed to demonstrate and enable researchers to move beyond these basic regression
methods to the more advanced techniques of the 21st century using any one of three computer
programs: LIMDEP (NLOGIT), STATA and SAS.
This module is broken into four parts. Part One introduces the nature of data and the
basic data generating processes for both continuous and discrete dependent variables. The lack
of a fixed variance in the population error term (heteroscedasticity) is introduced and discussed.
Parts Two, Three and Four show how to get that data into each one of the three computer
programs: Part Two for LIMDEP (NLOGIT), Part Three for STATA (by Ian McCarthy, Senior
Consultant, FTI Consulting) and Part Four for SAS (by Gregory Aaron William Gilpin, Assistant
Professor of Economics, Montana State University). Parts Two, Three and Four also provide
the respective computer program commands to do least-squares estimation of the standard
learning regression model involving a continuous dependent test-score variable, along with the
procedures to adjust for heteroscedastic errors. The maximum likelihood routines to estimate
probit and logit models of discrete choice are also provided using each of the three programs.
Finally, statistical tests of coefficient restrictions and model structure are presented.
MODULE TWO
MODULE THREE (co-authored by W. E. Becker, J.J. Siegfried and W. H. Greene)
As seen in Modules One and Two, the typical economic education empirical study involves an
assessment of learning between a pretest and a posttest. Other than the fact that testing occurs at
two different points in time (before and after an intervention) , there is no time dimension in this
model of learning. Panel data analysis provides an alternative structure in which measurements
on the cross section of subjects are taken at regular intervals over multiple periods of time.
Collecting data on the cross section of subjects over time enables a study of change. It also
opens the door for economic education researchers to look at things other than test scores that
vary with time.
This third in the series of four modules provides an introduction to panel data analysis
with specific applications to economic education. The data structure for a panel along with
constant coefficient, fixed effects and random effects representations of the data generating
processes are presented. Consideration is given to different methods of estimation and testing.
Finally, as in Modules One and Two, contemporary estimation and testing procedures are
demonstrated in Parts Two, Three and Four using LIMDEP (NLOGIT), STATA (by Ian
McCarthy) and SAS (by Gregory Gilpin).
MODULE FOUR
In the assessment of student learning that occurs between the start of a program (as measured, for
example, by a pretest) and the end of the program (posttest), there is an assumption that all the
students who start the program finish the program. There is also an assumption that those who
start the program are representative of, or at least are a random sample of, those for whom an
inference is to be made about the outcome of the program. This module addresses how these
assumptions might be wrong and how problems of sample selection might occur because of
unobservable or unmeasurable phenomena as well as things that can be observed and measured.
Attention is given to the Heckman-type models, regression discontinuity and propensity score
matching. As in the previous modules, contemporary estimation and testing procedures are
demonstrated in Parts Two, Three and Four using LIMDEP (NLOGIT), STATA (by Ian
McCarthy) and SAS (by Gregory Gilpin).
RECOGNITION
As with the 1969-1970 and 1987-1988 workshops, these four modules have been made possible
through a cooperative effort of the National Council on Economic Education and the American
Economic Association Committee on Economic Education. This new work is part of the
Absolute Priority direct activity component of the Excellence in Economic Education grant to the
NCEE and funded by the U.S. Department of Education Office of Innovation and Improvement.
Special thanks are due Ian McCarthy and Greg Gilpin for their care in duplicating in STATA and
SAS what I did in LIMDEP (in Modules One, Two and Three). Comments and constructive
criticism received from William Bosshardt, Jennifer (gigi) Foster, Peter Kennedy, Mark Maier,
KimMarie McGoldrick, Gail Hoyt, Martin Shanahan, Robert Toutkoushian and Michael Watts in
the Beta testing must also be thankfully acknowledged. Finally, as with all my writing, Suzanne
Becker must be thanked for her patience in dealing with me and excellence in editing.
REFERENCES
Becker, William E. and William J. Baumol (1996). Assessing Educational Practices: The
Contribution of Economics. Cambridge, MA: MIT Press.
Buckles, Stephen and Robert Highsmith (1990). “Preface to Special Research Issue,” Journal of
Economic Education, (Summer): 229-230.
Fisher, R. A. (1970). Statistical methods for research workers, 14th ed. New York: Hafner.
Simpson, E.H. (1951). “The interpretation of interaction in contingency tables”, Journal of the
Royal Statistical Society, Series B, 13: 238-241.
Stigler, Stephen M. (1986). The History of Statistics: The Measurement of Uncertainty before
1900. Cambridge: Harvard University Press.
Welsh, Arthur L. (1972). Research Papers in Economic Education. New York: Joint Council on
Economic Education.
ENDNOTES
i. In what is possibly the most influential book in statistics, Statistical Methods for
Research Workers, Sir Ronald Fisher wrote:
Statistical methods are essential to social studies, and it is principally by the aid of such
methods that these studies may be raised to the rank of sciences. This particular
dependence of social studies upon statistical methods has led to the unfortunate
misapprehension that statistics is to be regarded as a branch of economics, whereas in
truth methods adequate to the treatment of economic data, in so far as these exist, have
mostly been developed in biology and the other sciences. (1970, p. 2)
Fisher's view, traceable to the first version of his book in 1925, is still held today by many
scholars in the natural sciences and departments of mathematics. Econometricians (economists
who apply statistical methods to economics), psychometricians, cliometricians, and other
"metricians" in the social sciences have different views of the process by which statistical
methods have developed. Given that Fisher's numerous and great contributions to statistics were
in applications within biology, genetics, and agriculture, his view is understandable although it is
disputed by sociologist Clifford Clogg and the numerous commenters on his article "The Impact
of Sociological Methodology on Statistical Methodology," Statistical Science, (May 1992).
Econometricians certainly have contributed a great deal to the tool box, ranging from
simultaneous equation estimation techniques to a variety of important tests of the validity of a
statistical inference.
ii. George Yule (1871-1951) designed "net or partial regression" to represent the influence
of one variable on another, holding other variables constant. He invented the multiple
correlation coefficient R for the correlation of y with many x's. Yule's regressions looked much
like those of today. In 1899, for instance, he published a study in which changes in the
percentage of persons in poverty in England between 1871 and 1881 were explained by the
change in the percentage of disabled relief recipients to total relief recipients (called the "out-
relief ratio"), the percentage change in the proportion of old people, and the percentage change in
the population.
Stigler (1986, pp. 356-7) reports that although Yule's regression analysis of poverty was
well known at the time, it did not have an immediate effect on social policy or statistical
practices. In part this meager response was the result of the harsh criticism it received from the
leading English economist, A.C. Pigou. In 1908, Pigou wrote that statistical reasoning could not
be rightly used to establish the relationship between poverty and out-relief because even in a
multiple regression (which Pigou called "triple correlation") the most important influences,
superior program management and restrictive practices, cannot be measured quantitatively.
Pigou thereby offered the most enduring criticism of regression analysis: the possibility
that an unmeasured but relevant variable has been omitted from the regression and that it is this
variable that is really responsible for the appearance of a causal relationship between the
dependent variable and the included regressors. Both Yule and Pigou recognized the difference
between marginal and partial association.
Today some statisticians assign credit for this identification to E. H. Simpson (1951); the
proposition referred to as "Simpson's paradox" points out that marginal and partial association
can differ even in direction so that what is true for parts of a sample need not be true for the
entire sample. This is represented in the following figure where the two separate regressions of y
on the high values of x and the low values of x have positive slopes but a single regression fit to
all the data shows a negative slope. As in Pigou's criticism of Yule 90 years ago, this "paradox"
may be caused by an omitted but relevant explanatory variable for y that is also related to x. It
may be better named the "Yule-Pigou paradox" or "Yule-Pigou effect." As demonstrated in
Module Two, much of modern day econometrics has dealt with this problem and its occurrence in
education research.
[Figure: scatter of y against x in which separate regressions fit to the low-x and the high-x
observations each have positive slopes, while a single regression fit to all the data has a negative
slope.]
MODULE ONE, PART ONE: DATA-GENERATING PROCESSES
William E. Becker
Professor of Economics, Indiana University, Bloomington, Indiana, USA
Adjunct Professor of Commerce, University of South Australia, Adelaide, Australia
Research Fellow, Institute for the Study of Labor (IZA), Bonn, Germany
Editor, Journal of Economic Education
Editor, Social Science Research Network: Economic Research Network Educator
This is Part One of Module One. It highlights the nature of data and the data-generating process,
which is one of the key ideas of modern day econometrics. The difference between cross-section
and time-series data is presented and followed by a discussion of continuous and discrete
dependent variable data-generating processes. Least-squares and maximum-likelihood
estimation is introduced along with analysis of variance testing. This module assumes that the
user has some familiarity with estimation and testing from previous statistics and introductory
econometrics courses. Its purpose is to bring that knowledge up to date. These contemporary
estimation and testing procedures are demonstrated in Parts Two, Three and Four, where data are
respectively entered into LIMDEP, STATA and SAS for estimation of continuous and discrete
dependent variable models.
In the natural sciences, researchers speak of collecting data but within the social sciences it is
advantageous to think of the manner in which data are generated either across individuals or over
time. Typically, economic education studies have employed cross-section data. The term cross-
section data refers to observations on each of a broad set of entities in a given time period, for
example 100 Test of Economic Literacy (TEL) test scores matched to time usage for final
semester 12th graders in a given year. Time-series data, in contrast, are values for a given
category in a series of sequential time periods, i.e., the total number of U.S. students who
completed a unit in high school economics in each year from 1980 through 2008. Cross-section
data sets typically consist of observations of different individuals all collected at a point in time.
Time-series data sets have been primarily restricted to institutional data collected over particular
intervals of time.
More recently empirical work within education has emphasized panel data, which are a
combination of cross-section and time-series data. In panel analysis, the same group of
individuals (a cohort) is followed over time. In a cross-section analysis, things that vary among
individuals, such as sex, race and ability, must either be averaged out by randomization or taken
into account via controls. But sex, race, ability and other personal attributes tend to be constant
from one time period to another and thus do not distort a panel study even though the assignment
of individuals among treatment/control groups is not random. Only one of these four modules
will be explicitly devoted to panel data.
Test scores, such as those obtained from the TEL or Test of Understanding of College
Economics (TUCE), are typically assumed to be the outcome of a continuous variable Y that may
be generated by a process involving a deterministic component (e.g., the mean of Y, μy, which
might itself be a function of some explanatory variables X1, X2 … Xk) and the purely random
perturbation or error term components v and ε:

Yit = μy + vit + εit , with μy = f(Xit1, Xit2, … Xitk),

where Yit is the test score of the ith person at time t and the it subscripts similarly indicate
observations for the ith person on the X explanatory variables at time t. Additionally, normality
of the continuous dependent variable is ensured by assuming the error term components are
normally distributed with means of zero and constant variances: vit ~ N(0, σv²) and εit ~ N(0, σε²).
As a continuous random variable that gets its normal distribution from ε, Y can at least
theoretically take on any value. But as a test score, Y is only supported on values from zero to
the maximum test score, which for the TUCE is 30. In addition, multiple-
choice test scores like the TUCE can only assume whole number values between 0 and 30, which
poses problems that are addressed in these four modules.
The change score model (also known as the value-added model, gain score model or
achievement model) is just a variation on the above basic model, with the change in test scores
as the dependent variable:

Yit − Yit−1 = μΔy + vit + εit ,

where Yit−1 is the test score of the ith person at time t−1 and μΔy may again be a function of the
explanatory X variables. If one of the X variables is a bivariate dummy variable included to
capture the effect of a treatment over a control, then this model is called a difference in
difference model. Yit is now referred to as the post-treatment score or posttest and Yit−1 is the pre-treatment score or
pretest. Again, the dependent variable Yit −Yit-1 can be viewed as a continuous random variable,
but for multiple-choice tests, this difference is restricted to whole number values and is bounded
by the absolute value of the test score’s minimum and maximum.
This difference in difference model is often used with cross-section data that ignore the
time-series implications associated with the dependent variable (and thus the error term)
involving two periods. For such models, ordinary least-squares estimation, as demonstrated in
Parts Two, Three and Four, is typically employed.
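Parts Two, Three and Four give the LIMDEP (NLOGIT), STATA and SAS commands for this estimation. As a language-neutral illustration only, the following is a minimal Python sketch (statsmodels) of least-squares estimation of the change-score, difference in difference model with heteroscedasticity-robust standard errors; the data are simulated and the variable names (post, pre, treatment, female) are hypothetical stand-ins.

    # Simulate a small hypothetical data set and estimate the change-score
    # (difference in difference) model by ordinary least squares with
    # heteroscedasticity-robust (White-type) standard errors.
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 200
    df = pd.DataFrame({"treatment": rng.integers(0, 2, n),   # 1 = treated, 0 = control
                       "female": rng.integers(0, 2, n)})
    df["pre"] = 15 + rng.normal(0, 3, n)                     # pretest score
    df["post"] = df["pre"] + 2.0 * df["treatment"] + rng.normal(0, 3, n)

    y = df["post"] - df["pre"]                               # change score
    X = sm.add_constant(df[["treatment", "female"]])         # treatment dummy plus control
    result = sm.OLS(y, X).fit(cov_type="HC1")                # robust standard errors
    print(result.summary())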
Following the lead of Hanushek (1986, 1156-57), the change-score model has been
thought of as a special case of an allegedly superior regression involving a lagged dependent
variable, where the coefficient of adjustment (λ0*) is set equal to one for the change-score model:

Yit = λ0*Yit−1 + μy + vit + εit .
Allison (1990) rightfully called this interpretation into question, arguing that these are two
separate models (change score approach and regressor variable approach) involving different
assumptions about the data generating process. If it is believed that there is a direct causal
relationship Yit −1 ⇒ Yit or if the other explanatory X variables are related to the Yit-1 to Yit
transition, then the regressor variable approach is justified. But, as demonstrated to economic
educators as far back as Becker (1983), the regressor variable model has a built-in bias
associated with the regression to the mean phenomenon. Allison concluded, “The important
point is that there should be no automatic preference for either model and that the only proper
basis for a choice is a careful consideration of each empirical application . . . . In ambiguous
cases, there may be no recourse but to do the analysis both ways and to trust only those
conclusions that are consistent across methods.” (p. 110)
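Allison's advice can be followed mechanically: estimate the treatment effect both ways and see whether the conclusions agree. The short Python sketch below does exactly that on simulated data; the variable names are hypothetical and neither specification is presented as the preferred one.

    # Fit both the change-score model and the regressor variable (lagged
    # dependent variable) model and compare the estimated treatment effects.
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n = 200
    df = pd.DataFrame({"treatment": rng.integers(0, 2, n)})
    df["pre"] = 15 + rng.normal(0, 3, n)
    df["post"] = df["pre"] + 2.0 * df["treatment"] + rng.normal(0, 3, n)

    change_score = sm.OLS(df["post"] - df["pre"],
                          sm.add_constant(df[["treatment"]])).fit()
    regressor_var = sm.OLS(df["post"],
                           sm.add_constant(df[["pre", "treatment"]])).fit()
    print("change-score treatment estimate:    ", change_score.params["treatment"])
    print("lagged-dependent treatment estimate:", regressor_var.params["treatment"])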
As pointed out by Allison (1990) and Becker, Greene and Rosen (1990), at roughly the
same time, and earlier by Becker and Salemi (1977) and later by Becker (2004), models to avoid
are those that place a change score on the left-hand side and a pretest on the right. Yet,
educational researchers continue to employ this inherently faulty design. For example, Hake
(1998) constructed a “gap closing variable (g)” as the dependent variable and regressed it on the
pretest, where

g = (posttest − pretest)/(maxscore − pretest).
More specifically, Bandiera, Larcinese and Rasul estimated the following panel data
model for the ith student, enrolled on a degree program offered by department d, in time period t,
where gidct is the ith student’s grade in department d for course (or essay) c at time t and αi is a
fixed effect that captures time-invariant characteristics of the student that affect his or her grade
across time periods, such as his or her underlying motivation, ability, and labor market options
upon graduation. Because each student can only be enrolled in one department or degree
program, αi also captures all department and program characteristics that affect grades in both
periods, such as the quality of teaching and the grading standards. Fc is equal to one if the
student obtains feedback on his or her grade on course c and Tt identifies the first or second time
period, Xc includes a series of course characteristics that are relevant for both examined courses
and essays, and all other controls are as previously defined. TDidˊ is equal to one if student i
took any examined courses offered by department dˊ and is zero otherwise; it accounts for
differences in grades due to students taking courses in departments other than their own
department d. Finally, εidct is a disturbance term.
As specified, this model does not control for past grades (or expected grades), which is
the essence of a change-score model. It should have been specified either as a change-score
model or as a regressor variable model with the past grade included on the right-hand side.
Obviously, there is no past grade for the first period and that is in part why a panel data
set up has historically not been used when only “pre” and “post” measures of performance are
available. Notice that the treatment dummy variable coefficient β is estimated inconsistently,
with bias, if the relevant past course grades in the second period essay-grade equation are
omitted. As discussed in Module Three on panel data studies, bringing a lagged dependent
variable into panel data analysis poses additional estimation problems. The point emphasized
here is that a change-score model must be employed in assessing a treatment effect. In Module Four,
propensity score matching models are introduced as a means of doing this, as an alternative to
the least-squares methods employed in this module.
In many problems, the dependent variable cannot be treated as continuous. For example,
whether one takes another economics course is a bivariate variable that can be represented by Y =
1, if yes or 0, if not, which is a discrete choice involving one of two options. As another
example, consider count data of the type generated by the question: how many more courses in
economics will a student take (0, 1, 2 …)? Here increasing positive values are increasingly
unlikely. Grades provide another example of a discrete dependent variable where order matters
but there are no unique number line values that can be assigned. The grade of A is better than B
but not necessarily by the same magnitude that B is better than C. Typically A is assigned a 4, B
a 3 and C a 2 but these are totally arbitrary and do not reflect true number line values. The
dependent variable might also have no apparent order, as with the choice of a class to take in a
semester – for example, in the decision to enroll in economics 101, sociology 101, psychology
101 or whatever, one course of study cannot be given a number greater or less than another with
the magnitude having meaning on a number line.
In this module we will address the simplest of the discrete dependent variable models;
namely, those involving the bivariate dependent variable in the linear probability, probit and
logit models.
Linear Probability Model
Consider the binary choice model where Yi = 1, with probability Pi, or Yi = 0, with probability
(1 − Pi). In the linear probability regression model Yi = β1 + β2xi + εi, E(εi) = 0 implies
E(Yi|xi) = β1 + β2xi, where also E(Yi|xi) = (0)[1 − (Pi|xi)] + (1)(Pi|xi) = Pi|xi. Thus,
E(Yi|xi) = β1 + β2xi = Pi|xi, which we will write simply as Pi. That is, the expected value of the
0 or 1 bivariate dependent variable, conditional on the explanatory variable(s), is the probability
of a success (Y = 1). We can interpret a computer-generated, least-squares prediction of E(Y|x)
as the probability that Y = 1 at that x value.
In addition, the mean of the population error in the linear probability model is zero:

E(ε) = (1 − β1 − β2x)P + (0 − β1 − β2x)(1 − P) = P − β1 − β2x = P − E(Y|x) = 0, for P = E(Y|x).
However, the least squares Ŷ can be negative or greater than one, which makes it a peculiar
predictor of probability. Furthermore, the variance of ε depends on x,

var(εi) = Pi(1 − Pi) = (β1 + β2xi)(1 − β1 − β2xi),

so the error term is inherently heteroscedastic. An adjustment for heteroscedasticity in the linear
probability model can be made via a generalized least-squares procedure but the problem of
constraining β1 + β 2 xi to the zero – one
interval cannot be easily overcome. Furthermore, although predictions are continuous, epsilon
cannot be assumed to be normally distributed as long as the dependent variable is bivariate,
which makes suspect the use of the computer-generated t statistic. It is for these reasons that
linear probability models are no longer widely used in educational research.
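To make the discussion concrete, here is a small simulated illustration in Python (statsmodels); it is only a sketch and every variable name is hypothetical. It fits a linear probability model by least squares with White-type robust standard errors and then counts the fitted values that stray outside the zero – one interval.

    # Linear probability model: OLS of a 0/1 outcome on a regressor, with
    # heteroscedasticity-robust standard errors.  Some fitted "probabilities"
    # typically fall outside the [0, 1] interval, the weakness noted above.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 500
    x = rng.normal(0, 1, n)
    p_true = 1 / (1 + np.exp(-(0.5 + 1.5 * x)))     # true probability of success
    y = rng.binomial(1, p_true)

    X = sm.add_constant(x)
    lpm = sm.OLS(y, X).fit(cov_type="HC1")          # robust standard errors
    fitted = lpm.predict(X)
    print(lpm.params)
    print("fitted values outside [0, 1]:", int(((fitted < 0) | (fitted > 1)).sum()))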
Probit Model
Ideally, the estimates of the probability of success (Y = 1) will be consistent with probability
theory with values in the 0 to 1 interval. One way to do this is to specify a probit model, which
is then estimated by computer programs such as LIMDEP, SAS and STATA that use maximum
likelihood routines. Unlike least squares, which selects the sample regression coefficient to
minimize the squared residuals, maximum likelihood selects the coefficients in the assumed
data-generating model to maximize the probability of getting the observed sample data.
The probit model starts by building a bridge or mapping between the 0s and 1s to be
observed for the bivariate dependent variable and an unobservable or hidden (latent) variable that
is assumed to be the driving force for the 0s and 1s:
Ii* = β1 + β2Xi2 + β3Xi3 + β4Xi4 + εi = Xiβ, where εi ~ N(0, 1), with Yi = 1 when Ii* > 0 and
Yi = 0 otherwise. G( ) and g( ) are the standard normal distribution and density functions, and

P(Y = 1) = G(Xβ) = ∫_{−∞}^{Xβ} g(t) dt .
Within economics the latent variable I* is interpreted as net utility or propensity to take
action. For instance, I* might be interpreted as the net utility of taking another economics
course. If the net utility of taking another economics course is positive, then I* is positive,
implying another course is taken and Y = 1. If the net utility of taking another economics course
is negative, then the other course is not taken, I* is negative and Y = 0.
The idea behind maximum likelihood estimation of a probit model is to maximize the
density L with respect to β and σ, where the likelihood function for the n observed 0s and 1s is

L = Π_{i=1}^{n} [G(Xiβ)]^{Yi} [1 − G(Xiβ)]^{1−Yi} .

The calculation of ∂L/∂β is not convenient but the logarithm (ln) of the likelihood function is
easily differentiated, with

∂ ln L/∂β = L⁻¹ ∂L/∂β .
Intuitively, the strategy of maximum likelihood (ML) estimation is to maximize (the log of) this
joint density for the observed data with respect to the unknown parameters in the beta vector,
where σ is set equal to one. The probit maximum likelihood computation is a little more difficult
than for the standard classical regression model because it is necessary to compute the integrals
of the standard normal distribution. But computer programs can do the ML routines with ease in
most cases if the sample sizes are sufficiently large. See William Greene, Econometric Analysis
(5th Edition, 2003, pp. 670-671) for the joint density and likelihood function that lead to the
likelihood equations for ∂ ln L/∂β .
The unit of measurement, and thus the magnitude of the probit coefficients, is set by the
assumption that the variance of the error term ε is unity. That is, the magnitudes of the estimated
probit coefficients have no direct meaning on a number line. If the explanatory variables are
continuous, however, the probit coefficients can be employed to calculate the marginal
probability of success at specific values of the explanatory variables:

∂P(x)/∂x = g(Xβ)βx , where βx is the coefficient on x and g( ) is the density g(z) = ∂G(z)/∂z .
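A hedged Python sketch (statsmodels) of probit estimation follows, with data simulated exactly as the latent-variable story above describes; it reports the maximum likelihood coefficients and the marginal probability evaluated at the mean of the explanatory variable. All names are hypothetical.

    # Probit: simulate I* = XB + e with e ~ N(0,1), observe Y = 1 when I* > 0,
    # estimate by maximum likelihood, and compute the marginal effect g(XB)*B.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 500
    x = rng.normal(0, 1, n)
    latent = 0.5 + 1.5 * x + rng.normal(0, 1, n)    # net utility I*
    y = (latent > 0).astype(int)                    # Y = 1 if net utility positive

    X = sm.add_constant(x)
    probit = sm.Probit(y, X).fit()                  # maximum likelihood routine
    print(probit.params)                            # scale set by Var(e) = 1
    print(probit.get_margeff(at="mean").summary())  # marginal effect at the mean x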
Logit Model
An alternative to the probit model is the logit model, which has nearly identical properties to the
probit, but has a different interpretation of the latent variable I*. To see this, again let
Pi = E(Y = 1 | Xi). In the logit model this probability is given by the logistic function,
Pi = exp(Xiβ)/[1 + exp(Xiβ)], so that the log-odds ratio is linear in the explanatory variables:
ln[Pi /(1 − Pi)] = Xiβ.
A graph of the logistic function G(z) = exp(z)/[1 + exp(z)] looks like the standard normal, as seen
in the following figure, but does not rise or fall to 1.00 and 0.00 as fast:

[Figure: the cumulative standard normal and logistic distribution functions plotted together.]
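The comparison can also be seen numerically. The short Python sketch below evaluates the standard normal distribution function and a logistic distribution function rescaled to have unit variance at the same points; the logistic values approach 0 and 1 more slowly in the tails.

    # Compare the cumulative standard normal with a unit-variance logistic
    # (the standard logistic has variance pi^2/3, hence the rescaling).
    import numpy as np
    from scipy.stats import norm, logistic

    z = np.linspace(-4, 4, 9)
    normal_cdf = norm.cdf(z)
    logistic_cdf = logistic.cdf(z * np.pi / np.sqrt(3.0))
    print(np.column_stack((z, normal_cdf, logistic_cdf)))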
Nonparametrics
As outlined in Becker and Greene (2001), recent developments in theory and computational
procedures enable researchers to work with nonlinear modeling of all sorts as well as
nonparametric regression techniques. As an example of what can be done consider the widely
cited economic education application in Spector and Mazzeo (1980). They estimated a probit
model to shed light on how a student's performance in a principles of macroeconomics class
relates to his/her grade in an intermediate macroeconomics class, after controlling for such things
as grade point average (GPA) going into the class. The effect of GPA on future performance is
less obvious than it might appear at first. Certainly it is possible that students with the highest
GPA would get the most from the second course. On the other hand, perhaps the best students
were already well equipped, and if the second course catered to the mediocre (who had more to
gain and more room to improve) then a negative relationship between GPA and increase in
grades (GRADE) might arise. A negative relationship might also arise if artificially high grades
were given in the first course. The below figure provides an analysis similar to that done by
Spector and Mazzeo (using a subset of their data).
[Figure: estimated probability of a grade increase, Pr[incr], plotted against GPA over the range
2.0 to 4.0.]
In this figure, the horizontal axis shows the initial grade point average of students in the
study. The vertical axis shows the relative frequency of the incremental grades that increase
from the first to the second course. The solid curve shows the estimated relative frequency of
grades that improve in the second course using a probit model (the one used by the authors).
These estimates suggest a positive relationship between GPA and the probability of grade
improvement in the second macroeconomics course throughout the GPA range. The dashed curve in
the figure provides the results using a much less-structured nonparametric regression model. iii
The conclusion reached with this technique is qualitatively similar to that obtained with the
probit model for GPAs above 2.6, where the positive relationship between GPA and the
probability of grade improvement can be seen, but it is materially different for those with GPAs
lower than 2.6, where a negative relationship between GPA and the probability of grade
improvement is found. Possibly these poorer students received gift grades in the introductory
macroeconomics course.
There are other alternatives to least squares that economic education researchers can
employ in programs such as LIMDEP, STATA and SAS. For example, the least-absolute-
deviations approach is a useful device for assessing the sensitivity of estimates to outliers. It is
likely that examples can be found to show that even if least-squares estimation of the conditional
mean is a better estimator in large samples, least-absolute-deviations estimation of the
conditional median performs better in small samples. The critical point is that economic
education researchers must recognize that there are and will be new alternatives to modeling and
estimation routines as currently found in Journal of Economic Education articles and articles in
the other journals that publish this work, as listed in Lo, Wong and Mixon (2008). In this
module and in the remaining three, only passing mention will be given to these emerging
methods of analysis. The emphasis will be on least-squares and maximum-likelihood
estimations of continuous and discrete data-generating processes that can be represented
parametrically.
Hake (1998) reported that he has test scores for 6,542 individual students in 62
introductory physics courses. He works only with mean scores for the classes; thus, his effective
sample size is 62, and not 6,542. The 6,542 students are not irrelevant, but they enter in a way
that I did not find mentioned by Hake. The amount of variability around a mean test score for a
class of 20 students versus a mean for 200 students cannot be expected to be the same.
Estimation of a standard error for a sample of 62, where each of the 62 means receives an equal
weight, ignores this heterogeneity. vi Francisco, Trautman, and Nicoll (1998) recognized that the
number of subjects in each group implies heterogeneity in their analysis of average gain scores in
an introductory chemistry course. Similarly, Kennedy and Siegfried (1997) made an adjustment
for heterogeneity in their study of the effect of class size on student learning in economics.
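One way to build the differing precision of class means into the estimation is weighted least squares with weights proportional to class size, since (as shown in endnote vi) the variance of a class mean is σ²/nt. The following Python sketch uses simulated class-level data; the variable names and the class-level treatment indicator are hypothetical.

    # Weighted least squares for class-mean observations: a mean based on n_t
    # students has variance sigma^2 / n_t, so weights proportional to n_t give
    # the more precisely measured class means more influence.
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    classes = pd.DataFrame({"n_students": rng.integers(20, 200, 62),
                            "treated": rng.integers(0, 2, 62)})
    classes["mean_score"] = (55 + 3 * classes["treated"]
                             + rng.normal(0, 10, 62) / np.sqrt(classes["n_students"]))

    X = sm.add_constant(classes[["treated"]])
    wls = sm.WLS(classes["mean_score"], X, weights=classes["n_students"]).fit()
    print(wls.params)
    print(wls.bse)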
Fleisher, Hashimoto, and Weinberg (2002) considered the effectiveness (in terms of
student course grades and persistence) of 47 foreign graduate student instructors versus 21
native English speaking graduate student instructors in an environment in which English is the
language of the majority of their undergraduate students. Fleisher, Hashimoto, and Weinberg
recognized the loss of information in using the 92 mean class grades for these 68 graduate
student instructors, although they did report aggregate mean class grade effects with the
corrected heterogeneity adjustment for standard errors based on class size. They preferred to
look at 2,680 individual undergraduate results conditional on which one of the 68 graduate
student instructors each of the undergraduates had in any one of 92 sections of the course.
Whatever the unit of measure for the dependent variable (aggregate or individual), the
important point here is recognition that one of two adjustments must be made to
get the correct standard errors. If an aggregate unit is employed (e.g., class means), then an
adjustment for the number of observations making up each aggregate is required. If individual
observations share a common error component (e.g., students grouped into classes), then the standard
errors must reflect this clustering. Computer programs such as LIMDEP (NLOGIT), SAS and
STATA can automatically perform both of these adjustments.
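Parts Two, Three and Four show the LIMDEP (NLOGIT), STATA and SAS commands for these corrections. As a language-neutral illustration, the Python sketch below simulates students grouped into classes with a shared class-level error component and computes cluster-robust standard errors; the group structure and variable names are hypothetical.

    # Individual observations that share a class-level error component call for
    # cluster-robust standard errors; clustering is on the class identifier.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n_classes, per_class = 30, 25
    class_id = np.repeat(np.arange(n_classes), per_class)
    class_effect = rng.normal(0, 2, n_classes)[class_id]    # shared error component
    treated = rng.integers(0, 2, n_classes)[class_id]       # treatment varies by class
    score = 55 + 3 * treated + class_effect + rng.normal(0, 5, len(class_id))

    X = sm.add_constant(treated.astype(float))
    ols = sm.OLS(score, X).fit(cov_type="cluster", cov_kwds={"groups": class_id})
    print(ols.params)
    print(ols.bse)    # standard errors allow correlation within each class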
Students of statistics are familiar with the F statistic as computed and printed in most computer
regression routines under a banner “Analysis of Variance” or just ANOVA. This F is often
presented in introductory statistics textbooks as a test of the overall fit or explanatory power
of the regression. I have learned from years of teaching econometrics that it is better to think of
this test as one of the null hypothesis that all population model slope coefficients are zero (the
explanatory power is not sufficient to conclude that there is any relation between the xs and y in
the population) versus the alternative that at least one slope coefficient is not zero (there is some
explanatory power). Thinking of this F statistic as just a joint test of slope coefficients makes it
easier to recognize that an F statistic can be calculated for any subset of coefficients to test for
joint significance within the subset. Here I present the theoretical underpinnings for extensions
of the basic ANOVA to tests of subsets of coefficients. Parts Two, Three and Four provide the
corresponding commands to do these tests in LIMDEP, STATA and SAS.
Written in matrix form, the estimated regression model is

y = ib1 + X2b2 + e ,

where i is the column of 1’s in the X matrix associated with the intercept b1 and X2 contains the
remaining (K − 1) explanatory x variables associated with the (K − 1) slope coefficients in the b2
vector. The total sum of squared deviations,

TotSS = Σ_{i=1}^{n} (yi − ȳ)² = Σ_{i=1}^{n} yi² − nȳ² = y′y − nȳ² ,

measures the amount of variability in y around ȳ, ignoring any effect of the xs (in essence the b2
vector is assumed to be a vector of zeros). The residual sum of squares is

ResSS = Σ_{i=1}^{n} ei² = e′e .

For calculating the F statistic, computer programs use the equivalent of the following:

F = [(TotSS − ResSS)/(K − 1)] / [ResSS/(n − K)] .
This F is the ratio of two independently distributed Chi-square random variables adjusted for
their respective degrees of freedom. The relevant decision rule for rejecting the null hypothesis
is that the probability of this calculated F value or something greater, with K − 1 and n − K
degrees of freedom, is less than the typical (0.10, 0.05 or 0.01) probabilities of a Type I error.
Calculation of the F statistic in this manner, however, is just a special case of running two
regressions: a restricted and an unrestricted one. The restricted regression is computed with all the
slope coefficients set equal (or restricted) to zero, so Y is regressed only on the column of ones.
This restricted regression is the same as using the sample mean ȳ to predict Y regardless of the
values of the xs. The restricted residual sum of squares, e′r er, is what is usually called the total
sum of squares, TotSS = y′y − nȳ². The unrestricted regression allows all of the slope coefficients
to find the values that minimize the residual sum of squares, which is thus called the unrestricted
residual sum of squares, e′u eu, and is usually just listed in a computer printout as the residual sum
of squares, ResSS = e′e.
The idea of a restricted and unrestricted regression can be extended to test any subset of
coefficients. For example, say the full model for a posttest Y is
Yi = β1 + β2xi2 + β3xi3 + β4xi4 + εi .

Let’s say the claim is made that x3 and x4 do not affect Y. One way to interpret this is to specify
that β3 = β4 = 0, but β2 ≠ 0. The dependent variable is again decomposed into two components,
but now x2 is included with the intercept in the partitioning of the X matrix:

y = X1b1 + X2b2 + e ,

where X1 is the n × 2 matrix with the first column containing ones and the second containing the
observations on x2 (b1 contains the y intercept and the x2 slope coefficient) and X2 is the n × 2
matrix with two columns for x3 and x4 (b2 contains the x3 and x4 slope coefficients). If the claim
about x3 and x4 not affecting Y is correct, the restricted model is

Yi = β1 + β2xi2 + εi .

The null hypothesis is H0: β3 = β4 = 0; i.e., x2 might affect Y but x3 and x4 do not affect Y. The
corresponding F statistic, with J = 2 restrictions, is

F = [(e′r er − e′u eu)/J] / [e′u eu/(n − K)] ,

where the restricted residual sum of squares e′r er is obtained from a simple regression of Y on x2,
including a constant, and the unrestricted sum of squared residuals e′u eu is obtained from a
regression of Y on x2, x3 and x4, including a constant.
In general, it is best to test the overall fit of the regression model before testing any subset
or individual coefficients. The appropriate hypotheses and F statistic are

H0: β2 = β3 = … = βK = 0 (or H0: R² = 0)
HA: at least one slope coefficient is not zero (or HA: R² ≠ 0)

F = [(y′y − nȳ²) − e′e]/(K − 1) divided by e′e/(n − K) .

If the calculated value of this F is significant, then subsets of the coefficients can be tested with

H0: βs = βt = … = 0
HA: at least one of these slope coefficients is not zero

F = [(e′r er − e′u eu)/J] / [e′u eu/(n − K)] ,

where J is the number of coefficients restricted to zero. The restricted residual sum of squares
e′r er is obtained by a regression on only the q xs that did not have their coefficients restricted to
zero. Any number of subsets of coefficients can be tested in this framework of restricted and
unrestricted regressions, as summarized in the following examples.
H0: β4 = β5 = 0    HA: β4 or β5 ≠ 0
where "ChangeScore" is the difference between a student's test scores at the end and
beginning of a course in economics, female = 1, if female and 0 if male, "treatment" = 1,
if in the treatment group and 0 if not, and "GPA" is the student's grade point average
before enrolling in the course.
The F test of subsets of coefficients is also ideal for testing for fixed effects as reflected in sets of
dummy variables. For example, in Parts Two, Three and Four an F test is performed to check
whether there is any fixed difference in test performance among four classes taking economics
using the following assumed data generating process:

posti = β1 + β2 prei + β3 class1i + β4 class2i + β5 class3i + εi

H0: β3 = β4 = β5 = 0    HA: β3, β4 or β5 ≠ 0

where “post” is a student’s post-course test score, “pre” is the student’s pre-course test
score, and “class” identifies to which one of the four classes the student was assigned,
e.g., class3 = 1 if the student was in the third class and class3 = 0 if not. The fixed effect for
students in the fourth class (for whom class1, class2 and class3 are all zero) is captured in the
intercept β1.
It is important to notice in this test of fixed class effects that the relationship between the post
and pre test (as reflected in the slope coefficient β2) is assumed to be the same regardless of the
class to which the student was assigned. The next section describes a test for any structural
difference among the groups.
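Parts Two, Three and Four carry out this test in LIMDEP, STATA and SAS. For concreteness, here is a hedged Python sketch of the same F test on simulated data; it computes the restricted and unrestricted residual sums of squares by hand and then repeats the test with a built-in routine. Variable names mirror the post/pre/class example but are otherwise hypothetical.

    # F test that the class fixed effects are jointly zero: compare the
    # restricted regression (post on pre only) with the unrestricted regression
    # (post on pre and three class dummies, class 4 in the intercept).
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from scipy import stats

    rng = np.random.default_rng(0)
    n = 120
    df = pd.DataFrame({"pre": rng.normal(15, 3, n),
                       "class_id": rng.integers(1, 5, n)})
    df["post"] = 5 + 0.8 * df["pre"] + rng.normal(0, 2, n)
    dummies = pd.get_dummies(df["class_id"], prefix="class").astype(float)

    Xu = sm.add_constant(pd.concat([df[["pre"]],
                                    dummies[["class_1", "class_2", "class_3"]]], axis=1))
    Xr = sm.add_constant(df[["pre"]])
    unrestricted = sm.OLS(df["post"], Xu).fit()
    restricted = sm.OLS(df["post"], Xr).fit()

    J = 3                                              # number of restrictions
    F = ((restricted.ssr - unrestricted.ssr) / J) / (unrestricted.ssr / unrestricted.df_resid)
    print(F, stats.f.sf(F, J, unrestricted.df_resid))  # by-hand F and p-value
    print(unrestricted.f_test("class_1 = 0, class_2 = 0, class_3 = 0"))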
Earlier in our discussion of the difference in difference or change score model, a 0-1 bivariate
dummy variable was introduced to test for a difference in intercepts between a treatment and
control group, which could be done with a single coefficient t test. However, the expected
difference in the dependent variable for the two groups might not be constant. It might vary with
the level of the independent variables. Indeed, the appropriate model might be completely
different for the two groups. Or, it might be the same.
Allowing for any type of difference between the control and experimental groups
implies that the null and alternative hypotheses are

H0: β1 = β2    HA: β1 ≠ β2

where β1 and β2 are K × 1 column vectors containing the K coefficients β1, β2, β3, … βK for the
control (β1) and the experimental (β2) groups. Let X1 and X2 contain the observations on the
explanatory variables corresponding to β1 and β2, including the column of ones for the
constant. The unrestricted regression is captured by two separate regressions:
⎡ y1 ⎤   ⎡ X1   0 ⎤ ⎡ β1 ⎤   ⎡ ε1 ⎤
⎢    ⎥ = ⎢        ⎥ ⎢    ⎥ + ⎢    ⎥ .
⎣ y2 ⎦   ⎣ 0   X2 ⎦ ⎣ β2 ⎦   ⎣ ε2 ⎦
That is, the unrestricted model is estimated by fitting the two regressions separately. The
unrestricted residual sum of squares is obtained by adding the residuals from these two
regressions. The unrestricted degrees of freedom are similarly obtained by adding the degrees of
freedom of each regression.
The restricted regression imposes β1 = β2 = β and pools the data:

⎡ y1 ⎤   ⎡ X1 ⎤       ⎡ ε1 ⎤
⎢    ⎥ = ⎢    ⎥ [β] + ⎢    ⎥ .
⎣ y2 ⎦   ⎣ X2 ⎦       ⎣ ε2 ⎦

That is, the restricted residual sum of squares is obtained from a regression in which the data
from the two groups are pooled and a single set of coefficients is estimated for the pooled data
set.
For the two-group case the test statistic is

F = [(ResSSr − ResSSu)/K] / [ResSSu/(n − 2K)] ,

where the unrestricted ResSSu is the residual sum of squares from a regression on only those in
the control group plus the residual sum of squares from a regression on only those in the
treatment group, and the restricted ResSSr comes from the pooled regression.
Thus, to test for structural change over J regimes, run separate regressions on each and
add up the residual sums of squares to obtain the unrestricted residual sum of squares, ResSSu,
with df = n − JK. The restricted residual sum of squares is ResSSr, with df = n − K.
F = [(ResSSr − ResSSu) / K(J − 1)] / [ResSSu /(n − JK)] .
This form of testing for a difference among groups is known in economics as a Chow
Test. As demonstrated in Part Two using LIMDEP and Parts Three and Four using STATA and
SAS, any number of subgroups could be tested by adding up their individual residual sums of
squares and degrees of freedom to form the unrestricted residual sums of squares and matching
degrees of freedom.
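A hedged Python sketch of a two-group Chow test on simulated data follows; the group labels and variable names are hypothetical. The unrestricted residual sum of squares is the sum of the residual sums of squares from the two separate regressions, and the restricted one comes from the pooled regression.

    # Chow test with J = 2 regimes and K = 2 coefficients (intercept and slope).
    import numpy as np
    import statsmodels.api as sm
    from scipy import stats

    rng = np.random.default_rng(0)
    n = 100
    x = rng.normal(15, 3, 2 * n)
    group = np.repeat([0, 1], n)                       # 0 = control, 1 = treatment
    y = np.where(group == 0, 2 + 0.7 * x, 6 + 0.9 * x) + rng.normal(0, 2, 2 * n)

    def ssr(yy, xx):
        """Residual sum of squares from a simple regression with a constant."""
        return sm.OLS(yy, sm.add_constant(xx)).fit().ssr

    ssr_u = ssr(y[group == 0], x[group == 0]) + ssr(y[group == 1], x[group == 1])
    ssr_r = ssr(y, x)                                  # pooled (restricted) regression
    K, J, N = 2, 2, 2 * n
    F = ((ssr_r - ssr_u) / (K * (J - 1))) / (ssr_u / (N - J * K))
    print(F, stats.f.sf(F, K * (J - 1), N - J * K))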
Depending on the nature of the model being estimated and the estimation method, computer
programs will produce alternatives to the F statistics for testing (linear and nonlinear) restrictions
and structural changes. What follows is only an introduction to these statistics that should be
sufficient to give meaning to the numbers produced based on our discussion of ANOVA above.
The Wald (W) statistic follows the Chi-squared distribution with J degrees of freedom,
reflecting the number of restrictions imposed:
W = (e′r er − e′u eu) / (e′u eu /n) ~ χ²(J) .

In terms of the F statistic,

W = [nJ/(n − k)] F = [J/(1 − k/n)] F ,

and for large n this is approximately

W = JF .
The likelihood ratio (LR) test is formed by twice the difference between the log-
likelihood function for an unrestricted regression (Lur) and its value for the restricted regression
(Lr):

LR = 2(Lur − Lr) > 0 .
Under the null hypothesis that the J restrictions are true, LR is distributed Chi-square with J
degrees of freedom.
The Lagrange multiplier test (LM) is based on the gradient (or score) vector

⎡ ∂L/∂β  ⎤   ⎡ X′ε/σ²                ⎤
⎢        ⎥ = ⎢                       ⎥ ,
⎣ ∂L/∂σ² ⎦   ⎣ −(n/2σ²) + (ε′ε/2σ⁴)  ⎦

where, as before, to evaluate this score vector with the restrictions we replace e = y − Xb with
er = y − Xbr. After sufficient algebra, the Lagrange statistic is defined by

LM = nJF / {(n − k)[1 + JF/(n − k)]} = W / [1 + (W/n)] .
Thus, LM ≤ LR ≤ W .
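The likelihood ratio statistic is the easiest of the three to compute by hand whenever a maximum likelihood routine reports the log-likelihood. Here is a hedged Python sketch using a probit model with two excluded regressors as the restrictions; all data are simulated and the names are hypothetical.

    # Likelihood ratio test: LR = 2(lnL_unrestricted - lnL_restricted),
    # compared with a chi-square with J degrees of freedom.
    import numpy as np
    import statsmodels.api as sm
    from scipy import stats

    rng = np.random.default_rng(0)
    n = 400
    X = rng.normal(0, 1, (n, 3))                       # three candidate regressors
    y = (0.3 + 1.0 * X[:, 0] + rng.normal(0, 1, n) > 0).astype(int)

    unrestricted = sm.Probit(y, sm.add_constant(X)).fit(disp=0)
    restricted = sm.Probit(y, sm.add_constant(X[:, [0]])).fit(disp=0)  # x2, x3 dropped
    LR = 2 * (unrestricted.llf - restricted.llf)
    J = 2                                              # number of restrictions
    print(LR, stats.chi2.sf(LR, J))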
I like to say to students in my classes on econometrics that theory is easy, data are hard – hard to
find and hard to get into a computer program for statistical analysis. In this first of four parts in
Module One, I provided an introduction to the theoretical data generating processes associated
with continuous versus discrete dependent variables. Parts Two, Three and Four concentrate on
getting the data into one of three computer programs: LIMDEP (NLOGIT), STATA and SAS.
Attention is also given to estimation and testing within regressions employing individual cross-
sectional observations within these programs. Later modules will address complications
introduced by panel data and sources of endogeneity.
REFERENCES

Bandiera, Oriana, Valentino Larcinese and Imran Rasul (2010). “Blissful Ignorance? Evidence
from a Natural Experiment on the Effect of Individual Feedback on Performance,” IZA
Seminar, Bonn, Germany, December 5, 2009. January 2010 version downloadable at
http://www.iza.org/index_html?lang=en&mainframe=http%3A//www.iza.org/en/webcontent/eve
nts/izaseminar_description_html%3Fsem_id%3D1703&topSelect=events&subSelect=seminar
Becker, William E. (1983). “Economic Education Research: Part III, Statistical
Estimation Methods,” Journal of Economic Education, Vol. 14 (Summer): 4-15.
Becker, William E. and William H. Greene (2001). “Teaching Statistics and Econometrics to
Undergraduates,” Journal of Economic Perspectives, Vol. 15 (Fall): 169-182.
Becker, William E., William Greene and Sherwin Rosen (1990). “Research on High School
Economic Education,” American Economic Review, Vol. 80, (May): 14-23, and an expanded
version in Journal of Economic Education, Summer 1990: 231-253.
Becker, William E. and Peter Kennedy (1992). “A Graphical Exposition of the Ordered Probit,”
Econometric Theory, Vol. 8: 127-131.
Becker, William E. and Michael Salemi (1977). “The Learning and Cost Effectiveness of AVT
Supplemented Instruction: Specification of Learning Models,” Journal of Economic Education
Vol. 8 (Spring) : 77-92.
Campbell, D., and D. Kenny (1999). A Primer on Regression Artifacts. New York: The Guilford
Press.
Fleisher, B., M. Hashimoto, and B. Weinberg. 2002. “Foreign GTAs can be Effective Teachers
of Economics.” Journal of Economic Education, Vol. 33 (Fall): 299-326.
Francisco, J. S., M. Trautmann, and G. Nicoll. 1998. “Integrating a Study Skills Workshop and
Pre-Examination to Improve Student’s Chemistry Performance.” Journal of College Science
Teaching, Vol. 28 (February): 273-278.
Greene, William (2003). Econometric Analysis. 5th Edition, New Jersey: Prentice Hall.
Hanushek, Eric A. (1986). “The Economics of Schooling: Production and Efficiency in Public
Schools,” Journal of Economic Literature, Vol. 24 (September): 1141-1177.
Lo, Melody, Sunny Wong and Franklin Mixon (2008). “Ranking Economics Journals, Economics
Departments, and Economists Using Teaching-Focused Research Productivity.” Southern
Economic Journal, Vol. 74 (January): 894-906.
Moulton, B. R. (1986). “Random Group Effects and the Precision of Regression Estimators.”
Journal of Econometrics, Vol. 32 (August): 385-97.
Kennedy, P., and J. Siegfried. (1997). “Class Size and Achievement in Introductory Economics:
Evidence from the TUCE III Data.” Economics of Education Review, Vol. 16 (August): 385-394.
Kvam, Paul. (2000). “The Effect of Active Learning Methods on Student Retention in
Engineering Statistics.” American Statistician, 54 (2): 136-40.
Ramsden, P. (1998). “Managing the Effective University.” Higher Education Research &
Development, 17 (3): 347-70.
Salemi, Michael and George Tauchen. 1987. “Simultaneous Nonlinear Learning Models.” In W.
E. Becker and W. Walstad, eds., Econometric modeling in economic education research, pp.
207-23. Boston: Kluwer-Nijhoff.
Spector, Lee C. and Michael Mazzeo (1980). “Probit Analysis and Economic Education,”
Journal of Economic Education, Vol. 11 (Spring): 37-44.
ii
Let the posttest score ( y1 ) and pretest score ( y 0 ) be defined on the same scale, then the model
of the ith student’s pretest is
y0i = β0(ability)i + v0i ,

where β0 is the slope coefficient to be estimated, v0i is the population error in predicting the ith
student’s pretest score with ability, and all variables are measured as deviations from their
means. The ith student’s posttest is similarly defined by

y1i = β1(ability)i + v1i ,

and after substituting the pretest for unobserved true ability we have

E(Δb/b0) ≤ (Δβ/β0) .
Although v1i and y 0i are unrelated, E( v1i y 0i ) = 0, v0i and y 0i are positively related, E( v0i y 0i ) > 0;
thus, E (Δb / b0 ) ≤ Δβ / β 0 . Becker and Salemi (1977) suggested an instrumental variable
technique to address this source of bias and Salemi and Tauchen (1987) suggested a modeling of
the error term structure.
Hake (1998) makes no reference to this bias when he discusses his regressions and correlation of
average normalized gain, average gain score and posttest score on the average pretest score. If
there is in truth no relationship between the gap-closing measure

g = (posttest − pretest)/(maxscore − pretest)

and the pretest, but a least-squares estimator does not yield a significant negative relationship for
sample data, then there is evidence that something is peculiar. It is the lack of independence
between the pretest and the population error term (caused, for example, by measurement error in
the pretest, simultaneity between g and the pretest, or possible missing but relevant variables)
that is the problem. Hotelling received credit for recognizing this endogenous regressor problem
(in the 1930s) and the resulting regression to the mean phenomenon. Milton Friedman received
a Nobel prize in economics for coming up with an instrumental variable technique (for
estimation of consumption functions in the 1950s) to remove the resulting bias inherent in least-
squares estimators when measurement error in a regressor is suspected. Later Friedman (1992,
p. 2131) concluded: “I suspect that the regression fallacy is the most common fallacy in the
statistical analysis of economic data ...” Similarly, psychologists Campbell and Kenny (1999, p.
xiii) stated: “Regression toward the mean is an artifact that as easily fools statistical experts as lay
people.” But unlike Friedman, Campbell and Kenny did not recognize the instrumental variable
method for addressing the problem.
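The instrumental variable logic can be illustrated with a simple two-stage least squares sketch in Python. Everything here is hypothetical: ability, the testing errors and the instrument z (some variable correlated with ability but not with the testing error in the pretest) are all simulated, and the second-stage OLS standard errors are not the correct 2SLS standard errors, so a dedicated IV routine in LIMDEP, STATA or SAS should be used in practice.

    # Errors-in-variables: the pretest measures ability with error, so OLS of the
    # posttest on the pretest is biased toward zero; instrumenting the pretest
    # with z removes the bias (here the true slope on ability is 0.8).
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 1000
    ability = rng.normal(0, 1, n)
    z = ability + rng.normal(0, 1, n)            # hypothetical instrument
    pretest = ability + rng.normal(0, 1, n)      # pretest = ability + testing error
    posttest = 0.8 * ability + rng.normal(0, 1, n)

    ols = sm.OLS(posttest, sm.add_constant(pretest)).fit()
    stage1 = sm.OLS(pretest, sm.add_constant(z)).fit()
    stage2 = sm.OLS(posttest, sm.add_constant(stage1.fittedvalues)).fit()
    print("OLS slope (attenuated):", ols.params[1])
    print("2SLS slope            :", stage2.params[1])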
In an otherwise innovative study, Paul Kvam (2000) correctly concluded that there was
insufficient statistical evidence to conclude that active-learning methods (primarily through
integrating students’ projects into lectures) resulted in better retention of quantitative skills than
traditional methods, but then went out on a limb by concluding from a scatter plot of individual
student pretest and posttest scores that students who fared worse on the first exam retain
concepts better if they were taught using active-learning methods. Kvan never addressed the
measurement error problem inherent in using the pretest as an explanatory variable. Wainer
(2000) called attention to others who fail to take measurement error into account in labeling
students as “strivers” because their observed test scores exceed values predicted by a regression
equation.
iii
The plot for the probability model was produced by first fitting a probit model of the binary
variable GRADE as a function of GPA. This produces a functional relationship of the form
Prob(GRADE = 1) = Φ(α + βGPA), where estimates of α and β are produced by maximum
likelihood techniques. The graph is produced by plotting the standard normal distribution
function, Φ(α + βGPA), for the values of GPA in the sample, which range between 2.0
and 4.0, then connecting the dots. The nonparametric regression, although intuitively appealing
because it can be viewed as making use of weighted relative frequencies, is computationally
more complicated. [Today the binomial probit model can be fitted with just about any statistical
package but software for nonparametric estimation is less common. LIMDEP (NLOGIT)
version 8.0 (Econometric Software, Inc., 2001) was used for both the probit and nonparametric
estimations.] The nonparametric approach is based on the assumption that there is some as yet
iv
Unlike the mean, the median reflects relative but not absolute magnitude; thus, the median may
be a poor measure of change. For example, the series 1, 2, 3 and the series 1, 2, 300 have the
same median (2) but different means (2 versus 101).
v
To appreciate the importance of the unit of analysis, consider a study done by Ramsden (1998,
pp. 352-354) in which he provided a scatter plot showing a positive relationship between a y-axis
index for his “deep approach” (aimed at student understanding versus “surface learning”) and an
x-axis index of “good teaching” (including feedback of assessed work, clear goals, etc.):
Ramsden’s regression ( y = 18.960 + 0.35307 x ) seems to imply that a decrease (increase) in the
good teaching index by one unit leads to a 0.35307 decrease (increase) in the predicted deep
approach index; that is, good teaching positively affects deep learning. But does it?
Ramsden (1998) ignored the fact that each of his 50 data points represent a type of institutional
average that is based on multiple inputs; thus, questions of heteroscedasticity and the calculation
of appropriate standard errors for testing statistical inference are relevant. In addition, because
Ramsden reports working only with the aggregate data from each university, it is possible that
within each university the relationship between good teaching (x) and the deep approach (y)
could be negative but yet appear positive in the aggregate.
When I contacted Ramsden to get a copy of his data and his coauthored “Paper presented at the
Annual Conference of the Australian Association for Research in Education, Brisbane
(December 1997),” which was listed as the source for his regression of the deep approach index
on the good teaching index in his 1998 published article, he confessed that this conference paper
never got written and that he no longer had ready access to the data (email correspondence
August 22, 2000).
Aside from the murky issue of Ramsden citing his 1997 paper, which he subsequently admitted
does not exist, and his not providing the data on which the published 1998 paper is allegedly
based, a potential problem of working with data aggregated at the university level can be seen in
the following stylized illustration.

[Figure: stylized scatter plots for University One, University Two and University Three, together
with the university means, illustrating how a relationship that is negative within each university
can appear positive when only the aggregated university means are used.]
vi
Let yit be the observed test score index of the ith student in the tth class, who has an expected
test score index value of μit. That is, yit = μit + εit, where εit is the random error in testing such
that its expected value is zero, E(εit) = 0, and its variance is σ², E(εit²) = σ², for all i and t.
Let ȳt be the sample mean of the test score index for the tth class of nt students. That is,
ȳt = μt + ε̄t, where ε̄t is the average testing error and E(ε̄t²) = σ²/nt. Thus, the variance of the class
mean test score index is σ²/nt, which differs across classes of different sizes.
vii
As in Fleisher, Hashimoto, and Weinberg (2002), let ygi be the performance measure of the ith
student in a class taught by instructor g, let Fg be a dummy variable reflecting a characteristic
of the instructor (e.g., nonnative English speaker), and let xgi be a (1 × n) vector of the student’s
characteristics. The model is

ygi = Fgγ + xgiβ + εgi ,
where γ and β are parameters to be estimated. The error term, however, has two components:
one unique to the ith student in the gth instructor’s class ( ugi ) and one that is shared by all
students in this class ( ξ g ): ε gi = ξ g + ugi . It is the presence of the shared error ξ g for which an
adjustment in standard errors is required. The ordinary least squares routines employed by the
standard computer programs are based on a model in which the variance-covariance matrix of
error terms is diagonal, with element σ u2 . The presence of the ξ g terms makes this matrix block
diagonal, where each student in the gth instructor’s class has an off-diagonal element σ ξ2 .
In (May 11, 2008) email correspondence, Bill Greene called my attention to the fact that
Moulton (1986) gave a specific functional form for the shared error term component
computation. Fleisher, Hashimoto, and Weinberg actually used an approximation that is aligned
with the White estimator (as presented in Parts Two, Three and Four of this module), which is
the "CLUSTER" estimator in STATA. In LIMDEP (NLOGIT), Moulton’s shared error term
adjustment is done by first arranging the data as in a panel with the groups contained in
contiguous blocks of observations. Then, the command is “REGRESS ; ... ; CLUSTER = spec.
$” where "spec" is either a fixed number of observations in a group, or the name of an
identification variable that contains a class number. The important point is to recognize that
heterogeneity could be the result of each group having its own variance and each individual
within a group having its own variance. As discussed in detail in Parts Two, Three and Four,
heteroscedasticity in general is handled in STATA with the “ROBUST” command and in
LIMDEP with the “HETRO” command.
MODULE ONE, PART TWO: READING DATA INTO LIMDEP, CREATING AND
RECODING VARIABLES, AND ESTIMATING AND TESTING MODELS IN LIMDEP
This Part Two of Module One provides a cookbook-type demonstration of the steps required to
read or import data into LIMDEP. The reading of both small and large text and Excel files is
shown through real data examples. The procedures to recode and create variables within
LIMDEP are demonstrated. Commands for least-squares regression estimation and maximum
likelihood estimation of probit and logit models are provided. Consideration is given to analysis
of variance and the testing of linear restrictions and structural differences, as outlined in Part
One. (Parts Three and Four provide the STATA and SAS commands for the same operations
undertaken here in Part Two with LIMDEP. For a thorough review of LIMDEP, see Hilbe,
2006.)
LIMDEP can read or import data in several ways. The most easily imported files are those
created in Microsoft Excel with the “.xls” file name extension. To see how this is done, consider
the data set in the Excel file “post-pre.xls,” which consists of test scores for 24 students in four
classes. The column titled “Student” identifies the 24 students by number, “post” provides each
student’s post-course test score, “pre” is each student’s pre-course test score, and “class”
identifies to which one of the four classes the student was assigned, e.g., class4 = 1 if the student
was in the fourth class and class4 = 0 if not. The “.” in the post column for student 24 indicates
that the student is missing a post-course test score.
To start, the file “post-pre.xls” must be downloaded and copied to your computer’s hard drive.
Once this is done, open LIMDEP. Clicking on “Project,” “Import,” and “Variables…” yields the
following screen display:
Clicking “Variable” gives a screen display of your folders in “My Documents,” in which you can
locate Excel (.wk and .xls) files.
The next slide shows a path to the file “post-pre.xls.” (The path to your copy of “post-pre.xls”
will obviously depend on where you placed it on your computer’s hard drive.) Clicking “Open”
imports the file into LIMDEP.
To make sure the data has been correctly imported into LIMDEP, click the “Activate Data
Editor” button, which is second from the right on the tool bar, or go to Data Editor in the
Window menu. Notice that the missing observation for Student 24 appears as a blank in this
data editor. The sequencing of these steps and the associated screens follow:
Next we consider externally created text files that are typically accompanied by the “.txt” or
“.prn” extensions. For demonstration purposes, the data set we just employed with 24
observations on the 7 variables (“student,” “post,” “pre,” “class1,” “class2,” “class3,” and
“class4”) was saved as the space-delimited text file “post-pre.txt.” After downloading this file
to your hard drive, open LIMDEP to its first screen:
To read the file “post-pre.txt,” begin by clicking “File” in the upper left-hand corner of the
ribbon, which will yield the following screen display:
Click on “OK” to “Text/Command Document” to create a file into which all your commands
will go.
The “read” command is typed or copied into the “Untitled” command file, with all subparts of a
command separated with semicolons (;). The program is not case sensitive; thus, upper and
lower case letters can be used interchangeably. The read command includes the number of
variables or columns to be read (nvar= ), the number of records or observations for each variable
(nobs= ), and the place to find the file (File= ). Because the names of the variables are on the
first row of the file to be read, we tell this to LIMDEP with the Names=1 command. If the file
path is long and involves spaces (as it is here, but your path will depend on where you placed
your file), then quote marks are required around the path. The $ indicates the end of the
command.
Read;NVAR=7;NOBS=24;Names=1;File=
"C:\Documents and Settings\beckerw\My Documents\WKPAPERS\NCEE - econometrics
\Module One\post-pre.txt"$
Upon copying or typing this read command into the command file and highlighting the entire
three lines, the screen display appears as below and the “Go” button is pressed to run the
command.
LIMDEP tells the user that it has attempted the command with the appearance of
To check on the correct reading of the data, click the “Activate Data Editor” button, which is
second from the right on the tool bar, or go to Data Editor in the Window menu. Notice that if
you use the Window menu, there are now four files open within LIMDEP: Untitled Project 1*,
Untitled 1*, Output 1*, and Data Editor. As you already know, Untitled 1 contains your read
command and Untitled Project is just information in the opening LIMDEP screen. Output
contains the commands that LIMDEP has attempted, which so far only includes the read
command. This output file could have also been accessed by clicking on the view square next to
the X box in the following rectangle
when it appeared, to check on whether the read command was properly executed.
LIMDEP has a data matrix default restriction of no more than 222 rows (records per variable),
900 columns (number of variables) and 200,000 cells. To read, import or create nonconforming
data sets this default setting must be changed. For example, to accommodate larger data sets the
number of rows must be increased. If the creation of more than 900 variables is anticipated,
even if fewer than 900 variables were initially imported, then the number of columns must be
increased before any data is read. This is accomplished by clicking the project button on the top
ribbon, going to settings, and changing the number of cells and number of rows.
As an example, consider the data set employed by Becker and Powers (2001), which
initially had 2,837 records. Open LIMDEP and go to “Project” and then “Settings…,” which
yields the following screen display:
Increasing the “Number of Cells” from 200,000 to 2,000,000 and increasing “Rows” from 222 to
3,000, automatically resets the “Columns” to 666, which is more than sufficient to read the 64
variables in the initial data set and to accommodate any variables to be created within LIMDEP.
Pressing “OK” resets the memory allocation that LIMDEP will employ for this data set.
This Becker and Powers data set does not have variable names embedded in it. Thus, they will
be added to the read command. To now read the data follow the path “Files” to “New” to
“Text/Command Document” and click “OK.” Entering the following read statement into the
Text/Command file, highlighting it, and pushing the green “Go” button will enter the 2,837
records on 64 variables in file beck8WO into LIMDEP and each of the variables will be named
as indicated by each two character label.
READ; NREC=2837; NVAR=64; FILE=F:\beck8WO.csv; Names=
A1,A2,X3, C,AL,AM,AN,CA,CB,CC,CH,CI,CJ,CK,CL,CM,CN,CO,CS,CT,
CU,CV,CW,DB,DD,DI,DJ,DK,DL,DM,DN,DQ,DR,DS,DY,DZ,EA,EB,EE,EF,
EI,EJ,EP,EQ,ER,ET,EY,EZ,FF,FN,FX,FY,FZ,GE,GH,GM,GN,GQ,GR,HB,
HC,HD,HE,HF $
Defining all the variables is not critical for our purposes here, but the variables used in the
Becker and Powers study required the following definitions (where names are not upper- and
lower-case sensitive):
Notice that this data set is too large to fit in LIMDEP’s “Active Data Editor” but all of the data
are there as verified with the following DSTAT command, which is entered in the
Text/Command file and highlighted. Upon clicking on the Go button, the descriptive statistics
for each variable appear in the output file. Again, the output file is accessed via the Window tab
in the upper ribbon. (Notice that in this data set, all missing values were coded −9. )
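A descriptive-statistics request in LIMDEP has the general form Dstat ; Rhs = <variable list> $. For example (a sketch listing only a few of the variables named in the read statement above):

Dstat ; Rhs = A1, A2, X3, C, AL, AM, AN $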
In summary, the LIMDEP Help menu states that the READ command is of the general form
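Read ; File = <file name> ; Nobs = <number of observations> ; Nvar = <number of variables> ; Names = <1 or list of names> $

(a paraphrase built from the elements used in the read statements above, not a verbatim copy of the Help entry).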
The default is an ASCII (or text) data file in which numbers are separated by blanks, tabs, and/or
commas. Although not demonstrated here, LIMDEP will also read formatted files by adding the
option “; Format = ( Fortran format )” to the read command. In addition, although not
demonstrated, small data sets can be cut and pasted directly into the Text/Command Document,
preceded by a simple read command “READ ; Nobs = number of observations; Nvar =
number of variables$”, where “;Names = 1” would also be added if names appear on the line
before the data.
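For example, a tiny illustration of this cut-and-paste approach (a sketch with made-up variable names x and y and three pasted records):

Read ; Nobs = 3 ; Nvar = 2 ; Names = 1 $
x y
1 10
2 12
3 14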
To demonstrate some of the least-squares regression commands in LIMDEP, read either the
Excel or space-delimited text version of the 24 observations and 7 variable “post-pre” data set
into LIMDEP. The model to be estimated is
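post = β1 + β2 pre + β3 class1 + β4 class2 + β5 class3 + ε

(presumably; this form is consistent with the regression command used below and with the later test of the restriction β3 = β4 = β5 = 0).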
All statistical and mathematical instructions must be placed in the “Text/Command Document”
of LIMDEP, which is accessed via the “File” to “New” route described earlier:
Once in the “Text/Command Document,” the command for a regression can be entered.
Before doing this, however, recall that the posttest score is missing for the 24th person, as can be
seen in the “Active Data Editor.” LIMDEP automatically codes all missing data that appear in a
text or Excel file as “.” with the value −999. If a regression is estimated with the entire data set,
then this fictitious −999 placeholder value would be incorrectly employed. To avoid this, all
commands can be prefaced with “skip,” which tells LIMDEP to not use records involving −999.
(In the highly unlikely event that −999 is a legitimate value, then a recoding is required as
discussed below.) The syntax for regressions in LIMDEP is
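Regress ; Lhs = dependent variable ; Rhs = one, explanatory variables $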
where “lhs=” is the left-hand-side dependent variable and “rhs=” is the right-hand-side
explanatory variable. The “one” is included on the right-hand-side to estimate a y-intercept. If
this “one” is not specified then the regression is forced to go through the origin – that is, no
constant term is estimated. Finally, LIMDEP will predict the value of the dependent variable,
including 95 percent confidence intervals, and show the results in the output file if “fill; list” is
added to the regression command:
To avoid the sum of the four dummy variables equaling the column of ones in the data set, the
fourth class is not included. The commands to be typed into the Untitled Text/Command
Document are now
skip$
Regress ; Lhs =post; Rhs =one, pre, class1, class2,class3; fill; list$
Highlighting these commands and pressing “Go” gives the results in the LIMDEP output file:
From this output the predicted posttest score for the 24th student is 14.621, with a 95 percent
confidence interval of 11.5836 < E(y|X24) < 17.6579.
A researcher might be interested in testing whether the class in which a student is enrolled affects
his/her post-course test score, assuming fixed effects only. This linear restriction is done
automatically in LIMDEP by adding the following “cls:” command to the regression statement in
the Text/Command Document.
Regress ; Lhs =post; Rhs =one, pre, class1, class2,class3; CLS: b(3)=0,b(4)=0,b(5)=0$
Upon highlighting and pressing the Go button, the following results will appear in the output file:
--> Regress ; Lhs =post; Rhs =one, pre, class1, class2,class3; CLS: b(3)=0,b(...
************************************************************************
* NOTE: Deleted 1 observations with missing data. N is now 23 *
************************************************************************
+-----------------------------------------------------------------------+
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = POST Mean= 26.21739130 , S.D.= 5.384797808 |
| Model size: Observations = 23, Parameters = 5, Deg.Fr.= 18 |
| Residuals: Sum of squares= 28.59332986 , Std.Dev.= 1.26036 |
| Fit: R-squared= .955177, Adjusted R-squared = .94522 |
| Model test: F[ 4, 18] = 95.89, Prob value = .00000 |
| Diagnostic: Log-L = -35.1389, Restricted(b=0) Log-L = -70.8467 |
| LogAmemiyaPrCrt.= .660, Akaike Info. Crt.= 3.490 |
| Autocorrel: Durbin-Watson Statistic = 1.72443, Rho = .13779 |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant -3.585879292 1.6459223 -2.179 .0429
PRE 1.517221644 .93156695E-01 16.287 .0000 18.695652
CLASS1 1.420780437 .90500685 1.570 .1338 .21739130
CLASS2 1.177398543 .78819907 1.494 .1526 .30434783
CLASS3 2.954037461 .76623994 3.855 .0012 .26086957
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)
+-----------------------------------------------------------------------+
| Linearly restricted regression |
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = POST Mean= 26.21739130 , S.D.= 5.384797808 |
| Model size: Observations = 23, Parameters = 2, Deg.Fr.= 21 |
| Residuals: Sum of squares= 53.19669876 , Std.Dev.= 1.59160 |
| Fit: R-squared= .916608, Adjusted R-squared = .91264 |
| (Note: Not using OLS. R-squared is not bounded in [0,1] |
| Model test: F[ 1, 21] = 230.82, Prob value = .00000 |
| Diagnostic: Log-L = -42.2784, Restricted(b=0) Log-L = -70.8467 |
| LogAmemiyaPrCrt.= 1.013, Akaike Info. Crt.= 3.850 |
| Note, when restrictions are imposed, R-squared can be less than zero. |
| F[ 3, 18] for the restrictions = 5.1627, Prob = .0095 |
| Autocorrel: Durbin-Watson Statistic = 1.12383, Rho = .43808 |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant -2.211829436 1.9004224 -1.164 .2597
PRE 1.520632737 .10008855 15.193 .0000 18.695652
CLASS1 .0000000000 ........(Fixed Parameter)........ .21739130
CLASS2 -.4440892099E-15........(Fixed Parameter)........ .30434783
CLASS3 -.4440892099E-15........(Fixed Parameter)........ .26086957
The F statistic for the test of these restrictions is F[df1=3, df2=18] = 5.1627, with p-value = 0.0095.
Thus, the null hypothesis that none of the class dummies matters can be rejected at the 0.05 Type I
error level in favor of the hypothesis that at least one class is significant, assuming that the effect
of pre-course test score on post-course test
score is the same in all classes and only the constant is affected by class assignment.
The above test of the linear restriction β3 = β4 = β5 = 0 (no difference among classes) assumed
that the pretest slope coefficient was constant, fixed and unaffected by the class to which a
student belonged. A full structural test requires the fitting of four separate regressions to obtain
the four residual sum of squares that are added to obtain the unrestricted sum of squares. The
restricted sum of squares is obtained from a regression of posttest on pretest with no dummies for
the classes; that is, the class to which a student belongs is irrelevant in the manner in which
pretests determine the posttest score.
The commands to be entered into the Document/text file of LIMDEP are as follows:
Restricted Regression
Sample; 1-23$
Regress ; Lhs =post; Rhs =one, pre$
Calc ; SSall = Sumsqdev$
Unrestricted Regressions
Sample; 1-5$
Regress ; Lhs =post; Rhs =one, pre$
Calc ; SS1 = Sumsqdev$
Sample; 6-12$
Regress ; Lhs =post; Rhs =one, pre$
Calc ; SS2 = Sumsqdev$
Sample; 13-18$
Regress ; Lhs =post; Rhs =one, pre$
Calc ; SS3 = Sumsqdev$
Sample; 19-23$
Regress ; Lhs =post; Rhs =one, pre$
Calc ; SS4 = Sumsqdev$
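The Chow F statistic can then be computed from the stored sums of squares, for example with a Calc command along these lines (a sketch; 6 and 15 are the numerator and denominator degrees of freedom used below):

Calc ; List ; Fstat = ((SSall - (SS1 + SS2 + SS3 + SS4))/6) / ((SS1 + SS2 + SS3 + SS4)/15) $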
+-----------------------------------------------------------------------+
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = POST Mean= 26.21739130 , S.D.= 5.384797808 |
| Model size: Observations = 23, Parameters = 2, Deg.Fr.= 21 |
| Residuals: Sum of squares= 53.19669876 , Std.Dev.= 1.59160 |
| Fit: R-squared= .916608, Adjusted R-squared = .91264 |
| Model test: F[ 1, 21] = 230.82, Prob value = .00000 |
| Diagnostic: Log-L = -42.2784, Restricted(b=0) Log-L = -70.8467 |
| LogAmemiyaPrCrt.= 1.013, Akaike Info. Crt.= 3.850 |
| Autocorrel: Durbin-Watson Statistic = 1.12383, Rho = .43808 |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant -2.211829436 1.9004224 -1.164 .2575
PRE 1.520632737 .10008855 15.193 .0000 18.695652
+-----------------------------------------------------------------------+
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = POST Mean= 30.00000000 , S.D.= 3.000000000 |
| Model size: Observations = 5, Parameters = 2, Deg.Fr.= 3 |
| Residuals: Sum of squares= .2567567568 , Std.Dev.= .29255 |
| Fit: R-squared= .992868, Adjusted R-squared = .99049 |
| Model test: F[ 1, 3] = 417.63, Prob value = .00026 |
| Diagnostic: Log-L = .3280, Restricted(b=0) Log-L = -12.0299 |
| LogAmemiyaPrCrt.= -2.122, Akaike Info. Crt.= .669 |
| Autocorrel: Durbin-Watson Statistic = 2.19772, Rho = -.09886 |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant -2.945945946 1.6174496 -1.821 .1661
PRE 1.554054054 .76044788E-01 20.436 .0003 21.200000
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)
+-----------------------------------------------------------------------+
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = POST Mean= 27.28571429 , S.D.= 5.023753103 |
| Model size: Observations = 7, Parameters = 2, Deg.Fr.= 5 |
| Residuals: Sum of squares= 7.237132353 , Std.Dev.= 1.20309 |
| Fit: R-squared= .952208, Adjusted R-squared = .94265 |
| Model test: F[ 1, 5] = 99.62, Prob value = .00017 |
| Diagnostic: Log-L = -10.0492, Restricted(b=0) Log-L = -20.6923 |
| LogAmemiyaPrCrt.= .621, Akaike Info. Crt.= 3.443 |
| Autocorrel: Durbin-Watson Statistic = 1.50037, Rho = .24982 |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
+-----------------------------------------------------------------------+
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = POST Mean= 25.66666667 , S.D.= 5.278888772 |
| Model size: Observations = 6, Parameters = 2, Deg.Fr.= 4 |
| Residuals: Sum of squares= 8.081250000 , Std.Dev.= 1.42138 |
| Fit: R-squared= .942001, Adjusted R-squared = .92750 |
| Model test: F[ 1, 4] = 64.97, Prob value = .00129 |
| Diagnostic: Log-L = -9.4070, Restricted(b=0) Log-L = -17.9490 |
| LogAmemiyaPrCrt.= .991, Akaike Info. Crt.= 3.802 |
| Autocorrel: Durbin-Watson Statistic = 1.51997, Rho = .24001 |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant -1.525000000 3.4231291 -.445 .6790
PRE 1.568750000 .19463006 8.060 .0013 17.333333
+-----------------------------------------------------------------------+
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = POST Mean= 21.60000000 , S.D.= 5.549774770 |
| Model size: Observations = 5, Parameters = 2, Deg.Fr.= 3 |
| Residuals: Sum of squares= 8.924731183 , Std.Dev.= 1.72479 |
| Fit: R-squared= .927559, Adjusted R-squared = .90341 |
| Model test: F[ 1, 3] = 38.41, Prob value = .00846 |
| Diagnostic: Log-L = -8.5432, Restricted(b=0) Log-L = -15.1056 |
| LogAmemiyaPrCrt.= 1.427, Akaike Info. Crt.= 4.217 |
| Autocorrel: Durbin-Watson Statistic = .82070, Rho = .58965 |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant -7.494623656 4.7572798 -1.575 .2132
PRE 1.752688172 .28279093 6.198 .0085 16.600000
Because the calculated F = 2.92 exceeds the critical F (Prob of Type I error = 0.05, df1 = 6,
df2 = 15) = 2.79, we reject the null hypothesis and conclude that at least one class is significantly different
from another, allowing the slope on pre-course test score to vary from one class to another. That
is, the class in which a student is enrolled is important because of a change in slope and/or the
intercept.
HETEROSCEDASTICITY
To adjust for either heteroscedasticity across individual observations or a common error term
within groups but not across groups, the “hetro” and “cluster” commands can be added to the
standard “regress” command in LIMDEP in the following manner:
Skip
where the “class” variable is created to name the classes 1, 2, 3 and 4 to enable their
identification in the “cluster” command.
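The commands that follow the “skip” command, as echoed in the output below, are:

Regress ; Lhs = post ; Rhs = one, pre, class1, class2, class3 $
Regress ; Lhs = post ; Rhs = one, pre, class1, class2, class3 ; Hetro $
Create ; Class = class1 + 2*class2 + 3*class3 + 4*class4 $
Regress ; Lhs = post ; Rhs = one, pre, class1, class2, class3 ; Cluster = class $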
The resulting LIMDEP output shows a marked increase in the significance of the individual
group effects, as reflected in their respective p-values.
--> RESET
--> READ;FILE="C:\Documents and Settings\beckerw\My Documents\WKPAPERS\NCEE -...
--> skip
--> Regress; Lhs= post; Rhs= one, pre, class1, class2, class3$
************************************************************************
* NOTE: Deleted 1 observations with missing data. N is now 23 *
************************************************************************
+-----------------------------------------------------------------------+
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = POST Mean= 26.21739130 , S.D.= 5.384797808 |
--> Regress; Lhs= post; Rhs= one, pre, class1, class2, class3
;hetro $
************************************************************************
* NOTE: Deleted 1 observations with missing data. N is now 23 *
************************************************************************
+-----------------------------------------------------------------------+
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = POST Mean= 26.21739130 , S.D.= 5.384797808 |
| Model size: Observations = 23, Parameters = 5, Deg.Fr.= 18 |
| Residuals: Sum of squares= 28.59332986 , Std.Dev.= 1.26036 |
| Fit: R-squared= .955177, Adjusted R-squared = .94522 |
| Model test: F[ 4, 18] = 95.89, Prob value = .00000 |
| Diagnostic: Log-L = -35.1389, Restricted(b=0) Log-L = -70.8467 |
| LogAmemiyaPrCrt.= .660, Akaike Info. Crt.= 3.490 |
| Autocorrel: Durbin-Watson Statistic = 1.72443, Rho = .13779 |
| Results Corrected for heteroskedasticity |
| Breusch - Pagan chi-squared = 4.0352, with 4 degrees of freedom |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant -3.585879292 1.5096560 -2.375 .0289
PRE 1.517221644 .72981808E-01 20.789 .0000 18.695652
CLASS1 1.420780437 .67752835 2.097 .0504 .21739130
CLASS2 1.177398543 .72249740 1.630 .1206 .30434783
CLASS3 2.954037461 .80582075 3.666 .0018 .26086957
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)
--> Create; Class = class1+2*class2+3*class3+4*class4$
--> Regress ; Lhs= post; Rhs= one, pre, class1, class2, class3
;cluster=class $
************************************************************************
* NOTE: Deleted 1 observations with missing data. N is now 23 *
************************************************************************
+-----------------------------------------------------------------------+
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = POST Mean= 26.21739130 , S.D.= 5.384797808 |
| Model size: Observations = 23, Parameters = 5, Deg.Fr.= 18 |
Often variables need to be transformed or created within a computer program to perform the
desired analysis. To demonstrate the process and commands in LIMDEP, start with the Becker
and Powers data that have been or can be read into LIMDEP as shown earlier. After reading
the data into LIMDEP the first task is to recode the qualitative data into appropriate dummies.
A2 contains a range of values representing various classes of institutions. These are recoded via
the “recode” command, where A2 is set equal to 1 for doctoral institutions (100/199), 2 for
comprehensive or master’s degree granting institutions (200/299), 3 for liberal arts colleges
(300/399) and 4 for two-year colleges (400/499). The “create” command is then used to create
1 and 0 bivariate variables for each of these institutions of post-secondary education:
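Recode ; a2 ; 100/199 = 1 ; 200/299 = 2 ; 300/399 = 3 ; 400/499 = 4 $
Create ; doc = (a2=1) ; comp = (a2=2) ; lib = (a2=3) ; twoyr = (a2=4) $

(These two lines are a sketch based on the description that follows; doc, comp, and lib are the names used in the probit commands below, while twoyr is an illustrative name for the two-year college dummy.)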
As should be apparent, this syntax says if a2 has a value between 100 and 199 recode it to be 1.
If a2 has a value between 200 and 299 recode it to be 2 and so on. Next, create a variable called
“doc” and if a2=1, then set doc=1 and for any other value of a2 let doc=0. Create a variable
called “comp” and if a2=2, then set comp=1 and for any other value of a2 let comp=0, and so on.
Next 1 - 0 bivariates are created to show whether the instructor had a PhD degree and whether the
student got a positive score on the postTUCE:
To allow for quadratic forms in teacher experiences and class size the following variables are
created:
In this data set, as can be seen in the descriptive statistics (DSTAT), all missing values were
coded −9. Thus, adding together some of the responses to the student evaluations gives
information on whether a student actually completed an evaluation. For example, if the sum of
ge, gh, gm, and gq equals −36, we know that the student did not complete a student evaluation
in a meaningful way. A dummy variable to reflect this fact is then created by:
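Create ; noeval = (ge + gh + gm + gq = -36) $

(a sketch; noeval is the name used in the probit commands below).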
Finally, from the TUCE developer it is known that student number 2216 was counted in term 2
but was actually in term 1, and no postTUCE was taken. This error is corrected with the following
command:
recode; hb; 90=89$ #2216 counted in term 2, but in term 1 with no posttest
These “create” and “recode” commands can be entered into LIMDEP as a block, highlighted and
run with the “Go” button. Notice, also, that descriptive statements can be written after the “$” as
a reminder or for later justification or reference as to why the command was included.
One of the things of interest to Becker and Powers was whether class size at the beginning or end
of the term influenced whether a student completed the postTUCE. This can be assessed by
fitting a probit model to the 1 – 0 discrete dependent variable “final.” To do this, however, we
must make sure that there are no missing data on the variables to be included as regressors. In
this data set, all missing values were coded −9. LIMDEP’s “reject” command can be employed
to remove all records with a −9 value. The “probit” command is used to invoke a maximum
likelihood estimation with the following syntax:
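Probit ; Lhs = dependent variable ; Rhs = one, explanatory variables ; Marginaleffect $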
where the addition of the “marginaleffect” tells LIMDEP to calculate marginal effects regardless
of whether the explanatory variable is or is not continuous. The commands to be entered into the
Text/Command Document are then
Reject; AN=-9$
Reject; HB=-9$
Reject; ci=-9$
Reject; ck=-9$
Reject; cs=0$
Reject; cs=-9$
Reject; a2=-9$
Reject; phd=-9$
reject; hc=-9$
probit;lhs=final;
rhs=one,an,hb,doc,comp,lib,ci,ck,phd,noeval;marginaleffect$
probit;lhs=final;
rhs=one,an,hc,doc,comp,lib,ci,ck,phd,noeval;marginaleffect$
which upon highlighting and pressing the Go button yields the output for these two probit
models.
--> probit;lhs=final;
rhs=one,an,hb,doc,comp,lib,ci,ck,phd,noeval;marginaleffect$
Normal exit from iterations. Exit status=0.
+---------------------------------------------+
| Binomial Probit Model |
+-------------------------------------------+
| Partial derivatives of E[y] = F[*] with |
| respect to the vector of characteristics. |
| They are computed at the means of the Xs. |
| Observations used for means are All Obs. |
+-------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Index function for probability
Constant .1977242134 .48193408E-01 4.103 .0000
AN .4378002101E-02 .18769275E-02 2.333 .0197 10.596830
HB -.9699107460E-03 .38243741E-03 -2.536 .0112 55.558949
Marginal effect for dummy variable is P|1 - P|0.
DOC .1595047130 .20392136E-01 7.822 .0000 .31774256
Marginal effect for dummy variable is P|1 - P|0.
COMP .7783344522E-01 .25881201E-01 3.007 .0026 .41785852
Marginal effect for dummy variable is P|1 - P|0.
LIB .8208261358E-01 .21451464E-01 3.826 .0001 .13567839
CI .3947761030E-01 .18186048E-01 2.171 .0299 1.2311558
Marginal effect for dummy variable is P|1 - P|0.
CK .1820482750E-01 .29016989E-01 .627 .5304 .91998454
Marginal effect for dummy variable is P|1 - P|0.
PHD -.2575430653E-01 .19325466E-01 -1.333 .1826 .68612292
Marginal effect for dummy variable is P|1 - P|0.
NOEVAL -.5339850032 .19586185E-01 -27.263 .0000 .29068419
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)
+----------------------------------------+
| Fit Measures for Binomial Choice Model |
| Probit model for variable FINAL |
+----------------------------------------+
| Proportions P0= .197140 P1= .802860 |
| N = 2587 N0= 510 N1= 2077 |
--> probit;lhs=final;
rhs=one,an,hc,doc,comp,lib,ci,ck,phd,noeval;marginaleffect$
Normal exit from iterations. Exit status=0.
+---------------------------------------------+
| Binomial Probit Model |
| Maximum Likelihood Estimates |
| Model estimated: May 05, 2008 at 04:07:03PM.|
| Dependent variable FINAL |
| Weighting variable None |
| Number of observations 2587 |
| Iterations completed 6 |
| Log likelihood function -825.9472 |
| Restricted log likelihood -1284.216 |
| Chi squared 916.5379 |
| Degrees of freedom 9 |
| Prob[ChiSqd > value] = .0000000 |
| Hosmer-Lemeshow chi-squared = 22.57308 |
| P-value= .00396 with deg.fr. = 8 |
+---------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Index function for probability
Constant .8712666323 .24117408 3.613 .0003
AN .2259549490E-01 .94553383E-02 2.390 .0169 10.596830
HC .1585898886E-03 .21039762E-02 .075 .9399 49.974874
DOC .8804040395 .14866411 5.922 .0000 .31774256
COMP .4596088640 .13798168 3.331 .0009 .41785852
LIB .5585267697 .17568141 3.179 .0015 .13567839
CI .1797199200 .90808055E-01 1.979 .0478 1.2311558
CK .1415663447E-01 .13332671 .106 .9154 .91998454
PHD -.2351326125 .10107423 -2.326 .0200 .68612292
NOEVAL -1.928215642 .72363621E-01 -26.646 .0000 .29068419
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)
+-------------------------------------------+
| Partial derivatives of E[y] = F[*] with |
| respect to the vector of characteristics. |
| They are computed at the means of the Xs. |
| Observations used for means are All Obs. |
+-------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Index function for probability
Constant .1735365132 .47945637E-01 3.619 .0003
AN .4500509092E-02 .18776909E-02 2.397 .0165 10.596830
HC .3158750180E-04 .41902052E-03 .075 .9399 49.974874
Marginal effect for dummy variable is P|1 - P|0.
DOC .1467543687 .21319420E-01 6.884 .0000 .31774256
Marginal effect for dummy variable is P|1 - P|0.
COMP .8785901674E-01 .25536388E-01 3.441 .0006 .41785852
Marginal effect for dummy variable is P|1 - P|0.
LIB .8672357482E-01 .20661637E-01 4.197 .0000 .13567839
CI .3579612385E-01 .18068050E-01 1.981 .0476 1.2311558
Marginal effect for dummy variable is P|1 - P|0.
CK .2839467767E-02 .26927626E-01 .105 .9160 .91998454
Marginal effect for dummy variable is P|1 - P|0.
PHD -.4448632109E-01 .18193388E-01 -2.445 .0145 .68612292
Marginal effect for dummy variable is P|1 - P|0.
NOEVAL -.5339710749 .19569243E-01 -27.286 .0000 .29068419
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)
+----------------------------------------+
| Fit Measures for Binomial Choice Model |
| Probit model for variable FINAL |
+----------------------------------------+
| Proportions P0= .197140 P1= .802860 |
| N = 2587 N0= 510 N1= 2077 |
| LogL = -825.94717 LogL0 = -1284.2161 |
| Estrella = 1-(L/L0)^(-2L0/n) = .35481 |
+----------------------------------------+
| Efron | McFadden | Ben./Lerman |
| .39186 | .35685 | .80450 |
| Cramer | Veall/Zim. | Rsqrd_ML |
| .38436 | .52510 | .29833 |
+----------------------------------------+
| Information Akaike I.C. Schwarz I.C. |
| Criteria .64627 1730.47688 |
+----------------------------------------+
Frequencies of actual & predicted outcomes
Predicted outcome has maximum probability.
Threshold value for predicting Y=1 = .5000
Predicted
------ ---------- + -----
Actual 0 1 | Total
------ ---------- + -----
0 337 173 | 510
1 192 1885 | 2077
------ ---------- + -----
Total 529 2058 | 2587
For each of these two probits, the first block of coefficients is for the latent variable probit
equation and the second block provides the marginal effects. The initial class size (hb) probit
coefficient (−0.004883) is highly significant, with a two-tail p-value of 0.0112. On the other hand,
the end-of-term class size (hc) probit coefficient (0.000159) is insignificant, with a two-tail
p-value of 0.9399.
The overall goodness of fit can be assessed in several ways. The easiest is the proportion
of correct 0 and 1 predictions: For the first probit, using initial class size (hb) as an explanatory
variable, the proportion of correct prediction is 0.859 = (342+1880)/2587. For the second probit,
using end-of-term class size (hc) as an explanatory variable, the proportion of correct prediction
is also 0.859 = (337+1885)/2587. The Chi-square (922.95, df =9) for the probit employing the
initial class size is slightly higher than that for the end-of-term probit (916.5379, df =9) but they
are both highly significant.
Finally, it is worth noting that when the “reject” command is used, the record is not removed.
It can be reactivated with the “include” command. Active and inactive status can be observed in
LIMDEP’s editor by the presence or lack of presence of chevrons (>>) next to the row number
down the left-hand side of the display.
If you wish to save your work in LIMDEP, you must make sure to save each of the files
you want separately. Your Text/Command Document, data file, and output files must be saved
individually in LIMDEP. There is no global saving of all three files.
CONCLUDING REMARKS
The goal of this hands-on component of this first of four modules was to enable users to get data
into LIMDEP, create variables and run regressions on continuous and discrete variables; it was
not to explain all of the statistics produced by computer output. For this, an intermediate-level
econometrics textbook (such as Jeffrey Wooldridge, Introductory Econometrics) or an advanced
econometrics textbook (such as William Greene, Econometric Analysis) must be consulted.
REFERENCES
Becker, William E. and John Powers (2001). “Student Performance, Attrition, and Class Size
Given Missing Student Data,” Economics of Education Review, Vol. 20, August: 377-388.
Hilbe, Joseph M. (2006). “A Review of LIMDEP 9.0 and NLOGIT 4.0.” The American
Statistician, 60(May): 187-202.
MODULE ONE, PART THREE: READING DATA INTO STATA, CREATING AND
RECODING VARIABLES, AND ESTIMATING AND TESTING MODELS IN STATA
This Part Three of Module One provides a cookbook-type demonstration of the steps required to
read or import data into STATA. The reading of both small and large text and Excel files is
shown through real data examples. The procedures to recode and create variables within STATA
are demonstrated. Commands for least-squares regression estimation and maximum likelihood
estimation of probit and logit models are provided. Consideration is given to analysis of variance
and the testing of linear restrictions and structural differences, as outlined in Part One. (Parts
Two and Four provide the LIMDEP and SAS commands for the same operations undertaken
here in Part Three with STATA. For a review of STATA, version 7, see Kolenikov (2001).)
STATA can read data from many different formats. As an example of how to read data created in
an Excel spreadsheet, consider the data from the Excel file “post-pre.xls,” which consists of test
scores for 24 students in four classes. The column titled “Student” identifies the 24 students by
number, “post” provides each student’s post-course test score, “pre” is each student’s pre-course
test score, and “class” identifies to which one of the four classes the student was assigned, e.g.,
class4 = 1 if the student was in the fourth class and class4 = 0 if not. The “.” in the post column for
student 24 indicates that the student is missing a post-course test score.
To start, the file “post-pre.xls” must be downloaded and copied to your computer’s hard drive.
Unfortunately, STATA does not work with “.xls” data by default (i.e., there is no default
“import” function or command to get “.xls” data into STATA’s data editor); however, we can
still transfer data from an Excel spreadsheet into STATA by copy and paste. * First, open the
“post-pre.xls” file in Excel. The raw data are given below:
*
See Appendix A for a description of Stat/Transfer, a program to convert data from one format to
another.
In Excel, highlight the appropriate cells, right-click on the highlighted area and click “copy”.
Your screen should look something like:
Any time you wish to see your current data, you can go back to the data editor. We can also view
the data by typing in “browse” in the command window. As the terms suggest, “browse” only
allows you to see the data, while you can manually alter data in the “data editor”.
Next we consider externally created text files that are typically accompanied by the “.txt” or
“.prn” extensions. As an example, we use the previous dataset with 24 observations on the 7
variables (“student,” “post,” “pre,” “class1,” “class2,” “class3,” and “class4”), saved as the
space-delimited text file “post-pre.txt.” To read the data into STATA, we need to utilize the
“insheet” command. In the command window, type
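insheet using "F:\NCEE (Becker)\post-pre.txt", delimiter(" ")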
The “insheet” tells STATA to read in text data and “using” directs STATA to a particular file
name. In this case, the file is saved in the location “F:\NCEE (Becker)\post-pre.txt”, but this will
vary by user. Finally, the “delimiter(“ “)” option tells STATA that the data points in this file are
separated by a space. If your data were tab delimited, you could type
insheet using “F:\NCEE (Becker)\post-pre.txt”, tab
In general, the “delimiter( )” option is used when your data have a less standard delimiter (e.g., a
colon, semicolon, etc.).
Once you’ve typed the appropriate command into the command window, press enter to run that
line of text. This should yield the following screen:
Just as before, STATA tells us that it has read a data set consisting of 7 variables and 24
observations, and we can access our variable list in the lower-left window pane. We can also see
previously written lines from the “review” window in the upper-left window pane. Again, we can
view our data by typing “browse” in the command window and pressing enter.
The default memory allocation is different depending on the version of STATA you are using.
When STATA first opens, it will indicate how much memory is allocated by default. From the
previous screenshot, for instance, STATA indicates that 1.00mb is set aside for STATA’s use.
A more useful (and detailed) description of STATA’s memory usage (among other things) can
be obtained by typing “creturn list” into the command window. This provides:
To work with large datasets (in this case, anything larger than 1mb), we can type “set memory
10m” into the command window and press enter. This increases the memory allocation to 10 mb,
and you can increase by more or less to your preference. You can also increase STATA’s
memory allocation permanently by typing “set memory 10m, permanently” into the command
line. To check that the memory allocation has actually increased, type “memory” into the command
window and press enter. We get the following screen:
For the Becker and Powers data set, the 1mb allocation is sufficient, so we need only follow the
process to import a “.csv” file described above. Note, however, that this data set does not contain
variable names in the top row. You can assign names yourself with a slight addition to the
insheet command:
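insheet var1 var2 var3 … var64 using "F:\NCEE (Becker)\BECK8WO2.csv", comma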
where var1 var2 var3 … are the variable names for each of the 64 variables in the data set. Of
course, manually adding all 64 variable names can be irritating. For more details on how to
import data sets with data dictionaries (i.e., variable names and definitions in external files), try
typing “help infile” into the command window. If you do not assign variable names, then
STATA will provide default variable names of “v1, v2, v3, etc.”.
As in the previous section using LIMDEP, we now demonstrate various regression tools in
STATA using the “post-pre” data set. Recall the model being estimated is
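post = β1 + β2 pre + β3 class1 + β4 class2 + β5 class3 + ε

(presumably the same specification estimated with LIMDEP in Part Two).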
STATA automatically drops any missing observations from our analysis, so we need not restrict
the data in any of our commands. In general, the syntax for a basic OLS regression in STATA is
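regress y-variable x-variables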
where y-variable is the dependent variable name and x-variables are the explanatory (independent)
variable names. Now is a good time to mention STATA’s very useful help menu. Typing “help
regress” into the command window and pressing enter will open a thorough description of the
regress command and all of its options, and similarly with any command in STATA.
Once you have your data read into STATA, let’s estimate the model
by typing:
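regress post pre class1 class2 class3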
To see the predicted post-test scores (with confidence intervals) from our regression, we type:
predict posthat
predict se_f, stdf
generate upper_f = posthat + invttail(e(df_r),0.025)*se_f
generate lower_f = posthat - invttail(e(df_r),0.025)*se_f
You can either copy and paste these commands directly into the command window and press
enter, or you can enter each one directly into the command window and press enter one at a time.
Notice the use of the “predict” and “generate” keywords in the previous set of commands. After
running a regression, STATA has lots of data stored away, some of which is shown in the output
and some that is not. By typing “predict posthat”, STATA applies the estimated regression
equation to the 24 observations in the sample to get predicted y-values. These predicted y-values
are the default prediction for the “predict” command, and if we want the standard error of these
predictions, we need to use “predict” again but this time specify the option “stdf”. This stands for
the standard deviation of the forecast. Both posthat and se_f are new variables that STATA has
created for us. Now, to get the upper and lower bounds of a 95% confidence interval, we apply
the usual formula taking the predicted value plus/minus the margin of error. Typing “generate
upper_f=…” and “generate lower_f=…” creates two new variables named “upper_f” and “lower_f.”
Just as with LIMDEP, our 95% confidence interval for the 24th student’s predicted post-test score
is [11.5836, 17.6579]. For more information on the “predict” command, try typing “help predict”
into the command window.
To test the linear restriction of all class coefficients being zero, we type:
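test class1 class2 class3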
into the command window and press enter. STATA automatically forms the correct test statistic,
and we see
The second line gives us the p-value, where we see that we can reject the null that all class
coefficients are zero at any probability of Type I error greater than 0.0095.
One way to perform the structural (Chow) test is to run the regression separately for each class.
For students in class 1, type “regress post pre if class1==1” into the command window and press
enter. The resulting output is as follows:
. regress post pre if class1==1
------------------------------------------------------------------------------
post | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
pre | 1.554054 .0760448 20.44 0.000 1.312046 1.796063
_cons | -2.945946 1.61745 -1.82 0.166 -8.093392 2.201501
------------------------------------------------------------------------------
We see in the upper-left portion of this output that the residual sum of squares from this class-1
regression is 0.2568. We can similarly run the regression for only students in class 2 by
specifying the option “if class2==1”, and so forth for classes 3 and 4.
The second way to test for a structural break is to create several interaction terms and test
whether the dummy and interaction terms are jointly significantly different from zero. To
perform the Chow test this way, we first generate interaction terms between all dummy variables
and independent variables. To do this in STATA, type the following into the command window
and press enter:
generate pre_c1=pre*class1
generate pre_c2=pre*class2
generate pre_c3=pre*class3
With our new variables created, we now run a regression with all dummy and interaction terms
included, as well as the original independent variable. In STATA, we need to type:
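regress post pre class1 class2 class3 pre_c1 pre_c2 pre_c3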
into the command window and press enter. The output for this regression is not meaningful, as it
is only the test that we’re interested in. To run the test, we can then type:
into the command window and press enter. The resulting output is:
. test class1 class2 class3 pre_c1 pre_c2 pre_c3
( 1) class1 = 0
( 2) class2 = 0
( 3) class3 = 0
( 4) pre_c1 = 0
( 5) pre_c2 = 0
( 6) pre_c3 = 0
F( 6, 15) = 2.93
Prob > F = 0.0427
Just as we saw in LIMDEP, our F-statistic is 2.93, with a p-value of 0.0427. We again reject the
null (at a probability of Type I error=0.05) and conclude that class is important either through the
slope or intercept coefficients. This type of test will always yield results identical to the restricted
regression approach.
HETEROSCEDASTICITY
You can control for heteroscedasticity across observations or within specific groups (in this
case, within a given class but not across classes) by specifying the “robust” or “cluster” option,
respectively, at the end of your regression command.
To account for a common error term within groups, but not across groups, we first create a class
variable that identifies each student into one of the 4 classes. This is used to specify which group
(or cluster) a student is in. To generate this variable, type:
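generate class = class1 + 2*class2 + 3*class3 + 4*class4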
into the command window and press enter. Then to allow for clustered error terms, our
regression command is:
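regress post pre class1 class2 class3, cluster(class)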
------------------------------------------------------------------------------
| Robust
post | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
pre | 1.517222 .1057293 14.35 0.001 1.180744 1.8537
class1 | 1.42078 .4863549 2.92 0.061 -.1270178 2.968579
class2 | 1.177399 .3141671 3.75 0.033 .1775785 2.177219
class3 | 2.954037 .0775348 38.10 0.000 2.707287 3.200788
_cons | -3.585879 1.755107 -2.04 0.134 -9.171412 1.999654
------------------------------------------------------------------------------
Similarly, to account for general heteroscedasticity across individual observations, our regression
command is:
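regress post pre class1 class2 class3, robust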
------------------------------------------------------------------------------
| Robust
post | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
pre | 1.517222 .0824978 18.39 0.000 1.3439 1.690543
class1 | 1.42078 .7658701 1.86 0.080 -.188253 3.029814
class2 | 1.177399 .8167026 1.44 0.167 -.53843 2.893227
class3 | 2.954037 .9108904 3.24 0.005 1.040328 4.867747
_cons | -3.585879 1.706498 -2.10 0.050 -7.171098 -.0006609
------------------------------------------------------------------------------
We now want to estimate a probit model using the Becker and Powers data set. First, read in the
“.csv” file: i
. insheet a1 a2 x3 c al am an ca cb cc ch ci cj ck cl cm cn co cs ct cu ///
> cv cw db dd di dj dk dl dm dn dq dr ds dy dz ea eb ee ef ///
> ei ej ep eq er et ey ez ff fn fx fy fz ge gh gm gn gq gr hb ///
> hc hd he hf using "F:\NCEE (Becker)\BECK8WO2.csv", comma
(64 vars, 2849 obs)
As always, we should look at our data before we start doing any work. Typing “browse” into the
command window and pressing enter, it looks as if several variables have been read as character
strings rather than numeric values. We can see this by typing “describe” into the command
window or simply by noting that string variables appear in red in the browsing window. This is a
somewhat common problem when using STATA with Excel, usually because of variable names
in the Excel files or because of spaces placed in front or after numeric values. If there are spaces
in any cell that contains an otherwise numeric value, STATA will read the entire column as a
character string. Since we know all variables should be numeric, we can fix this problem by
typing:
destring, replace
into the command window and pressing enter. This automatically codes all variables as numeric
variables.
Also note that the original Excel .csv file has several “extra” observations at the end of the data
set. These are essentially extra rows that have been left blank but were somehow utilized in the
original Excel file (for instance, just pressing enter at the last cell will generate a new record with all
missing variables). STATA correctly reads these 12 observations as missing values, but because
we know these are not real observations, we can just drop these with the command “drop if
a1==.”. This works because a1 is not missing for any of the other observations.
Now we recode the variable a2 as a categorical variable, where a2=1 for doctorate institutions
(between 100 and 199), a2=2 for comprehensive master’s degree granting institutions (between
200 and 299), a2=3 for liberal arts colleges (between 300 and 399), and a2=4 for two-year
colleges (between 400 and 499). To do this, type the following command into the command
window:
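recode a2 (100/199=1) (200/299=2) (300/399=3) (400/499=4)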
Once we’ve recoded the variable, we can generate the 4 dummy variables as follows: ii
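generate doc=(a2==1)
generate comp=(a2==2)
generate lib=(a2==3)
generate twoyr=(a2==4)

(doc, comp, and lib are the names used in the probit command below; twoyr is an illustrative name for the two-year college dummy.)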
The more lengthy way to generate these variables would be to first generate new variables equal
to zero, and then replace each one if the relevant condition holds. But the above commands are a
more concise way.
In this data set, all missing values are coded −9. Thus, adding together some of the responses to
the student evaluations provides information as to whether a student actually completed an
evaluation. For example, if the sum of ge, gh, gm, and gq equals −36, we know that the student
did not complete a student evaluation in a meaningful way. A dummy variable to reflect this fact
is then created by: iii
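generate noeval=(ge+gh+gm+gq==-36)

(noeval is the name used in the probit command below.)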
Finally, from the TUCE developer it is known that student number 2216 was counted in term 2
but was actually in term 1, and no postTUCE was taken. This error is corrected with the following
command:
recode hb (90=89)
We are now ready to estimate the probit model with final as our dependent variable. Because
missing values are coded as -9 in this data set, we need to avoid these observations in our
analysis. The quickest way to avoid this problem is just to recode all of the variables, setting
every variable equal to “.” if it equals “-9”. Because there are 64 variables, we do not want to do
this one at a time, so instead we type:
foreach x of varlist * {
replace `x'=. if `x'==-9
}
You should type this command exactly as is for it to work correctly, including pressing enter
after the first open bracket. Also note that the single quotes surrounding each x in the replace
statement are two different characters. The first single quote is the key directly underneath the
escape key (for most keyboards) while the closing single quote is the standard single quote
keystroke by the enter key. For more help on this, type “help foreach” into the command
window.
Finally, we drop all observations where cs equals 0 or an is missing and then run the probit model. We can then retrieve the marginal effects by typing “mfx” into the command window and pressing enter.
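The drop commands are echoed in the log below. The probit command itself does not appear in this excerpt; judging from the coefficient table that follows, it takes the form (a sketch):
probit final an hb doc comp lib ci ck phd noeval
The resulting output is: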
. drop if cs==0
(1 observation deleted)
. drop if an==.
(249 observations deleted)
------------------------------------------------------------------------------
final | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
an | .022039 .0094752 2.33 0.020 .003468 .04061
hb | -.0048826 .0019241 -2.54 0.011 -.0086537 -.0011114
doc | .9757148 .1463617 6.67 0.000 .6888511 1.262578
comp | .4064945 .1392651 2.92 0.004 .13354 .679449
lib | .5214436 .1766459 2.95 0.003 .175224 .8676632
ci | .1987315 .0916865 2.17 0.030 .0190293 .3784337
ck | .08779 .1342874 0.65 0.513 -.1754085 .3509885
phd | -.133505 .1030316 -1.30 0.195 -.3354433 .0684333
noeval | -1.930522 .0723911 -26.67 0.000 -2.072406 -1.788638
_cons | .9953498 .2432624 4.09 0.000 .5185642 1.472135
------------------------------------------------------------------------------
The second probit model replaces initial class size (hb) with final class size (hc). Its coefficient output is:
------------------------------------------------------------------------------
final | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
an | .0225955 .0094553 2.39 0.017 .0040634 .0411276
hc | .0001586 .002104 0.08 0.940 -.0039651 .0042823
doc | .880404 .1486641 5.92 0.000 .5890278 1.17178
comp | .4596089 .1379817 3.33 0.001 .1891698 .730048
lib | .5585268 .1756814 3.18 0.001 .2141976 .902856
ci | .1797199 .090808 1.98 0.048 .0017394 .3577004
ck | .0141566 .1333267 0.11 0.915 -.2471589 .2754722
phd | -.2351326 .1010742 -2.33 0.020 -.4332344 -.0370308
noeval | -1.928216 .0723636 -26.65 0.000 -2.070046 -1.786386
_cons | .8712666 .2411741 3.61 0.000 .3985742 1.343959
------------------------------------------------------------------------------
Results from each model are equivalent to those of LIMDEP, where we see that the estimated coefficient on hb is −0.005 with a p-value of 0.011 and the estimated coefficient on hc is 0.0002 with a p-value of 0.94. These results imply that initial class size is strongly significant while final class size is insignificant.
To compare each model’s predictions with the actual outcomes, we predict the probability that final equals one, classify a prediction as a one when that probability exceeds 0.5, and cross-tabulate the predictions against final. For the first model this yields:
. predict prob1
(option p assumed; Pr(final))
. generate finalhat1=(prob1>.5)
| final
finalhat1 | 0 1 | Total
-----------+----------------------+----------
0 | 342 197 | 539
1 | 168 1,880 | 2,048
-----------+----------------------+----------
Total | 510 2,077 | 2,587
These results are exactly the same as with LIMDEP. For the second model, we get
. quietly probit final an hc doc comp lib ci ck phd noeval
. predict prob2
(option p assumed; Pr(final))
. generate finalhat2=(prob2>.5)
| final
finalhat2 | 0 1 | Total
-----------+----------------------+----------
0 | 337 192 | 529
1 | 173 1,885 | 2,058
-----------+----------------------+----------
Total | 510 2,077 | 2,587
REFERENCES
Becker, William E. and John Powers (2001). “Student Performance, Attrition, and Class Size
Given Missing Student Data,” Economics of Education Review, Vol. 20, August: 377-388.
Stat/Transfer is a convenient program used to convert data from one format to another. Although
this program is not free, there is a free trial version available at www.stattransfer.com. Note that
the trial program will not convert the entire data set—it will drop one observation.
Nonetheless, Stat/Transfer is very user friendly. If you install and open the trial program, your
screen should look something like:
We want to convert the “.xls” file into a STATA format (“.dta”). To do this, we need to first
specify the original file type (e.g., Excel), then specify the location of the file. We then specify
the format that we want (in this case, a STATA “.dta” file). Then click on Transfer, and
Stat/Transfer automatically converts the data into the format you’ve asked.
use “filename.dta”
i
When using the “insheet” command, STATA automatically converts A1 to a1, A2 to a2 and so
forth. STATA is, however, case sensitive. Therefore, whether users specify “insheet A1 A2 …”
or “insheet a1 a2 …,” we must still call the variables in lower case. For instance, the following
insheet command will work the exact same as that provided in the main text:
. insheet A1 A2 X3 C AL AM AN CA CB CC CH CI CJ CK CL CM CN CO CS CT CU ///
> CV CW DB DD DI DJ DK DL DM DN DQ DR DS DY DZ EA EB EE EF ///
> EI EJ EP EQ ER ET EY EZ FF FN FX FY FZ GE GH GM GN GQ GR HB ///
> HC HD HE HF using "F:\NCEE (Becker)\BECK8WO2.csv", comma
ii
The condition “if a2!=.” tells STATA to run the command only if a2 is not missing. Although
this particular dataset does not contain any missing values, it is generally good practice to always
use this type of condition when creating dummy variables the way we have done here. For
example, if there were a missing observation, the command “gen doc=(a2==1)” would set doc=0
even if a2 is missing.
iii
An alternative procedure is to first set all variables to missing if they equal -9 and then
generate the dummy variable using:
generate noeval=(ge==.&gh==.&gm==.&gq==.)
This Part Four of Module One provides a cookbook-type demonstration of the steps required to read or import data into SAS. The reading of both small and large text and Excel files is shown through real data examples. The procedures to recode and create variables within SAS are
demonstrated. Commands for least-squares regression estimation and maximum likelihood
estimation of probit and logit models are provided. Consideration is given to analysis of variance
and the testing of linear restrictions and structural differences, as outlined in Part One. Parts
Two and Three provide the LIMDEP and STATA commands for the same operations undertaken
here in Part Four with SAS. For a thorough review of SAS, see Delwiche and Slaughter’s The
Little SAS Book: A Primer (2003).
SAS can read or import data in several ways. The most commonly imported files by students are
those created in Microsoft Excel with the “.xls” file name extension. Researchers tend to use flat
files that compactly store data. In what follows, we focus on importing data using Excel files. To
see how this is done, consider the data set in the Excel file “post-pre.xls,” which consists of test
scores for 24 students in four classes. The column title “Student” identifies the 24 students by
number, “post” provides each student’s post-course test score, “pre” is each student’s pre-course
test score, and “class” identifies to which one of the four classes the student was assigned, e.g.,
class4 = 1 if student was in the fourth class and class4 = 0 if not. The “.” in the post column for
student 24 indicates that the student is missing a post-course test score.
To start, the file “post-pre.xls” must be downloaded and copied to your computer’s hard drive.
Once this is done, open SAS. Click on “File,” “Import Data…,” and “Standard data source”.
Selecting “Microsoft Excel 97, 2000 or 2002 Workbook” from the pull down menu yields the
following screen display:
Before the data is successfully imported, the table in your Excel file must be correctly specified.
By default, SAS assumes that you want to import the first table (Excel labels this table
“Sheet1”). If the data you are importing is on a different sheet, simply use the pull-down menu
and click on the correct table name. After you have specified the table from your Excel file, click
“Next”.
The Work Library is a temporary location in the computer’s RAM, and its contents are deleted once the user exits SAS. To retain your newly created SAS data set, you
must place the dataset into a permanent Library.
To view the dataset, click on the “Work” library located in the Explorer section and then on the
dataset “Prepost”. Note that SAS, unlike LIMDEP, records missing observations with a period,
rather than a blank space. The sequencing of these steps and the two associated screens follow:
Next we consider externally created text files, which typically carry the “.txt” or “.csv” extension. For demonstration purposes, the data set just employed, with 24 observations on the 7 variables (“student,” “post,” “pre,” “class1,” “class2,” “class3,” and “class4”), was
saved as the space delimited text file “post-pre.txt.” After downloading this file to your hard
drive, open SAS to its first screen:
Import the file by typing (the out= data set name and the replace option here are illustrative)
proc import datafile = "F:\NCEE (Becker)\post-pre.txt" out = prepost dbms = dlm replace;
getnames = yes; run;
into the editor window
and then clicking the “run man” submit button. “proc import” tells SAS to read in text data and
“datafile” directs SAS to a particular file name. In this case, the file is saved in the location
“F:\NCEE (Becker)\post-pre.txt”, but this will vary by user. Finally, the “dbms=dlm” option tells
SAS that the data points in this file are separated by a space.
If your data is tab delimited, change the “dbms” function to dbms = tab and if you were using a
“.csv” file, change the “dbms” function to dbms=csv. In general, the “dlm” option is used when
your data have a less standard delimiter (e.g., a colon, semicolon, etc.).
Once you’ve typed the appropriate command into the command window, press the submit button
to run that line of text. This should yield the following screen:
in the editor window, highlight it, and click the submit “Running man” button. The sequencing
of these steps and the associated screen is as follows:
Click “File”, “Import Data”, “User-defined formats” and then “Next”. After specifying the location of the file, click “Next” and name the dataset whatever you want to call it. We will use “prepost” as our SAS dataset name; then click “Next” and then “Finish”. This activates the user-defined
format list where the variables can be properly identified and named. Below is a screen shot of
the “Import: List”.
SAS is extremely powerful in handling large datasets. There are few default restrictions on
importing raw data files. The most common restriction is line length. When importing data one
can specify the physical line length of the file using the LRECL command (see syntax below).
As an example, consider the data set employed by Becker and Powers (2001), which initially had
2,837 records. The default restrictions are sufficient, so we need only follow the process of importing a “.csv” file described above. Note, however, that this data set does not contain variable names in the top row. You can assign names yourself with a simple, but lengthy, addition to the importing command. Also note that we have specified what type of input variables the data contains and what type of variables we want them to be; “best32.” reads each variable as numeric with a width of up to 32.
data BECPOW;
informat HF best32.;
format HF best12.;
input
A1 A2 X3 C AL AM AN CA CB CC CH CI CJ CK CL CM CN CO CS CT CU
CV CW DB DD DI DJ DK DL DM DN DQ DR DS DY DZ EA EB EE EF
EI EJ EP EQ ER ET EY EZ FF FN FX FY FZ GE GH GM GN GQ GR HB
HC HD HE HF; run;
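The data step above omits the infile statement that actually points SAS at the raw file. A minimal sketch of that statement, assuming the file location used earlier (the dlm, dsd, missover, and lrecl values are illustrative), would sit inside the data step ahead of the informat, format, and input statements:
infile "F:\NCEE (Becker)\BECK8WO2.csv" dlm=',' dsd missover lrecl=32767;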
For more details on how to import data sets with data dictionaries (i.e., variable names and
definitions in external files), try typing “infile” into the “SAS Help and Documentation” under
the Help tab. If you do not assign variable names, then SAS will provide default variable names
of “var1, var2, var3, etc.”.
As in the previous Module One, Part Two using LIMDEP, we now demonstrate various regression tools in SAS using the “post-pre” data set. Recall the model being estimated is
post = β1 + β2pre + β3class1 + β4class2 + β5class3 + ε
SAS automatically drops any missing observations from our analysis, so we need not restrict the
data in any of our commands. In general, the syntax for a basic OLS regression in SAS is
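(a generic sketch, with placeholder names)
proc reg data = dataset-name; model y-variable = x-variables; run;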
where y-variable is the dependent variable name and x-variables are the independent variable names.
Once you have your data read into SAS, we estimate the model
post = β1 + β2pre + β3class1 + β4class2 + β5class3 + ε
by typing:
proc reg data = prepost; model post = pre class1 class2 class3 / p cli clm; run;
into the editor window and pressing submit. Typing “/ p cli clm” after specifying the model outputs predicted values with 95% confidence intervals. From the output, the predicted posttest score for student 24 (whose posttest is missing) is 14.621, with a 95 percent confidence interval of 11.5836 < E(y|X24) < 17.6579. To test the joint hypothesis that the class coefficients are all zero, we add a test statement:
proc reg data = prepost; model post = pre class1 class2 class3;
b1: test class1=0, class2=0, class3=0;
run;
In this case we have named the test “b1”. Upon highlighting and pressing the submit button, SAS
automatically forms the correct test statistic, and we see
The first line gives us the value of the F statistic and the associated P-value, where we see that
we can reject the null that all class coefficients are zero at any probability of Type I error greater
than 0.0095.
The above test of the linear restriction β3 = β4 = β5 = 0 (no difference among classes) assumed that the pretest slope coefficient was constant, fixed and unaffected by the class to which a
student belonged. A full structural test requires the fitting of four separate regressions to obtain
the four residual sum of squares that are added to obtain the unrestricted sum of squares. The
restricted sum of squares is obtained from a regression of posttest on pretest with no dummies for
the classes; that is, the class to which a student belongs is irrelevant in the manner in which
pretests determined the posttest score.
We can perform this test in one of two ways. First, we can run the restricted (pooled) regression and the four separate class regressions, take note of the residual sums of squares from each, and explicitly calculate the F statistic. We already know how to run basic regressions in SAS, so the new part is how to run the regression separately for each class. For this, we create a new variable that identifies the class to which each observation belongs; with it, all four class-specific regressions can be run in a single pass. We first create the class variable “class” by typing into the editor window:
data prepost; set prepost; if class1 = 1 then class = 1; if class2 = 1 then class = 2;
if class3 = 1 then class = 3; if class4 = 1 then class = 4; run;
Highlight the command and click on the submit button. We now run all four class-specific regressions simultaneously by typing:
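(The command is not reproduced in this excerpt; the following sketch, in which the proc sort step and the by statement are assumptions, runs the regression separately by class.)
proc sort data = prepost; by class; run;
proc reg data = prepost; model post = pre; by class; run;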
into the editor window, highlighting the text and press submit. The resulting output is as follows:
F = [(ErrorSSr − ErrorSSu) / K(J − 1)] / [ErrorSSu / (n − JK)]
Because the calculated F = 2.92 and the critical F (Prob of Type I error = 0.05, df1 = 6, df2 = 15) = 2.79, we reject the null hypothesis and conclude that at least one class is significantly different from another when the slope on the pre-course test score is allowed to vary from one class to another. That is, the class in which a student is enrolled is important because of a change in slope and/or intercept.
The second way to test for a structural break is to create several interaction terms and test
whether the dummy and interaction terms are jointly significantly different from zero. To
perform the Chow test this way, we first generate interaction terms between all dummy variables
and independent variables. To do this in SAS, type the following into the editor window,
highlight it and press submit:
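(The data step creating the interaction terms is not shown in this excerpt; a sketch using the variable names that appear in the proc reg call below is:)
data prepost; set prepost;
pre_c1 = pre*class1; pre_c2 = pre*class2; pre_c3 = pre*class3;
run;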
proc reg data = prepost; model post = pre class1 class2 class3 pre_c1 pre_c2 pre_c3;
b2: test class1=0, class2=0, class3=0, pre_c1=0, pre_c2=0, pre_c3=0; run;
into the editor window, highlight it, and press submit. (The test statement is a sketch; the label “b2” is arbitrary.) The regression output itself is not of primary interest; it is the joint test that matters. The resulting output is:
F( 6, 15) = 2.93
Prob > F = 0.0427
Just as we could see in LIMDEP, our F statistic is 2.93, with a P-value of 0.0427. We again
reject the null (at a probability of Type I error=0.05) and conclude that class is important either
through the slope or intercept coefficients. This type of test will always yield results identical to
the restricted regression approach.
HETEROSCEDASTICITY
You can control for heteroscedasticity across individual observations, or allow for a common error component within specific groups (in this case, within a given class but not across classes), by requesting heteroscedasticity-robust or cluster-robust standard errors, respectively, in your regression command.
To account for a common error term within groups, but not across groups, we create a class variable that assigns each student to one of the 4 classes. This variable identifies the group (or cluster) a student is in. To generate this variable, type:
data prepost; set prepost; if class1 = 1 then class = 1; if class2 = 1 then class = 2;
if class3 = 1 then class = 3; if class4 = 1 then class = 4; run;
Then to allow for clustered error terms, our regression command is:
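(The command is not shown in this excerpt. One way to obtain cluster-robust standard errors in SAS, offered only as a sketch, is proc surveyreg with a cluster statement:)
proc surveyreg data = prepost; cluster class; model post = pre class1 class2 class3; run;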
Similarly, to account for general heteroscedasticity across individual observations, our regression
command is:
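(Again the command itself is not shown. A sketch using proc reg’s heteroscedasticity-consistent covariance option, available in SAS 9.2 and later, is:)
proc reg data = prepost; model post = pre class1 class2 class3 / hcc; run;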
As seen in the Becker and Powers (2001) study, variables often need to be transformed or created within a computer program to perform the desired analysis. To demonstrate the process
and commands in SAS, start with the Becker and Powers data that have been or can be read into
SAS as shown earlier.
As always, we should look at our log file and data before we start doing any work. Viewing the
log file, the data set BECPOW has 2849 observations and 64 variables. Upon viewing the
dataset, we should notice there are several “extra” observations at the end of the data set. These
are essentially extra rows that have been left blank but were somehow utilized in the original
Excel file (for instance, just pressing enter at the last cell will generate a new record with all missing
variables). SAS correctly reads these 12 observations as missing values, but because we know
these are not real observations, we can just drop these with the command
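(The command is not reproduced here; a sketch of the corresponding data step is:)
data BECPOW; set BECPOW; if a1 = . then delete; run;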
This works because a1 is not missing for any of the other observations.
Next, 0 - 1 binary variables “phd” and “final” are created to show whether the instructor had a PhD degree and whether the student got a positive score on the postTUCE. To allow for quadratic forms in teacher experience and class size, the variables “dmsq” and “hbsq” are created. In this
data set, all missing values were coded −9. Thus, adding together some of the responses to the
student evaluations gives information on whether a student actually completed an evaluation. For
example, if the sum of “ge”, “gh”, “gm”, and “gq” equals −36, we know that the student did not
complete a student evaluation in a meaningful way. A dummy variable “noeval” is created to
reflect this fact. Finally, from the TUCE developer it is known that student number 2216 was counted in term 2 but was actually in term 1, and no postTUCE was taken (see the “hb” line in the syntax). The
following are the commands:
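(The commands are not reproduced here. The partial sketch below omits the phd and final assignments, whose source columns are not identified in the surrounding text, and assumes dm is the teacher-experience variable underlying dmsq.)
data BECPOW; set BECPOW;
dmsq = dm*dm;                        /* quadratic in teacher experience; dm assumed */
hbsq = hb*hb;                        /* quadratic in class size */
noeval = (ge + gh + gm + gq = -36);  /* student did not complete a meaningful evaluation */
if hb = 90 then hb = 89;             /* correction for student 2216 noted above */
run;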
These commands can be entered into SAS as a block, highlighted and run with the “Submit”
button.
data BECPOW; set BECPOW;   /* data step wrapper assumed */
if an = -9 then delete;
if hb = -9 then delete;
if ci = -9 then delete;
if an = . then delete;
if cs = 0 then delete;
run;
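/* The procedure statement for the model line below is not shown in this excerpt. proc logistic
   with the descending option is one possibility, consistent with the link=, tech=, descending,
   and event= discussion that follows, but it is an assumption. */
proc logistic data = BECPOW descending;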
model final = an hb doc comp lib ci ck phd noeval / link=probit tech= newton;
run;
into the editor window, highlighting it, and pressing the submit button. The SAS probit procedure by default treats the smaller value of the dependent variable as the success category. Thus, the magnitudes of the coefficients remain the same, but the signs are opposite to those of STATA and LIMDEP. The “descending” option forces SAS to treat the larger value as success. Alternatively, you may
explicitly specify the category of successful “final” using the “event” option. The option
“link=probit” tells SAS that instead of running a logistic regression, we would like to do a probit
regression. We can then retrieve the marginal effects by typing:
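(The command is not reproduced in this excerpt. One way to obtain marginal effects for a probit in SAS, offered as a sketch rather than as the procedure used in the original, is proc qlim with a marginal output request:)
proc qlim data = BECPOW;
model final = an hb doc comp lib ci ck phd noeval / discrete(dist=normal);
output out = margeff marginal;
run;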
Results from each model are equivalent to those of LIMDEP and STATA, where we see the initial class size (hb) probit coefficient is −0.004883 with a P-value of 0.0112, and the estimated coefficient of “hc” is 0.0000159 with a P-value of 0.9399. These results imply that initial class size is strongly significant, whereas final class size is insignificant.
The overall goodness of fit can be assessed in several ways. A straightforward way is to use the Chi-square statistic found in the “Fit Statistics” output, which is 922.95 with df = 9 for this model.
CONCLUDING REMARKS
The goal of this hands-on component of Module One, Part Four was to enable users to get data
into SAS, create variables and run regressions on continuous and discrete variables; it was not to
explain all of the statistics produced by computer output. For this, an intermediate-level econometrics textbook (such as Jeffrey Wooldridge’s Introductory Econometrics) or an advanced econometrics textbook (such as William Greene’s Econometric Analysis) must be consulted.
REFERENCES
Becker, William E. and John Powers (2001). “Student Performance, Attrition, and Class Size
Given Missing Student Data,” Economics of Education Review, Vol. 20, August: 377-388.
Delwiche, Lora and Susan Slaughter (2003). The Little SAS Book: A Primer, Third Edition, SAS
Publishing.
William E. Becker
Professor of Economics, Indiana University, Bloomington, Indiana, USA
Adjunct Professor of Commerce, University of South Australia, Adelaide, Australia
Research Fellow, Institute for the Study of Labor (IZA), Bonn, Germany
Editor, Journal of Economic Education
Editor, Social Science Research Network: Economic Research Network Educator
Federal Reserve Chair Ben Bernanke said that being an economist is like being a
mechanic working on an engine while it is running. Economists typically do not have the
convenience of random assignment as in laboratory experiments. However, in some
situations they can take advantage of random events such as lotteries or nature. In other
circumstances, they might be able to produce variables that have desired random
components. When this is possible, they can use instrumental variable techniques and
two-stage least squares estimation, which is the focus of Module Two. Part One of
Module Two is devoted to the general theoretical issues associated with endogeneity.
Module Two, Parts Two, Three and Four provide the methods of instrumental variable
estimation using LIMDEP (NLOGIT), STATA and SAS. To get started consider three
types of problems for which instrumental variables are employed. 1
Pigou identified the most enduring criticism of regression analysis; namely, the
possibility that an unmeasured but relevant variable has been omitted from the regression
and that it is this variable that is giving the appearance of a causal relationship between
the dependent variable and the included regressors. As described by Michael Finkelstein
and Bruce Levin (1990, pp. 363-364 and pp. 409-415), for example, defense attorneys
continue to argue that the plaintiff's experts omitted relevant market and productivity
variables when they use regression analysis to demonstrate that women are paid less than
men. Modern academic journals are packed with articles that argue for one specification over another.
The second problem is errors in variables. The late Milton Friedman was
awarded the Nobel prize in Economics in part because of his path-breaking work in
estimating the relationship between consumption and permanent income, which is an
unobservable quantity. His work was later applied in unrelated areas such as education
research where a student’s grade is hypothesized to be a function of his or her effort and
ability, which are both unobservable. As we will see, unobserved explanatory variables
for which index variables are created give rise to errors-in-variables problems. As seen
in the early work of Becker and Salemi (1977), an outstanding example of this in
economic education research occurs when the pretest is used as a proxy for existing
knowledge, ability or prior understanding.
PROBLEMS OF ENDOGENEITY
Put simply, the problem of endogeneity occurs when an explanatory variable is related to
the error term in the population model of the data generating process, which causes the
ordinary least squares estimators of the relevant model parameters to be biased and
inconsistent. More precisely, for the least squares b vector to be a consistent estimator of the β vector in the population data generating model y = Xβ + ε, we require that, as the sample size n goes to infinity,
lim(n→∞) (1/n)X′X = Q (a positive definite matrix) and plim (1/n)X′ε = 0,
so that
plim b = β + Q⁻¹ plim((1/n)X′ε) = β.
In words, if observations on the explanatory variables (the Xs) are unrelated to draws
from the error terms (in vector ε ), then the sampling distribution of each of the
coefficients (the bs in b) will appear to degenerate to a spike on the relevant Beta, as the
sample size increases. In probability limit, a b is equal to its β : p lim b = β .
But if there is strong correlation between the Xs and εs, and this correlation does not deteriorate as the sample size goes to infinity, then the least squares estimators are not consistent estimators of the Betas and plim b ≠ β. The b vector is an inconsistent estimator because of endogenous regressors. That is, the sampling distribution of at least one of the coefficients (one of the bs in b) will not degenerate to a spike on the relevant Beta, as the sample size continues to increase.
OMITTED VARIABLE
If someone asserts that a regression has omitted variable bias, he or she is saying that the
population disturbance is related to an included regressor because a relevant explanatory
variable is missing in the estimated regression and its effects must be in the disturbance.
This is also known as unobserved heterogeneity because the effect of the omitted
variable also leads to population error term heterogeneity. The straightforward solution
is to include that omitted variable as a regressor, but often data on the missing variable
are unavailable. For example, as described in Becker (2004), the U.S. Congressional
Advisory Committee on Student Financial Assistance is interested in the functional
relationship between the effects of financial variables (e.g., family income, loan
availability, and/or grants) and the college-going decision, called persistence and
measured by the probability of attending a post-secondary institution, number of post-
secondary terms attempted and the like, in linear form:
The U.S. Department of Education is concerned about getting students “college ready,”
as measured by an index reflecting the completion of high school college prep courses,
high school grades, SAT scores and the like:
Putting the two interests together, where epsilon is the disturbance term, suggests that the
appropriate linear model is
Finances are now in the error term u. But students from wealthier families are known to
be more college ready than those from less well-off families. Thus, the explanatory
variable college ready is related to the error term u. If estimation is by OLS, biased and inconsistent estimation of λ2 results.
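To see the direction of the problem in the simplest case, consider a sketch (not the author’s exact specification) in which persistence depends linearly on college readiness and family finances are relegated to the error term:
persistence = λ1 + λ2(college ready) + u, with u = γ(finances) + ε.
The OLS estimator of λ2 then satisfies
plim λ̂2 = λ2 + γ·Cov(college ready, finances)/Var(college ready),
which differs from λ2 whenever college readiness and finances are correlated and γ ≠ 0.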
SIMULTANEITY
A classic case of simultaneity can be found in the most basic idea from microeconomics:
that the competitive market of supply and demand determines the equilibrium quantity.
The market data generating process is thus written as a three equation system:
Supply: Qs = m + nP + U
Demand: Qd = a + bP + cZ + V
Equilibrium Q = Qd = Qs
where E(U) = E(V) = 0, E(UV) = 0, E(U²) = σu², E(V²) = σv², and E(VZ) = E(UZ) = 0.
If the supply equation Q = m + nP + U is estimated by least squares, the slope estimator is
n̂ = (P′P)⁻¹P′Q = n + (P′P)⁻¹P′U.
But from the market structure assumed to be generating the data we know
P = (a − m)/(n − b) + [c/(n − b)]Z + (V − U)/(n − b) = β0 + β1Z + ε2.
Thus, E(P′U) ≠ 0:
E(PU) = E{[(a − m)/(n − b)]U + [c/(n − b)]ZU + [(V − U)/(n − b)]U} = E(−U²/(n − b)) = −σu²/(n − b).
The OLS estimator ( nˆ ) is downward biased; that is, the true population parameter is
expected to be underestimated by the least squares estimator:
E(n̂) = n − σu²/(n − b).
C = A + BX + U
She hypothesized that cities with many school districts provided more opportunity for
parents to switch their children in the pursuit of better schools; thus, competition among
school districts should lead to better schools as reflected in higher student test scores.
Allowing for other explanatory variables, the implied regression is
The causal effect of more school districts in a metropolitan area, however, may
not be clearly discerned from this regression of mean metropolitan test scores on the
number of school districts. Hoxby had anecdotal evidence that economy of scale
arguments might lead to two good school districts being merged. At the other extreme,
when districts were really bad they could not be merged with others and yet poor
performance did not imply that the district would be shut down (it might be taken over by
the state) even though a totally new district might be formed. That is, there is reverse
causality: bad test performance leads to more districts and good performance leads to
fewer.
This system of equations should make the simultaneity apparent. As discussed in more
detail later, the existence of the second equation (where both u and v include the effect of
unobservable ability) makes the free-response test score an endogenous regressor in the
first equation. Similarly, the existence of the first equation makes multiple-choice an
endogenous regressor in the second.
ERRORS IN VARIABLES
Next consider an “errors in variables” problem that leads to regressor and error term
correlation. In particular consider the example in which a student’s grade on an exam in
economics is hypothesized to be a function of effort and a random disturbance (u):
grade = A + B(effort) + u.
homework = C(effort) + v .
Solving the second equation for effort and substituting into the first gives grade = A + (B/C)homework + u*, where u* = u − (B/C)v. A shock to v causes a shock to homework; thus, homework and u* are correlated and the slope coefficient (B/C) cannot be estimated without bias via least squares.
Hoxby observed that areas with a lot of school districts also had a lot of streams, possibly because the streams made natural boundaries for the school districts. She had what has become known as a natural experiment. 2 The number of streams was a random event
in nature that had nothing to do with the population error term ( ε ) in the student
performance equation but yet was highly related to number of school districts. 3
For simplicity, ignoring any other variables in the student performance equation
and measuring test scores, number of school districts and number of streams in deviation
from their respective means, a consistent estimate of the effect of the number of school
districts on test scores can be obtained with the instrumental variable estimator
b2 = Σ(dev. in test scores)(dev. in number of streams) / Σ(dev. in number of school districts)(dev. in number of streams).
To appreciate why the instrumental estimator works, consider the expected value of the
terms in the numerator:
b_IV = Σi(zi − z̄)(yi − ȳ) / Σi(zi − z̄)(xi − x̄), summing over i = 1, …, n.
As with the OLS estimator, the IV estimator has an asymptotically normal distribution.
The IV large sample variance is obtained by
s²_bIV = [Σi(yi − ŷi)²/n] / [r²x,z Σi(xi − x̄)²],
where r²x,z is the coefficient of determination (square of the correlation coefficient) for x and
z. Notice, if the correlation between x and z were perfect, the IV and OLS variance
estimators would be the same. On the other hand, if the linear relationship between the x
and z is weak, then the IV variance will greatly exceed that calculated by OLS.
Important to recognize is that a poor instrument is one that has a low r²x,z, causing the standard error of the estimated slope coefficient to be overly large, or has E(Zε) ≠ 0,
implying the Z was in fact endogenous. Unlike OLS estimators, the desired properties of
IV estimators are all asymptotic; thus, to refer to small sample statistics like the t ratio is
not appropriate. The appropriate statistic for testing with b_IV is the standard normal:
Z ≅ (b_IV − β)/s_bIV, for large n.
where veteran is one if a veteran of the Vietnam War and zero otherwise. Angrist
recognized that there was a sample selection problem (to be discussed in detail in a later
module). It is likely that those who expected their earnings to be enhanced by the
For his instrument, Angrist observed that the lottery used to draft young men
provided a natural experiment. Lottery numbers were assigned randomly; thus, the
number received would not be correlated with ε . Men receiving lower numbers faced a
higher probability of being drafted; thus, lottery numbers are correlated with being a
Vietnam vet.
The use of these natural experiments has been, and likely will continue to be, a source of instrumental variables for endogenous explanatory variables. Michael Murray (2006)
provided a detailed but easily read review of natural experiments and the use of the IV
estimator.
Often there are many exogenous variables that could be used as instruments for endogenous variables. Let matrix Z contain the set of all the exogenous variables that could serve as instruments for the set of regressors. The instrumental variable estimator is now of the general form
b_IV = (Z′X)⁻¹Z′y
Var(b_IV) = σ²(Z′X)⁻¹Z′Z(X′Z)⁻¹.
Unlike the selective replacement of a regressor with its instrument, for sets of regressors the typical estimation procedure involves the projection of each of the columns of X onto the column space of Z; at least conceptually we have
X̂ = Z[(Z′Z)⁻¹Z′X].
b_IV = (X̂′X)⁻¹X̂′y
    = [X′Z(Z′Z)⁻¹Z′X]⁻¹X′Z(Z′Z)⁻¹Z′y
    = [X′(I − M2)X]⁻¹X′(I − M2)y
    = (X̂′X̂)⁻¹X̂′y,
which suggests a two-step process: 1) regress the endogenous regressor(s) on all the exogenous variables; 2) use the predicted values from step 1 as replacements for the endogenous regressor(s) in the equation of interest.
Unfortunately the standard errors associated with this TSLS estimation approach
do not reflect the fact that the instrument is a combination of variables. That is, the
standard errors obtained from the second step do not reflect the number of variables used
in the first step predictions. In the case of a single instrument the difference between the
variances of OLS and IV estimators was captured in the magnitude of r²x,z, and a similar adjustment must be made when multiple variables are used to form the instruments. Advanced econometrics programs like LIMDEP, STATA and SAS automatically do this in their TSLS programs.
We wish to test plim(X′ε/n) = 0, but cannot use the covariance between the n × K matrix X and the n residuals (ei = yi − ŷi) in the n × 1 vector e because X′e = 0 is a byproduct of
least squares. Greene (2003, pp. 80-83) outlined the testing procedure originally
proposed by Durbin (1954) and then extended by Wu (1973) and Hausman (1978).
Davidson and MacKinnon (1993) are recognized for providing an algebraic
demonstration of test statistic equivalence. Asymptotically, a Wald (W) statistic may be
used in a Chi-square ( χ 2 ) test with K* degrees of freedom, or for smaller samples, an F
statistic, with K* and n − (K + K*) degrees of freedom, can be used to test the joint
significance of the contribution of the predicted values (X̂*) of a regression of the K*
endogenous regressors, in matrix X*, on the exogenous variables (and a column of ones
for the constant term) in matrix Z:
y = Xβ + X̂*γ + ε*,
where X* = Zλ + u, X̂* = Zλ̂, and λ̂ is a least squares estimator of λ.
Consider, for example, the exam grade equation grade = β1 + β2homework + ε.
The theoretical data generating process that gave rise to this model suggests that number
of homeworks completed is an endogenous regressor. To test this we need truly
exogenous variables – say x2 and x3 , which might represent student gender and race. The
number of homeworks is then regressed on these two exogenous variables to get the least
square equation
predicted homework = λ̂1 + λ̂2x2 + λ̂3x3.
This predicted homework variable is then added to the exam grade equation to form the augmented regression
grade = β1 + β2homework + γ(predicted homework) + ε*.
In this example, K = 2 (for β1 and β 2 ) and K* = 1 (for γ ); thus, the degrees of freedom
for the F statistic are 1 and n − (K + K*) , which is also the square of a t statistic with n −
(K + K*) degrees of freedom. That is, with only one endogenous variable and relatively
small sample n, the t statistic printed by a computer program is sufficient to do the test.
(Recall that asymptotically the t goes to the standard normal, with no adjustment for
degrees of freedom required.) As with any other F, χ 2 , t or z test, calculated statistics
greater than their critical values lead to the rejection of the null hypothesis. Important to
keep in mind, however, is that failure to reject the null hypothesis at a specific probability
of a Type I error does not prove exogeneity. The null hypothesis can always be rejected
at some Type I error level.
The additional calculation of this residual for inclusion in the augmented regression is not
necessary because the absolute value of the estimate of γ and its standard error are
identical regardless of whether predicted homework or the residual (= homework −
predicted homework) is used.
Finally, keep in mind that you can use all the exogenous variables in the system to
predict the endogenous variable. Some of these exogenous variables can even be in the
original equation of interest – in the grade example, the grade equation might itself have included x2, with the auxiliary regression remaining predicted homework = λ̂1 + λ̂2x2 + λ̂3x3.
As will become clear in the next section, the auxiliary equation should always have at
least one more exogenous variable than the initial equation of interest.
IDENTIFICATION CONDITIONS
Supply in equilibrium: Q = m + nP + U
Demand in equilibrium: Q = a + bP + cZ + V
If Z is household income, then an increase in Z shifts the demand curve up, from D to D',
but does not affect the supply curve (Figure 2); thus, the supply curve is identified by the
change in equilibrium observations. Notice, however, that the demand curve is not
identified because there is no unique exogenous variable in the supply equation.
Any exogenous variable that is excluded from at least one equation in an equation
system can be used as an instrumental variable. It can be used as an instrument in the
equation from which it is excluded. For example, in the supply and demand equation
system, the reduced form (no endogenous variables as explanatory variables) for P is
P = (a − m)/(n − b) + [c/(n − b)]Z + (V − U)/(n − b) = β1 + β2Z + ε2.
And either the price predicted from this equation or Z itself can be used as the instrument
for P in the supply equation. If there were more exogenous variables excluded from the
supply equation then they could all be used to get predicted price from the reduced form
equation.
Notice that the coefficient on Z in the reduced form equation for P must be
nonzero for Z to be used as an instrument, which requires that c ≠ 0 and n – b ≠ 0. This
requirement states that exogenous variable(s) excluded from the supply equation must
have a nonzero population coefficient in the demand equation and that the effect of price
cannot be the same in both demand and supply. This is known as the rank condition.
In equation form, the two equations that summarize the influence of the various
covariates on multiple-choice and essay test questions are written as the following
structural equations:
Mi = ρ21 + ρ22Wi + ρ23M̄i + Σ(j=4 to J) ρ2jXij + Ui*.
Wi = ρ31 + ρ32Mi + ρ33W̄i + Σ(j=4 to J) ρ3jXij + Vi*.
Mi and Wi are the ith student’s respective scores on the multiple-choice test and essay test. M̄i and W̄i are the mean multiple-choice and essay test scores at the school where the ith student took the twelfth grade economics course. The Xij variables are the other exogenous variables used to explain the ith student’s multiple-choice and essay marks, where the ρs are parameters to be estimated. Ui* and Vi* are assumed to be zero mean and constant variance error terms that may or may not each include an effect of unobservable ability.
Least squares estimation of the ρs will involve bias if the respective error terms Ui* and Vi* are related to regressors (Wi in the first equation, and Mi in the second equation).
Such relationships are seen in the reduced form equations, which are obtained by
solving for M and W in terms of the exogenous variables and the error terms in these two
equations:
Mi = Γ21 + Γ22W̄i + Γ23M̄i + Σ(j=4 to J) Γ2jXij + Ui**.
Wi = Γ31 + Γ32M̄i + Γ33W̄i + Σ(j=4 to J) Γ3jXij + Vi**.
The reduced form parameters (Γs) are functions of the ρs, and U** and V** are
dependent on U* and V*:
Vi** = (Vi* + ρ32Ui*) / (1 − ρ22ρ32).
In the reduced form error terms, it can be seen that a random shock in U* causes a
change in V**, which causes a change in W in the reduced form. Thus, W and U* are
related in the essay structural equation, and consistent estimation of the parameters in this
equation is not possible using least squares. Similarly, a shock in V*, and a resulting
change in U**, yields a change in M. Thus, M and V* are dependent in the structural
equation, and least squares estimators of the parameters in that equation are inconsistent.
The inclusion of M̄i and W̄i in their respective structural equations, and their
exclusion from the other equation, enables both of the structural equations to be identified
within the system. For example, if a student moves from a school with a low average
multiple-choice test score to one with a higher average multiple-choice test score, then
his or her multiple-choice score will rise via a shift in the M-W relationship in the first
structural equation, but this shift is associated with a move along the W-M relationship in
the second structural equation; thus, the second structural equation is identified.
Similarly, if a student moves from a low average essay test score school to a higher one,
then his or her essay test score will rise via a shift in the W-M relationship in second
structural equation, but this shift implies a move along the M-W relationship in the first
structural equation, and this first structural equation is thus identified. Most certainly,
identification hinges critically on justifying the exclusionary rule employed.
CONCLUDING COMMENTS
REFERENCES
Angrist, Joshua D. (1990). “Lifetime Earnings and the Vietnam Era Draft Lottery: Evidence from Social Security Administrative Records,” American Economic Review, Vol. 80 (June): 313-336.
Angrist, Joshua D. and Alan B. Krueger (2001). “Instrumental Variables and the Search for Identification: From Supply and Demand to Natural Experiments,” Journal of Economic Perspectives, Vol. 15 (Fall): 69-85.
Becker, William E. and Peter Kennedy (1995). “A Lesson in Least Squares and R Squared,” American Statistician, Vol. 55 (November): 282-283. Portions reprinted in Dale Poirier, Intermediate Statistics and Econometrics (Cambridge: MIT Press, 1995): 562-563.
Becker, William E. and Michael Salemi (1977). “The Learning and Cost Effectiveness of AVT Supplemented Instruction: Specification of Learning Models,” Journal of Economic Education, Vol. 8 (Spring): 77-92.
Finkelstein, Michael and Bruce Levin (1990). Statistics for Lawyers. New York: Springer-Verlag.
Greene, William (2003). Econometric Analysis. 5th Edition, New Jersey: Prentice Hall.
Hilsenrath, Jon (2005). “Novel Way to Assess School Competition Stirs Academic Row,” Wall Street Journal (October 24): A1 and A11.
Murray, Michael P. (2006). “Avoiding Invalid Instruments and Coping with Weak Instruments,” Journal of Economic Perspectives, Vol. 20 (Fall): 111-132.
Waldman, Michael, Sean Nicholson and Nodir Adilov (2006). “Does Television Cause Autism?” NBER Working Paper No. W12632 (October).
Whitehouse, Mark (2007). “Mind and Matter: Is an Economist Qualified to Solve Puzzle of Autism? Professor’s Hypothesis: Rainy Days and TV May Trigger Condition,” Wall Street Journal. February 27: A.1.
1
Conceptually there are more than three forms of endogeneity that could occur. For
example, if there is a lagged dependent variable and the residuals are serially correlated,
then the lagged dependent variable will be correlated with the error term. This is not a
problem for the typical cross-section regressions considered by economic educators but
does become a problem when time is introduced. To see this, consider a data generating process in which knowledge of economics (yit) of the ith student at time t is a linear function of the student’s ability at time t (xit) plus an error term (εit):
yit = β1 + β2xit + εit and yit−1 = β1 + β2xit−1 + εit−1.
If learning is assessed in the following equation, then the pretest yit-1 regressor is
endogenous by construction:
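(The assessment equation referred to here does not appear in this excerpt. One common form, given only as a sketch, regresses the posttest on ability and the pretest:
yit = γ1 + γ2xit + γ3yit−1 + uit.
Because yit−1 contains εit−1, serial correlation in the ε’s makes yit−1 correlated with uit, so the pretest regressor is endogenous.)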
Part Two of Module Two provides a cookbook-type demonstration of the steps required to use
LIMDEP to address problems of endogeneity using a two-stage least squares, instrumental
variable estimator. The Durbin, Hausman and Wu specification test for endogeneity is also
demonstrated. Users of this module need to have completed Module One, Parts One and Two,
and Module Two, Part One. That is, from Module One, users are assumed to know how to get
data into LIMDEP, recode and create variables within LIMDEP, and run and interpret regression
results. From Module Two, Part One, they are expected to have an understanding of the problem
of and source of endogeneity and the basic idea behind an instrumental variable approach and the
two-stage least squares method. The Becker and Johnston (1999) data set is used throughout
this module for demonstration purposes only. Module Two, Parts Three and Four demonstrate
in STATA and SAS what is done here in LIMDEP.
THE CASE
As described in Module Two, Part One, Becker and Johnston (1999) called attention to
classroom effects that might influence multiple-choice and essay type test taking skills in
economics in different ways. For example, if the student is in a classroom that emphasizes skills
associated with multiple choice testing (e.g., risk-taking behavior, question analyzing skills,
memorization, and keen sense of judging between close alternatives), then the student can be
expected to do better on multiple-choice questions. By the same token, if placed in a classroom
that emphasizes the skills of essay test question answering (e.g., organization, good sentence and
paragraph construction, obfuscation when uncertain, logical argument, and good penmanship),
then the student can be expected to do better on the essay component. Thus, Becker and
Johnston attempted to control for the type of class of which the student is a member. Their
measure of “teaching to the multiple-choice questions” is the mean score or mark on the
multiple-choice questions for the school in which the ith student took the 12th grade economics
course. Similarly, the mean school mark or score on the essay questions is their measure of the
ith student’s exposure to essay question writing skills.
In equation form, the two equations that summarize the influence of the various
covariates on multiple-choice and essay test questions are written as the following structural equations:
Mi = ρ21 + ρ22Wi + ρ23M̄i + Σ(j=4 to J) ρ2jXij + Ui*.
Wi = ρ31 + ρ32Mi + ρ33W̄i + Σ(j=4 to J) ρ3jXij + Vi*.
As shown in Module Two, Part One, the least squares estimation of the ρs involves bias because the error term Ui* is related to Wi in the first equation, and Vi* is related to Mi in the second equation. Instruments for regressors Wi and Mi are needed. Because the reduced form
equations express Wi and Mi solely in terms of exogenous variables, they can be used to generate
the respective instruments:
Mi = Γ21 + Γ22W̄i + Γ23M̄i + Σ(j=4 to J) Γ2jXij + Ui**.
Wi = Γ31 + Γ32M̄i + Γ33W̄i + Σ(j=4 to J) Γ3jXij + Vi**.
The reduced form parameters (Γs) are functions of the ρs, and the reduced form error terms U**
and V** are functions of U* and V*, which are not related to any of the regressors in the reduced
form equations.
We could estimate the reduced form equations and get M̂i and Ŵi. We could then substitute M̂i and Ŵi into the structural equations as proxy regressors (instruments) for Mi and Wi. The least squares regression of Mi on Ŵi, M̄i and the Xs and a least squares regression of Wi on M̂i, W̄i and the Xs would yield consistent estimates of the respective ρs, but the standard errors would be incorrect. LIMDEP automatically does all the required estimations with the two-stage least squares command:
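(The command is not reproduced in this excerpt. A sketch of the form it takes for the multiple-choice equation, using the instrument list described later in this section, is:)
2SLS; Lhs = TOTALMC;
Rhs = TOTESSAY, ONE, ADULTST, SEX, AVGMC, ESLFLAG, EC011, EN093, MA081, MA082, MA083, SMALLER, SMALL, LARGE, LARGER;
Inst = ONE, ADULTST, SEX, AVGMC, AVGESSAY, ESLFLAG, EC011, EN093, MA081, MA082, MA083, SMALLER, SMALL, LARGE, LARGER$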
The Becker and Johnston (1999) data are in the file named “Bill.CSV.” Before reading these
data into LIMDEP, however, the “Project Settings” must be increased from 200000 cells (222
Using these recode and create commands yields the following relevant variable definitions:
In all of the regressions, the effect of being at a school with more than 49 test takers is captured
in the constant term, against which the other dummy variables are compared. The smallest
schools need to be rejected to treat the mean scores as exogenous and unaffected by any
individual student’s test performance, which is accomplished with the following command:
Reject; smallest = 1$
Dstat;RHS=TOTALMC,AVGMC,TOTESSAY,AVGESSAY,ADULTST,SEX,ESLFLAG,
EC011,EN093,MA081,MA082,MA083,SMALLER,SMALL,LARGE,LARGER$
Descriptive Statistics
All results based on nonmissing observations.
===============================================================================
Variable Mean Std.Dev. Minimum Maximum Cases
===============================================================================
-------------------------------------------------------------------------------
All observations in current sample
-------------------------------------------------------------------------------
TOTALMC 12.4355795 3.96194160 .000000000 20.0000000 3710
AVGMC 12.4355800 1.97263767 6.41666700 17.0714300 3710
TOTESSAY 18.1380054 9.21191366 .000000000 40.0000000 3710
AVGESSAY 18.1380059 4.66807071 5.70000000 29.7857100 3710
ADULTST .512129380E-02 .713893539E-01 .000000000 1.00000000 3710
SEX .390566038 .487943012 .000000000 1.00000000 3710
ESLFLAG .641509434E-01 .245054660 .000000000 1.00000000 3710
EC011 .677088949 .467652064 .000000000 1.00000000 3710
EN093 .622641509E-01 .241667268 .000000000 1.00000000 3710
MA081 .591374663 .491646035 .000000000 1.00000000 3710
MA082 .548787062 .497681208 .000000000 1.00000000 3710
MA083 .420215633 .493659946 .000000000 1.00000000 3710
SMALLER .462264151 .498641179 .000000000 1.00000000 3710
SMALL .207277628 .405410797 .000000000 1.00000000 3710
LARGE .106469003 .308478530 .000000000 1.00000000 3710
LARGER .978436658E-01 .297143201 .000000000 1.00000000 3710
For comparison with the two-stage least squares results, we start with the least squares
regressions shown after this paragraph. The least squares estimations are typical of those found
in multiple-choice and essay score correlation studies, with correlation coefficients of 0.77 and
0.78. The essay mark or score, W, is the most significant variable in the multiple-choice score
regression (first of the two tables) and the multiple-choice mark, M, is the most significant
variable in the essay regression (second of the two tables). Results like these have led
researchers to conclude that the essay and multiple-choice marks are good predictors of each
other. Notice also that both the mean multiple-choice and mean essay marks are significant in
their respective equations, suggesting that something in the classroom environment or group
experience influences individual test scores. Finally, being female has a significant negative
effect on the multiple choice-test score, but a significant positive effect on the essay score, as
expected from the least squares results reported by others. We will see how these results hold up
in the two-stage least squares regressions.
Regress;LHS=TOTALMC;RHS=TOTESSAY,ONE,ADULTST,SEX,AVGMC,
ESLFLAG,EC011,EN093,MA081,MA082,MA083,SMALLER,SMALL,LARGE,LARGER$
+-----------------------------------------------------------------------+
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = TOTALMC Mean= 12.43557951 , S.D.= 3.961941603 |
| Model size: Observations = 3710, Parameters = 15, Deg.Fr.= 3695 |
| Residuals: Sum of squares= 23835.89955 , Std.Dev.= 2.53985 |
| Fit: R-squared= .590590, Adjusted R-squared = .58904 |
| Model test: F[ 14, 3695] = 380.73, Prob value = .00000 |
| Diagnostic: Log-L = -8714.8606, Restricted(b=0) Log-L = -10371.4459 |
| LogAmemiyaPrCrt.= 1.868, Akaike Info. Crt.= 4.706 |
| Autocorrel: Durbin-Watson Statistic = 1.99019, Rho = .00490 |
Regress;LHS=TOTESSAY;RHS=TOTALMC,ONE, ADULTST,SEX,AVGESSAY,
ESLFLAG,EC011,EN093,MA081,MA082,MA083,SMALLER,SMALL,LARGE,LARGER$
+-----------------------------------------------------------------------+
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = TOTESSAY Mean= 18.13800539 , S.D.= 9.211913659 |
| Model size: Observations = 3710, Parameters = 15, Deg.Fr.= 3695 |
| Residuals: Sum of squares= 123011.3151 , Std.Dev.= 5.76986 |
| Fit: R-squared= .609169, Adjusted R-squared = .60769 |
| Model test: F[ 14, 3695] = 411.37, Prob value = .00000 |
| Diagnostic: Log-L = -11759.0705, Restricted(b=0) Log-L = -13501.8081 |
| LogAmemiyaPrCrt.= 3.509, Akaike Info. Crt.= 6.347 |
| Autocorrel: Durbin-Watson Statistic = 2.03115, Rho = -.01557 |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
TOTALMC 1.408895961 .28223608E-01 49.919 .0000 12.435580
Constant -8.948704180 .55427657 -16.145 .0000
ADULTST -.8291495512 1.3454556 -.616 .5377 .51212938E-02
SEX 1.239956900 .20801072 5.961 .0000 .39056604
AVGESSAY .4000235352 .23711680E-01 16.870 .0000 18.138006
ESLFLAG .4511403830 1.9369352 .233 .8158 .64150943E-01
EC011 .2985371912 .21044864 1.419 .1560 .67708895
EN093 -2.020881931 1.9647001 -1.029 .3037 .62264151E-01
MA081 .8495120566 .41061265 2.069 .0386 .59137466
MA082 .1590915478 .44249860 .360 .7192 .54878706
MA083 1.809541566 .26793945 6.754 .0000 .42021563
SMALLER .6170663022 .33054246 1.867 .0619 .46226415
SMALL .2693408755 .35476913 .759 .4477 .20727763
LARGE .2646447973 .40526280 .653 .5137 .10646900
LARGER .6150288712E-01 .41436703 .148 .8820 .97843666E-01
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)
Theoretical considerations discussed in Module Two, Part One, suggest that these least
squares estimates involve a simultaneous equation bias that is brought about by an apparent
reverse causality between the two forms of testing. Consistent estimation of the parameters in
this simultaneous equation system is possible with two-stage least squares, where our instrument
(M̂i) for Mi is obtained by a least squares regression of Mi on SEX, ADULTST, AVGMC, AVGESSAY, ESLFLAG, SMALLER, SMALL, LARGE, LARGER, EC011, EN093, MA081, MA082, and MA083. Our instrument (Ŵi) for Wi is obtained by a least squares regression of Wi on the same set of exogenous variables.
The 2SLS results differ from the least squares results in many ways. The essay mark or
score, W, is no longer a significant variable in the multiple-choice regression and the multiple-
choice mark, M, is likewise insignificant in the essay regression. Each score appears to be
measuring something different when the regressor and error-term-induced bias is eliminated by
our instrumental variable estimators.
Both the mean multiple-choice and mean essay scores continue to be significant in their
respective equations. But now being female is insignificant in explaining the multiple-choice
test score. Being female continues to have a significant positive effect on the essay score.
The theoretical argument is strong for treating multiple-choice and essay scores as endogenous
when employed as regressors in the explanation of the other. Nevertheless, this endogeneity can
be tested with the Durbin, Hausman and Wu specification test, which is a two-step procedure in
LIMDEP versions prior to 9.0.4. ii
Either a Wald statistic, in a Chi-square ( χ 2 ) test with K* degrees of freedom, or an F
statistic with K* and n − (K + K*) degrees of freedom, is used to test the joint significance of the
contribution of the predicted values (X̂*) of a regression of the K* endogenous regressors, in matrix X*, on the exogenous variables (and a column of ones for the constant term) in matrix Z:
y = Xβ + X̂*γ + ε*,
where X* = Zλ + u, X̂* = Zλ̂, and λ̂ is a least squares estimator of λ.
In our case, K* = 1 when the essay score is to be tested as an endogenous regressor in the
multiple-choice equation and when the multiple-choice regressor is to be tested as endogenous in
the essay equation. X̂ * is an n × 1 vector of predicted essay scores from a regression of essay
scores on all the exogenous variables (for subsequent use in the multiple-choice equation) or an
n × 1 vector of predicted multiple-choice scores from a regression of multiple-choice scores on all
the exogenous variables (for subsequent use in the essay equation). Because K* = 1, the relevant
test statistic is either the t, with n − (K + K*) degrees of freedom for small n or the standard
normal, for large n.
In LIMDEP, the predicted essay score is obtained by the following command, where the
specification “;keep=Essayhat” tells LIMDEP to predict the essay scores and keep them as a
variable called “Essayhat”:
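By analogy with the MChat command shown later for the multiple-choice score, the command takes the following form:
Regress;LHS=TOTESSAY;RHS=ONE,ADULTST,SEX,AVGMC,AVGESSAY,
ESLFLAG,EC011,EN093,MA081,MA082,MA083,SMALLER,SMALL,LARGE,LARGER;
keep=Essayhat$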
The predicted essay scores are then added as a regressor in the original multiple-choice
regression:
Regress;LHS=TOTALMC;RHS=TOTESSAY,ONE,ADULTST,SEX,AVGMC,
ESLFLAG,EC011,EN093,MA081,MA082,MA083,SMALLER,SMALL,LARGE,
LARGER, Essayhat$
The test statistic for the Essayhat coefficient is then used in the test of endogeneity. In the
LIMDEP output below, we see that the calculated standard normal test statistic z value is −12.916,
which in absolute value far exceeds the 1.96 critical standard normal value for a 0.05 probability
of a Type I error. Thus, the null hypothesis of an exogenous essay score as an explanatory variable
for the multiple-choice score is rejected. As theorized, the essay score is endogenous in an
explanation of the multiple-choice score.
+-----------------------------------------------------------------------+
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = TOTESSAY Mean= 18.13800539 , S.D.= 9.211913659 |
| Model size: Observations = 3710, Parameters = 15, Deg.Fr.= 3695 |
| Residuals: Sum of squares= 205968.5911 , Std.Dev.= 7.46609 |
| Fit: R-squared= .345598, Adjusted R-squared = .34312 |
| Model test: F[ 14, 3695] = 139.38, Prob value = .00000 |
| Diagnostic: Log-L = -12715.2253, Restricted(b=0) Log-L = -13501.8081 |
| LogAmemiyaPrCrt.= 4.025, Akaike Info. Crt.= 6.863 |
| Autocorrel: Durbin-Watson Statistic = 2.10143, Rho = -.05072 |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant -1.186477526 1.1613927 -1.022 .3070
ADULTST -.1617772661 1.7412483 -.093 .9260 .51212938E-02
SEX .6819632415 .27234747 2.504 .0123 .39056604
AVGESSAY .8405321032 .64642750E-01 13.003 .0000 18.138006
AVGMC .2714761464E-01 .15534612 .175 .8613 12.435580
ESLFLAG 1.739961011 2.5067133 .694 .4876 .64150943E-01
EC011 .7199749635 .27219191 2.645 .0082 .67708895
EN093 -4.021669541 2.5417647 -1.582 .1136 .62264151E-01
MA081 1.076407689 .53146100 2.025 .0428 .59137466
MA082 1.237970826 .57190601 2.165 .0304 .54878706
MA083 3.932399725 .34253928 11.480 .0000 .42021563
SMALLER .3418961082 .43385196 .788 .4307 .46226415
SMALL -.1350660711 .46222353 -.292 .7701 .20727763
--> Regress;LHS=TOTALMC;RHS=TOTESSAY,ONE,ADULTST,SEX,AVGMC,
ESLFLAG,EC011,EN093,MA081,MA082,MA083,SMALLER,SMALL,LARGE,LARGER,
Essayhat$
+-----------------------------------------------------------------------+
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = TOTALMC Mean= 12.43557951 , S.D.= 3.961941603 |
| Model size: Observations = 3710, Parameters = 16, Deg.Fr.= 3694 |
| Residuals: Sum of squares= 22805.95017 , Std.Dev.= 2.48471 |
| Fit: R-squared= .608280, Adjusted R-squared = .60669 |
| Model test: F[ 15, 3694] = 382.41, Prob value = .00000 |
| Diagnostic: Log-L = -8632.9227, Restricted(b=0) Log-L = -10371.4459 |
| LogAmemiyaPrCrt.= 1.825, Akaike Info. Crt.= 4.662 |
| Autocorrel: Durbin-Watson Statistic = 2.07293, Rho = -.03647 |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
TOTESSAY .2855834321 .54748868E-02 52.162 .0000 18.138005
Constant -.3038295700 .40335588 -.753 .4513
ADULTST .2533493567 .57959250 .437 .6620 .51212938E-02
SEX -.8971949978E-01 .95478380E-01 -.940 .3474 .39056604
AVGMC .9748840572 .52324297E-01 18.632 .0000 12.435580
ESLFLAG .6744471036 .83423185 .808 .4188 .64150943E-01
EC011 .2925430155 .93110045E-01 3.142 .0017 .67708895
EN093 -1.588715660 .85191616 -1.865 .0622 .62264151E-01
MA081 .2995655100 .17988277 1.665 .0958 .59137466
MA082 .8159710785 .19337874 4.220 .0000 .54878706
MA083 1.635255739 .15173722 10.777 .0000 .42021563
SMALLER .2715919941 .14509939 1.872 .0612 .46226415
SMALL .4372991270E-01 .15370083 .285 .7760 .20727763
LARGE .1981182700 .17494798 1.132 .2574 .10646900
LARGER -.8677104536E-01 .17856546 -.486 .6270 .97843666E-01
ESSAYHAT -.3380613370 .26173585E-01 -12.916 .0000 18.138005
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)
A similar estimation routine to test for the endogeneity of the multiple-choice test score in the
essay equation yields a calculated z test statistic of −11.713, which in absolute value far exceeds
the 1.96 critical value. Thus, the null hypothesis of an exogenous multiple-choice score as an
explanatory variable for the essay score is rejected. As theorized, the multiple-choice score is
endogenous in an explanation of the essay score.
--> Regress;LHS=TOTALMC; RHS=ONE, ADULTST,SEX, AVGMC,AVGESSAY,
ESLFLAG,EC011,EN093,MA081,MA082,MA083,SMALLER,SMALL,LARGE,LARGER;
keep=MChat$
+-----------------------------------------------------------------------+
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = TOTALMC Mean= 12.43557951 , S.D.= 3.961941603 |
| Model size: Observations = 3710, Parameters = 15, Deg.Fr.= 3695 |
| Residuals: Sum of squares= 39604.31525 , Std.Dev.= 3.27389 |
| Fit: R-squared= .319748, Adjusted R-squared = .31717 |
| Model test: F[ 14, 3695] = 124.06, Prob value = .00000 |
| Diagnostic: Log-L = -9656.7280, Restricted(b=0) Log-L = -10371.4459 |
| LogAmemiyaPrCrt.= 2.376, Akaike Info. Crt.= 5.214 |
| Autocorrel: Durbin-Watson Statistic = 2.07600, Rho = -.03800 |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
+-----------------------------------------------------------------------+
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = TOTESSAY Mean= 18.13800539 , S.D.= 9.211913659 |
| Model size: Observations = 3710, Parameters = 16, Deg.Fr.= 3694 |
| Residuals: Sum of squares= 118606.0003 , Std.Dev.= 5.66637 |
| Fit: R-squared= .623166, Adjusted R-squared = .62164 |
| Model test: F[ 15, 3694] = 407.25, Prob value = .00000 |
| Diagnostic: Log-L = -11691.4200, Restricted(b=0) Log-L = -13501.8081 |
| LogAmemiyaPrCrt.= 3.473, Akaike Info. Crt.= 6.311 |
| Autocorrel: Durbin-Watson Statistic = 2.09836, Rho = -.04918 |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
TOTALMC 1.485222426 .28473026E-01 52.162 .0000 12.435580
Constant -1.179740796 .85802415 -1.375 .1691
ADULTST -.1690793751 1.3225239 -.128 .8983 .51212938E-02
SEX .6854633662 .20969294 3.269 .0011 .39056604
AVGESSAY .8417622152 .44322287E-01 18.992 .0000 18.138006
ESLFLAG 1.723698602 1.9052933 .905 .3656 .64150943E-01
EC011 .7128702679 .20967911 3.400 .0007 .67708895
EN093 -3.983249481 1.9367199 -2.057 .0397 .62264151E-01
MA081 1.069628788 .40368533 2.650 .0081 .59137466
MA082 1.217026971 .44384827 2.742 .0061 .54878706
MA083 3.892551120 .31758961 12.257 .0000 .42021563
SMALLER .3348223746 .32550676 1.029 .3037 .46226415
SMALL -.1364832691 .35012421 -.390 .6967 .20727763
LARGE .3418924354 .39804844 .859 .3904 .10646900
LARGER -.8251220288E-01 .40712043 -.203 .8394 .97843666E-01
MCHAT -1.457334653 .12441585 -11.713 .0000 12.435580
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)
CONCLUDING COMMENTS
This cookbook-type introduction to the use of instrumental variables and two-stage least squares
regression and testing for endogeneity has just scratched the surface of this controversial
problem in statistical estimation and inference. It was intended to enable researchers to begin
using instrumental variables in their work and to enable readers of that work to have an idea of
what is being done. To learn more about these methods there is no substitute for a graduate level
textbook treatment such as that found in William Greene’s Econometric Analysis.
REFERENCES
Becker, William E. and Carol Johnston (1999). “The Relationship Between Multiple Choice and
Essay Response Questions in Assessing Economics Understanding,” Economic Record
(Economic Society of Australia), Vol. 75 (December): 348-357.
Greene, William (2003). Econometric Analysis. 5th Edition, New Jersey: Prentice Hall.
ENDNOTES
i
In the default mode, relatively large samples are required for 2SLS in LIMDEP because a
routine aimed at providing consistent estimators is employed; thus, for example, no degrees of
freedom adjustment is made for variances; i.e.,
σ̂² = (1/n) Σ (ith prediction error)²
As William Greene states, “this is consistent with most published sources, but (curiously
enough) inconsistent with most other commercially available computer programs.” The degrees
of freedom correction for small samples is obtainable by adding the following specification to
the 2SLS command: ;DFC
ii
In LIMDEP version 9.0.4, the following command will automatically test x3 for endogeneity:
Because x3 is not an instrument, LIMDEP knows the test for endogeneity is on this variable.
Part Three of Module Two demonstrates how to address problems of endogeneity using
STATA's two-stage least squares instrumental variable estimator, as well as how to perform and
interpret the Durbin, Hausman and Wu specification test for endogeneity. Users of this module
need to have completed Module One, Parts One and Three, and Module Two, Part One. That is,
from Module One, users are assumed to know how to get data into STATA, recode and create
variables within STATA, and run and interpret regression results. From Module Two, Part One,
they are expected to have an understanding of the problem of and source of endogeneity and the
basic idea behind an instrumental variable approach and the two-stage least squares method. The
Becker and Johnston (1999) data set is used throughout this module.
THE CASE
As described in Module Two, Part One, Becker and Johnston (1999) called attention to
classroom effects that might influence multiple-choice and essay type test taking skills in
economics in different ways. For example, if the student is in a classroom that emphasizes skills
associated with multiple-choice testing (e.g., risk-taking behavior, question analyzing skills,
memorization, and keen sense of judging between close alternatives), then the student can be
expected to do better on multiple-choice questions. By the same token, if placed in a classroom
that emphasizes the skills of essay test question answering (e.g., organization, good sentence and
paragraph construction, obfuscation when uncertain, logical argument, and good penmanship),
then the student can be expected to do better on the essay component. Thus, Becker and
Johnston attempted to control for the type of class of which the student is a member. Their
measure of “teaching to the multiple-choice questions” is the mean score or mark on the
multiple-choice questions for the school in which the ith student took the 12th grade economics
course. Similarly, the mean school mark or score on the essay questions is their measure of the
ith student’s exposure to essay question writing skills.
In equation form, the two equations that summarize the influence of the various
covariates on multiple-choice and essay test questions are written as the following structural
equations:
Mi = ρ21 + ρ22Wi + ρ23M̄i + Σ(j=4 to J) ρ2jXij + Ui* .
Wi = ρ31 + ρ32Mi + ρ33W̄i + Σ(j=4 to J) ρ3jXij + Vi* .
As shown in Module Two, Part One, the least squares estimators of the ρs are biased
because the error term U i* is related to Wi, in the first equation, and Vi* is related to Mi in second
equation. Instruments for regressors Wi and Mi are needed. Because the reduced form equations
express Wi and Mi solely in terms of exogenous variables, they can be used to generate the
respective instruments:
Mi = Γ21 + Γ22W̄i + Γ23M̄i + Σ(j=4 to J) Γ2jXij + Ui** .
Wi = Γ31 + Γ32M̄i + Γ33W̄i + Σ(j=4 to J) Γ3jXij + Vi** .
The reduced form parameters (Γs) are functions of the ρs, and the reduced form error terms U**
and V** are functions of U* and V*, which are not related to any of the regressors in the reduced
form equations.
We could estimate the reduced form equations and get M̂i and Ŵi. We could then
substitute M̂i and Ŵi into the structural equations as proxy regressors (instruments) for Mi and Wi.
The least squares regression of Mi on Ŵi, M̄i and the Xs and a least squares regression of Wi on
M̂i, W̄i and the Xs would yield consistent estimates of the respective ρs, but the standard errors
would be incorrect. STATA automatically performs the required estimations with the
instrumental variables command: i
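In its generic form (dependent_variable, independent_variables, endogenous_variable, and instruments below are placeholders rather than names in the data set), the command is
ivreg dependent_variable independent_variables (endogenous_variable = instruments)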
Here, independent_variables should be all of your included, exogenous variables, and in the
parentheses, we must specify the endogenous variable as a function of its instruments.
The Becker and Johnston (1999) data are in the file named “Bill.CSV.” Since this is a large
dataset, users may need to increase the size of STATA following the procedures described in
Module One, Part Three. For the version used in this Module (Intercooled STATA), the default
memory is sufficient. Now, the file Bill.CSV can be read into STATA with the following
insheet command. Note that, in this case, the directory has been changed beforehand so that we
need only specify the file BILL.csv. For instance, say the file is located in the folder,
“C:\Documents and Settings\My Documents\BILL.csv”. Then users can change the directory
with the command, cd “C:\Documents and Settings\My Documents”, in which case the file may
be accessed simply by specifying the actual file name, BILL.csv, as in the following:
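A minimal sketch of these steps (reading in the file and creating the school-size indicators), assuming the class-size variable is named size and using the same cutoffs as the SAS data step in Module Two, Part Four, is:
insheet using BILL.csv
generate smallest = (size > 0 & size < 10)
generate smaller = (size > 9 & size < 20)
generate small = (size > 19 & size < 30)
generate large = (size > 29 & size < 40)
generate larger = (size > 39 & size < 50)
generate largest = (size > 49)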
Using these recode and generate commands yields the following relevant variable definitions:
In all of the regressions, the effect of being at a school with more than 49 test takers is captured
in the constant term, against which the other dummy variables are compared. The smallest
schools should not be included so that we can treat the mean scores as exogenous and unaffected
by any individual student’s test performance, which is accomplished by adding the following
command to the end of the summary statistics and regression commands:
if smallest!=1
This command is added to the end of our regression and summary statistics commands as an
option, and it says to only perform the desired command if smallest is not equal to 1. We could
also completely remove these observations with the command:
drop if smallest==1
The problem with this approach, however, is that we cannot retrieve observations once they’ve
been dropped (at least not easily), so it’s generally sound practice to follow the first approach.
The descriptive statistics on the relevant variables are then obtained with the following
command, yielding the STATA output shown: ii
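A summarize command along the following lines (the variable list shown is illustrative) produces these statistics:
summarize totalmc totessay adultst sex avgmc avgessay eslflag ec011 en093 ma081 ma082 ma083 smaller small large larger if smallest!=1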
------------------------------------------------------------------------------
totalmc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
totessay | .2707917 .0054726 49.48 0.000 .2600621 .2815213
adultst | .4674948 .592213 0.79 0.430 -.6936016 1.628591
sex | -.5259549 .0912871 -5.76 0.000 -.7049329 -.3469768
avgmc | .3793819 .0252904 15.00 0.000 .3297974 .4289663
eslflag | .3933259 .8524557 0.46 0.645 -1.278004 2.064656
ec011 | .0172264 .0926488 0.19 0.853 -.1644214 .1988743
en093 | -.3117338 .8649386 -0.36 0.719 -2.007538 1.38407
ma081 | -.120807 .1808402 -0.67 0.504 -.4753635 .2337494
ma082 | .3827058 .1946737 1.97 0.049 .0010273 .7643843
ma083 | .3703758 .1184767 3.13 0.002 .1380896 .602662
smaller | .0672105 .147435 0.46 0.649 -.2218514 .3562725
small | -.0056878 .1570632 -0.04 0.971 -.3136269 .3022513
large | .0663582 .1785263 0.37 0.710 -.2836616 .4163781
larger | .0565486 .1821756 0.31 0.756 -.300626 .4137232
_cons | 2.654957 .3393615 7.82 0.000 1.989603 3.320311
------------------------------------------------------------------------------
Theoretical considerations discussed in Module Two, Part One, suggest that these least
squares estimates involve a simultaneous equation bias that is brought about by an apparent
reverse causality between the two forms of testing. Consistent estimation of the parameters in
this simultaneous equation system is possible with two-stage least squares, where our instrument
(M̂i) for Mi is obtained by a least squares regression of Mi on SEX, ADULTST, AVGMC,
AVGESSAY, ESLFLAG,SMALLER,SMALL, LARGE, LARGER, EC011, EN093, MA081,
MA082, and MA083. Our instrument (Ŵi) for Wi is obtained by a least squares regression of
Wi on SEX, ADULTST, AVGMC, AVGESSAY, ESLFLAG, SMALLER, SMALL, LARGE,
LARGER, EC011, EN093, MA081, MA082, and MA083. STATA will do these regressions
and the subsequent regressions for M and W employing these instruments via the following
commands, which yield the subsequent output. Note that we should only specify as instruments
variables that we are not including as independent variables in the full regression. As seen in the
output tables, STATA correctly includes all of the exogenous variables as instruments in the
two-stage least squares estimation:
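A sketch of these commands, consistent with the instrument lists reported in the output below, is:
ivreg totalmc adultst sex avgmc eslflag ec011 en093 ma081 ma082 ma083 smaller small large larger (totessay = avgessay) if smallest!=1
ivreg totessay adultst sex avgessay eslflag ec011 en093 ma081 ma082 ma083 smaller small large larger (totalmc = avgmc) if smallest!=1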
------------------------------------------------------------------------------
totalmc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
totessay | -.0524779 .036481 -1.44 0.150 -.1240029 .019047
adultst | .2533495 .8261181 0.31 0.759 -1.366343 1.873042
sex | -.0897195 .1360894 -0.66 0.510 -.3565373 .1770983
avgmc | .9748841 .0745801 13.07 0.000 .8286619 1.121106
eslflag | .6744471 1.189066 0.57 0.571 -1.656844 3.005738
ec011 | .2925431 .1327137 2.20 0.028 .0323437 .5527425
en093 | -1.588716 1.214273 -1.31 0.191 -3.969426 .7919948
ma081 | .2995655 .2563946 1.17 0.243 -.2031234 .8022545
------------------------------------------------------------------------------
totessay | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
totalmc | .0278877 .1583175 0.18 0.860 -.2825105 .338286
adultst | -.1690792 1.728774 -0.10 0.922 -3.558524 3.220366
sex | .6854633 .274106 2.50 0.012 .1480494 1.222877
avgessay | .8417622 .0579371 14.53 0.000 .7281703 .9553541
eslflag | 1.723698 2.490557 0.69 0.489 -3.159304 6.6067
ec011 | .7128703 .2740879 2.60 0.009 .1754919 1.250249
en093 | -3.983249 2.531637 -1.57 0.116 -8.946793 .980295
ma081 | 1.069629 .5276886 2.03 0.043 .0350393 2.104218
ma082 | 1.217027 .5801887 2.10 0.036 .0795055 2.354548
ma083 | 3.892551 .4151461 9.38 0.000 3.078613 4.706489
smaller | .3348224 .4254953 0.79 0.431 -.4994062 1.169051
small | -.1364833 .4576746 -0.30 0.766 -1.033803 .7608364
large | .3418925 .5203201 0.66 0.511 -.6782504 1.362035
larger | -.0825121 .5321788 -0.16 0.877 -1.125905 .960881
_cons | -1.179741 1.12159 -1.05 0.293 -3.378737 1.019256
------------------------------------------------------------------------------
Instrumented: totalmc
Instruments: adultst sex avgessay eslflag ec011 en093 ma081 ma082 ma083
smaller small large larger avgmc
------------------------------------------------------------------------------
The 2SLS results differ from the least squares results in many ways. The essay mark or
score, W, is no longer a significant variable in the multiple-choice regression and the multiple-
choice mark, M, is likewise insignificant in the essay regression. Each score appears to be
measuring something different when the regressor and error-term-induced bias is eliminated by
our instrumental variable estimators.
Both the mean multiple-choice and mean essay scores continue to be significant in their
respective equations. But now being female is insignificant in explaining the multiple-choice
test score. Being female continues to have a significant positive effect on the essay score.
y = Xβ + X̂*γ + ε* ,
where X* = Zλ + u , X̂* = Zλ̂ , and λ̂ is a least squares estimator of λ.
In our case, K* = 1 when the essay score is to be tested as an endogenous regressor in the
multiple-choice equation and when the multiple-choice regressor is to be tested as endogenous in
the essay equation. X̂ * is an n × 1 vector of predicted essay scores from a regression of essay
scores on all the exogenous variables (for subsequent use in the multiple-choice equation) or an
n × 1 vector of predicted multiple-choice scores from a regression of multiple-choice scores on all
the exogenous variables (for subsequent use in the essay equation). Because K* = 1, the relevant
test statistic is either the t, with n − (K + K*) degrees of freedom for small n or the standard
normal, for large n.
In STATA, the predicted essay score is obtained by the following command, where the
specification “predict totesshat, xb” tells STATA to predict the essay scores and keep them as a
variable called “totesshat”:
predict totesshat, xb
The predicted essay scores are then added as a regressor in the original multiple-choice
regression:
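regress totalmc totessay adultst sex avgmc eslflag ec011 en093 ma081 ma082 ma083 ///
smaller small large larger totesshat if smallest!=1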
The test statistic for the totesshat coefficient is then used in the test of endogeneity. In the
STATA output below, we see that the calculated standard normal test statistic z value is −12.92,
which in absolute value far exceeds the 1.96 critical standard normal value for a 0.05 probability
of a Type I error. Thus, the null hypothesis of an exogenous essay score as an explanatory variable for the
multiple-choice score is rejected. As theorized, the essay score is endogenous in an explanation
of the multiple-choice score.
. regress totessay adultst sex avgmc avgessay eslflag ec011 en093 ma081 ma082 ///
ma083 smaller small large larger if smallest!=1
------------------------------------------------------------------------------
totessay | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
adultst | -.1617771 1.741248 -0.09 0.926 -3.575679 3.252125
sex | .6819632 .2723475 2.50 0.012 .1479971 1.215929
avgmc | .0271476 .1553461 0.17 0.861 -.277425 .3317202
avgessay | .8405321 .0646428 13.00 0.000 .7137931 .9672711
eslflag | 1.739961 2.506713 0.69 0.488 -3.174717 6.654638
ec011 | .719975 .2721919 2.65 0.008 .1863138 1.253636
en093 | -4.021669 2.541765 -1.58 0.114 -9.005069 .9617305
ma081 | 1.076408 .531461 2.03 0.043 .0344219 2.118393
ma082 | 1.237971 .571906 2.16 0.030 .1166884 2.359253
ma083 | 3.9324 .3425393 11.48 0.000 3.260815 4.603984
smaller | .3418961 .433852 0.79 0.431 -.5087167 1.192509
small | -.1350661 .4622235 -0.29 0.770 -1.041304 .7711722
large | .3469099 .5247716 0.66 0.509 -.6819605 1.37578
larger | -.0848079 .5361811 -0.16 0.874 -1.136048 .9664321
_cons | -1.186477 1.161393 -1.02 0.307 -3.463511 1.090556
------------------------------------------------------------------------------
. predict totesshat, xb
(1 missing value generated)
.
. regress totalmc totessay adultst sex avgmc eslflag ec011 en093 ma081 ma082 ma083 ///
> smaller small large larger totesshat if smallest!=1
------------------------------------------------------------------------------
totalmc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
totessay | .2855834 .0054749 52.16 0.000 .2748493 .2963175
adultst | .2533495 .5795925 0.44 0.662 -.8830033 1.389702
sex | -.0897195 .0954784 -0.94 0.347 -.2769151 .097476
avgmc | .974884 .0523243 18.63 0.000 .8722967 1.077471
eslflag | .6744471 .8342319 0.81 0.419 -.9611532 2.310047
ec011 | .2925431 .09311 3.14 0.002 .1099909 .4750952
en093 | -1.588716 .8519162 -1.86 0.062 -3.258988 .0815565
ma081 | .2995655 .1798828 1.67 0.096 -.0531138 .6522448
A similar estimation routine to test for the endogeneity of the multiple-choice test score in the
essay equation yields a calculated z test statistic of −11.71, which in absolute value far exceeds
the 1.96 critical value. Thus, the null hypothesis of an exogenous multiple-choice score as an
explanatory variable for the essay score is rejected. As theorized, the multiple-choice score is
endogenous in an explanation of the essay score.
. regress totalmc avgessay adultst sex avgmc eslflag ec011 en093 ma081 ma082 ma083 ///
> smaller small large larger if smallest!=1
------------------------------------------------------------------------------
totalmc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
avgessay | -.0441094 .0283459 -1.56 0.120 -.0996846 .0114658
adultst | .2618392 .7635394 0.34 0.732 -1.235161 1.758839
sex | -.1255075 .1194247 -1.05 0.293 -.3596523 .1086372
avgmc | .9734594 .0681195 14.29 0.000 .839904 1.107015
eslflag | .5831376 1.099197 0.53 0.596 -1.571954 2.738229
ec011 | .2547603 .1193565 2.13 0.033 .0207492 .4887713
en093 | -1.377667 1.114567 -1.24 0.217 -3.562894 .8075597
ma081 | .2430779 .2330463 1.04 0.297 -.2138341 .6999899
ma082 | .7510049 .2507815 2.99 0.003 .2593213 1.242689
ma083 | 1.428892 .1502039 9.51 0.000 1.134401 1.723382
smaller | .2536501 .1902446 1.33 0.183 -.1193446 .6266448
small | .050818 .2026856 0.25 0.802 -.3465686 .4482046
large | .1799134 .2301129 0.78 0.434 -.2712475 .6310742
larger | -.0823205 .235116 -0.35 0.726 -.5432904 .3786495
_cons | -.2415657 .509272 -0.47 0.635 -1.240048 .7569162
------------------------------------------------------------------------------
. predict totmchat, xb
(1 missing value generated)
.
. regress totessay totalmc adultst sex avgessay eslflag ec011 en093 ma081 ma082 ///
ma083 smaller small large larger totmchat if smallest!=1
------------------------------------------------------------------------------
totessay | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
totalmc | 1.485222 .028473 52.16 0.000 1.429398 1.541047
adultst | -.1690792 1.322524 -0.13 0.898 -2.762028 2.42387
sex | .6854633 .2096929 3.27 0.001 .274338 1.096589
avgessay | .8417622 .0443223 18.99 0.000 .7548637 .9286608
eslflag | 1.723699 1.905293 0.90 0.366 -2.011832 5.459229
CONCLUDING COMMENTS
This cookbook-type introduction to the use of instrumental variables and two-stage least squares
regression and testing for endogeneity has just scratched the surface of this controversial
problem in statistical estimation and inference. It was intended to enable researchers to begin
using instrumental variables in their work and to enable readers of that work to have an idea of
what is being done. To learn more about these methods there is no substitute for a graduate level
textbook treatment such as that found in William Greene’s Econometric Analysis.
REFERENCES
Becker, William E. and Carol Johnston (1999).“The Relationship Between Multiple Choice and
Essay Response Questions in Assessing Economics Understanding,” Economic Record
(Economic Society of Australia), Vol. 75 (December): 348-357.
Greene, William (2003). Econometric Analysis. 5th Edition, New Jersey: Prentice Hall.
ENDNOTES
i
As stated in Module Two, Part One, the 2SLS coefficients are consistent but not necessarily
unbiased. Consistency is an asymptotic property for which there are no adjustments for degrees
of freedom. Nevertheless, the default in the standard STATA routine for 2SLS, “ivreg,” adjusts
standard errors for the degrees of freedom. As an alternative, if your institution permits
downloads from STATA's user-written routines, then the “ivreg2” command rather than “ivreg”
can be employed. The “ivreg2” command makes no adjustment for degrees of freedom.
To use ivreg2, type findit ivreg2 into the STATA command window, where a list of information
and links to download this routine appears. Click on one of the download links and STATA
automatically downloads and installs the routine for use. Users can then access the
documentation for this routine by typing help ivreg2.
Another alternative is to rescale the covariance matrix reported by ivreg to remove the degrees
of freedom adjustment; the following commands do this and list the implied large-sample
standard errors:
* copy the coefficient row vector as a holder for the standard errors
matrix large_sample_se=e(b)
* undo the small-sample correction: large-sample variance = reported variance*(df_r/N)
matrix large_sample_var=e(V)*e(df_r)/e(N)
local ncol=colsof(large_sample_se)
* replace each entry with the square root of the corresponding diagonal element
forvalues i=1/`ncol' {
   matrix large_sample_se[1,`i']=sqrt(large_sample_var[`i',`i'])
}
matrix list large_sample_se
ii
Notice that the size variables (smaller to larger) show 3711 observations but the others show
the correct 3710. This was an artifact of the way the size variables were created in STATA. The
extra blank space has no relevance and is ignored in the calculations that are all based on the
original 3710 observations.
Part Four of Module Two provides a cookbook-type demonstration of the steps required to use
SAS to address problems of endogeneity using a two-stage least squares, instrumental variable
estimator. The Durbin, Hausman and Wu specification test for endogeneity is also demonstrated.
Users of this module need to have completed Module One, Parts One and Four, and Module Two,
Part One. That is, from Module One, users are assumed to know how to get data into SAS,
recode and create variables within SAS, and run and interpret regression results. From Module
Two, Part One, they are expected to have an understanding of the problem of and source of
endogeneity and the basic idea behind an instrumental variable approach and the two-stage least
squares method. The Becker and Johnston (1999) data set is used throughout this module for
demonstration purposes only. Module Two, Parts Two and Three demonstrate in LIMDEP and
STATA what is done here in SAS.
THE CASE
As described in Module Two, Part One, Becker and Johnston (1999) called attention to
classroom effects that might influence multiple-choice and essay type test taking skills in
economics in different ways. For example, if the student is in a classroom that emphasizes skills
associated with multiple choice testing (e.g., risk-taking behavior, question analyzing skills,
memorization, and keen sense of judging between close alternatives), then the student can be
expected to do better on multiple-choice questions. By the same token, if placed in a classroom
that emphasizes the skills of essay test question answering (e.g., organization, good sentence and
paragraph construction, obfuscation when uncertain, logical argument, and good penmanship),
then the student can be expected to do better on the essay component. Thus, Becker and
Johnston attempted to control for the type of class of which the student is a member. Their
measure of “teaching to the multiple-choice questions” is the mean score or mark on the
multiple-choice questions for the school in which the ith student took the 12th grade economics
course. Similarly, the mean school mark or score on the essay questions is their measure of the
ith student’s exposure to essay question writing skills.
In equation form, the two equations that summarize the influence of the various
covariates on multiple-choice and essay test questions are written as the following structural
equations:
Mi = ρ21 + ρ22Wi + ρ23M̄i + Σ(j=4 to J) ρ2jXij + Ui* .
Wi = ρ31 + ρ32Mi + ρ33W̄i + Σ(j=4 to J) ρ3jXij + Vi* .
As shown in Module Two, Part One, the least squares estimation of the ρs involves bias
because the error term Ui* is related to Wi, in the first equation, and Vi* is related to Mi, in the
second equation. Instruments for regressors Wi and Mi are needed. Because the reduced form
equations express Wi and Mi solely in terms of exogenous variables, they can be used to generate
the respective instruments:
Mi = Γ21 + Γ22W̄i + Γ23M̄i + Σ(j=4 to J) Γ2jXij + Ui** .
Wi = Γ31 + Γ32M̄i + Γ33W̄i + Σ(j=4 to J) Γ3jXij + Vi** .
The reduced form parameters (Γs) are functions of the ρs, and the reduced form error terms U**
and V** are functions of U* and V*, which are not related to any of the regressors in the reduced
form equations.
We could estimate the reduced form equations and get M̂i and Ŵi. We could then
substitute M̂i and Ŵi into the structural equations as proxy regressors (instruments) for Mi and Wi.
The least squares regression of Mi on Ŵi, M̄i and the Xs and a least squares regression of Wi on
M̂i, W̄i and the Xs would yield consistent estimates of the respective ρs, but the standard errors
would be incorrect. SAS can automatically do all the required estimations with the two-stage
least squares command:
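One way to set this up is with PROC SYSLIN and its 2SLS option; a sketch for the multiple-choice equation (the essay equation is specified analogously) is:
proc syslin data=bill 2sls;
   endogenous totessay;
   instruments adultst sex avgmc avgessay eslflag ec011 en093
               ma081 ma082 ma083 smaller small large larger;
   model totalmc = totessay adultst sex avgmc eslflag ec011 en093
                   ma081 ma082 ma083 smaller small large larger;
run;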
The Becker and Johnston (1999) data are in the file named “Bill.CSV.” The file Bill.CSV can be
read into SAS with the following read command (the file may be located anywhere on your hard
drive but here it is located on the e drive):
Using these recode and create commands yields the following relevant variable definitions:
data Bill;
set M2P4.Bill;
smallest = 0;
smaller = 0;
small = 0;
large = 0;
larger = 0;
largest = 0;
if size > 0 & size < 10 then smallest = 1;
if size > 9 & size < 20 then smaller = 1;
if size > 19 & size < 30 then small = 1;
if size > 29 & size < 40 then large = 1;
if size > 39 & size < 50 then larger = 1;
if size > 49 then largest = 1;
run;
In all of the regressions, the effect of being at a school with more than 49 test takers is captured
in the constant term, against which the other dummy variables are compared. The smallest
schools need to be excluded so that the mean scores can be treated as exogenous and unaffected
by any individual student’s test performance, which is accomplished with the following command:
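A deletion step of the following form accomplishes this (alternatively, a WHERE statement can be added to each procedure):
data bill;
   set bill;
   if smallest = 1 then delete;
run;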
The descriptive statistics on the relevant variables are then obtained with the following
command, yielding the SAS output table shown:
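A PROC MEANS call of roughly the following form (the variable list shown is illustrative) produces the table:
proc means data=bill;
   var totalmc totessay adultst sex avgmc avgessay eslflag ec011 en093
       ma081 ma082 ma083 smaller small large larger;
run;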
For comparison with the two-stage least squares results, we start with the least squares
regressions shown after this paragraph. The least squares estimations are typical of those found
in multiple-choice and essay score correlation studies, with correlation coefficients of 0.77 and
0.78. The essay mark or score, W, is the most significant variable in the multiple-choice score
regression (first of the two tables) and the multiple-choice mark, M, is the most significant
variable in the essay regression (second of the two tables). Results like these have led
researchers to conclude that the essay and multiple-choice marks are good predictors of each
other. Notice also that both the mean multiple-choice and mean essay marks are significant in
their respective equations, suggesting that something in the classroom environment or group
experience influences individual test scores. Finally, being female has a significant negative
effect on the multiple choice-test score, but a significant positive effect on the essay score, as
expected from the least squares results reported by others. We will see how these results hold up
in the two-stage least squares regressions.
proc reg data = Bill; model totalmc = totessay adultst sex avgmc eslflag
ec011 en093 ma081 ma082 ma083 smaller small large larger; quit;
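The corresponding least squares command for the essay equation takes the same form, for example:
proc reg data = Bill; model totessay = totalmc adultst sex avgessay eslflag
ec011 en093 ma081 ma082 ma083 smaller small large larger; quit;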
Both the mean multiple-choice and mean essay scores continue to be significant in their
respective equations. But now being female is insignificant in explaining the multiple-choice
test score. Being female continues to have a significant positive effect on the essay score.
The theoretical argument is strong for treating multiple-choice and essay scores as endogenous
when employed as regressors in the explanation of the other. Nevertheless, this endogeneity can
be tested with the Durbin, Hausman and Wu specification test, which is a two-step procedure in
SAS.
y = Xβ + X̂*γ + ε* ,
where X* = Zλ + u , X̂* = Zλ̂ , and λ̂ is a least squares estimator of λ.
In our case, K* = 1 when the essay score is to be tested as an endogenous regressor in the
multiple-choice equation and when the multiple-choice regressor is to be tested as endogenous in
the essay equation. X̂ * is an n × 1 vector of predicted essay scores from a regression of essay
scores on all the exogenous variables (for subsequent use in the multiple-choice equation) or an
n × 1 vector of predicted multiple-choice scores from a regression of multiple-choice scores on all
the exogenous variables (for subsequent use in the essay equation). Because K* = 1, the relevant
test statistic is either the t, with n − (K + K*) degrees of freedom for small n or the standard
normal, for large n.
In SAS, the predicted essay score is obtained by the following command, where the
specification “output out=essaypredict p=Essayhat;” tells SAS to predict the essay scores and
keep them as a variable called “Essayhat”:
proc reg data = bill; model totessay = adultst sex avgessay avgmc eslflag
ec011 en093 ma081 ma082 ma083 smaller small large larger;
output out=essaypredict p=Essayhat; quit;
The predicted essay scores are then added as a regressor in the original multiple-choice
regression:
proc reg data = essaypredict; model totalmc = totessay adultst sex avgmc eslflag
ec011 en093 ma081 ma082 ma083 smaller small large larger Essayhat;
quit;
The test statistic for the Essayhat coefficient is then used in the test of endogeneity. In the
SAS output below, we see that the calculated standard normal test statistic z value is −12.916, which
in absolute value far exceeds the 1.96 critical standard normal value for a 0.05 probability of a Type I error.
Thus, the null hypothesis of an exogenous essay score as an explanatory variable for the
multiple-choice score is rejected. As theorized, the essay score is endogenous in an explanation
of the multiple-choice score.
A similar routine, using a predicted multiple-choice score (MChat) obtained from the reduced
form, tests for the endogeneity of the multiple-choice score in the essay equation:
proc reg data = bill; model totalmc = adultst sex avgmc avgessay eslflag ec011 en093
ma081 ma082 ma083 smaller small large larger; output out=mcpredict p=MChat; quit;
proc reg data = mcpredict; model totessay = totalmc adultst sex avgessay eslflag
ec011 en093 ma081 ma082 ma083 smaller small large larger MChat; quit;
CONCLUDING COMMENTS
This cookbook-type introduction to the use of instrumental variables and two-stage least squares
regression and testing for endogeneity has just scratched the surface of this controversial
problem in statistical estimation and inference. It was intended to enable researchers to begin
using instrumental variables in their work and to enable readers of that work to have an idea of
what is being done. To learn more about these methods there is no substitute for a graduate level
textbook treatment such as that found in William Greene’s Econometric Analysis.
REFERENCES
Becker, William E. and Carol Johnston (1999).“The Relationship Between Multiple Choice and
Essay Response Questions in Assessing Economics Understanding,” Economic Record
(Economic Society of Australia), Vol. 75 (December): 348-357.
Greene, William (2003). Econometric Analysis. 5th Edition, New Jersey: Prentice Hall.
As discussed in Modules One and Two, most of the empirical economic education research is
based on “value-added,” “change-score” or “difference-in-differences” model specifications
in which the expected improvement in student performance from a pre-program measure
(pretest) to its post-program measurement (posttest) is estimated and studied for a cross section
of subjects. Other than the fact that testing occurs at two different points in time, there is no
time dimension, as seen in the data sets employed in Modules One and Two. Panel data analysis
provides an alternative structure in which measurements on the cross section of subjects are
taken at regular intervals over multiple periods of time. i Collecting data on the cross section of
subjects over time enables a study of change. It opens the door for economic education
researchers to address unobservable attributes that lead to biased estimators in cross-section
analysis. ii As demonstrated in this module, it also opens the door for economic education
researchers to look at things other than test scores that vary with time.
This module provides an introduction to panel data analysis with specific applications to
economic education. The data structure for a panel along with constant coefficient, fixed effects
and random effects representations of the data generating processes are presented. Consideration
is given to different methods of estimation and testing. Finally, as in Modules One and Two,
contemporary estimation and testing procedures are demonstrated in Parts Two, Three and Four
using LIMDEP (NLOGIT), STATA and SAS.
As an example of a panel data set, consider our study (Becker, Greene and Siegfried,
Forthcoming) that examines the extent to which undergraduate degrees (BA and BS) in
economics or PhD degrees (PhD) in economics drive faculty size at those U.S. institutions that
offer only a bachelor degree and those that offer both bachelor degrees and PhDs.
-----------------------------------------
*William Becker is Professor of Economics, Indiana University, Adjunct Professor of Commerce,
University of South Australia, Research Fellow, Institute for the Study of Labor (IZA) and Fellow, Center
for Economic Studies and Institute for Economic Research (CESifo). William Greene is Professor of
Economics, Stern School of Business, New York University, Distinguished Adjunct Professor at
American University and External Affiliate of the Health Econometrics and Data Group at York
University. John Siegfried is Professor of Economics, Vanderbilt University, Senior Research Fellow,
University of Adelaide, South Australia, and Secretary-Treasurer of the American Economic Association.
Their e-mail addresses are <[email protected]>, <[email protected]> and
<[email protected]>.
We obtained data on the number of full-time tenured or tenure-track faculty and the
number of undergraduate economics degrees per institution per year from the American
Economic Association’s Universal Academic Questionnaire (UAQ). The numbers of PhD
degrees in economics awarded by department were obtained from the Survey of Earned
Doctorates, which is sponsored by several U.S. federal government agencies. These sources
provided yearly data on faculty size and degrees for each institution for 16 years, from 1990-
91 through 2005-06. For each year, we had data from 18 bachelor degree-granting institutions
and 24 institutions granting both the PhD and bachelor degrees. Pooling the cross-section
observations on each of the 18 bachelor only institutions, at a point in time, over the 16 years,
implies a panel of 288 observations on each initial variable. Pooling the cross-section
observations on each of the 24 PhD institutions, at a point in time, over the 16 years, implies a
panel of 384 observations on each initial variable. Subsequent creation of a three-year moving
average variable for degrees granted at each type of institution reduced the length of each panel
in the data set to 14 years of usable data.
Panel data are typically laid out in sequential blocks of cross-sectional data. For
example, the bachelor degree institution data observations for each of the 18 colleges appear in
blocks of 16 rows for years 1991 through 2006:
“T” is a time trend running from −7 to 8, corresponding to years from 1991 through 2006.
“MA_Deg” is a three-year moving average of degrees (unknown for the first two years).
College Year BA&S MEANBA&S Public Bschol Faculty T MA_Deg
1 1991 50 47.375 2 1 11 ‐7 Missing
1 1992 32 47.375 2 1 8 ‐6 Missing
1 1993 31 47.375 2 1 10 ‐5 37.667
1 1994 35 47.375 2 1 9 ‐4 32.667
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
1 2003 57 47.375 2 1 7 5 56
1 2004 57 47.375 2 1 10 6 55.667
1 2005 57 47.375 2 1 10 7 57
1 2006 51 47.375 2 1 10 8 55
2 1991 16 8.125 2 1 3 ‐7 Missing
2 1992 14 8.125 2 1 3 ‐6 Missing
2 1993 10 8.125 2 1 3 ‐5 13.333
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
2 2004 10 8.125 2 1 3 6 12.667
2 2005 7 8.125 2 1 3 7 11.333
2 2006 6 8.125 2 1 3 8 7.667
3 1991 40 35.5 2 1 8 ‐7 Missing
3 1992 31 37.125 2 1 8 ‐6 Missing
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
17 2004 64 39.3125 2 0 5 6 54.667
17 2005 37 39.3125 2 0 4 7 51.333
17 2006 53 39.3125 2 0 4 8 51.333
18 1991 14 8.4375 2 0 4 ‐7 Missing
18 1992 10 8.4375 2 0 4 ‐6 Missing
18 1993 10 8.4375 2 0 4 ‐5 11.333
18 1994 7 8.4375 2 0 3.5 ‐4 9
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
18 2005 4 8.4375 2 0 2.5 7 7.333
18 2006 7 8.4375 2 0 3 8 6
In a few years for some colleges, faculty size was missing. We interpolated missing data
on the number of faculty members in the economics department from the reported information in
the years before and after a missing observation, giving rise to the prospect of a half person
in those cases. If a panel data set such as this one has missing values that cannot be
meaningfully interpolated, it is an “unbalanced panel,” in which the number of usable
observations differs across the units. If there are no missing values and there are the same
number of periods of data for every group (college) in the sample, then the resulting pooled
cross-section and time-series data set is said to be a “balanced panel.” Typically, the cross-
section dimension is designated the i dimension and the time-series dimension is the t dimension.
Thus, panel data studies are sometimes referred to as “it ” studies.
There are three ways in which we consider the effect of degrees on faculty size. Here we will
consider only the bachelor degree-granting institutions.
First, the decision makers might set the permanent faculty based on the most current
available information, as reflected in the number of contemporaneous degrees (BA&Sit). That is,
the decision makers might form a type of rational expectation by setting the faculty size based on
the anticipated number of majors to receive degrees in the future, where that expectation for that
future number is forecasted by this year's value. Second, we included the overall mean number
of degrees awarded at each institution (MEANBA&Si) to reflect a type of historical steady state.
That is, the central administration or managers of the institution may have a target number of
permanent faculty relative to the long-term expected number of annual graduates from the
department that is desired to maintain the department’s appropriate role within the institution. iii
Third, the central authority might be willing to marginally increase or decrease the permanent
faculty size based on the near term trend in majors, as reflected in a three-year moving average
of degrees awarded (MA_Degit).
We then assume the faculty size data-generating process for bachelor degree-granting
undergraduate departments to be
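FACULTY sizeit = β1 + β2YEARt + β3BA&Sit + β4MEANBA&Si + β5PUBLICi + β6Bschli + β7MA_Degit + εit ,        (1)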
where the error term εit is independent and identically distributed (iid) across institutions and
over time and E(εit²|xit) = σ² , for I = 18 colleges and T = 14 years (−5 through 8) for 252
complete observations. Notice that there is no time subscript on the mean number of degrees,
public/private and B school regressors because they do not vary with time.
In a more general and nondescript algebraic form for any it study, in which all
explanatory variables are free to vary with time and the error term is of the simple iid form with
E(εit²|xit) = σ², the model would be written
Yit = β1 + β2X2it + β3X3it + . . . + βkXkit + εit , for i = 1, 2, …, I and t = 1, 2, …, T.        (2)
This is a constant coefficient model because the intercept and slopes are assumed not to
vary within a cross section (not to vary across institutions) or over time. If this assumption is
true, the parameters can be estimated without bias by ordinary least squares applied directly to
the panel data set. Unfortunately, this assumption is seldom true, requiring us to consider the
fixed-effect and random-effects models.
FIXED-EFFECTS MODEL
The fixed effects model allows the intercept to vary across institutions (or among whatever cross-
section categories that are under consideration in other studies), while keeping the slope
coefficients the same for all institutions (or categories). The model could be written as
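Yit = β1i + β2X2it + β3X3it + . . . + βkXkit + εit ,        (3)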
where β1i suggests that there is a separate intercept for each unit. No restriction is placed on
how the intercepts vary, except, of course, that they do so independently of εit. The model can be
made explicit for our application by inserting a 0-1 covariate or dummy variable for each of the
institutions except the one for which comparisons are to be made. In our case, there are 18
colleges; thus, 17 dummy variables are inserted and each of their coefficients is interpreted as the
expected change in faculty size for a movement from the excluded college to the college of
interest. Alternatively, we could have a separate dummy variable for each college and drop the
overall intercept. Both approaches give the same results for the other coefficients and for the R2
in the regression. (A moment’s thought will reveal, however, that in this setting, either way it is
formulated, it is not possible to have variables, such as type of school, that do not vary through
time. In the fixed effects model, such a variable would just be a multiple of the school specific
dummy variable.)
To clarify, the fixed-effects model for our study of bachelor degree institutions is written
(where Collgei = 1 if college i and 0 if not, for i = 1, 2, … 18) as
FACULTY sizeit = β1 + β2YEARt + β3BA&Sit + β4MEANBA&Si + β5PUBLICi +
β6Bschli + β7MA_Degit + β8College1 + β9College2 +        (4)
β10College3 + … + β23College16 + β24College17 + εit .
Here a dummy for college 18 is omitted and its effects are reflected in the constant term β1 when
College1 = College2 =…= College16 = College17 = 0. For example, β9 is the expected change
in faculty size for a movement from college 18 to college 2. Which college is omitted is
arbitrary, but one must be omitted to avoid perfect collinearity in the data set. In general, if i
goes from 1 to I categories, then only I − 1 dummies are used to form the fixed-effects model:
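Yit = β1 + β2X2it + β3X3it + . . . + βkXkit + βk+1D1i + βk+2D2i + . . . + βk+I−1D(I−1)i + εit .        (5)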
After creating the relevant dummy variables (Ds), the parameters of this fixed-effects model can
be estimated without bias by ordinary least squares. iv
If one has sufficient observations, the categorical dummy variables can be interacted with
the other time-varying explanatory variables to enable the slopes to vary along with the intercept
over time. For our study with 18 college categories this would be laborious to write out in
equation form. In many cases there simply are not sufficient degrees of freedom to
accommodate all the required interactions. v
To demonstrate a parsimonious model setup with both intercept and slope variability
consider a hypothetical situation involving three categories (represented by dummies D1, D2 and
D3 ) and two time-varying explanatory variables (represented by X2it and X3it):
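Yit = β1 + β2X2it + β3X3it + β4D1i + β5D2i + β6(D1i × X2it) + β7(D1i × X3it) + β8(D2i × X2it) + β9(D2i × X3it) + εit .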
In this model, β1 is the intercept for category three, for which D1 = D2 = 0 (the D3 dummy is omitted). The intercept for category one
is β1 + β4 and for category 2 it is β1 + β5. The change in the expected value of Y given a change
in X2 is β2 + β6D1 + β8D2; thus for category 1 it is β2 + β6 and for category 2 it is β2 + β8. The
change in the expected value of Y for a movement from category two to category three is −(β5 + β8X2it + β9X3it).
Individual coefficients are tested in fixed-effects models as in any other model with the z
ratio (with asymptotic properties) or t ratio (finite sample properties). There could be category-
specific heteroscedasticity or autocorrelation over time. As described and demonstrated in
Module One, where students were grouped or clustered into classes, a type of White
heteroscedasticity consistent covariance estimator can be used in fixed-effects models with
ordinary least squares to obtain standard errors robust to unequal variances across the groups.
Correlation of residuals from one period to the next within a panel can also be a problem. If this
serial correlation is of the first-order autoregressive type, a Prais-Winsten transformation
might be considered to partially first-difference the data to remove the serial
correlation problem. In general, because there are typically few time-series observations, it is
difficult both to correctly identify the nature of the time-series error term process and to
address it appropriately with a least-squares estimator. vi Contemporary treatments typically rely
on robust, “cluster” corrections that accommodate more general forms of correlation across time.
The unrestricted sum of squared residuals comes from the regression:
RANDOM-EFFECTS MODELS
The random effects model, like the fixed effects model, allows the intercept term to vary across
units. The difference is an additional assumption, not made in the fixed effects case, that this
variation is independent of the values of the other variables in the model. Recall, in the fixed
effects case, we placed no restriction on the relationship between the intercepts and the other
independent variables. In essence, a random-effects data generating process is a regression with
an intercept that is subject to purely random perturbations; it is a category-specific random
variable (β1i). The realization of the random variable intercept β1i is assumed to be formed by
the overall mean plus the ith category-specific random term vi. In the case of our hypothetical,
parsimonious two explanatory variable model, the relevant random-effects equations are
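Yit = β1i + β2X2it + β3X3it + εit ,
β1i = α + vi .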
Inserting the second equation into the first produces the “random effects” model,
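Yit = α + β2X2it + β3X3it + (vi + εit) .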
Deviations from the main intercept, α, as measured in the category specific part of the error term,
vi, must be uncorrelated with the time-varying regressors (that is, vi, is uncorrelated with X2it and
X3it) and have zero mean. Because vi does not vary with time, it is reasonable to assume its
variance is fixed given the explanatory variables. vii Thus,
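E(vi² | X2it, X3it) = σv² .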
An important difference between the fixed and random effects models is that time-invariant
regressors, such as type of school, can be accommodated in the random effects but not in the
fixed effects model. (This surprising result ultimately follows from the assumption that the
variation of the constant terms is independent of the other variables, which allows us to put the
term vi in the disturbance of the equation, rather than build it into the regression with a set of
dummy variables.)
In a random-effects model, disturbances for a given college (in our case) or whatever
entity is under study will be correlated across periods whereas in the fixed-effects model this
correlation is assumed to be absent. However, in both settings, correlation between the panel
error term effects and the explanatory variables is also likely. Where it occurs, this correlation
will reflect the effect of substantive influences on the dependent variable that have been omitted
from the equation – the classic “missing variables” problem. The now standard Mundlak (1978)
approach is a method of accommodating this correlation between the effects and means of the
regressors. The approach is motivated by the suggestion that the correlation can be explained by
the overall levels (group means) of the time variables. By this device, the effect, β1i, is projected
upon the group means of the time-varying variables, so that
β1i = β1 + δ′x̄i + wi        (15)
where x̄i is the set of group (school) means of the time-varying variables and wi is a (now)
random effect that is uncorrelated with the variables and disturbances in the model, wi~N(0, σw2).
In fact, the random effects model as described here departs from an assumption that the
school effect, vi, actually is uncorrelated with the other variables. If true, the projection would
be unnecessary. However, in most cases, the initial assumption of the random-effects model, that
the effects and the regressors are uncorrelated, is considered quite strong. In the fixed effects
case, the assumption is not made. However, it remains a useful extension of the fixed effects
model to think about the “effect,” β1i, in terms of a projection such as suggested above – perhaps
by the logic of a “hierarchical model,” for the regression. That is, although the fixed effects
model allows for an unrestricted effect, freely correlated with the time varying variables, the
Mundlak projection adds a layer of explanation to this effect. The Mundlak approach is useful
in either setting. Note that after incorporating the Mundlak “correction” in the fixed
effects specification, the resulting equation becomes a random effects equation.
Adding the unit means to the equations picks up the correlation between the school
effects and the other variables as well as reflecting an expected long-term steady state. Our
random effects model for BA and BS degree-granting undergraduate economics departments is

FACULTY sizeit = β1 + β2Tt + β3BA&Sit + β4MEANBA&Si + β5MOVAVBA&BS + β6PUBLICi + β7Bschl + εit + ui
FIXED EFFECTS VERSUS RANDOM EFFECTS
Fixed-effects models can be estimated efficiently by ordinary least squares whereas random-
effects models are usually estimated using some type of generalized least-squares procedure.
GLS should yield more asymptotically efficient estimators if the assumptions for the random-
effects model are correct. Current practice, however, favors the fixed-effects approach for
estimating standard errors because the stronger assumptions behind the
GLS estimator are likely not satisfied, implying poor finite sample properties (Angrist and
Pischke, 2009, p. 223). This creates a bit of a dilemma, because the fixed effects approach is, at
least potentially, very inefficient if the random effects assumptions are actually met. The fixed
effects approach could lead to estimation of K+1+n rather than K+2 parameters (including σ2).
Whether we treat the effects as fixed (with a constant intercept β1 and dummy category
variables) or random (with a stochastic intercept β1i) makes little difference when there are a
large number of time periods (Hsiao, 2007, p. 41). But the typical case is one for which the
time series is short, with many cross-section units.
The Hausman (1978) test has become the standard approach for assessing the
appropriateness of the fixed-effects versus random-effects model. Ultimately, the question is
whether there is strong correlation between the unobserved case-specific random effects and the
explanatory variables. If this correlation is significant, the random-effects model is inappropriate
and the fixed-effects model is supported. On the other hand, insignificant correlation between
the specific random-effects errors and the regressors implies that the more efficient random-
effects coefficient estimators trump the consistent fixed-effects estimators. The correlation
cannot be assessed directly. But, indirectly, it has a testable implication for the estimators. If the
effects are correlated with the time-varying variables, then, in essence, the dummy variables will
have been left out of the random effects model/estimator. The classic left out variable result then
implies that the random effects estimator will be biased because of this problem, but the fixed
effects estimator will not be biased because it includes the dummy variables. If the random
effects model is appropriate, the fixed effects approach will still be unbiased, though it will fail
to use the information that the extra dummy variables in the model are not needed. Thus, an
indirect test for the presence of this correlation is based on the empirical difference between the
fixed and random effects estimators.
Let βFE and βRE be the vectors of coefficients from a fixed-effects and random-effects
specification. The null hypothesis for purpose of the Hausman test is that under the random
effects assumption, estimators of both of these vectors are consistent, but the estimator for βRE is
more efficient (with a smaller asymptotic variance) than that of βFE. Hausman’s alternative
hypothesis is that the random-effects estimator is inconsistent (with coefficient distributions not
settling on the correct parameter values as sample size goes to infinity) under the hypothesis of
the fixed-effects model, but is consistent under the hypothesis of the random-effects model. The
fixed-effects estimator is consistent in both cases. The Hausman test statistic is based on the
difference between the estimated covariance matrix for least-squares dummy variable coefficient
estimates (bFE ) and that for the random-effects model:
H = (bFE − bRE)´[Var (bFE ) − Var(bRE)]−1(bFE − bRE)
If the Chi-square statistic’s p value is less than 0.05, reject the Hausman null hypothesis and do not
use random effects. If the p value is greater than 0.05, do not reject the Hausman null
hypothesis and use random effects. An intuitively appealing, and fully equivalent (and usually
more convenient) way to carry out the Hausman test is to test the null hypothesis in the context
of the random-effects model that the coefficients on the group means in the Mundlak-augmented
regression are jointly zero.
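For reference, both forms of the test can be sketched in STATA; the variable names y, x2 and x3 and
the panel identifier id below are placeholders, and the data are assumed to have already been declared
a panel:

* classic Hausman test: compare the fixed-effects and random-effects estimators
xtreg y x2 x3, fe
estimates store fixed
xtreg y x2 x3, re
estimates store random
hausman fixed random              // a small p-value rejects random effects in favor of fixed effects

* equivalent Mundlak-based test: add group means of the time-varying
* regressors to the random-effects model and test their coefficients jointly
bysort id: egen x2bar = mean(x2)
bysort id: egen x3bar = mean(x3)
xtreg y x2 x3 x2bar x3bar, re
test x2bar x3bar                  // joint test that the group-mean coefficients are zero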
CONCLUDING COMMENTS
As stated in Module One, “theory is easy, data are hard – hard to find and hard to get into a
computer program for statistical analysis.” This axiom is particularly true for those wishing to
do panel data analysis on topics related to the teaching of economics, where data collected for
only a single cross section is the norm. As stated in Endnote One, a recent exception is Stanca
(2006), in which a large panel data set for students in Introductory Microeconomics is used to
explore the effects of attendance on performance. As with Modules One and Two, Parts Two,
Three and Four of this module provide the computer code to conduct a panel data analysis with
LIMDEP (NLOGIT), STATA and SAS, using the Becker, Greene and Siegfried (2009) data set.
REFERENCES
Angrist, Joshua D. and Jorn-Steffen Pischke (2009). Mostly Harmless Econometrics. Princeton
New Jersey: Princeton University Press.
Becker, William, William Greene and John Siegfried (Forthcoming). “Does Teaching Load
Affect Faculty Size?” The American Economist.
Greene, William (2008). Econometric Analysis. 6th Edition, New Jersey: Prentice Hall.
Hsiao, Cheng (2007). Analysis of Panel Data. 2nd Edition (8th Printing), Cambridge: Cambridge
University Press.
Johnson, William R. and Sarah Turner (2009). “Faculty Without Students: Resource Allocation
in Higher Education, ” Journal of Economic Perspectives. Vol. 23. No. 2 (Spring): 169-190.
Link, Charles R. and James G. Mulligan (1996). "The Value of Repeat Data for individual
Students," in William E. Becker and William J. Baumol (eds), Assessing Educational
Practices: The Contribution of Economics, Cambridge MA.: MIT Press.
Mundlak, Yair (1978). “On the Pooling of Time Series and Cross Section Data, ” Econometrica.
Vol. 46. No. 1 (January): 69-85.
Stanca, Luca (2006). “The Effects of Attendance on Academic Performance: Panel Data
Evidence for Introductory Microeconomics” Journal of Economic Education. Vol. 37. No. 3
(Summer): 251-266.
ENDNOTES
i
As seen in Stanca (2006), where a large panel data set for students in Introductory
Microeconomics is used to explore the effects of attendance on performance, panel data analysis
typically involves a dimension of time. However, panels can be set up in blocks that involve a
dimension other than time. For example, Marburger (2006) uses a panel data structure (with no
time dimension) to overcome endogeneity problems in assessing the effects of an enforced
attendance policy on absenteeism and exam performance. Each of the Q multiple-choice exam
questions was associated with specific course material that could be linked to the
attendance pattern of N students, giving rise to NQ panel data records. A dummy variable for
each student captured the fixed effects for the unmeasured attributes of students, thereby
eliminating any student-specific sample selection problem.
ii
Section V of Link and Mulligan (1996) outlines the advantage of panel analysis for a range of
educational issues. They show how panel data analysis can be used to isolate the effect of
individual teachers, schools or school districts on students' test scores.
iii
One of us, as a member on an external review team for a well known economics department,
was told by a high-ranking administrator that the department had received all the additional lines
it was going to get because it now had too many majors for the good of the institution.
Historically, the institution was known for turning out engineers and the economics department
was attracting too many students away from engineering. This personal experience is consistent
with Johnson and Turner’s (2009, p. 170) assessment that a substantial part of the explanation for
differences in student-faculty ratios across academic departments resides in politics or tradition
rather than economic decision-making in many institutions of higher education.
iv
As long as the model is static with all the explanatory variables exogenous and no lagged
dependent variables used as explanatory variables, ordinary least-squares estimators are unbiased
and consistent although not as efficient as those obtained by maximum likelihood routines.
Unfortunately, this is not true if a lagged dependent variable is introduced as a regressor (as one
might want to do if the posttest is to be explained by a pretest). The implied correlation between
the lagged dependent variable and the individual-specific effects and associated error terms biases
the OLS estimators (Hsiao, 2007, pp. 73-74).
v
Fixed-effects models can have too many categories, requiring too many dummies, for
parameter estimation. Even if estimation is possible, there may be too few degrees of freedom
and little power for statistical tests. In addition, problems of multicollinearity arise when many
dummy variables are introduced.
vi
Hsiao (2007, pp. 295-310) discusses panel data with a large number of time periods. When T
is large serial correlation problems become a big issue, which is well beyond the scope of this
introductory module.
vii
Random-effects models in which the intercept error term vi does not depend on time are
referred to as one-way random-effects models. Two-way random-effects models have error terms
of the form
εit = vi + εt + uit
where vi is the cross-section-specific error, affecting only observations in the ith panel; εt is the
time-specific component, common to all observations in the tth period; and uit is the
random perturbation specific to the individual observation in the ith panel at time t. These two-
way random-effects models are also known as error component models and variance component
models.
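A short STATA sketch of this error structure (with purely illustrative variances) generates the three
components for a panel of 18 colleges over 1993 through 2006:

* simulate eps(i,t) = v(i) + e(t) + u(i,t) with hypothetical variances
clear
set seed 2010
set obs 18
gen college = _n
gen v = rnormal(0, 1)                       // cross-section-specific error v(i)
expand 14
bysort college: gen year = 1992 + _n        // years 1993 through 2006
bysort year (college): gen e_t = rnormal(0, 0.5) if _n == 1
by year: replace e_t = e_t[1]               // time-specific error shared by all colleges in year t
gen u = rnormal(0, 0.25)                    // observation-specific error u(i,t)
gen eps = v + e_t + u                       // composite two-way disturbance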
MODULE THREE, PART TWO: PANEL DATA ANALYSIS
IN ECONOMIC EDUCATION RESEARCH USING LIMDEP (NLOGIT)
Part Two of Module Three provides a cookbook-type demonstration of the steps required to use
LIMDEP (NLOGIT) in panel data analysis. Users of this module need to have completed Module
One, Parts One and Two, and Module Three, Part One. That is, from Module One users are
assumed to know how to get data into LIMDEP, recode and create variables within LIMDEP,
and run and interpret regression results. They are also expected to know how to test linear
restrictions on sets of coefficients as done in Module One, Parts One and Two. Module Three,
Parts Three and Four demonstrate in STATA and SAS what is done here in LIMDEP.
THE CASE
As described in Module Three, Part One, Becker, Greene and Siegfried (2009) examine the
extent to which undergraduate degrees (BA and BS) in economics or Ph.D. degrees (PhD) in
economics drive faculty size at those U.S. institutions that offer only a bachelor degree and those
that offer both bachelor degrees and PhDs. Here we retrace their analysis for the institutions
that offer only the bachelor degree. We provide and demonstrate the LIMDEP (NLOGIT) code
necessary to duplicate their results.
DATA FILE
The following panel data are provided in the comma separated values (CSV) text file
“bachelors.csv”, which will automatically open in EXCEL by simply double clicking on it after
it has been downloaded to your hard drive. Your EXCEL spreadsheet should look like this:
“DegreBar” is the average number of degrees awarded by each college for the 16-year period.
“T” is the time trend running from −7 to 8, corresponding to years from 1991 through 2006.
“MA_Deg” is a three-year moving average of degrees (unknown for the first two years).
College Year Degrees DegreBar Public Faculty Bschol T MA_Deg
1 1991 50 47.375 2 11 1 ‐7 0
1 1992 32 47.375 2 8 1 ‐6 0
1 1993 31 47.375 2 10 1 ‐5 37.667
1 1994 35 47.375 2 9 1 ‐4 32.667
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
1 2003 57 47.375 2 7 1 5 56
1 2004 57 47.375 2 10 1 6 55.667
1 2005 57 47.375 2 10 1 7 57
1 2006 51 47.375 2 10 1 8 55
2 1991 16 8.125 2 3 1 ‐7 0
2 1992 14 8.125 2 3 1 ‐6 0
2 1993 10 8.125 2 3 1 ‐5 13.333
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
2 2004 10 8.125 2 3 1 6 12.667
2 2005 7 8.125 2 3 1 7 11.333
2 2006 6 8.125 2 3 1 8 7.667
3 1991 40 35.5 2 8 1 ‐7 0
3 1992 31 37.125 2 8 1 ‐6 0
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
17 2004 64 39.3125 2 5 0 6 54.667
17 2005 37 39.3125 2 4 0 7 51.333
17 2006 53 39.3125 2 4 0 8 51.333
18 1991 14 8.4375 2 4 0 ‐7 0
18 1992 10 8.4375 2 4 0 ‐6 0
18 1993 10 8.4375 2 4 0 ‐5 11.333
18 1994 7 8.4375 2 3.5 0 ‐4 9
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
18 2005 4 8.4375 2 2.5 0 7 7.333
18 2006 7 8.4375 2 3 0 8 6
If you opened this CSV file in a word processor or text editing program, it would show
that each of the 289 lines (including the headers) corresponds to a row in the EXCEL table, but
variable values would be separated by commas and not appear neatly one on top of the other as
in EXCEL.
As discussed in Module One, Part Two, older versions of LIMDEP (NLOGIT) have a
data matrix default restriction of no more than 222 rows (records per variable), 900 columns
(number of variables) and 200,000 cells. LIMDEP 9 and NLOGIT 4.0 automatically adjust the
data constraints but in older versions the number of cells must be increased to accommodate
work with our data set. After opening LIMDEP, the number of working cells can be increased
by clicking the Project button on the top ribbon, going to Settings, and changing the number of
cells. Going from the default 200,000 cells to 900,000 cells (1,000 Rows and 900 columns) is
more than sufficient for this panel data set.
We could write a “READ” command to bring this text data file into LIMDEP but like
EXCEL it can be imported into LIMDEP directly by clicking the Project button on the top
ribbon, going to Import, and then clicking on Variables, from which the bachelors.csv file can be
located wherever it is stored (in our case in the “Greene programs 2” folder). Hitting the Open
button will bring the data set into LIMDEP, which can be checked by clicking the “Activate Data
Editor” button, which is second from the right on the tool bar or go to Data Editor in the
Window’s menu, as described and demonstrated in Module One, Part Two.
In addition to a visual inspection of the data via the “Activate Data Editor,” we use the “dstat”
command to check the descriptive statistics. First, however, we need to remove the two years
(1991 and 1992) for which no data are available for the degree moving average measure. This is
done with the “Reject” command. In our “File:New Text/Command Document” (which was
described in Module One, Part Two), we have
reject ; year < 1993 $
dstat;rhs=*$
--> reject ; year < 1993 $
--> dstat;rhs=*$
Descriptive Statistics
All results based on nonmissing observations.
==============================================================================
Variable Mean Std.Dev. Minimum Maximum Cases Missing
==============================================================================
All observations in current sample
--------+---------------------------------------------------------------------
COLLEGE | 9.50000 5.19845 1.00000 18.0000 252 0
YEAR | 1999.50 4.03915 1993.00 2006.00 252 0
DEGREES | 23.1111 19.2264 .000000 81.0000 252 0
DEGREBAR| 23.6528 18.0143 2.00000 62.4375 252 0
PUBLIC | 1.77778 .416567 1.00000 2.00000 252 0
FACULTY | 6.51786 3.13677 2.00000 14.0000 252 0
BSCHOOL | .388889 .488468 .000000 1.00000 252 0
T | 1.50000 4.03915 -5.00000 8.00000 252 0
MA_DEG | 23.1931 18.5540 1.33333 80.0000 252 0
The constant coefficient panel data model for the faculty size data-generating process for
bachelor degree-granting undergraduate departments is given by

FACULTY sizeit = β1 + β2Tt + β3BA&Sit + β4MEANBA&Si + β5MOVAVBA&BS + β6PUBLICi + β7Bschl + εit
where the error term εit is independent and identically distributed (iid) across institutions and
over time and E(εit2|xit) = σ2 , for I = 18 colleges and T = 14 years (−5 through 8) for 252
complete records. The LIMDEP OLS regression command that needs to be entered into the
command document (again, following the procedure for opening the command document
window shown in Module One, Part Two), including the standard error adjustment for clustering
is

regress;lhs=faculty;rhs=one,t,degrees,degrebar,public,bschool,MA_deg
;cluster=14$

Upon highlighting and hitting the “Go” button, the Output file shows the following results:
--> reject ; year < 1993 $
--> regress;lhs=faculty;rhs=one,t,degrees,degrebar,public,bschool,MA_deg
;cluster=14$
+----------------------------------------------------+
| Ordinary least squares regression |
| LHS=FACULTY Mean = 6.517857 |
| Standard deviation = 3.136769 |
| Number of observs. = 252 |
| Model size Parameters = 7 |
| Degrees of freedom = 245 |
| Residuals Sum of squares = 868.4410 |
| Standard error of e = 1.882726 |
| Fit R-squared = .6483574 |
| Adjusted R-squared = .6397458 |
| Model test F[ 6, 245] (prob) = 75.29 (.0000) |
| Diagnostic Log likelihood = -513.4686 |
| Restricted(b=0) = -645.1562 |
| Chi-sq [ 6] (prob) = 263.38 (.0000) |
| Info criter. LogAmemiya Prd. Crt. = 1.292840 |
| Akaike Info. Criter. = 1.292826 |
| Bayes Info. Criter. = 1.390866 |
| Autocorrel Durbin-Watson Stat. = .3295926 |
| Rho = cor[e,e(-1)] = .8352037 |
| Model was estimated Jul 16, 2009 at 04:21:28PM |
+----------------------------------------------------+
+---------------------------------------------------------------------+
| Covariance matrix for the model is adjusted for data clustering. |
| Sample of 252 observations contained 18 clusters defined by |
| 14 observations (fixed number) in each cluster. |
| Sample of 252 observations contained 1 strata defined by |
| 252 observations (fixed number) in each stratum. |
+---------------------------------------------------------------------+
+--------+--------------+----------------+--------+--------+----------+
|Variable| Coefficient | Standard Error |t-ratio |P[|T|>t]| Mean of X|
+--------+--------------+----------------+--------+--------+----------+
|Constant| 10.1397*** .91063 11.135 .0000 |
|T | -.02809 .02227 -1.261 .2083 1.50000|
|DEGREES | -.01636 .01866 -.877 .3814 23.1111|
|DEGREBAR| .10832*** .03378 3.206 .0015 23.6528|
|PUBLIC | -3.86239*** .56950 -6.782 .0000 1.77778|
|BSCHOOL | .58112 .94253 .617 .5381 .38889|
|MA_DEG | .03780** .01810 2.089 .0377 23.1931|
+--------+------------------------------------------------------------+
| Note: ***, **, * = Significance at 1%, 5%, 10% level. |
+---------------------------------------------------------------------+
Contemporaneous degrees have little to do with current faculty size but both overall number of
degrees awarded (the school means) and the moving average of degrees (MA_DEG) have
significant effects. It takes an increase of 26 or 27 bachelor degrees in the moving average to
expect just one more faculty position. Whether it is a public or a private college is highly
significant. Moving from a public to a private college lowers predicted faculty size by nearly
four members for otherwise comparable institutions. There is an insignificant erosion of tenured
and tenure-track faculty size over time. Finally, while economics departments in colleges with a
business school tend to have a larger permanent faculty, ceteris paribus, the effect is small and
insignificant.
FIXED-EFFECTS REGRESSION
To estimate the fixed-effects model we can either insert seventeen (0,1) covariates to capture the
unique effect of each of the 18 colleges (where each of the 17 dummy coefficients is measured
relative to the constant term) or insert 18 dummy variables with no overall constant term
in the OLS regression. The results for the other coefficients and for R2 will be identical either
way, while the constant terms, since they measure the difference of each college from the
18th in the first case, or the difference of all 18 from zero in the second, will differ. This
difference is inconsequential for the regression of interest. Which way the model is estimated is
purely a matter of convenience and preference.
create
;Col1=college=1
;Col2=college=2
;Col3=college=3
;Col4=college=4
;Col5=college=5
;Col6=college=6$
create
;Col7=college=7
;Col8=college=8
;Col9=college=9
;Col10=college=10
;Col11=college=11
;Col12=college=12$
create
;Col13=college=13
;Col14=college=14
;Col15=college=15
;Col16=college=16
;Col17=college=17
;Col18=college=18$
regress;lhs=faculty;rhs=one,t,degrees,MA_deg,
Col1,Col2,Col3,Col4,Col5,Col6,Col7,Col8,Col9,
Col10,Col11,Col12,Col13,Col14,Col15,Col16,Col17; cluster=14$
+---------------------------------------------------------------------+
| Covariance matrix for the model is adjusted for data clustering. |
| Sample of 252 observations contained 18 clusters defined by |
| 14 observations (fixed number) in each cluster. |
+---------------------------------------------------------------------+
----------------------------------------------------------------------
Ordinary least squares regression ............
LHS=FACULTY Mean = 6.51786
Standard deviation = 3.13677
Number of observs. = 252
Model size Parameters = 21
Degrees of freedom = 231
Residuals Sum of squares = 146.63709
Standard error of e = .79674
Fit R-squared = .94062
Adjusted R-squared = .93548
Model test F[ 20, 231] (prob) = 183.0(.0000)
Diagnostic Log likelihood = -289.34751
Restricted(b=0) = -645.15625
Chi-sq [ 20] (prob) = 711.6(.0000)
Info criter. LogAmemiya Prd. Crt. = -.37441
Akaike Info. Criter. = -.37480
Bayes Info. Criter. = -.08068
Model was estimated on Sep 23, 2009 at 06:44:38 PM
--------+-------------------------------------------------------------
Variable| Coefficient Standard Error t-ratio P[|T|>t] Mean of X
--------+-------------------------------------------------------------
Constant| 2.69636*** .15109 17.846 .0000
T| -.02853 .02245 -1.271 .2051 1.50000
DEGREES| -.01608 .01521 -1.058 .2913 23.1111
MA_DEG| .03985*** .01485 2.683 .0078 23.1931
COL1| 5.77747*** .76816 7.521 .0000 .05556
COL2| .15299*** .01343 11.392 .0000 .05556
COL3| 4.29759*** .55420 7.755 .0000 .05556
COL4| 6.28973*** .65533 9.598 .0000 .05556
COL5| 4.91094*** .56987 8.618 .0000 .05556
COL6| 5.02016*** .02561 196.041 .0000 .05556
COL7| 1.21384*** .01321 91.876 .0000 .05556
COL8| .77797*** .06785 11.466 .0000 .05556
COL9| 3.16474*** .06270 50.478 .0000 .05556
COL10| 2.86345*** .15540 18.427 .0000 .05556
COL11| 5.15181*** .02403 214.385 .0000 .05556
COL12| -.06802*** .02153 -3.160 .0018 .05556
COL13| 3.98895*** 1.01415 3.933 .0001 .05556
COL14| -.63196*** .11986 -5.272 .0000 .05556
COL15| 8.25859*** .47255 17.477 .0000 .05556
COL16| 8.00970*** .55461 14.442 .0000 .05556
COL17| .43544 .59258 .735 .4632 .05556
--------+-------------------------------------------------------------
Note: ***, **, * = Significance at 1%, 5%, 10% level.
----------------------------------------------------------------------
Once again, contemporaneous degrees is not a driving force in faculty size. An F test is
not needed to assess whether at least one of the 17 colleges differs from college 18. With the
exception of college 17, each of the other colleges is significantly different. The moving average of
degrees is again significant.
The preceding approach, of computing all the dummy variables and building them into
the regression, is likely to become unduly cumbersome if the number of colleges (units) is very
large. Most contemporary software, including LIMDEP, will do this computation automatically
without explicitly computing the dummy variables and including them in the equation. As an
alternative to specifying all the dummies in the regression command, the same results can be
obtained with the simpler "FixedEffects" command:
regress;lhs=faculty;rhs=one,t,degrees,MA_deg
;Panel;Str=College
;FixedEffects;Robust$
----------------------------------------------------------------------
Least Squares with Group Dummy Variables..........
Ordinary least squares regression ............
LHS=FACULTY Mean = 6.51786
Standard deviation = 3.13677
Number of observs. = 252
Model size Parameters = 21
Degrees of freedom = 231
Residuals Sum of squares = 146.63709
Standard error of e = .79674
Fit R-squared = .94062
Adjusted R-squared = .93548
Model test F[ 20, 231] (prob) = 183.0(.0000)
Diagnostic Log likelihood = -289.34751
Restricted(b=0) = -645.15625
Chi-sq [ 20] (prob) = 711.6(.0000)
Info criter. LogAmemiya Prd. Crt. = -.37441
Akaike Info. Criter. = -.37480
Bayes Info. Criter. = -.08068
Model was estimated on Sep 23, 2009 at 06:44:38 PM
Estd. Autocorrelation of e(i,t) = .293724
Robust cluster corrected covariance matrix used
Panel:Groups Empty 0, Valid data 18
Smallest 14, Largest 14
Average group size in panel 14.00
--------+-------------------------------------------------------------
Variable| Coefficient Standard Error t-ratio P[|T|>t] Mean of X
--------+-------------------------------------------------------------
T| -.02853 .02245 -1.271 .2050 1.50000
DEGREES| -.01608 .01521 -1.058 .2912 23.1111
MA_DEG| .03985*** .01485 2.683 .0078 23.1931
--------+-------------------------------------------------------------
Note: ***, **, * = Significance at 1%, 5%, 10% level.
----------------------------------------------------------------------
RANDOM-EFFECTS REGRESSION
Finally, consider the random-effects model in which we employ Mundlak’s (1978) approach to
estimating panel data. The Mundlak model posits that the fixed effects in the equation, β1i , can
be projected upon the group means of the time-varying variables, so that
β1i = β1 + δ′ xi + wi
where xi is the set of group (school) means of the time-varying variables and wi is a (now)
random effect that is uncorrelated with the variables and disturbances in the model. Logically,
adding the means to the equations picks up the correlation between the school effects and the
other variables. We could not incorporate the mean number of degrees awarded in the fixed-
effects model (because it was time invariant) but this variable plays a critical role in the Mundlak
approach to panel data modeling and estimation.
The random effects model for BA and BS degree-granting undergraduate departments is

FACULTY sizeit = β1 + β2Tt + β3BA&Sit + β4MEANBA&Si + β5MOVAVBA&BS + β6PUBLICi + β7Bschl + εit + ui

where error term ε is iid over time, E(εit2|xit) = σ2 for I = 18 and Ti = 14 and E[ui2] = θ2 for I =
18. The LIMDEP program to be run from the Text/Command Document (with 1991 and 1992
data suppressed) is
regress
;lhs=faculty
;rhs=one,t,degrees,degrebar,public,bschool,MA_deg
;pds=14
;panel
;random
;robust$
--> regress
;lhs=faculty;rhs=one,t,degrees,degrebar,public,bschool,MA_deg
;pds=14;panel;random;robust$
----------------------------------------------------------------------
OLS Without Group Dummy Variables.................
Ordinary least squares regression ............
LHS=FACULTY Mean = 6.51786
Standard deviation = 3.13677
Number of observs. = 252
Model size Parameters = 7
Degrees of freedom = 245
Residuals Sum of squares = 868.44104
Standard error of e = 1.88273
Fit R-squared = .64836
Adjusted R-squared = .63975
Model test F[ 6, 245] (prob) = 75.3(.0000)
Diagnostic Log likelihood = -513.46861
Restricted(b=0) = -645.15625
Chi-sq [ 6] (prob) = 263.4(.0000)
Info criter. LogAmemiya Prd. Crt. = 1.29284
Akaike Info. Criter. = 1.29283
Bayes Info. Criter. = 1.39087
Model was estimated on Sep 23, 2009 at 07:17:22 PM
Panel Data Analysis of FACULTY [ONE way]
Unconditional ANOVA (No regressors)
Source Variation Deg. Free. Mean Square
Between 2312.22321 17. 136.01313
Residual 157.44643 234. .67285
Total 2469.66964 251. 9.83932
--------+-------------------------------------------------------------
Variable| Coefficient Standard Error t-ratio P[|T|>t] Mean of X
--------+-------------------------------------------------------------
T| -.02809 .03030 -.927 .3549 1.50000
DEGREES| -.01636 .02334 -.701 .4839 23.1111
DEGREBAR| .10832*** .02047 5.293 .0000 23.6528
PUBLIC| -3.86239*** .29652 -13.026 .0000 1.77778
BSCHOOL| .58112** .25115 2.314 .0215 .38889
MA_DEG| .03780 .02907 1.300 .1947 23.1931
Constant| 10.1397*** .52427 19.341 .0000
--------+-------------------------------------------------------------
Note: ***, **, * = Significance at 1%, 5%, 10% level.
----------------------------------------------------------------------
+----------------------------------------------------+
| Panel:Groups Empty 0, Valid data 18 |
| Smallest 14, Largest 14 |
| Average group size 14.00 |
| There are 3 vars. with no within group variation. |
| DEGREBAR PUBLIC BSCHOOL |
+----------------------------------------------------+
----------------------------------------------------------------------
Random Effects Model: v(i,t) = e(i,t) + u(i)
Estimates: Var[e] = .643145
Var[u] = 2.901512
Corr[v(i,t),v(i,s)] = .818559
Lagrange Multiplier Test vs. Model (3) =1096.30
( 1 degrees of freedom, prob. value = .000000)
(High values of LM favor FEM/REM over CR model)
Baltagi-Li form of LM Statistic = 1096.30
Sum of Squares 868.488173
R-squared .648338
Robust cluster corrected covariance matrix used
--------+-------------------------------------------------------------
Variable| Coefficient Standard Error b/St.Er. P[|Z|>z] Mean of X
--------+-------------------------------------------------------------
T| -.02853 .02146 -1.329 .1838 1.50000
DEGREES| -.01609 .01793 -.897 .3696 23.1111
DEGREBAR| .10610*** .03228 3.287 .0010 23.6528
PUBLIC| -3.86365*** .54685 -7.065 .0000 1.77778
BSCHOOL| .58176 .90497 .643 .5203 .38889
MA_DEG| .03981** .01728 2.305 .0212 23.1931
Constant| 10.1419*** .87456 11.597 .0000
--------+-------------------------------------------------------------
Note: ***, **, * = Significance at 1%, 5%, 10% level.
----------------------------------------------------------------------
The marginal effect of an additional economics major is again insignificant but slightly negative
within the sample. Both the short-term moving average number and long-term average number
of bachelor degrees are significant. A long-term increase of about 10 students earning degrees in
economics is required to predict that one more tenured or tenure-track faculty member is in a
department. Ceteris paribus, economics departments at private institutions are smaller than
comparable departments at public schools by about four members, a difference that is large and
statistically significant.
Whether there is a business school present is insignificant. There is no meaningful trend in
faculty size.
CONCLUDING REMARKS
The goal of this hands-on component of this third of four modules is to enable economic
education researchers to make use of panel data for the estimation of constant coefficient, fixed-
effects and random-effects panel data models in LIMDEP (NLOGIT). It was not intended to
explain all of the statistical and econometric nuances associated with panel data analysis. For
this an intermediate level econometrics textbook (such as Jeffrey Wooldridge, Introductory
Econometrics) or advanced econometrics textbook (such as William Greene, Econometric
Analysis) should be consulted.
APPENDIX:
There are two alternative ways to create the college dummy variables for use in the regression.
REFERENCES
Becker, William, William Greene and John Siegfried (2009). “Does Teaching Load Affect
Faculty Size? ” Working Paper (July).
Greene, William (2008). Econometric Analysis. 6th Edition, New Jersey: Prentice Hall.
Mundlak, Yair (1978). "On the Pooling of Time Series and Cross Section Data," Econometrica.
Vol. 46. No. 1 (January): 69-85.
Wooldridge, Jeffrey (2009). Introductory Econometrics. 4th Edition, Mason OH: South-
Western.
MODULE THREE, PART THREE: PANEL DATA ANALYSIS
IN ECONOMIC EDUCATION RESEARCH USING STATA
Part Three of Module Three provides a cookbook-type demonstration of the steps required to use
STATA in panel data analysis. Users of this module need to have completed Module One, Parts
One and Three, and Module Three, Part One. That is, from Module One users are assumed to
know how to get data into STATA, recode and create variables within STATA, and run and
interpret regression results. They are also expected to know how to test linear restrictions on
sets of coefficients as done in Module One, Parts One and Three. Module Three, Parts Two and
Four demonstrate in LIMDEP and SAS what is done here in STATA.
THE CASE
As described in Module Three, Part One, Becker, Greene and Siegfried (2009) examine the
extent to which undergraduate degrees (BA and BS) in economics or Ph.D. degrees (PhD) in
economics drive faculty size at those U.S. institutions that offer only a bachelor degree and those
that offer both bachelor degrees and PhDs. Here we retrace their analysis for the institutions
that offer only the bachelor degree. We provide and demonstrate the STATA code necessary to
duplicate their results.
DATA FILE
The following panel data are provided in the comma separated values (CSV) text file
“bachelors.csv”, which will automatically open in EXCEL by simply double clicking on it after
it has been downloaded to your hard drive. Your EXCEL spreadsheet should look like this:
“DegreBar” is the average number of degrees awarded by each college for the 16-year period.
“T” is the time trend running from −7 to 8, corresponding to years from 1991 through 2006.
“MA_Deg” is a three-year moving average of degrees (unknown for the first two years).
College Year Degrees DegreBar Public Faculty Bschol T MA_Deg
1 1991 50 47.375 2 11 1 ‐7 0
1 1992 32 47.375 2 8 1 ‐6 0
1 1993 31 47.375 2 10 1 ‐5 37.667
1 1994 35 47.375 2 9 1 ‐4 32.667
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
1 2003 57 47.375 2 7 1 5 56
1 2004 57 47.375 2 10 1 6 55.667
1 2005 57 47.375 2 10 1 7 57
1 2006 51 47.375 2 10 1 8 55
2 1991 16 8.125 2 3 1 ‐7 0
2 1992 14 8.125 2 3 1 ‐6 0
2 1993 10 8.125 2 3 1 ‐5 13.333
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
2 2004 10 8.125 2 3 1 6 12.667
2 2005 7 8.125 2 3 1 7 11.333
2 2006 6 8.125 2 3 1 8 7.667
3 1991 40 35.5 2 8 1 ‐7 0
3 1992 31 37.125 2 8 1 ‐6 0
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
17 2004 64 39.3125 2 5 0 6 54.667
17 2005 37 39.3125 2 4 0 7 51.333
17 2006 53 39.3125 2 4 0 8 51.333
18 1991 14 8.4375 2 4 0 ‐7 0
18 1992 10 8.4375 2 4 0 ‐6 0
18 1993 10 8.4375 2 4 0 ‐5 11.333
18 1994 7 8.4375 2 3.5 0 ‐4 9
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
18 2005 4 8.4375 2 2.5 0 7 7.333
18 2006 7 8.4375 2 3 0 8 6
If you opened this CSV file in a word processor or text editing program, it would show
that each of the 289 lines (including the headers) corresponds to a row in the EXCEL table, but
variable values would be separated by commas and not appear neatly one on top of the other as
in EXCEL.
As discussed in Module One, Part Three, you can read the CSV file into STATA by
typing the following command into the command window and pressing enter:
insheet using “E:\NCEE (Becker)\bachelors.csv”, comma
In this case, the “bachelors.csv” file is saved in the folder “E:\NCEE (Becker)” but this will vary by
user. For these data, the default memory allocated by STATA should be sufficient. After
entering the above command in the command window and pressing enter, you should see the
following screen:
STATA indicates that the data consist of 9 variables and 288 observations. In addition to
a visual inspection of the data via the “browse” command, you can use the "summarize"
command to check the descriptive statistics. First, however, we need to remove the two years
(1991 and 1992) for which no data are available for the degree moving average measure. This is
done with the "drop if" command. In the command window, type:
drop if year < 1993
summarize
which upon pressing enter yields the summary statistics for each of the nine variables.

Next, the data must be identified as a panel with the “tsset” command, whose basic syntax is

tsset “panel variable” “time variable”
In this case, our panel variable is college and our time variable is year, so the relevant command
is:
tsset college year
After typing the above command into STATA’s command window and pressing enter, you
should see the following screen:
This indicates that STATA recognizes a strongly balanced panel (i.e., the same number of years
for each college) with observations for each panel from 1993 through 2006. Note that we could
also use the variable “t” as our time variable.
In general, we must “tsset” the data before we can utilize any of STATA’s time-series or
panel data commands (for example, the “xtreg” command presented below). Our time variable
should also be appropriately spaced. For example, if we have yearly data, but our time variable
was recorded in a daily format (e.g., 1/1/1999, 1/1/2000, 1/1/2002, etc.), we would want to
reformat this variable as a yearly variable rather than daily. Correctly formatting the time
variable is important to ensure the various time-series commands in STATA work properly. For
more detail on formats and other options for the “tsset” command type “help tsset” into
STATA’s command window.
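As an optional check (not part of the original sequence of commands), STATA's panel-description
commands can be used once the data have been tsset; for example:

xtdescribe                   // pattern of observations per college (here 18 colleges, 14 years each)
xtsum faculty degrees        // overall, between and within variation for selected variables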
The constant coefficient panel data model for the faculty size data-generating process for
bachelor degree-granting undergraduate departments is given by

FACULTY sizeit = β1 + β2Tt + β3BA&Sit + β4MEANBA&Si + β5MOVAVBA&BS + β6PUBLICi + β7Bschl + εit
where the error term εit is independent and identically distributed (iid) across institutions and
over time and E(εit2|xit) = σ2 , for I = 18 colleges and T = 14 years (−5 through 8) for 252
complete records. The STATA OLS regression command that needs to be entered into the
command window, including the standard error adjustment for clustering on colleges, is

regress faculty t degrees degrebar public bschool ma_deg, robust cluster(college)
After typing the above command into the command window and pressing enter, the output
window shows the following results:
------------------------------------------------------------------------------
| Robust
faculty | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
t | -.0280875 .0222654 -1.26 0.224 -.0750634 .0188885
degrees | -.0163611 .0186579 -0.88 0.393 -.0557259 .0230037
degrebar | .1083201 .0337821 3.21 0.005 .0370461 .1795942
public | -3.862393 .5694961 -6.78 0.000 -5.063925 -2.660862
bschool | .5811154 .9425269 0.62 0.546 -1.407443 2.569673
ma_deg | .0378038 .0180966 2.09 0.052 -.0003767 .0759842
_cons | 10.13974 .9106264 11.13 0.000 8.218486 12.06099
------------------------------------------------------------------------------
Contemporaneous degrees have little to do with current faculty size but both overall number of
degrees awarded (the school means) and the moving average of degrees (MA_DEG) have
significant effects. It takes an increase of 26 or 27 bachelor degrees in the moving average to
expect just one more faculty position. Whether it is a public or a private college is highly
significant. Moving from a public to a private college lowers predicted faculty size by nearly
four members for otherwise comparable institutions. There is an insignificant erosion of tenured
and tenure-track faculty size over time. Finally, while economics departments in colleges with a
business school tend to have a larger permanent faculty, ceteris paribus, the effect is small and
insignificant.
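The 26-to-27-degree figure is simply the reciprocal of the MA_DEG coefficient; immediately after the
regression it can be recovered from the stored estimate (a small optional sketch):

display 1/_b[ma_deg]        // approximately 26.5 additional degrees in the moving average per faculty member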
FIXED-EFFECTS REGRESSION
The fixed-effects model requires either the insertion of 17 (0,1) covariates to capture the unique
effect of each of the 18 colleges (where each of the 17 dummy coefficients are measured relative
to the constant term) or the insertion of 18 dummy variables with no constant term in the OLS
regression. In addition, no time invariant variables can be included because they would be
perfectly correlated with the respective college dummies. Thus, the overall mean number of
degrees, the public or private dummy, and business school dummy cannot be included as
regressors.
The STATA code, including the commands to create the dummy variables, is (two
additional ways to estimate fixed-effects models in STATA are presented in the Appendix):
gen Col1=(college==1)
gen Col2=(college==2)
gen Col3=(college==3)
gen Col4=(college==4)
gen Col5=(college==5)
gen Col6=(college==6)
gen Col7=(college==7)
gen Col8=(college==8)
gen Col9=(college==9)
gen Col10=(college==10)
gen Col11=(college==11)
gen Col12=(college==12)
gen Col13=(college==13)
gen Col14=(college==14)
gen Col15=(college==15)
gen Col16=(college==16)
gen Col17=(college==17)
gen Col18=(college==18)
regress faculty t degrees ma_deg Col1-Col17, robust cluster(college)
Linear regression Number of obs = 252
F( 2, 17) = .
Prob > F = .
R-squared = 0.9406
Number of clusters (college) = 18 Root MSE = .79674
------------------------------------------------------------------------------
| Robust
faculty | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
t | -.0285342 .022453 -1.27 0.221 -.0759059 .0188374
degrees | -.0160847 .0152071 -1.06 0.305 -.0481689 .0159995
ma_deg | .039847 .0148528 2.68 0.016 .0085103 .0711837
Col1 | 5.777467 .7681565 7.52 0.000 4.156799 7.398136
Col2 | .1529889 .0134293 11.39 0.000 .1246555 .1813222
Col3 | 4.297591 .5541956 7.75 0.000 3.128341 5.466842
Col4 | 6.289728 .6553347 9.60 0.000 4.907093 7.672363
Col5 | 4.910941 .5698701 8.62 0.000 3.708621 6.113262
Col6 | 5.020157 .0256077 196.04 0.000 4.966129 5.074185
Col7 | 1.213842 .0132117 91.88 0.000 1.185967 1.241716
Col8 | .7779701 .0678475 11.47 0.000 .6348244 .9211157
Col9 | 3.164737 .0626958 50.48 0.000 3.03246 3.297013
Col10 | 2.863453 .1553986 18.43 0.000 2.53559 3.191315
Col11 | 5.151815 .0240307 214.39 0.000 5.101115 5.202515
Col12 | -.0680152 .0215257 -3.16 0.006 -.1134304 -.0226
Col13 | 3.988947 1.014148 3.93 0.001 1.849282 6.128611
Col14 | -.631956 .1198635 -5.27 0.000 -.8848458 -.3790662
Col15 | 8.258587 .4725524 17.48 0.000 7.261588 9.255585
Col16 | 8.009696 .5546092 14.44 0.000 6.839573 9.179819
Col17 | .4354377 .5925837 0.73 0.472 -.8148046 1.68568
_cons | 2.696364 .1510869 17.85 0.000 2.377598 3.015129
------------------------------------------------------------------------------
Once again, contemporaneous degrees is not a driving force in faculty size. There is no
need to do an F test to assess whether at least one of the 17 colleges differs from college 18. With the
exception of college 17, each of the other colleges is significantly different. The moving
average of degrees is again significant.
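Because the college dummies are ordinary regressors here, linear restrictions on them can be tested
directly after the regression with the test command; for example, a sketch testing whether the effects
of two of the colleges are equal:

test Col15 = Col16          // null hypothesis: colleges 15 and 16 have the same fixed effect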
RANDOM-EFFECTS REGRESSION
Finally, consider the random-effects model in which we employ Mundlak’s (1978) approach to
estimating panel data. The Mundlak model posits that the fixed effects in the equation, β1i , can
be projected upon the group means of the time-varying variables, so that
β1i = β1 + δ′ xi + wi
where xi is the set of group (school) means of the time-varying variables and wi is a (now)
random effect that is uncorrelated with the variables and disturbances in the model. Logically,
adding the means to the equations picks up the correlation between the school effects and the
other variables. We could not incorporate the mean number of degrees awarded in the fixed-
effects model (because it was time invariant) but this variable plays a critical role in the Mundlak
approach to panel data modeling and estimation.
The random effects model for BA and BS degree-granting undergraduate departments is
FACULTY sizeit = β1 + β2Tt + β3BA&Sit + β4MEANBA&Si + β5MOVAVBA&BS
+ β6PUBLICi + β7Bschl + εit + ui
where error term ε is iid over time, E(εit2|xit) = σ2 for I = 18 and Ti = 14 and E[ui2] = θ2 for I =
18. The STATA command to estimate this model, including the cluster adjustment to the standard
errors, is of the form

xtreg faculty t degrees degrebar public bschool ma_deg, re vce(cluster college)
The marginal effect of an additional economics major is again insignificant but slightly negative
within the sample.1 Both the short-term moving average number and long-term average number
of bachelor degrees are significant. A long-term increase of about 10 students earning degrees in
economics is required to predict that one more tenured or tenure-track faculty member is in a
department. Ceteris paribus, economics departments at private institutions are smaller than
comparable departments at public schools by about four members, a difference that is large and
statistically significant. Whether there is a business school present is insignificant. There is no
meaningful trend in faculty size.

1 Note that the Wald statistic of 1273.20 is based on a test of all coefficients in the model (including
the constant). This is inconsistent with the default Wald statistic reported in other regression results,
including random-effects models without robust or clustered standard errors, where the default
statistic is based on a test of all slope coefficients in the model. In the model estimated here, the Wald
statistic based on a test of all slope coefficients equal to 0 is 198.55. I understand that the current
version of STATA (STATA 11) now consistently presents the Wald statistic based on a test of all
slope coefficients.
CONCLUDING REMARKS
The goal of this hands-on component of this third of four modules is to enable economic
education researchers to make use of panel data for the estimation of constant coefficient, fixed-
effects and random-effects panel data models in STATA. It was not intended to explain all of
the statistical and econometric nuances associated with panel data analysis. For this an
intermediate level econometrics textbook (such as Jeffrey Wooldridge, Introductory
Econometrics) or advanced econometrics textbook (such as William Greene, Econometric
Analysis) should be consulted.
APPENDIX: Alternative commands to estimate fixed-effects models in STATA
We estimated the above fixed-effects model after explicitly creating 18 different dummy
variables. STATA also has a built in command (“xi”) to create a sequence of dummy variables
from a single categorical variable. To be consistent with the above model, we can first indicate
to STATA which category it should omit when creating the college dummy variables by typing
the following command into the command window and pressing enter:
char college[omit] 18
We can now automatically create the relevant college dummy variables and estimate the fixed-
effects model all through one command:

xi: regress faculty t degrees ma_deg i.college, robust cluster(college)
------------------------------------------------------------------------------
| Robust
faculty | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
t | -.0285342 .022453 -1.27 0.221 -.0759059 .0188374
degrees | -.0160847 .0152071 -1.06 0.305 -.0481689 .0159995
ma_deg | .039847 .0148528 2.68 0.016 .0085103 .0711837
_Icollege_1 | 5.777467 .7681565 7.52 0.000 4.156799 7.398136
_Icollege_2 | .1529889 .0134293 11.39 0.000 .1246555 .1813222
_Icollege_3 | 4.297591 .5541956 7.75 0.000 3.128341 5.466842
_Icollege_4 | 6.289728 .6553347 9.60 0.000 4.907093 7.672363
_Icollege_5 | 4.910941 .5698701 8.62 0.000 3.708621 6.113262
_Icollege_6 | 5.020157 .0256077 196.04 0.000 4.966129 5.074185
_Icollege_7 | 1.213842 .0132117 91.88 0.000 1.185967 1.241716
_Icollege_8 | .7779701 .0678475 11.47 0.000 .6348244 .9211157
_Icollege_9 | 3.164737 .0626958 50.48 0.000 3.03246 3.297013
_Icollege_10 | 2.863453 .1553986 18.43 0.000 2.53559 3.191315
_Icollege_11 | 5.151815 .0240307 214.39 0.000 5.101115 5.202515
_Icollege_12 | -.0680152 .0215257 -3.16 0.006 -.1134304 -.0226
_Icollege_13 | 3.988947 1.014148 3.93 0.001 1.849282 6.128611
_Icollege_14 | -.631956 .1198635 -5.27 0.000 -.8848458 -.3790662
_Icollege_15 | 8.258587 .4725524 17.48 0.000 7.261588 9.255585
_Icollege_16 | 8.009696 .5546092 14.44 0.000 6.839573 9.179819
_Icollege_17 | .4354377 .5925837 0.73 0.472 -.8148046 1.68568
_cons | 2.696364 .1510869 17.85 0.000 2.377598 3.015129
------------------------------------------------------------------------------
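Before turning to the second method, note that in STATA 11 and later, factor-variable notation
provides yet another route that avoids both the explicit gen statements and the xi prefix; a sketch
(ib18.college makes college 18 the omitted base category):

regress faculty t degrees ma_deg ib18.college, vce(cluster college)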
Method 2 – xtreg fe
STATA’s “xtreg” command allows for various panel data models to be estimated. A random-
effects model was presented above, but “xtreg” also estimates a fixed-effects model, a between-
effects model, and various other models. For the fixed-effects model with cluster-robust standard
errors, the “xtreg” command is of the form

xtreg faculty t degrees ma_deg, fe vce(cluster college) dfadj

and the resulting output includes, among other statistics,

F(3,17) = 2.66
corr(u_i, Xb) = 0.4966 Prob > F = 0.0815
The additional “dfadj” option adjusts the cluster-robust standard error estimates to account for
the transformation used by STATA in estimating the fixed-effects model (called the within
transform). Although estimating the fixed-effects model with xtreg no longer provides estimates
of the dummy variable coefficients, we see that the coefficient estimates and standard errors for
the remaining variables are identical to those of an OLS regression with panel-specific dummies
and cluster-robust standard errors.
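A further equivalent, offered here only as a sketch, is STATA's areg command, which "absorbs" the
college effects rather than estimating and reporting the individual dummy coefficients; the slope
estimates match those of the dummy-variable regression:

areg faculty t degrees ma_deg, absorb(college) vce(cluster college)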
REFERENCES
Becker, William, William Greene and John Siegfried (2009). “Does Teaching Load Affect
Faculty Size? ” Working Paper (July).
Greene, William (2008). Econometric Analysis. 6th Edition, New Jersey: Prentice Hall.
Mundlak, Yair (1978). "On the Pooling of Time Series and Cross Section Data," Econometrica.
Vol. 46. No. 1 (January): 69-85.
Wooldridge, Jeffrey (2009). Introductory Econometrics. 4th Edition, Mason OH: South-
Western.
MODULE THREE, PART FOUR: PANEL DATA ANALYSIS
IN ECONOMIC EDUCATION RESEARCH USING SAS
Part Four of Module Three provides a cookbook-type demonstration of the steps required to use
SAS in panel data analysis. Users of this module need to have completed Module One, Parts One
and Four, and Module Three, Part One. That is, from Module One users are assumed to know
how to get data into SAS, recode and create variables within SAS, and run and interpret
regression results. They are also expected to know how to test linear restrictions on sets of
coefficients as done in Module One, Parts One and Four. Module Three, Parts Two and Three
demonstrate in LIMDEP and STATA what is done here in SAS.
THE CASE
As described in Module Three, Part One, Becker, Greene and Siegfried (2009) examine the
extent to which undergraduate degrees (BA and BS) in economics or Ph.D. degrees (PhD) in
economics drive faculty size at those U.S. institutions that offer only a bachelor degree and those
that offer both bachelor degrees and PhDs. Here we retrace their analysis for the institutions
that offer only the bachelor degree. We provide and demonstrate the SAS code necessary to
duplicate their results.
DATA FILE
The following panel data are provided in the comma separated values (CSV) text file
“bachelors.csv”, which will automatically open in EXCEL by simply double clicking on it after
it has been downloaded to your hard drive. Your EXCEL spreadsheet should look like this:
“DegreBar” is the average number of degrees awarded by each college for the 16-year period.
“T” is the time trend running from −7 to 8, corresponding to years from 1991 through 2006.
“MA_Deg” is a three-year moving average of degrees (unknown for the first two years).
College Year Degrees DegreBar Public Faculty Bschol T MA_Deg
1 1991 50 47.375 2 11 1 ‐7 0
1 1992 32 47.375 2 8 1 ‐6 0
1 1993 31 47.375 2 10 1 ‐5 37.667
1 1994 35 47.375 2 9 1 ‐4 32.667
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
1 2003 57 47.375 2 7 1 5 56
1 2004 57 47.375 2 10 1 6 55.667
1 2005 57 47.375 2 10 1 7 57
1 2006 51 47.375 2 10 1 8 55
2 1991 16 8.125 2 3 1 ‐7 0
2 1992 14 8.125 2 3 1 ‐6 0
2 1993 10 8.125 2 3 1 ‐5 13.333
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
2 2004 10 8.125 2 3 1 6 12.667
2 2005 7 8.125 2 3 1 7 11.333
2 2006 6 8.125 2 3 1 8 7.667
3 1991 40 35.5 2 8 1 ‐7 0
3 1992 31 37.125 2 8 1 ‐6 0
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
17 2004 64 39.3125 2 5 0 6 54.667
17 2005 37 39.3125 2 4 0 7 51.333
17 2006 53 39.3125 2 4 0 8 51.333
18 1991 14 8.4375 2 4 0 ‐7 0
18 1992 10 8.4375 2 4 0 ‐6 0
18 1993 10 8.4375 2 4 0 ‐5 11.333
18 1994 7 8.4375 2 3.5 0 ‐4 9
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
18 2005 4 8.4375 2 2.5 0 7 7.333
18 2006 7 8.4375 2 3 0 8 6
If you opened this CSV file in a word processor or text editing program, it would show
that each of the 289 lines (including the headers) corresponds to a row in the EXCEL table, but
variable values would be separated by commas and not appear neatly one on top of the other as
in EXCEL.
As discussed in Module One, Part Four, SAS has a data matrix default restriction. This
data set is sufficiently small, so there is no need to adjust the size of the matrix. We could write a
“READ” command to bring this text data file into SAS similar to Module One, Part Four, but like
EXCEL, it can be imported into SAS directly by using the import wizard.
To import the data into SAS, click on ‘File’ at the top left corner of your screen in SAS, and then
click ‘Import Data’.
This will initialize the Import Wizard pop-up screen. Since the data is comma separated values,
scroll down under the ‘Select data source below.’ tab and click on ‘Comma Separated Values
(*.csv)’ as shown below.
Click ‘Next’, and then provide the location of the bachelors.csv file, wherever it is stored
(in our case, “e:\bachelors.csv”).
To finish importing the data, click ‘Next’, and then name the dataset, known as a member in
SAS, to be stored in the temporary library called ‘WORK’. Recall that a library is simply a
folder to store datasets and output. I named the file ‘BACHELORS’ as seen below. Hitting the
Finish button will bring the data set into SAS.
To verify that the wizard imported the data correctly, review the Log file and physically inspect the
dataset. When SAS is opened, the default panels are the ‘Log’ window at the top right, the
‘Editor’ window in the bottom right and the ‘Explorer/Results’ window on the left. Scrolling
through the Log reveals that the dataset was successfully imported. The details of the data step
procedure are provided along with a few summary statistics of how many observations and
variables were imported.
To view the dataset, click on the “Libraries” folder, which is in the top left of the ‘Explorer’
panel, and then click on the ‘Work’ library. This reveals all of the members in the ‘Work’
library. In this case, the only member is the dataset ‘Bachelors’. To view the dataset, click on the
dataset icon ‘Bachelors’.
In addition to a visual inspection of the data, we use the “means” command to check the
descriptive statistics. Since we don’t list any variables in the command, by default, SAS runs the
‘means’ command on all variables in the dataset. First, however, we need to remove the two
years (1991 and 1992) for which no data are available for the degree moving average measure.
Since we may need the full dataset later, it is good practice to delete the observations from a
copy of the dataset (called bachelors2). This is done in a data step using an ‘if then’ command.
data bachelors2;
set bachelors;
if year = 1991 then delete;
if year = 1992 then delete;
run;
Typing the following commands into the ‘Editor’ window and then clicking the run button (recall
this is the running man at the top) yields the descriptive statistics in the Output window.

proc means data=bachelors2;
run;
CONSTANT COEFFICIENT REGRESSION
The constant coefficient panel data model for the faculty size data-generating process for
bachelor degree-granting undergraduate departments is given by

FACULTY sizeit = β1 + β2Tt + β3BA&Sit + β4MEANBA&Si + β5MOVAVBA&BS + β6PUBLICi + β7Bschl + εit
where the error term εit is independent and identically distributed (iid) across institutions and
over time and E(εit2|xit) = σ2 , for I = 18 colleges and T = 14 years (−5 through 8) for 252
complete records. To take into account clustering, include the cluster option with the cluster
being on the colleges. The SAS OLS regression command that needs to be entered into the
editor, including the standard error adjustment for clustering is
Upon highlighting and hitting the “run” button, the Output panel shows the following results
Contemporaneous degrees have little to do with current faculty size but both overall number of
degrees awarded (the school means) and the moving average of degrees (MA_DEG) have
significant effects. It takes an increase of 26 or 27 bachelor degrees in the moving average to
expect just one more faculty position. Whether it is a public or a private college is highly
significant. Moving from a public to a private college lowers predicted faculty size by nearly
four members for otherwise comparable institutions. There is an insignificant erosion of tenured
and tenure-track faculty size over time. Finally, while economics departments in colleges with a
business school tend to have a larger permanent faculty, ceteris paribus, the effect is small and
insignificant.
FIXED-EFFECTS REGRESSION
The fixed-effects model requires either the insertion of 17 (0,1) covariates to capture the unique
effect of each of the 18 colleges (where each of the 17 dummy coefficients is measured relative
to the constant term) or the insertion of 18 dummy variables with no constant term in the OLS
regression. In addition, no time-invariant variables can be included because they would be
perfectly correlated with the respective college dummies. Thus, the overall mean number of
degrees, the public or private dummy, and business school dummy cannot be included as
regressors.
The SAS code to be run from the editor window, including the commands to create the
dummy variables is:
data bachelors2;
set bachelors2;
array d{18} d1-d18;                        /* one (0,1) dummy per college; assumes college is coded 1-18 */
do j = 1 to 18; d{j} = (college = j); end;
drop j;
run;
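One way the regression itself might then be specified, as a sketch that assumes the dummies d1-d18 created above, is an OLS regression on the 18 college dummies with no overall constant and with the time-invariant regressors excluded:
proc reg data=bachelors2;
model faculty = degrees ma_deg t d1-d18 / noint;   /* noint drops the constant term */
run;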
The resulting regression information appearing in the output window is
Once again, contemporaneous degrees are not a driving force in faculty size. An F test is
not needed to assess whether at least one of the 17 colleges differs from college 18. With the exception
of college 17, each of the other colleges is significantly different. The moving average of
degrees is again significant.
RANDOM-EFFECTS REGRESSION
Finally, consider the random-effects model in which we employ Mundlak’s (1978) approach to
estimating panel data. The Mundlak model posits that the fixed effects in the equation, β1i , can
be projected upon the group means of the time-varying variables, so that
β1i = β1 + δ′x̄i + wi
where x̄i is the set of group (school) means of the time-varying variables and wi is a (now)
random effect that is uncorrelated with the variables and disturbances in the model. Logically,
adding the means to the equations picks up the correlation between the school effects and the
other variables. We could not incorporate the mean number of degrees awarded in the fixed-
effects model (because it was time invariant) but this variable plays a critical role in the Mundlak
approach to panel data modeling and estimation.
The error term εit is iid over time with E(ε²it | xit) = σ² for i = 1, 2, …, 18 and Ti = 14, and E[u²i] = θ² for each of the 18 colleges.
In SAS 9.1, there is no straightforward procedure for estimating this model. The appendix provides
a lengthy procedure that estimates the random-effects model by OLS regression on a
transformed model; it is quite complex and is not recommended for beginners. See Cameron
and Trivedi (2005) for further details. SAS 9.2 has a new procedure, PROC PANEL, for estimating
panel data models. For our model, we need to attach the RANONE option to specify that a
one-way random-effects model be estimated. We also need to correct for the clustering of the
data. Unlike the simple commands in LIMDEP and STATA, however, SAS does not have an option for one-way
random effects with clustered errors.
This new SAS 9.2 procedure has more options for specific error-term structures in panel data.
Although PROC PANEL does not allow a CLUSTER option, there is a VCOMP option that specifies the
type of variance-component estimate to use. For balanced data, the default is VCOMP=FB.
However, the FB method does not always produce nonnegative estimates of the cross-section (or
group) variance; in the case of a negative estimate, a warning is printed and the estimate is set to
zero. Because we have to address clustering, the VCOMP=WK option is specified, which is close to a
groupwise heteroscedastic regression.
The SAS code is run from the Editor panel (with the 1991 and 1992 data suppressed).
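A sketch of the PROC PANEL step, assuming the data are sorted by college and year and the variable names used in the appendix, is:
proc sort data=bachelors2; by college year; run;
proc panel data=bachelors2;
id college year;                                    /* cross-section and time identifiers */
model faculty = degrees ma_deg degrebar public bschool t / ranone vcomp=wk;
run;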
The marginal effect of an additional economics major is again insignificant but slightly negative
within the sample. Both the short-term moving average number and long-term average number
of bachelor degrees are significant. A long-term increase of about 10 students earning degrees in
economics is required to predict that one more tenured or tenure-track faculty member is in a
department. Ceteris paribus, economics departments at private institutions are smaller than
comparable departments at public schools by a large and statistically significant margin of about four members.
Whether there is a business school present is insignificant. There is no meaningful trend in
faculty size.
It should be clear that this regression is NOT identical to the similar one-way random-effects models
controlling for clustering in LIMDEP or STATA; here the standard errors are adjusted for a general
groupwise heteroscedastic error structure. The differences do not alter the significance of the estimates, and the
standard errors are, for the most part, very comparable.
CONCLUDING REMARKS
The goal of this hands-on component of this third of four modules is to enable economic
education researchers to make use of panel data for the estimation of constant coefficient, fixed-
effects and random-effects panel data models in SAS. It was not intended to explain all of the
statistical and econometric nuances associated with panel data analysis. For this an intermediate
level econometrics textbook (such as Jeffrey Wooldridge, Introductory Econometrics) or
advanced econometrics textbook (such as William Greene, Econometric Analysis) should be
consulted.
APPENDIX: Alternative Means to Estimate Random-Effects Model with Clustered Data.
The following code provides one way to estimate the random-effects model with
clustering. The estimation procedure is two-step feasible GLS: in the first step, the variance
matrix is estimated; in the second step, this estimated variance matrix is used to transform the equation.
Because the estimated rather than the true variance matrix is used, the standard errors
are slightly different from the standard errors provided by LIMDEP or STATA when estimating
a random-effects model with clustering.
/* create lamda */
proc iml;
use covvc;
read all var {_VARERR_ _VARCS_} into x;
use num;
read var {_freq_} into y;
print y;
sesq = x[1,1];
susq = x[1,2];
lamda = 1 - sqrt( sesq / (y[1,1]*susq + sesq) );
print x y lamda;
cname = {"lamda"};
create out from lamda [ colname=cname];
append from lamda;
quit;
DATA bachelors4;
if _N_ = 1 then set out;
SET bachelors3;
l = one*lamda;
run;
/* transform data */
data clean (keep = college con nfaculty nt ndegrees ndegrebar npublic
nbschool nMA_deg year);
set bachelors4;
nfaculty = faculty - lamda*avg_faculty;
nt = t - lamda*avg_t;
ndegrees = degrees - lamda*avg_degrees ;
ndegrebar = degrebar - lamda*avg_degrebar;
npublic = public - lamda*avg_public;
nbschool = bschool - lamda*avg_bschool;
nMA_deg = ma_deg - lamda*avg_ma_deg;
con = 1 - lamda*1;
run;
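The second step concludes with an OLS regression on the transformed data, with the transformed constant con entering as a regressor and the default intercept suppressed; a minimal sketch is:
proc reg data=clean;
model nfaculty = con nt ndegrees ndegrebar npublic nbschool nMA_deg / noint;
run;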
The standard errors associated with this regression are much closer to the standard errors from
LIMDEP and STATA. However, this is a complex sequence of code that beginners should not
attempt.
REFERENCES
Becker, William, William Greene and John Siegfried (2009). “Does Teaching Load Affect
Faculty Size?” Working Paper (July).
Cameron, A. Colin and Pravin Trivedi (2005). Microeconometrics: Methods and Applications. New York:
Cambridge University Press.
Mundlak, Yair (1978). "On the Pooling of Time Series and Cross Section Data," Econometrica.
Vol. 46. No. 1 (January): 69-85.
Greene, William (2008). Econometric Analysis. 6th Edition, New Jersey: Prentice Hall.
Wooldridge, Jeffrey (2009). Introductory Econometrics. 4th Edition, Mason OH: South-
Western.
MODULE FOUR, PART ONE: SAMPLE SELECTION
Modules One and Two addressed an economic education empirical study involving the
assessment of student learning that occurs between the start of a program (as measured, for
example, by a pretest) and the end of the program (posttest). At least implicitly, there is an
assumption that all the students who start the program finish the program. There is also an
assumption that those who start the program are representative of, or at least are a random
sample of, those for whom an inference is to be made about the outcome of the program. This
module addresses how these assumptions might be wrong and how problems of sample selection
might occur. The consequences of and remedies for sample selection are presented here in Part
One. As in the earlier three modules, contemporary estimation procedures to adjust for sample
selection are demonstrated in Parts Two, Three and Four using LIMDEP (NLOGIT), STATA
and SAS.
Before addressing the technical issues associated with sample selection problems in an
assessment of one or another instructional method, one type of student or teacher versus another,
or similar educational comparisons, it might be helpful to consider an analogy involving a
contest of skill between two types of contestants: Type A and Type B. There are 8 of each type
who compete against each other in the first round of matches. The 8 winners of the first set of
matches compete against each other in a second round, and the 4 winners of that round compete
in a third. Type A and Type B may compete against their own type in any match after the first
round, but one Type A and one Type B manage to make it to the final round. In the final match
they tie. Should we conclude, on probabilistic grounds, that Type A and Type B contestants are
equally skilled?
How is your answer to the above question affected if we tell you that in the first round 5
Type As and only 3 Type Bs won their matches and only one Type B was successful in the
second and third rounds? The additional information should make clear that we have to consider
how the individual matches are connected and not just look at the last match. But before you
conclude that Type As had a superior attribute only in the early contests and not in the finals,
consider another analogy provided by Thomas Kane (Becker 2004).
These analogies demonstrate all three sources of bias associated with attempts to assess
performance from the start of a program to its finish: sample selection bias, endogeneity, and
omitted variables. The length of the dogs’ legs not appearing to be a problem in the final race
reflects the sample selection issue that arises if the researcher looks only at that last race. In
education research this corresponds to only looking at the performance of those who take the
final exam, fill out the end-of-term student evaluations, and similar terminal program
measurements. Looking only at the last race (corresponding to those who take the final exam)
would be legitimate if the races were independent (previous exam performance had no effect on
final exam taking, students could not self select into the treatment group versus control group),
but the races (like test scores) are sequentially dependent; thus, there is an endogeneity problem
(as introduced in Module Two). As Kane points out, concluding that leg length was important in
the first two races and not in the third reveals the omitted-variable problem: a trait such as heart
strength or competitive motivation might be overriding short legs and thus should be included as
a relevant explanatory variable in the analyses. These problems of sample selection in
educational assessment are the focus of this module.
The statistical inference problems associated with sample selection in the typical change-score
model used in economic education research can be demonstrated using a modified version of the
presentation in Becker and Powers (2001), where the data generating process for the change
score (difference between post and pre TUCE scores) for the ith student ( Δyi ) is modeled as
Δyi = Xiβ + εi = β1 + ∑j=2,…,k βj xji + εi    (1)
The data set of explanatory variables is matrix X, where Xi is the row of xji values for the
relevant variables believed to explain the ith student’s pretest and posttest scores, the β j ’s are the
associated slope coefficients in the vector β , and ε i is the individual random shock (caused, for
example, by unobservable attributes, events or environmental factors) that affects the ith student’s
test scores. In empirical work, the exact nature of Δyi is critical. For instance, to model the
truncation issues that might be relevant for extremely able students’ being better than the
maximum TUCE score, a Tobit model can be specified for Δyi . i Also critical is the assumed
starting point on which all subsequent estimation is conditioned. ii
For an initial population of N students, let T* be the vector of all students’ propensities
to take a posttest. Let H be the matrix of explanatory variables that are believed to drive these
propensities, which includes directly observable things (e.g., time of class, instructor’s native
language). Let α be the vector of slope coefficients corresponding to these observable variables.
The individual unobservable random shocks that affect each student’s propensity to take the
posttest are contained in the error term vector ω . The data generating process for the i th
student’s propensity to take the posttest can now be written:
Ti* = Hiα + ωi    (2)
where Ti = 1 if Ti* > 0 and the ith student takes the posttest, and Ti = 0 otherwise.    (3)
For estimation purposes, the error term ωi is assumed to be a standard normal random
variable that is independently and identically distributed with the other students’ error terms in
the ω vector. As shown in Module Four (Parts Two, Three and Four) this probit model for the
propensity to take the posttest can be estimated using the maximum-likelihood routines in
programs such as LIMDEP, STATA or SAS.
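As a sketch in SAS, with hypothetical names T for the posttest indicator and h1 and h2 for the columns of H, the probit could be fit with:
proc logistic data=students;              /* 'students' is a hypothetical data set name */
model T(event='1') = h1 h2 / link=probit;
run;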
The effect of attrition between the pretest and posttest, as reflected in the absence of a
posttest score for the ith student (Ti = 0) and an adjustment for the resulting bias caused by
excluding those students from the Δyi regression can be illustrated with a two-equation model
formed by the selection equation (2) and the i th student’s change score equation (1). iii Each of
the disturbances in vector ε , equation (1), is assumed to be distributed bivariate normal with the
corresponding disturbance term in the ω vector of the selection equation (2). Thus, for the i th
student we have:
E(ε) = E(ω) = 0, E(εε′) = σε²I, E(ωω′) = I, and E(εω′) = ρσεI.    (4)
That is, the disturbances have zero means, unit variance, and no covariance among students, but
there is covariance between selection in getting a posttest score and the measurement of the
change score.
The difference in the functional forms of the posttest selection equation (2) and the
change score equation (1) ensures the identification of equation (1) but ideally other restrictions
would lend support to identification. Estimates of the parameters in equation (1) are desired, but
the i th student’s change score Δyi is observed in the TUCE data for only the subset of students
for whom Ti = 1. The regression for this censored sample of nT=1 students is
E(Δyi | Xi , Ti = 1) = Xiβ + E(εi | Ti* > 0) .
Similar to omitting a relevant variable from a regression (as discussed in Module Two), selection
bias is a problem because the magnitude of E (ε i | Ti * > 0) varies across individuals and yet is not
included in the estimation of equation (1). To the extent that ε i and ωi (and thus Ti* ) are
related, the estimators are biased.
The change score regression (1) can be adjusted for those who elected not to take a
posttest in several ways. An early Heckman-type solution to the sample selection problem is to
rewrite the omitted variable component of the regression so that the equation to be estimated is
E(Δyi | Xi , Ti = 1) = Xiβ + (ρσε)λi ,
where λi = f(−Ti*)/[1 − F(−Ti*)], and f(.) and F(.) are the normal density and distribution
functions. The inverse Mill’s ratio (or hazard) λi is the standardized mean of the disturbance
term ωi , for the i th student who took the posttest; it is close to zero only for those well above the
T = 1 threshold. The values of λ are generated from the estimated probit selection equation (2)
for all students. Each student in the change score regression ( Δyi ) gets a calculated value λi ,
with the vector of these values serving as a shift variable in the change-score regression.
The Heckman-type selection model represented by equations (1) and (2) highlights the
nature of the sample selection problem inherent in estimating a change-score model by itself.
Selection results in population error term and regressor correlation that biases and makes the
coefficient estimators in the change score model inconsistent. The early Heckman (1979) type
two-equation estimation of the parameters in a selection model and change-score model,
however, requires cross-equation exclusion restrictions (variables that affect selection but not the
change score), differences in functional forms, and/or distributional assumptions for the error
terms. Parameter estimates are typically sensitive to these model specifications.
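The two-equation selection model can also be estimated jointly in SAS with PROC QLIM; the following sketch uses hypothetical names T for the selection indicator, dy for the change score, and h1, h2, x1, x2 for the regressors:
proc qlim data=tuce;                      /* 'tuce' is a hypothetical data set name   */
model T = h1 h2 / discrete;               /* probit selection equation                */
model dy = x1 x2 / select(T = 1);         /* change-score equation, selected sample   */
run;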
An alternative identification strategy is the regression discontinuity design, illustrated by van der Klaauw's (2002) study of the effect of financial aid offers on college enrollment. The college investigated by van der Klaauw created a single continuous index of each student’s
initial financial aid potential (based on a SAT score and high school GPA) and then classified
students into one of four aid level categories based on discrete cut points. The aid assignment
rule depends at least in part on the value of a continuous variable relative to a given threshold in
such a way that the corresponding probability of receiving aid (and the mean amount offered) is
a discontinuous function of this continuous variable at the threshold cut point. A sample of
individual students close to a cut point on either side can be treated as a random sample at the cut
point because on average there really should be little difference between them (in terms of
financial aid offers received from other colleges and other unknown variables). In the absence of
the financial aid level under consideration, we should expect little difference in the college-going
decision of those just above and just below the cut point. Similarly, if they were all given the
financial aid, we should see little difference in outcomes, on average. To the extent that some
actually get it and others do not, we have an interpretable treatment effect. (Intuitively, this can
be thought of as running a regression of enrollment on financial aid for those close to the cut
point, with an adjustment for being in that position.) In his empirical work, van der Klaauw
obtained credible estimates of the importance of the financial aid effect without having to rely on
arbitrary cross-equation exclusion restrictions and functional form assumptions.
Kane (2003) uses an identification strategy similar to van der Klaauw but does so for all
those who applied for the Cal Grant Program to attend any college in California. Eligibility for
the Cal Grant Program is subject to a minimum GPA and maximum family income and asset
level. Like van der Klaauw, Kane exploits discontinuities on one dimension of eligibility for
those who satisfy the other dimensions of eligibility.
Although some education researchers are trying to fit their selection problems into this
regression discontinuity framework, legitimate applications are few because the technique has
very stringent data requirements (an actual but unknown, or conceptually defensible, continuous
index with thresholds for rank-ordered classifications) and limited ability to generalize away
from the classification cut points. Much of economic education research, on the other hand,
deals with the assessment of one type of program or environment versus another, in which the
source of selection bias is entry and exit from the control or experimental groups. An alternative
to Heckman’s parametric (rigid equation form) manner of comparing outcome measures adjusted
for selection based on unobservables is propensity score matching.
Propensity score matching as a body of methods is based on the following logic: We are
interested in evaluating a change score after a treatment. Let O now denote the outcome variable
of interest (e.g., posttest score, change score, persistence, or whatever) and T denote the
treatment dummy variable (e.g., took the enhanced course), such that T = 1 for an individual who
has experienced the “treatment,” and T = 0 for one who has not. If we are interested in the
change-score effect of treatment on the treated, the conceptual experiment would amount to
observing the treated individual (1) after he or she experienced the treatment and the same
individual in the same situation but (2) after he/she did not experience the treatment (but
presumably, others did). The treatment effect would be the difference between the two post-test
scores (because the pretest would be the one achieved by this individual). The problem, of
course, is that ex post, we don’t observe the outcome variable, O, for the treated individual, in
the absence of the treatment. We observe some individuals who were treated and other
individuals who were not. Propensity score matching is a largely nonparametric approach to
evaluating treatment effects with this consideration in mind. iv
If individuals who experienced the treatment were exactly like those who did not in all
other respects, we could proceed by comparing random samples of treated and nontreated
individuals, confident that any observed differences could be attributed to the treatment. The
first section of this module focused on the problem that treated individuals might differ from
untreated individuals systematically, but in ways that are not directly observable by the
econometrician. To consider an example, if the decision to take an economics course (the
treatment) were motivated by characteristics of individuals (curiosity, ambition, etc.) that were
also influential in their performance on the outcome (test), then our analysis might attribute the
change in the score to the treatment rather than to these characteristics. Models of sample
selection considered previously are directed at this possibility. The development in this section
is focused on the possibility that the same kinds of issues might arise, but the underlying features
that differentiate the treated from the untreated can be observed, at least in part.
To take a simple example, suppose it is known with certainty that the underlying,
unobservable characteristics that are affecting the change score are perfectly randomly
distributed across individuals, treated and untreated. Assume, as well, that it is known for certain
that the only systematic, observable difference between treated and untreated individuals is that
women are more likely to undertake the treatment than men. It would make sense, then, that if
we want to compare treated to untreated individuals, we would not want to compare a randomly
selected group of treated individuals to a randomly selected group of untreated individuals – the
former would surely contain more women than the latter. Rather, we would try to balance the
samples so that we compared a group of women to another group of women and a group of men
to another group of men, thereby controlling for the impact of gender on the likelihood of
receiving the treatment. We might then want to develop an overall average by averaging, once
again, this time the two differences, one for men, the other for women.
In the main, and as already made clear in our consideration of the Heckman adjustment,
if assignment to the treatment is nonrandom, then estimation of treatment effects will be biased
by the effect of the variables that affect the treatment assignment. The strategy is, essentially, to
locate an untreated individual who looks like the treated one in every respect except the
treatment, then compare the outcomes. We then average this across individual pairs to estimate
the “average treatment effect on the treated.” The practical difficulty is that individuals differ in
many characteristics, and it is not feasible, in a realistic application, to compare each treated
observation to an untreated one that “looks like it.” There are too many dimensions on which
individuals can differ. The technique of propensity score matching is intended to deal with this
complication. Keep in mind, however, that if unmeasured or unobserved attributes are important,
and they are not randomly distributed across treatment and control groups, matching techniques
may not work; that is the problem the methods in the previous sections were designed to address.
We now provide some technical details on propensity score matching. Let x denote a vector of
observable characteristics of the individual, before the treatment. Let the probability of treatment
be denoted P(T=1|x) = P(x). Because T is binary, P(x) = E[T|x], as in a linear probability model.
If treatment is random given x, then treatment is random given P(x), which in this context is
called the propensity score. It will generally not be possible to match individuals on all of the
characteristics individually; with continuously measured characteristics, such as income, there
are too many cells. The matching is therefore done via the propensity score. Individuals with similar
propensity scores are expected (on average) to be individuals with similar characteristics.
Overall, for a ‘treated’ individual with propensity P(xi) and outcome Oi, the strategy is to
locate a control observation with similar propensity P(xc) and with outcome Oc. The effect of
treatment on the treated for this individual is estimated by Oi – Oc. This is averaged across
individuals to estimate the average treatment effect on the treated. The underlying theory asserts
that the estimates of treatment effects across treated and controls are unbiased if the treatment
assignment is random among individuals with the same propensity score; the propensity score,
itself, captures the drivers of the treatment assignment. (Relevant papers that establish this
methodology are too numerous to list here. Useful references are four canonical papers,
Heckman et al. [1997, 1998a, 1998b, 1999] and a study by Becker and Ichino [2002].)
The steps in the propensity score matching analysis consist of the following:
Step 1. Estimate the propensity score function, P(x), for each individual by fitting
a probit or logit model, and using the fitted probabilities.
Step 2. Establish that the average propensity scores of treated and control
observations are the same within particular ranges of the propensity scores. (This
is a test of the “balancing hypothesis.”)
Step 3. Establish that the averages of the characteristics for treatment and controls
are the same for observations in specific ranges of the propensity score. This is a
check on whether the propensity score approach appears to be succeeding at
matching individuals with similar characteristics by matching them on their
propensity scores.
Step 4. For each treated observation in the sample, locate a similar control
observation(s) based on the propensity scores. Compute the treatment effect,
Oi – Oc. Average this across observations to get the average treatment effect.
Step 5. In order to estimate a standard error for this estimate, Step 4 is repeated
with a set of bootstrapped samples.
In Step 1 the propensity score function is estimated with a binary choice model; for the logit model,
Prob(T = 1 | x) = exp(β′x) / [1 + exp(β′x)] .
The probit model, F(β′x) = Φ(β′x), where Φ(t) is the normal distribution function, is an
alternative. The propensity score is the fitted probability from the probit or logit model,
Propensity score for individual i = F(β̂′xi) = Pi .
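A sketch of Step 1 in SAS, using hypothetical variable names suggested by the example developed below (age, its square, education, marital status, race, income, and unemployment status), saves the fitted probabilities as the propensity scores:
proc logistic data=psm;                   /* 'psm' is a hypothetical data set name    */
model T(event='1') = age agesq educ married race income unemp;
output out=scored p=pscore;               /* fitted probability = propensity score    */
run;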
The central feature of this step is to find similar individuals by finding individuals who have
similar propensity scores. Before proceeding, we note, the original objective is to find groups of
individuals who have the same x. This is easy to do in our simple example, where the only
variable in x is gender, so we can simply distinguish people by their gender. When the x vector
has many variables, it is impossible to partition the data set into groups of individuals with the
same, or even similar explanatory variables. In the example we will develop below, x includes
age (and age squared), education, marital status, race, income and unemployment status. The
working principle in this procedure is that individuals who have similar propensity scores will, if
we average enough of them, have largely similar characteristics. (The reverse must be true, of
course.) Thus, although we cannot group people by their characteristics, xs, we can (we hope)
achieve the same end by grouping people by their propensity scores. That leads to step 2 of the
matching procedure.
Grouping those with similar propensity scores should result in similar predicted probabilities for
treatment and control groups. For instance, suppose we take a range of propensity scores
(probabilities of participating in the treatment), say from 0.4 to 0.6. Then, the part of the sample
that contains propensity scores in this range should contain a mix of treated individuals
(individuals with T = 1) and controls (individuals with T = 0). If the theory we are relying on is
correct, then the average propensity score for treated and controls should be the same, at least
approximately. That is,
Average[ F(β̂′x) | T = 1 and F̂ in the range ] ≈ Average[ F(β̂′x) | T = 0 and F̂ in the range ].
We will look for a partitioning of the range of propensity scores for which this is the case in each
range.
A first step is to decide if it is necessary to restrict the sample to the range of values of
propensity scores that is shared by the treated and control observations. That range is called the
common support. Thus, if the propensity scores of the treated individuals range from 0.1 to 0.7
and the scores of the control observations range from 0.2 to 0.9, then the common support is
from 0.2 to 0.7. Observations that have scores outside this range would not be used in the
analysis.
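A sketch of this restriction in SAS, assuming the scored data set created above, first finds each group's range of scores and then keeps only the overlap (the bounds 0.2 and 0.7 repeat the example in the text):
proc means data=scored min max;
class T;                                  /* T = 1 for treated, T = 0 for controls    */
var pscore;
run;
data common;
set scored;
if 0.2 <= pscore <= 0.7;                  /* keep only observations on the common support */
run;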
Once the sample to be used is determined, we will partition the range of propensity scores
into K cells. For each partitioning of the range of propensity scores considered, we will use a
standard F test for equality of means of the propensity scores of the treatment and control
observations:
Fk[1, d] = (P̄C,k − P̄T,k)² / (S²C,k /NC,k + S²T,k /NT,k) ,   k = 1, …, K.
The denominator degrees of freedom for F are approximated using a technique invented by
Satterthwaite (1946):
d = 1 / [ w²/(NC,k − 1) + (1 − w)²/(NT,k − 1) ] ,  where  w = (S²C,k /NC,k) / (S²C,k /NC,k + S²T,k /NT,k) .
If any of the cells (ranges of scores) fails this test, the next step is to increase the number of cells.
There are various strategies by which this can be done. The natural approach would be to leave
cells that pass the test as they are, and partition more finely the ones that do not. This may take
several attempts. In our example, we started by separating the range into 5 parts. With 5
segments, however, the data did not appear to satisfy the balancing requirement. We then tried 6
and, finally, 7 segments of the range of propensity scores. With the range divided into 7
segments, the balancing requirement appears to be met.
Analysis can proceed even if the partitioning of the range of scores does not pass this test.
However, the test at this step will help to give an indication of whether the model used to
calculate the propensity scores is sufficiently specified. A persistent failure of the balancing test
might signal problems with the model that is being used to create the propensity scores. The
result of this step is a partitioning of the range of propensity scores into K cells defined by K + 1
boundary values.
Step 3 returns to the original motivation of the methodology. At step 3, we examine the
characteristics (x vectors) of the individuals in the treatment and control groups within the
subsamples defined by the groupings made by Step 2. If our theory of propensity scores is
working, it should be the case that within a group, for example, for the individuals whose
propensity scores are in the range 0.4 to 0.6, the x vectors should be similar in that at least the
means should be very close. This aspect of the data is examined statistically. Analysis can
proceed if this property is not met but the result(s) of these tests might signal to the analyst that
their results are a bit fragile. In our example below, there are seven cells in the grid of
propensity scores and 12 variables in the model. We find that for four of the 12 variables in one
of the 7 cells (i.e., in four cases out of 84), the means of the treated and control observations
appear to be significantly different. Overall, this difference does not appear to be too severe, so
we proceed in spite of it.
MATCHING
Assuming that the data have passed the scrutiny in step 3, we now match the observations. For
each treated observation (individual’s outcome measure such as a test score) in the sample, we
find a control observation that is similar to it. The intricate complication at this step is to define
“similar.” It will generally not be possible to find a treated observation and a control observation
with exactly the same propensity score. So, at this stage it is necessary to decide what rule to use
for “close.” The obvious choice would be the nearest neighbor in the set of observations that is
in the propensity score group. The nearest neighbor for observation Oi would be the Oc* for
which |Pi – Pc| is minimized. We note, by this strategy, a particular control observation might be
the nearest neighbor for more than one treatment observation and some control observations
might not be the nearest neighbor to any treated observation.
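A sketch of single nearest-neighbor matching in SAS, assuming data sets treated and controls that each contain an identifier id, a propensity score pscore, and the outcome O, is:
proc sql;
create table nn_matched as
select t.id, t.O as O_treated, c.O as O_control,
abs(t.pscore - c.pscore) as dist
from treated as t, controls as c
group by t.id
having dist = min(dist);                  /* keep, for each treated case, the closest control(s) */
quit;
By this construction a given control can serve as the match for more than one treated observation, as noted above.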
Another strategy is to use the average of several nearby observations. The counterpart
observation is constructed by averaging all control observations whose propensity scores fall in a
given range in the neighborhood of Pi. Thus, we first locate the set [Ct*] = the set of control
observations for which |Pi – Pc| < r, for a chosen value of r called the caliper. We then average
Oc for these observations. By this construction, the neighbor may be an average of several
control observations. It may also not exist, if no observations are close enough. In this case, r
must be increased. As in the single nearest neighbor computation, control observations may be
used more than once, or they might not be used at all (e.g., if the caliper is r = .01, and a control
observation has propensity .5 and the nearest treated observations have propensities of .45 and
.55, then this control will never be used).
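A corresponding sketch for the caliper version, averaging all controls within r = .01 of each treated observation's score, is:
proc sql;
create table radius_matched as
select t.id, t.O as O_treated, mean(c.O) as O_control
from treated as t, controls as c
where abs(t.pscore - c.pscore) < 0.01     /* the caliper r = .01 from the text        */
group by t.id, t.O;
quit;
Treated observations with no control inside the caliper simply drop out of the matched table, which signals that r must be increased.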
A third strategy for finding the counterpart observations is to use kernel methods to
average all of the observations in the range of scores that contains the Oi that we are trying to
match. The averaging function is computed as follows:
wc = (1/h) K[(Pi − Pc)/h] / ∑control observations in the cell (1/h) K[(Pi − Pc)/h]
The function K[.] is a weighting function that takes its largest value when Pi equals Pc and tapers
off to zero as Pc is farther from Pi. Typical choices for the kernel function are the normal or
logistic density functions. A common choice that cuts off the computation at a specific point is
the Epanechnikov (1969) weighting function,
K[t] = 0.75(1 – 0.2t²)/√5 for |t| < √5, and 0 otherwise.
The parameter h is the bandwidth that controls the weights given to points that lie relatively far
from Pi. A larger bandwidth gives more distant points relatively greater weight. Choice of the
bandwidth is a bit of an (arcane) art. The value 0.06 is a reasonable choice for the types of data
we are using in our analysis here.
Once treatment observations, Oi and control observations, Oc are matched, the treatment
effect for this pair is computed as Oi – Oc. The average treatment effect (ATE) is then estimated
by the mean,
ATE^ = (1/Nmatch) ∑i=1,…,Nmatch (Oi − Oc) .
STATISTICAL INFERENCE
In order to form a confidence interval around the estimated average treatment effect, it is
necessary to obtain an estimated standard error. This is done by reconstructing the entire sample
used in Steps 2 through 4 R times, using bootstrapping. By this method, we sample N
observations from the sample of N observations with replacement. Then ATE is computed R
times and the estimated standard error is the empirical standard deviation of the R observations.
This can be used to form a confidence interval for the ATE.
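A sketch of the resampling step in SAS uses PROC SURVEYSELECT to draw the bootstrap samples with replacement; Steps 2 through 4 are then repeated within each replicate (the data set name psm and the seed are illustrative):
proc surveyselect data=psm out=bootsamp seed=20100501
method=urs samprate=1 outhits reps=25;    /* 25 bootstrap samples drawn with replacement */
run;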
The end result of the computations will be a confidence interval for the expected
treatment effect on the treated individuals in the sample. For example, in the application that we
will present in Part 2 of this module, in which the outcome variable is the log of earnings and the
treatment is the National Supported Work Demonstration – see LaLonde (1986) – the following
is the set of final results:
+----------------------------------------------------------------------+
| Number of Treated observations = 185 Number of controls = 1157 |
| Estimated Average Treatment Effect = .156255 |
| Estimated Asymptotic Standard Error = .104204 |
| t statistic (ATT/Est.S.E.) = 1.499510 |
| Confidence Interval for ATT = ( -.047985 to .360496) 95% |
| Average Bootstrap estimate of ATT = .144897 |
| ATT - Average bootstrap estimate = .011358 |
+----------------------------------------------------------------------+
The overall estimate from the analysis is ATE = 0.156255, which suggests that the effect on
earnings that can be attributed to participation in the program is 15.6%. Based on the (25)
bootstrap replications, we obtained an estimated standard error of 0.104204. By forming a
confidence interval using this standard error, we obtain our interval estimate of the impact of the
program of (-4.80% to +36.05%). We would attribute the negative range to an unconstrained
estimate of the sampling variability of the estimator, not actually to a negative impact of the
program. v
CONCLUDING COMMENTS
The genius of James Heckman was in recognizing that sample selection problems are not
necessarily removed by bigger samples because unobservables will continue to bias estimators.
His parametric solution to the sample selection problem has not been supplanted by newer
semiparametric techniques. It is true that results obtained from the two-equation system advanced by
Heckman over 30 years ago are sensitive to the correctness of the equations and their
identification. Newer methods such as regression discontinuity, however, are extremely limited
in their applications. As we will see in Module Four, Parts Two, Three and Four, methods such
as propensity score matching depend on the validity of the logit or probit functions estimated
along with the methods used to obtain smoothness in the kernel density estimator. One of the
beauties of Heckman’s original selection adjustment method is that its results can be easily
replicated in LIMDEP, STATA and SAS. Such is not the case with the more recent
nonparametric and semi-parametric methods for addressing sample selection problems.
REFERENCES
Becker, William E. “Omitted Variables and Sample Selection Problems in Studies of College-
Going Decisions,” Public Policy and College Access: Investigating the Federal and State Role in
Equalizing Postsecondary Opportunity, Edward St. John (ed), 19. NY: AMS Press. 2004: 65-86.
_____ “Quit Lying and Address the Controversies: There Are No Dogmata, Laws, Rules or
Standards in the Science of Economics,” American Economist, 50, Spring 2008: 3-14.
_____ and William Walstad. “Data Loss from Pretest to Posttest as a Sample Selection
Problem,” Review of Economics and Statistics, 72, February 1990: 184-188.
_____ and John Powers. “Student Performance, Attrition, and Class Size Given Missing Student
Data,” Economics of Education Review, 20, August 2001: 377-388.
Dehejia, R. and S. Wahba “Causal Effects in Nonexperimental Studies: Reevaluation of the
Evaluation of Training Programs,” Journal of the American Statistical Association, 94, 1999, pp.
1052-1062.
Greene, William. H. “A statistical model for credit scoring.” Department of Economics, Stern
School of Business, New York University, (September 29, 1992).
Heckman, James. “Sample Selection Bias as a Specification Error,” Econometrica, 47, 1979: 153-162.
Heckman, J., H. Ichimura, J. Smith and P. Todd. “Characterizing Selection Bias Using
Experimental Data,” Econometrica, 66, 5, 1998a: 1017-1098.
Heckman, J., R. LaLonde, and J. Smith. ‘The Economics and Econometrics of Active Labour
Market Programmes,’ in Ashenfelter, O. and D. Card (eds.) The Handbook of Labor Economics,
Vol. 3, North Holland, Amsterdam, 1999.
Krueger, Alan B. and Molly F. McIntosh. “Using a Web-Based Questionnaire as an Aide for
High School Economics Instruction,” Journal of Economic Education, 39, Spring, 2008: 174-
197.
Huynh, Kim, David Jacho-Chavez, and James K. Self. “The Efficacy of Collaborative Learning
Recitation Sessions on Student Outcomes?” American Economic Review, (Forthcoming May
2010).
LaLonde, R., “Evaluating the Econometric Evaluations of Training Programs with Experimental
Data,” American Economic Review, 76, 4, 1986: 604-620.
van der Klaauw, W. “Estimating the Effect of Financial Aid Offers on College Enrollment: A
Regression-Discontinuity Approach.” International Economic Review, November, 2002: 1249-
1288.
ENDNOTES
i. The opportunistic samples employed in the older versions of the TUCE as well as the new TUCE
4 have few observations from highly selective schools. The TUCE 4 is especially noteworthy because it
has only one such prestige school: Stanford University, where the class was taught by a non-tenure track
teacher. Thus, the TUCE 4 might reflect what those in the sample are taught and are able to do, but it
does not reflect what those in the know are teaching or what highly able students are able to do. For
example Alan Krueger (Princeton University) is listed as a member of the TUCE 4 “national panel of
distinguished economists;” yet, in a 2008 Journal of Economic Education article he writes: “a long
standing complaint of mine, as well as others, for example Becker 2007 and Becker 2004, is that
introductory economics courses have not kept up with the economics profession’s expanding emphasis on
data and empirical analysis.” Whether bright and motivated students at the leading institutions of higher
education can be expected to get all or close to all 33 multiple-choice questions correct on either the
micro or macro parts of the TUCE (because they figure out what the test designers want for an answer) or
score poorly (because they know more than what the multiple-choice questions assume) is open to
question and empirical testing. What is not debatable is that the TUCE 4 is based on a censored sample
that excludes those at and exposed to thinking at the forefront of the science of economics.
ii. Because Becker and Powers (2001) do not have any data before the start of the course, they
condition on those who are already in the course and only adjust their change-score model estimation for
attrition between the pretest and posttest. More recently, Huynh, Jacho-Chavez, and Self (2010) account
for selection into, out of and between collaborative learning sections of a large principles course in their
change-score modeling.
iii. Although Δyi is treated as a continuous variable, this is not essential. For example, a bivariate
choice (probit or logit) model can be specified to explicitly model the taking of a posttest decision as a
“yes” or “no” for students who enrolled in the course. The selection issue is then modeled in a way
similar to that employed by Greene (1992) on consumer loan default and credit card expenditures. As
with the standard Heckman selection model, this two-equation system involving bivariate choice and
selection can be estimated in a program like LIMDEP.
iv. The procedure is not “parametric” in that it is not fully based on a parametric model. It is not
“nonparametric” in that it does employ a particular binary choice model to describe participation, or
receiving the treatment. But the binary choice model functions as an aggregator of a vector of variables
into a single score, not necessarily as a behavioral relationship. Perhaps “partially parametric” would be
appropriate here, but we have not seen this term used elsewhere.
v. The example mentioned at several points in this discussion will be presented in much greater
detail in Part 2. The data will be analyzed with LIMDEP, Stata and SAS. We note at this point, there are
some issues with duplication of the results with the three programs and with the studies done by the
original authors. Some of these are numerical and specifically explainable. However, we do not
anticipate that results in Step 5 can be replicated across platforms. The reason is that Step 5 requires
generation of random numbers to draw the bootstrap samples. The pseudorandom number generators
used by different programs vary substantially, and these differences show up, for example, in bootstrap
replications. If the samples involved are large enough, this sort of random variation (chatter) gets
averaged out in the results. The sample in our real world application is not large enough to expect that
this chatter will be completely averaged out. As such, as will be evident later, there will be some small
variation across programs in the results that one obtains with our or any other small or moderately sized
data set.
MODULE FOUR, PART TWO: SAMPLE SELECTION
Part Two of Module Four provides a cookbook-type demonstration of the steps required to use
LIMDEP (NLOGIT) in situations involving estimation problems associated with sample
selection. Users of this module need to have completed Module One, Parts One and Two, but not
necessarily Modules Two and Three. From Module One users are assumed to know how to get
data into LIMDEP, recode and create variables within LIMDEP, and run and interpret regression
results. Module Four, Parts Three and Four demonstrate in STATA and SAS what is done here
in LIMDEP.
The change score or difference in difference model is used extensively in education research.
Yet, before Becker and Walstad (1990), little if any attention was given to the consequence of
missing student records that result from: 1) "data cleaning" done by those collecting the data, 2)
student unwillingness to provide data, or 3) students self-selecting into or out of the study. The
implications of these types of sample selection are shown in the work of Becker and Powers
(2001) where the relationship between class size and student learning was explored using the
third edition of the Test of Understanding in College Economics (TUCE), which was produced
by Saunders (1994) for the National Council on Economic Education (NCEE), since renamed the
Council for Economic Education.
Module One, Part Two showed how to get the Becker and Powers data set
“beck8WO.csv” into LIMDEP (NLOGIT). As a brief review this was done with the read
command:
READ; NREC=2837; NVAR=64; FILE=k:\beck8WO.csv; Names=
A1,A2,X3, C,AL,AM,AN,CA,CB,CC,CH,CI,CJ,CK,CL,CM,CN,CO,CS,CT,
CU,CV,CW,DB,DD,DI,DJ,DK,DL,DM,DN,DQ,DR,DS,DY,DZ,EA,EB,EE,EF,
EI,EJ,EP,EQ,ER,ET,EY,EZ,FF,FN,FX,FY,FZ,GE,GH,GM,GN,GQ,GR,HB,
HC,HD,HE,HF $
where A1, A2, X3, … , HF are the names assigned to the 64 variables read from the file.
In Module One, Part Two the procedure for changing the size of the work space in earlier
versions of LIMDEP and NLOGIT was shown but that is no longer required for the 9th version
of LIMDEP and the 4th version of NLOGIT. Starting with LIMDEP version 9 and NLOGIT
version 4 the required work space is automatically determined by the “Read” command and
increased as needed with subsequent “Create” commands.
Separate dummy variables first need to be created for each type of school (A2).
To create a dummy variable for whether the instructor had a PhD we use
Create; phd=dj=3$
To create a dummy variable for whether the student took the postTUCE we use
Create; final=cc>0$
To create a dummy variable for whether a student did (noeval = 0) or did not (noeval = 1)
complete a student evaluation of the instructor we use
“Noeval” reflects whether the student was around toward the end of the term, attending classes,
and sufficiently motivated to complete an evaluation of the instructor. In the Saunders data set,
evaluation questions with no answer were coded -9; thus, a sum of -36 across these four questions
indicates that no questions were answered.
Create; change=cc-an$
Finally, there was a correction for the term in which student record 2216 was incorrectly
recorded:
All of these recoding and create commands are entered into the LIMDEP command file as follows:
Reject; AN=-9$
Reject; HB=-9$
Reject; ci=-9$
Reject; ck=-9$
Reject; cs=0$
Reject; cs=-9$
Reject; a2=-9$
Reject; phd=-9$
The use of these data entry and management commands will appear in the LIMDEP (NLOGIT)
output file for the equations to be estimated in the next section.
THE PROPENSITY TO TAKE THE POSTTEST AND THE CHANGE SCORE EQUATION
To address attrition-type sample selection problems in change score studies, Becker and Powers
first add observations that were dropped during the early stage of assembling data for TUCE III.
Becker and Powers do not have any data on students before they enrolled in the course and thus
cannot address selection into the course, but to examine the effects of attrition (course
withdrawal) they introduce three measures of class size (beginning, ending, and average) and
argue that initial or beginning class size is the critical measure for assessing learning over the
entire length of the course. i To show the effects of initial class size on attrition (as discussed in
Module Four, Part One) they employ what is now the simplest and most restrictive of sample
correction methods, which can be traced to James Heckman (1979), recipient of the 2000 Nobel
Prize in Economics.
From Module Four, Part One, we have the data generating process for the difference between
post and preTUCE scores for the ith student ( Δy i ):
Δyi = Xiβ + εi = β1 + ∑j=2,…,k βj xji + εi    (1)
where the data set of explanatory variables is matrix X, where Xi is the row of xji values for the
relevant variables believed to explain the ith student’s pretest and posttest scores, the β j ’s are the
associated slope coefficients in the vector β , and ε i is the individual random shock (caused, for
example, by unobservable attributes, events or environmental factors) that affects the ith student’s
test scores. Sample selection associated with students’ unwillingness to take the posttest
(dropping the course) results in population error term and regressor correlation that biases and
makes coefficient estimators in this change score model inconsistent.
The data generating process for the i th student’s propensity to take the posttest is:
Ti* = H i α + ωi (2)
where
H is the matrix of explanatory variables that are believed to drive these propensities.
ω is the vector of unobservable random shocks that affect each student’s propensity.
The effect of attrition between the pretest and posttest, as reflected in the absence of a
posttest score for the ith student (Ti = 0) and a Heckman adjustment for the resulting bias caused
by excluding those students from the change-score regression requires estimation of equation (2)
and the calculation of an inverse Mill’s ratio for each student who has a pretest. This inverse
Mill’s ratio is then added to the change-score regression (1) as another explanatory variable. In
essence, this inverse Mill’s ratio adjusts the error term for the missing students.
For the Heckman adjustment for sample selection each disturbance in vector ε , equation
(1), is assumed to be distributed bivariate normal with the corresponding disturbance term in the
ω vector of the selection equation (2). Thus, for the i th student we have:
E(ε) = E(ω) = 0, E(εε′) = σε²I, E(ωω′) = I, and E(εω′) = ρσεI.    (4)
That is, the disturbances have zero means, unit variance, and no covariance among students, but
there is covariance between selection in getting a posttest score and the measurement of the
change score.
The regression for this censored sample of nT=1 students who took the posttest is now
E(Δyi | Xi , Ti = 1) = Xiβ + (ρσε)λi ,
where λi is the inverse Mill’s ratio (or hazard) such that λi = f(−Ti*)/[1 − F(−Ti*)], and f(.)
and F (.) are the normal density and distribution functions. λi is the standardized mean of the
disturbance term ωi , for the i th student who took the posttest; it is close to zero only for those
well above the T = 1 threshold. The values of λ are generated from the estimated probit
selection equation (2) for all students.
The probit command for the selection equation to be estimated in LIMDEP (NLOGIT) is
probit;lhs=final;rhs=one,an,hb,doc,comp,lib,ci,ck,phd,noeval;hold results$
where the “hold results” extension tells LIMDEP to hold the results for the change equation to
be estimated by least squares with the inverse Mill’s ratio used as a regressor.
The command for estimating the adjusted change equation using both the inverse Mills
ratio as a regressor and maximum likelihood estimation of ρ and σε is written
selection;lhs=change;rhs=one,hb,doc,comp,lib,ci,ck,phd,noeval;mle$
where the extension “mle” tells LIMDEP (NLOGIT) to use maximum likelihood estimation.
As described in Module One, Part Two, entering all of these commands into the
command file in LIMDEP (NLOGIT), highlighting the bunch and pressing the GO button yields
the following output file:
Normal exit: 6 iterations. Status=0. F= 822.7411
+---------------------------------------------+
| Binomial Probit Model |
| Dependent variable FINAL |
| Log likelihood function -822.7411 |
| Restricted log likelihood -1284.216 |
| Chi squared [ 9 d.f.] 922.95007 |
| Significance level .0000000 |
| McFadden Pseudo R-squared .3593438 |
| Estimation based on N = 2587, K = 10 |
| AIC = .6438 Bayes IC = .6664 |
| AICf.s. = .6438 HQIC = .6520 |
| Model estimated: Dec 08, 2009, 12:12:49 |
| Results retained for SELECTION model. |
| Hosmer-Lemeshow chi-squared = 26.06658 |
| P-value= .00102 with deg.fr. = 8 |
+--------+--------------+----------------+--------+--------+----------+
|Variable| Coefficient | Standard Error |b/St.Er.|P[|Z|>z]| Mean of X|
+--------+--------------+----------------+--------+--------+----------+
+--------+Index function for probability |
|Constant| .99535*** .24326 4.092 .0000 |
|AN | .02204** .00948 2.326 .0200 10.5968|
|HB | -.00488** .00192 -2.538 .0112 55.5589|
|DOC | .97571*** .14636 6.666 .0000 .31774|
|COMP | .40649*** .13927 2.919 .0035 .41786|
|LIB | .52144*** .17665 2.952 .0032 .13568|
|CI | .19873** .09169 2.168 .0302 1.23116|
|CK | .08779 .13429 .654 .5133 .91998|
|PHD | -.13351 .10303 -1.296 .1951 .68612|
|NOEVAL | -1.93052*** .07239 -26.668 .0000 .29068|
+--------+------------------------------------------------------------+
| Note: ***, **, * = Significance at 1%, 5%, 10% level. |
+---------------------------------------------------------------------+
+----------------------------------------+
| Fit Measures for Binomial Choice Model |
| Probit model for variable FINAL |
+----------------------------------------+
| Y=0 Y=1 Total|
| Proportions .19714 .80286 1.00000|
| Sample Size 510 2077 2587|
+----------------------------------------+
| Log Likelihood Functions for BC Model |
| P=0.50 P=N1/N P=Model|
| LogL = -1793.17 -1284.22 -822.74|
+----------------------------------------+
| Fit Measures based on Log Likelihood |
| McFadden = 1-(L/L0) = .35934|
| Estrella = 1-(L/L0)^(-2L0/n) = .35729|
| R-squared (ML) = .30006|
| Akaike Information Crit. = .64379|
| Schwartz Information Crit. = .66643|
+----------------------------------------+
| Fit Measures Based on Model Predictions|
| Efron = .39635|
| Ben Akiva and Lerman = .80562|
| Veall and Zimmerman = .52781|
| Cramer = .38789|
+----------------------------------------+
+---------------------------------------------------------+
|Predictions for Binary Choice Model. Predicted value is |
|1 when probability is greater than .500000, 0 otherwise.|
|Note, column or row total percentages may not sum to |
|100% because of rounding. Percentages are of full sample.|
+------+---------------------------------+----------------+
|Actual| Predicted Value | |
|Value | 0 1 | Total Actual |
+------+----------------+----------------+----------------+
| 0 | 342 ( 13.2%)| 168 ( 6.5%)| 510 ( 19.7%)|
| 1 | 197 ( 7.6%)| 1880 ( 72.7%)| 2077 ( 80.3%)|
+------+----------------+----------------+----------------+
|Total | 539 ( 20.8%)| 2048 ( 79.2%)| 2587 (100.0%)|
+------+----------------+----------------+----------------+
+---------------------------------------------------------+
--> selection;lhs=change;rhs=one,hb,doc,comp,lib,ci,ck,phd,noeval;mle$
+----------------------------------------------------------+
| Sample Selection Model |
| Probit selection equation based on FINAL |
| Selection rule is: Observations with FINAL = 1 |
| Results of selection: |
| Data points Sum of weights |
| Data set 2587 2587.0 |
| Selected sample 2077 2077.0 |
+----------------------------------------------------------+
+----------------------------------------------------+
------------------------------------------------------------------
ML Estimates of Selection Model
Dependent variable CHANGE
Log likelihood function -6826.46734
Estimation based on N = 2587, K = 21
Information Criteria: Normalization=1/N
Normalized Unnormalized
AIC 5.29375 13694.93469
Fin.Smpl.AIC 5.29389 13695.29492
Bayes IC 5.34131 13817.95802
Hannan Quinn 5.31099 13739.52039
Model estimated: Mar 31, 2010, 15:17:41
FIRST 10 estimates are probit equation.
--------+---------------------------------------------------------
| Standard Prob.
CHANGE| Coefficient Error z z>|Z|
--------+---------------------------------------------------------
|Selection (probit) equation for FINAL
Constant| .99018*** .24020 4.12 .0000
AN| .02278** .00940 2.42 .0153
HB| -.00489** .00206 -2.37 .0178
DOC| .97154*** .15076 6.44 .0000
COMP| .40431*** .14433 2.80 .0051
LIB| .51505*** .19086 2.70 .0070
CI| .19927** .09054 2.20 .0277
CK| .08590 .11902 .72 .4705
PHD| -.13208 .09787 -1.35 .1772
NOEVAL| -1.92902*** .07138 -27.03 .0000
|Corrected regression, Regime 1
Constant| 6.81754*** .72389 9.42 .0000
HB| -.00978* .00559 -1.75 .0803
DOC| 1.99729*** .55348 3.61 .0003
COMP| -.36198 .43327 -.84 .4034
LIB| 2.23154*** .50534 4.42 .0000
CI| .39401 .25339 1.55 .1199
CK| -2.74337*** .38031 -7.21 .0000
PHD| .64209** .28964 2.22 .0266
NOEVAL| -.63201 1.26902 -.50 .6185
SIGMA(1)| 4.35713*** .07012 62.14 .0000
RHO(1,2)| .03706 .35739 .10 .9174
--------+---------------------------------------------------------
The beginning or initial class size is negatively and highly significantly related to the propensity
to take the posttest, with a one-tail p value of 0.0056.
The corresponding change-score equation, which employs the inverse Mills ratio as a regressor,
shows that the change score is negatively and significantly related to the class size, with a
one-tail p value of 0.0347, but it takes an additional 100 students to lower the change score by a point.
The maximum likelihood output above provides estimation of both the probit equation and the
change score equation with separate estimation of ρ and σε. The top panel provides the probit
coefficients for the propensity equation, where it is shown that initial class size is negatively and
significantly related to the propensity to take the posttest with a one-tail p value of 0.009. The
second panel gives the change score results, where initial class size is negatively and
significantly related to the change score with a one-tail p value of 0.040. Again, it takes
approximately 100 students to move the change score in the opposite direction by a point.
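As a quick check of this magnitude: the HB (initial class size) coefficient in the change-score
panel above is about −0.0098, so an additional 100 students implies a predicted change of roughly
100 × (−0.0098) ≈ −1 point on the change score.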
As a closing comment on the estimation of the Heckit model, it is worth pointing out that
there is no unique way to estimate the standard errors via maximum likelihood computer
routines. Historically, LIMDEP used the conventional second derivatives matrix to compute
standard errors for the maximum likelihood estimation of the two-equation Heckit model. In the
process of preparing this module, differences in standard errors produced by LIMDEP and
STATA suggested that STATA was using the alternative outer products of the first derivatives.
To achieve consistency, Bill Greene modified the LIMDEP routine in April 2010 so that it also
now uses the outer products of the first derivatives.
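For readers who want to see the two variance estimators side by side, the same choice can be made
explicit in STATA's heckman command (used in Part Three of this module). The following is only an
illustrative sketch: the first call uses STATA's default observed-information (Hessian-based)
standard errors, and the second requests the outer product of the gradients.

heckman change hb doc comp lib ci ck phd noeval, ///
    select(final = an hb doc comp lib ci ck phd noeval)
heckman change hb doc comp lib ci ck phd noeval, ///
    select(final = an hb doc comp lib ci ck phd noeval) vce(opg)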
AN APPLICATION OF PROPENSITY SCORE MATCHING
Unfortunately, we are not aware of a study in economic education for which propensity score
matching has been used. Thus, we looked outside economic education and elected to redo the
example reported in Becker and Ichino (2002). This application and data are derived from
Dehejia and Wahba (1999), whose study, in turn was based on LaLonde (1986). The data set
consists of observed samples of treatments and controls from the National Supported Work
demonstration. Some of the institutional features of the data set are given by Becker and Ichino.
The data were downloaded from the website http://www.nber.org/~rdehejia/nswdata.html. The
data set used here is in the original text form, contained in the data file “matchingdata.txt.” They
have been assembled from the several parts in the NBER archive.
Becker and Ichino report that they were unable to replicate Dehejia and Wahba’s results,
though they did obtain similar results. (They indicate that they did not have the original authors’
specifications of the number of blocks used in the partitioning of the range of propensity scores,
significance levels, or exact procedures for testing the balancing property.) In turn, we could
not precisely replicate Becker and Ichino’s results – we can identify the reason, as discussed
below. Likewise, however, we obtain similar results.
There are 2,675 observations in the data set, 2,490 controls (with t = 0) and 185 treated
observations (with t = 1). The variables in the raw data set are those listed in the READ
command below.
We will analyze these data following Becker and Ichino’s line of analysis. We assume
that you have completed Module One, Part Two, and thus are familiar with placing commands in
the text editor and using the GO button to submit commands, and where results are found in the
output window. In what follows, we will simply show the commands you need to enter into
LIMDEP (NLOGIT) to produce the results that we will discuss.
To start, the data are imported by using the command (where the data file is on the C
drive but your data could be placed wherever):
READ ; file=C:\matchingdata.txt;
names=t,age,educ,black,hisp,marr,nodegree,re74,re75,re78;nvar=10;nobs=2675$
age2 = age squared
educ2 = educ squared
re742 = re74 squared
re752 = re75 squared
blacku74 = black times 1(re74 = 0)
In order to improve the readability of some of the reported results, we have divided the
income variables by 10,000. (This is also an important adjustment that accommodates a
numerical problem with the original data set. This is discussed below.) The outcome variable is
re78.
The data are set up and described first. The transformed variables listed above are
constructed with CREATE commands, and the descriptive statistics are then obtained with
DSTAT ; Rhs = * $
Descriptive Statistics
All results based on nonmissing observations.
==============================================================================
Variable Mean Std.Dev. Minimum Maximum Cases Missing
==============================================================================
All observations in current sample
--------+---------------------------------------------------------------------
T| .691589E-01 .253772 .000000 1.00000 2675 0
AGE| 34.2258 10.4998 17.0000 55.0000 2675 0
EDUC| 11.9944 3.05356 .000000 17.0000 2675 0
BLACK| .291589 .454579 .000000 1.00000 2675 0
HISP| .343925E-01 .182269 .000000 1.00000 2675 0
MARR| .819439 .384726 .000000 1.00000 2675 0
NODEGREE| .333084 .471404 .000000 1.00000 2675 0
RE74| 1.82300 1.37223 .000000 13.7149 2675 0
RE75| 1.78509 1.38778 .000000 15.6653 2675 0
RE78| 2.05024 1.56325 .000000 12.1174 2675 0
AGE2| 1281.61 766.842 289.000 3025.00 2675 0
EDUC2| 153.186 70.6223 .000000 289.000 2675 0
RE742| 5.20563 8.46589 .000000 188.098 2675 0
RE752| 5.11175 8.90808 .000000 245.402 2675 0
BLACKU74| .549533E-01 .227932 .000000 1.00000 2675 0
We next fit the logit model for the propensity scores. An immediate problem arises with
the data set as used by Becker and Ichino. The income data are in raw dollar terms – the mean of
re74, for example is $18,230.00. The square of it, which is on the order of 300,000,000, as well
as the square of re75 which is similar, is included in the logit equation with a dummy variable for
Hispanic which is zero for 96.5% of the observations and the blacku74 dummy variable which is
zero for 94.5% of the observations. Because of the extreme difference in magnitudes, estimation
of the logit model in this form is next to impossible. But rescaling the data by dividing the
income variables by 10,000 addresses the instability problem. ii These transformations are shown
in the second CREATE command above. This has no impact on the results produced with the
data, other than stabilizing the estimation of the logit equation. We are now quite able to
replicate the Becker and Ichino results except for an occasional very low order digit.
The logit model from which the propensity scores are obtained is fit using
NAMELIST ; X = age,age2,educ,educ2,marr,black,hisp,
re74,re75,re742,re752,blacku74,one $
LOGIT ; Lhs = t ; Rhs = x ; Hold $
(Note: Becker and Ichino’s coefficients on re74 and re75 are multiplied by 10,000, and
coefficients on re742 and re752 are multiplied by 100,000,000. Some additional logit results
from LIMDEP are omitted. Becker and Ichino’s results are included in the results for
comparison.)
------------------------------------------------------------------
Binary Logit Model for Binary Choice
Dependent variable T Becker/Ichino
Log likelihood function -204.97536 (-204.97537)
Restricted log likelihood -672.64954 (identical)
Chi squared [ 12 d.f.] 935.34837
Significance level .00000
McFadden Pseudo R-squared .6952717
Estimation based on N = 2675, K = 13
Information Criteria: Normalization=1/N
Normalized Unnormalized
AIC .16297 435.95071
Fin.Smpl.AIC .16302 436.08750
Bayes IC .19160 512.54287
Hannan Quinn .17333 463.66183
Hosmer-Lemeshow chi-squared = 12.77381
P-value= .11987 with deg.fr. = 8
--------+-----------------------------------------------------
| Standard Prob. Mean
T| Coefficient Error z z>|Z| of X
--------+----------------------------------------------------- Becker/Ichino
|Characteristics in numerator of Prob[Y = 1] Coeff. |t|
AGE| .33169*** .12033 2.76 .0058 34.2258 .3316904 (2.76)
AGE2| -.00637*** .00186 -3.43 .0006 1281.61 -.0063668 (3.43)
EDUC| .84927** .34771 2.44 .0146 11.9944 .8492683 (2.44)
EDUC2| -.05062*** .01725 -2.93 .0033 153.186 -.0506202 (2.93)
MARR| -1.88554*** .29933 -6.30 .0000 .81944 -1.885542 (6.30)
BLACK| 1.13597*** .35179 3.23 .0012 .29159 1.135973 (3.23)
HISP| 1.96902*** .56686 3.47 .0005 .03439 1.969020 (3.47)
RE74| -1.05896*** .35252 -3.00 .0027 1.82300 -.1059000 (3.00)
RE75| -2.16854*** .41423 -5.24 .0000 1.78509 -.2169000 (5.24)
RE742| .23892*** .06429 3.72 .0002 5.20563 .2390000 (3.72)
RE752| .01359 .06654 .20 .8381 5.11175 .0136000 (0.21)
BLACKU74| 2.14413*** .42682 5.02 .0000 .05495 2.144129 (5.02)
Constant| -7.47474*** 2.44351 -3.06 .0022 -7.474742 (3.06)
--------+-----------------------------------------------------
Note: ***, **, * ==> Significance at 1%, 5%, 10% level.
--------------------------------------------------------------
+---------------------------------------------------------+
|Predictions for Binary Choice Model. Predicted value is |
|1 when probability is greater than .500000, 0 otherwise.|
|Note, column or row total percentages may not sum to |
|100% because of rounding. Percentages are of full sample.|
+------+---------------------------------+----------------+
|Actual| Predicted Value | |
|Value | 0 1 | Total Actual |
+------+----------------+----------------+----------------+
| 0 | 2463 ( 92.1%)| 27 ( 1.0%)| 2490 ( 93.1%)|
| 1 | 51 ( 1.9%)| 134 ( 5.0%)| 185 ( 6.9%)|
+------+----------------+----------------+----------------+
|Total | 2514 ( 94.0%)| 161 ( 6.0%)| 2675 (100.0%)|
+------+----------------+----------------+----------------+
The first set of matching results uses the kernel estimator for the neighbors, lists the
intermediate results, and uses only the observations in the common support. iii
The estimated propensity score function is echoed first. This merely reports the earlier
estimated binary choice model for the treatment assignment. The treatment assignment model is
not reestimated. (The ;Hold in the LOGIT or PROBIT command stores the estimated model for
this use.)
+---------------------------------------------------+
| ******* Propensity Score Matching Analysis ****** |
| Treatment variable = T , Outcome = RE78 |
| Sample In Use |
| Total number of observations = 2675 |
| Number of valid (complete) obs. = 2675 |
| Number used (in common support) = 1342 |
| Sample Partitioning of Data In Use |
| Treated Controls Total |
| Observations 185 1157 1342 |
| Sample Proportion 13.79% 86.21% 100.00% |
+---------------------------------------------------+
+-------------------------------------------------------------+
| Propensity Score Function = Logit based on T |
| Variable Coefficient Standard Error t statistic |
| AGE .33169 .12032986 2.757 |
| AGE2 -.00637 .00185539 -3.432 |
| EDUC .84927 .34770583 2.442 |
| EDUC2 -.05062 .01724929 -2.935 |
| MARR -1.88554 .29933086 -6.299 |
| BLACK 1.13597 .35178542 3.229 |
| HISP 1.96902 .56685941 3.474 |
| RE74 -1.05896 .35251776 -3.004 |
| RE75 -2.16854 .41423244 -5.235 |
| RE742 .23892 .06429271 3.716 |
| RE752 .01359 .06653758 .204 |
| BLACKU74 2.14413 .42681518 5.024 |
| ONE -7.47474 2.44351058 -3.059 |
| Note:Estimation sample may not be the sample analyzed here. |
| Observations analyzed are restricted to the common support =|
| only controls with propensity in the range of the treated. |
+-------------------------------------------------------------+
The note in the reported logit results explains how the common support is defined, that is,
as the range of variation of the propensity scores for the treated observations.
The next set of results reports the iterations that partition the range of estimated
probabilities. The report includes the results of the F tests within the partitions as well as the
details of the full partition itself. The balancing hypothesis is rejected when the p value is less
than 0.01 within the cell. Becker and Ichino do not report the results of this search for their data,
but do report that they ultimately found seven blocks, as we did. They do not report the means by
which the test of equality is carried out within the blocks or the critical value used.
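Although the routine does not document the within-block test, the reported F values are consistent
with the square of a two-sample t statistic for the difference in mean propensity scores, using
separate variances for controls and treated. For example, for the block [.58540, .78033] in the
Iteration 3 table below,

F ≈ (0.68950 − 0.64573)² / (0.04660²/13 + 0.04769²/19) ≈ 6.68,

which matches the value reported for that block. (This reading of the output is our own inference
from the reported numbers, not a statement from the program documentation.)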
.78033 .97525 7 .96240 .00713 103 .93022 .05405 29.05 .0000
Iteration 2 Mean scores are not equal in at least one cell
================================================================================
Iteration 3. Partitioning range of propensity scores into 7 intervals.
================================================================================
Range Controls Treatment
# Obs. Mean PS S.D. PS # obs. Mean PS S.D. PS F Prob
---------------- ---------------------- ---------------------- -------------
.00061 .09807 1026 .01522 .02121 11 .03636 .03246 4.64 .0566
.09807 .19554 55 .13104 .02762 6 .14183 .02272 1.16 .3163
.19554 .39047 41 .28538 .05956 26 .30732 .05917 2.18 .1460
.39047 .58540 15 .49681 .05098 20 .49273 .06228 .05 .8327
.58540 .78033 13 .68950 .04660 19 .64573 .04769 6.68 .0157
.78033 .87779 0 .00000 .00000 17 .81736 .02800 .00 1.0000
.87779 .97525 7 .96240 .00713 86 .95253 .01813 8.77 .0103
Mean PSCORES are tested equal within the blocks listed below
After partitioning the range of the propensity scores, we report the empirical distribution of the
propensity scores and the boundaries of the blocks estimated above. The values below show the
percentiles that are also reported by Becker and Ichino. Differences in the reported search
algorithms notwithstanding, the block boundaries reported by Becker and Ichino are roughly the
same as ours.
+-------------------------------------------------------------+
| Empirical Distribution of Propensity Scores in Sample Used | Becker/Ichino
| Percent Lower Upper Sample size = 1342 | Percentiles (lower)
| 0% - 5% .000611 .000801 Average score .137746 | .0006426
| 5% - 10% .000802 .001088 Std.Dev score .274560 | .0008025
| 10% - 15% .001093 .001378 Variance .075383 | .0010932
| 15% - 20% .001380 .001809 Blocks used to test balance |
| 20% - 25% .001815 .002355 Lower Upper # obs |
| 25% - 30% .002355 .003022 1 .000611 .098075 1037 | .0023546
| 30% - 35% .003046 .004094 2 .098075 .195539 61 |
| 35% - 40% .004097 .005299 3 .195539 .390468 67 |
| 40% - 45% .005315 .007631 4 .390468 .585397 35 |
| 45% - 50% .007632 .010652 5 .585397 .780325 32 |
| 50% - 55% .010682 .015103 6 .780325 .877790 17 | .0106667
| 55% - 60% .015105 .022858 7 .877790 .975254 93 |
| 60% - 65% .022888 .035187 |
| 65% - 70% .035316 .051474 |
| 70% - 75% .051488 .075104 |
| 75% - 80% .075712 .135218 | .0757115
| 80% - 85% .135644 .322967 |
| 85% - 90% .335230 .616205 |
| 90% - 95% .625082 .949302 | .6250832
| 95% - 100% .949302 .975254 | .949382 to .970598
+-------------------------------------------------------------+
The blocks used for the balancing hypothesis are shown at the right in the table above. Becker
and Ichino report that they used the following blocks and sample sizes:
Block   Lower   Upper   # obs
  6      0.60    0.80      33
  7      0.80    1.00     105
At this point, our results begin to differ somewhat from those of Becker and Ichino because they
are using a different (cruder) blocking arrangement for the ranges of the propensity scores. This should
not affect the ultimate estimation of the ATE; it is an intermediate step in the analysis that is a check on
the reliability of the procedure.
The next set of results reports the analysis of the balancing property for the independent variables.
A test is reported for each variable in each block as listed in the table above. The lines marked (by the
program) with “*” show cells in which one or the other group had no observations, so the F test could not
be carried out. This was treated as a “success” in each analysis. Lines marked with an “o” note where the
balancing property failed. There are only four of these, but those we do find are not borderline. Becker
and Ichino report their finding that the balancing property is satisfied. Note that our finding does not
prevent further analysis; it merely suggests that the analyst might want to consider a richer
specification of the propensity score model.
MARR 5 .153846 .210526 .17 .6821
MARR 6 .000000 .529412 .00 1.0000 *
MARR 7 .000000 .000000 .00 1.0000
BLACK 1 .358674 .636364 3.63 .0833
BLACK 2 .600000 .500000 .22 .6553
BLACK 3 .780488 .769231 .01 .9150
BLACK 4 .866667 .500000 6.65 .0145
BLACK 5 .846154 .947368 .81 .3792
BLACK 6 .000000 .941176 .00 1.0000 *
BLACK 7 1.000000 .953488 .00 1.0000 *
HISP 1 .048733 .000000 52.46 .0000 o
HISP 2 .072727 .333333 1.77 .2311
HISP 3 .048780 .000000 2.10 .1547
HISP 4 .066667 .150000 .66 .4224
HISP 5 .153846 .052632 .81 .3792
HISP 6 .000000 .058824 .00 1.0000 *
HISP 7 .000000 .046512 4.19 .0436
RE74 1 1.230846 1.214261 .00 1.0000
RE74 2 .592119 .237027 10.63 .0041 o
RE74 3 .584965 .547003 .06 .8074
RE74 4 .253634 .298130 .16 .6875
RE74 5 .154631 .197888 .44 .5108
RE74 6 .000000 .002619 .00 1.0000 *
RE74 7 .000000 .000000 .00 1.0000
RE75 1 1.044680 .896447 .41 .5343
RE75 2 .413079 .379168 .09 .7653
RE75 3 .276234 .279825 .00 1.0000
RE75 4 .286058 .169340 2.39 .1319
RE75 5 .137276 .139118 .00 1.0000
RE75 6 .000000 .061722 .00 1.0000 *
RE75 7 .012788 .021539 .37 .5509
RE742 1 2.391922 2.335453 .00 1.0000
RE742 2 .672950 .092200 9.28 .0035 o
RE742 3 .638937 .734157 .09 .7625
RE742 4 .127254 .245461 1.14 .2936
RE742 5 .040070 .095745 1.31 .2647
RE742 6 .000000 .000117 .00 1.0000 *
RE742 7 .000000 .000000 .00 1.0000
RE752 1 1.779930 1.383457 .43 .5207
RE752 2 .313295 .201080 1.48 .2466
RE752 3 .151139 .135407 .14 .7133
RE752 4 .128831 .079975 .97 .3308
RE752 5 .088541 .037465 .51 .4894
RE752 6 .000000 .037719 .00 1.0000 *
RE752 7 .001145 .005973 2.57 .1124
BLACKU74 1 .014620 .000000 15.12 .0001 o
BLACKU74 2 .054545 .000000 3.17 .0804
BLACKU74 3 .121951 .192308 .58 .4515
BLACKU74 4 .200000 .100000 .66 .4242
BLACKU74 5 .230769 .315789 .29 .5952
BLACKU74 6 .000000 .941176 .00 1.0000 *
BLACKU74 7 1.000000 .953488 .00 1.0000 *
Variable BLACKU74 is unbalanced in block 1
Other variables may also be unbalanced
You might want to respecify the index function for the P-scores
This part of the analysis ends with a recommendation that the analyst reexamine the specification
of the propensity score model. Because this is not a numerical problem, the analysis continues with
estimation of the average treatment effect on the treated.
The first example below shows estimation using the kernel estimator to define the counterpart
observation from the controls and using only the subsample in the common support. This stage consists of
nboot + 1 iterations. In order to be able to replicate the results, we set the seed of the random number
generator before computing the results.
CALC ; Ran(1234579) $
MATCH ; Lhs = re78 ; Kernel ; List ; Common Support $
The first result is the actual estimation, which is reported in the intermediate results. Then the
nboot repetitions are reported. (These will be omitted if ; List is not included in the command.) Recall,
we divided the income values by 10,000. The value of .156255 reported below thus corresponds to
$1,562.55. Becker and Ichino report a value (see their section 6.4) of $1,537.94. Using the bootstrap
replications, we have estimated the asymptotic standard error to be $1,042.04. A 95% confidence interval
for the treatment effect is computed as $1,562.55 ± 1.96($1,042.04) = (-$479.85, $3,604.96), which is the
interval reported in the output below.
+----------------------------------------------------------------------+
| Estimated Average Treatment Effect (T ) Outcome is RE78 |
| Kernel Using Epanechnikov kernel with bandwidth = .0600 |
| Note, controls may be reused in defining matches. |
| Number of bootstrap replications used to obtain variance = 25 |
+----------------------------------------------------------------------+
Estimated average treatment effect = .156255
Begin bootstrap iterations *******************************************
Boootstrap estimate 1 = .099594
Boootstrap estimate 2 = .109812
Boootstrap estimate 3 = .152911
Boootstrap estimate 4 = .168743
Boootstrap estimate 5 = -.015677
Boootstrap estimate 6 = .052938
Boootstrap estimate 7 = -.003275
Boootstrap estimate 8 = .212767
Boootstrap estimate 9 = -.042274
Boootstrap estimate 10 = .053342
Boootstrap estimate 11 = .351122
Boootstrap estimate 12 = .117883
Boootstrap estimate 13 = .181123
Boootstrap estimate 14 = .111917
Boootstrap estimate 15 = .181256
Boootstrap estimate 16 = -.012129
Boootstrap estimate 17 = .240363
Boootstrap estimate 18 = .201321
Boootstrap estimate 19 = .169463
Boootstrap estimate 20 = .238131
Boootstrap estimate 21 = .358050
Boootstrap estimate 22 = .199020
Boootstrap estimate 23 = .083503
Boootstrap estimate 24 = .146215
Boootstrap estimate 25 = .266303
End bootstrap iterations *******************************************
+----------------------------------------------------------------------+
| Number of Treated observations = 185 Number of controls = 1157 |
| Estimated Average Treatment Effect = .156255 | (.153794)
| Estimated Asymptotic Standard Error = .104204 | (.101687)
| t statistic (ATT/Est.S.E.) = 1.499510 |
| Confidence Interval for ATT = ( -.047985 to .360496) 95% |
| Average Bootstrap estimate of ATT = .144897 |
| ATT - Average bootstrap estimate = .011358 |
+----------------------------------------------------------------------+
Note that the estimated asymptotic standard error is somewhat different. As we noted earlier,
because of differences in random number generators, the bootstrap replications will differ across
programs. It will generally not be possible to exactly replicate results generated with different computer
programs. With a specific computer program, replication is obtained by setting the seed of the random
number generator. (The specific seed chosen is immaterial, so long as the same seed is used each time.)
The next set of estimates is based on all of the program defaults. The single nearest neighbor is
used for the counterpart observation; 25 bootstrap replications are used to compute the standard deviation,
and the full range of propensity scores (rather than the common support) is used. Intermediate output is
also suppressed. Once again, we set the seed for the random number generator before estimation.
CALC ; Ran(1234579) $
MATCH ; Rhs = re78 $
+-------------------------------------------------------------+
| Empirical Distribution of Propensity Scores in Sample Used |
| Percent Lower Upper Sample size = 2675 |
| 0% - 5% .000000 .000000 Average score .069159 |
| 5% - 10% .000000 .000002 Std.Dev score .206287 |
| 10% - 15% .000002 .000006 Variance .042555 |
| 15% - 20% .000007 .000015 Blocks used to test balance |
| 20% - 25% .000016 .000032 Lower Upper # obs |
| 25% - 30% .000032 .000064 1 .000000 .097525 2370 |
| 30% - 35% .000064 .000121 2 .097525 .195051 60 |
| 35% - 40% .000121 .000204 3 .195051 .390102 68 |
| 40% - 45% .000204 .000368 4 .390102 .585152 35 |
| 45% - 50% .000368 .000618 5 .585152 .780203 32 |
| 50% - 55% .000618 .001110 6 .780203 .877729 17 |
| 55% - 60% .001123 .001851 7 .877729 .975254 93 |
| 60% - 65% .001854 .003047 |
| 65% - 70% .003057 .005451 |
| 70% - 75% .005451 .010756 |
| 75% - 80% .010877 .023117 |
| 80% - 85% .023149 .051488 |
| 85% - 90% .051703 .135644 |
| 90% - 95% .136043 .625082 |
| 95% - 100% .625269 .975254 |
+-------------------------------------------------------------+
Examining exogenous variables for balancing hypothesis
Variable BLACKU74 is unbalanced in block 1
Other variables may also be unbalanced
You might want to respecify the index function for the P-scores
+----------------------------------------------------------------------+
| Estimated Average Treatment Effect (T ) Outcome is RE78 |
| Nearest Neighbor Using average of 1 closest neighbors |
| Note, controls may be reused in defining matches. |
| Number of bootstrap replications used to obtain variance = 25 |
+----------------------------------------------------------------------+
Estimated average treatment effect = .169094
Begin bootstrap iterations *******************************************
End bootstrap iterations *******************************************
+----------------------------------------------------------------------+
| Number of Treated observations = 185 Number of controls = 54 |
| Estimated Average Treatment Effect = .169094 |
| Estimated Asymptotic Standard Error = .102433 |
| t statistic (ATT/Est.S.E.) = 1.650772 |
| Confidence Interval for ATT = ( -.031675 to .369864) 95% |
| Average Bootstrap estimate of ATT = .171674 |
| ATT - Average bootstrap estimate = -.002579 |
+----------------------------------------------------------------------+
Using the full sample in this fashion produces an estimate of $1,690.94 for the treatment effect
with an estimated standard error of $1,024.33. Note that from the results above, we find that only 54 of
the 2,490 control observations were used as nearest neighbors for the 185 treated observations. In
comparison, using the 1,342 observations in their estimated common support, and the same 185 treated
observations, Becker and Ichino reported estimates of $1,667.64 and $2,113.59 for the effect and the
standard error, respectively, and used 57 of the 1,342 controls as nearest neighbors.
The next set of results uses the caliper form of matching and again restricts attention to the
estimates in the common support.
CALC ; Ran(1234579) $
MATCH ; Rhs = re78 ; Range = .0001 ; Common Support $
CALC ; Ran(1234579) $
MATCH ; Rhs = re78 ; Range = .01 ; Common Support $
The estimated treatment effects are now very different. We see that only 23 of the 185 treated
observations had a neighbor within a range (radius in the terminology of Becker and Ichino) of 0.0001.
The treatment effect is estimated to be only $321.95 with a standard error of $307.95. In contrast, using
this procedure, and this radius, Becker and Ichino report a nonsense result of -$5,546.10 with a standard
error of $2,388.72. They state that this illustrates the sensitivity of the estimator to the choice of radius,
which is certainly the case. To examine this aspect, we recomputed the estimator using a range of 0.01
instead of 0.0001. This produces the expected effect, as seen in the second set of results below. The
estimated treatment effect rises to $1,433.54, which is comparable to the other results already obtained.
+----------------------------------------------------------------------+
| Estimated Average Treatment Effect (T ) Outcome is RE78 |
| Caliper Using distance of .00010 to locate matches |
| Note, controls may be reused in defining matches. |
| Number of bootstrap replications used to obtain variance = 25 |
+----------------------------------------------------------------------+
Estimated average treatment effect = .032195
Begin bootstrap iterations *******************************************
End bootstrap iterations *******************************************
+----------------------------------------------------------------------+
| Number of Treated observations = 23 Number of controls = 66 |
| Estimated Average Treatment Effect = .032195 |
| Estimated Asymptotic Standard Error = .030795 |
| t statistic (ATT/Est.S.E.) = 1.045454 |
| Confidence Interval for ATT = ( -.028163 to .092553) 95% |
| Average Bootstrap estimate of ATT = .018996 |
| ATT - Average bootstrap estimate = .013199 |
+----------------------------------------------------------------------+
+----------------------------------------------------------------------+
| Estimated Average Treatment Effect (T ) Outcome is RE78 |
| Caliper Using distance of .01000 to locate matches |
| Note, controls may be reused in defining matches. |
| Number of bootstrap replications used to obtain variance = 25 |
+----------------------------------------------------------------------+
Estimated average treatment effect = .143354
Begin bootstrap iterations *******************************************
End bootstrap iterations *******************************************
+----------------------------------------------------------------------+
| Number of Treated observations = 146 Number of controls = 1111 |
| Estimated Average Treatment Effect = .143354 |
| Estimated Asymptotic Standard Error = .078378 |
| t statistic (ATT/Est.S.E.) = 1.829010 |
| Confidence Interval for ATT = ( -.010267 to .296974) 95% |
| Average Bootstrap estimate of ATT = .127641 |
| ATT - Average bootstrap estimate = .015713 |
+----------------------------------------------------------------------+
CONCLUDING COMMENTS
Results obtained from the two-equation system advanced by Heckman over 30 years ago are
sensitive to the correct specification of both equations and to their identification. Methods such
as propensity score matching, on the other hand, depend on the validity of the logit or probit
function used to estimate the propensity scores and on the smoothing method used in the kernel
density estimator. Someone using Heckman's original selection adjustment method can easily
have their results replicated in LIMDEP, STATA and SAS, although standard error estimates may
differ somewhat because of differences in the routines used. Such is not the case with propensity
score matching: propensity score matching results are highly sensitive to the computer program
employed, while Heckman's original sample selection adjustment method can be relied on to give
comparable coefficient estimates across programs.
REFERENCES
Becker, William and William Walstad. “Data Loss From Pretest to Posttest As a Sample
Selection Problem,” Review of Economics and Statistics, Vol. 72, February 1990: 184-188.
Becker, William and John Powers. “Student Performance, Attrition, and Class Size Given
Missing Student Data,” Economics of Education Review, Vol. 20, August 2001: 377-388.
Heckman, James. “Sample Selection Bias as a Specification Error,” Econometrica, Vol. 47, 1979: 153-162.
Huynh, Kim, David Jacho-Chavez, and James K. Self. “The Efficacy of Collaborative Learning
Recitation Sessions on Student Outcomes?” American Economic Review (Forthcoming, May
2010).
LaLonde, R., “Evaluating the Econometric Evaluations of Training Programs with Experimental
Data,” American Economic Review, Vol. 76, 4, 1986, 604-620.
Saunders, Phillip. The TUCE III Data Set: Background information and file codes
(documentation, summary tables, and five 3.5-inch double-sided, high density disks in ASCII
format). New York: National Council on Economic Education, 1994.
ENDNOTES
i
Huynh, Jacho-Chavez, and Self (2010) have a data set that enables them to account for selection
into, out of and between collaborative learning sections of a large principles course in their change-score
modeling.
ii
An attempt to compute a linear regression of the original RE78 on the original unscaled other
variables is successful, but produces a warning that the condition number of the X matrix is 6.5 × 10^9.
When the data are scaled as done above, no warning about multicollinearity is given.
iii
The kernel density estimator is a nonparametric estimator. Unlike a parametric
estimator (which is an equation), a nonparametric estimator has no fixed structure and is based
on a histogram of all the data. Histograms are bar charts, which are not smooth, and whose
shape depends on the width of the bin into which the data are divided. In essence, with a fixed
bin width, the kernel estimator smooths out the histogram by centering each of the bins at each
data point rather than fixing the end points of the bin. The optimal bin width is a subject of
debate and well beyond the technical level of this module.
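As a purely illustrative aside, the effect of the bandwidth choice can be seen by plotting a kernel
density in STATA and varying the bandwidth; here _pscore is the propensity score variable created by
the pscore routine in Part Three, and the two bandwidth values are arbitrary examples.

kdensity _pscore, kernel(epanechnikov) bwidth(0.06)
kdensity _pscore, kernel(epanechnikov) bwidth(0.20)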
MODULE FOUR, PART THREE: SAMPLE SELECTION
Part Three of Module Four provides a cookbook-type demonstration of the steps required to use
STATA in situations involving estimation problems associated with sample selection. Users of
this module need to have completed Module One, Parts One and Three, but not necessarily
Modules Two and Three. From Module One users are assumed to know how to get data into
STATA, recode and create variables within STATA, and run and interpret regression results.
Module Four, Parts Two and Four demonstrate in LIMDEP (NLOGIT) and SAS what is done
here in STATA.
The change score or difference in difference model is used extensively in education research.
Yet, before Becker and Walstad (1990), little if any attention was given to the consequence of
missing student records that result from: 1) "data cleaning" done by those collecting the data, 2)
student unwillingness to provide data, or 3) students self-selecting into or out of the study. The
implications of these types of sample selection are shown in the work of Becker and Powers
(2001) where the relationship between class size and student learning was explored using the
third edition of the Test of Understanding in College Economics (TUCE), which was produced
by Saunders (1994) for the National Council on Economic Education (NCEE), since renamed the
Council for Economic Education.
Module One, Part Three showed how to get the Becker and Powers data set
“beck8WO.csv” into STATA. As a brief review this was done with the insheet command:
. insheet a1 a2 x3 c al am an ca cb cc ch ci cj ck cl cm cn co cs ct cu ///
> cv cw db dd di dj dk dl dm dn dq dr ds dy dz ea eb ee ef ///
> ei ej ep eq er et ey ez ff fn fx fy fz ge gh gm gn gq gr hb ///
> hc hd he hf using "F:\BECK8WO2.csv", comma
(64 vars, 2849 obs)
where
Separate dummy variables need to be created for each type of school (A2), which is done with
the following code:
recode a2 (100/199=1) (200/299=2) (300/399=3) (400/499=4)
generate doc=(a2==1) if a2!=.
generate comp=(a2==2) if a2!=.
generate lib=(a2==3) if a2!=.
generate twoyr=(a2==4) if a2!=.
To create a dummy variable for whether the instructor had a PhD we use
To create a dummy variable for whether the student took the postTUCE we use
To create a dummy variable for whether a student did (noeval = 0) or did not (noeval = 1)
complete a student evaluation of the instructor we use
“Noeval” reflects whether the student was around toward the end of the term, attending classes,
and sufficiently motivated to complete an evaluation of the instructor. In the Saunders data set,
evaluation questions with no answer were coded -9; thus, a sum of -36 across these four questions
indicates that none of them was answered.
And the change score is created with
generate change=cc-an
Finally, there was a correction for the term in which student record 2216 was incorrectly
recorded:
recode hb (90=89)
All of these recoding and create commands are entered into the STATA command file as
follows:
gen noeval=(ge+gh+gm+gq==-36)
gen change=cc-an
recode hb (90=89)
The use of these data entry and management commands will appear in the STATA output file for
the equations to be estimated in the next section.
THE PROPENSITY TO TAKE THE POSTTEST AND THE CHANGE SCORE EQUATION
To address attrition-type sample selection problems in change score studies, Becker and Powers
first add observations that were dropped during the early stage of assembling data for TUCE III.
Becker and Powers do not have any data on students before they enrolled in the course and thus
cannot address selection into the course, but to examine the effects of attrition (course
withdrawal) they introduce three measures of class size (beginning, ending, and average) and
argue that initial or beginning class size is the critical measure for assessing learning over the
entire length of the course. i To show the effects of initial class size on attrition (as discussed in
Module Four, Part One) they employ what is now the simplest and most restrictive of sample
correction methods, which can be traced to James Heckman (1979), recipient of the 2000 Nobel
Prize in Economics.
From Module Four, Part One, we have the data generating process for the difference between
post and preTUCE scores for the ith student ($\Delta y_i$):

$\Delta y_i = X_i \beta + \varepsilon_i = \beta_1 + \sum_{j=2}^{k} \beta_j x_{ji} + \varepsilon_i \qquad (1)$
where the data set of explanatory variables is the matrix $X$, $X_i$ is the row of $x_{ji}$ values for the
relevant variables believed to explain the ith student’s pretest and posttest scores, the $\beta_j$’s are the
associated slope coefficients in the vector $\beta$, and $\varepsilon_i$ is the individual random shock (caused, for
example, by unobservable attributes, events or environmental factors) that affects the ith student’s
test scores. Sample selection associated with students’ unwillingness to take the posttest
(dropping the course) results in correlation between the population error term and the regressors,
which biases the coefficient estimators in this change score model and makes them inconsistent.
The data generating process for the ith student’s propensity to take the posttest is:

$T_i^* = H_i \alpha + \omega_i \qquad (2)$
where
H is the matrix of explanatory variables that are believed to drive these propensities.
ω is the vector of unobservable random shocks that affect each student’s propensity.
The effect of attrition between the pretest and posttest is reflected in the absence of a
posttest score for the ith student ($T_i = 0$). A Heckman adjustment for the resulting bias caused
by excluding those students from the change-score regression requires estimation of equation (2)
and the calculation of an inverse Mills ratio for each student who has a pretest. This inverse
Mills ratio is then added to the change-score regression (1) as another explanatory variable. In
essence, this inverse Mills ratio adjusts the error term for the missing students.
For the Heckman adjustment for sample selection, each disturbance in the vector $\varepsilon$ of equation
(1) is assumed to be distributed bivariate normal with the corresponding disturbance term in the
$\omega$ vector of the selection equation (2). Thus, for the ith student we have:

$E(\varepsilon) = E(\omega) = 0,\; E(\varepsilon\varepsilon') = \sigma_\varepsilon^2 I,\; E(\omega\omega') = I,\; \text{and } E(\varepsilon\omega') = \rho\sigma_\varepsilon I. \qquad (4)$
That is, the disturbances have zero means, unit variance, and no covariance among students, but
there is covariance between selection in getting a posttest score and the measurement of the
change score.
The regression for this censored sample of the $n_{T=1}$ students who took the posttest is now:

$E(\Delta y_i \mid X_i, T_i = 1) = X_i \beta + (\rho\sigma_\varepsilon)\lambda_i$,

where $\lambda_i$ is the inverse Mills ratio (or hazard) such that $\lambda_i = f(-T_i^*)/[1 - F(-T_i^*)]$, and $f(.)$
and $F(.)$ are the normal density and distribution functions. $\lambda_i$ is the standardized mean of the
disturbance term $\omega_i$ for the ith student who took the posttest; it is close to zero only for those
well above the $T = 1$ threshold. The values of $\lambda$ are generated from the estimated probit
selection equation (2) for all students.
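Before turning to the built-in command, the two-step logic just described can be sketched “by hand”
in STATA. This is only an illustration of the mechanics: the variable names are those of the Becker
and Powers data used below, and the OLS standard errors in the final line are not the corrected
two-step standard errors that the built-in routine reports.

* Step 1: estimate the probit selection equation (2) and form the inverse Mills ratio
probit final an hb doc comp lib ci ck phd noeval
predict double xb_sel, xb
gen double imr = normalden(xb_sel)/normal(xb_sel)
* Step 2: add the inverse Mills ratio to the change-score regression (1),
* estimated on the selected (final = 1) sample
regress change hb doc comp lib ci ck phd noeval imr if final==1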
STATA’s built-in “heckman” command estimates both the selection and outcome
equation using either the full-information maximum likelihood or Heckman’s original two-step
estimator (which uses the Mills ratio as a regressor). The default “heckman” command
implements the maximum likelihood estimation, including ρ and σε, and is written:
heckman change hb doc comp lib ci ck phd noeval, ///
select (final = an hb doc comp lib ci ck phd noeval) vce(opg)
while the Mills ratio two-step process can be implemented by specifying the option “twostep”
after the command. The option “vce(opg)” specifies the outer-product of the gradient method to
estimate standard errors, as opposed to STATA’s default Hessian method.
As described in Module One, Part Three, entering all of these commands into the
command window in STATA and pressing enter (or alternatively, highlighting the commands in
a do file and pressing ctrl-d) yields the following output file:
. insheet ///
> A1 A2 X3 C AL AM AN CA CB CC CH CI CJ CK CL CM CN CO CS CT ///
> CU CV CW DB DD DI DJ DK DL DM DN DQ DR DS DY DZ EA EB EE EF ///
> EI EJ EP EQ ER ET EY EZ FF FN FX FY FZ GE GH GM GN GQ GR HB ///
> HC HD HE HF ///
> using "C:\BECK8WO.csv", comma
(64 vars, 2837 obs)
. heckman change hb doc comp lib ci ck phd noeval, select (final = an hb doc
comp lib ci ck phd noeval) vce(opg)
------------------------------------------------------------------------------
| OPG
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
change |
hb | -.0097802 .0055923 -1.75 0.080 -.0207408 .0011805
doc | 1.997291 .5534814 3.61 0.000 .912487 3.082094
comp | -.361983 .4332653 -0.84 0.403 -1.211167 .4872015
lib | 2.23154 .505341 4.42 0.000 1.24109 3.22199
ci | .3940114 .2533859 1.55 0.120 -.1026158 .8906386
ck | -2.743372 .3803107 -7.21 0.000 -3.488767 -1.997976
phd | .6420888 .2896418 2.22 0.027 .0744013 1.209776
noeval | -.6320101 1.269022 -0.50 0.618 -3.119248 1.855227
_cons | 6.817536 .7238893 9.42 0.000 5.398739 8.236332
-------------+----------------------------------------------------------------
final |
an | .0227793 .009396 2.42 0.015 .0043634 .0411953
hb | -.0048868 .0020624 -2.37 0.018 -.008929 -.0008447
doc | .9715436 .150756 6.44 0.000 .6760672 1.26702
comp | .4043055 .1443272 2.80 0.005 .1214295 .6871815
lib | .5150521 .1908644 2.70 0.007 .1409648 .8891394
ci | .1992685 .0905382 2.20 0.028 .0218169 .37672
ck | .0859013 .1190223 0.72 0.470 -.1473781 .3191808
phd | -.1320764 .0978678 -1.35 0.177 -.3238939 .059741
noeval | -1.929021 .0713764 -27.03 0.000 -2.068916 -1.789126
_cons | .9901789 .240203 4.12 0.000 .5193897 1.460968
-------------+----------------------------------------------------------------
/athrho | .0370755 .3578813 0.10 0.917 -.6643589 .73851
/lnsigma | 1.471813 .0160937 91.45 0.000 1.44027 1.503356
-------------+----------------------------------------------------------------
rho | .0370585 .3573898 -.581257 .6282441
sigma | 4.357128 .0701223 4.221836 4.496756
lambda | .1614688 1.55763 -2.89143 3.214368
------------------------------------------------------------------------------
LR test of indep. eqns. (rho = 0): chi2(1) = 0.03 Prob > chi2 = 0.8612
------------------------------------------------------------------------------
The above output provides maximum likelihood estimation of both the probit equation
and the change score equation with separate estimation of ρ and σε. The bottom panel provides
the probit coefficients for the propensity equation, where it is shown that initial class size is
negatively and significantly related to the propensity to take the posttest with a one-tail p value
of 0.009. The top panel gives the change score results, where initial class size is negatively and
significantly related to the change score with a one-tail p value of 0.04. Again, it takes
approximately 100 students to move the change score in the opposite direction by a point.
Alternatively, the following command estimates the Heckman model using the Mills ratio
as a regressor:
. heckman change hb doc comp lib ci ck phd noeval, select (final = an hb doc comp lib
ci ck phd noeval) twostep
------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
change |
hb | -.0102219 .0056305 -1.82 0.069 -.0212575 .0008137
doc | 2.079684 .5764526 3.61 0.000 .9498578 3.20951
comp | -.329457 .4426883 -0.74 0.457 -1.19711 .5381962
lib | 2.274478 .5373268 4.23 0.000 1.221337 3.327619
ci | .4082326 .2592943 1.57 0.115 -.0999749 .9164401
ck | -2.730737 .377552 -7.23 0.000 -3.470725 -1.990749
phd | .6334483 .2910392 2.18 0.030 .063022 1.203875
noeval | -.8843357 1.272225 -0.70 0.487 -3.377851 1.60918
_cons | 6.741226 .7510686 8.98 0.000 5.269159 8.213293
-------------+----------------------------------------------------------------
final |
an | .022039 .0094752 2.33 0.020 .003468 .04061
hb | -.0048826 .0019241 -2.54 0.011 -.0086537 -.0011114
doc | .9757148 .1463617 6.67 0.000 .6888511 1.262578
comp | .4064945 .1392651 2.92 0.004 .13354 .679449
lib | .5214436 .1766459 2.95 0.003 .175224 .8676632
ci | .1987315 .0916865 2.17 0.030 .0190293 .3784337
ck | .08779 .1342874 0.65 0.513 -.1754085 .3509885
phd | -.133505 .1030316 -1.30 0.195 -.3354433 .0684333
noeval | -1.930522 .0723911 -26.67 0.000 -2.072406 -1.788638
_cons | .9953498 .2432624 4.09 0.000 .5185642 1.472135
-------------+----------------------------------------------------------------
mills |
lambda | .4856741 1.596833 0.30 0.761 -2.644061 3.61541
-------------+----------------------------------------------------------------
rho | 0.11132
sigma | 4.3630276
lambda | .48567415 1.596833
------------------------------------------------------------------------------
The estimated probit model is reported in the bottom portion of the above output.
The beginning or initial class size is negatively and highly significantly related to the propensity
to take the posttest, with a one-tail p value of 0.0056 (two-tail 0.011).
The corresponding change-score equation employing the inverse Mills ratio is in the
upper portion of the above output: the change score is negatively and significantly related to the
class size, with a one-tail p value of 0.0345, but it takes an additional 100 students to lower the
change score by a point.
AN APPLICATION OF PROPENSITY SCORE MATCHING
Unfortunately, we are not aware of a study in economic education for which propensity score
matching has been used. Thus, we looked outside economic education and elected to redo the
example reported in Becker and Ichino (2002). This application and data are derived from
Dehejia and Wahba (1999), whose study, in turn was based on LaLonde (1986). The data set
consists of observed samples of treatments and controls from the National Supported Work
demonstration. Some of the institutional features of the data set are given by Becker and Ichino.
The data were downloaded from the website http://www.nber.org/~rdehejia/nswdata.html. The
data set used here is in the original text form, contained in the data file “matchingdata.txt.” They
have been assembled from the several parts in the NBER archive.
Becker and Ichino report that they were unable to replicate Dehejia and Wahba’s results,
though they did obtain similar results. (They indicate that they did not have the original authors’
specifications of the number of blocks used in the partitioning of the range of propensity scores,
significance levels, or exact procedures for testing the balancing property.) In turn, we could
not precisely replicate Becker and Ichino’s results – we can identify the reason, as discussed
below. Likewise, however, we obtain similar results.
There are 2,675 observations in the data set, 2,490 controls (with t = 0) and 185 treated
observations (with t = 1). The variables in the raw data set are those listed in the insheet
command below.
We will analyze these data following Becker and Ichino’s line of analysis. We assume
that you have completed Module One, Part Three, and thus are familiar with placing commands
in the command window or in a do file. In what follows, we will simply show the commands
you need to enter into STATA to produce the results that we will discuss.
First, note that STATA does not have a default command available for propensity score
matching. Becker and Ichino, however, have created the user-written routine pscore that
implements the propensity score matching analysis underlying Becker and Ichino (2002). As
described in the endnotes of Module Two, Part Three, users can install the pscore routine by
typing findit pscore into the command window, where a list of information and links to
download this routine appears. Click on one of the download links and STATA automatically
downloads and installs the routine for use. Users can then access the documentation for this
routine by typing help pscore. Installing the pscore routine also downloads and installs several
other routines useful for analyzing treatment effects (i.e., the routines attk, attnd and attr,
discussed later in this Module).
To begin the analysis, the data are imported by using the command (where the data file is
on the C drive but your data could be placed wherever):
insheet ///
t age educ black hisp marr nodegree re74 re75 re78 ///
using "C:\matchingdata.txt"
age2 = age squared
educ2 = educ squared
re742 = re74 squared
re752 = re75 squared
blacku74 = black times 1(re74 = 0)
In order to improve the readability of some of the reported results, we have divided the
income variables by 10,000. (This is also an important adjustment that accommodates a
numerical problem with the original data set. This is discussed below.) The outcome variable is
re78.
The data are set up and described first. The transformations used to create the
transformed variables are
gen age2=age^2
gen educ2=educ^2
replace re74=re74/10000
replace re75=re75/10000
replace re78=re78/10000
gen re742=re74^2
gen re752=re75^2
gen blacku74=black*(re74==0)
global X age age2 educ educ2 marr black hisp re74 re75 re742 re752 blacku74
. sum
We next fit the logit model for the propensity scores. An immediate problem arises with
the data set as used by Becker and Ichino. The income data are in raw dollar terms – the mean of
re74, for example is $18,230.00. The square of it, which is on the order of 300,000,000, as well
as the square of re75 which is similar, is included in the logit equation with a dummy variable for
Hispanic which is zero for 96.5% of the observations and the blacku74 dummy variable which is
zero for 94.5% of the observations. Because of the extreme difference in magnitudes, estimation
of the logit model in this form is next to impossible. But rescaling the data by dividing the
income variables by 10,000 addresses the instability problem. These transformations are shown
in the replace commands above. This has no impact on the results produced with the data, other
than stabilizing the estimation of the logit equation.
The following command estimates the logit model from which the propensity scores are
obtained and tests the balancing hypothesis. The logit model from which the propensity scores
are obtained is fit using: ii
. global X age age2 educ educ2 marr black hisp re74 re75 re742 re752 blacku74
. pscore t $X, logit pscore(_pscore) blockid(_block) comsup
where the logit option specifies that propensity scores should be estimated using the logit
model, the blockid and pscore options define two new variables created by STATA
representing each observation’s propensity score and block id, and the comsup option restricts
the analysis to observations in the common support.
(Note: Becker and Ichino’s coefficients on re74 and re75 are multiplied by 10,000, and
coefficients on re742 and re752 are multiplied by 100,000,000. Otherwise, the output presented
here matches that of Becker and Ichino.)
. pscore t $X, logit pscore(_pscore) blockid(_block) comsup
****************************************************
Algorithm to estimate the propensity score
****************************************************
The treatment is t
------------------------------------------------------------------------------
t | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .3316903 .1203299 2.76 0.006 .0958482 .5675325
age2 | -.0063668 .0018554 -3.43 0.001 -.0100033 -.0027303
educ | .849268 .3477058 2.44 0.015 .1677771 1.530759
educ2 | -.0506202 .0172493 -2.93 0.003 -.0844282 -.0168122
marr | -1.885542 .2993309 -6.30 0.000 -2.472219 -1.298864
black | 1.135972 .3517854 3.23 0.001 .4464852 1.825459
hisp | 1.96902 .5668594 3.47 0.001 .857996 3.080044
re74 | -1.058961 .3525178 -3.00 0.003 -1.749883 -.3680387
re75 | -2.168541 .4142324 -5.24 0.000 -2.980422 -1.35666
re742 | .2389164 .0642927 3.72 0.000 .112905 .3649278
re752 | .0135926 .0665375 0.20 0.838 -.1168185 .1440038
blacku74 | 2.14413 .4268152 5.02 0.000 1.307588 2.980673
_cons | -7.474743 2.443511 -3.06 0.002 -12.26394 -2.68555
------------------------------------------------------------------------------
Note: 22 failures and 0 successes completely determined
Note: the common support option has been selected
The region of common support is [.00061066, .97525407]
Description of the estimated propensity score
in region of common support
The next set of results summarizes the tests of the balancing hypothesis. By specifying
the detail option in the above pscore command, the routine will also report the separate results of
the F tests within the partitions as well as the details of the full partition itself. The balancing
hypothesis is rejected when the p value is less than 0.01 within the cell. Becker and Ichino do
not report the results of this search for their data, but do report that they ultimately found seven
blocks. They do not report the means by which the test of equality is carried out within the
blocks or the critical value used.
******************************************************
Step 1: Identification of the optimal number of blocks
Use option detail if you want more detailed output
******************************************************
**********************************************************
Step 2: Test of balancing property of the propensity score
Use option detail if you want more detailed output
**********************************************************
Inferior |
of block | t
of pscore | 0 1 | Total
-----------+----------------------+----------
0 | 924 7 | 931
.05 | 102 4 | 106
.1 | 56 7 | 63
.2 | 41 28 | 69
.4 | 14 21 | 35
.6 | 13 20 | 33
.8 | 7 98 | 105
-----------+----------------------+----------
Total | 1,157 185 | 1,342
*******************************************
End of the algorithm to estimate the pscore
*******************************************
The final portion of the pscore output presents the blocks used for the balancing hypothesis.
Again, specifying the detail option will report the results of the balancing property test for each of the
independent variables, which are excluded here for brevity. This part of the analysis also recommends
that the analyst reexamine the specification of the propensity score model. Because this is not a
numerical problem, the analysis continues with estimation of the average treatment effect on the treated.
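If one did want to act on that recommendation, a natural next step would be to enrich the index and
rerun pscore. The sketch below is illustrative only: the two interaction variables are our own ad hoc
additions, not part of Becker and Ichino's specification, and they are not used in the rest of this
module.

gen age_black = age*black
gen educ_re74 = educ*re74
pscore t $X age_black educ_re74, logit pscore(_pscore2) blockid(_block2) comsup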
The first example below shows estimation using the kernel estimator to define the counterpart
observation from the controls and using only the subsample in the common support. iii This stage consists
of nboot + 1 iterations. In order to be able to replicate the results, we set the seed of the random number
generator before computing the results:
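A sketch of the commands behind the kernel-matching results that follow, assuming the macro $X and the saved propensity score _pscore from the pscore step; the seed value is arbitrary and the bandwidth shown is the attk default:
. set seed 123456789
. attk re78 t $X, pscore(_pscore) comsup bwidth(.06) bootstrap reps(25)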
Recall that we divided the income values by 10,000. The value of .1537945 reported below thus corresponds to $1,537.95. Becker and Ichino report a value (see their section 6.4) of $1,537.94. Using the bootstrap replications, we have estimated the asymptotic standard error to be $856.28. The 95% confidence interval reported for the treatment effect, $1,537.95 ± 2.064($856.28) = (-$229.32, $3,305.22), uses the critical value from the t distribution with 24 degrees of freedom implied by the 25 bootstrap replications.
ATT estimation with the Kernel Matching method
---------------------------------------------------------
n. treat. n. contr. ATT Std. Err. t
---------------------------------------------------------
---------------------------------------------------------
Note: Analytical standard errors cannot be computed. Use
the bootstrap option to get bootstrapped standard errors.
command: attk re78 t age age2 educ educ2 marr black hisp re74 re75 re742 re752
blacku74 , pscore(_pscore) comsup bwidth(.06)
statistic: attk = r(attk)
note: label truncated to 80 characters
------------------------------------------------------------------------------
Variable | Reps Observed Bias Std. Err. [95% Conf. Interval]
-------------+----------------------------------------------------------------
attk | 25 .1537945 -.0050767 .0856277 -.0229324 .3305215 (N)
| -.0308111 .279381 (P)
| -.0308111 .2729317 (BC)
------------------------------------------------------------------------------
Note: N = normal
P = percentile
BC = bias-corrected
---------------------------------------------------------
n. treat. n. contr. ATT Std. Err. t
---------------------------------------------------------
---------------------------------------------------------
Note that the estimated asymptotic standard error is somewhat different. As we noted earlier,
because of differences in random number generators, the bootstrap replications will differ across
programs. It will generally not be possible to exactly replicate results generated with different computer
programs. With a specific computer program, replication is obtained by setting the seed of the random
number generator. (The specific seed chosen is immaterial, so long as the same seed is used each time.)
The next set of estimates is based on all of the program defaults. The single nearest neighbor is used for the counterpart observation; 25 bootstrap replications are used to compute the standard error, and the full range of propensity scores (rather than the common support) is used. Intermediate output is also suppressed. Once again, we set the seed for the random number generator before estimation. In this case, the pscore calculation is not used; instead, the nearest neighbor matching and the logit propensity scores are estimated in the same command sequence by specifying the logit option rather than the pscore option. Skipping the pscore routine essentially amounts to ignoring any test of the balancing hypothesis. For the purposes of this module, this is a relatively innocuous simplification, but in practice the pscore routine should always be run before estimating the treatment effects.
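A sketch of the corresponding commands, again with an arbitrary seed; here the logit option has attnd estimate the propensity scores itself:
. set seed 123456789
. attnd re78 t $X, logit bootstrap reps(25)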
---------------------------------------------------------
n. treat. n. contr. ATT Std. Err. t
---------------------------------------------------------
---------------------------------------------------------
Note: the numbers of treated and controls refer to actual
nearest neighbour matches
command: attnd re78 t age age2 educ educ2 marr black hisp re74 re75 re742 re752
blacku74 , pscore() logit
statistic: attnd = r(attnd)
note: label truncated to 80 characters
------------------------------------------------------------------------------
Variable | Reps Observed Bias Std. Err. [95% Conf. Interval]
-------------+----------------------------------------------------------------
attnd | 25 .1667644 .012762 .1160762 -.0728051 .4063339 (N)
| -.1111108 .3704965 (P)
| -.1111108 .2918935 (BC)
------------------------------------------------------------------------------
Note: N = normal
P = percentile
BC = bias-corrected
ATT estimation with Nearest Neighbor Matching method
(random draw version)
Bootstrapped standard errors
---------------------------------------------------------
n. treat. n. contr. ATT Std. Err. t
---------------------------------------------------------
---------------------------------------------------------
Note: the numbers of treated and controls refer to actual
nearest neighbour matches
Using the full sample in this fashion produces an estimate of $1,667.64 for the treatment effect with an estimated standard error of $1,160.76. In comparison, using the 1,342 observations in their estimated common support and the same 185 treated observations, Becker and Ichino reported estimates of $1,667.64 and $2,113.59 for the effect and the standard error, respectively, and used 57 of the 1,342 controls as nearest neighbors.
The next set of results uses the radius form of matching and again restricts attention to the
estimates in the common support.
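A sketch of the command, with an arbitrary seed and the 0.0001 radius discussed below; attr is the Becker and Ichino radius-matching routine:
. set seed 123456789
. attr re78 t $X, logit comsup radius(.0001) bootstrap reps(25)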
---------------------------------------------------------
n. treat. n. contr. ATT Std. Err. t
---------------------------------------------------------
---------------------------------------------------------
Note: the numbers of treated and controls refer to actual
matches within radius
Bootstrapping of standard errors
command: attr re78 t age age2 educ educ2 marr black hisp re74 re75 re742 re752
blacku74 , pscore() logit comsu
> p radius(.0001)
statistic: attr = r(attr)
note: label truncated to 80 characters
------------------------------------------------------------------------------
Variable | Reps Observed Bias Std. Err. [95% Conf. Interval]
-------------+----------------------------------------------------------------
attr | 25 -.554614 -.0043318 .5369267 -1.662776 .5535483 (N)
| -1.64371 .967416 (P)
| -1.357991 .967416 (BC)
------------------------------------------------------------------------------
Note: N = normal
P = percentile
BC = bias-corrected
---------------------------------------------------------
n. treat. n. contr. ATT Std. Err. t
---------------------------------------------------------
---------------------------------------------------------
Note: the numbers of treated and controls refer to actual
matches within radius
The estimated treatment effects are now very different. We see that only 23 of the 185 treated observations had a neighbor within a range (a radius, in the terminology of Becker and Ichino) of 0.0001. Consistent with Becker and Ichino’s results, the treatment effect is estimated to be -$5,546.14 with a standard error of $5,369.27. Becker and Ichino state that these nonsensical results illustrate both the difference between “caliper” and “radius” matching and the sensitivity of the estimator to the choice of radius. In order to implement a true caliper matching process, the user-written psmatch2 routine should be used.
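If psmatch2 is not already installed, it can be downloaded from within Stata; for example:
. ssc install psmatch2, replace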
After installing the psmatch2 routine, caliper matching with logit propensity scores and common
support can be implemented with the following command:
. psmatch2 t $X, common logit caliper(0.0001) outcome(re78)
------------------------------------------------------------------------------
t | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .3316903 .1203299 2.76 0.006 .0958482 .5675325
age2 | -.0063668 .0018554 -3.43 0.001 -.0100033 -.0027303
educ | .849268 .3477058 2.44 0.015 .1677771 1.530759
educ2 | -.0506202 .0172493 -2.93 0.003 -.0844282 -.0168122
marr | -1.885542 .2993309 -6.30 0.000 -2.472219 -1.298864
black | 1.135972 .3517854 3.23 0.001 .4464852 1.825459
hisp | 1.96902 .5668594 3.47 0.001 .857996 3.080044
re74 | -1.058961 .3525178 -3.00 0.003 -1.749883 -.3680387
re75 | -2.168541 .4142324 -5.24 0.000 -2.980422 -1.35666
re742 | .2389164 .0642927 3.72 0.000 .112905 .3649278
re752 | .0135926 .0665375 0.20 0.838 -.1168185 .1440038
blacku74 | 2.14413 .4268152 5.02 0.000 1.307588 2.980673
_cons | -7.474743 2.443511 -3.06 0.002 -12.26394 -2.68555
------------------------------------------------------------------------------
Note: 22 failures and 0 successes completely determined.
There are observations with identical propensity score values.
The sort order of the data could affect your results.
Make sure that the sort order is random before calling psmatch2.
--------------------------------------------------------------------------------------
Variable Sample | Treated Controls Difference S.E. T-stat
----------------------------+---------------------------------------------------------
re78 Unmatched | .634914353 2.1553921 -1.52047775 .115461434 -13.17
ATT | .672171543 .443317968 .228853575 .438166333 0.52
----------------------------+---------------------------------------------------------
Note: S.E. for ATT does not take into account that the propensity score is estimated.
The “Difference” column in the “ATT” row of the above results presents the estimated treatment effect. Using a true caliper matching process, the estimates of $2,288.54 and $4,381.66 for the effect and the standard error, respectively, are much more comparable to the results previously obtained.
CONCLUDING COMMENTS
Results obtained from the two-equation system advanced by Heckman over 30 years ago are sensitive to the correctness of the equations and their identification. On the other hand, methods such as propensity score matching depend on the validity of the logit or probit functions estimated, as well as on the methods used to obtain smoothness in the kernel density estimator. Someone using Heckman’s original selection adjustment method can easily have their results replicated in LIMDEP, STATA and SAS. Such is not the case with propensity score matching: propensity score matching results are highly sensitive to the computer program employed, while Heckman’s original sample selection adjustment method can be relied on to give comparable results across programs.
REFERENCES
Becker, William and William Walstad. “Data Loss From Pretest to Posttest As a Sample
Selection Problem,” Review of Economics and Statistics, Vol. 72, February 1990: 184-188.
Becker, William and John Powers. “Student Performance, Attrition, and Class Size Given
Missing Student Data,” Economics of Education Review, Vol. 20, August 2001: 377-388.
Heckman, James. “Sample Selection Bias as a Specification Error,” Econometrica, Vol. 47, January 1979: 153-161.
Huynh, Kim, David Jacho-Chavez, and James K. Self. “The Efficacy of Collaborative Learning
Recitation Sessions on Student Outcomes?” American Economic Review, (Forthcoming May
2010).
LaLonde, R., “Evaluating the Econometric Evaluations of Training Programs with Experimental
Data,” American Economic Review, Vol. 76, 4, 1986, 604-620.
Saunders, Phillip. The TUCE III Data Set: Background information and file codes
(documentation, summary tables, and five 3.5-inch double-sided, high density disks in ASCII
format). New York: National Council on Economic Education, 1994.
ENDNOTES
i
Huynh, Jacho-Chavez, and Self (2010) have a data set that enables them to account for selection
into, out of and between collaborative learning sections of a large principles course in their change-score
modeling.
ii
Users can also estimate the logit model with STATA’s default logit command. The
predicted probabilities from the logit estimation are equivalent to the propensity scores
automatically provided with the pscore command. Since STATA does not offer any default
matching routine to use following the default logit command, we adopt the use of the pscore
routine (the download of which includes several matching routines to calculate treatment
effects). The pscore routine also tests the balancing hypothesis and provides other relevant
information for propensity score matching which is not provided by the default logit command.
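For example, a sketch of this default-logit alternative, assuming the covariate list stored in the global macro $X used in the text; predict with the pr option returns the fitted probabilities:
. logit t $X
. predict double phat, pr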
iii
The kernel density estimator is a nonparametric estimator. Unlike a parametric estimator (which imposes a specific functional form), a nonparametric estimator has no fixed structure and is based on a histogram of all the data. Histograms are bar charts, which are not smooth, and whose shape depends on the width of the bins into which the data are divided. In essence, with a fixed bin width, the kernel estimator smooths out the histogram by centering each of the bins at each data point rather than fixing the end points of the bins. The optimal bin width is a subject of debate and well beyond the technical level of this module.
MODULE FOUR, PART FOUR: SAMPLE SELECTION
Part Four of Module Four provides a cookbook-type demonstration of the steps required to use
SAS in situations involving estimation problems associated with sample selection. Unlike
LIMDEP and STATA, SAS does not have a procedure or macro available from SAS Institute
specifically designed to match observations using propensity scores. There are a few user-written programs, but these are not well suited to replicating the particular type of sample-selection problems estimated in LIMDEP and STATA. As such, this segment will go as far as SAS permits in replicating what is done in Parts Two and Three in LIMDEP and STATA. Users of this module need to have completed Module One, Parts One and Four, but not necessarily Modules Two and
Three. From Module One users are assumed to know how to get data into SAS, recode and
create variables within SAS, and run and interpret regression results. Module Four, Parts Two
and Three demonstrate in LIMDEP and STATA what is done here in SAS.
The change score or difference in difference model is used extensively in education research.
Yet, before Becker and Walstad (1990), little if any attention was given to the consequence of
missing student records that result from: 1) "data cleaning" done by those collecting the data, 2)
student unwillingness to provide data, or 3) students self-selecting into or out of the study. The
implications of these types of sample selection are shown in the work of Becker and Powers
(2001) where the relationship between class size and student learning was explored using the
third edition of the Test of Understanding in College Economics (TUCE), which was produced
by Saunders (1994) for the National Council on Economic Education (NCEE), since renamed the
Council for Economic Education.
Module One, Part Four showed how to get the Becker and Powers data set “beck8WO.csv” into SAS. As a brief review, this was done with the following data step:
data BECPOW;
infile 'C:\Users\gregory.gilpin\Desktop\BeckerWork\BECK8WO.CSV'
delimiter = ',' MISSOVER DSD lrecl=32767 ;
informat A1 best32.; informat A2 best32.; informat X3 best32.;
informat C best32. ; informat AL best32.; informat AM best32.;
informat AN best32.; informat CA best32.; informat CB best32.;
informat CC best32.; informat CH best32.; informat CI best32.;
informat CJ best32.; informat CK best32.; informat CL best32.;
informat CM best32.; informat CN best32.; informat CO best32.;
informat CS best32.; informat CT best32.; informat CU best32.;
informat CV best32.; informat CW best32.; informat DB best32.;
informat DD best32.; informat DI best32.; informat DJ best32.;
informat DK best32.; informat DL best32.; informat DM best32.;
informat DN best32.; informat DQ best32.; informat DR best32.;
informat DS best32.; informat DY best32.; informat DZ best32.;
informat EA best32.; informat EB best32.; informat EE best32.;
informat EF best32.; informat EI best32.; informat EJ best32.;
informat EP best32.; informat EQ best32.; informat ER best32.;
informat ET best32.; informat EY best32.; informat EZ best32.;
informat FF best32.; informat FN best32.; informat FX best32.;
informat FY best32.; informat FZ best32.; informat GE best32.;
informat GH best32.; informat GM best32.; informat GN best32.;
informat GQ best32.; informat GR best32.; informat HB best32.;
informat HC best32.; informat HD best32.; informat HE best32.;
informat HF best32.;
The variables referred to in the recoding below include:
an: preTUCE score (0 to 30)
ge: Student evaluation measured interest
gh: Student evaluation measured textbook quality
gm: Student evaluation measured regular instructor’s English ability
gq: Student evaluation measured overall teaching effectiveness
ci: Instructor sex (Male = 1, Female = 2)
ck: English is native language of instructor (Yes = 1, No = 0)
cs: PostTUCE score counts toward course grade (Yes = 1, No = 0)
ff: GPA*100
fn: Student had high school economics (Yes = 1, No = 0)
ey: Student’s sex (Male = 1, Female = 2)
fx: Student working in a job (Yes = 1, No = 0)
Separate dummy variables need to be created for each type of school (A2); this is done with the a2 recodes and the doc, comp, lib, and twoyr assignments shown in the combined data step below.
To create a dummy variable for whether the instructor had a PhD we use
phd = 0;
if dj = 3 then phd = 1;
To create a dummy variable for whether the student took the postTUCE we use
final = 0;
if cc > 0 then final = 1;
To create a dummy variable for whether a student did (noeval = 0) or did not (noeval = 1)
complete a student evaluation of the instructor we use
evalsum = ge+gh+gm+gq;
noeval= 0;
if evalsum = -36 then noeval = 1;
“Noeval” reflects whether the student was around toward the end of the term, attending classes, and sufficiently motivated to complete an evaluation of the instructor. In the Saunders data set, evaluation questions with no answer were coded -9; thus, a sum of -36 across these four questions indicates that none of them was answered.
The change score (the postTUCE score, cc, minus the preTUCE score, an) is created with
change = cc - an;
Finally, there was a correction for the term in which student record 2216 was incorrectly
recorded:
if hb = 90 then hb = 89;
All of these recoding and variable-creation commands are entered into the SAS editor file as follows:
data becpow;
set becpow;
if 99 < A2 < 200 then a2 = 1;
if 199 < A2 < 300 then a2 = 2;
if 299 < A2 < 400 then a2 = 3;
if 399 < A2 < 500 then a2 = 4;
doc = 0; comp = 0; lib = 0; twoyr = 0;
if a2 = 1 then doc = 1;
if a2 = 2 then comp = 1;
if a2 = 3 then lib = 1;
if a2 = 4 then twoyr = 1;
phd = 0;
if dj = 3 then phd = 1;
final = 0;
if cc > 0 then final = 1;
evalsum = ge+gh+gm+gq;
noeval= 0;
if evalsum = -36 then noeval = 1;
change = cc - an;
if hb = 90 then hb = 89;
run;
data becpow;
set becpow;
if AN=-9 then delete;
if HB=-9 then delete;
if ci=-9 then delete;
if ck=-9 then delete;
if cs=0 then delete;
if cs=-9 then delete;
if a2=-9 then delete;
if phd=-9 then delete;
run;
These data entry and management commands will appear in the SAS output file for the equations to be estimated in the next section.
THE PROPENSITY TO TAKE THE POSTTEST AND THE CHANGE SCORE EQUATION
To address attrition-type sample selection problems in change score studies, Becker and Powers
first add observations that were dropped during the early stage of assembling data for TUCE III.
Becker and Powers do not have any data on students before they enrolled in the course and thus
cannot address selection into the course, but to examine the effects of attrition (course
withdrawal) they introduce three measures of class size (beginning, ending, and average) and
argue that initial or beginning class size is the critical measure for assessing learning over the
entire length of the course. i To show the effects of initial class size on attrition (as discussed in
Module Four, Part One) they employ what is now the simplest and most restrictive of sample
correction methods, which can be traced to James Heckman (1979), recipient of the 2000 Nobel
Prize in Economics.
From Module Four, Part One, we have the data generating process for the difference between post and preTUCE scores for the ith student ($\Delta y_i$):

$$\Delta y_i = \mathbf{X}_i \boldsymbol{\beta} + \varepsilon_i = \beta_1 + \sum_{j=2}^{k} \beta_j x_{ji} + \varepsilon_i \qquad (1)$$
where the data set of explanatory variables is the matrix X, in which $\mathbf{X}_i$ is the row of $x_{ji}$ values for the relevant variables believed to explain the ith student’s pretest and posttest scores, the $\beta_j$’s are the associated slope coefficients in the vector $\boldsymbol{\beta}$, and $\varepsilon_i$ is the individual random shock (caused, for example, by unobservable attributes, events or environmental factors) that affects the ith student’s test scores. Sample selection associated with students’ unwillingness to take the posttest (dropping the course) results in correlation between the population error term and the regressors, which biases the coefficient estimators in this change-score model and makes them inconsistent.
The data generating process for the ith student’s propensity to take the posttest is:

$$T_i^* = \mathbf{H}_i \boldsymbol{\alpha} + \omega_i \qquad (2)$$

where
$T_i = 1$, if $T_i^* > 0$, and student i has a posttest score, and
$T_i = 0$, if $T_i^* \leq 0$, and student i does not have a posttest score.
H is the matrix of explanatory variables that are believed to drive these propensities, and $\boldsymbol{\omega}$ is the vector of unobservable random shocks that affect each student’s propensity.
Adjusting for attrition between the pretest and posttest, as reflected in the absence of a posttest score for the ith student ($T_i = 0$), requires a Heckman correction for the bias caused by excluding those students from the change-score regression: equation (2) is estimated and an inverse Mills ratio is calculated for each student who has a pretest. This inverse Mills ratio is then added to the change-score regression (1) as another explanatory variable. In essence, the inverse Mills ratio adjusts the error term for the missing students.
For the Heckman adjustment for sample selection, each disturbance in the vector $\boldsymbol{\varepsilon}$ of equation (1) is assumed to be distributed bivariate normal with the corresponding disturbance term in the vector $\boldsymbol{\omega}$ of the selection equation (2). Thus, for the ith student we have:

$$E(\boldsymbol{\varepsilon}) = E(\boldsymbol{\omega}) = \mathbf{0},\quad E(\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}') = \sigma_{\varepsilon}^{2}\mathbf{I},\quad E(\boldsymbol{\omega}\boldsymbol{\omega}') = \mathbf{I},\quad E(\boldsymbol{\varepsilon}\boldsymbol{\omega}') = \rho\sigma_{\varepsilon}\mathbf{I}. \qquad (4)$$

That is, the disturbances have zero means, unit variance, and no covariance among students, but there is covariance between selection in getting a posttest score and the measurement of the change score.
The regression for this censored sample of the $n_{T=1}$ students who took the posttest is now:

$$E(\Delta y_i \mid T_i = 1) = \mathbf{X}_i \boldsymbol{\beta} + \rho\sigma_{\varepsilon}\lambda_i ,$$

where $\lambda_i$ is the inverse Mills ratio (or hazard) such that $\lambda_i = f(-T_i^*)/[1 - F(-T_i^*)]$, and $f(\cdot)$ and $F(\cdot)$ are the standard normal density and distribution functions. $\lambda_i$ is the standardized mean of the disturbance term $\omega_i$ for the ith student who took the posttest; it is close to zero only for those well above the $T = 1$ threshold. The values of $\lambda$ are generated from the estimated probit selection equation (2) for all students.
The probit selection equation (2) is estimated in SAS with the qlim procedure, where the “/ discrete” option on the model statement tells SAS to estimate the model by probit.
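A minimal sketch of such a command, assuming proc qlim and the variables created above; the regressor list shown here is only illustrative of the specification used in the earlier parts of this module:
proc qlim data=becpow;
   /* illustrative probit for the propensity to take the posttest (final = 1) */
   model final = an hb doc comp lib ci ck noeval / discrete;
run;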
The command for estimating the adjusted change-score equation, using both the inverse Mills ratio as a regressor and maximum likelihood estimation of $\rho$ and $\sigma_{\varepsilon}$, adds a second model statement with the extension “/ select(final = 1)”, which tells SAS that selection is on observations with the variable final equal to 1.
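A minimal sketch of the combined selection and change-score estimation in proc qlim, again with an illustrative regressor list; the two model statements are estimated jointly by maximum likelihood:
proc qlim data=becpow;
   /* selection equation: probit for whether the posttest was taken */
   model final = an hb doc comp lib ci ck noeval / discrete;
   /* change-score equation, estimated only for observations with final = 1 */
   model change = hb doc comp lib ci ck noeval / select(final=1);
run;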
As described in Module One, Part Four, entering all of these commands into the editor window in SAS and pressing the RUN button produces the SAS output file; the key estimation results are discussed below.
In the estimated probit model, the beginning or initial class size is negatively and highly significantly related to the propensity to take the posttest, with a one-tail p value of 0.0056. In the corresponding change-score equation employing the inverse Mills ratio, the change score is negatively and significantly related to class size, with a one-tail p value of 0.0347, although it takes an additional 100 students to lower the change score by a point. The maximum likelihood results also contain separate estimates of $\rho$ and $\sigma_{\varepsilon}$. Note that the coefficients are slightly different from those provided by LIMDEP. This is due to the maximization algorithm used in proc qlim, which is the Newton-Raphson method. Currently, SAS does not have any other standard routine to perform Heckman’s two-step procedure; it should be noted that there are a few user-written programs that can be implemented.
AN APPLICATION OF PROPENSITY SCORE MATCHING
Unfortunately, we are not aware of a study in economic education for which propensity score
matching has been used. Thus, we looked outside economic education and elected to redo the
example reported in Becker and Ichino (2002). This application and data are derived from
Dehejia and Wahba (1999), whose study, in turn was based on LaLonde (1986). The data set
consists of observed samples of treatments and controls from the National Supported Work
demonstration. Some of the institutional features of the data set are given by Becker and Ichino.
The data were downloaded from the website http://www.nber.org/~rdehejia/nswdata.html. The
data set used here is in the original text form, contained in the data file “matchingdata.txt.” They
have been assembled from the several parts in the NBER archive.
Becker and Ichino report that they were unable to replicate Dehejia and Wahba’s results,
though they did obtain similar results. (They indicate that they did not have the original authors’
specifications of the number of blocks used in the partitioning of the range of propensity scores,
significance levels, or exact procedures for testing the balancing property.) In turn, we could not precisely replicate Becker and Ichino’s results, although we can identify the reason, as discussed below, and we likewise obtain similar results.
There are 2,675 observations in the data set: 2,490 controls (with t = 0) and 185 treated observations (with t = 1). The variables used from the raw data set are the treatment indicator t, age, educ, black, hisp, marr, nodegree, and the earnings variables re74, re75, and re78 (re78 is the outcome).
We will analyze these data following Becker and Ichino’s line of analysis. We assume
that you have completed Module One, Part Two, and thus are familiar with placing commands in
the editor and using the RUN button to submit commands, and where results are found in the
output window. In what follows, we will simply show the commands you need to enter into SAS
to produce the results that we will discuss.
To start, the data are imported using the import wizard. The file is most easily imported by specifying it as a ‘delimited file *.*’. When providing the location of the file, click ‘options’, then select the Delimiter ‘space’ and uncheck the box for ‘Get variable names from first row’. In what follows, I call the imported dataset ‘match’. As ‘match’ does not have proper variable names, this is easily corrected using a data step:
data match (keep = t age educ black hisp marr nodegree re74 re75 re78);
rename var3 = t var5 = age var7 = educ var9 = black var11 = hisp
var13 = marr var15 = nodegree var17 = re74 var19 = re75
var21 = re78;
set match;
run ;
The transformed variables to be added to the data set are:
age2 = age squared
educ2 = educ squared
re742 = re74 squared
re752 = re75 squared
blacku74 = black × 1(re74 = 0), i.e., black interacted with an indicator for zero 1974 earnings
In order to improve the readability of some of the reported results, we have divided the
income variables by 10,000. (This is also an important adjustment that accommodates a
numerical problem with the original data set. This is discussed below.) The outcome variable is
re78.
The data are set up and described first. The transformations used to create the
transformed variables are
data match;
set match;
age2 = age*age; educ2 = educ*educ;
re74 = re74/10000; re75 = re75/10000; re78 = re78/10000;
re742 = re74*re74; re752 = re75*re75;
blacku74 = black*(re74 = 0);
run;
The data are then described with standard summary statistics.
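A minimal sketch of such a describing step, assuming proc means; the variable list and options are illustrative, and the printed summary statistics are omitted here:
proc means data=match n mean std min max;
   var t age age2 educ educ2 black hisp marr nodegree re74 re75 re742 re752 re78 blacku74;
run;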
We next fit the logit model for the propensity scores. An immediate problem arises with the data set as used by Becker and Ichino. The income data are in raw dollar terms – the mean of re74, for example, is $18,230.00. Its square, which is on the order of 300,000,000 (as is the square of re75), is included in the logit equation along with a dummy variable for Hispanic, which is zero for 96.5% of the observations, and the blacku74 dummy variable, which is zero for 94.5% of the observations. Because of the extreme difference in magnitudes, estimation of the logit model in this form is next to impossible. But rescaling the data by dividing the income variables by 10,000 addresses the instability problem. ii These transformations are shown in the second set of commands above. Rescaling has no impact on the results produced with the data, other than stabilizing the estimation of the logit equation. We are now able to replicate the Becker and Ichino results except for an occasional very low-order digit.
The logit model from which the propensity scores are obtained is fit as follows.
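A minimal sketch of one way to fit this logit in SAS, assuming proc logistic; the descending option makes SAS model the probability that t = 1, and the output data set name and pscore variable are illustrative:
proc logistic data=match descending;
   model t = age age2 educ educ2 marr black hisp re74 re75 re742 re752 blacku74;
   /* save the predicted probabilities (propensity scores) */
   output out=match_ps p=pscore;
run;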
(Note: Becker and Ichino’s coefficients on re74 and re75 are multiplied by 10,000, and
coefficients on re742 and re752 are multiplied by 100,000,000.)
The logit results provide the predicted probabilities to be used in matching algorithms. As discussed in the introduction to this part, SAS does not have a procedure or macro specifically designed to match observations to estimate treatment effects. We refer the reader to Parts Two and Three of this module for further discussion of how to implement matching procedures in LIMDEP and STATA.
CONCLUDING COMMENTS
Results obtained from the two-equation system advanced by Heckman over 30 years ago are sensitive to the correctness of the equations and their identification. On the other hand, methods such as propensity score matching depend on the validity of the logit or probit functions estimated, as well as on the methods used to obtain smoothness in the kernel density estimator. Someone using Heckman’s original selection adjustment method can easily have their results replicated in LIMDEP, STATA and SAS. Such is not the case with propensity score matching: propensity score matching results are highly sensitive to the computer program employed, while Heckman’s original sample selection adjustment method can be relied on to give comparable results across programs.
REFERENCES
Becker, William and William Walstad. “Data Loss From Pretest to Posttest As a Sample
Selection Problem,” Review of Economics and Statistics, Vol. 72, February 1990: 184-188.
Becker, William and John Powers. “Student Performance, Attrition, and Class Size Given
Missing Student Data,” Economics of Education Review, Vol. 20, August 2001: 377-388.
Heckman, James. “Sample Selection Bias as a Specification Error,” Econometrica, Vol. 47, January 1979: 153-161.
Huynh, Kim, David Jacho-Chavez, and James K. Self. “The Efficacy of Collaborative Learning
Recitation Sessions on Student Outcomes?” American Economic Review, (Forthcoming May
2010).
LaLonde, R., “Evaluating the Econometric Evaluations of Training Programs with Experimental
Data,” American Economic Review, Vol. 76, 4, 1986, 604-620.
Saunders, Phillip. The TUCE III Data Set: Background information and file codes
(documentation, summary tables, and five 3.5-inch double-sided, high density disks in ASCII
format). New York: National Council on Economic Education, 1994.
ENDNOTES
i
Huynh, Jacho-Chavez, and Self (2010) have a data set that enables them to account for selection
into, out of and between collaborative learning sections of a large principles course in their change-score
modeling.
ii
An attempt to compute a linear regression of the original RE78 on the original unscaled other
variables is successful, but produces a warning that the condition number of the X matrix is 6.5 × 10^9.
When the data are scaled as done above, no warning about multicollinearity is given.