
INTRODUCTION TO AN ONLINE HANDBOOK FOR THE USE OF UP-TO-DATE ECONOMETRICS IN ECONOMIC EDUCATION RESEARCH


William E. Becker
Professor of Economics, Indiana University, Bloomington, Indiana, USA
Adjunct Professor of Commerce, University of South Australia, Adelaide, Australia
Research Fellow, Institute for the Study of Labor (IZA), Bonn, Germany
Fellow, Center for Economic Studies and Institute for Economic Research (CESifo), Munich, Germany

The Council for Economic Education (first named the Joint Council on Economic Education and
then the National Council on Economic Education) together with the American Economic
Association Committee on Economic Education have a long history of advancing the use of
econometrics for assessment aimed at increasing the effectiveness of teaching and learning
economics. As described by Welsh (1972), the first formal program was funded by the General
Electric Education Fund and held in 1969 and 1970 at Carnegie-Mellon University, and was a
contributing force behind the establishment of the Journal of Economic Education in 1969 as an
outlet for research on the teaching of economics. As described in Buckles and Highsmith
(1990), in 1987 and 1988 a second econometrics training program was sponsored by the Pew
Charitable Trusts and held at Princeton University, with papers featured in the Journal of
Economic Education, Summer 1990. Since that time there have been major advances in
econometrics and its applications in education research -- most notably, in the specification of
the data generating processes and computer programs for estimating the models representing
those processes.

Becker and Baumol (1996) called attention to the importance that Sir Ronald A. Fisher
assigned to "random arrangements" in experimental design. i Fisher designed experiments in
which plots of land were randomly assigned to different fertilizer treatments to assess differences
in yield. By measuring the mean yield on each of several different randomly assigned plots of
land, Fisher eliminated or "averaged out" the effects of nontreatment influences (such as weather
and soil content) so that only the effect of the choice of fertilizer was reflected in differences
among the mean yields. For educational research, Fisher's randomization concept became the
ideal: hypothetical classrooms (or students) are treated as if they could be assigned randomly to
different instructional procedures.

In education research, however, the data are typically not generated by well-defined
experiments employing random sampling procedures. Our ability to extract causal inferences
from an analysis is affected by sample selection procedures and estimation methods. The
absence of experimental data has undoubtedly provided the incentive for econometricians to
exert their considerable ingenuity to devise powerful methods to identify, separate out, and
evaluate the magnitude of the influence exercised by each of the many variables that determine
the shape of any economic phenomenon. Econometricians have been pioneers in the design of
techniques to deal with missing observations, missing variables, errors in variables, simultaneous
relationships and pooled cross-section and time-series data. In short, and as comprehensively reviewed by Imbens and Wooldridge (2009), they have found it useful to design an armory of
analytic weapons to deal with the messy and dirty statistics obtained from opportunistic samples,
seeking to extract from them the ceteris paribus relationships that empirical work in the natural
sciences is able to average out with the aid of randomized experiments. At the request of the
Council for Economic Education and the American Economic Association Committee on
Economic Education, I have developed four modules that will enable researchers to employ these
weapons in their empirical studies of educational practices, with special attention given to the
teaching and learning of economics.

MODULE ONE

Although least-squares estimated multiple regression was first employed in policy debate by George Yule at the turn of the 20th century, economic educators and general educationalists alike continue to rely on it to remove the effects of unwanted influences. ii This is the first of the four
modules designed to demonstrate and enable researchers to move beyond these basic regression
methods to the more advanced techniques of the 21st century using any one of three computer
programs: LIMDEP (NLOGIT), STATA and SAS.

This module is broken into four parts. Part One introduces the nature of data and the
basic data generating processes for both continuous and discrete dependent variables. The lack
of a fixed variance in the population error term (heteroscedasticity) is introduced and discussed.
Parts Two, Three and Four show how to get that data into each one of the three computer
programs: Part Two for LIMDEP (NLOGIT), Part Three for STATA (by Ian McCarthy, Senior
Consultant, FTI Consulting) and Part Four for SAS (by Gregory Aaron William Gilpin, Assistant
Professor of Economics, Montana State University). Parts Two, Three and Four also provide
the respective computer program commands to do least-squares estimation of the standard
learning regression model involving a continuous dependent test-score variable but with the
procedures to adjust for heteroscedastic errors. The maximum likelihood routines to estimate
probit and logit models of discrete choice are also provided using each of the three programs.
Finally, statistical tests of coefficient restrictions and model structure are presented.

MODULE TWO

The second of four modules is devoted to endogeneity in educational studies. Endogeneity is a problem caused by explanatory variables that are related to the error term in a population regression model. As explained in Part One of Module Two, this error term and regressor
dependence can be caused by omitted relevant explanatory variables, errors in the explanatory
variables, simultaneity (reverse causality between y and an x), and other sources emphasized in
subsequent modules. Endogeneity makes least squares estimators biased and inconsistent. The
uses of natural experiments, instrumental variables and two-stage least squares are presented as
means for addressing endogeneity. Parts Two, Three and Four show how to perform and provide
the commands for two-stage least squares estimation in LIMDEP (NLOGIT), STATA (by Ian
McCarthy) and SAS (by Gregory Gilpin) using data from a study of the relationship between
multiple-choice test scores and essay-test scores.

MODULE THREE (co-authored by W. E. Becker, J.J. Siegfried and W. H. Greene)

As seen in Modules One and Two, the typical economic education empirical study involves an
assessment of learning between a pretest and a posttest. Other than the fact that testing occurs at
two different points in time (before and after an intervention), there is no time dimension in this
model of learning. Panel data analysis provides an alternative structure in which measurements
on the cross section of subjects are taken at regular intervals over multiple periods of time.
Collecting data on the cross section of subjects over time enables a study of change. It also
opens the door for economic education researchers to look at things other than test scores that
vary with time.

This third in the series of four modules provides an introduction to panel data analysis
with specific applications to economic education. The data structure for a panel along with
constant coefficient, fixed effects and random effects representations of the data generating
processes are presented. Consideration is given to different methods of estimation and testing.
Finally, as in Modules One and Two, contemporary estimation and testing procedures are
demonstrated in Parts Two, Three and Four using LIMDEP (NLOGIT), STATA (by Ian
McCarthy) and SAS (by Gregory Gilpin).

MODULE FOUR (co-authored by W. E. Becker and W. H. Greene)

In the assessment of student learning that occurs between the start of a program (as measured, for
example, by a pretest) and the end of the program (posttest), there is an assumption that all the
students who start the program finish the program. There is also an assumption that those who
start the program are representative of, or at least are a random sample of, those for whom an
inference is to be made about the outcome of the program. This module addresses how these
assumptions might be wrong and how problems of sample selection might occur because of
unobservable or unmeasurable phenomena as well as things that can be observed and measured.
Attention is given to the Heckman-type models, regression discontinuity and propensity score
matching. As in the previous modules, contemporary estimation and testing procedures are
demonstrated in Parts Two, Three and Four using LIMDEP (NLOGIT), STATA (by Ian
McCarthy) and SAS (by Gregory Gilpin).

RECOGNITION

As with the 1969-1970 and 1987-1988 workshops, these four modules have been made possible
through a cooperative effort of the National Council on Economic Education and the American
Economic Association Committee on Economic Education. This new work is part of the
Absolute Priority direct activity component of the Excellence in Economic Education grant to the
NCEE and funded by the U.S. Department of Education Office of Innovation and Improvement.
Special thanks are due Ian McCarthy and Greg Gilpin for their care in duplicating in STATA and
SAS what I did in LIMDEP (in Modules One, Two and Three). Comments and constructive criticism received from William Bosshardt, Jennifer (gigi) Foster, Peter Kennedy, Mark Maier,
KimMarie McGoldrick, Gail Hoyt, Martin Shanahan, Robert Toutkoushian and Michael Watts in
the beta testing are also gratefully acknowledged. Finally, as with all my writing, Suzanne
Becker must be thanked for her patience in dealing with me and excellence in editing.

REFERENCES

Becker, William E. and William J. Baumol (1996). Assessing Educational Practices: The Contribution of Economics. Cambridge, MA: MIT Press.

Buckles, Stephen and Robert Highsmith (1990). “Preface to Special Research Issue,” Journal of
Economic Education, (Summer): 229-230.

Clogg, C. C. (1992). "The Impact of Sociological Methodology on Statistical Methodology," Statistical Science, (May): 183-196.

Fisher, R. A. (1970). Statistical Methods for Research Workers, 14th ed. New York: Hafner.

Imbens, Guido W. and Jeffrey M. Wooldridge (2009). "Recent Developments in the Econometrics of Program Evaluation," Journal of Economic Literature, (March): 5-86.

Simpson, E.H. (1951). “The interpretation of interaction in contingency tables”, Journal of the
Royal Statistical Society, Series B, 13: 238-241.

Stigler, Stephen M. (1986). The History of Statistics: The Measurement of Uncertainty before 1900. Cambridge, MA: Harvard University Press.

Welsh, Arthur L. (1972). Research Papers in Economic Education. New York: Joint Council on
Economic Education.

ENDNOTES
                                                       
i. In what is possibly the most influential book in statistics, Statistical Methods for
Research Workers, Sir Ronald Fisher wrote:

The science of statistics is essentially a branch of Applied Mathematics, and may be regarded as mathematics applied to observational data. (p. 1)

Statistical methods are essential to social studies, and it is principally by the aid of such
methods that these studies may be raised to the rank of sciences. This particular
dependence of social studies upon statistical methods has led to the unfortunate
misapprehension that statistics is to be regarded as a branch of economics, whereas in
truth methods adequate to the treatment of economic data, in so far as these exist, have
mostly been developed in biology and the other sciences. (1970, p. 2)

Fisher's view, traceable to the first version of his book in 1925, is still held today by many
scholars in the natural sciences and departments of mathematics. Econometricians (economists
who apply statistical methods to economics), psychometricians, cliometricians, and other
"metricians" in the social sciences have different views of the process by which statistical
methods have developed. Given that Fisher's numerous and great contributions to statistics were
in applications within biology, genetics, and agriculture, his view is understandable although it is
disputed by sociologist Clifford Clogg and the numerous commenters on his article "The Impact
of Sociological Methodology on Statistical Methodology," Statistical Science, (May 1992).
Econometricians certainly have contributed a great deal to the tool box, ranging from
simultaneous equation estimation techniques to a variety of important tests of the validity of a
statistical inference.

ii. George Yule (1871-1951) designed "net or partial regression" to represent the influence
of one variable on another, holding other variables constant. He invented the multiple
correlation coefficient R for the correlation of y with many x's. Yule's regressions looked much
like those of today. In 1899, for instance, he published a study in which changes in the
percentage of persons in poverty in England between 1871 and 1881 were explained by the
change in the percentage of disabled relief recipients to total relief recipients (called the "out-
relief ratio"), the percentage change in the proportion of old people, and the percentage change in
the population.

Predicted percentage change in pauperism = −27.07 percent + 0.299 (percentage change in out-relief ratio) + 0.271 (percentage change in proportion of old) + 0.064 (percentage change in population).

Stigler (1986, pp. 356-7) reports that although Yule's regression analysis of poverty was
well known at the time, it did not have an immediate effect on social policy or statistical
practices. In part this meager response was the result of the harsh criticism it received from the
leading English economist, A.C. Pigou. In 1908, Pigou wrote that statistical reasoning could not
be rightly used to establish the relationship between poverty and out-relief because even in a
multiple regression (which Pigou called "triple correlation") the most important influences,
superior program management and restrictive practices, cannot be measured quantitatively.
Pigou thereby offered the most enduring criticism of regression analysis: the possibility
that an unmeasured but relevant variable has been omitted from the regression and that it is this
variable that is really responsible for the appearance of a causal relationship between the
dependent variable and the included regressors. Both Yule and Pigou recognized the difference
between marginal and partial association.
Today some statisticians assign credit for this identification to E. H. Simpson (1951); the
proposition referred to as "Simpson's paradox" points out that marginal and partial association
can differ even in direction so that what is true for parts of a sample need not be true for the
entire sample. This is represented in the following figure where the two separate regressions of y
on the high values of x and the low values of x have positive slopes but a single regression fit to
all the data shows a negative slope. As in Pigou's criticism of Yule 90 years ago, this "paradox"
may be caused by an omitted but relevant explanatory variable for y that is also related to x. It
may be better named the "Yule-Pigou paradox" or "Yule-Pigou effect." As demonstrated in
Module Two, much of modern day econometrics has dealt with this problem and its occurrence in education research.

[Figure: scatter plot of y against x. Regressions fit separately to the low-x and high-x clusters of observations each have positive slopes, while a single regression fit to all of the data has a negative slope.]
MODULE ONE, PART ONE: DATA-GENERATING PROCESSES

William E. Becker
Professor of Economics, Indiana University, Bloomington, Indiana, USA
Adjunct Professor of Commerce, University of South Australia, Adelaide, Australia
Research Fellow, Institute for the Study of Labor (IZA), Bonn, Germany
Editor, Journal of Economic Education
Editor, Social Science Research Network: Economic Research Network Educator

This is Part One of Module One. It highlights the nature of data and the data-generating process,
which is one of the key ideas of modern day econometrics. The difference between cross-section
and time-series data is presented and followed by a discussion of continuous and discrete
dependent variable data-generating processes. Least-squares and maximum-likelihood
estimation is introduced along with analysis of variance testing. This module assumes that the user has some familiarity with estimation and testing from previous statistics and introductory econometrics courses. Its purpose is to bring that knowledge up to date. These contemporary
estimation and testing procedures are demonstrated in Parts Two, Three and Four, where data are
respectively entered into LIMDEP, STATA and SAS for estimation of continuous and discrete
dependent variable models.

CROSS-SECTION AND TIME-SERIES DATA

In the natural sciences, researchers speak of collecting data but within the social sciences it is
advantageous to think of the manner in which data are generated either across individuals or over
time. Typically, economic education studies have employed cross-section data. The term cross-section data refers to statistics for each member of a broad set of entities in a given time period, for
example 100 Test of Economic Literacy (TEL) test scores matched to time usage for final
semester 12th graders in a given year. Time-series data, in contrast, are values for a given
category in a series of sequential time periods, e.g., the total number of U.S. students who
completed a unit in high school economics in each year from 1980 through 2008. Cross-section
data sets typically consist of observations of different individuals all collected at a point in time.
Time-series data sets have been primarily restricted to institutional data collected over particular
intervals of time.

More recently empirical work within education has emphasized panel data, which are a
combination of cross-section and time-series data. In panel analysis, the same group of
individuals (a cohort) is followed over time. In a cross-section analysis, things that vary among
individuals, such as sex, race and ability, must either be averaged out by randomization or taken
into account via controls. But sex, race, ability and other personal attributes tend to be constant
from one time period to another and thus do not distort a panel study even though the assignment
of individuals among treatment/control groups is not random. Only one of these four modules
will be explicitly devoted to panel data.

CONTINUOUS DEPENDENT (TEST SCORE) VARIABLES

Test scores, such as those obtained from the TEL or Test of Understanding of College
Economics (TUCE), are typically assumed to be the outcome of a continuous variable Y that may
be generated by a process involving a deterministic component (e.g., the mean of Y, $\mu_y$, which might itself be a function of some explanatory variables $X_1, X_2, \ldots, X_k$) and the purely random perturbation or error term components $v$ and $\varepsilon$:

$$Y_{it} = \mu_y + v_{it} \quad\text{or}\quad Y_{it} = \beta_1 + \beta_2 X_{it2} + \beta_3 X_{it3} + \beta_4 X_{it4} + \varepsilon_{it},$$

where $Y_{it}$ is the test score of the ith person at time t and the it subscripts similarly indicate observations for the ith person on the X explanatory variables at time t. Additionally, normality of the continuous dependent variable is ensured by assuming the error term components are normally distributed with means of zero and constant variances: $v_{it} \sim N(0, \sigma_v^2)$ and $\varepsilon_{it} \sim N(0, \sigma_\varepsilon^2)$.

As a continuous random variable that gets its normal distribution from the error term, Y can in theory take any value. But as a test score, Y is only supported for values greater than zero and less than the maximum test score, which for the TUCE is 30. In addition, multiple-choice test scores like the TUCE can only assume whole-number values between 0 and 30, which poses problems that are addressed in these four modules.

The change score model (also known as the value-added model, gain score model or
achievement model) is just a variation on the above basic model:

$$Y_{it} - Y_{it-1} = \lambda_1 + \lambda_2 X_{it2} + \lambda_3 X_{it3} + \lambda_4 X_{it4} + u_{it},$$

where Yit-1 is the test score of the ith person at time t−1. If one of the X variables is a bivariate
dummy variable included to capture the effect of a treatment over a control, then this model is
called a difference in difference model:

[(mean treatment effect at time t) − (mean control effect at time t)] − [(mean treatment effect at time t−1) − (mean control effect at time t−1)]
= [E(Y_it | treatment = 1) − E(Y_it | treatment = 0)] − [E(Y_it−1 | treatment = 1) − E(Y_it−1 | treatment = 0)]
= [E(Y_it | treatment = 1) − E(Y_it−1 | treatment = 1)] − [E(Y_it | treatment = 0) − E(Y_it−1 | treatment = 0)]
= the lambda on the bivariate treatment variable.

Yit is now referred to as the post-treatment score or posttest and Yit-1 is the pre-treatment score or
pretest. Again, the dependent variable Yit −Yit-1 can be viewed as a continuous random variable,
but for multiple-choice tests, this difference is restricted to whole number values and is bounded
by the absolute value of the test score’s minimum and maximum.
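The following is a minimal sketch, in Python rather than the LIMDEP, STATA or SAS used in Parts Two, Three and Four, of how the change-score (difference in difference) treatment effect can be estimated by least squares; the data, sample size and 3-point treatment effect are simulated assumptions for illustration only.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 400
treatment = rng.integers(0, 2, n)              # 1 = treated, 0 = control (assumed random assignment)
pretest = rng.normal(15, 4, n)                 # simulated pretest scores
posttest = pretest + 2 + 3 * treatment + rng.normal(0, 2, n)   # assumed 3-point treatment effect

change = posttest - pretest                    # change (gain) score
ols = sm.OLS(change, sm.add_constant(treatment.astype(float))).fit()
print(ols.params)                              # slope estimates the treatment effect

# The same estimate is the difference in mean gains between the two groups:
print(change[treatment == 1].mean() - change[treatment == 0].mean())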

This difference in difference model is often used with cross-section data that ignores
time-series implications associated with the dependent variable (and thus the error term)
involving two periods. For such models, ordinary least-squares estimation as performed in EXCEL and all other computer programs is sufficient. However, time sequencing of testing can
cause problems. For example, as will be demonstrated in Module Three on sample selection, it
is not a trivial problem to work with observations for which there is a pretest (given at the start of
the term) but no posttest scores because the students dropped out of the class before the final
exam was given. Single equation least-squares estimators will be biased and inconsistent if the
explanatory variables and error term are related because of time-series problems.

Following the lead of Hanushek (1986, 1156-57), the change-score model has been thought of as a special case of an allegedly superior regression involving a lagged dependent variable, where the coefficient of adjustment $\lambda_0^*$ is set equal to one for the change-score model:

$$Y_{it} = \lambda_0^* Y_{it-1} + \lambda_1^* + \lambda_2^* X_{it2} + \lambda_3^* X_{it3} + \lambda_4^* X_{it4} + \omega_{it}.$$

Allison (1990) rightfully called this interpretation into question, arguing that these are two
separate models (change score approach and regressor variable approach) involving different
assumptions about the data generating process. If it is believed that there is a direct causal
relationship $Y_{it-1} \Rightarrow Y_{it}$, or if the other explanatory X variables are related to the $Y_{it-1}$ to $Y_{it}$
transition, then the regressor variable approach is justified. But, as demonstrated to economic
educators as far back as Becker (1983), the regressor variable model has a built-in bias
associated with the regression to the mean phenomenon. Allison concluded, “The important
point is that there should be no automatic preference for either model and that the only proper
basis for a choice is a careful consideration of each empirical application . . . . In ambiguous
cases, there may be no recourse but to do the analysis both ways and to trust only those
conclusions that are consistent across methods.” (p. 110)

As pointed out by Allison (1990) and Becker, Greene and Rosen (1990), at roughly the
same time, and earlier by Becker and Salemi (1977) and later by Becker (2004), models to avoid
are those that place a change score on the left-hand side and a pretest on the right. Yet,
educational researchers continue to employ this inherently faulty design. For example, Hake
(1998) constructed a “gap closing variable (g)” as the dependent variable and regressed it on the
pretest:

$$g = \text{gap closing} = \frac{\text{posttest score} - \text{pretest score}}{\text{maximum score} - \text{pretest score}} = f(\text{pretest score}, \ldots)$$
where the pretest and posttest scores were classroom averages on a standardized physics test,
and maximum score was the highest score possible. Apparently, Hake was unaware of the
literature on the gap-closing model. The outcome measure g is algebraically related to the
starting position of the student as reflected in the pretest: g falls as the pretest score rises, for
maximum score > posttest score > pretest score. i Any attempt to regress a posttest-minus-
pretest change score, or its standardized gap-closing measure g on a pretest score yields a biased
estimate of the pretest effect. ii

As an alternative to the change-score models [of the type posttest − pretest = f(treatment, . . .) or posttest = f(pretest, treatment, . . .)], labor economists have turned to a
difference-in-difference model employing a panel data specification to assess treatment effects.
But not all of these are consistent with the change score models discussed here. For example,
Bandiera, Larcinese and Rasul (2010) wanted to assess the effect in the second period of
providing students with information on grades in the first period. In the first period, numerical
grade scores were assigned to each student for course work, but only those in the treatment group were told their scores, and in the second period numerical grade scores were given on essays. That is, the treatment dummy variable equaled one if the student obtained grade information (feedback) on at least 75 percent of his or her course work in the first period, and zero if not.
This treatment dummy then entered in the second period as an explanatory variable for the essay
grade.

More specifically, Bandiera, Larcinese and Rasul estimated the following panel data
model for the ith student, enrolled on a degree program offered by department d, in time period t,

$$g_{idct} = \alpha_i + \beta [F_c \times T_t] + \gamma T_t + \delta X_c + \sum_{d'} \mu_{d'} TD_{id'} + \varepsilon_{idct}$$

where gidct is the ith student’s grade in department d for course (or essay) c at time t and αi is a
fixed effect that captures time-invariant characteristics of the student that affect his or her grade
across time periods, such as his or her underlying motivation, ability, and labor market options
upon graduation. Because each student can only be enrolled in one department or degree
program, αi also captures all department and program characteristics that affect grades in both
periods, such as the quality of teaching and the grading standards. Fc is equal to one if the student obtains feedback on his or her grade on course c, and Tt identifies the first or second time
period, Xc includes a series of course characteristics that are relevant for both examined courses
and essays, and all other controls are as previously defined. TDidˊ is equal to one if student i
took any examined courses offered by department dˊ and is zero otherwise; it accounts for
differences in grades due to students taking courses in departments other than their own
department d. Finally, εidct is a disturbance term.

As specified, this model does not control for past grades (or expected grades), which is
the essence of a change-score model. It should have been specified as either

$$g_{idct} = \alpha_i + \omega g_{idct-1} + \beta [F_c \times T_t] + \gamma T_t + \delta X_c + \sum_{d'} \mu_{d'} TD_{id'} + \varepsilon_{idct}$$

or

$$g_{idct} - g_{idct-1} = \alpha_i + \omega g_{idct-1} + \beta [F_c \times T_t] + \gamma T_t + \delta X_c + \sum_{d'} \mu_{d'} TD_{id'} + \varepsilon_{idct}.$$

Obviously, there is no past grade for the first period and that is in part why a panel data
set up has historically not been used when only “pre” and “post” measures of performance are
available. Notice that the treatment dummy variable coefficient β is inconsistently estimated
with bias if the relevant past course grades in the second period essay-grade equation are
omitted. As discuss in Module Three on panel data studies, bringing in a lagged dependent
variable into panel data analysis poses more estimation problems. The thing emphasized here is
that a change-score model must be employed in assessing a treatment effect. In Module Four,
propensity score matching models are introduced as a means of doing this, as an alternative to the least-squares methods employed in this module.

DISCRETE DEPENDENT VARIABLES

In many problems, the dependent variable cannot be treated as continuous. For example,
whether one takes another economics course is a bivariate variable that can be represented by Y =
1, if yes or 0, if not, which is a discrete choice involving one of two options. As another
example, consider count data of the type generated by the question how many more courses in
economics will a student take? 0, 1, 2 … where increasing positive values are increasingly
unlikely. Grades provide another example of a discrete dependent variable where order matters
but there are no unique number line values that can be assigned. The grade of A is better than B
but not necessarily by the same magnitude that B is better than C. Typically A is assigned a 4, B
a 3 and C a 2 but these are totally arbitrary and do not reflect true number line values. The
dependent variable might also have no apparent order, as the choice of a class to take in a
semester – for example, in the decision to enroll in economics 101, sociology 101, psychology
101 or whatever, one course of study cannot be given a number greater or less than another with
the magnitude having meaning on a number line.

In this module we will address the simplest of the discrete dependent variable models;
namely, those involving the bivariate dependent variable in the linear probability, probit and
logit models.  
 
 
Linear Probability Model

Consider the binary choice model where $Y_i = 1$ with probability $P_i$, or $Y_i = 0$ with probability $(1 - P_i)$. In the linear probability regression model $Y_i = \beta_1 + \beta_2 x_i + \varepsilon_i$, the assumption $E(\varepsilon_i) = 0$ implies $E(Y_i \mid x_i) = \beta_1 + \beta_2 x_i$, where also $E(Y_i \mid x_i) = (0)[1 - (P_i \mid x_i)] + (1)(P_i \mid x_i) = P_i \mid x_i$. Thus, $E(Y_i \mid x_i) = \beta_1 + \beta_2 x_i = P_i \mid x_i$, which we will write simply as $P_i$. That is, the expected value of the 0 or 1 bivariate dependent variable, conditional on the explanatory variable(s), is the probability of a success ($Y = 1$). We can interpret a computer-generated, least-squares prediction of $E(Y \mid x)$ as the probability that $Y = 1$ at that $x$ value.

In addition, the mean of the population error in the linear probability model is zero:

$$E(\varepsilon) = (1 - \beta_1 - \beta_2 x)P + (0 - \beta_1 - \beta_2 x)(1 - P) = P - \beta_1 - \beta_2 x = P - E(Y \mid x) = 0 \quad\text{for } P = E(Y \mid x).$$

However, the least squares Ŷ can be negative or greater than one, which makes it a peculiar
predictor of probability. Furthermore, the variance of epsilon is

$$\operatorname{var}(\varepsilon) = P_i[1 - (\beta_1 + \beta_2 x_i)]^2 + (1 - P_i)(\beta_1 + \beta_2 x_i)^2 = P_i(1 - P_i)^2 + (1 - P_i)P_i^2 = P_i(1 - P_i),$$

which (because Pi depends on xi) means that the linear probability model has a problem of
heteroscedasticity.

An adjustment for heteroscedasticity in the linear probability model can be made via a
generalized least-squares procedure but the problem of constraining β1 + β 2 xi to the zero – one
interval cannot be easily overcome. Furthermore, although predictions are continuous, epsilon
cannot be assumed to be normally distributed as long as the dependent variable is bivariate,
which makes suspect the use of the computer-generated t statistic. It is for these reasons that
linear probability models are no longer widely used in educational research.
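As an illustration of these two problems, the sketch below (Python with simulated data; an assumption-laden example, not part of the original module) fits a linear probability model by least squares with heteroscedasticity-robust standard errors and reports how many fitted "probabilities" fall outside the zero-one interval.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
x = rng.normal(0, 1, n)
p = 1 / (1 + np.exp(-(0.5 + 1.5 * x)))         # assumed true success probability
y = rng.binomial(1, p)                          # 0/1 dependent variable

X = sm.add_constant(x)
lpm = sm.OLS(y, X).fit(cov_type="HC1")          # heteroscedasticity-robust standard errors
yhat = lpm.predict(X)
print(lpm.params, lpm.bse)
print("share of fitted values outside [0, 1]:", ((yhat < 0) | (yhat > 1)).mean())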

Probit Model

Ideally, the estimates of the probability of success (Y = 1) will be consistent with probability
theory with values in the 0 to 1 interval. One way to do this is to specify a probit model, which
is then estimated by computer programs such as LIMDEP, SAS and STATA that use maximum
likelihood routines. Unlike least squares, which selects the sample regression coefficient to
minimize the squared residuals, maximum likelihood selects the coefficients in the assumed
data-generating model to maximize the probability of getting the observed sample data.

The probit model starts by building a bridge or mapping between the 0s and 1s to be
observed for the bivariate dependent variable and an unobservable or hidden (latent) variable that
is assumed to be the driving force for the 0s and 1s:

$$I_i^* = \beta_1 + \beta_2 X_{i2} + \beta_3 X_{i3} + \beta_4 X_{i4} + \varepsilon_i = X_i\beta, \quad\text{where } \varepsilon_i \sim N(0,1),$$

and $I^* > 0$ implies $Y = 1$ and $I^* < 0$ implies $Y = 0$, so that

$$P_i = P(Y = 1 \mid X_i) = P(I_i^* > 0) = P(Z_i \leq X_i\beta) = G(X_i\beta).$$

$G(\,)$ and $g(\,)$ are the standard normal distribution and density functions, and

$$P(Y = 1) = \int_{-\infty}^{X_i\beta} g(t)\,dt.$$

Within economics the latent variable I* is interpreted as net utility or propensity to take
action. For instance, I* might be interpreted as the net utility of taking another economics
course. If the net utility of taking another economics course is positive, then I* is positive,
implying another course is taken and Y = 1. If the net utility of taking another economics course
is negative, then the other course is not taken, I* is negative and Y = 0.

The idea behind maximum likelihood estimation of a probit model is to maximize the
density L with respect to β and σ where the likelihood function is

$$L = f(\boldsymbol{\varepsilon}) = (2\pi\sigma^2)^{-n/2}\exp(-\boldsymbol{\varepsilon}'\boldsymbol{\varepsilon}/2\sigma^2) = (2\pi\sigma^2)^{-n/2}\exp[-(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})'(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})/2\sigma^2].$$

The calculation of $\partial L/\partial\beta$ is not convenient, but the logarithm (ln) of the likelihood function is easily differentiated:

$$\partial \ln L/\partial\beta = L^{-1}\,\partial L/\partial\beta.$$

Intuitively, the strategy of maximum likelihood (ML) estimation is to maximize (the log of) this
joint density for the observed data with respect to the unknown parameters in the beta vector,
where σ is set equal to one. The probit maximum likelihood computation is a little more difficult
than for the standard classical regression model because it is necessary to compute the integrals
of the standard normal distribution. But computer programs can do the ML routines with ease in
most cases if the sample sizes are sufficiently large. See William Greene, Econometric Analysis
(5th Edition, 2003, pp. 670-671) for joint density and likelihood function that leads to the
likelihood equations for ∂ ln L / ∂β .

The unit of measurement, and thus the magnitude, of the probit coefficients is set by the assumption that the variance of the error term ε is unity. That is, the magnitudes of the estimated probit coefficients have no direct meaning on a number line. If the explanatory variables are continuous,
however, the probit coefficients can be employed to calculate a marginal probability of success
at specific values of the explanatory variables:

$$\partial p(x)/\partial x = g(X\beta)\,\beta_x, \quad\text{where } g(\,) \text{ is the density } g(z) = \partial G(z)/\partial z.$$

Interpreting coefficients for discrete explanatory variables is more cumbersome, as demonstrated graphically in Becker and Waldman (1989) and Becker and Kennedy (1992).
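A small sketch of probit estimation by maximum likelihood, and of the marginal probability g(Xβ)βx evaluated at a chosen x value, follows; it uses Python with simulated data, so the coefficients and the evaluation point are illustrative assumptions rather than results from the modules.

import numpy as np
from scipy.stats import norm
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 500
x = rng.normal(0, 1, n)
latent = -0.25 + 0.8 * x + rng.normal(0, 1, n)  # latent index I* with an N(0,1) error
y = (latent > 0).astype(int)                    # observed 0/1 outcome

X = sm.add_constant(x)
probit = sm.Probit(y, X).fit()
b0, b1 = probit.params
print(probit.params)
print("marginal effect at x = 0:", norm.pdf(b0 + b1 * 0) * b1)   # g(Xb) * beta_x
print(probit.get_margeff(at="mean").summary())  # built-in marginal effects for comparison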

Logit Model

An alternative to the probit model is the logit model, which has nearly identical properties to the
probit, but has a different interpretation of the latent variable I*. To see this, again let

$$P_i = P(Y = 1 \mid X_i).$$

The logit model is then obtained as an exponential function

$$P_i = \frac{1}{1 + e^{-X_i\beta}} = \frac{1}{1 + e^{-z_i}} = \frac{e^{z_i}}{1 + e^{z_i}}; \quad\text{thus,}$$

$$1 - P_i = 1 - \frac{e^{z_i}}{1 + e^{z_i}} = \frac{1}{1 + e^{z_i}}, \quad\text{and}$$

$$\frac{P_i}{1 - P_i} = e^{z_i}, \quad\text{which is the odds ratio for success } (Y = 1).$$

The log odds ratio is the latent variable logit equation

$$I_i^* = \ln\!\left(\frac{P_i}{1 - P_i}\right) = z_i = X_i\beta.$$

A graph of the logistic function G(z) = exp(z)/[1+exp(z)] looks like the standard normal, as seen
in the following figure, but does not rise or fall to 1.00 and 0.00 as fast:

[Figure: graph of the logistic function G(z) = exp(z)/(1 + exp(z)).]
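For comparison with the probit, the following Python sketch (simulated data, assumed coefficient values) fits a logit model; the estimated coefficients are effects on the log odds ln[P/(1−P)], so exponentiating them gives odds ratios.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 500
x = rng.normal(0, 1, n)
p = 1 / (1 + np.exp(-(-0.5 + 1.0 * x)))        # assumed logistic success probability
y = rng.binomial(1, p)

logit = sm.Logit(y, sm.add_constant(x)).fit()
print(logit.params)                             # coefficients of the log-odds (latent) equation
print(np.exp(logit.params))                     # odds ratios for a one-unit change in x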

Nonparametrics

As outlined in Becker and Greene (2001), recent developments in theory and computational
procedures enable researchers to work with nonlinear modeling of all sorts as well as
nonparametric regression techniques. As an example of what can be done consider the widely
cited economic education application in Spector and Mazzeo (1980). They estimated a probit
model to shed light on how a student's performance in a principles of macroeconomics class
relates to his/her grade in an intermediate macroeconomics class, after controlling for such things
as grade point average (GPA) going into the class. The effect of GPA on future performance is
less obvious than it might appear at first. Certainly it is possible that students with the highest
GPA would get the most from the second course. On the other hand, perhaps the best students
were already well equipped, and if the second course catered to the mediocre (who had more to
gain and more room to improve) then a negative relationship between GPA and increase in
grades (GRADE) might arise. A negative relationship might also arise if artificially high grades
were given in the first course. The below figure provides an analysis similar to that done by
Spector and Mazzeo (using a subset of their data).

[Figure: Probability that grade would increase (Pr[incr], vertical axis, 0.0 to 1.0) plotted against GPA (horizontal axis, 2.0 to 4.0). Legend: PR_MODEL (probit model) and PR_NPREG (nonparametric regression).]

In this figure, the horizontal axis shows the initial grade point average of students in the study. The vertical axis shows the estimated probability that a student's grade increases from the first to the second course. The solid curve shows the estimated probability of grade improvement in the second course using a probit model (the one used by the authors). These estimates suggest a positive relationship between GPA and the probability of grade improvement in the second macroeconomics course throughout the GPA range. The dashed curve in
the figure provides the results using a much less-structured nonparametric regression model. iii
The conclusion reached with this technique is qualitatively similar to that obtained with the
probit model for GPAs above 2.6, where the positive relationship between GPA and the
probability of grade improvement can be seen, but it is materially different for those with GPAs
lower than 2.6, where a negative relationship between GPA and the probability of grade
improvement is found. Possibly these poorer students received gift grades in the introductory
macroeconomics course.
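The sketch below illustrates the same comparison with Python on simulated data (not the Spector and Mazzeo data): a probit of a 0/1 grade-improvement indicator on GPA next to a less structured nonparametric (lowess) fit. The assumed U-shaped true relationship is there only so the two fits can disagree at low GPAs, as in the figure.

import numpy as np
import statsmodels.api as sm
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(4)
n = 300
gpa = rng.uniform(2.0, 4.0, n)
p = np.clip(0.35 + 0.6 * (gpa - 2.6) ** 2, 0.05, 0.95)   # assumed U-shaped true probability
grade_incr = rng.binomial(1, p)                           # 1 = grade increased in second course

probit = sm.Probit(grade_incr, sm.add_constant(gpa)).fit(disp=0)
smooth = lowess(grade_incr, gpa, frac=0.5)                # columns: sorted GPA, smoothed Pr[incr]

print(probit.predict(sm.add_constant(np.array([2.2, 3.0, 3.8]))))  # probit fitted probabilities
print(smooth[:5])                                         # first few nonparametric fitted points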

There are other alternatives to least squares that economic education researchers can
employ in programs such as LIMDEP, STATA and SAS. For example, the least-absolute-
deviations approach is a useful device for assessing the sensitivity of estimates to outliers. It is
likely that examples can be found to show that even if least-squares estimation of the conditional
mean is a better estimator in large samples, least-absolute-deviations estimation of the
conditional median performs better in small samples. The critical point is that economic
education researchers must recognize that there are and will be new alternatives to modeling and
estimation routines as currently found in Journal of Economic Education articles and articles in
the other journals that publish this work, as listed in Lo, Wong and Mixon (2008). In this
module and in the remaining three, only passing mention will be given to these emerging
methods of analysis. The emphasis will be on least-squares and maximum-likelihood
estimations of continuous and discrete data-generating processes that can be represented
parametrically.

INDIVIDUAL OBSERVATIONS OR GROUP AVERAGES: WHAT IS THE UNIT OF ANALYSIS?

In Becker (2004), I called attention to the implications of working with observations on individuals versus working with averages of individuals in different groupings. For example,
what is the appropriate unit of measurement for assessing the validity of student evaluations of
teaching (as reflected, for example, in the relationship between student evaluations of teaching
and student outcomes)? In the case of end-of-term student evaluations of instructors, an
administrator’s interest may not be how students as individuals rate the instructor but how the
class as a whole rates the instructor. Thus, the unit of measure is an aggregate for the class.
There is no unique aggregate, although the class mean or median response is typically used. iv
For the assessment of instructional methods, however, the unit of measurement may arguably be
the individual student in a class and not the class as a unit. Is the question: how is the ith
student’s learning affected by being in a classroom where one versus another teaching method is
employed? Or is the question: how is the class’s learning affected by one method versus
another? The answers to these questions have implications for the statistics employed and
interpretation of the results obtained. v

Hake (1998) reported that he had test scores for 6,542 individual students in 62 introductory physics courses. He worked only with mean scores for the classes; thus, his effective sample size is 62, and not 6,542. The 6,542 students are not irrelevant, but they enter in a way that I did not find mentioned by Hake. The amount of variability around a mean test score for a
class of 20 students versus a mean for 200 students cannot be expected to be the same.
Estimation of a standard error for a sample of 62, where each of the 62 means receives an equal
weight, ignores this heterogeneity. vi Francisco, Trautman, and Nicoll (1998) recognized that the
number of subjects in each group implies heterogeneity in their analysis of average gain scores in
an introductory chemistry course. Similarly, Kennedy and Siegfried (1997) made an adjustment
for heterogeneity in their study of class size on student learning in economics.

Fleisher, Hashimoto, and Weinberg (2002) considered the effectiveness (in terms of student course grades and persistence) of 47 foreign graduate student instructors versus 21
native English speaking graduate student instructors in an environment in which English is the
language of the majority of their undergraduate students. Fleisher, Hashimoto, and Weinberg
recognized the loss of information in using the 92 mean class grades for these 68 graduate
student instructors, although they did report aggregate mean class grade effects with the
corrected heterogeneity adjustment for standard errors based on class size. They preferred to
look at 2,680 individual undergraduate results conditional on which one of the 68 graduate
student instructors each of the undergraduates had in any one of 92 sections of the course. To ensure that their standard errors did not overstate the precision of their estimates when using the
individual student data, Fleisher, Hashimoto, and Weinberg explicitly adjusted their standard
errors for the clustering of the individual student observations into classes using a procedure akin
to that developed by Moulton (1986). vii

Whatever the unit of measure for the dependent variable (aggregate or individual), the important point here is recognition of the need for one of two adjustments that must be made to get the correct standard errors. If an aggregate unit is employed (e.g., class means), then an adjustment for the number of observations making up the aggregate is required. If individual observations share a common component (e.g., students grouped into classes), then the standard errors must reflect this clustering. Computer programs such as LIMDEP (NLOGIT), SAS and
STATA can automatically perform both of these adjustments.
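The clustering adjustment can be requested in any of the three programs; purely as an illustration of what it does, the Python sketch below (simulated classes, students and effect sizes are all assumptions) compares conventional and cluster-adjusted standard errors for a class-level treatment.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n_classes, per_class = 20, 30
class_id = np.repeat(np.arange(n_classes), per_class)
class_effect = rng.normal(0, 2, n_classes)[class_id]            # shared within-class error component
treatment = np.repeat(rng.integers(0, 2, n_classes), per_class) # treatment assigned at the class level
score = 50 + 3 * treatment + class_effect + rng.normal(0, 5, n_classes * per_class)

df = pd.DataFrame({"score": score, "treatment": treatment, "class_id": class_id})
conventional = smf.ols("score ~ treatment", data=df).fit()
clustered = smf.ols("score ~ treatment", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["class_id"]})
print(conventional.bse["treatment"], clustered.bse["treatment"])  # clustered SE is larger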

ANALYSIS OF VARIANCE (ANOVA) AND HYPOTHESES TESTING

Students of statistics are familiar with the F statistic as computed and printed in most computer regression routines under a banner "Analysis of Variance" or just ANOVA. This F is often presented in introductory statistics textbooks as a test of the overall fit or explanatory power of the regression. I have learned from years of teaching econometrics that it is better to think of this test as one of whether all population model slope coefficients are zero (the explanatory power is not sufficient to conclude that there is any relation between the xs and y in the population) versus the alternative that at least one slope coefficient is not zero (there is some explanatory power). Thinking of this F statistic as just a joint test of slope coefficients makes it easier to recognize that an F statistic can be calculated for any subset of coefficients to test for joint significance within the subset. Here I present the theoretical underpinnings for extensions of the basic ANOVA to tests of subsets of coefficients. Parts Two, Three and Four provide the corresponding commands to do these tests in LIMDEP, STATA and SAS.

As a starting point to ANOVA, consider the F statistic that is generated by most computer programs. This F calculation can be viewed as a decomposition or partitioning of the dependent variable into two components (intercept and slopes) and a residual:

$$\mathbf{y} = \mathbf{i}b_1 + \mathbf{X}_2\mathbf{b}_2 + \mathbf{e},$$

where $\mathbf{i}$ is the column of 1's in the X matrix associated with the intercept $b_1$ and $\mathbf{X}_2$ contains the remaining (K−1) explanatory x variables associated with the (K−1) slope coefficients in the $\mathbf{b}_2$
vector. The total sum of squared deviations

$$\text{TotSS} = \sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2 = \mathbf{y}'\mathbf{y} - n\bar{y}^2$$

measures the amount of variability in y around $\bar{y}$, ignoring any effect of the xs (in essence, the $\mathbf{b}_2$ vector is assumed to be a vector of zeros). The residual sum of squares

$$\text{ResSS} = \sum_{i=1}^{n} e_i^2 = \mathbf{e}'\mathbf{e}$$

measures the amount of variability in y around ŷ , which lets b1 and b2 assume their least squares
values.

Partitioning of y in this manner enables us to test the contributions of the xs to explaining variability in the dependent variable. That is,

$$H_0: \beta_2 = \beta_3 = \cdots = \beta_K = 0 \quad\text{versus}\quad H_A: \text{at least one slope coefficient is not zero.}$$

For calculating the F statistic, computer programs use the equivalent of the following:

$$F = \frac{[(\mathbf{y}'\mathbf{y} - n\bar{y}^2) - \mathbf{e}'\mathbf{e}]/[(n-1)-(n-K)]}{\mathbf{e}'\mathbf{e}/(n-K)} = \frac{[(\mathbf{y}'\mathbf{y} - n\bar{y}^2) - \mathbf{e}'\mathbf{e}]/(K-1)}{\mathbf{e}'\mathbf{e}/(n-K)} = \frac{(\text{TotSS} - \text{ResSS})/(K-1)}{\text{ResSS}/(n-K)}.$$

This F is the ratio of two independently distributed Chi-square random variables adjusted for
their respective degrees of freedom. The relevant decision rule for rejecting the null hypothesis
is that the probability of this calculated F value or something greater, with K − 1 and n − K
degrees of freedom, is less than the typical (0.10, 0.05 or 0.01) probabilities of a Type I error.

Calculation of the F statistic in this manner, however, is just a special case of running two regressions: a restricted and an unrestricted. One regression is computed with all the slope coefficients set equal (or restricted) to zero, so Y is regressed only on the column of ones. This restricted regression is the same as using $\bar{Y}$ to predict Y regardless of the values of the xs. The restricted residual sum of squares, $\mathbf{e}_r'\mathbf{e}_r$, is what is usually called the total sum of squares, $\text{TotSS} = \mathbf{y}'\mathbf{y} - n\bar{y}^2$. The unrestricted regression allows all of the slope coefficients to find the values that minimize the residual sum of squares, which is thus called the unrestricted residual sum of squares, $\mathbf{e}_u'\mathbf{e}_u$, and is usually just listed in a computer printout as the residual sum of squares, $\text{ResSS} = \mathbf{e}'\mathbf{e}$.

The idea of a restricted and unrestricted regression can be extended to test any subset of
coefficients. For example, say the full model for a posttest Y is

$$Y_i = \beta_1 + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \varepsilon_i.$$

Let's say the claim is made that x3 and x4 do not affect Y. One way to interpret this is to specify that $\beta_3 = \beta_4 = 0$, but $\beta_2 \neq 0$. The dependent variable is again decomposed into two components, but now x2 is included with the intercept in the partitioning of the X matrix:

$$\mathbf{y} = \mathbf{X}_1\mathbf{b}_1 + \mathbf{X}_2\mathbf{b}_2 + \mathbf{e},$$

where $\mathbf{X}_1$ is the n × 2 matrix whose first column contains ones and whose second column contains the observations on x2 ($\mathbf{b}_1$ contains the y intercept and the x2 slope coefficient), and $\mathbf{X}_2$ is the n × 2 matrix with two columns for x3 and x4 ($\mathbf{b}_2$ contains the x3 and x4 slope coefficients). If the claim about x3 and x4 not
belonging in the explanation of Y is true, then the two slope coefficients in b2 should be set to
zero because the true model is the restricted specification

$$Y_i = \beta_1 + \beta_2 x_{i2} + \varepsilon_i.$$

The null hypothesis is $H_0: \beta_3 = \beta_4 = 0$; i.e., x2 might affect Y but x3 and x4 do not affect Y.

The alternative hypothesis is $H_A: \beta_3 \neq 0 \text{ or } \beta_4 \neq 0$; i.e., x3 or x4 (or both) affect Y.

The F statistic to test the hypotheses is then

$$F = \frac{[\mathbf{e}_r'\mathbf{e}_r - \mathbf{e}_u'\mathbf{e}_u]/[(n - K_r) - (n - K_u)]}{\mathbf{e}_u'\mathbf{e}_u/(n - K_u)},$$

where the restricted residual sum of squares e′r e r is obtained from a simple regression of Y on x2,
including a constant, and the unrestricted sum of squared residuals e′u e u is obtained from a
regression of Y on x2, x3 and x4 , including a constant.
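The restricted/unrestricted calculation can be carried out directly. The Python sketch below (simulated data; the true coefficient values are assumptions) computes the F statistic for H0: β3 = β4 = 0 exactly as in the formula above.

import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(6)
n = 200
x2, x3, x4 = rng.normal(size=(3, n))
y = 1 + 0.8 * x2 + 0.5 * x3 + 0.0 * x4 + rng.normal(0, 1, n)

Xu = sm.add_constant(np.column_stack([x2, x3, x4]))   # unrestricted: constant, x2, x3, x4
Xr = sm.add_constant(x2)                              # restricted: beta3 = beta4 = 0
ssr_u = sm.OLS(y, Xu).fit().ssr                       # unrestricted residual sum of squares
ssr_r = sm.OLS(y, Xr).fit().ssr                       # restricted residual sum of squares

J, K_u = 2, Xu.shape[1]                               # 2 restrictions, 4 unrestricted coefficients
F = ((ssr_r - ssr_u) / J) / (ssr_u / (n - K_u))
print(F, stats.f.sf(F, J, n - K_u))                   # F statistic and its p-value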

In general, it is best to test the overall fit of the regression model before testing any subset
or individual coefficients. The appropriate hypotheses and F statistic are

$$H_0: \beta_2 = \beta_3 = \cdots = \beta_K = 0 \quad (\text{or } H_0: R^2 = 0)$$
$$H_A: \text{at least one slope coefficient is not zero} \quad (\text{or } H_A: R^2 \neq 0)$$

$$F = \frac{[(\mathbf{y}'\mathbf{y} - n\bar{y}^2) - \mathbf{e}'\mathbf{e}]/(K-1)}{\mathbf{e}'\mathbf{e}/(n-K)}.$$

If the calculated value of this F is significant, then subsets of the coefficients can be tested as

$$H_0: \beta_s = \beta_t = \cdots = 0$$
$$H_A: \text{at least one of these slope coefficients is not zero}$$

$$F = \frac{[\mathbf{e}_r'\mathbf{e}_r - \mathbf{e}_u'\mathbf{e}_u]/(K_u - q)}{\mathbf{e}_u'\mathbf{e}_u/(n - K_u)}, \quad\text{for } q = K - \text{number of restrictions}.$$

The restricted residual sum of squares e′r e r is obtained by a regression on only the q xs that did
not have their coefficients restricted to zero. Any number of subsets of coefficients can be tested
in this framework of restricted and unrestricted regressions as summarized in the following table.

SUMMARY FOR ANOVA TESTING

PANEL A. TRADITIONAL ANOVA FOR TESTING R² = 0 VERSUS R² ≠ 0

Source                            Sum of Squares         Degrees of Freedom   Mean Square
Total (to be explained)           y'y − nȳ²              n − 1                s_y²
Residual or Error (unexplained)   e'e                    n − K                s_e²
Regression or Model (explained)   b'X'y − nȳ²            K − 1

$$F = \frac{R^2/(K-1)}{(1-R^2)/(n-K)} = \frac{[1-(\text{ResSS}/\text{TotSS})]/(K-1)}{(\text{ResSS}/\text{TotSS})/(n-K)} = \frac{(\text{TotSS}-\text{ResSS})/(K-1)}{\text{ResSS}/(n-K)}$$

PANEL B. RESTRICTED REGRESSION FOR TESTING ALL THE SLOPES β2 = β3 = ... = βK = 0

Source                            Sum of Squares         Degrees of Freedom   Mean Square
Restricted (all slopes = 0)       e_r'e_r = y'y − nȳ²    n − 1                s_y²
Unrestricted                      e_u'e_u = e'e          n − K                s_e²
Improvement                       b'X'y − nȳ²            K − 1

$$F = \frac{[\text{Restricted ResSS (slopes}=0) - \text{Unrestricted ResSS}]/(K-1)}{\text{Unrestricted ResSS}/(n-K)}$$

PANEL C. RESTRICTED REGRESSION FOR TESTING A SUBSET OF COEFFICIENTS βs = βt = ... = 0

Source                            Sum of Squares         Degrees of Freedom
Restricted (βs = βt = ... = 0)    e_r'e_r                n − q, for q = K − number of restrictions
Unrestricted                      e_u'e_u                n − K
Improvement                       e_r'e_r − e_u'e_u      K − q

$$F = \frac{[\text{Restricted ResSS (subset}=0) - \text{Unrestricted ResSS}]/(K-q)}{\text{Unrestricted ResSS}/(n-K)}$$

The F test of subsets of coefficients is ideal for testing interactions. For instance, to test for the treatment effect in the following model, both $\beta_3$ and $\beta_4$ must be jointly tested against zero:

$$\text{ChangeScore} = \beta_1 + \beta_2\,\text{female} + \beta_3\,(\text{female} \times \text{treatment}) + \beta_4\,\text{treatment} + \beta_5\,\text{GPA} + \varepsilon$$

$$H_0: \beta_3 = \beta_4 = 0 \qquad H_A: \beta_3 \text{ or } \beta_4 \neq 0,$$

where "ChangeScore" is the difference between a student's test scores at the end and
beginning of a course in economics, female = 1, if female and 0 if male, "treatment" = 1,
if in the treatment group and 0 if not, and "GPA" is the student's grade point average
before enrolling in the course.

The F test of subsets of coefficients is also ideal for testing for fixed effects as reflected in sets of
dummy variables. For example, in Parts Two, Three and Four an F test is performed to check
whether there is any fixed difference in test performance among four classes taking economics
using the following assumed data generating process:

$$\text{post} = \beta_1 + \beta_2\,\text{pre} + \beta_3\,\text{class1} + \beta_4\,\text{class2} + \beta_5\,\text{class3} + \varepsilon$$

$$H_0: \beta_3 = \beta_4 = \beta_5 = 0 \qquad H_A: \beta_3, \beta_4 \text{ or } \beta_5 \neq 0,$$

where "post" is a student's post-course test score, "pre" is the student's pre-course test score, and "class" identifies to which one of the four classes the student was assigned, e.g., class3 = 1 if the student was in the third class and class3 = 0 if not. The fixed effect for students in the fourth class (for whom class1, class2 and class3 are all zero) is captured in the intercept $\beta_1$.

It is important to notice in this test of fixed class effects that the relationship between the posttest and pretest (as reflected in the slope coefficient $\beta_2$) is assumed to be the same regardless of the class to which the student was assigned. The next section describes a test for any structural difference among the groups.
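A sketch of this joint test of the class dummies, using Python with simulated data and a built-in test of linear restrictions rather than the LIMDEP, STATA or SAS commands given in Parts Two, Three and Four (the variable names and effect sizes here are assumptions):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 240
pre = rng.normal(12, 3, n)
cls = rng.integers(1, 5, n)                          # four classes; class 4 is the base group
post = 5 + 0.9 * pre + np.where(cls == 2, 1.5, 0) + rng.normal(0, 2, n)

df = pd.DataFrame({"post": post, "pre": pre,
                   "class1": (cls == 1).astype(int),
                   "class2": (cls == 2).astype(int),
                   "class3": (cls == 3).astype(int)})
fit = smf.ols("post ~ pre + class1 + class2 + class3", data=df).fit()
print(fit.f_test("class1 = 0, class2 = 0, class3 = 0"))   # joint F test of the fixed class effects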

TESTING FOR A SPECIFICATION DIFFERENCE ACROSS GROUPS

Earlier in our discussion of the difference in difference or change score model, a 0-1 bivariate
dummy variable was introduced to test for a difference in intercepts between a treatment and
control group, which could be done with a single coefficient t test. However, the expected
difference in the dependent variable for the two groups might not be constant. It might vary with
the level of the independent variables. Indeed, the appropriate model might be completely
different for the two groups. Or, it might be the same.

Allowing for any type of difference between the control and experimental groups implies that the null and alternative hypotheses are

$$H_0: \boldsymbol{\beta}_1 = \boldsymbol{\beta}_2 = \boldsymbol{\beta} \qquad H_A: \boldsymbol{\beta}_1 \neq \boldsymbol{\beta}_2,$$

where $\boldsymbol{\beta}_1$ and $\boldsymbol{\beta}_2$ are K×1 column vectors containing the K coefficients $\beta_1, \beta_2, \beta_3, \ldots, \beta_K$ for the control ($\boldsymbol{\beta}_1$) and the experimental ($\boldsymbol{\beta}_2$) groups. Let $\mathbf{X}_1$ and $\mathbf{X}_2$ contain the observations on the explanatory variables corresponding to $\boldsymbol{\beta}_1$ and $\boldsymbol{\beta}_2$, including the column of ones for the constant $\beta_1$. The unrestricted regression is captured by two separate regressions:

$$\begin{bmatrix}\mathbf{y}_1\\ \mathbf{y}_2\end{bmatrix} = \begin{bmatrix}\mathbf{X}_1 & \mathbf{0}\\ \mathbf{0} & \mathbf{X}_2\end{bmatrix}\begin{bmatrix}\boldsymbol{\beta}_1\\ \boldsymbol{\beta}_2\end{bmatrix} + \begin{bmatrix}\boldsymbol{\varepsilon}_1\\ \boldsymbol{\varepsilon}_2\end{bmatrix}.$$

That is, the unrestricted model is estimated by fitting the two regressions separately. The
unrestricted residual sum of squares is obtained by adding the residuals from these two
regressions. The unrestricted degrees of freedom are similarly obtained by adding the degrees of
freedom of each regression.

The restricted regression is just a regression of y on the xs with no group distinction in the beta coefficients:

$$\begin{bmatrix}\mathbf{y}_1\\ \mathbf{y}_2\end{bmatrix} = \begin{bmatrix}\mathbf{X}_1\\ \mathbf{X}_2\end{bmatrix}\boldsymbol{\beta} + \begin{bmatrix}\boldsymbol{\varepsilon}_1\\ \boldsymbol{\varepsilon}_2\end{bmatrix}.$$

That is, the restricted residual sum of squares is obtained from a regression in which the data
from the two groups are pooled and a single set of coefficients is estimated for the pooled data
set.

The appropriate F statistic is

$$F = \frac{[\text{Restricted ResSS}(\boldsymbol{\beta}_1 = \boldsymbol{\beta}_2) - \text{Unrestricted ResSS}]/K}{\text{Unrestricted ResSS}/(n - 2K)},$$

where the unrestricted ResSS is the residual sum of squares from a regression on only those in the control group plus the residual sum of squares from a regression on only those in the treatment group.

Thus, to test for structural change over J regimes, run separate regressions on each regime and add up their residual sums of squares to obtain the unrestricted residual sum of squares, ResSSu, with df = n − JK. The restricted residual sum of squares is ResSSr, with df = n − K.

H0: β1 = β2 = . . . = βJ     and     Ha: the β's are not all equal

\[
F = \frac{(\text{ResSS}_r - \text{ResSS}_u)/[K(J-1)]}{\text{ResSS}_u/(n - JK)} .
\]

This form of testing for a difference among groups is known in economics as a Chow
Test. As demonstrated in Part Two using LIMDEP and Parts Three and Four using STATA and
SAS, any number of subgroups could be tested by adding up their individual residual sums of
squares and degrees of freedom to form the unrestricted residual sums of squares and matching
degrees of freedom.
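
As a numerical illustration, anticipating the four-class example worked out with LIMDEP in Part Two (J = 4 regimes, K = 2 coefficients per regime, and n = 23 usable observations), the restricted and unrestricted residual sums of squares reported there give

\[
F = \frac{(53.197 - 24.500)/[2(4-1)]}{24.500/(23 - 8)} = \frac{4.783}{1.633} \approx 2.93 ,
\]

which is compared with the critical value of the F distribution with 6 and 15 degrees of freedom.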

OTHER TEST STATISTICS

Depending on the nature of the model being estimated and the estimation method, computer
programs will produce alternatives to the F statistic for testing (linear and nonlinear) restrictions and structural changes. What follows is only an introduction to these statistics, but, given our discussion of ANOVA above, it should be sufficient to give meaning to the numbers these programs produce.

The Wald (W) statistic follows the Chi-squared distribution with J degrees of freedom,
reflecting the number of restrictions imposed:

\[
W = \frac{e_r'e_r - e_u'e_u}{e_u'e_u / n} \sim \chi^2(J) .
\]

If the model and the restriction are linear, then

\[
W = \frac{nJ}{n-k}F = \frac{J}{1-(k/n)}F ,
\]

which for large n yields the asymptotic result

W = JF .

The likelihood ratio (LR) test is formed by twice the difference between the log-
likelihood function for an unrestricted regression ( Lur ) and its value for the restricted regression
(Lr ).

LR = 2(Lur − Lr ) > 0 .

Under the null hypothesis that the J restrictions are true, LR is distributed Chi-square with J
degrees of freedom.

The relationship between the likelihood ratio test and Wald test can be shown to be

\[
LR = \frac{n(e_r'e_r - e_u'e_u)}{e_u'e_u} - \frac{n(e_r'e_r - e_u'e_u)^2}{2(e_u'e_u)^2} \le W .
\]

The Lagrange multiplier test (LM) is based on the gradient (or score) vector

\[
\begin{bmatrix} \partial L / \partial \beta \\ \partial L / \partial \sigma^2 \end{bmatrix}
=
\begin{bmatrix} X'\varepsilon / \sigma^2 \\ -(n/2\sigma^2) + (\varepsilon'\varepsilon / 2\sigma^4) \end{bmatrix} ,
\]

where, as before, to evaluate this score vector with the restrictions we replace e = y − Xb with
er = y − Xbr . After sufficient algebra, the Lagrange statistic is defined by

\[
LM = n\, e_r'X(X'X)^{-1}X'e_r / e_r'e_r = nR^2 \sim \chi^2(J) ,
\]

where R2 is the conventional coefficient of determination from a regression of er on X, where er has a zero mean (i.e., only slopes are being tested). It can also be shown that

\[
LM = \frac{nJ}{(n-k)[1 + JF/(n-k)]}F = \frac{W}{1 + (W/n)} .
\]

Thus, LM ≤ LR ≤ W .
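
To see this ordering in a concrete case, take the linear restriction test reported in Part Two (F[3, 18] = 5.1627, with n = 23, k = 5 and J = 3 restrictions) and use the relationships above together with the fact that, for the normal linear regression model, LR = n ln(er′er/eu′eu) = n ln(1 + W/n):

\[
W = \frac{nJ}{n-k}F = \frac{23(3)(5.1627)}{18} \approx 19.79, \qquad
LR = 23\ln(1.8604) \approx 14.28, \qquad
LM = \frac{W}{1 + W/n} \approx 10.64 .
\]

All three statistics are referred to the chi-squared distribution with J = 3 degrees of freedom, and LM ≤ LR ≤ W as claimed.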

DATA ENTRY AND ESTIMATION

I like to say to students in my classes on econometrics that theory is easy, data are hard – hard to
find and hard to get into a computer program for statistical analysis. In this first of four parts in
Module One, I provided an introduction to the theoretical data generating processes associated
with continuous versus discrete dependent variables. Parts Two, Three and Four concentrate on
getting the data into one of three computer programs: LIMDEP (NLOGIT), STATA and SAS.
Attention is also given to estimation and testing within regressions employing individual cross-
sectional observations within these programs. Later modules will address complications
introduced by panel data and sources of endogeneity.

REFERENCES

Allison, Paul D. (1990). “Change Scores as Dependent Variables in Regression Analysis,” Sociological Methodology, Vol. 20: 93-114.

Bandiera, Oriana, Valentino Larcinese and Imron Rasul (2010). “Blissful Ignorance? Evidence from a Natural Experiment on the Effect of Individual Feedback on Performance,” IZA Seminar, Bonn Germany, December 5, 2009. January 2010 version downloadable at http://www.iza.org/index_html?lang=en&mainframe=http%3A//www.iza.org/en/webcontent/events/izaseminar_description_html%3Fsem_id%3D1703&topSelect=events&subSelect=seminar

Becker, William E. (2004). “Quantitative Research on Teaching Methods in Tertiary Education,” in W. E. Becker and M. L. Andrews (eds), The Scholarship of Teaching and Learning in Higher Education: Contributions of the Research Universities, Indiana University Press: 265-309.

Becker, William E. (1983). “Economic Education Research: Part III, Statistical Estimation Methods,” Journal of Economic Education, Vol. 14 (Summer): 4-15.

Becker, William E. and William H. Greene (2001). “Teaching Statistics and Econometrics to
Undergraduates,” Journal of Economic Perspectives, Vol. 15 (Fall): 169-182.

Becker, William E., William Greene and Sherwin Rosen (1990). “Research on High School
Economic Education,” American Economic Review, Vol. 80, (May): 14-23, and an expanded
version in Journal of Economic Education, Summer 1990: 231-253.

Becker, William E. and Peter Kennedy (1992). “A Graphical Exposition of the Ordered Probit,” Econometric Theory, Vol. 8: 127-131.

Becker, William E. and Michael Salemi (1977). “The Learning and Cost Effectiveness of AVT Supplemented Instruction: Specification of Learning Models,” Journal of Economic Education, Vol. 8 (Spring): 77-92.

Becker, William E. and Donald Waldman (1989). “Graphical Interpretation of Probit Coefficients,” Journal of Economic Education, Vol. 20 (Fall): 371-378.

Campbell, D., and D. Kenny (1999). A Primer on Regression Artifacts. New York: The Guilford
Press.

Fleisher, B., M. Hashimoto, and B. Weinberg. 2002. “Foreign GTAs can be Effective Teachers
of Economics.” Journal of Economic Education, Vol. 33 (Fall): 299-326.

Francisco, J. S., M. Trautmann, and G. Nicoll. 1998. “Integrating a Study Skills Workshop and
Pre-Examination to Improve Student’s Chemistry Performance.” Journal of College Science
Teaching, Vol. 28 (February): 273-278.

Friedman, M. 1992. “Communication: Do Old Fallacies Ever Die?” Journal of Economic Literature, Vol. 30 (December): 2129-2132.

Greene, William (2003). Econometric Analysis. 5th Edition, New Jersey: Prentice Hall.

Hake, R. R. (1998). “Interactive-Engagement versus Traditional Methods: A Six-Thousand-Student Survey of Mechanics Test Data for Introductory Physics Courses.” American Journal of Physics, Vol. 66 (January): 64-74.

Hanushek, Eric A. (1986). “The Economics of Schooling: Production and Efficiency in Public Schools,” Journal of Economic Literature, Vol. 24 (September): 1141-1177.

Lo, Melody, Sunny Wong and Franklin Mixon (2008). “Ranking Economics Journals, Economics Departments, and Economists Using Teaching-Focused Research Productivity.” Southern Economic Journal, Vol. 74 (January): 894-906.

Moulton, B. R. (1986). “Random Group Effects and the Precision of Regression Estimators.”
Journal of Econometrics, Vol. 32 (August): 385-97.

Kennedy, P., and J. Siegfried. (1997). “Class Size and Achievement in Introductory Economics:
Evidence from the TUCE III Data.” Economics of Education Review, Vol. 16 (August): 385-394.

Kvam, Paul. (2000). “The Effect of Active Learning Methods on Student Retention in
Engineering Statistics.” American Statistician, 54 (2): 136-40.

Ramsden, P. (1998). “Managing the Effective University.” Higher Education Research &
Development, 17 (3): 347-70.

Salemi, Michael and George Tauchen. 1987. “Simultaneous Nonlinear Learning Models.” In W.
E. Becker and W. Walstad, eds., Econometric modeling in economic education research, pp.
207-23. Boston: Kluwer-Nijhoff.

Spector, Lee C. and Michael Mazzeo (1980). “Probit Analysis and Economic Education,” Journal of Economic Education, Vol. 11 (Spring): 37-44.

Wainer, H. 2000. “Kelley’s Paradox.” Chance, 13 (Winter): 47-48.

ENDNOTES
                                                       
i
Let the change or gain score be Δy = y1 − y0 , which is the posttest score minus the pretest score, and let the maximum change score be Δymax = ymax − y0 . Then

\[
\frac{\partial(\Delta y / \Delta y_{\max})}{\partial y_0} = \frac{-(y_{\max} - y_1)}{(y_{\max} - y_0)^2} \le 0 , \quad \text{for } y_{\max} \ge y_1 \ge y_0 .
\]
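
As a hypothetical numerical illustration of this negative relationship, suppose the maximum score is 30 and a student’s posttest score is y1 = 25: with a pretest score of y0 = 15 the student closes (25 − 15)/(30 − 15) = 0.67 of the possible gap, whereas with a pretest score of y0 = 20 the same posttest closes only (25 − 20)/(30 − 20) = 0.50 of it.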

ii
Let the posttest score ( y1 ) and pretest score ( y 0 ) be defined on the same scale, then the model
of the ith student’s pretest is

y 0i = β 0 (ability ) i + v 0i ,

where β 0 is the slope coefficient to be estimated, v0i is the population error in predicting the ith
student’s pretest score with ability, and all variables are measured as deviations from their
means. The ith student’s posttest is similarly defined by

y1i = β 1 (ability ) i + v1i

The change or gain score model is then

y1i − y 0i = ( β 1 − β 0 )ability + v1i − v0i

And after substituting the pretest for unobserved true ability we have

Δy i = (Δβ / β 0 ) y 0i + v1i − v0i [1 + (Δβ / β 0 )]

The least squares slope estimator (Δb / b0 ) has an expected value of

\[
E(\Delta b / b_0) = E\left( \sum_i \Delta y_i\, y_{0i} \Big/ \sum_i y_{0i}^2 \right)
\]
\[
E(\Delta b / b_0) = (\Delta\beta/\beta_0) + E\left\{ \sum_i \big[ v_{1i} - v_{0i} - v_{0i}(\Delta\beta/\beta_0) \big]\, y_{0i} \Big/ \sum_i y_{0i}^2 \right\}
\]
\[
E(\Delta b / b_0) \le (\Delta\beta/\beta_0)
\]

Although v1i and y0i are unrelated (E(v1i y0i) = 0), v0i and y0i are positively related (E(v0i y0i) > 0); thus, E(Δb/b0) ≤ Δβ/β0. Becker and Salemi (1977) suggested an instrumental variable technique to address this source of bias and Salemi and Tauchen (1987) suggested a modeling of the error term structure.

Hake (1998) makes no reference to this bias when he discusses his regressions and correlation of average normalized gain, average gain score and posttest score on the average pretest score. In http://www.consecol.org/vol5/iss2/art28/, he continued to be unaware of, unable or unwilling to
specify the mathematics of the population model from which student data are believed to be
generated and the method of parameter estimation employed. As the algebra of this endnote
suggests, if a negative relationship is expected between the gap closing measure

g = (posttest−pretest)/(maxscore−pretest)

and the pretest, but a least-squares estimator does not yield a significant negative relationship for
sample data, then there is evidence that something is peculiar. It is the lack of independence
between the pretest and the population error term (caused, for example, by measurement error in
the pretest, simultaneity between g and the pretest, or possible missing but relevant variables)
that is the problem. Hotelling received credit for recognizing this endogenous regressor problem
(in the 1930s) and the resulting regression to the mean phenomenon. Milton Friedman received
a Nobel prize in economics for coming up with an instrumental variable technique (for
estimation of consumption functions in the 1950s) to remove the resulting bias inherent in least-
squares estimators when measurement error in a regressor is suspected. Later Friedman (1992,
p. 2131) concluded: “I suspect that the regression fallacy is the most common fallacy in the
statistical analysis of economic data ...” Similarly, psychologists Campbell and Kenny (1999, p.
xiii) stated: “Regression toward the mean is an artifact that as easily fools statistical experts as lay
people.” But unlike Friedman, Campbell and Kenny did not recognize the instrumental variable
method for addressing the problem.

In an otherwise innovative study, Paul Kvam (2000) correctly concluded that there was
insufficient statistical evidence to conclude that active-learning methods (primarily through
integrating students’ projects into lectures) resulted in better retention of quantitative skills than
traditional methods, but then went out on a limb by concluding from a scatter plot of individual
student pretest and posttest scores that students who fared worse on the first exam retain
concepts better if they were taught using active-learning methods. Kvan never addressed the
measurement error problem inherent in using the pretest as an explanatory variable. Wainer
(2000) called attention to others who fail to take measurement error into account in labeling
students as “strivers” because their observed test scores exceed values predicted by a regression
equation.

iii
The plot for the probability model was produced by first fitting a probit model of the binary
variable GRADE, as a function of GPA. This produces a functional relationship of the form
Prob(GRADE = 1) = Φ(α + βGPA), where estimates of α and β are produced by maximum likelihood techniques. The graph is produced by plotting the standard normal distribution function, Φ(α + βGPA), for the values of GPA in the sample, which range between 2.0 and 4.0, then connecting the dots. The nonparametric regression, although intuitively appealing
because it can be viewed as making use of weighted relative frequencies, is computationally
more complicated. [Today the binomial probit model can be fitted with just about any statistical
package but software for nonparametric estimation is less common. LIMDEP (NLOGIT)
version 8.0 (Econometric Software, Inc., 2001) was used for both the probit and nonparametric
estimations.] The nonparametric approach is based on the assumption that there is some as yet
unknown functional relationship between the Prob(GRADE = 1) and the independent variable,
GPA, say Prob(Grade = 1 | GPA) = F(GPA). The probit model based on the normal distribution
is one functional candidate, but the normality assumption is more specific than we need at this
point. We proceed to use the data to find an approximation to this function. The form of the
‘estimator’ of this function is F(GPA*) = Σi = all observations w(GPA* − GPAi )GRADEi. The
weights, ‘w(.),’ are positive weight functions that sum to 1.0, so for any specific value GPA*, the
approximation is a weighted average of the values of GRADE. The weights in the function are
based on the desired value of GPA, that is GPA*, as well as all the data. The nature of the
computation is such that if there is a positive relationship between GPA and GRADE =1, then as
GPA* gets larger, the larger weights in the average shown above will tend to be associated with
the larger values of GRADE. (Because GRADE is zeros and ones, this means that for larger
values of GPA*, the weights associated with the observations on GRADE that equal one will
generally be larger than those associated with the zeros.) The specific form of these weights is as
follows: w(GPA* − GPAi) = (1/A)×(1/h)K[(GPA* − GPAi)/h]. The ‘h’ is called the smoothing
parameter, or bandwidth, K[.] is the ‘kernel density function’ and A is the sum of the functions,
ensuring that the entire expression sums to one. Discussion of nonparametric regression using a
kernel density estimator is given in Greene (2003, pp. 706-708). The nonparametric regression
of GRADE on GPA plotted in the figure was produced using a logistic distribution as the kernel
function and the following computation of the bandwidth: let r equal one third of the sample
range of GPA and let s equal the sample standard deviation of GPA. The bandwidth is then h = 0.9×Min(r, s)/n^(1/5). (In spite of their apparent technical cachet, bandwidths are found largely by
experimentation. There is no general rule that dictates what one should use in a particular case,
which is unfortunate because the shapes of kernel density plots are heavily dependent upon
them.)

iv
Unlike the mean, the median reflects relative but not absolute magnitude; thus, the median may
be a poor measure of change. For example, the series 1, 2, 3 and the series 1, 2, 300 have the
same median (2) but different means (2 versus 101).

v
To appreciate the importance of the unit of analysis, consider a study done by Ramsden (1998,
pp. 352-354) in which he provided a scatter plot showing a positive relationship between a y-axis
index for his “deep approach” (aimed at student understanding versus “surface learning”) and an
x-axis index of “good teaching” (including feedback of assessed work, clear goals, etc.).

Ramsden’s regression ( y = 18.960 + 0.35307 x ) seems to imply that a decrease (increase) in the
good teaching index by one unit leads to a 0.35307 decrease (increase) in the predicted deep
approach index; that is, good teaching positively affects deep learning. But does it?

Ramsden (1998) ignored the fact that each of his 50 data points represents a type of institutional average that is based on multiple inputs; thus, questions of heteroscedasticity and the calculation of appropriate standard errors for statistical inference are relevant. In addition, because
Ramsden reports working only with the aggregate data from each university, it is possible that
within each university the relationship between good teaching (x) and the deep approach (y)
could be negative but yet appear positive in the aggregate.

When I contacted Ramsden to get a copy of his data and his coauthored “Paper presented at the
Annual Conference of the Australian Association for Research in Education, Brisbane
(December 1997),” which was listed as the source for his regression of the deep approach index
on the good teaching index in his 1998 published article, he confessed that this conference paper
never got written and that he no longer had ready access to the data (email correspondence
August 22, 2000).

Aside from the murky issue of Ramsden citing his 1997 paper, which he subsequently admitted
does not exist, and his not providing the data on which the published 1998 paper is allegedly
based, a potential problem of working with data aggregated at the university level can be seen
with three hypothetical data sets. The three regressions for each of the following hypothetical
universities show a negative relationship for y (deep approach) and x (good teaching), with slope
coefficients of –0.4516, –0.0297, and –0.4664, but a regression on the university means shows a
positive relationship, with slope coefficient of +0.1848. This is a demonstration of “Simpson’s
paradox,” where aggregate results are different from disaggregated results.

University One

ŷ(1) = 21.3881 − 0.4516 x(1)    Std. Error = 2.8622    R2 = 0.81    n = 4

y(1): 21.8  15.86  26.25  14.72
x(1): −4.11  6.82  −5.12  17.74

University Two

ŷ(2) = 17.4847 − 0.0297 x(2)    Std. Error = 2.8341    R2 = 0.01    n = 8

y(2): 12.60  17.90  19.00  16.45  21.96  17.1  18.61  17.85
x(2): −10.54  −10.53  −5.57  −11.54  −15.96  −2.1  −9.64  12.25

University Three

ŷ(3) = 17.1663 − 0.4664 x(3)    Std. Error = 2.4286    R2 = 0.91    n = 12

y(3): 27.10  2.02  16.81  15.42  8.84  22.90  12.77  17.52  23.20  22.60  25.90
x(3): −23.16  26.63  5.86  9.75  11.19  −14.29  11.51  −0.63  −19.21  −4.89  −16.16

University Means

ŷ(means) = 18.6105 + 0.1848 x(means)    Std. Error = 0.7973    R2 = 0.75    n = 3

y(means): 19.658  17.684  17.735
x(means): 3.833  −6.704  −1.218

vi
Let yit be the observed test score index of the ith student in the tth class, who has an expected test score index value of μit. That is, yit = μit + εit , where εit is the random error in testing such that its expected value is zero, E(εit) = 0, and its variance is σ², E(εit²) = σ², for all i and t.

Let ȳt be the sample mean of the test score index for the tth class of nt students. That is, ȳt = μ̄t + ε̄t and E(ε̄t²) = σ²/nt. Thus, the variance of the class mean test score index is inversely related to class size.

vii
As in Fleisher, Hashimoto, and Weinberg (2002), let y gi be the performance measure of the ith
student in a class taught by instructor g, let Fg be a dummy variable reflecting a characteristic of the instructor (e.g., nonnative English speaker), let xgi be a (1 × n) vector of the student’s
observable attributes, and let the random error associated with the ith student taught by the gth
instructor be ε gi . The performance of the ith student is then generated by

y gi = Fg γ + xgi β + ε gi

where γ and β are parameters to be estimated. The error term, however, has two components:
one unique to the ith student in the gth instructor’s class ( ugi ) and one that is shared by all
students in this class ( ξ g ): ε gi = ξ g + ugi . It is the presence of the shared error ξ g for which an
adjustment in standard errors is required. The ordinary least squares routines employed by the
standard computer programs are based on a model in which the variance-covariance matrix of
error terms is diagonal, with element σ u2 . The presence of the ξ g terms makes this matrix block
diagonal, where each student in the gth instructor’s class has an off-diagonal element σ ξ2 .

In (May 11, 2008) email correspondence, Bill Greene called my attention to the fact that
Moulton (1986) gave a specific functional form for the shared error term component
computation. Fleisher, Hashimoto, and Weinberg actually used an approximation that is aligned
with the White estimator (as presented in Parts Two, Three and Four of this module), which is
the "CLUSTER" estimator in STATA. In LIMDEP (NLOGIT), Moulton’s shared error term
adjustment is done by first arranging the data as in a panel with the groups contained in
contiguous blocks of observations. Then, the command is “REGRESS ; ... ; CLUSTER = spec.
$” where "spec" is either a fixed number of observations in a group, or the name of an
identification variable that contains a class number. The important point is to recognize that
heterogeneity could be the result of each group having its own variance and each individual
within a group having its own variance. As discussed in detail in Parts Two, Three and Four,
heteroscedasticity in general is handled in STATA with the “ROBUST” command and in
LIMDEP with the “HETRO” command.

MODULE ONE, PART TWO: READING DATA INTO LIMDEP, CREATING AND
RECODING VARIABLES, AND ESTIMATING AND TESTING MODELS IN LIMDEP

This Part Two of Module One provides a cookbook-type demonstration of the steps required to
read or import data into LIMDEP. The reading of both small and large text and Excel files is shown through real data examples. The procedures to recode and create variables within
LIMDEP are demonstrated. Commands for least-squares regression estimation and maximum
likelihood estimation of probit and logit models are provided. Consideration is given to analysis
of variance and the testing of linear restrictions and structural differences, as outlined in Part
One. (Parts Three and Four provide the STATA and SAS commands for the same operations
undertaken here in Part Two with LIMDEP. For a thorough review of LIMDEP, see Hilbe,
2006.)

IMPORTING EXCEL FILES INTO LIMDEP

LIMDEP can read or import data in several ways. The most easily imported files are those
created in Microsoft Excel with the “.xls” file name extension. To see how this is done, consider
the data set in the Excel file “post-pre.xls,” which consists of test scores for 24 students in four
classes. The column titled “Student” identifies the 24 students by number, “post” provides each
student’s post-course test score, “pre” is each student’s pre-course test score, and “class”
identifies to which one of the four classes the student was assigned, e.g., class4 = 1 if the student was in the fourth class and class4 = 0 if not. The “.” in the post column for student 24 indicates
that the student is missing a post-course test score.

student  post  pre  class1  class2  class3  class4
1        31    22   1       0       0       0
2        30    21   1       0       0       0
3        33    23   1       0       0       0
4        31    22   1       0       0       0
5        25    18   1       0       0       0
6        32    24   0       1       0       0
7        32    23   0       1       0       0
8        30    20   0       1       0       0
9        31    22   0       1       0       0
10       23    17   0       1       0       0
11       22    16   0       1       0       0
12       21    15   0       1       0       0
13       30    19   0       0       1       0
14       21    14   0       0       1       0
15       19    13   0       0       1       0
16       23    17   0       0       1       0
17       30    20   0       0       1       0
18       31    21   0       0       1       0
19       20    15   0       0       0       1
20       26    18   0       0       0       1
21       20    16   0       0       0       1
22       14    13   0       0       0       1
23       28    21   0       0       0       1
24       .     12   0       0       0       1

To start, the file “post-pre.xls” must be downloaded and copied to your computer’s hard drive.
Once this is done, open LIMDEP. Clicking on “Project,” “Import,” and “Variables…” yields the
following screen display:

Clicking “Variable” gives a screen display of your folders in “My Documents,” in which you can locate Excel (.wk and .xls) files.

The next slide shows a path to the file “post-pre.xls.” (The path to your copy of “post-pre.xls”
will obviously depend on where you placed it on your computer’s hard drive.) Clicking “Open”
imports the file into LIMDEP.

To make sure the data has been correctly imported into LIMDEP, click the “Activate Data
Editor” button, which is second from the right on the tool bar or go to Data Editor in the
Window’s menu. Notice that the missing observation for Student 24 appears as a blank in this
data editor. The sequencing of these steps and the associated screens follow:

READING SPACE-DELIMITED TEXT FILES INTO LIMDEP

Next we consider externally created text files that are typically accompanied by the “.txt” or
“.prn” extensions. For demonstration purposes, the data set we just employed with 24
observations on the 7 variables (“student,” “post,” “pre,” “class1,” “class2,” “class3,” and
“class4”) was saved as the space-delimited text file “post-pre.txt.” After downloading this file to your hard drive, open LIMDEP to its first screen:
 

To read the file “post-pre.txt,” begin by clicking “File” in the upper left-hand corner of the
ribbon, which will yield the following screen display:

Click on “OK” to “Text/Command Document” to create a file into which all your commands
will go.

The “read” command is typed or copied into the “Untitled” command file, with all subparts of a
command separated with semicolons (;). The program is not case sensitive; thus, upper and
lower case letters can be used interchangeably. The read command includes the number of
variables or columns to be read (nvar= ), the number of records or observations for each variable
(nobs= ), and the place to find the file (File= ). Because the names of the variables are on the
first row of the file to be read, we tell this to LIMDEP with the Names=1 command. If the file
path is long and involves spaces (as it is here, but your path will depend on where you placed
your file), then quote marks are required around the path. The $ indicates the end of the
command.

Read;NVAR=7;NOBS=24;Names=1;File=
"C:\Documents and Settings\beckerw\My Documents\WKPAPERS\NCEE - econometrics
\Module One\post-pre.txt"$

Upon copying or typing this read command into the command file and highlighting the entire
three lines, the screen display appears as below and the “Go” button is pressed to run the
command.

LIMDEP tells the user that it has attempted the command with the appearance of

To check on the correct reading of the data, click the “Activate Data Editor” button, which is
second from the right on the tool bar or go to Data Editor in the Window’s menu. Notice that if
you use the Window’s menus, there are now four files open within Limdep: Untitled Project 1*,
Untitled 1*, Output 1*, and Data Editor. As you already know, Untitled 1 contains your read
command and Untitled Project is just information in the opening LIMDEP screen. Output
contains the commands that LIMDEP has attempted, which so far only includes the read
command. This output file could also have been accessed, when it appeared, by clicking on the view square next to the X box in the following rectangle to check on whether the read command was properly executed.

READING LARGE FILES INTO LIMDEP

LIMDEP has a data matrix default restriction of no more than 222 rows (records per variable),
900 columns (number of variables) and 200,000 cells. To read, import or create nonconforming data sets, this default setting must be changed. For example, to accommodate larger data sets the number of rows must be increased. If the creation of more than 900 variables is anticipated, even if fewer than 900 variables were initially imported, then the number of columns must be increased before any data are read. This is accomplished by clicking the project button on the top ribbon, going to settings, and changing the number of cells and number of rows.

As an example, consider the data set employed by Becker and Powers (2001), which
initially had 2,837 records. Open LIMDEP and go to “Project” and then “Settings…,” which
yields the following screen display:

Increasing the “Number of Cells” from 200,000 to 2,000,000 and increasing “Rows” from 222 to
3,000, automatically resets the “Columns” to 666, which is more than sufficient to read the 64
variables in the initial data set and to accommodate any variables to be created within LIMDEP.
Pressing “OK” resets the memory allocation that LIMDEP will employ for this data set.

This Becker and Powers data set does not have variable names embedded in it. Thus, they will be added to the read command. To now read the data, follow the path “File” to “New” to “Text/Command Document” and click “OK.” Entering the following read statement into the
Text/Command file, highlighting it, and pushing the green “Go” button will enter the 2,837
records on 64 variables in file beck8WO into LIMDEP and each of the variables will be named
as indicated by each two character label.
READ; NREC=2837; NVAR=64; FILE=F:\beck8WO.csv; Names=
A1,A2,X3, C,AL,AM,AN,CA,CB,CC,CH,CI,CJ,CK,CL,CM,CN,CO,CS,CT,
CU,CV,CW,DB,DD,DI,DJ,DK,DL,DM,DN,DQ,DR,DS,DY,DZ,EA,EB,EE,EF,
EI,EJ,EP,EQ,ER,ET,EY,EZ,FF,FN,FX,FY,FZ,GE,GH,GM,GN,GQ,GR,HB,
HC,HD,HE,HF $

Defining all the variables is not critical for our purposes here, but the variables used in the Becker and Powers study required the following definitions (where names are not upper- and lower-case sensitive):

A1: Term, where 1= fall, 2 = spring


A2: School code, where 100/199 = doctorate, 200/299 = comprehensive, 300/399 = lib arts,
400/499 = 2 year
hb: initial class size (number taking preTUCE)
hc: final class size (number taking postTUCE)
dm: experience measured by number of years teaching
dj: teacher’s highest degree, where Bachelors=1, Masters=2, PhD=3
cc: postTUCE score (0 to 30)
an: preTUCE score (0 to 30)
ge: Student evaluation measured interest
gh: Student evaluation measured textbook quality
gm: Student evaluation measured regular instructor’s English ability
gq: Student evaluation measured overall teaching effectiveness
ci: Instructor sex (Male=1, Female=2)
ck: English is native language of instructor (Yes=1, No=0)
cs: PostTUCE score counts toward course grade (Yes=1, No=0)
ff: GPA*100
fn: Student had high school economics (Yes=1, No=0)
ey: Student’s sex (Male=1, Female=2)
fx: Student working in a job (Yes=1, No=0)

Notice that this data set is too large to fit in LIMDEP’s “Active Data Editor” but all of the data
are there as verified with the following DSTAT command, which is entered in the
Text/Command file and highlighted. Upon clicking on the Go button, the descriptive statistics
for each variable appear in the output file. Again, the output file is accessed via the Window tab
in the upper ribbon. (Notice that in this data set, all missing values were coded −9. )

Dstat; RHS= A1,A2,X3, C,AL,AM,AN,CA,CB,CC,CH,CI,CJ,CK,CL,CM,CN,CO,CS,CT,


CU,CV,CW,DB,DD,DI,DJ,DK,DL,DM,DN,DQ,DR,DS,DY,DZ,EA,EB,EE,EF,
EI,EJ,EP,EQ,ER,ET,EY,EZ,FF,FN,FX,FY,FZ,GE,GH,GM,GN,GQ,GR,HB,
HC,HD,HE,HF $

In summary, the LIMDEP Help menu states that the READ command is of the general form

READ ; Nobs = number of observations


; Nvar = number of variables
; Names = list of Nvar names
; File = name of the data file $

The default is an ASCII (or text) data file in which numbers are separated by blanks, tabs, and/or
commas. Although not demonstrated here, LIMDEP will also read formatted files by adding the
option “; Format = ( Fortran format )” to the read command. In addition, although not
demonstrated, small data sets can be cut and pasted directly into the Text/Command Document,
preceded by a simple read command “READ ; Nobs = number of observations; Nvar =
number of variables$”, where “;Names = 1” would also be added if names appear on the line
before the data.
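
For instance, a minimal sketch of this cut-and-paste approach, following the general form just described (the three hypothetical observations below are simply the first three rows of the “post-pre” data and are shown for illustration only), would place the named data directly after the read command in the Text/Command Document:

READ ; Nobs = 3 ; Nvar = 2 ; Names = 1 $
post pre
31 22
30 21
33 23

Highlighting these lines and pressing “Go” should read the two named variables just as the file-based read commands above do.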
 

LEAST-SQUARES ESTIMATION AND LINEAR RESTRICTIONS IN LIMDEP

To demonstrate some of the least-squares regression commands in LIMDEP, read either the Excel or space-delimited text version of the 24 observations and 7 variable “post-pre” data set into LIMDEP. The model to be estimated is

post = β1 + β2 pre + f(classes) + ε

All statistical and mathematical instructions must be placed in the “Text/Command Document”
of LIMDEP, which is accessed via the “File” to “New” route described earlier:

Once in the “Text/Command Document,” the command for a regression can be entered.
Before doing this, however, recall that the posttest score is missing for the 24th person, as can be
seen in the “Active Data Editor.” LIMDEP automatically codes all missing data that appear in a
text or Excel file as “.” with the value −999. If a regression is estimated with the entire data set,
then this fictitious −999 place holding value would be incorrectly employed. To avoid this, all
commands can be prefaced with “skip,” which tells LIMDEP to not use records involving −999.
(In the highly unlikely event that −999 is a legitimate value, then a recoding is required as
discussed below.) The syntax for regressions in LIMDEP is

Regress; lhs= ???; rhs=one, ??? $

where “lhs=” is the left-hand-side dependent variable and “rhs=” is the right-hand-side
explanatory variable. The “one” is included on the right-hand-side to estimate a y-intercept. If
this “one” is not specified then the regression is forced to go through the origin – that is, no
constant term is estimated. Finally, LIMDEP will automatically predict the value of the
dependent variable, including 95 percent confidence intervals, and show the results in the output
file by adding “fill; list” to the regression command:

Regress; lhs= ???; rhs=one, ???; fill; list $

Returning to the 24-observation “post-pre” data set, the full model to be estimated is

post = β1 + β2 pre + β3class1 + β4class2 + β5class3 + ε

To avoid the sum of the four dummy variables equaling the column of ones in the data set, the
fourth class is not included. The commands to be typed into the Untitled Text/Command
Document are now

skip$

Regress ; Lhs =post; Rhs =one, pre, class1, class2,class3; fill; list$

Highlighting these commands and pressing “Go” gives the results in the LIMDEP output file:

Predicted Values (* => observation was not in estimating sample.)


le One\post-pre.xls
Observation Observed Y Predicted Y Residual 95% Forecast Interval
1 31.000 31.214 -.2138 28.3089 34.1187
2 30.000 29.697 .3034 26.7956 32.5975
3 33.000 32.731 .2690 29.8090 35.6530
4 31.000 31.214 -.2138 28.3089 34.1187
5 25.000 25.145 -.1449 22.1774 28.1124
6 32.000 34.005 -2.0048 31.0444 36.9653
7 32.000 32.488 -.4876 29.5784 35.3968
8 30.000 27.936 2.0640 25.1040 30.7679
9 31.000 30.970 .0296 28.1000 33.8408
10 23.000 23.384 -.3843 20.5091 26.2594
11 22.000 21.867 .1329 18.9513 24.7828
12 21.000 20.350 .6502 17.3811 23.3186
13 30.000 28.195 1.8046 25.3167 31.0740
14 21.000 20.609 .3907 17.6757 23.5428
15 19.000 19.092 -.0920 16.1089 22.0752
16 23.000 25.161 -2.1609 22.3001 28.0218
17 30.000 29.713 .2874 26.8053 32.6199
18 31.000 31.230 -.2298 28.2811 34.1786
19 20.000 19.172 .8276 16.2549 22.0900
20 26.000 23.724 2.2759 20.8105 26.6377
21 20.000 20.690 -.6897 17.7866 23.5927
22 14.000 16.138 -2.1380 13.1530 19.1230
23 28.000 28.276 -.2758 25.2500 31.3016
* 24 No data 14.621 No data 11.5836 17.6579

From this output the predicted posttest score for student 24 is 14.621, with 95 percent confidence interval 11.5836 < E(y|X24) < 17.6579.

A researcher might be interested in testing whether the class in which a student is enrolled affects his/her post-course test score, assuming fixed effects only. This linear restriction is done
automatically in LIMDEP by adding the following “cls:” command to the regression statement in
the Text/Command Document.

Regress ; Lhs =post; Rhs =one, pre, class1, class2,class3; CLS: b(3)=0,b(4)=0,b(5)=0$

Upon highlighting and pressing the Go button, the following results will appear in the output file:
 
 
--> Regress ; Lhs =post; Rhs =one, pre, class1, class2,class3; CLS: b(3)=0,b(...

************************************************************************
* NOTE: Deleted 1 observations with missing data. N is now 23 *
************************************************************************

le One\post-pre.xls
+-----------------------------------------------------------------------+
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = POST Mean= 26.21739130 , S.D.= 5.384797808 |
| Model size: Observations = 23, Parameters = 5, Deg.Fr.= 18 |
| Residuals: Sum of squares= 28.59332986 , Std.Dev.= 1.26036 |
| Fit: R-squared= .955177, Adjusted R-squared = .94522 |
| Model test: F[ 4, 18] = 95.89, Prob value = .00000 |
| Diagnostic: Log-L = -35.1389, Restricted(b=0) Log-L = -70.8467 |
| LogAmemiyaPrCrt.= .660, Akaike Info. Crt.= 3.490 |
| Autocorrel: Durbin-Watson Statistic = 1.72443, Rho = .13779 |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant -3.585879292 1.6459223 -2.179 .0429 le One\post-pre.xl
PRE 1.517221644 .93156695E-01 16.287 .0000 18.695652
CLASS1 1.420780437 .90500685 1.570 .1338 .21739130
CLASS2 1.177398543 .78819907 1.494 .1526 .30434783
CLASS3 2.954037461 .76623994 3.855 .0012 .26086957
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)
le One\post-pre.xls

le One\post-pre.xls
+-----------------------------------------------------------------------+
| Linearly restricted regression |
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = POST Mean= 26.21739130 , S.D.= 5.384797808 |
| Model size: Observations = 23, Parameters = 2, Deg.Fr.= 21 |
| Residuals: Sum of squares= 53.19669876 , Std.Dev.= 1.59160 |
| Fit: R-squared= .916608, Adjusted R-squared = .91264 |
| (Note: Not using OLS. R-squared is not bounded in [0,1] |
| Model test: F[ 1, 21] = 230.82, Prob value = .00000 |
| Diagnostic: Log-L = -42.2784, Restricted(b=0) Log-L = -70.8467 |
| LogAmemiyaPrCrt.= 1.013, Akaike Info. Crt.= 3.850 |
| Note, when restrictions are imposed, R-squared can be less than zero. |
| F[ 3, 18] for the restrictions = 5.1627, Prob = .0095 |
| Autocorrel: Durbin-Watson Statistic = 1.12383, Rho = .43808 |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant -2.211829436 1.9004224 -1.164 .2597 le One\post-pre.xl
PRE 1.520632737 .10008855 15.193 .0000 18.695652
CLASS1 .0000000000 ........(Fixed Parameter)........ .21739130
CLASS2 -.4440892099E-15........(Fixed Parameter)........ .30434783
CLASS3 -.4440892099E-15........(Fixed Parameter)........ .26086957

(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)


le One\post-pre.xls

From the second part of this printout, the appropriate F statistic to test

Ho: β3 = β4 = β5 = 0 versus Ha: at least one of these betas is nonzero

is F[df1 = 3, df2 = 18] = 5.1627, with p-value = 0.0095. Thus, the null hypothesis that none of the class dummies matters can be rejected at the 0.05 Type I error level in favor of the hypothesis that at least one class matters, assuming that the effect of the pre-course test score on the post-course test score is the same in all classes and only the constant is affected by class assignment.
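
As a check, this F value can be reproduced from the residual sums of squares reported in the two panels above (unrestricted 28.5933 with 18 degrees of freedom; restricted 53.1967):

\[
F = \frac{(53.1967 - 28.5933)/3}{28.5933/18} = \frac{8.2011}{1.5885} \approx 5.16 .
\]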

STRUCTURAL (CHOW) TEST

The above test of the linear restriction β3 = β4 = β5 = 0 (no difference among classes) assumed that the pretest slope coefficient was constant, fixed and unaffected by the class to which a student belonged. A full structural test requires the fitting of four separate regressions to obtain the four residual sums of squares that are added to obtain the unrestricted sum of squares. The
restricted sum of squares is obtained from a regression of posttest on pretest with no dummies for
the classes; that is, the class to which a student belongs is irrelevant in the manner in which
pretests determine the posttest score.

The commands to be entered into the Document/text file of LIMDEP are as follows:

Restricted Regression

Sample; 1-23$
Regress ; Lhs =post; Rhs =one, pre$
Calc ; SSall = Sumsqdev$

Unrestricted Regressions

Sample; 1-5$
Regress ; Lhs =post; Rhs =one, pre$
Calc ; SS1 = Sumsqdev$

Sample; 6-12$
Regress ; Lhs =post; Rhs =one, pre$
Calc ; SS2 = Sumsqdev$

Sample; 13-18$
Regress ; Lhs =post; Rhs =one, pre$
Calc ; SS3 = Sumsqdev$
William. E. Becker  Module One, Part Two: Using LIMDEP  Sept. 15, 2008: p. 23   

 
 

Sample; 19-23$
Regress ; Lhs =post; Rhs =one, pre$
Calc ; SS4 = Sumsqdev$

Calc;List ;F=((SSall-(SS1+SS2+SS3+SS4))/(3*2)) / ((SS1+SS2+SS3+SS4)/(23-4*2))$

The LIMDEP output is


--> RESET
--> READ;FILE="C:\Documents and Settings\beckerw\My Documents\WKPAPERS\NCEE -...
--> Reject; post=-999$
--> Regress ; Lhs =post; Rhs =one, pre, class1, class2,class3; CLS: b(3)=0,b(...

+-----------------------------------------------------------------------+
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = POST Mean= 26.21739130 , S.D.= 5.384797808 |
| Model size: Observations = 23, Parameters = 5, Deg.Fr.= 18 |
| Residuals: Sum of squares= 28.59332986 , Std.Dev.= 1.26036 |
| Fit: R-squared= .955177, Adjusted R-squared = .94522 |
| Model test: F[ 4, 18] = 95.89, Prob value = .00000 |
| Diagnostic: Log-L = -35.1389, Restricted(b=0) Log-L = -70.8467 |
| LogAmemiyaPrCrt.= .660, Akaike Info. Crt.= 3.490 |
| Autocorrel: Durbin-Watson Statistic = 1.72443, Rho = .13779 |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant -3.585879292 1.6459223 -2.179 .0429
PRE 1.517221644 .93156695E-01 16.287 .0000 18.695652
CLASS1 1.420780437 .90500685 1.570 .1338 .21739130
CLASS2 1.177398543 .78819907 1.494 .1526 .30434783
CLASS3 2.954037461 .76623994 3.855 .0012 .26086957
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)
+-----------------------------------------------------------------------+
| Linearly restricted regression |
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = POST Mean= 26.21739130 , S.D.= 5.384797808 |
| Model size: Observations = 23, Parameters = 2, Deg.Fr.= 21 |
| Residuals: Sum of squares= 53.19669876 , Std.Dev.= 1.59160 |
| Fit: R-squared= .916608, Adjusted R-squared = .91264 |
| (Note: Not using OLS. R-squared is not bounded in [0,1] |
| Model test: F[ 1, 21] = 230.82, Prob value = .00000 |
| Diagnostic: Log-L = -42.2784, Restricted(b=0) Log-L = -70.8467 |
| LogAmemiyaPrCrt.= 1.013, Akaike Info. Crt.= 3.850 |
| Note, when restrictions are imposed, R-squared can be less than zero. |
| F[ 3, 18] for the restrictions = 5.1627, Prob = .0095 |
| Autocorrel: Durbin-Watson Statistic = 1.12383, Rho = .43808 |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant -2.211829436 1.9004224 -1.164 .2597
PRE 1.520632737 .10008855 15.193 .0000 18.695652
CLASS1 .0000000000 ........(Fixed Parameter)........ .21739130
CLASS2 -.4440892099E-15........(Fixed Parameter)........ .30434783
CLASS3 -.4440892099E-15........(Fixed Parameter)........ .26086957

William. E. Becker  Module One, Part Two: Using LIMDEP  Sept. 15, 2008: p. 24   

 
 

(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

--> Sample; 1-23$


--> Regress ; Lhs =post; Rhs =one, pre$

+-----------------------------------------------------------------------+
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = POST Mean= 26.21739130 , S.D.= 5.384797808 |
| Model size: Observations = 23, Parameters = 2, Deg.Fr.= 21 |
| Residuals: Sum of squares= 53.19669876 , Std.Dev.= 1.59160 |
| Fit: R-squared= .916608, Adjusted R-squared = .91264 |
| Model test: F[ 1, 21] = 230.82, Prob value = .00000 |
| Diagnostic: Log-L = -42.2784, Restricted(b=0) Log-L = -70.8467 |
| LogAmemiyaPrCrt.= 1.013, Akaike Info. Crt.= 3.850 |
| Autocorrel: Durbin-Watson Statistic = 1.12383, Rho = .43808 |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant -2.211829436 1.9004224 -1.164 .2575
PRE 1.520632737 .10008855 15.193 .0000 18.695652

--> Calc ; SSall = Sumsqdev $


--> Sample; 1-5$
--> Regress ; Lhs =post; Rhs =one, pre$

+-----------------------------------------------------------------------+
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = POST Mean= 30.00000000 , S.D.= 3.000000000 |
| Model size: Observations = 5, Parameters = 2, Deg.Fr.= 3 |
| Residuals: Sum of squares= .2567567568 , Std.Dev.= .29255 |
| Fit: R-squared= .992868, Adjusted R-squared = .99049 |
| Model test: F[ 1, 3] = 417.63, Prob value = .00026 |
| Diagnostic: Log-L = .3280, Restricted(b=0) Log-L = -12.0299 |
| LogAmemiyaPrCrt.= -2.122, Akaike Info. Crt.= .669 |
| Autocorrel: Durbin-Watson Statistic = 2.19772, Rho = -.09886 |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant -2.945945946 1.6174496 -1.821 .1661
PRE 1.554054054 .76044788E-01 20.436 .0003 21.200000
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

--> Calc ; SS1 = Sumsqdev $


--> Sample; 6-12$
--> Regress ; Lhs =post; Rhs =one, pre$

+-----------------------------------------------------------------------+
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = POST Mean= 27.28571429 , S.D.= 5.023753103 |
| Model size: Observations = 7, Parameters = 2, Deg.Fr.= 5 |
| Residuals: Sum of squares= 7.237132353 , Std.Dev.= 1.20309 |
| Fit: R-squared= .952208, Adjusted R-squared = .94265 |
| Model test: F[ 1, 5] = 99.62, Prob value = .00017 |
| Diagnostic: Log-L = -10.0492, Restricted(b=0) Log-L = -20.6923 |
| LogAmemiyaPrCrt.= .621, Akaike Info. Crt.= 3.443 |
| Autocorrel: Durbin-Watson Statistic = 1.50037, Rho = .24982 |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X|


+---------+--------------+----------------+--------+---------+----------+
Constant .6268382353 2.7094095 .231 .8262
PRE 1.362132353 .13647334 9.981 .0002 19.571429

--> Calc ; SS2 = Sumsqdev $


--> Sample; 13-18$
--> Regress ; Lhs =post; Rhs =one, pre$

+-----------------------------------------------------------------------+
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = POST Mean= 25.66666667 , S.D.= 5.278888772 |
| Model size: Observations = 6, Parameters = 2, Deg.Fr.= 4 |
| Residuals: Sum of squares= 8.081250000 , Std.Dev.= 1.42138 |
| Fit: R-squared= .942001, Adjusted R-squared = .92750 |
| Model test: F[ 1, 4] = 64.97, Prob value = .00129 |
| Diagnostic: Log-L = -9.4070, Restricted(b=0) Log-L = -17.9490 |
| LogAmemiyaPrCrt.= .991, Akaike Info. Crt.= 3.802 |
| Autocorrel: Durbin-Watson Statistic = 1.51997, Rho = .24001 |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant -1.525000000 3.4231291 -.445 .6790
PRE 1.568750000 .19463006 8.060 .0013 17.333333

--> Calc ; SS3 = Sumsqdev $


--> Sample; 19-23$
--> Regress ; Lhs =post; Rhs =one, pre$

+-----------------------------------------------------------------------+
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = POST Mean= 21.60000000 , S.D.= 5.549774770 |
| Model size: Observations = 5, Parameters = 2, Deg.Fr.= 3 |
| Residuals: Sum of squares= 8.924731183 , Std.Dev.= 1.72479 |
| Fit: R-squared= .927559, Adjusted R-squared = .90341 |
| Model test: F[ 1, 3] = 38.41, Prob value = .00846 |
| Diagnostic: Log-L = -8.5432, Restricted(b=0) Log-L = -15.1056 |
| LogAmemiyaPrCrt.= 1.427, Akaike Info. Crt.= 4.217 |
| Autocorrel: Durbin-Watson Statistic = .82070, Rho = .58965 |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant -7.494623656 4.7572798 -1.575 .2132
PRE 1.752688172 .28279093 6.198 .0085 16.600000

--> Calc ; SS4 = Sumsqdev $


--> Calc;List ;F=((SSall-(SS1+SS2+SS3+SS4))/(3*2)) / ((SS1+SS2+SS3+SS4)/(23-4*2))$
F = .29282633057790450D+01

The structural test across all classes is

H0: β1 = β2 = . . . = β4     and     Ha: the β's are not all equal

William. E. Becker  Module One, Part Two: Using LIMDEP  Sept. 15, 2008: p. 26   

 
 

\[
F = \frac{(\text{ResSS}_r - \text{ResSS}_u)/[K(J-1)]}{\text{ResSS}_u/(n - JK)}
\]

Because the calculated F = 2.93 exceeds the critical F (probability of Type I error = 0.05, df1 = 6, df2 = 15) = 2.79, we reject the null hypothesis and conclude that at least one class is significantly different from another when the slope on the pre-course test score is allowed to vary from one class to another. That is, the class in which a student is enrolled is important because of a change in the slope and/or the intercept.

HETEROSCEDASTICITY

To adjust for either heteroscedasticity across individual observations or a common error term
within groups but not across groups, the “hetro” and “cluster” commands can be added to the standard “regress” command in LIMDEP in the following manner:

Skip

Regress; Lhs= post; Rhs= one, pre, class1, class2, class3$

Regress; Lhs= post; Rhs= one, pre, class1, class2, class3


;hetro $

Create; Class = class1+2*class2+3*class3+4*class4$


Regress ; Lhs= post; Rhs= one, pre, class1, class2, class3
;cluster=class $

where the “class” variable is created to name the classes 1, 2, 3 and 4 to enable their
identification in the “cluster” command.

The resulting LIMDEP output shows a marked increase in the significance of the individual
group effects, as reflected in their respective p-values.
--> RESET
--> READ;FILE="C:\Documents and Settings\beckerw\My Documents\WKPAPERS\NCEE -...
--> skip
--> Regress; Lhs= post; Rhs= one, pre, class1, class2, class3$
************************************************************************
* NOTE: Deleted 1 observations with missing data. N is now 23 *
************************************************************************

le One\post-pre.xls
+-----------------------------------------------------------------------+
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = POST Mean= 26.21739130 , S.D.= 5.384797808 |
| Model size: Observations = 23, Parameters = 5, Deg.Fr.= 18 |


| Residuals: Sum of squares= 28.59332986 , Std.Dev.= 1.26036 |
| Fit: R-squared= .955177, Adjusted R-squared = .94522 |
| Model test: F[ 4, 18] = 95.89, Prob value = .00000 |
| Diagnostic: Log-L = -35.1389, Restricted(b=0) Log-L = -70.8467 |
| LogAmemiyaPrCrt.= .660, Akaike Info. Crt.= 3.490 |
| Autocorrel: Durbin-Watson Statistic = 1.72443, Rho = .13779 |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant -3.585879292 1.6459223 -2.179 .0429 le One\post-pre.xl
PRE 1.517221644 .93156695E-01 16.287 .0000 18.695652
CLASS1 1.420780437 .90500685 1.570 .1338 .21739130
CLASS2 1.177398543 .78819907 1.494 .1526 .30434783
CLASS3 2.954037461 .76623994 3.855 .0012 .26086957
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

--> Regress; Lhs= post; Rhs= one, pre, class1, class2, class3
;hetro $

************************************************************************
* NOTE: Deleted 1 observations with missing data. N is now 23 *
************************************************************************

+-----------------------------------------------------------------------+
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = POST Mean= 26.21739130 , S.D.= 5.384797808 |
| Model size: Observations = 23, Parameters = 5, Deg.Fr.= 18 |
| Residuals: Sum of squares= 28.59332986 , Std.Dev.= 1.26036 |
| Fit: R-squared= .955177, Adjusted R-squared = .94522 |
| Model test: F[ 4, 18] = 95.89, Prob value = .00000 |
| Diagnostic: Log-L = -35.1389, Restricted(b=0) Log-L = -70.8467 |
| LogAmemiyaPrCrt.= .660, Akaike Info. Crt.= 3.490 |
| Autocorrel: Durbin-Watson Statistic = 1.72443, Rho = .13779 |
| Results Corrected for heteroskedasticity |
| Breusch - Pagan chi-squared = 4.0352, with 4 degrees of freedom |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant -3.585879292 1.5096560 -2.375 .0289
PRE 1.517221644 .72981808E-01 20.789 .0000 18.695652
CLASS1 1.420780437 .67752835 2.097 .0504 .21739130
CLASS2 1.177398543 .72249740 1.630 .1206 .30434783
CLASS3 2.954037461 .80582075 3.666 .0018 .26086957
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)
--> Create; Class = class1+2*class2+3*class3+4*class4$
--> Regress ; Lhs= post; Rhs= one, pre, class1, class2, class3
;cluster=class $

************************************************************************
* NOTE: Deleted 1 observations with missing data. N is now 23 *
************************************************************************

+-----------------------------------------------------------------------+
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = POST Mean= 26.21739130 , S.D.= 5.384797808 |
| Model size: Observations = 23, Parameters = 5, Deg.Fr.= 18 |
| Residuals: Sum of squares= 28.59332986 , Std.Dev.= 1.26036 |
| Fit: R-squared= .955177, Adjusted R-squared = .94522 |
| Model test: F[ 4, 18] = 95.89, Prob value = .00000 |
| Diagnostic: Log-L = -35.1389, Restricted(b=0) Log-L = -70.8467 |
| LogAmemiyaPrCrt.= .660, Akaike Info. Crt.= 3.490 |
| Autocorrel: Durbin-Watson Statistic = 1.72443, Rho = .13779 |
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
| Covariance matrix for the model is adjusted for data clustering. |
| Sample of 23 observations contained 4 clusters defined by |
| variable CLASS which identifies by a value a cluster ID. |
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |t-ratio |P[|T|>t] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant -3.585879292 1.5875538 -2.259 .0365
PRE 1.517221644 .95635769E-01 15.865 .0000 18.695652
CLASS1 1.420780437 .43992454 3.230 .0046 .21739130
CLASS2 1.177398543 .28417486 4.143 .0006 .30434783
CLASS3 2.954037461 .70132897E-01 42.121 .0000 .26086957
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

ESTIMATING PROBIT MODELS IN LIMDEP

Often variables need to be transformed or created within a computer program to perform the
desired analysis. To demonstrate the process and commands in LIMDEP, start with the Becker
and Powers data that have been or can be read into LIMDEP as shown earlier. After reading
the data into LIMDEP, the first task is to recode the qualitative data into appropriate dummies.

A2 contains a range of values representing various classes of institutions. These are recoded via
the “recode” command, where A2 is set equal to 1 for doctoral institutions (100/199), 2 for
comprehensive or master’s degree granting institutions (200/299), 3 for liberal arts colleges
(300/399) and 4 for two-year colleges (400/499). The "create" command is then used to create
1 and 0 bivariate variables for each of these institutions of post-secondary education:

recode; a2; 100/199 = 1; 200/299 = 2; 300/399 = 3; 400/499 =4$


create; doc=a2=1; comp=a2=2; lib=a2=3; twoyr=a2=4$

As should be apparent, this syntax says if a2 has a value between 100 and 199 recode it to be 1.
If a2 has a value between 200 and 299 recode it to be 2 and so on. Next, create a variable called
“doc” and if a2=1, then set doc=1 and for any other value of a2 let doc=0. Create a variable
called “comp” and if a2=2, then set comp=1 and for any other value of a2 let comp=0, and so on.

Next 1 - 0 bivariates are created to show whether the instructor had a PhD degree and whether the
student got a positive score on the postTUCE:

create; phd=dj=3; final=cc>0$

To allow for quadratic forms in teacher experiences and class size the following variables are
created:

create; dmsq=dm^2; hbsq=hb^2$

In this data set, as can be seen in the descriptive statistics (DSTAT), all missing values were
coded −9. Thus, adding together some of the responses to the student evaluations gives
information on whether a student actually completed an evaluation. For example, if the sum of
ge, gh, gm, and gq equals −36, we know that the student did not complete a student evaluation
in a meaningful way. A dummy variable to reflect this fact is then created by:

create; evalsum=ge+gh+gm+gq; noeval=evalsum=−36$

Finally, from the TUCE developer it is known that student number 2216 was counted in term 2
but was actually in term 1, and no postTUCE was taken. This error is corrected with the following
command:

recode; hb; 90=89$ #2216 counted in term 2, but in term 1 with no posttest

These “create” and “recode” commands can be entered into LIMDEP as a block, highlighted and
run with the “Go” button. Notice, also, that descriptive statements can be written after the “$” as
a reminder or for later justification or reference as to why the command was included.

One of the things of interest to Becker and Powers was whether class size at the beginning or end
of the term influenced whether a student completed the postTUCE. This can be assessed by
fitting a probit model to the 1 – 0 discrete dependent variable “final.” To do this, however, we
must make sure that there are no missing data on the variables to be included as regressors. In
this data set, all missing values were coded −9. LIMDEP’s “reject” command can be employed
to remove all records with a −9 value. The “probit” command is used to invoke a maximum
likelihood estimation with the following syntax:

Probit; Lhs=???; rhs=one, ???; marginaleffect$

where the addition of the “marginaleffect” tells LIMDEP to calculate marginal effects regardless
of whether the explanatory variable is or is not continuous. The commands to be entered into the
Text/Command Document are then

Reject; AN=-9$
Reject; HB=-9$
Reject; ci=-9$
Reject; ck=-9$
Reject; cs=0$
Reject; cs=-9$
Reject; a2=-9$
Reject; phd=-9$
reject; hc=-9$
probit;lhs=final;
rhs=one,an,hb,doc,comp,lib,ci,ck,phd,noeval;marginaleffect$
probit;lhs=final;
rhs=one,an,hc,doc,comp,lib,ci,ck,phd,noeval;marginaleffect$

which upon highlighting and pressing the Go button yields the output for these two probit
models.

--> probit;lhs=final;
rhs=one,an,hb,doc,comp,lib,ci,ck,phd,noeval;marginaleffect$
Normal exit from iterations. Exit status=0.

+---------------------------------------------+
| Binomial Probit Model |
| Maximum Likelihood Estimates |
| Model estimated: May 05, 2008 at 04:07:02PM.|
| Dependent variable FINAL |
| Weighting variable None |
| Number of observations 2587 |
| Iterations completed 6 |
| Log likelihood function -822.7411 |
| Restricted log likelihood -1284.216 |
| Chi squared 922.9501 |
| Degrees of freedom 9 |
| Prob[ChiSqd > value] = .0000000 |
| Hosmer-Lemeshow chi-squared = 26.06658 |
| P-value= .00102 with deg.fr. = 8 |
+---------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Index function for probability
Constant .9953497702 .24326247 4.092 .0000
AN .2203899720E-01 .94751772E-02 2.326 .0200 10.596830
HB -.4882560519E-02 .19241005E-02 -2.538 .0112 55.558949
DOC .9757147902 .14636173 6.666 .0000 .31774256
COMP .4064945318 .13926507 2.919 .0035 .41785852
LIB .5214436028 .17664590 2.952 .0032 .13567839
CI .1987315042 .91686501E-01 2.168 .0302 1.2311558
CK .8778999306E-01 .13428742 .654 .5133 .91998454
PHD -.1335050091 .10303166 -1.296 .1951 .68612292
NOEVAL -1.930522400 .72391102E-01 -26.668 .0000 .29068419
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

+-------------------------------------------+
| Partial derivatives of E[y] = F[*] with |
| respect to the vector of characteristics. |
| They are computed at the means of the Xs. |
| Observations used for means are All Obs. |
+-------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Index function for probability
Constant .1977242134 .48193408E-01 4.103 .0000
AN .4378002101E-02 .18769275E-02 2.333 .0197 10.596830
HB -.9699107460E-03 .38243741E-03 -2.536 .0112 55.558949
Marginal effect for dummy variable is P|1 - P|0.
DOC .1595047130 .20392136E-01 7.822 .0000 .31774256
Marginal effect for dummy variable is P|1 - P|0.
COMP .7783344522E-01 .25881201E-01 3.007 .0026 .41785852
Marginal effect for dummy variable is P|1 - P|0.
LIB .8208261358E-01 .21451464E-01 3.826 .0001 .13567839
CI .3947761030E-01 .18186048E-01 2.171 .0299 1.2311558
Marginal effect for dummy variable is P|1 - P|0.
CK .1820482750E-01 .29016989E-01 .627 .5304 .91998454
Marginal effect for dummy variable is P|1 - P|0.
PHD -.2575430653E-01 .19325466E-01 -1.333 .1826 .68612292
Marginal effect for dummy variable is P|1 - P|0.
NOEVAL -.5339850032 .19586185E-01 -27.263 .0000 .29068419
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

+----------------------------------------+
| Fit Measures for Binomial Choice Model |
| Probit model for variable FINAL |
+----------------------------------------+
| Proportions P0= .197140 P1= .802860 |
| N = 2587 N0= 510 N1= 2077 |
| LogL = -822.74107 LogL0 = -1284.2161 |
| Estrella = 1-(L/L0)^(-2L0/n) = .35729 |
+----------------------------------------+
| Efron | McFadden | Ben./Lerman |
| .39635 | .35934 | .80562 |
| Cramer | Veall/Zim. | Rsqrd_ML |
| .38789 | .52781 | .30006 |
+----------------------------------------+
| Information Akaike I.C. Schwarz I.C. |
| Criteria .64379 1724.06468 |
+----------------------------------------+
Frequencies of actual & predicted outcomes
Predicted outcome has maximum probability.
Threshold value for predicting Y=1 = .5000
Predicted
------ ---------- + -----
Actual 0 1 | Total
------ ---------- + -----
0 342 168 | 510
1 197 1880 | 2077
------ ---------- + -----
Total 539 2048 | 2587

--> probit;lhs=final;
rhs=one,an,hc,doc,comp,lib,ci,ck,phd,noeval;marginaleffect$
Normal exit from iterations. Exit status=0.

+---------------------------------------------+
| Binomial Probit Model |
| Maximum Likelihood Estimates |
| Model estimated: May 05, 2008 at 04:07:03PM.|
| Dependent variable FINAL |
| Weighting variable None |
| Number of observations 2587 |
| Iterations completed 6 |
| Log likelihood function -825.9472 |
| Restricted log likelihood -1284.216 |
| Chi squared 916.5379 |
| Degrees of freedom 9 |
| Prob[ChiSqd > value] = .0000000 |
| Hosmer-Lemeshow chi-squared = 22.57308 |
| P-value= .00396 with deg.fr. = 8 |
+---------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Index function for probability
Constant .8712666323 .24117408 3.613 .0003
AN .2259549490E-01 .94553383E-02 2.390 .0169 10.596830
HC .1585898886E-03 .21039762E-02 .075 .9399 49.974874
DOC .8804040395 .14866411 5.922 .0000 .31774256
COMP .4596088640 .13798168 3.331 .0009 .41785852
LIB .5585267697 .17568141 3.179 .0015 .13567839
CI .1797199200 .90808055E-01 1.979 .0478 1.2311558
CK .1415663447E-01 .13332671 .106 .9154 .91998454
PHD -.2351326125 .10107423 -2.326 .0200 .68612292
NOEVAL -1.928215642 .72363621E-01 -26.646 .0000 .29068419
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

+-------------------------------------------+
| Partial derivatives of E[y] = F[*] with |
| respect to the vector of characteristics. |
| They are computed at the means of the Xs. |
| Observations used for means are All Obs. |
+-------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Index function for probability
Constant .1735365132 .47945637E-01 3.619 .0003
AN .4500509092E-02 .18776909E-02 2.397 .0165 10.596830
HC .3158750180E-04 .41902052E-03 .075 .9399 49.974874
Marginal effect for dummy variable is P|1 - P|0.
DOC .1467543687 .21319420E-01 6.884 .0000 .31774256
Marginal effect for dummy variable is P|1 - P|0.
COMP .8785901674E-01 .25536388E-01 3.441 .0006 .41785852
Marginal effect for dummy variable is P|1 - P|0.
LIB .8672357482E-01 .20661637E-01 4.197 .0000 .13567839
CI .3579612385E-01 .18068050E-01 1.981 .0476 1.2311558
Marginal effect for dummy variable is P|1 - P|0.
CK .2839467767E-02 .26927626E-01 .105 .9160 .91998454
Marginal effect for dummy variable is P|1 - P|0.
PHD -.4448632109E-01 .18193388E-01 -2.445 .0145 .68612292
Marginal effect for dummy variable is P|1 - P|0.
NOEVAL -.5339710749 .19569243E-01 -27.286 .0000 .29068419
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

+----------------------------------------+
| Fit Measures for Binomial Choice Model |
| Probit model for variable FINAL |
+----------------------------------------+
| Proportions P0= .197140 P1= .802860 |
| N = 2587 N0= 510 N1= 2077 |
| LogL = -825.94717 LogL0 = -1284.2161 |
| Estrella = 1-(L/L0)^(-2L0/n) = .35481 |
+----------------------------------------+
| Efron | McFadden | Ben./Lerman |
| .39186 | .35685 | .80450 |
| Cramer | Veall/Zim. | Rsqrd_ML |
| .38436 | .52510 | .29833 |
+----------------------------------------+
| Information Akaike I.C. Schwarz I.C. |
| Criteria .64627 1730.47688 |
+----------------------------------------+
Frequencies of actual & predicted outcomes
Predicted outcome has maximum probability.
Threshold value for predicting Y=1 = .5000
Predicted
------ ---------- + -----
Actual 0 1 | Total
------ ---------- + -----
0 337 173 | 510
1 192 1885 | 2077
------ ---------- + -----
Total 529 2058 | 2587

For each of these two probits, the first block of coefficients is for the latent-variable
probit equation and the second block provides the marginal effects. The initial class size (hb)
probit coefficient (−0.004883) is highly significant, with a two-tail p-value of 0.0112.
On the other hand, the end-of-term class size (hc) probit coefficient (0.000159) is insignificant,
with a two-tail p-value of 0.9399.
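
A quick way to see how the two blocks are linked: for a continuous regressor the marginal effect
reported at the means is the index coefficient scaled by the standard normal density evaluated at
the means of the regressors,

∂Pr(final = 1)/∂xk = φ(x'β)βk, evaluated at the means,

so that, from the output above, −0.000970 ≈ 0.199 × (−0.004883) for hb and 0.004378 ≈ 0.199 ×
0.022039 for an. For the dummy regressors, LIMDEP instead reports the discrete change
P|1 − P|0 noted in the output.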

The overall goodness of fit can be assessed in several ways. The easiest is the proportion
of correct 0 and 1 predictions: For the first probit, using initial class size (hb) as an explanatory
variable, the proportion of correct predictions is 0.859 = (342 + 1880)/2587. For the second probit,
using end-of-term class size (hc) as an explanatory variable, the proportion of correct predictions
is also 0.859 = (337 + 1885)/2587. The chi-squared statistic (922.95, df = 9) for the probit employing
the initial class size is slightly higher than that for the end-of-term probit (916.5379, df = 9), but
they are both highly significant.

Finally, it is worth noting that the "reject" command does not actually remove a record;
a rejected record can be reactivated with the "include" command. Active and inactive status can
be observed in LIMDEP's editor by the presence or absence of chevrons (>>) next to the row
numbers down the left-hand side of the display.

If you wish to save your work in LIMDEP, you must make sure to save each of the files
you want separately. Your Text/Command Document, data file, and output files must be saved
individually in LIMDEP. There is no global saving of all three files.

CONCLUDING REMARKS

The goal of this hands-on component of this first of four modules was to enable users to get data
into LIMDEP, create variables and run regressions on continuous and discrete variables; it was
not to explain all of the statistics produced by the computer output. For this, an intermediate-level
econometrics textbook (such as Jeffrey Wooldridge, Introductory Econometrics) or an advanced
econometrics textbook (such as William Greene, Econometric Analysis) must be consulted.

REFERENCES

Becker, William E. and John Powers (2001). “Student Performance, Attrition, and Class Size
Given Missing Student Data,” Economics of Education Review, Vol. 20, August: 377-388.

Hilbe, Joseph M. (2006). “A Review of LIMDEP 9.0 and NLOGIT 4.0.” The American
Statistician, 60(May): 187-202.

MODULE ONE, PART THREE: READING DATA INTO STATA, CREATING AND
RECODING VARIABLES, AND ESTIMATING AND TESTING MODELS IN STATA

This Part Three of Module One provides a cookbook-type demonstration of the steps required to
read or import data into STATA. The reading of both small and large text and Excel files is
shown through real data examples. The procedures to recode and create variables within STATA
are demonstrated. Commands for least-squares regression estimation and maximum likelihood
estimation of probit and logit models are provided. Consideration is given to analysis of variance
and the testing of linear restrictions and structural differences, as outlined in Part One. (Parts
Two and Four provide the LIMDEP and SAS commands for the same operations undertaken
here in Part Three with STATA. For a review of STATA, version 7, see Kolenikov (2001).)

IMPORTING EXCEL FILES INTO STATA

STATA can read data from many different formats. As an example of how to read data created in
an Excel spreadsheet, consider the data from the Excel file “post-pre.xls,” which consists of test
scores for 24 students in four classes. The column titled “Student” identifies the 24 students by
number, “post” provides each student’s post-course test score, “pre” is each student’s pre-course
test score, and "class" identifies to which one of the four classes the student was assigned, e.g.,
class4 = 1 if the student was in the fourth class and class4 = 0 if not. The "." in the post column for
student 24 indicates that the student is missing a post-course test score.

To start, the file “post-pre.xls” must be downloaded and copied to your computer’s hard drive.
Unfortunately, STATA does not work with “.xls” data by default (i.e., there is no default
“import” function or command to get “.xls” data into STATA’s data editor); however, we can
still transfer data from an Excel spreadsheet into STATA by copy and paste. * First, open the
“post-pre.xls” file in Excel. The raw data are given below:

                                                            
* See Appendix A for a description of Stat/Transfer, a program to convert data from one format to
another.



student post pre class1 class2 class3 class4
1 31 22 1 0 0 0
2 30 21 1 0 0 0
3 33 23 1 0 0 0
4 31 22 1 0 0 0
5 25 18 1 0 0 0
6 32 24 0 1 0 0
7 32 23 0 1 0 0
8 30 20 0 1 0 0
9 31 22 0 1 0 0
10 23 17 0 1 0 0
11 22 16 0 1 0 0
12 21 15 0 1 0 0
13 30 19 0 0 1 0
14 21 14 0 0 1 0
15 19 13 0 0 1 0
16 23 17 0 0 1 0
17 30 20 0 0 1 0
18 31 21 0 0 1 0
19 20 15 0 0 0 1
20 26 18 0 0 0 1
21 20 16 0 0 0 1
22 14 13 0 0 0 1
23 28 21 0 0 0 1
24 . 12 0 0 0 1

In Excel, highlight the appropriate cells, right-click on the highlighted area and click “copy”.
Your screen should look something like:



Then open STATA. Go to “Data”, and click on “Data Editor”:



Clicking on “Data Editor” will yield the following screen:

From here, right-click on the highlighted cell and click “paste”:



Your data is now in the STATA data editor, which yields the following screen:



Note that STATA, unlike LIMDEP, records missing observations with a period, rather than a
blank space. Closing the data editor, we see that our variables are now added to the variable list
in STATA, and we are correctly told that our data consist of 7 variables and 24 observations.

Any time you wish to see your current data, you can go back to the data editor. We can also view
the data by typing in “browse” in the command window. As the terms suggest, “browse” only
allows you to see the data, while you can manually alter data in the “data editor”.

READING SPACE, TAB, OR COMMA DELIMITED FILES INTO STATA

Next we consider externally created text files that are typically accompanied by the “.txt” or
“.prn” extensions. As an example, we use the previous dataset with 24 observations on the 7
variables ("student," "post," "pre," "class1," "class2," "class3," and "class4"), saved as a
space-delimited text file "post-pre.txt." To read the data into STATA, we need to utilize the
“insheet” command. In the command window, type

insheet using “F:\NCEE (Becker)\post-pre.txt”, delimiter(“ “)

The “insheet” tells STATA to read in text data and “using” directs STATA to a particular file
name. In this case, the file is saved in the location “F:\NCEE (Becker)\post-pre.txt”, but this will
vary by user. Finally, the “delimiter(“ “)” option tells STATA that the data points in this file are
separated by a space. If your data were tab delimited, you could type
insheet using “F:\NCEE (Becker)\post-pre.txt”, tab

and if you were using a “.csv” file, you could type

insheet using “F:\NCEE (Becker)\post-pre.csv”, comma

In general, the “delimiter( )” option is used when your data have a less standard delimiter (e.g., a
colon, semicolon, etc.).

Once you’ve typed the appropriate command into the command window, press enter to run that
line of text. This should yield the following screen:

Just as before, STATA tells us that it has read a data set consisting of 7 variables and 24
observations, and we can access our variable list in the lower-left window pane. We can also see
previously written lines from the “review” window in the upper-left window pane. Again, we can
view our data by typing “browse” in the command window and pressing enter.

READING LARGE DATA FILES INTO STATA

The default memory allocation is different depending on the version of STATA you are using.
When STATA first opens, it will indicate how much memory is allocated by default. From the
previous screenshot, for instance, STATA indicates that 1.00mb is set aside for STATA’s use.



This is shown in the note directly above the entered command, which appears every time we start
STATA. The 1.00mb memory is the standard for Intercooled STATA, which is the version used
for this module. For a slightly more detailed look at the current memory allocation, you can type
into the command window, “memory” and press enter. This provides the following:

A more useful (and detailed) description of STATA’s memory usage (among other things) can
be obtained by typing “creturn list” into the command window. This provides:



You may have to click on the “-more-“ link (bottom left of the STATA output window) to see
this output. You can also press spacebar in the command window (or any key) to advance
screens whenever you see “-more-“ at the bottom. Two things to notice from this screen are: (1)
c(max_N_theory) tells us the maximum possible number of records our version of STATA will
allow, while c(max_N_current) tells us the maximum possible number of records we have
currently allocated to STATA based on our memory allocation, and (2) c(max_k_theory) tells us
the maximum possible number of variables, while c(max_k_current) tells us the maximum
number of variables based on our current memory allocation.

To work with large datasets (in this case, anything larger than 1mb), we can type “set memory
10m” into the command window and press enter. This increases the memory allocation to 10 mb,
and you can increase by more or less to your preference. You can also increase STATA’s
memory allocation permanently by typing, “set memory 10m, permanently” into the command
line. To check that our memory has actually increased, again type “memory” into the command
window and press enter. We get the following screen:



The maximum amount of memory you can allocate to STATA varies based on your computer’s
performance. If we try to allocate more memory than our RAM can allow, we get an error:



Note that the total amount of memory allowed depends on the computer’s performance;
however, the total number of variables allowed may be restricted by your version of STATA. (If
you’re using Small STATA, then the memory allocation is also limited.) For Intercooled
STATA, for instance, we cannot have more than 2048 variables in our data set.

For the Becker and Powers data set, the 1mb allocation is sufficient, so we need only follow the
process to import a “.csv” file described above. Note, however, that this data set does not contain
variable names in the top row. You can assign names yourself with a slight addition to the
insheet command:

insheet var1 var2 var3 … using “filename.csv”, comma

where var1 var2 var3 … are the variable names for each of the 64 variables in the data set. Of
course, manually adding all 64 variable names can be irritating. For more details on how to
import data sets with data dictionaries (i.e., variable names and definitions in external files), try
typing “help infile” into the command window. If you do not assign variable names, then
STATA will provide default variable names of “v1, v2, v3, etc.”.
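
As a minimal sketch of that default-name route (using the same illustrative file path as above and
the column ordering of the insheet command in the main text, so that v1 corresponds to a1, v2 to
a2, and so on), you could let STATA assign the defaults and rename only the columns you need:

* read the headerless file, accept the default names v1, v2, ..., then rename as needed
insheet using "F:\NCEE (Becker)\BECK8WO2.csv", comma clear
rename v1 a1
rename v2 a2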

LEAST-SQUARES ESTIMATION AND LINEAR RESTRICTIONS IN STATA

As in the previous section using LIMDEP, we now demonstrate various regression tools in
STATA using the “post-pre” data set. Recall the model being estimated is

post = β1 + β2 pre + f(classes) + ε.

STATA automatically drops any missing observations from our analysis, so we need not restrict
the data in any of our commands. In general, the syntax for a basic OLS regression in STATA is

regress y-variable x-variables,

where y-variable is just the dependent variable name and x-variables are the independent
variable names. Now is a good time to mention STATA’s very useful help menu. Typing “help
regress” into the command window and pressing enter will open a thorough description of the
regress command and all of its options, and similarly with any command in STATA.

Once you have your data read into STATA, let’s estimate the model

post = β1 + β2 pre + β3 class1 + β4 class2 + β5 class3 + ε

by typing:

regress post pre class1 class2 class3



into the command window and pressing enter. We get the following output:

To see the predicted post-test scores (with confidence intervals) from our regression, we type:

predict posthat
predict se_f, stdf
generate upper_f = posthat + invttail(e(df_r),0.025)*se_f
generate lower_f = posthat - invttail(e(df_r),0.025)*se_f

You can either copy and paste these commands directly into the command window and press
enter, or you can enter each one directly into the command window and press enter one at a time.
Notice the use of the “predict” and “generate” keywords in the previous set of commands. After
running a regression, STATA has lots of data stored away, some of which is shown in the output
and some that is not. By typing “predict posthat”, STATA applies the estimated regression
equation to the 24 observations in the sample to get predicted y-values. These predicted y-values
are the default prediction for the “predict” command, and if we want the standard error of these
predictions, we need to use “predict” again but this time specify the option “stdf”. This stands for
the standard deviation of the forecast. Both posthat and se_f are new variables that STATA has
created for us. Now, to get the upper and lower bounds of a 95% confidence interval, we apply
the usual formula taking the predicted value plus/minus the margin of error. Typing “generate
upper_f=…” and “generate lower_f=…” creates two new variables named “upper_f” and
“lower_f”, respectively. To see our predictions, we can type “browse” into the command window
and press enter. This yields:

Just as with LIMDEP, our 95% confidence interval for the 24th student’s predicted post-test score
is [11.5836, 17.6579]. For more information on the “predict” command, try typing “help predict”
into the command window.
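
For reference, the bounds computed above follow the usual forecast-interval formula,

posthat ± invttail(e(df_r), 0.025) × se_f,

where e(df_r) = 18 is the residual degrees of freedom left by the regression.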

To test the linear restriction of all class coefficients being zero, we type:

test class1 class2 class3

into the command window and press enter. STATA automatically forms the correct test statistic,
and we see

F(3, 18) = 5.16


Prob > F = 0.0095

The second line gives us the p-value, where we see that we can reject the null that all class
coefficients are zero at any probability of Type I error greater than 0.0095.

TEST FOR A STRUCTURAL BREAK (CHOW TEST)

The above test of the linear restriction β3 = β4 = β5 = 0 (no difference among classes) assumed
that the pretest slope coefficient was constant, fixed and unaffected by the class to which a
student belonged. A full structural test can be performed in two possible ways. One, we can run
the four class-specific regressions and the restricted pooled regression, take note of the residual
sums of squares from each regression, and explicitly calculate the F-statistic. The four separate
class regressions yield the four residual sums of squares that are added to obtain the
unrestricted sum of squares. The restricted sum of squares is obtained from a regression of
posttest on pretest with no dummies for the classes; that is, the class to which a student belongs
is irrelevant in the manner in which pretests determine the posttest score.

For this, we can type:

regress post pre if class1==1

into the command window and press enter. The resulting output is as follows:
. regress post pre if class1==1

Source | SS df MS Number of obs = 5


-------------+------------------------------ F( 1, 3) = 417.63
Model | 35.7432432 1 35.7432432 Prob > F = 0.0003
Residual | .256756757 3 .085585586 R-squared = 0.9929
-------------+------------------------------ Adj R-squared = 0.9905
Total | 36 4 9 Root MSE = .29255

------------------------------------------------------------------------------
post | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
pre | 1.554054 .0760448 20.44 0.000 1.312046 1.796063
_cons | -2.945946 1.61745 -1.82 0.166 -8.093392 2.201501
------------------------------------------------------------------------------

We see in the upper-left portion of this output that the residual sum of squares from this class 1
regression is 0.2568. We can similarly run the regression for only students in class 2 by
specifying the condition “if class2==1”, and so forth for classes 3 and 4.

The second way to test for a structural break is to create several interaction terms and test
whether the dummy and interaction terms are jointly significantly different from zero. To
perform the Chow test this way, we first generate interaction terms between all dummy variables
and independent variables. To do this in STATA, type the following into the command window
and press enter:

generate pre_c1=pre*class1
generate pre_c2=pre*class2
generate pre_c3=pre*class3

With our new variables created, we now run a regression with all dummy and interaction terms
included, as well as the original independent variable. In STATA, we need to type:

regress post pre class1 class2 class3 pre_c1 pre_c2 pre_c3

into the command window and press enter. The output for this regression is not meaningful, as it
is only the test that we’re interested in. To run the test, we can then type:

test class1 class2 class3 pre_c1 pre_c2 pre_c3

into the command window and press enter. The resulting output is:
. test class1 class2 class3 pre_c1 pre_c2 pre_c3

( 1) class1 = 0
( 2) class2 = 0
( 3) class3 = 0
( 4) pre_c1 = 0
( 5) pre_c2 = 0
( 6) pre_c3 = 0

F( 6, 15) = 2.93
Prob > F = 0.0427

Just as we saw in LIMDEP, our F-statistic is 2.93, with a p-value of 0.0427. We again reject the
null (at a probability of Type I error=0.05) and conclude that class is important either through the
slope or intercept coefficients. This type of test will always yield results identical to the restricted
regression approach.

HETEROSCEDASTICITY

You can control for heteroscedasticity across individual observations or for a common error term
within specific groups (in this case, within a given class but not across classes) by specifying the
“robust” or “cluster” option, respectively, at the end of your regression command.

To account for a common error term within groups, but not across groups, we first create a class
variable that identifies each student into one of the 4 classes. This is used to specify which group
(or cluster) a student is in. To generate this variable, type:

generate class=class1 + 2*class2 + 3*class3 + 4*class4

into the command window and press enter. Then to allow for clustered error terms, our
regression command is:

regress post pre class1 class2 class3, cluster(class)

This gives us the following output:


. regress post pre class1 class2 class3, cluster(class)

Linear regression Number of obs = 23

F( 0, 3) = .
Prob > F = .
R-squared = 0.9552
Number of clusters (class) = 4 Root MSE = 1.2604

------------------------------------------------------------------------------
| Robust
post | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
pre | 1.517222 .1057293 14.35 0.001 1.180744 1.8537
class1 | 1.42078 .4863549 2.92 0.061 -.1270178 2.968579
class2 | 1.177399 .3141671 3.75 0.033 .1775785 2.177219
class3 | 2.954037 .0775348 38.10 0.000 2.707287 3.200788
_cons | -3.585879 1.755107 -2.04 0.134 -9.171412 1.999654
------------------------------------------------------------------------------

Similarly, to account for general heteroscedasticity across individual observations, our regression
command is:

regress post pre class1 class2 class3, robust

and we get the following output:


. regress post pre class1 class2 class3, robust

Linear regression Number of obs = 23


F( 4, 18) = 165.74
Prob > F = 0.0000
R-squared = 0.9552
Root MSE = 1.2604

------------------------------------------------------------------------------
| Robust
post | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
pre | 1.517222 .0824978 18.39 0.000 1.3439 1.690543
class1 | 1.42078 .7658701 1.86 0.080 -.188253 3.029814
class2 | 1.177399 .8167026 1.44 0.167 -.53843 2.893227
class3 | 2.954037 .9108904 3.24 0.005 1.040328 4.867747
_cons | -3.585879 1.706498 -2.10 0.050 -7.171098 -.0006609
------------------------------------------------------------------------------

ESTIMATING PROBIT MODELS IN STATA

We now want to estimate a probit model using the Becker and Powers data set. First, read in the
“.csv” file: i
. insheet a1 a2 x3 c al am an ca cb cc ch ci cj ck cl cm cn co cs ct cu ///
> cv cw db dd di dj dk dl dm dn dq dr ds dy dz ea eb ee ef ///
> ei ej ep eq er et ey ez ff fn fx fy fz ge gh gm gn gq gr hb ///
> hc hd he hf using "F:\NCEE (Becker)\BECK8WO2.csv", comma
(64 vars, 2849 obs)

Notice the “///” at the end of each line. Because STATA by default reads the end of the line as
the end of a command, you have to tell it when the command actually goes on to the next line.
The “///” tells STATA to continue reading this command through the next line.

As always, we should look at our data before we start doing any work. Typing “browse” into the
command window and pressing enter, it looks as if several variables have been read as character
strings rather than numeric values. We can see this by typing “describe” into the command
window or simply by noting that string variables appear in red in the browsing window. This is a
somewhat common problem when using STATA with Excel, usually because of variable names
in the Excel files or because of spaces placed in front or after numeric values. If there are spaces
in any cell that contains an otherwise numeric value, STATA will read the entire column as a
character string. Since we know all variables should be numeric, we can fix this problem by
typing:

destring, replace

into the command window and pressing enter. This automatically codes all variables as numeric
variables.

Also note that the original Excel .csv file has several “extra” observations at the end of the data
set. These are essentially extra rows that have been left blank but were somehow utilized in the
original Excel file (for instance, just pressing enter at the last cell will generate a new record with all
missing variables). STATA correctly reads these 12 observations as missing values, but because
we know these are not real observations, we can just drop these with the command “drop if
a1==.”. This works because a1 is not missing for any of the other observations.

Now we recode the variable a2 as a categorical variable, where a2=1 for doctorate institutions
(between 100 and 199), a2=2 for comprehensive master’s degree granting institutions (between
200 and 299), a2=3 for liberal arts colleges (between 300 and 399), and a2=4 for two-year
colleges (between 400 and 499). To do this, type the following command into the command
window:

recode a2 (100/199=1) (200/299=2) (300/399=3) (400/499=4)

Once we’ve recoded the variable, we can generate the 4 dummy variables as follows: ii

generate doc=(a2==1) if a2!=.


generate comp=(a2==2) if a2!=.
generate lib=(a2==3) if a2!=.
generate twoyr=(a2==4) if a2!=.

The more lengthy way to generate these variables would be to first generate new variables equal
to zero, and then replace each one if the relevant condition holds. But the above commands are a
more concise way.
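
For completeness, that lengthier route would look like the following sketch for the doc dummy
(doc2 is just an illustrative name so as not to overwrite the variable created above):

* lengthier equivalent of: generate doc=(a2==1) if a2!=.
generate doc2 = 0 if a2 != .
replace doc2 = 1 if a2 == 1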

Next 1 - 0 bivariates are created to show whether the instructor had a PhD degree and whether the
student got a positive score on the postTUCE. We also create new variables, dmsq and hbsq, to
allow for quadratic forms in teacher experiences and class size:

generate phd=(dj==3) if dj!=.


generate final=(cc>0) if cc!=.
generate dmsq=dm^2
generate hbsq=hb^2

In this data set, all missing values are coded −9. Thus, adding together some of the responses to
the student evaluations provides information as to whether a student actually completed an
evaluation. For example, if the sum of ge, gh, gm, and gq equals −36, we know that the student
did not complete a student evaluation in a meaningful way. A dummy variable to reflect this fact
is then created by: iii

generate noeval=(ge + gh + gm + gq == -36)

Finally, from the TUCE developer it is known that student number 2216 was counted in term 2
but was actually in term 1, and no postTUCE was taken. This error is corrected with the following
command:

recode hb (90=89)

We are now ready to estimate the probit model with final as our dependent variable. Because
missing values are coded as -9 in this data set, we need to avoid these observations in our
analysis. The quickest way to avoid this problem is just to recode all of the variables, setting
every variable equal to “.” if it equals “-9”. Because there are 64 variables, we do not want to do
this one at a time, so instead we type:

foreach x of varlist * {
replace `x'=. if `x'==-9
}

You should type this command exactly as is for it to work correctly, including pressing enter
after the first open bracket. Also note that the single quotes surrounding each x in the replace
statement are two different characters. The first single quote is the key directly underneath the
escape key (for most keyboards) while the closing single quote is the standard single quote
keystroke by the enter key. For more help on this, type “help foreach” into the command
window.

Finally, we drop all observations where an=. and where cs=0 and run the probit model by typing

drop if an==.

drop if cs==0
probit final an hb doc comp lib ci ck phd noeval

into the command window and pressing enter. We can then retrieve the marginal effects by
typing “mfx” into the command window and pressing enter. This yields the following output: 
. drop if cs==0
(1 observation deleted)

. drop if an==.
(249 observations deleted)

. probit final an hb doc comp lib ci ck phd noeval

Iteration 0: log likelihood = -1284.2161


Iteration 1: log likelihood = -840.66421
Iteration 2: log likelihood = -823.09278
Iteration 3: log likelihood = -822.74126
Iteration 4: log likelihood = -822.74107

Probit regression Number of obs = 2587


LR chi2(9) = 922.95
Prob > chi2 = 0.0000
Log likelihood = -822.74107 Pseudo R2 = 0.3593

------------------------------------------------------------------------------
final | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
an | .022039 .0094752 2.33 0.020 .003468 .04061
hb | -.0048826 .0019241 -2.54 0.011 -.0086537 -.0011114
doc | .9757148 .1463617 6.67 0.000 .6888511 1.262578
comp | .4064945 .1392651 2.92 0.004 .13354 .679449
lib | .5214436 .1766459 2.95 0.003 .175224 .8676632
ci | .1987315 .0916865 2.17 0.030 .0190293 .3784337
ck | .08779 .1342874 0.65 0.513 -.1754085 .3509885
phd | -.133505 .1030316 -1.30 0.195 -.3354433 .0684333
noeval | -1.930522 .0723911 -26.67 0.000 -2.072406 -1.788638
_cons | .9953498 .2432624 4.09 0.000 .5185642 1.472135
------------------------------------------------------------------------------

. mfx

Marginal effects after probit


y = Pr(final) (predict)
= .88118215
------------------------------------------------------------------------------
variable | dy/dx Std. Err. z P>|z| [ 95% C.I. ] X
---------+--------------------------------------------------------------------
an | .004378 .00188 2.33 0.020 .000699 .008057 10.5968
hb | -.0009699 .00038 -2.54 0.011 -.001719 -.00022 55.5589
doc*| .1595047 .02039 7.82 0.000 .119537 .199473 .317743
comp*| .0778334 .02588 3.01 0.003 .027107 .12856 .417859
lib*| .0820826 .02145 3.83 0.000 .040039 .124127 .135678
ci | .0394776 .01819 2.17 0.030 .003834 .075122 1.23116
ck*| .0182048 .02902 0.63 0.530 -.038667 .075077 .919985
phd*| -.0257543 .01933 -1.33 0.183 -.063632 .012123 .686123
noeval*| -.533985 .01959 -27.26 0.000 -.572373 -.495597 .290684
------------------------------------------------------------------------------
(*) dy/dx is for discrete change of dummy variable from 0 to 1

For the other probit model (using hc rather than hb), we get:
. probit final an hc doc comp lib ci ck phd noeval

Iteration 0: log likelihood = -1284.2161


Iteration 1: log likelihood = -843.39917
Iteration 2: log likelihood = -826.28953
Iteration 3: log likelihood = -825.94736
Iteration 4: log likelihood = -825.94717

Probit regression Number of obs = 2587


LR chi2(9) = 916.54
Prob > chi2 = 0.0000
Log likelihood = -825.94717 Pseudo R2 = 0.3568

------------------------------------------------------------------------------
final | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
an | .0225955 .0094553 2.39 0.017 .0040634 .0411276
hc | .0001586 .002104 0.08 0.940 -.0039651 .0042823
doc | .880404 .1486641 5.92 0.000 .5890278 1.17178
comp | .4596089 .1379817 3.33 0.001 .1891698 .730048
lib | .5585268 .1756814 3.18 0.001 .2141976 .902856
ci | .1797199 .090808 1.98 0.048 .0017394 .3577004
ck | .0141566 .1333267 0.11 0.915 -.2471589 .2754722
phd | -.2351326 .1010742 -2.33 0.020 -.4332344 -.0370308
noeval | -1.928216 .0723636 -26.65 0.000 -2.070046 -1.786386
_cons | .8712666 .2411741 3.61 0.000 .3985742 1.343959
------------------------------------------------------------------------------

. mfx

Marginal effects after probit


y = Pr(final) (predict)
= .88073351
------------------------------------------------------------------------------
variable | dy/dx Std. Err. z P>|z| [ 95% C.I. ] X
---------+--------------------------------------------------------------------
an | .0045005 .00188 2.40 0.017 .00082 .008181 10.5968
hc | .0000316 .00042 0.08 0.940 -.00079 .000853 49.9749
doc*| .1467544 .02132 6.88 0.000 .104969 .18854 .317743
comp*| .087859 .02554 3.44 0.001 .037809 .137909 .417859
lib*| .0867236 .02066 4.20 0.000 .046228 .12722 .135678
ci | .0357961 .01807 1.98 0.048 .000383 .071209 1.23116
ck*| .0028395 .02693 0.11 0.916 -.049938 .055617 .919985
phd*| -.0444863 .01819 -2.45 0.014 -.080145 -.008828 .686123
noeval*| -.5339711 .01957 -27.29 0.000 -.572326 -.495616 .290684
------------------------------------------------------------------------------
(*) dy/dx is for discrete change of dummy variable from 0 to 1
 

Results from each model are equivalent to those of LIMDEP, where we see that the estimated
coefficient on hb is -0.005 with a p-value of 0.01, and the estimated coefficient on hc is 0.0002
with a p-value of 0.94. These results imply that initial class size is strongly significant while
final class size is insignificant.

To assess model fit, we can form the predicted 0 or 1 values by first taking the predicted
probabilities and then transforming these into 0 or 1 depending on whether the predicted
probability is greater than .5. Then we can look at a tabulation to see how many correct 0s and 1s
our probit model predicts. Because we have already run the models, we are not interested in the
output, so to look only at these predictions, type the following into the command window:

quietly probit final an hb doc comp lib ci ck phd noeval
predict prob1
generate finalhat1=(prob1>.5)
tab finalhat1 final

This yields:

. quietly probit final an hb doc comp lib ci ck phd noeval

. predict prob1
(option p assumed; Pr(final))

. generate finalhat1=(prob1>.5)

. tab finalhat1 final

| final
finalhat1 | 0 1 | Total
-----------+----------------------+----------
0 | 342 197 | 539
1 | 168 1,880 | 2,048
-----------+----------------------+----------
Total | 510 2,077 | 2,587
 

These results are exactly the same as with LIMDEP. For the second model, we get
. quietly probit final an hc doc comp lib ci ck phd noeval

. predict prob2
(option p assumed; Pr(final))

. generate finalhat2=(prob2>.5)

. tab finalhat2 final

| final
finalhat2 | 0 1 | Total
-----------+----------------------+----------
0 | 337 192 | 529
1 | 173 1,885 | 2,058
-----------+----------------------+----------
Total | 510 2,077 | 2,587
 

Again, these results are identical to those of LIMDEP.

We can also use the built-in processes to do these calculations. To do so, type “estat class” after
the model you’ve run. Part of the resulting output will be the tabulation of predicted versus
actual values. Furthermore, to perform a Pearson goodness of fit test, type “estat gof” into the
command window after you have run your model. This will provide a Chi-square value. All of
these postestimation tools conclude that both models do a sufficient job of prediction.
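
A minimal sketch of that built-in route for the first probit, re-running the model quietly so that
only the postestimation output appears:

* re-run the first probit quietly, then use the built-in postestimation tools
quietly probit final an hb doc comp lib ci ck phd noeval
* classification table of predicted versus actual outcomes (default threshold of .5)
estat class
* Pearson goodness-of-fit chi-squared test
estat gof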

REFERENCES

Kolenikov, Stanislav (2001). “Review of STATA 7,” The Journal of Applied Econometrics,
16(5), October: 637-46.

Becker, William E. and John Powers (2001). “Student Performance, Attrition, and Class Size
Given Missing Student Data,” Economics of Education Review, Vol. 20, August: 377-388.

APPENDIX A : Using Stat/Transfer

Stat/Transfer is a convenient program used to convert data from one format to another. Although
this program is not free, there is a free trial version available at www.stattransfer.com. Note that
the trial program will not convert the entire data set—it will drop one observation.

Nonetheless, Stat/Transfer is very user friendly. If you install and open the trial program, your
screen should look something like:

We want to convert the “.xls” file into a STATA format (“.dta”). To do this, we need to first
specify the original file type (e.g., Excel), then specify the location of the file. We then specify
the format that we want (in this case, a STATA “.dta” file). Then click on Transfer, and
Stat/Transfer automatically converts the data into the format you’ve asked.

To open this new “.dta” file in STATA, simply type

use “filename.dta”

into the command window and press enter.

ENDNOTES

                                                            
i. When using the “insheet” command, STATA automatically converts A1 to a1, A2 to a2 and so
forth. STATA is, however, case sensitive. Therefore, whether users specify “insheet A1 A2 …”
or “insheet a1 a2 …,” we must still call the variables in lower case. For instance, the following
insheet command will work the exact same as that provided in the main text:
. insheet A1 A2 X3 C AL AM AN CA CB CC CH CI CJ CK CL CM CN CO CS CT CU ///
> CV CW DB DD DI DJ DK DL DM DN DQ DR DS DY DZ EA EB EE EF ///
> EI EJ EP EQ ER ET EY EZ FF FN FX FY FZ GE GH GM GN GQ GR HB ///
> HC HD HE HF using "F:\NCEE (Becker)\BECK8WO2.csv", comma

ii. The conditions “if a2!=.” tell STATA to run the command only if a2 is not missing. Although
this particular dataset does not contain any missing values, it is generally good practice to always
use this type of condition when creating dummy variables the way we have done here. For
example, if there were a missing observation, the command “gen doc=(a2==1)” would set doc=0
even if a2 is missing.
iii. An alternative procedure is to first set all variables to missing if they equal -9 and then
generate the dummy variable using:

generate noeval=(ge==.&gh==.&gm==.&gq==.)

MODULE ONE, PART FOUR: READING DATA INTO SAS, CREATING AND
RECODING VARIABLES, AND ESTIMATING AND TESTING MODELS IN SAS

This Part Four of Module One provides a cookbook-type demonstration of the steps required to
read or import data into SAS. The reading of both small and large text and Excel files is shown
through real data examples. The procedures to recode and create variables within SAS are
demonstrated. Commands for least-squares regression estimation and maximum likelihood
estimation of probit and logit models are provided. Consideration is given to analysis of variance
and the testing of linear restrictions and structural differences, as outlined in Part One. Parts
Two and Three provide the LIMDEP and STATA commands for the same operations undertaken
here in Part Four with SAS. For a thorough review of SAS, see Delwiche and Slaughter’s The
Little SAS Book: A Primer (2003).

IMPORTING EXCEL FILES INTO SAS

SAS can read or import data in several ways. The most commonly imported files by students are
those created in Microsoft Excel with the “.xls” file name extension. Researchers tend to use flat
files that compactly store data. In what follows, we focus on importing data using Excel files. To
see how this is done, consider the data set in the Excel file “post-pre.xls,” which consists of test
scores for 24 students in four classes. The column titled “Student” identifies the 24 students by
number, “post” provides each student’s post-course test score, “pre” is each student’s pre-course
test score, and “class” identifies to which one of the four classes the student was assigned, e.g.,
class4 = 1 if the student was in the fourth class and class4 = 0 if not. The “.” in the post column for
student 24 indicates that the student is missing a post-course test score.

student post pre class1 class2 class3 class4


1 31 22 1 0 0 0
2 30 21 1 0 0 0
3 33 23 1 0 0 0
4 31 22 1 0 0 0
5 25 18 1 0 0 0
6 32 24 0 1 0 0
7 32 23 0 1 0 0
8 30 20 0 1 0 0
9 31 22 0 1 0 0
10 23 17 0 1 0 0
11 22 16 0 1 0 0
12 21 15 0 1 0 0
13 30 19 0 0 1 0
14 21 14 0 0 1 0
15 19 13 0 0 1 0
16 23 17 0 0 1 0
17 30 20 0 0 1 0
18 31 21 0 0 1 0
19 20 15 0 0 0 1
20 26 18 0 0 0 1
21 20 16 0 0 0 1
22 14 13 0 0 0 1
23 28 21 0 0 0 1
24 . 12 0 0 0 1

To start, the file “post-pre.xls” must be downloaded and copied to your computer’s hard drive.
Once this is done open SAS. Click on “File,” “Import Data…,” and “Standard data source”.
Selecting “Microsoft Excel 97, 2000 or 2002 Workbook” from the pull down menu yields the
following screen display:

Clicking “Next” gives a pop-up screen window in which you can locate and specify the file
containing the Excel file. Using the “Browse…” function is the simplest way to locate your file
and is identical to opening any Microsoft file in MS Word or Excel.

After you have located the file, click “Open”. SAS automatically fills in the path location on the
“Connect to MS Excel” pop-up window (The path to “post-pre.xls” will obviously depend on
where you placed it on your computer’s hard drive).

By clicking “OK” in the “Connect to MS Excel”, SAS now has the location of the data you wish
to import.

Before the data is successfully imported, the table in your Excel file must be correctly specified.
By default, SAS assumes that you want to import the first table (Excel labels this table
“Sheet1”). If the data you are importing is on a different sheet, simply use the pull-down menu
and click on the correct table name. After you have specified the table from your Excel file, click
“Next”.

The last step required to import data is to name the file. In the “Member” field, enter the name
you want to call your SAS data file. This can be different from the name of the Excel file which
contains the data. There are some rules for naming files, and SAS will promptly tell you if a
name is unacceptable and why. Note that SAS names are not case sensitive. For demonstrative
purposes, I have called the SAS dataset “prepost”. It is best to leave the “Library” set to the
default “WORK”. Clicking “Finish” imports the file into SAS.
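
The wizard is convenient, but the same import can also be accomplished by submitting a PROC
IMPORT step from the editor window. A minimal sketch is below; the path is hypothetical, and
the DBMS keyword you need (XLS or EXCEL) depends on your SAS version and installed access
engines:

proc import datafile = "C:\data\post-pre.xls"
     out = work.prepost
     dbms = xls replace;    /* dbms = excel is an alternative on some installations */
     getnames = yes;        /* read the variable names from the first row */
run;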

In SAS, the screen is broken up into three main sections: Program Editor, Log, and
Explorer/Results. Each time SAS is given a command, it displays in the Log what you requested
and how it completed the task. To make sure the data have been correctly imported into SAS, we
can simply read the Log and verify that SAS was able to successfully create the new dataset. The
Log in the screen below indicates that WORK.PREPOST was successfully created, where WORK
indicates the Library and PREPOST is the name chosen. Libraries are the directories where data
sets are located.

The Work library is a temporary storage location that is deleted once the user exits SAS. To
retain your newly created SAS data set, you must place the dataset into a permanent Library.

To view the dataset, click on the “Work” library located in the Explorer section and then on the
dataset “Prepost”. Note that SAS, unlike LIMDEP, records missing observations with a period,
rather than a blank space. The sequencing of these steps and the two associated screens follow:

READING SPACE, TAB, OR COMMA DELIMITED FILES INTO SAS

Next we consider externally created text files, which are typically accompanied by the “.txt” and
“.csv” extensions. For demonstration purposes, the data set just employed with 24 observations
on the 7 variables (“student,” “post,” “pre,” “class1,” “class2,” “class3,” and “class4”) was
saved as the space delimited text file “post-pre.txt.” After downloading this file to your hard
drive, open SAS to its first screen:

To read the data into SAS, we need to utilize the “proc import” command. In the Editor window,
type

proc import datafile = "F:\NCEE (Becker)\post-pre.txt"
     out = prepost dbms = dlm replace;
     getnames = yes; run;

and then click the “Running man” submit button. “proc import” tells SAS to read in text data,
and “datafile” directs SAS to a particular file name. In this case, the file is saved in the location
“F:\NCEE (Becker)\post-pre.txt”, but this will vary by user. The “out=” option names the SAS
data set to be created, and “replace” overwrites it if it already exists. Finally, the “dbms=dlm”
option tells SAS that the data values in this file are separated by a delimiter, which by default is a
space.

If your data are tab delimited, change the “dbms” option to dbms = tab, and if you are using a
“.csv” file, change it to dbms = csv. In general, the “dlm” option is used when your data have a
less standard delimiter (e.g., a colon or semicolon), which can be specified with an additional
“delimiter=” statement.
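
For instance, a comma-delimited version of the same data (a hypothetical file “post-pre.csv”)
could be read with:

proc import datafile = "F:\NCEE (Becker)\post-pre.csv"
     out = prepost dbms = csv replace;
     getnames = yes;
run;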

Once you have typed the appropriate command into the editor window, highlight it and press the
submit button to run it. This should yield the following screen:

Just as before, the SAS log tells us that it has read a data set consisting of 7 variables and 24
observations, and we can access the dataset from the explorer window. We can view our data by
typing

proc print data = prepost; run;

in the editor window, highlight it, and click the submit “Running man” button. The sequencing
of these steps and the associated screen is as follows:
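
In addition to printing the data, a quick numerical check of the import can be obtained with
PROC MEANS; the N and NMISS columns make the missing post-course score for student 24
easy to spot. A minimal sketch:

proc means data = prepost n nmiss mean min max;
     var post pre;    * check counts, missing values, and ranges;
run;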

Some files contain multiple delimiters that can cause some difficulty in importing data. One way
to address this is to use the “User-defined formats” option while using the import wizard. For
demonstration purposes, we will import our “pre-post.txt” file using the user defined formats.

Click “File”, “Import Data”, “User-defined formats” and then “Next”. After specifying the location
of the file, click “Next” and name the dataset whatever you want to call it. We will use “prepost”
as our SAS dataset name and then click “Next” and then “Finish”. This activates the user-defined
format list where the variables can be properly identified and named. Below is a screen shot of
the “Import: List”.

In the user-defined format wizard, click “Options...”. This will allow us to specify the type of
data with which we are dealing. As shown below, we define the style of input as “column” and
allow automatic variable creation. The automatically created variables are numeric, and the
starting record is row 2 (row 1 contains the variable names). Specify the number of records as 24
and the number of working rows as 25. Click “Okay”. SAS prompts you to inspect the data in the
“To: Work.Prepost” window for accuracy of the import. Also, under our current specification,
variable names have been omitted. We can manually update the variable names now by clicking
on the variable names in “To: Work.Prepost”, entering the new name, and clicking “Update”, or
later in a data step once we have finished importing the data (a rename sketch follows below). In
this case, because the number of variables is small, we manually enter the variable names. To
finish importing the data, simply click the red box in the upper right-hand corner and then “save”.
Your screens should look something like this:
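
If you instead let SAS assign default names during the import, they can be changed afterwards in
a data step with the rename= data set option. A minimal sketch, in which the default names
VAR1–VAR7 are an assumption that depends on how the file was imported:

data prepost;
     set prepost (rename = (var1=student var2=post var3=pre var4=class1
                            var5=class2 var6=class3 var7=class4));
run;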

READING LARGE FILES INTO SAS

SAS is extremely powerful in handling large datasets. There are few default restrictions on
importing raw data files. The most common restriction is line length. When importing data one
can specify the physical line length of the file using the LRECL option on the INFILE statement
(see the syntax below). As an example, consider the data set employed by Becker and Powers
(2001), which initially had 2,837 records. The default restrictions are sufficient, so we need only
follow the process for importing a “.csv” file described above. Note, however, that this data set
does not contain variable names in the top row. You can assign names yourself with a simple, but
lengthy, addition to the importing command. Also note that we have specified what type of input
variables the data contain and what type of variables we want them to be. “best32.” is a numeric
informat that reads values up to 32 characters wide; “best12.” is the corresponding display format.

data BECPOW;

infile 'C:\Users\Greg\Desktop\BeckerWork\BECK8WO.CSV' delimiter = ','


MISSOVER DSD lrecl=32767 ;

informat A1 best32.; informat A2 best32.; informat X3 best32.;

informat C best32. ; informat AL best32.; informat AM best32.;

informat AN best32.; informat CA best32.; informat CB best32.;

informat CC best32.; informat CH best32.; informat CI best32.;

informat CJ best32.; informat CK best32.; informat CL best32.;

informat CM best32.; informat CN best32.; informat CO best32.;

informat CS best32.; informat CT best32.; informat CU best32.;

informat CV best32.; informat CW best32.; informat DB best32.;

informat DD best32.; informat DI best32.; informat DJ best32.;

informat DK best32.; informat DL best32.; informat DM best32.;

informat DN best32.; informat DQ best32.; informat DR best32.;

informat DS best32.; informat DY best32.; informat DZ best32.;

informat EA best32.; informat EB best32.; informat EE best32.;

informat EF best32.; informat EI best32.; informat EJ best32.;

informat EP best32.; informat EQ best32.; informat ER best32.;

informat ET best32.; informat EY best32.; informat EZ best32.;

informat FF best32.; informat FN best32.; informat FX best32.;

informat FY best32.; informat FZ best32.; informat GE best32.;

informat GH best32.; informat GM best32.; informat GN best32.;

informat GQ best32.; informat GR best32.; informat HB best32.;

informat HC best32.; informat HD best32.; informat HE best32.;

informat HF best32.;

format A1 best12.; format A2 best12.; format X3 best12.;

format C best12. ; format AL best12.; format AM best12.;

format AN best12.; format CA best12.; format CB best12.;

format CC best12.; format CH best12.; format CI best12.;

format CJ best12.; format CK best12.; format CL best12.;

format CM best12.; format CN best12.; format CO best12.;

format CS best12.; format CT best12.; format CU best12.;

format CV best12.; format CW best12.; format DB best12.;

format DD best12.; format DI best12.; format DJ best12.;

format DK best12.; format DL best12.; format DM best12.;

format DN best12.; format DQ best12.; format DR best12.;

format DS best12.; format DY best12.; format DZ best12.;

format EA best12.; format EB best12.; format EE best12.;

format EF best12.; format EI best12.; format EJ best12.;

format EP best12.; format EQ best12.; format ER best12.;

format ET best12.; format EY best12.; format EZ best12.;

format FF best12.; format FN best12.; format FX best12.;

format FY best12.; format FZ best12.; format GE best12.;

format GH best12.; format GM best12.; format GN best12.;

format GQ best12.; format GR best12.; format HB best12.;

format HC best12.; format HD best12.; format HE best12.;

format HF best12.;

input

A1 A2 X3 C AL AM AN CA CB CC CH CI CJ CK CL CM CN CO CS CT CU

CV CW DB DD DI DJ DK DL DM DN DQ DR DS DY DZ EA EB EE EF

EI EJ EP EQ ER ET EY EZ FF FN FX FY FZ GE GH GM GN GQ GR HB

HC HD HE HF; run;

For more details on how to import data sets with data dictionaries (i.e., variable names and
definitions in external files), try typing “infile” into the “SAS Help and Documentation” under
the Help tab. If you do not supply variable names (for example, importing with PROC IMPORT
when the file has no header row), SAS will provide default variable names of “VAR1, VAR2,
VAR3, etc.”
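
After a large import it is worth confirming, in the output as well as in the log, that the expected
numbers of records and variables arrived. A minimal sketch:

proc contents data = becpow; run;    * lists the variables, their types, and the number of observations;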

LEAST-SQUARES ESTIMATION AND LINEAR RESTRICTIONS IN SAS

As in Module One, Part Two using LIMDEP, we now demonstrate various regression tools in
SAS using the “post-pre” data set. Recall the model being estimated is

post = β1 + β 2 pre + f (classes ) + ε .

SAS automatically drops any missing observations from our analysis, so we need not restrict the
data in any of our commands. In general, the syntax for a basic OLS regression in SAS is

proc reg data = FILENAME; model y-variable = x-variables; run;

where y-variable is the dependent variable name and x-variables are the independent
(explanatory) variable names.

Once you have your data read into SAS, we estimate the model

post = β1 + β 2 pre + β 3class1 + β 4class 2 + β5class 3 + ε

by typing:

proc reg data = prepost; model post = pre class1 class2 class3 / p cli clm; run;

into the editor window and pressing submit. Typing “/ p cli clm” after specifying the model
prints the predicted values (p) together with 95 percent confidence limits for an individual
prediction (cli) and for the expected value (clm). From the output, the predicted posttest score for
student 24 is 14.621, with a 95 percent confidence interval of 11.5836 < E(y|X24) < 17.6579.

We get the following output:

A researcher might be interested in testing whether the class in which a student is enrolled affects
his/her post-course test score, assuming fixed effects only. This linear restriction is tested
automatically in SAS by adding the following “test” statement to the regression command in the
editor window.

proc reg data = prepost; model post = pre class1 class2 class3;
b1: test class1=0, class2=0, class3=0;
run;

In this case we have named the test “b1”. Upon highlighting and pressing the submit button, SAS
automatically forms the correct test statistic, and we see

F(3, 18) = 5.16
Prob > F = 0.0095

The F statistic and its associated P-value indicate that we can reject the null hypothesis that all
class coefficients are zero at any probability of a Type I error greater than 0.0095.

The following results will appear in the output:

TEST FOR A STRUCTURAL BREAK (CHOW TEST)

The above test of the linear restriction β3 = β4 = β5 = 0 (no difference among classes) assumed
that the pretest slope coefficient was constant, fixed, and unaffected by the class to which a
student belonged. A full structural test requires fitting four separate regressions to obtain the four
residual sums of squares, which are added to obtain the unrestricted sum of squares. The
restricted sum of squares is obtained from a regression of posttest on pretest with no dummies for
the classes; that is, under the restriction the class to which a student belongs is irrelevant to the
manner in which pretest scores determine the posttest score.

We can perform this test in one of two ways. First, we can run the four class-by-class regressions
(whose residual sums of squares add up to the unrestricted sum of squares) and the pooled
regression of posttest on pretest (which gives the restricted sum of squares), take note of the
residual sums of squares from each regression, and explicitly calculate the F statistic. We already
know how to run basic regressions in SAS, so the new part is how to run the regression separately
for each class. For this, we create a variable “class” that identifies the class to which each
observation belongs, so that all four class-by-class regressions can be run at once with a BY
statement. We first create the variable “class” by typing into the editor window:

data prepost; set prepost; if class1 = 1 then class = 1; if class2 = 1 then class = 2;

if class3 = 1 then class = 3; if class4 = 1 then class = 4; run;

Highlight the command and click on the submit button. We now run all four class-by-class
regressions simultaneously by typing:

proc reg data = prepost; model post = pre; by class; run;

into the editor window, highlighting the text, and pressing submit. The resulting output is as follows:

We see in the Analysis of Variance table of this output that the residual sum of squares from the
regression shown is 0.2568. The other residual sums of squares can be obtained from the output
in the same way.

The structural test across all classes is

H0: β1 = β2 = . . . = β4 (the intercept and pretest slope are the same in all four classes) and
Ha: the β's are not all equal,

and the test statistic is

F = [(ErrorSSr − ErrorSSu)/K(J − 1)] / [ErrorSSu/(n − JK)],

where J = 4 is the number of classes and K = 2 is the number of coefficients (intercept and pretest
slope) estimated for each class.

Because the calculated F = 2.92 exceeds the critical F (probability of Type I error = 0.05, df1 = 6,
df2 = 15) = 2.79, we reject the null hypothesis and conclude that at least one class is significantly
different from another when the slope on the pre-course test score is allowed to vary from one
class to another. That is, the class in which a student is enrolled is important because of a change
in slope and/or intercept.
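
If you prefer to compute the F statistic explicitly, the arithmetic can be done in a short data step.
A minimal sketch follows; the two error sums of squares are placeholders that must be replaced
with the values reported in your own output (FINV returns the critical value and PROBF the
cumulative F probability):

data chow;
     sse_r = 999.9;                          * restricted SS from post on pre alone (placeholder);
     sse_u = 111.1 + 222.2 + 333.3 + 444.4;  * sum of the four class-by-class SSs (placeholders);
     n = 23; J = 4; K = 2;                   * usable observations, classes, coefficients per class;
     df1 = K*(J - 1);  df2 = n - J*K;
     F = ((sse_r - sse_u)/df1) / (sse_u/df2);
     Fcrit = finv(0.95, df1, df2);           * critical value for a 0.05 probability of Type I error;
     pvalue = 1 - probf(F, df1, df2);
run;
proc print data = chow; run;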

The second way to test for a structural break is to create several interaction terms and test
whether the dummy and interaction terms are jointly significantly different from zero. To
perform the Chow test this way, we first generate interaction terms between all dummy variables
and independent variables. To do this in SAS, type the following into the editor window,
highlight it and press submit:

data prepost; set prepost; pre_c1=pre*class1; pre_c2=pre*class2; pre_c3=pre*class3; run;

With our new variables created, we now run a regression with all dummy and interaction terms
included, as well as the original independent variable and run the F test. In SAS, we need to type

proc reg data = prepost; model post = pre class1 class2 class3 pre_c1 pre_c2 pre_c3;

b3: test class1, class2, class3, pre_c1, pre_c2, pre_c3; run;

into the editor window, highlight it, and press submit. The regression output itself is not of
primary interest here, as it is only the test result that we are after. The resulting test output is:

F( 6, 15) = 2.93
Prob > F = 0.0427

Just as we could see in LIMDEP, our F statistic is 2.93, with a P-value of 0.0427. We again
reject the null (at a probability of Type I error=0.05) and conclude that class is important either
through the slope or intercept coefficients. This type of test will always yield results identical to
the restricted regression approach.

HETEROSCEDASTICITY

You can control for heteroscedasticity across observations, or allow for a common error
component within specific groups (in this case, within a given class, but not across classes), by
computing White heteroscedasticity-consistent standard errors or cluster-adjusted standard errors,
respectively, as demonstrated below.

To account for a common error term within groups, but not across groups, we create a class
variable that assigns each student to one of the four classes. This is used to specify which group
(or cluster) a student is in. (If you already created “class” for the Chow test above, this step can
be skipped.) To generate this variable, type:

data prepost; set prepost; if class1 = 1 then class = 1; if class2 = 1 then class = 2;

if class3 = 1 then class = 3; if class4 = 1 then class = 4; run;

Highlight the command and click on the submit button.

Then to allow for clustered error terms, our regression command is:

proc surveyreg data=prepost; cluster class;

model post = pre class1 class2 class3; run;

This gives us the following output:

Similarly, to account for general heteroscedasticity across individual observations, our regression
command is:

proc model data=prepost; parms const p c1 c2 c3;

post = const + p*pre + c1*class1 + c2*class2 + c3*class3;

fit post / white; run; quit;

and we get the following output:

ESTIMATING PROBIT MODELS IN SAS

As seen in the Becker and Powers (2001) study, variables often need to be transformed or created
within a computer program to perform the desired analysis. To demonstrate the process and
commands in SAS, start with the Becker and Powers data that have been or can be read into SAS
as shown earlier.

As always, we should look at our log file and data before we start doing any work. Viewing the
log file, the data set BECPOW has 2849 observations and 64 variables. Upon viewing the
dataset, we should notice there are several “extra” observations at the end of the data set. These
are essentially extra rows that were left blank but were somehow utilized in the original Excel file
(for instance, just pressing Enter at the last cell will generate a new record with all missing
values). SAS correctly reads these 12 observations as missing, but because we know these are not
real observations, we can just drop them with the command

data becpow; set becpow; if a1 = . then delete; run;

This works because a1 is not missing for any of the other observations.

After reading the data into SAS the first task is to recode the qualitative data into appropriate
dummies. A2 contains a range of values representing various classes of institutions. SAS does
not have a recode command, so we will use a series of if-then/else commands in a data step to do
the job. We need to create variables for doctorate institutions (100/199), for comprehensive or
master’s degree granting institutions (200/299), for liberal arts colleges (300/399) and for two-
year colleges (400/499). The following code creates these variables:

data becpow; set becpow; doc = 0; comp = 0; lib = 0; twoyr = 0;

if 99 < A2 < 200 then doc = 1;

if 199 < A2 < 300 then comp = 1;

if 299 < A2 < 400 then lib = 1;

if 399 < A2 < 500 then twoyr = 1; run;

Next, 1–0 binary variables “phd” and “final” are created to show whether the instructor had a PhD
degree and whether the student got a positive score on the postTUCE. To allow for quadratic
forms in teacher experience and class size, the variables “dmsq” and “hbsq” are created. In this
data set, all missing values were coded −9. Thus, adding together some of the responses to the
student evaluations gives information on whether a student actually completed an evaluation. For
example, if the sum of “ge”, “gh”, “gm”, and “gq” equals −36, we know that the student did not
complete a student evaluation in a meaningful way. A dummy variable “noeval” is created to
reflect this fact. Finally, from the TUCE developer it is known that student number 2216 was
counted in term 2 but was actually in term 1, and no postTUCE was taken (see the “hb” line in the
syntax). The following are the commands:

data becpow; set becpow;


phd = 0;
final = 0;
noeval = 0;
if dj=3 then phd = 1;
if cc>0 then final = 1;
dmsq = dm*dm;
hbsq = hb*hb;
evalsum=ge+gh+gm+gq;
if evalsum = -36 then noeval = 1;
if hb = 90 then hb = 89;
run;

These commands can be entered into SAS as a block, highlighted and run with the “Submit”
button.

One of the things of interest to Becker and Powers was whether class size at the beginning or end
of the term influenced whether a student completed the postTUCE. This can be assessed by
fitting a probit model to the 1 – 0 discrete dependent variable “final.” Because missing values
are coded as −9 in this data set, we need to avoid these observations in our analysis. The quickest
way to avoid this problem is just to create a new dataset and delete those observations that have
−9 included in them. This is done by typing:

data becpowp; set becpow;

if an = -9 then delete;

if hb = -9 then delete;

if doc = -9 then delete;

if comp = -9 then delete;

if lib = -9 then delete;

if ci = -9 then delete;

if phd = -9 then delete;

if noeval = -9 then delete;

if an = . then delete;

if cs = 0 then delete;

run;

Finally, we run the probit model by typing:

proc logistic data= becpowp descending;

model final = an hb doc comp lib ci ck phd noeval / link=probit tech= newton;

ods output parameterestimates=prbparms;

output out = outprb xbeta = xb prob = probpr;

run;

into the editor window, highlighting it, and pressing the submit button. By default, the SAS
logistic procedure treats the smaller value of the dependent variable as the “success.” Thus, the
magnitudes of the coefficients remain the same, but the signs are opposite to those reported by
STATA and LIMDEP. The “descending” option forces SAS to treat the larger value (here,
final = 1) as the success; alternatively, you may explicitly specify the success category of “final”
with the “event” option, as in model final(event='1') = ... . The option “link=probit” tells SAS that
instead of running a logistic regression, we would like to do a probit regression. Because the
marginal effect of a regressor in a probit is the standard normal density evaluated at the index,
pdf('NORMAL',xb), multiplied by that regressor’s coefficient, the code below computes these
products and then averages them over the sample with PROC MEANS. We can then retrieve the
marginal effects by typing:

proc transpose data=prbparms out=tprb (rename=(an = tan hb = thb doc = tdoc


comp=tcomp lib=tlib ci = tci ck = tck phd =tphd noeval = tnoeval));

var estimate; id variable; run;

data outprb; if _n_=1 then set tprb; set outprb;

MEffan = pdf('NORMAL',xb)*tan; MEffhb = pdf('NORMAL',xb)*thb;

MEffdoc = pdf('NORMAL',xb)*tdoc; MEffcomp= pdf('NORMAL',xb)*tcomp;

MEfflib = pdf('NORMAL',xb)*tlib; MEffci = pdf('NORMAL',xb)*tci;

MEffck = pdf('NORMAL',xb)*tck; MEffphd = pdf('NORMAL',xb)*tphd;

MEffeval = pdf('NORMAL',xb)*tnoeval; run;

proc means data=outprb; run;

into the editor window, highlighting it and pressing enter. This yields the following coefficient
estimates:

and marginal effects:

For the other probit model (using hc rather than hb), we get coefficient estimates of:

and marginal effects of:

Results from each model are equivalent to those of LIMDEP and STATA, where we see the
initial class size (hb) probit coefficient is −0.004883 with a P-value of 0.0112, and the estimated
coefficient of “hc” is 0.0000159 with a P-value of 0.9399. These results imply that initial class
size is statistically significant, whereas final (end-of-term) class size is not.

The overall goodness of fit can be assessed in several ways. A straightforward way is to use the
Chi-square statistic found in the “Fit Statistics” output. The Chi-square for the probit employing
the initial class size (922.95, df = 9) is slightly higher than that for the end-of-term probit
(916.5379, df = 9), but both are highly significant.

CONCLUDING REMARKS

The goal of this hands-on component of Module One, Part Four was to enable users to get data
into SAS, create variables, and run regressions on continuous and discrete dependent variables; it
was not to explain all of the statistics produced by the computer output. For this, an intermediate-level
econometrics textbook (such as Jeffrey Wooldridge’s Introductory Econometrics) or an advanced
econometrics textbook (such as William Greene’s Econometric Analysis) must be consulted.

REFERENCES

Becker, William E. and John Powers (2001). “Student Performance, Attrition, and Class Size
Given Missing Student Data,” Economics of Education Review, Vol. 20, August: 377-388.

Delwiche, Lora and Susan Slaughter (2003). The Little SAS Book: A Primer, Third Edition, SAS
Publishing.

MODULE TWO, PART ONE:
ISSUES OF ENDOGENEITY AND INSTRUMENTAL VARIABLES
IN ECONOMIC EDUCATION RESEARCH

William E. Becker
Professor of Economics, Indiana University, Bloomington, Indiana, USA
Adjunct Professor of Commerce, University of South Australia, Adelaide, Australia
Research Fellow, Institute for the Study of Labor (IZA), Bonn, Germany
Editor, Journal of Economic Education
Editor, Social Science Research Network: Economic Research Network Educator

Federal Reserve Chair Ben Bernanke said that being an economist is like being a
mechanic working on an engine while it is running. Economists typically do not have the
convenience of random assignment as in laboratory experiments. However, in some
situations they can take advantage of random events such as lotteries or nature. In other
circumstances, they might be able to produce variables that have desired random
components. When this is possible, they can use instrumental variable techniques and
two-stage least squares estimation, which is the focus of Module Two. Part One of
Module Two is devoted to the general theoretical issues associated with endogeneity.
Module Two, Parts Two, Three and Four provide the methods of instrumental variable
estimation using LIMDEP (NLOGIT), STATA and SAS. To get started consider three
types of problems for which instrumental variables are employed. 1

The first problem of concern is omitted variables. When presenting regression
results someone invariably proposes that an explanatory variable that is alleged to be
relevant but was omitted is correlated with the included regressors. This renders the
coefficient estimators of the included but correlated regressors biased and inconsistent.
As stated in the introduction to these modules, examples of this can be traced back over
one hundred years to a debate between statistician George Yule and economist Arthur
Pigou, see Stephen Stigler (1986, pp. 356-357). Recall that Pigou criticized Yule's
multiple regression (aimed at explaining the percentage of persons in poverty with the
change in the percentage of disabled relief recipients, the percentage change in the
proportion of old people, and the percentage change in the population) because it omitted
the most important influences: superior program management and restrictive practices,
which cannot be measured quantitatively.

Pigou identified the most enduring criticism of regression analysis; namely, the
possibility that an unmeasured but relevant variable has been omitted from the regression
and that it is this variable that is giving the appearance of a causal relationship between
the dependent variable and the included regressors. As described by Michael Finkelstein
and Bruce Levin (1990, pp. 363-364 and pp. 409-415), for example, defense attorneys
continue to argue that the plaintiff's experts omitted relevant market and productivity
variables when they use regression analysis to demonstrate that women are paid less than
men. Modern academic journals are packed with articles that argue for one specification

of a regression equation versus another for everything from the demand for places in
higher education to the learning of economics in the introductory courses.

The second problem is errors in variables. The late Milton Friedman was
awarded the Nobel prize in Economics in part because of his path-breaking work in
estimating the relationship between consumption and permanent income, which is an
unobservable quantity. His work was later applied in unrelated areas such as education
research where a student’s grade is hypothesized to be a function of his or her effort and
ability, which are both unobservable. As we will see, unobserved explanatory variables
for which index variables are created give rise to errors-in-variables problems. As seen
in the early work of Becker and Salemi (1979), an outstanding example of this in
economic education research occurs when the pretest is used as a proxy for existing
knowledge, ability or prior understanding.

The third problem is simultaneity. At the aggregate level, estimating a Keynesian
consumption function (in which consumption is a function of income) has problems
caused by a second equation involving an accounting identity in which aggregate income
must equal personal consumption plus other forms of aggregate expenditures. That is, for
the nation as a whole there is a simultaneous relationship between income and
consumption: consumption is a function of income and income is a function of
consumption. Harvard/Stanford University researcher Caroline Hoxby (2000) identified
a similar reverse causality problem in her study of the effect of competition among school
districts on student performance, as reported in the Wall Street Journal (Oct 24, 2005).
She hypothesized that more school districts in a community implied more competition
and better schools. She also recognized, however, that there could be reverse causality in
that a poor school district that could not be closed (because of state regulations, for
example) would force politicians (through parental pressure) to start another school
district. In economic education, Becker and Johnston (1999) identified a simultaneity
problem in trying to explain scores on one type of test (say multiple choice) with scores
on another (essay or free response), where causality is bidirectional. Students who score
high on either are likely to score high on the other. As we will see, these are problems of
simultaneity that involve endogenous regressors.

Omitted variables that are correlated with included explanatory variables,
simultaneity and errors in variables are all examples of endogeneity problems for which
single equation estimation is not sufficient.

PROBLEMS OF ENDOGENEITY

Put simply, the problem of endogeneity occurs when an explanatory variable is related to
the error term in the population model of the data generating process, which causes the
ordinary least squares estimators of the relevant model parameters to be biased and
inconsistent. More precisely, for the least squares b vector to be a consistent estimator
of the β vector in the population data generating model y = Xβ + ε, the matrix (1/n)X′X must
converge to a positive definite matrix (defined as Q, as the sample size n goes to infinity) and

there can be no relationship between the vector of population error terms ( ε ) and the
regressors (explanatory variables) in X. Mathematically, if

lim n→∞ (1/n)X′X = Q (a positive definite matrix)   and   plim (1/n)X′ε = 0,

then

plim b = β + (Q)-1 plim[(1/n)X′ε] = β.

In words, if observations on the explanatory variables (the Xs) are unrelated to draws
from the error terms (in vector ε ), then the sampling distribution of each of the
coefficients (the bs in b) will appear to degenerate to a spike on the relevant Beta, as the
sample size increases. In probability limit, a b is equal to its β : p lim b = β .

But if there is strong correlation between the Xs and εs, and this correlation does
not deteriorate as the sample size goes to infinity, then the least squares estimators are not
consistent estimators of the Betas and plim b ≠ β. The b vector is an inconsistent estimator
because of endogenous regressors. That is, the sampling distribution of at least one of the
coefficients (one of the bs in b) will not degenerate to a spike on the relevant Beta as the
sample size continues to increase.

OMITTED VARIABLE

If someone asserts that a regression has omitted variable bias, he or she is saying that the
population disturbance is related to an included regressor because a relevant explanatory
variable is missing in the estimated regression and its effects must be in the disturbance.
This is also known as unobserved heterogeneity because the effect of the omitted
variable also leads to population error term heterogeneity. The straightforward solution
is to include that omitted variable as a regressor, but often data on the missing variable
are unavailable. For example, as described in Becker (2004), the U.S. Congressional
Advisory Committee on Student Financial Assistance is interested in the functional
relationship between the effects of financial variables (e.g., family income, loan
availability, and/or grants) and the college-going decision, called persistence and
measured by the probability of attending a post-secondary institution, number of post-
secondary terms attempted and the like, in linear form:

Persistence = f ( finances, random perturbation)

The U.S. Department of Education is concerned about getting students “college ready,”
as measured by an index reflecting the completion of high school college prep courses,
high school grades, SAT scores and the like:

Persistence = h(college ready, random error )

Putting the two interests together, where epsilon is the disturbance term, suggests that the
appropriate linear model is

Persistence = β 1 + β 2 (college ready ) + β 3 ( finances) + ε

Information on college readiness is obtainable from Department of Education records but
matching financial information is more difficult to obtain; thus, a researcher might
consider estimating the parameters in

Persistence = λ1 + λ 2 (college ready ) + u

Finances are now in the error term u. But students from wealthier families are known to
be more college ready than those from less well-off families. Thus, the explanatory
variable college ready is related to the error term u. If estimation is by OLS, biased and
inconsistent estimation of λ2 results:

E[(college ready)u] = β3 E[(college ready)(finances)] + E[(college ready)ε]

= β3 E[(college ready)(finances)] = β3 cov[(college ready), (finances)] ≠ 0

SIMULTANEITY

A classic case of simultaneity can be found in the most basic idea from microeconomics:
that the competitive market of supply and demand determines the equilibrium quantity.
The market data generating process is thus written as a three equation system:

Supply: Qs = m + nP + U

Demand: Qd = a + bP + cZ + V

Equilibrium Q = Qd = Qs

where m, n, a, b and c are parameters to be estimated. P is price. Qd and Qs are
quantities demanded and supplied, which in equilibrium are equal to Q. Z is an
exogenous variable and U and V are errors such that

E(U) = E(V) = 0,  E(UV) = 0,
E(U²) = σu²,  E(V²) = σv², and
E(VZ) = E(UZ) = 0.

Suppose the supply curve now is to be estimated by OLS from observable market
data for which it must be the case that quantity demand equals quantity supplied in
equilibrium:

Q = m + nP + U .

The estimated slope coefficient in the supply equation would be obtained as

n̂ = (P′P)−1P′Q
= n + (P′P)−1P′U.

But from the market structure assumed to be generating the data we know

P = (a − m)/(n − b) + [c/(n − b)]Z + (V − U)/(n − b) = β0 + β1Z + ε2.

Thus, E(PU) ≠ 0:

E(PU) = E{[(a − m)/(n − b)]U + [c/(n − b)]ZU + [(V − U)/(n − b)]U} = E[−U²/(n − b)] = −σu²/(n − b).

The OLS estimator ( nˆ ) is downward biased; that is, the true population parameter is
expected to be underestimated by the least squares estimator:

E(n̂) = n − σu²/(n − b).

Next consider an example from macroeconomics in which an aggregate
Keynesian consumption function is to be estimated.

C = A + BX + U

where C is consumption (realized and planned consumption are equal in equilibrium), X
is current income and U is the disturbance term. A and B are parameters to be estimated.
From the national income accounting rules, we know that

X = C + V, where V is other exogenous expenditure .

Thus, X = (1−B)-1(A + V + U). A shock in U causes a shock in X, and U and X are
related by the algebra of the data generating process. The B cannot be estimated without
bias using least squares.
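
One way to see the dependence explicitly (a short derivation, using the stated structure and the
assumptions that E(U) = 0 and that the exogenous expenditure V is uncorrelated with U):

cov(X, U) = E[(1−B)-1(A + V + U)U] = (1−B)-1[A·E(U) + E(VU) + E(U²)] = σu²/(1 − B) ≠ 0,

so the regressor X and the error term U are correlated, and least squares applied to the
consumption function is biased and inconsistent.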

Consider a third example of simultaneity that is more subtle. Caroline Hoxby’s
problem in estimating the relationship between student performance and school

competition was algebraically similar to the classic simultaneous equation problem of the
Keynesian consumption function but yet quite a bit different in its theoretical origins.

She hypothesized that cities with many school districts provided more opportunity for
parents to switch their children in the pursuit of better schools; thus, competition among
school districts should lead to better schools as reflected in higher student test scores.
Allowing for other explanatory variables, the implied regression is

Test scores = β 1 + β 2 (number of school districts ) + . . . + ε .

The causal effect of more school districts in a metropolitan area, however, may
not be clearly discerned from this regression of mean metropolitan test scores on the
number of school districts. Hoxby had anecdotal evidence that economy of scale
arguments might lead to two good school districts being merged. At the other extreme,
when districts were really bad they could not be merged with others and yet poor
performance did not imply that the district would be shut down (it might be taken over by
the state) even though a totally new district might be formed. That is, there is reverse
causality: bad test performance leads to more districts and good performance leads to
fewer.

As a final example of simultaneity, consider the Becker and Johnston (1999)
study of the relationship between multiple-choice test and free-response test scores of
economics understanding. Although these two forms of tests are alleged to measure many
different skills, matched scores are known to be highly correlated. Becker and Johnston
assert that in part this is because both forms are a function of an unobservable ability that
is captured in the error terms u and v in the following system of equations:

Multiple-choice score = β1 + β2(Free-response score) + . . . + u .

Free-response score = λ1 + λ2(Multiple-choice score) + . . . + v .

This system of equations should make the simultaneity apparent. As discussed in more
detail later, the existence of the second equation (where both u and v include the effect of
unobservable ability) makes the free-response test score an endogenous regressor in the
first equation. Similarly, the existence of the first equation makes multiple-choice an
endogenous regressor in the second.

ERRORS IN VARIABLES

Next consider an “errors in variables” problem that leads to regressor and error term
correlation. In particular consider the example in which a student’s grade on an exam in
economics is hypothesized to be a function of effort and a random disturbance (u):

grade = A + B(effort) + u.

But effort is not observable (as was also the case for Milton Friedman’s permanent
income). What is observable is the number of homework assignments completed, which
may be either indicative of or the result of the amount of effort:

homework = C(effort) + v .

The equation to be estimated is then

grade = A + (B/C)homework + u*, where u* = u − (B/C)v .

But a shock to v causes a shock to homework; thus, homework and u* are correlated and
the slope coefficient (B/C) cannot be estimated without bias via least squares.

A SINGLE VARIABLE INSTRUMENT

So what is the solution to these three problems of endogeneity? The instrumental
variable (IV) solution is to find something that is highly correlated with the offending
regressor but that is not correlated with the error term. In the case of
Caroline Hoxby’s problem in estimating the relationship between student performance
and school competition,

Test scores = β 1 + β 2 (number of school districts ) + ε ,

she observed that areas with a lot of school districts also had a lot of streams, possibly
because the streams made natural boundaries for the school districts. She had what has
become known as a natural experiment. 2 The number of streams was a random event
in nature that had nothing to do with the population error term (ε) in the student
performance equation but yet was highly related to the number of school districts. 3

For simplicity, ignoring any other variables in the student performance equation
and measuring test scores, number of school districts and number of streams in deviation
from their respective means, a consistent estimate of the effect of the number of school
districts on test scores can be obtained with the instrumental variable estimator

b2 = Σ(dev. in test scores)(dev. in number of streams) / Σ(dev. in number of school districts)(dev. in number of streams).
To appreciate why the instrumental estimator works, consider the expected value of the
terms in the numerator:

E[(deviations in test score)(deviations in number of streams)]
= E{[β2(dev. in number of school districts) + ε](dev. in number of streams)}
= β2 cov(number of school districts, number of streams),

because the number of streams in an area is a purely random variable unrelated to
epsilon.

In this example, deviations in one exogenous variable (z − z̄: deviations in the number of
streams) could be used as an instrument for deviations in an endogenous explanatory
variable (x − x̄: deviations in the number of school districts):

bIV = Σi(zi − z̄)(yi − ȳ) / Σi(zi − z̄)(xi − x̄).

As with the OLS estimator, the IV estimator has an asymptotically normal distribution.
The IV large sample variance is obtained by
s²bIV = [Σi(yi − ŷi)²/n] / [r²x,z Σi(xi − x̄)²],

where r²x,z is the coefficient of determination (the square of the correlation coefficient) for x and
z. Notice that if the correlation between x and z were perfect, the IV and OLS variance
estimators would be the same. On the other hand, if the linear relationship between x and z is
weak, the IV variance will greatly exceed that calculated by OLS.

Important to recognize is that a poor instrument is one that has a low r²x,z, causing
the standard error of the estimated slope coefficient to be overly large, or one that has E(Zε) ≠ 0,
implying that Z was in fact endogenous. Unlike OLS estimators, the desired properties of
IV estimators are all asymptotic; thus, to refer to small sample statistics like the t ratio is
not appropriate. The appropriate statistic for testing with bIV is the standard normal:

Z ≅ (bIV − β)/sbIV , for large n.

It is important to recognize that this instrumental variable approach is not restricted to
continuous endogenous variables. For example, Angrist (1990) was interested in the
lifetime earnings effect of being a Vietnam War veteran. Measuring earnings in
logarithmic form, Angrist’s model was

Ln(earnings ) = β 1 + β 2 veteran + ... + ε ,

where veteran is one if a veteran of the Vietnam War and zero otherwise. Angrist
recognized that there was a sample selection problem (to be discussed in detail in a later
module). It is likely that those who expected their earnings to be enhanced by the

military experience are the ones who volunteer for service. That is, being a veteran is
dependent on earnings expectations at the time of joining. To the extent that all the
factors that go into these earnings expectations and the decision to join are not captured
in this single equation model, they are in the epsilon error term. Thus, the error term must
be correlated with being a veteran, E[(veteran)(ε)] ≠ 0.

For his instrument, Angrist observed that the lottery used to draft young men
provided a natural experiment. Lottery numbers were assigned randomly; thus, the
number received would not be correlated with ε . Men receiving lower numbers faced a
higher probability of being drafted; thus, lottery numbers are correlated with being a
Vietnam vet.

The use of these natural experiments has been, and likely will continue to be, a source of
instrumental variables for endogenous explanatory variables. Michael Murray (2006)
provided a detailed but easily read review of natural experiments and the use of the IV
estimator.

INSTRUMENTAL VARIABLE ESTIMATORS IN GENERAL

Often there are many exogenous variables that could be used as instruments for
endogenous variables. Let matrix Z contain the set of all the exogenous variables that
could serve as the instrument set for the regressors. The instrumental variable estimator is
now of the general form

b IV = (Z′X)-1 Z'y
Var(b IV ) = σ 2 (Z′X)-1 Z′Z(X′Z)-1 .

Unlike the selective replacement of a regressor with its instrument, for sets of regressors
the typical estimation procedure involves the projection of each of the columns of X onto the
column space of Z; at least conceptually we have

X̂ = Z[(Z′Z)-1Z′X] .

This projected X̂ matrix is then substituted for Z.

bIV = (X̂′X)-1X̂′y
= [X′Z(Z′Z)-1Z′X]-1X′Z(Z′Z)-1Z′y
= [X′(I − MZ)X]-1X′(I − MZ)y, where MZ = I − Z(Z′Z)-1Z′,
= (X̂′X̂)-1X̂′y ,

which suggests a two step process: 1) regress the endogenous regressor(s) on all the
exogenous variables; 2) use the predicted values from step 1 as replacement for the

endogenous regressor in the original equation. This instrumental variable procedure is
referred to as Two-Stage Least Squares (TSLS).
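
Although the software implementations are covered in Parts Two through Four of this module, a
minimal sketch of two-stage least squares in SAS may help fix ideas. PROC SYSLIN (part of
SAS/ETS) performs both stages internally; the data set and variable names below are
hypothetical, based on the grade-homework example:

proc syslin data = grades 2sls;
     endogenous homework;      * regressor suspected of being endogenous;
     instruments x2 x3;        * exogenous variables used to form the instrument;
     model grade = homework;   * the equation of interest;
run;

Because the two stages are done internally, the reported standard errors are the correct 2SLS
standard errors, an issue taken up next.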

Unfortunately the standard errors associated with this TSLS estimation approach
do not reflect the fact that the instrument is a combination of variables. That is, the
standard errors obtained from the second step do not reflect the number of variables used
in the first step predictions. In the case of a single instrument the difference between the
variances of OLS and IV estimators was captured in the magnitude of r²x,z, and a similar
adjustment must be made when multiple variables are used to form the instruments.
Advanced econometrics programs like LIMDEP, STATA and SAS automatically do this in
their TSLS programs.

The asymptotic variances correctly calculated can be extremely large if Z is not
highly correlated with X; that is, (Z′X) -1 is large if X and Z are not related. Also, for
poor fitting instruments, it is possible to get negative R2 when the typical computational
formula [1 − (ResSS/TotalSS)] is used – recall that least squares minimizes the ResSS so
that it necessarily is less than or equal to TotalSS. But the IV estimator will have a
ResSS greater than or equal to that of least squares. The fit of the IV can be so bad that
its ResSS exceeds the Total SS. (For demonstration of this see Becker and Kennedy,
1992.)

DURBIN, HAUSMAN AND WU SPECIFICATION TEST APPLIED TO ENDOGENEITY

We wish to test plim(X′ε/n) = 0, but cannot use the covariance between the n × K matrix X
and the n residuals (ei = yi − ŷi) in the n × 1 vector e because X′e = 0 is a byproduct of
least squares. Greene (2003, pp. 80-83) outlined the testing procedure originally
proposed by Durbin (1954) and then extended by Wu (1973) and Hausman (1978).
Davidson and MacKinnon (1993) are recognized for providing an algebraic
demonstration of test statistic equivalence. Asymptotically, a Wald (W) statistic may be
used in a Chi-square ( χ 2 ) test with K* degrees of freedom, or for smaller samples, an F
statistic, with K* and n − (K + K*) degrees of freedom, can be used to test the joint
significance of the contribution of the predicted values (X̂*) of a regression of the K*
endogenous regressors, in matrix X*, on the exogenous variables (and a column of ones
for the constant term) in matrix Z:

y = Xβ + X̂*γ + ε* ,

where X* = Zλ + u, X̂* = Zλ̂, and λ̂ is a least squares estimator of λ.

H0: γ = 0 , the K* variables in X* are exogenous.

HA: γ ≠ 0 , at least one of the variables in X* is endogenous.

As an example, consider the previously introduced economic exam grade
equation that has the number of homework assignments as an explanatory variable:

grade = β1 + β2 homework + ε .

The theoretical data generating process that gave rise to this model suggests that the number
of homeworks completed is an endogenous regressor. To test this we need truly
exogenous variables – say x2 and x3, which might represent student gender and race. The
number of homeworks is then regressed on these two exogenous variables to get the least
squares equation

predicted homework = λ̂1 + λ̂2 x2 + λ̂3 x3 .

This predicted homework variable is then added to the exam grade equation to form the
augmented regression

grade = β1 + β 2 homework +γ (predicted homework) +ε *

In this example, K = 2 (for β1 and β 2 ) and K* = 1 (for γ ); thus, the degrees of freedom
for the F statistic are 1 and n − (K + K*) , which is also the square of a t statistic with n −
(K + K*) degrees of freedom. That is, with only one endogenous variable and relatively
small sample n, the t statistic printed by a computer program is sufficient to do the test.
(Recall that asymptotically the t goes to the standard normal, with no adjustment for
degrees of freedom required.) As with any other F, χ 2 , t or z test, calculated statistics
greater than their critical values lead to the rejection of the null hypothesis. Important to
keep in mind, however, is that failure to reject the null hypothesis at a specific probability
of a Type I error does not prove exogeneity. The null hypothesis can always be rejected
at some Type I error level.
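
In SAS, the augmented regression version of this test can be carried out with two PROC REG
steps. A minimal sketch, using hypothetical data set and variable names that match the
grade-homework example:

* Step 1: auxiliary regression of the suspect regressor on the exogenous variables;
proc reg data = grades;
     model homework = x2 x3;
     output out = aux p = hw_hat;    * save predicted homework;
run;

* Step 2: augmented regression; with one suspect regressor the t test on hw_hat is the DWH test;
proc reg data = aux;
     model grade = homework hw_hat;
run;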

Some introductory econometrics textbooks such as Wooldridge (2009, pp. 527-528) specify that
the residuals from the auxiliary regression X̂* = Zλ̂ should be used in the augmented regression
y = Xβ + (X* − X̂*)γ + ε* . For example, in the case of the test scores model the augmented
regression would be

grade = β1 + β 2 homework +γ (homework−predicted homework) +ε *

The additional calculation of this residual for inclusion in the augmented regression is not
necessary because the absolute value of the estimate of γ and its standard error are
identical regardless of whether predicted homework or the residual (= homework −
predicted homework) is used.

Finally, keep in mind that you can use all the exogenous variables in the system to
predict the endogenous variable. Some of these exogenous variables can even be in the
original equation of interest – in the grade example, the grade equation might have been

grade = β1 +β 2 homework + β 3x 3 +ε .

The auxiliary equation would still be

predicted homework = λ̂1 + λ̂2 x2 + λ̂3 x3 .

As will become clear in the next section, the auxiliary equation should always have at
least one more exogenous variable than the initial equation of interest.

IDENTIFICATION CONDITIONS

Whenever an instrumental variable estimator or two-stage least squares (2SLS) routine is


employed consideration must be given to the identification conditions. To understand
identification, consider a set of matched price and quantity observations (Figure 1, panel
a) for which quantity values tend to rise as prices rise, as seen in the fitted OLS
regression (Figure 1, panel b). The question to be asked is: is this a supply relationship?
As seen in Figure 1, panel c, the OLS line is not a supply curve. It is tracing equilibrium
points. 4

If a supply curve is to be estimated, more information is needed than the observation that
quantity and price are positively related. We need to identify a supply
curve. This can be done if there is an exogenous variable that affects demand but does
not affect supply. For example, household income likely affects demand but does not
affect supply. In our previous simultaneous equation market model, for example,

Supply in equilibrium: Q = m + nP + U

Demand in equilibrium: Q=a + bP + cZ + V

if Z is household income, then an increase in Z shifts the demand curve up, from D to D',
but does not affect the supply curve (Figure 2); thus, the supply curve is identified by the
change in equilibrium observations. Notice, however, that the demand curve is not
identified because there is no unique exogenous variable in the supply equation.

Identification of this supply curve in this two endogenous variable system is
achieved by an exclusionary or zero restriction -- the coefficient on income in the supply
equation was restricted to zero. A necessary order condition for identification of any
equation in a system is that the number of exogenous variables excluded from an
equation must be at least as great as the number of endogenous variables less one. In
this example, there were two endogenous variables (Q and P) and one exogenous variable
(Z) excluded from the supply equation; thus, the necessary condition for identification
was met: 2 − 1 ≤ 1 . This necessary condition for identification is called the order
condition.

Figure 1. Market data.

Panel a: Scatter plot

Panel b: OLS regression

Panel c: Demand and supply interaction

Figure 2. Supply curve is identified .

Any exogenous variable that is excluded from at least one equation in an equation
system can be used as an instrumental variable. It can be used as an instrument in the
equation from which it is excluded. For example, in the supply and demand equation
system, the reduced form (no endogenous variables as explanatory variables) for P is

P = (a − m)/(n − b) + [c/(n − b)]Z + (V − U)/(n − b) = β1 + β2Z + ε2 .

And either the price predicted from this equation or Z itself can be used as the instrument
for P in the supply equation. If there were more exogenous variables excluded from the
supply equation then they could all be used to get predicted price from the reduced form
equation.

Notice that the coefficient on Z in the reduced form equation for P must be
nonzero for Z to be used as an instrument, which requires that c ≠ 0 and n – b ≠ 0. This
requirement states that exogenous variable(s) excluded from the supply equation must
have a nonzero population coefficient in the demand equation and that the effect of price
cannot be the same in both demand and supply. This is known as the rank condition.

As an example of identification in economic education research consider the work
of Becker and Johnston (1999). In addition to the multi-dimensional attributes of the
Australian 12th grade test takers (captured in the explanatory X variables such as gender,
age, English as a second language, etc.), Becker and Johnston called attention to classroom
and peer effects that might influence multiple-choice and essay type test taking skills in
different ways. For example, if the student is in a classroom that emphasizes skills
associated with multiple-choice testing (e.g., risk-taking behavior, question analyzing
skills, memorization, and keen sense of judging between close alternatives), then the
student can be expected to do better on multiple-choice questions. By the same token, if
placed in a classroom that emphasizes the skills of essay test question answering (e.g.,
organization, good sentence and paragraph construction, obfuscation when uncertain, and
logical argument), then the student can be expected to do better on the essay component.
Thus, Becker and Johnston attempted to control for the type of class of which the student
is a member. Their measure of “teaching to the multiple-choice questions” is the mean
score on the multiple-choice questions for the school in which the ith student took the 12th
grade economics course. Similarly, the mean school score on the essay questions is their
measure of the ith student’s exposure to essay question writing skills.

In equation form, the two equations that summarize the influence of the various
covariates on multiple-choice and essay test questions are written as the following
structural equations:

M_i = ρ21 + ρ22 W_i + ρ23 M̄_i + Σ_{j=4}^{J} ρ2j X_ij + U_i*

W_i = ρ31 + ρ32 M_i + ρ33 W̄_i + Σ_{j=4}^{J} ρ3j X_ij + V_i*

M_i and W_i are the ith student’s respective scores on the multiple-choice test and essay test.
M̄_i and W̄_i are the mean multiple-choice and essay test scores at the school where the ith
student took the twelfth grade economics course. The X_ij variables are the other
exogenous variables used to explain the ith student’s multiple-choice and essay marks,
where the ρs are parameters to be estimated. U_i* and V_i* are assumed to be zero mean and
constant variance error terms that may or may not each include an effect of unobservable
ability.

Least squares estimation of the ρs will involve bias if the respective error terms
U_i* and V_i* are related to regressors (W_i in the first equation, and M_i in the second equation).

Such relationships are seen in the reduced form equations, which are obtained by
solving for M and W in terms of the exogenous variables and the error terms in these two
equations:

M_i = Γ21 + Γ22 W̄_i + Γ23 M̄_i + Σ_{j=4}^{J} Γ2j X_ij + U_i**

W_i = Γ31 + Γ32 M̄_i + Γ33 W̄_i + Σ_{j=4}^{J} Γ3j X_ij + V_i**

The reduced form parameters (Γs) are functions of the ρs, and U** and V** are
dependent on U* and V*:

U_i** = (U_i* + ρ22 V_i*) / (1 − ρ22 ρ32)

V_i** = (V_i* + ρ32 U_i*) / (1 − ρ22 ρ32)
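
These expressions follow from substituting one structural equation into the other and collecting terms. For U_i**, for example, substitute the W_i equation into the M_i equation:

M_i = ρ21 + ρ22[ρ31 + ρ32 M_i + ρ33 W̄_i + Σ_{j=4}^{J} ρ3j X_ij + V_i*] + ρ23 M̄_i + Σ_{j=4}^{J} ρ2j X_ij + U_i*
(1 − ρ22 ρ32) M_i = (terms in the exogenous variables) + U_i* + ρ22 V_i*

Dividing through by (1 − ρ22 ρ32) produces the reduced form error U_i** shown above; the expression for V_i** follows in the same way.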

In the reduced form error terms, it can be seen that a random shock in U* causes a
change in V**, which causes a change in W in the reduced form. Thus, W and U* are
related in the essay structural equation, and consistent estimation of the parameters in this
equation is not possible using least squares. Similarly, a shock in V* causes a change in
U**, which yields a change in M in the reduced form. Thus, M and V* are dependent in the structural
equation, and least squares estimators of the parameters in that equation are inconsistent.
The inclusion of M̄_i and W̄_i in their respective structural equations, and their
exclusion from the other equation, enables both of the structural equations to be identified
within the system. For example, if a student moves from a school with a low average
multiple-choice test score to one with a higher average multiple-choice test score, then
his or her multiple-choice score will rise via a shift in the M-W relationship in the first
structural equation, but this shift is associated with a move along the W-M relationship in
the second structural equation; thus, the second structural equation is identified.
Similarly, if a student moves from a low average essay test score school to a higher one,
then his or her essay test score will rise via a shift in the W-M relationship in second
structural equation, but this shift implies a move along the M-W relationship in the first
structural equation, and this first structural equation is thus identified. Most certainly,
identification hinges critically on justifying the exclusionary rule employed.

To summarize, identification involves two conditions.

The order condition for identifying an equation in a model of K equations and K endogenous variables is that the equation exclude at
least K – 1 variables that appear in the model. Alternatively, if the
number of potential instruments (exogenous variables in the system
but not in the equation) equals the number of endogenous regressors,
the equation is exactly identified. If exactly K – 1 variables are
excluded, then the equation is just identified. If more (less) than K – 1
variables are excluded, then the equation is over (under) identified.

The order condition is a necessary condition, but not a sufficient condition for identification.

The sufficient condition for identification is the rank condition. By
the rank condition an equation is identified if and only if at least one
nonzero determinant of order K − 1 exists for the coefficients of the
excluded variables that are included in the other equations of the
model. This sufficient condition requires that variables excluded from
the equation, but included in the other equations of the model, not be
linearly dependent. It ensures that the parameters can be estimated from the
reduced form.
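
Applied to the two-equation test-score system above (a worked check in the notation already introduced), each structural equation excludes exactly one exogenous variable: W̄_i is excluded from the M_i equation, and M̄_i from the W_i equation. With K = 2, each equation therefore has exactly K − 1 = 1 exclusion and is just identified by the order condition. The rank condition parallels the c ≠ 0 and n − b ≠ 0 requirement of the supply-and-demand example: provided 1 − ρ22 ρ32 ≠ 0,

Γ33 = ρ33/(1 − ρ22 ρ32) ≠ 0 (that is, ρ33 ≠ 0) identifies the M_i equation, and
Γ23 = ρ23/(1 − ρ22 ρ32) ≠ 0 (that is, ρ23 ≠ 0) identifies the W_i equation.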

CONCLUDING COMMENTS

Eagerness to employ natural experiments and instrumental variables to address problems of endogeneity has exploded within economics, but along with that growth have come
questions of validity, as seen most recently in criticism of the work of Waldman,
Nicholson and Adilov (2006) that suggests that TV watching causes autism. Economist
Waldman recognized that he could not simply run a regression of incidence of autism on
amount of TV watched because autism might in some way influence the TV watching.
He observed, however, that TV watching and precipitation were highly correlated.
Because rainfall is a natural occurrence unrelated to the error term in the autism
regression, he had his instrument for TV watching. As reported in the Wall Street
Journal, Whitehouse (2007), those who specialize in the study of autism were not
impressed, labeling Waldman’s work “irresponsible” (because it shifts responsibility to
parents when experts claim that autism is genetic and beyond the control of parents) and “junk
science.”

When instrumental variables are used, exactly what is being measured can be unclear. Unanswered in the Waldman, Nicholson and Adilov study is how TV watching
influences autism. Arm-chair speculation that children are distracted by television is not
convincing to those who have devoted their lives to studying autism. Joseph Piven,
Director of the Neurodevelopment Disorder Research Center at the University of North
Carolina, is quoted in the WSJ article stating that “it is just too much of a stretch to tie
(autism) to television-watching. Why not tie it to carrying umbrellas?” More damning
still are the quotes from Nobel Laureate in Economics James Heckman, “There’s a saying
that ignorance is bliss,” and IV econometrician guru Jerry Hausman, “I think that
characterizes a lot of the enthusiasm for these instruments. If your instruments aren’t
perfect, you could go seriously wrong.”

REFERENCES

Angrist, Joshua D. (1990). “Lifetime Earnings and the Vietnam Era Draft Lottery: Evidence from
Social Security Administrative Records,” American Economic Review, Vol. 80 (June): 313-336.

Angrist, Joshua D. and Alan B. Krueger (2001). “Instrumental Variables and the Search for
Identification: From Supply and Demand to Natural Experiments,” Journal of Economic
Perspectives, Vol. 15 (Fall): 69-85.

Becker, William E. (2004). “Omitted Variables and Sample Selection Problems in Studies of
College-Going Decisions,” Public Policy and College Access: Investigating the Federal and State
Role in Equalizing Postsecondary Opportunity, Edward St. John (ed), Vol. 19. NY: AMS Press:
65-86.

Becker, William E. and Carol Johnston (1999). “The Relationship Between Multiple Choice and
Essay Response Questions in Assessing Economics Understanding,” Economic Record (Economic
Society of Australia), Vol. 75 (December): 348-357.

Becker, William E. and Peter Kennedy (1995). “A Lesson in Least Squares and R Squared,”
American Statistician, Vol. 55 (November): 282-283. Portions reprinted in Dale Poirier,
Intermediate Statistics and Econometrics (Cambridge: MIT Press, 1995): 562-563.

Becker, William E. and Michael Salemi (1977). “The Learning and Cost Effectiveness of AVT
Supplemented Instruction: Specification of Learning Models,” Journal of Economic Education,
Vol. 8 (Spring): 77-92.

Davidson, Russell and James G. MacKinnon (1993). Estimation and Inference in Econometrics.
New York: Oxford University Press.

Durbin, James (1954). “Errors in Variables,” Review of the International Statistical Institute,
Vol. 22(1): 23-32.

Finkelstein, Michael and Bruce Levin (1990). Statistics for Lawyers. New York: Springer-Verlag.

Greene, William (2003). Econometric Analysis. 5th Edition. New Jersey: Prentice Hall.

Hausman, Jerry (1978). “Specification Tests in Econometrics,” Econometrica, Vol. 46
(November): 1251-1271.

Hilsenrath, Jon (2005). “Novel Way to Assess School Competition Stirs Academic Row,” Wall
Street Journal (October 24): A1 and A11.

Hoxby, Caroline M. (2000). “Does Competition Among Public Schools Benefit Students and
Taxpayers?” American Economic Review, Vol. 90 (December): 1209-1238.

Kennedy, Peter (2003). A Guide to Econometrics. 5th Edition. United Kingdom: Blackwell.

Murray, Michael P. (2006). “Avoiding Invalid Instruments and Coping with Weak Instruments,”
Journal of Economic Perspectives, Vol. 20 (Fall): 111-132.

Rosenzweig, Mark and Kenneth Wolpin (2000). “Natural ‘Natural Experiments’ in Economics,”
Journal of Economic Literature, Vol. 38 (December): 827-874.

Stigler, Stephen (1986). The History of Statistics: The Measurement of Uncertainty Before 1900.
Cambridge: Harvard University Press.

Waldman, Michael, Sean Nicholson and Nodir Adilov (2006). “Does Television Cause Autism?”
NBER Working Paper No. W12632 (October).

Whitehouse, Mark (2007). “Mind and Matter: Is an Economist Qualified to Solve Puzzle of
Autism? Professor’s Hypothesis: Rainy Days and TV May Trigger Condition,” Wall Street
Journal (February 27): A1.

Wooldridge, Jeffrey M. (2009). Introductory Econometrics: A Modern Approach. 4th Edition.
Mason, OH: South-Western.

Working, E. J. (1927). “What Do Statistical ‘Demand Curves’ Show?” Quarterly Journal of
Economics, 41(2): 212-235.

Wu, De-Min (1973). “Alternative Tests of Independence between Stochastic Regressors and
Disturbances,” Econometrica, 41(July): 733-750.

ENDNOTES

1. Conceptually there are more than three forms of endogeneity that could occur. For
example, if there is a lagged dependent variable and the residuals are serially correlated,
then the lagged dependent variable will be correlated with the error term. This is not a
problem for the typical cross-section regressions considered by economic educators but
does become a problem when time is introduced. To see this consider a data generating
process in which knowledge of economics (Yit) of the ith student at time t is a linear
function of the student’s ability at time t (xit) plus an error term ( ε it ):

yit = β1 + β 2 xit + ε it .

At time t-1, knowledge is then given by

yit −1 = β1 + β 2 xit −1 + ε it −1 .

If learning is assessed in the following equation, then the pretest yit-1 regressor is
endogenous by construction:

yit = β 1(1 − ρ ) + β 2 ( xit − ρ xit −1 ) + ρ yit −1 + (ε it − ρε it −1 )

E[ yit−1 (ε it − ρ ε it−1 )] = E( yit−1 ε it ) − ρ E( yit−1 ε it−1 ) ≠ 0 .
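
The last expectation is nonzero because the pretest contains the lagged error term; using the equation for yit−1 above (a sketch, treating xit−1 as uncorrelated with ε it−1):

E( yit−1 ε it−1 ) = E[( β1 + β2 xit−1 + ε it−1 ) ε it−1 ] = E( ε it−1 ² ) = σ²_ε ≠ 0 .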

As demonstrated in a later module, sample selection also leads to endogeneity problems.


However, the sample selection form of endogeneity is typically associated with a
truncation of the error term, which is a different problem than the three sources of
endogeneity considered in the text of this module, where the error term is always
assumed to be continuous.
2. Natural experiments and instrumental variables are not synonymous but Rosenzweig
and Wolpin (2000, pp.827-8) state "The most widely applied approach to identifying
causal or treatment effects, which has a long history in economics, employs instrumental
variable techniques . . .in standard instrumental variable studies, economists as well as
researchers in other fields have sought out 'natural experiments,' random treatments that
have arisen serendipitously . . ."
3. Jon Hilsenrath reported in his Wall Street Journal (October 24, 2005, pp. A1 and A11)
“Novel Way to Assess School Competition Stirs Academic Row,” that Princeton
University economist Jesse Rothstein questioned Hoxby’s use of the instrumental
variable technique because he could not replicate her count of streams, which aside from
ethical questions posed by Hilsenrath introduces an added complication if her instrument
has a measurement error problem.

4. Working (1927) provided an early intuitive explanation of simultaneity and the
identification problem that is still relevant today, as seen in its modern rendition by
Kennedy (2003).

MODULE TWO, PART TWO: ENDOGENEITY,
INSTRUMENTAL VARIABLES AND TWO-STAGE LEAST SQUARES
IN ECONOMIC EDUCATION RESEARCH USING LIMDEP

Part Two of Module Two provides a cookbook-type demonstration of the steps required to use
LIMDEP to address problems of endogeneity using a two-stage least squares, instrumental
variable estimator. The Durbin, Hausman and Wu specification test for endogeneity is also
demonstrated. Users of this model need to have completed Module One, Parts One and Two,
and Module Two, Part One. That is, from Module One, users are assumed to know how to get
data into LIMDEP, recode and create variables within LIMDEP, and run and interpret regression
results. From Module Two, Part One, they are expected to have an understanding of the problem
of and source of endogeneity and the basic idea behind an instrumental variable approach and the
two-stage least squares method. The Becker and Johnston (1999) data set is used throughout
this module for demonstration purposes only. Module Two, Parts Three and Four demonstrate
in STATA and SAS what is done here in LIMDEP.

THE CASE

As described in Module Two, Part One, Becker and Johnston (1999) called attention to
classroom effects that might influence multiple-choice and essay type test taking skills in
economics in different ways. For example, if the student is in a classroom that emphasizes skills
associated with multiple choice testing (e.g., risk-taking behavior, question analyzing skills,
memorization, and keen sense of judging between close alternatives), then the student can be
expected to do better on multiple-choice questions. By the same token, if placed in a classroom
that emphasizes the skills of essay test question answering (e.g., organization, good sentence and
paragraph construction, obfuscation when uncertain, logical argument, and good penmanship),
then the student can be expected to do better on the essay component. Thus, Becker and
Johnston attempted to control for the type of class of which the student is a member. Their
measure of “teaching to the multiple-choice questions” is the mean score or mark on the
multiple-choice questions for the school in which the ith student took the 12th grade economics
course. Similarly, the mean school mark or score on the essay questions is their measure of the
ith student’s exposure to essay question writing skills.

In equation form, the two equations that summarize the influence of the various
covariates on multiple-choice and essay test questions are written as the following structural
equations:

M_i = ρ21 + ρ22 W_i + ρ23 M̄_i + Σ_{j=4}^{J} ρ2j X_ij + U_i*

W_i = ρ31 + ρ32 M_i + ρ33 W̄_i + Σ_{j=4}^{J} ρ3j X_ij + V_i*

M_i and W_i are the ith student’s respective scores on the multiple-choice test and essay test.
M̄_i and W̄_i are the mean multiple-choice and essay test scores at the school where the ith student
took the 12th grade economics course. The X_ij variables are the other exogenous variables (such
as gender, age, English as a second language, etc.) used to explain the ith student’s multiple-choice
and essay marks, where the ρs are parameters to be estimated. The inclusion of the mean
multiple-choice and mean essay test scores in their respective structural equations, and their
exclusion from the other equation, enables both of the structural equations to be identified within
the system.

As shown in Module Two, Part One, the least squares estimation of the ρs involves bias
because the error term U_i* is related to W_i in the first equation, and V_i* is related to M_i in
the second equation. Instruments for regressors W_i and M_i are needed. Because the reduced form
equations express Wi and Mi solely in terms of exogenous variables, they can be used to generate
the respective instruments:

M_i = Γ21 + Γ22 W̄_i + Γ23 M̄_i + Σ_{j=4}^{J} Γ2j X_ij + U_i**

W_i = Γ31 + Γ32 M̄_i + Γ33 W̄_i + Σ_{j=4}^{J} Γ3j X_ij + V_i**

The reduced form parameters (Γs) are functions of the ρs, and the reduced form error terms U**
and V** are functions of U* and V*, which are not related to any of the regressors in the reduced
form equations.

We could estimate the reduced form equations and get M̂_i and Ŵ_i. We could then
substitute M̂_i and Ŵ_i into the structural equations as proxy regressors (instruments) for
M_i and W_i. The least squares regression of M_i on Ŵ_i, M̄_i, and the Xs and a least squares
regression of W_i on M̂_i, W̄_i, and the Xs would yield consistent estimates of the respective ρs, but
the standard errors would be incorrect. LIMDEP automatically does all the required estimations
with the two-stage, least squares command:

2SLS; LHS= ; RHS= ; INST= $

TWO-STAGE LEAST SQUARES IN LIMDEP

The Becker and Johnston (1999) data are in the file named “Bill.CSV.” Before reading these
data into LIMDEP, however, the “Project Settings” must be increased from 200000 cells (222
rows and 900 columns) to accommodate the 4,178 observations. This can be done with a project
setting of 4000000 cells (4444 rows and 900 columns), following the procedures described in
Module One, Part Two. After increasing the project setting, file Bill.CSV can be read into
LIMDEP with the following read command (the file may be located anywhere on your hard
drive but here it is located on the e drive):

READ; NREC=4178; NVAR=44; FILE=e:\bill.csv; Names=
student,school,size,other,birthday,sex,eslflag,adultst,
mc1,mc2,mc3,mc4,mc5,mc6,mc7,mc8,mc9,mc10,mc11,mc12,mc13,
mc14,mc15,mc16,mc17,mc18,mc19,mc20,totalmc,avgmc,
essay1,essay2,essay3,essay4,totessay,avgessay,
totscore,avgscore,ma081,ma082,ec011,ec012,ma083,en093$

Using these recode and create commands, yields the following relevant variable definitions:

recode; size; 0/9=1; 10/19=2; 20/29=3; 30/39=4; 40/49=5;
50/100=6$
create; smallest=size=1; smaller=size=2; small=size=3;
large=size=4; larger=size=5; largest=size=6$

TOTALMC: Student’s score on 12th grade economics multiple-choice exam (M_i).
AVGMC: Mean multiple-choice score for students at school (M̄_i).
TOTESSAY: Student’s score on 12th grade economics essay exam (W_i).
AVGESSAY: Mean essay score for students at school (W̄_i).
ADULTST = 1, if a returning adult student, and 0 otherwise.
SEX = GENDER = 1 if student is female and 0 is male.
ESLFLAG = 1 if English is not student’s first language and 0 if it is.
EC011 = 1 if student enrolled in first semester 11 grade economics course, 0 if not.
EN093 = 1 if student was enrolled in ESL English course, 0 if not
MA081 = 1 if student enrolled in the first semester 11 grade math course, 0 if not.
MA082 = 1 if student was enrolled in the second semester 11 grade math course, 0 if not.
MA083 = 1 if student was enrolled in the first semester 12 grade math course, 0 if not.
SMALLER = 1 if student from a school with 10 to 19 test takers, 0 if not.
SMALL = 1 if student from a school with 20 to 29 test takers, 0 if not.
LARGE = 1 if student from a school with 30 to 39 test takers, 0 if not.
LARGER = 1 if student from a school with 40 to 49 test takers, 0 if not.

In all of the regressions, the effect of being at a school with more than 49 test takers is captured
in the constant term, against which the other dummy variables are compared. The smallest
schools need to be rejected to treat the mean scores as exogenous and unaffected by any
individual student’s test performance, which is accomplished with the following command:

Reject; smallest = 1$

The descriptive statistics on the relevant variables are then obtained with the following
command, yielding the LIMDEP output table shown:

Dstat;RHS=TOTALMC,AVGMC,TOTESSAY,AVGESSAY,ADULTST,SEX,ESLFLAG,
EC011,EN093,MA081,MA082,MA083,SMALLER,SMALL,LARGE,LARGER$

Descriptive Statistics
All results based on nonmissing observations.
===============================================================================
Variable Mean Std.Dev. Minimum Maximum Cases
===============================================================================
-------------------------------------------------------------------------------
All observations in current sample
-------------------------------------------------------------------------------
TOTALMC 12.4355795 3.96194160 .000000000 20.0000000 3710
AVGMC 12.4355800 1.97263767 6.41666700 17.0714300 3710
TOTESSAY 18.1380054 9.21191366 .000000000 40.0000000 3710
AVGESSAY 18.1380059 4.66807071 5.70000000 29.7857100 3710
ADULTST .512129380E-02 .713893539E-01 .000000000 1.00000000 3710
SEX .390566038 .487943012 .000000000 1.00000000 3710
ESLFLAG .641509434E-01 .245054660 .000000000 1.00000000 3710
EC011 .677088949 .467652064 .000000000 1.00000000 3710
EN093 .622641509E-01 .241667268 .000000000 1.00000000 3710
MA081 .591374663 .491646035 .000000000 1.00000000 3710
MA082 .548787062 .497681208 .000000000 1.00000000 3710
MA083 .420215633 .493659946 .000000000 1.00000000 3710
SMALLER .462264151 .498641179 .000000000 1.00000000 3710
SMALL .207277628 .405410797 .000000000 1.00000000 3710
LARGE .106469003 .308478530 .000000000 1.00000000 3710
LARGER .978436658E-01 .297143201 .000000000 1.00000000 3710

For comparison with the two-stage least squares results, we start with the least squares
regressions shown after this paragraph. The least squares estimations are typical of those found
in multiple-choice and essay score correlation studies, with correlation coefficients of 0.77 and
0.78. The essay mark or score, W, is the most significant variable in the multiple-choice score
regression (first of the two tables) and the multiple-choice mark, M, is the most significant
variable in the essay regression (second of the two tables). Results like these have led
researchers to conclude that the essay and multiple-choice marks are good predictors of each
other. Notice also that both the mean multiple-choice and mean essay marks are significant in
their respective equations, suggesting that something in the classroom environment or group
experience influences individual test scores. Finally, being female has a significant negative
effect on the multiple choice-test score, but a significant positive effect on the essay score, as
expected from the least squares results reported by others. We will see how these results hold up
in the two-stage least squares regressions.

Regress;LHS=TOTALMC;RHS=TOTESSAY,ONE,ADULTST,SEX,AVGMC,
ESLFLAG,EC011,EN093,MA081,MA082,MA083,SMALLER,SMALL,LARGE,LARGER$
+-----------------------------------------------------------------------+
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = TOTALMC Mean= 12.43557951 , S.D.= 3.961941603 |
| Model size: Observations = 3710, Parameters = 15, Deg.Fr.= 3695 |
| Residuals: Sum of squares= 23835.89955 , Std.Dev.= 2.53985 |
| Fit: R-squared= .590590, Adjusted R-squared = .58904 |
| Model test: F[ 14, 3695] = 380.73, Prob value = .00000 |
| Diagnostic: Log-L = -8714.8606, Restricted(b=0) Log-L = -10371.4459 |
| LogAmemiyaPrCrt.= 1.868, Akaike Info. Crt.= 4.706 |
| Autocorrel: Durbin-Watson Statistic = 1.99019, Rho = .00490 |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
TOTESSAY .2707916883 .54725877E-02 49.481 .0000 18.138005
Constant 2.654956801 .33936151 7.823 .0000
ADULTST .4674947703 .59221296 .789 .4299 .51212938E-02
SEX -.5259548390 .91287080E-01 -5.762 .0000 .39056604
AVGMC .3793818833 .25290373E-01 15.001 .0000 12.435580
ESLFLAG .3933259495 .85245570 .461 .6445 .64150943E-01
EC011 .1722643321E-01 .92648817E-01 .186 .8525 .67708895
EN093 -.3117337847 .86493864 -.360 .7185 .62264151E-01
MA081 -.1208070545 .18084020 -.668 .5041 .59137466
MA082 .3827058262 .19467371 1.966 .0493 .54878706
MA083 .3703758129 .11847674 3.126 .0018 .42021563
SMALLER .6721051012E-01 .14743497 .456 .6485 .46226415
SMALL -.5687831831E-02 .15706323 -.036 .9711 .20727763
LARGE .6635816769E-01 .17852633 .372 .7101 .10646900
LARGER .5654860817E-01 .18217561 .310 .7563 .97843666E-01
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

Regress;LHS=TOTESSAY;RHS=TOTALMC,ONE, ADULTST,SEX,AVGESSAY,
ESLFLAG,EC011,EN093,MA081,MA082,MA083,SMALLER,SMALL,LARGE,LARGER$
+-----------------------------------------------------------------------+
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = TOTESSAY Mean= 18.13800539 , S.D.= 9.211913659 |
| Model size: Observations = 3710, Parameters = 15, Deg.Fr.= 3695 |
| Residuals: Sum of squares= 123011.3151 , Std.Dev.= 5.76986 |
| Fit: R-squared= .609169, Adjusted R-squared = .60769 |
| Model test: F[ 14, 3695] = 411.37, Prob value = .00000 |
| Diagnostic: Log-L = -11759.0705, Restricted(b=0) Log-L = -13501.8081 |
| LogAmemiyaPrCrt.= 3.509, Akaike Info. Crt.= 6.347 |
| Autocorrel: Durbin-Watson Statistic = 2.03115, Rho = -.01557 |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
TOTALMC 1.408895961 .28223608E-01 49.919 .0000 12.435580
Constant -8.948704180 .55427657 -16.145 .0000
ADULTST -.8291495512 1.3454556 -.616 .5377 .51212938E-02
SEX 1.239956900 .20801072 5.961 .0000 .39056604
AVGESSAY .4000235352 .23711680E-01 16.870 .0000 18.138006
ESLFLAG .4511403830 1.9369352 .233 .8158 .64150943E-01
EC011 .2985371912 .21044864 1.419 .1560 .67708895
EN093 -2.020881931 1.9647001 -1.029 .3037 .62264151E-01
MA081 .8495120566 .41061265 2.069 .0386 .59137466
MA082 .1590915478 .44249860 .360 .7192 .54878706
MA083 1.809541566 .26793945 6.754 .0000 .42021563
SMALLER .6170663022 .33054246 1.867 .0619 .46226415
SMALL .2693408755 .35476913 .759 .4477 .20727763
LARGE .2646447973 .40526280 .653 .5137 .10646900
LARGER .6150288712E-01 .41436703 .148 .8820 .97843666E-01
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

Theoretical considerations discussed in Module Two, Part One, suggest that these least
squares estimates involve a simultaneous equation bias that is brought about by an apparent
reverse causality between the two forms of testing. Consistent estimation of the parameters in
this simultaneous equation system is possible with two-stage least squares, where our instrument
(M̂_i) for M_i is obtained by a least squares regression of M_i on SEX, ADULTST, AVGMC,
AVGESSAY, ESLFLAG, SMALLER, SMALL, LARGE, LARGER, EC011, EN093, MA081,
MA082, and MA083. Our instrument (Ŵ_i) for W_i is obtained by a least squares regression of
Wi on SEX, ADULTST, AVGMC, AVGESSAY, ESLFLAG, SMALLER, SMALL, LARGE,
LARGER, EC011, EN093, MA081, MA082, and MA083. LIMDEP will do these regressions
and the subsequent regressions for M and W employing these instruments via the following
commands, which yield the output shown below: i

2SLS; LHS = TOTALMC; RHS = TOTESSAY,ONE, ADULTST,SEX,AVGMC,
ESLFLAG,EC011,EN093,MA081,MA082,MA083,SMALLER,SMALL,LARGE,
LARGER; INST = ONE,SEX, ADULTST ,AVGMC,AVGESSAY,ESLFLAG,
SMALLER,SMALL,LARGE,LARGER,EC011,EN093,MA081,MA082,MA083$
+-----------------------------------------------------------------------+
| Two stage least squares regression Weighting variable = none |
| Dep. var. = TOTALMC Mean= 12.43557951 , S.D.= 3.961941603 |
| Model size: Observations = 3710, Parameters = 15, Deg.Fr.= 3695 |
| Residuals: Sum of squares= 46157.78754 , Std.Dev.= 3.53440 |
| Fit: R-squared= .203966, Adjusted R-squared = .20095 |
| (Note: Not using OLS. R-squared is not bounded in [0,1] |
| Model test: F[ 14, 3695] = 67.63, Prob value = .00000 |
| Diagnostic: Log-L = -9940.7797, Restricted(b=0) Log-L = -10371.4459 |
| LogAmemiyaPrCrt.= 2.529, Akaike Info. Crt.= 5.367 |
| Autocorrel: Durbin-Watson Statistic = 2.07829, Rho = -.03914 |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
TOTESSAY -.5247790489E-01 .36407219E-01 -1.441 .1495 18.138005
Constant -.3038295700 .57375703 -.530 .5964
ADULTST .2533493567 .82444633 .307 .7586 .51212938E-02
SEX -.8971949978E-01 .13581404 -.661 .5089 .39056604
AVGMC .9748840572 .74429145E-01 13.098 .0000 12.435580
ESLFLAG .6744471036 1.1866603 .568 .5698 .64150943E-01
EC011 .2925430155 .13244518 2.209 .0272 .67708895
EN093 -1.588715660 1.2118154 -1.311 .1899 .62264151E-01
MA081 .2995655100 .25587578 1.171 .2417 .59137466
MA082 .8159710785 .27507326 2.966 .0030 .54878706
MA083 1.635255739 .21583992 7.576 .0000 .42021563
SMALLER .2715919941 .20639788 1.316 .1882 .46226415
SMALL .4372991271E-01 .21863306 .200 .8415 .20727763
LARGE .1981182700 .24885626 .796 .4260 .10646900
LARGER -.8677104536E-01 .25400196 -.342 .7326 .97843666E-01
2SLS; LHS = TOTESSAY; RHS = TOTALMC,ONE, ADULTST,SEX,AVGESSAY,
ESLFLAG,EC011,EN093,MA081,MA082,MA083,SMALLER,SMALL,
LARGE,LARGER; INST = ONE,SEX, ADULTST,AVGMC,AVGESSAY,ESLFLAG,
SMALLER,SMALL,LARGE,LARGER,EC011,EN093,MA081,MA082,MA083$
+-----------------------------------------------------------------------+
| Two stage least squares regression Weighting variable = none |
| Dep. var. = TOTESSAY Mean= 18.13800539 , S.D.= 9.211913659 |
| Model size: Observations = 3710, Parameters = 15, Deg.Fr.= 3695 |
| Residuals: Sum of squares= 201898.9900 , Std.Dev.= 7.39196 |
| Fit: R-squared= .355924, Adjusted R-squared = .35348 |
| (Note: Not using OLS. R-squared is not bounded in [0,1] |
| Model test: F[ 14, 3695] = 145.85, Prob value = .00000 |
| Diagnostic: Log-L = -12678.2066, Restricted(b=0) Log-L = -13501.8081 |
| LogAmemiyaPrCrt.= 4.005, Akaike Info. Crt.= 6.843 |
| Autocorrel: Durbin-Watson Statistic = 2.10160, Rho = -.05080 |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
TOTALMC .2788777265E-01 .15799711 .177 .8599 12.435580
Constant -1.179740796 1.1193206 -1.054 .2919
ADULTST -.1690793751 1.7252757 -.098 .9219 .51212938E-02
SEX .6854633662 .27355130 2.506 .0122 .39056604
AVGESSAY .8417622152 .57819872E-01 14.558 .0000 18.138006
ESLFLAG 1.723698602 2.4855173 .693 .4880 .64150943E-01
EC011 .7128702679 .27353325 2.606 .0092 .67708895
EN093 -3.983249481 2.5265144 -1.577 .1149 .62264151E-01
MA081 1.069628788 .52662071 2.031 .0422 .59137466
MA082 1.217026971 .57901457 2.102 .0356 .54878706
MA083 3.892551120 .41430603 9.395 .0000 .42021563
SMALLER .3348223746 .42463421 .788 .4304 .46226415
SMALL -.1364832691 .45674848 -.299 .7651 .20727763
LARGE .3418924354 .51926721 .658 .5103 .10646900
LARGER -.8251220287E-01 .53110191 -.155 .8765 .97843666E-01

The 2SLS results differ from the least squares results in many ways. The essay mark or
score, W, is no longer a significant variable in the multiple-choice regression and the multiple-
choice mark, M, is likewise insignificant in the essay regression. Each score appears to be
measuring something different once the bias induced by the correlation between regressor and
error term is eliminated by our instrumental variable estimators.

Both the mean multiple-choice and mean essay scores continue to be significant in their
respective equations. But now being female is insignificant in explaining the multiple-choice
test score. Being female continues to have a significant positive effect on the essay score.

DURBIN, HAUSMAN AND WU TEST FOR ENDOGENEITY

The theoretical argument is strong for treating multiple-choice and essay scores as endogenous
when employed as regressors in the explanation of the other. Nevertheless, this endogeneity can
be tested with the Durbin, Hausman and Wu specification test, which is a two-step procedure in
LIMDEP versions prior to 9.0.4. ii
 
Either a Wald statistic, in a Chi-square (χ²) test with K* degrees of freedom, or an F
statistic with K* and n − (K + K*) degrees of freedom, is used to test the joint significance of the
contribution of the predicted values (X̂*) of a regression of the K* endogenous regressors, in
matrix X*, on the exogenous variables (and column of ones for the constant term) in matrix Z:

y = Xβ + X̂*γ + ε* ,

where X* = Zλ + u, X̂* = Zλ̂, and λ̂ is a least squares estimator of λ.

H0: γ = 0, the regressors in X* are exogenous
HA: γ ≠ 0, at least one of the regressors in X* is endogenous

In our case, K* = 1 when the essay score is to be tested as an endogenous regressor in the
multiple-choice equation and when the multiple-choice regressor is to be tested as endogenous in
the essay equation. X̂ * is an n × 1 vector of predicted essay scores from a regression of essay
scores on all the exogenous variables (for subsequent use in the multiple-choice equation) or an
n × 1 vector of predicted multiple-choice scores from a regression of multiple-choice scores on all
the exogenous variables (for subsequent use in the essay equation). Because K* = 1, the relevant
test statistic is either the t, with n − (K + K*) degrees of freedom for small n or the standard
normal, for large n.
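
In this application, with K* = 1, the test regression is simply the structural equation augmented with the predicted value of the suspect regressor. For the multiple-choice equation, for example (a restatement in the module’s notation, with Ŵ_i denoting the predicted essay score):

M_i = ρ21 + ρ22 W_i + ρ23 M̄_i + Σ_{j=4}^{J} ρ2j X_ij + γ Ŵ_i + e_i ,   H0: γ = 0 ,

and the t (or z) statistic on γ provides the test; in the commands that follow, Essayhat plays the role of Ŵ_i.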

In LIMDEP, the predicted essay score is obtained by the following command, where the
specification “;keep=Essayhat” tells LIMDEP to predict the essay scores and keep them as a
variable called “Essayhat”:

Regres; lhs= TOTESSAY; RHS= ONE,ADULTST,SEX, AVGESSAY,AVGMC,
ESLFLAG,EC011,EN093,MA081,MA082,MA083,SMALLER,SMALL,LARGE,LARGER
;keep=Essayhat$

The predicted essay scores are then added as a regressor in the original multiple-choice
regression:

Regress;LHS=TOTALMC;RHS=TOTESSAY,ONE,ADULTST,SEX,AVGMC,
ESLFLAG,EC011,EN093,MA081,MA082,MA083,SMALLER,SMALL,LARGE,
LARGER, Essayhat$

The test statistic for the Essayhat coefficient is then used in the test of endogeneity. In the
LIMDEP output below, we see that the calculated standard normal test statistic z value is −12.916,
which in absolute value far exceeds the 1.96 critical value for a 0.05 probability of a Type I
error. Thus, the null hypothesis of an exogenous essay score as an explanatory variable
for the multiple-choice score is rejected. As theorized, the essay score is endogenous in an
explanation of the multiple-choice score.

--> Regres; lhs= TOTESSAY; RHS= ONE,ADULTST,SEX, AVGESSAY,AVGMC,
ESLFLAG,EC011,EN093,MA081,MA082,MA083,SMALLER,SMALL,LARGE,LARGER;keep=Essayhat$

+-----------------------------------------------------------------------+
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = TOTESSAY Mean= 18.13800539 , S.D.= 9.211913659 |
| Model size: Observations = 3710, Parameters = 15, Deg.Fr.= 3695 |
| Residuals: Sum of squares= 205968.5911 , Std.Dev.= 7.46609 |
| Fit: R-squared= .345598, Adjusted R-squared = .34312 |
| Model test: F[ 14, 3695] = 139.38, Prob value = .00000 |
| Diagnostic: Log-L = -12715.2253, Restricted(b=0) Log-L = -13501.8081 |
| LogAmemiyaPrCrt.= 4.025, Akaike Info. Crt.= 6.863 |
| Autocorrel: Durbin-Watson Statistic = 2.10143, Rho = -.05072 |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant -1.186477526 1.1613927 -1.022 .3070
ADULTST -.1617772661 1.7412483 -.093 .9260 .51212938E-02
SEX .6819632415 .27234747 2.504 .0123 .39056604
AVGESSAY .8405321032 .64642750E-01 13.003 .0000 18.138006
AVGMC .2714761464E-01 .15534612 .175 .8613 12.435580
ESLFLAG 1.739961011 2.5067133 .694 .4876 .64150943E-01
EC011 .7199749635 .27219191 2.645 .0082 .67708895
EN093 -4.021669541 2.5417647 -1.582 .1136 .62264151E-01
MA081 1.076407689 .53146100 2.025 .0428 .59137466
MA082 1.237970826 .57190601 2.165 .0304 .54878706
MA083 3.932399725 .34253928 11.480 .0000 .42021563
SMALLER .3418961082 .43385196 .788 .4307 .46226415
SMALL -.1350660711 .46222353 -.292 .7701 .20727763
LARGE .3469098130 .52477155 .661 .5086 .10646900
LARGER -.8480793833E-01 .53618108 -.158 .8743 .97843666E-01
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

--> Regress;LHS=TOTALMC;RHS=TOTESSAY,ONE,ADULTST,SEX,AVGMC,
ESLFLAG,EC011,EN093,MA081,MA082,MA083,SMALLER,SMALL,LARGE,LARGER,
Essayhat$

+-----------------------------------------------------------------------+
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = TOTALMC Mean= 12.43557951 , S.D.= 3.961941603 |
| Model size: Observations = 3710, Parameters = 16, Deg.Fr.= 3694 |
| Residuals: Sum of squares= 22805.95017 , Std.Dev.= 2.48471 |
| Fit: R-squared= .608280, Adjusted R-squared = .60669 |
| Model test: F[ 15, 3694] = 382.41, Prob value = .00000 |
| Diagnostic: Log-L = -8632.9227, Restricted(b=0) Log-L = -10371.4459 |
| LogAmemiyaPrCrt.= 1.825, Akaike Info. Crt.= 4.662 |
| Autocorrel: Durbin-Watson Statistic = 2.07293, Rho = -.03647 |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
TOTESSAY .2855834321 .54748868E-02 52.162 .0000 18.138005
Constant -.3038295700 .40335588 -.753 .4513
ADULTST .2533493567 .57959250 .437 .6620 .51212938E-02
SEX -.8971949978E-01 .95478380E-01 -.940 .3474 .39056604
AVGMC .9748840572 .52324297E-01 18.632 .0000 12.435580
ESLFLAG .6744471036 .83423185 .808 .4188 .64150943E-01
EC011 .2925430155 .93110045E-01 3.142 .0017 .67708895
EN093 -1.588715660 .85191616 -1.865 .0622 .62264151E-01
MA081 .2995655100 .17988277 1.665 .0958 .59137466
MA082 .8159710785 .19337874 4.220 .0000 .54878706
MA083 1.635255739 .15173722 10.777 .0000 .42021563
SMALLER .2715919941 .14509939 1.872 .0612 .46226415
SMALL .4372991270E-01 .15370083 .285 .7760 .20727763
LARGE .1981182700 .17494798 1.132 .2574 .10646900
LARGER -.8677104536E-01 .17856546 -.486 .6270 .97843666E-01
ESSAYHAT -.3380613370 .26173585E-01 -12.916 .0000 18.138005
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

The similar estimation routine to test for the endogeneity of the multiple-choice test score in the
essay equation yields a calculated z test statistic of −11.713, which in absolute value far exceeds
the 1.96 critical value. Thus, the null hypothesis of an exogenous multiple-choice score as an
explanatory variable for the essay score is rejected. As theorized, the multiple-choice score is
endogenous in an explanation of the essay score.
--> Regress;LHS=TOTALMC; RHS=ONE, ADULTST,SEX, AVGMC,AVGESSAY,
ESLFLAG,EC011,EN093,MA081,MA082,MA083,SMALLER,SMALL,LARGE,LARGER;
keep=MChat$

+-----------------------------------------------------------------------+
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = TOTALMC Mean= 12.43557951 , S.D.= 3.961941603 |
| Model size: Observations = 3710, Parameters = 15, Deg.Fr.= 3695 |
| Residuals: Sum of squares= 39604.31525 , Std.Dev.= 3.27389 |
| Fit: R-squared= .319748, Adjusted R-squared = .31717 |
| Model test: F[ 14, 3695] = 124.06, Prob value = .00000 |
| Diagnostic: Log-L = -9656.7280, Restricted(b=0) Log-L = -10371.4459 |
| LogAmemiyaPrCrt.= 2.376, Akaike Info. Crt.= 5.214 |
| Autocorrel: Durbin-Watson Statistic = 2.07600, Rho = -.03800 |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant -.2415657153 .50927203 -.474 .6353
ADULTST .2618390887 .76353941 .343 .7317 .51212938E-02
SEX -.1255075019 .11942469 -1.051 .2933 .39056604
AVGMC .9734594072 .68119457E-01 14.290 .0000 12.435580
AVGESSAY -.4410936377E-01 .28345921E-01 -1.556 .1197 18.138006
ESLFLAG .5831375952 1.0991967 .531 .5958 .64150943E-01
EC011 .2547602379 .11935647 2.134 .0328 .67708895
EN093 -1.377666868 1.1145668 -1.236 .2164 .62264151E-01
MA081 .2430778897 .23304627 1.043 .2969 .59137466
MA082 .7510049632 .25078145 2.995 .0027 .54878706
MA083 1.428891640 .15020388 9.513 .0000 .42021563
SMALLER .2536500026 .19024459 1.333 .1824 .46226415
SMALL .5081789714E-01 .20268556 .251 .8020 .20727763
LARGE .1799131698 .23011294 .782 .4343 .10646900
LARGER -.8232050244E-01 .23511603 -.350 .7262 .97843666E-01

--> Regress;LHS=TOTESSAY;RHS=TOTALMC,ONE, ADULTST,SEX,AVGESSAY,
ESLFLAG,EC011,EN093,MA081,MA082,MA083,SMALLER,SMALL,LARGE,LARGER,
MChat$

+-----------------------------------------------------------------------+
| Ordinary least squares regression Weighting variable = none |
| Dep. var. = TOTESSAY Mean= 18.13800539 , S.D.= 9.211913659 |
| Model size: Observations = 3710, Parameters = 16, Deg.Fr.= 3694 |
| Residuals: Sum of squares= 118606.0003 , Std.Dev.= 5.66637 |
| Fit: R-squared= .623166, Adjusted R-squared = .62164 |
| Model test: F[ 15, 3694] = 407.25, Prob value = .00000 |
| Diagnostic: Log-L = -11691.4200, Restricted(b=0) Log-L = -13501.8081 |
| LogAmemiyaPrCrt.= 3.473, Akaike Info. Crt.= 6.311 |
| Autocorrel: Durbin-Watson Statistic = 2.09836, Rho = -.04918 |
+-----------------------------------------------------------------------+
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
TOTALMC 1.485222426 .28473026E-01 52.162 .0000 12.435580
Constant -1.179740796 .85802415 -1.375 .1691
ADULTST -.1690793751 1.3225239 -.128 .8983 .51212938E-02
SEX .6854633662 .20969294 3.269 .0011 .39056604
AVGESSAY .8417622152 .44322287E-01 18.992 .0000 18.138006
ESLFLAG 1.723698602 1.9052933 .905 .3656 .64150943E-01
EC011 .7128702679 .20967911 3.400 .0007 .67708895
EN093 -3.983249481 1.9367199 -2.057 .0397 .62264151E-01
MA081 1.069628788 .40368533 2.650 .0081 .59137466
MA082 1.217026971 .44384827 2.742 .0061 .54878706
MA083 3.892551120 .31758961 12.257 .0000 .42021563
SMALLER .3348223746 .32550676 1.029 .3037 .46226415
SMALL -.1364832691 .35012421 -.390 .6967 .20727763
LARGE .3418924354 .39804844 .859 .3904 .10646900
LARGER -.8251220288E-01 .40712043 -.203 .8394 .97843666E-01
MCHAT -1.457334653 .12441585 -11.713 .0000 12.435580
(Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

CONCLUDING COMMENTS

This cookbook-type introduction to the use of instrumental variables and two-stage least squares
regression and testing for endogeneity has just scratched the surface of this controversial
problem in statistical estimation and inference. It was intended to enable researchers to begin
using instrumental variables in their work and to enable readers of that work to have an idea of
what is being done. To learn more about these methods there is no substitute for a graduate level
textbook treatment such as that found in William Greene’s Econometric Analysis.

REFERENCES

Becker, William E. and Carol Johnston (1999). “The Relationship Between Multiple Choice and
Essay Response Questions in Assessing Economics Understanding,” Economic Record
(Economic Society of Australia), Vol. 75 (December): 348-357.

Greene, William (2003). Econometric Analysis. 5th Edition, New Jersey: Prentice Hall.

ENDNOTES

                                                       
i. In the default mode, relatively large samples are required for 2SLS in LIMDEP because a
routine aimed at providing consistent estimators is employed; thus, for example, no degrees of
freedom adjustment is made for variances; i.e.,

σ̂² = (1/n) Σ (ith prediction error)²

As William Greene states, “this is consistent with most published sources, but (curiously
enough) inconsistent with most other commercially available computer programs.” The degrees
of freedom correction for small samples is obtainable by adding the following specification to
the 2SLS command: ;DFC
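
For comparison, the familiar small-sample estimator applies a degrees-of-freedom correction, dividing by n − K (with K the number of estimated coefficients) rather than by n:

σ̂² = [1/(n − K)] Σ (ith prediction error)²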

ii. In LIMDEP version 9.0.4, the following command will automatically test x3 for endogeneity:

Regress; lhs=y; rhs=one,x2,x3; inst=one,x2,x4; Wu test$

Because x3 is not an instrument, LIMDEP knows the test for endogeneity is on this variable.
 

MODULE TWO, PART THREE: ENDOGENEITY,
INSTRUMENTAL VARIABLES AND TWO-STAGE LEAST SQUARES
IN ECONOMIC EDUCATION RESEARCH USING STATA

Part Three of Module Two demonstrates how to address problems of endogeneity using
STATA's two-stage least squares instrumental variable estimator, as well as how to perform and
interpret the Durbin, Hausman and Wu specification test for endogeneity. Users of this model
need to have completed Module One, Parts One and Three, and Module Two, Part One. That is,
from Module One, users are assumed to know how to get data into STATA, recode and create
variables within STATA, and run and interpret regression results. From Module Two, Part One,
they are expected to have an understanding of the problem of and source of endogeneity and the
basic idea behind an instrumental variable approach and the two-stage least squares method. The
Becker and Johnston (1999) data set is used throughout this module.

THE CASE

As described in Module Two, Part One, Becker and Johnston (1999) called attention to
classroom effects that might influence multiple-choice and essay type test taking skills in
economics in different ways. For example, if the student is in a classroom that emphasizes skills
associated with multiple-choice testing (e.g., risk-taking behavior, question analyzing skills,
memorization, and keen sense of judging between close alternatives), then the student can be
expected to do better on multiple-choice questions. By the same token, if placed in a classroom
that emphasizes the skills of essay test question answering (e.g., organization, good sentence and
paragraph construction, obfuscation when uncertain, logical argument, and good penmanship),
then the student can be expected to do better on the essay component. Thus, Becker and
Johnston attempted to control for the type of class of which the student is a member. Their
measure of “teaching to the multiple-choice questions” is the mean score or mark on the
multiple-choice questions for the school in which the ith student took the 12th grade economics
course. Similarly, the mean school mark or score on the essay questions is their measure of the
ith student’s exposure to essay question writing skills.

In equation form, the two equations that summarize the influence of the various
covariates on multiple-choice and essay test questions are written as the following structural
equations:

M_i = ρ21 + ρ22 W_i + ρ23 M̄_i + Σ_{j=4}^{J} ρ2j X_ij + U_i*

W_i = ρ31 + ρ32 M_i + ρ33 W̄_i + Σ_{j=4}^{J} ρ3j X_ij + V_i*

M_i and W_i are the ith student’s respective scores on the multiple-choice test and essay test.
M̄_i and W̄_i are the mean multiple-choice and essay test scores at the school where the ith student
took the 12th grade economics course. The X_ij variables are the other exogenous variables (such
as gender, age, English as a second language, etc.) used to explain the ith student’s multiple-choice
and essay marks, where the ρs are parameters to be estimated. The inclusion of the mean
multiple-choice and mean essay test scores in their respective structural equations, and their
exclusion from the other equation, enables both of the structural equations to be identified within
the system.

As shown in Module Two, Part One, the least squares estimators of the ρs are biased
because the error term U_i* is related to W_i in the first equation, and V_i* is related to M_i in the
second equation. Instruments for regressors W_i and M_i are needed. Because the reduced form equations
express Wi and Mi solely in terms of exogenous variables, they can be used to generate the
respective instruments:

M_i = Γ21 + Γ22 W̄_i + Γ23 M̄_i + Σ_{j=4}^{J} Γ2j X_ij + U_i**

W_i = Γ31 + Γ32 M̄_i + Γ33 W̄_i + Σ_{j=4}^{J} Γ3j X_ij + V_i**

The reduced form parameters (Γs) are functions of the ρs, and the reduced form error terms U**
and V** are functions of U* and V*, which are not related to any of the regressors in the reduced
form equations.

We could estimate the reduced form equations and get M̂_i and Ŵ_i. We could then
substitute M̂_i and Ŵ_i into the structural equations as proxy regressors (instruments) for M_i and W_i.
The least squares regression of M_i on Ŵ_i, M̄_i, and the Xs and a least squares regression of W_i on
M̂_i, W̄_i, and the Xs would yield consistent estimates of the respective ρs, but the standard errors
would be incorrect. STATA automatically performs the required estimations with the
instrumental variables command: i

ivreg dependent_variable independent_variables (endogenous_var_name=instruments)

Here, independent_variables should be all of your included exogenous variables, and in the
parentheses, we must specify the endogenous variable as a function of its instruments.
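
As a concrete sketch of that syntax for this module’s multiple-choice equation (using the variable names defined in the next section, and treating the school mean essay score as the excluded instrument for the endogenous essay score), the call could look like the following; the included exogenous regressors serve as their own instruments:

ivreg totalmc adultst sex avgmc eslflag ec011 en093 ma081 ma082 ///
    ma083 smaller small large larger (totessay = avgessay) if smallest!=1

The manual two-step approach described above can also be coded directly; it reproduces the 2SLS coefficient estimates but not the correct standard errors (essayhat is an illustrative name for the first-stage fitted values):

regress totessay adultst sex avgmc avgessay eslflag ec011 en093 ///
    ma081 ma082 ma083 smaller small large larger if smallest!=1
predict essayhat if smallest!=1, xb
regress totalmc essayhat adultst sex avgmc eslflag ec011 en093 ///
    ma081 ma082 ma083 smaller small large larger if smallest!=1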

TWO-STAGE LEAST SQUARES IN STATA

The Becker and Johnston (1999) data are in the file named “Bill.CSV.” Since this is a large
dataset, users may need to increase the size of STATA following the procedures described in
Module One, Part Three. For the version used in this Module (Intercooled STATA), the default
memory is sufficient. Now, the file Bill.CSV can be read into STATA with the following
insheet command. Note that, in this case, the directory has been changed beforehand so that we
need only specify the file BILL.csv. For instance, say the file is located in the folder,
“C:\Documents and Settings\My Documents\BILL.csv”. Then users can change the directory
with the command, cd “C:\Documents and Settings\My Documents”, in which case the file may
be accessed simply by specifying the actual file name, BILL.csv, as in the following:

insheet student school size other birthday sex eslflag ///
adultst mc1 mc2 mc3 mc4 mc5 mc6 mc7 mc8 mc9 mc10 mc11 ///
mc12 mc13 mc14 mc15 mc16 mc17 mc18 mc19 mc20 totalmc ///
avgmc essay1 essay2 essay3 essay4 totessay avgessay ///
totscore avgscore ma081 ma082 ec011 ec012 ma083 en093 ///
using "BILL.csv", comma

Using these recode and generate commands yields the following relevant variable definitions:

recode size (0/9=1) (10/19=2) (20/29=3) (30/39=4) ///
(40/49=5) (50/100=6)
gen smallest=(size==1)
gen smaller=(size==2)
gen small=(size==3)
gen large=(size==4)
gen larger=(size==5)
gen largest=(size==6)

TOTALMC: Student’s score on 12th grade economics multiple-choice exam (M_i).
AVGMC: Mean multiple-choice score for students at school (M̄_i).
TOTESSAY: Student’s score on 12th grade economics essay exam (W_i).
AVGESSAY: Mean essay score for students at school (W̄_i).
ADULTST = 1, if a returning adult student, and 0 otherwise.
SEX = GENDER = 1 if student is female and 0 is male.
ESLFLAG = 1 if English is not student’s first language and 0 if it is.
EC011 = 1 if student enrolled in first semester 11 grade economics course, 0 if not.
EN093 = 1 if student was enrolled in ESL English course, 0 if not
MA081 = 1 if student enrolled in the first semester 11 grade math course, 0 if not.
MA082 = 1 if student was enrolled in the second semester 11 grade math course, 0 if not.
MA083 = 1 if student was enrolled in the first semester 12 grade math course, 0 if not.
SMALLER = 1 if student from a school with 10 to 19 test takers, 0 if not.
SMALL = 1 if student from a school with 20 to 29 test takers, 0 if not.
LARGE = 1 if student from a school with 30 to 39 test takers, 0 if not.
LARGER = 1 if student from a school with 40 to 49 test takers, 0 if not.

In all of the regressions, the effect of being at a school with more than 49 test takers is captured
in the constant term, against which the other dummy variables are compared. The smallest
schools should not be included so that we can treat the mean scores as exogenous and unaffected
by any individual student’s test performance, which is accomplished by adding the following
command to the end of the summary statistics and regression commands:

if smallest!=1

This qualifier is added to the end of our regression and summary statistics commands, and it
says to perform the desired command only for observations where smallest is not equal to 1. We could
also completely remove these observations with the command:

drop if smallest==1

The problem with this approach, however, is that we cannot retrieve observations once they’ve
been dropped (at least not easily), so it’s generally sound practice to follow the first approach.
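
If you prefer not to attach the if qualifier to every command, another option is to set the restricted sample only temporarily; a minimal sketch using Stata’s preserve and restore:

preserve
drop if smallest==1
* run the summary statistics and regressions here without the "if" qualifier
restore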

The descriptive statistics on the relevant variables are then obtained with the following
command, yielding the STATA output shown: ii

sum totalmc avgmc totessay avgessay adultst sex eslflag ///
ec011 en093 ma081 ma082 ma083 smaller small large ///
larger if smallest!=1

Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
totalmc | 3710 12.43558 3.961942 0 20
avgmc | 3710 12.43558 1.972638 6.416667 17.07143
totessay | 3710 18.13801 9.211914 0 40
avgessay | 3710 18.13801 4.668071 5.7 29.78571
adultst | 3710 .0051213 .0713894 0 1
-------------+--------------------------------------------------------
sex | 3710 .390566 .487943 0 1
eslflag | 3710 .0641509 .2450547 0 1
ec011 | 3710 .6770889 .4676521 0 1
en093 | 3710 .0622642 .2416673 0 1
ma081 | 3710 .5913747 .491646 0 1
-------------+--------------------------------------------------------
ma082 | 3710 .5487871 .4976812 0 1
ma083 | 3710 .4202156 .4936599 0 1
smaller | 3711 .4621396 .4986317 0 1
small | 3711 .2072218 .4053704 0 1
large | 3711 .1064403 .3084419 0 1
-------------+--------------------------------------------------------
larger | 3711 .0978173 .2971075 0 1

For comparison with the two-stage least squares results, we start with the least squares
regressions shown after this paragraph. The least squares estimations are typical of those found
in multiple-choice and essay score correlation studies, with correlation coefficients of 0.77 and
0.78. The essay mark or score, W, is the most significant variable in the multiple-choice score
regression (first of the two tables) and the multiple-choice mark, M, is the most significant
variable in the essay regression (second of the two tables). Results like these have led
researchers to conclude that the essay and multiple-choice marks are good predictors of each
other. Notice also that both the mean multiple-choice and mean essay marks are significant in
their respective equations, suggesting that something in the classroom environment or group
experience influences individual test scores. Finally, being female has a significant negative
effect on the multiple choice-test score, but a significant positive effect on the essay score, as
expected from the least squares results reported by others. We will see how these results hold up
in the two-stage least squares regressions.

regress totalmc totessay adultst sex avgmc eslflag ec011 ///
    en093 ma081 ma082 ma083 smaller small large larger if ///
    smallest!=1

Source | SS df MS Number of obs = 3710
-------------+------------------------------ F( 14, 3695) = 380.73
Model | 34384.2039 14 2456.01457 Prob > F = 0.0000
Residual | 23835.8996 3695 6.45085239 R-squared = 0.5906
-------------+------------------------------ Adj R-squared = 0.5890
Total | 58220.1035 3709 15.6969813 Root MSE = 2.5399

------------------------------------------------------------------------------
totalmc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
totessay | .2707917 .0054726 49.48 0.000 .2600621 .2815213
adultst | .4674948 .592213 0.79 0.430 -.6936016 1.628591
sex | -.5259549 .0912871 -5.76 0.000 -.7049329 -.3469768
avgmc | .3793819 .0252904 15.00 0.000 .3297974 .4289663
eslflag | .3933259 .8524557 0.46 0.645 -1.278004 2.064656
ec011 | .0172264 .0926488 0.19 0.853 -.1644214 .1988743
en093 | -.3117338 .8649386 -0.36 0.719 -2.007538 1.38407
ma081 | -.120807 .1808402 -0.67 0.504 -.4753635 .2337494
ma082 | .3827058 .1946737 1.97 0.049 .0010273 .7643843
ma083 | .3703758 .1184767 3.13 0.002 .1380896 .602662
smaller | .0672105 .147435 0.46 0.649 -.2218514 .3562725
small | -.0056878 .1570632 -0.04 0.971 -.3136269 .3022513
large | .0663582 .1785263 0.37 0.710 -.2836616 .4163781
larger | .0565486 .1821756 0.31 0.756 -.300626 .4137232
_cons | 2.654957 .3393615 7.82 0.000 1.989603 3.320311
------------------------------------------------------------------------------

regress totessay totalmc adultst sex avgessay eslflag ec011 ///
    en093 ma081 ma082 ma083 smaller small large larger if ///
    smallest!=1

Source | SS df MS Number of obs = 3710
-------------+------------------------------ F( 14, 3695) = 411.37
Model | 191732.026 14 13695.1447 Prob > F = 0.0000
Residual | 123011.315 3695 33.2912896 R-squared = 0.6092
-------------+------------------------------ Adj R-squared = 0.6077
Total | 314743.341 3709 84.8593533 Root MSE = 5.7699

------------------------------------------------------------------------------
totessay | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
totalmc | 1.408896 .0282236 49.92 0.000 1.353561 1.464231
adultst | -.8291495 1.345456 -0.62 0.538 -3.467058 1.808759
sex | 1.239957 .2080107 5.96 0.000 .8321298 1.647784
avgessay | .4000235 .0237117 16.87 0.000 .3535343 .4465128
eslflag | .4511402 1.936935 0.23 0.816 -3.346427 4.248707
ec011 | .2985372 .2104486 1.42 0.156 -.1140697 .7111441
en093 | -2.020882 1.9647 -1.03 0.304 -5.872885 1.831122
ma081 | .8495121 .4106127 2.07 0.039 .0444623 1.654562
ma082 | .1590916 .4424986 0.36 0.719 -.7084739 1.026657
ma083 | 1.809542 .2679394 6.75 0.000 1.284218 2.334865
smaller | .6170663 .3305425 1.87 0.062 -.0309973 1.26513
small | .2693409 .3547691 0.76 0.448 -.4262217 .9649034
large | .2646448 .4052628 0.65 0.514 -.5299159 1.059206
larger | .0615029 .414367 0.15 0.882 -.7509076 .8739135
_cons | -8.948704 .5542766 -16.14 0.000 -10.03542 -7.861986
------------------------------------------------------------------------------

Theoretical considerations discussed in Module Two, Part One, suggest that these least
squares estimates involve a simultaneous equation bias that is brought about by an apparent
reverse causality between the two forms of testing. Consistent estimation of the parameters in
this simultaneous equation system is possible with two-stage least squares, where our instrument
M̂i for Mi is obtained by a least squares regression of Mi on SEX, ADULTST, AVGMC,
AVGESSAY, ESLFLAG, SMALLER, SMALL, LARGE, LARGER, EC011, EN093, MA081,
MA082, and MA083. Our instrument Ŵi for Wi is obtained by a least squares regression of
Wi on the same set of exogenous variables. STATA will do these regressions
and the subsequent regressions for M and W employing these instruments via the following
commands, which yield the subsequent output. Note that within the parentheses of the ivreg
command we list only the excluded instrument, that is, the exogenous variable that does not also
appear as a regressor in the equation being estimated. As seen in the
output tables, STATA correctly includes all of the exogenous variables as instruments in the
two-stage least squares estimation:

ivreg totalmc adultst sex avgmc eslflag ec011 en093 ///
    ma081 ma082 ma083 smaller small large larger ///
    (totessay=avgessay) if smallest!=1

Instrumental variables (2SLS) regression

Source | SS df MS Number of obs = 3710
-------------+------------------------------ F( 14, 3695) = 106.01
Model | 11874.9352 14 848.209657 Prob > F = 0.0000
Residual | 46345.1683 3695 12.5426707 R-squared = 0.2040
-------------+------------------------------ Adj R-squared = 0.2010
Total | 58220.1035 3709 15.6969813 Root MSE = 3.5416

------------------------------------------------------------------------------
totalmc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
totessay | -.0524779 .036481 -1.44 0.150 -.1240029 .019047
adultst | .2533495 .8261181 0.31 0.759 -1.366343 1.873042
sex | -.0897195 .1360894 -0.66 0.510 -.3565373 .1770983
avgmc | .9748841 .0745801 13.07 0.000 .8286619 1.121106
eslflag | .6744471 1.189066 0.57 0.571 -1.656844 3.005738
ec011 | .2925431 .1327137 2.20 0.028 .0323437 .5527425
en093 | -1.588716 1.214273 -1.31 0.191 -3.969426 .7919948
ma081 | .2995655 .2563946 1.17 0.243 -.2031234 .8022545
ma082 | .8159711 .275631 2.96 0.003 .2755672 1.356375
ma083 | 1.635256 .2162776 7.56 0.000 1.211221 2.059291
smaller | .2715921 .2068164 1.31 0.189 -.1338934 .6770776
small | .04373 .2190764 0.20 0.842 -.3857925 .4732525
large | .1981185 .2493609 0.79 0.427 -.29078 .687017
larger | -.086771 .254517 -0.34 0.733 -.5857787 .4122366
_cons | -.3038295 .5749204 -0.53 0.597 -1.431022 .8233631
------------------------------------------------------------------------------
Instrumented: totessay
Instruments: adultst sex avgmc eslflag ec011 en093 ma081 ma082 ma083
smaller small large larger avgessay
------------------------------------------------------------------------------

ivreg totessay adultst sex avgessay eslflag ec011 ///
    en093 ma081 ma082 ma083 smaller small large larger ///
    (totalmc=avgmc) if smallest!=1

Instrumental variables (2SLS) regression

Source | SS df MS Number of obs = 3710
-------------+------------------------------ F( 14, 3695) = 141.62
Model | 112024.731 14 8001.76652 Prob > F = 0.0000
Residual | 202718.61 3695 54.8629526 R-squared = 0.3559
-------------+------------------------------ Adj R-squared = 0.3535
Total | 314743.341 3709 84.8593533 Root MSE = 7.407

------------------------------------------------------------------------------
totessay | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
totalmc | .0278877 .1583175 0.18 0.860 -.2825105 .338286
adultst | -.1690792 1.728774 -0.10 0.922 -3.558524 3.220366
sex | .6854633 .274106 2.50 0.012 .1480494 1.222877
avgessay | .8417622 .0579371 14.53 0.000 .7281703 .9553541
eslflag | 1.723698 2.490557 0.69 0.489 -3.159304 6.6067
ec011 | .7128703 .2740879 2.60 0.009 .1754919 1.250249
en093 | -3.983249 2.531637 -1.57 0.116 -8.946793 .980295
ma081 | 1.069629 .5276886 2.03 0.043 .0350393 2.104218
ma082 | 1.217027 .5801887 2.10 0.036 .0795055 2.354548
ma083 | 3.892551 .4151461 9.38 0.000 3.078613 4.706489
smaller | .3348224 .4254953 0.79 0.431 -.4994062 1.169051
small | -.1364833 .4576746 -0.30 0.766 -1.033803 .7608364
large | .3418925 .5203201 0.66 0.511 -.6782504 1.362035
larger | -.0825121 .5321788 -0.16 0.877 -1.125905 .960881
_cons | -1.179741 1.12159 -1.05 0.293 -3.378737 1.019256
------------------------------------------------------------------------------
Instrumented: totalmc
Instruments: adultst sex avgessay eslflag ec011 en093 ma081 ma082 ma083
smaller small large larger avgmc
------------------------------------------------------------------------------

The 2SLS results differ from the least squares results in many ways. The essay mark or
score, W, is no longer a significant variable in the multiple-choice regression and the multiple-
choice mark, M, is likewise insignificant in the essay regression. Each score appears to be
measuring something different when the regressor and error-term-induced bias is eliminated by
our instrumental variable estimators.

Both the mean multiple-choice and mean essay scores continue to be significant in their
respective equations. But now being female is insignificant in explaining the multiple-choice
test score. Being female continues to have a significant positive effect on the essay score.

DURBIN, HAUSMAN AND WU TEST FOR ENDOGENEITY

The theoretical argument is strong for treating multiple-choice and essay scores as endogenous
when employed as regressors in the explanation of the other. Nevertheless, this endogeneity can
be tested with the Durbin, Hausman and Wu specification test. There are at least two ways to
perform this test in STATA: One is with the auxiliary regression as done with LIMDEP in
Module Two, Part Two, and the other is with the general Hausman command. For consistency
across Parts Two and Three, we use the auxiliary regression method here; however, for those
interested in the more general Hausman command, type help hausman in the command window
for a brief description. 
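
For readers who want to see what the more general route might look like, the following sketch (not part of the original module; the variable lists simply repeat those of the earlier ivreg and regress commands) stores the instrumental variable and least squares results and then asks STATA to compare them:

quietly ivreg totalmc adultst sex avgmc eslflag ec011 en093 ///
    ma081 ma082 ma083 smaller small large larger ///
    (totessay=avgessay) if smallest!=1
estimates store iv
quietly regress totalmc totessay adultst sex avgmc eslflag ec011 en093 ///
    ma081 ma082 ma083 smaller small large larger if smallest!=1
estimates store ols
hausman iv ols, constant sigmamore

The instrumental variable estimates, which remain consistent under endogeneity, are listed first; the sigmamore option bases both covariance matrices on the error variance of the efficient (least squares) estimator.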
 
Either a Wald statistic, in a Chi-square (χ²) test with K* degrees of freedom, or an F
statistic with K* and n − (K + K*) degrees of freedom, is used to test the joint significance of the
contribution of the predicted values (X̂*) of a regression of the K* endogenous regressors, in
matrix X*, on the exogenous variables (and a column of ones for the constant term) in matrix Z:

y = Xβ + X̂*γ + ε* ,

where X* = Zλ + u , X̂* = Zλ̂ , and λ̂ is a least squares estimator of λ.

Ho: γ = 0 , the variables in X* are exogenous

HA: γ ≠ 0 , at least one of the variables in X* is endogenous

In our case, K* = 1 when the essay score is to be tested as an endogenous regressor in the
multiple-choice equation and when the multiple-choice regressor is to be tested as endogenous in
the essay equation. X̂ * is an n × 1 vector of predicted essay scores from a regression of essay
scores on all the exogenous variables (for subsequent use in the multiple-choice equation) or an
n × 1 vector of predicted multiple-choice scores from a regression of multiple-choice scores on all
the exogenous variables (for subsequent use in the essay equation). Because K* = 1, the relevant
test statistic is either the t, with n − (K + K*) degrees of freedom for small n or the standard
normal, for large n.

In STATA, the predicted essay score is obtained by the following command, where the
specification “predict totesshat, xb” tells STATA to predict the essay scores and keep them as a
variable called “totesshat”:

regress totessay adultst sex avgmc avgessay eslflag ///
    ec011 en093 ma081 ma082 ma083 smaller small large ///
    larger if smallest!=1

predict totesshat, xb

The predicted essay scores are then added as a regressor in the original multiple-choice
regression:

regress totalmc totessay adultst sex avgmc eslflag ///
    ec011 en093 ma081 ma082 ma083 smaller small large ///
    larger totesshat if smallest!=1

The test statistic for the totesshat coefficient is then used in the test of endogeneity. In the STATA
output below, we see that the calculated standard normal test statistic z value is −12.92, which
far exceeds, in absolute value, the 1.96 critical value for a 0.05 probability of a Type I error.
Thus, the null hypothesis of an exogenous essay score as an explanatory variable for the
multiple-choice score is rejected. As theorized, the essay score is endogenous in an explanation
of the multiple-choice score.

. regress totessay adultst sex avgmc avgessay eslflag ec011 en093 ma081 ma082 ///
ma083 smaller small large larger if smallest!=1

Source | SS df MS Number of obs = 3710
-------------+------------------------------ F( 14, 3695) = 139.38
Model | 108774.75 14 7769.62501 Prob > F = 0.0000
Residual | 205968.591 3695 55.7425145 R-squared = 0.3456
-------------+------------------------------ Adj R-squared = 0.3431
Total | 314743.341 3709 84.8593533 Root MSE = 7.4661

------------------------------------------------------------------------------
totessay | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
adultst | -.1617771 1.741248 -0.09 0.926 -3.575679 3.252125
sex | .6819632 .2723475 2.50 0.012 .1479971 1.215929
avgmc | .0271476 .1553461 0.17 0.861 -.277425 .3317202
avgessay | .8405321 .0646428 13.00 0.000 .7137931 .9672711
eslflag | 1.739961 2.506713 0.69 0.488 -3.174717 6.654638
ec011 | .719975 .2721919 2.65 0.008 .1863138 1.253636
en093 | -4.021669 2.541765 -1.58 0.114 -9.005069 .9617305
ma081 | 1.076408 .531461 2.03 0.043 .0344219 2.118393
ma082 | 1.237971 .571906 2.16 0.030 .1166884 2.359253
ma083 | 3.9324 .3425393 11.48 0.000 3.260815 4.603984
smaller | .3418961 .433852 0.79 0.431 -.5087167 1.192509
small | -.1350661 .4622235 -0.29 0.770 -1.041304 .7711722
large | .3469099 .5247716 0.66 0.509 -.6819605 1.37578
larger | -.0848079 .5361811 -0.16 0.874 -1.136048 .9664321
_cons | -1.186477 1.161393 -1.02 0.307 -3.463511 1.090556
------------------------------------------------------------------------------

. predict totesshat, xb
(1 missing value generated)

.
. regress totalmc totessay adultst sex avgmc eslflag ec011 en093 ma081 ma082 ma083 ///
> smaller small large larger totesshat if smallest!=1

Source | SS df MS Number of obs = 3710
-------------+------------------------------ F( 15, 3694) = 382.41
Model | 35414.1533 15 2360.94355 Prob > F = 0.0000
Residual | 22805.9502 3694 6.17378186 R-squared = 0.6083
-------------+------------------------------ Adj R-squared = 0.6067
Total | 58220.1035 3709 15.6969813 Root MSE = 2.4847

------------------------------------------------------------------------------
totalmc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
totessay | .2855834 .0054749 52.16 0.000 .2748493 .2963175
adultst | .2533495 .5795925 0.44 0.662 -.8830033 1.389702
sex | -.0897195 .0954784 -0.94 0.347 -.2769151 .097476
avgmc | .974884 .0523243 18.63 0.000 .8722967 1.077471
eslflag | .6744471 .8342319 0.81 0.419 -.9611532 2.310047
ec011 | .2925431 .09311 3.14 0.002 .1099909 .4750952
en093 | -1.588716 .8519162 -1.86 0.062 -3.258988 .0815565
ma081 | .2995655 .1798828 1.67 0.096 -.0531138 .6522448
ma082 | .8159711 .1933787 4.22 0.000 .4368315 1.195111
ma083 | 1.635256 .1517372 10.78 0.000 1.337759 1.932753
smaller | .2715921 .1450994 1.87 0.061 -.0128907 .5560749
small | .04373 .1537008 0.28 0.776 -.2576168 .3450769
large | .1981185 .174948 1.13 0.258 -.1448856 .5411227
larger | -.086771 .1785655 -0.49 0.627 -.4368675 .2633256
totesshat | -.3380613 .0261736 -12.92 0.000 -.3893774 -.2867452
_cons | -.3038295 .4033559 -0.75 0.451 -1.094652 .4869926
------------------------------------------------------------------------------

The similar estimation routine to test for the endogeneity of the multiple-choice test score in the
essay equation yields a calculated z test statistic of −11.71, which far exceeds its 1.96 critical
value in absolute value. Thus, the null hypothesis of an exogenous multiple-choice score as an
explanatory variable for the essay score is rejected. As theorized, the multiple-choice score is
endogenous in an explanation of the essay score.
. regress totalmc avgessay adultst sex avgmc eslflag ec011 en093 ma081 ma082 ma083 ///
> smaller small large larger if smallest!=1

Source | SS df MS Number of obs = 3710
-------------+------------------------------ F( 14, 3695) = 124.06
Model | 18615.7882 14 1329.69916 Prob > F = 0.0000
Residual | 39604.3153 3695 10.7183533 R-squared = 0.3197
-------------+------------------------------ Adj R-squared = 0.3172
Total | 58220.1035 3709 15.6969813 Root MSE = 3.2739

------------------------------------------------------------------------------
totalmc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
avgessay | -.0441094 .0283459 -1.56 0.120 -.0996846 .0114658
adultst | .2618392 .7635394 0.34 0.732 -1.235161 1.758839
sex | -.1255075 .1194247 -1.05 0.293 -.3596523 .1086372
avgmc | .9734594 .0681195 14.29 0.000 .839904 1.107015
eslflag | .5831376 1.099197 0.53 0.596 -1.571954 2.738229
ec011 | .2547603 .1193565 2.13 0.033 .0207492 .4887713
en093 | -1.377667 1.114567 -1.24 0.217 -3.562894 .8075597
ma081 | .2430779 .2330463 1.04 0.297 -.2138341 .6999899
ma082 | .7510049 .2507815 2.99 0.003 .2593213 1.242689
ma083 | 1.428892 .1502039 9.51 0.000 1.134401 1.723382
smaller | .2536501 .1902446 1.33 0.183 -.1193446 .6266448
small | .050818 .2026856 0.25 0.802 -.3465686 .4482046
large | .1799134 .2301129 0.78 0.434 -.2712475 .6310742
larger | -.0823205 .235116 -0.35 0.726 -.5432904 .3786495
_cons | -.2415657 .509272 -0.47 0.635 -1.240048 .7569162
------------------------------------------------------------------------------

. predict totmchat, xb
(1 missing value generated)

.
. regress totessay totalmc adultst sex avgessay eslflag ec011 en093 ma081 ma082 ///
ma083 smaller small large larger totmchat if smallest!=1

Source | SS df MS Number of obs = 3710
-------------+------------------------------ F( 15, 3694) = 407.25
Model | 196137.341 15 13075.8227 Prob > F = 0.0000
Residual | 118606 3694 32.1077423 R-squared = 0.6232
-------------+------------------------------ Adj R-squared = 0.6216
Total | 314743.341 3709 84.8593533 Root MSE = 5.6664

------------------------------------------------------------------------------
totessay | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
totalmc | 1.485222 .028473 52.16 0.000 1.429398 1.541047
adultst | -.1690792 1.322524 -0.13 0.898 -2.762028 2.42387
sex | .6854633 .2096929 3.27 0.001 .274338 1.096589
avgessay | .8417622 .0443223 18.99 0.000 .7548637 .9286608
eslflag | 1.723699 1.905293 0.90 0.366 -2.011832 5.459229
ec011 | .7128703 .2096791 3.40 0.001 .3017721 1.123969
en093 | -3.983249 1.93672 -2.06 0.040 -7.780395 -.186104
ma081 | 1.069629 .4036853 2.65 0.008 .2781608 1.861097
ma082 | 1.217027 .4438483 2.74 0.006 .3468152 2.087239
ma083 | 3.892551 .3175896 12.26 0.000 3.269883 4.515219
smaller | .3348226 .3255068 1.03 0.304 -.303368 .9730132
small | -.1364831 .3501242 -0.39 0.697 -.8229389 .5499726
large | .3418926 .3980484 0.86 0.390 -.4385237 1.122309
larger | -.0825121 .4071204 -0.20 0.839 -.880715 .7156909
totmchat | -1.457335 .1244158 -11.71 0.000 -1.701265 -1.213404
_cons | -1.179741 .8580241 -1.37 0.169 -2.861988 .5025071
------------------------------------------------------------------------------

CONCLUDING COMMENTS

This cookbook-type introduction to the use of instrumental variables and two-stage least squares
regression and testing for endogeneity has just scratched the surface of this controversial
problem in statistical estimation and inference. It was intended to enable researchers to begin
using instrumental variables in their work and to enable readers of that work to have an idea of
what is being done. To learn more about these methods there is no substitute for a graduate level
textbook treatment such as that found in William Greene’s Econometric Analysis.

REFERENCES

Becker, William E. and Carol Johnston (1999). “The Relationship Between Multiple Choice and
Essay Response Questions in Assessing Economics Understanding,” Economic Record
(Economic Society of Australia), Vol. 75 (December): 348-357.

Greene, William (2003). Econometric Analysis. 5th Edition, New Jersey: Prentice Hall.

ENDNOTES

                                                       
i. As stated in Module Two, Part One, the 2SLS coefficients are consistent but not necessarily
unbiased. Consistency is an asymptotic property for which there are no adjustments for degrees
of freedom. Nevertheless, the default in the standard STATA routine for 2SLS, “ivreg,” adjusts
standard errors for the degrees of freedom. As an alternative, if your institution permits
downloads from STATA's user-written routines, then the “ivreg2” command rather than “ivreg”
can be employed. The “ivreg2” command makes no adjustment for degrees of freedom.

To use ivreg2, type findit ivreg2 into the STATA command window, where a list of information
and links to download this routine appears. Click on one of the download links and STATA
automatically downloads and installs the routine for use. Users can then access the
documentation for this routine by typing help ivreg2.
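
Once it is installed, ivreg2 accepts the same basic syntax as ivreg, so the first of the two-stage least squares estimations above could be rerun as in the following sketch, with the reported standard errors now being the unadjusted, large-sample ones:

ivreg2 totalmc adultst sex avgmc eslflag ec011 en093 ///
    ma081 ma082 ma083 smaller small large larger ///
    (totessay=avgessay) if smallest!=1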

If users do not have access to ivreg2 or are not permitted to download user-written routines on
the machine in use, the following code provides the unadjusted standard errors after running a
model using ivreg:

* Start from the coefficient vector to get a 1 x k matrix of the right dimension
matrix large_sample_se=e(b)
* Rescale the reported covariance matrix by e(df_r)/e(N) = (n-k)/n to undo the df adjustment
matrix large_sample_var=e(V)*e(df_r)/e(N)
local ncol=colsof(large_sample_se)
* Replace each element with the square root of the corresponding variance
forvalues i=1/`ncol' {
matrix large_sample_se[1,`i']=sqrt(large_sample_var[`i',`i'])
}
matrix list large_sample_se
ii. Notice that the size variables (smaller to larger) show 3711 observations but the others show
the correct 3710. This is an artifact of the way the size variables were created in STATA. The
extra, blank observation has no bearing on the results and is ignored in the calculations, which
are all based on the original 3710 observations.
 
 

MODULE TWO, PART FOUR: ENDOGENEITY,
INSTRUMENTAL VARIABLES AND TWO-STAGE LEAST SQUARES
IN ECONOMIC EDUCATION RESEARCH USING SAS

Part Four of Module Two provides a cookbook-type demonstration of the steps required to use
SAS to address problems of endogeneity using a two-stage least squares, instrumental variable
estimator. The Durbin, Hausman and Wu specification test for endogeneity is also demonstrated.
Users of this module need to have completed Module One, Parts One and Four, and Module Two,
Part One. That is, from Module One, users are assumed to know how to get data into SAS,
recode and create variables within SAS, and run and interpret regression results. From Module
Two, Part One, they are expected to have an understanding of the problem of and source of
endogeneity and the basic idea behind an instrumental variable approach and the two-stage least
squares method. The Becker and Johnston (1999) data set is used throughout this module for
demonstration purposes only. Module Two, Parts Two and Three demonstrate in LIMDEP and
STATA what is done here in SAS.

THE CASE

As described in Module Two, Part One, Becker and Johnston (1999) called attention to
classroom effects that might influence multiple-choice and essay type test taking skills in
economics in different ways. For example, if the student is in a classroom that emphasizes skills
associated with multiple choice testing (e.g., risk-taking behavior, question analyzing skills,
memorization, and keen sense of judging between close alternatives), then the student can be
expected to do better on multiple-choice questions. By the same token, if placed in a classroom
that emphasizes the skills of essay test question answering (e.g., organization, good sentence and
paragraph construction, obfuscation when uncertain, logical argument, and good penmanship),
then the student can be expected to do better on the essay component. Thus, Becker and
Johnston attempted to control for the type of class of which the student is a member. Their
measure of “teaching to the multiple-choice questions” is the mean score or mark on the
multiple-choice questions for the school in which the ith student took the 12th grade economics
course. Similarly, the mean school mark or score on the essay questions is their measure of the
ith student’s exposure to essay question writing skills.

In equation form, the two equations that summarize the influence of the various
covariates on multiple-choice and essay test scores are written as the following structural
equations:

Mi = ρ21 + ρ22Wi + ρ23M̄i + ∑j=4,…,J ρ2jXij + Ui* .

Wi = ρ31 + ρ32Mi + ρ33W̄i + ∑j=4,…,J ρ3jXij + Vi* .

Mi and Wi are the ith student’s respective scores on the multiple-choice test and essay test.
M̄i and W̄i are the mean multiple-choice and essay test scores at the school where the ith student
took the 12th grade economics course. The Xij variables are the other exogenous variables (such
as gender, age, English as a second language, etc.) used to explain the ith student’s multiple-choice
and essay marks, where the ρs are parameters to be estimated. The inclusion of the mean
multiple-choice and mean essay test scores in their respective structural equations, and their
exclusion from the other equation, enables both of the structural equations to be identified within
the system.

As shown in Module Two, Part One, the least squares estimation of the ρs involves bias
because the error term Ui* is related to Wi in the first equation, and Vi* is related to Mi in the
second equation. Instruments for regressors Wi and Mi are needed. Because the reduced form
equations express Wi and Mi solely in terms of exogenous variables, they can be used to generate
the respective instruments:

Mi = Γ21 + Γ22W̄i + Γ23M̄i + ∑j=4,…,J Γ2jXij + Ui** .

Wi = Γ31 + Γ32M̄i + Γ33W̄i + ∑j=4,…,J Γ3jXij + Vi** .

The reduced form parameters (Γs) are functions of the ρs, and the reduced form error terms U**
and V** are functions of U* and V*, which are not related to any of the regressors in the reduced
form equations.

We could estimate the reduced form equations and get M̂i and Ŵi. We could then
substitute M̂i and Ŵi into the structural equations as proxy regressors (instruments) for Mi and Wi.
The least squares regression of Mi on Ŵi, M̄i and the Xs and a least squares regression of Wi on
M̂i, W̄i and the Xs would yield consistent estimates of the respective ρs, but the standard errors
would be incorrect. SAS can automatically do all the required estimations with the two-stage
least squares command:

proc syslin data=dataset_name 2sls;
   endogenous p;
   instruments y u s;
   equation1: model q = p y s;
   equation2: model q = p u;
run;

This ‘proc syslin’ package command, however, makes an inappropriate adjustment for degrees of freedom
in calculating the standard errors. What follows is the correct two-step procedure for the
asymptotically efficient estimators, which involves no adjustment for degrees of freedom.

TWO-STAGE, LEAST SQUARES IN SAS

The Becker and Johnston (1999) data are in the file named “Bill.CSV.” The file Bill.CSV can be
read into SAS with the following read command (the file may be located anywhere on your hard
drive but here it is located on the e drive):

data work.bill; infile 'e:\BILL.CSV' delimiter = ',' missover dsd;
informat student best32.; informat school best32.; informat size best32.;
informat other best32.; informat birthday best32.; informat sex best32.;
informat eslflag best32.; informat adultst best32.; informat mc1 best32.;
informat mc2 best32.; informat mc3 best32.; informat mc4 best32.;
informat mc5 best32.; informat mc6 best32.; informat mc7 best32.;
informat mc8 best32.; informat mc9 best32.; informat mc10 best32.;
informat mc11 best32.; informat mc12 best32.; informat mc13 best32.;
informat mc14 best32.; informat mc15 best32.; informat mc16 best32.;
informat mc17 best32.; informat mc18 best32.; informat mc19 best32.;
informat mc20 best32.; informat totalmc best32.; informat avgmc best32.;
informat essay1 best32.; informat essay2 best32.; informat essay3 best32.;
informat essay4 best32.; informat totessay best32.; informat avgessay best32.;
informat totscore best32.; informat avgscore best32.; informat ma081 best32.;
informat ma082 best32.; informat ec011 best32.; informat ec012 best32.;
informat ma083 best32.; informat en093 best32.;

format student best12.; format school best12.; format size best12.;
format other best12.; format birthday best12.; format sex best12.;
format eslflag best12.; format adultst best12.; format mc1 best12.;
format mc2 best12.; format mc3 best12.; format mc4 best12.;
format mc5 best12.; format mc6 best12.; format mc7 best12.;
format mc8 best12.; format mc9 best12.; format mc10 best12.;
format mc11 best12.; format mc12 best12.; format mc13 best12.;
format mc14 best12.; format mc15 best12.; format mc16 best12.;
format mc17 best12.; format mc18 best12.; format mc19 best12.;
format mc20 best12.; format totalmc best12.; format avgmc best12.;
format essay1 best12.; format essay2 best12.; format essay3 best12.;
format essay4 best12.; format totessay best12.; format avgessay best12.;
format totscore best12.; format avgscore best12.; format ma081 best12.;
format ma082 best12.; format ec011 best12.; format ec012 best12.;
format ma083 best12.; format en093 best12.;

input student school size other birthday sex eslflag adultst mc1 mc2 mc3 mc4 mc5 mc6
mc7 mc8 mc9 mc10 mc11 mc12 mc13 mc14 mc15 mc16 mc17 mc18 mc19 mc20 totalmc
avgmc essay1 essay2 essay3 essay4 totessay avgessay totscore avgscore ma081 ma082 ec011
ec012 ma083 en093; run;

Using these recode and create commands yields the following relevant variable definitions:

data Bill;
set M2P4.Bill;
smallest = 0;
smaller = 0;
small = 0;
large = 0;
larger = 0;
largest = 0;
if size > 0 & size < 10 then smallest = 1;
if size > 9 & size < 20 then smaller = 1;
if size > 19 & size < 30 then small = 1;
if size > 29 & size < 40 then large = 1;
if size > 39 & size < 50 then larger = 1;
if size > 49 then largest = 1;
run;

TOTALMC: Student’s score on 12th grade economics multiple-choice exam (Mi).
AVGMC: Mean multiple-choice score for students at the school (M̄i).
TOTESSAY: Student’s score on 12th grade economics essay exam (Wi).
AVGESSAY: Mean essay score for students at the school (W̄i).
ADULTST = 1 if a returning adult student, and 0 otherwise.
SEX = GENDER = 1 if student is female and 0 if male.
ESLFLAG = 1 if English is not student’s first language and 0 if it is.
EC011 = 1 if student enrolled in the first semester 11th grade economics course, 0 if not.
EN093 = 1 if student was enrolled in ESL English course, 0 if not.
MA081 = 1 if student enrolled in the first semester 11th grade math course, 0 if not.
MA082 = 1 if student was enrolled in the second semester 11th grade math course, 0 if not.
MA083 = 1 if student was enrolled in the first semester 12th grade math course, 0 if not.
SMALLER = 1 if student from a school with 10 to 19 test takers, 0 if not.
SMALL = 1 if student from a school with 20 to 29 test takers, 0 if not.
LARGE = 1 if student from a school with 30 to 39 test takers, 0 if not.
LARGER = 1 if student from a school with 40 to 49 test takers, 0 if not.

In all of the regressions, the effect of being at a school with more than 49 test takers is captured
in the constant term, against which the other dummy variables are compared. The smallest
schools need to be excluded so that the mean scores can be treated as exogenous and unaffected by any
individual student’s test performance, which is accomplished with the following commands:

data Bill; set Bill;
if smallest = 1 then delete;
if student = . then delete;
run;

The descriptive statistics on the relevant variables are then obtained with the following
command, yielding the SAS output table shown:

proc means data = Bill;
   var totalmc avgmc totessay avgessay adultst sex eslflag ec011
       en093 ma081 ma082 ma083 smaller small large larger;
   output out = new;
quit;

For comparison with the two-stage least squares results, we start with the least squares
regressions shown after this paragraph. The least squares estimations are typical of those found
in multiple-choice and essay score correlation studies, with correlation coefficients of 0.77 and
0.78. The essay mark or score, W, is the most significant variable in the multiple-choice score
regression (first of the two tables) and the multiple-choice mark, M, is the most significant
variable in the essay regression (second of the two tables). Results like these have led
researchers to conclude that the essay and multiple-choice marks are good predictors of each
other. Notice also that both the mean multiple-choice and mean essay marks are significant in
their respective equations, suggesting that something in the classroom environment or group
experience influences individual test scores. Finally, being female has a significant negative
effect on the multiple choice-test score, but a significant positive effect on the essay score, as
expected from the least squares results reported by others. We will see how these results hold up
in the two-stage least squares regressions.

proc reg data = Bill; model totalmc = totessay adultst sex avgmc eslflag
ec011 en093 ma081 ma082 ma083 smaller small large larger; quit;

proc reg data = Bill; model totessay = totalmc adultst sex avgessay
eslflag ec011 en093 ma081 ma082 ma083 smaller small large larger; quit;

Theoretical considerations discussed in Module Two, Part One, suggest that these least
squares estimates involve a simultaneous equation bias that is brought about by an apparent
reverse causality between the two forms of testing. Consistent estimation of the parameters in
this simultaneous equation system is possible with two-stage least squares, where our instrument
(M̂i) for Mi is obtained by a least squares regression of Mi on SEX, ADULTST, AVGMC,
AVGESSAY, ESLFLAG, SMALLER, SMALL, LARGE, LARGER, EC011, EN093, MA081,
MA082, and MA083. Our instrument (Ŵi) for Wi is obtained by a least squares regression of Wi
on the same set of exogenous variables. SAS will do these regressions and
the subsequent regressions for M and W employing these instruments via the following
commands, which yield the subsequent outputs:

proc model data = bill;
   instruments sex adultst avgmc avgessay eslflag smaller
               small large larger ec011 en093 ma081 ma082 ma083;
   totalmc = constant_1 + totessay_1*totessay + adultst_1*adultst +
             sex_1*sex + avgmc_1*avgmc + eslflag_1*eslflag +
             ec011_1*ec011 + en093_1*en093 + ma081_1*ma081 +
             ma082_1*ma082 + ma083_1*ma083 + smaller_1*smaller +
             small_1*small + large_1*large + larger_1*larger;
   totessay = constant_2 + totalmc_2*totalmc + adultst_2*adultst +
              sex_2*sex + avgessay_2*avgessay + eslflag_2*eslflag +
              ec011_2*ec011 + en093_2*en093 + ma081_2*ma081 +
              ma082_2*ma082 + ma083_2*ma083 + smaller_2*smaller +
              small_2*small + large_2*large + larger_2*larger;
   fit totalmc totessay / 2sls outv=vdata vardef=N;
quit;

The 2SLS results differ from the least squares results in many ways. The essay mark or
score, W, is no longer a significant variable in the multiple-choice regression and the multiple-
choice mark, M, is likewise insignificant in the essay regression. Each score appears to be
measuring something different when the regressor and error-term-induced bias is eliminated by
our instrumental variable estimators.

Both the mean multiple-choice and mean essay scores continue to be significant in their
respective equations. But now being female is insignificant in explaining the multiple-choice
test score. Being female continues to have a significant positive effect on the essay score.

DURBIN, HAUSMAN AND WU TEST FOR ENDOGENEITY

The theoretical argument is strong for treating multiple-choice and essay scores as endogenous
when employed as regressors in the explanation of the other. Nevertheless, this endogeneity can
be tested with the Durbin, Hausman and Wu specification test, which is a two-step procedure in
SAS. 
 

Either a Wald statistic, in a Chi-square (χ²) test with K* degrees of freedom, or an F
statistic with K* and n − (K + K*) degrees of freedom, is used to test the joint significance of the
contribution of the predicted values (X̂*) of a regression of the K* endogenous regressors, in
matrix X*, on the exogenous variables (and a column of ones for the constant term) in matrix Z:

y = Xβ + X̂*γ + ε* ,

where X* = Zλ + u , X̂* = Zλ̂ , and λ̂ is a least squares estimator of λ.

Ho: γ = 0 , the variables in X* are exogenous

HA: γ ≠ 0 , at least one of the variables in X* is endogenous

In our case, K* = 1 when the essay score is to be tested as an endogenous regressor in the
multiple-choice equation and when the multiple-choice regressor is to be tested as endogenous in
the essay equation. X̂ * is an n × 1 vector of predicted essay scores from a regression of essay
scores on all the exogenous variables (for subsequent use in the multiple-choice equation) or an
n × 1 vector of predicted multiple-choice scores from a regression of multiple-choice scores on all
the exogenous variables (for subsequent use in the essay equation). Because K* = 1, the relevant
test statistic is either the t, with n − (K + K*) degrees of freedom for small n or the standard
normal, for large n.

In SAS, the predicted essay score is obtained by the following command, where the
specification “output out=essaypredict p=Essayhat;” tells SAS to predict the essay scores and
keep them as a variable called “Essayhat”:

proc reg data = bill; model totessay = adultst sex avgessay avgmc eslflag
ec011 en093 ma081 ma082 ma083 smaller small large larger;
output out=essaypredict p=Essayhat; quit;

The predicted essay scores are then added as a regressor in the original multiple-choice
regression:

proc reg data = essaypredict; model totalmc = totessay adultst sex avgmc eslflag
ec011 en093 ma081 ma082 ma083 smaller small large larger Essayhat;
quit;

The test statistic for the Essayhat coefficient is then used in the test of endogeneity. In the SAS
output below, we see that the calculated standard normal test statistic z value is −12.916, which far
exceeds, in absolute value, the 1.96 critical value for a 0.05 probability of a Type I error.
Thus, the null hypothesis of an exogenous essay score as an explanatory variable for the
multiple-choice score is rejected. As theorized, the essay score is endogenous in an explanation
of the multiple-choice score.

proc reg data = bill; model totessay = adultst sex avgessay avgmc eslflag ec011 en093
ma081 ma082 ma083 smaller small large larger; output out=essaypredict p=Essayhat;
quit;

proc reg data = essaypredict; model totalmc = totessay adultst sex avgmc eslflag
ec011 en093 ma081 ma082 ma083 smaller small large larger Essayhat; quit;

The similar estimation routine to test for the endogeneity of the multiple-choice test score in the
essay equation yields a calculated z test statistic of −11.713, which far exceeds its 1.96 critical
value in absolute value. Thus, the null hypothesis of an exogenous multiple-choice score as an
explanatory variable for the essay score is rejected. As theorized, the multiple-choice score is
endogenous in an explanation of the essay score.

proc reg data = bill; model totalmc = adultst sex avgmc avgessay eslflag ec011 en093
ma081 ma082 ma083 smaller small large larger; output out=mcpredict p=MChat; quit;

proc reg data = mcpredict; model totessay = totalmc adultst sex avgessay eslflag
ec011 en093 ma081 ma082 ma083 smaller small large larger MChat; quit;

CONCLUDING COMMENTS

This cookbook-type introduction to the use of instrumental variables and two-stage least squares
regression and testing for endogeneity has just scratched the surface of this controversial
problem in statistical estimation and inference. It was intended to enable researchers to begin
using instrumental variables in their work and to enable readers of that work to have an idea of
what is being done. To learn more about these methods there is no substitute for a graduate level
textbook treatment such as that found in William Greene’s Econometric Analysis.

REFERENCES

Becker, William E. and Carol Johnston (1999). “The Relationship Between Multiple Choice and
Essay Response Questions in Assessing Economics Understanding,” Economic Record
(Economic Society of Australia), Vol. 75 (December): 348-357.

Greene, William (2003). Econometric Analysis. 5th Edition, New Jersey: Prentice Hall.

MODULE THREE, PART ONE:
PANEL DATA ANALYSIS IN ECONOMIC EDUCATION RESEARCH

William E. Becker, William H. Greene and John J. Siegfried*

As discussed in Modules One and Two, most of the empirical economic education research is
based on “value-added,” “change-score” or “difference-in-differences” model specifications
in which the expected improvement in student performance from a pre-program measure
(pretest) to its post-program measurement (posttest) is estimated and studied for a cross section
of subjects. Other than the fact that testing occurs at two different points in time, there is no
time dimension, as seen in the data sets employed in Modules One and Two. Panel data analysis
provides an alternative structure in which measurements on the cross section of subjects are
taken at regular intervals over multiple periods of time. i Collecting data on the cross section of
subjects over time enables a study of change. It opens the door for economic education
researchers to address unobservable attributes that lead to biased estimators in cross-section
analysis. ii As demonstrated in this module, it also opens the door for economic education
researchers to look at things other than test scores that vary with time.

This module provides an introduction to panel data analysis with specific applications to
economic education. The data structure for a panel along with constant coefficient, fixed effects
and random effects representations of the data generating processes are presented. Consideration
is given to different methods of estimation and testing. Finally, as in Modules One and Two,
contemporary estimation and testing procedures are demonstrated in Parts Two, Three and Four
using LIMDEP (NLOGIT), STATA and SAS.

THE PANEL DATA SET

As an example of a panel data set, consider our study (Becker, Greene and Siegfried,
Forthcoming) that examines the extent to which undergraduate degrees (BA and BS) in
economics or PhD degrees (PhD) in economics drive faculty size at those U.S. institutions that
offer only a bachelor degree and those that offer both bachelor degrees and PhDs.

-----------------------------------------
*William Becker is Professor of Economics, Indiana University, Adjunct Professor of Commerce,
University of South Australia, Research Fellow, Institute for the Study of Labor (IZA) and Fellow, Center
for Economic Studies and Institute for Economic Research (CESifo). William Greene is Professor of
Economics, Stern School of Business, New York University, Distinguished Adjunct Professor at
American University and External Affiliate of the Health Econometrics and Data Group at York
University. John Siegfried is Professor of Economics, Vanderbilt University, Senior Research Fellow,
University of Adelaide, South Australia, and Secretary-Treasurer of the American Economic Association.
Their e-mail addresses are <[email protected]>, <[email protected]> and
<[email protected]>.

We obtained data on the number of full-time tenured or tenure-track faculty and the
number of undergraduate economics degrees per institution per year from the American
Economic Association’s Universal Academic Questionnaire (UAQ). The numbers of PhD
degrees in economics awarded by department were obtained from the Survey of Earned
Doctorates, which is sponsored by several U.S. federal government agencies. These sources
provided yearly data on faculty size and degrees for each institution for 16 years, from 1990-
91 through 2005-06. For each year, we had data from 18 bachelor degree-granting institutions
and 24 institutions granting both the PhD and bachelor degrees. Pooling the cross-section
observations on each of the 18 bachelor only institutions, at a point in time, over the 16 years,
implies a panel of 288 observations on each initial variable. Pooling the cross-section
observations on each of the 24 PhD institutions, at a point in time, over the 16 years, implies a
panel of 384 observations on each initial variable. Subsequent creation of a three-year moving
average variable for degrees granted at each type of institution reduced the length of each panel
in the data set to 14 years of usable data.
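
For readers who want to see how such a moving average might be constructed, a minimal sketch in STATA (with hypothetical lowercase variable names standing in for the college identifier, year, and annual degree counts) is:

sort college year
by college: gen ma_deg = (bas + bas[_n-1] + bas[_n-2])/3

Because the two lagged values do not exist for the first two years of each college’s series, ma_deg is missing in those years, which is what shortens each panel to 14 years of usable data.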

Panel data are typically laid out in sequential blocks of cross-sectional data. For
example, the bachelor degree institution data observations for each of the 18 colleges appear in
blocks of 16 rows for years 1991 through 2006:

“College” identifies the bachelor degree-granting institution by a number 1 through 18.

“Year” runs from 1991 through 2006.

“BA&S” is the number of BS or BA degrees awarded in each year by each college.

“MEANBA&S” is the average number of degrees awarded by each college for the 16-year period.

“Public” equals 1 if the institution is a public college and 2 if it is a private college.

“Bschol” equals 1 if the college has a business program and 0 if not.

“Faculty” is the number of tenured or tenure-track economics department faculty members.

“T” is a time trend running from −7 to 8, corresponding to years from 1991 through 2006.

“MA_Deg” is a three-year moving average of degrees (unknown for the first two years).

College  Year  BA&S  MEANBA&S Public  Bschol  Faculty  T  MA_Deg
1  1991  50  47.375  2  1  11  ‐7  Missing 
1  1992  32  47.375  2  1  8  ‐6  Missing 
1  1993  31  47.375  2  1  10  ‐5  37.667 
1  1994  35  47.375  2  1  9  ‐4  32.667 
                 
↓  ↓  ↓  ↓  ↓  ↓  ↓    ↓ 
                 
1  2003  57  47.375  2  1  7  5  56 
1  2004  57  47.375  2  1  10  6  55.667 
1  2005  57  47.375  2  1  10  7  57 
1  2006  51  47.375  2  1  10  8  55 
2  1991  16  8.125  2  1  3  ‐7  Missing 
2  1992  14  8.125  2  1  3  ‐6  Missing 
2  1993  10  8.125  2  1  3  ‐5  13.333 
                 
↓  ↓  ↓  ↓  ↓  ↓  ↓    ↓ 
                 
2  2004  10  8.125  2  1  3  6  12.667 
2  2005  7  8.125  2  1  3  7  11.333 
2  2006  6  8.125  2  1  3  8  7.667 
3  1991  40  35.5  2  1  8  ‐7  Missing 
3  1992  31  37.125  2  1  8  ‐6  Missing 
                 
↓  ↓  ↓  ↓  ↓  ↓  ↓    ↓ 
                 
17  2004  64  39.3125  2  0  5  6  54.667 
17  2005  37  39.3125  2  0  4  7  51.333 
17  2006  53  39.3125  2  0  4  8  51.333 
18  1991  14  8.4375  2  0  4  ‐7  Missing 
18  1992  10  8.4375  2  0  4  ‐6  Missing 
18  1993  10  8.4375  2  0  4  ‐5  11.333 
18  1994  7  8.4375  2  0  3.5  ‐4  9 
                 
↓  ↓  ↓  ↓  ↓  ↓  ↓    ↓ 
                 
18  2005  4  8.4375  2  0  2.5  7  7.333 
18  2006  7  8.4375  2  0  3  8  6 

In a few years for some colleges, faculty size was missing. We interpolated missing data
on the number of faculty members in the economics department from the reported information in
the years before and after a missing observation, thus giving rise to the possibility of a half person
in those cases. If a panel data set such as this one has missing values that cannot be
meaningfully interpolated, it is an “unbalanced panel,” in which the number of usable
observations differs across the units. If there are no missing values and there are the same
number of periods of data for every group (college) in the sample, then the resulting pooled
cross-section and time-series data set is said to be a “balanced panel.” Typically, the cross-
section dimension is designated the i dimension and the time-series dimension is the t dimension.
Thus, panel data studies are sometimes referred to as “it ” studies.
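
In STATA, for example, declaring the i and t dimensions of such a panel and checking whether it is balanced might look like the following sketch (hypothetical variable names; the module’s own estimations appear in Parts Two, Three and Four):

xtset college year
xtdescribe

The xtdescribe command reports how many colleges are observed, over how many years, and whether any college-year combinations are missing.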

THE PANEL DATA-GENERATING PROCESS

There are three ways in which we consider the effect of degrees on faculty size. Here we will
consider only the bachelor degree-granting institutions.

First, the decision makers might set the permanent faculty based on the most current
available information, as reflected in the number of contemporaneous degrees (BA&Sit). That is,
the decision makers might form a type of rational expectation by setting the faculty size based on
the anticipated number of majors to receive degrees in the future, where that expectation for that
future number is forecasted by this year's value. Second, we included the overall mean number
of degrees awarded at each institution (MEANBA&Si) to reflect a type of historical steady state.
That is, the central administration or managers of the institution may have a target number of
permanent faculty relative to the long-term expected number of annual graduates from the
department that is desired to maintain the department’s appropriate role within the institution. iii
Third, the central authority might be willing to marginally increase or decrease the permanent
faculty size based on the near term trend in majors, as reflected in a three-year moving average
of degrees awarded (MA_Degit).

We then assume the faculty size data-generating process for bachelor degree-granting
undergraduate departments to be

FACULTY sizeit = β1 + β2Tt + β3BA&Sit + β4MEANBA&Si + β5PUBLICi + β6Bscholi + β7MA_Degit + εit (1)

where the error term εit is independent and identically distributed (iid) across institutions and
over time, with E(εit²|xit) = σ², for I = 18 colleges and T = 14 years (−5 through 8) for 252
complete observations. Notice that there is no time subscript on the mean number of degrees,
public/private and B school regressors because they do not vary with time.

In a more general and nondescript algebraic form for any it study, in which all
explanatory variables are free to vary with time and the error term is of the simple iid form with
E(εit2|xit) = σ2, the model would be written

Yit = β1 + β2X2it + β3X3it + … + βkXkit + εit , for i = 1, 2, …, I and t = 1, 2, …, T. (2)

This is a constant coefficient model because the intercept and slopes are assumed not to
vary within a cross section (not to vary across institutions) or over time. If this assumption is
true, the parameters can be estimated without bias by ordinary least squares applied directly to
the panel data set. Unfortunately, this assumption is seldom true, requiring us to consider the
fixed-effect and random-effects models.

FIXED-EFFECTS MODEL

The fixed effects model allows the intercept to vary across institutions (or among whatever cross-
section categories that are under consideration in other studies), while keeping the slope
coefficients the same for all institutions (or categories). The model could be written as

Yit = β1i + β2X2it + β3X3it + εit . (3)

Here β1i indicates that there is a separate intercept for each unit. No restriction is placed on
how the intercepts vary, except, of course, that they do so independently of εit. The model can be
made explicit for our application by inserting a 0-1 covariate or dummy variable for each of the
institutions except the one for which comparisons are to be made. In our case, there are 18
colleges; thus, 17 dummy variables are inserted and each of their coefficients is interpreted as the
expected change in faculty size for a movement from the excluded college to the college of
interest. Alternatively, we could have a separate dummy variable for each college and drop the
overall intercept. Both approaches give the same results for the other coefficients and for the R2
in the regression. (A moment’s thought will reveal, however, that in this setting, either way it is
formulated, it is not possible to have variables, such as type of school, that do not vary through
time. In the fixed effects model, such a variable would just be a multiple of the school specific
dummy variable.)

To clarify, the fixed-effects model for our study of bachelor degree institutions is written
(where Collegei = 1 if college i and 0 if not, for i = 1, 2, …, 18) as

FACULTY sizeit = β1 + β2YEARt + β3BA&Sit + β4MEANBA&Si + β5PUBLICi + β6Bscholi
        + β7MA_Degit + β8College1 + β9College2 + β10College3 + … + β23College16 + β24College17 + εit . (4)

Here a dummy for college 18 is omitted and its effects are reflected in the constant term β1 when
College1 = College2 =…= College16 = College17 = 0. For example, β9 is the expected change
in faculty size for a movement from college 18 to college 2. Which college is omitted is
arbitrary, but one must be omitted to avoid perfect collinearity in the data set. In general, if i
goes from 1 to I categories, then only I − 1 dummies are used to form the fixed-effects model:

Yit = β1 + β2X2it + β3X3it + … + βkXkit + βk+1D1 + βk+2D2 + … + βk+(I−1)DI−1 + εit , (5)
for i = 1, 2, …, I and t = 1, 2, …, T.

After creating the relevant dummy variables (Ds), the parameters of this fixed-effects model can
be estimated without bias by ordinary least squares. iv
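
As a concrete illustration, a sketch of the two equivalent ways of computing the fixed-effects estimates in STATA (hypothetical lowercase variable names; not the code used in Parts Two through Four) is:

xtset college year
xi: regress faculty t bas ma_deg i.college
xtreg faculty t bas ma_deg, fe

The first line declares the panel structure, the second estimates the dummy-variable version of the model, and the third produces the equivalent “within” estimates of the slope coefficients. The time-invariant regressors (the college mean of degrees, public/private status and the business school indicator) are omitted here because, as noted above, they would simply be multiples of the college dummies.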

If one has sufficient observations, the categorical dummy variables can be interacted with
the other time-varying explanatory variables to enable the slopes, as well as the intercepts, to vary
across the categories. For our study with 18 college categories this would be laborious to write out in
equation form. In many cases there simply are not sufficient degrees of freedom to
accommodate all the required interactions. v

To demonstrate a parsimonious model setup with both intercept and slope variability
consider a hypothetical situation involving three categories (represented by dummies D1, D2 and
D3 ) and two time-varying explanatory variables (represented by X2it and X3it):

Yit = β1 + β2X2it + β3X3it + β4D1 + β5D2 + β6(X2itD1) + β7(X3itD1) + β8(X2itD2) + β9(X3itD2) + εit . (6)

In this model, β1 is the intercept for the omitted category three, for which D1 = D2 = 0. The intercept for category one
is β1 + β4 and for category two it is β1 + β5. The change in the expected value of Y given a change
in X2 is β2 + β6D1 + β8D2; thus, for category one it is β2 + β6 and for category two it is β2 + β8. The
change in the expected value of Y for a movement from category one to category two (the
difference between the category two and category one intercepts and slopes) is

(β5 − β4) + (β8 − β6)X2it + (β9 − β7)X3it . (7)

Individual coefficients are tested in fixed-effects models as in any other model with the z
ratio (with asymptotic properties) or t ratio (finite sample properties). There could be category-
specific heteroscedasticity or autocorrelation over time. As described and demonstrated in
Module One, where students were grouped or clustered into classes, a type of White
heteroscedasticity consistent covariance estimator can be used in fixed-effects models with
ordinary least squares to obtain standard errors robust to unequal variances across the groups.
Correlation of residuals from one period to the next within a panel can also be a problem. If this serial correlation is of the first-order autoregressive type, a Prais-Winsten transformation might be considered to partially difference the data and remove the serial correlation problem. In general, because there are typically few time-series observations, it is difficult both to identify correctly the nature of the time-series error process and to address it appropriately with a least-squares estimator. vi Contemporary treatments typically rely on robust, “cluster” corrections that accommodate more general forms of correlation across time.
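For example, a minimal Stata sketch of a fixed-effects regression with such cluster-corrected standard errors (again with hypothetical names y, x2, x3, id and year, not our data set) is:

xtset id year
xtreg y x2 x3, fe vce(cluster id)   // standard errors robust to heteroscedasticity and to within-panel correlation over time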

Hypothesis tests about sets of coefficients related to the categories in fixed-effects models are conducted as tests of linear restrictions on a subset of coefficients, as described and demonstrated in Module One. For instance, as a starting point one might want to test if there is
any difference in intercepts or slopes. For our hypothetical parsimonious model the null and
alternative hypotheses are:

HO: β4 = β5 = β6 = β7 = β8 = β9 = 0 vs. (8)


HA: at least one of these six βs is not zero.

The unrestricted sum of squared residuals comes from the regression:

ŷit = b1 + b2x2it + b3x3it + b4D1 + b5D2 + b6(x2itD1) + b7(x3itD1) + b8(x2itD2) + b9(x3itD2) .        (9)

The restricted sum of squared residuals comes from the regression:

ŷit = b1 + b2x2it + b3x3it .        (10)

The relevant F statistic is

F = {[Restricted ResSS(β subset = 0) − Unrestricted ResSS] / (9 − 3)} / [Unrestricted ResSS / (ΣTi − 9)] .        (11)
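A minimal Stata sketch of this restriction test for the hypothetical three-category model (with made-up variable names y, x2, x3, d1 and d2) is given below; the F statistic that Stata reports corresponds to equation (11).

generate x2d1 = x2*d1
generate x3d1 = x3*d1
generate x2d2 = x2*d2
generate x3d2 = x3*d2
regress y x2 x3 d1 d2 x2d1 x3d1 x2d2 x3d2      // unrestricted regression, equation (9)
test d1 d2 x2d1 x3d1 x2d2 x3d2                 // joint F test of the six category terms, equation (8)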

RANDOM-EFFECTS MODELS

The random effects model, like the fixed effects model, allows the intercept term to vary across
units. The difference is an additional assumption, not made in the fixed effects case, that this
variation is independent of the values of the other variables in the model. Recall, in the fixed
effects case, we placed no restriction on the relationship between the intercepts and the other
independent variables. In essence, a random-effects data generating process is a regression with
an intercept that is subject to purely random perturbations; it is a category-specific random
variable (β1i). The realization of the random variable intercept β1i is assumed to be formed by
the overall mean plus the ith category-specific random term vi. In the case of our hypothetical,
parsimonious two explanatory variable model, the relevant random-effects equations are

Yit = β1i + β2X2it + β3X3it + εit .        (12)

β1i = α + vi with Cov[vi, (X2it, X3it)] = 0.

Inserting the second equation into the first produces the “random effects” model,

Yit = α + β2X2it + β3X3it + εit + vi . (13)

Deviations from the main intercept, α, as measured in the category specific part of the error term,
vi, must be uncorrelated with the time-varying regressors (that is, vi, is uncorrelated with X2it and
X3it) and have zero mean. Because vi does not vary with time, it is reasonable to assume its
variance is fixed given the explanatory variables. vii Thus,

E(vi| X2it, X3it) = 0 and E(vi2| X2it, X3it) = θ2 . (14)

An important difference between the fixed and random effects models is that time-invariant
regressors, such as type of school, can be accommodated in the random effects but not in the

    WEB Module Three, Part One, 4‐14‐2010: 7  
 
fixed effects model. The surprising result ultimately follows from the assumption that the
variation of the constant terms is independent of the other variables, which allows us to put the
term vi in the disturbance of the equation, rather than build it into the regression with a set of
dummy variables.
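As a minimal Stata sketch (with hypothetical names y, x2 and x3 for time-varying regressors, ztype for a time-invariant regressor such as type of school, and id and year as the panel identifiers), the random-effects estimator keeps such a regressor in the equation:

xtset id year
xtreg y x2 x3 ztype, re     // ztype would be dropped as collinear with the category dummies in the fixed-effects version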

In a random-effects model, disturbances for a given college (in our case) or whatever
entity is under study will be correlated across periods whereas in the fixed-effects model this
correlation is assumed to be absent. However, in both settings, correlation between the panel
error term effects and the explanatory variables is also likely. Where it occurs, this correlation
will reflect the effect of substantive influences on the dependent variable that have been omitted
from the equation – the classic “missing variables” problem. The now standard Mundlak (1978)
approach is a method of accommodating this correlation between the effects and means of the
regressors. The approach is motivated by the suggestion that the correlation can be explained by
the overall levels (group means) of the time variables. By this device, the effect, β1i, is projected
upon the group means of the time-varying variables, so that

β1i = β1 + δ′ xi + wi (15)

where xi is the set of group (school) means of the time-varying variables and wi is a (now)
random effect that is uncorrelated with the variables and disturbances in the model, wi~N(0, σw2).

In fact, the random effects model as described here starts from the assumption that the school effect, vi, actually is uncorrelated with the other variables. If that assumption held, the projection would be unnecessary. However, in most cases, the initial assumption of the random-effects model, that the effects and the regressors are uncorrelated, is considered quite strong. In the fixed effects case, the assumption is not made. Nevertheless, it remains a useful extension of the fixed effects model to think about the “effect,” β1i, in terms of a projection such as that suggested above, perhaps by the logic of a “hierarchical model” for the regression. That is, although the fixed effects model allows for an unrestricted effect, freely correlated with the time-varying variables, the Mundlak projection adds a layer of explanation to this effect. The Mundlak device is therefore useful in either setting. Note that after incorporating the Mundlak “correction” in the fixed effects specification, the resulting equation becomes a random effects equation.
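A minimal Stata sketch of the Mundlak device (hypothetical names y, x2, x3, id and year; the group means are simply added as regressors in a random-effects estimation) is:

xtset id year
egen x2bar = mean(x2), by(id)      // group (unit) means of the time-varying regressors
egen x3bar = mean(x3), by(id)
xtreg y x2 x3 x2bar x3bar, re vce(cluster id)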

Adding the unit means to the equations picks up the correlation between the school
effects and the other variables as well as reflecting an expected long-term steady state. Our
random effects model for BA and BS degree-granting undergraduate economics departments is

FACULTY sizeit = β1 + β2YEARt + β3BA&Sit + β4MEANBA&Si + β5PUBLICi + β6Bschl + β7MA_Degit + εit + wi        (16)
 
where the error term εit is iid over time, E(εit2|xit) = σi2 for I = 18 colleges and T = 14 years, and E[wi2] = θ2.

FIXED EFFECTS VERSUS RANDOM EFFECTS

Fixed-effects models can be estimated efficiently by ordinary least squares whereas random-
effects models are usually estimated using some type of generalized least-squares procedure.
GLS should yield more asymptotically efficient estimators if the assumptions for the random-
effects model are correct. Current practice, however, favors the fixed-effects approach for
estimating standard errors because of the likelihood that the stronger assumptions behind the
GLS estimator are likely not satisfied, implying poor finite sample properties (Angrist and
Pischke, 2009, p. 223). This creates a bit of a dilemma, because the fixed effects approach is, at
least potentially, very inefficient if the random effects assumptions are actually met. The fixed
effects approach could lead to estimation of K+1+n rather than K+2 parameters (including σ2).

Whether we treat the effects as fixed (with a constant intercept β1 and dummy category
variables) or random (with a stochastic intercept β1i) makes little difference when there are a
large number of time periods (Hsiao, 2007, p. 41). But the typical case is one for which the time series is short, with many cross-section units.

The Hausman (1978) test has become the standard approach for assessing the
appropriateness of the fixed-effects versus random-effects model. Ultimately, the question is
whether there is strong correlation between the unobserved case-specific random effects and the
explanatory variables. If this correlation is significant, the random-effects model is inappropriate
and the fixed-effects model is supported. On the other hand, insignificant correlation between
the specific random-effects errors and the regressors implies that the more efficient random-
effects coefficient estimators trump the consistent fixed-effects estimators. The correlation
cannot be assessed directly. But, indirectly, it has a testable implication for the estimators. If the
effects are correlated with the time-varying variables, then, in essence, the dummy variables will have been left out of the random effects model/estimator. The classic omitted-variable result then
implies that the random effects estimator will be biased because of this problem, but the fixed
effects estimator will not be biased because it includes the dummy variables. If the random
effects model is appropriate, the fixed effects approach will still be unbiased, though it will fail
to use the information that the extra dummy variables in the model are not needed. Thus, an
indirect test for the presence of this correlation is based on the empirical difference between the
fixed and random effects estimators.

Let βFE and βRE be the vectors of coefficients from a fixed-effects and random-effects
specification. The null hypothesis for purposes of the Hausman test is that under the random
effects assumption, estimators of both of these vectors are consistent, but the estimator for βRE is
more efficient (with a smaller asymptotic variance) than that of βFE. Hausman’s alternative
hypothesis is that the random-effects estimator is inconsistent (with coefficient distributions not
settling on the correct parameter values as sample size goes to infinity) under the hypothesis of
the fixed-effects model, but is consistent under the hypothesis of the random-effects model. The
fixed-effects estimator is consistent in both cases. The Hausman test statistic is based on the
difference between the estimated covariance matrix for least-squares dummy variable coefficient
estimates (bFE ) and that for the random-effects model:

    WEB Module Three, Part One, 4‐14‐2010: 9  
 
H = (bFE − bRE)´[Var (bFE ) − Var(bRE)]−1(bFE − bRE)

where H is asymptotically distributed as chi-square with K (the number of coefficients in b) degrees of freedom.

If the Chi-square statistic p value < 0.05, reject the Hausman null hypothesis and do not use
random effects. If the Chi-square statistic p value > 0.05, do not reject the Hausman null
hypothesis and use random effects. An intuitively appealing, and fully equivalent (and usually
more convenient) way to carry out the Hausman test is to test the null hypothesis in the context
of the random-effects model that the coefficients on the group means in the Mundlak-augmented
regression are jointly zero.
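The following is only a minimal Stata sketch of both forms of the test; the variable names (y, x2, x3, id, year) are hypothetical and are not part of our data set.

xtset id year
quietly xtreg y x2 x3, fe
estimates store fe
quietly xtreg y x2 x3, re
estimates store re
hausman fe re                           // small p-value favors the fixed-effects specification
* Mundlak-based equivalent: test the group means jointly in the augmented random-effects model
egen x2bar = mean(x2), by(id)
egen x3bar = mean(x3), by(id)
quietly xtreg y x2 x3 x2bar x3bar, re
test x2bar x3bar                        // rejecting this null likewise argues against random effects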

CONCLUDING COMMENTS

As stated in Module One, “theory is easy, data are hard – hard to find and hard to get into a
computer program for statistical analysis.” This axiom is particularly true for those wishing to
do panel data analysis on topics related to the teaching of economics, where data collected only for cross sections are the norm. As stated in Endnote One, a recent exception is Stanca
(2006), in which a large panel data set for students in Introductory Microeconomics is used to
explore the effects of attendance on performance. As with Modules One and Two, Parts Two,
Three and Four of this module provide the computer code to conduct a panel data analysis with
LIMDEP (NLOGIT), STATA and SAS, using the Becker, Greene and Siegfried (2009) data set.

REFERENCES

Angrist, Joshua D. and Jorn-Steffen Pischke (2009). Mostly Harmless Econometrics. Princeton
New Jersey: Princeton University Press.

Becker, William, William Greene and John Siegfried (Forthcoming). “Does Teaching Load
Affect Faculty Size?” American Economist.

Greene, William (2008). Econometric Analysis. 6th Edition, New Jersey: Prentice Hall.

Hausman, J. A. (1978). “Specification Tests in Econometrics,” Econometrica. Vol. 46, No. 6 (November): 1251-1271.

Hsiao, Cheng (2007). Analysis of Panel Data. 2nd Edition (8th Printing), Cambridge: Cambridge
University Press.

Johnson, William R. and Sarah Turner (2009). “Faculty Without Students: Resource Allocation
in Higher Education, ” Journal of Economic Perspectives. Vol. 23. No. 2 (Spring): 169-190.

Link, Charles R. and James G. Mulligan (1996). "The Value of Repeat Data for Individual
Students," in William E. Becker and William J. Baumol (eds), Assessing Educational
Practices: The Contribution of Economics, Cambridge MA.: MIT Press.

Marburger, Daniel R. (2006). “Does Mandatory Attendance Improve Student Performance,” Journal of Economic Education. Vol. 37. No. 2 (Spring): 148-155.

Mundlak, Yair (1978). “On the Pooling of Time Series and Cross Section Data, ” Econometrica.
Vol. 46. No. 1 (January): 69-85.

Stanca, Luca (2006). “The Effects of Attendance on Academic Performance: Panel Data
Evidence for Introductory Microeconomics,” Journal of Economic Education. Vol. 37. No. 3
(Summer): 251-266.

ENDNOTES

                                                       
i
   As seen in Stanca (2006), where a large panel data set for students in Introductory
Microeconomics is used to explore the effects of attendance on performance, panel data analysis
typically involves a dimension of time. However, panels can be set up in blocks that involve a
dimension other than time. For example, Marburger (2006) uses a panel data structure (with no
time dimension) to overcome endogeneity problems in assessing the effects of an enforced
attendance policy on absenteeism and exam performance. Each of the Q multiple-choice exam
questions was associated with specific course material that could be linked to the attendance pattern of N students, giving rise to NQ panel data records. A dummy variable for each student captured the fixed effects of unmeasured student attributes, thus eliminating any student-specific sample selection problem.
ii
Section V of Link and Mulligan (1996) outlines the advantage of panel analysis for a range of
educational issues. They show how panel data analysis can be used to isolate the effect of
individual teachers, schools or school districts on students' test scores.
iii
One of us, as a member on an external review team for a well known economics department,
was told by a high-ranking administrator that the department had received all the additional lines
it was going to get because it now had too many majors for the good of the institution.
Historically, the institution was known for turning out engineers and the economics department
was attracting too many students away from engineering. This personal experience is consistent
with Johnson and Turner’s (2009, p. 170) assessment that a substantial part of the explanation for
differences in student-faculty ratios across academic departments resides in politics or tradition
rather than economic decision-making in many institutions of higher education.
 
iv
As long as the model is static with all the explanatory variables exogenous and no lagged
dependent variables used as explanatory variables, ordinary least-squares estimators are unbiased
and consistent although not as efficient as those obtained by maximum likelihood routines.
Unfortunately, this is not true if a lagged dependent variable is introduced as a regressor (as one
might want to do if the posttest is to be explained by a pretest). The implied correlation between
the lagged dependent variable and the individual-specific effects and associated error terms biases the OLS estimators (Hsiao, 2007, pp. 73-74).
 
v
Fixed-effects models can have too many categories, requiring too many dummies, for
parameter estimation. Even if estimation is possible, there may be too few degrees of freedom
and little power for statistical tests. In addition, problems of multicollinearity arise when many
dummy variables are introduced. 
vi
Hsiao (2007, pp. 295-310) discusses panel data with a large number of time periods. When T
is large serial correlation problems become a big issue, which is well beyond the scope of this
introductory module.

                                                                                                                                                                               
vii
Random-effects models in which the intercept error term vi does not depend on time are
referred to as one-way random-effects models. Two-way random-effects models have error terms
of the form

εit = vi + εt + uit

where vi is the cross-section-specific error, affecting only observations in the ith panel; εt is the
time-specific component, which is unique to all observations for the tth period; and uit is the
random perturbation specific to the individual observation in the ith panel at time t. These two-
way random-effects models are also known as error component models and variance component
models.
 

MODULE THREE, PART TWO: PANEL DATA ANALYSIS
IN ECONOMIC EDUCATION RESEARCH USING LIMDEP (NLOGIT)

Part Two of Module Three provides a cookbook-type demonstration of the steps required to use
LIMDEP (NLOGIT) in panel data analysis. Users of this module need to have completed Module
One, Parts One and Two, and Module Three, Part One. That is, from Module One users are
assumed to know how to get data into LIMDEP, recode and create variables within LIMDEP,
and run and interpret regression results. They are also expected to know how to test linear
restrictions on sets of coefficients as done in Module One, Parts One and Two. Module Three,
Parts Three and Four demonstrate in STATA and SAS what is done here in LIMDEP.

THE CASE

As described in Module Three, Part One, Becker, Greene and Siegfried (2009) examine the
extent to which undergraduate degrees (BA and BS) in economics or Ph.D. degrees (PhD) in
economics drive faculty size at those U.S. institutions that offer only a bachelor degree and those
that offer both bachelor degrees and PhDs. Here we retrace their analysis for the institutions
that offer only the bachelor degree. We provide and demonstrate the LIMDEP (NLOGIT) code
necessary to duplicate their results.

DATA FILE

The following panel data are provided in the comma separated values (CSV) text file
“bachelors.csv”, which will automatically open in EXCEL by simply double clicking on it after
it has been downloaded to your hard drive. Your EXCEL spreadsheet should look like this:

“College” identifies the bachelor degree-granting institution by a number 1 through 18.

“Year” runs from 1991 through 2006.

“Degrees” is the number of BS or BA degrees awarded in each year by each college.

“DegreBar” is the average number of degrees awarded by each college for the 16-year period.

“Public” equals 1 if the institution is a public college and 2 if it is a private college.

“Faculty” is the number of tenured or tenure-track economics department faculty members.

“Bschol” equals 1 if the college has a business program and 0 if not.

“T” is the time trend running from −7 to 8, corresponding to years from 1991 through 2006.

“MA_Deg” is a three-year moving average of degrees (unknown for the first two years).

College  Year  Degrees  DegreBar Public  Faculty  Bschol  T  MA_Deg
1  1991  50  47.375  2  11  1  ‐7  0 
1  1992  32  47.375  2  8  1  ‐6  0 
1  1993  31  47.375  2  10  1  ‐5  37.667 
1  1994  35  47.375  2  9  1  ‐4  32.667 
                 
↓  ↓  ↓  ↓  ↓  ↓  ↓    ↓ 
                 
1  2003  57  47.375  2  7  1  5  56 
1  2004  57  47.375  2  10  1  6  55.667 
1  2005  57  47.375  2  10  1  7  57 
1  2006  51  47.375  2  10  1  8  55 
2  1991  16  8.125  2  3  1  ‐7  0 
2  1992  14  8.125  2  3  1  ‐6  0 
2  1993  10  8.125  2  3  1  ‐5  13.333 
                 
↓  ↓  ↓  ↓  ↓  ↓  ↓    ↓ 
                 
2  2004  10  8.125  2  3  1  6  12.667 
2  2005  7  8.125  2  3  1  7  11.333 
2  2006  6  8.125  2  3  1  8  7.667 
3  1991  40  35.5  2  8  1  ‐7  0 
3  1992  31  37.125  2  8  1  ‐6  0 
                 
↓  ↓  ↓  ↓  ↓  ↓  ↓    ↓ 
                 
17  2004  64  39.3125  2  5  0  6  54.667 
17  2005  37  39.3125  2  4  0  7  51.333 
17  2006  53  39.3125  2  4  0  8  51.333 
18  1991  14  8.4375  2  4  0  ‐7  0 
18  1992  10  8.4375  2  4  0  ‐6  0 
18  1993  10  8.4375  2  4  0  ‐5  11.333 
18  1994  7  8.4375  2  3.5  0  ‐4  9 
                 
↓  ↓  ↓  ↓  ↓  ↓  ↓    ↓ 
                 
18  2005  4  8.4375  2  2.5  0  7  7.333 
18  2006  7  8.4375  2  3  0  8  6 

If you opened this CSV file in a word processor or text editing program, it would show
that each of the 289 lines (including the headers) corresponds to a row in the EXCEL table, but
variable values would be separated by commas and not appear neatly one on top of the other as
in EXCEL.

As discussed in Module One, Part Two, older versions of LIMDEP (NLOGIT) have a
data matrix default restriction of no more than 222 rows (records per variable), 900 columns
(number of variables) and 200,000 cells. LIMDEP 9 and NLOGIT 4.0 automatically adjust the
data constraints but in older versions the number of cells must be increased to accommodate
work with our data set. After opening LIMDEP, the number of working cells can be increased
by clicking the Project button on the top ribbon, going to Settings, and changing the number of
cells. Going from the default 200,000 cells to 900,000 cells (1,000 Rows and 900 columns) is
more than sufficient for this panel data set.

We could write a “READ” command to bring this text data file into LIMDEP but like
EXCEL it can be imported into LIMDEP directly by clicking the Project button on the top
ribbon, going to Import, and then clicking on Variables, from which the bachelors.csv file can be
located wherever it is stored (in our case in the “Greene programs 2” folder). Hitting the Open
button will bring the data set into LIMDEP, which can be checked by clicking the “Activate Data
Editor” button, which is second from the right on the tool bar or go to Data Editor in the
Window’s menu, as described and demonstrated in Module One, Part Two.

In addition to a visual inspection of the data via the “Activate Data Editor,” we use the “dstat”
command to check the descriptive statistics. First, however, we need to remove the two years
(1991 and 1992) for which no data are available for the degree moving average measure. This is
done with the “Reject” command. In our “File:New Text/Command Document” (which was
described in Module One, Part Two), we have

 
reject ; year < 1993 $ 
dstat;rhs=*$ 
 

which upon highlighting and pressing “Go” yields

--> reject ; year < 1993 $
--> dstat;rhs=*$
Descriptive Statistics
All results based on nonmissing observations.
==============================================================================
Variable Mean Std.Dev. Minimum Maximum Cases Missing
==============================================================================
All observations in current sample
--------+---------------------------------------------------------------------
COLLEGE | 9.50000 5.19845 1.00000 18.0000 252 0
YEAR | 1999.50 4.03915 1993.00 2006.00 252 0
DEGREES | 23.1111 19.2264 .000000 81.0000 252 0
DEGREBAR| 23.6528 18.0143 2.00000 62.4375 252 0
PUBLIC | 1.77778 .416567 1.00000 2.00000 252 0
FACULTY | 6.51786 3.13677 2.00000 14.0000 252 0
BSCHOOL | .388889 .488468 .000000 1.00000 252 0
T | 1.50000 4.03915 -5.00000 8.00000 252 0
MA_DEG | 23.1931 18.5540 1.33333 80.0000 252 0

CONSTANT COEFFICIENT REGRESSION

The constant coefficient panel data model for the faculty size data-generating process for
bachelor degree-granting undergraduate departments is given by

Faculty sizeit = β1 + β2Tt + β3BA&Sit + β4MEANBA&Si + β5PUBLICi


+ β6Bschl + β7MA_Degit + εit

where the error term εit is independent and identically distributed (iid) across institutions and
over time and E(εit2|xit) = σ2 , for I = 18 colleges and T = 14 years (−5 through 8) for 252
complete records. The LIMDEP OLS regression command that needs to be entered into the
command document (again, following the procedure for opening the command document
window shown in Module One, Part Two), including the standard error adjustment for clustering
is

reject ; year < 1993 $


regress
;lhs=faculty;rhs=one,t,degrees,degrebar,public,bschool,MA_deg
;cluster=14$

Upon highlighting and hitting the “Go” button the Output file shows the following results

--> reject ; year < 1993 $
--> regress;lhs=faculty;rhs=one,t,degrees,degrebar,public,bschool,MA_deg
;cluster=14$
+----------------------------------------------------+
| Ordinary least squares regression |
| LHS=FACULTY Mean = 6.517857 |
| Standard deviation = 3.136769 |
| Number of observs. = 252 |
| Model size Parameters = 7 |
| Degrees of freedom = 245 |
| Residuals Sum of squares = 868.4410 |
| Standard error of e = 1.882726 |
| Fit R-squared = .6483574 |
| Adjusted R-squared = .6397458 |
| Model test F[ 6, 245] (prob) = 75.29 (.0000) |
| Diagnostic Log likelihood = -513.4686 |
| Restricted(b=0) = -645.1562 |
| Chi-sq [ 6] (prob) = 263.38 (.0000) |
| Info criter. LogAmemiya Prd. Crt. = 1.292840 |
| Akaike Info. Criter. = 1.292826 |
| Bayes Info. Criter. = 1.390866 |
| Autocorrel Durbin-Watson Stat. = .3295926 |
| Rho = cor[e,e(-1)] = .8352037 |
| Model was estimated Jul 16, 2009 at 04:21:28PM |
+----------------------------------------------------+
+---------------------------------------------------------------------+
| Covariance matrix for the model is adjusted for data clustering. |
| Sample of 252 observations contained 18 clusters defined by |
| 14 observations (fixed number) in each cluster. |
| Sample of 252 observations contained 1 strata defined by |
| 252 observations (fixed number) in each stratum. |
+---------------------------------------------------------------------+
+--------+--------------+----------------+--------+--------+----------+
|Variable| Coefficient | Standard Error |t-ratio |P[|T|>t]| Mean of X|
+--------+--------------+----------------+--------+--------+----------+
|Constant| 10.1397*** .91063 11.135 .0000 |
|T | -.02809 .02227 -1.261 .2083 1.50000|
|DEGREES | -.01636 .01866 -.877 .3814 23.1111|
|DEGREBAR| .10832*** .03378 3.206 .0015 23.6528|
|PUBLIC | -3.86239*** .56950 -6.782 .0000 1.77778|
|BSCHOOL | .58112 .94253 .617 .5381 .38889|
|MA_DEG | .03780** .01810 2.089 .0377 23.1931|
+--------+------------------------------------------------------------+
| Note: ***, **, * = Significance at 1%, 5%, 10% level. |
+---------------------------------------------------------------------+

Contemporaneous degrees have little to do with current faculty size but both overall number of
degrees awarded (the school means) and the moving average of degrees (MA_DEG) have
significant effects. It takes an increase of 26 or 27 bachelor degrees in the moving average to
expect just one more faculty position. Whether it is a public or a private college is highly
significant. Moving from a public to a private college lowers predicted faculty size by nearly
four members for otherwise comparable institutions. There is an insignificant erosion of tenured
and tenure-track faculty size over time. Finally, while economics departments in colleges with a
business school tend to have a larger permanent faculty, ceteris paribus, the effect is small and
insignificant.

FIXED-EFFECTS REGRESSION

To estimate the fixed-effects model we can either insert seventeen (0,1) covariates to capture the unique effect of each of the 18 colleges (where each of the 17 dummy coefficients is measured relative to the constant term) or insert 18 dummy variables with no overall constant term in the OLS regression. The results for the other coefficients and for R2 will be identical either way, while the constant terms, since they measure the difference of each college from the 18th in the first case, or the difference of all 18 from zero in the second, will differ. This difference is inconsequential for the regression of interest. Which way the model is estimated is purely a matter of convenience and preference.

An important implication of the fixed effects specification is that no time invariant


variables can be included in the equation because they would be perfectly correlated with the
respective college dummies. Thus, the overall school mean number of degrees, the public or
private dummy variables, and business school dummy variables must all be excluded from the
fixed effects model.

A LIMDEP (NLOGIT) program to be run from the Text/Command Document, including


the commands to create the dummy variables then run the regression is shown below. (An
alternative, more compact way to create the dummies and run the regression is shown in the
Appendix.)

reject ; year < 1993 $

create
;Col1=college=1
;Col2=college=2
;Col3=college=3
;Col4=college=4
;Col5=college=5
;Col6=college=6$
create
;Col7=college=7
;Col8=college=8
;Col9=college=9
;Col10=college=10
;Col11=college=11
;Col12=college=12$
create
;Col13=college=13
;Col14=college=14
;Col15=college=15
;Col16=college=16
;Col17=college=17
;Col18=college=18$

regress;lhs=faculty;rhs=one,t,degrees,MA_deg,
Col1,Col2,Col3,Col4,Col5,Col6,Col7,Col8,Col9,
Col10,Col11,Col12,Col13,Col14,Col15,Col16,Col17; cluster=14$

The resulting regression information appearing in the output window is

+---------------------------------------------------------------------+
| Covariance matrix for the model is adjusted for data clustering. |
| Sample of 252 observations contained 18 clusters defined by |
| 14 observations (fixed number) in each cluster. |
+---------------------------------------------------------------------+

----------------------------------------------------------------------
Ordinary least squares regression ............
LHS=FACULTY Mean = 6.51786
Standard deviation = 3.13677
Number of observs. = 252
Model size Parameters = 21
Degrees of freedom = 231
Residuals Sum of squares = 146.63709
Standard error of e = .79674
Fit R-squared = .94062
Adjusted R-squared = .93548
Model test F[ 20, 231] (prob) = 183.0(.0000)
Diagnostic Log likelihood = -289.34751
Restricted(b=0) = -645.15625
Chi-sq [ 20] (prob) = 711.6(.0000)
Info criter. LogAmemiya Prd. Crt. = -.37441
Akaike Info. Criter. = -.37480
Bayes Info. Criter. = -.08068
Model was estimated on Sep 23, 2009 at 06:44:38 PM
--------+-------------------------------------------------------------
Variable| Coefficient Standard Error t-ratio P[|T|>t] Mean of X
--------+-------------------------------------------------------------
Constant| 2.69636*** .15109 17.846 .0000
T| -.02853 .02245 -1.271 .2051 1.50000
DEGREES| -.01608 .01521 -1.058 .2913 23.1111
MA_DEG| .03985*** .01485 2.683 .0078 23.1931
COL1| 5.77747*** .76816 7.521 .0000 .05556
COL2| .15299*** .01343 11.392 .0000 .05556
COL3| 4.29759*** .55420 7.755 .0000 .05556
COL4| 6.28973*** .65533 9.598 .0000 .05556
COL5| 4.91094*** .56987 8.618 .0000 .05556
COL6| 5.02016*** .02561 196.041 .0000 .05556
COL7| 1.21384*** .01321 91.876 .0000 .05556
COL8| .77797*** .06785 11.466 .0000 .05556
COL9| 3.16474*** .06270 50.478 .0000 .05556
COL10| 2.86345*** .15540 18.427 .0000 .05556
COL11| 5.15181*** .02403 214.385 .0000 .05556
COL12| -.06802*** .02153 -3.160 .0018 .05556
COL13| 3.98895*** 1.01415 3.933 .0001 .05556
COL14| -.63196*** .11986 -5.272 .0000 .05556
COL15| 8.25859*** .47255 17.477 .0000 .05556
COL16| 8.00970*** .55461 14.442 .0000 .05556
COL17| .43544 .59258 .735 .4632 .05556
--------+-------------------------------------------------------------
Note: ***, **, * = Significance at 1%, 5%, 10% level.
----------------------------------------------------------------------

Once again, contemporaneous degrees are not a driving force in faculty size. An F test is not needed to assess whether at least one of the 17 colleges differs from college 18. With the exception of college 17, each of the other colleges is significantly different. The moving average of degrees is again significant.

The preceding approach, of computing all the dummy variables and building them into
the regression, is likely to become unduly cumbersome if the number of colleges (units) is very
large. Most contemporary software, including LIMDEP, will do this computation automatically
without explicitly computing the dummy variables and including them in the equation. As an
alternative to specifying all the dummies in the regression command, the same results can be
obtained with the simpler "FixedEffects" command:

regress;lhs=faculty;rhs=one,t,degrees,MA_deg
;Panel;Str=College
;FixedEffects;Robust$

----------------------------------------------------------------------
Least Squares with Group Dummy Variables..........
Ordinary least squares regression ............
LHS=FACULTY Mean = 6.51786
Standard deviation = 3.13677
Number of observs. = 252
Model size Parameters = 21
Degrees of freedom = 231
Residuals Sum of squares = 146.63709
Standard error of e = .79674
Fit R-squared = .94062
Adjusted R-squared = .93548
Model test F[ 20, 231] (prob) = 183.0(.0000)
Diagnostic Log likelihood = -289.34751
Restricted(b=0) = -645.15625
Chi-sq [ 20] (prob) = 711.6(.0000)
Info criter. LogAmemiya Prd. Crt. = -.37441
Akaike Info. Criter. = -.37480
Bayes Info. Criter. = -.08068
Model was estimated on Sep 23, 2009 at 06:44:38 PM
Estd. Autocorrelation of e(i,t) = .293724
Robust cluster corrected covariance matrix used
Panel:Groups Empty 0, Valid data 18
Smallest 14, Largest 14
Average group size in panel 14.00
--------+-------------------------------------------------------------
Variable| Coefficient Standard Error t-ratio P[|T|>t] Mean of X
--------+-------------------------------------------------------------
T| -.02853 .02245 -1.271 .2050 1.50000
DEGREES| -.01608 .01521 -1.058 .2912 23.1111
MA_DEG| .03985*** .01485 2.683 .0078 23.1931
--------+-------------------------------------------------------------
Note: ***, **, * = Significance at 1%, 5%, 10% level.
----------------------------------------------------------------------

RANDOM-EFFECTS REGRESSION

Finally, consider the random-effects model in which we employ Mundlak’s (1978) approach to
estimating panel data. The Mundlak model posits that the fixed effects in the equation, β1i , can
be projected upon the group means of the time-varying variables, so that

β1i = β1 + δ′ xi + wi

where xi is the set of group (school) means of the time-varying variables and wi is a (now)
random effect that is uncorrelated with the variables and disturbances in the model. Logically,
adding the means to the equations picks up the correlation between the school effects and the
other variables. We could not incorporate the mean number of degrees awarded in the fixed-
effects model (because it was time invariant) but this variable plays a critical role in the Mundlak
approach to panel data modeling and estimation.

The random effects model for BA and BS degree-granting undergraduate departments is


 
  FACULTY sizeit = β1 + β2Tt + β3BA&Sit + β4MEANBA&Si + β5MOVAVBA&BS
+ β6PUBLICi + β7Bschl + εit + ui

where error term ε is iid over time, E(εit2|xit) = σ2 for I = 18 and Ti = 14 and E[ui2] = θ2 for I =
18. The LIMDEP program to be run from the Text/Command Document (with 1991 and 1992
data suppressed) is

regress
;lhs=faculty
;rhs=one,t,degrees,degrebar,public,bschool,MA_deg
;pds=14
;panel
;random
;robust$

The resulting regression information appearing in the output window is

--> regress
;lhs=faculty;rhs=one,t,degrees,degrebar,public,bschool,MA_deg
;pds=14;panel;random;robust$
----------------------------------------------------------------------
OLS Without Group Dummy Variables.................
Ordinary least squares regression ............
LHS=FACULTY Mean = 6.51786
Standard deviation = 3.13677
Number of observs. = 252
Model size Parameters = 7
Degrees of freedom = 245
Residuals Sum of squares = 868.44104
Standard error of e = 1.88273
Fit R-squared = .64836
Adjusted R-squared = .63975
Model test F[ 6, 245] (prob) = 75.3(.0000)
Diagnostic Log likelihood = -513.46861
Restricted(b=0) = -645.15625
Chi-sq [ 6] (prob) = 263.4(.0000)
Info criter. LogAmemiya Prd. Crt. = 1.29284
Akaike Info. Criter. = 1.29283
Bayes Info. Criter. = 1.39087
Model was estimated on Sep 23, 2009 at 07:17:22 PM
Panel Data Analysis of FACULTY [ONE way]
Unconditional ANOVA (No regressors)
Source Variation Deg. Free. Mean Square
Between 2312.22321 17. 136.01313
Residual 157.44643 234. .67285
Total 2469.66964 251. 9.83932
--------+-------------------------------------------------------------
Variable| Coefficient Standard Error t-ratio P[|T|>t] Mean of X
--------+-------------------------------------------------------------
T| -.02809 .03030 -.927 .3549 1.50000
DEGREES| -.01636 .02334 -.701 .4839 23.1111
DEGREBAR| .10832*** .02047 5.293 .0000 23.6528
PUBLIC| -3.86239*** .29652 -13.026 .0000 1.77778
BSCHOOL| .58112** .25115 2.314 .0215 .38889
MA_DEG| .03780 .02907 1.300 .1947 23.1931
Constant| 10.1397*** .52427 19.341 .0000
--------+-------------------------------------------------------------
Note: ***, **, * = Significance at 1%, 5%, 10% level.
----------------------------------------------------------------------

+----------------------------------------------------+
| Panel:Groups Empty 0, Valid data 18 |
| Smallest 14, Largest 14 |
| Average group size 14.00 |
| There are 3 vars. with no within group variation. |
| DEGREBAR PUBLIC BSCHOOL |
+----------------------------------------------------+

----------------------------------------------------------------------
Random Effects Model: v(i,t) = e(i,t) + u(i)
Estimates: Var[e] = .643145
Var[u] = 2.901512
Corr[v(i,t),v(i,s)] = .818559
Lagrange Multiplier Test vs. Model (3) =1096.30
( 1 degrees of freedom, prob. value = .000000)
(High values of LM favor FEM/REM over CR model)
Baltagi-Li form of LM Statistic = 1096.30
Sum of Squares 868.488173
R-squared .648338
Robust cluster corrected covariance matrix used
--------+-------------------------------------------------------------
Variable| Coefficient Standard Error b/St.Er. P[|Z|>z] Mean of X
--------+-------------------------------------------------------------
T| -.02853 .02146 -1.329 .1838 1.50000
DEGREES| -.01609 .01793 -.897 .3696 23.1111
DEGREBAR| .10610*** .03228 3.287 .0010 23.6528
PUBLIC| -3.86365*** .54685 -7.065 .0000 1.77778
BSCHOOL| .58176 .90497 .643 .5203 .38889
MA_DEG| .03981** .01728 2.305 .0212 23.1931
Constant| 10.1419*** .87456 11.597 .0000
--------+-------------------------------------------------------------
Note: ***, **, * = Significance at 1%, 5%, 10% level.
----------------------------------------------------------------------

The marginal effect of an additional economics major is again insignificant but slightly negative
within the sample. Both the short-term moving average number and long-term average number
of bachelor degrees are significant. A long-term increase of about 10 students earning degrees in
economics is required to predict that one more tenured or tenure-track faculty member is in a
department. Ceteris paribus, economics departments at private institutions are smaller than
comparable departments at public schools by a large and statistically significant margin of about four members.
Whether there is a business school present is insignificant. There is no meaningful trend in
faculty size.

CONCLUDING REMARKS

The goal of this hands-on component of this third of four modules is to enable economic
education researchers to make use of panel data for the estimation of constant coefficient, fixed-
effects and random-effects panel data models in LIMDEP (NLOGIT). It was not intended to
explain all of the statistical and econometric nuances associated with panel data analysis. For
this an intermediate level econometrics textbook (such as Jeffrey Wooldridge, Introductory
Econometrics) or advanced econometrics textbook (such as William Greene, Econometric
Analysis) should be consulted.

APPENDIX:

ALTERNATIVES FOR CREATING COLLEGE DUMMY VARIABLES AND RUNNING


REGRESSIONS IN LIMDEP (NLOGIT)

There are two alternative ways to create the college dummy variables for use in the regression.

First is the “create ; expand(college)$” command, where COLLEGE is expanded as


_COLLEG_, with the following resulting output:
 
COLLEGE was expanded as _COLLEG_.
Largest value = 18. 18 New variables were created.
Category
1 New variable = COLLEG01 Frequency= 14
2 New variable = COLLEG02 Frequency= 14
3 New variable = COLLEG03 Frequency= 14
4 New variable = COLLEG04 Frequency= 14
5 New variable = COLLEG05 Frequency= 14
6 New variable = COLLEG06 Frequency= 14
7 New variable = COLLEG07 Frequency= 14
8 New variable = COLLEG08 Frequency= 14
9 New variable = COLLEG09 Frequency= 14
10 New variable = COLLEG10 Frequency= 14
11 New variable = COLLEG11 Frequency= 14
12 New variable = COLLEG12 Frequency= 14
13 New variable = COLLEG13 Frequency= 14
14 New variable = COLLEG14 Frequency= 14
15 New variable = COLLEG15 Frequency= 14
16 New variable = COLLEG16 Frequency= 14
17 New variable = COLLEG17 Frequency= 14
18 New variable = COLLEG18 Frequency= 14
Note, this is a complete set of dummy variables. If
you use this set in a regression, drop the constant.
 
The second method for creating dummies and running the regression is even more condensed; it is a feature not yet documented in the LIMDEP (NLOGIT) manual:
regress;lhs=faculty;rhs=one,t,degrees,MA_deg,expand(college)
;cluster=college$
 
which produces the same results as the fixed effects regression command we used at the
beginning of this discussion.

REFERENCES

Becker, William, William Greene and John Siegfried (2009). “Does Teaching Load Affect
Faculty Size? ” Working Paper (July).

Greene, William (2008). Econometric Analysis. 6th Edition, New Jersey: Prentice Hall.

Mundlak, Yair (1978). "On the Pooling of Time Series and Cross Section Data," Econometrica.
Vol. 46. No. 1 (January): 69-85.

Wooldridge, Jeffrey (2009). Introductory Econometrics. 4th Edition, Mason OH: South-
Western.

MODULE THREE, PART THREE: PANEL DATA ANALYSIS
IN ECONOMIC EDUCATION RESEARCH USING STATA

Part Three of Module Three provides a cookbook-type demonstration of the steps required to use
STATA in panel data analysis. Users of this module need to have completed Module One, Parts
One and Three, and Module Three, Part One. That is, from Module One users are assumed to
know how to get data into STATA, recode and create variables within STATA, and run and
interpret regression results. They are also expected to know how to test linear restrictions on
sets of coefficients as done in Module One, Parts One and Three. Module Three, Parts Two and
Four demonstrate in LIMDEP and SAS what is done here in STATA.

THE CASE

As described in Module Three, Part One, Becker, Greene and Siegfried (2009) examine the
extent to which undergraduate degrees (BA and BS) in economics or Ph.D. degrees (PhD) in
economics drive faculty size at those U.S. institutions that offer only a bachelor degree and those
that offer both bachelor degrees and PhDs. Here we retrace their analysis for the institutions
that offer only the bachelor degree. We provide and demonstrate the STATA code necessary to
duplicate their results.

DATA FILE

The following panel data are provided in the comma separated values (CSV) text file
“bachelors.csv”, which will automatically open in EXCEL by simply double clicking on it after
it has been downloaded to your hard drive. Your EXCEL spreadsheet should look like this:

“College” identifies the bachelor degree-granting institution by a number 1 through 18.

“Year” runs from 1991 through 2006.

“Degrees” is the number of BS or BA degrees awarded in each year by each college.

“DegreBar” is the average number of degrees awarded by each college for the 16-year period.

“Public” equals 1 if the institution is a public college and 2 if it is a private college.

“Faculty” is the number of tenured or tenure-track economics department faculty members.

“Bschol” equals 1 if the college has a business program and 0 if not.

“T” is the time trend running from −7 to 8, corresponding to years from 1991 through 2006.

“MA_Deg” is a three-year moving average of degrees (unknown for the first two years).

College  Year  Degrees  DegreBar Public  Faculty  Bschol  T  MA_Deg
1  1991  50  47.375  2  11  1  ‐7  0 
1  1992  32  47.375  2  8  1  ‐6  0 
1  1993  31  47.375  2  10  1  ‐5  37.667 
1  1994  35  47.375  2  9  1  ‐4  32.667 
                 
↓  ↓  ↓  ↓  ↓  ↓  ↓    ↓ 
                 
1  2003  57  47.375  2  7  1  5  56 
1  2004  57  47.375  2  10  1  6  55.667 
1  2005  57  47.375  2  10  1  7  57 
1  2006  51  47.375  2  10  1  8  55 
2  1991  16  8.125  2  3  1  ‐7  0 
2  1992  14  8.125  2  3  1  ‐6  0 
2  1993  10  8.125  2  3  1  ‐5  13.333 
                 
↓  ↓  ↓  ↓  ↓  ↓  ↓    ↓ 
                 
2  2004  10  8.125  2  3  1  6  12.667 
2  2005  7  8.125  2  3  1  7  11.333 
2  2006  6  8.125  2  3  1  8  7.667 
3  1991  40  35.5  2  8  1  ‐7  0 
3  1992  31  37.125  2  8  1  ‐6  0 
                 
↓  ↓  ↓  ↓  ↓  ↓  ↓    ↓ 
                 
17  2004  64  39.3125  2  5  0  6  54.667 
17  2005  37  39.3125  2  4  0  7  51.333 
17  2006  53  39.3125  2  4  0  8  51.333 
18  1991  14  8.4375  2  4  0  ‐7  0 
18  1992  10  8.4375  2  4  0  ‐6  0 
18  1993  10  8.4375  2  4  0  ‐5  11.333 
18  1994  7  8.4375  2  3.5  0  ‐4  9 
                 
↓  ↓  ↓  ↓  ↓  ↓  ↓    ↓ 
                 
18  2005  4  8.4375  2  2.5  0  7  7.333 
18  2006  7  8.4375  2  3  0  8  6 

If you opened this CSV file in a word processor or text editing program, it would show
that each of the 289 lines (including the headers) corresponds to a row in the EXCEL table, but
variable values would be separated by commas and not appear neatly one on top of the other as
in EXCEL.

As discussed in Module One, Part Three, you can read the CSV file into STATA by
typing the following command into the command window and pressing enter:

insheet using “E:\NCEE (Becker)\bachelors.csv”, comma 

In this case, the “bachelors.csv” file is saved in the folder “E:\NCEE (Becker)” but this will vary by
user. For these data, the default memory allocated by STATA should be sufficient. After
entering the above command in the command window and pressing enter, you should see the
following screen:

STATA indicates that the data consist of 9 variables and 288 observations. In addition to
a visual inspection of the data via the “browse” command, you can use the "summarize"
command to check the descriptive statistics. First, however, we need to remove the two years
(1991 and 1992) for which no data are available for the degree moving average measure. This is
done with the "drop if" command. In the command window, type:
 
drop if year < 1993 
summarize 

which upon pressing enter yields the following summary statistics:

Variable | Obs Mean Std. Dev. Min Max


-------------+--------------------------------------------------------
college | 252 9.5 5.198452 1 18
year | 252 1999.5 4.039151 1993 2006
degrees | 252 23.11111 19.22636 0 81
degrebar | 252 23.65278 18.01427 2 62.4375
public | 252 1.777778 .4165671 1 2
-------------+--------------------------------------------------------
faculty | 252 6.517857 3.136769 2 14
bschool | 252 .3888889 .4884682 0 1
t | 252 1.5 4.039151 -5 8
ma_deg | 252 23.19312 18.55398 1.333333 80

By default, STATA essentially considers all data as cross-sectional. Since we are


working with panel data in this case, we need to indicate to STATA that there is a time-series
component to our dataset. This is done with the “tsset” command. The general syntax for the
“tsset” command with panel data is:

tsset “panel variable” “time variable” 

In this case, our panel variable is college and our time variable is year, so the relevant command
is:

tsset college year 

After typing the above command into STATA’s command window and pressing enter, you
should see the following screen:

This indicates that STATA recognizes a strongly balanced panel (i.e., the same number of years
for each college) with observations for each panel from 1993 through 2006. Note that we could
also use the variable “t” as our time variable.

In general, we must “tsset” the data before we can utilize any of STATA’s time-series or
panel data commands (for example, the “xtreg” command presented below). Our time variable
should also be appropriately spaced. For example, if we have yearly data, but our time variable
was recorded in a daily format (e.g., 1/1/1999, 1/1/2000, 1/1/2002, etc.), we would want to
reformat this variable as a yearly variable rather than daily. Correctly formatting the time
variable is important to ensure the various time-series commands in STATA work properly. For
more detail on formats and other options for the “tsset” command type “help tsset” into
STATA’s command window.
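For instance, a minimal sketch of that reformatting, using a hypothetical daily date variable named datevar that is not in bachelors.csv, might be:

generate yearvar = year(datevar)   // extract the calendar year from a Stata daily date
tsset college yearvar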

CONSTANT COEFFICIENT REGRESSION

The constant coefficient panel data model for the faculty size data-generating process for
bachelor degree-granting undergraduate departments is given by

Faculty sizeit = β1 + β2Tt + β3BA&Sit + β4MEANBA&Si + β5PUBLICi


+ β6Bschl + β7MA_Degit + εit

where the error term εit is independent and identically distributed (iid) across institutions and
over time and E(εit2|xit) = σ2 , for I = 18 colleges and T = 14 years (−5 through 8) for 252
complete records. The STATA OLS regression command that needs to be entered into the
command window, including the standard error adjustment for clustering is

regress faculty t degrees degrebar public bschool ma_deg, cluster(college)

After typing the above command into the command window and pressing enter, the output
window shows the following results:

. regress faculty t degrees degrebar public bschool ma_deg, cluster(college)

Linear regression Number of obs = 252


F( 6, 17) = 27.70
Prob > F = 0.0000
R-squared = 0.6484
Number of clusters (college) = 18 Root MSE = 1.8827

------------------------------------------------------------------------------
| Robust
faculty | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
t | -.0280875 .0222654 -1.26 0.224 -.0750634 .0188885
degrees | -.0163611 .0186579 -0.88 0.393 -.0557259 .0230037
degrebar | .1083201 .0337821 3.21 0.005 .0370461 .1795942
public | -3.862393 .5694961 -6.78 0.000 -5.063925 -2.660862
bschool | .5811154 .9425269 0.62 0.546 -1.407443 2.569673
ma_deg | .0378038 .0180966 2.09 0.052 -.0003767 .0759842
_cons | 10.13974 .9106264 11.13 0.000 8.218486 12.06099
------------------------------------------------------------------------------

Contemporaneous degrees have little to do with current faculty size but both overall number of
degrees awarded (the school means) and the moving average of degrees (MA_DEG) have
significant effects. It takes an increase of 26 or 27 bachelor degrees in the moving average to
expect just one more faculty position. Whether it is a public or a private college is highly
significant. Moving from a public to a private college lowers predicted faculty size by nearly
four members for otherwise comparable institutions. There is an insignificant erosion of tenured
and tenure-track faculty size over time. Finally, while economics departments in colleges with a
business school tend to have a larger permanent faculty, ceteris paribus, the effect is small and
insignificant.
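As a quick check on the "26 or 27 degrees" figure, the reciprocal of the estimated MA_DEG coefficient can be displayed from the stored results; the line below is our added illustration, not part of the original code, and assumes it is typed immediately after the regression above.

display 1/_b[ma_deg]    // roughly 26.5 additional degrees in the moving average per additional faculty member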

FIXED-EFFECTS REGRESSION

The fixed-effects model requires either the insertion of 17 (0,1) covariates to capture the unique
effect of each of the 18 colleges (where each of the 17 dummy coefficients are measured relative
to the constant term) or the insertion of 18 dummy variables with no constant term in the OLS
regression. In addition, no time invariant variables can be included because they would be
perfectly correlated with the respective college dummies. Thus, the overall mean number of

degrees, the public or private dummy, and business school dummy cannot be included as
regressors.

The STATA code, including the commands to create the dummy variables, is (two
additional ways to estimate fixed-effects models in STATA are presented in the Appendix):

gen Col1=(college==1)
gen Col2=(college==2)
gen Col3=(college==3)
gen Col4=(college==4)
gen Col5=(college==5)
gen Col6=(college==6)
gen Col7=(college==7)
gen Col8=(college==8)
gen Col9=(college==9)
gen Col10=(college==10)
gen Col11=(college==11)
gen Col12=(college==12)
gen Col13=(college==13)
gen Col14=(college==14)
gen Col15=(college==15)
gen Col16=(college==16)
gen Col17=(college==17)
gen Col18=(college==18)

regress faculty t degrees ma_deg Col1-Col17, cluster(college) 

The resulting regression information appearing in the output window is:

Linear regression Number of obs = 252
F( 2, 17) = .
Prob > F = .
R-squared = 0.9406
Number of clusters (college) = 18 Root MSE = .79674

------------------------------------------------------------------------------
| Robust
faculty | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
t | -.0285342 .022453 -1.27 0.221 -.0759059 .0188374
degrees | -.0160847 .0152071 -1.06 0.305 -.0481689 .0159995
ma_deg | .039847 .0148528 2.68 0.016 .0085103 .0711837
Col1 | 5.777467 .7681565 7.52 0.000 4.156799 7.398136
Col2 | .1529889 .0134293 11.39 0.000 .1246555 .1813222
Col3 | 4.297591 .5541956 7.75 0.000 3.128341 5.466842
Col4 | 6.289728 .6553347 9.60 0.000 4.907093 7.672363
Col5 | 4.910941 .5698701 8.62 0.000 3.708621 6.113262
Col6 | 5.020157 .0256077 196.04 0.000 4.966129 5.074185
Col7 | 1.213842 .0132117 91.88 0.000 1.185967 1.241716
Col8 | .7779701 .0678475 11.47 0.000 .6348244 .9211157
Col9 | 3.164737 .0626958 50.48 0.000 3.03246 3.297013
Col10 | 2.863453 .1553986 18.43 0.000 2.53559 3.191315
Col11 | 5.151815 .0240307 214.39 0.000 5.101115 5.202515
Col12 | -.0680152 .0215257 -3.16 0.006 -.1134304 -.0226
Col13 | 3.988947 1.014148 3.93 0.001 1.849282 6.128611
Col14 | -.631956 .1198635 -5.27 0.000 -.8848458 -.3790662
Col15 | 8.258587 .4725524 17.48 0.000 7.261588 9.255585
Col16 | 8.009696 .5546092 14.44 0.000 6.839573 9.179819
Col17 | .4354377 .5925837 0.73 0.472 -.8148046 1.68568
_cons | 2.696364 .1510869 17.85 0.000 2.377598 3.015129
------------------------------------------------------------------------------

Once again, contemporaneous degrees are not a driving force in faculty size. There is no need to do an F test to assess whether at least one of the 17 colleges differs from college 18. With the exception of college 17, each of the other colleges is significantly different. The moving average of degrees is again significant.
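Because each college dummy is measured relative to the constant term, the implied intercept for any one college is the constant plus that college's coefficient. As an added illustration (not part of the original code), it can be recovered after the regression above with, for example:

lincom _cons + Col1    // implied intercept for college 1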

RANDOM-EFFECTS REGRESSION

Finally, consider the random-effects model in which we employ Mundlak’s (1978) approach to
estimating panel data. The Mundlak model posits that the fixed effects in the equation, β1i , can
be projected upon the group means of the time-varying variables, so that

β1i = β1 + δ′ xi + wi

where xi is the set of group (school) means of the time-varying variables and wi is a (now)
random effect that is uncorrelated with the variables and disturbances in the model. Logically,
adding the means to the equations picks up the correlation between the school effects and the
other variables. We could not incorporate the mean number of degrees awarded in the fixed-
effects model (because it was time invariant) but this variable plays a critical role in the Mundlak
approach to panel data modeling and estimation.

The random effects model for BA and BS degree-granting undergraduate departments is
 
  FACULTY sizeit = β1 + β2Tt + β3BA&Sit + β4MEANBA&Si + β5MOVAVBA&BS
+ β6PUBLICi + β7Bschl + εit + ui

where error term ε is iid over time, E(εit2|xit) = σ2 for I = 18 and Ti = 14 and E[ui2] = θ2 for I =
18. The STATA command to estimate this model is

xtreg faculty t degrees degrebar public bschool ma_deg, re cluster(college)

The resulting regression information appearing in the output window is 1

. xtreg faculty t degrees degrebar public bschool ma_deg, re cluster(college)

Random-effects GLS regression Number of obs = 252


Group variable (i): college Number of groups = 18

R-sq: within = 0.0687 Obs per group: min = 14


between = 0.6878 avg = 14.0
overall = 0.6483 max = 14

Random effects u_i ~ Gaussian Wald chi2(7) = 1273.20


corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.0000

(Std. Err. adjusted for 18 clusters in college)


------------------------------------------------------------------------------
| Robust
faculty | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
t | -.0285293 .0218015 -1.31 0.191 -.0712594 .0142007
degrees | -.0160879 .0147378 -1.09 0.275 -.0449734 .0127976
degrebar | .1060891 .0312801 3.39 0.001 .0447811 .167397
public | -3.863652 .5662052 -6.82 0.000 -4.973394 -2.75391
bschool | .5817666 .9406433 0.62 0.536 -1.26186 2.425394
ma_deg | .0398252 .01444 2.76 0.006 .0115233 .0681271
_cons | 10.14196 .9033207 11.23 0.000 8.371485 11.91244
-------------+----------------------------------------------------------------
sigma_u | 2.0564748
sigma_e | .79673873
rho | .86948846 (fraction of variance due to u_i)
------------------------------------------------------------------------------

The marginal effect of an additional economics major is again insignificant but slightly negative
within the sample. Both the short-term moving average number and long-term average number
                                                       
1
Note that the Wald statistic of 1273.20 is based on a test of all coefficients in the model (including the constant).
This is inconsistent with the default Wald statistic reported in other regression results, including random-effects
models without robust or clustered standard errors, where the default statistic is based on a test of all slope
coefficients in the model. In the model estimated here, the Wald statistic based on a test of all slope coefficients
equal to 0 is 198.55. I understand that the current version of STATA (STATA 11) now consistently presents the
Wald statistic based on a test of all slope coefficients.

of bachelor degrees are significant. A long-term increase of about 10 students earning degrees in economics (roughly the reciprocal of the degrebar coefficient, 1/0.106 ≈ 9.4) is required to predict one more tenured or tenure-track faculty member in a department. Ceteris paribus, economics departments at private institutions are smaller than comparable departments at public schools by a large and statistically significant margin of nearly four members. Whether a business school is present is insignificant. There is no meaningful trend in faculty size.
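
As noted in the footnote above, the reported Wald chi-squared statistic includes the constant term. A hedged illustration of how the all-slope-coefficients version could be obtained after the xtreg command is STATA's test command:

test t degrees degrebar public bschool ma_deg

which tests the joint hypothesis that all six slope coefficients are zero.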

CONCLUDING REMARKS

The goal of this hands-on component of this third of four modules is to enable economic
education researchers to make use of panel data for the estimation of constant coefficient, fixed-
effects and random-effects panel data models in STATA. It was not intended to explain all of
the statistical and econometric nuances associated with panel data analysis. For this an
intermediate level econometrics textbook (such as Jeffrey Wooldridge, Introductory
Econometrics) or advanced econometrics textbook (such as William Greene, Econometric
Analysis) should be consulted.

APPENDIX: Alternative commands to estimate fixed-effects models in STATA

Method 1 – Alternative Method of Creating Dummy variables

We estimated the above fixed-effects model after explicitly creating 18 different dummy
variables. STATA also has a built-in command (“xi”) to create a sequence of dummy variables
from a single categorical variable. To be consistent with the above model, we can first indicate
to STATA which category it should omit when creating the college dummy variables by typing
the following command into the command window and pressing enter:

char college[omit] 18

We can now automatically create the relevant college dummy variables and estimate the fixed-
effects model all through one command:

xi: regress faculty t degrees ma_deg i.college, cluster(college)

The resulting regression information appearing in the output window is

. xi: regress faculty t degrees ma_deg i.college, cluster(college)

i.college _Icollege_1-18 (naturally coded; _Icollege_18 omitted)

Linear regression Number of obs = 252


F( 2, 17) = .
Prob > F = .
R-squared = 0.9406
Number of clusters (college) = 18 Root MSE = .79674

------------------------------------------------------------------------------
| Robust
faculty | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
t | -.0285342 .022453 -1.27 0.221 -.0759059 .0188374
degrees | -.0160847 .0152071 -1.06 0.305 -.0481689 .0159995
ma_deg | .039847 .0148528 2.68 0.016 .0085103 .0711837
_Icollege_1 | 5.777467 .7681565 7.52 0.000 4.156799 7.398136
_Icollege_2 | .1529889 .0134293 11.39 0.000 .1246555 .1813222
_Icollege_3 | 4.297591 .5541956 7.75 0.000 3.128341 5.466842
_Icollege_4 | 6.289728 .6553347 9.60 0.000 4.907093 7.672363
_Icollege_5 | 4.910941 .5698701 8.62 0.000 3.708621 6.113262
_Icollege_6 | 5.020157 .0256077 196.04 0.000 4.966129 5.074185
_Icollege_7 | 1.213842 .0132117 91.88 0.000 1.185967 1.241716
_Icollege_8 | .7779701 .0678475 11.47 0.000 .6348244 .9211157
_Icollege_9 | 3.164737 .0626958 50.48 0.000 3.03246 3.297013
_Icollege_10 | 2.863453 .1553986 18.43 0.000 2.53559 3.191315
_Icollege_11 | 5.151815 .0240307 214.39 0.000 5.101115 5.202515
_Icollege_12 | -.0680152 .0215257 -3.16 0.006 -.1134304 -.0226
_Icollege_13 | 3.988947 1.014148 3.93 0.001 1.849282 6.128611
_Icollege_14 | -.631956 .1198635 -5.27 0.000 -.8848458 -.3790662
_Icollege_15 | 8.258587 .4725524 17.48 0.000 7.261588 9.255585
_Icollege_16 | 8.009696 .5546092 14.44 0.000 6.839573 9.179819
_Icollege_17 | .4354377 .5925837 0.73 0.472 -.8148046 1.68568
_cons | 2.696364 .1510869 17.85 0.000 2.377598 3.015129
------------------------------------------------------------------------------

Method 2 – xtreg fe

STATA’s “xtreg” command allows for various panel data models to be estimated. A random-
effects model was presented above, but “xtreg” also estimates a fixed-effects model, a between-
effects model, and various other models. The basic syntax for the “xtreg” command is:

xtreg “dependent variable” “independent variables”, “model to be estimated” “other options”

To estimate a random-effects model, the “model to be estimated” is “re.” Similarly, to estimate a fixed-effects model, the “model to be estimated” is “fe.” When using “xtreg” to estimate a fixed-
effects model, STATA does not estimate the panel-specific dummy variables. This is a by-
product of the type of estimator used by STATA. However, the coefficient estimates for the
remaining independent variables are identical to those estimated by OLS with panel specific
dummy variables. For example, using the “xtreg” command to estimate the fixed-effects model
presented above, STATA provides the following output:

. xtreg faculty t degrees ma_deg, fe cluster(college) dfadj

Fixed-effects (within) regression Number of obs = 252


Group variable (i): college Number of groups = 18

R-sq: within = 0.0687 Obs per group: min = 14


between = 0.4175 avg = 14.0
overall = 0.3469 max = 14

F(3,17) = 2.66
corr(u_i, Xb) = 0.4966 Prob > F = 0.0815

(Std. Err. adjusted for 18 clusters in college)


------------------------------------------------------------------------------
| Robust
faculty | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
t | -.0285342 .022453 -1.27 0.221 -.0759059 .0188374
degrees | -.0160847 .0152071 -1.06 0.305 -.0481689 .0159995
ma_deg | .039847 .0148528 2.68 0.016 .0085103 .0711837
_cons | 6.008218 .4400811 13.65 0.000 5.079728 6.936708
-------------+----------------------------------------------------------------
sigma_u | 2.8596636
sigma_e | .79673873
rho | .92796654 (fraction of variance due to u_i)
------------------------------------------------------------------------------

The additional “dfadj” option adjusts the cluster-robust standard error estimates to account for
the transformation used by STATA in estimating the fixed-effects model (called the within
transform). Although estimating the fixed-effects model with xtreg no longer provides estimates
of the dummy variable coefficients, we see that the coefficient estimates and standard errors for
the remaining variables are identical to those of an OLS regression with panel-specific dummies
and cluster-robust standard errors.
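
If estimates of the college-specific effects themselves are wanted after xtreg, fe, one way (sketched here, assuming the fixed-effects estimation above is the most recent one in memory) is to use predict:

predict uhat, u
* uhat now holds each college's estimated fixed effect (its deviation from the overall constant)

These values correspond, up to the normalization of the constant term, to the college dummy coefficients reported in the OLS regressions above.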

REFERENCES

Becker, William, William Greene and John Siegfried (2009). “Does Teaching Load Affect
Faculty Size? ” Working Paper (July).

Greene, William (2008). Econometric Analysis. 6th Edition, New Jersey: Prentice Hall.

Mundlak, Yair (1978). "On the Pooling of Time Series and Cross Section Data," Econometrica.
Vol. 46. No. 1 (January): 69-85.

Wooldridge, Jeffrey (2009). Introductory Econometrics. 4th Edition, Mason OH: South-
Western.

MODULE THREE, PART FOUR: PANEL DATA ANALYSIS
IN ECONOMIC EDUCATION RESEARCH USING SAS

Part Four of Module Three provides a cookbook-type demonstration of the steps required to use
SAS in panel data analysis. Users of this module need to have completed Module One, Parts One
and Four, and Module Three, Part One. That is, from Module One users are assumed to know
how to get data into SAS, recode and create variables within SAS, and run and interpret
regression results. They are also expected to know how to test linear restrictions on sets of
coefficients as done in Module One, Parts One and Two. Module Three, Parts Two and Three
demonstrate in LIMDEP and STATA what is done here in SAS.

THE CASE

As described in Module Three, Part One, Becker, Greene and Siegfried (2009) examine the
extent to which undergraduate degrees (BA and BS) in economics or Ph.D. degrees (PhD) in
economics drive faculty size at those U.S. institutions that offer only a bachelor degree and those
that offer both bachelor degrees and PhDs. Here we retrace their analysis for the institutions
that offer only the bachelor degree. We provide and demonstrate the SAS code necessary to
duplicate their results.

DATA FILE

The following panel data are provided in the comma separated values (CSV) text file
“bachelors.csv”, which will automatically open in EXCEL by simply double clicking on it after
it has been downloaded to your hard drive. Your EXCEL spreadsheet should look like this:

“College” identifies the bachelor degree-granting institution by a number 1 through 18.

“Year” runs from 1991 through 2006.

“Degrees” is the number of BS or BA degrees awarded in each year by each college.

“DegreBar” is the average number of degrees awarded by each college for the 16-year period.

“Public” equals 1 if the institution is a public college and 2 if it is a private college.

“Faculty” is the number of tenured or tenure-track economics department faculty members.

“Bschol” equals 1 if the college has a business program and 0 if not.

“T” is the time trend running from −7 to 8, corresponding to the years 1991 through 2006.

“MA_Deg” is a three-year moving average of degrees (unknown for the first two years).

College  Year  Degrees  DegreBar Public  Faculty  Bschol  T  MA_Deg
1  1991  50  47.375  2  11  1  ‐7  0 
1  1992  32  47.375  2  8  1  ‐6  0 
1  1993  31  47.375  2  10  1  ‐5  37.667 
1  1994  35  47.375  2  9  1  ‐4  32.667 
                 
↓  ↓  ↓  ↓  ↓  ↓  ↓    ↓ 
                 
1  2003  57  47.375  2  7  1  5  56 
1  2004  57  47.375  2  10  1  6  55.667 
1  2005  57  47.375  2  10  1  7  57 
1  2006  51  47.375  2  10  1  8  55 
2  1991  16  8.125  2  3  1  ‐7  0 
2  1992  14  8.125  2  3  1  ‐6  0 
2  1993  10  8.125  2  3  1  ‐5  13.333 
                 
↓  ↓  ↓  ↓  ↓  ↓  ↓    ↓ 
                 
2  2004  10  8.125  2  3  1  6  12.667 
2  2005  7  8.125  2  3  1  7  11.333 
2  2006  6  8.125  2  3  1  8  7.667 
3  1991  40  35.5  2  8  1  ‐7  0 
3  1992  31  37.125  2  8  1  ‐6  0 
                 
↓  ↓  ↓  ↓  ↓  ↓  ↓    ↓ 
                 
17  2004  64  39.3125  2  5  0  6  54.667 
17  2005  37  39.3125  2  4  0  7  51.333 
17  2006  53  39.3125  2  4  0  8  51.333 
18  1991  14  8.4375  2  4  0  ‐7  0 
18  1992  10  8.4375  2  4  0  ‐6  0 
18  1993  10  8.4375  2  4  0  ‐5  11.333 
18  1994  7  8.4375  2  3.5  0  ‐4  9 
                 
↓  ↓  ↓  ↓  ↓  ↓  ↓    ↓ 
                 
18  2005  4  8.4375  2  2.5  0  7  7.333 
18  2006  7  8.4375  2  3  0  8  6 

If you opened this CSV file in a word processor or text editing program, it would show
that each of the 289 lines (including the headers) corresponds to a row in the EXCEL table, but
variable values would be separated by commas and not appear neatly one on top of the other as
in EXCEL.

As discussed in Module One, Part Two, SAS has a data matrix default restriction. This
data set is sufficiently small, so there is no need to adjust the size of the matrix. We could write a data step to read this text data file into SAS, as in Module One, Part Four, but like EXCEL, it can be imported into SAS directly by using the Import Wizard.
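
For readers who prefer typed commands to the wizard, a roughly equivalent way to read the file is SAS's PROC IMPORT (a sketch, assuming the file has been saved as e:\bachelors.csv; adjust the path to wherever the file is stored):

proc import datafile="e:\bachelors.csv" out=bachelors dbms=csv replace;
   getnames=yes;
run;

The wizard described next accomplishes the same thing through menus.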

To import the data into SAS, click on ‘File’ at the top left corner of your screen in SAS, and then
click ‘Import Data’.

This will initialize the Import Wizard pop-up screen. Since the data is comma separated values,
scroll down under the ‘Select data source below.’ tab and click on ‘Comma Separated Values
(*.csv)’ as shown below.

Click ‘Next’, and then provide the location of the file bachelors.csv, wherever it is stored (in our case, “e:\bachelors.csv”).

To finish importing the data, click ‘Next’, and then name the dataset, known as a member in
SAS, to be stored in the temporary library called ‘WORK’. Recall that a library is simply a
folder to store datasets and output. I named the file ‘BACHELORS’ as seen below. Hitting the
Finish button will bring the data set into SAS.

To verify that the wizard imported the data correctly, review the Log file and inspect the
dataset. When SAS is opened, the default panels are the ‘Log’ window at the top right, the
‘Editor’ window in the bottom right and the ‘Explorer/Results’ window on the left. Scrolling
through the Log reveals that the dataset was successfully imported. The details of the data step
procedure are provided along with a few summary statistics of how many observations and
variables were imported.

To view the dataset, click on the “Libraries” folder, which is in the top left of the ‘Explorer’
panel, and then click on the ‘Work’ library. This reveals all of the members in the ‘Work’
library. In this case, the only member is the dataset ‘Bachelors’. To view the dataset, click on the
dataset icon ‘Bachelors’.

In addition to a visual inspection of the data, we use the “means” command to check the
descriptive statistics. Since we don’t list any variables in the command, by default, SAS runs the
‘means’ command on all variables in the dataset. First, however, we need to remove the two
years (1991 and 1992) for which no data are available for the degree moving average measure.
Since we may need the full dataset later, it is good practice to delete the observations from a copy of the dataset (called bachelors2). This is done in a data step using ‘if-then’ statements.

data bachelors2;
set bachelors;
if year = 1991 then delete;
if year = 1992 then delete;
run;

PROC MEANS DATA=bachelors2;
RUN;

Typing these commands into the ‘Editor’ window and then clicking the run button
(recall this is the running man at the top) yields the following screen.
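
As an aside, the two if-then deletions in the data step above could equivalently be written with SAS's in operator:

if year in (1991, 1992) then delete;

Either form leaves 252 observations (18 colleges times 14 years) in bachelors2.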

CONSTANT COEFFICIENT REGRESSION

The constant coefficient panel data model for the faculty size data-generating process for
bachelor degree-granting undergraduate departments is given by

Faculty sizeit = β1 + β2Tt + β3BA&Sit + β4MEANBA&Si + β5PUBLICi + β6Bschl + β7MA_Degit + εit

where the error term εit is independent and identically distributed (iid) across institutions and
over time and E(εit2|xit) = σ2 , for I = 18 colleges and T = 14 years (−5 through 8) for 252
complete records. To take clustering into account, include the cluster statement, with clustering on the colleges. The SAS OLS regression command to be entered into the editor, including the standard error adjustment for clustering, is

proc surveyreg data=bachelors2;
   cluster college;
   model faculty = t degrees degrebar public bschool ma_deg;
run;

Upon highlighting and hitting the “run” button, the Output panel shows the following results

Contemporaneous degrees have little to do with current faculty size but both overall number of
degrees awarded (the school means) and the moving average of degrees (MA_DEG) have
significant effects. It takes an increase of 26 or 27 bachelor degrees in the moving average to
expect just one more faculty position. Whether it is a public or a private college is highly
significant. Moving from a public to a private college lowers predicted faculty size by nearly
four members for otherwise comparable institutions. There is an insignificant erosion of tenured
and tenure-track faculty size over time. Finally, while economics departments in colleges with a
business school tend to have a larger permanent faculty, ceteris paribus, the effect is small and
insignificant.

FIXED-EFFECTS REGRESSION

The fixed-effects model requires either the insertion of 17 (0,1) covariates to capture the unique
effect of each of the 18 colleges (where each of the 17 dummy coefficients is measured relative to the constant term) or the insertion of 18 dummy variables with no constant term in the OLS
regression. In addition, no time invariant variables can be included because they would be
perfectly correlated with the respective college dummies. Thus, the overall mean number of
degrees, the public or private dummy, and business school dummy cannot be included as
regressors.
The SAS code to be run from the editor window, including the commands to create the
dummy variables is:

data bachelors2;
   set bachelors2;
   col1 = 0; col2 = 0; col3 = 0; col4 = 0; col5 = 0; col6 = 0;
   col7 = 0; col8 = 0; col9 = 0; col10 = 0; col11 = 0; col12 = 0;
   col13 = 0; col14 = 0; col15 = 0; col16 = 0; col17 = 0; col18 = 0;
   if college = 1 then col1 = 1; if college = 2 then col2 = 1;
   if college = 3 then col3 = 1; if college = 4 then col4 = 1;
   if college = 5 then col5 = 1; if college = 6 then col6 = 1;
   if college = 7 then col7 = 1; if college = 8 then col8 = 1;
   if college = 9 then col9 = 1; if college = 10 then col10 = 1;
   if college = 11 then col11 = 1; if college = 12 then col12 = 1;
   if college = 13 then col13 = 1; if college = 14 then col14 = 1;
   if college = 15 then col15 = 1; if college = 16 then col16 = 1;
   if college = 17 then col17 = 1; if college = 18 then col18 = 1;
run;

proc surveyreg data=bachelors2;
   cluster college;
   model faculty = t degrees ma_deg col1 col2 col3 col4 col5
                   col6 col7 col8 col9 col10 col11 col12
                   col13 col14 col15 col16 col17;
quit;

The resulting regression information appearing in the output window is

Once again, contemporaneous degrees are not a driving force in faculty size. An F test is not needed to assess whether at least one of the 17 colleges differs from college 18: with the exception of college 17, each of the other colleges is individually significantly different. The moving average of degrees is again significant.
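
As an aside, because SAS logical expressions evaluate to 1 or 0, the eighteen dummies could also be created more compactly, mirroring the approach used in the STATA version of this module; a sketch is:

data bachelors2;
   set bachelors2;
   col1 = (college = 1);
   col2 = (college = 2);
   /* ... and so on through col18 = (college = 18); */
run;

The regression command itself is unchanged.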

RANDOM-EFFECTS REGRESSION

Finally, consider the random-effects model in which we employ Mundlak’s (1978) approach to
estimating panel data. The Mundlak model posits that the fixed effects in the equation, β1i , can
be projected upon the group means of the time-varying variables, so that

β1i = β1 + δ′ xi + wi

where xi is the set of group (school) means of the time-varying variables and wi is a (now)
random effect that is uncorrelated with the variables and disturbances in the model. Logically,
adding the means to the equations picks up the correlation between the school effects and the
other variables. We could not incorporate the mean number of degrees awarded in the fixed-
effects model (because it was time invariant) but this variable plays a critical role in the Mundlak
approach to panel data modeling and estimation.

The random effects model for BA and BS degree-granting undergraduate departments is


 
  FACULTY sizeit = β1 + β2YEARt + β3BA&Sit + β4MEANBA&Si + β5MOVAVBA&BS
+ β6PUBLICi + β7Bschl + εit + ui

where the error term εit is iid over time, E(εit2|xit) = σ2 and E[ui2] = θ2, for the I = 18 colleges and Ti = 14 years per college.

In SAS 9.1, there are no straightforward procedures to estimate this model. In the appendix, I do
provide a lengthy procedure that estimates the random effects model by OLS regression on a
transformed model. This is quite complex and is not recommended for beginners. See Cameron
and Trivedi (2005) for further details. SAS 9.2 has a new command called the PANEL procedure
to estimate panel data. For our model, we need to attach the / RANONE option to specify that a
one-way random-effects model be estimated. We also need to correct for the clustering of the
data. Unlike simple commands in LIMDEP and STATA, SAS does not have an option for one-
way random effects with clustered errors.

This new SAS 9.2 procedure has more options for specific error term structures in panel data.
Although SAS does not allow the CLUSTER option, there is a VCOMP option that specifies the
type of variance component estimate to use. For balanced data, the default is VCOMP=FB.
However, the FB method does not always obtain nonnegative estimates for the cross section (or
group) variance. In the case of a negative estimate, a warning is printed and the estimate is set to
zero. Because we have to address clustering, the WK option is specified, which is close to a groupwise heteroscedastic regression.

The SAS code to be run from the Editor panel (with 1991 and 1992 data suppressed) is

PROC SORT DATA=bachelors2;
   BY college year;
RUN;

PROC PANEL DATA=bachelors2;
   ID college year;
   MODEL faculty = t degrees degrebar public bschool MA_deg / RANONE VCOMP=WK;
RUN;

The resulting regression information appearing in the output window is

The marginal effect of an additional economics major is again insignificant but slightly negative
within the sample. Both the short-term moving average number and long-term average number
of bachelor degrees are significant. A long-term increase of about 10 students earning degrees in
economics is required to predict that one more tenured or tenure-track faculty member is in a
department. Ceteris paribus, economics departments at private institutions are smaller than
comparable departments at public schools by a large and statistically significant margin of nearly four members.
Whether there is a business school present is insignificant. There is no meaningful trend in
faculty size.

It should be clear that this regression is NOT identical to similar one-way random effect models
controlling for clustering in LIMDEP or STATA. The standard errors are adjusted for a general
groupwise heteroscedastic error structure. The difference does not alter the significance of the estimates, and the standard errors are, for the most part, very comparable.

CONCLUDING REMARKS

The goal of this hands-on component of this third of four modules is to enable economic
education researchers to make use of panel data for the estimation of constant coefficient, fixed-
effects and random-effects panel data models in SAS. It was not intended to explain all of the
statistical and econometric nuances associated with panel data analysis. For this an intermediate
level econometrics textbook (such as Jeffrey Wooldridge, Introductory Econometrics) or
advanced econometrics textbook (such as William Greene, Econometric Analysis) should be
consulted.

APPENDIX: Alternative Means to Estimate Random-Effects Model with Clustered Data.

The following code estimates the random-effects model with clustering. The estimation procedure is two-step feasible GLS. In the first step, the variance components (the error and cross-section variances) are estimated. In the second step, they are used to transform (quasi-demean) the equation, subtracting from each variable λ times its college mean, where λ = 1 − sqrt(σε2 / (T σu2 + σε2)); this λ is the quantity computed as “lamda” in the PROC IML step below.

Because the variance components are estimated rather than known, the resulting standard errors are slightly different from the standard errors provided by LIMDEP or STATA when estimating a random-effects model with clustering.

The code to be run in the editor window is:

/* get SSE and SSU */

proc sort data=bachelors2;
   by college year;
quit;

proc tscsreg data=bachelors2 outest=covvc;
   id college year;
   model faculty = t degrees degrebar public bschool MA_deg / ranone;
quit;

/* find number of years */

data numobs (keep = year);
   set bachelors2;
run;

proc sort nodupkey;
   by year;
quit;

proc means data = numobs max;
   output out = num;
quit;

/* create lamda */
proc iml;
use covvc;
read all var {_VARERR_ _VARCS_} into x;
use num;
read var {_freq_} into y;
print y;
sesq = x[1,1];
susq = x[1,2];
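/* The next line computes the feasible-GLS quasi-demeaning factor:          */
/* lamda = 1 - sqrt(sesq / (T*susq + sesq)), where sesq and susq are the    */
/* error and cross-section variance estimates read from the TSCSREG output  */
/* above and T = y[1,1] is the number of years per college.                 */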
lamda = 1 - sqrt( sesq / (y[1,1]*susq + sesq) );
print x y lamda;
cname = {"lamda"};

create out from lamda [ colname=cname];
append from lamda;
quit;

/* find averages of each variable grouped by college #*/


proc means noprint data=bachelors2;
   class college;
   output out=stats
      mean= avg_year avg_degrees avg_degrebar avg_public avg_faculty avg_bschool
            avg_t avg_ma_deg;
run;

data bachelors3 (drop = _type_ _freq_);
   merge bachelors2 stats;
   by college;
   if _type_ = 0 then delete;
   one = 1;
run;

DATA bachelors4;
if _N_ = 1 then set out;
SET bachelors3;
l = one*lamda;
run;

/* transform data */
data clean (keep = college con nfaculty nt ndegrees ndegrebar npublic
nbschool nMA_deg year);
set bachelors4;
nfaculty = faculty - lamda*avg_faculty;
nt = t - lamda*avg_t;
ndegrees = degrees - lamda*avg_degrees ;
ndegrebar = degrebar - lamda*avg_degrebar;
npublic = public - lamda*avg_public;
nbschool = bschool - lamda*avg_bschool;
nMA_deg = ma_deg - lamda*avg_ma_deg;
con = 1 - lamda*1;
run;

/* run regression on transformed equation assuming clustering */
/* Since intercept is included in transformed equation, use noint option */
proc surveyreg data=clean;
   cluster college;
   model nfaculty = con nt ndegrees ndegrebar npublic nbschool nMA_deg / noint;
quit;

The output for this regression is:

The standard errors associated with this regression are much closer to the standard errors from
LIMDEP and STATA. However, this is a complex sequence of code that beginners should not attempt.

REFERENCES

Becker, William, William Greene and John Siegfried (2009). “Does Teaching Load Affect
Faculty Size? ” Working Paper (July).

Cameron, Colin and Pravin Trivedi (2005). Microeconometrics. 1st Edition, New York,
Cambridge University Press.

Mundlak, Yair (1978). "On the Pooling of Time Series and Cross Section Data," Econometrica.
Vol. 46. No. 1 (January): 69-85.

Greene, William (2008). Econometric Analysis. 6th Edition, New Jersey: Prentice Hall.

Wooldridge, Jeffrey (2009). Introductory Econometrics. 4th Edition, Mason OH: South-
Western.

MODULE FOUR, PART ONE:

SAMPLE SELECTION IN ECONOMIC EDUCATION RESEARCH


William E. Becker and William H. Greene *

Modules One and Two addressed an economic education empirical study involving the
assessment of student learning that occurs between the start of a program (as measured, for
example, by a pretest) and the end of the program (posttest). At least implicitly, there is an
assumption that all the students who start the program finish the program. There is also an
assumption that those who start the program are representative of, or at least are a random
sample of, those for whom an inference is to be made about the outcome of the program. This
module addresses how these assumptions might be wrong and how problems of sample selection
might occur. The consequences of and remedies for sample selection are presented here in Part
One. As in the earlier three modules, contemporary estimation procedures to adjust for sample
selection are demonstrated in Parts Two, Three and Four using LIMDEP (NLOGIT), STATA
and SAS.

Before addressing the technical issues associated with sample selection problems in an
assessment of one or another instructional method, one type of student or teacher versus another,
or similar educational comparisons, it might be helpful to consider an analogy involving a
contest of skill between two types of contestants: Type A and Type B. There are 8 of each type
who compete against each other in the first round of matches. The 8 winners of the first set of
matches compete against each other in a second round, and the 4 winners of that round compete
in a third. Type A and Type B may compete against their own type in any match after the first
round, but one Type A and one Type B manage to make it to the final round. In the final match
they tie. Should we conclude, on probabilistic grounds, that Type A and Type B contestants are
equally skilled?

-----------------------------------------

*William Becker is Professor Emeritus of Economics, Indiana University, Adjunct Professor of Commerce, University of South Australia, Research Fellow, Institute for the Study of Labor (IZA) and
Fellow, Center for Economic Studies and Institute for Economic Research (CESifo). William Greene is
Toyota Motor Corp. Professor of Economics, Stern School of Business, New York University,
Distinguished Adjunct Professor, American University and External Affiliate of the Health Econometrics
and Data Group, York University.

How is your answer to the above question affected if we tell you that in the first round 5 Type As and only 3 Type Bs won their matches, and only one Type B was successful in the second and third rounds? The additional information should make clear that we have to consider
how the individual matches are connected and not just look at the last match. But before you
conclude that Type As had a superior attribute only in the early contests and not in the finals,
consider another analogy provided by Thomas Kane (Becker 2004).

Kane's hypothetical series of races is contested by 8 greyhounds and 8 dachshunds. In the first race, the greyhounds enjoy a clear advantage with 5 greyhounds and only 3 dachshunds
finishing among the front-runners. These 8 dogs then move to the second race, when only one
dachshund wins. This dachshund survives to the final race when it ties with a greyhound. Kane
asks: “Should I conclude that leg length was a disadvantage in the first two races but not in the
third?” And answers: “That would be absurd. The little dachshund that made it into the third
race and eventually tied for the win most probably had an advantage on other traits—such as a
strong heart, or an extraordinary competitive spirit—which were sufficient to overcome the
disadvantage created by its short stature.”

These analogies demonstrate all three sources of bias associated with attempts to assess
performance from the start of a program to its finish: sample selection bias, endogeneity, and
omitted variables. The length of the dogs’ legs not appearing to be a problem in the final race
reflects the sample selection bias that results if the researcher looks only at that last race. In
education research this corresponds to only looking at the performance of those who take the
final exam, fill out the end-of-term student evaluations, and similar terminal program
measurements. Looking only at the last race (corresponding to those who take the final exam)
would be legitimate if the races were independent (previous exam performance had no effect on
final exam taking, students could not self-select into the treatment group versus control group),
but the races (like test scores) are sequentially dependent; thus, there is an endogeneity problem
(as introduced in Module Two). As Kane points out, concluding that leg length was important in
the first two races and not in the third reveals the omitted-variable problem: a trait such as heart
strength or competitive motivation might be overriding short legs and thus should be included as
a relevant explanatory variable in the analyses. These problems of sample selection in
educational assessment are the focus of this module.

SAMPLE SELECTION FROM PRETEST TO POSTTEST AND HECKMAN CORRECTION

The statistical inference problems associated with sample selection in the typical change-score
model used in economic education research can be demonstrated using a modified version of the
presentation in Becker and Powers (2001), where the data generating process for the change
score (difference between post and pre TUCE scores) for the ith student ( Δyi ) is modeled as

Δyi = Xiβ + εi = β1 + ∑j=2,...,k βj xji + εi                                  (1)

The data set of explanatory variables is matrix X, where Xi is the row of xji values for the
relevant variables believed to explain the ith student’s pretest and posttest scores, the β j ’s are the
associated slope coefficients in the vector β , and ε i is the individual random shock (caused, for
example, by unobservable attributes, events or environmental factors) that affect the ith student’s
test scores. In empirical work, the exact nature of Δyi is critical. For instance, to model the
truncation issues that might be relevant for extremely able students’ being better than the
maximum TUCE score, a Tobit model can be specified for Δyi . i Also critical is the assumed
starting point on which all subsequent estimation is conditioned. ii

As discussed in Module One, to explicitly model the decision to complete a course, as reflected by the existence of a posttest for the ith student, a “yes” or “no” choice probit model can
be specified. Let Ti = 1 , if the i th student takes the posttest and let Ti = 0 , if not. Assume that
there is an unobservable continuous dependent variable, Ti* , representing the i th student’s desire
or propensity to complete a course by taking the posttest.

For an initial population of N students, let T* be the vector of all students’ propensities
to take a posttest. Let H be the matrix of explanatory variables that are believed to drive these
propensities, which includes directly observable things (e.g., time of class, instructor’s native
language). Let α be the vector of slope coefficients corresponding to these observable variables.
The individual unobservable random shocks that affect each student’s propensity to take the
posttest are contained in the error term vector ω . The data generating process for the i th
student’s propensity to take the posttest can now be written:

Ti* = H i α + ωi (2)

where

Ti = 1 , if Ti* > 0 , and student i has a posttest score, and

Ti = 0 , if Ti* ≤ 0 , and student i does not have a posttest score.

For estimation purposes, the error term ωi is assumed to be a standard normal random
variable that is independently and identically distributed with the other students’ error terms in
the ω vector. As shown in Module Four (Parts Two, Three and Four) this probit model for the
propensity to take the posttest can be estimated using the maximum-likelihood routines in
programs such as LIMDEP, STATA or SAS.

The effect of attrition between the pretest and posttest, as reflected in the absence of a
posttest score for the ith student (Ti = 0) and an adjustment for the resulting bias caused by
excluding those students from the Δyi regression can be illustrated with a two-equation model
formed by the selection equation (2) and the i th student’s change score equation (1). iii Each of
the disturbances in vector ε , equation (1), is assumed to be distributed bivariate normal with the
corresponding disturbance term in the ω vector of the selection equation (2). Thus, for the i th
student we have:

(ε i ,ω i ) ~ bivariate normal (0,0, σ ε ,1, ρ ) (3)

and for all perturbations in the two-equation system we have:

E(ε) = E(ω) = 0, E(εε ') = σε2I, E(ωω ') = I, and E(εω') = ρσεI . (4)

That is, the disturbances have zero means, constant variances, and no covariance among students, but
there is covariance between selection in getting a posttest score and the measurement of the
change score.

The difference in the functional forms of the posttest selection equation (2) and the
change score equation (1) ensures the identification of equation (1) but ideally other restrictions
would lend support to identification. Estimates of the parameters in equation (1) are desired, but
the i th student’s change score Δyi is observed in the TUCE data for only the subset of students
for whom Ti = 1 . The regression for this censored sample of nT =1 students is:

E (Δyi | Xi , Ti = 1) = Xi β + E (ε i | Ti * > 0); i = 1, 2,... nT =1 , for nT =1 < N . (5)

Similar to omitting a relevant variable from a regression (as discussed in Module Two), selection
bias is a problem because the magnitude of E (ε i | Ti * > 0) varies across individuals and yet is not
included in the estimation of equation (1). To the extent that ε i and ωi (and thus Ti* ) are
related, the estimators are biased.

The change score regression (1) can be adjusted for those who elected not to take a
posttest in several ways. An early Heckman-type solution to the sample selection problem is to
rewrite the omitted variable component of the regression so that the equation to be estimated is:

E (Δyi | Xi , Ti = 1) = Xi β + ( ρσ ε )λi ; i = 1, 2,... nT =1 (6)

where λi = f (−Ti * ) /[1 − F (−Ti * )] , and f (.) and F (.) are the normal density and distribution
functions. The inverse Mills ratio (or hazard) λi is the standardized mean of the disturbance
term ωi , for the i th student who took the posttest; it is close to zero only for those well above the

T = 1 threshold. The values of λ are generated from the estimated probit selection equation (2)
for all students. Each student in the change score regression ( Δyi ) gets a calculated value λi ,
with the vector of these values serving as a shift variable in the change-score regression.

The single coefficient represented by the product of ρ and σε (i.e., ρσε) is estimated in a two-step procedure in which the probit selection equation (2) is first estimated by maximum likelihood and then the change-score equation (1) is estimated by least squares with the inverse Mills ratio used as an additional regressor to adjust for the selection bias.
σ ε , and all the other coefficients in equations (1) and (2) can also be obtained simultaneously
and more efficiently using the maximum-likelihood routines in LIMDEP, STATA or SAS, as
will be demonstrated in Parts Two, Three and Four of this module using the Becker and Powers
data set.
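
As a purely illustrative sketch ahead of those demonstrations (the variable names below are placeholders, not those of the Becker and Powers data set), the two-step version of the estimator is available in STATA through the heckman command:

heckman changescore x1 x2, select(posttest = x1 x2 z1) twostep

where posttest is the 0/1 selection indicator and z1 is a variable assumed to affect selection but not the change score. Omitting the twostep option requests the full maximum-likelihood version instead.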

The Heckman-type selection model represented by equations (1) and (2) highlights the
nature of the sample selection problem inherent in estimating a change-score model by itself.
Selection results in population error term and regressor correlation that biases and makes the
coefficient estimators in the change score model inconsistent. The early Heckman (1979) type
two-equation estimation of the parameters in a selection model and change-score model,
however, requires cross-equation exclusion restrictions (variables that affect selection but not the
change score), differences in functional forms, and/or distributional assumptions for the error
terms. Parameter estimates are typically sensitive to these model specifications.

ALTERNATIVE METHODS FOR ADDRESSING SELECTION

As reviewed in Imbens and Wooldridge (2009), alternative nonparametric and semiparametric methods are being explored for assessing treatment effects in nonrandomized experiments but
these methods have been slow to catch on in education research in general and economic
education in particular. Exceptions, in the case of financial aid and the enrollment decision, are
the works of Wilbert van der Klaauw and Thomas Kane. Van der Klaauw (2002) estimates the
effect of financial aid on the enrollment decision of students admitted to a specific East Coast
college, recognizing that this college’s financial aid is endogenous because competing offers are
unknown and thus by definition are omitted relevant explanatory variables in the enrollment
decision of students considering this college.

The college investigated by van der Klaauw created a single continuous index of each student’s
initial financial aid potential (based on a SAT score and high school GPA) and then classified
students into one of four aid level categories based on discrete cut points. The aid assignment
rule depends at least in part on the value of a continuous variable relative to a given threshold in
such a way that the corresponding probability of receiving aid (and the mean amount offered) is
a discontinuous function of this continuous variable at the threshold cut point. A sample of

individual students close to a cut point on either side can be treated as a random sample at the cut
point because on average there really should be little difference between them (in terms of
financial aid offers received from other colleges and other unknown variables). In the absence of
the financial aid level under consideration, we should expect little difference in the college-going
decision of those just above and just below the cut point. Similarly, if they were all given the
financial aid, we should see little difference in outcomes, on average. To the extent that some
actually get it and others do not, we have an interpretable treatment effect. (Intuitively, this can
be thought of as running a regression of enrollment on financial aid for those close to the cut
point, with an adjustment for being in that position.) In his empirical work, van der Klaauw
obtained credible estimates of the importance of the financial aid effect without having to rely on
arbitrary cross-equation exclusion restrictions and functional form assumptions.

Kane (2003) uses an identification strategy similar to van der Klaauw but does so for all
those who applied for the Cal Grant Program to attend any college in California. Eligibility for
the Cal Grant Program is subject to a minimum GPA and maximum family income and asset
level. Like van der Klaauw, Kane exploits discontinuities on one dimension of eligibility for
those who satisfy the other dimensions of eligibility.

Although some education researchers are trying to fit their selection problems into this
regression discontinuity framework, legitimate applications are few because the technique has very stringent data requirements (an actual but unknown, or conceptually defendable, continuous index with thresholds for rank-ordered classifications) and limited ability to generalize away
from the classification cut points. Much of economic education research, on the other hand,
deals with the assessment of one type of program or environment versus another, in which the
source of selection bias is entry and exit from the control or experimental groups. An alternative
to Heckman’s parametric (rigid equation form) manner of comparing outcome measures adjusted
for selection based on unobservables is propensity score matching.

PROPENSITY SCORE MATCHING

Propensity score matching as a body of methods is based on the following logic: We are
interested in evaluating a change score after a treatment. Let O now denote the outcome variable
or interest (e.g., posttest score, change score, persistence, or whatever) and T denote the
treatment dummy variable (e.g., took the enhanced course), such that T = 1 for an individual who
has experienced the “treatment,” and T = 0 for one who has not. If we are interested in the
change-score effect of treatment on the treated, the conceptual experiment would amount to
observing the treated individual (1) after he or she experienced the treatment and the same
individual in the same situation but (2) after he/she did not experience the treatment (but
presumably, others did). The treatment effect would be the difference between the two post-test
scores (because the pretest would be the one achieved by this individual). The problem, of

course, is that ex post, we don’t observe the outcome variable, O, for the treated individual, in
the absence of the treatment. We observe some individuals who were treated and other
individuals who were not. Propensity score matching is a largely nonparametric approach to
evaluating treatment effects with this consideration in mind. iv

If individuals who experienced the treatment were exactly like those who did not in all
other respects, we could proceed by comparing random samples of treated and nontreated
individuals, confident that any observed differences could be attributed to the treatment. The
first section of this module focused on the problem that treated individuals might differ from
untreated individuals systematically, but in ways that are not directly observable by the
econometrician. To consider an example, if the decision to take an economics course (the
treatment) were motivated by characteristics of individuals (curiosity, ambition, etc.) that were
also influential in their performance on the outcome (test), then our analysis might attribute the
change in the score to the treatment rather than to these characteristics. Models of sample
selection considered previously are directed at this possibility. The development in this section
is focused on the possibility that the same kinds of issues might arise, but the underlying features
that differentiate the treated from the untreated can be observed, at least in part.

If assignment to the treatment were perfectly random, as discussed in the introduction to this module, solving this problem would be straightforward. A large enough sample of
individuals would allow us to average away the differences between treated and untreated
individuals, both in terms of observable characteristics and unobservable attributes. Regression
methods, such as those discussed in the previous sections of this module, are designed to deal
with the difficult case in which the assignment is nonrandom with respect to the unobservable
characteristics of individuals (such as ability, motivation, etc.) that can be related to the
“treatment assignment,” that is, whether or not they receive the treatment. Those methods do not
address another question, that is, whether there are systematic, observable differences between
treated and nontreated individuals. Propensity score methods are used to address this problem.

To take a simple example, suppose it is known with certainty that the underlying,
unobservable characteristics that are affecting the change score are perfectly randomly
distributed across individuals, treated and untreated. Assume, as well, that it is known for certain
that the only systematic, observable difference between treated and untreated individuals is that
women are more likely to undertake the treatment than men. It would make sense, then, that if
we want to compare treated to untreated individuals, we would not want to compare a randomly
selected group of treated individuals to a randomly selected group of untreated individuals – the
former would surely contain more women than the latter. Rather, we would try to balance the
samples so that we compared a group of women to another group of women and a group of men
to another group of men, thereby controlling for the impact of gender on the likelihood of
receiving the treatment. We might then want to develop an overall average by averaging, once
again, this time the two differences, one for men, the other for women.

In the main, and as already made clear in our consideration of the Heckman adjustment,
if assignment to the treatment is nonrandom, then estimation of treatment effects will be biased
by the effect of the variables that affect the treatment assignment. The strategy is, essentially, to
locate an untreated individual who looks like the treated one in every respect except the
treatment, then compare the outcomes. We then average this across individual pairs to estimate
the “average treatment effect on the treated.” The practical difficulty is that individuals differ in
many characteristics, and it is not feasible, in a realistic application, to compare each treated
observation to an untreated one that “looks like it.” There are too many dimensions on which
individuals can differ. The technique of propensity score matching is intended to deal with this
complication. Keep in mind, however, that if unmeasured or unobserved attributes are important, and they are not randomly distributed across treatment and control groups, matching techniques may not work. That is the problem the methods in the previous sections were designed to address.

THE PROPENSITY SCORE MATCHING METHOD

We now provide some technical details on propensity score matching. Let x denote a vector of
observable characteristics of the individual, before the treatment. Let the probability of treatment
be denoted P(T=1|x) = P(x). Because T is binary, P(x) = E[T|x], as in a linear probability model.
If treatment is random given x, then treatment is random given P(x), which in this context is
called the propensity score. It will generally not be possible to match individuals based on all the
characteristics individually – with continuously measured characteristics, such as income. There
are too many cells. The matching is done via the propensity score. Individuals with similar
propensity scores are expected (on average) to be individuals with similar characteristics.

Overall, for a ‘treated’ individual with propensity P(xi) and outcome Oi, the strategy is to
locate a control observation with similar propensity P(xc) and with outcome Oc. The effect of
treatment on the treated for this individual is estimated by Oi – Oc. This is averaged across
individuals to estimate the average treatment effect on the treated. The underlying theory asserts
that the estimates of treatment effects across treated and controls are unbiased if the treatment
assignment is random among individuals with the same propensity score; the propensity score,
itself, captures the drivers of the treatment assignment. (Relevant papers that establish this
methodology are too numerous to list here. Useful references are four canonical papers,
Heckman et al. [1997, 1998a, 1998b, 1999] and a study by Becker and Ichino [2002].)

The steps in the propensity score matching analysis consist of the following:

Step 1. Estimate the propensity score function, P(x), for each individual by fitting
a probit or logit model, and using the fitted probabilities.

Step 2. Establish that the average propensity scores of treated and control
observations are the same within particular ranges of the propensity scores. (This
is a test of the “balancing hypothesis.”)

Step 3. Establish that the averages of the characteristics for treatment and controls
are the same for observations in specific ranges of the propensity score. This is a
check on whether the propensity score approach appears to be succeeding at
matching individuals with similar characteristics by matching them on their
propensity scores.

Step 4. For each treated observation in the sample, locate a similar control
observation(s) based on the propensity scores. Compute the treatment effect,
Oi – Oc. Average this across observations to get the average treatment effect.

Step 5. In order to estimate a standard error for this estimate, Step 4 is repeated
with a set of bootstrapped samples.
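
The five steps above do not have to be coded by hand. As one hedged illustration (hypothetical variable names, and a command that appears in releases of STATA more recent than the one used in these modules), nearest-neighbor propensity score matching with the treatment effect on the treated can be obtained with:

teffects psmatch (O) (T x1 x2 x3), atet

Note that this command reports analytic (Abadie-Imbens) standard errors rather than the bootstrapped standard errors described in Step 5, so its results will not line up exactly with the step-by-step procedure.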

THE PROPENSITY SCORE

We use a binary choice model to predict “participation” in the treatment. Thus,

Prob(T = 1|x) = F (β0 + β1 x1 + β2 x2 + ... + β K xK ) = F (β′x) .

The choice of F is up to the analyst. The logit model is a common choice;

Prob(T = 1|x) = exp(β′x) / [1 + exp(β′x)].

The probit model, F (β′x) = Φ (β′x) , where Φ(t) is the normal distribution function, is an
alternative. The propensity score is the fitted probability from the probit or logit model,

Propensity score for individual i = F(β̂′xi) = Pi.

The central feature of this step is to find similar individuals by finding individuals who have
similar propensity scores. Before proceeding, we note, the original objective is to find groups of
individuals who have the same x. This is easy to do in our simple example, where the only
variable in x is gender, so we can simply distinguish people by their gender. When the x vector
has many variables, it is impossible to partition the data set into groups of individuals with the
same, or even similar explanatory variables. In the example we will develop below, x includes
age (and age squared), education, marital status, race, income and unemployment status. The
working principle in this procedure is that individuals who have similar propensity scores will, if

we average enough of them, have largely similar characteristics. (The reverse must be true, of
course.) Thus, although we cannot group people by their characteristics, xs, we can (we hope)
achieve the same end by grouping people by their propensity scores. That leads to step 2 of the
matching procedure.
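
A minimal sketch of Step 1 in STATA, with hypothetical variable names (T the treatment dummy and x1-x3 the observable characteristics), is:

logit T x1 x2 x3
predict pscore, pr

The new variable pscore holds the fitted probabilities, that is, the propensity scores used in the grouping step described next.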

GROUPING INDIVIDUALS BY PROPENSITY SCORES

Grouping those with similar propensity scores should result in similar predicted probabilities for
treatment and control groups. For instance, suppose we take a range of propensity scores
(probabilities of participating in the treatment), say from 0.4 to 0.6. Then, the part of the sample
that contains propensity scores in this range should contain a mix of treated individuals
(individuals with T = 1) and controls (individuals with T = 0). If the theory we are relying on is
correct, then the average propensity score for treated and controls should be the same, at least
approximately. That is,

Average[ F(β̂′x) | T = 1 and F̂ in the range ] ≈ Average[ F(β̂′x) | T = 0 and F̂ in the range ].
We will look for a partitioning of the range of propensity scores for which this is the case in each
range.

A first step is to decide if it is necessary to restrict the sample to the range of values of
propensity scores that is shared by the treated and control observations. That range is called the
common support. Thus, if the propensity scores of the treated individuals range from 0.1 to 0.7
and the scores of the control observations range from 0.2 to 0.9, then the common support is
from 0.2 to 0.7. Observations that have scores outside this range would not be used in the
analysis.

Once the sample to be used is determined, we will partition the range of propensity scores
into K cells. For each partitioning of the range of propensity scores considered, we will use a
standard F test for equality of means of the propensity scores of the treatment and control
observations:

Fk[1, d] = (P̄Ck − P̄Tk)² / (S²C,k/NCk + S²T,k/NTk),   k = 1, ..., K.
The denominator degrees of freedom for F are approximated using a technique invented by
Satterthwaite (1946):

d = w (NCk − 1)(S²C,k/NCk + S²T,k/NTk)/(S²C,k/NCk) + (1 − w)(NTk − 1)(S²C,k/NCk + S²T,k/NTk)/(S²T,k/NTk),

where

w = (NTk − 1)(S²C,k/NCk)² / [ (NTk − 1)(S²C,k/NCk)² + (NCk − 1)(S²T,k/NTk)² ].

If any of the cells (ranges of scores) fails this test, the next step is to increase the number of cells.
There are various strategies by which this can be done. The natural approach would be to leave
cells that pass the test as they are, and partition more finely the ones that do not. This may take
several attempts. In our example, we started by separating the range into 5 parts. With 5
segments, however, the data do not appear to satisfy the balancing requirement. We then try 6
and, finally, 7 segments of the range of propensity scores. With the range divided into 7
segments, it appears that the balance requirement is met.

Analysis can proceed even if the partitioning of the range of scores does not pass this test.
However, the test at this step will help to give an indication of whether the model used to
calculate the propensity scores is sufficiently specified. A persistent failure of the balancing test
might signal problems with the model that is being used to create the propensity scores. The
result of this step is a partitioning of the range of propensity scores into K cells with the K + 1
values,

[P*] = [P1, P2, ..., PK+1]

which is used in the succeeding steps.

EXAMINING THE CHARACTERISTICS IN THE SAMPLE GROUPS

Step 3 returns to the original motivation of the methodology. At step 3, we examine the
characteristics (x vectors) of the individuals in the treatment and control groups within the
subsamples defined by the groupings made by Step 2. If our theory of propensity scores is
working, it should be the case that within a group, for example, for the individuals whose
propensity scores are in the range 0.4 to 0.6, the x vectors should be similar in that at least the
means should be very close. This aspect of the data is examined statistically. Analysis can
proceed if this property is not met, but the results of these tests might signal to the analyst that
the results are somewhat fragile. In our example below, there are seven cells in the grid of
propensity scores and 12 variables in the model. We find that for four of the 12 variables in one
of the 7 cells (i.e., in four cases out of 84), the means of the treated and control observations
appear to be significantly different. Overall, this difference does not appear to be too severe, so
we proceed in spite of it.
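
A minimal sketch of this within-cell check, under the same caveats as the earlier fragments (Python, hypothetical names; X is the matrix of covariates and edges holds the cell boundaries found in step 2):

import numpy as np
from scipy import stats

def covariate_balance(X, pscore, treated, edges, alpha=0.01):
    """Welch two-sample test of equal treated/control means for every covariate
    (column of X) within every propensity-score cell defined by edges."""
    X = np.asarray(X, dtype=float)
    pscore = np.asarray(pscore, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    failures = []
    for j in range(X.shape[1]):                        # loop over covariates
        for k, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
            cell = (pscore >= lo) & (pscore <= hi)
            xt, xc = X[cell & treated, j], X[cell & ~treated, j]
            if len(xt) < 2 or len(xc) < 2:
                continue                               # nothing to compare
            if stats.ttest_ind(xt, xc, equal_var=False).pvalue < alpha:
                failures.append((j, k))                # covariate j fails in cell k
    return failures

With 12 variables and 7 cells this produces at most 84 comparisons, which is the accounting behind the four-failures-out-of-84 count mentioned above.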

 
MATCHING

Assuming that the data have passed the scrutiny in step 3, we now match the observations. For
each treated observation (individual’s outcome measure such as a test score) in the sample, we
find a control observation that is similar to it. The intricate complication at this step is to define
“similar.” It will generally not be possible to find a treated observation and a control observation
with exactly the same propensity score. So, at this stage it is necessary to decide what rule to use
for “close.” The obvious choice would be the nearest neighbor in the set of observations that is
in the propensity score group. The nearest neighbor for observation Oi would be the Oc* for
which |Pi – Pc| is minimized. We note that, by this strategy, a particular control observation might be
the nearest neighbor for more than one treatment observation and some control observations
might not be the nearest neighbor to any treated observation.
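
A minimal sketch of single nearest-neighbor matching with replacement follows; p_treat, y_treat, p_ctrl, and y_ctrl are hypothetical arrays of propensity scores and outcomes for the treated and control groups, and blocking by cell is ignored for brevity.

import numpy as np

def nearest_neighbor_att(p_treat, y_treat, p_ctrl, y_ctrl):
    """Single nearest-neighbor matching with replacement: for each treated unit,
    take the control with the closest propensity score and average Oi - Oc."""
    p_ctrl = np.asarray(p_ctrl, dtype=float)
    y_ctrl = np.asarray(y_ctrl, dtype=float)
    diffs = []
    for p_i, y_i in zip(np.asarray(p_treat), np.asarray(y_treat)):
        j = np.argmin(np.abs(p_ctrl - p_i))   # nearest control; may be reused
        diffs.append(y_i - y_ctrl[j])
    return np.mean(diffs)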

Another strategy is to use the average of several nearby observations. The counterpart
observation is constructed by averaging all control observations whose propensity scores fall in a
given range in the neighborhood of Pi. Thus, we first locate the set [Ct*] = the set of control
observations for which |Pt – Pc| < r, for a chosen value of r called the caliper. We then average
Oc for these observations. By this construction, the neighbor may be an average of several
control observations. It may also not exist, if no observations are close enough. In this case, r
must be increased. As in the single nearest neighbor computation, control observations may be
used more than once, or they might not be used at all (e.g., if the caliper is r = .01, and a control
observation has propensity .5 and the nearest treated observations have propensities of .45 and
.55, then this control will never be used).
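
The same setup can be used to sketch radius (caliper) matching; again the names are ours, and treated units with no control inside the caliper are simply skipped here rather than triggering a larger r.

import numpy as np

def radius_match_att(p_treat, y_treat, p_ctrl, y_ctrl, r=0.01):
    """Radius (caliper) matching: for each treated unit, average the outcomes of
    all controls whose propensity score lies within r of the treated score."""
    p_ctrl = np.asarray(p_ctrl, dtype=float)
    y_ctrl = np.asarray(y_ctrl, dtype=float)
    diffs = []
    for p_i, y_i in zip(np.asarray(p_treat), np.asarray(y_treat)):
        close = np.abs(p_ctrl - p_i) < r
        if close.any():
            diffs.append(y_i - y_ctrl[close].mean())
    return np.mean(diffs) if diffs else None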

A third strategy for finding the counterpart observations is to use kernel methods to
average all of the observations in the range of scores that contains the Oi that we are trying to
match. The averaging function is computed as follows:

Ōc = ∑ wc Oc , where the sum runs over the control observations in the cell and

wc = (1/h) K[(Pi − Pc)/h] / ∑ (1/h) K[(Pi − Pc)/h],

with the sum in the denominator again taken over the control observations in the cell.

The function K[.] is a weighting function that takes its largest value when Pi equals Pc and tapers
off to zero as Pc is farther from Pi. Typical choices for the kernel function are the normal or
logistic density functions. A common choice that cuts off the computation at a specific point is
the Epanechnikov (1969) weighting function,

 
K[t] = 0.75(1 − 0.2t²)/√5 for |t| < √5, and 0 otherwise.

The parameter h is the bandwidth that controls the weights given to points that lie relatively far
from Pi. A larger bandwidth gives more distant points relatively greater weight. Choice of the
bandwidth is a bit of an (arcane) art. The value 0.06 is a reasonable choice for the types of data
we are using in our analysis here.
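
Putting the kernel, bandwidth, and weights together, kernel matching can be sketched as follows. This is our own illustration (not the MATCH command's internal code), it averages over all controls rather than only those in the same block, and the 1/h factors cancel once the weights are normalized.

import numpy as np

def epanechnikov(t):
    """Epanechnikov kernel: 0.75*(1 - 0.2*t**2)/sqrt(5) for |t| < sqrt(5), else 0."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) < np.sqrt(5.0),
                    0.75 * (1.0 - 0.2 * t ** 2) / np.sqrt(5.0), 0.0)

def kernel_match_att(p_treat, y_treat, p_ctrl, y_ctrl, h=0.06):
    """Kernel matching: each treated unit is compared with a weighted average of
    control outcomes, with weights K[(Pi - Pc)/h] normalized to sum to one."""
    p_ctrl = np.asarray(p_ctrl, dtype=float)
    y_ctrl = np.asarray(y_ctrl, dtype=float)
    diffs = []
    for p_i, y_i in zip(np.asarray(p_treat), np.asarray(y_treat)):
        w = epanechnikov((p_i - p_ctrl) / h)
        if w.sum() > 0.0:
            diffs.append(y_i - np.dot(w, y_ctrl) / w.sum())
    return np.mean(diffs)

With h = 0.06, as suggested above, controls whose scores differ from Pi by more than about 0.134 (that is, 0.06√5) receive zero weight.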

Once treatment observations, Oi, and control observations, Oc, are matched, the treatment
effect for this pair is computed as Oi – Oc. The average treatment effect (ATE) is then estimated
by the mean,


ÂTE = (1/Nmatch) ∑i=1…Nmatch (Oi − Oc)

STATISTICAL INFERENCE

In order to form a confidence interval around the estimated average treatment effect, it is
necessary to obtain an estimated standard error. This is done by reconstructing the entire sample
used in Steps 2 through 4 R times, using bootstrapping. By this method, we sample N
observations from the sample of N observations with replacement. Then ATE is computed R
times and the estimated standard error is the empirical standard deviation of the R observations.
This can be used to form a confidence interval for the ATE.
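
The bootstrap step can be sketched as follows, where att_fn stands for a user-supplied function (purely illustrative) that reruns steps 2 through 4 on a resampled data matrix and returns the estimated effect; the seed value is arbitrary.

import numpy as np

def bootstrap_se(att_fn, data, n_boot=25, seed=12345):
    """Empirical standard error of the matching estimator: re-estimate the
    treatment effect on n_boot resamples (with replacement) of the rows of
    data and take the standard deviation of the resulting estimates."""
    rng = np.random.default_rng(seed)
    n = len(data)
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)      # draw n row indices with replacement
        estimates.append(att_fn(data[idx]))   # rerun steps 2 through 4 on the resample
    return np.std(estimates, ddof=1)

The confidence interval is then formed in the usual way from the point estimate and this standard error, as in the output reproduced below.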

The end result of the computations will be a confidence interval for the expected
treatment effect on the treated individuals in the sample. For example, in the application that we
will present in Part 2 of this module, in which the outcome variable is the log of earnings and the
treatment is the National Supported Work Demonstration – see LaLonde (1986) – the following
is the set of final results:
+----------------------------------------------------------------------+
| Number of Treated observations = 185 Number of controls = 1157 |
| Estimated Average Treatment Effect = .156255 |
| Estimated Asymptotic Standard Error = .104204 |
| t statistic (ATT/Est.S.E.) = 1.499510 |
| Confidence Interval for ATT = ( -.047985 to .360496) 95% |
| Average Bootstrap estimate of ATT = .144897 |
| ATT - Average bootstrap estimate = .011358 |
+----------------------------------------------------------------------+

The overall estimate from the analysis is ATE = 0.156255, which suggests that the effect on
earnings that can be attributed to participation in the program is 15.6%. Based on the (25)
bootstrap replications, we obtained an estimated standard error of 0.104204. By forming a
confidence interval using this standard error, we obtain our interval estimate of the impact of the

 
program of (-4.80% to +36.05%). We would attribute the negative range to an unconstrained
estimate of the sampling variability of the estimator, not actually to a negative impact of the
program. v

CONCLUDING COMMENTS

The genius of James Heckman was in recognizing that sample selection problems are not
necessarily removed by bigger samples, because unobservables will continue to bias estimators.
His parametric solution to the sample selection problem has not been diminished by newer semi-
parametric techniques. It is true that results obtained from the two-equation system advanced by
Heckman over 30 years ago are sensitive to the correctness of the equations and their
identification. Newer methods such as regression discontinuity, however, are extremely limited
in their applications. As we will see in Module Four, Parts Two, Three and Four, methods such
as propensity score matching depend on the validity of the logit or probit functions estimated,
along with the methods used to obtain smoothness in the kernel density estimator. One of the
beauties of Heckman’s original selection adjustment method is that its results can be easily
replicated in LIMDEP, STATA and SAS. Such is not the case with the more recent
nonparametric and semi-parametric methods for addressing sample selection problems.

REFERENCES

Becker, William E. “Omitted Variables and Sample Selection Problems in Studies of College-
Going Decisions,” Public Policy and College Access: Investigating the Federal and State Role in
Equalizing Postsecondary Opportunity, Edward St. John (ed), 19. NY: AMS Press. 2004: 65-86.

_____ “Economics for a Higher Education,” International Review of Economics Education, 3, 1, 2004: 52-62.

_____ “Quit Lying and Address the Controversies: There Are No Dogmata, Laws, Rules or
Standards in the Science of Economics,” American Economist, 50, Spring 2008: 3-14.

_____ and William Walstad. “Data Loss from Pretest to Posttest as a Sample Selection
Problem,” Review of Economics and Statistics, 72, February 1990: 184-188.

_____ and John Powers. “Student Performance, Attrition, and Class Size Given Missing Student
Data,” Economics of Education Review, 20, August 2001: 377-388.

Becker, S. and A. Ichino. “Estimation of Average Treatment Effects Based on Propensity
Scores,” The Stata Journal, 2, 2002: 358-377.

 
Dehejia, R. and S. Wahba. “Causal Effects in Nonexperimental Studies: Reevaluation of the
Evaluation of Training Programs,” Journal of the American Statistical Association, 94, 1999:
1052-1062.

Epanechnikov, V. “Nonparametric Estimates of a Multivariate Probability Density,” Theory of
Probability and its Applications, 14, 1969: 153-158.

Greene, William H. “A Statistical Model for Credit Scoring.” Department of Economics, Stern
School of Business, New York University, September 29, 1992.

Heckman, James. “Sample Selection Bias as a Specification Error,” Econometrica, 47, 1979: 153-162.

Heckman, J., H. Ichimura, J. Smith and P. Todd. “Characterizing Selection Bias Using
Experimental Data,” Econometrica, 66, 5, 1998a: 1017-1098.

Heckman, J., H. Ichimura and P. Todd. “Matching as an Econometric Evaluation Estimator:
Evidence from Evaluating a Job Training Program,” Review of Economic Studies, 64, 4, 1997:
605-654.

Heckman, J., H. Ichimura and P. Todd. “Matching as an Econometric Evaluation Estimator,”
Review of Economic Studies, 65, 2, 1998b: 261-294.

Heckman, J., R. LaLonde, and J. Smith. “The Economics and Econometrics of Active Labour
Market Programmes,” in Ashenfelter, O. and D. Card (eds.), The Handbook of Labor Economics,
Vol. 3, North Holland, Amsterdam, 1999.

Huynh, Kim, David Jacho-Chavez, and James K. Self. “The Efficacy of Collaborative Learning
Recitation Sessions on Student Outcomes?” American Economic Review, (Forthcoming May
2010).

Imbens, Guido W. and Jeffrey M. Wooldridge. “Recent Developments in the Econometrics of
Program Evaluation,” Journal of Economic Literature, March 2009: 5-86.

Kane, Thomas. “A Quasi-Experimental Estimate of the Impact of Financial Aid on College-
Going.” NBER Working Paper No. W9703, May 2003.

Krueger, Alan B. and Molly F. McIntosh. “Using a Web-Based Questionnaire as an Aide for
High School Economics Instruction,” Journal of Economic Education, 39, Spring 2008: 174-197.

LaLonde, R., “Evaluating the Econometric Evaluations of Training Programs with Experimental
Data,” American Economic Review, 76, 4, 1986: 604-620.

Satterthwaite, F. E. “An Approximate Distribution of Estimates of Variance Components,”
Biometrics Bulletin, 2, 1946: 110-114.

 
van der Klaauw, W. “Estimating the Effect of Financial Aid Offers on College Enrollment: A
Regression-Discontinuity Approach,” International Economic Review, November 2002: 1249-1288.

ENDNOTES

                                                            
i. The opportunistic samples employed in the older versions of the TUCE as well as the new TUCE
4 have few observations from highly selective schools. The TUCE 4 is especially noteworthy because it
has only one such prestige school: Stanford University, where the class was taught by a non-tenure track
teacher. Thus, the TUCE 4 might reflect what those in the sample are taught and are able to do, but it
does not reflect what those in the know are teaching or what highly able students are able to do. For
example Alan Krueger (Princeton University) is listed as a member of the TUCE 4 “national panel of
distinguished economists;” yet, in a 2008 Journal of Economic Education article he writes: “a long
standing complaint of mine, as well as others, for example Becker 2007 and Becker 2004, is that
introductory economics courses have not kept up with the economics profession’s expanding emphasis on
data and empirical analysis.” Whether bright and motivated students at the leading institutions of higher
education can be expected to get all or close to all 33 multiple-choice questions correct on either the
micro or macro parts of the TUCE (because they figure out what the test designers want for an answer) or
score poorly (because they know more than what the multiple-choice questions assume) is open to
question and empirical testing. What is not debatable is that the TUCE 4 is based on a censored sample
that excludes those at and exposed to thinking at the forefront of the science of economics.
ii. Because Becker and Powers (2001) do not have any data before the start of the course, they
condition on those who are already in the course and only adjust their change-score model estimation for
attrition between the pretest and posttest. More recently, Huynh, Jacho-Chavez, and Self (2010) account
for selection into, out of and between collaborative learning sections of a large principles course in their
change-score modeling.
iii. Although Δyi is treated as a continuous variable, this is not essential. For example, a bivariate
choice (probit or logit) model can be specified to explicitly model the taking of a posttest decision as a
“yes” or “no” for students who enrolled in the course. The selection issue is then modeled in a way
similar to that employed by Greene (1992) on consumer loan default and credit card expenditures. As
with the standard Heckman selection model, this two-equation system involving bivariate choice and
selection can be estimated in a program like LIMDEP. 
iv. The procedure is not “parametric” in that it is not fully based on a parametric model. It is not
“nonparametric” in that it does employ a particular binary choice model to describe participation, or
receiving the treatment. But the binary choice model functions as an aggregator of a vector of variables
into a single score, not necessarily as a behavioral relationship. Perhaps “partially parametric” would be
appropriate here, but we have not seen this term used elsewhere.

 
                                                                                                                                                                                                
v. The example mentioned at several points in this discussion will be presented in much greater
detail in Part 2. The data will be analyzed with LIMDEP, Stata and SAS. We note at this point, there are
some issues with duplication of the results with the three programs and with the studies done by the
original authors. Some of these are numerical and specifically explainable. However, we do not
anticipate that results in Step 5 can be replicated across platforms. The reason is that Step 5 requires
generation of random numbers to draw the bootstrap samples. The pseudorandom number generators
used by different programs vary substantially, and these differences show up, for example, in bootstrap
replications. If the samples involved are large enough, this sort of random variation (chatter) gets
averaged out in the results. The sample in our real world application is not large enough to expect that
this chatter will be completely averaged out. As such, as will be evident later, there will be some small
variation across programs in the results that one obtains with our or any other small or moderately sized
data set.

 
MODULE FOUR, PART TWO: SAMPLE SELECTION

IN ECONOMIC EDUCATION RESEARCH USING LIMDEP (NLOGIT)

Part Two of Module Four provides a cookbook-type demonstration of the steps required to use
LIMDEP (NLOGIT) in situations involving estimation problems associated with sample
selection. Users of this module need to have completed Module One, Parts One and Two, but not
necessarily Modules Two and Three. From Module One users are assumed to know how to get
data into LIMDEP, recode and create variables within LIMDEP, and run and interpret regression
results. Module Four, Parts Three and Four demonstrate in STATA and SAS what is done here
in LIMDEP.

THE CASE, DATA, AND ROUTINE FOR EARLY HECKMAN ADJUSTMENT

The change score or difference in difference model is used extensively in education research.
Yet, before Becker and Walstad (1990), little if any attention was given to the consequence of
missing student records that result from: 1) "data cleaning" done by those collecting the data, 2)
student unwillingness to provide data, or 3) students self-selecting into or out of the study. The
implications of these types of sample selection are shown in the work of Becker and Powers
(2001) where the relationship between class size and student learning was explored using the
third edition of the Test of Understanding in College Economics (TUCE), which was produced
by Saunders (1994) for the National Council on Economic Education (NCEE), since renamed the
Council for Economic Education.

Module One, Part Two showed how to get the Becker and Powers data set
“beck8WO.csv” into LIMDEP (NLOGIT). As a brief review this was done with the read
command:
READ; NREC=2837; NVAR=64; FILE=k:\beck8WO.csv; Names=
A1,A2,X3, C,AL,AM,AN,CA,CB,CC,CH,CI,CJ,CK,CL,CM,CN,CO,CS,CT,
CU,CV,CW,DB,DD,DI,DJ,DK,DL,DM,DN,DQ,DR,DS,DY,DZ,EA,EB,EE,EF,
EI,EJ,EP,EQ,ER,ET,EY,EZ,FF,FN,FX,FY,FZ,GE,GH,GM,GN,GQ,GR,HB,
HC,HD,HE,HF $

where

A1: term, where 1= fall, 2 = spring


A2: school code, where 100/199 = doctorate,
200/299 = comprehensive,
300/399 = lib arts,
400/499 = 2 year
hb: initial class size (number taking preTUCE)
 
hc: final class size (number taking postTUCE)
dm: experience, as measured by number of years teaching
dj: teacher’s highest degree, where Bachelors=1, Masters=2, PhD=3
cc: postTUCE score (0 to 30)
an: preTUCE score (0 to 30)
ge: Student evaluation measured interest
gh: Student evaluation measured textbook quality
gm: Student evaluation measured regular instructor’s English ability
gq: Student evaluation measured overall teaching effectiveness
ci: Instructor sex (Male = 1, Female = 2)
ck: English is native language of instructor (Yes = 1, No = 0)
cs: PostTUCE score counts toward course grade (Yes = 1, No = 0)
ff: GPA*100
fn: Student had high school economics (Yes = 1, No = 0)
ey: Student’s sex (Male = 1, Female = 2)
fx: Student working in a job (Yes = 1, No = 0)

In Module One, Part Two the procedure for changing the size of the work space in earlier
versions of LIMDEP and NLOGIT was shown but that is no longer required for the 9th version
of LIMDEP and the 4th version of NLOGIT. Starting with LIMDEP version 9 and NLOGIT
version 4 the required work space is automatically determined by the “Read” command and
increased as needed with subsequent “Create” commands.

Separate dummy variables need to be created for each type of school (A2), which is done with
the following code:

recode; a2; 100/199 = 1; 200/299 = 2; 300/399 = 3; 400/499 =4$


create; doc=a2=1; comp=a2=2; lib=a2=3; twoyr=a2=4$

To create a dummy variable for whether the instructor had a PhD we use

Create; phd=dj=3$

To create a dummy variable for whether the student took the postTUCE we use

Create; final=cc>0$

 
To create a dummy variable for whether a student did (noeval = 0) or did not (noeval = 1)
complete a student evaluation of the instructor we use

Create; evalsum=ge+gh+gm+gq; noeval=evalsum=-36$

“Noeval” reflects whether the student was around toward the end of the term, attending classes,
and sufficiently motivated to complete an evaluation of the instructor. In the Saunders data set,
evaluation questions with no answer were coded -9; thus, these four questions summing to -36
indicates that no questions were answered.

And the change score is created with

Create; change=cc-an$

Finally, there was a correction for the term in which student record 2216 was incorrectly
recorded:

recode; hb; 90=89$

All of these recoding and create commands are entered into LIMDEP command file as follows:

recode; a2; 100/199 = 1; 200/299 = 2; 300/399 = 3; 400/499 =4$


create; doc=a2=1; comp=a2=2; lib=a2=3; twoyr=a2=4; phd=dj=3;final=cc>0;
evalsum=ge+gh+gm+gq; noeval=evalsum=-36$
Create; change=cc-an$
recode; hb; 90=89$ #2216 counted in term 2, but in term 1 with no posttest

To remove records with missing data the following is entered:

Reject; AN=-9$

 
Reject; HB=-9$
Reject; ci=-9$
Reject; ck=-9$
Reject; cs=0$
Reject; cs=-9$
Reject; a2=-9$
Reject; phd=-9$

The use of these data entry and management commands will appear in the LIMDEP (NLOGIT)
output file for the equations to be estimated in the next section.

THE PROPENSITY TO TAKE THE POSTTEST AND THE CHANGE SCORE EQUATION

To address attrition-type sample selection problems in change score studies, Becker and Powers
first add observations that were dropped during the early stage of assembling data for TUCE III.
Becker and Powers do not have any data on students before they enrolled in the course and thus
cannot address selection into the course, but to examine the effects of attrition (course
withdrawal) they introduce three measures of class size (beginning, ending, and average) and
argue that initial or beginning class size is the critical measure for assessing learning over the
entire length of the course. i To show the effects of initial class size on attrition (as discussed in
Module Four, Part One) they employ what is now the simplest and most restrictive of sample
correction methods, which can be traced to James Heckman (1979), recipient of the 2000 Nobel
Prize in Economics.

From Module Four, Part One, we have the data generating process for the difference between
post and preTUCE scores for the ith student ( Δy i ):

Δyi = Xi β + εi = β1 + ∑j=2…k βj xji + εi                                  (1)

where the data set of explanatory variables is matrix X, where Xi is the row of xji values for the
relevant variables believed to explain the ith student’s pretest and posttest scores, the β j ’s are the
associated slope coefficients in the vector β , and ε i is the individual random shock (caused, for
example, by unobservable attributes, events or environmental factors) that affect the ith student’s
test scores. Sample selection associated with students’ unwillingness to take the posttest
(dropping the course) results in correlation between the population error term and the regressors,
which biases the coefficient estimators in this change score model and makes them inconsistent.

The data generating process for the i th student’s propensity to take the posttest is:

Ti* = H i α + ωi (2)

 
where

Ti = 1 , if Ti* > 0 , and student i has a posttest score, and

Ti = 0 , if Ti* ≤ 0 , and student i does not have a posttest score.

T * is the vector of all students’ propensities to take a posttest.

H is the matrix of explanatory variables that are believed to drive these propensities.

α is the vector of slope coefficients corresponding to these observable variables.

ω is the vector of unobservable random shocks that affect each student’s propensity.

The effect of attrition between the pretest and posttest is reflected in the absence of a
posttest score for the ith student (Ti = 0). A Heckman adjustment for the resulting bias caused
by excluding those students from the change-score regression requires estimation of equation (2)
and the calculation of an inverse Mill’s ratio for each student who has a pretest. This inverse
Mill’s ratio is then added to the change-score regression (1) as another explanatory variable. In
essence, the inverse Mill’s ratio adjusts the error term for the missing students.

For the Heckman adjustment for sample selection each disturbance in vector ε , equation
(1), is assumed to be distributed bivariate normal with the corresponding disturbance term in the
ω vector of the selection equation (2). Thus, for the i th student we have:

(ε i ,ω i ) ~ bivariate normal (0,0, σ ε ,1, ρ ) (3)

and for all perturbations in the two-equation system we have:

E (ε) = E (ω) = 0, E (εε ') = σ ε2I, E (ωω ') = I, and E (εω') = ρσ ε I . (4)

That is, the disturbances have zero means and no covariance among students, the selection
equation disturbance has unit variance, and there is covariance between selection in getting a
posttest score and the measurement of the change score.

The regression for this censored sample of nT =1 students who took the posttest is now:

E ( Δyi | X i , Ti = 1) = Xi β + E (ε i | Ti * > 0); i = 1, 2,... nT =1 , for nT =1 < N (5)

which suggests the Heckman adjusted regression to be estimated:

E ( Δyi | Xi , Ti = 1) = Xi β + ( ρσ ε )λi ; i = 1, 2,... nT =1 (6)

 
where λi is the inverse Mill’s ratio (or hazard) such that λi = f ( −Ti* ) /[1 − F (−Ti* )] , and f (.)
and F (.) are the normal density and distribution functions. λi is the standardized mean of the
disturbance term ωi , for the i th student who took the posttest; it is close to zero only for those
well above the T = 1 threshold. The values of λ are generated from the estimated probit
selection equation (2) for all students.
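
For readers who want to see the two-step logic in compact form outside LIMDEP, here is a minimal Python sketch using statsmodels. The function and variable names are ours; X and H are assumed to already include constant columns; and, unlike the LIMDEP results reported below, the OLS standard errors in this sketch are not corrected for the two-step estimation.

import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

def heckman_two_step(y, X, T, H):
    """Two-step Heckit: (1) probit of the selection indicator T on H;
    (2) OLS of the outcome y on X plus the inverse Mill's ratio, using only
    the selected (T = 1) observations. X and H must include constant columns."""
    y, X, T, H = map(np.asarray, (y, X, T, H))
    probit = sm.Probit(T, H).fit(disp=0)           # selection equation (2)
    index = H @ np.asarray(probit.params)          # estimated index H'alpha for everyone
    lam = norm.pdf(index) / norm.cdf(index)        # inverse Mill's ratio
    sel = T == 1
    X_aug = np.column_stack([X[sel], lam[sel]])    # add lambda as a regressor
    ols = sm.OLS(y[sel], X_aug).fit()              # change-score equation (6)
    return probit, ols

The coefficient on the appended column estimates ρσε in equation (6).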

The probit command for the selection equation to be estimated in LIMDEP (NLOGIT) is

probit;lhs=final;rhs=one,an,hb,doc,comp,lib,ci,ck,phd,noeval;hold results$

where the “hold results” extension tells LIMDEP to hold the results for the change equation to
be estimated by least squares with the inverse Mill’s ratio used as regressor.

The command for estimating the adjusted change equation using both the inverse Mills
ratio as a regressor and maximum likelihood estimation of the ρ and σ ε is written

selection;lhs=change;rhs=one,hb,doc,comp,lib,ci,ck,phd,noeval;mle$

where the extension “mle” tells LIMDEP (NLOGIT) to use maximum likelihood estimation.

As described in Module One, Part Two, entering all of these commands into the
command file in LIMDEP (NLOGIT), highlighting the bunch and pressing the GO button yields
the following output file:

Initializing NLOGIT Version 4.0.7


--> READ; NREC=2837; NVAR=64; FILE=k:\beck8WO.csv; Names=
A1,A2,X3, C,AL,AM,AN,CA,CB,CC,CH,CI,CJ,CK,CL,CM,CN,CO,CS,CT,
CU,CV,CW,DB,DD,DI,DJ,DK,DL,DM,DN,DQ,DR,DS,DY,DZ,EA,EB,EE,EF,
EI,EJ,EP,EQ,ER,ET,EY,EZ,FF,FN,FX,FY,FZ,GE,GH,GM,GN,GQ,GR,HB,
HC,HD,HE,HF $
--> recode; a2; 100/199 = 1; 200/299 = 2; 300/399 = 3; 400/499 =4$
--> recode; hb; 90=89$ #2216 counted in term 2, but in term 1 with no posttest
--> create; doc=a2=1; comp=a2=2; lib=a2=3; twoyr=a2=4; phd=dj=3; final=cc>0;
evalsum=ge+gh+gm+gq; noeval=evalsum=-36$
--> Create; change=cc-an$
--> Reject; AN=-9$
--> Reject; HB=-9$
--> Reject; ci=-9$
--> Reject; ck=-9$
--> Reject; cs=0$
--> Reject; cs=-9$
--> Reject; a2=-9$
--> Reject; phd=-9$
--> probit;lhs=final;rhs=one,an,hb,doc,comp,lib,ci,ck,phd,noeval;hold results$

 
Normal exit: 6 iterations. Status=0. F= 822.7411

+---------------------------------------------+
| Binomial Probit Model |
| Dependent variable FINAL |
| Log likelihood function -822.7411 |
| Restricted log likelihood -1284.216 |
| Chi squared [ 9 d.f.] 922.95007 |
| Significance level .0000000 |
| McFadden Pseudo R-squared .3593438 |
| Estimation based on N = 2587, K = 10 |
| AIC = .6438 Bayes IC = .6664 |
| AICf.s. = .6438 HQIC = .6520 |
| Model estimated: Dec 08, 2009, 12:12:49 |
| Results retained for SELECTION model. |
| Hosmer-Lemeshow chi-squared = 26.06658 |
| P-value= .00102 with deg.fr. = 8 |

+--------+--------------+----------------+--------+--------+----------+
|Variable| Coefficient | Standard Error |b/St.Er.|P[|Z|>z]| Mean of X|
+--------+--------------+----------------+--------+--------+----------+
+--------+Index function for probability |
|Constant| .99535*** .24326 4.092 .0000 |
|AN | .02204** .00948 2.326 .0200 10.5968|
|HB | -.00488** .00192 -2.538 .0112 55.5589|
|DOC | .97571*** .14636 6.666 .0000 .31774|
|COMP | .40649*** .13927 2.919 .0035 .41786|
|LIB | .52144*** .17665 2.952 .0032 .13568|
|CI | .19873** .09169 2.168 .0302 1.23116|
|CK | .08779 .13429 .654 .5133 .91998|
|PHD | -.13351 .10303 -1.296 .1951 .68612|
|NOEVAL | -1.93052*** .07239 -26.668 .0000 .29068|
+--------+------------------------------------------------------------+
| Note: ***, **, * = Significance at 1%, 5%, 10% level. |
+---------------------------------------------------------------------+
+----------------------------------------+
| Fit Measures for Binomial Choice Model |
| Probit model for variable FINAL |
+----------------------------------------+
| Y=0 Y=1 Total|
| Proportions .19714 .80286 1.00000|
| Sample Size 510 2077 2587|
+----------------------------------------+
| Log Likelihood Functions for BC Model |
| P=0.50 P=N1/N P=Model|
| LogL = -1793.17 -1284.22 -822.74|
+----------------------------------------+
| Fit Measures based on Log Likelihood |
| McFadden = 1-(L/L0) = .35934|
| Estrella = 1-(L/L0)^(-2L0/n) = .35729|
| R-squared (ML) = .30006|
| Akaike Information Crit. = .64379|
| Schwartz Information Crit. = .66643|
+----------------------------------------+
| Fit Measures Based on Model Predictions|
| Efron = .39635|
| Ben Akiva and Lerman = .80562|
| Veall and Zimmerman = .52781|
| Cramer = .38789|
+----------------------------------------+
+---------------------------------------------------------+

 
|Predictions for Binary Choice Model. Predicted value is |
|1 when probability is greater than .500000, 0 otherwise.|
|Note, column or row total percentages may not sum to |
|100% because of rounding. Percentages are of full sample.|
+------+---------------------------------+----------------+
|Actual| Predicted Value | |
|Value | 0 1 | Total Actual |
+------+----------------+----------------+----------------+
| 0 | 342 ( 13.2%)| 168 ( 6.5%)| 510 ( 19.7%)|
| 1 | 197 ( 7.6%)| 1880 ( 72.7%)| 2077 ( 80.3%)|
+------+----------------+----------------+----------------+
|Total | 539 ( 20.8%)| 2048 ( 79.2%)| 2587 (100.0%)|
+------+----------------+----------------+----------------+
+---------------------------------------------------------+

|Crosstab for Binary Choice Model. Predicted probability |


|vs. actual outcome. Entry = Sum[Y(i,j)*Prob(i,m)] 0,1. |
|Note, column or row total percentages may not sum to |
|100% because of rounding. Percentages are of full sample.|
+------+---------------------------------+----------------+
|Actual| Predicted Probability | |
|Value | Prob(y=0) Prob(y=1) | Total Actual |
+------+----------------+----------------+----------------+
| y=0 | 259 ( 10.0%)| 250 ( 9.7%)| 510 ( 19.7%)|
| y=1 | 252 ( 9.7%)| 1824 ( 70.5%)| 2077 ( 80.2%)|
+------+----------------+----------------+----------------+
|Total | 512 ( 19.8%)| 2074 ( 80.2%)| 2587 ( 99.9%)|
+------+----------------+----------------+----------------+
=======================================================================
Analysis of Binary Choice Model Predictions Based on Threshold = .5000
-----------------------------------------------------------------------
Prediction Success
-----------------------------------------------------------------------
Sensitivity = actual 1s correctly predicted 87.819%
Specificity = actual 0s correctly predicted 50.784%
Positive predictive value = predicted 1s that were actual 1s 87.946%
Negative predictive value = predicted 0s that were actual 0s 50.586%
Correct prediction = actual 1s and 0s correctly predicted 80.518%
-----------------------------------------------------------------------
Prediction Failure
-----------------------------------------------------------------------
False pos. for true neg. = actual 0s predicted as 1s 49.020%
False neg. for true pos. = actual 1s predicted as 0s 12.133%
False pos. for predicted pos. = predicted 1s actual 0s 12.054%
False neg. for predicted neg. = predicted 0s actual 1s 49.219%
False predictions = actual 1s and 0s incorrectly predicted 19.405%
=======================================================================

--> selection;lhs=change;rhs=one,hb,doc,comp,lib,ci,ck,phd,noeval;mle$

+----------------------------------------------------------+
| Sample Selection Model |
| Probit selection equation based on FINAL |
| Selection rule is: Observations with FINAL = 1 |
| Results of selection: |
| Data points Sum of weights |
| Data set 2587 2587.0 |
| Selected sample 2077 2077.0 |
+----------------------------------------------------------+

 
+----------------------------------------------------+

| Sample Selection Model |


| Two step least squares regression |
| LHS=CHANGE Mean = 5.456909 |
| Standard deviation = 4.582964 |
| Number of observs. = 2077 |
| Model size Parameters = 10 |
| Degrees of freedom = 2067 |
| Residuals Sum of squares = 39226.14 |
| Standard error of e = 4.356298 |
| Fit R-squared = .0960355 |
| Adjusted R-squared = .0920996 |
| Model test F[ 9, 2067] (prob) = 24.40 (.0000) |
| Diagnostic Log likelihood = -5998.683 |
| Restricted(b=0) = -6108.548 |
| Chi-sq [ 9] (prob) = 219.73 (.0000) |
| Info criter. LogAmemiya Prd. Crt. = 2.948048 |
| Akaike Info. Criter. = 2.948048 |
| Bayes Info. Criter. = 2.975196 |
| Not using OLS or no constant. Rsqd & F may be < 0. |
| Model was estimated Dec 08, 2009 at 00:12:49PM |
| Standard error corrected for selection.. 4.36303 |
| Correlation of disturbance in regression |
| and Selection Criterion (Rho)........... .11132 |
+----------------------------------------------------+
+--------+--------------+----------------+--------+--------+----------+
|Variable| Coefficient | Standard Error |b/St.Er.|P[|Z|>z]| Mean of X|
+--------+--------------+----------------+--------+--------+----------+
|Constant| 6.74123*** .75107 8.976 .0000 |
|HB | -.01022* .00563 -1.815 .0695 55.7429|
|DOC | 2.07968*** .57645 3.608 .0003 .33558|
|COMP | -.32946 .44269 -.744 .4567 .40924|
|LIB | 2.27448*** .53733 4.233 .0000 .14011|
|CI | .40823 .25929 1.574 .1154 1.22773|
|CK | -2.73074*** .37755 -7.233 .0000 .91815|
|PHD | .63345** .29104 2.177 .0295 .69957|
|NOEVAL | -.88434 1.27223 -.695 .4870 .15744|
|LAMBDA | .48567 1.59683 .304 .7610 .21796|
+--------+------------------------------------------------------------+
| Note: ***, **, * = Significance at 1%, 5%, 10% level. |
+---------------------------------------------------------------------+

Normal exit: 25 iterations. Status=0. F= 6826.467

------------------------------------------------------------------
ML Estimates of Selection Model
Dependent variable CHANGE
Log likelihood function -6826.46734
Estimation based on N = 2587, K = 21
Information Criteria: Normalization=1/N
Normalized Unnormalized
AIC 5.29375 13694.93469
Fin.Smpl.AIC 5.29389 13695.29492
Bayes IC 5.34131 13817.95802
Hannan Quinn 5.31099 13739.52039
Model estimated: Mar 31, 2010, 15:17:41
FIRST 10 estimates are probit equation.
--------+---------------------------------------------------------
| Standard Prob.
CHANGE| Coefficient Error z z>|Z|

 
--------+---------------------------------------------------------
|Selection (probit) equation for FINAL
Constant| .99018*** .24020 4.12 .0000
AN| .02278** .00940 2.42 .0153
HB| -.00489** .00206 -2.37 .0178
DOC| .97154*** .15076 6.44 .0000
COMP| .40431*** .14433 2.80 .0051
LIB| .51505*** .19086 2.70 .0070
CI| .19927** .09054 2.20 .0277
CK| .08590 .11902 .72 .4705
PHD| -.13208 .09787 -1.35 .1772
NOEVAL| -1.92902*** .07138 -27.03 .0000
|Corrected regression, Regime 1
Constant| 6.81754*** .72389 9.42 .0000
HB| -.00978* .00559 -1.75 .0803
DOC| 1.99729*** .55348 3.61 .0003
COMP| -.36198 .43327 -.84 .4034
LIB| 2.23154*** .50534 4.42 .0000
CI| .39401 .25339 1.55 .1199
CK| -2.74337*** .38031 -7.21 .0000
PHD| .64209** .28964 2.22 .0266
NOEVAL| -.63201 1.26902 -.50 .6185
SIGMA(1)| 4.35713*** .07012 62.14 .0000
RHO(1,2)| .03706 .35739 .10 .9174
--------+---------------------------------------------------------

 
 

The estimated probit model (from the probit output above) is

Estimated propensity to take the posttest = 0.995 + 0.022(preTUCE score)

− 0.005(initial class size) + 0.976(Doctoral Institution)

+ 0.406 (Comprehensive Institution) + 0.521(Liberal Arts Institution)

+ 0.199 (Male instructor) + 0.0878(English Instructor Native Language)

− 0.134(Instructor has PhD ) − 1.930(No Evaluation of Instructor)

The beginning or initial class size is negatively and highly significantly related to the propensity
to take the posttest, with a one-tail p value of 0.0056.

The corresponding change-score equation employing the inverse Mills ratio, from the two-step output above, is:

Predicted Change = 6.741 − 0.010(initial class size) + 2.080(Doctoral Institution)

− 0.329(Comprehensive Institution) + 2.274(Liberal Arts Institution)

+ 0.408(Male instructor) − 2.731(English Instructor Native Language)

+ 0.633(Instructor has PhD) − 0.884(No Evaluation of Instructor) + 0.486λ

The change score is negatively and significantly related to the class size, with a one-tail p value
of 0.0347, but it takes an additional 100 students to lower the change score by a point.

The maximum likelihood output above provides estimates of both the probit equation and the
change score equation with separate estimation of ρ and σε. The top panel provides the probit
coefficients for the propensity equation, where it is shown that initial class size is negatively and
significantly related to the propensity to take the posttest with a one-tail p value of 0.009. The
second panel gives the change score results, where initial class size is negatively and
significantly related to the change score with a one-tail p value of 0.040. Again, it takes
approximately 100 students to move the change score in the opposite direction by a point.

As a closing comment on the estimation of the Heckit model, it is worth pointing out that
there is no unique way to estimate the standard errors via maximum likelihood computer
routines. Historically, LIMDEP used the conventional second derivatives matrix to compute
standard errors for the maximum likelihood estimation of the two-equation Heckit model. In the
process of preparing this module, differences in standard errors produced by LIMDEP and
STATA suggested that STATA was using the alternative outer products of the first derivatives.
To achieve consistency, Bill Greene modified the LIMDEP routine in April 2010 so that it also
now uses the outer products of the first derivatives.
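
As general background on why such differences can arise (and not a description of either program’s internal code), two conventional estimators of the asymptotic covariance matrix of a maximum likelihood estimator are

Est.Var(Hessian) = [ −∑i Ĥi ]⁻¹   and   Est.Var(OPG) = [ ∑i ĝi ĝi′ ]⁻¹,

where ĝi and Ĥi are the first- and second-derivative contributions of the ith observation to the log likelihood, evaluated at the maximum likelihood estimates. Both are consistent estimators of the same asymptotic covariance matrix when the model is correctly specified, but they generally differ in finite samples, so identical coefficient estimates can be accompanied by somewhat different standard errors.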

 
AN APPLICATION OF PROPENSITY SCORE MATCHING

 
Unfortunately, we are not aware of a study in economic education for which propensity score
matching has been used. Thus, we looked outside economic education and elected to redo the
example reported in Becker and Ichino (2002). This application and data are derived from
Dehejia and Wahba (1999), whose study, in turn, was based on LaLonde (1986). The data set
consists of observed samples of treatments and controls from the National Supported Work
demonstration. Some of the institutional features of the data set are given by Becker and Ichino.
The data were downloaded from the website http://www.nber.org/~rdehejia/nswdata.html. The
data set used here is in the original text form, contained in the data file “matchingdata.txt.” They
have been assembled from the several parts in the NBER archive.

Becker and Ichino report that they were unable to replicate Dehejia and Wahba’s results,
though they did obtain similar results. (They indicate that they did not have the original authors’
specifications of the number of blocks used in the partitioning of the range of propensity scores,
significance levels, or exact procedures for testing the balancing property.) In turn, we could
not precisely replicate Becker and Ichino’s results – we can identify the reason, as discussed
below. Likewise, however, we obtain similar results.

There are 2,675 observations in the data set, 2490 controls (with t = 0) and 185 treated
observations (with t = 1). The variables in the raw data set are

t = treatment dummy variable


age = age in years
educ = education in years
black = dummy variable for black
hisp = dummy variable for Hispanic
marr = dummy variable for married
nodegree = dummy for no degree (not used)
re74 = real earnings in 1974
re75 = real earnings in 1975
re78 = real earnings in 1978 – the outcome variable

We will analyze these data following Becker and Ichino’s line of analysis. We assume
that you have completed Module One, Part Two, and thus are familiar with placing commands in
the text editor and using the GO button to submit commands, and where results are found in the
output window. In what follows, we will simply show the commands you need to enter into
LIMDEP (NLOGIT) to produce the results that we will discuss.

To start, the data are imported by using the command (where the data file is on the C
drive but your data could be placed wherever):

 
READ ; file=C:\matchingdata.txt;
names=t,age,educ,black,hisp,marr,nodegree,re74,re75,re78;nvar=10;nobs=2675$

Transformed variables added to the equation are

  age2 = age squared 
  educ2 = educ squared 
  re742 = re74 squared 
  re752 = re75 squared 
  blacku74 = black times 1(re74 = 0) 
 
In order to improve the readability of some of the reported results, we have divided the
income variables by 10,000. (This is also an important adjustment that accommodates a
numerical problem with the original data set. This is discussed below.) The outcome variable is
re78.

The data are set up and described first. The transformations used to create the
transformed variables are

CREATE ; age2 = age^2 ; educ2 = educ^2 $


CREATE ; re74 = re74/10000 ; re75 = re75/10000 ; re78 = re78/10000 $
CREATE ; re742 = re74^2 ; re752 = re75^2 $
CREATE ; blacku74 = black * (re74 = 0) $

The data are described with the following statistics:

DSTAT ; Rhs = * $
Descriptive Statistics
All results based on nonmissing observations.
==============================================================================
Variable Mean Std.Dev. Minimum Maximum Cases Missing
==============================================================================
All observations in current sample
--------+---------------------------------------------------------------------
T| .691589E-01 .253772 .000000 1.00000 2675 0
AGE| 34.2258 10.4998 17.0000 55.0000 2675 0
EDUC| 11.9944 3.05356 .000000 17.0000 2675 0
BLACK| .291589 .454579 .000000 1.00000 2675 0
HISP| .343925E-01 .182269 .000000 1.00000 2675 0
MARR| .819439 .384726 .000000 1.00000 2675 0
NODEGREE| .333084 .471404 .000000 1.00000 2675 0
RE74| 1.82300 1.37223 .000000 13.7149 2675 0
RE75| 1.78509 1.38778 .000000 15.6653 2675 0
RE78| 2.05024 1.56325 .000000 12.1174 2675 0
AGE2| 1281.61 766.842 289.000 3025.00 2675 0
EDUC2| 153.186 70.6223 .000000 289.000 2675 0

 
RE742| 5.20563 8.46589 .000000 188.098 2675 0
RE752| 5.11175 8.90808 .000000 245.402 2675 0
BLACKU74| .549533E-01 .227932 .000000 1.00000 2675 0

We next fit the logit model for the propensity scores. An immediate problem arises with
the data set as used by Becker and Ichino. The income data are in raw dollar terms – the mean of
re74, for example, is $18,230.00. Its square, which is on the order of 300,000,000 (the square of
re75 is similar), is included in the logit equation along with a dummy variable for Hispanic, which
is zero for 96.5% of the observations, and the blacku74 dummy variable, which is
zero for 94.5% of the observations. Because of the extreme difference in magnitudes, estimation
of the logit model in this form is next to impossible. But rescaling the data by dividing the
income variables by 10,000 addresses the instability problem. ii These transformations are shown
in the second CREATE command above. This has no impact on the results produced with the
data, other than stabilizing the estimation of the logit equation. We are now quite able to
replicate the Becker and Ichino results except for an occasional very low order digit.

The logit model from which the propensity scores are obtained is fit using

NAMELIST ; X = age,age2,educ,educ2,marr,black,hisp,
re74,re75,re742,re752,blacku74,one $
LOGIT ; Lhs = t ; Rhs = x ; Hold $

(Note: Becker and Ichino’s coefficients on re74 and re75 are multiplied by 10,000, and
coefficients on re742 and re752 are multiplied by 100,000,000. Some additional logit results
from LIMDEP are omitted. Becker and Ichino’s results are included in the results for
comparison.)

------------------------------------------------------------------
Binary Logit Model for Binary Choice
Dependent variable T Becker/Ichino
Log likelihood function -204.97536 (-204.97537)
Restricted log likelihood -672.64954 (identical)
Chi squared [ 12 d.f.] 935.34837
Significance level .00000
McFadden Pseudo R-squared .6952717
Estimation based on N = 2675, K = 13
Information Criteria: Normalization=1/N
Normalized Unnormalized
AIC .16297 435.95071
Fin.Smpl.AIC .16302 436.08750
Bayes IC .19160 512.54287
Hannan Quinn .17333 463.66183
Hosmer-Lemeshow chi-squared = 12.77381
P-value= .11987 with deg.fr. = 8
--------+-----------------------------------------------------
| Standard Prob. Mean
T| Coefficient Error z z>|Z| of X
--------+----------------------------------------------------- Becker/Ichino
|Characteristics in numerator of Prob[Y = 1] Coeff. |t|
AGE| .33169*** .12033 2.76 .0058 34.2258 .3316904 (2.76)

 
AGE2| -.00637*** .00186 -3.43 .0006 1281.61 -.0063668 (3.43)
EDUC| .84927** .34771 2.44 .0146 11.9944 .8492683 (2.44)
EDUC2| -.05062*** .01725 -2.93 .0033 153.186 -.0506202 (2.93)
MARR| -1.88554*** .29933 -6.30 .0000 .81944 -1.885542 (6.30)
BLACK| 1.13597*** .35179 3.23 .0012 .29159 1.135973 (3.23)
HISP| 1.96902*** .56686 3.47 .0005 .03439 1.969020 (3.47)
RE74| -1.05896*** .35252 -3.00 .0027 1.82300 -.1059000 (3.00)
RE75| -2.16854*** .41423 -5.24 .0000 1.78509 -.2169000 (5.24)
RE742| .23892*** .06429 3.72 .0002 5.20563 .2390000 (3.72)
RE752| .01359 .06654 .20 .8381 5.11175 .0136000 (0.21)
BLACKU74| 2.14413*** .42682 5.02 .0000 .05495 2.144129 (5.02)
Constant| -7.47474*** 2.44351 -3.06 .0022 -7.474742 (3.06)
--------+-----------------------------------------------------
Note: ***, **, * ==> Significance at 1%, 5%, 10% level.
--------------------------------------------------------------
+---------------------------------------------------------+
|Predictions for Binary Choice Model. Predicted value is |
|1 when probability is greater than .500000, 0 otherwise.|
|Note, column or row total percentages may not sum to |
|100% because of rounding. Percentages are of full sample.|
+------+---------------------------------+----------------+
|Actual| Predicted Value | |
|Value | 0 1 | Total Actual |
+------+----------------+----------------+----------------+
| 0 | 2463 ( 92.1%)| 27 ( 1.0%)| 2490 ( 93.1%)|
| 1 | 51 ( 1.9%)| 134 ( 5.0%)| 185 ( 6.9%)|
+------+----------------+----------------+----------------+
|Total | 2514 ( 94.0%)| 161 ( 6.0%)| 2675 (100.0%)|
+------+----------------+----------------+----------------+
 
 
The first set of matching results uses the kernel estimator for the neighbors, lists the
intermediate results, and uses only the observations in the common support. iii

MATCH ; Lhs = re78 ; Kernel ; List ; Common Support $

The estimated propensity score function is echoed first. This merely reports the earlier
estimated binary choice model for the treatment assignment. The treatment assignment model is
not reestimated. (The ;Hold in the LOGIT or PROBIT command stores the estimated model for
this use.)

+---------------------------------------------------+
| ******* Propensity Score Matching Analysis ****** |
| Treatment variable = T , Outcome = RE78 |
| Sample In Use |
| Total number of observations = 2675 |
| Number of valid (complete) obs. = 2675 |
| Number used (in common support) = 1342 |
| Sample Partitioning of Data In Use |
| Treated Controls Total |
| Observations 185 1157 1342 |
| Sample Proportion 13.79% 86.21% 100.00% |
+---------------------------------------------------+

+-------------------------------------------------------------+

 
| Propensity Score Function = Logit based on T |
| Variable Coefficient Standard Error t statistic |
| AGE .33169 .12032986 2.757 |
| AGE2 -.00637 .00185539 -3.432 |
| EDUC .84927 .34770583 2.442 |
| EDUC2 -.05062 .01724929 -2.935 |
| MARR -1.88554 .29933086 -6.299 |
| BLACK 1.13597 .35178542 3.229 |
| HISP 1.96902 .56685941 3.474 |
| RE74 -1.05896 .35251776 -3.004 |
| RE75 -2.16854 .41423244 -5.235 |
| RE742 .23892 .06429271 3.716 |
| RE752 .01359 .06653758 .204 |
| BLACKU74 2.14413 .42681518 5.024 |
| ONE -7.47474 2.44351058 -3.059 |
| Note:Estimation sample may not be the sample analyzed here. |
| Observations analyzed are restricted to the common support =|
| only controls with propensity in the range of the treated. |
+-------------------------------------------------------------+

The note in the reported logit results reports how the common support is defined, that is,
as the range of variation of the scores for the treated observations.

The next set of results reports the iterations that partition the range of estimated
probabilities. The report includes the results of the F tests within the partitions as well as the
details of the full partition itself. The balancing hypothesis is rejected when the p value is less
than 0.01 within the cell. Becker and Ichino do not report the results of this search for their data,
but do report that they ultimately found seven blocks, as we did. They do not report the means by
which the test of equality is carried out within the blocks or the critical value used.

Partitioning the range of propensity scores


================================================================================
Iteration 1. Partitioning range of propensity scores into 5 intervals.
================================================================================
Range Controls Treatment
# Obs. Mean PS S.D. PS # obs. Mean PS S.D. PS F Prob
---------------- ---------------------- ---------------------- -------------
.00061 .19554 1081 .02111 .03337 17 .07358 .05835 13.68 .0020
.19554 .39047 41 .28538 .05956 26 .30732 .05917 2.18 .1460
.39047 .58540 15 .49681 .05098 20 .49273 .06228 .05 .8327
.58540 .78033 13 .68950 .04660 19 .64573 .04769 6.68 .0157
.78033 .97525 7 .96240 .00713 103 .93022 .05405 29.05 .0000
Iteration 1 Mean scores are not equal in at least one cell
================================================================================

Iteration 2. Partitioning range of propensity scores into 6 intervals.


================================================================================
Range Controls Treatment
# Obs. Mean PS S.D. PS # obs. Mean PS S.D. PS F Prob
---------------- ---------------------- ---------------------- -------------
.00061 .09807 1026 .01522 .02121 11 .03636 .03246 4.64 .0566
.09807 .19554 55 .13104 .02762 6 .14183 .02272 1.16 .3163
.19554 .39047 41 .28538 .05956 26 .30732 .05917 2.18 .1460
.39047 .58540 15 .49681 .05098 20 .49273 .06228 .05 .8327
.58540 .78033 13 .68950 .04660 19 .64573 .04769 6.68 .0157

 
.78033 .97525 7 .96240 .00713 103 .93022 .05405 29.05 .0000
Iteration 2 Mean scores are not equal in at least one cell
================================================================================
Iteration 3. Partitioning range of propensity scores into 7 intervals.
================================================================================
Range Controls Treatment
# Obs. Mean PS S.D. PS # obs. Mean PS S.D. PS F Prob
---------------- ---------------------- ---------------------- -------------
.00061 .09807 1026 .01522 .02121 11 .03636 .03246 4.64 .0566
.09807 .19554 55 .13104 .02762 6 .14183 .02272 1.16 .3163
.19554 .39047 41 .28538 .05956 26 .30732 .05917 2.18 .1460
.39047 .58540 15 .49681 .05098 20 .49273 .06228 .05 .8327
.58540 .78033 13 .68950 .04660 19 .64573 .04769 6.68 .0157
.78033 .87779 0 .00000 .00000 17 .81736 .02800 .00 1.0000
.87779 .97525 7 .96240 .00713 86 .95253 .01813 8.77 .0103
Mean PSCORES are tested equal within the blocks listed below

After partitioning the range of the propensity scores, we report the empirical distribution of the
propensity scores and the boundaries of the blocks estimated above. The values below show the
percentiles that are also reported by Becker and Ichino. The reported search algorithm notwithstanding,
the block boundaries reported by Becker and Ichino, shown below, are roughly the same.

+-------------------------------------------------------------+
| Empirical Distribution of Propensity Scores in Sample Used | Becker/Ichino
| Percent Lower Upper Sample size = 1342 | Percentiles (lower)
| 0% - 5% .000611 .000801 Average score .137746 | .0006426
| 5% - 10% .000802 .001088 Std.Dev score .274560 | .0008025
| 10% - 15% .001093 .001378 Variance .075383 | .0010932
| 15% - 20% .001380 .001809 Blocks used to test balance |
| 20% - 25% .001815 .002355 Lower Upper # obs |
| 25% - 30% .002355 .003022 1 .000611 .098075 1037 | .0023546
| 30% - 35% .003046 .004094 2 .098075 .195539 61 |
| 35% - 40% .004097 .005299 3 .195539 .390468 67 |
| 40% - 45% .005315 .007631 4 .390468 .585397 35 |
| 45% - 50% .007632 .010652 5 .585397 .780325 32 |
| 50% - 55% .010682 .015103 6 .780325 .877790 17 | .0106667
| 55% - 60% .015105 .022858 7 .877790 .975254 93 |
| 60% - 65% .022888 .035187 |
| 65% - 70% .035316 .051474 |
| 70% - 75% .051488 .075104 |
| 75% - 80% .075712 .135218 | .0757115
| 80% - 85% .135644 .322967 |
| 85% - 90% .335230 .616205 |
| 90% - 95% .625082 .949302 | .6250832
| 95% - 100% .949302 .975254 | .949382 to .970598
+-------------------------------------------------------------+

The blocks used for the balancing hypothesis are shown at the right in the table above. Becker
and Ichino report that they used the following blocks and sample sizes:

Lower Upper Observations


1 0.0006 0.05 931
2 0.05 0.10 106
3 0.10 0.20 3
4 0.20 0.40 69
5 0.40 0.60 35

 
6 0.60 0.80 33
7 0.80 1.00 105

At this point, our results begin to differ somewhat from those of Becker and Ichino because they
are using a different (cruder) blocking arrangement for the ranges of the propensity scores. This should
not affect the ultimate estimation of the ATE; it is an intermediate step in the analysis that is a check on
the reliability of the procedure.

The next set of results reports the analysis of the balancing property for the independent variables.
A test is reported for each variable in each block as listed in the table above. The lines marked (by the
program) with “*” show cells in which one or the other group had no observations, so the F test could not
be carried out. This was treated as a “success” in each analysis. Lines marked with an “o” note where the
balancing property failed. There are only four of these, but those we do find are not borderline. Becker
and Ichino report their finding that the balancing property is satisfied. Note that our finding does not
prevent further analysis. It merely suggests that the analyst might want to consider a richer
specification of the propensity score model.

Examining exogenous variables for balancing hypothesis


* Indicates no observations, treatment and/or controls, for test.
o Indicates means of treated and controls differ significantly.
=================================================================
Variable Interval Mean Control Mean Treated F Prob
-------- -------- ------------ ------------ ------ -----
AGE 1 31.459064 30.363636 .41 .5369
AGE 2 27.727273 26.500000 .10 .7587
AGE 3 28.170732 28.769231 .07 .7892
AGE 4 26.800000 25.050000 .44 .5096
AGE 5 24.846154 24.210526 .10 .7544
AGE 6 .000000 30.823529 .00 1.0000 *
AGE 7 23.285714 23.837209 .55 .4653
AGE2 1 1081.180312 953.454545 1.43 .2576
AGE2 2 822.200000 783.833333 .02 .8856
AGE2 3 873.341463 906.076923 .05 .8202
AGE2 4 774.400000 690.350000 .25 .6193
AGE2 5 644.230769 623.789474 .03 .8568
AGE2 6 .000000 1003.058824 .00 1.0000 *
AGE2 7 543.857143 596.023256 1.99 .1666
EDUC 1 11.208577 11.545455 .37 .5575
EDUC 2 10.636364 10.166667 .40 .5463
EDUC 3 10.414634 10.076923 .31 .5819
EDUC 4 10.200000 10.150000 .01 1.0000
EDUC 5 10.230769 11.000000 1.03 .3218
EDUC 6 .000000 11.058824 .00 1.0000 *
EDUC 7 10.571429 10.046512 .86 .3799
EDUC2 1 132.446394 136.636364 .11 .7420
EDUC2 2 117.618182 106.166667 .60 .4624
EDUC2 3 113.878049 107.769231 .31 .5829
EDUC2 4 108.066667 107.650000 .00 1.0000
EDUC2 5 109.923077 124.263158 .83 .3703
EDUC2 6 .000000 124.705882 .00 1.0000 *
EDUC2 7 113.714286 104.302326 .70 .4275
MARR 1 .832359 .818182 .01 .9056
MARR 2 .563636 .833333 2.63 .1433
MARR 3 .268293 .269231 .00 1.0000
MARR 4 .200000 .050000 1.73 .2032
MARR 5 .153846 .210526 .17 .6821
MARR 6 .000000 .529412 .00 1.0000 *
MARR 7 .000000 .000000 .00 1.0000
BLACK 1 .358674 .636364 3.63 .0833
BLACK 2 .600000 .500000 .22 .6553
BLACK 3 .780488 .769231 .01 .9150
BLACK 4 .866667 .500000 6.65 .0145
BLACK 5 .846154 .947368 .81 .3792
BLACK 6 .000000 .941176 .00 1.0000 *
BLACK 7 1.000000 .953488 .00 1.0000 *
HISP 1 .048733 .000000 52.46 .0000 o
HISP 2 .072727 .333333 1.77 .2311
HISP 3 .048780 .000000 2.10 .1547
HISP 4 .066667 .150000 .66 .4224
HISP 5 .153846 .052632 .81 .3792
HISP 6 .000000 .058824 .00 1.0000 *
HISP 7 .000000 .046512 4.19 .0436
RE74 1 1.230846 1.214261 .00 1.0000
RE74 2 .592119 .237027 10.63 .0041 o
RE74 3 .584965 .547003 .06 .8074
RE74 4 .253634 .298130 .16 .6875
RE74 5 .154631 .197888 .44 .5108
RE74 6 .000000 .002619 .00 1.0000 *
RE74 7 .000000 .000000 .00 1.0000
RE75 1 1.044680 .896447 .41 .5343
RE75 2 .413079 .379168 .09 .7653
RE75 3 .276234 .279825 .00 1.0000
RE75 4 .286058 .169340 2.39 .1319
RE75 5 .137276 .139118 .00 1.0000
RE75 6 .000000 .061722 .00 1.0000 *
RE75 7 .012788 .021539 .37 .5509
RE742 1 2.391922 2.335453 .00 1.0000
RE742 2 .672950 .092200 9.28 .0035 o
RE742 3 .638937 .734157 .09 .7625
RE742 4 .127254 .245461 1.14 .2936
RE742 5 .040070 .095745 1.31 .2647
RE742 6 .000000 .000117 .00 1.0000 *
RE742 7 .000000 .000000 .00 1.0000
RE752 1 1.779930 1.383457 .43 .5207
RE752 2 .313295 .201080 1.48 .2466
RE752 3 .151139 .135407 .14 .7133
RE752 4 .128831 .079975 .97 .3308
RE752 5 .088541 .037465 .51 .4894
RE752 6 .000000 .037719 .00 1.0000 *
RE752 7 .001145 .005973 2.57 .1124
BLACKU74 1 .014620 .000000 15.12 .0001 o
BLACKU74 2 .054545 .000000 3.17 .0804
BLACKU74 3 .121951 .192308 .58 .4515
BLACKU74 4 .200000 .100000 .66 .4242
BLACKU74 5 .230769 .315789 .29 .5952
BLACKU74 6 .000000 .941176 .00 1.0000 *
BLACKU74 7 1.000000 .953488 .00 1.0000 *
Variable BLACKU74 is unbalanced in block 1
Other variables may also be unbalanced
You might want to respecify the index function for the P-scores

This part of the analysis ends with a recommendation that the analyst reexamine the specification
of the propensity score model. Because this is not a numerical problem, the analysis continues with
estimation of the average treatment effect on the treated.

The first example below shows estimation using the kernel estimator to define the counterpart
observation from the controls and using only the subsample in the common support. This stage consists of
nboot + 1 iterations. In order to be able to replicate the results, we set the seed of the random number
generator before computing the results.
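For reference, the kernel estimator constructs, for each treated observation, a weighted average of the outcomes of all the controls, weighting each control by how close its propensity score is to that of the treated observation. In generic notation (ours, not Becker and Ichino's), with p denoting the estimated propensity scores, h the bandwidth, and K(·) the kernel function (the Epanechnikov kernel with h = 0.06 in the output below), the estimator of the average treatment effect on the treated is

ATT = (1/n_T) Σ_{i ∈ Treated} [ y_i − Σ_{j ∈ Controls} w_ij y_j ],   where  w_ij = K((p_j − p_i)/h) / Σ_{k ∈ Controls} K((p_k − p_i)/h).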

CALC ; Ran(1234579) $
MATCH ; Lhs = re78 ; Kernel ; List ; Common Support $

The first result is the actual estimation, which is reported in the intermediate results. Then the
nboot repetitions are reported. (These will be omitted if ; List is not included in the command.) Recall,
we divided the income values by 10,000. The value of .156255 reported below thus corresponds to
$1,562.55. Becker and Ichino report a value (see their section 6.4) of $1,537.94. Using the bootstrap
replications, we have estimated the asymptotic standard error to be $1,042.04. A 95% confidence interval
for the treatment effect is computed as $1,562.55 ± 1.96($1,042.04) = (-$479.85, $3,604.96), as reported in the output below.

+----------------------------------------------------------------------+
| Estimated Average Treatment Effect (T ) Outcome is RE78 |
| Kernel Using Epanechnikov kernel with bandwidth = .0600 |
| Note, controls may be reused in defining matches. |
| Number of bootstrap replications used to obtain variance = 25 |
+----------------------------------------------------------------------+
Estimated average treatment effect = .156255
Begin bootstrap iterations *******************************************
Boootstrap estimate 1 = .099594
Boootstrap estimate 2 = .109812
Boootstrap estimate 3 = .152911
Boootstrap estimate 4 = .168743
Boootstrap estimate 5 = -.015677
Boootstrap estimate 6 = .052938
Boootstrap estimate 7 = -.003275
Boootstrap estimate 8 = .212767
Boootstrap estimate 9 = -.042274
Boootstrap estimate 10 = .053342
Boootstrap estimate 11 = .351122
Boootstrap estimate 12 = .117883
Boootstrap estimate 13 = .181123
Boootstrap estimate 14 = .111917
Boootstrap estimate 15 = .181256
Boootstrap estimate 16 = -.012129
Boootstrap estimate 17 = .240363
Boootstrap estimate 18 = .201321
Boootstrap estimate 19 = .169463
Boootstrap estimate 20 = .238131
Boootstrap estimate 21 = .358050
Boootstrap estimate 22 = .199020
Boootstrap estimate 23 = .083503
Boootstrap estimate 24 = .146215
Boootstrap estimate 25 = .266303
End bootstrap iterations *******************************************
+----------------------------------------------------------------------+
| Number of Treated observations = 185 Number of controls = 1157 |
| Estimated Average Treatment Effect = .156255 | (.153794)
| Estimated Asymptotic Standard Error = .104204 | (.101687)
| t statistic (ATT/Est.S.E.) = 1.499510 |
| Confidence Interval for ATT = ( -.047985 to .360496) 95% |
| Average Bootstrap estimate of ATT = .144897 |
| ATT - Average bootstrap estimate = .011358 |
+----------------------------------------------------------------------+

Note that the estimated asymptotic standard error is somewhat different. As we noted earlier,
because of differences in random number generators, the bootstrap replications will differ across
programs. It will generally not be possible to exactly replicate results generated with different computer
programs. With a specific computer program, replication is obtained by setting the seed of the random
number generator. (The specific seed chosen is immaterial, so long as the same seed is used each time.)

The next set of estimates is based on all of the program defaults. The single nearest neighbor is
used for the counterpart observation; 25 bootstrap replications are used to compute the standard deviation,
and the full range of propensity scores (rather than the common support) is used. Intermediate output is
also suppressed. Once again, we set the seed for the random number generator before estimation.

CALC ; Ran(1234579) $
MATCH ; Rhs = re78 $

Partitioning the range of propensity scores


Iteration 1 Mean scores are not equal in at least one cell
Iteration 2 Mean scores are not equal in at least one cell
Mean PSCORES are tested equal within the blocks listed below.

+-------------------------------------------------------------+
| Empirical Distribution of Propensity Scores in Sample Used |
| Percent Lower Upper Sample size = 2675 |
| 0% - 5% .000000 .000000 Average score .069159 |
| 5% - 10% .000000 .000002 Std.Dev score .206287 |
| 10% - 15% .000002 .000006 Variance .042555 |
| 15% - 20% .000007 .000015 Blocks used to test balance |
| 20% - 25% .000016 .000032 Lower Upper # obs |
| 25% - 30% .000032 .000064 1 .000000 .097525 2370 |
| 30% - 35% .000064 .000121 2 .097525 .195051 60 |
| 35% - 40% .000121 .000204 3 .195051 .390102 68 |
| 40% - 45% .000204 .000368 4 .390102 .585152 35 |
| 45% - 50% .000368 .000618 5 .585152 .780203 32 |
| 50% - 55% .000618 .001110 6 .780203 .877729 17 |
| 55% - 60% .001123 .001851 7 .877729 .975254 93 |
| 60% - 65% .001854 .003047 |
| 65% - 70% .003057 .005451 |
| 70% - 75% .005451 .010756 |
| 75% - 80% .010877 .023117 |
| 80% - 85% .023149 .051488 |
| 85% - 90% .051703 .135644 |
| 90% - 95% .136043 .625082 |
| 95% - 100% .625269 .975254 |
+-------------------------------------------------------------+
Examining exogenous variables for balancing hypothesis
Variable BLACKU74 is unbalanced in block 1
Other variables may also be unbalanced
You might want to respecify the index function for the P-scores

+----------------------------------------------------------------------+
| Estimated Average Treatment Effect (T ) Outcome is RE78 |
| Nearest Neighbor Using average of 1 closest neighbors |
| Note, controls may be reused in defining matches. |
| Number of bootstrap replications used to obtain variance = 25 |
+----------------------------------------------------------------------+
Estimated average treatment effect = .169094
Begin bootstrap iterations *******************************************
End bootstrap iterations *******************************************
+----------------------------------------------------------------------+
| Number of Treated observations = 185 Number of controls = 54 |
| Estimated Average Treatment Effect = .169094 |
| Estimated Asymptotic Standard Error = .102433 |
| t statistic (ATT/Est.S.E.) = 1.650772 |
| Confidence Interval for ATT = ( -.031675 to .369864) 95% |
| Average Bootstrap estimate of ATT = .171674 |
| ATT - Average bootstrap estimate = -.002579 |
+----------------------------------------------------------------------+

Using the full sample in this fashion produces an estimate of $1,690.94 for the treatment effect
with an estimated standard error of $1,024.33. Note that from the results above, we find that only 54 of
the 2,490 control observations were used as nearest neighbors for the 185 treated observations. In
comparison, using the 1,342 observations in their estimated common support, and the same 185 treated
observations, Becker and Ichino reported estimates of $1,667.64 and $2,113.59 for the effect and the
standard error, respectively, and used 57 of the 1,342 controls as nearest neighbors.

The next set of results uses the caliper form of matching and again restricts attention to the
estimates in the common support.

CALC ; Ran(1234579) $
MATCH ; Rhs = re78 ; Range = .0001 ; Common Support $
CALC ; Ran(1234579) $
MATCH ; Rhs = re78 ; Range = .01 ; Common Support $

The estimated treatment effects are now very different. We see that only 23 of the 185 treated
observations had a neighbor within a range (radius in the terminology of Becker and Ichino) of 0.0001.
The treatment effect is estimated to be only $321.95 with a standard error of $307.95. In contrast, using
this procedure, and this radius, Becker and Ichino report a nonsense result of -$5,546.10 with a standard
error of $2,388.72. They state that this illustrates the sensitivity of the estimator to the choice of radius,
which is certainly the case. To examine this aspect, we recomputed the estimator using a range of 0.01
instead of 0.0001. This produces the expected effect, as seen in the second set of results below. The
estimated treatment effect rises to $1,433.54, which is comparable to the other results already obtained.

+----------------------------------------------------------------------+
| Estimated Average Treatment Effect (T ) Outcome is RE78 |
| Caliper Using distance of .00010 to locate matches |
| Note, controls may be reused in defining matches. |
| Number of bootstrap replications used to obtain variance = 25 |
+----------------------------------------------------------------------+
Estimated average treatment effect = .032195
Begin bootstrap iterations *******************************************
End bootstrap iterations *******************************************
+----------------------------------------------------------------------+
| Number of Treated observations = 23 Number of controls = 66 |
| Estimated Average Treatment Effect = .032195 |
| Estimated Asymptotic Standard Error = .030795 |
| t statistic (ATT/Est.S.E.) = 1.045454 |
| Confidence Interval for ATT = ( -.028163 to .092553) 95% |
| Average Bootstrap estimate of ATT = .018996 |
| ATT - Average bootstrap estimate = .013199 |
+----------------------------------------------------------------------+

+----------------------------------------------------------------------+
| Estimated Average Treatment Effect (T ) Outcome is RE78 |
| Caliper Using distance of .01000 to locate matches |
| Note, controls may be reused in defining matches. |
| Number of bootstrap replications used to obtain variance = 25 |
+----------------------------------------------------------------------+
Estimated average treatment effect = .143354
Begin bootstrap iterations *******************************************
End bootstrap iterations *******************************************
+----------------------------------------------------------------------+
| Number of Treated observations = 146 Number of controls = 1111 |
| Estimated Average Treatment Effect = .143354 |
| Estimated Asymptotic Standard Error = .078378 |
| t statistic (ATT/Est.S.E.) = 1.829010 |
| Confidence Interval for ATT = ( -.010267 to .296974) 95% |
| Average Bootstrap estimate of ATT = .127641 |
| ATT - Average bootstrap estimate = .015713 |
+----------------------------------------------------------------------+

CONCLUDING COMMENTS

Results obtained from the two equation system advanced by Heckman over 30 years ago are
sensitive to the correctness of the equations and their identification. On the other hand, methods
such as propensity score matching depend on the validity of the estimated logit or probit propensity
score functions and on the smoothing choices made in the kernel density estimator.
Someone using Heckman’s original selection adjustment method can easily have their results
replicated in LIMDEP, STATA and SAS, although standard error estimates may differ somewhat
because of the difference in routines used. Such is not the case with propensity score matching.
Propensity score matching results are highly sensitive to the computer program employed while
Heckman’s original sample selection adjustment method can be relied on to give comparable
coefficient estimates across programs.

REFERENCES

Becker, William and William Walstad. “Data Loss From Pretest to Posttest As a Sample
Selection Problem,” Review of Economics and Statistics, Vol. 72, February 1990: 184-188.

Becker, William and John Powers. “Student Performance, Attrition, and Class Size Given
Missing Student Data,” Economics of Education Review, Vol. 20, August 2001: 377-388.

Becker, S. and A. Ichino. “Estimation of Average Treatment Effects Based on Propensity
Scores,” The Stata Journal, Vol. 2, November 2002: 358-377.

Dehejia, R. and S. Wahba. “Causal Effects in Nonexperimental Studies: Reevaluating the
Evaluation of Training Programs,” Journal of the American Statistical Association, Vol. 94,
1999: 1052-1062.

Heckman, James. “Sample Selection Bias as a Specification Error,” Econometrica, Vol. 47, 1979: 153-161.

Huynh, Kim, David Jacho-Chavez, and James K. Self. “The Efficacy of Collaborative Learning
Recitation Sessions on Student Outcomes?” American Economic Review, (Forthcoming May
2010).

LaLonde, R. “Evaluating the Econometric Evaluations of Training Programs with Experimental
Data,” American Economic Review, Vol. 76, 1986: 604-620.

Saunders, Phillip. The TUCE III Data Set: Background information and file codes
(documentation, summary tables, and five 3.5-inch double-sided, high density disks in ASCII
format). New York: National Council on Economic Education, 1994.

ENDNOTES

                                                            
i Huynh, Jacho-Chavez, and Self (2010) have a data set that enables them to account for selection
into, out of, and between collaborative learning sections of a large principles course in their change-score
modeling.

ii An attempt to compute a linear regression of the original RE78 on the original unscaled other
variables is successful, but produces a warning that the condition number of the X matrix is 6.5 × 10^9.
When the data are scaled as done above, no warning about multicollinearity is given.

iii The kernel density estimator is a nonparametric estimator. Unlike a parametric
estimator (which is an equation), a nonparametric estimator has no fixed structure and is based
on a histogram of all the data. Histograms are bar charts, which are not smooth, and whose
shape depends on the width of the bin into which the data are divided. In essence, with a fixed
bin width, the kernel estimator smooths out the histogram by centering each of the bins at each
data point rather than fixing the end points of the bin. The optimum bin width is a subject of
debate and well beyond the technical level of this module.
 

MODULE FOUR, PART THREE: SAMPLE SELECTION

IN ECONOMIC EDUCATION RESEARCH USING STATA

Part Three of Module Four provides a cookbook-type demonstration of the steps required to use
STATA in situations involving estimation problems associated with sample selection. Users of
this module need to have completed Module One, Parts One and Three, but not necessarily
Modules Two and Three. From Module One users are assumed to know how to get data into
STATA, recode and create variables within STATA, and run and interpret regression results.
Module Four, Parts Two and Four demonstrate in LIMDEP (NLOGIT) and SAS what is done
here in STATA.

THE CASE, DATA, AND ROUTINE FOR EARLY HECKMAN ADJUSTMENT

The change score or difference in difference model is used extensively in education research.
Yet, before Becker and Walstad (1990), little if any attention was given to the consequence of
missing student records that result from: 1) "data cleaning" done by those collecting the data, 2)
student unwillingness to provide data, or 3) students self-selecting into or out of the study. The
implications of these types of sample selection are shown in the work of Becker and Powers
(2001) where the relationship between class size and student learning was explored using the
third edition of the Test of Understanding in College Economics (TUCE), which was produced
by Saunders (1994) for the National Council on Economic Education (NCEE), since renamed the
Council for Economic Education.

Module One, Part Three showed how to get the Becker and Powers data set
“beck8WO.csv” into STATA. As a brief review this was done with the insheet command:
. insheet a1 a2 x3 c al am an ca cb cc ch ci cj ck cl cm cn co cs ct cu ///
> cv cw db dd di dj dk dl dm dn dq dr ds dy dz ea eb ee ef ///
> ei ej ep eq er et ey ez ff fn fx fy fz ge gh gm gn gq gr hb ///
> hc hd he hf using "F:\BECK8WO2.csv", comma
(64 vars, 2849 obs)

where

A1: term, where 1= fall, 2 = spring


A2: school code, where 100/199 = doctorate,
200/299 = comprehensive,
300/399 = lib arts,
400/499 = 2 year
hb: initial class size (number taking preTUCE)
hc: final class size (number taking postTUCE)
dm: experience, as measured by number of years teaching
dj: teacher’s highest degree, where Bachelors=1, Masters=2, PhD=3
cc: postTUCE score (0 to 30)
an: preTUCE score (0 to 30)
ge: Student evaluation measured interest
gh: Student evaluation measured textbook quality
gm: Student evaluation measured regular instructor’s English ability
gq: Student evaluation measured overall teaching effectiveness
ci: Instructor sex (Male = 1, Female = 2)
ck: English is native language of instructor (Yes = 1, No = 0)
cs: PostTUCE score counts toward course grade (Yes = 1, No = 0)
ff: GPA*100
fn: Student had high school economics (Yes = 1, No = 0)
ey: Student’s sex (Male = 1, Female = 2)
fx: Student working in a job (Yes = 1, No = 0)

Separate dummy variables need to be created for each type of school (A2), which is done with
the following code:
recode a2 (100/199=1) (200/299=2) (300/399=3) (400/499=4)
generate doc=(a2==1) if a2!=.
generate comp=(a2==2) if a2!=.
generate lib=(a2==3) if a2!=.
generate twoyr=(a2==4) if a2!=.

To create a dummy variable for whether the instructor had a PhD we use

generate phd=(dj==3) if dj!=.

To create a dummy variable for whether the student took the postTUCE we use

generate final=(cc>0) if cc!=.

To create a dummy variable for whether a student did (noeval = 0) or did not (noeval = 1)
complete a student evaluation of the instructor we use

generate noeval=(ge + gh + gm + gq == -36)

“Noeval” reflects whether the student was around toward the end of the term, attending classes,
and sufficiently motivated to complete an evaluation of the instructor. In the Saunders data set,
evaluation questions with no answer were coded -9; thus, these four questions summing to -36
indicates that no questions were answered.

And the change score is created with

generate change=cc-an

Finally, there was a correction for the term in which student record 2216 was incorrectly
recorded:

recode hb (90=89)

All of these recoding and create commands are entered into the STATA command file as
follows:

recode a2 (100/199=1) (200/299=2) (300/399=3) (400/499=4)


gen doc=(a2==1) if a2!=.
gen comp=(a2==2) if a2!=.
gen lib=(a2==3) if a2!=.
gen twoyr=(a2==4) if a2!=.
gen phd=(dj==3) if dj!=.
gen final=(cc>0) if cc!=.

gen noeval=(ge+gh+gm+gq==-36)

gen change=cc-an
recode hb (90=89)

To remove records with missing data the following is entered:


drop if an==-9
drop if hb==-9
drop if ci==-9
drop if ck==-9
drop if cs==0
drop if cs==-9
drop if a2==-9
drop if phd==-9

The use of these data entry and management commands will appear in the STATA output file for
the equations to be estimated in the next section.

THE PROPENSITY TO TAKE THE POSTTEST AND THE CHANGE SCORE EQUATION

To address attrition-type sample selection problems in change score studies, Becker and Powers
first add observations that were dropped during the early stage of assembling data for TUCE III.
Becker and Powers do not have any data on students before they enrolled in the course and thus
cannot address selection into the course, but to examine the effects of attrition (course
withdrawal) they introduce three measures of class size (beginning, ending, and average) and
argue that initial or beginning class size is the critical measure for assessing learning over the
entire length of the course. i To show the effects of initial class size on attrition (as discussed in
Module Four, Part One) they employ what is now the simplest and most restrictive of sample
correction methods, which can be traced to James Heckman (1979), recipient of the 2000 Nobel
Prize in Economics.

From Module Four, Part One, we have the data generating process for the difference between
post and preTUCE scores for the ith student (Δy_i):

Δy_i = X_i β + ε_i = β_1 + Σ_{j=2}^{k} β_j x_{ji} + ε_i                (1)

where the data set of explanatory variables is matrix X, where X_i is the row of x_ji values for the
relevant variables believed to explain the ith student's pretest and posttest scores, the β_j's are the
associated slope coefficients in the vector β, and ε_i is the individual random shock (caused, for
example, by unobservable attributes, events or environmental factors) that affects the ith student's
test scores. Sample selection associated with students' unwillingness to take the posttest
(dropping the course) results in correlation between the population error term and the regressors,
which biases the coefficient estimators in this change score model and makes them inconsistent.

The data generating process for the ith student's propensity to take the posttest is:

T_i* = H_i α + ω_i                (2)

where

T_i = 1, if T_i* > 0, and student i has a posttest score, and

T_i = 0, if T_i* ≤ 0, and student i does not have a posttest score.

T* is the vector of all students' propensities to take a posttest.

H is the matrix of explanatory variables that are believed to drive these propensities.

α is the vector of slope coefficients corresponding to these observable variables.

ω is the vector of unobservable random shocks that affect each student’s propensity.

The effect of attrition between the pretest and posttest is reflected in the absence of a
posttest score for the ith student (T_i = 0). A Heckman adjustment for the resulting bias, caused
by excluding those students from the change-score regression, requires estimation of equation (2)
and the calculation of an inverse Mills ratio for each student who has a pretest. This inverse
Mills ratio is then added to the change-score regression (1) as another explanatory variable. In
essence, the inverse Mills ratio adjusts the error term for the missing students.

For the Heckman adjustment for sample selection, each disturbance in vector ε of equation
(1) is assumed to be distributed bivariate normal with the corresponding disturbance term in the
ω vector of the selection equation (2). Thus, for the ith student we have:

(ε_i, ω_i) ~ bivariate normal(0, 0, σ_ε, 1, ρ)                (3)

and for all perturbations in the two-equation system we have:

E(ε) = E(ω) = 0, E(εε′) = σ_ε² I, E(ωω′) = I, and E(εω′) = ρσ_ε I.                (4)

That is, the disturbances have zero means, unit variance, and no covariance among students, but
there is covariance between selection in getting a posttest score and the measurement of the
change score.

The regression for this censored sample of n_{T=1} students who took the posttest is now:

E(Δy_i | X_i, T_i = 1) = X_i β + E(ε_i | T_i* > 0);  i = 1, 2, ..., n_{T=1}, for n_{T=1} < N                (5)

which suggests the Heckman adjusted regression to be estimated:

E(Δy_i | X_i, T_i = 1) = X_i β + (ρσ_ε)λ_i;  i = 1, 2, ..., n_{T=1}                (6)

where λ_i is the inverse Mills ratio (or hazard) such that λ_i = f(−T_i*)/[1 − F(−T_i*)], and f(·)
and F(·) are the normal density and distribution functions. λ_i is the standardized mean of the
disturbance term ω_i for the ith student who took the posttest; it is close to zero only for those
well above the T = 1 threshold. The values of λ are generated from the estimated probit
selection equation (2) for all students.
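Because the selection equation is estimated as a probit, for a student with a posttest (T_i = 1) the inverse Mills ratio reduces to λ_i = φ(H_i α)/Φ(H_i α), the standard normal density at the index divided by the distribution function at the index. A minimal STATA sketch of this two-step logic done “by hand,” using the variables created above and the same specification estimated below (the variable names xb_hat and lambda_hat are our own), is:

* estimate the selection (probit) equation (2)
probit final an hb doc comp lib ci ck phd noeval
* save the estimated probit index for every student
predict double xb_hat, xb
* inverse Mills ratio: normal density over normal CDF at the index
gen double lambda_hat = normalden(xb_hat)/normal(xb_hat)
* change-score regression (6) for posttest takers, with lambda added
regress change hb doc comp lib ci ck phd noeval lambda_hat if final==1

The coefficient on lambda_hat estimates ρσ_ε in equation (6). The standard errors from this manual second step are not correct, however, because they ignore the fact that λ is itself estimated, which is one reason to rely on the built-in routine described next.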

STATA's built-in “heckman” command estimates both the selection and outcome
equations using either full-information maximum likelihood or Heckman's original two-step
estimator (which uses the Mills ratio as a regressor). The default “heckman” command
implements the maximum likelihood estimation, including ρ and σ_ε, and is written:

heckman change hb doc comp lib ci ck phd noeval, ///
select (final = an hb doc comp lib ci ck phd noeval) vce(opg)

while the Mills ratio two-step process can be implemented by specifying the option “twostep”
after the command. The option “vce(opg)” specifies the outer-product of the gradient method to
estimate standard errors, as opposed to STATA’s default Hessian method.

As described in Module One, Part Three, entering all of these commands into the
command window in STATA and pressing enter (or alternatively, highlighting the commands in
a do file and pressing ctrl-d) yields the following output file:

. insheet ///
> A1 A2 X3 C AL AM AN CA CB CC CH CI CJ CK CL CM CN CO CS CT ///
> CU CV CW DB DD DI DJ DK DL DM DN DQ DR DS DY DZ EA EB EE EF ///
> EI EJ EP EQ ER ET EY EZ FF FN FX FY FZ GE GH GM GN GQ GR HB ///
> HC HD HE HF ///
> using "C:\BECK8WO.csv", comma
(64 vars, 2837 obs)

. recode a2 (100/199=1) (200/299=2) (300/399=3) (400/499=4)


(a2: 2837 changes made)

. gen doc=(a2==1) if a2!=.


. gen comp=(a2==2) if a2!=.
. gen lib=(a2==3) if a2!=.
. gen twoyr=(a2==4) if a2!=.
. gen phd=(dj==3) if dj!=.
. gen final=(cc>0) if cc!=.
. gen noeval=(ge+gh+gm+gq==-36)
. gen change=cc-an
. recode hb (90=89)
(hb: 96 changes made)

. drop if an==-9 | hb==-9 | ci==-9 | ck==-9 | cs==0 | cs==-9 | a2==-9 |


phd==-9
(250 observations deleted)

. heckman change hb doc comp lib ci ck phd noeval, select (final = an hb doc
comp lib ci ck phd noeval) vce(opg)

Iteration 0: log likelihood = -6826.563


Iteration 1: log likelihood = -6826.4685
Iteration 2: log likelihood = -6826.4674
Iteration 3: log likelihood = -6826.4674

Heckman selection model Number of obs = 2587


(regression model with sample selection) Censored obs = 510
Uncensored obs = 2077

Wald chi2(8) = 211.39


Log likelihood = -6826.467 Prob > chi2 = 0.0000

------------------------------------------------------------------------------
| OPG
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
change |
hb | -.0097802 .0055923 -1.75 0.080 -.0207408 .0011805
doc | 1.997291 .5534814 3.61 0.000 .912487 3.082094
comp | -.361983 .4332653 -0.84 0.403 -1.211167 .4872015
lib | 2.23154 .505341 4.42 0.000 1.24109 3.22199
ci | .3940114 .2533859 1.55 0.120 -.1026158 .8906386
ck | -2.743372 .3803107 -7.21 0.000 -3.488767 -1.997976
phd | .6420888 .2896418 2.22 0.027 .0744013 1.209776
noeval | -.6320101 1.269022 -0.50 0.618 -3.119248 1.855227
_cons | 6.817536 .7238893 9.42 0.000 5.398739 8.236332
-------------+----------------------------------------------------------------
final |
an | .0227793 .009396 2.42 0.015 .0043634 .0411953
hb | -.0048868 .0020624 -2.37 0.018 -.008929 -.0008447
doc | .9715436 .150756 6.44 0.000 .6760672 1.26702
comp | .4043055 .1443272 2.80 0.005 .1214295 .6871815
lib | .5150521 .1908644 2.70 0.007 .1409648 .8891394
ci | .1992685 .0905382 2.20 0.028 .0218169 .37672
ck | .0859013 .1190223 0.72 0.470 -.1473781 .3191808
phd | -.1320764 .0978678 -1.35 0.177 -.3238939 .059741
noeval | -1.929021 .0713764 -27.03 0.000 -2.068916 -1.789126
_cons | .9901789 .240203 4.12 0.000 .5193897 1.460968
-------------+----------------------------------------------------------------
/athrho | .0370755 .3578813 0.10 0.917 -.6643589 .73851
/lnsigma | 1.471813 .0160937 91.45 0.000 1.44027 1.503356
-------------+----------------------------------------------------------------
rho | .0370585 .3573898 -.581257 .6282441
sigma | 4.357128 .0701223 4.221836 4.496756
lambda | .1614688 1.55763 -2.89143 3.214368
------------------------------------------------------------------------------
LR test of indep. eqns. (rho = 0): chi2(1) = 0.03 Prob > chi2 = 0.8612
------------------------------------------------------------------------------

The above output provides maximum likelihood estimation of both the probit equation
and the change score equation with separate estimation of ρ and σ ε . The bottom panel provides
the probit coefficients for the propensity equation, where it is shown that initial class size is
negatively and significantly related to the propensity to take the posttest with a one-tail p value
of 0.009. The top panel gives the change score results, where initial class size is negatively and
significantly related to the change score with a one-tail p value of 0.04. Again, it takes
approximately 100 students to move the change score in the opposite direction by a point.
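In the maximum likelihood output, ρ and σ_ε are estimated through the transformed parameters /athrho and /lnsigma (the inverse hyperbolic tangent of ρ and the natural log of σ_ε); the lower panel simply back-transforms them. For example, tanh(0.0370755) = 0.0370585 and exp(1.471813) = 4.357128, matching the reported rho and sigma. The likelihood-ratio test at the bottom of the output (Prob > chi2 = 0.8612) is a test of ρ = 0, that is, of independence between the selection and change-score equations.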

Alternatively, the following command estimates the Heckman model using the Mills ratio
as a regressor:
. heckman change hb doc comp lib ci ck phd noeval, select (final = an hb doc comp lib
ci ck phd noeval) twostep

Heckman selection model -- two-step estimates Number of obs = 2587


(regression model with sample selection) Censored obs = 510
Uncensored obs = 2077

Wald chi2(16) = 931.46


Prob > chi2 = 0.0000

------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
change |
hb | -.0102219 .0056305 -1.82 0.069 -.0212575 .0008137
doc | 2.079684 .5764526 3.61 0.000 .9498578 3.20951
comp | -.329457 .4426883 -0.74 0.457 -1.19711 .5381962
lib | 2.274478 .5373268 4.23 0.000 1.221337 3.327619
ci | .4082326 .2592943 1.57 0.115 -.0999749 .9164401
ck | -2.730737 .377552 -7.23 0.000 -3.470725 -1.990749
phd | .6334483 .2910392 2.18 0.030 .063022 1.203875
noeval | -.8843357 1.272225 -0.70 0.487 -3.377851 1.60918
_cons | 6.741226 .7510686 8.98 0.000 5.269159 8.213293
-------------+----------------------------------------------------------------
final |
an | .022039 .0094752 2.33 0.020 .003468 .04061
hb | -.0048826 .0019241 -2.54 0.011 -.0086537 -.0011114
doc | .9757148 .1463617 6.67 0.000 .6888511 1.262578
comp | .4064945 .1392651 2.92 0.004 .13354 .679449
lib | .5214436 .1766459 2.95 0.003 .175224 .8676632
ci | .1987315 .0916865 2.17 0.030 .0190293 .3784337
ck | .08779 .1342874 0.65 0.513 -.1754085 .3509885
phd | -.133505 .1030316 -1.30 0.195 -.3354433 .0684333
noeval | -1.930522 .0723911 -26.67 0.000 -2.072406 -1.788638
_cons | .9953498 .2432624 4.09 0.000 .5185642 1.472135
-------------+----------------------------------------------------------------
mills |
lambda | .4856741 1.596833 0.30 0.761 -2.644061 3.61541
-------------+----------------------------------------------------------------
rho | 0.11132
sigma | 4.3630276
lambda | .48567415 1.596833
------------------------------------------------------------------------------

The estimated probit model (in the bottom portion of the above output) is

Estimated propensity to take the posttest = 0.995 + 0.022(preTUCE score)

− 0.005(initial class size) + 0.976(Doctoral Institution)

+ 0.406(Comprehensive Institution) + 0.521(Liberal Arts Institution)

+ 0.199(Male instructor) + 0.0878(English Instructor Native Language)

− 0.134(Instructor has PhD) − 1.930(No Evaluation of Instructor)

The beginning or initial class size is negatively and highly significantly related to the propensity
to take the posttest, with a one-tail p value of 0.011.

The corresponding change-score equation employing the inverse Mills ratio is in the
upper portion of the above output:

Predicted Change = 6.741 − 0.010(initial class size) + 2.080(Doctoral Institution)

− 0.329(Comprehensive Institution) + 2.274(Liberal Arts Institution)

+ 0.408(Male instructor) − 2.731(English Instructor Native Language)

+ 0.633(Instructor has PhD) − 0.884(No Evaluation of Instructor) + 0.486λ

The change score is negatively and significantly related to the class size, with a one-tail p value
of 0.0345, but it takes an additional 100 students to lower the change score by a point.

AN APPLICATION OF PROPENSITY SCORE MATCHING
 
Unfortunately, we are not aware of a study in economic education for which propensity score
matching has been used. Thus, we looked outside economic education and elected to redo the
example reported in Becker and Ichino (2002). This application and data are derived from
Dehejia and Wahba (1999), whose study, in turn was based on LaLonde (1986). The data set
consists of observed samples of treatments and controls from the National Supported Work
demonstration. Some of the institutional features of the data set are given by Becker and Ichino.
The data were downloaded from the website http://www.nber.org/~rdehejia/nswdata.html. The
data set used here is in the original text form, contained in the data file “matchingdata.txt.” They
have been assembled from the several parts in the NBER archive.

Becker and Ichino report that they were unable to replicate Dehejia and Wahba’s results,
though they did obtain similar results. (They indicate that they did not have the original authors’
specifications of the number of blocks used in the partitioning of the range of propensity scores,
significance levels, or exact procedures for testing the balancing property.) In turn, we could
not precisely replicate Becker and Ichino’s results – we can identify the reason, as discussed
below. Likewise, however, we obtain similar results.

There are 2,675 observations in the data set, 2,490 controls (with t = 0) and 185 treated
observations (with t = 1). The variables in the raw data set are

t = treatment dummy variable


age = age in years
educ = education in years
black = dummy variable for black
hisp = dummy variable for Hispanic
marr = dummy variable for married
nodegree = dummy for no degree (not used)
re74 = real earnings in 1974
re75 = real earnings in 1975
re78 = real earnings in 1978 – the outcome variable

We will analyze these data following Becker and Ichino’s line of analysis. We assume
that you have completed Module One, Part Three, and thus are familiar with placing commands
in the command window or in a do file. In what follows, we will simply show the commands
you need to enter into STATA to produce the results that we will discuss.

First, note that STATA does not have a default command available for propensity score
matching. Becker and Ichino, however, have created the user-written routine pscore that
implements the propensity score matching analysis underlying Becker and Ichino (2002). As
described in the endnotes of Module Two, Part Three, users can install the pscore routine by
typing findit pscore into the command window, where a list of information and links to
download this routine appears. Click on one of the download links and STATA automatically
downloads and installs the routine for use. Users can then access the documentation for this
routine by typing help pscore. Installing the pscore routine also downloads and installs several
other routines useful for analyzing treatment effects (i.e., the routines attk, attnd and attr,
discussed later in this Module).

To begin the analysis, the data are imported by using the command (where the data file is
on the C drive but your data could be placed wherever):
insheet ///
t age educ black hisp marr nodegree re74 re75 re78 ///
using "C:\matchingdata.txt"

Transformed variables added to the equation are

  age2 = age squared 
  educ2 = educ squared 
  re742 = re74 squared 
  re752 = re75 squared 
  blacku74 = black times 1(re74 = 0) 
 
In order to improve the readability of some of the reported results, we have divided the
income variables by 10,000. (This is also an important adjustment that accommodates a
numerical problem with the original data set. This is discussed below.) The outcome variable is
re78.

The data are set up and described first. The transformations used to create the
transformed variables are

gen age2=age^2
gen educ2=educ^2
replace re74=re74/10000
replace re75=re75/10000
replace re78=re78/10000
gen re742=re74^2
gen re752=re75^2
gen blacku74=black*(re74==0)
global X age age2 educ educ2 marr black hisp re74 re75 re742 re752 blacku74

The data are described with the following statistics:

. sum

Variable | Obs Mean Std. Dev. Min Max


-------------+--------------------------------------------------------
t | 2675 .0691589 .2537716 0 1
age | 2675 34.22579 10.49984 17 55
educ | 2675 11.99439 3.053556 0 17
black | 2675 .2915888 .4545789 0 1
hisp | 2675 .0343925 .1822693 0 1
-------------+--------------------------------------------------------
marr | 2675 .8194393 .3847257 0 1
nodegree | 2675 .3330841 .4714045 0 1
re74 | 2675 1.823 1.372225 0 13.71487
re75 | 2675 1.785089 1.387778 0 15.66532
re78 | 2675 2.050238 1.563252 0 12.11736
-------------+--------------------------------------------------------
age2 | 2675 1281.61 766.8415 289 3025
educ2 | 2675 153.1862 70.62231 0 289
re742 | 2675 5.205628 8.465888 0 188.0976
re752 | 2675 5.111751 8.908081 0 245.4024
blacku74 | 2675 .0549533 .2279316 0 1

We next fit the logit model for the propensity scores. An immediate problem arises with
the data set as used by Becker and Ichino. The income data are in raw dollar terms – the mean of
re74, for example, is $18,230.00. The square of it, which is on the order of 300,000,000, as well
as the square of re75, which is similar, is included in the logit equation along with a dummy variable for
Hispanic, which is zero for 96.5% of the observations, and the blacku74 dummy variable, which is
zero for 94.5% of the observations. Because of the extreme difference in magnitudes, estimation
of the logit model in this form is next to impossible. But rescaling the data by dividing the
income variables by 10,000 addresses the instability problem. These transformations are shown
in the replace commands above. This has no impact on the results produced with the data, other
than stabilizing the estimation of the logit equation.

The following command estimates the logit model from which the propensity scores are
obtained and tests the balancing hypothesis. The logit model from which the propensity scores
are obtained is fit using: ii

. global X age age2 educ educ2 marr black hisp re74 re75 re742 re752 blacku74
. pscore t $X, logit pscore(_pscore) blockid(_block) comsup

where the logit option specifies that propensity scores should be estimated using the logit
model, the blockid and pscore options define two new variables created by STATA
representing each observation’s propensity score and block id, and the comsup option restricts
the analysis to observations in the common support.
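Incidentally, the propensity scores themselves are simply the predicted probabilities from this logit model. If only the scores were needed (without the blocking and balancing tests), they could be generated with STATA's built-in commands; a minimal sketch, in which the variable name pscore_hat is our own:

* fit the propensity score logit and save the predicted probabilities
logit t $X
predict double pscore_hat, pr

The pscore routine is still needed, however, to construct the blocks and carry out the balancing tests reported below.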

(Note: Becker and Ichino’s coefficients on re74 and re75 are multiplied by 10,000, and
coefficients on re742 and re752 are multiplied by 100,000,000. Otherwise, the output presented
here matches that of Becker and Ichino.)
. pscore t $X, logit pscore(_pscore) blockid(_block) comsup

****************************************************
Algorithm to estimate the propensity score
****************************************************

The treatment is t

t | Freq. Percent Cum.


------------+-----------------------------------
0 | 2,490 93.08 93.08
1 | 185 6.92 100.00
------------+-----------------------------------
Total | 2,675 100.00

Estimation of the propensity score

Iteration 0: log likelihood = -672.64954


Iteration 1: log likelihood = -506.34385
Iteration 2: log likelihood = -385.59357
Iteration 3: log likelihood = -253.47057
Iteration 4: log likelihood = -239.00944
Iteration 5: log likelihood = -216.46206
Iteration 6: log likelihood = -209.42835
Iteration 7: log likelihood = -205.15188
Iteration 8: log likelihood = -204.97706
Iteration 9: log likelihood = -204.97537
Iteration 10: log likelihood = -204.97536
Iteration 11: log likelihood = -204.97536

Logistic regression Number of obs = 2675


LR chi2(12) = 935.35
Prob > chi2 = 0.0000
Log likelihood = -204.97536 Pseudo R2 = 0.6953

------------------------------------------------------------------------------
t | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .3316903 .1203299 2.76 0.006 .0958482 .5675325
age2 | -.0063668 .0018554 -3.43 0.001 -.0100033 -.0027303
educ | .849268 .3477058 2.44 0.015 .1677771 1.530759
educ2 | -.0506202 .0172493 -2.93 0.003 -.0844282 -.0168122
marr | -1.885542 .2993309 -6.30 0.000 -2.472219 -1.298864
black | 1.135972 .3517854 3.23 0.001 .4464852 1.825459
hisp | 1.96902 .5668594 3.47 0.001 .857996 3.080044
re74 | -1.058961 .3525178 -3.00 0.003 -1.749883 -.3680387
re75 | -2.168541 .4142324 -5.24 0.000 -2.980422 -1.35666
re742 | .2389164 .0642927 3.72 0.000 .112905 .3649278
re752 | .0135926 .0665375 0.20 0.838 -.1168185 .1440038
blacku74 | 2.14413 .4268152 5.02 0.000 1.307588 2.980673
_cons | -7.474743 2.443511 -3.06 0.002 -12.26394 -2.68555
------------------------------------------------------------------------------
Note: 22 failures and 0 successes completely determined

Note: the common support option has been selected
The region of common support is [.00061066, .97525407]
Description of the estimated propensity score
in region of common support

Estimated propensity score


-------------------------------------------------------------
Percentiles Smallest
1% .0006426 .0006107
5% .0008025 .0006149
10% .0010932 .0006159 Obs 1342
25% .0023546 .000618 Sum of Wgt. 1342

50% .0106667 Mean .1377463


Largest Std. Dev. .2746627
75% .0757115 .974804
90% .6250822 .9749805 Variance .0754396
95% .949302 .9752243 Skewness 2.185181
99% .970598 .9752541 Kurtosis 6.360726

The next set of results summarizes the tests of the balancing hypothesis. By specifying
the detail option in the above pscore command, the routine will also report the separate results of
the F tests within the partitions as well as the details of the full partition itself. The balancing
hypothesis is rejected when the p value is less than 0.01 within the cell. Becker and Ichino do
not report the results of this search for their data, but do report that they ultimately found seven
blocks. They do not report the means by which the test of equality is carried out within the
blocks or the critical value used.
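To get a feel for what such a within-block comparison involves, the balance of the covariates in any one block can be inspected directly once pscore has created the _pscore and _block variables; a sketch (the loop below is our own illustration, not part of the pscore routine):

* compare treated and control means of each covariate within block 1
foreach v of global X {
    quietly ttest `v' if _block == 1, by(t)
    display "`v'" _col(12) "p-value = " %6.4f r(p)
}

The pscore output that follows reports the result of its own version of these comparisons for each block.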

******************************************************
Step 1: Identification of the optimal number of blocks
Use option detail if you want more detailed output
******************************************************

The final number of blocks is 7

This number of blocks ensures that the mean propensity score


is not different for treated and controls in each blocks

**********************************************************
Step 2: Test of balancing property of the propensity score
Use option detail if you want more detailed output
**********************************************************

Variable black is not balanced in block 1

The balancing property is not satisfied

Try a different specification of the propensity score

Inferior |
of block | t
of pscore | 0 1 | Total
-----------+----------------------+----------
0 | 924 7 | 931
.05 | 102 4 | 106
.1 | 56 7 | 63
.2 | 41 28 | 69
.4 | 14 21 | 35
.6 | 13 20 | 33
.8 | 7 98 | 105
-----------+----------------------+----------
Total | 1,157 185 | 1,342

Note: the common support option has been selected

*******************************************
End of the algorithm to estimate the pscore
*******************************************

The final portion of the pscore output presents the blocks used for the balancing hypothesis.
Again, specifying the detail option will report the results of the balancing property test for each of the
independent variables, which are excluded here for brevity. This part of the analysis also recommends
that the analyst reexamine the specification of the propensity score model. Because this is not a
numerical problem, the analysis continues with estimation of the average treatment effect on the treated.

The first example below shows estimation using the kernel estimator iii to define the counterpart
observation from the controls and using only the subsample in the common support. This stage consists
of nboot + 1 iterations. In order to be able to replicate the results, we set the seed of the random number
generator before computing the results:

set seed 1234579


attk re78 t $X, pscore(_pscore) bootstrap comsup reps(25)

Recall, we divided the income values by 10,000. The value of .153795 reported below thus
corresponds to $1,537.95. Becker and Ichino report a value (see their section 6.4) of $1,537.94. Using
the bootstrap replications, we have estimated the standard error to be $856.28. The normal-approximation
95% confidence interval reported in the output below is (-$229.32, $3,305.22); it is somewhat wider than
$1,537.95 ± 1.96($856.28) because the routine uses a critical value of about 2.06 (a t value with reps − 1 = 24
degrees of freedom) in place of 1.96.

. attk re78 t $X, pscore(_pscore) bootstrap comsup reps(25)

The program is searching for matches of each treated unit.


This operation may take a while.

ATT estimation with the Kernel Matching method

---------------------------------------------------------
n. treat. n. contr. ATT Std. Err. t
---------------------------------------------------------

185 1157 0.154 . .

---------------------------------------------------------
Note: Analytical standard errors cannot be computed. Use
the bootstrap option to get bootstrapped standard errors.

Bootstrapping of standard errors

command: attk re78 t age age2 educ educ2 marr black hisp re74 re75 re742 re752
blacku74 , pscore(_pscore) comsup bwidth(.06)
statistic: attk = r(attk)
note: label truncated to 80 characters

Bootstrap statistics Number of obs = 2675


Replications = 25

------------------------------------------------------------------------------
Variable | Reps Observed Bias Std. Err. [95% Conf. Interval]
-------------+----------------------------------------------------------------
attk | 25 .1537945 -.0050767 .0856277 -.0229324 .3305215 (N)
| -.0308111 .279381 (P)
| -.0308111 .2729317 (BC)
------------------------------------------------------------------------------
Note: N = normal
P = percentile
BC = bias-corrected

ATT estimation with the Kernel Matching method


Bootstrapped standard errors

---------------------------------------------------------
n. treat. n. contr. ATT Std. Err. t
---------------------------------------------------------

185 1157 0.154 0.086 1.796

---------------------------------------------------------

Note that the estimated asymptotic standard error is somewhat different. As we noted earlier,
because of differences in random number generators, the bootstrap replications will differ across
programs. It will generally not be possible to exactly replicate results generated with different computer
programs. With a specific computer program, replication is obtained by setting the seed of the random
number generator. (The specific seed chosen is immaterial, so long as the same seed is used each time.)

The next set of estimates is based on all of the program defaults. The single nearest neighbor is
used for the counterpart observation; 25 bootstrap replications are used to compute the standard deviation,
and the full range of propensity scores (rather than the common support) is used. Intermediate output is
also suppressed. Once again, we set the seed for the random number generator before estimation. In this
case, the pscore calculation is not used, and we have instead estimated the nearest neighbor matching and
the logit propensity scores in the same command sequence by specifying the logit option rather than the
pscore option. Skipping the pscore routine essentially amounts to ignoring any test of the balancing
hypothesis. For the purposes of this Module, this is a relatively innocuous simplification, but in practice,
the pscore routine should always be used prior to estimating the treatment effects.
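Had we wanted instead to reuse the scores and blocks from the pscore step and impose the common support, a parallel call would look like the attk command used earlier (a sketch only; it is not the run reported below):

set seed 1234579
attnd re78 t $X, pscore(_pscore) comsup bootstrap reps(25)

The results reported next, however, are based on the defaults just described.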

. attnd re78 t $X, logit bootstrap reps(25)

The program is searching the nearest neighbor of each treated unit.


This operation may take a while.

ATT estimation with Nearest Neighbor Matching method


(random draw version)
Analytical standard errors

---------------------------------------------------------
n. treat. n. contr. ATT Std. Err. t
---------------------------------------------------------

185 57 0.167 0.211 0.789

---------------------------------------------------------
Note: the numbers of treated and controls refer to actual
nearest neighbour matches

Bootstrapping of standard errors

command: attnd re78 t age age2 educ educ2 marr black hisp re74 re75 re742 re752
blacku74 , pscore() logit
statistic: attnd = r(attnd)
note: label truncated to 80 characters

Bootstrap statistics Number of obs = 2675


Replications = 25

------------------------------------------------------------------------------
Variable | Reps Observed Bias Std. Err. [95% Conf. Interval]
-------------+----------------------------------------------------------------
attnd | 25 .1667644 .012762 .1160762 -.0728051 .4063339 (N)
| -.1111108 .3704965 (P)
| -.1111108 .2918935 (BC)
------------------------------------------------------------------------------
Note: N = normal
P = percentile
BC = bias-corrected

ATT estimation with Nearest Neighbor Matching method
(random draw version)
Bootstrapped standard errors

---------------------------------------------------------
n. treat. n. contr. ATT Std. Err. t
---------------------------------------------------------

185 57 0.167 0.116 1.437

---------------------------------------------------------
Note: the numbers of treated and controls refer to actual
nearest neighbour matches

Using the full sample in this fashion produces an estimate of $1,667.64 for the treatment effect
with an estimated standard error of $1,160.76. In comparison, using the 1,342 observations in their
estimated common support, and the same 185 treated observations, Becker and Ichino reported estimates
of $1,667.64 and $2,113.59 for the effect and the standard error, respectively, and used 57 of the 1,342
controls as nearest neighbors.

The next set of results uses the radius form of matching and again restricts attention to the
estimates in the common support.

. attr re78 t $X, logit bootstrap comsup radius(0.0001) reps(25)

The program is searching for matches of treated units within radius.


This operation may take a while.

ATT estimation with the Radius Matching method


Analytical standard errors

---------------------------------------------------------
n. treat. n. contr. ATT Std. Err. t
---------------------------------------------------------

23 66 -0.555 0.239 -2.322

---------------------------------------------------------
Note: the numbers of treated and controls refer to actual
matches within radius

Bootstrapping of standard errors

command: attr re78 t age age2 educ educ2 marr black hisp re74 re75 re742 re752
blacku74 , pscore() logit comsu
> p radius(.0001)
statistic: attr = r(attr)
note: label truncated to 80 characters

Bootstrap statistics Number of obs = 2675


Replications = 25

------------------------------------------------------------------------------
Variable | Reps Observed Bias Std. Err. [95% Conf. Interval]
-------------+----------------------------------------------------------------
attr | 25 -.554614 -.0043318 .5369267 -1.662776 .5535483 (N)
| -1.64371 .967416 (P)
| -1.357991 .967416 (BC)
------------------------------------------------------------------------------
Note: N = normal
P = percentile
BC = bias-corrected

ATT estimation with the Radius Matching method


Bootstrapped standard errors

---------------------------------------------------------
n. treat. n. contr. ATT Std. Err. t
---------------------------------------------------------

23 66 -0.555 0.537 -1.033

---------------------------------------------------------
Note: the numbers of treated and controls refer to actual
matches within radius

The estimated treatment effects are now very different. We see that only 23 of the 185 treated
observations had a neighbor within a range (radius in the terminology of Becker and Ichino) of 0.0001.
Consistent with Becker and Ichino's results, the treatment effect is estimated to be -$5,546.14 with a
standard error of $5,369.27. Becker and Ichino state that these nonsensical results illustrate both the
difference between “caliper” and “radius” matching and the sensitivity of the estimator to the choice of
radius. In order to implement a true caliper matching process, the user-written psmatch2 routine should
be used.

After installing the psmatch2 routine, caliper matching with logit propensity scores and common
support can be implemented with the following command:

. psmatch2 t $X, common logit caliper(0.0001) outcome(re78)

Logistic regression Number of obs = 2675


LR chi2(12) = 935.35
Prob > chi2 = 0.0000
Log likelihood = -204.97536 Pseudo R2 = 0.6953

------------------------------------------------------------------------------
t | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .3316903 .1203299 2.76 0.006 .0958482 .5675325
age2 | -.0063668 .0018554 -3.43 0.001 -.0100033 -.0027303
educ | .849268 .3477058 2.44 0.015 .1677771 1.530759
educ2 | -.0506202 .0172493 -2.93 0.003 -.0844282 -.0168122
marr | -1.885542 .2993309 -6.30 0.000 -2.472219 -1.298864
black | 1.135972 .3517854 3.23 0.001 .4464852 1.825459
hisp | 1.96902 .5668594 3.47 0.001 .857996 3.080044
re74 | -1.058961 .3525178 -3.00 0.003 -1.749883 -.3680387
re75 | -2.168541 .4142324 -5.24 0.000 -2.980422 -1.35666
re742 | .2389164 .0642927 3.72 0.000 .112905 .3649278
re752 | .0135926 .0665375 0.20 0.838 -.1168185 .1440038
blacku74 | 2.14413 .4268152 5.02 0.000 1.307588 2.980673
_cons | -7.474743 2.443511 -3.06 0.002 -12.26394 -2.68555
------------------------------------------------------------------------------
Note: 22 failures and 0 successes completely determined.
There are observations with identical propensity score values.
The sort order of the data could affect your results.
Make sure that the sort order is random before calling psmatch2.
--------------------------------------------------------------------------------------
Variable Sample | Treated Controls Difference S.E. T-stat
----------------------------+---------------------------------------------------------
re78 Unmatched | .634914353 2.1553921 -1.52047775 .115461434 -13.17
ATT | .672171543 .443317968 .228853575 .438166333 0.52
----------------------------+---------------------------------------------------------
Note: S.E. for ATT does not take into account that the propensity score is estimated.

 psmatch2:  |    psmatch2: Common support
 Treatment  |
 assignment | Off support   On support |     Total
------------+---------------------------+----------
 Untreated  |           0        2,490  |    2,490
 Treated    |         162           23  |      185
------------+---------------------------+----------
 Total      |         162        2,513  |    2,675

The “Difference” column in the “ATT” row of the above results presents the estimated treatment
effect. Using a true caliper matching process, the estimates of $2,288.54 for the effect and $4,381.66 for
the standard error are much more comparable to the results previously obtained.
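
As a check on the conversion (the earnings variables were divided by 10,000 before estimation, so the
psmatch2 output is in units of $10,000):

$$\widehat{ATT} = 0.228853575 \times 10{,}000 \approx \$2{,}288.54, \qquad \widehat{SE} = 0.438166333 \times 10{,}000 \approx \$4{,}381.66.$$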

CONCLUDING COMMENTS

Results obtained from the two-equation system advanced by Heckman over 30 years ago are
sensitive to the correct specification of the equations and to their identification. Methods such as
propensity score matching, on the other hand, depend on the validity of the logit or probit
function used to estimate the propensity scores and on the smoothing choices built into the kernel
density estimator. Someone using Heckman’s original selection adjustment method can easily
have their results replicated in LIMDEP, STATA, and SAS. Such is not the case with propensity
score matching: propensity score matching results are highly sensitive to the computer program
employed, whereas Heckman’s original sample selection adjustment method can be relied on to
give comparable results across programs.

REFERENCES

Becker, William and William Walstad. “Data Loss From Pretest to Posttest As a Sample
Selection Problem,” Review of Economics and Statistics, Vol. 72, February 1990: 184-188.

Becker, William and John Powers. “Student Performance, Attrition, and Class Size Given
Missing Student Data,” Economics of Education Review, Vol. 20, August 2001: 377-388.

Becker, S. and A. Ichino. “Estimation of Average Treatment Effects Based on Propensity
Scores,” The Stata Journal, Vol. 2, November 2002: 358-377.

Dehejia, R. and S. Wahba. “Causal Effects in Nonexperimental Studies: Reevaluating the
Evaluation of Training Programs,” Journal of the American Statistical Association, Vol. 94,
1999: 1052-1062.

Heckman, James. “Sample Selection Bias as a Specification Error,” Econometrica, Vol. 47, 1979: 153-162.

Huynh, Kim, David Jacho-Chavez, and James K. Self. “The Efficacy of Collaborative Learning
Recitation Sessions on Student Outcomes?” American Economic Review, (Forthcoming May
2010).

LaLonde, R., “Evaluating the Econometric Evaluations of Training Programs with Experimental
Data,” American Economic Review, Vol. 76, 4, 1986, 604-620.

Saunders, Phillip. The TUCE III Data Set: Background information and file codes
(documentation, summary tables, and five 3.5-inch double-sided, high density disks in ASCII
format). New York: National Council on Economic Education, 1994.

ENDNOTES

                                                            
i
  Huynh, Jacho-Chavez, and Self (2010) have a data set that enables them to account for selection
into, out of and between collaborative learning sections of a large principles course in their change-score
modeling.
ii
   Users can also estimate the logit model with STATA’s default logit command. The
predicted probabilities from the logit estimation are equivalent to the propensity scores
automatically provided with the pscore command. Since STATA does not offer any default
matching routine to use following the default logit command, we adopt the use of the pscore
routine (the download of which includes several matching routines to calculate treatment
effects). The pscore routine also tests the balancing hypothesis and provides other relevant
information for propensity score matching which is not provided by the default logit command.
iii
The kernel density estimator is a nonparametric estimator. Unlike a parametric
estimator (which is an equation), a non-parametric estimator has no fixed structure and is based
on a histogram of all the data. Histograms are bar charts, which are not smooth, and whose
shape depends on the width of the bin into which the data are divided. In essence, with a fixed
bin width, the kernel estimator smoothes out the histogram by centering each of the bins at each
data point rather than fixing the end points of the bin. The optimum bin width is a subject of
debate and well beyond the technical level of this module.
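
For reference, the textbook form of the kernel density estimator (not specific to any of the routines used
above) is

$$\hat{f}(x) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right),$$

where $K(\cdot)$ is the kernel function (for example, the standard normal density) and $h$ is the bandwidth, the
smoothing counterpart of the histogram’s bin width.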
 

MODULE FOUR, PART FOUR: SAMPLE SELECTION

IN ECONOMIC EDUCATION RESEARCH USING SAS

Part Four of Module Four provides a cookbook-type demonstration of the steps required to use
SAS in situations involving estimation problems associated with sample selection. Unlike
LIMDEP and STATA, SAS does not have a procedure or macro available from SAS Institute
specifically designed to match observations using propensity scores. There are a few user-written
codes, but these are not well suited to replicating the particular type of sample-selection problems
estimated in LIMDEP and STATA. As such, this segment will go as far as SAS permits in
replicating what is done in Parts Two and Three in LIMDEP and STATA. Users of this module
need to have completed Module One, Parts One and Four, but not necessarily Modules Two and
Three. From Module One users are assumed to know how to get data into SAS, recode and
create variables within SAS, and run and interpret regression results. Module Four, Parts Two
and Three demonstrate in LIMDEP and STATA what is done here in SAS.

THE CASE, DATA, AND ROUTINE FOR EARLY HECKMAN ADJUSTMENT

The change score or difference in difference model is used extensively in education research.
Yet, before Becker and Walstad (1990), little if any attention was given to the consequence of
missing student records that result from: 1) "data cleaning" done by those collecting the data, 2)
student unwillingness to provide data, or 3) students self-selecting into or out of the study. The
implications of these types of sample selection are shown in the work of Becker and Powers
(2001) where the relationship between class size and student learning was explored using the
third edition of the Test of Understanding in College Economics (TUCE), which was produced
by Saunders (1994) for the National Council on Economic Education (NCEE), since renamed the
Council for Economic Education.

Module One, Part Four showed how to get the Becker and Powers data set
“beck8WO.csv” into SAS. As a brief review, this was done with the following DATA step:
data BECPOW;
infile 'C:\Users\gregory.gilpin\Desktop\BeckerWork\BECK8WO.CSV'
delimiter = ',' MISSOVER DSD lrecl=32767 ;
informat A1 best32.; informat A2 best32.; informat X3 best32.;
informat C best32. ; informat AL best32.; informat AM best32.;
informat AN best32.; informat CA best32.; informat CB best32.;
informat CC best32.; informat CH best32.; informat CI best32.;
informat CJ best32.; informat CK best32.; informat CL best32.;
informat CM best32.; informat CN best32.; informat CO best32.;
informat CS best32.; informat CT best32.; informat CU best32.;
informat CV best32.; informat CW best32.; informat DB best32.;

informat DD best32.; informat DI best32.; informat DJ best32.;
informat DK best32.; informat DL best32.; informat DM best32.;
informat DN best32.; informat DQ best32.; informat DR best32.;
informat DS best32.; informat DY best32.; informat DZ best32.;
informat EA best32.; informat EB best32.; informat EE best32.;
informat EF best32.; informat EI best32.; informat EJ best32.;
informat EP best32.; informat EQ best32.; informat ER best32.;
informat ET best32.; informat EY best32.; informat EZ best32.;
informat FF best32.; informat FN best32.; informat FX best32.;
informat FY best32.; informat FZ best32.; informat GE best32.;
informat GH best32.; informat GM best32.; informat GN best32.;
informat GQ best32.; informat GR best32.; informat HB best32.;
informat HC best32.; informat HD best32.; informat HE best32.;
informat HF best32.;

format A1 best12.; format A2 best12.; format X3 best12.;


format C best12. ; format AL best12.; format AM best12.;
format AN best12.; format CA best12.; format CB best12.;
format CC best12.; format CH best12.; format CI best12.;
format CJ best12.; format CK best12.; format CL best12.;
format CM best12.; format CN best12.; format CO best12.;
format CS best12.; format CT best12.; format CU best12.;
format CV best12.; format CW best12.; format DB best12.;
format DD best12.; format DI best12.; format DJ best12.;
format DK best12.; format DL best12.; format DM best12.;
format DN best12.; format DQ best12.; format DR best12.;
format DS best12.; format DY best12.; format DZ best12.;
format EA best12.; format EB best12.; format EE best12.;
format EF best12.; format EI best12.; format EJ best12.;
format EP best12.; format EQ best12.; format ER best12.;
format ET best12.; format EY best12.; format EZ best12.;
format FF best12.; format FN best12.; format FX best12.;
format FY best12.; format FZ best12.; format GE best12.;
format GH best12.; format GM best12.; format GN best12.;
format GQ best12.; format GR best12.; format HB best12.;
format HC best12.; format HD best12.; format HE best12.;
format HF best12.;
input
A1 A2 X3 C AL AM AN CA CB CC CH CI CJ CK CL CM CN CO CS CT CU
CV CW DB DD DI DJ DK DL DM DN DQ DR DS DY DZ EA EB EE EF
EI EJ EP EQ ER ET EY EZ FF FN FX FY FZ GE GH GM GN GQ GR HB
HC HD HE HF; run;
where

A1: term, where 1= fall, 2 = spring


A2: school code, where 100/199 = doctorate,
200/299 = comprehensive,
300/399 = lib arts,
400/499 = 2 year
hb: initial class size (number taking preTUCE)
hc: final class size (number taking postTUCE)
dm: experience, as measured by number of years teaching
dj: teacher’s highest degree, where Bachelors=1, Masters=2, PhD=3
cc: postTUCE score (0 to 30)

an: preTUCE score (0 to 30)
ge: Student evaluation measured interest
gh: Student evaluation measured textbook quality
gm: Student evaluation measured regular instructor’s English ability
gq: Student evaluation measured overall teaching effectiveness
ci: Instructor sex (Male = 1, Female = 2)
ck: English is native language of instructor (Yes = 1, No = 0)
cs: PostTUCE score counts toward course grade (Yes = 1, No = 0)
ff: GPA*100
fn: Student had high school economics (Yes = 1, No = 0)
ey: Student’s sex (Male = 1, Female = 2)
fx: Student working in a job (Yes = 1, No = 0)

Separate dummy variables need to be created for each type of school (A2), which is done with
the following code:

if 99 < A2 < 200 then a2 = 1;


if 199 < A2 < 300 then a2 = 2;
if 299 < A2 < 400 then a2 = 3;
if 399 < A2 < 500 then a2 = 4;
doc = 0; comp = 0; lib = 0; twoyr = 0;
if a2 = 1 then doc = 1;
if a2 = 2 then comp = 1;
if a2 = 3 then lib = 1;
if a2 = 4 then twoyr = 1;

To create a dummy variable for whether the instructor had a PhD we use
phd = 0;
if dj = 3 then phd = 1;

To create a dummy variable for whether the student took the postTUCE we use
final = 0;
if cc > 0 then final = 1;

To create a dummy variable for whether a student did (noeval = 0) or did not (noeval = 1)
complete a student evaluation of the instructor we use

evalsum = ge+gh+gm+gq;
noeval= 0;
if evalsum = -36 then noeval = 1;

“Noeval” reflects whether the student was around toward the end of the term, attending classes,
and sufficiently motivated to complete an evaluation of the instructor. In the Saunders data set,
evaluation questions with no answer were coded -9; thus, a sum of -36 across these four questions
indicates that none of them was answered.

And the change score is created with

change = cc - an;

Finally, a correction is made for student record 2216, whose initial class size was incorrectly
recorded:

if hb = 90 then hb = 89;

All of these recoding and variable-creation commands are entered into the SAS editor file as follows:

data becpow;
set becpow;
if 99 < A2 < 200 then a2 = 1;
if 199 < A2 < 300 then a2 = 2;
if 299 < A2 < 400 then a2 = 3;
if 399 < A2 < 500 then a2 = 4;
doc = 0; comp = 0; lib = 0; twoyr = 0;
if a2 = 1 then doc = 1;
if a2 = 2 then comp = 1;
if a2 = 3 then lib = 1;
if a2 = 4 then twoyr = 1;
phd = 0;
if dj = 3 then phd = 1;
final = 0;
if cc > 0 then final = 1;
evalsum = ge+gh+gm+gq;
noeval= 0;
if evalsum = -36 then noeval = 1;
change = cc - an;
if hb = 90 then hb = 89;
run;

To remove records with missing data the following is entered:

data becpow;
set becpow;
if AN=-9 then delete;
if HB=-9 then delete;
if ci=-9 then delete;
if ck=-9 then delete;
if cs=0 then delete;

if cs=-9 then delete;
if a2=-9 then delete;
if phd=-9 then delete;
run;

The use of these data entry and management commands will appear in the SAS output file for the
equations to be estimated in the next section.

THE PROPENSITY TO TAKE THE POSTTEST AND THE CHANGE SCORE EQUATION

To address attrition-type sample selection problems in change score studies, Becker and Powers
first add observations that were dropped during the early stage of assembling data for TUCE III.
Becker and Powers do not have any data on students before they enrolled in the course and thus
cannot address selection into the course, but to examine the effects of attrition (course
withdrawal) they introduce three measures of class size (beginning, ending, and average) and
argue that initial or beginning class size is the critical measure for assessing learning over the
entire length of the course. i To show the effects of initial class size on attrition (as discussed in
Module Four, Part One) they employ what is now the simplest and most restrictive of sample
correction methods, which can be traced to James Heckman (1979), recipient of the 2000 Nobel
Prize in Economics.

From Module Four, Part One, we have the data generating process for the difference between
post and preTUCE scores for the ith student ($\Delta y_i$):

$$\Delta y_i = \mathbf{X}_i \boldsymbol{\beta} + \varepsilon_i = \beta_1 + \sum_{j=2}^{k} \beta_j x_{ji} + \varepsilon_i \qquad (1)$$

where the data set of explanatory variables is the matrix $\mathbf{X}$, $\mathbf{X}_i$ is the row of $x_{ji}$ values for the
relevant variables believed to explain the ith student’s pretest and posttest scores, the $\beta_j$’s are the
associated slope coefficients in the vector $\boldsymbol{\beta}$, and $\varepsilon_i$ is the individual random shock (caused, for
example, by unobservable attributes, events, or environmental factors) that affects the ith student’s
test scores. Sample selection associated with students’ unwillingness to take the posttest
(dropping the course) results in correlation between the population error term and the regressors,
which biases the coefficient estimators in this change score model and makes them inconsistent.

The data generating process for the ith student’s propensity to take the posttest is:

$$T_i^* = \mathbf{H}_i \boldsymbol{\alpha} + \omega_i \qquad (2)$$

where

$T_i = 1$, if $T_i^* > 0$, and student $i$ has a posttest score, and

$T_i = 0$, if $T_i^* \leq 0$, and student $i$ does not have a posttest score.

$\mathbf{T}^*$ is the vector of all students’ propensities to take a posttest.

$\mathbf{H}$ is the matrix of explanatory variables that are believed to drive these propensities.

$\boldsymbol{\alpha}$ is the vector of slope coefficients corresponding to these observable variables.

$\boldsymbol{\omega}$ is the vector of unobservable random shocks that affect each student’s propensity.

Adjusting for the effect of attrition between the pretest and posttest, as reflected in the absence of
a posttest score for the ith student ($T_i = 0$), with a Heckman correction for the bias caused by
excluding those students from the change-score regression, requires estimation of equation (2)
and the calculation of an inverse Mills ratio for each student who has a pretest. This inverse
Mills ratio is then added to the change-score regression (1) as another explanatory variable. In
essence, the inverse Mills ratio adjusts the error term for the missing students.

For the Heckman adjustment for sample selection, each disturbance in the vector $\boldsymbol{\varepsilon}$ of equation
(1) is assumed to be distributed bivariate normal with the corresponding disturbance term in the
$\boldsymbol{\omega}$ vector of the selection equation (2). Thus, for the ith student we have:

$$(\varepsilon_i, \omega_i) \sim \text{bivariate normal}(0, 0, \sigma_\varepsilon, 1, \rho) \qquad (3)$$

and for all perturbations in the two-equation system we have:

$$E(\boldsymbol{\varepsilon}) = E(\boldsymbol{\omega}) = \mathbf{0}, \quad E(\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}') = \sigma_\varepsilon^{2}\mathbf{I}, \quad E(\boldsymbol{\omega}\boldsymbol{\omega}') = \mathbf{I}, \quad \text{and} \quad E(\boldsymbol{\varepsilon}\boldsymbol{\omega}') = \rho\sigma_\varepsilon\mathbf{I}. \qquad (4)$$

That is, the disturbances have zero means and no covariance across students, with the selection
disturbances normalized to unit variance and the change-score disturbances having variance $\sigma_\varepsilon^{2}$;
for each student, however, there is covariance between selection into having a posttest score and
the measurement of the change score.

The regression for this censored sample of $n_{T=1}$ students who took the posttest is now:

$$E(\Delta y_i \mid \mathbf{X}_i, T_i = 1) = \mathbf{X}_i\boldsymbol{\beta} + E(\varepsilon_i \mid T_i^* > 0); \quad i = 1, 2, \ldots, n_{T=1}, \ \text{for } n_{T=1} < N \qquad (5)$$

which suggests the Heckman adjusted regression to be estimated:

$$E(\Delta y_i \mid \mathbf{X}_i, T_i = 1) = \mathbf{X}_i\boldsymbol{\beta} + (\rho\sigma_\varepsilon)\lambda_i; \quad i = 1, 2, \ldots, n_{T=1} \qquad (6)$$

where $\lambda_i$ is the inverse Mills ratio (or hazard) such that $\lambda_i = f(-T_i^*)/[1 - F(-T_i^*)]$, and $f(\cdot)$
and $F(\cdot)$ are the normal density and distribution functions. $\lambda_i$ is the standardized mean of the

disturbance term $\omega_i$ for the ith student who took the posttest; it is close to zero only for those
well above the $T = 1$ threshold. The values of $\lambda$ are generated from the estimated probit
selection equation (2) for all students.
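
For completeness, the step linking equations (5) and (6) is the standard conditional-mean result for the
bivariate normal in (3); with the selection disturbance standardized to unit variance,

$$E(\varepsilon_i \mid T_i^* > 0) = E(\varepsilon_i \mid \omega_i > -\mathbf{H}_i\boldsymbol{\alpha}) = \rho\sigma_\varepsilon \frac{\phi(\mathbf{H}_i\boldsymbol{\alpha})}{\Phi(\mathbf{H}_i\boldsymbol{\alpha})} = (\rho\sigma_\varepsilon)\lambda_i,$$

where $\phi(\cdot)$ and $\Phi(\cdot)$ are the standard normal density and distribution functions; by the symmetry of the
normal density, $\phi(\mathbf{H}_i\boldsymbol{\alpha})/\Phi(\mathbf{H}_i\boldsymbol{\alpha})$ is the same as the expression $f(-T_i^*)/[1 - F(-T_i^*)]$ given above.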

The probit command for the selection equation to be estimated in SAS is

proc qlim data =becpow;


model final= an hb doc comp lib ci ck phd noeval / discrete;
quit;

where the “/ discrete” extension tells SAS to estimate the model by probit.

The command for estimating the adjusted change equation, using both the inverse Mills
ratio as a regressor and maximum likelihood estimation of $\rho$ and $\sigma_\varepsilon$, is written

proc qlim data=becpow;


model final = an hb doc comp lib ci ck phd noeval / discrete;
model change = hb doc comp lib ci ck phd noeval / select(final=1);
quit;

where the extension “ / select (final = 1)” tells SAS that the selection is on observations with the
variable final equal to 1.

As described in Module One, Part Four, entering all of these commands into the editor
window in SAS and pressing the RUN button produces the output file from which the estimates
reported below are taken.

The estimated probit model is

Estimated propensity to take the posttest = 0.995 + 0.022(preTUCE score)

      − 0.005(initial class size) + 0.976(Doctoral Institution)

      + 0.406(Comprehensive Institution) + 0.521(Liberal Arts Institution)

      + 0.199(Male instructor) + 0.0878(English Instructor Native Language)

      − 0.134(Instructor has PhD) − 1.930(No Evaluation of Instructor)

The beginning or initial class size is negatively and highly significantly related to the propensity
to take the posttest, with a one-tail p value of 0.0056.

The corresponding change-score equation employing the inverse Mills ratio is:

Predicted Change = 6.847 − 0.010(initial class size) + 1.970(Doctoral Institution)

      − 0.380(Comprehensive Institution) + 2.211(Liberal Arts Institution)

      + 0.386(Male instructor) − 2.749(English Instructor Native Language)

      + 0.650(Instructor has PhD) − 0.588(No Evaluation of Instructor) + 0.486λ

The change score is negatively and significantly related to the class size, with a one-tail p value
of 0.0347, but it takes an additional 100 students to lower the change score by a point. The
maximum likelihood results also contain separate estimates of $\rho$ and $\sigma_\varepsilon$. Note that the
coefficients are slightly different from those provided by LIMDEP; this is due to the
maximization algorithm used in proc qlim, the Newton–Raphson method. Currently SAS does
not have any other standard routine to perform Heckman’s two-step procedure, although a few
user-written codes can be implemented.
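
For readers who want two-step estimates from SAS itself, the following is a minimal sketch of a manual
Heckman two-step procedure. It is an illustration only and not part of the original module: it uses proc
logistic with a probit link for the selection equation rather than proc qlim, the data set and variable
names step1, xb, and lambda are introduced here for the example, and the second-step standard errors
are not corrected for the fact that lambda is itself estimated.

/* Step 1: probit selection equation, saving the linear index for each student */
proc logistic data=becpow;
model final(event='1') = an hb doc comp lib ci ck phd noeval / link=probit;
output out=step1 xbeta=xb;
run;

/* Inverse Mills ratio, lambda = phi(xb)/PHI(xb), for students with a posttest */
data step1;
set step1;
if final = 1 then lambda = pdf('normal', xb) / cdf('normal', xb);
run;

/* Step 2: change-score regression with lambda added as a regressor;
   the coefficient on lambda estimates rho*sigma (standard errors uncorrected) */
proc reg data=step1;
where final = 1;
model change = hb doc comp lib ci ck phd noeval lambda;
run;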

AN APPLICATION OF PROPENSITY SCORE MATCHING
 
Unfortunately, we are not aware of a study in economic education for which propensity score
matching has been used. Thus, we looked outside economic education and elected to redo the
example reported in Becker and Ichino (2002). This application and data are derived from
Dehejia and Wahba (1999), whose study, in turn, was based on LaLonde (1986). The data set
consists of observed samples of treatments and controls from the National Supported Work
demonstration. Some of the institutional features of the data set are given by Becker and Ichino.
The data were downloaded from the website http://www.nber.org/~rdehejia/nswdata.html. The
data set used here is in the original text form, contained in the data file “matchingdata.txt.” They
have been assembled from the several parts in the NBER archive.

Becker and Ichino report that they were unable to replicate Dehejia and Wahba’s results,
though they did obtain similar results. (They indicate that they did not have the original authors’
specifications of the number of blocks used in the partitioning of the range of propensity scores,
significance levels, or exact procedures for testing the balancing property.) In turn, we could
not precisely replicate Becker and Ichino’s results – we can identify the reason, as discussed
below. Likewise, however, we obtain similar results.

There are 2,675 observations in the data set: 2,490 controls (with t = 0) and 185 treated
observations (with t = 1). The variables in the raw data set are

t = treatment dummy variable


age = age in years
educ = education in years
black = dummy variable for black
hisp = dummy variable for Hispanic
marr = dummy variable for married
nodegree = dummy for no degree (not used)
re74 = real earnings in 1974
re75 = real earnings in 1975
re78 = real earnings in 1978 – the outcome variable

We will analyze these data following Becker and Ichino’s line of analysis. We assume
that you have completed Module One, Part Two, and thus are familiar with placing commands in
the editor and using the RUN button to submit commands, and where results are found in the
output window. In what follows, we will simply show the commands you need to enter into SAS
to produce the results that we will discuss.

To start, the data are imported by using the import wizard. The file is most easily
imported by specifying the file as a ‘delimited file *.*’. When providing the location of the file,
click ‘options’, then select the ‘space’ delimiter and uncheck the box for ‘Get variable
names from first row’. In what follows, I call the imported dataset ‘match’. As ‘match’ does not
have proper variable names, this is easily corrected using a data step (a programmatic alternative
to the import wizard, using proc import, is sketched after the data step):

data match (keep = t age educ black hisp marr nodegree re74 re75 re78);
rename var3 = t var5 = age var7 = educ var9 = black var11 = hisp
var13 = marr var15 = nodegree var17 = re74 var19 = re75
var21 = re78;
set match;
run ;
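
The following proc import step is a sketch of that programmatic alternative. It is illustrative only: the
file path is a placeholder, and with getnames=no SAS assigns the default names VAR1, VAR2, ...,
which is what the rename step above relies on.

/* Programmatic equivalent of the import wizard settings described above */
proc import datafile='C:\data\matchingdata.txt'   /* placeholder path */
            out=match
            dbms=dlm
            replace;
delimiter=' ';    /* space-delimited text file */
getnames=no;      /* no variable names in the first row */
run;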

Transformed variables added to the dataset are

  age2 = age squared 
  educ2 = educ squared 
  re742 = re74 squared 
  re752 = re75 squared 
  blacku74 = black times 1(re74 = 0) 
 
In order to improve the readability of some of the reported results, we have divided the
income variables by 10,000. (This is also an important adjustment that accommodates a
numerical problem with the original data set. This is discussed below.) The outcome variable is
re78.

The data are set up and described first. The transformations used to create the
transformed variables are

data match;
set match;
age2 = age*age; educ2 = educ*educ;
re74 = re74/10000; re75 = re75/10000; re78 = re78/10000;
re742 = re74*re74; re752 = re75*re75;
blacku74 = black*(re74 = 0);
run;

The data are described with the following code and statistics:

proc means data = match;


var t age educ black hisp marr nodegree re74 re75 re78 age2 educ2 re742
re752 blacku74;
quit;

We next fit the logit model for the propensity scores. An immediate problem arises with
the data set as used by Becker and Ichino. The income data are in raw dollar terms – the mean of
re74, for example, is $18,230.00. Its square, which is on the order of 300,000,000 (as is the
square of re75), is included in the logit equation together with a dummy variable for Hispanic,
which is zero for 96.5% of the observations, and the blacku74 dummy variable, which is zero for
94.5% of the observations. Because of the extreme difference in magnitudes, estimation of the
logit model in this form is next to impossible. But rescaling the data by dividing the income
variables by 10,000 addresses the instability problem. ii These transformations are shown
in the second set of commands above. This has no impact on the results produced with the data,
other than stabilizing the estimation of the logit equation. We are now quite able to replicate the
Becker and Ichino results except for an occasional very low order digit.

The logit model from which the propensity scores are obtained is fit using

proc qlim data = match;


model t = age age2 educ educ2 marr black hisp re74 re75 re742 re752
blacku74 / discrete (dist = logit);
quit;

(Note: Becker and Ichino’s coefficients on re74 and re75 are multiplied by 10,000, and
coefficients on re742 and re752 are multiplied by 100,000,000.)
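
This is just the usual rescaling rule: dividing a regressor by a constant multiplies its estimated coefficient
by that constant, and the coefficient on its square by the constant squared. With incomes divided by
10,000,

$$\beta_{re74}^{\text{rescaled}} = 10{,}000 \times \beta_{re74}^{\text{dollars}}, \qquad \beta_{re74^{2}}^{\text{rescaled}} = (10{,}000)^{2} \times \beta_{re74^{2}}^{\text{dollars}} = 100{,}000{,}000 \times \beta_{re74^{2}}^{\text{dollars}},$$

and similarly for re75 and re752.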

The above results provide the predicted probabilities to be used in matching algorithms. As
discussed in the Introduction of this part, SAS does not have a procedure or macro
specifically designed to match observations to estimate treatment effects. We refer the reader to
Parts Two and Three of this module for further discussion of how to implement matching
procedures in LIMDEP and STATA.
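
That said, a rough nearest-neighbor match can be pieced together from standard SAS steps. The sketch
below is only an illustration of the idea and not a substitute for the pscore/attr or psmatch2 routines: it
re-estimates the propensity scores with proc logistic (whose predicted probabilities are equivalent to
those above), matches each treated observation to its closest control with replacement (keeping both
controls in the event of a tie), skips the common-support and balancing checks, and reports a naive ATT
with no standard error. The data set and variable names scored, pscore, tid, y_treated, and y_control are
introduced here for the example.

/* Propensity scores from the same logit specification used above */
proc logistic data=match;
model t(event='1') = age age2 educ educ2 marr black hisp re74 re75 re742 re752 blacku74;
output out=scored p=pscore;
run;

/* Split into treated and control files; tid indexes the treated observations */
data treated controls;
set scored;
if t = 1 then do;
   tid + 1;
   output treated;
end;
else output controls;
run;

/* For each treated observation, keep the control(s) with the closest propensity score,
   then average the treated-minus-control differences in re78 */
proc sql;
create table matched as
select tr.tid,
       tr.re78 as y_treated,
       co.re78 as y_control
from treated as tr, controls as co
group by tr.tid
having abs(tr.pscore - co.pscore) = min(abs(tr.pscore - co.pscore));

select mean(y_treated - y_control) as att_nearest_neighbor
from matched;
quit;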

CONCLUDING COMMENTS

Results obtained from the two-equation system advanced by Heckman over 30 years ago are
sensitive to the correct specification of the equations and to their identification. Methods such as
propensity score matching, on the other hand, depend on the validity of the logit or probit
function used to estimate the propensity scores and on the smoothing choices built into the kernel
density estimator. Someone using Heckman’s original selection adjustment method can easily
have their results replicated in LIMDEP, STATA, and SAS. Such is not the case with propensity
score matching: propensity score matching results are highly sensitive to the computer program
employed, whereas Heckman’s original sample selection adjustment method can be relied on to
give comparable results across programs.

REFERENCES

Becker, William and William Walstad. “Data Loss From Pretest to Posttest As a Sample
Selection Problem,” Review of Economics and Statistics, Vol. 72, February 1990: 184-188.

Becker, William and John Powers. “Student Performance, Attrition, and Class Size Given
Missing Student Data,” Economics of Education Review, Vol. 20, August 2001: 377-388.

Becker, S. and A. Ichino. “Estimation of Average Treatment Effects Based on Propensity
Scores,” The Stata Journal, Vol. 2, November 2002: 358-377.

Dehejia, R. and S. Wahba. “Causal Effects in Nonexperimental Studies: Reevaluating the
Evaluation of Training Programs,” Journal of the American Statistical Association, Vol. 94,
1999: 1052-1062.

Heckman, James. “Sample Selection Bias as a Specification Error,” Econometrica, Vol. 47, 1979: 153-162.

Huynh, Kim, David Jacho-Chavez, and James K. Self. “The Efficacy of Collaborative Learning
Recitation Sessions on Student Outcomes?” American Economic Review, (Forthcoming May
2010).

LaLonde, R., “Evaluating the Econometric Evaluations of Training Programs with Experimental
Data,” American Economic Review, Vol. 76, 4, 1986, 604-620.

Saunders, Phillip. The TUCE III Data Set: Background information and file codes
(documentation, summary tables, and five 3.5-inch double-sided, high density disks in ASCII
format). New York: National Council on Economic Education, 1994.

ENDNOTES

                                                            
i
  Huynh, Jacho-Chavez, and Self (2010) have a data set that enables them to account for selection
into, out of and between collaborative learning sections of a large principles course in their change-score
modeling.
ii
An attempt to compute a linear regression of the original RE78 on the original unscaled other
variables is successful, but produces a warning that the condition number of the X matrix is $6.5 \times 10^{9}$.
When the data are scaled as done above, no warning about multicollinearity is given.
