Econometric Analysis of Panel Data
Part I. Literature 20
Part II. The Mundlak Estimator 20
Part III. Panel Data Regressions 50
Part IV. Binary Choice Models 50
Part V. A Loglinear Model 60
Note, in parts of the exam in which you are asked to report the results of computation, please filter your
response so that you present the numerical results as part of an organized discussion of the question. Do not
submit long, unannotated pages of computer output. Some of the parts require you to do some
computations. Use Stata, R, NLOGIT, MatLab or any other software you wish to use.
Part I. Literature
Locate a published study in a field that interests you that uses a panel data based methodology.
Describe in no more than one page the study, the estimation method(s) used and the conclusion(s) reached
by the author(s).
The course website contains an abbreviated version of the WHO health outcomes data set,
http://people.stern.nyu.edu/wgreene/Econometrics/WHO-balanced-panel.csv
http://people.stern.nyu.edu/wgreene/Econometrics/WHO-balanced-panel.lpj
The csv file is a text, comma delimited file that should be directly readable by other programs such as Stata
and R. The original data set contained 840 observations as an unbalanced panel for 191 countries. It also
contained data for some internal political districts such as the 24 states of Mexico and the provinces of
Canada and Australia. This panel retains the data for the 140 countries that contain all 5 years of data. The
variables in the file are
Note that COMP, DALE, EDUC and HLTHEXP are time varying, but all other measured variables are
time invariant.
Call this Model A. This is a translog production function. The authors found that the values of $\beta_{12}$ implied
a nonconcave production function, and fixed $\beta_{22}$ and $\beta_{12}$ both to zero in their final presentation. Call this
restricted model Model B.
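Because Model B is Model A with two second-order coefficients set to zero, the restriction can be examined in the pooled regression with the usual F statistic for linear restrictions. The following is a minimal sketch using simulated data, not the WHO data. The names `lx1` and `lx2`, the data-generating values, and the assumption that the restricted terms are the squared second input and the cross product are all illustrative.

```python
import numpy as np

def ols_ssr(y, X):
    """Sum of squared residuals from least squares of y on X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    return float(e @ e)

def f_statistic(y, X_unrestricted, X_restricted):
    """F statistic for the zero restrictions that drop columns of X."""
    n, k = X_unrestricted.shape
    j = k - X_restricted.shape[1]              # number of restrictions
    ssr_u = ols_ssr(y, X_unrestricted)
    ssr_r = ols_ssr(y, X_restricted)
    return ((ssr_r - ssr_u) / j) / (ssr_u / (n - k)), j, n - k

# Simulated stand-in for a pooled sample of 140 groups x 5 years (hypothetical).
rng = np.random.default_rng(0)
n = 700
lx1 = rng.normal(size=n)                       # stand-in for the first log input
lx2 = rng.normal(size=n)                       # stand-in for the second log input
y = 1.0 + 0.3 * lx1 + 0.5 * lx2 + rng.normal(scale=0.2, size=n)

# Restricted design (Model B style) and full translog design (Model A style).
X_b = np.column_stack([np.ones(n), lx1, lx2, 0.5 * lx1**2])
X_a = np.column_stack([X_b, 0.5 * lx2**2, lx1 * lx2])

F, j, dof = f_statistic(y, X_a, X_b)
print(f"F[{j}, {dof}] = {F:.3f}")
```

The statistic would be compared with the critical value from the F distribution with `j` and `dof` degrees of freedom.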
a. Fit the “pooled” model and report your results.
b. Using the pooled model, test the null hypothesis of Model B against the alternative Model A.
c. Using the formulation of Model B, fit a random effects model and a fixed effects model. Use your
estimation results to decide which is the preferable model. If you find that neither panel data model is
preferred to the pooled model, show how you reached that conclusion. As part of the analysis, test the
hypothesis that there are no “country effects.”
d. Using the Mundlak approach, determine which model, fixed or random effects is preferred.
e. Assuming that there are “latent individual (country) effects,” the asymptotic covariance matrix that is
computed for the pooled estimator, $s^2(X'X)^{-1}$, is inappropriate. What estimator can be computed for the
covariance matrix of the pooled estimator that will give appropriate standard errors?
f. The hypothesis of constant returns to scale in the translog model (Model A) would be
g. The 2004 Health Economics paper by Greene argued that WHO did not handle the obvious heterogeneity
across countries appropriately. Variables GINI, TROPICS, logPOPDN, logGDPC, GEFF, VOICE, OECD
all capture dimensions of this heterogeneity. Extend the random effects model to include some (or all) of
these variables and test the hypothesis that they significantly add to the explanatory power of the model.
h. Are there “time effects” in the data? One approach to finding out would be to add the time dummy variables (less one
of them) to the preferred regression model and test for their joint significance. A second approach would
be to use a Chow test to test for homogeneity of the regression model over the 5 years. Test the
homogeneity assumption using your preferred pooled model.
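For part (e), one estimator that is often used when group effects are left in the disturbance is the cluster-robust (sandwich) covariance matrix, $(X'X)^{-1}\left[\sum_i X_i' e_i e_i' X_i\right](X'X)^{-1}$, where $e_i$ is the residual vector for group $i$. The following is a minimal sketch on simulated data. The function name, the simulated design and the absence of a small-sample correction are assumptions made for illustration.

```python
import numpy as np

def pooled_ols_covariances(y, X, groups):
    """Pooled OLS with conventional and cluster-robust covariance matrices."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ (X.T @ y)
    e = y - X @ b
    v_conventional = (e @ e) / (n - k) * XtX_inv     # s^2 (X'X)^{-1}
    meat = np.zeros((k, k))
    for g in np.unique(groups):
        idx = groups == g
        score = X[idx].T @ e[idx]                    # X_i' e_i for one group
        meat += np.outer(score, score)
    v_cluster = XtX_inv @ meat @ XtX_inv             # no small-sample correction
    return b, v_conventional, v_cluster

# Simulated panel: 140 groups, 5 periods, group effect left in the disturbance.
rng = np.random.default_rng(1)
n_groups, n_periods = 140, 5
groups = np.repeat(np.arange(n_groups), n_periods)
x = rng.normal(size=groups.size) + np.repeat(rng.normal(size=n_groups), n_periods)
effect = np.repeat(rng.normal(size=n_groups), n_periods)
y = 1.0 + 0.5 * x + effect + rng.normal(size=groups.size)
X = np.column_stack([np.ones(groups.size), x])

b, v_conv, v_clu = pooled_ols_covariances(y, X, groups)
print("conventional s.e.:  ", np.sqrt(np.diag(v_conv)))
print("cluster-robust s.e.:", np.sqrt(np.diag(v_clu)))
```

When every observation is its own group, the same function returns the heteroscedasticity-robust (White) estimator, which is a convenient check on the implementation.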
Part IV. Binary Choice Models
The course website describes the “German Manufacturing Innovation Data.” The actual data are not
published on the course website. We will use them for purposes of this exercise, however. You can obtain
them by downloading either a csv file,
http://people.stern.nyu.edu/wgreene/Econometrics/probit-panel.csv
http://people.stern.nyu.edu/wgreene/Econometrics/probit-panel.lpj
This data set contains 1,270 firms and 5 years of data for 6,350 observations in total – a balanced panel. The
variables that you need for this exercise are described in the data sets area of the course home page,
http://people.stern.nyu.edu/wgreene/Econometrics/PanelDataSets.htm
(The csv file can easily be ported to other software such as R, SAS and Stata.) I am interested in a binary
choice model for the innovation variable, IP. You will fit your model using at least three of the independent
variables in the data set. With respect to the model you specify,
A. THEORY
(a) If you fit a pooled logit model, there is the possibility that you might be ignoring unobserved heterogeneity
(effects). Wooldridge argues that when one fits a probit model while ignoring unobserved heterogeneity, the
raw coefficient estimator (MLE) is inconsistent, but the quantity of interest, the “Average Partial Effects” might
well be estimated appropriately. Explain in detail what he has in mind here.
(b) Suppose we were to estimate a “fixed effects” probit model by “brute force,” just by including the 1,270
dummy variables needed to create the empirical model. What would the properties of the resulting estimator
likely be? What is “the incidental parameters problem?”
(c) How would I proceed to use Chamberlain’s estimator to obtain a consistent slope estimator for the fixed
effects logit model?
(d) Describe in detail how to fit a random effects logit model using quadrature and using simulation for the part
of the computations where they would be necessary, under the assumption that the effects are uncorrelated with
the other included exogenous variables.
(e) Using the random effects logit model that you described in part (d), describe how you would test the
hypothesis that the same logit model applies to the four different sectors in the data set
(CONSGOOD,FOOD,RAWMTL,INVGOOD).
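For part (d), the piece of the random effects logit likelihood that needs numerical work is the integral over the group effect. The following sketch approximates one group's likelihood contribution two ways: Gauss-Hermite quadrature and simulation. It assumes a normally distributed effect with standard deviation `sigma`. The function names and the example values are hypothetical.

```python
import numpy as np

def re_logit_group_likelihood(y, xb, sigma, n_nodes=20):
    """Gauss-Hermite approximation to one group's likelihood contribution."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    q = 2.0 * np.asarray(y, dtype=float) - 1.0           # +1 if y=1, -1 if y=0
    # u = sigma*sqrt(2)*node converts the N(0, sigma^2) integral to Hermite form.
    index = np.asarray(xb, dtype=float)[:, None] + sigma * np.sqrt(2.0) * nodes[None, :]
    probs = 1.0 / (1.0 + np.exp(-q[:, None] * index))    # logistic cdf, T x nodes
    joint = probs.prod(axis=0)                           # product over the T periods
    return float(weights @ joint / np.sqrt(np.pi))

def re_logit_group_likelihood_sim(y, xb, sigma, n_draws=5000, seed=0):
    """Simulation counterpart: average the joint probability over random draws."""
    rng = np.random.default_rng(seed)
    u = sigma * rng.standard_normal(n_draws)
    q = 2.0 * np.asarray(y, dtype=float) - 1.0
    index = np.asarray(xb, dtype=float)[:, None] + u[None, :]
    probs = 1.0 / (1.0 + np.exp(-q[:, None] * index))
    return float(probs.prod(axis=0).mean())

# One hypothetical group with three periods.
y_i = [1, 0, 1]
xb_i = [0.2, -0.1, 0.5]
print(re_logit_group_likelihood(y_i, xb_i, sigma=1.0))
print(re_logit_group_likelihood_sim(y_i, xb_i, sigma=1.0))
```

The log likelihood would be the sum of the logs of these contributions over groups, maximized over the slopes and `sigma`.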
B. PRACTICE
(a) Fit a pooled probit model using your specification. Provide all relevant estimation results. (Please condense
and organize the results in a readable form.)
(b) Fit a random effects probit model.
(c) Use the Mundlak (correlated random effects) approach to approximate a fixed effects model. Recall this
means adding the group means of the time varying variables to the model, then using a random effects model.
(d) Note the difference between the estimates in (b) and (c). Which do you think is appropriate? Explain.
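The data step in part (c), adding the group means of the time varying variables, can be sketched as follows. This is a minimal NumPy illustration. The function name and the toy data are hypothetical, and the estimation step itself is left to whatever random effects probit routine is used.

```python
import numpy as np

def add_group_means(X_timevarying, groups):
    """Append the group mean of each time-varying column to the design matrix."""
    X = np.asarray(X_timevarying, dtype=float)
    ids, inverse = np.unique(groups, return_inverse=True)
    counts = np.bincount(inverse)
    means = np.zeros((ids.size, X.shape[1]))
    np.add.at(means, inverse, X)               # sum the rows within each group
    means /= counts[:, None]
    return np.column_stack([X, means[inverse]])

# Toy example: two firms, one time-varying regressor.
firm = np.array([1, 1, 2, 2, 2])
x = np.array([[1.0], [3.0], [2.0], [4.0], [6.0]])
augmented = add_group_means(x, firm)
print(augmented)
```

The augmented matrix has the original regressor in the first column and its firm mean in the second. A joint test that the coefficients on the added means are zero is one way to judge whether the effects are correlated with the regressors.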
Part V. A Loglinear Model
This semester, we have examined several ‘loglinear models,’ including the logit model for binary
choice, Poisson and negative binomial models for counts and the exponential model for a continuous
nonnegative random variable. We will now examine one more loglinear model. The nonnegative,
continuous random variable y|x has a Weibull distribution:
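$$f(y_i \mid \mathbf{x}_i) = \lambda_i P\, y_i^{P-1} \exp\left(-\lambda_i y_i^{P}\right), \quad y_i \ge 0,\ P > 0,$$
$$\lambda_i = \exp(\mathbf{x}_i'\boldsymbol{\beta}).$$
The first element of $\mathbf{x}_i$ is a constant term, 1.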
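As a quick numerical check on the density, the sketch below assumes it is read as $f(y|\mathbf{x}) = \lambda P y^{P-1}\exp(-\lambda y^P)$ with $\lambda = \exp(\mathbf{x}'\boldsymbol{\beta})$. It evaluates the log density, confirms that the density integrates to one, and computes the mean of $y^P$ by numerical integration. The function names, the midpoint rule and the values of the index and of P are illustrative assumptions.

```python
import math

def weibull_logdensity(y, xb, P):
    """log f(y|x) for f = lam * P * y**(P-1) * exp(-lam * y**P), lam = exp(xb)."""
    lam = math.exp(xb)
    return xb + math.log(P) + (P - 1.0) * math.log(y) - lam * y ** P

def midpoint_integral(func, lower, upper, n=100_000):
    """Midpoint-rule approximation to the integral of func on [lower, upper]."""
    h = (upper - lower) / n
    return h * sum(func(lower + (k + 0.5) * h) for k in range(n))

xb, P = 0.5, 2.0                     # hypothetical index value and shape parameter
lam = math.exp(xb)

def density(y):
    return math.exp(weibull_logdensity(y, xb, P))

# The upper limit 10 is far enough into the tail for these parameter values.
total_mass = midpoint_integral(density, 0.0, 10.0)
mean_y_power = midpoint_integral(lambda y: y ** P * density(y), 0.0, 10.0)

print(f"integral of density = {total_mass:.6f}")
print(f"E[y^P] = {mean_y_power:.6f}, 1/lambda = {1.0 / lam:.6f}")
```

Summing `weibull_logdensity` over the observations gives the log likelihood that a Newton iteration would maximize.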
(We examined a version of this model in Assignment 5.) Estimation and analysis are based on a
sample of N observations on $(y_i, \mathbf{x}_i)$. The conditional mean function is
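$$E[y_i \mid \mathbf{x}_i] = \frac{1}{\lambda_i}\,\Gamma\!\left(\frac{P+1}{P}\right) = \Gamma\!\left(\frac{P+1}{P}\right)\exp(-\mathbf{x}_i'\boldsymbol{\beta}) \quad \text{(Note the minus sign.)}$$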
The data set is a panel. There are 7,293 groups with group sizes ranging from 1 to 7. This exercise
will examine a variety of regression formulations. I have done the estimation for you; the results
appear below. Some of the questions will involve a small amount of ancillary computation.
A. I propose to estimate the parameters $(P, \alpha, \boldsymbol{\beta})$ by maximum likelihood. The results are shown in
regression 1 below. Derive the log likelihood function, likelihood equations and Hessian. Show
precisely how to use Newton’s method to estimate the parameters. How will you obtain asymptotic
standard errors for your estimator? Test the hypothesis of ‘the model.’ That is, test the hypothesis
that all of the coefficients are equal to zero (except the constant term) using the likelihood ratio test.
B. There are several interesting special cases of the Weibull model. If P = 1, the model reduces to
the exponential model discussed in class. We considered three different ways to test a parametric
restriction such as this, Wald, Likelihood ratio and LM tests. Using the results of regressions 1, 2
and 3 below, carry out the three tests. Do the results of the three tests agree?
C. The conditional mean function shown above suggests a nonlinear least squares approach. Note
that the conditional mean function can be written
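$$E[y \mid \mathbf{x}] = \exp\left[\log\Gamma\!\left(\frac{P+1}{P}\right) - \alpha - \mathbf{x}_1'\boldsymbol{\beta}\right] = \exp(-\mathbf{x}'\boldsymbol{\delta})$$
where $\alpha$ is the constant term and $\boldsymbol{\beta}$ is the remaining parameters, and $\mathbf{x}_1$ is all variables not including
the constant term. Thus, $(\alpha, \boldsymbol{\beta})$ and $\boldsymbol{\delta}$ have different constant terms, and are otherwise the same. The
nonlinear least squares results are shown in regression 4. How do the two results compare? We now
have two possible estimators of $\boldsymbol{\beta}$. In theoretical terms, which is better, MLE or NLS? Why? Do the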
empirical results support your argument?
D. The likelihood equations for estimation of $(P, \boldsymbol{\beta})$ imply that $E[y^P \mid \mathbf{x}] = 1/\lambda$. Prove this result.
E. Derive the partial effects for the Weibull conditional mean function, $\partial E[y \mid \mathbf{x}]/\partial \mathbf{x}$. Compute the
partial effects at the means of the data. Hint: $\Gamma((P+1)/P)$ for the P in regression 1 equals .88562.
How would you obtain standard errors for your estimated partial effects? Explain in detail.
F. Regression 5 presents linear least squares results for the regression of –y on x. (The minus sign on
y changes the sign of the coefficients so they will be comparable to the earlier results.) How do these
results compare to the MLEs in part A? How do they compare to the results in part E? Why would
they resemble the results in part E?
G. The log of a Weibull distributed variable has a type 1 extreme value distribution. The expected
value of $\log y$ is $-\mathbf{x}'\boldsymbol{\beta} + \gamma$, where $\gamma$ is the Euler-Mascheroni constant, 0.57721566…. Regression 6
presents the results of linear regression of $-\log y$ on x. Which other result should these resemble? Do
they?
H. Since these are panel data, it is appropriate to rebuild the model to accommodate the unobserved
heterogeneity. Explain the difference between fixed and random effects models. How would they
appear in the loglinear model formulated here?
J. Some have argued that marital status might be endogenous in an income equation when there are
households that have two working people. (You probably thought people married for love.) To
investigate in the present model, I will use a control function approach. Regression 10 presents a
probit equation for marital status based on age, education, gender and whether the household head has
a white collar job. The variable GENRES is the generalized residual from this model,
GENRES = $q\,\phi(\mathbf{x}'\boldsymbol{\theta})/\Phi(q\,\mathbf{x}'\boldsymbol{\theta})$, where $q = 2\,\text{Married} - 1$ and $\boldsymbol{\theta}$ is the probit coefficient vector. The expected value of GENRES is zero, and
since it is the derivative of logL with respect to the constant term, it will sum to zero in the sample. I
am going to use GENRES as a control function. What is a control function, and why will I use it in
the INCOME model?
K. Regression 11 presents estimates of the Weibull INCOME model that includes the control
function. Regression 12 is similar to 11, but regression 12 includes normal heterogeneity in the
model in the form of what appears to be a random effect – a random constant. But this is not a panel
data model; look closely at the results and note that the ‘panel’ has one period. The implied two
equation model underlying 12 is
where $(\varepsilon_i, u_i)$ have a bivariate normal distribution with means (0,0), standard deviations $(\sigma, 1)$ and
correlation $\rho$. The endogeneity issue turns on $\rho$. The coefficient on GENRES in the model in
regression 12 will approximate $\rho\sigma$. So, based on the estimated model, marry for money
(endogenous, $\rho$ not equal to zero) or marry for love (exogenous, $\rho$ equal to zero)?
L. In this model, the argument in parts J and K about MARRIED could also be made about health
satisfaction, HSAT. But, HSAT is an ordered outcome, coded 0,1,2 (bad, middling, good) in our
data. How would you proceed to deal with endogeneity of HSAT in this model?
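The generalized residual used as the control function in part J can be computed directly from the fitted probit index. The sketch below assumes the formula is read as $q\,\phi(z)/\Phi(qz)$ with $q = 2\,\text{Married} - 1$ and $z$ the fitted index. The function names and the index value are hypothetical. It also verifies numerically that the residual has expectation zero at a given index.

```python
import math

def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def generalized_residual(married, index):
    """q * phi(index) / Phi(q * index), with q = 2*married - 1."""
    q = 2 * married - 1
    return q * normal_pdf(index) / normal_cdf(q * index)

z = 0.4                                    # hypothetical fitted probit index
prob_married = normal_cdf(z)
expected_value = (prob_married * generalized_residual(1, z)
                  + (1.0 - prob_married) * generalized_residual(0, z))

print(generalized_residual(1, z), generalized_residual(0, z))
print(f"expected value at this index: {expected_value:.2e}")
```

In a control function approach, this variable would then be added as a regressor in the INCOME equation.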
1. Weibull, MLE
-----------------------------------------------------------------------------
Weibull (Loglinear) Regression Model
Dependent variable INCOME
Log likelihood function 12133.14495
Restricted log likelihood 1195.24508 (Log likelihood when β = 0)
Chi squared [ 7](P= .000) 21875.79975
Significance level .00000
McFadden Pseudo R-squared -9.1511775
Estimation based on N = 27326, K = 8
Inf.Cr.AIC = -24250.3 AIC/N = -.887
--------+--------------------------------------------------------------------
| Standard Prob. 95% Confidence
INCOME| Coefficient Error z |z|>Z* Interval
--------+--------------------------------------------------------------------
|Parameters in conditional mean function
Constant| 1.67075*** .01433 116.62 .0000 1.64267 1.69883
AGE| .00086*** .00022 3.91 .0001 .00043 .00130
EDUC| -.05084*** .00073 -69.23 .0000 -.05228 -.04940
HSAT| -.01233*** .00077 -15.96 .0000 -.01385 -.01082
MARRIED| -.16990*** .00371 -45.79 .0000 -.17717 -.16262
FEMALE| -.02041*** .00334 -6.11 .0000 -.02696 -.01386
HHKIDS| .06403*** .00375 17.07 .0000 .05668 .07139
|Scale parameter for Weibull model
P_scale| 2.13722*** .00495 431.40 .0000 2.12751 2.14693
--------+--------------------------------------------------------------------
***, **, * ==> Significance at 1%, 5%, 10% level.
-----------------------------------------------------------------------------
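The summary statistics reported with regression 1 can be reproduced from the two log likelihoods, and a Wald statistic for P = 1 follows from the estimate and standard error of P_scale. The sketch below uses only numbers printed in the table. The variable names are illustrative, and the 5% critical value 3.841 for one degree of freedom is hard-coded.

```python
# Values copied from the regression 1 output above.
logl = 12133.14495          # log likelihood function
logl_restricted = 1195.24508
n_obs, n_params = 27326, 8
p_hat, se_p = 2.13722, 0.00495

# Likelihood ratio statistic reported as "Chi squared".
lr_stat = 2.0 * (logl - logl_restricted)

# McFadden pseudo R-squared; negative here because both log likelihoods are positive.
pseudo_r2 = 1.0 - logl / logl_restricted

# Akaike information criterion and its per-observation version.
aic = -2.0 * logl + 2.0 * n_params
aic_per_obs = aic / n_obs

# Wald statistic for the hypothesis P = 1 (one restriction).
wald_p_equals_1 = ((p_hat - 1.0) / se_p) ** 2
CHI2_95_DF1 = 3.841

print(f"LR statistic     = {lr_stat:.5f}")
print(f"pseudo R-squared = {pseudo_r2:.7f}")
print(f"AIC = {aic:.1f}, AIC/N = {aic_per_obs:.3f}")
print(f"Wald (P = 1)     = {wald_p_equals_1:.1f}, reject at 5%: {wald_p_equals_1 > CHI2_95_DF1}")
```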
2. Exponential, MLE
-----------------------------------------------------------------------------
Exponential (Loglinear) Regression Model
Dependent variable INCOME
Log likelihood function 1558.04494
Restricted log likelihood 1195.24508
Chi squared [ 5](P= .000) 725.59973
Significance level .00000
McFadden Pseudo R-squared -.3035360
Estimation based on N = 27326, K = 6
Inf.Cr.AIC = -3104.1 AIC/N = -.114
--------+--------------------------------------------------------------------
| Standard Prob. 95% Confidence
INCOME| Coefficient Error z |z|>Z* Interval
--------+--------------------------------------------------------------------
|Parameters in conditional mean function
Constant| 1.85106*** .04834 38.29 .0000 1.75632 1.94580
AGE| .00158** .00064 2.48 .0133 .00033 .00283
EDUC| -.05438*** .00268 -20.27 .0000 -.05963 -.04912
HSAT| -.01101*** .00275 -4.00 .0001 -.01641 -.00561
MARRIED| -.26249*** .01568 -16.75 .0000 -.29322 -.23177
HHKIDS| .06619*** .01399 4.73 .0000 .03877 .09360
--------+--------------------------------------------------------------------