
Introduction To Panel Data UG-students

This document provides an introduction to panel data analysis. It defines panel data as cross-sectional data repeatedly sampled over time where the same individuals are followed. Common features of panel data include individuals like people, firms, or countries observed over multiple time periods. Panel data allows researchers to control for individual fixed effects and model temporal effects without aggregation bias. Common panel data sets are cited as examples. Advantages of panel data include more observations, improved efficiency of estimates, and the ability to study dynamics of change over time.

Uploaded by

David

A short introduction to applied econometrics
Panel Data Analysis

Prof. Razack Lokina
Department of Economics
UDSM

Introduction to Applied Panel Data Econometrics
• Panel data (or longitudinal data) refers to a cross-section repeatedly
sampled over time, but where the same individual has been followed
throughout the period of the sample.
o Individuals: person, household, plant, firm, municipality, state or
a country.
o Time: five years intervals, annual, quarters, weeks, days or just an
observation time. We cannot assume that the observations are
independently distributed across time (e.g. unobserved factors
that affect a person’s wage in 1990 will also affect that person’s
wage in 1991).
• Independently pooled cross-section data: obtained by sampling
randomly from a large population at different points in time.
o An important feature of such data is that they consist of
independently sampled observations.
o This rules out correlation in the error terms across different
observations
• Examples of Panel Data
o Firm or company data
o Longitudinal data on patterns of individual behaviour over the life-cycle.
o Comparative country-specific macroeconomic data over time.
• Why use panel data methods?
o Increased precision of regression estimates
o The ability to control for individual fixed effects
o The ability to model temporal effects without aggregation bias
• Examples of Panel Datasets
o Panel Study of Income Dynamics (PSID)
o National Longitudinal Surveys of Labor Market Experience (NLS)
o German Socioeconomic Panel (GSEP)
o The British Household Panel Survey (BHPS)
o Tanzania National Panel Data (very recent data)
• Common feature:
o The sample of individuals is typically relatively large
o The number of time periods is generally short.
• Advantages of panel estimation methods?
o Large number of data points (observations)
o Increased degrees of freedom
o Reduces the collinearity among the explanatory variables
o Improved efficiency of econometric estimates
o More variability, less aggregation over firms and individuals
• Better able to study dynamics of adjustment in, for example, unemployment and income mobility
• Panel data provide better prediction of individual’s behaviour
• More reliable and stable parameter estimates
• Identify and measure effects not detectable in pure cross-section (CS) or
time-series (TS) data.
o Control for unobservable individual heterogeneity and dynamics
not possible in TS (N=1) and CS (T=1).
o Example: married woman labour-force participation of 50%
interpreted as 50% chance of being in labour force in any given
year, or alternatively 50% always work and 50% never
• Microdynamic and macrodynamic effects cannot be estimated using CS
data.
• Multicollinearity problem in single TS data. Insufficient information to
obtain unconditional estimates of lag coefficients.
Advantages of panel analysis
• More observations: pooling of cross-sectional and time series data
• More degrees of freedom: stems from more observations
• Reduced multicollinearity: especially a problem in distributed lag models
⇒ Improved efficiency (unbiased estimator with smallest variance for all possible true parameter values)
Advantages of panel analysis
• Wider range of problems: dynamics of change, e.g. labor market participation
• Causality discussion: time structure facilitates discussion
⇒ You can test new hypotheses on individual behavior or policy changes that affect several entities
The importance of the data structure
• Example: 11 countries over 10 years
• General note: cross-sectional dimension should
be larger than time dimension
• But: many new models currently developed
• Very fertile field for research!
• I prefer the following data structure
The importance of the data structure
name     code  year  gdp           sav           pop
Albania  ALB   1990  6,75179343    20,9783993    1,6
Albania  ALB   1991  -11,4142038   -13,0284996   -0,2
Albania  ALB   1992  -27,5896031   -75,4131012   -1,6
Albania  ALB   1993  -5,69153612   -33,6716003   -1,4
Albania  ALB   1994  11,1974627    -9,88263035   0,2
Albania  ALB   1995  9,1941036     -3,94799995   1,2
Albania  ALB   1996  7,55757392    -11,8118      1,3
Albania  ALB   1997  7,73893405    -9,25912952   1,2
Albania  ALB   1998  -8,06352119   -6,69585991   1,1
Albania  ALB   1999  missing       -1,66910005   1,1
Algeria  DZA   1990  2,29575915    27,4666996    2,5
Algeria  DZA   1991  -3,72084675   36,6562004    2,4
Algeria  DZA   1992  -3,55414336   32,3755989    2,4
Algeria  DZA   1993  -0,79384221   27,8384991    2,3
Algeria  DZA   1994  -4,35723136   27,0359993    2,2
Algeria  DZA   1995  -3,31007521   28,4333992    2,2
Algeria  DZA   1996  1,59040861    31,4230003    2,2
Algeria  DZA   1997  1,58921549    32,1985016    2,2
Algeria  DZA   1998  -1,03429441   27,0669003    2,1
Algeria  DZA   1999  1,44857954    31,6912003    2,1
(The Albania rows form the first cross-sectional unit; the year column is the time dimension.)
• Disadvantages
o Complicated survey design, stratification
o Changing structure of population (use of rotating panel data)
o Incomplete coverage of the population of interest
o Data collection and management problem
o Distortions from measurement errors due to faulty responses, unclear
questions, ...
o Selectivity problems (self-selection not to work because
reservation wage > offered wage)
• Non-response (partial or complete) due to lack of cooperation
• Attrition problem: non-response increases over time
• Short time-series dimension: increasing N is costly, increasing T worsens
attrition
• New estimation problems
• Imputation of unit non-response/missing values.
Pooled regression
• Combine both dimensions in one data set
• Neglect time and cross-sectional structure
• Run following regression with POLS/SOLS

$gdp_{it} = \alpha + \beta\, sav_{it} + \gamma\, pop_{it} + e_{it}$

Thereby, i...countries, t...years
Pooled regression

. reg gdp pop sav

variables      coefficients   t-values   p-values
pop            -1.73028       -1.95      0.055
sav            0.1766935      3.51       0.001
Adjusted R2    0.10
F-test         6.20 (0.003)
Observations   95
Autocorrelation
• The data now have a time dimension; hence, correlation
among successive residuals is possible
• This affects t- and p-values and violates the
assumption $E(e_{it} e_{it-j}) = 0$ for all $j \neq 0$
• How can we test for this problem?
• What can we do if we detect autocorrelation?
Autocorrelation
• Stata should know that the data set is a panel
• Command: tsset (i) year
• note: i=cross-section
• Normal test commands for autocorrelation do
not work; hence, develop own test (several
procedures!)
Test for Autocorrelation
• Run the following regression and estimate residuals
$gdp_{it} = \alpha + \beta\, sav_{it} + \gamma\, pop_{it} + e_{it}$
• Insert lagged residuals in regression
$gdp_{it} = \alpha + \beta\, sav_{it} + \gamma\, pop_{it} + \rho\, \hat{e}_{it-1} + e_{it}$
• Run t-test for autocorrelation coefficient
• H0: $\rho = 0$ – autocorrelation if rejected
• Note: AR(1) and assumption of strict exogeneity!
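The residual-based test above can be sketched in Python with NumPy. This is a minimal illustration on simulated data, not the Stata procedure from the slides: the panel size, the coefficients and the AR(1) parameter `rho_true` are assumptions chosen for the example, and the helper `ols` is hypothetical.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients, conventional standard errors and residuals."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = X.shape[0] - X.shape[1]
    sigma2 = resid @ resid / dof
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, se, resid

rng = np.random.default_rng(0)
N, T, rho_true = 50, 20, 0.6   # assumed panel size and AR(1) parameter

# Simulated regressors and AR(1) errors: one row per unit, one column per period
sav = rng.normal(size=(N, T))
pop = rng.normal(size=(N, T))
e = np.zeros((N, T))
e[:, 0] = rng.normal(size=N)
for t in range(1, T):
    e[:, t] = rho_true * e[:, t - 1] + rng.normal(size=N)
gdp = 1.0 + 0.5 * sav - 1.0 * pop + e

# Step 1: pooled regression, residuals kept in the (N, T) layout
X1 = np.column_stack([np.ones(N * T), sav.ravel(), pop.ravel()])
_, _, resid = ols(X1, gdp.ravel())
resid = resid.reshape(N, T)

# Step 2: add the lagged residual; the lag is taken within each unit,
# so the first period of every unit is dropped
X2 = np.column_stack([
    np.ones(N * (T - 1)),
    sav[:, 1:].ravel(),
    pop[:, 1:].ravel(),
    resid[:, :-1].ravel(),
])
beta2, se2, _ = ols(X2, gdp[:, 1:].ravel())

# Step 3: t-test of H0: rho = 0
rho_hat = beta2[3]
t_rho = rho_hat / se2[3]
print(f"rho_hat = {rho_hat:.3f}, t = {t_rho:.2f}")
```

A large absolute t-value leads to rejecting H0 and points to first-order autocorrelation. The lag must never cross from one country to the next, which is why the arrays are kept in unit-by-period form before flattening.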
• Pooling Independent Cross-Sections across Time
– Example: Current Population Survey
– Since a random sample is drawn at each time period, pooling the resulting random samples gives us an independently pooled cross-section
– As such, we can use standard OLS methods
– Advantage of pooling is to increase the sample size, thereby obtaining more precise estimates and test statistics with greater power
– Pooling is only useful in this regard if the relationship between the dependent variable and at least some of the independent variables remains constant over time
– To reflect the fact that the population may have different distributions in different time periods, the intercept is usually allowed to differ across time periods (can be accomplished by including year dummies)
– The coefficients on the year dummies may be of interest (e.g. after controlling for other factors, has the pattern of fertility changed over time?)
– Year dummies can also be interacted with other explanatory variables to see if the effect of that variable has changed over time
Testing for Structural Change across Time
– Consider a pooled dataset of two time periods
– Interact each variable with a year dummy for one of the two years
– Test for the joint significance of the year dummy and all of the interaction terms
– Since the intercept in a regression model often changes over time, the Chow test can detect such changes. It is usually more interesting to allow for an intercept difference and then to test whether certain slope coefficients change over time
– This can be extended to more than two time periods
o Panel regression Models
§ Fixed effects panel data models
§ $y_{it} = \alpha_i + x_{it}'\beta + \varepsilon_{it}$ (1)
§ For $i = 1, \dots, N$ individuals over $t = 1, \dots, T$ periods
o Model includes
§ An individual effect, $\alpha_i$ (constant over time)
§ Marginal effects $\beta$ for $x_{it}$ (common across $i$ and $t$)
o The pooled Ordinary Least Squares (OLS) estimator
o The simplest approach to the estimation.
o Individual effects $\alpha_i$ are fixed and common across economic
agents, such that $\alpha_i = \alpha$ for all $i = 1, \dots, N$
o OLS produces consistent and efficient estimates of $\alpha$ and $\beta$.
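o The Within-Groups (WG) estimator
o Can be used if individual effects $\alpha_i$ are fixed but not
common across $i = 1, \dots, N$
o Eliminates the fixed effect $\alpha_i$ by differencing from individual means
o Let $\bar{y}_i = T^{-1}\sum_{t=1}^{T} y_{it}$ and $\bar{x}_i = T^{-1}\sum_{t=1}^{T} x_{it}$
o Define: $y_{it}^* = y_{it} - \bar{y}_i$ and $x_{it}^* = x_{it} - \bar{x}_i$
o Then $\bar{y}_i = \alpha_i + \bar{x}_i'\beta + \bar{\varepsilon}_i$
o Subtracting from (1) gives:
$y_{it} - \bar{y}_i = (\alpha_i - \alpha_i) + (x_{it} - \bar{x}_i)'\beta + (\varepsilon_{it} - \bar{\varepsilon}_i)$
$y_{it} - \bar{y}_i = (x_{it} - \bar{x}_i)'\beta + (\varepsilon_{it} - \bar{\varepsilon}_i)$
or $y_{it}^* = x_{it}^{*\prime}\beta + \varepsilon_{it}^*$, where $\varepsilon_{it}^* = \varepsilon_{it} - \bar{\varepsilon}_i$
o which can then be estimated by OLS
o The individual effects can be estimated as $\hat{\alpha}_i = \bar{y}_i - \bar{x}_i'\hat{\beta}$
o The estimator of the slope parameters, $\hat{\beta}$, is consistent if either $N$
or $T$ becomes large
o The estimator of the individual effects, $\hat{\alpha}_i$, is consistent only if $T$
becomes large
o The number of degrees of freedom needs to be adjusted.
o Usually the degrees of freedom would be $NT - K$, but with
individual effects we have $NT - N - K$ (software packages
usually make this correction when running their panel
commands)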
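The within transformation can be sketched in a few lines of NumPy. The data are simulated under assumed values (`beta_true`, the panel size, and regressors deliberately correlated with the individual effects), so this is an illustration of the steps rather than a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, K = 100, 5, 2
beta_true = np.array([2.0, -1.0])                    # assumed slopes

alpha = rng.normal(scale=3.0, size=(N, 1))           # individual effects alpha_i
x = rng.normal(size=(N, T, K)) + alpha[:, :, None]   # regressors correlated with alpha_i
eps = rng.normal(size=(N, T))
y = alpha + x @ beta_true + eps                      # equation (1)

# Within transformation: subtract each individual's time mean
y_star = y - y.mean(axis=1, keepdims=True)
x_star = x - x.mean(axis=1, keepdims=True)

# OLS on the demeaned data (no constant: demeaning removes it)
Xs = x_star.reshape(N * T, K)
ys = y_star.ravel()
beta_wg, *_ = np.linalg.lstsq(Xs, ys, rcond=None)

# Individual effects: alpha_i = ybar_i - xbar_i' beta
alpha_hat = y.mean(axis=1) - x.mean(axis=1) @ beta_wg

# Error variance with the adjusted degrees of freedom NT - N - K
resid = ys - Xs @ beta_wg
sigma2 = resid @ resid / (N * T - N - K)

# Pooled OLS with a common constant, for comparison
Xp = np.column_stack([np.ones(N * T), x.reshape(N * T, K)])
beta_pooled, *_ = np.linalg.lstsq(Xp, y.ravel(), rcond=None)

print("within:", beta_wg, "pooled:", beta_pooled[1:])
```

Because the simulated regressors are correlated with $\alpha_i$, pooled OLS is biased here while the within estimate stays close to the assumed slopes.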
o Drawback with the Within-Groups estimator
o Eliminates time-invariant characteristics from a model of
the form
$y_{it} = \alpha_i + x_{it}'\beta + z_i'\gamma + \varepsilon_{it}$
o As such, we cannot distinguish between observed and
unobserved heterogeneity

o The Least Squares Dummy Variables (LSDV) Model
o Define a series of group-specific dummy variables
$d_{itj} = 1(i = j)$
o This gives:
$y_{it} = \alpha_i + x_{it}'\beta + \varepsilon_{it}$ (2)
$y_{it} = \alpha_1 d_{it1} + \alpha_2 d_{it2} + \dots + \alpha_N d_{itN} + x_{it}'\beta + \varepsilon_{it}$
o Estimate by standard OLS (excluding a constant)
o Here the constant terms vary by individual, but the slopes are the
same for all individuals
o A test for individual effects:
§ $H_0: \alpha_1 = \alpha_2 = \dots = \alpha_N$
§ which can be tested using an F-test.
o Note: equation (2) can be written as $y_{it} = \mu + \alpha_i + x_{it}'\beta + \varepsilon_{it}$,
where $\mu$ is the average individual effect and $\alpha_i$ is the deviation
from average
§ The model can thus be estimated by including a
constant and $N - 1$ individual dummies
o Problems
o Incidental parameters – the number of dummies grows as N
increases. The usual proof of consistency therefore does not hold for
LSDV models
o Inverting an $(N + K)$-dimensional matrix can be impossible, and even
when possible impractical and/or inaccurate
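As a small check on the LSDV idea, the sketch below builds the dummy matrix explicitly and compares the resulting slopes with the within-groups estimate on the same simulated data. The panel size and coefficients are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, K = 6, 8, 2
alpha = rng.normal(size=N)
x = rng.normal(size=(N, T, K))
y = alpha[:, None] + x @ np.array([1.5, -0.5]) + rng.normal(size=(N, T))

X = x.reshape(N * T, K)
yv = y.ravel()
ids = np.repeat(np.arange(N), T)          # individual index of each stacked row

# Dummy matrix: D[row, j] = 1 if the row belongs to individual j
D = (ids[:, None] == np.arange(N)[None, :]).astype(float)

# LSDV: regress y on all N dummies and x, excluding a constant
coef, *_ = np.linalg.lstsq(np.column_stack([D, X]), yv, rcond=None)
alpha_lsdv, beta_lsdv = coef[:N], coef[N:]

# Within-groups estimate on the same data, for comparison
xs = (x - x.mean(axis=1, keepdims=True)).reshape(N * T, K)
ys = (y - y.mean(axis=1, keepdims=True)).ravel()
beta_wg, *_ = np.linalg.lstsq(xs, ys, rcond=None)

print(np.allclose(beta_lsdv, beta_wg))
```

The two slope vectors agree up to floating-point error, which is why the within transformation is usually preferred when N is large: it avoids building and inverting the full dummy matrix.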
o Random Effects Models
o In the random effects model the $\alpha_i$ are treated as random
variables, rather than fixed constants
o The $\alpha_i$ are usually assumed to be independent of the errors
$\varepsilon_{it}$ and also mutually independent, i.e.
$\alpha_i \sim IID(0, \sigma_\alpha^2)$
$\varepsilon_{it} \sim IID(0, \sigma_\varepsilon^2)$
$\alpha_i$ and $\varepsilon_{it}$ are independently distributed
o Since the $\alpha_i$ are now random, the errors now take the
following form: $u_{it} = \alpha_i + \varepsilon_{it}$
o The presence of $\alpha_i$ produces a correlation among the errors of the
same cross-section unit (i.e. $cov(u_{it}, u_{is}) \neq 0$), though the errors
from different cross-section units are independent
($cov(u_{it}, u_{js}) = 0$)
§ OLS is thus inefficient in the random effects model,
and yields incorrect standard errors

o The Two-Way Fixed Effects Model
o In the one-way model, we assume that there exists an unobserved
individual heterogeneity, but that the model is homogeneous over
time
o The two-way panel model allows for unobserved heterogeneity across
both time and individuals
o The two-way panel model can be written as
o $y_{it} = \alpha_i + \lambda_t + x_{it}'\beta + \varepsilon_{it}$ or,
$y_{it} = \mu + \alpha_i + \lambda_t + x_{it}'\beta + \varepsilon_{it}$
where $\sum_i \alpha_i = 0$ and $\sum_t \lambda_t = 0$
o We can then define:
o Individual/time effect: $\alpha_{it} = \mu + \alpha_i + \lambda_t$
o The average effect: $\mu = \bar{\alpha}_{\cdot\cdot} = \frac{1}{NT}\sum_i \sum_t \alpha_{it}$
o The individual effect: $\mu + \alpha_i = \bar{\alpha}_{i\cdot} = \frac{1}{T}\sum_t \alpha_{it}$
o The time effect: $\mu + \lambda_t = \bar{\alpha}_{\cdot t} = \frac{1}{N}\sum_i \alpha_{it}$
o Using these we can write $\alpha_{it} - \bar{\alpha}_{i\cdot} - \bar{\alpha}_{\cdot t} + \bar{\alpha}_{\cdot\cdot} = 0$
o The two-way fixed effect panel model can be estimated using
the LSDV approach by including time dummies $\tau_{its} = 1(t = s)$
in addition to individual dummies $d_{itj} = 1(i = j)$, thus estimating:

$y_{it} = \alpha_1 d_{it1} + \alpha_2 d_{it2} + \dots + \alpha_N d_{itN} + \dots + \lambda_T \tau_{itT} + x_{it}'\beta + \varepsilon_{it}$
o In certain cases a dataset might have more than 2 dimensions:
e.g. firm-industry-year, country-region-year,
individual-household-year, employee-firm-year,
farm-region-year.
o This class of data can be analyzed using nested error
component models
$y_{ijt} = \mu + \beta' x_{ijt} + u_{ijt}$, with $u_{ijt} = \alpha_i + \gamma_j + \lambda_t + \eta_{ij} + \varepsilon_{ijt}$
o Which can be estimated using LSDV methods.


o One problem with estimating the two-way panel model using
dummy variables is that there is an incidental parameters
problem as either $N$ or $T$ goes to infinity
§ A new within transformation can remove these:
§ $\tilde{y}_{it} = y_{it} - \bar{y}_{i\cdot} - \bar{y}_{\cdot t} + \bar{y}_{\cdot\cdot}$
§ The two-way within model can then be written as
§ $\tilde{y}_{it} = \tilde{x}_{it}'\beta + \tilde{\varepsilon}_{it}$
o The average, individual and time effects can now be estimated as
§ $\hat{\mu} = \bar{y}_{\cdot\cdot} - \hat{\beta}'\bar{x}_{\cdot\cdot}$
§ $\hat{\alpha}_{i\cdot} = \bar{y}_{i\cdot} - \hat{\beta}'\bar{x}_{i\cdot}$
§ $\hat{\alpha}_{\cdot t} = \bar{y}_{\cdot t} - \hat{\beta}'\bar{x}_{\cdot t}$
o Consistency:
o $\hat{\mu}$ and $\hat{\beta}$ are consistent as either T or N tends to infinity
o $\hat{\alpha}_{i\cdot}$ is only T-consistent
o $\hat{\alpha}_{\cdot t}$ is only N-consistent
o The two-way within transformation removes both observed and
unobserved heterogeneity for both individual and time effects
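The two-way within transformation can be illustrated with a single regressor on a balanced panel. Everything in this sketch is assumed for the example (the slope `beta_true`, the panel size, and effects centred so that they sum to zero); the function name `two_way_within` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
N, T = 40, 10
beta_true = 0.8                                   # assumed slope

alpha = rng.normal(size=(N, 1))
alpha -= alpha.mean()                             # normalisation: sum_i alpha_i = 0
lam = rng.normal(size=(1, T))
lam -= lam.mean()                                 # normalisation: sum_t lambda_t = 0
x = rng.normal(size=(N, T)) + alpha + lam         # regressor correlated with both effects
y = 1.0 + alpha + lam + beta_true * x + rng.normal(scale=0.5, size=(N, T))

def two_way_within(z):
    """Return z_it - zbar_i. - zbar_.t + zbar_.. for an (N, T) array."""
    return (z - z.mean(axis=1, keepdims=True)
              - z.mean(axis=0, keepdims=True)
              + z.mean())

y_t, x_t = two_way_within(y), two_way_within(x)
beta_hat = (x_t * y_t).sum() / (x_t ** 2).sum()   # OLS slope with one regressor

# Average, individual and time effects recovered from the means
mu_hat = y.mean() - beta_hat * x.mean()
alpha_i_dot = y.mean(axis=1) - beta_hat * x.mean(axis=1)   # estimates mu + alpha_i
alpha_dot_t = y.mean(axis=0) - beta_hat * x.mean(axis=0)   # estimates mu + lambda_t

print(f"beta_hat = {beta_hat:.3f}, mu_hat = {mu_hat:.3f}")
```

Applying `two_way_within` to `alpha + lam` returns an array of zeros (up to rounding), which is the sense in which the transformation removes both sets of effects without estimating any dummies.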
o The two-way model can also be estimated using a random effects model
by GLS
o In one-way models the individual effects are either fixed or random. In a two-
way model the individual and time effects can each be fixed or random
o i.e. we may have mixed random effects / fixed effects models
where, for example, the time effect is assumed fixed and the individual
effect random
o if T is small, for example, one may estimate a one-way random
effects model on a set of exogenous variables and time dummies
• Fixed or Random Effects
o Fixed effects allow for arbitrary correlation between the individual
effects and the regressors
o Fixed effects cannot provide estimates of variables that are constant
over time
o The group effect can be thought of as random if we can think of the
sample as being drawn from a larger population.
o Fixed effects model appropriate when differences between
individuals may be viewed as parametric shifts in the regression
function (considered reasonable when the sample is broadly
exhaustive of the population)
– Random effects more applicable when we want to draw
inferences for the whole population
– Random effects preferred when there is no correlation between
the individual effects and the regressors (see Hausman test
below)
– LSDV model often results in a large loss in degrees of freedom
– Fixed effects model eliminates a large portion of the total
variation if the between sum of squares is large relative to the
within sum of squares
– The $\alpha_i$ are a total of several factors specific to the cross-section
units and thus represent “specific ignorance”, which can be
treated as random variables in the same manner as $\varepsilon_{it}$, which
represents “general ignorance”, is treated as random.
o Chow Test
o Provides a test of the pooled (restricted) model versus the
fixed effects (unrestricted) model
o This is simply a joint test of whether the fixed effects are
significant
$F = \dfrac{(RRSS - URSS)/(N - 1)}{URSS/(NT - N - K)}$
where RRSS and URSS are the residual sums of
squares from the restricted and unrestricted model
respectively. This is distributed $F_{N-1,\,NT-N-K}$ under
the null of no fixed effects
o If there are a number of observed individual specific variables
in the model, these are included in the pooled model, but not
the fixed effects model (i.e. we want to test for unobserved
heterogeneity)
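The F statistic above can be computed directly from the two residual sums of squares. The sketch below is one possible way to do it on a balanced panel; the function names `rss` and `chow_fixed_effects` and the simulated data are assumptions for illustration.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares from an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def chow_fixed_effects(y, x):
    """F statistic for pooled (restricted) vs fixed effects (unrestricted).

    y has shape (N, T); x has shape (N, T, K). Returns F and its
    degrees of freedom (N - 1, NT - N - K).
    """
    N, T, K = x.shape
    X = x.reshape(N * T, K)
    yv = y.ravel()
    D = np.kron(np.eye(N), np.ones((T, 1)))              # individual dummies
    rrss = rss(np.column_stack([np.ones(N * T), X]), yv)  # common intercept
    urss = rss(np.column_stack([D, X]), yv)               # one intercept per unit
    F = ((rrss - urss) / (N - 1)) / (urss / (N * T - N - K))
    return F, (N - 1, N * T - N - K)

rng = np.random.default_rng(4)
N, T = 20, 10
x = rng.normal(size=(N, T, 1))
eps = rng.normal(size=(N, T))
alpha = rng.normal(scale=2.0, size=(N, 1))

y_fe = alpha + 0.5 * x[:, :, 0] + eps        # heterogeneous intercepts
y_pooled = 1.0 + 0.5 * x[:, :, 0] + eps      # common intercept

F_fe, dof = chow_fixed_effects(y_fe, x)
F_pooled, _ = chow_fixed_effects(y_pooled, x)
print(f"F with effects = {F_fe:.2f}, F without = {F_pooled:.2f}, dof = {dof}")
```

The statistic is compared with the critical value of the F distribution with the returned degrees of freedom; a large value rejects the pooled model in favour of fixed effects.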
o Hausman Test
o Usually applied to test for fixed versus random effects models
o Compares directly the random effects estimator, $\hat{\beta}_{RE}$, to the
fixed effects estimator, $\hat{\beta}_{FE}$
o In the presence of a correlation between the individual effects
and the regressors the GLS estimates are inconsistent, while
the OLS fixed effects results are consistent
o If there is no correlation between the individual effects and the
regressors both estimators are consistent, but the OLS fixed
effects estimator is inefficient
o Construct $\hat{q} = \hat{\beta}_{FE} - \hat{\beta}_{RE}$ and $V(\hat{q}) = V(\hat{\beta}_{FE}) - V(\hat{\beta}_{RE})$
o Test statistic: $H = \hat{q}'\,[V(\hat{q})]^{-1}\,\hat{q}$, distributed as a $\chi^2$ statistic
with $k$ degrees of freedom (where $k$ is the dimensionality of $\beta$)
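Given the two sets of estimates and their covariance matrices, the Hausman statistic is a single quadratic form. The numbers below are hypothetical and chosen only to make the arithmetic easy to follow; they are not results from the slides, and the function name `hausman` is an assumption.

```python
import numpy as np

def hausman(b_fe, b_re, V_fe, V_re):
    """Hausman statistic H = q' [V(q)]^{-1} q and its degrees of freedom."""
    q = np.asarray(b_fe, dtype=float) - np.asarray(b_re, dtype=float)
    Vq = np.asarray(V_fe, dtype=float) - np.asarray(V_re, dtype=float)
    H = float(q @ np.linalg.solve(Vq, q))
    return H, q.size

# Hypothetical estimates, for illustration only
b_fe = [1.0, 2.0]
b_re = [0.9, 2.3]
V_fe = [[0.02, 0.00], [0.00, 0.05]]
V_re = [[0.01, 0.00], [0.00, 0.03]]

H, k = hausman(b_fe, b_re, V_fe, V_re)
print(f"H = {H:.2f} with {k} degrees of freedom")
```

Here $H = 0.1^2/0.01 + 0.3^2/0.02 = 5.5$, which is below the 5% critical value of a $\chi^2$ distribution with 2 degrees of freedom (5.99), so in this made-up case the random effects model would not be rejected. In finite samples $V(\hat{q})$ need not be positive definite, in which case the statistic has to be interpreted with care.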


o Breusch and Pagan Test
o Provides a test of the random effects model against the
pooled OLS model
o Tests the null hypothesis that $\sigma_\alpha^2 = 0$, which is the case
where the individual effects do not exist and OLS is
applicable (i.e. the random effects model reduces to the
pooled one if the variance of the individual effects is zero)
o Denote the residuals from the OLS (pooled) regression as $e_{it}$
o Define: $S_1 = \sum_{i=1}^{N}\left(\sum_{t=1}^{T} e_{it}\right)^2$ and $S_2 = \sum_{i=1}^{N}\sum_{t=1}^{T} e_{it}^2$
o Test statistic: $LM = \dfrac{NT}{2(T-1)}\left(\dfrac{S_1}{S_2} - 1\right)^2$, distributed as a $\chi^2$
statistic with 1 degree of freedom under the null hypothesis
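The LM statistic only needs the pooled OLS residuals arranged by unit and period. The sketch below assumes a balanced panel and uses simulated data; the helper names `breusch_pagan_lm` and `pooled_residuals` are hypothetical.

```python
import numpy as np

def breusch_pagan_lm(resid):
    """LM statistic from pooled OLS residuals arranged as an (N, T) array."""
    N, T = resid.shape
    s1 = (resid.sum(axis=1) ** 2).sum()   # sum_i (sum_t e_it)^2
    s2 = (resid ** 2).sum()               # sum_i sum_t e_it^2
    return N * T / (2 * (T - 1)) * (s1 / s2 - 1) ** 2

def pooled_residuals(y, x):
    """Residuals from pooled OLS of y on a constant and one regressor."""
    X = np.column_stack([np.ones(x.size), x.ravel()])
    b, *_ = np.linalg.lstsq(X, y.ravel(), rcond=None)
    return (y.ravel() - X @ b).reshape(x.shape)

rng = np.random.default_rng(6)
N, T = 50, 10
x = rng.normal(size=(N, T))
eps = rng.normal(size=(N, T))
alpha = rng.normal(size=(N, 1))           # random individual effects

lm_re = breusch_pagan_lm(pooled_residuals(1 + 0.5 * x + alpha + eps, x))
lm_none = breusch_pagan_lm(pooled_residuals(1 + 0.5 * x + eps, x))
print(f"LM with effects = {lm_re:.1f}, LM without = {lm_none:.2f}")
```

With individual effects present the residuals of the same unit share a common component, so $S_1$ is large relative to $S_2$ and the statistic far exceeds the 5% critical value of 3.84.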
• Misspecification Tests
o It is difficult to investigate the time-series properties (e.g.
autocorrelation, stationarity, etc.) of panel data when $T$ is small
o Testing for heteroscedasticity is possible with small $T$ using the
Bickel version of the Breusch-Pagan test
§ This is a test of both within and between heterogeneity
§ This is a test of $\gamma_1 = \dots = \gamma_p = 0$ in the regression model
• $\hat{\varepsilon}_{it}^2 = \gamma_0 + \gamma_1 \hat{y}_{it} + \dots + \gamma_p \hat{y}_{it}^p + v_{it}$
§ where $\hat{\varepsilon}_{it}$ and $\hat{y}_{it}$ are the residuals and fitted values
respectively from the within regression
• Policy Analysis with Pooled Cross-Sections
• Difference-in-Difference Estimation
o Methodology
§ Examine the effect of some sort of treatment by
comparing the treatment group after treatment both to
the treatment group before treatment and to some other
control group.
§ Standard case: outcomes are observed for two groups
for two time periods.
• One of the groups is exposed to a treatment in
the second period but not in the first period.
• The second group is not exposed to the
treatment during either period.
§ Structure can apply to repeated cross sections or panel
data.
§ Example:
• Usually related to a so-called natural (or quasi-)
experiment, when some exogenous event – often
a change in government policy – changes the
environment in which individuals, families, firms
or cities operate.
• A state offers a tax break to firms providing employees with health
insurance.
o To estimate the impact of the bill on the percentage of firms
offering health insurance we could use data on a state that
didn’t implement such a law as a control group.
o It is not correct to just compare pre- and post-law changes in
the percentage of firms offering health insurance, i.e. to estimate
§ $y = \beta_0 + \delta_0\, d2 + u$, where $d2$ is a dummy for period two.
o Here the coefficient estimate $\hat{\delta}_0$ gives an estimate of the
difference in the percentage of firms offering health
insurance between periods one and two, but there could be
a trend towards more employers offering health insurance
over time.
• With repeated cross sections, let $A$ be the control group and $B$
the treatment group.
• Write

$y = \beta_0 + \beta_1\, dB + \delta_0\, d2 + \delta_1\, d2 \cdot dB + u$ (2)

• where:
o $y$ is the outcome of interest (e.g. percentage of
firms offering health insurance in each State)
o $dB$ captures possible differences between the
treatment and control groups prior to the policy
change (e.g. State $B$ versus State $A$)
o $d2$ captures aggregate factors that would cause changes in
$y$ over time even in the absence of a policy change, i.e. for
both States (e.g. time dummies)
• The coefficient of interest is $\delta_1$, which gives an estimate of
the change in health insurance take-up for firms in State $B$, and
which is called the difference-in-difference estimator.
• The difference-in-difference (DD) can be written as:
$\hat{\delta}_1 = (\bar{y}_{B,2} - \bar{y}_{B,1}) - (\bar{y}_{A,2} - \bar{y}_{A,1})$ (3)
§ in other words, $\hat{\delta}_1$ represents the difference in the
changes over time.
• Assuming that both states have the same health insurance
trends over time, we have now controlled for a possible
national time trend, and can now identify what the true impact
of the tax deductibility is on employers offering insurance.
• Inference based on moderate sample sizes in each of the four
groups is straightforward, and is easily made robust to
different group/time period variances in a regression
framework.
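The equivalence between the regression form (2) and the difference of mean changes (3) can be checked numerically. The data-generating values below (a common trend of 3, a group gap of 2, a treatment effect of 5) are assumptions made up for this sketch.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200                                   # assumed observations per group and period
dB = np.repeat([0, 0, 1, 1], n)           # 1 = treatment state B, 0 = control state A
d2 = np.repeat([0, 1, 0, 1], n)           # 1 = second period

# Assumed outcome: baseline 40, group gap 2, common trend 3, treatment effect 5
y = 40 + 2 * dB + 3 * d2 + 5 * dB * d2 + rng.normal(scale=4, size=4 * n)

# Regression (2): y = b0 + b1*dB + d0*d2 + d1*d2*dB + u
X = np.column_stack([np.ones(4 * n), dB, d2, d2 * dB])
b0, b1, d0, d1 = np.linalg.lstsq(X, y, rcond=None)[0]

def cell_mean(group, period):
    """Mean outcome for one group (0 = A, 1 = B) in one period (0 or 1)."""
    return y[(dB == group) & (d2 == period)].mean()

# Equation (3): difference of the changes over time
dd = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))

# Naive before/after comparison in the treated state also picks up the trend
naive = cell_mean(1, 1) - cell_mean(1, 0)

print(f"d1 = {d1:.3f}, DD from means = {dd:.3f}, naive = {naive:.3f}")
```

The interaction coefficient and the difference of mean changes coincide, while the naive before/after difference equals the sum of the trend and the treatment effect, which is the problem the control group is there to solve.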
Fixed Effects Regression

Useful command: areg – you do not need to construct dummies by hand!

areg gdp sav pop, absorb(i)

areg gdp sav pop, absorb(year)

Absorbing both at once is not possible with areg – instead use xi: reg gdp sav pop i.year i.i

variables      Year dummies      Country dummies   Both
constant       -0.8954 (0.525)   -3.8602 (0.106)   2.2578 (0.582)
pop            -1.5334 (0.099)   -0.7835 (0.654)   -0.6431 (0.728)
sav            0.1705 (0.002)    0.2878 (0.005)    0.2710 (0.017)
Adjusted R2    0.07              0.13              0.10
F-test         5.27 (0.007)      4.60 (0.013)      1.51 (0.102)
Observations   95                95                95
(p-values in parentheses)
Fixed Effects Regression:
• Joint F-tests indicate that neither time nor country
dummies are relevant
• But: For a few countries dummies might be used
• General: You have to estimate lots of additional
coefficients
• But: Widely applied and easy to interpret
• Note: Time dummies do not eliminate problems that
may arise from stochastic trends!
Random Effects Regression
• We assume the following regression
$gdp_{it} = \alpha + \beta\, sav_{it} + \gamma\, pop_{it} + u_i + e_{it}$

• Individual effects are random


• Estimation with GLS or maximum likelihood
procedure
• After estimation: Breusch-Pagan (1980) test or
likelihood ratio test whether random effects
should be assumed
Random Effects Regression
xtreg gdp pop sav, re – random effects with group variable i (countries)

Postestimation command: xttest0 – carries out an LM test (H0: Var(u_i) = 0)

xtreg gdp pop sav, mle – maximum likelihood estimation

Note: Likelihood ratio test is reported

variables      GLS               ML
constant       -0.9731 (0.518)   -0.7222 (0.590)
pop            -1.7037 (0.076)   -1.7303 (0.048)
sav            0.1860 (0.001)    0.1767 (0.000)
Wald chi2      11.54 (0.003)     -
LR test        -                 12.01 (0.003)
Observations   95                95

Test whether random effects should be used

LM test        0.11 (0.736)      -
LR test        -                 0.00 (1.000)
Which Procedure should we use?
• Neither fixed nor random effects are superior
• Little evidence that individual effects matter
• Hence: stick to POLS/SOLS pooled regression
• Maybe: use dummies for extreme countries
• Check stability of coefficients over time (goes
beyond the scope of the course!)
The Causality Issue
• Note: We assume that current saving rate and
population growth rate affect GDP growth rate
• But: Possible that causality goes the other way round!
• Solution: VAR model – test for Granger causality
• Result: Savings and population growth rate Granger
cause GDP growth rate and not vice versa!
Additional Issues
• Stochastic trends in panel data
– Spurious regressions
– Unit-root tests – panel based; thus, more
observations
– First differencing or deviation from common
trends
• Long-term equilibriums and cointegration
