
Survey Data in Macroeconomics

III. Econometric Methods for the Analysis of Survey Microdata

Prof. Dr. Lena Dräger

Johannes Gutenberg-University Mainz, GSEFM field course


Email: [email protected]

Outline

1 Analysis of Survey Data in STATA

2 Pooled Cross-Sections

3 Panel Estimators

4 Analysis of Binary and Ordinal (Qualitative) Variables

5 Sample Selection and Attrition Bias

6 Sampling Weights

Textbooks: Wooldridge (2010) and Greene (2012)

Analysis of Survey Data in STATA


Analysis of Survey Microdata

In general:

Control for socio-demographic characteristics (where possible) or individual fixed effects (only with panel data)

Possibly control for time trends and macroeconomic control variables

Be careful with causal statements regarding policy changes; these are generally not possible

Probit/logit estimators: Define a “representative survey respondent” and evaluate marginal effects at this point ⇒ otherwise, marginal effects are evaluated at (or averaged over) the estimation sample, which may differ across models


Analysis of Survey Microdata in STATA

If the dataset is repeated cross-sections, declare the data to be (pooled) time series: tsset timevar

If the dataset has a panel structure, declare the data to be a panel: xtset panelvar timevar

For a single cross-section, you can declare the dataset to be survey data: svyset

Think about meaningful truncation

Get to know your dataset!


Implications of Truncation
Micro survey data on macroeconomic expectations are typically truncated, at least for non-experts

We might also think about truncating extremely high incomes etc.

Should we adjust our estimator to the truncated nature of the data?

⇒ E.g., use estimators for censored data like a tobit estimator?

⇒ No, because truncated observations are simply dropped, not set to some fixed value (censored)

⇒ Truncation may bias OLS estimates towards 0; a maximum likelihood estimator that models the truncation remains consistent

⇒ In practice, estimations are typically not adjusted for truncation, probably because the truncation only becomes binding in very few cases
Pooled Cross-Sections


Pool micro cross-sections of survey waves over time

Individual and time component

⇒ If identical individuals are followed over time: Use panel estimators

⇒ If there are repeated (random) cross-sections: Use a simple OLS estimator on the pooled cross-sections


Pooled OLS

Assumptions:

Random samples from a population at different points in time

Observations are independent, but not identically distributed

Include either macro control variables or time dummies to control for aggregate changes over time

Include individual characteristics to avoid omitted variables bias

Standard errors may be corrected to account for heteroscedasticity (STATA: option vce(robust)) or clustering (STATA: option vce(cluster clustvar))


Pooled cross-section model:

$$y_t = x_t \beta + u_t, \quad t = 1, 2, \dots, T \tag{1}$$

Coefficients in the vector $\beta$ can be consistently estimated by OLS if:

1 $E(x_t' u_t) = 0, \quad t = 1, 2, \dots, T$
⇒ No contemporaneous correlation between explanatory variables and the error term

2 $\operatorname{rank}\left[\sum_{t=1}^{T} E(x_t' x_t)\right] = K$
⇒ No perfect linear dependencies (collinearity) between the explanatory variables


For the usual OLS standard errors to be valid, we additionally need:

3 $E(u_t^2 x_t' x_t) = \sigma^2 E(x_t' x_t), \quad t = 1, 2, \dots, T$, where $\sigma^2 = E(u_t^2)$ for all $t$
⇒ Homoscedasticity

4 $E(u_t u_s x_t' x_s) = 0, \quad t \neq s, \quad t, s = 1, 2, \dots, T$
⇒ No serial correlation


Diff-in-diff Estimation

Popular in applied microeconometrics for policy analysis (e.g. unexpected change in unemployment insurance only in one of two neighboring states)

Natural experiment: One control group (A) and one treatment group (B, dummy dB)

Two time periods: One before the policy change, one afterwards (dummy d2)

Research question: What is the effect of the change in unemployment insurance on the individual labor market experience?

Estimation equation:

$$y_{i,t} = \beta_0 + \beta_2\, dB_{i,t} + \beta_3\, d2_t + \delta_1\, d2_t \cdot dB_{i,t} + u_{i,t} \tag{2}$$

OLS estimator of $\delta_1$:

$$\hat{\delta}_1 = \left(\bar{y}_{B,2} - \bar{y}_{B,1}\right) - \left(\bar{y}_{A,2} - \bar{y}_{A,1}\right) \tag{3}$$

where $\bar{y}_{A,1}$, $\bar{y}_{A,2}$, $\bar{y}_{B,1}$ and $\bar{y}_{B,2}$ are the sample averages of $y$ for the control (A) and the treatment (B) group before and after the policy change (time periods 1 and 2)

⇒ $\hat{\delta}_1$ measures the change in individual labor market experience of group B relative to group A, which can be attributed to the policy change experienced only by group B.
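
A hedged numerical sketch of (3) in Python: the outcome values and the helper names `mean` and `did_estimate` are made up for illustration. The estimate is the change in the treatment group's average minus the change in the control group's average.

```python
# Illustrative data only: outcomes y by group (A = control, B = treatment)
# and period (1 = before, 2 = after the policy change).
sample = {
    ("A", 1): [10.0, 12.0, 11.0],
    ("A", 2): [11.0, 13.0, 12.0],
    ("B", 1): [9.0, 10.0, 11.0],
    ("B", 2): [13.0, 14.0, 15.0],
}


def mean(values):
    """Sample average of a list of numbers."""
    return sum(values) / len(values)


def did_estimate(data):
    """(Change in the group B average) minus (change in the group A average)."""
    change_b = mean(data[("B", 2)]) - mean(data[("B", 1)])
    change_a = mean(data[("A", 2)]) - mean(data[("A", 1)])
    return change_b - change_a


delta_1 = did_estimate(sample)
print(delta_1)
```

In this toy sample the control group's average rises by 1 and the treatment group's by 4, so the common time effect is netted out of the estimate.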

Estimator allows for time- and group-specific effects

The estimator is unbiased if the policy change is not systematically related to other factors that affect $y_{i,t}$ (and are hidden in $u_{i,t}$) ⇒ usually include further control variables for both time and cross-sectional variation


Robust and Clustered Standard Errors


Problem: OLS assumes that observations are drawn from identical distributions, but this is often not the case in survey data

Cross-sectional or time heteroscedasticity: E.g., variance of income is larger for high income respondents than for respondents with lower income

Clustering: Data may differ across certain groups, either different time periods or different cross-section groups

⇒ With heteroscedasticity, the error variance differs across observations; with clustering, errors are correlated within groups ⇒ in both cases the usual OLS standard errors are biased

Robust and clustered standard errors are available with many estimators, including panel and probit/logit models

OLS standard errors:

$$\operatorname{Var}(\beta) = (x'x)^{-1} (x'uu'x) (x'x)^{-1} = \sigma^2 (x'x)^{-1} \tag{4}$$

Robust standard errors:

Account for (cross-sectional) heteroscedasticity:

$$\operatorname{Var}(\beta) = (x'x)^{-1} \left[\sum_{i=1}^{N} u_i^2 x_i' x_i\right] (x'x)^{-1} \tag{5}$$

In STATA: option vce(robust)



Clustered standard errors:

Similar to robust standard errors, but the correction terms are summed within each cluster $g$, which allows residuals to be correlated within a cluster ⇒ assumes independence across clusters

$$\operatorname{Var}(\beta) = (x'x)^{-1} \left[\sum_{g=1}^{G} x_g' u_g u_g' x_g\right] (x'x)^{-1} \tag{6}$$

Appropriate cluster variables may emerge from the estimation context

Clustering is possible along several dimensions (multi-way clustering); otherwise include fixed effects along one dimension and cluster along the other

In STATA: option vce(cluster clustvar), e.g. vce(cluster income) or vce(cluster year)
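
A minimal Python sketch of the sandwich formulas (5) and (6) for the special case of a single regressor without intercept (K = 1), so that $x'x$ is a scalar. The data and the function names `robust_var` and `cluster_var` are illustrative assumptions, and no finite-sample correction is applied, so the numbers will not match STATA's output exactly.

```python
import math

# Illustrative data: one regressor x (no intercept), outcome y, cluster id g.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.4, 3.8, 5.5, 5.7]
g = [0, 0, 1, 1, 2, 2]

sxx = sum(xi * xi for xi in x)                      # x'x (a scalar here)
beta = sum(xi * yi for xi, yi in zip(x, y)) / sxx   # OLS slope
u = [yi - beta * xi for xi, yi in zip(x, y)]        # OLS residuals


def robust_var(x, u):
    """Equation (5) for K = 1: (x'x)^-1 [sum_i u_i^2 x_i^2] (x'x)^-1."""
    bread = sum(xi * xi for xi in x)
    meat = sum((ui * xi) ** 2 for xi, ui in zip(x, u))
    return meat / bread ** 2


def cluster_var(x, u, g):
    """Equation (6) for K = 1: the terms x_i * u_i are summed within each cluster first."""
    bread = sum(xi * xi for xi in x)
    score_by_cluster = {}
    for xi, ui, gi in zip(x, u, g):
        score_by_cluster[gi] = score_by_cluster.get(gi, 0.0) + xi * ui
    meat = sum(s ** 2 for s in score_by_cluster.values())
    return meat / bread ** 2


se_robust = math.sqrt(robust_var(x, u))
se_cluster = math.sqrt(cluster_var(x, u, g))
print(beta, se_robust, se_cluster)
```

If every observation forms its own cluster, the clustered formula collapses to the robust one, which is a quick way to check the implementation.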
Panel Estimators


Account for time variation within individual units

Survey data: Usually unbalanced panel

In principle, we can test for fixed vs. random effects (Hausman test), but intuitively, individual fixed effects make more sense ⇒ they account for unobserved constant individual effects


Fixed Effects Estimator


Unobserved effects model:

$$y_{it} = x_{it}\beta + c_i + u_{it}, \quad t = 1, 2, \dots, T \tag{7}$$

The model includes time-invariant individual effects $c_i$ by estimating a dummy variable for each individual unit over time

1 Exogeneity assumption:

$$E(u_{it} \mid x_i, c_i) = 0, \quad t = 1, 2, \dots, T \tag{8}$$

⇒ Strict exogeneity of $x_{it}$ conditional on $c_i$

With fixed effects, $x_{it}$ cannot include any time-invariant characteristics such as gender, race etc.! All variables must be time-varying at least for SOME part of the sample

2 Rank condition:

$$\operatorname{rank}\left(\sum_{t=1}^{T} E\left[(x_{it} - \bar{x}_i)'(x_{it} - \bar{x}_i)\right]\right) = K \tag{9}$$

⇒ Consistent estimation if the exogeneity assumption and the rank condition hold

3 Homoscedasticity and no serial correlation of residuals:

$$E(u_i u_i' \mid x_i, c_i) = \sigma_u^2 I_T \tag{10}$$

⇒ Efficient and unbiased estimation


Fixed Effects: Within and Between Estimator


Within estimator:

$$(y_{it} - \bar{y}_i) = (x_{it} - \bar{x}_i)\beta + (u_{it} - \bar{u}_i) \tag{11}$$

⇒ Eliminates the unobserved effect by subtracting time averages for each cross-section, with $\bar{y}_i = T^{-1}\sum_{t=1}^{T} y_{it}$ ⇒ Evaluates time variation within cross-sections
⇒ Estimated by pooled OLS
⇒ In STATA: xtreg depvar indepvars [if] [in] [weight], fe [options]

Between estimator:

$$\bar{y}_i = \bar{x}_i\beta + c_i + \bar{u}_i \tag{12}$$

⇒ Eliminates the time effect by calculating time averages between the cross-sections
⇒ In STATA: xtreg depvar indepvars [if] [in] [weight], be [options]
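
The within transformation in (11) can be sketched in Python on a toy panel. All numbers and the helper names `slope` and `demean` are illustrative assumptions: $y_{it} = 2 x_{it} + c_i$ with no idiosyncratic error and with $c_i$ correlated with $x_{it}$.

```python
# Illustrative panel: y_it = 2 * x_it + c_i with no idiosyncratic error,
# where c_i is correlated with x_it (all numbers are made up).
panel = {
    "i1": {"x": [1.0, 2.0, 3.0], "c": 10.0},
    "i2": {"x": [4.0, 5.0, 6.0], "c": 0.0},
}
for unit in panel.values():
    unit["y"] = [2.0 * x + unit["c"] for x in unit["x"]]


def slope(xs, ys):
    """OLS slope of y on x with an intercept."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = sum((a - mx) ** 2 for a in xs)
    return num / den


def demean(values):
    """Subtract the unit-specific time average, as in equation (11)."""
    m = sum(values) / len(values)
    return [v - m for v in values]


pooled_x = [x for unit in panel.values() for x in unit["x"]]
pooled_y = [y for unit in panel.values() for y in unit["y"]]
within_x = [x for unit in panel.values() for x in demean(unit["x"])]
within_y = [y for unit in panel.values() for y in demean(unit["y"])]

beta_pooled = slope(pooled_x, pooled_y)   # contaminated by c_i
beta_within = slope(within_x, within_y)   # c_i has been differenced out
print(beta_pooled, beta_within)
```

In this constructed example pooled OLS even gets the sign wrong, because units with low $x$ have a high $c_i$, while the within estimator recovers the slope of 2.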


First Differencing: FD Estimator

We can also first-difference the data to eliminate the unobserved individual component in (7):

$$\Delta y_{it} = \Delta x_{it}\beta + \Delta u_{it}, \quad t = 2, \dots, T \tag{13}$$

Use pooled OLS to estimate (13)

Loses one observation (the first period) per individual

Same assumptions as for FE estimator need to hold

We can use this version to construct a diff-in-diff estimator
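
A short Python sketch of the first-difference estimator, again on made-up data with $y_{it} = 0.5\, x_{it} + c_i$ and no error term. The helper name `first_diff` is hypothetical.

```python
# Illustrative panel with y_it = 0.5 * x_it + c_i (no error term); made-up values.
x = {"i1": [1.0, 3.0, 4.0], "i2": [2.0, 2.5, 6.0]}
c = {"i1": 5.0, "i2": -3.0}
y = {i: [0.5 * v + c[i] for v in xs] for i, xs in x.items()}


def first_diff(values):
    """Return [v_2 - v_1, ..., v_T - v_{T-1}]; the first period is lost."""
    return [b - a for a, b in zip(values, values[1:])]


dx = [d for i in x for d in first_diff(x[i])]
dy = [d for i in y for d in first_diff(y[i])]

# Pooled OLS on the differenced data (no intercept, as in the differenced equation).
beta_fd = sum(a * b for a, b in zip(dx, dy)) / sum(a * a for a in dx)
print(len(dx), beta_fd)
```

With T = 3 periods and two individuals, four differenced observations remain, and $c_i$ drops out of every one of them.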


Dynamic Panel Models

Standard FE estimators are biased if the model includes a lagged dependent variable, because the within-transformed lag is correlated with the transformed error term (Nickell (1981) bias)

Arellano-Bond estimator to account for a lagged dependent variable (Arellano and Bond (1991), in STATA: xtabond)

Nickell bias is not a severe problem if T is large (the bias is of order 1/T) ⇒ often the case in aggregate country panels

Alternatively, use the Least Squares Dummy Variable Corrected (LSDVC) estimator (Bruno (2005), in STATA: xtlsdvc)

Generally: Serial correlation is not a big issue with micro survey data; it applies more to aggregated (macro) panel models

Accounting for Cross-Sectional Correlation in Panel Data

Usual panel estimators assume that cross-sections are not correlated

However, we may find contemporaneous correlation across either individuals or aggregate cross-sections

Tests for cross-sectional correlation:

Breusch and Pagan (1980) statistic for cross-sectional independence of the residuals of a fixed effects model (see also Greene, 2012). In STATA: xttest2

Pesaran (2004) test for cross-sectional dependence, can also handle unbalanced panels. In STATA: xtcsd, pesaran show

xtcsd includes further tests by Frees (1995, 2004) and Friedman (1937)


Solution:

Use OLS regression with clustered standard errors, robust to heteroscedasticity across cross-sections and to within cross-section correlation

Estimate with feasible generalized least squares (FGLS); the estimator is robust to AR(1) autocorrelation within cross-sections, heteroscedasticity and cross-sectional correlation. Estimator in STATA: xtgls

Use OLS regression with panel-corrected standard errors that account for heteroscedasticity and contemporaneous correlation across cross-sections and, potentially, for autocorrelation. Estimator in STATA: xtpcse

Analysis of Binary and Ordinal (Qualitative) Variables


Binary Variables

Dependent variable is a dummy, e.g. =1 if a person is employed and 0 otherwise

We’re interested in the response probability conditional on a set of explanatory variables $x$:

$$p(x) \equiv P(y = 1 \mid x) = P(y = 1 \mid x_1, x_2, \dots, x_K) \tag{14}$$

Interpretation of coefficients not straightforward, except for the signs

Estimate marginal effects post-regression


Probit or Logit Estimators

Latent variable model:

$$y^* = x\beta + e, \quad y = 1[y^* > 0] \tag{15}$$

$e$ is a continuously distributed variable independent of $x$ with a symmetric distribution around 0

Let $G$ be the cdf of $e$, then by the symmetry of the pdf of $e$ around 0, we have $1 - G(-z) = G(z)$. Use this to get:

$$P(y = 1 \mid x) = P(y^* > 0 \mid x) = P(e > -x\beta \mid x) = 1 - G(-x\beta) = G(x\beta) \tag{16}$$

Estimation by maximum likelihood

Robust or clustered standard errors to account for heteroscedasticity or clustered correlations


Probit estimator: Assume a standard normal distribution for $G$:

$$G(x\beta) = \Phi(x\beta) \equiv \int_{-\infty}^{x\beta} \phi(\nu)\, d\nu \tag{17}$$

In STATA: probit depvar [indepvars] [if] [in] [weight] [, options]

Logit estimator: Assume a logistic distribution for $G$:

$$G(x\beta) = \Lambda(x\beta) \equiv \frac{\exp(x\beta)}{1 + \exp(x\beta)} \tag{18}$$

In STATA: logit depvar [indepvars] [if] [in] [weight] [, options]
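
The two link functions in (17) and (18) can be written down directly. The following Python sketch uses hypothetical coefficients and regressor values, and the function names `probit_cdf` and `logit_cdf` are assumptions. It evaluates the response probability $G(x\beta)$ under both distributions.

```python
import math


def probit_cdf(z):
    """Standard normal cdf Phi(z), written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def logit_cdf(z):
    """Logistic cdf Lambda(z) = exp(z) / (1 + exp(z))."""
    return math.exp(z) / (1.0 + math.exp(z))


# Hypothetical coefficients and regressor values (constant, age, income class).
beta = [-1.0, 0.02, 0.3]
x = [1.0, 49.0, 3.0]
index = sum(b * v for b, v in zip(beta, x))  # the linear index x * beta

print(probit_cdf(index), logit_cdf(index))
```

Both functions satisfy the symmetry property $1 - G(-z) = G(z)$ used to derive the response probability.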


Marginal Effects

In order to get an estimate of the marginal effect of a regressor on the probability that $y = 1$, we need to derive marginal effects after the estimation of probit/logit models

For a continuous regressor $x_j$:

$$\frac{\partial p(x)}{\partial x_j} = g(x\beta)\beta_j, \quad \text{where } g(z) \equiv \frac{dG}{dz}(z) \tag{19}$$

For a discrete regressor $x_K$, the marginal effect of changing $x_K$ from 0 to 1, holding all other variables fixed, is given by:

$$G(\beta_1 + \beta_2 x_2 + \dots + \beta_{K-1} x_{K-1} + \beta_K) - G(\beta_1 + \beta_2 x_2 + \dots + \beta_{K-1} x_{K-1}) \tag{20}$$

⇒ The marginal effect of a single regressor depends on all regressors in $x$
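
As a sketch of (19) and (20) for the probit case, the Python snippet below evaluates both kinds of marginal effect at one chosen respondent. The coefficients, the respondent's characteristics and the function names `me_continuous` and `me_dummy` are hypothetical.

```python
import math


def norm_pdf(z):
    """Standard normal density phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal cdf Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def me_continuous(beta, x, j):
    """Equation (19) for a probit: g(x*beta) * beta_j with g the normal density."""
    index = sum(b * v for b, v in zip(beta, x))
    return norm_pdf(index) * beta[j]


def me_dummy(beta, x, k):
    """Equation (20): G(index with x_k = 1) minus G(index with x_k = 0)."""
    x1 = list(x)
    x1[k] = 1.0
    x0 = list(x)
    x0[k] = 0.0
    idx1 = sum(b * v for b, v in zip(beta, x1))
    idx0 = sum(b * v for b, v in zip(beta, x0))
    return norm_cdf(idx1) - norm_cdf(idx0)


# Hypothetical coefficients: constant, age (continuous), female (dummy).
beta = [-0.5, 0.01, 0.4]
respondent = [1.0, 50.0, 0.0]   # the chosen evaluation point

print(me_continuous(beta, respondent, 1), me_dummy(beta, respondent, 2))
```

Changing any entry of `respondent` changes both results, which illustrates why the evaluation point has to be fixed before effects are compared across models.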


Default options: The older mfx command evaluates marginal effects at the mean of the variables in x; margins, dydx(*) reports average marginal effects over the estimation sample unless atmeans or at() is specified

With survey micro data: Define a representative participant and evaluate marginal effects there, so that effects are comparable across different models

In STATA: margins, dydx(*) at(sex=1 age=49 inc=3 employ=3)


Bi-Probit Estimators

What if we are interested in the likelihood that two dummy variables equal 1 simultaneously?

$$y_1 = 1[x_1\beta_1 + e_1 > 0] \tag{21}$$

$$y_2 = 1[x_2\beta_2 + e_2 > 0] \tag{22}$$

Error terms in $e \equiv (e_1, e_2)$ are assumed to be bivariate normally distributed: $e \mid x \sim N(0, \Omega)$


If $e_1$ and $e_2$ are correlated, estimating a joint maximum likelihood procedure is more efficient

In STATA: biprobit depvar1 depvar2 [indepvars] [if] [in] [weight] [, options]

Compute marginal effects for the likelihood of both $y_1$ and $y_2$ being 1: margins, predict(p11) dydx(*) at(sex=1 age=49 inc=3 employ=3)

margins also computes conditional marginal effects or marginal effects for one variable being 1 and the other being 0


Ordered Probit or Logit Estimators

Dependent variable is ordinal, e.g. has values from 1-5 where the ordering matters

Often the case for qualitative survey data, e.g. Likert scale: (1 – like), (2 – like somewhat), (3 – neutral), (4 – dislike somewhat), (5 – dislike)

Latent variable model:

$$y^* = x\beta + e, \quad e \mid x \sim N(0, 1) \tag{23}$$

$$y = \begin{cases} 0 & \text{if } y^* \leq \alpha_1 \\ 1 & \text{if } \alpha_1 < y^* \leq \alpha_2 \\ \vdots & \\ J & \text{if } y^* > \alpha_J \end{cases}$$
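
The threshold rule implies that the probability of each category is a difference of normal cdf values, $P(y = j \mid x) = \Phi(\alpha_{j+1} - x\beta) - \Phi(\alpha_j - x\beta)$. A minimal Python sketch, with a hypothetical index value, hypothetical cutoffs and an assumed function name `ordered_probit_probs`:

```python
import math


def norm_cdf(z):
    """Standard normal cdf Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def ordered_probit_probs(index, cutoffs):
    """P(y = j | x) for j = 0..J given x*beta and increasing cutoffs alpha_1..alpha_J."""
    # Cumulative probabilities P(y* <= alpha_j | x) = Phi(alpha_j - x*beta),
    # padded with 0 and 1 for the open-ended lowest and highest categories.
    cum = [0.0] + [norm_cdf(a - index) for a in cutoffs] + [1.0]
    return [hi - lo for lo, hi in zip(cum, cum[1:])]


# Hypothetical values: x*beta = 0.3 and four cutoffs, i.e. five ordered categories.
probs = ordered_probit_probs(0.3, [-1.0, -0.2, 0.5, 1.4])
print(probs, sum(probs))
```

The probabilities sum to one by construction, and a higher index shifts probability mass towards the higher categories.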


Estimate parameters in $\alpha$ and $\beta$ by maximum likelihood assuming either a normal or a logistic distribution for $e$

Make sure the ordering is meaningful, i.e. qualitative expectations should assign higher numbers to higher expectations

In STATA: oprobit or ologit

Marginal effects have to be estimated separately for each ordinal realisation, e.g. in STATA: margins, dydx(*) predict(outcome(5)) at(sex=1 age=49 inc=3 employ=3) gives the marginal effects of all regressors for the likelihood that the ordinal variable takes on the value of 5


Binary or Ordinal Dependent Variables in Panel Data


All estimators for binary or ordinal dependent variables are also
available for panel data

With random cross-sections: Simply use the estimators on the pooled data, controlling for individual characteristics

With panel data (identical individuals followed over time):

Use within transformation or first-differencing to eliminate the unobserved effect $c_i$

Estimate random-effects or population-averaged probit or ordered probit models (in STATA: xtprobit and xtoprobit)

Population-averaged probit: Estimates at the average value of unobserved components $c_i$ in the population, $\bar{c}$

Random-effects probit: Assume a normal distribution for $c_i$, integrate out $c_i$ when constructing the log-likelihood function, marginal effects can be estimated at $c = 0$
Sample Selection and Attrition Bias


Assumption: Survey samples are drawn randomly and are representative of the overall population

What if there is non-random sample selection (= incidental truncation)? ⇒ Survivorship or attrition bias

Respondents leave the survey permanently (e.g. professional forecasters get a new job at a firm not participating in the survey, a household drops out for private reasons or the respondent dies, a firm goes bankrupt)

Certain groups of participants are more likely to be selected into a panel dimension of the survey

What if the decision to not answer a particular question is non-random? ⇒ Non-response bias

Certain questions might be deemed more difficult than others

Heckman (1979) Estimator

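Two-step estimation procedure by Heckman (1979) to account for attrition

Unobserved selection equation:

$$\begin{aligned}
z_i^* &= w_i'\gamma + u_i, \quad z_i = 1 \text{ if } z_i^* > 0 \text{ and } 0 \text{ otherwise} \\
\operatorname{Prob}(z_i = 1 \mid w_i) &= \Phi(w_i'\gamma) \\
\operatorname{Prob}(z_i = 0 \mid w_i) &= 1 - \Phi(w_i'\gamma)
\end{aligned} \tag{24}$$

Regression equation:

$$\begin{aligned}
y_i &= x_i'\beta + \varepsilon_i \quad \text{observed only if } z_i = 1 \\
(u_i, \varepsilon_i) &\sim \text{bivariate normal}[0, 0, 1, \sigma_\varepsilon, \rho]
\end{aligned} \tag{25}$$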


Need to assume that zi and wi are observed for a random sample

Combining (24) and (25) gives:

$$E[y_i \mid z_i = 1, x_i, w_i] = x_i'\beta + \rho\sigma_\varepsilon \lambda(w_i'\gamma), \quad \text{with } \lambda(w_i'\gamma) = \phi(w_i'\gamma)/\Phi(w_i'\gamma) \tag{26}$$

⇒ Not accounting for sample selection results in an omitted variable bias

⇒ OLS is not consistent!
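
The step from the selection and regression equations to (26) can be spelled out as follows. This is a sketch that uses the standard conditional-mean result for the bivariate normal distribution, $E[\varepsilon_i \mid u_i] = \rho\sigma_\varepsilon u_i$ when $\operatorname{Var}(u_i) = 1$, together with the mean of a truncated standard normal variable:

```latex
\begin{aligned}
E[y_i \mid z_i = 1, x_i, w_i]
  &= x_i'\beta + E[\varepsilon_i \mid u_i > -w_i'\gamma] \\
  &= x_i'\beta + \rho\sigma_\varepsilon \, E[u_i \mid u_i > -w_i'\gamma] \\
  &= x_i'\beta + \rho\sigma_\varepsilon \, \frac{\phi(-w_i'\gamma)}{1 - \Phi(-w_i'\gamma)} \\
  &= x_i'\beta + \rho\sigma_\varepsilon \, \frac{\phi(w_i'\gamma)}{\Phi(w_i'\gamma)}
   = x_i'\beta + \rho\sigma_\varepsilon \, \lambda(w_i'\gamma)
\end{aligned}
```

The last line uses the symmetry of the normal distribution. A regression of $y_i$ on $x_i$ alone therefore omits the term $\lambda(w_i'\gamma)$ whenever $\rho \neq 0$.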


Two-Step Heckman Correction

1 Estimate a probit model of (24) by maximum likelihood to obtain estimates of $\gamma$. For each observation, compute $\hat{\lambda}_i = \phi(w_i'\hat{\gamma})/\Phi(w_i'\hat{\gamma})$ and $\hat{\delta}_i = \hat{\lambda}_i(\hat{\lambda}_i + w_i'\hat{\gamma})$

2 Estimate $y_i = x_i'\beta + \beta_\lambda \hat{\lambda}_i + \varepsilon_i$ by least squares to obtain estimates of $\beta$ and $\beta_\lambda$

The estimator of the correlation between the selection equation and the regression model, $\hat{\rho}$, provides a measure for the degree of attrition bias ⇒ Test whether $\rho = 0$
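
The quantities computed after the first step can be sketched in Python. The probit indices below are hypothetical stand-ins for $w_i'\hat{\gamma}$ (no probit is actually estimated here), and the function names `inverse_mills` and `delta` are assumptions.

```python
import math


def norm_pdf(z):
    """Standard normal density phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal cdf Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def inverse_mills(index):
    """lambda = phi(w'gamma) / Phi(w'gamma) for a given probit index w'gamma."""
    return norm_pdf(index) / norm_cdf(index)


def delta(index):
    """delta = lambda * (lambda + w'gamma)."""
    lam = inverse_mills(index)
    return lam * (lam + index)


# Hypothetical first-step probit indices w_i' * gamma_hat for three respondents.
indices = [-1.0, 0.0, 1.5]
lambdas = [inverse_mills(v) for v in indices]
deltas = [delta(v) for v in indices]
print(lambdas, deltas)
```

Respondents with a low selection index (unlikely to be observed) get a large $\hat{\lambda}_i$, so the correction term in the second step matters most for them.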


Two-Step Heckman Correction in STATA

Cross-section data: heckman depvar [indepvars], select([depvar_s =] varlist_s) twostep [options]

Binary data: heckprobit depvar [indepvars] [if] [in] [weight], select([depvar_s =] varlist_s) [options]

Ordinal data: heckoprobit depvar [indepvars] [if] [in] [weight], select([depvar_s =] varlist_s) [options]

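At the time of these slides there was no module to do a Heckman correction with panel data in STATA; newer STATA versions (16 and later) include xtheckman for random-effects models with sample selection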

Sampling Weights


Sampling weights are calculated by surveys in order to ensure the representativeness of the sample w.r.t. the overall population

Check whether the sample contains a weight variable (index), e.g. the household head weight index in the Michigan Survey of Consumers

Sampling weights (in STATA: pweight) denote the inverse of the probability that the observation is included due to the sampling design

Use weights in the estimations, e.g.:

oprobit cons_past infl_exp int_exp age sex income [pweight=weight], robust
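
To see what inverse-probability weights do, consider a made-up sample in which one type of respondent is under-sampled. This Python sketch uses purely illustrative values and compares the unweighted with the weighted sample mean.

```python
# Made-up example: each respondent's outcome and probability of being sampled.
outcomes = [2.0, 2.0, 2.0, 6.0]
inclusion_prob = [0.5, 0.5, 0.5, 0.1]   # the last respondent's group is under-sampled

# Sampling weight = inverse of the inclusion probability.
weights = [1.0 / p for p in inclusion_prob]

unweighted_mean = sum(outcomes) / len(outcomes)
weighted_mean = sum(w * y for w, y in zip(weights, outcomes)) / sum(weights)
print(unweighted_mean, weighted_mean)
```

The last respondent stands in for ten people in the population, the others for two each, so the weighted mean moves towards that respondent's outcome.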


Literature I

Arellano, M. and S. Bond (1991).
Some tests of specification for panel data: Monte Carlo evidence and an application to employment equations.
Review of Economic Studies 58 (2), 277–297.

Breusch, T. and A. R. Pagan (1980).
The Lagrange multiplier test and its applications to model specification in econometrics.
Review of Economic Studies 47 (1), 239–253.

Bruno, G. S. (2005).
Approximating the bias of the LSDV estimator for dynamic unbalanced panel data models.
Economics Letters 87 (3), 361–366.


Literature II

Frees, E. W. (1995).
Assessing cross-sectional correlations in panel data.
Journal of Econometrics 64, 393–414.

Frees, E. W. (2004).
Longitudinal and Panel Data: Analysis and Applications in the Social Sciences.
Cambridge University Press.

Friedman, M. (1937).
The use of ranks to avoid the assumption of normality implicit in the analysis of variance.
Journal of the American Statistical Association 32, 675–701.


Literature III

Greene, W. (2012).
Econometric Analysis (7th ed.).
Pearson Education.
Heckman, J. J. (1979).
Sample selection bias as a specification error.
Econometrica 47 (1), 153–161.

Nickell, S. (1981).
Biases in dynamic models with fixed effects.
Econometrica 49 (6), 1417–1426.

Pesaran, M. H. (2004).
General diagnostic tests for cross section dependence in panels.
Cambridge Working Paper in Economics 0435.


Literature IV

Wooldridge, J. M. (2010).
Econometric Analysis of Cross Section and Panel Data (2nd ed.).
Cambridge, MA: MIT Press.
