
Generalized Synthetic Control Method: Causal Inference

with Interactive Fixed Effects Models


Yiqing Xu∗
University of California, San Diego
Forthcoming, Political Analysis

ABSTRACT
Difference-in-differences (DID) is commonly used for causal inference in time-
series cross-sectional data. It requires the assumption that the average outcomes
of treated and control units would have followed parallel paths in the absence
of treatment. In this paper, we propose a method that not only relaxes this
often-violated assumption, but also unifies the synthetic control method (Abadie,
Diamond and Hainmueller 2010) with linear fixed effects models under a simple
framework, of which DID is a special case. It imputes counterfactuals for each
treated unit using control group information based on a linear interactive fixed ef-
fects model that incorporates unit-specific intercepts interacted with time-varying
coefficients. This method has several advantages. First, it allows the treatment
to be correlated with unobserved unit and time heterogeneities under reasonable
modelling assumptions. Second, it generalizes the synthetic control method to
the case of multiple treated units and variable treatment periods, and improves
efficiency and interpretability. Third, with a built-in cross-validation procedure,
it avoids specification searches and thus is easy to implement. An empirical ex-
ample of Election Day Registration and voter turnout in the United States is
provided.

Keywords: causal inference, TSCS data, difference-in-differences, synthetic con-


trol method, interactive fixed effects, factor analysis


∗ Department of Political Science, University of California, San Diego. Social Science Building 377, 9500
Gilman Drive #0521, La Jolla, CA 92093. Email: [email protected]. The author is indebted to Matt Black-
well, Devin Caughey, Justin Grimmer, Jens Hainmueller, Danny Hidalgo, Simon Jackman, Jonathan Katz,
Luke Keele, Eric Min, Molly Roberts, Jim Snyder, Brandon Stewart, Teppei Yamamoto, as well as seminar
participants at the 2015 MPSA Annual Meeting and 2015 APSA Annual Meeting for helpful comments and
suggestions. I thank the editor, Mike Alvarez, and two anonymous reviewers for their extremely helpful
suggestions. I thank Jushan Bai for generously sharing the Matlab codes used in Bai (2009) and Melanie
Springer for kindly providing the state-level voter turnout data (1920-2000). The source code and data used
in the paper can be downloaded from the Political Analysis Dataverse at dx.doi.org/10.7910/DVN/8AKACJ
(Xu 2016) as well as the author’s website. Supplementary Materials for this article are available on the
journal’s website.

1. INTRODUCTION

Difference-in-differences (DID) is one of the most commonly used empirical designs in today’s
social sciences. The identifying assumptions for DID include the “parallel trends” assump-
tion, which states that in the absence of the treatment the average outcomes of treated and
control units would have followed parallel paths. This assumption is not directly testable,
but researchers have more confidence in its validity when they find that the average out-
comes of the treated and control units follow parallel paths in pre-treatment periods. In
many cases, however, parallel pre-treatment trends are not supported by data, a clear sign
that the “parallel trends” assumption is likely to fail in the post-treatment period as well.
This paper attempts to deal with this problem systematically. It proposes a method that es-
timates the average treatment effect on the treated using time-series cross-sectional (TSCS)
data when the “parallel trends” assumption is not likely to hold.
The presence of unobserved time-varying confounders causes the failure of this assump-
tion. There are broadly two approaches in the literature to deal with this problem. The first
one is to condition on pre-treatment observables using matching methods, which may help
balance the influence of potential time-varying confounders between treatment and control
groups. For example, Abadie (2005) proposes matching before DID estimations. Although
this method is easy to implement, it does not guarantee parallel pre-treatment trends. The
synthetic control method proposed by Abadie, Diamond and Hainmueller (2010, 2015) goes
one step further. It matches both pre-treatment covariates and outcomes between a treated
unit and a set of control units and uses pre-treatment periods as criteria for good matches.1
Specifically, it constructs a “synthetic control unit” as the counterfactual for the treated unit
by reweighting the control units. It provides explicit weights for the control units, thus mak-
ing the comparison between the treated and synthetic control units transparent. However,
it only applies to the case of one treated unit and the uncertainty estimates it offers are not
[1] See Hsiao, Ching and Wan (2012) and Angrist, Jordà and Kuersteiner (2013) for alternative matching methods along this line of thought.

easily interpretable.2
The second approach is to model the unobserved time-varying heterogeneities explicitly.
A widely used strategy is to add in unit-specific linear or quadratic time trends to conven-
tional two-way fixed effects models. By doing so, researchers essentially rely upon a set
of alternative identification assumptions that treatment assignment is ignorable conditional
on both the fixed effects and the imposed trends (Mora and Reggio 2012). Controlling for
these trends, however, often consumes a large number of degrees of freedom and may not
necessarily solve the problem if the underlying confounders are not in forms of the specified
trends.
An alternative way is to model unobserved time-varying confounders semi-parametrically.
For example, Bai (2009) proposes an interactive fixed effects (IFE) model, which incorpor-
ates unit-specific intercepts interacted with time-varying coefficients. The time-varying coef-
ficients are also referred to as (latent) factors while the unit-specific intercepts are labelled as
factor loadings. This approach builds upon an earlier literature on factor models in quantitative finance.3 The model is estimated by iteratively conducting a factor analysis of the
residuals from a linear model and estimating the linear model that takes into account the in-
fluences of a fixed number of most influential factors. Pang (2010, 2014) explores non-linear
IFE models with exogenous covariates in a Bayesian multilevel framework. Stewart (2014)
provides a general framework of estimating IFE models based on a Bayesian variational
inference algorithm. Gobillon and Magnac (2016) show that IFE models out-perform the
synthetic control method in DID settings when factor loadings of the treatment and control
groups do not share common support.4
This paper proposes a generalized synthetic control (GSC) method that links the two
approaches and unifies the synthetic control method with linear fixed effects models under
[2] To gauge the uncertainty of the estimated treatment effect, the synthetic control method compares the estimated treatment effect with the "effects" estimated from placebo tests in which the treatment is randomly assigned to a control unit.
[3] See Campbell, Lo and MacKinlay (1997) for applications of factor models in finance.
[4] For more empirical applications of the IFE estimator, see Kim and Oka (2014) and Gaibulloev, Sandler and Sul (2014).

a simple framework, of which DID is a special case. It first estimates an IFE model using
only the control group data, obtaining a fixed number of latent factors. It then estimates
factor loadings for each treated unit by linearly projecting pre-treatment treated outcomes
onto the space spanned by these factors. Finally, it imputes treated counterfactuals based
on the estimated factors and factor loadings. The main contribution of this paper, hence, is
to employ a latent factor approach to address a causal inference problem and provide valid,
simulation-based uncertainty estimates under reasonable assumptions.
This method is in the spirit of the synthetic control method in the sense that in essence
it is a reweighting scheme that takes pre-treatment treated outcomes as benchmarks when
choosing weights for control units and uses cross-sectional correlations between treated and
control units to predict treated counterfactuals. Unlike the synthetic matching method,
however, it conducts dimension reduction prior to reweighting such that vectors to be re-
weighted on are smoothed across control units. The method can also be understood as a
bias correction procedure for IFE models when the treatment effect is heterogeneous across
units.5 It treats counterfactuals of treated units as missing data and makes out-of-sample
predictions for post-treatment treated outcomes based on an IFE model.
This method has several advantages. First, it generalizes the synthetic control method
to cases of multiple treated units and/or variable treatment periods. Since the IFE model is
estimated only once, treated counterfactuals are obtained in a single run. Users therefore no
longer need to find matches of control units for each treated unit one by one.6 This makes
the algorithm fast and less sensitive to the idiosyncrasies of a small number of observations.
Second, the GSC method produces frequentist uncertainty estimates, such as standard
[5] When the treatment effect is heterogeneous (as is almost always the case), an IFE model that imposes a constant treatment effect assumption gives biased estimates of the average treatment effect because the estimation of the factor space is affected by the heterogeneity in the treatment effect.
[6] For example, Acemoglu et al. (2016), who estimate the effect of Tim Geithner connections on stock market returns, conduct the synthetic control method repeatedly for each connected (treated) firm; Dube and Zipperer (2015) estimate the effect of minimum wage policies on wage and employment by conducting the method for each of the 29 policy changes. The latter also extend Abadie, Diamond and Hainmueller (2010)'s original inferential method to the case of multiple treated units using the mean percentile ranks of the estimated effects.

errors and confidence intervals, and improves efficiency under correct model specifications.
A parametric bootstrap procedure based on simulated data can provide valid inference under
reasonable assumptions. Since no observations are discarded from the control group, this
method uses more information from the control group and thus is more efficient than the
synthetic matching method when the model is correctly specified.
Third, it embeds a cross-validation scheme that selects the number of factors of the
IFE model automatically, and thus is easy to implement. One advantage of the DID data
structure is that treated observations in pre-treatment periods can naturally serve as a
validation dataset for model selection. We show that with sufficient data, the cross-validation
procedure can pick up the correct number of factors with high probability, therefore reducing
the risks of over-fitting.
The GSC method has two main limitations. First, it requires more pre-treatment data
than fixed effects estimators. When the number of pre-treatment periods is small, “incidental
parameters” can lead to biased estimates of the treatment effects. Second, and perhaps
more importantly, modelling assumptions play a heavier role with the GSC method than with the
original synthetic matching method. For example, if the treated and control units do not
share common support in factor loadings, the synthetic matching method may simply fail
to construct a synthetic control unit. Since such a problem is obvious to users, the chances
that users misuse the method are small. The GSC method, however, will still impute treated
counterfactuals based on model extrapolation, which may lead to erroneous conclusions. To
safeguard against this risk, it is crucial to conduct various diagnostic checks, such as plotting
the raw data, fitted values, and predicted counterfactuals.
The rest of the paper is organized as follows. Section 2 sets up the model and defines
the quantities of interest. Section 3 introduces the GSC estimator, describes how it is
implemented, and discusses the parametric bootstrap procedure. Section 4 reports simulation results that explore the finite sample properties of the GSC estimator and compare it with
several existing methods. Section 5 illustrates the method with an empirical example that

investigates the effect of Election Day Registration laws on voter turnout in the United
States. The last section concludes.

2. FRAMEWORK

Suppose Yit is the outcome of interest of unit i at time t. Let T and C denote the sets of units
in treatment and control groups, respectively. The total number of units is N = Ntr + Nco ,
where Ntr and Nco are the numbers of treated and control units, respectively. All units are
observed for T periods (from time 1 to time T ). Let T0,i be the number of pre-treatment
periods for unit i, which is first exposed to the treatment at time (T0,i + 1) and subsequently
observed for qi = T − T0,i periods. Units in the control group are never exposed to the
treatment in the observed time span. For notational convenience, we assume that all treated
units are first exposed to the treatment at the same time, i.e., T0,i = T0 and qi = q; variable
treatment periods can be easily accommodated. First, we assume that Yit is given by a linear
factor model.

Assumption 1 Functional form:

Yit = δit Dit + x′it β + λ′i ft + εit,

where the treatment indicator Dit equals 1 if unit i has been exposed to the treatment
prior to time t and equals 0 otherwise (i.e., Dit = 1 when i ∈ T and t > T0 and Dit = 0
otherwise).7 δit is the heterogeneous treatment effect on unit i at time t; xit is a (k × 1) vector of observed covariates, β = [β1, · · · , βk]′ is a (k × 1) vector of unknown parameters,8 ft = [f1t, · · · , frt]′ is an (r × 1) vector of unobserved common factors, λi = [λi1, · · · , λir]′ is
[7] Cases in which the treatment switches on and off (or "multiple-treatment-time") can be easily incorporated in this framework as long as we impose assumptions on how the treatment affects current and future outcomes. For example, one can assume that the treatment only affects the current outcome but not future outcomes (no carryover effect), as fixed effects models often do. In this paper, we do not impose such assumptions. See Imai and Kim (2016) for a thorough discussion.
[8] β is assumed to be constant across space and time mainly for the purpose of fast computation in the frequentist framework. It is a limitation compared with more flexible and increasingly popular random coefficient models in Bayesian multi-level analysis.

an (r × 1) vector of unknown factor loadings, and εit represents unobserved idiosyncratic
shocks for unit i at time t and has zero mean. Assumption 1 requires that the treated and
control units are affected by the same set of factors and the number of factors is fixed during
the observed time periods, i.e., no structural breaks are allowed.
The factor component of the model, λ′i ft = λi1 f1t + λi2 f2t + · · · + λir frt, takes a linear,
additive form by assumption. In spite of the seemingly restrictive form, it covers a wide
range of unobserved heterogeneities. First and foremost, conventional additive unit and
time fixed effects are special cases. To see this, if we set f1t = 1 and λi2 = 1 and rewrite
λi1 = αi and f2t = ξt , then λi1 f1t + λi2 f2t = αi + ξt .9 Moreover, the term also incorporates
cases ranging from unit-specific linear or quadratic time trends to autoregressive components
that researchers often control for when analyzing TSCS data.10 In general, as long as an
unobserved random variable can be decomposed into a multiplicative form, i.e., Uit = ai × bt, it can be absorbed by λ′i ft; the term cannot, however, capture unobserved confounders that are independent across units.
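As a concrete check that additive fixed effects and unit-specific time trends are special cases of the factor component, the following minimal numpy sketch (hypothetical values, not code from the paper) constructs F and Λ so that λ′i ft reproduces αi + ξt + γi · t:

    import numpy as np

    rng = np.random.default_rng(0)
    N, T = 6, 10
    alpha = rng.normal(size=N)        # unit fixed effects
    xi = rng.normal(size=T)           # time fixed effects
    gamma = rng.normal(size=N)        # slopes of unit-specific linear trends
    trend = np.arange(1, T + 1)

    # Three "factors": a constant, the time effects, and a linear trend
    F = np.column_stack([np.ones(T), xi, trend])       # (T x r) with r = 3
    # Matching loadings: alpha_i on the constant, 1 on xi_t, gamma_i on the trend
    Lam = np.column_stack([alpha, np.ones(N), gamma])  # (N x r)

    factor_component = Lam @ F.T                       # (N x T) matrix of lambda_i' f_t
    direct = alpha[:, None] + xi[None, :] + gamma[:, None] * trend[None, :]
    print(np.allclose(factor_component, direct))       # True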
To formalize the notion of causality, we also use the notation from the potential out-
comes framework for causal inference (Neyman 1923; Rubin 1974; Holland 1986). Let Yit (1)
and Yit (0) be the potential outcomes for individual i at time t when Dit = 1 or Dit = 0,
respectively. We thus have Yit(0) = x′it β + λ′i ft + εit and Yit(1) = δit + x′it β + λ′i ft + εit. The
individual treatment effect on treated unit i at time t is therefore δit = Yit (1) − Yit (0) for
any i ∈ T , t > T0 .
We can rewrite the DGP of each unit as:

Yi = Di ◦ δi + Xi β + F λi + εi,   i = 1, 2, · · · , Nco, Nco + 1, · · · , N,

where Yi = [Yi1, Yi2, · · · , YiT]′; Di = [Di1, Di2, · · · , DiT]′ and δi = [δi1, δi2, · · · , δiT]′ (the symbol "◦" stands for point-wise product); εi = [εi1, εi2, · · · , εiT]′ are (T × 1) vectors; Xi =
[9] For this reason, additive unit and time fixed effects are not explicitly assumed in the model. An extended model that directly imposes additive two-way fixed effects is discussed in the next section.
[10] In the former case, we can set f1t = t and f2t = t²; in the latter case, for example, we can rewrite Yit = ρYi,t−1 + x′it β + εit as Yit = Yi0 · ρ^t + x′it β + νit, in which νit is an AR(1) process and ρ^t and Yi0 are the unknown factor and factor loading, respectively. See Gobillon and Magnac (2016) for more examples.

[xi1, xi2, · · · , xiT]′ is a (T × k) matrix; and F = [f1, f2, · · · , fT]′ is a (T × r) matrix.
The control and treated units are subscripted from 1 to Nco and from Nco + 1 to N ,
respectively. The DGP of a control unit can be expressed as: Yi = Xi β + F λi + εi, i = 1, 2, · · · , Nco. Stacking all control units together, we have:

Yco = Xco β + F Λ′co + εco,    (1)

in which Yco = [Y1, Y2, · · · , YNco] and εco = [ε1, ε2, · · · , εNco] are (T × Nco) matrices; Xco is a three-dimensional (T × Nco × p) matrix; and Λco = [λ1, λ2, · · · , λNco]′ is a (Nco × r) matrix; hence, the products Xco β and F Λ′co are also (T × Nco) matrices. To identify β, F, and Λco in Equation (1), more constraints are needed. Following Bai (2003, 2009), I add two sets of constraints on the factors and factor loadings: (1) all factors are normalized, and (2) they are orthogonal to each other, i.e., F′F/T = Ir and Λ′co Λco = diagonal.11 For the
moment, the number of factors r is assumed to be known. In the next section, we propose
a cross-validation procedure that automates the choice of r.
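The normalization constraints can be made concrete with a few lines of numpy: given an arbitrary pair (F, Λco), the sketch below (a minimal illustration with made-up dimensions, not part of the paper's code) rotates the pair into an observationally equivalent one that satisfies F′F/T = Ir and Λ′co Λco diagonal while leaving the common component F Λ′co unchanged.

    import numpy as np
    from numpy.linalg import eigh, inv

    rng = np.random.default_rng(1)
    T, N_co, r = 30, 45, 2
    F = rng.normal(size=(T, r))          # arbitrary factors
    Lam = rng.normal(size=(N_co, r))     # arbitrary factor loadings

    # Symmetric square root S of F'F/T; rescaling by S^{-1} normalizes the factors
    w, Q = eigh(F.T @ F / T)
    S = Q @ np.diag(np.sqrt(w)) @ Q.T
    F1, Lam1 = F @ inv(S), Lam @ S       # now F1'F1/T = I_r
    # Rotate so that the loadings have a diagonal cross-product
    d, V = eigh(Lam1.T @ Lam1)
    F_star, Lam_star = F1 @ V, Lam1 @ V

    print(np.allclose(F_star.T @ F_star / T, np.eye(r)))   # True
    print(np.allclose(Lam_star.T @ Lam_star, np.diag(d)))  # True
    print(np.allclose(F_star @ Lam_star.T, F @ Lam.T))     # True: the common component is preserved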
The main quantity of interest of this paper is the average treatment effect on the treated
(ATT) at time t (when t > T0 ):

ATTt,t>T0 = (1/Ntr) Σi∈T [Yit(1) − Yit(0)] = (1/Ntr) Σi∈T δit.12

Note that in this paper, as in Abadie, Diamond and Hainmueller (2010), we treat the treat-
ment effects δit as given once the sample is drawn.13 Because Yit (1) is observed for treated
[11] These constraints do not lead to loss of generality because for an arbitrary pair of matrices F and Λco, we can find an (r × r) invertible matrix A such that (FA)′(FA)/T = Ir and A−1 Λ′co Λco (A′)−1 is a diagonal matrix. To see this, we can rewrite F λi as F̃ λ̃i, in which F̃ = FA and λ̃i = A−1 λi for units in both the treatment and control groups, such that F̃ and Λ̃co satisfy the above constraints. The total number of constraints is r², the dimension of the matrix space to which A belongs. It is worth noting that although the original factors F may not be identifiable, the space spanned by F, an r-dimensional subspace of the T-dimensional space, is identified under the above constraints because any vector in the subspace spanned by F̃ is also in the subspace spanned by the original factors F.
[12] For a clear and detailed explanation of quantities of interest in TSCS analysis, see Blackwell and Glynn (2015). Using their terminology, this paper intends to estimate the Average Treatment History Effect on the Treated given two specific treatment histories: E[Yit(a1t) − Yit(a0t) | Di,t−1 = a1,t−1], in which a0t = (0, · · · , 0) and a1t = (0, · · · , 0, 1, · · · , 1), with T0 zeros and (t − T0) ones, indicate the histories of treatment statuses. We keep the current notation for simplicity.
[13] We attempt to make inference about the ATT in the sample we draw, not the ATT of the population. In other words, we do not incorporate uncertainty of the treatment effects δit.

units in post-treatment periods, the main objective of this paper is to construct counterfac-
tuals for each treated unit in post-treatment periods, i.e., Yit (0) for i ∈ T and t > T0 . The
problem of causal inference indeed turns into a problem of forecasting missing data.14

Assumptions for causal identification. In addition to the functional form assumption


(Assumption 1), three assumptions are required for the identification of the quantities of
interest. Among them, the assumption of strict exogeneity is the most important.

Assumption 2 Strict exogeneity.

εit ⊥⊥ Djs , xjs , λj , fs ∀i, j, t, s.

Assumption 2 means that the error term of any unit at any time period is independent of
treatment assignment, observed covariates, and unobserved cross-sectional and temporal
heterogeneities of all units (including itself) at all periods. We call it a strict exogen-
eity assumption, which implies conditional mean independence, i.e., E[εit |Dit , xit , λi , ft ] =
E[εit |xit , λi , ft ] = 0.15
Assumption 2 is arguably weaker than the strict exogeneity assumption required by
fixed effects models when decomposable time-varying confounders are present. These
confounders are decomposable if they can take forms of heterogeneous impacts of a common
trend or a series of common shocks. For instance, suppose a law is passed in a state because
the public opinion in that state becomes more liberal. Because changing ideologies are often
cross-sectionally correlated across states, a latent factor may be able to capture shifting
ideology at the national level; the national shifts may have a larger impact on a state that
has a tradition of mass liberalism or has a higher proportion of manufacturing workers than a
[14] The idea of predicting treated counterfactuals in a DID setup is also explored by Brodersen et al. (2014) using a structural Bayesian time series approach.
[15] Note that because εit is independent of Dis and xis for all (t, s), Assumption 2 rules out the possibility that past outcomes may affect future treatments, which is allowed by the so-called "sequential exogeneity" assumption. A directed acyclic graph (DAG) representation is provided in the Online Appendix. See Blackwell and Glynn (2015) and Imai and Kim (2016) for discussions on the difference between the strict ignorability and sequential ignorability assumptions. What is unique here is that we condition on unobserved factors and factor loadings.

state that is historically conservative. Controlling for this unobserved confounder, therefore,
can alleviate the concern that the passage of the law is endogenous to changing ideology of
a state’s constituents to a great extent.
When such a confounder exists, with two-way fixed effects models we need to assume that (εit + λ′i ft) ⊥⊥ Djs, xjs, αj, ξs, ∀i, j, t, s (with λ′i ft, αj, and ξs representing the time-varying confounder for unit i at time t, the fixed effect for unit j, and the fixed effect for time s, respectively) for the identification of the constant treatment effect. This is implausible because λ′i ft is likely to be correlated with Dit, xit, and αi, not to mention other terms. In contrast, Assumption 2 allows the treatment indicator to be correlated with both xjs and λ′j fs for any unit j at any time period s (including i and t themselves).
Identifying the treatment effects also requires the following assumptions.

Assumption 3 Weak serial dependence of the error terms.


Assumption 4 Regularity conditions.

Assumptions 3 and 4 (see the Online Appendix for details) are needed for the consistent
estimation of β and the space spanned by F (or F 0 F/T ). Similar, though slightly weaker,
assumptions are made in Bai (2009) and Moon and Weidner (2015). Assumption 3 allows
weak serial correlations but rules out strong serial dependence, such as unit root processes;
errors of different units are uncorrelated. A sufficient condition for Assumption 3 to hold is
that the error terms are not only independent of covariates, factors and factor loadings, but
also independent both across units and over time, which is assumed in Abadie, Diamond and
Hainmueller (2010). Assumption 4 specifies moment conditions that ensure the convergence
of the estimator.
For valid inference based on a block bootstrap procedure discussed in the next section,
we also need Assumption 5 (see the Online Appendix for details). Heteroskedasticity across
time, however, is allowed.

Assumption 5 The error terms are cross-sectionally independent and homoscedastic.

Remark 1: Assumptions 3 and 5 suggest that the error terms εit can be serially correlated.
Assumption 2 rules out dynamic models with lagged dependent variables; however, this is
mainly for the purpose of simplifying proofs (Bai 2009, p. 1243). The proposed method can
accommodate dynamic models as long as the error terms are not serially correlated.

3. ESTIMATION STRATEGY

In this section, we first propose a generalized synthetic control (GSC) estimator for the treatment effect of each treated unit. It is essentially an out-of-sample prediction method based on Bai
(2009)’s factor augmented model.
The GSC estimator for the treatment effect on treated unit i at time t is given by the
difference between the actual outcome and its estimated counterfactual: δ̂it = Yit (1) − Ŷit (0),
in which Ŷit (0) is imputed with three steps. In the first step, we estimate an IFE model
using only the control group data and obtain β̂, F̂ , Λ̂co :

Step 1. (β̂, F̂, Λ̂co) = argmin_{β̃, F̃, Λ̃co} Σi∈C (Yi − Xi β̃ − F̃ λ̃i)′(Yi − Xi β̃ − F̃ λ̃i),
        s.t. F̃′F̃/T = Ir and Λ̃′co Λ̃co = diagonal.

We explain in detail how to estimate this model in the Online Appendix. The second step
estimates factor loadings for each treated unit by minimizing the mean squared error of the
predicted treated outcome in pre-treatment periods:

Step 2. λ̂i = argmin_{λ̃i} (Yi0 − Xi0 β̂ − F̂0 λ̃i)′(Yi0 − Xi0 β̂ − F̂0 λ̃i)
            = (F̂0′ F̂0)−1 F̂0′ (Yi0 − Xi0 β̂),   i ∈ T,

in which β̂ and F̂ 0 are from the first-step estimation and the superscripts “0”s denote the
pre-treatment periods. In the third step, we calculate treated counterfactuals based on β̂,



F̂ , and λ̂i :
Step 3. Ŷit(0) = x′it β̂ + λ̂′i f̂t,   i ∈ T, t > T0.

An estimator for ATTt therefore is: ÂTTt = (1/Ntr) Σi∈T [Yit(1) − Ŷit(0)] for t > T0.
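The three steps can be sketched compactly in numpy. The snippet below is a simplified illustration under the assumptions of this section (no covariates, a known number of factors r, and a balanced panel); it is not the estimation code accompanying the paper. Without covariates, the interactive fixed effects step reduces to principal components on the control outcomes; with covariates one would iterate between this step and the estimation of β, as in Bai (2009).

    import numpy as np

    def gsc_no_covariates(Y_co, Y_tr, T0, r):
        """Y_co: (T x N_co) control outcomes; Y_tr: (T x N_tr) treated outcomes.
        Returns imputed treated counterfactuals Y_hat(0) of shape (T x N_tr)."""
        T = Y_co.shape[0]
        # Step 1: factors from the control group via principal components,
        # normalized so that F'F/T = I_r (the no-covariate version of the IFE step)
        _, eigvecs = np.linalg.eigh(Y_co @ Y_co.T)
        F = np.sqrt(T) * eigvecs[:, ::-1][:, :r]
        # Step 2: loadings for each treated unit from pre-treatment periods only
        lam_tr = np.linalg.lstsq(F[:T0, :], Y_tr[:T0, :], rcond=None)[0]
        # Step 3: impute counterfactuals for all T periods
        return F @ lam_tr

    # Hypothetical usage: ATT_t for the post-treatment periods
    # Y0_hat = gsc_no_covariates(Y_co, Y_tr, T0, r=2)
    # att = (Y_tr - Y0_hat)[T0:, :].mean(axis=1)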

Remark 2: In the Online Appendix, we show that, under Assumptions 1-4, the bias of the GSC estimator shrinks to zero as the sample size grows, i.e., Eε(ÂTTt | D, X, Λ, F) → ATTt as Nco, T0 → ∞ (Ntr is taken as given), in which D = [D1, D2, · · · , DN] is a (T × N) matrix, X is a three-dimensional (T × N × p) matrix, and Λ = [λ1, λ2, · · · , λN]′ is an (N × r) matrix.
Intuitively, both large Nco and large T0 are necessary for the convergence of β̂ and the
estimated factor space. When T0 is small, imprecise estimation of the factor loadings, or the
“incidental parameters” problem, will lead to bias in the estimated treatment effects. This
is a crucial difference from the conventional linear fixed-effect models.

Model selection. In practice, researchers may have limited knowledge of the exact number
of factors to be included in the model. Therefore, we develop a cross-validation procedure to
select models before estimating the causal effect. It relies on the control group information
as well as information from the treatment group in pre-treatment periods. Algorithm 1
describes the details of this procedure.

Algorithm 1 (Cross-validating the number of factors) A leave-one-out cross-validation


procedure that selects the number of factors takes the following steps:

Step 1. Start with a given number of factors r, estimate an IFE model using the control
group data {Yi , Xi }i∈C , obtaining β̂ and F̂ ;

Step 2. Start a cross-validation loop that goes through all T0 pre-treatment periods:

(a) In round s ∈ {1, · · · , T0 }, hold back data of all treated units at time s. Run
an OLS regression using the rest of the pre-treatment data, obtaining factor
loadings for each treated unit i:

λ̂i,−s = (F0−s′ F0−s)−1 F0−s′ (Y0i,−s − X0i,−s β̂),   ∀i ∈ T,



in which the subscript "−s" stands for all pre-treatment periods except for s.
(b) Predict the treated outcomes at time s using Ŷis(0) = x′is β̂ + λ̂′i,−s f̂s and save
the prediction error eis = Yis(0) − Ŷis(0) for all i ∈ T.

End of the cross-validation loop;

Step 3. Calculate the mean square prediction error (MSPE) given r,


MSPE(r) = (Σs=1…T0 Σi∈T e²is) / T0.

Step 4. Repeat Steps 1-3 with different r’s and obtain corresponding MSPEs.

Step 5. Choose r∗ that minimizes the MSPE.

The basic idea of the above procedure is to hold back a small amount of data (e.g. one
pre-treatment period of the treatment group) and use the rest of data to predict the held-
back information. The algorithm then chooses the model that on average makes the most
accurate predictions. A TSCS dataset with a DID data structure allows us to do so because
(1) there exists a set of control units that are never exposed to the treatment and therefore
can serve as the basis for estimating time-varying factors and (2) the pre-treatment periods
of treated units constitute a natural validation set for candidate models. This procedure
is computationally inexpensive because with each r, the IFE model is estimated only once
(Step 1). The other steps involve merely simple calculations. In the Online Appendix, we conduct Monte Carlo exercises and show that the above procedure performs well in terms of choosing the correct number of factors even with relatively small datasets.
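The leave-one-period-out loop can be sketched as follows, reusing the simplified no-covariate estimator sketched in the previous section (again an illustration under assumed inputs, not the paper's implementation):

    import numpy as np

    def cross_validate_r(Y_co, Y_tr, T0, r_max=5):
        """Return the number of factors in 1..r_max with the smallest MSPE."""
        T = Y_co.shape[0]
        _, eigvecs = np.linalg.eigh(Y_co @ Y_co.T)     # control-group factor step
        mspe = {}
        for r in range(1, r_max + 1):                  # Step 4: repeat for each r
            F = np.sqrt(T) * eigvecs[:, ::-1][:, :r]   # Step 1: factors, F'F/T = I_r
            errors = []
            for s in range(T0):                        # Step 2: hold back period s
                keep = [t for t in range(T0) if t != s]
                lam = np.linalg.lstsq(F[keep, :], Y_tr[keep, :], rcond=None)[0]
                errors.append(Y_tr[s, :] - F[s, :] @ lam)   # prediction error at s
            mspe[r] = np.mean(np.concatenate(errors) ** 2)  # Step 3: MSPE(r)
        return min(mspe, key=mspe.get)                 # Step 5: r* minimizing MSPE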

Remark 3: Our framework can also accommodate DGPs that directly incorporate additive
fixed effects, known time trends, and exogenous time-invariant covariates, such as:

Yit = δit Dit + x′it β + γ′i lt + z′i θt + λ′i ft + αi + ξt + εit,    (2)

in which lt is a (q × 1) vector of known time trends that may affect each unit differently; γi is a (q × 1) vector of unit-specific unknown parameters; zi is an (m × 1) vector of observed time-invariant



covariates; θt is an (m × 1) vector of unknown parameters; αi and ξt are additive individual and time fixed effects, respectively. We describe the estimation procedure of this extended model in the Online Appendix.

Inference. We rely on a parametric bootstrap procedure to obtain the uncertainty es-


timates of the GSC estimator (deriving the analytical asymptotic distribution of the GSC
estimator is a necessary step for future research). When the sample size is large, when Ntr
is large in particular, a simple non-parametric bootstrap procedure can provide valid uncer-
tainty estimates. When the sample size is small, especially when Ntr is small, we are unable
to approximate the DGP of the treatment group by resampling the data non-parametrically.
In this case, we simply lack the information of the joint distribution of (Xi , λi , δi ) for the
treatment group. However, we can obtain uncertainty estimates conditional on observed co-
variates and unobserved factors and factor loadings using a parametric bootstrap procedure
via re-sampling the residuals. By re-sampling entire time-series of residuals, we preserve the
serial correlation within the units, thus avoiding underestimating the standard errors due to
serial correlations (Beck and Katz 1995). Our goal is to estimate the conditional variance of the ATT estimator, i.e., Varε(ÂTTt | D, X, Λ, F). Notice that the only random variable that is not being conditioned on is εi, which is assumed to be independent of treatment assign-
ment, observed covariates, factors and factor loadings (Assumption 2). We can interpret εi
as measurement errors or variations in the outcome that we cannot explain but are unrelated
to treatment assignment.16
In the parametric bootstrap procedure, we simulate treated counterfactuals and control
units based on the following re-sampling scheme:

Ỹi (0) = Xi β̂ + F̂ λ̂i + ε̃i , ∀i ∈ C;


Ỹi (0) = Xi β̂ + F̂ λ̂i + ε̃pi , ∀i ∈ T .
[16] εit may be correlated with λ̂i when the errors are serially correlated because λ̂i is estimated using the pre-treatment data.



in which Ỹi (0) is a vector of simulated outcomes in the absence of the treatment; Xi β̂ + F̂ λ̂i is
the estimated conditional mean; and ε̃i and ε̃pi are re-sampled residuals for unit i, depending
on whether it belongs to the treatment or control group. Because β̂ and F̂ are estimated
using only the control group information, Xi β̂ + F̂ λ̂i fits Xi β + F λi better for a control unit
than for a treated unit (as a result, the variance of ε̃pi is usually bigger than that of ε̃i ).
Hence, ε̃i and ε̃pi are drawn from different empirical distributions: ε̃i is the in-sample error
of the IFE model fitted to the control group data, and therefore is drawn from the empirical
distribution of the residuals of the IFE model, while ε̃pi can be seen as the prediction error
of the IFE model for treated counterfactuals.17
Although we cannot observe treated counterfactuals, Yit (0) is observed for all control
units. With the assumptions that treated and control units follow the same factor model
(Assumption 1) and the error terms are independent and homoscedastic across space (As-
sumption 5), we can use a cross-validation method to simulate εpi based on the control group
data (Efron 2012). Specifically, each time we leave one control unit out (to be taken as a "fake" treated unit) and use the rest of the control units to predict the outcome of the left-out unit. The difference between the predicted and observed outcomes is a prediction error of the
IFE model. εpi is drawn from the empirical distributions of the prediction errors. Under As-
sumptions 1-5, this procedure provides valid uncertainty estimates for the proposed method
without making particular distributional assumptions of the error terms. Algorithm 2 de-
scribes the entire procedure in detail.

Algorithm 2 (Inference) A parametric bootstrap procedure that gives the uncertainty es-
timates of the ATT is described as follows:

[17] The treated outcome for unit i thus can be drawn from Ỹi(1) = Ỹi(0) + δi. We do not directly observe δi, but since it is taken as given, its presence will not affect the uncertainty estimates of ÂTTt. Hence, in the bootstrap procedure, we use Ỹi(0) for both the treatment and control groups to form bootstrapped samples (setting δi = 0 for all i ∈ T). We will add back ÂTTt when constructing confidence intervals.



Step 1. Start a loop that runs B1 times:

(a) In round m ∈ {1, · · · , B1 }, randomly select one control unit i as if it was


treated when t > T0 ;
(b) Re-sample the rest of control group with replacement of size Nco and form a
new sample with one “treated” unit and Nco re-sampled control units;
(c) Apply the GSC method to the new sample, obtaining a vector of prediction errors, or residuals: ε̂p(m) = Yi − Ŷi(0).

End of the loop, collecting êp = {ε̂p(1), ε̂p(2), · · · , ε̂p(B1)}.

Step 2. Apply the GSC method to the original data, obtaining: (1) ÂTTt for all t > T0, (2) estimated coefficients β̂, F̂, Λ̂co, and λ̂j, j ∈ T, and (3) the fitted values and residuals of the control units: Ŷco = {Ŷ1(0), Ŷ2(0), · · · , ŶNco(0)} and ê = {ε̂1, ε̂2, · · · , ε̂Nco}.

Step 3. Start a bootstrap loop that runs B2 times:

(a) In round k ∈ {1, · · · , B2}, construct a bootstrapped sample S(k) by:

    Ỹi(k)(0) = Ŷi(0) + ε̃i,   i ∈ C
    Ỹi(k)(0) = Ŷi(0) + ε̃pi,   i ∈ T

in which the vectors ε̃i and ε̃pi are randomly selected from ê and êp, respectively, and Ŷi(0) = Xi β̂ + F̂ λ̂i. Note that the simulated treated counterfactuals do not contain the treatment effect.

(b) Apply the GSC method to S(k) and obtain a new ATT estimate; add ÂTTt,t>T0 to it, obtaining the bootstrapped estimate ÂTT(k)t,t>T0.

End of the bootstrap loop.

Step 4. Compute the variance of ÂTTt,t>T0 using

    Var(ÂTTt | D, X, Λ, F) = (1/B) Σk=1…B [ÂTT(k)t − (1/B) Σj=1…B ÂTT(j)t]²

and its confidence interval using the conventional percentile method (Efron and Tibshirani 1993).
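A stripped-down version of Algorithm 2, reusing the hypothetical gsc_no_covariates sketch from above and ignoring covariates, might look as follows; it illustrates the resampling logic only and is not the inference code used in the paper.

    import numpy as np

    def parametric_bootstrap_att(Y_co, Y_tr, T0, r, B=2000, seed=0):
        rng = np.random.default_rng(seed)
        T, N_co = Y_co.shape
        N_tr = Y_tr.shape[1]
        # Point estimates on the original data (Step 2)
        Y0_hat_tr = gsc_no_covariates(Y_co, Y_tr, T0, r)
        att = (Y_tr - Y0_hat_tr)[T0:, :].mean(axis=1)
        # In-sample fit and residuals of the control units (Step 2)
        _, eigvecs = np.linalg.eigh(Y_co @ Y_co.T)
        F = np.sqrt(T) * eigvecs[:, ::-1][:, :r]
        Y0_hat_co = F @ np.linalg.lstsq(F, Y_co, rcond=None)[0]
        resid_co = Y_co - Y0_hat_co
        # Leave-one-control-out prediction errors (a simplified stand-in for Step 1)
        pred_err = [Y_co[:, [i]] - gsc_no_covariates(np.delete(Y_co, i, axis=1),
                                                     Y_co[:, [i]], T0, r)
                    for i in range(N_co)]
        boot = np.empty((B, T - T0))
        for k in range(B):                                     # Step 3
            e_co = resid_co[:, rng.integers(N_co, size=N_co)]  # resample whole residual series
            e_tr = np.hstack([pred_err[j] for j in rng.integers(N_co, size=N_tr)])
            Y0_hat_k = gsc_no_covariates(Y0_hat_co + e_co, Y0_hat_tr + e_tr, T0, r)
            boot[k, :] = att + (Y0_hat_tr + e_tr - Y0_hat_k)[T0:, :].mean(axis=1)
        # Step 4: bootstrap standard errors and percentile confidence intervals
        return att, boot.std(axis=0), np.percentile(boot, [2.5, 97.5], axis=0)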



4. MONTE CARLO EVIDENCE

In this section, we conduct Monte Carlo exercises to explore the finite sample properties of the
GSC estimator and compare it with several existing methods, including the DID estimator,
the IFE estimator, and the original synthetic matching method. We also investigate the
extent to which the proposed cross-validation scheme can choose the number of factors
correctly in relatively small samples.
We start with the following data generating process (DGP) that includes two observed
time-varying covariates, two unobserved factors, and additive two-way fixed effects:

Yit = δit Dit + xit,1 · 1 + xit,2 · 3 + λ′i ft + αi + ξt + 5 + εit    (3)

where ft = (f1t, f2t)′ and λi = (λi1, λi2)′ are time-varying factors and unit-specific factor loadings. The covariates are (positively) correlated with both the factors and factor loadings: xit,k = 1 + λ′i ft + λi1 + λi2 + f1t + f2t + ηit,k, k = 1, 2. The error term εit and disturbances
in covariates ηit,1 and ηit,2 are i.i.d. N (0, 1). Factors f1t and f2t , as well as time fixed effects
ξt , are also i.i.d. N (0, 1). The treatment and control groups consist of Ntr and Nco units.
The treatment starts to affect the treated units at time T0 + 1 and since then 10 periods are
observed (q = 10). The treatment indicator is defined as in Section 2, i.e., Dit = 1 when
i ∈ T and t > T0 and Dit = 0 otherwise. The heterogeneous treatment effect is generated
by δit,t>T0 = δ̄t + eit, in which eit is i.i.d. N(0,1). δ̄t is given by: [δ̄T0+1, δ̄T0+2, · · · , δ̄T0+10] = [1, 2, · · · , 10].
Factor loadings λi1 and λi2 , as well as unit fixed effects αi , are drawn from uniform
distributions U[−√3, √3] for control units and U[√3 − 2w√3, 3√3 − 2w√3] for treated units
(w ∈ [0, 1]). This means that when 0 ≤ w < 1, (1) the random variables have variance 1;
(2) the supports of factor loadings of treated and control units are not perfectly overlapped;
and (3) the treatment indicator and factor loadings are positively correlated.18
[18] The DGP specified here is modified based on Bai (2009) and Gobillon and Magnac (2016).
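For concreteness, one draw from this DGP can be generated with the following numpy sketch (parameter values as described above; the function name and defaults are illustrative, not taken from the paper's replication files):

    import numpy as np

    def simulate_dgp(N_tr=5, N_co=45, T0=20, q=10, w=0.8, seed=0):
        rng = np.random.default_rng(seed)
        N, T = N_tr + N_co, T0 + q
        f = rng.normal(size=(T, 2))                            # factors f1t, f2t
        xi = rng.normal(size=T)                                # time fixed effects
        s = np.sqrt(3)
        lam = np.vstack([rng.uniform(-s, s, size=(N_co, 2)),   # control loadings
                         rng.uniform(s - 2*w*s, 3*s - 2*w*s, size=(N_tr, 2))])
        alpha = np.concatenate([rng.uniform(-s, s, size=N_co),
                                rng.uniform(s - 2*w*s, 3*s - 2*w*s, size=N_tr)])
        lf = lam @ f.T                                         # (N x T) factor component
        x = np.stack([1 + lf + lam.sum(axis=1)[:, None] + f.sum(axis=1)[None, :]
                      + rng.normal(size=(N, T)) for _ in range(2)])  # two covariates
        D = np.zeros((N, T)); D[N_co:, T0:] = 1                # treated units, post periods
        delta = np.arange(1, q + 1)[None, :] + rng.normal(size=(N_tr, q))
        Y = x[0] * 1 + x[1] * 3 + lf + alpha[:, None] + xi[None, :] + 5 \
            + rng.normal(size=(N, T))
        Y[N_co:, T0:] += delta                                 # heterogeneous treatment effect
        return Y, D, x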



A simulated example. We first illustrate the proposed method, as well as the DGP
described above, with a simulated sample of Ntr = 5, Nco = 45, and T0 = 20 (hence,
N = 50, T = 30). w is set to be 0.8, which means that the treated units are more likely
to have larger factor loadings than the control units. Figure 1 visualizes the raw data and
estimation results.

[Figure 1. Estimated ATT for a Simulated Sample. Upper panel: trajectories of treated and control units, the treated average, and the estimated Y(0) average for the treated. Lower panel: the estimated ATT, the true ATT, and 95% confidence intervals.]

In the upper panel, the dark and light gray lines are time series of the treated and control units, respectively. The bold solid line is the average outcome of the five
treated units while the bold dashed line is the average predicted outcome of the five units in
the absence of the treatment. The latter is imputed using the proposed method.
The lower panel of Figure 1 shows the estimated ATT (solid line) and the true ATT
(dashed line). The 95 percent confidence intervals for the ATT are based on bootstraps of
2,000 times. It shows that the estimated average treated outcome fits the data well in pre-
treatment periods and the estimated ATT is very close to the actual ATT. The estimated



factors and factor loadings, as well as the imputed counterfactual and individual treatment effect for each treated unit, are shown in the Online Appendix.

Finite sample properties. We present the Monte Carlo evidence on the finite sample
properties of the GSC estimator in Table 1 (additional results are shown in the Online
Appendix). As in the previous example, the treatment group is set to have five units. The
estimand is the ATT at time T0 + 5, whose expected value equals 5. Observables, factors,
and factor loadings are drawn only once while the error term is drawn repeatedly; w is
set to be 0.8 such that treatment assignment is positively correlated with factor loadings.
Table 1 reports the bias, standard deviation (SD), and root mean squared error (RMSE) of
AT
[ T T0 +5 from 5,000 simulations for each pair of T0 and Nco .19 It shows that the the GSC
estimator has limited bias even when T0 and Nco are relatively small and the bias goes away
as T0 and Nco grow. As expected, both the SD and RMSE shrink when T0 and Nco become
larger. Table 1 also reports the coverage probabilities of 95 percent confidence intervals for
AT
[ T i,T0 +5 constructed by the parametric bootstrap procedure (Algorithm 2). For each pair
of T0 and Nco , the coverage probability is calculated based on 5,000 simulated samples, each
of which is bootstrapped for 1,000 times. These numbers show that the proposed procedure
can achieve the correct coverage rate even when the sample size is relatively small (e.g.,
T0 = 15, Ntr = 5, Nco = 80).
In the Online Appendix, we run additional simulations and compare the proposed method
with several existing methods, including the DID estimator, the IFE estimator, and the
synthetic matching method. We find that (1) the GSC estimator has less bias than the
DID estimator in the presence of unobserved, decomposable time-varying confounders; (2)
it has less bias than the IFE estimator when the treatment effect is heterogeneous; and
(3) it is usually more efficient than the original synthetic matching estimator. It is worth
[19] Standard deviation is defined as SD(ÂTTt) = sqrt(E[ÂTT(k)t − E(ÂTT(k)t)]²), while root mean squared error is defined as RMSE(ÂTTt) = sqrt(E(ÂTT(k)t − ATTt)²). The superscript (k) denotes the k-th sample. We see that they are very close because the bias of the GSC estimator shrinks to zero as the sample size grows.



Table 1. Finite Sample Properties and Coverage Rates

Ntr   Nco   T0     Bias     SD      RMSE    Coverage
 5     40   15    0.053   0.589    0.591     0.947
 5     80   15    0.017   0.535    0.536     0.949
 5    120   15    0.010   0.524    0.524     0.949
 5    200   15    0.011   0.518    0.518     0.949
 5     40   30    0.046   0.538    0.540     0.946
 5     80   30    0.021   0.504    0.505     0.948
 5    120   30    0.024   0.494    0.495     0.949
 5    200   30    0.008   0.487    0.487     0.949
 5     40   50    0.031   0.519    0.520     0.947
 5     80   50    0.016   0.497    0.498     0.948
 5    120   50    0.003   0.475    0.475     0.949
 5    200   50    0.016   0.468    0.469     0.949

emphasizing that these results are under the premise of correct model specifications. To
address the concern that the GSC method relies on correct model specifications, we conduct
additional tests and show that the cross-validation scheme described in Algorithm 1 is able
to choose the number of factors correctly most of the time when the sample is large enough.

5. EMPIRICAL EXAMPLE

In this section, we illustrate the GSC method with an empirical example that investigates
the effect of Election Day Registration (EDR) laws on voter turnout in the United States.
Voting in the United States usually takes two steps. Except in North Dakota, where no regis-
tration is needed, eligible voters throughout the country must register prior to casting their
ballots. Registration, which often requires a separate trip from voting, is widely regarded
as a substantial cost of voting and a culprit of low turnout rates before the 1993 National
Voter Registration Act (NVRA) was enacted (e.g. Highton 2004). Against this backdrop,
EDR is a reform that allows eligible voters to register on Election Day when they arrive at
polling stations. In the mid-1970s, Maine, Minnesota, and Wisconsin were the first adopters



of this reform in the hopes of increasing voter turnout; while Idaho, New Hampshire, and
Wyoming established EDR in the 1990s as a strategy to opt out of the NVRA (Hanmer 2009).
Before the 2012 presidential election, three other states, Montana, Iowa, and Connecticut,
passed laws to enact EDR, bringing the number of states with EDR laws to nine.20
Most existing studies based on individual-level cross-sectional data, such as the Current
Population Surveys and the National Election Surveys, suggest that EDR laws increase
turnout (the estimated effect varies from 5 to 14 percentage points).21 These studies do not
provide compelling evidence of a causal effect of EDR laws because the research designs they
use are insufficient to address the problem that states self-select their systems of registration
laws. “Registration requirements did not descend from the skies,” as Dean Burnham puts it
(1980, p. 69). A few studies employ time-series or TSCS analysis to address the identification
problem.22 However, Keele and Minozzi (2013) cast doubts on these studies and suggest that
the “parallel trends” assumption may not hold, as we will also demonstrate below.
In the following analysis, we use state-level voter turnout data for presidential elections
from 1920 to 2012.23 The turnout rates are calculated with total ballots counted in a pres-
idential election in a state as the numerator and the state’s voting-age population (VAP)
as the denominator.24 Alaska and Hawaii are not included in the sample since they were
not states until 1959. North Dakota is also dropped since no registration is required. As
mentioned above, up to the 2012 presidential election, 9 states had adopted EDR laws (here-
after referred to as treated) and the remaining 38 states had not (referred to as controls). The raw
[20] In the Online Appendix, we list the years during which EDR laws were enacted and first took effect in presidential elections.
[21] See Wolfinger and Rosenstone (1980); Mitchell and Wlezien (1995); Rhine (1992); Highton (1997); Timpone (1998, 2002); Huang and Shields (2000); Alvarez, Ansolabehere and Wilson (2002); Brians and Grofman (2001); Hanmer (2009); Burden et al. (2009); Cain, Donovan and Tolbert (2011); Teixeira (2011) for examples. The results are especially consistent for the three early adopters, Maine, Minnesota, and Wisconsin.
[22] See, for example, Fenster (1994); King and Wambeam (1995); Knack and White (2000); Knack (2001); Neiheisel and Burden (2012); Springer (2014).
[23] The data from 1920 to 2000 are from Springer (2014). The data from 2004 to 2012 are from The United States Election Project, http://www.electproject.org/. Indicators of other registration laws, including universal mail-in registration and motor voter registration, also come from Springer (2014), with a few supplements. Replication files can be found in Xu (2016).
[24] We do not use the voting-eligible population (VEP) as the denominator because it is not available in early years.



turnout data for all 47 states are shown in the Online Appendix.25
First, we use a standard two-way fixed effects model, which is often referred to as a DID
model in the literature. The results are shown in Table 2 columns (1) and (2). Standard
errors are produced by non-parametric bootstraps (blocked at the state level) of 2,000 times.
In column (1), only the EDR indicator is included, while in column (2), we additionally
control for indicators of universal mail-in registration and motor voter registration. The
estimated coefficients of EDR laws are 0.87 and 0.78 percent using the two specifications,
respectively, with standard errors around 3 percent.
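This two-way fixed effects specification can be reproduced in a few lines (a minimal sketch with hypothetical column names such as "state", "year", "turnout", and "edr" in a long-format data frame; it omits the covariates and the blocked bootstrap standard errors and is not the paper's replication code):

    import pandas as pd

    # On a balanced panel, two-way demeaning gives the same point estimate as a
    # regression of turnout on the EDR indicator with state and year dummies.
    def two_way_fe(df, y="turnout", d="edr"):
        out = df.copy()
        for v in (y, d):
            out[v + "_dm"] = (out[v]
                              - out.groupby("state")[v].transform("mean")
                              - out.groupby("year")[v].transform("mean")
                              + out[v].mean())
        num = (out[y + "_dm"] * out[d + "_dm"]).sum()
        den = (out[d + "_dm"] ** 2).sum()
        return num / den      # coefficient on the EDR indicator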

Table 2. The Effect of EDR on Voter Turnout

Outcome variable: Voter Turnout %

                                     FE                  GSC
                                (1)       (2)       (3)       (4)
Election Day Registration      0.87      0.78      5.13      4.90
                              (3.01)    (3.31)    (2.27)    (2.27)
Universal Mail-in Registration          -0.94                0.15
                                        (1.80)              (0.80)
Motor Voter Registration                -0.21               -1.05
                                        (1.45)              (0.79)
State fixed effects             x         x         x         x
Year fixed effects              x         x         x         x
Unobserved factors             N/A       N/A        2         2
Observations                  1,128     1,128     1,128     1,128
Treated states                  9         9         9         9
Control states                  38        38        38        38

Note: Standard errors in columns (1) and (2) are based on non-parametric bootstraps (blocked at the state level) of 2,000 times. Standard errors in columns (3) and (4) are based on parametric bootstraps (blocked at the state level) of 2,000 times.

The two-way fixed effects model presented in Table 2 assumes a constant treatment effect
both across states and over time. Next we relax this assumption by literally employing a DID
approach. In other words, we estimate the effect of EDR laws on voter turnout in the post-
[25] As is shown in the figure and has been pointed out by many, turnout rates are in general higher in states that have EDR laws than in states that do not, but this does not necessarily imply a causal relationship between EDR laws and voter turnout.



treatment period by subtracting the time intercepts estimated from the control group and
the unit intercepts based on the pre-treatment data. The predicted turnout for state i in year
t, therefore, is the summation of unit intercept i and time intercept t, plus the impact of the
time-varying covariates. The result is visualized in the upper panel of Figure 2. Figure (a)
shows the average actual turnout (solid line) and average predicted turnout in the absence
of EDR laws (dashed line); both averages are taken based on the number of terms since (or
before) EDR laws first took effect. Figure (b) shows the gap between the two lines, or the
estimated ATT. The confidence intervals are produced by block bootstraps of 2,000 times.
It is clear from both figures that the “parallel trends” assumption is not likely to hold since
the average predicted turnout deviates from the average actual turnout in the pre-treatment
periods.
Next, we apply the GSC method to the same dataset. Table 2 columns (3) and (4)
summarize the result.26 Again, both specifications impose additive state and year fixed
effects. In column (3), no covariates are included, while in column (4), mail-in and motor
voter registration are controlled for (assuming that they have constant effects on turnout).
With both specifications, the cross-validation scheme finds two unobserved factors to be
important and after conditioning on both the factors and additive fixed effects, the estimated
ATT based on the GSC method is around 5 percent with a standard error of 2.3 percent.27
This means that EDR laws are associated with a statistically significant increase in voter
turnout, consistent with previous OLS results based on individual-level data. The lower panel
of Figure 2 shows the dynamics of the estimated ATT. Again, in the left figure, averages are
taken after the actual and predicted turnout rates are re-aligned to the timing of the reform.
With the GSC method, the average actual turnout and average predicted turnout match
well in pre-treatment periods and diverge after EDR laws took effect. The right figure shows
[26] Note that although the estimated ATT of EDR on voter turnout is presented in the same row as the coefficient of EDR using the FE model, the GSC method does not assume the treatment effect to be constant. In fact, it allows the treatment effect to be different both across states and over time. Predicted counterfactuals and individual treatment effects for each of the 9 treated states are shown in the Online Appendix.
[27] The results are similar if additive state and year fixed effects are not directly imposed, though not surprisingly, the algorithm includes an additional factor.



[Figure 2. The Effect of EDR on Turnout: Main Results. Panel (a): difference-in-differences; panel (b): generalized synthetic control. In each panel, the left plot shows the treated average and the estimated Y(0) average for the treated, and the right plot shows the estimated ATT with 95% confidence intervals, both plotted against the term relative to the reform (turnout %).]

that the gaps between the two lines are virtually flat in pre-treatment periods and the effect
takes off right after the adoption of EDR.28
Figure 3 presents the estimated factors and factor loadings produced by the GSC method.29
Figure 3(a) depicts the two estimated factors. The x-axis is year and the y-axis is the mag-
nitude of factors (re-scaled by the square root of their corresponding eigenvalues to demon-
strate their relative importance). Figure 3(b) shows the estimated factor loadings for each treated (black, bold) and control (gray) unit, with x- and y-axes indicating the magnitude
of the loadings for the first and second factors, respectively. Bearing in mind the caveat
[28] Although it is not guaranteed, this is not surprising since the GSC method uses information of all past outcomes and minimizes gaps between actual and predicted turnout rates in pre-treatment periods.
[29] The results are essentially the same with or without controlling for the other two registration reforms.



[Figure 3. The Effect of EDR on Turnout: Factors and Loadings. Panel (a): the two estimated factors plotted over 1920-2016. Panel (b): estimated loadings on the two factors for treated states, non-southern control states, and southern control states.]

Bearing in mind the caveat that the estimated factors may not be directly interpretable
because they are, at best, linear transformations of the true factors, we find that the estimated
factors shown in this figure are meaningful. The first factor captures the sharp increase in
turnout in the southern states after the 1965 Voting Rights Act removed Jim Crow laws, such
as poll taxes and literacy tests, that had suppressed turnout. As shown in the right panel, the
11 states with the largest loadings on the first factor are exactly the 11 southern states that
were previously in the Confederacy.30 The labels of these states are underlined in Figure 3(b).
The second factor, which is set to be orthogonal to the first one, is less interpretable. How-
ever, its non-negligible magnitude indicates a strong downward trend in voter turnout in
many states in recent years. Another reassuring finding shown by Figure 3(b) is that the
estimated factor loadings of the 9 treated units mostly lie in the convex hull of those of the
control units, which indicates that the treated counterfactuals are produced mostly by more
reliable interpolations instead of extrapolations.
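As a rough illustration of this overlap diagnostic, the base-R sketch below checks whether each
treated unit's estimated loadings fall inside the two-dimensional convex hull of the control units'
loadings. The loading matrices here are simulated placeholders rather than the estimates behind
Figure 3.

    ## Check whether each treated unit's loadings lie inside the 2-D convex hull
    ## of the control units' loadings.
    in_convex_hull_2d <- function(points, hull_pts) {
      ## vertices of the convex hull, in the order returned by grDevices::chull()
      hull <- hull_pts[grDevices::chull(hull_pts), , drop = FALSE]
      apply(points, 1, function(p) {
        n <- nrow(hull)
        ## a point is inside a convex polygon if it lies on the same side of every
        ## edge, i.e., all edge cross products share one sign
        signs <- sapply(seq_len(n), function(i) {
          a <- hull[i, ]
          b <- hull[if (i == n) 1 else i + 1, ]
          (b[1] - a[1]) * (p[2] - a[2]) - (b[2] - a[2]) * (p[1] - a[1])
        })
        all(signs >= 0) || all(signs <= 0)
      })
    }

    ## Example with simulated loadings (placeholders, not the actual estimates):
    set.seed(1)
    L_ctrl  <- matrix(rnorm(38 * 2, sd = 5), ncol = 2)   # 38 control states
    L_treat <- matrix(rnorm(9 * 2,  sd = 3), ncol = 2)   # 9 treated states
    in_convex_hull_2d(L_treat, L_ctrl)                   # TRUE = interpolation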
Finally, we investigate the heterogeneous treatment effects of EDR laws.
30. Although we can control for indicators of Jim Crow laws in the model, such indicators may not be able
to capture the heterogeneous impacts of these laws on voter turnout in each state.



Previous studies have suggested that the motivations behind enacting these laws differed
greatly between the early adopters and the later ones. For example, Maine, Minnesota, and
Wisconsin, which established EDR in the mid-1970s, did so because officials in these states
sincerely wanted turnout rates to be higher, while the "reluctant adopters," including Idaho,
New Hampshire, and Wyoming, introduced EDR as a means to avoid the NVRA, which
officials viewed as "a more costly and potentially chaotic system" (Hanmer 2009). Because
of these different motivations, among other reasons, we may expect the treatment effect of
EDR laws to differ across states that adopted them at different times.
Table 3. The Effect of EDR on Voter Turnout: Three Waves

Outcome variable: Voter Turnout %
                                          1st Wave       2nd Wave       3rd Wave
                                             (1)            (2)            (3)
Election Day Registration                   7.27           2.17          -1.14
                                           (3.33)         (2.82)         (3.00)
Mail-in and motor voter registration          x              x              x
State fixed effects                           x              x              x
Year fixed effects                            x              x              x
Unobserved factors                            2              2              2
Observations                               1,128          1,128          1,128
Treated states                                3              3              3
                                        (ME, MN, WI)   (ID, NH, WY)   (MT, IA, CT)
Control states                               38             38             38
Note: Standard errors are based on 1,000 parametric bootstraps blocked at the state level.

The estimation of heterogeneous treatment effects is embedded in the GSC method, since
it gives individual treatment effects for all treated units in a single run. Table 3 summarizes
the ATTs of EDR on voter turnout for the three waves of EDR adopters. Again, additive
state and year fixed effects, as well as indicators of the two other registration systems, are
controlled for. Table 3 shows that EDR laws have a large and positive effect on the early
adopters (the estimate is about 7 percent with a standard error of 3 percent), while they have
no statistically significant impact on the other six states.31
31. In the Online Appendix, we show that the treatment effects are positive (and relatively large) for all
three early adopting states, Maine, Minnesota, and Wisconsin. Using a fuzzy regression discontinuity design,
Keele and Minozzi (2013) show that EDR has almost no effect on turnout in Wisconsin. The discrepancy with
this paper is likely due mainly to the difference in estimands: the two biggest cities in Wisconsin, Milwaukee
and Madison, constitute a major part of Wisconsin's electorate but have negligible influence on their local
estimates. One advantage of Keele and Minozzi (2013)'s approach over ours is its use of fine-grained
municipal-level data.
Such differential outcomes can be due to two reasons. First, the NVRA of 1993 substantially
reduced the cost of registration: since almost everyone with some intention to vote is already
a registrant after the NVRA was enacted, "there is now little room for enhancing turnout
further by making registration easier" (Highton 2004). Second, because states with a strong
"participatory culture" were more likely to select into an EDR system in earlier years, costly
registration was a binding constraint in those states; in states where many eligible voters have
little incentive to vote in the first place, registration costs may not be a first-order issue. It is
also possible that voters in early adopting states formed a habit of voting in the days when
the demand for participation was high (Hanmer 2009).
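To give a concrete sense of how the wave-specific estimates in Table 3 might be produced, the
sketch below re-runs the estimator separately for each wave, keeping that wave's three treated
states together with the 38 never-treated control states. As before, the data frame and variable
names are hypothetical placeholders, and the gsynth arguments follow the package documentation
at the time of writing.

    library(gsynth)

    waves <- list(wave1 = c("ME", "MN", "WI"),
                  wave2 = c("ID", "NH", "WY"),
                  wave3 = c("MT", "IA", "CT"))
    edr_states <- unlist(waves)

    wave_fits <- lapply(waves, function(w) {
      ## drop the treated states that belong to the other two waves
      sub <- subset(turnout_data, !(state %in% setdiff(edr_states, w)))
      gsynth(turnout ~ edr + motor + mail_in, data = sub,
             index = c("state", "year"), force = "two-way",
             CV = TRUE, r = c(0, 5), se = TRUE,
             nboots = 1000, inference = "parametric")
    })

    lapply(wave_fits, function(f) f$est.avg)   # one ATT (with SE and CI) per wave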
In short, using the GSC method, we find that EDR laws increased turnout in early adopting
states, including Maine, Minnesota, and Wisconsin, but not in states that introduced EDR as
a strategy to opt out of the NVRA or that enacted EDR laws more recently. These results
are broadly consistent with evidence from a large literature based on individual-level cross-
sectional data (see, for example, Leighley and Nagler 2013 for a summary). They are also
more credible than results from conventional fixed effects models when the "parallel trends"
assumption appears to fail.32
32. Glynn and Quinn (2011) argue that traditional cross-sectional methods in general over-estimate the effect
of EDR laws on voter turnout and suggest that EDR laws are likely to have a minimal effect on turnout in
non-EDR states (the ATC). In this paper, we focus instead on the effect of EDR in EDR states (the ATT).

6. CONCLUSION

In this paper, we propose the generalized synthetic control (GSC) method for causal inference
with TSCS data. It attempts to address the challenge that the “parallel trends” assumption
often fails when researchers apply fixed effects models to estimate the causal effect of a
certain treatment. The GSC method estimates the individual treatment effect on each
treated unit semi-parametrically. Specifically, it imputes treated counterfactuals based on
a linear interactive fixed effects model that incorporates time-varying coefficients (factors)
interacted with unit-specific intercepts (factor loadings). A built-in cross-validation scheme
automatically selects the model, reducing the risks of over-fitting.
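As a schematic illustration of this imputation step (not of the full estimation routine), the snippet
below constructs an imputed non-treatment outcome path for a single treated unit from estimated
factors, loadings, coefficients, and additive fixed effects; all quantities are simulated placeholders.

    set.seed(2)
    T_periods <- 24; n_factors <- 2; n_covar <- 2
    F_hat    <- matrix(rnorm(T_periods * n_factors), T_periods, n_factors)  # time-varying factors
    lambda_i <- rnorm(n_factors)                 # the treated unit's factor loadings
    beta_hat <- c(1.5, -0.8)                     # coefficients on observed covariates
    X_it     <- matrix(rnorm(T_periods * n_covar), T_periods, n_covar)
    alpha_i  <- 0.3                              # unit fixed effect
    xi_t     <- rnorm(T_periods)                 # time fixed effects

    ## imputed untreated outcome for the treated unit in every period:
    ## Y_hat(0)_it = x_it' beta + lambda_i' f_t + alpha_i + xi_t
    Y0_hat <- as.vector(X_it %*% beta_hat + F_hat %*% lambda_i) + alpha_i + xi_t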
This method is in the spirit of the original synthetic control method in that it uses data from
pre-treatment periods as benchmarks to customize a re-weighting scheme of control units
in order to make the best possible predictions for treated counterfactuals. It generalizes
the synthetic control method in two aspects. First, it allows multiple treated units and
differential treatment timing. Second, it offers uncertainty estimates, such as standard errors
and confidence intervals, that are easy to interpret.
Monte Carlo exercises suggest that the proposed method performs well even with relatively
small T0 and Nco and show that it has advantages over several existing methods: (1) it has
less bias than the two-way fixed effects or DID estimators in the presence of decomposable
time-varying confounders; (2) it corrects the bias of the IFE estimator when the treatment
effect is heterogeneous across units; and (3) it is more efficient than the synthetic control
method.
To illustrate the applicability of this method in political science, we estimate the effect of
Election Day Registration (EDR) laws on voter turnout in the United States. We show that
EDR laws increased turnout in early adopting states but not in states that introduced them
more recently.
Two caveats are worth emphasizing. First, insufficient data (either a small T0 or a small
Nco) can bias the estimated treatment effect. In general, users should be cautious when
T0 < 10 or Nco < 40. Second, excessive extrapolation based on imprecisely estimated factors
and factor loadings can lead to erroneous results. To avoid this problem, we recommend the
following diagnostics when using this method: (1) plot the raw data of treated and control
outcomes as well as the imputed counterfactuals and check whether the imputed values fall
within reasonable intervals; (2) plot the estimated factor loadings of both treated and
control units and check the overlap (as in Figure 3). We provide software routines, gsynth,
in both R and Stata to implement the estimation procedure as well as these diagnostic tests.
When excessive extrapolation appears to occur, we recommend that users include a smaller
number of factors or switch back to the conventional DID framework. We also recommend
benchmarking the results against estimates from the IFE model (Bai 2009) and from Bayesian
multi-level factor models (e.g., Pang 2014) whenever possible.
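Assuming a fitted gsynth object, out, as in the earlier sketch, the two recommended diagnostics
can be generated roughly as follows; the plot-type names are taken from the package documentation
at the time of writing and may differ across versions.

    plot(out, type = "raw")             # (1) raw treated and control outcomes over time
    plot(out, type = "counterfactual")  #     imputed Y(0) against observed treated outcomes
    plot(out, type = "loadings")        # (2) overlap of treated and control loadings (cf. Figure 3)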
Another limitation of the proposed method is that it cannot accommodate complex DGPs
that often appear in TSCS data (in which T is typically much larger than in classic panel
data), such as (1) dynamic relationships between the treatment, covariates, and outcome
(e.g., Pang 2010, 2014; Blackwell and Glynn 2015), (2) structural breaks (e.g., Park 2010,
2012), and (3) treatments administered at multiple points in time or with variable intensity.
Nor does it allow random coefficients for the observed time-varying covariates, even though
such modeling setups are becoming increasingly popular in Bayesian multi-level analysis.
Future research is needed to accommodate these scenarios.



References

Abadie, Alberto. 2005. “Semiparametric Difference-in-Differences Estimators.” The Review


of Economic Studies 72(1):1–19.
Abadie, Alberto, Alexis Diamond and Jens Hainmueller. 2010. “Synthetic Control Methods
for Comparative Case Studies: Estimating the Effect of California’s Tobacco Control
Program.” Journal of the American Statistical Association 105(490):493–505.
Abadie, Alberto, Alexis Diamond and Jens Hainmueller. 2015. “Comparative Politics and
the Synthetic Control Method.” American Journal of Political Science 59(2):495–510.
Acemoglu, Daron, Simon Johnson, Amir Kermani, James Kwak and Todd Mitton. 2016.
“The Value of Connections In Turbulent Times: Evidence from the United States.” Journal
of Financial Economics 121(2):368–391.
Alvarez, R. Michael, Stephen Ansolabehere and Catherine H. Wilson. 2002. "Election Day
Voter Registration in the United States: How One-step Voting Can Change the Composition
of the American Electorate." Working Paper, Caltech/MIT Voting Technology Project.
Angrist, Joshua D., Òscar Jordà and Guido Kuersteiner. 2013. "Semiparametric Estimates of
Monetary Policy Effects: String Theory Revisited." NBER Working Paper No. 19355.
Bai, Jushan. 2003. "Inferential Theory for Factor Models of Large Dimensions." Econometrica
71(1):135–171.
Bai, Jushan. 2009. “Panel Data Models with Interactive Fixed Effects.” Econometrica
77:1229–1279.
Beck, Nathaniel and Jonathan N. Katz. 1995. “What to do (and not to do) with Time-Series
Cross-Section Data.” American Political Science Review 89(3):634–647.
Blackwell, Matthew and Adam Glynn. 2015. "How to Make Causal Inferences with Time-
Series Cross-Sectional Data." Mimeo, Harvard University.
Brians, Craig Leonard and Bernard Grofman. 2001. “Election Day Registration’s Effect on
US Voter Turnout.” Social Science Quarterly 82(1):170–183.
Brodersen, Kay H., Fabian Gallusser, Jim Koehler, Nicolas Remy and Steven L. Scott.
2014. “Inferring Causal Impact Using Bayesian Structural Time-series Models.” Annals
of Applied Statistics 9(1):247–274.
Burden, Barry C., David T. Canon, Kenneth R. Mayer and Donald P. Moynihan. 2009. “The
Effects and Costs of Early Voting, Election Day Registration, and Same Day Registration
in the 2008 Elections.” Mimeo, University of Wisconsin-Madison.
Burnham, Walter Dean. 1980. The Appearance and Disappearance of the American Voter.
In Electoral Participation: A Comparative Analysis, ed. Richard Rose. Beverly Hills, CA:
Sage Publications.



Cain, Bruce E., Todd Donovan and Caroline J. Tolbert. 2011. Democracy in the States:
Experiments in Election Reform. Brookings Institution Press.

Campbell, John Y., Andrew W. Lo and A. Craig MacKinlay. 1997. The Econometrics of
Financial Markets. Princeton, NJ: Princeton University Press.

Dube, Arindrajit and Ben Zipperer. 2015. “Pooling Multiple Case Studies Using Synthetic
Controls: An Application to Minimum Wage Policies.” IZA Discussion Paper No. 8944.

Efron, Brad. 2012. “The Estimation of Prediction Error.” Journal of the American Statistical
Association 99(467):619–632.

Efron, Brad and Rob Tibshirani. 1993. An Introduction to the Bootstrap. New York, NY:
Chapman & Hall.

Fenster, Mark J. 1994. "The Impact of Allowing Day of Registration Voting on Turnout in
US Elections from 1960 to 1992: A Research Note." American Politics Research 22(1):74–87.

Gaibulloev, Khusrav, Todd Sandler and Donggyu Sul. 2014. “Dynamic Panel Analysis under
Cross-Sectional Dependence.” Political Analysis 22(2):258–273.

Glynn, Adam N. and Kevin M. Quinn. 2011. "Why Process Matters for Causal Inference."
Political Analysis 19(3):273–286.

Gobillon, Laurent and Thierry Magnac. 2016. "Regional Policy Evaluation: Interactive Fixed
Effects and Synthetic Controls." The Review of Economics and Statistics 98(3):535–551.

Hanmer, Michael J. 2009. Discount Voting: Voter Registration Reforms and their Effects.
Cambridge University Press.

Highton, Benjamin. 1997. “Easy Registration and Voter Turnout.” The Journal of Politics
59(2):565–575.

Highton, Benjamin. 2004. "Voter Registration and Turnout in the United States." Perspectives
on Politics 2(3):507–515.

Holland, Paul W. 1986. "Statistics and Causal Inference." Journal of the American Statistical
Association 81(396):945–960.

Hsiao, Cheng, Steve H. Ching and Shui Ki Wan. 2012. "A Panel Data Approach for Program
Evaluation: Measuring the Benefits of Political and Economic Integration of Hong Kong
with Mainland China." Journal of Applied Econometrics 27(5):705–740.

Huang, Chi and Todd G. Shields. 2000. "Interpretation of Interaction Effects in Logit and
Probit Analyses: Reconsidering the Relationship Between Registration Laws, Education,
and Voter Turnout." American Politics Research 28(1):80–95.

Imai, Kosuke and In Song Kim. 2016. “When Should We Use Linear Fixed Effects Regression
Models for Causal Inference with Panel Data.” Mimeo, Princeton University.



Keele, Luke and William Minozzi. 2013. "How Much Is Minnesota Like Wisconsin? Assumptions
and Counterfactuals in Causal Inference with Observational Data." Political Analysis
21(2):193–216.
Kim, Dukpa and Tatsushi Oka. 2014. “Divorce Law Reforms and Divorce Rates in the USA:
An Interactive Fixed-Effects Approach.” Journal of Applied Econometrics 29(2):231–245.
King, James D. and Rodney A. Wambeam. 1995. “Impact of Election Day Registration on
Voter Turnout: A Quasi-experimental Analysis.” Policy Studies Review 14(3):263–278.
Knack, Stephen. 2001. "Election-day Registration: The Second Wave." American Politics
Research 29(1):65–78.
Knack, Stephen and James White. 2000. "Election-day Registration and Turnout Inequality."
Political Behavior 22(1):29–44.
Leighley, Jan E. and Jonathan Nagler. 2013. Who Votes Now? Demographics, Issues,
Inequality, and Turnout in the United States. Princeton, New Jersey: Princeton University
Press.
Mitchell, Glenn E. and Christopher Wlezien. 1995. “The Impact of Legal Constraints on
Voter Registration, Turnout, and the Composition of the American Electorate.” Political
Behavior 17(2):179–202.
Moon, Hyungsik Roger and Martin Weidner. 2015. “Dynamic Linear Panel Regression
Models with Interactive Fixed Effects.” Econometric Theory (forthcoming).
Mora, Ricardo and Ilina Reggio. 2012. “Treatment Effect Identification Using Alternative
Parallel Assumptions.” Mimeo, Universidad Carlos III de Madrid.
Neiheisel, Jacob R. and Barry C. Burden. 2012. "The Impact of Election Day Registration on
Voter Turnout and Election Outcomes." American Politics Research 40(4):636–664.
Neyman, Jerzy. 1923. “On the Application of Probability Theory to Agricultural Experi-
ments: Essay on Principles.” Statistical Science 5:465–80. Section 9 (translated in 1990).
Pang, Xun. 2010. "Modeling Heterogeneity and Serial Correlation in Binary Time-Series
Cross-sectional Data: A Bayesian Multilevel Model with AR(p) Errors." Political Analysis
18(4):470–498.
Pang, Xun. 2014. “Varying Responses to Common Shocks and Complex Cross-Sectional
Dependence: Dynamic Multilevel Modeling with Multifactor Error Structures for Time-
Series Cross-Sectional Data.” Political Analysis 22(4):464–496.
Park, Jong Hee. 2010. “Structural Change in US Presidents’ Use of Force.” American Journal
of Political Science 54(3):766–782.
Park, Jong Hee. 2012. "A Unified Method for Dynamic and Cross-Sectional Heterogeneity:
Introducing Hidden Markov Panel Models." American Journal of Political Science
56(4):1040–1054.



Rhine, Staci L. 1992. “An Analysis of the Impact of Registration Factors on Turnout in
1992.” Political Behavior 18(2):171–185.

Rubin, Donald B. 1974. "Estimating Causal Effects of Treatments in Randomized and Non-
randomized Studies." Journal of Educational Psychology 66(5):688–701.

Springer, Melanie Jean. 2014. How the States Shaped the Nation: American Electoral Insti-
tutions and Voter Turnout, 1920-2000. University of Chicago Press.

Stewart, Brandon. 2014. "Latent Factor Regressions for the Social Sciences." Mimeo, Princeton
University.

Teixeira, Ruy A. 2011. The Disappearing American Voter. Brookings Institution Press.

Timpone, Richard J. 1998. "Structure, Behavior, and Voter Turnout in the United States."
The American Political Science Review 92(1):145–158.

Timpone, Richard J. 2002. “Estimating Aggregate Policy Reform Effects: New Baselines for
Registration, Participation, and Representation.” Political Analysis 10(2):154–177.

Wolfinger, Raymond E. and Steven J. Rosenstone. 1980. Who Votes? New Haven, CT: Yale
University Press.

Xu, Yiqing. 2016. “Replication Data for: Generalized Synthetic Control Method: Causal
Inference with Interactive Fixed Effects Models.” doi:10.7910/DVN/8AKACJ, Harvard
Dataverse.
