Understanding Randomized Controlled Trials
An Introduction to the “Handbook of Field Experiments”
Many (though by no means all) of the questions that economists and policymakers ask them-
selves are causal in nature: What would be the impact of adding computers in classrooms? What
is the price elasticity of demand for preventive health products? Would increasing interest rates
lead to an increase in default rates? Decades ago, the statistician Fisher (1925) proposed
a method to answer such causal questions: Randomized Controlled Trials (RCTs). In an RCT,
the assignment of different units to different treatment groups is chosen randomly. This ensures
that no unobservable characteristics of the units are reflected in the assignment, and hence that
any difference between treatment and control units reflects the impact of the treatment. While
the idea is simple, the implementation in the field can be more involved, and it took some time
before randomization was considered to be a practical tool for answering questions in economics.
By many accounts, the first large-scale social experiment was the New Jersey Income Main-
tenance Experiment, which was initiated in 1968 and tested the impact of income transfers and
tax rates on labor supply. The next few decades, as chapter 1 (Gueron, 2016) and chapter 18
(von Wachter and Rothstein, 2016) in this volume remind us, were a sometimes tortuous journey
that eventually led to a more widespread acceptance of RCTs, both by policymakers and by
academic researchers. While this acceptance first took hold in the US, starting in the mid 1990s
it extended to developing countries, where the RCT “revolution” took the field by storm.
At this point, the method has gained widespread acceptance (though there continue to be
vocal critics and active debates, many of which this Handbook covers), and there is now a
large body of research on field experiments, both in developed and developing countries. We
feel that we have collectively learnt an enormous amount from this literature: about how to
conduct and analyze experiments, about the methodological contributions experiments have made
to economics, and about the world. For this volume, we asked some of
the foremost experts in the field to distill these learnings, as well as discuss the most important
challenges and open questions for future work. In this short introduction, we provide what is
our (admittedly personal, subjective, and somewhat biased towards our own field, development
economics) assessment of the impacts that the past 20 years of field experiment research have
had, both on how we do research and how we understand the world.
The remarkable growth in the number of RCTs is, in itself, a dramatic change in some fields.
The type of development research that is carried out today is significantly different from research
conducted even fifteen years ago. A reflection of this fact is that many researchers who were
openly skeptical of RCTs or simply belonged to an entirely different tradition within development
economics are now involved in one or more randomized controlled trials (e.g. Daron Acemoglu,
Derek Neal, Martin Ravallion, Mark Rosenzweig).
Early discussions of the merits (or lack thereof) of randomization put significant emphasis on
its role in the reliable identification of internally valid causal effects and the external validity of
such estimates. We and others have already had these discussions in various forums (Heckman,
1992; Banerjee et al., 2007; Duflo et al., 2007; Banerjee and Duflo, 2009; Deaton, 2010), and we
will not reproduce them here. As we had also begun to argue in Banerjee and Duflo (2009),
we actually think that these discussions somewhat miss the point about why RCTs are really
valuable, and why they have become so popular with researchers.
[1] This section draws on Banerjee, Duflo, and Kremer (2016).

Starting with Neyman (1923), who used randomization as a theoretical device, and Fisher (1925),
who was the first to propose physically randomizing units, the original motivation for randomized
experiments was the credible identification of causal effects. As Imbens and Athey (2016) write in
chapter 2 of this volume:
There is a long tradition viewing randomized experiments as the most credible of
designs to obtain causal inferences. Freedman (2006) writes succinctly “experiments
offer more reliable evidence on causation than observational studies.” On the other
hand, some researchers continue to be skeptical about the relative merits of random-
ized experiments. For example, Deaton (2010) argues that “evidence from randomized
experiments has no special priority. . . . Randomized experiments cannot automati-
cally trump other evidence, they do not occupy any special place in some hierarchy
of evidence.” Our view aligns with that of Freedman and others who view random-
ized experiments as playing a special role in causal inference. Whenever possible,
a randomized experiment is unique in the control that the researcher has over the
assignment mechanism, and by virtue of this control, selection bias in comparisons
between treated and control units can be eliminated. That does not mean that ran-
domized experiments can answer all causal questions. There are a number of reasons
why randomized experiments may not be suitable to answer particular questions.
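To see concretely what eliminating selection bias means, it helps to write down the standard potential-outcomes decomposition (the notation here is ours, not the chapter's). Let $Y_i(1)$ and $Y_i(0)$ denote unit $i$'s potential outcomes with and without treatment, and let $T_i \in \{0,1\}$ denote assignment. A naive comparison of means satisfies
\[
E[Y_i \mid T_i = 1] - E[Y_i \mid T_i = 0]
= \underbrace{E[Y_i(1) - Y_i(0) \mid T_i = 1]}_{\text{effect on the treated}}
+ \underbrace{E[Y_i(0) \mid T_i = 1] - E[Y_i(0) \mid T_i = 0]}_{\text{selection bias}} .
\]
When $T_i$ is assigned at random, it is independent of $(Y_i(1), Y_i(0))$, so the selection-bias term is zero and the simple difference in means identifies the average treatment effect; with observational data, that second term can take essentially any value.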
For a long time, observational studies and randomized studies progressed largely on paral-
lel paths: in agricultural science, and then biomedical studies, randomized experiments were
quickly accepted, and a vocabulary and statistical apparatus to think about them were devel-
oped. Despite the adoption of randomized studies in other fields, most researchers in the social
sciences continued to reason exclusively in terms of observational data. The main approach was
to estimate associations, and then to try to assess the extent to which these associations reflect
causality (or to explicitly give up on causality). Starting with Rubin’s (1974) fundamental con-
tribution, researchers started to use the experimental analog to reason about observational data,
and this set the stage for thinking about how to analyze observational data through the lens of
the “ideal experiment.”
Through the 1980s and 1990s, motivated by this clear thinking about causal effects, labor
economics and public finance were transformed by the introduction of new empirical methods for
estimating causal effects, namely: matching, instrumental variables, difference-in-differences and
regression discontinuity designs. Development economics also embraced these methods starting
in the 1990s, but some researchers went further and decided that it might be possible to go straight
to the “ideal” experiment (the RCT), and they began to go back and forth between experimental
and non-experimental studies. As a result, the experimental and non-experimental
literatures developed in close relationship, constantly cross-fertilizing each other.
In development economics, the non-experimental literature was completely transformed by
the existence of this large RCT movement. When the “gold standard” is not just a twinkle in
someone’s eye, but a clear alternative to a particular empirical strategy, or at least a well-defined
benchmark for it, researchers feel compelled to think harder about identification strategies, and
to be more inventive and rigorous about them. As a result, researchers have become increasingly
clever at identifying and using natural experiments, and at the same time, much
more cautious in interpreting the results from them. Not surprisingly, the standards of the non-
experimental literature have therefore improved tremendously over the last few decades, without
necessarily sacrificing its ability to ask broad and important questions. To highlight some ex-
amples, Alesina, Giuliano, and Nunn (2013) use suitability for the plow to study the long-run
determinants of social attitudes towards the role of women; Padró i Miquel, Qian, and Yao
(2012) use a difference-in-differences strategy to study village democracy; and Banerjee and Iyer
(2005) and Dell (2010) each use a spatial discontinuity to look at the long-run impact of ex-
tractive institutions. In each of these cases, the questions are approached with the same eye for
careful identification as other more standard program evaluation questions.
Meanwhile, the RCT literature was also influenced by work done in the non-experimental liter-
ature. The understanding of the power (and limits) of instrumental variables allowed researchers
to move away from the basic experimental paradigm of the completely randomized experiment
with perfect follow-up and use more complicated strategies, such as encouragement designs.
Techniques developed in the non-experimental literature offered ways to handle situations in the
field that are removed from the ideal setting of experiments (imperfect randomization, clustering,
non-compliance, attrition, spillovers and contamination, etc.). These methods are very clearly
exposited in chapter 2 (Imbens and Athey, 2016) on the econometrics of experiments, and most
chapters provide examples of their uses.
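To illustrate how one of these tools is used in practice, the sketch below works through the simplest encouragement-design calculation: the intention-to-treat effect divided by the effect of the encouragement on take-up (the Wald/IV estimator). The data are simulated, and all names and parameter values are ours rather than drawn from any study in this volume.

```python
# Minimal sketch: Wald / IV estimate of the local average treatment effect (LATE)
# in an encouragement design with imperfect compliance. Simulated data for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

z = rng.integers(0, 2, size=n)              # randomized encouragement (e.g., a voucher)
u = rng.uniform(size=n)
d = ((u < 0.2) | ((u < 0.7) & (z == 1))).astype(int)   # always-takers, plus compliers who need the push
y = 1.0 + 2.0 * d + rng.normal(size=n)      # outcome responds to actual take-up, not to z directly

itt = y[z == 1].mean() - y[z == 0].mean()          # intention-to-treat effect of the encouragement
first_stage = d[z == 1].mean() - d[z == 0].mean()  # effect of the encouragement on take-up
late = itt / first_stage                           # Wald estimator: average effect for compliers

print(f"ITT = {itt:.3f}, first stage = {first_stage:.3f}, LATE = {late:.3f}")
```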
Structural methods are also increasingly combined with experiments to estimate counterfac-
tual policies (See chapter 18 (von Wachter and Rothstein, 2016) for a number of examples from
the developed world, as well as Todd and Wolpin (2006) and Attanasio, Meghir, and Santiago
(2012) for developing country examples).
More recently, machine learning techniques have also been combined with experiments to
model treatment effect heterogeneity (see chapter 2 (Imbens and Athey, 2016)).
Of course, the broadening offered by these new techniques comes with the cost of making
additional assumptions on top of the original experimental assignment, and those assumptions
may or may not be valid. This means that the difference in the quality of identification between
a very well-identified, non-experimental study and a randomized evaluation that ends up facing
lots of constraints in the field or tries to estimate parameters that are not pure treatment effects
is a matter of degree. In this sense, there has been a convergence across the empirical spectrum
in terms of the quality of identification, though mostly because experiments have pulled the
remaining study designs up with them.
Interestingly, somewhat counter to this tendency to blur the boundaries between experiments
and non-experiments, in chapter 2, Imbens and Athey (2016) provide a coherent framework for
designing and analyzing experiments that puts randomization at the center:
A major theme of the chapter is that we recommend using statistical methods that
are directly justified by randomization, in contrast to the more traditional sampling-
based approach that is commonly used in econometrics. In essence, the sampling
based approach considers the treatment assignments to be fixed, while the outcomes
are random. Inference is based on the idea that the subjects are a random sample
from a much larger population. In contrast, the randomization-based approach takes
the subject’s potential outcomes (that is, the outcomes they would have had in each
possible treatment regime) as fixed, and considers the assignment of subjects to
treatments as random.
Thus, the methods they propose to analyze experiments sometimes differ from “traditional”
econometrics: for example, instead of controlling for covariates (as researchers routinely do),
which can easily lead to bias in finite samples, they suggest placing the data into strata, analyzing
the within-stratum experiments, and averaging the results. This is directly justified by the
randomization of the treatment and does not require any additional assumptions. They also suggest
doing as much as possible ex-ante through the design of the experiment to avoid any ex-post
adjustment.
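The following sketch illustrates this stratified, randomization-based logic on simulated data (the simulation and variable names are ours): the treatment effect is estimated within each stratum, the estimates are averaged with stratum sizes as weights, and a p-value is obtained by re-randomizing the assignment within strata rather than by appealing to sampling from a larger population.

```python
# Stratified estimate of a treatment effect, with a randomization-inference p-value.
import numpy as np

rng = np.random.default_rng(1)

def simulate_strata(n_per_stratum=200, n_strata=5, effect=0.5):
    """Within each stratum, randomly assign exactly half the units to treatment."""
    strata, treat, y = [], [], []
    for s in range(n_strata):
        t = rng.permutation([1] * (n_per_stratum // 2) + [0] * (n_per_stratum // 2))
        outcome = 1.0 * s + effect * t + rng.normal(size=n_per_stratum)  # stratum-specific level
        strata.append(np.full(n_per_stratum, s))
        treat.append(t)
        y.append(outcome)
    return np.concatenate(strata), np.concatenate(treat), np.concatenate(y)

def stratified_estimate(strata, treat, y):
    """Weighted average of within-stratum differences in means."""
    estimates, weights = [], []
    for s in np.unique(strata):
        m = strata == s
        estimates.append(y[m & (treat == 1)].mean() - y[m & (treat == 0)].mean())
        weights.append(m.sum())
    return np.average(estimates, weights=weights)

strata, treat, y = simulate_strata()
tau_hat = stratified_estimate(strata, treat, y)

# Randomization inference: re-draw the within-stratum assignment many times and ask
# how often a placebo estimate is at least as large (in absolute value) as tau_hat.
placebo = []
for _ in range(2000):
    fake = treat.copy()
    for s in np.unique(strata):
        m = strata == s
        fake[m] = rng.permutation(treat[m])
    placebo.append(stratified_estimate(strata, fake, y))
p_value = np.mean(np.abs(placebo) >= abs(tau_hat))

print(f"stratified estimate = {tau_hat:.3f}, randomization p-value = {p_value:.3f}")
```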
1.2 Assessing external validity
In the words of Imbens and Athey (2016) (chapter 2): “external validity is concerned with
generalizing causal inferences, drawn for a particular population and setting, to others, where
these alternative settings could involve different populations, different outcomes, or different
contexts.” The question of the external validity of RCTs is even more hotly debated than that of
their internal validity. This is perhaps because, unlike internal validity, there is no clear endpoint
to the debate. Other individuals could always be different and react differently to the treatment,
and any future treatment could be ever so slightly different from what has been tested. As
Banerjee, Chassang, and Snowberg (2016) (chapter 3) acknowledge: “External policy advice is
unavoidably subjective. This does not mean that it needs to be uninformed by experimental
evidence, rather, judgment will unavoidably color it.”
It is worth noting that there is very little here that is specific about RCTs (Banerjee and
Duflo, 2009). The same problem afflicts all empirical analysis with the one exception of what
Heckman (1992) calls the “randomization bias.” “Randomization bias” refers to the fact that
experiments require the consent of both the subjects and the organization that is carrying out
the program, and these people may be special, and non-representative of the future population
that could be treated. Chapter 4 (Glennerster, 2016) provides a list of the characteristics of
an ideal partner: they must have sufficient scale, be flexible and technically competent in the
area of the program, have expertise and reputation, have low staff turnover, and possess a desire
to know the truth. In other words, they are clearly not representative of the typical NGO or
government, and this has clear implications for what can be generalized from those studies.
On the other hand, it is worth pointing out that any naturally occurring policy that gets
evaluated (i.e. not an RCT) is also selected: the evaluation requires that the policy did take
place, and that was presumably because someone thought it was a good idea to try it out. In
general, any study takes place in a particular time and place, and that might affect results. This
does not imply that subjective recommendations by experts, based both on their priors and on the
results of their experiments, cannot be of use to policymakers. Most policymakers are
not stupid, and they do know how to combine data that is presented to them with their own
prior knowledge of their settings. In our experience, when presented with evidence from a
program of interest, the immediate reaction of a policymaker is typically to ask whether an RCT
could be done in their own context.
There is one clear advantage that RCTs do offer for external validity, although it is not often
discussed and has not been systematically exploited as yet. To assess any external validity issues,
it is helpful to have well-identified causal studies in multiple settings. These settings should vary
in terms of the distribution of characteristics of the units, and possibly in terms of the specific
nature of the treatments or the treatment rate, in order to assess the credibility of generalizing
to other settings. With RCTs, because we can, in principle, control where and over what sample
experiments take place (and not just how to allocate the treatment within a sample), we can,
also in principle, get a handle on how treatment effects might vary by context. Of course, if we
allow the world to vary in infinite ways, this is not sufficient to say anything much on its
own. But there are several ways to make progress.
A first approach is to combine existing evaluations, and make assumptions about the possible
distribution of treatment effects. There are a variety of ways of doing so, ranging from the explic-
itly parametric — Rubin (1981) proposes modeling treatment effect heterogeneity as stemming
from a normal distribution: in each site, the causal effect of the treatment is a site specific effect
drawn from a normal distribution — to more non-parametric procedures, such as those based on
revealed preference. Chapter 18 (von Wachter and Rothstein, 2016) contains an extensive dis-
cussion of the trade-offs between the various approaches in the context of the evaluation of social
programs in developed countries. Chapter 12 (Fryer, 2016) provides a systematic meta-analysis
of 196 RCTs in education in the US in three domains.
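To fix ideas, the parametric approach mentioned above amounts to a two-level hierarchical model of the site-specific effects (the notation is ours):
\[
\hat{\tau}_j \mid \tau_j \sim \mathcal{N}(\tau_j, \sigma_j^2), \qquad
\tau_j \sim \mathcal{N}(\mu, \omega^2), \qquad j = 1, \dots, J,
\]
where $\hat{\tau}_j$ is the estimated effect in site $j$, $\sigma_j^2$ its sampling variance, $\mu$ the average effect across sites, and $\omega^2$ the cross-site heterogeneity. An estimate of $\omega^2$ close to zero suggests a common effect; a large $\omega^2$ is a warning that extrapolating any single site's estimate to a new setting is risky.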
One issue that arises with any kind of meta-analysis is access to an unselected sample of
results from an unselected sample of studies. Since there is publication bias in
economics, the worry is that the sample of published studies may not be representative of all the
studies that exist; furthermore, since researchers have some flexibility in the analyses to run, the
available results may themselves be selected. This is where another advantage of RCTs kicks in:
since they have a defined beginning and end, they can in principle be registered. To this end,
chapter 4 (Glennerster, 2016) discusses how the American Economic Association recently created
a registry of randomized trials ([Link]), which listed over 800 entries as of
August 10. The hope is that all projects are registered, preferably before they are launched, and
that results are clearly linked to their respective study, so that in the future meta-analysts can
work from the full universe of studies. Chapter 4 (Glennerster, 2016) and chapter 3 (Banerjee,
Chassang, and Snowberg, 2016) also have a useful exchange on the value of going further than
registration, to pre-analysis plans, in which researchers lay out in advance the hypotheses to be
tested and the regressions to be run.[2] Overall, both chapters point out the value in tying the
hands of a partner who may be too eager to show success, but also emphasize that this comes
with the cost of losing the flexibility to explore the data. In chapter 3, Banerjee, Chassang, and
Snowberg (2016) point out that, if the data is available to others, there is in principle no reason
to pre-specify a specific analysis, since anyone can decide what to run. This ties in to another
issue that is discussed in chapter 4 (Glennerster, 2016): the need for open access to complete
and usable data, both for reproducing existing analyses and for running others. This is an area
where a lot of progress has been made, and hopefully more will be made in years to come.

[2] Paluck and Shafir (2016) also discuss the merits of pre-registration and pre-analysis plans for an experimenter who has some construal of what the results should be.
A second approach is to use the results from other experiments to test specific channels, and
support the conclusions from the policy experiment. One way to do so is to draw parallels between
those results and results from laboratory experiments conducted in comparable settings (see
chapter 9 (Gneezy and Imas, 2016)). Another option involves carrying out additional field
experiments that provide support for the causal channels that underlie the policy claim (see
chapter 14 (Kling, Ludwig, Congdon, and Mullainathan, 2016)).
A third approach is to conceive projects as multi-site projects from the start. One recent example
of such an enterprise is the “Graduation” approach—an integrated, multi-faceted program with
livelihood promotion at its core that aims to “graduate” individuals out of extreme poverty
and onto a long-term, sustainable higher consumption path, which is discussed in chapter 17
(Hanna and Karlan, 2016). BRAC, perhaps the world’s largest nongovernmental organization,
has scaled up this program in Bangladesh (Bandiera et al., 2013), while NGOs around the world
have engaged in similar livelihood-based efforts. Six randomized trials were undertaken over
the same time period across the world (Ethiopia, Ghana, Honduras, India, Pakistan, and Peru).
The teams regularly communicated with each other and with BRAC to ensure that their local
adaptations remained true to the original program. The results suggest that the integrated
multi-faceted program was “sufficient” to increase long-term income, where long-term is defined
as three years after the productive asset transfer (Banerjee et al., 2015). Using an index approach
to account for multiple hypothesis testing, positive impacts were found for consumption, income
and revenue, asset wealth, food security, financial inclusion, physical health, mental health, labor
supply, political involvement and women’s decision-making after two years. After a third year,
the results remained the same in 8 out of 10 outcome categories. There is country-by-country
variation (e.g. the program was ineffective in Honduras), and the team is currently working on
a meta-analysis to quantify the level of heterogeneity.
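The index approach mentioned above typically works by collapsing each family of related outcomes into a single standardized index before testing, in the spirit of the summary indices of Kling, Liebman, and Katz (2007). A minimal sketch on simulated data (the data, names, and parameter values are ours, not those of the graduation study):

```python
# Summary-index approach to a family of related outcomes: standardize each outcome
# by the control group's mean and standard deviation, average the z-scores into one
# index per person, and test the treatment effect on that single index.
import numpy as np

rng = np.random.default_rng(2)
n = 4_000
treat = rng.integers(0, 2, size=n)

# A family of related outcomes (say, several food-security measures), each with a
# small positive treatment effect and idiosyncratic noise.
outcomes = np.column_stack([rng.normal(loc=0.1 * treat, scale=1.0) for _ in range(5)])

control = outcomes[treat == 0]
z_scores = (outcomes - control.mean(axis=0)) / control.std(axis=0)  # standardize on the control group
index = z_scores.mean(axis=1)                                        # one summary index per person

effect = index[treat == 1].mean() - index[treat == 0].mean()
se = np.sqrt(index[treat == 1].var(ddof=1) / (treat == 1).sum()
             + index[treat == 0].var(ddof=1) / (treat == 0).sum())
print(f"index treatment effect = {effect:.3f} (t-statistic = {effect / se:.2f})")
```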
One issue is that there is little the researcher can do ex-post to causally identify the source of dif-
ferences in findings across countries. An option for multi-site projects would be to take guidance
from the first few sites to make a prediction about what the next sites would find. To discipline this
process, researchers would be encouraged to use the results from existing trials to make some
explicit predictions about what they expect to observe in other samples (or with slightly different
treatments). These can serve as a guide for subsequent trials. This idea is discussed by Banerjee,
Chassang, and Snowberg (2016) in chapter 3, where they call it “structured speculation.” They propose
the following broad guidelines for structured speculation:
2. Such speculation should be clearly and cleanly separated from the rest of the
paper, maybe in a section called “speculation”
According to Banerjee, Chassang, and Snowberg (2016), structured speculation has three
advantages: First, it ensures that the researcher’s specific knowledge is captured. Second, it
creates a clear sense of where else experiments should be run. Third, it creates incentives to
design research that has greater external validity. They write:
To address scalability, experimenters may structure local pilot studies for easy com-
parison with their main experiments. To identify the right sub-populations for gen-
eralizing to other environments, experimenters can identify ahead of time the char-
acteristics of groups that can be generalized, and stratify on those. To extend the
results to populations with a different distribution of unobserved characteristics, ex-
perimenters may elicit the former using the selective trial techniques discussed in
Chassang et al. (2012), and run the experiments separately for each of the groups so
identified.
As this idea is just being proposed, there are few examples as yet. A notable example is
Dupas (2014), who studies the effect of short-term subsidies on long-run adoption of new health
products, and reports that short-term subsidies had a significant impact on the adoption of a
more effective and comfortable class of bed nets. The paper then provides a clear discussion of
external validity. It first spells out a simple and transparent argument relating the effectiveness
of short-run subsidies to: 1) the speed at which various forms of uncertainty are resolved; 2) the
timing of users’ costs and benefits. If the uncertainty over benefits is resolved quickly, short-run
subsidies can have a long-term effect. If uncertainty over benefits is resolved slowly, and adoption
costs are incurred early on, short-run subsidies are unlikely to have a long-term effect.
Dupas (2014) then answers the question “For what types of health products and contexts
would we expect the same results to obtain?” It does so by classifying potential technologies into
three categories based on how short-run (or one-time) subsidies would change adoption patterns.
Clearly, there could be such discussions at the end of all papers, not just ones featuring RCTs.
But because RCTs can be purposefully designed and placed, there is a higher chance of follow-up
in this case.
This discussion makes clear that talking about external validity only makes sense once we
understand the lesson that we want to generalize. Reflecting on the problem of partner selection
that we mentioned earlier, in chapter 4, Glennerster (2016) writes:
. . . human behavior—such as a willingness to pay now for benefits in the future—the
representativeness of the partner may be less relevant. If we want to know whether
a type of program, as it is usually implemented, works, we will want to prioritize
working with a representative partner. Note that “does this type of program work” is
not necessarily a more policy-relevant question than a more general question about
human behavior. By their nature, more general questions generalize better and can
be applied to a wider range of policy questions.
A big contribution of field experiments has been the ability to test theory. In chapter 2,
Imbens and Athey (2016) argue “a randomized experiment is unique in the control that the
researcher has over the assignment mechanism.” We would take the argument one step further:
randomization is also unique in the control that the researcher (often) has over the treatment itself.
In observational studies, however beautifully designed, the researcher is limited to evaluating
what has been implemented in the world. In a randomized experiment, she can manipulate the
treatment in ways that we do not observe in reality. This has a number of advantages. First, she
can innovate, i.e. design new policies or interventions that she thinks will be effective based on
prior knowledge or theory, and test them even if no policymaker is thinking of putting them in
practice yet. Development economists have many ideas, often inspired by what they have read
or researched, and many of the randomized experiment projects come out of those: they test in
the field an intervention that simply did not exist before (a kilogram of lentils for parents who
vaccinate their kids; stickers to encourage riders to speak up against a bad driver; free chlorine
dispensers, etc.).
Second, she can introduce variations that will help her test implications of existing theories
or establish facts that could not otherwise be established. The well-known Negative Income
Tax (NIT) experiment was designed with precisely that idea in mind: in general, when wages
are raised, this creates both income and substitution effects which cannot easily be separated
(Heckman, 1992). But randomized manipulation of the slope and the intercept of a wage schedule
makes it possible to estimate both effects separately.
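To see why, consider a stylized NIT budget constraint (the notation is ours):
\[
c = y_0 + G + (1 - t)\,w\,h ,
\]
where $c$ is consumption, $h$ hours worked, $w$ the market wage, $y_0$ other unearned income, $G$ the guarantee (the intercept of the schedule), and $t$ the tax rate, which sets the slope $(1-t)w$. Randomly assigning households to different $(G, t)$ pairs generates independent variation in the intercept and the slope: the response of labor supply to $G$ at a fixed slope isolates the income effect, and, given that income effect, the Slutsky decomposition
\[
\frac{\partial h}{\partial w}\bigg|_{\text{uncompensated}}
= \frac{\partial h}{\partial w}\bigg|_{\text{compensated}} + h\,\frac{\partial h}{\partial y_0}
\]
recovers the substitution effect from the response to the slope. Ordinary wage variation moves income and the return to work at the same time, which is why the two effects are hard to separate without such a design.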
Interestingly, after the initial NIT and the RAND Health Insurance Experiment, the tradition of
social experiments in the US has mainly been to obtain causal effects of social policies that were
often fairly comprehensive packages (Gueron, 2016), though according to chapter 14 (Kling, Ludwig, Congdon, and Mullainathan, 2016) there
has been a recent revival of what they call “mechanism experiments” which they define to be:
. . . an experiment that tests a mechanism—that is, it tests not the effects of vari-
ation in policy parameters themselves, directly, but the effects of variation in an
intermediate link in the causal chain that connects (or is hypothesized to connect) a
policy to an outcome. That is, where there is a specified policy that has candidate
mechanisms that affect an outcome of policy concern, the mechanism experiment
tests one or more of those mechanisms. There can be one or more mechanisms that
link the policy to the outcome, which could operate in parallel (for example when
there are multiple potential mediating channels through which a policy could change
outcomes) or sequentially (if for example some mechanisms affect take-up or imple-
mentation fidelity). The central idea is that the mechanism experiment is intended to
be informative about some policy but does not involve a test of that policy directly.
In other words, mechanism experiments are a specific kind of theory-testing experiment, distinguished by having a relatively direct implication for the design of some policy.
Experiments that test theories, including mechanism experiments, have always had an im-
portant place in development economics and are now also used in developed countries. Banerjee
and Duflo (2009) discuss some early examples of mechanism experiments, including the justly
influential paper on “observing unobservables” by Karlan and Zinman (2009). A number of these
are discussed in chapter 7 (Bertrand and Duflo, 2016), chapter 11 (Dupas and Miguel, 2016),
and chapter 17 (Hanna and Karlan, 2016).
Another area where it is now standard to use field experiments to test theories is in the
growing literature on replicating tests of theories that were previously conducted in the laboratory
in more realistic settings. Chapter 6 (Al-Ubaydli and List, 2016) and chapter 9 (Gneezy and
Imas, 2016) are both excellent introductions to this literature, with the first focusing on theoretical
predictions about market outcomes while the second is more about understanding preferences.
By moving from the lab to the field, the studies that are reviewed in these two chapters aim to
select a more relevant population and to place participants in situations that are not artificial, in order
to test these theories in the contexts that are relevant in practice. The idea is that, in the lab,
people do not behave as they would in reality. Chapter 5 (Paluck and Shafir, 2016) goes one step
further in helping us think about how an experimenter must design an experiment to successfully
test a theory. They place the notion of “construal” at the center of their approach. They
write: “Construal is defined as the individual’s subjective interpretation of a stimulus, whether
the stimulus is a choice set, a situation, another person or group of people, or an experimental
intervention.” In order to successfully test a theory, the experiment must be designed such that
the participants understand the world (and the different treatments) in the way the experimenter
intended, and therefore their actions and behavior in the different conditions can be interpreted.
Of course, construal is relevant for other research as well (it affects how people will respond
to a survey). But it is particularly important in an experimental setup, when a researcher
is thinking about the relevant manipulations. There is no magic recipe for this, but Paluck and
Shafir encourage us to use this lens to think about basic experimental practice:
piloting, as well as open-ended and open-minded observation, in the early phase of an experiment,
to ensure that the participants’ construal is the same as the researchers’; a “manipulation check”
to make sure that participants understood that they were being treated; and decisions on whether
to be present or not during an experiment.
Data collection is at the core of experimental work, since administrative data is not always
available or sufficient to obtain information on the relevant outcome. Considerable progress has
been made on this front. Chapter 4 (Glennerster, 2016) gives specific and useful guidance on
how researchers can ensure the validity of the data that they have collected, summarizing best
practices on monitoring, back checking, and effective use of information technology. Experiments
have also spurred creativity in measurement, and Glennerster’s chapter, as well as almost all the
other chapters, covers these innovations. We elaborate a bit more on these issues here.
In principle, there is no automatic link between careful and innovative collection of microe-
conomic data and the experimental method. However, one specific feature of experiments
encourages the development of new measurement methods: they typically combine high take-up
rates with a well-defined measurement problem. In many experimental studies, a large fraction of those who are
intended to be affected by the program are actually affected. This means that the number of
units on which data needs to be collected to assess the impact of the program does not have to
be very large and that data are typically collected especially for the purpose of the experiment.
Elaborate and expensive measurement of outcomes is then easier to afford than in the context
of a large multipurpose household or firm survey. By contrast, observational studies must often
rely for identification on variation (policy changes, market-induced variation, natural variation,
supply shocks, etc.) that covers large populations, requiring the use of large datasets often not
collected for a specific purpose. This makes it more difficult to fine-tune the measurement to
the specific question at hand. Moreover, even if it is possible ex post to do a sophisticated data
collection exercise specifically targeted to the question, it is generally impossible to do it for
the preprogram situation. This precludes the use of a difference-in-differences strategy for these
types of outcomes, which again limits the incentives to collect them ex-post.
Some of the most exciting recent developments related to field experiments have to do with
measurement. Researchers have turned to other sub-fields of economics as well as different
fields altogether to borrow tools for measuring outcomes. Examples include soil testing and
remote sensing in agriculture (see chapter 15 (de Janvry, Sadoulet, and Suri, 2016) for a review
on agriculture); techniques developed by social psychologists for difficult-to-measure outcomes
such as discrimination and prejudice – audit and correspondence studies, implicit association
tests, Goldberg experiments, and list experiments (see chapter 7 (Bertrand and Duflo, 2016)
for a review on discrimination); tools developed by cognitive psychologists for child development
(Attanasio et al., 2014); tools inspired by economic theory, such as Becker-DeGroot-Marschak
games to infer willingness to pay (see a discussion in chapter 11 (Dupas and Miguel, 2016));
biomarkers in health, beyond the traditional height, weight and hemoglobin (cortisol to measure
stress for example); wearable devices to measure mobility or effort (Rao, Schilbach, and Schofield,
in progress; Kreindler, in progress).
Specific methods and devices that exactly suit the purpose at hand have also been developed
for experiments. Olken (2007) is one example of the kind of data that can be collected in an
experimental setting. The objective was to determine whether audits or community monitoring
were effective ways to curb corruption in decentralized construction projects. Getting a reliable
measure of actual levels of corruption was thus necessary. Olken focused on roads and had
engineers dig holes in the road to measure the material used. He then compared that with the
level of material reported to be used. The difference is a measure of how much of the material
was stolen, or never purchased but invoiced, and thus an objective measure of corruption. Olken
then demonstrated that this measure of “missing inputs” is affected by the threat of audits,
but not, except under one specific condition, by encouraging greater participation in community
meetings. Rigol, Hussam, and Regianni (in progress) provide another example of innovative data
collection practices. For their experiment, in order to accurately measure if and when people
wash their hands, they designed soap dispensers that could track when the pump was being
pushed and hired a Chinese company to manufacture them. Similar “audit” methodologies are
used to measure the impact of interventions in health, such as having actors pose as patients with
specific conditions to measure the impact of training (Banerjee et al., 2016), or ineligible people attempting
to buy free bed nets (Dupas et al., 2016). Even a partial list of such examples would be very
long.
In parallel, greater use is being made of administrative data, which are often combined with
large-scale experiments. Administrative data are often at the core of the analysis of experiments
in the US (see chapter 1 (Gueron, 2016) and chapter 18 (von Wachter and Rothstein, 2016)), and
the more recent availability of tax data has made it possible to examine long-term impacts of interventions
(Chetty et al., 2011, 2016). Recently, the practice has also spread to developing countries. For
example, Banerjee et al. (2016) make use of both publicly available administrative data on a
workfare program in India and restricted expenditure data made available to them as part of the
experiment; Olken, Khan, and Khwaja (2016) use administrative tax data from Pakistan; and
Attanasio, Medina, and Meghir (2016) use unemployment insurance data to measure the long-term
effect of job training in Colombia.
Another increasingly important source of data comes from the use of lab-in-the-field exper-
iments either as predictors of the treatment effect (e.g. commitment devices should help those
who have self-control problems more than others) or as an outcome (e.g. cooperation in a public
goods game as a measure of success in creating social capital). Chapter 9 (Gneezy and Imas,
2016) provides a number of examples, but also warns against blindly trusting lab-in-field experi-
ments to unearth deep preferences—for example, behavior in a dictator game may not necessarily
predict pro-social behaviors in real-life contexts.
The bottom line is that there has been great progress in our understanding of how to creatively
and accurately collect new data, or use existing data, beyond traditional surveys, and these
insights have led both to better projects and to innovations in data collection that have been
adopted in non-randomized work as well.
1.5 Iterate and build on previous research in the same settings
Another methodological advantage of RCTs also relates to the control that researchers have
over the assignment and, often enough, over the treatments themselves. Well-identified policy
evaluations often raise more questions than they can actually answer. In particular, we are often
left wondering why things turned out the way they did and how to change the intervention to
make things (even) better.
This is where the ability to keep trying different interventions can be enormously valuable.
Chapter 12 (Fryer, 2016) on education in the developed world is in part a history of such a quest.
Fryer details the process of trying to figure out what actually works in closing the black-white
achievement gap, describing the long line of experiments that failed to deliver, or to deliver enough,
and the slow accretion of learning from successes and failures. Through this process, the main
directions eventually became clear and he is able to conclude:
These facts provide reason for optimism. Through the systematic implementation
of randomized field experiments designed to increase human capital of school-aged
children, we have substantially increased our knowledge of how to produce human
capital and have assembled a canon of best practices.
We see a very similar process of dynamic discovery in chapter 8 (Gerber and Green, 2016) on
the question of how to influence voter turnout. Marketing experiments also feature dynamically
evolving treatments (see chapter 10 (Simester, 2016)), as do some agricultural experiments (see
chapter 15 (de Janvry, Sadoulet, and Suri, 2016)).
Finally, RCTs allow researchers to “unpack” a program into its constituent elements. Here
again the work may be iterative. For example, all the initial evaluations of the BRAC ultra-poor
program were done using their “full package,” as were a large number of evaluations of the
Mexican conditional cash transfer (CCT) program PROGRESA. But both for research and for
policy, once we know that the full program works, there is a clear interest in knowing which
elements are key to its success. In recent years, a number of papers have looked “inside”
CCTs, relaxing the conditionality and altering it in other ways, which are discussed in chapter
17 (Hanna and Karlan, 2016). Hanna and Karlan also highlight the challenge of fully unpacking
a program in the context of their discussion of the graduation program, mentioned above, which
provides beneficiaries with the gift of an asset, as well as access to a savings opportunity, health
services and information, life coaching and a small stipend. They write:
As this paragraph implies, the way forward is clearly going to be the development of a mo-
saic, rather than any one definitive study that both tests each component and also includes
sufficient contextual and market variations so that it can help set policy for a myriad of coun-
tries and populations. More work is needed to tease apart the different components: asset
transfer (addresses capital market failures), savings account (lowers savings transaction fee), in-
formation (addresses information failures), life-coaching (addresses behavioral constraints, and
perhaps changes expectations and beliefs about possible return on investment), health services
and information (addresses health market failures), consumption support (addresses nutrition-
based poverty traps), etc. Furthermore, for several of these questions, there are key open issues
for how to address them; for example, life-coaching can take on an infinite number of manifes-
tations. Some organizations conduct life-coaching through religion, others through interactive
problem-solving, and others through psychotherapy approaches (Bolton et al., 2003, 2007; Patel
et al., 2010). Much remains to be learned, not just about the promise of such life-coaching
components, but also about how to make them work (if they work at all).