Data for Development Impact:
The DIME Analytics Resource Guide

DIME Analytics
Copyright © 2019
Kristoffer Bjärkefur
Luíza Cardoso de Andrade
Benjamin Daniels
Maria Jones
http://worldbank.github.com/d4di
Released under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
https://creativecommons.org/licenses/by/4.0/
Feedback
We encourage feedback and corrections so that we can improve the contents of the book in future editions. Please visit https://worldbank.github.com/d4di/feedback/ to see different options on how to provide feedback. You can also email us at [email protected] with input or comments, and we will be very thankful.
We hope you enjoy the book!
Dedicated to all the research assistants
Welcome to Data for Development Impact. This book is intended to serve as a resource guide for people
who collect or use data for development research. In particular, the book is intended to guide the reader
through the process of research using primary survey data, from research design to fieldwork to data
management to analysis. This book will not teach you econometrics or epidemiology or agribusiness. This
book will not teach you how to design an impact evaluation. This book will not teach you how to do data
analysis, or how to code. There are lots of really good resources out there for all of these things, and they
are much better than what we would be able to squeeze into this book.
What this book will teach you is how to think about quantitative data, keeping in mind that you are not
going to be the only person collecting it, using it, or looking back on it. We hope to provide you two key
tools by the time you finish this book. First, we want you to form a mental model of data collection as a
“social process”, in which many people need to have the same idea about what is to be done, and when
and where and by whom, so that they can collaborate effectively on large, long-term projects. Second, we
want to provide a map of concrete resources for supporting these processes in practice. As research teams
and timespans have grown dramatically over the last decade, it has become inefficient for everyone to have
their own personal style dictating how they use different functions, how they store data, and how they
write code.
code.do
* Load the auto dataset
sysuse auto.dta , clear

* Run a simple regression
reg price mpg rep78 headroom , coefl

* Transpose and store the output
matrix results = r(table)'

* Load the results into memory
clear
svmat results , n(col)
We have tried really hard to make sure that all the Stata code runs,
and that each block is well-formatted and uses built-in functions. We
will also point to user-written functions when they provide important
tools. In particular, we have written two suites of Stata commands, ietoolkit12 and iefieldkit,13 that standardize some of our core data collection workflows. Providing some standardization to Stata code style is also a goal of this team, since groups are collaborating on code in Stata more than ever before. We will not explain Stata commands unless the behavior we are exploiting is outside the usual expectation of their functionality; we will comment the code generously (as you should), but you should reference the Stata help files via help [command] whenever you do not understand the functionality being used. We hope that these snippets will provide a foundation for your code style. Alongside the collaborative view of data that we outlined above, good code practices are a core part of the new data science of development research. Code today is no longer just a means to an end (such as a paper); it is part of the output itself: a means for communicating how something was done, in a world where the credibility and transparency of data cleaning and analysis are increasingly important.

12 https://dimewiki.worldbank.org/wiki/ietoolkit
13 https://dimewiki.worldbank.org/wiki/iefieldkit
While adopting the workflows and mindsets described in this book requires an up-front cost, it should quickly start to save you and others a lot of time and hassle. In part this is because you will learn how to do the essential things directly; in part this is because you will find tools for the more advanced things; and in part this is because you will have the mindset to approach everything else in a high-quality way. We hope you will find this book helpful for accomplishing all of the above, and that mastery of data helps you make an impact!
– The DIME Analytics Team
Handling data ethically
Development research does not just involve real people – it also affects real people. Policy decisions are
made every day using the results of briefs and studies, and these can have wide-reaching consequences on
the lives of millions. As the range and importance of the policy-relevant questions asked by development
researchers grow, so too does the (rightful) scrutiny under which methods and results are placed. This
scrutiny involves two major components: data handling and analytical quality. Performing at a high
standard in both means that research participants are appropriately protected, and that consumers of
research can have confidence in its conclusions.
What we call ethical standards in this chapter is a set of practices for data privacy and research transparency
that address these two components. Their adoption is an objective measure by which to judge a research product's
performance in both. Without these transparent measures of credibility, reputation is the primary signal
for the quality of evidence, and two failures may occur: low-quality studies from reputable sources may
be used as evidence when in fact they don’t warrant it, and high-quality studies from sources without an
international reputation may be ignored. Both these outcomes reduce the quality of evidence overall. Even
more importantly, they usually mean that credibility in development research accumulates at international
institutions and top global universities instead of the people and places directly involved in and affected by
it. Simple transparency standards mean that it is easier to judge research quality, and making high-quality
research identifiable also increases its impact. This section provides some basic guidelines and resources
for collecting, handling, and using field data ethically and responsibly to publish research findings.
Research replicability
Replicable research, first and foremost, means that the actual analytical processes you used are executable by others.18 (We use "replicable" and "reproducible" somewhat interchangeably, referring only to the code processes themselves in a specific study; in other contexts they may have more specific meanings.19) All your code files involving data construction and analysis should be public – nobody should have to guess what exactly comprises a given index, or what controls are included in your main regression, or whether or not you clustered standard errors correctly. That is, as a purely technical matter, nobody should have to "just trust you", nor should they have to bother you to find out what happens if any or all of these things were to be done slightly differently.20 Letting people play around with your data and code is a great way to have new questions asked and answered based on the valuable work you have already done. Services like GitHub that expose your code history are also valuable resources. For one, they can show modifications made in response to referee comments; for another, they can show the research paths and questions you may have tried to answer (but excluded from publication) as a resource to others who have similar questions about their own data.

Secondly, reproducible research21 enables other researchers to re-use your code and processes to do their own work more easily in the future. This may mean applying your techniques to their data or implementing a similar structure in a different context. As a pure public good, this is nearly costless. The useful tools and standards you create will have high value to others. If you are personally or professionally motivated by citations, producing these kinds of resources will almost certainly lead to that as well. Therefore, your code should be written neatly and published openly. It should be easy to read and understand in terms of structure, style, and syntax. Finally, the corresponding dataset should be openly accessible unless it cannot be for legal or ethical reasons.22

18 Dafoe, A. (2014). Science deserves better: the imperative to share complete replication files. PS: Political Science & Politics, 47(1):60–66.
19 http://datacolada.org/76
20 Simmons, J. P., Nelson, L. D., and Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11):1359–1366; and Wicherts, J. M., Veldkamp, C. L., Augusteijn, H. E., Bakker, M., Van Aert, R., and Van Assen, M. A. (2016). Degrees of freedom in planning, running, analyzing, and reporting psychological studies: A checklist to avoid p-hacking. Frontiers in Psychology, 7:1832.
21 https://dimewiki.worldbank.org/wiki/Reproducible_Research
22 https://dimewiki.worldbank.org/wiki/Publishing_Data
Research transparency
Transparent research will expose not only the code, but all the processes involved in establishing credibility to the public.23 This means that readers are able to judge for themselves if the research was done well, and if the decision-making process was sound. If the research is well-structured, and all relevant documentation24 is shared, this is as easy as possible for the reader to do. This is also an incentive for researchers to make better decisions, be skeptical about their assumptions, and, as we hope to convince you, make the process easier for themselves, because it requires methodical organization that is labor-saving over the complete course of a project.

Registered Reports25 can help with this process where they are available. By setting up a large portion of the research design in advance,26 a great deal of work has already been completed, and at least some research questions are pre-committed for publication regardless of the outcome. This is meant to combat the "file-drawer problem",27 and to ensure that researchers are transparent in the additional sense that all the results obtained from registered studies are actually published.

Documenting a project in detail greatly increases transparency. This means explicitly noting decisions as they are made, and explaining the process behind them. Documentation on data processing and additional hypotheses tested will be expected in the supplemental materials to any publication. Careful documentation will also save the research team a lot of time during a project, as it prevents you from having the same discussion twice (or more!), since you have a record of why something was done in a particular way. There are a number of available tools that will contribute to producing documentation, but project documentation should always be an active and ongoing process, not a one-time requirement or retrospective task. New decisions are always being made as the plan begins contact with reality, and there is nothing wrong with sensible adaptation so long as it is recorded and disclosed. (Email is not a note-taking service.)

There are various software solutions for building documentation over time. The Open Science Framework28 provides one such solution that can be adapted to different team and project dynamics, but is less effective for file storage. Each project has its specificities, and the exact shape of this process can be molded to the team's needs, but it should be agreed on prior to project launch. This way, you can start building a project's documentation as soon as you start making decisions.

23 http://www.princeton.edu/~mjs3/open_and_reproducible_opr_2017.pdf
24 https://dimewiki.worldbank.org/wiki/Data_Documentation
25 https://blogs.worldbank.org/impactevaluations/registered-reports-piloting-pre-results-review-process-journal-development-economics
26 https://www.bitss.org/2019/04/18/better-pre-analysis-plans-through-design-declaration-and-diagnosis/
27 Simonsohn, U., Nelson, L. D., and Simmons, J. P. (2014). P-curve: a key to the file-drawer. Journal of Experimental Psychology: General, 143(2):534.
28 https://osf.io/
Research credibility
The credibility of research is traditionally a function of design choices.31 Is the research design sufficiently powered through its sampling and randomization? Were the key research outcomes pre-specified or chosen ex-post? How sensitive are the results to changes in specifications or definitions? Tools such as pre-analysis plans32 are important to assuage these concerns for experimental evaluations, but they may feel like "golden handcuffs" for other types of research.33 Regardless of whether or not a formal pre-analysis plan is utilized, all experimental and observational studies should be

31 Angrist, J. D. and Pischke, J.-S. (2010). The credibility revolution in empirical economics: How better research design is taking the con out of econometrics. Journal of Economic Perspectives, 24(2):3–30; and Ioannidis, J. P. (2005). Why most published research findings are false. PLoS Medicine, 2(8):e124.
32 https://dimewiki.worldbank.org/wiki/Pre-Analysis_Plan
Preparation for data work begins long before you collect any data. In order to be prepared for fieldwork,
you need to know what you are getting into. This means knowing which data sets you need, how those
data sets will stay organized and linked, and what identifying information you will collect for the different
types and levels of data you’ll observe. Identifying these details creates a data map for your project, giving
you and your team a sense of how information resources should be organized. It’s okay to update this map
once the project is underway – the point is that everyone knows what the plan is.
Then, you must identify and prepare your tools and workflow. All the tools we discuss here are
designed to prepare you for collaboration and replication, so that you can confidently manage tools and
tasks on your computer. We will try to provide free, open-source, and platform-agnostic tools wherever
possible, and provide more detailed instructions for those with which we are familiar. However, most have
a learning and adaptation process, meaning you will become most comfortable with each tool only by
using it in real-world work. Get to know them well early on, so that you do not spend a lot of time later
figuring out basic functions.
Being comfortable using your computer and having the tools you
need in reach is key. This section provides a brief introduction to key
concepts and toolkits that can help you take on the work you will be
primarily responsible for. Some of these skills may seem elementary,
but thinking about simple things from a workflow perspective can
help you make marginal improvements every day you work.
Teams often develop their workflows as they go, solving new
challenges when they appear. This will always be necessary, and new
challenges will keep coming. However, there are a number of tasks
that will always have to be completed on any project. These include
organizing folders, collaborating on code, controlling different
versions of a file, and reviewing each other’s work. Thinking about
the best way to do these tasks ahead of time, instead of just doing it
as quickly as you can when needed, will save your team a lot of re-working. This chapter will outline the main points to discuss within
the team, and point to some possible solutions.
Folder management
The first thing your team will need to create is a shared folder.52 If every team member is working on their local computers, there

52 Common tools for folder sharing are Dropbox, Box, and OneDrive.
Code management
Once you start a project’s data work, the number of scripts, datasets
and outputs that you have to manage will grow very quickly. This
can get out of hand just as quickly, so it’s important to organize your
data work and follow best practices from the beginning. Adjustments
will always be needed along the way, but if the code is well-organized,
they will be much easier to make. Below we discuss a few crucial
steps to code organization. They all come from the principle that
code is an output by itself, not just a means to an end. So code
should be written thinking of how easy it will be for someone to
read it later.
Code documentation is one of the main factors that contribute to
readability, if not the main one. There are two types of comments
that should be included in code. The first one describes what is
being done. This should be easy to understand from the code itself
if you know the language well enough and the code is clear. But
writing plain English (or whichever language you communicate with
your team on) will make it easier for everyone to read. The second
type of comment is what differentiates commented code from well-
commented code: it explains why the code is performing a task in
a particular way. As you are writing code, you are making a series
of decisions that (hopefully) make perfect sense to you at the time.
However, you will probably not remember how they were made
in a couple of weeks. So write them down in your code. There are
other ways to document decisions (GitHub offers a lot of different
documentation options, for example), but information that is relevant
to understand the code should always be written in the code itself.
Code organization is the next level. Start by adding a code header.
This should include simple things such as stating the purpose of the
script and the name of the person who wrote it. If you are using version control software, the last time a modification was made and the person who made it will be recorded by that software. Otherwise, you should include this information in the header. Finally, and most importantly, use the header to track the inputs and outputs of the script. When you are
trying to track down which code creates a data set, this will be very
helpful.
Breaking your code into readable steps is also good practice for code organization. One way to do this is to create sections where a specific task is completed. So, for example, if you want to find the line in your code where a variable was created, you can go straight to PART 2: Create new variables, instead of reading the code line by line. RStudio makes it very easy to create sections, and compiles them into an interactive script index. In Stata, you can use comments to create section headers, though they are only there to make the reading easier. Adding a code index to the header by copying and pasting section titles is the easiest way to create a code map; you can then navigate between sections using the find command. Since Stata code is harder to navigate – you will need to scroll through the document – it is particularly important to avoid writing very long scripts. One reasonable rule of thumb is to not write files longer than around 200 lines. This is also true for other statistical software, though breaking it there will not cause as much of a hassle.
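For example, a code index in the header and matching section markers in the body can be as simple as the following minimal sketch; the PART titles here are placeholders invented for illustration, not taken from any specific project.

* Sketch of a code map: index in the header, matching markers in the body

/*******************************************************************************
    OUTLINE:    PART 1: Load data
                PART 2: Create new variables
                PART 3: Run analysis
*******************************************************************************/

* ...

********************************************************************************
*   PART 2: Create new variables
********************************************************************************

* Searching for "PART 2" with the editor's find command jumps straight here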
stata-master-dofile.do
/*******************************************************************************
*                         TEMPLATE MASTER DO-FILE                             *
********************************************************************************
*                                                                              *
*  PURPOSE:  Reproduce all data work, map inputs and outputs,                  *
*            facilitate collaboration                                          *
*                                                                              *
*  OUTLINE:  PART 1: Set standard settings and install packages                *
*            PART 2: Prepare folder paths and define programs                  *
*            PART 3: Run do files                                              *
*                                                                              *
********************************************************************************
   PART 1: Set standard settings and install packages
********************************************************************************/

if (0) {
    ssc install ietoolkit, replace
}

ieboilstart, v(15.1)
`r(version)'

/*******************************************************************************
   PART 2: Prepare folder paths and define programs
*******************************************************************************/

* Research Assistant folder paths
if "`c(username)'" == "ResearchAssistant" {
    global github    "C:/Users/RA/Documents/GitHub/d4di/DataWork"
    global dropbox   "C:/Users/RA/Dropbox/d4di/DataWork"
    global encrypted "A:/DataWork/EncryptedData" // Always mount to A disk!
}

* Baseline folder globals
global bl_encrypt "${encrypted}/Round Baseline Encrypted"
global bl_dt      "${dropbox}/Baseline/DataSets"
global bl_doc     "${dropbox}/Baseline/Documentation"
global bl_do      "${github}/Baseline/Dofiles"
global bl_out     "${github}/Baseline/Output"

/*******************************************************************************
   PART 3: Run do files
*******************************************************************************/

/*------------------------------------------------------------------------------
   PART 3.1: De-identify baseline data
--------------------------------------------------------------------------------
   REQUIRES: ${bl_encrypt}/Raw Identified Data/D4DI_baseline_raw_identified.dta
   CREATES:  ${bl_dt}/Raw Deidentified/D4DI_baseline_raw_deidentified.dta
   IDS VAR:  hhid
------------------------------------------------------------------------------*/
do "${bl_do}/Cleaning/deidentify.do"

/*------------------------------------------------------------------------------
   PART 3.2: Clean baseline data
--------------------------------------------------------------------------------
   REQUIRES: ${bl_dt}/Raw Deidentified/D4DI_baseline_raw_deidentified.dta
   CREATES:  ${bl_dt}/Final/D4DI_baseline_clean.dta
             ${bl_doc}/Codebook baseline.xlsx
   IDS VAR:  hhid
------------------------------------------------------------------------------*/
do "${bl_do}/Cleaning/cleaning.do"

/*------------------------------------------------------------------------------
   PART 3.3: Construct income indicators
--------------------------------------------------------------------------------
   REQUIRES: ${bl_dt}/Final/D4DI_baseline_clean.dta
   CREATES:  ${bl_out}/Raw/D4DI_baseline_income_distribution.png
             ${bl_dt}/Intermediate/D4DI_baseline_constructed_income.dta
   IDS VAR:  hhid
------------------------------------------------------------------------------*/
do "${bl_do}/Construct/construct_income.do"
Version control
A version control system is the way you manage the changes to any computer file. This is important, for example, for your team to be able to find the version of a presentation that you delivered to your donor, but also to understand why the significance of your estimates has changed. Everyone who has ever encountered a file named something like final_report_v5_LJK_KLE_jun15.docx can appreciate how useful such a system can be. Most file sharing solutions offer some level of version control. These are usually enough to manage changes to binary files (such as Word and PowerPoint documents) without needing to appeal to these dreaded file names. For code files, however, a more complex version control system is usually desirable. We recommend using Git56 for all plain text files. Git tracks all the changes you make to your code, and allows you to go back to previous versions without losing the information on changes made. It also makes it possible to work on two parallel versions of the code, so you don't risk breaking the code for other team members

56 Git: a multi-user version control system for collaborating on and tracking changes to code as it is written.
Output management
One more thing to be discussed with your team is the best way to
manage outputs. A great number of them will be created during the
course of a project, from raw outputs such as tables and graphs to
final products such as presentations, papers and reports. When the
first outputs are being created, agree on where to store them, what
software to use, and how to keep track of them.
Decisions about storage of final outputs are made easier by
technical constraints. As discussed above, Git is a great way to control different versions of plain text files, while sync software such as Dropbox is better for binary files. So it makes sense to store raw outputs such as .tex and .eps files in Git, and final outputs in PDF, PowerPoint, or Word. Storing plain text outputs on
Git makes it easier to identify changes that affect results. If you are
re-running all of your code from the master when significant changes
to the code are made, the outputs will be overwritten, and changes
in coefficients and number of observations, for example, will be
highlighted.
Though formatted text software such as Word and PowerPoint is still prevalent, more and more researchers are choosing to write final outputs using LaTeX.57 LaTeX is a document preparation system that can create both text documents and presentations. The main difference between them is that LaTeX uses plain text, and it is necessary to learn its markup convention to use it. The main advantage of using LaTeX is that you can write dynamic documents, which import inputs every time they are compiled. This means you can skip the copying and pasting whenever an output is updated. Because it is written in plain text, it is also easier to control and document changes using Git. Creating documents in LaTeX using an integrated writing environment such as TeXstudio is great for outputs that focus mainly on text, but include small chunks of code and static code outputs. This book, for example, was written in LaTeX. Another option is to use the statistical software's dynamic document engines. This means you can write both text (in Markdown) and code in the script, and the result will usually be a PDF or HTML file including code, text, and code outputs. Dynamic document tools are better for including large chunks of code and dynamically created graphs and tables, but formatting can be trickier. So they are great for creating appendices, or quick documents with results as you work on them, but not for final papers and reports. RMarkdown58 is the

57 https://www.latex-project.org
58 https://rmarkdown.rstudio.com/
The Stata do-file editor is the most widely adopted code editor, but Atom64 and Sublime65 can also be configured to run Stata code. Opening an entire directory and loading the whole tree view in the sidebar, which gives you access to directory management actions, is a really useful feature. This can be done using RStudio projects in RStudio, Stata projects in Stata, and directory managers in Atom and Sublime.

When it comes to collaboration software,66 the two most common tools in use are Dropbox and GitHub.67 GitHub issues are a great tool for task management, and Dropbox Paper also provides a good interface with notifications. Neither of these tools requires much technical knowledge; they merely require an agreement and workflow design so that the people assigning the tasks are sure to set them up in the system. GitHub is useful because tasks can clearly be tied to file versions; therefore it is useful for managing code-related tasks. It also creates incentives for writing down why changes were made as they are saved, creating naturally documented code. Dropbox Paper is useful because tasks can be easily linked to other documents saved in Dropbox; therefore it is useful for managing non-code-related tasks. Our team uses both.

64 https://atom.io
65 https://www.sublimetext.com/
66 https://dimewiki.worldbank.org/wiki/Collaboration_Tools
67 https://michaelstepner.com/blog/git-vs-dropbox/
Designing research for causal inference
Research design is the process of structuring field work – both experimental design and data collection –
that will answer a specific research question. You don’t need to be an expert in this, and there are lots of
good resources out there that focus on designing interventions and evaluations. This section will present a
very brief overview of the most common methods that are used, so that you can have an understanding of
how to construct appropriate counterfactuals, data structures, and the corresponding code tools as you are
setting up your data structure before going to the field to collect data.
You can categorize most research questions into one of two main types. There are cross-sectional,
descriptive, and observational analyses, which seek only to describe something for the first time, such
as the structure or variation of a population. We will not describe these here, because there are endless
possibilities and they tend to be sector-specific. For all sectors, however, there are also causal research
questions, both experimental and quasi-experimental, which rely on establishing exogenous variation in
some input to draw a conclusion about its effect on various outcomes of interest. We'll focus
on these causal designs, since the literature offers a standardized set of approaches, with publications and
code tools available to support your work.
Cross-sectional RCTs
Cross-sectional RCTs are the simplest possible study design: a
program is implemented, surveys are conducted, and data is analyzed.
The randomization process, as in all RCTs, draws the treatment
and control groups from the same underlying population. This
Differences-in-differences
Regression discontinuity
Regression discontinuity (RD) designs differ from other RCTs in that the treatment group is not directly randomly assigned, even though it is often applied in the context of a specific experiment.87 (In practice, many RDs are quasi-experimental, but this section will treat them as though they are designed by the researcher.) In an RD design, there is a running variable which gives eligible people access to some program, and a strict cutoff determines who is included.88 This is usually justified by budget limitations. The running variable should not be the outcome of interest, and while it can be time, that may require additional modeling assumptions. Those who qualify are given the intervention and those who don't are not; this process substitutes for explicit randomization.89

For example, imagine that there is a strict income cutoff created for a program that subsidizes some educational resources. Here, income is the running variable. The intuition is that the people who are "barely eligible" should not in reality be very different from those who are "barely ineligible", and that resulting differences between them at measurement are therefore due to the intervention or program.90 For the modeling component, the bandwidth, or the size of the window around the cutoff to use, has to be decided and tested against various options for robustness. The rest of the model depends largely on the design and execution of the experiment.

87 https://dimewiki.worldbank.org/wiki/Regression_Discontinuity
88 Lee, D. S. and Lemieux, T. (2010). Regression discontinuity designs in economics. Journal of Economic Literature, 48(2):281–355.
89 http://blogs.worldbank.org/impactevaluations/regression-discontinuity-porn
90 Imbens, G. W. and Lemieux, T. (2008). Regression discontinuity designs: A guide to practice. Journal of Econometrics, 142(2):615–635.
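To make the cutoff and bandwidth intuition concrete, here is a minimal simulated sketch in Stata; it is not from any real project, and the variable names, cutoff, and bandwidth are all illustrative assumptions.

* Illustrative RD sketch on simulated data (all names and values are assumptions)
clear
set obs 1000
set seed 215597

gen income   = 100 * runiform()                           // Running variable
gen eligible = (income < 50)                              // Strict cutoff at 50
gen outcome  = 10 + 0.05*income + 2*eligible + rnormal()  // True jump of 2 at the cutoff

* Local linear estimate of the jump within a bandwidth of 10 around the cutoff
gen dist = income - 50
reg outcome i.eligible##c.dist if abs(dist) < 10

* Re-run with other bandwidths (for example 5 or 20) to test robustness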
Quasi-experimental designs
Instrumental variables
Instrumental variables designs utilize variation in an otherwise-unrelated predictor of exposure to a treatment condition as an "instrument" for the treatment condition itself.93 The simplest example is actually experimental – in a randomization design, we can use instrumental variables based on an offer to join some program, rather than on the actual inclusion in the program.94 The reason for doing this is that the second stage of actual program takeup may be severely self-selected, making the group of program participants in fact wildly different from the group of non-participants.95 The corresponding two-stage least squares (2SLS) estimator96 solves this by conditioning on only the random portion of takeup – in this case, the randomized offer of enrollment in the program.

Unfortunately, instrumental variables designs are known to have very high variances relative to ordinary least squares.97 IV designs furthermore rely on strong but untestable assumptions about the relationship between the instrument and the outcome.98 Therefore IV designs face special scrutiny, and only the most believable designs, usually those backed by extensive qualitative analysis, are acceptable as high-quality evidence.

93 https://dimewiki.worldbank.org/wiki/instrumental_variables
94 Angrist, J. D. and Krueger, A. B. (2001). Instrumental variables and the search for identification: From supply and demand to natural experiments. Journal of Economic Perspectives, 15(4):69–85.
95 http://www.rebeccabarter.com/blog/2018-05-23-instrumental_variables/
96 http://www.nuff.ox.ac.uk/teaching/economics/bond/IV%20Estimation%20Using%20Stata.pdf
97 Young, A. (2017). Consistency without inference: Instrumental variables in practical application. Unpublished manuscript, London: London School of Economics and Political Science. Retrieved from: http://personal.lse.ac.uk/YoungA
98 Bound, J., Jaeger, D. A., and Baker, R. M. (1995). Problems with instrumental variables estimation when the correlation between the instruments and the endogenous explanatory variable is weak. Journal of the American Statistical Association, 90(430):443–450.
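To fix ideas, here is a minimal sketch of the experimental case described above, using Stata's built-in ivregress command; the data and variable names (offer, takeup, ability) are simulated assumptions invented for illustration, not part of any real design.

* Illustrative 2SLS sketch on simulated data (all names and values are assumptions)
clear
set obs 1000
set seed 510402

gen ability = rnormal()                                  // Unobserved confounder
gen offer   = (runiform() < 0.5)                         // Randomized offer (the instrument)
gen takeup  = (offer == 1) & (ability + rnormal() > 0)   // Takeup is self-selected
gen outcome = 5 + 1.5*takeup + ability + rnormal()       // True effect of takeup is 1.5

* OLS of outcome on takeup would be biased by ability; 2SLS conditions
* only on the random portion of takeup, i.e. the randomized offer
ivregress 2sls outcome (takeup = offer) , first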
Matching estimators

Matching estimators rely on the assumption that, conditional on some observable characteristics, untreated units can be compared to treated units, as if the treatment had been fully randomized.99 In other words, they assert that differential takeup is sufficiently predictable by observed characteristics. These assertions are somewhat testable,100 and there are a large number of "treatment effect" packages devoted to standardizing reporting of various tests.101

However, since most matching models rely on a specific linear model, such as the typical propensity score matching estimator, they are open to the criticism of "specification searching", meaning that researchers can try different models of matching until one, by chance, leads to the final result that was desired. Newer methods, such as coarsened exact matching,102 are designed to remove some of the modelling, such that simple differences between matched observations are sufficient to estimate treatment effects given somewhat weaker assumptions on the structure of that effect. One solution, as with the experimental variant of 2SLS proposed above, is to incorporate matching models into explicitly experimental designs.

99 https://dimewiki.worldbank.org/wiki/Matching
100 https://dimewiki.worldbank.org/wiki/iematch
101 http://fmwww.bc.edu/repec/usug2016/drukker_uksug16.pdf
102 Iacus, S. M., King, G., and Porro, G. (2012). Causal inference without balance checking: Coarsened exact matching. Political Analysis, 20(1):1–24.
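As a minimal sketch of the propensity score approach described above, the example below uses Stata's built-in teffects psmatch command on simulated data; the covariates, takeup model, and effect size are invented for illustration.

* Illustrative propensity score matching sketch (all names and values are assumptions)
clear
set obs 1000
set seed 852526

gen age     = 20 + 40*runiform()
gen educ    = floor(12*runiform())
gen treat   = (runiform() < invlogit(-4 + 0.05*age + 0.2*educ))  // Takeup predicted by observables
gen outcome = 2 + 0.1*age + 0.3*educ + treat + rnormal()         // True effect of 1

* Match treated and untreated units on the observables that predict takeup
teffects psmatch (outcome) (treat age educ)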
Synthetic controls
Synthetic controls methods103 are designed for a particularly interesting situation: one where useful controls for an intervention simply do not exist. Canonical examples are policy changes at state or national levels, since at that scope there are no other units quite like the one that was affected by the policy change (much less sufficient N for a regression estimation).104 In this method, time series data is almost always required, and the control comparison is constructed by creating a linear combination of other units such that pre-treatment outcomes for the treated unit are best approximated by that specific combination.

103 Abadie, A., Diamond, A., and Hainmueller, J. (2015). Comparative politics and the synthetic control method. American Journal of Political Science, 59(2):495–510.
104 Gobillon, L. and Magnac, T. (2016). Regional policy evaluation: Interactive fixed effects and synthetic controls. Review of Economics and Statistics, 98(3):535–551.
Sampling, randomization, and power
Sampling, randomization, and power calculations are the core elements of experimental design. Sampling
and randomization determine which units are observed and in which states. Each of these processes
introduces statistical noise or uncertainty into the final estimates of effect sizes. Sampling noise produces
some probability of selection of units to measure that will produce significantly wrong estimates, and
randomization noise produces some probability of placement of units into treatment arms that does
the same. Power calculation is the method by which these probabilities of error are meaningfully assessed.
Good experimental design has high power – a low likelihood that these noise parameters will meaningfully
affect estimates of treatment effects.
Not all studies are capable of achieving traditionally high power: the possible sampling or treatment
assignments may simply be fundamentally too noisy. This may be especially true for novel or small-
scale studies – things that have never been tried before may be hard to fund or execute at scale. What is
important is that every study includes reasonable estimates of its power, so that the evidentiary value
of its results can be honestly assessed. Demonstrating that sampling and randomization were given serious consideration before going to the field lends credibility to any research study. Using these
tools to design the most highly-powered experiments possible is a responsible and ethical use of donor and
client resources, and maximizes the likelihood that reported effect sizes are accurate.
will use ieboilstart106 at the beginning of your master do-file107 to set the version once; in this guide, we will use version 13.1 in examples where we expect this to already be done.

106 https://dimewiki.worldbank.org/wiki/ieboilstart
107 https://dimewiki.worldbank.org/wiki/Master_Do-files
Sorting means that the actual data that the random process is run
on is fixed. Most random outcomes have as their basis an algorithmic
sequence of pseudorandom numbers. This means that if the start
point is set, the full sequence of numbers will not change. A corollary
of this is that the underlying data must be unchanged between runs:
to ensure that the dataset is fixed, you must make a LOCKED copy of
it at runtime. However, if you re-run the process with the dataset
in a different order, the same numbers will get assigned to different
units, and the randomization will turn out different. In Stata, isid
[id_variable], sort will ensure that order is fixed over repeat runs.
Seeding means manually setting the start-point of the underlying
randomization algorithm. You can draw a standard seed randomly
by visiting http://bit.ly/stata-random. You will see in the code
below that we include the timestamp for verification. Note that there
are two distinct concepts referred to here by “randomization”: the
conceptual process of assigning units to treatment arms, and the
technical process of assigning random numbers in statistical software,
which is a part of all tasks that include a random component.108 If the randomization seed for the statistical software is not set, then its pseudorandom algorithm will pick up where it left off. By setting the seed, you force it to restart from the same point. In Stata, set seed [seed] will accomplish this.

108 https://blog.stata.com/2016/03/10/how-to-generate-random-numbers-in-stata/
The code below loads and sets up the auto.dta dataset for
any random process. Note the three components: versioning, sorting,
and seeding. Why are check1 and check3 the same? Why is check2
different?
replicability.do
* Set the version
ieboilstart , v(13.1)
`r(version)'

* Load the auto dataset and sort uniquely
sysuse auto.dta , clear
isid make, sort

* Set the seed using random.org (range: 100000 - 999999)
set seed 287608 // Timestamp: 2019-02-17 23:06:36 UTC

* Demonstrate stability under the three rules
gen check1 = rnormal()
gen check2 = rnormal()

set seed 287608
gen check3 = rnormal()

// Visualize randomization results
graph matrix check1 check2 check3 , half
Commands like bys: and merge will re-sort your data as part of their execution, and other commands may alter the seed without you realizing it.109 (To reiterate: any process that includes a random component is a random process, including sampling, randomization, power calculation, and many algorithms like bootstrapping.) Any of these things will cause the output to fail to replicate. Therefore, each random process should be independently executed to ensure that these three rules are followed. Before shipping the results of any random process, save the outputs of the process in a temporary location, re-run the file, and use cf _all using [dataset] targeting the saved file. If there are any differences, the process has not reproduced, and cf will return an error, as shown here.

109 https://dimewiki.worldbank.org/wiki/Randomization_in_Stata
randomization-cf.do
* Make one randomization
sysuse bpwide.dta , clear
isid patient, sort
version 13.1
set seed 796683 // Timestamp: 2019-02-26 22:14:17 UTC

sample 100

* Save for comparison
tempfile sample
save `sample' , replace

* Identical randomization
sysuse bpwide.dta , clear
isid patient, sort
version 13.1
set seed 796683 // Timestamp: 2019-02-26 22:14:17 UTC

sample 100
cf _all using `sample'

* Do something wrong
sysuse bpwide.dta , clear
sort bp*
version 13.1
set seed 796683 // Timestamp: 2019-02-26 22:14:17 UTC

sample 100
cf _all using `sample'
Sampling
simple-sample.do
/*
    Simple reproducible sampling
*/

* Set up reproducibility
ieboilstart , v(12)     // Version
`r(version)'            // Version
sysuse auto.dta, clear  // Load data
isid make, sort         // Sort
set seed 215597         // Timestamp: 2019-04-26 17:51:02 UTC

* Take a sample of 20%
preserve
    sample 20
    tempfile sample
    save `sample' , replace
restore

* Merge and complete
merge 1:1 make using `sample'
recode _merge (3 = 1 "Sampled") (* = 0 "Not Sampled") , gen(sample)
label var sample "Sampled"
drop _merge

* Check
tab sample
sample-noise.do
* Reproducible setup: data, isid, version, seed
sysuse auto.dta , clear
isid make, sort
version 13.1
set seed 556292 // Timestamp: 2019-02-25 23:30:39 UTC

* Get true population parameter for price mean
sum price
local theMean = `r(mean)'

* Sample 20 units 1000 times and store the mean of [price]
cap mat drop results // Make matrix free
qui forvalues i = 1/1000 {
    preserve
        sample 20 , count   // Remove count for 20%
        sum price           // Calculate sample mean
        * Allow first run and append each estimate
        mat results = nullmat(results) \ [`r(mean)']
    restore
}

* Load the results into memory and graph the distribution
clear
mat colnames results = "price_mean"
svmat results , n(col)
kdensity price_mean , norm xline(`theMean')
to be, and report that as the standard error of our point estimates.
The interpretation of, say, a 95% confidence interval in this context
is that, conditional on our sampling strategy, we would anticipate
that 95% of future samples from the same distribution would lead
to parameter estimates in the indicated range. This approach says
nothing about the truth or falsehood of any hypothesis.
Randomization
randomization-program-1.do
* Define a randomization program
cap prog drop my_randomization
prog def my_randomization

    * Syntax with open options for [ritest]
    syntax , [*]
    cap drop treatment

    * Assign 2/5 to control and 3/5 to treatment
    xtile group = runiform() , n(5)
    recode group (1/2=0 "Control") (3/5=1 "Treatment") , gen(treatment)
    drop group

    * Cleanup
    lab var treatment "Treatment Arm"

end
With this program created and executed, the next part of the
code, shown below, can set up for reproducibility. Then it will call
the randomization program by name, which executes the exact
randomization process we programmed to the data currently loaded
in memory. Having pre-programmed the exact randomization
does two things: it lets us write this next code chunk much more
simply, and it allows us to reuse that precise randomization as
needed. Specifically, the user-written ritest command118 allows

118 http://hesss.org/ritest.pdf
randomization-program-2.do
* Reproducible setup: data, isid, version, seed
sysuse auto.dta , clear
isid make, sort
version 13.1
set seed 107738 // Timestamp: 2019-02-25 23:34:33 UTC

* Call the program
my_randomization
tab treatment

* Show randomization variation with [ritest]
ritest treatment _b[treatment]                       ///
    , samplingprogram(my_randomization) kdensityplot ///
    : reg price treatment
randtreat-strata.do
* Use [randtreat] in randomization program ----------------
cap prog drop my_randomization
prog def my_randomization

    * Syntax with open options for [ritest]
    syntax , [*]
    cap drop treatment
    cap drop strata

    * Create strata indicator
    egen strata = group(sex agegrp) , label
    label var strata "Strata Group"

    * Group 1/5 in control and each treatment
    randtreat,              ///
        generate(treatment) /// New variable name
        multiple(6)         /// 6 arms
        strata(strata)      /// 6 strata
        misfits(global)     /// Randomized altogether

    * Cleanup
    lab var treatment "Treatment Arm"
    lab def treatment 0 "Control" 1 "Treatment 1" 2 "Treatment 2" ///
        3 "Treatment 3" 4 "Treatment 4" 5 "Treatment 5" , replace
    lab val treatment treatment
end // ----------------------------------------------------

* Reproducible setup: data, isid, version, seed
sysuse bpwide.dta , clear
isid patient, sort
version 13.1
set seed 796683 // Timestamp: 2019-02-26 22:14:17 UTC

* Randomize
my_randomization
tab treatment strata
randtreat-clusters.do
* Use [randtreat] in randomization program ----------------
cap prog drop my_randomization
prog def my_randomization

    * Syntax with open options for [ritest]
    syntax , [*]
    cap drop treatment
    cap drop cluster

    * Create cluster indicator
    egen cluster = group(sex agegrp) , label
    label var cluster "Cluster Group"

    * Save data set with all observations
    tempfile ctreat
    save `ctreat' , replace

    * Keep only one from each cluster for randomization
    bysort cluster : keep if _n == 1

    * Group 1/2 in control and treatment in new variable treatment
    randtreat, generate(treatment) multiple(2)

    * Keep only treatment assignment and merge back to all observations
    keep cluster treatment
    merge 1:m cluster using `ctreat' , nogen

    * Cleanup
    lab var treatment "Treatment Arm"
    lab def treatment 0 "Control" 1 "Treatment" , replace
    lab val treatment treatment
end // ----------------------------------------------------

* Reproducible setup: data, isid, version, seed
sysuse bpwide.dta , clear
isid patient, sort
version 13.1
set seed 796683 // Timestamp: 2019-02-26 22:14:17 UTC

* Randomize
my_randomization
tab cluster treatment
Power calculations
minimum-detectable-effect.do
* Simulate power for different treatment effect sizes
clear
set matsize 5000
cap mat drop results
set seed 852526 // Timestamp: 2019-02-26 23:18:42 UTC

* Loop over treatment effect sizes (te)
* of 0 to 0.5 standard deviations
* in increments of 0.05 SDs
qui forval te = 0(0.05)0.5 {
    forval i = 1/100 {  // Loop 100 times
        clear           // New simulation
        set obs 1000    // Set sample size to 1000

        * Randomly assign treatment
        * Here you could call a randomization program instead:
        gen t = (rnormal() > 0)

        * Simulate assumed effect sizes
        gen e = rnormal()       // Include a normal error term
        gen y = 1 + `te'*t + e  // Set functional form for DGP

        * Does regression detect an effect in this assignment?
        reg y t

        * Store the result
        mat a = r(table)    // Reg results
        mat a = a[1...,1]   // t parameters
        mat results = nullmat(results) \ a' , [`te'] // First run and accumulate
    } // End iteration loop
} // End incrementing effect size

* Load stored results into data
clear
svmat results , n(col)

* Analyze all the regressions we ran against power 80%
gen sig = (pvalue <= 0.05) // Flag significant runs

* Proportion of significant results in each effect size group (80% power)
graph bar sig , over(c10) yline(0.8)
minimum-sample-size.do
* Power for varying sample size & a fixed treatment effect
clear
set matsize 5000
cap mat drop results
set seed 510402 // Timestamp: 2019-02-26 23:19:00 UTC

* Loop over sample sizes (ss) 100 to 1000, increments of 100
qui forval ss = 100(100)1000 {
    forval i = 1/100 {  // 100 iterations per each
        clear
        set obs `ss'    // Simulation with new sample size

        * Randomly assign treatment
        * Here you could call a randomization program instead
        gen t = (rnormal() > 0)

        * Simulate assumed effect size: here 0.2SD
        gen e = rnormal()       // Normal error term
        gen y = 1 + 0.2*t + e   // Functional form for DGP

        * Does regression detect an effect in this assignment?
        reg y t

        * Store the result
        mat a = r(table)    // Reg results
        mat a = a[1...,1]   // t parameters
        mat results = nullmat(results) \ a' , [`ss'] // First run and accumulate
    } // End iteration loop
} // End incrementing sample size

* Load stored results into data
clear
svmat results , n(col)

* Analyze all the regressions we ran against power 80%
gen sig = (pvalue <= 0.05) // Flag significant runs

* Proportion of significant results in each sample size group (80% power)
graph bar sig , over(c10) yline(0.8)
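For simple two-arm comparisons, simulation results like these can be cross-checked against Stata's built-in power command; the minimal sketch below mirrors the assumptions used above (a 0.2 SD effect, unit standard deviation, 80% power) and reports the required sample size analytically.

* Analytical cross-check of the simulation assumptions
power twomeans 0 0.2 , sd(1) power(0.8)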
Most data collection is now done using digital data entry using tools that are specially designed for
surveys. These tools, called computer-assisted personal interviewing (CAPI) software, provide a wide
range of features designed to make implementing even highly complex surveys easy, scalable, and secure.
However, these are not fully automatic: you still need to actively design and manage the survey. Each
software has specific practices that you need to follow to enable features such as Stata-compatibility and
data encryption.
You can work in any software you like, and this guide will present tools and workflows that are
primarily conceptual: this chapter should provide a motivation for planning data structure during survey
design, developing surveys that are easy to control for quality and security, and having proper file storage
ready for sensitive PII data.
and relate directly to the data that will be used later, both the field team and the data team should collaborate to make sure that the survey suits all needs.134

Generally, this collaboration means building the experimental design fundamentally into the structure of the survey. In addition to having prepared a unique anonymous ID variable using the master data, that ID should be built into confirmation checks in the survey form. When ID matching and tracking across rounds is essential, the survey should be prepared to verify new data against preloaded data from master records or from other rounds. Extensive tracking sections – in which reasons for attrition, treatment contamination, and loss to follow-up can be documented – are essential data components for completing CONSORT135 records.136

Questionnaire design137 is the first task where the data team and the field team must collaborate on data structure.138 Questionnaire

134 Krosnick, J. A. (2018). Questionnaire design. In The Palgrave Handbook of Survey Research, pages 439–455. Springer.
135 CONSORT: a standardized system for reporting enrollment, intervention allocation, follow-up, and data analysis through the phases of a randomized trial of two groups.
136 Begg, C., Cho, M., Eastwood, S., Horton, R., Moher, D., Olkin, I., Pitkin, R., Rennie, D., Schulz, K. F., Simel, D., et al. (1996). Improving the quality of reporting of randomized controlled trials: The CONSORT statement. JAMA, 276(8):637–639.
137 https://dimewiki.worldbank.org/wiki/Questionnaire_Design
138 https://dimewiki.worldbank.org/wiki/Preparing_for_Field_Data_Collection
survey to which it corresponds. Any time you access the data – either when viewing it in the browser or syncing it to your computer – you will be asked to provide this keyfile. The details differ between software, but typically you would copy that keyfile from where it is stored (for example LastPass) to your desktop, point to it, and the rest is automatic. After each time you use the keyfile, delete it from your desktop, but not from your password manager.
Finally, you should ensure that all teams take basic precautions to
ensure the security of data, as most problems are due to human error.
Most importantly, all computers, tablets, and accounts used must
have a logon password associated with them. Ideally, the machine
hard drives themselves should also be encrypted. This policy should
also be applied to physical data storage such as flash drives and hard
drives; similarly, files sent to the field containing PII data such as
the sampling list should at least be password-protected. This can
be done using a zip-file creator. LastPass can also be used to share passwords securely, so that you do not need to send passwords over email.
This step significantly mitigates the risk in case there is a security
breach such as loss, theft, hacking, or a virus, and adds very little
hassle to utilization.
While the team is in the field, the research assistant and field coordinator
will be jointly responsible for making sure that the survey is progressing
correctly,146 that the collected data matches the survey sample, and
that errors and duplicate observations are resolved quickly so that
the field team can make corrections.147 Modern survey software
makes it relatively easy to catch issues in individual surveys, using
a combination of built-in features such as hard constraints on
answer ranges and soft confirmations or validation questions.
146 https://dimewiki.worldbank.org/wiki/Data_Quality_Assurance_Plan
147 https://dimewiki.worldbank.org/wiki/Duplicates_and_Survey_Logs
These features allow you to spend more time looking for issues
that the software cannot check automatically.148 These will mainly
be suspicious patterns across multiple responses or across a group
of surveys, rather than errors in any single response field (those can
often be flagged by the questionnaire software): enumerators who
are taking either too long or not long enough to complete their work;
"difficult" groups of respondents who are systematically incomplete;
and systematic response errors. These checks typically take two main
forms: high-frequency checks (HFCs) and back-checks.
148 https://dimewiki.worldbank.org/wiki/Monitoring_Data_Quality
High-frequency checks are carried out on the data side.149 First,
149 https://github.com/PovertyAction/
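The exact checks depend on the design of each survey, but as a minimal sketch, a daily high-frequency-check do-file might include tests along the following lines; the variable names (hhid, submissiondate, income_month, duration) are illustrative placeholders rather than part of any particular survey.

* Assumes the latest survey submissions are already loaded in memory

* Flag duplicate household IDs so the field team can resolve them
duplicates tag hhid , generate(duplicate_id)
list hhid submissiondate if (duplicate_id > 0)

* Flag values outside the range allowed by the questionnaire
list hhid income_month if (income_month < 0) & !missing(income_month)

* Flag suspiciously short interviews (for example, under 15 minutes)
list hhid duration if (duration < 15) & !missing(duration)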
In this section, you finally get your hands on some data! What do
we do with it? Data handling is one of the biggest “black boxes”
in primary research – it always gets done, but teams have wildly
different approaches for actually doing it. This section breaks the
process into key conceptual steps and provides at least one practical
solution for each. Initial receipt of data will proceed as follows: the
data will be downloaded, and a “gold master” copy of the raw data
should be permanently stored in a secure location. Then, a “master”
copy of the data is placed into an encrypted location that will remain
accessible on disk and backed up. This handling satisfies the rule of
three: there are two on-site copies of the data and one off-site copy,
so the data can never be lost in case of hardware failure. For this
step, the remote location can take a variety of forms: the cheapest is
a long-term cloud storage service such as Amazon Web Services or
Data analysis is hard. Making sense of a dataset in such a way that makes a substantial contribution to
scientific knowledge requires a mix of subject expertise, programming skills, and statistical and econometric
knowledge. The process of data analysis is therefore typically a back-and-forth discussion between the
various people who have differing experiences, perspectives, and research interests. The research assistant
usually ends up being the fulcrum for this discussion, and has to transfer and translate results among
people with a wide range of technical capabilities while making sure that code and outputs do not become
tangled and lost over time (typically months or years).
Organization is the key to this task. The structure of files needs to be well-organized, so that any
material can be found when it is needed. Data structures need to be organized, so that the various steps
in creating datasets can always be traced and revised without massive effort. The structure of version
histories and backups needs to be organized, so that different workstreams can exist simultaneously and
experimental analyses can be tried without a complex workflow. Finally, the outputs need to be organized,
so that it is clear what results go with what analyses, and that each individual output is a readable element
in its own right. This chapter outlines how to stay organized so that you and the team can focus on getting
the work right rather than trying to understand what you did in the past.
analysis to be executed from one do-file that runs all other files in
the correct order, serving as a human-readable map to the file and
folder structure used for all the code. By reading the master do-file,
anyone not familiar with the project should be able to understand
what the main tasks are, which do-files execute those tasks, and
where in the project folder they can be found.
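As a minimal sketch, a master do-file might look like the following; the folder globals and do-file names are hypothetical placeholders for a real project structure.

* Set the folder globals for the whole project in one place
global myProject "C:/Users/username/Documents/MyProject"
global code      "${myProject}/code"

* Run every stage of the project in the correct order
do "${code}/cleaning.do"      // Import and de-identify the raw data
do "${code}/construction.do"  // Construct the analysis variables
do "${code}/analysis.do"      // Produce the tables and figures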
Raw data should contain only materials that are received directly
from the field. These datasets will invariably come in a host of
file formats and nearly always contain personally-identifying
information. These should be retained in the raw data folder exactly
as they were received, including the precise filename that was submitted,
along with detailed documentation about the source and contents
of each of the files. This data must be encrypted if it is shared in an
insecure fashion, and it must be backed up in a secure offsite location.
Everything else can be replaced, but raw data cannot. Therefore, raw
data should never be interacted with directly.
Instead, the first step upon receipt of data is de-identification.158
158 https://dimewiki.worldbank.org/wiki/De-identification
There will be a code folder and a corresponding data folder so that
it is clear how de-identification is done and where it lives. Typically,
this process only involves stripping identifying fields from the raw
data; naming, formatting, and optimizing the file for storage; and
placing it as a .dta or other data-format file into the de-identified
data folder. The underlying data structure is unchanged: it should
contain only fields that were collected in the field, without any
modifications to the responses collected there. This creates a
survey-based version of the file that can be shared among the team
without fear of data corruption or exposure. Only a core set of team
members will have access to the underlying raw data necessary to
re-generate these datasets, since the encrypted raw data will be
password-protected. The de-identified data will therefore be the
underlying source for all cleaned and constructed data. It will also
become the template dataset for final public release.
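As a minimal sketch, and assuming the folder globals are defined in the master do-file, a de-identification do-file might look like the following; the variable, file, and folder names are illustrative placeholders.

* Load the raw data from the encrypted folder
use "${encrypted}/raw_survey_data.dta" , clear

* Strip the directly identifying fields collected in the field
drop respondent_name phone_number gps_latitude gps_longitude

* Optimize the file for storage and save it to the de-identified folder
compress
save "${deidentified}/survey_deidentified.dta" , replace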
continue to keep the data-construction files open, and add to them as
166 http://scunning.com/mixtape.html
Visualizing data
Increasingly, research assistants are relied on to manage some or all of the publication process. This can
include managing the choice of software, coordinating referencing and bibliography, tracking changes
across various authors and versions, and preparing final reports or papers for release or submission.
Modern software tools can make a lot of these processes easier. Unfortunately there is some learning curve,
particularly for lead authors who have been publishing for a long time. This chapter suggests some tools
and processes that can make writing and publishing in a team significantly easier. It will provide resources
to judge how best to adapt your team to the tools you agree upon, since all teams vary in composition and
technical experience.
Ideally, your team will spend as little time as possible fussing with the technical requirements of
publication. It is in nobody’s interest for a skilled and busy research assistant to spend days re-numbering
references (and it can take days) if a small amount of up-front effort could automate the task. However,
experienced academics will likely have a workflow with which they are already comfortable, and since
they have worked with many others in the past, that workflow is likely to be the least-common-denominator:
Microsoft Word with tracked changes. This chapter will show you how you can avoid at least some of the
pain of Microsoft Word, while still providing materials in the format that co-authors prefer and journals
request.
The gold standard for academic writing is LaTeX.177 LaTeX allows
automatically-organized elements such as titles, sections, and
bibliographies, imports tables and figures in a dynamic fashion, and
can be version-controlled using Git. Unfortunately, LaTeX can be a
challenge to set up and use at first, particularly for people who are
unfamiliar with plaintext, code, or file management. LaTeX requires
that all formatting be done in its special code language, and it is not
particularly informative when you do something wrong. This can be
off-putting very quickly for people who simply want to get to
writing, like lead authors. Therefore, if we want to take advantage of
the features of LaTeX without getting stuck in the weeds, we will
need to adopt a few tools and tricks to make it effective.
177 https://www.maths.tcd.ie/~dwilkins/LaTeXPrimer/GSWLaTeX.pdf
The first is choice of software. The easiest way for someone new
to LaTeX to be able to "just write" is often the web-based Overleaf
suite.178 Overleaf offers a rich text editor that behaves pretty similarly
to familiar tools like Word. TeXstudio179 and atom-latex180 are two
popular desktop-based tools for writing LaTeX; they allow more
advanced integration with Git, among other advantages, but the
178 https://www.overleaf.com
179 https://www.texstudio.org
180 https://atom.io/packages/atom-latex
sample.bib
1 @article{flom2005latex,
2 title={LATEX for academics and researchers who (think they) don't need it},
3 author={Flom, Peter},
4 journal={The PracTEX Journal},
5 volume={4},
6 year={2005},
7 publisher={Citeseer}
8 }
citation.tex
1 With these tools, you can ensure that co-authors are writing
2 in a format you can manage and control.\cite{flom2005latex}
With these tools, you can ensure that co-authors are writing in a
format you can manage and control.182 The purpose of this setup,
just like with other synced folders, is to avoid there ever being more
than one master copy of the document. This means that people can
edit simultaneously without fear of conflicts, and it is never necessary
to manually resolve differences in the document. Finally, LaTeX has
one more useful trick: if you download a journal styler from the
Citation Styles Library183 and use pandoc,184 you can translate the
raw document into Word by running the following code from the
command line:
182 Flom, P. (2005). LaTeX for academics and researchers who (think they) don't need it. The PracTEX Journal, 4.
183 https://github.com/citation-style-language/styles
184 http://pandoc.org/
pandoc.sh
1 pandoc -s -o main.docx main.tex --bibliography sample.bib --csl=[style].csl
Data and code should always be released with any publication.185
Many journals and funders have strict open data policies, and
providing these materials to other researchers allows them to
evaluate the credibility of your work as well as to re-use your
materials for further research.186 If you have followed the steps in
this book carefully, you will find that this is a very easy requirement
to fulfill.187 You will already have a publishable (either de-identified
or constructed) version of your research dataset. You will already
have your analysis code well-ordered, and you won't have any junk
lying around for cleanup. You will have written your code so that
others can read it, and you will have documentation for all your
work, as well as a well-structured directory containing the code and
data.
185 https://dimewiki.worldbank.org/wiki/Publishing_Data
186 https://dimewiki.worldbank.org/wiki/Exporting_Analysis
187 https://www.bitss.org/2016/05/23/out-of-the-file-drawer-tips-on-prepping-data-for-publication/
If you are at this stage, all you need to do is find a place to publish
your work! GitHub provides one of the easiest solutions here, since it
is completely free for static, public projects and it is straightforward
to simply upload a fixed directory and obtain a permanent URL for
it. The Open Science Framework also provides a good resource, as
does ResearchGate (which can also assign a permanent digital object
identifier link for your work). Any of these locations is acceptable –
the main requirement is that the system can handle the structured
directory that you are submitting, and that it can provide a stable,
Most academic programs that prepare students for a career in the type of work discussed in this book
spend a disproportionately small amount of time teaching their students coding skills, in relation to the
share of their professional time they will spend writing code in their first years after graduating. Recent
Masters' program graduates who have joined our team tend to have very good knowledge of the theory
of our trade, but often require a lot of training in its practical skills. To us, it is like hiring architects
who can sketch, describe, and discuss the concepts and requirements of a new building very well, but do
not have the technical skill set to contribute to a blueprint using professional standards that can be
used and understood by other professionals during construction. The reasons for this are probably a topic
for another book, but in today’s data-driven world, people working in quantitative economics research
must be proficient programmers, and that includes more than being able to compute the correct numbers.
This appendix first has a short section with instructions on how to access and use the code shared in
this book. The second section contains the current DIME Analytics style guide for Stata code. Widely
accepted and used style guides are common in most programming languages, and we think that using
such a style guide greatly improves the quality of research projects coded in Stata. We hope that this guide
can help to increase the emphasis in the Stata community on using, improving, sharing and standardizing
code style. Style guides are the most important tool in how you, like an architect, draw a blueprint that can
be understood and used by everyone in your trade.
You can access the raw code used in examples in this book in many
ways. We use GitHub to version control everything in this book, the
code included. To see the code on GitHub, go to: https://github.
com/worldbank/d4di/tree/master/code. If you are familiar with
GitHub you can fork the repository and clone your fork. We only use
Stata’s built-in datasets in our code examples, so you do not need to
download any data from anywhere. If you have Stata installed on
your computer, then you will have the data files used in the code.
A less technical way to access the code is to click the individual
file in the URL above, then click the button that says Raw. You
will then get to a page that looks like the one at: https://raw.
githubusercontent.com/worldbank/d4di/master/code/code.do.
There, you can copy the code from your browser window to your do-
file editor with the formatting intact. This method is only practical
for a single file at a time. If you want to download all code used in
this book, you can do that at: https://github.com/worldbank/d4di/
archive/master.zip. That link offers a .zip file download with all
the content used in writing this book, including the LATEX code used
for the book itself. After extracting the .zip-file you will find all the
code in a folder called /code/.
Regardless of whether you are new to Stata or have used it for
decades, you will always run into commands that you have not seen
before or whose behavior you do not remember. Every time that
happens, you should look that command up in its helpfile. For some
reason, we often encounter the notion that helpfiles are only for
beginners. We could not disagree more: the only way to get better at
Stata is to constantly read helpfiles. So if there is a command that
you do not understand in any of our code examples, for example
isid, then type help isid, and the helpfile for the command isid
will open.
We cannot emphasize too much how important we think it is that
you get into the habit of reading helpfiles.
Sometimes, you will encounter code employing user-written
commands, and you will not be able to read their helpfiles until you
have installed the commands. Two examples of these in our code are
randtreat and ieboilstart. The most common place to distribute
user-written commands for Stata is the Boston College Statistical
Software Components (SSC) archive. In our code examples, we only
use either Stata's built-in commands or commands available from the
SSC archive. So, if your installation of Stata does not recognize
a command in our code, for example randtreat, then type ssc
install randtreat in Stata.
Some commands on SSC are distributed in packages, for example
ieboilstart, meaning that you will not be able to install it using ssc
install ieboilstart. If you try, Stata will suggest that you instead
use findit ieboilstart, which will search SSC (among other places)
to see if there is a package that contains a command called ieboilstart.
Stata will find ieboilstart in the package ietoolkit, so you will
then type ssc install ietoolkit instead.
We understand that this can be confusing the first time you work
this way, but it is the best way to set up your Stata installation to
benefit from the work that other people have made publicly
available, and once you are used to installing commands like this it
will not be confusing at all. Furthermore, code that uses user-written
commands is best written so that it installs those commands at the
beginning of the master do-file, so that the user does not have to
search for packages manually.
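One common pattern, sketched below, is to list the user-written commands a project requires at the top of the master do-file and install any that are missing from SSC; the command list here is only an example.

* Install each required user-written command if it is not already installed
local user_commands ietoolkit iefieldkit randtreat
foreach command of local user_commands {
    capture which `command'
    if (_rc == 111) ssc install `command'
}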
Commenting code
Comments do not change the output of code, but without them, your
code will not be accessible to your colleagues. It will also take you a
much longer time to edit code you wrote in the past if you did not
comment it well. So, comment a lot: do not only write what your
code is doing but also why you wrote it like that.
There are three types of comments in Stata and they have different
purposes:
1. /* */ indicates narrative, multi-line comments at the beginning of
files or sections.
2. * indicates a comment that documents a task covering at most a few
lines of code.
3. // indicates an inline comment that documents a single line, written
after the code on that line.
stata-comments.do
1 /*
2 This is a do-file with examples of comments in Stata. This
3 type of comment is used to document all of the do-file or a large
4 section of it
5 */
6
7 * Standardize settings (This comment is used to document a task
8 * covering at maximum a few lines of code)
9 ieboilstart, version(13.1)
10 `r(version)'
11
12 * Open the dataset
13 sysuse auto.dta // Built in dataset (This comment is used to document a single line)
Abbreviating commands
Stata commands can often be abbreviated in the code. In the helpfiles
you can tell whether a command can be abbreviated, indicated by the
part of the command name that is underlined in the syntax section at
the top of the helpfile.
Abbreviation Command
tw twoway
di display
gen generate
mat matrix
reg regress
lab label
sum summarize
tab tabulate
bys bysort
qui quietly
cap capture
forv forvalues
prog program
Writing loops
In Stata examples and in other programming languages, it is common
for the local macro created by foreach or forvalues to be named
something as simple as i or j. In Stata, however, loops generally
index a real object, and looping commands should name that index
descriptively. One-letter indices are acceptable only for general
examples; for looping through iterations with i; and for looping
across matrices with i, j. Other typical index names are obs or var
when looping over observations or variables, respectively. But since
Stata does not have arrays, such abstract syntax should not be used
in Stata code otherwise. Instead, index names should describe what
the code is looping over, for example household members, crops, or
medicines. This makes code much more readable, particularly in
nested loops.
stata-loops.do
1 * This is BAD
2 foreach i in potato cassava maize {
3 }
4
5 * These are GOOD
6 foreach crop in potato cassava maize {
7 }
8
9 * or
10
11 local crops potato cassava maize
12 * Loop over crops
13 foreach crop of local crops {
14 * Loop over plot number
15 forvalues plot_num = 1/10 {
16 }
17 }
Using whitespace
In Stata, one space or many spaces does not make a difference to
how the code runs, and this can be used to make the code much more
readable. In the example below the exact same code is written twice,
but in the good example whitespace is used to signal to the reader
that the central object of this segment of code is the variable
employed. Organizing the code like this makes it much quicker to
read, and small typos stand out much more, making them easier to
spot.
We are all very well trained in using whitespace in software
like PowerPoint and Excel: we would never present a PowerPoint
presentation where the text does not align or submit an Excel table
with unstructured rows and columns, and the same principles apply
to coding.
stata-whitespace-columns.do
1 * This is BAD
2 * Create dummy for being employed
3 generate employed = 1
4 replace employed = 0 if (_merge == 2)
5 label variable employed "Person exists in employment data"
6 label define yesno 1 "Yes" 0 "No"
7 label value employed yesno
8
9 * This is GOOD
10 * Create dummy for being employed
11 generate employed = 1
12 replace  employed = 0 if (_merge == 2)
13 label variable employed "Person exists in employment data"
14 label define   yesno    1 "Yes" 0 "No"
15 label value    employed yesno
stata-whitespace-indentation.do
1 * This is GOOD
2 * Loop over crops
3 foreach crop in potato cassava maize {
4     * Loop over plot number
5     forvalues plot_num = 1/10 {
6         gen crop_`crop'_`plot_num' = "`crop'"
7     }
8 }
9
10 * or
11 local sampleSize = `c(N)'
12 if (`sampleSize' <= 100) {
13     gen use_sample = 0
14 }
15 else {
16     gen use_sample = 1
17 }
18
19 * This is BAD
20 * Loop over crops
21 foreach crop in potato cassava maize {
22 * Loop over plot number
23 forvalues plot_num = 1/10 {
24 gen crop_`crop'_`plot_num' = "`crop'"
25 }
26 }
27
28 * or
29 local sampleSize = `c(N)'
30 if (`sampleSize' <= 100) {
31 gen use_sample = 0
32 }
33 else {
34 gen use_sample = 1
35 }
stata-conditional-expressions1.do
1 * These examples are GOOD
2 replace gender_string = "Female" if (gender == 1)
3 replace gender_string = "Male" if ((gender != 1) & !missing(gender))
4
5 * These examples are BAD
6 replace gender_string = "Female" if gender == 1
7 replace gender_string = "Male" if (gender ~= 1)
stata-conditional-expressions2.do
1 local sampleSize = _N // Get the number of observations in dataset
2
3 * This example is GOOD
4 if (`sampleSize' <= 100) {
5 }
6 else {
7 }
8
9 * This example is BAD
10 if (`sampleSize' <= 100) {
11 }
12 if (`sampleSize' > 100) {
13 }
Using macros
Stata has several types of macros where numbers or text can be
stored temporarily, but the two most common macros are local and
global. Locals should always be the default type and globals should
only be used when the information stored is used in a different do-
file. Globals are error-prone since they are active as long as Stata is
open, which creates a risk that a global from one project is incorrectly
used in another, so only use globals where they are necessary. Our
recommendation is that globals should only be defined in the master
do-file. All globals should be referenced using both the dollar
sign and the curly brackets around their name; otherwise, they can
cause readability issues when the endpoint of the macro name is
unclear.
There are several naming conventions you can use for macros with
long or multi-word names. Which one you use is not as important
as whether you and your team are consistent in how you name them.
You can use all lower case (mymacro), underscores (my_macro), or
“camel case” (myMacro), as long as you are consistent.
stata-macros.do
1 * Define a local and a global using the same name convention
2 local myLocal "A string local"
3 global myGlobal "A string global"
4
5 * Reference the local and the global macros
6 display "`myLocal'"
7 display "${myGlobal}"
8
9 * Escape character. If backslashes are used just before a local
10 * or a global then two backslashes must be used
11 local myFolderLocal "Documents"
12 global myFolderGlobal "Documents"
13
14 * These are BAD
15 display "C:\Users\username\`myFolderLocal'"
16 display "C:\Users\username\${myFolderGlobal}"
17
18 * These are GOOD
19 display "C:\Users\username\\`myFolderLocal'"
20 display "C:\Users\username\\${myFolderGlobal}"
in practice as setting cd, as all new users should only have to change
these file path globals in one location. But dynamic absolute file
paths are a better practice: if the global names are set uniquely,
there is no risk that files are saved in the incorrect project folder, and
you can create multiple folder globals instead of just one location as
with cd.
stata-filepaths.do
1 * Dynamic, absolute file paths
2
3 * Dynamic (and absolute) - GOOD
4 global myDocs "C:/Users/username/Documents"
5 global myProject "${myDocs}/MyProject"
6 use "${myProject}/MyDataset.dta"
7
8 * Relative and absolute file paths
9
10 * Relative - BAD
11 cd "C:/Users/username/Documents/MyProject"
12 use MyDataset.dta
13
14 * Absolute but not dynamic - BAD
15 use "C:/Users/username/Documents/MyProject/MyDataset.dta"
stata-linebreak.do
1 * This is GOOD
2 graph hbar ///
3 invil if (priv == 1) ///
4 , over(statename, sort(1) descending) blabel(bar, format(%9.0f)) ///
5 ylab(0 "0%" 25 "25%" 50 "50%" 75 "75%" 100 "100%") ///
6 ytit("Share of private primary care visits made in own village")
7
8 * This is BAD
9 #delimit ;
10 graph hbar
11 invil if (priv == 1)
12 , over(statename, sort(1) descending) blabel(bar, format(%9.0f))
13 ylab(0 "0%" 25 "25%" 50 "50%" 75 "75%" 100 "100%")
14 ytit("Share of private primary care visits made in own village");
15 #delimit cr
stata-boilerplate.do
1 * This is GOOD
2 ieboilstart, version(13.1)
3 `r(version)'
4
5 * This is GOOD but less GOOD
6 set more off
7 set maxvar 10000
8 version 13.1
Saving data
Similarly to boilerplate code, there are good practices that should be
followed before saving the data set. These are sorting and ordering
the data set, dropping intermediate variables that are not needed,
and compressing the data set to save disk space and network bandwidth.
If there is an ID variable or a set of ID variables, then the code
should also test that they uniquely and fully identify the
data set.193 ID variables are also perfect variables to sort on, and to
order leftmost in the data set.
193 https://dimewiki.worldbank.org/wiki/ID_Variable_Properties
The command compress makes the data set smaller in terms of
memory usage without ever losing any information. It optimizes the
storage types for all variables and therefore makes it smaller on your
computer and faster to send over a network or the internet.
stata-before-saving.do
1 * If the data set has ID variables, create a local and test
2 * if they are fully and uniquely identifying the observations.
3 local idvars household_ID household_member year
4 isid `idvars'
5
6 * Sort and order on the idvars (or any other variables if there are no ID variables)
7 sort `idvars'
8 order * , seq // Place all variables in alphanumeric order (optional but useful)
9 order `idvars' , first // Make sure the idvars are the leftmost vars when browsing
10
11 * Drop intermediate variables no longer needed
12
13 * Optimize disk space
14 compress
15
16 * Save the data set
17 save "${myProject}/myDataFile.dta" , replace // The folder global is set in master do-file
18 use "${myProject}/myDataFile.dta" , clear // It is useful to be able to recall the data quickly
Bibliography
Angrist, J., Azoulay, P., Ellison, G., Hill, R., and Lu, S. F. (2017).
Economic research evolves: Fields and styles. American Economic
Review, 107(5):293–97.
Iacus, S. M., King, G., and Porro, G. (2012). Causal inference without
balance checking: Coarsened exact matching. Political Analysis,
20(1):1–24.
Orozco, V., Bontemps, C., Maigne, E., Piguet, V., Hofstetter, A.,
Lacroix, A., Levert, F., Rousselle, J.-M., et al. (2018). How to make a
pie? Reproducible research for empirical economics & econometrics.
Toulouse School of Economics Working Paper, 933.