Data Management in Stata
__________
* Prepared for the 2013 Applied Methods Summer Series, Department of Political Science,
University of North Texas.
INTRODUCTION
The purpose of this guide is to provide graduate students with a more complete understanding of
issues related to good data management in Stata. To do this, we will discuss a number of topics,
ranging from the relatively simple (e.g., calling data into Stata) to the somewhat advanced (e.g.,
transforming the unit of analysis in a dataset). I have made these notes, as well as the dataset and do
files we will be using, available on my webpage (http://www.psci.unt.edu/~pmcollins/courses.htm).
Note that our purpose here is not to deal with statistical models; instead we will focus solely on data
management issues.
Prior to working with a dataset in Stata, there are two rules you should always follow:
1) Know your data. This goes for working with any data, whether it is qualitative or
quantitative, in Stata, SPSS, R, SAS, etc. Before you start running models, you should know
where your dataset came from, if any changes have been made to it, whether it has missing
values and why, and what the unit of analysis is. By and large, this can be accomplished by
reading the codebook corresponding to your data and visually inspecting the data.
2) Use do files. Whenever you work in Stata and you are transforming data or running models
on data, you should be working through a do file. While interactivity has its place, the
profession demands replicability and one easy way to meet this standard is to catalog all of
the changes you make to your data so your analysis can be replicated by others (and yourself
after you forget what you did). I also recommend that you keep at least one copy of your
data in its original format. Because the do file contains the commands for any changes you
make, this shouldn't be a problem. Keeping at least one copy of the data in its original
format is useful when you want to revisit an analysis on those data (or start a new project
using the data).
a. Note: Many researchers also use log files, which catalog all of the information that
crosses the screen. While I am generally ambivalent about their use, a good case can
be made that they are beneficial.
For the purposes of this guide, we will be working with Harold Spaeth's Original United States Supreme
Court Judicial Database, which contains information on the U.S. Supreme Court for the 1953-2004
terms. However, it is important to note that the discussion here is applicable to virtually any dataset.
All you have to do is retrofit the commands for use with your data and you are in business.
Finally, I also want to emphasize that it is useful to work with data conversion software, such as
STAT/TRANSFER, for transferring data from one format to another (e.g., from SPSS to Stata).
String Variables: the type of string is represented by the prefix str followed by the number of
characters (i.e., letters and/or numbers) in the longest value of the variable. So, if you have a
variable whose longest value appears in the data as jerrygarcia, it is a str11 variable.
Identifying the variables that are contained in string format is very important, particularly if they are
variables of interest, because Stata cannot use these variables in their original form to perform
statistical analyses.
Because led is a variable that is in string format, and because it contains information on both the
volume and page number of each case, we have to make two transformations to the variable in order
to get Stata to run summary statistics on the variable.
First, let's split the variable up.
split led, p("/")
This command creates two new variables. led1 is the volume number for each case (the numbers
that precede the slash), while led2 is the page number for each case (the numbers that follow the
slash).
Note that, because these are string variables, we need to convert them to numeric to get Stata to
show us summary statistics.
destring led1, generate(ledvolume)
destring led2, generate(ledpage)
These commands create two new numeric variables, ledvolume and ledpage, thus allowing us to
examine summary statistics for these variables (e.g., with sum or tab).
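The two steps above can be tried on a small toy dataset. This is only a sketch: the three citations below are made up for illustration and are not taken from the Spaeth data.

```stata
* Toy data: hypothetical citations in "volume/page" form
clear
input str10 led
"98/873"
"100/1012"
"157/1"
end

* Split on the slash: creates the string variables led1 (volume) and led2 (page)
split led, parse("/")

* Convert the pieces to numeric, keeping the string originals
destring led1, generate(ledvolume)
destring led2, generate(ledpage)

* Summary statistics now work because the new variables are numeric
summarize ledvolume ledpage
```

If you do not need the intermediate string variables, split also accepts a destring option that converts the new variables in the same step.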
These commands have a variety of applications, including splitting up variables that contain a
person's first and last names into two new variables (i.e., first name, last name), as well as working
with data that, for whatever reason, were entered as a string variable but are numeric in form.
Stata stores missing observations of numeric variables as extremely large positive values. If I just told
Stata to score newvalue a 3 if value was greater than 8, it would recode the single missing
observation of the variable value as a 3, which is not necessarily the issue area the case involved.
Accordingly, you have to be very careful when using the greater than operator (>). If the variable you
are transforming has missing values, you should always remember to tell Stata not to recode missing
values by adding the less than missing condition (<.).
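To see the problem concretely, here is a minimal sketch on made-up data. The variable names value and newvalue follow the discussion above; the variable unsafe and the five observations are hypothetical.

```stata
* Toy data: the last observation of value is missing
clear
input value
1
8
9
13
.
end

* Unsafe: missing (.) counts as larger than any number, so it is recoded too
generate unsafe = 3 if value > 8

* Safe: "& value < ." excludes the missing observation from the recode
generate newvalue = 3 if value > 8 & value < .

list value unsafe newvalue
```

In the listing, the missing observation is scored 3 in unsafe but stays missing in newvalue.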
To identify how many missing values a particular variable contains, enter the following
command:
egen missvalue=rowmiss(value)
This creates a new variable, missvalue, that is scored 1 for missing observations of the variable
value and 0 for non-missing observations. If we use the tab command, we can see that there is 1
missing observation of the value variable:
tab missvalue
  missvalue |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |     12,576       99.99       99.99
          1 |          1        0.01      100.00
------------+-----------------------------------
      Total |     12,577      100.00
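As an alternative to egen, the built-in missing() function gives the same information for a single variable. The sketch below uses made-up data, and missflag is a hypothetical variable name.

```stata
* Toy data with one missing observation
clear
input value
2
5
.
8
end

* Count the observations with a missing value
count if missing(value)

* Flag them: missing() returns 1 if value is missing and 0 otherwise
generate missflag = missing(value)
tabulate missflag
```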
Having figured out how to find missing observations, we want to label our new variable.
label variable newvalue "Recoded Issue Area"
label define newvalue 1 "Civil Rights and Liberties" 2 "Economics" 3 "Other"
label value newvalue newvalue
Another way to create variables in Stata is to use the not operator (~). That is, instead of specifying
the values of an existing variable, we can create a new variable based on values that we do not want
this variable to reflect. For example, let's create a variable that indicates whether the case involved a
privacy issue.
generate privacy=.
* value<. keeps the missing observation of value from being scored 0
replace privacy=0 if value~=5 & value<.
replace privacy=1 if value==5
label variable privacy "Case Involved a Privacy Issue"
label define privacy 0 "No" 1 "Yes"
label value privacy privacy
Creating a new variable based on a string variable is just as easy, although it requires slightly different
command language. Let's say we want a variable that indicates whether Justice Douglas authored the
majority opinion; this information can be found in the mow variable.
generate douglasopinion=0
replace douglasopinion=1 if mow=="DOUG"
Creating a new variable from an existing string variable differs from creating a new variable from a
numeric variable in two ways. First, you have to use quotes to identify the values of the string
variable that will be represented by numbers in the new variable. Second, you have to be very
attentive to spelling and capitalization with regard to the values of the string variable you are
recoding.
Finally, we want to label our new variable.
label variable douglasopinion "Justice Douglas Authored Majority Opinion"
label define douglasopinion 1 "Yes" 0 "No"
label value douglasopinion douglasopinion
SIDENOTE: Many of the commands to create new variables used above rely on a conditional
statement (an if qualifier). Here is a useful list of commonly used conditional operators in Stata.
<            less than
<=           less than or equal to
==           equal to
>            greater than
>=           greater than or equal to
~=           not equal to
|            or (you can't spell out or)
&            and (you can't spell out and)
~            not
e(sample)    this returns only values from the estimated sample (the model most recently
             estimated)
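The operators in this list can be combined in a single if qualifier. The following sketch uses a made-up four-case dataset; the variable names term and value follow the guide, but the observations are hypothetical.

```stata
* Toy data: one missing observation of value
clear
input term value
1994 2
1995 5
1995 .
1996 8
end

* & (and): cases with value 5 decided in the 1995 term
count if value == 5 & term == 1995

* | (or): cases from either the 1994 or the 1996 term
count if term == 1994 | term == 1996

* ~= (not equal) with < . so the missing observation is not counted
count if value ~= 5 & value < .
```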
CONCATENATING VARIABLES
Above, I covered how to split up and destring variables. It is often equally important to know how
to create a new string variable based on two (or more) existing variables. I find this to be particularly
useful for merging data.
Let's create a new variable that consists of each case's Lawyers' Edition to U.S. Reports citation and
the case's docket number.
egen leddocket=concat(led docket), punct(" ")
This command creates the new variable, which combines the led variable and the docket variable,
separated by a space. If you do not want the new variable to separate the two variables by a space,
you can leave out the , punct(" ") option. Alternatively, if you want the original variable values
to be separated by a comma, you can replace the , punct(" ") option with this one: ,
punct(,)
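Here is a minimal sketch of concatenation on toy data. The two citations and docket values are made up, and leddocket2 is a hypothetical name used to show the result without a separator.

```stata
* Toy data: hypothetical citations and docket numbers, both stored as strings
clear
input str10 led str8 docket
"98/873" "1"
"100/1012" "10 ORIG"
end

* Join the two variables with a space between them
egen leddocket = concat(led docket), punct(" ")

* Join them with nothing in between
egen leddocket2 = concat(led docket)

list
```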
This tells us the following: there are 6,292 cases that appear in the data (based on the Lawyers'
Edition citation) only once; there are 3,428 cases that appear in the data twice; there are 981 cases
that appear in the data three times, etc.
If this were all Stata could do, it wouldn't be of much use. Instead, we want to tag the duplicate
observations, so we know which cases are repeated in the data.
duplicates tag led, gen(ledduplicates)
Now we have created a new variable, ledduplicates, that tags each case on the basis of whether it is
a duplicate observation. This variable is set up such that: 0 = not a duplicate; 1 = duplicate, appears
twice; 2 = duplicate, appears three times, etc.
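The coding of the tag variable is easiest to see on a small example. This sketch uses six made-up citations rather than the Spaeth data.

```stata
* Toy data: one citation appears twice, one once, one three times
clear
input str10 led
"98/873"
"98/873"
"100/1012"
"157/1"
"157/1"
"157/1"
end

* Report how often each citation appears
duplicates report led

* Tag: 0 = not a duplicate, 1 = appears twice, 2 = appears three times
duplicates tag led, generate(ledduplicates)

list, sepby(led)
```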
At this point, it is useful to introduce the unique command. This command is a user-written add-on
to Stata, meaning that you need to download it. To download the command, type the following:
ssc install unique
This tells Stata to install the unique command from the SSC archive over the internet. You can also
locate the command by typing findit unique. We can use this command to identify how many unique
values there are of a variable.
unique led
Number of unique values of led is 8661
Number of records is 12577
This tells us that there are 8,661 unique values of led. This command can also be combined with
conditional statements. For example, to determine how many unique values of led there were during
the 1995 term, we can type:
unique led if term==1995
drop led1 ledvolume led2 ledpage day month year oraldate certiorari newvalue missvalue
privacy douglasopinion leddocket ledduplicates
This command gets rid of these variables, bringing us back to the variables in the original database
(although the duplicate observations are still purged).
Occasionally, you might want to drop a large number of variables from the dataset. One way to do
this is through the drop command; you can list all of the variables you want to drop after the
command (as I did above). Alternatively, you can tell Stata which variables you want to keep by
using the keep command. For example, if I only wanted to keep the term and led variables, I could
type: keep led term. That command would remove all of the other variables in the dataset.
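The difference between the two commands can be sketched on a toy dataset. The observations are made up, and docketnum is a hypothetical variable added only so that there is something to drop.

```stata
* Toy data with four variables
clear
input term str10 led value docketnum
1994 "98/873" 2 101
1995 "100/1012" 5 202
end

* drop: remove the named variable and leave the rest
drop docketnum

* keep: retain only the named variables and remove everything else
keep led term

describe
```

When most variables should go, keep is usually the shorter command; when most should stay, drop is.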
case  rehndir  stevdir  ocondir  scaldir  kendir  soutdir  thomdir  gindir  brydir
1     0        0        0        0        1       1        1        1       1
2     1        1        1        1        0       0        0        0       0

To this format:

case  justice  dir
1     rehn     0
1     stev     0
1     ocon     0
1     scal     0
1     ken      1
1     sout     1
1     thom     1
1     gin      1
1     bry      1
2     rehn     1
2     stev     1
2     ocon     1
2     scal     1
2     ken      0
2     sout     0
2     thom     0
2     gin      0
2     bry      0
The first step is to rename the variables that describe the individual justices' voting behavior. It is
necessary to rename these variables because the new code seeks out j suffixes in order to make the
transformation (in the following example, the j suffix is b for behavior).
rename har harb
rename blc blcb
rename doug dougb
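For reference, here is one way a wide-to-long transformation of this kind can be sketched with the reshape command. This is not necessarily the code used in the original do file: it assumes variable names ending in dir, as in the layout shown earlier, and uses made-up votes for three justices only.

```stata
* Toy wide data: one row per case, one vote variable per justice
clear
input case rehndir stevdir kendir
1 0 0 1
2 1 1 0
end

* @ marks where the justice name sits in each variable name (before "dir");
* the string option is needed because the j values are text, not numbers
reshape long @dir, i(case) j(justice) string

list, sepby(case)
```

The result has one row per case-justice pair, with the justice name stored in justice and the vote stored in dir.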
COLLAPSING DATA
Another way of transforming the unit of analysis in a dataset is through the collapse command.
This command is particularly useful for obtaining summary statistics at the aggregate level.
Let's say we are interested in the proportion of liberal votes each justice cast in each term. To do
this, we enter the following command:
collapse (mean) dir, by(term justid)
We now have a dataset in which the unit of analysis is the justice-term. That is, the data now contain
information on the proportion of liberal votes each justice cast for each term they served on the
Court.
Note that the collapse command can also be used to return values other than the mean. For
example, if we want a dataset that contains the standard deviations surrounding each justice's
proportion of liberal votes, we can replace (mean) with (sd). If we want minimum or maximum
values, we can replace (mean) with (min) or (max), respectively.
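Several statistics can also be requested in one collapse command by naming each result. The sketch below uses made-up votes and assumes dir is coded 1 for a liberal vote and 0 otherwise; propliberal and sdliberal are hypothetical variable names.

```stata
* Toy data: one row per vote
clear
input term justid dir
1994 1 1
1994 1 0
1994 2 1
1994 2 1
1995 1 0
1995 1 0
end

* One row per justice-term, holding the mean and standard deviation of dir
collapse (mean) propliberal=dir (sd) sdliberal=dir, by(term justid)

list
```

Because collapse replaces the data in memory, it is worth saving the vote-level data first (or wrapping the command in preserve and restore) if you still need it.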
This means that Stata successfully merged all 475 observations in the data. We can also check this
with the following command:
tab _merge
This command tells you how many observations in both the master (the data we have been working
with all this time) and the using (the Segal and Cover) data matched perfectly.
The _merge variable is coded as follows:
1    master
2    using
3    match
4    match_update
5    match_conflict
In plain English, if _merge returns a 1, that means that these observations only contain information
from the master dataset. In other words, the using data did not merge to the master data.
If _merge returns a 2, that means that these observations only contain information from the using
dataset. In other words, the master data did not merge with the using data. In practice, this means
that the number of observations in the data has increased in accordance with the number of 2s
returned, and none of these observations contain information from the master data.
If _merge returns a 3, that means that these observations contain information from both the master
and using datasets (i.e., they merged correctly).
Note that codes 4 and 5 only arise if the update option was used. update means that you are telling
Stata to update values in the master dataset with data in the using dataset.
If _merge returns a 4, this means that these observations contain information from both the master
and using datasets (i.e., they merged correctly), and any missing values in the master dataset were
updated with data in the using dataset.
If _merge returns a 5, this means that these observations contain information from both the master
and using datasets (i.e., they merged correctly), and Stata is alerting you that there are conflicting
nonmissing values in the master and using datasets.
As only 3s were returned, we know our data merged correctly.
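The _merge codes can be explored with a small many-to-one merge. This is a sketch on made-up data: the scores are invented, and a temporary file stands in for the using dataset.

```stata
* Hypothetical using data: one row per justice
clear
input justid score
1 0.25
2 0.75
end
tempfile scores
save `scores'

* Hypothetical master data: one row per justice-term
clear
input justid term
1 1994
1 1995
2 1994
3 1994
end

* Many justice-terms match one justice row, hence m:1
merge m:1 justid using `scores'

tabulate _merge
```

Here justice 3 has no row in the using data, so that observation is coded 1 (master only) while the other three are coded 3. In a do file, following the merge with assert _merge==3 stops execution whenever some observations fail to match.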
SIDENOTE: You can also use Stata's old merge syntax. This syntax allows you
to merge various types of datasets without specifying the type of merge that is to be conducted. For
example, if you wanted to merge the two datasets we combined above, you could do so with the
following command:
merge justid using "segalcoverscores.dta"
After we run this command, Stata returns the following:
variable justid does not uniquely identify observations in the master data
Anytime Stata comments on a previous command, you should take this very seriously and figure out
what the program is trying to tell you. In this example, Stata is telling us that the variable justid does
not uniquely identify the observations in the master data. We know this, because the observations in
the master data are based on justice-terms, not the justices themselves. In other words, the variable
justid repeats itself for each term a justice served (which is why we used the merge m:1 command
above).