Stata Data Management
UCLA OARC
Statistical Methods and Data Analytics
Purpose of the workshop
Data sets usually do not arrive in a state immediately ready for analysis
◦ Variables need to be cleaned
◦ E.g. missing data codes like -99 can invalidate analyses if left untouched
◦ Data errors should be corrected
◦ Variables need to be generated
◦ Unnecessary data should perhaps be dropped for efficiency
◦ Data sets may need to be merged together
This workshop covers commonly used Stata commands to prepare data sets for statistical
analysis
Topics
Inspecting variables
Creating variables
Selecting observations and variables
Missing data
Date variables
String variables
Appending data sets
Merging data sets
Looping
Processing data by groups
Preliminaries: save and load
Use do-files
Write and run code in do-files, text files where Stata commands can be saved.
◦ A record of the commands used
◦ Can easily make adjustments to code later
To run code from the do-file, highlight the code and then press Ctrl-D or click the “Execute (do)” icon
Comments: precede text with * or enclose text within /* and */
◦ *not executed
◦ /* this is a comment */
You can also place comments on the same line as a command following //
tab x // also a comment
Reading syntax specifications in this workshop
Unitalicized words should be typed as is
Italicized words should be replaced with the appropriate variable, command, or value
For example:
merge 1:1 varlist using filename
varlist should be replaced with one or more variable names
filename should be replaced with a file’s name
The rest should be typed as is
varlist will be used throughout the workshop to mean one or more variable names
Workshop dataset
This dataset contains fake hospital patient data, with patients as rows of data (no patient id,
though)
A mix of continuous and discrete numeric variables as well as string variables
Some data errors and missing data codes in the data
Each patient is also linked to a doctor (docid).
Another dataset containing doctor variables will be merged into this dataset later in the
seminar.
save and use
We recommend that you save your data set under a different name after performing some data management on it
◦ Always good to have an untouched raw data set
◦ You may change your mind later about how variables should be generated/cleaned/etc.
* save data as Stata data set, overwrite
save data_clean.dta, replace
* .dta automatically added so can omit it
save data_clean, replace
* load data_clean, clear memory first
use data_clean, clear
Inspecting variables: summarize, tabulate, codebook
browse
Spreadsheet style window view
browse
◦ or click on the spreadsheet and magnifying glass icon in the Stata toolbar
Creating variables
generate variables
generate (abbreviated gen or even g) is the basic command to create variables
◦ Often from other existing variables
◦ If the value of an input variable is missing, the generated value will be missing as well.
* sum of tests
* if any test=., testsum=.
gen testsum = test1 + test2 + test3
* reverse-code a 1,2,3 variable
gen revvar = 4-var
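The missing-value behavior of generate can be checked on a tiny made-up data set. This is a minimal sketch: the data values are invented for illustration, and the egen rowtotal() line is shown only as one possible alternative when missing inputs should count as 0, not as the workshop's recommendation.

```stata
* hypothetical toy data: the second row has a missing test2
clear
input test1 test2 test3
80 90 70
85 .  75
end

* plain addition: any missing input makes the sum missing
gen testsum = test1 + test2 + test3

* alternative: egen's rowtotal() treats missing inputs as 0
egen testsum_rt = rowtotal(test1 test2 test3)

list
```

In the listing, testsum is missing in the second row while testsum_rt is 160 there.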
Selecting observations: < > = ! ~ & |, replace, recode, drop
Logical Operators
& and
| or
! not
~ not
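A short sketch of how these operators combine inside if conditions. The variables age and pain and their values are made up for illustration.

```stata
* hypothetical toy data
clear
input age pain
25 3
40 8
67 7
52 .
end

* & (and): every condition must be true
count if age > 30 & pain > 6 & !missing(pain)

* | (or): at least one condition must be true
count if age < 30 | age > 60

* ! and ~ (not) are interchangeable
count if !(age > 30)
count if ~(age > 30)
```

With this data the first two counts are 2 and the last two are both 1.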
replace and if
The replace command is used to replace the values of an existing variable with new values
* binary variable coding whether pain is greater than 6
gen highpain = 0
replace highpain = 1 if pain > 6
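One detail worth checking when using > with if: Stata treats numeric missing values as larger than any nonmissing number, so pain > 6 is also true when pain is missing. The sketch below uses made-up data to show this, plus one possible variant (highpain2, a hypothetical name) that leaves the indicator missing instead.

```stata
* hypothetical toy data: the third row has missing pain
clear
input pain
3
9
.
end

gen highpain = 0
replace highpain = 1 if pain > 6
* row 3 now has highpain == 1 even though pain is missing

* variant: only code the indicator where pain is observed
gen highpain2 = pain > 6 if !missing(pain)

list
```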
Or, if you need to drop most variables, you can keep a few
* drop all variables but age
keep age
Missing data: . "" .a .b, misstable summarize
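A minimal sketch of handling a missing data code such as -99, using made-up data. It assumes -99 is the only missing code in these variables; mvdecode converts it to system missing (.), and misstable summarize then reports how many values are missing.

```stata
* hypothetical toy data where -99 stands for "missing"
clear
input age weight
34 -99
51 180
-99 165
end

* convert the numeric code -99 to system missing (.) in both variables
mvdecode age weight, mv(-99)

* equivalent for a single variable:
* replace age = . if age == -99

* table of missing-value counts per variable
misstable summarize
```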
. di date("3/5/2021", "MDY")
22344
. di date("3-5-2021", "MDY")
22344
. di date("3-5-21", "MD19Y")
-14181
Formatting numeric dates
After conversion using date(), the resulting variable will be filled with numbers representing dates, but these can be hard to read directly as dates
Stata’s format command controls how variables are displayed
The format %td formats numbers as dates
◦ 22344 will appear as 05mar2021 after applying the format
* apply date format to variable newdob
format newdob %td
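Conversion and formatting can be tried together on a small made-up data set. The string variable name dob_str and its values are assumptions for illustration; newdob follows the slide.

```stata
* hypothetical toy data: dates of birth stored as strings
clear
input str10 dob_str
"3/5/2021"
"12/10/1990"
end

* convert the strings to numeric dates (days since 01jan1960)
gen newdob = date(dob_str, "MDY")

* the stored value is unchanged; only the display changes
format newdob %td

list
```

The first row stores 22344 and displays as 05mar2021.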
Date arithmetic
Once dates are stored as numeric variables in Stata, we can perform date arithmetic, for example, to find the number of days between two dates
* length of stay
gen los = discharge_date - admit_date
Functions to extract components of dates
At times, we will need to extract one of the components of a date, such as the year
Each of these functions returns a number:
◦ year(): numeric year
◦ month(): 1 to 12
◦ day(): day of the month
*year of birth
gen yob = year(dob)
*month of birth
gen mob = month(dob)
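Date arithmetic and the extraction functions can be sketched together as follows. The variable names admit_str and discharge_str, the derived names admit_year, admit_month and admit_day, and the data values are all assumptions for illustration.

```stata
* hypothetical toy data: admission and discharge dates as strings
clear
input str10 admit_str str10 discharge_str
"3/5/2021"   "3/9/2021"
"12/30/2020" "1/2/2021"
end

gen admit_date = date(admit_str, "MDY")
gen discharge_date = date(discharge_str, "MDY")
format admit_date discharge_date %td

* length of stay in days
gen los = discharge_date - admit_date

* components of the admission date
gen admit_year = year(admit_date)
gen admit_month = month(admit_date)
gen admit_day = day(admit_date)

list
```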
String variables: help string functions, strtrim()
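strtrim() removes leading and trailing blanks from a string. A minimal sketch with made-up data; the variable names name and name_clean are hypothetical.

```stata
* hypothetical toy data with stray blanks around the names
clear
input str12 name
"  smith "
"jones   "
end

* remove leading and trailing blanks
gen name_clean = strtrim(name)

list
```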
Appending data sets
Appending: adding observations
We often wish to combine data sets that are split into multiple files that have the same
variables (more or less)
◦ Data collected over several waves of time
◦ Data collected from different labs/sources
Data set on hard drive:
ID   DOB         Weight  Insured
201  4-29-1970   150     0
202  12-15-1963  254     0
203  1-10-1962   199     1

Appended data set (rows shown):
ID   DOB         Weight  Insured
103  12-10-1990  198     1
201  4-29-1970   150     0
202  12-15-1963  254     0
203  1-10-1962   199     1
append
With Stata’s append, data set files stored on hard drives are appended to the data set currently loaded in Stata (the master data set)
Syntax:
append using filename [filename…]
Multiple filenames can be appended to the master data set
Variables with the same name should have the same type (string, float, etc.)
◦ use the force option for mismatches
*append data set data2
append using data2
*gen() option creates new variable source
* source=0 if obs comes from the master data set
* source=1 if obs comes from data2
append using data2, gen(source)
Unshared variables will have missing values
Variables that do not appear in all datasets will have missing values where they were omitted

Data set on hard drive (no Insured variable):
ID   DOB         Weight
301  5-29-1974   203
302  1-5-1959    225
303  7-24-1969   165

Appended data set (rows shown):
ID   DOB         Weight  Insured
103  12-10-1990  198     1
301  5-29-1974   203     .
302  1-5-1959    225     .
303  7-24-1969   165     .
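A self-contained sketch of appending, using temporary files in place of data sets on the hard drive. The lowercase variable names and the macro names data2 and data3 are assumptions; the values are loosely based on the tables above. The second file lacks insured, so its rows receive missing values for that variable.

```stata
* first file to append (has insured)
clear
input id weight insured
201 150 0
202 254 0
end
tempfile data2
save "`data2'"

* second file to append (no insured variable)
clear
input id weight
301 203
302 225
end
tempfile data3
save "`data3'"

* master data set, currently loaded in Stata
clear
input id weight insured
101 175 1
102 213 0
end

* source = 0 for master rows, 1 for data2 rows, 2 for data3 rows
append using "`data2'" "`data3'", gen(source)

list
```

Because tempfile names are local macros, run the whole block at once from the do-file rather than line by line.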
Merging data sets: merge
One-to-one merging
When we merge datasets, we add more columns of variables.
Datasets to be merged should generally be matched on an id variable that appears in both datasets
In the most basic merge, each id appears once in each file

First data set:
ID   DOB         Weight  Insured
101  3-3-1981    175     1
102  2-14-1975   213     0
103  12-10-1990  198     1

Second data set:
ID   Doc_ID  LOS
101  A1      3
102  A1      5
103  A2      2

Merged data set:
ID   DOB         Weight  Insured  Doc_ID  LOS
101  3-3-1981    175     1        A1      3
102  2-14-1975   213     0        A1      5
103  12-10-1990  198     1        A2      2
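The one-to-one merge pictured above can be sketched with a temporary file standing in for the second data set. Variable names are lowercased and the macro name stays is hypothetical; which data set is loaded in memory (the master) is also an assumption here.

```stata
* second data set: doctor id and length of stay
clear
input id str2 doc_id los
101 "A1" 3
102 "A1" 5
103 "A2" 2
end
tempfile stays
save "`stays'"

* first data set, loaded in memory
clear
input id weight insured
101 175 1
102 213 0
103 198 1
end

* each id appears once in each file, so use 1:1
merge 1:1 id using "`stays'"

* _merge == 3 marks ids found in both files
tab _merge
list
```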
Merging on Doc_ID (patient data set as master)

Master data set:
ID   Doc_ID  LOS
101  A1      3
102  A1      5
103  A2      2
104  A2      7

Using data set:
Doc_ID  Doc_yrs  Doc_gen
A1      12       F
A2      29       M
A3      8        F

Merged data set:
ID   Doc_ID  LOS  Doc_yrs  Doc_gen  _merge
101  A1      3    12       F        3
102  A1      5    12       F        3
103  A2      2    29       M        3
104  A2      7    29       M        3
.    A3      .    8        F        2
Merging on Doc_ID (doctor data set as master)

Master data set:
Doc_ID  Doc_yrs  Doc_gen
A1      12       F
A2      29       M
A3      8        F

Using data set:
ID   Doc_ID  LOS
101  A1      3
102  A1      5
103  A2      2
104  A2      7

Merged data set:
Doc_ID  Doc_yrs  Doc_gen  ID   LOS  _merge
A1      12       F        101  3    3
A1      12       F        102  5    3
A2      29       M        103  2    3
A2      29       M        104  7    3
A3      8        F        .    .    1
One-to-many:
merge 1:m varlist using filename
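A minimal sketch of the one-to-many merge, with the doctor data set as master and a temporary file standing in for the patient data set. Variable names are lowercased and the macro name patients is hypothetical.

```stata
* patient-level data: several patients per doctor
clear
input id str2 doc_id los
101 "A1" 3
102 "A1" 5
103 "A2" 2
104 "A2" 7
end
tempfile patients
save "`patients'"

* doctor-level data as master: one row per doctor
clear
input str2 doc_id doc_yrs str1 doc_gen
"A1" 12 "F"
"A2" 29 "M"
"A3" 8  "F"
end

* one doctor row can match many patient rows
merge 1:m doc_id using "`patients'"

* doctor A3 has no patients, so that row has _merge == 1
list, sepby(doc_id)
```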
Looping: forvalues
Loops
Loops are a programmer’s tool to perform some task repeatedly over a set of items.
For example, this Stata loop:
foreach var of varlist x y z {
    summ `var'
    gen copy_`var' = `var'
}
Is equivalent to running
summ x
gen copy_x = x
summ y
gen copy_y = y
summ z
gen copy_z = z
var is known as a Stata macro variable, a temporary variable (not related to the data set) that can hold string values
The word var is arbitrary and can be any word you like
varlist tells Stata that the strings following are variable names
1st pass through loop: initially var will be set equal to x, and then the commands within {} are run; wherever `var' appears, it is replaced with x.
summ x
gen copy_x = x
2nd pass through loop: then var becomes y, and the commands within {} are run with `var' replaced by y.
summ y
gen copy_y = y
3rd pass through loop: repeat for z.
summ z
gen copy_z = z
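To watch the substitution happen, the loop can be run on a small made-up data set with a display line added. The data values and the display line are additions for illustration only.

```stata
* hypothetical toy data with variables x, y and z
clear
input x y z
1 10 100
2 20 200
end

foreach var of varlist x y z {
    * show what `var' currently holds
    display "now processing: `var'"
    summ `var'
    gen copy_`var' = `var'
}

* copy_x, copy_y and copy_z now exist
describe
```

Note that the macro is opened with a backtick (`) and closed with a straight apostrophe (').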
Stata loops
In general, Stata loops will consist of:
◦ A macro variable that sequentially takes on the values of a set of items
◦ A set of commands within {}
◦ Calls to the contents of the macro variable with `macro', where macro is the macro variable name
◦ When the loop commands execute, `macro' will be replaced by its current contents
* ^ means “to the power”
forvalues i = 2/4 {
    gen test`i' = test^`i'
    summ test`i'
}
*loop above is equivalent to
gen test2 = test^2
summ test2
gen test3 = test^3
summ test3
gen test4 = test^4
summ test4
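forvalues also accepts a range with a step, written as first(step)last. The sketch below is a hypothetical variant, not taken from the workshop: it creates indicators for a made-up score variable exceeding the cutoffs 10, 20 and 30.

```stata
* hypothetical toy data
clear
input score
5
15
25
35
end

* c takes the values 10, 20, 30
forvalues c = 10(10)30 {
    gen above`c' = score > `c' if !missing(score)
}

list
```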
foreach: looping over variables
foreach macro of varlist varlist {
    commands
}
* for more help with foreach loops
help foreach
Processing data by group: bysort:, varname[n]
Grouped data
Many datasets consist of grouped observations
[Example table columns: Class_ID, Student_ID, SAT_Math, Mean_Math]
Data should generally be sorted by the ID variable and then the time variable before by-ID processing
◦ However, we usually only want to process by the ID variable
◦ So, sort first, then use by: (2 steps)
[Example table: depress, depress at time=0, and lagged depress]
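The two-step approach, and a one-step equivalent with bysort, can be sketched as follows. The data values and the variable names mean_dep and mean_dep2 are made up; in bysort id (time):, the variable in parentheses is used for sorting only, not for forming groups.

```stata
* hypothetical toy data, deliberately unsorted
clear
input id time depress
2 1 31
1 0 40
2 0 31
1 1 38
end

* two steps: sort, then process by id
sort id time
by id: egen mean_dep = mean(depress)

* one step: sort by id and time, but group by id only
bysort id (time): egen mean_dep2 = mean(depress)

list, sepby(id)
```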
Specifying the value of a variable from a particular observation
If we want to use the value of a variable from a particular observation number, e.g. observation 1, we can use this syntax:
varname[n]
So, math[3] is the value of the math variable from the third observation
If used with by:, then it is the nth value from within each group
◦ by classid: math[3], third math value within each class
*sort by id and time first
sort id time
*baseline (first) value of depression per id
by id: gen depress0 = depress[1]
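A small sketch contrasting math[3] with and without by:, using made-up data. The names third_overall and third_in_class are hypothetical. The stable option keeps the original order within each class, which matters here because the data have no second sort variable.

```stata
* hypothetical toy data: class 1 has 3 students, class 2 has 2
clear
input classid math
1 500
1 620
1 580
2 700
2 640
end

* without by: math[3] is the third observation in the whole data set
gen third_overall = math[3]

* with by: math[3] is the third value within each class
sort classid, stable
by classid: gen third_in_class = math[3]

* class 2 has no third observation, so its value is missing
list, sepby(classid)
```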
System variables _n and _N
System variables are created and updated by Stata
_n is the number of the current observation
_N is the total number of observations
With by:
◦ _n is the current observation in a group
◦ _N is the total number of observations in the current group.
*last depression score per id
* allows for different number of timepoints
by id: gen depress_last = depress[_N]
*lagged depression
* first obs will be . per id
by id: gen depress_lag = depress[_n-1]
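These pieces can be run together on a small made-up data set in which the ids have different numbers of timepoints. The data values and the names obs_in_id and n_in_id are assumptions for illustration.

```stata
* hypothetical toy data: id 1 has 3 timepoints, id 2 has 2
clear
input id time depress
1 0 40
1 1 38
1 2 35
2 0 31
2 1 33
end

sort id time

* position within the group and group size
by id: gen obs_in_id = _n
by id: gen n_in_id = _N

* last and lagged depression scores per id
by id: gen depress_last = depress[_N]
by id: gen depress_lag = depress[_n-1]

list, sepby(id)
```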