Stata Data Management

This document provides an overview of commonly used Stata commands for preparing data sets for statistical analysis. It discusses how to inspect variables using commands like browse, summarize, tabulate, and codebook. It also covers generating new variables from existing ones using generate and extended generate (egen). Additionally, it describes how to select observations and variables using if, replace, recode, keep, and drop, and how to merge and append datasets.

Uploaded by

speaktosurendra
Stata data management
UCLA OARC Statistical Methods and Data Analytics
Purpose of the workshop
Data sets usually do not arrive in a state immediately ready for analysis
◦ Variables need to be cleaned
◦ E.g. missing data codes like -99 can invalidate analyses if left untouched
◦ Data errors should be corrected
◦ Variables need to be generated
◦ Unnecessary data should perhaps be dropped for efficiency
◦ Data sets may need to be merged together

This workshop covers commonly used Stata commands to prepare data sets for statistical
analysis
Topics
Inspecting variables
Creating variables
Selecting observations and variables
Missing data
Date variables
String variables
Appending data sets
Merging data sets
Looping
Processing data by groups
Preliminaries
save, load
Use do-files
Write and run code in do-files, text files where Stata commands can be saved.
◦ A record of the commands used
◦ Can easily make adjustments to code later

To run code from the do-file, highlight code and then Ctrl-D or click “Execute(do)” icon
Comments: precede text with * or enclose text within /* and */
◦ *not executed
◦ /* this is a comment */

You can also place comments on the same line as a command following //
tab x // also a comment
Reading syntax specifications in this workshop
Unitalicized words should be typed as is
Italicized words should be replaced with the appropriate variable, command, or value
For example:
merge 1:1 varlist using filename
varlist should be replaced with one or more variable names
filename should be replaced with a file’s name
The rest should be typed as is

varlist will be used throughout the workshop to mean one or more variable names
Workshop dataset
This dataset contains fake hospital patient data, with patients as rows of data (no patient id,
though)
A mix of continuous and discrete numeric variables as well as string variables
Some data errors and missing data codes in the data
Each patient is also linked to a doctor (docid).
Another dataset containing doctor variables will be merged into this dataset later in the
seminar.
save and use
We recommend that you save your data set under a different name after performing some data management on it
◦ Always good to have an untouched, raw data set
◦ You may change your mind later about how variables should be generated/cleaned/etc

To quickly save and load your data in Stata, save as a Stata file (usually .dta)
use loads Stata data files

* save data as Stata data set, overwrite
save data_clean.dta, replace

* .dta automatically added so can omit it
save data_clean, replace

* load data_clean, clear memory first
use data_clean, clear


Inspecting variables
browse, summarize, tabulate, codebook
browse
Spreadsheet style window view
browse
◦ or click on the spreadsheet and magnifying glass icon in the Stata toolbar

Numeric variables appear black


String variables appear red
Numeric variables with value labels appear blue
For large data sets, we need commands that allow us to quickly inspect variables
summarize
summarize (abbreviated summ) provides summary statistics for numeric variables, which may be useful for data management:
◦ Number of non-missing observations
  ◦ Does this variable have the correct number of missing observations?
◦ Mean and standard deviation
  ◦ Are there any data errors or missing data codes that make the mean or standard deviation seem implausible?
◦ Minimum and maximum
  ◦ Are the min and max within the plausible range for this variable?

*summary stats for variable y
summ y

*summary stats for all numeric variables
summ
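The checks above can also be scripted. A minimal sketch, assuming the auto example data set that ships with Stata stands in for the workshop data, and with a made-up plausible range of 5 to 60 for mpg:

```stata
* example data shipped with Stata (an assumption; substitute your own data)
sysuse auto, clear

* summary statistics for one variable
summarize mpg

* summarize leaves its results behind in r(); display them for a quick check
display "non-missing obs: " r(N)
display "min = " r(min) ", max = " r(max)

* count non-missing values outside the (made-up) plausible range
count if (mpg < 5 | mpg > 60) & !missing(mpg)
```

A count of 0 suggests no out-of-range values or numeric missing data codes in that variable.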
tabulate
tabulate (abbreviated tab here)
◦ frequency tables (# observations per value)
◦ string or numeric variables ok
◦ but not continuous numeric variables

Questions to ask with tab:
◦ Is this the number of categories I expect?
◦ Are the categories numbered and labeled as I expect?
◦ Do these frequencies look right?

Important data management options
◦ miss - treat missing values as another category
◦ nolabel - remove labels to inspect underlying numeric representation

*table of frequencies of race
* display missing
tab race, miss

*2-way table of frequencies
tab race gender
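To see what the miss and nolabel options change, here is a small sketch. It assumes the auto example data set that ships with Stata, where rep78 has some missing values and foreign carries value labels:

```stata
* example data shipped with Stata (an assumption; substitute your own data)
sysuse auto, clear

tab rep78              // missing values are left out of the table
tab rep78, miss        // missing values shown as their own category

tab foreign            // value labels are displayed
tab foreign, nolabel   // underlying numeric codes are displayed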
codebook
Detailed information about variables
codebook provides:
◦ For all variables
  ◦ Number of unique values
  ◦ Number of missing values
  ◦ Value labels, if applied
◦ For numeric variables
  ◦ Range
  ◦ quantiles, means and standard deviation for continuous variables
  ◦ Frequencies for discrete variables
◦ For string variables
  ◦ frequencies
  ◦ warnings about leading and trailing blanks

*detailed info about variables x and y
codebook x y

*detailed info about all variables
codebook
Creating variables
generate, help functions, egen
generate variables
generate (abbreviated gen or even g) is the basic command to create variables
◦ Often from other existing variables
◦ if the value of the input variable is missing, the generated value will be missing as well.

We will be using gen throughout this seminar

*sum of tests
* if any test=., testsum=.
gen testsum = test1 + test2 + test3

*reverse-code a 1,2,3 variable
gen revvar = 4-var
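A small sketch of how missing inputs propagate through gen, using made-up test scores. For contrast it also uses egen's rowtotal() function, which is not part of the slide above and treats missing inputs as 0:

```stata
clear
input test1 test2 test3
10 20 30
10  . 30
end

* with gen, one missing input makes the whole sum missing
gen testsum = test1 + test2 + test3

* egen's rowtotal() treats missing inputs as 0 instead
egen testtotal = rowtotal(test1 test2 test3)

list
```

Which behavior is right depends on the analysis; the point is to choose deliberately.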


Functions to use with generate
We can use functions to perform some operation on a variable to generate another variable
Later sections of this seminar will take a focused look at the function groups Date and Time and String
We will use various functions from other groups as well
Most of these functions accept no more than one variable as an input

*get table of function help pages
help functions

*random number (0,1) for each obs
gen x = runiform()

*running (cumulative) sum of x
gen sumx = sum(x)

*extract year from date variable
gen yr_birth = year(date_birth)

*extract 1st 3 numbers of phone number
gen areacode = substr(phone, 1, 3)
egen, extended generation command
egen (extended generate) creates variables with its own, exclusive set of functions, which include:
◦ Many functions that accept multiple variables as arguments
◦ cut() function to create categorical variable from continuous variable
◦ group() function to create a new grouping variable that is the crossing of two grouping variables

*mean of 3 y variables
egen ymean = rowmean(y1 y2 y3)

*how many missing values in y vars
egen ymiss = rowmiss(y1 y2 y3)

* how many of the y vars equal 2 or 3
egen y2or3 = anycount(y1 y2 y3), values(2 3)

*age category variable:
* 0-17, 18-59, 60-119, with value labels
egen agecat = cut(age), at(0, 18, 60, 120) label

*create agecat-by-race variable, with value labels
* (variables inside group() are separated by spaces)
egen agecat_race = group(agecat race), label
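A runnable sketch of the row-wise egen functions and cut(), on three made-up observations (the variable names follow the slide; the values are assumptions):

```stata
clear
input y1 y2 y3 age
1 2 3 10
2 . 3 45
3 3 . 70
end

egen ymean  = rowmean(y1 y2 y3)                  // mean of the non-missing y values
egen ymiss  = rowmiss(y1 y2 y3)                  // number of missing y values
egen y2or3  = anycount(y1 y2 y3), values(2 3)    // how many y values equal 2 or 3
egen agecat = cut(age), at(0, 18, 60, 120) label // categories coded 0, 1, 2 with labels

list
```

Note that rowmean() ignores missing values rather than returning missing, unlike a sum built with gen.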
Selecting observations and variables
if, < > = ! ~ & |, replace, recode, drop, keep

if: selecting by condition
select observations that meet a certain condition
if clause usually placed after the command specification, but before the comma that marks the beginning of the list of options.

*tab x for age > 60, include missing
tab x if age > 60, miss
Logical and relational operators
Relational Operators
>   greater than
<=  less than or equal
==  equal
!=  not equal
~=  not equal

Logical Operators
&   and
|   or
!   not
~   not

*summary stats for y
* for obs where insured not equal to 1
summ y if insured != 1

*tab x for obs where
* age < 60 and insured equal to 1
tab x if (age < 60) & (insured == 1)
replace and if
The replace command is used to replace the values of an existing variable with new values
Typically, replace is used with if to replace the values in a subset of observations

*binary variable coding whether pain is greater than 6
gen highpain = 0
replace highpain = 1 if pain > 6
Change variable coding with recode
Variables are often not coded the way we want
◦ too many categories or with values out of order.

With recode, we can handle all recodes for a variable in a single step.

*recode (0,1,2)->3 and (6,7)->5
recode income_cat (0 1 2 = 3) (6 7 = 5)
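recode overwrites the variable by default. A minimal sketch, on made-up income_cat values, of the gen() option, which writes the recoded values to a new variable and so keeps the original for checking (the name income_cat2 is an assumption):

```stata
clear
input income_cat
0
2
4
7
end

* gen() stores the result in a new variable; income_cat itself is unchanged
recode income_cat (0 1 2 = 3) (6 7 = 5), gen(income_cat2)

* cross-tabulate old against new to verify the recode
tab income_cat income_cat2
```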
keep and drop: filtering observations
Drop unneeded observations using:
drop if exp
Where exp expresses some condition that, if true, will cause an observation to be dropped
Conversely, we can keep only observations that satisfy some condition with
keep if exp

* drop observations where age < 18
drop if age < 18

* same thing as above
keep if age >= 18
drop (or keep) variables
Unneeded variables can be dropped with:
drop varlist
where varlist is a list of variables to drop
See examples for some shortcuts to drop many variables at once
Or, if you need to drop most variables, you can keep a few

* drop variables x y z
drop x y z

* drop all variables in consecutive columns
* from age-dob
drop age-dob

* drop all variables that begin with "pat"
drop pat*

* drop all variables but age
keep age
Missing data
. "" .a .b, misstable summarize, mvdecode, missing()
Missing data
Missing values are very common in real data, and it is important for an analyst to be aware of missingness in the data set
◦ Hopefully, you know any missing data codes (e.g. -99)

When reading in data from a text or Excel file, missing data can be represented by an empty field.

* overview of how missing values work in Stata and Stata commands for missing values
help missing
Missing values in Stata
. is missing for numeric variables (also called sysmiss)
"" is missing for string variables (also called blank).

Additional missing data values are available
◦ .a through .z
◦ can be used to represent different types of missing (refusal, don't know, etc.).

Missing values are very large numbers in Stata
◦ all non-missing numbers < . < .a < .b < … < .z

replace stringvar = "" if stringvar == "-99"

* replace with sysmiss if -99 (skipped)
replace numvar = . if numvar == -99

* use .a for different missing data code -98
* (e.g. refused to answer)
replace numvar = .a if numvar == -98
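The ordering of missing values can be checked directly. A minimal sketch on three made-up values, one of them system missing and one extended missing:

```stata
clear
input x
5
.
.a
end

* every missing value is larger than every number
count if x > 100    // counts both missing rows, not the 5
count if x >= .     // any missing value (., .a, ...)
count if x == .     // system missing only
```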
misstable summarize: finding existing missing values
misstable summarize produces a table of missing values (of all types) across a set of variables
column Obs=. counts the number of missing values equal to .
column Obs>. counts the number of missing values other than ., such as .a and .b
column Obs<. and the entire right-hand section Obs<. address non-missing values

*table of missing values across all variables
misstable summarize
Detecting missing data codes
extreme numeric values are often used to represent missing values
◦ -99 or 999

Undetected, these missing data codes can be included as real data in statistical analysis
Use summarize and graph box to look for missing data codes across many variables at once

*boxplot of variables
graph box lungcapacity test1 test2
Use mvdecode to convert user-defined missing data codes to missing values
We can quickly convert all user-defined missing codes to system missing values across all numeric variables with mvdecode.
Unfortunately, mvdecode will not work at all on string variables.

*convert -99 to . for all variables
mvdecode _all, mv(-99)

*convert -99 to . and -98 to .a
mvdecode _all, mv(-99 =. \ -98=.a)
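Since mvdecode leaves string variables alone, one possible workaround is to loop over the string variables and blank out the code with replace. This is a sketch on made-up data, not part of the workshop code; it uses ds to collect the string variable names and a foreach loop (loops are covered later in the workshop):

```stata
clear
input str5 hospital str5 insurer wbc
"UCLA" "-99" 5.1
"-99"  "HMO" -99
end

* numeric variables: mvdecode does the work (string variables are ignored)
mvdecode _all, mv(-99)

* string variables: list them with ds, then blank out the "-99" code
ds, has(type string)
foreach v of varlist `r(varlist)' {
    replace `v' = "" if `v' == "-99"
}

list
```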
The missing() function
The missing() function returns TRUE if the value is any one of Stata's missing values (., .a, .b, etc)

*eligible if not missing for lungcapacity
gen eligible = 0
replace eligible = 1 if !missing(lungcapacity)
Be careful with relational operators when working with missing values
Missing values are very large numbers in Stata
◦ all non-missing numbers < . < .a < .b < … < .z

Thus, (. > 50) results in TRUE

When creating variables from other variables, make sure you know how you want to handle missing values

* we want hightest1 = 1 if test1 > 10
* but . > 10, so this is not right
gen hightest1 = 0
replace hightest1 = 1 if test1 > 10

*now hightest1 will be . when test1 is .
replace hightest1 = . if test1 == .
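The same indicator can be built in one step. A sketch on made-up values of test1, assuming we want the new variable to be missing whenever test1 is any kind of missing (., .a, and so on):

```stata
clear
input test1
5
15
.
.a
end

* (test1 > 10) evaluates to 1 or 0; the if qualifier leaves the result
* missing wherever test1 is missing
gen hightest1 = (test1 > 10) if !missing(test1)

list
```

Using missing() here also catches extended missing values such as .a, which a test of test1 == . would not.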
Date variables
help datetime, date(), format, year(), month(), day()
Dates as strings and numbers
In Stata we can store dates as strings…
◦ "January 2, 2021"
◦ "1-2-2021"

However, dates should be represented numerically in Stata if the date data are needed
◦ To create analysis variables
◦ For plotting

Numeric dates (with day, month, and year data) in Stata are represented by the number of days since January 1, 1960 (a reference date)
◦ January 2, 2021 = 22,282

* Overview of how Stata handles dates
help datetime

*Use codebook or describe to determine whether your date variable is a string or number
codebook date_var

*Alternatively, we can look at the variable in the browser: red=string, blue/black=number
browse date_var
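The day count can be checked with the mdy() function, which builds a numeric date from month, day, and year. mdy() is not used elsewhere in the workshop; this is just a quick sketch:

```stata
* mdy(month, day, year) returns the number of days since January 1, 1960
display mdy(1, 1, 1960)   // the reference date itself: 0
display mdy(1, 2, 2021)   // 22282

* the same number displayed in a readable date format
display %td mdy(1, 2, 2021)
```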
date(): converting string dates to numeric
Use the date() function to convert string dates to numeric in Stata
The generic syntax, for example with gen, is:
gen varname = date(stringdate, mask)
◦ stringdate is a variable
◦ mask is a code that specifies the order of the components of the date

For day, month, year dates:
◦ mask is "MDY" if the order is month, day, year
◦ mask is "DMY" if the order is day, month, year

* create numeric version of date of birth
* order is month, day, year
gen newdob = date(dob, "MDY")
The date() function accepts dates in many formats
String dates are recorded in many different formats, but fortunately, the date() function is flexible in what it accepts as inputs

*just display commands to show date() usage
. di date("March 5, 2021", "MDY")
22344

. di date("Mar 5, 2021", "MDY")
22344

. di date("3/5/2021", "MDY")
22344

. di date("3-5-2021", "MDY")
22344

*add 20 or 19 to the mask if year is 2-digit
. di date("3-5-21", "MD20Y")
22344

. di date("3-5-21", "MD19Y")
-14181
Formatting numeric dates
After conversion using date(), the resulting variable will be filled with numbers representing dates, but can be hard to read directly as dates
Stata's format command controls how variables are displayed
The format %td formats numbers as dates
◦ 22344 will appear as 05mar2021 after applying the format

*apply date format to variable newdob
format newdob %td
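Putting conversion and formatting together, a minimal sketch on two made-up string dates (the variable names dob and newdob follow the slides):

```stata
clear
input str12 dob
"3/5/2021"
"12-10-1990"
end

* convert the string dates; the order is month, day, year
gen newdob = date(dob, "MDY")
list            // newdob is shown as raw day counts

* same stored values, now displayed as dates
format newdob %td
list
```

The format changes only how newdob is displayed; the stored numbers, and any arithmetic on them, are unchanged.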
Date arithmetic
Once dates are stored as numeric variables in Stata, we can perform date arithmetic, for example, to find the number of days between two dates

* length of stay
gen los = discharge_date - admit_date
Functions to extract components of dates
At times, we will need to extract one of the components of a date, such as the year
Each of these functions returns a number:
◦ year(): numeric year
◦ month(): 1 to 12
◦ day(): day of the month

*year of birth (from the numeric date, not the string)
gen yob = year(newdob)

*month of birth
gen mob = month(newdob)
String variables
help string functions, strtrim(), substr(), +, encode, destring
Strings in Stata
Strings are just a series of characters
Variables can generally be stored as either numeric or string
String values are surrounded by quotes when specifying a Stata command
◦ String missing is ""

Many estimation commands in Stata will not work with string variables.

* z now missing if z was "-99"
replace z = "" if z == "-99"
String functions
Stata provides a number of functions to clean, combine, and extract components of string variables
We will examine a few in detail shortly
Other useful functions:
◦ strlen(): length of strings
◦ strpos(): the position in a string where a substring is found
◦ strofreal(): convert number to string with a specified format

* help page for all string functions
help string functions
strtrim(): trimming white space
String variables often arrive messy, with unnecessary spaces at the beginning or end of the string
◦ " May 3, 2001 "

Extra spaces can complicate string matching and extracting components
◦ tab will treat " UCLA" and "UCLA" as separate categories

strtrim() removes whitespace from the beginning and end of strings

* some of these categories should be combined
tab hospital

* remove all leading and trailing whitespace
replace hospital = strtrim(hospital)
tab hospital
substr(): extracting a substring
substr() extracts a substring from a longer string
substr(s, n1, n2)
◦ s is a string value or string variable
◦ n1 is the starting position
  ◦ negative number counts from the end
◦ n2 is the length of the substring

*area code starts at 1, length=3
gen areacode = substr(phone, 1, 3)

*extract last 5 characters as zip code
gen zipcode = substr(address, -5, 5)
+: concatenating strings
Strings can be joined or concatenated together in a Stata command with a +
◦ String variables can be combined with string constants this way

*create full name variable: "Last, First"
gen fullname = lastname + ", " + firstname
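A short sketch that combines strtrim(), substr(), and + on a single made-up record (the phone number and names are invented; last4 is a hypothetical variable name):

```stata
clear
set obs 1
gen phone     = "310-555-0100"
gen lastname  = "  Smith "
gen firstname = "Ana"

* clean up the stray spaces before combining
replace lastname = strtrim(lastname)

gen areacode = substr(phone, 1, 3)     // first 3 characters
gen last4    = substr(phone, -4, 4)    // last 4 characters, counted from the end
gen fullname = lastname + ", " + firstname

list
```

Trimming first matters: without it, fullname would carry the extra spaces around the last name.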
String matching
For matching strings exactly, we can use the == operator
For more flexible pattern matching, Stata has a number of functions that use regular expressions
◦ e.g. to find strings that contain either "US" or "USA" and may or may not contain periods
  ◦ "US"
  ◦ "U.S."
  ◦ "USA"
  ◦ "U.S.A."
◦ See help regexm
◦ See help strmatch

Regular expressions are beyond the scope of this workshop

* for regular expression functions
help regexm
Encoding strings into numeric variables
Categorical variables are often initially coded as strings
◦ But most estimation commands require numeric variables

encode converts string variables to numeric
◦ assigns a numeric value to each distinct string value
◦ applies value labels of the original string values
◦ Use the gen() option to create a new variable
◦ gen() is required: encode cannot overwrite the original string variable

The ordering of the categories is alphabetical

Remember, when browsing, string variables will appear red while numeric variables with labels will appear blue

* convert hospital to numeric, generate new variable
encode hospital, gen(hospnum)
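A minimal sketch of what encode produces, on three made-up hospital names. Tabulating with and without nolabel shows the labels and the alphabetical codes behind them:

```stata
clear
input str6 hospital
"UCLA"
"Cedars"
"UCLA"
end

encode hospital, gen(hospnum)

tab hospnum            // displays the value labels
tab hospnum, nolabel   // displays the codes: Cedars = 1, UCLA = 2
```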
Convert number variables stored as strings to numeric with destring
Sometimes, variables with number values are loaded as strings into Stata.
This can happen when a character value is mistakenly entered, or a non-numeric code (e.g. "NA") is used for missing.
◦ Or the variable was saved as a string in other software (e.g. Excel)

We do not want to use encode here, because the variable values are truly numbers rather than categories
◦ we would not want the "1.25" converted to a category.

Instead, we can use destring, which directly translates numbers stored as strings into the numbers themselves.

*convert string wbc to numeric and overwrite
destring wbc, replace
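destring will not convert a variable that still contains non-numeric text, so codes like "NA" need handling first. A sketch on made-up wbc values, assuming "NA" is the only non-numeric code present:

```stata
clear
input str4 wbc
"5.1"
"NA"
"1.25"
end

* blank out the non-numeric missing code first
replace wbc = "" if wbc == "NA"

* now every remaining value is a number, so destring can convert
destring wbc, replace

list
```

The blank string becomes system missing (.) in the converted variable.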
Appending data sets
append
Appending: adding observations
We often wish to combine data sets that are split into multiple files that have the same
variables (more or less)
◦ Data collected over several waves of time
◦ Data collected from different labs/sources

Loaded in Stata:
ID   DOB         Weight  Insured
101  3-3-1981    175     1
102  2-14-1975   213     0
103  12-10-1990  198     1

+ On hard drive:
ID   DOB         Weight  Insured
201  4-29-1970   150     0
202  12-15-1963  254     0
203  1-10-1962   199     1

= Result:
ID   DOB         Weight  Insured
101  3-3-1981    175     1
102  2-14-1975   213     0
103  12-10-1990  198     1
201  4-29-1970   150     0
202  12-15-1963  254     0
203  1-10-1962   199     1
append
With Stata's append, data set files stored on hard drives are appended to the data set currently loaded in Stata (the master data set)
Syntax:
append using filename [filename…]
Multiple filenames can be appended to the master data set
Variables with the same name should have the same type (string, float, etc.)
◦ use the force option for mismatches

*append data set data2
append using data2

*gen() option creates new variable source
* source=0 if obs comes from the master data set
* source=1 if obs comes from data2 (the 1st using data set)
append using data2, gen(source)
Variables that do not appear in all datasets will have missing values where they were omitted

ID   DOB         Weight  Insured
101  3-3-1981    175     1
102  2-14-1975   213     0
103  12-10-1990  198     1

+
ID   DOB         Weight
301  5-29-1974   203
302  1-5-1959    225
303  7-24-1969   165

=
ID   DOB         Weight  Insured
101  3-3-1981    175     1
102  2-14-1975   213     0
103  12-10-1990  198     1
301  5-29-1974   203     .
302  1-5-1959    225     .
303  7-24-1969   165     .
Merging data sets
merge
One-to-one merging
When we merge datasets, we add more columns of variables.
Datasets to be merged should generally be matched on an id variable that appears in both datasets
In the most basic merge, each id appears once in each file

Loaded in Stata (master dataset):
ID   DOB         Weight  Insured
101  3-3-1981    175     1
102  2-14-1975   213     0
103  12-10-1990  198     1

+ On hard drive (using dataset):
ID   Doc_ID  LOS
101  A1      3
102  A1      5
103  A2      2

= Result:
ID   DOB         Weight  Insured  Doc_ID  LOS
101  3-3-1981    175     1        A1      3
102  2-14-1975   213     0        A1      5
103  12-10-1990  198     1        A2      2
merge 1:1 for one-to-one merging
Basic syntax:
merge 1:1 varlist using filename
◦ varlist is one or more variables used to match observations
◦ filename is the data set stored elsewhere to be merged into the data set currently loaded in Stata

*one-to-one merge example
merge 1:1 id using morevars.dta
_merge: understanding the resulting merge
After running a merge, Stata will output a table similar to this one:

    Result                      Number of obs
    -----------------------------------------
    Not matched                             5
        from master                         5  (_merge==1)
        from using                          0  (_merge==2)

    Matched                               200  (_merge==3)
    -----------------------------------------

Stata also automatically adds a new variable, _merge, to the merged data set, where:
◦ _merge==1 if the observation's id is only found in the master file
◦ _merge==2 if the observation's id is only found in the using file
◦ _merge==3 if the observation's merge id was matched in both files

* drop unmatched observations
drop if _merge != 3
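A self-contained one-to-one merge, as a sketch. The using data set is saved to a temporary file, and the ids and values are made up; id 103 deliberately has no match so that _merge takes more than one value:

```stata
* build the using data set and save it to a temporary file
clear
input id los
101 3
102 5
end
tempfile morevars
save `morevars'

* build the master data set
clear
input id weight
101 175
102 213
103 198
end

merge 1:1 id using `morevars'
tab _merge

* keep matched observations, then remove the bookkeeping variable
keep if _merge == 3
drop _merge
```

Dropping _merge afterwards matters if another merge follows, since merge will not overwrite an existing _merge variable.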
Many-to-one merges
Many-to-one merge: unique id values appear multiple times in the master data set, but only
once in the using dataset

Merging on Doc_ID

Loaded in Stata (master dataset):
ID   Doc_ID  LOS
101  A1      3
102  A1      5
103  A2      2
104  A2      7

+ On hard drive (using dataset):
Doc_ID  Doc_yrs  Doc_gen
A1      12       F
A2      29       M
A3      8        F

= Result:
ID   Doc_ID  LOS  Doc_yrs  Doc_gen  _merge
101  A1      3    12       F        3
102  A1      5    12       F        3
103  A2      2    29       M        3
104  A2      7    29       M        3
.    A3      .    8        F        2

_merge==2 means this Doc_ID only appears in using dataset
One-to-many merges
One-to-many merge: unique id values appear once each in the master data set, but can appear
many times in the using dataset

Merging on Doc_ID

Loaded in Stata (master dataset):
Doc_ID  Doc_yrs  Doc_gen
A1      12       F
A2      29       M
A3      8        F

+ On hard drive (using dataset):
ID   Doc_ID  LOS
101  A1      3
102  A1      5
103  A2      2
104  A2      7

= Result:
Doc_ID  Doc_yrs  Doc_gen  ID   LOS  _merge
A1      12       F        101  3    3
A1      12       F        102  5    3
A2      29       M        103  2    3
A2      29       M        104  7    3
A3      8        F        .    .    1

_merge==1 means this Doc_ID only appears in master dataset
merge m:1 and merge 1:m
Many-to-one:
merge m:1 varlist using filename

One-to-many:
merge 1:m varlist using filename

◦ varlist is one or more id variables
◦ filename is using data set file name

*each docid appears multiple times in
* master data, once in using data
merge m:1 docid using "dm_doctor_data.dta"
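The many-to-one table shown earlier can be reproduced with a short sketch. The doctor data set is saved to a temporary file; the variable names are lowercase versions of the table headers:

```stata
* doctor-level data: one row per doctor (the using data set)
clear
input str2 doc_id doc_yrs str1 doc_gen
"A1" 12 "F"
"A2" 29 "M"
"A3" 8  "F"
end
tempfile doctors
save `doctors'

* patient-level data: several rows per doctor (the master data set)
clear
input id str2 doc_id los
101 "A1" 3
102 "A1" 5
103 "A2" 2
104 "A2" 7
end

merge m:1 doc_id using `doctors'
list
```

Doctor A3 has no patients, so that row comes only from the using data set and gets _merge==2.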
Looping
foreach, forvalues
Loops
Loops are a programmer's tool to perform some task repeatedly over a set of items.
For example, this Stata loop:
foreach var of varlist x y z {
    summ `var'
    gen copy_`var' = `var'
}
Is equivalent to running
summ x
gen copy_x = x
summ y
gen copy_y = y
summ z
gen copy_z = z

var is known as a Stata macro variable, a temporary variable (not related to the data set) that can hold string values
The word var is arbitrary and can be any word you like
varlist tells Stata that the strings following are variable names

1st pass through loop: initially var will be set equal to x, and then the commands within {} are run; wherever `var' appears, replace with x.
summ x
gen copy_x = x

2nd pass through loop: then var becomes y, the commands within {} are run, replace `var' with y
summ y
gen copy_y = y

3rd pass through loop: repeat for z …
summ z
gen copy_z = z
Stata loops
In general, Stata loops will consist of:
◦ A macro variable that sequentially takes on the values of a set of items
◦ A set of commands within {}
◦ Calls to the contents of the macro variable with `macro', where macro is the macro variable name
  ◦ when the loop commands execute, `macro' will be replaced by its current contents

* ^ means "to the power"
forvalues i = 2/4 {
    gen test`i' = test^`i'
    summ test`i'
}

*loop above is equivalent to
gen test2 = test^2
summ test2
gen test3 = test^3
summ test3
gen test4 = test^4
summ test4
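A runnable version of the power loop, on two made-up values of test, so the generated variables can be inspected:

```stata
clear
input test
2
3
end

* i takes the values 2, 3, 4 in turn
forvalues i = 2/4 {
    gen test`i' = test^`i'
}

list
```

Note the quoting: the macro is opened with a left single quote (backtick) and closed with an ordinary apostrophe.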
foreach: looping over variables
foreach macro of varlist varlist {
    commands
}

macro: name of variable that takes on values in varlist
varlist: a list of variable names
commands: Stata commands to run each time loop iterates
◦ Use `' with macro name to substitute contents of macro

Opening { must be on the first line
Closing } must be by itself on the last line
foreach can loop over many other kinds of lists besides variables

* for more help with foreach loops
help foreach

*reverse-code three 1-7 Likert scale variables
* apply variable label
foreach var of varlist T1 T2 T3 {
    gen rev_`var' = 8 - `var'
    label var rev_`var' "Reverse-coded `var'"
}

*create copies of four variables
* where 4 is recoded to 3
foreach var of varlist X1 X2 X3 X4 {
    gen new_`var' = `var'
    recode new_`var' (4=3)
}
Use forvalues to loop over numbers
forvalues macro = range {
    commands
}

macro: name of variable that takes on values in range
range: a range of numbers, e.g. 1/10 or 1(2)11
commands: Stata commands to run each time loop iterates

Opening { must be on the first line
Closing } must be by itself on the last line

*create dummy variables for
* different age cutoffs
forvalues i = 40(10)60 {
    gen age`i' = 0
    replace age`i' = 1 if age >= `i'
    replace age`i' = . if age == .
}

*equivalent to
gen age40 = 0
replace age40 = 1 if age >= 40
replace age40 = . if age == .
gen age50 = 0
replace age50 = 1 if age >= 50
replace age50 = . if age == .
gen age60 = 0
replace age60 = 1 if age >= 60
replace age60 = . if age == .
Processing data by group
sort, by:, bysort:, varname[n]
Grouped data
Many datasets consist of grouped observations
◦ observations were sampled in clusters (e.g. students sampled from schools)
◦ repeated measurements of the same individual

With grouped data, we often want to generate variables and process data by groups
Examples:
◦ The mean SAT math score for each classroom of students
◦ In a longitudinal study of depression scores, a variable that records each person's first (baseline) value of depression
  ◦ Or a variable that represents depression from the previous timepoint (lagged depression)

Mean_Math is the mean of SAT_Math by Class_ID:
Class_ID  Student_ID  SAT_Math  Mean_Math
1         101         620       660
1         102         650       660
1         103         710       660
2         201         520       550
2         202         620       550
2         203         510       550
3         301         490       540
3         302         450       540
3         303         680       540
The by prefix
For processing by groups, we use the prefix by and a group variable, which precede other
Stata commands
In general the syntax will be:
◦ by varlist: stata_cmd
◦ varlist is one or more grouping variables by which the data are to be processed
◦ stata_cmd is the Stata command to run on each group of data.
Data must be sorted before processing by group
Data must be sorted by the grouping variable before processing by that variable
◦ Either use the sort command on the grouping variable before running any commands with by…
◦ …or use the prefix bysort (instead of by), which sorts the data by the grouping variable(s) before processing by group

*2 steps, sort first; summ age by doctor
sort docid
by docid: summ age

* same as above in one step
bysort docid: summ age
Generating statistics by group
some egen functions can be used with by to generate statistics by group.
◦ including mean, max, and sd

*mean, max, and standard dev of age
* of patients within each doctor
by docid: egen mean_age = mean(age)
by docid: egen max_age = max(age)
by docid: egen sd_age = sd(age)
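The classroom example from the grouped data slide can be reproduced with a short sketch. It uses the first two classes from that table, with lowercase variable names, and bysort so that no separate sort step is needed:

```stata
clear
input class_id student_id sat_math
1 101 620
1 102 650
1 103 710
2 201 520
2 202 620
2 203 510
end

* every student in a class gets that class's mean score
bysort class_id: egen mean_math = mean(sat_math)

list, sepby(class_id)
```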
Longitudinal data (multiple rows per subject)
Longitudinal designs repeatedly measure subjects over time
If repeated measurements are recorded on separate rows of data (i.e. long data), then there will usually be both an ID variable and a time variable
Data should generally be sorted by the ID variable and then the time variable before by-ID processing
◦ However, we usually only want to process by the ID variable
◦ So, sort first, then use by: (2 steps)

depress0 is depress at time=0; lag_depress is lagged depress:
ID  time  depress  depress0  lag_depress
1   0     25       25        .
1   1     23       25        25
1   2     23       25        23
1   3     18       25        23
1   4     15       25        18
2   0     31       31        .
2   1     31       31        31
2   2     32       31        31
2   3     32       31        32
2   4     33       31        32
Specifying the value of a variable from a particular observation
If we want to use the value of a variable from a particular observation number, e.g. observation 1, we can use this syntax:
varname[n]
So, math[3] is the value of the math variable from the third observation
If used with by:, then it is the nth value from within each group
◦ by classid: math[3], third math value within each class

*sort by id and time first
sort id time

*baseline (first) value of depression per id
by id: gen depress0 = depress[1]
System variables _n and _N
System variables are created and updated by Stata
_n is the number of the current observation
_N is the total number of observations

With by:
◦ _n is the current observation in a group
◦ _N is the total number of observations in the current group.

These can then be used to create variables for longitudinal data

*last depression score per id
* allows for different number of timepoints
by id: gen depress_last = depress[_N]

*lagged depression
* first obs will be . per id
by id: gen depress_lag = depress[_n-1]
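A sketch of the baseline, last, and lagged variables on made-up long data. It uses bysort id (time):, a variant of the two-step sort-then-by approach that groups by id while ordering rows by time within each id:

```stata
clear
input id time depress
1 0 25
1 1 23
1 2 23
2 0 31
2 1 31
end

* group by id; rows are ordered by time inside each group
bysort id (time): gen depress0     = depress[1]
bysort id (time): gen depress_last = depress[_N]
bysort id (time): gen depress_lag  = depress[_n-1]

list, sepby(id)
```

The first row of each id has no previous row, so depress_lag is missing there.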
Summing within group
When used with generate, sum() creates a running sum
◦ gen sumx = sum(x) // running sum of x

With a by-group specification, we get running sums by group.
◦ The running sum may itself be a useful variable
◦ We can also pull the value from the last observation (using _N) within each group to create a total sum variable

*running sum of adverse life events (ale)
* per id
by id: gen sum_ale = sum(ale)

*total number of adverse life events per id
by id: gen total_ale = sum_ale[_N]
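A runnable sketch of the running and total sums, on made-up counts of adverse life events (ale). The data are sorted by id and time first, so the running sum accumulates in time order:

```stata
clear
input id time ale
1 0 1
1 1 0
1 2 2
2 0 0
2 1 3
end

sort id time

* running sum within each id
by id: gen sum_ale = sum(ale)

* the last running-sum value within each id is that id's total
by id: gen total_ale = sum_ale[_N]

list, sepby(id)
```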
References and Further Learning
References
Mitchell, Michael N. 2010. Data Management Using Stata: A Practical Handbook. Stata Press.
Stata YouTube channel – videos for both data management and data analysis made by Stata, and a list of links to their videos on their home site
Data management FAQ on Stata home site
UCLA OARC Stata pages – our own pages on data management and data analysis
THANK YOU!
