STATA Notes - by Ms Bing
Reference Guide
Bingxin Yu
Data Exploration
Chapter 5. Examine Dataset
Chapter 6. Generate and Organize Variables
Chapter 7. Descriptive Statistics
Chapter 8. Normality and Outlier
Chapter 9. Graphing Data
Chapter 10. Statistical Tests
Chapter 11. Data Management
Analysis
Chapter 12. Linear Regression
Chapter 13. Logistic Regression
Chapter 14. Simulations
Chapter 15. System Equations
Chapter 16. Simultaneous Equations
Extensions
Chapter 17. Troubleshooting and Update
Chapter 18. Advanced Programming
Chapter 19. Helpful Resources
Chapter 1. Introduction
Why Stata?
Stata is a package that offers a good combination of ease of learning and power. It has numerous
powerful yet simple commands for data management, which allow users to perform complex
manipulations with ease. Under Stata/SE, one can have up to 32,768 variables in a Stata data file
and use up to 11,000 of them in any estimation command.
Stata performs most general statistical analyses (regression, logistic regression, ANOVA, factor
analysis, and some multivariate analysis). The greatest strengths of Stata are probably in
regression and logistic regression. Stata also has a very nice array of robust methods that are very
easy to use, including robust regression and regression with robust standard errors; many other
estimation commands offer robust standard errors as well.
Stata has the ability to easily download programs developed by other users, and you can
create your own Stata programs that seamlessly become part of Stata. One can find many cutting-edge
statistical procedures already written by other users and incorporate them into one's own
Stata programs. Stata uses one-line commands, which can be entered one at a time in the
Command window or many at a time in a Stata program.
1. Windows

The Command window on the bottom right is where you'll enter commands. When you press
ENTER, they are pasted into the Stata Results window above, which is where you will see your
commands execute and view the results. Results are displayed in a black Courier font.
On the left are two convenience windows. The Variables window keeps a list of your current
variables. If you click on one of them, its name will be pasted into the current command at the
location of the cursor, which saves a little typing. The Review window keeps a list of all the
commands you've typed this Stata session. Click on one, and it will be pasted into the Command
window, which is handy for fixing typos. Double-click, and the command will be pasted and re-
executed. You can also export everything in the Review window into a .do file (more on these
later) so you can run the exact same commands at any time. To do this, right-click the Review
window.
When we first open Stata, all these windows are blank except for the Stata Results window. You
can resize these 4 windows independently, and you can resize the outer window as well. To save
your window size changes, click on the Prefs button, then Save Windowing Preferences.
Entering commands in Stata works pretty much like you expect. BACKSPACE deletes the
character to the left of the cursor, DELETE the character to the right, the arrow keys move the
cursor around, and if you type, the text is inserted at the current location of the cursor. The up
arrow does not retrieve previous commands, but you can do that by pressing PAGE UP, or
CTRL-R, or by using the Review window.
2. Menus
Stata displays 9 drop-down menus across the top of the outer window, from left to right:
A. File
Open: open a Stata data file (use)
Save/Save as: save the Stata data in memory to disk
Do: execute a do-file
Filename: copy a filename to the command line
Print: print log or graph
Exit: quit Stata
B. Edit
Copy/Paste: copy text among the Command, Results, and Log windows
Copy Table: copy table from Results window to another file
Table copy options: what to do with table lines in Copy Table
C. Prefs - all Stata-related preferences
D. Data
E. Graphics
F. Statistics
build and run Stata commands from menus
G. User - menus for user-supplied Stata commands (download from Internet)
H. Window - bring a Stata window to the front
I. Help - Stata command syntax and keyword searches
3. Button bar
The buttons on the button bar are from left to right (equivalent command is in bold):
Open a Stata data file: use
Save the Stata data in memory to disk: save
Print a log or graph
Open a log, or suspend/close an open log: log
Open a new viewer
Bring Results window to front
Bring Graph window to front
New Dofile Editor: doedit
Edit the data in memory: edit
Browse the data in memory: browse
Scroll another page when --more-- is displayed: Space Bar
Stop current command or do-file: Ctrl-Break
Chapter 2. Getting Started
Directory commands
We begin by defining the working directory.
The pwd command, which stands for print working directory, shows the current directory you
are in when Stata starts up.
. pwd
Y:\Stata9SE
The cd command stands for change directory; in this case, we change to the notes directory. The
advantage of working from within a non-Stata directory is not only that Stata and your work are
safe, but also that you can use files without spelling out the full path, which can be quite handy.
. cd "u:\notes"
. pwd
u:\notes
The log command starts a log file called test1 that keeps a record of the commands and output
during the Stata session.
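For example:
. log using test1.txt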
The log close command closes and saves the current log file.
. log close
The log file, test1.txt, can be viewed with any text editor or word processor.
The cd command fixes the current directory, but this might be inconvenient if we need to switch
between Stata files under two or more different paths. We can use the global command (which
actually defines a macro) to store each path and give us a shortcut in programming.
. global t "u:\notes"
. use $t\ethiopia.dta, clear
Do file
It is easier to collect all of the Stata commands used to perform a certain task together in one
place and run them all at once rather than one at a time. Otherwise, if modifications are needed,
you have to start from scratch and try to remember how you got that far in the first place. A do file
allows users to keep track of previous work and to make changes easily. Any command that you
can type on the command line can be placed in a do file. There is a limit of 3,500 lines to a do
file.
Do files are created with the do file editor or any other text editor. Any command which can be
executed from the command line can be placed in a do file. Here is a simple example of a do file:
cd u:\notes
log using test1.txt
pwd
log close
. doedit
. doedit e1.do

1. Use the do command, or press the Ctrl+D key in the Do-file Editor, to execute the Stata
commands in e1.do and display the output.

. do e1.do

2. Use the run command to execute the Stata commands in e1.do but display no output.

. run e1.do
If you would like to add documentation to a do file, but do not want Stata to execute your notes,
enclose them in /* */ comment delimiters.
Memory commands
Sometimes we might need extra memory to read a big data file.
First you can check to see how much memory is allocated to hold your data using the memory
command. I am running Stata 9 under Windows, and this is what the memory command told me.
. memory

                                      bytes
--------------------------------------------------------------------
Details of set memory usage
    overhead (pointers)                    16        0.00%
    data                                   72        0.00%
                                  ----------------------------
    data + overhead                        88        0.00%
    free                           10,485,664      100.00%
                                  ----------------------------
    Total allocated                10,485,752      100.00%
--------------------------------------------------------------------
Other memory usage
    set maxvar usage                1,816,666
    set matsize usage               1,315,200
    programs, saved results, etc.         509
                                  ---------------
    Total                           3,132,375
--------------------------------------------------------------------
Grand total                        13,618,127
The total allocation is about 10 MB, nearly all of it free for reading in a data file. I have a data
file that is 33 MB, which is beyond the Stata default memory allocated to me, so I get an error
message when I try to use it.
I will allocate 100 MB of memory with the set memory command before trying to use my file.
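. set memory 100m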
Now that I have allocated enough memory, I will be able to read the file. If I want to allocate
100m (100 megabytes) every time I start Stata, I can type
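. set memory 100m, permanently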
And then Stata will allocate this amount of memory every time you start Stata.
Chapter 3. Input and Import Data
One of the easiest methods for getting data into Stata is using the Stata data editor, which
resembles an Excel spreadsheet. It is useful when your data is on paper and needs to be typed in,
or if your data is already typed into an Excel spreadsheet. To learn more about the Stata data
editor, see the edit module.
The edit command opens up a spreadsheet-like window in which you can enter and change data.
You can also get to the 'Data Editor' from the pull-down 'Window' menu or by clicking on the
'Data Editor' icon on the tool bar.
. edit
Enter values and press return. Double click on the column head and you can change the name of
the variables. When you are done click the 'close box' for the 'Data Editor' window.
Another option is to use input command, then enter your own data set in command window or
do file editor.
input a b c
1 2 3
4 5 6
7 8 9
end
Consider the file comma.csv below that contains three variables, name, midterm, and final,
separated by commas. The file looks like what is shown below (the variable names are indeed the
first line of the file.)
name,midterm,final
Smith,79,84
Jones,87,86
Brown,91,94
Adraktas,80,84
You can read this kind of file using the insheet command as shown below.
. clear
. insheet using comma.csv
We can issue the list command to see if the data was read properly.
. list
+----------------------------+
| name midterm final |
|----------------------------|
1. | Smith 79 84 |
2. | Jones 87 86 |
3. | Brown 91 94 |
4. | Adraktas 80 84 |
+----------------------------+
As you can see, the insheet command was pretty smart. It got the variable names from the first
row of the data, and it examines the file to determine for itself whether the data is separated by
commas or by tabs. The exact same command could read the same file delimited with tabs (you
can try reading such a file for yourself).
If the variable names are not included in the data, you can still read the data into Stata using the
infile command.
. clear
. infile str10 name midterm final using ascii.txt
. list
+----------------------------+
| name midterm final |
|----------------------------|
1. | Smith 79 84 |
2. | Jones 87 86 |
3. | Brown 91 94 |
4. | Adraktas 80 84 |
+----------------------------+
To convert an SPSS file into Stata, in SPSS use "File Save As" to make a .csv file and then in
Stata use the insheet command to read the .csv file. Another way is to save an ASCII text
file in SPSS and use the infile command to import the data into Stata.
Note that the variables are clearly defined by the column(s) in which they are located. The
columns define where make begins and ends, and the embedded spaces no longer create confusion.
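Based on the column positions and the listing further below, fixedcolumn.txt would look
something like:

AMC Concord   22 2930 4099
AMC Pacer     17 3350 4749
AMC Spirit    22 2640 3799
Buick Century 20 3250 4816
Buick Electra 15 4080 7827

This file can be read with the infix command as shown below.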
. infix str make 1-13 mpg 15-16 weight 18-21 price 23-26 using
fixedcolumn.txt
(5 observations read)
Here again we need to tell Stata that make is a string variable by preceding make with str. We
did not need to indicate the length since Stata can infer that make can be up to 13 characters wide
based on the column locations.
The list command confirms that the data was read correctly.
. list
+--------------------------------------+
| make mpg weight price |
|--------------------------------------|
1. | AMC Concord 22 2930 4099 |
2. | AMC Pacer 17 3350 4749 |
3. | AMC Spirit 22 2640 3799 |
4. | Buick Century 20 3250 4816 |
5. | Buick Electra 15 4080 7827 |
+--------------------------------------+
Stata does not want you to lose the changes that you made to the data sitting in memory. If you
really want to discard those changes, the clear option specifies that it is okay to replace the
data in memory, even though the current data have not been saved to disk.
The save command will save the dataset as a .dta file under the name you choose. Editing the
dataset changes the data in the computer's memory; it does not change the data stored on the
computer's disk. Note that since I have defined my current directory with the cd command, it is not
necessary to specify the path while saving my data. But if I would like to save the data under a
different directory, I have to spell out the path.
. save t1.dta
. save ..\plus\t1.dta
The replace option allows you to save a changed file to disk, replacing the original file. Stata
worries that you will accidentally overwrite your data file; you need the replace option
to tell Stata that you know the file exists and you want to replace it.
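For example:
. save t1.dta, replace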
Chapter 4. Export Data
Since many different software packages (e.g., PowerPoint) read in Excel spreadsheets, we will
focus on getting Stata results into Excel spreadsheets. Once your results are in an Excel
spreadsheet, you are ready either to use Excel to format the table or create the graph, or to bring
the results into some other software. There are ways to create graphs and to format your results in
Stata, but most people are already familiar with doing all that in software such as Excel.
There are many ways to get your results out of Stata and into Excel. No matter which way you
choose, it's always a good idea to check that the process did what you expected.
1. From the Stata Results window, select text in Stata and paste into Excel. To copy an output
table, select the table in the Results window, right click (or press Shift+Ctrl+C) to choose
Copy Table, and paste into Excel.
2. From a log file, select text and paste into Excel. Or open the log file (*.log) in Excel as fixed
width file. Make use of Stata's ability to start and stop logging to a log file so that your log file
contains only one table of results. It's easier to have multiple log files with one table per file than
multiple tables and other code in one big log file.
use "u:\notes\ethiopia.dta"
log using "test1.log", replace
collapse (mean) tot_exp, by(regco)
log close
3. From a Stata data set, use DBMS/Copy to convert a Stata data set to an Excel spreadsheet.
4. From a Stata data set, use the outsheet command to output a spreadsheet that can be read
into Excel.
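A sketch (the output filename is illustrative):
. outsheet using results.csv, comma replace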
The related outfile command exports data to a plain text file that can also be read into Excel.
The wide option causes Stata to write the data with one observation per line, which could be
used to feed into other software.
Chapter 5. Examine Dataset
Stata syntax
Most Stata commands follow the same syntax:

[by varlist1:] command [varlist2] [if exp] [in range] [weight] [, options]

Items inside of the square brackets are either optional or not available for every command. This
syntax applies to all Stata commands. In order to use the by prefix, the dataset must first be
sorted on the by variable(s).
The [if exp] qualifier can be complex, using & (and) and | (or) to join conditions together; an
example follows the list of operators below.
~ not
== equal
~= not equal
!= not equal
> greater than
>= greater than or equal
< less than
<= less than or equal
& and
| or
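For example, using variables from the household data introduced later in these notes:
. list hhid tot_exp if tot_exp>20 & urban==1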
1. fweights, or frequency weights, are weights that indicate the number of duplicated
observations. They are used when your data set has been collapsed and contains a variable that
gives the frequency with which each record occurred.
2. pweights, or sampling weights, are weights that denote the inverse of the probability that the
observation is included due to the sampling design. pweights are the correct choice for sample
survey data. The pweight qualifier causes Stata to use the sampling weight as the number of
subjects in the population that each observation represents when computing estimates such as
proportions, means, and regression parameters. A robust variance estimation technique will
automatically be used to adjust for the design characteristics so that variances, standard errors,
and confidence intervals are correct.
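For example, assuming the data contained a sampling-weight variable wt (illustrative; not in our
ethiopia.dta):
. regress tot_exp hhz_usu [pweight=wt]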
3. aweight, or analytic weights, are weights that are inversely proportional to the variance of an
observation; i.e., the variance of the j-th observation is assumed to be sigma^2/w_j, where w_j
are the weights. Typically, the observations represent averages and the weights are the number
of elements that gave rise to the average. For most Stata commands, the recorded scale of
aweights is irrelevant; Stata internally rescales them to sum to N, the number of observations in
your data, when it uses them.
Analytic weights are used when you want to compute a linear regression on data that are
observed means. Do not use aweights to specify sampling weights, because the formulas
that use aweights assume that larger weights designate more accurately measured observations.
In a sample survey, however, one observation is no more accurately measured than any
other. Hence, using aweights to specify sampling weights will cause
Stata to estimate incorrect values for the variances, standard errors of estimates, and p-values
for hypothesis tests.
4. iweight, or importance weights, are weights that indicate the "importance" of the observation
in some vague sense. iweights have no formal statistical definition; any command that supports
iweights will define exactly how they are treated. In most cases, they are intended for use by
programmers who need to implement their own analytical techniques by using some of the
available estimation commands. Special care should be taken when using importance weights to
understand how they are used in the formulas for estimates and variance. This information is
available in the Methods and Formulas section in the Stata manual for each estimation command.
In general, these formulas will be incorrect for computing the variance for data from a sample
survey.
Stata accepts unambiguous abbreviations for commands and variable names. For example,
instead of typing the full command

. describe

we can just type

. des

or even

. d
Dataset examining commands
The list command without any variable names displays the values of all the variables for all
cases. The list command with variable names displays the values of the variables listed. The in
qualifier restricts the range of observations listed.
. list hhid urban
. list hhid tot_exp regco urban hhz_usu in 1/5
+----------------------------------------------------------+
| hhid tot_exp regco urban hhz_usu |
|----------------------------------------------------------|
1. | 101010888130501 5.572000027 1 0 2 |
2. | 101010888130502 3.738346577 1 0 1 |
3. | 101010888130503 13.01241112 1 0 5 |
4. | 101010888130504 2.75808239 1 0 1 |
5. | 101010888130505 4.473479271 1 0 2 |
.
.
Here we look at hhid, tot_exp, regco, urban, hhz_usu for the first 5 observations.
Note the --more-- message that appears whenever output fills up the computer screen. Pressing
the space bar will display the next screen, and so on, until all of the information has been
displayed. To get out of --more--, you can click on the "break" button, select "Break" from the
pull-down "Tools" menu, or press the "q" key.
The if exp qualifier allows you to list values for those cases for which the exp is "true."
. list hhid if tot_exp==.
. list hhid urban tot_exp if regco==1
The first list displays all cases for which tot_exp is missing. Stata uses "." to indicate missing
values. The if regco==1 only displays households in region 1.
The browse command is similar to edit, except that it will not allow users to change the data.
The browse command is a convenient alternative to the list command: users view the data
through the data editor instead of the Results window.
We can use the describe command to display a basic summary of a Stata dataset, including the
number of observations in the file, the number of variables, the names of the variables, and the
variable formats and labels.
. describe regco
. describe
. codebook
. codebook hhz_usu
-----------------------------------------------------------------------------
hhz_usu
Number of usual household members
-----------------------------------------------------------------------------
mean: 4.7466
std. dev: 2.39059
Another useful command for getting a quick overview of a data file is the inspect command. The
inspect command displays information about the values of variables and is useful for checking
data accuracy.
. inspect
. inspect hhz_usu
The count command shows the number of observations satisfying an if condition. If no
condition is specified, count displays the number of observations in the data.
. count
17332
. count if hhz_usu>8
1235
Chapter 6. Generate and Organize Variables
Create variables
The generate command is used to create a new variable.
. gen urbanrural="RURAL"
For existing variables, the replace command is needed to replace existing value of a variable
with a new value.
egen stands for extended generate and is an extremely powerful command with many options
for creating new variables; for example, it can add summary statistics to each observation.
Although the egen and generate commands look alike, they produce quite different results, as
shown in the example below.
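The hhincome and hhincome1 variables in the listing were presumably created along these lines
(a sketch, assuming the data are sorted by hid):
. egen hhincome = total(income), by(hid)
. by hid: generate hhincome1 = sum(income)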
+------------------------------------------+
| hid iid income hhincome hhinco~1 |
|------------------------------------------|
1. | 1 1 1000 1250 1000 |
2. | 1 2 250 1250 1250 |
3. | 1 3 0 1250 1250 |
4. | 2 1 600 1100 600 |
5. | 2 2 500 1100 1100 |
|------------------------------------------|
6. | 3 1 20000 20000 20000 |
+------------------------------------------+
Some of the functions available through the egen command include count(), max(), mean(),
median(), min(), sd(), and total().
Another approach to generating variables is to use the recode command. The example makes a
new variable called grade with values 0 to 4 based on student scores (0-100).
. gen grade=totavg
. recode grade 0/60=0 60/70=1 70/80=2 80/90=3 90/100=4
Modify variables
We can use the rename command to rename a variable.
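For example, to rename hhz_usu to hhsize (as noted later in these notes):
. rename hhz_usu hhsize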
The format command allows you to specify the display format for variables; the internal
precision of the variables is unaffected. An example follows the table of formats below.
%fmt description example
-----------------------------------------------------------------------------
Right-justified formats
%#.#g general numeric format %9.0g
%#.#f fixed numeric format %9.2f
%#.#e exponential numeric format %10.7e
%d default numeric elapsed date format %d
%d... user-specified elapsed date format %dM/D/Y
%#s string format %15s
Leading-zero formats
%0#.#f fixed numeric format %09.2f
%0#s string format %015s
Left-justified formats
%-#.#g general numeric format %-9.0g
%-#.#f fixed numeric format %-9.2f
%-#.#e exponential numeric format %-10.7e
%-d default numeric elapsed date format %-d
%-d... user-specified elapsed date format %-dM/D/Y
%-#s string format %-15s
Centered formats
%~#s string format (special) %~15s
-----------------------------------------------------------------------------
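For example, to display tot_exp with two decimal places:
. format tot_exp %9.2f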
Label variables
Now let’s include some variable labels so that we know a little more about the variables. The
variable urban may be confusing since it is hard to tell what the 0s and 1s mean.
. use ethiopia.dta, clear
. codebook urban
-----------------------------------------------------------------------------
urban
Urban
-----------------------------------------------------------------------------
type: numeric (double)
We will use the label commands to add a brief definition for the variable urban, clearly
indicating 0 for rural households and 1 for urban households.
The label variable command makes labels that help explain individual variables.
The label define command creates a definition for the values 0 and 1 called ul.
The label values command connects the definition ul with the values of the variable urban.
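Putting these together (the label text itself is illustrative):
. label variable urban "1 if urban, 0 if rural household"
. label define ul 0 "rural" 1 "urban"
. label values urban ul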
In the student grading example, the values of grade are labeled A-F by label define and label
values.
. label define abcdf 0 "F" 1 "D" 2 "C" 3 "B" 4 "A"
. label values grade abcdf
Chapter 7. Descriptive Statistics
Frequency table
The tabulate command is useful for obtaining frequency tables. Below we make a table for
regco.
. tabulate regco
The tab1 command can be used as a shortcut to request one-way frequency tables for a series of
variables, instead of typing the tabulate command over and over again.
We can also make crosstables using tabulate. Let's look at households broken down by
regco and urban/rural.
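. tabulate regco urban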
| Urban
Region | 0 1 | Total
-----------+----------------------+----------
1 | 564 688 | 1,252
2 | 392 400 | 792
3 | 1,740 1,600 | 3,340
4 | 1,824 1,904 | 3,728
5 | 372 480 | 852
6 | 516 400 | 916
7 | 1,872 768 | 2,640
12 | 360 384 | 744
13 | 360 368 | 728
14 | 300 1,200 | 1,500
15 | 360 480 | 840
-----------+----------------------+----------
Total | 8,660 8,672 | 17,332
With the column option, we can request column percentages. Notice that about 21.96% of urban
households live in regco 4.
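. tabulate regco urban, column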
+-------------------+
| Key |
|-------------------|
| frequency |
| column percentage |
+-------------------+
| Urban
Region | 0 1 | Total
-----------+----------------------+----------
1 | 564 688 | 1,252
| 6.51 7.93 | 7.22
-----------+----------------------+----------
2 | 392 400 | 792
| 4.53 4.61 | 4.57
-----------+----------------------+----------
3 | 1,740 1,600 | 3,340
| 20.09 18.45 | 19.27
-----------+----------------------+----------
4 | 1,824 1,904 | 3,728
| 21.06 21.96 | 21.51
-----------+----------------------+----------
5 | 372 480 | 852
| 4.30 5.54 | 4.92
-----------+----------------------+----------
6 | 516 400 | 916
| 5.96 4.61 | 5.29
-----------+----------------------+----------
7 | 1,872 768 | 2,640
| 21.62 8.86 | 15.23
-----------+----------------------+----------
12 | 360 384 | 744
| 4.16 4.43 | 4.29
-----------+----------------------+----------
13 | 360 368 | 728
| 4.16 4.24 | 4.20
-----------+----------------------+----------
14 | 300 1,200 | 1,500
| 3.46 13.84 | 8.65
-----------+----------------------+----------
15 | 360 480 | 840
| 4.16 5.54 | 4.85
-----------+----------------------+----------
Total | 8,660 8,672 | 17,332
| 100.00 100.00 | 100.00
Since we are more interested in the percentages than in the frequencies, the nofreq option is used
to suppress the frequencies.
. tabulate regco urban, column nofreq
| Urban
Region | 0 1 | Total
-----------+----------------------+----------
1 | 6.51 7.93 | 7.22
2 | 4.53 4.61 | 4.57
3 | 20.09 18.45 | 19.27
4 | 21.06 21.96 | 21.51
5 | 4.30 5.54 | 4.92
6 | 5.96 4.61 | 5.29
7 | 21.62 8.86 | 15.23
12 | 4.16 4.43 | 4.29
13 | 4.16 4.24 | 4.20
14 | 3.46 13.84 | 8.65
15 | 4.16 5.54 | 4.85
-----------+----------------------+----------
Total | 100.00 100.00 | 100.00
We can use the plot option to make a plot that visually shows the tabulated values.
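. tabulate regco, plot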
Region | Freq.
----------+------------+-----------------------------------------------------
1 | 1,252 |******************
2 | 792 |***********
3 | 3,340 |***********************************************
4 | 3,728 |*****************************************************
5 | 852 |************
6 | 916 |*************
7 | 2,640 |**************************************
12 | 744 |***********
13 | 728 |**********
14 | 1,500 |*********************
15 | 840 |************
----------+------------+-----------------------------------------------------
Total | 17,332
Suppose we want to focus on just the households with household income less than 1000. We can
combine the if qualifier with the tabulate command to do this; a sketch follows.
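A sketch, assuming a household income variable named hhincome (hypothetical; the exact
variable is not shown in this extract):
. tabulate regco urban if hhincome<1000, column nofreq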
| Urban
Region | 0 1 | Total
-----------+----------------------+----------
1 | 6.53 7.92 | 7.23
2 | 4.46 4.62 | 4.54
3 | 20.01 18.46 | 19.24
4 | 21.10 21.97 | 21.54
5 | 4.27 5.52 | 4.90
6 | 5.98 4.62 | 5.30
7 | 21.68 8.86 | 15.26
12 | 4.17 4.41 | 4.29
13 | 4.15 4.24 | 4.19
14 | 3.47 13.84 | 8.67
15 | 4.17 5.54 | 4.86
-----------+----------------------+----------
Total | 100.00 100.00 | 100.00
Stata has two built-in variables called _n and _N. _n is Stata notation for the current observation
number. Thus _n=1 in the first observation, and _n=2 in the second, and so on. _N is Stata
notation for the total number of observations. Let’s see how _n and _N work.
. clear
. input score group
score group
1. 72 1
2. 84 2
3. 76 1
4. 89 3
5. 82 2
6. 90 1
7. 85 1
. end
. generate id=_n
. generate nt=_N
. list
+-------------------------+
| score group id nt |
|-------------------------|
1. | 72 1 1 7 |
2. | 84 2 2 7 |
3. | 76 1 3 7 |
4. | 89 3 4 7 |
5. | 82 2 5 7 |
|-------------------------|
6. | 90 1 6 7 |
7. | 85 1 7 7 |
+-------------------------+
As you can see, the variable id contains observation number running from 1 to 7 and nt is the
total number of observations, which is 7 for all observations.
Summary statistics
For summary statistics, the summarize command is the most commonly used. Let's generate
some summary statistics on tot_exp.
. summarize tot_exp
To get these values separately for urban and rural households, we could use the by urban: prefix
as shown below. Note that we first have to sort the data by urban before using the prefix.
. sort urban
. by urban: summarize tot_exp
-----------------------------------------------------------------------------
-> urban = 0
-----------------------------------------------------------------------------
-> urban = 1
Suppose we want summaries for each regco; by regco: summarize works but produces long
output. A more concise way, which does not require the data to be sorted, is to use the
summarize() option of the tabulate command.
. tabulate regco, summarize(tot_exp)
We can use if and by with most Stata commands. Here, we get summary statistics for tot_exp for
households in regco 1.
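. summarize tot_exp if regco==1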
ATTENTION!
Missing values are represented as . and are the highest value possible. Most commands ignore
missing values by default. Some commands, such as tabulate, have an option to display missing
if you want to see how many missing observations there are. Therefore, when values are missing,
be careful with commands like summarize and tabulate. To avoid this problem, use missing
option to treat missing values like other values in tabulate command.
. tabulate regco urban, column nofreq missing
Other commands, however, may use missing values in a way that will surprise you. For example,
the replace command does not ignore missing values. Here is a simple example to demonstrate
how the replace command handles missing values. In this example, we have a variable income
with one missing value, and we want to generate indicator variables with a cutoff value of 500.
clear
input hid income
1 1000
2 450
3 .
4 700
5 500
end
. gen cut1=0 if income<=500
(3 missing values generated)
. replace cut1=1 if income>500
(3 real changes made)
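The cut2 and cut3 variables in the listing below were presumably created along these lines
(a sketch; see the discussion that follows):
. gen cut2=0 if income<=500
. replace cut2=1 if income>500 & income~=.
. recode income (min/500=0) (nonmissing=1), gen(cut3)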
. list
+-----------------------------------+
| hid income cut1 cut2 cut3 |
|-----------------------------------|
1. | 1 1000 1 1 1 |
2. | 2 450 0 0 0 |
3. | 3 . 1 . . |
4. | 4 700 1 1 1 |
5. | 5 500 0 0 0 |
+-----------------------------------+
The first replace command changes every income value greater than 500 to 1. This
command does not ignore missing values, so both incomes greater than 500 and missing values
are changed to 1. This is probably not what we would normally want, since missing values
should remain missing.
The second replace command changes all income values greater than 500 but not missing to 1.
In this case non-missing values greater than 500 are changed and missing values are left alone,
which is our intention.
The recode command automatically ignores missing values, so we don't have to think about it.
The results are the same as the second replace command.
Advanced statistics
The table command calculates and displays tables of statistics, including frequency, mean,
standard deviation, sum, and percentiles from the 1st to the 99th. The row and col options add a
row and a column to the table, reflecting the totals across rows and columns.
The example lists a two-way table of the median of tot_exp (50th percentile) by urban and femhead.
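The command would be something like:
. table urban femhead, contents(p50 tot_exp) row col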
-------------------------------------------
| Female headed household
Urban | 0 1 Total
----------+--------------------------------
0 | 11.713764 7.9488297 10.741507
1 | 17.312714 11.005945 14.391151
|
Total | 12.156257 8.5248489 11.194192
-------------------------------------------
The tabstat command displays summary statistics for a series of numeric variables in a single
table.
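For example (the statistics shown are chosen for illustration):
. tabstat tot_exp hhz_usu, statistics(mean sd p50) by(urban)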
Sometimes we have data files that need to be aggregated at a higher level to be useful. For
example, we have household data but are really interested in regional data. The collapse
command serves this purpose by converting the dataset in memory into a dataset of means, sums,
medians, or percentiles. Note that the collapse command creates a new dataset: all household-level
information disappears and only the specified aggregates remain at the region
level. The resulting summary table can be viewed with the edit command.
We would like to see the mean tot_exp in each regco and urban/rural areas.
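. collapse (mean) tot_exp, by(regco urban)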
regco urban tot_exp
1 0 12.067
1 1 14.899
2 0 13.022
2 1 17.849
3 0 11.612
3 1 16.507
4 0 13.324
4 1 17.790
5 0 15.152
5 1 22.627
6 0 11.890
6 1 18.261
7 0 12.313
7 1 18.591
12 0 10.851
12 1 19.714
13 0 19.528
13 1 20.021
14 0 21.568
14 1 30.597
15 0 16.627
15 1 19.574
However, this table is not easy to interpret; we can call it a long format since the urban and
rural data are listed vertically. We will use the reshape command to convert it into a wide
format where rural and urban are arranged horizontally in a two-way table. The reshape wide
command tells Stata that we want to go from long to wide; the i() option specifies the row
variable while j() specifies the column variable.
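Here:
. reshape wide tot_exp, i(regco) j(urban)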
If needed, the table can be converted back into the long form by reshape long.
Chapter 8. Normality and Outlier

There are no official rules to identify outliers; in a sense, it is left up to the analyst
(or a consensus process) to decide what will be considered abnormal. Sometimes it is obvious
that an outlier is simply miscoded (for example, age reported as 230) and hence should be set to
missing. But most times this is not the case.
Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution. That
is, data sets with high kurtosis tend to have a distinct peak near the mean, decline rather rapidly,
and have heavy tails. Data sets with low kurtosis tend to have a flat top near the mean rather than
a sharp peak; a uniform distribution would be the extreme case. Stata reports kurtosis on a scale
where the normal distribution scores 3: values above 3 indicate a "peaked" distribution and
values below 3 a "flat" distribution. A kurtosis of 6 or larger indicates a large departure from
normality.
We can obtain skewness and kurtosis values by using the detail option of the summarize command.
Clearly, the variable tot_exp_pc is skewed to the right and has a peaked distribution; both
statistics indicate that the distribution of tot_exp_pc is far from normal. Remember that missing
values are represented as . and are the highest value possible, so those observations need to be
dropped first.
. cd "u:\notes"
. use ethiopia.dta, clear
. drop if tot_exp_pc==.
(37 observations deleted)
. sum tot_exp_pc
Total daily expenditures per capita
-------------------------------------------------------------
Percentiles Smallest
1% .9142123 .4008128
5% 1.265692 .4301129
10% 1.540479 .4934749 Obs 17295
25% 2.101912 .5245371 Sum of Wgt. 17295
Besides commands for descriptive statistics, such as summarize, we can also check normality of
a variable visually by looking at some basic graphs in Stata, including histograms, boxplots,
kdensity, pnorm, and qnorm. Let’s keep using tot_exp_pc from ethiopia.dta file for making
some graphs.
The histogram command is an effective graphical technique for showing both the skewness and
kurtosis of tot_exp_pc.
. histogram tot_exp_pc
[Figure: histogram of tot_exp_pc (Density against Total daily expenditures per capita)]
The normal option can be used to get a normal overlay. This shows the skew to the right in
tot_exp_pc.
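. histogram tot_exp_pc, normal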
[Figure: histogram of tot_exp_pc with normal overlay]
We can use the bin() option to increase the number of bins to 100; this option specifies how
many bins to aggregate the data into, and here it better illustrates the distribution of tot_exp_pc.
Notice that the histogram resembles a bell-shaped curve truncated at 0.
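. histogram tot_exp_pc, bin(100)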
[Figure: histogram of tot_exp_pc with 100 bins]
graph box draws vertical box plots. In a vertical box plot, the y axis is numerical and the x axis
is categorical. The upper and lower bounds of the box are defined by the 25th and 75th percentiles
of tot_exp_pc, and the line within the box is the median. The ends of the whiskers mark the upper
and lower adjacent values (the most extreme values within 1.5 times the interquartile range of the
box). The graph box command produces a boxplot which can help us examine the distribution of
tot_exp_pc. If tot_exp_pc were normal, the median would be in the center of the box and the ends
of the whiskers would be equidistant from the box.
The boxplot for tot_exp_pc shows positive skew. The median is pulled to the low end of the box,
and the upper whisker is stretched out away from the box, in both rural and urban cases.
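The boxplot below was drawn with something like:
. graph box tot_exp_pc, by(urban)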
[Figure: box plots of tot_exp_pc, Graphs by Urban (0 and 1)]
The kdensity command with the normal option displays a kernel density graph of a variable with
a normal distribution superimposed on the graph. This is particularly useful for verifying that
regression residuals are normally distributed, which is a very important assumption for
regression. The plot shows that tot_exp_pc is skewed to the right and more peaked than the
normal distribution.
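. kdensity tot_exp_pc, normal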
[Figure: kernel density of tot_exp_pc with normal overlay]
Graphical alternatives to the kdensity command are the P-P plot and Q-Q plot.
The pnorm command produces a P-P plot, which graphs a standardized normal probability plot.
It should be approximately linear if the variable follows a normal distribution: the straighter the
line formed by the P-P plot, the more closely the variable's distribution conforms to the normal
distribution.
. pnorm tot_exp_pc
[Figure: P-P plot of tot_exp_pc, Normal F[(tot_exp_pc-m)/s] against empirical probabilities]
The qnorm command plots the quantiles of a variable against the quantiles of a normal
distribution. If the Q-Q plot shows a line close to the 45 degree line, the variable is
approximately normally distributed.
. qnorm tot_exp_pc
[Figure: Q-Q plot of tot_exp_pc (Total daily expenditures per capita) against the inverse normal]
Both the P-P and Q-Q plots show that tot_exp_pc is not normal, with a long tail to the right. The
qnorm plot is more sensitive to deviations from normality in the tails of the distribution, whereas
the pnorm plot is more sensitive to deviations near the mean of the distribution.
From the statistics and graphs we can confidently conclude that outliers exist, especially at the
upper end of the distribution.
Since our data have a heavy right tail, we will focus on very large outliers. A customary criterion
identifies as an outlier any observation more than three standard deviations from the median.
Note that we use the median because it is a robust statistic: if there are big outliers, the mean will
shift a lot but the median will not.
/* Calculate number of standard deviations from median by urban or rural */
. use ethiopia.dta, clear
. egen median=median(tot_exp_pc), by (urban)
. egen sd=sd(tot_exp_pc), by (urban)
. gen ratio=(tot_exp_pc-median)/sd
(37 missing values generated)
. gen outlier=1 if ratio>3 & ratio~=.
(17044 missing values generated)
. replace outlier=0 if outlier==. & ratio~=.
(17007 real changes made)
288 observations are identified as outliers. When we compare the mean and median values
using the table command, the mean has dropped around 10% among urban households once
outliers are excluded, while the medians are much less sensitive to outliers.
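The two tables below were presumably produced with:
. table urban outlier, contents(mean tot_exp_pc) row col
. table urban outlier, contents(p50 tot_exp_pc) row col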
-------------------------------------------
| outlier
Urban | 0 1 Total
----------+--------------------------------
0 | 2.7359721 15.140992 2.8739015
1 | 4.8367138 28.860726 5.3692863
|
Total | 3.7820815 24.287481 4.1235417
-------------------------------------------
-------------------------------------------
| outlier
Urban | 0 1 Total
----------+--------------------------------
0 | 2.453216 11.575683 2.4679672
1 | 3.8607807 25.327541 3.9278767
|
Total | 2.9651103 22.549266 3.0001049
-------------------------------------------
Sometimes dropping outliers can greatly decrease the adverse effect of extreme
values. But it does not work in our data, as indicated by the histogram below.
. histogram tot_exp_pc if outlier==0, normal
[Figure: histogram of tot_exp_pc excluding outliers, with normal overlay]
When we are concerned about outliers or skewed distributions, the rreg command is used for
robust regression. Robust regression down-weights influential observations, so its coefficients
and standard errors differ from OLS; this is different from the regress command with the robust
option, which changes only the standard errors.
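. rreg tot_exp hhz_usu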
-----------------------------------------------------------------------------
tot_exp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
------------+----------------------------------------------------------------
hhz_usu | 1.589587 .0213129 74.58 0.000 1.547812 1.631363
_cons | 5.948286 .1133414 52.48 0.000 5.726125 6.170447
-----------------------------------------------------------------------------
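Another approach is to transform the variable. The logexp variable examined below is the
natural log of per capita expenditure, created with something like:
. gen logexp=ln(tot_exp_pc)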
[Figure: histogram of logexp]
Statistics from the summarize command also indicate an almost perfect normal distribution.
logexp
-------------------------------------------------------------
Percentiles Smallest
1% -.0896924 -.9142609
5% .2356188 -.8437076
10% .4320935 -.7062833 Obs 17295
25% .7428475 -.6452391 Sum of Wgt. 17295
Method 4. Imputation
After identifying outliers, we usually first recode them as missing values. Missing data usually
present a problem in statistical analyses. If missing values are correlated with the outcome of
interest, then ignoring them will bias the results of statistical tests. In addition, most statistical
software packages (e.g., SAS, Stata) automatically drop observations that have missing values
for any variables used in an analysis. This practice reduces the analytic sample size, lowering the
power of any test carried out.
Other than simply dropping missing values, there is more than one approach of imputation to fill
in the cell of missing value. We will only focus on single imputation, referring to fill a missing
value with one single replacement value.
The easy approach is to use arbitrary methods to impute missing data, such as mean substitution.
Substitution of the simple grand mean will reduce the variance of the variable. Reduced variance
can bias correlation downward (attenuation) or, if the same cases are missing for two variables
and means are substituted, correlation can be inflated. These effects on correlation carry over in a
regression context to lack of reliability of the beta weights and of the related estimates of the
relative importance of independent variables. That is, mean substitution in the case of one
variable can lead to bias in estimates of the effects of other or all variables in the regression
analysis, because bias in one correlation can affect the beta weights of all variables. Mean
substitution is no longer recommended.
Another approach is regression-based imputation. In this strategy, it is assumed that the same
model explains the data for the non-missing cases as for the missing cases. First the analyst
estimates a regression model in which the dependent variable has missing values for some
observations, using all non-missing data. In the second step, the estimated regression coefficients
are used to predict (impute) missing values of that variable. The proper regression model
depends on the form of the dependent variable. A probit or logit is used for binary variables,
Poisson or other count models for integer-valued variables, and OLS or related models for
continuous variables. Even though this may introduce unrealistically low levels of noise in the
data, it is a performs more robustly than mean substitution and less complex than multiple
imputation. Thus it is the preferred approach in imputation.
Assuming we already coded outliers of tot_exp_pc as missing ., now the missing values are
replaced (imputed) with predicted values.
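A sketch, assuming we impute per capita expenditure from household size and urban status (the
regressor choice and the variable name exp_hat are illustrative):
. regress tot_exp_pc hhz_usu urban
. predict exp_hat
. replace tot_exp_pc=exp_hat if tot_exp_pc==.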
There is another Stata command to perform imputation. The impute command fills in missing
values by regression and puts the imputed values into a new variable named in the generate()
option.
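For example (the regressors are again illustrative):
. impute tot_exp_pc hhz_usu urban, generate(tot_exp_i)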
0.21% (37) observations imputed
Chapter 9. Graphing Data

Graph commands that produce histograms, box plots, kdensity plots, P-P plots, and Q-Q plots
were introduced together with the discussion of normality in Chapter 8. Here we will get
acquainted with some twoway graph commands.
A two way scatterplot can be drawn using (graph) twoway scatter command to show the
relationship between two variables, tot_exp and exp_food. As we would expect, there is a
positive relationship between the two variables.
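. graph twoway scatter tot_exp exp_food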
[Figure: scatterplot of tot_exp against exp_food (Total food expenditures)]
We can show the regression line predicting tot_exp from exp_food by adding an lfit plot.
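. graph twoway (scatter tot_exp exp_food) (lfit tot_exp exp_food)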
[Figure: scatterplot of tot_exp against exp_food with fitted regression line]
[Figure: tot_exp plotted against hhz_usu (Number of usual household members)]
Chapter 10. Statistical Tests
compare command
The compare command is an easy way to check if two variables are the same. Let's first create a
variable, compare, which equals tot_exp where tot_exp is not missing and equals 0 where
tot_exp is missing.
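A sketch:
. gen compare=tot_exp
. replace compare=0 if tot_exp==.
. compare tot_exp compare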
correlate command
The correlate command displays a matrix of Pearson correlations for the variables listed.
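. correlate tot_exp hhz_usu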
| tot_exp hhz_usu
-------------+------------------
tot_exp | 1.0000
hhz_usu | 0.3357 1.0000
ttest command
We would like to see if the mean of hhz_usu equals 4 by using a single-sample t-test, testing
whether the sample was drawn from a population with a mean of 4. The ttest command is used
for this purpose.
. ttest hhz_usu=4
One-sample t test
-----------------------------------------------------------------------------
Variable | Obs Mean Std. Err. Std. Dev. [95% Conf.Interval]
---------+-------------------------------------------------------------------
hhz_usu | 17332 4.746596 .0181585 2.390587 4.711003 4.782188
-----------------------------------------------------------------------------
mean = mean(hhz_usu) t = 41.1155
Ho: mean = 4 degrees of freedom = 17331
. ttest tot_exp=exp_food
Paired t test
-----------------------------------------------------------------------------
Variable | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
---------+-------------------------------------------------------------------
tot_exp | 17295 17.25426 .1407023 18.50382 16.97847 17.53005
exp_food | 17295 8.870088 .0405354 5.330828 8.790635 8.949542
---------+-------------------------------------------------------------------
diff | 17295 8.384172 .1171933 15.41215 8.154462 8.613883
-----------------------------------------------------------------------------
mean(diff) = mean(tot_exp - exp_food) t = 71.5414
Ho: mean(diff) = 0 degrees of freedom = 17294
The t-test for independent groups comes in two varieties: pooled variance and unequal variance.
We want to look at the differences in tot_exp between rural and urban households. We will begin
with the ttest command for independent groups with pooled variance and compare the results to
the ttest command for independent groups using unequal variance.
diff| -8.197619 .2741701 -8.735033 -7.660205
-----------------------------------------------------------------------------
diff = mean(0) - mean(1) t = -29.8998
Ho: diff = 0 Satterthwaite's degrees of freedom = 12883.4
The tabulate command with the chi2 option performs a chi-square test to see if two variables are independent.
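. tabulate urban femhead, chi2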
| Female headed
| household
Urban | 0 1 | Total
-----------+----------------------+----------
0 | 6,621 2,013 | 8,634
1 | 5,116 3,545 | 8,661
-----------+----------------------+----------
Total | 11,737 5,558 | 17,295
Chapter 11. Data Management
Subset data
We can subset data by keeping or dropping variables, or by keeping or dropping observations.
Suppose our data file has many variables, but we only care about a handful of them. We
can subset the file to keep just the variables of interest. The keep command keeps the
variables in the list while dropping all other variables.
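For example, keeping only the variables used throughout these notes:
. keep hhid tot_exp regco urban hhz_usu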
Instead of keeping just a handful of variables, we might want to get rid
of just one or two variables in the data file. The drop command drops the variables in the
list while keeping all other variables.
. drop tot_exp
. keep if urban==0
(8660 observations deleted)
We want to focus on rural households in the data set, so the urban households are dropped from
the data set.
A similar concept applies to the drop if command. We can eliminate the observations with
missing values using drop if; the condition after drop if specifies which observations should
be dropped.
. drop if missing(tot_exp)
(37 observations deleted)
You can eliminate both variables and observations with the use command. Let’s read in just
hhid, tot_exp, regco, urban, hhz_usu from ethiopia.dta file.
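. use hhid tot_exp regco urban hhz_usu using ethiopia.dta, clear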
Organize data
The sort command arranges the observations of the current data into ascending order based on
the values of the variables listed. There is no limit to the number of variables in the variable list.
Missing numeric values are interpreted as being larger than any other number, so they are placed
last. When you sort on a string variable, however, null strings are placed first.
The order command helps us to organize variables in a way that makes sense by changing the
order of the variables. While there are several possible orderings that are logical, we usually put
the id variable first, followed by the demographic variables, such as region, zone, gender,
urban/rural. We will put the variables regarding the household total expenditure as follows.
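A sketch (the ordering is illustrative):
. order hhid regco urban hhz_usu tot_exp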
Using _n and _N in conjunction with the by command can produce some very useful results.
When used with by command, _N is the total number of observations within each group listed in
by command, and _n is the running counter to uniquely identify observations within the group.
To use the by command we must first sort our data on the by variable.
. sort group
. by group: generate n1=_n
. by group: generate n2=_N
. list
+-----------------------------------+
| score group id nt n1 n2 |
|-----------------------------------|
1. | 72 1 1 7 1 4 |
2. | 85 1 7 7 2 4 |
3. | 76 1 3 7 3 4 |
4. | 90 1 6 7 4 4 |
5. | 84 2 2 7 1 2 |
|-----------------------------------|
6. | 82 2 5 7 2 2 |
7. | 89 3 4 7 1 1 |
+-----------------------------------+
Now n1 is the observation number within each group and n2 is the total number of observations
for each group. This is very useful in programming, especially in identifying duplicate
observations.
. sort id
. list if id == id[_n+1]
If there are a lot of variables in the data set, it could take a long time to type them all out twice.
We can make use of the "*" and "?" wildcards to indicate that we wish to use all matching variables.
Further, we can combine the sort and by commands into a single statement. Below is a simplified
version of the code that yields the same results as above.
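. bysort group: generate n1=_n
. bysort group: generate n2=_N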
Suppose we are given one file with data for the rural households (called rural.dta) and a file for
the urban households (called urban.dta). We need to combine these files together to be able to
analyze them.
The append command does not require that the two datasets contain the same variables, even
though this is typically the case. But it is highly recommended to use an identical list of variables
for the append command, to avoid missing values coming from one dataset.
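For example:
. use rural.dta, clear
. append using urban.dta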
Assume we are working on our household expenditure data and have been given two files:
one with all the demographic information (called hhinfo.dta) and the other with total
expenditure by household (called hhexp.dta). Both data sets have been cleaned and sorted by
hhid. We would like to merge the two files together by hhid.
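With Stata 9's match-merge syntax:
. use hhinfo.dta, clear
. merge hhid using hhexp.dta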
After the merge command, a _merge variable appears. The _merge variable indicates, for each
observation, how the merge went; this is especially useful in identifying mismatched records.
_merge can have one of three values when merging file A using file B:
_merge==1    the record contains information from the master data file A only
_merge==2    the record contains information from the using data file B only
_merge==3    the record contains information from both files
When there are many records, tabulating _merge is very useful for summarizing how many
mismatched observations you have. If every record matches, the value of _merge is always 3; in
the tabulation below, however, 8,660 records came only from the using file.
. tab _merge
_merge | Freq. Percent Cum.
------------+-----------------------------------
2 | 8,660 49.97 49.97
3 | 8,672 50.03 100.00
------------+-----------------------------------
Total | 17,332 100.00
The strategy for a one-to-many merge is really the same as for the one-to-one match merge.
The order of the files to be merged makes no difference to the results; the only
difference is the order of the records after the merge.
Label data
Besides giving labels to variables, we can also label the data set itself so that we will remember
what the data are. The label data command places a label on the whole dataset.
. label data "relabeled household"
We can also add some notes to the data set. The note: (note the colon, “:”) command allows you
to place notes into the dataset.
. notes hhsize: the variable hhz_usu was renamed to hhsize
Chapter 12. Linear Regression

Regression commands
This is an example of ordinary linear regression using the regress command.
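. regress tot_exp hhz_usu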
-----------------------------------------------------------------------------
tot_exp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
------------+----------------------------------------------------------------
hhz_usu | 2.601727 .0555198 46.86 0.000 2.492902 2.710551
_cons | 4.890831 .2952526 16.56 0.000 4.312106 5.469556
-----------------------------------------------------------------------------
This regression tells us that for every extra person (hhz_usu) added to a household, total daily
expenditure (tot_exp) increases by 2.6 Ethiopian Birr. This increase is statistically significant,
as indicated by the 0.000 probability associated with this coefficient.
The other important piece of information is the r-squared (r2), which equals 0.1127. In essence,
this value tells us that our independent variable (hhz_usu) accounts for approximately 11% of
the variation in the dependent variable (tot_exp).
We can run the regression with robust standard errors, which relax the assumption that the
residuals are independently and identically distributed. This is very useful when there is
heterogeneity of variance. The robust option does not affect the estimates of the regression coefficients.
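. regress tot_exp hhz_usu, robust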
The regress command without any arguments redisplays the last regression analysis.
Extract results
Stata stores results from estimation commands in e(), and you can see a list of what exactly is
stored using the ereturn list command.
. ereturn list
scalars:
e(N) = 17295
e(df_m) = 1
e(df_r) = 17293
e(F) = 2195.973818320767
e(r2) = .112677755062528
e(rmse) = 17.43069424443639
e(mss) = 667200.7528912034
e(rss) = 5254116.658171482
e(r2_a) = .1126264439976499
e(ll) = -73972.67623316433
e(ll_0) = -75006.45947850558
macros:
e(title) : "Linear regression"
e(depvar) : "tot_exp"
e(cmd) : "regress"
e(properties) : "b V"
e(predict) : "regres_p"
e(model) : "ols"
e(estat_cmd) : "regress_estat"
matrices:
e(b) : 1 x 2
e(V) : 2 x 2
functions:
e(sample)
Using the generate command, we can extract those results, such as estimated coefficients and
standard errors, to be used in other Stata commands.
. gen intercept=_b[_cons]
. display intercept
4.890831
. gen slope=_b[hhz_usu]
. display slope
2.6017268
The estimates table command displays a table with coefficients and statistics for one or more
estimation sets in parallel columns. In addition, standard errors, t statistics, p-values, and scalar
statistics may be listed by b, se, t, p options.
. estimates table, b se t p
---------------------------
Variable | active
-------------+-------------
hhz_usu | 2.6017268
| .05551983
| 46.86
| 0.0000
_cons | 4.890831
| .29525262
| 16.56
| 0.0000
---------------------------
Prediction commands
The predict command computes predicted value and residual for each observation. The default
shown below is to calculate the predicted tot_exp.
. predict pred
(option xb assumed; fitted values)
When using the resid option the predict command calculates the residual.
. predict e, residual
We can plot the predicted value and observed value using graph twoway command.
. regress tot_exp hhz_usu
. predict pred
. graph twoway (scatter tot_exp hhz_usu) (line pred hhz_usu)
[Figure: scatter of tot_exp against hhz_usu (Number of usual household members) with fitted line]
The rvfplot command is a convenience command that generates a plot of the residual versus the
fitted values. It is used after regress command.
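. rvfplot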
[Figure: residuals versus fitted values (rvfplot)]
The rvpplot command is another convenience command, which plots the residuals against a
specified predictor; it is also used after regress. In this example it produces the same graph as
above.
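For the model above, the two commands would be typed as follows (rvpplot takes the predictor
as an argument):
. rvfplot
. rvpplot hhz_usu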
Hypothesis tests
The test command performs Wald tests of simple and composite linear hypotheses about the
parameters of the most recent estimation.
. gen regco1=0
. replace regco1=1 if regco==1
(1252 real changes made)
. gen regco2=0
. replace regco2=1 if regco==2
(792 real changes made)
. gen regco3=0
. replace regco3=1 if regco==3
(3340 real changes made)
. gen regco4=0
. replace regco4=1 if regco==4
(3728 real changes made)
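As an aside, indicator variables for every level of regco can be created in one step with the
generate() option of tabulate (the stem regcod here is an arbitrary choice):
. tab regco, gen(regcod)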
. regress tot_exp hhz_usu regco1 regco2 regco3 regco4
-----------------------------------------------------------------------------
tot_exp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
------------+----------------------------------------------------------------
hhz_usu | 2.586416 .0556412 46.48 0.000 2.477353 2.695478
regco1 | -1.656508 .5287445 -3.13 0.002 -2.6929 -.6201148
regco2 | -1.59704 .650351 -2.46 0.014 -2.871794 -.3222864
regco3 | -1.721974 .3587041 -4.80 0.000 -2.425071 -1.018878
regco4 | -2.516142 .3437652 -7.32 0.000 -3.189956 -1.842327
_cons | 6.028982 .333079 18.10 0.000 5.376113 6.68185
-----------------------------------------------------------------------------
. test regco1=0
( 1) regco1 = 0
F( 1, 17289) = 9.82
Prob > F = 0.0017
. test regco1=regco2=regco3=regco4
( 1) regco1 - regco2 = 0
( 2) regco1 - regco3 = 0
( 3) regco1 - regco4 = 0
F( 3, 17289) = 1.66
Prob > F = 0.1726
test and predict are commands that can be used in conjunction with all of the above estimation
procedures.
The suest command combines the estimation results from regressions (including parameter
estimates and associated covariance matrices) into a single parameter vector and simultaneous
covariance matrix of the sandwich/robust type.
Typical applications of the suest command are tests of intra-model and cross-model hypotheses
using the test or testnl commands, such as a generalized Hausman specification test or a Chow
test for structural break.
Before we perform any test using suest, we must first save each set of estimation results with the
estimates store command.
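The commands that produced the output below are not shown; a plausible reconstruction,
assuming an urban indicator variable as used elsewhere in these notes, is:
. regress tot_exp hhz_usu if urban==1
. estimates store urban
. regress tot_exp hhz_usu if urban==0
. estimates store rural
. suest urban rural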
Simultaneous results for urban, rural
Number of obs = 17295
-----------------------------------------------------------------------------
| Robust
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
------------+----------------------------------------------------------------
urban_mean |
hhz_usu| 3.43175 .1170849 29.31 0.000 3.202268 3.661232
_cons| 5.729652 .4250396 13.48 0.000 4.89659 6.562714
------------+----------------------------------------------------------------
urban_lnvar |
_cons| 6.092408 .1238756 49.18 0.000 5.849617 6.3352
------------+----------------------------------------------------------------
rural_mean |
hhz_usu| 1.932295 .0492704 39.22 0.000 1.835727 2.028863
_cons| 3.576649 .2157359 16.58 0.000 3.153814 3.999484
------------+----------------------------------------------------------------
rural_lnvar |
_cons| 4.748265 .3268713 14.53 0.000 4.107609 5.388921
-----------------------------------------------------------------------------
We would like to test whether the hhz_usu coefficients are zero using the test command.
. test hhz_usu
( 1) [urban_mean]hhz_usu = 0
( 2) [rural_mean]hhz_usu = 0
chi2( 2) = 2397.14
Prob > chi2 = 0.0000
Next we want to see if the same hhz_usu coefficient holds for rural and urban households. We
can type
. test [urban_mean]hhz_usu=[rural_mean]hhz_usu
( 1) [urban_mean]hhz_usu - [rural_mean]hhz_usu = 0
chi2( 1) = 139.33
Prob > chi2 = 0.0000
. test ([urban_mean]hhz_usu=[rural_mean]hhz_usu)
([urban_mean]_cons=[rural_mean]_cons)
( 1) [urban_mean]hhz_usu - [rural_mean]hhz_usu = 0
( 2) [urban_mean]_cons - [rural_mean]_cons = 0
chi2( 2) = 1414.59
Prob > chi2 = 0.0000
This is equivalent to using the accumulate option of the test command, which tests a hypothesis
jointly with previously tested hypotheses.
. test ([urban_mean]hhz_usu=[rural_mean]hhz_usu)
. test ([urban_mean]_cons=[rural_mean]_cons) , accumulate
( 1) [urban_mean]hhz_usu - [rural_mean]hhz_usu = 0
( 2) [urban_mean]_cons - [rural_mean]_cons = 0
chi2( 2) = 1414.59
Prob > chi2 = 0.0000
Heteroskedasticity
We can always visually check how well the regression surface fits the data by plotting residuals
versus fitted values, as with the rvfplot or rvpplot commands. In addition, a number of statistical
tests for heteroskedasticity in regression errors are available.
We can use the hettest command, which runs an auxiliary regression of ln(e_i^2) on the fitted
values.
. hettest
chi2(1) = 33821.20
Prob > chi2 = 0.0000
We can also use the information matrix test via the imtest command, which provides a summary
test of violations of the assumptions on regression errors.
. imtest
---------------------------------------------------
Source | chi2 df p
---------------------+-----------------------------
Heteroskedasticity | 216.69 2 0.0000
Skewness | 49.16 1 0.0000
Kurtosis | 4.31 1 0.0379
---------------------+-----------------------------
Total | 270.17 4 0.0000
---------------------------------------------------
For the next two tests, we first need to download the programs from the internet at
http://econpapers.repec.org/software/bocbocode/s390601.htm
and
http://econpapers.repec.org/software/bocbocode/s390602.htm
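Since both programs are archived in the Boston College software archive, they can likely also be
located and installed from within Stata (a sketch; the package names are assumed to match the
command names):
. findit bpagan
. findit whitetst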
The bpagan command computes the Breusch-Pagan Lagrange multiplier test for
heteroskedasticity in the error distribution, conditional on a set of variables which are presumed
to influence the error variance. The test statistic, a Lagrange multiplier measure, is distributed
Chi-squared(p) under the null hypothesis of homoskedasticity.
. gen food2=exp_food^2
. gen lgfood=log(exp_food)
. regress tot_exp exp_food
. bpagan exp_food food2 lgfood
Breusch-Pagan LM statistic: 47406.9 Chi-sq( 3) P-value = 0
The whitetst command computes White's test for heteroskedasticity following regression. This
test is a special case of the Breusch-Pagan test (bpagan), but does not require specifying a list of
variables, as that list is constructed from the explanatory variables of the regression.
Alternatively, whitetst can perform a specialized form of the test which economizes on degrees
of freedom.
. whitetst
In our example, the explanatory variable regco has 11 levels and requires 10 dummy variables.
The test command is used to test the collective effect of the 10 dummy-coded variables; in other
words, it tests the main effect of the variable regco. Note that the dummy-coded variable names
must be written exactly as they appear in the regression results, including the uppercase I.
( 1) _Iregco_2 = 0
( 2) _Iregco_3 = 0
( 3) _Iregco_4 = 0
( 4) _Iregco_5 = 0
( 5) _Iregco_6 = 0
( 6) _Iregco_7 = 0
( 7) _Iregco_12 = 0
( 8) _Iregco_13 = 0
( 9) _Iregco_14 = 0
(10) _Iregco_15 = 0
The xi prefix can also be used to create dummy variables for regco and for the interaction term
of regco and hhz_usu. The first test command tests the overall interaction and the second tests
the main effect of regco; a sketch follows.
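The original commands are not reproduced here; a sketch of such a specification might be
(testparm tests the listed coefficients jointly; the xi-generated names are assumptions):
. xi: regress tot_exp i.regco*hhz_usu
. testparm _IregX*
. testparm _Iregco_*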
By default, Stata selects the first category of a categorical variable as the reference category. If
we would like to declare a certain category as the reference, the char command is needed. In the
model above, to use region 5 as the reference region, the command is
. char regco[omit] 5
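After setting this characteristic, the xi model must be re-run for the new reference category to
take effect, e.g.
. xi: regress tot_exp hhz_usu i.regco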
Some estimation procedures in Stata are included here:
Chapter 13. Logistic Regression
Logistic regression
We are not going to discuss the theory behind logistic regression per se, but focus on how to
perform logistic regression analyses and interpret the results using Stata. It is assumed that users
are familiar with logistic regression.
We will use the ethiopia1.dta dataset, which adds one binary response variable called poverty.
The logistic command by default reports odds ratios, but it can display the coefficients if the coef
option is used.
The exact same results can be obtained by using the logit command.
. logit poverty hhz_usu agehhh
The xi prefix can also be used in a logistic model to include categorical variables.
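For instance, the specification used later in this chapter includes the urban dummy via xi:
. xi: logit poverty hhz_usu agehhh i.urban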
Extract results
We can use the ereturn or estat commands to retrieve results after estimation, just as with other
regression commands.
scalars:
e(N) = 17332
e(ll_0) = -10903.56206621753
e(ll) = -9782.301403136013
e(df_m) = 3
e(chi2) = 2242.521326163038
e(r2_p) = .102834344984885
macros:
e(title) : "Logistic regression"
e(depvar) : "poverty"
e(cmd) : "logit"
e(crittype) : "log likelihood"
e(predict) : "logit_p"
e(properties) : "b V"
e(estat_cmd) : "logit_estat"
e(chi2type) : "LR"
matrices:
e(b) : 1 x 4
e(V) : 4 x 4
functions:
e(sample)
. estat summarize
-------------------------------------------------------------
Variable | Mean Std. Dev. Min Max
-------------+-----------------------------------------------
poverty | .3229864 .467631 0 1
hhz_usu | 4.746596 2.390587 1 18
agehhh | 43.60322 14.90114 13 99
_Iurban_1 | .5003462 .5000143 0 1
-------------------------------------------------------------
. estat ic
-----------------------------------------------------------------------------
Model | Obs ll(null) ll(model) df AIC BIC
------------+----------------------------------------------------------------
. | 17332 -10903.56 -9782.301 4 19572.6 19603.64
-----------------------------------------------------------------------------
Marginal effects
We use the mfx command to numerically calculate the marginal effects or elasticities and their
standard errors after estimation. Several options are available for the calculation of marginal
effects; an example follows the list.
dydx is the default.
eyex specifies that elasticities be calculated in the form of d(lny)/d(lnx)
dyex specifies that elasticities be calculated in the form of d(y)/d(lnx)
eydx specifies that elasticities be calculated in the form of d(lny)/d(x)
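A minimal sketch after the logit model above (mfx with no options reports the default dy/dx;
adding the eyex option requests elasticities):
. mfx
. mfx, eyex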
Hypothesis Tests
Likelihood-ratio test
The lrtest command performs a likelihood-ratio test for the null hypothesis that the parameter
vector of a statistical model satisfies some smooth constraint. To conduct the test, both the
unrestricted and the restricted models must be fitted using the maximum likelihood method (or
some equivalent method), and the results of at least one must be stored using estimates store.
The lrtest command provides an important alternative to Wald testing for models fitted by
maximum likelihood. Wald testing requires fitting only one model (the unrestricted model) and
is hence computationally more attractive than likelihood-ratio testing. Most statisticians,
however, favor likelihood-ratio testing whenever feasible, since the null distribution of the LR
test statistic is often more closely chi-square distributed than that of the Wald test statistic.
We would like to see whether introducing the urban dummy helps our estimation. We perform a
likelihood-ratio test using the lrtest command.
. xi: logit poverty hhz_usu agehhh i.urban
. estimates store m1
. logit poverty hhz_usu agehhh
. lrtest m1
Likelihood-ratio test LR chi2(1) = 1256.08
(Assumption: . nested in m1) Prob > chi2 = 0.0000
. findit listcoef
These add-on programs ease the running and interpretation of ordinal logistic models.
Alternatively, you can install the complete spostado package by clicking on one of the links
under web resources.
One useful command is prchange, which computes discrete and marginal change for regression
models for categorical and count variables. Marginal change is the partial derivative of the
predicted probability or predicted rate with respect to the independent variables. Discrete change
is the difference in the predicted value as one independent variable changes values while all
others are held constant at specified values.
The discrete change is computed when a variable changes from its minimum to its maximum
(Min->Max), from 0 to 1 (0->1), from its specified value minus .5 units to its specified value
plus .5 (-+1/2), and from its specified value minus .5 standard deviations to its value plus .5
standard deviations (-+sd/2).
. xi: logit poverty hhz_usu agehhh i.urban
. prchange
0 1
Pr(y|x) 0.7007 0.2993
After estimating a regression model, the prtab command presents a one- to four-way table of
predicted values (probabilities, rates) for different combinations of values of the independent
variables.
. prtab hhz_usu
. prtab hhz_usu, x(agehhh=50 _Iurban_1=0)
----------------------
Number of |
usual |
household |
members | Prediction
----------+-----------
1 | 0.2634
2 | 0.3070
3 | 0.3543
4 | 0.4047
5 | 0.4571
6 | 0.5106
7 | 0.5637
8 | 0.6155
9 | 0.6647
10 | 0.7107
11 | 0.7526
12 | 0.7903
13 | 0.8236
14 | 0.8526
15 | 0.8775
16 | 0.8987
17 | 0.9166
18 | 0.9316
----------------------
We interpret the results from the prtab command in this way: for a rural household whose head
is aged 50, the predicted probability that the household is in poverty increases from 0.2634 to
0.9316 as the number of household members goes up.
The prgen command computes predicted values and confidence intervals for regression with
continuous, categorical, and count outcomes in a way that is useful for making plots. Predicted
values are computed for the case in which one independent variable varies over a specified range
while the others are held constant. You can request variables containing upper and lower bounds
for these variables. You can also create a variable containing the marginal change in the outcome
with respect to the specified variable, holding other variables constant. New variables are added
to the existing dataset that contain these predicted values that can be plotted.
In the first prgen command, we generate the variable urbanpovertyp1 for the predicted
probability of being poor and urbanpovertyp0 for being not poor. Since we know from the prtab
command that hhz_usu takes values from 1 to 18, we use the ncases() option to define the
number of predicted values as urbanpovertyx varies from the start value 1 to the end value 18.
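The prgen commands themselves are not reproduced above; based on the description, they were
presumably along these lines (the ruralpoverty prefix and the graph command are assumptions):
. prgen hhz_usu, from(1) to(18) ncases(18) gen(urbanpoverty) x(_Iurban_1=1)
. prgen hhz_usu, from(1) to(18) ncases(18) gen(ruralpoverty) x(_Iurban_1=0)
. graph twoway (line urbanpovertyp1 urbanpovertyx) (line ruralpovertyp1 ruralpovertyx)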
(Graph: predicted probability of poverty against the number of usual household members,
plotted separately for urban and rural households.)
Stata has a variety of commands for performing estimation when the dependent variable is
dichotomous or polychotomous. Here is a list of some estimation commands for discrete
dependent variables. See the estimation commands entry in the manuals for a complete list of all
of Stata's estimation commands.
logistic logistic regression
logit maximum-likelihood logit regression
mlogit maximum-likelihood multinomial logit models
mprobit multinomial probit regression
nbreg maximum-likelihood negative binomial regression
nlogit nested logit regression
ologit maximum-likelihood ordered logit
oprobit maximum-likelihood ordered probit
probit maximum-likelihood probit estimation
rologit rank-ordered logistic regression
scobit skewed logistic regression
slogit stereotype logistic regression
xtcloglog random-effects and population-averaged cloglog models
xtgee GEE population-averaged generalized linear models
xtlogit fixed-effects, random-effects, and population-averaged logit models
xtprobit random-effects and population-averaged probit models
Chapter 14. Simulation
Class
Before we get started, let’s first classify Stata commands into three categories:
r-class: general commands, such as summarize. Results are returned in r() and generally must
be used before executing further commands, since later r-class commands overwrite them.
. summarize tot_exp
. return list
scalars:
r(N) = 17295
r(sum_w) = 17295
r(mean) = 17.25426066783716
r(Var) = 342.3914311936321
r(sd) = 18.50382206987606
r(min) = .6677808165550232
r(max) = 602.4645385742188
r(sum) = 298412.4382502437
We can save results from the summarize command into our dataset.
. gen mean=r(mean)
e-class: estimation commands, such as regress and logistic, that fit statistical models. Such
estimation results stay around until the next model is fitted and are returned in e().
The ereturn list command lists the results stored in e(). We have seen examples of ereturn in the
regression and logistic chapters before.
There are also s-class, n-class, and c-class commands, but we will skip them for now.
Program command
The program command defines and manipulates programs. A program returns results in r() if it
is declared r-class and uses the return command, or in e() if it is declared e-class and uses the
ereturn command. We tell Stata that the program is finished with the end command.
Let’s write a very simple program called t, to display “hello” on the screen.
program t, rclass
display "hello"
end
When we would like to use program t to display a “hello” on the screen, we simply type
. t
hello
The program command allows users to write their own programs and use them later, which
provides great flexibility and efficiency. Let's look at another example of a program that
calculates per capita expenditure and returns the result as a scalar. We name the program
myprog.
The program executes like this: first it calls summarize and stores the mean of the variable
tot_exp in a local macro exp. It then repeats this procedure for the second variable hhz_usu and
stores the mean in another local macro people. Finally, the ratio of the two means is computed
and returned as a scalar in the saved result we call r(pcexp).
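The program listing is not reproduced here; a minimal sketch consistent with the description,
taking the two variable names as arguments as in the bootstrap call in the next section, is:
capture program drop myprog
program myprog, rclass
    args expvar sizevar
    * mean of the expenditure variable
    quietly summarize `expvar'
    local exp = r(mean)
    * mean of the household-size variable
    quietly summarize `sizevar'
    local people = r(mean)
    * return the ratio of the two means as a saved result
    return scalar pcexp = `exp'/`people'
end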
. return list
. display "Average per capita expenditure is " r(pcexp)
Average per capita expenditure is 3.635081
Simulation
Bootstrap command
The bootstrap command handles repeatedly drawing samples with replacement, running the
user-written program or command, collecting the results into a new dataset, and presenting the
bootstrap results. Users can supply an expression that is a function of the saved results of
existing commands, or they can write a program to calculate the statistics of interest. Such a
calculation program is easy to write because every Stata command saves the statistics it
calculates.
For instance, assume that we wish to obtain the bootstrap estimate of the standard error of the
mean of variable tot_exp. Stata has a built-in command, summarize, that calculates and displays
summary statistics including means. In addition to displaying the calculated results, the
summarize command saves them in the form of r().
. sum tot_exp
. return list
scalars:
r(N) = 17295
r(sum_w) = 17295
r(mean) = 17.25426066783716
r(Var) = 342.3914311936321
r(sd) = 18.50382206987606
r(min) = .6677808165550232
r(max) = 602.4645385742188
r(sum) = 298412.4382502437
In order to get a bootstrap estimate of the standard error of the mean, all we need to do is type
the bootstrap command and Stata will do the work. The reps() option specifies the number of
bootstrap replications to be performed; the default is 50. It is recommended to choose a large but
tolerable number of replications to obtain the bootstrap estimates.
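A sketch of the command, mirroring the seeded version shown later in this section (the note
below is part of the bootstrap output):
. bootstrap r(mean), reps(50): sum tot_exp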
If the assumption is not true, press Break, save the data, and drop the
observations that are to be excluded. Be sure that the dataset in memory
contains only the relevant data.
-----------------------------------------------------------------------------
| Observed Bootstrap Normal-based
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
------------+----------------------------------------------------------------
_bs_1 | 17.25426 .124953 138.09 0.000 17.00936 17.49916
-----------------------------------------------------------------------------
By default, bootstrap samples have the same size as the full dataset. If we would like to specify
the size of the sample to be drawn instead, we can set the sample size with the size() option.
Note that the sample size should always be less than or equal to the number of observations.
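For example, to draw bootstrap samples of 1,000 observations (an illustrative size):
. bootstrap r(mean), reps(50) size(1000): sum tot_exp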
-----------------------------------------------------------------------------
| Observed Bootstrap Normal-based
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
------------+----------------------------------------------------------------
_bs_1 | 17.25426 .1963464 87.88 0.000 16.86943 17.63909
-----------------------------------------------------------------------------
We may also want to look at the resulting mean for each bootstrap replicate. If we add the
saving() option, the replication results are saved to a specified Stata data file that we can open
and examine.
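A sketch, with bsmean as a hypothetical file name:
. bootstrap r(mean), reps(50) saving(bsmean, replace): sum tot_exp
. use bsmean, clear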
Let’s compare the bootstrap estimates to OLS, and see the difference.
. reg tot_exp hhz_usu
-----------------------------------------------------------------------------
tot_exp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
------------+----------------------------------------------------------------
hhz_usu | 2.601727 .0555198 46.86 0.000 2.492902 2.710551
_cons | 4.890831 .2952526 16.56 0.000 4.312106 5.469556
-----------------------------------------------------------------------------
Another option, seed(), sets the random-number seed for bootstrapping. We can change the
random-number seed and obtain the bootstrap estimates again, using the same number of
replications. If the results change dramatically, the number of replications we chose is too small
and we should pick a larger number. If the results are similar enough, we probably have a large
enough number.
. bootstrap r(mean), rep(50) seed(123456): sum tot_exp
-----------------------------------------------------------------------------
| Observed Bootstrap Normal-based
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
------------+----------------------------------------------------------------
_bs_1 | 17.25426 .1315345 131.18 0.000 16.99646 17.51206
-----------------------------------------------------------------------------
We might agree that 50 replicates are good enough to serve our purpose.
For an example of a situation where we need to write a program, consider bootstrapping per
capita expenditure. We use the calculation routine myprog, defined with the program command
at the beginning of this chapter.
With our program written, we can now obtain the bootstrap estimate by simply typing
use "u:\notes\ethiopia.dta", clear
bootstrap pcexpd=r(pcexp), rep(100) seed(123456) size(1000): myprog tot_exp
hhz_usu
Jackknife command
The jackknife command calculates estimates by leaving one observation out of the sample at a
time, so no resampling is needed for jackknife estimation. The jackknife is a reliable method for
estimating standard errors nonparametrically. The method is easy to use, but it can be extremely
computationally intensive.
The eclass, rclass, and n() options specify where the called command saves its results and the
number of observations on which the calculated statistic is based.
We can estimate the standard error of the standard deviation of tot_exp, so we type
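A sketch of such a command, using the fact that summarize saves the standard deviation in
r(sd):
. jackknife sd=r(sd): summarize tot_exp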
It takes at least 15 minutes for my computer to execute this command, and the output is
-----------------------------------------------------------------------------
| Jackknife
| Coef. Std. Err. t P>|t| [95% Conf. Interval]
------------+----------------------------------------------------------------
sd | 18.50382 .9552704 19.37 0.000 16.6314 20.37625
-----------------------------------------------------------------------------
Simulate command
The syntax of the simulate command is
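In outline (paraphrasing the syntax diagram):
simulate exp_list, reps(#) [options]: command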
The simulate command performs Monte Carlo-type simulations by running the specified
command for # replications and saving the results listed in exp_list. Most Stata commands and
user-defined programs can be used with simulate. The reps() option is required and specifies the
number of replications to be performed.
Let’s make a dataset containing means and variances of 100-observation samples from a
standard normal distribution by performing the experiment 10,000 times:
capture program drop normalsim
program define normalsim, rclass
version 9
syntax [, obs(integer 1) mu(real 0) sigma(real 1) ]
drop _all
set obs `obs'
tempvar z
gen `z' = `mu' + `sigma'*invnorm(uniform())
summarize `z'
return scalar mean = r(mean)
return scalar Var = r(Var)
end
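By analogy with the 50-observation example below, this experiment would be run as
. simulate mean=r(mean) var=r(Var), reps(10000): normalsim, obs(100)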
The resulting dataset contains the means and variances of 10,000 samples of 100 observations
each.
. sum
To make a dataset containing means and variances of 50-observation samples from a normal
distribution with a mean of -3 and a standard deviation of 7, we perform the experiment 10,000
times:
. simulate mean=r(mean) var=r(Var), reps(10000): normalsim, obs(50) mu(-3) sigma(7)
Chapter 15. System Equations
The most common problem in estimating systems of equations is seemingly unrelated
regressions (SURE). The sureg command fits SURE models.
Let's borrow an example from the internet on student scores (details can be found at
http://www.ats.ucla.edu/stat/stata/notes/hsb2.dta); we are interested in estimating a SURE
model with two equations:
. describe
. sureg (read write math science) (socst write math science)
Now, say that we would like to constrain the write coefficient to be the same in the read and
socst equations. The constraint command is used to define a constraint named 1.
. constraint define 1 [read]write = [socst]write
. sureg (read write math science) (socst write math science), constraint(1)
Extract results
We can first use the estat command to inspect the covariance matrix of the estimates.
. estat vce
             |                     read                       |                     socst
        e(V) |     write       math    science      _cons     |     write       math    science      _cons
-------------+------------------------------------------------+------------------------------------------------
read         |
       write |  .00476021
        math | -.00206069  .00545882
     science | -.00136975 -.00213253  .00448292
       _cons |  -.0717133 -.06805451  -.0478838  10.090327
-------------+------------------------------------------------+------------------------------------------------
socst        |
       write |  .00181246 -.00078461 -.00052153 -.02730498    |  .00663256
        math | -.00078461  .00207846 -.00081197 -.02591189    | -.00287123  .00760595
     science | -.00052153 -.00081197  .00170688 -.01823185    | -.00190852 -.00297133   .0062462
       _cons | -.02730498 -.02591189 -.01823185  3.8419121    | -.09992051  -.0948226 -.06671808  14.059187
Chapter 16. Simultaneous Equations
Sometimes we need to consider simultaneous equations, where the model equations are jointly
determined. Variables determined within the model are endogenous variables, and variables
determined outside the model are exogenous.
Here we will estimate a simultaneous equation model from Greene's "Econometric Analysis"
(2000, p. 655). This is Klein's small, dynamic model of consumption, investment, private wages,
equilibrium demand, private profits, and capital stock. There is more than one way to estimate
this model.
In this model, c, i, and w are the endogenous variables, while g, t, wg, and yr are exogenous
variables. We usually use instrumental variables to estimate simultaneous equations, since this
yields consistent estimates where ordinary least squares does not.
. ivreg c p1 (p w = t wg g yr p1 x1 k1)
-----------------------------------------------------------------------------
c | Coef. Std. Err. t P>|t| [95% Conf. Interval]
------------+----------------------------------------------------------------
p | .0173022 .1312046 0.13 0.897 -.2595153 .2941197
w | .8101827 .0447351 18.11 0.000 .7158 .9045654
p1 | .2162338 .1192217 1.81 0.087 -.0353019 .4677696
_cons | 16.55476 1.467979 11.28 0.000 13.45759 19.65192
-----------------------------------------------------------------------------
Instrumented: p w
Instruments: p1 t wg g yr x1 k1
-----------------------------------------------------------------------------
/* additional code to get correct standard errors, thanks to Kit Baum */
. mat vpr=e(V)*e(df_r)/e(N)
. mat se=e(b)
. local nc=colsof(se)
. forv i=1/`nc' {
    mat se[1,`i']=sqrt(vpr[`i',`i'])
  }
. mat list se
se[1,4]
p w p1 _cons
y1 .11804942 .04024972 .10726797 1.3207925
-----------------------------------------------------------------------------
| Coef. Std. Err. t P>|t| [95% Conf. Interval]
------------+----------------------------------------------------------------
c |
p | .0173022 .1180494 0.15 0.885 -.2317603 .2663647
p1 | .2162338 .107268 2.02 0.060 -.0100818 .4425495
w | .8101827 .0402497 20.13 0.000 .7252632 .8951022
_cons | 16.55476 1.320793 12.53 0.000 13.76813 19.34139
-----------------------------------------------------------------------------
Endogenous variables: c p w
Exogenous variables: t wg g yr p1 x1 k1
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
| Coef. Std. Err. t P>|t| [95% Conf. Interval]
------------+----------------------------------------------------------------
c |
p | .0173022 .1180494 0.15 0.884 -.2196919 .2542963
p1 | .2162338 .107268 2.02 0.049 .0008844 .4315833
w | .8101827 .0402497 20.13 0.000 .729378 .8909874
_cons | 16.55476 1.320793 12.53 0.000 13.90316 19.20636
------------+----------------------------------------------------------------
i |
p | .1502219 .1732292 0.87 0.390 -.1975503 .4979941
p1 | .6159434 .1627853 3.78 0.000 .2891382 .9427486
k1 | -.1577876 .0361262 -4.37 0.000 -.2303141 -.0852612
_cons | 20.27821 7.542704 2.69 0.010 5.135599 35.42082
------------+----------------------------------------------------------------
wp |
x | .4388591 .0356319 12.32 0.000 .3673251 .5103931
x1 | .1466739 .0388361 3.78 0.000 .0687071 .2246406
yr | .1303956 .029141 4.47 0.000 .0718927 .1888985
_cons | 1.500296 1.147779 1.31 0.197 -.8039674 3.804559
-----------------------------------------------------------------------------
Endogenous variables: c p w i wp x
Exogenous variables: t wg g yr p1 x1 k1
-----------------------------------------------------------------------------
Method 4. Use reg3 for 3SLS
We can run 3SLS using the reg3 command to obtain full-information estimates for the system of
consumption, investment, and wage equations.
. reg3 (c p p1 w) (i p p1 k1) (wp x x1 yr), 3sls inst(t wg g yr p1 x1 k1)
Three-stage least squares regression
----------------------------------------------------------------------
Equation Obs Parms RMSE "R-sq" chi2 P
----------------------------------------------------------------------
c 21 3 .9443305 0.9801 864.5909 0.0000
i 21 3 1.446736 0.8258 162.9808 0.0000
wp 21 3 .7211282 0.9863 1594.751 0.0000
----------------------------------------------------------------------
-----------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
------------+----------------------------------------------------------------
c |
p | .1248904 .1081291 1.16 0.248 -.0870387 .3368194
p1 | .1631439 .1004382 1.62 0.104 -.0337113 .3599992
w | .790081 .0379379 20.83 0.000 .715724 .8644379
_cons | 16.44079 1.304549 12.60 0.000 13.88392 18.99766
------------+----------------------------------------------------------------
i |
p | -.0130791 .1618962 -0.08 0.936 -.3303898 .3042316
p1 | .7557238 .1529331 4.94 0.000 .4559805 1.055467
k1 | -.1948482 .0325307 -5.99 0.000 -.2586072 -.1310893
_cons | 28.17785 6.793768 4.15 0.000 14.86231 41.49339
------------+----------------------------------------------------------------
wp |
x | .4004919 .0318134 12.59 0.000 .3381388 .462845
x1 | .181291 .0341588 5.31 0.000 .1143411 .2482409
yr | .149674 .0279352 5.36 0.000 .094922 .2044261
_cons | 1.797216 1.115854 1.61 0.107 -.3898181 3.984251
-----------------------------------------------------------------------------
Endogenous variables: c p w i wp x
Exogenous variables: t wg g yr p1 x1 k1
-----------------------------------------------------------------------------
Chapter 17. Troubleshooting and Update
The help command followed by a Stata command brings up the on-line help system for that
command. It can be used from the command line or from the help window. With help you must
spell the full name of the command completely and correctly.
. help regress
The help contents will list all commands that can be accessed using help command.
. help contents
The search command looks for the term in help files, Stata Technical Bulletins and Stata FAQs.
It can be used from the command line or from the help window.
. search logit
The findit command can be used to search the Stata site and other sites for Stata-related
information, including ado files. Say that we are interested in panel data; we can search for
related programs from within Stata by typing
. findit panel data
The Stata Viewer window appears and we are shown a number of resources related to this
keyword.
Stata is composed of an executable file and official ado files. Ado stands for automatically
loaded do file. An ado file is a Stata command created by users like you; once installed on your
computer, it works pretty much the same way as a built-in Stata command. Stata files are
regularly updated, so it is important to make sure that you are always running the most
up-to-date Stata.
The update command reports on the current update level and installs official updates to Stata. It
keeps users up to date with the latest Stata ado and executable files, and it copies and installs the
ado files into the specified directory.
. update
. update ado, into(d:\ado)
You can keep track of all the user-written ado files that you have added over time with the ado
command, which will list all of them, with information on where you got each package and what
it does.
. ado
[1] package spost9_ado from http://www.indiana.edu/~jslsoc/stata
spost9_ado Stata 9 commands for the post-estimation interpretation of
(package uninstalled)
Chapter 18. Advanced Programming
Besides simple one-line commands, we can always get more out of Stata through more
sophisticated programming.
Looping
Consider the sample program below, which reads in income data for twelve months. Say that we
want to compute the amount of tax (10%) paid for each month, which means computing 12 new
variables by multiplying each of the inc* variables by 0.10.
There is more than one way to execute part of your do file more than once.
In the example below, we use the foreach command to cycle through the variables inc1 to inc12
and compute the taxable income as taxinc1-taxinc12.
foreach var of varlist inc1-inc12 {
generate tax`var' = `var' * .10
}
The initial foreach statement tells Stata that we want to cycle through the variables inc1 to inc12
using the statements that are surrounded by the curly braces. Note that the opening curly brace
must appear at the end of the foreach command line. The first time we cycle through the
statements, the value of var will be inc1; the second time the value of var will be inc2, and so on
until the final iteration, where the value of var will be inc12. Each statement within the loop (in
this case, just the one generate statement) is evaluated and executed. When we are inside the
foreach loop, we can access the value of var by surrounding it with the funny quotation marks,
like this: `var'. The ` is the quote right below the ~ on your keyboard and the ' is the quote below
the " on your keyboard. The first time through the loop, `var' is replaced with inc1, so the
statement
generate tax`var' = `var' * .10
becomes
generate taxinc1 = inc1 * .10
This is repeated for inc2 and then inc3 and so on until inc12. So, this foreach loop is the
equivalent of executing the 12 generate commands manually, but much easier and less error
prone.
3. The third way is to use a while loop.
First we define a Stata local variable that is going to be the loop counter. Similar to the foreach
command, the code refers to the local variable with the quote syntax, here `i'.
local i=1
while `i'<=12 {
generate taxinc`i'=inc`i'*0.10
local i=`i'+1
}
The local variable i can be seen as a counter, and the while command states how many times the
commands within the while loop are going to be repeated. The statement basically says: continue
until the counter value exceeds the limit of 12. Note that the opening curly brace must appear at
the end of the while command line. All commands between the curly braces are executed each
time the system goes through the while loop. So first the statement
generate taxinc`i'=inc`i'*0.10
becomes
generate taxinc1=inc1*0.10
The counter value is then increased by 1 unit. Note that the fourth line means the value of the
local variable i will be increased by 1 from its current value stored in `i'.
1. Use the gen and egen commands to create a running population total, then use the generate
and replace commands to create the quintile code.
gen pop=hhsize*weight
gen popsum=sum(pop)
egen totpop=sum(pop)
gen cut=0
replace cut=1 if popsum<(totpop/5)
replace cut=2 if popsum<(2*totpop/5) & popsum>=(totpop/5)
replace cut=3 if popsum<(3*totpop/5) & popsum>=(2*totpop/5)
replace cut=4 if popsum<(4*totpop/5) & popsum>=(3*totpop/5)
replace cut=5 if popsum<=(totpop) & popsum>=(4*totpop/5)
2. Use the pctile and xtile commands.
The pctile command creates a new variable containing the percentiles of another variable. The
xtile command creates a new variable that records which quantile of tot_exp each observation
belongs to. Let's try those two commands (sketched below) and check the output. The pctile
command produces three cutting points, at the 25th, 50th (median), and 75th percentiles. The
xtile command assigns a new variable recording which quartile the observation belongs to.
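A sketch of the commands that produce the output below (the new variable names pct and quart
are arbitrary choices):
. pctile pct = tot_exp, nq(4)
. tab pct
. xtile quart = tot_exp, nq(4)
. tab quart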
percentiles |
of tot_exp | Freq. Percent Cum.
------------+-----------------------------------
7.678307 | 1 33.33 33.33
11.19419 | 1 33.33 66.67
16.24392 | 1 33.33 100.00
------------+-----------------------------------
Total | 3 100.00
4 quantiles |
of tot_exp | Freq. Percent Cum.
------------+-----------------------------------
1 | 3,414 19.74 19.74
2 | 3,575 20.67 40.41
3 | 4,131 23.89 64.30
4 | 6,175 35.70 100.00
------------+-----------------------------------
Total | 17,295 100.00
Chapter 19. Helpful Resources
http://www.stata.com/
http://www.stata.com/statalist/
Statalist is hosted at the Harvard School of Public Health and is an email listserver where Stata
users, from experts who write Stata programs to users like us, maintain a lively dialogue about
all things statistical and Stata. You can sign up for Statalist so that you can receive as well as
post questions through email.
http://ideas.repec.org/s/boc/bocode.html
http://www.princeton.edu/~erp/stata/main.html
http://www.cpc.unc.edu/services/computer/presentations/statatutorial/
http://www.ats.ucla.edu/stat/stata/