Tobit Analysis - Stata Data Analysis Examples
Version info: Code for this page was tested in Stata 12.
Please Note: The purpose of this page is to show how to use various data
analysis commands. It does not cover all aspects of the research process
which researchers are expected to do. In particular, it does not cover data
cleaning and checking, verification of assumptions, model diagnostics and
potential follow-up analyses.
Let's look at the data. Note that in this dataset, the lowest value of apt is
352. No students received a score of 200 (i.e., the lowest score possible),
meaning that even though censoring from below was possible, it does not
occur in the dataset.
use https://stats.idre.ucla.edu/stat/stata/dae/tobit, clear
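The summary statistics behind the minimum of apt quoted above are not reproduced in this copy of the page. A minimal sketch of how they could be obtained, assuming the dataset has been loaded with the use command and contains the variables apt, read, and math named in the text:

```stata
* Sketch only: descriptive statistics for the outcome and the two
* continuous predictors. The Min column for apt should show whether
* any score sits at the lower limit of 200.
summarize apt read math
```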
tabulate prog

    type of |
    program |      Freq.     Percent        Cum.
------------+-----------------------------------
   academic |         45       22.50       22.50
    general |        105       52.50       75.00
 vocational |         50       25.00      100.00
------------+-----------------------------------
      Total |        200      100.00
Looking at the above histogram showing the distribution of apt, we can see
the censoring in the data, that is, there are far more cases with scores of
750 to 800 than one would expect looking at the rest of the distribution.
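The histogram itself is not reproduced in this copy. A plot of this kind could be drawn roughly as follows; the choice of 10 bins, the normal overlay, and the reference line at 800 are assumptions made for illustration, not necessarily the settings used for the original figure:

```stata
* Sketch only: histogram of apt with a normal density overlay and a
* vertical reference line at the upper censoring limit of 800.
* bin(10), normal, and xline(800) are illustrative choices.
histogram apt, normal bin(10) xline(800)
```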
Below is an alternative histogram that further highlights the excess of cases
where apt=800. In the histogram below, the discrete option produces a
histogram where each unique value of apt has its own bar. The freq option
causes the y-axis to be labeled with the frequency for each value, rather
than the density. Because apt is continuous, most values of apt are unique
in the dataset, although close to the center of the distribution there are a
few values of apt that have two or three cases. The spike on the far right of
the histogram is the bar for cases where apt=800; the height of this bar
relative to all the others clearly shows the excess number of cases with this
value.
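The command for this second histogram is not shown in this copy. Based on the two options described above, a minimal version would look like this:

```stata
* Sketch only: one bar per distinct value of apt (discrete), with the
* y-axis showing counts rather than densities (freq).
histogram apt, discrete freq
```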
Tobit regression
Below we run the tobit model, using read, math, and prog to predict apt.
The ul( ) option in the tobit command indicates the value at which the
right-censoring begins (i.e., the upper limit). There is also a ll( ) option to
indicate the value of the left-censoring (the lower limit), which was not needed in this
example. The i. before prog indicates that prog is a factor variable (i.e.,
categorical variable), and that it should be included in the model as a series
of dummy variables. Note that this syntax was introduced in Stata 11.
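The command line that produced the output is not shown in this copy. A sketch consistent with the description above and with the censoring summary in the output (right-censored at apt>=800) would be:

```stata
* Sketch only: tobit model of apt on read, math, and program type,
* with right-censoring at the upper limit of 800.
* i.prog enters prog as a set of indicator (dummy) variables,
* using the lowest level (prog=1) as the reference category.
tobit apt read math i.prog, ul(800)
```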
Tobit regression                                  Number of obs   =        200
                                                  LR chi2(4)      =     188.97
                                                  Prob > chi2     =     0.0000
Log likelihood = -1041.0629                       Pseudo R2       =     0.0832

------------------------------------------------------------------------------
         apt |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   2.697939    .618798     4.36   0.000     1.477582    3.918296
        math |   5.914485   .7098063     8.33   0.000     4.514647    7.314323
             |
        prog |
          2  |  -12.71476   12.40629    -1.02   0.307    -37.18173     11.7522
          3  |   -46.1439   13.72401    -3.36   0.001     -73.2096   -19.07821
             |
       _cons |    209.566   32.77154     6.39   0.000     144.9359    274.1961
-------------+----------------------------------------------------------------
      /sigma |   65.67672   3.481272                      58.81116    72.54228
------------------------------------------------------------------------------
  Obs. summary:          0  left-censored observations
                       183     uncensored observations
                        17 right-censored observations at apt>=800
The final log likelihood (-1041.0629) is shown at the top of the output.
It can be used in comparisons of nested models, but we won't show
an example of that here.
Also at the top of the output we see that all 200 observations in our
data set were used in the analysis (fewer observations would have
been used if any of our variables had missing values).
The likelihood ratio chi-square of 188.97 (df=4) with a p-value of less
than 0.0001 tells us that our model as a whole fits significantly better than
an empty model (i.e., a model with no predictors).
In the table we see the coefficients, their standard errors, the
t-statistics, associated p-values, and the 95% confidence intervals of the
coefficients. The coefficients for read and math are statistically
significant, as is the coefficient for prog=3. Tobit regression
coefficients are interpreted in a similar manner to OLS regression
coefficients; however, the linear effect is on the uncensored latent
variable, not the observed outcome. See McDonald and Moffitt (1980)
for more details.
For a one-unit increase in read, there is a 2.70-point increase in
the predicted value of apt.
A one-unit increase in math is associated with a 5.91-point
increase in the predicted value of apt.
The terms for prog have a slightly different interpretation. The
predicted value of apt is 46.14 points lower for students in a
vocational program (prog=3) than for students in an academic
program (prog=1).
The ancillary statistic /sigma is analogous to the square root of the
residual variance in OLS regression. The value of 65.68 can be
compared to the standard deviation of academic aptitude, which
was 99.21, a substantial reduction. The output also contains an
estimate of the standard error of /sigma as well as the 95% confidence
interval.
Finally, the output provides a summary of the number of left-censored,
uncensored and right-censored values.
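To make the phrase "the linear effect is on the uncensored latent variable" concrete, the model behind the output above can be written out as follows. This is the standard textbook form of a tobit model with an upper limit only; the symbols y*, x, beta, and epsilon are notation introduced here for illustration and do not appear in the Stata output.

```latex
% Latent (uncensored) aptitude and its observed, right-censored counterpart.
\begin{align*}
  y_i^{*} &= x_i'\beta + \varepsilon_i,
  \qquad \varepsilon_i \sim N(0, \sigma^{2}), \\
  \mathit{apt}_i &=
  \begin{cases}
    y_i^{*} & \text{if } y_i^{*} < 800, \\
    800     & \text{if } y_i^{*} \ge 800.
  \end{cases}
\end{align*}
% A coefficient is the effect of x_k on the latent mean, while the effect
% on the mean of the observed, censored outcome is scaled down by the
% probability of being uncensored:
\begin{align*}
  \frac{\partial E[y_i^{*} \mid x_i]}{\partial x_{ik}} &= \beta_k, &
  \frac{\partial E[\mathit{apt}_i \mid x_i]}{\partial x_{ik}}
    &= \beta_k \,\Phi\!\left(\frac{800 - x_i'\beta}{\sigma}\right).
\end{align*}
```

Under this notation, the coefficient of 2.70 for read is an estimate of the corresponding element of beta, and /sigma is the estimate of sigma.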
We can test for an overall effect of prog using the test command. Below
we see that the overall effect of prog is statistically significant.
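The test command itself is not shown in this copy. A sketch that matches the two constraints listed in the output, assuming the tobit model with i.prog is the most recent estimation:

```stata
* Sketch only: joint Wald test that the coefficients on prog=2 and
* prog=3 are both zero, i.e. that program type has no overall effect.
test 2.prog 3.prog
```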
 ( 1)  [model]2.prog = 0
 ( 2)  [model]3.prog = 0

       F(  2,   196) =    5.98
            Prob > F =    0.0030
We can also test additional hypotheses about the differences in the
coefficients for different levels of prog. Below we test that the coefficient
for prog=2 is equal to the coefficient for prog=3. In the output below we
see that the coefficient for prog=2 is significantly different from the
coefficient for prog=3.
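Again the command line is missing from this copy. A sketch that corresponds to the single constraint shown in the output, assuming the same tobit model is still the active estimation:

```stata
* Sketch only: Wald test that the prog=2 and prog=3 coefficients are
* equal (equivalently, that their difference is zero).
test 2.prog = 3.prog
```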
 ( 1)  [model]2.prog - [model]3.prog = 0

       F(  1,   196) =    6.66
            Prob > F =    0.0106
We may also wish to see measures of how well our model fits. This can be
particularly useful when comparing competing models. One method of
doing this is to compare the predicted values based on the tobit model to
the observed values in the dataset. Below we use predict to generate
predicted values of apt based on the model. Next we correlate the
observed values of apt with the predicted values (yhat).
predict yhat
(option xb assumed; fitted values)
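The correlation table that follows would come from a command along these lines (the command line is not preserved in this copy; yhat is the variable created by predict above):

```stata
* Sketch only: Pearson correlation between observed apt and the
* linear prediction yhat from the tobit model.
correlate apt yhat
```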
             |      apt     yhat
-------------+------------------
         apt |   1.0000
        yhat |   0.7825   1.0000
The correlation between the predicted and observed values of apt is
0.7825. If we square this value, we get the multiple squared correlation, which
indicates that the predicted values share about 61% (0.7825^2 = 0.6123) of their
variance with apt. Additionally, we can use the user-written
command fitstat to produce a variety of fit statistics. You can find more
information on fitstat by typing search fitstat (see How can I use the
search command to search for programs and get additional help? for more
information about using search).
fitstat
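The squared correlation quoted above can also be computed in Stata rather than by hand. A minimal sketch, assuming yhat was created with predict as shown earlier; it relies on the result r(rho) that correlate leaves behind:

```stata
* Sketch only: recompute the apt/yhat correlation quietly, then
* display its square using the stored result r(rho).
quietly correlate apt yhat
display "squared correlation = " %6.4f r(rho)^2
```

If the correlation matches the 0.7825 reported in the table, the displayed value should agree with the figure of about 0.61 given in the text.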
See Also
Stata Online Manual
tobit
Related Stata Commands
cnreg: censored normal regression, in which the censoring
values may change from observation to observation.
intreg: interval regression, in which observations may be point
data, interval data, left-censored data, or right-censored data.
References
Long, J. S. (1997). Regression Models for Categorical and Limited
Dependent Variables. Thousand Oaks, CA: Sage Publications.
McDonald, J. F. and Moffitt, R. A. (1980). The Uses of Tobit
Analysis. The Review of Economics and Statistics, 62(2), 318-321.