When Does Heckman's Two-Step Procedure For Censored Data Work and When Does It Not?
Robert Jonsson
Abstract:
Keywords:
1. Introduction
$$Y^* = \begin{cases} Y, & \text{if } Y > a \\ a, & \text{if } Y \le a \end{cases} \tag{1}$$
The Tobit model was later generalized by Heckman who introduced a further
latent variable to take account of selection effects (Heckman 1976, 1979).
Consider, for example, the variable Y* = ‘Number of sick-listed days per person’,
where many observations are zeros. To deal with the problem of border
observations at a = 0 one may introduce the latent variable Y = ‘State of health’,
which can be measured in several ways (cf. e.g. Hansson et al., 2004). For those
interested in the actual and private budgetary consequences of sick-listing there
is no reason to include selection effects, because the zeros are true zeros.
However, persons with zero sick-listed days may differ from others in several
respects. For example, in a Swedish study women with extremely low household
incomes returned to work after sick-listing earlier than others, and after 90 days
nearly all had returned (Bergendorff et al., 2001, p. 33). For those interested in
studying the potential outcome that would follow if incomes were changed, it
seems natural to take account of the selection effect that derives from household
income. The problem of choosing a proper model for the censoring in the latter
case may be termed the selection-effect problem, and it is distinct from the
border-observation problem mentioned above. A clarifying discussion of the
problem of border observations and selection effects has been given by Dow
and Norton (2003).
Objections may be raised against introducing a latent variable whose meaning
may be unclear, such as ‘State of health’, but it nevertheless gives a simple
solution to a complicated problem. The introduction of a latent variable in the
selection-effect situation is even more delicate, especially if it is generally stated
that the two latent variables have a bivariate normal distribution (cf. e.g. Flood
and Gråsjö, 2001). In the latter paper simulation studies were performed that
showed that the simple Tobit model can be as good as more sophisticated
selection-effects models, and sometimes even better. In this paper only the
censoring in Eq. (1) is studied.
Eq. (1) contains two types of data: counting data and observations on Y. When
Y depends on explanatory variables in a regression relation it is possible to find
the Maximum Likelihood (ML) estimates of the parameters by using both types
of data under suitable assumptions, such as linearity of the regression and
normality (Rosett and Nelson, 1975; Nelson, 1984). The computational
difficulties involved in solving the ML equations led Heckman (1976, 1979) to
propose a simple two-step method (Heckit). Although it was originally designed
for censoring due to selection effects in cross-sectional data, it can also be used
for data free from selection effects and for panel data. In a first step the Heckit
requires an estimate of a censoring proportion p from counting data. This in turn
gives an estimate of the hazard h for approaching a (also known as the inverse
Mills ratio). In a second step the parameters in the linear model are obtained by
regressing the observations on the explanatory variables and on the estimates of h.
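The two steps can be illustrated with a minimal, noise-free sketch. It assumes one explanatory variable t, a normal latent variable with mean alpha + beta*t and standard deviation v, and the standard truncated-normal relation E(Y | Y > a) = mu + v*h. The function names and the small linear solver are hypothetical helpers written for this illustration; they are not taken from the paper. Exact censoring proportions and exact conditional means are fed in, so the second-step regression should return the generating parameters.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def hazard_from_proportion(p):
    """Step 1: convert a censoring proportion p into (u, h)."""
    u = STD.inv_cdf(p)           # u = Phi^{-1}(p)
    h = STD.pdf(u) / (1.0 - p)   # inverse Mills ratio phi(u) / (1 - p)
    return u, h


def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def second_step(ts, hs, ybars):
    """Step 2: least squares of ybar on (1, t, h) via the normal equations."""
    rows = [[1.0, t, h] for t, h in zip(ts, hs)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, ybars)) for i in range(3)]
    return solve3(xtx, xty)  # [intercept, slope on t, coefficient on h]


# Noise-free illustration with assumed values alpha=40, beta=-10, v=15, a=0.
alpha, beta, v, a = 40.0, -10.0, 15.0, 0.0
ts = [1.0, 2.0, 3.0, 4.0]
ps = [STD.cdf((a - (alpha + beta * t)) / v) for t in ts]     # exact proportions
hs = [hazard_from_proportion(p)[1] for p in ps]              # step 1
ybars = [alpha + beta * t + v * h for t, h in zip(ts, hs)]   # E(Y | Y > a)
alpha_hat, beta_hat, v_hat = second_step(ts, hs, ybars)      # step 2
```

With real data the proportions and the means of the uncensored observations are estimates, and the near-linearity of h in t is what makes the second-step regression fragile.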
It is peculiar that the Heckit never seems to have been used by biostatisticians,
although problems with censoring occur frequently in this area. Pure
statisticians also seem to have ignored the procedure. It is typical that in a recent
PhD thesis in statistics including four papers on the subject, the Heckit is not
mentioned (Karlsson, 2005). Among econometricians, however, the Heckit is still
popular despite the fact that a large number of Monte Carlo studies cast
doubt on the procedure (see Puhani, 2000, for an overview). From these
studies it is hard to find guidelines that can be used in practice.
Heckman’s two-step procedure involves several critical steps. It is the
aim of this paper to clarify the following issues: (i) What are the properties of
the estimated hazard that is used in the second step? (ii) What are the
properties (bias and variance) of the regression estimates obtained with three
different linear models? Furthermore, is it possible to adjust for the bias? In
earlier studies the performance of the Heckit estimators has been compared
with other alternatives such as the Tobit ML estimator and several
semiparametric estimators (Kim and Lai, 2000; Lee, 1996; Newey, 2001;
Powell, 1994). This paper will focus only on the Heckit. The aim is to find
simple guidelines for when the Heckit works and when it does not.
Let $Y_{tj}$ denote an observation on the latent variable from the j:th subject at
time t, j = 1,…,n and t = 1,…,T. For cross-sectional data the index t is omitted.
The observations for each subject are represented by a transposed vector
$\mathbf{y}_j' = (Y_{1j}, \ldots, Y_{Tj})$ and it is assumed that the latter are independent over the
j’s. The problem considered is to estimate a linear regression function
$E(Y_{tj} \mid \mathbf{x}_t) = \mu_x$, where $\mathbf{x}_t$ is a vector of p explanatory variables possibly
depending on t, when observations are obtained only in the interval $(a, \infty)$ and
it is known how many observations fall below a. The function $\mu_x$ is written
$\alpha + \mathbf{x}_t'\boldsymbol{\beta}$, where $\boldsymbol{\beta}$ is a vector of regression coefficients.
Consider the following models, where random variables are denoted by capital
letters, fixed values by small letters and parameters by Greek symbols.
Here the $U_{tj}$’s are independent and identically distributed (iid) disturbances with
mean 0 and variance $\sigma_U^2$. $A_j$ is a random intercept that is specific for the j:th
subject with mean $\alpha$ and variance $\sigma_A^2$, while $\mathbf{b}_j$ is a vector of random regression
coefficients specific for the j:th subject with mean $\boldsymbol{\beta}$ and variance $\sigma_{B_r}^2$ for the
r:th component. All $A_j$’s and $\mathbf{b}_j$’s are iid and $U_{tj}$ is independent of $A_j$
and $\mathbf{b}_j$. The latter two may be correlated with $\mathrm{Cov}(A_j, B_{rj}) = \sigma_{AB_r}$. All
random variables are assumed to be normally distributed.
The models in Eq. (2) have been widely used (see e.g. Swamy, 1971 and
Hsiao, 2003) and have been termed (a) Gauss-Markov (GM), (b) Error
Components Regression (ECR) and (c) Random Coefficient Regression (RCR),
just to mention a few names. The GM-model is intended for cross-sectional data
or panel data without within-subject correlations. ECR- and RCR models are
intended for panel data. Tests for uncensored data in order to establish a proper
random structure have been suggested by several authors (see e.g. Honda, 1985,
Lundevaller and Laitila, 2002, Hsiao, 2003), but no such test seems to have been
suggested for censored data.
The Heckit requires that the censored variable is normally distributed. This
can be tested by Pearson’s chi-square statistic or the likelihood-ratio statistic,
also called the deviance, provided that data can be sorted by the explanatory
variables. For each combination of the latter, the observed proportion of
censored observations is compared with the estimate of the corresponding
theoretical proportion $p_x$ defined by
$$p_x = P(Y_{tj} \le a) = \Phi(u_x), \quad \text{with } u_x = \frac{a - \mu_x}{v_x}, \text{ where } v_x^2 = V(Y_{tj}) \tag{3}$$
These tests are supplied by several statistical packages such as SAS (SAS
Online Guide, 2006).
Below it is shown that the performance of Heckman’s estimation procedure
depends on the magnitude of the standardized variable $u_x$ rather than on $\mu_x$ or
$\mathbf{x}$. In order to simplify the simulation studies (Sect. 3) it was therefore decided
to consider just one explanatory variable, which was chosen as t, t = 1,…,T, so the
expressions in Eq. (2) simplify to
$$h_x = \phi(u_x)/(1 - p_x) \tag{5}$$
This is often referred to as the inverse Mills ratio. Since $h_x$ is the limit of
$\delta^{-1} P(Y_{tj} \in (a, a+\delta) \mid Y_{tj} > a)$ as $\delta \to 0$ it can be interpreted as the hazard for
approaching the censoring limit a for a given vector $\mathbf{x}_t$. The behaviour of $h_x$ as a
function of $u_x$ is seen in Figure 1. Notice that $h_x$ is roughly linear when $u_x$ is
large. From the inequality $u_x < h_x < u_x + 1/u_x$ (Gordon, 1941), it follows that
the asymptotic slope for large $u_x$ is 1. In Figure 1 the range of $u_x$ is from -2 to 2.
The latter corresponds to a range of the censoring proportion from 2.3 % to 97.7
% and this will cover most situations that occur in practice.
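The hazard in Eq. (5), Gordon's bounds and the quoted range of censoring proportions can be checked numerically. The following is a small sketch using only the Python standard library; the function name normal_hazard and the chosen grid of u values are assumptions made for illustration.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def normal_hazard(u):
    """Hazard of Eq. (5) as a function of u: phi(u) / (1 - Phi(u))."""
    return STD.pdf(u) / (1.0 - STD.cdf(u))


# Censoring proportions at the ends of the range shown in Figure 1.
p_low = STD.cdf(-2.0)
p_high = STD.cdf(2.0)

# Gordon's bounds u < h(u) < u + 1/u, checked on a few positive u values.
bounds_ok = all(u < normal_hazard(u) < u + 1.0 / u for u in (0.5, 1.0, 2.0, 4.0))
```

The bounds are only meaningful for positive u, which is why the grid is restricted to u > 0.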
Figure 1. The solid line is the hazard in Eq. (5) (normal observations). The three
dotted lines are the hazards for Laplace distributed observations (cf. Section
2.2.2) with v = 0.5 (upper curve), v =1.0 and v =5.0 (lower curve).
The expectation of the $Y_{tj}$’s that are found above a is related to $\mu_x$ in the
following way (Johnson et al., 1994)
All these results are based on the assumption of normality of the censored
variables, and the two-step procedure described above will therefore be termed
the normal-Heckit. Below (Sect. 3) it will be found that, if the normal-Heckit is
applied to data that are not normally distributed, it may collapse.
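$$h_x = \begin{cases} \sigma^{-1}\left[2\exp(-u_x\sqrt{2}) - 1\right]^{-1}, & \text{for } u_x \le 0 \\ \sigma^{-1}, & \text{for } u_x \ge 0 \end{cases} \tag{8}$$
This function is shown in Figure 1 for $v = \sigma\sqrt{2}$ = 0.5, 1.0 and 5.0. When
$u_x \le 0$ the hazard is increasing and for some values of v the hazard is rather
close to that of the normal distribution. For $u_x \ge 0$ the hazard is completely
different and is identical to the hazard of the exponential distribution with a
constant level. It also follows that
$$E(Y_{tj} \mid Y_{tj} > a) = \frac{\int_a^{\mu_x} y f(y)\,dy + \int_{\mu_x}^{\infty} y f(y)\,dy}{1 - \frac{1}{2}\exp(-(\mu_x - a)/\sigma)} = \mu_x + h_x\left[\sigma(\mu_x - a) + \sigma^2\right], \quad \text{for } a \le \mu_x$$
$$E(Y_{tj} \mid Y_{tj} > a) = \frac{\int_a^{\infty} y f(y)\,dy}{\frac{1}{2}\exp(-(a - \mu_x)/\sigma)} = a + \sigma, \quad \text{for } a \ge \mu_x$$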
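The Laplace hazard and the conditional expectations above can be written as two short functions. This is a sketch under the assumption that the Laplace density has location mu and scale sigma, so that the standard deviation is v = sigma*sqrt(2) and (mu - a)/sigma equals -u*sqrt(2). The function names are hypothetical.

```python
import math


def laplace_hazard(a, mu, sigma):
    """Hazard f(a) / (1 - F(a)) of a Laplace(mu, sigma) variable at a."""
    d = mu - a
    if d >= 0:  # a <= mu, i.e. u <= 0: hazard increases towards 1/sigma
        return 1.0 / (sigma * (2.0 * math.exp(d / sigma) - 1.0))
    return 1.0 / sigma  # a > mu: constant hazard, as for the exponential


def laplace_mean_above(a, mu, sigma):
    """E(Y | Y > a) for a Laplace(mu, sigma) variable."""
    if a <= mu:
        h = laplace_hazard(a, mu, sigma)
        return mu + h * (sigma * (mu - a) + sigma ** 2)
    return a + sigma  # memoryless exponential tail above mu


# Example with assumed values mu = 1, sigma = 2 and censoring limit a = 0.
example_mean = laplace_mean_above(0.0, 1.0, 2.0)
```

Unlike the normal case, the term added to mu is not a constant multiple of the hazard, which is the second reason given later for why the normal-Heckit can fail on such data.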
The first step in Heckman’s procedure is to estimate the hazard in the definition
(5), and this in turn requires estimates of $p_x$ or $u_x$ in Eq. (3). The most
basic way to estimate $p_x$ is to count the number of observations that fall below
a for a given $\mathbf{x}_t$ out of a total of $n_x$. This suggests the estimator
The estimator of the hazard that is based on Eq. (9a) will be termed
semi-parametric. In practice the latter is only feasible when the model has a small
number of explanatory variables, each with a limited state space. Alternatively
one can perform a probit analysis that fits the relation in Eq. (3) to data. In this
way one gets estimators of $(a - \alpha)/v_x$ and $\boldsymbol{\beta}/v_x$ (being of less value when
$v_x$ is unknown), but also of $p_x$ and $u_x$,
The theoretical exposition above raises some questions that will be dealt with in
the next section:
(i) What are the properties of the semi-parametric and the probit-based
estimates of the hazard under normal and non-normal distributional
assumptions? (ii) For which range of $u_x$-values, or alternatively for which
censoring proportions, are estimates obtained by Heckman’s procedure reliable?
(iii) Under which of the three random structures, GM, ECR and RCR, are
estimates obtained by Heckman’s procedure reliable?
Data were generated according to the three models in (4) with $E(Y_{tj}) = \mu_t = \alpha + \beta t$,
t = 1,2,3,4 and $V(Y_{tj}) = v^2$ with $v^2 = \sigma_U^2$ for GM data and $v^2 = \sigma_U^2 + \sigma_A^2$
for ECR data. For RCR data the variance depends on t,
$V(Y_{tj}) = v_t^2 = \sigma_U^2 + \sigma_A^2 + 2t\sigma_{AB} + t^2\sigma_B^2$. The censoring limit was a = 0 and the
Heckit was studied within the ranges $u_t \in [-2, 0] = I_-$, $u_t \in [-1, 1] = I_0$ and
$u_t \in [0, 2] = I_+$.
For GM and ECR data the parameters were $\beta = -10, -30$ and $v = -3\beta/2$
(= 15, 45). For $u_t \in I_-$, $\alpha = -4\beta$ (= 40, 120) yielding $u_t = (2t - 8)/3$. For
$u_t \in I_0$, $\alpha = -5\beta/2$ (= 25, 75) yielding $u_t = (2t - 5)/3$, and for $u_t \in I_+$,
$\alpha = -\beta$ (= 10, 30) yielding $u_t = 2(t - 1)/3$. The expected proportion of censored
observations was 0.22 for $u_t \in I_-$, 0.50 for $u_t \in I_0$ and 0.78 for $u_t \in I_+$.
For ECR data two sets of variance components were used,
$(\sigma_U^2 = 200, \sigma_A^2 = 25)$ and $(\sigma_U^2 = 25, \sigma_A^2 = 200)$ giving v = 15, and furthermore
$(\sigma_U^2 = 1772, \sigma_A^2 = 253)$ and $(\sigma_U^2 = 253, \sigma_A^2 = 1772)$ giving v = 45. Since $v_t^2$
depends on t in the RCR model it is not possible to find parameter values such
that $V(Y_{tj})$ is exactly the same as for the GM and ECR data. The following
parameter choices made the results for the RCR model roughly comparable with
the former models: $\beta = -10$, $\sigma_U^2 = 25$, $\sigma_A^2 = 200$, $\sigma_B^2 = 10$. For $u_t \in I_-$,
$\sigma_{AB} = -18.45$, so $v_t$ varied between 14.1 and 15.4, and for $u_t \in I_+$,
$\sigma_{AB} = -31.55$ with $v_t$ varying between 11.5 and 13.1.
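The design above can be checked with a few lines of arithmetic. The sketch below assumes a = 0, t = 1,…,4, beta = -10 and v = 15, and reads the covariances as negative (-18.45 and -31.55), since only negative values give the stated ranges of the RCR standard deviation. All function and variable names are hypothetical.

```python
import math
from statistics import NormalDist

STD = NormalDist()
T_VALUES = (1, 2, 3, 4)


def u_values(alpha, beta, v, a=0.0):
    """Standardized censoring limits u_t = (a - mu_t) / v with mu_t = alpha + beta*t."""
    return [(a - (alpha + beta * t)) / v for t in T_VALUES]


def mean_censoring(alpha, beta, v):
    """Expected proportion censored, averaged over t = 1,...,4."""
    us = u_values(alpha, beta, v)
    return sum(STD.cdf(u) for u in us) / len(us)


def rcr_sd(t, s_u2=25.0, s_a2=200.0, s_b2=10.0, s_ab=-18.45):
    """Standard deviation v_t of Y_tj under the RCR model."""
    return math.sqrt(s_u2 + s_a2 + 2 * t * s_ab + t * t * s_b2)


# Intercepts alpha for the three ranges when beta = -10 and v = 15.
designs = {"I-": 40.0, "I0": 25.0, "I+": 10.0}
props = {name: round(mean_censoring(alpha, -10.0, 15.0), 2)
         for name, alpha in designs.items()}
```

Averaging Phi(u_t) over the four time points is one way to read "expected proportion of censored observations"; it gives 0.22, 0.50 and 0.78 for the three ranges.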
The bias of the estimated hazard $\hat{h}_t$ was studied at t = 1, 2, 3, 4 when data were
generated by the GM model with normally distributed disturbances. For both
estimators in (9a) and (9b) the bias decreased rapidly with increasing n. For
small n, the bias could be substantial, especially for $u_t \in I_+$ and t = 4. However,
it was concluded that for practical purposes the bias in estimating $h_t$ can
be ignored when n is 100 or larger. The same conclusions were drawn about the
variances of the $h_t$ estimates. Here the probit-based estimator had a slightly
smaller variance, and the variance decreased more rapidly than the bias with
increasing n. A similar pattern was obtained for the ECR and RCR models. So,
under normality assumptions the probit-based estimator is at least as good as the
semi-parametric estimator, and for n = 100 or larger the influence of bias can
be ignored and the variance remains small.
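The semi-parametric hazard estimate discussed above can be sketched as follows. The sketch assumes that the counted proportion is plugged into the normal hazard of Eq. (5) through u = Phi^{-1}(p); the function name and the toy sample are hypothetical and are not data from the study.

```python
from statistics import NormalDist

STD = NormalDist()


def semiparametric_hazard(observations, a):
    """Estimate (p, u, h) from the share of observations at or below a."""
    n_x = len(observations)
    n_below = sum(1 for y in observations if y <= a)
    p_hat = n_below / n_x
    if not 0.0 < p_hat < 1.0:
        # Phi^{-1} is undefined at 0 and 1, so no hazard estimate exists.
        raise ValueError("hazard estimate undefined when p_hat is 0 or 1")
    u_hat = STD.inv_cdf(p_hat)
    h_hat = STD.pdf(u_hat) / (1.0 - p_hat)
    return p_hat, u_hat, h_hat


# Toy sample: half of 100 observations sit on the censoring limit a = 0.
sample = [0.0] * 50 + [3.2] * 50
p_hat, u_hat, h_hat = semiparametric_hazard(sample, a=0.0)
```

The explicit error for proportions of 0 or 1 reflects why small n is problematic at extreme u values: cells with no censored or no uncensored observations give no usable estimate.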
Now consider the case when the disturbances are Laplace-distributed. The
absolute relative bias was smallest for v = 1. With increasing n the bias persisted
while the variance decreased. The latter was more than five times larger for n =
100 than for n = 400. The results show that both proportion-based and
probit-based estimators of the hazard can be seriously biased if the hazard is far from
that of the normal distribution, and this cannot be compensated for by increasing n.
In the sequel, when the properties of the estimates of $\beta$ and $\alpha$ are studied under
normality, n is chosen as 100 and 400. From the results above it follows that
possible biases of these estimates cannot be caused by poor estimates of the
hazard in the first step of the Heckit, but stem purely from the fact that $\mu_x$ and
$h_x$ in Eq. (6) are both linear, which in turn leads to the structure in Eq. (10).
Since the Heckit is so closely tied up with normality it was furthermore
studied whether two commonly used tests of normality for censored data,
Pearson’s chi-square and maximum likelihood-ratio (SAS Online Guide, 2006),
were able to detect deviations from normality. When the observations were
Tables 1a and 1b summarize the properties of the $\beta$ and $\alpha$ estimates when the
Heckit was applied to GM data. Both bias and variance of the estimates
increased as the range of the $u_t$ values moved upwards, and decreased with
increasing n. Especially for $u_t \in I_+$, bias and variance were considerable, up to
15 times larger than for $u_t \in I_-$. As expected, both bias and variance were larger
for $\beta = -30$ than for $\beta = -10$ since the former value makes $V(Y_{tj})$ larger.
However, it is interesting that the absolute relative bias turned out to be
independent of the magnitude of $\beta$ for given n and a given range of $u_t$.
                   Variance of $\hat{\beta}$          Variance of $\hat{\alpha}$
$\beta$    n       $I_-$    $I_0$    $I_+$      $I_-$    $I_0$    $I_+$
-10       100      19       128      289        19       27       84
          400      2.9      34       270        3.7      5.9      88
-30       100      163      2282     3316       166      298      1188
          400      26       392      2033       34       65       679
Similar results, when the Heckit was applied to ECR data, are seen in Tables 2a
and 2b. Bias and variance were roughly the same as for the GM data. For
$u_t \in I_-$ and $u_t \in I_0$ bias and variance of the $\beta$-estimator were larger when the
ratio $\sigma_A^2/\sigma_U^2$ is large. As for the GM model, the absolute relative bias seemed
to be roughly independent of the magnitude of $\beta$.
Table 2a. Relative bias (%) with the ECR-model. The first and second figures
represent the cases when $\sigma_A^2/\sigma_U^2$ is small and large, respectively.
Table 2b. Variances with the ECR-model. The upper and lower figures in the
cells represent the cases when $\sigma_A^2/\sigma_U^2$ is small and large, respectively.
                             Variance of $\hat{\beta}$          Variance of $\hat{\alpha}$
$\beta$    n      ratio      $I_-$    $I_0$    $I_+$      $I_-$    $I_0$    $I_+$
-10       100     small      29       161      298        24       23       135
                  large      97       196      291        29       30       411
          400     small      3.6      43       233        4.0      4.7      93
                  large      14       89       179        4.1      9.7      250
-30       100     small      228      1286     3581       194      222      1310
                  large      708      1683     2159       235      229      3165
          400     small      34       504      3851       37       45       1446
                  large      142      993      2071       39       107      3012
Tables 3a and 3b show the pattern for the RCR data. Compared with the results
in Tables 1 and 2, bias and variance are smaller.
          Bias of $\hat{\beta}$        Bias of $\hat{\alpha}$
n         $I_-$    $I_+$      $I_-$    $I_+$
100       -5       46         4        57
400       -5       30         5        53

          Variance of $\hat{\beta}$    Variance of $\hat{\alpha}$
n         $I_-$    $I_+$      $I_-$    $I_+$
100       1.7      230        3.6      33
400       0.34     212        0.91     34
From Tables 1-3 it is concluded that the Heckit works quite well for $u_t \in I_-$
(22 % censored) and is less good when $u_t \in I_0$ (50 % censored), especially
regarding the bias of the $\beta$ estimator. For $u_t \in I_+$ (78 % censored), Heckman’s
procedure is very poor but seems to perform slightly better with RCR data.
In Section 2.3 it was noticed that the absolute relative bias when estimating
the $\boldsymbol{\beta}$-components can be expressed by $\theta$ in Eq. (10). Since $\theta$ in Tables 1 and 2 is
roughly independent of the magnitude of $\beta$, and thus also of $v$, and depends
only on n and on the censoring proportion p, it is tempting to search for
a relation that describes how $\theta$ depends on n and p. From the results in Tables 1
and 2 (GM and ECR data) the following relation was established,
$$\theta = p^{\Psi\sqrt{n}} \tag{11}$$
where $\Psi$ = 0.1966 (GM), 0.1791 (ECR with $\sigma_A^2/\sigma_U^2$ small), 0.1324 (ECR with
$\sigma_A^2/\sigma_U^2$ large). The constant $\Psi$ was determined by fitting the linearized version
of Eq. (11) to the estimates obtained in Tables 1-2 by ordinary least squares.
The coefficient of determination ($R^2$) ranged from 99.3 % to 99.8 %. The
relation in Eq. (11) is illustrated in Figures 2a,b. From Figure 2a it is concluded
that when n = 1000 or larger the censoring proportion p has little impact on the
magnitude of $\theta$ as long as p is below 50 %. E.g. n = 1000 and p = 0.5 gives
$\theta \approx 0.01$. If the censoring proportion is small, say below 20 %, then Figure 2b
tells us that the absolute relative bias can be ignored for sample sizes above 250.
However, for large p and small n the absolute relative bias can be substantial.
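Eq. (11) is easy to evaluate, and it can be inverted to give the sample size needed for a target bias. The sketch below reads Eq. (11) as theta = p raised to the power Psi*sqrt(n), which is the reading that agrees with the numerical examples quoted in the text. The function names and dictionary keys are hypothetical.

```python
import math

# Constants reported for Eq. (11); the keys are labels chosen for this sketch.
PSI = {"GM": 0.1966, "ECR_small_ratio": 0.1791, "ECR_large_ratio": 0.1324}


def relative_bias(p, n, psi):
    """Absolute relative bias theta = p ** (psi * sqrt(n)) from Eq. (11)."""
    return p ** (psi * math.sqrt(n))


def required_n(theta_max, p, psi):
    """Smallest integer n with relative_bias(p, n, psi) <= theta_max."""
    root_n = math.log(theta_max) / (psi * math.log(p))
    return math.ceil(root_n ** 2)


# Censoring proportion 0.72 with n = 203, as in the sick-listing example.
theta_example = relative_bias(0.72, 203, PSI["GM"])
```

For the GM constant, required_n(0.01, 0.05, PSI["GM"]) gives 62, and the corresponding value for p = 0.50 falls within one unit of the 1142 quoted in the text, the small difference being a matter of rounding in the constant. The example with p = 0.72 and n = 203 gives a bias of about 40 %.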
[Figure 2, panels (a) and (b): the relation in Eq. (11).]
Table 4a Relative bias (%) and variance of estimates obtained by the normal-
Heckit when in fact the data are Laplace distributed with u I .
3.4 Comparison between the efficiency obtained with censored and uncensored
data
When data are censored it is obvious that some information is lost when
estimating the parameters. Although this is inevitable it may be of some interest
to compare the variances in Tables 1-3 with those that are obtained with
uncensored data. Such a comparison may be considered to be of purely
academic interest, but one reason for doing it is to set up a standard that allows
for comparisons between the normal-Heckit and alternative methods. Let the
optimal estimator of $\beta$ with uncensored data be $\hat{\beta}_{OPT} = n^{-1}\sum_{j=1}^{n} \hat{\beta}_j$, where
$\hat{\beta}_j = w_{tY}/w_{tt}$ with $w_{tY} = \sum_{t=1}^{T} (t - \bar{t})(Y_{tj} - \bar{Y}_j)$, $w_{tt} = \sum_{t=1}^{T} (t - \bar{t})^2$ (cf. Rao,
1965, Ch. IV in Swamy, 1971 and Ch. 3 in Hsiao, 2003). Then
$V(\hat{\beta}_{OPT}) = \sigma_U^2/(n w_{tt})$ for the GM and ECR models, and
$V(\hat{\beta}_{OPT}) = (\sigma_B^2 + \sigma_U^2/w_{tt})/n$ for the RCR model. From this one obtains the
relative efficiency $RE = 100 \cdot V(\hat{\beta}_{OPT})/V(\hat{\beta}_{Heck})$, where $V(\hat{\beta}_{Heck})$ is the
variance of $\hat{\beta}$ obtained from the Heckit and is determined from
the simulations. For $u_t \in I_0$ and $u_t \in I_+$ the relative efficiency is below 1 % for
all three models. But for $u_t \in I_-$, RE is 11.0 % when n = 400 and 8.8 % when
n = 100 for the RCR-model, compared with RE of 3.4 % (n = 400) and 2.4 %
(n = 100) for the GM-model. Also from this point of view, Heckman’s procedure
seems to produce the best estimates when it is applied to the RCR-model.
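The relative efficiencies can be recomputed from the variance formulas and the simulated Heckit variances reported in the tables. The sketch below assumes t = 1,…,4 and the RCR variance components used in the simulations (variance of B equal to 10, variance of U equal to 25); function names are hypothetical.

```python
def w_tt(times):
    """Sum of squared deviations of the time points from their mean."""
    t_bar = sum(times) / len(times)
    return sum((t - t_bar) ** 2 for t in times)


def var_opt_gm_ecr(sigma_u2, n, times):
    """Variance of the optimal slope estimator, GM and ECR models."""
    return sigma_u2 / (n * w_tt(times))


def var_opt_rcr(sigma_b2, sigma_u2, n, times):
    """Variance of the optimal slope estimator, RCR model."""
    return (sigma_b2 + sigma_u2 / w_tt(times)) / n


def relative_efficiency(var_opt, var_heckit):
    """RE in per cent: uncensored optimum relative to the Heckit variance."""
    return 100.0 * var_opt / var_heckit


times = [1, 2, 3, 4]
# Heckit variances 0.34 (n = 400) and 1.7 (n = 100) are the RCR values for
# the lowest range of u_t in the variance table.
re_rcr_400 = relative_efficiency(var_opt_rcr(10.0, 25.0, 400, times), 0.34)
re_rcr_100 = relative_efficiency(var_opt_rcr(10.0, 25.0, 100, times), 1.7)
# GM model with beta = -10: variance of U is 225, Heckit variance 19 (n = 100).
re_gm_100 = relative_efficiency(var_opt_gm_ecr(225.0, 100, times), 19.0)
```

These calls give about 11.0, 8.8 and 2.4 per cent, matching the RCR figures and the GM figure for n = 100 quoted above.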
4. Using the Heckit for analysing recurrence of lower back problems among
sick-listed men
4.1 Background
Data from the post follow-up will be used to illustrate some undesirable
consequences of the Heckit. A total of n = 203 men with unspecified lower back
diagnoses who had returned to work within the follow-up period were observed
during the post follow-up. Men with specific back diagnoses (about 10 % of all cases,
Bergendorff et al. 2001, p. 46) were excluded since these had back surgery and
were thereafter free from back problems with the same diagnosis. The
dependent variable of interest is DAYS = ‘Number of sick-listed days during the
post follow-up due to the same diagnosis as in the follow-up’. One important
explanatory variable was EQT = ‘Value on EuroQol Thermometer scale’,
obtained at the end of the 2-year follow-up. The latter is a health-related quality
of life measure obtained from a visual scale on which the respondent is asked to
mark his health from 0 (worst function) to 100 (best function) (Hansson et al.,
2005). The variable EQT was negatively associated with DAYS. Another
explanatory variable was STATE1Y (= 1 if the person had returned to work
within 1 year during the previous follow-up, and = 0 otherwise). Rather
unexpectedly, there was a significant positive association between not returning
to work within 1 year and DAYS = 0 (p-value= 0.01, Chi-square test). In fact,
89 % (31/35) of those who did not return within 1 year had zero days during the
post follow-up period, while the corresponding figure for those who returned
within 1 year was 68 % (115/168). No further explanatory variables, such as
demographic and socio-economic factors, work environment, co-morbidity and
treatment received, were found to be associated with DAYS.
Most of the observations are found on the border DAYS = 0, and it
is obvious that the standard conditions for performing a regression analysis,
such as normally or at least symmetrically distributed disturbances, are
violated. Therefore, a latent variable Y is introduced such that
$$\mathrm{DAYS} = \begin{cases} 0, & \text{if } Y \le 0 \\ Y, & \text{if } Y > 0 \end{cases}$$
and Y is a variable that is related to a person’s state of health. It is assumed that
for the j:th person, $Y_j = \alpha + \beta_1\,\mathrm{STATE1Y} + \beta_2\,\mathrm{EQT} + U_j$, j = 1,…,203.
Below the data are analysed by the Heckit and, in order to clarify the different
steps, these are numbered (i)-(iii).
Here all estimated coefficients are significantly different from zero at the 5 %
level as judged by two sided t-tests.
This paper has studied the performance of Heckman’s two-step approach when
it is used to solve the problem with border-observations without selection effects
and when data are censored from below. From the simulations it was concluded
that the Heckit performed quite well for n larger than 100 and when the
censoring proportion was 0.22, provided that the censored variable was
normally distributed. With increasing censoring proportion the estimates
gradually became more biased and the variance increased. However, it is
possible to compensate for this by increasing the sample size.
By means of Eq. (11) it is possible to estimate $\theta$, the absolute relative bias of
the $\beta$-estimates, and to adjust for the bias in the way that was done in Section
4.3. Eq. (11) can also be used in the planning of a study. By first taking a pilot
sample one gets a rough estimate of the censoring proportion p. The final
sample size n can then be determined from restrictions on $\theta$. E.g. if it is
required that $\theta$ is at most 1 % for the GM model, then n should be at least 62 if
p = 0.05 and at least 1142 if p = 0.50. For reasons of space Eq. (11) was
considered only for two special cases of the ECR model. This gives some
practical guidelines, but more detailed studies should be performed on the
effect of the variance ratio upon the relation in Eq. (11).
Since the Heckit inevitably gives more or less biased estimates, one should
compare the estimated expectation of the observed variable with the observed
data in a final step. A cautionary practical example was given in Section 4, where
the censoring proportion was 0.72, leading to an estimated absolute relative bias
of the regression estimates of 40 %, and this in turn led to gross
overestimates of the actual costs for sick-listing.
When the censored variable has a distribution that is not normal, Heckman’s
two-step procedure may collapse for at least two reasons. One is that estimates
of the hazard (or Mills ratio) used in the first step are biased. A second is that
the regression function of interest and the hazard are no longer additive.
For reasons of space the effects of misspecification were only
studied for Laplace distributed disturbances, but such effects should be further
investigated for a variety of distributions.
Acknowledgements
The author would like to thank two anonymous referees for their valuable
comments. The research was supported by the National Social Insurance Board
in Sweden (RFV), Dnr 3124/99 –UFU.
References
Dow, W.H. and Norton, E.C. (2003), Choosing Between and Interpreting the
Heckit and Two-Part Models for Corner Solutions, Health Services & Outcomes
Research Methodology 4, 5-18.
Gordon, R.D. (1941), Values of Mills’ ratio of area to bounding ordinate and of
the normal probability integral for large values of the argument, Annals of
Mathematical Statistics 12, 364-366.
Bergendorff, S., Hansson, E., Hansson, T. and Jonsson, R. (2001), Vad kan
förutsäga utfallet av en sjukskrivning? (Predictors of health status and work
resumption) (in Swedish), Rygg och Nacke 8. Stockholm: RFV and Sahlgrenska
Universitetssjukhuset.
Hansson, E., Hansson, T. and Jonsson, R. (2004), Predictors for work ability and
disability in men and women with low-back or neck problems, accepted for
publication in European Spine Journal.
Kim, C.K. and Lai, T.L. (2000), Efficient score estimation and adaptive M-
estimators in censored and truncated regression models, Statistica Sinica 10,
731-749.
Nelson, F.D. (1984), Efficiency of the two-step estimator for models with
endogenous sample selection, Journal of Econometrics 24, 181-196.
Powell, J.L. (1994), Estimation of semiparametric models. In: Engle, R.F. and
McFadden, D.L. (Eds.), Handbook of Econometrics, Vol 4, pp 2444-2521,
North-Holland, Amsterdam.
Puhani, P.A. (2000), The Heckman correction for sample selection and its
critique, Journal of Economic Surveys 14, No 1, 53-68.
Rao, C.R. (1965), The theory of least squares when the parameters are
stochastic and its application to the analysis of growth curves, Biometrika 52,
447-458.
Rosett, R.N. and Nelson, F.D. (1975), Estimation of the two-limit probit
regression model, Econometrica 43, 141-146.