When Does Heckman's Two-Step Procedure For Censored Data Work and When Does It Not?


Research Report

Statistical Research Unit


Department of Economics
University of Gothenburg
Sweden

When does Heckman’s two-step procedure for censored data work and when does it not?

Robert Jonsson

Research Report 2008:2


ISSN 0349-8034

Mailing address: Statistical Research Unit, P.O. Box 640, SE 405 30 Göteborg, Sweden
Fax: Nat: 031-786 12 74, Int: +46 31 786 12 74
Phone: Nat: 031-786 00 00, Int: +46 31 786 00 00
Home page: http://www.statistics.gu.se/

When does Heckman’s two-step procedure for censored data work and
when does it not?

Robert Jonsson

Department of Economics, University of Gothenburg, Box 640, 405 30 Göteborg, Sweden

Abstract:

Heckman’s two-step procedure (Heckit) for estimating the parameters in linear
models from censored data is frequently used by econometricians, despite the
fact that earlier studies cast doubt on the procedure. In this paper it is shown
that estimates of the hazard h for approaching the censoring limit, which are
used as an explanatory variable in the second step of the Heckit, can induce
multicollinearity. The influence of the censoring proportion and sample size
upon bias and variance in three types of random linear models is studied by
simulations. From these results a simple relation is established that describes
how absolute bias depends on the censoring proportion and the sample size. It is
also shown that the Heckit may work with non-normal (Laplace) distributions,
but it collapses if h deviates too much from that of the normal distribution. Data
from a study of work resumption after sick-listing are used to demonstrate that
the Heckit can be very risky.

Keywords:

Censoring, Cross-sectional and panel data, Hazard, Multicollinearity



1. Introduction

When studying the relation between a dependent variable $Y^*$ and a set of
explanatory variables it sometimes happens that a large proportion of the
observations fall on $Y^* = a$, and no observations are found below the known
constant $a$. As a consequence, the standard conditions for efficient
estimation of the parameters are violated. This may be termed the problem of
border observations. One way to deal with the latter is to use the fact, or just
make the assumption, that it has originated from censoring of some latent
variables. (According to Kruskal and Tanur (1978), data are censored if
observations are measured only in some interval, while observations outside the
interval are counted but not measured.) The relation between $Y^*$ and the latent
variables can be expressed in several ways, the simplest being the Tobit model
(Tobin, 1958)

$$
Y^* = \begin{cases} Y, & \text{if } Y > a \\ a, & \text{if } Y \le a \end{cases} \tag{1}
$$

The Tobit model was later generalized by Heckman, who introduced a further
latent variable to take account of selection effects (Heckman 1976, 1979).
Consider e.g. the variable $Y^*$ = ‘Number of sick-listed days per person’, where
many observations are zeros. To deal with the problem of border observations at
$a = 0$ one may introduce the latent variable $Y$ = ‘State of health’, which can be
measured in several ways (cf. e.g. Hansson et al., 2004). For those interested in
the actual and private budgetary consequences of sick-listing there is no
reason to include selection effects because the zeros are true zeros. However,
persons with zero sick-listed days may be different from others in several
respects. E.g. in a Swedish study women with extremely low household incomes
returned to work after sick-listing earlier than others, and after 90 days nearly
all had returned (Bergendorff et al. 2001, p. 33). For those interested in studying
the potential outcome that would follow if incomes were changed, it seems
natural to take account of the selection effect that derives from household
income. The problem of choosing a proper model for the censoring in the latter
case may be termed the selection-effect problem and is separate from the
border-observation problem mentioned above. A clarifying discussion of the
problem of border observations and selection effects has been given by Dow
and Norton (2003).
Objections may be raised against introducing a latent variable whose meaning
may be unclear, such as ‘State of health’, but it nevertheless gives a simple
solution to a complicated problem. The introduction of a latent variable in the
selection-effect situation is even more delicate, especially if it is generally stated
that the two latent variables have a bivariate normal distribution (cf. e.g. Flood
and Gråsjö, 2001). In the latter paper simulation studies were performed that
showed that the simple Tobit model can be as good as more sophisticated
selection-effects models, and sometimes even better. In this paper only the
censoring in Eq. (1) is studied.

Eq. (1) contains two types of data, counting data and observations on Y. When
Y depends on explanatory variables in a regression relation it is possible to find
the Maximum Likelihood (ML) estimates of the parameters by using both types
of data under suitable assumptions, such as linearity of the regression and
normality (Rosett and Nelson, 1975, Nelson, 1984). The computational
difficulties involved in solving the ML equations led Heckman (1976, 1979) to
propose a simple two-step method (Heckit). Although it was originally designed
for censoring due to selection effects in cross-sectional data, it can be used for
data free from selection effects and for panel data. The Heckit requires in a first
step an estimate of a censoring proportion p from counting data. This in turn
gives estimates of the hazard (h) for approaching a (or inverse Mills ratio). In a
second step the parameters in the linear model are obtained by regressing the
observations on the explanatory variables and on estimates of h.
It is peculiar that the Heckit never seems to have been used by biostatisticians,
although problems with censoring occur frequently in this area. Pure
statisticians also seem to have ignored the procedure. It is typical that in a recent
PhD thesis in statistics including four papers on the subject, the Heckit is not
mentioned (Karlsson, 2005). Among econometricians, however, the Heckit is still
popular despite the fact that an extensive number of Monte Carlo studies cast
doubt on the procedure (see Puhani, 2000 for an overview). From these
studies it is nevertheless hard to find guidelines that can be used in practice.
Heckman’s two-step procedure involves several critical steps. It is the
aim of this paper to clarify the following issues: (i) What are the properties of
the estimated hazard that is used later in the second step? (ii) What are the
properties (bias and variance) of the regression estimates obtained with three
different linear models? Furthermore, is it possible to adjust for the bias? In
earlier studies the performance of the Heckit estimators has been compared
with other alternatives such as the Tobit ML estimator and several
semiparametric estimators (Kim and Lai, 2000, Lee, 1996, Newey, 2001 and
Powell, 1994). This paper will focus only on the Heckit. The aim is to find
simple guidelines for when the Heckit works and when it does not.

2. Notations, assumptions and some theoretical results

Let $Y_{tj}$ denote an observation on the latent variable from the $j$:th subject at time
$t$, $j = 1, \ldots, n$ and $t = 1, \ldots, T$. For cross-sectional data the index $t$ is omitted. The
observations for each subject are represented by a transposed vector
$y_j' = (Y_{1j}, \ldots, Y_{Tj})$ and it is assumed that the latter are independent over the $j$’s. The
problem considered is to estimate a linear regression function $E(Y_{tj} \mid x_t) = \mu_x$,
where $x_t$ is a vector of $p$ explanatory variables possibly depending on $t$, when
observations are obtained only in the interval $(a, \infty)$ and it is known how many
observations fall below $a$. The function $\mu_x$ is written $\alpha + x_t'\beta$, where $\beta$ is a
vector of regression coefficients.

2.1 Three linear models with different random structures

Consider the following models, where random variables are denoted by capital
letters, fixed values by small letters and parameters by Greek symbols.

$$
\text{(a) } Y_{tj} = \alpha + x_t'\beta + U_{tj}, \quad
\text{(b) } Y_{tj} = A_j + x_t'\beta + U_{tj}, \quad
\text{(c) } Y_{tj} = A_j + x_t' b_j + U_{tj} \tag{2}
$$

Here the $U_{tj}$’s are independent and identically distributed (iid) disturbances with
mean 0 and variance $\sigma_U^2$. $A_j$ is a random intercept that is specific for the $j$:th
subject with mean $\alpha$ and variance $\sigma_A^2$, while $b_j$ is a vector of random regression
coefficients specific for the $j$:th subject with mean $\beta$ and variance $\sigma_{B_r}^2$ for the
$r$:th component. All $A_j$’s and $b_j$’s are iid and $U_{tj}$ is independent of $A_j$
and $b_j$. The latter two may be correlated with $Cov(A_j, B_{rj}) = \sigma_{AB_r}$. All
random variables are assumed to be normally distributed.
The models in Eq. (2) have been widely used (see e.g. Swamy, 1971 and
Hsiao, 2003) and have been termed (a) Gauss-Markov (GM), (b) Error
Components Regression (ECR) and (c) Random Coefficient Regression (RCR),
just to mention a few names. The GM-model is intended for cross-sectional data
or panel data without within-subject correlations. ECR- and RCR models are
intended for panel data. Tests for uncensored data in order to establish a proper
random structure have been suggested by several authors (see e.g. Honda, 1985,
Lundevaller and Laitila, 2002, Hsiao, 2003), but no such test seems to have been
suggested for censored data.
The Heckit requires that the censored variable is normally distributed. This
can be tested by Pearson’s chi-square statistic or the likelihood-ratio statistic,
also called the deviance, provided that data can be sorted by the explanatory
variables. For each combination of the latter, the observed proportion of
censored observations is compared with the estimate of the corresponding
theoretical proportion $p_x$ defined by

$$
p_x = P(Y_{tj} \le a) = \Phi(u_x), \quad \text{with } u_x = \frac{a - \mu_x}{v_x} \text{ where } v_x = \sqrt{V(Y_{tj})} \tag{3}
$$

These tests are supplied by several statistical packages such as SAS (SAS
Online Guide, 2006).
Below it is shown that the performance of Heckman’s estimation procedure is
dependent on the magnitude of the standardized variable $u_x$ rather than on $\mu_x$ or
$x$. In order to simplify the simulation studies (Sect. 3) it was therefore decided
to consider just one explanatory variable, which was chosen as $t$, $t = 1, \ldots, T$, so the
expressions in Eq. (2) simplify to

$$
\text{(a) } Y_{tj} = \alpha + \beta t + U_{tj}, \quad
\text{(b) } Y_{tj} = A_j + \beta t + U_{tj}, \quad
\text{(c) } Y_{tj} = A_j + B_j t + U_{tj} \tag{4}
$$

with variances $V(Y_{tj}) = \sigma_U^2$ (a), $\sigma_U^2 + \sigma_A^2$ (b), $\sigma_U^2 + \sigma_A^2 + 2t\sigma_{AB} + t^2\sigma_B^2$ (c) and
covariances $Cov(Y_{sj}, Y_{tj}) = 0$ (a), $\sigma_A^2$ (b), $\sigma_A^2 + (s + t)\sigma_{AB} + st\sigma_B^2$ (c).

2.2 Results on expectations of censored variables

2.2.1 Normally distributed censored variables


Let $\phi$ be the density of a standardized normal variable and consider the function

$$
h_x = \phi(u_x) / (1 - p_x) \tag{5}
$$

This is often referred to as the inverse Mills ratio. Since $h_x$ is the limit of
$\delta^{-1} P(Y_{tj} \in (a, a + \delta) \mid Y_{tj} > a)$ as $\delta \to 0$ it can be interpreted as the hazard for
approaching the censoring limit $a$ for a given vector $x_t$. The behaviour of $h_x$ as a
function of $u_x$ is seen in Figure 1. Notice that $h_x$ is roughly linear when $u_x$ is
large. From the inequality $u_x < h_x < u_x + 1/u_x$ for $u_x > 0$ (Gordon, 1941), it follows that
the asymptotic slope for large $u_x$ is 1. In Figure 1 the range of $u_x$ is from -2 to 2.
The latter corresponds to a range of the censoring proportion from 2.3 % to 97.7
% and this will cover most situations that occur in practice.
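
The quantities in Eqs. (3) and (5) are easy to tabulate. The following is a minimal sketch using only the Python standard library; the function names `normal_hazard` and `censoring_proportion` are illustrative and do not come from the paper. It prints the censoring proportion and the hazard over the range of $u_x$ shown in Figure 1, so the bounds in Gordon's inequality can be checked numerically.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def censoring_proportion(u: float) -> float:
    """p = Phi(u), the censoring proportion in Eq. (3)."""
    return _STD_NORMAL.cdf(u)


def normal_hazard(u: float) -> float:
    """h = phi(u) / (1 - Phi(u)), the inverse Mills ratio in Eq. (5)."""
    return _STD_NORMAL.pdf(u) / (1.0 - _STD_NORMAL.cdf(u))


if __name__ == "__main__":
    for u in (-2.0, -1.0, 0.0, 1.0, 2.0):
        p = censoring_proportion(u)
        h = normal_hazard(u)
        print(f"u = {u:+.1f}   p = {p:.3f}   h = {h:.3f}")
    # Gordon's inequality u < h < u + 1/u holds for positive u.
    u = 2.0
    print(u < normal_hazard(u) < u + 1.0 / u)
```

For very large $u$ the denominator $1 - \Phi(u)$ underflows in floating point, so this direct formula is only intended for the moderate range considered here.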

Figure 1. The solid line is the hazard in Eq. (5) (normal observations). The three
dotted lines are the hazards for Laplace distributed observations (cf. Section
2.2.2) with v = 0.5 (upper curve), v =1.0 and v =5.0 (lower curve).

The expectation of the $Y_{tj}$’s that are found above $a$ is related to $\mu_x$ in the
following way (Johnson et al., 1994)

$$
E(Y_{tj} \mid Y_{tj} > a) = \mu_x + v_x \cdot h_x \tag{6}
$$



As Heckman noticed, the latter relation makes it possible to obtain estimates of
the parameters in $\mu_x$ by regressing the observations with $Y_{tj} > a$ on the explanatory
variables and on the estimated hazard. The expectation of the observed variable $Y_{tj}^*$ can
finally be obtained by putting Eq. (6) into the obvious relation

$$
E(Y_{tj}^*) = a \cdot p_x + E(Y_{tj} \mid Y_{tj} > a) \cdot (1 - p_x) \tag{7}
$$

All these results are based on the assumption of normality of the censored
variables and the two-step procedure described above will therefore be termed the
normal-Heckit. Below (Sect. 3) it will be found that, if the normal-Heckit is
applied to data that are not normally distributed, it may collapse.
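
Eqs. (6) and (7) can be checked against a direct simulation. The sketch below is only an illustration, written with the Python standard library; the helper names, the seed and the number of draws are assumptions made here, and the parameter values ($\mu_x = 10$, $v_x = 15$, $a = 0$) are just one convenient choice.

```python
import random
from statistics import NormalDist, fmean

STD = NormalDist()  # standard normal distribution


def mean_above_limit(mu: float, v: float, a: float) -> float:
    """E(Y | Y > a) = mu + v * h for Y ~ N(mu, v^2), as in Eq. (6)."""
    u = (a - mu) / v
    h = STD.pdf(u) / (1.0 - STD.cdf(u))
    return mu + v * h


def mean_observed(mu: float, v: float, a: float) -> float:
    """E(Y*) = a * p + E(Y | Y > a) * (1 - p), as in Eq. (7)."""
    p = STD.cdf((a - mu) / v)
    return a * p + mean_above_limit(mu, v, a) * (1.0 - p)


def simulate(mu: float, v: float, a: float, draws: int = 200_000, seed: int = 1):
    """Return the simulated means of Y given Y > a and of the censored Y*."""
    rng = random.Random(seed)
    ys = [rng.gauss(mu, v) for _ in range(draws)]
    above = [y for y in ys if y > a]
    censored = [y if y > a else a for y in ys]
    return fmean(above), fmean(censored)


if __name__ == "__main__":
    mu, v, a = 10.0, 15.0, 0.0
    sim_above, sim_observed = simulate(mu, v, a)
    print("Eq. (6):", mean_above_limit(mu, v, a), "simulated:", sim_above)
    print("Eq. (7):", mean_observed(mu, v, a), "simulated:", sim_observed)
```

The simulated and closed-form values should agree up to Monte Carlo error, which is small with this many draws.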

2.2.2 Non-normally distributed censored variables: The Laplace distribution


Under normality assumptions the hazard $h_x$ is separated from $\mu_x$ in Eq. (6) in
an additive way. For other distributions this decomposition is seldom possible.
Consider e.g. the case when the $Y_{tj}$’s in the GM model (4a) have the Laplace (or
double exponential) distribution with the following density $f(y)$ and cdf $F(y)$:

$$
f(y) = \frac{1}{2\sigma} \cdot \begin{cases} \exp(-z), & \text{if } z \ge 0 \\ \exp(z), & \text{if } z \le 0 \end{cases}
\qquad
F(y) = \begin{cases} 1 - \exp(-z)/2, & z \ge 0 \\ \exp(z)/2, & z \le 0 \end{cases},
\qquad \text{with } z = \frac{y - \mu_x}{\sigma}.
$$

The expectation and variance of $Y_{tj}$ are $\mu_x$ and $2\sigma^2$, respectively (cf. Johnson et
al., 1994). The normal density and the Laplace density are both symmetric
around $\mu_x$, but compared to the normal density the Laplace density has a sharper
peak at $\mu_x$ and longer tails. In terms of $u_x$ defined in Eq. (3), the hazard for
approaching the censoring limit $a$ is

$$
h_x = \begin{cases} \left[ \sigma \left( 2\exp(-u_x\sqrt{2}) - 1 \right) \right]^{-1}, & \text{for } u_x \le 0 \\ \sigma^{-1}, & \text{for } u_x \ge 0 \end{cases} \tag{8}
$$

This function is shown in Figure 1 for $v = \sigma\sqrt{2}$ = 0.5, 1.0 and 5.0. When
$u_x \le 0$ the hazard is increasing and for some values of $v$ the hazard is rather
close to that of the normal distribution. For $u_x \ge 0$ the hazard is completely
different and is identical to the hazard of the exponential distribution with a
constant level. It also follows that

$$
E(Y_{tj} \mid Y_{tj} > a) = \frac{\int_a^{\mu_x} y f(y)\,dy + \int_{\mu_x}^{\infty} y f(y)\,dy}{1 - \frac{1}{2}\exp(-(\mu_x - a)/\sigma)} = \mu_x + h_x \left[ \sigma(\mu_x - a) + \sigma^2 \right], \quad \text{for } a \le \mu_x
$$

$$
E(Y_{tj} \mid Y_{tj} > a) = \frac{\int_a^{\infty} y f(y)\,dy}{\frac{1}{2}\exp(-(a - \mu_x)/\sigma)} = a + \sigma, \quad \text{for } a \ge \mu_x
$$

In the last expressions $\mu_x$ and $h_x$ cannot in general be expressed in separate
terms as in Eq. (6). Only when $a$ equals $\mu_x$ do they have the same structure.
Thus, if the normal-Heckit is applied to data where the censored variable in
fact is Laplace distributed, estimates can be expected to be very unreliable for
two reasons. First, estimates of the hazard are uncertain since the form of the
hazard is incorrectly specified and second, the hazard is not additively separated
from $\mu_x$, so the regression relation is incorrectly specified in Heckman’s second
step.
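
The Laplace expressions of this section can be written out as a short sketch, assuming the parameterisation $v = \sigma\sqrt{2}$ and $u_x = (a - \mu_x)/v$ used above. The function names are illustrative, not the author's. The two branches of the conditional expectation coincide at $a = \mu_x$, which gives a convenient consistency check.

```python
import math


def laplace_hazard(u: float, v: float) -> float:
    """Hazard at the censoring limit for Laplace data, Eq. (8).

    u is the standardized limit (a - mu) / v and v is the standard
    deviation, so the Laplace scale is sigma = v / sqrt(2).
    """
    sigma = v / math.sqrt(2.0)
    if u >= 0.0:
        return 1.0 / sigma  # constant, as for the exponential distribution
    return 1.0 / (sigma * (2.0 * math.exp(-u * math.sqrt(2.0)) - 1.0))


def laplace_mean_above_limit(mu: float, v: float, a: float) -> float:
    """E(Y | Y > a) for Laplace data with mean mu and standard deviation v."""
    sigma = v / math.sqrt(2.0)
    if a >= mu:
        return a + sigma  # memoryless exponential tail above mu
    h = laplace_hazard((a - mu) / v, v)
    return mu + h * (sigma * (mu - a) + sigma ** 2)


if __name__ == "__main__":
    for v in (0.5, 1.0, 5.0):
        row = [round(laplace_hazard(u, v), 3) for u in (-2.0, -1.0, 0.0, 1.0, 2.0)]
        print(f"v = {v}: hazard at u = -2, -1, 0, 1, 2 ->", row)
```

Unlike Eq. (6), the term multiplying $h_x$ here depends on $\mu_x$, which is the lack of additive separation discussed in the text.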

2.3 Heckman’s two-step procedure

The first step in Heckman’s procedure is to estimate the hazard in the definition
(5), and this in turn requires estimates of $p_x$ or $u_x$ in Eq. (3). The most
basic way to estimate $p_x$ is to count the number of observations that fall below
$a$ for a given $x_t$ out of a total of $n_x$. This suggests the estimator

$$
\hat{p}_x = \text{Proportion of observations} \le a \text{ at } x_t, \text{ and from this } \hat{u}_x = \Phi^{-1}(\hat{p}_x) \tag{9a}
$$

The estimator of the hazard that is based on Eq. (9a) will be termed semi-parametric.
In practice the latter is only feasible when the model has a small
number of explanatory variables, each with a limited state space. Alternatively
one can perform a probit analysis that fits the relation in Eq. (3) to data. In this
way one gets estimators of $(a - \alpha)/v_x$ and $\beta/v_x$ (being of less value when
$v_x$ is unknown), but also of $p_x$ and $u_x$,

$$
\hat{p}_x \text{ and } \hat{u}_x \text{ from probit analysis} \tag{9b}
$$

The latter estimator will be termed probit-based. The essential difference
between the two types of estimators is that the one in (9b) makes full use of the
normality assumption, while that in (9a) only uses the normality assumption for
estimating the numerator in the definition (5). The estimates of $\alpha$ and $\beta$ are
finally obtained in the second step by regressing the observations with $Y_{tj} > a$ on $x_t$ and on the
estimated hazard $\hat{h}_x$.
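
The two steps can be sketched in a few lines. The code below is not the author's SAS program: it is a minimal standard-library illustration that uses the semi-parametric first step (9a), since a probit fit would need an optimisation routine. The simulated GM data, the seed, the clamping of $\hat{p}_x$ away from 0 and 1, and all function names are assumptions made for the example. A large $n$ is used so that the sampling noise is small.

```python
import random
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def ols(rows, ys):
    """Least squares: solve X'X b = X'y by Gauss-Jordan elimination."""
    k = len(rows[0])
    aug = []
    for i in range(k):
        xtx_row = [sum(row[i] * row[j] for row in rows) for j in range(k)]
        xty = sum(row[i] * y for row, y in zip(rows, ys))
        aug.append(xtx_row + [xty])
    for col in range(k):
        pivot = max(range(col, k), key=lambda idx: abs(aug[idx][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for idx in range(k):
            if idx != col:
                factor = aug[idx][col] / aug[col][col]
                aug[idx] = [x - factor * p for x, p in zip(aug[idx], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


def heckit_semiparametric(data, a=0.0):
    """data: list of (t, y_star) pairs. Returns [alpha_hat, beta_hat, v_hat]."""
    by_t = {}
    for t, y in data:
        by_t.setdefault(t, []).append(y)
    hazard = {}
    for t, ys in by_t.items():
        # Step 1, Eq. (9a): censored share at this t, kept strictly inside (0, 1).
        p_hat = sum(y <= a for y in ys) / len(ys)
        p_hat = min(max(p_hat, 0.5 / len(ys)), 1.0 - 0.5 / len(ys))
        u_hat = STD.inv_cdf(p_hat)
        hazard[t] = STD.pdf(u_hat) / (1.0 - p_hat)  # Eq. (5)
    # Step 2: regress the uncensored observations on (1, t, estimated hazard).
    rows = [(1.0, float(t), hazard[t]) for t, y in data if y > a]
    ys = [y for t, y in data if y > a]
    return ols(rows, ys)


def simulate_gm(n, alpha, beta, v, a=0.0, times=(1, 2, 3, 4), seed=2008):
    """Censored data from the GM model (4a): Y* = max(Y, a) as in Eq. (1)."""
    rng = random.Random(seed)
    return [(t, max(alpha + beta * t + rng.gauss(0.0, v), a))
            for _ in range(n) for t in times]


if __name__ == "__main__":
    sample = simulate_gm(n=5000, alpha=40.0, beta=-10.0, v=15.0)
    alpha_hat, beta_hat, v_hat = heckit_semiparametric(sample)
    print("alpha:", alpha_hat, "beta:", beta_hat, "v:", v_hat)
```

By Eq. (6) the coefficient on the estimated hazard is an estimate of $v_x$, which is why the third returned value is labelled `v_hat`.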


In Figure 1 $h_x$ is roughly linear for large values of $u_x$, say $h_x \approx \lambda + \theta \cdot u_x$,
where $\theta \in (0, 1)$ and $\lambda > 0$. Putting this into Eq. (6) and using Eq. (3) gives

$$
E(Y_{tj} \mid Y_{tj} > a) \approx \left[ \alpha(1 - \theta) + v_x \lambda + \theta \cdot a \right] + x_t'\beta \cdot (1 - \theta) \tag{10}
$$

From this it is obvious that estimates of $\alpha$ and $\beta$ can be seriously biased by
performing the second step in Heckman’s procedure, since one is estimating the
slope vector $\beta(1 - \theta)$ rather than $\beta$. Provided that $\beta(1 - \theta)$ is estimated without
bias, it follows that $\theta$ can be interpreted as the relative bias of the $\beta$-components.
If $\theta$ is known this can be used to adjust for the bias when
estimating $\beta$ by simply dividing the estimate by $(1 - \theta)$. An example of this will
be given in Section 4.3.
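
The step from the linear approximation of the hazard to Eq. (10) can be spelled out as follows, using $u_x = (a - \mu_x)/v_x$ from Eq. (3), $\mu_x = \alpha + x_t'\beta$, and the conditional expectation in Eq. (6):

$$
\begin{aligned}
E(Y_{tj} \mid Y_{tj} > a) &= \mu_x + v_x h_x \approx \mu_x + v_x (\lambda + \theta u_x) \\
&= \mu_x + v_x \lambda + \theta (a - \mu_x) \\
&= (1 - \theta)\mu_x + v_x \lambda + \theta a \\
&= \left[ \alpha(1 - \theta) + v_x \lambda + \theta a \right] + x_t'\beta (1 - \theta).
\end{aligned}
$$

As a purely hypothetical numerical illustration (the numbers are not from the paper): if the second step returned a slope of $-7$ and $\theta$ were known to be 0.3, the adjusted estimate would be $-7 / (1 - 0.3) = -10$.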

2.4 Specific problems to be considered

The theoretical exposition above raises some questions that will be dealt with in
the next section:
(i) What are the properties of the semi-parametric and the probit-based
estimates of the hazard under normal and non-normal distributional
assumptions? (ii) For which range of $u_x$-values, or alternatively for which
censoring proportions, are estimates obtained by Heckman’s procedure reliable?
(iii) Under which of the three random structures, GM, ECR and RCR, are
estimates obtained by Heckman’s procedure reliable?

3. Monte Carlo simulations

3.1 Design of the simulation study

Data were generated according to the three models in (4) with $E(Y_{tj}) = \mu_t =
\alpha + \beta \cdot t$, $t = 1, 2, 3, 4$ and $V(Y_{tj}) = v^2$ with $v^2 = \sigma_U^2$ for GM data and $v^2 = \sigma_U^2 +
\sigma_A^2$ for ECR data. For RCR data the variance depends on $t$, $V(Y_{tj}) = v_t^2 =
\sigma_U^2 + \sigma_A^2 + 2t\sigma_{AB} + t^2\sigma_B^2$. The censoring limit was $a = 0$ and the Heckit was
studied within the ranges $u_t \in [-2, 0] = I_-$, $u_t \in [-1, 1] = I_0$ and $u_t \in [0, 2] = I_+$.
For GM and ECR data the parameters were $\beta = -10, -30$ and $v = -3\beta/2$
(= 15, 45). For $u_t \in I_-$, $\alpha = -4\beta$ (= 40, 120) yielding $u_t = (2t - 8)/3$. For
$u_t \in I_0$, $\alpha = -5\beta/2$ (= 25, 75) yielding $u_t = (2t - 5)/3$, and for $u_t \in I_+$,
$\alpha = -\beta$ (= 10, 30) yielding $u_t = 2(t - 1)/3$. The expected proportion of censored
observations was 0.22 for $u_t \in I_-$, 0.50 for $u_t \in I_0$ and 0.78 for $u_t \in I_+$.
For ECR data two sets of variance components were used,
$(\sigma_U^2 = 200, \sigma_A^2 = 25)$ and $(\sigma_U^2 = 25, \sigma_A^2 = 200)$ giving $v = 15$, and furthermore
$(\sigma_U^2 = 1772, \sigma_A^2 = 253)$ and $(\sigma_U^2 = 253, \sigma_A^2 = 1772)$ giving $v = 45$. Since $v_t^2$
depends on $t$ in the RCR model it is not possible to find parameter values such
that $V(Y_{tj})$ is exactly the same as for the GM and ECR data. The following
parameter choices made the results for the RCR model roughly comparable with
the former models: $\beta = -10$, $\sigma_U^2 = 25$, $\sigma_A^2 = 200$, $\sigma_B^2 = 10$. For $u_t \in I_-$,
$\sigma_{AB} = -18.45$, so $v_t$ varied between 14.1 and 15.4, and for $u_t \in I_+$, $\sigma_{AB} =
-31.55$ with $v_t$ varying between 11.5 and 13.1.

Simulations were also performed to study the performance of the normal-Heckit
when in fact the observations with GM data were Laplace distributed.
Three cases were considered: (i) $v = 0.5$, $\beta = -0.25$, (ii) $v = 1$, $\beta = -0.5$, (iii)
$v = 5$, $\beta = -2.5$. For $u_t \in I_-$, $\alpha = -4\beta$ giving $u_t = (t - 4)/2$, $t = 1, 2, 3, 4$. For
$u_t \in I_+$, $\alpha = -\beta$ giving $u_t = (t - 1)/2$, $t = 1, 2, 3, 4$. The hazards for these three
values of $v$ are shown in Figure 1.
Estimates of $p_t$ and $u_t$ that are required in order to estimate the hazard $h_t$ in
the first step of Heckman’s procedure were obtained from probit analysis. Based
on the results from a preparatory study of the bias of the estimated hazard
outlined below, the sample sizes were chosen as n = 100 and 400 when studying
bias and variance of the $\alpha$ and $\beta$ estimates. All simulations were performed
with 10,000 replicates, using random number functions and procedures in SAS
version 9.1. A computer program is available from the author on request.
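
The design values for the normal case can be reproduced with a few lines of arithmetic. The sketch below (standard-library Python, with helper names chosen here for illustration) assumes $\beta = -10$, $v = 15$, $a = 0$ and $t = 1, \ldots, 4$ as stated above, and recomputes the standardized limits $u_t$, the expected censoring proportions, and the range of $v_t$ for the RCR parameters.

```python
from statistics import NormalDist, fmean

STD = NormalDist()  # standard normal distribution
TIMES = (1, 2, 3, 4)


def standardized_limits(alpha, beta, v, a=0.0):
    """u_t = (a - mu_t) / v with mu_t = alpha + beta * t, as in Eq. (3)."""
    return [(a - (alpha + beta * t)) / v for t in TIMES]


def expected_censoring(alpha, beta, v, a=0.0):
    """Average of Phi(u_t) over the time points."""
    return fmean(STD.cdf(u) for u in standardized_limits(alpha, beta, v, a))


def rcr_sd(t, var_u, var_a, cov_ab, var_b):
    """v_t for the RCR model: sqrt(s_U^2 + s_A^2 + 2 t s_AB + t^2 s_B^2)."""
    return (var_u + var_a + 2 * t * cov_ab + t * t * var_b) ** 0.5


if __name__ == "__main__":
    # alpha = -4*beta, -5*beta/2 and -beta for beta = -10 and v = 15
    for label, alpha in (("I-", 40.0), ("I0", 25.0), ("I+", 10.0)):
        u = [round(x, 2) for x in standardized_limits(alpha, -10.0, 15.0)]
        p = expected_censoring(alpha, -10.0, 15.0)
        print(label, "u_t =", u, "expected censoring =", round(p, 2))
    print([round(rcr_sd(t, 25.0, 200.0, -18.45, 10.0), 1) for t in TIMES])
```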

3.2 The estimated hazard

The bias of the estimated hazard $h_t$ was studied at t = 1, 2, 3, 4 when data were
generated by the GM model with normally distributed disturbances. For both
estimators in (9a) and (9b) the bias decreased rapidly with increasing n. For
small n, the bias could be substantial, especially for $u_t \in I_+$ and t = 4. However,
it was concluded that for practical purposes when estimating $h_t$, the bias could
be ignored when n is 100 or larger. The same conclusions were drawn about the
variances of the $h_t$ estimates. Here the probit-based estimator had a slightly
smaller variance and the variance decreased more rapidly than the bias with
increasing n. A similar pattern was obtained for the ECR and RCR models. So,
under normality assumptions the probit-based estimator is at least as good as the
semi-parametric estimator, and for n = 100 or larger the influence from bias can
be ignored and the variance remains small.
Now, consider the case when the disturbances are Laplace distributed. The
absolute relative bias was smallest for $v = 1$. With increasing n the bias persisted
and the variance decreased. The latter was more than five times larger for n =
100 than for n = 400. The results show that both the proportion-based (semi-parametric)
and the probit-based estimators of the hazard can be seriously biased if the hazard is far from
that of the normal and that this cannot be compensated for by increasing n.
In the sequel, when the properties of estimates of $\beta$ and $\alpha$ are studied under
normality, n is chosen as 100 and 400. From the results above it follows that
possible biases of the estimates cannot be caused by poor estimates of the
hazard in the first step in the Heckit, but purely by the fact that $\mu_x$ and $h_x$ in Eq.
(6) are both linear, which in turn leads to the structure in Eq. (10).
Since the Heckit is so closely tied to normality it was furthermore
studied whether two commonly used tests of normality for censored data,
Pearson’s chi-square and the likelihood-ratio test (SAS Online Guide, 2006),
were able to detect deviations from normality. When the observations were
Laplace distributed ($v = 0.5$, $\beta = -0.25$, $\alpha = -4\beta$) it was found that the p-values
of both tests were roughly the same. However, for n = 100 only 20 % of the p-values
were below 0.10 (the recommended significance level) and 36 % were
below 0.20. For n = 400, 58 % of the p-values were lower than 0.10 and 72 %
were lower than 0.20. It is beyond the scope of this paper to go into details
about these tests, but it is clear that the powers of the tests are unsatisfactorily low
when the alternative to the normal distribution is that of Laplace and $n \le 400$.

3.3 Estimates of $\beta$ and $\alpha$

Tables 1a and 1b summarize the properties of the $\beta$ and $\alpha$ estimates when the
Heckit was applied to GM data. Both bias and variance of the estimates
increased as the range of the $u_t$ values moved upwards, and decreased with
increasing n. Especially for $u_t \in I_+$, bias and variance were considerable, up to
15 times larger than for $u_t \in I_-$. As expected, both bias and variance were larger
for $\beta = -30$ than for $\beta = -10$ since the former value makes $V(Y_{tj})$ larger.
However, it is interesting that the absolute relative bias turned out to be
independent of the magnitude of $\beta$ for given n and a given range of $u_t$.

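Table 1a. Relative bias (%) with the GM model.

| $\beta$ | n | $\hat{\beta}$: $I_-$ | $\hat{\beta}$: $I_0$ | $\hat{\beta}$: $I_+$ | $\hat{\alpha}$: $I_-$ | $\hat{\alpha}$: $I_0$ | $\hat{\alpha}$: $I_+$ |
|---|---|---|---|---|---|---|---|
| -10 | 100 | 5 | 28 | 71 | 3 | 6 | 61 |
| -10 | 400 | 0.3 | 4 | 53 | 4 | 8 | 51 |
| -30 | 100 | 5 | 29 | 70 | 3 | 5 | 60 |
| -30 | 400 | 0.6 | 5 | 50 | 4 | 4 | 50 |

Table 1b. Variances with the GM model.

| $\beta$ | n | $\hat{\beta}$: $I_-$ | $\hat{\beta}$: $I_0$ | $\hat{\beta}$: $I_+$ | $\hat{\alpha}$: $I_-$ | $\hat{\alpha}$: $I_0$ | $\hat{\alpha}$: $I_+$ |
|---|---|---|---|---|---|---|---|
| -10 | 100 | 19 | 128 | 289 | 19 | 27 | 84 |
| -10 | 400 | 2.9 | 34 | 270 | 3.7 | 5.9 | 88 |
| -30 | 100 | 163 | 2282 | 3316 | 166 | 298 | 1188 |
| -30 | 400 | 26 | 392 | 2033 | 34 | 65 | 679 |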

Similar results, when the Heckit was applied to ECR data, are seen in Tables 2a
and 2b. Bias and variance were roughly the same as for the GM data. For
$u_t \in I_-$ and $u_t \in I_0$ bias and variance of the $\beta$-estimator were larger when the
ratio $\sigma_A^2 / \sigma_U^2$ is large. As for the GM model, the absolute relative bias seemed
to be roughly independent of the magnitude of $\beta$.

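Table 2a. Relative bias (%) with the ECR model. The first and second figures
represent the cases when $\sigma_A^2 / \sigma_U^2$ is small and large, respectively.

| $\beta$ | n | $\hat{\beta}$: $I_-$ | $\hat{\beta}$: $I_0$ | $\hat{\beta}$: $I_+$ | $\hat{\alpha}$: $I_-$ | $\hat{\alpha}$: $I_0$ | $\hat{\alpha}$: $I_+$ |
|---|---|---|---|---|---|---|---|
| -10 | 100 | 6, 14 | 33, 36 | 69, 61 | 2, 2 | 5, 11 | 71, 90 |
| -10 | 400 | 0.5, 2 | 6, 5 | 53, 51 | 4, 3 | 8, 8 | 55, 77 |
| -30 | 100 | 6, 12 | 32, 33 | 67, 60 | 2, 2 | 5, 10 | 67, 90 |
| -30 | 400 | 0.4, 1 | 6, 16 | 53, 47 | 4, 3 | 8, 9 | 57, 73 |

Table 2b. Variances with the ECR model. For each combination of $\beta$ and n, the
first and second rows represent the cases when $\sigma_A^2 / \sigma_U^2$ is small and large, respectively.

| $\beta$ | n | $\sigma_A^2 / \sigma_U^2$ | $\hat{\beta}$: $I_-$ | $\hat{\beta}$: $I_0$ | $\hat{\beta}$: $I_+$ | $\hat{\alpha}$: $I_-$ | $\hat{\alpha}$: $I_0$ | $\hat{\alpha}$: $I_+$ |
|---|---|---|---|---|---|---|---|---|
| -10 | 100 | small | 29 | 161 | 298 | 24 | 23 | 135 |
| -10 | 100 | large | 97 | 196 | 291 | 29 | 30 | 411 |
| -10 | 400 | small | 3.6 | 43 | 233 | 4.0 | 4.7 | 93 |
| -10 | 400 | large | 14 | 89 | 179 | 4.1 | 9.7 | 250 |
| -30 | 100 | small | 228 | 1286 | 3581 | 194 | 222 | 1310 |
| -30 | 100 | large | 708 | 1683 | 2159 | 235 | 229 | 3165 |
| -30 | 400 | small | 34 | 504 | 3851 | 37 | 45 | 1446 |
| -30 | 400 | large | 142 | 993 | 2071 | 39 | 107 | 3012 |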

Tables 3a and 3b show the pattern for the RCR data. Compared with the results
in the Tables 1 and 2, bias and variance are smaller.

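Table 3a. Relative bias (%) with the RCR model.

| n | $\hat{\beta}$: $I_-$ | $\hat{\beta}$: $I_+$ | $\hat{\alpha}$: $I_-$ | $\hat{\alpha}$: $I_+$ |
|---|---|---|---|---|
| 100 | -5 | 46 | 4 | 57 |
| 400 | -5 | 30 | 5 | 53 |

Table 3b. Variances with the RCR model.

| n | $\hat{\beta}$: $I_-$ | $\hat{\beta}$: $I_+$ | $\hat{\alpha}$: $I_-$ | $\hat{\alpha}$: $I_+$ |
|---|---|---|---|---|
| 100 | 1.7 | 230 | 3.6 | 33 |
| 400 | 0.34 | 212 | 0.91 | 34 |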

From Tables 1-3 it is concluded that the Heckit works quite well for $u_t \in I_-$
(22 % censored) and is less good when $u_t \in I_0$ (50 % censored), especially
regarding bias of the $\beta$ estimator. For $u_t \in I_+$ (78 % censored), Heckman’s
procedure is very poor but seems to perform slightly better with RCR data.
In Section 2.3 it was noticed that the absolute relative bias when estimating
the $\beta$-components can be expressed by $\theta$ in Eq. (10). Since $\theta$ in Tables 1a and 2a is
roughly independent of the magnitude of $\beta$, and thus also of $v$, and only
dependent on n and on the censoring proportion p, it is tempting to search for
a relation that describes how $\theta$ depends on n and p. From the results in Tables 1
and 2 (GM and ECR data) the following relation was established,

$$
\theta = p^{\Psi\sqrt{n}} \tag{11}
$$

where $\Psi$ = 0.1966 (GM), 0.1791 (ECR with $\sigma_A^2 / \sigma_U^2$ small), 0.1324 (ECR with
$\sigma_A^2 / \sigma_U^2$ large). The constant $\Psi$ was determined by fitting the linearized version
of Eq. (11) to the estimates obtained in Tables 1-2 by ordinary least squares.
The coefficient of determination ($R^2$) ranged from 99.3 % to 99.8 %. The
relation in Eq. (11) is illustrated in Figures 2a,b. From Figure 2a it is concluded
that when n = 1000 or larger the censoring proportion p has little impact on the
magnitude of $\theta$ as long as p is below 50 %. E.g. n = 1000 and p = 0.5 gives
$\theta \approx 0.01$. If the censoring proportion is small, say below 20 %, then Figure 2b
tells us that the absolute relative bias can be ignored for sample sizes above 250.
However, for large p and small n the absolute relative bias can be substantial.
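
The relation in Eq. (11) and the adjustment suggested in Section 2.3 can be sketched as below. This assumes that Eq. (11) has the form $\theta = p^{\Psi\sqrt{n}}$, which reproduces the value $\theta \approx 0.01$ quoted for n = 1000 and p = 0.5 as well as the GM entries of Table 1a for the two lower censoring ranges. The function names and dictionary keys are illustrative and not taken from the paper.

```python
import math

# Fitted constants reported for Eq. (11).
PSI = {"GM": 0.1966, "ECR_small_ratio": 0.1791, "ECR_large_ratio": 0.1324}


def relative_bias(p: float, n: int, psi: float = PSI["GM"]) -> float:
    """theta = p ** (psi * sqrt(n)); p is the censoring proportion."""
    return p ** (psi * math.sqrt(n))


def adjust_slope(beta_hat: float, p: float, n: int, psi: float = PSI["GM"]) -> float:
    """Divide the second-step slope by (1 - theta), cf. Eq. (10)."""
    return beta_hat / (1.0 - relative_bias(p, n, psi))


if __name__ == "__main__":
    for n in (50, 100, 400, 1000):
        row = [round(relative_bias(p, n), 3) for p in (0.22, 0.50, 0.78)]
        print(f"n = {n:4d}: theta at p = 0.22, 0.50, 0.78 ->", row)
```

The adjustment only makes sense when $\theta$ is clearly below 1; for large p and small n the divisor $(1 - \theta)$ becomes small and the adjusted estimate is correspondingly unstable.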

Figure 2. Illustration of the dependency of the absolute relative bias $\theta$ (Theta)
on p and n in Eq. (11) when $\Psi = 0.1966$ (GM model). (a) The upper to the
lower curves show the dependency for n = 50, 100, 400 and 1000. (b) The upper
to the lower curves show the dependency for the censoring proportions p
= 0.78, 0.50 and 0.22.

Since $\theta$ can be estimated from data by means of Eq. (11) it is possible to
remove a great part of the bias by dividing the $\beta$ estimate obtained from the
second step in the Heckit by $(1 - \theta)$ (cf. Eq. (10)). This was also confirmed in
simulation experiments where the absolute relative bias was about three times
smaller after the adjustment. A similar adjustment for bias when estimating $\alpha$
requires an estimate of $v$. Although $v$ is an estimable parameter in the second
step of Heckman’s procedure, the estimates of the latter seem to be extremely
unreliable. In the simulation study the estimates of $v$ had a serious negative bias
and the variances of the $v$-estimates were 5-15 times larger than the variance
of $\hat{\beta}$. For this reason no attempt was made to adjust for bias of the $\alpha$ parameter.
The (normal-)Heckit estimates of $\alpha$ and $\beta$ were furthermore studied when
the disturbances in fact were Laplace distributed using the parameters $v$ = 0.5,
1.0, 5.0. The corresponding hazards are shown in Figure 1. For $u_t \in I_-$ it is
concluded from Table 4a that for given n the absolute relative bias of the
estimates is roughly the same for the three values of $v$. With increasing n
much of the bias persists and the variances are reduced. A comparison between
Table 4a and Table 1a for $u_t \in I_-$ shows that the absolute relative bias is very much
the same for n = 100. The difference is that in Table 1a, where the Heckit is
applied to normally distributed observations, the bias is reduced much more for
n = 400. The normal-Heckit nevertheless seems to be surprisingly robust for Laplace
distributed observations provided that $u_t \le 0$. On the other hand, for $u_t \ge 0$ it
is seen from Table 4b that the normal-Heckit collapses with Laplace distributed
data.

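Table 4a. Relative bias (%) and variance of estimates obtained by the normal-Heckit
when in fact the data are Laplace distributed with $u \in I_-$.

| n | $\beta$ | $v$ | Relative bias of $\hat{\beta}$ | Variance of $\hat{\beta}$ | Relative bias of $\hat{\alpha}$ | Variance of $\hat{\alpha}$ |
|---|---|---|---|---|---|---|
| 100 | -0.25 | 0.5 | 5 | 0.02 | 4 | 0.01 |
| 100 | -0.5 | 1.0 | 5 | 0.07 | 5 | 0.04 |
| 100 | -2.5 | 5.0 | 6 | 3.20 | 5 | 1.85 |
| 400 | -0.25 | 0.5 | 3 | 0.00 | 5 | 0.00 |
| 400 | -0.5 | 1.0 | 3 | 0.01 | 5 | 0.01 |
| 400 | -2.5 | 5.0 | 2 | 0.18 | 5 | 0.18 |

Table 4b. Relative bias (%) and variance of estimates obtained by the normal-Heckit
when in fact the data are Laplace distributed with $u \in I_+$.

| n | $\beta$ | $v$ | Relative bias of $\hat{\beta}$ | Variance of $\hat{\beta}$ | Relative bias of $\hat{\alpha}$ | Variance of $\hat{\alpha}$ |
|---|---|---|---|---|---|---|
| 100 | -0.25 | 0.5 | 94 | 1.09 | 36 | 0.50 |
| 100 | -0.5 | 1.0 | 100 | 2.33 | 41 | 1.43 |
| 100 | -2.5 | 5.0 | 101 | 45.42 | 41 | 41.54 |
| 400 | -0.25 | 0.5 | 97 | 0.30 | 38 | 0.33 |
| 400 | -0.5 | 1.0 | 99 | 1.14 | 40 | 0.96 |
| 400 | -2.5 | 5.0 | 100 | 13.90 | 42 | 14.74 |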

It is interesting to compare these results with those obtained by Paarsch
(1984). There the normal-Heckit was applied to Laplace distributed observations
using two sets of parameters: $\alpha = -2.94$, $\beta = 1$, $v = 10$ giving $u_t =
(2.94 - t)/10$ for t = 0, 1, …, 20 and $u_t \in (-1.706, 0.294)$ (25 % censoring), and
$\alpha = -10$ and the same $\beta$ and $v$ giving $u_t = 1 - t/10$ and $u_t \in (-1, 1)$ (50 %
censoring). For n = 100 the relative bias of the $\beta$-estimator was found to be 32
% (25 % censoring) and 68 % (50 % censoring). Although these figures were
based on simulations with only 100 replicates, they agree well with the results in
this paper.

3.4 Comparison between the efficiency obtained with censored and uncensored
data

When data are censored it is obvious that some information is lost when
estimating the parameters. Although this is inevitable it may be of some interest
to compare the variances in Tables 1-3 with those that are obtained with
uncensored data. Such a comparison may be considered to be of purely
academic interest, but one reason for doing it is to set up a standard that allows
for comparisons between the normal-Heckit and alternative methods. Let the
optimal estimator of $\beta$ with uncensored data be $\hat{\beta}_{OPT} = n^{-1}\sum_{j=1}^{n} \hat{\beta}_j$, where $\hat{\beta}_j =
w_{tY} / w_{tt}$ with $w_{tY} = \sum_{t=1}^{T} (t - \bar{t})(Y_{tj} - \bar{Y}_j)$, $w_{tt} = \sum_{t=1}^{T} (t - \bar{t})^2$ (cf. Rao, 1965, Ch. IV in
Swamy, 1971 and Ch. 3 in Hsiao, 2003). Then $V(\hat{\beta}_{OPT}) = \sigma_U^2 / (n w_{tt})$ for the GM and
ECR models, and $V(\hat{\beta}_{OPT}) = (\sigma_B^2 + \sigma_U^2 / w_{tt}) / n$ for the RCR model. From this
one obtains the relative efficiency $RE = 100 \cdot V(\hat{\beta}_{OPT}) / V(\hat{\beta}_{Heck})$, where
$V(\hat{\beta}_{Heck})$ is the variance of $\hat{\beta}$ obtained from the Heckit and is determined from
the simulations. For $u_t \in I_0$ and $u_t \in I_+$ the relative efficiency is below 1 % for
all three models. But for $u_t \in I_-$, RE is 11.0 % when n = 400 and 8.8 % when
n = 100 for the RCR model, compared with RE of 3.4 % (n = 400) and 2.4 %
(n = 100) for the GM model. Also from this point of view, Heckman’s procedure
seems to produce the best estimates when it is applied to the RCR model.
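
The relative efficiencies can be recomputed from the design parameters and the simulated Heckit variances. The sketch below assumes $t = 1, \ldots, 4$ (so that $w_{tt} = 5$), $\sigma_U^2 = 225$ for the GM model with $\beta = -10$, and $\sigma_B^2 = 10$, $\sigma_U^2 = 25$ for the RCR model. The Heckit variances are the rounded entries for $I_-$ in Tables 1b and 3b, so only approximate agreement with the reported figures can be expected. The function names are illustrative.

```python
TIMES = (1, 2, 3, 4)


def w_tt(times=TIMES):
    """Sum of squared deviations of the time points from their mean."""
    t_bar = sum(times) / len(times)
    return sum((t - t_bar) ** 2 for t in times)


def var_opt_gm_ecr(var_u, n, times=TIMES):
    """V(beta_OPT) = sigma_U^2 / (n * w_tt) for the GM and ECR models."""
    return var_u / (n * w_tt(times))


def var_opt_rcr(var_b, var_u, n, times=TIMES):
    """V(beta_OPT) = (sigma_B^2 + sigma_U^2 / w_tt) / n for the RCR model."""
    return (var_b + var_u / w_tt(times)) / n


def relative_efficiency(var_opt, var_heckit):
    """RE in percent: 100 * V(beta_OPT) / V(beta_Heck)."""
    return 100.0 * var_opt / var_heckit


if __name__ == "__main__":
    # Heckit variances for I-: 19 (GM, n=100), 1.7 and 0.34 (RCR, n=100 and 400).
    print("GM,  n=100:", round(relative_efficiency(var_opt_gm_ecr(225.0, 100), 19.0), 1))
    print("RCR, n=100:", round(relative_efficiency(var_opt_rcr(10.0, 25.0, 100), 1.7), 1))
    print("RCR, n=400:", round(relative_efficiency(var_opt_rcr(10.0, 25.0, 400), 0.34), 1))
```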

4. Using the Heckit for analysing recurrence of lower back problems among
sick-listed men

4.1 Background

In 1993 the International Social Security Association initiated the Work
Incapacity and Reintegration project, primarily because of high levels of
expenditure on sickness in many industrialized countries (Hansson and
Hansson, 2000). In the Swedish part of the project, men and women sick-listed
due to lower back or neck problems were followed for 2 years. One purpose
of the study was to analyze the effects of commonly practiced medical
interventions upon work resumption. The Swedish data base also contains
information about the person’s health during a further 2-year period after the 2-
year follow-up. Results from this post follow-up period have not been published
elsewhere. Of special interest was the number of sick-listed days during
the post follow-up due to the same diagnosis as in the follow-up.

4.2 The post follow-up

Data from the post follow-up will be used to illustrate some undesirable
consequences of the Heckit. A total of n = 203 men with unspecified lower back diagnoses
who had returned to work within the follow-up period were observed during the
post follow-up. Men with specific back diagnoses (about 10 % of all cases,
Bergendorff et al. 2001, p. 46) were excluded since these had back surgery and
were thereafter free from back problems with the same diagnosis. The
dependent variable of interest is DAYS = ‘Number of sick-listed days during the
post follow-up due to the same diagnosis as in the follow-up’. One important
explanatory variable was EQT = ‘Value on EuroQol Thermometer scale’,
obtained at the end of the 2-year follow-up. The latter is a health-related quality
of life measure obtained from a visual scale on which the respondent is asked to
mark his health from 0 (worst function) to 100 (best function) (Hansson et al.,
2005). The variable EQT was negatively associated with DAYS. Another
explanatory variable was STATE1Y (= 1 if the person had returned to work
within 1 year during the previous follow-up, and = 0 otherwise). Rather
unexpectedly, there was a significant positive association between not returning
to work within 1 year and DAYS = 0 (p-value = 0.01, chi-square test). In fact,
89 % (31/35) of those who did not return within 1 year had zero days during the
post follow-up period, while the corresponding figure for those who returned
within 1 year was 68 % (115/168). No further explanatory variables, such as
demographic and socio-economic factors, work environment, co-morbidity and
treatment received, were found to be associated with DAYS.
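The association between STATE1Y and DAYS = 0 can be re-examined from the four counts given above (31 and 4 among those who did not return within 1 year, 115 and 53 among those who did). The sketch below computes the ordinary Pearson chi-square statistic for a 2x2 table without continuity correction; which variant of the test was actually used is not stated in the text, so this is only an illustration.

```python
from math import erfc, sqrt

def pearson_chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2.0))
    return chi2, p_value

# Rows: did not return / returned within 1 year.
# Columns: zero days / at least one day in the post follow-up.
chi2, p_value = pearson_chi2_2x2(31, 4, 115, 53)
print(round(chi2, 2), round(p_value, 3))
```

The uncorrected statistic gives a p-value between 0.01 and 0.02, which is of the same order as the value reported above.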
The majority of the observations lie on the border DAYS = 0, and it
is obvious that the standard conditions for performing a regression analysis,
such as normally or at least symmetrically distributed disturbances, are
violated. Therefore, a latent variable Y is introduced such that
$$DAYS = \begin{cases} 0, & \text{if } Y \le 0 \\ Y, & \text{if } Y > 0 \end{cases}$$
and Y is a variable that is related to a person's state of health. It is assumed that
for the j:th person, $Y_j = \alpha + \beta_1 \cdot STATE1Y + \beta_2 \cdot EQT + U_j$, j = 1,…,203.
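A small simulation may help to show how a latent variable of this kind piles observations up on the border. The sketch below leaves out the covariates and uses hypothetical values for the intercept and the standard deviation v of normally distributed disturbances; it only illustrates the censoring mechanism and is not a model of the back-pain data.

```python
import random
from math import erf, sqrt

def observed_days(y):
    """Censoring from below at zero: DAYS = 0 if Y <= 0, otherwise DAYS = Y."""
    return y if y > 0 else 0.0

def simulate_censored_fraction(alpha, v, n, seed=0):
    """Share of border observations when Y = alpha + U with U ~ N(0, v^2)."""
    rng = random.Random(seed)
    days = [observed_days(alpha + rng.gauss(0.0, v)) for _ in range(n)]
    return sum(1 for d in days if d == 0.0) / n

def std_normal_cdf(u):
    return 0.5 * (1.0 + erf(u / sqrt(2.0)))

# Hypothetical values: alpha = -1 and v = 2 give P(Y <= 0) = Phi(0.5).
simulated = simulate_censored_fraction(alpha=-1.0, v=2.0, n=20000)
theoretical = std_normal_cdf(0.5)
print(round(simulated, 3), round(theoretical, 3))
```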

4.3 Applying Heckman’s two-step approach

Below, the data are analyzed by the Heckit; to clarify the different
steps, they are numbered (i)-(iii).

(i) Estimation of $h_x$ in (5) by means of probit analysis

The probit model is
$p_x = P(Y_j \le 0) = \Phi(u_x)$, $u_x = \theta_0 + \theta_1 \cdot STATE1Y + \theta_2 \cdot EQT$
where $\theta_0 = -\alpha/v$, $\theta_1 = -\beta_1/v$, $\theta_2 = -\beta_2/v$. The fit of the model was tested by
Pearson's chi-square statistic and the Maximum Likelihood Ratio (MLR)
statistic, giving the p-values 0.33 and 0.20, respectively, so the probit model
should not be rejected at the 10 % level. The estimates that were obtained from
the probit analysis were $\hat\theta_0 = 0.5649$, $\hat\theta_1 = -1.1037$, $\hat\theta_2 = 0.0148$. The observed
censoring proportion was 146/203 = 0.72. Much of the u-range is located in the
part where the hazard is roughly linear, especially for STATE1Y = 0, where
$u_x$ ranges from 0.56 to 2.04. The range of the $u_x$-values indicates that the
Heckit may give unreliable estimates (cf. Section 2.3).
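The u-range quoted in step (i) can be reproduced from the probit estimates. The sketch below makes two assumptions that should be checked against Eq. (5): the coefficient of STATE1Y is taken to be negative, since those who returned within 1 year had the lower proportion of zero days, and the hazard is taken to be the standard normal hazard $\varphi(u)/(1 - \Phi(u))$. EQT is evaluated at the end points 0 and 100 of its scale.

```python
from math import erf, exp, pi, sqrt

# Probit estimates quoted in the text. The negative sign of THETA_1 is an
# assumption made for this sketch (see the lead-in).
THETA_0, THETA_1, THETA_2 = 0.5649, -1.1037, 0.0148

def u_x(state1y, eqt):
    """Probit index u_x = theta_0 + theta_1 * STATE1Y + theta_2 * EQT."""
    return THETA_0 + THETA_1 * state1y + THETA_2 * eqt

def std_normal_cdf(u):
    return 0.5 * (1.0 + erf(u / sqrt(2.0)))

def std_normal_pdf(u):
    return exp(-0.5 * u * u) / sqrt(2.0 * pi)

def hazard(u):
    """Assumed form of the hazard: standard normal density over survival function."""
    return std_normal_pdf(u) / (1.0 - std_normal_cdf(u))

for state1y in (0, 1):
    u_low, u_high = u_x(state1y, 0), u_x(state1y, 100)
    print(state1y, round(u_low, 2), round(u_high, 2),
          round(hazard(u_low), 3), round(hazard(u_high), 3))
```

For STATE1Y = 0 the index runs from 0.56 to 2.04, in agreement with the range stated above.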

(ii) Regressing $Y_j \mid Y_j > 0$ on $x' = (STATE1Y, EQT)$ and $h_x$

The regression relation in Eq. (6), estimated by OLS, is

$\hat E(Y_j \mid Y_j > 0) = 3563 + 2788 \cdot STATE1Y - 37.5 \cdot EQT + 3095 \cdot \hat h_x$   (12)

Here all estimated coefficients are significantly different from zero at the 5 %
level as judged by two sided t-tests.

(iii) Calculation of expected number of sick-listed days during the post follow-up, according to Eq. (7).
The expected number of sick-listed days is $\hat E(DAYS) = \hat E(Y_j \mid Y_j > 0) \cdot (1 - \hat p_x)$.
Here the first factor is given in Eq. (12) and an estimate of $p_x$ is obtained from
the estimated probit model. The estimates have little in common with the actual
data. E.g. at EQT = 20, $\hat E(DAYS)$ is about 800, but in the actual data no one
had more than 650 days. From Eq. (11) the estimate of $\theta$ is
$(0.72)^{0.1966\sqrt{203}} = 0.40$, i.e. the $\beta$-coefficients have been estimated with an
absolute relative bias of 40 %. This figure can be used to correct for the bias of
the $\beta$-parameters by using Eq. (10):
$\hat\beta_1 (1 - \theta) = \hat\beta_1 (1 - 0.40) = 2788 \Rightarrow \hat\beta_1 = 4647$

$\hat\beta_2 (1 - \theta) = \hat\beta_2 (1 - 0.40) = -37.5 \Rightarrow \hat\beta_2 = -62.5$
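The arithmetic of the bias estimate and the correction can be followed in a short sketch. It assumes that the relation used for the bias has the form $\theta = p^{c\sqrt{n}}$ with c = 0.1966, which reproduces the value 0.40 for p = 0.72 and n = 203, and that the correction divides each estimate by $1 - \theta$. The function names are hypothetical, and the negative sign of the EQT coefficient is taken from Eq. (12).

```python
from math import sqrt

def relative_bias(p, n, c=0.1966):
    """Assumed form of the bias relation: theta = p ** (c * sqrt(n))."""
    return p ** (c * sqrt(n))

def bias_corrected(beta_hat, theta):
    """Solve beta * (1 - theta) = beta_hat for beta."""
    return beta_hat / (1.0 - theta)

# Censoring proportion 0.72 and n = 203 from the example; theta is rounded
# to two decimals before the correction, as in the text.
theta = round(relative_bias(0.72, 203), 2)
beta_1 = bias_corrected(2788.0, theta)
beta_2 = bias_corrected(-37.5, theta)
print(theta, round(beta_1), round(beta_2, 1))
```

Using the unrounded value of $\theta$ changes the corrected coefficients only slightly.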



5. Conclusions and suggestions for further research

This paper has studied the performance of Heckman’s two-step approach when
it is used to solve the problem with border-observations without selection effects
and when data are censored from below. From the simulations it was concluded
that the Heckit performed quite well for n larger than 100 and when the
censoring proportion was 0.22, provided that the censored variable was
normally distributed. With increasing censoring proportion the estimates
gradually became more biased and the variance increased. However, it is
possible to compensate for this by increasing the sample size.
By means of Eq. (11) it is possible to estimate $\theta$, the absolute relative bias of
the $\beta$-estimates, and to adjust for the bias in the way that was done in Section
4.3. Eq. (11) can also be used in the planning of a study. By first taking a pilot
sample one gets a rough estimate of the censoring proportion p. The final
sample size n can then be determined from restrictions on $\theta$. E.g. if it is
required that $\theta$ is at most 1 % for the GM model, then n should be at least 62 if
p = 0.05 and at least 1142 if p = 0.50. For reasons of space, Eq. (11) could
only be considered for two special cases of the ECR model. This gives some
practical guidelines, but more detailed studies should be performed on the
effect of the variance ratio upon the relation in Eq. (11).
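The planning step can be sketched as follows. The calculation assumes that Eq. (11) has the form $\theta = p^{c\sqrt{n}}$ with c = 0.1966 for the GM model, which is the form suggested by the worked value in Section 4.3; solving for n then gives $n = (\ln\theta / (c \ln p))^2$. The function name is hypothetical.

```python
from math import log

def required_n(theta_max, p, c=0.1966):
    """Smallest real-valued n with p ** (c * sqrt(n)) <= theta_max, for 0 < p < 1."""
    return (log(theta_max) / (c * log(p))) ** 2

# Pilot estimates of the censoring proportion and a 1 % bound on theta.
for p in (0.05, 0.50):
    print(p, round(required_n(0.01, p), 1))
```

Under this assumption the solutions are about 61.1 and 1142.0, in line with the sample sizes 62 and 1142 quoted above.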
Since the Heckit inevitably gives more or less biased estimates, one should
compare the estimated expectation of the observed variable with the observed
data in a final step. A cautionary practical example was given in Section 4, where
the censoring proportion was 0.72, leading to an estimated absolute relative bias
of the regression estimates of 40 %, and this in turn led to gross overestimates
of the actual costs for sick-listing.
When the censored variable has a distribution that is not normal, Heckman's
two-step procedure may collapse for at least two reasons. One is that the estimates
of the hazard (or Mills ratio) obtained in the first step are biased. A second is that
the regression function of interest and the hazard are no longer additive.
For reasons of space, the effects of misspecification were only
studied for Laplace distributed disturbances, but such effects should be further
investigated for a variety of distributions.

Acknowledgements

The author would like to thank two anonymous referees for their valuable
comments. The research was supported by the National Social Insurance Board
in Sweden (RFV), Dnr 3124/99 –UFU.

References

Dow, W.H. and Norton, E.C. (2003), Choosing Between and Interpreting the
Heckit and Two-Part Models for Corner Solutions, Health Services & Outcomes
Research Methodology 4, 5-18.

Flood, L. and Gråsjö, U. (2001), A Monte Carlo simulation study of a Tobit
model, Applied Economics Letters 8, 581-584.

Gordon, R.D. (1941), Values of Mills' ratio of area to bounding ordinate and of
the normal probability integral for large values of the argument, Annals of
Mathematical Statistics 12, 364-366.

Hansson, T. and Hansson, E. (2000), The Effects of Common Medical
Interventions on Pain, Back Function, and Work Resumption in Patients With
Chronic Low Back Pain, SPINE 25, No 23, 3055-3064.

Bergendorff, S., Hansson, E., Hansson, T. and Jonsson, R. (2001), Vad kan
förutsäga utfallet av en sjukskrivning? (Predictors of health status and work
resumption) (in Swedish), Rygg och Nacke 8. Stockholm: RFV and Sahlgrenska
Universitetssjukhuset.

Hansson, E., Hansson, T. and Jonsson, R. (2004), Predictors for work ability and
disability in men and women with low-back or neck problems, accepted for
publication in European Spine Journal.

Heckman, J. (1976), The common structure of statistical models of truncation,
sample selection and limited dependent variables and a simple estimator of such
models, Annals of Economic and Social Measurement 5, 475-492.

Heckman, J. (1979), Sample Selection Bias as a Specification Error,
Econometrica 47, 153-161.

Honda, Y. (1985), Testing the Error Components Model with Non-Normal
Disturbances, The Review of Economic Studies 52, 681-690.

Hsiao, C. (2003), Analysis of panel data, Cambridge University Press,
Cambridge.

Johnson, N.L., Kotz, S. and Balakrishnan, N. (1994), Continuous
univariate distributions, vol I (2nd ed.), Wiley, New York.

Karlsson, M. (2005), Estimators of Semiparametric Truncated and Censored
Regression Models, Statistical Studies 34, PhD thesis, Department of Statistics,
Umeå University.

Kim, C.K. and Lai, T.L. (2000), Efficient score estimation and adaptive M-
estimators in censored and truncated regression models, Statistica Sinica 10,
731-749.

Kruskal, W.H. and Tanur, J.M. (Ed.) (1978), International Encyclopedia of
Statistics, vol 2, McMillan, New York.

Lee, M.J. (1996), Method of Moments and Semiparametric Econometrics for
Limited Dependent Variable Models, Springer, New York.

Lundevaller, E.H. and Laitila, T. (2002), Test of random subject effects in
heteroscedastic linear models, Biometrical Journal 44, 825-834.

Nelson, F.D. (1984), Efficiency of the two-step estimator for models with
endogenous sample selection, Journal of Econometrics 24, 181-196.

Newey, W.K. (2001), Conditional moment restrictions in censored and
truncated regression models, Econometric Theory 17, 863-888.

Paarsch, H.J. (1984), A Monte Carlo comparison of estimators for censored
regression models, Journal of Econometrics 24, 197-213.

Powell, J.L. (1994), Estimation of semiparametric models. In: Engle, R.F. and
McFadden, D.L. (Eds.), Handbook of Econometrics, Vol 4, pp 2444-2521,
North-Holland, Amsterdam.

Puhani, P.A. (2000), The Heckman correction for sample selection and its
critique, Journal of Economic Surveys 14, No 1, 53-68.

Rao, C.R. (1965), The theory of least squares when the parameters are
stochastic and its application to the analysis of growth curves, Biometrika 52,
447-458.

Rosett, R.N. and Nelson, F.D. (1975), Estimation of the two-limit probit
regression model, Econometrica 43, 141-146.

SAS Online Guide (2006),
http://support.sas.com/91doc/getDoc/statug.hlp/probit_sect.5/htm.

Swamy, P.A.V.B. (1971), Statistical inference in random coefficient regression
models, 55, Springer, Berlin.

Tobin, J. (1958), Estimation of relationships for limited dependent variables,
Econometrica 26, 24-36.

Research Report

2007:2 Frisén, M.: Optimal Sequential Surveillance for Finance, Public Health and other areas.
2007:3 Bock, D.: Consequences of using the probability of a false alarm as the false alarm measure.
2007:4 Frisén, M.: Principles for Multivariate Surveillance.
2007:5 Andersson, E., Bock, D. & Frisén, M.: Modeling influenza incidence for the purpose of on-line monitoring.
2007:6 Bock, D., Andersson, E. & Frisén, M.: Statistical Surveillance of Epidemics: Peak Detection of Influenza in Sweden.
2007:7 Andersson, E., Kühlmann-Berenzon, S., Linde, A., Schiöler, L., Rubinova, S. & Frisén, M.: Predictions by early indicators of the time and height of yearly influenza outbreaks in Sweden.
2007:8 Bock, D., Andersson, E. & Frisén, M.: Similarities and differences between statistical surveillance and certain decision rules in finance.
2007:9 Bock, D.: Evaluations of likelihood based surveillance of volatility.
2007:10 Bock, D. & Pettersson, K.: Explorative analysis of spatial aspects on the Swedish influenza data.
2007:11 Frisén, M. & Andersson, E.: Semiparametric surveillance of outbreaks.
2007:12 Frisén, M., Andersson, E. & Schiöler, L.: Robust outbreak surveillance of epidemics in Sweden.
2007:13 Frisén, M., Andersson, E. & Pettersson, K.: Semiparametric estimation of outbreak regression.
2007:14 Pettersson, K.: Unimodal regression in the two-parameter exponential family with constant or known dispersion parameter
2007:15 Pettersson, K.: On curve estimation under order restrictions
2008:1 Frisén, M.: Introduction to financial surveillance
