Linear Mixed Models For Longitudinal Data
Springer Series in Statistics
Geert Verbeke
Biostatistical Centre
Katholieke Universiteit Leuven
Kapucijnenvoer 35
B-3000 Leuven
Belgium

Geert Molenberghs
Biostatistics, Center for Statistics
Limburgs Universitair Centrum
Universitaire Campus, Building D
B-3590 Diepenbeek
Belgium
Preface

As with the first version, we hope this book will be of value to a wide
audience, including applied statisticians and biomedical researchers, par-
ticularly in the pharmaceutical industry, medical and public health research
organizations, contract research organizations, and academic departments.
This implies that the majority of the chapters are explanatory rather than research oriented and that the emphasis is on practice rather than mathematical rigor. In this respect, guidance and advice on practical issues are the
main focus of the text. On the other hand, some more advanced topics
are included as well, which we believe to be of use to the more demanding
modeler.
In the first version, we had placed strong emphasis on the SAS procedure
MIXED, without discouraging the non-SAS users. Considerable effort was put into treating data analysis issues in a generic fashion, instead of making them fully software dependent. Therefore, a research question was first
translated into a statistical model by means of algebraic notation. In a
number of cases, such a model was then implemented using SAS code.
This was positively received by many readers, and we have therefore kept this format for the most part. In this version, most of the SAS-related material is centralized in a single chapter, and selected examples are still kept throughout the text. Additionally, an Appendix is devoted to other software tools
(MLwiN, SPlus).
Because SAS Version 7 has not been generally marketed, SAS Version
6.12 was used throughout this book. The Appendix briefly lists the most
important changes in Version 7. Selected macros for tools discussed in
the text, not otherwise available in commercial software packages, as well
as publicly available data sets, can be found at Springer-Verlag’s URL:
www.springer-ny.com.
Acknowledgments

This book has been accomplished with considerable help from several peo-
ple. We would like to gratefully acknowledge their support.
A large part of this book is based on joint research. We are grateful to sev-
eral co-authors: Larry Brant (Gerontology Research Center and The Johns
Hopkins University, Baltimore), Luc Bijnens (Janssen Research Founda-
tion, Beerse), Tomasz Burzykowski (Limburgs Universitair Centrum), Marc
Buyse (International Institute for Drug Development, Brussels), Desmond
Curran (European Organization for Research and Treatment of Cancer,
Brussels), Helena Geys (Limburgs Universitair Centrum), Mike Kenward
(London School of Hygiene and Tropical Medicine), Emmanuel Lesaffre
(Katholieke Universiteit Leuven), Stuart Lipsitz (Medical University of
South Carolina, Charleston), Bart Michiels (Janssen Research Foundation,
Beerse), Didier Renard (Limburgs Universitair Centrum), Ziv Shkedy (Lim-
burgs Universitair Centrum), Bart Spiessens (Katholieke Universiteit Leu-
ven), Herbert Thijs (Limburgs Universitair Centrum), Tony Vangeneug-
den (Janssen Research Foundation, Beerse), and Paige Williams (Harvard
School of Public Health, Boston).
Russell Wolfinger (SAS Institute, Cary, NC) has been kind enough to pro-
vide us with a trial version of SAS Version 7.0 during the development of
this text. Bart Spiessens (Katholieke Universiteit Leuven) kindly provided
us with technical support. Steffen Fieuws (Katholieke Universiteit Leuven)
commented on earlier versions of the text.
It has been a pleasure to work with John Kimmel and Jenny Wolkowicki
of Springer-Verlag.
We apologize to our wives, daughters, and son for the time not spent with
them during the preparation of this book and we are very grateful for their
understanding. The preparation of this book has been a period of close and
fruitful collaboration, of which we will keep good memories.
Contents

Preface
Acknowledgments
1 Introduction
2 Examples
3.1 Introduction
3.2.1 Stage 1
3.2.2 Stage 2
4.1 Introduction
5.1 Introduction
6.1 Introduction
7.1 Introduction
7.5 Shrinkage
7.8.1 Introduction
8.1 Introduction
Appendix A Software
References
Index
1
Introduction
Among these data structures, multivariate data have received the most attention in the statistical literature (e.g., Seber 1984, Krzanowski 1988, Johnson and Wichern
1992). Techniques devised for this situation include multivariate regression
and multivariate analysis of variance, which have been implemented in the
SAS procedure GLM (SAS 1991) for general linear models. In addition,
SAS contains a battery of relatively specialized procedures for principal
components analysis, canonical correlation analysis, discriminant analysis,
factor analysis, cluster analysis, and so forth (SAS 1989).
Among the clustered data settings, longitudinal data perhaps require the
most elaborate modeling of the random variability. Diggle, Liang, and Zeger
(1994) distinguish among three components of variability. The first one
groups traditional random effects (as in a random-effects ANOVA model)
and random coefficients (Longford 1993). It stems from interindividual vari-
ability (i.e., heterogeneity between individual profiles). The second compo-
nent, serial association, is present when residuals close to each other in time
are more similar than residuals further apart. This notion is well known in
the time-series literature (Ripley 1981, Diggle 1983, Cressie 1991). Finally,
in addition to the other two components, there is potentially also measure-
ment error. This results from the fact that, for delicate measurements (e.g.,
laboratory assays), even immediate replication will not be able to avoid a
certain level of variation. In longitudinal data, these three components of variability can be distinguished by virtue of both replication and a clear distance concept (time), one of which is lacking in classical spatial and time-series analysis and in clustered data. This implies that adapting
models for longitudinal data to other data structures is in many cases rel-
atively straightforward. For example, clustered data could be analyzed by
leaving out all aspects of the model that refer to time.
Two fairly different views can be adopted. The first one, supported by
large-sample results, states that normal theory should be applied as much
as possible, even to non-normal data such as ordinal scores and counts.
A different view is that each type of outcome should be analyzed using
instruments that exploit the nature of the data. We will adopt the second
standpoint. In addition, since the statistical community has been familiar-
ized with generalized linear models (GLIM; McCullagh and Nelder 1989),
some have taken the view that the normal model for continuous data is
but one type of GLIM. Although this is correct in principle, it fails to acknowledge that normal models are much further developed than any other GLIM (e.g., model checks and diagnostic tools) and that they enjoy unique properties (e.g., the existence of closed-form solutions, exact distributions
of test statistics, unbiased estimators). Extensions of GLIM to the longitu-
dinal case are discussed in Diggle, Liang, and Zeger (1994), where the main
emphasis is on generalized estimating equations (Liang and Zeger 1986).
Generalized linear mixed models have been proposed by, for example, Bres-
low and Clayton (1993). Fahrmeir and Tutz (1994) devote an entire book
to GLIM for multivariate settings.
For normally distributed data, marginal models can easily be fitted, for ex-
ample, with the SAS procedure MIXED, the SPlus function lme, or within
the MLwiN package. For such data, integrating a mixed-effects model over
the random effects produces a marginal model, in which the regression
parameters retain their meaning and the random effects contribute in a
simple way to the variance-covariance structure. For example, the mar-
ginal model corresponding to a random-intercepts model is a compound
symmetry model that can be fitted without explicitly acknowledging the
random-intercepts structure. In the same vein, certain types of transition
model induce simple marginal covariance structures. For example, some
first-order stationary autoregressive models imply an exponential or AR(1)
covariance structure. As a consequence, many marginal models derived
from random-effects and transition models can be fitted with mixed-models
software.
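As a concrete illustration of this correspondence, the following sketch fits a random-intercepts model and the compound symmetry model it implies, using the SAS procedure MIXED. The data set LONG and the variables ID, TREAT, TIME, and Y are hypothetical placeholders, not one of the data sets analyzed in this book.

   /* Random-intercepts formulation: the implied marginal covariance   */
   /* is compound symmetry (constant variance and a constant,          */
   /* non-negative covariance between any two measurements on the      */
   /* same subject).                                                   */
   proc mixed data=long method=reml;
      class id treat;
      model y = treat time treat*time / solution;
      random intercept / subject=id;
   run;

   /* Direct marginal formulation: the same compound symmetry          */
   /* structure, specified through the REPEATED statement without      */
   /* reference to random effects.                                     */
   proc mixed data=long method=reml;
      class id treat;
      model y = treat time treat*time / solution;
      repeated / type=cs subject=id;
   run;

Both programs yield the same fitted marginal model whenever the estimated within-subject covariance is non-negative; only the first program carries the hierarchical, random-effects interpretation.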
Motivated by the above discussion, we have restricted the scope of this book
to linear mixed models for continuous outcomes. Fahrmeir and Tutz (1994)
discuss generalized linear (mixed) models for multivariate outcomes, while
longitudinal versions are treated in Diggle, Liang, and Zeger (1994). Non-
linear models for repeated measurement data are discussed by Davidian
and Giltinan (1995).
While research in this area has largely focused on the formulation of linear
mixed-effects models, inference, and software implementation, other im-
portant aspects, such as exploratory analysis, the investigation of model
fit, and the construction of diagnostic tools have received considerably less
attention. In addition, longitudinal data are typically very prone to in-
completeness, due to dropout or intermediate missing values. This poses
Broadly, the structure of the book is as follows. The key examples, used
throughout the book, are introduced in Chapter 2. Chapters 3 to 9 pro-
vide the core about the linear mixed-effects model, while Chapters 10 to 13
discuss more advanced tools for model exploration, influence diagnostics,
as well as extensions of the original model. Chapters 14 to 16 introduce
the reader to basic incomplete data concepts. Chapters 17 and 18 discuss
strategies to model incomplete longitudinal data, based on the linear mixed
model. The sensitivity of such strategies to parametric assumptions is in-
vestigated in Chapters 19 and 20. Some additional missing data topics are
presented in Chapters 21 and 22. Chapter 23 is devoted to design consid-
erations. Five case studies are treated in detail in Chapter 24. Appendix A
reviews a number of software tools for fitting mixed models. Since the book
puts relatively more emphasis on SAS than on other packages, this proce-
dure is discussed in detail in Chapter 8, while worked examples can be
found throughout the text. Some technical background material from the
sensitivity chapters is deferred until Appendix B.
2
Examples
This chapter introduces the longitudinal sets of data which will be used
throughout the book. The rat data are presented in Section 2.1. The TDO
data, studying toenails, are described in Section 2.2. Section 2.3 is de-
voted to the Baltimore Longitudinal Study of Aging, with two substudies:
prostate-specific antigen data (Section 2.3.1) and data on hearing (Sec-
tion 2.3.2). Section 2.4 introduces the Vorozole study, focusing on quality
of life in breast cancer patients. In Section 2.5, we will introduce data,
previously analyzed by Goldstein (1979), on the heights of 20 schoolgirls.
Section 2.6 presents the growth data of Potthoff and Roy (1964). Mastitis
in dairy cattle is the subject of Section 2.7.
To complement the data introduced in this chapter, five case studies, in-
volving additional sets of data, are presented in Chapter 24.
In medical science, there has recently been increased interest in the ther-
apeutic use of hormones. However, such drastic therapies require detailed
knowledge about their effect on the different aspects of growth. To this end, an experiment has been set up at the Department of Orthodontics
of the Catholic University of Leuven (KUL) in Belgium (see Verdonck et
al . 1998). The primary aim was to investigate the effect of the inhibition
FIGURE 2.1. Rat Data. Individual profiles for each of the treatment groups in
the rat experiment separately.
For the purpose of this book, we will consider one of the measurements
which can be used to characterize the height of the skull. The individual
profiles are shown in Figure 2.1. It is clear that not all rats have measure-
ments up to the age of 110 days. This is due to the fact that many rats do
not survive anaesthesia and therefore drop out before the end of the study.
Table 2.1 shows the number of rats observed at each occasion. Although 50 rats were randomized at the start of the experiment, only 22 of them survived the first 6 measurement occasions, so measurements on only 22 rats are available in the way anticipated at the design stage. For example, at the
TABLE 2.1. Rat Data. Summary of the number of observations taken at each
occasion in the rat experiment, for each group separately and in total.
# Observations
Age (days) Control Low High Total
50 15 18 17 50
60 13 17 16 46
70 13 15 15 43
80 10 15 13 38
90 7 12 10 29
100 4 10 10 24
110 4 8 10 22
second occasion (age = 60 days), only 46 rats were available, implying that
for 4 rats only 1 measurement could be recorded.
which is always at the free end of the nail) of the target nail is measured in
millimeters. Obviously, this response will be related to the toe size. There-
fore, we will only include here those patients for which the target nail was
one of the two big toenails. This reduces our sample under consideration to
150 and 148 subjects, respectively. Figure 2.2 shows the observed profiles
of 30 randomly selected subjects from treatment group A and treatment
group B, respectively.
Due to a variety of reasons, 72 (24%) out of the 298 participants left the
study prematurely. Table 2.2 summarizes the number of subjects still in
the study at each occasion, for both treatment groups separately. Although
the comparison of the average evolutions in both treatment groups was of
primary interest, there was also some interest in studying the relationship
between the dropout process and the actual outcome. For example, are
patients who drop out doing better or worse than patients who do not
drop out from the study?
TABLE 2.2. Toenail Data. Summary of the number of observations taken at each
occasion in the TDO study, for each group separately and in total.
# Observations
Time (months) Treatment A Treatment B Total
0 150 148 298
1 149 142 291
2 146 138 284
3 140 131 271
6 131 124 255
9 120 109 229
12 118 108 226
In this book, two of the many responses measured in the BLSA will be
used to illustrate the statistical methodology. In Section 2.3.1, it will be
discussed how data from the BLSA can be used to study the natural history
of prostate disease. Afterward, in Section 2.3.2, the hearing data will be
presented.
During the last 10 years, many papers have been published on the natural
history of prostate disease; see, for example, Carter et al . (1992a, 1992b)
and Pearson et al . (1991, 1994). According to Carter and Coffey (1990),
prostate disease is one of the most common and most costly medical prob-
lems in the United States, and prostate cancer has become the second
leading cause of male cancer deaths. It is therefore very important to look
for markers which can detect the disease at an early stage. The prostate-
specific antigen (PSA) is such a marker. PSA is an enzyme produced by
TABLE 2.3. Prostate Data. Description of subjects included in the prostate data
set, by diagnostic group. The cancer cases are subdivided into local/regional (L/R)
and metastatic (M) cancer cases.
Cancer Cases
Controls BPH cases L/R M
Number of participants 16 20 14 4
Age at diagnosis (years)
Median 66 75.9 73.8 72.1
Range 56.7-80.5 64.6-86.7 63.6-85.4 62.7-82.8
Years of follow-up
Median 15.1 14.3 17.2 17.4
Range 9.4-16.8 6.9-24.1 10.6-24.9 10-25.3
Time between
measurements (years)
Median 2 2 1.7 1.7
Range 1.1-11.7 0.9-8.3 0.9-10.8 0.9-4.8
Number of measurements
per individual
Median 8 8 11 9.5
Range 4-10 5-11 7-15 7-12
both normal and cancerous prostate cells, and its level is related to the
volume of prostate tissue. Still, an elevated PSA level is not necessarily an
indicator of prostate cancer because patients with benign prostatic hyper-
plasia (BPH) also have an enlarged volume of prostate tissue and therefore
also an increased PSA level. This overlap of the distribution of PSA values
in patients with prostate cancer and BPH has limited the usefulness of
a single PSA value as a screening tool since, according to Pearson et al .
(1991), up to 60% of BPH patients may be falsely identified as potential
cancer cases based on a single PSA value.
FIGURE 2.3. Prostate Data. Longitudinal trends in PSA in men with prostate
cancer, benign prostatic hyperplasia, or no evidence of prostate disease.
To the extent possible, age at diagnosis and years of follow-up were matched
for the control, BPH, and cancer groups. However, due to the high preva-
lence of BPH in men over age 50, it was difficult to find age-matched
controls with no evidence of prostate disease. In fact, the control group
remained significantly younger at first visit and at diagnosis, compared to
the BPH group. For this reason, our analyses of this data set will always
correct for age differences at the time of the diagnosis.
Also recorded in the BLSA study are hearing threshold sound pressure lev-
els (SPLs in dB), measured at 11 different frequencies [varying from 125
to 8000 hertz (Hz)] on both ears, yielding a maximum of 22 observations
per visit. This was done by means of a soundproof chamber and a Bekesy
audiometer. Using these data, Brant and Fozard (1990) have shown that
the relationship between hearing threshold level and frequency can be well
described by a quadratic function of the logarithm of frequency, the pa-
rameters of which depend on age and are highly subject-specific. Morrell
and Brant (1991) and Brant and Pearson (1994) considered the data of 268
elderly male participants whose first visit occurred at about 70 years of age
or older. They studied how hearing thresholds change over time and how
these evolutions depend on age and on the frequency under consideration.
For our purposes, we now consider all available hearing thresholds for 500
Hz, from male BLSA participants only, without otologic disease, unilateral
hearing loss, or evidence of noise-induced hearing loss. Individual profiles
on the left and right ear separately are shown in Figure 2.4 for 30 randomly
selected subjects.
In total, we have 6170 observations (3089 on the left ear and 3081 on the
right ear), from 681 males. Their age at the first visit ranged from 17.2 to
90.5 years, with median value equal to 53 years. The number of visits per
subject varied from 1 to 15, and some of the participants were followed for
over 22 years (median 7.5 years).
2.4 The Vorozole Study
Patients underwent screening and, for those deemed eligible, a detailed examination at baseline (occasion 0) took place. Further measurement occasions were month 1 and then, from month 2 onward, bimonthly intervals until month 44.
FIGURE 2.5. Heights of Schoolgirls. Growth curves of 20 school girls from age 6
to 10, for girls with small, medium, or tall mothers.
These data, introduced by Potthoff and Roy (1964), contain growth mea-
surements for 11 girls and 16 boys. For each subject, the distance from
the center of the pituitary to the maxillary fissure was recorded at ages 8,
TABLE 2.5. Growth Data for 11 Girls and 16 Boys. Measurements marked with
∗ were deleted by Little and Rubin (1987).
10, 12, and 14. The data were used by Jennrich and Schluchter (1986) to
illustrate estimation methods for unbalanced data, where unbalancedness
is now to be interpreted in the sense of an unequal number of boys and
girls.
FIGURE 2.6. Mastitis in Dairy Cattle. The first panel shows a scatter plot of
the second measurement versus the first measurement. The second panel shows a
scatter plot of the change versus the baseline measurement.
3.1 Introduction
In practice, longitudinal data are often highly unbalanced in the sense that the number of available measurements is not the same for all subjects and/or that measurements are not taken at fixed time points. In the rat data
set and the toenail data set, presented in Section 2.1 and in Section 2.2,
respectively, a fixed number of measurements was scheduled to be taken
on all subjects, at fixed time points. However, during the study, rats died,
and patients left the toenail study prematurely, implying unbalance. This
is different from the prostate data and the hearing data (Sections 2.3.1 and 2.3.2, respectively), where the unbalance results directly from the fact that the volunteers participating in the BLSA were asked to return
approximately every 2 years for medical examination.
3.2.1 Stage 1
Let the random variable Yij denote the (possibly transformed) response
of interest, for the ith individual, measured at time tij , i = 1, . . . , N ,
j = 1, . . . , ni, and let Yi be the ni-dimensional vector of all repeated measurements for the ith subject, that is, Yi = (Yi1, Yi2, . . . , Yini)′. The first
stage of the two-stage approach assumes that Yi satisfies the linear regres-
sion model
Yi = Zi β i + εi , (3.1)
where Zi is a (ni × q) matrix of known covariates, modeling how the re-
sponse evolves over time for the ith subject. Further, β i is a q-dimensional
vector of unknown subject-specific regression coefficients, and εi is a vec-
tor of residual components εij , j = 1, . . . , ni . It is usually assumed that all
εi are independent and normally distributed with mean vector zero, and
covariance matrix σ 2 Ini , where Ini is the ni -dimensional identity matrix.
This latter assumption will be extended in Section 3.3.
Obviously, model (3.1) includes very flexible models for the description of
subject-specific profiles. In practice, polynomials will often suffice. How-
ever, extensions such as fractional polynomial models (Royston and Alt-
man 1994), or extended spline functions (Pan and Goldstein 1998) can be
considered as well. We refer to Lesaffre, Asefa and Verbeke (1999) for an
example where subject-specific profiles have been modeled using fractional
polynomials.
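As a minimal sketch of how such first-stage fits could be computed, the program below fits a separate straight line to each rat with ordinary least squares; the data set RATS and the variables RAT (rat identifier), Y (the response), and TIMET (the transformed time scale used in the text) are assumed names rather than the book's own code.

   /* First stage: one ordinary least squares line per subject.        */
   /* OUTEST= collects the subject-specific intercept and slope        */
   /* estimates, one record per rat, for later use as summary          */
   /* statistics.                                                      */
   proc sort data=rats;
      by rat;
   run;

   proc reg data=rats outest=stage1 noprint;
      by rat;
      model y = timet;
   run;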
3.2.2 Stage 2
The rat data presented in Section 2.1 have been analyzed by Verbeke and
Lesaffre (1999), who describe the subject-specific profiles shown in Fig-
ure 2.1 by straight lines, after transforming the original time scale (age
expressed in days) logarithmically (see also Section 4.3.3). The first-stage
model (3.1) then becomes
In the second stage, the subject-specific intercepts and time effects are
related to the treatment of the rats (low dose, high dose, control). Our
second-stage model (3.2) then becomes
\begin{cases}
\beta_{1i} = \beta_0 + b_{1i}, \\
\beta_{2i} = \beta_1 L_i + \beta_2 H_i + \beta_3 C_i + b_{2i},
\end{cases}
\qquad (3.4)

where L_i, H_i, and C_i are indicators equal to one if rat i belongs to the low-dose group, the high-dose group, or the control group, respectively, and zero otherwise.
For the prostate data, the first-stage model is

Y_{ij} = \ln(\mathrm{PSA}_{ij} + 1) = \beta_{1i} + \beta_{2i} t_{ij} + \beta_{3i} t_{ij}^2 + \varepsilon_{ij}, \qquad j = 1, \ldots, n_i, \qquad (3.5)

and the columns of the covariate matrix Zi contain only ones, all time points tij, and all squared time points t2ij.
in which Agei equals the subject’s age at diagnosis (t = 0), and where Ci ,
Bi , Li , and Mi are indicator variables defined to be one if the subject is
a control, a BPH case, a local cancer case, or a metastatic cancer case,
respectively, and zero otherwise. The parameters β2 , β3 , β4 , and β5 are the
average intercepts for the controls, the BPH cases, the L/R cancer cases,
and the metastatic cancer cases, respectively, after correction for age at
diagnosis. Similar interpretations hold for the other parameters in (3.6).
Fitting the models (3.5) and (3.6) to the prostate data, Verbeke and Molen-
berghs (1997, Section 3.3) found that the subject-specific regression para-
meters β i did not depend on age at diagnosis (at the 5% level of signifi-
cance), but highly significant differences were found among the diagnostic
groups. No significant differences were obtained between the controls and
the BPH cases, and the two groups of cancer patients only differed with
respect to their intercepts.
Note how this two-stage analysis can be interpreted as the calculation (first
stage) and analysis (second stage) of summary statistics. First, the actually
observed data vector yi is summarized by β̂i, for each subject separately.
Afterward, regression methods are used to assess the relation between the
so-obtained summary statistics and relevant covariates. Other summary
statistics frequently used in practice are the area under each individual
profile (AUC), the mean response for each individual, the largest observa-
tion (peak), the half-time, and so forth (see, for example, Weiner 1981 and
Rang and Dale 1990).
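Continuing the hypothetical first-stage program sketched in Section 3.2.1, the second stage could then be carried out by treating the estimated slopes as summary statistics and regressing them on treatment group; in the OUTEST= data set created by PROC REG, the slope estimate is stored under the name of its regressor (here TIMET), and GROUP is assumed to have been merged in from the original data.

   /* Second stage: relate the subject-specific slope estimates to     */
   /* the treatment groups, as in model (3.4).                         */
   proc glm data=stage1;
      class group;
      model timet = group / solution;
   run;
   quit;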
Yi = Xi β + Zi bi + εi , (3.7)
Since model (3.8) is defined through the distributions f (yi |bi ) and f (bi ),
it will be called the hierarchical formulation of the linear mixed model. The
corresponding marginal normal distribution with mean Xi β and covariance
Zi DZi + Σi is called the marginal formulation of the model. Note that,
although the marginal model naturally follows from the hierarchical one,
both models are not equivalent. We refer to Section 5.6.2 for a detailed
discussion on the differences between both models.
\mathrm{Cov}(Y_i(t_1), Y_i(t_2)) = \begin{pmatrix} 1 & t_1 \end{pmatrix} D \begin{pmatrix} 1 \\ t_2 \end{pmatrix} + \sigma^2
= d_{22}\, t_1 t_2 + d_{12}\,(t_1 + t_2) + d_{11} + \sigma^2.
Note how the model now implies the variance function of the response to
be quadratic over time, with positive curvature d22 .
Y_{ij} \equiv \ln(\mathrm{PSA}_{ij} + 1)
= \beta_1 \mathrm{Age}_i + \beta_2 C_i + \beta_3 B_i + \beta_4 L_i + \beta_5 M_i
+ (\beta_6 \mathrm{Age}_i + \beta_7 C_i + \beta_8 B_i + \beta_9 L_i + \beta_{10} M_i)\, t_{ij}
+ (\beta_{11} \mathrm{Age}_i + \beta_{12} C_i + \beta_{13} B_i + \beta_{14} L_i + \beta_{15} M_i)\, t_{ij}^2
+ b_{1i} + b_{2i} t_{ij} + b_{3i} t_{ij}^2 + \varepsilon_{ij}. \qquad (3.10)
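A hedged sketch of how a model of the form (3.10) could be fitted with the SAS procedure MIXED is given below. The data set PROSTATE and the variables ID, LNPSA (equal to ln(PSA + 1)), AGE (age at diagnosis), TIME, TIME2 (the squared time), and GROUP (with levels for controls, BPH cases, local/regional and metastatic cancer cases) are assumed names and need not coincide with the book's own programs.

   /* Mean structure: group-specific intercepts and group-specific     */
   /* linear and quadratic time trends, each corrected for age at      */
   /* diagnosis; NOINT because the group indicators replace the        */
   /* overall intercept.                                               */
   /* Random structure: correlated random intercepts and random        */
   /* linear and quadratic time effects (unstructured 3 x 3 matrix D). */
   proc mixed data=prostate method=reml;
      class id group;
      model lnpsa = age group age*time group*time age*time2 group*time2
                  / noint solution;
      random intercept time time2 / type=un subject=id g;
   run;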
Very often, Σi is chosen to be equal to σ 2 Ini where Ini denotes the identity
matrix of dimension ni . We then call model (3.8) the conditional inde-
pendence model, since it implies that the ni responses on individual i are
independent, conditional on bi and β. As shown in Section 3.3.2, the con-
ditional independence model may imply unrealistically simple covariance
structures for the response vector Yi , especially for models with few random
effects. When there is no evidence for the presence of additional random
effects, or when additional random effects have no substantive meaning,
the covariance assumptions can often be relaxed by allowing an appro-
priate, more general, residual covariance structure Σi for the vector εi of
subject-specific error components.
Two frequently used functions g(·) are the exponential and Gaussian serial
correlation functions defined as g(u) = exp(−φu) and g(u) = exp(−φu2 ),
respectively (φ > 0), and which are shown in Figure 3.2 for φ = 1. Note
that the most important qualitative difference between these functions is
their behavior near u = 0, although their tail behavior is also different.
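In the SAS procedure MIXED, serial correlation of this kind can be specified through the REPEATED statement. The sketch below, with the same assumed PROSTATE variable names as before, uses TYPE=SP(EXP) for the exponential serial correlation function (TYPE=SP(GAU) would give the Gaussian one) and the LOCAL option for an additional measurement error component.

   /* Random intercepts, plus serial correlation decaying              */
   /* exponentially in the time lag, plus measurement error.           */
   proc mixed data=prostate method=reml;
      class id group;
      model lnpsa = age group group*time group*time2 / noint solution;
      random intercept / subject=id;
      repeated / type=sp(exp)(time) local subject=id;
   run;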
Although Diggle, Liang, and Zeger (1994) discuss model (3.11) in full gen-
erality, they do not fit any models which simultaneously include serial cor-
relation as well as random effects other than intercepts. They argue that,
in applications, the effect of serial correlation is very often dominated by
the combination of random effects and measurement error. In practice, this
4.1 Introduction
Most books on longitudinal data discuss exploratory analysis. See, for ex-
ample, Diggle, Liang, and Zeger (1994). However, most effort is devoted to model building and formal aspects of inference. In this section, we present
a selected set of techniques to underpin the model building. We distinguish
between two modes of display. In Section 4.2, the marginal distribution of
the responses in the Vorozole study is explored, that is, we explore the ob-
served profiles averaged over (sub)populations. Three aspects of the data
will be looked at in turn: the average evolution, the variance function, and
the correlation structure. Afterward, in Section 4.3, we will discuss some
procedures for exploring the observed profiles in a subject-specific way.
The average evolution describes how the profile for a number of relevant
subpopulations (or the population as a whole) evolves over time. The results
FIGURE 4.1. Vorozole Study. Individual profiles, raw residuals, and standardized
residuals.
The individual profiles are displayed in Figure 4.1, and the mean profiles,
per treatment arm, are plotted in Figure 4.2. The average profiles indicate
an increase over time which is slightly stronger for the Vorozole group. In
addition, the Vorozole group is, with the exception of month 16, consistently
higher than the AGT group. Of course, at this point it is not yet possible
to decide on the significance of this difference. It is useful to explore the
treatment difference separately since even when both evolutions might be
complicated, the treatment difference, which is often of primary interest,
could follow a simple model, or vice versa. The treatment difference is
plotted in Figure 4.3.
The detrended profiles are displayed in Figure 4.1, and the corresponding
variance function is plotted in Figure 4.4.
As shown in Sections 3.2 and 3.3, linear mixed models can be interpreted as resulting from a two-stage approach, where the first stage consists
of approximating each observed longitudinal profile by an appropriate lin-
ear regression function. In this section, we propose two simple exploratory
tools to check to what extent observed longitudinal profiles can be described
by a specific linear regression model.
Y = Xβ + ε (4.1)
FIGURE 4.5. Vorozole Study. Scatter plot matrix for selected time points. The
same vertical scale is used along the diagonal to display the attrition rate as well.
for (some of) the subjects, which may result in (very) high values of Ri2. This suggests that a fair comparison of the Ri2 should take into account the numbers of measurements on which they are based. We therefore promote the use of scatter plots of the Ri2 versus the ni. Examples will be given in Sections 4.3.3 and 4.3.4.
Y = Xβ + X ∗ β ∗ + ε (4.5)
where the sums are taken over all subjects with at least p + p∗ mea-
surements. Assuming that the candidate first-stage model was correctly
specified, and assuming that all residuals are independently normally dis-
tributed with mean zero and with some common variance, we have that
the above test statistic follows an F-distribution with \sum_{i: n_i \ge p+p^*} p^* and \sum_{i: n_i \ge p+p^*} (n_i - p - p^*) degrees of freedom.
The right panel in Figure 4.6 shows a scatter plot of the Ri2 versus the ni, for the so-obtained quadratic first-stage model. The overall coefficient R2meta
now equals 0.9554, which is a rather small improvement when compared
to the original model. This is also reflected in the scatter plot. Note also
that all rats with at most three measurements have Ri2 = 1. Testing for
the need of an additional cubic term in the first-stage model results in
Fmeta = 1.3039, which is not significant (p = 0.1633) when compared to an
F -distribution with 38 and 75 degrees of freedom. Strictly speaking, this
is evidence in favor of using the quadratic first-stage model rather than
the original linear model. However, in view of the good fits which were
already obtained with the linear model, and in order to keep our models as
parsimonious as possible, our further analyses of the rat data will be based
on the original linear first-stage model (3.3).
Although the linear two-stage model explains about 82% of all within-subject variability (R2meta = 0.8188), many longitudinal profiles are badly
described (left panel of Figure 4.7). For example, for 8 out of the 54 profiles,
less than 10% of the observed variability could be explained by a linear fit.
As can be expected from observing the individual profiles in Figure 2.3, the
linear model is strongly rejected when compared to a quadratic first-stage
model (Fmeta = 6.2181, 54 and 301 degrees of freedom, p < 0.0001).
The right panel in Figure 4.7 shows the coefficients Ri2 versus the ni, for this quadratic model. The new model explains about 10% more within-subject variability (R2meta increased to 0.9143). Testing for the need of an additional
cubic term results in Fmeta = 1.2310, which is not significant (p = 0.1484)
when compared to an F -distribution with 54 and 247 degrees of freedom.
This clearly supports the first-stage model (3.5) proposed in Section 3.2.4.
The fact that some individual coefficients Ri2 are still quite small, although
the quadratic model has not been rejected, suggests the presence of a con-
siderable amount of residual variability. This will be confirmed later, in
Section 9.5.
5
Estimation of the Marginal Model
5.1 Introduction
As discussed in Section 3.3, the general linear mixed model (3.8) implies
the marginal model
Yi ∼ N(Xi β, Zi DZi′ + Σi).    (5.1)
Unless the data are analyzed in a Bayesian framework (see, e.g., Gelman et
al . 1995), inference is based on this marginal distribution for the response
Yi . It should be emphasized that the hierarchical structure of the original
model (3.8) is then not taken into account. Indeed, the marginal model (5.1)
is not equivalent to the original hierarchical model (3.8). Inferences based on
the marginal model do not explicitly assume the presence of random effects
representing the natural heterogeneity between subjects. An example of this
can be found in Section 5.6.2. In this and the next chapter, we will discuss
inference for the parameters in the marginal distribution (5.1). Later, in
Chapter 7, it will be shown how the random effects can be estimated under
the explicit assumption that Yi satisfies model (3.8).
Let α denote the vector of all variance and covariance parameters (usually called variance components) found in Vi = Zi DZi′ + Σi, that is, α consists of the q(q+1)/2 different elements in D and of all parameters in Σi. Finally, let θ = (β′, α′)′ be the s-dimensional vector of all parameters in the marginal model for Yi, and let Θ = Θβ × Θα denote the parameter space for θ, with
Θβ and Θα the parameter spaces for the fixed effects and for the variance components, respectively. Note that Θβ = ℝp, and Θα equals the set of values for α such that D and all Σi are positive (semi-)definite.
The maximum likelihood (ML) estimator of θ is obtained by maximizing the likelihood function

L_{ML}(\theta) = \prod_{i=1}^{N} \Bigl\{ (2\pi)^{-n_i/2}\, |V_i(\alpha)|^{-1/2} \exp\Bigl( -\tfrac{1}{2}\,(Y_i - X_i\beta)^\top V_i^{-1}(\alpha)\,(Y_i - X_i\beta) \Bigr) \Bigr\}.    (5.2)
distributed with mean zero and variance σ². The MLE for σ² equals

\hat{\sigma}^2 = \bigl(Y - X(X^\top X)^{-1}X^\top Y\bigr)^\top \bigl(Y - X(X^\top X)^{-1}X^\top Y\bigr) / N,

whereas the estimator based on the error contrasts equals

\hat{\sigma}^2 = \bigl(Y - X(X^\top X)^{-1}X^\top Y\bigr)^\top \bigl(Y - X(X^\top X)^{-1}X^\top Y\bigr) / (N - p),
which is the mean squared error, unbiased for σ 2 , and classically used as
estimator for the residual variance in linear regression analysis (see, for
example, Neter, Wasserman, and Kutner 1990, Chapter 7; Seber 1977, Sec-
tion 3.3). As in Section 5.3.1, we again have that any matrix A satisfying the
specified conditions leads to the same estimator for the residual variance,
which is again called the REML estimator for σ 2 .
In practice, linear mixed models often contain many fixed effects. For ex-
ample, the linear mixed model (3.10) which immediately followed from the
two-stage model proposed in Section 3.2.4 for the prostate data, has a 15-
dimensional vector β of parameters in the mean structure. In such cases,
it may be important to estimate the variance components, explicitly tak-
ing into account the loss of the degrees of freedom involved in estimating
the fixed effects. In contrast to the simple cases discussed in Sections 5.3.1
and 5.3.2, an unbiased estimator for the vector α of variance components
cannot be obtained from appropriately transforming the ML estimator as
suggested from the analytic calculation of its bias. However, the error con-
trasts approach can still be applied as follows. We first combine all N subject-specific regression models (3.8) into one model:
Y = Xβ + Zb + ε, (5.5)
where the vectors Y , b, and ε, and the matrix X are obtained from stacking
the vectors Yi , bi , and εi , and the matrices Xi respectively, underneath
each other, and where Z is the block-diagonal matrix with blocks Zi on the main diagonal and zeros elsewhere. The dimension of Y equals \sum_{i=1}^{N} n_i and will be denoted by n.
The marginal distribution for Y is normal with mean vector Xβ and with
covariance matrix V (α) equal to the block-diagonal matrix with blocks Vi
on the main diagonal and zeros elsewhere. The REML estimator for the
variance components α is now obtained from maximizing the likelihood
function of a set of error contrasts U = A′Y, where A is any (n × (n − p)) full-rank matrix with columns orthogonal to the columns of the X matrix. The vector U then follows a normal distribution with mean vector zero and covariance matrix A′V(α)A, which is not dependent on β any longer.
Further, Harville (1974) has shown that the likelihood function of the error
contrasts can be written as
L(\alpha) = (2\pi)^{-(n-p)/2}\, \left| \sum_{i=1}^{N} X_i^\top X_i \right|^{1/2} \left| \sum_{i=1}^{N} X_i^\top V_i^{-1} X_i \right|^{-1/2} \prod_{i=1}^{N} |V_i|^{-1/2} \exp\left( -\frac{1}{2} \sum_{i=1}^{N} (Y_i - X_i\widehat{\beta})^\top V_i^{-1} (Y_i - X_i\widehat{\beta}) \right),    (5.6)
Note that the maximum likelihood estimators for the mean of a univariate normal population and for the vector of regression parameters in a linear regression model are independent of the residual variance σ². Hence, the
estimates for the mean structures of the two examples in Sections 5.3.1
and 5.3.2 do not change if REML estimates are used for the variance com-
ponents, rather than ML estimates. However, it follows from (5.3) that this
no longer holds in the general linear mixed model. Thus, we have that al-
though REML estimation is only with respect to the variance components
in the model, the “REML” estimator for the vector of fixed effects is not
identical to its ML version. This will be illustrated in Section 5.5, where
model (3.10) will be fitted to the prostate cancer data.
it follows that the REML estimators for α and for β can also be found by
The main justification of the REML approach has been given by Patterson
and Thompson (1971), who prove that, in the absence of information on β,
no information about α is lost when inference is based on U rather than on
Y . More precisely, U is marginally sufficient for α in the sense described by
Sprott (1975) (see also Harville 1977). Further, Harville (1974) has shown
that, from a Bayesian point of view, using only error contrasts to make
inferences on α is equivalent to ignoring any prior information on β and
using all the data to make those inferences.
Also with regard to the mean squared error for estimating α, there is no
indisputable preference for either one of the two estimation procedures,
since it depends on the specifics of the underlying model and possibly on
the true value of α. For ordinary ANOVA or regression models, the ML
estimator of the residual variance σ 2 has uniformly smaller mean squared
error than the REML estimator when p = rank(X) ≤ 4, but the opposite is
true when p > 4 and n − p is sufficiently large (n − p > 2 suffices if p > 12).
In general, one may expect results from ML and REML estimation to differ
more as the number p of fixed effects in the model increases. We hereby
Note that fitting the general linear mixed model (3.8) requires maximiza-
tion of LML and LREML over the parameter space Θ, which consists of all
vectors θ which yield positive (semi-)definite matrices D and Σi . On the
other hand, the marginal model (5.1) only requires all Vi = Zi DZi + Σi
to be positive (semi-)definite. This is why some statistical packages maxi-
mize the likelihood functions over a parameter space which is larger than
Θ. For example, the SAS procedure MIXED (Version 6.12), by default,
only requires the diagonal elements of D and all Σi to be positive, which
probably stems from classical variance-components models, where the ran-
dom effects are assumed to be independent of each other (see, for example,
Searle, Casella and McCulloch 1992, Chapter 6). An example of this will
be given in Section 5.6.2.
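As a sketch of how these constraints can be controlled in practice, the program below fits the marginal model corresponding to (3.9) for the rat data twice: once under the default non-negativity constraints on the variance parameters, and once with the NOBOUND option, which removes those constraints so that the likelihood is maximized over the larger parameter space discussed above. The data set RATS and the variables RAT, GROUP, Y, and TIMET are assumed names.

   /* Default: variance parameters constrained to be non-negative.     */
   proc mixed data=rats method=reml;
      class rat group;
      model y = group*timet / solution;
      random intercept timet / type=un subject=rat;
   run;

   /* NOBOUND: unconstrained maximization; the fit is acceptable as a  */
   /* marginal model as long as the resulting matrices Vi remain       */
   /* positive (semi-)definite over the observed covariate range.      */
   proc mixed data=rats method=reml nobound;
      class rat group;
      model y = group*timet / solution;
      random intercept timet / type=un subject=rat;
   run;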
FIGURE 5.1. Prostate Data. Fitted average profiles for males of median ages at
diagnosis, based on the model (3.10), where the parameters are replaced by their
REML estimates.
Table 5.1 shows the maximum likelihood as well as the restricted maximum
likelihood estimates for all parameters in the marginal model corresponding
to model (3.10), where time is expressed in decades prior to diagnosis rather
than years prior to diagnosis (for reasons which will be explained further
in Section 5.6.1). Recall that the residual variability in this model is pure
measurement error, that is, εi = ε(1)i , with notation as in Section 3.3.4.
As can be expected from the theory in Section 5.3, the ML estimates de-
viate most from the REML estimates for the variance components in the
model. In fact, all REML estimates are larger in absolute value than the
ML estimates. Note that the same is true for the REML estimates for
the residual variance in normal populations or in linear regression models
when compared to the ML estimates, as described in Section 5.3.1 and
Section 5.3.2, respectively. Further, Table 5.1 illustrates the fact that the
REML estimates for the fixed effects are also different from the ML esti-
mates. Figure 5.1 shows, for each diagnostic group separately, the fitted
average profile for a male of median age at diagnosis.
TABLE 5.1. Prostate Data. Maximum likelihood and restricted maximum likeli-
hood estimates (MLE and REMLE) and standard errors (model based; robust) for
all fixed effects and all variance components in model (3.10), with time expressed
in decades before diagnosis.
When we first fitted a linear mixed model to the prostate cancer data, time
was expressed in decades before diagnosis, rather than years before diag-
nosis as in the original data set (see Section 5.5). This was done to avoid
that the random slopes for the linear and quadratic time effects would
show too little variability, which might lead to divergence of the numerical
maximization routine. To illustrate this, we refit our mixed model using
the SAS procedure MIXED (SAS Version 6.12), but we express time as
months before diagnosis. The procedure failed to converge. The estimates
for the variance components at the last iteration are shown in Table 5.2.
Note how the reported estimate for the variance of the random slopes for
the quadratic time effect equals d33 = 0.00000000. This suggests that there
is very little variability among the quadratic time effects, requiring the iter-
ative procedure to converge to a point which is very close to the boundary
of the parameter space (since only non-negative variances are allowed), if
not on the boundary of the parameter space. This can produce numerical
difficulties in the maximization process. One way of circumventing this is to enlarge d33 artificially, for example by rescaling the time axis.
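A minimal sketch of such a rescaling, with assumed variable names: starting from a variable MONTHS holding the number of months before diagnosis, time is re-expressed in decades in a DATA step before calling PROC MIXED.

   data prostate;
      set prostate;
      time  = months / 120;   /* decades before diagnosis             */
      time2 = time * time;    /* quadratic term on the rescaled axis  */
   run;

Relative to a time scale in months, the variances of the random linear and quadratic time effects are inflated by factors of 120² and 120⁴ on the decade scale, which moves them away from zero and from the boundary of the parameter space.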
which is a new linear mixed effects model, in which t∗ij is now expressed in
decades before diagnosis, and where the random effects b∗1i ≡ b1i , b∗2i and
TABLE 5.3. Rat Data. Restricted maximum likelihood estimates (REMLE) and
standard errors for all fixed effects and all variance components in the marginal
model corresponding to model (3.9), for two different parameter restrictions for
the variance components α.
could ever yield a marginal model as has now been obtained. On the other
hand, as long as all covariance matrices Vi = Zi DZi + σ 2 Ini are positive
(semi-) definite, that is, as long as the covariates Zi take values within
a specific range, a valid marginal model is obtained. In our example, the
variance function is predicted by
\mathrm{Var}(Y_i(t)) = \begin{pmatrix} 1 & t \end{pmatrix} D \begin{pmatrix} 1 \\ t \end{pmatrix} + \sigma^2
= d_{22}\, t^2 + 2 d_{12}\, t + d_{11} + \sigma^2
= -0.287\, t^2 + 0.924\, t + 4.443.    (5.9)
FIGURE 5.2. Rat Data. Sample variance function for ordinary least squares
(OLS) residuals, obtained from fitting a linear regression model with the same
mean structure as the marginal model corresponding to (3.9).
arose from the two-stage approach described in Section 3.2.3, it does not
necessarily imply an appropriate marginal model. This again illustrates
the need for exploratory data analysis (Chapter 4) prior to fitting linear
mixed models. More detailed discussions on negative variance components
can be found in Nelder (1954), Thompson (1962), and Searle, Casella and
McCulloch (1992, Section 3.5).
6
Inference for the Marginal Model
6.1 Introduction
where Wi equals Vi−1 (α). In practice, the covariance matrix (6.3) is es-
timated by replacing α by its ML or REML estimator. For the models
previously fitted to the prostate data and to the rat data, the so-obtained
standard errors for the fixed effects are also reported in Table 5.1 and
Table 5.3, respectively.
H0 : Lβ = 0 versus HA : Lβ ≠ 0,    (6.4)
As noted by Dempster, Rubin and Tsutakawa (1981), the Wald test statistics are based on estimated standard errors which underestimate the true variability in β̂ because they do not take into account the variability introduced by estimating α. In practice, this downward bias is often resolved
by using approximate t- and F -statistics for testing hypotheses about β.
It should be remarked that all these methods usually lead to different re-
sults. However, in the analysis of longitudinal data, different subjects con-
tribute independent information, which results in numbers of degrees of
freedom which are typically large enough, whatever estimation method is
used, to lead to very similar p-values. Only for very small samples, or when
linear mixed models are used outside the context of longitudinal analy-
ses, different estimation methods for degrees of freedom may lead to severe
differences in the resulting p-values. This will be illustrated in Section 8.3.5.
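In the SAS procedure MIXED, the method for the denominator degrees of freedom is selected with the DDFM= option on the MODEL statement. The sketch below requests the Satterthwaite approximation; alternatives such as DDFM=CONTAIN or DDFM=RESIDUAL generally yield different numbers of degrees of freedom and hence somewhat different p-values. Data set and variable names are assumed, as before.

   proc mixed data=prostate method=reml;
      class id group;
      model lnpsa = age group group*time group*time2
                  / noint solution ddfm=satterth;
      random intercept time time2 / type=un subject=id;
   run;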
Table 5.1 clearly suggests that the original linear mixed model (3.10), used
for describing the prostate data, can be reduced to a more parsimonious
model. Classically, this is done in a hierarchical way, starting with the
highest-order interaction terms, deleting nonsignificant terms, and com-
bining parameters which do not differ significantly. The so-obtained final
model assumes no average evolution over time for the control group, a
linear average time trend for the BPH group, the same average quadratic
time effects for both cancer groups, and no age dependencies of the average
linear as well as quadratic time trends. An overall test for comparing this
final model with the original model (3.10) is testing the null hypothesis:
H_0 : \begin{cases}
\beta_6 = 0 & \text{(no age by time interaction)} \\
\beta_7 = 0 & \text{(no linear time effect for controls)} \\
\beta_{11} = 0 & \text{(no age by time}^2\text{ interaction)} \\
\beta_{12} = 0 & \text{(no quadratic time effect for controls)} \\
\beta_{13} = 0 & \text{(no quadratic time effect for BPH)} \\
\beta_{14} = \beta_{15} & \text{(equal quadratic time effect for both cancer groups).}
\end{cases}
The observed value under the above null hypothesis for the associated Wald
statistic (6.5) equals 3.3865, on 6 degrees of freedom. The observed value
under H0 for the associated F -statistic (6.6) equals 3.3865/6 = 0.5664, on
6 and 46.7 degrees of freedom. The denominator degrees of freedom have been obtained from the Satterthwaite approximation in the SAS procedure MIXED (Version 6.12). The corresponding p-values are 0.7587 and 0.7561,
respectively, suggesting that no important terms have been left out of the
model.
From now on, all further inferences will be based on the reduced final model,
which can be written as:
Y_{ij} = Y_i(t_{ij})
= \beta_1 \mathrm{Age}_i + \beta_2 C_i + \beta_3 B_i + \beta_4 L_i + \beta_5 M_i
+ (\beta_8 B_i + \beta_9 L_i + \beta_{10} M_i)\, t_{ij}
+ \beta_{14}\,(L_i + M_i)\, t_{ij}^2
+ b_{1i} + b_{2i} t_{ij} + b_{3i} t_{ij}^2 + \varepsilon_{ij},    (6.8)
Table 6.1 contains the parameter estimates and estimated standard errors
for all fixed effects and variance components in model (6.8). Although the
average PSA level for the control patients is not significantly different from
TABLE 6.1. Prostate Data. Parameter estimates and standard errors (model
based; robust) obtained from fitting the final model (6.8) to the prostate cancer
data, using restricted maximum likelihood estimation.
zero (p = 0.1889), we will not remove the corresponding effect from the
model because a point estimate for the average PSA level in the control
group may be of interest.
Figure 6.1 shows the average fitted profiles based on this final model, for
a man of median age at diagnosis, for each of the diagnostic groups sepa-
rately. Note how little difference there is with the average fitted profiles in
Figure 5.1, based on the full model (3.10).
When the prostate data were first analyzed, one of the research questions
of primary interest was whether early discrimination between cancer cases
and BPH cases should be based on the rate of increase of PSA (which can
FIGURE 6.1. Prostate Data. Fitted average profiles for males with median ages
at diagnosis, based on the final model (6.8), where the parameters are estimated
using restricted maximum likelihood estimation.
and

\mathrm{DIFFRATE}(t = 5\ \text{years}) = \left.\frac{\partial}{\partial t}\bigl( \beta_1 \mathrm{Age} + \beta_4 + \beta_9 t + \beta_{14} t^2 \bigr)\right|_{t=0.5} - \left.\frac{\partial}{\partial t}\bigl( \beta_1 \mathrm{Age} + \beta_3 + \beta_8 t \bigr)\right|_{t=0.5} = -\beta_8 + \beta_9 + \beta_{14},    (6.10)
which are of the form Lβ, for specific (1 × 15) matrices L. Obviously, Lβ will be estimated by Lβ̂. Moreover, the chi-squared approximation for (6.5) as well as the F-approximation for (6.6) can be used to obtain approximate confidence intervals for Lβ. The results from the chi-squared approximation are summarized in the top part of Table 6.2.
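Linear combinations Lβ such as (6.9) and (6.10) can be obtained in the SAS procedure MIXED with ESTIMATE statements, which report the estimate, its standard error, and, with the CL option, Wald-type confidence limits. The contrast coefficients below are placeholders only: they have to be written out to match the chosen parameterization of model (6.8) and are not the book's actual contrasts; data set and variable names are assumed as before.

   proc mixed data=prostate method=reml;
      class id group;
      model lnpsa = age group group*time group*time2 / noint solution;
      random intercept time time2 / type=un subject=id;
      /* Placeholder contrast: coefficients must be adapted to the     */
      /* ordering of the GROUP levels and to the interaction terms.    */
      estimate 'DIFF (placeholder)' group 0 1 -1 0
                                    group*time 0 0.5 -0.5 0 / cl alpha=0.05;
   run;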
TABLE 6.2. Prostate Data. Naive and robust inference for the linear combinations
(6.9) and (6.10) of fixed effects in model (6.8), fitted to the prostate cancer data,
using restricted maximum likelihood estimation.
Naive inference
Wald-type approximate
Effect Estimate Standard error 95% confidence interval
DIFF 0.221 0.146 [−0.065, 0.507]
DIFFRATE −0.951 0.166 [−1.276, −0.626]
Robust inference
Wald-type approximate
Effect Estimate Standard error 95% confidence interval
DIFF 0.221 0.159 [−0.092, 0.533]
DIFFRATE −0.951 0.245 [−1.432, −0.470]
Note that this suggests that as long as interest is only in inferences for
average longitudinal evolutions, little effort should be spent in modeling
the covariance structure, provided that the data set is sufficiently large.
In this respect, an extreme point of view would be to use ordinary least
squares regression methods to fit longitudinal models, thereby completely
ignoring the presence of any correlation among the repeated measurements,
and to use the sandwich estimator to correct for this in the inferential
procedures. However, in practice, an appropriate covariance model may
be of interest since it helps in interpreting the random variation in the
data. For example, it may be of scientific interest to explore the presence
of random slopes. Further, efficiency is gained if an appropriate covariance
model can be specified (see, for example, Diggle, Liang, and Zeger 1994
Section 4.6). Finally, in the case of missing observations, use of the sandwich
estimator only provides valid inferences for the fixed effects under very
strict, severe assumptions about the underlying missingness process. This
will be extensively discussed in Sections 15.8 and 16.5. Therefore, from
now on, all inferences will be based on model-based standard errors, unless
explicitly stated otherwise.
Robust versions of the approximate Wald, t-, and F -test, described in Sec-
tions 6.2.1 and 6.2.2, as well as associated confidence intervals, can also
be obtained, replacing the naive covariance matrix (6.3) in (6.5) and (6.6)
by the robust one given in (6.2). As an illustration, we recalculated the
confidence intervals for the linear combinations (6.9) and (6.10) of fixed
effects in model (6.8), using robust inference. The results are now shown
in the bottom part of Table 6.2. Note that the robust standard errors for
both estimates are larger than the naive ones, leading to larger confidence
intervals.
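In the SAS procedure MIXED, these robust (sandwich) standard errors are requested by adding the EMPIRICAL option to the PROC MIXED statement; all reported standard errors, Wald-type tests, and ESTIMATE confidence limits are then based on the robust covariance matrix (6.2). Data set and variable names are, again, assumed.

   proc mixed data=prostate method=reml empirical;
      class id group;
      model lnpsa = age group group*time group*time2 / noint solution;
      random intercept time time2 / type=un subject=id;
   run;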
A classical statistical test for the comparison of nested models with different
mean structures is the likelihood ratio (LR) test. Suppose that the null
hypothesis of interest is given by H0 : β ∈ Θβ,0 , for some subspace Θβ,0
of the parameter space Θβ of the fixed effects β. Let LML denote again the
ML likelihood function (5.2) and let −2 ln λN be the likelihood ratio test
TABLE 6.3. Prostate Data. Likelihood ratio test for H0 : β1 = 0, under model
(3.10), using ML as well as REML estimation.
statistic defined as
-2 \ln \lambda_N = -2 \ln \left[ \frac{L_{ML}(\widehat{\theta}_{ML,0})}{L_{ML}(\widehat{\theta}_{ML})} \right],
It should be emphasized that the above result is not valid if the models
are fitted using REML rather than ML estimation. Indeed, the mean struc-
ture of the model fitted under H0 is not the mean structure Xi β of the
original model under Θβ, leading to different error contrasts U = A′Y
(see Section 5.3) under both models. Hence, the corresponding REML log-
likelihood functions are based on different observations, which makes them
no longer comparable. This can be well illustrated in the context of the
prostate data. We reconsider the final linear mixed model (6.8), and we
use the likelihood ratio test for testing whether correction for the different
ages at the time of the diagnosis is really needed. The corresponding null
hypothesis equals H0 : β1 = 0. The results obtained under ML as well as
under REML estimation are summarized in Table 6.3. Under ML, the ob-
served value for the LR statistic −2 ln λN equals 6.602, which is significant
(p = 0.010) when compared to a chi-squared distribution with 1 degree of
freedom. Note that a negative observed value for the LR statistic −2 ln λN
is obtained under REML, clearly illustrating the fact that valid classical
LR tests for the mean structure can only be obtained in the context of ML
inference. We refer to Welham and Thompson (1997) for two alternative
LR-type tests, based on profile likelihoods, which do allow comparison of
two models with nested mean structures, fitted using the REML estimation
method.
In many practical situations, the mean structure rather than the covari-
ance model is of primary interest. However, adequate covariance modeling
is useful for the interpretation of the random variation in the data and it
is essential to obtain valid model-based inferences for the parameters in
the mean structure of the model. Overparameterization of the covariance
structure leads to inefficient estimation and potentially poor assessment
of standard errors for estimates of the mean response profiles (fixed ef-
fects), whereas a too restrictive specification invalidates inferences about
the mean response profile when the assumed covariance structure does not
hold (Altham 1984). Finally, as will be discussed in Chapters 17, 19, and 21,
analyses of longitudinal data subject to dropout often require correct spec-
ification of the longitudinal model. In this section, inferential procedures
for variance components in linear mixed models will be discussed.
It follows from classical likelihood theory (see, e.g., Cox and Hinkley 1990,
Chapter 9) that, under some regularity conditions, the distribution of the
ML as well as the REML estimator $\widehat{\alpha}$ can be well approximated by a nor-
mal distribution with mean vector α and with covariance matrix given by
the inverse of the Fisher information matrix. Hence, approximate standard
errors for the estimates of the variance components in α can be easily cal-
culated from inverting minus the matrix of second-order partial derivatives
of the log-likelihood function (ML or REML) with respect to α. These
are also the standard errors previously reported in Tables 5.1 and 6.1 and
Table 5.3 for the prostate data and the rat data, respectively.
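In symbols, and writing ℓ(α) for the (ML or REML) log-likelihood regarded as a function of α, the resulting standard errors and Wald-type quantities take the familiar form below. This is a standard sketch added here for reference only; it is not one of the book's own displays:
\[
\widehat{\operatorname{var}}(\widehat{\alpha}) \;=\; \left( -\,\frac{\partial^2 \ell(\alpha)}{\partial\alpha\,\partial\alpha'}\,\Bigg|_{\alpha=\widehat{\alpha}} \right)^{-1},
\qquad
\frac{\widehat{\alpha}_j - \alpha_j}{\sqrt{\bigl[\widehat{\operatorname{var}}(\widehat{\alpha})\bigr]_{jj}}} \;\approx\; N(0,1),
\]
so that, for a component $\alpha_j$ away from the boundary of the parameter space, an approximate 95% confidence interval is $\widehat{\alpha}_j \pm 1.96\,\text{s.e.}(\widehat{\alpha}_j)$.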
For example, consider the linear mixed model (6.8) previously derived for
the prostate data, with REML estimates as reported in Table 6.1. If we
assume that the variability between subjects can be explained by ran-
dom effects bi , then H0 : d33 = 0 cannot be tested using an approxi-
mate Wald test. Indeed, given the hierarchical interpretation of the model,
D is restricted to be positive (semi-)definite, implying that H0 is on the
boundary of the parameter space Θα . Therefore, the asymptotic distribution of $\widehat{d}_{33}/\text{s.e.}(\widehat{d}_{33})$ is not normal under H0 , such that the approximate
Wald test is not applicable. On the other hand, if one only assumes that
the covariance matrix of each subject’s response Yi can be described by
Vi = Zi DZi + σ 2 Ini , not assuming that this covariance matrix results from
an underlying random-effects structure, D is no longer restricted to be
positive (semi-)definite. Since d33 = 0 is then interior to Θα , $\widehat{d}_{33}/\text{s.e.}(\widehat{d}_{33})$
asymptotically follows a standard normal distribution under H0 : d33 = 0,
from which a valid approximate Wald test follows. Based on the parameter
estimates reported in Table 6.1, we find that the observed value for the
test statistic equals $\widehat{d}_{33}/\text{s.e.}(\widehat{d}_{33}) = 0.114/0.035 = 3.257$, which is highly
significant when compared to a standard normal distribution (p = 0.0011).
Obviously, the above distinction between the hierarchical and marginal in-
terpretation of a linear mixed model is far less crucial for testing significance
of covariance parameters in α. For example, the hypothesis H0 : d23 = 0
can be tested with an approximate Wald test, even when the variability
between subjects is believed to be induced by random effects. However,
in order to keep H0 away from the boundary of Θα , one then still has to
assume that the variances d22 and d33 are strictly positive. Hence, based
on the parameter estimates reported in Table 6.1, and assuming all diag-
onal elements in D to be nonzero, we find highly significant correlations
between the subject-specific intercepts and slopes in model (6.8), and the
only positive correlation is the one between the random intercepts and the
random slopes for time2 .
Similar as for the fixed effects, a LR test can be derived for comparing
nested models with different covariance structures. Suppose that the null
hypothesis of interest is now given by H0 : α ∈ Θα,0 , for some subspace
Θα,0 of the parameter space Θα of the variance components α. Let LML
denote again the ML likelihood function (5.2) and let −2 ln λN be the like-
lihood ratio test statistic, which is again defined as
\[
-2\ln\lambda_N \;=\; -2\ln\!\left( \frac{L_{\mathrm{ML}}(\widehat{\theta}_{\mathrm{ML},0})}{L_{\mathrm{ML}}(\widehat{\theta}_{\mathrm{ML}})} \right), \tag{6.11}
\]
where θML,0 and θ ML are now the maximum likelihood estimates obtained
from maximizing LML over Θα,0 and Θα , respectively. It then follows from
classical likelihood theory (see, e.g., Cox and Hinkley 1990, Chapter 9)
that, under some regularity conditions, −2 ln λN follows, asymptotically
under H0 , a chi-squared distribution with degrees of freedom equal to the
difference between the dimension s − p of Θα and the dimension of Θα,0 .
For the rat data, consider testing whether the random slopes can be omitted from model (3.9), that is, H0 : d12 = d22 = 0. The observed value of the corresponding test statistic equals 2.936,
which is not significant when compared to a chi-squared distribution with
2 degrees of freedom (p = 0.2304). The REML estimates of the parameters
TABLE 6.4. Rat Data. REML estimates and associated estimated standard errors
for all parameters in model (3.9), under H0 : d12 = d22 = 0.
in the reduced model are shown in Table 6.4. A LR test for the same null
hypothesis can now be obtained from comparing the maximized REML
log-likelihood values (see Tables 5.3 and 6.4). The observed value for the
test statistic equals −2 ln λN = −2(−466.202 + 465.193) = 2.018, which
is also not significant when compared to a chi-squared distribution with 2
degrees of freedom (p = 0.3646).
From now on, all further inferences will be based on the reduced model. For
each treatment group separately, the predicted average profile based on the
estimates reported in Table 6.4 is shown in Figure 6.2. The observed value
for the Wald statistic (6.5) for testing the hypothesis H0 : β1 = β2 = β3
of no average treatment effect equals 4.63, which is not significant when
compared to a chi-squared distribution with 2 degrees of freedom (p =
0.0987).
Since the obtained estimate for d11 equals 3.565 > 0, the fitted model allows
a random-effects interpretation. The corresponding hierarchical model is
given by
\[
Y_{ij} \;=\;
\begin{cases}
\beta_0 + b_{1i} + \beta_1 t_{ij} + \varepsilon_{ij}, & \text{if low dose} \\
\beta_0 + b_{1i} + \beta_2 t_{ij} + \varepsilon_{ij}, & \text{if high dose} \\
\beta_0 + b_{1i} + \beta_3 t_{ij} + \varepsilon_{ij}, & \text{if control}
\end{cases}
\tag{6.12}
\]
and is obtained from omitting the subject-specific slopes b2i from the orig-
inal model (3.9), thereby assuming that all individual profiles have equal
FIGURE 6.2. Rat Data. Fitted average evolution for each treatment group sepa-
rately, obtained from fitting the final model (6.12) using REML estimation.
slopes, after correction for the treatment. As before, the residual compo-
nents εij only contain measurement error (i.e., εi = ε(1)i , see Section 3.3.4).
Note that the above random-intercepts model does not only imply the mar-
ginal variance function to be constant over time, it also assumes constant
correlation between any two repeated measurements from the same rat.
The constant correlation is given by
\[
\rho_I \;=\; \frac{d_{11}}{d_{11} + \sigma^2}\,,
\]
which is the intraclass correlation coefficient previously encountered in Sec-
tion 3.3.2. In our example, ρI is estimated by 3.565/(3.565+1.445) = 0.712.
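For completeness, this follows directly from the general marginal covariance $V_i = Z_i D Z_i' + \sigma^2 I_{n_i}$ with $Z_i = 1_{n_i}$ and $D = d_{11}$, a standard calculation included here for reference:
\[
V_i \;=\; d_{11}\, 1_{n_i} 1_{n_i}' + \sigma^2 I_{n_i},
\]
so that every measurement has variance $d_{11} + \sigma^2$ and every pair of measurements from the same rat has covariance $d_{11}$, hence constant correlation $\rho_I$.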
The corresponding covariance matrix, with constant variance and constant
correlation, is often called compound symmetry. This again illustrates that
negative estimates for variance components in a linear mixed model of-
ten have meaningful interpretations in the implied marginal model. Here, a
nonpositive estimate for the variance d11 of the random effects in a random-
intercepts model would indicate that the assumption of constant positive
correlation between the repeated measurements is not valid for the data
set at hand. We refer to Section 5.6.2 for another example in which neg-
ative variance estimates indicate misspecifications in the implied marginal
model.
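In SAS, the reduced model (6.12) could tentatively be fitted with a program of the following form. This is only a sketch, assuming a data set ‘rats’ in the “long” format with variables rat (identifier), treat (treatment group), t (transformed age), and y (response); these names are our own assumptions and need not match the data set used in the book:

proc mixed data = rats method = reml;
  class rat treat;
  /* common intercept beta0 plus treatment-specific slopes beta1, beta2, beta3 */
  model y = treat*t / solution;
  /* random intercepts b_1i only */
  random intercept / type = un subject = rat g;
run;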
in panel (a) of Figure 6.3. Note that if the classical null distribution were used, all p-values would be overestimated. The null hypothesis would therefore be accepted too often, resulting in an oversimplified covariance structure for the model, which may seriously invalidate inferences, as shown by Altham (1984).
Further, when testing
\[
H_0: D \;=\; \begin{pmatrix} d_{11} & 0 \\ 0 & 0 \end{pmatrix},
\]
for some strictly positive d11 , versus HA that D is a (2 × 2) positive semi-
definite matrix, we have that the asymptotic null distribution of −2 ln λN
is a mixture with equal weights 0.5 for χ22 and χ21 , shown in Figure 6.3(b).
As in case 1, ignoring the boundary problem may result in overly parsimonious covariance structures.
Finally, consider testing
\[
H_0: D \;=\; \begin{pmatrix} D_{11} & 0 \\ 0 & 0 \end{pmatrix} \tag{6.14}
\]
versus
\[
H_A: D \;=\; \begin{pmatrix} D_{11} & D_{12} \\ D_{12}' & D_{22} \end{pmatrix}.
\]
Note that the results in the above cases 1 to 4 assume that the likelihood
function can be maximized over the space Θα of positive semidefinite ma-
trices D, and that the estimating procedure is able to converge, for example,
to values of D which are positive semidefinite but not positive definite. This
is software dependent and should be checked when the above results are
applied in practice.
For example, according to Stram and Lee (1994), this assumption did not
hold for the SAS procedure MIXED when their paper was written, and they
therefore discuss how their results had to be corrected. Since the procedure
only allowed maximization of the likelihood over a subspace of the required
parameter space Θα , the likelihood ratio statistics were typically too small.
For the third case, for example, the asymptotic null distribution became
a mixture of χ2q+1 and χ20 with equal weight 0.5. However, as explained
in Section 5.6.2, since release 6.10 of SAS, the only constraint on variance
TABLE 6.5. Prostate Data. Several random-effects models with the associated
value for the log-likelihood value evaluated at the parameter estimates, for maxi-
mum as well as restricted maximum likelihood estimation.
                                               ln[L(θ̂)]
Random effects                            ML              REML
Model 1: intercepts, time, time²        −3.575          −20.165
Model 2: intercepts, time              −50.710          −66.563
Model 3: intercepts                   −131.218         −149.430
Model 4: (no random effects)          −251.275         −272.367
For the prostate data, the hypothesis of most interest is that only random
intercepts and random slopes for the linear time effect are needed in model
(6.8), and hence that the random slopes for the quadratic time effect may
be omitted (case 3). However, for illustrative purposes, we tested all hy-
potheses of deleting one random effect from the model, in a hierarchical
way starting from the highest-order time effect. Likelihood ratio tests were
used, based on maximum likelihood as well as on restricted maximum like-
lihood estimation. The models and the associated maximized log-likelihood
values are shown in Table 6.5. Further, Table 6.6 shows the likelihood ra-
tio statistics for dropping one random effect at a time, starting from the
quadratic time effect. The correct asymptotic null distributions directly fol-
low from the results described in cases 1 to 3. We hereby denote a mixture of two chi-squared distributions with k1 and k2 degrees of freedom and equal weights 0.5 by $\chi^2_{k_1:k_2}$.
TABLE 6.6. Prostate Data. Likelihood ratio statistics with the correct as well
as naive asymptotic null distribution for comparing random-effects models, for
maximum as well as restricted maximum likelihood estimation. A mixture of two
chi-squared distributions with k1 and k2 degrees of freedom and with equal weight
for both distributions is denoted by χ2k1 :k2 .
Maximum likelihood
                                                   Asymptotic null distribution
Hypothesis                     −2 ln(λN)           Correct            Naive
Model 2 versus Model 1           94.270            χ²(2:3)            χ²(3)
Model 3 versus Model 2          161.016            χ²(1:2)            χ²(2)
Model 4 versus Model 3          240.114            χ²(0:1)            χ²(1)

Restricted maximum likelihood
                                                   Asymptotic null distribution
Hypothesis                     −2 ln(λN)           Correct            Naive
Model 2 versus Model 1           92.796            χ²(2:3)            χ²(3)
Model 3 versus Model 2          165.734            χ²(1:2)            χ²(2)
Model 4 versus Model 3          245.874            χ²(0:1)            χ²(1)
All testing procedures discussed in Sections 6.2 and 6.3 considered the
comparison of so-called nested models, in the sense that the model under
the null hypothesis could be viewed as a special case from the alternative
model. In order to extend this to the case where one wants to discriminate
between non-nested models, we take a closer look at the likelihood ratio
tests discussed in Sections 6.2.5, 6.3.2, and 6.3.4. Let $\ell_A$ and $\ell_0$ denote the log-likelihood function evaluated at the estimates obtained under the alternative hypothesis and under the null hypothesis, respectively. Further, let $\#\theta_0$ and $\#\theta_A$ denote the number of free parameters under the null
hypothesis and under the alternative hypothesis, respectively. The LR test
then rejects the null hypothesis if $\ell_A - \ell_0$ is large in comparison to the difference in degrees of freedom between the two models which are to be compared or, equivalently, if
\[
\ell_A - \ell_0 \;>\; F(\#\theta_A) - F(\#\theta_0),
\]
or, equivalently, if
\[
\ell_A - F(\#\theta_A) \;>\; \ell_0 - F(\#\theta_0),
\]
for an appropriate function F(·). For example, when tests are performed at the 5% level of significance, for hypotheses of the same form as those described in the third case of Section 6.3.4, F was such that
\[
F(\#\theta_A) - F(\#\theta_0) \;=\; \chi^2_{k_1:k_2,\,0.95}\big/2,
\]
where $\chi^2_{k_1:k_2,\,0.95}$ denotes the 95% percentile of the $\chi^2_{k_1:k_2}$ distribution. This
procedure can be interpreted as a formal test of significance only if the
model under the null hypothesis is nested within the model under the al-
ternative hypothesis. However, if this is not the case, there is no reason why
TABLE 6.8. Rat Data. Summary of the results of fitting two different ran-
dom-intercepts models to the rat data (ML estimation, n = 252).
the above procedure could not be used as a rule of thumb, or why no other
functions F(·) could be used to construct empirical rules for discriminating
between covariance structures. Some other frequently used functions are
shown in Table 6.7, all leading to different discriminating rules, called in-
formation criteria. The main idea behind information criteria is to compare
models based on their maximized log-likelihood value, but to penalize for
the use of too many parameters. The model with the largest AIC, SBC,
HQIC, or CAIC is deemed best. Note that, except for the Akaike infor-
mation criterion (AIC), they all involve the sample size (see Table 6.7),
showing that differences in likelihood need to be viewed, not only relative
to the differences in numbers of parameters but also relative to the number
of observations included in the analysis. As the sample size increases, more
severe increases in likelihood are required before a complex model will be
preferred over a simple model. Note also that, since REML is based on a
set of n − p error contrasts (see Section 5.3), the effective sample size used
in the definition of the information criteria is n∗ = n − p under REML esti-
mation, while being n under ML estimation. Note also that, as explained in
Section 6.2.5, REML log-likelihoods are only fully comparable for models
with the same mean structure. Hence, for comparing models with different
mean structures, one should only consider information criteria based on
ML estimation.
We refer to Akaike (1974), Schwarz (1978), Hannan and Quinn (1979), and
Bozdogan (1987) for more information on the information criteria defined
in Table 6.7.
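Table 6.7 itself is not reproduced here. As an indication only, the two most frequently used criteria take the forms below, with ℓ the maximized (RE)ML log-likelihood, #θ the number of parameters counted, and n* the effective sample size (n under ML, n − p under REML); these expressions are consistent with the numerical example given later in Section 8.3.3, but they are quoted here from memory rather than copied from Table 6.7:
\[
\mathrm{AIC} \;=\; \ell - \#\theta,
\qquad
\mathrm{SBC} \;=\; \ell - \tfrac{1}{2}\,\#\theta\,\ln n^{*}.
\]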
As an illustration for the rat data, we compare a random-intercepts model in which the average slope is common to the three treatment groups (β1 = β2 = β3) to the original model (6.12). Table 6.8 shows the AIC and SBC obtained
from the ML results for both models, with effective sample size equal to
n = 252. Note that AIC prefers the model with treatment-specific average
slopes, whereas the common-slope model is to be preferred based on SBC.
This again illustrates the effect of taking into account the sample size n in
the calculation of SBC, which is not the case for AIC.
7
Inference for the Random Effects
7.1 Introduction
Since the random effects in model (3.8) are assumed to be random vari-
ables, it is most natural to estimate them using Bayesian techniques (see,
for example, Box and Tiao 1992 or Gelman et al . 1995). As discussed in
Section 3.3, the distribution of the vector Yi of responses for the ith indi-
vidual, conditional on that individual’s specific regression coefficients bi , is
multivariate normal with mean vector Xi β + Zi bi and with covariance ma-
trix Σi . Further, the marginal distribution of bi is multivariate normal with
mean vector 0 and covariance matrix D. In the Bayesian literature, this
last distribution is usually called the prior distribution of the parameters
bi since it does not depend on the data Yi . Once observed values yi for Yi
have been collected, the so-called posterior distribution of bi , defined as the
distribution of bi , conditional on Yi = yi , can be calculated. If we denote
the density function of Yi conditional on bi , and the prior density function
of bi by f (yi |bi ) and f (bi ), respectively, we have that the posterior density
function of bi given Yi = yi is given by
\[
f(b_i \mid y_i) \;\equiv\; f(b_i \mid Y_i = y_i)
\;=\;
\frac{f(y_i \mid b_i)\, f(b_i)}{\displaystyle\int f(y_i \mid b_i)\, f(b_i)\, db_i}. \tag{7.1}
\]
For the sake of notational convenience, we hereby suppressed the depen-
dence of all above density functions on certain components of θ.
Using the theory on general Bayesian linear models (Smith 1973, Lindley
and Smith 1972), it can be shown that (7.1) is the density of a multivariate
normal distribution. Very often, bi is estimated by the mean of this pos-
terior distribution, called the posterior mean of bi . This estimate is then
given by
\[
\widehat{b}_i(\theta) \;=\; D Z_i'\, W_i(\alpha)\, \bigl(y_i - X_i \widehat{\beta}(\alpha)\bigr), \tag{7.2}
\]
where, as before, Wi equals Vi−1 (Laird and Ware 1982). Note that (7.3), the covariance matrix of $\widehat{b}_i(\theta)$, underestimates the variability in $\widehat{b}_i(\theta) - b_i$ since it ignores the variation of bi . Therefore, inference for bi is usually based on the covariance matrix (7.4) of $\widehat{b}_i(\theta) - b_i$, which accounts for this additional variability, as an estimator for the variation in $\widehat{b}_i(\theta) - b_i$ (Laird and Ware 1982).
Let the linear mixed model be denoted as in (5.5) in Section 5.3.3; that
is, Y = Xβ + Zb + ε, where the vectors Y , b, and ε, and the matrix X
are obtained from stacking the vectors Yi , bi , and εi , and the matrices
Xi , respectively, underneath each other, and where Z is the block-diagonal
matrix with blocks Zi on the main diagonal and zeros elsewhere. Let D and
Σ be block-diagonal with blocks D and Σi on the main diagonal and zeros
elsewhere. Henderson et al . (1959) showed that, conditional on the vector
α of variance components, the estimate (5.3) for β and the estimates (7.2)
of all random effects bi can be obtained from solving the so-called mixed-
model equations
\[
\begin{pmatrix}
X'\Sigma^{-1}X & X'\Sigma^{-1}Z \\
Z'\Sigma^{-1}X & Z'\Sigma^{-1}Z + D^{-1}
\end{pmatrix}
\begin{pmatrix}
\widehat{\beta} \\ \widehat{b}
\end{pmatrix}
\;=\;
\begin{pmatrix}
X'\Sigma^{-1}y \\ Z'\Sigma^{-1}y
\end{pmatrix}
\]
with respect to β and b (see also Henderson 1984, Searle, Casella and
McCulloch 1992, Section 7.6). Note that, especially with large data sets,
this may become computationally very expensive, such that, in practice, it
may be (much) more efficient to calculate the estimates directly from the
expressions (5.3) and (7.2).
where $\widehat{\beta}(\alpha)$ and $\widehat{b}_i(\widehat{\beta}, \alpha) = \widehat{b}_i(\theta)$ are as defined in expressions (5.3) and (7.2), respectively. It can now be shown that $\widehat{u}(\alpha)$ is a best linear unbiased predictor of u, in the sense that it is unbiased for u and has minimum variance among all unbiased estimators of the form $c + \sum_i \lambda_i' Y_i$ (see, for
example, Harville 1976, McLean, Sanders and Stroup 1991, Searle, Casella
and McCulloch 1992, Chapter 7). Note that, in practice, u is estimated by $\widehat{u}(\widehat{\alpha})$, where $\widehat{\alpha}$ equals the maximum likelihood or restricted maximum likelihood estimator for the vector α of variance components (see Chapter 5).
7.5 Shrinkage
The prediction of the ith profile, combining the estimated fixed effects with the EB estimates of the random effects, can be rewritten as
\[
\begin{aligned}
\widehat{Y}_i \;&\equiv\; X_i\widehat{\beta} + Z_i\widehat{b}_i \\
&=\; X_i\widehat{\beta} + Z_i D Z_i' V_i^{-1}\bigl(y_i - X_i\widehat{\beta}\bigr) \\
&=\; \bigl(I_{n_i} - Z_i D Z_i' V_i^{-1}\bigr) X_i\widehat{\beta} + Z_i D Z_i' V_i^{-1}\, y_i \\
&=\; \Sigma_i V_i^{-1} X_i\widehat{\beta} + \bigl(I_{n_i} - \Sigma_i V_i^{-1}\bigr)\, y_i,
\end{aligned}
\tag{7.6}
\]
In the Bayesian literature, one usually refers to phenomena like those ex-
hibited in expression (7.6), as shrinkage (Carlin and Louis 1996, Strenio,
Weisberg, and Bryk 1983). The observed data are shrunken toward the
prior average profile, which is Xi β since the prior mean of the random ef-
fects was zero. This is also illustrated in (7.4), which implies that for any
linear combination λ of the random effects,
\[
\operatorname{var}(\lambda'\,\widehat{b}_i) \;\leq\; \operatorname{var}(\lambda'\, b_i). \tag{7.7}
\]
Denoting 1ni 1ni by Jni , it follows from (7.2) that the EB estimate for the
random intercept of subject i is given by
\[
\begin{aligned}
\widehat{b}_i \;&=\; \sigma_b^2\, 1_{n_i}' \bigl(\sigma_b^2 J_{n_i} + \sigma^2 I_{n_i}\bigr)^{-1} \bigl(y_i - X_i\widehat{\beta}\bigr) \\
&=\; \frac{\sigma_b^2}{\sigma^2}\, 1_{n_i}' \left( I_{n_i} - \frac{\sigma_b^2}{\sigma^2 + n_i\sigma_b^2}\, J_{n_i} \right) \bigl(y_i - X_i\widehat{\beta}\bigr) \\
&=\; \frac{n_i\sigma_b^2}{\sigma^2 + n_i\sigma_b^2}\; \frac{1}{n_i}\sum_{j=1}^{n_i} \bigl(y_{ij} - x_{ij}'\widehat{\beta}\bigr),
\end{aligned}
\tag{7.8}
\]
where the vector $x_{ij}'$ consists of the jth row in the design matrix Xi . Note that $r_{i\cdot} = \sum_{j=1}^{n_i}(y_{ij} - x_{ij}'\widehat{\beta})/n_i$ is equal to the average residual for subject i.
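As a purely illustrative piece of arithmetic (not an example taken from the book), take the rat-data REML estimates $\widehat{d}_{11} = 3.565$ and $\widehat{\sigma}^2 = 1.445$ from Table 6.4, with $\widehat{d}_{11}$ playing the role of $\sigma_b^2$, and consider a rat with, say, $n_i = 5$ measurements (an arbitrary choice):
\[
\frac{n_i\,\widehat{d}_{11}}{\widehat{\sigma}^2 + n_i\,\widehat{d}_{11}}
\;=\; \frac{5 \times 3.565}{1.445 + 5 \times 3.565}
\;\approx\; 0.925,
\]
so the EB estimate of that rat's random intercept equals roughly 92.5% of its average residual $r_{i\cdot}$; the shrinkage factor approaches 1 as $n_i$ grows and approaches 0 for very small $n_i$ or very small between-subject variability.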
FIGURE 7.1. Prostate Data. Histograms (panels a, c, and e) and scatter plots
(panels b, d, and f ) of the empirical Bayes estimates for the random intercepts
and slopes in the final model (6.8).
7.8.1 Introduction
Note also that even when all EB estimates follow the same distribution,
it follows from the shrinkage effect (see Section 7.5) that the histogram
of the EB estimates shows less variability than actually present in the
population of random-effects bi . This suggests that such histograms do not
necessarily reflect the correct random-effects distribution, and hence also
that histograms of (standardized) EB estimates might not be suitable for
detecting deviations from the normality assumption. Louis (1984) therefore
proposes to minimize other well-chosen loss functions than the classical
squared-error loss function which corresponds to the posterior mean (7.2).
For example, if interest is in the random-effects distribution, one could
minimize a distance function between the empirical cumulative distribution
function of the estimates and the true parameters.
FIGURE 7.3. Histogram (range [−5, 5]) of 1000 random intercepts drawn from
the normal mixture 0.5N (−2, 1) + 0.5N (2, 1).
FIGURE 7.4. Histogram (range [−5, 5]) of the Empirical Bayes estimates of the
random intercepts shown in Figure 7.3, calculated under the assumption that the
random effects are normally distributed.
On the other hand, the covariates Zi are also important. For example,
suppose that Zi is an ni -dimensional vector with elements ti1 , ti2 , . . . , tini . We then have that $Z_i'Z_i$ equals $\sum_{j=1}^{n_i} t_{ij}^2$. Hence, heterogeneity will be more likely to be
detected when the bi represent random slopes in a model for measurements
taken at large time points tij than when the bi are random intercepts (all
tij equal to 1). If Zi contains both random intercepts and random slopes
for time points tij , we have that $Z_i'Z_i$ equals
\[
Z_i' Z_i \;=\;
\begin{pmatrix}
n_i & \sum_{j=1}^{n_i} t_{ij} \\[4pt]
\sum_{j=1}^{n_i} t_{ij} & \sum_{j=1}^{n_i} t_{ij}^2
\end{pmatrix}.
\]
In a designed experiment with Zi = Z for all i, and where the time points tj are centered, the eigenvalues are $\lambda_2 = \sum_{j=1}^{n_i} t_j^2$ and $\lambda_1 = n_i$, or vice versa.
So, if we are interested in detecting subgroups in the random-effects popu-
lation, we should take as many measurements as possible, at the beginning
and at the end of the study (maximal spread of the time points).
for N → ∞.
It can easily be seen that the covariance matrix obtained from replacing α
in (6.3) by its MLE equals
\[
\widehat{\operatorname{var}}(\widehat{\beta})
\;=\;
\left( \sum_{i=1}^{N} X_i'\, V_i^{-1}(\widehat{\alpha})\, X_i \right)^{-1}
\;=\; A_{N,11}^{-1}\big/ N, \tag{7.11}
\]
For the random components on the other hand, and more specifically for
the elements in D, this is only true under the correct model (normal ran-
dom effects). When the random effects are not normally distributed, the
Lange and Ryan (1989) have proposed to check the normality assumption
for the random effects based on weighted normal quantile plots of stan-
dardized linear combinations
\[
v_i \;=\; \frac{c'\,\widehat{b}_i}{\sqrt{c'\,\widehat{\operatorname{var}}(\widehat{b}_i)\,c}}
\]
of the EB estimates $\widehat{b}_i$. However, since the vi are functions of the random effects
bi as well as of the error terms εi , these normal quantile plots can only
indicate that the vi do not have the distribution one expects under the
assumed model, but the plots cannot differentiate a wrong distributional
assumption for the random effects or the error terms from a wrong choice
of covariates.
This suggests that non-normality of the random effects can only be de-
tected by comparing the results obtained under the normality assumption
with results obtained from fitting a linear mixed model with relaxed distri-
butional assumptions for the random effects. Verbeke (1995) and Verbeke
and Lesaffre (1996a, 1997b) therefore propose to extend the general linear
FIGURE 7.5. Density functions of mixtures pN (µ1 , σb2 ) + (1 − p)N (µ2 , σb2 ) of two
normal distributions, for varying values of p and σb2 . The dashed lines represent
the densities of the normal components; the solid line represents the density of
the mixture.
mixed model by assuming that the random effects bi are drawn from a mixture of g normal distributions with common covariance matrix D, that is,
\[
b_i \;\sim\; \sum_{j=1}^{g} p_j\, N(\mu_j, D), \tag{7.12}
\]
with $\sum_{j=1}^{g} p_j = 1$, and such that the mean $\sum_{j=1}^{g} p_j \mu_j$ equals zero. As
discussed in Section 7.8.2, this extension naturally arises from assuming
that there is unobserved heterogeneity in the random-effects population.
Each component in the mixture represents a cluster containing a proportion
pj of the total population. The model is therefore called the heterogeneity
model and the linear mixed model discussed so far can then be called the
homogeneity model. Also, as shown in Figure 7.5, it extends the assumption
about the random-effects distribution to a very broad class of distributions:
unimodal as well as multimodal, symmetric as well as very skewed. Note
that heterogeneity in the random-effects population may occur very often in
practice. Whenever a categorical covariate has been omitted as a fixed effect
in a linear mixed-effects model, the random effects will follow a mixture of
g normal distributions, where g is the number of categories of the missing
covariate.
Magder and Zeger (1996) also considered linear mixed models with mix-
tures of normal distributions as random-effects distribution, but they treat-
ed the number g of components as an unknown parameter, to be estimated
from the data. In order to avoid that nonsmooth mixture distributions, with
many components, would be obtained, they prespecify a lower boundary h
FIGURE 7.6. Histogram (range [−5, 5]) of the Empirical Bayes estimates of the
random intercepts shown in Figure 7.3, calculated under the assumption that the
random effects are drawn from a two-component mixture of normal distributions.
8.1 Introduction
In this chapter, our original model (3.10) for the prostate data will be used
as a guiding example. In Section 8.2, the program for fitting the model will
be presented, together with some available options. It is by no means our
intention to give a full overview of all available statements and options.
Instead, we restrict to those statements and options that are, in our ex-
perience, most frequently used in the analysis of longitudinal data. When
fitting mixed models in other contexts, other statements or options may
be more appropriate. We refer to the SAS manuals (SAS 1992, 1996, 1997)
and to Littell et al . (1996) for more details on the procedure MIXED and
for a variety of examples in other contexts.
We now consider fitting the original linear mixed model (3.10) for the
prostate data. Let the variable group indicate whether a subject is a con-
trol (group = 1), a BPH case (group = 2), a local cancer case (group = 3)
or a metastatic cancer case (group = 4). As in Sections 5.5 and 5.6.1, we
express time in decades before diagnosis, rather than years before diagno-
sis. Further, we define the variable timeclss to be equal to time. This will
enable us to consider time as a continuous covariate and as a classification
variable (a factor in the ANOVA terminology) simultaneously. As before,
the variable age measures the age of the subject at the time of diagnosis.
Finally, id is a variable containing the subject’s identification label, and
lnpsa is the logarithmic transformation ln(1 + x) of the original PSA mea-
surements. We can then use the following program to fit model (3.10) and
to obtain the inferences described in Chapters 6 and 7:
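The program itself is not reproduced here. As an indication only, a program of the following form, assembled from the statements and options discussed in the remainder of this chapter, could be used; the exact list of model effects, the construction of the time2 variable (assumed to equal time squared), and the placement of the CONTRAST, ESTIMATE, and MAKE statements are our own assumptions rather than a verbatim copy of the original program:

proc mixed data = prostate method = reml asycov asycorr covtest ic;
  class id group timeclss;
  /* mean structure of model (3.10): group-specific intercepts and      */
  /* (quadratic) time trends, plus age and its interactions with time   */
  model lnpsa = group group*time group*time2 age age*time age*time2
              / noint solution ddfm = satterth chisq covb predicted;
  id id time;
  random intercept time time2 / type = un subject = id
                                g gcorr v vcorr solution;
  repeated timeclss / type = simple subject = id r rcorr;
  /* CONTRAST, ESTIMATE, and MAKE statements (Sections 8.2.7-8.2.9) would follow here */
run;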
Before presenting the results of this analysis, we briefly discuss the state-
ments and options used in the above program.
This statement calls the procedure MIXED and specifies that the data
be stored in the SAS data set ‘prostate’. If no data set is specified, then
the most recently created data set is used. In general, there are two ways
of setting up data sets containing repeated measurements. One way is to
define a variable for each variable measured and for each time point in
the data set at which at least one subject was measured. Each subject
then corresponds to exactly one record (one line) in the data set. This
setup is convenient when the data are highly balanced, that is, when all
measurements are taken at only a small number of time points. However, this
approach leads to huge data matrices with many missing values in cases of
highly unbalanced data such as the prostate data. The same problem occurs
in the presence of missing data or dropout (see Section 17.3). Therefore, the
MIXED procedure requires that the data set be structured such that each
record corresponds to the measurements available for a subject at only one
moment in time. For example, five repeated measurements for individual
i are put into five different records. This has the additional advantage
that time-varying covariates (such as time) can be easily incorporated into
the model. An identification variable id is needed to link measurements to
subjects, and a time variable is used to order the repeated measurements
within each individual. For example, our prostate cancer data set is set up
in the following way:
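The actual listing is not reproduced here. As an indication of the required “long” layout only, the first records might look as follows, with one line per measurement occasion; all values below are invented purely for illustration:

  id    time   timeclss   group   age    lnpsa
   1    2.30     2.30       1      67    0.26
   1    1.75     1.75       1      67    0.41
   1    0.95     0.95       1      67    0.53
   2    2.10     2.10       2      72    1.28
   2    1.40     1.40       2      72    1.42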
The options ‘asycov’ and ‘asycorr’ can be used for printing the asymptotic
covariance matrix as well as the associated correlation matrix of the es-
timators for the variance components in the marginal model. The option
‘covtest’ requires the printing of the resulting asymptotic standard errors
and associated Wald tests for those variance components. Note that these
Wald tests are not applicable to all variance components in the model (see
discussion in Section 6.3.1). The calculation of the information criteria dis-
cussed in Section 6.4 can be requested by adding the option ’ic’.
The MODEL statement names the response variable (one and only one)
and all fixed effects, which determine the Xi matrices. Note that in order to
have the same parameterization for the mean structure as model (3.10), no
overall intercept (using the ‘noint’ option) nor overall linear or quadratic
time effects should be included into the model, since, otherwise, the mean
structure is parameterized using contrasts between the intercepts and slopes
of the first three diagnostic groups and those for the last group. Although
this would facilitate the testing of group differences (see also Section 8.4),
it complicates the interpretation of the parameter estimates.
The ‘solution’ option is used to request the printing of the estimates for all
the fixed effects in the model, together with standard errors, t-statistics, and
corresponding p-values for testing their significance (see Section 6.2). When
the whole model-based estimated covariance matrix (6.3) for the fixed ef-
fects is of interest, it can be obtained by specifying the option ‘covb’.
When predicted values are requested with the option ’predmeans’ or ’pre-
dicted’ in the MODEL statement, SAS prints a table with the requested
predicted value, the corresponding observed value, and the resulting resid-
ual, for each record in the original data set. Although the records are then
printed in the same order as they appear in the original data set, it may
still be helpful to add columns which help identify the records. This is done
via the ID statement. The values of the variables in the ID statement are
then printed beside each observed, predicted, and residual value. In our ex-
ample, we used the identification number of the patient, together with the
time point at which a measurement was taken to identify the predictions
and residuals which we requested in the MODEL statement.
This statement is used to define the random effects in the model, that is,
the matrices Zi containing the covariates with subject-specific regression
coefficients. Note that when random intercepts are required, this should be
specified explicitly, which is in contrast to the MODEL statement where
an intercept is included by default.
The ‘subject=’ option is used to identify the subjects in our data set. Here,
‘subject=id’ means that all records with the same value for id are assumed
to be from the same subject, whereas records with different values for id are
assumed to contain independent data. This option also defines the block-
diagonality of the matrix Z and of the covariance matrix of b in (5.5). The
variable id is permitted to be continuous as well as categorical (specified
in the CLASS statement). However, when id is continuous, PROC MIXED
considers a record to be from a new subject whenever the value of id is
different from the previous record. Hence, one then should first sort the
data by the values of id. On the other hand, using a continuous id variable
reduces execution times for models with a large number of subjects (manual
PROC MIXED).
The ‘type=’ option specifies the covariance structure D for the random
effects bi . In our example, we specified ‘type=un’ which corresponds to a
general unstructured covariance matrix, i.e., a symmetric positive (semi-)
definite (q×q) matrix D. Many other covariance structures can be specified,
some of which are shown in Table 8.1 and Table 8.2. We further refer to
the SAS manual (1996) for a complete list of possible structures. Although
many structures are available, in longitudinal data analysis, one usually
specifies ‘type=UN’ which does not assume the random-effects covariance
matrix to be of any specific form.
Specifying the options ‘g’ and ‘gcorr’ requests that the random-effects co-
variance matrix D as well as its associated correlation matrix are printed,
printing blanks for all values that are zero. The options ‘v’ and ‘vcorr’ can
be used if a printout of the marginal covariance matrix Vi = Zi DZi + Σi
and the corresponding correlation matrix, respectively, are needed. By de-
fault, SAS only prints the covariance and correlation matrices for the first
subject in the data set. However, ‘v=’ and ‘vcorr=’ can be used to specify
the identification numbers of the patients for which the matrix Vi and the
associated correlation matrix are needed.
The option ‘solution’ is needed for the calculation of the empirical Bayes
(EB) estimates for the random effects bi , previously derived and discussed
in Chapter 7. The result will be a table containing the EB estimates for
the random effects of all subjects included in the analysis. If, for example,
scatter plots or histograms of components of these estimates bi are to be
made, then the estimates should be converted to a SAS output data set.
This will be discussed in Section 8.2.9.
TABLE 8.1. Overview of frequently used covariance structures which can be speci-
fied in the RANDOM and REPEATED statements of the SAS procedure MIXED.
The σ parameters are used to denote variances and covariances, whereas the ρ
parameters are used for correlations.
Structure                      Example

Unstructured
type=UN
\[
\begin{pmatrix}
\sigma_1^2 & \sigma_{12} & \sigma_{13} \\
\sigma_{12} & \sigma_2^2 & \sigma_{23} \\
\sigma_{13} & \sigma_{23} & \sigma_3^2
\end{pmatrix}
\]

Structure                      Example

Power
type=SP(POW)(list)
\[
\sigma^2
\begin{pmatrix}
1 & \rho^{d_{12}} & \rho^{d_{13}} \\
\rho^{d_{12}} & 1 & \rho^{d_{23}} \\
\rho^{d_{13}} & \rho^{d_{23}} & 1
\end{pmatrix}
\]

Exponential
type=SP(EXP)(list)
\[
\sigma^2
\begin{pmatrix}
1 & \exp(-d_{12}/\rho) & \exp(-d_{13}/\rho) \\
\exp(-d_{12}/\rho) & 1 & \exp(-d_{23}/\rho) \\
\exp(-d_{13}/\rho) & \exp(-d_{23}/\rho) & 1
\end{pmatrix}
\]

Gaussian
type=SP(GAU)(list)
\[
\sigma^2
\begin{pmatrix}
1 & \exp(-d_{12}^2/\rho^2) & \exp(-d_{13}^2/\rho^2) \\
\exp(-d_{12}^2/\rho^2) & 1 & \exp(-d_{23}^2/\rho^2) \\
\exp(-d_{13}^2/\rho^2) & \exp(-d_{23}^2/\rho^2) & 1
\end{pmatrix}
\]
For example, the option ‘subject=’ identifies the subjects in the data set,
and complete independence is assumed across subjects. It therefore defines
the block-diagonality of the covariance matrix of ε in (5.5). With respect to
the variable id, the same remarks hold as the ones stated in our description
of the RANDOM statement. Although this is strictly speaking not required,
the RANDOM and REPEATED statement often have the same option
‘subject=id’, as was the case in our example.
Further, the ‘type=’ option specifies the covariance structure Σi for the
error components εi . All covariance structures previously described for the
RANDOM statement can also be specified here. Very often, one selects
‘type=simple’ which corresponds to the most simple covariance structure
Σi = σ 2 Ini . Finally, if no REPEATED statement is used, PROC MIXED
automatically fits such a ‘simple’ covariance structure for the residual com-
ponents.
Finally, specifying the options ‘r’ and ‘rcorr’ requests that the residual co-
variance matrix Σi as well as its associated correlation matrix be printed.
As for the ‘v’ and ‘vcorr’ options in the RANDOM statement, SAS only
prints by default the covariance and correlation matrices for the first sub-
ject in the data set. However ‘r=’ and ‘rcorr=’ can be used to specify the
identification numbers of the subjects for which the matrix Σi and the
associated correlation matrix are needed.
The contrast statement allows testing general linear hypotheses of the form
(6.4). In our program, we have shown how to test whether the original
model (3.10) for the prostate data can be reduced to the final model (6.8)
[see hypothesis (6.7) obtained in Section 6.2.3].
The ESTIMATE statement can be used to estimate the average difference in ln(1 + PSA), at the time of diagnosis, between the local cancer cases and the BPH cases, that is, we will estimate the linear combination (6.9), specified in Section 6.2.3.
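As an indication of the syntax only, such statements take the following form; the labels and coefficients below are placeholders, and the actual coefficients must match the parameterization of the MODEL statement and the definitions of hypothesis (6.7) and linear combination (6.9):

contrast 'reduction of (3.10) to (6.8)'
         age*time 1, age*time2 1 / chisq;
estimate 'BPH versus L/R cancer'
         group 0 1 -1 0 / cl alpha = 0.05;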
It should be emphasized that, from Version 7.0 on, the MAKE statement
is replaced by the integrated ODS (output delivery system). In Version 7.0,
the MAKE statement is still supported, but it is envisaged that this will
no longer be the case in later versions. We refer to Appendix A for more
details and for an example.
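For instance, to export the empirical Bayes estimates produced by the ‘solution’ option of the RANDOM statement to a data set, one could use either of the following statements inside the PROC MIXED step (a sketch, not the book's own code; the table name SolutionR and the output data set name eb are assumptions):

/* up to Version 6.12 */
make 'SolutionR' out = eb;

/* from Version 7.0 onward */
ods output SolutionR = eb;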
The sandwich estimator (6.2) can also be used as the basis for robust inference for all fixed effects in the model (see Section 6.2.4). This can be
requested by adding the option ‘empirical’ to the PROC MIXED statement.
All reported standard errors and inferences for the fixed effects will then
be based on the robust estimate for the covariance matrix rather than the
naive one.
Similar to the ESTIMATE statement, one can add the options ‘cl’ and
‘alpha=’ to the MODEL statement to request for the construction of t-
type confidence limits for each of the fixed-effect parameters.
The ‘group=’ option can be added to the RANDOM or REPEATED statement to allow the covariance structure to differ between groups of subjects: each level of the specified effect will produce a new set of covariance parameters with the same structure as specified in the ‘type=’ option.
In Section 3.3.4, a general but flexible model was presented for the resid-
ual covariance Σi , which assumed that the residual component εi can be
decomposed as εi = ε(1)i +ε(2)i , in which ε(2)i is a component of serial corre-
lation and where ε(1)i represents pure measurement error. Such models can
also easily be fitted in PROC MIXED. One then has to specify the required
covariance structure of ε(2)i in the ‘type=’ option of the REPEATED state-
ment, whereas the measurement error component is obtained by adding the
option ‘local’ to the same REPEATED statement. This will be illustrated
in Section 9.4.
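For instance, a Gaussian serial correlation component combined with measurement error could tentatively be requested with a statement of the following form, using the variable names introduced earlier in this chapter (a sketch only, not the specification used in Section 9.4):

repeated timeclss / type = sp(gau)(time) local subject = id;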
In the following sections, we will discuss the different parts of the SAS
output obtained from fitting the model on p. 94 to the prostate data.
Iteration   Evaluations      Objective      Criterion
    0             1       -259.0577593
    1             2       -753.2423823     0.00962100
    2             1       -757.9085275     0.00444385
    .             .       ............     ..........
    6             1       -760.8988784     0.00000003
    7             1       -760.8988902     0.00000000
The objective function is, apart from a constant which does not depend
on the parameters, minus twice the log-likelihood function. In the case
of REML estimation, the exact relation between LREML and the objective
function OFREML is given by
\[
\mathrm{OF}_{\mathrm{REML}} \;=\; -\,2 \ln L_{\mathrm{REML}} \;-\; (n-p)\ln(2\pi). \tag{8.1}
\]
Hence, our final parameter estimates are those which minimize this objec-
tive function. The reported number of evaluations is the number of times
the objective function has been evaluated during each iteration. In the
‘Criterion’ column, a measure of convergence is given, where a value equal
to zero indicates that the iterative estimation procedure has converged.
In practice, the procedure is considered to have converged whenever the
convergence criterion is smaller than a so-called tolerance number which
is set equal to 10−8 by default. Unless specified otherwise, SAS uses the
relative Hessian convergence criterion defined as |gk Hk−1 gk |/|fk |, where
fk is the value of the objective function at iteration k, gk is the gradient
(vector of first-order derivatives) of fk , and Hk is the Hessian (matrix of
second-order derivatives) of fk . Other possible choices are the relative func-
tion convergence criterion and the relative gradient convergence criterion
defined as |fk − fk−1 |/|fk | and maxj |gkj |/|fk |, respectively, where gkj is
the jth element in gk .
The ‘Model Fitting Information’ table shows the following additional in-
formation:
Description Value
Observations 463.0000
Res Log Likelihood -31.2350
Akaike’s Information Criterion -38.2350
Schwarz’s Bayesian Criterion -52.6018
-2 Res Log Likelihood 62.4700
Null Model LRT Chi-Square 501.8411
Null Model LRT DF 6.0000
Null Model LRT P-Value 0.0000
In our example, $n = \sum_{i=1}^{N} n_i = 463$ observations were used to calculate
the parameter estimates. The value of the REML log-likelihood function
evaluated at the REML estimates is reported as ‘Res Log Likelihood’, and
similarly for minus twice the maximized log-likelihood function. Note that
the log-likelihood can also easily be calculated from expression (8.1). In our
example, this becomes −[(463 − 15) ln(2π) − 760.8989]/2 = −31.2350. Note
that this value was already reported in Table 5.1. It is also used for the
calculation of the reported Akaike and Schwarz information criteria which
we defined in Section 6.4. The number of parameters used in the calculation
equals the number of variance components in the model. Hence, the criteria
reported here should only be used to compare models with the same mean
structure, but with different covariance structures, even when maximum
likelihood would be used for model fitting. We will come back to this in
Section 8.3.3. In our example, we have AIC = −31.2350 − 7 = −38.2350
and SBC = −31.2350 − 3.5 ln(463 − 15) = −52.6018, respectively.
The ‘Null Model LRT Chi-Square’ value is −2 times the log-likelihood from
the null model minus −2 times the log-likelihood from the fitted model,
where the null model is the one with the same fixed effects as the actual
model, but without any random effects, and with Σi = σ 2 Ini . This statistic
is then compared to a χ2 -distribution with degrees of freedom equal to the
number of variance components minus 1, and the reported p-value is the
upper tail area from this distribution. It is suggested that this p-value can
be used to test whether or not there is any need at all for modeling the
covariance structure of the data. However, as discussed in Section 6.3.4,
the obtained p-value is not valid since the classical maximum likelihood
theory from which it results does not hold due to boundary problems in
the parameter space. Hence, it is, in general, not recommended to interpret
the reported p-value in any such way.
Since the option ‘ic’ was specified in the PROC MIXED statement, all four
information criteria previously defined in Section 6.4 were calculated. The
results are summarized in the following table:
Information Criteria
The first two rows are the information criteria as we defined them in Sec-
tion 6.4. When comparing different models, the model with the largest
value of AIC, SBC, HQIC, or CAIC is deemed best. The last two rows
are the versions defined based on minus twice the maximized log-likelihood
value rather than on the log-likelihood itself. They equal minus twice the
corresponding value in the first or second row. For these versions, small
values of AIC, SBC, HQIC, or CAIC are considered as good. The reported
parameters p and q in the above output table correspond to the number
of fixed effects and the number of variance components that is taken into
account in the computation of the information criteria. Note that, since the
information criteria reported in the first and third rows do not take into
account the number of parameters in the mean structure (p set equal to
zero), these criteria should not be used to compare models with different
mean structures. Moreover, as explained in Section 6.2.5 and Section 6.4,
they are only fully interpretable when models are fitted with the maxi-
mum likelihood procedure, rather than the restricted maximum likelihood
procedure.
Since we specified the options ‘asycov’ and ‘asycorr’ in the PROC MIXED
statement, we also get a printout of the estimated asymptotic covariance
matrix and associated correlation matrix for the above parameter esti-
mates. These are given by
Cov Parm Row COVP1 COVP2 COVP3 COVP4 COVP5 COVP6 COVP7
and
Cov Parm Row COVP1 COVP2 COVP3 COVP4 COVP5 COVP6 COVP7
respectively.
The estimated random-effects covariance matrix D and its associated correlation matrix are reported in the output tables labeled ‘G Matrix’ and ‘G Correlation Matrix’,
respectively, and were obtained by specifying the options ‘g’ and ‘gcorr’ in
the RANDOM statement. The estimate for the residual covariance matrix
Σ1 for the first subject in the data set, as well as the associated correlation
matrix are reported as
R Matrix for ID 1
1 0.0282
2 0.0282
3 0.0282
4 0.0282
5 0.0282
6 0.0282
7 0.0282
8 0.0282
and
1 1.0000
2 1.0000
3 1.0000
4 1.0000
5 1.0000
6 1.0000
7 1.0000
8 1.0000
respectively, and were obtained by specifying the options ‘r’ and ‘rcorr’ in
the REPEATED statement. Note that zero entries are left blank. Finally,
combining the estimate for D with the estimate for Σ1 , an estimate is
obtained for the marginal covariance matrix V1 = Z1 DZ1 + Σ1 , as well as
for the associated correlation matrix, of the first subject in the data set.
These are reported in the output tables labeled ‘V Matrix for ID 1’ and the associated correlation matrix,
respectively, and were obtained by specifying the options ‘v’ and ‘vcorr’ in
the RANDOM statement.
TABLE 8.3. Prostate Data. Overview of the hypotheses corresponding to the tests
specified in the table labeled ‘Tests of Fixed Effects.’
Note that the estimates and standard errors were already reported in Ta-
ble 5.1. A printout of the complete estimated covariance matrix for the fixed
effects is also obtained (due to the option ‘covb’ in the MODEL statement),
but it is not printed here because of its high dimension (15 × 15).
By default, SAS provides approximate F -tests (see Section 6.2.2) for all ef-
fects specified in the MODEL statement. For continuous covariates, which
do not interact with any factors (i.e., with no interaction term included in
the MODEL statement), this is equivalent to the t-test reported in the table
‘Solution for Fixed Effects.’ For each factor specified in the CLASS state-
ment, it is tested whether any of the parameters assigned to this factor is
significantly different from zero. The same is true for interactions of factors
with other effects. The hypotheses tested in our example are summarized
in Table 8.3. Finally, since the option ‘chisq’ was added to the MODEL
statement, approximate Wald tests are also performed (see Section 6.2.1).
The output table is given by
Because the option ‘ddfm=satterth’ has been added to the MODEL state-
ment, a Satterthwaite approximation is used for the calculation of the de-
nominator degrees of freedom needed for the approximate F -tests. As is
Two additional tables are also given, labeled ‘CONTRAST Statement Re-
sults’ and ‘ESTIMATE Statement Results,’ which contain the results from
the specified CONTRAST and ESTIMATE statements, respectively. The
first one is given by
and provides approximate F - and Wald tests for testing whether the orig-
inal model (3.10) can be reduced to model (6.8). The results were already
reported in Section 6.2.3. The table with the results from the ESTIMATE
statement equals
Note that the results are slightly different from our results given in Ta-
ble 6.2, which is due to the fact that inferences for the average difference
between BPH patients and local cancer cases is now obtained under model
(3.10) rather than under model (6.8), as was the case in Table 6.2.
For each subject, the empirical Bayes estimate $\widehat{b}_i$ for its vector bi of random effects is printed, together with approximate standard errors and t-tests. The standard errors reported here are not based on the covariance matrix (7.3) of $\widehat{b}_i$ but on the covariance matrix (7.4) of $\widehat{b}_i - b_i$.
Note that it follows from the way we parameterized the mean structure of
our model that the F -tests discussed in Section 8.3.5 cannot be used to test
whether the different diagnostic groups have different intercepts or slopes.
For example, four parameters are assigned to the effect ‘time2 ∗ group’,
being the slopes for the quadratic time effect for each group separately.
The F -test reported for the effect time2 ∗ group therefore corresponds to the null hypothesis
\[
H_5 : \beta_{12} = \beta_{13} = \beta_{14} = \beta_{15} = 0
\]
rather than to
\[
H_0 : \beta_{12} = \beta_{13} = \beta_{14} = \beta_{15}. \tag{8.2}
\]
Note that hypothesis (8.2) is also of the form H0 : Lβ = 0 and can thus
also be tested using a CONTRAST statement in PROC MIXED (see Sec-
tion 8.2.7).
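A sketch of such a CONTRAST statement, assuming the effect was specified as group*time2 in the MODEL statement on p. 94 so that its four parameters are the group-specific quadratic slopes (the label is arbitrary):

contrast 'equal quadratic time effects (8.2)'
         group*time2 1 -1 0 0,
         group*time2 1 0 -1 0,
         group*time2 1 0 0 -1 / chisq;

Note that this contrast has 3 numerator degrees of freedom, in agreement with the discussion below.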
Suppose we refit the model with an overall intercept and overall linear and quadratic time effects included in the MODEL statement (that is, without the ‘noint’ option). We then get the following output table ‘Solution for Fixed Effects’ with
REML estimates and t-tests for all parameters in the reparameterized mean
structure:
The slope β15 for time2 in the last group is now the parameter assigned to
the overall time2 effect, and the three parameters assigned to the interaction
of group with time2 are the contrasts β12 − β15 , β13 − β15 , and β14 − β15 ,
respectively (see also the original estimates in Table 5.1). Note that this also
implies that the CONTRAST statement and the ESTIMATE statement
previously used in the program on p. 94 are no longer valid under the new
parameterization.
For the reparameterized model, we get the following F -tests for the effects
specified in the MODEL statement:
TABLE 8.4. Prostate Data. Overview of the hypotheses corresponding to the tests
specified in the table labeled ‘Tests of Fixed Effects’ for the model with reparame-
terized mean structure.
The hypotheses tested in the above output are shown in Table 8.4. Hence,
the test for hypothesis (8.2) is now reported as the F -test corresponding to
the effect of time2 ∗ group. Note also the change in its numerator degrees of
freedom due to the fact that we now test for equality of the quadratic time
effect in the four diagnostic groups, rather than testing whether there is any
quadratic time effect in any of the four diagnostic groups at all. Also, under
this parameterization for the mean structure, the F -test reported for time2
tests whether there is a quadratic time effect in the overall population and
is therefore not equivalent to the t-test reported for time2 in the table
labeled ‘Solution for Fixed Effects,’ which was testing for a quadratic time
effect for the metastatic cancer cases only. The same remark is true for the
F -test reported for time.
for some non-negative value σc2 , corresponding to a compound symmetry covariance structure. Although such an assumption is often not
realistic in a longitudinal-data setting (constant variance and all correla-
tions equal), it is frequently used in practice since it immediately follows
from random-factor ANOVA models (see, for example, Neter, Wasserman,
and Kutner 1990, Section 17.6, Searle 1987, Chapter 13).
This also implies that one should not conclude that the REPEATED state-
ment is used whenever the data are of the repeated-measures type. Some
repeated-measures models are best expressed using the RANDOM state-
ment (see, e.g., the prostate cancer data), whereas there are also random-
effects models, which do not fall into the repeated-measures class but where
the REPEATED statement is the simplest tool for expressing them in
PROC MIXED syntax. For example, suppose 100 exams were randomly
assigned for correction to 10 randomly selected teachers. If Yij then de-
notes the grade assigned to the jth exam by the ith observer, the following
random-factor ANOVA model can be used to analyze the data:
\[
Y_{ij} \;=\; \mu + \alpha_i + \varepsilon_{ij}. \tag{8.4}
\]
The parameter µ represents the overall mean, the parameters αi are the
random observer effects, and the εij are components of measurement error.
It is thereby assumed that all αi and εij are independent of each other, and
that they are normally distributed with mean zero and constant variances
σc2 and σ 2 , respectively. Model (8.4) can then be fitted using the following
program:
proc mixed;
class observer;
model Y = / solution;
repeated / type = cs subject = observer;
run;
For balanced longitudinal data (i.e., longitudinal data where all subjects
have the same number of repeated measures, taken at time points which
are also the same for all subjects), one often analyzes the data using the
SAS procedure PROC GLM, fitting general multivariate regression models
(Seber 1984, Chapters 8 and 9) to the data. Such models can also be fitted
with PROC MIXED by omitting the RANDOM statement and including
a REPEATED statement with option ‘type=UN’. One then fits a linear
model with a general unstructured covariance matrix Σ = Σi . However,
the two procedures do not necessarily yield the same results: PROC GLM
only takes into account the data of the completers, that is, only the data
of the subjects with all measurements available are used in the calcula-
tions. PROC MIXED, on the other hand, uses all available data. Hence,
patients for whom not all measurements were recorded will still be taken
into account in the analysis. We refer to Section 17.3 for an illustration.
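As an indication only, such a multivariate-regression fit could look as follows for a hypothetical balanced data set ‘growth’ with identifier id, measurement occasion time, and response y (all names are assumptions, and the choice of mean structure is up to the analyst):

proc mixed data = growth method = ml;
  class id time;
  model y = time / solution;
  repeated time / type = un subject = id r rcorr;
run;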
For example, a univariate analysis of the measurements taken at the second occasion only can be obtained by adding a WHERE statement such as

where time = 2;

to the program.
Note that this again may yield different results than PROC GLM due to
the fact that now all second measurements are analyzed rather than only
the measurements from the patients with measurements taken at all time
points. Finally, the “split unit” type of analysis provided by the GLM procedure can be obtained in PROC MIXED by fitting a compound symmetry model (see Section 8.5). However, Greenhouse-Geisser and Huynh-Feldt corrections to the F -tests are not available in the MIXED procedure,
but they are not really required, as it is very simple to fit and test models
with more complex covariance structures.
The main strength of the procedure MIXED is that it does not assume that
an equal number of repeated observations is taken from each individual or
that all individuals should be measured on the same time points. Hence,
the measurements can be viewed as being taken at a continuous rather
than discrete time scale. Also, the use of random effects allows us to model
covariances as continuous functions of time. Another main advantage in
using the MIXED procedure is the fact that all available data (not only
the “complete cases”) are used in the analysis. Finally, PROC MIXED also
allows us to include time-varying covariates in the mean structure, which
is not possible in PROC GLM.
9.1 Introduction
[Figure 9.1: diagram linking the mean structure and the covariance structure of the model to the following inferential results: Estimation of θ, Covariance matrix for θ̂, t-tests and F-tests, Confidence intervals, Efficiency, Prediction.]
FIGURE 9.1. Graphical representation of how the mean structure and the covari-
ance structure of a linear mixed model influence one another and how they affect
the inference results.
in which Xi∗ and Zi∗ are the fixed-effects and random-effects covariates,
respectively, and ε∗(2)i is the serial error, at time t∗i . The random effect bi
is estimated as in Section 7.2. Chi and Reinsel (1989) have shown that if
the components of ε(2)i follow an AR(1) process,
\[
E\bigl(\varepsilon^{*}_{(2)i} \mid y_i\bigr)
\;=\;
\phi^{\,(t_i^{*} - t_{i,n_i})}
\bigl( y_i - X_i\widehat{\beta} - Z_i\widehat{b}_i \bigr)_{n_i},
\]
for φ equal to the constant which determines the AR(1) process, |φ| < 1.
This means that the inclusion of serial correlation may improve the predic-
For data sets where most variability in the measurements is due to between-
subject variability, one can very often use the two-stage approach to con-
struct an appropriate linear mixed model. This was illustrated for the
prostate cancer data in Sections 3.2.4 and 3.3.3. On the other hand, as
shown in Section 5.6.2, a two-stage approach does not always automati-
cally yield a valid marginal model for the data. Also, if the intersubject
variability is small in comparison to the intrasubject variability, this sug-
gests that the covariance structure cannot be modeled using random effects,
but that an appropriate covariance matrix Σi for εi should be found.
In this chapter, some simple guidelines will be discussed which can help the
data analyst to select an appropriate linear mixed model for some specific
data set at hand. All steps in this model building process will be illustrated
with the prostate cancer data set. It should be emphasized that following
the proposed guidelines does not necessarily yield the most appropriate
model, nor does it yield a linear mixed model where all distributional as-
sumptions are automatically satisfied. In general, more complex model di-
agnostics are required to assess the goodness-of-fit of a model. We therefore
refer to the exploratory techniques of Chapter 3 and to the more advanced
techniques later described in Chapters 10, 11, and 12.
Since the covariance structure models all variability in the data which can-
not be explained by the fixed effects, we start by first removing all sys-
tematic trends. As proposed by Diggle (1988) and Diggle, Liang, and Zeger
(1994) (Sections 4.4 and 5.3), we use an overelaborated model for the mean
response profile. When the data are from a designed experiment in which
the only relevant explanatory variables are the treatment labels, it is a
sensible strategy to use a “saturated model” for the mean structure. This
incorporates a separate parameter for the mean response at each time point
within each treatment group. For example, when two treatment groups had
measurements at four fixed time points, we would use p = 4 × 2 = 8 para-
meters to model E(Yi ). The Xi matrices would then equal
\[
X_i \;=\;
\begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0
\end{pmatrix}
\quad\text{or}\quad
X_i \;=\;
\begin{pmatrix}
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix},
\]
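A saturated mean structure of this form can be requested by treating both the treatment indicator and the measurement occasion as classification variables and fitting their interaction without an overall intercept. The sketch below assumes a data set with variables y, id, group, and time; these names are assumptions:

proc mixed data=example2 method=reml;
   class id group time;
   model y = group*time / noint solution;   /* one mean per group-by-time cell */
   repeated time / type=un subject=id;      /* unstructured covariance matrix  */
run;

The eight GROUP*TIME cell means correspond to the eight columns of the matrices Xi shown above.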
FIGURE 9.2. Prostate Data. Smoothed average trend of ln(PSA + 1) for each
diagnostic group separately.
For data in which the times of measurement are not common to all individ-
uals or when there are continuous covariates which are believed to affect
the mean response, the concept of a saturated model breaks down and the
choice of our most elaborate model becomes less obvious. In such cases, a
plot of smoothed average trends or individual profiles often helps to select
a candidate mean structure. For the prostate cancer data, Figure 9.2 shows
the smoothed average trend in each diagnostic group separately. Note that
these trends are not corrected for the age differences between the study
participants. Further, the individual profiles (Figure 2.3) and our results
from Section 4.3.4 suggest modeling ln(1 + PSA) as a quadratic function
over time. This results in an average intercept and an average linear as
well as quadratic time effect within each diagnostic group. Finally, it has
been anticipated that age is also an important prognostic covariate. We
therefore also include age at time of diagnosis along with its interactions
with time and time2 . Our preliminary mean structure therefore contains
4 × 3 + 3 = 15 fixed effects, represented by the vector β. Note that this is
the mean structure which was used in our initial model (3.10), and that at
this stage, we deliberately favor overparameterized models for E(Yi ) in or-
der to get consistent estimators of the covariance structure in the following
steps.
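The ordinary least squares (OLS) residuals needed in the next section can be obtained by fitting this preliminary mean structure without any random effects. A sketch is given below; the data set prostate and the variable names lnpsa (for ln(1 + PSA)), group, time, timesq (time squared), and age are assumptions:

proc glm data=prostate;
   class group;
   model lnpsa = group group*time group*timesq
                 age age*time age*timesq / noint;   /* 4 x 3 + 3 = 15 fixed effects */
   output out=olsres r=olsresid;                    /* save the OLS residuals       */
run;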
9.3 Selection of a Preliminary Random-Effects Structure
FIGURE 9.3. Prostate Data. Ordinary least squares (OLS) residual profiles.
FIGURE 9.4. Prostate Data. Smoothed average trend of squared OLS residuals.
Squared residuals larger than 0.4 are not shown.
For the prostate data, the OLS residual profiles, obtained by subtracting the fitted preliminary mean structure from the observed measurements, are shown in Figure 9.3. A smoothed average trend of
the squared OLS residuals is shown in Figure 9.4 and is used to explore
the variance function over time. If it is believed that different variance
functions should be used for different groups, a plot as in Figure 9.4 could
be constructed for each group separately. When this plot shows constant
variability over time, we assume stationarity and we do not include random
effects other than intercepts. In cases where the variability varies over time
and where there is still some remaining systematic structure in the residual
profiles (i.e., where the between-subject variability is large in comparison
to the overall variation), the following guidelines can be used to select one
or more random effects additional to the random intercepts.
• Try to find a regression model for each residual profile in the above
plot. Such models contain subject-specific parameters and are there-
fore perfect candidates as random effects in our general linear mixed
model. For example, if the residual profiles can be approximated by
straight lines, then only random intercepts and random slopes for
time would be included.
• Since our model always assumes the random effects bi to have mean
zero, we only consider covariates Zi which have already been included
as covariates in the fixed part (i.e., in Xi ) or which are linear com-
binations of columns of Xi . Note that this condition was satisfied in
model (3.10), which we used to analyze the prostate cancer data. For
example, the second column of Zi , which represents the linear ran-
dom effect for time, equals the sum of columns 7 to 10 in Xi , which
are the columns containing the linear time effects for the controls,
the benign prostatic hyperplasia patients, the local cancer cases, and
the metastatic cancer cases, respectively.
• Morrell, Pearson, and Brant (1997) have shown that Zi should not
include a polynomial effect if not all hierarchically inferior terms are
also included, and similarly for interaction terms. This generalizes
the well-known results from linear regression (see, e.g., Peixoto 1987,
1990) to random-effects models. It ensures that the model is invariant
to coding transformations and avoids unanticipated covariance struc-
tures. This means that if, for example, we want to include quadratic
random time effects, then also linear random time effects and random
intercepts should be included.
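For the prostate data, the variance function implied by a model with random intercepts and random slopes for the linear and quadratic time effects can then be compared with the smoothed squared OLS residuals of Figure 9.4. Under such a model, with measurement error variance σ² and no serial correlation, a standard calculation gives (presumably the quantity plotted in Figure 9.5)
\[
\widehat{\mathrm{var}}\bigl(Y_i(t)\bigr) \;=\; (1,\,t,\,t^2)\,\widehat{D}\,(1,\,t,\,t^2)' \;+\; \widehat{\sigma}^2,
\]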
where D̂ and σ̂2 are the REML estimates reported in Table 5.1. The result is
presented in Figure 9.5. Both variance functions show similar trends, except
at the beginning and at the end. The deviation for small time points can be
explained by noticing that some subjects have extremely large PSA values
close to their time of diagnosis (see individual profiles in Figure 2.3). These
correspond to extremely large squared residuals (not shown in Figure 9.4),
which may have inflated the fitted variance function much more than the
variance function obtained by smoothing the squared OLS residuals. The
deviation for large time points can be ascribed to the small amount of data
available. Only 24 out of the 463 PSA measurements have been taken more
than 20 years prior to the diagnosis.
Apart from the information criteria discussed in Section 6.4, there are no general sim-
ple techniques available to compare all these models. For highly unbalanced
data with many repeated measurements per subject, one usually assumes
that random effects can account for most of the variation in the data and
that the remaining error components εi have a very simple covariance struc-
ture, leading to parsimonious models for Vi .
One such model is the model (3.11), introduced in Section 3.3.4. It assumes
that εi has constant variance and can be decomposed as εi = ε(1)i + ε(2)i ,
in which ε(2)i is a component of serial correlation and where ε(1)i is a com-
ponent of measurement error. The model is then completed by specifying a
serial correlation function g(·). The most frequently used functions are the
exponential and the Gaussian serial correlation functions already shown in
Figure 3.2, but other functions can also be specified in the SAS procedure
MIXED. We propose to fit a selection of linear mixed models with the
same mean and random-effects structure, but with different serial correla-
tion structures, and to use likelihood-based criteria to compare the different
models. In some cases, likelihood ratio tests can be applied. In other cases,
one might want to use the information criteria introduced in Section 6.4.
Alternatively, one can use one of the advanced methods for the exploration
of the residual serial correlation, which will be discussed in Chapter 10.
However, unless one is especially interested in the serial correlation func-
tion, it is usually sufficient to fit and compare a series of serial correlation
models.
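As an illustration, a model combining the preliminary mean structure of Section 9.2 with three random effects, Gaussian serial correlation, and measurement error might be specified as follows (a sketch; the data set prostate and the variable names lnpsa and timesq are assumptions, while id, group, time, and timeclss are as in Section 8.2):

proc mixed data=prostate method=reml covtest;
   class id group timeclss;
   model lnpsa = group group*time group*timesq
                 age age*time age*timesq / noint solution;
   random intercept time timesq / type=un subject=id g gcorr;
   repeated timeclss / type=sp(gau)(time) local subject=id;  /* Gaussian serial correlation; */
run;                                                         /* LOCAL adds measurement error */

Replacing ‘type=sp(gau)(time)’ by, for example, ‘type=sp(exp)(time)’ or ‘type=sp(pow)(time)’ fits alternative serial correlation functions, which can then be compared via their likelihoods or information criteria.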
TABLE 9.1. Prostate Data. REML estimates and estimated standard errors for all
variance components in a linear mixed model with the preliminary mean structure
defined in Section 9.2, with random intercepts and random slopes for the linear as
well as quadratic time effects, with measurement error, and with Gaussian serial
correlation.
Comparing minus twice the REML log-likelihood of the above model with
the value obtained without the serial correlation component (see Table 5.1)
yields a difference of 12.896, indicating that adding the serial correlation
component really improved the covariance structure of our model. Further,
note that the residual variability has now been split up into two compo-
nents which are about equally important (similar variance). Based on this
extended covariance matrix, we repeated our informal check previously pre-
sented in Figure 9.5, comparing the smoothed average trend in the squared
OLS residuals with the new fitted variance function. The fitted variances
TABLE 9.2. Prostate Data. REML estimates and estimated standard errors for all
variance components in a linear mixed model with the preliminary mean structure
defined in Section 9.2, with random intercepts and random slopes for the linear as
well as quadratic time effects, with measurement error, and with Gaussian serial
correlation. Each time, another parameterization of the model is used, based on
how the intercept has been defined.
It is important that the final steps in the iterative procedure are based on the default Newton-
Raphson method since, otherwise, all reported standard errors are based
on the expected rather than observed Hessian matrix, the consequences of
which will be discussed in Chapter 21. Note that the variance components
in the covariance structure of εi as well as the variance d33 of the random
slopes for the quadratic time effect remain unchanged when the model is
reparameterized. This is not the case for the other elements in the random-
effects covariance matrix D. As was expected, the random-intercepts vari-
ance d11 decreases as the intercept moves further away from the time of
diagnosis, and the same holds for the overall variance d11 + σ 2 + τ 2 at the
time of the intercept. We therefore recommend defining random intercepts
as the response value at the time where the random variation in the data
is maximal. This facilitates the discrimination among the three sources of
stochastic variability.
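In practice, such a reparameterization amounts to shifting the time origin before fitting the model. A sketch is given below; the shift of 5 years is purely illustrative, and the variable names are assumptions:

data prostate2;
   set prostate;
   timeshift   = time - 5;          /* intercept now refers to time = 5    */
   timesqshift = timeshift**2;      /* quadratic term on the shifted scale */
run;

The shifted variables then replace time and timesq in the MODEL and RANDOM statements.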
terms are also included. Taking into account this hierarchy, one should test
the significance of the highest-order random effects first. We reemphasize
that the need for random effects cannot be tested using classical likelihood
ratio tests, due to the fact that the null hypotheses of interest are on the
boundary of the parameter space, which implies that the likelihood ratio
statistic does not have the classical asymptotic chi-squared distribution (see
our discussion in Section 6.3.4).
Further, now that the final covariance structure for the model has been
selected, the tests discussed in Section 6.2 become available for the fixed
effects in the preliminary mean structure.
For the prostate cancer data, we then end up with the same mean structure
as well as random-effects structure as model (6.8) in Section 6.2.3. This
model can now be fitted in SAS with a program of the following form:
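The sketch below only indicates the general form of such a program. The terms listed in the MODEL statement are illustrative and stand in for the exact fixed effects of model (6.8); the RANDOM and REPEATED statements reflect the covariance structure described in the text, and timesq (time squared) is an assumed derived variable:

proc mixed data=prostate method=reml covtest;
   class id group timeclss;
   model lnpsa = age group time*group
                 timesq*loccanc timesq*metcanc / noint solution;   /* illustrative only */
   random intercept time timesq / type=un subject=id g;
   repeated timeclss / type=sp(gau)(time) local subject=id;
run;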
The SAS variables cancer, bph, loccanc, and metcanc are dummy variables
defined to be equal to 1 if the patient has prostate cancer, benign prostatic
hyperplasia, local prostate cancer, or metastatic prostate cancer, respec-
tively, and zero otherwise. The variables id, group, time, and timeclss are
as defined in Section 8.2. The parameter estimates and estimated standard
errors are shown in Table 9.3, and they can be compared to the estimates
shown in Table 6.1, which were obtained without assuming the presence of
residual serial correlation. Note that adding the serial correlation compo-
nent to the model leads to smaller standard errors for almost all parameters
in the marginal model. This illustrates that an adequate covariance model
implies efficient model-based inferences for the fixed effects.
TABLE 9.3. Prostate Data. Results from fitting the final model (6.8) to the
prostate cancer data, using restricted maximum likelihood estimation. The co-
variance structure contains three random effects, Gaussian serial correlation, and
measurement error.
10
Exploring Serial Correlation
10.1 Introduction
where the (j, k) element of Hi equals g(|tij −tik |) for some (usually) decreas-
ing function g(·) with g(0) = 1. Exploring the residual covariance structure
then reduces to studying the serial correlation function g(·).
We will also assume that all systematic trends have been removed from the
data. As explained in Section 9.2, this can be done by calculating ordinary least squares (OLS) residuals.
In Section 10.2, we will discuss an informal check for the need of a serial
correlation component τ 2 Hi in (10.1). Afterward, in Section 10.3, it will
be shown how the residual serial correlation function can be studied by
fitting linear mixed models with general flexible parametric functions for
g(·). Finally, in Section 10.4, we discuss the use of the so-called variogram
to study the serial correlation function nonparametrically, that is, without
assuming any parametric form for g(·).
We will then use the methods of Sections 10.3 and 10.4 to explore whether serial correlation should be added to the model,
and if so, what function g(·) would be appropriate.
10.3.1 Introduction
where the degree m is a positive integer, where p1 > . . . > pm are real-
valued prespecified powers, and where φ0 and φ1 , . . . , φm are real-valued
unknown regression coefficients. Finally, x(pj ) is defined as
\[
x^{(p_j)} \;=\;
\begin{cases}
x^{p_j} & \text{if } p_j \neq 0,\\
\ln(x) & \text{if } p_j = 0.
\end{cases}
\tag{10.2}
\]
In the context of linear and logistic regression analyses, Royston and Alt-
man (1994) have shown that the family of fractional polynomials is very
flexible and that models with degree m larger than 2 are rarely required.
In practice, several values for the powers p1 , . . . , pm can be tried, and the
model with the best fit is then selected.
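As a simple illustration (not taken from the text), a second-degree fractional polynomial with powers p1 = 1 and p2 = 0.5 is just
\[
\phi_0 \;+\; \phi_1\, x \;+\; \phi_2\, x^{1/2},
\]
so that conventional polynomials are recovered as the special case of integer powers.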
TABLE 10.1. Prostate Data. ML estimates and estimated standard errors for the
variance of the measurement error components as well as the parameters φj in
the serial correlation function (10.4).
Parameter estimates and associated asymptotic standard errors for the pa-
rameters φj as well as for the variance σ 2 of the measurement error compo-
nents are given in Table 10.1. The fitted function (10.4) is shown in panel
(a) of Figure 10.1 (solid line). Note that we report here maximum likeli-
hood (ML) estimates instead of restricted maximum likelihood estimates
(REML). Indeed, all parameters reported in the table are based on the
OLS residuals ri , which implies that no mean structure is included in the
model. Hence, REML estimation becomes impossible.
The parameter estimates for the variance components in the model with
Gaussian serial correlation have already been reported in Table 9.1. Those
for the model with exponential serial correlation are given in Table 10.2.
This latter model can be fitted with the SAS procedure MIXED by replac-
ing the option ‘type=sp(gau)(time)’ by the option ‘type=sp(exp)(time)’ in
our program on p. 129. Note also that when our exponential serial cor-
relation function is parameterized as g(u) = exp(−φu), SAS provides an
FIGURE 10.1. Prostate Data. Estimates for the residual serial covariance func-
tion τ 2 g(u) and for the residual serial correlation function g(u). The solid lines
represent the estimate obtained under model (10.4), where the parameter esti-
mates are the ones reported in Table 10.1. The long-dashed lines show the es-
timates under the exponential serial correlation model g(u) = exp(−φu), where
the parameter estimates of τ and φ are the ones reported in Table 10.2. The
short-dashed lines show the estimates under the Gaussian serial correlation model
g(u) = exp(−φu2 ), where the parameter estimates of τ and φ are the ones re-
ported in Table 9.1.
estimate for 1/φ rather than for φ itself. The so-obtained parametric esti-
mates for τ 2 g(u) are also included in panel (a) of Figure 10.1. Note that the
estimate for τ 2 g(u) based on model (10.4) is not well approximated by ei-
ther one of the other two estimates, but this is mainly due to the differences
between the estimates for the variance τ 2 . For studying the correlation func-
tion g(u), independent of the amount of variability explained by the serial
correlation component, it is often helpful to plot the corresponding rescaled
functions g(u), shown in panel (b) of Figure 10.1. We now find that the
exponential as well as Gaussian serial correlation functions are good ap-
proximations for the function obtained under model (10.4), with a slightly
better performance for the exponential serial correlation model, especially
for small time lags. This is also supported by the maximized log-likelihood
values which are equal to −24.266 and −24.787 for the exponential and the
Gaussian serial correlation model, respectively (see Tables 9.1 and 10.2).
The fact that both models are hard to distinguish is also reflected in the
high correlations between the estimates for the parameters φj and the es-
timate for the variance σ 2 , under model (10.4). The estimated correlation
matrix is given by
\[
\mathrm{Corr}\bigl(\widehat{\sigma}^2,\widehat{\phi}_0,\widehat{\phi}_1,\widehat{\phi}_2\bigr) \;=\;
\begin{pmatrix}
1.000 & 0.235 & -0.811 & 0.862\\
0.235 & 1.000 & -0.010 & 0.646\\
-0.811 & -0.010 & 1.000 & -0.706\\
0.862 & 0.646 & -0.706 & 1.000
\end{pmatrix},
\]
which shows that all parameter estimates are highly correlated with φ2 ,
including the estimate for the variance σ 2 of the measurement errors.
TABLE 10.2. Prostate Data. REML estimates and estimated standard errors for
all variance components in a linear mixed model with the preliminary mean struc-
ture defined in Section 9.2, with random intercepts and random slopes for the lin-
ear as well as quadratic time effect, with measurement error, and with exponential
serial correlation.
10.4 The Semi-Variogram
10.4.1 Introduction
In Section 10.3, a parametric approach was followed to study the serial cor-
relation function g(·) in linear mixed models. The empirical semi-variogram
is an alternative, nonparametric technique which does not require the fit-
ting of linear mixed models. It was first introduced by Diggle (1988) for the
case of random-intercepts models (i.e., linear mixed models where the only
random effects are intercepts). Later, Verbeke, Lesaffre and Brant (1998)
extended the technique to models which may also contain other random
effects, additional to the random intercepts (see also Verbeke 1995). In
Sections 10.4.2 and 10.4.3, the original empirical semi-variogram for the
random-intercepts model will be discussed and illustrated. Afterward, in
Sections 10.4.4 and 10.4.5, the extended version will be presented.
Assuming that the only random effects in the model are random intercepts,
we have that the marginal covariance matrix is given by
\[
V_i \;=\; \nu^2 J_{n_i} \;+\; \tau^2 H_i \;+\; \sigma^2 I_{n_i},
\]
where Jni is the (ni × ni ) matrix containing only ones and where ν 2 now
denotes the random-intercepts variance. This implies that the residuals rij
have constant variance ν 2 + σ 2 + τ 2 and that the correlation between any
two residuals rij and rik from the same subject i is given by
\[
\rho(|t_{ij}-t_{ik}|) \;=\; \frac{\nu^2 \;+\; \tau^2\, g(|t_{ij}-t_{ik}|)}{\nu^2 + \sigma^2 + \tau^2}.
\]
A stochastic process with mean zero, constant variance and a correlation
function which only depends on the time lag between the measurements is
often called (second-order) stationary (see for example Diggle 1990).
FIGURE 10.2. (a) The semi-variogram for a linear random-intercepts model con-
taining a component with exponential serial correlation. (b) The semi-variogram
for a linear random-intercepts model containing a component with Gaussian ser-
ial correlation. σ 2 , τ 2 , and ν 2 represent the variability of the measurement error,
the serial correlation component, and the random intercepts respectively.
correlations equal to one between the components of ε(2)i . So, the smaller
the value of φ, the stronger the serial correlation in the data.
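For later reference, the semi-variogram discussed below can be written explicitly. Under the random-intercepts model above, with serial correlation and measurement error, a standard calculation (see, e.g., Diggle, Liang, and Zeger 1994) gives, for u = |tij − tik| > 0,
\[
v(u) \;=\; \tfrac{1}{2}\,E\bigl[(r_{ij}-r_{ik})^{2}\bigr] \;=\; \sigma^{2} \;+\; \tau^{2}\bigl\{1 - g(u)\bigr\}.
\]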
In practice, the function v(u) is estimated by smoothing the scatter plot of the Σi ni(ni − 1)/2 half-squared differences vijk = (rij − rik)²/2 between pairs of residuals within subjects versus the corresponding time lags uijk = |tij − tik|. This estimate will be denoted by v̂(u) and is called the sample semi-variogram. Further, since
\[
\tfrac{1}{2}\,E\bigl[(r_{ij} - r_{kl})^{2}\bigr] \;=\; \sigma^2 + \tau^2 + \nu^2
\]
whenever i ≠ k, we estimate the total process variance by
\[
\widehat{\sigma^2 + \tau^2 + \nu^2} \;=\; \frac{1}{2N^{*}}\,\sum_{i\neq k}\,\sum_{j=1}^{n_i}\,\sum_{l=1}^{n_k}\,(r_{ij} - r_{kl})^{2},
\]
where N* is the number of terms in the sum. This estimate, together with v̂(u), can now be used for deciding which of the three stochastic components will be included in the model and for selecting an appropriate function g(u) in case serial correlation is to be included. The sample semi-variogram also provides initial values for τ², σ², and ν² when needed for the numerical maximization procedure. Finally, comparing v̂(u) with a fitted semi-variogram yields an informal check on the assumed covariance structure.
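A sketch of how the sample semi-variogram could be computed in SAS is given below, assuming the OLS residuals are stored in a data set olsres with variables id, time, and olsresid (all names are assumptions):

proc sql;
   create table vario as
   select a.id,
          abs(a.time - b.time)               as u,   /* time lag                */
          ((a.olsresid - b.olsresid)**2) / 2 as v    /* half-squared difference */
   from olsres as a, olsres as b
   where a.id = b.id and a.time < b.time;
quit;

proc loess data=vario;   /* smooth v versus u to obtain the sample semi-variogram */
   model v = u;
run;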
More details on this topic can be found in Chapter 5 in the book by Diggle,
Liang and Zeger (1994) in which the semi-variogram is discussed for several
special cases of the covariance structure (10.5) and the method is illustrated
in a covariance analysis for some real data sets.
FIGURE 10.3. Vorozole Study. Observed variogram (bullets with size proportional
to the number of pairs on which they are based) and fitted variogram (solid line).
The variogram is presented in Figure 10.3. Apart from the observed var-
iogram, a fitted version is presented as well. This is based on a linear
mixed-effects model with fixed effects: time, time×baseline, time2 , and
time2 ×baseline. The covariance structure includes a random intercept, a
Gaussian serial process, and residual measurement error.
where Airs denotes the (r, s) element of the matrix Ai . When all measure-
ments are taken at fixed time points, then only a small number of values uirs
can occur, which we denote by u0 , . . . , uM . Note that (10.6) can then be seen
as a multiple regression model with parameters σ 2 + τ 2 and gj = τ 2 g(uj ),
j = 0, . . . , M, and a scatter plot of the OLS estimates ĝ0, . . . , ĝM versus u0, . . . , uM then provides a nonparametric estimate of the serial covariance function τ²g(u).
Using simulations, Verbeke (1995) has shown that the above approach
yields estimates for the gj which are too unstable to be useful for the detection of residual serial correlation. This is caused by the large amount of scatter in the squared differences appearing in (10.6), but also by the high degree of multicollinearity
(Neter, Wasserman and Kutner, 1990, Section 8.5) in the approximate re-
gression model, induced by the linear interpolation. A classical method to
obtain more stable estimates in the presence of multicollinearity is to use
ridge regression, which allows a small bias in return for stability (Sen and
Srivastava 1990, Section 12.3, Neter, Wasserman and Kutner 1990, Section
1.7). The bias depends on the so-called shrinkage parameter c which is cho-
sen as small as possible (to reduce the bias) and such that the resulting esti-
mates indicate stability and satisfy approximately ĝ0 ≥ ĝ1 ≥ · · · ≥ ĝM = 0.
Note also that the choice of the matrices Ai is not unique. In the absence
of serial correlation, we have that the distribution of the squared differences in (10.6) does not depend on Ai. Hence, differences in parameter
estimates due to different transformations Ai then only reflect sampling
variability. This is no longer the case in the presence of serial correlation,
where different choices for Ai not only result in different responses used in
the final regression analysis but also yield different covariates. In that case,
the effect of choosing other transformations Ai is less obvious. However,
Verbeke, Lesaffre and Brant (1998) found empirically that different choices
for the matrices Ai yield slightly different nonparametric estimates for g(u),
but all these estimates lead to the same conclusion with respect to the
presence and type of serial correlation.
Further, the above check for serial correlation acts conditional on the ran-
dom effects included in the model (i.e., conditional on the covariates Zi ).
However, the resulting semi-variogram is invariant under general repara-
meterizations of the form Zi Gi . Also, if too many random effects have been
specified, the results will still be valid since the resulting transformation
matrices Ai satisfy Ai Zi∗ = 0 for any matrix Zi∗ consisting of columns of
the overspecified Zi . This again justifies favoring overspecified models Zi bi
for the detection of residual serial correlation rather than models which are
too restrictive (see also our discussion in Section 9.3).
The solid line in Figure 10.4 represents the estimate for τ 2 g(u), obtained
from the methods described in the previous section. It clearly suggests the
presence of serial correlation, which may be appropriately described by a Gaussian serial correlation function.
FIGURE 10.4. Prostate Data. Estimates for the residual serial covariance func-
tion τ 2 g(u). The solid line represents the estimate obtained from the extended
semi-variogram approach. The dashed line shows the estimated Gaussian serial
covariance function τ 2 exp(−φu2 ), where the parameter estimates of τ and φ are
the ones reported in Table 9.1.
In Section 10.3, the exponential serial correlation function was found to describe the prostate data slightly better than the Gaussian serial correlation function,
with very similar maximized likelihood values for both models. On the other
hand, the nonparametric variogram in Section 10.4.5 seems to suggest the
presence of serial correlation of the Gaussian type. This again illustrates
the fact that precise characterization of the serial correlation function g(·)
is extremely difficult in the presence of several random effects. This was
also the conclusion of Lesaffre, Asefa and Verbeke (1999) after the analysis
of longitudinal data from more than 1500 children. However, this does not
justify ignoring the possible presence of any serial correlation, since this
might result in less efficient model-based inferences (see example in Sec-
tion 9.5). Practical experience suggests that including serial correlation, if
present, is more important than correctly specifying the serial correlation
function. We therefore propose to use the procedures discussed in this chap-
ter for detecting whether any serial correlation is present, rather than for
specifying the actual shape of g(·), which seems to be of minor importance.
\[
\mathrm{Corr}\bigl(\widehat{d}_{11},\widehat{d}_{12},\widehat{d}_{22},\widehat{d}_{13},\widehat{d}_{23},\widehat{d}_{33},\widehat{\tau}^{2},1/\widehat{\phi},\widehat{\sigma}^{2}\bigr)
\]
\[
=\begin{pmatrix}
 1.00 & -0.87 &  0.62 &  0.70 & -0.49 &  0.39 & -0.18 & -0.10 & -0.00\\
-0.87 &  1.00 & -0.85 & -0.94 &  0.75 & -0.63 &  0.21 &  0.08 & -0.03\\
 0.62 & -0.85 &  1.00 &  0.88 & -0.97 &  0.91 & -0.46 & -0.29 &  0.02\\
 0.70 & -0.94 &  0.88 &  1.00 & -0.82 &  0.72 & -0.22 & -0.06 &  0.05\\
-0.49 &  0.75 & -0.97 & -0.82 &  1.00 & -0.97 &  0.51 &  0.33 & -0.02\\
 0.39 & -0.63 &  0.91 &  0.72 & -0.97 &  1.00 & -0.57 & -0.38 &  0.01\\
-0.18 &  0.21 & -0.46 & -0.22 &  0.51 & -0.57 &  1.00 &  0.81 &  0.04\\
-0.10 &  0.08 & -0.29 & -0.06 &  0.33 & -0.38 &  0.81 &  1.00 &  0.32\\
-0.00 & -0.03 &  0.02 &  0.05 & -0.02 &  0.01 &  0.04 &  0.32 &  1.00
\end{pmatrix}.
\]
We indeed get some relatively large correlations between τ̂² and the estimates of some of the parameters in D. Note also the small correlations between σ̂² and the other estimates, except for 1/φ̂, which is not completely unexpected since the Gaussian serial correlation component reduces to measurement error for φ becoming infinitely large.
11
Local Influence for the Linear Mixed
Model
11.1 Introduction
Many diagnostics have been developed for linear regression models. See,
for example, Cook and Weisberg (1982) and Chatterjee and Hadi (1988).
Since the linear mixed model can be seen as a concatenation of several
subject-specific regression models, it is most obvious to investigate how
these diagnostics (residual analysis, leverage, Cook’s distance, etc.) can be
generalized to the models considered in this book. Unfortunately, such a
generalization is far from obvious. First, several kinds of residuals could
be defined. For example, the marginal residual yi − Xi β reflects how a
specific profile deviates from the overall population mean and can there-
fore be interpreted as a residual. Alternatively, the subject-specific residual
yi − Xi β − Zi bi measures how much the observed values deviate from the
subject’s own predicted regression line. Finally, the estimated random ef-
fects bi can also be seen as residuals since they reflect how much specific
profiles deviate from the population average profile. Further, the linear
mixed model involves two kinds of covariates. The matrix Xi represents
the design matrix for the fixed effects, and Zi is the design matrix for the
random effects. Therefore, it is not clear how leverages should be defined,
partially because the matrices Xi and Zi are not necessarily of the same
dimension.
All these considerations suggest that an influence analysis for the linear
mixed model should not be based on the same diagnostic procedures as
ordinary least squares regression. DeGruttola, Ware and Louis (1987) de-
scribe measures of influence and leverage for a generalized three-step least
squares estimator for the regression coefficients in a class of multivariate
linear models for repeated measurements. However, their method does not
apply to maximum likelihood estimation, and it is also not clear how to
extend their diagnostics to the case of unequal covariance matrices Vi .
which clearly allows different weights for different subjects. The weight
vector ω is then N dimensional, and the original log-likelihood corresponds
to ω = ω0 = (1, 1, . . . , 1)′. Note also that the log-likelihood with the ith case completely removed corresponds to the vector ω with wi = 0 and wj = 1 for all j ≠ i.
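In the notation used here, the perturbed log-likelihood and the likelihood displacement take the standard forms of Cook (1986), restated for convenience:
\[
\ell(\theta \mid \omega) \;=\; \sum_{i=1}^{N} w_i\,\ell_i(\theta),
\qquad
LD(\omega) \;=\; 2\,\Bigl\{\ell\bigl(\widehat{\theta}\bigr) \;-\; \ell\bigl(\widehat{\theta}_{\omega}\bigr)\Bigr\},
\]
where θ̂ω denotes the maximum likelihood estimate of θ under the perturbed log-likelihood ℓ(θ | ω).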
This way, the variability of θ̂ is taken into account. LD(ω) will be large if ℓ(θ) is strongly curved at θ̂ (which means that θ is estimated with high precision) and LD(ω) will be small if ℓ(θ) is fairly flat at θ̂ (meaning that θ is estimated with high variability). From this perspective, a graph
of LD(ω) versus ω contains essential information on the influence of case-
weight perturbations. It is useful to view this graph as the geometric surface
formed by the values of the (r + 1)-dimensional vector
\[
\xi(\omega) \;=\; \begin{pmatrix} \omega \\ LD(\omega) \end{pmatrix}
\]
as ω varies throughout Ω. In differential geometry, a surface of this form
is frequently called a Monge patch. Following Cook (1986), we will refer
to ξ(ω) as an influence graph. It is a surface in IRr+1 and can be used to
assess the influence of varying ω through Ω. A graphical representation,
which also illustrates all further developments, is given in Figure 11.1.
FIGURE 11.1. Graphical representation of the influence graph ξ(ω) and of the local influence approach: the surface LD(ω) above the space Ω of weight vectors (ω1, . . . , ωr), with tangent plane T0 at ω0 and a perturbation in the direction ω0 + h.
Ideally, we would like a complete influence graph [i.e., a graph of ξ(ω) for
varying ω] to assess influence for a particular model and a particular data
set. However, this is only possible in cases where the number r of weights
in ω does not exceed 2. Hence, methods are needed for extracting the most
relevant information from an influence graph. One possible approach is local
influence, which uses normal curvatures of ξ(ω) in ω 0 . One proceeds as
follows. Let T0 be the tangent plane to ξ(ω) at ω 0 . Since LD(ω) attains its
minimum at ω 0 , we have that T0 is parallel to Ω ⊂ IRr . Each vector h in Ω,
of unit length, determines a plane that contains h and which is orthogonal
to T0 . The intersection, called a normal section, of this plane with the
surface is called a lifted line. It can be graphed by plotting LD(ω 0 + ah)
versus the univariate parameter a ∈ IR. The normal curvature of the lifted
line, denoted by Ch , is now defined as the curvature of the plane curve
(a, LD(ω 0 + ah)) at a = 0. It can be visualized by the inverse radius of
the best-fitting circle at a = 0. The curvature Ch is called the normal
curvature of the surface ξ(ω), at ω 0 , in the direction h. Large values of Ch
indicate sensitivity to the induced perturbations in the direction h. Ch is
called the local influence on the estimation of θ, of perturbing the model
corresponding to the log-likelihood (11.1), in the direction h.
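The normal curvature has a simple closed form (Cook 1986), presumably the expression referred to as (11.3):
\[
C_h \;=\; 2\,\bigl|\, h'\,\Delta'\,\ddot{L}^{-1}\,\Delta\, h \,\bigr|,
\]
where ∆ is the matrix whose ith column ∆i contains the derivatives ∂²ℓ(θ | ω)/∂ωi ∂θ evaluated at θ = θ̂ and ω = ω0, and L̈ is the matrix of second-order derivatives of ℓ(θ), also evaluated at θ = θ̂.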
There are several ways in which (11.3) can be used to study ξ(ω), each
corresponding to a specific choice of the unit vector h. One evident choice
corresponds to the perturbation of the ith weight only (case-weight pertur-
bation). This is obtained by taking h equal to the vector hi which contains
zeros everywhere except on the ith position, where there is a one. The
resulting local influence is then given by
\[
C_i \;\equiv\; C_{h_i} \;=\; 2\,\bigl|\,\Delta_i'\,\ddot{L}^{-1}\Delta_i\,\bigr|. \tag{11.4}
\]
The direction hmax is the direction for which the normal curvature is maxi-
mal. It shows how to perturb the postulated model (the model for ω = ω 0 )
to obtain the largest local changes in the likelihood displacement. If, for
example, the ith component of hmax is found to be relatively large, this in-
dicates that perturbations in the weight wi may lead to substantial changes
in the results of the analysis. On the other hand, denoting the s nonzero eigenvalues of −∆′L̈−1∆ by Cmax/2 ≡ λ1 ≥ · · · ≥ λs > 0 and the corresponding normalized orthogonal eigenvectors by {hmax ≡ v1, . . . , vs}, we have that
\[
C_i \;=\; 2\sum_{j=1}^{s} \lambda_j\, v_{ji}^{2}, \tag{11.5}
\]
where vji is the ith component of the vector vj . Hence, the local influence
Ci of perturbing the ith weight wi can be large without the ith component
of hmax being large, as long as some of the other eigenvectors of −∆′L̈−1∆
have large ith components. This will be further illustrated with an exam-
ple in Section 11.4. Therefore, it is not sufficient to calculate only hmax ;
all Ci should be computed as well. Cook (1986) proposes to inspect hmax ,
regardless of the size of Cmax , since it may highlight directions which are
simultaneously influential. However, since there is usually no analytic ex-
pression for hmax, it is very difficult to gain insight into the reasons
for such influences. This is not the case for the influence measures Ci . In
Section 11.3, it will be shown how expression (11.4) can be used to ascribe
local influence to interpretable components such as residuals.
The methods discussed above for the full parameter vector now directly
carry over to calculate local influences on the geometric surface defined by
LD1(ω). We now partition L̈ as
\[
\ddot{L} \;=\; \begin{pmatrix} \ddot{L}_{11} & \ddot{L}_{12}\\ \ddot{L}_{21} & \ddot{L}_{22} \end{pmatrix},
\]
according to the dimensions of θ1 and θ2. Cook (1986) has then shown that the local influence on the estimation of θ1, of perturbing the model in the direction of a normalized vector h, is given by
\[
C_h(\theta_1) \;=\; 2\,\left|\, h'\,\Delta'\!\left[\ddot{L}^{-1} - \begin{pmatrix} 0 & 0\\ 0 & \ddot{L}_{22}^{-1}\end{pmatrix}\right]\!\Delta\, h \,\right|. \tag{11.6}
\]
are either one or zero, we have that (Seber 1984, p. 526), for any vector v,
\[
0 \;\geq\; v'\begin{pmatrix} 0 & 0\\ 0 & \ddot{L}_{22}^{-1}\end{pmatrix} v \;\geq\; v'\,\ddot{L}^{-1}\, v,
\]
so that
\[
C_h(\theta_1) \;=\; -2\,h'\Delta'\ddot{L}^{-1}\Delta h \;+\; 2\,h'\Delta'\begin{pmatrix} 0 & 0\\ 0 & \ddot{L}_{22}^{-1}\end{pmatrix}\Delta h
\;=\; C_h \;+\; 2\,h'\Delta'\begin{pmatrix} 0 & 0\\ 0 & \ddot{L}_{22}^{-1}\end{pmatrix}\Delta h
\;\leq\; C_h. \tag{11.7}
\]
This means that the normal curvature for θ 1 , in the direction h, can never
be larger than the normal curvature for θ in that same direction.
Note also that it immediately follows from (11.6) that, for L̈12 = 0,
\[
C_h \;=\; -2\,h'\Delta'\begin{pmatrix} \ddot{L}_{11}^{-1} & 0\\ 0 & 0\end{pmatrix}\Delta h
\;-\; 2\,h'\Delta'\begin{pmatrix} 0 & 0\\ 0 & \ddot{L}_{22}^{-1}\end{pmatrix}\Delta h
\;=\; C_h(\theta_1) + C_h(\theta_2).
\]
Hence, we have that for any direction h, the normal curvature for θ in
that direction is then the sum of the normal curvatures for θ 1 and θ 2 in
the same direction. Intuitively, this can be explained as follows. It follows
from the classical maximum likelihood theory that, for sufficiently large samples, and under the correct model specification, θ̂ is asymptotically normally distributed with covariance matrix −L̈−1. So, L̈12 = 0 means that θ̂1 and θ̂2 are statistically independent. It is then not surprising that Ch = Ch(θ1) + Ch(θ2) since this expresses the fact that the influence for θ1 is independent of the influence for θ2.
Finally, there are again many possible choices for the vector h. For example,
the local influence of perturbing the ith weight on the estimation of θ 1 is
obtained for h = hi , the vector with zeros everywhere except on the ith
position where there is a one. The corresponding normal curvature will be
denoted by Ci (θ 1 ).
11.3 The Detection of Influential Subjects
We will now show how the local influence approach, introduced in Sec-
tion 11.2, can be applied to detect subjects which are locally influential for
the fitting of a specific linear mixed model. We hereby restrict the discus-
sion to models which assume conditional independence; that is, models of
the form (3.8) where all residual covariance matrices Σi are equal to σ 2 Ini .
Our perturbed log-likelihood is defined in (11.2), which allows different
subjects to have different weights in the log-likelihood function. The vector
∆i now equals the s-dimensional vector of first-order derivatives of ℓi(θ), with respect to all components of θ and evaluated at θ = θ̂. Note that
the calculation of ∆i usually only requires little additional computational
effort, since the first-order derivative of the log-likelihood is needed in the
iterative Newton-Raphson estimation procedure.
See also Pregibon (1979) (Chapter 5) and Cook and Weisberg (1982) (Chap-
ter 5) for more details. It now follows from (11.8) that
\[
\Delta_i \;=\; -\sum_{j\neq i}\Delta_j \;=\; \ddot{L}_{(i)}\bigl(\widehat{\theta}\bigr)\,\bigl(\widehat{\theta}_{(i)} - \widehat{\theta}\bigr), \tag{11.9}
\]
Further, we have that
\[
\frac{\partial^{2} \ell(\theta)}{\partial\beta\,\partial\alpha_k}
\;=\; -\sum_{i=1}^{N} X_i'\,V_i^{-1}\,\frac{\partial V_i}{\partial\alpha_k}\,V_i^{-1}\,(y_i - X_i\beta),
\]
which has expectation zero, implying that the maximum likelihood estimates for the fixed effects
and for the variance components are asymptotically independent (see Ver-
beke and Lesaffre 1996b, 1997a, for technical details). It now follows from
Section 11.2 that, for N sufficiently large,
Lesaffre and Verbeke (1998) have shown that this also implies that Ci (β)
can be decomposed using only the first two components, Xi′Xi and Ri, whereas only the last three components, Zi′Zi, I − RiRi′, and Vi−1, are needed in the decomposition of Ci(α). Hence, for sufficiently large data
sets, influence for the fixed effects and for the variance components can
be further investigated by studying the first two and the last three in-
terpretable components, respectively. This will also be illustrated in our
example in Section 11.4.
11.4 Example: The Prostate Data
FIGURE 11.2. Prostate Data. (a) Plot of total local influences Ci versus the
identification numbers of the individuals in the BLSA data set. (b) Scatter plot
of the local influence measures Ci (β) and Ci (α) for the fixed effects and the
variance components, respectively. The most influential subjects are indicated by
their identification number.
Figure 11.2(a) is an index plot of the total local influence Ci . The cutoff
value used for Ci equals 2 Σi Ci/N = 1.98 and has been indicated in the
figure by the dashed line. Participants #15, #22, #23, #28, and #39 are
found to have a Ci value larger than 1.98 and are therefore considered to be
relatively influential for the estimation of the complete parameter vector θ.
Their observed and expected profiles are shown in Figure 11.3. Pearson et
al . (1994) report that subjects #22, #28, and #39, who were classified as
local/regional cancer cases, were probably misclassified metastatic cancer
cases. It is therefore reassuring that this influence approach flagged these
three cases as being special. Subjects #15 and #23 were already in the
metastatic cancer group. In Figure 11.2(b), a scatter plot of Ci (α) versus
Ci(β) is given. Their respective cutoff values are 2 Σi Ci(α)/N = 1.10 and 2 Σi Ci(β)/N = 0.99. Obviously, subject #28, who is the subject with
the largest Ci value, is highly influential for both the fixed effects and the
variance components. Individuals #15 and #39 are also influential for both
parts of the model, but to a much lesser extent. Finally, we have that subject
#23 is influential only for the fixed effects β and that, except for subject
FIGURE 11.3. Prostate Data. Observed (dashed lines) and fitted (solid lines) pro-
files for the five most locally influential subjects. All subjects are metastatic cancer
cases, but individuals #22, #28, and #39 were wrongly classified as local/regional
cancer cases.
#28, subject #22 has the highest influence for the variance components α,
but is not influential for β.
Figure 11.4 shows an index plot of each of the five interpretable compo-
nents in the decomposition of the local influence measures Ci , Ci (β), and
Ci (α), as well as of the number ni of PSA measurements available for each
subject. These can now be used to ascribe the influence of the influential
subjects to their specific characteristics. As an example, we will illustrate
this for subject #22, which has been circled in Figure 11.4. As indicated by
Figure 11.2, this subject is highly influential, but only for the estimation
of the variance components. If approximation (11.12) is sufficiently accu-
rate, this influence for α can be ascribed to the last three interpretable
components only [i.e., the components plotted in the panels (c), (d), and
(e) of Figure 11.4]. Hence, although the residual component for the mean
structure is the largest for subject #22 [Figure 11.4(b)], it is not the cause
of the highly influential character of this subject for the estimation of the
variance components, nor did it cause a large influence on the estimation
of the fixed effects in the model. Note instead how this subject also has the
largest residual for the covariance structure, suggesting that the covari-
ance matrix is poorly predicted by the model-based covariance V22 . This is
also illustrated in Figure 11.3. Obviously, the large residual for the mean
was caused by the poor prediction around the time of diagnosis, but this
FIGURE 11.4. Prostate Data. Index plots of the five interpretable components in the decomposition of the total local influence Ci [(a) Xi′Xi, (b) Ri, (c) Zi′Zi, (d) I − RiRi′, (e) Vi−1], and (f) an index plot of the number ni of repeated measurements for each subject.
was not sufficient to make subject #22 influential for the estimation of
the complete average profile. Further, a close look at the estimated covari-
ance matrix V22 shows that only positive correlations are assumed between
the repeated measurements, whereas the positive and negative residuals in
Figure 11.3 suggest some negative correlations.
FIGURE 11.5. Prostate Data. Comparison of the likelihood displacement LDi with
the total local curvature Ci . The individuals are numbered by their identification
number.
where θ(i) denotes the maximum likelihood estimate for θ after deletion
of subject i. From this picture we can observe that for the data set at
hand, both approaches highlight the same set of influential observations.
However, they do not agree on the ranking of the observations according
to their influence.
On the other hand, the index plot in Figure 11.6 offers an extra diagnostic
tool. The components of hmax have a positive or a negative sign. From
this plot, one can observe that case #15 has a different sign than individ-
uals #28, #39, and #45. The impression is given that individual #15 is
counterbalancing the effect of these other cases. For the three cases with
a negative component of hmax , the observed profiles (see also Figure 11.3)
are completely above and much steeper than the predicted profiles, whereas
for individual #15 the observed response intersects its prediction somewhat
halfway through the observation period and is also much less steep. More de-
tailed similar information could be obtained from hmax by deriving it for the
fixed effects and the variance components separately. On the other hand,
since no analytic expression for hmax is available, using hmax as a diagnos-
tic tool does not yield much insight into the reasons why some individuals
are more influential than others.
Our results from Section 11.2 and Section 11.3 assume that the parameters
in the marginal linear mixed model are estimated via maximum likelihood.
An influence analysis for the REML estimates would also be useful. How-
ever, it follows from expression (5.8) that the REML log-likelihood function
can no longer be seen as a sum of independent individual contributions and,
therefore, it is not obvious how a perturbation scheme, similar to (11.2),
should be defined and interpreted. One approach would be to replace ℓ(θ|ω) in (11.2) by
\[
\ell_{\mathrm{REML}}(\theta \mid \omega) \;=\; -\,\frac{1}{2}\,\ln\left|\,\sum_{i=1}^{N} w_i\, X_i' V_i^{-1} X_i \,\right| \;+\; \ell(\theta \mid \omega).
\]
The theory of local influence can then also be applied to this new perturba-
tion scheme, but no longer results in simple expressions for the curvature
Ci ; hence, it becomes much more complicated to ascribe influence to the
specific characteristics of the influential subjects. Therefore, we only con-
sidered here the maximum likelihood situation.
12
The Heterogeneity Model
12.1 Introduction
However, although this approach looks very appealing, it raises many prob-
lems with respect to the normality assumption for the random effects, which
is automatically made by the linear mixed-effects model. For example, it
follows from the results in Section 6.2.3 that the mean quadratic time ef-
fect is zero for the noncancer cases and positive for both cancer groups.
Hence, the quadratic effects β4 + b3i in model (12.1) should follow a nor-
mal distribution with mean zero for the noncancer cases and with positive
mean for the cancer cases. This means that the b3i are no longer normally
distributed, but follow a mixture of two normal distributions, i.e.,
b3i ∼ p N(µ1, σ1²) + (1 − p) N(µ2, σ2²), in which µ1, µ2 and σ1², σ2² denote the means and variances of the b3i in the
noncancer and cancer groups, respectively, and where p is the proportion
of patients in the data set which belong to the noncancer group. Similar
arguments hold for the random intercepts b1i and for the random time
slopes b2i , which even may be sampled from mixtures of more than two
normal distributions.
It was shown in Section 7.8.2 that for the detection of subgroups in the
random-effects population or for the classification of subjects in such sub-
groups, one should definitely not use empirical Bayes estimates obtained
under the normality assumption for the random effects. In this chapter,
it will be shown that the heterogeneity model is a natural model for clas-
sifying longitudinal profiles. In Sections 12.2 and 12.3, the heterogeneity
model will be defined in full detail, and it will be described how the so-
called Expectation-Maximization (EM) algorithm can be applied to obtain
maximum likelihood estimates for all the parameters in the corresponding
marginal model. In Section 12.4, it will be briefly discussed how longitudi-
nal profiles can be classified based on the heterogeneity model. As already
mentioned in Section 7.8.4, testing for the number of components in a het-
erogeneity model is far from straightforward due to boundary problems
which make classical likelihood results break down. We will therefore dis-
cuss in Section 12.5 some simple informal checks for the goodness-of-fit of
heterogeneity models. Finally, two examples will be given in Sections 12.6
and 12.7. Another example can be found in Brant and Verbeke (1997a,
1997b).
\[
b_i \;\sim\; \sum_{j=1}^{g} p_j\, N(\mu_j,\,D_j), \tag{12.2}
\]
with p1 + · · · + pg = 1. We now define zij = 1 if bi is sampled from the jth component in the mixture, and 0 otherwise, j = 1, . . . , g. We then have that P(zij = 1) = E(zij) = pj and that
\[
E(b_i) \;=\; E\bigl(E(b_i \mid z_{i1},\ldots,z_{ig})\bigr) \;=\; E\Bigl(\sum_{j=1}^{g}\mu_j z_{ij}\Bigr) \;=\; \sum_{j=1}^{g} p_j\,\mu_j.
\]
Therefore, the additional constraint Σj pjµj = 0 is needed to assure that E(yi) = Xiβ. Further, we have that the overall covariance matrix of the bi is given by
\[
D^{*} \;=\; \sum_{j=1}^{g} p_j\,\mu_j\mu_j' \;+\; \sum_{j=1}^{g} p_j\,D_j. \tag{12.3}
\]
The density function of the random effects bi is then the mixture density
\[
\sum_{j=1}^{g} p_j\,(2\pi)^{-q/2}\,|D_j|^{-1/2}
\exp\Bigl\{-\tfrac{1}{2}\,(b_i-\mu_j)'\,D_j^{-1}\,(b_i-\mu_j)\Bigr\}. \tag{12.4}
\]
Note also that since the random effects are assumed to follow a mixture of
distributions of the same parametric family, the vector of all parameters
174 12. The Heterogeneity Model
where y = (y1 , . . . , yN ) is the vector containing all observed response
values.
Let zij be as defined in Section 12.2. The prior probability for an individual
to belong to component j is then P (zij = 1) = pj , the mixture proportion
for that component. The log-likelihood function for the observed measure-
ments y and for the vector z of all unobserved zij is then
\[
\ell(\theta \mid y, z) \;=\; \sum_{i=1}^{N}\sum_{j=1}^{g} z_{ij}\,\bigl\{\ln p_j \;+\; \ln f_{ij}(y_i \mid \gamma)\bigr\},
\]
More specifically, let θ (t) be the current estimate for θ, and θ (t+1) stands
for the updated estimate, obtained from one further iteration in the EM
algorithm. We then have the following E and M steps in the estimation
process for the heterogeneity model.
where only the posterior probability for the ith individual to belong to the jth component of the mixture is needed. It is given by
\[
p_{ij}\bigl(\theta^{(t)}\bigr) \;=\; E\bigl(z_{ij}\mid y_i,\theta^{(t)}\bigr) \;=\; P\bigl(z_{ij}=1\mid y_i,\theta^{(t)}\bigr)
\;=\; \left.\frac{p_j\,f_{ij}(y_i\mid\gamma)}{\sum_{k=1}^{g} p_k\,f_{ik}(y_i\mid\gamma)}\right|_{\widehat{\pi}^{(t)},\,\widehat{\gamma}^{(t)}},
\]
from which it follows that all estimates pj(t+1) satisfy
\[
p_j^{(t+1)} \;=\; \frac{1}{N}\sum_{i=1}^{N} p_{ij}\bigl(\theta^{(t)}\bigr).
\]
Once all parameters θ in the marginal heterogeneity model have been esti-
mated, one might be interested in estimating the random effects bi also. To
this end, empirical Bayes (EB) estimates can be calculated, in exactly the
same way as EB estimates were obtained under the classical linear mixed
model (see Section 7.2). The posterior density of bi is given by
\[
f_i(b_i \mid y_i, \theta) \;=\; \sum_{j=1}^{g} p_{ij}(\theta)\, f_{ij}(b_i \mid y_i, \gamma),
\]
Note how the first component of bi is of exactly the same form as the esti-
mator (7.2) obtained in Section 7.2, assuming normally distributed random
effects. However, the overall covariance matrix of the bi is now replaced by
the within-component covariance matrix D. The second component in the
expression for bi can be viewed as a correction term toward the component
means µj , with most weight on those means, which correspond to compo-
nents for which the subject has a high posterior probability of belonging.
Finally, the unknown parameters in (12.10) are replaced by their maximum
likelihood estimates obtained from the EM algorithm. These are the EB
estimates shown in Figure 7.6 for the data simulated in Section 7.8.2 and
obtained under a two-component heterogeneity model (i.e., g equals 2).
Interest could also lie in the classification of the subjects into the different
mixture components. It is natural in mixture models for such a classification
to be based on the estimated posterior probabilities pij (θ) (McLachlan
and Basford 1988, Section 1.4). One then classifies the ith subject into the
component for which it has the highest estimated posterior probability to
belong to, that is, to the j(i)th component, where j(i) is the index for which pi,j(i)(θ̂) = max1≤j≤g pij(θ̂). Note how this technique can be used
for cluster analysis within the framework of linear mixed-effects models: If
the individual profiles are to be classified into g subgroups, fit a mixture
model with g components and use the above rule for classification in either
one of the g clusters.
So far, we have not discussed yet how the number g of components in the
heterogeneity model should be chosen. One approach is to fit models with
increasing numbers of components and to compare them using likelihood
ratio tests. However, as explained in Section 7.8.4, this is far from straight-
forward, due to boundary problems. Also, acceptance of the null hypothesis
does not automatically yield a good-fitting model since it was tested against
only one specific alternative hypothesis. An alternative approach is to in-
crease g to the level where some of the subpopulations get very small weight
(some pj very small) or where some of the subpopulations coincide (some
µj approximately the same). Finally, Verbeke and Lesaffre (1996a) pro-
posed some omnibus goodness-of-fit checks for the marginal heterogeneity
model (12.6), which can be employed to determine the number of compo-
nents in the heterogeneity model, but which are also useful for evaluating
the final model. These will now be discussed.
Suppose we want to test whether model (12.6) fits our data well, for some
specific value of g (possibly 1). The most well-known goodness-of-fit tests
are derived for univariate random variables, but, here, all observed vec-
tors yi possibly have different distributions or different dimensions. Thus,
strictly speaking, these tests are not applicable here. However, this prob-
lem can be circumvented by considering the stochastic variables Fi (yi ), for
Fi equal to the cumulative distribution function of Yi , under the assumed
model. If the assumed model is correct, we have that all Fi (yi ) can be
considered sampled from a uniform distribution with support [0, 1]. On the
other hand, we have that the computation of Fi (yi ) involves the evalua-
tion of multivariate normal distribution functions of dimensions ni , which
may therefore be practically unfeasible for data sets with large numbers of
repeated measurements for some of the individuals. We therefore propose
to first summarize each vector yi by some linear combination ai′yi, and then to calculate Fi(ai′yi), where Fi is now the distribution function of ai′Yi under the assumed model. We then have that, under model (12.6), the stochastic variables
\[
U_i \;=\; F_i(a_i'Y_i) \;=\; \sum_{j=1}^{g} p_j\,
\Phi\!\left(\frac{a_i'\,\bigl(Y_i - X_i\beta - Z_i\mu_j\bigr)}{\sqrt{a_i'\,V_i\,a_i}}\right), \tag{12.11}
\]
can be considered a sample from the uniform distribution on [0, 1].
Two procedures can now be followed. First, we can apply the Kolmogorov-
Smirnov test (see, for example, Birnbaum 1952, Bickel and Doksum 1977,
pp. 378-381) to test whether the observed Ui , calculated by replacing Yi by
yi, and all parameters in (12.11) by their maximum likelihood estimates, can be considered a sample from the uniform distribution on [0, 1].
Of course, the above goodness-of-fit tests can be performed for any ai , but
a good choice of the linear combination may increase the power of the test.
Here, we are specifically interested in exploring whether the number g of
components in our heterogeneity model has been taken sufficiently large.
Note how this testing for heterogeneity can be viewed as an attempt to
break down the total random-effects variability into the within-subgroup
variability represented by D, and the between-subgroup variability repre-
sented by the component means µ1 , . . . , µg . However, it is intuitively clear
that this will only be successful when the residual variability in the model,
represented by the error terms εi , is small to moderate, in comparison to
the random-effects variability in which we are interested. We therefore rec-
ommend choosing ai such that the variability in ai yi due to the random
effects is large compared to the variability due to the error terms. Specifi-
cally, we choose ai such that
\[
\frac{\mathrm{var}(a_i' Z_i b_i)}{\mathrm{var}(a_i' \varepsilon_i)}
\;=\; \frac{a_i'\, Z_i D^{*} Z_i'\, a_i}{a_i'\, \Sigma_i\, a_i}
\]
is maximal.
TABLE 12.1. Prostate Data. Parameter estimates for all parameters in model
(12.1), under a one-, two- and three-component heterogeneity model, and based
on the observed data from cancer patients and control patients only.
Two-component model:
   Component means (intercept, time, time²): (−0.2373, 0.0260, 0.0012), p = 0.79
                                             (0.1790, −0.0439, 0.0105), p = 0.21
   Common covariance matrix D:
      0.0497   0.0089  −0.0005
      0.0089   0.0017  −0.0001
     −0.0005  −0.0001   0.00002
   Additional variance components: 0.013, 0.026

Three-component model:
   Component means (intercept, time, time²): (−0.0202, 0.0124, 0.0012), p = 0.72
                                             (0.5110, −0.0088, 0.0045), p = 0.19
                                             (0.2167, −0.0288, 0.0207), p = 0.09
   Common covariance matrix D:
      0.0306   0.0082  −0.0003
      0.0082   0.0023  −0.0001
     −0.0003  −0.0001   0.00001
   Additional variance components: 0.009, 0.027
TABLE 12.2. Prostate Data. Goodness-of-fit checks for model (12.1), under
a one-, two- and three-component heterogeneity model, and based on the ob-
served data from cancer patients and control patients only. The reported Kol-
mogorov-Smirnov statistics are to be compared with the 5% critical value
Dc = 0.2274. The Shapiro-Wilk statistics are accompanied by their p-values.
Kolmogorov-Smirnov Shapiro-Wilk
1 component 0.2358 (> Dc ) 0.824 (p = 0.0001)
2 components 0.2113 (< Dc ) 0.848 (p = 0.0001)
3 components 0.1061 (< Dc ) 0.940 (p = 0.0797)
The fixed-effects vector now consists of two parts: β1 measures the effect
of age on PSA, whereas ν = (β2 , β3 , β4 ) reflects the overall average trend
over time, after correction for age differences at entry into the study.
So, ν contains the overall mean intercept β2, mean slope β3 for time, and
mean slope β4 for time², which are estimated by the homogeneity model
as 0.1694, 0.0085, and 0.0034, respectively. The component means reported
in Table 12.1 are the average trends within each component of the mixture
(i.e., ν + µj , j = 1, . . . , g).
This model will be useful for the early detection of prostate cancer only if
one or two of our components can be shown to represent the cancer cases,
with the remaining component(s) representing the controls. We therefore
compare classification by our mixture approach with
the correct classification as control or cancer. The result is shown in Ta-
ble 12.3. Except for one patient, all controls were classified in the first com-
ponent, together with 10 cancer cases for which the profiles show hardly
any difference from many profiles in the control group (only a moderate,
linear increase over time). Three cancer cases were classified in the third
component. These are those cases which have entered the study almost
simultaneously with the start of the growth of the tumor. The five can-
cer cases, classified in the second component, are those who were already
in the study long before the tumor started to develop and therefore have
profiles which hardly change in the beginning, but which start increasing
quadratically after some period of time in the study.
                             Mixture classification
                               1       2       3
Disease status   control     15       1       0
                 cancer      10       5       3
This example has shown that the mixture approach does not necessarily model what
one might hope. There is no a priori reason why the mixture classification
should exactly correspond to some predefined group structure, which may
not fully reflect the heterogeneity in growth curves.
A linear mixed model obtained from a two-stage approach (see Section 3.2)
assumes the average evolution within each group to be linear as a function
of age and allows for subject-specific intercepts as well as slopes. More
specifically, our model is given by
$$
\text{Height}_{ij} \;=\;
\begin{cases}
\beta_1 + b_{1i} + (\beta_4 + b_{2i})\,\text{Age}_{ij} + \varepsilon_{ij}, & \text{if small mother} \\[2pt]
\beta_2 + b_{1i} + (\beta_5 + b_{2i})\,\text{Age}_{ij} + \varepsilon_{ij}, & \text{if medium mother} \\[2pt]
\beta_3 + b_{1i} + (\beta_6 + b_{2i})\,\text{Age}_{ij} + \varepsilon_{ij}, & \text{if tall mother,}
\end{cases}
$$
where Heightij and Ageij are the height and the age of the ith girl at the
jth measurement, respectively. The model can easily be rewritten as
$$
\text{Height}_{ij} \;=\; (\beta_1\,\text{Small}_i + \beta_2\,\text{Medium}_i + \beta_3\,\text{Tall}_i + b_{1i})
+ (\beta_4\,\text{Small}_i + \beta_5\,\text{Medium}_i + \beta_6\,\text{Tall}_i + b_{2i})\,\text{Age}_{ij}
+ \varepsilon_{ij}, \qquad (12.12)
$$
where Smalli , Mediumi , and Talli are dummy variables defined to be 1 if the
mother of the ith girl is small, medium, or tall, respectively, and defined to
be 0 otherwise. So, β1 , β2 , and β3 represent the average intercepts and β4 ,
β5 , and β6 the average slopes in the three groups. The terms b1i and b2i are
the random intercepts and random slopes, respectively. REML estimates
for all parameters in this model are given in Table 12.4, obtained assuming
conditional independence; that is, assuming that all error components εij
are independent with common variance σ 2 .
All of the above analyses are based on the somewhat arbitrary discretization
of the heights of the mothers into three different intervals (small, medium,
and tall mothers). It is therefore interesting to see how heterogeneity models
would classify the children into two, three, or even more groups, ignoring
this prior classification.
TABLE 12.5. Heights of Schoolgirls. Parameter estimates for the two- and
three-component heterogeneity models.

                      Component mean      p_j        D                 σ²
Two components        (82.78, 5.39)'      0.68    ( 6.73  0.10 )      0.47
                      (82.06, 6.42)'      0.32    ( 0.10  0.03 )

Three components      (79.46, 5.60)'      0.20    ( 3.64  0.32 )      0.47
                      (84.21, 5.32)'      0.50    ( 0.32  0.03 )
                      (81.65, 6.47)'      0.30

The heterogeneity model fitted here assumes that
$$
\text{Height}_{ij} \;=\; \beta_1 + b_{1i} + (\beta_2 + b_{2i})\,\text{Age}_{ij} + \varepsilon_{ij},
$$
where β1 and β2 denote the overall average intercept and linear age effect,
respectively. As before, we will assume all error components εij to be in-
dependent and normally distributed with mean zero and common variance
σ2 .
Three mixture models were fitted: the homogeneous model (one compo-
nent) and two heterogeneous models (two and three components). For the
heterogeneity models, the girls were classified in either one of the mixture
components. The parameter estimates and classification rules are summa-
rized in Table 12.5 and Figure 12.2, respectively. The reported component
means are the average growth trends within each component of the mixture
(i.e., β + µj , j = 1, . . . , g).
First, it follows from the fit of the homogeneity model that the average
intercept and slope are estimated to be 82.48 and 5.72, respectively. Us-
ing the goodness-of-fit procedures discussed in Section 12.5, we found that
this homogeneity model fits the data sufficiently well; that is, no statistical
evidence for any lack of fit was found.
FIGURE 12.2 (classification rules): the 20 girls are first split into 14 “slow”
growers (girls 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 18) and 6 “fast” growers
(girls 9, 15, 16, 17, 19, 20); the “slow” growers are further split into 3 “small”
girls (2, 6, 13) and 11 “tall” girls (1, 3, 4, 5, 7, 8, 10, 11, 12, 14, 18).
Although the separation of the children with tall mothers from the rest is
achieved fairly well, the children
with small mothers could not be well separated from those with medium
mothers. However, this is in agreement with results obtained from analy-
ses based on the linear mixed model (12.12), the parameter estimates for
which are given in Table 12.4. Indeed, applying the F -tests described in
Section 6.2.2 (with the Satterthwaite approximation for the denominator
degrees of freedom), we found that the average slopes β4 , β5 , and β6 are
significantly different (p = 0.0019), but this can be fully ascribed to differ-
ences between the groups of children with small or medium mothers on the
one hand and the group of children with tall mothers on the other hand:
β4 and β6 are significantly different (p = 0.0007), also β5 and β6 are signif-
icantly different (p = 0.0081), but β4 is not significantly different from β5
(p = 0.2259). Since our mixture approach only partially reconstructs the
prior group structure of Goldstein (1979), we conclude that the latter does
not fully reflect the heterogeneity in the growth curves.
Finally, we note that, under the three-component model, the overall average
trend is given by
$$
0.20\begin{pmatrix}79.46\\5.60\end{pmatrix}
+ 0.50\begin{pmatrix}84.21\\5.32\end{pmatrix}
+ 0.30\begin{pmatrix}81.65\\6.47\end{pmatrix}
\;=\;
\begin{pmatrix}82.49\\5.72\end{pmatrix},
$$
which is very similar to the overall average trend, estimated under the
homogeneity model. Further, we also have that the overall random-effects
covariance matrix, D plus the between-component covariance Σ_j p_j µ_j µ_j', is
given by
$$
\begin{pmatrix}3.64 & 0.32\\0.32 & 0.03\end{pmatrix}
+ 0.20\begin{pmatrix}-3.03\\-0.12\end{pmatrix}\begin{pmatrix}-3.03 & -0.12\end{pmatrix}
+ 0.50\begin{pmatrix}1.72\\-0.40\end{pmatrix}\begin{pmatrix}1.72 & -0.40\end{pmatrix}
+ 0.30\begin{pmatrix}-0.84\\0.75\end{pmatrix}\begin{pmatrix}-0.84 & 0.75\end{pmatrix}
\;=\;
\begin{pmatrix}7.17 & -0.14\\-0.14 & 0.28\end{pmatrix}.
$$

13
Conditional Linear Mixed Models
13.1 Introduction
As pointed out by Diggle, Liang and Zeger (1994, Section 1.4) and as shown
in the examples so far presented in this book, the main advantage of longi-
tudinal studies, when compared to cross-sectional studies, is that they can
distinguish changes over time within individuals (longitudinal effects) from
differences among people in their baseline values (cross-sectional effects).
men over age 50, the control group was significantly younger on average
than the BPH cases, at first visit as well as at the time of diagnosis. Brant
et al. (1992) analyzed repeated measures of systolic blood pressure from
955 healthy males. Their models included cross-sectional effects for age
at first visit (linear as well as quadratic effect), obesity, and birth cohort.
In a non-linear context, Diggle, Liang and Zeger (1994, Section 9.3) used
longitudinal data on 250 children to investigate the evolution of the risk for
respiratory infection and its relation to vitamin A deficiency. They adjusted
for factors like gender, season, and age at entry in the study.
in which tij is the time point (in decades from entry in the study) at which
the jth measurement is taken for the ith subject, and where Agei1 is the
age (in decades) of the subject at the time of entry in the study. Pearson
et al. (1995) found evidence for the presence of a learning effect from the
first visit to subsequent visits. This is taken into account by the extra time-
varying covariate Visit1ij , defined to be one at the first measurement and
zero for all other visits. Finally, the b1i are random intercepts, and the b2i
are random slopes for time. As before, the ε(1)ij are measurement error
components. Table 13.1 shows the ML estimates and associated standard
errors for all parameters in the marginal model corresponding to the model
(13.1).

TABLE 13.1. Hearing Data. ML estimates (standard errors) for the parameters in
the marginal linear mixed model corresponding to (13.1), with and without inclu-
sion of a cross-sectional quadratic effect of age, for the original data (∆ = 0) as
well as for contaminated data (∆ = −10). The last column contains ML estimates
(standard errors) obtained from the conditional linear mixed model approach.
Figure 13.1 compares the estimates of the average longitudinal effects under
both models. The estimates under the correct model are independent of
the degree ∆ of contamination. Under the incorrect model however, the
obtained estimates clearly depend on ∆, and for β5 and β6, they differ up
to one standard deviation from the estimates obtained under the correct
model.

FIGURE 13.1. Hearing Data. ML estimates for the average longitudinal effects
under correct (long dashes) and incorrect (solid) cross-sectional models, as well
as for the conditional linear mixed model (short dashes), for several degrees of
contamination (∆). The vertical lines represent one estimated standard deviation
under the incorrect model. The bold vertical line corresponds to the original data
(∆ = 0).

The parameter estimates under the correct and under the incorrect
model, for the case ∆ = −10, are also given in Table 13.1. Under the
correct model, we get exactly the same estimates as for the original data,
except for the cross-sectional quadratic age effect, for which the difference
is exactly ∆ = −10. As noticed earlier, the omission of a cross-sectional
covariate inflates the random-intercepts variability, but this is now much
more pronounced than earlier. Note also that deleting the quadratic age
effect for the contaminated data changes the estimated correlation between
the random intercepts and random slopes from 0.2013 (p = 0.0625, LR test)
to −0.2490 (p = 0.0206, LR test).
For ∆ = −10, we also calculated the empirical Bayes estimates (EB) (see
Section 7.2) for the random slopes b2i under the correct and under the
incorrect linear mixed model, as well as under the conditional linear mixed
model discussed below; pairwise scatter plots of these estimates are shown
in Figure 13.2.
FIGURE 13.2. Hearing Data. Pairwise scatter plots of the empirical Bayes esti-
mates for the random slopes b2i obtained under the correct linear mixed model, the
incorrect linear mixed model, and the conditional linear mixed model. All plots are
based on contaminated data (∆ = −10), and the Pearson correlation coefficient
is denoted by r.
13.3 Conditional Linear Mixed Models

The results of Section 13.2 illustrate the need for statistical methodology
which allows for the study of longitudinal trends in observational data,
without having to specify any cross-sectional effects. Verbeke et al. (1999)
and Verbeke, Spiessens and Lesaffre (2000) propose the use of so-called
conditional linear mixed models. In order to simplify notation, we will
restrict ourselves to discussing this approach in the context of model (13.1) for the
hearing data, rather than in full generality, and we refer to the above-
mentioned papers for more details.
Conditional linear mixed models now proceed in two steps. In a first step,
we condition on sufficient statistics for the nuisance parameters b∗i . In a
second step, maximum likelihood or restricted maximum likelihood esti-
mation is used to estimate the remaining parameters in the conditional
distribution of the Yi given these sufficient statistics.
$$
f_i\!\left(y_i \,\middle|\, 1_{n_i}'y_i,\, b_i^*, b_i\right)
\;=\; \frac{f_i(y_i \mid b_i^*, b_i)}{f_i(1_{n_i}'y_i \mid b_i^*, b_i)}
\;=\; \left(2\pi\sigma^2\right)^{-(n_i-1)/2}\sqrt{n_i}\,
\exp\!\left\{-\frac{1}{2\sigma^2}\,(y_i - X_i\beta - Z_i b_i)'
\left[I_{n_i} - 1_{n_i}\!\left(1_{n_i}'1_{n_i}\right)^{-1}\!1_{n_i}'\right]
(y_i - X_i\beta - Z_i b_i)\right\}. \qquad (13.4)
$$
It now follows directly from some matrix algebra (Seber 1984, property
B3.5, p. 536) that (13.4) is proportional to
$$
\left(2\pi\sigma^2\right)^{-(n_i-1)/2}
\exp\!\left\{-\frac{1}{2\sigma^2}\,\left(A_i'y_i - A_i'X_i\beta - A_i'Z_i b_i\right)'\left(A_i'A_i\right)^{-1}
\left(A_i'y_i - A_i'X_i\beta - A_i'Z_i b_i\right)\right\} \qquad (13.5)
$$
for any set of n_i × (n_i − 1) matrices A_i of rank n_i − 1 which satisfy A_i'1_{n_i} = 0.
This shows that the conditional approach is equivalent to transforming
each vector Yi orthogonal to 1_{n_i}. If we now also require the A_i to satisfy
A_i'A_i = I_{(n_i−1)}, we have that the transformed vectors A_i'Y_i satisfy
$$
A_i'Y_i \;=\; X_i^*\beta + Z_i^* b_i + \varepsilon_{(1)i}^*, \qquad (13.6)
$$
where X_i^* = A_i'X_i and Z_i^* = A_i'Z_i and where the ε*_{(1)i} = A_i'ε_{(1)i} are nor-
mally distributed with mean 0 and covariance matrix σ²I_{n_i−1}.
Model (13.6) is now again a linear mixed model, but with transformed
data and covariates, and such that the only parameters still in the model
are the longitudinal effects and the residual variance. Hence, the second
step in fitting conditional linear mixed models is to fit model (13.6) using
maximum likelihood or restricted maximum likelihood methods. As earlier,
the subject-specific slopes are estimated using empirical Bayes methods (see
Section 7.2). Note that once the transformed responses and covariates have
been calculated, standard software for fitting linear mixed models (e.g.,
SAS procedure MIXED) can be used for the estimation of all parameters
in model (13.6). A SAS macro for performing the transformation has been
provided by Verbeke et al. (1999) and is available from the website.
Conditional linear mixed models are based on transformed data, where the
transformation is chosen such that a specific set of “nuisance” parameters
vanishes from the likelihood. In this respect, the proposed method is
very similar to REML estimation in the linear mixed model, where the vari-
ance components are estimated after transforming the data such that the
fixed effects vanish from the model (see Section 5.3). As shown by Harville
(1974, 1977) and by Patterson and Thompson (1971), and as discussed in
Section 5.3.4, the REML estimates for the variance components do not de-
pend on the selected transformation, and no information on the variance
components is lost in the absence of information on the fixed effects. It has
been shown by Verbeke, Spiessens and Lesaffre (2000) that similar prop-
erties hold for inferences obtained from conditional linear mixed models;
that is, it was shown that results do not depend on the selected transfor-
mation Yi → Ai Yi and that no information is lost on the average, nor
on the subject-specific longitudinal effects, from conditioning on sufficient
statistics for the cross-sectional components b∗i in the original model.
Panel (b) of Figure 13.2 shows a scatter plot of the EB estimates for the
subject-specific slopes b2i in model (13.1), obtained by fitting the associated
(correct) linear mixed model versus those obtained from the conditional
linear mixed model. Note that the same plot would be obtained for all
contaminated data sets. For the contaminated data with ∆ = −10, a similar
scatter plot has been included for the EB estimates under the incorrect
linear mixed model [panel (c) of Figure 13.2]. Surprisingly, the estimates
under the conditional model correlate better with the estimates under the
incorrect model (r = 0.93) than with those obtained from the correct linear
mixed model (r = 0.86). However, panel (c) in Figure 13.2 reveals the
presence of outliers which may inflate the correlation and also suggests that
the incorrect model tends to systematically underestimate small (negative)
and large (positive) slopes, when compared to the conditional linear mixed
model, whereas the opposite is true for the slopes closer to zero.
We also calculated, for all subjects, the difference between their EB esti-
mate obtained under the correct or incorrect model and their EB estimate
obtained under the conditional linear mixed model. A plot of these differ-
ences versus the subject’s age at entry in the study is shown in Figure 13.3.
It clearly shows that the omission of the cross-sectional quadratic age-
effect results in a systematic bias of the EB estimates, when compared to
the estimates from the conditional model: The incorrect model tends to
overestimate the subject-specific slope for subjects of low or high age at
entry in the study, whereas the opposite is true for middle-aged subjects.

FIGURE 13.3. Hearing Data. Scatter plots of the differences in empirical Bayes
estimates for the random slopes b2i obtained under the correct linear mixed model
and the conditional linear mixed model (left panel) as well as under the incorrect
model and the conditional linear mixed model (right panel). Both plots are based
on contaminated data (∆ = −10).
This bias is not present for the EB estimates obtained from the correct
linear mixed model. These findings suggest that one way of checking the
appropriateness of the cross-sectional component of a classical linear mixed
model could be to calculate the difference between the resulting EB esti-
mates for the subject-specific slopes and their EB estimates obtained from
the conditional approach, and to plot these differences versus relevant co-
variates. However, more research is needed in order to fully explore the
potential of such procedures.
For example, we tried this fixed-effects approach to obtain the results previ-
ously derived for the hearing data under a conditional linear mixed model
and already reported in Table 13.1. In SAS PROC MIXED, this can be
done as follows:
The variable time contains the time point (in decades from entry in the
study) at which the repeated responses were taken, whereas f time repre-
sents the interaction term between time and the age (in decades) of the
subjects at their entry in the study. Further, id contains the identification
number of each patient, and the variable visit1 is one at the first visit,
and zero for all subsequent visits. Finally, the response variable L500 con-
tains the hearing thresholds for 500 Hz, taken on the left ear of the study
participants. Running the above program requires fitting a linear mixed
model with 684 fixed effects and 2 variance components. Unfortunately,
this turned out not to be feasible because the 1.12 Gb of free disk space
was insufficient for SAS to fit the model.
TABLE 13.2. Hearing Data (randomly selected subset). ML and REML estimates
(standard errors) for the longitudinal effects in (13.1), from a conditional linear
mixed model as well as from a fixed-effects approach.
obtained from the conditional linear mixed model (i.e., d22 = 11.466 and
σ² = 20.148). This is done in PROC MIXED using the following program
with a PARMS statement:
The so-obtained results are now exactly the same as those from the condi-
tional linear mixed model, reported in the first column of Table 13.2.
Finally, since REML estimation does account for the loss in degrees of free-
dom due to the estimation of the fixed effects, we get the same results from
the conditional linear mixed model as from the fixed-effects approach, pro-
vided both use REML estimation. For our example, this is illustrated by
the fact that the second and the last columns in Table 13.2 are identical.
14
Exploring Incomplete Data
Several issues arise when data are incomplete. In the remainder of this
chapter, we will illustrate some of these issues based on the Vorozole study.
To simplify matters, we will focus on dropout. In the next chapter, a formal
treatment of missingness, based on the pivotal work of Rubin (1976) and
Little and Rubin (1987) will be given. Subsequent chapters present various
modeling strategies for incomplete longitudinal data, as well as tools for
sensitivity analysis, an area in which interest is strongly rising.
The first issue, resulting from dropout, is evidently a depletion of the study
subjects. Of course, a decreasing sample size increases variability which, in
turn, decreases precision. In this respect, the Vorozole study is a dramatic
example, as can be seen from Figure 14.1 and Table 14.1, which graphi-
cally and numerically present dropout in both treatment arms. Clearly, the
dropout rate is high and there is a hint of a differential rate between the
two arms. This means we have identified one potential factor that could
          Standard        Vorozole
Week      #    (%)        #    (%)
0 226 (100) 220 (100)
1 221 (98) 216 (98)
2 203 (90) 198 (90)
4 161 (71) 146 (66)
6 123 (54) 106 (48)
8 90 (40) 90 (41)
10 73 (32) 77 (35)
12 51 (23) 64 (29)
14 39 (17) 51 (23)
16 27 (12) 44 (20)
18 19 (8) 33 (15)
20 14 (6) 27 (12)
22 6 (3) 22 (10)
24 5 (2) 17 (8)
26 4 (2) 9 (4)
28 3 (1) 7 (3)
30 3 (1) 3 (1)
32 2 (1) 1 (0)
34 2 (1) 1 (0)
36 1 (0) 1 (0)
38 1 (0) 0 (0)
40 1 (0) 0 (0)
42 1 (0) 0 (0)
44 1 (0) 0 (0)
FIGURE 14.3. Vorozole Study. Scatter plot matrix for selected time points.
FIGURE 14.4. Vorozole Study. Mean profiles, with 95% confidence intervals
added.
The individual profiles plot, by definition displaying all available data, has
some intrinsic limitations. As is the case with any individual data plot,
it tends to be fairly busy. Since there is a lot of early dropout, there are
many short sequences and since we decided to use the same time axis for
all profiles, also for those that drop out early, very little information can
be extracted. Indeed, the evolution over the first few sequences is not clear
at all. In addition, the eye assigns more weight to the longer profiles, even
though they are considerably less frequent.
Some of these limitations are removed in Figure 14.6, where the pattern-
specific average profiles are displayed per treatment arm. Still, care has
to be taken not to overinterpret the longer profiles while neglecting the
shorter profiles. Indeed, for this study the latter represent more subjects
than the longer profiles.
Another important observation is that those who drop out rather early
seem to decrease from the start, whereas those who remain relatively long
in the study exhibit, on average and in turn, a rise, a plateau, and then
a decline.
FIGURE 14.6. Vorozole Study. Mean profiles, per dropout pattern, grouped per
treatment arm.
15
Joint Modeling of Measurements and Missingness

15.1 Introduction
Patients who drop out of a clinical trial are usually listed on a separate with-
drawal sheet of the case record form with the reasons for withdrawal, en-
tered by the investigator. Reasons typically encountered are adverse events,
illness not related to study medication, uncooperative patient, protocol vi-
olation, ineffective study medication, and other reasons (with further spec-
ification, e.g., lost to follow-up). Based on such a medical typology, Gould
(1980) proposed specific methods to handle this type of incompleteness.
Early work on missing values was largely concerned with algorithmic and
computational solutions to the induced lack of balance or deviations from
the intended study design. See, for example, the reviews by Afifi and
Elashoff (1966) and Hartley and Hocking (1971). More recently, general
Much of the treatment in this work will be restricted to dropout (or at-
trition); that is, to patterns in which missing values are only followed by
missing values. There are four related reasons for this restriction. First, the
classification of missing value processes has a simpler interpretation with
dropout than for patterns with intermediate missing values. Second, it is
easier to formulate models for dropout and, third, much of the missing
value literature on longitudinal data is restricted to this setting. Finally,
dropouts are by far the most common form of missingness in longi-
tudinal studies.
In a strict sense, the conventional justification for the analysis of data from
a randomized trial is removed when data are missing for reasons outside
the control of the investigator. Before one can address this problem how-
ever, it is necessary to clearly establish the purpose of the study (Heyting,
Tolboom, and Essers 1992). If one is working within a pragmatic setting,
the event of dropout, for example, may well be a legitimate component of
the response. It may make no sense to ask what response the subject would
have shown had they remained in the trial, and the investigator may then
require a description of the response conditional on a subject remaining in
the trial. This, together with the pattern of observed missingness, may then
be the appropriate and valid summary of the outcome. We might call this a
conditional description. Shih and Quan (1997) argue that such a description
will be of more relevance in many clinical trials. On the other hand, from
a more explanatory perspective, we might be interested in the behavior of
Although such dropout may in any particular setting imply that a marginal
model is not helpful, it does not imply that it necessarily has no meaning.
Provided that the underlying model does not attach a probability of 1 to
dropout for a particular patient, then non-dropout and subsequent obser-
vation is an outcome that is consistent with the model and logically not
different from any other event in a probability model. Such distinctions,
particularly with respect to the conditional analysis, are complicated by
the inevitable mixture of causes behind missing values. The conditional
description is a mirror of what has been observed, and so its validity is
less of an issue than its interpretation. In contrast, other methods of han-
dling incompleteness make some correction or adjustment to what has been
directly observed, and therefore address questions other than those corre-
sponding to the conditional setting. In seeking to understand the validity
of these analyses, we need to compare their consequences with their aims.
In conclusion, we see that when there are missing values, simple methods of
analysis do not necessarily imply simple, or even accessible, assumptions,
and without understanding properly the assumptions being made in an
analysis, we are not in a position to judge its validity or value. It has
been argued that although any particular such ad hoc analysis may not
represent the true picture behind the data, a collection of such analyses
should provide a reasonable envelope within which the truth might lie.
This does point to the desirability of a sensitivity analysis, but the main
conclusion does not follow. Counterexamples to this can be constructed
and, again, without a clear formulation of the assumptions being made, we
are not in a position to interpret such an envelope, and we are certainly not
justified in assuming that its coverage is, in some practical sense, inclusive.
One way to proceed is to consider a formal framework for the missing value
problem, and this leads us to Rubin’s classification.
In many examples, however, the reasons for dropout are many and varied
and it is therefore difficult to justify on a priori grounds the assumption
of random dropout. Arguably, in the presence of nonrandom dropout, a
wholly satisfactory analysis of the data is not feasible.
Indeed, one feature in common to all of the more complex approaches is that
they rely on untestable assumptions about the relation between the mea-
surement process (often of primary interest) and the dropout process. One
should therefore avoid missing data as much as possible, and if dropout oc-
curs, information should be collected on the reasons for this. As an example,
consider a clinical trial where outcome and dropout are both strongly re-
lated to a specific covariate X and where, conditionally on X, the response
Y and the missing data process R are independent. In the selection frame-
work, we then have that f (Y, R|X) = f (Y |X)f (R|X), implying MCAR,
whereas omission of X from the model may imply MAR or even MNAR,
which has important consequences for selecting valid statistical methods.
Section 15.5 develops the necessary terminology and notation and Sec-
tion 15.6 describes various missing data patterns. The missing data mecha-
nisms, informally introduced in this section, are formalized in Section 15.7.
The important case where the missing data mechanism can be excluded
from statistical analysis is introduced in Section 15.8. Since much of the
subsequent treatment will be confined to dropout, this situation is reviewed
in Section 15.9.
15.5 Terminology
The missing data indicators Rij are grouped into a vector Ri which is, of
course, of the same length as Yi. We then distinguish between:

Complete data Yi : the (hypothetical) vector of outcomes that would have
    been recorded had no data been missing.

Full data (Yi , Ri ): the complete data, together with the missing data
    indicators. Note that unless all components of Ri equal 1, the full
    data components are never jointly observed.

Observed data Y_i^o : the components of Yi that are actually observed.

Missing data Y_i^m : the remaining, unobserved components of Yi.
Some confusion might arise between the terms complete data introduced
here and complete case analysis of Sections 15.3 and 16.2. Although the
former refers to the (hypothetical) data set that would arise if there were
no missing data, “complete case analysis” refers to deletion of all subjects
for which at least one component is missing.
Note that one observes the measurements Y_i^o together with the missingness
indicators Ri .
Further, selection models and pattern-mixture models are not the only
possible ways of factorizing the joint distribution of the outcome and miss-
ingness processes. Section 17.1 places these models in a broader context.
15.8 Ignorability
Let us decide to use likelihood based estimation. The full data likelihood
contribution for subject i assumes the form
$$
L_i^*(\theta, \psi \mid y_i, r_i) \;\propto\; f(y_i, r_i \mid X_i, Z_i, \theta, \psi).
$$
Since inference has to be based on what is observed, the full data likelihood
L* has to be replaced by the observed data likelihood L:
$$
L(\theta, \psi) \;\propto\; \prod_{i=1}^{N} f(y_i^o, r_i \mid X_i, Z_i, \theta, \psi),
$$
with
$$
f(y_i^o, r_i \mid \theta, \psi) \;=\; \int f(y_i, r_i \mid X_i, Z_i, \theta, \psi)\, dy_i^m
\;=\; \int f(y_i^o, y_i^m \mid X_i, Z_i, \theta)\, f(r_i \mid y_i^o, y_i^m, X_i, \psi)\, dy_i^m. \qquad (15.5)
$$
Under MAR, f(r_i | y_i^o, y_i^m, X_i, ψ) = f(r_i | y_i^o, X_i, ψ) does not depend on the
missing measurements, so that (15.5) becomes
$$
f(y_i^o, r_i \mid \theta, \psi) \;=\; f(y_i^o \mid X_i, Z_i, \theta)\, f(r_i \mid y_i^o, X_i, \psi); \qquad (15.6)
$$
that is, the likelihood factorizes into two components of the same functional
form as the general factorization (15.1) of the complete data. If, further,
θ and ψ are disjoint in the sense that the parameter space of the full
vector (θ', ψ')' is the product of the parameter spaces of θ and ψ, then
inference can be based on the marginal observed data density only. This
technical requirement is referred to as the separability condition. However,
some caution is still needed when constructing precision estimators
(see Chapter 21).
In the specific case of dropout, the missingness pattern for subject i can be
summarized by the scalar
$$
D_i \;=\; 1 + \sum_{j=1}^{n_i} R_{ij}, \qquad (15.7)
$$
denoting the occasion at which dropout occurs, or equivalently by the number
of measurements actually taken,
$$
T_i \;=\; \sum_{j=1}^{n_i} R_{ij} \;=\; D_i - 1. \qquad (15.8)
$$
Selection modeling is now obtained from factorizing the density of the full
data (yi, di), i = 1, . . . , N, as (suppressing covariate dependence)
$$
f(y_i, d_i \mid \theta, \psi) \;=\; f(y_i \mid \theta)\, f(d_i \mid y_i, \psi),
$$
where the first factor is the marginal density of the measurement process
and the second one is the density of the missingness process, conditional
on the outcomes.
If f(d_i | y_i^o, y_i^m, ψ) is independent of the measurements [i.e., when it assumes
the form f(d_i | ψ)], then the process is termed missing completely at random
(MCAR). If f(d_i | y_i^o, y_i^m, ψ) is independent of the unobserved (missing)
measurements y_i^m, but depends on the observed measurements y_i^o, thereby
assuming the form f(d_i | y_i^o, ψ), then the process is referred to as missing
at random (MAR). Finally, when f(d_i | y_i^o, y_i^m, ψ) depends on the missing
values y_i^m, the process is referred to as nonrandom missingness (MNAR). Of
course, ignorability is defined in analogy with its definition in Section 15.8.
16
Simple Missing Data Methods
16.1 Introduction
As suggested in Section 15.2, missing data nearly always entail problems for
the practicing statistician. First, inference will often be invalidated when
the observed measurements do not constitute a simple random subset of the
complete set of measurements. Second, even when correct inference follows,
it is not always an easy task to trick standard software into operation on a
ragged data structure.
We will briefly review a number of techniques that are valid when the mea-
surement and missing data processes are independent and their parameters
are separated (MCAR). It is important to realize that many of these meth-
ods are used also in situations where the MCAR assumption is not tenable.
This should be seen as bad practice since it will often lead to biased esti-
mates and invalid tests and hence to erroneous conclusions. Ample detail
and illustrations of several problems are provided in Verbeke and Molen-
berghs (1997, Chapter 5). Section 16.2 discusses the computationally sim-
plest technique, a complete case analysis, in which the analysis is restricted
to the subjects for whom all intended measurements have been observed.
A complete case analysis is popular because it maps a ragged data matrix
into a rectangular one, by deleting incomplete cases. A second family of
approaches, with a similar effect on the applicability of complete data soft-
ware, is based on imputing missing values. One distinguishes between single
imputation (Section 16.3) and multiple imputation (Section 20.3). In the
first case, a single value is substituted for every “hole” in the data set and
the resulting data set is analyzed as if it represented the true complete data.
Multiple imputation properly acknowledges the uncertainty stemming from
filling in missing values rather than observing them.
A complete case analysis includes in the analysis only those cases for
which all ni measurements were recorded. This method has obvious advan-
tages. It is very simple to describe and since the data structure is as would
have resulted from a complete experiment, standard statistical software can
be used. Further, since the complete estimation is done on the same subset
of completers, there is a common basis for inference, unlike for the available
case methods (see Section 16.4).
Finally, both within- and between-subject information can be used
(e.g., conditional mean imputation, hot deck imputation). Standard refer-
ences are Little and Rubin (1987) and Rubin (1987). Imputation strategies
have been very popular in sample survey methods.
For example, Little and Rubin (1987) show that the method could work
for a linear model with one fixed effect and one error term, but that it
generally does not for hierarchical models, split-plot designs, and repeated
measures (with a complicated error structure), random-effects, and mixed-
effects models. At the very least, different imputations for different effects
would be necessary.
The user of imputation strategies faces several dangers. First, the imputa-
tion model could be wrong and, hence, the point estimates would be biased.
Second, even for a correct imputation model, the uncertainty resulting from
incompleteness is masked. Indeed, even when one is reasonably sure about
the mean value the unknown observation would have, the actual stochastic
realization, depending on both the mean structure as well as on the error
distribution, is still unknown.
The idea behind unconditional mean imputation (Little and Rubin 1987)
is to replace a missing value with the average of the observed values on the
same variable over the other subjects. Thus, the term unconditional refers
to the fact that one does not use (i.e., condition on) information on the
subject for which an imputation is generated.
This approach was suggested by Buck (1960) and reviewed by Little and
Rubin (1987). The method is technically hardly more complex than mean
imputation. Let us describe it first for a single multivariate normal sample.
The first step is to estimate the mean vector µ and the covariance matrix
Σ from the complete cases. This step builds on the assumption that Y ∼
N(µ, Σ). For a subject with missing components, the regression of the
missing components (Y_i^m) on the observed ones (y_i^o) is
$$
Y_i^m \mid y_i^o \;\sim\; N\!\left(\mu^m + \Sigma^{mo}\left(\Sigma^{oo}\right)^{-1}\!\left(y_i^o - \mu_i^o\right),\;
\Sigma^{mm} - \Sigma^{mo}\left(\Sigma^{oo}\right)^{-1}\Sigma^{om}\right).
$$
Buck (1960) showed that under mild regularity conditions, the method
is valid for MCAR mechanisms. Little and Rubin (1987) added that the
method is valid under certain types of MAR mechanism. Even though the
distribution of the observed components is allowed to differ between com-
plete and incomplete observations, it is very important that the regression
of the missing components on the observed ones is constant across miss-
ingness patterns.
Again, this method shares with other single imputation strategies the draw-
back that, although point estimation may be consistent, the precision will be under-
estimated. Little and Rubin (1987, p. 46) indicated ways to correct the
precision estimation for unconditional mean imputation.
The imputation methods reviewed here are clearly not the only ones. Little
and Rubin (1987) and Rubin (1987) mention several others. Several meth-
ods, such as hot deck imputation, are based on filling in missing values from
“matching” subjects, where an appropriate matching criterion is used.
The main advantage, shared with complete case analysis, is that complete
data software can be used. Although a complete case analysis is even sim-
pler since one does not need to address the imputation task, the imputation
family uses all (and, in fact, too much) of the available information. With
the availability of the SAS procedure MIXED, it is no longer necessary to
stick to complete data software, since this procedure allows for measure-
ment sequences of unequal length. A discussion of multiple imputation is
postponed until Section 20.3.
Available case methods (Little and Rubin 1987) use as much of the data
as possible. Let us restrict attention to the estimation of the mean vector
µ and the covariance matrix Σ. The jth component µj of the mean vector
and the jth diagonal variance element σjj are estimated using all cases
that are observed on the jth variable, disregarding their response status
at the other measurement occasions. The (j, k)th element (j = k) of the
covariance matrix is computed using all cases that are observed on both
the jth and the kth variable.
This method is more efficient than the complete case method, since more
information is used. The number of components of the outcome vector has
no direct effect on the sample available for a particular mean or covariance
component.
This method will be illustrated in Section 17.4.2, using the growth data
introduced in Section 2.6.
FIGURE 16.1. Toenail Data. Estimated mean profile under treatment A (solid
line) and treatment B (dashed line), obtained under different assumptions for
the measurement model and the dropout model. (a) Completely random dropout,
without parametric model for the average evolution in both groups. (b) Com-
pletely random dropout, assuming quadratic average evolution for both groups.
(c) Random dropout, assuming quadratic average evolution for both groups. (d)
Nonrandom dropout, assuming quadratic average evolution for both groups.
time, thereby including all patients still available at that occasion. For the
toenail example, this is shown in panel (a) of Figure 16.1. The graph sug-
gests that there is very little difference between both groups, with marginal
superiority of treatment A.
Note that the sample averages at a specific occasion are unbiased estimators
of the mean responses of those subjects still in the study at that occasion.
Hence, the average profiles in panel (a) of Figure 16.1 only reflect the
marginal average evolutions if, at each occasion, the mean response of those
still in the study equals the mean response of those who already dropped
out. Thus, we have to assume that the mean of the response, conditional
on dropout status, is independent of the dropout status.
For our toenail example, we used the sample averages displayed in Fig-
ure 16.1 [panel (a)] to test for any differences between both treatment
groups. The resulting Wald statistic equals χ2 = 4.704 on 7 degrees of
freedom, from which we conclude that there is no evidence in the data
for any difference between both groups (p = 0.696). Note that the above
methodology also applies if we assume the outcome to satisfy a general lin-
ear regression model, where the average evolution in both groups may be
assumed to be of a specific parametric form. We compared both treatments
assuming that the average evolution is quadratic over time, with regression
parameters possibly depending on treatment. The resulting OLS profiles
are shown in panel (b) of Figure 16.1. The main difference with the profiles
obtained from a model with unstructured mean evolutions [panel (a)] is
seen during the treatment period (first 3 months). The Wald test statistic,
employed for testing treatment differences, now equals χ2 = 2.982, on 3
degrees of freedom (p = 0.395) yielding the same conclusion as before.
In practice, patients often leave the study prematurely due to reasons re-
lated to the outcome of interest. The assumption of completely random
dropout is then no longer tenable, and statistical methods allowing for less
strict assumptions about the relation between the dropout process and the
measurement process (MAR or even MNAR) should be investigated. This
will be discussed in Sections 17.2 and 18.3.
17
Selection Models
17.1 Introduction
Much of the early development of, and debate about, selection models ap-
peared in the econometrics literature in which the Tobit model (Heckman
1976) played a central role. This combines a marginal Gaussian regression
model for the response, as might be used in the absence of missing data,
with a Gaussian-based threshold model for the probability of a value be-
ing missing. For simplicity, consider a single Gaussian-distributed response
variable Y ∼ N(µ, σ²). The probability of Y being missing is assumed to
depend on a second Gaussian variable Y_m ∼ N(µ_m, σ_m²), where
$$
P(R = 0) \;=\; P(Y_m < 0).
$$
Dependence of missingness on the response Y is induced by introducing
a correlation between Y and Ym . To avoid some of the complications of
direct likelihood maximization, a two-stage estimation procedure was pro-
posed by Heckman (1976) for this type of model. The use of the Tobit
model and associated two-stage procedure was the subject of considerable
debate in the econometrics literature, much of it focusing on the issues of
identifiability and sensitivity (Amemiya 1984, Little 1986).
At first sight, the Tobit model does not appear to have the selection model
structure specified in (15.1) in that there is no conditional partition of
f(y, r). However, it is simple to show from the joint Gaussian distribution that
$$
P(R = 0 \mid Y = y) \;=\; \Phi(\beta_0 + \beta_1 y)
$$
for suitably chosen parameters β0 and β1 and Φ(·) the Gaussian cumulative
distribution function. This can be seen as a probit regression model for the
(binary) missing value process. This basic structure underlies the simplest
form of selection model that has been proposed for longitudinal data in
the biometric setting. A suitable response model, such as the multivariate
Gaussian, is combined with a binary regression model for dropout. At each
time point, the occurrence of dropout can be regressed on previous and
current values of the response as well as covariates. In this chapter, we
explore such models in more detail.
As formally introduced in Chapter 15, under the selection model (15.1), one
uses the functional form of f (di |yi ) to discriminate between different types
of dropout processes. Indeed, recall from Section 15.8 that under MCAR or
MAR, the joint density of observed measurements and dropout indicator
factors as
$$
f(y_i^o, d_i) \;=\;
\begin{cases}
f(y_i^o)\, f(d_i) & \text{under MCAR} \\[2pt]
f(y_i^o)\, f(d_i \mid y_i^o) & \text{under MAR,}
\end{cases}
$$
from which it follows that a marginal model for the observed data yio only is
required. Moreover, the measurement model f (yio ) and the dropout model
f (di ) or f (di |yio ) can be fitted separately, provided that the parameters
in both models are functionally independent of each other (separability).
If interest is in the measurement model only, the dropout model can be
completely ignored (Section 15.8). This implies that, under ignorability,
MCAR and MAR provide the same fitted measurement model. However,
as discussed by Kenward and Molenberghs (1998) and by Verbeke and
Molenberghs (1997, Section 5.8), this does not imply that inferences under
MCAR and MAR are equivalent.
This model is similar to (3.11). All random components have zero mean.
The random intercept variance is d11 . The variance of the measurement
error ε(1)i is σ²I_{n_i}, whereas the variance of the serial process ε(2)i is
τ 2 Hi , where Hi follows from the particular serial process considered. The
unknown parameters βA0 , βA1 , βA2 , βB0 , βB1 , and βB2 describe the average
quadratic evolution of Yio over time.
Let us first assume that ε(1)ij is absent. The estimated average profiles
obtained from fitting model (17.1) to our TDO data are shown in panel
(c) of Figure 16.1. Note that there is very little difference from the OLS
average profiles shown in panel (b) of the same figure and obtained under
the MCAR assumption. The observed likelihood ratio statistic for testing
for treatment differences equals 2 ln λ = 4.626 on 3 degrees of freedom.
Hence, under model (17.1) and under the assumption of random dropout,
there is little evidence for any average difference between the treatments A
and B (p = 0.201).
For the TDO data, we assume that all outcomes Yij satisfy
$$
Y_{ij} \;=\;
\begin{cases}
(\beta_{A0} + b_i) + \beta_{A1} t_{ij} + \beta_{A2} t_{ij}^2 + \varepsilon_{(2)ij}, & \text{group A} \\[2pt]
(\beta_{B0} + b_i) + \beta_{B1} t_{ij} + \beta_{B2} t_{ij}^2 + \varepsilon_{(2)ij}, & \text{group B.}
\end{cases} \qquad (17.2)
$$
Note that, for reasons that will become clear later, the measurement error
component ε(1)ij has been removed from the model. Under MAR, model
(17.2) reduces to (17.1) with the measurement error removed, but when
MAR does not hold, we no longer have that Yio satisfies model (17.1).
Further, we assume that the probability of dropout at occasion j (j =
2, . . . , ni), given that the subject was still in the study at the previous occasion,
follows a logistic regression model, in line with Diggle and Kenward (1994),
$$
\text{logit}\!\left[P(D_i = j \mid D_i \ge j,\, y_i)\right] \;=\; \psi_0 + \psi_1\, y_{i,j-1} + \psi_2\, y_{ij}. \qquad (17.3)
$$
Model (17.4) implies that the dropout contribution f (di |yi ) to the log-
likelihood can be written as a product of independent contributions of
the form P (Di = di |Di ≥ di , yi ). They describe a binary outcome, condi-
tional on covariates. For such data, logistic regression methods are a natural
choice, as mentioned earlier. Implementations of this will be discussed at a
later stage.
At this point, we will relate dropout to the current and previous observa-
tion only. No covariates are included, and we do not explicitly take into
account the fact that the time points at which measurements have been
taken are not evenly spaced. Diggle and Kenward (1994) consider a more
general model where dropout at occasion j can depend on the complete
history {yi1 , . . . , yi,j−1 }, as well as on external covariates. In addition, one
could argue that the dependence of dropout on the measurement history is
time dependent. This would imply that the parameters ψ = (ψ0 , ψ1 , ψ2 )
in (17.3) depend on j. However, this level of generality will not be con-
sidered here. Note also that, strictly speaking, (15.1) allows dropout at
a specific occasion to be related to all future responses. However, this is
rather counterintuitive in many cases, especially when it is difficult for the
study participants to make projections about the future responses. More-
over, including future outcomes seriously complicates the calculations since
computation of the likelihood (15.4) then requires evaluation of a possibly
high-dimensional integral. Diggle and Kenward (1994) and Molenberghs,
Kenward, and Lesaffre (1997) considered nonrandom versions of this model
by including the current, possibly unobserved measurement. This requires
more elaborate fitting algorithms, given the high-dimensional integration
mentioned earlier. Diggle and Kenward (1994) used the simplex algorithm (Nelder and
Mead, 1965), and Molenberghs, Kenward, and Lesaffre fitted their models
with the EM algorithm (Dempster, Laird, and Rubin 1977). The algorithm
of Diggle and Kenward (1994) is implemented in OSWALD (Smith, Robert-
son, and Diggle 1996). For further information on OSWALD, please consult
Section A.3.2.
We fitted the above model to our TDO data using the PCMID function in
the Splus suite of functions called OSWALD (Smith, Robertson and Diggle,
1996). The fitted average profiles are shown in panel (d) of Figure 16.1. Note
that we again find very little difference with the estimated average profiles
obtained from previous analyses. The observed likelihood ratio statistic
for testing for treatment differences equals 2 ln λ = 4.238 on 3 degrees of
freedom. Hence, there is again very little evidence for the presence of any
average treatment difference (p = 0.237).
An important issue with such nonrandom dropout models is the sensitivity
of the conclusions to the stated complete data model. Several authors have
pointed to this sensitivity, such as Rubin (1994), Laird (1994), Little (1995),
Hogan and Laird (1997), and Molenberghs, Kenward and Lesaffre (1997). A
good example is given by Kenward (1998), who reanalyzed data on mastitis
in dairy cows, previously analyzed by Diggle and Kenward (1994). He found
that all evidence for nonrandom dropout vanishes when the normality as-
sumption for the second, possibly unobserved, outcome conditional on the
first, always observed, outcome is replaced by a heavy-tailed t-distribution.
See also Section 19.5.1. Further illustrations are Molenberghs, Verbeke, et
al. (1999) and Kenward and Molenberghs (1999). Thus, clearly, caution is
required and, preferably, a sensitivity analysis should be conducted. Formal
tools to carry out such an analysis are provided in Chapter 19.
Another consequence of having to formulate a model for the complete rather
than only the observed data is that the missing components need to be inte-
grated out from the likelihood. Technically, this implies that, at present,
specialized software, based on computationally intensive and highly unstable
algorithms, is required for fitting nonrandom dropout models.
TABLE 17.1. Toenail Data. Summary of results obtained from fitting model (17.2)
and model (17.5), in combination with dropout model (17.3), to the TDO data.
Model (17.2)
  Variability explained by measurement error:   0%
  LR test for treatment differences:            2 ln λ = 4.238 (p = 0.237)
  LR test for MAR:                              2 ln λ = 25.386 (p < 0.0001)
  Fitted correlations:
      1   .87  .77  .70  .58  .53  .52
           1   .87  .77  .61  .54  .52
                1   .87  .65  .56  .53
                     1   .70  .58  .53
                          1   .70  .58
                               1   .70
                                    1
Model (17.5)
  Variability explained by measurement error:   6%
  LR test for treatment differences:            2 ln λ = 3.024 (p = 0.388)
  LR test for MAR:                              2 ln λ = 4.432 (p = 0.035)
  Fitted correlations:
      1   .91  .89  .86  .79  .73  .70
           1   .91  .89  .81  .75  .70
                1   .91  .83  .77  .71
                     1   .86  .79  .73
                          1   .86  .79
                               1   .86
                                    1
Let us, as in the toenail data example in Section 17.2.1, assume that MAR
holds. Assume, in addition, that the separability condition is satisfied (Sec-
tion 15.8). In Section 15.8, it was argued, and in Section 17.2.1 it was re-
iterated, that likelihood based inference is valid, whenever the mechanism
is MAR and provided the technical condition holds that the parameters
describing the nonresponse mechanism are distinct from the measurement
model parameters. In other words, the missing data process should be ig-
norable in the likelihood inference sense, since, then, factorization (15.6)
applies and the log-likelihood partitions into two functionally independent
components.
This implies that a module with likelihood estimation facilities that can
handle incompletely observed subjects, because its units are measurements
rather than subjects, manipulates the correct likelihood and leads to valid
likelihood ratios. We will qualify this statement in more detail since, al-
though this is an extremely important feature of PROC MIXED, and in
fact of any flexibly written linear mixed model likelihood optimization rou-
tine, a few cautionary remarks still apply.
The growth data have been introduced in Section 2.6. Section 17.4.1 an-
alyzes the original set of data (i.e., without artificially removed subjects).
The incomplete version, generated by Little and Rubin (1987), is studied
in Section 17.4.3, and the missingness process is studied in Section 17.4.4.
Yi = Xi β + Zi bi + εi , (17.6)
where
bi ∼ N (0, D),
εi ∼ N (0, Σ),
Model 1
The first model we will consider assumes a separate mean for each of
the eight age×sex combinations, together with an unstructured covariance.
This is done by assuming that the covariance matrix Σ of the error vector
Yi = Xi β + εi ,
with
$$
X_i \;=\;
\begin{pmatrix}
1 & x_i & 1-x_i & 0 & 0 & x_i & 0 & 0\\
1 & x_i & 0 & 1-x_i & 0 & 0 & x_i & 0\\
1 & x_i & 0 & 0 & 1-x_i & 0 & 0 & x_i\\
1 & x_i & 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}
$$
and β = (β0 , β1 , β0,8 , β0,10 , β0,12 , β1,8 , β1,10 , β1,12 ) . With this parameteri-
zation, the means for girls are β0 + β1 + β1,8 ; β0 + β1 + β1,10 ; β0 + β1 + β1,12 ;
and β0 +β1 at ages 8, 10, 12, and 14, respectively. The corresponding means
for boys are β0 + β0,8 ; β0 + β0,10 ; β0 + β0,12 ; and β0 , respectively. Of course,
there are many equivalent ways to express the set of eight means in terms
of eight linearly independent parameters.
This model can, for example, be fitted with the following SAS code:
Let us discuss the fit of the model. The deviance (minus twice the log-
likelihood at maximum) equals 416.5093, and there are 18 model parame-
ters (8 mean, 4 variance, and 6 covariance parameters). This deviance will
serve as a reference to assess the goodness-of-fit of simpler models. Pa-
rameter estimates and standard errors are reproduced in Table 17.2. The
deviances are listed in Table 17.4.
TABLE 17.2. Growth Data. Maximum likelihood estimates and standard errors
(model based and empirically corrected) for the fixed effects in Model 1 (complete
data set).
These quantities are easily obtained in PROC MIXED by using the options
‘r’ and ‘rcorr’ in the REPEATED statement (see Section 8.2.6). Apparently,
the variances are close to each other, and so are the correlations.
The estimated covariance and correlation matrices were computed separately for
boys and girls; one of the estimated correlation matrices is

      1.0000   0.4374   0.5579   0.3152
      0.4374   1.0000   0.3873   0.6309
      0.5579   0.3873   1.0000   0.5860
      0.3152   0.6309   0.5860   1.0000
From these, we suspect that there is a non-negligible difference between
the covariance structures for boys and girls, with, in particular, a weaker
correlation among the boys’ measurements. This is indeed supported by a
Here, the ‘empirical’ option is added to the PROC MIXED statement. This
method yields a consistent estimator of precision, even if the covariance
model is misspecified. In this particular case (a full factorial mean model),
both methods (Model 0 on the one hand and the empirically corrected
Model 1 on the other hand) lead to exactly the same standard errors. This
illustrates that the robust method can be advantageous if correct standard
errors are required, but finding an adequate covariance model is judged too
involved. The robust standard errors are presented in Table 17.2 as the
second entry in parentheses. It is seen that the naive standard errors are
somewhat smaller than their robust counterparts, except for the parameters
β1,8 , β1,10 , and β1,12 , where they are considerably larger. Even though the
relation between the standard errors of the “correct model” (here, Model
0) and the empirically corrected “working model” (here, Model 1) will
not always be a mathematical identity, the empirically corrected estimator
option is a useful tool to compensate for misspecification in the covariance
model.
FIGURE 17.1. Growth Data. Profiles for the complete data, from a selected set
of models.
Model 2
The first simplification occurs by assuming a linear trend within each sex
group. This implies that each profile can be described with two parameters
(intercept and slope), instead of with four unstructured means. The error
matrix Σ will be left unstructured. The model can be expressed as
Yij = β0 + β01 xi + β10 tj (1 − xi ) + β11 tj xi + εij (17.10)
or, in matrix notation,
Y i = Xi β + εi ,
where the design matrix changes to
$$
X_i \;=\;
\begin{pmatrix}
1 & x_i & 8(1-x_i) & 8x_i\\
1 & x_i & 10(1-x_i) & 10x_i\\
1 & x_i & 12(1-x_i) & 12x_i\\
1 & x_i & 14(1-x_i) & 14x_i
\end{pmatrix}
$$
and β = (β0 , β01 , β10 , β11 ) . Here, β0 is the intercept for boys and β0 + β01
is the intercept for girls. The slopes are β10 and β11 , respectively.
The SAS code for Model 1 can be adapted simply by deleting age from the
CLASS statement.
The likelihood ratio test comparing Model 2 to Model 1 does not reject
the null hypothesis of linearity. A summary of model fitting information
for this and subsequent models as well as for comparisons between models
is given in Table 17.4. The first column contains the model number; and a
short description of the model is given in the second and third columns, in
terms of the mean and covariance structures, respectively. The number of
parameters is given next, as well as the deviance (−2ℓ). The column labeled
“Ref” displays one (or more) numbers of models to which the current model
is compared. The G² likelihood ratio statistic is the difference between
−2ℓ of the current and the reference model. The final columns contain the
number of degrees of freedom, and the p-value corresponds to the likelihood
ratio test statistic. Model 2 predicts the following mean growth curves:
These profiles are visualized in the second panel of Figure 17.1. The ob-
served means are added to the graph. The mean model seems acceptable,
consistent with the likelihood ratio test. The estimated covariance and cor-
relation matrices of the measurements are similar to the ones found for
Model 1.
Model 3
The next step is to investigate whether the two profiles are parallel. Al-
though the plot for Model 2 suggests that the profiles are diverging, the
question remains whether this effect is statistically significant. The model
can be described as model (17.10) with β = (β0 , β01 , β1 )′; that is, the two slopes in Model 2 have been replaced by
β1 , a slope common to boys and girls.
Table 17.4 reveals that the likelihood ratio test statistic (comparing Models
2 and 3) rejects the common slope hypothesis (p = 0.0098). This is consis-
tent with the systematic deviation between observed and expected means
in the third panel of Figure 17.1.
In line with the choice of Jennrich and Schluchter (1986), the mean struc-
ture of Model 2 will be kept. We will now turn our attention to simplifying
the covariance structure.
Graphical Exploration
Figure 17.2 presents the 27 individual profiles. The left-hand panel shows
the raw profiles, exhibiting the time trend found in the mean model. To
FIGURE 17.2. Growth Data. Raw and residual profiles for the complete data set.
(Girls are indicated with solid lines. Boys are indicated with dashed lines.)
Model 4
This model assumes a banded (Toeplitz) covariance structure. Comparing its likelihood to that of the reference Model 2 shows that Model 4 is consistent with the data (see Table 17.4). The
covariance matrix is
\[
\begin{pmatrix}
4.9439 & 3.0507 & 3.4054 & 2.3421 \\
3.0507 & 4.9439 & 3.0507 & 3.4054 \\
3.4054 & 3.0507 & 4.9439 & 3.0507 \\
2.3421 & 3.4054 & 3.0507 & 4.9439
\end{pmatrix}
\]
and the derived correlation matrix is
\[
\begin{pmatrix}
1.0000 & 0.6171 & 0.6888 & 0.4737 \\
0.6171 & 1.0000 & 0.6171 & 0.6888 \\
0.6888 & 0.6171 & 1.0000 & 0.6171 \\
0.4737 & 0.6888 & 0.6171 & 1.0000
\end{pmatrix}.
\]
The lag 2 correlation is slightly higher than the lag 1 correlation, while the
lag 3 correlation shows a drop. In light of the standard errors of the covari-
ance parameters (0.9791, 0.9812, and 1.0358, respectively), this observation
should not be seen as clear evidence for a particular trend.
Note that this structure constrains the variance to be constant across time.
Should this assumption be considered unrealistic, then heterogeneous ver-
sions can be fitted instead, combining the correlation matrix from the ho-
mogeneous version with variances that are allowed to change over time.
At this point, Model 4 can replace Model 2 as the most parsimonious model,
consistent with the data found so far. Whether or not further simplifications
are possible will be investigated next.
Model 5
The next model assumes a first-order autoregressive, AR(1), covariance structure,
\[
\sigma_{ij} \;=\; \sigma^{2}\rho^{|i-j|}.
\]
In other words, the variance of the measurements equals σ 2 , and the co-
variance decreases with increasing time lag if ρ > 0. To fit this model
with PROC MIXED, the REPEATED statement should include the op-
tion ‘type=AR(1)’.
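A sketch of the corresponding statement (subject identifier as in the earlier programs):

  repeated / type = AR(1) subject = idnr;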
Table 17.4 reveals that there is an apparent lack of fit for this model, when
compared to Model 2. Jennrich and Schluchter (1986) compared Model 5 to
Model 2 as well. Alternatively, we might want to compare Model 5 to Model
4. This more parsimonious test (2 degrees of freedom) yields p = 0.0003,
strongly rejecting the AR(1) structure.
Model 6
With random intercepts and slopes for age, that is, with Zi containing a column of ones and a column with the ages 8, 10, 12, and 14, the estimated random-effects covariance matrix equals
\[
\hat D \;=\;
\begin{pmatrix}
4.5569 & -0.1983 \\
-0.1983 & 0.0238
\end{pmatrix}. \qquad (17.12)
\]
Model 7
\[
V_i \;=\; Z_i\, d\, Z_i' + \sigma^{2} I_4 \;=\; d\, J_4 + \sigma^{2} I_4 ,
\]
with J4 the 4 × 4 matrix of ones. These two equivalent views toward the same model have been discussed in
Section 3.3.2.
FIGURE 17.3. Growth Data. χ²₆ and simulated null distributions for comparing
Models 2 and 6.
They are shown in the fourth panel of Figure 17.1. Although not exactly
the same, they are extremely similar to the profiles of Model 2.
Model 8
Let us now turn toward the incomplete version of the data. In this section,
we will focus on a straightforward but restrictive available case analysis
from a frequentist perspective. The method is briefly introduced in Sec-
tion 16.4. Specifically, the parameters for the unstructured mean and co-
variance Model 1 will be estimated.
The mean vector for girls is based on a sample of size 11, except for the
second element, which is based on the 7 complete observations. The corre-
sponding sample sizes for boys are 16 and 11, respectively.
Looking at the available case procedure from the perspective of the indi-
vidual observation, one might say that each observation contributes to the
subvector of the parameter vector about which it contains information. For
example, a complete observation in the growth data set contributes to 4
(sex specific) mean components as well as to all 10 variance-covariance pa-
rameters. In an incomplete observation, there is information about three
mean components and six variance-covariance parameters (excluding those
with a subscript 2).
TABLE 17.5. Growth Data. MAR analysis (Little and Rubin). Model fit summary.
In other words, an incomplete observation provides no information about the parameters in
the second row and the second column of the covariance matrix. In most
software packages, such as the SAS procedure MIXED, this bookkeeping is performed
automatically, as will be discussed next.
Little and Rubin (1987) fitted the same eight models as Jennrich and
Schluchter (1986) to the incomplete growth data set. Whereas Little and
Rubin made use of the EM algorithm, we set out to perform our analysis
with direct maximization of the observed likelihood (with Fisher scoring or
Newton-Raphson) in PROC MIXED. The results ought to coincide. Table
17.5 reproduces the findings of Little and Rubin. We added p-values.
Applying the programs to the data yields some discrepancies, as seen from
the model fit summary in Table 17.7.
Let us take a close look at these discrepancies. Although most of the tests
performed lead to the same conclusion, there is one fundamental difference.
In Table 17.5, the AR(1) model is rejected whereas it is not in Table 17.7.
A puzzling difference is that the maximized log-likelihoods are different
for Models 1–5, but not for Models 6–8. The same holds for the mean
and covariance parameter estimates. To get a handle on this problem, let us
consider the REPEATED statement used for, say, Model 1.
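Such a statement is of the following form (a sketch; IDNR is the subject identifier and the unstructured covariance corresponds to Model 1):

  repeated / type = un subject = idnr;

The data set to which it is applied contains one record per available measurement, so that the age 10 records of the nine incomplete children are simply absent. An extract is shown below (columns: observation number, IDNR, AGE, an indicator for sex, and the measurement):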
1 1 8 2 21.0
2 1 10 2 20.0
3 1 12 2 21.5
4 1 14 2 23.0
5 2 8 2 21.0
6 2 10 2 21.5
7 2 12 2 24.0
8 2 14 2 25.5
9 3 8 2 20.5
10 3 12 2 24.5
11 3 14 2 26.0
...
97 27 8 1 22.0
98 27 12 1 23.5
99 27 14 1 25.0
This statement identifies the subject in terms of IDNR blocks but does
not specify the ordering of the observations within a subject. Thus, PROC
MIXED assumes the default ordering: 1, 2, 3, 4 for a complete subject and,
erroneously, 1, 2, 3 for an incomplete one, whereas the correct incomplete
ordering is 1, 2, 4. This means that, by default, dropout is assumed. Since
this assumption is inadequate for the growth data, Models 1–5 in Table
17.7 are incorrect. The random-effects Model 6, on the other hand, uses
a RANDOM statement in which the variable AGE conveys the information needed to correctly calculate
the random-effects parameters. Indeed, for an incomplete observation,
the correct design matrix
\[
Z_i \;=\;
\begin{pmatrix}
1 & 8 \\
1 & 12 \\
1 & 14
\end{pmatrix}
\]
is used.
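A sketch of such a RANDOM statement (random intercepts and AGE slopes with an unstructured 2 × 2 covariance matrix; variable names as before):

  random intercept age / type = un subject = idnr g;

The 'g' option prints the estimated random-effects covariance matrix.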
TABLE 17.7. Growth Data. Inadequate MAR analysis (Little and Rubin). Model
fit summary.
There are two equivalent ways to overcome this problem. The first is to
adapt the data set slightly. An example is given in Table 17.8.
The effect of using this data set is, of course, that incomplete records
are deleted from the analysis, but that the relative positions are correctly
passed on to PROC MIXED. Running Models 1–8 on this data set yields
exactly the same results as in Table 17.5.
The second way is to pass the timing information to PROC MIXED by listing the variable AGE in the REPEATED statement.
However, such a program generates an error, since the variables in the REPEATED
statement have to be categorical variables, termed CLASS variables
in PROC MIXED (see Section 8.2.6). One of the tricks to overcome
this issue is to use the program
data help;
  set growthav;
  agec = age;   /* categorical copy of AGE, for use in the CLASS and REPEATED statements */
run;
Thus, there are two identical copies of the variable AGE, only one of which
is treated as a class variable.
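One possible form of the corresponding call, sketched here for the linear-trend Model 2 (the response and sex variable names MEASURE and SEX are assumptions):

  proc mixed data = help method = ml;
    class idnr sex agec;                      /* AGEC is the categorical copy       */
    model measure = sex age*sex / solution;   /* AGE stays continuous in the mean   */
    repeated agec / type = un subject = idnr; /* AGEC orders the repeated measures  */
  run;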
TABLE 17.8. Growth Data. Extract of the incomplete data set. The missing ob-
servations are explicitly indicated.
1 1 8 2 21.0
2 1 10 2 20.0
3 1 12 2 21.5
4 1 14 2 23.0
5 2 8 2 21.0
6 2 10 2 21.5
7 2 12 2 24.0
8 2 14 2 25.5
9 3 8 2 20.5
10 3 10 2 .
11 3 12 2 24.5
12 3 14 2 26.0
...
105 27 8 1 22.0
106 27 10 1 .
107 27 12 1 23.5
108 27 14 1 25.0
The analyses of the complete and of the incomplete data sets yield the same conclusions. In both cases, linear profiles turn out to be
consistent with the data, but parallel profiles do not. A Toeplitz correlation
structure (Model 4) is acceptable, as well as a random intercepts and slopes
model (Model 6). These models can be simplified further to compound
symmetry (Model 7). The assumption of no correlation between repeated
measures (Model 8) is untenable. This means that Model 7 is again the most
parsimonious description of the data among the eight models considered.
It has to be noted that the rejection of Models 3 and 5 is less compelling
in the MAR analysis than it was in the complete data set. Of course, this
is to be expected due to the reduction in the sample size, or rather in the
number of available measurements. The likelihood ratio test statistic for a
direct comparison of Model 5 to Model 4 is 11.494 on 2 degrees of freedom
(p = 0.0032), which is, again, a clear indication of an unacceptable fit.
FIGURE 17.4. Growth Data. Profiles for a selected set of models. MAR analysis.
(The small dots are the observed group means for the complete data set. The large
dots are the corresponding quantities for the incomplete data.)
The two sets of dots in Figure 17.4 reflect the means for the complete data set and for those observed at
age 10, respectively. Since the average of the observed measurements at age
10 is biased upward, the fitted profiles from the complete case analysis and
from unconditional mean imputation were too high. Clearly, the average
observed from the data is the same for the complete case analysis, the
unconditional mean imputation, the available case analysis, and the present
analysis. The most crucial difference is that the current Model 1, although
saturated in the sense that there are eight mean parameters (one for each
age by sex combination), does not let the (biased) observed and fitted
averages at age 10 coincide, in contrast to the means at ages 8, 12, and 14.
Indeed, if the model specification is correct, then an ignorable likelihood
analysis is consistent for the correct complete data mean, rather than for
the observed data mean. Of course, this effect might be blurred in relatively
small data sets due to small-sample variability.
This discussion touches upon the key distinction between the frequentist
available case analysis of Section 16.4, with example in Section 17.4.2, and
the present likelihood based available case analysis. The method of Sec-
tion 16.4 constructs an estimate for the age 10 parameters, irrespective of
the (extra) information available for the other parameters.
TABLE 17.9. Growth Data. Means per age and sex group: complete data set, and observed and predicted means for the incomplete data set.
Age   Complete   Incomplete Obs.   Incomplete Pred.
Girls
8 21.18 21.18 21.18
10 22.23 22.79 21.58
12 23.09 23.09 23.09
14 24.09 24.09 24.09
Boys
8 22.88 22.88 22.88
10 23.81 24.14 23.17
12 25.72 25.72 25.72
14 27.47 27.47 27.47
The likelihood approach implicitly constructs a correction, based on (1) the fact that the
measurements at ages 8, 12, and 14 differ between the subgroups of com-
plete and incomplete observations and (2) the fairly strong correlation be-
tween the measurement at age 10 on the one hand, and the measurements
at ages 8, 12, and 14 on the other hand. A detailed treatment of likelihood
estimation in incomplete multivariate normal samples is given in Little and
Rubin (1987, Chapter 6). Clearly, this correction leads to an overshoot in
the fairly small growth data set, whence the predicted mean at age 10 is
actually smaller than the one of the complete data set. The means are
reproduced in Table 17.9. All means coincide for ages 8, 12, and 14. Irre-
spective of the small-sample behavior encountered here, the validity under
MAR and the ease of implementation are good arguments that favor this
ignorable analysis over other techniques.
The fitted mean profiles of Models 2, 3, and 7 are fairly similar to their complete data counterparts. This
is in contrast to analyses obtained from the simple methods, described in
Chapter 16 and applied in Verbeke and Molenberghs (1997, Sections 5.4–
5.6).
Let us now study this method in terms of the effect on the estimated
covariance structure. The estimated covariance matrix of Model 1 is
\[
\hat\Sigma \;=\;
\begin{pmatrix}
5.0142 & 4.8796 & 3.6205 & 2.5095 \\
4.8796 & 6.6341 & 3.3772 & 3.0621 \\
3.6205 & 3.3772 & 5.9775 & 3.8248 \\
2.5095 & 3.0621 & 3.8248 & \cdot
\end{pmatrix}.
\]
The variance at age 10 is inflated compared to its complete data set coun-
terpart (17.8). The dominating reason is that the sample size at age 10 is
only two-thirds of the original one, thereby making all estimators involved
more variable. In other settings, the variance may increase due to an in-
creased homogeneity in the selected subset. A correct analysis, such as the
ignorable one considered here, should acknowledge this additional source
of uncertainty. The correlation matrices are as follows:
• Model 1 (unstructured):
\[
\begin{pmatrix}
1.0000 & 0.8460 & 0.6613 & 0.5216 \\
0.8460 & 1.0000 & 0.5363 & 0.5533 \\
0.6613 & 0.5363 & 1.0000 & 0.7281 \\
0.5216 & 0.5533 & 0.7281 & 1.0000
\end{pmatrix}. \qquad (17.16)
\]
• Model 4 (Toeplitz):
\[
\begin{pmatrix}
1.0000 & 0.6248 & 0.6688 & 0.4307 \\
0.6248 & 1.0000 & 0.6248 & 0.6688 \\
0.6688 & 0.6248 & 1.0000 & 0.6248 \\
0.4307 & 0.6688 & 0.6248 & 1.0000
\end{pmatrix}.
\]
• Model 5 (AR(1)):
\[
\begin{pmatrix}
1.0000 & 0.6265 & 0.3925 & 0.2459 \\
0.6265 & 1.0000 & 0.6265 & 0.3925 \\
0.3925 & 0.6265 & 1.0000 & 0.6265 \\
0.2459 & 0.3925 & 0.6265 & 1.0000
\end{pmatrix}.
\]
Finally, for the random intercepts and slopes model (Model 6), the estimated covariance matrix of the random effects equals
\[
\hat D \;=\;
\begin{pmatrix}
6.7853 & -0.3498 \\
-0.3498 & 0.0337
\end{pmatrix}.
\]
There are still a few issues with the estimation of precision and with hy-
pothesis testing, related to this type of analysis. These will be discussed in
Chapter 21. We will first study the missingness mechanism for the growth
data.
Consider, for instance, the logistic regression
\[
\ln\frac{P(R_i = 0 \mid \boldsymbol{y}_i)}{1 - P(R_i = 0 \mid \boldsymbol{y}_i)} \;=\; \psi_0 + \psi_1 y_{i1},
\]
where Ri = 1 for complete observations and 0 otherwise, and Yi1 is the
measurement at age 8. Of course, this model is easily adapted to include
a different subset of measurements into the linear predictor. Table 17.10
shows the model fit for a few choices. In each of the four models, missing-
ness depends on a single outcome. When this dependence is on Yi2 , the
process is nonrandom; it is MAR otherwise. We used the complete data
set to estimate the parameters of the nonrandom model (with linear pre-
dictor including Yi2 ). This is generally not possible. Ways to overcome this
problem are discussed in Sections 17.2.2 and 17.5. The only important co-
variates are the measurements at ages 8 and 12 (i.e., the ones adjacent to
the possibly missing measurement). A backward logistic regression model
retains only Yi1 . Its coefficients (standard errors) are estimated as
\[
\ln\frac{P(R_i = 0 \mid y_{i1})}{1 - P(R_i = 0 \mid y_{i1})} \;=\; 41.22\,(18.17) - 1.94\,(0.85)\,y_{i1}.
\]
This model implies that the missingness probability decreases with increas-
ing Yi1 .
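A minimal sketch of how such a nonresponse model can be fitted (the one-record-per-child data set DROPSET and the age 8 variable MEAS8 are assumed names; R is coded as in the text, 1 for complete and 0 for incomplete children):

  proc logistic data = dropset;
    model r(event = '0') = meas8;   /* models P(R = 0), i.e., missingness at age 10 */
  run;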
\[
\text{boys:}\qquad \ln\frac{P(R_i = 0 \mid y_{i1}, x_i = 0)}{1 - P(R_i = 0 \mid y_{i1}, x_i = 0)} \;=\; \infty\,(22 - y_{i1}),
\]
\[
\text{girls:}\qquad \ln\frac{P(R_i = 0 \mid y_{i1}, x_i = 1)}{1 - P(R_i = 0 \mid y_{i1}, x_i = 1)} \;=\; \infty\,(20.75 - y_{i1}).
\]
The model for boys is interpreted as follows:
\[
P(R_i = 0 \mid y_{i1}, x_i = 0) \;=\;
\begin{cases}
1 & \text{if } y_{i1} < 22, \\
0.5 & \text{if } y_{i1} = 22, \\
0 & \text{if } y_{i1} > 22.
\end{cases}
\]
This is exactly what is seen in Table 2.5. The same is true for girls, with
the sole difference that the cut point lies halfway between two observable
outcome values (20.75):
\[
P(R_i = 0 \mid y_{i1}, x_i = 1) \;=\;
\begin{cases}
1 & \text{if } y_{i1} < 20.75, \\
0 & \text{if } y_{i1} > 20.75.
\end{cases}
\]
The models are displayed in Figure 17.5. Thus, the missingness mecha-
nism used by Little and Rubin (1987) is, in fact, deterministic (given the
outcomes at age 8). This should not be confused with the fact that nonre-
sponse depends (very clearly) on the observed outcomes and the observed
outcomes only, whence it is missing at random ! A similar mechanism is
employed in Section 21.3.
FIGURE 17.5. Growth Data. Logistic nonresponse models for growth data.
17.5 A Selection Model for Nonrandom Dropout
In Section 17.2, the toenail data were analyzed assuming both MAR (Sec-
tion 17.2.1) and MNAR (Section 17.2.2). The responses were modeled using
linear mixed models and the dropout process was described by means of
logistic regression. This is in agreement with the model proposed by Diggle
and Kenward (1994). Using the notation laid out in Section 15.9, we will
now slightly generalize this model.
ment model for which we use a linear mixed model and the dropout model,
assumed to follow a logistic regression, can then be fitted separately. If
ω = 0, the dropout process is assumed to be nonrandom. A special case is
given by (17.3). Then, (17.4) can be used to calculate the dropout proba-
bility at a given occasion.
In line with our discussion in Sections 17.1 and 17.2, Rubin (1994) points
out that such analyses heavily depend on the assumed dropout process,
whereas it is impossible to find evidence for or against the model, un-
less supplemental information on the dropouts is available. In practice, a
dropout model may be found to be nonrandom solely because one or a few
influential subjects have driven the analysis. This and related issues will
be studied in Chapter 19, which is devoted to sensitivity analysis for the
selection model.
17.6 A Selection Model for the Vorozole Study
Fitted profiles are displayed in Figure 17.6 and Figure 17.7. In Figure 17.7,
empirical Bayes estimates of the random effects are included, whereas in
Figure 17.6, the purely marginal mean is used. For each treatment group,
we obtain three sets of profiles. The fitted complete profile is the average
curve that would be obtained had all individuals been completely observed.
If we use only those predicted values that correspond to occasions at which
an observation was made, then the fitted incomplete profiles are obtained.
The latter are somewhat above the former when the random effects are in-
cluded, and somewhat below when they are not, suggesting that individuals
with lower measurements are more likely to disappear from the study. In
addition, although the fitted complete curves are very close (the treatment
effect was not significant), the fitted incomplete curves are not, suggesting
TABLE 17.11. Vorozole Study. Selection model parameter estimates and standard
errors.
that there is more dropout in the standard arm than in the treatment arm.
This is in agreement with the dropout rate, displayed in Figure 14.1, and
should not be seen as evidence of a bad fit. Finally, the observed curves,
based on the measurements available at each time point, are displayed.
These are higher than the fitted ones, but this should be viewed with the
standard errors of the observed means in mind (see Figure 4.2).
With larger data sets such as this one, convergence of nonrandom models
can be painstakingly difficult in, for example, OSWALD, and one has to
FIGURE 17.6. Vorozole Study. Fitted profiles (averaging the predicted means for
the incomplete and complete measurement sequences, without the random effects).
The fitted dropout model equals
\[
\mathrm{logit}\bigl[g(h_{ij})\bigr] \;=\; 0.033\,(0.401) - 0.013\,(0.003)\,\mathrm{base}_i
- 0.023\,(0.005)\,\frac{y_{i,j-2}+y_{i,j-1}}{2}
- 0.047\,(0.010)\,\frac{y_{i,j-1}-y_{i,j-2}}{2}, \qquad (17.19)
\]
indicating that both size and increment are significant predictors for drop-
out. We conclude that dropout increases with a decrease in baseline, in
overall level of the outcome variable, as well as with a decreasing evolution
in the outcome. Recall that fitting the dropout model could be done using
a logistic regression of the type (17.19), given (17.4) and the discussion
following this equation.
FIGURE 17.7. Vorozole Study. Fitted profiles (averaging the predicted means for
the incomplete and complete measurement sequences, including the random ef-
fects).
Using OSWALD, both dropout models (17.18) and (17.19) can be com-
pared with their nonrandom counterparts, where yij is added to the linear
predictor. The first one becomes model (17.20), and the second model (17.21).
It turns out that model (17.21) is not significantly better than (17.19) and,
hence, we retain (17.19) as the most plausible description of the dropout
process we have so far obtained.
18
Pattern-Mixture Models
18.1 Introduction
where
\[
\mu(t) \;=\;
\begin{pmatrix}
\mu_1(t) \\ \mu_2(t) \\ \mu_3(t)
\end{pmatrix}
\qquad\text{and}\qquad
\Sigma(t) \;=\;
\begin{pmatrix}
\sigma_{11}(t) & \sigma_{21}(t) & \sigma_{31}(t) \\
\sigma_{21}(t) & \sigma_{22}(t) & \sigma_{32}(t) \\
\sigma_{31}(t) & \sigma_{32}(t) & \sigma_{33}(t)
\end{pmatrix},
\]
and where the marginal mean of interest equals
\[
\mu \;=\; \sum_{t=1}^{3} \pi_t\,\mu(t).
\]
Its variance can be derived by application of the delta method (see Sections
18.3, 18.4, 20.6.2, and 24.4.2).
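In a brief sketch of this delta-method step (assuming only that the estimators of the πt and of the pattern-specific means are jointly asymptotically normal with covariance matrix V), write θ = (π1 , π2 , π3 , µ(1)′, µ(2)′, µ(3)′)′ and A = ∂µ/∂θ′ = (µ(1), µ(2), µ(3), π1 I3 , π2 I3 , π3 I3 ); then
\[
\widehat{\mathrm{Var}}(\hat\mu) \;\approx\; \hat A\,\hat V\,\hat A'.
\]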
However, although the πt can be simply estimated from the observed pro-
portions in each dropout group, only 16 of the 27 response parameters can
be identified from the data without making further assumptions. These
16 comprise all the parameters from the completers plus those from the
following two submodels. For t = 2, these are the parameters of the bivariate normal distribution of the first two measurements,
\[
N\!\left(
\begin{pmatrix}\mu_1(2)\\ \mu_2(2)\end{pmatrix};\;
\begin{pmatrix}\sigma_{11}(2) & \sigma_{21}(2)\\ \sigma_{21}(2) & \sigma_{22}(2)\end{pmatrix}
\right),
\]
and for t = 1, those of
\[
N\bigl(\mu_1(1);\,\sigma_{11}(1)\bigr).
\]
Little (1993) shows how these constraints can be used to identify all the
parameters in the model and so obtain estimates for these and the marginal
probabilities. The CCMV restrictions essentially equate conditional distri-
butions beyond time t (i.e., those unidentifiable from this dropout group),
with the same conditional distributions from the completers. A stronger
restriction is to identify the former conditional distributions and all condi-
tional distributions from those who drop out after t. This has been called
the available case missing value (ACMV) restrictions and it has been shown
(Molenberghs, Michiels, Kenward, and Diggle 1998; see also Section 20.2)
that for dropout, these conditions are equivalent to MAR in the selection
model framework. Again, such constraints can be used to develop meth-
ods of estimation or to set up schemes for sensitivity analysis. A detailed
account is given in Chapter 20.
18.1.2 A Paradox
g(t|y1 , y2 ) := f (t|y1 , y2 ),
p(t) := f (t),
ft (y1 , y2 ) := f (y1 , y2 |t).
\[
f_1(y_1, y_2) \;=\; \frac{1 - g(y_1, y_2)}{g(y_1, y_2)}\,\frac{p}{1 - p}\, f_2(y_1, y_2).
\]
All selection model factors are identified, as are the pattern-mixture quan-
tities on the right-hand side. However, the left-hand side is not entirely
identifiable. We can further separate the identifiable from the nonidentifi-
able quantities:
\[
f_1(y_2 \mid y_1) \;=\; \frac{1 - g(y_1, y_2)}{g(y_1, y_2)}\,\frac{p}{1 - p}\, f_2(y_2 \mid y_1)\,\frac{f_2(y_1)}{f_1(y_1)}. \qquad (18.5)
\]
This clearly illustrates the need for sensitivity analysis. Due to the different
nature of the selection and pattern-mixture models, specific forms for each
of the two contexts will be presented in Chapters 19 and 20, respectively.
Thus, the fixed effects as well as the covariance parameters are allowed to
change with dropout pattern and a priori no restrictions are placed on the
structure of this change.
Fitting such a model only requires specifying a marginal model for the dropout process and
a conditional model for the observed outcomes, given the dropout pattern,
as in (18.6). Further, as for ignorable selection models, both models can be
fitted separately, provided their parameters are separable.
Hogan and Laird (1997) noted that in order to estimate the large num-
ber of parameters in general pattern-mixture models, one has to make the
awkward requirement that each dropout pattern is sufficiently “filled”; in
other words, one has to require large numbers of dropouts. This problem is
less prominent in simplified models. Note however that simplified models,
qualified as “assumption rich” by Sheiner, Beal, and Dunne (1997), are also
making untestable assumptions and therefore illustrate that even pattern-
mixture models do not provide a free lunch. A main advantage however is
that the need of assumptions and their implications are more obvious. For
example, it is not possible to assume an unstructured time trend in incom-
plete patterns, except if one restricts attention to the time range from onset
until dropout. In contrast, assuming a linear time trend allows estimation
in all patterns containing at least two measurements.
For the TDO data, we will assume the dropout patterns Ti to be sampled
from a multinomial distribution with support {1, 2, 3, 4, 5, 6, 7}, where the
class Ti = 7 contains all completers. The associated multinomial probability
vector is denoted by π = (π1 , π2 , . . . , π7 ) . Further, our model for Yio ,
conditional on Ti = ti , is assumed to be of the same form as model (17.1),
TABLE 18.1. Toenail Data. Fitted dropout probabilities under the multinomial
dropout model.
Dropout occasion t:            1     2     3     4     5     6     7
Fitted prob. π̂t = P̂(Ti = t):  0.02  0.02  0.04  0.05  0.10  0.01  0.76
FIGURE 18.1. Toenail Data. Fitted average profiles for each dropout pattern,
obtained from fitting the mixed models (18.8). For each pattern, the number of
subjects in group A and group B is denoted by nA and nB , respectively.
For each pattern, the likelihood ratio statistic 2 ln λ measures the difference
between the two treatment groups with respect to the fitted average profile.
The sum of all of these statistics could be used to test whether there is any
treatment difference at all for any of the dropout patterns. In our example,
this sum equals 26.5, on 18 degrees of freedom. When compared to a χ2 -
distribution, there is no evidence for any treatment effect (p = 0.089).
However, this should be interpreted with care since the χ2 -approximation
may not be accurate due to the small numbers of subjects observed in some
of the dropout patterns.
For the dropout patterns where at least three measurements are available
per subject (ti ≥ 3), we extrapolate the quadratic trend over time, fitted
from the observed data. Borrowing the linear and quadratic time effects,
or just the quadratic time effect, from pattern ti = 3, we can extrapolate
the patterns ti = 1 and ti = 2 as well. More precisely, our extrapolation
assumes that
\[
\left\{
\begin{array}{l}
\beta_{A1}(1) = \beta_{A1}(3), \\
\beta_{A2}(1) = \beta_{A2}(2) = \beta_{A2}(3), \\
\beta_{B1}(1) = \beta_{B1}(3), \\
\beta_{B2}(1) = \beta_{B2}(2) = \beta_{B2}(3).
\end{array}
\right. \qquad (18.10)
\]
This expresses our belief that the average behavior in the first two patterns
is likely to be similar to the third pattern. The obtained extrapolations are
indicated by dashed lines in Figure 18.2. Note that the strongly positive
estimated average quadratic time effect for the patterns ti = 3 and ti = 4
implies extremely steep extrapolated curves for all patterns with ti ≤ 4.
FIGURE 18.2. Toenail Data. Extrapolated fitted average profiles for each dropout
pattern, obtained from fitting the mixed models (18.8), imposing the restrictions
(18.10). For each pattern, the number of subjects in group A and group B is
denoted by nA and nB , respectively.
FIGURE 18.3. Toenail Data. Fitted marginal average profiles (18.11) for both
treatment groups, obtained from fitting the pattern-mixture model (18.8), under
the restrictions (18.10).
versus the alternative hypothesis that H0 does not hold. Following Little
(1993), we tested the above hypothesis using a Wald-type test, where the
asymptotic covariance matrix of the estimators of the three functions in
(18.12) is estimated via the delta method. Other methods for precision es-
timation in pattern-mixture models are described by Michiels, Molenberghs
and Lipsitz (1999). These authors use multiple imputation in the context
of categorical data.
Let β (t) denote the vector of all six fixed effects in the linear mixed model
corresponding to pattern ti = t. The asymptotic covariance matrix
of (π̂′, β̂(1)′, . . . , β̂(7)′)′ is needed; for the multinomial dropout probabilities,
\[
\widehat{\mathrm{Var}}(\hat{\boldsymbol\pi}) \;=\; \bigl(\mathrm{diag}(\hat{\boldsymbol\pi}) - \hat{\boldsymbol\pi}\hat{\boldsymbol\pi}'\bigr)/N.
\]
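In generic form (a sketch; F̂ denotes the vector of the three estimated functions in (18.12) and Â its matrix of derivatives with respect to the full parameter vector), the Wald statistic is
\[
W \;=\; \hat F'\Bigl[\hat A\,\widehat{\mathrm{Var}}\bigl(\hat\pi, \hat\beta(1), \ldots, \hat\beta(7)\bigr)\hat A'\Bigr]^{-1}\hat F,
\]
to be referred to a χ² distribution with 3 degrees of freedom.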
The resulting observed test statistic equals 2.464, which is not significant
(p = 0.482) when compared to a χ2 -distribution with 3 degrees of freedom.
This may seem counterintuitive in view of the large difference between both
treatment groups, seen in Figure 18.3. Again, it should be emphasized that
many parameters were estimated with very little precision. Further, the
observed difference is, to a large extent, a function of extrapolation rather
than observation.
In Chapter 14, the individual and average profiles for the Vorozole study
were plotted in a pattern-specific way (Figures 14.5 and 14.6).
FIGURE 18.4. Vorozole Study. Fitted selection (solid line) and first pat-
tern-mixture models (dashed lines).
the interaction of pattern with time2 . This reduced model can be found
in Table 18.2. As was the case with the selection model in Table 17.11,
the treatment effect is nonsignificant. Indeed, a single degree of freedom
F -test yields a p-value of 0.687. Note that such a test is possible since
treatment effect does not interact with pattern, in contrast to the model
which we will describe next. The fitted profiles are displayed in Figure 18.4.
We observe that the profiles for both arms are very similar. This is due to
the fact that treatment effect is not significant but perhaps also because
we did not allow a more complex treatment effect. For example, we might
consider an interaction of treatment with the square of time and, more im-
portantly, a treatment effect which is pattern-specific. Some evidence for
such an interaction is seen in Figure 14.6.
Our second, expanded model allowed for up to cubic time effects, the in-
teraction of time with dropout pattern, dominant site, baseline value and
treatment, as well as their two- and three-way interactions. After a back-
ward selection procedure, the effects included are time and time2 , the two-
way interaction of time and dropout pattern, as well as three-factor in-
teractions of time and dropout pattern with (1) baseline, (2) group, and
(3) dominant site. Finally, time2 interacts with dropout pattern and with
the interaction of baseline and dropout pattern. No cubic time effects were
necessary, which is in agreement with the observed profiles in Figure 14.6.
TABLE 18.2. Vorozole Study. Parameter estimates and standard errors for the
first pattern-mixture model.
Variance Parameters:
Random intercept 78.45
Serial variance 95.38
Serial association 8.85
Measurement error 73.77
The parameter estimates of this model are displayed in Tables 18.3 and
18.4. The model is graphically represented in Figure 18.5.
TABLE 18.3. Vorozole Study. Parameter estimates and standard errors for the
second pattern-mixture model (part I). Each column represents an effect, for which
a main effect is given, as well as interactions with the dropout patterns.
method, as in Section 18.3. See also Section 20.6.2. This effect is equal to
−0.286(0.288), producing a p-value of 0.321, which is still nonsignificant.
TABLE 18.4. Vorozole Study. Parameter estimates and standard errors for the
second pattern-mixture model (part II). Each column represents an effect, for
which a main effect is given, as well as interactions with the dropout patterns.
FIGURE 18.5. Vorozole Study. Fitted selection (solid line) and second pat-
tern-mixture models (dashed lines).
However, there are several reasons why more complex manipulations may
be needed. First, there will often be interest in the marginal distribution of
the responses, for which a mixture of an effect over the different dropout
patterns is needed, such as in (18.9).
19
Sensitivity Analysis for Selection Models
19.1 Introduction
When more flexible assumptions, such as MAR or even MNAR, are considered,
several choices have to be made. For example, one has to choose
between selection and pattern-mixture models, or an alternative framework
such as shared-parameter models (Wu and Bailey 1988, 1989, Wu and Car-
roll 1988, DeGruttola and Tu 1994). For a review, see Little (1995).
Particularly within the selection modeling framework, there has been an in-
creasing literature on nonrandom missing data. At the same time, concern
has been growing precisely about the fact that models often rest on strong
assumptions and relatively little evidence from the data themselves. This
point was already raised by Glynn, Laird and Rubin (1986), who indicate
that this is typical for so-called selection models, whereas it is much less so
for a pattern-mixture model (Section 18.1.2). In Section 17.1 attention was
drawn to the fact that much of the debate on selection models is rooted in
the econometrics literature, in particular Heckman’s selection model (Heck-
man 1976). Draper (1995) and Copas and Li (1997) provide useful insight
in model uncertainty and nonrandomly selected samples. Vach and Blet-
tner (1995) study the case of incompletely observed covariates in logistic
regression.
Because the model of Diggle and Kenward (1994) fits within the class of se-
lection models, it is fair to say that, at first, it raised expectations that were too high.
This was made clear by many discussants of the paper. It implies that, for
example, formal tests for the null hypothesis of random missingness, al-
though technically possible, should be approached with caution. See also
Section 18.1.1. In Section 17.2.2, it was shown, using the toenail data, that
excluding a small amount of measurement error can have a serious impact
on the rest of the model parameters. In particular, the likelihood ratio
test statistic for the random dropout null hypothesis changes drastically
(Table 17.1).
Section 19.2 adapts the model of Diggle and Kenward (1994) to a form
useful for sensitivity analysis. Such a sensitivity analysis method, based
on local influence (Cook 1986; Thijs, Molenberghs, and Verbeke 2000; see
also Section 11.2) is introduced in Section 19.3 and applied to the rats
data in Section 19.4. Note that in Section 24.4, a comparison is made with
a more conventional global influence analysis (Chatterjee and Hadi 1988).
Both informal and formal methods of sensitivity are applied to the mastitis
data in Section 19.5. Note that a sensitivity analysis of the milk protein
contents data is given in Section 24.4. An outlook on alternative approaches
is given in Section 19.6. Random-coefficient-based models are discussed
in Section 19.7. We will conclude that caution is needed with selection
models and that a mechanical use, perhaps stimulated by the availability
of software such as the SPlus suite of functions termed OSWALD (Smith,
Robertson, and Diggle 1996), should be avoided.
In Section 17.5, the selection model of Diggle and Kenward (1994) was pre-
sented, specifically with a linear mixed model for the measurement process
and a logistic regression based dropout model (see also Sections 17.2 and
17.6).
George Box famously remarked that all statistical models are wrong,
but some are useful. Cook (1986) uses this idea to motivate his assessment
of local influence. He suggests that more confidence can be put in a model
which is relatively stable under small modifications. The best known per-
turbation schemes are based on case deletion (Cook and Weisberg 1982),
in which the effect is studied of completely removing cases from the analy-
sis. A quite different paradigm is the local influence approach where one
investigates how the results of an analysis are changed under small pertur-
bations of the model. In the framework of the linear mixed model, Beckman,
Nachtsheim, and Cook (1987) used local influence to assess the effect of per-
turbing the error variances, the random-effects variances, and the response
vector. In the same context, Lesaffre and Verbeke (1998) have shown that
the local influence approach is also useful for the detection of influential
subjects in a longitudinal data analysis. Moreover, because the resulting
influence diagnostics can be expressed analytically, they often can be de-
composed in interpretable components, which yield additional insights in
the reasons why some subjects are more influential than others.
Since this so-called influence graph can only be depicted when N = 2, Cook
(1986) proposed looking at local influence [i.e., at the normal curvatures
Ch of ξ(ω) in ω 0 ], in the direction of some N dimensional vector h of unit
length. Let ∆i be the s-dimensional vector defined by
\[
\Delta_i \;=\; \left.\frac{\partial^2 \ell_i(\gamma \mid \omega_i)}{\partial\omega_i\,\partial\gamma}\right|_{\gamma = \hat\gamma,\;\omega_i = 0}.
\]
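With these ingredients, the normal curvature in the direction h takes the standard local influence form (Cook 1986)
\[
C_h \;=\; 2\,\bigl|\,h'\,\Delta'\,\ddot L^{-1}\,\Delta\,h\,\bigr|,
\]
where ∆ is the matrix with columns ∆i and L̈ denotes the matrix of second-order derivatives of the log-likelihood of the unperturbed model, evaluated at γ = γ̂.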
Let Vi,11 be the predicted covariance matrix of the observed vector
(yi1 , . . . , yi,d−1 )′, let Vi,22 be the predicted variance of the missing observation
yid , and let Vi,12 be the vector of predicted covariances between the
elements of the observed vector and the missing observation. It then fol-
lows from the linear mixed model (3.8) that the conditional expectation for
the observation at dropout, given the history, equals
\[
\lambda(y_{id}\mid h_{id}) \;=\; \lambda(y_{id}) + V_{i,21}\,V_{i,11}^{-1}\,[h_{id} - \lambda(h_{id})]. \qquad (19.6)
\]
The derivatives of (19.6) w.r.t. the fixed effects and variance components
in the measurement model are
\[
\frac{\partial\lambda(y_{id}\mid h_{id})}{\partial\beta} \;=\; x_{id} - V_{i,21}\,V_{i,11}^{-1}\,X_{i,(d-1)},
\]
\[
\frac{\partial\lambda(y_{id}\mid h_{id})}{\partial\alpha} \;=\; \left[\frac{\partial V_{i,21}}{\partial\alpha} - V_{i,21}\,V_{i,11}^{-1}\,\frac{\partial V_{i,11}}{\partial\alpha}\right]V_{i,11}^{-1}\,[h_{id} - \lambda(h_{id})],
\]
respectively, where xid is the dth row of Xi and where Xi,(d−1) indicates
the first (d − 1) rows of Xi .
The second factor of (19.7) involves L̈(θ) and is therefore harder to study.
However, we can still make progress if we are prepared to make some ap-
proximations. The off-diagonal block of the observed information matrix
L̈(θ) pertaining to the mixed derivatives w.r.t. β and α is not equal to
zero. The corresponding block of the expected information matrix is zero
for a complete data problem, but is not for an incomplete data set, unless
the missing data are MCAR (Kenward and Molenberghs 1998). However,
these authors also argue that in many practical settings, the difference
might be small (see also Chapter 21). Therefore, we will assume that L̈(θ)
is block-diagonal and that Ci (θ) ≈ Ciap (β) + Ciap (σ 2 , ν 2 ).
\[
\ddot L(\beta) \;=\; -\sum_{i=1}^{N} X_{i(d-1)}'\,V_{i,11}^{-1}\,X_{i(d-1)}
\;=\; -\,\frac{1}{\sigma^{2}}\sum_{i=1}^{N} X_{i(d-1)}'\left[I_{d-1}-\frac{\nu^{2}}{\sigma^{2}+(d-1)\nu^{2}}\,J_{d-1}\right]X_{i(d-1)},
\]
where
\[
\overline X_{i(d-1)} \;=\; \frac{1}{d-1}\,1_{d-1}'\,X_{i(d-1)}
\]
denotes the row of column averages of $X_{i(d-1)}$. Further, the approximate contribution $C_i^{\mathrm{ap}}(\sigma^{2},\nu^{2})$ contains the quadratic form
\[
\left(-1,\;\frac{1}{\nu^{2}}\right)\ddot L^{-1}(\sigma^{2},\nu^{2})\left(-1,\;\frac{1}{\nu^{2}}\right)', \qquad (19.10)
\]
where
\[
\ddot L(\sigma^{2},\nu^{2})
\;=\;\sum_{i=1}^{N}\frac{d-1}{2\,[\sigma^{2}+(d-1)\nu^{2}]^{2}}
\begin{pmatrix}
[\sigma^{2}+(d-1)\nu^{2}]^{2}-\nu^{2}\,[2\sigma^{2}+(d-1)\nu^{2}] & 1\\
1 & d-1
\end{pmatrix},
\]
and $\overline{[h_{id}-\lambda(h_{id})]}$ represents the average difference between the elements of $h_{id}$ and
their predicted values.
The third factor is large when the squared average residual of the history at the
time of dropout is large.
for incomplete sequences. Further, vij equals g(hij )[1 − g(hij )], which is
the variance of the estimated dropout probability under MAR. Expression
(19.11) bears some resemblance with the hat-matrix diagonal, used for
diagnostic purposes in logistic regression (Hosmer and Lemeshow 1989).
One of the differences is that the contributions from a single individual
are summed in the first and third factors of (19.11), even though they
contribute independent pieces of information to the logistic regression. This
is because each individual is given a single weight ωi for an entire sequence
of measurements.
To get a good feel for when Ci (ψ) is large, simplify hij yij vij to
where the factor in curly braces equals the hat-matrix diagonal. In the case
of dropout, the same replacement as before for yid has to be made. When
the length of a measurement sequence is restricted to 2, then (19.13) and
(19.11) coincide. Note that this alternative perturbation scheme assigns
weights to observations rather than subjects, changing the interpretation
of the results of the analysis. Also, the graphical representation of the
results is more involved since series of influence measures Cij now have to
be studied and interpreted.
Thus far, the development has focused on the standard linear mixed model,
with random effects and measurement error. If, in addition, a serially cor-
related part of the variance structure is thought to be present, the variance
parameter α needs to be extended with the serial correlation parameters,
as was done in Section 3.3.4, and thus encompasses djk , the components
of the variance-covariance matrix of the random effects, τ 2 , the variance
of the serial process, ϕ, the serial correlation parameter, and σ 2 , the vari-
ance of the measurement error process. The general model is spelled out
in (3.11).
Most of the development in Section 19.3.2 remains the same. We will briefly
outline the changes that have to be made. For the various variance para-
meters, expression (19.7) can be written as
\[
\frac{\partial\lambda(y_{id}\mid h_{id})}{\partial d_{jk}} \;=\; \Bigl\{[Z_i D_{jk} Z_i']_{21} - V_{i,21}V_{i,11}^{-1}[Z_i D_{jk} Z_i']_{11}\Bigr\}\,V_{i,11}^{-1}\,[h_{id}-\lambda(h_{id})],
\]
\[
\frac{\partial\lambda(y_{id}\mid h_{id})}{\partial \tau^{2}} \;=\; \bigl[H_{i,21} - V_{i,21}V_{i,11}^{-1}H_{i,11}\bigr]\,V_{i,11}^{-1}\,[h_{id}-\lambda(h_{id})],
\]
\[
\frac{\partial\lambda(y_{id}\mid h_{id})}{\partial \phi} \;=\; \bigl[\tau^{2}K_{i,21} - V_{i,21}V_{i,11}^{-1}\,\tau^{2}K_{i,11}\bigr]\,V_{i,11}^{-1}\,[h_{id}-\lambda(h_{id})],
\]
\[
\frac{\partial\lambda(y_{id}\mid h_{id})}{\partial \sigma^{2}} \;=\; -V_{i,21}\,V_{i,11}^{-2}\,[h_{id}-\lambda(h_{id})].
\]
Here, the elements of $D_{jk}$ and $K_i$ are given by
\[
[D_{jk}]_{\ell m} \;=\; \delta_{j\ell}\delta_{km} + \delta_{jm}\delta_{k\ell} - \delta_{jk}\,\delta_{j\ell}\delta_{km},
\qquad
K_{i,\ell m} \;=\; |t_{i\ell}-t_{im}|\,e^{\phi|t_{i\ell}-t_{im}|}.
\]
FIGURE 19.1. Rat Data. Individual growth curves for the three treatment groups
separately. Influential subjects are highlighted by bold lines or dots.
In order to illustrate the above methodology, we will apply the local influ-
ence approach to data from a randomized experiment, designed to study
the effect of the inhibition of the testosterone production in rats. The data
were introduced in Section 2.1. The profiles were explored in Section 4.3.3.
A linear mixed model with random intercepts was fitted in Section 6.3.3
(equation (6.12)).
The individual profiles are shown in Figure 19.1. They can be linearized by
using the logarithmic transformation t = ln(1 + (age − 45)/10) for the time
scale. This is also the scale we will use from now on in all statistical analy-
ses. Note that the transformation was chosen such that t = 0 corresponds
to the start of the treatment. We assume a linear mixed model for the re-
sponse with common average intercept β0 for all three groups, with average
slopes β1 , β2 , and β3 for the three treatment groups, respectively, and as-
suming compound symmetry covariance structure, with common variance
σ 2 + ν 2 and common covariance ν 2 . These models are estimated under
MCAR, MAR, and MNAR processes, using the PCMID function in the
Splus suite of functions called OSWALD (Smith, Robertson, and Diggle
1996). The estimates are displayed in Table 19.1 (original data). Following
these models, and if we are prepared to believe the assumptions on which
TABLE 19.1. Rat Data. Maximum likelihood estimates (standard errors) of com-
pletely random, random and nonrandom dropout models, fitted to the rat data set,
with and without modification.
Original Data
Effect Parameter MCAR MAR MNAR
Measurement model:
Intercept β0 68.61 68.61 68.61
Slope control β1 7.51 7.51 7.50
Slope low dose β2 6.87 6.87 6.86
Slope high dose β3 7.31 7.31 7.30
Random intercept ν2 3.44 3.44 3.44
Measurement error σ2 1.43 1.43 1.43
Dropout model:
Intercept ψ0 −1.98 −8.48 −8.05
Prev. measurement ψ1 0.084 0.096
Curr. measurement ω = ψ2 −0.017
−2 log-likelihood 1777.3 1774.5 1774.5
Modified Data
Effect Parameter MCAR MAR MNAR
Measurement model:
Intercept β0 70.20 70.20 70.26
Slope control β1 7.52 7.52 7.39
Slope low dose β2 6.97 6.97 6.88
Slope high dose β3 7.21 7.21 6.98
Random intercept ν2 40.38 40.38 40.83
Measurement error σ2 1.42 1.42 1.46
Dropout model:
Intercept ψ0 −2.20 −0.79 3.23
Prev. measurement ψ1 −0.015 0.32
Curr. measurement ω = ψ2 −0.38
−2 log-likelihood 1906.6 1894.6 1890.2
they rest, there is little evidence of MAR and no evidence for MNAR. The
estimates in Table 19.1 differ from those in Table 6.4, since the latter were
obtained with the REML method.
FIGURE 19.2. Rat Data. Index plots of Ci , Ci (θ), Ci (β), Ci (α), Ci (ψ), and of
the components of the direction hmax of maximal curvature.
The absolute magnitude of Ci (·) depends upon the scale on which the measure-
ments and/or covariates are expressed, and hence influence graphs should
be interpreted in a relative fashion.
The largest Ci are observed for rats #10, #16, #35, and #41, and virtually
the same picture holds for Ci (ψ). They are highlighted in Figure 19.1. All
four belong to the low-dose group. Arguably, their relatively large influence
is caused by an interplay of three facts. First, the profiles are relatively high,
and hence yij and hij in (19.11) are large. Second, since all four profiles
are complete, the first factor in (19.11) contains a maximal number of large
terms. Third, the computed vij are relatively large, which is implied by the
MAR dropout model parameter estimates in Table 19.1. Indeed, for these
measurements, the logit of the dropout probability is closest to 0 and hence
vij is fairly close to its maximal value of 0.25.
Turning attention to Ci (α) reveals peaks for rats #5 and #23. Both be-
long to the control group and drop out after a single measurement occasion.
They are highlighted (by means of a bullet) in the first panel of Figure 19.1.
To explain this, observe that the relative magnitude of Ci (α), approxi-
mately given by (19.10), is determined by 1 − g(hid ) and hid − λ(hid ).
The first term is large when the probability of dropout is small. Now,
when dropout occurs early in the sequence, the measurements are still
relatively low, implying that the dropout probability is rather small (cf.
FIGURE 19.3. Rat Data. Index plots of Ci , Ci (θ), Ci (β), Ci (α), Ci (ψ), and of
the components of the direction hmax of maximal curvature where four profiles
have been shifted upward.
Table 19.1). This feature is built into the model by writing the dropout
probability in terms of the raw measurements with time-independent coef-
ficients rather than, for example, in terms of residuals. Alternatively, the
dropout model parameters could be made time dependent. Further, the
residual hid − λ(hid ) is large since these two rats are somewhat distant
from their group-by-time mean.
FIGURE 19.4. Rat Data. Index plots of Ci , Ci (θ), Ci (β), Ci (α), Ci (ψ), and of
the components of the direction hmax of maximal curvature, where 4 profiles have
been shifted upward and the components have been ordered in decreasing order of
Ci .
The analysis of the rat data set supports the claim that the influence mea-
sures are easy to interpret. In addition to the advantages quoted earlier,
we claim that a careful study of the conditions under which the diagnostics
become large can shed some light on the adequacy of the model formula-
tion. For example, the Diggle and Kenward (1994) model usually writes
the logit of the dropout probability as a function of the raw measurements,
with time-independent coefficients. This implies that an expression such
as (19.11) depends directly on the magnitude of the responses. An alter-
native parameterization of the dropout probability in terms of residuals
(Yij − µij )/σij would obviously yield a different picture. However, this pa-
FIGURE 19.5. Rat Data. Scatter plots of (1) Ci (θ) versus Ci (ψ) and (2) Ci (β)
versus Ci (α), where four profiles have been shifted upward.
19.5 Mastitis in Dairy Cattle
The data have been introduced in Section 2.7. Diggle and Kenward (1994)
and Kenward (1998) performed several analyses of these data. In Diggle
and Kenward (1994), a separate mean for each group defined by the year
of first lactation and a common time effect was considered, together with an
unstructured 2×2 covariance matrix. The dropout model included both Yi1
and Yi2 and was reparameterized in terms of the size variable (Yi1 + Yi2 )/2
and the increment Yi2 −Yi1 . It turned out that the increment was important,
in contrast to a relatively small contribution of the size. If this model were
TABLE 19.2. Mastitis in Dairy Cattle. Maximum likelihood estimates (standard errors) for the random and nonrandom dropout models, under several deletion schemes.
Random dropout
Parameter All (53,54,66,69) (4,5) (66) (4,5,66)
Measurement model:
β0 5.77(0.09) 5.69(0.09) 5.81(0.08) 5.75(0.09) 5.80(0.09)
βd 0.72(0.11) 0.70(0.11) 0.64(0.09) 0.68(0.10) 0.60(0.08)
σ12 0.87(0.12) 0.76(0.11) 0.77(0.11) 0.86(0.12) 0.76(0.11)
σ22 1.30(0.20) 1.08(0.17) 1.30(0.20) 1.10(0.17) 1.09(0.17)
ρ 0.58(0.07) 0.45(0.08) 0.72(0.05) 0.57(0.07) 0.73(0.05)
Dropout model:
ψ0 −2.65(1.45) −3.69(1.63) −2.34(1.51) −2.77(1.47) −2.48(1.54)
ψ1 0.27(0.25) 0.46(0.28) 0.22(0.25) 0.29(0.24) 0.24(0.26)
ω = ψ2 0 0 0 0 0
−2 log-likelihood 280.02 246.64 237.94 264.73 220.23
Nonrandom dropout
Parameter All (53,54,66,69) (4,5) (66) (4,5,66)
Measurement model:
β0 5.77(0.09) 5.69(0.09) 5.81(0.08) 5.75(0.09) 5.80(0.09)
βd 0.33(0.14) 0.35(0.14) 0.40(0.18) 0.34(0.14) 0.63(0.29)
σ12 0.87(0.12) 0.76(0.11) 0.77(0.11) 0.86(0.12) 0.76(0.11)
σ22 1.61(0.29) 1.29(0.25) 1.39(0.25) 1.34(0.25) 1.10(0.20)
ρ 0.48(0.09) 0.42(0.10) 0.67(0.06) 0.48(0.09) 0.73(0.05)
Dropout model:
ψ0 0.37(2.33) −0.37(2.65) −0.77(2.04) 0.45(2.35) −2.77(3.52)
ψ1 2.25(0.77) 2.11(0.76) 1.61(1.13) 2.06(0.76) 0.07(1.82)
ω = ψ2 −2.54(0.83) −2.22(0.86) −1.66(1.29) −2.33(0.86) 0.20(2.09)
−2 log-likelihood 274.91 243.21 237.86 261.15 220.23
G2 for MNAR 5.11 3.43 0.08 3.57 0.005
Kenward (1998) noted that two cows, #4 and #5, produced unusually low milk yields during the first year, whereas a normal yield was obtained during the second year. He then fitted
t-distributions to Yi2 given Yi1 = yi1 . Not surprisingly, his finding was that
the heavier the tails of the t-distribution, the better the outliers were ac-
commodated. As a result, the difference between the MAR and nonrandom
models vanished (G2 = 1.08 for a t2 -distribution). Alternatively, removing
these two cows and refitting the normal model shows complete lack of evi-
dence for nonrandom dropout (G2 = 0.08). This latter procedure is similar
to a global influence analysis by means of deleting two observations. Para-
meter estimates and standard errors for random and nonrandom dropout,
under several deletion schemes, are reproduced in Table 19.2. It is clear
that the influence on the measurement model parameters is small in the
random dropout case, although the gap on the time effect βd between the
random and nonrandom dropout models is reduced when #4 and #5 are
removed.
\[
\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix}
\;\sim\;
N\!\left(
\begin{pmatrix} \mu \\ \mu + \Delta \end{pmatrix},\;
\begin{pmatrix}
\sigma_1^2 & \rho\sigma_1\sigma_2 \\
\rho\sigma_1\sigma_2 & \sigma_2^2
\end{pmatrix}
\right).
\]
Note that the parameter ∆ represents the change in average yield between
the 2 years. The probability of mastitis is assumed to follow the logistic
regression model
\[
\mathrm{logit}\bigl[P(\text{mastitis})\bigr] \;=\; \psi_0 + \psi_1 y_1 + \psi_2 y_2 .
\]
Using the likelihoods to compare the fit of the MAR (ψ2 = 0) and MNAR models, we get a difference
G2 = 5.11. The corresponding tail probability from the χ²₁ distribution is 0.02.
Some insight into this fitted model can be obtained by rewriting it in terms
of the milk yield total (Y1 + Y2 ) and the increment (Y2 − Y1 ).
To gain some additional insight into these two fitted models, we now take
a closer look at the raw data and the predictive behavior of the Gaussian
MNAR model. Under an MNAR model, the predicted, or imputed, value
of a missing observation is given by the ratio of expectations:
\[
\hat{y}_m \;=\; \frac{E_{Y_m\mid Y_o}\bigl[y_m\,P(r \mid y_o, y_m)\bigr]}{E_{Y_m\mid Y_o}\bigl[P(r \mid y_o, y_m)\bigr]}. \qquad (19.17)
\]
Recall that the fitted dropout model (19.15) implies that the probability
of mastitis increases with decreasing values of the increment Y2 − Y1 . We
therefore plot the 27 imputed values of this quantity together with the 80
observed increments against the first year yield Y1 . This is presented in
Figure 19.6, in which the imputed values are indicated with triangles and
the observed values with crosses. Note how the imputed values are almost
linear in Y1 : This is a well-known property of the ratio (19.17) within this
range of observations. The imputed values are all negative, in contrast to
the observed increments, which are nearly all positive. With animals of this
age, one would normally expect an increase in yield between the 2 years.
The dropout model is imposing very atypical behavior on these animals and
this corresponds to the statistical significance of the MNAR component of
the model (ψ2 ) but, of course, necessitates further scrutiny.
FIGURE 19.6. Mastitis in Dairy Cattle. Plot of observed and imputed year 2 −
year 1 yield differences against year 1 yield. Two outlying points are circled.
Another feature of this plot is the pair of outlying observed points circled in
the top left-hand corner. These two animals have the lowest and third lowest
yields in the first year, but moderately large yields in the second, leading
to the largest positive increments. In a well-husbanded dairy herd, one
would expect approximately Gaussian joint milk yields, and these two then
represent outliers. It is likely that there is some anomaly, possibly illness,
leading to their relatively low yields in the first year. One can conjecture
that these two animals are the cause of the structure identified by the
Gaussian MNAR model. Under the joint Gaussian assumption, the MNAR
model essentially “fills in” the missing data to produce a complete Gaussian
distribution. To counterbalance the effect of these two extreme positive
increments, the dropout model predicts negative increments for the mastitic
cows, leading to the results observed. As a check on this conjecture, we
omit these two animals from the data set and refit the MAR and MNAR
Gaussian models. The resulting estimates are presented in the (4, 5) column
of Table 19.2.
The deviance is minimal and the MNAR model now shows no improvement
in fit over MAR. The estimates of the dropout parameters, although still
moderately large in an absolute sense, are of the same size as their stan-
dard errors which, as mentioned earlier, are probably underestimates. In
the absence of the two anomalous animals, the structure identified earlier
in terms of the MNAR dropout model no longer exists. The increments
FIGURE 19.7. Mastitis in Dairy Cattle. Normal probability plot of the year 1
milk yields.
imputed by the fitted model are also plotted in Figure 19.6, indicated by
circles. Although still lying among the lower region of the observed incre-
ments, these are now all positive and lie close to the increments imputed
by the MAR model (diamonds). Thus, we have a plausible representation
of the data in terms of joint Gaussian milk yields, two pairs of outlying
yields and no requirement for an MNAR dropout process.
The two key assumptions underlying the outcome-based MNAR model are,
first, the form chosen for the relationship between dropout probability and
response and, second, the distribution of the response or, more precisely,
the conditional distribution of the possibly unobserved response given the
observed response. In the current setting for the first assumption, if there
is dependence of mastitis occurrence on yield, experience with logistic re-
gression tells us that the exact form of the link function in this relationship
is unlikely to be critical. In terms of sensitivity, we therefore consider the
second assumption, the distribution of the response.
All the data from the first year are available, and a normal probability plot
of these, Figure 19.7, does not show great departures from the Gaussian
assumption. Leaving this distribution unchanged, we therefore examine the
effect of changing the conditional distribution of Y2 given Y1 . One simple
and obvious choice is to consider a heavy-tailed distribution; for this, we take the conditional distribution of Y2 , given Y1 = y1 , to be a tm -distribution centered at
\[
\mu_{2|1} \;=\; \mu + \Delta + \frac{\rho\,\sigma_2\,(y_1-\mu)}{\sigma_1},
\]
the conditional mean of Y2 | y1 . The corresponding conditional variance is
\[
\frac{m}{m-2}\,\sigma^{2}.
\]
Relevant parameter estimates from the fits of both MAR and MNAR mod-
els are presented in Table 19.3 for three values of m: 2, 10, and 25. Smaller
values of m correspond to greater kurtosis and, as m becomes large, the
model approaches the Gaussian one used in the previous section. It can be
seen from the results for the MNAR model in Table 19.3, that as the kurto-
sis increases the estimate of ψ2 decreases. Also, the maximized likelihoods
of the MAR and MNAR models converge. With 10 and 2 degrees of free-
dom, there is no evidence at all to support the inclusion of ψ2 in the model;
that is, the MAR model provides as good a description of the observed data
as the MNAR, in contrast to the Gaussian-based conclusions. Further, as m
decreases, the estimated yearly increment in milk yield ∆ from the MNAR
model increases to the value estimated under the MAR model. In most
applications of outcome-based selection models (see Section 19.7), it will
be quantities of this type that will be of prime interest, and it is clearly
seen in this example how the dropout model can have a crucial influence
on the estimate of this. Comparing the values of the deviance from the
t-based model with those from the original Gaussian model, we also see
that the former with m = 10 or 2 produces a slightly better fit, although
no meaning can be attached to the statistical significance of the difference
in these likelihood values.
The results observed here are consistent with those from the deletion analy-
sis. The two outlying pairs of measurements identified earlier are not incon-
sistent with the heavy-tailed t-distribution; so it would require no “filling
in” and hence no evidence for nonrandomness in the dropout process under
the second model. In conclusion, if we consider the data with outliers in-
cluded, we have two models that effectively fit equally well to the observed
data. The first assumes a joint Gaussian distribution for the responses and
a MNAR dropout model. The second assumes a Gaussian distribution for
the first observation and a conditional tm -distribution (with small m) for
the second given the first, with no requirement for a MNAR dropout com-
ponent. Each provides a different explanation for what has been observed,
19.5 Mastitis in Dairy Cattle 319
TABLE 19.3. Mastitis in Dairy Cattle. Details of the fit of MAR and MNAR
dropout models, assuming a tm -distribution for the conditional distribution of Y2
given Y1 . Maximum likelihood estimates (standard errors) are shown.
However, it should also be noted that absence of structure in the data as-
sociated with an MNAR process does not imply that an MNAR process
is not operating: different models with similar maximized likelihoods (i.e., with similar plausibility with respect to the observed data) may have com-
pletely different implications for the dropout process and the unobserved
data. These points together suggest that the appropriate role of such mod-
eling is as a component of a sensitivity analysis.
The analysis of the previous section is characterized by its grounding in substantive knowledge about the data. The local influence approach, presented in Section 19.3 and applied to the rats data in Section 19.4, may appear to be applicable “blindly” (i.e., without recourse to specific information about the data). In this section, we will apply the technique to the mastitis
data, confront the results with those found in Section 19.5.1, and suggest
that here a combination of methodology and substantive insight will be the
most fruitful approach.
Applying the method to the mastitis data produces Figure 19.8, which sug-
gests that there are four influential subjects: #53, #54, #66, and #69. The
most striking feature of this analysis is that #4 and #5 are not recovered.
See also Figure 2.6. It is interesting to consider an analysis with these four
cows removed. Details are given in Table 19.2. In contrast to the removal of #4 and #5, the influence on the likelihood ratio test is rather small: G2 = 3.43 instead
of the original 5.11. The influence on the measurement model parameters
under both random and nonrandom dropout is small.
It is very important to realize that one should not expect agreement be-
tween deletion and our local influence analysis. The latter focuses on the
sensitivity of the results with respect to the assumed dropout model; more
specifically, how the results change when the MAR model is extended into
the direction of nonrandom dropout. In particular, all subjects singled out
so far are complete and hence Ci (θ) ≡ 0, placing all influence on Ci (ψ)
and hmax,i . A comparison between local influence and deletion is given in
Section 24.4.
FIGURE 19.8. Mastitis in Dairy Cattle. Index plots of Ci, Ci(θ), Ci(ψ), and of the components of the direction hmax of maximal curvature, when the dropout model is parameterized as a function of Yi1 and Yi2.
FIGURE 19.9. Mastitis in Dairy Cattle. Index plots of Ci, Ci(θ), Ci(ψ), and of the components of the direction hmax of maximal curvature, when the dropout model is parameterized as a function of Yi1 and Yi2 − Yi1.
It is straightforward to show that these two parameterizations of the dropout model, in terms of (Yi1, Yi2) and in terms of (Yi1, Yi2 − Yi1), lead to different perturbation schemes of the form (19.1). At first sight, this feature can be seen as both an advantage and a disadvantage. The fact that reparameterizations of the linear predictor of the dropout model lead to different perturbation schemes requires careful reflection
based on substantive knowledge in order to guide the analysis, such as the
considerations on the incremental variable made earlier.
We will present the results of the incremental analysis and then offer further
comments on the rationale behind this particular transformation. From the
diagnostic plots in Figure 19.9, it is obvious that we recover three influential
subjects: #4, #5, and #66. Although Kenward (1998) did not consider #66
to be influential, it does appear to be somewhat distant from the bulk of
the data (Figure 2.6). The main difference between these two groups is that the first two cows were likely sick during year 1, whereas this is not necessarily so for #66. An additional feature is that, in all cases, both Ci(ψ) and hmax point to the same influential animals. In addition, hmax suggests that the influence of #66 differs from that of the others; it could be conjectured that #66 pulls the coefficient ω in a different direction than the other two. The other values are all relatively small. This could indicate that for
the remaining 104 subjects, MAR is plausible, whereas a deviation in the
direction of the incremental variable, with differing signs, appears to be
necessary for the other three subjects. At this point, a comparison between
FIGURE 19.10. Mastitis in Dairy Cattle. Index plots of the three components of Ci(ψ) when the dropout model is parameterized as a function of Yi1 and Yi2 − Yi1.
hmax for the direct-variable and incremental analyses is useful. Since hmax is normalized in both cases, the two plots are directly comparable. There is
no pronounced influence indication in the direct variables case and perhaps
only random noise is seen. A more formal way to distinguish between signal
and noise needs to be developed.
A related issue is the interpretation of the sign of the components of hmax, that is, whether dropout probabilities move toward the center of the range or are pulled away from
it. (Note that −hmax is another normalized eigenvector corresponding to
the largest eigenvalue.)
Equation (19.18) shows that the direct variables model checks the influ-
ence on the random dropout parameter ψ1 , whereas the random dropout
parameter in the incremental model is ψ1 + ψ2 . Not only is this a different
parameter, it is also estimated with higher precision. One often observes that ψ̂1 and ψ̂2 have similar variances and a negative correlation, in which case the linear combination with the smallest variance lies approximately in the direction of the sum ψ1 + ψ2. (Were the correlation positive, the difference direction ψ1 − ψ2 would have the smallest variance instead.) Let us assess this when all 107
observations are included. The estimated covariance matrix is
\[
\begin{pmatrix} 0.59 & -0.54 \\ -0.54 & 0.70 \end{pmatrix},
\]
with correlation −0.84. The variance of ψ1 + ψ2, on the other hand, is estimated to be 0.21. In this case, the direction of minimal variance is along (0.74; 0.67), which is indeed close to the sum direction.
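These quantities are easily reproduced; a short Python sketch using the matrix just quoted:

```python
import numpy as np

# Estimated covariance matrix of (psi1_hat, psi2_hat), all 107 observations
cov = np.array([[0.59, -0.54],
                [-0.54, 0.70]])

corr = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])             # about -0.84
var_sum = np.array([1.0, 1.0]) @ cov @ np.array([1.0, 1.0])   # about 0.21

# Direction of minimal variance: eigenvector of the smallest eigenvalue
eigvals, eigvecs = np.linalg.eigh(cov)
min_dir = eigvecs[:, np.argmin(eigvals)]
min_dir *= np.sign(min_dir[0])                                # about (0.74, 0.67)

print(corr, var_sum, min_dir)
```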
When all three influential subjects are removed, the estimated covariance matrix becomes
\[
\begin{pmatrix} 3.31 & -3.77 \\ -3.77 & 4.37 \end{pmatrix},
\]
with correlation −0.9897. Removing only #4 and #5 yields an intermediate situation, the results of which are not shown. The variance of the sum is now 0.15, a further reduction, and the sum is still close to the direction of minimal variance. These considerations reinforce the claim that an incre-
mental analysis is highly recommended. It might therefore be interesting
to routinely construct a plot such as in Figure 2.6 or Figure 19.6, even
with longer measurement sequences. On the other hand, transforming the
dropout model to a size variable (Yi1 +Yi2 )/2 will worsen the problem since
an insensitive parameter for Yi1 will result.
Although local and global influence are, strictly speaking, not equivalent,
it is insightful to see how the global influence on θ can be linked to the
behavior of Ci (ψ). We observed earlier that all locally influential subjects
are completers and hence Ci (θ) ≡ 0. Yet, removing #4, #5, and #66 shows
some effect on the discrepancy between the random dropout (MAR) and
nonrandom dropout (MNAR) estimates of the time effect βd . In particular,
MAR and MNAR estimates with all three subjects removed are virtually
identical (0.60 and 0.63, respectively). Let us do a small thought experi-
ment. Since these subjects are influential in Ci (ψ), the MAR model could
be improved by including incremental terms for these three subjects. Such
a model would still imply random dropout. In contrast, allowing a depen-
dence on the increment in all subjects will influence E(Yi2 |yi1 , dropout)
for all incomplete observations; hence, the measurement model parameters
under MNAR will change. In conclusion, this provides a way to assess the
indirect influence of the dropout mechanism on the measurement model
parameters through local influence methods. In the mastitis data, this influence is likely due to the fact that an exceptional increment, caused by a different mechanism (perhaps a diseased animal during the first year), is nevertheless treated on an equal footing with the other observations within the dropout model. Such an investigation cannot be done with the
case-deletion method because it is not possible to disentangle the various
sources of influence.
The perturbation scheme used throughout this chapter has several elegant
properties. The perturbation is around the MAR mechanism, which is often
deemed a sensible starting point. Extra calculations are limited and free of
numerical integration. Influence decomposes into a measurement part and a dropout part, the first of which is zero in the case of a complete observation.
if the special case of compound symmetry is assumed, the measurement
part can approximately be written in interpretable components for the
fixed effect and variance component parts.
Apart from MAR, often MCAR also is considered a useful model. It is then
natural to consider departures from the MCAR model, rather than from
the MAR model. This would change (19.1) to
\[
\mathrm{logit}\,[g(h_{ij}, y_{ij})] \;=\; \mathrm{logit}\,[\mathrm{pr}(D_i = j \mid D_i \ge j, y_i)] \;=\; h_{ij}\psi + \omega_{i1} y_{i,j-1} + \omega_{i2} y_{ij}, \qquad (19.19)
\]
with obvious change in the definition of hij . This way, the perturbation
parameter becomes a two-component vector ω i = (ωi1 , ωi2 ). As a result,
the ith subject produces a pair (hi1 , hi2 ), which is a normalized vector and
hence main interest lies in its direction. Also, Ch = Ci is the local influence
on γ of allowing the ith subject to drop out randomly or nonrandomly.
Figure 19.11 shows the result of this procedure, applied to the mastitis
data. Pairs (hi1, hi2) are plotted. The main diagonal corresponds to the size direction, whereas the anti-diagonal represents the purely incremental direction.
The circles are used to indicate the minimal and maximal distances to the
origin. Finally, squares rather than bullets are used for cows #4, #5, and
#66. Most cows lie in the size direction, but it is noticeable that #4, #5,
and #66 tend toward the nonrandom direction. Further, no extremely large
Ci are seen in this case.
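As an illustration only (the helper and the example pairs are hypothetical), such normalized pairs can be decomposed into size and incremental components:

```python
import numpy as np

def decompose_direction(h):
    """Project a normalized perturbation pair (h1, h2) onto the size
    direction (1, 1)/sqrt(2) and the incremental direction (-1, 1)/sqrt(2)."""
    h = np.asarray(h, dtype=float)
    size_comp = h @ np.array([1.0, 1.0]) / np.sqrt(2.0)
    incr_comp = h @ np.array([-1.0, 1.0]) / np.sqrt(2.0)
    return size_comp, incr_comp

# Hypothetical pairs: one along the size diagonal, one mostly incremental
for pair in [(0.70, 0.71), (-0.60, 0.80)]:
    print(pair, decompose_direction(pair))
```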
Another extension would result from the observation that the choice of the
incremental analysis in Section 19.5.2 may, although motivated by substan-
tive insight, seem rather arbitrary. Hence, it would be desirable to have a
more automatic, data-driven selection of a direction. One way of doing this
is by considering
\[
\mathrm{logit}\,[g(h_{ij}, y_{ij})] \;=\; \mathrm{logit}\,[\mathrm{pr}(D_i = j \mid D_i \ge j, y_i)] \;=\; h_{ij}\psi + \omega_i\,(\sin\theta\, y_{i,j-1} + \cos\theta\, y_{ij}). \qquad (19.20)
\]
The contribution of subject i to the observed-data likelihood then takes the form
\[
f(y_i^o, r_i \mid \theta, \psi, \omega_i) \;=\; \int f(y_i^o, y_i^m \mid X_i, Z_i, \theta)\, f(r_i \mid y_i^o, y_i^m, X_i, \psi, \omega_i)\, dy_i^m. \qquad (19.21)
\]
It has been seen in the mastitis data, and observed elsewhere, such as
in Diggle and Kenward (1994) and Molenberghs, Kenward, and Lesaffre
(1997), that apparent nonrandom dropout in a selection model often mani-
fests itself in terms of a dependence of dropout on the increment or change
in response, C = Yij − Yi,j−1 , say. If a subject is exhibiting a clear trend in
response, then we might regard C as an estimate, with error, of an under-
lying trend. This is supported particularly in examples in which a MNAR
model with dependence on C fits a little better than an MAR model with
dependence on a change calculated wholly from past values. A better way
of approaching such a situation may be to attempt to model the underly-
ing trend directly using a latent, or random-coefficient, model and allow
both response and dropout model to depend on this. Little (1995) uses the
term random-coefficient based to distinguish such models from those used
in this chapter in which the probability of dropout depends directly on
the response Y i , which he terms outcome based. We will now discuss these
types of model in some detail.
Suppose that the latent variable bi (possibly vector valued) describes some
aspect of an individual’s response. This leads naturally to the linear mixed
model (3.8). Let us consider a model with random intercept and random
slope:
\[
Y_{ij} \mid b_{0i}, b_{1i} \;\sim\; N\bigl(\beta_0 + b_{0i} + (\beta_1 + b_{1i})\,t_{ij},\; \sigma^2\bigr), \qquad j = 1, \ldots, n_i. \qquad (19.22)
\]
The subject’s random coefficients bi = (b0i , b1i ) are assumed to be nor-
mally distributed with zero mean and covariance matrix D. Also, the
dropout model is assumed to depend on the random effects: P (r|b).
The response y now enters the joint distribution only through the observed
data and the latent variable, and the nonrandom dropout model will typi-
cally be identified.
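As an illustration only (all parameter values are hypothetical), the following Python sketch simulates data from such a random-coefficient model, letting dropout depend on the subject's random slope:

```python
import numpy as np

rng = np.random.default_rng(2024)

n_subjects, times = 200, np.arange(5.0)
beta0, beta1, sigma = 10.0, -0.5, 1.0
D = np.array([[1.0, 0.2], [0.2, 0.3]])          # covariance of (b0, b1)
psi0, psi1 = -2.0, 1.5                           # dropout depends on the random slope

data = []
for i in range(n_subjects):
    b0, b1 = rng.multivariate_normal([0.0, 0.0], D)
    y = beta0 + b0 + (beta1 + b1) * times + rng.normal(0.0, sigma, times.size)
    for j, t in enumerate(times):
        if j > 0:
            # dropout probability depends on the latent slope, not on y itself
            p_drop = 1.0 / (1.0 + np.exp(-(psi0 + psi1 * b1)))
            if rng.uniform() < p_drop:
                break
        data.append((i, t, y[j]))

print(len(data), "records retained out of", n_subjects * times.size)
```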
Wu and Carroll (1988) proposed such a model for what they termed infor-
mative right censoring. The situation they cover extends the earlier setting
to accommodate right censoring of the dropout times. Although this com-
plicates the likelihood to some degree, it does not fundamentally change
the structure of the model and could equally well be used for outcome-
based models. For a continuous response, Wu and Carroll suggested using
a conventional Gaussian random-coefficient model (Laird and Ware 1982,
Longford 1993) combined with an appropriate model for time to dropout,
such as proportional hazards, logistic or probit regression. The combination
of probit and Gaussian response allows explicit solution of the integral and
was used in their application.
Since all models for incomplete longitudinal data rest on unverifiable as-
sumptions, this chapter argues that a careful investigation of the model output is in order. Using two examples, both informal and formal sensitiv-
ity analyses have been conducted. The former are based on insight into the
modeling process and the distributional assumptions made, as well as on
background knowledge of the data problem.
20.1 Introduction
Thus, in line with the reflections made in Section 18.5, we have three strate-
gies to build a full data model in the pattern-mixture context: identifying
restrictions, simple within-pattern models, and the inclusion of pattern as
a covariate, the latter of which allows for the combination of information
across patterns. A few observations are in order. First, although identifying
restrictions impose a careful reflection on the unidentified part of the distri-
bution, the other strategies are more implicit about the assumptions made
to identify the full distribution. In this respect, they are open to some of
the criticisms toward selection models. Second, the identifying-restrictions
strategy is harder to implement, except in fairly simple settings, such as
a single normal sample or contingency tables (Little 1993, 1994a). This
chapter provides tools to conduct such a strategy in realistic longitudinal
settings. Third, in the selection modeling framework, the MAR assumption
plays a crucial role. It can be seen as a compromise between the very rigid
and unrealistic MCAR assumption and the complex and fundamentally
problematic MNAR assumptions. A counterpart to the MAR assumption
is provided in Section 20.2, which can be exploited as the basis for specific
identifying-restrictions strategies.
The first couple of sections provide background material. Section 20.2 de-
scribes the relationship between MAR and the pattern-mixture framework.
Multiple imputation, a tool used in the identifying restrictions strategy, is
reviewed in Section 20.3. The three strategies to fit pattern-mixture models,
mentioned earlier, are described in Section 20.4. The identifying-restrictions
strategy is described in detail in Section 20.5. Application to the Vorozole
study is discussed in Section 20.6. Reflections and suggestions for alterna-
tive routes of sensitivity are offered in Section 20.7.
The missing data taxonomy of Rubin (1976) and Little and Rubin (1987),
which distinguishes between missing completely at random, missing at
random, and nonrandom missingness, is widely used (Section 15.5). It is
Little (1993) suggested the possibility of using more than the completers to
construct identifying restrictions for two practical reasons: (1) The set of
completers may be small and (2) there may be a closer similarity between
the conditional distributions given d = t + 1 and some other incomplete
pattern d = s + 1, or set of patterns, than between those for d = t + 1 and
the completers, d = n + 1.
One way to proceed is as follows. First, restrict the data set to the first two
components only. Then, missing data patterns d = 3, . . . , n+1 collapse into
a single pattern d > 2. Applying the ACMV restriction to pattern d = 2 and the collapsed pattern d > 2 leads to the construction of the density f(y2|y1, d = 2) = f(y2|y1, d > 2),
as in (20.1). Multiplying by f (y1 |d = 2) leads to f (y1 , y2 |d = 2), thus
determining the joint densities of f (y1 , y2 |d) for all d = 2, . . . , n + 1. Next,
f (y3 |y1 , y2 , d) (d = 2, 3) can be calculated from f (y3 |y1 , y2 , d > 3). We then
proceed by induction to construct all joint densities.
Note that the result of Theorem 20.1 does not hold for general missing
data patterns. Consider a bivariate outcome (Y1 , Y2 ) where missingness
can occur in both components. Let (R1 , R2 ) be the corresponding bivariate
missingness indicator, where Rj = 0 if Yj is missing and 1 otherwise (j =
1, 2). Consider the following MAR mechanism:
\[
f(r \mid y) \;=\; \Pr(r_1, r_2 \mid y_1, y_2) \;=\;
\begin{cases}
p & \text{if } (r_1, r_2) = (0, 0), \\
q_{y_1} & \text{if } (r_1, r_2) = (1, 0), \\
s_{y_2} & \text{if } (r_1, r_2) = (0, 1), \\
1 - p - q_{y_1} - s_{y_2} & \text{if } (r_1, r_2) = (1, 1).
\end{cases} \qquad (20.2)
\]
The idea is that the density of missing components, given observed com-
ponents, is replaced by the corresponding density of patterns for which
both are available. Restrictions for the pattern r = (0, 0) will be discussed
further.
Clearly, since both qy1 and sy2 have to be constant, the mechanism needs to
be MCAR. In other words, ACMV≡MCAR, independent of the restrictions
for f (y1 , y2 |r = (0, 0)), and hence ACMV and MAR differ.
distribution
\[
f(y_i^m \mid y_i^o, \gamma). \qquad (20.5)
\]
Note that we explicitly distinguish the parameter of scientific interest θ from the parameter γ in (20.5). Since γ is unknown, we must estimate it from the data, say γ̂, and use
\[
f(y_i^m \mid y_i^o, \widehat{\gamma}) \qquad (20.6)
\]
to impute the missing values.
4. Independently repeat steps 1–3 M times. The M data sets give rise to estimates θ̂^{(m)} and U^{(m)}, for m = 1, ..., M.
Steps 1 and 2 are referred to as the Imputation Task. Step 3 is the Es-
timation Task. Of course, one wants to combine the M inferences into a
single one. Both parameter and precision estimation, on the one hand, and
hypothesis testing, on the other hand, will be discussed next.
The M point estimates are combined into
\[
\widehat{\theta}^{*} \;=\; \frac{1}{M}\sum_{m=1}^{M} \widehat{\theta}^{(m)}.
\]
The precision of θ̂* is estimated by the total variance V, where
\[
V \;=\; \widehat{W} + \frac{M+1}{M}\,\widehat{B}, \qquad (20.8)
\]
in which
\[
\widehat{W} \;=\; \frac{\sum_{m=1}^{M} U^{(m)}}{M} \qquad (20.9)
\]
is the average within-imputation variance, and
\[
\widehat{B} \;=\; \frac{\sum_{m=1}^{M} \bigl(\widehat{\theta}^{(m)} - \widehat{\theta}^{*}\bigr)\bigl(\widehat{\theta}^{(m)} - \widehat{\theta}^{*}\bigr)'}{M-1} \qquad (20.10)
\]
is the between-imputation variance (Rubin 1987). Rubin and Schenker
(1986) report that a small number of imputations (M = 2, 3) already yields
a major improvement over single imputation. Upon noting that the factor
(M + 1)/M approaches 1 for large M , (20.8) is approximately the sum of
the within- and the between-imputations variability.
Hypotheses about θ can be tested by comparing the statistic
\[
F \;=\; \frac{(\theta^{*} - \theta_0)'\, W^{-1} (\theta^{*} - \theta_0)}{k(1+r)} \qquad (20.12)
\]
to an F-distribution with k numerator and w denominator degrees of freedom, where k is the dimension of θ and
\[
w \;=\; 4 + (\tau - 4)\left[1 + \frac{(1 - 2\tau^{-1})}{r}\right]^2, \qquad
r \;=\; \frac{1}{k}\left(1 + \frac{1}{M}\right)\operatorname{tr}(B W^{-1}), \qquad
\tau \;=\; k(M-1). \qquad (20.13)
\]
Clearly, procedure (20.11) can be used as well when not the full vector θ,
but one component, a subvector, or a set of linear contrasts, is the subject
of hypothesis testing. When a subvector is of interest (a single component
being a special case), the corresponding submatrices of B and W need
to be used in (20.12) and (20.13). For a set of linear contrasts Lθ, one
should use the appropriately transformed covariance matrices: W̃ = LWL', B̃ = LBL', and Ṽ = LVL'.
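To make these combination rules concrete, here is a small self-contained Python sketch implementing (20.8)–(20.10) and the test statistic (20.12)–(20.13); the inputs are simulated placeholders rather than output from any analysis in this book:

```python
import numpy as np
from scipy import stats

def pool_mi(theta_hats, U_hats, theta0=None):
    """Combine M multiple-imputation estimates (Rubin 1987).

    theta_hats : (M, k) array of point estimates
    U_hats     : (M, k, k) array of within-imputation covariance matrices
    Returns the pooled estimate, total variance, and the F test of theta = theta0.
    """
    theta_hats = np.asarray(theta_hats, float)
    U_hats = np.asarray(U_hats, float)
    M, k = theta_hats.shape

    theta_star = theta_hats.mean(axis=0)                  # pooled point estimate
    W = U_hats.mean(axis=0)                               # (20.9)
    dev = theta_hats - theta_star
    B = dev.T @ dev / (M - 1)                             # (20.10)
    V = W + (M + 1) / M * B                               # (20.8)

    if theta0 is None:
        theta0 = np.zeros(k)
    r = (1 + 1 / M) * np.trace(B @ np.linalg.inv(W)) / k  # (20.13)
    tau = k * (M - 1)
    w = 4 + (tau - 4) * (1 + (1 - 2 / tau) / r) ** 2      # denominator d.f.
    d = theta_star - theta0
    F = d @ np.linalg.solve(W, d) / (k * (1 + r))         # (20.12)
    p = 1 - stats.f.cdf(F, k, w)
    return theta_star, V, F, w, p

# Toy usage with M = 5 imputations of a k = 3 parameter
rng = np.random.default_rng(1)
theta_hats = rng.normal(0.0, 1.0, size=(5, 3))
U_hats = np.stack([np.eye(3)] * 5)
print(pool_mi(theta_hats, U_hats))
```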
A natural informal sensitivity analysis is to fit both a selection model and a pattern-mixture model to the same data and to use them either (1) to answer the same scientific question, such as the marginal treatment effect or time evolution, based on these two rather different modeling strategies, or (2) to gain additional insight by supplementing the selection
model results with those from a pattern-mixture approach. Examples can
be found in Verbeke, Lesaffre, and Spiessens (2000), Curran, Pignatti, and
Molenberghs (1997), and Michiels et al. (1999) for continuous outcomes.
The categorical outcome case has been treated in Molenberghs, Michiels,
and Lipsitz (1999), and Michiels, Molenberghs, and Lipsitz (1999). Further
references include Cohen and Cohen (1983), Muthén, Kaplan, and Hollis
(1987), Allison (1987), McArdle and Hamagani (1992), Little and Wang
(1996), Hedeker and Gibbons (1997), Siddiqui and Ali (1998), Ekholm and
Skinner (1998), Molenberghs, Michiels, and Kenward (1998), and Park and
Lee (1999).
We want to stress the usefulness of these two modeling strategies and also
refer to the toenail (Sections 17.2 and 18.3) and Vorozole studies (Sec-
tions 17.6 and 18.4). In the Vorozole case, the treatment effect assessment
is virtually identical under both strategies, but useful additional insight
is obtained from the pattern-specific average profiles. Of course, such bi-
lateral comparisons are not confined to the selection and pattern-mixture
model families. For example, shared-parameter models can be included as
well (see Section 15.4 and the very complete review by Little 1995).
On the other hand, the sensitivity analysis we propose here can also be
conducted within the pattern-mixture family, in analogy with the selection
model case (Chapter 18). The key area on which sensitivity analysis should focus is the unidentified components of the model and the way(s) in which they are handled. Indeed, recall that model family (18.6) contains underiden-
tified members since it describes the full set of measurements in pattern ti ,
even though there are no measurements after occasion ti − 1. In the intro-
duction, we mentioned three strategies to deal with this underidentification.
Let us describe these in turn.
– Strategy 3. Second, one can let the parameters vary across pat-
terns in a parametric way. Thus, rather than estimating a sepa-
rate time trend in each pattern, one could assume that the time
evolution within a pattern is unstructured, but parallel across
patterns. This is achieved by treating pattern as a covariate in the model. The available data can be used to assess whether
such simplifications are supported within the time ranges for
which there is information. Using the so-obtained profiles past
the time of dropout still requires extrapolation.
From a model-building perspective, this modeling strategy can
be viewed as a standard linear mixed model with the pattern
indicator as an additional covariate.
This approach was recently adopted by Park and Lee (1999)
in the context of count data for which generalized estimating
equations (Liang and Zeger 1986) are used.
The first factor on the right-hand side of (20.14) is clearly identifiable from
the observed data, whereas the second factor is not. Therefore, identifying
restrictions are applied in order to identify the second component.
The first of these factors, the density of the observed measurements within pattern t, is
\[
f_t(y_1, \ldots, y_t). \qquad (20.15)
\]
The second, unidentified, factor can be written as
\[
f_t(y_{t+1}, \ldots, y_n \mid y_1, \ldots, y_t) \;=\; \prod_{s=t+1}^{n} f_t(y_s \mid y_1, \ldots, y_{s-1}). \qquad (20.18)
\]
The result is easily shown. First, the equivalence between (20.22) and
(20.19) is immediate. Second, for (20.21) we proceed by inductive reason-
ing. Clearly, the equivalence holds for t = n − 1. Assume now that it holds
for t + 1. Then,
\begin{align*}
f_t(y_{t+1}, \ldots, y_n \mid y_1, \ldots, y_t)
&= f_{t+1}(y_{t+1}, \ldots, y_n \mid y_1, \ldots, y_t) \\
&= f_{t+1}(y_{t+1} \mid y_1, \ldots, y_t) \prod_{s=2}^{n-t} f_{t+1}(y_{t+s} \mid y_1, \ldots, y_{t+s-1}) \\
&= f_{t+1}(y_{t+1} \mid y_1, \ldots, y_t) \prod_{s=2}^{n-t} f_{t+s}(y_{t+s} \mid y_1, \ldots, y_{t+s-1}),
\end{align*}
where the last equality holds by virtue of the induction hypothesis. This
establishes the result.
and reproducing the proof of Result 20.1 for the g(·) functions, it is clear
that the following result holds:
\[
f_t(y_1, \ldots, y_n) \;=\; f_t(y_1, \ldots, y_t) \times \prod_{s=0}^{n-t-1}\left[\sum_{j=n-s}^{n} \omega_{n-s,j}\, f_j(y_{n-s} \mid y_1, \ldots, y_{n-s-1})\right]. \qquad (20.29)
\]
In this case, there are only three patterns, and identification (20.29) takes
the following form:
\begin{align*}
f_3(y_1, y_2, y_3) &= f_3(y_1, y_2, y_3), && (20.30)\\
f_2(y_1, y_2, y_3) &= f_2(y_1, y_2)\, f_3(y_3 \mid y_1, y_2), && (20.31)\\
f_1(y_1, y_2, y_3) &= f_1(y_1)\,\bigl[\omega f_2(y_2 \mid y_1) + (1-\omega) f_3(y_2 \mid y_1)\bigr]\, f_3(y_3 \mid y_1, y_2). && (20.32)
\end{align*}
Since f3 (y1 , y2 , y3 ) is completely identifiable from the data, and for density
f2 (y1 , y2 , y3 ) there is only one possible identification, given (20.20), the
only room for choice is in pattern 1. Setting ω = 1 corresponds to NCMV,
whereas ω = 0 implies CCMV. Using (20.28), ACMV boils down to
\[
\omega \;=\; \frac{\alpha_2 f_2(y_1)}{\alpha_2 f_2(y_1) + \alpha_3 f_3(y_1)}. \qquad (20.33)
\]
The conditional density f1 (y2 |y1 ) in (20.32) can be rewritten as
\[
f_1(y_2 \mid y_1) \;=\; \frac{\alpha_2 f_2(y_1, y_2) + \alpha_3 f_3(y_1, y_2)}{\alpha_2 f_2(y_1) + \alpha_3 f_3(y_1)},
\]
which is, of course, a special case of (20.26). Upon division by α2 + α3 ,
both numerator and denominator are mixture densities, hence producing a
legitimate conditional density.
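For instance, a small Python sketch of the ACMV weight (20.33), under the additional assumption that the pattern-specific densities of y1 are normal; all numbers are made up for illustration:

```python
import numpy as np
from scipy import stats

def acmv_weight(y1, alpha2, alpha3, mean2, sd2, mean3, sd3):
    """ACMV weight omega of (20.33) when f2(y1) and f3(y1) are normal densities.

    alpha2 and alpha3 are the probabilities of patterns 2 and 3.
    """
    f2 = stats.norm.pdf(y1, loc=mean2, scale=sd2)
    f3 = stats.norm.pdf(y1, loc=mean3, scale=sd3)
    return alpha2 * f2 / (alpha2 * f2 + alpha3 * f3)

# Illustrative values only
print(acmv_weight(y1=1.2, alpha2=0.3, alpha3=0.5,
                  mean2=1.0, sd2=1.0, mean3=1.5, sd3=0.8))
```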
\[
f_1(y_1, y_2, y_3, y_4) \;=\; f_1(y_1)\, f_1(y_2 \mid y_1)\,\cdots
\]
All steps but the first one have to be repeated M times to obtain the
same number of imputed data sets. Inference then proceeds as outlined in
Sections 20.3.1 and 20.3.2.
When the observed densities are estimated using linear mixed models,
f3 (y1 , y2 , y3 ), f2 (y1 , y2 ), and f1 (y1 ) produce fixed effects and variance com-
ponents. Let us group all of them in γ and assume a draw is made from
their distribution, γ ∗ say. To this end, their precision estimates need to be
computed. In SAS, they are provided from the ‘covb’ option in the MODEL
statement and the ‘asycov’ option in the PROC MIXED statement.
Let us illustrate this procedure for (20.31). Assume that the ith subject
has only two measurements, and hence belongs to the second pattern. Let
its design matrices be Xi and Zi for the fixed effects and random effects,
respectively. Its mean and variance for the third pattern are
\[
\mu_i(3) \;=\; X_i \beta^{*}(3), \qquad (20.41)
\]
\[
V_i(3) \;=\; Z_i D^{*}(3) Z_i' + \Sigma_i^{*}(3), \qquad (20.42)
\]
where (3) indicates that the parameters are specific to the third pattern,
as in (18.6). The asterisk is reminiscent of these parameters being part of
the draw γ ∗ .
Now, based on (20.41)–(20.42) and the observed values yi1 and yi2 , the
parameters for the conditional density follow immediately:
\[
\mu_{i,2|1}(3) \;=\; \mu_{i,2}(3) + V_{i,21}(3)\,[V_{i,11}(3)]^{-1}\,\bigl(y_{i,1} - \mu_{i,1}(3)\bigr),
\]
\[
V_{i,2|1}(3) \;=\; V_{i,22}(3) - V_{i,21}(3)\,[V_{i,11}(3)]^{-1}\,V_{i,12}(3),
\]
where a subscript 1 indicates the first two components and a subscript 2
refers to the third component.
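A minimal sketch (Python; the partitioning helper and the pattern-3 parameter values are hypothetical) of this conditioning step and a subsequent draw of the missing third measurement:

```python
import numpy as np

rng = np.random.default_rng(7)

def conditional_normal(mu, V, y_obs, obs_idx, mis_idx):
    """Mean and variance of the missing block given the observed block
    of a multivariate normal with mean mu and covariance V."""
    mu, V = np.asarray(mu, float), np.asarray(V, float)
    V_oo = V[np.ix_(obs_idx, obs_idx)]
    V_mo = V[np.ix_(mis_idx, obs_idx)]
    V_mm = V[np.ix_(mis_idx, mis_idx)]
    resid = np.asarray(y_obs, float) - mu[obs_idx]
    cond_mean = mu[mis_idx] + V_mo @ np.linalg.solve(V_oo, resid)
    cond_var = V_mm - V_mo @ np.linalg.solve(V_oo, V_mo.T)
    return cond_mean, cond_var

# Hypothetical pattern-3 parameters for a subject with two observed measurements
mu3 = np.array([1.0, 1.2, 1.5])
V3 = np.array([[1.0, 0.6, 0.4],
               [0.6, 1.2, 0.7],
               [0.4, 0.7, 1.5]])
m, S = conditional_normal(mu3, V3, y_obs=[0.8, 1.1], obs_idx=[0, 1], mis_idx=[2])
y3_draw = rng.multivariate_normal(m, S)   # imputed third measurement
print(m, S, y3_draw)
```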
• Draw from the kth component. Since every component of the mixture
is normal, only draws from uniform and normal random variables are
required.
All of these steps have been combined in a SAS macro, which is available
from the web site.
These figures, apart from giving a feel for the relative importance of the var-
ious patterns, will be needed to calculate marginal effects (such as marginal
treatment effect) from pattern-mixture model parameters, as was done in
Sections 18.3 and 18.4.
Strategies 2 and 1
We include time and time2 effects, as well as their interactions with treat-
ment. Further, time by baseline value interaction is included. Recall from
Section 18.4 that all effects interact with time, in order to force profiles to
pass through the origin, since we are studying change versus baseline. An
unstructured 3 × 3 covariance matrix is assumed for each pattern.
Let us present this model graphically. Since there is one binary (treatment
arm) and one continuous covariate (baseline level of FLIC score), and there
are three patterns, we can represent the models using 3 × 2 surface plots,
as in Figure 20.1. Similar plots for the other strategies are displayed in
Figures 20.3, 20.5, 20.7, and 20.9. More insight on the effect of extrapola-
tion can be obtained by presenting time plots for selected values of baseline
value, the only continuous covariate. Precisely, we chose the minimum, av-
erage, and maximum values (Figure 20.2). Bold type is used for the range
over which data are obtained within a particular pattern, and extrapola-
tion is indicated using thinner type. Note that the extrapolation can have
surprising effects, even with these relatively simple models. Thus, although
this form of extrapolation is simple, its plausibility can be called into ques-
tion. Note that this is in line with the experience gained from the toenail
data analysis in Section 18.3 (Figure 18.2).
This initial model provides a basis, and its graphical representation extra
motivation, to consider identifying-restrictions models. Using the method-
ology detailed in Section 20.5, a GAUSS macro and a SAS macro (available
from the web) were written to conduct the multiple imputation, fitting
of imputed data sets, and combination of the results into a single infer-
ence. Results are presented in Table 20.1, for each of the three types of
restrictions (CCMV, NCMV, ACMV). For patterns 1 and 2, there is some
variability in the parameter estimates across the three strategies, although
this is often consistent with random variation (see the standard errors).
Since the data in pattern 3 are complete, there is, of course, no difference
between the initial model parameters and those obtained with each of the
identifying-restrictions techniques. Again, a better impression can be ob-
tained from a graphical representation. In all of the two-dimensional plots,
the same mean response scale as in Figure 20.2 was retained, illustrat-
ing that the identifying-restrictions strategies extrapolate much closer to
TABLE 20.1. Vorozole Study. Multiple imputation estimates and standard errors
for CCMV, NCMV, and ACMV restrictions.
Effect Initial CCMV NCMV ACMV
Pattern 1:
Time 3.40(13.94) 13.21(15.91) 7.56(16.45) 4.43(18.78)
Time∗base −0.11(0.13) −0.16(0.16) −0.14(0.16) −0.11(0.17)
Time∗treat 0.33(3.91) −2.09(2.19) −1.20(1.93) −0.41(2.52)
Time2 −0.84(4.21) −2.12(4.24) −0.70(4.22)
Time2 ∗base 0.01(0.04) 0.03(0.04) 0.02(0.04)
σ11 131.09(31.34) 151.91(42.34) 134.54(32.85) 137.33(34.18)
σ12 59.84(40.46) 119.76(40.38) 97.86(38.65)
σ22 201.54(65.38) 257.07(86.05) 201.87(80.02)
σ13 55.12(58.03) 49.88(44.16) 61.87(43.22)
σ23 84.99(48.54) 99.97(57.47) 110.42(87.95)
σ33 245.06(75.56) 241.99(79.79) 286.16(117.90)
Pattern 2:
Time 53.85(14.12) 29.78(10.43) 33.74(11.11) 28.69(11.37)
Time∗base −0.46(0.12) −0.29(0.09) −0.33(0.10) −0.29(0.10)
Time∗treat −0.95(1.86) −1.68(1.21) −1.56(2.47) −2.12(1.36)
Time2 −18.91(6.36) −4.45(2.87) −7.00(3.80) −4.22(4.20)
Time2 ∗base 0.15(0.05) 0.04(0.02) 0.07(0.03) 0.05(0.04)
σ11 170.77(26.14) 175.59(27.53) 176.49(27.65) 177.86(28.19)
σ12 151.84(29.19) 147.14(29.39) 149.05(29.77) 146.98(29.63)
σ22 292.32(44.61) 297.38(46.04) 299.40(47.22) 297.39(46.04)
σ13 57.22(37.96) 89.10(34.07) 99.18(35.07)
σ23 71.58(36.73) 107.62(47.59) 166.64(66.45)
σ33 212.68(101.31) 264.57(76.73) 300.78(77.97)
Pattern 3:
Time 29.91(9.08) 29.91(9.08) 29.91(9.08) 29.91(9.08)
Time∗base −0.26(0.08) −0.26(0.08) −0.26(0.08) −0.26(0.08)
Time∗treat 0.82(0.95) 0.82(0.95) 0.82(0.95) 0.82(0.95)
Time2 −6.42(2.23) −6.42(2.23) −6.42(2.23) −6.42(2.23)
Time2 ∗base 0.05(0.02) 0.05(0.02) 0.05(0.02) 0.05(0.02)
σ11 206.73(35.86) 206.73(35.86) 206.73(35.86) 206.73(35.86)
σ12 96.97(26.57) 96.97(26.57) 96.97(26.57) 96.97(26.57)
σ22 174.12(31.10) 174.12(31.10) 174.12(31.10) 174.12(31.10)
σ13 87.38(30.66) 87.38(30.66) 87.38(30.66) 87.38(30.66)
σ23 91.66(28.86) 91.66(28.86) 91.66(28.86) 91.66(28.86)
σ33 262.16(44.70) 262.16(44.70) 262.16(44.70) 262.16(44.70)
the observed data mean responses. There are some differences among the
identifying-restrictions methods. Roughly speaking, CCMV extrapolates
rather toward a rise whereas NCMV seems to predict more of a decline,
at least for baseline value 53. Further, ACMV predominantly indicates a
steady state. For the other baseline levels, a status quo or a mild increase
FIGURE 20.1. Vorozole Study. Extrapolation based on initial model fitted to ob-
served data (Strategy 2). Per pattern and per treatment group, the mean response
surface is shown as a function of time (month) and baseline value.
Nevertheless, the NCMV prediction looks more plausible since the worst baseline value shows declining profiles, whereas the best one leaves room for improvement.
FIGURE 20.2. Vorozole Study. Extrapolation based on initial model fitted to ob-
served data (Strategy 2). For three levels of baseline value (minimum, average,
maximum), plots of mean profiles over time are presented. The bold portion of
the curves runs from baseline until the last obtained measurement, and the ex-
trapolated piece is shown in thin type. The dashed line refers to megestrol acetate;
the solid line is the Vorozole arm.
FIGURE 20.3. Vorozole Study. Complete case missing value restrictions analysis.
Per pattern and per treatment group, the mean response surface is shown as a
function of time (month) and baseline value.
FIGURE 20.4. Vorozole Study. Complete case missing value restrictions analysis.
For three levels of baseline value (minimum, average, maximum), plots of mean
profiles over time are presented. The bold portion of the curves runs from baseline
until the last obtained measurement, whereas the extrapolated piece is shown in
thin line type. The dashed line refers to megestrol acetate, the solid line is the
Vorozole arm.
Exactly the same results are obtained as before, by ensuring that every
effect interacts with the class variable pattern, and ‘group=pattern’ is included in the REPEATED statement.
FIGURE 20.5. Vorozole Study. Neighboring case missing value restrictions analy-
sis. Per pattern and per treatment group, the mean response surface is shown as
a function of time (month) and baseline value.
FIGURE 20.6. Vorozole Study. Neighboring case missing value restrictions analy-
sis. For three levels of baseline value (minimum, average, maximum), plots of
mean profiles over time are presented. The bold portion of the curves runs from
baseline until the last obtained measurement, and the extrapolated piece is shown
in thin type. The dashed line refers to megestrol acetate; the solid line is the
Vorozole arm.
FIGURE 20.7. Vorozole Study. Available case missing value restrictions analysis.
Per pattern and per treatment group, the mean response surface is shown as a
function of time (month) and baseline value.
Strategy 3
FIGURE 20.8. Vorozole Study. Available case missing value restrictions analysis.
For three levels of baseline value (minimum, average, maximum), plots of mean
profiles over time are presented. The bold portion of the curves runs from baseline
until the last obtained measurement, and the extrapolated piece is shown in thin
type. The dashed line refers to megestrol acetate; the solid line is the Vorozole
arm.
TABLE 20.2. Vorozole Study. Parameter estimates and standard errors for Strat-
egy 3.
These findings suggest, again, that a more careful reflection on the extrap-
olation method is required. This is perfectly possible in a pattern-mixture
context, but then the first strategy, rather than the second or third strat-
egy, has to be used, either as model of choice or to supplement insight
gained from another model.
FIGURE 20.9. Vorozole Study. Models with pattern used as a covariate (Strategy
3). Per pattern and per treatment group, the mean response surface is shown as
a function of time (month) and baseline value.
/* The opening lines of this program are reconstructed here for completeness;
   the data set and response variable names are placeholders, not those of
   the original program. */
proc mixed data = vorozole;
class id pattern timedisc;
model flicchange = time time*base
                   time*treat
                   time*treat*base
                   time*pattern
                   time*pattern*base
                   time*pattern*treat
                   time*time
                   time*time*base
                   time*time*treat
                   time*time*pattern
                   / noint solution covb;
repeated timedisc / subject = id type = un;
run;
FIGURE 20.10. Vorozole Study. Models with pattern used as a covariate (Strategy
3). For three levels of baseline value (minimum, average, maximum), plots of
mean profiles over time are presented. The bold portion of the curves runs from
baseline until the last obtained measurement, and the extrapolated piece is shown
in thin type. The dashed line refers to megestrol acetate; the solid line is the
Vorozole arm.
This program is also more parsimonious in the sense that it treats treatment
as a continuous variable which, when there are only two modalities, comes
down to exactly the same model as when it is treated as a class variable,
but a number of structural zeros are removed from the parameter vector.
In the latter case, two distinct routes are possible. The more “epidemi-
ologic” viewpoint is to direct inferences toward the vector of treatment
effects. For example, if treatment effects are heterogeneous across patterns,
then it may be deemed better to avoid combining such effects into a single
measure. In our case, this implies, for example, testing for the treatment
by time interaction to be zero in all three patterns simultaneously. Alter-
natively, one can calculate the same quantity as would be obtained in the
corresponding selection model. Then, the marginal treatment effect is cal-
culated, based on the pattern-specific treatment effects and the weighting
probabilities, perhaps irrespective of whether the treatment effects are ho-
mogeneous across patterns or not. This was done in (18.9) and (18.11) for
the toenail data, and in Section 18.4. See also Section 24.4.2 (Eq. (20.45)).
Precisely, let β̂_{ℓt} represent the treatment effect parameter estimates, ℓ = 1, ..., g (assuming there are g + 1 treatment groups), in pattern t = 1, ..., n, and let π_t be the proportion of patients in pattern t. Then, the estimates of the marginal treatment effects β̂_ℓ are
\[
\widehat{\beta}_\ell \;=\; \sum_{t=1}^{n} \widehat{\beta}_{\ell t}\, \pi_t, \qquad \ell = 1, \ldots, g. \qquad (20.45)
\]
The variance is obtained using the delta method. Precisely, it assumes the form
\[
\mathrm{var}(\widehat{\beta}_1, \ldots, \widehat{\beta}_g) \;=\; A V A', \qquad (20.46)
\]
where
\[
V \;=\; \begin{pmatrix} \mathrm{var}(\widehat{\beta}_{\ell t}) & 0 \\ 0 & \mathrm{var}(\pi_t) \end{pmatrix} \qquad (20.47)
\]
and
\[
A \;=\; \frac{\partial(\widehat{\beta}_1, \ldots, \widehat{\beta}_g)}{\partial(\widehat{\beta}_{11}, \ldots, \widehat{\beta}_{ng}, \pi_1, \ldots, \pi_n)}, \qquad (20.48)
\]
where β_0 = (β_1, ..., β_g)'.
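A small Python sketch of (20.45) and its delta-method variance (20.46), for a single treatment contrast across three patterns; the pattern proportions, their covariance, and the sample size are hypothetical, while the pattern-specific effects and variances are those of the Strategy 2 analysis reported below:

```python
import numpy as np

# Pattern-specific treatment effects and their variances (Strategy 2),
# together with hypothetical pattern proportions and sample size
beta_t = np.array([0.33, -0.95, 0.82])
var_beta_t = np.diag([15.28, 3.44, 0.90])
pi_t = np.array([0.25, 0.35, 0.40])        # hypothetical
N = 200                                     # hypothetical
var_pi = (np.diag(pi_t) - np.outer(pi_t, pi_t)) / N

beta_marg = beta_t @ pi_t                   # (20.45)

# Delta method: derivative w.r.t. (beta_1,...,beta_n, pi_1,...,pi_n)
A = np.concatenate([pi_t, beta_t])
V = np.block([[var_beta_t, np.zeros((3, 3))],
              [np.zeros((3, 3)), var_pi]])
var_marg = A @ V @ A                        # (20.46)
print(beta_marg, var_marg)
```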
We will now apply both testing approaches to the models presented in Ta-
bles 20.1 and 20.2. All three pattern-mixture strategies will be considered.
Since the identifying restrictions strategies are slightly more complicated
than the others, we will consider the other strategies first.
Strategy 2
Recall that the parameters are presented in Table 20.1 as the initial model.
The treatment effect vector is β = (0.33, −0.95, 0.82) with, since the pat-
terns are analyzed separately, diagonal covariance matrix:
\[
V \;=\; \begin{pmatrix} 15.28 & 0 & 0 \\ 0 & 3.44 & 0 \\ 0 & 0 & 0.90 \end{pmatrix}.
\]
These quantities are obtained as either the square of the standard errors
reported in the ‘Solution for Fixed Effects’ panel in the output of the SAS
procedure MIXED or, directly, as the appropriate diagonal elements of the
‘Covariance Matrix for Fixed Effects’ panel, produced by means of the
‘covb’ option in the MODEL statement. This leads to the test statistic β'V⁻¹β = 1.02 on 3 degrees of freedom, producing p = 0.796.
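This calculation is easily reproduced; a short Python sketch using the numbers just quoted:

```python
import numpy as np
from scipy import stats

beta = np.array([0.33, -0.95, 0.82])
V = np.diag([15.28, 3.44, 0.90])

wald = beta @ np.linalg.solve(V, beta)      # about 1.02
p_value = 1 - stats.chi2.cdf(wald, df=3)    # about 0.80
print(wald, p_value)
```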
Strategy 3
The parameters are presented in Table 20.2. The treatment effect vector is
β = (5.25, 3.48, 3.44) with nondiagonal covariance matrix
\[
V \;=\; \begin{pmatrix} 41.12 & 23.59 & 25.48 \\ 23.59 & 29.49 & 30.17 \\ 25.48 & 30.17 & 36.43 \end{pmatrix}.
\]
The covariance matrix is not diagonal because the other treatment effects (the three-way interaction with baseline and time, and the interaction with time2) are common to all three patterns, inducing dependence across patterns.
Strategy 1
The CCMV case will be discussed in detail. The two other restriction types
are entirely similar.
There are three treatment effects, one for each pattern. Hence, multiple
imputation produces five vectors of three treatment effects which are aver-
aged to produce a single treatment effect vector. In addition, the within,
between, and total covariance matrices are calculated:
\[
\widehat{\beta}_{CC} = (-2.09, -1.68, 0.82)', \qquad (20.50)
\]
\[
W_{CC} = \begin{pmatrix} 1.67 & 0.00 & 0.00 \\ 0.00 & 0.59 & 0.00 \\ 0.00 & 0.00 & 0.90 \end{pmatrix}, \qquad (20.51)
\]
\[
B_{CC} = \begin{pmatrix} 2.62 & 0.85 & 0.00 \\ 0.85 & 0.72 & 0.00 \\ 0.00 & 0.00 & 0.00 \end{pmatrix}, \qquad (20.52)
\]
and
\[
T_{CC} = \begin{pmatrix} 4.80 & 1.02 & 0.00 \\ 1.02 & 1.46 & 0.00 \\ 0.00 & 0.00 & 0.90 \end{pmatrix}. \qquad (20.53)
\]
TABLE 20.3. Vorozole Study. Tests of treatment effect for CCMV, NCMV, and
ACMV restrictions.
Parameter CCMV NCMV ACMV
Stratified analysis:
k 3 3 3
τ 12 12 12
denominator d.f. w 28.41 17.28 28.06
r 1.12 2.89 1.14
F -statistic 1.284 0.427 0.946
p-value 0.299 0.736 0.432
Marginal Analysis:
Marginal effect (s.e.) −0.85(0.77) −0.63(1.22) −0.74(0.85)
k 1 1 1
τ 4 4 4
denominator d.f. w 4 4 4
r 1.49 4.57 1.53
F -statistic 0.948 0.216 0.579
p-value 0.385 0.667 0.489
Note that, even though the analysis is done per pattern, the between and
total matrices have nonzero off-diagonal elements. This is because imputa-
tion is done based on information from other patterns, hence introducing
interpattern dependence. Results are presented in Table 20.3. All results
are nonsignificant, in line with earlier evidence from Strategies 2 and 3,
although the p-values for CCMV and ACMV are somewhat smaller.
For the marginal parameter, the situation is more complicated here than
with Strategies 2 and 3. Indeed, the theory of Section 20.3.2 assumes in-
ference is geared toward the original vector, or linear contrasts thereof.
Formula (20.45) displays a nonlinear transformation of the parameter vec-
tor and therefore needs further development. First, consider π to be part
of the parameter vector. Since there is no missingness involved in this part,
it contributes to the within matrix, but not to the between matrix. Then,
using (20.46), the approximate within matrix for the marginal treatment
effect is
\[
W_0 \;=\; \pi' W \pi + \beta'\,\mathrm{var}(\pi)\,\beta,
\]
with, for the between matrix, simply
\[
B_0 \;=\; \pi' B \pi.
\]
The latter formula consists of only one term, since there is no between
variance for π.
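A short Python sketch of these marginal within and between components, using the CCMV quantities (20.50)–(20.53) and hypothetical pattern proportions:

```python
import numpy as np

beta_cc = np.array([-2.09, -1.68, 0.82])
W_cc = np.array([[1.67, 0.00, 0.00],
                 [0.00, 0.59, 0.00],
                 [0.00, 0.00, 0.90]])
B_cc = np.array([[2.62, 0.85, 0.00],
                 [0.85, 0.72, 0.00],
                 [0.00, 0.00, 0.00]])

# Hypothetical pattern proportions and their (multinomial) covariance
pi = np.array([0.25, 0.35, 0.40])
N = 200
var_pi = (np.diag(pi) - np.outer(pi, pi)) / N

W0 = pi @ W_cc @ pi + beta_cc @ var_pi @ beta_cc
B0 = pi @ B_cc @ pi
M = 5
T0 = W0 + (M + 1) / M * B0        # total variance of the marginal effect
print(W0, B0, T0)
```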
The results are presented in the second panel of Table 20.3. All three p-
values are, again, nonsignificant, in agreement with Strategies 2 and 3. Of
course, all five agree on the nonsignificance of the treatment effect. The
reason for the differences is to be found in the way the treatment effect is
extrapolated beyond the period of observation. Indeed, the highest p-value
is obtained for Strategy 2, and from Figure 20.2, we learn that virtually no
separation between both treatment arms is projected. On the other hand,
wider separations are seen in Figure 20.10.
Model building guidelines for the standard linear mixed-effects model can
be found in Chapter 9. These guidelines can be used without any problem
in a selection model context, but the pattern-mixture case is more compli-
cated. Of course, the same general principles can be applied, taking into
account the intertwining between the mean or fixed-effects structure, on
the one hand, and the components of variability on the other hand, as
graphically represented in Figure 9.1.
Strategy 3
TABLE 20.4. Vorozole Study. Strategy 3. Parameter estimates and standard er-
rors of a reduced model.
Strategy 2
20.7 Thoughts
Section 24.4 offers a case study on the milk protein contents trial (Diggle
and Kenward 1994). Several sensitivity analysis tools, both informal and
formal, are employed to gain insight into the data and into the missingness
mechanism.
The SAS and GAUSS macros which have been used to carry out the mul-
tiple imputation related tasks are available from the authors’ web pages.
21
How Ignorable Is Missing At
Random?
21.1 Introduction
For over two decades, following the pioneering work of Rubin (1976) and
Little (1976), there has been a growing literature on incomplete data, with
a lot of emphasis on longitudinal data. Following the original work of Rubin
and Little, there has evolved a general view that “likelihood methods” that
ignore the missing value mechanism are valid under an MAR process, where
likelihood is interpreted in a frequentist sense. The availability of flexible
standard software for incomplete data, such as PROC MIXED, and the
advantages quoted in Section 17.3 contribute to this point of view. This
statement needs careful qualification however. Kenward and Molenberghs
(1998) provided an exposition of the precise sense in which frequentist
methods of inference are justified under MAR processes.
As discussed in Section 15.8, Rubin (1976) has shown that MAR (and
parameter distinctness) is necessary and sufficient to ensure validity of
direct-likelihood inference when ignoring the process that causes missing
data. Here, direct-likelihood inference is defined as an “inference that re-
sults solely from ratios of the likelihood function for various values of the
parameter,” in agreement with the definition in Edwards (1972). In the
concluding section of the same paper, Rubin remarks:
Little and Rubin (1987) discuss several aspects of this problem and propose,
using the observed information matrix, to circumvent problems associated
with the determination of the correct expected information matrix. Laird
(1988) makes a similar point in the context of incomplete longitudinal data
analysis.
In this section, we will drop the subject subscript i from the notation.
We assume that the joint distribution of the full data (Y , R) is regular in
the sense of Cox and Hinkley (1974, p. 281). We are concerned here with
the sampling distributions of certain statistics under MCAR and MAR
mechanisms. Under an MAR process, the joint distribution of Y o (the
observed components) and R factorizes as in (15.6). In terms of the log-
likelihood function, we have
\[
\ell(\theta, \psi; y^o, r) \;=\; \ell_1(\theta; y^o) + \ell_2(\psi; r, y^o). \qquad (21.1)
\]
It is assumed that θ and ψ satisfy the separability condition. This partition
of the likelihood has, with important exceptions, been taken for granted to
mean that, under an MAR mechanism, likelihood methods based on ℓ₁ alone are valid for inferences about θ even when interpreted in the broad
frequentist sense. We now consider more precisely the sense in which the dif-
ferent elements of the frequentist likelihood methodology can be regarded
as valid in general under the MAR mechanism. It is now well known that
such inferences are valid under an MCAR mechanism (Rubin 1976, Sec-
tion 6).
First, we note that under the MAR mechanism, r is not an ancillary sta-
tistic for θ in the extended sense of Cox and Hinkley (1974, p. 35). (A
statistic S(Y , R) is ancillary for θ if its distribution does not depend upon
θ.) Hence, we are not justified in restricting the sample space from that
associated with the pair (Y , R). In considering the properties of frequentist
procedures below, we therefore define the appropriate sampling distribu-
tions to be that determined by this pair. We call this the unconditional
sampling framework. By working within this framework, we do need to
consider the missing value mechanism. We shall be comparing this with
the sampling distribution that would apply if r were fixed by design [i.e.,
if we repeatedly sampled using the distribution f (y o ; θ)]. If this sampling
distribution were appropriate, this would lead directly to the use of ℓ₁ as a basis for inference. We call this the naive sampling framework.
The above argument justifying the use of the maximum likelihood estimators from ℓ₁(θ; y^o) applies equally well to the use of the inverse of the observed information derived from ℓ₁ as an estimate of the asymptotic
variance-covariance matrix of these estimators. This has been pointed out
by Little and Rubin (1987, Section 8.2.2) and Laird (1988, p. 307). In addi-
tion, there are other reasons for preferring the observed information matrix
(Efron and Hinkley 1978).
21.3 Illustration
We consider a bivariate normal sample (Y_{i1}, Y_{i2}), i = 1, ..., N, with mean vector μ = (μ₁, μ₂)' and covariance matrix
\[
\Sigma \;=\; \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix}.
\]
It is assumed that m complete pairs are observed and that only the first member (Y_{i1}) is observed for the remaining N − m pairs. This implies that the dropout process can be
represented by a scalar indicator Ri which is 1 if the second component is
observed and 0 otherwise. The log-likelihood can be expressed as the sum
of the log-likelihoods for the complete and incomplete pairs:
\begin{align*}
\ell \;=\;& \sum_{i=1}^{m} \ln f(y_{i1}, y_{i2} \mid \mu_1, \mu_2, \sigma_{11}, \sigma_{12}, \sigma_{22}) + \sum_{i=m+1}^{N} \ln f(y_{i1} \mid \mu_1, \sigma_{11}) \\
=\;& -\frac{N-m}{2}\ln\sigma_{11} - \frac{m}{2}\ln|\Sigma| - \frac{1}{2\sigma_{11}}\sum_{i=m+1}^{N}(y_{i1}-\mu_1)^2 \\
& - \frac{1}{2}\sum_{i=1}^{m}
\begin{pmatrix} y_{i1}-\mu_1 \\ y_{i2}-\mu_2 \end{pmatrix}'
\begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix}^{-1}
\begin{pmatrix} y_{i1}-\mu_1 \\ y_{i2}-\mu_2 \end{pmatrix}.
\end{align*}
The observed information matrix then has blocks
\[
i_O(\mu, \mu) \;=\; (N-m)\begin{pmatrix} \sigma_{11}^{-1} & 0 \\ 0 & 0 \end{pmatrix} + m\,\Sigma^{-1},
\]
\[
i_O(\mu_1, \sigma_{11}) \;=\; \frac{1}{\sigma_{11}^{2}}\sum_{i=m+1}^{N}(y_{i1}-\mu_1) \;+\; \sum_{i=1}^{m} e_1'\,\Sigma^{-1} E_{11} \Sigma^{-1} \begin{pmatrix} y_{i1}-\mu_1 \\ y_{i2}-\mu_2 \end{pmatrix}, \qquad (21.3)
\]
and, for the other combinations of mean and covariance parameters,
\[
i_O(\mu_j, \sigma_{kl}) \;=\; \sum_{i=1}^{m} e_j'\,\Sigma^{-1} E_{kl} \Sigma^{-1} \begin{pmatrix} y_{i1}-\mu_1 \\ y_{i2}-\mu_2 \end{pmatrix}, \qquad (21.4)
\]
for
\[
e_1 = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \quad e_2 = \begin{pmatrix} 0 \\ 1 \end{pmatrix},
\]
and
\[
E_{11} = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}, \quad
E_{12} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad
E_{22} = \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}.
\]
Because the missingness pattern enters the expression for i_U(μ, μ) only through m, the naive and unconditional information matrices for μ are effectively equivalent. However, we now show that this is not true for the cross-term elements of the information matrices. Define α_j = E(Y_{i1} | r_i = j) − μ₁, j = 0, 1. For the conditional
expectation of Yi2 in the complete subgroup, we have
\begin{align*}
E(Y_{i2} \mid r_i = 1) &= \int\!\!\int y_{i2}\, f(y_{i2} \mid y_{i1})\, dy_{i2}\; f(y_{i1} \mid r_i = 1)\, dy_{i1} \\
&= \mu_2 - \sigma_{12}\sigma_{11}^{-1}\mu_1 + \frac{\sigma_{12}}{\sigma_{11}\,P(r_i = 1)} \int y_{i1}\, f(y_{i1}, r_i = 1)\, dy_{i1} \\
&= \mu_2 + \sigma_{12}\sigma_{11}^{-1}\{E(Y_{i1} \mid r_i = 1) - \mu_1\},
\end{align*}
or
\[
E_{Y|R}(Y_{i2} - \mu_2) = \beta\alpha_1
\]
for β = σ₁₂σ₁₁⁻¹. Hence,
\[
E_{Y|R}\begin{pmatrix} Y_{i1} - \mu_1 \\ Y_{i2} - \mu_2 \end{pmatrix} = \alpha_1 \begin{pmatrix} 1 \\ \beta \end{pmatrix}.
\]
Noting that
\[
\Sigma^{-1}\begin{pmatrix} 1 \\ \beta \end{pmatrix} = \begin{pmatrix} \sigma_{11}^{-1} \\ 0 \end{pmatrix} = \sigma_{11}^{-1} e_1,
\]
we then have from (21.3) and (21.4)
\[
E_{Y|R}\{i_O(\mu_j, \sigma_{kl})\} \;=\;
\begin{cases}
\displaystyle (N-m)\,\frac{\alpha_0}{\sigma_{11}^{2}} + m\,\frac{\alpha_1}{\sigma_{11}}\, e_1'\Sigma^{-1}E_{11}e_1, & j = k = l = 1, \\[2ex]
\displaystyle m\,\frac{\alpha_1}{\sigma_{11}}\, e_j'\Sigma^{-1}E_{kl}e_1, & \text{otherwise.}
\end{cases}
\]
Finally, taking expectations over R, we get for the cross-terms of the unconditional information matrix
\[
i_U(\mu, \sigma_{11}) \;=\; \frac{N(1-\pi)\alpha_0}{\sigma_{11}^{2}}\begin{pmatrix} 1 \\ 0 \end{pmatrix} + \frac{N\pi\alpha_1}{\sigma_{11}(\sigma_{11}\sigma_{22}-\sigma_{12}^{2})}\begin{pmatrix} \sigma_{22} \\ -\sigma_{12} \end{pmatrix}, \qquad (21.5)
\]
\[
i_U(\mu, \sigma_{12}) \;=\; \frac{N\pi\alpha_1}{\sigma_{11}\sigma_{22}-\sigma_{12}^{2}}\begin{pmatrix} -\beta \\ 1 \end{pmatrix}, \qquad (21.6)
\]
\[
i_U(\mu, \sigma_{22}) \;=\; \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \qquad (21.7)
\]
where π = P(r_i = 1) is the probability of observing a complete pair.
We now illustrate these findings with a few numerical results. The off-
diagonal unconditional information elements (21.5)–(21.7) are computed
for sample size N = 1000, mean vector (0, 0)', and two covariance matrices: (1) σ₁₁ = σ₂₂ = 1 with correlation ρ = σ₁₂ = 0.5, and (2) σ₁₁ = 2, σ₂₂ = 3, and ρ = 0.5, leading to σ₁₂ = √6/2. Further, two MAR dropout mechanisms are considered. They are both of the logistic form
\[
P(R_i = 0 \mid y_{i1}) \;=\; \frac{\exp(\gamma_0 + \gamma_1 y_{i1})}{1 + \exp(\gamma_0 + \gamma_1 y_{i1})}.
\]
We choose γ₀ = 0 and (a) γ₁ = 1 or (b) γ₁ = −∞. The latter mechanism implies r_i = 0 if y_{i1} < 0 and r_i = 1 otherwise. Both dropout mechanisms yield π = 0.5. In all cases, α₁ = −α₀, with α₁ taking the following values in the four possible combinations of covariance and dropout parameters: (1a) 0.4132, (1b) √(2/π), (2a) 0.7263, and (2b) 2/√π. Numerical values for (21.5)–(21.7) are presented in Table 21.1, as well as the average from the observed information matrices
Table 21.1, as well as the average from the observed information matrices
in a simulation with 500 replicates.
Obviously, these elements are far from the zero values that would be obtained with the naive estimator. They are of the same order of magnitude as the upper left block of the information matrix (pertaining to the mean parameters), which equals
\[
\begin{pmatrix} 1166.67 & -333.33 \\ -333.33 & 666.67 \end{pmatrix}.
\]
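The following Python sketch evaluates the mean block and the cross-term expressions (21.5)–(21.7) for the first covariance matrix under the second dropout mechanism:

```python
import numpy as np

N, pi = 1000, 0.5
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
s11, s12, s22 = Sigma[0, 0], Sigma[0, 1], Sigma[1, 1]
alpha1 = np.sqrt(2.0 / np.pi)      # E(Y1 | Y1 >= 0) for the gamma1 = -infinity mechanism
alpha0 = -alpha1
det = s11 * s22 - s12 ** 2
beta = s12 / s11

# Mean block: i_U(mu, mu) = N(1 - pi) diag(1/s11, 0) + N pi Sigma^{-1}
i_mu_mu = N * (1 - pi) * np.array([[1 / s11, 0], [0, 0]]) + N * pi * np.linalg.inv(Sigma)

# Cross terms (21.5)-(21.7)
i_mu_s11 = (N * (1 - pi) * alpha0 / s11 ** 2) * np.array([1.0, 0.0]) \
         + (N * pi * alpha1 / (s11 * det)) * np.array([s22, -s12])
i_mu_s12 = (N * pi * alpha1 / det) * np.array([-beta, 1.0])
i_mu_s22 = np.zeros(2)

print(i_mu_mu)            # approximately [[1166.67, -333.33], [-333.33, 666.67]]
print(i_mu_s11, i_mu_s12, i_mu_s22)
```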
We performed a limited simulation study to verify the coverage probability
for the Wald tests under the unconditional and a selection of conditional
frameworks. The hypotheses considered are H01 : µ1 = 0, H02 : µ2 = 0,
and H03 : µ1 = µ2 = 0. The simulations have been restricted to the first
covariance matrix used in Table 21.1 and to the second dropout mecha-
nism (γ1 = −∞). Results are reported in Table 21.2. The coverages for the
TABLE 21.1. Bivariate Normal Data. Computed and simulated values for the
off-diagonal block of the unconditional information matrix. Sample size is N =
1000 (500 replications). (The true model has zero mean vector. Two true covari-
ances Σ and two dropout parameters γ1 are considered.)
21.4 Example
TABLE 21.2. Bivariate Normal Data. True values are as in the third model of
Table 21.1. Coverage probabilities (× 1000) for Wald test statistics. Sample size is
N = 1000 (500 replications). The null hypotheses are H01 : µ1 = 0, H02 : µ2 = 0,
H03 : µ1 = µ2 = 0. For the naive sampling frameworks, m denotes the fixed
number of complete cases.
In this example, rats were divided into five groups receiving different concentrations of halothane (0%, 0.25%, 0.5%, 1%, and 2%). The groups were of sizes 11, 10, 11, 11, and 11 rats, respectively. Following an induced heart attack in each
rat, the blood pressure was recorded on nine unequally spaced occasions.
A number of rats died during the course of the experiment, including all
rats from group 5 (2% halothane). Following the original authors we omit
this group from the analysis since they contribute no information at all,
leaving 43 rats, of which 23 survived the experiment.
Examination of the data from these four groups does not provide any ev-
idence of an MAR dropout process, although this observation must be
considered in the light of the small sample size. A Gaussian multivariate
linear model with an unconstrained covariance matrix was fitted to the
data. There was very little evidence of a treatment by time interaction
and the following results are based on the use of a model with additive
effects for treatment and time. The Wald statistics for the treatment main
effect on 3 degrees of freedom are equal to 46.95 and 30.82 respectively
using the expected and observed information matrices. Although leading
to the same qualitative conclusions, the figures are notably discrepant. A
first reaction may be to attribute this difference to the incompleteness of
the data. However, the lack of evidence for an MAR process together with
the relatively small sample size points to another cause. The equivalent
analysis of the 24 completers produces Wald statistics of 45.34 and 26.35,
respectively; that is, the effect can be attributed to a combination of small-
sample variation and possible model misspecification. A theoretical reason for this difference might be that the off-diagonal block of the information matrix of the maximum likelihood estimates (describing the covariance between the mean and covariance parameters) has expectation zero but is likely to depart from zero in small samples. As a consequence, the
variances of the estimated treatment effects will be higher when derived
from the observed information, thereby reducing the Wald statistic.
TABLE 21.3. Maximum likelihood estimates and standard errors (in parentheses)
for the parameters in Model 7, fitted to the growth data (complete data set and
ignorable analysis).
Clearly, there are only minor differences between the two sets of standard
errors and the analysis on an incomplete set of data does not seem to widen
the gap.
We can conclude from this that, with the exception of the expected informa-
tion matrix, conventional likelihood-based frequentist inference, including
standard hypothesis testing, is applicable in the MAR setting. Standard
errors based on inverting the entire Hessian are to be preferred, and in
this sense, it is a pity that this option is presently not available in PROC
MIXED.
22
The Expectation-Maximization
Algorithm
Although the models in Table 17.5 are fitted using direct observed data like-
lihood maximization in PROC MIXED, Little and Rubin (1987) obtained
these same results using the Expectation-Maximization algorithm. Special
forms of the algorithm, designed for specific applications, had been pro-
posed for about half a century (e.g., Yates 1933), but the first unifying and
formal account was given by Dempster, Laird, and Rubin (1977). McLach-
lan and Krishnan (1997) devoted a whole volume to the EM algorithm and
its extensions.
Even though the SAS procedure MIXED uses direct likelihood maximiza-
tion, the EM algorithm is generally useful to maximize certain complicated
likelihood functions. For example, it has been used to maximize mixture
likelihoods in Section 12.3. Liu and Rubin (1995) used it to estimate the t-distribution, relying on EM and on its extensions ECM (expectation conditional maximization) and ECME (expectation conditional maximization either), which are described in Meng and Rubin (1993), Liu and Rubin (1994), and
van Dyk, Meng, and Rubin (1995). EM methods specifically for mixed-
effects models are discussed in Meng and van Dyk (1998). A nice review
is given in Meng (1997), where the focus is on EM applications in medical
studies.
The algorithm starts from an initial value θ^{(0)} of the parameter vector, which can be found from, for example, a complete case analysis, an available case analysis, single imputation, or any other convenient method.
The E Step. Given the current value θ (t) of the parameter vector, the E
step computes the expected value of the complete data log-likelihood,
given the observed data and the current parameters, which is called
the objective function:
\[
Q(\theta \mid \theta^{(t)}) \;=\; \int \ell(\theta; y)\, f(y^m \mid y^o, \theta^{(t)})\, dy^m \;=\; E\bigl[\ell(\theta; y) \mid y^o, \theta^{(t)}\bigr]. \qquad (22.1)
\]
These computations can easily be done using the sweep operator (Little
and Rubin 1987).
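As a self-contained illustration (Python; not the SAS-based analyses discussed in the text), here is a minimal EM iteration for a bivariate normal sample in which the second component may be missing:

```python
import numpy as np

def em_bivariate_normal(y1, y2, n_iter=50):
    """EM for a bivariate normal sample in which y2 may contain NaN (MAR dropout).

    E step: replace missing y2 by its conditional mean and add the conditional
    variance to the sufficient statistics.  M step: update mean and covariance.
    """
    y1, y2 = np.asarray(y1, float), np.asarray(y2, float)
    miss = np.isnan(y2)
    mu = np.array([y1.mean(), np.nanmean(y2)])            # starting values
    Sigma = np.cov(y1[~miss], y2[~miss])

    for _ in range(n_iter):
        # E step: conditional mean and variance of Y2 given Y1 for incomplete pairs
        b = Sigma[0, 1] / Sigma[0, 0]
        cond_mean = mu[1] + b * (y1[miss] - mu[0])
        cond_var = Sigma[1, 1] - b * Sigma[0, 1]
        y2_filled = y2.copy()
        y2_filled[miss] = cond_mean

        # M step: update mean and covariance from the completed sufficient statistics
        n = y1.size
        mu = np.array([y1.mean(), y2_filled.mean()])
        d1, d2 = y1 - mu[0], y2_filled - mu[1]
        Sigma = np.array([[d1 @ d1, d1 @ d2],
                          [d1 @ d2, d2 @ d2 + miss.sum() * cond_var]]) / n
    return mu, Sigma

# Toy data with dropout depending on the observed y1 (an MAR mechanism)
rng = np.random.default_rng(3)
y = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=500)
y1, y2 = y[:, 0], y[:, 1].copy()
y2[y1 < -0.5] = np.nan
print(em_bivariate_normal(y1, y2))
```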
Unstructured: Σ = α.
Banded: σ_jk = α_r with r = |j − k| + 1.
Autoregressive: σ_jk = α_1 α_2^{|j−k|}.
Compound symmetry: Σ = α_1 J_4 + α_2 I_4, with J_4 a matrix of ones.
Random effects: Σ = ZαZ′ + σ²I, with α the covariance matrix of the
random effects.
Independence: Σ = αI_4.
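In PROC MIXED itself, such structures would typically be requested through the
TYPE= option of the REPEATED statement, or through a RANDOM statement for the
random-effects structure. The program below is only a sketch: the data set name
growth and the variables idnr, sex, age, and measure are assumed labels for the
growth data, and only one residual structure would be used at a time.

proc mixed data = growth method = ml;
  class idnr sex age;
  model measure = sex age*sex / s;
  repeated age / type = un subject = idnr;   /* unstructured                */
  /* alternatives for the residual covariance structure:
       type = toep   : banded (Toeplitz)
       type = ar(1)  : autoregressive
       type = cs     : compound symmetry
       type = simple : independence                                         */
run;

/* the random-effects structure is obtained with a RANDOM statement, e.g.,
   random intercepts and slopes with unstructured covariance matrix         */
proc mixed data = growth method = ml;
  class idnr sex;
  model measure = sex age*sex / s;
  random intercept age / type = un subject = idnr;
run;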
The main drawback of the EM algorithm is its typically slow rate of con-
vergence. The double iterative structure of many implementations adds to
the problem. Further, the algorithm does not automatically provide preci-
sion estimates for the parameters. Proposals for overcoming these limitations have been made
by, for example, Louis (1982), Meilijson (1989), Meng and Rubin (1991),
Baker (1992), Meng and van Dyk (1997), and Liu, Rubin, and Wu (1998).
In the light of these observations, one might argue that the existence of
PROC MIXED, enabling the use of Newton-Raphson or Fisher scoring al-
gorithms to maximize the observed data likelihood, is fortunate. Although
this statement is certainly warranted for a wide range of applications, there
may be situations where the EM algorithm is beneficial. Baker (1994) men-
tions advantages of starting with an EM algorithm and then switching to
Newton-Raphson, if necessary, including less sensitivity to poor starting
values and more reliable convergence to a boundary when the maximum
likelihood estimate is indeed a boundary value. In the latter situation,
Many of the issues briefly touched upon in this section are discussed at
length in McLachlan and Krishnan (1997). This includes the definition
and basic principles and theory of EM, various ways of obtaining standard
errors and improving the speed of convergence, as well as extensions, in-
cluding those mentioned earlier, but also stochastic EM and Gibbs sampler
versions.
23
Design Considerations
23.1 Introduction
In the first part of this book (Chapters 3 to 13), emphasis was on the for-
mulation and fitting of, as well as on inference and diagnostics for, linear
mixed models in general. Later (Chapters 14 to 22), the problem of miss-
ing data was discussed in full detail, with emphasis on how to obtain valid
inferences from observed longitudinal data and how to perform sensitivity
analyses with respect to assumptions made about the dropout process.
In practice (see, e.g., the rat experiment and the Vorozole study introduced
in Chapter 2), longitudinal experiments often do not yield the amount of
information hoped for at the design stage, due to dropout. This results
in realized experiments with (possibly much) less power than originally
planned. In Section 23.4, it will be shown how expected dropout can be
taken into account in sample-size calculations. The basic idea behind this
is that two designs with equal power in the absence of dropout are not
necessarily equally powerful once dropout has occurred.
When the hypothesis of interest can be written as
\[
H_0 : \xi \equiv L\beta - \xi_0 = 0 \quad \text{versus} \quad H_A : \xi \neq 0 \tag{23.1}
\]
for some known matrix L and known vector ξ0 , a simplified procedure can
be followed. As explained in Section 6.2.2, (23.1) can be tested based on
the fact that, under the null hypothesis, the test statistic
\[
F \;=\; \widehat{\xi}\,' \left[\, L \left( \sum_i X_i' V_i^{-1} X_i \right)^{-1} L' \right]^{-1} \widehat{\xi} \;\Big/\; \mathrm{rank}(L)
\]
approximately follows an F-distribution with rank(L) numerator degrees of freedom.
Helms (1992) reports simulation results which show that, under the al-
ternative hypothesis HA, the distribution of F can also be approximated
by an F-distribution, now with rank(L) and Σ_i n_i − rank[X|Z] degrees of
freedom, and with an appropriate noncentrality parameter.
An example in which the above results are used for power calculations
can be found in Helms (1992), where it has been shown empirically that
intentionally incomplete designs, where some subjects are intentionally not
measured at all time points, can have more power while being less expensive
to conduct. Another example will be given in the next section. Finally, the
noncentral F -approximation will also be used in Section 23.4 to perform
power calculations, taking into account that dropout is to be expected.
Note that, as already mentioned in Section 2.1 and shown in Table 2.1, this
rat experiment suffers from a severe degree of dropout, since many rats do
not survive anesthesia needed to measure the outcome. Indeed, although
50 rats have been randomized at the start of the experiment, only 22 of
them survived the first 6 measurements, so measurements on only 22 rats
are available in the way anticipated at the design stage. For example, at
the second occasion (age = 60 days), only 46 rats were available, implying
that for 4 rats, only 1 measurement has been recorded. As can be expected,
this high dropout rate inevitably leads to severe losses in efficiency of the
statistical inferential procedures. Indeed, if no dropout had occurred (i.e.,
if all 50 rats had withstood all 7 measurements), the power for
detecting the observed differences at the 5% level of significance would
have been 74%, rather than the 56% previously reported for the realized
experiment.
In the rat example, dropout was not entirely unexpected since it is in-
herently related to the way the response of interest is actually measured
(anesthesia cannot be avoided) and should therefore have been taken into
account at the design stage. In the next section, we will discuss a general,
computationally simple method, proposed by Verbeke and Lesaffre (1999)
for the design of longitudinal experiments, when dropout is to be expected.
Afterward, in Section 23.5, the proposed approach is applied to the rat
data.
In order to fully understand how the dropout process can be taken into
account at the design stage, we first investigate how it affects the power
of a realized experiment. Note that the power of the F -test described in
Section 23.2 not only depends on the true parameter values β, D, and σ 2
(or, more generally, Σi ) but also on the covariates Xi and Zi . Usually, in
designed experiments, many subjects will have the same covariates, such
that there are only a small number of different sets (Xi , Zi ). For the rat
data, for example, all 15 rats in the control group have Xi and Zi equal to
\[
X_i = \begin{pmatrix}
1 & 0 & 0 & \ln[1 + (50 - 45)/10] \\
1 & 0 & 0 & \ln[1 + (60 - 45)/10] \\
\vdots & \vdots & \vdots & \vdots \\
1 & 0 & 0 & \ln[1 + (110 - 45)/10]
\end{pmatrix},
\qquad
Z_i = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}.
\]
However, due to the dropout mechanism, the above matrices have been
realized for only four of them. Indeed, for a rat that drops out early, say at
the kth occasion, the realized design matrices equal the first k rows of the
above planned matrices; that is,
\[
X_i = \begin{pmatrix}
1 & 0 & 0 & \ln[1 + (50 - 45)/10] \\
\vdots & \vdots & \vdots & \vdots \\
1 & 0 & 0 & \ln[1 + (40 + k \times 10 - 45)/10]
\end{pmatrix},
\qquad
Z_i = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}.
\]
Note that the number of rats that drop out at each occasion is a realization
of the stochastic dropout process, from which it follows that the power
of the realized experiment is also a realization of a random variable, the
distribution of which depends on the planned design and on the dropout
process. From now on, we will denote this random power function by P.
Note that all of the above criteria are based on only one specific aspect of
the distribution of P. A criterion which takes into account the full distri-
bution selects the second design over the first one if P 1 is stochastically
smaller than P 2 , P 1 ≺ P 2 , which is defined as (Lehmann and D’Abrera
1975, p. 66)
P1 ≺ P2 ⇐⇒ P (P 1 ≤ p) ≥ P (P 2 ≤ p), ∀p.
This means that, for any power p, the risk of ending up with a final analysis
with power less than p is smaller for the second design than for the first
design. Obviously, if this criterion is to be used, one needs to assess the com-
plete power distribution function for all designs which are to be compared.
We propose doing this via sampling methods in which, for each design un-
der consideration, a large number of realized values ps , s = 1, . . . , S, are
sampled from P and used to construct the empirical distribution function
\[
\widehat{P}(\mathcal{P} \le p) \;=\; \frac{1}{S} \sum_{s=1}^{S} I[p_s \le p]
\]
Each realized value p_s is obtained by sampling a dropout pattern for the design
and computing the resulting noncentrality parameter and the appropriate numbers
of degrees of freedom for the F-statistic, from which a realized power follows.
Note that the dropout process associates
with each triplet (X j , Z j , Mj ) in the design a vector pj = (pj,1 , . . . , pj,nj )
in which pj,k equals the marginal probability that exactly k measurements
are taken on a subject with planned design matrices X j and Z j . Once
all vectors pj have been specified, we have that all sets (Mj,1 , . . . , Mj,nj )
follow a multinomial distribution with Mj trials and probabilities given by
the elements of the vectors pj . This implies that the sampling procedure
basically reduces to sampling from multinomial distributions, from which
it follows that the implementation is straightforward. This allows one to
explore many different combinations of models for the dropout process with
models for the actual responses and to investigate the effect of possible
model misspecifications. Further, it can easily be seen that the computing
time does not increase with the planned sample size. It only depends on
the number of triplets (X_j, Z_j, M_j) in the design rather than on the total
sample size Σ_j M_j.
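The following SAS/IML program sketches this sampling procedure for a deliberately
simplified setting: a single group of rats with random intercepts only, a test on
its slope alone, a constant conditional dropout rate of 12%, and invented parameter
values (they are not the Table 6.4 estimates). It is meant only to show the
multinomial sampling step and the noncentral F-calculation of a realized power.

proc iml;
call randseed(1999);

/* planned design for one group: 7 occasions at ages 50, 60, ..., 110 days */
age = do(50, 110, 10)`;
t   = log(1 + (age - 45)/10);
X   = j(7, 1, 1) || t;                    /* columns: intercept, time        */
Z   = j(7, 1, 1);                         /* random intercepts only          */

/* invented parameter values (illustrative only) */
beta   = {68, 7.5};                       /* intercept and slope             */
d11    = 3.5;                             /* random-intercept variance       */
sigma2 = 1.5;                             /* measurement error variance      */
L      = {0 1};                           /* test on the slope               */
xi0    = 7;                               /* H0: L*beta = xi0                */
M      = 15;                              /* planned number of rats          */
alpha  = 0.05;

/* constant conditional dropout rate of 12% -> marginal probabilities p_k
   of obtaining exactly k measurements, k = 1, ..., 7                       */
p = j(1, 7, 0);  surv = 1;
do k = 1 to 6;
   p[k] = surv * 0.12;  surv = surv * 0.88;
end;
p[7] = surv;

nSim   = 1000;                            /* number of sampled designs       */
counts = randmultinomial(nSim, M, p);     /* nSim x 7 dropout patterns       */
pow    = j(nSim, 1, 0);

do s = 1 to nSim;
   A = j(2, 2, 0);  ntot = 0;             /* realized information matrix     */
   do k = 1 to 7;
      if counts[s, k] > 0 then do;
         Xk = X[1:k, ];  Zk = Z[1:k, ];
         Vk = Zk * d11 * Zk` + sigma2 * I(k);
         A  = A + counts[s, k] * (Xk` * inv(Vk) * Xk);
         ntot = ntot + counts[s, k] * k;
      end;
   end;
   nc     = (L*beta - xi0)` * inv(L * inv(A) * L`) * (L*beta - xi0);
   ddf    = ntot - 2;                     /* residual df: total obs minus rank[X|Z] */
   Fcrit  = finv(1 - alpha, 1, ddf);
   pow[s] = 1 - probf(Fcrit, 1, ddf, nc); /* realized power via noncentral F */
end;

/* empirical distribution of the realized power, evaluated at p = 0.56 */
print ((pow <= 0.56)[:])[label = "estimated P(realized power <= 0.56)"];
quit;

Because the dropout patterns are the only random input, the computing time indeed
depends on the number of design triplets and not on the planned sample size.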
TABLE 23.1. Rat Data. Observed conditional dropout rates pj,k|≥k at each occa-
sion, for all treatment groups simultaneously.
In this section, we will use the sampling procedure described in the previ-
ous section to compare the design that was used in the rat experiment with
alternative designs which could be used in future similar experiments. The
assumption of random dropout is supported by our analyses in Section 19.4.
In order to be able to specify realistic marginal dropout probabilities pj,k ,
we first study the dropout rates observed in the data set at hand. Accord-
ing to the clinicians who collected the data, there is a strong belief that
there is constant probability for a rat to drop out, given that the rat sur-
vived up to that moment. This suggests that the conditional probability
pj,k|≥k that a rat with planned covariates (X j , Z j ) does not survive the
kth occasion, given that it survived all previous measurements, does not
depend on k. Further, it is believed that the dropout process does not de-
pend on treatment (i.e., that pj,k and pj,k|≥k do not depend on j), as long
as measurements are planned at the same occasions for the three treatment
groups. Table 23.1 shows the conditional observed dropout rates pj,k|≥k at
each occasion, for the total sample. For example, 3 rats, out of the 46 rats
who survived the first measurement, died at the second occasion, leading
to an observed conditional dropout rate of 3/46 ≈ 0.065 at that occasion.
TABLE 23.2. Rat Data. Marginal probabilities pj,k for several conditional dropout
models and designs. Empty entries correspond to occasions at which no observa-
tions are planned in the design.
In Sections 23.5.1 and 23.5.2, we will first compare several designs under
the assumption that pj,k|≥k = 0.12. Afterward, in Section 23.5.3, the results
will be compared with those obtained under two alternative dropout models
which assume that pj,k|≥k increases over time. In all designs, the three
treatment groups are measured on the same occasions, and all designs plan
their first and last observation at the age of 50 and 110 days, respectively.
The marginal probabilities are shown in Table 23.2, and the actual designs
are summarized in Table 23.3, together with their power if no dropout
would occur.
TABLE 23.3. Rat Data. Summary of the designs compared in the simulation
study.
All calculations are done under the assumption that the true
parameter values are given by the estimates in Table 6.4 and all simulated
power distributions are based on 1000 draws from the correct distribution.
Since, at each occasion, rats may die, it seems natural to reduce the number
of occasions at which measurements are taken. We have therefore simulated
the power distribution of four designs in which the number of rats assigned
to each treatment group is the same as in the original experiment, but the
planned number of measurements per subject is seven, four, three, and two,
respectively. These are the designs A to D in Table 23.3. Note that design
A is the design used in the original rat experiment. The simulated power
distributions are shown in Figure 23.1.
First, note that the solid line is an estimate for the power function of
the originally designed rat experiment under the assumption of constant
dropout probability pj,k|≥k equal to 12%. It shows that there was more
than 80% chance for the final analysis to have realized power less than
the 56% which was observed in the actual experiment. Comparing the four
designs under consideration, we observe that the risk of high power losses
increases as the planned number of measurements per subject decreases. On
the other hand, it should be emphasized that the four designs are, strictly
speaking, not comparable in the sense that, in the absence of dropout, they
have very different powers ranging from 74% for design A to only 53% for
design D.
FIGURE 23.1. Rat Data. Comparison of the simulated power distributions for
designs with seven, four, three, or two measurements per rat, with equal number
of rats in each design (designs A, B, C, and D, respectively), under the assumption
of constant pj,k|≥k equal to 12%. The vertical dashed line corresponds to the power
which was realized in the original rat experiment (56%).
Designs E, F, and G are the same as designs B, C, and D, but with sample
sizes such that their power is approximately the same as the power of design
A, in the absence of dropout. The simulated power distributions are shown
in Figure 23.2. The figure suggests that P_A ≺ P_E ≺ P_F ≺ P_G, from which
it follows that, in practice, the design in which subjects are measured only
at the beginning and at the end of the study is to be preferred, under
the assumed dropout process. This can be explained by the fact that the
probability for surviving up to the age of 110 days is almost twice as high
for design G (88%) as for the original design (46%) (Table 23.2). Note
also that the parameters of interest [β1 , β2 , and β3 in model (23.2)] are
slopes in a linear model such that two measurements are sufficient for the
parameters to be estimable. On the other hand, design G does not allow
testing for possible nonlinearities in the average evolutions. It also follows
from Figure 23.2 that if design E, F, or G had been taken, it would have
been very unlikely to have a final analysis with such small power as in the
original experiment.
FIGURE 23.2. Rat Data. Comparison of the simulated power distributions for
designs with seven, four, three, or two measurements per rat, with equal power
if no dropout would occur (designs A, E, F, and G, respectively), under the as-
sumption of constant pj,k|≥k equal to 12%. The vertical dashed line corresponds
to the power which was realized in the original rat experiment (56%).
The results in Section 23.5.1 suggest that similar future experiments should
plan fewer measurements for each rat. From now on, we will, therefore, only
consider designs with only three measurements per rat. As before, the first
and last measurement are planned to be taken at the beginning and at
the end of the study, respectively (at the age of 50 and 110 days). Hence,
only the second observation needs to be specified. Designs H, F, and I have
their second observation planned early in the study (at the age of 60 days),
halfway through the study (at the age of 80 days) and late in the study
(at the age of 100 days), respectively. Note that design H needs 18 more
subjects than design I in order to get comparable power in the absence of
dropout (Table 23.3). This is due to the fact that our linear mixed model is
linear as a function of t = ln(1 + (Age − 45)/10) instead of the original Age
scale and because maximal spread in the values tij is obtained by taking
the second measurement at the end rather than at the beginning of the
experiment.
Figure 23.3 shows the simulated power distributions for designs F, H, and I.
As in Section 23.5.1, these are obtained under the assumption that pj,k|≥k is
constant and equal to 12%. This implies that, under each of these designs,
there is 12% chance for a subject to have only one measurement, 11%
chance for two measurements, and 77% chance that all three observations
will be available at the time of the analysis (Table 23.2).
FIGURE 23.3. Rat Data. Comparison of the simulated power distributions for
designs with three measurements per rat, with equal power if no dropout would
occur (designs H, F, and I), under the assumption of constant pj,k|≥k equal to
12%. The vertical dashed line corresponds to the power which was realized in the
original rat experiment (56%).
First, under all three designs, it is very unlikely to have a final analysis
with such small power as observed in the rat experiment. Further, we have
that, from the perspective of efficiency, designs F and I are almost identical,
but clearly superior to design H; that is, we have that P_H ≺ P_F ≈ P_I.
The relatively poor behavior of design H can be explained as follows. For
subjects which drop out at the first occasion (and therefore have only one
observation available), there is no difference between the three designs. This
is in contrast with subjects which drop out at the second occasion, since
subjects from design H then contain less information on the parameters of
interest than subjects from design I. Note that although the designs F and
I are equivalent with respect to efficiency of the final analysis, design I is
to be preferred since it requires the randomization of fewer subjects.
The simulated power distributions are shown in Figure 23.4. We now clearly
get that P_I ≺ P_F ≺ P_H, which is opposite to our results under the assump-
tion of constant pj,k|≥k . The most efficient design is now the one where the
second occasion is planned immediately after the first measurement. This
can be explained by observing that there is 83% chance under design H
that a subject will have all measurements available as planned at the de-
sign stage. For design I, this probability drops to only 40%. Hence, the
efficiency gained by having more spread in the covariate values tij for our
linear model is lost by the severely increased risk of dropping out.
The fact that our conclusions are opposite to those in Section 23.5.2 sug-
gests that there exists a dropout model under which designs F, H, and I are
equivalent with respect to efficiency of the final analysis. One such model
is obtained by setting ψ0 = −3 and ψ1 = 0.02 in the above logistic regres-
sion model. The corresponding simulated power distributions are shown in
Figure 23.5. We now have that the gain in efficiency due to more spread in
the covariate values tij is in balance with the loss in efficiency due to an
404 23. Design Considerations
increased risk of dropping out. In this case, one would prefer design I since
it requires fewer subjects to conduct the study.
Note that the results presented here fully rely on the assumed linear mixed
model (23.2). For example, the simulation results reported in Section 23.5.1
show that design G, with only two observations per subject, is to be pre-
ferred over designs A, E, and F, with more than two observations scheduled
for each subject. Obviously, the assumption of linearity is crucial here, and
design G will not allow testing for nonlinearities. Hence, when interest
would be in providing support for model (23.2), more simulations would be
needed comparing the behavior of different designs under different models
for the outcome under consideration, and design G should no longer be
taken into account. As for any sample-size calculation, it would be advis-
able to perform some informal sensitivity analysis to investigate the impact
of model assumptions and imputed parameter values on the final results.
24
Case Studies
FIGURE 24.1. Blood Pressure Data. Systolic and diastolic blood pressure in pa-
tients with moderate essential hypertension, immediately before and 2 hours after
taking captopril.
The data are taken from Hand et al. (1994), data set #72. For 15 patients with moderate essential
(unknown cause) hypertension, the supine (measured while patient is ly-
ing down) systolic and diastolic blood pressure was measured immediately
before and 2 hours after taking the drug captopril. The individual profiles
are shown in Figure 24.1. The objective of the analysis is to investigate the
effect of treatment on both responses.
Note that since we only have two measurements available for each response,
there is no need for modeling the variance or the mean as continuous func-
tions of time. Also, saturated mean structures and covariance structures
can easily be fitted because of the balanced nature of the data. No trans-
formation to normality is needed since none of the four responses has a
distribution which shows clear deviations from normality.
TABLE 24.1. Blood Pressure Data. Summary of the results of fitting several co-
variance models. All models include a saturated mean structure. Notations RIdia ,
RSdia , RIsys , RSsys , RI, and RS are used for random intercepts and slopes for
the diastolic blood pressures, random intercepts and slopes for the systolic blood
pressures, and random intercepts and slopes which are the same for both blood
pressures, respectively.
As a second step, we refit our model, not including the random slopes. This
is the fourth model in Table 24.1. The p-value calculated using the theory
of Section 6.3.4 on testing the significance of the random slopes is small,
indicating that the treatment effect is not the same for all subjects.
where the random effects are ordered as indicated in front of the matrix.
Clearly, there is no significant correlation between either one of the random
intercepts on one side and the random slopes on the other side (p = 0.8501
and p = 0.4101 for the diastolic and systolic blood pressure, respectively),
meaning that the treatment effect does not depend on the initial value.
On the other hand, there is a significant positive correlation (p = 0.0321)
between the random intercepts for the diastolic blood pressure and the
random intercepts for the systolic blood pressure, meaning that a patient
with an initial diastolic blood pressure higher than average is likely to have
an initial systolic blood pressure which is also higher than average.
This suggests that an overall subject effect may be present in the data. We
can easily reparameterize our fourth model such that an overall random
intercept is included, but a correction term for either systolic or diastolic
blood pressure is then needed. In view of the larger variability in systolic
blood pressures than in diastolic blood pressures, we decided to reparame-
terize our model as a random-effects model, with overall random intercepts,
random intercepts for systolic blood pressure, and random slopes. The over-
all random intercepts can then be interpreted as the random intercepts for
the diastolic blood pressures. The random intercepts for systolic blood pres-
sure are corrections to the overall intercepts, indicating, for each patient,
what its deviation from the average initial systolic blood pressure is, ad-
ditional to its deviation from the average initial diastolic blood pressure.
These correction terms then explain the additional variability for systolic
blood pressure, in comparison to diastolic blood pressure. Information on
the model fit for this fifth model is also shown in Table 24.1. Since this
model is merely a reparameterization of our third model, we obtain the
same results for both models. The REML-estimated random-effects covari-
ance matrix D and the corresponding estimated standard errors are now
\[
\begin{array}{l}
\mathrm{RI} \;\longrightarrow \\
\mathrm{RI}_{\mathrm{sys}} \;\longrightarrow \\
\mathrm{RS} \;\longrightarrow
\end{array}
\begin{pmatrix}
 92\ (39) &  54\ (42) &   4\ (22) \\
 54\ (42) & 209\ (84) & -42\ (34) \\
  4\ (22) & -42\ (34) &  51\ (25)
\end{pmatrix},
\]
suggesting that there are no pairwise correlations between the three ran-
dom effects (all p-values larger than 0.19). We therefore fit a sixth model
assuming independent random effects. The results are also shown in Ta-
ble 24.1. Minus twice the difference in maximized REML log-likelihood
between the sixth and fifth model equals 3.788, which is not significant
when compared to a chi-squared distribution with 3 degrees of freedom
(p = 0.2853). We will preserve this covariance structure (four independent
components of stochastic variability: a random subject effect, a random
effect for the overall difference between the systolic and diastolic blood
pressures, a random effect for the overall treatment effect, and a compo-
nent of measurement error) in the models considered next. Note that this
covariance structure is also the one selected by the AIC as well as the SBC
criterion (see Table 24.1).
Using the above covariance structure, we can now try to reduce our satu-
rated mean structure. The average treatment effect was found to be signif-
icantly different for the two blood pressure measurements, and we found
significant treatment effects for the systolic as well as diastolic blood pres-
sures (all p-values smaller than 0.0001). Our final model is now given by
\[
Y_{ij} = \begin{cases}
\beta_1 + b_{1i} + \varepsilon_{(1)ij} & \text{diastolic, before,} \\
\beta_2 + b_{1i} + b_{2i} + \varepsilon_{(1)ij} & \text{systolic, before,} \\
\beta_3 + b_{1i} + b_{3i} + \varepsilon_{(1)ij} & \text{diastolic, after,} \\
\beta_4 + b_{1i} + b_{2i} + b_{3i} + \varepsilon_{(1)ij} & \text{systolic, after.}
\end{cases} \tag{24.2}
\]
data blood;
  set blood;
  slope  = (time = 'after');     /* 0/1 indicator: measurement taken after captopril  */
  intsys = (meas = 'systolic');  /* 0/1 indicator: systolic rather than diastolic      */
run;
TABLE 24.2. Blood Pressure Data. Results from fitting the final model (24.2),
using restricted maximum likelihood estimation.
The variables meas and time are factors with levels “systolic” and “dias-
tolic” and with levels “before” and “after,” respectively. The ESTIMATE
statements are included to estimate the average treatment effect for the sys-
tolic and diastolic blood pressures separately. The CONTRAST statement
is used to compare these two effects.
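The program below is a plausible sketch of how the final model could be fitted with
these variables; the response variable y and the subject identifier patient are
assumed names (only meas, time, slope, and intsys appear in the data step above).
The treatment effects are parameterized as before-minus-after reductions, and the
ESTIMATE and CONTRAST coefficients assume the default (alphabetical) ordering of
the CLASS levels; both conventions should be checked against the actual data.

proc mixed data = blood;
  class patient meas time;
  model y = meas*time / noint s;                 /* saturated mean: 4 cell means    */
  random intercept intsys slope
         / type = simple subject = patient;      /* independent random effects      */
  /* cell order assumed: (dia,after) (dia,before) (sys,after) (sys,before)          */
  estimate 'treatment effect, diastolic' meas*time -1  1  0  0 / cl;
  estimate 'treatment effect, systolic'  meas*time  0  0 -1  1 / cl;
  contrast 'systolic effect = 2 x diastolic'
           meas*time  2 -2 -1  1;
run;

REML estimation is the PROC MIXED default, in line with the caption of Table 24.2,
and the CL option on the ESTIMATE statements yields the confidence intervals quoted
below.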
The REML estimates for all parameters in the marginal model are shown in
Table 24.2. The 95% confidence intervals for the average treatment effect on
diastolic and systolic blood pressure are [4.383; 14.151] and [14.050; 23.817],
respectively. Further, the parameter estimates in Table 24.2 suggest that
the average treatment effect on systolic blood pressure is twice the average
treatment effect on diastolic blood pressure. This hypothesis, tested with
the CONTRAST statement in the above program, was indeed not rejected
at the 5% level of significance (p = 0.9099).
24.2.1 Introduction
Traditionally, safety assessment has been based on the highest dose level that shows no significant difference from the control group
in the rate of malformed embryos or fetal deaths (Gaylor 1989). Several
limitations of this approach have been widely recognized, and new regula-
tory guidelines emphasize the use of quantitative methods similar to those
developed for cancer risk assessment to estimate reference concentrations
and benchmark doses (U.S. EPA 1991). Thus, more recent techniques for
risk assessment in this area are based on fitting dose-response models and
estimating the dose that leads to a certain increase in risk of some type of
adverse developmental effect over that of the control group.
Although a wide variety of statistical methods have been developed for can-
cer risk assessment, the issue of multiple endpoints does not present quite
the degree of complexity in this area as it does for developmental toxicity
studies. The endpoint of interest in an animal cancer bioassay is typically
the occurrence of a particular type of tumor, whereas in developmental tox-
icity studies, there is no clear choice for a single type of adverse outcome.
In fact, an entire array of outcomes is needed to define certain birth defect
syndromes (Khoury et al. 1987, Holmes 1988). Ryan (1992a) describes the
data resulting from a developmental toxicity study as consisting of a series
of hierarchical outcomes and has proposed a modeling framework which in-
corporates the hierarchical structure and allows for the combination of data
on fetal death and resorption. Catalano et al . (1993) extend this approach
to account for low birth weight, by modeling this continuous outcome con-
ditionally on other outcomes. In the most general situation, the multiple
outcomes recorded in the course of a developmental toxicity study may
represent a combination of binary, ordinal, and continuous measurements
(Ryan 1992a, 1992b).
TABLE 24.3. Heat Shock Study. Study design: number of (viable) embryos ex-
posed to each combination of duration and temperature.
Duration of Exposure (minutes)
Temperature (°C) 5 10 15 20 30 45 60 Total
37.0 11 11 12 13 12 18 11 88
40.0 11 9 9 8 11 10 11 69
40.5 9 8 10 9 11 10 7 64
41.0 10 9 10 11 9 6 0 55
41.5 9 8 9 10 10 7 0 53
42.0 10 8 10 5 7 6 0 46
Total 60 53 60 56 60 57 29 375
In these heat shock experiments, the embryos are explanted from the uterus
of the maternal dam during the gestation period and cultured in vitro.
Each individual embryo is subjected to a short period of heat stress by
placing the culture vial into a water bath, usually involving an increase
over body temperature of 4◦ C to 5◦ C for a duration of 5 to 60 minutes. The
embryos are examined 24 hours later for signs of impaired or accelerated
development.
This type of developmental toxicity test system has several advantages over
the standard Segment II design. First of all, the exposure is administered
directly to the embryo, so controversial issues regarding the unknown (and
often nonlinear) relationship between the level of exposure to the maternal
dam and that received by the developing embryo need not be addressed.
Although genetic factors are still expected to exert an influence on the
vulnerability to injury of embryos from a common dam, direct exposure
to individual embryos reduces the need to account for such litter effects,
but does not remove it. Second, the exposure pattern can be much more
easily controlled than in most developmental toxicity studies, since it is
possible to achieve target temperature levels in the water bath within 1
to 2 minutes. Whereas the typical Segment II study requires waiting 8
to 12 days after exposure to assess its impact, information regarding the
effects of exposure is quickly obtained in heat shock studies. Finally, this
animal test system provides a convenient mechanism for examining the
joint effects of both duration of exposure and exposure levels, which, until
recently, have received little attention. The actual study design for the set
of experiments conducted by Kimmel et al . (1994) is shown in Table 24.3.
For the experiment, 71 dams (clusters) were available, yielding a total of
375 embryos. The distribution of cluster sizes is given in Table 24.4.
TABLE 24.4. Heat Shock Study. Distribution of cluster sizes.
Cluster size n_i                1  2  3   4   5   6  7  8  9  10  11
Number of clusters of size n_i  6  3  6  12  13  11  8  5  2   3   2
For the heat shock studies, the vector of exposure covariates must incor-
porate both exposure level (also referred to as temperature or dose), dij ,
and duration (time), tij , for the jth embryo within the ith cluster. Fur-
thermore, models must be formulated in such a way that departures from
Haber’s premise of the same adverse response levels for any equivalent mul-
tiple of dose times duration can easily be assessed. The exposure metrics
in these models are the cumulative heat exposure, (dt)ij = dij tij , referred
to as durtemp and the effect of duration of exposure at positive increases
in temperature (the increase in temperature over the normal body temper-
ature of 37◦ C):
(pd)ij = tij I(dij > 37).
We refer to the latter as posdur . There are measurements on 13 morpholog-
ical variables. Some are binary; others are measured on a continuous scale.
Even though we will focus on continuous outcomes, as in Geys, Molen-
berghs, and Williams (1999), it is worth mentioning that a lot of work has
been done in the area of clustered binary outcomes and combined binary
and continuous outcomes.
There are several continuous outcomes recorded in the heat shock study,
such as size measures on crown rump, yolk sac, and head. We will focus on
crown rump length (CRL).
Although there is no doubt that random effects are used to model
interdam variability, and the role of the measurement error term is also un-
ambiguous, it is less obvious what the role of the serial association would
be. Generally, serial association results from the fact that within a cluster,
residuals of individuals closer to each other are often more similar than
residuals for individuals further apart. Although this distance concept is
clear in longitudinal and spatial applications, it is less so in this context.
However, covariates like duration and temperature, or relevant transforma-
tions thereof, can play a similar role. This distinction is very useful since
random effects capture the correlation structure which is attributable to
the dam and hence includes genetic components. The serial correlation, on
the other hand, is entirely design driven. If one conjectures that the latter
component is irrelevant, then translation into a statistical hypothesis and,
consequently, testing for it are relatively straightforward. Note that such a
model is not possible in conventional developmental toxicity studies, where
exposure applies at the dam level, not at the individual fetus level.
Yij = (β1 + bi1 ) + (β2 + bi2 )(dt)ij + β3 (pd)ij + ε(1)ij + ε(2)ij , (24.3)
where the ε(1)ij are uncorrelated and follow a normal distribution with zero
mean and variance σ 2 . The ε(2)ij have zero mean, variance τ 2 , and serial
correlation
\[
h_{ijk} = \exp\!\left\{ -\phi\,[(dt)_{ij} - (dt)_{ik}]^{2} \right\}.
\]
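A plausible PROC MIXED formulation of model (24.3) is sketched below; the data set
name heatshock, the cluster identifier dam, and the outcome variable crl are assumed
names, whereas durtemp and posdur are the exposure metrics defined above. The spatial
Gaussian structure SP(GAU), with durtemp playing the role of the "coordinate",
reproduces the serial correlation function h_ijk, and the LOCAL option adds the
measurement error variance σ².

proc mixed data = heatshock;
  class dam;
  model crl = durtemp posdur / s;
  random intercept durtemp / type = un subject = dam;      /* b_i1 and b_i2           */
  repeated / subject = dam
             type = sp(gau)(durtemp) local;                /* serial + measurement error */
run;

The same program, with posdur removed from the MODEL statement, would correspond to
the simplified fixed-effects structure considered in Figure 24.2(b).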
TABLE 24.5. Heat Shock Study. Parameter estimates (standard errors) for the initial
and final model (fixed effects and random-effects parameters).
FIGURE 24.2. Heat Shock Study. Fixed-effects structure for (a) the final model
and (b) the model with posdur removed.
The fitted variogram is presented in Figure 24.3. Roughly half of the vari-
ability is attributed to measurement error, and the remaining half is divided
equally over the random intercept and the serial process. The correspond-
ing fitted correlation function is presented in Figure 24.4. The correlation
is about 0.50 for two fetuses that are at the exact same level of exposure.
It then decreases to 0.25 when the distance between exposures is maximal.
This reflects the fact that half of the correlation is due to the random effect,
and the other half is attributed to the serial process in durtemp.
24.3.1 Introduction
Prentice’s Definition
Prentice’s Criteria
However, (24.9) and (24.10) are, in general, not equal to one another, in
which case definition (24.4) is violated. Nevertheless, it is possible to construct
examples where f (T |Z) = f (T ), in which case the definition still holds
despite the fact that (24.8) does not hold. Hence, (24.8) is not a necessary
condition, except for binary endpoints.
FIGURE 24.5. Advanced Ovarian Cancer. Scatter plot of progression free survival
versus survival.
Suppose we have data from N trials, in the ith of which Ni subjects are
enrolled. Let Tij and Sij be random variables that denote the true and
surrogate endpoints, respectively, for the jth subject in the ith trial, and let
Zij be an indicator variable for treatment. Although the main focus of this
work is on binary treatment indicators, the methods proposed generalize
without difficulty to multiple category indicators for treatment, as well as to
situations where covariate information is used in addition to the treatment
indicators.
Our methods will first be illustrated using data from a meta-analysis of four
randomized multicenter trials in advanced ovarian cancer (Ovarian Can-
cer Meta-Analysis Project 1991). Individual patient data are available in
these four trials for the comparison of two treatment modalities: cyclophos-
phamide plus cisplatin (CP) versus cyclophosphamide plus adriamycin plus
cisplatin (CAP). The binary indicator for treatment (Zij ) will be set to 0
for CP and to 1 for CAP. The surrogate endpoint Sij will be the logarithm
of time to progression, defined as the time (in weeks) from randomization
to clinical progression of the disease or death due to the disease, and the
final endpoint Tij will be the logarithm of survival, defined as the time (in
weeks) from randomization to death from any cause. The full results of this
meta-analysis were published with a minimum follow-up of 5 years in all
trials (Ovarian Cancer Meta-Analysis Project 1991). The data set was sub-
sequently updated to include a minimum follow-up of 10 years in all trials
(Ovarian Cancer Meta-Analysis Project 1998). After such long follow-up,
most patients have had a disease progression or have died (952 of 1194
patients, i.e., 80%), so censoring will be ignored in our analyses. Methods
that account for censoring would admittedly be preferable, but we ignore it
here for the purposes of illustrating the case where the surrogate and final
endpoints are both normally distributed.
The ovarian cancer data set contains only four trials. This will turn out to
be insufficient to apply the meta-analytic methods of Section 24.3.4. In the
two larger trials, information is also available on the centers in which the
patients had been treated. We can then use center as the unit of analysis for
the two larger trials, and the trial as the unit of analysis for the two smaller
trials. A total of 50 units are thus available for analysis, with a number of
individual patients per unit ranging from 2 to 274. To assess sensitivity, all
analyses will be performed with and without the two smaller trials in which
center is unknown. A scatter plot of the surrogate and true endpoints for
all individuals in the trials included is presented in Figure 24.5.
The first three Prentice criteria (24.5)–(24.7) are provided by tests of sig-
nificance of parameters α, β, and γ in the following models:
\[
S_{ij} \mid Z_{ij} = \mu_S + \alpha Z_{ij} + \varepsilon_{Sij}, \tag{24.11}
\]
\[
T_{ij} \mid Z_{ij} = \mu_T + \beta Z_{ij} + \varepsilon_{Tij}, \tag{24.12}
\]
\[
T_{ij} \mid S_{ij} = \mu + \gamma S_{ij} + \varepsilon_{ij}, \tag{24.13}
\]
where εSij, εTij, and εij are independent Gaussian errors with mean zero.
If the analysis is restricted to the two large trials in which center is known,
α = 0.228 [standard error (s.e.) 0.091, p = 0.013], β = 0.149 (s.e. 0.085,
p = 0.079), and γ = 0.874 (s.e. 0.011, p < 0.0001). Strictly speaking, the
criteria are not fulfilled because β fails to reach statistical significance. This
will often be the case, since a surrogate endpoint is needed when there is
no convincing evidence of a treatment effect upon the true endpoint.
Here, βS = −0.051 (s.e. 0.028) and P E = 1.34 (95% delta confidence limits
[0.73; 1.95]). The proportion explained is larger than 100%, because the
direction of the effect of Z on T is reversed after adjustment for S. Another
problem would arise if there were a strong interaction between Z and S,
which would require the following model to be fitted instead of (24.15):
Tij |Zij , Sij = µ̌T + β̌S Zij + ρ̌Z Sij + δZij Sij + ε̌T ij , (24.16)
With this model, P E would cease to be captured by a single number and the
validation process would have to stop (Freedman, Graubard, and Schatzkin
1992). In the two large ovarian cancer trials, the interaction term is not
statistically significant (δ = 0.014, s.e. 0.022), and therefore model (24.15)
may be used.
When the two smaller trials are included in the analysis, the results change
very little, providing evidence for the validity of considering each of the
smaller trials as a single center. The p-values for α, β, and γ become 0.003,
0.054, and < 0.0001, respectively, and P E = 1.46 (95% confidence limits
[0.80; 2.13]), RE = 0.60 (95% confidence limits [0.32; 0.87]), and ρZ = 0.942
(95% confidence limits [0.94; 0.95]). By including both trials, the precision
is somewhat improved. However, in this case, the interaction term in model
(24.16) is statistically significant (δ = 0.037, s.e. 0.018), further complicat-
ing the interpretation of P E.
An Example in Ophthalmology
The second example concerns a clinical trial for patients with age-related
macular degeneration, a condition in which patients progressively lose vi-
sion (Pharmacological Therapy for Macular Degeneration Study Group
1997). In this example, the binary indicator for treatment (Zij ) is set to
0 for placebo and to 1 for interferon-α. The surrogate endpoint Sij is the
change in the visual acuity (which we assume to be normally distributed) at
6 months after starting treatment, and the final endpoint Tij is the change
in the visual acuity at 1 year. The data are presented in Figure 24.6. The
first three Prentice criteria (24.5)–(24.7) are again provided by tests of sig-
nificance of parameters α, β, and γ. Here, α = −1.90 (s.e. 1.87, p = 0.312),
β = −2.88 (s.e. 2.32, p = 0.216), and γ = 0.92 (s.e. 0.06, p < 0.001). Only
γ is statistically significant and therefore the validation procedure has to
stop inconclusively. Note, however, that the lack of statistical significance
of α and β could merely be due to the insufficient number of observations
available in this trial. Also note that α and β are negative, hinting at a
negative effect of interferon-α upon visual acuity. Freedman’s proportion
explained is calculated as P E = 0.61 (95% confidence limits [−0.19; 1.41]).
The relative effect is RE = 1.51 (95% confidence limits [−0.46; 3.49]), and
the adjusted association ρZ = 0.74 (95% confidence limits [0.68; 0.81]). The
adjusted association is determined rather precisely, but the confidence lim-
its of P E and RE are too wide to convey any useful information. Even so,
as we will see in Section 24.3.5, some conclusions can be reached in this
example that are in sharp contrast to those reached in the ovarian cancer
example.
As a third example, we will use data from two randomized multicenter trials
in advanced colorectal cancer (Corfu-A Study Group 1995; Greco et al .
1996). In one trial, treatment with fluorouracil and interferon (5FU/IFN)
was compared to treatment with 5FU plus folinic acid (5FU/LV) (Corfu-A
Study Group 1995). In the other trial, treatment with 5FU plus interferon
(5FU/IFN) was compared to treatment with 5FU alone (Greco et al . 1996).
The binary indicator for treatment (Zij) will be set to 0 for 5FU/IFN and to 1
for 5FU/LV or 5FU alone. The surrogate endpoint Sij will be progression-
free survival time, defined as the time (in years) from randomization to
clinical progression of the disease or death, and the final endpoint Tij will
be survival time, defined as the time (in years) from randomization to
death from any cause. Most patients in the two trials have had a disease
progression or have died (694 of 736 patients, i.e., 94.3%).
Similarly to the ovarian cancer example, we will use center as the unit of
analysis. A total of 76 units are thus available for analysis. However, in
eight centers, one of the treatment arms accrued no patients. These eight
centers were therefore excluded from the analysis. As a result, the data
used for illustration contained 68 units, with a number of individual pa-
tients per unit ranging from 2 to 38. The data are graphically depicted
in Figure 24.7. Fitting (24.11)–(24.13) yields α = 0.021 (standard error,
s.e. 0.066, p = 0.749), β = 0.002 (s.e. 0.075, p = 0.794), and γ = 0.917
(s.e. 0.031, p < 0.0001). As with the ovarian cancer case, the criteria are
not fulfilled because both β and α fail to reach statistical significance. The
proportion explained is estimated as 0.985 (95% delta confidence limits
[−3.44; 5.41]). Further, RE = 0.931 (95% confidence limits [−3.23; 5.10]).
Both quantities have estimated confidence limits that are too wide to be
of any practical use.
FIGURE 24.7. Advanced Colorectal Cancer. Scatter plot of progression free sur-
vival versus survival.
Let us describe the two-stage model first. The first stage is based upon a
fixed-effects model:
Sij |Zij = µSi + αi Zij + εSij , (24.18)
Tij |Zij = µT i + βi Zij + εT ij , (24.19)
where µSi and µT i are trial-specific intercepts, αi and βi are trial-specific
effects of treatment Z on the endpoints in trial i and εSi and εT i are cor-
related error terms, assumed to be zero-mean normally distributed with
covariance matrix
σSS σST
Σ = . (24.20)
σT T
At the second stage, the trial-specific parameters are decomposed as
\[
\begin{pmatrix} \mu_{Si} \\ \mu_{Ti} \\ \alpha_i \\ \beta_i \end{pmatrix}
= \begin{pmatrix} \mu_S \\ \mu_T \\ \alpha \\ \beta \end{pmatrix}
+ \begin{pmatrix} m_{Si} \\ m_{Ti} \\ a_i \\ b_i \end{pmatrix}, \tag{24.21}
\]
where the second term on the right-hand side of (24.21) is assumed to follow
a mean-zero normal distribution with dispersion matrix
\[
D = \begin{pmatrix}
d_{SS} & d_{ST} & d_{Sa} & d_{Sb} \\
d_{ST} & d_{TT} & d_{Ta} & d_{Tb} \\
d_{Sa} & d_{Ta} & d_{aa} & d_{ab} \\
d_{Sb} & d_{Tb} & d_{ab} & d_{bb}
\end{pmatrix}. \tag{24.22}
\]
Combining both stages yields the random-effects model
\[
S_{ij} \mid Z_{ij} = \mu_S + m_{Si} + (\alpha + a_i) Z_{ij} + \varepsilon_{Sij}, \tag{24.23}
\]
\[
T_{ij} \mid Z_{ij} = \mu_T + m_{Ti} + (\beta + b_i) Z_{ij} + \varepsilon_{Tij}, \tag{24.24}
\]
where now µS and µT are fixed intercepts, α and β are the fixed effects
of treatment Z on the endpoints, mSi and mT i are random intercepts,
and ai and bi are the random effects of treatment Z on the endpoints
in trial i. The vector of random effects (mSi , mT i , ai , bi ) is assumed to be
zero-mean normally distributed with covariance matrix (24.22). The error
terms εSi and εT i follow the same assumptions as in fixed-effects model
(24.18)–(24.19), with covariance matrix (24.20), thereby completing the
specification of the linear mixed model. Section 24.3.10 provides sample
SAS code to fit this particular model.
Much debate has been devoted to the relative merits of fixed versus random
effects, especially in the context of meta-analysis (Thompson and Pocock
1991, Thompson 1993, Fleiss 1993, Senn 1998). Although the underlying
models rest on different assumptions about the nature of the experiments
being analyzed, the two approaches yield discrepant results only in patho-
logical situations, or in very small samples where a fixed-effects analysis can
yield artificially precise results if the experimental units truly constitute a
random sample from a larger population. In our setting, both approaches
are very similar, and the two-stage procedure can be used to introduce
random effects (Section 3.2; see also Laird and Ware 1982). As the data
analyses in Section 24.3.5 will illustrate, the choice between random and
fixed effects can also be guided by pragmatic arguments.
Trial-Level Surrogacy
For a new trial i = 0, write
\[
\begin{pmatrix} m_{S0} \\ a_0 \end{pmatrix}
= \begin{pmatrix} \mu_{S0} - \mu_S \\ \alpha_0 - \alpha \end{pmatrix}.
\]
Then
\[
E(\beta + b_0 \mid m_{S0}, a_0)
= \beta + \begin{pmatrix} d_{Sb} \\ d_{ab} \end{pmatrix}'
\begin{pmatrix} d_{SS} & d_{Sa} \\ d_{Sa} & d_{aa} \end{pmatrix}^{-1}
\begin{pmatrix} \mu_{S0} - \mu_S \\ \alpha_0 - \alpha \end{pmatrix}, \tag{24.26}
\]
\[
\mathrm{var}(\beta + b_0 \mid m_{S0}, a_0)
= d_{bb} - \begin{pmatrix} d_{Sb} \\ d_{ab} \end{pmatrix}'
\begin{pmatrix} d_{SS} & d_{Sa} \\ d_{Sa} & d_{aa} \end{pmatrix}^{-1}
\begin{pmatrix} d_{Sb} \\ d_{ab} \end{pmatrix}. \tag{24.27}
\]
This suggests calling a surrogate perfect at the trial level if the conditional
variance (24.27) is equal to zero. A measure to assess the quality of the
surrogate at the trial level is the coefficient of determination
\[
R^2_{\mathrm{trial(f)}} = R^2_{b_i \mid m_{Si}, a_i}
= \frac{\begin{pmatrix} d_{Sb} \\ d_{ab} \end{pmatrix}'
\begin{pmatrix} d_{SS} & d_{Sa} \\ d_{Sa} & d_{aa} \end{pmatrix}^{-1}
\begin{pmatrix} d_{Sb} \\ d_{ab} \end{pmatrix}}{d_{bb}}. \tag{24.28}
\]
When the prediction is based on a_0 alone, these expressions simplify to
\[
E(\beta + b_0 \mid a_0) = \beta + \frac{d_{ab}}{d_{aa}}\,(\alpha_0 - \alpha),
\qquad
\mathrm{var}(\beta + b_0 \mid a_0) = d_{bb} - \frac{d_{ab}^2}{d_{aa}},
\]
with corresponding
\[
R^2_{\mathrm{trial(r)}} = R^2_{b_i \mid a_i} = \frac{d_{ab}^2}{d_{aa}\, d_{bb}}. \tag{24.29}
\]
Now, R²_trial(r) = 1 if the trial-level treatment effects are simply multiples of
each other. We will refer to this simplified version as the reduced random-
effects model, and the original expression (24.28) will be said to derive from
the full random-effects model.
Individual-Level Surrogacy
Individual-level surrogacy refers to the association between the surrogate and
the true endpoint after adjustment for the treatment effect. To this end, we need
to construct the conditional distribution of T, given S and Z. From (24.18)–(24.19),
we derive
\[
T_{ij} \mid Z_{ij}, S_{ij} \;\sim\;
N\!\left(
\mu_{Ti} - \sigma_{TS}\sigma_{SS}^{-1}\mu_{Si}
+ (\beta_i - \sigma_{TS}\sigma_{SS}^{-1}\alpha_i) Z_{ij}
+ \sigma_{TS}\sigma_{SS}^{-1} S_{ij}\,;\;
\sigma_{TT} - \sigma_{TS}^{2}\sigma_{SS}^{-1}
\right). \tag{24.30}
\]
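Under (24.30), a natural summary of the individual-level association is the squared
correlation between the residuals of both endpoints; assuming this is the quantity
denoted R²_indiv in what follows, with the adjusted association ρZ as its square
root, it can be written as
\[
R^2_{\mathrm{indiv}} = R^2_{\varepsilon_{Tij} \mid \varepsilon_{Sij}}
= \frac{\sigma_{TS}^{2}}{\sigma_{SS}\,\sigma_{TT}}.
\]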
It should be noted that the validation criteria proposed here do not require
the treatment to have a significant effect on either endpoint. In particu-
lar, it is possible to have α ≡ 0 and yet have a perfect surrogate. Indeed,
even though the treatment may not have any effect on the surrogate end-
point as a whole, the fluctuations around zero in individual trials (or other
experimental units) can be very strongly predictive of the effect on the
true endpoint. However, such a situation is unlikely to occur since the het-
erogeneity between the trials is generally small compared to that between
individual patients.
If data are available on a single trial (or, more generally, on a single exper-
imental unit), the above developments are only partially possible. While
the individual-level reasoning (producing ρZ as in (24.32)) carries over by
virtue of the within-trial replication, the trial-level reasoning breaks down
and one cannot go beyond the relative effect (RE) as suggested in Buyse
and Molenberghs (1998). Recall that the RE is defined as the ratio of the
effects of Z on S and T , respectively, as expressed in (24.17). The confi-
dence limits of RE can be used to assess the uncertainty about the value
of β predicted from that of α, but in contrast to the above developments,
no sensible prediction interval can be calculated for β.
As in Section 24.3.3, all analyses have been performed with and without the
two smaller trials. Excluding the two smaller trials has very little impact
on the estimates of interest, and therefore the results reported are those ob-
tained with all four trials. Two-stage fixed-effects models (24.18)–(24.19)
could be fitted, as well as a reduced version of the mixed-effects model
(24.23)–(24.24), with random treatment effects but no random intercepts.
Point estimates for the two types of model are in close agreement, although
standard errors are smaller by roughly 35% in the random-effects model.
Figure 24.8 shows a plot of the treatment effects on the true endpoint (loga-
rithm of survival) by the treatment effects on the surrogate endpoint (loga-
rithm of time to progression). These effects are highly correlated. Similarly
to the random-effects situation, we refer to the models with and without the
intercept used for determining R² as the reduced and full fixed-effects mod-
els. The reduced fixed-effects model provides R²_trial(r) = 0.939 (s.e. 0.017).
When the sample sizes of the experimental units are used to weight the
pairs (a_i, b_i), then R²_trial(r) = 0.916 (s.e. 0.023). The full fixed-effects model
yields R²_trial(f) = 0.940 (s.e. 0.017). In the reduced random-effects model,
R²_trial(r) = 0.951 (s.e. 0.098).
FIGURE 24.8. Advanced Ovarian Cancer. Treatment effects on the true endpoint
(logarithm of survival time) versus treatment effects on the surrogate endpoint
(logarithm of time to progression) for all units of analysis. The size of each point
is proportional to the number of patients in the corresponding unit.
Table 24.6 reports prediction intervals for several experimental units: six centers taken
at random from the two large trials, and the two small trials in which center
is unknown. Note that none of the predictions is significantly different from
zero. The predicted values for β + b0 agree reasonably well with the effects
estimated from the data. The ratio β0 / α0 ranges from 0.69 to 0.73, which
is close to the RE estimated in Section 24.3.3.
At the individual level, R²_indiv = 0.886 (s.e. 0.006) in the fixed-effects model,
and R²_indiv = 0.888 (s.e. 0.006) in the reduced random-effects model. The
square roots of these quantities are respectively 0.941 and 0.942, very close
to the value of ρZ estimated in Section 24.3.3. Figure 24.9 displays a scatter
plot of the residuals on both endpoints. It exhibits the close relationship
which exists between both endpoints at the individual level.
TABLE 24.6. Advanced Ovarian Cancer. Predictions. Standard errors are shown
in parentheses. The number of patients is reported for each unit, as well as which
sample is used for the estimation (only two trials or all four). α0 and β0 are
the values estimated from the data; E(β + b0 | a0) is the predicted effect of treatment
on survival (β0), given its effect upon time to progression (α0). The DACOVA
and GONO trials are the two smaller studies, for which predictions are based on
parameter estimates from the centers in the two larger studies.
Unit      n_i   Trials   α0             E(β + b0|a0)     β0
6          17      2     −0.58 (0.33)   −0.45 (0.29)     −0.56 (0.32)
                   4                    −0.45 (0.29)
8          10      2      0.67 (0.76)    0.49 (0.57)      0.76 (0.39)
                   4                     0.47 (0.56)
37         12      2      1.02 (0.61)    0.76 (0.54)      1.04 (0.70)
                   4                     0.73 (0.53)
49         40      2      0.54 (0.34)    0.39 (0.26)      0.28 (0.28)
                   4                     0.37 (0.25)
55         31      2      1.08 (0.56)    0.80 (0.44)      0.79 (0.45)
                   4                     0.77 (0.44)
BB         21      2     −1.05 (0.55)   −0.80 (0.46)     −0.79 (0.51)
                   4                    −0.79 (0.46)
DACOVA    274      2      0.25 (0.15)    0.17 (0.13)      0.14 (0.14)
GONO      125      2      0.15 (0.25)    0.10 (0.20)      0.03 (0.22)
The results derived here are considerably more useful than the conclusions
in Section 24.3.3. Indeed, the first three Prentice criteria provide only mar-
ginal evidence and P E cannot be estimated on the full data set, since there
is a three-way interaction between Z, S, and T . RE is meaningful and esti-
mated with precision, but it is derived from a regression through the origin
based on a single data point. In contrast, the approach used here combines
evidence from several experimental units and allows prediction intervals to
be calculated for the effect of treatment on the true endpoint.
As in the ovarian cancer example, centers were treated as the unit of analysis. A total of 36 centers were thus available
for analysis, with a number of individual patients per center ranging from
2 to 18.
Figure 24.11(a) shows a plot of the raw data (true endpoint versus surrogate
endpoint for all individual patients). Irrespective of the software tool used
(SAS, SPlus, MLwiN), the random effects are difficult to obtain. Therefore,
we report only the result of a two-stage fixed-effects model and explore the
computational issues further in Section 24.3.6. Figure 24.11(b) shows a
plot of the treatment effects on the true endpoint by the treatment effects
on the surrogate endpoint. These effects are moderately correlated, with
R²_trial(f) = 0.692 (s.e. 0.087). The estimates based on the reduced model are
virtually identical. At the individual level, R²_indiv = 0.483 (s.e. 0.053). Note
that R_indiv = 0.69 is close to ρZ = 0.74, as estimated in Section 24.3.3. The
coefficients of determination R²_trial(r) and R²_indiv are both too low to make
visual acuity at 6 months a reliable surrogate for visual acuity at 12 months.
Figure 24.11(c) shows that the correlation of the measurements at 6 months
and at 1 year is indeed rather poor at the individual level. Therefore, even
with the limited data available, it is clear that the assessment of visual
acuity at 6 months is not a good surrogate for the same assessment at 1
year. This is in contrast with the inconclusive analysis in Section 24.3.3.
Figure 24.12 shows a plot of the treatment effects on the true endpoint
(logarithm of survival) by the treatment effects on the surrogate endpoint
(logarithm of time to progression). Clearly, the correlation between both is
considerably weaker than in the advanced ovarian cancer case. The corre-
sponding R²_trial(r) = 0.454 (95% confidence limits [0.23; 0.68]).
At the individual level (Figure 24.13), R²_indiv = 0.665 (95% confidence limits
[0.62; 0.71]). The square root of this quantity is 0.815, very close to the
estimate of ρZ, which is 0.805.
FIGURE 24.12. Advanced Colorectal Cancer. Treatment effects on the true end-
point (logarithm of survival time) versus treatment effects on the surrogate end-
point (logarithm of time to progression) for all units of analysis. The size of each
point is proportional to the number of patients in the corresponding unit.
Table 24.7 shows the number of runs for which convergence could be achieved
within 20 iterations. In each case, 500 runs were performed, assuming a
model of the following form for the residual covariance matrix:
\[
\Sigma = 3 \begin{pmatrix} 1 & 0.8 \\ 0.8 & 1 \end{pmatrix}.
\]
The number of trials was fixed to either 10, 20, or 50, each trial involving
10 subjects randomly assigned to treatment groups. The δ 2 parameter was
set to 0.1 or 1.
TABLE 24.7. Simulation results. Number of runs for which convergence was
achieved within 20 iterations. Total number of runs: 500; percentages are given
in parentheses.
                 Number of trials
δ²         50            20            10
1       500 (100%)   498 (100%)   412 (82%)
0.1     491 (98%)    417 (83%)    218 (44%)
From Table 24.7, we see that when the between-trial variability is large
(δ 2 = 1), no convergence problems occur, except when the number of trials
gets very small. When the between-trial variability gets smaller, conver-
gence problems do arise and worsen as the number of trials decreases.
24.3.7 Extensions
For binary outcomes, there are both marginal models such as generalized
estimating equations (Liang and Zeger 1986) or full likelihood approaches
(Fitzmaurice and Laird 1993, Molenberghs and Lesaffre 1994, Lang and
Agresti 1994, Glonek and McCullagh 1995) and random-effects models (Sti-
ratelli, Laird, and Ware 1984, Zeger, Liang, and Albert 1988, Breslow and
Clayton 1993, Wolfinger and O’Connell 1993, Lee and Nelder 1996). Re-
views are given in Diggle, Liang, and Zeger (1994) and Fahrmeir and Tutz
(1994). For additional references, see also Section 24.2.1 (p. 414).
Since our validation measures make use not only of main-effect parame-
ters, such as treatment effects, but prominently of association (random-
effects structure and residual covariance structure), standard generalized
estimating equations are less suitable to extend ideas toward noncontin-
uous data. Possible approaches are second-order generalized estimating
equations (Liang, Zeger, and Qaqish 1992, Molenberghs and Ritter 1996)
and random-effects models. Since the latter are computationally involved,
the likelihood-based approaches need to be supplemented with alternative
methods of estimation such as quasi-likelihood. Also, pseudo-likelihood is
a viable alternative.
Burzykowski et al . (1999) studied the situation where Sij and Tij are
failure-time endpoints. In order to extend the approach used in the case
of two normally distributed endpoints described in Section 24.3.4, they re-
placed model (24.18)–(24.19) by a model for two correlated failure-time
random variables. This is based on a copula model (Shih and Louis 1995).
More specifically, they assume that the joint survivor function of (Sij , Tij )
can be written as
\begin{align*}
\frac{\partial f}{\partial \mu_{S0}} &= -\frac{\partial f}{\partial \mu_{S}}
   = D_1 D_2^{-1}\begin{pmatrix}1\\0\end{pmatrix},\\
\frac{\partial f}{\partial \alpha_{0}} &= -\frac{\partial f}{\partial \alpha}
   = D_1 D_2^{-1}\begin{pmatrix}0\\1\end{pmatrix},\\
\frac{\partial f}{\partial d_{Sb}} &= \begin{pmatrix}1\\0\end{pmatrix}^{\!\top} D_2^{-1} D_3,\\
\frac{\partial f}{\partial d_{ab}} &= \begin{pmatrix}0\\1\end{pmatrix}^{\!\top} D_2^{-1} D_3,\\
\frac{\partial f}{\partial d_{SS}} &= -\,D_1 D_2^{-1}\begin{pmatrix}1 & 0\\0 & 0\end{pmatrix} D_2^{-1} D_3,\\
\frac{\partial f}{\partial d_{Sa}} &= -\,D_1 D_2^{-1}\begin{pmatrix}0 & 1\\1 & 0\end{pmatrix} D_2^{-1} D_3,\\
\frac{\partial f}{\partial d_{aa}} &= -\,D_1 D_2^{-1}\begin{pmatrix}0 & 0\\0 & 1\end{pmatrix} D_2^{-1} D_3.
\end{align*}
Denoting the asymptotic covariance matrix of the estimated parameter
vector by V, the asymptotic variance of f is given by f_d′ V f_d, producing a
confidence interval in the usual way. For a prediction interval, the variance
to be used is f_d′ V f_d + var(β + b0 | mS0, a0).
The above syntax presumes that there are two records per subject in the
input data set, one corresponding to the surrogate endpoint and the other
to the true endpoint. The variable endpoint is an indicator for the kind of
endpoint (coded 0 for the surrogate and 1 for the true endpoint) and the
variable outcome contains measurements obtained from each endpoint.
24.4.1 Introduction
Diggle (1990) and Diggle and Kenward (1994) analyzed data taken from
Verbyla and Cullis (1990), who, in turn, had discovered the data at a work-
shop at Adelaide University in 1989. The data consist of assayed protein
content of milk samples taken weekly for 19 weeks from 79 Australian cows.
The cows entered the experiment after calving and were randomly allocated
to 1 of 3 diets: barley, mixed barley-lupins, and lupins alone, with 25, 27,
and 27 animals in the 3 groups, respectively. The time profiles for all 79
cows are plotted in Figure 24.19. All cows remained on study during the
first 14 weeks, whereafter the sample reduced to 59, 50, 46, 46, and 41,
respectively, due to dropout. This means that dropout is as high as 48%
by the end of the study. Table 24.8 shows the number of cows per arm and
per dropout pattern.
The primary objective of the milk protein experiment was to describe the
effects of diet on the mean response profile of milk protein content over
time. Previous analyses of the same data are reported by Diggle (1990,
Chapter 5), Verbyla and Cullis (1990), Diggle, Liang, and Zeger (1994),
and Diggle and Kenward (1994), under different assumptions and with
different modeling of the dropout process. Diggle (1990) assumed random
dropout, whereas Diggle and Kenward (1994) concluded that dropout was
TABLE 24.8. Milk Protein Content Trial. Number of cows per arm and per
dropout pattern.
                       Diet
Dropout week    Barley   Mixed   Lupins
Week 15            6        7       7
Week 16            2        3       4
Week 17            2        1       1
Week 18
Week 19            2        2       1
Completers        13       14      14
Total             25       27      27
In addition to the usual problems with this type of model, serious doubts
have been raised about the very appropriateness of the “dropout” con-
cept in this study. Cullis (1994) warned that the conclusions inferred from
the statistical model are very unlikely, since usually there is no relation-
ship between dropout and a relatively low level of milk protein content. In
the discussion of the Diggle and Kenward (1994) paper, one is informed
by Cullis that Valentine, who originally conducted the experiment, had
previously revealed the real reasons for dropout. The explanation eluci-
dates that the experiment terminated when feed availability declined in
the paddock in which animals were grazing. Thus, this would imply that a
nonrandom dropout mechanism is very implausible. A nonrandom dropout
mechanism would wrongly relate dropout to response, whereas, to the con-
trary, dropout depends on food availability only. Thus, there are actually no
dropouts but rather five cohorts representing the different starting times.
Together with Cullis (1994), and in agreement with our pleas for sensitivity
analysis (Chapters 19 and 20), we conclude that especially with incomplete
data, a statistical analysis should not proceed without a thorough discus-
sion with the experimenters.
The complex and somewhat vague history of the data set is probably the
main cause of the many conflicting issues related to the analysis of the
milk data. At the same time, it becomes a perfect candidate for sensitiv-
ity analysis. Modeling will be based upon the linear mixed-effects model
with serial correlation (3.11), introduced in Section 3.3.4. In Section 24.4.2,
Since there has been some confusion about the actual design employed, we
cannot avoid making subjective assumptions such as the following: Several
matched paddocks are randomly assigned to one of three diets: barley,
lupins, or a mixture of the two. The experiment starts as the first cow
experiences calving. After the first 5 weeks have passed, all 79 cows have
entered their randomly assigned, randomly cultivated paddock. By week
19, all paddocks appear to approach the point of exhausting their food
availability (in a synchronous fashion) and the experiment is terminated
for all animals simultaneously.
All previous analyses assumed a fixed date for entry into the trial and the
crucial issue then becomes how the dropout process should be handled and
analyzed. However, it seems intuitive that since entry into the study was
at random time points (i.e., after calving) and since the experiment was
terminated at a fixed time point, this time point should be the refer-
ence for all other time points. It is therefore also appealing to reverse the
time axis and to analyze the data in reverse, starting from time of dropout.
Under the aforementioned assumptions, we have found a partial solution
to the problem of potentially nonrandom dropout since dropout has been
replaced by ragged entry. Note, however, that a crucial simplification arises:
Since entry into the trial depends solely on calving and gestation, it can be
thought of as totally independent from the unobserved responses.
A problem with the alignment lies in the fact that virtually all cows showed
a very steep decrease in milk protein content immediately after calving,
lasting until the third week into the experiment. This behavior could be
due to a special hormone regulation of milk composition following calving,
which lasts only for a few weeks. Such a process is most likely totally
independent of diet and, probably, can also be observed in the absence of
food, at the expense of the animal’s natural reserves. Since entry is now
FIGURE 24.14. Milk Protein Content Trial. Data manipulations on five selected
cows: (a) raw profiles; (b) right-aligned profiles; (c) deletion of the first three
observations; (d) profiles with, in addition, time reversal.
ragged, the process is spread and influences the mean response level during
the first 8 weeks. Of course, one might construct an appropriate model for
the first 3 weeks with a separate model specification, in analogy to the one
used in Diggle and Kenward (1994). Instead, we prefer to ignore the first
3 weeks, analogous in spirit to the approach taken in Verbyla and Cullis
(1990). Hence, we have time series of length 16, with some observations
missing at the beginning. Figure 24.14 displays the data manipulations
for five selected cows. In Figure 24.14(a), the raw profiles are shown. In
Figure 24.14(b), the plots are right aligned. Figure 24.14(c) illustrates the
protein content levels for the five cows with the first three observations
deleted and Figure 24.14(d) presents these profiles when time is reversed.
FIGURE 24.15. Milk Protein Content Trial. Mean response profiles on the orig-
inal data and after aligning and reverting.
no evidence for random effects, as the serial correlation levels off toward the
process variance.
Table 24.10 presents maximum likelihood estimates for the original data,
similar to the analysis by Diggle and Kenward (1994). The corresponding
parameters after aligning and reverting are 3.45 (s.e. 0.06) for the barley
group, with differences for the lupins and mixed groups estimated to be
−0.21 (s.e. 0.08) and −0.12 (s.e. 0.08), respectively. The variance parame-
ters roughly retain their relative magnitude, although even more weight is
given to the serial process (90% of the total variance).
The analysis using aligned and reverted data shows little difference when
compared to the original analysis by Diggle and Kenward (1994). It would be
interesting to acquire knowledge about what mechanisms determined the
systematic increase and decrease observed for the three parallel profiles
illustrated in Figure 24.15(b). It is difficult to envisage that the paral-
lelism of the profiles and their systematic peaks and troughs shown in Fig-
ure 24.15(b) is due entirely to chance. Indeed, many of the previous analyses
had debated the influence on variability of factors common to the paddocks
cultivated with the three different diets, such as meteorological factors,
which had not been reported by the experimenter. These factors
may account for a large amount of variability in the data. Hence, the data
exploration performed in this analysis may be shown to be a useful tool
FIGURE 24.16. Milk Protein Content Trial. Variogram for the original data and
after aligning and reverting.
in gaining insight about the response process. For example, we note that
after transformation, the inexplicable trend toward an increase in milk pro-
tein content as the paddocks approach exhaustion has, in fact, vanished or
even reverted to a possible decrease. This was also confirmed in the strat-
ified analysis where the protein level content tended to decrease prior to
termination of the experiment (see Figure 24.17).
Denoting the parameter for diet effect 1, 2 (difference with the barley
group) in pattern t = 1, 2, 3 by β_t, and letting π_t be the proportion of cows
Note that the simple multinomial model for the dropout probabilities could
be extended when additional information concerning the dropout mecha-
nism is available. For example, if covariates are known or believed to influence dropout,
the simple multinomial model can be replaced by logistic regression or time-
to-event methods (Hogan and Laird 1997).
Recall that Table 24.8 presents the dropout pattern by time in each of
the three diet groups. As few dropouts occurred in weeks 16, 17, and 19,
these three dropout patterns were collapsed into a single pattern. Thus,
three patterns remain with 20, 18, and 41 cows, respectively. The model
fitting results are presented in Table 24.9. The most complex model for the
mean structure assumes a separate mean for each diet by time by dropout
pattern combination. As the variogram indicated no random effects, the
covariance matrix was taken as first-order autoregressive, σ_jk = σ²ρ^|j−k|,
supplemented with a residual measurement error variance. The variance-covariance parameters are also
allowed to vary according to the dropout pattern. This model is equivalent
to including time and diet as covariates in the model and stratifying for
dropout pattern and it provides a starting point for model simplification
through backward selection. We refer to the description of Strategy 3 in
Section 20.5.1. The protein content levels over time are presented by pattern
and diet in Figure 24.17. Note that the protein content profiles appear to
vary considerably according to missingness pattern and time. Additionally,
Diggle and Kenward (1994) suggested an increase in protein content level
toward the end of the experiment. This observation is not consistent for
the three plots in Figure 24.17. In fact, there is a tendency for a decrease
in all diet by pattern subgroups prior to dropout.
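By way of illustration, a minimal PROC MIXED sketch of a model of this type is given below. The data set milk and the variables protein, diet, week, pattern, and cow are hypothetical names of ours, and METHOD=ML is assumed; the GROUP= option is one way to let the covariance parameters vary with dropout pattern, and LOCAL adds a measurement error term on top of the AR(1) serial process.

proc mixed data=milk method=ml;
   class diet week pattern cow;
   /* separate mean for each diet by time by dropout pattern combination */
   model protein = diet*week*pattern / solution;
   /* AR(1) serial correlation plus measurement error (LOCAL), with
      covariance parameters varying according to dropout pattern */
   repeated week / subject=cow type=ar(1) local group=pattern;
run;

Simpler models in the sequence below are then obtained by dropping the GROUP= option and by reducing the mean structure.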
TABLE 24.9. Milk Protein Content Trial. Model fit summary for pattern-mixture
models.
      Mean                                                Covar
 1    Full interaction                                    AR1(t), meas(t)
 2    Full interaction                                    AR1(t), meas
 3    Full interaction                                    AR(1), meas
 4    Two-way interactions                                AR1(t), meas
 5    Diet, time, pattern, diet∗time, diet∗pattern        AR1(t), meas
 6    Diet, time, pattern, diet∗time, time∗pattern        AR1(t), meas
 7    Diet, time, pattern, diet∗pattern, time∗pattern     AR1(t), meas
 8    Time, pattern, time∗pattern                         AR1(t), meas
 9    Time, diet(time)                                    AR(1), meas
10    Time, diet                                          AR(1), meas

      # par     −2ℓ       Ref      G²       df        p
 1     162    −474.93
 2     160    −470.79      1       4.44      2     0.111
 3     156    −428.26      2      42.23      4    <0.001
 4     100    −439.96      2      30.53     50     0.987
 5      70    −202.40      4     237.56     30    <0.001
 6      96    −430.55      4       9.41      4     0.052
 7      64    −404.04      4      35.92     36     0.472
 8      58    −378.22      7      25.82      6    <0.001
 9      60                 6      52.33     38     0.061
10      24
FIGURE 24.17. Milk Protein Content Trial. Mean response level per diet and per
dropout pattern.
Recall that the objective of the experiment was to assess the influence
of diet on protein content level. With selection models, the corresponding
null hypothesis of no effect can be tested using, for example, the standard
F -tests on 2 numerator degrees of freedom as provided by the SAS proce-
dure MIXED or similar software. In the pattern-mixture framework, such a
standard test can be used only if the treatment effects do not interact with
pattern. Otherwise, the marginal treatment (diet) effect has to be deter-
mined as in (20.45) and the delta method can be used to test the hypothesis
of no effect. In Model 6, the diet effect is independent of pattern and the
reverse holds for Model 7. Reparameterizing Model 6 by including the diet
effect and diet by time interaction as one effect in the model provides us
with an appropriate F -test for the three diet profiles. The F -test rejects
the null hypothesis of no diet effect (F = 1.57 on 38 degrees of freedom,
p = 0.015). In the corresponding selection model, Model 9, we remove all
the terms from Model 6 which include pattern. In that case, the F -test is
not significant (F = 1.26 on 38 degrees of freedom, p = 0.133). The differ-
ence in the tests may be explained by the variance parameters which were
larger in the selection model in the absence of stratification for pattern,
thereby effectively diluting the strength of the difference. Additionally, the
standard errors for the estimates of the fixed effects were slightly smaller
in the pattern-mixture model. This is not surprising, as in the model fit-
ting we found that the means and variance parameters were dependent on
FIGURE 24.18. Milk Protein Content Trial. Diet effect over time for the selection
model (SM), the corresponding pattern-mixture model (PMM), and the estimate
obtained after weighting the PMM contributions (weighted).
Using Model 7, we test the global null hypothesis of no diet effect in any of
the patterns. This analysis can be seen as a stratified analysis where a diet
effect is estimated separately within each pattern. This model results in a
significant F-test for the diet effect (F = 6.05 on 6 degrees of freedom,
p < 0.001). Alternatively, we can consider the pooled estimate for the diet
effect, provided by equation (20.45), and calculate the test statistic using
the delta method. This test also indicates a significant diet effect (F = 17.82
on 2 degrees of freedom, p < 0.001), as does the corresponding selection
model, Model 10 (F = 8.51 on 2 degrees of freedom, p < 0.001).
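To fix ideas, a small SAS/IML sketch of such a delta-method calculation for a single diet contrast is given below. All numbers are hypothetical, and the computation treats the pattern-specific estimates and the multinomial pattern proportions as independent; it illustrates the idea behind the pooled estimate and its variance rather than reproducing (20.45) verbatim.

proc iml;
   /* hypothetical pattern-specific estimates of one diet contrast */
   beta  = {-0.10, -0.25, -0.20};
   /* hypothetical covariance matrix of these estimates */
   Vbeta = diag({0.010, 0.012, 0.008});
   /* observed pattern proportions (20, 18, and 41 cows out of 79) */
   pi = {20, 18, 41} / 79;
   N  = 79;
   Vpi = (diag(pi) - pi*pi`) / N;     /* multinomial covariance        */
   theta = pi` * beta;                /* pooled (marginal) diet effect */
   /* first-order delta-method variance */
   vartheta = pi` * Vbeta * pi + beta` * Vpi * beta;
   z = theta / sqrt(vartheta);
   print theta vartheta z;
quit;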
Figure 24.18 presents the diet by time parameter estimates for selection
Model 10, for the corresponding pattern-mixture Model 7 and the weighted
average estimates used in the delta method. The estimates for the selection
model and the pattern-mixture model appear to differ only slightly. Since
the model building within both families is done separately, this is a very
reassuring sensitivity analysis outcome.
Our analysis of the correlation structure appears to agree with the general
conclusions retained in the Diggle and Kenward (1994) analysis. Particu-
larly, it is interesting to notice the absence of random effects. We do not
completely share the surprise expressed by Diggle and Kenward (1994)
since it should be noted that the study animals are highly selected through
centuries of selective breeding. Had we dealt with wild ani-
mals, the role played by random effects would most likely have been much
more substantial. To explain the absence of random effects, we may assume
that there were additional eligibility criteria for the trial (e.g., a specific
breed of cow), which made random effects even more unlikely.
Supplementing the results of Section 24.4.2 with a more formal look at sen-
sitivity, in the selection-model spirit of Chapter 19, can be done using local
influence methods, described in Section 19.3 and applied in Sections 19.4
and 19.5 to the rats and mastitis data sets, respectively. In addition, we
will describe and apply global influence techniques as well.
Cook (1986) suggests that more confidence can be put in a model which is
relatively stable under small modifications. The best known perturbation
schemes are based on case deletion (Cook and Weisberg 1982, Chatterjee
and Hadi 1988), in which the effect of completely removing cases from the
analysis is studied. They were introduced by Cook (1977a, 1979) for the
linear regression context. Denote the log-likelihood function, corresponding to
measurement model (3.8) and dropout model (17.17) and (17.4), by
\[
\ell(\gamma) = \sum_{i=1}^{N} \ell_i(\gamma),
\]
where ℓ_i(γ) is the contribution of the ith subject, as well as the case-deleted
version ℓ_(−i)(γ), in which the ith subject is removed from the sum.
Recall that Diggle and Kenward (1994) considered model (3.11), where
the mean model includes separate intercepts for the barley, mixed, and
lupins groups, and a common time effect which is linear during the first
3 weeks and constant thereafter. The covariance structure is described by
a random intercept, an exponential serial process, and measurement error.
TABLE 24.10. Milk Protein Content Trial. Maximum likelihood estimates (stan-
dard errors) of random and nonrandom dropout models. Dropout starts from week
15 onward.
The dropout model includes dependence on the previous and current, pos-
sibly unobserved, measurements. Since dropout only happens from week
15 onward, Diggle and Kenward (1994) chose to set the dropout proba-
bility for earlier occasions equal to zero. Thereafter, they allowed separate
intercepts per time point, but common dependencies on previous and cur-
rent measurements. We will now introduce two models which use the same
measurement model as Diggle and Kenward (1994) but different dropout
models.
A first dropout model is closely related to the one of Diggle and Kenward
(1994), who defined occasion-specific intercepts ψ0k (k = 15, 16, 17, 19),
assumed common slopes, and set the dropout probability equal to zero
at other occasions. We also model dropout from week 15 onward, but we
will keep the intercepts constant for occasions 15 to 19. Precisely, our first
model contains three parameters (intercept ψ0 , dependence on the previous
measurement ψ1 , and dependence on the current measurement ψ2 ).
Parameter estimates for this model under both MAR and MNAR are listed
in Table 24.10. The fitted model is qualitatively equivalent to the model
used by Diggle and Kenward (1994), who concluded overwhelming evidence
TABLE 24.11. Milk Protein Content Trial. Maximum likelihood estimates (stan-
dard errors) of random and nonrandom dropout models. Dropout starts from week
1 onwards.
for nonrandom dropout (likelihood ratio statistic 13.9). In line with these
results, we also could decide in favor of a nonrandom process (likelihood
ratio statistic 14.59).
In our second dropout model, we allow dropout starting from the sec-
ond week. Specifically, this model contains three parameters (intercept ψ0 ,
dependence on the previous measurement ψ1 , and dependence on the cur-
rent measurement ψ2 ) which are assumed constant throughout the whole
19-week period. The fit of this model is listed in Table 24.11. A striking
difference with the previous analysis is that the MAR assumption is bor-
derline not rejected (likelihood ratio statistic 3.63). Apparently, this is a
major source of sensitivity, to be explored further. As follows from the theory,
the measurement model parameters do not change under the MAR model,
compared to those displayed in Table 24.10. The measurement model ob-
tained under MNAR has changed only slightly.
Global Influence
Global influence results are shown in Figures 24.19–24.21. They are based
on fitting a MNAR model with each of the cows deleted in turn. The Cook’s
distances for the first and the second model are shown in Figure 24.20 and
24.21, respectively. The individual curves with influential subjects high-
lighted are plotted in Figure 24.19 where subject #38 should not be high-
lighted for the second model.
There is very little difference in some of the Cook’s distance plots, when Fig-
ures 24.20 and 24.21 are compared. Precisely, CD1i , CD2i (γ), and CD2i (θ)
are virtually identical. The three others are similar in the sense that there
is some overlap in the subjects indicated as peaks, but with varying mag-
nitudes. Subject #38 is influential on the dropout measures CD2,38 (ψ, ω),
CD2,38 (ψ), and CD2,38 (ω). This is not surprising since #38 is rather low
in the middle portion of the measurement sequence, whereas it is very high
from week 15 onward. Therefore, this sequence is picked up in the second
analysis only. By looking at a plot with the evolution of the parameters
separately during the deletion process (not shown here), we can conclude
that subject #38 has some impact on the serial correlation parameter while
#65 is rather influential for the measurement error. In view of the fairly
smooth deviation from a straight line of the former and the abrupt peaks
in the latter, this is not a surprise.
Based on our second model, all forms of CD2i (·), whether based on the
entire parameter vector γ, the dropout parameters (ψ0 , ψ1 , ω), or subsets
of the latter, indicate that subjects #51, #59, and #68 are influential. In
contrast, CD1i , which is based directly on the likelihood, does not reveal
these subjects, but rather subject #65 jumps out. Thus, although the for-
mer three subjects have a substantial impact on the parameter estimates,
they do not change the likelihood in a noticeable fashion. From a plot of
the dropout parameter estimates for each deleted case (not shown here),
it is very clear that upward peaks in ψ_0(−i) for subjects #51 and #59 are
compensated by downward peaks in ω_(−i). An explanation for this phe-
nomenon can be found in the variance-covariance matrix of the dropout
FIGURE 24.19. Milk Protein Content Trial. Individual profiles, with globally in-
fluential subjects highlighted. Dropout modeled from week 15.
FIGURE 24.20. Milk Protein Content Trial. Index plots of CD1i , CD2i (γ),
CD2i (θ), CD2i (ψ, ω), CD2i (ψ), and CD2i (ω). Dropout modeled from week 15.
FIGURE 24.21. Milk Protein Content Trial. Index plots of CD1i , CD2i (γ),
CD2i (θ), CD2i (ψ, ω), CD2i (ψ), and CD2i (ω). Dropout modeled from week 1.
From a principal components analysis, it follows that more than 90% of the
variation is captured in the linear combination 0.93ψ0 −0.37ω. Hence, there
is mass transfer between these two parameters, of course with sign reversal,
little impact on the likelihood value, and little effect on the MAR parameter
ψ1 . Note that a similar plot for the measurement model parameters can be
constructed (not shown).
Let us now turn to the subjects which are globally influential. A first and
common reason for those subjects to show up is the fact that they all
have a rather strange profile. Recall that the overall trend slopes downward
during the first 3 weeks and is constant thereafter. Subject #65
appears with a large CD1,65 and a large CD2,65(θ). The reason for this can be
found in the fact that its profile shows extremely low and high peaks.
Subjects #51, #59, and #68, on the other hand, only show large values for
CD2 (ψ, ω), CD2 (ψ), CD2 (ω). This means that these subjects are influential
for the dropout parameters. For subject #51, this can be explained by the
fact that it drops out in spite of its rather high profile. Subjects #59 and
#68, on the contrary, stay in the experiment even though they both have
rather low profiles.
Local Influence
Local influence plots and individual profiles, with the influential subjects
highlighted (bold type), for the first model for raw and incremental data,
FIGURE 24.22. Milk Protein Content Trial. Index plots of Ci , Ci (θ), Ci (β),
Ci (α), and Ci (ψ), and of the components of the direction hmax of maximal cur-
vature. Dropout modeled from week 15.
Observe that the plots for Ci and Ci (ψ) are virtually identical. This is
due to the relative magnitudes of the ψ and θ components. Profiles #51,
#59, and #66–#68 are highlighted in Figure 24.27. An explanation for the
influence in ψ is found by studying (19.12). Indeed, for ψ0 and ψ1 as in
Table 24.11, the maximum is obtained for y = 2.51, exactly as seen in the
influential profiles, which are all in the lupins group (Figure 24.27). Fur-
ther, note that there is some agreement between the locally and globally
influential subjects, although there is no compelling need for the two ap-
proaches to be identical (#51 appears in different influential components in
the two approaches). Indeed, although global influence lumps together all
sources of influence, our local influence approach is designed to detect sub-
jects which, due to several causes, tend to have a strong impact on ω and
therefore on the conclusion about the nature of the dropout mechanism.
Observe that one factor in (19.12) is the square of the response. This is
a direct consequence of our parameterization of the dropout process, the
logit of which is in terms of the previous and current outcomes, to which
no transformation is applied. As was argued in the mastitis case (p. 321),
since two subsequent measurements are usually positively correlated, it is
FIGURE 24.23. Milk Protein Content Trial. Individual profiles, with locally in-
fluential subjects highlighted. Dropout modeled from week 15.
FIGURE 24.24. Milk Protein Content Trial. Index plots of Ci , Ci (θ), Ci (β),
Ci (α), and Ci (ψ), and of the components of the direction hmax of maximal cur-
vature. Dropout modeled from week 15. Incremental analysis.
FIGURE 24.25. Milk Protein Content Trial. Individual profiles, with locally influ-
ential subjects highlighted. Dropout modeled from week 15. Incremental analysis.
One can also reparameterize the dropout model (19.1) in terms of the increment;
that is, yij is replaced by yij − yi,j−1. This is related to the approach of Diggle and
Kenward (1994), who reparameterized their dropout model in terms of the
increment just introduced and the size (the average of both measurements).
Even though a dropout model in the outcomes themselves, termed the direct
variables model, is equivalent to a model in the first variable Yi1 and the in-
crement Yi2 − Yi1, termed the incremental variable representation, it was shown
that they lead to different perturbation schemes of the form (19.1). Indeed,
from equality (19.18) or, more generally, from
FIGURE 24.26. Milk Protein Content Trial. Index plots of Ci , Ci (θ), Ci (β),
Ci (α), and Ci (ψ), and of the components of the direction hmax of maximal cur-
vature. Dropout modeled from week 1.
FIGURE 24.27. Milk Protein Content Trial. Individual profiles, with locally in-
fluential subjects highlighted. Dropout modeled from week 1.
Overview
Whereas global influence, as stated earlier, starts from deleting one subject
completely, local influence only changes the dropout process for one subject
from random dropout to nonrandom dropout. Because of the completely
FIGURE 24.28. Milk Protein Content Trial. Index plots of Ci , Ci (θ), Ci (β),
Ci (α), and Ci (ψ), and of the components of the direction hmax of maximal cur-
vature. Dropout modeled from week 1. Incremental analysis.
different approach, there is no need for both methods to yield similar re-
sults, although by looking at the influential subjects for all cases studied
above, we notice some overlap.
FIGURE 24.29. Milk Protein Content Trial. Individual profiles, with locally in-
fluential subjects highlighted. Dropout modeled from week 1. Incremental analysis.
The mentally handicapped residing in institutions are at high risk for he-
patitis B virus (HBV) acquisition and subsequent carrier state. The higher
risk of nonparental transmission in this population is due to the typical be-
havior of mentally retarded patients, the type of mental retardation, and
the closed setting of the institutions, which all enhance spreading of the
virus.
We describe the use of linear mixed-effects models for the analysis of anti-
bodies against hepatitis B surface antigen (anti-HBs) data from a mentally
handicapped population vaccinated against hepatitis B, 11 years earlier
(Van Damme et al . 1989). Use of random effects in this setting was already
proposed by Coursaget et al . (1994) and Gilks et al . (1993), who consid-
ered between-individuals variability in a Bayesian random-effects model.
In previous studies, several factors have been described to cause a higher
risk in the acquisition of hepatitis B virus infection. These factors include
age, age at admission, duration of residency, type of mental retardation
[Down’s syndrome (DS) or other types of mental retardation (OMR)], sex,
and use of anti-epileptic medication (Vellinga et al . 1999a). Sex, age, and
type of mental retardation are also of influence on the response to vaccina-
tion (Vellinga et al . 1999b). The linear mixed model describes the decline in
antibody titer in relation to the significant risk factors across measurement
occasions. A detailed account of the medical results is reported in Vellinga
et al . (1999c).
samples were taken after each vaccine dose, and if residents did not meet the
(arbitrary) anti-HBs level of 100 IU/L at month 7, they received a booster
vaccine dose at month 12. If the requirement of 100 IU/L were still not met
at month 13, additional booster doses were administered (these residents
were, however, not further included in the program). All residents received
a booster dose after 5 years (i.e., at month 60).
Section 24.5.1 presents model building for the two models we just described.
Section 24.5.2 is devoted to the issue of prediction at year 12, where sensi-
tivity to assumptions and type of model used can be assessed.
Figure 24.31 depicts an estimate of the empirical variogram for these data.
It was constructed using standardized ordinary least squares residuals ob-
tained upon fitting a saturated groups by times model (where group is
type of mental retardation). Also shown in this figure is a loess-smoothed
estimate (Cleveland 1979) of the variogram. The between-subject variance
seems relatively large in these data, accounting for about one-half of the
total variability. The measurement error is also substantial, accounting for
the other half of the process variance. This variogram leaves little room for
a serially correlated component. Note that in this context, it is essential
to use standardized residuals to remove variance heterogeneity in the data,
ensuring that the process variance is constant and equal to 1.
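A sketch of one way to construct such a variogram in SAS is given below. The data set hepb and the variables id, time, group, and titer are hypothetical names of ours, and standardization is done crudely by scaling the OLS residuals to unit variance.

/* OLS residuals from a saturated group-by-time means model */
proc glm data=hepb;
   class group time;
   model titer = group*time;
   output out=resids r=r;
run; quit;

/* scale the residuals to mean zero and unit variance */
proc standard data=resids mean=0 std=1 out=stdres;
   var r;
run;

/* half squared differences of residual pairs within subjects,
   by time lag, give the empirical variogram */
proc sql;
   create table vario as
   select a.id,
          abs(a.time - b.time) as u,
          0.5 * (a.r - b.r)**2 as v
   from stdres as a, stdres as b
   where a.id = b.id and a.time > b.time;
quit;

proc means data=vario mean;
   class u;
   var v;
run;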
Finally, retaining the covariance structure we just selected, the mean model
can be reduced. Effects kept in the final model were time, type of mental
retardation, number of vaccine doses (with unstructured time effects), dura-
tion of residency (with a linear time trend), use of anti-epileptic medication,
and sex (time-constant effects). Although not significant, sex was kept in
the model for reasons of external comparison.
Parameter estimates for this model are shown in Tables 24.14 and 24.15. It
can be seen that no covariance parameter is given for the random effects in
these tables. A comparison between model-based and empirical standard er-
[Model fit summary table: model, description, number of parameters (random, serial), −2ℓ, comparison model∗, G², df, p.]
∗ Comparison model.
† One parameter could not be estimated due to parameter constraints.
†† From a 50-50 mixture of χ²₂- and χ²₃-distributions.
††† From a 50-50 mixture of χ²₁- and χ²₂-distributions.
rors revealed large discrepancies (relative increases more than threefold for
most of the estimates) in the final model. Empirically corrected (or robust)
standard errors (Section 6.2.4) counteract the effect of potential misspeci-
fication of the covariance structure (Diggle, Liang, and Zeger 1994, Liang
and Zeger 1986) and disagreement between both types of standard errors
might point to an inadequately specified covariance structure. Arguably, we
had little reason to believe that the selected covariance function is substan-
tially incorrect. Therefore, it is wise to attain a trade-off between model fit
as reported by likelihood ratios, and differences occurring between model-
based and empirical standard errors. In particular, assuming a diagonal
instead of an unstructured covariance matrix for the random effects yields
a much better model in this respect and was therefore retained as our final
model. Most of the estimated empirical standard errors in Tables 24.14 and
24.15 do not exhibit changes of more than 25% compared to model-based
standard errors.
It is worth noting that the effect of Down’s syndrome on antibody titer was
significant at months 24, 36, and 48, indicating a faster decline in anti-HBs
in this population than in the OMR group. There did not seem to be
Mean Structure:
Intercept 9.36 (0.41; 0.38)
Time 1 −6.80 (0.21; 0.24)
Time 2 −4.30 (0.21; 0.20)
Time 7 †
Time 12 −1.66 (0.18; 0.13)
Time 13 −0.54 (0.36; 0.37)
Time 24 −3.36 (0.20; 0.19)
Time 36 −3.68 (0.21; 0.19)
Time 48 −4.21 (0.28; 0.27)
Time 60 −4.69 (0.31; 0.25)
Time 61 1.62 (0.34; 0.25)
Time 132 −1.92 (0.59; 0.55)
DS/OMR 1 0.60 (0.64; 0.54)
DS/OMR 2 0.37 (0.68; 0.62)
DS/OMR 7 −0.02 (0.56; 0.56)
DS/OMR 12 −0.30 (0.61; 0.82)
DS/OMR 13 ††
DS/OMR 24 −1.56 (0.53; 0.74)
DS/OMR 36 −1.75 (0.46; 0.60)
DS/OMR 48 −1.50 (0.56; 0.62)
DS/OMR 60 −0.61 (0.54; 0.44)
DS/OMR 61 −0.83 (0.69; 0.54)
DS/OMR 132 −1.18 (0.69; 0.39)
# doses 1 −1.34 (0.37; 0.24)
# doses 2 −1.78 (0.40; 0.36)
# doses 7 −2.45 (0.33; 0.36)
# doses 12 −2.71 (0.35; 0.42)
# doses 13 †††
# doses 24 −0.11 (0.31; 0.36)
Random Effects:
Intercept 0.66
Time 0.02
Serial Structure:
Variance 0.58
Rate of exponential decrease (1/ρ) 2.32
Measurement Error:
Time 1 1.32
Time 2 1.37
Time 7 0.76
Time 12 0.70
Time 13 0.78
Time 24 0.48
Time 36 0.00
Time 48 0.46
Time 60 0.47
Time 61 1.31
Time 132 0.31
that time again led to better responses in G1 and this was still visible at
year 11.
where the βj are regression parameters, t^(0) ≡ ln(t), and the powers p1 <
· · · < pm are positive or negative integers or fractions (Royston and Altman
1994). They argue that polynomials with degree higher than 2 are rarely
required in practice and further restrict the powers to a small predefined
set of possibly noninteger values: Π = {−2, −1, −1/2, 0, 1/2, 1, 2, . . . ,
max(3, m)}. For example, setting m = 2 generates (1) four quadratics in
powers of t, represented by (1/t², 1/t), (1/t, 1/√t), (√t, t), and (t, t²); (2) a
quadratic in ln(t); and (3) other curves which have shapes different from
those of conventional low-degree polynomials. The full definition includes
possible “repeated powers,” which involve powers of ln(t). For example, a
fractional polynomial of degree m = 3 with powers (−1, −1, 2) is of the
form
\[
\beta_0 + \beta_1 t^{-1} + \beta_2 t^{-1}\ln(t) + \beta_3 t^2
\]
(Royston and Altman 1994, Sauerbrei and Royston 1999). In this case, the
set of powers we have considered ranged from −3 to 3 with increments of
0.5. We then searched for the best pair or triple of powers among this set.
The combination (−3, 0) turned out to give an adequate fit to these data.
See also Section 10.3.
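The selected powers (−3, 0) correspond to the terms time⁻³ and ln(time). A sketch of how such a fractional polynomial could be fitted with PROC MIXED is shown below; the data set postvac and the variables lntiter, time, and id are hypothetical names of ours, and the covariance structure is simplified here to a single random intercept.

/* create the fractional-polynomial terms for powers (-3, 0) */
data postvac;
   set postvac;
   t_m3 = time**(-3);    /* power -3            */
   t_0  = log(time);     /* power 0, i.e., ln(t) */
run;

proc mixed data=postvac method=ml;
   class id;
   model lntiter = t_m3 t_0 / solution;
   random intercept / subject=id;
run;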
In order to visually assess the fit of these two models, Figure 24.32 shows
observed and predicted average profiles for combinations of number of vac-
cine doses and type of mental retardation. For predicted average profiles,
all other covariate effects were set equal to their mean values.
FIGURE 24.32. Hepatitis B Vaccination Study. Observed and predicted mean pro-
files for combinations of number of vaccine doses and type of mental retardation:
(a) original data; (b) postvaccination data.
and OMR patients in their anti-HBs response after vaccination have failed.
See Vellinga et al . (1999c) for further discussion. Another point concerns
whether anti-epileptic medication has an influence on the immune system.
However, it is hard to decide whether this is due to the medication itself,
or rather an indication for the influence of epilepsy, or both (De Ponti et
al . 1993).
Although there is some interest in modeling the complete set of data from
a descriptive viewpoint, this could only be achieved with the help of a time-
saturated model to account for the high nonlinearity in the profiles. If one is
interested in a more parsimonious, parametric description of the temporal
decline in antibody titer, to address such questions as long-term influence,
one needs to resort to an alternative solution. Focusing the analysis on
postvaccination data is a suitable alternative, enabling simple parametric
modeling of the anti-HBs evolution over time. Obviously, the absence of
intermittent measurements between years 5 and 11 weakens the long-term
prediction process, and we need to make certain assumptions, such as that
the rate of decline after booster administration at month 60 is similar to the
rate of decline after the time of last vaccination. It is nevertheless reassuring
to see that predicted values at year 12 were all in good agreement, inde-
pendently of the model or method chosen. We conclude that each of these
two models may bring their own insight into the data and their combined
use may better serve the purpose of a sensitivity analysis.
The first approach merely uses a linear extrapolation based on the last
two measurements (at months 61 and 132) of an individual. The resulting
extrapolations are then averaged out to obtain a prediction at month 144.
Obviously, this approach can be criticized as being overly simple since the
profiles are clearly nonlinear over the first 5 years. A refinement of this
method might consist of overlaying profiles for the month 61–132 period
with profiles from the first part of the study and then extrapolating until
month 144. It raises some technical difficulties though, since the starting
TABLE 24.17. Hepatitis B Vaccination Study. Predicting log antibody titer (IU/L)
at year 12: (a) Approach 1: linear interpolation (original data); (b) Approach 2:
refined linear interpolation (original data); (c) Approach 3: fractional polynomials
model (post-vaccination data).
point of the first period (time of the last vaccine dose) depends upon the
group being considered: month 7 for group G1 and month 13 for G2. Using
month 24 as a cutoff point to split the first time period into two pieces,
we can linearly approximate the profiles in these two time windows, trans-
late them to the month 61–132 period, extrapolate until month 144, and
eventually average the results across the two groups.
We conclude with two remarks. First, in this study, prediction takes place
only 1 year upon completion of the study, which is not too distant in time
compared to the duration of the study. Would prediction have to be done
several years later, we would likely observe larger discrepancies. Second, in
studies where more emphasis is to be put on prediction, it is a good idea
to plan intermittent assessment occasions to aid in modeling long-term
temporal evolution. A simple model (e.g., using fractional polynomials),
might then be used straightforwardly for making long-term inference with
more confidence.
We describe how to use the SAS statistical software package to fit our start-
ing and thus the most complex model on the one hand, and the finally se-
lected model on the other hand. The models are discussed in Section 24.5.1.
where Ui is a design matrix built from the covariates included and δ are
the estimated and reported parameters. Note that the exponential function
ensures non-negative components of variance.
Many of the analyses done in this book have been performed in SAS, in par-
ticular using the procedure MIXED. An extensive overview is to be found
in Chapter 8. Here, the focus is on the Baltimore Longitudinal Study of
Aging (prostate cancer data; Section 2.3.1). The main program features are
discussed, as well as the output. The SAS procedure MIXED is explicitly
discussed in a number of other chapters. In Chapter 9, the prostate cancer
data are used to illustrate model building principles. In Chapter 17, both
complete as well as incomplete versions of the growth data (Section 2.6)
are analyzed using PROC MIXED. The use of SAS for pattern-mixture
based sensitivity analysis is documented in Chapter 20. Finally, SAS code
is included for several case studies in Chapter 24.
In this book, all analyses using the SAS System have been carried out
using Version 6.12. Although this may seem anomalous to many, given the
availability of Version 7.0 and higher, it has to be noted that Version 7.0
(SAS 1999) was not available on a commercial basis in 1999, for example,
in Europe. For a thorough description of PROC MIXED in SAS Version
7.0, we refer to the on-line manual (SAS 1999).
In this section, we will highlight some of the important changes that have
been implemented in Version 7.0 with respect to PROC MIXED. They are
ordered by statement.
The option ‘CL’, requesting confidence intervals for the covariance para-
meter estimates, has been modified. For those parameters with a default
lower bound of zero (diagonal elements in a covariance matrix), Satterth-
waite approximations are used:
\[
\frac{\nu\,\widehat{\sigma}^2}{\chi^2_{\nu,\,1-\alpha/2}} \;\leq\; \sigma^2 \;\leq\; \frac{\nu\,\widehat{\sigma}^2}{\chi^2_{\nu,\,\alpha/2}}, \qquad\qquad (A.1)
\]
where ν = 2Z² and Z is the classical Wald statistic σ̂²/s.e.(σ̂²). Using
‘CL=WALD’ requests a Wald version, rather than the modified limits in
(A.1).
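For instance, a call of the following form (data set and variable names are ours) would produce the limits (A.1) for the variance components; adding ‘=WALD’ to the CL option switches to the Wald-type limits.

proc mixed data=example covtest cl;
   class id;
   model y = time / solution;
   random intercept time / subject=id;
run;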
The ‘method’ option has been extended to include, apart from REML, ML,
and MIVQUE0, also TYPE1, TYPE2, and TYPE3. The new methods re-
quest analysis of variance (ANOVA) estimation of the corresponding type,
producing method-of-moment variance component estimates. Of course,
these methods are available only when there is no ‘subject=<effects>’ op-
tion and no REPEATED statement.
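A sketch of such a call is given below (data set and variable names are ours); note that, in line with the restriction just mentioned, the RANDOM statement carries no ‘subject=’ option and there is no REPEATED statement.

proc mixed data=example method=type3;
   class block treatment;
   model y = treatment;
   random block;
run;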
A major change to PROC MIXED (and all other procedures) is the way in
which output is handled. PROC MIXED now makes use of the integrated
ODS (output delivery system). This implies that the MAKE statement has
become obsolete. Although still supported, it is envisaged that this will no
longer be the case in later versions.
MODEL Statement
Given the ODS structure, predicted means and predicted values are now
handled in a different way. Precisely, the option ‘outpredm=SAS-data set’
and ‘outpred=SAS-data set’ have to be used. Options ‘pred’ and ‘pred-
means’ have become obsolete. Aliases for the new options are ‘outpm=SAS-
data set’ and ‘outp=SAS-data set’.
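For example (data set and variable names are ours), the following call writes the predicted means and the predicted values to the data sets predmeans and predvalues, using the aliases:

proc mixed data=example;
   class id;
   model y = time / solution outpm=predmeans outp=predvalues;
   random intercept time / subject=id;
run;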
PRIOR Statement
This statement has undergone a major revision. New options are as follows:
‘out=’: output data set with the sample of the posterior density.
‘outg=’: output data set from the grid evaluations.
‘outgt=’: output data set from the transformed grid evaluations.
‘psearch’: displays the search used to determine the parameters for the
inverted gamma densities.
‘ptrans’: displays the transformation of the variance components.
‘seed=’: starting value for the random number generation within a call of
PROC MIXED.
‘tdata=’: allows to input the transformation of the covariance parameters
used by the sampling algorithm.
‘trans=’: specifies the algorithm used to determine the transformation of
the covariance parameters.
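A sketch using some of these options is given below (data set, variable names, and the seed are ours); the sample from the posterior density is written to the data set postsample, and the search and transformation are displayed.

proc mixed data=example;
   class id;
   model y = time / solution;
   random intercept / subject=id;
   prior / out=postsample seed=52201 psearch ptrans;
run;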
RANDOM Statement
REPEATED Statement
Note that all outcomes for a given individual are still stacked, as with uni-
variate repeated measures. Thus, two indicators are needed: var to indicate
which of the multivariate outcomes is listed and time indicating the longi-
tudinal structure. The ‘type=’ option specifies an unstructured covariance
matrix among the components at a given time (a typical assumption in
multivariate data, cf. PROC GLM), and further an AR(1) process for re-
peated measures of a specific outcome. It is then assumed that for different
outcomes at unequal measurement times, the Kronecker product specifies
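A sketch of such a REPEATED specification is given below; the stacked data set multiv, with outcome column outcome and indicators var and time, is a hypothetical example, and TYPE=UN@AR(1) requests the Kronecker product of an unstructured matrix (across outcomes) and an AR(1) matrix (across time).

proc mixed data=multiv;
   class id var time;
   model outcome = var / solution;
   repeated var time / subject=id type=un@ar(1);
run;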
A.2 Fitting Mixed Models Using MLwiN
http://www.ioe.ac.uk/multilevel/
http://www.medent.umontreal/multilevel/
http://www.edfac.unimelb.edu.au/multilevel/
http://www.ioe.ac.uk/mlwin/
+β01 xi (A.4)
+β10 tj (1 − xi ) (A.5)
+β11 tj xi (A.6)
+b1i tj . (A.7)
FIGURE A.1. Growth Data. Symbolic MLwiN equation for Models (A.8)–(A.9).
There are ample graphical capabilities. For example, a simple plot of the
raw profiles is shown in Figure A.3. Predicted means are displayed in Fig-
ure A.4. This plot is based on the fixed-effects structure. Therefore, two
fitted straight lines are shown, one for boys and one for girls. Empirical
Bayes predictions are displayed in Figure A.5. Since the model assumes
random intercepts and random slopes, individual linear profiles are pro-
duced. MLwiN uses data by means of a work sheet. This can easily be
manipulated. Variables can be renamed and transformed, and new ones
can be created. For example, if predicted values are computed, they can be
added to the work sheet as a new column. Subsequently, predicted values
can be used in the graphical window to generate plots such as Figures A.4
and A.5.
A.3 Fitting Mixed Models Using SPlus

SPlus provides various ways to estimate mixed models. On the one hand,
the built-in functions lme() for linear mixed-effects models can be used.
Note that there is a companion function for nonlinear mixed-effects models,
nlme(). These functions are based on work by Lindstrom and Bates (1988),
Laird and Ware (1982), Box, Jenkins, and Reinsel (1994), and Davidian and
Giltinan (1995). The lme() function will be discussed in Section A.3.1. A
third-party suite of SPlus functions, termed OSWALD (Smith, Robertson,
and Diggle 1996) has been developed to fit longitudinal models. The pack-
age is based on the methods described in Diggle, Liang, and Zeger (1994).
In particular, the pcmid() function fits linear mixed models. Many of the
functionalities between lme() and pcmid() are shared, but an attractive
Clusters. The clusters (subjects, units, etc.) are defined using cluster.
Call:
Fixed: MEASURE ~ 1 + MALE + MALEAGE + FEMAGE
Random: ~ 1 + AGE
Cluster: ~ (IDNR)
Data: growth5.df
Structure: unstructured
Parametrization: matrixlog
Standard Deviation(s) of Random Effect(s)
(Intercept) AGE
2.134752 0.1541473
Correlation of Random Effects
(Intercept)
AGE -0.6025632
Although the above output is rather brief, one can obtain a more extensive
summary:
Call:
Fixed: MEASURE ~ 1 + MALE + MALEAGE + FEMAGE
Random: ~ 1 + AGE
Cluster: ~ (IDNR)
Data: growth5.df
Estimation Method: ML
Convergence at iteration: 6
Log-likelihood: -213.903
AIC: 443.806
BIC: 465.263
The estimates and standard errors coincide with those obtained with ML-
wiN (Figure A.2). This is immediately clear for the fixed-effects estimates,
their standard errors, and the residual variance. The components of the D
matrix have to be derived from the standard deviations and correlation of
the random effects:
d11 = 2.134752² = 4.557,
d12 = (−0.6025632)(2.134752)(0.1541473) = −0.198,
d22 = 0.1541473² = 0.024.
As is the case with MLwiN, SPlus in general and lme() in particular offer
extensive graphical capabilities.
Even though a word of care was issued in Chapters 17–20 about nonrandom
dropout models restricting the model-building exercise simply to MAR
In a heart failure study, the primary efficacy endpoint is based upon the
ability to do physical exercise. This ability is measured in the number of
seconds a subject is able to ride the exercise bike. There are 25 subjects
assigned to placebo and 25 to treatment. The treatment consisted of the ad-
ministration of ACE inhibitors. Four measurements were taken at monthly
intervals. Table A.2 presents outcome scores, transformed to normality. We
will refer to them as the exercise bike data. All 50 subjects are observed
at the first occasion, whereas there are 44, 41, and 38 subjects seen at the
second, the third, and the fourth visits, respectively.
The first model we consider is the random intercept Model 7 used for the
growth data (see Section 17.4.1, p. 253). It is given by omitting the H
Placebo Treatment
1 2 3 4 1 2 3 4
0.43 0.94 4.32 4.51 −2.54 −0.20 −0.15 3.53
3.10 5.82 5.59 6.32 4.33 5.57 6.86 6.87
0.56 2.21 1.18 1.54 −2.46 . . .
−1.18 −0.30 2.48 2.67 2.30 4.64 7.37 7.99
1.24 2.83 1.98 3.21 0.73 3.29 5.23 6.12
−1.87 −0.06 1.16 1.84 0.38 1.25 2.91 4.71
−0.28 1.30 . . 1.51 4.00 5.98 .
2.93 . . . 0.38 0.94 3.28 4.05
−0.20 3.34 3.71 3.69 0.42 2.53 . .
−0.12 2.01 2.35 2.70 2.41 4.24 4.79 8.14
−1.60 1.42 0.41 0.72 0.12 1.48 3.12 3.69
0.64 . . . −3.46 −0.93 2.78 3.02
−1.14 −1.20 0.09 2.39 −0.55 . . .
2.24 2.12 3.00 1.52 0.74 2.40 4.04 5.61
−0.44 0.88 2.83 1.47 2.37 2.79 4.05 5.91
0.39 1.77 3.62 4.35 1.94 5.05 3.06 5.89
−4.37 −2.43 −0.43 −0.13 0.77 2.46 . .
0.20 2.05 3.18 5.13 1.32 . . .
1.31 3.82 2.70 3.59 2.15 4.84 7.70 8.29
−0.38 −1.92 −0.12 −0.40 −0.09 2.02 4.68 5.29
−0.78 . . . 2.10 4.91 7.48 8.91
−0.48 0.32 0.66 3.03 1.36 0.62 1.87 .
−0.64 1.53 1.29 . 3.14 5.79 5.95 7.50
0.88 2.10 1.90 3.51 −0.94 −0.08 3.57 3.80
2.02 3.10 4.93 4.76 0.89 1.51 3.14 5.96
term from (A.10). The next two models are new and, to avoid confusion,
they will be assigned numbers 9 and 10. The second model (Model 9)
supplements a random intercept with serial correlation. We will choose
the serial correlation to be of the AR(1) type. This model is found by
omitting the measurement error component τ²I from (A.10) and choosing
the elements of H to be h_jk = ρ^|j−k| = exp(−φ|j − k|). Model 10 combines
all three sources of variability. SAS code for these models is
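The code itself is not reproduced here; a sketch consistent with the covariance structures just described and with the output in Table A.3, assuming variables y, group, time, and subject identifier id (names are ours), could look as follows.

/* Model 7: random intercept + measurement error (compound symmetry) */
proc mixed data=bike method=ml;
   class id group;
   model y = group time group*time / solution;
   repeated / subject=id type=cs;
run;

/* Model 9: random intercept + AR(1) serial correlation */
proc mixed data=bike method=ml;
   class id group;
   model y = group time group*time / solution;
   random intercept / subject=id;
   repeated / subject=id type=ar(1);
run;

/* Model 10: random intercept + AR(1) serial correlation
             + measurement error (LOCAL)                  */
proc mixed data=bike method=ml;
   class id group;
   model y = group time group*time / solution;
   random intercept / subject=id;
   repeated / subject=id type=ar(1) local;
run;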
The mean and covariance model parameters are summarized in Table A.3.
The parameters supplied by PROC MIXED are supplemented with some
additional quantities in order to obtain both sets of intercepts and slopes
for the two treatment groups. Thus, intercept 0 is the sum of intercept
1 and the group 0 effect. Further, φ and ρ are connected by φ = − ln(ρ).
Although the mean models are straightforward to interpret from the SAS
output, it is necessary to approach the covariance parameters output with
some care.
TABLE A.3. Exercise Bike Data. SAS output on Models 7, 9, and 10.
Model 7:
   CS           ID         2.37805794
   Residual                0.76223505

Model 9:
   UN(1,1)      ID         2.15845062
   AR(1)        ID         0.30795893
   Residual                0.94837375

Model 10:
   G matrix:    INTERCEPT  2.09382622
   UN(1,1)      ID         2.09382622
   Variance     ID         0.80117791
   AR(1)        ID         0.44401509
   Residual                0.20973894
This output is easier to understand since the parameters are nicely grouped
by source of variability: (1) random intercept, with ν² = 2.0938, (2) serial
correlation, with σ² = 0.8012 and ρ = 0.4440, and (3) measurement error
(residual), with τ² = 0.2097.
Formal inspection of the deviances shows that Model 7 is too simple and
that Model 10 does not improve Model 9 significantly.
Our next goal is to fit the same three models with OSWALD. For a full
documentation on OSWALD we refer to Smith, Robertson, and Diggle
(1996) or to the web page:
http://www.maths.lancs.ac.uk:2080/~maa036/OSWALD/
The output is presented in the form of a typical SPlus list object. For
Model 7,
Call:
pcmid(formula = bike.bal ~ group * time,
vparms = c(0, 1.5, 0), correxp = 1)
Maximised likelihood:
[1] -482.4208
Mean Parameters:
(Intercept) group time group:time
PARAMETER -0.6166502 -0.1915053 0.92002700 0.7233457
STD.ERROR 0.3792614 0.5374845 0.08554001 0.1232477
Variance Parameters:
nu.sq sigma.sq tau.sq phi
0 2.378172 0.7622228 0
Model 9 omits the measurement error, which is simply done by setting the
initial value for τ² ≡ 0:
Call:
pcmid(formula = bike.bal ~ group * time,
vparms = c(1, 0, 1), correxp = 1)
Maximised likelihood:
[1] -480.13
Mean Parameters:
(Intercept) group time group:time
PARAMETER -0.6609958 -0.1607851 0.92363457 0.7212975
STD.ERROR 0.3923346 0.5559721 0.09760774 0.1403651
Variance Parameters:
nu.sq sigma.sq tau.sq phi
2.158405 0.9483919 0 1.177751
Call:
pcmid(formula = bike.bal ~ group * time,
vparms = c(2, 0.2, 1), correxp = 1,
reqmin = 1e-012)
Maximised likelihood:
[1] -480.1222
Mean Parameters:
(Intercept) group time group:time
PARAMETER -0.6629645 -0.1606256 0.92304454 0.7221020
STD.ERROR 0.3932421 0.5572348 0.09842758 0.1415133
Variance Parameters:
nu.sq sigma.sq tau.sq phi
2.094417 0.8016666 0.2087339 0.8139487
All analyses done on the exercise bike data so far assumed ignorable nonre-
sponse, in the spirit of Section 17.3. This means that they are valid under
MAR (and not only under MCAR, in spite of the claim printed in the
OSWALD output). Of great potential value is the feature of OSWALD to
be able to go beyond an ignorable analysis and to fit a specific class of
nonrandom models. We will exemplify this power using the exercise bike
data. Illustration will be on the basis of the most general Model 10, even
though the slightly simpler Model 9 fits the data equally well. The analysis
featured by OSWALD couples a linear mixed-effects model for the mea-
surements with a logistic model for dropout, with predictors given by the
current outcome as well as a set of previous responses. Details of the model
are to be found in Diggle and Kenward (1994). For instance, assuming that
the dropout probability at occasion j depends on both the current outcome
Yij and the previous one Yi,j−1 , leads to the following model:
\[
\ln\!\left[\frac{P(R_{ij}=0\,|\,\boldsymbol{y}_i)}{1-P(R_{ij}=0\,|\,\boldsymbol{y}_i)}\right]
= \psi_0 + \psi_1 y_{ij} + \psi_2 y_{i,j-1}. \qquad\qquad (A.11)
\]
Such an analysis is done in OSWALD through the DROP.PARMS, DROP-
MODEL, and DROP.COV.PARMS arguments of the PCMID function.
\[
\ln\!\left[\frac{P(R_{ij}=0\,|\,\boldsymbol{y}_i)}{1-P(R_{ij}=0\,|\,\boldsymbol{y}_i)}\right]
= \psi_0 + \boldsymbol{x}_i'\boldsymbol{\psi}_c + \psi_1 y_{ij} + \psi_2 y_{i,j-1}, \qquad\qquad (A.12)
\]
Let us discuss OSWALD analyses for dropout model (A.11), in the MCAR,
MAR, and nonrandom contexts. The MCAR analysis output is as follows:
Call:
pcmid(formula = bike.bal ~ group * time,
vparms = c(2, 0.2, 1), drop.parms = c(0),
drop.cov.parms = c(-2), dropmodel = ~ 1,
correxp = 1, maxfn = 1600, reqmin = 1e-012)
Maximised likelihood:
[1] -520.6167
Mean Parameters:
(Intercept) group time group:time
PARAMETER -0.6629601 -0.1606322 0.9230446 0.7221012
STD.ERROR 0.3859817 0.5470223 0.1009714 0.1451370
Variance Parameters:
nu.sq sigma.sq tau.sq phi
2.094412 0.8016692 0.2087333 0.8139474
Dropout parameters:
(Intercept) y.d
-2.32728 0
Call:
pcmid(formula = bike.bal ~ group * time,
vparms = c(2, 0.2, 1), drop.parms = c(0, -0.1),
drop.cov.parms = c(-2), dropmodel = ~ 1,
correxp = 1, maxfn = 3000, reqmin = 1e-012)
Maximised likelihood:
[1] -520.3494
Mean Parameters:
(Intercept) group time group:time
PARAMETER -0.6629684 -0.1606169 0.9230426 0.7221043
STD.ERROR 0.3859815 0.5470220 0.1009713 0.1451369
TABLE A.5. Exercise Bike Data. Ignorable analysis and MCAR, MAR, and MNAR dropout models (Model 10).

                             Dropout modeled
Parameter        Ign.       MCAR       MAR        MNAR
Intercept      −0.6630    −0.6630    −0.6630    −0.6666
Group          −0.1606    −0.1606    −0.1606    −0.1740
Time            0.9230     0.9230     0.9230     0.9377
Group:time      0.7221     0.7221     0.7221     0.7317
ν²              2.0944     2.0944     2.0944     2.1067
σ²              0.8017     0.8017     0.8017     0.8349
τ²              0.2087     0.2087     0.2087     0.1577
φ               0.8139     0.8139     0.8139     0.9311
ψ0                        −2.3273    −2.1661    −2.7316
ψ1                                                0.3280
ψ2                                   −0.0992    −0.3869
Deviance      960.24     1041.23    1040.70    1040.46
Variance Parameters:
nu.sq sigma.sq tau.sq phi
2.09441 0.8016771 0.2087185 0.8139738
Dropout parameters:
(Intercept) y.d y.d-1
-2.166139 0 -0.09920587
Note that the number of iterations has increased somewhat, even though
the extra dropout parameter, ψ2 , appears to be very small. In fact, the
likelihood has increased only marginally over the MCAR analysis.
Call:
pcmid(formula = bike.bal ~ group * time,
vparms = c(2, 0.4, 0.5), drop.parms = c(2, -2),
drop.cov.parms = c(-3), dropmodel = ~ 1,
Maximised likelihood:
[1] -520.2316
Mean Parameters:
(Intercept) group time group:time
PARAMETER -0.6666244 -0.1740098 0.9377247 0.7317099
STD.ERROR NA NA NA NA
Variance Parameters:
nu.sq sigma.sq tau.sq phi
2.106741 0.8348588 0.1577192 0.9310764
Dropout parameters:
(Intercept) y.d y.d-1
-2.731573 0.3279598 -0.3869054
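Written out, and assuming that the y.d and y.d-1 estimates play the roles
of ψ1 and ψ2 in (A.11), the fitted nonrandom dropout model is, after
rounding,
\[
\ln\frac{P(R_{ij}=0\,|\,y_i)}{1-P(R_{ij}=0\,|\,y_i)}
\;\approx\; -2.73 + 0.33\, y_{ij} - 0.39\, y_{i,j-1},
\]
so that, other things being equal, dropout becomes more likely for larger
current outcomes and less likely for larger previous ones.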
Results for the three dropout models are summarized in Table A.5, under
the headings MCAR, MAR, and MNAR, respectively. Also included is the
earlier ignorable analysis. Since MCAR and MAR are ignorable, whether
or not the dropout model parameters are estimated explicitly, the first
three models in Table A.5 yield exactly the same values for the mean and
covariance parameter estimates, as they should. Note that the deviance
from the ignorable model is not comparable to the other deviances, since
the dropout parameters are not estimated. Comparing the parameters in
these models to the nonrandom dropout models shows some shifts, although
they are very modest.
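As a quick check (a small S-PLUS sketch; the numbers are simply copied
from the maximized log-likelihoods reported in the output above), the
deviances listed for the MCAR, MAR, and MNAR fits are −2 times the
corresponding maximized log-likelihoods:

-2 * c(-520.6167, -520.3494, -520.2316)   # MCAR, MAR, MNAR log-likelihoods
[1] 1041.233 1040.699 1040.463

These agree with the entries in Table A.5 up to rounding.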
TABLE A.6. Exercise Bike Data. Non-Random models. Model 10. Treatment
assignment (group) included into the dropout model.
We may now want to compare the dropout models. The likelihood ratio
test statistic to compare MAR with MCAR is 0.53 on 1 degree of freedom
(p = 0.4666). This means that MCAR would be acceptable provided MAR
were the correct alternative hypothesis and the actual parametric form for
the MAR process were correct. In addition, a comparison between the non-
random and random dropout models yields a likelihood ratio test statistic
of 0.24 (p = 0.6242). Of course, for reasons outlined in Chapters 19 and
20 (see also Section 18.1.2), one should use nonrandom dropout models
with caution, since they rely on assumptions that are at best only partially
verifiable. These issues of sensitivity are illustrated in Sections 19.4, 19.5,
and 24.4.
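The likelihood ratio statistics quoted here can be reproduced from the
maximized log-likelihoods in the OSWALD output; a minimal S-PLUS
sketch, with the values copied from the output shown earlier:

loglik.mcar <- -520.6167
loglik.mar  <- -520.3494
loglik.mnar <- -520.2316
2 * (loglik.mar - loglik.mcar)                 # about 0.53 (MAR versus MCAR)
2 * (loglik.mnar - loglik.mar)                 # about 0.24 (MNAR versus MAR)
1 - pchisq(2 * (loglik.mar - loglik.mcar), 1)  # close to the quoted p = 0.4666
1 - pchisq(2 * (loglik.mnar - loglik.mar), 1)  # close to the quoted p = 0.6242

The small discrepancies with the quoted p-values arise from rounding the
test statistics to two decimal places.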
Call:
pcmid(formula = demo2.bal ~ group * time,
vparms = c(2, 0.2, 0.8), drop.parms = c(0, -0.1),
Maximised likelihood:
[1] -519.9814
Mean Parameters:
(Intercept) group time group:time
PARAMETER -0.6629779 -0.1606298 0.9230512 0.7220979
STD.ERROR 0.3859810 0.5470213 0.1009712 0.1451367
Variance Parameters:
nu.sq sigma.sq tau.sq phi
2.094418 0.8016748 0.2087164 0.8139889
Dropout parameters:
(Intercept) group y.d y.d-1
-2.405863 0.5395115 0 -0.1289393
and
Call:
pcmid(formula = demo2.bal ~ group * time,
vparms = c(2, 0.2, 0.8),
drop.parms = c(0.3, -0.3),
drop.cov.parms = c(-2, 0.1),
dropmodel = ~ 1 + group,
correxp = 1, maxfn = 10000, reqmin = 1e-012)
Maximised likelihood:
[1] -519.9143
Mean Parameters:
(Intercept) group time group:time
PARAMETER -0.6213682 -0.228779 0.9032322 0.7251818
STD.ERROR NA NA NA NA
Variance Parameters:
nu.sq sigma.sq tau.sq phi
2.043767 0.7858604 0.2930629 0.6288833
Dropout parameters:
(Intercept) group y.d y.d-1
-2.103619 0.7979308 -0.303576 0.1238007
Again, the main effect and variance parameters in the MAR column are
unchanged relative to their ignorable counterparts in Table A.5. In con-
trast, the parameters of the nonrandom dropout model are all different
from those of the nonrandom model in Table A.5.
Appendix B
Technical Details for Sensitivity
Analysis
This appendix contains more technical material on sensitivity analysis
for selection models (Section B.1) and pattern-mixture models (Section
B.2). It is included for completeness but is not essential for understanding
the main text.
B.1 Local Influence: Derivation of Components of ∆i

For a subject with a complete measurement sequence, the contribution to
the perturbed log-likelihood equals
\[
\ell_{i\omega} \;=\; \ln f(y_i)
\;+\; \sum_{j=2}^{n_i} \ln\bigl[1 - g(h_{ij}, y_{ij})\bigr],
\]
where the parameter dependencies are suppressed for notational ease. The
mixed derivatives are particularly easy to calculate, immediately yielding
expressions (19.2) and (19.3).
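A minimal sketch of that calculation, assuming the perturbed logistic
dropout model logit[g(h_ij, y_ij)] = h_ij ψ + ω_i y_ij (a form consistent with
(B.7) and (B.8) below):
\[
\frac{\partial \ln\bigl[1 - g(h_{ij}, y_{ij})\bigr]}{\partial\psi}
 = -\,h_{ij}\, g(h_{ij}, y_{ij}),
\qquad
\frac{\partial^2 \ln\bigl[1 - g(h_{ij}, y_{ij})\bigr]}{\partial\psi\,\partial\omega_i}
 = -\,h_{ij}\, y_{ij}\, g(h_{ij}, y_{ij})\bigl[1 - g(h_{ij}, y_{ij})\bigr].
\]
Summing the latter over j = 2, ..., n_i gives the ψ component, while the
mixed derivative with respect to θ and ω_i vanishes because ln f(y_i) does
not involve ω_i.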
For a subject who drops out at occasion d, the contribution becomes
\[
\ell_{i\omega} \;=\; \ln f(h_{id})
\;+\; \sum_{j=2}^{d-1} \ln\bigl[1 - g(h_{ij}, y_{ij})\bigr]
\;+\; \ln \int f(y_{id}\,|\,h_{id})\, g(h_{id}, y_{id})\, dy_{id},
\]
of which the first component depends on θ only, the second one on ψ only,
and the third one contains both.
\[
\begin{aligned}
\frac{\partial^2 \ell_{i\omega}}{\partial\psi\,\partial\omega_i}
 ={}& -\sum_{j=2}^{d-1} h_{ij}\, y_{ij}\, g(h_{ij}, y_{ij})\bigl[1 - g(h_{ij}, y_{ij})\bigr] \\
 &+ \frac{\displaystyle\int f(y_{id}\,|\,h_{id})\, g(h_{id}, y_{id})\, dy_{id}
          \int f(y_{id}\,|\,h_{id})\,
          \frac{\partial^2 g(h_{id}, y_{id})}{\partial\psi\,\partial\omega_i}\, dy_{id}}
         {\Bigl[\displaystyle\int f(y_{id}\,|\,h_{id})\, g(h_{id}, y_{id})\, dy_{id}\Bigr]^2} \\
 &- \frac{\displaystyle\int f(y_{id}\,|\,h_{id})\,
          \frac{\partial g(h_{id}, y_{id})}{\partial\omega_i}\, dy_{id}
          \int f(y_{id}\,|\,h_{id})\,
          \frac{\partial g(h_{id}, y_{id})}{\partial\psi}\, dy_{id}}
         {\Bigl[\displaystyle\int f(y_{id}\,|\,h_{id})\, g(h_{id}, y_{id})\, dy_{id}\Bigr]^2}.
\end{aligned}
\qquad (B.2)
\]
\[
\cdots \times \frac{\partial \lambda(y_{id}\,|\,h_{id})}{\partial\theta}
\qquad\qquad (B.6)
\]
\[
\left.\int f(y_{id}\,|\,h_{id})\,
      \frac{\partial g(h_{id}, y_{id})}{\partial\psi}\, dy_{id}\right|_{\omega_i=0}
 = g(h_{id})\bigl[1 - g(h_{id})\bigr]\, h_{id}
\qquad\qquad (B.7)
\]
\[
\left.\int f(y_{id}\,|\,h_{id})\,
      \frac{\partial^2 g(h_{id}, y_{id})}{\partial\psi\,\partial\omega_i}\, dy_{id}\right|_{\omega_i=0}
 = g(h_{id})\bigl[1 - g(h_{id})\bigr]\bigl[1 - 2g(h_{id})\bigr]\, h_{id}\,
   \lambda(y_{id}\,|\,h_{id}).
\qquad (B.8)
\]
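As a brief check of (B.7) and (B.8), under the same assumed logistic form
for the dropout model as above: at ω_i = 0, g(h_id, y_id) reduces to g(h_id)
and no longer depends on y_id, so that
\[
\left.\frac{\partial g(h_{id}, y_{id})}{\partial\psi}\right|_{\omega_i=0}
 = g(h_{id})\bigl[1 - g(h_{id})\bigr] h_{id},
\qquad
\left.\frac{\partial^2 g(h_{id}, y_{id})}{\partial\psi\,\partial\omega_i}\right|_{\omega_i=0}
 = g(h_{id})\bigl[1 - g(h_{id})\bigr]\bigl[1 - 2g(h_{id})\bigr] h_{id}\, y_{id}.
\]
Integrating both expressions against f(y_id | h_id) reproduces (B.7) and
(B.8), consistent with reading λ(y_id | h_id) as the conditional mean of Y_id
given the observed history.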
\[
\begin{aligned}
&= \sum_{i=1}^{t-1} f(y_t\,|\,y_1,\ldots,y_{t-1}, d=j+1)\, f(d=i+1)
   + f(y_t\,|\,y_1,\ldots,y_{t-1}, d=j+1)\, f(d>t) \\
&= f(y_t\,|\,y_1,\ldots,y_{t-1}, d=j+1)
   \left( \sum_{i=1}^{t-1} f(d=i+1) + f(d>t) \right)
\end{aligned}
\]
MAR ⇒ ACMV
Consider the ratio Q of the complete data likelihood to the observed data
likelihood. This gives, under the MAR assumption,
\[
Q \;=\; \frac{f(y_1,\ldots,y_n)\, f(d=i+1\,|\,y_1,\ldots,y_i)}
             {f(y_1,\ldots,y_i)\, f(d=i+1\,|\,y_1,\ldots,y_i)}
  \;=\; f(y_{i+1},\ldots,y_n\,|\,y_1,\ldots,y_i).
\qquad\qquad (B.11)
\]
Further, one can always write,
\[
Q \;=\; f(y_{i+1},\ldots,y_n\,|\,y_1,\ldots,y_i, d=i+1)
   \times \frac{f(y_1,\ldots,y_i\,|\,d=i+1)\, f(d=i+1)}
               {f(y_1,\ldots,y_i\,|\,d=i+1)\, f(d=i+1)}
  \;=\; f(y_{i+1},\ldots,y_n\,|\,y_1,\ldots,y_i, d=i+1).
\qquad (B.12)
\]
Equating expressions (B.11) and (B.12) for Q, we see that
f (yi+1 , . . . , yn |y1 , . . . , yi , d = i + 1) = f (yi+1 , . . . , yn |y1 , . . . , yi ). (B.13)
To show that (B.13) implies the ACMV conditions (B.10), we will use the
induction principle on t. First, consider the case t = 2. Using (B.13) for
i = 1, and integrating over y3 , . . . , yn , we obtain
f (y2 |y1 , d = 2) = f (y2 |y1 ),
leading to, using Lemma B.1,
f (y2 |y1 , d = 2) = f (y2 |y1 , d > 2).
Suppose, by induction, that ACMV holds for all t ≤ i. We now prove the
hypothesis for t = i + 1. Choose j ≤ i. Then, from the induction hypothesis
and Lemma B.1, it follows that for all j < t ≤ i:
f (yt |y1 , . . . , yt−1 , d = j + 1) = f (yt |y1 , . . . , yt−1 , d > t)
= f (yt |y1 , . . . , yt−1 ).
Taking the product over t = j + 1, . . . , i then gives
f (yj+1 , . . . , yi |y1 , . . . , yj , d = j + 1) = f (yj+1 , . . . , yi |y1 , . . . , yj ). (B.14)
Dividing (B.15) by (B.14) and equating the left- and right-hand sides, we
find that
\[
f(y_{i+1}\,|\,y_1,\ldots,y_i, d=j+1) \;=\; f(y_{i+1}\,|\,y_1,\ldots,y_i).
\]
This holds for all j ≤ i, and Lemma B.1 shows that this is equivalent to
ACMV.
ACMV ⇒ MAR
\[
\begin{aligned}
f(y_1,\ldots,y_n, d=i+1)
&= f(y_1,\ldots,y_i, d=i+1)\, f(y_{i+1},\ldots,y_n\,|\,y_1,\ldots,y_i, d=i+1) \\
&= f(y_1,\ldots,y_i, d=i+1) \prod_{t=i+1}^{T} f(y_t\,|\,y_1,\ldots,y_{t-1}, d=i+1).
\end{aligned}
\]
Under ACMV, Lemma B.1 implies that each factor f(y_t | y_1, ..., y_{t−1},
d = i + 1) in this product equals f(y_t | y_1, ..., y_{t−1}), so that
\[
\begin{aligned}
f(y_1,\ldots,y_n, d=i+1)
&= f(y_1,\ldots,y_i\,|\,d=i+1)\, f(d=i+1) \prod_{t=i+1}^{T} f(y_t\,|\,y_1,\ldots,y_{t-1}) \\
&= f(y_1,\ldots,y_i\,|\,d=i+1)\, f(d=i+1)\, f(y_{i+1},\ldots,y_n\,|\,y_1,\ldots,y_i) \\
&= \frac{f(y_1,\ldots,y_i\,|\,d=i+1)\, f(d=i+1)}{f(y_1,\ldots,y_i)}\,
   f(y_1,\ldots,y_i)\, f(y_{i+1},\ldots,y_n\,|\,y_1,\ldots,y_i) \\
&= \frac{f(y_1,\ldots,y_i\,|\,d=i+1)\, f(d=i+1)}{f(y_1,\ldots,y_i)}\,
   f(y_1,\ldots,y_n) \\
&= f(d=i+1\,|\,y_1,\ldots,y_i)\, f(y_1,\ldots,y_n).
\end{aligned}
\qquad (B.17)
\]
Since also f(y_1, ..., y_n, d = i + 1) = f(d = i + 1 | y_1, ..., y_n) f(y_1, ..., y_n),
it follows that
\[
f(d=i+1\,|\,y_1,\ldots,y_n) \;=\; f(d=i+1\,|\,y_1,\ldots,y_i),
\]
which is precisely the MAR condition.

References
Agresti, A. (1990) Categorical Data Analysis. New York: John Wiley &
Sons.
Ashford, J.R. and Sowden, R.R. (1970) Multivariate probit analysis. Bio-
metrics, 26, 535–546.
Baker, S.G. (1992) A simple method for computing the observed infor-
mation matrix when using the EM algorithm with categorical data.
Journal of Computational and Graphical Statistics, 1, 63–76.
Baker, S.G. (1994) Regression analysis of grouped survival data with in-
complete covariates: non-ignorable missing-data and censoring mech-
anisms. Biometrics, 50, 821–826.
Baker, S.G. and Laird, N.M. (1988) Regression analysis for categorical
variables with outcome subject to non-ignorable non-response. Journal
of the American Statistical Association, 83, 62–69.
Beckman, R.J., Nachtsheim, C.J., and Cook, R.D. (1987) Diagnostics for
mixed-model analysis of variance. Technometrics, 29, 413–426.
Boissel, J.P., Collet, J.P., Moleur, P., and Haugh, M. (1992) Surrogate
endpoints: a basis for a rational approach. European Journal of Clin-
ical Pharmacology, 43, 235–244.
Box, G.E.P., Jenkins, G.M., and Reinsel, G.C. (1994) Time Series Analy-
sis: Forecasting and Control (3rd ed.). London: Holden-Day.
Box, G.E.P. and Tiao, G.C. (1992) Bayesian Inference in Statistical
Analysis. Wiley Classics Library edition. New York: John Wiley &
Sons.
Bozdogan, H. (1987) Model selection and Akaike’s Information Criterion
(AIC): The general theory and its analytical extensions. Psychome-
trika, 52, 345–370.
Brant, L.J. and Fozard, J.L. (1990) Age changes in pure-tone hearing
thresholds in a longitudinal study of normal human aging. Journal of
the Acoustic Society of America, 88, 813–820.
Brant, L.J. and Pearson, J.D. (1994) Modeling the variability in lon-
gitudinal patterns of aging. In: Biological Anthropology and Aging:
Perspectives on Human Variation over the Life Span, Ch. 14, D.E.
Crews and R.M.Garruto (Eds.). New York: Oxford University Press,
pp. 373–393.
Brant, L.J., Pearson, J.D., Morrell, C.H., and Verbeke, G. (1992) Statis-
tical methods for studying individual change during aging. Collegium
Antropologicum, 16, 359–369.
Brant, L.J. and Verbeke, G. (1997a) Describing the natural heterogeneity
of aging using multilevel regression models. International Journal of
Sports Medicine, 18, S225–S231.
Brant, L.J. and Verbeke, G. (1997b) Modelling longitudinal studies of
aging. In: Proceedings of the 12th International Workshop on Sta-
tistical Modelling, Schriftenreihe der Osterreichischen Statistischen
Gesellschaft, Vol. 5, C.E. Minder and H. Friedl (Eds.). Biel/Bienne,
Switzerland, pp. 19–30.
Breslow, N.E. and Clayton, D.G. (1993) Approximate inference in gener-
alized linear mixed models. Journal of the American Statistical Asso-
ciation, 88, 9–25.
Brown, N.A. and Fabro, S. (1981) Quantitation of rat embryonic develop-
ment in vitro: a morphological scoring system. Teratology, 24, 65–78.
Buck, S.F. (1960) A method of estimation of missing values in multivari-
ate data suitable for use with an electronic computer. Journal of the
Royal Statistical Society, Series B, 22, 302–306.
Burzykowski, T., Molenberghs, G., Buyse, M., Geys, H., and Renard,
D. (1999) Validation of surrogate endpoints in multiple randomized
clinical trials with failure-time endpoints. Submitted for publication.
Butler, S.M. and Louis, T.A. (1992) Random effects models with non-
parametric priors. Statistics in Medicine, 11, 1981–2000.
Buyse, M. and Molenberghs, G. (1998) The validation of surrogate end-
points in randomized experiments. Biometrics, 54, 1014–1029.
Buyse, M., Molenberghs, G., Burzykowski, T., Renard, D., and Geys,
H. (2000) The validation of surrogate endpoints in meta-analyses of
randomized experiments, Biostatistics, 1, 000–000.
Carlin, B.P. and Louis, T.A. (1996) Bayes and Empirical Bayes Methods
for Data Analysis. London: Chapman & Hall.
Carter, H.B. and Coffey, D.S. (1990) The prostate: an increasing medical
problem. The Prostate, 16, 39–48.
Carter, H.B., Morrell, C.H., Pearson, J.D., Brant, L.J., Plato, C.C., Met-
ter, E.J., Chan, D.W., Fozard, J.L., and Walsh, P.C. (1992a) Estima-
tion of prostate growth using serial prostate-specific antigen measure-
ments in men with and without prostate disease. Cancer Research,
52, 3323–3328.
Carter, H.B., Pearson, J.D., Metter, E.J., Brant, L.J., Chan, D.W., An-
dres, R., Fozard, J.L., and Walsh, P.C. (1992b) Longitudinal eval-
uation of prostate-specific antigen levels in men with and without
prostate disease. Journal of the American Medical Association, 267,
2215–2220.
Catalano, P.J. and Ryan, L.M. (1992) Bivariate latent variable models for
clustered discrete and continuous outcomes. Journal of the American
Statistical Association, 87, 651–658.
Catalano, P.J., Scharfstein, D.O., Ryan, L.M., Kimmel, C.A., and Kim-
mel, G.L. (1993) Statistical model for fetal death, fetal weight, and
malformation in developmental toxicity studies. Teratology, 47, 281–
290.
Chatterjee, S. and Hadi, A.S. (1988) Sensitivity Analysis in Linear Re-
gression. New York: John Wiley & Sons.
Chen, J. (1993) A malformation incidence dose-response model incorpo-
rating fetal weight and/or litter size as covariates. Risk Analysis, 13,
559–564.
Chen, T.T., Simon, R.M., Korn, E.L., Anderson, S.J., Lindblad, A.D.,
Wieand, H.S., Douglass Jr., H.O., Fisher, B., Hamilton, J.M., and
Friedman, M.A. (1998) Investigation of disease-free survival as a sur-
rogate endpoint for survival in cancer clinical trials. Communications
in Statistics, A, 27, 1363–1378.
Chi, E.M. and Reinsel, G.C. (1989) Models for longitudinal data with
random effects and AR(1) errors. Journal of the American Statistical
Association, 84, 452–459.
Choi, S., Lagakos, S., Schooley, R.T., and Volberding, P.A. (1993) CD4+
lymphocytes are an incomplete surrogate marker for clinical progres-
sion in persons with asymptomatic HIV infection taking zidovudine.
Annals of Internal Medicine, 118, 674–680.
Christensen, R., Pearson, L.M., and Johnson, W. (1992) Case-deletion
diagnostics for mixed models. Technometrics, 34, 38–45.
Copas, J.B. and Li, H.G. (1997) Inference from non-random samples (with
discussion). Journal of the Royal Statistical Society, Series B, 59, 55–
96.
Corfu-A Study Group (1995) Phase III randomized study of two fluo-
rouracil combinations with either interferon alfa-2a or leucovorin for
advanced colorectal cancer. Journal of Clinical Oncology, 13, 921–928.
Coursaget, P., Leboulleux, D., Soumare, M., le Cann P., Yvonnet, B.,
Chiron, J.P., and Collseck A.M. (1994) Twelve-year follow-up study
of hepatitis immunization of Senegalese infants. Journal of Hepatology,
21, 250–254.
Cowles, M.K., Carlin, B.P., and Connett, J.E. (1996) Bayesian tobit mod-
eling of longitudinal ordinal clinical trial compliance data with nonig-
norable missingness. Journal of the American Statistical Association,
91, 86–98.
Cox, D.R. (1972) The analysis of multivariate binary data. Applied Sta-
tistics, 21, 113–120.
Cox, D.R. and Hinkley, D.V. (1974) Theoretical Statistics. London: Chap-
man & Hall.
Cox, D.R. and Hinkley, D.V. (1990) Theoretical Statistics. London: Chap-
man & Hall.
Cox, D.R. and Wermuth, N. (1992) Response models for mixed binary
and quantitative variables. Biometrika, 79, 441–461.
Crépeau, H., Koziol, J., Reid, N., and Yuh, Y.S. (1985) Analysis of in-
complete multivariate data from repeated measurements experiments.
Biometrics, 41, 505–514.
Cressie, N.A.C. (1991) Statistics for Spatial Data. New York: John Wiley
& Sons.
Cullis, B.R. (1994) Discussion to Diggle, P.J. and Kenward, M.G.: Infor-
mative dropout in longitudinal data analysis. Applied Statistics, 43,
79–80.
Curran, D., Pignatti, F., and Molenberghs, G. (1997) Milk protein trial:
informative dropout versus random drop-in. Submitted for publication.
Dale, J.R. (1986) Global cross-ratio models for bivariate, discrete, ordered
responses. Biometrics, 42, 909–917.
Daniels, M.J. and Hughes, M.D. (1997) Meta-analysis for the evaluation
of potential surrogate markers. Statistics in Medicine, 16, 1515–1527.
DeGruttola, V., Fleming, T.R., Lin, D.Y., and Coombs, R. (1997) Vali-
dating surrogate markers - are we being naive? Journal of Infectious
Diseases, 175, 237–246.
DeGruttola, V., Lange, N., and Dafni, U. (1991) Modeling the progression
of HIV infection. Journal of the American Statistical Association, 86,
569–577.
DeGruttola, V., Ware, J.H., and Louis, T.A. (1987) Influence analysis of
generalized least squares estimators. Journal of the American Statis-
tical Association, 82, 911–917.
DeGruttola, V., Wulfsohn, M., Fischl, M.A., and Tsiatis, A. (1993) Mod-
elling the relationship between survival and CD4 lymphocytes in pa-
tients with AIDS and AIDS-related complex. Journal of Acquired Im-
mune Deficiency Syndrome, 6, 359–365.
Dempster, A.P. and Rubin, D.B. (1983) Overview. In: Incomplete Data
in Sample Surveys, Vol. II: Theory and Annotated Bibliography, W.G.
Madow, I. Olkin, and D.B. Rubin (Eds.). New York: Academic Press,
pp. 3–10.
Dempster, A.P., Rubin, D.B., and Tsutakawa, R.K. (1981) Estimation
in covariance components models. Journal of the American Statistical
Association, 76, 341–353.
De Ponti, F., Lecchini, S., Cosentino, M., Castelletti, C.M., Malesci, A.,
and Frigo, G.M. (1993) Immunological adverse effects of anticonvul-
sants. What is their clinical relevance? Drug Safety, 8, 235–250.
Diggle, P.J., Liang, K.-Y., and Zeger, S.L. (1994) Analysis of Longitudinal
Data. Oxford Science Publications. Oxford: Clarendon Press.
Fleming, T.R., Prentice, R.L., Pepe, M.S., and Glidden, D. (1994) Sur-
rogate and auxiliary endpoints in clinical trials, with potential ap-
plications in cancer and AIDS research. Statistics in Medicine, 13,
955–968.
Gelman, A., Carlin, J.B., Stern, H.S., and Rubin, D.B. (1995) Bayesian
Data Analysis, Texts in Statistical Science. London: Chapman & Hall.
Geys, H., Molenberghs, G., and Ryan, L.M. (1997) Pseudo-likelihood in-
ference for clustered binary data. Communications in Statistics: The-
ory and Methods, 26, 2743–2767.
Geys, H., Regan, M., Catalano, P., and Molenberghs, G. (1999) Two
latent variable risk assessment approaches for combined continuous
and discrete outcomes from developmental toxicity data. Submitted
for publication.
Ghosh, J.K. and Sen, P.K. (1985) On the asymptotic performance of the
log likelihood ratio statistic for the mixture model and related results.
In: Proceedings of the Berkeley Conference in Honor of Jerzy Ney-
man and Jack Kiefer, Vol. 2, L.M. Le Cam and R.A. Olshen (Eds.).
Monterey: Wadsworth, Inc., pp. 789–806.
Gilks, W.R., Wang, C.C., Yvonnet, B., and Coursaget, P. (1993) Random-
effects models for longitudinal data using Gibbs sampling. Biometrics,
49, 441–453.
Glonek, G.F.V. and McCullagh, P. (1995) Multivariate logistic models.
Journal of the Royal Statistical Society, Series B, 81, 477–482.
Glynn, R.J., Laird, N.M., and Rubin, D.B. (1986) Selection modelling
versus mixture modelling with non-ignorable nonresponse. In: Drawing
Inferences from Self Selected Samples, H. Wainer (Ed.). New York:
Springer-Verlag, pp. 115–142.
Goldstein, H. (1979) The Design and Analysis of Longitudinal Studies.
London: Academic Press.
Goldstein, H. (1995) Multilevel Statistical Models. Kendall's Library of
Statistics 3. London: Arnold.
Golub, G.H. and Van Loan, C.F. (1989) Matrix Computations. (2nd ed.).
Baltimore: The Johns Hopkins University Press.
Goss, P.E., Winer, E.P., Tannock, I.F., and Schwartz, L.H. (1999) Breast
cancer: randomized phase III trial comparing the new potent and se-
lective third-generation aromatase inhibitor vorozole with megestrol
acetate in postmenopausal advanced breast cancer patients. Journal
of Clinical Oncology, 17, 52–63.
Gould, A.L. (1980) A new approach to the analysis of clinical drug trials
with withdrawals. Biometrics, 36, 721–727.
Greco, F.A., Figlin, R., York, M., Einhorn, L., Schilsky, R., Marshall,
E.M., Buys, S.S., Froimtchuk, M.J., Schuller, J., Buyse, M., Ritter,
L., Man, A., and Yap, A.K.L. (1996) Phase III randomized study
to compare interferon alfa-2a in combination with fluorouracil versus
fluorouracil alone in patients with advanced colorectal cancer. Journal
of Clinical Oncology, 14, 2674–2681.
Green, S., Benedetti, J., and Crowley, J. (1997) Clinical Trials in Oncol-
ogy. London: Chapman & Hall.
Greenlees, W.S., Reece, J.S., and Zieschang, K.D. (1982) Imputation of
missing values when the probability of response depends on the vari-
able being imputed. Journal of the American Statistical Association,
77, 251–261.
Heitjan, D.F. (1993) Estimation with missing data. Letter to the Editor.
Biometrics, 49, 580.
Henderson, C.R., Kempthorne, O., Searle, S.R., and von Krosigk, C.N.
(1959) Estimation of environmental and genetic trends from records
subject to culling. Biometrics, 15, 192–218.
Laird, N.M. and Ware, J.H. (1982) Random effects models for longitudinal
data. Biometrics, 38, 963–974.
Lesaffre, E., Asefa, M., and Verbeke, G. (1999) Assessing the goodness-
of-fit of the Laird and Ware model: an example: the Jimma Infant
Survival Differential Longitudinal Study. Statistics in Medicine, 18,
835–854.
Li, K.H., Raghunathan, T.E., and Rubin, D.B. (1991) Large-sample sig-
nificance levels from multiply imputed data using moment-based sta-
tistics and an F reference distribution. Journal of the American Sta-
tistical Association, 86, 1065–1073.
Liang, K.-Y. and Zeger, S.L. (1986) Longitudinal data analysis using
generalized linear models. Biometrika, 73, 13–22.
Liang, K.-Y. and Zeger, S.L. (1989) A class of logistic regression models
for multivariate binary time series. Journal of the American Statistical
Association, 84, 447–451.
Lin, D.Y., Fischl, M.A., and Schoenfeld, D.A. (1993) Evaluating the role
of CD4-lymphocyte change as a surrogate endpoint in HIV clinical
trials. Statistics in Medicine, 12, 835–842.
Lin, D.Y., Fleming T.R., and De Gruttola, V. (1997) Estimating the pro-
portion of treatment effect explained by a surrogate marker. Statistics
in Medicine, 16, 1515–1527.
Lin, X., Raz, J., and Harlow, S. (1997) Linear mixed models with hetero-
geneous within-cluster variances. Biometrics, 53, 910–923.
Lindley, D.V. and Smith, A.F.M. (1972) Bayes estimates for the linear
model. Journal of the Royal Statistical Society, Series B, 34, 1–41.
Lindstrom, M.J. and Bates, D.M. (1988) Newton-Raphson and EM al-
gorithms for linear mixed-effects models for repeated-measures data.
Journal of the American Statistical Association, 83, 1014–1022.
Littell, R.C., Milliken, G.A., Stroup, W.W., and Wolfinger, R.D. (1996)
SAS System for Mixed Models. Cary, NC: SAS Institute Inc.
Little, R.J.A. (1986) A note about models for selectivity bias. Econo-
metrica, 53, 1469–1474.
Little, R.J.A. and Rubin, D.B. (1987) Statistical Analysis with Missing
Data. New York: John Wiley & Sons.
Little, R.J.A. and Wang, Y. (1996) Pattern-mixture models for multivari-
ate incomplete data with covariates. Biometrics, 52, 98–111.
Little, R.J.A. and Yau, L. (1996) Intent-to-treat analysis for longitudinal
studies with drop-outs. Biometrics, 52, 1324–1333.
Liu, C. and Rubin, D.B. (1994) The ECME algorithm: a simple extension
of EM and ECM with faster monotone convergence. Biometrika, 81,
633–648.
Liu, C. and Rubin, D.B. (1995) MI estimation of the t distribution using
EM and its extensions, ECM and ECME. Statistica Sinica, 5, 19–39.
Liu, C., Rubin, D.B., and Wu, Y.N. (1998) Parameter expansion to ac-
celerate EM: the PX-EM algorithm. Biometrika, 85, 755–770.
Liu, G. and Liang, K.-Y. (1997) Sample size calculations for studies with
correlated observations. Biometrics, 53, 937–947.
Longford, N.T. (1993) Random Coefficient Models. Oxford: Oxford Uni-
versity Press.
Louis, T.A. (1982) Finding the observed information matrix when using
the EM algorithm. Journal of the Royal Statistical Society, Series B,
44, 226–233.
Louis, T.A. (1984) Estimating a population of parameter values using
Bayes and empirical Bayes methods. Journal of the American Statis-
tical Association, 79, 393–398.
Magder, L.S. and Zeger, S.L. (1996) A smooth nonparametric estimate
of a mixing distribution using mixtures of Gaussians. Journal of the
American Statistical Association, 91, 1141–1152.
Mansour, H., Nordheim, E.V., and Rutledge, J.J. (1985) Maximum likeli-
hood estimation of variance components in repeated measures designs
assuming autoregressive errors. Biometrics, 41, 287–294.
McArdle, J.J. and Hamagami, F. (1992) Modeling incomplete longitu-
dinal and cross-sectional data using latent growth structural models.
Experimental Aging Research, 18, 145–166.
McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models. Lon-
don: Chapman & Hall.
McLachlan, G.J. and Basford, K.E. (1988) Mixture models. Inference and
Applications to Clustering. New York: Marcel Dekker.
McLachlan, G.J. and Krishnan, T. (1997) The EM Algorithm and Exten-
sions. New York: John Wiley & Sons.
McLean, R.A., Sanders, W.L., and Stroup, W.W. (1991) A unified ap-
proach to mixed linear models. The American Statistician, 45, 54–64.
Meilijson, I. (1989) A fast improvement to the EM algorithm on its own
terms. Journal of the Royal Statistical Society, Series B, 51, 127–138.
Mellors, J.W., Munoz, A., Giorgi, J.V., Margolich, J.B., Tassoni, C.J.,
Gupta, P., Kingsley, L.A., Todd, J.A., Saah, A.J., Phair, J.P., and
Rinaldo, C.R. (1997) Plasma viral load and CD4+ lymphocytes as
prognostic markers of HIV-1 infection. Annals of Internal Medicine,
126, 946–954.
Meng, X.-L. (1997) The EM algorithm and medical studies: a historical
link. Statistical Methods in Medical Research, 6, 3–23.
Meng, X.-L. and Rubin, D.B. (1991) Using EM to obtain asymptotic vari-
ance covariance matrices: the SEM algorithm. Journal of the American
Statistical Association, 86, 899–909.
Meng, X.-L. and Rubin, D.B. (1993) Maximum likelihood estimation via
the ECM algorithm: a general framework. Biometrika, 80, 267–278.
Meng, X.-L. and van Dyk, D. (1997) The EM algorithm–an old folk-song
sung to a fast new tune. Journal of the Royal Statistical Society, Series
B, 3, 511–567.
Meng, X.-L. and van Dyk, D. (1998) Fast EM-type implementation for
mixed effects models. Journal of the Royal Statistical Society, Series
B, 3, 559–578.
Mentré, F., Mallet, A., and Baccar, D. (1997) Optimal design in random-
effects regression models. Biometrika, 84, 429–442.
Michiels, B., Molenberghs, G., Bijnens, L., and Vangeneugden, T. (1999)
Selection models and pattern-mixture models to analyze longitudinal
quality of life data subject to dropout. Submitted for publication.
Michiels, B., Molenberghs, G., and Lipsitz, S.R. (1999). Selection mod-
els and pattern-mixture models for incomplete categorical data with
covariates. Biometrics, 55, 978–983.
Miller, J.J. (1977) Asymptotic properties of maximum likelihood esti-
mates in the mixed model of the analysis of variance. The Annals of
Statistics, 5, 746–762.
Molenberghs, G., Buyse, M., Geys, H., Renard, D., and Burzykowski, T.
(1999) Statistical challenges in the evaluation of surrogate endpoints
in randomized trials. Submitted for publication.
Molenberghs, G., Geys, H., and Buyse, M. (1998) Validation of surro-
gate endpoints in randomized experiments with mixed discrete and
continuous outcomes. Submitted for publication.
Molenberghs, G., Goetghebeur, E.J.T., Lipsitz, S.R., Kenward, M.G.
(1999) Non-random missingness in categorical data: strengths and lim-
itations. The American Statistician, 53, 110–118.
Nelder, J.A. and Mead, R. (1965) A simplex method for function min-
imisation. The Computer Journal, 7, 303–313.
Neter, J., Wasserman, W., and Kutner, M.H. (1990) Applied Linear Sta-
tistical Models. Regression, Analysis of Variance and Experimental
Designs (3rd ed.). Homewood, IL: Richard D. Irwin, Inc.
O’Brien, W.A., Hartigan, P.M., Martin, D., Eisnhart, J., Hill, A., Benoit,
S., Rubin, M., Simberkoff, M.S., and Hamilton, J.D. (1996) Changes
in plasma HIV-1 RNA and CD4+ lymphocyte counts and the risk of
progression to AIDS. New England Journal of Medicine, 334, 426–431.
Park, T. and Brown, M.B. (1994) Models for categorical data with nonig-
norable nonresponse. Journal of the American Statistical Association,
89, 44–52.
Park, T. and Lee, S.-L. (1999) Simple pattern-mixture models for longitu-
dinal data with missing observations: analysis of urinary incontinence
data. Statistics in Medicine, 18, 2933–2941.
Patel, H.I. (1991) Analysis of incomplete data from clinical trials with
repeated measurements. Biometrika, 78, 609–619.
Pearson, E.S., D'Agostino, R.B., and Bowman, K.O. (1977) Tests for de-
parture from normality: comparison of powers. Biometrika, 64, 231–
246.
Pearson, J.D., Kaminski, P., Metter, E.J., Fozard, J.L., Brant, L.J., Mor-
rell, C.H., and Carter, H.B. (1991) Modeling longitudinal rates of
change in prostate specific antigen during aging. Proceedings of the
Social Statistics Section of the American Statistical Association, Wash-
ington, DC, pp. 580–585.
Pearson, J.D., Morrell, C.H., Gordon-Salant, S., Brant, L.J., Metter, E.J.,
Klein, L.L., and Fozard J.L. (1995) Gender differences in a longitudinal
study of age-associated hearing loss. Journal of the Acoustical Society
of America, 97, 1196–1205.
Pearson, J.D., Morrell, C.H., Landis, P.K., Carter, H.B., and Brant, L.J.
(1994) Mixed-effects regression models for studying the natural history
of prostate disease. Statistics in Medicine, 13, 587–601.
Pendergast, J.F., Gange, S.J., Newton, M.A., Lindstrom, M.J., Palta, M.,
and Fisher, M.R. (1996) A survey of methods for analyzing clustered
binary response data. International Statistical Review, 64, 89–118.
Prasad, N.G.N. and Rao, J.N.K. (1990) The estimation of mean squared
error of small-area estimators. Journal of the American Statistical As-
sociation, 85, 163–171.
Rao, C.R. (1973) Linear Statistical Inference and Its Applications (2nd
ed.). New York: John Wiley & Sons.
Regan, M.M. and Catalano, P.J. (1999a) Likelihood models for clustered
binary and continuous outcomes: Application to developmental toxi-
cology. Biometrics, 55, 760–768.
Regan, M.M. and Catalano, P.J. (1999b) Bivariate dose-response model-
ing and risk estimation in developmental toxicology. Journal of Agri-
cultural, Biological and Environmental Statistics, 4, 217–237.
Ripley, B.D. (1981) Spatial Statistics. New York: John Wiley & Sons.
Rubin, D.B. (1976) Inference and missing data. Biometrika, 63, 581–592.
Rubin, D.B. (1994) Discussion to Diggle, P.J. and Kenward, M.G.: Infor-
mative dropout in longitudinal data analysis. Applied Statistics, 43,
80–82.
Rubin, D.B. (1996) Multiple imputation after 18+ years. Journal of the
American Statistical Association, 91, 473–489.
Rubin, D.B., Stern H.S., and Vehovar V. (1995) Handling “don’t know”
survey responses: the case of the Slovenian plebiscite. Journal of the
American Statistical Association, 90, 822–828.
Ryan, L.M. (1992b) The use of generalized estimating equations for risk
assessment in developmental toxicity. Risk Analysis, 12, 439–447.
Ryan, L.M. and Molenberghs, G. (1999) Statistical methods for develop-
mental toxicity: analysis of clustered multivariate binary data. Annals
of the New York Academy of Sciences, 00, 000–000.
SAS Institute Inc. (1991) SAS System for Linear Models (3rd ed.). Cary,
NC: SAS Institute Inc.
SAS Institute Inc. (1992) SAS Technical Report P-229, SAS/STAT Soft-
ware: Changes and Enhancements, Release 6.07. Cary, NC: SAS In-
stitute Inc.
Self, S.G. and Liang, K.Y. (1987) Asymptotic properties of maximum like-
lihood estimators and likelihood ratio tests under nonstandard condi-
tions. Journal of the American Statistical Association, 82, 605–610.
Shapiro, S.S. and Wilk, M.B. (1965) An analysis of variance test for
normality (complete samples). Biometrika, 52, 591–611.
Shapiro, S.S. and Wilk, M.B. (1968) Approximations for the null distri-
bution of the W statistic. Technometrics, 10, 861–866.
Shih, W.J. and Quan, H. (1997) Testing for treatment differences with
dropouts in clinical trials—a composite approach. Statistics in Medi-
cine, 16, 1225–1239.
Shock, N.W., Greulich, R.C., Andres, R., Arenberg, D., Costa, P.T.,
Lakatta, E.G., and Tobin, J.D. (1984) Normal Human Aging: The
Baltimore Longitudinal Study of Aging. National Institutes of Health
Publication 84–2450. Washington, DC: National Institutes of Health.
Smith, D.M., Robertson, B., and Diggle, P.J. (1996) Object-oriented Soft-
ware for the Analysis of Longitudinal Data in S. Technical Report MA
96/192. Department of Mathematics and Statistics, University of Lan-
caster, LA1 4YF, United Kingdom.
Stasny, E.A. (1986) Estimating gross flows using panel data with nonre-
sponse: an example from the Canadian Labour Force Survey. Journal
of the American Statistical Association, 81, 42–47.
Stiratelli, R., Laird, N., and Ware, J. (1984) Random effects models for
serial observations with dichotomous response. Biometrics, 40, 961–
972.
Stram, D.O. and Lee, J.W. (1994) Variance components testing in the
longitudinal mixed effects model. Biometrics, 50, 1171–1177.
Stram, D.A. and Lee, J.W. (1995) Correction to: Variance components
testing in the longitudinal mixed effects model. Biometrics, 51, 1196.
Strenio, J.F., Weisberg, H.J., and Bryk, A.S. (1983) Empirical Bayes es-
timation of individual growth-curve parameters and their relationship
to covariates. Biometrics, 39, 71–86.
Tanner, M.A. and Wong, W.H. (1987) The calculation of posterior dis-
tributions by data augmentation. Journal of the American Statistical
Association, 82, 528–550.
Thijs, H., Molenberghs, G., and Verbeke, G. (2000) The milk protein
trial: influence analysis of the dropout process. Biometrical Journal,
00, 000–000.
Troxel, A.B., Harrington, D.P., and Lipsitz, S.R. (1998) Analysis of longi-
tudinal data with non-ignorable non-monotone missing values. Applied
Statistics, 47, 425–438.
Van Damme, P., Vranckx, R., Safary, A., Andre F.E., and Meheus, A.
(1989) Protective efficacy of a recombinant desoxyribonucleic acid he-
patitis B vaccine in institutionalized mentally handicapped clients.
American Journal of Medicine, 87, 26S–29S.
van Dyk, D., Meng, X.-L., and Rubin, D.B. (1995) Maximum likelihood
estimation via the ECM algorithm: computing the asymptotic vari-
ance. Statistica Sinica, 5, 55–75.
Vellinga, A., Van Damme, P., and Meheus, A. (1999a) Hepatitis B and
C in institutions for individuals with mental retardation: a review.
Journal of Intellectual Disability Research, 00, 000–000.
Vellinga, A., Van Damme, P., Weyler, J.J., Vranckx, R., Meheus, A.
(1999b) Hepatitis B vaccination in mentally retarded: effectiveness
after 11 years. Vaccine, 17, 602–606.
Vellinga, A., Van Damme, P., Bruckers, L., Weyler, J.J., Molenberghs, G.,
and Meheus, A. (1999c) Modelling long term persistence of hepatitis B
antibodies after vaccination. Journal of Medical Virology, 57, 100–103.
Verbeke, G., Lesaffre, E., and Brant L.J. (1998) The detection of residual
serial correlation in linear mixed models. Statistics in Medicine, 17,
1391–1402.
Verbeke, G., Lesaffre, E., and Spiessens, B. (2000) The practical use of
different strategies to handle dropout in longitudinal studies. Submit-
ted for publication.
Verdonck, A., De Ridder, L., Verbeke, G., Bourguignon, J.P., Carels, C.,
Kuhn, E.R., Darras, V., and de Zegher, F. (1998) Comparative effects
of neonatal and prepubertal castration on craniofacial growth in rats.
Archives of Oral Biology, 43, 861–871.
Wang-Clow, F., Lange, N., Laird, N.M., and Ware, J.H. (1995) A sim-
ulation study of estimators for rate of change in longitudinal studies
with attrition. Statistics in Medicine, 14, 283–297.
Waternaux, C., Laird, N.M., and Ware, J.H. (1989) Methods for analysis
of longitudinal data: bloodlead concentrations and cognitive develop-
ment. Journal of the American Statistical Association, 84, 33–41.
Weihrauch, T.R. and Demol, P. (1998) The value of surrogate endpoints
for evaluation of therapeutic efficacy. Drug Information Journal, 32,
737–743.
Weiner, D.L. (1981) Design and analysis of bioavailability studies. In:
Statistics in the pharmaceutical industry, C.R. Buncher and J.-Y. Tsay
(Eds.). New York: Marcel Dekker, pp. 205–229.
Welham, S.J. and Thompson, R. (1997) Likelihood ratio tests for fixed
model terms using residual maximum likelihood. Journal of the Royal
Statistical Society, Series B, 59, 701–714.
White, H. (1980) Nonlinear regression on cross-section data. Economet-
rica, 48, 721–746.
White, H. (1982) Maximum likelihood estimation of misspecified models.
Econometrica, 50, 1–25.
Williams, P.L., Molenberghs, G., and Lipsitz, S.R. (1996) Analysis of
multiple ordinal outcomes in developmental toxicity studies. Journal
of Agricultural, Biological, and Environmental Statistics, 1, 250–274.
Wolfinger, R. and O’Connell, M. (1993) Generalized linear mixed models:
a pseudo-likelihood approach. Journal of Statistical Computation and
Simulation, 48, 233–243.
Wu, M.C. and Bailey, K.R. (1988) Analysing changes in the presence of
informative right censoring caused by death and withdrawal. Statistics
in Medicine, 7, 337–346.
Wu, M.C. and Bailey, K.R. (1989) Estimation and comparison of changes
in the presence of informative right censoring: conditional linear
model. Biometrics, 45, 939–955.
Wu, M.C. and Carroll, R.J. (1988) Estimation and comparison of changes
in the presence of informative right censoring by modeling the censor-
ing process. Biometrics, 44, 175–188.
Yates, F. (1933) The analysis of replicated experiments when the field
results are incomplete. Empire Journal of Experimental Agriculture,
1, 129–142.
Zeger, S.L., Liang, K.-Y., and Albert, P.S. (1988) Models for longitudi-
nal data: a generalized estimating equation approach. Biometrics, 44,
1049–1060.
Index
sensitivity analysis, 352–373
selection model, 270–273
semi-variogram, 144
variance function, 33