
Springer Series in Statistics

Advisors:
P. Bickel, P. Diggle, S. Fienberg, K. Krickeberg,
I. Olkin, N. Wermuth, S. Zeger

Springer
New York
Berlin
Heidelberg
Barcelona
Hong Kong
London
Milan
Paris
Singapore
Tokyo
Springer Series in Statistics
Andersen/Borgan/Gill/Keiding: Statistical Models Based on Counting Processes.
Berger: Statistical Decision Theory and Bayesian Analysis, 2nd edition.
Bolfarine/Zacks: Prediction Theory for Finite Populations.
Borg/Groenen: Modern Multidimensional Scaling: Theory and Applications.
Brockwell/Davis: Time Series: Theory and Methods, 2nd edition.
Chen/Shao/Ibrahim: Monte Carlo Methods in Bayesian Computation.
Efromovich: Nonparametric Curve Estimation: Methods, Theory, and Applications.
Fahrmeir/Tutz: Multivariate Statistical Modelling Based on Generalized Linear
Models.
Farebrother: Fitting Linear Relationships: A History of the Calculus of Observations
1750-1900.
Federer: Statistical Design and Analysis for Intercropping Experiments, Volume I:
Two Crops.
Federer: Statistical Design and Analysis for Intercropping Experiments, Volume II:
Three or More Crops.
Fienberg/Hoaglin/Kruskal/Tanur (Eds.): A Statistical Model: Frederick Mosteller's
Contributions to Statistics, Science and Public Policy.
Fisher/Sen: The Collected Works of Wassily Hoeffding.
Good: Permutation Tests: A Practical Guide to Resampling Methods for Testing
Hypotheses, 2nd edition.
Gourieroux: ARCH Models and Financial Applications.
Grandell: Aspects of Risk Theory.
Haberman: Advanced Statistics, Volume I: Description of Populations.
Hall: The Bootstrap and Edgeworth Expansion.
Härdle: Smoothing Techniques: With Implementation in S.
Hart: Nonparametric Smoothing and Lack-of-Fit Tests.
Hartigan: Bayes Theory.
Hedayat/Sloane/Stufken: Orthogonal Arrays: Theory and Applications.
Heyde: Quasi-Likelihood and its Application: A General Approach to Optimal
Parameter Estimation.
Huet/Bouvier/Gruet/Jolivet: Statistical Tools for Nonlinear Regression: A Practical
Guide with S-PLUS Examples.
Kolen/Brennan: Test Equating: Methods and Practices.
Kotz/Johnson (Eds.): Breakthroughs in Statistics Volume I.
Kotz/Johnson (Eds.): Breakthroughs in Statistics Volume II.
Kotz/Johnson (Eds.): Breakthroughs in Statistics Volume III.
Küchler/Sørensen: Exponential Families of Stochastic Processes.
Le Cam: Asymptotic Methods in Statistical Decision Theory.
Le Cam/Yang: Asymptotics in Statistics: Some Basic Concepts.
Longford: Models for Uncertainty in Educational Testing.
Miller, Jr.: Simultaneous Statistical Inference, 2nd edition.
Mosteller/Wallace: Applied Bayesian and Classical Inference: The Case of the
Federalist Papers.
Parzen/Tanabe/Kitagawa: Selected Papers of Hirotugu Akaike.
Politis/Romano/Wolf: Subsampling.

(continued after index)


Geert Verbeke
Geert Molenberghs

Linear Mixed Models for Longitudinal Data

With 128 Illustrations

Springer

Geert Verbeke
Biostatistical Centre
Katholieke Universiteit Leuven
Kapucijnenvoer 35
B-3000 Leuven
Belgium

Geert Molenberghs
Biostatistics, Center for Statistics
Limburgs Universitair Centrum
Universitaire Campus, Building D
B-3590 Diepenbeek
Belgium

Library of Congress Cataloging-in-Publication Data


Verbeke, Geert.
Linear mixed models for longitudinal data / Geert Verbeke, Geert Molenberghs.
p. cm. — (Springer series in statistics)
Includes bibliographical references and index.
ISBN 0-387-95027-3 (alk. paper)
1. Linear models (Statistics) 2. Longitudinal method. I. Molenberghs, Geert. II. Title.
III. Series.
QA279 .V458 2000
519.5'3—dc21 00-026596

© 2000 Springer-Verlag New York, Inc.


All rights reserved. This work may not be translated or copied in whole or in part without the
written permission of the publisher (Springer-Verlag New York, Inc., 175 Fifth Avenue, New York,
NY 10010, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use
in connection with any form of information storage and retrieval, electronic adaptation, computer
software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use of general descriptive names, trade names, trademarks, etc., in this publication, even if the
former are not especially identified, is not to be taken as a sign that such names, as understood by
the Trade Marks and Merchandise Marks Act, may accordingly be used freely by anyone.

ISBN 0-387-95027-3 Springer-Verlag New York Berlin Heidelberg SPIN 10761593


To Godewina, Lien, Noor, and Aart

To Conny, An, and Jasper


To the memory of my father
Preface

The dissemination of the MIXED procedure in SAS and related software
has made a whole class of linear mixed-effects models, some with a long
history, available for routine use. Experience shows that neither the ideas
behind the techniques nor their software implementation is entirely
straightforward, and users from various applied backgrounds often encounter
difficulties in using the methodology effectively. Courses and consultancy
in this domain have been in great demand over the last decade, illustrating
the clear need for resource material to aid the user.

As an outgrowth of such courses, Verbeke and Molenberghs (1997) was
intended as a contribution to bridging this gap. Since its appearance, it has
been the basis for several short and regular courses in academia and in-
dustry. In the meantime, many research papers on these and related topics
have appeared in the statistical literature. Therefore, it is considered timely
to present a second, entirely recast version. Material kept from Verbeke and
Molenberghs (1997) has been reworked, and a large range of new topics has
been added. The structure of the book reflects not only our own research
activity but also our experience in teaching various applied longitudinal
modeling courses, such as the Longitudinal Data Analysis course in the
Master of Science in Biostatistics Programme of the Limburgs Universitair
Centrum, the Repeated Measures course in the International Study Pro-
gramme in Statistics of the Katholieke Universiteit Leuven, and the Topics
in Biostatistics course at the Universiteit Antwerpen.

As with the first version, we hope this book will be of value to a wide
audience, including applied statisticians and biomedical researchers, par-
ticularly in the pharmaceutical industry, medical and public health research
organizations, contract research organizations, and academic departments.
This implies that the majority of the chapters are explanatory rather than
research oriented and that it emphasizes practice rather than mathemat-
ical rigor. In this respect, guidance and advice on practical issues are the
main focus of the text. On the other hand, some more advanced topics
are included as well, which we believe to be of use to the more demanding
modeler.

In the first version, we had placed strong emphasis on the SAS procedure
MIXED, without discouraging the non-SAS users. Considerable effort was
put into treating data analysis issues in a generic fashion, instead of mak-
ing them fully software dependent. Therefore, a research question was first
translated into a statistical model by means of algebraic notation. In a
number of cases, such a model was then implemented using SAS code.
This was positively received by many readers, and we have therefore for the
most part kept this format. In this version, most of the SAS-related issues are
centralized in a single chapter, and we still keep selected examples through-
out the text. Additionally, an Appendix is devoted to other software tools
(MLwiN, SPlus).

Because SAS Version 7 had not yet been generally marketed, SAS Version
6.12 was used throughout this book. The Appendix briefly lists the most
important changes in Version 7. Selected macros for tools discussed in
the text, not otherwise available in commercial software packages, as well
as publicly available data sets, can be found at Springer-Verlag’s URL:
www.springer-ny.com.

Geert Verbeke (Katholieke Universiteit Leuven, Leuven)

Geert Molenberghs (Limburgs Universitair Centrum, Diepenbeek)


Acknowledgments

This book has been completed with considerable help from several people.
We would like to gratefully acknowledge their support.

A large part of this book is based on joint research. We are grateful to sev-
eral co-authors: Larry Brant (Gerontology Research Center and The Johns
Hopkins University, Baltimore), Luc Bijnens (Janssen Research Founda-
tion, Beerse), Tomasz Burzykowski (Limburgs Universitair Centrum), Marc
Buyse (International Institute for Drug Development, Brussels), Desmond
Curran (European Organization for Research and Treatment of Cancer,
Brussels), Helena Geys (Limburgs Universitair Centrum), Mike Kenward
(London School of Hygiene and Tropical Medicine), Emmanuel Lesaffre
(Katholieke Universiteit Leuven), Stuart Lipsitz (Medical University of
South Carolina, Charleston), Bart Michiels (Janssen Research Foundation,
Beerse), Didier Renard (Limburgs Universitair Centrum), Ziv Shkedy (Lim-
burgs Universitair Centrum), Bart Spiessens (Katholieke Universiteit Leu-
ven), Herbert Thijs (Limburgs Universitair Centrum), Tony Vangeneug-
den (Janssen Research Foundation, Beerse), and Paige Williams (Harvard
School of Public Health, Boston).

Russell Wolfinger (SAS Institute, Cary, NC) has been kind enough to pro-
vide us with a trial version of SAS Version 7.0 during the development of
this text. Bart Spiessens (Katholieke Universiteit Leuven) kindly provided
us with technical support. Steffen Fieuws (Katholieke Universiteit Leuven)
commented on earlier versions of the text.

We gratefully acknowledge support from Research Project Fonds voor We-
tenschappelijk Onderzoek Vlaanderen G.0002.98: “Sensitivity Analysis for
Incomplete Data,” NATO Collaborative Research Grant CRG950648: “Sta-
tistical Research for Environmental Risk Assessment,” and from Onder-
zoeksfonds K.U.Leuven grant PDM/96/105.

It has been a pleasure to work with John Kimmel and Jenny Wolkowicki
of Springer-Verlag.

We apologize to our wives, daughters, and son for the time not spent with
them during the preparation of this book and we are very grateful for their
understanding. The preparation of this book has been a period of close and
fruitful collaboration, of which we will keep good memories.

Geert and Geert

Kessel-Lo, December 1999


Contents

Preface vii

Acknowledgments ix

1 Introduction 1

2 Examples 7

2.1 The Rat Data . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.2 The Toenail Data (TDO) . . . . . . . . . . . . . . . . . . . 9

2.3 The Baltimore Longitudinal Study of Aging (BLSA) . . . . 10

2.3.1 The Prostate Data . . . . . . . . . . . . . . . . . . . 11

2.3.2 The Hearing Data . . . . . . . . . . . . . . . . . . . 14

2.4 The Vorozole Study . . . . . . . . . . . . . . . . . . . . . . 15

2.5 Heights of Schoolgirls . . . . . . . . . . . . . . . . . . . . . 16

2.6 Growth Data . . . . . . . . . . . . . . . . . . . . . . . . . . 16



2.7 Mastitis in Dairy Cattle . . . . . . . . . . . . . . . . . . . . 18

3 A Model for Longitudinal Data 19

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3.2 A Two-Stage Analysis . . . . . . . . . . . . . . . . . . . . . 20

3.2.1 Stage 1 . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.2.2 Stage 2 . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.2.3 Example: The Rat Data . . . . . . . . . . . . . . . . 21

3.2.4 Example: The Prostate Data . . . . . . . . . . . . . 21

3.2.5 Two-Stage Analysis . . . . . . . . . . . . . . . . . . 22

3.3 The General Linear Mixed-Effects Model . . . . . . . . . . 23

3.3.1 The Model . . . . . . . . . . . . . . . . . . . . . . . 23

3.3.2 Example: The Rat Data . . . . . . . . . . . . . . . . 25

3.3.3 Example: The Prostate Data . . . . . . . . . . . . . 26

3.3.4 A Model for the Residual Covariance Structure . . . 26

4 Exploratory Data Analysis 31

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

4.2 Exploring the Marginal Distribution . . . . . . . . . . . . . 31

4.2.1 The Average Evolution . . . . . . . . . . . . . . . . 31

4.2.2 The Variance Structure . . . . . . . . . . . . . . . . 33

4.2.3 The Correlation Structure . . . . . . . . . . . . . . . 34

4.3 Exploring Subject-Specific Profiles . . . . . . . . . . . . . . 35

4.3.1 Measuring the Overall Goodness-of-Fit . . . . . . . . 35

4.3.2 Testing for the Need of a Model Extension . . . . . 37

4.3.3 Example: The Rat Data . . . . . . . . . . . . . . . . 38

4.3.4 Example: The Prostate Data . . . . . . . . . . . . . 39



5 Estimation of the Marginal Model 41

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

5.2 Maximum Likelihood Estimation . . . . . . . . . . . . . . . 42

5.3 Restricted Maximum Likelihood Estimation . . . . . . . . 43

5.3.1 Variance Estimation in Normal Populations . . . . . 43

5.3.2 Estimation of Residual Variance in Linear Regression 43

5.3.3 REML Estimation for the Linear Mixed Model . . . 44

5.3.4 Justification of REML Estimation . . . . . . . . . . 46

5.3.5 Comparison Between ML and REML Estimation . 46

5.4 Model-Fitting Procedures . . . . . . . . . . . . . . . . . . . 47

5.5 Example: The Prostate Data . . . . . . . . . . . . . . . . . 48

5.6 Estimation Problems . . . . . . . . . . . . . . . . . . . . . . 50

5.6.1 Small Variance Components . . . . . . . . . . . . . . 50

5.6.2 Model Misspecifications . . . . . . . . . . . . . . . . 52

6 Inference for the Marginal Model 55

6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

6.2 Inference for the Fixed Effects . . . . . . . . . . . . . . . . . 55

6.2.1 Approximate Wald Tests . . . . . . . . . . . . . . . 56

6.2.2 Approximate t-Tests and F -Tests . . . . . . . . . . . 56

6.2.3 Example: The Prostate Data . . . . . . . . . . . . . 57

6.2.4 Robust Inference . . . . . . . . . . . . . . . . . . . . 61

6.2.5 Likelihood Ratio Tests . . . . . . . . . . . . . . . . . 62

6.3 Inference for the Variance Components . . . . . . . . . . . . 64

6.3.1 Approximate Wald Tests . . . . . . . . . . . . . . . 64

6.3.2 Likelihood Ratio Tests . . . . . . . . . . . . . . . . . 65

6.3.3 Example: The Rat Data . . . . . . . . . . . . . . . . 66



6.3.4 Marginal Testing for the Need of Random Effects . . 69

6.3.5 Example: The Prostate Data . . . . . . . . . . . . . 72

6.4 Information Criteria . . . . . . . . . . . . . . . . . . . . . . 74

7 Inference for the Random Effects 77

7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

7.2 Empirical Bayes Inference . . . . . . . . . . . . . . . . . . . 78

7.3 Henderson’s Mixed-Model Equations . . . . . . . . . . . . . 79

7.4 Best Linear Unbiased Prediction (BLUP) . . . . . . . . . . 80

7.5 Shrinkage . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

7.6 Example: The Random-Intercepts Model . . . . . . . . . . . 81

7.7 Example: The Prostate Data . . . . . . . . . . . . . . . . . 82

7.8 The Normality Assumption for Random Effects . . . . . . . 83

7.8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . 83

7.8.2 Impact on EB Estimates . . . . . . . . . . . . . . . 85

7.8.3 Impact on the Estimation of the Marginal Model . . 87

7.8.4 Checking the Normality Assumption . . . . . . . . . 89

8 Fitting Linear Mixed Models with SAS 93

8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

8.2 The SAS Program . . . . . . . . . . . . . . . . . . . . . . . 94

8.2.1 The PROC MIXED Statement . . . . . . . . . . . . 95

8.2.2 The CLASS Statement . . . . . . . . . . . . . . . . . 96

8.2.3 The MODEL Statement . . . . . . . . . . . . . . . . 96

8.2.4 The ID Statement . . . . . . . . . . . . . . . . . . . 97

8.2.5 The RANDOM Statement . . . . . . . . . . . . . . . 97

8.2.6 The REPEATED Statement . . . . . . . . . . . . . 98



8.2.7 The CONTRAST Statement . . . . . . . . . . . . . 101

8.2.8 The ESTIMATE Statement . . . . . . . . . . . . . . 101

8.2.9 The MAKE Statement . . . . . . . . . . . . . . . . . 102

8.2.10 Some Additional Statements and Options . . . . . . 102

8.3 The SAS Output . . . . . . . . . . . . . . . . . . . . . . . . 104

8.3.1 Information on the Iteration Procedure . . . . . . . 104

8.3.2 Information on the Model Fit . . . . . . . . . . . . . 105

8.3.3 Information Criteria . . . . . . . . . . . . . . . . . . 107

8.3.4 Inference for the Variance Components . . . . . . . . 107

8.3.5 Inference for the Fixed Effects . . . . . . . . . . . . 111

8.3.6 Inference for the Random Effects . . . . . . . . . . . 113

8.4 Note on the Mean Parameterization . . . . . . . . . . . . . 114

8.5 The RANDOM and REPEATED Statements . . . . . . . . 117

8.6 PROC MIXED versus PROC GLM . . . . . . . . . . . . . 119

9 General Guidelines for Model Building 121

9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

9.2 Selection of a Preliminary Mean Structure . . . . . . . . . 123

9.3 Selection of a Preliminary Random-Effects Structure . . . . 125

9.4 Selection of a Residual Covariance Structure . . . . . . . . 128

9.5 Model Reduction . . . . . . . . . . . . . . . . . . . . . . . . 132

10 Exploring Serial Correlation 135

10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

10.2 An Informal Check for Serial Correlation . . . . . . . . . . . 136

10.3 Flexible Models for Serial Correlation . . . . . . . . . . . . 137

10.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . 137



10.3.2 Fractional Polynomials . . . . . . . . . . . . . . . . . 137

10.3.3 Example: The Prostate Data . . . . . . . . . . . . . 138

10.4 The Semi-Variogram . . . . . . . . . . . . . . . . . . . . . . 141

10.4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . 141

10.4.2 The Semi-Variogram for Random-Intercepts Models 142

10.4.3 Example: The Vorozole Study . . . . . . . . . . . . . 144

10.4.4 The Semi-Variogram for Random-Effects Models . . 144

10.4.5 Example: The Prostate Data . . . . . . . . . . . . . 147

10.5 Some Remarks . . . . . . . . . . . . . . . . . . . . . . . . . 148

11 Local Influence for the Linear Mixed Model 151

11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 151

11.2 Local Influence . . . . . . . . . . . . . . . . . . . . . . . . . 153

11.3 The Detection of Influential Subjects . . . . . . . . . . . . . 158

11.4 Example: The Prostate Data . . . . . . . . . . . . . . . . . 162

11.5 Local Influence Under REML Estimation . . . . . . . . . . 167

12 The Heterogeneity Model 169

12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 169

12.2 The Heterogeneity Model . . . . . . . . . . . . . . . . . . . 171

12.3 Estimation of the Heterogeneity Model . . . . . . . . . . . . 173

12.4 Classification of Longitudinal Profiles . . . . . . . . . . . . . 177

12.5 Goodness-of-Fit Checks . . . . . . . . . . . . . . . . . . . . 178

12.6 Example: The Prostate Data . . . . . . . . . . . . . . . . . 180

12.7 Example: The Heights of Schoolgirls . . . . . . . . . . . . . 183

13 Conditional Linear Mixed Models 189

13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 189



13.2 A Linear Mixed Model for the Hearing Data . . . . . . . . . 190

13.3 Conditional Linear Mixed Models . . . . . . . . . . . . . . . 194

13.4 Applied to the Hearing Data . . . . . . . . . . . . . . . . . 197

13.5 Relation with Fixed-Effects Models . . . . . . . . . . . . . . 198

14 Exploring Incomplete Data 201

15 Joint Modeling of Measurements and Missingness 209

15.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 209

15.2 The Impact of Incompleteness . . . . . . . . . . . . . . . . . 210

15.3 Simple ad hoc Methods . . . . . . . . . . . . . . . . . . . . 211

15.4 Modeling Incompleteness . . . . . . . . . . . . . . . . . . . 212

15.5 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . 214

15.6 Missing Data Patterns . . . . . . . . . . . . . . . . . . . . . 215

15.7 Missing Data Mechanisms . . . . . . . . . . . . . . . . . . . 215

15.8 Ignorability . . . . . . . . . . . . . . . . . . . . . . . . . . . 217

15.9 A Special Case: Dropout . . . . . . . . . . . . . . . . . . . . 218

16 Simple Missing Data Methods 221

16.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 221

16.2 Complete Case Analysis . . . . . . . . . . . . . . . . . . . . 223

16.3 Simple Forms of Imputation . . . . . . . . . . . . . . . . . . 223

16.3.1 Last Observation Carried Forward . . . . . . . . . . 224

16.3.2 Imputing Unconditional Means . . . . . . . . . . . . 225

16.3.3 Buck’s Method: Conditional Mean Imputation . . . 225

16.3.4 Discussion of Imputation Techniques . . . . . . . . . 226

16.4 Available Case Methods . . . . . . . . . . . . . . . . . . . . 227

16.5 MCAR Analysis of Toenail Data . . . . . . . . . . . . . . . 227



17 Selection Models 231

17.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 231

17.2 A Selection Model for the Toenail Data . . . . . . . . . . . 233

17.2.1 MAR Analysis . . . . . . . . . . . . . . . . . . . . . 233

17.2.2 MNAR Analysis . . . . . . . . . . . . . . . . . . . . . 234

17.3 Scope of Ignorability . . . . . . . . . . . . . . . . . . . . . . 239

17.4 Growth Data . . . . . . . . . . . . . . . . . . . . . . . . . . 240

17.4.1 Analysis of Complete Growth Data . . . . . . . . . . 240

17.4.2 Frequentist Analysis of Incomplete Growth Data . . 256

17.4.3 Likelihood Analysis of Incomplete Growth Data . . . 257

17.4.4 Missingness Process for the Growth Data . . . . . . 267

17.5 A Selection Model for Nonrandom Dropout . . . . . . . . . 269

17.6 A Selection Model for the Vorozole Study . . . . . . . . . . 270

18 Pattern-Mixture Models 275

18.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 275

18.1.1 A Simple Illustration . . . . . . . . . . . . . . . . . . 275

18.1.2 A Paradox . . . . . . . . . . . . . . . . . . . . . . . 278

18.2 Pattern-Mixture Models . . . . . . . . . . . . . . . . . . . . 280

18.3 Pattern-Mixture Model for the Toenail Data . . . . . . . . . 281

18.4 A Pattern-Mixture Model for the Vorozole Study . . . . . . 287

18.5 Some Reflections . . . . . . . . . . . . . . . . . . . . . . . . 291

19 Sensitivity Analysis for Selection Models 295

19.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 295

19.2 A Modified Selection Model for Nonrandom Dropout . . . . 297

19.3 Local Influence . . . . . . . . . . . . . . . . . . . . . . . . . 298



19.3.1 Review of the Theory . . . . . . . . . . . . . . . . . 299

19.3.2 Applied to the Model of Diggle and Kenward . . . . 300

19.3.3 Special Case: Compound Symmetry . . . . . . . . . 302

19.3.4 Serial Correlation . . . . . . . . . . . . . . . . . . . . 306

19.4 Analysis of the Rat Data . . . . . . . . . . . . . . . . . . . 307

19.5 Mastitis in Dairy Cattle . . . . . . . . . . . . . . . . . . . . 312

19.5.1 Informal Sensitivity Analysis . . . . . . . . . . . . . 312

19.5.2 Local Influence Approach . . . . . . . . . . . . . . . 319

19.6 Alternative Local Influence Approaches . . . . . . . . . . . 326

19.7 Random-Coefficient-Based Models . . . . . . . . . . . . . . . 328

19.8 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . 330

20 Sensitivity Analysis for Pattern-Mixture Models 331

20.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 331

20.2 Pattern-Mixture Models and MAR . . . . . . . . . . . . . . 332

20.2.1 MAR and ACMV . . . . . . . . . . . . . . . . . . . 333

20.2.2 Nonmonotone Patterns: A Counterexample . . . . . 335

20.3 Multiple Imputation . . . . . . . . . . . . . . . . . . . . . . 336

20.3.1 Parameter and Precision Estimation . . . . . . . . . 338

20.3.2 Hypothesis Testing . . . . . . . . . . . . . . . . . . . 338

20.4 Pattern-Mixture Models and Sensitivity Analysis . . . . . . 339

20.5 Identifying Restrictions Strategies . . . . . . . . . . . . . . 343

20.5.1 Strategy Outline . . . . . . . . . . . . . . . . . . . . 343

20.5.2 Identifying Restrictions . . . . . . . . . . . . . . . . 344

20.5.3 ACMV Restrictions . . . . . . . . . . . . . . . . . . 347

20.5.4 Drawing from the Conditional Densities . . . . . . . 350

20.6 Analysis of the Vorozole Study . . . . . . . . . . . . . . . . 352



20.6.1 Fitting a Model . . . . . . . . . . . . . . . . . . . . . 352

20.6.2 Hypothesis Testing . . . . . . . . . . . . . . . . . . . 366

20.6.3 Model Reduction . . . . . . . . . . . . . . . . . . . . 371

20.7 Thoughts . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373

21 How Ignorable Is Missing At Random? 375

21.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 375

21.2 Information and Sampling Distributions . . . . . . . . . . . 377

21.3 Illustration . . . . . . . . . . . . . . . . . . . . . . . . . . . 379

21.4 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383

21.5 Implications for PROC MIXED . . . . . . . . . . . . . . . . 385

22 The Expectation-Maximization Algorithm 387

23 Design Considerations 391

23.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 391

23.2 Power Calculations Under Linear Mixed Models . . . . . . . 392

23.3 Example: The Rat Data . . . . . . . . . . . . . . . . . . . . 393

23.4 Power Calculations When Dropout Is to Be Expected . . . 394

23.5 Example: The Rat Data . . . . . . . . . . . . . . . . . . . . 397

23.5.1 Constant p_{j,k|≥k}, Varying n_j . . . . . . . . . . . . . 399

23.5.2 Constant p_{j,k|≥k}, Constant n_j . . . . . . . . . . . . 401

23.5.3 Increasing p_{j,k|≥k} over Time, Constant n_j . . . . . . 402

24 Case Studies 405

24.1 Blood Pressures . . . . . . . . . . . . . . . . . . . . . . . . 405

24.2 The Heat Shock Study . . . . . . . . . . . . . . . . . . . . . 411

24.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . 411



24.2.2 Analysis of Heat Shock Data . . . . . . . . . . . . . 415

24.3 The Validation of Surrogate Endpoints from Multiple Trials 420

24.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . 420

24.3.2 Validation Criteria . . . . . . . . . . . . . . . . . . . 421

24.3.3 Notation and Motivating Examples . . . . . . . . . . 424

24.3.4 A Meta-Analytic Approach . . . . . . . . . . . . . . 429

24.3.5 Data Analysis . . . . . . . . . . . . . . . . . . . . . . 434

24.3.6 Computational Issues . . . . . . . . . . . . . . . . . 439

24.3.7 Extensions . . . . . . . . . . . . . . . . . . . . . . . 442

24.3.8 Reflections on Surrogacy . . . . . . . . . . . . . . . . 443

24.3.9 Prediction Intervals . . . . . . . . . . . . . . . . . . 444

24.3.10 SAS Code for Random-Effects Model . . . . . . . . 445

24.4 The Milk Protein Content Trial . . . . . . . . . . . . . . . . 446

24.4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . 446

24.4.2 Informal Sensitivity Analysis . . . . . . . . . . . . . 448

24.4.3 Formal Sensitivity Analysis . . . . . . . . . . . . . . 457

24.5 Hepatitis B Vaccination . . . . . . . . . . . . . . . . . . . . 470

24.5.1 Time Evolution of Antibodies . . . . . . . . . . . . . 472

24.5.2 Prediction at Year 12 . . . . . . . . . . . . . . . . . 481

24.5.3 SAS Code for Vaccination Models . . . . . . . . . . 482

Appendix

A Software 485

A.1 The SAS System . . . . . . . . . . . . . . . . . . . . . . . . 485

A.1.1 Standard Applications . . . . . . . . . . . . . . . . . 485

A.1.2 New Features in SAS Version 7.0 . . . . . . . . . . . 485



A.2 Fitting Mixed Models Using MLwiN . . . . . . . . . . . . . 489

A.3 Fitting Mixed Models Using SPlus . . . . . . . . . . . . . . 493

A.3.1 Standard SPlus Functions . . . . . . . . . . . . . . . 494

A.3.2 OSWALD for Nonrandom Nonresponse . . . . . . . 497

B Technical Details for Sensitivity Analysis 515

B.1 Local Influence: Derivation of Components of ∆i . . . . . . 515

B.2 Proof of Theorem 20.1 . . . . . . . . . . . . . . . . . . . . . 518

References 523

Index 554
1 Introduction

In applied sciences, one is often confronted with the collection of correlated
data. This generic term embraces a multitude of data structures, such as
multivariate observations, clustered data, repeated measurements, longitu-
dinal data, and spatially correlated data.

Among those, multivariate data have received most attention in the statis-
tical literature (e.g., Seber 1984, Krzanowski 1988, Johnson and Wichern
1992). Techniques devised for this situation include multivariate regression
and multivariate analysis of variance, which have been implemented in the
SAS procedure GLM (SAS 1991) for general linear models. In addition,
SAS contains a battery of relatively specialized procedures for principal
components analysis, canonical correlation analysis, discriminant analysis,
factor analysis, cluster analysis, and so forth (SAS 1989).

As an example of a simple multivariate study, assume that a subject’s
systolic and diastolic blood pressure are measured simultaneously. This
is different from a clustered setting where, for example, for a number of
families, diastolic blood pressure is measured for all of their members. A
design where, for each subject, diastolic blood pressure is recorded under
several experimental conditions is often termed a repeated measures study.
In the case that diastolic blood pressure is measured repeatedly over time
for each subject, we are dealing with longitudinal data. Although one could
view all of these data structures as special cases of multivariate designs, we
believe there are many fundamental differences, thoroughly affecting the
mode of analysis. First, certain multivariate techniques, such as principal
components, are hardly useful for the other designs. Second, in a truly
multivariate set of outcomes, the variance-covariance structure is usually
unstructured, in contrast to, for example, longitudinal data. Therefore,
the methodology of the general linear model is too restrictive to perform
satisfactory data analyses of these more complex data. In contrast, the
general linear mixed model, as implemented in the SAS procedure MIXED
(Littell et al. 1996), is much more flexible.

Replacing the time dimension in a longitudinal setting with one or more
spatial dimensions leads naturally to spatial data. While ideas in the lon-
gitudinal and spatial areas have developed relatively independently, efforts
have been made to bridge the gap between the two disciplines. In 1996, a
workshop was devoted to this idea: “The Nantucket Conference on Mod-
eling Longitudinal and Spatially Correlated Data: Methods, Applications,
and Future Directions” (Gregoire et al. 1997).

Still, restricting attention to the correlated data settings described earlier is
too limited to fully grasp the wide applicability of the general linear mixed
model. In designed experiments, such as analysis of variance (ANOVA) or
nested factorial designs, the variance structure has to reflect the design and
thus elaborate structures will be needed. A good mode of analysis should
be able to account for various sources of variability. Linear mixed models
originated precisely in this area of application. For a review, see Robinson
(1991).

Among the clustered data settings, longitudinal data perhaps require the
most elaborate modeling of the random variability. Diggle, Liang, and Zeger
(1994) distinguish among three components of variability. The first one
groups traditional random effects (as in a random-effects ANOVA model)
and random coefficients (Longford 1993). It stems from interindividual vari-
ability (i.e., heterogeneity between individual profiles). The second compo-
nent, serial association, is present when residuals close to each other in time
are more similar than residuals further apart. This notion is well known in
the time-series literature (Ripley 1981, Diggle 1983, Cressie 1991). Finally,
in addition to the other two components, there is potentially also measure-
ment error. This results from the fact that, for delicate measurements (e.g.,
laboratory assays), even immediate replication will not be able to avoid a
certain level of variation. In longitudinal data, these three components of
variability can be distinguished by virtue of both replication and a clear
distance concept (time), one of which is lacking in classical spatial
and time-series analysis and in clustered data. This implies that adapting
models for longitudinal data to other data structures is in many cases rel-
atively straightforward. For example, clustered data could be analyzed by
leaving out all aspects of the model that refer to time.
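
Anticipating notation that is developed formally in Chapter 3, and purely
as a sketch, the three components can be thought of as adding up in the
marginal covariance matrix of the response vector $Y_i$ for subject $i$:
\[
\operatorname{Var}(Y_i) \; = \; Z_i D Z_i' \; + \; \tau^2 H_i \; + \; \sigma^2 I_{n_i},
\]
where the first term stems from the random effects $b_i \sim N(0, D)$ with
design matrix $Z_i$, the second term captures serial association through a
correlation matrix $H_i$ whose elements decay as the time lag between
measurements grows, and the third term represents measurement error, with
$I_{n_i}$ the identity matrix of dimension $n_i$, the number of measurements
on subject $i$.
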
A very important characteristic of data to be analyzed is the type of out-
come. Methods for continuous data form the best developed and most ad-
vanced body of research; the same is true for software implementation. This
is natural, since the special status and the elegant properties of the normal
distribution simplify model building and ease software development. It is
in this area that the general linear mixed model and the SAS procedure
MIXED, as well as its counterparts in, for example, SPlus and MLwiN,
are situated. However, categorical (nominal, ordinal, and binary) and dis-
crete outcomes are also very prominent in statistical practice. For example,
quality of life outcomes are often scored on ordinal scales.

Two fairly different views can be adopted. The first one, supported by
large-sample results, states that normal theory should be applied as much
as possible, even to non-normal data such as ordinal scores and counts.
A different view is that each type of outcome should be analyzed using
instruments that exploit the nature of the data. We will adopt the second
standpoint. In addition, since the statistical community has been familiar-
ized with generalized linear models (GLIM; McCullagh and Nelder 1989),
some have taken the view that the normal model for continuous data is
but one type of GLIM. Although this is correct in principle, it fails to ac-
knowledge that normal models are much further developed than any other
GLIM (e.g., model checks and diagnostic tools) and that they enjoy unique
properties (e.g., the existence of closed-form solutions, exact distributions
of test statistics, unbiased estimators). Extensions of GLIM to the longitu-
dinal case are discussed in Diggle, Liang, and Zeger (1994), where the main
emphasis is on generalized estimating equations (Liang and Zeger 1986).
Generalized linear mixed models have been proposed by, for example, Bres-
low and Clayton (1993). Fahrmeir and Tutz (1994) devote an entire book
to GLIM for multivariate settings.

In longitudinal settings, each individual typically has a vector Y of re-
sponses with a natural (time) ordering among the components. This leads
to several, generally nonequivalent, extensions of univariate models. In a
marginal model, marginal distributions are used to describe the outcome
vector Y, given a set X of predictor variables. The correlation among
the components of Y can then be captured either by adopting a fully
parametric approach or by means of working assumptions, such as in the
semiparametric approach of Liang and Zeger (1986). Alternatively, in a
random-effects model, the predictor variables X are supplemented with a
vector b of random effects, conditional upon which the components of Y
are usually assumed to be independent. This does not preclude that more
elaborate models are possible if residual dependence is detected (Longford
1993). Finally, a conditional model describes the distribution of the compo-
nents of Y, conditional on X but also conditional on (a subset of) the other
components of Y. In a longitudinal context, a particularly relevant class of
conditional models describes a component of Y given the ones recorded
earlier in time. Well-known members of this class of transition models are
Markov type models. Several examples are given in Diggle, Liang, and Zeger
(1994).
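
Schematically, with $Y_{ij}$ the $j$th measurement on subject $i$, $x_{ij}$
and $z_{ij}$ covariate vectors, $b_i$ a vector of random effects, and $\alpha$
a transition parameter (notation used here for illustration only), the three
families differ in what they condition on:
\begin{align*}
\text{marginal:} \quad & E(Y_{ij}) = x_{ij}'\beta, \\
\text{random-effects:} \quad & E(Y_{ij} \mid b_i) = x_{ij}'\beta + z_{ij}'b_i, \\
\text{transition:} \quad & E(Y_{ij} \mid Y_{i,j-1}, \ldots, Y_{i1}) = x_{ij}'\beta + \alpha Y_{i,j-1}.
\end{align*}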

For normally distributed data, marginal models can easily be fitted, for ex-
ample, with the SAS procedure MIXED, the SPlus function lme, or within
the MLwiN package. For such data, integrating a mixed-effects model over
the random effects produces a marginal model, in which the regression
parameters retain their meaning and the random effects contribute in a
simple way to the variance-covariance structure. For example, the mar-
ginal model corresponding to a random-intercepts model is a compound
symmetry model that can be fitted without explicitly acknowledging the
random-intercepts structure. In the same vein, certain types of transition
model induce simple marginal covariance structures. For example, some
first-order stationary autoregressive models imply an exponential or AR(1)
covariance structure. As a consequence, many marginal models derived
from random-effects and transition models can be fitted with mixed-models
software.
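
As a minimal sketch of the random-intercepts example (the data set and
variable names are hypothetical), the following two PROC MIXED programs
fit the same marginal compound symmetry model, once through an explicit
random intercept and once directly through the residual covariance
structure:

proc mixed data = example;
  class id;
  model y = time / solution;
  random intercept / subject = id;    /* random-intercepts formulation */
run;

proc mixed data = example;
  class id;
  model y = time / solution;
  repeated / type = cs subject = id;  /* marginal compound symmetry */
run;

One difference worth noting: the random-intercepts formulation implicitly
restricts the intraclass covariance to be nonnegative, whereas the TYPE=CS
formulation does not.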

It should be emphasized that the above elegant properties of normal mod-
els do not extend to the general GLIM case. For example, opting for a
marginal model for longitudinal binary data precludes the researcher from
answering conditional and transitional questions in terms of simple model
parameters. This implies that each model family requires its own specific
software tools. For example, an analysis based on generalized estimating
equations can be performed within the GENMOD procedure in SAS, or
the SPlus set of functions termed OSWALD (Smith, Robertson, and Dig-
gle 1996). Mixed-effects models for non-Gaussian data can be fitted using
the MIXOR program (Hedeker and Gibbons 1994, 1996), MLwiN, or the
SAS procedure NLMIXED. The latter procedure is available from Version
7 onward and is the successor of the macros GLIMMIX and NONLINMIX.

Motivated by the above discussion, we have restricted the scope of this book
to linear mixed models for continuous outcomes. Fahrmeir and Tutz (1994)
discuss generalized linear (mixed) models for multivariate outcomes, while
longitudinal versions are treated in Diggle, Liang, and Zeger (1994). Non-
linear models for repeated measurement data are discussed by Davidian
and Giltinan (1995).

While research in this area has largely focused on the formulation of linear
mixed-effects models, inference, and software implementation, other im-
portant aspects, such as exploratory analysis, the investigation of model
fit, and the construction of diagnostic tools have received considerably less
attention. In addition, longitudinal data are typically very prone to in-
completeness, due to dropout or intermediate missing values. This poses
particular challenges to methodological development. In this book, we have
attempted to give a detailed account of several of these topics. By no means
has it been our intention to give a complete or definitive overview. Indeed,
given the high research activity, this would be impossible.

Broadly, the structure of the book is as follows. The key examples, used
throughout the book, are introduced in Chapter 2. Chapters 3 to 9 pro-
vide the core about the linear mixed-effects model, while Chapters 10 to 13
discuss more advanced tools for model exploration, influence diagnostics,
as well as extensions of the original model. Chapters 14 to 16 introduce
the reader to basic incomplete data concepts. Chapters 17 and 18 discuss
strategies to model incomplete longitudinal data, based on the linear mixed
model. The sensitivity of such strategies to parametric assumptions is in-
vestigated in Chapters 19 and 20. Some additional missing data topics are
presented in Chapters 21 and 22. Chapter 23 is devoted to design consid-
erations. Five case studies are treated in detail in Chapter 24. Appendix A
reviews a number of software tools for fitting mixed models. Since the book
puts relatively more emphasis on SAS than on other packages, the MIXED
procedure is discussed in detail in Chapter 8, while worked examples can be
found throughout the text. Some technical background material from the
sensitivity chapters is deferred until Appendix B.
2 Examples

This chapter introduces the longitudinal sets of data which will be used
throughout the book. The rat data are presented in Section 2.1. The TDO
data, studying toenails, are described in Section 2.2. Section 2.3 is de-
voted to the Baltimore Longitudinal Study of Aging, with two substudies:
prostate-specific antigen data (Section 2.3.1) and data on hearing (Sec-
tion 2.3.2). Section 2.4 introduces the Vorozole study, focusing on quality
of life in breast cancer patients. In Section 2.5, we will introduce data,
previously analyzed by Goldstein (1979), on the heights of 20 schoolgirls.
Section 2.6 presents the growth data of Potthoff and Roy (1964). Mastitis
in dairy cattle is the subject of Section 2.7.

To complement the data introduced in this chapter, five case studies, in-
volving additional sets of data, are presented in Chapter 24.

2.1 The Rat Data

In medical science, there has recently been increased interest in the ther-
apeutic use of hormones. However, such drastic therapies require detailed
knowledge about their effect on the different aspects of growth. In this
respect, an experiment has been set up at the Department of Orthodontics
of the Catholic University of Leuven (KUL) in Belgium (see Verdonck et
al. 1998). The primary aim was to investigate the effect of the inhibition
of the production of testosterone in male Wistar rats on their craniofacial
growth.

FIGURE 2.1. Rat Data. Individual profiles for each of the treatment groups in
the rat experiment separately.

A total of 50 male Wistar rats have been randomized to either a control
group or one of the two treatment groups where treatment consisted of a
low or high dose of the drug Decapeptyl, which is an inhibitor for testos-
terone production in rats. The treatment started at the age of 45 days, and
measurements were taken every 10 days, with the first observation taken
at the age of 50 days. The responses of interest are distances (in pixels)
between well-defined points on X-ray pictures of the skull of each rat, taken
after the rat has been anesthetized. Of primary interest is the estimation
of changes over time and testing whether these changes are treatment de-
pendent.

For the purpose of this book, we will consider one of the measurements
which can be used to characterize the height of the skull. The individual
profiles are shown in Figure 2.1. It is clear that not all rats have measure-
ments up to the age of 110 days. This is due to the fact that many rats do
not survive anaesthesia and therefore drop out before the end of the study.
Table 2.1 shows the number of rats observed at each occasion. While 50
rats have been randomized at the start of the experiment, only 22 of them
survived the first 6 measurements, so measurements on only 22 rats are
available in the way anticipated at the design stage. For example, at the
second occasion (age = 60 days), only 46 rats were available, implying that
for 4 rats only 1 measurement could be recorded.

TABLE 2.1. Rat Data. Summary of the number of observations taken at each
occasion in the rat experiment, for each group separately and in total.

                        # Observations
Age (days)    Control    Low    High    Total
    50          15        18     17      50
    60          13        17     16      46
    70          13        15     15      43
    80          10        15     13      38
    90           7        12     10      29
   100           4        10     10      24
   110           4         8     10      22

2.2 The Toenail Data (TDO)

The data introduced in this section were obtained from a randomized,
double-blind, parallel group, multicenter study for the comparison of two
oral treatments (in the sequel coded as A and B) for toenail dermatophyte
onychomycosis (TDO), described in full detail by De Backer et al. (1996).
TDO is a common toenail infection, difficult to treat, affecting more than 2
out of 100 persons (Roberts 1992). Antifungal compounds, classically used
for treatment of TDO, need to be taken until the whole nail has grown out
healthy. The development of new such compounds, however, has reduced
the treatment duration to 3 months. The aim of the present study was to
compare the efficacy and safety of 12 weeks of continuous therapy with
treatment A or with treatment B.

In total, 2 × 189 patients were randomized, distributed over 36 centers.
Subjects were followed during 12 weeks (3 months) of treatment and fol-
lowed further, up to a total of 48 weeks (12 months). Measurements were
taken at baseline, every month during treatment, and every 3 months af-
terward, resulting in a maximum of seven measurements per subject. For
our purposes, we will only consider one of the secondary endpoints, unaf-
fected nail length, which is measured as follows. At the first occasion, the
treating physician indicates one of the affected toenails as the target nail,
the nail which will be followed over time. At each occasion, the unaffected
nail length (measured from the nail bed to the infected part of the nail,
which is always at the free end of the nail) of the target nail is measured in
millimeters. Obviously, this response will be related to the toe size. There-
fore, we will only include here those patients for which the target nail was
one of the two big toenails. This reduces our sample under consideration to
150 and 148 subjects, respectively. Figure 2.2 shows the observed profiles
of 30 randomly selected subjects from treatment group A and treatment
group B, respectively.

FIGURE 2.2. Toenail Data. Individual profiles of 30 randomly selected subjects
in each of the treatment groups in the toenail experiment.

Due to a variety of reasons, 72 (24%) out of the 298 participants left the
study prematurely. Table 2.2 summarizes the number of subjects still in
the study at each occasion, for both treatment groups separately. Although
the comparison of the average evolutions in both treatment groups was of
primary interest, there was also some interest in studying the relationship
between the dropout process and the actual outcome. For example, are
patients who drop out doing better or worse than patients who do not
drop out from the study?

2.3 The Baltimore Longitudinal Study of Aging (BLSA)

The Baltimore Longitudinal Study of Aging (BLSA) is an ongoing multi-
disciplinary observational study, which started in 1958 and has the study
of normal human aging as its primary objective (Shock et al. 1984). Partici-
pants in the BLSA are volunteers who return approximately every 2 years
for 3 days of biomedical and psychological examinations. They are predom-
inantly white (95%), well educated (over 75% have bachelor’s degrees), and
financially comfortable (82%). So far, over 1400 men with an average of al-
most 7 visits and 16 years of follow-up have participated in the study since
its inception in 1958. Later on, females have been included in the study as
well.

TABLE 2.2. Toenail Data. Summary of the number of observations taken at each
occasion in the TDO study, for each group separately and in total.

                         # Observations
Time (months)   Treatment A   Treatment B   Total
      0             150           148        298
      1             149           142        291
      2             146           138        284
      3             140           131        271
      6             131           124        255
      9             120           109        229
     12             118           108        226

The BLSA (Pearson et al. 1994) is a unique resource for rapidly evaluating
longitudinal hypotheses because of the availability of data from repeated
clinical examinations and a bank of frozen blood samples from the same
individuals over 30 years of follow-up (where new studies would require
many years to conduct). On the other hand, the observational aspect of
the study poses additional complications for the statistical analysis. For ex-
ample, although repeated visits are scheduled every 2 years, some subjects
may have more than one visit within 1 year of time, while others have over
10 years between two successive visits. Also, longitudinal evolutions may
be highly influenced by many covariates which may or may not be recorded
in the study.

In this book, two of the many responses measured in the BLSA will be
used to illustrate the statistical methodology. In Section 2.3.1, it will be
discussed how data from the BLSA can be used to study the natural history
of prostate disease. Afterward, in Section 2.3.2, the hearing data will be
presented.

2.3.1 The Prostate Data

During the last 10 years, many papers have been published on the natural
history of prostate disease; see, for example, Carter et al. (1992a, 1992b)
and Pearson et al. (1991, 1994). According to Carter and Coffey (1990),
prostate disease is one of the most common and most costly medical prob-
lems in the United States, and prostate cancer has become the second
leading cause of male cancer deaths. It is therefore very important to look
for markers which can detect the disease at an early stage. The prostate-
specific antigen (PSA) is such a marker. PSA is an enzyme produced by
both normal and cancerous prostate cells, and its level is related to the
volume of prostate tissue. Still, an elevated PSA level is not necessarily an
indicator of prostate cancer because patients with benign prostatic hyper-
plasia (BPH) also have an enlarged volume of prostate tissue and therefore
also an increased PSA level. This overlap of the distribution of PSA values
in patients with prostate cancer and BPH has limited the usefulness of
a single PSA value as a screening tool since, according to Pearson et al.
(1991), up to 60% of BPH patients may be falsely identified as potential
cancer cases based on a single PSA value.

TABLE 2.3. Prostate Data. Description of subjects included in the prostate data
set, by diagnostic group. The cancer cases are subdivided into local/regional (L/R)
and metastatic (M) cancer cases.

                                               Cancer Cases
                         Controls   BPH Cases     L/R          M
Number of participants      16         20          14          4
Age at diagnosis (years)
  Median                    66        75.9        73.8        72.1
  Range                 56.7-80.5  64.6-86.7   63.6-85.4   62.7-82.8
Years of follow-up
  Median                   15.1       14.3        17.2        17.4
  Range                  9.4-16.8   6.9-24.1   10.6-24.9    10-25.3
Time between measurements (years)
  Median                     2          2          1.7         1.7
  Range                  1.1-11.7   0.9-8.3     0.9-10.8     0.9-4.8
Number of measurements per individual
  Median                     8          8          11          9.5
  Range                    4-10       5-11        7-15        7-12

Based on clinical practice, researchers have hypothesized that the rate of
change in PSA level might be a more accurate method of detecting prostate
cancer in the early stages of the disease. This has been extensively investi-
gated by Pearson et al. (1994), who analyzed repeated PSA measures from
the Baltimore Longitudinal Study of Aging (BLSA), using linear mixed
models.

A retrospective case-control study was undertaken that utilized frozen
serum samples from 18 BLSA participants identified as prostate cancer
cases, 20 cases of BPH, and 16 controls with no clinical signs of prostate
disease. In order to be eligible for the analyses, men had to meet several
criteria:

1. seven or more years of follow-up prior to diagnosis of prostate cancer,


simple prostatectomy for BPH, or exclusion of prostate disease by a
urologist,

2. confirmation of the pathological diagnosis, and

3. no prostate surgery prior to diagnosis.

To the extent possible, age at diagnosis and years of follow-up were matched
for the control, BPH, and cancer groups. However, due to the high preva-
lence of BPH in men over age 50, it was difficult to find age-matched
controls with no evidence of prostate disease. In fact, the control group
remained significantly younger at first visit and at diagnosis, compared to
the BPH group. For this reason, our analyses of this data set will always
correct for age differences at the time of the diagnosis.

A description of the data, differentiating between local/regional (L/R) can-
cer cases and metastatic cancer cases, is given in Table 2.3. The number of
repeated PSA measurements per individual varies between 4 and 15, and
the follow-up period ranges from 6.9 to 25.3 years. Since it was anticipated
that PSA values would increase exponentially in prostate cancer cases, the
responses were transformed to ln(PSA + 1). These transformed individual
profiles are shown in Figure 2.3.

FIGURE 2.3. Prostate Data. Longitudinal trends in PSA in men with prostate
cancer, benign prostatic hyperplasia, or no evidence of prostate disease.
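
In SAS, this transformation amounts to a single assignment in a DATA
step; a minimal sketch (the data set and variable names are hypothetical):

data prostate;
  set prostate;
  lnpsa = log(psa + 1);  /* log() is the natural logarithm in SAS */
run;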

2.3.2 The Hearing Data

Also recorded in the BLSA study are hearing threshold sound pressure lev-
els (SPLs in dB), measured at 11 different frequencies [varying from 125
to 8000 hertz (Hz)] on both ears, yielding a maximum of 22 observations
per visit. This was done by means of a soundproof chamber and a Békésy
audiometer. Using these data, Brant and Fozard (1990) have shown that
the relationship between hearing threshold level and frequency can be well
described by a quadratic function of the logarithm of frequency, the pa-
rameters of which depend on age and are highly subject-specific. Morrell
and Brant (1991) and Brant and Pearson (1994) considered the data of 268
elderly male participants whose first visit occurred at about 70 years of age
or older. They studied how hearing thresholds change over time and how
these evolutions depend on age and on the frequency under consideration.

For our purposes, we now consider all available hearing thresholds for 500
Hz, from male BLSA participants only, without otologic disease, unilateral
hearing loss, or evidence of noise-induced hearing loss. Individual profiles
on the left and right ear separately are shown in Figure 2.4 for 30 randomly
selected subjects.

FIGURE 2.4. Hearing Data. Individual profiles of 30 randomly selected subjects
in the hearing data set, for the left and the right ear separately.

In total, we have 6170 observations (3089 on the left ear and 3081 on the
right ear), from 681 males. Their age at the first visit ranged from 17.2 to
90.5 years, with median value equal to 53 years. The number of visits per
subject varied from 1 to 15, and some of the participants were followed for
over 22 years (median 7.5 years).

2.4 The Vorozole Study

This study was an open-label, multicenter, parallel-group trial conducted
at 67 North American centers. Patients were randomized to either the new
drug Vorozole (2.5 mg taken once daily) or the standard drug megestrol
acetate (40 mg four times daily). The patient population consisted of post-
menopausal patients with histologically confirmed estrogen-receptor pos-
itive metastatic breast carcinoma. All 452 randomized patients were fol-
lowed until disease progression or death. The main objective was to com-
pare the treatment groups with respect to response rate, whereas secondary
objectives included a comparison relative to duration of response, time to
progression, survival, safety, pain relief, performance status, and quality of
life. Full details of this study are reported in Goss et al. (1999). In this book,
we will focus on overall quality of life, measured by the total Functional
Living Index: Cancer (FLIC; Schipper et al. 1984). A higher FLIC score
corresponds to a more desirable outcome. Even though this outcome is, strictly
speaking, of the ordinal type, the total number of categories encountered
exceeds 70, justifying the use of continuous-outcome methods.

Patients underwent screening, and for those deemed eligible, a detailed ex-
amination at baseline (occasion 0) took place. Further measurement oc-
casions were month 1 and then, from month 2 onward, bimonthly intervals
until month 44.

Goss et al. (1999) analyzed FLIC using a two-way ANOVA model with
effects for treatment, disease status, as well as their interaction. No signifi-
cant difference was found. Apart from treatment, important covariates are
dominant site of the disease as well as clinical stage.

This example will be used, among other things, to introduce exploratory
tools in Chapter 4.

FIGURE 2.5. Heights of Schoolgirls. Growth curves of 20 school girls from age 6
to 10, for girls with small, medium, or tall mothers.

2.5 Heights of Schoolgirls

Goldstein (1979, Table 4.3, p. 101) reports growth curves of 20 preado-
lescent girls, measured on a yearly basis from age 6 to 10. The girls were
classified according to the height of their mother, which was discretized
as in Table 2.4. The individual profiles are shown in Figure 2.5, for each
group separately. The measurements are given at exact years of age, some
having been previously adjusted to these. The values Goldstein reports for
the fifth girl in the first group are 114.5, 112, 126.4, 131.2, and 135.0. This
suggests that the second measurement is incorrect. We therefore replaced
it by 122. An extensive analysis of this data set can be found in Section 4.2
of Verbeke and Molenberghs (1997). Of primary interest is to test whether
the growth of these schoolgirls is related to the height of their mothers.

2.6 Growth Data

These data, introduced by Potthoff and Roy (1964), contain growth mea-
surements for 11 girls and 16 boys. For each subject, the distance from
the center of the pituitary to the pterygomaxillary fissure was recorded at ages 8,

TABLE 2.5. Growth Data for 11 Girls and 16 Boys. Measurements marked with
∗ were deleted by Little and Rubin (1987).

         Age (in years)                    Age (in years)
Girl   8     10     12     14      Boy   8     10     12     14
1 21.0 20.0 21.5 23.0 1 26.0 25.0 29.0 31.0
2 21.0 21.5 24.0 25.5 2 21.5 22.5∗ 23.0 26.5
3 20.5 24.0∗ 24.5 26.0 3 23.0 22.5 24.0 27.5
4 23.5 24.5 25.0 26.5 4 25.5 27.5 26.5 27.0
5 21.5 23.0 22.5 23.5 5 20.0 23.5∗ 22.5 26.0
6 20.0 21.0∗ 21.0 22.5 6 24.5 25.5 27.0 28.5
7 21.5 22.5 23.0 25.0 7 22.0 22.0 24.5 26.5
8 23.0 23.0 23.5 24.0 8 24.0 21.5 24.5 25.5
9 20.0 21.0∗ 22.0 21.5 9 23.0 20.5 31.0 26.0
10 16.5 19.0∗ 19.0 19.5 10 27.5 28.0 31.0 31.5
11 24.5 25.0 28.0 28.0 11 23.0 23.0 23.5 25.0
12 21.5 23.5∗ 24.0 28.0
13 17.0 24.5∗ 26.0 29.5
14 22.5 25.5 25.5 26.0
15 23.0 24.5 26.0 30.0
16 22.0 21.5∗ 23.5 25.0

Source: Potthoff and Roy (1964), Jennrich and Schluchter (1986).

10, 12, and 14. The data were used by Jennrich and Schluchter (1986) to
illustrate estimation methods for unbalanced data, where unbalancedness
is now to be interpreted in the sense of an unequal number of boys and
girls.

Little and Rubin (1987) deleted 9 of the [(11 + 16) × 4] measurements,
rendering 9 incomplete subjects. Deletion is confined to the age 10 mea-
surements. Little and Rubin (1987) describe the mechanism to be such that
subjects with a low value at age 8 are more likely to have a missing value at
age 10. The data are presented in Table 2.5. The measurements that were
deleted are marked with an asterisk. In Section 17.4.1, the complete data
will be analyzed in some detail. Sections 17.4.2 and 17.4.3 are devoted to
frequentist and likelihood-based ignorable analyses of the incomplete ver-
sion of the data, respectively. Section 17.4.4 is devoted to insight in the
missingness mechanism.

FIGURE 2.6. Mastitis in Dairy Cattle. The first panel shows a scatter plot of
the second measurement versus the first measurement. The second panel shows a
scatter plot of the change versus the baseline measurement.

2.7 Mastitis in Dairy Cattle

This example, concerning the occurrence of the infectious disease mastitis in
dairy cows, was introduced in Diggle and Kenward (1994) and reanalyzed
in Kenward (1998). Data were available on the milk yields in thousands
of liters of 107 dairy cows from a single herd in 2 consecutive years: Yij
(i = 1, . . . , 107; j = 1, 2). In the first year, all animals were supposedly
free of mastitis; in the second year, 27 became infected. Mastitis typically
reduces milk yield, and the question of scientific interest is whether the
probability of occurrence of mastitis is related to the yield that would have
been observed had mastitis not occurred. A graphical representation of the
complete data is given in Figure 2.6.
3
A Model for Longitudinal Data

3.1 Introduction

In practice, longitudinal data are often highly unbalanced, in the sense that
an equal number of measurements is not available for all subjects and/or
that measurements are not taken at fixed time points. In the rat data
set and the toenail data set, presented in Section 2.1 and in Section 2.2,
respectively, a fixed number of measurements was scheduled to be taken
on all subjects, at fixed time points. However, during the study, rats died,
and patients left the toenail study prematurely, implying unbalance. This
is different from the prostate data and the hearing data (Sections 2.3.1
and 2.3.2, respectively), where the unbalance is an immediate result of
the fact that the volunteers participating in the BLSA were asked to return
approximately every 2 years for medical examination.

Due to their unbalanced nature, many longitudinal data sets cannot be
analyzed using multivariate regression techniques (see, for example, Seber
1984, Chapters 8 and 9, Hand and Taylor 1987). A natural alternative
arises from observing that subject-specific longitudinal profiles can often be
well approximated by linear regression functions. One hereby summarizes
the vector of repeated measurements for each subject by a vector of a
relatively small number of estimated subject-specific regression coefficients.
Afterward, in a second stage, multivariate regression techniques can be used
to relate these estimates to known covariates such as treatment, disease
classification, baseline characteristics, and so forth. This so-called two-stage
analysis will be introduced in Section 3.2. Afterward, in Section 3.3, the
general linear mixed model will be introduced as a result of combining the
two stages into one single statistical model.

3.2 A Two-Stage Analysis

3.2.1 Stage 1

Let the random variable Yij denote the (possibly transformed) response
of interest, for the ith individual, measured at time tij , i = 1, . . . , N ,
j = 1, . . . , ni , and let Yi be the ni -dimensional vector of all repeated mea-
surements for the ith subject, that is, Yi = (Yi1 , Yi2 , . . . , Yini )′ . The first
stage of the two-stage approach assumes that Yi satisfies the linear regres-
sion model
Yi = Zi β i + εi , (3.1)
where Zi is a (ni × q) matrix of known covariates, modeling how the re-
sponse evolves over time for the ith subject. Further, β i is a q-dimensional
vector of unknown subject-specific regression coefficients, and εi is a vec-
tor of residual components εij , j = 1, . . . , ni . It is usually assumed that all
εi are independent and normally distributed with mean vector zero, and
covariance matrix σ 2 Ini , where Ini is the ni -dimensional identity matrix.
This latter assumption will be extended in Section 3.3.

Obviously, model (3.1) includes very flexible models for the description of
subject-specific profiles. In practice, polynomials will often suffice. How-
ever, extensions such as fractional polynomial models (Royston and Alt-
man 1994), or extended spline functions (Pan and Goldstein 1998) can be
considered as well. We refer to Lesaffre, Asefa and Verbeke (1999) for an
example where subject-specific profiles have been modeled using fractional
polynomials.

3.2.2 Stage 2

In a second step, a multivariate regression model of the form

    βi = Ki β + bi ,    (3.2)
is used to explain the observed variability between the subjects, with re-
spect to their subject-specific regression coefficients β i . Ki is a (q × p)
matrix of known covariates, and β is a p-dimensional vector of unknown
regression parameters. Finally, the bi are assumed to be independent, fol-
lowing a q-dimensional normal distribution with mean vector zero and gen-
eral covariance matrix D.

3.2.3 Example: The Rat Data

The rat data presented in Section 2.1 have been analyzed by Verbeke and
Lesaffre (1999), who describe the subject-specific profiles shown in Fig-
ure 2.1 by straight lines, after transforming the original time scale (age
expressed in days) logarithmically (see also Section 4.3.3). The first-stage
model (3.1) then becomes

Yij = β1i + β2i tij + εij , j = 1, . . . , ni , (3.3)

where tij = ln[1 + (Ageij − 45)/10], implying that t = 0 corresponds to
the start of the treatment. The matrix Zi has two columns: one containing
only ones, and one containing all time points tij , j = 1, . . . , ni .

In the second stage, the subject-specific intercepts and time effects are
related to the treatment of the rats (low dose, high dose, control). Our
second-stage model (3.2) then becomes

    β1i = β0 + b1i ,
    β2i = β1 Li + β2 Hi + β3 Ci + b2i ,        (3.4)

in which Li , Hi , and Ci are indicator variables defined to be one if the rat
belongs to the low-dose group, the high-dose group, or the control group,
respectively, and zero otherwise. The randomization in combination with
the chosen transformation of the original time scale allows us to assume
the subject-specific intercepts β1i not to depend on the treatment. The
parameter β0 can be interpreted as the average response at the start of the
treatment, whereas the parameters β1 , β2 , and β3 represent the average
time effects for each treatment group separately. Of primary interest is
the comparison of these average slopes, since this directly measures the
treatment effect on the average growth.

3.2.4 Example: The Prostate Data

Pearson et al. (1994) and Verbeke and Molenberghs (1997, Chapter 3) have
previously analyzed the prostate data presented in Section 2.3.1, assuming
that each individual profile shown in Figure 2.3 can be well approximated
by a quadratic function over time, where time is expressed as years before
diagnosis (see also Section 4.3.4). The regression model (3.1) in the first
stage is then

    Yij = ln(PSAij + 1)
        = β1i + β2i tij + β3i tij² + εij ,    j = 1, . . . , ni ,    (3.5)

and the columns of the covariate matrix Zi contain only ones, all time
points tij , and all squared time points tij² .

In the second stage, the subject-specific intercepts and linear as well as
quadratic time effects are related to the diagnostic class of the subject
(control, BPH case, local cancer case, or metastatic cancer case). The age
at the time of diagnosis is included as a covariate in order to correct for
the age differences among the four diagnostic groups. Model (3.2) in the
second stage then becomes


    β1i = β1 Agei + β2 Ci + β3 Bi + β4 Li + β5 Mi + b1i ,
    β2i = β6 Agei + β7 Ci + β8 Bi + β9 Li + β10 Mi + b2i ,        (3.6)
    β3i = β11 Agei + β12 Ci + β13 Bi + β14 Li + β15 Mi + b3i ,

in which Agei equals the subject’s age at diagnosis (t = 0), and where Ci ,
Bi , Li , and Mi are indicator variables defined to be one if the subject is
a control, a BPH case, a local cancer case, or a metastatic cancer case,
respectively, and zero otherwise. The parameters β2 , β3 , β4 , and β5 are the
average intercepts for the controls, the BPH cases, the L/R cancer cases,
and the metastatic cancer cases, respectively, after correction for age at
diagnosis. Similar interpretations hold for the other parameters in (3.6).

3.2.5 Two-Stage Analysis

In practice, the regression parameters in (3.2) are of primary interest. They
can be estimated by sequentially fitting the models (3.1) and (3.2). First,
all βi are estimated by fitting model (3.1) to the observed data vector yi
of each subject separately, yielding estimates β̂i . Afterward, model (3.2) is
fitted to the estimates β̂i , providing inferences for β.
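To make the procedure concrete, the following is a minimal sketch of such
a two-stage fit in Python (the function and argument names are ours; the
book's own computations rely on SAS). Stage 1 fits model (3.1) by ordinary
least squares for each subject; stage 2 regresses the stacked estimates on the
stacked Ki matrices, thereby ignoring the differing precision of the subject-
specific estimates, a weakness discussed at the end of this section.

```python
import numpy as np

def two_stage(y_list, Z_list, K_list):
    """Naive two-stage fit: per-subject OLS of y_i on Z_i (stage 1),
    then OLS of the stacked estimates on the stacked K_i (stage 2)."""
    # Stage 1: subject-specific estimates of beta_i in model (3.1)
    beta_hats = [np.linalg.lstsq(Z, y, rcond=None)[0]
                 for y, Z in zip(y_list, Z_list)]
    # Stage 2: stack beta_hat_i = K_i beta + b_i, solve for beta by OLS
    beta, *_ = np.linalg.lstsq(np.vstack(K_list),
                               np.concatenate(beta_hats), rcond=None)
    return beta, beta_hats
```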

Fitting the models (3.5) and (3.6) to the prostate data, Verbeke and Molen-
berghs (1997, Section 3.3) found that the subject-specific regression para-
meters β i did not depend on age at diagnosis (at the 5% level of signifi-
cance), but highly significant differences were found among the diagnostic
groups. No significant differences were obtained between the controls and
the BPH cases, and the two groups of cancer patients only differed with
respect to their intercepts.

Note how this two-stage analysis can be interpreted as the calculation (first
stage) and analysis (second stage) of summary statistics. First, the actually
observed data vector yi is summarized by β̂i , for each subject separately.
Afterward, regression methods are used to assess the relation between the
so-obtained summary statistics and relevant covariates. Other summary
statistics frequently used in practice are the area under each individual
profile (AUC), the mean response for each individual, the largest observa-
tion (peak), the half-time, and so forth (see, for example, Weiner 1981 and
Rang and Dale 1990).

As for any analysis of summary statistics, the two-stage analysis obviously
suffers from at least two problems. First, information is lost in summarizing
the vector yi of observed measurements for the ith subject by β̂i . Second,
random variability is introduced by replacing the βi in model (3.2) by
their estimates β̂i . Moreover, the covariance matrix of β̂i highly depends
on the number of measurements available for the ith subject as well as on
the time points at which these measurements were taken, and this has not
been taken into account in the second stage of the analysis. In Section 3.3,
it will be shown how this can be solved by combining the two stages into
one model, the so-called linear mixed-effects model.

3.3 The General Linear Mixed-Effects Model

3.3.1 The Model

In order to combine the models from the two-stage analysis, we replace βi
in (3.1) by expression (3.2), yielding

Yi = Xi β + Zi bi + εi , (3.7)

where Xi = Zi Ki is the appropriate (ni × p) matrix of known covari-
ates, and where all other components are as defined earlier. Model (3.7) is
called a linear mixed (-effects) model with fixed effects β and with subject-
specific effects bi . It assumes that the vector of repeated measurements
on each subject follows a linear regression model where some of the re-
gression parameters are population-specific (i.e., the same for all subjects),
whereas other parameters are subject-specific. As in Section 3.2.2, the bi
are assumed to be random and are therefore often called random effects.
In general, a linear mixed-effects model is any model which satisfies (Laird
and Ware 1982)

    Yi = Xi β + Zi bi + εi ,
    bi ∼ N (0, D),
    εi ∼ N (0, Σi ),                                           (3.8)
    b1 , . . . , bN , ε1 , . . . , εN independent,

where Yi is the ni -dimensional response vector for subject i, 1 ≤ i ≤ N , N
is the number of subjects, Xi and Zi are (ni × p) and (ni × q) dimensional
matrices of known covariates, β is a p-dimensional vector containing the
fixed effects, bi is the q-dimensional vector containing the random effects,
and εi is an ni -dimensional vector of residual components. Finally, D is a
general (q × q) covariance matrix with (i, j) element dij = dji and Σi is a
(ni × ni ) covariance matrix which depends on i only through its dimension
ni , i.e. the set of unknown parameters in Σi will not depend upon i. In
some cases, one may wish to relax this last assumption. An example of this
can be found in Lin, Raz and Harlow (1997).

It follows from (3.8) that, conditional on the random effect bi , Yi is nor-
mally distributed with mean vector Xi β + Zi bi and with covariance matrix
Σi . Further, bi is assumed to be normally distributed with mean vector
0 and covariance matrix D. Let f (yi |bi ) and f (bi ) be the corresponding
density functions. The marginal density function of Yi is then given by

    f (yi ) = ∫ f (yi |bi ) f (bi ) dbi ,

which can easily be shown to be the density function of an ni -dimensional
normal distribution with mean vector Xi β and with covariance matrix
Vi = Zi DZi′ + Σi . Hence, the marginal model implied by the two-stage
approach makes very specific assumptions about the dependence of the
mean structure and the covariance structure on the covariates Xi and Zi ,
respectively.

Since model (3.8) is defined through the distributions f (yi |bi ) and f (bi ),
it will be called the hierarchical formulation of the linear mixed model. The
corresponding marginal normal distribution with mean Xi β and covariance
Zi DZi′ + Σi is called the marginal formulation of the model. Note that,
although the marginal model naturally follows from the hierarchical one,
both models are not equivalent. We refer to Section 5.6.2 for a detailed
discussion on the differences between both models.
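The relation between the two formulations is easily illustrated numerically.
The sketch below (our own helper names, not taken from the book) computes
the marginal covariance Vi = Zi DZi′ + Σi and draws a response vector
through the hierarchical route; averaging many such draws reproduces the
marginal N (Xi β, Vi ) distribution.

```python
import numpy as np

def marginal_cov(Z, D, Sigma):
    """Marginal covariance V_i = Z_i D Z_i' + Sigma_i implied by (3.8)."""
    return Z @ D @ Z.T + Sigma

def draw_hierarchical(X, Z, beta, D, Sigma, rng):
    """Sample Y_i hierarchically: b_i ~ N(0, D), then
    Y_i | b_i ~ N(X_i beta + Z_i b_i, Sigma_i)."""
    b = rng.multivariate_normal(np.zeros(D.shape[0]), D)
    eps = rng.multivariate_normal(np.zeros(Sigma.shape[0]), Sigma)
    return X @ beta + Z @ b + eps
```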
3.3.2 Example: The Rat Data

Combining models (3.3) and (3.4) previously proposed for a two-stage
analysis of the rat data, we obtain

Yij = (β0 + b1i ) + (β1 Li + β2 Hi + β3 Ci + b2i )tij + εij , j = 1, . . . , ni ,

which can be rewritten as

    Yij = β0 + b1i + (β1 + b2i )tij + εij ,   if low dose,
        = β0 + b1i + (β2 + b2i )tij + εij ,   if high dose,        (3.9)
        = β0 + b1i + (β3 + b2i )tij + εij ,   if control.

The subject-specific profiles are assumed to be linear, with subject-specific
intercepts as well as slopes. The average evolution is also linear, with dif-
ferent slopes for the three treatment groups, but with common intercepts.
Still assuming the error components εij to be independently identically dis-
tributed with variance σ 2 , we have that the assumed covariance function
can be summarized by

    Cov(Yi (t1 ), Yi (t2 )) = (1  t1 ) D (1  t2 )′ + σ²
                           = d22 t1 t2 + d12 (t1 + t2 ) + d11 + σ² .

Note how the model now implies the variance function of the response to
be quadratic over time, with positive curvature d22 .

A model which assumes that all variability in subject-specific slopes can be
ascribed to treatment differences can be obtained by omitting the random
slopes b2i from the above model. This is the so-called random-intercepts
model. The subject-specific profiles are then still assumed to be linear, with
subject-specific intercepts, but with the same slopes within each treatment
group. The implied covariance structure assumes constant variance d11 +σ 2
over time as well as equal positive correlation ρI = d11 /(d11 + σ 2 ) between
any two measurements from the same rat. This covariance structure is
called compound symmetric, whereas the common correlation ρI is often
termed the intraclass correlation coefficient (see, e.g., Crowder and Hand
1990, p. 27). Note that ρI is large when the intersubject variability d11 is
large in comparison to the intrasubject variability σ 2 .
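The two covariance structures just derived are easily evaluated numerically.
The sketch below uses assumed values for the elements of D and for σ²,
chosen for illustration only, not estimates obtained from the rat data.

```python
import numpy as np

def cov_int_slope(t1, t2, D, sigma2):
    """(1 t1) D (1 t2)', plus measurement error sigma2 when t1 == t2."""
    z1, z2 = np.array([1.0, t1]), np.array([1.0, t2])
    return z1 @ D @ z2 + (sigma2 if t1 == t2 else 0.0)

def icc(d11, sigma2):
    """Intraclass correlation d11 / (d11 + sigma2) of the
    random-intercepts model."""
    return d11 / (d11 + sigma2)

D = np.array([[0.4, 0.05], [0.05, 0.1]])        # assumed d11, d12, d22
print(cov_int_slope(1.0, 2.0, D, sigma2=0.2))   # covariance at t1=1, t2=2
print(icc(d11=0.4, sigma2=0.2))                 # equals 2/3
```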
3.3.3 Example: The Prostate Data

Combining models (3.5) and (3.6) previously proposed for a two-stage
analysis of the prostate data, we obtain

    Yij ≡ ln(PSAij + 1)
        = β1 Agei + β2 Ci + β3 Bi + β4 Li + β5 Mi
          + (β6 Agei + β7 Ci + β8 Bi + β9 Li + β10 Mi ) tij
          + (β11 Agei + β12 Ci + β13 Bi + β14 Li + β15 Mi ) tij²
          + b1i + b2i tij + b3i tij² + εij .                      (3.10)

The subject-specific profiles are assumed to be quadratic over time, with
subject-specific intercepts as well as slopes for the linear as well as quadratic
time effect. The average evolution is also quadratic, with different intercepts
and slopes for the four diagnostic groups. If we again assume the error
components εij to be independently identically distributed with variance
σ 2 , we have that the assumed covariance function is given by
    Cov(Yi (t1 ), Yi (t2 )) = (1  t1  t1²) D (1  t2  t2²)′ + σ²
                           = d33 t1² t2² + d23 (t1² t2 + t1 t2²) + d22 t1 t2
                             + d13 (t1² + t2²) + d12 (t1 + t2 ) + d11 + σ² .

Consequently, the implied variance function is now a fourth-order polyno-
mial over time.

3.3.4 A Model for the Residual Covariance Structure

Very often, Σi is chosen to be equal to σ 2 Ini where Ini denotes the identity
matrix of dimension ni . We then call model (3.8) the conditional inde-
pendence model, since it implies that the ni responses on individual i are
independent, conditional on bi and β. As shown in Section 3.3.2, the con-
ditional independence model may imply unrealistically simple covariance
structures for the response vector Yi , especially for models with few random
effects. When there is no evidence for the presence of additional random
effects, or when additional random effects have no substantive meaning,
the covariance assumptions can often be relaxed by allowing an appro-
priate, more general, residual covariance structure Σi for the vector εi of
subject-specific error components.
FIGURE 3.1. Graphical representation of the three stochastic components in
the general linear mixed model (3.11). The solid line represents the popula-
tion-average evolution. The lines with long dashes show subject-specific evolutions
for two subjects i1 and i2 . The residual components of serial correlation and mea-
surement error are indicated by short-dashed lines and dots, respectively.

A variety of models has been proposed in the statistical literature. See,
for example, Mansour, Nordheim, and Rutledge (1985), Diem and Liukko-
nen (1988), Diggle (1988), Chi and Reinsel (1989), Rochon (1992), and
Núñez-Antón and Woodworth (1994), among many others. Most of these
models are special cases of the general model proposed by Diggle, Liang,
and Zeger (1994). They assume that εi has constant variance and can be
decomposed as εi = ε(1)i + ε(2)i in which ε(2)i is a component of serial cor-
relation, suggesting that at least part of an individual’s observed profile is
a response to time-varying stochastic processes operating within that indi-
vidual. This type of random variation results in a correlation between serial
measurements, which is usually a decreasing function of the time separa-
tion between these measurements. Further, ε(1)i is an extra component of
measurement error reflecting variation added by the measurement process
itself, and assumed to be independent of ε(2)i . A graphical representation
of the three stochastic components (random effects, serial correlation, and
measurement error) in the resulting model is shown in Figure 3.1.
FIGURE 3.2. Exponential and Gaussian serial correlation functions.

The resulting linear mixed model can now be written as




    Yi = Xi β + Zi bi + ε(1)i + ε(2)i ,
    bi ∼ N (0, D),
    ε(1)i ∼ N (0, σ² Ini ),                                    (3.11)
    ε(2)i ∼ N (0, τ² Hi ),
    b1 , . . . , bN , ε(1)1 , . . . , ε(1)N , ε(2)1 , . . . , ε(2)N independent,
and the model is completed by assuming a specific structure for the (ni ×ni )
correlation matrix Hi . Such structures are often borrowed from time-series
analysis. One usually assumes that the serial effect ε(2)i is a population
phenomenon, independent of the individual. The serial correlation matrix
Hi then only depends on i through the number ni of observations and
through the time points tij at which measurements were taken. Further, it
is assumed that the (j, k) element hijk of Hi is modeled as hijk = g(|tij −
tik |) for some decreasing function g(·) with g(0) = 1. This means that
the correlation between ε(2)ij and ε(2)ik only depends on the time interval
between the measurements yij and yik , and decreases if the length of this
interval increases.

Two frequently used functions g(·) are the exponential and Gaussian serial
correlation functions defined as g(u) = exp(−φu) and g(u) = exp(−φu2 ),
respectively (φ > 0), and which are shown in Figure 3.2 for φ = 1. Note
that the most important qualitative difference between these functions is
their behavior near u = 0, although their tail behavior is also different.
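A sketch of these two correlation functions, turned into the matrix Hi for a
given vector of measurement times (the function name and defaults below
are ours):

```python
import numpy as np

def serial_corr(times, phi, kind="exponential"):
    """Serial correlation matrix H_i with elements g(|t_ij - t_ik|),
    for g(u) = exp(-phi*u) or g(u) = exp(-phi*u**2), phi > 0."""
    u = np.abs(np.subtract.outer(times, times))
    return np.exp(-phi * u) if kind == "exponential" else np.exp(-phi * u**2)

# Correlations among measurements taken at times 0, 0.5, 1, and 2.5
H = serial_corr(np.array([0.0, 0.5, 1.0, 2.5]), phi=1.0, kind="gaussian")
```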

Although Diggle, Liang, and Zeger (1994) discuss model (3.11) in full gen-
erality, they do not fit any models which simultaneously include serial cor-
relation as well as random effects other than intercepts. They argue that,
in applications, the effect of serial correlation is very often dominated by
the combination of random effects and measurement error. In practice, this
is often reflected in estimation problems for models which include several
random effects, serial correlation, as well as measurement error. We refer to
Section 9.4 for an example. In Chapter 10, we will discuss how appropriate
residual covariance structures can be found in the presence of random ef-
fects, other than just intercepts. We also refer to Chapter 4 in the book by
Davidian and Giltinan (1995) for a discussion of components of variability
in the context of nonlinear mixed models.
4
Exploratory Data Analysis

4.1 Introduction

Most books on longitudinal data discuss exploratory analysis. See, for ex-
ample, Diggle, Liang, and Zeger (1994). However, most effort is spent on
model building and formal aspects of inference. In this section, we present
a selected set of techniques to underpin the model building. We distinguish
between two modes of display. In Section 4.2, the marginal distribution of
the responses in the Vorozole study is explored, that is, we explore the ob-
served profiles averaged over (sub)populations. Three aspects of the data
will be looked at in turn: the average evolution, the variance function, and
the correlation structure. Afterward, in Section 4.3, we will discuss some
procedures for exploring the observed profiles in a subject-specific way.

4.2 Exploring the Marginal Distribution

4.2.1 The Average Evolution

The average evolution describes how the profile for a number of relevant
subpopulations (or the population as a whole) evolves over time. The results
FIGURE 4.1. Vorozole Study. Individual profiles, raw residuals, and standardized
residuals.

of this exploration will be useful in order to choose a fixed-effects structure
for the linear mixed model.

The individual profiles are displayed in Figure 4.1, and the mean profiles,
per treatment arm, are plotted in Figure 4.2. The average profiles indicate
an increase over time which is slightly stronger for the Vorozole group. In
addition, the Vorozole group is, with the exception of month 16, consistently
higher than the AGT group. Of course, at this point it is not yet possible
to decide on the significance of this difference. It is useful to explore the
treatment difference separately since even when both evolutions might be
complicated, the treatment difference, which is often of primary interest,
could follow a simple model, or vice versa. The treatment difference is
plotted in Figure 4.3.

The individual profiles augment the averaged plot with a suggestion of
the variability seen within the data. The thinning of the data toward the
later study times suggests that trends at later times should be treated
with caution. Although these plots also give us some indications about the
variability at given times and even about the correlation between measure-
ments of the same individual, it is easier to base such considerations on
residual profiles and standardized residual profiles.
FIGURE 4.2. Vorozole Study. Mean profiles.

4.2.2 The Variance Structure

In addition to the average evolution, the evolution of the variance is impor-
tant to build an appropriate longitudinal model. Clearly, one has to correct
the measurements for the fixed-effects structure and hence raw residuals
must be used. Again, two plots are of interest. The first one pictures the
average evolution of the variance as a function of time; the second one
merely produces the individual residual plots.

The detrended profiles are displayed in Figure 4.1, and the corresponding
variance function is plotted in Figure 4.4.

The variance function seems to be relatively stable and hence a constant
variance model could be a plausible starting point. The individual de-
trended profiles show subjects’ tendency, most clearly in the Vorozole group,
to decrease right before they leave the study. In addition, the detrended
profiles suggest that the variance would decrease over time. This is in con-
tradiction with the variance function; it is entirely due to considerable
attrition. This observation suggests that caution should be used with in-
complete data.
FIGURE 4.3. Vorozole Study. Treatment difference.

4.2.3 The Correlation Structure

The correlation structure describes how measurements within a subject
correlate. The correlation function depends on a pair of times and only
under the assumption of stationarity (see Section 10.4.2 for a formal defini-
tion) does this pair of times simplify to the time lag only. This is important
since many exploratory and modeling tools are based on this assumption.
A plot of standardized residuals is useful in this respect (Figure 4.1). The
picture is not radically different from the previous individual plots, which
can be explained by the relative flatness of both mean profile and variance
functions. If one or both structures vary with time, the standardized
residuals will contribute useful additional information.

A different way of displaying the correlation structure is using a scatter
plot matrix, such as in Figure 4.5. The off-diagonal elements picture scat-
ter plots of standardized residuals obtained from pairs of measurement
occasions. The decay of correlation with time is studied by considering the
evolution of the scatters with increasing distance to the main diagonal. Sta-
tionarity on the other hand implies that the scatter plots remain similar
within diagonal bands if measurement occasions are approximately equally
spaced. In addition to the scatter plots, we place histograms on the diago-
nal, capturing the variance structure, including such features as skewness.
If the axes are given the same scales, it is very easy to capture the attrition
rate as well.
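A basic version of such a display can be produced as follows; this is a
minimal sketch with matplotlib, assuming the standardized residuals are
held in a rectangular array with one column per measurement occasion and
NaN for skipped visits.

```python
import numpy as np
import matplotlib.pyplot as plt

def scatterplot_matrix(resid):
    """Pairwise scatters of standardized residuals off the diagonal,
    histograms on the diagonal; common axis scales show attrition."""
    k = resid.shape[1]
    fig, axes = plt.subplots(k, k, figsize=(2 * k, 2 * k))
    for r in range(k):
        for c in range(k):
            ax = axes[r, c]
            if r == c:
                ax.hist(resid[~np.isnan(resid[:, r]), r], bins=15)
            else:
                ok = ~np.isnan(resid[:, r]) & ~np.isnan(resid[:, c])
                ax.scatter(resid[ok, c], resid[ok, r], s=5)
            ax.set_xlim(-3.5, 3.5)   # same scale in every panel
    plt.show()
```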
FIGURE 4.4. Vorozole Study. Variance function.

4.3 Exploring Subject-Specific Profiles

As shown in Sections 3.2 and 3.3, linear mixed models can be inter-
preted as the result of a two-stage approach, where the first stage consists
of approximating each observed longitudinal profile by an appropriate lin-
ear regression function. In this section, we propose two simple exploratory
tools to check to what extent observed longitudinal profiles can be described
by a specific linear regression model.

4.3.1 Measuring the Overall Goodness-of-Fit

In practice, one often uses the coefficient of multiple determination R2 to
measure the overall goodness-of-fit of a classical multiple linear regression
model. See, for example, Neter, Wasserman and Kutner (1990, Section 7.5).
Let

    Y = Xβ + ε    (4.1)

be a linear regression model, where Y is an N -dimensional vector, X is
an (N × p) matrix with known covariate values, and where it is assumed
that all elements in ε are independently normally distributed with mean
zero and variance σ 2 . The total sum of squares SSTO and the error sum of
FIGURE 4.5. Vorozole Study. Scatter plot matrix for selected time points. The
same vertical scale is used along the diagonal to display the attrition rate as well.

squares SSE are then defined as

    SSTO = (Y − 1N 1N′ Y /N )′ (Y − 1N 1N′ Y /N ),    (4.2)
    SSE = (Y − X(X′X)^{−1} X′Y )′ (Y − X(X′X)^{−1} X′Y ),    (4.3)
respectively, where 1N is the N -dimensional vector containing only ones.
Further, the coefficient of multiple determination is defined as

    R2 = (SSTO − SSE)/SSTO,    (4.4)

which expresses what proportion of the total observed variability in the
response values can be explained by the covariates in the matrix X. R2 is
always between zero and one. The larger R2 , the better the model describes
the observed data.

In order to assess how well a candidate first-stage linear regression model
describes observed longitudinal profiles, a coefficient of multiple determi-
nation Ri2 can be calculated for each subject separately. We therefore apply
the expressions (4.2), (4.3) and (4.4) to obtain subject-specific total and er-
ror sums of squares SSTOi and SSEi , as well as subject-specific coefficients
of multiple determination Ri2 , respectively. A histogram can now be used
to summarize the so-obtained Ri2 . Ideally, all Ri2 should be large. Typically,
only a small or moderate number ni of repeated measurements is available
for (some of) the subjects, which may result in (very) high values Ri2 . This
suggests that a fair comparison of the Ri2 should take into account the
numbers of measurements on which they are based. We therefore promote
the use of scatter plots of the Ri2 versus the ni . Examples will be given in
Sections 4.3.3 and 4.3.4.

Finally, an overall measure for the goodness-of-fit of first-stage linear re-
gression models is

    R2meta = [ Σ_{i=1}^N (SSTOi − SSEi ) ] / [ Σ_{i=1}^N SSTOi ],

which expresses what proportion of the total within-subject variability can
be explained by the first-stage linear regression models.
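A sketch of these goodness-of-fit measures (our own function name, written
in Python rather than the SAS used for the analyses in this book):

```python
import numpy as np

def subject_r2(y_list, Z_list):
    """Subject-specific R_i^2 from per-subject OLS fits, and the overall
    R2_meta = sum_i (SSTO_i - SSE_i) / sum_i SSTO_i."""
    ssto, sse, r2 = [], [], []
    for y, Z in zip(y_list, Z_list):
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        sse_i = np.sum((y - Z @ beta) ** 2)
        ssto_i = np.sum((y - y.mean()) ** 2)
        sse.append(sse_i); ssto.append(ssto_i)
        r2.append(1.0 - sse_i / ssto_i if ssto_i > 0 else np.nan)
    return np.array(r2), 1.0 - np.sum(sse) / np.sum(ssto)
```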

4.3.2 Testing for the Need of a Model Extension

Another approach toward assessing the adequacy of a linear regression
model is to test the assumed model versus an alternative model which is
an extended version of the original model. In the context of classical linear
regression, we consider again model (4.1). An indirect way of checking
whether this model is appropriate is to consider an extended version

Y = Xβ + X ∗ β ∗ + ε (4.5)

of model (4.1) and to test whether the p∗ -dimensional vector β ∗ of regres-
sion parameters, corresponding to the additional covariates in X ∗ , equals
zero. Let SSE(F ) and SSE(R) denote the error sums of squares, as defined
in (4.3), for the full model (4.5) and the reduced model (4.1), respectively.
The null hypothesis H0 : β ∗ = 0 can then be tested using the test statistic

    F = [(SSE(R) − SSE(F ))/p∗ ] / [SSE(F )/(N − p − p∗ )],    (4.6)

which follows an F -distribution with p∗ and N − p − p∗ degrees of freedom,
under H0 and assuming that model (4.5) is correct (see Neter, Wasserman
and Kutner 1990, Section 8.6). The null hypothesis is rejected for large
observed values of the above F -statistic.

In a longitudinal data context, a similar approach can now be used to test
a specific candidate first-stage linear regression model versus an extended
version. We therefore first calculate all subject-specific error sums of squares
SSEi (F ) and SSEi (R) under the full model and under the reduced model,
respectively. Afterward, an overall test statistic can be calculated as

    Fmeta = [ Σ_{i: ni ≥ p+p∗} (SSEi (R) − SSEi (F )) / Σ_{i: ni ≥ p+p∗} p∗ ]
            / [ Σ_{i: ni ≥ p+p∗} SSEi (F ) / Σ_{i: ni ≥ p+p∗} (ni − p − p∗ ) ],

where the sums are taken over all subjects with at least p + p∗ mea-
surements. Assuming that the candidate first-stage model was correctly
specified, and assuming that all residuals are independently normally dis-
tributed with mean zero and with some common variance, we have that
the above test statistic follows an F -distribution with Σ_{i: ni ≥ p+p∗} p∗
and Σ_{i: ni ≥ p+p∗} (ni − p − p∗ ) degrees of freedom.

It should be emphasized that the above testing procedure relies on specific
distributional assumptions, which are not necessarily satisfied in a longi-
tudinal data context. For example, the within-subjects errors are assumed
to be independent, having common variance over the subjects. Hence, re-
ferring back to the general linear mixed model discussed in Section 3.3.4,
the absence of residual serial correlation within subjects is implicitly as-
sumed. We therefore propose using this approach as a general tool for ex-
ploring how a specific first-stage regression model can be improved, rather
than a formal testing procedure for the adequacy of the model. In this re-
spect, it is also advisable to always use this procedure in combination with
the goodness-of-fit measures discussed in Section 4.3.1. In Sections 4.3.3
and 4.3.4, two examples will be given, for which all calculations have been
performed using a SAS macro available from the website.
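For readers without access to that macro, the following sketch shows the
same computation in Python (hypothetical function names; scipy is used
only for the F tail probability). Each subject contributes its reduced- and
full-model error sums of squares, provided it has at least p + p∗ measure-
ments.

```python
import numpy as np
from scipy import stats

def f_meta(y_list, Z_red_list, Z_full_list):
    """Overall F-test of a reduced first-stage model against a full one,
    pooling per-subject error sums of squares as in Section 4.3.2."""
    def sse(y, Z):
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        return np.sum((y - Z @ beta) ** 2)
    num_ss = den_ss = 0.0
    num_df = den_df = 0
    for y, Zr, Zf in zip(y_list, Z_red_list, Z_full_list):
        if len(y) < Zf.shape[1]:              # needs at least p + p* points
            continue
        num_ss += sse(y, Zr) - sse(y, Zf)
        num_df += Zf.shape[1] - Zr.shape[1]   # p* per qualifying subject
        den_ss += sse(y, Zf)
        den_df += len(y) - Zf.shape[1]        # n_i - p - p*
    F = (num_ss / num_df) / (den_ss / den_df)
    return F, num_df, den_df, stats.f.sf(F, num_df, den_df)
```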

4.3.3 Example: The Rat Data

As a first example, we explore the adequacy of the first-stage model (3.3)
previously used in Section 3.2.3 for the rat data. Recall that it was assumed
that, apart from residual variability, the response is a linear function of the
transformed timescale tij = ln[1 + (Ageij − 45)/10]. The left panel in
Figure 4.6 shows a scatter plot of the subject-specific coefficients Ri2 of
multiple determination, versus the numbers ni of repeated measurements.
The overall coefficient R2meta of multiple determination, represented by the
dashed line, equals 0.9294, indicating that the model explains about 93% of
the total within-subject variability. All except two coefficients Ri2 are larger
than 0.85, suggesting that our first-stage model fits the observed profiles
reasonably well. However, comparing the model with an extended model
which assumes quadratic subject-specific evolutions yields Fmeta = 1.5347,
FIGURE 4.6. Rat Data. Subject-specific coefficients Ri2 of multiple determination
and the overall coefficient R2meta of multiple determination (dashed lines), for
first-stage models which assume linear (left panel) as well as quadratic (right
panel) subject-specific profiles.

on 43 and 113 degrees of freedom, which is significant at the 5% level
(p = 0.0382). This suggests that adding quadratic terms to the first-stage
model might improve the fit considerably.

The right panel in Figure 4.6 shows a scatter plot of the Ri2 versus the ni , for
the so-obtained quadratic first-stage model. The overall coefficient R2meta
now equals 0.9554, which is a rather small improvement when compared
to the original model. This is also reflected in the scatter plot. Note also
that all rats with at most three measurements have Ri2 = 1. Testing for
the need of an additional cubic term in the first-stage model results in
Fmeta = 1.3039, which is not significant (p = 0.1633) when compared to an
F -distribution with 38 and 75 degrees of freedom. Strictly speaking, this
is evidence in favor of using the quadratic first-stage model rather than
the original linear model. However, in view of the good fits which were
already obtained with the linear model, and in order to keep our models as
parsimonious as possible, our further analyses of the rat data will be based
on the original linear first-stage model (3.3).

4.3.4 Example: The Prostate Data

As a second example, we explore the adequacy of the first-stage model (3.5)
previously used in Section 3.2.4 for the prostate data. Recall that it was
assumed that, apart from residual variability, the response is a quadratic
function of time before diagnosis (expressed in decades). The same checks
were performed as for the rat data, described in Section 4.3.3. The results
are shown in Figure 4.7.
FIGURE 4.7. Prostate Data. Subject-specific coefficients Ri2 of multiple determi-
nation and the overall coefficient R2meta of multiple determination (dashed lines),
for first-stage models which assume linear (left panel) as well as quadratic (right
panel) subject-specific profiles.

Although the linear two-stage model explains about 82% of all within-
subject variability (R2meta = 0.8188), many longitudinal profiles are badly
described (left panel of Figure 4.7). For example, for 8 out of the 54 profiles,
less than 10% of the observed variability could be explained by a linear fit.
As can be expected from observing the individual profiles in Figure 2.3, the
linear model is strongly rejected when compared to a quadratic first-stage
model (Fmeta = 6.2181, 54 and 301 degrees of freedom, p < 0.0001).

The right panel in Figure 4.7 shows the coefficients Ri2 versus the ni , for this
quadratic model. The new model explains about 10% more within-subject
variability (R2meta increased to 0.9143). Testing for the need of an additional
cubic term results in Fmeta = 1.2310, which is not significant (p = 0.1484)
when compared to an F -distribution with 54 and 247 degrees of freedom.
This clearly supports the first-stage model (3.5) proposed in Section 3.2.4.
The fact that some individual coefficients Ri2 are still quite small, although
the quadratic model has not been rejected, suggests the presence of a con-
siderable amount of residual variability. This will be confirmed later, in
Section 9.5.
5
Estimation of the Marginal Model

5.1 Introduction

As discussed in Section 3.3, the general linear mixed model (3.8) implies
the marginal model
    Yi ∼ N (Xi β, Zi DZi′ + Σi ).    (5.1)
Unless the data are analyzed in a Bayesian framework (see, e.g., Gelman et
al. 1995), inference is based on this marginal distribution for the response
Yi . It should be emphasized that the hierarchical structure of the original
model (3.8) is then not taken into account. Indeed, the marginal model (5.1)
is not equivalent to the original hierarchical model (3.8). Inferences based on
the marginal model do not explicitly assume the presence of random effects
representing the natural heterogeneity between subjects. An example of this
can be found in Section 5.6.2. In this and the next chapter, we will discuss
inference for the parameters in the marginal distribution (5.1). Later, in
Chapter 7, it will be shown how the random effects can be estimated under
the explicit assumption that Yi satisfies model (3.8).

Let α denote the vector of all variance and covariance parameters (usually
called variance components) found in Vi = Zi DZi′ + Σi , that is, α consists of
the q(q+1)/2 different elements in D and of all parameters in Σi . Finally, let
θ = (β ′ , α′ )′ be the s-dimensional vector of all parameters in the marginal
model for Yi , and let Θ = Θβ × Θα denote the parameter space for θ, with
Θβ and Θα the parameter spaces for the fixed effects and for the variance
components respectively. Note that Θβ = IRp , and Θα equals the set of
values for α such that D and all Σi are positive (semi-)definite.

The classical approach to inference is based on estimators obtained from
maximizing the marginal likelihood function

    LML (θ) = Π_{i=1}^N (2π)^{−ni/2} |Vi (α)|^{−1/2}
              × exp[ −(1/2) (Yi − Xi β)′ Vi^{−1}(α) (Yi − Xi β) ]    (5.2)

with respect to θ. Let us first assume α to be known. The maximum like-
lihood estimator (MLE) of β, obtained from maximizing (5.2), conditional
on α, is then given by (Laird and Ware 1982)

    β̂(α) = ( Σ_{i=1}^N Xi′ Wi Xi )^{−1} Σ_{i=1}^N Xi′ Wi yi ,    (5.3)

where Wi equals Vi−1 .

When α is not known, but an estimate α̂ is available, we can set V̂i = Vi (α̂)
and Ŵi = V̂i−1 , and estimate β by using the expression (5.3) in which Wi
is replaced by Ŵi . Two frequently used methods for estimating α are maxi-
mum likelihood estimation and restricted maximum likelihood estimation,
which will be discussed and compared in Sections 5.2 and 5.3, respec-
tively.
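For known variance components, estimator (5.3) is a weighted (generalized)
least squares computation. A minimal sketch, assuming lists of per-subject
responses, design matrices, and covariance matrices Vi evaluated at the
given α:

```python
import numpy as np

def gls_beta(y_list, X_list, V_list):
    """Estimator (5.3): (sum_i Xi' Wi Xi)^{-1} sum_i Xi' Wi yi,
    with Wi = Vi^{-1}."""
    p = X_list[0].shape[1]
    A, b = np.zeros((p, p)), np.zeros(p)
    for y, X, V in zip(y_list, X_list, V_list):
        W = np.linalg.inv(V)
        A += X.T @ W @ X
        b += X.T @ W @ y
    return np.linalg.solve(A, b)
```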

5.2 Maximum Likelihood Estimation

The maximum likelihood estimator (MLE) of α is obtained by maximizing
(5.2) with respect to α, after β is replaced by (5.3). This approach arises
naturally when we consider the estimation of β and α simultaneously by
maximizing the joint likelihood (5.2).
5.3 Restricted Maximum Likelihood Estimation

5.3.1 Variance Estimation in Normal Populations

As an introductory example to restricted maximum likelihood estimation,
consider the case where the variance of a normal distribution N (µ, σ 2 ) is
to be estimated based on a sample Y1 , . . . , YN of N observations. When
the mean µ is known, the MLE for σ 2 equals σ̂ 2 = Σi (Yi − µ)2 /N , which
is unbiased for σ 2 . When µ is not known, we get the same expression for
the MLE, but with µ replaced by the sample mean Ȳ = Σi Yi /N . One can
then easily show that

    E[ σ̂ 2 ] = ((N − 1)/N ) σ 2 ,    (5.4)

indicating that the MLE is now biased downward, due to the estimation of
µ. Note, however, that an unbiased estimate is easily obtained from expres-
sion (5.4), yielding the classical sample variance S 2 = Σi (Yi − Ȳ )2 /(N − 1).

Apparently, directly obtaining an unbiased estimate for σ 2 should be based
on a statistical procedure which does not require estimation of µ first. This
can be done as follows. Let Y = (Y1 , . . . , YN )′ denote the vector of all mea-
surements, and let 1N be the N -dimensional vector containing only ones.
The distribution of Y is then N (µ 1N , σ 2 IN ) where, as before, IN equals
the N -dimensional identity matrix. Let A be any N × (N − 1) matrix with
N − 1 linearly independent columns orthogonal to the vector 1N . We then
define the vector U of N − 1 so-called error contrasts by U = A′Y , which
now follows a normal distribution with mean vector 0 and covariance ma-
trix σ 2 A′A. Maximizing the corresponding likelihood with respect to the
only remaining parameter σ 2 yields σ̂ 2 = Y ′A(A′A)^{−1}A′Y /(N − 1), which
can be shown to equal the classical sample variance S 2 previously derived
from expression (5.4). Note that any matrix A satisfying the specified con-
ditions leads to the same estimator for σ 2 . The resulting estimator for σ 2
is often called the restricted maximum likelihood (REML) estimator since
it is restricted to (N − 1) error contrasts.
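The argument is easy to verify numerically. In the sketch below, a basis A
for the orthogonal complement of 1N is obtained from a QR decomposition
(one of many valid choices, in line with the invariance noted above), and the
resulting REML estimate coincides with the sample variance S 2 .

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(5.0, 2.0, size=50)
N = len(y)

# Columns 2,...,N of Q form an orthonormal basis orthogonal to 1_N
Q, _ = np.linalg.qr(np.column_stack([np.ones(N), np.eye(N)]))
A = Q[:, 1:N]                                   # N x (N-1), A' 1_N = 0

# REML estimate based on the error contrasts U = A'Y
sigma2_reml = y @ A @ np.linalg.solve(A.T @ A, A.T @ y) / (N - 1)
assert np.isclose(sigma2_reml, y.var(ddof=1))   # equals S^2
```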

5.3.2 Estimation of Residual Variance in Linear Regression

As a second example, we now consider the estimation of the residual vari-
ance σ 2 in a linear regression model Y = Xβ + ε, where Y is an N -
dimensional vector, and with X an (N × p) matrix with known covariate
values. It is assumed that all elements in ε are independently normally
distributed with mean zero and variance σ 2 . The MLE for σ 2 equals

    σ̂ 2 = (Y − X(X′X)^{−1}X′Y )′ (Y − X(X′X)^{−1}X′Y )/N,

which can easily be shown to be biased downward by a factor (N − p)/N .

Similarly, as in Section 5.3.1, σ 2 can be estimated using a set of error con-
trasts U = A′Y where A is now any N × (N − p) matrix with N − p linearly
independent columns orthogonal to the columns of the design matrix X.
We then have that U follows a normal distribution with mean vector 0 and
covariance matrix σ 2 A′A, in which σ 2 is again the only unknown parame-
ter. Maximizing the corresponding likelihood with respect to σ 2 yields

    σ̂ 2 = (Y − X(X′X)^{−1}X′Y )′ (Y − X(X′X)^{−1}X′Y )/(N − p),

which is the mean squared error, unbiased for σ 2 , and classically used as
estimator for the residual variance in linear regression analysis (see, for
example, Neter, Wasserman, and Kutner 1990, Chapter 7; Seber 1977, Sec-
tion 3.3). As in Section 5.3.1, we again have that any matrix A satisfying the
specified conditions leads to the same estimator for the residual variance,
which is again called the REML estimator for σ 2 .

5.3.3 REML Estimation for the Linear Mixed Model

In practice, linear mixed models often contain many fixed effects. For ex-
ample, the linear mixed model (3.10) which immediately followed from the
two-stage model proposed in Section 3.2.4 for the prostate data, has a 15-
dimensional vector β of parameters in the mean structure. In such cases,
it may be important to estimate the variance components, explicitly tak-
ing into account the loss of the degrees of freedom involved in estimating
the fixed effects. In contrast to the simple cases discussed in Sections 5.3.1
and 5.3.2, an unbiased estimator for the vector α of variance components
cannot be obtained from appropriately transforming the ML estimator as
suggested from the analytic calculation of its bias. However, the error con-
trasts approach can still be applied as follows. We first combine all N
subject-specific regression models (3.8) to one model:

Y = Xβ + Zb + ε, (5.5)

where the vectors Y , b, and ε, and the matrix X are obtained from stacking
the vectors Yi , bi , and εi , and the matrices Xi respectively, underneath
each other, and where Z is the block-diagonal matrix with blocks Z i on
N
the main diagonal and zeros elsewhere. The dimension of Y equals i=1 ni
and will be denoted by n.
The marginal distribution for Y is normal with mean vector Xβ and with
covariance matrix V (α) equal to the block-diagonal matrix with blocks Vi
on the main diagonal and zeros elsewhere. The REML estimator for the
variance components α is now obtained from maximizing the likelihood
function of a set of error contrasts U = A′Y where A is any (n × (n − p))
full-rank matrix with columns orthogonal to the columns of the X matrix.
The vector U then follows a normal distribution with mean vector zero
and covariance matrix A′V (α)A, which no longer depends on β.
Further, Harville (1974) has shown that the likelihood function of the error
contrasts can be written as
    L(α) = (2π)^{−(n−p)/2} | Σ_{i=1}^N Xi′ Xi |^{1/2}
           × | Σ_{i=1}^N Xi′ Vi^{−1} Xi |^{−1/2} Π_{i=1}^N |Vi |^{−1/2}
           × exp[ −(1/2) Σ_{i=1}^N (Yi − Xi β̂)′ Vi^{−1} (Yi − Xi β̂) ],    (5.6)

where β̂ is given by (5.3). Hence, as in the simple examples described
in Sections 5.3.1 and 5.3.2, the so-obtained REML estimator α̂ does not
depend on the error contrasts (i.e., the choice of A).

Note that the maximum likelihood estimators for the mean of a univariate
normal population and for the vector of regression parameters in a linear
regression model are independent of the residual variance σ 2 . Hence, the
estimates for the mean structures of the two examples in Sections 5.3.1
and 5.3.2 do not change if REML estimates are used for the variance com-
ponents, rather than ML estimates. However, it follows from (5.3) that this
no longer holds in the general linear mixed model. Thus, we have that al-
though REML estimation is only with respect to the variance components
in the model, the “REML” estimator for the vector of fixed effects is not
identical to its ML version. This will be illustrated in Section 5.5, where
model (3.10) will be fitted to the prostate cancer data.

Finally, note that the likelihood function in (5.6) equals

    L(α) = C | Σ_{i=1}^N Xi′ Wi (α)Xi |^{−1/2} LML (β̂(α), α),    (5.7)

where C is a constant not depending on α, where, as earlier, Wi (α) equals
Vi−1 (α), and where LML (β, α) = LML (θ) is the ML likelihood function
given by (5.2). Because | Σ_{i=1}^N Xi′ Wi (α)Xi | in (5.7) does not depend
on β, it follows that the REML estimators for α and for β can also be found by
maximizing the so-called REML likelihood function

    LREML (θ) = | Σ_{i=1}^N Xi′ Wi (α)Xi |^{−1/2} LML (θ)    (5.8)

with respect to all parameters simultaneously (α and β).
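As an illustration, the REML log-likelihood (5.8), up to an additive con-
stant, can be evaluated at given variance components as follows (a sketch
with our own function name; in practice this function would be maximized
over α with a Newton-type optimizer, as discussed in Section 5.4):

```python
import numpy as np

def reml_loglik(y_list, X_list, V_list):
    """log L_REML in (5.8), up to a constant, at given Vi = Vi(alpha);
    beta is profiled out via (5.3)."""
    p = X_list[0].shape[1]
    W_list = [np.linalg.inv(V) for V in V_list]
    XtWX, XtWy = np.zeros((p, p)), np.zeros(p)
    for y, X, W in zip(y_list, X_list, W_list):
        XtWX += X.T @ W @ X
        XtWy += X.T @ W @ y
    beta = np.linalg.solve(XtWX, XtWy)           # beta_hat(alpha), eq. (5.3)
    ll = -0.5 * np.linalg.slogdet(XtWX)[1]       # the REML factor in (5.8)
    for y, X, V, W in zip(y_list, X_list, V_list, W_list):
        r = y - X @ beta
        ll -= 0.5 * (np.linalg.slogdet(V)[1] + r @ W @ r)
    return ll
```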

5.3.4 Justification of REML Estimation

The main justification of the REML approach has been given by Patterson
and Thompson (1971), who prove that, in the absence of information on β,
no information about α is lost when inference is based on U rather than on
Y . More precisely, U is marginally sufficient for α in the sense described by
Sprott (1975) (see also Harville 1977). Further, Harville (1974) has shown
that, from a Bayesian point of view, using only error contrasts to make
inferences on α is equivalent to ignoring any prior information on β and
using all the data to make those inferences.

5.3.5 Comparison Between ML and REML Estimation

Maximum likelihood estimation and restricted maximum likelihood estima-
tion both have the same merits of being based on the likelihood principle
which leads to useful properties such as consistency, asymptotic normality,
and efficiency. ML estimation also provides estimators of the fixed effects,
whereas REML estimation, in itself, does not. On the other hand, for bal-
anced mixed ANOVA models, the REML estimates for the variance compo-
nents are identical to classical ANOVA-type estimates obtained from solv-
ing the equations which set mean squares equal to their expectations. This
implies optimal minimum variance properties, and it shows that REML
estimates in that context do not rely on any normality assumption since
only moment assumptions are involved (Harville 1977 and Searle, Casella,
and McCulloch 1992).

Also with regard to the mean squared error for estimating α, there is no
indisputable preference for either one of the two estimation procedures,
since it depends on the specifics of the underlying model and possibly on
the true value of α. For ordinary ANOVA or regression models, the ML
estimator of the residual variance σ 2 has uniformly smaller mean squared
error than the REML estimator when p = rank(X) ≤ 4, but the opposite is
true when p > 4 and n − p is sufficiently large (n − p > 2 suffices if p > 12).
In general, one may expect results from ML and REML estimation to differ
more as the number p of fixed effects in the model increases. We hereby
refer to Section 13.5 for an example with extremely many covariates in
the mean structure, leading to severe differences between ML and REML
estimates. More details on this and related topics can be found in Harville
(1977).

5.4 Model-Fitting Procedures

In the literature, several methods for the actual calculation of the ML or
REML estimates have been described. Dempster, Laird and Rubin (1977),
for example, have introduced the EM algorithm for the calculation of MLEs
based on incomplete data and have illustrated how it can be used for the
estimation of variance components in mixed-model analysis of variance.
Laird and Ware (1982) have shown how this EM algorithm not only can be
applied to obtain MLEs, but also to calculate the REML estimates through
an empirical Bayesian approach. Note that, strictly speaking, no data are
missing: The EM algorithm is only used to “estimate” the unobservable
parameters (i.e., the random effects bi ). The main advantage of the EM
algorithm is that the general theory (Dempster, Laird, and Rubin 1977) as-
sures that each iteration increases the likelihood. However, Laird and Ware
(1982) report slow convergence of the estimators of the variance compo-
nents, especially when the maximum likelihood is on or near the boundary
of the parameter space. We refer to Chapter 22 for more details on the EM
algorithm.

Therefore, nowadays, one usually uses Newton-Raphson-based procedures
to estimate all parameters in the model. Details about the implementation
of such algorithms, together with expressions for all first- and second-order
derivatives of LML and LREML with respect to all parameters in θ can be
found in Lindstrom and Bates (1988).

Note that fitting the general linear mixed model (3.8) requires maximiza-
tion of LML and LREML over the parameter space Θ, which consists of all
vectors θ which yield positive (semi-)definite matrices D and Σi . On the
other hand, the marginal model (5.1) only requires all Vi = Zi DZi + Σi
to be positive (semi-)definite. This is why some statistical packages maxi-
mize the likelihood functions over a parameter space which is larger than
Θ. For example, the SAS procedure MIXED (Version 6.12), by default,
only requires the diagonal elements of D and all Σi to be positive, which
probably stems from classical variance-components models, where the ran-
dom effects are assumed to be independent of each other (see, for example,
Searle, Casella and McCulloch 1992, Chapter 6). An example of this will
be given in Section 5.6.2.
FIGURE 5.1. Prostate Data. Fitted average profiles for males of median ages at
diagnosis, based on the model (3.10), where the parameters are replaced by their
REML estimates.

5.5 Example: The Prostate Data

Table 5.1 shows the maximum likelihood as well as the restricted maximum
likelihood estimates for all parameters in the marginal model corresponding
to model (3.10), where time is expressed in decades prior to diagnosis rather
than years prior to diagnosis (for reasons which will be explained further
in Section 5.6.1). Recall that the residual variability in this model is pure
measurement error, that is, εi = ε(1)i , with notation as in Section 3.3.4.

As can be expected from the theory in Section 5.3, the ML estimates de-
viate most from the REML estimates for the variance components in the
ML estimates. Note that the same is true for the REML estimates for
the residual variance in normal populations or in linear regression models
when compared to the ML estimates, as described in Section 5.3.1 and
Section 5.3.2, respectively. Further, Table 5.1 illustrates the fact that the
REML estimates for the fixed effects are also different from the ML esti-
mates. Figure 5.1 shows, for each diagnostic group separately, the fitted
average profile for a male of median age at diagnosis.
TABLE 5.1. Prostate Data. Maximum likelihood and restricted maximum likeli-
hood estimates (MLE and REMLE) and standard errors (model based; robust) for
all fixed effects and all variance components in model (3.10), with time expressed
in decades before diagnosis.

Effect                 Parameter    MLE (s.e.)         REMLE (s.e.)
Age effect             β1            0.026 (0.013)      0.027 (0.014;0.016)
Intercepts:
  Control              β2           −1.077 (0.919)     −1.098 (0.976;1.037)
  BPH                  β3           −0.493 (1.026)     −0.523 (1.090;1.190)
  L/R cancer           β4            0.314 (0.997)      0.296 (1.059;1.100)
  Met. cancer          β5            1.574 (1.022)      1.549 (1.086;1.213)
Age×time effect        β6           −0.010 (0.020)     −0.011 (0.021;0.024)
Time effects:
  Control              β7            0.511 (1.359)      0.568 (1.473;1.640)
  BPH                  β8            0.313 (1.511)      0.396 (1.638;1.853)
  L/R cancer           β9           −1.072 (1.469)     −1.036 (1.593;1.646)
  Met. cancer          β10          −1.657 (1.499)     −1.605 (1.626;2.038)
Age×time² effect       β11           0.002 (0.008)      0.002 (0.009;0.010)
Time² effects:
  Control              β12          −0.106 (0.549)     −0.130 (0.610;0.688)
  BPH                  β13          −0.119 (0.604)     −0.158 (0.672;0.774)
  L/R cancer           β14           0.350 (0.590)      0.342 (0.656;0.683)
  Met. cancer          β15           0.411 (0.598)      0.395 (0.666;0.844)
Covariance of bi:
  var(b1i)             d11           0.398 (0.083)      0.452 (0.098)
  var(b2i)             d22           0.768 (0.187)      0.915 (0.230)
  var(b3i)             d33           0.103 (0.032)      0.131 (0.041)
  cov(b1i, b2i)        d12 = d21    −0.443 (0.113)     −0.518 (0.136)
  cov(b2i, b3i)        d23 = d32    −0.273 (0.076)     −0.336 (0.095)
  cov(b3i, b1i)        d13 = d31     0.133 (0.043)      0.163 (0.053)
Residual variance:
  var(εij)             σ²            0.028 (0.002)      0.028 (0.002)
Log-likelihood                      −1.788             −31.235

5.6 Estimation Problems

As discussed in Section 5.4, the fitting of linear mixed models is usually done via Newton-Raphson-based procedures. Based on some starting values for the parameters, these procedures iteratively update the estimates until sufficient convergence has been obtained. When fitting complex linear
mixed models, the practicing statistician is often faced with nonconverging
iteration processes, in the sense that the iterative process does not converge
at all, or that it converges to parameter values on or outside the boundary
of the parameter space. In some cases, this can be solved by specifying
better starting values, or by using other numerical procedures. We refer to
Section 9.4 for an example where convergence problems with the Newton-
Raphson procedure could be solved by using the Fisher scoring method
which uses the expected Hessian matrix (the matrix of second-order deriv-
atives) of the log-likelihood function rather than the observed one. In many
cases however, divergence is an indicator of substantial problems with the
parameterization of the model or the assumptions implied by the model.
Two examples of frequently occurring problems will now be given in Sec-
tions 5.6.1 and 5.6.2. It should be emphasized that such numerical prob-
lems always arise from estimating the variance components in the model,
not from estimating the fixed effects. This can easily be explained from
the fact that the classical ordinary least squares estimator for the vector of
fixed effects, although completely ignoring the longitudinal structure of the
data, is unbiased and consistent and therefore provides good starting val-
ues for the fixed effects. This is in contrast to the variance components for
which good starting values are often hard to obtain, especially in complex
models with many variance components. Also, as explained in Section 5.1,
iterative procedures are, strictly speaking, only required for the estimation
of the variance components, not for the estimation of the fixed effects.

5.6.1 Estimation Problems due to Small Variance Components

When we first fitted a linear mixed model to the prostate cancer data, time
was expressed in decades before diagnosis, rather than years before diag-
nosis as in the original data set (see Section 5.5). This was done to prevent the random slopes for the linear and quadratic time effects from showing too little variability, which might lead to divergence of the numerical maximization routine. To illustrate this, we refit our mixed model using
the SAS procedure MIXED (SAS Version 6.12), but we express time as
months before diagnosis. The procedure failed to converge. The estimates
for the variance components at the last iteration are shown in Table 5.2.

TABLE 5.2. Prostate Data. Restricted maximum likelihood estimates (REMLE) at the last iteration for all variance components in model (3.10), with time expressed in months before diagnosis.

Effect               Parameter    REMLE
Covariance of bi:
  var(b1i)           d11           0.36893546
  var(b2i)           d22           0.00003846
  var(b3i)           d33           0.00000000
  cov(b1i, b2i)      d12 = d21    −0.00244046
  cov(b2i, b3i)      d23 = d32    −0.00000011
  cov(b3i, b1i)      d13 = d31     0.00000449
Residual variance:
  var(εij)           σ²            0.03259207

Note how the reported estimate for the variance of the random slopes for
the quadratic time effect equals d33 = 0.00000000. This suggests that there
is very little variability among the quadratic time effects, requiring the iter-
ative procedure to converge to a point which is very close to the boundary
of the parameter space (since only non-negative variances are allowed), if
not on the boundary of the parameter space. This can produce numerical
difficulties in the maximization process. One way of circumventing this is
by artificially enlarging the true value d33 .

Let the model we just fitted for Yij be written as

$$Y_{ij} \;=\; X_i^{[j]}\beta + b_{1i} + b_{2i}\,t_{ij} + b_{3i}\,t_{ij}^2 + \varepsilon_{ij},$$

where $X_i^{[j]}$ is the jth row of Xi, where tij is time expressed as months before diagnosis, and where the random effects b1i, b2i, and b3i have covariance matrix

$$\operatorname{var}(b_i) \;=\; D \;=\; \begin{pmatrix} d_{11} & d_{12} & d_{13} \\ d_{12} & d_{22} & d_{23} \\ d_{31} & d_{32} & d_{33} \end{pmatrix}.$$

We can then reformulate the model as

$$Y_{ij} \;=\; X_i^{[j]}\beta + b_{1i} + 120\,b_{2i}\left(\frac{t_{ij}}{120}\right) + (120)^2\,b_{3i}\left(\frac{t_{ij}}{120}\right)^2 + \varepsilon_{ij} \;=\; X_i^{[j]}\beta + b_{1i}^* + b_{2i}^*\,t_{ij}^* + b_{3i}^*\,t_{ij}^{*2} + \varepsilon_{ij},$$

which is a new linear mixed-effects model, in which $t_{ij}^*$ is now expressed in decades before diagnosis, and where the random effects $b_{1i}^* \equiv b_{1i}$, $b_{2i}^*$, and $b_{3i}^*$ now have covariance matrix

$$\operatorname{var}(b_i^*) \;=\; D^* \;=\; \begin{pmatrix} (120)^0\,d_{11} & (120)^1\,d_{12} & (120)^2\,d_{13} \\ (120)^1\,d_{12} & (120)^2\,d_{22} & (120)^3\,d_{23} \\ (120)^2\,d_{31} & (120)^3\,d_{32} & (120)^4\,d_{33} \end{pmatrix}.$$

This transformation enlarges the covariance parameters substantially, which implies that the peak of the log-likelihood is well away from the boundary and that the evaluated second-order derivatives are well away from zero.
The normal equations, which are the equations to be solved in the max-
imization algorithm, now form a system which is much more stable and
which can easily be solved without any convergence problems. In general,
we therefore recommend always rescaling linear mixed models with random
effects which are expected to have (very) small variability.
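To make the effect of this rescaling concrete, the following minimal Python sketch verifies that expressing time in decades rather than months multiplies the covariance elements by the appropriate powers of 120. The matrix D below is hypothetical, with deliberately tiny slope variances; it is not taken from Table 5.2.

```python
import numpy as np

# Rescaling time, t* = t / 120, replaces (b1, b2, b3) by (b1, 120 b2, 120^2 b3),
# so var(b*) = S D S with S = diag(1, 120, 120^2).
# D is a hypothetical covariance matrix with very small slope variances.
c = 120.0
D = np.array([[ 3.7e-1, -2.4e-3,  4.5e-6],
              [-2.4e-3,  3.8e-5, -1.1e-7],
              [ 4.5e-6, -1.1e-7,  1.0e-9]])
S = np.diag([1.0, c, c**2])
D_star = S @ D @ S          # covariance on the decade scale
print(np.round(D_star, 3))  # all entries now well away from zero
```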

5.6.2 Estimation Problems due to Model Misspecifications

As explained in Section 5.1, inference is based on the marginal model (5.1), rather than on the original, more restrictive, hierarchical model (3.8). In
practice, this often causes numerical maximization procedures not to con-
verge to parameter values in the interior of the parameter space implied by
the hierarchical model.

As an illustration, we consider the linear mixed model (3.9) proposed in Section 3.3.2 for the rat data introduced in Section 2.1. Table 5.3 shows the re-
stricted maximum likelihood estimates, obtained using the SAS procedure
MIXED (version 6.12), for all parameters in the corresponding marginal
model. Recall that the residual variability in this model is pure measure-
ment error, that is, εi = ε(1)i , with notation as in Section 3.3.4. Note that
the variance of the random slopes is estimated by 0, and no standard error
is reported. In contrast to the example given in Section 5.6.1, this cannot
be solved by reparameterizing the model. Instead, the zero estimate indi-
cates that the maximum of the REML log-likelihood function is really on
the boundary of the parameter space. Indeed, as discussed in Section 5.4,
SAS maximizes LREML under the restriction that all diagonal elements dii
in D as well as σ 2 are positive. Our results now suggest that the REML
likelihood could be further increased by removing these restrictions and al-
lowing some of the variance components dii or σ 2 to become negative. The
parameter estimates obtained from refitting the model without any restric-
tions on the parameter space are also reported in Table 5.3. As expected,
we now get a negative estimate for the variance d22 of the random slopes.
Note also that this has further increased the REML log-likelihood value.
TABLE 5.3. Rat Data. Restricted maximum likelihood estimates (REMLE) and standard errors for all fixed effects and all variance components in the marginal model corresponding to model (3.9), for two different parameter restrictions for the variance components α.

                                  Parameter restrictions for α
                                  dii ≥ 0, σ² ≥ 0       dii ∈ IR, σ² ∈ IR
Effect               Parameter    REMLE (s.e.)          REMLE (s.e.)
Intercept            β0           68.606 (0.325)        68.618 (0.313)
Time effects:
  Low dose           β1            7.503 (0.228)         7.475 (0.198)
  High dose          β2            6.877 (0.231)         6.890 (0.198)
  Control            β3            7.319 (0.285)         7.284 (0.254)
Covariance of bi:
  var(b1i)           d11           3.369 (1.123)         2.921 (1.019)
  var(b2i)           d22           0.000 ( )            −0.287 (0.169)
  cov(b1i, b2i)      d12 = d21     0.090 (0.381)         0.462 (0.357)
Residual variance:
  var(εij)           σ²            1.445 (0.145)         1.522 (0.165)
REML log-likelihood               −466.173              −465.193

It should be strongly emphasized, however, that the resulting model does not allow any hierarchical interpretation, since no random-effects structure could ever yield a marginal model as has now been obtained. On the other hand, as long as all covariance matrices Vi = Zi D Zi′ + σ²Ini are positive (semi-)definite, that is, as long as the covariates Zi take values within a specific range, a valid marginal model is obtained. In our example, the variance function is predicted by

$$\widehat{\operatorname{Var}}(Y_i(t)) \;=\; \begin{pmatrix} 1 & t \end{pmatrix} \widehat{D} \begin{pmatrix} 1 \\ t \end{pmatrix} + \widehat{\sigma}^2 \;=\; \widehat{d}_{22}\,t^2 + 2\,\widehat{d}_{12}\,t + \widehat{d}_{11} + \widehat{\sigma}^2 \;=\; -0.287\,t^2 + 0.924\,t + 4.443 \qquad (5.9)$$

and therefore suggests the presence of negative curvature in the variance function. As an informal check, we can calculate the sample variance func-
tion of the ordinary least squares (OLS) residuals obtained from fitting a
linear regression model with the same mean structure as the marginal model
corresponding to (3.9), thereby completely ignoring the correlation struc-
ture in the data (see Chapter 4). The obtained variance function, shown
in Figure 5.2, indeed supports the negative curvature suggested by our fit-
ted variance function. Note that this is not compatible with the proposed
hierarchical model. Hence, although our random-effects model naturally

FIGURE 5.2. Rat Data. Sample variance function for ordinary least squares
(OLS) residuals, obtained from fitting a linear regression model with the same
mean structure as the marginal model corresponding to (3.9).

arose from the two-stage approach described in Section 3.2.3, it does not
necessarily imply an appropriate marginal model. This again illustrates
the need for exploratory data analysis (Chapter 4) prior to fitting linear
mixed models. More detailed discussions on negative variance components
can be found in Nelder (1954), Thompson (1962), and Searle, Casella and
McCulloch (1992, Section 3.5).
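As a sketch of this informal check, one could compute the sample variance function of OLS residuals as follows; the simulated balanced design below is a hypothetical stand-in for the rat data, which are not reproduced here:

```python
import numpy as np

# Informal check: variance of OLS residuals per time point, deliberately
# ignoring the correlation structure. The data are simulated stand-ins.
rng = np.random.default_rng(1)
time = np.tile(np.arange(7.0), 50)                 # 50 "subjects", 7 occasions
X = np.column_stack([np.ones(time.size), time])    # same mean structure as the model
y = X @ np.array([68.6, 7.3]) + rng.normal(scale=np.sqrt(3.6 + 0.9 * time))
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS fit
resid = y - X @ beta_ols
for t in np.unique(time):                          # sample variance function
    print(t, resid[time == t].var(ddof=1))
```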
6
Inference for the Marginal Model

6.1 Introduction

In practice, the fitting of a model is rarely the ultimate goal of a statistical analysis. Usually, one is primarily interested in drawing inferences on the
parameters in a model, in order to generalize results obtained from a specific
sample to the general population from which the sample was taken. In
Section 6.2, inference for the parameter vector β in the mean structure of
model (5.1) is discussed. Afterward, in Section 6.3, inference with respect
to the variance components α will be handled.

6.2 Inference for the Fixed Effects

As discussed in Section 5.1, the vector β of fixed effects is estimated by

$$\widehat{\beta}(\alpha) \;=\; \left(\sum_{i=1}^{N} X_i' W_i X_i\right)^{-1} \sum_{i=1}^{N} X_i' W_i y_i, \qquad (6.1)$$

in which the unknown vector α of variance components is replaced by its ML or REML estimate. Under the marginal model (5.1), and conditionally on α, β̂(α) follows a multivariate normal distribution with mean vector β and with variance-covariance matrix

$$\operatorname{var}(\widehat{\beta}) \;=\; \left(\sum_{i=1}^{N} X_i' W_i X_i\right)^{-1} \left(\sum_{i=1}^{N} X_i' W_i \operatorname{var}(Y_i) W_i X_i\right) \left(\sum_{i=1}^{N} X_i' W_i X_i\right)^{-1} \qquad (6.2)$$

$$\phantom{\operatorname{var}(\widehat{\beta})} \;=\; \left(\sum_{i=1}^{N} X_i' W_i X_i\right)^{-1}, \qquad (6.3)$$

where Wi equals Vi⁻¹(α). In practice, the covariance matrix (6.3) is es-
timated by replacing α by its ML or REML estimator. For the models
previously fitted to the prostate data and to the rat data, the so-obtained
standard errors for the fixed effects are also reported in Table 5.1 and
Table 5.3, respectively.

6.2.1 Approximate Wald Tests

For each parameter βj in β, j = 1, . . . , p, an approximate Wald test (also termed Z-test), as well as an associated confidence interval, is obtained from approximating the distribution of (β̂j − βj)/s.e.(β̂j) by a standard univariate normal distribution. In general, for any known matrix L, a test for the hypothesis

$$H_0: L\beta = 0, \quad \text{versus} \quad H_A: L\beta \neq 0, \qquad (6.4)$$

immediately follows from the fact that the distribution of

$$(\widehat{\beta} - \beta)'\, L' \left[ L \left( \sum_{i=1}^{N} X_i'\, V_i^{-1}(\widehat{\alpha})\, X_i \right)^{-1} L' \right]^{-1} L\, (\widehat{\beta} - \beta) \qquad (6.5)$$

asymptotically follows a chi-squared distribution with rank(L) degrees of freedom.

6.2.2 Approximate t-Tests and F -Tests

As noted by Dempster, Rubin, and Tsutakawa (1981), the Wald test statistics are based on estimated standard errors which underestimate the true variability in β̂ because they do not take into account the variability introduced by estimating α. In practice, this downward bias is often resolved by using approximate t- and F-statistics for testing hypotheses about β.

For each parameter βj in β, j = 1, . . . , p, an approximate t-test and associated confidence interval can be obtained by approximating the distribution of (β̂j − βj)/s.e.(β̂j) by an appropriate t-distribution. Testing general linear hypotheses of the form (6.4) is now based on an F-approximation to the distribution of

$$F \;=\; \frac{(\widehat{\beta} - \beta)'\, L' \left[ L \left( \sum_{i=1}^{N} X_i'\, V_i^{-1}(\widehat{\alpha})\, X_i \right)^{-1} L' \right]^{-1} L\, (\widehat{\beta} - \beta)}{\operatorname{rank}(L)}. \qquad (6.6)$$

The numerator degrees of freedom equals rank(L). The denominator degrees of freedom needs to be estimated from the data. The same is true for the degrees of freedom needed in the above t-approximation.

In practice, several methods are available for estimating the appropriate number of degrees of freedom needed for a specific t- or F-test. The SAS
procedure MIXED (Version 6.12), for example, includes four different es-
timation methods, one of which is based on a so-called Satterthwaite-type
approximation (Satterthwaite 1941). We refer to Section 3.5.2 and Appen-
dix A in Verbeke and Molenberghs (1997) and to SAS (1999) for a detailed
discussion on the estimation of the degrees of freedom in SAS. Recently,
Kenward and Roger (1997) proposed a scaled Wald statistic, based on an
adjusted covariance estimate which accounts for the extra variability intro-
duced by estimating α, and they show that its small sample distribution
can be well approximated by an F -distribution with denominator degrees
of freedom also obtained via a Satterthwaite-type approximation.

It should be remarked that all these methods usually lead to different re-
sults. However, in the analysis of longitudinal data, different subjects con-
tribute independent information, which results in numbers of degrees of
freedom which are typically large enough, whatever estimation method is
used, to lead to very similar p-values. Only for very small samples, or when
linear mixed models are used outside the context of longitudinal analy-
ses, different estimation methods for degrees of freedom may lead to severe
differences in the resulting p-values. This will be illustrated in Section 8.3.5.

6.2.3 Example: The Prostate Data

Table 5.1 clearly suggests that the original linear mixed model (3.10), used
for describing the prostate data, can be reduced to a more parsimonious
model. Classically, this is done in a hierarchical way, starting with the
highest-order interaction terms, deleting nonsignificant terms, and com-
bining parameters which do not differ significantly. The so-obtained final
model assumes no average evolution over time for the control group, a
linear average time trend for the BPH group, the same average quadratic

time effects for both cancer groups, and no age dependencies of the average
linear as well as quadratic time trends. An overall test for comparing this
final model with the original model (3.10) is testing the null hypothesis:

$$H_0: \begin{cases} \beta_6 = 0 & \text{(no age by time interaction)} \\ \beta_7 = 0 & \text{(no linear time effect for controls)} \\ \beta_{11} = 0 & \text{(no age by time}^2\text{ interaction)} \\ \beta_{12} = 0 & \text{(no quadratic time effect for controls)} \\ \beta_{13} = 0 & \text{(no quadratic time effect for BPH)} \\ \beta_{14} = \beta_{15} & \text{(equal quadratic time effect for both cancer groups).} \end{cases}$$

The above hypothesis can be rewritten as

$$H_0: \begin{pmatrix} 0&0&0&0&0&1&0&0&0&0&0&0&0&0&0 \\ 0&0&0&0&0&0&1&0&0&0&0&0&0&0&0 \\ 0&0&0&0&0&0&0&0&0&0&1&0&0&0&0 \\ 0&0&0&0&0&0&0&0&0&0&0&1&0&0&0 \\ 0&0&0&0&0&0&0&0&0&0&0&0&1&0&0 \\ 0&0&0&0&0&0&0&0&0&0&0&0&0&1&-1 \end{pmatrix} \beta \;=\; 0; \qquad (6.7)$$

it is clearly of the form (6.4).
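As a sketch, the contrast matrix (6.7) and the Wald statistic (6.5) can be computed as follows; beta_hat and cov_beta denote the fitted fixed effects and their (15 × 15) covariance matrix, which are left hypothetical here:

```python
import numpy as np
from scipy.stats import chi2

# Build the 6 x 15 contrast matrix of (6.7): rows for beta6, beta7, beta11,
# beta12, beta13 = 0 (0-based columns 5, 6, 10, 11, 12) and beta14 = beta15.
L = np.zeros((6, 15))
for row, col in enumerate([5, 6, 10, 11, 12]):
    L[row, col] = 1.0
L[5, 13], L[5, 14] = 1.0, -1.0

def wald_test(beta_hat, cov_beta, L):
    """Wald statistic (6.5) and its chi-squared p-value under H0: L beta = 0."""
    Lb = L @ beta_hat
    stat = Lb @ np.linalg.solve(L @ cov_beta @ L.T, Lb)
    df = np.linalg.matrix_rank(L)
    return stat, chi2.sf(stat, df)   # the F version compares stat/df to F(df, ddf)
```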

The observed value under the above null hypothesis for the associated Wald
statistic (6.5) equals 3.3865, on 6 degrees of freedom. The observed value
under H0 for the associated F-statistic (6.6) equals 3.3865/6 = 0.5644, on 6 and 46.7 degrees of freedom. The denominator degrees of freedom was obtained from the Satterthwaite approximation in the SAS procedure MIXED (Version 6.12). The corresponding p-values are 0.7587 and 0.7561,
respectively, suggesting that no important terms have been left out of the
model.

From now on, all further inferences will be based on the reduced final model, which can be written as

$$Y_{ij} \;=\; Y_i(t_{ij}) \;=\; \beta_1\,\mathrm{Age}_i + \beta_2 C_i + \beta_3 B_i + \beta_4 L_i + \beta_5 M_i + (\beta_8 B_i + \beta_9 L_i + \beta_{10} M_i)\,t_{ij} + \beta_{14}(L_i + M_i)\,t_{ij}^2 + b_{1i} + b_{2i}\,t_{ij} + b_{3i}\,t_{ij}^2 + \varepsilon_{ij}. \qquad (6.8)$$

Table 6.1 contains the parameter estimates and estimated standard errors for all fixed effects and variance components in model (6.8). Although the average PSA level for the control patients is not significantly different from zero (p = 0.1889), we will not remove the corresponding effect from the model, because a point estimate for the average PSA level in the control group may be of interest.

TABLE 6.1. Prostate Data. Parameter estimates and standard errors (model based;robust) obtained from fitting the final model (6.8) to the prostate cancer data, using restricted maximum likelihood estimation.

Effect                 Parameter    Estimate (s.e.)
Age effect             β1            0.016 (0.006;0.006)
Intercepts:
  Control              β2           −0.564 (0.428;0.404)
  BPH                  β3            0.275 (0.488;0.486)
  L/R cancer           β4            1.099 (0.486;0.499)
  Met. cancer          β5            2.284 (0.531;0.507)
Time effects:
  BPH                  β8           −0.410 (0.068;0.067)
  L/R cancer           β9           −1.870 (0.233;0.360)
  Met. cancer          β10          −2.303 (0.262;0.391)
Time² effects:
  Cancer               β14 = β15     0.510 (0.088;0.128)
Covariance of bi:
  var(b1i)             d11           0.443 (0.093)
  var(b2i)             d22           0.842 (0.203)
  var(b3i)             d33           0.114 (0.035)
  cov(b1i, b2i)        d12 = d21    −0.490 (0.124)
  cov(b2i, b3i)        d23 = d32    −0.300 (0.082)
  cov(b3i, b1i)        d13 = d31     0.148 (0.047)
Residual variance:
  var(εij)             σ²            0.028 (0.002)
REML log-likelihood                 −20.165

Figure 6.1 shows the average fitted profiles based on this final model, for
a man of median age at diagnosis, for each of the diagnostic groups sepa-
rately. Note how little difference there is with the average fitted profiles in
Figure 5.1, based on the full model (3.10).

When the prostate data were first analyzed, one of the research questions
of primary interest was whether early discrimination between cancer cases
and BPH cases should be based on the rate of increase of PSA (which can

FIGURE 6.1. Prostate Data. Fitted average profiles for males with median ages
at diagnosis, based on the final model (6.8), where the parameters are estimated
using restricted maximum likelihood estimation.

only be estimated when repeated PSA measurements are available) rather than on just one single measurement of PSA (see also Section 2.3.1). In order to assess this, we estimate the average difference in ln(1 + PSA) between these two groups, as well as the average difference in the rate of increase of ln(1 + PSA) between the two groups, 5 years prior to diagnosis. If we ignore the metastatic cancer cases, this is equivalent to estimating

$$\mathrm{DIFF}(t = 5\ \text{years}) \;=\; \left.\left(\beta_1\,\mathrm{Age} + \beta_4 + \beta_9 t + \beta_{14} t^2\right)\right|_{t=0.5} - \left.\left(\beta_1\,\mathrm{Age} + \beta_3 + \beta_8 t\right)\right|_{t=0.5} \;=\; -\beta_3 + \beta_4 - 0.5\,\beta_8 + 0.5\,\beta_9 + 0.25\,\beta_{14} \qquad (6.9)$$

and

$$\mathrm{DIFFRATE}(t = 5\ \text{years}) \;=\; \left.\frac{\partial}{\partial t}\left(\beta_1\,\mathrm{Age} + \beta_4 + \beta_9 t + \beta_{14} t^2\right)\right|_{t=0.5} - \left.\frac{\partial}{\partial t}\left(\beta_1\,\mathrm{Age} + \beta_3 + \beta_8 t\right)\right|_{t=0.5} \;=\; -\beta_8 + \beta_9 + \beta_{14}, \qquad (6.10)$$

which are of the form Lβ, for specific (1 × 15) matrices L. Obviously, Lβ will be estimated by Lβ̂. Moreover, the chi-squared approximation for (6.5) as well as the F-approximation for (6.6) can be used to obtain approximate confidence intervals for Lβ. The results from the chi-squared approximation are summarized in the top part of Table 6.2.

TABLE 6.2. Prostate Data. Naive and robust inference for the linear combinations
(6.9) and (6.10) of fixed effects in model (6.8), fitted to the prostate cancer data,
using restricted maximum likelihood estimation.

Naive inference
                                          Wald-type approximate
Effect       Estimate   Standard error    95% confidence interval
DIFF           0.221        0.146         [−0.065, 0.507]
DIFFRATE      −0.951        0.166         [−1.276, −0.626]

Robust inference
                                          Wald-type approximate
Effect       Estimate   Standard error    95% confidence interval
DIFF           0.221        0.159         [−0.092, 0.533]
DIFFRATE      −0.951        0.245         [−1.432, −0.470]

The average difference in ln(1 + PSA), 5 years prior to diagnosis, between local cancer cases and BPH cases is estimated by 0.221, with standard error equal to 0.146. The average difference in rate of change of ln(1 + PSA), 5 years prior to diagnosis, between local cancer cases and BPH cases is estimated by −0.951, with standard error equal to 0.166. The corresponding 95% Wald-type approximate confidence intervals are [−0.065, 0.507] and [−1.276, −0.626], respectively (Table 6.2). Hence, there is no significant difference in ln(1 + PSA) between the local cancer cases and the BPH cases, whereas the rate of increase of ln(1 + PSA) differs highly significantly. This illustrates why repeated measures of PSA are needed to discriminate between the different prostate diseases.
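The two estimable functions can be sketched in Python as single-row contrasts; the 0-based indices follow the ordering β1, . . . , β15 of Table 5.1, and beta_hat and cov_beta again denote the (hypothetical) fitted values:

```python
import numpy as np

def contrast_ci(beta_hat, cov_beta, L, z=1.96):
    """Estimate, standard error, and Wald-type 95% CI for a contrast L beta."""
    est = float(L @ beta_hat)
    se = float(np.sqrt(L @ cov_beta @ L))
    return est, se, (est - z * se, est + z * se)

L_diff = np.zeros(15)                       # DIFF of (6.9)
L_diff[[2, 3, 7, 8, 13]] = [-1.0, 1.0, -0.5, 0.5, 0.25]
L_rate = np.zeros(15)                       # DIFFRATE of (6.10)
L_rate[[7, 8, 13]] = [-1.0, 1.0, 1.0]
```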

6.2.4 Robust Inference

A sufficient condition for β̂, given by (6.1), to be unbiased for β is that the mean E(Yi) is correctly specified as Xiβ. However, the equivalence of (6.2) and (6.3) also assumes the marginal covariance matrix to be correctly specified as Vi = Zi D Zi′ + Σi. Thus, an analysis based on (6.3) will not be robust with respect to model misspecification of the covariance structure. Liang and Zeger (1986) therefore propose inferential procedures based on the so-called sandwich estimator for var(β̂), obtained by replacing var(Yi) in (6.2) by ri ri′, where ri = yi − Xiβ̂. The resulting estimator, also called the robust or empirical variance estimator, can then be shown to be consistent, as long as the mean is correctly specified in the model.
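A minimal sketch of the sandwich estimator, with the same hypothetical per-subject inputs as in the GLS sketch of Section 6.2, is:

```python
import numpy as np

def sandwich_cov(X_list, V_list, y_list, beta_hat):
    """Robust covariance: var(Y_i) in (6.2) replaced by r_i r_i'."""
    p = len(beta_hat)
    bread = np.zeros((p, p))
    meat = np.zeros((p, p))
    for X, V, y in zip(X_list, V_list, y_list):
        W = np.linalg.inv(V)
        bread += X.T @ W @ X
        u = X.T @ W @ (y - X @ beta_hat)   # X_i' W_i r_i
        meat += np.outer(u, u)             # X_i' W_i r_i r_i' W_i X_i
    bread_inv = np.linalg.inv(bread)
    return bread_inv @ meat @ bread_inv
```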

Note that this suggests that as long as interest is only in inferences for
average longitudinal evolutions, little effort should be spent in modeling
the covariance structure, provided that the data set is sufficiently large.
In this respect, an extreme point of view would be to use ordinary least
squares regression methods to fit longitudinal models, thereby completely
ignoring the presence of any correlation among the repeated measurements,
and to use the sandwich estimator to correct for this in the inferential
procedures. However, in practice, an appropriate covariance model may
be of interest since it helps in interpreting the random variation in the
data. For example, it may be of scientific interest to explore the presence
of random slopes. Further, efficiency is gained if an appropriate covariance model can be specified (see, for example, Diggle, Liang, and Zeger 1994, Section 4.6). Finally, in the case of missing observations, the sandwich estimator only provides valid inferences for the fixed effects under very strict assumptions about the underlying missingness process. This will be extensively discussed in Sections 15.8 and 16.5. Therefore, from
now on, all inferences will be based on model-based standard errors, unless
explicitly stated otherwise.

As an illustration of robust inference, model-based as well as robust standard errors were reported in Tables 5.1 and 6.1. For some parameters,
the robust standard error is smaller than the naive, model-based one. For
other parameters, the opposite is true.

Robust versions of the approximate Wald, t-, and F -test, described in Sec-
tions 6.2.1 and 6.2.2, as well as associated confidence intervals, can also
be obtained, replacing the naive covariance matrix (6.3) in (6.5) and (6.6)
by the robust one given in (6.2). As an illustration, we recalculated the
confidence intervals for the linear combinations (6.9) and (6.10) of fixed
effects in model (6.8), using robust inference. The results are now shown
in the bottom part of Table 6.2. Note that the robust standard errors for
both estimates are larger than the naive ones, leading to larger confidence
intervals.

6.2.5 Likelihood Ratio Tests

A classical statistical test for the comparison of nested models with different
mean structures is the likelihood ratio (LR) test. Suppose that the null
hypothesis of interest is given by H0 : β ∈ Θβ,0 , for some subspace Θβ,0
of the parameter space Θβ of the fixed effects β. Let LML denote again the
ML likelihood function (5.2) and let −2 ln λN be the likelihood ratio test
statistic defined as

$$-2\ln\lambda_N \;=\; -2\ln\left[\frac{L_{\mathrm{ML}}(\widehat{\theta}_{\mathrm{ML},0})}{L_{\mathrm{ML}}(\widehat{\theta}_{\mathrm{ML}})}\right],$$

where θ̂ML,0 and θ̂ML are the maximum likelihood estimates obtained from maximizing LML over Θβ,0 and Θβ, respectively. It then follows from classical likelihood theory (see, e.g., Cox and Hinkley 1990, Chapter 9) that, under some regularity conditions, −2 ln λN follows, asymptotically under H0, a chi-squared distribution with degrees of freedom equal to the difference between the dimension p of Θβ and the dimension of Θβ,0.

TABLE 6.3. Prostate Data. Likelihood ratio test for H0 : β1 = 0, under the final model (6.8), using ML as well as REML estimation.

                          ML estimation       REML estimation
Under β1 ∈ IR             LML = −3.575        LREML = −20.165
Under H0 : β1 = 0         LML = −6.876        LREML = −19.003
−2 ln λN                    6.602               −2.324
degrees of freedom          1
p-value                     0.010

It should be emphasized that the above result is not valid if the models
are fitted using REML rather than ML estimation. Indeed, the mean struc-
ture of the model fitted under H0 is not the mean structure Xi β of the
original model under Θβ, leading to different error contrasts U = A′Y
(see Section 5.3) under both models. Hence, the corresponding REML log-
likelihood functions are based on different observations, which makes them
no longer comparable. This can be well illustrated in the context of the
prostate data. We reconsider the final linear mixed model (6.8), and we
use the likelihood ratio test for testing whether correction for the different
ages at the time of the diagnosis is really needed. The corresponding null
hypothesis equals H0 : β1 = 0. The results obtained under ML as well as
under REML estimation are summarized in Table 6.3. Under ML, the ob-
served value for the LR statistic −2 ln λN equals 6.602, which is significant
(p = 0.010) when compared to a chi-squared distribution with 1 degree of
freedom. Note that a negative observed value for the LR statistic −2 ln λN
is obtained under REML, clearly illustrating the fact that valid classical
LR tests for the mean structure can only be obtained in the context of ML
inference. We refer to Welham and Thompson (1997) for two alternative
LR-type tests, based on profile likelihoods, which do allow comparison of
two models with nested mean structures, fitted using the REML estimation
method.
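In code, the LR test of this section is a one-liner; the sketch below reproduces the ML column of Table 6.3 and, as stressed above, should only be fed ML (never REML) log-likelihoods when the mean structures differ:

```python
from scipy.stats import chi2

def lr_test(ll_null, ll_alt, df):
    """-2 ln(lambda_N) and its chi-squared p-value for nested mean structures."""
    stat = -2.0 * (ll_null - ll_alt)
    return stat, chi2.sf(stat, df)

print(lr_test(-6.876, -3.575, df=1))   # (6.602, p ~ 0.010), as in Table 6.3
```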

6.3 Inference for the Variance Components

In many practical situations, the mean structure rather than the covari-
ance model is of primary interest. However, adequate covariance modeling
is useful for the interpretation of the random variation in the data and it
is essential to obtain valid model-based inferences for the parameters in
the mean structure of the model. Overparameterization of the covariance
structure leads to inefficient estimation and potentially poor assessment
of standard errors for estimates of the mean response profiles (fixed ef-
fects), whereas a too restrictive specification invalidates inferences about
the mean response profile when the assumed covariance structure does not
hold (Altham 1984). Finally, as will be discussed in Chapters 17, 19, and 21,
analyses of longitudinal data subject to dropout often require correct spec-
ification of the longitudinal model. In this section, inferential procedures
for variance components in linear mixed models will be discussed.

6.3.1 Approximate Wald Tests

It follows from classical likelihood theory (see, e.g., Cox and Hinkley 1990,
Chapter 9) that, under some regularity conditions, the distribution of the
ML as well as REML estimator α̂ can be well approximated by a nor-
mal distribution with mean vector α and with covariance matrix given by
the inverse of the Fisher information matrix. Hence, approximate standard
errors for the estimates of the variance components in α can be easily cal-
culated from inverting minus the matrix of second-order partial derivatives
of the log-likelihood function (ML or REML) with respect to α. These
are also the standard errors previously reported in Tables 5.1 and 6.1 and
Table 5.3 for the prostate data and the rat data, respectively.

Using the asymptotic normality of the parameter estimates, approximate Wald tests and approximate Wald confidence intervals can now easily be
obtained, similarly as described in Section 6.2.1 for the fixed effects. How-
ever, the performance of the normal approximation strongly depends on the
true value α, with larger samples needed for values of α relatively closer
to the boundary of the parameter space Θα . In the case that α is a bound-
ary value, the normal approximation completely fails. This has important
consequences for significance tests for variance components. Depending on
the hypothesis of interest, and depending on whether the marginal or the
hierarchical interpretation of the linear mixed model under consideration
is used (see discussion in Section 5.6.2), Wald tests may or may not yield
valid inferences.

For example, consider the linear mixed model (6.8) previously derived for the prostate data, with REML estimates as reported in Table 6.1. If we assume that the variability between subjects can be explained by random effects bi, then H0 : d33 = 0 cannot be tested using an approximate Wald test. Indeed, given the hierarchical interpretation of the model, D is restricted to be positive (semi-)definite, implying that H0 is on the boundary of the parameter space Θα. Therefore, the asymptotic distribution of d̂33/s.e.(d̂33) is not normal under H0, such that the approximate Wald test is not applicable. On the other hand, if one only assumes that the covariance matrix of each subject's response Yi can be described by Vi = Zi D Zi′ + σ²Ini, not assuming that this covariance matrix results from an underlying random-effects structure, D is no longer restricted to be positive (semi-)definite. Since d33 = 0 is then interior to Θα, d̂33/s.e.(d̂33) asymptotically follows a standard normal distribution under H0 : d33 = 0, from which a valid approximate Wald test follows. Based on the parameter estimates reported in Table 6.1, we find that the observed value for the test statistic equals d̂33/s.e.(d̂33) = 0.114/0.035 = 3.257, which is highly significant when compared to a standard normal distribution (p = 0.0011).

Obviously, the above distinction between the hierarchical and marginal in-
terpretation of a linear mixed model is far less crucial for testing significance
of covariance parameters in α. For example, the hypothesis H0 : d23 = 0
can be tested with an approximate Wald test, even when the variability
between subjects is believed to be induced by random effects. However,
in order to keep H0 away from the boundary of Θα , one then still has to
assume that the variances d22 and d33 are strictly positive. Hence, based
on the parameter estimates reported in Table 6.1, and assuming all diag-
onal elements in D to be nonzero, we find highly significant correlations
between the subject-specific intercepts and slopes in model (6.8), and the
only positive correlation is the one between the random intercepts and the
random slopes for time².

6.3.2 Likelihood Ratio Tests

Similar as for the fixed effects, a LR test can be derived for comparing
nested models with different covariance structures. Suppose that the null
hypothesis of interest is now given by H0 : α ∈ Θα,0 , for some subspace
Θα,0 of the parameter space Θα of the variance components α. Let LML
denote again the ML likelihood function (5.2) and let −2 ln λN be the likelihood ratio test statistic, again defined as

$$-2\ln\lambda_N \;=\; -2\ln\left[\frac{L_{\mathrm{ML}}(\widehat{\theta}_{\mathrm{ML},0})}{L_{\mathrm{ML}}(\widehat{\theta}_{\mathrm{ML}})}\right], \qquad (6.11)$$

where θ̂ML,0 and θ̂ML are now the maximum likelihood estimates obtained from maximizing LML over Θα,0 and Θα, respectively. It then follows from classical likelihood theory (see, e.g., Cox and Hinkley 1990, Chapter 9) that, under some regularity conditions, −2 ln λN follows, asymptotically under H0, a chi-squared distribution with degrees of freedom equal to the difference between the dimension s − p of Θα and the dimension of Θα,0.

One of the regularity conditions under which the chi-squared approximation is valid is that H0 is not on the boundary of the parameter space Θα. Hence, the LR test suffers from exactly the same problems as the approximate Wald test previously described in Section 6.3.1. Further, in contrast to the LR test for fixed effects (see Section 6.2.5), valid LR tests are also obtained under REML estimation. The test statistic −2 ln λN is then still given by (6.11), but LML needs to be replaced by the REML likelihood function LREML, given by expression (5.8), and the parameter estimates θ̂ML,0 and θ̂ML are replaced by their corresponding REML estimates. This is because models with the same mean structure lead to the same error contrasts U = A′Y (see Section 5.3), which makes both REML likelihood functions comparable since they are no longer based on different observations.

6.3.3 Example: The Rat Data

In Section 5.6.2, the marginal model corresponding to model (3.9) was fitted to the rat data, without restricting the parameter space for the variance
components. Based on the unrestricted parameter estimates reported in
Table 5.3, the variance function was predicted by expression (5.9), which
suggested the presence of negative curvature in the variance function. Under
the assumed marginal model, an approximate Wald test as well as LR test
can be derived to test whether the variance function is significantly different
from constant. More specifically, the hypothesis of interest is H0 : d12 =
d22 = 0, which is not on the boundary of the parameter space under the
marginal interpretation of the model. The observed value for the Wald
statistic equals

$$\begin{pmatrix} \widehat{d}_{12} & \widehat{d}_{22} \end{pmatrix} \begin{pmatrix} \widehat{\operatorname{Var}}(\widehat{d}_{12}) & \widehat{\operatorname{Cov}}(\widehat{d}_{12}, \widehat{d}_{22}) \\ \widehat{\operatorname{Cov}}(\widehat{d}_{12}, \widehat{d}_{22}) & \widehat{\operatorname{Var}}(\widehat{d}_{22}) \end{pmatrix}^{-1} \begin{pmatrix} \widehat{d}_{12} \\ \widehat{d}_{22} \end{pmatrix} \;=\; \begin{pmatrix} 0.462 & -0.287 \end{pmatrix} \begin{pmatrix} 0.127 & -0.038 \\ -0.038 & 0.029 \end{pmatrix}^{-1} \begin{pmatrix} 0.462 \\ -0.287 \end{pmatrix} \;=\; 2.936,$$
which is not significant when compared to a chi-squared distribution with
2 degrees of freedom (p = 0.2304). The REML estimates of the parameters in the reduced model are shown in Table 6.4.

TABLE 6.4. Rat Data. REML estimates and associated estimated standard errors for all parameters in model (3.9), under H0 : d12 = d22 = 0.

Effect               Parameter    REMLE (s.e.)
Intercept            β0           68.607 (0.331)
Time effects:
  Low dose           β1            7.507 (0.225)
  High dose          β2            6.871 (0.228)
  Control            β3            7.507 (0.225)
Covariance of bi:
  var(b1i)           d11           3.565 (0.808)
Residual variance:
  var(εij)           σ²            1.445 (0.145)
REML log-likelihood              −466.202

A LR test for the same null
hypothesis can now be obtained from comparing the maximized REML
log-likelihood values (see Tables 5.3 and 6.4). The observed value for the
test statistic equals −2 ln λN = −2(−466.202 + 465.193) = 2.018, which
is also not significant when compared to a chi-squared distribution with 2
degrees of freedom (p = 0.3646).

From now on, all further inferences will be based on the reduced model. For
each treatment group separately, the predicted average profile based on the
estimates reported in Table 6.4 is shown in Figure 6.2. The observed value
for the Wald statistic (6.5) for testing the hypothesis H0 : β1 = β2 = β3
of no average treatment effect equals 4.63, which is not significant when
compared to a chi-squared distribution with 2 degrees of freedom (p =
0.0987).

Since the obtained estimate for d11 equals 3.565 > 0, the fitted model allows a random-effects interpretation. The corresponding hierarchical model is given by

$$Y_{ij} \;=\; \begin{cases} \beta_0 + b_{1i} + \beta_1 t_{ij} + \varepsilon_{ij}, & \text{if low dose,} \\ \beta_0 + b_{1i} + \beta_2 t_{ij} + \varepsilon_{ij}, & \text{if high dose,} \\ \beta_0 + b_{1i} + \beta_3 t_{ij} + \varepsilon_{ij}, & \text{if control,} \end{cases} \qquad (6.12)$$

and is obtained from omitting the subject-specific slopes b2i from the orig-
inal model (3.9), thereby assuming that all individual profiles have equal

FIGURE 6.2. Rat Data. Fitted average evolution for each treatment group sepa-
rately, obtained from fitting the final model (6.12) using REML estimation.

slopes, after correction for the treatment. As before, the residual compo-
nents εij only contain measurement error (i.e., εi = ε(1)i , see Section 3.3.4).

Note that the above random-intercepts model not only implies that the marginal variance function is constant over time, but also assumes constant correlation between any two repeated measurements from the same rat. The constant correlation is given by

$$\rho_I \;=\; \frac{d_{11}}{d_{11} + \sigma^2},$$
which is the intraclass correlation coefficient previously encountered in Sec-
tion 3.3.2. In our example, ρI is estimated by 3.565/(3.565+1.445) = 0.712.
The corresponding covariance matrix, with constant variance and constant
correlation, is often called compound symmetry. This again illustrates that
negative estimates for variance components in a linear mixed model of-
ten have meaningful interpretations in the implied marginal model. Here, a
nonpositive estimate for the variance d11 of the random effects in a random-
intercepts model would indicate that the assumption of constant positive
correlation between the repeated measurements is not valid for the data
set at hand. We refer to Section 5.6.2 for another example in which neg-
ative variance estimates indicate misspecifications in the implied marginal
model.
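A short sketch makes the implied compound-symmetry structure explicit, using the REML estimates of Table 6.4 and an arbitrarily chosen number of five measurements per rat:

```python
import numpy as np

# Random-intercepts model: V_i = d11 * J + sigma2 * I (compound symmetry).
d11, sigma2, n_i = 3.565, 1.445, 5
V = d11 * np.ones((n_i, n_i)) + sigma2 * np.eye(n_i)
rho_I = d11 / (d11 + sigma2)      # constant intraclass correlation
print(round(rho_I, 3))            # 0.712, as reported in the text
```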

6.3.4 Marginal Testing for the Need of Random Effects

As illustrated in Chapter 3, random effects in a linear mixed model represent the variability in subject-specific intercepts and slopes, not explained
by the covariates included in the model. Under the hierarchical interpre-
tation of the model, it may therefore be of scientific interest to test for
the need of (some of the) random effects in the model. For example, under
model (6.8) for the prostate data, it might be of interest to test whether ran-
dom quadratic time effects are needed. If not, this would suggest that all
noncancer patients evolve linearly over time, whereas all cancer patients
would have the same quadratic time effect described by the fixed effect
β14 = β15 . Note that, unless a Bayesian approach is followed, this can
only indirectly be tested via the induced marginal model. For the prostate
example, the corresponding hypothesis of interest is

H0 : d13 = d23 = d33 = 0, (6.13)

which is clearly on the boundary of the parameter space Θα, such that the classical likelihood-based inference cannot be applied (see the discussion in Sections 6.3.1 and 6.3.2).

Using results of Self and Liang (1987) on nonstandard testing situations, Stram and Lee (1994, 1995) have been able to show that the asymptotic null
distribution for the likelihood ratio test statistic for testing hypotheses of
the type (6.13) is often a mixture of chi-squared distributions rather than
the classical single chi-squared distribution. This was derived under the
assumption of conditional independence, that is, assuming that all residual
covariances Σi are of the form σ 2 Ini . For ANOVA models with independent
random effects, this was already briefly discussed by Miller (1977).

Let −2 ln λN be the likelihood ratio test statistic as in expression (6.11). Stram and Lee (1994, 1995) then discuss several specific testing situations,
which we will briefly summarize. Although their results were derived for the
case of maximum likelihood estimation, the same results apply for restricted
maximum likelihood estimation, as shown by Morrell (1998). In fact, the
REML test statistic performs slightly better than the ML test statistic
in the sense that, on average, the rejection proportions are closer to the
nominal level for the REML test statistic than for the ML test statistic.

Case 1: No Random Effects versus One Random Effect

For testing H0 : D = 0 versus HA : D = d11, where d11 is a non-negative scalar, we have that the asymptotic null distribution of −2 ln λN is a mixture of χ²1 and χ²0 with equal weights 0.5. The χ²0 distribution is the distri-
bution which gives probability mass 1 to the value 0. The mixture is shown

FIGURE 6.3. Graphical representation of the asymptotic null distribution of the likelihood ratio statistic for testing the significance of random effects in a linear mixed model, for three different types of hypotheses. For each case, the distribution (solid line) is a mixture of two chi-squared distributions (dashed lines), with both weights equal to 0.5:
(a) Case 1: no random effects versus one random effect.
(b) Case 2: one random effect versus two random effects.
(c) Case 3: two random effects versus three random effects.

in panel (a) of Figure 6.3. Note that if the classical null distribution would
be used, all p-values would be overestimated. Therefore, the null hypothesis
would be accepted too often, resulting in incorrectly simplifying the covari-
ance structure of the model, which may seriously invalidate inferences, as
shown by Altham (1984).

Case 2: One versus Two Random Effects

In the case that one wishes to test

$$H_0: D = \begin{pmatrix} d_{11} & 0 \\ 0 & 0 \end{pmatrix},$$

for some strictly positive d11, versus HA that D is a (2 × 2) positive semidefinite matrix, we have that the asymptotic null distribution of −2 ln λN is a mixture with equal weights 0.5 for χ²2 and χ²1, shown in Figure 6.3(b).
Similar to case 1, we have that ignoring the boundary problems may result
in too parsimonious covariance structures.

Case 3: q versus q + 1 Random Effects

For testing the hypothesis

$$H_0: D = \begin{pmatrix} D_{11} & 0 \\ 0 & 0 \end{pmatrix}, \qquad (6.14)$$

in which D11 is a (q × q) positive definite matrix, versus HA that D is a general ((q + 1) × (q + 1)) positive semidefinite matrix, the large-sample behavior of the null distribution of −2 ln λN is a mixture of χ²q+1 and χ²q,
again with equal weights 0.5. A graphical representation for the case of
testing two random effects (q = 2) versus three random effects is given in
the third panel of Figure 6.3. Again, we have that the correction due to the
boundary problems reduces the p-values in order to protect against the use
of oversimplified covariance structures.

Case 4: q versus q + k Random Effects

The null distribution of −2 ln λN for testing (6.14) versus

$$H_A: D = \begin{pmatrix} D_{11} & D_{12} \\ D_{12}' & D_{22} \end{pmatrix},$$

which is a general ((q + k) × (q + k)) positive semidefinite matrix, is a


mixture of χ2 random variables as well as other types of random variables
formed by the lengths of projections of multivariate normal random vari-
ables upon curved as well as flat surfaces. Apart from very special cases,
current statistical knowledge calls for simulation methods to estimate the
appropriate null distribution.

Note that the results in the above cases 1 to 4 assume that the likelihood function can be maximized over the space Θα of positive semidefinite matrices D, and that the estimating procedure is able to converge, for example,
to values of D which are positive semidefinite but not positive definite. This
is software dependent and should be checked when the above results are
applied in practice.

For example, according to Stram and Lee (1994), this assumption did not
hold for the SAS procedure MIXED when their paper was written, and they
therefore discuss how their results had to be corrected. Since the procedure
only allowed maximization of the likelihood over a subspace of the required
parameter space Θα , the likelihood ratio statistics were typically too small.
For the third case, for example, the asymptotic null distribution became
a mixture of χ2q+1 and χ20 with equal weight 0.5. However, as explained
in Section 5.6.2, since release 6.10 of SAS, the only constraint on variance
components estimates is non-negativeness of the variances. In some cases, this can even lead to estimates of D which are not non-negative definite (see, for example, the analysis in Section 5.6.2 of the rat data). Because any symmetric matrix with at least one negative diagonal element is not positive semidefinite, we have that the required parameter space Θα is a subspace of the set of all symmetric matrices with non-negative diagonal elements. Hence, we may conclude that, since release 6.10, SAS allows maximizing the likelihood over Θα, and therefore that the original results, as described in cases 1 to 4, are valid even when the procedure MIXED is used. However, since the likelihood is maximized over a parameter space which is larger than Θα, one should check the resulting estimate D̂ for positive semidefiniteness. In the next section, the above results will be illustrated in the context of the prostate data. Other examples can be found in Section 17.4 as well as in Section 24.1.

TABLE 6.5. Prostate Data. Several random-effects models with the associated log-likelihood value evaluated at the parameter estimates, for maximum as well as restricted maximum likelihood estimation.

Random effects                           ln L(θ̂), ML     ln L(θ̂), REML
Model 1: intercepts, time, time²          −3.575          −20.165
Model 2: intercepts, time                −50.710          −66.563
Model 3: intercepts                     −131.218         −149.430
Model 4: no random effects              −251.275         −272.367

6.3.5 Example: The Prostate Data

For the prostate data, the hypothesis of most interest is that only random
intercepts and random slopes for the linear time effect are needed in model
(6.8), and hence that the random slopes for the quadratic time effect may
be omitted (case 3). However, for illustrative purposes, we tested all hy-
potheses of deleting one random effect from the model, in a hierarchical
way starting from the highest-order time effect. Likelihood ratio tests were
used, based on maximum likelihood as well as on restricted maximum like-
lihood estimation. The models and the associated maximized log-likelihood
values are shown in Table 6.5. Further, Table 6.6 shows the likelihood ra-
tio statistics for dropping one random effect at a time, starting from the
quadratic time effect. The correct asymptotic null distributions directly follow from the results described in cases 1 to 3. We hereby denote a mixture of two chi-squared distributions with k1 and k2 degrees of freedom, with equal weights 0.5, by χ²k1:k2.

TABLE 6.6. Prostate Data. Likelihood ratio statistics with the correct as well as naive asymptotic null distribution for comparing random-effects models, for maximum as well as restricted maximum likelihood estimation. A mixture of two chi-squared distributions with k1 and k2 degrees of freedom and with equal weight for both distributions is denoted by χ²k1:k2.

Maximum likelihood
                                         Asymptotic null distribution
Hypothesis                 −2 ln(λN)     Correct        Naive
Model 2 versus Model 1      94.270       χ²2:3          χ²3
Model 3 versus Model 2     161.016       χ²1:2          χ²2
Model 4 versus Model 3     240.114       χ²0:1          χ²1

Restricted maximum likelihood
                                         Asymptotic null distribution
Hypothesis                 −2 ln(λN)     Correct        Naive
Model 2 versus Model 1      92.796       χ²2:3          χ²3
Model 3 versus Model 2     165.734       χ²1:2          χ²2
Model 4 versus Model 3     245.874       χ²0:1          χ²1

For example, the p-value obtained under REML estimation for the comparison of Model 2 versus Model 1 can then be calculated as

$$p \;=\; P\left(\chi^2_{2:3} > 92.796\right) \;=\; \tfrac{1}{2}\,P\left(\chi^2_{2} > 92.796\right) + \tfrac{1}{2}\,P\left(\chi^2_{3} > 92.796\right).$$
The naive asymptotic null distribution is the one which follows from ap-
plying the classical likelihood theory, ignoring the boundary problem for
the null hypothesis (i.e., a chi-squared distribution with degrees of free-
dom equal to the number of free parameters which vanish under the null
hypothesis). All observed values for −2 ln(λN ) are larger than 90, yielding
p-values smaller than 0.0001. We conclude that the covariance structure
should not be simplified deleting random effects from the model. We refer
to Sections 17.4 and 24.1 for examples where the naive and the corrected
p-values show much more difference.
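The mixture p-values of Table 6.6 are easily computed; the following sketch implements the χ²q:q+1 tail probability of cases 1 to 3 (q versus q + 1 random effects), treating χ²0 as a point mass at zero:

```python
from scipy.stats import chi2

def mixture_pvalue(stat, q):
    """P(chi2_{q:q+1} > stat): equal-weight mixture of chi2_q and chi2_{q+1}."""
    tail_q = 0.0 if q == 0 else chi2.sf(stat, q)   # chi2_0 contributes no tail mass
    return 0.5 * tail_q + 0.5 * chi2.sf(stat, q + 1)

# REML, Model 2 versus Model 1 (q = 2): P(chi2_{2:3} > 92.796) < 0.0001
print(mixture_pvalue(92.796, 2))
```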

TABLE 6.7. Overview of frequently used information criteria for comparing linear mixed models. We hereby define n* equal to the total number $n = \sum_{i=1}^{N} n_i$ of observations, or equal to n − p, depending on whether ML or REML estimation was used in the calculations.

Criterion                    Definition of F(·)
Akaike (AIC)                 F(#θ) = #θ
Schwarz (SBC)                F(#θ) = (#θ ln n*)/2
Hannan and Quinn (HQIC)      F(#θ) = #θ ln(ln n*)
Bozdogan (CAIC)              F(#θ) = #θ (ln n* + 1)/2

6.4 Information Criteria

All testing procedures discussed in Sections 6.2 and 6.3 considered the comparison of so-called nested models, in the sense that the model under the null hypothesis can be viewed as a special case of the alternative model. In order to extend this to the case where one wants to discriminate between non-nested models, we take a closer look at the likelihood ratio tests discussed in Sections 6.2.5, 6.3.2, and 6.3.4. Let ℓA and ℓ0 denote the log-likelihood function evaluated at the estimates obtained under the alternative hypothesis and under the null hypothesis, respectively. Further, let #θ0 and #θA denote the number of free parameters under the null hypothesis and under the alternative hypothesis, respectively. The LR test then rejects the null hypothesis if ℓA − ℓ0 is large in comparison to the difference in degrees of freedom between the two models which are to be compared, or, equivalently, if

$$\ell_A - \ell_0 \;>\; \mathcal{F}(\#\theta_A) - \mathcal{F}(\#\theta_0),$$

or, equivalently, if

$$\ell_A - \mathcal{F}(\#\theta_A) \;>\; \ell_0 - \mathcal{F}(\#\theta_0),$$

for an appropriate function F(·). For example, when tests are performed at the 5% level of significance, for hypotheses of the same form as those described in the third case of Section 6.3.4, F was such that

$$2\left[\mathcal{F}(\#\theta_A) - \mathcal{F}(\#\theta_0)\right] \;=\; \chi^2_{(\#\theta_A - \#\theta_0 - 1)\,:\,(\#\theta_A - \#\theta_0),\;0.95},$$

where χ²k1:k2,0.95 denotes the 95th percentile of the χ²k1:k2 distribution. This procedure can be interpreted as a formal test of significance only if the model under the null hypothesis is nested within the model under the alternative hypothesis. However, if this is not the case, there is no reason why

the above procedure could not be used as a rule of thumb, or why no other
functions F(·) could be used to construct empirical rules for discriminating
between covariance structures. Some other frequently used functions are
shown in Table 6.7, all leading to different discriminating rules, called in-
formation criteria. The main idea behind information criteria is to compare
models based on their maximized log-likelihood value, but to penalize for
the use of too many parameters. The model with the largest AIC, SBC,
HQIC, or CAIC is deemed best. Note that, except for the Akaike infor-
mation criterion (AIC), they all involve the sample size (see Table 6.7),
showing that differences in likelihood need to be viewed, not only relative
to the differences in numbers of parameters but also relative to the number
of observations included in the analysis. As the sample size increases, more
severe increases in likelihood are required before a complex model will be
preferred over a simple model. Note also that, since REML is based on a
set of n − p error contrasts (see Section 5.3), the effective sample size used
in the definition of the information criteria is n∗ = n − p under REML esti-
mation, while being n under ML estimation. Note also that, as explained in
Section 6.2.5, REML log-likelihoods are only fully comparable for models
with the same mean structure. Hence, for comparing models with different
mean structures, one should only consider information criteria based on
ML estimation.
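As a sketch, the criteria of Table 6.7 can be computed as ℓ − F(#θ); the printed values reproduce the AIC and SBC reported for the rat data in Table 6.8 below (ML estimation, n* = n = 252):

```python
import numpy as np

def info_criteria(ell, k, n_star):
    """ell - F(#theta) for the criteria of Table 6.7 (larger is better)."""
    return {"AIC":  ell - k,
            "SBC":  ell - k * np.log(n_star) / 2.0,
            "HQIC": ell - k * np.log(np.log(n_star)),
            "CAIC": ell - k * (np.log(n_star) + 1.0) / 2.0}

print(info_criteria(-464.326, 6, 252))   # separate slopes: AIC -470.326, SBC -480.914
print(info_criteria(-466.622, 4, 252))   # common slope:    AIC -470.622, SBC -477.681
```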

We refer to Akaike (1974), Schwarz (1978), Hannan and Quinn (1979), and
Bozdogan (1987) for more information on the information criteria defined
in Table 6.7.

It should be strongly emphasized that information criteria only provide rules of thumb to discriminate between several statistical models. They
should never be used or interpreted as formal statistical tests of significance.
In specific examples, different criteria can even lead to different models. As
an illustration, we consider model (6.12) previously derived for the rat data.
Using an approximate Wald test, we found evidence in Section 6.3.3 that,
under the proposed model, there is no average difference among the three
treatment groups (p = 0.0987). Hence, we expect that a linear mixed model
with equal average slope for all three groups is preferred when compared

to the original model (6.12). Table 6.8 shows the AIC and SBC obtained
from the ML results for both models, with effective sample size equal to n = 252. Note that AIC prefers the model with treatment-specific average slopes, whereas the common-slope model is to be preferred based on SBC. This again illustrates the effect of taking into account the sample size n in the calculation of SBC, which is not the case for AIC.

TABLE 6.8. Rat Data. Summary of the results of fitting two different random-intercepts models to the rat data (ML estimation, n = 252).

Mean structure                ℓML         #θ     AIC         SBC
Separate average slopes      −464.326      6    −470.326    −480.914
Common average slope         −466.622      4    −470.622    −477.681
7
Inference for the Random Effects

7.1 Introduction

Although in practice one is usually primarily interested in estimating the parameters in the marginal linear mixed-effects model (the fixed effects β and the variance components in D and in all Σi), it is often useful to cal-
and the variance components in D and in all Σi ), it is often useful to cal-
culate estimates for the random effects bi as well, since they reflect how
much the subject-specific profiles deviate from the overall average profile.
Such estimates can then be interpreted as residuals which may be helpful
for detecting special profiles (i.e., outlying individuals) or groups of indi-
viduals evolving differently in time. Also, estimates for the random effects
are needed whenever interest is in prediction of subject-specific evolutions
(see Section 7.5).

As indicated in Section 5.1, it is then no longer sufficient to assume that the marginal distribution of the responses Yi is given by model (5.1), because it
does not imply that the variability in the data can be explained by random
effects. In this section, we will therefore explicitly assume that the hierar-
chical model (3.8) is appropriate. Since random effects represent a natural
heterogeneity between the subjects, this assumption will often be justified
for data where the between-subjects variability is large in comparison to
the within-subject variability.

7.2 Empirical Bayes Inference

Since the random effects in model (3.8) are assumed to be random vari-
ables, it is most natural to estimate them using Bayesian techniques (see,
for example, Box and Tiao 1992 or Gelman et al . 1995). As discussed in
Section 3.3, the distribution of the vector Yi of responses for the ith indi-
vidual, conditional on that individual’s specific regression coefficients bi , is
multivariate normal with mean vector Xi β + Zi bi and with covariance ma-
trix Σi . Further, the marginal distribution of bi is multivariate normal with
mean vector 0 and covariance matrix D. In the Bayesian literature, this
last distribution is usually called the prior distribution of the parameters
bi since it does not depend on the data Yi . Once observed values yi for Yi
have been collected, the so-called posterior distribution of bi , defined as the
distribution of bi , conditional on Yi = yi , can be calculated. If we denote
the density function of Yi conditional on bi , and the prior density function
of bi by f (yi |bi ) and f (bi ), respectively, we have that the posterior density
function of bi given Yi = yi is given by
\[
f(b_i \mid y_i) \;\equiv\; f(b_i \mid Y_i = y_i)
 \;=\; \frac{f(y_i \mid b_i)\, f(b_i)}{\int f(y_i \mid b_i)\, f(b_i)\, db_i}. \tag{7.1}
\]
For the sake of notational convenience, we hereby suppressed the depen-
dence of all above density functions on certain components of θ.

Using the theory on general Bayesian linear models (Smith 1973, Lindley
and Smith 1972), it can be shown that (7.1) is the density of a multivariate
normal distribution. Very often, bi is estimated by the mean of this pos-
terior distribution, called the posterior mean of bi . This estimate is then
given by

\[
\widehat{b}_i(\theta) \;=\; E[\, b_i \mid Y_i = y_i \,]
 \;=\; \int b_i\, f(b_i \mid y_i)\, db_i
 \;=\; D Z_i'\, W_i(\alpha)\, (y_i - X_i \beta), \tag{7.2}
\]
and the covariance matrix of the corresponding estimator equals
\[
\operatorname{var}(\widehat{b}_i(\theta)) \;=\; D Z_i' \left\{ W_i - W_i X_i \left( \sum_{i=1}^{N} X_i' W_i X_i \right)^{-1} X_i' W_i \right\} Z_i D, \tag{7.3}
\]
where, as before, $W_i$ equals $V_i^{-1}$ (Laird and Ware 1982). Note that (7.3) underestimates the variability in $\widehat{b}_i(\theta) - b_i$ since it ignores the variation of $b_i$. Therefore, inference for $b_i$ is usually based on
\[
\operatorname{var}\!\left( \widehat{b}_i(\theta) - b_i \right) \;=\; D - \operatorname{var}\!\left( \widehat{b}_i(\theta) \right) \tag{7.4}
\]
as an estimator for the variation in $\widehat{b}_i(\theta) - b_i$ (Laird and Ware 1982).

So far, all calculations were performed conditionally on the vector θ of parameters in the marginal model. In practice, the unknown parameters β
and α in (7.2), (7.3), and (7.4) are replaced by their maximum or restricted
maximum likelihood estimates. The resulting estimates for the random
effects are called “Empirical Bayes” (EB) estimates, which we will denote by $\widehat{b}_i$. Note that (7.3) and (7.4) then underestimate the true variability in the obtained estimate $\widehat{b}_i$ since they do not take into account the variability introduced by replacing the unknown parameter θ by its estimate. Similar to that for fixed effects (see Section 6.2), inference is therefore often based on approximate t-tests or F-tests, rather than on Wald tests, with similar
procedures for the estimation of the denominator degrees of freedom. An
example will be given in Section 8.3.6.

In practice, one often uses histograms and scatter plots of components of $\widehat{b}_i$ for diagnostic purposes, such as the detection of outliers, which are
subjects who seem to evolve differently from the other subjects in the data
set. For example, Morrell and Brant (1991) use scattergrams of the EB
estimates to pinpoint outlying observations, DeGruttola, Lange, and Dafni
(1991) report histograms of the EB estimates, and use a normal quantile
plot of standardized estimated random intercepts to check their normality,
and Waternaux, Laird, and Ware (1989) use several techniques based on
the EB estimates to look for unusual individuals and departures from the
model assumptions. However, as will be explained in Section 7.8, results
from such procedures should be interpreted with extreme care.

7.3 Henderson’s Mixed-Model Equations

In Section 7.2, the estimation of the random effects was approached in a Bayesian way, which was motivated by the assumption that the bi are
random parameters. Henderson has shown that the estimates (7.2) can also
be obtained from solving a system of linear equations.

Let the linear mixed model be denoted as in (5.5) in Section 5.3.3; that
is, Y = Xβ + Zb + ε, where the vectors Y , b, and ε, and the matrix X
are obtained from stacking the vectors Yi , bi , and εi , and the matrices
Xi , respectively, underneath each other, and where Z is the block-diagonal
matrix with blocks Zi on the main diagonal and zeros elsewhere. Let D and
Σ be block-diagonal with blocks D and Σi on the main diagonal and zeros
elsewhere. Henderson et al. (1959) showed that, conditional on the vector α of variance components, the estimate (5.3) for β and the estimates (7.2) of all random effects bi can be obtained from solving the so-called mixed-model equations
\[
\begin{pmatrix}
X' \Sigma^{-1} X & X' \Sigma^{-1} Z \\[2pt]
Z' \Sigma^{-1} X & Z' \Sigma^{-1} Z + D^{-1}
\end{pmatrix}
\begin{pmatrix} \widehat{\beta} \\[2pt] \widehat{b} \end{pmatrix}
\;=\;
\begin{pmatrix} X' \Sigma^{-1} y \\[2pt] Z' \Sigma^{-1} y \end{pmatrix}
\]
with respect to β and b (see also Henderson 1984, Searle, Casella and
McCulloch 1992, Section 7.6). Note that, especially with large data sets,
this may become computationally very expensive, such that, in practice, it
may be (much) more efficient to calculate the estimates directly from the
expressions (5.3) and (7.2).
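As a minimal numerical sketch of these equations (not part of the original text; the toy data, design, and variance components below are all hypothetical), consider a random-intercepts model with two subjects and two measurements each, solved with PROC IML:

proc iml;
  /* hypothetical toy data: two subjects, two measurements each     */
  y = {1.2, 1.5, 0.8, 0.9};           /* stacked response vector    */
  X = {1, 1, 1, 1};                   /* fixed effect: overall mean */
  Z = {1 0, 1 0, 0 1, 0 1};           /* random-intercepts design   */
  sigma2 = 0.5;   d = 2.0;            /* assumed variance components */
  SigInv = I(4) / sigma2;             /* Sigma^{-1}, Sigma = sigma2*I */
  Dinv   = I(2) / d;                  /* D^{-1}                     */
  /* coefficient matrix and right-hand side of the equations above  */
  C = (X`*SigInv*X || X`*SigInv*Z) //
      (Z`*SigInv*X || Z`*SigInv*Z + Dinv);
  r = (X`*SigInv*y) // (Z`*SigInv*y);
  sol = solve(C, r);                  /* beta-hat, then the two b-hats */
  print sol;
quit;

For such a small example, the same values would of course be obtained from evaluating (5.3) and (7.2) directly, which, as noted above, is usually the cheaper route for large data sets.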

7.4 Best Linear Unbiased Prediction (BLUP)

Suppose interest is in the estimation of a linear combination $u = \lambda_\beta' \beta + \lambda_b' b_i$ of the vector β of fixed effects and the vector bi of random effects, for some known vectors $\lambda_\beta$ and $\lambda_b$ of dimension p and q, respectively. Conditionally on the variance components in α, an obvious estimator for u is now
\[
\widehat{u}(\alpha) \;=\; \lambda_\beta'\, \widehat{\beta}(\alpha) \;+\; \lambda_b'\, \widehat{b}_i(\widehat{\beta}(\alpha), \alpha), \tag{7.5}
\]
where $\widehat{\beta}(\alpha)$ and $\widehat{b}_i(\beta, \alpha) = \widehat{b}_i(\theta)$ are as defined in expressions (5.3) and (7.2), respectively. It can now be shown that $\widehat{u}(\alpha)$ is a best linear unbiased predictor of u, in the sense that it is unbiased for u and has minimum variance among all unbiased estimators of the form $c + \sum_i \lambda_i' Y_i$ (see, for example, Harville 1976, McLean, Sanders and Stroup 1991, Searle, Casella and McCulloch 1992, Chapter 7). Note that, in practice, u is estimated by $\widehat{u}(\widehat{\alpha})$, where $\widehat{\alpha}$ equals the maximum likelihood or restricted maximum likelihood estimator for the vector α of variance components (see Chapter 5).

7.5 Shrinkage

To illustrate the interpretation of the EB estimates, consider the prediction $\widehat{Y}_i$ of the ith profile. It follows from (7.2) that
\[
\begin{aligned}
\widehat{Y}_i \;\equiv\; X_i \widehat{\beta} + Z_i \widehat{b}_i
 &= X_i \widehat{\beta} + Z_i D Z_i' V_i^{-1}\, (y_i - X_i \widehat{\beta}) \\
 &= \left( I_{n_i} - Z_i D Z_i' V_i^{-1} \right) X_i \widehat{\beta} + Z_i D Z_i' V_i^{-1}\, y_i \\
 &= \Sigma_i V_i^{-1}\, X_i \widehat{\beta} + \left( I_{n_i} - \Sigma_i V_i^{-1} \right) y_i,
\end{aligned}
\tag{7.6}
\]
and therefore can be interpreted as a weighted average of the population-averaged profile $X_i \widehat{\beta}$ and the observed data $y_i$, with weights $\Sigma_i V_i^{-1}$ and $I_{n_i} - \Sigma_i V_i^{-1}$, respectively. Note that the “numerator” of $\Sigma_i V_i^{-1}$ is the residual covariance matrix $\Sigma_i$ and the “denominator” is the overall covariance matrix $V_i$. Hence, much weight will be given to the overall average profile if the residual variability is large in comparison to the between-subject variability (modeled by the random effects), whereas much weight will be given to the observed data if the opposite is true.

In the Bayesian literature, one usually refers to phenomena like those exhibited in expression (7.6) as shrinkage (Carlin and Louis 1996, Strenio, Weisberg, and Bryk 1983). The observed data are shrunken toward the prior average profile, which is $X_i \beta$ since the prior mean of the random effects was zero. This is also illustrated in (7.4), which implies that for any linear combination λ of the random effects,
\[
\operatorname{var}(\lambda'\, \widehat{b}_i) \;\leq\; \operatorname{var}(\lambda'\, b_i). \tag{7.7}
\]

7.6 Example: The Random-Intercepts Model

As a special case, we consider the random-intercepts model, that is, a linear mixed model where the only subject-specific effects are intercepts. An ex-
ample was already given in Section 6.3.3, in the context of the rat data. The
random-effects covariance matrix D is now a scalar and will be denoted by
σb2 . Further, all design matrices Zi are of the form 1ni , an ni -dimensional
vector of ones. We will assume here that all residual covariance matrices
are of the form Σi = σ 2 Ini ; that is, we assume conditional independence
(see Section 3.3.4).

Denoting $1_{n_i} 1_{n_i}'$ by $J_{n_i}$, it follows from (7.2) that the EB estimate for the random intercept of subject i is given by
\[
\begin{aligned}
\widehat{b}_i
 &= \sigma_b^2\, 1_{n_i}' \left( \sigma_b^2 J_{n_i} + \sigma^2 I_{n_i} \right)^{-1} (y_i - X_i \widehat{\beta}) \\
 &= \frac{\sigma_b^2}{\sigma^2}\, 1_{n_i}' \left( I_{n_i} - \frac{\sigma_b^2}{\sigma^2 + n_i \sigma_b^2}\, J_{n_i} \right) (y_i - X_i \widehat{\beta}) \\
 &= \frac{n_i \sigma_b^2}{\sigma^2 + n_i \sigma_b^2}\; \frac{1}{n_i} \sum_{j=1}^{n_i} \left( y_{ij} - x_{ij}' \widehat{\beta} \right),
\end{aligned}
\tag{7.8}
\]
where the vector $x_{ij}'$ is the jth row in the design matrix $X_i$. Note that $r_{i\cdot} = \sum_{j=1}^{n_i} (y_{ij} - x_{ij}' \widehat{\beta}) / n_i$ equals the average residual for subject i.
Expression (7.8) clearly illustrates the shrinkage effect. It immediately follows from $n_i \sigma_b^2 / (\sigma^2 + n_i \sigma_b^2) < 1$ that $\widehat{b}_i$ is a weighted average of zero (the
prior mean of bi ) and the average residual ri· . The larger the number ni of
measurements available for subject i, the more weight is put on ri· , yielding
less severe shrinkage. Expression (7.8) also shows that more shrinkage is
obtained in cases where the within-subject variability is large in comparison
to the between-subject variability.
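As a small numerical illustration (the variance components below are hypothetical): take $\sigma^2 = 4$ and $\sigma_b^2 = 1$. The shrinkage weight in (7.8) then equals
\[
\frac{n_i\, \sigma_b^2}{\sigma^2 + n_i\, \sigma_b^2}
 \;=\; 0.2 \ (n_i = 1), \qquad 0.5 \ (n_i = 4), \qquad 0.8 \ (n_i = 16),
\]
so a subject with a single measurement has 80% of its average residual shrunken away toward the prior mean of zero, whereas a subject with 16 measurements retains 80% of it.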

7.7 Example: The Prostate Data

As an illustration of EB estimation, we calculated the EB estimates for the random effects in our final model (6.8). Frequency histograms and scatter plots of these estimates can be found in Figure 7.1. Note how the scatter plots clearly show strong negative correlations between the intercepts and slopes for time, and between the slopes for time and the slopes for time². On the other hand, the intercepts are positively correlated with the slopes for the quadratic time effect. This is in agreement with the estimates for the covariance parameters in D (see Table 6.1).

The histograms in Figure 7.1 suggest the presence of outliers. Furthermore, we highlighted subjects #22, #28, #39, and #45, who are the individuals with the four highest slopes for time² and the four smallest slopes for time. Hence, these are the subjects with the strongest (quadratic) growth of ln(1 + PSA) over time. Pearson et al. (1994) noticed that the local/regional cancer cases #22, #28, and #39 were probably misclassified by the original methods of clinical staging and should have been included in the group of metastatic cancer cases instead. Further, subject #45 is the metastatic cancer case with the highest rate of increase of ln(1 + PSA) over time (see also Figure 2.3).
To illustrate the shrinkage effect, we calculated $\widehat{Y}_i$ and $X_i \widehat{\beta}$ for subjects #15 and #28 in the prostate data set. The resulting predicted profiles and the observed profiles are shown in Figure 7.2. The EB estimates clearly correct the population-average profile toward the observed profile.

The sample covariance matrix of the EB estimates $\widehat{b}_i$ equals
\[
\widehat{\operatorname{var}}(\widehat{b}_i) \;=\;
\begin{pmatrix}
 0.4033 & -0.4398 & 0.1311 \\
 -0.4398 & 0.7287 & -0.2532 \\
 0.1311 & -0.2532 & 0.0922
\end{pmatrix},
\]
which clearly underestimates the variability in the random-effects population (compare to the elements of $\widehat{D}$ in Table 6.1). This again illustrates the shrinkage effect previously obtained in expression (7.7).

FIGURE 7.1. Prostate Data. Histograms (panels a, c, and e) and scatter plots (panels b, d, and f) of the empirical Bayes estimates for the random intercepts and slopes in the final model (6.8).

7.8 The Normality Assumption for Random Effects

7.8.1 Introduction

As described in Section 7.2, EB estimates for subject-specific regression coefficients are often used for diagnostic purposes, such as checking whether the normality assumption for the random effects bi is appropriate.

FIGURE 7.2. Prostate Data. Observed profiles (short dashes), population-average predicted profiles (long dashes), and subject-specific predicted profiles (solid line) for subjects #15 and #28 of the prostate data set.

However, it should be emphasized that, even when the assumed linear mixed model is correctly specified, the EB estimators $\widehat{b}_i$ all have different dis-
tributions unless all covariate matrices Xi and Zi are the same. Hence, it
may be questioned to what extent histograms and scatter plots of unstan-
dardized EB estimates are interpretable. This is why DeGruttola, Lange,
and Dafni (1991) first standardize the EB estimates prior to constructing
normal quantile plots to check the normality assumption for the random
effects.

Note also that even when all EB estimates follow the same distribution,
it follows from the shrinkage effect (see Section 7.5) that the histogram
of the EB estimates shows less variability than is actually present in the population of random effects bi. This suggests that such histograms do not
necessarily reflect the correct random-effects distribution, and hence also
that histograms of (standardized) EB estimates might not be suitable for
detecting deviations from the normality assumption. Louis (1984) therefore
proposes to minimize other well-chosen loss functions than the classical
squared-error loss function which corresponds to the posterior mean (7.2).
For example, if interest is in the random-effects distribution, one could
minimize a distance function between the empirical cumulative distribution
function of the estimates and the true parameters.

In Section 7.8.2, it will be shown that, in some cases, a histogram of the classical EB estimates indeed does not reflect the correct random-effects distribution. This indicates that EB estimates are very dependent on their assumed prior distribution. In Section 7.8.3, it will be shown that this is in contrast with the estimation of the vector θ of parameters in the marginal model (5.1), which is very robust with respect to misspecifications of the random-effects distribution. Finally, Section 7.8.4 briefly discusses how the normality assumption of random effects can be checked in practice.

FIGURE 7.3. Histogram (range [−5, 5]) of 1000 random intercepts drawn from the normal mixture 0.5N(−2, 1) + 0.5N(2, 1).

7.8.2 Impact on EB Estimates

In order to investigate the sensitivity of EB estimates with respect to the assumed underlying random-effects distribution, Verbeke (1995) and Verbeke and Lesaffre (1996a) report results from 1000 simulated longitudinal profiles with 5 repeated measurements each, where univariate random intercepts bi were drawn from the mixture distribution
\[
\tfrac{1}{2}\, N(-2, 1) \;+\; \tfrac{1}{2}\, N(2, 1), \tag{7.9}
\]
reflecting the presence of heterogeneity in the population. In practice, such
a mixture could occur when the population under consideration consists
of two subpopulations of equal size, with negative and positive subject-
specific intercepts, respectively. A histogram of the realized values is shown
in Figure 7.3. They then fitted a linear mixed model, assuming normality
for the random effects, and they calculated the EB estimates bi for the
random intercepts in the model. The histogram of these estimates is shown
in Figure 7.4. Clearly, the severe amount of shrinkage forces the estimates bi
to satisfy the assumption of a homogeneous, unimodal (normal) population.

FIGURE 7.4. Histogram (range [−5, 5]) of the Empirical Bayes estimates of the random intercepts shown in Figure 7.3, calculated under the assumption that the random effects are normally distributed.

Under the conditional independence model (see Section 3.3.4), Verbeke (1995) and Verbeke and Lesaffre (1996a) have shown that the bimodal aspect of the original distribution will not be reflected in the distribution of the EB estimates obtained under the normality assumption, as soon as the eigenvalues of $\sigma^2 (Z_i' Z_i)^{-1}$ are sufficiently large. This means that both the error variance and the covariate structure play an important role in the shape of the distribution of the $\widehat{b}_i$. First, if $\sigma^2$ is large, it will be difficult to detect heterogeneity in the random-effects population, based on the $\widehat{b}_i$. Thus, if the error variability $\sigma^2$ is large compared to the random-effects variability, the $\widehat{b}_i$ may not reflect the correct distributional shape of the
random effects. For a linear mixed-effects model with only random inter-
cepts, we previously defined the intraclass correlation ρI as d11 /(d11 + σ 2 ),
where d11 is the intercept variability (see Section 6.3.3). It represents the
correlation between two repeated measurements within the same subject,
that is, $\rho_I = \operatorname{corr}(Y_{ik}, Y_{il})$ for all $k \neq l$. We now have that subgroups
in the random-effects population will be unrecognized when the within-
subject correlation is small. Note that this again illustrates that the degree
of shrinkage increases as the residual variability increases (see Section 7.6).
The residual variance used in the above simulation by Verbeke and Lesaffre
(1996a) equals σ 2 = 30. Hence, since the variance of the distribution (7.9)
equals 5, the implied intraclass correlation equals ρI = 0.14.
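Both numbers follow directly from (7.9): the mixture has mean zero, so its variance equals the average within-component variance plus the variance of the component means,
\[
\operatorname{var}(b_i) \;=\; \tfrac{1}{2}\left(1 + (-2)^2\right) + \tfrac{1}{2}\left(1 + 2^2\right) \;=\; 5,
\qquad
\rho_I \;=\; \frac{5}{5 + 30} \;\approx\; 0.14.
\]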

On the other hand, the covariates $Z_i$ are also important. For example, suppose that $Z_i$ is an $n_i$-dimensional vector with elements $t_{i1}, t_{i2}, \ldots, t_{in_i}$. We then have that $Z_i' Z_i$ equals $\sum_{j=1}^{n_i} t_{ij}^2$. Hence, heterogeneity will be more likely to be detected when the bi represent random slopes in a model for measurements taken at large time points $t_{ij}$ than when the bi are random intercepts (all $t_{ij}$ equal to 1). If $Z_i$ contains both random intercepts and random slopes for time points $t_{ij}$, we have that $Z_i' Z_i$ equals
\[
Z_i' Z_i \;=\;
\begin{pmatrix}
 n_i & \sum_{j=1}^{n_i} t_{ij} \\[4pt]
 \sum_{j=1}^{n_i} t_{ij} & \sum_{j=1}^{n_i} t_{ij}^2
\end{pmatrix},
\]

which has two positive eigenvalues $\lambda_1$ and $\lambda_2$, given by
\[
2\lambda_k \;=\; \sum_{j=1}^{n_i} t_{ij}^2 + n_i
 \;+\; (-1)^k \sqrt{ \left( \sum_{j=1}^{n_i} t_{ij}^2 - n_i \right)^{\!2} + 4 \left( \sum_{j=1}^{n_i} t_{ij} \right)^{\!2} }.
\]
In a designed experiment with $Z_i = Z$ for all i, and where the time points $t_j$ are centered, the eigenvalues are $\lambda_1 = n$ and $\lambda_2 = \sum_{j=1}^{n} t_j^2$, or vice versa. So, if we are interested in detecting subgroups in the random-effects population, we should take as many measurements as possible, at the beginning and at the end of the study (maximal spread of the time points).
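For instance (a hypothetical design, not taken from the text), five centered time points $t_j \in \{-2, -1, 0, 1, 2\}$ give $n = 5$ and $\sum_j t_j^2 = 10$, hence eigenvalues $\lambda_1 = 5$ and $\lambda_2 = 10$, whereas spreading the same five measurements maximally, $t_j \in \{-2, -2, 0, 2, 2\}$, gives $\sum_j t_j^2 = 16$ and thus a larger dominant eigenvalue.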

In many cases, one is not especially interested in detecting subgroups in the random-effects distribution. However, the above examples clearly illustrate
that the EB estimates may be highly affected by their normality assump-
tion. Hence, if there is any interest at all in the distribution of the random
effects bi , one should explore whether the assumed Gaussian distribution is
appropriate. This will be discussed in Section 7.8.4. The special case of the
detection of heterogeneity in the random-effects population will be handled
in detail in Chapter 12.

7.8.3 Impact on the Estimation of the Marginal Model

In contrast to our results in the previous section on the estimation of the random effects, estimation of the vector θ of parameters in the marginal
model is very robust with respect to misspecification of the random-effects
distribution. We refer to Section 12.7 for an illustration of this property.
Using simulations and the analysis of a real data set, Butler and Louis
(1992) have shown that wrongly specifying the random-effects distribution
of univariate random effects has little effect on the fixed-effects estimates
as well as on the estimates for the residual variance and the variance of the
random effects. No evidence was found for any inconsistencies among these
estimators. However, it was shown that the standard errors of all parameter
estimators need correction in order to get valid inferences. Using theory
on maximum likelihood estimation in misspecified models (White 1980,
1982), Verbeke (1995), and Verbeke and Lesaffre (1994, 1996b) extended
this to the general model (3.8). Let $N \times A_N(\theta)$ be minus the matrix of second-order derivatives of the log-likelihood function with respect to the elements of θ, and let $N \times B_N(\theta)$ be the matrix with cross-products of first-order derivatives of the log-likelihood function, also with respect to θ. Their estimated versions, obtained from replacing θ by its MLE, are denoted by $\widehat{A}_N$ and $\widehat{B}_N$, respectively. Verbeke (1995) and Verbeke and
Lesaffre (1996b) then prove that, under general regularity conditions, the MLE $\widehat{\theta}$ of θ is asymptotically normally distributed with mean θ and with asymptotic covariance matrix
\[
A_N^{-1}\, B_N\, A_N^{-1} / N, \tag{7.10}
\]
for $N \rightarrow \infty$.

It can easily be seen that the covariance matrix obtained from replacing α in (6.3) by its MLE equals
\[
\operatorname{var}(\widehat{\beta}) \;=\; \left( \sum_{i=1}^{N} X_i'\, V_i^{-1}(\widehat{\alpha})\, X_i \right)^{-1} \;=\; \widehat{A}_{N,11}^{-1} / N, \tag{7.11}
\]
where $\widehat{A}_{N,11}$ is the leading block in $\widehat{A}_N$, corresponding to the fixed effects.
Hence, we have that the asymptotic covariance matrix for $\widehat{\beta}$, obtained from (7.10), adds extra variability to the “naive” estimate (7.11) by taking into account the estimation of the variance components, but it also corrects for possible misspecification of the random-effects distribution. Note also that $\widehat{A}_N^{-1} \widehat{B}_N \widehat{A}_N^{-1} / N$ is of the same form as the so-called “information sandwich” estimator for the asymptotic covariance matrix of fixed effects, estimated with quasi-likelihood methods (see Section 6.2.4 and, e.g., Liang and Zeger 1986). However, the above asymptotic result relates to both the fixed effects and the parameters in the “working correlation” model, and the model is incorrectly specified only through the random-effects distribution, whereas the covariance structure is assumed to be correct.

In practice, the asymptotic covariance matrix of MLEs is usually estimated by the inverse Fisher information matrix. However, this is only valid un-
der the assumed model. Verbeke and Lesaffre (1997a) performed exten-
sive simulations to compare this uncorrected covariance matrix with the
above sandwich estimator, which corrects for possible non-normality of
the random effects. In general, they conclude that, for the fixed effects,
the corrected and uncorrected standard errors are very similar. This is in
agreement with the results of Sharples and Breslow (1992), who showed
that, for correlated binary data, the sandwich estimator for the covariance
matrix of fixed effects is almost as efficient as the uncorrected model-based
estimator, even under the correct model.

For the random components on the other hand, and more specifically for the elements in D, this is only true under the correct model (normal random effects). When the random effects are not normally distributed, the corrected standard errors are clearly superior to the uncorrected ones. In some cases, the correction enlarges the standard errors to get confidence
levels closer to the pursued level. In other cases, the correction results in
smaller standard errors protecting against too conservative confidence in-
tervals.

Verbeke and Lesaffre (1997a) calculated the corrected and uncorrected standard errors for all parameters in model (3.10). The ratio of the corrected over the uncorrected standard errors was between 0.52 and 1.72 for all parameters, whereas the same ratio could be shown to be between 0.21 and 2.76 for any linear combination $\lambda'\theta$ of the parameters. Hence, the un-
corrected standard errors could be up to five times too large, and almost
three times too small, when compared to the standard errors obtained after
correcting for possible non-normality of the random effects bi .

7.8.4 Checking the Normality Assumption

The results presented in the previous section suggest that if interest is only in inference for the marginal model, and especially if interest is only
in inference for the fixed effects, valid inferences are obtained even when
the random effects have been incorrectly assumed to be normally distrib-
uted. This is in strong contrast with the results discussed in Section 7.8.2,
showing that the EB estimates may heavily depend on their distributional
assumptions. This calls for methods to check these underlying assumptions.

Lange and Ryan (1989) have proposed to check the normality assumption
for the random effects based on weighted normal quantile plots of stan-
dardized linear combinations
\[
v_i \;=\; \frac{c'\, \widehat{b}_i}{\sqrt{c'\, \operatorname{var}(\widehat{b}_i)\, c}},
\]
of the estimates $\widehat{b}_i$. However, since the $v_i$ are functions of the random effects
bi as well as of the error terms εi , these normal quantile plots can only
indicate that the vi do not have the distribution one expects under the
assumed model, but the plots cannot differentiate a wrong distributional
assumption for the random effects or the error terms from a wrong choice
of covariates.

This suggests that non-normality of the random effects can only be de-
tected by comparing the results obtained under the normality assumption
with results obtained from fitting a linear mixed model with relaxed distri-
butional assumptions for the random effects. Verbeke (1995) and Verbeke
and Lesaffre (1996a, 1997b) therefore propose to extend the general linear
mixed model (3.8) by allowing the bi to be sampled from a mixture of g normal distributions with equal covariance matrix, i.e.,
\[
b_i \;\sim\; \sum_{j=1}^{g} p_j\, N(\mu_j, D), \tag{7.12}
\]
with $\sum_{j=1}^{g} p_j = 1$, and such that the mean $\sum_{j=1}^{g} p_j \mu_j$ equals zero. As discussed in Section 7.8.2, this extension naturally arises from assuming that there is unobserved heterogeneity in the random-effects population. Each component in the mixture represents a cluster containing a proportion $p_j$ of the total population. The model is therefore called the heterogeneity model and the linear mixed model discussed so far can then be called the homogeneity model. Also, as shown in Figure 7.5, it extends the assumption about the random-effects distribution to a very broad class of distributions: unimodal as well as multimodal, symmetric as well as very skewed. Note that heterogeneity in the random-effects population may occur very often in practice. Whenever a categorical covariate has been omitted as a fixed effect in a linear mixed-effects model, the random effects will follow a mixture of g normal distributions, where g is the number of categories of the missing covariate.

FIGURE 7.5. Density functions of mixtures $pN(\mu_1, \sigma_b^2) + (1-p)N(\mu_2, \sigma_b^2)$ of two normal distributions, for varying values of p and $\sigma_b^2$. The dashed lines represent the densities of the normal components; the solid line represents the density of the mixture.

Verbeke and Lesaffre (1996a) considered the number of components g in (7.12) to be known. In practice, several models can be fitted, with increas-
ing values for g, leading to a series of nested models, and testing proce-
dures such as the likelihood ratio test could be used for the comparison
of these models. However, as discussed by Ghosh and Sen (1985), testing
for the number of components in a finite mixture is seriously complicated
by boundary problems similar to the ones discussed in Section 6.3.4 in the
context of testing for the need of random effects. In order to briefly high-
light the main problems, we consider testing H0 : g = 1 versus HA : g = 2.
The null hypothesis can then be expressed as H0 : µ1 = µ2 . However, the
same hypothesis is obtained by setting H0 : p1 = 0 or H0 : p2 = 0, which
clearly illustrates that H0 is on the boundary of the parameter space, and
hence also that the usual regularity conditions for application of the clas-
sical maximum likelihood theory are violated. Therefore, simulations are
needed to derive the correct null distribution of the LR test statistic. We
refer to Verbeke (1995, Section 4.6) for an example, and to McLachlan and
Basford (1988, Section 1.10) for an extensive overview of the literature on
the use of the LR test in finite mixture problems. Finally, some informal
procedures for checking the goodness-of-fit of heterogeneity models will be
discussed in Section 12.5. Obviously, these procedures can also be used to
explore the plausibility of the normality assumption for the random effects.
In practice however, it may be sufficient to fit several heterogeneity models
and to explore how increasing g affects the inference for the parameters of
interest. We refer to Chapter 12 for details on the definition and the fitting
of the heterogeneity model, and for two examples where the heterogeneity
model is used for the classification of longitudinal profiles.

As an example, a two-component heterogeneity model was fitted to the data simulated and analyzed in Section 7.8.2. Figure 7.6 shows the EB
estimates of the 1000 simulated random intercepts previously shown in
Figure 7.3, and obtained under a two-component heterogeneity model. An
expression for these estimates will be derived and discussed in Section 12.3.
In comparison to the histogram of the EB estimates under the normality
assumption (Figure 7.4), the correct random-effects distribution is (much)
better reflected. We do not claim that the bi , calculated under the het-
erogeneity model, perfectly reflect the correct mixture distribution, but,
at least, they suggest that the random effects certainly do not follow a
normal distribution, as was suggested by the estimates obtained under the
normality assumption.

FIGURE 7.6. Histogram (range [−5, 5]) of the Empirical Bayes estimates of the random intercepts shown in Figure 7.3, calculated under the assumption that the random effects are drawn from a two-component mixture of normal distributions.

Magder and Zeger (1996) also considered linear mixed models with mixtures of normal distributions as random-effects distribution, but they treated the number g of components as an unknown parameter, to be estimated from the data. In order to avoid obtaining nonsmooth mixture distributions with many components, they prespecify a lower boundary h for the within-component variability, measured by |D|, the determinant of the covariance matrix of each component in the mixture. In practice, very
little difference is expected from the model used by Verbeke and Lesaffre
(1996a). Indeed, when a very smooth mixing distribution is required, a
large value of h can be specified, which will yield a mixture of a relatively
small number of normal distributions.
8 Fitting Linear Mixed Models with SAS

8.1 Introduction

In Chapters 5 and 6, estimation and inference on all parameters in the marginal model (5.1) were discussed. Chapter 7 considered inference for the
random effects in the hierarchical model (3.8). At present, among the most
flexible commercially available statistical packages is the SAS procedure
PROC MIXED (SAS 1992, 1996, 1997). In this chapter, we will therefore
explain in full detail how all previously discussed inferences can be obtained
with this procedure, using SAS Release 6.12 (SAS 1997). Although this
may seem anomalous to many, given the availability of Version 7.0 and
higher, it has to be noted that Version 7.0 (SAS 1999) was not available
on a commercial basis in 1999, for example, in Europe. For a thorough
description of PROC MIXED in SAS Version 7.0, we refer to the on-line
manual (SAS 1999). Further, some of the important changes in comparison
to Version 6.12 are summarized in Appendix A.

In this chapter, our original model (3.10) for the prostate data will be used
as a guiding example. In Section 8.2, the program for fitting the model will
be presented, together with some available options. It is by no means our
intention to give a full overview of all available statements and options.
Instead, we restrict ourselves to those statements and options that are, in our ex-
perience, most frequently used in the analysis of longitudinal data. When
fitting mixed models in other contexts, other statements or options may

be more appropriate. We refer to the SAS manuals (SAS 1992, 1996, 1997)
and to Littell et al. (1996) for more details on the procedure MIXED and
for a variety of examples in other contexts.

The SAS output consists of a series of tables, each addressing a specific aspect of the fitted model. These will be discussed in Section 8.3. An alter-
native SAS procedure, often used in practice for the analysis of longitudinal
data, is PROC GLM. In Section 8.6, the most important differences be-
tween the procedures GLM and MIXED will be summarized.

8.2 The SAS Program

We now consider fitting the original linear mixed model (3.10) for the
prostate data. Let the variable group indicate whether a subject is a con-
trol (group = 1), a BPH case (group = 2), a local cancer case (group = 3)
or a metastatic cancer case (group = 4). As in Sections 5.5 and 5.6.1, we
express time in decades before diagnosis, rather than years before diagno-
sis. Further, we define the variable timeclss to be equal to time. This will
enable us to consider time as a continuous covariate and as a classification
variable (a factor in the ANOVA terminology) simultaneously. As before,
the variable age measures the age of the subject at the time of diagnosis.
Finally, id is a variable containing the subject’s identification label, and
lnpsa is the logarithmic transformation ln(1 + x) of the original PSA mea-
surements. We can then use the following program to fit model (3.10) and
to obtain the inferences described in Chapters 6 and 7:

proc mixed data = prostate method = reml asycov asycorr
           covtest ic;
  class id group timeclss;
  model lnpsa = group age group*time age*time
                group*time2 age*time2
                / noint solution ddfm = satterth covb chisq;
  id id time;
  random intercept time time2 / type = un subject = id
                                g gcorr v vcorr solution;
  repeated timeclss / type = simple subject = id r rcorr;
  contrast 'Final model' age*time 1,
                         group*time 1 0 0 0,
                         age*time2 1,
                         group*time2 1 0 0 0,
                         group*time2 0 1 0 0,
                         group*time2 0 0 1 -1 / chisq;
  estimate 'Diff L/R-BPH, t=5yr' group 0 -4 4 0
                                 group*time 0 -2 2 0
                                 group*time2 0 0 1 0
                                 / cl alpha = 0.05 divisor = 4;
  make 'solutionR' out = randeff;
run;

Before presenting the results of this analysis, we briefly discuss the state-
ments and options used in the above program.

8.2.1 The PROC MIXED Statement

This statement calls the procedure MIXED and specifies that the data
be stored in the SAS data set ‘prostate’. If no data set is specified, then
the most recently created data set is used. In general, there are two ways
of setting up data sets containing repeated measurements. One way is to
define a variable for each variable measured and for each time point in
the data set at which at least one subject was measured. Each subject
then corresponds to exactly one record (one line) in the data set. This
setup is convenient when the data are highly balanced, that is, when all
measurements are taken at only a small number of time points. However, this
approach leads to huge data matrices with many missing values in cases of
highly unbalanced data such as the prostate data. The same problem occurs
in the presence of missing data or dropout (see Section 17.3). Therefore, the
MIXED procedure requires that the data set be structured such that each
record corresponds to the measurements available for a subject at only one
moment in time. For example, five repeated measurements for individual
i are put into five different records. This has the additional advantage
that time-varying covariates (such as time) can be easily incorporated into
the model. An identification variable id is needed to link measurements to
subjects, and a time variable is used to order the repeated measurements
within each individual. For example, our prostate cancer data set is set up
in the following way:

OBS ID LNPSA TIME AGE GROUP

1 1 0.405 1.94 72.4 2


2 1 0.336 1.44 72.4 2
3 1 0.693 1.20 72.4 2
... .. ..... .... .... .
461 54 0.182 0.46 62.9 1
462 54 0.262 0.25 62.9 1
463 54 0.182 0.00 62.9 1
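For completeness, the following hypothetical DATA step sketches how a ‘wide’ data set (one record per subject, with responses y1-y3 measured at three common time points) could be restructured into the required format; the variable names and the time coding are assumptions for illustration only:

data long;
  set wide;                     /* assumed wide data set: id, age, y1-y3 */
  array y{3} y1-y3;             /* responses at three fixed time points  */
  do j = 1 to 3;
    time  = j - 1;              /* hypothetical time coding              */
    lnpsa = y{j};
    if lnpsa ne . then output;  /* one record per available measurement  */
  end;
  drop y1-y3 j;
run;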

The option ‘method=’ specifies the estimation method. In this book, we will always specify ‘method=ML’ or ‘method=REML’, requesting ML or
REML estimation, respectively. However, it is also possible to use the non-
iterative MIVQUE0 method (minimum variance quadratic unbiased esti-
mation), which is used by default to compute starting values for the iter-
ative ML and REML estimation procedures. We refer to the SAS manual
(1996) for a treatment of the MIVQUE0 method. If no method is specified,
then REML estimation is used by default.

The options ‘asycov’ and ‘asycorr’ can be used for printing the asymptotic
covariance matrix as well as the associated correlation matrix of the es-
timators for the variance components in the marginal model. The option
‘covtest’ requests the printing of the resulting asymptotic standard errors
and associated Wald tests for those variance components. Note that these
Wald tests are not applicable to all variance components in the model (see
discussion in Section 6.3.1). The calculation of the information criteria dis-
cussed in Section 6.4 can be requested by adding the option ’ic’.

8.2.2 The CLASS Statement

This statement specifies which variables should be considered as factors. Such classification variables can be either character or numeric. Internally, each of these factors will correspond to a set of dummy variables in the manner described in the SAS manual on linear models (1991, Section 5.5).

8.2.3 The MODEL Statement

The MODEL statement names the response variable (one and only one)
and all fixed effects, which determine the Xi matrices. Note that in order to
have the same parameterization for the mean structure as model (3.10), no
overall intercept (using the ‘noint’ option) nor overall linear or quadratic
time effects should be included into the model, since, otherwise, the mean
structure is parameterized using contrasts between the intercepts and slopes
of the first three diagnostic groups and those for the last group. Although
this would facilitate the testing of group differences (see also Section 8.4),
it complicates the interpretation of the parameter estimates.

The ‘solution’ option is used to request the printing of the estimates for all
the fixed effects in the model, together with standard errors, t-statistics, and
corresponding p-values for testing their significance (see Section 6.2). When
the whole model-based estimated covariance matrix (6.3) for the fixed ef-
fects is of interest, it can be obtained by specifying the option ‘covb’. The estimation method for the degrees of freedom in the t- and F-approximations needed in tests for fixed-effects estimates (see Section 6.2.2) is specified in the option ‘ddfm=’. Here, the Satterthwaite approximation was used, but
other methods are also available within SAS. When the option ‘chisq’ is
added to the MODEL statement, SAS also provides approximate Wald tests
(Section 6.2.1), next to the default t- and F -tests, for all effects specified
in the MODEL statement.

8.2.4 The ID Statement

When predicted values are requested with the option ’predmeans’ or ’pre-
dicted’ in the MODEL statement, SAS prints a table with the requested
predicted value, the corresponding observed value, and the resulting resid-
ual, for each record in the original data set. Although the records are then
printed in the same order as they appear in the original data set, it may
still be helpful to add columns which help identify the records. This is done
via the ID statement. The values of the variables in the ID statement are
then printed beside each observed, predicted, and residual value. In our ex-
ample, we used the identification number of the patient, together with the
time point at which a measurement was taken to identify the predictions
and residuals which we requested in the MODEL statement.

8.2.5 The RANDOM Statement

This statement is used to define the random effects in the model, that is,
the matrices Zi containing the covariates with subject-specific regression
coefficients. Note that when random intercepts are required, this should be
specified explicitly, which is in contrast to the MODEL statement where
an intercept is included by default.

The ‘subject=’ option is used to identify the subjects in our data set. Here,
‘subject=id’ means that all records with the same value for id are assumed
to be from the same subject, whereas records with different values for id are
assumed to contain independent data. This option also defines the block-
diagonality of the matrix Z and of the covariance matrix of b in (5.5). The
variable id is permitted to be continuous as well as categorical (specified
in the CLASS statement). However, when id is continuous, PROC MIXED
considers a record to be from a new subject whenever the value of id is
different from the previous record. Hence, one then should first sort the
data by the values of id. On the other hand, using a continuous id variable
reduces execution times for models with a large number of subjects (see the PROC MIXED manual).
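A sketch of such a sorting step, grouping all records per subject and ordering them in time, is:

proc sort data = prostate;
  by id time;
run;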

The ‘type=’ option specifies the covariance structure D for the random
effects bi . In our example, we specified ‘type=un’ which corresponds to a
general unstructured covariance matrix, i.e., a symmetric positive (semi-)
definite (q×q) matrix D. Many other covariance structures can be specified,
some of which are shown in Table 8.1 and Table 8.2. We further refer to
the SAS manual (1996) for a complete list of possible structures. Although
many structures are available, in longitudinal data analysis, one usually
specifies ‘type=UN’ which does not assume the random-effects covariance
matrix to be of any specific form.

Specifying the options ‘g’ and ‘gcorr’ requests that the random-effects co-
variance matrix D as well as its associated correlation matrix are printed,
printing blanks for all values that are zero. The options ‘v’ and ‘vcorr’ can
be used if a printout of the marginal covariance matrix Vi = Zi DZi + Σi
and the corresponding correlation matrix, respectively, are needed. By de-
fault, SAS only prints the covariance and correlation matrices for the first
subject in the data set. However, ‘v=’ and ‘vcorr=’ can be used to specify
the identification numbers of the patients for which the matrix Vi and the
associated correlation matrix are needed.

The option ‘solution’ is needed for the calculation of the empirical Bayes
(EB) estimates for the random effects bi , previously derived and discussed
in Chapter 7. The result will be a table containing the EB estimates for
the random effects of all subjects included in the analysis. If, for example,
scatter plots or histograms of components of these estimates bi are to be
made, then the estimates should be converted to a SAS output data set.
This will be discussed in Section 8.2.9.

8.2.6 The REPEATED Statement

The REPEATED statement is used to specify the Σi matrices in the mixed model. The repeated effects define the ordering of the repeated measure-
ments within each subject. These effects (in the example, ‘timeclss’) must
be classification variables, which is why we needed two versions of our time
variable: a continuous version ‘time’ needed in the MODEL statement as
well as in the RANDOM statement, and a classification version ‘timeclss’
needed in the REPEATED statement. Usually, one will specify only one re-
peated effect. Its levels should then be different for each observation within
a subject. If not, PROC MIXED constructs identical rows in Σi corre-
sponding to the observations with the same level, yielding a singular Σi
and an infinite likelihood. If the data are ordered similarly for each sub-
ject, and any missing data are denoted with missing values, then specifying
a repeated effect is not necessary. In this case, the name ‘DIAG’ appears
as the repeated effect in the printed output.

TABLE 8.1. Overview of frequently used covariance structures which can be specified in the RANDOM and REPEATED statements of the SAS procedure MIXED. The σ parameters are used to denote variances and covariances, whereas the ρ parameters are used for correlations.

Unstructured (type=UN):
\[
\begin{pmatrix} \sigma_1^2 & \sigma_{12} & \sigma_{13} \\ \sigma_{12} & \sigma_2^2 & \sigma_{23} \\ \sigma_{13} & \sigma_{23} & \sigma_3^2 \end{pmatrix}
\]

Simple / Variance components (type=SIMPLE, type=VC):
\[
\begin{pmatrix} \sigma^2 & 0 & 0 \\ 0 & \sigma^2 & 0 \\ 0 & 0 & \sigma^2 \end{pmatrix}^{(1)}
\quad \text{or} \quad
\begin{pmatrix} \sigma_1^2 & 0 & 0 \\ 0 & \sigma_2^2 & 0 \\ 0 & 0 & \sigma_3^2 \end{pmatrix}^{(2)}
\]

Compound symmetry (type=CS):
\[
\begin{pmatrix} \sigma_1^2 + \sigma^2 & \sigma_1^2 & \sigma_1^2 \\ \sigma_1^2 & \sigma_1^2 + \sigma^2 & \sigma_1^2 \\ \sigma_1^2 & \sigma_1^2 & \sigma_1^2 + \sigma^2 \end{pmatrix}
\]

Banded (type=UN(2)):
\[
\begin{pmatrix} \sigma_1^2 & \sigma_{12} & 0 \\ \sigma_{12} & \sigma_2^2 & \sigma_{23} \\ 0 & \sigma_{23} & \sigma_3^2 \end{pmatrix}
\]

First-order autoregressive (type=AR(1)):
\[
\begin{pmatrix} \sigma^2 & \rho\sigma^2 & \rho^2\sigma^2 \\ \rho\sigma^2 & \sigma^2 & \rho\sigma^2 \\ \rho^2\sigma^2 & \rho\sigma^2 & \sigma^2 \end{pmatrix}
\]

Toeplitz (type=TOEP):
\[
\begin{pmatrix} \sigma^2 & \sigma_{12} & \sigma_{13} \\ \sigma_{12} & \sigma^2 & \sigma_{12} \\ \sigma_{13} & \sigma_{12} & \sigma^2 \end{pmatrix}
\]

Toeplitz (1) (type=TOEP(1)):
\[
\begin{pmatrix} \sigma^2 & 0 & 0 \\ 0 & \sigma^2 & 0 \\ 0 & 0 & \sigma^2 \end{pmatrix}
\]

Heterogeneous compound symmetry (type=CSH):
\[
\begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 & \rho\sigma_1\sigma_3 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 & \rho\sigma_2\sigma_3 \\ \rho\sigma_1\sigma_3 & \rho\sigma_2\sigma_3 & \sigma_3^2 \end{pmatrix}
\]

Heterogeneous first-order autoregressive (type=ARH(1)):
\[
\begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 & \rho^2\sigma_1\sigma_3 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 & \rho\sigma_2\sigma_3 \\ \rho^2\sigma_1\sigma_3 & \rho\sigma_2\sigma_3 & \sigma_3^2 \end{pmatrix}
\]

Heterogeneous Toeplitz (type=TOEPH):
\[
\begin{pmatrix} \sigma_1^2 & \rho_1\sigma_1\sigma_2 & \rho_2\sigma_1\sigma_3 \\ \rho_1\sigma_1\sigma_2 & \sigma_2^2 & \rho_1\sigma_2\sigma_3 \\ \rho_2\sigma_1\sigma_3 & \rho_1\sigma_2\sigma_3 & \sigma_3^2 \end{pmatrix}
\]

(1) Example: repeated timeclss / type=simple subject=id;
(2) Example: random intercept time time2 / type=simple subject=id;

TABLE 8.2. Overview of frequently used (stationary) spatial covariance structures, which can be specified in the RANDOM and REPEATED statements of the SAS procedure MIXED. The correlations are positive decreasing functions of the Euclidean distances $d_{ij}$ between the observations. The coordinates of the observations used to calculate these distances are given by a set of variables, the names of which are specified in the list ‘list’. The variance is denoted by $\sigma^2$, and ρ defines how fast the correlations decrease as functions of the $d_{ij}$.

Power (type=SP(POW)(list)):
\[
\sigma^2 \begin{pmatrix} 1 & \rho^{d_{12}} & \rho^{d_{13}} \\ \rho^{d_{12}} & 1 & \rho^{d_{23}} \\ \rho^{d_{13}} & \rho^{d_{23}} & 1 \end{pmatrix}
\]

Exponential (type=SP(EXP)(list)):
\[
\sigma^2 \begin{pmatrix} 1 & \exp(-d_{12}/\rho) & \exp(-d_{13}/\rho) \\ \exp(-d_{12}/\rho) & 1 & \exp(-d_{23}/\rho) \\ \exp(-d_{13}/\rho) & \exp(-d_{23}/\rho) & 1 \end{pmatrix}
\]

Gaussian (type=SP(GAU)(list)):
\[
\sigma^2 \begin{pmatrix} 1 & \exp(-d_{12}^2/\rho^2) & \exp(-d_{13}^2/\rho^2) \\ \exp(-d_{12}^2/\rho^2) & 1 & \exp(-d_{23}^2/\rho^2) \\ \exp(-d_{13}^2/\rho^2) & \exp(-d_{23}^2/\rho^2) & 1 \end{pmatrix}
\]

Note that this is not the same as completely omitting the REPEATED statement, since this would not allow specification of parametric forms for $\Sigma_i$ other than the simple form $\sigma^2 I_{n_i}$.
The options for the REPEATED statement are similar to those for the
RANDOM statement.

For example, the option ‘subject=’ identifies the subjects in the data set,
and complete independence is assumed across subjects. It therefore defines
the block-diagonality of the covariance matrix of ε in (5.5). With respect to
the variable id, the same remarks hold as the ones stated in our description
of the RANDOM statement. Although this is strictly speaking not required,
the RANDOM and REPEATED statement often have the same option
‘subject=id’, as was the case in our example.

Further, the ‘type=’ option specifies the covariance structure Σi for the
error components εi . All covariance structures previously described for the
RANDOM statement can also be specified here. Very often, one selects
‘type=simple’, which corresponds to the simplest covariance structure $\Sigma_i = \sigma^2 I_{n_i}$. Finally, if no REPEATED statement is used, PROC MIXED
automatically fits such a ‘simple’ covariance structure for the residual com-
ponents.
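As an illustrative alternative (not the model fitted in this chapter), a first-order autoregressive residual covariance structure from Table 8.1 would be requested as

repeated timeclss / type = ar(1) subject = id r rcorr;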

Finally, specifying the options ‘r’ and ‘rcorr’ requests that the residual co-
variance matrix Σi as well as its associated correlation matrix be printed.
As for the ‘v’ and ‘vcorr’ options in the RANDOM statement, SAS only
prints by default the covariance and correlation matrices for the first sub-
ject in the data set. However, ‘r=’ and ‘rcorr=’ can be used to specify the
identification numbers of the subjects for which the matrix Σi and the
associated correlation matrix are needed.

8.2.7 The CONTRAST Statement

The CONTRAST statement allows testing of general linear hypotheses of the form
(6.4). In our program, we have shown how to test whether the original
model (3.10) for the prostate data can be reduced to the final model (6.8)
[see hypothesis (6.7) obtained in Section 6.2.3].

Since it is possible to test several hypotheses at once (specifying several CONTRAST statements), a label is needed for each contrast in order to
identify them in the output. This label can be up to 20 characters long and
must be enclosed in single quotes. In the above example, the label was ‘Final
model’. Following the label, one needs to specify the linear combinations in
the hypothesis (i.e., the rows in the matrix L), separated by commas. Each
row in L is represented by a list of the effects, specified in the MODEL
statement, followed by the appropriate elements in the corresponding row
of the matrix L. Effects for which the corresponding parameters only get
zero weight in the linear combination may be omitted. For example, the
last row in (6.7) only gives nonzero weights to parameters in β assigned
to the interaction of group by the quadratic time effect. The first two
parameters assigned to this effect (β12 and β13 ) get zero weight and the
other two parameters (β14 and β15 ) get weights 1 and −1, respectively.
This is represented in the last row of the above CONTRAST statement. A
similar argument leads to the other rows.

By default, the specified hypothesis is tested based on an F-test (see Section 6.2.2). The option ‘chisq’ requests that an approximate Wald test (see Section 6.2.1) also be performed.

8.2.8 The ESTIMATE Statement

The ESTIMATE statement allows estimation and testing of linear combinations of the fixed effects. In the above SAS program, we illustrate the estimation of the average difference in ln(1 + PSA), 5 years prior to diagnosis, between the local cancer cases and the BPH cases, that is, we will
estimate the linear combination (6.9), specified in Section 6.2.3.

As for the CONTRAST statement, several ESTIMATE statements can be specified, all having their own label. In our example, the given label equals
‘Diff L/R-BPH, t=5yr’. Linear combinations to be estimated are speci-
fied in exactly the same way as they would be specified in a CONTRAST
statement (see Section 8.2.7). The only difference in output is that an ES-
TIMATE statement also provides point estimates as well as confidence in-
tervals, whereas a CONTRAST statement only yields approximate F -tests
and Wald tests. The options ‘cl’ and ‘alpha=0.05’ are used to request that
an approximate t-type 95% confidence interval is calculated for the linear
combination specified. Finally, the option ‘divisor=4’ requests division of
all coefficients in the specified linear combination by 4.

8.2.9 The MAKE Statement

SAS allows conversion of any table produced by PROC MIXED to a SAS data file. This is done via a MAKE statement. If several tables need to be
converted, several MAKE statements can be used. In our example, we are
converting the empirical Bayes (EB) estimates which have been requested
by the option ‘solution’ in the RANDOM statement (see Section 8.2.5).
This is especially convenient for producing histograms and scatter plots of
components of the EB estimates bi , as the ones shown in Figure 7.1. Each
table is given a label which can be found in the SAS manuals and which
needs to be specified in the MAKE statement. For example, the table with
EB estimates is labeled ‘solutionR’. The option ‘out=randeff’ specifies the
name of the SAS data set which will contain the requested information.
In practice, one will often add the option ‘noprint’, which avoids printing
of the table in the SAS output. This option is particularly useful for large
tables such as the one containing the EB estimates.

It should be emphasized that, from Version 7.0 on, the MAKE statement
is replaced by the integrated ODS (output delivery system). In Version 7.0,
the MAKE statement is still supported, but it is envisaged that this will
no longer be the case in later versions. We refer to Appendix A for more
details and for an example.

8.2.10 Some Additional Statements and Options

Our program on p. 94 will provide model-based inferences for all parameters in the original linear mixed model (3.10). However, SAS also allows robust inference for all fixed effects in the model (see Section 6.2.4). This can be
requested by adding the option ‘empirical’ to the PROC MIXED statement.
All reported standard errors and inferences for the fixed effects will then
be based on the robust estimate for the covariance matrix rather than the
naive one.

As discussed in Section 5.6, convergence problems may often be avoided by using the Fisher scoring estimation method rather than the default Newton-Raphson-based procedures. This can be requested by adding the
option ‘scoring’ to the PROC MIXED statement. The expected Hessian
matrix instead of the observed Hessian is then also used to compute ap-
proximate standard errors for the covariance parameters. However, as will
be discussed in Chapter 21, this may yield invalid inferences when some of
the response measurements are missing. In practice, one can start the iter-
ation process using the Fisher scoring algorithm and proceed with Newton-
Raphson. This will then still yield inferences based on the observed Hessian
rather than the expected one. This can be done in SAS by specifying the
option ‘scoring=a’, where a is the required number of Fisher scoring steps.

Another way of avoiding convergence problems is to specify good starting values for the parameters in the marginal covariance structure. This can
be done by adding a PARMS statement to the model. Single values, and
also sequences of values can be specified. In the latter case, a grid search
is performed and the best point on the grid is used as starting value for
the iterative estimation procedure. SAS also allows fixing some of these
parameters to known values (option ‘eqcons’), which are then no longer
included in the iterative estimation procedure.
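A hypothetical sketch for our model, which has seven covariance parameters (assumed here to be the six elements of the unstructured matrix D, in the order used by PROC MIXED, followed by the residual variance; the numerical values are arbitrary illustrations):

parms (0.4) (-0.4) (0.7) (0.1) (-0.25) (0.09) (0.03);
  /* adding '/ eqcons=7' would hold the seventh parameter,
     the residual variance, fixed at its starting value */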

When predictions are of interest, the options ‘predmeans’ and ‘predicted’ can be added to the MODEL statement, which request the calculation of predicted means and predicted subject-specific profiles, respectively. The predicted means are given by $X_i \widehat{\beta}$, whereas predicted subject-specific profiles are obtained from calculating $X_i \widehat{\beta} + Z_i \widehat{b}_i$. SAS automatically also calculates residuals, defined as the difference between the predicted (mean or subject-specific) and observed values. The options ‘predmeans’ and ‘predicted’ are particularly useful when graphs such as Figure 7.2 are to be prepared.

Similar to the ESTIMATE statement, one can add the options ‘cl’ and
‘alpha=’ to the MODEL statement to request for the construction of t-
type confidence limits for each of the fixed-effect parameters.

As discussed in Sections 8.2.5 and 8.2.6, the RANDOM and REPEATED statements specify the structure in the covariance matrices D and Σi. The
option ‘group=’ can be added to both statements to specify heterogeneity
in these covariance structures. All observations having the same level of the
specified effect will have the same covariance parameters. Each new level

of the specified effect will produce a new set of covariance parameters with
the same structure as specified in the ‘type=’ option.

In Section 3.3.4, a general but flexible model was presented for the resid-
ual covariance Σi , which assumed that the residual component εi can be
decomposed as εi = ε(1)i +ε(2)i , in which ε(2)i is a component of serial corre-
lation and where ε(1)i represents pure measurement error. Such models can
also easily be fitted in PROC MIXED. One then has to specify the required
covariance structure of ε(2)i in the ‘type=’ option of the REPEATED state-
ment, whereas the measurement error component is obtained by adding the
option ‘local’ to the same REPEATED statement. This will be illustrated
in Section 9.4.
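A sketch of such a REPEATED statement, combining Gaussian serial correlation (one of the structures in Table 8.2) with measurement error through the ‘local’ option (the actual analysis is deferred to Section 9.4):

repeated timeclss / type = sp(gau)(time) local subject = id;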

The CONTRAST and ESTIMATE statements in our program on p. 94 illustrate how linear combinations of the fixed effects in the model can be estimated and tested. However, SAS also allows one to perform subject-specific inferences, which can be obtained by including random effects in the linear combinations. We refer to the SAS manuals and to Littell et al. (1996) for examples and more details.

Finally, as discussed in Section 5.4, SAS maximizes likelihood functions under the restriction that all diagonal elements of the random-effects co-
variance matrix D as well as all diagonal elements of the residual covariance
matrices Σi are positive. However, an example was given in Section 5.6.2,
which shows that, in practice, one may want to remove these constraints
on the parameter estimates. This can be done in SAS by adding the option
‘nobound’ to the PROC MIXED statement or to the PARMS statement.

8.3 The SAS Output

In the following sections, we will discuss the different parts of the SAS
output obtained from fitting the model on p. 94 to the prostate data.

8.3.1 Information on the Iteration Procedure

First of all, an ‘Estimation Iteration History’ table is given, describing the iteration history, that is, the process of maximizing the likelihood function
(or equivalently, the log-likelihood function). This table is of the following
form:

REML Estimation Iteration History

Iteration Evaluations Objective Criterion

0 1 -259.0577593
1 2 -753.2423823 0.00962100
2 1 -757.9085275 0.00444385
. . ............ ..........
6 1 -760.8988784 0.00000003
7 1 -760.8988902 0.00000000

Convergence criteria met.

The objective function is, apart from a constant which does not depend
on the parameters, minus twice the log-likelihood function. In the case
of REML estimation, the exact relation between $L_{\text{REML}}$ and the objective
function $\text{OF}_{\text{REML}}$ is given by

\[
-2 \ln\left(L_{\text{REML}}(\theta)\right) = (n - p) \ln(2\pi) + \text{OF}_{\text{REML}}(\theta). \qquad (8.1)
\]

For ML estimation, the above equation becomes

\[
-2 \ln\left(L_{\text{ML}}(\theta)\right) = n \ln(2\pi) + \text{OF}_{\text{ML}}(\theta).
\]

Hence, our final parameter estimates are those which minimize this objec-
tive function. The reported number of evaluations is the number of times
the objective function has been evaluated during each iteration. In the
‘Criterion’ column, a measure of convergence is given, where a value equal
to zero indicates that the iterative estimation procedure has converged.
In practice, the procedure is considered to have converged whenever the
convergence criterion is smaller than a so-called tolerance number which
is set equal to $10^{-8}$ by default. Unless specified otherwise, SAS uses the
relative Hessian convergence criterion, defined as $|g_k' H_k^{-1} g_k| / |f_k|$, where
$f_k$ is the value of the objective function at iteration $k$, $g_k$ is the gradient
(vector of first-order derivatives) of $f_k$, and $H_k$ is the Hessian (matrix of
second-order derivatives) of $f_k$. Other possible choices are the relative function
convergence criterion and the relative gradient convergence criterion,
defined as $|f_k - f_{k-1}| / |f_k|$ and $\max_j |g_{kj}| / |f_k|$, respectively, where $g_{kj}$ is
the $j$th element of $g_k$.
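
The criterion used, as well as its tolerance, can be changed through options of the PROC MIXED statement. As a hedged sketch (the options ‘convh=’, ‘convf=’, and ‘convg=’ select the Hessian, function, and gradient criteria, respectively):

  proc mixed data = prostate convf = 1e-8;  /* relative function criterion */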

8.3.2 Information on the Model Fit

The ‘Model Fitting Information’ table shows the following additional in-
formation:

Model Fitting Information for LNPSA

Description Value

Observations 463.0000
Res Log Likelihood -31.2350
Akaike’s Information Criterion -38.2350
Schwarz’s Bayesian Criterion -52.6018
-2 Res Log Likelihood 62.4700
Null Model LRT Chi-Square 501.8411
Null Model LRT DF 6.0000
Null Model LRT P-Value 0.0000
In our example, $n = \sum_{i=1}^{N} n_i = 463$ observations were used to calculate
the parameter estimates. The value of the REML log-likelihood function
evaluated at the REML estimates is reported as ‘Res Log Likelihood’, and
similarly for minus twice the maximized log-likelihood function. Note that
the log-likelihood can also easily be calculated from expression (8.1). In our
example, this becomes −[(463 − 15) ln(2π) − 760.8989]/2 = −31.2350. Note
that this value was already reported in Table 5.1. It is also used for the
calculation of the reported Akaike and Schwarz information criteria which
we defined in Section 6.4. The number of parameters used in the calculation
equals the number of variance components in the model. Hence, the criteria
reported here should only be used to compare models with the same mean
structure, but with different covariance structures, even when maximum
likelihood would be used for model fitting. We will come back to this in
Section 8.3.3. In our example, we have AIC = −31.2350 − 7 = −38.2350
and SBC = −31.2350 − 3.5 ln(463 − 15) = −52.6018, respectively.

The ‘Null Model LRT Chi-Square’ value is −2 times the log-likelihood from
the null model minus −2 times the log-likelihood from the fitted model,
where the null model is the one with the same fixed effects as the actual
model, but without any random effects, and with Σi = σ 2 Ini . This statistic
is then compared to a χ2 -distribution with degrees of freedom equal to the
number of variance components minus 1, and the reported p-value is the
upper tail area from this distribution. It is suggested that this p-value can
be used to test whether or not there is any need at all for modeling the
covariance structure of the data. However, as discussed in Section 6.3.4,
the obtained p-value is not valid since the classical maximum likelihood
theory from which it results does not hold due to boundary problems in
the parameter space. Hence, it is, in general, not recommended to interpret
the reported p-value in any such way.

8.3.3 Information Criteria

Since the option ‘ic’ was specified in the PROC MIXED statement, all four
information criteria previously defined in Section 6.4 were calculated. The
results are summarized in the following table:

Information Criteria

Better Parms q p AIC HQIC BIC CAIC

Larger 7 7 0 -38.2 -43.9 -52.6 -56.1


Larger 22 7 15 -53.2 -71.0 -98.4 -109.4
Smaller 7 7 0 76.5 87.8 105.2 112.2
Smaller 22 7 15 106.5 142.1 196.8 218.8

The first two rows are the information criteria as we defined them in Sec-
tion 6.4. When comparing different models, the model with the largest
value of AIC, SBC, HQIC, or CAIC is deemed best. The last two rows
are the versions defined based on minus twice the maximized log-likelihood
value rather than on the log-likelihood itself. They equal minus twice the
corresponding value in the first or second row. For these versions, small
values of AIC, SBC, HQIC, or CAIC are considered good. The reported
parameters p and q in the above output table correspond to the number
of fixed effects and the number of variance components that are taken into
account in the computation of the information criteria. Note that, since the
information criteria reported in the first and third rows do not take into
account the number of parameters in the mean structure (p set equal to
zero), these criteria should not be used to compare models with different
mean structures. Moreover, as explained in Section 6.2.5 and Section 6.4,
they are only fully interpretable when models are fitted with the maxi-
mum likelihood procedure, rather than the restricted maximum likelihood
procedure.

8.3.4 Inference for the Variance Components

First, a table labeled ‘Covariance Parameter Estimates (REML)’ is given


which contains parameter estimates for all variance components in the
model. Since we specified the option ‘covtest’, estimated standard errors
and approximate Wald tests are also given:

Covariance Parameter Estimates (REML)

Cov Parm Subject Estimate Std Error Z Pr > |Z|

UN(1,1) ID 0.4517 0.0976 4.63 0.0001


UN(2,1) ID -0.5178 0.1355 -3.82 0.0001
UN(2,2) ID 0.9153 0.2297 3.98 0.0001
UN(3,1) ID 0.1625 0.0529 3.07 0.0021
UN(3,2) ID -0.3356 0.0949 -3.53 0.0004
UN(3,3) ID 0.1308 0.0409 3.19 0.0014
TIMECLSS ID 0.0281 0.0022 12.41 0.0001

The entry UN(i,j) corresponds to the element dij of the random-effects


covariance matrix D. The entry TIMECLSS reports the inference results for
the residual variance σ 2 . Note that the p-values reported for all UN(i,i)
entries as well as for the entry TIMECLSS should be ignored since the clas-
sical maximum likelihood theory from which they result does not hold
due to boundary problems in the parameter space (see our discussion in
Section 6.3.1). Note that the estimates and standard errors were already
reported in Table 5.1.

Since we specified the options ‘asycov’ and ‘asycorr’ in the PROC MIXED
statement, we also get a printout of the estimated asymptotic covariance
matrix and associated correlation matrix for the above parameter esti-
mates. These are given by

Asymptotic Covariance Matrix of Estimates

Cov Parm Row COVP1 COVP2 COVP3 COVP4 COVP5 COVP6 COVP7

UN(1,1) 1 0.0095 -0.0116 0.0145 0.0039 -0.0049 0.0017 -0.0000


UN(2,1) 2 -0.0116 0.0183 -0.0277 -0.0069 0.0103 -0.0038 0.0000
UN(2,2) 3 0.0145 -0.0277 0.0527 0.0112 -0.0212 0.0086 -0.0000
UN(3,1) 4 0.0039 -0.0069 0.0112 0.0027 -0.0044 0.0017 -0.0000
UN(3,2) 5 -0.0049 0.0103 -0.0212 -0.0044 0.0090 -0.0038 0.0000
UN(3,3) 6 0.0017 -0.0038 0.0086 0.0017 -0.0038 0.0016 -0.0000
TIMECLSS 7 -0.0000 0.0000 -0.0000 -0.0000 0.0000 -0.0000 0.0000

and

Asymptotic Correlation Matrix of Estimates

Cov Parm Row COVP1 COVP2 COVP3 COVP4 COVP5 COVP6 COVP7

UN(1,1) 1 1.0000 -0.8830 0.6501 0.7594 -0.5383 0.4354 -0.0607


UN(2,1) 2 -0.8830 1.0000 -0.8917 -0.9634 0.8060 -0.7020 0.0983
UN(2,2) 3 0.6501 -0.8917 1.0000 0.9218 -0.9760 0.9168 -0.1325
UN(3,1) 4 0.7594 -0.9634 0.9218 1.0000 -0.8813 0.8038 -0.1291
UN(3,2) 5 -0.5383 0.8060 -0.9760 -0.8813 1.0000 -0.9806 0.1592
UN(3,3) 6 0.4354 -0.7020 0.9168 0.8038 -0.9806 1.0000 -0.1798
TIMECLSS 7 -0.0607 0.0983 -0.1325 -0.1291 0.1592 -0.1798 1.0000

respectively.

For other covariance structures, the table ‘Covariance Parameter Estimates


(REML)’ may look slightly different. Therefore it is useful in practice to
print out the complete covariance matrix derived from the estimated vari-
ance components. The resulting estimate for the random-effects covariance
matrix D and its associated correlation matrix are reported as

G Matrix

Effect ID Row COL1 COL2 COL3

INTERCEPT 1 1 0.4517 -0.5178 0.1625


TIME 1 2 -0.5178 0.9153 -0.3356
TIME2 1 3 0.1625 -0.3356 0.1308

and

G Correlation Matrix

Effect ID Row COL1 COL2 COL3

INTERCEPT 1 1 1.0000 -0.8052 0.6685


TIME 1 2 -0.8052 1.0000 -0.9700
TIME2 1 3 0.6685 -0.9700 1.0000

respectively, and were obtained by specifying the options ‘g’ and ‘gcorr’ in
the RANDOM statement. The estimate for the residual covariance matrix
Σ1 for the first subject in the data set, as well as the associated correlation
matrix are reported as

R Matrix for ID 1

Row COL1 COL2 COL3 COL4 COL5 COL6 COL7 COL8

1 0.0282
2 0.0282
3 0.0282
4 0.0282
5 0.0282
6 0.0282
7 0.0282
8 0.0282

and

R Correlation Matrix for ID 1

Row COL1 COL2 COL3 COL4 COL5 COL6 COL7 COL8

1 1.0000
2 1.0000
3 1.0000
4 1.0000
5 1.0000
6 1.0000
7 1.0000
8 1.0000

respectively, and were obtained by specifying the options ‘r’ and ‘rcorr’ in
the REPEATED statement. Note that zero entries are left blank. Finally,
combining the estimate for D with the estimate for $\Sigma_1$, an estimate is
obtained for the marginal covariance matrix $V_1 = Z_1 D Z_1' + \Sigma_1$, as well as
for the associated correlation matrix, of the first subject in the data set.
These are reported as

V Matrix for ID 1

Row COL1 COL2 COL3 COL4 COL5 COL6 COL7 COL8

1 0.4133 0.2959 0.2021 0.1460 0.1023 0.0709 0.0501 0.0585


2 0.2959 0.2684 0.1800 0.1427 0.1121 0.0882 0.0691 0.0576
3 0.2021 0.1800 0.1821 0.1357 0.1187 0.1029 0.0864 0.0570
4 0.1460 0.1427 0.1357 0.1566 0.1194 0.1086 0.0942 0.0569
5 0.1023 0.1121 0.1187 0.1194 0.1446 0.1098 0.0977 0.0571
6 0.0709 0.0882 0.1029 0.1086 0.1098 0.1345 0.0968 0.0577
7 0.0501 0.0691 0.0864 0.0942 0.0977 0.0968 0.1186 0.0587
8 0.0585 0.0576 0.0570 0.0569 0.0571 0.0577 0.0587 0.0905

and

V Correlation Matrix for ID 1

Row COL1 COL2 COL3 COL4 COL5 COL6 COL7 COL8

1 1.0000 0.8885 0.7366 0.5740 0.4186 0.3009 0.2265 0.3025


2 0.8885 1.0000 0.8143 0.6960 0.5689 0.4642 0.3875 0.3696
3 0.7366 0.8143 1.0000 0.8035 0.7316 0.6577 0.5879 0.4440
4 0.5740 0.6960 0.8035 1.0000 0.7934 0.7485 0.6914 0.4781
5 0.4186 0.5689 0.7316 0.7934 1.0000 0.7870 0.7460 0.4996
6 0.3009 0.4642 0.6577 0.7485 0.7870 1.0000 0.7663 0.5231
7 0.2265 0.3875 0.5879 0.6914 0.7460 0.7663 1.0000 0.5672
8 0.3025 0.3696 0.4440 0.4781 0.4996 0.5231 0.5672 1.0000

respectively, and were obtained by specifying the options ‘v’ and ‘vcorr’ in
the RANDOM statement.

8.3.5 Inference for the Fixed Effects

Due to the specification of the ‘solution’ option in the MODEL statement,


we obtain a table labeled ‘Solution for Fixed Effects,’ which contains the
parameter estimates, estimated standard errors, and approximate t-tests
for all fixed effects in the model:

Solution for Fixed Effects

Effect GROUP Estimate Std Error DF t Pr > |t|

GROUP 1 -1.0984 0.9763 47.9 -1.13 0.2662


GROUP 2 -0.5228 1.0895 48 -0.48 0.6335
GROUP 3 0.2964 1.0587 48 0.28 0.7807
GROUP 4 1.5493 1.0856 47.6 1.43 0.1600
AGEDIAG 0.0265 0.0142 47.9 1.87 0.0683
TIME*GROUP 1 0.5680 1.4725 42.7 0.39 0.7016
TIME*GROUP 2 0.3956 1.6376 42.3 0.24 0.8103
TIME*GROUP 3 -1.0359 1.5927 42.4 -0.65 0.5190
TIME*GROUP 4 -1.6049 1.6257 41.6 -0.99 0.3293
AGEDIAG*TIME -0.0111 0.0214 42.3 -0.52 0.6050
TIME2*GROUP 1 -0.1295 0.6100 32.9 -0.21 0.8332
TIME2*GROUP 2 -0.1584 0.6723 31.9 -0.24 0.8152
TIME2*GROUP 3 0.3419 0.6562 32.1 0.52 0.6060
TIME2*GROUP 4 0.3950 0.6660 31.1 0.59 0.5574
AGEDIAG*TIME2 0.0022 0.0088 32 0.26 0.7997

TABLE 8.3. Prostate Data. Overview of the hypotheses corresponding to the tests
specified in the table labeled ‘Tests of Fixed Effects.’

Source Null hypothesis


Group H1 : β2 = β3 = β4 = β5 = 0
Age H2 : β1 = 0
Time∗group H3 : β7 = β8 = β9 = β10 = 0
Age∗group H4 : β6 = 0
Time2∗group H5 : β12 = β13 = β14 = β15 = 0
Age∗time2 H6 : β11 = 0

Note that the estimates and standard errors were already reported in Ta-
ble 5.1. A printout of the complete estimated covariance matrix for the fixed
effects is also obtained (due to the option ‘covb’ in the MODEL statement),
but it is not printed here because of its high dimension (15 × 15).

By default, SAS provides approximate F -tests (see Section 6.2.2) for all ef-
fects specified in the MODEL statement. For continuous covariates, which
do not interact with any factors (i.e., with no interaction term included in
the MODEL statement), this is equivalent to the t-test reported in the table
‘Solution for Fixed Effects.’ For each factor specified in the CLASS state-
ment, it is tested whether any of the parameters assigned to this factor is
significantly different from zero. The same is true for interactions of factors
with other effects. The hypotheses tested in our example are summarized
in Table 8.3. Finally, since the option ‘chisq’ was added to the MODEL
statement, approximate Wald tests are also performed (see Section 6.2.1).
The output table is given by

Tests of Fixed Effects

Type III Type III


Source NDF DDF ChiSq F Pr > ChiSq Pr > F

GROUP 4 47.8 63.60 15.90 0.0001 0.0001


AGEDIAG 1 47.9 3.48 3.48 0.0621 0.0683
TIME*GROUP 4 42.2 31.41 7.85 0.0001 0.0001
AGEDIAG*TIME 1 42.3 0.27 0.27 0.6022 0.6050
TIME2*GROUP 4 31.8 17.78 4.44 0.0014 0.0058
AGEDIAG*TIME2 1 32 0.07 0.07 0.7980 0.7997

Because the option ‘ddfm=satterth’ has been added to the MODEL state-
ment, a Satterthwaite approximation is used for the calculation of the de-
nominator degrees of freedom needed for the approximate F -tests. As is

expected, all p-values obtained from the chi-squared approximation are


smaller than those from the F -approximation. However, the difference is
rather small, which illustrates the fact that, in a longitudinal context, dif-
ferent estimation methods for the denominator degrees of freedom for the
F -test usually lead to very similar results (see also Section 6.2.2).

Two additional tables are also given, labeled ‘CONTRAST Statement Re-
sults’ and ‘ESTIMATE Statement Results,’ which contain the results from
the specified CONTRAST and ESTIMATE statements, respectively. The
first one is given by

CONTRAST Statement Results

Source NDF DDF ChiSq F Pr > ChiSq Pr > F

Final model 6 46.7 3.39 0.56 0.7587 0.7561

and provides approximate F - and Wald tests for testing whether the orig-
inal model (3.10) can be reduced to model (6.8). The results were already
reported in Section 6.2.3. The table with the results from the ESTIMATE
statement equals

ESTIMATE Statement Results

Parameter Estimate Std Error DF t Pr > |t|

Diff L/R-BPH, t=5yr 0.1889 0.2189 71.2 0.86 0.3910

Alpha Lower Upper

0.05 -0.2476 0.6255

Note that the results are slightly different from those given in Table 6.2,
because inferences for the average difference between BPH patients and
local cancer cases are now obtained under model (3.10) rather than under
model (6.8), as was the case in Table 6.2.

8.3.6 Inference for the Random Effects

Because the option ‘solution’ was added to the RANDOM statement, a


table is printed containing empirical Bayes estimates for all random effects
in the model:

Solution for Random Effects

Effect ID Estimate SE Pred DF t Pr > |t|

INTERCEPT 1 0.3470 0.2141 126 1.62 0.1075


TIME 1 -0.8979 0.3822 114 -2.35 0.0205
TIME2 1 0.3138 0.1641 57.3 1.91 0.0609
INTERCEPT 2 -0.3150 0.2781 67.9 -1.13 0.2613
TIME 2 0.9239 0.4218 60.9 2.19 0.0324
TIME2 2 -0.3840 0.1649 38.7 -2.33 0.0252
.... .. ......

INTERCEPT 54 -0.3603 0.2223 93.6 -1.62 0.1083


TIME 54 0.1435 0.4315 108 0.33 0.7400
TIME2 54 -0.0089 0.1896 34.8 -0.05 0.9628

For each subject, the empirical Bayes estimate $\widehat{b}_i$ of its vector $b_i$ of random
effects is printed, together with approximate standard errors and t-tests.
The standard errors reported here are not based on the covariance matrix
(7.3) of $\widehat{b}_i$ but on the covariance matrix (7.4) of $\widehat{b}_i - b_i$.

8.4 Note on the Mean Parameterization

Note that it follows from the way we parameterized the mean structure of
our model that the F -tests discussed in Section 8.3.5 cannot be used to test
whether the different diagnostic groups have different intercepts or slopes.
For example, four parameters are assigned to the effect ‘time2 ∗ group’,
being the slopes for the quadratic time effect for each group separately.
The F -test reported for the effect time2 ∗ group was therefore
H5 : β12 = β13 = β14 = β15 = 0
rather than
H0 : β12 = β13 = β14 = β15 . (8.2)
Note that hypothesis (8.2) is also of the form H0 : Lβ = 0 and can thus
also be tested using a CONTRAST statement in PROC MIXED (see Sec-
tion 8.2.7).
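
For example, under the original parameterization, hypothesis (8.2) can be expressed as three simultaneous contrasts among the four time2*group slopes. A possible CONTRAST statement, sketched under the assumption that the effect ordering is that of the program on p. 94, is

  contrast 'Equal quadratic time effects'
           time2*group 1 -1  0  0,
           time2*group 1  0 -1  0,
           time2*group 1  0  0 -1 / chisq;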

Another possibility, which directly yields a test for group differences, is to


reparameterize the mean structure, including an overall intercept, and over-
all slopes for the linear and quadratic time effects. The MODEL statement
of our program on p. 94 then needs to be replaced by

model lnpsa = group age time group*time age*time


time2 group*time2 age*time2
/ solution ddfm = satterth covb chisq;

We then get the following output table ‘Solution for Fixed Effects’ with
REML estimates and t-tests for all parameters in the reparameterized mean
structure:

Solution for Fixed Effects

Effect GROUP Estimate Std Error DF t Pr > |t|

INTERCEPT 1.5493 1.0856 47.6 1.43 0.1600


GROUP 1 -2.6478 0.3931 48 -6.74 0.0001
GROUP 2 -2.0722 0.3835 48.8 -5.40 0.0001
GROUP 3 -1.2529 0.3932 48.3 -3.19 0.0025
GROUP 4 0.0000 . . .
AGEDIAG 0.0265 0.0142 47.9 1.87 0.0683
TIME -1.6049 1.6257 41.6 -0.99 0.3293
TIME*GROUP 1 2.1729 0.5836 40.9 3.72 0.0006
TIME*GROUP 2 2.0005 0.5678 40.4 3.52 0.0011
TIME*GROUP 3 0.5689 0.5794 39.5 0.98 0.3321
TIME*GROUP 4 0.0000 . . .
AGEDIAG*TIME -0.0111 0.0214 42.3 -0.52 0.6050
TIME2 0.3950 0.6660 31.1 0.59 0.5574
TIME2*GROUP 1 -0.5245 0.2341 29.6 -2.24 0.0327
TIME2*GROUP 2 -0.5535 0.2232 25.9 -2.48 0.0200
TIME2*GROUP 3 -0.0531 0.2267 25.2 -0.23 0.8166
TIME2*GROUP 4 0.0000 . . .
AGEDIAG*TIME2 0.0022 0.0088 32 0.26 0.7997

The slope β15 for time2 in the last group is now the parameter assigned to
the overall time2 effect, and the three parameters assigned to the interaction
of group with time2 are the contrasts β12 − β15 , β13 − β15 , and β14 − β15 ,
respectively (see also the original estimates in Table 5.1). Note that this also
implies that the CONTRAST statement and the ESTIMATE statement
previously used in the program on p. 94 are no longer valid under the new
parameterization.

For the reparameterized model, we get the following F -tests for the effects
specified in the MODEL statement:

TABLE 8.4. Prostate Data. Overview of the hypotheses corresponding to the tests
specified in the table labeled ‘Tests of Fixed Effects’ for the model with reparame-
terized mean structure.

Source Null hypothesis


Group H7 : β2 = β3 = β4 = β5
Age H8 : β1 = 0
Time H9 : (β7 + β8 + β9 + β10 )/4 = 0
Time∗group H10 : β7 = β8 = β9 = β10
Age∗group H11 : β6 = 0
Time2 H12 : (β12 + β13 + β14 + β15 )/4 = 0
Time2∗group H13 : β12 = β13 = β14 = β15
Age∗time2 H14 : β11 = 0

Tests of Fixed Effects

Type III Type III


Source NDF DDF ChiSq F Pr > ChiSq Pr > F

GROUP 3 47.9 60.39 20.13 0.0001 0.0001


AGEDIAG 1 47.9 3.48 3.48 0.0621 0.0683
TIME 1 42.2 0.07 0.07 0.7874 0.7887
TIME*GROUP 3 41.8 31.24 10.41 0.0001 0.0001
AGEDIAG*TIME 1 42.3 0.27 0.27 0.6022 0.6050
TIME2 1 32 0.03 0.03 0.8609 0.8620
TIME2*GROUP 3 30.8 17.78 5.93 0.0005 0.0026
AGEDIAG*TIME2 1 32 0.07 0.07 0.7980 0.7997

The hypotheses tested in the above output are shown in Table 8.4. Hence,
the test for hypothesis (8.2) is now reported as the F -test corresponding to
the effect of time2 ∗ group. Note also the change in its numerator degrees of
freedom due to the fact that we now test for equality of the quadratic time
effect in the four diagnostic groups, rather than testing whether there is any
quadratic time effect in any of the four diagnostic groups at all. Also, under
this parameterization for the mean structure, the F -test reported for time2
tests whether there is a quadratic time effect in the overall population and
is therefore not equivalent to the t-test reported for time2 in the table
labeled ‘Solution for Fixed Effects,’ which was testing for a quadratic time
effect for the metastatic cancer cases only. The same remark is true for the
F -test reported for time.

Although fitting the reparameterized model automatically yields tests for


group differences with respect to average intercepts or slopes, it often com-

plicates the interpretation of the parameter estimates since contrasts are


estimated rather than the parameters of interest. All further analyses of the
prostate data will therefore be based on the original parameterization as in
model (3.10) or as in the reduced model (6.8) obtained in Section 6.2.3.

8.5 The RANDOM and REPEATED Statements

In Section 8.2, we introduced the RANDOM statement and the REPEAT-


ED statement of PROC MIXED, and both statements were used in our
program on p. 94 to fit model (3.10) to the prostate cancer data. However,
since the covariance structure for the error components εi was taken equal
to σ 2 Ini , which is the default in PROC MIXED, the same model can be
fitted omitting the REPEATED statement. In practice, it is often sufficient
to use only a RANDOM statement or only a REPEATED statement. In
the first case, a hierarchical model is assumed, in which random effects are
used to describe the covariance structure in the data, whereas all remaining
variability is assumed to be purely measurement error (the components in
εi are assumed to be independently, identically distributed). The covariance
structure is then assumed to be of the form $V_i = Z_i D Z_i' + \sigma^2 I_{n_i}$. In the other
case, no random effects are included, indicating that no part of the observed
variability in the data can be ascribed to between-subject variability. The
covariance structure for the data is then completely determined by the
covariance structure Σi for the error components εi , which is specified in
the REPEATED statement.

It should be noted, however, that although both procedures have different


interpretations, they can result in identical marginal models. We hereby
also refer to our discussion in Section 5.6.2 on hierarchical and marginal
models. As an example, we take the so-called ‘random intercepts’ or ‘com-
pound symmetry’ model, which assumes a covariance structure of the form
\[
\mathrm{var}(Y_i) \;=\; V_i \;=\;
\begin{pmatrix}
\sigma^2 + \sigma_c^2 & \sigma_c^2 & \cdots & \sigma_c^2 \\
\sigma_c^2 & \sigma^2 + \sigma_c^2 & \cdots & \sigma_c^2 \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_c^2 & \sigma_c^2 & \cdots & \sigma^2 + \sigma_c^2
\end{pmatrix}, \qquad (8.3)
\]

for some non-negative value σc2 . Although such an assumption is often not
realistic in a longitudinal-data setting (constant variance and all correla-
tions equal), it is frequently used in practice since it immediately follows
from random-factor ANOVA models (see, for example, Neter, Wasserman,
and Kutner 1990, Section 17.6, Searle 1987, Chapter 13).

Since the covariance matrix in (8.3) can be rewritten as


\[
V_i \;=\;
\begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}
\sigma_c^2
\begin{pmatrix} 1 & 1 & \cdots & 1 \end{pmatrix}
\;+\;
\sigma^2
\begin{pmatrix}
1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1
\end{pmatrix}
\;=\; Z_i\,\mathrm{var}(b_i)\,Z_i' + \sigma^2 I_{n_i},
\]

it can be interpreted as the marginal covariance structure of a linear mixed


model containing only random intercepts with variance d11 = σc2 , which
can be fitted with PROC MIXED by specifying random intercepts in the
RANDOM statement and omitting the REPEATED statement. This hi-
erarchical interpretation of the compound symmetry covariance structure
was already encountered in Sections 3.3.2 and 6.3.3. Note also that in this
case, no ‘type=’ option is needed in the RANDOM statement since for uni-
variate random effects bi , all types result in the same covariance structure.

Further, since the option ‘type=CS’ in the REPEATED statement results


in a covariance structure for εi of the same form as (8.3), the same model
can be fitted without the RANDOM statement. This shows that there are
sometimes several ways of specifying a given model. In such a case, it is
recommended to specify the model using the REPEATED statement rather
than the RANDOM statement because this can reduce the computing time
considerably.
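
For illustration, the following two hypothetical programs (data set example, with response y, treatment factor treat, and subject identifier subject) fit the same marginal compound symmetry model; for the reason just given, the second formulation is the recommended one:

  /* hierarchical formulation: random intercepts */
  proc mixed data = example;
    class treat subject;
    model y = treat / solution;
    random intercept / subject = subject;
  run;

  /* equivalent marginal formulation: compound symmetry */
  proc mixed data = example;
    class treat subject;
    model y = treat / solution;
    repeated / type = cs subject = subject;
  run;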

This also implies that one should not conclude that the REPEATED state-
ment is used whenever the data are of the repeated-measures type. Some
repeated-measures models are best expressed using the RANDOM state-
ment (see, e.g., the prostate cancer data), whereas there are also random-
effects models, which do not fall into the repeated-measures class but where
the REPEATED statement is the simplest tool for expressing them in
PROC MIXED syntax. For example, suppose 100 exams were randomly
assigned for correction to 10 randomly selected teachers. If Yij then de-
notes the grade assigned to the jth exam by the ith observer, the following
random-factor ANOVA model can be used to analyze the data:

Yij = µ + αi + εij . (8.4)

The parameter µ represents the overall mean, the parameters αi are the
random observer effects, and the εij are components of measurement error.
It is thereby assumed that all αi and εij are independent of each other, and
that they are normally distributed with mean zero and constant variances
σc2 and σ 2 , respectively. Model (8.4) can then be fitted using the following
program:

proc mixed;
class observer;
model Y = / solution;
repeated / type = cs subject = observer;
run;

In this case, no effects need to be specified in the REPEATED statement


since the ordering of observations for each of the observers is of no impor-
tance for the estimation of the covariance structure (constant variance and
equal correlations).

8.6 PROC MIXED versus PROC GLM

For balanced longitudinal data (i.e., longitudinal data where all subjects
have the same number of repeated measures, taken at time points which
are also the same for all subjects), one often analyzes the data using the
SAS procedure PROC GLM, fitting general multivariate regression models
(Seber 1984, Chapters 8 and 9) to the data. Such models can also be fitted
with PROC MIXED by omitting the RANDOM statement and including
a REPEATED statement with option ‘type=UN’. One then fits a linear
model with a general unstructured covariance matrix Σ = Σi . However,
the two procedures do not necessarily yield the same results: PROC GLM
only takes into account the data of the completers, that is, only the data
of the subjects with all measurements available are used in the calcula-
tions. PROC MIXED, on the other hand, uses all available data. Hence,
patients for whom not all measurements were recorded will still be taken
into account in the analysis. We refer to Section 17.3 for an illustration.
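
A minimal sketch of such a PROC MIXED program, for a hypothetical balanced data set with the common measurement occasions coded by the classification variable timeclss, is

  proc mixed data = balanced;
    class id group timeclss;
    model y = group*timeclss / solution;
    repeated timeclss / type = un subject = id;
  run;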

The multivariate approach used in the GLM procedure produces multi-


variate tests for the fixed effects based on Wilks’ Lambda likelihood
ratio test statistic (see, e.g., Rao 1973, Chapter 8, SAS 1989, Chapter 1).
The resulting F -tests are based on a better approximation to the actual
distribution of the test statistic than that for the F -tests currently given
by the MIXED procedure (see Roger and Kenward 1993). Further, apart
from the multivariate tests, PROC GLM also provides a univariate analysis
for the response at each time point separately. This can also be obtained
with PROC MIXED, by specifying a WHERE statement. For example, an
analysis of the responses at time t = 2 is requested by adding the following
line to the main program:

where time = 2;

Note that this again may yield different results than PROC GLM due to
the fact that now all second measurements are analyzed rather than only
the measurements from the patients with measurements taken at all time
points. Finally, the “split unit” type of analysis provided by the GLM proce-
dure can be obtained using PROC MIXED from fitting a compound sym-
metry model (see Section 8.5). However, Greenhouse-Geisser and Huynh-
Feldt corrections to the F -tests are not available in the MIXED procedure,
but they are not really required, as it is very simple to fit and test models
with more complex covariance structures.

The main strength of the procedure MIXED is that it does not assume that
an equal number of repeated observations is taken from each individual or
that all individuals be measured at the same time points. Hence,
the measurements can be viewed as being taken at a continuous rather
than discrete time scale. Also, the use of random effects allows us to model
covariances as continuous functions of time. Another main advantage in
using the MIXED procedure is the fact that all available data (not only
the “complete cases”) are used in the analysis. Finally, PROC MIXED also
allows us to include time-varying covariates in the mean structure, which
is not possible in PROC GLM.

For a more elaborate discussion on the comparison between the proce-


dures MIXED and GLM, we refer to Roger and Kenward (1993) and Roger
(1993).
9 General Guidelines for Model Building

9.1 Introduction

As discussed in Chapter 8, the SAS procedure PROC MIXED allows the


user to fit general linear mixed models, with a large variety of possible
covariance structures. Under the linear mixed model (3.11), the data vector
$Y_i$ for the ith subject is assumed to be normally distributed with mean
vector $X_i\beta$ and covariance matrix of the form $V_i = Z_i D Z_i' + \sigma^2 I_{n_i} +
\tau^2 H_i$. Hence, fitting linear mixed models implies that an appropriate mean
structure as well as covariance structure needs to be specified. As shown in
Figure 9.1, they are not independent of each other.

First, unless robust inference is used (see Section 6.2.4), an appropriate


covariance model is essential to obtain valid inferences for the parameters
in the mean structure, which is usually of primary interest. This will be
especially the case in the presence of missing data, since robust inference
then only provides valid results under often unrealistically strict assump-
tions about the missing data process (see, for example, Section 16.5 as
well as Chapters 17, 19, and 21). Too restrictive specifications invalidate
inferences when the assumed structure does not hold, whereas overpara-
meterization of the covariance structure leads to inefficient estimation and
poor assessment of standard errors (Altham 1984).

FIGURE 9.1. Graphical representation of how the mean structure and the covariance
structure of a linear mixed model influence one another and how they affect
the inference results (estimation of θ, covariance matrix for $\widehat{\theta}$, t-tests and
F-tests, confidence intervals, efficiency, prediction).

Further, the covariance structure itself may be of interest for understanding


the random variation observed in the data. However, since it only explains
the variability not explained by systematic trends, it is highly dependent
on the specified mean structure.

Finally, note that an appropriate covariance structure also yields better


predictions. For example, the prediction of a future observation $y_i^*$ for
individual $i$, to be taken at time point $t_i^*$, based on model (3.11), is given
by

\[
\widehat{y}_i^* \;=\; X_i^*\widehat{\beta} + Z_i^*\widehat{b}_i + \widehat{E}(\varepsilon_{(2)i}^* \mid y_i),
\]

in which $X_i^*$ and $Z_i^*$ are the fixed-effects and random-effects covariates,
respectively, and $\varepsilon_{(2)i}^*$ is the serial error, at time $t_i^*$. The random effect $b_i$
is estimated as in Section 7.2. Chi and Reinsel (1989) have shown that, if
the components of $\varepsilon_{(2)i}$ follow an AR(1) process,

\[
\widehat{E}(\varepsilon_{(2)i}^* \mid y_i)
\;=\; \phi^{(t_i^* - t_{i,n_i})}
\left\{ y_i - X_i\widehat{\beta} - Z_i\widehat{b}_i \right\}_{n_i},
\]

for $\phi$ equal to the constant which determines the AR(1) process, $|\phi| < 1$,
and where $\{\cdot\}_{n_i}$ denotes the $n_i$th (last) element. This means that the
inclusion of serial correlation may improve the prediction since it exploits
the correlation between the observation to be predicted and the last observed
value $y_{i,n_i}$.

For data sets where most variability in the measurements is due to between-
subject variability, one can very often use the two-stage approach to con-
struct an appropriate linear mixed model. This was illustrated for the
prostate cancer data in Sections 3.2.4 and 3.3.3. On the other hand, as
shown in Section 5.6.2, a two-stage approach does not always automati-
cally yield a valid marginal model for the data. Also, if the intersubject
variability is small in comparison to the intrasubject variability, this sug-
gests that the covariance structure cannot be modeled using random effects,
but that an appropriate covariance matrix Σi for εi should be found.

In this chapter, some simple guidelines will be discussed which can help the
data analyst to select an appropriate linear mixed model for some specific
data set at hand. All steps in this model building process will be illustrated
with the prostate cancer data set. It should be emphasized that following
the proposed guidelines does not necessarily yield the most appropriate
model, nor does it yield a linear mixed model where all distributional as-
sumptions are automatically satisfied. In general, more complex model di-
agnostics are required to assess the goodness-of-fit of a model. We therefore
refer to the exploratory techniques of Chapter 3 and to the more advanced
techniques later described in Chapters 10, 11, and 12.

9.2 Selection of a Preliminary Mean Structure

Since the covariance structure models all variability in the data which can-
not be explained by the fixed effects, we start by first removing all sys-
tematic trends. As proposed by Diggle (1988) and Diggle, Liang, and Zeger
(1994) (Sections 4.4 and 5.3), we use an overelaborated model for the mean
response profile. When the data are from a designed experiment in which
the only relevant explanatory variables are the treatment labels, it is a
sensible strategy to use a “saturated model” for the mean structure. This
incorporates a separate parameter for the mean response at each time point
within each treatment group. For example, when two treatment groups had
measurements at four fixed time points, we would use p = 4 × 2 = 8 para-
meters to model E(Yi ). The Xi matrices would then equal
\[
X_i = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0
\end{pmatrix}
\quad \text{or} \quad
X_i = \begin{pmatrix}
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix},
\]

FIGURE 9.2. Prostate Data. Smoothed average trend of ln(PSA + 1) for each
diagnostic group separately.

depending on whether the ith individual belongs to the first or second


treatment group, respectively.
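
In PROC MIXED, such a saturated mean structure can be specified by crossing the treatment factor with a classification version of the time variable; a hypothetical sketch of the corresponding MODEL statement is

  model y = treat*timeclss / noint solution;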

For data in which the times of measurement are not common to all individ-
uals or when there are continuous covariates which are believed to affect
the mean response, the concept of a saturated model breaks down and the
choice of our most elaborate model becomes less obvious. In such cases, a
plot of smoothed average trends or individual profiles often helps to select
a candidate mean structure. For the prostate cancer data, Figure 9.2 shows
the smoothed average trend in each diagnostic group separately. Note that
these trends are not corrected for the age differences between the study
participants. Further, the individual profiles (Figure 2.3) and our results
from Section 4.3.4 suggest modeling ln(1 + PSA) as a quadratic function
over time. This results in an average intercept and an average linear as
well as quadratic time effect within each diagnostic group. Finally, it has
been anticipated that age is also an important prognostic covariate. We
therefore also include age at time of diagnosis along with its interactions
with time and time2 . Our preliminary mean structure therefore contains
4 × 3 + 3 = 15 fixed effects, represented by the vector β. Note that this is
the mean structure which was used in our initial model (3.10), and that at
this stage, we deliberately favor overparameterized models for E(Yi ) in or-
der to get consistent estimators of the covariance structure in the following
steps.

FIGURE 9.3. Prostate Data. Ordinary least squares (OLS) residual profiles.

Once an appropriate mean structure Xi β for E(Yi ) has been selected, we


use the ordinary least squares (OLS) method to estimate β, and we hereby
ignore that not all measurements are independent. It follows from the the-
ory of generalized estimating equations (GEE) that this OLS estimator is
consistent for β (Liang and Zeger 1986; see also Section 6.2.4). This justifies
the use of the OLS residuals $r_i = y_i - X_i\widehat{\beta}_{\mathrm{OLS}}$ for studying the dependence
among the repeated measures.
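
The OLS residuals can be obtained, for example, with PROC GLM, which fits the same preliminary mean structure while ignoring all correlation. The following sketch uses the parameterization of model (3.10) and saves the residuals in a new data set (the data set and variable names are assumptions):

  proc glm data = prostate;
    class group;
    model lnpsa = group age group*time age*time
                  group*time2 age*time2 / noint;
    output out = olsres r = r;  /* OLS residuals in data set olsres */
  run;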

9.3 Selection of a Preliminary Random-Effects


Structure

In a second step, we will select a set of random effects to be included in the


covariance model. Note that random effects for time-independent covari-
ates can be interpreted as subject-specific corrections to the overall mean
structure. This makes them hard to distinguish from random intercepts.
Therefore, one often includes random intercepts, and random effects only
for time-varying covariates. However, it will be shown in Section 24.1 that
in some applications, random effects for time-independent indicators may
be useful to model differences in variability between subgroups of subjects
or measurements.

A helpful tool for deciding which time-varying covariates should be included


in the model is a plot of the OLS residual profiles versus time. For the

FIGURE 9.4. Prostate Data. Smoothed average trend of squared OLS residuals.
Squared residuals larger than 0.4 are not shown.

prostate data, this was done in Figure 9.3. A smoothed average trend of
the squared OLS residuals is shown in Figure 9.4 and is used to explore
the variance function over time. If it is believed that different variance
functions should be used for different groups, a plot as in Figure 9.4 could
be constructed for each group separately. When this plot shows constant
variability over time, we assume stationarity and we do not include random
effects other than intercepts. In cases where the variability varies over time
and where there is still some remaining systematic structure in the residual
profiles (i.e., where the between-subject variability is large in comparison
to the overall variation), the following guidelines can be used to select one
or more random effects additional to the random intercepts.

• Try to find a regression model for each residual profile in the above
plot. Such models contain subject-specific parameters and are there-
fore perfect candidates as random effects in our general linear mixed
model. For example, if the residual profiles can be approximated by
straight lines, then only random intercepts and random slopes for
time would be included.
• Since our model always assumes the random effects bi to have mean
zero, we only consider covariates Zi which have already been included
as covariates in the fixed part (i.e., in Xi ) or which are linear com-
binations of columns of Xi . Note that this condition was satisfied in
model (3.10), which we used to analyze the prostate cancer data. For
example, the second column of Zi , which represents the linear ran-
dom effect for time, equals the sum of columns 7 to 10 in Xi , which
are the columns containing the linear time effects for the controls,

the benign prostatic hyperplasia patients, the local cancer cases, and
the metastatic cancer cases, respectively.

• Morrell, Pearson, and Brant (1997) have shown that Zi should not
include a polynomial effect if not all hierarchically inferior terms are
also included, and similarly for interaction terms. This generalizes
the well-known results from linear regression (see, e.g., Peixoto 1987,
1990) to random-effects models. It ensures that the model is invariant
to coding transformations and avoids unanticipated covariance struc-
tures. This means that if, for example, we want to include quadratic
random time effects, then also linear random time effects and random
intercepts should be included.

• The choice of a set of random effects for the model automatically


implies that the covariance matrix for $Y_i$ is assumed to be of the
general form $V_i = Z_i D Z_i' + \Sigma_i$. In the presence of random effects
other than intercepts, it is often assumed (see, e.g., Diggle, Liang, and
Zeger 1994) that the diagonal elements in $\Sigma_i$ are all equal, such that
the variance of $Y_i(t)$ depends on time only through the component
$Z_i(t) D Z_i'(t)$, where it is now explicitly indicated that the covariates $Z_i$
depend on time. As an informal check for the appropriateness of the
selected random effects, one can compare the fitted variance function
based on a mixed-effects model with Σi = σ 2 Ini to the smoothed
sample variance function of the residuals rij .

In the example on prostate cancer, Figure 9.4 clearly suggests nonstatio-


narity, and we assume that the remaining structure in the OLS residual
profiles in Figure 9.3 can be well described by a quadratic function over
time. Hence, random intercepts and linear as well as quadratic random
slopes for time are included in the preliminary random-effects structure.
Note that, as in Section 9.2, we favor the inclusion of too many random
effects rather than omitting some. This ensures that the remaining vari-
ability is not due to any missing random effects. However, it also should be
emphasized that including high-dimensional random effects bi with uncon-
strained covariance matrix D leads to complicated covariance structures
and may result in divergence of the maximization procedure.

As an informal check for the variance function, we compared the smoothed


average trend of squared OLS residuals, previously shown in Figure 9.4,
with the fitted variance function obtained from fitting a linear mixed model
with the preliminary mean structure, the preliminary random-effects struc-
ture, and measurement error. For the prostate data, this is the original

FIGURE 9.5. Prostate Data. Comparison of the smoothed average trend of


squared OLS residuals (solid line) with the fitted variance function (dashed line)
obtained using the REML estimates in Table 5.1 for the variance components in
model (3.10).

model (3.10), and the fitted variance function is obtained by calculating


\[
\begin{pmatrix} 1 & t & t^2 \end{pmatrix}
\widehat{D}
\begin{pmatrix} 1 \\ t \\ t^2 \end{pmatrix}
+ \widehat{\sigma}^2,
\]

where $\widehat{D}$ and $\widehat{\sigma}^2$ are the REML estimates reported in Table 5.1. The result is
presented in Figure 9.5. Both variance functions show similar trends, except
at the beginning and at the end. The deviation for small time points can be
explained by noticing that some subjects have extremely large PSA values
close to their time of diagnosis (see individual profiles in Figure 2.3). These
correspond to extremely large squared residuals (not shown in Figure 9.4),
which may have inflated the fitted variance function much more than the
variance function obtained by smoothing the squared OLS residuals. The
deviation for large time points can be ascribed to the small amount of data
available. Only 24 out of the 463 PSA measurements have been taken more
than 20 years prior to the diagnosis.

9.4 Selection of a Residual Covariance Structure

Conditional on our selected set of random effects, we now need to specify


the covariance matrix Σi for the error components εi . Many possible co-
variance structures are available at this stage. Unfortunately, apart from

the information criteria discussed in Section 6.4, there are no general sim-
ple techniques available to compare all these models. For highly unbalanced
data with many repeated measurements per subject, one usually assumes
that random effects can account for most of the variation in the data and
that the remaining error components εi have a very simple covariance struc-
ture, leading to parsimonious models for Vi .

One such model is the model (3.11), introduced in Section 3.3.4. It assumes
that εi has constant variance and can be decomposed as εi = ε(1)i + ε(2)i ,
in which ε(2)i is a component of serial correlation and where ε(1)i is a com-
ponent of measurement error. The model is then completed by specifying a
serial correlation function g(·). The most frequently used functions are the
exponential and the Gaussian serial correlation functions already shown in
Figure 3.2, but other functions can also be specified in the SAS procedure
MIXED. We propose to fit a selection of linear mixed models with the
same mean and random-effects structure, but with different serial correla-
tion structures, and to use likelihood-based criteria to compare the different
models. In some cases, likelihood ratio tests can be applied. In other cases,
one might want to use the information criteria introduced in Section 6.4.
Alternatively, one can use one of the advanced methods for the exploration
of the residual serial correlation, which will be discussed in Chapter 10.
However, unless one is especially interested in the serial correlation func-
tion, it is usually sufficient to fit and compare a series of serial correlation
models.
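
In PROC MIXED, such a series of fits only requires changing the ‘type=’ option of the REPEATED statement. As a hedged sketch, exponential and Gaussian serial correlation, each combined with measurement error, would be requested by, respectively,

  repeated timeclss / type = sp(exp)(time) local subject = id;
  repeated timeclss / type = sp(gau)(time) local subject = id;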

Using some of the methods to be discussed in Chapter 10, Verbeke, Lesaffre,


and Brant (1998) and Verbeke and Lesaffre (1997b) have shown that the
residual components in the prostate cancer model indeed contain a serial
correlation component ε(2)i , which is probably of the Gaussian type. The
linear mixed model with our preliminary mean structure, with random
intercepts and random slopes for the linear as well as quadratic time effect,
with measurement error, and with Gaussian serial correlation can then be
fitted using the following program:

proc mixed data = prostate covtest;


class id group timeclss;
model lnpsa = group age group*time age*time
group*time2 age*time2 / noint solution;
random intercept time time2 / type = un subject = id g;
repeated timeclss / type = sp(gau)(time) local subject = id;
run;

The option ‘type=sp(gau)(time)’ is used to specify the Gaussian serial cor-


relation structure with the SAS variable time as the variable which needs to
be used to calculate the time differences between the repeated measures.

TABLE 9.1. Prostate Data. REML estimates and estimated standard errors for all
variance components in a linear mixed model with the preliminary mean structure
defined in Section 9.2, with random intercepts and random slopes for the linear as
well as quadratic time effects, with measurement error, and with Gaussian serial
correlation.

Effect Parameter Estimate (s.e.)


Covariance of bi :
var(b1i ) d11 0.389 (0.096)
var(b2i ) d22 0.559 (0.206)
var(b3i ) d33 0.059 (0.032)
cov(b1i , b2i ) d12 = d21 −0.382 (0.121)
cov(b2i , b3i ) d23 = d32 −0.175 (0.079)
cov(b3i , b1i ) d13 = d31 0.099 (0.043)
Measurement error variance:
var(ε(1)ij ) σ2 0.023 (0.002)
Gaussian serial correlation:
var(ε(2)ij ) τ2 0.032 (0.021)

Rate of exponential decrease 1/√φ 0.619 (0.202)
REML log-likelihood −24.787
−2 REML log-likelihood 49.574

As explained in Section 8.2.10, the option ‘local’ requests inclusion of a


measurement error component ε(1)i on top of the serial correlation com-
ponent ε(2)i . The REML estimates and estimated standard errors of all
variance components in this model are shown in Table 9.1. Note how SAS
estimates $1/\sqrt{\phi}$ rather than φ itself; this ensures positiveness of φ. In
cases where no measurement error is included in the model, this also allows
testing whether $\phi = +\infty$, under which $H_i$ becomes equal to the identity
matrix $I_{n_i}$, meaning that no serial correlation would be present in the error
components $\varepsilon_i$.

Comparing minus twice the REML log-likelihood of the above model with
the value obtained without the serial correlation component (see Table 5.1)
yields a difference of 12.896, indicating that adding the serial correlation
component really improved the covariance structure of our model. Further,
note that the residual variability has now been split up into two compo-
nents which are about equally important (similar variance). Based on this
extended covariance matrix, we repeated our informal check previously pre-
sented in Figure 9.5, comparing the smoothed average trend in the squared
OLS residuals with the new fitted variance function. The fitted variances

FIGURE 9.6. Prostate Data. Comparison of the smoothed average trend of


squared OLS residuals (solid line) with the fitted variance function (dashed line)
obtained using the REML estimates in Table 9.1 for the variance components in
model (3.10), extended with a Gaussian serial correlation component.

are now obtained by calculating


\[
\begin{pmatrix} 1 & t & t^2 \end{pmatrix}
\widehat{D}
\begin{pmatrix} 1 \\ t \\ t^2 \end{pmatrix}
+ \widehat{\sigma}^2 + \widehat{\tau}^2,
\]

where the estimates $\widehat{D}$, $\widehat{\sigma}^2$, and $\widehat{\tau}^2$ are the ones reported in Table 9.1. The
result is shown in Figure 9.6. Figures 9.5 and 9.6 are very similar, except
for large values of time, where the two variance functions are now more
alike.

Finally, it should be emphasized that, for fitting complicated covariance


structures, with possibly overspecified random-effects structures, one often
needs to specify starting values (using the PARMS statement of PROC
MIXED, see Section 8.2.10) in order for the iterative procedure to con-
verge. Sometimes it is sufficient to use the Fisher scoring method (option
‘scoring’ in the PROC MIXED statement, see Section 8.2.10) in the itera-
tive estimating procedure, which uses the expected Hessian matrix instead
of the observed one. To illustrate this, we reparameterize the above fitted
model by defining the intercept successively at 0 years (= original parame-
terization), 5 years, 10 years, 15 years, and 20 years prior to diagnosis. The
resulting estimates and standard errors for all variance components in the
model are shown in Table 9.2.

Obviously, minus twice the maximized REML log-likelihood function is not


affected by the reparameterization. The Fisher scoring algorithm was used
in two cases in order to attain convergence. It is hereby important that

TABLE 9.2. Prostate Data. REML estimates and estimated standard errors for all
variance components in a linear mixed model with the preliminary mean structure
defined in Section 9.2, with random intercepts and random slopes for the linear as
well as quadratic time effects, with measurement error, and with Gaussian serial
correlation. Each time, another parameterization of the model is used, based on
how the intercept has been defined.

Definition of intercept (time in years before diagnosis)


Parameter t=0 t=5 t = 10(1) t = 15 t = 20(2)
d11 0.389 (0.096) 0.156 (0.039) 0.090 (0.032) 0.061 (0.027) 0.026 (0.025)
d22 0.559 (0.206) 0.267 (0.085) 0.094 (0.027) 0.038 (0.026) 0.099 (0.091)
d33 0.059 (0.032) 0.059 (0.032) 0.059 (0.032) 0.059 (0.032) 0.059 (0.032)
d12 = d21 −0.382 (0.121) −0.120 (0.042) −0.033 (0.019) −0.033 (0.017) −0.030 (0.024)
d23 = d32 −0.175 (0.079) −0.116 (0.048) −0.057 (0.021) 0.001 (0.023) 0.060 (0.051)
d13 = d31 0.099 (0.043) 0.026 (0.023) −0.018 (0.023) −0.032 (0.020) −0.016 (0.016)
σ2 0.023 (0.002) 0.023 (0.002) 0.023 (0.002) 0.023 (0.002) 0.023 (0.002)
τ2 0.032 (0.021) 0.032 (0.021) 0.032 (0.021) 0.032 (0.021) 0.032 (0.021)
1/√φ 0.619 (0.202) 0.619 (0.202) 0.619 (0.202) 0.619 (0.202) 0.619 (0.202)
−2 log-lik 49.574 49.574 49.574 49.574 49.574
(1)
Five initial steps of Fisher scoring.
(2)
One initial step of Fisher scoring.

the final steps in the iterative procedure are based on the default Newton-
Raphson method since, otherwise, all reported standard errors are based
on the expected rather than observed Hessian matrix, the consequences of
which will be discussed in Chapter 21. Note that the variance components
in the covariance structure of εi as well as the variance d33 of the random
slopes for the quadratic time effect remain unchanged when the model is
reparameterized. This is not the case for the other elements in the random-
effects covariance matrix D. As was expected, the random-intercepts vari-
ance d11 decreases as the intercept moves further away from the time of
diagnosis, and the same holds for the overall variance d11 + σ 2 + τ 2 at the
time of the intercept. We therefore recommend defining random intercepts
as the response value at the time where the random variation in the data
is maximal. This facilitates the discrimination among the three sources of
stochastic variability.
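
Such a reparameterization only requires shifting the time variable in a DATA step before refitting the model. For example, assuming time is expressed in years prior to diagnosis, the intercept can be moved to 10 years before diagnosis as follows (a hypothetical sketch):

  data prostate10;
    set prostate;
    time = time - 10;     /* intercept now at 10 years before diagnosis */
    time2 = time * time;  /* recompute the quadratic term */
  run;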

9.5 Model Reduction

Based on the residual covariance structure specified in the previous step,


we can now investigate whether the random effects which we included in
Section 9.3 are really needed in the model. As discussed in Section 9.3,
Zi should not contain a polynomial effect if not all hierarchically inferior

terms are also included. Taking into account this hierarchy, one should test
the significance of the highest-order random effects first. We reemphasize
that the need for random effects cannot be tested using classical likelihood
ratio tests, due to the fact that the null hypotheses of interest are on the
boundary of the parameter space, which implies that the likelihood ratio
statistic does not have the classical asymptotic chi-squared distribution (see
our discussion in Section 6.3.4).

Further, now that the final covariance structure for the model has been
selected, the tests discussed in Section 6.2 become available for the fixed
effects in the preliminary mean structure.

For the prostate cancer data, we then end up with the same mean structure
as well as random-effects structure as model (6.8) in Section 6.2.3. This
model can now be fitted in SAS with the following program:

proc mixed data = prostate covtest;


class id group timeclss;
model lnpsa = group age bph*time loccanc*time metcanc*time
cancer*time2 / noint solution;
random intercept time time2 / type = un subject = id g;
repeated timeclss / type = sp(gau)(time) local subject = id;
run;

The SAS variables cancer, bph, loccanc, and metcanc are dummy variables
defined to be equal to 1 if the patient has prostate cancer, benign prostatic
hyperplasia, local prostate cancer, or metastatic prostate cancer, respec-
tively, and zero otherwise. The variables id, group, time, and timeclss are
as defined in Section 8.2. The parameter estimates and estimated standard
errors are shown in Table 9.3, and they can be compared to the estimates
shown in Table 6.1, which were obtained without assuming the presence of
residual serial correlation. Note that adding the serial correlation compo-
nent to the model leads to smaller standard errors for almost all parameters
in the marginal model. This illustrates that an adequate covariance model
implies efficient model-based inferences for the fixed effects.

Finally, note that the total residual variability is estimated as $\widehat{\sigma}^2 + \widehat{\tau}^2 =
0.023 + 0.029 = 0.052$, which is as large as the total variability present
in the response ln(PSA + 1) prior to any development of prostate disease,
that is, many years prior to diagnosis (see Figure 9.4). This is in agreement
with our findings in Section 4.3.4, where evidence was found in favor of the
presence of considerable residual variability not explained by the first-stage
model (3.5) used in the two-stage approach proposed in Section 3.2.4.

TABLE 9.3. Prostate Data. Results from fitting the final model (6.8) to the
prostate cancer data, using restricted maximum likelihood estimation. The co-
variance structure contains three random effects, Gaussian serial correlation, and
measurement error.

Effect Parameter Estimate (s.e.)

Age effect β1 0.015 (0.006)
Intercepts:
Control β2 −0.496 (0.411)
BPH β3 0.320 (0.470)
L/R cancer β4 1.216 (0.469)
Met. cancer β5 2.353 (0.518)
Time effects:
BPH β8 −0.376 (0.070)
L/R cancer β9 −1.877 (0.210)
Met. cancer β10 −2.274 (0.244)
Time² effects:
Cancer β14 = β15 0.484 (0.073)
Covariance of bi:
var(b1i) d11 0.393 (0.093)
var(b2i) d22 0.550 (0.187)
var(b3i) d33 0.056 (0.028)
cov(b1i, b2i) d12 = d21 −0.382 (0.114)
cov(b2i, b3i) d23 = d32 −0.170 (0.070)
cov(b3i, b1i) d13 = d31 0.098 (0.039)
Measurement error variance:
var(ε(1)ij) σ² 0.023 (0.002)
Gaussian serial correlation:
var(ε(2)ij) τ² 0.029 (0.018)
Rate of exponential decrease 1/√φ 0.599 (0.192)
Observations 463
REML log-likelihood −13.704
−2 REML log-likelihood 27.407
Akaike’s information criterion −22.704
Schwarz’s Bayesian criterion −41.235
10
Exploring Serial Correlation

10.1 Introduction

As discussed in Sections 3.3.4 and 9.4, the selection of an appropriate resid-
ual covariance structure is a nontrivial step in the model selection process,
especially in the presence of random effects. This is because the residual
variability is in practice very often dominated by the random effects in
the model. In this chapter, we will discuss two procedures for exploring
the residual covariance, conditionally on a set of random effects already in-
cluded in the model. This will be done under the assumption that the data
can be well described by a general linear mixed model of the form (3.11)
introduced in Section 3.3.4; that is, it is assumed that the residual compo-
nent εi has constant variance and can be decomposed as εi = ε(1)i + ε(2)i ,
in which ε(2)i is a component of serial correlation and where ε(1)i is a com-
ponent of measurement error. The marginal covariance matrix is then of
the form
\[
V_i \;=\; Z_i D Z_i' + \tau^2 H_i + \sigma^2 I_{n_i}, \tag{10.1}
\]
where the (j, k) element of Hi equals g(|tij − tik|) for some (usually) decreas-
ing function g(·) with g(0) = 1. Exploring the residual covariance structure
then reduces to studying the serial correlation function g(·).
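For example, under Gaussian serial correlation, g(u) = exp(−φu²) (see
Section 3.3.4), a subject measured at three time points ti1, ti2, ti3 has
\[
H_i \;=\;
\begin{pmatrix}
1 & e^{-\phi(t_{i1}-t_{i2})^2} & e^{-\phi(t_{i1}-t_{i3})^2} \\
e^{-\phi(t_{i1}-t_{i2})^2} & 1 & e^{-\phi(t_{i2}-t_{i3})^2} \\
e^{-\phi(t_{i1}-t_{i3})^2} & e^{-\phi(t_{i2}-t_{i3})^2} & 1
\end{pmatrix}.
\]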

We will also assume that all systematic trends have been removed from the
data. As explained in Section 9.2, this can be done by calculating ordinary
least squares residuals ri = yi − Xiβ̂OLS, based on some preliminary mean
structure, ignoring any dependence among the repeated measurements. The
residuals ri can now be used to study the covariance structure of our data.

In Section 10.2, we will discuss an informal check for the need of a serial
correlation component τ 2 Hi in (10.1). Afterward, in Section 10.3, it will
be shown how the residual serial correlation function can be studied by
fitting linear mixed models with general flexible parametric functions for
g(·). Finally, in Section 10.4, we discuss the use of the so-called variogram
to study the serial correlation function nonparametrically, that is, without
assuming any parametric form for g(·).

10.2 An Informal Check for Serial Correlation

A simple informal check for the need of a component τ²Hi of serial corre-
lation in (10.1) has been proposed by Verbeke, Lesaffre, and Brant (1998).
The main idea is to project the residuals ri orthogonal to the columns of
Zi, which allows one to directly study the variability in the data not ex-
plained by the included random effects. For each i, i = 1, . . . , N, let Ai be
an ni × (ni − q) matrix such that Ai'Zi = 0 and such that Ai'Ai = Ini−q.
The (ni − q)-dimensional transformed OLS residuals are then defined as
ri⋆ = Ai'ri. Their covariance matrix equals
\[
A_i' V_i A_i \;=\; \tau^2 A_i' H_i A_i + \sigma^2 I_{n_i - q}.
\]
In the absence of serial correlation, this is equal to σ²Ini−q, from which
it then follows that all rij⋆, i = 1, . . . , N and j = 1, . . . , (ni − q), are nor-
mally distributed with mean zero and common variance σ². Once the trans-
formed residuals rij⋆ have been calculated, various techniques are available
for checking their normality. Non-normality indicates that the assumed
model may not be appropriate, possibly because a component of serial
correlation is missing in the covariance structure.
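As a minimal SAS sketch of this check — assuming a data set resid that
contains, per measurement, the subject indicator id, the time time, and the
OLS residual res (all names hypothetical), and taking q = 3 random effects
(intercept, time, time²), as in the illustration below — the transformed
residuals can be computed and tested as follows:

proc iml;
use resid;
read all var {id time res};            /* subject, time, OLS residual   */
close resid;
ids = unique(id);
free eps;                              /* starts empty; filled below    */
do s = 1 to ncol(ids);
   rows = t(loc(id = ids[s]));         /* observations of subject s     */
   t_i  = time[rows];
   Z_i  = j(nrow(t_i), 1, 1) || t_i || t_i##2;  /* intercept, time, time2 */
   ni = nrow(Z_i);  q = ncol(Z_i);
   call qr(Qf, Rf, piv, lindep, Z_i);  /* full QR decomposition of Z_i  */
   A_i = Qf[, (q+1):ni];               /* so that Ai'Zi = 0, Ai'Ai = I  */
   eps = eps // (t(A_i) * res[rows]);  /* transformed residuals Ai'ri   */
end;
create epsout from eps[colname = "eps"];
append from eps;
close epsout;
quit;

proc univariate data = epsout normal;  /* includes the Shapiro-Wilk test */
   var eps;
run;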

As an illustration, we applied this to the prostate data, introduced in Sec-
tion 2.3.1. The same preliminary mean structure is used as in Section 9.2,
resulting in the OLS residual profiles which have been shown in Figure 9.3.
Further, random intercepts as well as slopes for time and time² have been
included in the model. The Shapiro-Wilk test (Shapiro and Wilk 1965)
revealed clear deviations from normality for the transformed residuals rij⋆
(p = 0.0011). In view of the fact that the assumed random-effects struc-
ture describes the variance function quite well (see Figure 9.5), we believe
that this suggests that the correlation function is still not adequately mod-
eled. We will therefore use the techniques to be described in Sections 10.3
and 10.4 to explore whether serial correlation should be added to the model,
and if so, what function g(·) would be appropriate.

10.3 Flexible Models for Serial Correlation

10.3.1 Introduction

A classical statistical procedure for testing whether a specific statistical
model fits some data set at hand is to test the model versus an extended
version of that model. A similar idea can now be used to investigate the
serial correlation function, conditionally on a prespecified set of random
effects. One thereby assumes that g(·) has a parametric form, which is
flexible enough to allow various shapes for the function g(·). Lesaffre, Asefa
and Verbeke (1999) have used so-called fractional polynomials, which we
introduce in the next section. Afterward, in Section 10.3.3, this approach
will be applied to the prostate data.

10.3.2 Fractional Polynomials

Royston and Altman (1994) define a fractional polynomial as any function
of the form
\[
f(x) \;=\; \phi_0 + \sum_{j=1}^{m} \phi_j\, x^{(p_j)},
\]
where the degree m is a positive integer, where p1 > . . . > pm are real-
valued prespecified powers, and where φ0 and φ1, . . . , φm are real-valued
unknown regression coefficients. Finally, x^(pj) is defined as
\[
x^{(p_j)} \;=\;
\begin{cases}
x^{p_j} & \text{if } p_j \neq 0 \\
\ln(x)  & \text{if } p_j = 0.
\end{cases}
\tag{10.2}
\]

In the context of linear and logistic regression analyses, Royston and Alt-
man (1994) have shown that the family of fractional polynomials is very
flexible and that models with degree m larger than 2 are rarely required.
In practice, several values for the powers p1 , . . . , pm can be tried, and the
model with the best fit is then selected.
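For instance, a second-degree fractional polynomial with powers
(p1, p2) = (2, 1) is the ordinary quadratic f(x) = φ0 + φ1x² + φ2x, the
form that will reappear in (10.4) below, whereas (p1, p2) = (0.5, 0) gives
f(x) = φ0 + φ1√x + φ2 ln(x).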

Lesaffre, Asefa and Verbeke (1999) applied fractional polynomials to model
the serial correlation function g(·). Their model is of the form
\[
\tau^2 g(u) \;=\; \exp\left\{ \phi_0 + \sum_{j=1}^{m} \phi_j\, u^{(p_j)} \right\}. \tag{10.3}
\]

Since u here represents the time lags between repeated measurements,
(10.2) needs to be adapted to allow zero time lags. One possibility is to
replace (10.2) by
\[
x^{(p_j)} \;=\;
\begin{cases}
x^{p_j}          & \text{if } p_j > 0 \\
\ln(x+1)         & \text{if } p_j = 0 \\
(x+1)^{p_j} - 1  & \text{if } p_j < 0.
\end{cases}
\]
Note that, because g(0) = 1, τ² is parameterized as τ² = exp(φ0).

A linear mixed model with random effects, measurement error, as well as
a serial correlation component with correlation function of the form (10.3)
can be fitted using maximum or restricted maximum likelihood estimation.
At present, this is not possible in the SAS procedure MIXED, but it can be
easily implemented in any software package which allows numerical opti-
mization. However, as discussed in Section 9.4, fitting linear mixed models
which contain all of the above three sources of stochastic variation may
become quite involved, in the sense that iterative numerical optimization
procedures often diverge. This is especially the case when high-degree frac-
tional polynomials are used for modeling the serial correlation function. We
therefore propose keeping the degree m of the polynomials relatively small,
but fitting several models with a variety of powers p1 , . . . , pm .
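In such an implementation, the quantity to be maximized numerically is,
as a sketch, the marginal normal log-likelihood of the OLS residuals ri
(treated as having mean zero, as in the analysis below),
\[
\ell(\theta) \;=\; -\frac{1}{2} \sum_{i=1}^{N} \left[ n_i \ln(2\pi) + \ln|V_i| + r_i' V_i^{-1} r_i \right],
\]
with Vi as in (10.1) and with the serial covariance τ²Hi built from the
fractional polynomial (10.3).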

10.3.3 Example: The Prostate Data

As an illustration, we have applied the above approach to the prostate can-
cer data introduced in Section 2.3.1. The same preliminary mean structure
is used as in Section 9.2, resulting in the OLS residual profiles which have
been shown in Figure 9.3. Further, random intercepts as well as slopes for
time and time² have been included in the model. Several models of the
form (10.3) have been fitted, with varying degrees m and varying powers
p1, . . . , pm. For some models, the numerical maximization procedure failed
to converge. This should not be surprising, keeping in mind that all mod-
els have quite complicated covariance structures, with at least 9 variance
components, and that the prostate data set contains data from 54 subjects
only. Among all models for which estimates for the parameters could be
obtained, the best one (largest maximized likelihood value) was the second-
degree fractional polynomial with powers p1 = 2 and p2 = 1, i.e.,
\[
\tau^2 g(u) \;=\; \exp\left\{ \phi_0 + \phi_1 u^2 + \phi_2 u \right\}. \tag{10.4}
\]

TABLE 10.1. Prostate Data. ML estimates and estimated standard errors for the
variance of the measurement error components as well as the parameters φj in
the serial correlation function (10.4).

Parameter Estimate (s.e.)

σ² 0.018 (0.001)
φ0 −3.173 (0.706)
φ1 −0.800 (3.019)
φ2 −0.907 (3.338)

Parameter estimates and associated asymptotic standard errors for the pa-
rameters φj as well as for the variance σ² of the measurement error compo-
nents are given in Table 10.1. The fitted function (10.4) is shown in panel
(a) of Figure 10.1 (solid line). Note that we report here maximum likeli-
hood (ML) estimates instead of restricted maximum likelihood estimates
(REML). Indeed, all parameters reported in the table are based on the
OLS residuals ri , which implies that no mean structure is included in the
model. Hence, REML estimation becomes impossible.

The variance of the serial correlation component is estimated as τ̂² =
exp(−3.173) = 0.042, which is more than twice the estimated variance
σ̂² of the measurement errors. Note also the relatively large standard er-
ror for φ0 and the relatively small standard error for σ̂², indicating that
we will probably not be able to estimate the serial correlation component
very accurately. The expression for the serial correlation function in (10.4)
suggests that the “optimal” serial correlation function is a combination of
exponential and Gaussian serial correlations. It is therefore interesting to
compare the obtained estimate for τ 2 g(u) with its estimates obtained un-
der exponential and Gaussian serial correlations, respectively. We therefore
fit linear mixed models with the same mean structure as the preliminary
mean structure and the same random-effects structure as the preliminary
random-effects structure.

The parameter estimates for the variance components in the model with
Gaussian serial correlation have already been reported in Table 9.1. Those
for the model with exponential serial correlation are given in Table 10.2.
This latter model can be fitted with the SAS procedure MIXED by replac-
ing the option ‘type=sp(gau)(time)’ by the option ‘type=sp(exp)(time)’ in
our program on p. 129. Note also that when our exponential serial cor-
relation function is parameterized as g(u) = exp(−φu), SAS provides an
estimate for 1/φ rather than for φ itself.
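In outline, only the REPEATED statement of that program changes:

repeated timeclss / type = sp(exp)(time) local subject = id;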
FIGURE 10.1. Prostate Data. Estimates for the residual serial covariance func-
tion τ²g(u) and for the residual serial correlation function g(u). The solid lines
represent the estimate obtained under model (10.4), where the parameter esti-
mates are the ones reported in Table 10.1. The long-dashed lines show the es-
timates under the exponential serial correlation model g(u) = exp(−φu), where
the parameter estimates of τ and φ are the ones reported in Table 10.2. The
short-dashed lines show the estimates under the Gaussian serial correlation model
g(u) = exp(−φu²), where the parameter estimates of τ and φ are the ones re-
ported in Table 9.1.

The so-obtained parametric estimates for τ²g(u) are also included in panel
(a) of Figure 10.1. Note that the
estimate for τ²g(u) based on model (10.4) is not well approximated by ei-
ther one of the other two estimates, but this is mainly due to the differences
between the estimates for the variance τ². For studying the correlation func-
tion g(u), independent of the amount of variability explained by the serial
correlation component, it is often helpful to plot the corresponding rescaled
functions g(u), shown in panel (b) of Figure 10.1. We now find that the
exponential as well as Gaussian serial correlation functions are good ap-
proximations for the function obtained under model (10.4), with a slightly
better performance for the exponential serial correlation model, especially
for small time lags. This is also supported by the maximized log-likelihood
values which are equal to −24.266 and −24.787 for the exponential and the
Gaussian serial correlation model, respectively (see Tables 9.1 and 10.2).
The fact that both models are hard to distinguish is also reflected in the
high correlations between the estimates for the parameters φj and the es-
timate for the variance σ², under model (10.4). The estimated correlation
matrix is given by
\[
\mathrm{Corr}(\hat\sigma^2, \hat\phi_0, \hat\phi_1, \hat\phi_2) \;=\;
\begin{pmatrix}
 1.000 &  0.235 & -0.811 &  0.862 \\
 0.235 &  1.000 & -0.010 &  0.646 \\
-0.811 & -0.010 &  1.000 & -0.706 \\
 0.862 &  0.646 & -0.706 &  1.000
\end{pmatrix},
\]

which shows that all parameter estimates are highly correlated with φ̂2,
including the estimate for the variance σ̂² of the measurement errors.

TABLE 10.2. Prostate Data. REML estimates and estimated standard errors for
all variance components in a linear mixed model with the preliminary mean struc-
ture defined in Section 9.2, with random intercepts and random slopes for the lin-
ear as well as quadratic time effect, with measurement error, and with exponential
serial correlation.

Effect Parameter Estimate (s.e.)

Covariance of bi:
var(b1i) d11 0.373 (0.115)
var(b2i) d22 0.557 (0.212)
var(b3i) d33 0.060 (0.032)
cov(b1i, b2i) d12 = d21 −0.378 (0.120)
cov(b2i, b3i) d23 = d32 −0.177 (0.078)
cov(b3i, b1i) d13 = d31 0.099 (0.042)
Measurement error variance:
var(ε(1)ij) σ² 0.015 (0.005)
Exponential serial correlation:
var(ε(2)ij) τ² 0.050 (0.065)
Rate of exponential decrease 1/φ 0.766 (1.301)
REML log-likelihood −24.266
−2 REML log-likelihood 48.532

10.4 The Semi-Variogram

10.4.1 Introduction

In Section 10.3, a parametric approach was followed to study the serial cor-
relation function g(·) in linear mixed models. The empirical semi-variogram
is an alternative, nonparametric technique which does not require the fit-
ting of linear mixed models. It was first introduced by Diggle (1988) for the
case of random-intercepts models (i.e., linear mixed models where the only
random effects are intercepts). Later, Verbeke, Lesaffre and Brant (1998)
extended the technique to models which may also contain other random
effects, additional to the random intercepts (see also Verbeke 1995). In
Sections 10.4.2 and 10.4.3, the original empirical semi-variogram for the
random-intercepts model will be discussed and illustrated. Afterward, in
Sections 10.4.4 and 10.4.5, the extended version will be presented.

10.4.2 The Semi-Variogram for Random-Intercepts Models

Assuming that the only random effects in the model are random intercepts,
we have that the marginal covariance matrix is given by

\[
\nu^2 J_{n_i} + \tau^2 H_i + \sigma^2 I_{n_i}, \tag{10.5}
\]

where Jni is the (ni × ni) matrix containing only ones and where ν² now
denotes the random-intercepts variance. This implies that the residuals rij
have constant variance ν² + σ² + τ² and that the correlation between any
two residuals rij and rik from the same subject i is given by
\[
\rho(|t_{ij} - t_{ik}|) \;=\; \frac{\nu^2 + \tau^2\, g(|t_{ij} - t_{ik}|)}{\nu^2 + \sigma^2 + \tau^2}.
\]
A stochastic process with mean zero, constant variance and a correlation
function which only depends on the time lag between the measurements is
often called (second-order) stationary (see for example Diggle 1990).

It immediately follows from the stationarity of the random process ri1,
ri2, . . . , that
\[
\tfrac{1}{2}\, E\left(r_{ij} - r_{ik}\right)^2 \;=\; \sigma^2 + \tau^2 \bigl(1 - g(|t_{ij} - t_{ik}|)\bigr) \;=\; v(u_{ijk}),
\]
for all i = 1, . . . , N and for all j ≠ k. The function v(u) is called the semi-
variogram, and it only depends on the time points tij through the time
lags uijk = |tij − tik |. Note that decreasing serial correlation functions g(·)
yield increasing semi-variograms v(u), with v(0) = σ², which converge to
σ² + τ² as u grows to infinity.

This is illustrated in Figure 10.2 for the case of exponential as well as
Gaussian serial correlation (see also Figure 3.2). The two graphs show the
semi-variogram for a linear random-intercepts model, once with exponential
serial correlation and once with Gaussian serial correlation. The variance of
the measurement errors was taken equal to σ 2 = 0.7, the variance of the ser-
ial correlation component equal to τ 2 = 1.3, and the random intercepts had
variance ν 2 = 1. Thus, the most important source of variability was the ser-
ial correlation component ε(2)i . The parameter φ was taken equal to 1; that
is, the function g(u) was exp(−u) for the exponential model and exp(−u2 )
for the Gaussian model. Larger values of φ would yield semi-variograms
which increase much faster, meaning that the function g(u) decays to zero
much quicker. The extreme case would be φ = +∞, which leads to the
independence model, assuming no correlation between the components in
ε(2)i . On the other hand, values of φ smaller than 1 yield semi-variograms
which level out much slower, meaning that g(u) does not decay to zero so
quickly. The extreme case would be φ = 0, which leads to a model assuming

FIGURE 10.2. (a) The semi-variogram for a linear random-intercepts model con-
taining a component with exponential serial correlation. (b) The semi-variogram
for a linear random-intercepts model containing a component with Gaussian ser-
ial correlation. σ², τ², and ν² represent the variability of the measurement error,
the serial correlation component, and the random intercepts, respectively.

correlations equal to one between the components of ε(2)i . So, the smaller
the value of φ, the stronger the serial correlation in the data.

In practice, the function v(u) is estimated by smoothing the scatter plot
of the \(\sum_{i=1}^{N} n_i(n_i-1)/2\) half-squared differences vijk = (rij − rik)²/2
between pairs of residuals within subjects versus the corresponding time
lags uijk = |tij − tik|. This estimate will be denoted by v̂(u) and is called
the sample semi-variogram. Further, since
\[
\tfrac{1}{2}\, E\left[r_{ij} - r_{kl}\right]^2 \;=\; \sigma^2 + \tau^2 + \nu^2
\]
whenever i ≠ k, we estimate the total process variance by
\[
\widehat{\sigma^2 + \tau^2 + \nu^2} \;=\; \frac{1}{2N^{*}} \sum_{i \neq k}\, \sum_{j=1}^{n_i}\, \sum_{l=1}^{n_k} \left(r_{ij} - r_{kl}\right)^2,
\]
where N* is the number of terms in the sum. This estimate, together with
v̂(u), can now be used for deciding which of the three stochastic com-
ponents will be included in the model and for selecting an appropriate
function g(u) in case serial correlation is included. The sample
semi-variogram also provides initial values for τ², σ², and ν² when needed
for the numerical maximization procedure. Finally, comparing v̂(u) with a
fitted semi-variogram yields an informal check on the assumed covariance
structure.

More details on this topic can be found in Chapter 5 in the book by Diggle,
Liang and Zeger (1994) in which the semi-variogram is discussed for several
special cases of the covariance structure (10.5) and the method is illustrated
in a covariance analysis for some real data sets.

FIGURE 10.3. Vorozole Study. Observed variogram (bullets with size proportional
to the number of pairs on which they are based) and fitted variogram (solid line).

10.4.3 Example: The Vorozole Study

The Vorozole study was introduced in Section 2.4. In order to construct
a variogram, the following fixed effects were considered in the calculation
of the residuals ri : linear, quadratic, and cubic time effects, as well as the
interactions between time and the covariates baseline value, treatment, and
dominant site.

The variogram is presented in Figure 10.3. Apart from the observed var-
iogram, a fitted version is presented as well. This is based on a linear
mixed-effects model with fixed effects: time, time×baseline, time², and
time²×baseline. The covariance structure includes a random intercept, a
Gaussian serial process, and residual measurement error.

10.4.4 The Semi-Variogram for Random-Effects Models

The semi-variogram described in Section 10.4.2 explicitly assumed that the
marginal covariance matrices of the longitudinal vectors Yi satisfy (10.5),
which makes the semi-variogram applicable when the only random effects
in the model are intercepts, and therefore only when the variance function
can be assumed to be constant. However, in many practical situations, the
variance function is clearly not constant, and one may wish to use random
effects (other than just intercepts) to take this into account in the model.
We will therefore describe now the approach of Verbeke, Lesaffre and Brant
(1998), who extended the semi-variogram to the more general linear mixed
model with (possibly multivariate) random effects, other than intercepts
(see also Verbeke and Lesaffre 1997b).

In the presence of random effects other than intercepts, several authors
have reported that the covariance structure (10.1) is often dominated by
its first component. It is therefore proposed to first remove all variability
which is explained by the random effects bi before studying the remaining
serial correlation in the data. This can be done by considering again the
transformed residuals ri⋆ = Ai'ri, defined in Section 10.2. The ri⋆ can now
be used to study the covariance structure of Ai'ε(1)i + Ai'ε(2)i. Another way
to remove the effect of the random effects would be to use subject-specific
residuals yi − Xiβ̂ − Zib̂i, in which b̂i = E(bi|yi) is the empirical Bayes
estimate for the random effect bi, obtained under a specific linear mixed
model. This is the approach followed by Morrell et al. (1995) in the context
of piecewise nonlinear mixed-effects models for the prostate data introduced
in Section 2.3.1. However, b̂i greatly depends on the normality assumption
for the random effects [see our discussion in Section 7.8.2, as well as Verbeke
and Lesaffre (1996a)], and is also influenced by the assumed covariance
structure Vi. This already involves assumptions about the functional form
of the serial correlation, if present, which is why the subject-specific residuals
should not be used to check assumptions about Vi. This is in contrast
with the ri⋆, which are independent of any distributional assumptions for
the bi and the calculation of which does not require an estimate of the
random-effects covariance matrix D. In the context of linear mixed models,
we therefore propose to study the serial correlation structure based on
the transformed residuals ri⋆, rather than on the subject-specific residuals
yi − Xiβ̂ − Zib̂i.

As mentioned in Section 10.2, all components rij⋆ are normally distributed
with mean zero and common variance σ² if no serial correlation is present.
In general, however, we have that
\[
\tfrac{1}{2}\, E\left(r^{\star}_{ij} - r^{\star}_{ik}\right)^2
\;=\; \sigma^2 + \tau^2 + \tau^2 \sum_{r<s} \left(A_{irj} - A_{irk}\right)\left(A_{isj} - A_{isk}\right) g(u_{irs}), \tag{10.6}
\]

where Airs denotes the (r, s) element of the matrix Ai. When all measure-
ments are taken at fixed time points, then only a small number of values uirs
can occur, which we denote by u0, . . . , uM. Note that (10.6) can then be seen
as a multiple regression model with parameters σ² + τ² and gj = τ²g(uj),
j = 0, . . . , M, and a scatter plot of the OLS estimates ĝ0, . . . , ĝM versus
u0, . . . , uM can be used to select an appropriate parametric form for the
function g(u).

Unfortunately, in practice (e.g., in the prostate example, Section 2.3.1),
one often has highly unbalanced data, resulting in a very large number

of values uirs. Although the above approach is no longer feasible, we can
still apply it to estimate g(u) for some set of prespecified values u. Let
u0 = 0 < u1 < . . . < uM be a set of values u for which we wish to estimate
g(u), and let g0, . . . , gM be defined as before. We take uM equal to the largest
time lag observed in the data set at hand. For each uirs not equal to any of
the prespecified values for u, we apply linear interpolation to approximate
g(uirs) in (10.6) by a linear combination of g(ut) and g(ut+1), for t such
that ut < uirs < ut+1.
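Explicitly, for such t, the interpolation uses
\[
g(u_{irs}) \;\approx\; \frac{u_{t+1} - u_{irs}}{u_{t+1} - u_t}\, g(u_t) \;+\; \frac{u_{irs} - u_t}{u_{t+1} - u_t}\, g(u_{t+1}),
\]
so that (10.6) remains linear in the unknown parameters.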
This yields an approximate linear regression model with parameters σ² + τ²
and g0 = τ², g1, . . . , gM. Similar to the balanced case, a scatter plot of the
OLS estimates ĝ0, . . . , ĝM versus u0, . . . , uM can then be used to propose an
appropriate structure for the serial correlation in the model. Note that this
approach can be seen as an extension of the sample semi-variogram, but
with the original OLS residuals ri replaced by the transformed versions ri⋆.

In order to improve the approximation in the linear interpolation, one
may be inclined to choose M large, yielding small intervals [ut, ut+1],
t = 0, . . . , M − 1. However, this automatically increases the number of
parameters g0 , . . . , gM to be estimated. Hence, there is a dilemma between
reducing the number of parameters and improving the approximation. One
also needs to specify the interpolation points ut , conditional on the choice
of M . We propose to first calculate all distances uijk = |tij − tik | and to
take the values ut equal to the (1/M )100% percentiles of the distribution
of these distances. This usually leads to a set of intervals with increasing
length, which has the advantage that most computational effort is spent
in the estimation of g(u) for small values of u, where an accurate estimate
of g is required for specifying an appropriate parametric structure for the
serial correlation in the data (see the difference between exponential and
Gaussian serial correlation in Section 3.3.4). Further, it can easily be shown
that the covariates in model (10.6) are linearly dependent when the model
contains random intercepts. We then set gM = 0, which is equivalent to the
assumption that the time span covered by the data is long enough for the
serial correlation to have decayed to zero.

Using simulations, Verbeke (1995) has shown that the above approach
yields estimates for gt which are too unstable to be useful for the detection
of residual serial correlation. This is caused by the large amount of scatter
in the values (rij⋆ − rik⋆)², but also by the high degree of multicollinearity
(Neter, Wasserman and Kutner, 1990, Section 8.5) in the approximate re-
gression model, induced by the linear interpolation. A classical method to
obtain more stable estimates in the presence of multicollinearity is to use
ridge regression, which allows a small bias in return for stability (Sen and
Srivastava 1990, Section 12.3, Neter, Wasserman and Kutner 1990, Section
1.7). The bias depends on the so-called shrinkage parameter c which is cho-
sen as small as possible (to reduce the bias) and such that the resulting esti-
mates indicate stability and satisfy approximately ĝ0 ≥ ĝ1 ≥ . . . ≥ ĝM = 0.

Note also that the choice of the matrices Ai is not unique. In the absence
of serial correlation, we have that the distribution of the quantities
(rij⋆ − rik⋆)² in (10.6) does not depend on Ai. Hence, differences in parameter
estimates due to different transformations Ai then only reflect sampling
variability. This is no longer the case in the presence of serial correlation,
where different choices for Ai not only result in different responses used in
the final regression analysis but also yield different covariates. In that case,
the effect of choosing other transformations Ai is less obvious. However,
Verbeke, Lesaffre and Brant (1998) found empirically that different choices
for the matrices Ai yield slightly different nonparametric estimates for g(u),
but all these estimates lead to the same conclusion with respect to the
presence and type of serial correlation.

Further, the above check for serial correlation acts conditional on the ran-
dom effects included in the model (i.e., conditional on the covariates Zi ).
However, the resulting semi-variogram is invariant under general repara-
meterizations of the form Zi Gi . Also, if too many random effects have been
specified, the results will still be valid since the resulting transformation
matrices Ai satisfy Ai'Zi* = 0 for any matrix Zi* consisting of columns of
the overspecified Zi. This again justifies favoring overspecified models Zibi
for the detection of residual serial correlation rather than models which are
too restrictive (see also our discussion in Section 9.3).

10.4.5 Example: The Prostate Data

As an illustration, we have applied the above semi-variogram approach to
the prostate cancer data introduced in Section 2.3.1. The same preliminary
mean structure is used as in Section 9.2, resulting in the OLS residual
profiles which have been shown in Figure 9.3. Further, random intercepts as
well as slopes for time and time² have been included in the model. Our aim
is to check whether the lack of fit detected in Section 10.2 can be (partially)
ascribed to the fact that no serial correlation component was included. We
used five intervals, each of which contained 20% of the values uijk . The
boundaries of these intervals are thus the quintiles of the distribution of
all time lags uijk , i = 1, . . . , N = 54, j = 1, . . . , ni , k = j + 1, . . . , ni , in
the prostate cancer data set. These boundaries equal u0 = 0, u1 = 2.7,
u2 = 4.7, u3 = 7.5, u4 = 11.2, and u5 = 25.3.

The solid line in Figure 10.4 represents the estimate for τ²g(u), obtained
from the methods described in the previous section. It clearly suggests the
presence of serial correlation, which may be appropriately described by a

FIGURE 10.4. Prostate Data. Estimates for the residual serial covariance func-
tion τ²g(u). The solid line represents the estimate obtained from the extended
semi-variogram approach. The dashed line shows the estimated Gaussian serial
covariance function τ² exp(−φu²), where the parameter estimates of τ and φ are
the ones reported in Table 9.1.

Gaussian serial correlation function g(u) = exp(−φu²) for some constant
φ > 0. For u = 0, we have that ĝ0 = τ̂² = 0.0285. The intercept σ² + τ²
of model (10.6) was estimated as 0.0564, from which σ̂² can be estimated
as 0.0279. This suggests that the variability which cannot be explained by
the three random effects can be split up into two components which are
about equally important (similar variance). We can now compare these re-
sults with those obtained from fitting a linear mixed model with the same
assumed preliminary mean structure, with the same assumed preliminary
random-effects structure, and with Gaussian serial correlation as well as
measurement error. The parameter estimates for the variance components
in this model have already been reported in Table 9.1. Note that the re-
ported estimates for τ² and σ² (0.032 and 0.023, respectively) were very
close to the values we now obtain nonparametrically. We have also added
the parametrically fitted Gaussian serial correlation function to Figure 10.4
(dashed line). Both estimates are very similar, suggesting that a Gaussian
serial correlation component might be appropriate here.

10.5 Some Remarks

Comparing parametric models for the serial correlation function, we found
in Section 10.3.3 that the exponential serial correlation function fits the

prostate data slightly better than the Gaussian serial correlation function,
with very similar maximized likelihood values for both models. On the other
hand, the nonparametric variogram in Section 10.4.5 seems to suggest the
presence of serial correlation of the Gaussian type. This again illustrates
the fact that precise characterization of the serial correlation function g(·)
is extremely difficult in the presence of several random effects. This was
also the conclusion of Lesaffre, Asefa and Verbeke (1999) after the analysis
of longitudinal data from more than 1500 children. However, this does not
justify ignoring the possible presence of any serial correlation, since this
might result in less efficient model-based inferences (see example in Sec-
tion 9.5). Practical experience suggests that including serial correlation, if
present, is more important than correctly specifying the serial correlation
function. We therefore propose to use the procedures discussed in this chap-
ter for detecting whether any serial correlation is present, rather than for
specifying the actual shape of g(·), which seems to be of minor importance.

Both procedures discussed in this chapter are conditional on a specific
random-effects structure. In practice, one often observes strong competi-
tion between these two sources of stochastic variation. Chi and Reinsel
(1989) report that the inclusion of a sufficient number of random effects in
a model with white noise errors may be able to represent the serial corre-
lations among the measurements taken on each individual. Indeed, serial
correlation can be replaced by very smooth subject-specific functions. This
is also reflected in substantial correlations between the estimates for the
variance components in the random-effects covariance matrix D and the
estimates for the remaining variance components in the covariance struc-
ture. As an example, we consider here the covariance matrix for all variance
components in the final linear mixed model (6.8), with Gaussian serial cor-
relation in addition to the three random effects and the measurement error.
The REML estimates for all parameters in this model have been reported in
Table 9.3. The estimated correlation matrix for the estimates of all variance
components is given by
\[
\mathrm{Corr}\left(\hat d_{11}, \hat d_{12}, \hat d_{22}, \hat d_{13}, \hat d_{23}, \hat d_{33}, \hat\tau^2, 1/\sqrt{\hat\phi}, \hat\sigma^2\right) \;=\;
\]
\[
\begin{pmatrix}
 1.00 & -0.87 &  0.62 &  0.70 & -0.49 &  0.39 & -0.18 & -0.10 & -0.00 \\
-0.87 &  1.00 & -0.85 & -0.94 &  0.75 & -0.63 &  0.21 &  0.08 & -0.03 \\
 0.62 & -0.85 &  1.00 &  0.88 & -0.97 &  0.91 & -0.46 & -0.29 &  0.02 \\
 0.70 & -0.94 &  0.88 &  1.00 & -0.82 &  0.72 & -0.22 & -0.06 &  0.05 \\
-0.49 &  0.75 & -0.97 & -0.82 &  1.00 & -0.97 &  0.51 &  0.33 & -0.02 \\
 0.39 & -0.63 &  0.91 &  0.72 & -0.97 &  1.00 & -0.57 & -0.38 &  0.01 \\
-0.18 &  0.21 & -0.46 & -0.22 &  0.51 & -0.57 &  1.00 &  0.81 &  0.04 \\
-0.10 &  0.08 & -0.29 & -0.06 &  0.33 & -0.38 &  0.81 &  1.00 &  0.32 \\
-0.00 & -0.03 &  0.02 &  0.05 & -0.02 &  0.01 &  0.04 &  0.32 &  1.00
\end{pmatrix}.
\]

We indeed get some relatively large correlations between τ̂² and the es-
timates of some of the parameters in D. Note also the small correlations
between σ̂² and the other estimates, except for 1/√φ̂, which is not com-
pletely unexpected since the Gaussian serial correlation component reduces
to measurement error for φ becoming infinitely large.
11
Local Influence for the Linear Mixed
Model

11.1 Introduction

As explained in Chapter 5, the fitting of mixed models is based on likelihood
methods (maximum likelihood, restricted maximum likelihood), which are
sensitive to peculiar observations. The data analyst should be aware of
particular observations that have an unusually large influence on the results
of the analysis. Such cases may be found to be completely appropriate and
retained in the analysis, or they may represent inappropriate data and may
be eliminated from the analysis, or they may suggest that additional data
need to be collected or that the current model is inadequate. Of course, an
extended investigation of influential cases is only possible once they have
been identified.

Many diagnostics have been developed for linear regression models. See,
for example, Cook and Weisberg (1982) and Chatterjee and Hadi (1988).
Since the linear mixed model can be seen as a concatenation of several
subject-specific regression models, it is most obvious to investigate how
these diagnostics (residual analysis, leverage, Cook’s distance, etc.) can be
generalized to the models considered in this book. Unfortunately, such a
generalization is far from obvious. First, several kinds of residuals could
be defined. For example, the marginal residual yi − Xiβ̂ reflects how a
specific profile deviates from the overall population mean and can there-
fore be interpreted as a residual. Alternatively, the subject-specific residual

yi − Xiβ̂ − Zib̂i measures how much the observed values deviate from the
subject’s own predicted regression line. Finally, the estimated random ef-
fects b̂i can also be seen as residuals since they reflect how much specific
profiles deviate from the population average profile. Further, the linear
mixed model involves two kinds of covariates. The matrix Xi represents
the design matrix for the fixed effects, and Zi is the design matrix for the
random effects. Therefore, it is not clear how leverages should be defined,
partially because the matrices Xi and Zi are not necessarily of the same
dimension.

The final classification of subjects as influential or not influential can be
based on the Cook’s distance, first introduced by Cook (1977a, 1977b,
1979), which measures how much the parameter estimates change when
a specific individual has been removed from the data set. In ordinary re-
gression analysis, this can easily be calculated due to the availability of
closed-form expressions for the parameter estimates, which makes it also
possible to ascribe influence to the specific characteristics of the subjects
(leverage, outlying). Unfortunately, this is no longer the case in linear mixed
models. For exact Cook’s distances, the iterative estimation procedure has
to be used N + 1 times, once to fit the model for all observations and once
for each individual that has been excluded from the analysis. This is not
only extremely time-consuming, but it also does not give any information
on the reason why some individuals are more influential than others.

All these considerations suggest that an influence analysis for the linear
mixed model should not be based on the same diagnostic procedures as
ordinary least squares regression. DeGruttola, Ware and Louis (1987) de-
scribe measures of influence and leverage for a generalized three-step least
squares estimator for the regression coefficients in a class of multivariate
linear models for repeated measurements. However, their method does not
apply to maximum likelihood estimation, and it is also not clear how to
extend their diagnostics to the case of unequal covariance matrices Vi .

Christensen, Pearson and Johnson (1992) have noticed that, conditionally
on the variance components α, there is an explicit expression for β̂ (see
Section 5.1), and hence it is possible to extend the Cook’s distance to mea-
sure influence on the fixed effects in a mixed model. For known α, the
so-obtained distances can be compared to a \(\chi^2_p\)-distribution in order to de-
cide which ones are exceptionally large. For estimated α, they still propose
using the χ2 -distribution as approximation. Further, they define Cook’s
distances, based on one-step estimates, for examining case influence on
the estimation of the variance components. These one-step estimates are
obtained from one single step in the Newton-Raphson procedure for the
maximization of the log-likelihood corresponding to the incomplete data
(ith case removed), starting from the estimates obtained for the complete
data. Although these procedures seem intuitively appealing, they do not
yield any influence diagnostic for the fixed effects and the variance compo-
nents simultaneously. Further, they do not allow one to ascribe global
influence to any of the subject’s characteristics.

Since case-deletion diagnostics assess the effect of an observation by com-
pletely removing it, they fit into the framework of global influence analy-
ses. This contrasts with a local influence analysis, first introduced by Cook
(1986). Beckman, Nachtsheim and Cook (1987) used the idea of local influ-
ence to develop methods for assessing the effect of perturbations from the
usual assumptions in the mixed-models analysis of variance with uncorre-
lated random components. They investigate how the parameters change un-
der small perturbations of the error variances, the random-effects variances,
and the response vector. An alternative perturbation scheme, proposed
by Verbeke (1995), Verbeke and Lesaffre (1997b), and Lesaffre and Ver-
beke (1998), is case-weight perturbation where it is investigated how much
the parameter estimates are affected by changes in the weights of the log-
likelihood contributions of specific individuals. In Section 11.2, some general
theory on local influence will be presented. Afterward, in Section 11.3, this
will be applied to case-weight perturbations in the context of linear mixed
models. Finally, in Section 11.4, the local influence methodology will be
illustrated in an influence analysis for the prostate data.

11.2 Local Influence

In this section, the local influence approach, first introduced by Cook
(1986), will be presented. The general idea is to give every individual its own
weight in the calculation of the parameter estimates and to investigate how
these estimates depend on the weights, locally around the equal-weight case
(i.e., all individuals have the same weight 1), which is the ordinary max-
imum likelihood case. Much of the terminology and most concepts in the
sequel of this section are borrowed from differential geometry. More details
can be found in any textbook on differential geometry; see, for example,
O’Neill (1966).

We know from expression (5.2) that the maximum likelihood log-likelihood
function for model (3.8) can be seen as
\[
\ell(\theta) \;=\; \sum_{i=1}^{N} \ell_i(\theta), \tag{11.1}
\]
in which ℓi(θ) is the contribution of the ith individual to the log-likelihood.


Let ℓ(θ|ω) now denote any perturbed version of ℓ(θ), depending on an
r-dimensional vector ω of weights, which is assumed to belong to an open
subset Ω of IR^r, and such that the original log-likelihood ℓ(θ) is obtained
for ω = ω0.

For the detection of influential subjects, our perturbed log-likelihood will
be
\[
\ell(\theta\,|\,\omega) \;=\; \sum_{i=1}^{N} w_i\, \ell_i(\theta), \tag{11.2}
\]
which clearly allows different weights for different subjects. The weight
vector ω is then N-dimensional, and the original log-likelihood corresponds
to ω = ω0 = (1, 1, . . . , 1)'. Note also that the log-likelihood with the ith
case completely removed corresponds to the vector ω with wi = 0 and
wj = 1 for all j ≠ i.

Later, in Chapter 19, another perturbation scheme will be used to perform
a sensitivity analysis in the context of missing data. For now, we will further
consider the case of a general perturbed log-likelihood ℓ(θ|ω) to explain the
local influence methodology.
Let θ̂ be the maximum likelihood estimator for θ, obtained by maximizing
ℓ(θ), and let θ̂ω denote the estimator for θ under ℓ(θ|ω). The local influ-
ence approach now compares θ̂ω with θ̂. Similar estimates indicate that
perturbations have little effect on the parameter estimates. Strongly differ-
ent estimates suggest that the estimation procedure is highly sensitive to
such perturbations. Cook (1986) proposed to measure the distance between
θ̂ω and θ̂ by the so-called likelihood displacement, defined by
\[
LD(\omega) \;=\; 2\left[\ell(\hat\theta) - \ell(\hat\theta_\omega)\right].
\]

This way, the variability of θ̂ is taken into account. LD(ω) will be large if
ℓ(θ) is strongly curved at θ̂ (which means that θ is estimated with high
precision) and LD(ω) will be small if ℓ(θ) is fairly flat at θ̂ (meaning
that θ is estimated with high variability). From this perspective, a graph
of LD(ω) versus ω contains essential information on the influence of case-
weight perturbations. It is useful to view this graph as the geometric surface
formed by the values of the (r + 1)-dimensional vector
\[
\xi(\omega) \;=\; \begin{pmatrix} \omega \\ LD(\omega) \end{pmatrix}
\]
as ω varies throughout Ω. In differential geometry, a surface of this form
is frequently called a Monge patch. Following Cook (1986), we will refer
to ξ(ω) as an influence graph. It is a surface in IR^{r+1} and can be used to
assess the influence of varying ω through Ω. A graphical representation,
which also illustrates all further developments, is given in Figure 11.1.

FIGURE 11.1. Graphical representation of the influence graph and the local in-
fluence approach.

Ideally, we would like a complete influence graph [i.e., a graph of ξ(ω) for
varying ω] to assess influence for a particular model and a particular data
set. However, this is only possible in cases where the number r of weights
in ω does not exceed 2. Hence, methods are needed for extracting the most
relevant information from an influence graph. One possible approach is local
influence, which uses normal curvatures of ξ(ω) at ω0. One proceeds as
follows. Let T0 be the tangent plane to ξ(ω) at ω0. Since LD(ω) attains its
minimum at ω0, we have that T0 is parallel to Ω ⊂ IR^r. Each vector h in Ω,
of unit length, determines a plane that contains h and which is orthogonal
to T0. The intersection, called a normal section, of this plane with the
surface is called a lifted line. It can be graphed by plotting LD(ω0 + ah)
versus the univariate parameter a ∈ IR. The normal curvature of the lifted
line, denoted by Ch, is now defined as the curvature of the plane curve
(a, LD(ω0 + ah)) at a = 0. It can be visualized as the inverse radius of
the best-fitting circle at a = 0. The curvature Ch is called the normal
curvature of the surface ξ(ω) at ω0 in the direction h. Large values of Ch
indicate sensitivity to the induced perturbations in the direction h. Ch is
called the local influence on the estimation of θ, of perturbing the model
corresponding to the log-likelihood (11.1), in the direction h.

Let ∆i be the s-dimensional vector of second-order derivatives of ℓ(θ|ω),
with respect to wi and all components of θ, evaluated at θ = θ̂ and
at ω = ω0, and let ∆ be the (s × r) matrix with ∆i as the ith column.
Further, let L̈ denote the (s × s) matrix of all second-order derivatives of
ℓ(θ), also evaluated at θ = θ̂. Cook (1986) has then shown that, for any
unit vector h in Ω,
\[
C_h \;=\; 2\left| h' \Delta' \ddot L^{-1} \Delta\, h \right|. \tag{11.3}
\]

There are several ways in which (11.3) can be used to study ξ(ω), each
corresponding to a specific choice of the unit vector h. One evident choice
corresponds to the perturbation of the ith weight only (case-weight pertur-
bation). This is obtained by taking h equal to the vector hi which contains
zeros everywhere except on the ith position, where there is a one. The
resulting local influence is then given by
\[
C_i \;\equiv\; C_{h_i} \;=\; 2\left| \Delta_i' \ddot L^{-1} \Delta_i \right|. \tag{11.4}
\]

Another important direction is determined by hmax, which corresponds to
the maximal normal curvature Cmax. It follows from Seber (1984, p. 526)
that Cmax is twice the largest eigenvalue of −∆'L̈⁻¹∆, and that hmax is
the corresponding eigenvector. Hence, the calculation of Cmax and hmax in-
volves an eigenvalue analysis of an (r × r)-dimensional matrix and can there-
fore be both difficult and computationally expensive. Fortunately, ∆'L̈⁻¹∆
is only of rank s, which is often small in comparison to r, and this can be
exploited to simplify the computations as follows (see also Beckman, Nacht-
sheim and Cook 1987). We know from Seber (1984, p. 521) that there exists
a nonsingular matrix R such that L̈⁻¹ = −R'R, from which it follows that
−∆'L̈⁻¹∆ equals ∆'R'R∆. It now follows from Seber (1984, p. 518) that
the nonzero eigenvalues of −∆'L̈⁻¹∆ are the same as those of R∆∆'R',
which is of dimension (s × s). Hence, Cmax can easily be calculated as twice
the largest eigenvalue of R∆∆'R'. Let z denote the corresponding eigen-
vector. The direction of maximal normal curvature can then readily be seen
to equal
\[
h_{\max} \;=\; \frac{\Delta' R' z}{\left\| \Delta' R' z \right\|}.
\]
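As a small SAS/IML sketch of these computations — assuming the (s × N)
matrix Delta, with ∆i as its ith column, and the matrix Lddot containing
L̈ have already been computed, for instance as by-products of the Newton-
Raphson procedure:

proc iml;
/* Delta (s x N) and Lddot (s x s) assumed to be available here       */
Linv = inv(Lddot);
Ci = 2 # abs(vecdiag(t(Delta) * Linv * Delta));  /* C_i as in (11.4)  */
R  = root(-Linv);          /* Cholesky root: -inv(Lddot) = R`R         */
B  = R * Delta * t(Delta) * t(R);                /* small s x s matrix */
call eigen(lambda, W, B);  /* eigenvalues, largest first               */
Cmax = 2 * lambda[1];
h    = t(Delta) * t(R) * W[, 1];
hmax = h / sqrt(ssq(h));   /* direction of maximal normal curvature    */
quit;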

The direction hmax is the direction for which the normal curvature is maxi-
mal. It shows how to perturb the postulated model (the model for ω = ω0)
to obtain the largest local changes in the likelihood displacement. If, for
example, the ith component of hmax is found to be relatively large, this in-
dicates that perturbations in the weight wi may lead to substantial changes

in the results of the analysis. On the other hand, denoting the s nonzero
eigenvalues of −∆'L̈⁻¹∆ by Cmax/2 ≡ λ1 ≥ . . . ≥ λs > 0 and the corre-
sponding normalized orthogonal eigenvectors by {hmax ≡ v1, . . . , vs}, we
have that
\[
C_i \;=\; 2 \sum_{j=1}^{s} \lambda_j\, v_{ji}^2, \tag{11.5}
\]

where vji is the ith component of the vector vj. Hence, the local influence
Ci of perturbing the ith weight wi can be large without the ith component
of hmax being large, as long as some of the other eigenvectors of −∆'L̈⁻¹∆
have large ith components. This will be further illustrated with an exam-
ple in Section 11.4. Therefore, it is not sufficient to calculate only hmax ;
all Ci should be computed as well. Cook (1986) proposes to inspect hmax ,
regardless of the size of Cmax , since it may highlight directions which are
simultaneously influential. However, since there is usually no analytic ex-
pression for hmax, it is very difficult to gain insight into the reasons
for such influences. This is not the case for the influence measures Ci. In
Section 11.3, it will be shown how expression (11.4) can be used to ascribe
local influence to interpretable components such as residuals.

When a subset θ1 of θ = (θ1', θ2')' is of special interest, a similar approach
can be used, replacing the log-likelihood by the profile log-likelihood. Let
g(θ1) be the value of θ2 that maximizes ℓ(θ) = ℓ(θ1, θ2) for each fixed
θ1. The profile log-likelihood for θ1 is defined as ℓ(θ1, g(θ1)). For θ̂1ω de-
termined from the partition θ̂ω = (θ̂1ω', θ̂2ω')', the likelihood displacement
based on this profile log-likelihood is given by
\[
LD_1(\omega) \;=\; 2\left[\ell(\hat\theta) - \ell\bigl(\hat\theta_{1\omega},\, g(\hat\theta_{1\omega})\bigr)\right].
\]

The methods discussed above for the full parameter vector now directly
carry over to calculate local influences on the geometric surface defined by
LD1(ω). We now partition L̈ as
\[
\ddot L \;=\; \begin{pmatrix} \ddot L_{11} & \ddot L_{12} \\ \ddot L_{21} & \ddot L_{22} \end{pmatrix},
\]
according to the dimensions of θ1 and θ2. Cook (1986) has then shown
that the local influence on the estimation of θ1, of perturbing the model in
the direction of a normalized vector h, is given by
\[
C_h(\theta_1) \;=\; 2\left| h' \Delta' \left( \ddot L^{-1} - \begin{pmatrix} 0 & 0 \\ 0 & \ddot L_{22}^{-1} \end{pmatrix} \right) \Delta\, h \right|. \tag{11.6}
\]

Because all eigenvalues of the matrix
\[
\begin{pmatrix} \ddot L_{11} & \ddot L_{12} \\ \ddot L_{21} & \ddot L_{22} \end{pmatrix}
\begin{pmatrix} 0 & 0 \\ 0 & \ddot L_{22}^{-1} \end{pmatrix}
\;=\;
\begin{pmatrix} 0 & \ddot L_{12} \ddot L_{22}^{-1} \\ 0 & I \end{pmatrix}
\]
are either one or zero, we have that (Seber 1984, p. 526), for any vector v,
\[
0 \;\geq\; v' \begin{pmatrix} 0 & 0 \\ 0 & \ddot L_{22}^{-1} \end{pmatrix} v \;\geq\; v' \ddot L^{-1} v,
\]
and therefore also that
\[
C_h(\theta_1) \;=\; -2 h' \Delta' \ddot L^{-1} \Delta h \;+\; 2 h' \Delta' \begin{pmatrix} 0 & 0 \\ 0 & \ddot L_{22}^{-1} \end{pmatrix} \Delta h
\;=\; C_h + 2 h' \Delta' \begin{pmatrix} 0 & 0 \\ 0 & \ddot L_{22}^{-1} \end{pmatrix} \Delta h \;\leq\; C_h. \tag{11.7}
\]
This means that the normal curvature for θ1, in the direction h, can never
be larger than the normal curvature for θ in that same direction.

Note also that it immediately follows from (11.6) that, for L̈12 = 0,
\[
C_h \;=\; -2 h' \Delta' \begin{pmatrix} \ddot L_{11}^{-1} & 0 \\ 0 & 0 \end{pmatrix} \Delta h
\;-\; 2 h' \Delta' \begin{pmatrix} 0 & 0 \\ 0 & \ddot L_{22}^{-1} \end{pmatrix} \Delta h
\;=\; C_h(\theta_1) + C_h(\theta_2).
\]

Hence, we have that for any direction h, the normal curvature for θ in
that direction is then the sum of the normal curvatures for θ1 and θ2 in
the same direction. Intuitively, this can be explained as follows. It follows
from the classical maximum likelihood theory that, for sufficiently large
samples, and under the correct model specification, θ̂ is asymptotically
normally distributed with covariance matrix −L̈⁻¹. So, L̈12 = 0 means
that θ̂1 and θ̂2 are statistically independent. It is then not surprising that
Ch = Ch(θ1) + Ch(θ2) since this expresses the fact that the influence for
θ1 is independent of the influence for θ2.

Finally, there are again many possible choices for the vector h. For example,
the local influence of perturbing the ith weight on the estimation of θ1 is
obtained for h = hi, the vector with zeros everywhere except on the ith
position where there is a one. The corresponding normal curvature will be
denoted by Ci(θ1).

11.3 The Detection of Influential Subjects

We will now show how the local influence approach, introduced in Sec-
tion 11.2, can be applied to detect subjects which are locally influential for
the fitting of a specific linear mixed model. We hereby restrict the discus-
sion to models which assume conditional independence, that is, models of
the form (3.8) where all residual covariance matrices Σi are equal to σ²Ini.
Our perturbed log-likelihood is defined in (11.2), which allows different
subjects to have different weights in the log-likelihood function. The vector
∆i now equals the s-dimensional vector of first-order derivatives of ℓi(θ),
with respect to all components of θ and evaluated at θ = θ̂. Note that
the calculation of ∆i usually requires only little additional computational
effort, since the first-order derivative of the log-likelihood is needed in the
iterative Newton-Raphson estimation procedure.

As explained in Section 11.2, local influence measures Ch can be calculated
for a variety of unit vectors h. For the detection of influential subjects,
the measures Ci will be of particular interest since they represent the local
influence of each subject separately on the estimation of θ. In a global case-
deletion approach, the maximum likelihood estimate θ̂ is compared with
θ̂(i) obtained by maximizing
\[
\ell_{(i)}(\theta) \;=\; \sum_{j \neq i} \ell_j(\theta).
\]

Because this is computationally very expensive, one instead often compares
θ̂ with the approximation of θ̂(i) given by
\[
\hat\theta^{\,1}_{(i)} \;=\; \hat\theta \;-\; \ddot L^{-1}_{(i)}(\hat\theta) \sum_{j \neq i} \Delta_j(\hat\theta), \tag{11.8}
\]
where L̈(i)(θ̂) is the matrix of second-order derivatives of ℓ(i)(θ), evaluated
at θ̂. The vector θ̂¹(i) is referred to as the one-step approximation to θ̂(i),
since it is obtained from a single Newton-Raphson step in the maximization
procedure of ℓ(i)(θ), using θ̂ as the starting value.

A measure of influence, proposed by Pregibon (1981), is then
\[
\rho_i \;=\; -\left(\hat\theta - \hat\theta^{\,1}_{(i)}\right)' \ddot L \left(\hat\theta - \hat\theta^{\,1}_{(i)}\right).
\]

See also Pregibon (1979, Chapter 5) and Cook and Weisberg (1982, Chap-
ter 5) for more details. It now follows from (11.8) that
\[
\Delta_i \;=\; -\sum_{j \neq i} \Delta_j \;=\; \ddot L_{(i)}(\hat\theta)\left(\hat\theta^{\,1}_{(i)} - \hat\theta\right), \tag{11.9}
\]
such that expression (11.4) becomes
\[
C_i \;=\; -2\left(\hat\theta - \hat\theta^{\,1}_{(i)}\right)' \ddot L_{(i)}\, \ddot L^{-1} \ddot L_{(i)} \left(\hat\theta - \hat\theta^{\,1}_{(i)}\right). \tag{11.10}
\]

Comparison of (11.9) with (11.10) shows that Ci/2 and ρi are approxi-
mately the same for N sufficiently large. Hence, our local influence measures
Ci can be interpreted as approximations to the classical global case-deletion
diagnostics.

An advantage of Ci, in comparison to, for example, the direction hmax of
maximal curvature, is the availability of the analytic expression (11.4) for
Ci. Lesaffre and Verbeke (1998) have shown that Ci can be decomposed
into five interpretable components. Let ℛi, 𝒳i, and 𝒵i now denote the
“standardized” residuals and covariates for the ith individual, defined by
\(\mathcal{R}_i = V_i^{-1/2} r_i\), \(\mathcal{X}_i = V_i^{-1/2} X_i\), and \(\mathcal{Z}_i = V_i^{-1/2} Z_i\), respectively, with ri
equal to yi − Xiβ̂. Further, for any matrix A, let ‖A‖ = √tr(A'A) denote
the Frobenius norm of A (see Golub and Van Loan 1989). The interpretable
components in Ci are then
\[
\|\mathcal{X}_i'\mathcal{X}_i\|, \quad \|\mathcal{R}_i\|, \quad \|\mathcal{Z}_i'\mathcal{Z}_i\|, \quad \|I - \mathcal{R}_i\mathcal{R}_i'\|, \quad \|V_i^{-1}\|. \tag{11.11}
\]
First, ‖𝒳i'𝒳i‖ measures the “length” of the standardized covariates in the
mean structure, and ‖ℛi‖ is an overall measure of how well the observed
data for the ith subject are predicted by the mean structure Xiβ̂. Second,
the components ‖𝒵i'𝒵i‖ and ‖I − ℛiℛi'‖ have a similar meaning, but for
the covariance structure. For example, ‖I − ℛiℛi'‖ will be zero only if Vi
equals riri'. Note that riri' is an estimate for var(yi), which only assumes
the mean to be correctly modeled as Xiβ. Therefore, ‖I − ℛiℛi'‖ can be
interpreted as a residual, measuring how well the covariance structure of
the data is modeled by Vi = ZiDZi' + σ²Ini. Finally, the fifth component
‖Vi⁻¹‖ will be large if Vi has small eigenvalues, which indicates that the
ith subject is assumed to have small variability.

The decomposition of Ci immediately suggests a practical procedure to find
an explanation for the influential character of an individual. Namely, when
Ci is large, we inspect the diagnostics (11.11). Index plots are useful to
Ci is large, we inspect the diagnostics (11.11). Index plots are useful to
graphically inspect the individuals vis-à-vis their influential nature. Hence,
we propose to start with an index plot of Ci . In a second step, the index
plots of (11.11) can be examined. A recurrent practical difficulty with di-
agnostics is to establish a threshold above which an individual is defined
as “remarkable”. It follows from (11.4) that
\[
\sum_{i=1}^{N} C_i \;=\; -2\, \mathrm{tr}\left( \ddot L^{-1} \sum_{i=1}^{N} \Delta_i \Delta_i' \right),
\]

which converges to 2s, for N becoming infinitely large. As for leverage in


linear regression (see, for example, Neter, Wasserman and Kutner 1990,
pp. 395-396), one could classify an individual for which Ci is larger than
twice the average value (larger than 4s/N , for N large) as being influential.
However, unlike for the leverage situation, 2s is only the approximate sum
11.4 Example: The Prostate Data 161

of the Ci , whichwill not be accurate if the model is not correctly specified


(such that L̈−1 i ∆i ∆i does not converge to Is ) or if N is too small for
the asymptotics to yield good approximations. In such cases, we propose
to replace 2s by the actual sum, and Nwe call the ith subject influential if
Ci is larger than the cutoff value 2 i=1 Ci /N .

It is less evident to find "natural" thresholds for the diagnostics (11.11). Also in the context of linear mixed models, Waternaux, Laird and Ware (1989) proposed to calibrate $\|R_i\|^2$ with a χ²-distribution with ni degrees of freedom to detect outliers. Note that the other interpretable components also depend on the size ni of the response vector yi. However, it is not immediately clear how to correct them for ni. We therefore suggest adding an index plot of ni to the index plots of the interpretable components (11.11). We refer to Section 11.4 for an example.

The general theory on local influence also allows assessing influence on subsets of θ. Here, we will be especially interested in the local influence of subjects on the estimation of the fixed effects β or the variance components α separately. The local influence of subject i on the estimation of β will be denoted by Ci(β). For the variance components, this will be denoted by Ci(α). It can be shown that Ci(β) and Ci(α) contain the same five interpretable components in their decomposition as Ci. Further, we have that, for any variance component αk from α,

$$\frac{\partial^2 \ell(\theta)}{\partial\beta\,\partial\alpha_k} \;=\; -\sum_{i=1}^{N} X_i'\, V_i^{-1}\, \frac{\partial V_i}{\partial \alpha_k}\, V_i^{-1}\, (y_i - X_i\beta),$$

which implies that the maximum likelihood estimates for the fixed effects and for the variance components are asymptotically independent (see Verbeke and Lesaffre 1996b, 1997a, for technical details). It now follows from Section 11.2 that, for N sufficiently large,

$$C_i \;\approx\; C_i(\beta) + C_i(\alpha). \qquad (11.12)$$

Lesaffre and Verbeke (1998) have shown that this also implies that Ci(β) can be decomposed using only the first two components $\|\mathcal{X}_i \mathcal{X}_i'\|$ and $\|R_i\|$, whereas only the last three components $\|\mathcal{Z}_i \mathcal{Z}_i'\|$, $\|I_{n_i} - R_i R_i'\|$, and $\|V_i^{-1}\|$ are needed in the decomposition of Ci(α). Hence, for sufficiently large data sets, influence for the fixed effects and for the variance components can be further investigated by studying the first two and the last three interpretable components, respectively. This will also be illustrated in our example in Section 11.4.

FIGURE 11.2. Prostate Data. (a) Plot of total local influences Ci versus the
identification numbers of the individuals in the BLSA data set. (b) Scatter plot
of the local influence measures Ci (β) and Ci (α) for the fixed effects and the
variance components, respectively. The most influential subjects are indicated by
their identification number.

11.4 Example: The Prostate Data

As an illustration, we now perform a local influence analysis for model (6.8) for the prostate data. The first step is to trace the individuals who have a large impact on the parameter estimates, as measured by Ci. In the second step, it is determined which part of the fitted model is affected by the influential cases: the fixed effects and/or the variance components. Finally, the cause of the influential character has to be established in order to gain insight into why a case is peculiar. The calculation of the influence measures and the interpretable components can be performed with a SAS macro available from the website.

Figure 11.2(a) is an index plot of the total local influence Ci. The cutoff value used for Ci equals $2\sum_i C_i/N = 1.98$ and has been indicated in the figure by the dashed line. Participants #15, #22, #23, #28, and #39 are found to have a Ci value larger than 1.98 and are therefore considered to be relatively influential for the estimation of the complete parameter vector θ. Their observed and expected profiles are shown in Figure 11.3. Pearson et al. (1994) report that subjects #22, #28, and #39, who were classified as local/regional cancer cases, were probably misclassified metastatic cancer cases. It is therefore reassuring that this influence approach flagged these three cases as being special. Subjects #15 and #23 were already in the metastatic cancer group. In Figure 11.2(b), a scatter plot of Ci(α) versus Ci(β) is given. Their respective cutoff values are $2\sum_i C_i(\alpha)/N = 1.10$ and $2\sum_i C_i(\beta)/N = 0.99$. Obviously, subject #28, who is the subject with the largest Ci value, is highly influential for both the fixed effects and the variance components. Individuals #15 and #39 are also influential for both parts of the model, but to a much lesser extent. Finally, we have that subject #23 is influential only for the fixed effects β and that, except for subject #28, subject #22 has the highest influence for the variance components α, but is not influential for β.

FIGURE 11.3. Prostate Data. Observed (dashed lines) and fitted (solid lines) profiles for the five most locally influential subjects. All subjects are metastatic cancer cases, but individuals #22, #28, and #39 were wrongly classified as local/regional cancer cases.

Figure 11.4 shows an index plot of each of the five interpretable components in the decomposition of the local influence measures Ci, Ci(β), and Ci(α), as well as of the number ni of PSA measurements available for each subject. These can now be used to ascribe the influence of the influential subjects to their specific characteristics. As an example, we will illustrate this for subject #22, which has been circled in Figure 11.4. As indicated by Figure 11.2, this subject is highly influential, but only for the estimation of the variance components. If approximation (11.12) is sufficiently accurate, this influence for α can be ascribed to the last three interpretable components only [i.e., the components plotted in the panels (c), (d), and (e) of Figure 11.4]. Hence, although the residual component for the mean structure is the largest for subject #22 [Figure 11.4(b)], it is not the cause of the highly influential character of this subject for the estimation of the variance components, nor did it cause a large influence on the estimation of the fixed effects in the model. Note instead how this subject also has the largest residual for the covariance structure, suggesting that the covariance matrix is poorly predicted by the model-based covariance V22. This is also illustrated in Figure 11.3. Obviously, the large residual for the mean was caused by the poor prediction around the time of diagnosis, but this was not sufficient to make subject #22 influential for the estimation of the complete average profile. Further, a close look at the estimated covariance matrix V22 shows that only positive correlations are assumed between the repeated measurements, whereas the positive and negative residuals in Figure 11.3 suggest some negative correlations.

FIGURE 11.4. Prostate Data. Index plots of the five interpretable components in the decomposition of the total local influence Ci [panels (a) $\|\mathcal{X}_i\mathcal{X}_i'\|$, (b) $\|R_i\|$, (c) $\|\mathcal{Z}_i\mathcal{Z}_i'\|$, (d) $\|I - R_iR_i'\|$, and (e) $\|V_i^{-1}\|$], and an index plot (f) of the number ni of repeated measurements for each subject.

FIGURE 11.5. Prostate Data. Comparison of the likelihood displacement LDi with
the total local curvature Ci . The individuals are numbered by their identification
number.

As discussed in Section 11.3, the local influence measures Ci can be interpreted as approximations to the classical global case-deletion diagnostics. As an illustration, we also performed such a global influence analysis. Figure 11.5 compares the local influence measures Ci with the likelihood displacements

$$LD_i \;=\; 2\left[\,\ell(\widehat{\theta}\,) - \ell(\widehat{\theta}_{(i)})\right],$$

where $\widehat{\theta}_{(i)}$ denotes the maximum likelihood estimate for θ after deletion of subject i. From this picture, we can observe that, for the data set at hand, both approaches highlight the same set of influential observations. However, they do not agree on the ranking of the observations according to their influence.
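Computing the LDi by brute force requires N refits of the model but no additional theory. A schematic Python loop follows, with `fit` and `loglik` hypothetical helpers wrapping one's own mixed-model machinery (so this is a sketch of the procedure, not of any particular library):

    def likelihood_displacements(subjects, fit, loglik):
        """Global case-deletion diagnostics LD_i (schematic)."""
        theta_hat = fit(subjects)                  # MLE on all N subjects
        ll_full = loglik(theta_hat, subjects)
        LD = []
        for i in range(len(subjects)):
            reduced = subjects[:i] + subjects[i + 1:]  # delete subject i
            theta_i = fit(reduced)                     # MLE without subject i
            # LD_i = 2 [ l(theta_hat) - l(theta_hat_(i)) ], both evaluated on
            # the FULL data, so LD_i >= 0 by definition of the MLE.
            LD.append(2 * (ll_full - loglik(theta_i, subjects)))
        return LD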

We also calculated the direction hmax = v1 of maximal curvature Cmax for the data set at hand. Cmax equals 15.19, and an index plot of the individual components of hmax is shown in Figure 11.6. The vector hmax is pointing toward individuals #15, #28, #39, and #45. These individuals were also found in the index plot of Ci. Indeed, case #45 also stands out in Figure 11.2, although the C45 value is below the threshold. However, subjects #22 and #23 are not highlighted in Figure 11.6. This illustrates the fact that locally influential subjects are not necessarily subjects with a large component in the direction of maximal curvature. Subject #22, for example, has total local influence C22 equal to 7.02, but seems to have a small weight in v1 ($v_{1,22} = 0.18$). As already explained in Section 11.2, this occurs when $-\Delta'\ddot{L}^{-1}\Delta$ has large eigenvalues other than λ1, and for subjects with much weight in one or more eigenvectors corresponding to such eigenvalues. The first five terms in expression (11.5) are shown for subject #22 in Table 11.1. The five largest eigenvalues of $-\Delta'\ddot{L}^{-1}\Delta$ are λ1 = 7.59, λ2 = 5.49, λ3 = 2.77, λ4 = 1.68, and λ5 = 1.53, and the weights of this subject in the corresponding eigenvectors are $v_{1,22} = 0.18$, $v_{2,22} = -0.74$, $v_{3,22} = -0.20$, $v_{4,22} = 0.11$, and $v_{5,22} = -0.01$. Because of its large weight in v2, this subject has a large second component in the local influence ($2\lambda_2 v_{2,22}^2 = 6.0413$), which results in a large C22 value. This is why we believe that an influence analysis should not be based on the direction of maximal curvature only.

FIGURE 11.6. Prostate Data. An index plot of the components of hmax.

On the other hand, the index plot in Figure 11.6 offers an extra diagnostic tool. The components of hmax have a positive or a negative sign. From this plot, one can observe that case #15 has a different sign than individuals #28, #39, and #45. The impression is given that individual #15 is counterbalancing the effect of these other cases. For the three cases with a negative component of hmax, the observed profiles (see also Figure 11.3) lie completely above, and are much steeper than, the predicted profiles, whereas for individual #15 the observed response intersects its prediction about halfway through the observation period and is also much less steep. Similar, more detailed information could be obtained from hmax by deriving it for the fixed effects and the variance components separately. However, since no analytic expression for hmax is available, using hmax as a diagnostic tool does not yield much insight into the reasons why some individuals are more influential than others.

Finally, since three subjects were probably misclassified as local cancer cases (participants #22, #28, and #39), we reallocated them to the metastatic cancer group and performed a new influence analysis. Subjects #23 and #39 were not influential anymore, but the local influence of participants #22 and #28 was as high as before, and individual #15 has even become more influential. Hence, the influence of subjects #22 and #28, as found in the first analysis, cannot be entirely ascribed to their incorrect classification as local cancer cases. One possible explanation may be that their Zi covariate matrix, and hence the assumed model for their covariance matrix Vi, is not changed by reclassifying these patients as metastatic cancer cases.

TABLE 11.1. Prostate Data. Decomposition (11.5) of C22 according to the 18 eigenvectors vj of $-\Delta'\ddot{L}^{-1}\Delta$ with nonzero eigenvalues λj. The table shows the first five terms in the decomposition, corresponding to the five largest eigenvalues. The calculations illustrate why subject #22 is influential without having a large contribution in the direction hmax = v1 of maximal curvature.

    $2\lambda_1 v_{1,22}^2 = 0.5175$
    $2\lambda_2 v_{2,22}^2 = 6.0413$
    $2\lambda_3 v_{3,22}^2 = 0.2311$     (sum of first five terms: 6.8283)
    $2\lambda_4 v_{4,22}^2 = 0.0382$
    $2\lambda_5 v_{5,22}^2 = 0.0001$
    ...
    $\sum_{j=1}^{18} 2\lambda_j v_{j,22}^2 = 7.0214 = C_{22}$

11.5 Local Influence Under REML Estimation

Our results from Sections 11.2 and 11.3 assume that the parameters in the marginal linear mixed model are estimated via maximum likelihood. An influence analysis for the REML estimates would also be useful. However, it follows from expression (5.8) that the REML log-likelihood function can no longer be seen as a sum of independent individual contributions and, therefore, it is not obvious how a perturbation scheme similar to (11.2) should be defined and interpreted. One approach would be to replace ℓ(θ|ω) in (11.2) by

$$\ell_{\mathrm{REML}}(\theta \,|\, \omega) \;=\; -\frac{1}{2} \ln\left|\,\sum_{i=1}^{N} w_i\, X_i' V_i^{-1} X_i\right| \;+\; \ell(\theta \,|\, \omega).$$

The theory of local influence can then also be applied to this new perturbation scheme, but it no longer results in simple expressions for the curvatures Ci; hence, it becomes much more complicated to ascribe influence to the specific characteristics of the influential subjects. Therefore, we have only considered the maximum likelihood situation here.
12
The Heterogeneity Model

12.1 Introduction

In Section 7.8.4, we discussed how Verbeke and Lesaffre (1996a, 1997b) extended the linear mixed model to cases where the random effects are not necessarily normally distributed. Their so-called heterogeneity model assumes the random effects to be sampled from a mixture of normal distributions rather than from just one single normal distribution. This not only extends the assumption about the random-effects distribution to a very broad class of distributions (unimodal as well as multimodal, symmetric as well as highly skewed; see Figure 7.5), it is also perfectly suitable for classification purposes based on longitudinal profiles.

An example where classification of subjects based on longitudinal profiles is clearly of interest is the prostate data set introduced in Section 2.3.1. Indeed, our analyses of this data set have revealed some significant differences between the control patients, patients with benign prostatic hyperplasia, local cancer cases, and metastatic cancer cases (see, for example, Section 6.2.3), suggesting that repeated measures of PSA might be useful for detecting prostate cancer in an early stage of the disease. Note that such a classification procedure cannot be based on a model for PSA which includes age at diagnosis and time before diagnosis as covariates [as in our final model (6.8) in Section 6.2.3], since these are only available in retrospective studies, such as the Baltimore Longitudinal Study of Aging, where classification of the individuals is superfluous. The indicator variables Ci, Bi, Li, and Mi are also not available for the same reason. The only possible adjustment of model (6.8) which could be used for our classification purposes is therefore given by

$$\ln(1 + \mathrm{PSA}_{ij}) \;=\; \beta_1\,\mathrm{Age}_i + (\beta_2 + b_{1i}) + (\beta_3 + b_{2i})\,t_{ij} + (\beta_4 + b_{3i})\,t_{ij}^2 + \varepsilon_{ij}, \qquad (12.1)$$

where Agei is now the age of the ith subject at entry in the study (or at the time the first measurement was taken) and where the time points tij are now expressed as time since entry. The procedure would then be to first fit model (12.1), from which estimates for the random effects can be calculated, as explained in Section 7.2. These estimates could then be used to classify patients into one of the diagnostic groups.

However, although this approach looks very appealing, it raises many problems with respect to the normality assumption for the random effects, which is automatically made by the linear mixed-effects model. For example, it follows from the results in Section 6.2.3 that the mean quadratic time effect is zero for the noncancer cases and positive for both cancer groups. Hence, the quadratic effects β4 + b3i in model (12.1) should follow a normal distribution with mean zero for the noncancer cases and with positive mean for the cancer cases. This means that the b3i are no longer normally distributed, but follow a mixture of two normal distributions, i.e.,

$$b_{3i} \;\sim\; p\,N(\mu_1, \sigma_1^2) \;+\; (1 - p)\,N(\mu_2, \sigma_2^2),$$

in which µ1, µ2 and σ1², σ2² denote the means and variances of the b3i in the noncancer and cancer groups, respectively, and where p is the proportion of patients in the data set who belong to the noncancer group. Similar arguments hold for the random intercepts b1i and for the random time slopes b2i, which may even be sampled from mixtures of more than two normal distributions.

It was shown in Section 7.8.2 that, for the detection of subgroups in the random-effects population or for the classification of subjects in such subgroups, one should definitely not use empirical Bayes estimates obtained under the normality assumption for the random effects. In this chapter, it will be shown that the heterogeneity model is a natural model for classifying longitudinal profiles. In Sections 12.2 and 12.3, the heterogeneity model will be defined in full detail, and it will be described how the so-called Expectation-Maximization (EM) algorithm can be applied to obtain maximum likelihood estimates for all the parameters in the corresponding marginal model. In Section 12.4, it will be briefly discussed how longitudinal profiles can be classified based on the heterogeneity model. As already mentioned in Section 7.8.4, testing for the number of components in a heterogeneity model is far from straightforward due to boundary problems which make classical likelihood results break down. We will therefore discuss in Section 12.5 some simple informal checks for the goodness-of-fit of heterogeneity models. Finally, two examples will be given in Sections 12.6 and 12.7. Another example can be found in Brant and Verbeke (1997a, 1997b).

12.2 The Heterogeneity Model

As already explained in Section 7.8.4, the heterogeneity model of Verbeke (1995) and Verbeke and Lesaffre (1996a, 1997b) is obtained by replacing the normality assumption for the random effects in the linear mixed model (3.8) by a mixture of g q-dimensional normal distributions with mean vectors µj and covariance matrices Dj, i.e.,

$$b_i \;\sim\; \sum_{j=1}^{g} p_j\, N(\mu_j, D_j), \qquad (12.2)$$

with $\sum_{j=1}^{g} p_j = 1$. We now define $z_{ij} = 1$ if bi is sampled from the jth component in the mixture, and 0 otherwise, j = 1, ..., g. We then have that $P(z_{ij} = 1) = E(z_{ij}) = p_j$ and that

$$E(b_i) \;=\; E\left(E(b_i \,|\, z_{i1}, \ldots, z_{ig})\right) \;=\; E\left(\sum_{j=1}^{g} \mu_j z_{ij}\right) \;=\; \sum_{j=1}^{g} p_j \mu_j.$$

Therefore, the additional constraint $\sum_{j=1}^{g} p_j \mu_j = 0$ is needed to assure that $E(y_i) = X_i\beta$. Further, we have that the overall covariance matrix of the bi is given by

$$D^* \;=\; \mbox{var}\left(E(b_i \,|\, z_{i1}, \ldots, z_{ig})\right) + E\left(\mbox{var}(b_i \,|\, z_{i1}, \ldots, z_{ig})\right) \;=\; \mbox{var}\left(\sum_{j=1}^{g} \mu_j z_{ij}\right) + E\left(\sum_{j=1}^{g} D_j z_{ij}\right) \;=\; \sum_{j=1}^{g} p_j \mu_j \mu_j' \;+\; \sum_{j=1}^{g} p_j D_j. \qquad (12.3)$$

The density function corresponding to (12.2) is given by

$$\sum_{j=1}^{g} p_j\, (2\pi)^{-q/2}\, |D_j|^{-1/2} \exp\left\{-\frac{1}{2}\,(b_i - \mu_j)'\, D_j^{-1}\, (b_i - \mu_j)\right\}. \qquad (12.4)$$

Note that, for µ1 = bi, (12.4) becomes infinitely large if |D1| → 0. In order to avoid numerical problems in the estimation procedure, which will be described later, we will assume from now on that all covariance matrices Dj are the same (i.e., that Dj = D for all j). Our extended model is then fully determined by

$$\left\{ \begin{array}{l} Y_i \;=\; X_i \beta + Z_i b_i + \varepsilon_i, \\[2mm] b_i \;\sim\; \sum_{j=1}^{g} p_j\, N(\mu_j, D), \qquad \sum_{j=1}^{g} p_j = 1, \qquad \sum_{j=1}^{g} p_j \mu_j = 0, \\[2mm] \varepsilon_i \;\sim\; N(0, \Sigma_i), \\[2mm] b_1, \ldots, b_N,\ \varepsilon_1, \ldots, \varepsilon_N \mbox{ independent,} \end{array} \right. \qquad (12.5)$$

and it assumes that the random-effects population consists of g subpopulations with mean vectors µj and with common covariance matrix D. The model is therefore called the heterogeneity model. Also, specifying the model as

$$\left\{ \begin{array}{l} Y_i \,|\, b_i \;\sim\; N(X_i \beta + Z_i b_i, \Sigma_i), \\[2mm] b_i \,|\, \mu \;\sim\; N(\mu, D), \\[2mm] \mu \in \{\mu_1, \ldots, \mu_g\}, \mbox{ with } P(\mu = \mu_j) = p_j, \end{array} \right.$$

it can be interpreted as a hierarchical Bayes model, but now with an underlying random vector µ which is no longer identically zero and which therefore represents the heterogeneity for the mean of the random-effects distribution. The classical linear mixed model, which assumes bi ∼ N(0, D), does not allow this type of heterogeneity and is therefore called the homogeneity model.
geneity model.

12.3 Estimation of the Heterogeneity Model

The marginal distribution of the measurements Yi under model (12.5) can easily be seen to be given by

$$Y_i \;\sim\; \sum_{j=1}^{g} p_j\, N(X_i\beta + Z_i\mu_j,\; V_i), \qquad (12.6)$$

with $V_i = Z_i D Z_i' + \Sigma_i = W_i^{-1}$. Estimation of the parameters β, the µj, the pj, D, and the parameters in Σi can be done via maximum likelihood estimation. For this, the so-called Expectation-Maximization (EM) algorithm has been advocated; see Laird (1978). The EM algorithm is particularly useful for mixture problems since it often happens that a model is fitted with too many components (g too large), leading to a likelihood which is maximal anywhere on a ridge. As shown by Dempster, Laird and Rubin (1977), the EM algorithm is capable of converging to some particular point on that ridge. Titterington, Smith and Makov (1985, pp. 88-89) compare the EM algorithm with the Newton-Raphson (NR) algorithm. Their conclusions can be summarized as follows:

• EM is usually simple to apply and satisfies the appealing monotonic property in that it increases the objective function at each iteration step. NR is more complicated, and there is no guarantee of monotonicity.

• If NR converges, it is of second order (i.e., fast), whereas EM is often painfully slow. However, if the separation between the components in the mixture is poor, even the numerical performance of NR can be disappointing. Simulations have shown that, in such cases, NR can fail to converge in up to half the simulations, even when the algorithm was started from the true parameter values.

• Convergence is not guaranteed with either technique since EM, even with the monotonicity property, can converge to a local maximum of the likelihood surface.

Böhning and Lindsay (1988) have considered maximization of log-likelihoods for which the quadratic approximation based on the Taylor series is "flatter" than the objective function, thereby sending the solution too far at the next step. They conclude that, in a mixture framework, flat log-likelihoods can lead to problems in convergence and to instabilities for the Newton-Raphson algorithm.

Note also that, since the random effects are assumed to follow a mixture of distributions of the same parametric family, the vector of all parameters in model (12.5) is not identifiable. Indeed, the log-likelihood is invariant under the g! possible permutations of the mean vectors and corresponding probabilities of the components in the mixture. Therefore, the likelihood will have at least g! local maxima with the same likelihood value. However, this lack of identifiability is of no concern in practice, as it can easily be overcome by imposing some constraint on the parameters. For example, Aitkin and Rubin (1985) use the constraint that

$$p_1 \;\geq\; p_2 \;\geq\; \cdots \;\geq\; p_g. \qquad (12.7)$$

The likelihood is then maximized without the restriction, and the component labels are permuted afterward to achieve (12.7).

The EM algorithm is frequently used for the calculation of maximum likelihood estimates for missing data problems. We have therefore deferred a detailed presentation of this algorithm to the second part of this book, namely to Chapter 22. However, a brief introduction in the context of the heterogeneity model will be given here. We also refer to McLachlan and Basford (1988, Section 1.6) for an application of the EM algorithm in a simpler mixture context, where it is assumed that the available data are all drawn from the same mixture distribution (no different dimensions, no covariates).

Let π be the vector of component probabilities [i.e., π' = (p1, ..., pg)], and let γ be the vector containing the remaining parameters: β, D, the parameters in all Σi, and all µj. Further, θ' = (π', γ') denotes the vector of all parameters in the marginal heterogeneity model (12.6), and fij(yi|γ) is the density function of the normal distribution with mean Xiβ + Ziµj and covariance matrix Vi. The likelihood function corresponding to (12.6) is then

$$L(\theta \,|\, y) \;=\; \prod_{i=1}^{N} \left\{\,\sum_{j=1}^{g} p_j\, f_{ij}(y_i \,|\, \gamma)\right\}, \qquad (12.8)$$

where $y' = (y_1', \ldots, y_N')$ is the vector containing all observed response values.

Let zij be as defined in Section 12.2. The prior probability for an individual to belong to component j is then P(zij = 1) = pj, the mixture proportion for that component. The log-likelihood function for the observed measurements y and for the vector z of all unobserved zij is then

$$\ell(\theta \,|\, y, z) \;=\; \sum_{i=1}^{N} \sum_{j=1}^{g} z_{ij} \left\{\ln p_j + \ln f_{ij}(y_i \,|\, \gamma)\right\},$$

which is easier to maximize than the log-likelihood function corresponding to the likelihood (12.8) of the observed data vector y only. On the other hand, maximizing ℓ(θ|y, z) with respect to θ yields estimates which depend on the unobserved ("missing") indicators zij. A compromise is obtained with the EM algorithm, where the expected value of ℓ(θ|y, z), rather than ℓ(θ|y, z) itself, is maximized with respect to θ, where the expectation is taken over all the unobserved zij. In the E step (expectation step), the conditional expectation of ℓ(θ|y, z), given the observed data vector y, is calculated. In the M step (maximization step), the so-obtained expected log-likelihood function is maximized with respect to θ, providing an updated estimate for θ. Finally, one keeps iterating between the E step and the M step until convergence is attained.

More specifically, let θ(t) be the current estimate for θ, and let θ(t+1) denote the updated estimate, obtained from one further iteration of the EM algorithm. We then have the following E and M steps in the estimation process for the heterogeneity model.

The E Step. The conditional expectation

$$Q(\theta \,|\, \theta^{(t)}) \;=\; E\left[\,\ell(\theta \,|\, y, z) \;\big|\; y, \theta^{(t)}\right]$$

is given by

$$Q(\theta \,|\, \theta^{(t)}) \;=\; \sum_{i=1}^{N} \sum_{j=1}^{g} p_{ij}(\theta^{(t)}) \left[\ln p_j + \ln f_{ij}(y_i \,|\, \gamma)\right], \qquad (12.9)$$

where only the posterior probability for the ith individual to belong to the jth component of the mixture, given by

$$p_{ij}(\theta^{(t)}) \;=\; E(z_{ij} \,|\, y_i, \theta^{(t)}) \;=\; P(z_{ij} = 1 \,|\, y_i, \theta^{(t)}) \;=\; \left.\frac{p_j\, f_{ij}(y_i \,|\, \gamma)}{\sum_{k=1}^{g} p_k\, f_{ik}(y_i \,|\, \gamma)}\right|_{\widehat{\pi}^{(t)},\, \widehat{\gamma}^{(t)}},$$

has to be calculated for each i and j.


The M Step. To get the updated estimate θ(t+1), we have to maximize expression (12.9) with respect to θ. We first maximize

$$\sum_{i=1}^{N}\sum_{j=1}^{g} p_{ij}(\theta^{(t)}) \ln p_j \;=\; \sum_{i=1}^{N}\sum_{j=1}^{g-1} p_{ij}(\theta^{(t)}) \ln p_j \;+\; \sum_{i=1}^{N} p_{ig}(\theta^{(t)}) \ln\Bigl(1 - \sum_{j=1}^{g-1} p_j\Bigr)$$

with respect to p1, ..., pg−1. Setting all first-order derivatives equal to zero yields that the updated estimates satisfy

$$\frac{p_j^{(t+1)}}{p_g^{(t+1)}} \;=\; \frac{\sum_{i=1}^{N} p_{ij}(\theta^{(t)})}{\sum_{i=1}^{N} p_{ig}(\theta^{(t)})},$$

for all j = 1, ..., g − 1. This also implies that

$$1 \;=\; \sum_{j=1}^{g} p_j^{(t+1)} \;=\; \frac{N\, p_g^{(t+1)}}{\sum_{i=1}^{N} p_{ig}(\theta^{(t)})},$$

from which it follows that all estimates $p_j^{(t+1)}$ satisfy

$$p_j^{(t+1)} \;=\; \frac{1}{N}\sum_{i=1}^{N} p_{ij}(\theta^{(t)}).$$

Unfortunately, the second part of (12.9) cannot be maximized analytically, and a numerical maximization procedure such as Newton-Raphson is needed to maximize

$$\sum_{i=1}^{N}\sum_{j=1}^{g} p_{ij}(\theta^{(t)}) \ln f_{ij}(y_i \,|\, \gamma)$$

with respect to γ. All necessary derivatives can be obtained from the expressions in a paper by Lindstrom and Bates (1988).
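To make the two steps concrete, the following is a minimal Python sketch of one EM iteration under simplifying assumptions (all names are ours, not from any package): it computes the posterior probabilities of the E step and the closed-form update of the component probabilities; the update of the remaining parameters γ is left to a numerical maximizer, as discussed above.

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_step(y, X, Z, Sigma, p, mu, beta, D):
        """One EM iteration for the heterogeneity model (sketch).

        y, X, Z, Sigma : lists with the data and error covariances of the
        N subjects; p, mu : current component probabilities and means;
        beta, D : current fixed effects and within-component covariance.
        """
        N, g = len(y), len(p)
        post = np.zeros((N, g))
        # E step: posterior membership probabilities p_ij(theta^(t)).
        for i in range(N):
            V_i = Z[i] @ D @ Z[i].T + Sigma[i]     # marginal covariance V_i
            for j in range(g):
                post[i, j] = p[j] * multivariate_normal.pdf(
                    y[i], mean=X[i] @ beta + Z[i] @ mu[j], cov=V_i)
            post[i] /= post[i].sum()
        # M step, closed-form part: p_j^(t+1) = (1/N) sum_i p_ij(theta^(t)).
        p_new = post.mean(axis=0)
        # The gamma-parameters (beta, D, Sigma, mu) would be updated here by
        # numerically maximizing sum_ij p_ij ln f_ij(y_i | gamma).
        return post, p_new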

Once all parameters θ in the marginal heterogeneity model have been estimated, one might also be interested in estimating the random effects bi. To this end, empirical Bayes (EB) estimates can be calculated, in exactly the same way as EB estimates were obtained under the classical linear mixed model (see Section 7.2). The posterior density of bi is given by

$$f_i(b_i \,|\, y_i, \theta) \;=\; \sum_{j=1}^{g} p_{ij}(\theta)\, f_{ij}(b_i \,|\, y_i, \gamma),$$

where fij(bi|yi, γ) is the posterior density function of bi, conditional on zij = 1, that is, conditional on the knowledge that bi was sampled from the jth component in the mixture. Hence, the posterior distribution of bi is a mixture of the posterior distributions of bi within each component of the mixture, with the posterior probabilities pij(θ) as subject-specific mixture proportions. The posterior mean is then

$$\widehat{b}_i \;=\; \sum_{j=1}^{g} p_{ij}(\theta)\, E(b_i \,|\, y_i, \gamma, z_{ij} = 1).$$

Conditionally on bi, Yi is normally distributed with mean Xiβ + Zibi and variance-covariance matrix Σi. The vector bi is, conditionally on component j, also normally distributed, with mean µj and variance-covariance matrix D. It then follows from Lindley and Smith (1972) that

$$E(b_i \,|\, y_i, \gamma, z_{ij} = 1) \;=\; D Z_i' W_i (y_i - X_i\beta) \;+\; \left(I - D Z_i' W_i Z_i\right)\mu_j,$$

from which it follows that

$$\widehat{b}_i \;=\; D Z_i' W_i (y_i - X_i\beta) \;+\; \left(I - D Z_i' W_i Z_i\right) \sum_{j=1}^{g} p_{ij}(\theta)\, \mu_j. \qquad (12.10)$$

Note how the first component of $\widehat{b}_i$ is of exactly the same form as the estimator (7.2) obtained in Section 7.2, assuming normally distributed random effects. However, the overall covariance matrix of the bi is now replaced by the within-component covariance matrix D. The second component in the expression for $\widehat{b}_i$ can be viewed as a correction term toward the component means µj, with most weight on the means of those components to which the subject has a high posterior probability of belonging. Finally, the unknown parameters in (12.10) are replaced by their maximum likelihood estimates obtained from the EM algorithm. These are the EB estimates shown in Figure 7.6 for the data simulated in Section 7.8.2 and obtained under a two-component heterogeneity model (i.e., g equals 2).
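Expression (12.10) translates directly into a few lines of linear algebra. A minimal sketch for one subject, assuming the EM output is available (the names are ours):

    import numpy as np

    def eb_estimate(y_i, X_i, Z_i, post_i, mu, beta, D, Sigma_i):
        """Empirical Bayes estimate (12.10) for one subject.

        post_i : the subject's posterior component probabilities p_ij(theta);
        mu     : list of the g component means. All parameters are the MLEs
                 obtained from the EM algorithm.
        """
        W_i = np.linalg.inv(Z_i @ D @ Z_i.T + Sigma_i)   # W_i = V_i^{-1}
        DZW = D @ Z_i.T @ W_i
        mu_bar = sum(pij * mu_j for pij, mu_j in zip(post_i, mu))
        # First term: usual EB shrinkage; second: pull toward component means.
        return DZW @ (y_i - X_i @ beta) \
               + (np.eye(D.shape[0]) - DZW @ Z_i) @ mu_bar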

12.4 Classification of Longitudinal Profiles

Interest could also lie in the classification of the subjects into the different mixture components. It is natural in mixture models for such a classification to be based on the estimated posterior probabilities $p_{ij}(\widehat{\theta})$ (McLachlan and Basford 1988, Section 1.4). One then classifies the ith subject into the component for which it has the highest estimated posterior probability of belonging, that is, into the j(i)th component, where j(i) is the index for which $p_{i,j(i)}(\widehat{\theta}) = \max_{1\leq j\leq g} p_{ij}(\widehat{\theta})$. Note how this technique can be used for cluster analysis within the framework of linear mixed-effects models: If the individual profiles are to be classified into g subgroups, fit a mixture model with g components and use the above rule for classification into one of the g clusters.

For g = 2, the above classification rule implies classification of subject i into the first component if and only if

$$\widehat{p}_1\, f_{i1}(y_i \,|\, \widehat{\gamma}) \;\geq\; (1 - \widehat{p}_1)\, f_{i2}(y_i \,|\, \widehat{\gamma}).$$

Using some matrix algebra, this can be rewritten as

$$\left(y_i - X_i\widehat{\beta} - Z_i(\widehat{\mu}_1 + \widehat{\mu}_2)/2\right)' \widehat{V}_i^{-1}\, Z_i\, (\widehat{\mu}_1 - \widehat{\mu}_2) \;\geq\; \ln\left[(1 - \widehat{p}_1)/\widehat{p}_1\right],$$

which is the linear discriminant function recently proposed by Tomasko, Helms and Snapinn (1999), also in the context of linear mixed models.
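In code, the general rule is a one-liner on the matrix of estimated posterior probabilities, and for g = 2 the equivalent discriminant can be evaluated directly. A hedged sketch (names ours, not library code):

    import numpy as np

    def classify(post):
        """Assign each subject to the mixture component with the highest
        estimated posterior probability p_ij(theta-hat); post is (N, g)."""
        return np.argmax(post, axis=1)          # component labels j(i)

    def discriminant_g2(y_i, X_i, Z_i, V_i, beta, mu1, mu2, p1):
        """Two-component rule: classify into component 1 iff the linear
        discriminant exceeds ln[(1 - p1)/p1]."""
        lhs = (y_i - X_i @ beta - Z_i @ (mu1 + mu2) / 2) \
              @ np.linalg.solve(V_i, Z_i @ (mu1 - mu2))
        return lhs >= np.log((1 - p1) / p1)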

12.5 Goodness-of-Fit Checks

So far, we have not yet discussed how the number g of components in the heterogeneity model should be chosen. One approach is to fit models with increasing numbers of components and to compare them using likelihood ratio tests. However, as explained in Section 7.8.4, this is far from straightforward, due to boundary problems. Also, acceptance of the null hypothesis does not automatically yield a good-fitting model since it was tested against only one specific alternative hypothesis. An alternative approach is to increase g to the level where some of the subpopulations get very small weight (some pj very small) or where some of the subpopulations coincide (some µj approximately the same). Finally, Verbeke and Lesaffre (1996a) proposed some omnibus goodness-of-fit checks for the marginal heterogeneity model (12.6), which can be employed to determine the number of components in the heterogeneity model, but which are also useful for evaluating the final model. These will now be discussed.
the final model. These will now be discussed.

Suppose we want to test whether model (12.6) fits our data well, for some specific value of g (possibly 1). The most well-known goodness-of-fit tests are derived for univariate random variables, but, here, all observed vectors yi possibly have different distributions or different dimensions. Thus, strictly speaking, these tests are not applicable here. However, this problem can be circumvented by considering the stochastic variables Fi(yi), for Fi equal to the cumulative distribution function of Yi under the assumed model. If the assumed model is correct, all Fi(yi) can be considered sampled from a uniform distribution with support [0, 1]. On the other hand, the computation of Fi(yi) involves the evaluation of multivariate normal distribution functions of dimension ni, which may be practically unfeasible for data sets with large numbers of repeated measurements for some of the individuals. We therefore propose to first summarize each vector yi by some linear combination ai′yi, and then to calculate Fi(ai′yi), where Fi is now the distribution function of ai′Yi under the assumed model. We then have that, under model (12.6), the stochastic variables

$$U_i \;=\; F_i(a_i' Y_i) \;=\; \sum_{j=1}^{g} p_j\, \Phi\left(\frac{a_i'\,(Y_i - X_i\beta - Z_i\mu_j)}{\sqrt{a_i'\, V_i\, a_i}}\right), \qquad (12.11)$$

with Φ denoting the cumulative distribution function of a univariate standard normal random variable, are uniformly distributed.

Two procedures can now be followed. First, we can apply the Kolmogorov-Smirnov test (see, for example, Birnbaum 1952, Bickel and Doksum 1977, pp. 378-381) to test whether the observed Ui, calculated by replacing Yi by yi and all parameters in (12.11) by their maximum likelihood estimates obtained from the EM algorithm, indeed follow the uniform distribution, as is to be expected under the assumed model. For moderate sample sizes, percentage points are tabulated in Birnbaum (1952) and Neave (1986), whereas Bickel and Doksum (1977, p. 483) give approximations for large sample sizes (N > 80). Another possible approach is to test whether or not the values Φ⁻¹(Ui) can be assumed to be sampled from a univariate normal distribution. We therefore use the Shapiro-Wilk test, first introduced by Shapiro and Wilk (1965) and since then extensively investigated and compared to other normality tests; see, for example, D'Agostino (1971), Dyer (1974), and Pearson, D'Agostino and Bowman (1977). Percentage points have been tabulated by Shapiro and Wilk (1965), and approximations to the distribution of the test statistic can be found in Shapiro and Wilk (1968) and Leslie, Stephens and Fotopoulos (1986).

Of course, the above goodness-of-fit tests can be performed for any ai, but a good choice of the linear combination may increase the power of the test. Here, we are specifically interested in exploring whether the number g of components in our heterogeneity model has been taken sufficiently large. Note how this testing for heterogeneity can be viewed as an attempt to break down the total random-effects variability into the within-subgroup variability, represented by D, and the between-subgroup variability, represented by the component means µ1, ..., µg. However, it is intuitively clear that this will only be successful when the residual variability in the model, represented by the error terms εi, is small to moderate in comparison to the random-effects variability in which we are interested. We therefore recommend choosing ai such that the variability in ai′yi due to the random effects is large compared to the variability due to the error terms. Specifically, we choose ai such that

$$\frac{\mbox{var}(a_i'\, Z_i b_i)}{\mbox{var}(a_i'\, \varepsilon_i)} \;=\; \frac{a_i'\, Z_i D^* Z_i'\, a_i}{a_i'\, \Sigma_i\, a_i},$$

with D* as defined in (12.3), is maximal. It follows from Seber (1984, p. 526) that this is satisfied for ai equal to the eigenvector corresponding to the largest eigenvalue of $\Sigma_i^{-1} Z_i D^* Z_i'$. This choice can be further justified as follows. Let λmax be the largest eigenvalue of $\Sigma_i^{-1} Z_i D^* Z_i'$, and let ai be the corresponding eigenvector. We then have that $a_i' y_i = a_i' \Sigma_i^{-1} Z_i D^* Z_i' y_i / \lambda_{\max}$, which is therefore a function of $Z_i' y_i$, which can easily be seen to be sufficient for bi in the conditional distribution of Yi given bi (distribution given in Section 3.3.1). So, ai′yi only uses that part of the information in the sample which is sufficient for the random effects. In the case that the only random effects in the model are intercepts, ai′yi is even equivalent to the sufficient statistic Zi′yi. In practice, all parameters in the above expressions need to be replaced by their maximum likelihood estimates obtained from the EM algorithm.
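Under this choice of ai, the checks are straightforward to compute once the EM estimates are available. A minimal Python sketch, assuming the fitted quantities are passed in as arguments (names ours):

    import numpy as np
    from scipy.linalg import eigh
    from scipy.stats import norm, kstest, shapiro

    def gof_checks(y, X, Z, Sigma, p, mu, beta, D, D_star):
        """Kolmogorov-Smirnov and Shapiro-Wilk checks based on (12.11)."""
        U = []
        for i in range(len(y)):
            # a_i: eigenvector for the largest eigenvalue of
            # Sigma_i^{-1} Z_i D* Z_i', via the equivalent symmetric
            # generalized problem  Z_i D* Z_i' a = lambda Sigma_i a.
            _, vecs = eigh(Z[i] @ D_star @ Z[i].T, Sigma[i])
            a = vecs[:, -1]
            V_i = Z[i] @ D @ Z[i].T + Sigma[i]
            sd = np.sqrt(a @ V_i @ a)
            # U_i of (12.11): mixture of normal cdf's for a_i' Y_i.
            U.append(sum(p_j * norm.cdf(a @ (y[i] - X[i] @ beta - Z[i] @ m_j) / sd)
                         for p_j, m_j in zip(p, mu)))
        U = np.asarray(U)
        # U_i should look uniform on [0, 1]; Phi^{-1}(U_i) should look normal.
        return kstest(U, "uniform"), shapiro(norm.ppf(U))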

TABLE 12.1. Prostate Data. Parameter estimates for all parameters in model (12.1), under a one-, two-, and three-component heterogeneity model, and based on the observed data from cancer patients and control patients only. Shown are the component means ν + µj with their probabilities pj, the common covariance matrix D, the age effect β1, and the residual variance σ².

    g = 1:  ν + µ1 = (0.1694, 0.0085, 0.0034)', p1 = 1.00
            D = [0.0760, 0.0051, 0.0002; 0.0051, 0.0023, −0.0001; 0.0002, −0.0001, 0.00003]
            β1 = 0.008,  σ² = 0.027

    g = 2:  ν + µ1 = (−0.2373, 0.0260, 0.0012)', p1 = 0.79
            ν + µ2 = (0.1790, −0.0439, 0.0105)', p2 = 0.21
            D = [0.0497, 0.0089, −0.0005; 0.0089, 0.0017, −0.0001; −0.0005, −0.0001, 0.00002]
            β1 = 0.013,  σ² = 0.026

    g = 3:  ν + µ1 = (−0.0202, 0.0124, 0.0012)', p1 = 0.72
            ν + µ2 = (0.5110, −0.0088, 0.0045)', p2 = 0.19
            ν + µ3 = (0.2167, −0.0288, 0.0207)', p3 = 0.09
            D = [0.0306, 0.0082, −0.0003; 0.0082, 0.0023, −0.0001; −0.0003, −0.0001, 0.00001]
            β1 = 0.009,  σ² = 0.027

12.6 Example: The Prostate Data

As an example, we will now use several heterogeneity models to analyze the prostate data introduced in Section 2.3.1, hereby ignoring any prior information about the disease status of the patients. In order not to complicate the model too much at once, we excluded the benign prostatic hyperplasia (BPH) patients from our analyses, yielding a total of 34 remaining patients. The purpose of our analyses is to investigate (1) whether our mixture approach detects the presence of heterogeneity in the random-effects population, which we know to be present, and (2) whether our classification procedure correctly classifies patients as being controls or cancer cases. As explained in Section 12.1, such an analysis should be based on model (12.1). Several models have been fitted, with one to three components in the random-effects mixture distribution, and all models assume conditional independence (i.e., $\Sigma_i = \sigma^2 I_{n_i}$). The parameter estimates for all fitted models are summarized in Table 12.1, and the results of the goodness-of-fit tests are shown in Table 12.2.

TABLE 12.2. Prostate Data. Goodness-of-fit checks for model (12.1), under a one-, two-, and three-component heterogeneity model, and based on the observed data from cancer patients and control patients only. The reported Kolmogorov-Smirnov statistics are to be compared with the 5% critical value Dc = 0.2274. The Shapiro-Wilk statistics are accompanied by their p-values.

                    Kolmogorov-Smirnov    Shapiro-Wilk
    1 component     0.2358 (> Dc)         0.824 (p = 0.0001)
    2 components    0.2113 (< Dc)         0.848 (p = 0.0001)
    3 components    0.1061 (< Dc)         0.940 (p = 0.0797)

The fixed-effects vector now consists of two parts: β1 measures the effect of age on PSA, whereas ν = (β2, β3, β4)' reflects the overall average trend over time, after correction for age differences at entry into the study. So, ν contains the overall mean intercept β2, mean slope β3 for time, and mean slope β4 for time², which are estimated by the homogeneity model as 0.1694, 0.0085, and 0.0034, respectively. The component means reported in Table 12.1 are the average trends within each component of the mixture (i.e., ν + µj, j = 1, ..., g).

As can be expected, both goodness-of-fit tests reject the homogeneity model, that is, the one-component model. We therefore extended the model by fitting a two-component heterogeneity model, which yields a first group (79% of the patients) with patients evolving mainly linearly, and a second group (21% of the patients) with patients who clearly evolve quadratically. Although this two-component model is accepted by the Kolmogorov-Smirnov test, it is not by the Shapiro-Wilk test. We therefore also fitted a three-component model, which was accepted by the Kolmogorov-Smirnov test as well as by the Shapiro-Wilk test. The estimated mean profiles, not taking into account the effect of age, are shown in Figure 12.1. Apparently, the first component represents the individuals who evolve mainly linearly: There is a constant increase of PSA over time. This is in contrast with the other two groups in the mixture, which contain the subjects who evolve quadratically over time: for the second component, after a period of small linear increase; for the last component, immediately after enrollment in the study.

FIGURE 12.1. Prostate Data. Estimated component means and probabilities, based on a three-component heterogeneity model.

This model will only be useful for the detection of prostate cancer at an early stage if one or two of our components can be shown to represent the cancer cases, and the other component(s) then would represent the controls. We therefore compare classification by our mixture approach with the correct classification as control or cancer. The result is shown in Table 12.3. Except for one patient, all controls were classified in the first component, together with 10 cancer cases for which the profiles show hardly any difference from many profiles in the control group (only a moderate, linear increase over time). Three cancer cases were classified in the third component. These are those cases which entered the study almost simultaneously with the start of the growth of the tumor. The five cancer cases classified in the second component are those who were already in the study long before the tumor started to develop and therefore have profiles which hardly change in the beginning, but which start increasing quadratically after some period of time in the study.

Apparently, the detection of the correct diagnostic group is hampered by the different onsets of observation periods. Further, the quadratic mixed-effects model is only a rough approximation to the correct model. For example, Carter et al. (1992a) and Pearson et al. (1994) have fitted piecewise nonlinear mixed-effects models to estimate the time when rapid increases in PSA were first observable. One could also think of extending the heterogeneity model to the case where the component probabilities pj are modeled as functions over time. This would take into account the fact that the proportion of cancer cases increases with time. In any case, this example has shown that the mixture approach does not necessarily model what one might hope. There is no a priori reason why the mixture classification should exactly correspond to some predefined group structure, which may not fully reflect the heterogeneity in growth curves.

TABLE 12.3. Prostate Data. Cross-classification of 34 patients according to the three-component mixture model and according to their true disease status.

                              Mixture classification
                               1      2      3
    Disease status  control   15      1      0
                    cancer    10      5      3

12.7 Example: The Heights of Schoolgirls

As a second example of the use of heterogeneity models, we consider the growth curves of 20 preadolescent schoolgirls, introduced in Section 2.5 and previously analyzed by Goldstein (1979), not using linear mixed models. Goldstein found a significant (at the 5% level) group effect as well as a significant interaction of age with group. Note that the individual profiles shown in Figure 2.5 suggest that the variability in the observed heights is mainly due to between-subject variability. That is why we will now reanalyze the data using linear mixed models, which allow us to use subject-specific regression coefficients.

A linear mixed model obtained from a two-stage approach (see Section 3.2) assumes the average evolution within each group to be linear as a function of age and allows for subject-specific intercepts as well as slopes. More specifically, our model is given by

$$\mathrm{Height}_{ij} \;=\; \left\{\begin{array}{ll} \beta_1 + b_{1i} + (\beta_4 + b_{2i})\,\mathrm{Age}_{ij} + \varepsilon_{ij}, & \mbox{if small mother,} \\ \beta_2 + b_{1i} + (\beta_5 + b_{2i})\,\mathrm{Age}_{ij} + \varepsilon_{ij}, & \mbox{if medium mother,} \\ \beta_3 + b_{1i} + (\beta_6 + b_{2i})\,\mathrm{Age}_{ij} + \varepsilon_{ij}, & \mbox{if tall mother,} \end{array}\right.$$

where Heightij and Ageij are the height and the age of the ith girl at the jth measurement, respectively. The model can easily be rewritten as

$$\mathrm{Height}_{ij} \;=\; b_{1i} + \beta_1\,\mathrm{Small}_i + \beta_2\,\mathrm{Medium}_i + \beta_3\,\mathrm{Tall}_i + \left\{b_{2i} + \beta_4\,\mathrm{Small}_i + \beta_5\,\mathrm{Medium}_i + \beta_6\,\mathrm{Tall}_i\right\}\mathrm{Age}_{ij} + \varepsilon_{ij}, \qquad (12.12)$$

where Smalli, Mediumi, and Talli are dummy variables defined to be 1 if the mother of the ith girl is small, medium, or tall, respectively, and defined to be 0 otherwise. So, β1, β2, and β3 represent the average intercepts, and β4, β5, and β6 the average slopes, in the three groups. The terms b1i and b2i are the random intercepts and random slopes, respectively. REML estimates for all parameters in this model are given in Table 12.4, obtained assuming conditional independence, that is, assuming that all error components εij are independent with common variance σ².

TABLE 12.4. Heights of Schoolgirls. REML estimates and associated estimated standard errors for all parameters in model (12.12), under the assumption of conditional independence.

    Effect                  Parameter    REMLE (s.e.)
    Intercepts:
      Small mothers         β1           81.300 (1.338)
      Medium mothers        β2           82.974 (1.239)
      Tall mothers          β3           83.123 (1.239)
    Age effects:
      Small mothers         β4            5.270 (0.174)
      Medium mothers        β5            5.567 (0.161)
      Tall mothers          β6            6.249 (0.161)
    Covariance of bi:
      var(b1i)              d11           7.603 (3.729)
      var(b2i)              d22           0.133 (0.063)
      cov(b1i, b2i)         d12          −0.444 (0.399)
    Residual variance:
      var(εij)              σ²            0.476 (0.087)
    REML log-likelihood                 −157.874

All of the above analyses are based on the somewhat arbitrary discretization of the heights of the mothers into three intervals (small, medium, and tall mothers). It is therefore interesting to see how heterogeneity models would classify the children into two, three, or even more groups, ignoring the group structure used so far. The corresponding model is then

$$\mathrm{Height}_{ij} \;=\; \beta_1 + b_{1i} + (\beta_2 + b_{2i})\,\mathrm{Age}_{ij} + \varepsilon_{ij}, \qquad (12.13)$$

where β1 and β2 denote the overall average intercept and linear age effect, respectively. As before, we will assume all error components εij to be independent and normally distributed with mean zero and common variance σ².

TABLE 12.5. Heights of Schoolgirls. Parameter estimates for all parameters in model (12.13), under a one-, two-, and three-component heterogeneity model. Shown are the component means β + µj with their probabilities pj, the common covariance matrix D, and the residual variance σ².

    g = 1:  β + µ1 = (82.48, 5.72)', p1 = 1.00
            D = [6.71, −0.07; −0.07, 0.27],  σ² = 0.47

    g = 2:  β + µ1 = (82.78, 5.39)', p1 = 0.68;  β + µ2 = (82.06, 6.42)', p2 = 0.32
            D = [6.73, 0.10; 0.10, 0.03],  σ² = 0.47

    g = 3:  β + µ1 = (79.46, 5.60)', p1 = 0.20;  β + µ2 = (84.21, 5.32)', p2 = 0.50;
            β + µ3 = (81.65, 6.47)', p3 = 0.30
            D = [3.64, 0.32; 0.32, 0.03],  σ² = 0.47

Three mixture models were fitted: the homogeneous model (one component) and two heterogeneous models (two and three components). For the heterogeneity models, the girls were classified into one of the mixture components. The parameter estimates and classification rules are summarized in Table 12.5 and Figure 12.2, respectively. The reported component means are the average growth trends within each component of the mixture (i.e., β + µj, j = 1, ..., g).

First, it follows from the fit of the homogeneity model that the average intercept and slope are estimated to be 82.48 and 5.72, respectively. Using the goodness-of-fit procedures discussed in Section 12.5, we found that this homogeneity model fits the data sufficiently well; that is, no statistical evidence is found for the presence of heterogeneity in the random-effects population (p = 0.2970 for the Shapiro-Wilk test; observed value of the Kolmogorov-Smirnov statistic equal to 0.1505, below the 5% critical value Dc = 0.2941). Note, however, that this is based on a small data set. Also, acceptance of the homogeneity model should not necessarily prevent us from performing cluster analysis. The two-component mixture model subdivides the children into two groups, with similar intercepts but highly different slopes, and can therefore be regarded as discriminating the "slow" growers (68%) from the "fast" growers (32%). Note also that this implies a large reduction in slope variability and no reduction in intercept variability, as indicated by the estimated variances in the matrix D. Finally, the three-component mixture model further subdivides the "slow" growers into "small" and "tall" girls, with an average difference in height of 84.21 − 79.46 = 4.75 cm.

FIGURE 12.2. Heights of Schoolgirls. Graphical representation of the cluster analysis on the growth curves of 20 preadolescent schoolgirls, based on model (12.13), under a one-, two-, and three-component heterogeneity model. The child numbers are given underneath each cluster. [The 20 girls first split into "slow" growers (girls 1-8, 10-14, and 18) and "fast" growers (girls 9, 15-17, 19, and 20); the "slow" growers are further subdivided into "small" girls (2, 6, and 13) and "tall" girls (1, 3-5, 7, 8, 10-12, 14, and 18).]

Although there is no a priori reason why our mixture classification should reconstruct Goldstein's groups, shown in Table 2.4, it may still be interesting to compare his artificial group structure with ours, which arises naturally from the profile structures themselves. Although separation of the children with tall mothers from the rest is achieved fairly well, the children with small mothers could not be well separated from those with medium mothers. However, this is in agreement with results obtained from analyses based on the linear mixed model (12.12), the parameter estimates for which are given in Table 12.4. Indeed, applying the F-tests described in Section 6.2.2 (with the Satterthwaite approximation for the denominator degrees of freedom), we found that the average slopes β4, β5, and β6 are significantly different (p = 0.0019), but this can be fully ascribed to differences between the groups of children with small or medium mothers on the one hand and the group of children with tall mothers on the other hand: β4 and β6 are significantly different (p = 0.0007), and β5 and β6 are also significantly different (p = 0.0081), but β4 is not significantly different from β5 (p = 0.2259). Since our mixture approach only partially reconstructs the prior group structure of Goldstein (1979), we conclude that the latter does not fully reflect the heterogeneity in the growth curves.
Finally, we note that, under the three-component model, the overall average trend is given by

$$0.20 \begin{pmatrix} 79.46 \\ 5.60 \end{pmatrix} + 0.50 \begin{pmatrix} 84.21 \\ 5.32 \end{pmatrix} + 0.30 \begin{pmatrix} 81.65 \\ 6.47 \end{pmatrix} \;=\; \begin{pmatrix} 82.49 \\ 5.72 \end{pmatrix},$$

which is very similar to the overall average trend estimated under the homogeneity model. Further, we also have that the overall random-effects covariance matrix is given by

$$\begin{pmatrix} 3.64 & 0.32 \\ 0.32 & 0.03 \end{pmatrix} + 0.20 \begin{pmatrix} -3.03 \\ -0.12 \end{pmatrix}\begin{pmatrix} -3.03 & -0.12 \end{pmatrix} + 0.50 \begin{pmatrix} 1.72 \\ -0.40 \end{pmatrix}\begin{pmatrix} 1.72 & -0.40 \end{pmatrix} + 0.30 \begin{pmatrix} -0.84 \\ 0.75 \end{pmatrix}\begin{pmatrix} -0.84 & 0.75 \end{pmatrix} \;=\; \begin{pmatrix} 7.17 & -0.14 \\ -0.14 & 0.28 \end{pmatrix},$$

which is also similar to the overall random-effects covariance matrix obtained under the homogeneity model, but less similar than what we found for the overall average trend. Finally, the residual variance estimate is the same for the three fitted models. This illustrates the results discussed in Section 7.8.3: Even if the three-component model were the correct one, the homogeneity model would yield good (consistent) estimators for all parameters in the model.
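The mixture-moment identities used here are easy to verify numerically. A short check of the overall mean and of D* from (12.3), using the three-component estimates reported above (Python with numpy; this is plain arithmetic, not model fitting):

    import numpy as np

    p = np.array([0.20, 0.50, 0.30])                  # component probabilities
    means = np.array([[79.46, 5.60],                  # component means
                      [84.21, 5.32],
                      [81.65, 6.47]])
    D = np.array([[3.64, 0.32], [0.32, 0.03]])        # within-component covariance

    overall_mean = p @ means                          # approx (82.49, 5.72)
    centered = means - overall_mean                   # the mu_j of (12.3)
    # D* = D + sum_j p_j mu_j mu_j', expression (12.3) with common D
    D_star = D + sum(pj * np.outer(m, m) for pj, m in zip(p, centered))
    print(overall_mean.round(2))                      # [82.49  5.72]
    print(D_star.round(2))                            # [[ 7.17 -0.14] [-0.14  0.28]]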
13
Conditional Linear Mixed Models

13.1 Introduction

As pointed out by Diggle, Liang and Zeger (1994, Section 1.4) and as shown in the examples presented so far in this book, the main advantage of longitudinal studies, when compared to cross-sectional studies, is that they can distinguish changes over time within individuals (longitudinal effects) from differences among people in their baseline values (cross-sectional effects).

Consider a randomized longitudinal clinical trial, where subjects are first randomly assigned to one out of a set of treatments and then followed for a certain period of time, during which measurements are taken at prespecified time points. Treatment effects are then completely represented by differences in evolutions over time (i.e., by interactions of treatment with time), whereas the randomization assures that, at least in large trials, the treatment groups are completely comparable at baseline with respect to factors which potentially influence change afterward. Hence, a statistical model for such data does not need a cross-sectional model component.

In observational studies, however, subjects may be very heterogeneous at baseline, such that longitudinal changes need to be studied after correction for potential confounders such as age, gender, and so forth. For example, all our analyses of the prostate data so far have been corrected for age at diagnosis because we knew that, due to the high prevalence of BPH in men over age 50, the control group was significantly younger on average than the BPH cases, at first visit as well as at the time of diagnosis. Brant et al. (1992) analyzed repeated measures of systolic blood pressure from 955 healthy males. Their models included cross-sectional effects for age at first visit (linear as well as quadratic effects), obesity, and birth cohort. In a nonlinear context, Diggle, Liang and Zeger (1994, Section 9.3) used longitudinal data on 250 children to investigate the evolution of the risk for respiratory infection and its relation to vitamin A deficiency. They adjusted for factors like gender, season, and age at entry in the study.

When analyzing longitudinal data, the longitudinal effects are usually of primary interest, whereas the cross-sectional component of the model is often considered a nuisance, needed only to correct for baseline differences. In this chapter, we will therefore explore how sensitive inference for the longitudinal effects is to model assumptions in the cross-sectional component of the model, and it will be shown how such inferences can be obtained without making any assumptions about this cross-sectional component. Illustrations will be based on the hearing data, introduced in Section 2.3.2, which we will first analyze in the next section.

13.2 A Linear Mixed Model for the Hearing Data

As an example of the effect of the assumed cross-sectional model on the


estimation of longitudinal trends, we will now analyze the hearing data
introduced in Section 2.3.2, but we restrict our analysis to measurements
from the left ear only. Based on the results of Brant and Fozard (1990),
Morrell and Brant (1991), and Pearson et al . (1995), we propose the fol-
lowing linear mixed model for these data:

Yij = (β1 + β2 Agei1 + β3 Age2i1 + b1i )


+(β4 + β5 Agei1 + b2i )tij
+β6 Visit1ij + ε(1)ij , (13.1)

in which tij is the time point (in decades from entry in the study) at which
the jth measurement is taken for the ith subject, and where Agei1 is the
age (in decades) of the subject at the time of entry in the study. Pearson
et al . (1995) found evidence for the presence of a learning effect from the
first visit to subsequent visits. This is taken into account by the extra time-
varying covariate Visit1ij , defined to be one at the first measurement and
zero for all other visits. Finally, the b1i are random intercepts, and the b2i
are random slopes for time. As before, the ε(1)ij are measurement error
components. Table 13.1 shows the ML estimates and associated standard
errors for all parameters in the marginal model corresponding to the model
(13.1).

TABLE 13.1. Hearing Data. ML estimates (standard errors) for the parameters in
the marginal linear mixed model corresponding to (13.1), with and without inclu-
sion of a cross-sectional quadratic effect of age, for the original data (∆ = 0) as
well as for contaminated data (∆ = −10). The last column contains ML estimates
(standard errors) obtained from the conditional linear mixed model approach.

                               Linear mixed model                            Conditional
                     Original data                Contaminated data          linear mixed
Parameter            β3 ≠ 0         β3 = 0        β3 ≠ 0         β3 = 0      model
Fixed effects:
β1 (intercept)       3.52 (2.39)   −4.52 (0.86)    3.52 (2.39)   226.39 (3.31)     —
β2 (age)            −1.63 (1.01)    1.96 (0.15)   −1.63 (1.01)  −101.42 (0.60)     —
β3 (age²)            0.35 (0.10)        —         −9.65 (0.10)        —            —
β4 (time)           −0.20 (0.81)   −0.15 (0.81)   −0.20 (0.81)    −0.58 (0.81)  0.02 (0.81)
β5 (age×time)        0.86 (0.16)    0.84 (0.16)    0.86 (0.16)     1.03 (0.16)  0.82 (0.17)
β6 (visit1)          1.85 (0.30)    1.86 (0.30)    1.85 (0.30)     2.07 (0.31)  1.96 (0.31)
Variance components:
d11 = var(b1i)          41.81          42.61          41.81         861.93         —
d12 = cov(b1i, b2i)      3.59           4.11           3.59         −20.75         —
d22 = var(b2i)           7.61           7.67           7.61           8.06        7.61
σ² = var(ε(1)ij)        25.16          25.15          25.13          25.16       25.19
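For completeness, a minimal SAS sketch of how this marginal fit could be
obtained (our illustration, not the authors' original program; the variable
names age and age2, for age and squared age at entry in decades, are
assumptions, whereas id, time, ftime, visit1, and L500 are as defined in
Section 13.5):

proc mixed data = hearing noclprint method = ml;
  class id;
  model L500 = age age2 time ftime visit1 / solution;  /* cross-sectional and longitudinal fixed effects */
  random intercept time / type = un subject = id;      /* random intercepts and random slopes for time   */
run;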

To study the effect of misspecifying the cross-sectional model on the esti-
mation of the longitudinal model, we refitted model (13.1), not including
the cross-sectional quadratic age effect. The results are also shown in Ta-
ble 13.1. We conclude that removing cross-sectional terms from the model
inflates the random-intercepts variance d11 , but the estimates of the average
longitudinal trends (β4 , β5 , and β6 ) are only slightly affected. Intuitively,
one might expect the effect of omitting the quadratic age effect to depend
on β3 . We checked this by repeatedly fitting the two previous models on
contaminated data, obtained from replacing the original response values yij
by yij + ∆Age2i1 , for ∆ equal to −10, −9, . . . , 9, and 10. For ∆ = −10, there
is an extremely strong negative quadratic age effect, whereas for ∆ = 10,
there is an extremely strong positive quadratic age effect. The situation
∆ = 0 corresponds to the original data. From now on, model (13.1) will be
termed the “correct” model, and the model under β3 = 0 will be termed
the “incorrect” (misspecified) model.
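Such a contaminated version can be produced with a short DATA step; a
sketch for ∆ = −10 (ours, assuming the variable age holds the age at entry
in decades):

data hearing_c10;
  set hearing;
  L500 = L500 + (-10) * age * age;  /* add Delta * Age^2 to every response */
run;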

FIGURE 13.1. Hearing Data. ML estimates for the average longitudinal effects
under correct (long dashes) and incorrect (solid) cross-sectional models, as well
as for the conditional linear mixed model (short dashes), for several degrees of
contamination (∆). The vertical lines represent one estimated standard deviation
under the incorrect model. The bold vertical line corresponds to the original data
(∆ = 0).

Figure 13.1 compares the estimates of the average longitudinal effects under
both models. The estimates under the correct model are independent of
the degree ∆ of contamination. Under the incorrect model, however, the
obtained estimates clearly depend on ∆, and for β5 and β6, they differ up
to one standard deviation from the estimates obtained under the correct
model. The parameter estimates under the correct and under the incorrect
model, for the case ∆ = −10, are also given in Table 13.1. Under the
correct model, we get exactly the same estimates as for the original data,
except for the cross-sectional quadratic age effect, for which the difference
is exactly ∆ = −10. As noted earlier, the omission of a cross-sectional
covariate inflates the random-intercepts variability, but the effect is now
much more pronounced. Note also that deleting the quadratic age
effect for the contaminated data changes the estimated correlation between
the random intercepts and random slopes from 0.2013 (p = 0.0625, LR test)
to −0.2490 (p = 0.0206, LR test).

FIGURE 13.2. Hearing Data. Pairwise scatter plots of the empirical Bayes esti-
mates for the random slopes b2i obtained under the correct linear mixed model, the
incorrect linear mixed model, and the conditional linear mixed model. All plots are
based on contaminated data (∆ = −10), and the Pearson correlation coefficient
is denoted by r.

For ∆ = −10, we also calculated the empirical Bayes (EB) estimates (see
Section 7.2) for the random slopes b2i under the correct and under the in-
correct model. A graphical comparison is shown in panel (a) of Figure 13.2.


Note that, although the estimates from both procedures are highly corre-
lated (r = 0.72), many subjects are found to increase faster (slower) than
average under the correct model, but slower (faster) than average under
the misspecified model, and vice versa.

13.3 Conditional Linear Mixed Models

The results of Section 13.2 illustrate the need for statistical methodology
which allows for the study of longitudinal trends in observational data,
without having to specify any cross-sectional effects. Verbeke et al. (1999)
and Verbeke, Spiessens and Lesaffre (2000) propose the use of so-called
conditional linear mixed models. In order to simplify notation, we will
restrict discussion of this approach to the context of model (13.1) for the
hearing data, rather than treating it in full generality, and we refer to the above-
mentioned papers for more details.

We first reformulate the linear mixed model (13.1) as
$$Y_{ij} = b_i^* + (\beta_4 + \beta_5\,\mathrm{Age}_{i1} + b_{2i})\, t_{ij} + \beta_6\,\mathrm{Visit1}_{ij} + \varepsilon_{(1)ij}, \qquad (13.2)$$
where $b_i^*$ represents the cross-sectional component
$\beta_1 + \beta_2\,\mathrm{Age}_{i1} + \beta_3\,\mathrm{Age}_{i1}^2 + b_{1i}$,
corresponding to subject $i$, under the original model. The parameters
of interest are the fixed slopes $\beta_4$, $\beta_5$, and $\beta_6$, the subject-specific slopes $b_{2i}$,
and the residual variance $\sigma^2$; the cross-sectional component $b_i^*$ is considered
as nuisance. Note how model (13.2) is of the form
$$Y_i = 1_{n_i} b_i^* + X_i\beta + Z_i b_i + \varepsilon_{(1)i}, \qquad (13.3)$$
where the matrices $X_i$ and $Z_i$ and the vectors $\beta$ and $b_i$ are those submatri-
ces and subvectors of their original counterparts $X_i$, $Z_i$, $\beta$, and $b_i$ obtained
from deleting the elements which correspond to the cross-sectional compo-
nent (i.e., the time-independent covariates) of the original model (13.1).

Conditional linear mixed models now proceed in two steps. In a first step,
we condition on sufficient statistics for the nuisance parameters $b_i^*$. In a
second step, maximum likelihood or restricted maximum likelihood esti-
mation is used to estimate the remaining parameters in the conditional
distribution of the Yi given these sufficient statistics.

Conditional on the subject-specific parameters $b_i^*$ and $b_i$ in (13.3), we have
that $Y_i$ is normally distributed with mean vector $1_{n_i} b_i^* + X_i\beta + Z_i b_i$ and
covariance matrix $\sigma^2 I_{n_i}$, from which it readily follows that $\overline{y}_i = \sum_j y_{ij}/n_i$
is sufficient for $b_i^*$. Further, the distribution of $Y_i$, conditional on $\overline{y}_i$ and on
the remaining subject-specific effects $b_i$, is given by
$$f_i(y_i \mid \overline{y}_i, b_i) = \frac{f_i(y_i \mid b_i^*, b_i)}{f_i(\overline{y}_i \mid b_i^*, b_i)}
= \left(2\pi\sigma^2\right)^{-(n_i-1)/2} n_i^{-1/2}
\exp\left\{ -\frac{1}{2\sigma^2}\, (y_i - X_i\beta - Z_i b_i)'
\left[ I_{n_i} - 1_{n_i}\left(1_{n_i}'1_{n_i}\right)^{-1}1_{n_i}' \right]
(y_i - X_i\beta - Z_i b_i) \right\}. \qquad (13.4)$$

It now follows directly from some matrix algebra (Seber 1984, property
B3.5, p. 536) that (13.4) is proportional to
$$\left(2\pi\sigma^2\right)^{-(n_i-1)/2}
\exp\left\{ -\frac{1}{2\sigma^2}\, (A_i'y_i - A_i'X_i\beta - A_i'Z_i b_i)'
\left(A_i'A_i\right)^{-1}
(A_i'y_i - A_i'X_i\beta - A_i'Z_i b_i) \right\} \qquad (13.5)$$
for any set of $n_i \times (n_i - 1)$ matrices $A_i$ of rank $n_i - 1$ which satisfy $A_i'1_{n_i} = 0$.
This shows that the conditional approach is equivalent to transforming
each vector $Y_i$ orthogonal to $1_{n_i}$. If we now also require the $A_i$ to satisfy
$A_i'A_i = I_{(n_i-1)}$, we have that the transformed vectors $A_i'Y_i$ satisfy

$$Y_i^* \equiv A_i'Y_i = A_i'X_i\beta + A_i'Z_i b_i + A_i'\varepsilon_{(1)i}
= X_i^*\beta + Z_i^*b_i + \varepsilon_{(1)i}^*, \qquad (13.6)$$
where $X_i^* = A_i'X_i$ and $Z_i^* = A_i'Z_i$ and where the $\varepsilon_{(1)i}^* = A_i'\varepsilon_{(1)i}$ are nor-
mally distributed with mean 0 and covariance matrix $\sigma^2 I_{n_i-1}$.
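For concreteness (our own illustration), a valid choice for $n_i = 3$ is the
normalized Helmert matrix
$$A_i = \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{6} \\ -1/\sqrt{2} & 1/\sqrt{6} \\ 0 & -2/\sqrt{6} \end{pmatrix},$$
whose columns sum to zero and are orthonormal, so that indeed $A_i'1_3 = 0$
and $A_i'A_i = I_2$.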

Model (13.6) is now again a linear mixed model, but with transformed
data and covariates, and such that the only parameters still in the model
are the longitudinal effects and the residual variance. Hence, the second
step in fitting conditional linear mixed models is to fit model (13.6) using
maximum likelihood or restricted maximum likelihood methods. As earlier,
the subject-specific slopes are estimated using empirical Bayes methods (see
Section 7.2). Note that once the transformed responses and covariates have
been calculated, standard software for fitting linear mixed models (e.g.,
SAS procedure MIXED) can be used for the estimation of all parameters
in model (13.6). A SAS macro for performing the transformation has been
provided by Verbeke et al. (1999) and is available from the website.
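Although the published macro should be used in practice, the transforma-
tion itself is elementary; a minimal SAS/IML sketch for one subject (our
illustration, with made-up numbers) could read:

proc iml;
  /* normalized Helmert basis: columns sum to zero and are orthonormal */
  start helmert(n);
    h = j(n, n-1, 0);
    do k = 1 to n-1;
      h[1:k, k] = 1 / sqrt(k*(k+1));
      h[k+1, k] = -k / sqrt(k*(k+1));
    end;
    return(h);
  finish;
  y = {4.1, 5.0, 6.2};        /* responses of one subject (fictitious) */
  x = {0 1, 0.5 0, 1 0};      /* columns: time, visit1 (illustrative)  */
  a = helmert(nrow(y));
  ystar = a` * y;             /* transformed responses  Y_i^* */
  xstar = a` * x;             /* transformed covariates X_i^* */
  print ystar xstar;
quit;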

Conditional linear mixed models are based on transformed data, where the
transformation is chosen such that a specific set of “nuisance” parameters
vanishes from the likelihood. From this respect, the proposed method is
very similar to REML estimation in the linear mixed model, where the vari-
ance components are estimated after transforming the data such that the
fixed effects vanish from the model (see Section 5.3). As shown by Harville
(1974, 1977) and by Patterson and Thompson (1971), and as discussed in
Section 5.3.4, the REML estimates for the variance components do not de-
pend on the selected transformation, and no information on the variance
components is lost in the absence of information on the fixed effects. It has
been shown by Verbeke, Spiessens and Lesaffre (2000) that similar prop-
erties hold for inferences obtained from conditional linear mixed models;
that is, it was shown that results do not depend on the selected transfor-
mation $Y_i \rightarrow A_i'Y_i$ and that no information is lost on the average, nor
on the subject-specific longitudinal effects, from conditioning on sufficient
statistics for the cross-sectional components $b_i^*$ in the original model.

The simplest example of conditional linear mixed models is obtained for
balanced data with only two repeated measurements per subject, where the
only time-varying covariate of interest is a binary indicator for the occasion
at which the measurement is taken. The proposed approach is then equiva-
lent to analyzing the difference between the first and second measurement
of each subject. Hence, conditional linear mixed models can be interpreted
as an extension of the well-known paired t-test to longitudinal data sets,
possibly unbalanced, with more than two measurements per subject.

In a recent paper, Neuhaus and Kalbfleisch (1998) proposed a similar con-
ditional approach for the analysis of clustered data with generalized linear
mixed models. However, since their models do not contain any random
effects other than intercepts, they are often too restrictive because they
imply unrealistically simple covariance structures. For linear models, the
conditional linear mixed models extend this methodology to accommodate
models which also allow the presence of subject-specific longitudinal effects
(random slopes).

Conditional linear mixed models have many advantages. First, inference
becomes available for the parameters of interest, completely disregarding
the nuisance parameters, without loss of any information. Further, the fit-
ting of conditional linear mixed models is very straightforward. Note also
that the second step in the fitting process does not necessarily require the
random longitudinal effects to be normally distributed. Other possibilities
include the use of finite mixtures of Gaussian distributions (e.g., Verbeke
and Lesaffre 1996a, Magder and Zeger 1996; see Section 7.8.4 and Chap-
ter 12.2). In fact, one could even use nonparametric maximum likelihood
estimation, not assuming the mixing distribution to be of any specific para-
metric form (see, for example, Laird 1978, Aitkin and Francis 1995, Aitkin
1999). The advantage of the first conditional step is then the reduction of
the dimensionality of the mixing distribution which seriously reduces the
numerical complexity of the fitting algorithms.

A disadvantage of the conditional linear mixed model is that all information
is lost on the average as well as on the subject-specific cross-sectional ef-
fects. However, it should be emphasized that longitudinal data are collected
for studying longitudinal changes, rather than cross-sectional differences
among subjects.

13.4 Applied to the Hearing Data

As an illustration, we continue our analysis of Section 13.2 of the hearing
data, and we will estimate all longitudinal effects in model (13.1), without
having to correct for any baseline differences among the study participants.
So-obtained ML estimates for the parameters in the marginal model cor-
responding to model (13.6) have also been included in Table 13.1 and in
Figure 13.1. Note that it is now irrelevant whether or not a quadratic age
effect is included in the model, and that exactly the same results are found
for the original data as well as for the contaminated data. All parameter es-
timates are very similar to the ones obtained from fitting model (13.1). This
suggests that the baseline differences among the participants in this study
can be well described by a quadratic function of the age at entry, which
has also been assumed by other authors who studied the relation between
hearing thresholds and age (e.g., Morrell and Brant 1991 and Pearson et
al. 1995).

Panel (b) of Figure 13.2 shows a scatter plot of the EB estimates for the
subject-specific slopes b2i in model (13.1), obtained by fitting the associated
(correct) linear mixed model versus those obtained from the conditional
linear mixed model. Note that the same plot would be obtained for all
contaminated data sets. For the contaminated data with ∆ = −10, a similar
scatter plot has been included for the EB estimates under the incorrect
linear mixed model [panel (c) of Figure 13.2]. Surprisingly, the estimates
under the conditional model correlate better with the estimates under the
incorrect model (r = 0.93) than with those obtained from the correct linear
mixed model (r = 0.86). However, panel (c) in Figure 13.2 reveals the
presence of outliers which may inflate the correlation and also suggests that
the incorrect model tends to systematically underestimate small (negative)
and large (positive) slopes, when compared to the conditional linear mixed
model, whereas the opposite is true for the slopes closer to zero.

FIGURE 13.3. Hearing Data. Scatter plots of the differences in empirical Bayes
estimates for the random slopes b2i obtained under the correct linear mixed model
and the conditional linear mixed model (left panel) as well as under the incorrect
model and the conditional linear mixed model (right panel). Both plots are based
on contaminated data (∆ = −10).

We also calculated, for all subjects, the difference between their EB esti-
mate obtained under the correct or incorrect model and their EB estimate
obtained under the conditional linear mixed model. A plot of these differ-
ences versus the subject's age at entry in the study is shown in Figure 13.3.
It clearly shows that the omission of the cross-sectional quadratic age
effect results in a systematic bias of the EB estimates, when compared to
the estimates from the conditional model: The incorrect model tends to
overestimate the subject-specific slope for subjects of low or high age at
entry in the study, whereas the opposite is true for middle-aged subjects.
This bias is not present for the EB estimates obtained from the correct
linear mixed model. These findings suggest that one way of checking the
appropriateness of the cross-sectional component of a classical linear mixed
model could be to calculate the difference between the resulting EB esti-
mates for the subject-specific slopes and their EB estimates obtained from
the conditional approach, and to plot these differences versus relevant co-
variates. However, more research is needed in order to fully explore the
potential of such procedures.

13.5 Relation with Fixed-Effects Models

It can be shown that, conditional on the variance components, the same
estimates are obtained for longitudinal components in a linear mixed model
by applying the conditional linear mixed model approach as obtained by
fitting the corresponding model (13.3), thus considering the subject-specific
intercepts b∗i as fixed, rather than random, parameters. So, one can, strictly
speaking, perform a conditional linear mixed model analysis without ex-
plicit computation of the transformed response vectors yi ∗ or of the trans-
formed covariance matrices Xi∗ and Zi∗ . In large data sets however, this
requires fitting linear mixed models with hundreds, if not thousands, of pa-
rameters in the mean structure. In many cases, this will become unfeasible
using standard software.

For example, we tried this fixed-effects approach to obtain the results previ-
ously derived for the hearing data under a conditional linear mixed model
and already reported in Table 13.1. In SAS PROC MIXED, this can be
done as follows:

proc mixed data = hearing noclprint method = ml;
  class id;                                      /* one fixed intercept per subject */
  model L500 = id time ftime visit1 / solution;
  random time / type = un subject = id;          /* random slopes for time */
run;

The variable time contains the time point (in decades from entry in the
study) at which the repeated responses were taken, whereas ftime repre-
sents the interaction term between time and the age (in decades) of the
subjects at their entry in the study. Further, id contains the identification
number of each patient, and the variable visit1 is one at the first visit,
and zero for all subsequent visits. Finally, the response variable L500 con-
tains the hearing thresholds for 500 Hz, taken on the left ear of the study
participants. Running the above program requires fitting a linear mixed
model with 684 fixed effects and 2 variance components. Unfortunately,
this turned out not to be feasible because the 1.12 Gb of free disk space
was insufficient for SAS to fit the model.

In order to still be able to illustrate the relationship between the conditional
linear mixed models and the fixed-effects approach, we analyzed hearing
thresholds from 100 randomly selected subjects in the hearing data set.
The results are summarized in Table 13.2. The first two columns of this
table show the estimates obtained from fitting a conditional linear mixed
model using maximum likelihood (ML) and restricted maximum likelihood
(REML), respectively. The third and the fourth columns show the esti-
mates obtained from the fixed-effects approach also under ML and REML
estimation, respectively. Note that the ML estimates from both fitting ap-
proaches do not coincide and that the fixed-effects approach even fails to
detect the presence of random slopes. This is because the equivalence be-
tween both procedures holds conditionally on the variance components. As
discussed in Chapter 5, ML estimates for the variance components do not
account for the loss in degrees of freedom due to the estimation of the fixed
effects in the mean structure. Here, we are comparing ML estimates from a
model with 3 fixed effects (conditional linear mixed model) with those from
a model which contains 103 fixed effects (fixed-effects approach), yielding
severe differences in the estimates for the variance components, and there-
fore also in the estimates of the longitudinal components in model (13.1).

TABLE 13.2. Hearing Data (randomly selected subset). ML and REML estimates
(standard errors) for the longitudinal effects in (13.1), from a conditional linear
mixed model as well as from a fixed-effects approach.

                     Conditional linear mixed model      Fixed-effects approach
Parameter            ML              REML                ML              REML
Fixed effects:
β4 (time)           −1.778 (2.111)  −1.817 (2.146)      −0.483 (1.372)  −1.817 (2.146)
β5 (age×time)        1.212 (0.430)   1.221 (0.436)       0.918 (0.286)   1.221 (0.436)
β6 (visit1)          2.968 (0.697)   2.969 (0.699)       2.939 (0.658)   2.969 (0.699)
Variance components:
d22 = var(b2i)      11.466 (4.140)  12.216 (4.381)       0.000 ( — )    12.216 (4.381)
σ² = var(ε(1)ij)    20.148 (1.505)  20.210 (1.513)      19.043 (1.190)  20.210 (1.513)
Random slopes b2i:
Subject 1            2.597 (2.657)   2.680 (2.712)       0.000 ( — )     2.680 (2.712)
Subject 2           −1.847 (2.285)  −1.904 (2.322)       0.000 ( — )    −1.904 (2.322)
Subject 3           −1.130 (1.823)  −1.146 (1.844)       0.000 ( — )    −1.146 (1.844)

To illustrate this even better, we refitted the fixed-effects model, using ML
estimation, but we restricted the variance components to be equal to those
obtained from the conditional linear mixed model (i.e., d22 = 11.466 and
σ² = 20.148). This is done in PROC MIXED using the following program
with a PARMS statement:

proc mixed data = subset noclprint method = ml;
  class id;
  model L500 = id time ftime visit1 / solution;
  random time / type = un subject = id solution;
  parms (11.466) (20.148) / eqcons = 1, 2;  /* hold d22 and sigma^2 fixed */
run;

The so-obtained results are now exactly the same as those from the condi-
tional linear mixed model, reported in the first column of Table 13.2.

Finally, since REML estimation does account for the loss in degrees of free-
dom due to the estimation of the fixed effects, we get the same results from
the conditional linear mixed model as from the fixed-effects approach, pro-
vided both apply the REML estimation. For our example, this is illustrated
in the fact that the second and the last columns in Table 13.2 are identical.
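In PROC MIXED, this equivalence can be verified by simply rerunning the
fixed-effects program of this section under REML instead of ML; a minimal
sketch:

proc mixed data = subset noclprint method = reml;
  class id;
  model L500 = id time ftime visit1 / solution;
  random time / type = un subject = id solution;
run;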
14
Exploring Incomplete Data

In Chapter 4, we introduced several tools, in the context of the Vorozole
study, to graphically explore longitudinal data, both from the individual-
level standpoint (Figures 4.1 and 4.5) as well as from the population-
averaged or group-averaged perspective (Figures 4.2, 4.3, 4.4, and 10.3).
These plots are designed to focus on various structural aspects, such as the
mean structure, the variance function, and the association structure.

An extra level of complexity is added whenever not all planned measure-
ments are observed. This results in incompleteness or missingness. Another
frequently encountered term is dropout, which refers to the case where all
observations on a subject are obtained until a certain point in time, after
which all measurements are missing.

Several issues arise when data are incomplete. In the remainder of this
chapter, we will illustrate some of these issues based on the Vorozole study.
To simplify matters, we will focus on dropout. In the next chapter, a formal
treatment of missingness, based on the pivotal work of Rubin (1976) and
Little and Rubin (1987) will be given. Subsequent chapters present various
modeling strategies for incomplete longitudinal data, as well as tools for
sensitivity analysis, an area in which interest is strongly rising.

The first issue, resulting from dropout, is evidently a depletion of the study
subjects. Of course, a decreasing sample size increases variability which, in
turn, decreases precision. In this respect, the Vorozole study is a dramatic
example, as can be seen from Figure 14.1 and Table 14.1, which graphi-
cally and numerically present dropout in both treatment arms. Clearly, the
dropout rate is high and there is a hint of a differential rate between the
two arms. This means we have identified one potential factor that could
influence a patient's probability of dropping out. Although a large part of
the trialist's interest focuses on the treatment effect, we should be aware
that it is still a covariate and hence a design factor. Another question that
will arise is whether dropout depends on observed or unobserved responses.
An equivalent representation is given in Figure 14.2.

FIGURE 14.1. Vorozole Study. Representation of dropout.

FIGURE 14.2. Vorozole Study. Alternative representation of dropout.

TABLE 14.1. Vorozole Study. Evolution of dropout.

            Standard         Vorozole
Week       #      (%)       #      (%)
0         226    (100)     220    (100)
1         221     (98)     216     (98)
2         203     (90)     198     (90)
4         161     (71)     146     (66)
6         123     (54)     106     (48)
8          90     (40)      90     (41)
10         73     (32)      77     (35)
12         51     (23)      64     (29)
14         39     (17)      51     (23)
16         27     (12)      44     (20)
18         19      (8)      33     (15)
20         14      (6)      27     (12)
22          6      (3)      22     (10)
24          5      (2)      17      (8)
26          4      (2)       9      (4)
28          3      (1)       7      (3)
30          3      (1)       3      (1)
32          2      (1)       1      (0)
34          2      (1)       1      (0)
36          1      (0)       1      (0)
38          1      (0)       0      (0)
40          1      (0)       0      (0)
42          1      (0)       0      (0)
44          1      (0)       0      (0)

A different way of displaying several structural aspects is to use a scat-
ter plot matrix, such as in Figure 14.3. The off-diagonal elements picture
scatter plots of standardized residuals obtained from pairs of measurement
occasions. The decay of correlation with time is studied by considering
the evolution of the scatters with increasing distance to the main diago-
nal. Stationarity, on the other hand, implies that the scatter plots remain
similar within diagonal bands if measurement occasions are approximately
equally spaced. In addition to the scatter plots, we place histograms on the
diagonal, capturing the variance structure. Features such as skewness, mul-
timodality, and so forth, can then be graphically detected. Finally, if the
axes are given the same scales, it is very easy to capture the attrition rate
as well; see also Figure 4.5.

FIGURE 14.3. Vorozole Study. Scatter plot matrix for selected time points.

Another aspect of the impact of dropout is also seen if we consider the
average profile in each treatment arm, with pointwise confidence limits
added (Figure 14.4). Indeed, near the end of the study, these intervals
become extremely wide, as opposed to the relatively narrow intervals at the
start of the experiment. Thus, it is clear that dropout leads to efficiency
loss. Of course, this effect can be due in part to increasing variability over
time. Modeling is needed to obtain more insight into this effect.

To gain further insight into the impact of dropout, it is useful to construct
dropout-pattern-specific plots. Figures 14.5 and 14.6 display the individual
and averaged profiles per pattern.
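For readers who wish to reproduce such displays, the pattern-specific aver-
ages can be sketched as follows (our illustration; a long-format data set
vorozole with variables id, arm, week, and y is assumed):

proc sql;
  create table withpat as
  select a.*, b.pattern
  from vorozole as a,
       (select id, max(week) as pattern   /* last week observed = pattern */
        from vorozole
        where y is not missing
        group by id) as b
  where a.id = b.id;
quit;
proc means data = withpat noprint;
  class arm pattern week;
  var y;
  output out = patmeans mean = ymean;     /* mean profile per pattern and arm */
run;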

FIGURE 14.4. Vorozole Study. Mean profiles, with 95% confidence intervals
added.

The individual profiles plot, by definition displaying all available data, has
some intrinsic limitations. As is the case with any individual data plot,
it tends to be fairly busy. Since there is a lot of early dropout, there are
many short sequences, and since we decided to use the same time axis for
all profiles, including those that drop out early, very little information can
be extracted. Indeed, the evolution over the first few sequences is not clear
at all. In addition, the eye assigns more weight to the longer profiles, even
though they are considerably less frequent.

Some of these limitations are removed in Figure 14.6, where the pattern-
specific average profiles are displayed per treatment arm. Still, care has
to be taken not to overinterpret the longer profiles and neglect the
shorter profiles. Indeed, for this study the latter represent more subjects
than the longer profiles.

Several observations can be made. Most profiles clearly show a quadratic
trend, which seems to be in contrast with the relatively flat nature of the
average profiles in Figure 14.4. This implies that the impression from all
patterns together may differ radically from a pattern-specific look. These
conclusions seem to be consistent across treatment arms.

Another important observation is that those who drop out rather early
seem to decrease from the start, whereas those who remain relatively long
in the study exhibit, on average and in turn, a rise, a plateau, and then
a decrease. Looked upon from the standpoint of dropout, we suggest that
there are at least two important characteristics that make dropout increase:
(1) a low value of change versus baseline and (2) an unfavorable (downward)
evolution.

FIGURE 14.5. Vorozole Study. Individual profiles, per dropout pattern.

FIGURE 14.6. Vorozole Study. Mean profiles, per dropout pattern, grouped per
treatment arm.

Arguably, a careful modeling of these data, irrespective of the framework
chosen, should reflect these features. In Chapters 17 and 18, we will recon-
sider these preliminary findings.
15
Joint Modeling of Measurements and
Missingness

15.1 Introduction

The problem of dealing with missing values is common throughout sta-
tistical work and is almost ever-present in the analysis of longitudinal or
repeated measurements data. Missing data are indeed common in clinical
trials (Piantadosi 1997, Green, Benedetti, and Crowley 1997, Friedman,
Furberg, and DeMets 1998), in epidemiologic studies (Kahn and Sempos
1989, Clayton and Hills 1993, Lilienfeld and Stolley 1994, Selvin 1996),
and, very prominently, in sample surveys (Fowler 1988, Schafer, Khare and
Ezatti-Rice 1993, Rubin 1987, Rubin, Stern, and Vehovar 1995).

Patients who drop out of a clinical trial are usually listed on a separate with-
drawal sheet of the case record form with the reasons for withdrawal, en-
tered by the investigator. Reasons typically encountered are adverse events,
illness not related to study medication, uncooperative patient, protocol vi-
olation, ineffective study medication, and other reasons (with further spec-
ification, e.g., lost to follow-up). Based on such a medical typology, Gould
(1980) proposed specific methods to handle this type of incompleteness.

Early work on missing values was largely concerned with algorithmic and
computational solutions to the induced lack of balance or deviations from
the intended study design. See, for example, the reviews by Afifi and
Elashoff (1966) and Hartley and Hocking (1971). More recently, general
algorithms, such as the Expectation-Maximization (EM; Dempster, Laird,
and Rubin 1977), and data imputation and augmentation procedures (Ru-
bin 1987, Tanner and Wong 1987), combined with powerful computing
resources have largely provided a solution to this aspect of the problem.
There remains the very difficult and important question of assessing the
impact of missing data on subsequent statistical inference. Conditions can
be formulated and are outlined later (Section 15.8), under which an analy-
sis that proceeds as if the missing data are missing by design (i.e., ignoring
the missing value process), can provide valid answers to study questions.
The difficulty in practice is that such conditions can rarely be assumed to
hold. We will frequently emphasize a key point here: When we undertake
such analyses, assumptions will be required that cannot be assessed from
the data under analysis. Hence, in this setting, there cannot be anything
that could be termed a definitive analysis, and arguably the appropriate
statistical framework is one of sensitivity analysis (Chapters 19 and 20).

Much of the treatment in this work will be restricted to dropout (or at-
trition); that is, to patterns in which missing values are only followed by
missing values. There are four related reasons for this restriction. First, the
classification of missing value processes has a simpler interpretation with
dropout than for patterns with intermediate missing values. Second, it is
easier to formulate models for dropout and, third, much of the missing
value literature on longitudinal data is restricted to this setting. Finally,
dropouts are by far the most common form of missingness in longi-
tudinal studies.

15.2 The Impact of Incompleteness

In a strict sense, the conventional justification for the analysis of data from
a randomized trial is removed when data are missing for reasons outside
the control of the investigator. Before one can address this problem how-
ever, it is necessary to clearly establish the purpose of the study (Heyting,
Tolboom, and Essers 1992). If one is working within a pragmatic setting,
the event of dropout, for example, may well be a legitimate component of
the response. It may make no sense to ask what response the subject would
have shown had they remained in the trial, and the investigator may then
require a description of the response conditional on a subject remaining in
the trial. This, together with the pattern of observed missingness, may then
be the appropriate and valid summary of the outcome. We might call this a
conditional description. Shih and Quan (1997) argue that such a description
will be of more relevance in many clinical trials. On the other hand, from
a more explanatory perspective, we might be interested in the behavior of
the responses that occurred irrespective of whether we were able to record
them or not. This might be termed a marginal description of the response.
For a further discussion of intention-to-treat and explanatory analyses in
the context of dropout, we refer to Heyting, Tolboom, and Essers (1992)
and Little and Yau (1996). It is commonly suggested (Shih and Quan 1997)
that such a marginal representation is not meaningful when the nature of
dropout (e.g., death) means that the response cannot subsequently exist,
irrespective of whether it is measured.

Although such dropout may in any particular setting imply that a marginal
model is not helpful, it does not imply that it necessarily has no meaning.
Provided that the underlying model does not attach a probability of 1 to
dropout for a particular patient, then non-dropout and subsequent obser-
vation is an outcome that is consistent with the model and logically not
different from any other event in a probability model. Such distinctions,
particularly with respect to the conditional analysis, are complicated by
the inevitable mixture of causes behind missing values. The conditional
description is a mirror of what has been observed, and so its validity is
less of an issue than its interpretation. In contrast, other methods of han-
dling incompleteness make some correction or adjustment to what has been
directly observed, and therefore address questions other than those corre-
sponding to the conditional setting. In seeking to understand the validity
of these analyses, we need to compare their consequences with their aims.

15.3 Simple ad hoc Methods

Two simple, common approaches to analysis are (1) to discard subjects
with incomplete sequences and (2) simple imputation. The first approach
has the advantage of simplicity, although the wide availability of more
sophisticated methods of analysis minimizes the significance of this. It is also
an inefficient use of information. In a trivial sense, it provides a description
of the response conditional on a subject remaining in the trial. Whether
this reflects a question of interest depends entirely on the mechanism(s)
generating the missing values. It is not difficult to envisage situations where
it can be very misleading, and examples of this exist in the literature (Wang-
Clow et al. 1995).

There are several forms of simple imputation. For example, a cross-sectional
approach replaces a missing observation by the average of available obser-
vations at the same time from other subjects with the same covariates and
treatment. A simple longitudinal approach carries the last available mea-
surement from a subject forward, replacing the entire sequence of missing
values. A more sophisticated version predicts the next missing value us-
ing a regression relationship established from available past data. These
methods share the same drawbacks, although not all to the same degree.
The data set that results will mimic a sample from the population of in-
terest, itself determined by the aims of the analysis, only under particular
and potentially unrealistic assumptions. Further, these assumptions depend
critically on the missing value mechanism(s). For example, under certain
dropout mechanisms, the process of imputation may recover the actual
marginal behavior required, whereas under other mechanisms, it may be
wildly misleading. It is only under the simplest and most ignorable mecha-
nisms that the relationship between imputation procedure and assumption
is easily deduced. Little (1994a) gives two simple examples in which the
relationship is clear. A further minor point is that, without further elabo-
ration, the analysis of the completed data set will underestimate the true
variability of the data.

In conclusion, we see that when there are missing values, simple methods of
analysis do not necessarily imply simple, or even accessible, assumptions,
and without understanding properly the assumptions being made in an
analysis, we are not in a position to judge its validity or value. It has
been argued that although any particular such ad hoc analysis may not
represent the true picture behind the data, a collection of such analyses
should provide a reasonable envelope within which the truth might lie.
This does point to the desirability of a sensitivity analysis, but the main
conclusion does not follow. Counterexamples to this can be constructed
and, again, without a clear formulation of the assumptions being made, we
are not in a position to interpret such an envelope, and we are certainly not
justified in assuming that its coverage is, in some practical sense, inclusive.
One way to proceed is to consider a formal framework for the missing value
problem, and this leads us to Rubin’s classification.

A review of simple methods, with their advantages and disadvantages, in-
cluding complete case analysis and simple forms of imputation, is provided
in Chapter 16.

15.4 Modeling Incompleteness

In order to incorporate incompleteness into the modeling process, we need
to reflect on the nature of the missing value mechanism and its implica-
tions for statistical inference. Rubin (1976) and Little and Rubin (1987,
Chapter 6) make important distinctions between different missing values
processes. A dropout process is said to be completely random (MCAR)
if the dropout is independent of both unobserved and observed data, and
random (MAR) if, conditional on the observed data, the dropout is inde-
pendent of the unobserved measurements; otherwise, the dropout process
is termed nonrandom (MNAR). A more formal definition of these con-
cepts is given in Section 15.7. If a dropout process is random, then a valid
analysis can be obtained through a likelihood-based analysis that ignores
the dropout mechanism, provided the parameters describing the measure-
ment process are functionally independent of the parameters describing the
dropout process, the so-called parameter distinctness condition. This sit-
uation is termed ignorable by Rubin (1976) and Little and Rubin (1987)
and leads to considerable simplification in the analysis (Diggle 1989). See
also Section 15.8.

In many examples, however, the reasons for dropout are many and varied
and it is therefore difficult to justify on a priori grounds the assumption
of random dropout. Arguably, in the presence of nonrandom dropout, a
wholly satisfactory analysis of the data is not feasible.

One approach is to estimate from the available data the parameters of a
model representing a nonrandom dropout mechanism. It may be difficult to
justify the particular choice of dropout model, and it does not necessarily
follow that the data contain information on the parameters of the particu-
lar model chosen, but where such information exists, the fitted model may
provide some insight into the nature of the dropout process and of the
sensitivity of the analysis to assumptions about this process. This is the
route taken by Diggle and Kenward (1994) in the context of continuous
longitudinal data; see also Diggle, Liang and Zeger (1994, Chapter 11) and
Section 12.4. Further approaches are proposed by Laird, Lange, and Stram
(1987), Wu and Bailey (1988, 1989), Wu and Carroll (1988), and Green-
lees, Reece, and Zieschang (1982). An overview of the different modeling
approaches is given by Little (1995).

Also, the case of categorical outcomes has received considerable attention.


See, for example, Baker and Laird (1988), Stasny (1986), Baker, Rosen-
berger, and DerSimonian (1992), Conaway (1992, 1993), Park and Brown
(1994), and Molenberghs, Kenward, and Lesaffre (1997).

Indeed, one feature in common to all of the more complex approaches is that
they rely on untestable assumptions about the relation between the mea-
surement process (often of primary interest) and the dropout process. One
should therefore avoid missing data as much as possible, and if dropout oc-
curs, information should be collected on the reasons for this. As an example,
consider a clinical trial where outcome and dropout are both strongly re-
lated to a specific covariate X and where, conditionally on X, the response
Y and the missing data process R are independent. In the selection frame-
work, we then have that $f(Y, R \mid X) = f(Y \mid X)\, f(R \mid X)$, implying MCAR,
whereas omission of X from the model may imply MAR or even MNAR,
which has important consequences for selecting valid statistical methods.

Because different models imply different untestable assumptions, thereby
possibly affecting the statistical inferences of interest, it is advisable, in
practice, to always perform a sensitivity analysis. Various informal and
formal ways of performing sensitivity analyses are described in Chapters 19
and 20. Draper (1995) and Copas and Li (1997) provide useful insight in
model uncertainty and nonrandomly selected samples.

Section 15.5 develops the necessary terminology and notation and Sec-
tion 15.6 describes various missing data patterns. The missing data mecha-
nisms, informally introduced in this section, are formalized in Section 15.7.
The important case where the missing data mechanism can be excluded
from statistical analysis is introduced in Section 15.8. Since much of the
subsequent treatment will be confined to dropout, this situation is reviewed
in Section 15.9.

15.5 Terminology

In this section, we introduce terminology, building on the standard frame-
work for missing data, which is largely due to Rubin (1976) and Little and
Rubin (1987).

In general, we assume that for subject $i$ in the study, a sequence of mea-
surements $Y_{ij}$ is designed to be measured at occasions $j = 1, \ldots, n_i$. As
previously, the outcomes are grouped into a vector $Y_i = (Y_{i1}, \ldots, Y_{in_i})'$.
In addition, for each occasion $j$ define
$$R_{ij} = \begin{cases} 1 & \text{if } Y_{ij} \text{ is observed,} \\ 0 & \text{otherwise.} \end{cases}$$
The missing data indicators $R_{ij}$ are grouped into a vector $R_i$ which is, of
course, of the same length as $Y_i$.

Partition $Y_i$ into two subvectors such that $Y_i^o$ is the vector containing
those $Y_{ij}$ for which $R_{ij} = 1$ and $Y_i^m$ contains the remaining components.
These subvectors are referred to as the observed and missing components,
respectively. The following terminology is adopted:

Complete data $Y_i$: the scheduled measurements. This is the outcome
vector that would have been recorded if there had been no missing
data.

Missing data indicators $R_i$: the process generating $R_i$ is referred to as
the missing data process.

Full data $(Y_i, R_i)$: the complete data, together with the missing data
indicators. Note that unless all components of $R_i$ equal 1, the full
data components are never jointly observed.

Observed data $Y_i^o$.

Missing data $Y_i^m$.

Some confusion might arise between the terms complete data introduced
here and complete case analysis of Sections 15.3 and 16.2. Although the
former refers to the (hypothetical) data set that would arise if there were
no missing data, “complete case analysis” refers to deletion of all subjects
for which at least one component is missing.

Note that one observes the measurements $Y_i^o$ together with the missingness
indicators $R_i$.
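For instance, if $n_i = 3$ and $R_i = (1, 0, 1)'$, then $Y_i^o = (Y_{i1}, Y_{i3})'$ and
$Y_i^m = Y_{i2}$.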

15.6 Missing Data Patterns

A hierarchy of missing data patterns can be considered. When missingness
is due to attrition, all measurements for a subject from baseline onward
up to a certain measurement time are recorded, whereafter all data are
missing. It is then possible to replace the information contained in the
vector Ri by a single indicator variable. For example, Ri could indicate the
last observed measurement occasion. The sample size decreases over time.

Attrition is a particular monotone pattern of missingness. In order to have
monotone missingness, there has to exist a permutation of the measurement
components such that a measurement earlier in the permuted sequence is
observed for at least those subjects that are observed at later measure-
ments. Note that for this definition to be meaningful, we need to have a
balanced design in the sense of a common set of measurement occasions.
Other patterns are called nonmonotone.

15.7 Missing Data Mechanisms

Statistical modeling begins by considering the full data density
$$f(y_i, r_i \mid X_i, Z_i, \theta, \psi),$$
where $X_i$ and $Z_i$ are the design matrices for fixed and random effects, re-
spectively, and $\theta$ and $\psi$ are vectors that parameterize the joint distribution.
We will use $\theta = (\beta', \alpha')'$ (fixed-effects and covariance parameters) and $\psi$
to describe the measurement and missingness processes, respectively.

The taxonomy, constructed by Rubin (1976), further developed in Little
and Rubin (1987), and informally sketched in Section 15.4, is based on the
factorization
$$f(y_i, r_i \mid X_i, Z_i, \theta, \psi) = f(y_i \mid X_i, Z_i, \theta)\, f(r_i \mid y_i, X_i, \psi), \qquad (15.1)$$
where the first factor is the marginal density of the measurement process
and the second one is the density of the missingness process, conditional on
the outcomes. It is possible to have additional covariates in the missingness
model, but this is suppressed from notation. Factorization (15.1) forms the
basis of selection modeling, as the second factor corresponds to the (self-
)selection of individuals into “observed” and “missing” groups. Selection
modeling is discussed in detail in Chapters 17 and 19. An alternative tax-
onomy can be built based on so-called pattern-mixture models (Little 1993,
Little 1994a). These are based on the factorization
$$f(y_i, r_i \mid X_i, Z_i, \theta, \psi) = f(y_i \mid r_i, X_i, Z_i, \theta)\, f(r_i \mid X_i, \psi). \qquad (15.2)$$
Indeed, (15.2) can be seen as a mixture of different populations, character-
ized by the observed pattern of missingness. Pattern-mixture models are
given extensive treatment in Chapters 18 and 20.

The natural parameters of selection models and pattern-mixture models
have a different meaning, and transforming a probability model into the
other framework is, in general, not straightforward, not even for normal
measurement models. When a selection model is used, it is often men-
tioned that one has to make untestable assumptions about the relationship
between dropout and missing data (discussion in Diggle and Kenward 1994,
Molenberghs, Kenward, and Lesaffre 1997). In pattern-mixture models, it
is explicit which parameters cannot be identified. Little (1993) suggests
the use of identifying relationships between identifiable and nonidentifiable
parameters. Thus, even though these identifying relationships are also un-
verifiable (Little 1995), the advantage of pattern-mixture models is that
the verifiable and unverifiable assumptions can easily be separated. Note
that when interest is confined to describing the observed portions of the
profiles, no extrapolation and, hence, no restriction is needed. More details
are given in Section 18.1 (p. 278).

Further, selection models and pattern-mixture models are not the only
possible ways of factorizing the joint distribution of the outcome and miss-
ingness processes. Section 17.1 places these models in a broader context.

The selection model taxonomy is based on the second factor of (15.1):
$$f(r_i \mid y_i, X_i, \psi) = f(r_i \mid y_i^o, y_i^m, X_i, \psi). \qquad (15.3)$$

If (15.3) is independent of the measurements [i.e., when it assumes the form
$f(r_i \mid X_i, Z_i, \psi)$], then the process is termed missing completely at random
(MCAR).

If (15.3) is independent of the unobserved (missing) measurements $y_i^m$,
but depends on the observed measurements $y_i^o$, thereby assuming the form
$f(r_i \mid y_i^o, X_i, \psi)$, then the process is referred to as missing at random (MAR).

Finally, when (15.3) depends on the missing values $y_i^m$, the process is re-
ferred to as nonrandom missingness (MNAR). An MNAR process is allowed
to depend on $y_i^o$.

It is important to note that the above terminology is independent of the statis-
tical framework chosen to analyze the data. This is to be contrasted with
the terms ignorable and nonignorable missingness. The latter terms depend
crucially on the inferential framework (Rubin 1976).

15.8 Ignorability

Let us decide to use likelihood-based estimation. The full data likelihood
contribution for subject $i$ assumes the form
$$L^*(\theta, \psi \mid X_i, Z_i, y_i, r_i) \propto f(y_i, r_i \mid X_i, Z_i, \theta, \psi).$$

Since inference has to be based on what is observed, the full data likelihood
L∗ has to be replaced by the observed data likelihood L:

$$L(\theta, \psi \mid X_i, Z_i, y_i^o, r_i) \propto f(y_i^o, r_i \mid X_i, Z_i, \theta, \psi) \qquad (15.4)$$

with
$$f(y_i^o, r_i \mid \theta, \psi) = \int f(y_i, r_i \mid X_i, Z_i, \theta, \psi)\, dy_i^m
= \int f(y_i^o, y_i^m \mid X_i, Z_i, \theta)\, f(r_i \mid y_i^o, y_i^m, X_i, \psi)\, dy_i^m. \qquad (15.5)$$

Under an MAR process, we obtain
$$f(y_i^o, r_i \mid \theta, \psi) = \int f(y_i^o, y_i^m \mid X_i, Z_i, \theta)\, f(r_i \mid y_i^o, X_i, \psi)\, dy_i^m
= f(y_i^o \mid X_i, Z_i, \theta)\, f(r_i \mid y_i^o, X_i, \psi); \qquad (15.6)$$

that is, the likelihood factorizes into two components of the same functional
form as the general factorization (15.1) of the complete data. If, further,
$\theta$ and $\psi$ are disjoint in the sense that the parameter space of the full
vector $(\theta', \psi')'$ is the product of the parameter spaces of $\theta$ and $\psi$, then
inference can be based on the marginal observed data density only. This
technical requirement is referred to as the separability condition. However,
some caution should still be used when constructing precision estimators
(see Chapter 21).
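Written out explicitly (our restatement of (15.6) combined with separa-
bility), the observed data log-likelihood then separates as
$$\ell(\theta, \psi) = \sum_{i=1}^{N} \log f(y_i^o \mid X_i, Z_i, \theta) + \sum_{i=1}^{N} \log f(r_i \mid y_i^o, X_i, \psi),$$
so that maximization with respect to $\theta$ can be based on the first sum alone,
ignoring the model for the missingness process.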

In conclusion, when the separability condition is satisfied, within the likeli-
hood framework, ignorability is equivalent to the union of MAR and MCAR.
Hence, nonignorability and MNAR are synonyms in this context. A formal
derivation is given in Rubin (1976), where it is also shown that the same
requirements hold for Bayesian inference, but that frequentist inference is
ignorable only under MCAR. Of course, ignorability is not helpful when
at least part of the scientific interest is directed toward the missingness
process.

Classical examples of the more stringent condition with frequentist methods
are ordinary least squares (see also Sections 16.4 and 17.3) and the gen-
eralized estimating equations (GEE) approach of Liang and Zeger (1986).
These GEE define an unbiased estimator only under MCAR. Robins, Rot-
nitzky, and Zhao (1995) and Rotnitzky and Robins (1995) have estab-
lished that some progress can be made under MAR and that, even under
MNAR processes, these methods can be applied (Rotnitzky and Robins
1997, Robins, Rotnitzky, and Scharfstein 1998). Their method is based on
including weights that depend on the missingness probability, proving the
point that at least some information on the missingness mechanism should
be included and, thus, that ignorability does not hold.

15.9 A Special Case: Dropout

Without modifying the notational conventions for the measurement process
$Y_i$, we now let the scalar variable $D_i$ be the dropout indicator. This is
meaningful since, in the case that missingness is restricted to dropout,
each vector $R_i$ is of the form $(1, \ldots, 1, 0, \ldots, 0)'$ and we can introduce a
scalar dropout indicator
$$D_i = 1 + \sum_{j=1}^{n_i} R_{ij}. \qquad (15.7)$$

For an incomplete sequence, $D_i$ denotes the occasion at which dropout
occurs. For a complete sequence, $D_i = n_i + 1$. In both cases, $D_i$ equals one
plus the length of the measurement sequence, whether complete or incomplete.
It will sometimes be convenient to use a different dropout indicator:
$$T_i = \sum_{j=1}^{n_i} R_{ij} = D_i - 1. \qquad (15.8)$$

Thus, $T_i = t$ indicates the pattern in which $t$ measurements are obtained.
For a complete sequence, $T_i = n_i$. Throughout the text, $D_i$ and $T_i$ will
be used in precisely this meaning. Note that, thus far, tij has been used
to indicate the time at which the jth measurement for subject i is taken.
Given the difference between single and double subscripts for these different
concepts, no confusion should arise.
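For instance, a subject scheduled for $n_i = 5$ measurements who drops out
after the third one has $R_i = (1, 1, 1, 0, 0)'$, and hence $D_i = 4$ and $T_i = 3$.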

Selection modeling is now obtained from factorizing the density of the full
data $(y_i, d_i)$, $i = 1, \ldots, N$, as (suppressing covariate dependence)
$$f(y_i, d_i \mid \theta, \psi) = f(y_i \mid \theta)\, f(d_i \mid y_i, \psi), \qquad (15.9)$$

where the first factor is the marginal density of the measurement process
and the second one is the density of the missingness process, conditional
on the outcomes.

The observed data likelihood can be expressed as
$$f(y_i^o, d_i \mid \theta, \psi) = \int f(y_i, d_i \mid \theta, \psi)\, dy_i^m
= \int f(y_i^o, y_i^m \mid \theta)\, f(d_i \mid y_i^o, y_i^m, \psi)\, dy_i^m. \qquad (15.10)$$

If $f(d_i \mid y_i^o, y_i^m, \psi)$ is independent of the measurements [i.e., when it assumes
the form $f(d_i \mid \psi)$], then the process is termed missing completely at random
(MCAR). If $f(d_i \mid y_i^o, y_i^m, \psi)$ is independent of the unobserved (missing)
measurements $y_i^m$, but depends on the observed measurements $y_i^o$, thereby
assuming the form $f(d_i \mid y_i^o, \psi)$, then the process is referred to as missing
at random (MAR). Finally, when $f(d_i \mid y_i^o, y_i^m, \psi)$ depends on the missing
values $y_i^m$, the process is referred to as nonrandom missingness (MNAR). Of
course, ignorability is defined in analogy with its definition in Section 15.8.
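As a concrete illustration (ours, in the spirit of the dropout models of
Diggle and Kenward (1994) treated in Chapter 17), suppose dropout at
occasion $j$ follows the logistic model
$$\operatorname{logit} P(D_i = j \mid D_i \geq j, y_i) = \psi_0 + \psi_1 y_{i,j-1} + \psi_2 y_{ij}.$$
Then $\psi_1 = \psi_2 = 0$ corresponds to MCAR, $\psi_2 = 0$ with $\psi_1 \neq 0$ to MAR,
and $\psi_2 \neq 0$ to MNAR.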
16
Simple Missing Data Methods

16.1 Introduction

As suggested in Section 15.2, missing data nearly always entail problems for
the practicing statistician. First, inference will often be invalidated when
the observed measurements do not constitute a simple random subset of the
complete set of measurements. Second, even when correct inference follows,
it is not always an easy task to trick standard software into operation on a
ragged data structure.

Even in the simple case of a one-way ANOVA design (Neter, Wasserman,
and Kutner 1990) and under an MCAR mechanism operating, problems
occur since missingness destroys the balance between the sizes of the sub-
samples. This implies that a slightly more complicated least squares analy-
sis has to be invoked. Of course, a regression module for the latter analysis
is included in most statistical software packages. The trouble is that the
researcher has to know which tool to choose for particular classes of incom-
plete data.

Little and Rubin (1987) give an extensive treatment of methods to analyze
incomplete data, many of which are intended for continuous, normally dis-
tributed data. Some of these methods were proposed more than 50 years
ago. Examples are Yates’ (1933) iterated ANOVA and Bartlett’s (1937)
ANCOVA procedures to analyze incomplete ANOVA designs. The former
method is an early example of the Expectation-Maximization (EM) algo-
rithm (Dempster, Laird, and Rubin 1977). See also Chapter 22.

We will briefly review a number of techniques that are valid when the mea-
surement and missing data processes are independent and their parameters
are separated (MCAR). It is important to realize that many of these meth-
ods are used also in situations where the MCAR assumption is not tenable.
This should be seen as bad practice since it will often lead to biased esti-
mates and invalid tests and hence to erroneous conclusions. Ample detail
and illustrations of several problems are provided in Verbeke and Molen-
berghs (1997, Chapter 5). Section 16.2 discusses the computationally sim-
plest technique, a complete case analysis, in which the analysis is restricted
to the subjects for whom all intended measurements have been observed.
A complete case analysis is popular because it maps a ragged data matrix
into a rectangular one, by deleting incomplete cases. A second family of
approaches, with a similar effect on the applicability of complete data soft-
ware, is based on imputing missing values. One distinguishes between single
imputation (Section 16.3) and multiple imputation (Section 20.3). In the
first case, a single value is substituted for every “hole” in the data set and
the resulting data set is analyzed as if it represented the true complete data.
Multiple imputation properly acknowledges the uncertainty stemming from
filling in missing values rather than observing them.

A third family is based on the principle of analyzing the incomplete data
as such. A simple representative is a so-called available case analysis. This
merely means that every component of a parameter (e.g., made up of mean
vector and covariance matrix elements for a multivariate normal sample)
is estimated using the maximal amount of information available for that
component. This technique is discussed in Section 16.4 and applied in Sec-
tion 17.4.2 on the growth data that have been introduced in Section 2.6.
Although it makes use of more data than a corresponding complete case
analysis, it still suffers from some drawbacks. For example, the method re-
quires the missingness process to be MCAR. Section 17.3 describes a simple
and convenient analysis, based on the more relaxed MAR assumption, that
is consistent with factorization (15.6) and, importantly, can be carried out
using PROC MIXED. A popular and very general technique to optimize
incomplete data likelihoods under MAR is the EM algorithm (Dempster,
Laird, and Rubin 1977). Little and Rubin (1987) used the EM algorithm
to analyze their incomplete version of the growth data (Section 17.4.1).
The principal ideas behind this method, and its connection to the MAR
analysis of Section 17.3 will be given in Chapter 22.

16.2 Complete Case Analysis

A complete case analysis includes only those cases for which all n_i
measurements were recorded. This method has obvious advan-
tages. It is very simple to describe and since the data structure is as would
have resulted from a complete experiment, standard statistical software can
be used. Further, since the complete estimation is done on the same subset
of completers, there is a common basis for inference, unlike for the available
case methods (see Section 16.4).
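
In SAS, the completers' data set is easily constructed. A minimal sketch (our illustration, not part of the original analyses), assuming a hypothetical subject-level data set wide with planned measurements y1-y4:

data cc;
   set wide;
   if nmiss(of y1-y4) = 0;   /* retain subjects with all four measurements */
run;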

Unfortunately, the method suffers from severe drawbacks. First, there is
nearly always a substantial loss of information. For example, suppose there
are 20 measurements, with 10% of missing data on each measurement.
Suppose, further, that missingness on the different measurements is inde-
pendent; then, the estimated percentage of incomplete observations is as
high as 1 − 0.9^20 ≈ 87%. The impact on precision and power is dramatic. Even though
the reduction of the number of complete cases will be less dramatic in
realistic settings where the missingness indicators Ri are correlated, the
effect just sketched will often undermine a lot of complete case analyses. In
addition, severe bias can result when the missingness mechanism is MAR
but not MCAR. Indeed, should an estimator be consistent in the complete
data problem, then the derived complete case analysis is consistent only if
the missingness process is MCAR. Unfortunately, the MCAR assumption
is much more restrictive than the MAR assumption.

A simple partial check on the MCAR assumption is as follows (Little and
Rubin 1987). Divide the observations on measurement j into two groups:
(1) those subjects that are also observed on another measurement or set of
measurements and (2) those missing on the other measurement(s). Should
MCAR hold, then both groups should be random samples of the same
population. Failure to reject equality of the distributional parameters of
both samples increases the evidence for MCAR, but does not prove it.
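
As an illustration, a minimal sketch of such a check in SAS (our own; the wide data set and the variables y1 and y2 are hypothetical) compares the distribution of the first measurement between subjects observed and subjects missing on the second:

data check;
   set wide;
   r2 = (y2 ne .);   /* 1 if the second measurement is observed */
run;

proc ttest data = check;
   class r2;
   var y1;   /* compare both groups on the first measurement */
run;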

16.3 Simple Forms of Imputation

An alternative way to obtain a data set on which complete data methods
can be used is filling in the missing values, instead of deleting subjects with
incomplete sequences. The principle of imputation is particularly easy. The
observed values are used to impute values for the missing observations.
There are several ways to use the observed information. First, one can use
information on the same subject (e.g., last observation carried forward).
Second, information can be borrowed from other subjects (e.g., mean impu-
tation). Finally, both within and between subject information can be used
(e.g., conditional mean imputation, hot deck imputation). Standard refer-
ences are Little and Rubin (1987) and Rubin (1987). Imputation strategies
have been very popular in sample survey methods.

However, great care has to be taken with imputation strategies. Dempster
and Rubin (1983) write

The idea of imputation is both seductive and dangerous. It
is seductive because it can lull the user into the pleasurable
state of believing that the data are complete after all, and it
is dangerous because it lumps together situations where the
problem is sufficiently minor that it can be legitimately handled
in this way and situations where standard estimators applied
to the real and imputed data have substantial biases.

For example, Little and Rubin (1987) show that the method could work
for a linear model with one fixed effect and one error term, but that it
generally does not for hierarchical models, split-plot designs, and repeated
measures (with a complicated error structure), random-effects, and mixed-
effects models. At the very least, different imputations for different effects
would be necessary.

The user of imputation strategies faces several dangers. First, the imputa-
tion model could be wrong and, hence, the point estimates would be biased.
Second, even for a correct imputation model, the uncertainty resulting from
incompleteness is masked. Indeed, even when one is reasonably sure about
the mean value the unknown observation would have, the actual stochastic
realization, depending on both the mean structure as well as on the error
distribution, is still unknown.

In this section, several mean imputation strategies will be described. Ap-
plications to real data are discussed in Verbeke and Molenberghs (1997,
Chapter 5).

16.3.1 Last Observation Carried Forward

Whenever a value is missing, the last observed value is substituted. The
technique can be applied to both monotone and nonmonotone missing data.
It is typically applied to settings where incompleteness is due to attrition.

Very strong and often unrealistic assumptions have to be made to ensure
validity of this method. First, one has to believe that a subject's measure-
ment stays at the same level from the moment of dropout onward (or during
the period they are unobserved in the case of intermittent missingness). In
a clinical trial setting, one might believe that the response profile changes
as soon as a patient goes off treatment and even that it would flatten.
However, the constant profile assumption is even stronger. Further, this
method shares with other single imputation methods that it overestimates
the precision by treating imputed and actually observed values on equal
footing.

16.3.2 Imputing Unconditional Means

The idea behind unconditional mean imputation (Little and Rubin 1987)
is to replace a missing value with the average of the observed values on the
same variable over the other subjects. Thus, the term unconditional refers
to the fact that one does not use (i.e., condition on) information on the
subject for which an imputation is generated.
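
One way to carry this out in SAS is with PROC STDIZE, whose REPONLY option replaces missing values only; a sketch, with hypothetical data set and variable names:

proc stdize data = wide out = meanimp method = mean reponly;
   var y1 y2 y3 y4;   /* each missing value is replaced by the variable mean */
run;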

16.3.3 Buck’s Method: Conditional Mean Imputation

This approach was suggested by Buck (1960) and reviewed by Little and
Rubin (1987). The method is technically hardly more complex than mean
imputation. Let us describe it first for a single multivariate normal sample.
The first step is to estimate the mean vector µ and the covariance matrix
Σ from the complete cases. This step builds on the assumption that Y ∼
N (µ, Σ). For a subject with missing components, the regression of the
missing components (Y_i^m) on the observed ones (y_i^o) is

$$ Y_i^m \mid y_i^o \;\sim\; N\!\left(\mu^m + \Sigma^{mo}(\Sigma^{oo})^{-1}(y_i^o - \mu^o),\; \Sigma^{mm} - \Sigma^{mo}(\Sigma^{oo})^{-1}\Sigma^{om}\right). $$

Superscripts o and m refer to “observed” and “missing” components, re-
spectively. The second step calculates the conditional mean from this re-
gression and substitutes it for the missing values. In this way, “vertical”
information (estimates for µ and Σ) is combined with “horizontal” infor-
mation (y_i^o).
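
A minimal numerical sketch of this second step in PROC IML, with purely hypothetical values for the estimated µ and Σ and a subject observed on the first two of three components:

proc iml;
   mu    = {20, 22, 24};   /* hypothetical estimated mean vector  */
   Sigma = {4 2 2,
            2 4 2,
            2 2 4};        /* hypothetical estimated covariance   */
   o  = {1 2};             /* indices of the observed components  */
   m  = {3};               /* index of the missing component      */
   yo = {21, 23};          /* observed values for this subject    */
   /* conditional mean of the missing component given the observed ones */
   cmean = mu[m] + Sigma[m, o] * inv(Sigma[o, o]) * (yo - mu[o]);
   print cmean;
quit;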

Buck (1960) showed that under mild regularity conditions, the method
is valid for MCAR mechanisms. Little and Rubin (1987) added that the
method is valid under certain types of MAR mechanism. Even though the
distribution of the observed components is allowed to differ between com-
plete and incomplete observations, it is very important that the regression
of the missing components on the observed ones is constant across miss-
ingness patterns.

Again, this method shares with other single imputation strategies that,
although point estimation may be consistent, the precision will be under-
estimated. Little and Rubin (1987, p. 46) indicated ways to correct the
precision estimation for unconditional mean imputation.

16.3.4 Discussion of Imputation Techniques

The imputation methods reviewed here are clearly not the only ones. Little
and Rubin (1987) and Rubin (1987) mention several others. Several meth-
ods, such as hot deck imputation, are based on filling in missing values from
“matching” subjects, where an appropriate matching criterion is used.

Almost all imputation techniques suffer from the following limitations:

1. The performance of imputation techniques is unreliable. Situations
where they do work are difficult to distinguish from situations where
they prove misleading. For example, although conditional imputa-
tion is considered superior to unconditional imputation, Verbeke and
Molenberghs (1997, pp. 217–218) have seen that the latter performs
better on the growth data, introduced in Section 2.6.

2. Imputation often requires ad hoc adjustments to yield satisfactory
point estimates.

3. The methods fail to provide simple correct precision estimators.

In addition, most methods require the MCAR assumption to hold. Methods
such as the last observation carried forward require additional and often
unrealistically strong assumptions.

The main advantage, shared with complete case analysis, is that complete
data software can be used. Although a complete case analysis is even sim-
pler since one does not need to address the imputation task, the imputation
family uses all (and, in fact, too much) of the available information. With
the availability of the SAS procedure MIXED, it is no longer necessary to
stick to complete data software, since this procedure allows for measure-
ment sequences of unequal length. A discussion of multiple imputation is
postponed until Section 20.3.

16.4 Available Case Methods

Consider a single multivariate normal sample, based on i = 1, . . . , N sub-
jects, for which j = 1, . . . , n variables are planned. In a longitudinal context,
the n variables would refer to n repeated measurements. The data matrix
is Y = (yij ).

Available case methods (Little and Rubin 1987) use as much of the data
as possible. Let us restrict attention to the estimation of the mean vector
µ and the covariance matrix Σ. The jth component µj of the mean vector
and the jth diagonal variance element σjj are estimated using all cases
that are observed on the jth variable, disregarding their response status
at the other measurement occasions. The (j, k)th element (j ≠ k) of the
covariance matrix is computed using all cases that are observed on both
the jth and the kth variable.
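
For the mean vector and covariance matrix, this is what PROC CORR computes by default, since it handles missing values pairwise; a sketch, with hypothetical data set and variable names:

proc corr data = wide cov;
   var y1 y2 y3 y4;   /* each (co)variance uses all pairwise complete cases */
run;   /* adding the NOMISS option would yield a complete case analysis instead */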

This method is more efficient than the complete case method, since more
information is used. The number of components of the outcome vector has
no direct effect on the sample available for a particular mean or covariance
component.

The method is valid only under MCAR. In this respect, it is no funda-
mental improvement over a complete case analysis. An added disadvantage
is that, although more information is used and a consistent estimator is
obtained under MCAR, it is not guaranteed that the covariance matrix is
positive (semi-)definite. Of course, this is only a small-sample problem and
does not invalidate asymptotic results. However, for samples with a large
number of variables and/or with fairly high correlations between pairs of
outcomes, this nuisance feature is likely to occur.

Although a complete case analysis is possible for virtually every statistical
method and single imputation is also fairly generally applicable, extending
an available case analysis beyond multivariate means and covariances can
be tedious.

This method will be illustrated in Section 17.4.2, using the growth data
introduced in Section 2.6.

16.5 MCAR Analysis of Toenail Data

Let us analyze the toenail data, introduced in Section 2.2, assuming an
MCAR process holds. An exploratory graphical tool for studying average
evolutions over time is to plot the sample average at each occasion versus

FIGURE 16.1. Toenail Data. Estimated mean profile under treatment A (solid
line) and treatment B (dashed line), obtained under different assumptions for
the measurement model and the dropout model. (a) Completely random dropout,
without parametric model for the average evolution in both groups. (b) Com-
pletely random dropout, assuming quadratic average evolution for both groups.
(c) Random dropout, assuming quadratic average evolution for both groups. (d)
Nonrandom dropout, assuming quadratic average evolution for both groups.

time, thereby including all patients still available at that occasion. For the
toenail example, this is shown in panel (a) of Figure 16.1. The graph sug-
gests that there is very little difference between both groups, with marginal
superiority of treatment A.

Note that the sample averages at a specific occasion are unbiased estimators
of the mean responses of those subjects still in the study at that occasion.
Hence, the average profiles in panel (a) of Figure 16.1 only reflect the
marginal average evolutions if, at each occasion, the mean response of those
still in the study equals the mean response of those who already dropped
out. Thus, we have to assume that the mean of the response, conditional
on dropout status, is independent of the dropout status.

Similar assumptions for variances, correlations, and so forth are needed
for drawing valid inferences, based on sample statistics of the observed
data. This then comes down to assuming the response Y to be statistically
independent of the dropout time D (i.e., MCAR).

Under the assumption of MCAR, valid inferences based on sample statis-
tics can be obtained as follows. First, note that the vector of all 14 sample
averages plotted in panel (a) of Figure 16.1 can be interpreted as the ordi-
nary least squares (OLS) estimate obtained from fitting a two-way ANOVA
model to all available measurements, thereby ignoring the dependence be-
tween repeated measures within subjects. Under MCAR, this provides an
unbiased estimator for the marginal average evolution in the population.
Further, it follows from the theory on generalized estimating equations
(Liang and Zeger 1986) that this OLS estimator is asymptotically normally
distributed, and valid standard errors are obtained from the sandwich esti-
mator (see Section 6.2.4). Hence, Wald-type statistics are readily available
for testing hypotheses or for the calculation of approximate confidence in-
tervals.
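
A sketch of this analysis with PROC MIXED (data set and variable names are hypothetical): under an independence working covariance, the ML fixed-effects estimates are the OLS cell means, and the 'empirical' option produces the sandwich standard errors, from which Wald-type tests can be based on CONTRAST statements:

proc mixed data = toenail method = ml empirical;
   class idnr treat time;
   model y = treat*time / noint s;   /* one mean per treatment-by-time cell */
   repeated / type = simple subject = idnr;   /* independence working covariance */
run;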

For our toenail example, we used the sample averages displayed in Fig-
ure 16.1 [panel (a)] to test for any differences between both treatment
groups. The resulting Wald statistic equals χ2 = 4.704 on 7 degrees of
freedom, from which we conclude that there is no evidence in the data
for any difference between both groups (p = 0.696). Note that the above
methodology also applies if we assume the outcome to satisfy a general lin-
ear regression model, where the average evolution in both groups may be
assumed to be of a specific parametric form. We compared both treatments
assuming that the average evolution is quadratic over time, with regression
parameters possibly depending on treatment. The resulting OLS profiles
are shown in panel (b) of Figure 16.1. The main difference with the profiles
obtained from a model with unstructured mean evolutions [panel (a)] is
seen during the treatment period (first 3 months). The Wald test statistic,
employed for testing treatment differences, now equals χ2 = 2.982, on 3
degrees of freedom (p = 0.395) yielding the same conclusion as before.

In practice, patients often leave the study prematurely due to reasons re-
lated to the outcome of interest. The assumption of completely random
dropout is then no longer tenable, and statistical methods allowing for less
strict assumptions about the relation between the dropout process and the
measurement process (MAR or even MNAR) should be investigated. This
will be discussed in Sections 17.2 and 18.3.
17
Selection Models

17.1 Introduction

Much of the early development of, and debate about, selection models ap-
peared in the econometrics literature in which the Tobit model (Heckman
1976) played a central role. This combines a marginal Gaussian regression
model for the response, as might be used in the absence of missing data,
with a Gaussian-based threshold model for the probability of a value be-
ing missing. For simplicity, consider a single Gaussian-distributed response
variable Y ∼ N(µ, σ^2). The probability of Y being missing is assumed to
depend on a second Gaussian variable Y_m ∼ N(µ_m, σ_m^2), where

P(R = 0) = P(Y_m < 0).
Dependence of missingness on the response Y is induced by introducing
a correlation between Y and Ym . To avoid some of the complications of
direct likelihood maximization, a two-stage estimation procedure was pro-
posed by Heckman (1976) for this type of model. The use of the Tobit
model and associated two-stage procedure was the subject of considerable
debate in the econometrics literature, much of it focusing on the issues of
identifiability and sensitivity (Amemiya 1984, Little 1986).

At first sight, the Tobit model does not appear to have the selection model
structure specified in (15.1) in that there is no conditional partition of
f (y, r). However, it is simple to show from the joint Gaussian distribution
of Y and Ym that in the Tobit model,

P (R = 0 | Y = y) = Φ(β0 + β1 y)

for suitably chosen parameters β0 and β1 and Φ(·) the Gaussian cumulative
distribution function. This can be seen as a probit regression model for the
(binary) missing value process. This basic structure underlies the simplest
form of selection model that has been proposed for longitudinal data in
the biometric setting. A suitable response model, such as the multivariate
Gaussian, is combined with a binary regression model for dropout. At each
time point, the occurrence of dropout can be regressed on previous and
current values of the response as well as covariates. In this chapter, we
explore such models in more detail.

Especially for a continuous response, these models can be constructed in a
fairly obvious way, combining the multivariate Gaussian linear model with a
suitable dropout model. Diggle and Kenward (1994) used a logistic dropout
model in a longitudinal development of the model of Greenlees, Reece, and
Zieschang (1982) for nonrandom missingness in a cross-sectional setting.
This was subsequently extended to the nonmonotone setting, using an ante-
dependence covariance structure for the response with full likelihood and
using pseudo-likelihood (Troxel, Harrington, and Lipsitz 1998). For the full
likelihood analyses, subject-by-subject integration is required in general,
unless MAR is assumed. This makes maximization somewhat cumbersome.
The above authors (Diggle and Kenward 1994, Troxel, Harrington, and
Lipsitz 1998) used the Nelder and Mead simplex algorithm (Nelder and
Mead 1965). However, such an approach lacks flexibility, is inefficient for
high-dimensional problems, and does not exploit the well-known routines
that are implemented for the two separate components of the model. For
some combinations of response and dropout model, the EM algorithm can
be used and this does allow separate maximization of response and dropout,
hence exploiting the familiar structure, but integration is still required in
the expectation step of the algorithm.

Section 17.2 presents an introductory data analysis, based on the toenail
data. Both MAR and MNAR analyses are presented in order to appreciate
interpretational and computational differences between both. Illustrated
with a balanced set of growth data, the validity of an ignorable analysis,
using standard statistical software such as the SAS procedure MIXED, is
established in Section 17.3. A general MNAR selection model is constructed
in Section 17.5 and applied to the Vorozole data in Section 17.6.

17.2 A Selection Model for the Toenail Data

We will familiarize the reader with selection modeling by means of an in-
troductory analysis of the toenail set of data, introduced in Section 2.2 and
analyzed in the MCAR context in Section 16.5.

As formally introduced in Chapter 15, under the selection model (15.1), one
uses the functional form of f (di |yi ) to discriminate between different types
of dropout processes. Indeed, recall from Section 15.8 that under MCAR or
MAR, the joint density of observed measurements and dropout indicator
factors as

$$ f(y_i^o, d_i) = \begin{cases} f(y_i^o)\, f(d_i) & \text{under MCAR} \\ f(y_i^o)\, f(d_i \mid y_i^o) & \text{under MAR,} \end{cases} $$

from which it follows that a marginal model for the observed data y_i^o only is
required. Moreover, the measurement model f(y_i^o) and the dropout model
f(d_i) or f(d_i | y_i^o) can be fitted separately, provided that the parameters
in both models are functionally independent of each other (separability).
If interest is in the measurement model only, the dropout model can be
completely ignored (Section 15.8). This implies that, under ignorability,
MCAR and MAR provide the same fitted measurement model. However,
as discussed by Kenward and Molenberghs (1998) and by Verbeke and
Molenberghs (1997, Section 5.8), this does not imply that inferences under
MCAR and MAR are equivalent.

17.2.1 MAR Analysis

In this section, we will fit a selection model to the toenail dermatophyte
onychomycosis (TDO) data, assuming random dropout. Our primary goal
is to test for any treatment differences; hence, we do not need to explicitly
consider a dropout model, but only need to specify a marginal model for
the observed outcomes Yio . The measurement model we consider here as-
sumes a quadratic evolution for each subject, possibly with subject-specific
intercepts, and we allow the stochastic error components to be correlated
within subjects. More formally, we assume that Yio satisfies the following
linear mixed-effects model:

$$ Y_{ij}^o = \begin{cases} (\beta_{A0} + b_i) + \beta_{A1} t_{ij} + \beta_{A2} t_{ij}^2 + \varepsilon_{(1)ij} + \varepsilon_{(2)ij} & \text{group A} \\ (\beta_{B0} + b_i) + \beta_{B1} t_{ij} + \beta_{B2} t_{ij}^2 + \varepsilon_{(1)ij} + \varepsilon_{(2)ij} & \text{group B} \end{cases} \qquad (17.1) $$

This model is similar to (3.11). All random components have zero mean.
The random intercept variance is d_{11}. The variance of the measurement
error ε_{(1)i} is σ^2 I_{n_i}, whereas the variance of the serial process ε_{(2)i} is
τ^2 H_i, where H_i follows from the particular serial process considered. The
unknown parameters βA0 , βA1 , βA2 , βB0 , βB1 , and βB2 describe the average
quadratic evolution of Y_i^o over time.

Let us first assume that ε(1)ij is absent. The estimated average profiles
obtained from fitting model (17.1) to our TDO data are shown in panel
(c) of Figure 16.1. Note that there is very little difference from the OLS
average profiles shown in panel (b) of the same figure and obtained under
the MCAR assumption. The observed likelihood ratio statistic for testing
for treatment differences equals 2 ln λ = 4.626 on 3 degrees of freedom.
Hence, under model (17.1) and under the assumption of random dropout,
there is little evidence for any average difference between the treatments A
and B (p = 0.201).
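
Under ignorability, this analysis can be carried out directly with PROC MIXED. The sketch below is one possible implementation, with hypothetical data set and variable names, and with Gaussian serial correlation assumed for the serial process ε_{(2)ij}:

proc mixed data = toenail method = ml;
   class idnr treat;
   model y = treat treat*time treat*time*time / noint s;
   random intercept / subject = idnr;
   repeated / type = sp(gau)(time) subject = idnr;
run;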

17.2.2 MNAR Analysis

In cases where dropout could be related to the unobserved responses,
dropout is no longer ignorable, implying that treatment effects can no
longer be tested or estimated without explicitly taking the dropout model
f(d_i | y_i^o, y_i^m) into account. Moreover, a marginal model is then required for
the complete vector Y_i, rather than for the observed component Y_i^o only.

For the TDO data, we assume that all outcomes Yij satisfy

$$ Y_{ij} = \begin{cases} (\beta_{A0} + b_i) + \beta_{A1} t_{ij} + \beta_{A2} t_{ij}^2 + \varepsilon_{(2)ij} & \text{group A} \\ (\beta_{B0} + b_i) + \beta_{B1} t_{ij} + \beta_{B2} t_{ij}^2 + \varepsilon_{(2)ij} & \text{group B.} \end{cases} \qquad (17.2) $$

Note that, for reasons that will become clear later, the measurement error
component ε(1)ij has been removed from the model. Under MAR, model
(17.2) reduces to (17.1) with the measurement error removed, but when
MAR does not hold, we no longer have that Y_i^o satisfies model (17.1).
Further, we assume that the probability for dropout at occasion j (j =
2, . . . , ni ), given the subject was still in the study at the previous occasion,
follows a logistic regression model, in line with Diggle and Kenward (1994),

logit [P(D_i = j | D_i ≥ j, y_i)] = ψ_0 + ψ_1 y_{ij} + ψ_2 y_{i,j−1}.     (17.3)


It is assumed that d_i ≥ 2. The above model is then used to calculate the
dropout probability at each occasion, given the measurements y_i:

$$ f(d_i \mid y_i) = \begin{cases} P(D_i = d_i \mid D_i \ge d_i, y_i) \displaystyle\prod_{k=2}^{d_i-1} \left[ 1 - P(D_i = k \mid D_i \ge k, y_i) \right] & d_i \le n_i \\[2ex] \displaystyle\prod_{k=2}^{n_i} \left[ 1 - P(D_i = k \mid D_i \ge k, y_i) \right] & d_i > n_i. \end{cases} \qquad (17.4) $$

Model (17.4) implies that the dropout contribution f (di |yi ) to the log-
likelihood can be written as a product of independent contributions of
the form P (Di = di |Di ≥ di , yi ). They describe a binary outcome, condi-
tional on covariates. For such data, logistic regression methods are a natural
choice, as mentioned earlier. Implementations of this will be discussed at a
later stage.
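
For instance, with ψ_1 = 0 (random dropout), all covariates in (17.3) are observed, and the dropout model can be fitted by ordinary logistic regression applied to a person-period data set with one record per subject per occasion at risk; a sketch with hypothetical names:

proc logistic data = dropdata;
   model dropout(event = '1') = yprev;   /* yprev holds the previous outcome y_{i,j-1};
                                            dropout = 1 at the dropout occasion */
run;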

At this point, we will relate dropout to the current and previous observa-
tion only. No covariates are included, and we do not explicitly take into
account the fact that the time points at which measurements have been
taken are not evenly spaced. Diggle and Kenward (1994) consider a more
general model where dropout at occasion j can depend on the complete
history {yi1 , . . . , yi,j−1 }, as well as on external covariates. In addition, one
could argue that the dependence of dropout on the measurement history is
time dependent. This would imply that the parameters ψ = (ψ0 , ψ1 , ψ2 )
in (17.3) depend on j. However, this level of generality will not be con-
sidered here. Note also that, strictly speaking, (15.1) allows dropout at
a specific occasion to be related to all future responses. However, this is
rather counterintuitive in many cases, especially when it is difficult for the
study participants to make projections about the future responses. More-
over, including future outcomes seriously complicates the calculations since
computation of the likelihood (15.4) then requires evaluation of a possibly
high-dimensional integral. Diggle and Kenward (1994) and Molenberghs,
Kenward, and Lesaffre (1997) considered nonrandom versions of this model
by including the current, possibly unobserved, measurement. This requires
more elaborate fitting algorithms, given the high-dimensional integration mentioned
earlier. Diggle and Kenward (1994) used the simplex algorithm (Nelder and
Mead 1965), and Molenberghs, Kenward, and Lesaffre (1997) fitted their models
with the EM algorithm (Dempster, Laird, and Rubin 1977). The algorithm
of Diggle and Kenward (1994) is implemented in OSWALD (Smith, Robert-
son, and Diggle 1996). For further information on OSWALD, please consult
Section A.3.2.

We fitted the above model to our TDO data using the PCMID function in
the S-Plus suite of functions called OSWALD (Smith, Robertson, and Diggle
1996). The fitted average profiles are shown in panel (d) of Figure 16.1. Note
that we again find very little difference with the estimated average profiles
obtained from previous analyses. The observed likelihood ratio statistic
for testing for treatment differences equals 2 ln λ = 4.238 on 3 degrees of
freedom. Hence, there is again very little evidence for the presence of any
average treatment difference (p = 0.237).

The fitted dropout model equals

logit [P(D_i = j | D_i ≥ j, y_i)] = −4.26 + 0.47 y_{ij} − 0.46 y_{i,j−1},

which can be rewritten as

logit [P(D_i = j | D_i ≥ j, y_i)] = −4.26 + 0.47 (y_{ij} − y_{i,j−1}) + 0.01 y_{i,j−1},
showing that dropout is related to the increment y_{ij} − y_{i,j−1}, rather than
to any of the actual observations y_{ij} or y_{i,j−1}, and such that subjects who
improve most (large increments) are very likely to drop out from the study.
This phenomenon is very common in practice (see, e.g., Diggle and Kenward
1994, Molenberghs, Kenward and Lesaffre 1997). See also Section 19.5.2.

Special cases of model (17.3) are obtained by setting ψ_1 = 0 or ψ_1 = ψ_2 = 0,
respectively. In the first case, dropout is no longer allowed to depend on
the current measurement, implying random dropout. In the second case,
dropout is independent of the outcome, which corresponds to completely
random dropout. Thus, under the assumed model, it is possible to test for
nonignorable dropout. The likelihood ratio test statistic, comparing the
maximized likelihood under model (17.3) with the maximized likelihood
under the same model with ψ1 = 0, equals 2 ln λ = 25.386, which is highly
significant (p < 0.0001) on 1 degree of freedom. Hence, conditional on the
validity of model (17.3), there is a lot of evidence for nonrandom dropout.
However, some caution is needed, as will be indicated next.

The main advantage of selection models for nonrandom dropout is that
they directly model the quantities which are usually of primary interest:
the marginal distribution of the outcome vector Yi and the distribution
of the dropout process conditional on Yi . The former is used for marginal
inferences on longitudinal profiles; the latter is used to characterize the
dropout process (MCAR, MAR, MNAR). It is important to realize that a
model is needed for the complete data vector Yi rather than for the ob-
served component Y_i^o only. Thus, even if a posited model fits the observed
outcomes and the nonresponse data well, one can easily find a model with a
similar fit but different in the predictions for the unobserved outcomes. Such
a model may yield different conclusions about key aspects of the outcome
and nonresponse mechanisms. Thus, one is faced with untestable assump-
tions (see also Section 18.1, p. 278). This clearly indicates a great sensitivity
of the conclusions to the stated complete data model. Several authors have
pointed to this sensitivity, such as Rubin (1994), Laird (1994), Little (1995),
Hogan and Laird (1997), and Molenberghs, Kenward and Lesaffre (1997). A
good example is given by Kenward (1998), who reanalyzed data on mastitis
in dairy cows, previously analyzed by Diggle and Kenward (1994). He found
that all evidence for nonrandom dropout vanishes when the normality as-
sumption for the second, possibly unobserved, outcome conditional on the
first, always observed, outcome is replaced by a heavy-tailed t-distribution.
See also Section 19.5.1. Further illustrations are Molenberghs, Verbeke, et
al. (1999) and Kenward and Molenberghs (1999). Thus, clearly, caution is
required and, preferably, a sensitivity analysis should be conducted. Formal
tools to carry out such an analysis are provided in Chapter 19.

Another reason why a model is needed not only for the observed but also
for the complete data is that the missing components need to be integrated
out from the likelihood. Technically, this implies that, at present, specialized soft-
ware, based on computationally intensive and highly unstable algorithms,
is required for fitting nonrandom dropout models.

Another example of the sensitivity of selection models is found in the con-
text of the TDO example. We reanalyzed the data, by allowing a measure-
ment error component in (17.2) to be present:

$$ Y_{ij} = \begin{cases} (\beta_{A0} + b_i) + \beta_{A1} t_{ij} + \beta_{A2} t_{ij}^2 + \varepsilon_{(1)ij} + \varepsilon_{(2)ij} & \text{group A} \\ (\beta_{B0} + b_i) + \beta_{B1} t_{ij} + \beta_{B2} t_{ij}^2 + \varepsilon_{(1)ij} + \varepsilon_{(2)ij} & \text{group B.} \end{cases} \qquad (17.5) $$
Table 17.1 shows a summary of the results from model (17.5) as well as from
model (17.2). The estimated amount of variability explained by the ε_{(1)ij}
equals σ̂^2/(σ̂^2 + τ̂^2 + d̂_{11}) = 6%. As before, we again find no evidence for any
difference in average evolution between both treatment groups (p = 0.388).
The main difference between the results from both models is found in the
LR test for random dropout. The likelihood ratio test statistic reduces from
2 ln λ = 25.386 under model (17.2) to 2 ln λ = 4.432 under model (17.5).
The reason for this can be found in the estimated correlations corr̂(y_{ij}, y_{ik})
of outcomes within subjects, also shown in Table 17.1. Under model (17.5),
outcomes are much more correlated than under model (17.2), explaining
why, under model (17.5), the current observation is less needed for predict-
ing dropout, once the previous measurement is known.

In practice, the covariance structure is often considered a nuisance, and very
little effort is spent in finding adequate covariance models. As shown in the
above example, this becomes crucial when dropout is present. Statistical
packages such as the SAS procedure MIXED (1997) nowadays allow one to
fit linear mixed models with a variety of covariance structures. However,
as illustrated by Verbeke, Lesaffre and Brant (1998) and by Lesaffre, Asefa
and Verbeke (1999), finding appropriate covariance models is often far from
TABLE 17.1. Toenail Data. Summary of results obtained from fitting model (17.2)
and model (17.5), in combination with dropout model (17.3), to the TDO data.

Model (17.2)
  Variability explained by measurement error:   0%
  LR test for treatment differences:   2 ln λ = 4.238, p = 0.237
  LR test for MAR:   2 ln λ = 25.386, p < 0.0001
  Fitted correlations:
      1   .87  .77  .70  .58  .53  .52
           1   .87  .77  .61  .54  .52
                1   .87  .65  .56  .53
                     1   .70  .58  .53
                          1   .70  .58
                               1   .70
                                    1

Model (17.5)
  Variability explained by measurement error:   6%
  LR test for treatment differences:   2 ln λ = 3.024, p = 0.388
  LR test for MAR:   2 ln λ = 4.432, p = 0.035
  Fitted correlations:
      1   .91  .89  .86  .79  .73  .70
           1   .91  .89  .81  .75  .70
                1   .91  .83  .77  .71
                     1   .86  .79  .73
                          1   .86  .79
                               1   .86
                                    1

straightforward. It is therefore advisable to compare results from several
models, with varying plausible covariance structures. See also Chapter 10.

17.3 Scope of Ignorability

Let us, as in the toenail data example in Section 17.2.1, assume that MAR
holds. Assume, in addition, that the separability condition is satisfied (Sec-
tion 15.8). In Section 15.8, it was argued, and in Section 17.2.1 it was re-
iterated, that likelihood based inference is valid, whenever the mechanism
is MAR and provided the technical condition holds that the parameters
describing the nonresponse mechanism are distinct from the measurement
model parameters. In other words, the missing data process should be ig-
norable in the likelihood inference sense, since, then, factorization (15.6)
applies and the log-likelihood partitions into two functionally independent
components.

This implies that a module with likelihood estimation facilities, which can
handle incompletely observed subjects because its units are measurements
rather than subjects, manipulates the correct likelihood and leads to valid
likelihood ratios. We will qualify this statement in more detail since, al-
though this is an extremely important feature of PROC MIXED and in
fact of any flexibly written linear mixed model likelihood optimization rou-
tine, a few cautionary remarks still apply.

1. Ignorability depends on the often implicit assumption that the sci-
entific interest is directed toward the measurement model parame-
ters θ (fixed effects, variance components, or a combination of both)
and that the missing data mechanism parameters ψ are nuisance
parameters. This is not always true. For instance, when the ques-
tion of predicting an individual’s measurement profile (individual or
group averaged) is raised, past the time of dropout and given that
she dropped out, then both parameter vectors θ and ψ need to be
estimated. However, due to the ignorability and the resulting parti-
tioning of the likelihood, one can construct a model for nonresponse,
separately from the linear mixed measurement model. As a practical
consequence, the software module to estimate the missingness model
parameters can be chosen independently from PROC MIXED. Of-
ten, categorical data analysis methods such as logistic regression will
be a sensible choice in this respect.
2. Likelihood inference is often surrounded with references to the sam-
pling distribution (e.g., to construct precision estimators and for sta-
tistical hypothesis tests; Kenward and Molenberghs 1998). This issue
and its relationship to the PROC MIXED implementation is dis-
cussed further in Chapter 21.
3. Even though the assumption of likelihood ignorability encompasses
the MAR and not only the more stringent and often implausible
MCAR mechanisms, it is difficult to exclude the option of a more gen-


eral nonrandom dropout mechanism. One solution is to fit an MNAR
model as proposed by Diggle and Kenward (1994). This was done for
the toenail data in Section 17.2.2. Diggle and Kenward (1994) fitted
models to the full data using the simplex algorithm (Nelder and Mead
1965). Alternatively, the EM algorithm can be used, as proposed by
Molenberghs, Kenward, and Lesaffre (1997), for the longitudinal cat-
egorical data problem. A module for the linear mixed model with
dropout is implemented in the OSWALD software, written for S-Plus
(Smith, Robertson, and Diggle 1996). It is based on an extension of
the Diggle and Kenward (1994) model. A SAS program for an EM
algorithm for the linear model would consist of three parts. First,
a macro to carry out the E step has to be written, where the ex-
pected value of the observed data likelihood is computed conditional
on the current parameter vector and on the observed data. The M
step consists of two substeps, where PROC MIXED might be used to
maximize the measurement process likelihood and a different routine
(e.g., logistic regression) could be called to maximize the nonresponse
likelihood. Diggle and Kenward (1994) assumed a logistic model for
the dropout process. The EM algorithm will be sketched in Chap-
ter 22.

17.4 Growth Data

The growth data have been introduced in Section 2.6. Section 17.4.1 an-
alyzes the original set of data (i.e., without artificially removed subjects).
The incomplete version, generated by Little and Rubin (1987), is studied
in Section 17.4.3, and the missingness process is studied in Section 17.4.4.

17.4.1 Analysis of Complete Growth Data

Following guidelines in Chapter 9 and in Diggle, Liang, and Zeger (1994)


model building should proceed by constructing an adequate description of
the variability on an appropriate set of residuals. These residuals are prefer-
ably found by subtracting a saturated sex by time mean model from the
measurements. When a satisfactory covariance model is found, attention
would then shift to simplification of the mean structure. However, this in-
sight is relatively recent and was certainly not the standard procedure in
the mid-eighties. Jennrich and Schluchter (1986) constructed eight models,
where the first three concentrate on the mean model, leaving the 4 × 4
covariance matrix of the repeated measurements completely unstructured.
Once an adequate mean model is found, the remaining five models are fit to
enable simplification of the covariance structure. Jennrich and Schluchter
(1986) primarily wanted to illustrate their estimation procedures and did
not envisage a comprehensive data analysis. Moreover, since this proce-
dure can be considered legitimate in small balanced studies and also for
reasons of comparability, we will, at first, adopt the same eight models, in
the same order. In this section, these models will be fitted to the original
data, referred to henceforth as the complete data set. The results of Jen-
nrich and Schluchter (1986) will be recovered and additional insight will
be given. In Section 17.4.3, these solutions will be compared to the results
for the same eight models on the incomplete data. Jennrich and Schluchter
(1986) used Newton-Raphson, Fisher scoring, and generalized Expectation-
Maximization (EM) algorithms to maximize the log-likelihood. We will
show that the data can be analyzed relatively easily using PROC MIXED.

The models of Jennrich and Schluchter (1986) can be expressed in the
general linear mixed models family (3.8):

Yi = Xi β + Zi bi + εi , (17.6)

where

bi ∼ N (0, D),
εi ∼ N (0, Σ),

and b_i and ε_i are statistically independent. As earlier (Section 3.3), Y_i is
the (4×1) response vector, X_i is a (4×p) design matrix for the fixed effects,
β is a vector of unknown fixed regression coefficients, Zi is a (4 × q) design
matrix for the random effects, bi is a (q × 1) vector of normally distrib-
uted random parameters, with covariance matrix D, and εi is a normally
distributed (4 × 1) random error vector, with covariance matrix Σ. Since
every subject contributes exactly four measurements at exactly the same
time points, it has been possible to drop the subscript i from the error
covariance matrix Σ. The random error εi encompasses both measurement
error (as in a cross-sectional study) and serial correlation. In this study,
the design will be a function of age, sex, and/or the interaction between
both. Let us indicate boys with xi = 0, girls with xi = 1, and age with
tj = 8, 10, 12, 14.

Model 1

The first model we will consider assumes a separate mean for each of
the eight age×sex combinations, together with an unstructured covariance.
This is done by assuming that the covariance matrix Σ of the error vector
ε_i is a completely general positive definite matrix and no random effects
are included.

This model can be expressed as

Y_{i1} = β_0 + β_1 x_i + β_{0,8}(1 − x_i) + β_{1,8} x_i + ε_{i1},
Y_{i2} = β_0 + β_1 x_i + β_{0,10}(1 − x_i) + β_{1,10} x_i + ε_{i2},      (17.7)
Y_{i3} = β_0 + β_1 x_i + β_{0,12}(1 − x_i) + β_{1,12} x_i + ε_{i3},
Y_{i4} = β_0 + β_1 x_i + ε_{i4},

or, in matrix notation,

Yi = Xi β + εi ,

with
$$ X_i = \begin{pmatrix} 1 & x_i & 1-x_i & 0 & 0 & x_i & 0 & 0 \\ 1 & x_i & 0 & 1-x_i & 0 & 0 & x_i & 0 \\ 1 & x_i & 0 & 0 & 1-x_i & 0 & 0 & x_i \\ 1 & x_i & 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix} $$

and β = (β_0, β_1, β_{0,8}, β_{0,10}, β_{0,12}, β_{1,8}, β_{1,10}, β_{1,12})′. With this parameteri-
zation, the means for girls are β0 + β1 + β1,8 ; β0 + β1 + β1,10 ; β0 + β1 + β1,12 ;
and β0 +β1 at ages 8, 10, 12, and 14, respectively. The corresponding means
for boys are β0 + β0,8 ; β0 + β0,10 ; β0 + β0,12 ; and β0 , respectively. Of course,
there are many equivalent ways to express the set of eight means in terms
of eight linearly independent parameters.

This model can, for example, be fitted with the following SAS code:

proc mixed data = growth method = ml covtest;
title 'Growth Data, Model 1';
class idnr sex age;
model measure = sex age*sex / s;
repeated / type = un subject = idnr r rcorr;
run;

Let us discuss the fit of the model. The deviance (minus twice the log-
likelihood at maximum) equals 416.5093, and there are 18 model parame-
ters (8 mean, 4 variance, and 6 covariance parameters). This deviance will
serve as a reference to assess the goodness-of-fit of simpler models. Pa-
rameter estimates and standard errors are reproduced in Table 17.2. The
deviances are listed in Table 17.4.

TABLE 17.2. Growth Data. Maximum likelihood estimates and standard errors
(model based and empirically corrected) for the fixed effects in Model 1 (complete
data set).

Parameter      MLE        (s.e.)(1)    (s.e.)(2)
β0 24.0909 (0.6478) (0.7007)
β1 3.3778 (0.8415) (0.8636)
β0,8 −2.9091 (0.6475) (0.3793)
β0,10 −1.8636 (0.4620) (0.3407)
β0,12 −1.0000 (0.5174) (0.2227)
β1,8 −4.5938 (0.5369) (0.6468)
β1,10 −3.6563 (0.3831) (0.4391)
β1,12 −1.7500 (0.4290) (0.5358)
(1) Default s.e.'s under Model 1.
(2) Sandwich s.e.'s, obtained from the “empirical” option; also obtained under Model 0.

The estimated covariance matrix Σ̂ of the error vector, based on this model,
equals

$$ \widehat{\Sigma} = \begin{pmatrix} 5.0143 & 2.5156 & 3.6206 & 2.5095 \\ 2.5156 & 3.8748 & 2.7103 & 3.0714 \\ 3.6206 & 2.7103 & 5.9775 & 3.8248 \\ 2.5095 & 3.0714 & 3.8248 & 4.6164 \end{pmatrix} \qquad (17.8) $$

with corresponding correlation matrix

$$ \begin{pmatrix} 1.0000 & 0.5707 & 0.6613 & 0.5216 \\ 0.5707 & 1.0000 & 0.5632 & 0.7262 \\ 0.6613 & 0.5632 & 1.0000 & 0.7281 \\ 0.5216 & 0.7262 & 0.7281 & 1.0000 \end{pmatrix}. \qquad (17.9) $$

These quantities are easily obtained in PROC MIXED by using the options
‘r’ and ‘rcorr’ in the REPEATED statement (see Section 8.2.6). Apparently,
the variances are close to each other, and so are the correlations.

Even though we opted to follow closely the models discussed in Jennrich
and Schluchter (1986), it is instructive to consider a more elaborate model,
termed Model 0, where a separate unstructured covariance matrix is as-
sumed for each of the two sex groups. This model has 10 extra parameters
and can be fitted to the data using the following SAS code:

TABLE 17.3. Growth Data. Predicted means.

                     Boys                Girls
Model   Age   Estimate   s.e.    Estimate   s.e.

1 8 22.88 0.56 21.18 0.68
10 23.81 0.49 22.23 0.59
12 25.72 0.61 23.09 0.74
14 27.47 0.54 24.09 0.65
2 8 22.46 0.49 21.24 0.59
10 24.11 0.45 22.19 0.55
12 25.76 0.47 23.14 0.57
14 27.42 0.54 24.10 0.65
3 8 22.82 0.48 20.77 0.57
10 24.16 0.45 22.12 0.55
12 25.51 0.47 23.47 0.56
14 26.86 0.52 24.82 0.60
4 8 22.64 0.53 21.22 0.64
10 24.23 0.48 22.17 0.57
12 25.83 0.48 23.12 0.57
14 27.42 0.53 24.07 0.64
5 8 22.75 0.54 21.19 0.66
10 24.29 0.44 22.16 0.53
12 25.83 0.44 23.13 0.53
14 27.37 0.54 24.09 0.66
6 8 22.62 0.51 21.21 0.61
10 24.18 0.47 22.17 0.56
12 25.75 0.48 23.13 0.58
14 27.32 0.55 24.09 0.67
7 8 22.62 0.52 21.21 0.63
10 24.18 0.47 22.17 0.57
12 25.75 0.47 23.13 0.57
14 27.32 0.52 24.09 0.63
8 8 22.62 0.46 21.21 0.56
10 24.18 0.30 22.17 0.37
12 25.75 0.30 23.13 0.37
14 27.32 0.46 24.09 0.56
Source: Jennrich and Schluchter (1986).

proc mixed data = growth method = ml covtest;
title 'Growth Data, Model 0';
class idnr sex age;
model measure = sex age*sex / s;
repeated / type = un subject = idnr
r = 1,12 rcorr = 1,12 group = sex;
run;

Since this model has individual-specific covariance matrices (although there
are only two values, one for each gender), Σ has to be replaced by Σ_i.

These separate covariance matrices are requested by means of the ‘group=’
option. These matrices and the corresponding correlation matrices are
printed using the ‘r=’ and ‘rcorr=’ options. The estimated covariance ma-
trix for girls is
$$ \begin{pmatrix} 4.1033 & 3.0496 & 3.9380 & 3.9607 \\ 3.0496 & 3.2893 & 3.6612 & 3.7066 \\ 3.9380 & 3.6612 & 5.0826 & 4.9690 \\ 3.9607 & 3.7066 & 4.9690 & 5.4008 \end{pmatrix} $$

with corresponding correlation matrix

$$ \begin{pmatrix} 1.0000 & 0.8301 & 0.8623 & 0.8414 \\ 0.8301 & 1.0000 & 0.8954 & 0.8794 \\ 0.8623 & 0.8954 & 1.0000 & 0.9484 \\ 0.8414 & 0.8794 & 0.9484 & 1.0000 \end{pmatrix}. $$

The corresponding quantities for boys are

$$ \begin{pmatrix} 5.6406 & 2.1484 & 3.4023 & 1.5117 \\ 2.1484 & 4.2773 & 2.0566 & 2.6348 \\ 3.4023 & 2.0566 & 6.5928 & 3.0381 \\ 1.5117 & 2.6348 & 3.0381 & 4.0771 \end{pmatrix} $$

and

$$ \begin{pmatrix} 1.0000 & 0.4374 & 0.5579 & 0.3152 \\ 0.4374 & 1.0000 & 0.3873 & 0.6309 \\ 0.5579 & 0.3873 & 1.0000 & 0.5860 \\ 0.3152 & 0.6309 & 0.5860 & 1.0000 \end{pmatrix}. $$
From these, we suspect that there is a non-negligible difference between
the covariance structures for boys and girls, with, in particular, a weaker
correlation among the boys’ measurements. This is indeed supported by a
deviance of 23.77 on 10 degrees of freedom (p = 0.0082). Nevertheless, the
point estimates for the fixed effects coincide exactly with the ones obtained
from Model 1 (see Table 17.2). However, even if attention is restricted to
fixed-effects inference, one still needs to address the quality of the esti-
mates of precision. To this end, there are, in fact, two solutions. First, the
more elaborate Model 0 can be fitted, as was done already. A drawback is
that this model has 28 parameters altogether, which is quite a substantial
number for such a small data set, implying that the asymptotic behavior
of, for example, the deviance statistic becomes questionable. As discussed
in Section 6.2.4, an alternative solution consists of retaining Model 1 and
estimating the standard errors by means of the so-called robust estima-
tor (equivalently termed “sandwich” or “empirically corrected” estimator;
Liang and Zeger 1986). To this end, the following code can be used:

proc mixed data = growth method = ml covtest empirical;
title 'Growth Data, Model 1, Empirically Corrected';
class idnr sex age;
model measure = sex age*sex / s;
repeated / type = un subject = idnr r rcorr;
run;

Here, the ‘empirical’ option is added to the PROC MIXED statement. This
method yields a consistent estimator of precision, even if the covariance
model is misspecified. In this particular case (a full factorial mean model),
both methods (Model 0 on the one hand and the empirically corrected
Model 1 on the other hand) lead to exactly the same standard errors. This
illustrates that the robust method can be advantageous if correct standard
errors are required, but finding an adequate covariance model is judged too
involved. The robust standard errors are presented in Table 17.2 as the
second entry in parentheses. It is seen that the naive standard errors are
somewhat smaller than their robust counterparts, except for the parameters
β1,8 , β1,10 , and β1,12 , where they are considerably larger. Even though the
relation between the standard errors of the “correct model” (here, Model
0) and the empirically corrected “working model” (here, Model 1) will
not always be a mathematical identity, the empirically corrected estimator
option is a useful tool to compensate for misspecification in the covariance
model.

Let us now return to our discussion of Model 1. It is insightful to consider
the means for each of the eight categories explicitly. These means are pre-
sented in Table 17.3 for Models 1–8. The first panel of Figure 17.1 depicts
the eight individual group means, connected to form two profiles, one for
each sex group. Clearly, there seems to be a linear trend in both profiles
as well as a vague indication for diverging lines, and hence different slopes.

FIGURE 17.1. Growth Data. Profiles for the complete data, from a selected set
of models.

These hypotheses will be assessed on the basis of likelihood ratio tests,
using the simpler Models 2 and 3.

Model 2

The first simplification occurs by assuming a linear trend within each sex
group. This implies that each profile can be described with two parameters
(intercept and slope), instead of with four unstructured means. The error
matrix Σ will be left unstructured. The model can be expressed as
$$ Y_{ij} = \beta_0 + \beta_{01} x_i + \beta_{10} t_j (1 - x_i) + \beta_{11} t_j x_i + \varepsilon_{ij} \qquad (17.10) $$

or, in matrix notation,

Y_i = X_i β + ε_i,

where the design matrix changes to

$$ X_i = \begin{pmatrix} 1 & x_i & 8(1-x_i) & 8x_i \\ 1 & x_i & 10(1-x_i) & 10x_i \\ 1 & x_i & 12(1-x_i) & 12x_i \\ 1 & x_i & 14(1-x_i) & 14x_i \end{pmatrix} $$

TABLE 17.4. Growth Data. Complete data set. Model fit summary.

Model   Mean       Covar    par   −2ℓ       Ref   G2       df    p
1       unstr.     unstr.   18    416.509
2       ≠ slopes   unstr.   14    419.477   1     2.968    4     0.5632
3       = slopes   unstr.   13    426.153   2     6.676    1     0.0098
4       ≠ slopes   Toepl.   8     424.643   2     5.166    6     0.5227
5       ≠ slopes   AR(1)    6     440.681   2     21.204   8     0.0066
                                            4     16.038   2     0.0003
6       ≠ slopes   random   8     427.806   2     8.329    6     0.2150
7       ≠ slopes   CS       6     428.639   2     9.162    8     0.3288
                                            4     3.996    2     0.1356
                                            6     0.833    2     0.6594
                                            6     0.833    1:2   0.5104
8       ≠ slopes   simple   5     478.242   7     49.603   1     <0.0001
                                            7     49.603   0:1   <0.0001

and β = (β_0, β_{01}, β_{10}, β_{11})′. Here, β_0 is the intercept for boys and β_0 + β_{01}
is the intercept for girls. The slopes are β10 and β11 , respectively.

The SAS code for Model 1 can be adapted simply by deleting age from the
CLASS statement.
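
For completeness, the resulting program reads:

proc mixed data = growth method = ml covtest;
title 'Growth Data, Model 2';
class idnr sex;
model measure = sex age*sex / s;
repeated / type = un subject = idnr r rcorr;
run;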

The likelihood ratio test comparing Model 2 to Model 1 does not reject
the null hypothesis of linearity. A summary of model fitting information
for this and subsequent models as well as for comparisons between models
is given in Table 17.4. The first column contains the model number, and a
short description of the model is given in the second and third columns, in
terms of the mean and covariance structures, respectively. The number of
parameters is given next, as well as the deviance (−2ℓ). The column labeled
“Ref” displays one (or more) numbers of models to which the current model
is compared. The G2 likelihood ratio statistic is the difference between the
−2ℓ values of the current and the reference model. The final columns contain
the number of degrees of freedom and the p-value of the corresponding
likelihood ratio test.
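
As a worked illustration of how these entries combine, comparing Model 2 with its reference Model 1 gives

$$ G^2 = 419.477 - 416.509 = 2.968 $$

on 18 − 14 = 4 degrees of freedom (p = 0.5632).

Model 2 predicts the following mean growth curves: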

girls: Ŷ_j = 17.43 + 0.4764 t_j,
boys: Ŷ_j = 15.84 + 0.8268 t_j.

These profiles are visualized in the second panel of Figure 17.1. The ob-
served means are added to the graph. The mean model seems acceptable,
consistent with the likelihood ratio test. The estimated covariance and cor-
relation matrices of the measurements are similar to the ones found for
Model 1.

Model 3

The next step is to investigate whether the two profiles are parallel. Al-
though the plot for Model 2 suggests that the profiles are diverging, the
question remains whether this effect is statistically significant. The model
can be described as follows:

Yij = β0 + β01 xi + β1 tj + εij . (17.11)

The design matrix Xi simplifies further:


⎛ ⎞
1 xi 8
⎜ 1 xi 10 ⎟
Xi = ⎜ ⎝ 1 xi

12 ⎠
1 xi 14

and β = (β_0, β_{01}, β_1)′. The two slopes in Model 2 have been replaced by
β1 , a slope common to boys and girls.

Model 3 can be fitted in PROC MIXED by replacing the model statement
in Model 2 with

model measure = sex age / s;

The predicted growth curves are

girls: Ŷ_j = 15.37 + 0.6747 t_j,
boys: Ŷ_j = 17.42 + 0.6747 t_j.

Table 17.4 reveals that the likelihood ratio test statistic (comparing Models
2 and 3) rejects the common slope hypothesis (p = 0.0098). This is consis-
tent with the systematic deviation between observed and expected means
in the third panel of Figure 17.1.

In line with the choice of Jennrich and Schluchter (1986), the mean struc-
ture of Model 2 will be kept. We will now turn our attention to simplifying
the covariance structure.

Graphical Exploration

Figure 17.2 presents the 27 individual profiles. The left-hand panel shows
the raw profiles, exhibiting the time trend found in the mean model. To

FIGURE 17.2. Growth Data. Raw and residual profiles for the complete data set.
(Girls are indicated with solid lines. Boys are indicated with dashed lines.)

obtain a rough idea about the covariance structure, it is useful to look at
the right-hand panel, which gives the profiles after subtracting the means
predicted by Model 2. Since these means agree closely with the observed
means (see Figure 17.1), the corresponding sets of residuals are equivalent.
A noticeable though not fully general trend is that a profile tends to be
high or low as a whole, which points to a random intercept. Apparently,
the variance of the residuals is roughly constant over time, implying that
the random-effects structure is probably confined to the intercept. This
observation is consistent with correlation matrix (17.9) of the unstructured
Model 1. A more formal exploration can be done by means of the variogram
(Diggle, Liang, and Zeger 1994, p. 51) or its extensions (Verbeke, Lesaffre,
and Brant 1998). See also Section 10.4.4.

Jennrich and Schluchter (1986) considered several covariance structure
models, which are all included in PROC MIXED as standard options.

Model 4

The first covariance structure model is the so-called Toeplitz covariance
matrix. Mean model formula (17.10) of Model 2 still applies, but the error
vector ε_i is now assumed to follow an N(0, Σ) distribution, where
Σ is constrained to σ_{ij} = α_{|i−j|}; that is, the covariance depends on the
measurement occasions through the lag between them only. In addition, Σ
is assumed to be positive definite. For the growth data, there are only 4
free parameters, α0 , α1 , α2 , and α3 , instead of 10 in the unstructured case.
The relationship among the α parameters is left unspecified. In the sequel,
such additional constraints will lead to first-order autoregressive (Model 5)
or exchangeable (Model 7) covariance structures.

To fit this model with PROC MIXED, the REPEATED statement needs
to be changed, leading to the following program:

proc mixed data = growth method = ml covtest;
title 'Growth Data, Model 4';
class sex idnr;
model measure = sex age*sex / s;
repeated / type = toep subject = idnr r rcorr;
run;

Comparing the likelihood of this model to the one of the reference Model
2 shows that Model 4 is consistent with the data (see Table 17.4). The
covariance matrix is
$$ \begin{pmatrix} 4.9439 & 3.0507 & 3.4054 & 2.3421 \\ 3.0507 & 4.9439 & 3.0507 & 3.4054 \\ 3.4054 & 3.0507 & 4.9439 & 3.0507 \\ 2.3421 & 3.4054 & 3.0507 & 4.9439 \end{pmatrix} $$

and the derived correlation matrix is

$$ \begin{pmatrix} 1.0000 & 0.6171 & 0.6888 & 0.4737 \\ 0.6171 & 1.0000 & 0.6171 & 0.6888 \\ 0.6888 & 0.6171 & 1.0000 & 0.6171 \\ 0.4737 & 0.6888 & 0.6171 & 1.0000 \end{pmatrix}. $$
The lag 2 correlation is slightly higher than the lag 1 correlation, while the
lag 3 correlation shows a drop. In light of the standard errors of the covari-
ance parameters (0.9791, 0.9812, and 1.0358, respectively), this observation
should not be seen as clear evidence for a particular trend.

Note that this structure constrains the variance to be constant across time.
Should this assumption be considered unrealistic, then heterogeneous ver-
sions can be fitted instead, combining the correlation matrix from the ho-
mogeneous version with variances that are allowed to change over time.
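In PROC MIXED, such a heterogeneous Toeplitz structure is available in recent releases as the ‘type=TOEPH’ option; a minimal sketch of the corresponding REPEATED statement, everything else as in the Model 4 program:

repeated / type = toeph subject = idnr r rcorr;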

At this point, Model 4 can replace Model 2 as the most parsimonious model,
consistent with the data found so far. Whether or not further simplifications
are possible will be investigated next.

Model 5

A special case of the Toeplitz model is the first-order autoregressive model.


This model is based on the assumption that the covariance between two
measurements is a decreasing function of the time lag between them:

σij = σ 2 ρ|i−j| .

In other words, the variance of the measurements equals σ 2 , and the co-
variance decreases with increasing time lag if ρ > 0. To fit this model
with PROC MIXED, the REPEATED statement should include the op-
tion ‘type=AR(1)’.
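The full program is otherwise identical to that of Model 4; a minimal sketch:

proc mixed data = growth method = ml covtest;
title 'Growth Data, Model 5';
class sex idnr;
model measure = sex age*sex / s;
repeated / type = ar(1) subject = idnr r rcorr;
run;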

The estimated covariance matrix is


⎛ ⎞
4.8903 2.9687 1.8021 1.0940
⎜ 2.9687 4.8903 2.9687 1.8021 ⎟
⎜ ⎟.
⎝ 1.8021 2.9687 4.8903 2.9687 ⎠
1.0940 1.8021 2.9687 4.8903

The correlation matrix is


⎛ ⎞
1.0000 0.6070 0.3685 0.2237
⎜ 0.6070 1.0000 0.6070 0.3685 ⎟
⎜ ⎟.
⎝ 0.3685 0.6070 1.0000 0.6070 ⎠
0.2237 0.3685 0.6070 1.0000

Table 17.4 reveals that there is an apparent lack of fit for this model, when
compared to Model 2. Jennrich and Schluchter (1986) compared Model 5 to
Model 2 as well. Alternatively, we might want to compare Model 5 to Model
4. This more parsimonious test (2 degrees of freedom) yields p = 0.0003,
strongly rejecting the AR(1) structure.

Model 6

An alternative simplification of the unstructured covariance Model 2 is


given by allowing the intercept and slope parameters to be random. This is
an example of model (17.6) with fixed-effects design matrix Xi as in Model
2 [Eq. (17.10)], random-effects design matrix
$$
Z_i = \begin{pmatrix}
1 & 8 \\
1 & 10 \\
1 & 12 \\
1 & 14
\end{pmatrix},
$$

as well as measurement error structure Σ = σ 2 I4 .
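In PROC MIXED, this model is specified with a RANDOM rather than a REPEATED statement; a minimal sketch, anticipating the unstructured $D$ assumed below (the ‘g’ and ‘v’ options request the estimated $D$ and $V_i$ matrices discussed next):

proc mixed data = growth method = ml covtest;
title 'Growth Data, Model 6';
class sex idnr;
model measure = sex age*sex / s;
random intercept age / type = un subject = idnr g v;
run;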



An unstructured covariance matrix D for the random effects bi will be


assumed. The matrix D (requested by the ‘g’ option in the RANDOM
statement) is estimated to be

$$
\widehat{D} = \begin{pmatrix} 4.5569 & -0.1983 \\ -0.1983 & 0.0238 \end{pmatrix}. \qquad (17.12)
$$

One easily calculates the resulting covariance matrix of $Y_i$: $V_i = Z_i D Z_i' + \sigma^2 I_4$, which is estimated by
$$
\widehat{V}_i = Z_i \widehat{D} Z_i' + \widehat{\sigma}^2 I_4 = \begin{pmatrix}
4.6216 & 2.8891 & 2.8727 & 2.8563 \\
2.8891 & 4.6839 & 3.0464 & 3.1251 \\
2.8727 & 3.0464 & 4.9363 & 3.3938 \\
2.8563 & 3.1251 & 3.3938 & 5.3788
\end{pmatrix}, \qquad (17.13)
$$

where $\widehat{\sigma}^2 = 1.7162$. Of course, this matrix can be requested by the ‘v’
option in the RANDOM statement as well. Thus, this covariance matrix
is a function of four parameters (three random-effects parameters and one
measurement error parameter). The corresponding estimated correlation
matrix is
$$
\begin{pmatrix}
1.0000 & 0.6209 & 0.6014 & 0.5729 \\
0.6209 & 1.0000 & 0.6335 & 0.6226 \\
0.6014 & 0.6335 & 1.0000 & 0.6586 \\
0.5729 & 0.6226 & 0.6586 & 1.0000
\end{pmatrix}. \qquad (17.14)
$$
This model is a submodel of Model 2, but not of Model 4 since the cor-
relations increase within each diagonal, albeit only moderately since the
variance of the random slope is very modest. From Table 17.4, we observe
that this model is a plausible simplification of Model 2. It has the same
number of degrees of freedom as Model 4, although the latter one has a
slightly smaller deviance.

Since the variance of the random slope is small, it is natural to explore


whether a random intercept model is adequate.

Model 7

A random intercept model is given by $Z_i = (1\;1\;1\;1)'$, with variance of the
random intercepts equal to $d$. The resulting covariance matrix of $Y_i$ is
$$
V_i = Z_i\, d\, Z_i' + \sigma^2 I_4 = d J_4 + \sigma^2 I_4,
$$
where $J_4$ is a $(4 \times 4)$ matrix of ones. This covariance structure is called ex-
changeable or compound symmetry. Another term is intraclass correlation
(Section 3.3.2). All correlations are equal to $d/(d + \sigma^2)$, implying that this
model is a submodel of Models 4 and 6, as well as of Model 2. It can be
fitted in PROC MIXED with two equivalent programs:

proc mixed data = growth method = ml covtest;


title ’Jennrich and Schluchter, Model 7’;
class sex idnr;
model measure = sex age*sex / s;
random intercept / type = un subject = idnr g;
run;

and

proc mixed data = growth method = ml covtest;


title ’Jennrich and Schluchter, Model 7’;
class sex idnr;
model measure = sex age*sex / s;
repeated / type = cs subject = idnr r rcorr;
run;

These two equivalent views toward the same model have been discussed in
Section 3.3.2.

The estimated covariance matrix is


⎛ ⎞
4.9052 3.0306 3.0306 3.0306
⎜ 3.0306 4.9052 3.0306 3.0306 ⎟
⎜ ⎟,
⎝ 3.0306 3.0306 4.9052 3.0306 ⎠
3.0306 3.0306 3.0306 4.9052

with corresponding correlation matrix


⎛ ⎞
1.0000 0.6178 0.6178 0.6178
⎜ 0.6178 1.0000 0.6178 0.6178 ⎟
⎜ ⎟
⎝ 0.6178 0.6178 1.0000 0.6178 ⎠ .
0.6178 0.6178 0.6178 1.0000

Comparing this model to Model 2 yields p = 0.3288. Comparisons to Mod-


els 4 and 6 lead to the same conclusion. This implies that this model is
currently the simplest one consistent with the data. It has to be noted that
a comparison of Model 7 with Model 6 is slightly complicated by the fact
that the null hypothesis implies that two of the three parameters in the D
matrix of Model 6 are zero. For the variance of the random slope, this null
value lies on the boundary of the parameter space. As explained in Section
6.3.4, Stram and Lee (1994) show that the corresponding reference distri-
bution is not $\chi^2_2$, but a 50:50 mixture of a $\chi^2_1$ and a $\chi^2_2$. Such a mixture
is indicated by $\chi^2_{1:2}$, or simply by 1:2. As a result, the corrected p-value
would be 0.5104, thereby indicating no change in the conclusion. Similarly,
comparing Models 2 and 6 as carried out earlier suffers from the same prob-
lem. Stram and Lee (1994) indicate that the asymptotic null distribution is

FIGURE 17.3. Growth Data. $\chi^2_6$ and simulated null distributions for comparing
Models 2 and 6.

even more complex and involves projections of random variables on curved


surfaces (Stram and Lee 1994). Therefore, the p-value is best determined by
means of simulations. A simulation study of 500 samples yields p = 0.046,
rather than p = 0.215, as reported in Table 17.4. To simulate the null
distribution, we generated 500 samples of 270 individuals rather than 27
individuals, to reduce small sample effects. Although this choice reflects
the desire to perform asymptotic inference, it is debatable since one might
rightly argue that generating samples of size 27 would reflect small-sample
effects as well. Figure 17.3 shows the simulated as well as the inadequate
$\chi^2_6$ null distributions.

Although such a correction is clearly necessary, it is hard to use in gen-


eral practice in its current form. Additional work in this area is certainly
required.

The profiles, predicted by Model 7, are

girls: Ŷj = 17.37 + 0.4795tj ,


boys: Ŷj = 16.34 + 0.7844tj .

They are shown in the fourth panel of Figure 17.1. Although not exactly
the same, they are extremely similar to the profiles of Model 2.

Model 8

Finally, the independence model is considered in which the only source of


variability is measurement error: Σ = σ 2 I4 . This model can be fitted in
PROC MIXED using the ‘type=simple’ option in the REPEATED state-
ment. Table 17.4 indicates that this model does not fit the data at all.
Whether a $\chi^2_1$ is used or a $\chi^2_{0:1}$ does not affect the conclusion.
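A minimal sketch of the corresponding program:

proc mixed data = growth method = ml covtest;
title 'Growth Data, Model 8';
class sex idnr;
model measure = sex age*sex / s;
repeated / type = simple subject = idnr r;
run;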

In summary, among the models presented, Model 7 is preferred to summa-


rize the data. In Sections 17.4.2 and 17.4.3, the trimmed version of the data
will be analyzed, using frequentist available data methods and an ignorable
likelihood-based analysis.

17.4.2 Frequentist Analysis of Incomplete Growth Data

Let us now turn toward the incomplete version of the data. In this section,
we will focus on a straightforward but restrictive available case analysis
from a frequentist perspective. The method is briefly introduced in Sec-
tion 16.4. Specifically, the parameters for the unstructured mean and co-
variance Model 1 will be estimated.

The estimated mean vector for girls is

(21.1818, 22.7857, 23.0909, 24.0909),

whereas the vector for boys is

(22.8750, 24.1364, 25.7188, 27.4688).

The mean vector for girls is based on a sample of size 11, except for the
second element, which is based on the 7 complete observations. The corre-
sponding sample sizes for boys are 16 and 11, respectively.

The estimated covariance matrix is


⎛ ⎞
5.4155 2.3155 3.9102 2.7102
⎜ 2.3155 4.2763 2.0420 2.5741 ⎟
⎜ ⎟, (17.15)
⎝ 3.9102 2.0420 6.4557 4.1307 ⎠
2.7102 2.5741 4.1307 4.9857

with correlation matrix


⎛ ⎞
1.0000 0.4812 0.6613 0.5216
⎜ 0.4812 1.0000 0.3886 0.5575 ⎟
⎜ ⎟.
⎝ 0.6613 0.3886 1.0000 0.7281 ⎠
0.5216 0.5575 0.7281 1.0000

The elements of the covariance matrix are computed as


$$
\widehat{\sigma}_{jk} = \frac{1}{25}\left[\sum_{i=1}^{11}\left(y^g_{ij}-\overline{y}^g_j\right)\left(y^g_{ik}-\overline{y}^g_k\right) + \sum_{i=1}^{16}\left(y^b_{ij}-\overline{y}^b_j\right)\left(y^b_{ik}-\overline{y}^b_k\right)\right], \qquad j, k \neq 2,
$$
$$
\widehat{\sigma}_{j2} = \frac{1}{18}\left[\sum_{i=1}^{7}\left(y^g_{ij}-\overline{y}^g_j\right)\left(y^g_{i2}-\overline{y}^g_2\right) + \sum_{i=1}^{11}\left(y^b_{ij}-\overline{y}^b_j\right)\left(y^b_{i2}-\overline{y}^b_2\right)\right].
$$

The superscripts g and b refer to girls and boys, respectively. It is assumed


that, within each sex subgroup, the ordering is such that completers precede
the incompletely measured children.

Looking at the available case procedure from the perspective of the indi-
vidual observation, one might say that each observation contributes to the
subvector of the parameter vector about which it contains information. For
example, a complete observation in the growth data set contributes to 4
(sex specific) mean components as well as to all 10 variance-covariance pa-
rameters. In an incomplete observation, there is information about three
mean components and six variance-covariance parameters (excluding those
with a subscript 2).

Whether or not there is nonrandom selection of the incomplete observa-


tions does not affect those parameters without a subscript 2. For the ones
involving a subscript 2, potential differences between completers and non-
completers are not taken into account and, hence, biased estimation may
result when an MAR mechanism is operating. In fact, the estimates for the
parameters with at least one subscript 2 equal their complete case analy-
sis counterparts. Thus, MCAR is required. This observation is consistent
with the theory in Rubin (1976), since the current available case method
is frequentist rather than likelihood based.

17.4.3 Likelihood Analysis of Incomplete Growth Data

As with the available case method of Section 16.4, a complete subject


contributes to “more parameters” than an incomplete subject. Whereas
these contributions were direct in terms of parameter vector components
in Section 16.4, in the current framework subjects contribute information
through their factor of the likelihood function. For example, let us consider
Model 1. A complete subject contributes by means of a four-dimensional
normal log-likelihood term with 4 out of 8 mean components (boys and girls
have separate means) and a 4 × 4 positive definite covariance matrix. An
incomplete observation contributes through the three-dimensional marginal
density, obtained by integrating over the second component. In practice,
this is done by deleting the second component of the mean vector and

TABLE 17.5. Growth Data. MAR analysis (Little and Rubin). Model fit summary.

Model  Mean       Covar    par  −2ℓ      Ref  G²      df   p

1      unstr.     unstr.   18   386.957
2      ≠ slopes   unstr.   14   393.288  1    6.331   4    0.1758
3      = slopes   unstr.   13   397.400  2    4.112   1    0.0426
4      ≠ slopes   Toepl.   8    398.030  2    4.742   6    0.5773
5      ≠ slopes   AR(1)    6    409.523  2    16.235  8    0.0391
6      ≠ slopes   random   8    400.452  2    7.164   6    0.3059
7      ≠ slopes   CS       6    401.313  6    0.861   2    0.6502
                                         6    0.861   1:2  0.5018
8      ≠ slopes   simple   5    441.583  7    40.270  1    <0.0001
                                         7    40.270  0:1  <0.0001

the second row and the second column of the covariance matrix. In most
software packages, such as the SAS procedure MIXED, this is performed
automatically, as will be discussed next.

Little and Rubin (1987) fitted the same eight models as Jennrich and
Schluchter (1986) to the incomplete growth data set. Whereas Little and
Rubin made use of the EM algorithm, we set out to perform our analysis
with direct maximization of the observed likelihood (with Fisher scoring or
Newton-Raphson) in PROC MIXED. The results ought to coincide. Table
17.5 reproduces the findings of Little and Rubin. We added p-values.
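The mixture p-values in such tables (the rows labeled 1:2 and 0:1) are easily computed directly; a minimal sketch in SAS, using the $\chi^2_{1:2}$ comparison of Models 6 and 7 in Table 17.5 as an example:

data mixture;
g2 = 0.861;   /* deviance difference between Models 6 and 7 (Table 17.5) */
p = 0.5*(1 - probchi(g2, 1)) + 0.5*(1 - probchi(g2, 2));   /* 50:50 mixture of chi-squared(1) and chi-squared(2) */
run;

This reproduces the tabulated value p = 0.5018.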

The PROC MIXED programs, constructed in Section 17.3 to analyze the


complete data set, will be applied to the artificially incomplete data set.
The structure of this data set is given in Table 17.6. Although there would
be four records for every subject in the complete data set, now there are
nine subjects (e.g., subjects #3 and #27) with only three records.

Applying the programs to the data yields some discrepancies, as seen from
the model fit summary in Table 17.7.

Let us take a close look at these discrepancies. Although most of the tests
performed lead to the same conclusion, there is one fundamental difference.
In Table 17.5, the AR(1) model is rejected whereas it is not in Table 17.7.
A puzzling difference is that the maximized log-likelihoods are different
for Models 1–5, but not for Models 6–8. The same holds for the mean
and covariance parameter estimates. To get a hold on this problem, let us
consider the repeated statement (e.g., of Model 1):

repeated / type = un subject = idnr r rcorr;



TABLE 17.6. Growth Data. Extract of the incomplete data set.

OBS IDNR AGE SEX MEASURE

1 1 8 2 21.0
2 1 10 2 20.0
3 1 12 2 21.5
4 1 14 2 23.0
5 2 8 2 21.0
6 2 10 2 21.5
7 2 12 2 24.0
8 2 14 2 25.5
9 3 8 2 20.5
10 3 12 2 24.5
11 3 14 2 26.0
...
97 27 8 1 22.0
98 27 12 1 23.5
99 27 14 1 25.0

This statement identifies the subject in terms of IDNR blocks but does
not specify the ordering of the observations within a subject. Thus, PROC
MIXED assumes the default ordering: 1, 2, 3, 4 for a complete subject and,
erroneously, 1, 2, 3 for an incomplete one, whereas the correct incomplete
ordering is 1, 2, 4. This means that, by default, dropout is assumed. Since
this assumption is inadequate for the growth data, Models 1–5 in Table
17.7 are incorrect. The random-effects Model 6, on the other hand, uses
the RANDOM statement

random intercept age / type = un subject = idnr g;

where the variable AGE conveys the information needed to correctly calcu-
late the random-effects parameters. Indeed, for an incomplete observation,
the correct design
$$
Z_i = \begin{pmatrix} 1 & 8 \\ 1 & 12 \\ 1 & 14 \end{pmatrix}
$$

is generated. Finally, it remains to be discussed why Models 7 and 8 give


a correct answer in spite of the fact that they also use the REPEATED
statement rather than the RANDOM statement. This is best seen in Model

TABLE 17.7. Growth Data. Inadequate MAR analysis (Little and Rubin). Model
fit summary.

Model  Mean       Covar    par  −2ℓ      Ref  G²      df   p

1      unstr.     unstr.   18   394.309
2      ≠ slopes   unstr.   14   397.862  1    3.553   4    0.4699
3      = slopes   unstr.   13   401.935  2    4.073   1    0.0436
4      ≠ slopes   banded   8    400.981  2    3.119   6    0.7938
5      ≠ slopes   AR(1)    6    408.996  2    11.134  8    0.1942
6      ≠ slopes   random   8    400.452  2    2.590   6    0.8583
7      ≠ slopes   CS       6    401.312  6    0.860   2    0.6505
                                         6    0.860   1:2  0.5021
8      ≠ slopes   simple   5    441.582  7    40.270  1    <0.0001
                                         7    40.270  0:1  <0.0001

8, where we assume an independence covariance structure. This covariance


structure is equivalent to assuming that the 99 = 108 − 9 measurements
form a simple random sample of size n = 99, rather than a longitudinal
sample. Consequently, the actual position of a measurement within a sub-
ject’s sequence is irrelevant. The same holds for the compound symmetry
or exchangeable model. The only difference with the simple model is that a
common intraclass correlation between two measurements within the same
sequence is assumed. Since this correlation is constant and thus indepen-
dent of the actual distance between measurements, it can be determined
from the full set of pairs of measurements within an individual (six pairs
for a complete observation and three pairs for an incomplete observation),
with the order being immaterial.

There are two equivalent ways to overcome this problem. The first is to
adapt the data set slightly. An example is given in Table 17.8.

The effect of using this data set is, of course, that incomplete records
are deleted from the analysis, but that the relative positions are correctly
passed on to PROC MIXED. Running Models 1–8 on this data set yields
exactly the same results as in Table 17.5.
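Such a data set can be constructed automatically rather than by hand; a sketch using PROC SQL, assuming the incomplete data are in GROWTHAV with the variables of Table 17.6 (the data set names GRID and GROWTHFULL are hypothetical):

proc sql;
create table grid as
select a.idnr, a.sex, b.age
from (select distinct idnr, sex from growthav) as a,
(select distinct age from growthav) as b;   /* all subject-by-age combinations */
create table growthfull as
select g.idnr, g.sex, g.age, m.measure
from grid as g left join growthav as m
on g.idnr = m.idnr and g.age = m.age
order by g.idnr, g.age;   /* MEASURE is missing (.) where no record exists */
quit;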

It is also possible to use the data as presented in Table 17.6. Instead of


passing on the position of the missing values through the data set, we have
to specify explicitly the ordering by coding it properly into the PROC
MIXED program. For Model 1, the following code can be used:

proc mixed data = growthav method = ml;


title ’Jennrich and Schluchter (MAR, Altern.), Model 1’;
class sex idnr age;
model measure = sex age*sex / s;
repeated age / type = un subject = idnr r rcorr;
run;

The REPEATED statement now explicitly includes the ordering by means


of the AGE variable. Note that any counter with the correct ordering (e.g.,
1, 2, 3, 4) would be suitable. We consider it good practice to always include
the (time) ordering variable of the measurements.

The corresponding Model 2 program would be

proc mixed data = growthav method = ml;


title ’Jennrich and Schluchter (MAR, Altern.), Model 2’;
class sex idnr;
model measure = sex age*sex / s;
repeated age / type = un subject = idnr r rcorr;
run;

However, this program generates an error since the variables in the RE-
PEATED statement have to be categorical variables, termed CLASS vari-
ables in PROC MIXED. See Section 8.2.6. One of the tricks to overcome
this issue is to use the following program:

data help;
set growthav;
agec = age;
run;

proc mixed data = help method = ml;


title ’Jennrich and Schluchter (MAR, Altern.), Model 2’;
class sex idnr agec;
model measure = sex age*sex / s;
repeated agec / type = un subject = idnr r rcorr;
run;

Thus, there are two identical copies of the variable AGE, only one of which
is treated as a class variable.

Let us now turn attention to the performance of the ignorable method


of analysis and compare the results with the ones obtained earlier. First,
the model comparisons performed in Tables 17.4 and 17.5 qualitatively

TABLE 17.8. Growth Data. Extract of the incomplete data set. The missing ob-
servations are explicitly indicated.

OBS IDNR AGE SEX MEASURE

1 1 8 2 21.0
2 1 10 2 20.0
3 1 12 2 21.5
4 1 14 2 23.0
5 2 8 2 21.0
6 2 10 2 21.5
7 2 12 2 24.0
8 2 14 2 25.5
9 3 8 2 20.5
10 3 10 2 .
11 3 12 2 24.5
12 3 14 2 26.0
...
105 27 8 1 22.0
106 27 10 1 .
107 27 12 1 23.5
108 27 14 1 25.0

yield the same conclusions. In both cases, linear profiles turn out to be
consistent with the data, but parallel profiles do not. A Toeplitz correlation
structure (Model 4) is acceptable, as well as a random intercepts and slopes
model (Model 6). These models can be simplified further to compound
symmetry (Model 7). The assumption of no correlation between repeated
measures (Model 8) is untenable. This means that Model 7 is again the most
parsimonious description of the data among the eight models considered.
It has to be noted that the rejection of Models 3 and 5 is less compelling
in the MAR analysis than it was in the complete data set. Of course, this
is to be expected due to the reduction in the sample size, or rather in the
number of available measurements. The likelihood ratio test statistic for a
direct comparison of Model 5 to Model 4 is 11.494 on 2 degrees of freedom
(p = 0.0032), which is, again, a clear indication of an unacceptable fit.

Figure 17.4 displays the fit of Models 1, 2, 3, and 7. Let us consider the fit


of Model 1 first. As mentioned earlier, the complete observations at age 10
are those with a higher measurement at age 8. Due to the within-subject
correlation, they are the ones with a higher measurement at age 10 as well.
This is seen by comparing the large dot with the corresponding small dot,

FIGURE 17.4. Growth Data. Profiles for a selected set of models. MAR analysis.
(The small dots are the observed group means for the complete data set. The large
dots are the corresponding quantities for the incomplete data.)

reflecting the means for the complete data set and for those observed at
age 10, respectively. Since the average of the observed measurements at age
10 is biased upward, the fitted profiles from the complete case analysis and
from unconditional mean imputation were too high. Clearly, the average
observed from the data is the same for the complete case analysis, the
unconditional mean imputation, the available case analysis, and the present
analysis. The most crucial difference is that the current Model 1, although
saturated in the sense that there are eight mean parameters (one for each
age by sex combination), does not let the (biased) observed and fitted
averages at age 10 coincide, in contrast to the means at ages 8, 12, and 14.
Indeed, if the model specification is correct, then an ignorable likelihood
analysis is consistent for the correct complete data mean, rather than for
the observed data mean. Of course, this effect might be blurred in relatively
small data sets due to small-sample variability.

This discussion touches upon the key distinction between the frequentist
available case analysis of Section 16.4, with example in Section 17.4.2, and
the present likelihood based available case analysis. The method of Sec-
tion 16.4 constructs an estimate for the age 10 parameters, irrespective of
the (extra) information available for the other parameters. The likelihood

TABLE 17.9. Growth Data. Means under unstructured Model 1.

                            Incomplete
Age    Complete      Obs.        Pred.
Girls
 8     21.18         21.18       21.18
10     22.23         22.79       21.58
12     23.09         23.09       23.09
14     24.09         24.09       24.09
Boys
 8     22.88         22.88       22.88
10     23.81         24.14       23.17
12     25.72         25.72       25.72
14     27.47         27.47       27.47

approach implicitly constructs a correction, based on (1) the fact that the
measurements at ages 8, 12, and 14 differ between the subgroups of com-
plete and incomplete observations and (2) the fairly strong correlation be-
tween the measurement at age 10 on the one hand, and the measurements
at ages 8, 12, and 14 on the other hand. A detailed treatment of likelihood
estimation in incomplete multivariate normal samples is given in Little and
Rubin (1987, Chapter 6). Clearly, this correction leads to an overshoot in
the fairly small growth data set, whence the predicted mean at age 10 is
actually smaller than the one of the complete data set. The means are
reproduced in Table 17.9. All means coincide for ages 8, 12, and 14. Irre-
spective of the small-sample behavior encountered here, the validity under
MAR and the ease of implementation are good arguments that favor this
ignorable analysis over other techniques.

We now present the predicted mean curves for Models 2, 3, and 7:

• Model 2:

girls: Ŷj = 17.18 + 0.4917tj ,


boys: Ŷj = 16.32 + 0.7886tj .

• Model 3:

girls: Ŷj = 15.40 + 0.6519tj ,


boys: Ŷj = 17.82 + 0.6519tj .

• Model 7:

girls: Ŷj = 17.22 + 0.4890tj ,


boys: Ŷj = 16.30 + 0.7867tj .

These profiles are fairly similar to their complete data counterparts. This
is in contrast to analyses obtained from the simple methods, described in
Chapter 16 and applied in Verbeke and Molenberghs (1997, Sections 5.4–
5.6).

Let us now study this method in terms of the effect on the estimated
covariance structure. The estimated covariance matrix of Model 1 is
$$
\widehat{\Sigma} = \begin{pmatrix}
5.0142 & 4.8796 & 3.6205 & 2.5095 \\
4.8796 & 6.6341 & 3.3772 & 3.0621 \\
3.6205 & 3.3772 & 5.9775 & 3.8248 \\
2.5095 & 3.0621 & 3.8248 & 4.6164
\end{pmatrix}.
$$

The variance at age 10 is inflated compared to its complete data set coun-
terpart (17.8). The dominating reason is that the sample size at age 10 is
only two-thirds of the original one, thereby making all estimators involved
more variable. In other settings, the variance may increase due to an in-
creased homogeneity in the selected subset. A correct analysis, such as the
ignorable one considered here, should acknowledge this additional source
of uncertainty. The correlation matrices are as follows:

• Model 1 (unstructured):
⎛ ⎞
1.0000 0.8460 0.6613 0.5216
⎜ 0.8460 1.0000 0.5363 0.5533 ⎟
⎜ ⎟. (17.16)
⎝ 0.6613 0.5363 1.0000 0.7281 ⎠
0.5216 0.5533 0.7281 1.0000

• Model 4 (Toeplitz):
⎛ ⎞
1.0000 0.6248 0.6688 0.4307
⎜ 0.6248 1.0000 0.6248 0.6688 ⎟
⎜ ⎟.
⎝ 0.6688 0.6248 1.0000 0.6248 ⎠
0.4307 0.6688 0.6248 1.0000

• Model 5 (AR(1)):
⎛ ⎞
1.0000 0.6265 0.3925 0.2459
⎜ 0.6265 1.0000 0.6265 0.3925 ⎟
⎜ ⎟.
⎝ 0.3925 0.6265 1.0000 0.6265 ⎠
0.2459 0.3925 0.6265 1.0000

• Model 6 (random effects):


⎛ ⎞
1.0000 0.6341 0.5971 0.5465
⎜ 0.6341 1.0000 0.6302 0.6041 ⎟
⎜ ⎟.
⎝ 0.5971 0.6302 1.0000 0.6461 ⎠
0.5465 0.6041 0.6461 1.0000

• Model 7 (compound symmetry):


⎛ ⎞
1.0000 0.6054 0.6054 0.6054
⎜ 0.6054 1.0000 0.6054 0.6054 ⎟
⎜ ⎟.
⎝ 0.6054 0.6054 1.0000 0.6054 ⎠
0.6054 0.6054 0.6054 1.0000

The unstructured model reveals an increased correlation between ages 10


and 8, but a decrease between ages 10 and 12 and also a decrease be-
tween ages 10 and 14. Although the differences in correlation between
the complete data set and ignorable analyses are carried across the simpli-
fied correlation structures, they are, in fact, very modest. For example, the
complete data exchangeable correlation of 0.6178 changes to 0.6054 here.

It is interesting to consider the covariance structure of random-effects


Model 6 in a bit more detail. The matrix D is estimated to be

$$
\widehat{D} = \begin{pmatrix} 6.7853 & -0.3498 \\ -0.3498 & 0.0337 \end{pmatrix}
$$

and σ̂ 2 = 1.7700. Thus, all entries in the random-effects covariance matrix


as well as the measurement error σ 2 seem to have increased slightly in
absolute value in comparison to the complete data analysis version (17.12).
The resulting covariance matrix is now
$$
\widehat{V}_i = Z_i \widehat{D} Z_i' + \widehat{\sigma}^2 I_4 = \begin{pmatrix}
5.1140 & 3.1833 & 3.0226 & 2.8620 \\
3.1833 & 4.9274 & 3.1315 & 3.1055 \\
3.0226 & 3.1315 & 5.0103 & 3.3491 \\
2.8620 & 3.1055 & 3.3491 & 5.3626
\end{pmatrix}.
$$

In conclusion, a likelihood ignorable analysis is preferable since it uses all


available information, without the need to delete or to impute
measurements or entire subjects. It is theoretically justified whenever the
missing data mechanism is MAR, which is a more relaxed assumption than
MCAR, necessary for simple analyses (complete case, frequentist available
case, and single-imputation based analyses, with the exception of Buck’s
method of conditional mean imputation). There is no statistical information
distortion, since observations are neither removed (such as in complete case
analysis) nor added (such as in single imputation). There is no additional

programming involved to perform an ignorable analysis in PROC MIXED,


provided the order of the measurements is correctly specified. This can be
done either by supplying records with missing data in the input data set or
by properly indicating the order of the measurement in the REPEATED
and/or RANDOM statements in PROC MIXED.

When the scientific interest is directed to the missing data mechanism as


well, then a simple ignorable analysis is generally not sufficient and has to
be supplemented with a model for missingness. Also, when the missingness
mechanism is nonrandom, then an ignorable analysis is not valid and more
complex modeling (e.g., Diggle and Kenward 1994) is required. This topic is
of vital importance and will be discussed further in Sections 17.5 and 17.6.

There are still a few issues with the estimation of precision and with hy-
pothesis testing, related to this type of analysis. These will be discussed in
Chapter 21. We will first study the missingness mechanism for the growth
data.

17.4.4 Missingness Process for the Growth Data

The only prior information we have about the nonresponse mechanism is


that Little and Rubin (1987) conceived it to depend on the measurement
at age 8 in such a way that lower values led to a higher nonresponse. Let
us assume that the missingness probability follows a logistic model. For
example, if dropout would depend solely on the measurement at age 8, a
candidate model would be

$$
\ln\left[\frac{P(R_i = 0 \mid \boldsymbol{y}_i)}{1 - P(R_i = 0 \mid \boldsymbol{y}_i)}\right] = \psi_0 + \psi_1 y_{i1},
$$
where Ri = 1 for complete observations and 0 otherwise, and Yi1 is the
measurement at age 8. Of course, this model is easily adapted to include
a different subset of measurements into the linear predictor. Table 17.10
shows the model fit for a few choices. In each of the four models, missing-
ness depends on a single outcome. When this dependence is on Yi2 , the
process is nonrandom; it is MAR otherwise. We used the complete data
set to estimate the parameters of the nonrandom model (with linear pre-
dictor including Yi2 ). This is generally not possible. Ways to overcome this
problem are discussed in Sections 17.2.2 and 17.5. The only important co-
variates are the measurements at ages 8 and 12 (i.e., the ones adjacent to
the possibly missing measurement). A backward logistic regression model
retains only Yi1 . Its coefficients (standard errors) are estimated as

$$
\ln\left[\frac{P(R_i = 0 \mid y_{i1})}{1 - P(R_i = 0 \mid y_{i1})}\right] = 41.22\,(18.17) - 1.94\,(0.85)\, y_{i1}.
$$

TABLE 17.10. Growth Data. Model fit for logistic nonresponse models.

Type Effects Deviance p


MAR Yi1 19.51 <0.0001
MAR Yi3 7.43 0.0064
MAR Yi4 2.51 0.1131
MNAR Yi2 2.55 0.1105

This model implies that the missingness probability decreases with increas-
ing Yi1 .
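Such logistic nonresponse models can be fitted with standard software; a minimal sketch using PROC LOGISTIC, assuming a data set DROPOUT with one record per child, containing the indicator R and the age 8 measurement Y1 (hypothetical names):

proc logistic data = dropout;
model r = y1;   /* by default, the lowest response level, R = 0, is modeled */
run;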

It is important to note that omitting a relevant predictor from the nonre-


sponse model might lead to the wrong conclusions, even about the nature
of the nonresponse mechanism itself. For example, a MCAR mechanism
might be classified as nonrandom if a crucial covariate is omitted from the
model. Therefore, it is wise to examine all available information, including
covariates, with the greatest care. In the growth example, the only covari-
ate is sex (xi ). A model including the age 8 measurement Yi1 together with
sex xi as an analysis by sex group leads to a complete separation in the
covariate space, resulting in parameter estimates at infinity. Examining the
fit carefully, we deduce the following mechanism:

$$
\text{boys:}\quad \ln\left[\frac{P(R_i = 0 \mid y_{i1}, x_i = 0)}{1 - P(R_i = 0 \mid y_{i1}, x_i = 0)}\right] = \infty\,(22 - y_{i1}),
$$
$$
\text{girls:}\quad \ln\left[\frac{P(R_i = 0 \mid y_{i1}, x_i = 1)}{1 - P(R_i = 0 \mid y_{i1}, x_i = 1)}\right] = \infty\,(20.75 - y_{i1}).
$$
The model for boys is interpreted as follows:

$$
P(R_i = 0 \mid y_{i1}, x_i = 0) = \begin{cases}
1 & \text{if } y_{i1} < 22 \\
0.5 & \text{if } y_{i1} = 22 \\
0 & \text{if } y_{i1} > 22.
\end{cases}
$$
This is exactly what is seen in Table 2.5. The same is true for girls, with
the sole difference that the cut point lies halfway between two observable
outcome values (20.75):

$$
P(R_i = 0 \mid y_{i1}, x_i = 1) = \begin{cases}
1 & \text{if } y_{i1} < 20.75 \\
0 & \text{if } y_{i1} > 20.75.
\end{cases}
$$
The models are displayed in Figure 17.5. Thus, the missingness mecha-
nism used by Little and Rubin (1987) is, in fact, deterministic (given the
outcomes at age 8). This should not be confused with the fact that nonre-
sponse depends (very clearly) on the observed outcomes and the observed
outcomes only, whence it is missing at random! A similar mechanism is
employed in Section 21.3.

FIGURE 17.5. Growth Data. Logistic nonresponse models for growth data.

17.5 A Selection Model for Nonrandom Dropout

In Section 17.2, the toenail data were analyzed assuming both MAR (Sec-
tion 17.2.1) and MNAR (Section 17.2.2). The responses were modeled using
linear mixed models and the dropout process was described by means of
logistic regression. This is in agreement with the model proposed by Diggle
and Kenward (1994). Using the notation laid out in Section 15.9, we will
now slightly generalize this model.

We assume the measurement model to be of the linear mixed model (3.8)


form. Assuming that the first measurement Yi1 is obtained for every subject
in the study, the model for the dropout process is based on a logistic re-
gression for the probability of dropout at occasion j, given the subject was
still in the study up to occasion j. We denote this probability by g(hij , yij ),
in which hij is a vector containing all responses observed up to but not
including occasion j, as well as relevant covariates. We then assume that
g(hij , yij ) satisfies

$$
\text{logit}[g(h_{ij}, y_{ij})] = \text{logit}\left[\text{pr}(D_i = j \mid D_i \geq j, \boldsymbol{y}_i)\right] = h_{ij}'\psi + \omega y_{ij} \qquad (17.17)
$$

(i = 1, . . . , N ). When ω equals zero, the dropout model is random, and all


parameters can be estimated using standard software since the measure-

ment model, for which we use a linear mixed model, and the dropout model,
assumed to follow a logistic regression, can then be fitted separately. If
ω ≠ 0, the dropout process is assumed to be nonrandom. A special case is
given by (17.3). Then, (17.4) can be used to calculate the dropout proba-
bility at a given occasion.

In line with our discussion in Sections 17.1 and 17.2, Rubin (1994) points
out that such analyses heavily depend on the assumed dropout process,
whereas it is impossible to find evidence for or against the model, un-
less supplemental information on the dropouts is available. In practice, a
dropout model may be found to be nonrandom solely because one or a few
influential subjects have driven the analysis. This and related issues will
be studied in Chapter 19, which is devoted to sensitivity analysis for the
selection model.

17.6 A Selection Model for the Vorozole Study

We will assume a model of the form described in the previous section.


Since we are modeling change versus baseline, all models are forced to pass
through the origin. The following covariates were considered for the mea-
surement model: baseline value, treatment, dominant site, stage, and time
in months. Second-order interactions were considered as well. For design
reasons, treatment was kept in the model in spite of its nonsignificance.
An F -test for treatment effect produces a p-value of 0.5822. Apart from
baseline, no other time-stationary covariates were kept. A quadratic time
effect provided an adequate description of the time trend. Based on the var-
iogram (Figure 10.3), we confined the random-effects structure to random
intercepts and supplemented this with a Gaussian serial correlation compo-
nent and measurement error. The final model is presented in Table 17.11.

Fitted profiles are displayed in Figure 17.6 and Figure 17.7. In Figure 17.7,
empirical Bayes estimates of the random effects are included, whereas in
Figure 17.6, the purely marginal mean is used. For each treatment group,
we obtain three sets of profiles. The fitted complete profile is the average
curve that would be obtained had all individuals been completely observed.
If we use only those predicted values that correspond to occasions at which
an observation was made, then the fitted incomplete profiles are obtained.
The latter are somewhat above the former when the random effects are in-
cluded, and somewhat below when they are not, suggesting that individuals
with lower measurements are more likely to disappear from the study. In
addition, although the fitted complete curves are very close (the treatment
effect was not significant), the fitted incomplete curves are not, suggesting

TABLE 17.11. Vorozole Study. Selection model parameter estimates and standard
errors.

Effect                  Parameter   Estimate (s.e.)

Fixed-Effect Parameters:
Time                    β0          7.78 (1.05)
Time∗baseline           β1          −0.065 (0.009)
Time∗treatment          β2          0.086 (0.157)
Time²                   β3          −0.30 (0.06)
Time²∗baseline          β4          0.0024 (0.0005)
Variance Parameters:
Random intercept        d           105.42
Serial variance         τ²          77.96
Serial association      λ           7.22
Measurement error       σ²          77.83

that there is more dropout in the standard arm than in the treatment arm.
This is in agreement with the dropout rate, displayed in Figure 14.1, and
should not be seen as evidence of a bad fit. Finally, the observed curves,
based on the measurements available at each time point, are displayed.
These are higher than the fitted ones, but this should be viewed with the
standard errors of the observed means in mind (see Figure 4.2).

The fitted variance structure is represented by means of the fitted vari-


ogram in Figure 10.3. The total correlation between two measurements,
1 month apart, equals 0.696. The residual correlation, which remains af-
ter accounting for the random effects, is still equal to 0.491. The serial
correlation, obtained by further ignoring the measurement error, equals
$\rho = \exp(-1/7.22^2) = 0.981$.

Next, we will study factors which influence dropout. A logistic regression


model, described by (17.17) and (17.4) is used. To start, we restrict atten-
tion to MAR processes, whence ω = 0. The first model includes treatment,
dominant side, stage group, baseline, and the previous measurement, but
only the last two are significant, producing

$$
\text{logit}[g(h_{ij})] = 0.080\,(0.341) - 0.014\,(0.003)\,\text{base}_i - 0.033\,(0.004)\,y_{i,j-1}. \qquad (17.18)
$$

With larger data sets such as this one, convergence of nonrandom models
can be painstakingly difficult in, for example, OSWALD, and one has to

FIGURE 17.6. Vorozole Study. Fitted profiles (averaging the predicted means for
the incomplete and complete measurement sequences, without the random effects).

worry about apparent convergence. Therefore, we first proceed in an alter-


native way. Both Diggle and Kenward (1994) and Molenberghs, Kenward,
and Lesaffre (1997) observed that in nonrandom models, dropout tends to
depend on the increment (i.e., the difference between the current and previ-
ous measurements yij − yi,j−1 ). Clearly, a very similar quantity is obtained
as yi,j−1 − yi,j−2 , but a major advantage of such a model is that it fits
within the MAR framework. In our case, we obtain

$$
\begin{aligned}
\text{logit}[g(h_{ij})] &= 0.033\,(0.401) - 0.013\,(0.003)\,\text{base}_i + 0.012\,(0.006)\,y_{i,j-2} - 0.035\,(0.005)\,y_{i,j-1} \\
&= 0.033\,(0.401) - 0.013\,(0.003)\,\text{base}_i - 0.023\,(0.005)\,\frac{y_{i,j-2} + y_{i,j-1}}{2} \\
&\qquad - 0.047\,(0.010)\,\frac{y_{i,j-1} - y_{i,j-2}}{2}, \qquad (17.19)
\end{aligned}
$$
indicating that both size and increment are significant predictors for drop-
out. We conclude that dropout increases with a decrease in baseline, in
overall level of the outcome variable, as well as with a decreasing evolution
in the outcome. Recall that fitting the dropout model could be done using
a logistic regression of the type (17.19), given (17.4) and the discussion
following this equation.

FIGURE 17.7. Vorozole Study. Fitted profiles (averaging the predicted means for
the incomplete and complete measurement sequences, including the random ef-
fects).

Using OSWALD, both dropout models (17.18) and (17.19) can be com-
pared with their nonrandom counterparts, where yij is added to the linear
predictor. The first one becomes

$$
\text{logit}[g(h_{ij}, y_{ij})] = 0.53 - 0.015\,\text{base}_i - 0.076\,y_{i,j-1} + 0.057\,y_{ij}, \qquad (17.20)
$$

and the second one becomes

$$
\text{logit}[g(h_{ij}, y_{ij})] = 1.38 - 0.021\,\text{base}_i - 0.0027\,y_{i,j-2} - 0.064\,y_{i,j-1} + 0.035\,y_{ij}. \qquad (17.21)
$$

It turns out that model (17.21) is not significantly better than (17.19) and,
hence, we retain (17.19) as the most plausible description of the dropout
process we have so far obtained.
18
Pattern-Mixture Models

18.1 Introduction

The high sensitivity of selection modeling results to the correct specification


of the measurement model as well as the dropout model, about which little
is often known, has been extensively documented. See also Sections 15.3,
15.4, 17.1, 17.2.2, and 17.5. This has led to growing interest in pattern-
mixture modeling, based on the factorization (15.2) (Little 1993, Glynn,
Laird and Rubin 1986, Hogan and Laird 1997). After initial mention of
pattern-mixture models (Glynn, Laird, and Rubin 1986, Little and Rubin
1987), they are receiving more attention lately (Little 1993, 1994a, 1995,
Hogan and Laird 1997, Ekholm and Skinner 1998, Molenberghs, Michiels,
Kenward, and Diggle 1998, Molenberghs, Michiels, and Kenward 1998).

18.1.1 A Simple Illustration

We will first illustrate the idea of pattern-mixture modeling using a simple


setting. Let us adopt pattern-mixture decomposition (15.2) and suppress
dependence on covariates:

$$
f(\boldsymbol{y}_i, \boldsymbol{r}_i \mid \theta, \psi) = f(\boldsymbol{y}_i \mid \boldsymbol{r}_i, \theta)\, f(\boldsymbol{r}_i \mid \psi),
$$



with notation as laid out in Chapter 15. Restricting attention to dropout


(Section 15.9), we obtain, using (15.7),

f (y i , di |θ, ψ) = f (y i |di , θ)f (di |ψ). (18.1)

Equivalently, using (15.8),

f (y i , ti |θ, ψ) = f (y i |ti , θ)f (ti |ψ). (18.2)

Consider a continuous response at three times of measurement which will


be modeled using a trivariate Gaussian distribution. Assume that there
may be dropout at time 2 or 3, and let the dropout indicator Ti take the
values 1 and 2 to indicate that the last observation occurred at these times
and 3 to indicate no dropout. Then, in the first instance, the model implies
a different distribution for each time of dropout. We can write

y i | ti ∼ N (µ(ti ), Σ(ti )), (18.3)

where
⎛ ⎞ ⎛ ⎞
µ1 (t) σ11 (t) σ21 (t) σ31 (t)
µ(t) = ⎝ µ2 (t) ⎠ and Σ(t) = ⎝ σ21 (t) σ22 (t) σ32 (t) ⎠,
µ3 (t) σ31 (t) σ32 (t) σ33 (t)

for t = 1, 2, 3. Recall that t indicates length of sequences, as defined in


Section 15.9, rather than time points of measurements actually taken. Let
$P(T_i = t) = \pi_t = f(t \mid \psi)$; then the marginal distribution of the response is a
mixture of normals with, for example, mean
$$
\mu = \sum_{t=1}^{3} \pi_t\, \mu(t).
$$

Its variance can be derived by application of the delta method (see Sections
18.3, 18.4, 20.6.2, and 24.4.2).
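As a sketch of this delta-method computation (our notation; the $\pi_t$ are treated as free parameters here), stack the estimates as $\theta = (\pi_1, \pi_2, \pi_3, \mu(1)', \mu(2)', \mu(3)')'$; then
$$
\text{Var}(\widehat{\mu}) \approx \left(\frac{\partial \mu}{\partial \theta}\right) \text{Var}(\widehat{\theta}) \left(\frac{\partial \mu}{\partial \theta}\right)',
\qquad
\frac{\partial \mu}{\partial \pi_t} = \mu(t), \quad \frac{\partial \mu}{\partial \mu(t)} = \pi_t I_3.
$$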

However, although the πt can be simply estimated from the observed pro-
portions in each dropout group, only 16 of the 27 response parameters can
be identified from the data without making further assumptions. These
16 comprise all the parameters from the completers plus those from the
following two submodels. For $t = 2$,
$$
N\left( \begin{pmatrix} \mu_1(2) \\ \mu_2(2) \end{pmatrix};\;
\begin{pmatrix} \sigma_{11}(2) & \sigma_{21}(2) \\ \sigma_{21}(2) & \sigma_{22}(2) \end{pmatrix} \right),
$$
and for $t = 1$,
$$
N\left(\mu_1(1); \sigma_{11}(1)\right).
$$

This is a saturated pattern-mixture model and the representation makes


it very clear what information each dropout group provides and, conse-
quently, the assumptions that need to be made if we are to predict the
behavior of the unobserved responses, and so obtain marginal models for
the response. If the three sets of parameters µ(t) are simply equated, with
the same holding for the corresponding variance components, then this im-
plies that dropout is completely random. Progress can be made with less
stringent restrictions however. Little (1993) introduces so-called complete
case missing value (CCMV) restrictions. These can be defined in terms
of conditional distributions. Let $y = (y_1, y_2, \ldots, y_n)'$. Then the CCMV
restrictions imply that, for any $t < j$,
$$
f(y_j \mid y_1, \ldots, y_{j-1}, T = t) = f(y_j \mid y_1, \ldots, y_{j-1}, T = n).
$$

Little (1993) shows how these constraints can be used to identify all the
parameters in the model and so obtain estimates for these and the marginal
probabilities. The CCMV restrictions essentially equate conditional distri-
butions beyond time t (i.e., those unidentifiable from this dropout group),
with the same conditional distributions from the completers. A stronger
restriction is to identify the former conditional distributions and all condi-
tional distributions from those who drop out after t. This has been called
the available case missing value (ACMV) restrictions and it has been shown
(Molenberghs, Michiels, Kenward, and Diggle 1998; see also Section 20.2)
that for dropout, these conditions are equivalent to MAR in the selection
model framework. Again, such constraints can be used to develop meth-
ods of estimation or to set up schemes for sensitivity analysis. A detailed
account is given in Chapter 20.
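In the notation used above for CCMV, the ACMV restrictions amount to equating, for any $t < j$,
$$
f(y_j \mid y_1, \ldots, y_{j-1}, T = t) = f(y_j \mid y_1, \ldots, y_{j-1}, T \geq j),
$$
a sketch of the condition formalized in Chapter 20.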

In practice, choice of restrictions will need to be guided by the context. In


addition, the form of the data will typically be more complex, requiring,
for example, a more structured model for the response with the incorpora-
tion of covariates. Hence, models for f (ti |ψ) can be constructed in many
ways. Most authors assume the dropout process is fully observed and that
Ti satisfies a parametric model (Wu and Bailey 1988, 1989, Little 1993,
DeGruttola and Tu 1994). Hogan and Laird (1997) extend this to cases
where the dropout time is allowed to be right censored and no parametric
restrictions are put on the dropout times. Their conditional model for Yio
given Ti is a linear mixed model with dropout time as one of the covariates
in the mean structure. Due to the right censoring, the estimation method
must handle incomplete covariates. Hogan and Laird (1997) use the EM
algorithm (Dempster, Laird, and Rubin 1977) for ML estimation.

At this point, a distinction between so-called outcome-based and random-


coefficient-based models is useful. In the context of the former, Little (1995)
and Little and Wang (1996) consider the restrictions implied by a selection
dropout model in the pattern-mixture framework. For example, with two

time points and a Gaussian response, Little proposes a general form of


dropout model:

P (dropout | y) = g(y1 + λy2 ), (18.4)

with the function g(·) left unspecified. In a selection modeling context,


(18.4) is often assumed to have a logistic form [Chapter 17, e.g., (17.17)].
This relationship implies that the conditional distribution of Y1 given Y1 +
λY2 is the same for those who drop out and those who do not. With this re-
striction and given λ, the parameters of the full distribution of the dropouts
is identified. The “weight” λ can then be used as a sensitivity parameter,
its size determining dependence of dropout on the past and present, as
in the selection models. Such a procedure can be extended to more gen-
eral problems (Little 1995, Little and Wang 1996). It is instructive in this
very simple setting to compare the sources of identifiability in the pattern-
mixture and selection models. In the former, the information comes from
the assumption that the dropout probability is some function of a linear
combination of the two observations with known coefficients. In the latter,
it comes from the shape of the assumed conditional distribution of Y2 given
Y1 (typically Gaussian), together with the functional form of the dropout
probability. The difference is highlighted if we consider a sensitivity analysis
for the selection model that varies λ in the same way as with the pattern-
mixture model. Such sensitivity analysis is much less convincing because
the data can, through the likelihood, distinguish between the fit associated
with different values of λ.

Therefore, identifiability problems in the selection context tend to be
masked. Indeed, there are always unidentified parameters, although a related
“problem” seems absent in the selection model. This apparent paradox
has been observed by Glynn, Laird, and Rubin (1986). Let us discuss this
paradox in some detail.

18.1.2 A Paradox

Assume we have two measurements where Y1 is always observed and Y2 is


either observed (t = 2) or missing (t = 1). Let us further simplify the nota-
tion by suppressing dependence on parameters and additionally adopting
the following definitions:

g(t|y1 , y2 ) := f (t|y1 , y2 ),
p(t) := f (t),
ft (y1 , y2 ) := f (y1 , y2 |t).

Equating the selection model and pattern-mixture model factorizations


yields

$$
f(y_1, y_2)\, g(t = 2 \mid y_1, y_2) = f_2(y_1, y_2)\, p(t = 2),
$$
$$
f(y_1, y_2)\, g(t = 1 \mid y_1, y_2) = f_1(y_1, y_2)\, p(t = 1).
$$

Since we have only two patterns, this obviously simplifies further to

f (y1 , y2 )g(y1 , y2 ) = f2 (y1 , y2 )p,


f (y1 , y2 )[1 − g(y1 , y2 )] = f1 (y1 , y2 )[1 − p],

of which the ratio yields

$$
f_1(y_1, y_2) = \frac{1 - g(y_1, y_2)}{g(y_1, y_2)}\, \frac{p}{1 - p}\, f_2(y_1, y_2).
$$

All selection model factors are identified, as are the pattern-mixture quan-
tities on the right-hand side. However, the left-hand side is not entirely
identifiable. We can further separate the identifiable from the nonidentifi-
able quantities:

$$
f_1(y_2 \mid y_1) = f_2(y_2 \mid y_1)\, \frac{1 - g(y_1, y_2)}{g(y_1, y_2)}\, \frac{p}{1 - p}\, \frac{f_2(y_1)}{f_1(y_1)}. \qquad (18.5)
$$

In other words, the conditional distribution of the second measurement


given the first one, in the incomplete first pattern, about which there is no
information in the data, is identified by equating it to its counterpart from
the complete pattern, modulated via the ratio of the “prior” and “poste-
rior” odds for dropout [p/(1 − p) and g(y1 , y2 )/(1 − g(y1 , y2 )), respectively]
and via the ratio of the densities for the first measurement.

Thus, although an identified selection model is seemingly less arbitrary


than a pattern-mixture model, it incorporates implicit restrictions. Indeed,
precisely these are used in (18.5) to identify the component for which there
is no information.

This clearly illustrates the need for sensitivity analysis. Due to the different
nature of the selection and pattern-mixture models, specific forms for each
of the two contexts will be presented in Chapters 19 and 20, respectively.

In Section 18.2, we will describe a general strategy for fitting pattern-


mixture models. The remainder of this chapter is devoted to pattern-
mixture models for the toenail data (Section 18.3) and for the Vorozole
data (Section 18.4). Chapter 20 is devoted to a formal juxtaposition of
several strategies for pattern-mixture modeling.

18.2 Pattern-Mixture Models

As indicated in Section 18.1, this family is based on factorization (15.2),


which, for dropout, can be rewritten as (18.1) or (18.2). The conditional
density of the measurements given the dropout pattern is combined with
the marginal density describing the dropout mechanism. Note that the sec-
ond factor can depend on covariates, but not on outcomes. It is, of course,
possible to have different covariate dependencies in both components of the
factorization. For example, dropout can vary with treatment arm and age of
the respondent, whereas the measurement model can depend on treatment
arm, sex, and measurement time.

Thus, the dropout process $f(t_i \mid X_i, \psi)$ is just a, possibly covariate-dependent,
model for the probability to belong to a particular pattern. If it is
expressed in analogy with (17.17), then g(hij ) will describe the dropout
rate at each occasion.

The measurement model has to reflect dependence on dropout. In its most


general form, this implies that (3.8) is replaced by


$$
\left\{
\begin{aligned}
Y_i &= X_i \beta(t_i) + Z_i b_i + \varepsilon_i, \\
b_i &\sim N(0, D(t_i)), \\
\varepsilon_i &\sim N(0, \Sigma_i(t_i)).
\end{aligned}
\right. \qquad (18.6)
$$

Thus, the fixed effects as well as the covariance parameters are allowed to
change with dropout pattern and a priori no restrictions are placed on the
structure of this change.

It immediately follows from (15.2) that the likelihood contribution of the


ith subject, based on the observed data (yio , ti ), is proportional to

f (yio , ti ) = f (ti )f (yio |ti ), (18.7)

which only requires specifying a marginal model for the dropout process and
a conditional model for the observed outcomes, given the dropout pattern
as in (18.6). Further, as for ignorable selection models, both models can be
fitted separately, provided separability of their parameters.

Model family (18.6) contains underidentified members since it describes


the full set of measurements in pattern ti , even though there are not mea-
surements after occasion ti , as was pointed out in Section 18.1.2 for the
simple case of two measurements. Several routes can be taken to solve this
problem. They are described in detail in Section 20.4. Let us briefly sketch
them.

Little (1993, 1994a) advocated the use of identifying restrictions which


works well in relatively simple settings. Molenberghs, Michiels, Kenward,
and Diggle (1998) proposed a particular set of restrictions for the monotone
case which correspond to MAR. Alternatively, several types of simplified
(identified) models can be considered. The advantage is that the number
of parameters decreases, which is generally an issue with pattern-mixture
models. This route will be followed in Sections 18.3 and 18.4.

Hogan and Laird (1997) noted that in order to estimate the large num-
ber of parameters in general pattern-mixture models, one has to make the
awkward requirement that each dropout pattern is sufficiently “filled”; in
other words, one has to require large numbers of dropouts. This problem is
less prominent in simplified models. Note however that simplified models,
qualified as “assumption rich” by Sheiner, Beal, and Dunne (1997), are also
making untestable assumptions and therefore illustrate that even pattern-
mixture models do not provide a free lunch. A main advantage however is
that the need of assumptions and their implications are more obvious. For
example, it is not possible to assume an unstructured time trend in incom-
plete patterns, except if one restricts attention to the time range from onset
until dropout. In contrast, assuming a linear time trend allows estimation
in all patterns containing at least two measurements.

In general, we distinguish between two types of simplification to identify


pattern-mixture models. First, functional model forms can be restricted to
those which are supported by the information available within a pattern.
For example, a linear time trend with a fixed treatment effect, together
with a compound symmetry covariance structure, is identifiable as soon as
there are two time points. This will be illustrated in Section 18.3. Second,
one can let the parameters vary across patterns in a parametric way. Thus,
rather than estimating a separate time trend in each pattern, one could
assume that the time evolution is unstructured within a pattern, but par-
allel across patterns. The available data can be used to assess whether such
simplifications are supported within the range of the observed data. Using
the so-obtained profiles past the time of dropout still requires extrapolation
or, in other words, a leap of faith. This is the route chosen in Section 18.4.

18.3 Pattern-Mixture Model for the Toenail Data

For the TDO data, we will assume the dropout patterns Ti to be sampled
from a multinomial distribution with support {1, 2, 3, 4, 5, 6, 7}, where the
class Ti = 7 contains all completers. The associated multinomial probability
vector is denoted by $\pi = (\pi_1, \pi_2, \ldots, \pi_7)'$. Further, our model for $Y_i^o$,
conditional on Ti = ti , is assumed to be of the same form as model (17.1),

TABLE 18.1. Toenail Data. Fitted dropout probabilities under the multinomial
dropout model.

Dropout occasion t:                1     2     3     4     5     6     7
Fitted prob. π̂t = P̂(Ti = t):       0.02  0.02  0.04  0.05  0.10  0.01  0.76

but with different parameters for each possible value of Ti . Obviously, no


quadratic average evolution can be fitted whenever ti = 1 or ti = 2. We
then only fit a constant term or a linear average evolution, respectively.
More specifically, our model for Yio , conditional on Ti = ti , equals



$$
Y_{ij}^o = \begin{cases}
\left(\beta_{A0}(t_i) + b_i(t_i)\right) + \beta_{A1}(t_i)\, t_{ij} + \beta_{A2}(t_i)\, t_{ij}^2 + \varepsilon_{(2)ij}(t_i) & \text{group A} \\[4pt]
\left(\beta_{B0}(t_i) + b_i(t_i)\right) + \beta_{B1}(t_i)\, t_{ij} + \beta_{B2}(t_i)\, t_{ij}^2 + \varepsilon_{(2)ij}(t_i) & \text{group B.}
\end{cases} \qquad (18.8)
$$
Recall that ti indicates pattern and tij indicates time of measurement. For
the patterns ti = 1 and ti = 2, we only have information in the data to fit
constant average trends or linear average trends, respectively. Therefore,
we need to restrict the parameters in model (18.8). A possible restriction
is βA1 (1) = βB1 (1) = βA2 (1) = βB2 (1) = βA2 (2) = βB2 (2) = 0. For ti = 1,
there is only one measurement per subject such that no random intercepts
can be included into the model. As before, it is assumed that the bi (ti ) and
the ε(2)ij (ti ) are normally distributed with means zero, but we allow their
variance to depend on the dropout pattern ti . More parsimonious mod-
els can be obtained by putting additional restrictions on the parameters,
such as assuming that all variance components are independent of ti . The
most extreme case assumes that none of the parameters in model (18.8)
depends on ti . We then have that the measurement model is statistically
independent of the dropout model, implying completely random dropout.

The separability of the parameters in the measurement model and the


dropout model, together with the separability of the parameters in the mea-
surement models across dropout occasions, implies that fitting the above
model reduces to fitting a linear mixed model to each dropout pattern
separately and to the calculation of the observed dropout probabilities at
the various occasions. Table 18.1 contains the fitted dropout probabilities,
whereas Figure 18.1 shows the fitted average profiles for each dropout pat-
tern, obtained from fitting the linear mixed models (18.8). Each panel of
the figure corresponds to a specific dropout pattern, and the number of
subjects in the pattern is denoted by nA and nB for group A and group
B, respectively. Note that there is very little information to fit some of the

FIGURE 18.1. Toenail Data. Fitted average profiles for each dropout pattern,
obtained from fitting the mixed models (18.8). For each pattern, the number of
subjects in group A and group B is denoted by nA and nB , respectively.

models in (18.8). This explains the unexpected behavior observed for ti = 1


and ti = 6.
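To make the strategy concrete, the sketch below shows how this pattern-by-pattern fitting could be organized. It assumes a hypothetical long-format data set with columns id, group, time, y, and pattern (the dropout occasion Ti), and uses the Python package statsmodels merely as a stand-in for the SAS procedure MIXED mentioned later in this section; all names are illustrative.

    import statsmodels.formula.api as smf

    def fit_pattern_mixture(df):
        # Multinomial dropout probabilities: observed proportions pi_t.
        pi_hat = df.groupby("pattern")["id"].nunique() / df["id"].nunique()
        fits = {}
        for t, sub in df.groupby("pattern"):
            if t == 1:
                # One measurement per subject: constant mean only,
                # and no random intercept can be identified.
                fits[t] = smf.ols("y ~ group", data=sub).fit()
            elif t == 2:
                # Two measurements: at most a linear average trend.
                fits[t] = smf.mixedlm("y ~ group * time", sub,
                                      groups=sub["id"]).fit()
            else:
                # Three or more measurements: quadratic average trend.
                fits[t] = smf.mixedlm("y ~ group * (time + I(time**2))",
                                      sub, groups=sub["id"]).fit()
        return pi_hat, fits

By default, each mixed model contains a random intercept plus measurement error, which corresponds to the compound symmetry structure assumed in (18.8).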

For each pattern, the likelihood ratio statistic 2 ln λ measures the difference
between the two treatment groups with respect to the fitted average profile.
The sum of all of these statistics could be used to test whether there is any
treatment difference at all for any of the dropout patterns. In our example,
this sum equals 26.5 on 18 degrees of freedom. When compared to a χ²-
distribution, there is no evidence for any treatment effect (p = 0.089).
However, this should be interpreted with care since the χ2 -approximation
may not be accurate due to the small numbers of subjects observed in some
of the dropout patterns.
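For reference, the quoted tail probability is easily reproduced with any χ² routine; for example, assuming SciPy is available:

    from scipy.stats import chi2

    chi2.sf(26.5, df=18)   # 0.089, the p-value quoted above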

One of the main advantages of pattern-mixture models is that a conditional


model for the observed responses Y_i^o only is needed, rather than for the
complete vector Yi . However, this changes when interest is in inferences for
the marginal distribution of Yi instead of the conditional distribution. Sup-
pose we want to compare the marginal average time trends between both
treatment groups, as was done under the selection models of Section 17.2.
We then need to evaluate the marginal expectation
$$
E[Y_{ij}] = E[Y_{ij} \mid T_i = 1]\,P(T_i = 1) + E[Y_{ij} \mid T_i = 2]\,P(T_i = 2) + \cdots + E[Y_{ij} \mid T_i = 7]\,P(T_i = 7), \tag{18.9}
$$


which requires specification of E(Y_i^m | T_i) (i.e., of the expected evolution of
the subjects after they dropped out). It should be emphasized that the data
do not contain any information on these average profiles beyond the time of
dropout. Hence, the estimation of (18.9) entirely relies on the extrapolation
of the fitted average profiles shown in Figure 18.1 to time points where no
data were observed.

For the dropout patterns where at least three measurements are available
per subject (ti ≥ 3), we extrapolate the quadratic trend over time, fitted
from the observed data. Borrowing the linear and quadratic time effects,
or just the quadratic time effect, from pattern ti = 3, we can extrapolate
the patterns ti = 1 and ti = 2 as well. More precisely, our extrapolation
assumes that


$$
\begin{cases}
\beta_{A1}(1) = \beta_{A1}(3),\\
\beta_{A2}(1) = \beta_{A2}(2) = \beta_{A2}(3),\\
\beta_{B1}(1) = \beta_{B1}(3),\\
\beta_{B2}(1) = \beta_{B2}(2) = \beta_{B2}(3).
\end{cases}
\tag{18.10}
$$

This expresses our belief that the average behavior in the first two patterns
is likely to be similar to the third pattern. The obtained extrapolations are
indicated by dashed lines in Figure 18.2. Note that the strongly positive
estimated average quadratic time effect for the patterns ti = 3 and ti = 4
implies extremely steep extrapolated curves for all patterns with ti ≤ 4.

The marginal expectation (18.9) for treatment group A now becomes


 
$$
\begin{aligned}
E[Y_{ij}] =\;& \bigl[\beta_{A0}(1) + \beta_{A1}(3)\,t_{ij} + \beta_{A2}(3)\,t_{ij}^{2}\bigr]\,\pi_1\\
&+ \bigl[\beta_{A0}(2) + \beta_{A1}(2)\,t_{ij} + \beta_{A2}(3)\,t_{ij}^{2}\bigr]\,\pi_2\\
&+ \bigl[\beta_{A0}(3) + \beta_{A1}(3)\,t_{ij} + \beta_{A2}(3)\,t_{ij}^{2}\bigr]\,\pi_3 + \cdots\\
&+ \bigl[\beta_{A0}(7) + \beta_{A1}(7)\,t_{ij} + \beta_{A2}(7)\,t_{ij}^{2}\bigr]\,\pi_7,
\end{aligned}
\tag{18.11}
$$
which is estimated by replacing all parameters by their estimates obtained
by our original pattern-mixture model. The marginal expectation for treat-
ment group B is obtained from replacing the subscript A in (18.11) by B.
The so-obtained estimates of the marginal average profiles are shown in
Figure 18.3. Note the completely different behavior of treatment group A
when compared to the fitted average trends obtained from selection model-
ing, shown in Figure 16.1. Obviously, this is a consequence of the extremely
steep extrapolations for the dropout patterns ti ≤ 4, which was more pro-
nounced for group A than for group B. One might argue that these extrap-
olations are unrealistic. However, as discussed previously, such conclusions
cannot be supported by the collected data, since they do not contain any

FIGURE 18.2. Toenail Data. Extrapolated fitted average profiles for each dropout
pattern, obtained from fitting the mixed models (18.8), imposing the restrictions
(18.10). For each pattern, the number of subjects in group A and group B is
denoted by nA and nB , respectively.

information on the trends beyond dropout. Moreover, our analyses in Sec-


tion 17.2.2 have shown that subjects with large increments in unaffected
nail length are more susceptible to dropout than others, suggesting that
early dropouts are subjects for which the response increases quickly over
time. In this respect, the extrapolations used in Figure 18.2 become less
unrealistic. Chapter 20 will discuss various strategies which can then be
considered simultaneously.
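In code, once the pattern-specific fixed effects have been estimated and the restrictions (18.10) imposed, the marginal profile is simply a π̂-weighted sum of pattern-specific quadratic curves. A minimal sketch, in which the containers pi_hat and betas are hypothetical holders of the estimates discussed above:

    import numpy as np

    def marginal_profile(pi_hat, betas, times):
        # betas[t] = (b0, b1, b2): intercept, linear, and quadratic
        # coefficients for pattern t, after the restrictions (18.10).
        times = np.asarray(times, dtype=float)
        profile = np.zeros_like(times)
        for t, (b0, b1, b2) in betas.items():
            profile += pi_hat[t] * (b0 + b1 * times + b2 * times**2)
        return profile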

When interest is in testing for marginal average differences between both


treatment groups, the null hypothesis of interest is
$$
H_0:\;
\begin{cases}
\displaystyle\sum_{t=1}^{7} \pi_t\,\beta_{A0}(t) - \sum_{t=1}^{7} \pi_t\,\beta_{B0}(t) = 0,\\[8pt]
\displaystyle\sum_{t=1}^{7} \pi_t\,\beta_{A1}(t) - \sum_{t=1}^{7} \pi_t\,\beta_{B1}(t) = 0,\\[8pt]
\displaystyle\sum_{t=1}^{7} \pi_t\,\beta_{A2}(t) - \sum_{t=1}^{7} \pi_t\,\beta_{B2}(t) = 0,
\end{cases}
\tag{18.12}
$$

FIGURE 18.3. Toenail Data. Fitted marginal average profiles (18.11) for both
treatment groups, obtained from fitting the pattern-mixture model (18.8), under
the restrictions (18.10).

versus the alternative hypothesis that H0 does not hold. Following Little
(1993), we tested the above hypothesis using a Wald-type test, where the
asymptotic covariance matrix of the estimators of the three functions in
(18.12) is estimated via the delta method. Other methods for precision es-
timation in pattern-mixture models are described by Michiels, Molenberghs
and Lipsitz (1999). These authors use multiple imputation in the context
of categorical data.

Let β̂(t) denote the vector of all six fixed effects in the linear mixed model
corresponding to pattern t_i = t. The asymptotic covariance matrix

var(π̂, β̂(1), . . . , β̂(7))

of all parameters involved in (18.12) is block-diagonal with blocks var(π̂),
var(β̂(1)), . . . , var(β̂(7)). All var(β̂(t)), t = 1, . . . , 7, are readily available
from the statistical package (e.g., the SAS procedure MIXED) used for
fitting the models (18.8) to each pattern separately. Further,

var(π̂) = diag(π̂) − π̂ π̂′

(Bickel and Doksum 1977, Section A.13).

In our example, the average difference in intercepts, linear time effects,


and quadratic time effects [the three functions in (18.12)] are estimated by
0.446, −0.131, and 0.042, respectively. The asymptotic covariance matrix
for these estimators is estimated by

$$
\begin{pmatrix}
0.174 & -0.021 & 0.001\\
-0.021 & 0.018 & -0.003\\
0.001 & -0.003 & 0.001
\end{pmatrix}.
$$

The resulting observed test statistic equals 2.464, which is not significant
(p = 0.482) when compared to a χ²-distribution with 3 degrees of freedom.
This may seem counterintuitive in view of the large difference between both
treatment groups, seen in Figure 18.3. Again, it should be emphasized that
many parameters were estimated with very little precision. Further, the
observed difference is, to a large extent, a function of extrapolation rather
than observation.
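The Wald computation itself is elementary. The sketch below plugs in the rounded values printed above and therefore only approximates the reported statistic of 2.464; the exact value requires the unrounded estimates:

    import numpy as np
    from scipy.stats import chi2

    theta = np.array([0.446, -0.131, 0.042])        # estimated differences
    V = np.array([[ 0.174, -0.021,  0.001],
                  [-0.021,  0.018, -0.003],
                  [ 0.001, -0.003,  0.001]])        # estimated covariance
    W = float(theta @ np.linalg.solve(V, theta))    # Wald statistic
    p = chi2.sf(W, df=3)                            # chi-square with 3 d.f.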

18.4 A Pattern-Mixture Model for the Vorozole Study

In Chapter 14, the individual and average profiles for the Vorozole study
were plotted in a pattern-specific way (Figures 14.5 and 14.6).

Figure 14.6 suggests that the pattern-specific profiles are of a quadratic nature,
with a sharp decline prior to dropout in most cases. Note that this is in line
with the fitted dropout mechanism (17.19). Therefore, it seems reasonable
to expect reflection of this feature in the pattern-mixture model. In analogy
with our selection model, the profiles are forced to pass through the origin.
This is done by allowing only time main effects and interactions of other
covariables with time in the model.

The most complex pattern-mixture model we consider includes a differ-


ent parameter vector for each of the observed patterns. This is done by
having all effects in the model interact with pattern, a factor variable. We
then proceed by backward selection in order to simplify the model. First,
we found that the covariance structure is common to all patterns, encom-
passing random intercept, a serial exponential process, and measurement
error.

For the fixed effects, we proceeded as follows. A backward selection proce-


dure was conducted, starting from a model that includes a main effect of
time and time2 , as well as interactions of time with baseline value, treat-
ment effect, dominant site and pattern, and the interaction of pattern with
time2 . This procedure revealed main effects of time and time2 , as well as
interactions of time with baseline value, treatment effect, and pattern, and

FIGURE 18.4. Vorozole Study. Fitted selection (solid line) and first pat-
tern-mixture models (dashed lines).

the interaction of pattern with time2 . This reduced model can be found
in Table 18.2. As was the case with the selection model in Table 17.11,
the treatment effect is nonsignificant. Indeed, a single degree of freedom
F -test yields a p-value of 0.687. Note that such a test is possible since
treatment effect does not interact with pattern, in contrast to the model
which we will describe next. The fitted profiles are displayed in Figure 18.4.
We observe that the profiles for both arms are very similar. This is due to
the fact that treatment effect is not significant but perhaps also because
we did not allow a more complex treatment effect. For example, we might
consider an interaction of treatment with the square of time and, more im-
portantly, a treatment effect which is pattern-specific. Some evidence for
such an interaction is seen in Figure 14.6.

Our second, expanded model allowed for up to cubic time effects, the in-
teraction of time with dropout pattern, dominant site, baseline value and
treatment, as well as their two- and three-way interactions. After a back-
ward selection procedure, the effects included are time and time2 , the two-
way interaction of time and dropout pattern, as well as three-factor in-
teractions of time and dropout pattern with (1) baseline, (2) group, and
(3) dominant site. Finally, time2 interacts with dropout pattern and with
the interaction of baseline and dropout pattern. No cubic time effects were
necessary, which is in agreement with the observed profiles in Figure 14.6.

TABLE 18.2. Vorozole Study. Parameter estimates and standard errors for the
first pattern-mixture model.

Fixed-effect parameters:
Effect              Estimate (s.e.)    Effect               Estimate (s.e.)
Time                 4.671 (0.844)     Time²                 −0.034 (0.029)
Time∗Pattern 1      −8.856 (2.739)     Time²∗Pattern 1
Time∗Pattern 2      −0.796 (2.958)     Time²∗Pattern 2       −1.918 (1.269)
Time∗Pattern 3      −1.959 (1.794)     Time²∗Pattern 3       −0.145 (0.365)
Time∗Pattern 4       1.600 (1.441)     Time²∗Pattern 4       −0.541 (0.197)
Time∗Pattern 5       0.292 (1.295)     Time²∗Pattern 5       −0.107 (0.133)
Time∗Pattern 6       1.366 (1.035)     Time²∗Pattern 6       −0.181 (0.080)
Time∗Pattern 7       1.430 (1.045)     Time²∗Pattern 7       −0.132 (0.071)
Time∗Pattern 8       1.176 (1.025)     Time²∗Pattern 8       −0.118 (0.061)
Time∗Pattern 9       0.735 (0.934)     Time²∗Pattern 9       −0.083 (0.049)
Time∗Pattern 10      0.797 (1.078)     Time²∗Pattern 10      −0.078 (0.055)
Time∗Pattern 11      0.274 (0.989)     Time²∗Pattern 11      −0.023 (0.046)
Time∗Pattern 12      0.544 (1.087)     Time²∗Pattern 12      −0.026 (0.049)
Time∗Baseline       −0.031 (0.004)     Time∗Treatment        −0.067 (0.166)

Variance parameters:
Random intercept     78.45
Serial variance      95.38
Serial association    8.85
Measurement error    73.77

The parameter estimates of this model are displayed in Tables 18.3 and
18.4. The model is graphically represented in Figure 18.5.

Because a pattern-specific parameter has been included, we have several


options for the assessment of treatment. Since there are 13 patterns (re-
member we cut off the patterns at 2 years), one can test the global hy-
pothesis, based on 13 degrees of freedom, of no treatment effect. We obtain
F = 1.25, producing p = 0.240, indicating that there is no overall treatment
effect. Each of the treatment effects separately is at a nonsignificant level.
Alternatively, the marginal effect of treatment can be calculated, which is
the weighted average of the pattern-specific treatment effects, with weights
given by the probability of occurrence of the various patterns. Its standard
error is calculated using a straightforward application of the delta method,
as in Section 18.3 (see also Section 20.6.2). This effect equals −0.286
(s.e. 0.288), producing a p-value of 0.321, which is still nonsignificant.
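A sketch of this weighted average and its delta-method standard error, assuming hypothetical arrays theta_hat (pattern-specific treatment effects) with covariance V_theta, and pi_hat (pattern probabilities) with covariance V_pi; the two blocks are independent, as they are estimated from separate models:

    import numpy as np

    def marginal_treatment_effect(pi_hat, theta_hat, V_pi, V_theta):
        est = pi_hat @ theta_hat
        # Delta method: the gradient w.r.t. (pi, theta) is (theta, pi).
        var = theta_hat @ V_pi @ theta_hat + pi_hat @ V_theta @ pi_hat
        return est, np.sqrt(var)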

TABLE 18.3. Vorozole Study. Parameter estimates and standard errors for the
second pattern-mixture model (part I). Each column represents an effect, for which
a main effect is given, as well as interactions with the dropout patterns.

Fixed-effect parameters [estimate (s.e.)]

Effect        Time              Time∗Baseline     Time²
Main           5.468 (5.089)    −0.034 (0.040)    −0.271 (0.206)
Pattern 1      7.616 (21.908)   −0.119 (0.175)
Pattern 2     44.097 (17.489)   −0.440 (0.148)   −18.632 (7.491)
Pattern 3     22.471 (10.907)   −0.218 (0.089)    −5.871 (2.143)
Pattern 4     10.578 (9.833)    −0.055 (0.079)    −1.429 (1.276)
Pattern 5     14.691 (8.424)    −0.123 (0.069)    −1.571 (0.814)
Pattern 6      7.527 (6.401)    −0.061 (0.052)    −0.827 (0.431)
Pattern 7    −12.631 (7.367)     0.086 (0.058)     0.653 (0.454)
Pattern 8     14.827 (6.467)    −0.126 (0.053)    −0.697 (0.343)
Pattern 9      5.667 (6.050)    −0.049 (0.049)    −0.315 (0.288)
Pattern 10    12.418 (6.473)    −0.093 (0.051)    −0.273 (0.296)
Pattern 11     1.934 (6.551)    −0.022 (0.053)    −0.049 (0.289)
Pattern 12     6.303 (6.426)    −0.052 (0.050)    −0.182 (0.259)

Effect        Time²∗Baseline     Time∗Treatment
Main           0.002 (0.002)
Pattern 1                         0.445 (5.095)
Pattern 2      0.1458 (0.0644)    0.867 (1.552)
Pattern 3      0.0484 (0.0178)   −1.312 (0.808)
Pattern 4      0.0080 (0.0107)   −0.249 (0.686)
Pattern 5      0.0127 (0.0069)   −0.184 (0.678)
Pattern 6      0.0058 (0.0036)    0.527 (0.448)
Pattern 7     −0.0065 (0.0038)    0.782 (0.502)
Pattern 8      0.0052 (0.0029)   −0.809 (0.464)
Pattern 9      0.0021 (0.0023)   −0.080 (0.443)
Pattern 10     0.0016 (0.0024)    0.331 (0.579)
Pattern 11     0.0003 (0.0024)   −0.679 (0.492)
Pattern 12     0.0015 (0.0021)    0.433 (0.688)
Pattern 13                       −1.323 (0.706)


TABLE 18.4. Vorozole Study. Parameter estimates and standard errors for the
second pattern-mixture model (part II). Each column represents an effect, for
which a main effect is given, as well as interactions with the dropout patterns.

Fixed-effect parameters [estimate (s.e.)]


Effect Time∗Domsite (1) Time∗Domsite (2) Time∗Domsite (3)
Main −0.873 (1.073) 0.941 (0.845) 0.023 (0.576)
Pattern 1 −5.822 (17.401) −9.320 (9.429) 1.431 (9.878)
Pattern 2 2.024 (3.847) 4.393 (2.690) 5.681 (2.642)
Pattern 3 2.937 (2.596) 0.940 (1.697) 1.414 (1.633)
Pattern 4 −1.378 (2.699) −4.366 (2.367) −3.237 (2.289)
Pattern 5 −0.547 (1.917) −1.099 (1.456) −1.015 (1.344)
Pattern 6 1.302 (1.130) −0.914 (0.811)
Pattern 7 3.881 (1.485) 1.733 (1.226) 4.548 (1.218)
Pattern 8 2.359 (1.241) −0.436 (0.843)
Pattern 9 1.138 (1.128) −0.326 (0.753)
Pattern 10 −3.595 (0.996)
Pattern 11 0.317 (1.152) 0.182 (0.825)
Pattern 12 −1.694 (0.972)
Variance parameters
Random intercept 98.93
Serial variance 38.86
Serial association 6.10
Measurement error 73.65

The various assessments of treatment effect, based on the results obtained
in this section and in Section 17.6, are summarized in Table 18.5.

Thus, we obtain a nonsignificant treatment effect from all of our different
models, which adds weight to this conclusion.

18.5 Some Reflections

Strictly speaking, pattern-mixture modeling does not require modeling of
the unobserved outcomes. Indeed, in its simplest form, a chosen measurement
model (e.g., a linear mixed-effects model) can be fitted to the observed
data in each of the dropout patterns separately. Together with estimating

FIGURE 18.5. Vorozole Study. Fitted selection (solid line) and second pat-
tern-mixture models (dashed lines).

the (possibly covariate-dependent) probabilities of membership to each of


the dropout patterns, the model is completed.

However, there are several reasons why more complex manipulations may
be needed. First, there will often be interest in the marginal distribution of
the responses, for which a mixture of an effect over the different dropout
patterns is needed, such as in (18.9).

Second, interest can be placed on the prediction of pattern-specific quanti-


ties, such as average profiles, beyond the time of dropout. This is where
the underidentification of pattern-mixture models manifests itself. Sev-
eral solutions have been proposed. Little (1993) suggested identifying re-
strictions (see Section 18.1). Alternatively, relatively simple models can
be constructed such as linear or quadratic time evolutions, which allow
easy extrapolation (Section 18.3). Finally, incorporating dropout time as
a covariate into the model, information can be borrowed across patterns
(Section 18.4). Thus, an advantage is that the assumptions are always very
explicit, in contrast to selection modeling. This simplifies performing sen-
sitivity analyses by investigating the effect of various assumptions on the
final results. This advantage will be exploited in Chapter 20.

TABLE 18.5. Vorozole Study. Summary of treatment effect assessment.

Method d.f. p-value


Selection model 1 0.582
First pattern-mixture model 1 0.687
Second pattern-mixture model 13 0.240
Second pattern-mixture model 1 0.321

Finally, as illustrated in our analyses, pattern-mixture models often com-


prise large numbers of parameters, some of which may be estimated very
inefficiently, thereby possibly distorting asymptotics.
19
Sensitivity Analysis for Selection Models

19.1 Introduction

In the previous chapters, it was indicated on various occasions (see Sections


15.4, 16.1, 16.5, 17.1, and 17.2) that incomplete longitudinal data pose
specific challenges related to sensitivity to modeling assumptions. Even
when the linear mixed model would beyond any doubt be the preferred
choice to describe the measurement process had the data been complete,
the analysis of the actually observed, incomplete version is still subject
to further, untestable modeling assumptions. The terminology which is
useful to this end has been reviewed in Chapter 15.

The methodologically simplest case is discussed in Chapter 16, where it is


assumed that the missing data are MCAR. Simple techniques such as a
complete case analysis, simple forms of imputation, and so forth may be
advised in some cases. However, the MCAR assumption is a strong one and
made too often in practice. Thus, simple forms of analysis are certainly too
common in applied statistical practice.

When more flexible assumptions, such as MAR or even MNAR, are considered,
several choices have to be made. For example, one has to choose
between selection and pattern-mixture models, or an alternative framework
such as shared-parameter models (Wu and Bailey 1988, 1989, Wu and Car-
roll 1988, DeGruttola and Tu 1994). For a review, see Little (1995). A

more complete literature review can be found in Section 15.4. Selection


models have been studied in Chapter 17, and pattern-mixture models are
the subject of Chapters 18 and 20.

Particularly within the selection modeling framework, there has been an in-
creasing literature on nonrandom missing data. At the same time, concern
has been growing precisely about the fact that models often rest on strong
assumptions and relatively little evidence from the data themselves. This
point was already raised by Glynn, Laird and Rubin (1986), who indicate
that this is typical for so-called selection models, whereas it is much less so
for a pattern-mixture model (Section 18.1.2). In Section 17.1 attention was
drawn to the fact that much of the debate on selection models is rooted in
the econometrics literature, in particular Heckman’s selection model (Heck-
man 1976). Draper (1995) and Copas and Li (1997) provide useful insight
into model uncertainty and nonrandomly selected samples. Vach and Blet-
tner (1995) study the case of incompletely observed covariates in logistic
regression.

Because the model of Diggle and Kenward (1994) fits within the class of se-
lection models, it is fair to say that it raised, at first, too high expectations.
This was made clear by many discussants of the paper. It implies that, for
example, formal tests for the null hypothesis of random missingness, al-
though technically possible, should be approached with caution. See also
Section 18.1.1. In Section 17.2.2, it was shown, using the toenail data, that
excluding a small amount of measurement error can have a serious impact
on the rest of the model parameters. In particular, the likelihood ratio
test statistic for the random-dropout null hypothesis changes drastically
(Table 17.1).

In response to these concerns, there is growing awareness of the need for


methods that investigate the sensitivity of the results with respect to the
model assumptions. See, for example, Nordheim (1984), Little (1994a), Ru-
bin (1994), Laird (1994), Fitzmaurice, Molenberghs, and Lipsitz (1995),
Molenberghs, Goetghebeur, Lipsitz, and Kenward (1999), and Kenward
and Molenberghs (1999). Still, only a few actual proposals have been made.
Moreover, many of these are to be considered useful but ad hoc approaches.
Whereas such informal sensitivity analyses are an indispensable step in the
analysis of incomplete longitudinal data, it is desirable to conduct more
formal sensitivity analyses.

In any case, fitting a nonrandom dropout model should be subject to careful


scrutiny. The modeler needs to pay attention, not only to the assumed
distributional form of her model (Little 1994b, Kenward 1998; see also
Section 19.5.1), but also to the impact one or a few influential subjects may
have on the dropout and/or measurement model parameters (Section 19.5).
Because fitting a nonrandom dropout model is feasible by virtue of strong
assumptions, such models are likely to pick up a wide variety of influences in


the parameters describing the nonrandom part of the dropout mechanism.
Hence, a good level of caution is in order.

We could define a sensitivity analysis as one in which several statistical


models are considered simultaneously and/or where a statistical model is
further scrutinized using specialized tools (such as diagnostic measures).
This rather loose and very general definition encompasses a wide variety
of useful approaches. The simplest procedure is to fit a selected number
of (nonrandom) models which are all deemed plausible or one in which a
preferred (primary) analysis is supplemented with a number of variations.
The extent to which conclusions (inferences) are stable across such ranges
provides an indication about the belief that can be put into them. Vari-
ations to a basic model can be constructed in different ways. The most
obvious strategy is to consider various dependencies of the missing data
process on the outcomes and/or on covariates. Alternatively, the distribu-
tional assumptions of the models can be changed.

Section 19.2 adapts the model of Diggle and Kenward (1994) to a form
useful for sensitivity analysis. Such a sensitivity analysis method, based
on local influence (Cook 1986; Thijs, Molenberghs, and Verbeke 2000; see
also Section 11.2) is introduced in Section 19.3 and applied to the rats
data in Section 19.4. Note that in Section 24.4, a comparison is made with
a more conventional global influence analysis (Chatterjee and Hadi 1988).
Both informal and formal methods of sensitivity are applied to the mastitis
data in Section 19.5. Note that a sensitivity analysis of the milk protein
contents data is given in Section 24.4. An outlook on alternative approaches
is given in Section 19.6. Random-coefficient-based models are discussed
in Section 19.7. We will conclude that caution is needed with selection
models and that a mechanical use, perhaps stimulated by the availability
of software such as the SPlus suite of functions termed OSWALD (Smith,
Robertson, and Diggle 1996), should be avoided.

19.2 A Modified Selection Model for Nonrandom Dropout

In Section 17.5, the selection model of Diggle and Kenward (1994) was presented,
specifically with a linear mixed model for the measurement process
and a logistic-regression-based dropout model (see also Sections 17.2 and
17.6).

In this chapter, we investigate the sensitivity of the estimation of quantities


of interest, such as treatment effect, growth parameters, or the dropout
parameters, with respect to assumptions about the dropout model. To this


end, we consider the following perturbed version of (17.17):

$$\mathrm{logit}\bigl(g(h_{ij}, y_{ij})\bigr) = \mathrm{logit}\bigl[\mathrm{pr}(D_i = j \mid D_i \ge j, y_i)\bigr] = h_{ij}'\psi + \omega_i\, y_{ij} \tag{19.1}$$

(i = 1, . . . , N ), in which different subjects give different weights to the


response at occasion j to predict dropout at occasion j. If all ωi equal
zero, the model reduces to a MAR model; hence, (19.1) can be seen as an
extension of the MAR model, which allows some individuals to drop out in
a “less random” way (|ωi | large) than others (|ωi | small). It is important to
note that we will not consider ωi to be (subject-specific) parameters in the
usual sense. Rather, they have to be seen as perturbations around a null
model, which in this case will be MAR (ωi = 0). Then, studying the effect
of extending an MAR model to the nonrandom case on the parameters of
interest can be achieved by investigating the effect of perturbing the ωi ’s
around zero. This will be done using the local influence approach of Cook
(1986) which has been described in Section 11.2. In the next section, the
theory is summarized and then adapted to the current setting.
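As a small illustration of (19.1), the perturbed dropout probability is straightforward to evaluate; a sketch, in which h_ij denotes the vector of dropout covariates (typically containing the previous measurement):

    import numpy as np
    from scipy.special import expit

    def dropout_prob(h_ij, y_ij, psi, omega_i):
        # logit g = h'psi + omega_i * y_ij; omega_i = 0 recovers MAR.
        return expit(np.dot(h_ij, psi) + omega_i * y_ij)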

19.3 Local Influence

George Box famously stated that all statistical models are wrong,
but some are useful. Cook (1986) uses this idea to motivate his assessment
of local influence. He suggests that more confidence can be put in a model
which is relatively stable under small modifications. The best known per-
turbation schemes are based on case deletion (Cook and Weisberg 1982),
in which the effect is studied of completely removing cases from the analy-
sis. A quite different paradigm is the local influence approach where one
investigates how the results of an analysis are changed under small pertur-
bations of the model. In the framework of the linear mixed model, Beckman,
Nachtsheim, and Cook (1987) used local influence to assess the effect of per-
turbing the error variances, the random-effects variances, and the response
vector. In the same context, Lesaffre and Verbeke (1998) have shown that
the local influence approach is also useful for the detection of influential
subjects in a longitudinal data analysis. Moreover, because the resulting
influence diagnostics can be expressed analytically, they often can be de-
composed in interpretable components, which yield additional insights in
the reasons why some subjects are more influential than others.

In our case, we are interested in the influence the nonrandomness of dropout


exerts on the parameters of interest, which will most often be the fixed-
effects parameters, possibly supplemented with the variance components.
This can be done in a meaningful way by considering (19.1) as the dropout
model. Indeed, ωi = 0, for all i, corresponds to an MAR process and such a


process cannot influence the measurement model parameters. When small
perturbations in a specific ωi lead to relatively large differences in the
model parameters, then this suggests that these subjects may have a large
impact on the final analysis. Therefore, even though we may be tempted
to conclude that such subjects drop out nonrandomly, this conclusion is
misguided because we are not aiming to detect (groups of) subjects that
drop out nonrandomly but rather subjects that have a considerable impact
on the dropout and measurement model parameters.

In Section 19.3.1, a general introduction is given about the local influence


methodology as introduced by Cook (1986). In Section 19.3.2, we will ap-
ply it to the dropout model presented in Section 17.5, and a special but
important case will be discussed in Section 19.3.3.

19.3.1 Review of the Theory

We denote the log-likelihood function corresponding to model (19.1) by

$$\ell(\gamma \mid \omega) = \sum_{i=1}^{N} \ell_i(\gamma \mid \omega_i),$$

in which ℓi(γ|ωi) is the contribution of the ith individual to the log-likelihood
and where γ = (θ, ψ) is the s-dimensional vector grouping the parameters
of the measurement model and the dropout model, not including the N × 1
vector ω = (ω1, ω2, . . . , ωN)′ of weights defining the perturbation of the
MAR model. This expression arises from taking the logarithm of (15.10),
the model components of which are described in Section 17.5. It is assumed
that ω belongs to an open subset Ω of ℝᴺ. For ω equal to ω0 = (0, 0, . . . , 0)′,
ℓ(γ|ω0) is the log-likelihood function which corresponds to an MAR dropout
model.

Let γ̂ be the maximum likelihood estimator for γ, obtained by maximizing
ℓ(γ|ω0), and let γ̂ω denote the maximum likelihood estimator for γ under
ℓ(γ|ω). The local influence approach now compares γ̂ω with γ̂. Similar
estimates indicate that the parameter estimates are robust with respect to
perturbations of the MAR model in the direction of nonrandom dropout.
Strongly different estimates suggest that the model is highly sensitive to
such perturbations, which suggests that the choice between an MAR model
and a nonrandom dropout model highly affects the results of the analysis.
Recall that Cook (1986) proposed to measure the distance between γ̂ω
and γ̂ by the likelihood displacement, defined by

$$LD(\omega) = 2\left[\ell(\hat{\gamma} \mid \omega_0) - \ell(\hat{\gamma}_\omega \mid \omega_0)\right].$$

This takes into account the variability of γ̂. Indeed, LD(ω) will be large
if ℓ(γ|ω0) is strongly curved at γ̂, which means that γ is estimated with
high precision, and small otherwise. Therefore, a graph of LD(ω) versus ω
contains essential information on the influence of perturbations. It is useful
to view this graph as the geometric surface formed by the values of the
(N + 1)-dimensional vector ξ(ω) = (ω′, LD(ω))′ as ω varies throughout
Ω. See Figure 11.1.

Since this so-called influence graph can only be depicted when N = 2, Cook
(1986) proposed looking at local influence [i.e., at the normal curvatures
C_h of ξ(ω) in ω0], in the direction of some N-dimensional vector h of unit
length. Let ∆i be the s-dimensional vector defined by

$$\Delta_i = \left.\frac{\partial^2 \ell_i(\gamma \mid \omega_i)}{\partial \omega_i\, \partial \gamma}\right|_{\gamma = \hat{\gamma},\ \omega_i = 0}$$

and define ∆ as the (s × N) matrix with ∆i as its ith column. Further,
let L̈ denote the (s × s) matrix of second-order derivatives of ℓ(γ|ω0) with
respect to γ, also evaluated at γ = γ̂. Cook (1986) has then shown that
C_h can be easily calculated by

$$C_h = 2\,\bigl|h'\Delta'\ddot{L}^{-1}\Delta h\bigr|.$$

Obviously, C_h can be calculated for any direction h. One evident choice is
the vector h_i containing 1 in the ith position and 0 elsewhere, corresponding
to the perturbation of the ith weight only. This reflects the influence
of allowing the ith subject to drop out nonrandomly, whereas the others
can only drop out at random. The corresponding local influence measure,
denoted by C_i, then becomes C_i = 2|∆i′ L̈⁻¹ ∆i|. Another important direction
is the direction h_max of maximal normal curvature C_max. It shows how
to perturb the MAR model to obtain the largest local changes in the likelihood
displacement. It is readily seen that C_max is the largest eigenvalue
of −2∆′L̈⁻¹∆, and that h_max is the corresponding eigenvector.

When a subset γ 1 of γ = (γ 1 , γ 2 ) is of special interest, a similar approach


can be used, replacing the log-likelihood by the profile log-likelihood for γ 1 ,
and the methods discussed above for the full parameter vector directly carry
over. There are many possible choices for the vector h. For example, Ci (γ 1 ),
corresponding to h = hi , defined above, expresses the local influence of
allowing the ith subject to drop out nonrandomly on the estimation of γ 1 .
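Computationally, once ∆ and L̈ are available, all quantities of this section follow from elementary linear algebra. A minimal sketch:

    import numpy as np

    def local_influence(Delta, L_dd):
        # Delta: (s x N), one column per subject; L_dd: (s x s) matrix of
        # second derivatives of the log-likelihood, both at the MAR fit.
        M = -2.0 * Delta.T @ np.linalg.solve(L_dd, Delta)    # N x N
        C_i = np.abs(np.diag(M))          # C_i = 2|Delta_i' L^{-1} Delta_i|
        eigval, eigvec = np.linalg.eigh((M + M.T) / 2.0)
        return C_i, eigval[-1], eigvec[:, -1]    # C_i, C_max, h_max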

19.3.2 Applied to the Model of Diggle and Kenward

As discussed in the previous section, calculation of local influence measures


merely reduces to the evaluation of ∆ and L̈. The components of L̈ follow
from a standard likelihood optimization routine. For the linear mixed model
with Σi = σ 2 Ini , expressions are derived by Lesaffre and Verbeke (1998)
and can easily be extended to the more general case considered here. Let us
use iω as shorthand for i (γ|ωi ). It is shown in the Appendix (Section B.1),
which can be skipped without problem for the less technically interested
reader, that the components of the columns ∆i of ∆ are given by



$$\left.\frac{\partial^2 \ell_{i\omega}}{\partial\theta\,\partial\omega_i}\right|_{\omega_i = 0} = 0, \tag{19.2}$$

$$\left.\frac{\partial^2 \ell_{i\omega}}{\partial\psi\,\partial\omega_i}\right|_{\omega_i = 0} = -\sum_{j=2}^{n_i} h_{ij}\,y_{ij}\,g(h_{ij})\,[1 - g(h_{ij})], \tag{19.3}$$

for complete sequences (no dropout), and by

$$\left.\frac{\partial^2 \ell_{i\omega}}{\partial\theta\,\partial\omega_i}\right|_{\omega_i = 0} = [1 - g(h_{id})]\,\frac{\partial\lambda(y_{id} \mid h_{id})}{\partial\theta}, \tag{19.4}$$

$$\left.\frac{\partial^2 \ell_{i\omega}}{\partial\psi\,\partial\omega_i}\right|_{\omega_i = 0} = -\sum_{j=2}^{d-1} h_{ij}\,y_{ij}\,g(h_{ij})\,[1 - g(h_{ij})] - h_{id}\,\lambda(y_{id} \mid h_{id})\,g(h_{id})\,[1 - g(h_{id})], \tag{19.5}$$

for incomplete sequences, where all of the above expressions are evaluated
at γ̂, and where g(hij) = g(hij, yij)|ωi=0 is the MAR version of the dropout
model. By λ(yid |hid ), we denote the expected value of yid , given the his-
tory and the fitted MAR model parameters. It is understood that hij is
restricted to a relevant subset of the history. For example, one can restrict
attention to the previous measurement, in which case hij is taken to be
yi,j−1 .

Let Vi,11 be the predicted covariance matrix for the observed vector
(yi1, . . . , yi,d−1)′, let Vi,22 be the predicted variance for the missing observation
yid, and let Vi,12 be the vector of predicted covariances between the
elements of the observed vector and the missing observation. It then follows
from the linear mixed model (3.8) that the conditional expectation for
the observation at dropout, given the history, equals

$$\lambda(y_{id} \mid h_{id}) = \lambda(y_{id}) + V_{i,21}\,V_{i,11}^{-1}\,[h_{id} - \lambda(h_{id})]. \tag{19.6}$$

The derivatives of (19.6) with respect to the fixed effects and variance
components in the measurement model are

$$\frac{\partial\lambda(y_{id} \mid h_{id})}{\partial\beta} = x_{id} - V_{i,21}\,V_{i,11}^{-1}\,X_{i,(d-1)},$$

$$\frac{\partial\lambda(y_{id} \mid h_{id})}{\partial\alpha} = \left[\frac{\partial V_{i,21}}{\partial\alpha} - V_{i,21}\,V_{i,11}^{-1}\,\frac{\partial V_{i,11}}{\partial\alpha}\right] V_{i,11}^{-1}\,[h_{id} - \lambda(h_{id})],$$
respectively, where xid is the dth row of Xi and where Xi,(d−1) indicates
the first (d − 1) rows of Xi .

In practice, the parameter θ in the measurement model is often of primary


interest. Since L̈ is block-diagonal with blocks L̈(θ) and L̈(ψ), we have that
for any unit vector h, C_h equals C_h(θ) + C_h(ψ), with

$$C_h(\theta) = -2\,h'\Delta'\ddot{L}^{-1}(\theta)\,\Delta h, \qquad C_h(\psi) = -2\,h'\Delta'\ddot{L}^{-1}(\psi)\,\Delta h,$$

evaluated at γ = γ̂. It now immediately follows from (19.2) and (19.4)
that influence on θ only arises from those measurement occasions at which
dropout occurs. This implies that complete sequences cannot be influential
(Ci (θ) = 0) and that incomplete sequences only contribute at the actual
dropout time. This is intuitively clear from the following consideration.
Suppose the model is fitted using the EM algorithm (Dempster, Laird, and
Rubin 1977); then the E step determines the expected values of the missing
measurements, and the M step fits an ignorable model to the so-completed
data set. The only way in which the dropout process can influence the
measurement model parameters is by predicting a value which deviates
from prediction under ignorability, which is simply the conditional mean
of the missing measurement, given the history. From expression (19.4), it
is clear that the corresponding contribution is large only if (1) the dropout
probability was small but the subject disappeared nevertheless and (2) the
conditional mean “strongly depends” on the parameter of interest.

Additional insight can be gained from comparing two incomplete sequences,


with equal history, which drop out at the same time point. They then have
the same contribution 1 − g(hid ) to (19.4). Hence, different influences on θ
can be ascribed to differences for the second factor of (19.4). For example,
for the fixed effects, we have that

$$\frac{\partial\lambda(y_{id} \mid h_{id})}{\partial\beta} - \frac{\partial\lambda(y_{jd} \mid h_{jd})}{\partial\beta} = x_{id} - x_{jd} - (V_{i,21} - V_{j,21})\,V_{i,11}^{-1}\,X_{i,(d-1)}.$$
Hence, if the estimated covariance matrix for the complete data is the same
for both sequences, the above expression reduces to xid − xjd , indicating
that differences with respect to Ci (θ) can be entirely ascribed to differences
in time-varying covariates for the mean structure. A similar interpretation
can be obtained for the variance components α.

19.3.3 Special Case: Compound Symmetry

A special but enlightening case is the compound symmetry covariance


structure. It arises from assuming that Zi = 1ni , a vector of ones. The
matrix D then reduces to ν 2 . Assuming further that Σi = σ 2 Ini the covari-
ance matrix becomes Vi = σ 2 Ini + ν 2 Jni , where Jni is an (ni × ni ) matrix
of ones. We will now study Ci (θ) and Ci (ψ) in turn.

As discussed earlier, L̈ is block-diagonal, from which it follows that



$$C_i(\theta) = -2\,[1 - g(h_{id})]^2\; \frac{\partial\lambda(y_{id} \mid h_{id})}{\partial\theta'}\; \ddot{L}^{-1}(\theta)\; \frac{\partial\lambda(y_{id} \mid h_{id})}{\partial\theta}, \tag{19.7}$$
in which the first factor is large for a small dropout probability at the
time of dropout, in other words, for an unlikely event. This is intuitively
appealing, since g_id then has the potential of being improved by including
dependence on y_id. For such a subject, apparent “nonrandomness” would
help.

The second factor of (19.7) involves L̈(θ) and is therefore harder to study.
However, we can still make progress if we are prepared to make some ap-
proximations. The off-diagonal block of the observed information matrix
L̈(θ) pertaining to the mixed derivatives w.r.t. β and α is not equal to
zero. The corresponding block of the expected information matrix is zero
for a complete data problem, but is not for an incomplete data set, unless
the missing data are MCAR (Kenward and Molenberghs 1998). However,
these authors also argue that in many practical settings, the difference
might be small (see also Chapter 21). Therefore, we will assume that L̈(θ)
is block-diagonal and that Ci(θ) ≈ Ci^ap(β) + Ci^ap(σ², ν²).

Let us consider Ciap (β) first. With some algebra, we arrive at


$$\frac{\partial\lambda(y_{id} \mid h_{id})}{\partial\beta} = \xi_{id}\,x_{id} + (1 - \xi_{id})\,\rho_{id} \tag{19.8}$$

with

$$\xi_{id} = \frac{\sigma^2}{\sigma^2 + (d-1)\nu^2}, \qquad \rho_{id} = x_{id} - \frac{1}{d-1}\,X_{i(d-1)}'\,1_{d-1}.$$
Note that (19.8) is a weighted average of the covariate xid and the within-
series residual covariate ρid , at time d. Further, the matrix of second order
derivatives L̈−1 (β) equals
N

−1
 ν2
L̈ (β) = Xi(d−1) Id−1 + 2 Jd−1 Xi(d−1) ,
i=1
σ + (d − 1)ν 2

from which it follows that


Ciap (β)
= 2[1 − g(hid )]2 (ξid xid + (1 − ξid )ρid )
(N )
  
 −1
×σ 2 ξid Xi(d−1) Xi(d−1) + (1 − ξid )Ri(d−1) Ri(d−1)
i=1
×(ξid xid + (1 − ξid )ρid ), (19.9)
304 19. Sensitivity Analysis for Selection Models

where

Ri,d−1 = Xi(d−1) − 1ni d−1 Xi(d−1) .

Here,
1
Xi(d−1) = 1n  Xi(d−1) .
d − 1 i d−1

Expression (19.9) is the product of the factor which purely depends on


the dropout probability and a factor which has the structure of a lever-
age. When ξid = 1 for all individuals, we have a classical leverage where
each measurement is an independent contribution. When ξid = 0, each
subject presents a single independent contribution. The general case is a
weighted combination of the between- and within-individual contributions.
These arguments motivate calling the second factor of Ci^ap(β) a generalized
leverage, not only for compound symmetry but also for general covariance
structures.

Similar calculations can be performed for the variance components (σ 2 , ν 2 ),


yielding
$$C_i^{ap}(\sigma^2, \nu^2) = 2\,[1 - g(h_{id})]^2\,\xi_{id}^2\,(1 - \xi_{id})^2\;\overline{[h_{id} - \lambda(h_{id})]}^{\,2}\;\left(-1,\ \frac{1}{\nu^2}\right) \ddot{L}^{-1}(\sigma^2, \nu^2) \left(-1,\ \frac{1}{\nu^2}\right)', \tag{19.10}$$

where

$$\ddot{L}(\sigma^2, \nu^2) = \sum_{i=1}^{N} \frac{d-1}{2\,\bigl(\sigma^2 + (d-1)\nu^2\bigr)^2}
\begin{pmatrix}
\dfrac{[\sigma^2 + (d-1)\nu^2]^2 - \nu^2\,[2\sigma^2 + (d-1)\nu^2]}{\sigma^4} & 1\\[6pt]
1 & d-1
\end{pmatrix},$$

and $\overline{[h_{id} - \lambda(h_{id})]}$ represents the average difference between the hid column
and its predicted value.

It is important to note that, even though L̈−1 (σ 2 , ν 2 ) has a somewhat com-


plicated form, it occurs in (19.10) only through a scalar. Thus, Ciap (σ 2 , ν 2 )
can in practice be decomposed into three interpretable components. The
first factor is shared with Ciap (β) and has the same interpretation. The sec-
ond factor disappears when either the measurement error variance or the
variance of the random intercept is reduced to zero. It is maximal when
there is “balance” between both components of variability (ξid = 0.5). The
third factor is large when the squared average residual of the history at the
time of dropout is large.

For the dropout model parameters, there are no approximations involved,


and we have that
$$C_i(\psi) = 2\left(\sum_{j=2}^{d} h_{ij}\,y_{ij}\,v_{ij}\right)'\left(\sum_{i=1}^{N}\sum_{j=2}^{d} v_{ij}\,h_{ij}\,h_{ij}'\right)^{-1}\left(\sum_{j=2}^{d} h_{ij}\,y_{ij}\,v_{ij}\right), \tag{19.11}$$

in which d = ni for a complete case and where yid needs to be replaced


with

$$\lambda(y_{id} \mid h_{id}) = \lambda(y_{id}) + (1 - \xi_{id})\,\overline{[h_{id} - \lambda(h_{id})]}$$

for incomplete sequences. Further, vij equals g(hij )[1 − g(hij )], which is
the variance of the estimated dropout probability under MAR. Expression
(19.11) bears some resemblance with the hat-matrix diagonal, used for
diagnostic purposes in logistic regression (Hosmer and Lemeshow 1989).
One of the differences is that the contributions from a single individual
are summed in the first and third factors of (19.11), even though they
contribute independent pieces of information to the logistic regression. This
is because each individual is given a single weight ωi for an entire sequence
of measurements.

To get a good feel for when Ci (ψ) is large, simplify hij yij vij to

$$F(y) = y^2\,g\,(1 - g), \tag{19.12}$$

which is based on the assumption that previous and current measurements


are approximately equal. Given estimates for ψ, it is easy to determine at
which value this function is maximal.
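For instance, taking the MAR estimates for the rat data of Section 19.4 (Table 19.1: ψ0 = −8.48, ψ1 = 0.084), the maximizer of (19.12) can be located numerically; the search interval below is an arbitrary choice covering the observed response range:

    from scipy.optimize import minimize_scalar
    from scipy.special import expit

    def F(y, psi0, psi1):
        g = expit(psi0 + psi1 * y)     # dropout probability with h_ij ~ y
        return y**2 * g * (1 - g)

    res = minimize_scalar(lambda y: -F(y, -8.48, 0.084),
                          bounds=(0.0, 250.0), method="bounded")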

An even greater resemblance would be obtained by using an alternative


weighting scheme which replaces ωi in (19.1) by ωij , hereby giving differ-
ent weights to the different observations within subjects. This alternative
perturbation scheme would not imply any differences in the influence con-
tributions for the measurement model, but for the dropout parameters, we
would obtain
$$C_{ij}(\psi) = 2\,v_{ij}\,y_{ij}^2\left\{v_{ij}\,h_{ij}'\left(\sum_{i=1}^{N}\sum_{j=2}^{d} v_{ij}\,h_{ij}\,h_{ij}'\right)^{-1} h_{ij}\right\}, \tag{19.13}$$
where the factor in curly braces equals the hat-matrix diagonal. In the case
of dropout, the same replacement as before for yid has to be made. When
the length of a measurement sequence is restricted to 2, then (19.13) and
(19.11) coincide. Note that this alternative perturbation scheme assigns
weights to observations rather than subjects, changing the interpretation
of the results of the analysis. Also, the graphical representation of the
results is more involved since series of influence measures Cij now have to
be studied and interpreted.

19.3.4 Serial Correlation

Thus far, the development has focused on the standard linear mixed model,
with random effects and measurement error. If, in addition, a serially cor-
related part of the variance structure is thought to be present, the variance
parameter α needs to be extended with the serial correlation parameters,
as was done in Section 3.3.4, and thus encompasses djk , the components
of the variance-covariance matrix of the random effects, τ 2 , the variance
of the serial process, ϕ, the serial correlation parameter, and σ 2 , the vari-
ance of the measurement error process. The general model is spelled out
in (3.11).

Most of the development in Section 19.3.2 remains the same. We will briefly
outline the changes that have to be made. For the various variance para-
meters, the derivative factors in expression (19.7) become

$$
\begin{aligned}
\frac{\partial\lambda(y_{id} \mid h_{id})}{\partial d_{jk}} &= \left\{[Z_i D_{jk} Z_i']_{21} - V_{i,21}\,V_{i,11}^{-1}\,[Z_i D_{jk} Z_i']_{11}\right\} V_{i,11}^{-1}\,[h_{id} - \lambda(h_{id})],\\
\frac{\partial\lambda(y_{id} \mid h_{id})}{\partial \tau^2} &= \left[H_{i,21} - V_{i,21}\,V_{i,11}^{-1}\,H_{i,11}\right] V_{i,11}^{-1}\,[h_{id} - \lambda(h_{id})],\\
\frac{\partial\lambda(y_{id} \mid h_{id})}{\partial \varphi} &= \left[\tau^2 K_{i,21} - V_{i,21}\,V_{i,11}^{-1}\,\tau^2 K_{i,11}\right] V_{i,11}^{-1}\,[h_{id} - \lambda(h_{id})],\\
\frac{\partial\lambda(y_{id} \mid h_{id})}{\partial \sigma^2} &= -V_{i,21}\,V_{i,11}^{-2}\,[h_{id} - \lambda(h_{id})].
\end{aligned}
$$
Here,

$$[D_{jk}]_{\ell m} = \delta_{j\ell}\,\delta_{km} + \delta_{jm}\,\delta_{k\ell} - \delta_{jk}\,\delta_{j\ell}\,\delta_{km}, \qquad
K_{i,\ell m} = |t_{i\ell} - t_{im}|^{u}\, e^{-\varphi\,|t_{i\ell} - t_{im}|^{u}},$$

with obvious subscript and superscript use. The exponent u = 1 in the
exponential case and u = 2 in the Gaussian case.

FIGURE 19.1. Rat Data. Individual growth curves for the three treatment groups
separately. Influential subjects are highlighted by bold lines or dots.

19.4 Analysis of the Rat Data

In order to illustrate the above methodology, we will apply the local influ-
ence approach to data from a randomized experiment, designed to study
the effect of the inhibition of the testosterone production in rats. The data
were introduced in Section 2.1. The profiles were explored in Section 4.3.3.
A linear mixed model with random intercepts was fitted in Section 6.3.3
(equation (6.12)).

The individual profiles are shown in Figure 19.1. They can be linearized by
using the logarithmic transformation t = ln(1 + (age − 45)/10) for the time
scale. This is also the scale we will use from now on in all statistical analy-
ses. Note that the transformation was chosen such that t = 0 corresponds
to the start of the treatment. We assume a linear mixed model for the re-
sponse with common average intercept β0 for all three groups, with average
slopes β1 , β2 , and β3 for the three treatment groups, respectively, and as-
suming compound symmetry covariance structure, with common variance
σ 2 + ν 2 and common covariance ν 2 . These models are estimated under
MCAR, MAR, and MNAR processes, using the PCMID function in the
Splus suite of functions called OSWALD (Smith, Robertson, and Diggle
1996). The estimates are displayed in Table 19.1 (original data). Following
these models, and if we are prepared to believe the assumptions on which
they rest, there is little evidence of MAR and no evidence for MNAR.

TABLE 19.1. Rat Data. Maximum likelihood estimates (standard errors) of com-
pletely random, random and nonrandom dropout models, fitted to the rat data set,
with and without modification.

Original Data
Effect Parameter MCAR MAR MNAR
Measurement model:
Intercept β0 68.61 68.61 68.61
Slope control β1 7.51 7.51 7.50
Slope low dose β2 6.87 6.87 6.86
Slope high dose β3 7.31 7.31 7.30
Random intercept ν² 3.44 3.44 3.44
Measurement error σ² 1.43 1.43 1.43
Dropout model:
Intercept ψ0 −1.98 −8.48 −8.05
Prev. measurement ψ1 0.084 0.096
Curr. measurement ω = ψ2 −0.017
−2 log-likelihood 1777.3 1774.5 1774.5
Modified Data
Effect Parameter MCAR MAR MNAR
Measurement model:
Intercept β0 70.20 70.20 70.26
Slope control β1 7.52 7.52 7.39
Slope low dose β2 6.97 6.97 6.88
Slope high dose β3 7.21 7.21 6.98
Random intercept ν² 40.38 40.38 40.83
Measurement error σ² 1.42 1.42 1.46
Dropout model:
Intercept ψ0 −2.20 −0.79 3.23
Prev. measurement ψ1 −0.015 0.32
Curr. measurement ω = ψ2 −0.38
−2 log-likelihood 1906.6 1894.6 1890.2

The estimates in Table 19.1 differ from those in Table 6.4, since the latter
were obtained with the REML method.

Figure 19.2 displays overall Ci , as well as influences for subvectors θ, β,


α, and ψ. In addition, the direction hmax corresponding to maximal local
influence is given. We observe large absolute scale differences for different
influence graphs. As is clear from such expressions as (19.9) and (19.11), the
FIGURE 19.2. Rat Data. Index plots of Ci , Ci (θ), Ci (β), Ci (α), Ci (ψ), and of
the components of the direction hmax of maximal curvature.

absolute magnitude of Ci (·) depends upon the scale on which the measure-
ments and/or covariates are expressed, and hence influence graphs should
be interpreted in a relative fashion.

The largest Ci are observed for rats #10, #16, #35, and #41, and virtually
the same picture holds for Ci (ψ). They are highlighted in Figure 19.1. All
four belong to the low-dose group. Arguably, their relatively large influence
is caused by an interplay of three facts. First, the profiles are relatively high,
and hence yij and hij in (19.11) are large. Second, since all four profiles
are complete, the first factor in (19.11) contains a maximal number of large
terms. Third, the computed vij are relatively large, which is implied by the
MAR dropout model parameter estimates in Table 19.1. Indeed, for these
measurements, the logit of the dropout probability is closest to 0 and hence
vij is fairly close to its maximal value of 0.25.

Turning attention to Ci (α) reveals peaks for rats #5 and #23. Both be-
long to the control group and drop out after a single measurement occasion.
They are highlighted (by means of a bullet) in the first panel of Figure 19.1.
To explain this, observe that the relative magnitude of Ci (α), approxi-
mately given by (19.10), is determined by 1 − g(hid ) and hid − λ(hid ).
The first term is large when the probability of dropout is small. Now,
when dropout occurs early in the sequence, the measurements are still
relatively low, implying that the dropout probability is rather small (cf.
FIGURE 19.3. Rat Data. Index plots of Ci , Ci (θ), Ci (β), Ci (α), Ci (ψ), and of
the components of the direction hmax of maximal curvature where four profiles
have been shifted upward.

Table 19.1). This feature is built into the model by writing the dropout
probability in terms of the raw measurements with time-independent coef-
ficients rather than, for example, in terms of residuals. Alternatively, the
dropout model parameters could be made time dependent. Further, the
residual hid − λ(hid ) is large since these two rats are somewhat distant
from their group-by-time mean.

All deviations discussed are fairly moderate. This conclusion is supported


by the observation that the components of the normalized vector hmax do
not deviate much from 1/√N, and it is consistent with the observation that
the likelihood ratio statistics for MNAR versus MAR did not reject the null
hypothesis.

To further explore the properties of the influence diagnostics, we consider


a second analysis where all responses for rats #10, #16, #35, and #41
have been increased by 20 units. A graphical display of the local influence
measures is given in Figure 19.3. The parameter estimates for all three
models are also shown in Table 19.1. The peaks in Ci and Ci (ψ) observed
earlier have become much clearer. Thus, the fact that the test statistics for
MAR versus MCAR and for MNAR versus MAR have become significant
is correctly explained by the influence analysis to have been driven by the
four extreme profiles.

FIGURE 19.4. Rat Data. Index plots of Ci , Ci (θ), Ci (β), Ci (α), Ci (ψ), and of
the components of the direction hmax of maximal curvature, where 4 profiles have
been shifted upward and the components have been ordered in decreasing order of
Ci .

Graphical representations such as Figure 19.3 are sometimes judged mis-


leading since the apparent magnitude of a subject is influenced by its neigh-
bors. On the other hand, it preserves the order across all six index plots.
One way to overcome this problem is by ordering one plot (e.g., according to
Ci ) and keeping this order across all six panels. This is done in Figure 19.4.
Alternatively, scatter plots of (1) the measurement versus dropout com-
ponents and (2) fixed-effects versus variance component elements can be
used. An example of the latter is presented in Figure 19.5. In this figure,
the axes are extended slightly below zero for ease of display, even though
these values are always non-negative.

The analysis of the rat data set supports the claim that the influence mea-
sures are easy to interpret. In addition to the advantages quoted earlier,
we claim that a careful study of the conditions under which the diagnostics
become large can shed some light on the adequacy of the model formula-
tion. For example, the Diggle and Kenward (1994) model usually writes
the logit of the dropout probability as a function of the raw measurements,
with time-independent coefficients. This implies that an expression such
as (19.11) depends directly on the magnitude of the responses. An alter-
native parameterization of the dropout probability in terms of residuals
(Yij − µij )/σij would obviously yield a different picture. However, this pa-
FIGURE 19.5. Rat Data. Scatter plots of (1) Ci (θ) versus Ci (ψ) and (2) Ci (β)
versus Ci (α), where four profiles have been shifted upward.

rameterization has one important drawback in the sense that parameters


are shared between the measurement and dropout models, thus destroying
the separability (see Section 15.8). As a consequence, such a parameteriza-
tion would require an entire new and much more complicated theoretical
development.

19.5 Mastitis in Dairy Cattle

19.5.1 Informal Sensitivity Analysis

The data have been introduced in Section 2.7. Diggle and Kenward (1994)
and Kenward (1998) performed several analyses of these data. In Diggle
and Kenward (1994), a separate mean for each group defined by the year
of first lactation and a common time effect was considered, together with an
unstructured 2×2 covariance matrix. The dropout model included both Yi1
and Yi2 and was reparameterized in terms of the size variable (Yi1 + Yi2 )/2
and the increment Yi2 −Yi1 . It turned out that the increment was important,
in contrast to a relatively small contribution of the size. If this model were

TABLE 19.2. Mastitis in Dairy Cattle. Maximum likelihood estimates (stan-


dard errors) of random and nonrandom dropout models, under several deletion
schemes.

Random dropout
Parameter All (53,54,66,69) (4,5) (66) (4,5,66)
Measurement model:
β0 5.77(0.09) 5.69(0.09) 5.81(0.08) 5.75(0.09) 5.80(0.09)
βd 0.72(0.11) 0.70(0.11) 0.64(0.09) 0.68(0.10) 0.60(0.08)
σ₁² 0.87(0.12) 0.76(0.11) 0.77(0.11) 0.86(0.12) 0.76(0.11)
σ₂² 1.30(0.20) 1.08(0.17) 1.30(0.20) 1.10(0.17) 1.09(0.17)
ρ 0.58(0.07) 0.45(0.08) 0.72(0.05) 0.57(0.07) 0.73(0.05)
Dropout model:
ψ0 −2.65(1.45) −3.69(1.63) −2.34(1.51) −2.77(1.47) −2.48(1.54)
ψ1 0.27(0.25) 0.46(0.28) 0.22(0.25) 0.29(0.24) 0.24(0.26)
ω = ψ2 0 0 0 0 0
−2 log-likelihood 280.02 246.64 237.94 264.73 220.23
Nonrandom dropout
Parameter All (53,54,66,69) (4,5) (66) (4,5,66)
Measurement model:
β0 5.77(0.09) 5.69(0.09) 5.81(0.08) 5.75(0.09) 5.80(0.09)
βd 0.33(0.14) 0.35(0.14) 0.40(0.18) 0.34(0.14) 0.63(0.29)
σ₁² 0.87(0.12) 0.76(0.11) 0.77(0.11) 0.86(0.12) 0.76(0.11)
σ₂² 1.61(0.29) 1.29(0.25) 1.39(0.25) 1.34(0.25) 1.10(0.20)
ρ 0.48(0.09) 0.42(0.10) 0.67(0.06) 0.48(0.09) 0.73(0.05)
Dropout model:
ψ0 0.37(2.33) −0.37(2.65) −0.77(2.04) 0.45(2.35) −2.77(3.52)
ψ1 2.25(0.77) 2.11(0.76) 1.61(1.13) 2.06(0.76) 0.07(1.82)
ω = ψ2 −2.54(0.83) −2.22(0.86) −1.66(1.29) −2.33(0.86) 0.20(2.09)
−2 log-likelihood 274.91 243.21 237.86 261.15 220.23
G² for MNAR 5.11 3.43 0.08 3.57 0.005

assumed plausible, MAR would be rejected on the basis of a likelihood ratio
test statistic of G² = 5.11 on 1 degree of freedom.

Kenward (1998) carried out what we could term a data-driven sensitiv-


ity analysis. He started from the original model in Diggle and Kenward
(1994), albeit with a common intercept, since there was no evidence for
a dependence on first lactation year. The right-hand panel of Figure 2.6
reveals that there appear to be two cows, #4 and #5, with unusually
large increments. He conjectures that this might mean that these animals
were ill during the first lactation year, producing an unusually low yield,
whereas a normal yield was obtained during the second year. He then fitted t-distributions to Yi2 given Yi1 = yi1. Not surprisingly, his finding was that the heavier the tails of the t-distribution, the better the outliers were accommodated. As a result, the difference between the MAR and nonrandom models vanished (G² = 1.08 for a t2-distribution). Alternatively, removing these two cows and refitting the normal model shows complete lack of evidence for nonrandom dropout (G² = 0.08). This latter procedure is similar to a global influence analysis by means of deleting two observations. Parameter estimates and standard errors for random and nonrandom dropout, under several deletion schemes, are reproduced in Table 19.2. It is clear that the influence on the measurement model parameters is small in the random dropout case, although the gap on the time effect βd between the random and nonrandom dropout models is reduced when #4 and #5 are removed.

Next, these informal but insightful forms of sensitivity analysis will be presented. A sensitivity analysis based on local influence, as introduced in Section 19.3, is performed in Section 19.5.2.

Kenward’s Sensitivity Analysis

A simple multivariate Gaussian linear model is used to represent the marginal milk yield in the 2 years (i.e., the yield that would be, or was, observed in the absence of mastitis):
\[
\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix}
\sim N\left\{
\begin{pmatrix} \mu \\ \mu + \Delta \end{pmatrix},
\begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}
\right\}.
\]

Note that the parameter ∆ represents the change in average yield between the 2 years. The probability of mastitis is assumed to follow the logistic regression model:
\[
P(\text{dropout}) = \frac{e^{\psi_0 + \psi_1 y_1 + \psi_2 y_2}}{1 + e^{\psi_0 + \psi_1 y_1 + \psi_2 y_2}}. \tag{19.14}
\]
The combined response/dropout model was fitted to the milk yields by maximum likelihood using a generic function maximization routine. In addition, the MAR model (ψ2 = 0) was fitted. This latter is equivalent to fitting separately the Gaussian linear model for the milk yields and the logistic regression model for the occurrence of mastitis. These fits produced the parameter estimates, standard errors, and minimized value of twice the negative log-likelihood displayed in the “All” column of Table 19.2.

Using the likelihoods to compare the fit of the two models, we get a difference of G² = 5.11. The corresponding tail probability from the χ²₁ distribution is 0.02.
This test essentially examines the contribution of ψ2 to the fit of the model. Using the Wald statistic for the same purpose gives a statistic of (−2.54/0.83)² ≈ 9.35, with corresponding χ²₁ probability of 0.002. The discrepancy between the results of the two tests suggests that the asymptotic approximations on which these are based are not very accurate in this setting and the standard error probably underestimates the true variability of the estimate of ψ2. Nevertheless, there is a suggestion from the change in likelihood that ψ2 is making a real contribution to the fit of the model.
The dropout model estimated from the MNAR setting is as follows:

\[
\mathrm{logit}[P(\text{mastitis})] = 0.37 + 2.25\, y_1 - 2.54\, y_2. \tag{19.15}
\]

Some insight into this fitted model can be obtained by rewriting it in terms of the milk yield totals (Y1 + Y2) and increments (Y2 − Y1):
\[
\mathrm{logit}[P(\text{mastitis})] = 0.37 - 0.145\,(y_1 + y_2) - 2.395\,(y_2 - y_1). \tag{19.16}
\]
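The conversion from (19.15) to (19.16) is just a change of basis: ψ1 y1 + ψ2 y2 = ((ψ1 + ψ2)/2)(y1 + y2) + ((ψ2 − ψ1)/2)(y2 − y1). As a minimal numerical check (an aside, not part of the original analysis), the following Python fragment reproduces these coefficients and the two tail probabilities quoted above, using the estimates from Table 19.2:

# Minimal sketch: size/increment coefficients of (19.16) and the quoted
# tail probabilities, from the Table 19.2 estimates.
from scipy.stats import chi2

psi1, psi2 = 2.25, -2.54
print((psi1 + psi2) / 2)               # size coefficient: -0.145, as in (19.16)
print((psi2 - psi1) / 2)               # increment coefficient: -2.395
print(chi2.sf(5.11, df=1))             # likelihood ratio test: approx. 0.02
print(chi2.sf((psi2 / 0.83) ** 2, 1))  # Wald test: approx. 0.002
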

The probability of mastitis increases with larger negative increments; that is, those animals who showed (or would have shown) a greater decrease in yield over the 2 years have a higher probability of getting mastitis. The other differences in parameter estimates between the two models are consistent with this: The MNAR dropout model predicts a smaller average increment in yield (∆), with larger second year variance and smaller correlation caused by greater negative imputed differences between yields.

To gain some additional insight into these two fitted models, we now take
a closer look at the raw data and the predictive behavior of the Gaussian
MNAR model. Under an MNAR model, the predicted, or imputed, value
of a missing observation is given by the ratio of expectations:

\[
\hat{y}^m = \frac{E_{Y^m \mid Y^o}\left[y^m\, P(r \mid y^o, y^m)\right]}{E_{Y^m \mid Y^o}\left[P(r \mid y^o, y^m)\right]}. \tag{19.17}
\]
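As an illustration of how (19.17) can be evaluated in practice, the following minimal Python sketch computes the imputed second-year yield by one-dimensional numerical integration over the conditional density f(y2 | y1), plugging in the MNAR estimates from the “All” column of Table 19.2; the grid and the example values of y1 are arbitrary choices, not part of the original analysis:

# Minimal sketch: evaluate the ratio of expectations (19.17) on a grid,
# under the fitted Gaussian MNAR model ("All" column of Table 19.2).
import numpy as np
from scipy.stats import norm

mu, delta = 5.77, 0.33
s1, s2, rho = np.sqrt(0.87), np.sqrt(1.61), 0.48
psi0, psi1, psi2 = 0.37, 2.25, -2.54   # logit P(mastitis) = psi0 + psi1*y1 + psi2*y2

def imputed_y2(y1, grid=np.linspace(-2.0, 14.0, 4001)):
    """Ratio of expectations (19.17) over f(y2 | y1), on a uniform grid."""
    m21 = mu + delta + rho * s2 * (y1 - mu) / s1   # conditional mean of Y2 given y1
    sd21 = s2 * np.sqrt(1.0 - rho**2)              # conditional standard deviation
    dens = norm.pdf(grid, m21, sd21)
    pdrop = 1.0 / (1.0 + np.exp(-(psi0 + psi1 * y1 + psi2 * grid)))
    # uniform grid, so the spacing cancels in the ratio
    return (grid * pdrop * dens).sum() / (pdrop * dens).sum()

for y1 in (4.0, 6.0, 8.0):                         # arbitrary first-year yields
    print(y1, round(imputed_y2(y1) - y1, 2))       # imputed increment y2 - y1

This should reproduce the qualitative behavior discussed next: imputed increments that are negative and nearly linear in y1.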

Recall that the fitted dropout model (19.15) implies that the probability
of mastitis increases with decreasing values of the increment Y2 − Y1 . We
therefore plot the 27 imputed values of this quantity together with the 80
observed increments against the first year yield Y1 . This is presented in
Figure 19.6, in which the imputed values are indicated with triangles and
the observed values with crosses. Note how the imputed values are almost
linear in Y1 : This is a well-known property of the ratio (19.17) within this
range of observations. The imputed values are all negative, in contrast to
the observed increments, which are nearly all positive. With animals of this
age, one would normally expect an increase in yield between the 2 years.
The dropout model is imposing very atypical behavior on these animals and
this corresponds to the statistical significance of the MNAR component of
the model (ψ2 ) but, of course, necessitates further scrutiny.
FIGURE 19.6. Mastitis in Dairy Cattle. Plot of observed and imputed year 2 −
year 1 yield differences against year 1 yield. Two outlying points are circled.

Another feature of this plot is the pair of outlying observed points circled in
the top left-hand corner. These two animals have the lowest and third lowest
yields in the first year, but moderately large yields in the second, leading
to the largest positive increments. In a well-husbanded dairy herd, one
would expect approximately Gaussian joint milk yields, and these two then
represent outliers. It is likely that there is some anomaly, possibly illness,
leading to their relatively low yields in the first year. One can conjecture
that these two animals are the cause of the structure identified by the
Gaussian MNAR model. Under the joint Gaussian assumption, the MNAR
model essentially “fills in” the missing data to produce a complete Gaussian
distribution. To counterbalance the effect of these two extreme positive
increments, the dropout model predicts negative increments for the mastitic
cows, leading to the results observed. As a check on this conjecture, we
omit these two animals from the data set and refit the MAR and MNAR
Gaussian models. The resulting estimates are presented in the (4, 5) column
of Table 19.2.

The deviance difference is minimal and the MNAR model now shows no improvement in fit over MAR. The estimates of the dropout parameters, although still moderately large in an absolute sense, are of the same size as their standard errors which, as mentioned earlier, are probably underestimates. In the absence of the two anomalous animals, the structure identified earlier in terms of the MNAR dropout model no longer exists. The increments imputed by the fitted model are also plotted in Figure 19.6, indicated by circles. Although still lying among the lower region of the observed increments, these are now all positive and lie close to the increments imputed by the MAR model (diamonds). Thus, we have a plausible representation of the data in terms of joint Gaussian milk yields, two pairs of outlying yields, and no requirement for an MNAR dropout process.

FIGURE 19.7. Mastitis in Dairy Cattle. Normal probability plot of the year 1 milk yields.

The two key assumptions underlying the outcome-based MNAR model are,
first, the form chosen for the relationship between dropout probability and
response and, second, the distribution of the response or, more precisely,
the conditional distribution of the possibly unobserved response given the
observed response. In the current setting for the first assumption, if there
is dependence of mastitis occurrence on yield, experience with logistic re-
gression tells us that the exact form of the link function in this relationship
is unlikely to be critical. In terms of sensitivity, we therefore consider the
second assumption, the distribution of the response.

All the data from the first year are available, and a normal probability plot
of these, Figure 19.7, does not show great departures from the Gaussian
assumption. Leaving this distribution unchanged, we therefore examine the
effect of changing the conditional distribution of Y2 given Y1 . One simple
and obvious choice is to consider a heavy-tailed distribution, and for this,
we use the translated and scaled tm-distribution with density:
\[
f(y_2 \mid y_1) = \left\{ \sigma \sqrt{m}\, B(1/2, m/2) \right\}^{-1}
\left[ 1 + \frac{1}{m} \left( \frac{y_2 - \mu_{2|1}}{\sigma} \right)^{2} \right]^{-(m+1)/2},
\]
where
\[
\mu_{2|1} = \mu + \Delta + \frac{\rho \sigma_2 (y_1 - \mu)}{\sigma_1}
\]
is the conditional mean of Y2 | y1. The corresponding conditional variance is
\[
\frac{m}{m-2}\, \sigma^2.
\]
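As an aside (not part of the original analysis), the density above is simply a standard Student t-density in (y2 − µ2|1)/σ, rescaled by 1/σ. The following minimal Python sketch verifies this numerically for arbitrary illustrative values:

# Minimal sketch: the translated/scaled t_m-density equals a standard
# Student-t density in (y2 - mu21)/sigma, rescaled by 1/sigma.
import numpy as np
from scipy.stats import t as student_t
from scipy.special import beta

def f_cond(y2, mu21, sigma, m):
    """Conditional t_m-density of Y2 given y1, written as in the text."""
    z = (y2 - mu21) / sigma
    return (1 + z**2 / m) ** (-(m + 1) / 2) / (sigma * np.sqrt(m) * beta(0.5, m / 2))

y2, mu21, sigma, m = 7.0, 6.1, 1.1, 10             # illustrative values only
print(f_cond(y2, mu21, sigma, m))
print(student_t.pdf((y2 - mu21) / sigma, df=m) / sigma)  # same value
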
Relevant parameter estimates from the fits of both MAR and MNAR mod-
els are presented in Table 19.3 for three values of m: 2, 10, and 25. Smaller
values of m correspond to greater kurtosis and, as m becomes large, the
model approaches the Gaussian one used in the previous section. It can be
seen from the results for the MNAR model in Table 19.3 that, as the kurtosis increases, the estimate of ψ2 decreases. Also, the maximized likelihoods
of the MAR and MNAR models converge. With 10 and 2 degrees of free-
dom, there is no evidence at all to support the inclusion of ψ2 in the model;
that is, the MAR model provides as good a description of the observed data
as the MNAR, in contrast to the Gaussian-based conclusions. Further, as m
decreases, the estimated yearly increment in milk yield ∆ from the MNAR
model increases to the value estimated under the MAR model. In most
applications of outcome-based selection models (see Section 19.7), it will
be quantities of this type that will be of prime interest, and it is clearly
seen in this example how the dropout model can have a crucial influence
on the estimate of this. Comparing the values of the deviance from the
t-based model with those from the original Gaussian model, we also see
that the former with m = 10 or 2 produces a slightly better fit, although
no meaning can be attached to the statistical significance of the difference
in these likelihood values.

The results observed here are consistent with those from the deletion analysis. The two outlying pairs of measurements identified earlier are not inconsistent with the heavy-tailed t-distribution, so the second model requires no “filling in” and hence shows no evidence for nonrandomness in the dropout process. In conclusion, if we consider the data with outliers in-
cluded, we have two models that effectively fit equally well to the observed
data. The first assumes a joint Gaussian distribution for the responses and
a MNAR dropout model. The second assumes a Gaussian distribution for
the first observation and a conditional tm -distribution (with small m) for
the second given the first, with no requirement for a MNAR dropout com-
ponent. Each provides a different explanation for what has been observed,
TABLE 19.3. Mastitis in Dairy Cattle. Details of the fit of MAR and MNAR dropout models, assuming a tm-distribution for the conditional distribution of Y2 given Y1. Maximum likelihood estimates (standard errors) are shown.

t DF Par. MAR MNAR
25 ∆ 0.69(0.10) 0.35(0.13)
ψ1 0.27(0.24) 2.11(0.78)
ψ2 −2.33(0.88)
−2 log-likelihood 275.54 271.77
10 ∆ 0.67(0.09) 0.38(0.14)
ψ1 0.27(0.24) 1.84(0.82)
ψ2 −1.96(0.95)
−2 log-likelihood 271.22 269.12
2 ∆ 0.61(0.08) 0.54(0.11)
ψ1 0.27(0.24) 0.80(0.66)
ψ2 −0.65(0.73)
−2 log-likelihood 267.87 266.79

with quite a different biological interpretation. In likelihood terms, the second model fits a little better than the first, but a key feature of such dropout models is that the distinction between them should not be based on the observed data likelihood alone. It is always possible to specify models with identical maximized observed data likelihoods that differ with respect to the unobserved data and dropout mechanism, and such models can have very different implications for the underlying mechanism generating the data. Finally, the most plausible explanation for the observed data is that the pairs of milk yields have joint Gaussian distributions, with no need for an MNAR dropout component, and that two animals are associated with anomalous pairs of yields.

19.5.2 Local Influence Approach

In the previous section, the sensitivity to distributional assumptions of conclusions concerning the randomness of the dropout process has been established in the context of the mastitis data. Such sensitivity has led some to conclude that such modeling should be avoided. We argue that this conclusion is too strong. First, repeated measures tend to be incomplete and therefore the consideration of the dropout process is simply unavoidable.
Second, if a nonrandom dropout component is added to a model and the
maximized likelihood changes appreciably, then some real structure in the
data has been identified that is not encompassed by the original model.
The MNAR analysis may tell us about inadequacies of the original model
rather than the adequacy of the MNAR model. It is the interpretation of
the identified structure that cannot be made unequivocally from the data
under analysis. The mastitis data clearly illustrated that, using external
information on the distribution of the response, a plausible explanation of
the structure so identified might be made in terms of the outlying responses
from two animals.

However, it should also be noted that absence of structure in the data as-
sociated with an MNAR process does not imply that an MNAR process
is not operating: Different models with similar maximized likelihoods (i.e.,
with similar plausibility with respect to the observed data), may have com-
pletely different implications for the dropout process and the unobserved
data. These points together suggest that the appropriate role of such mod-
eling is as a component of a sensitivity analysis.

The analysis of the previous section is characterized by its grounding in substantive knowledge about the data. The local influence approach, presented
in Section 19.3 and applied to the rats data in Section 19.4, may appear to
be “blindly” applicable (i.e., without relying on specific information
about the data). In this section, we will apply the technique to the mastitis
data, confront the results with those found in Section 19.5.1, and suggest
that here a combination of methodology and substantive insight will be the
most fruitful approach.

Applying the method to the mastitis data produces Figure 19.8, which sug-
gests that there are four influential subjects: #53, #54, #66, and #69. The
most striking feature of this analysis is that #4 and #5 are not recovered.
See also Figure 2.6. It is interesting to consider an analysis with these four
cows removed. Details are given in Table 19.2. Unlike removing #4 and #5,
the influence on the likelihood ratio test is rather small: G² = 3.43 instead
of the original 5.11. The influence on the measurement model parameters
under both random and nonrandom dropout is small.

It is very important to realize that one should not expect agreement be-
tween deletion and our local influence analysis. The latter focuses on the
sensitivity of the results with respect to the assumed dropout model; more
specifically, how the results change when the MAR model is extended into
the direction of nonrandom dropout. In particular, all subjects singled out
so far are complete and hence Ci (θ) ≡ 0, placing all influence on Ci (ψ)
and hmax,i . A comparison between local influence and deletion is given in
Section 24.4.
FIGURE 19.8. Mastitis in Dairy Cattle. Index plots of Ci , Ci (θ), Ci (ψ), and
of the components of the direction hmax of maximal curvature, when the dropout
model is parameterized in function of Yi1 and Yi2 .

More insight can also be obtained by studying (19.13). The contribution
for subject i is made up of three factors. The first factor, Vi , is small for
extreme dropout probabilities. The subjects with a very high probability
to either remain in the study or disappear will be less influential. Cows
#4 and #5 have dropout probabilities equal to 0.13 and 0.17, respectively.
The 107 cows in the study span the dropout probability interval [0.13, 0.37].
Thus, this component rather deflates the influence of subjects #4 and #5.
Second, (19.13) contains a leverage factor in curly braces. Third, a subject
is relatively more influential when both milk yields are high. We now need
to question whether this is plausible or relevant. Since both measurements
are positively correlated, measurements with both milk yields high or low
will not be unusual. In Section 19.5.1, we observed that cows #4 and #5
are unusual on the basis of their increment. This is in line with several
other applications of similar dropout models (Diggle and Kenward 1994,
Molenberghs, Kenward, and Lesaffre 1997) where it was found that a strong
incremental component pointed to genuine nonrandomness. In contrast,
the size variable can often be replaced by just the history, and hence the
corresponding model is very close to random dropout.

Even though a dropout model in the outcomes themselves, termed the direct variables model, is equivalent to a model in the first variable Yi1 and the increment Yi2 − Yi1, termed the incremental variables representation, we will
FIGURE 19.9. Mastitis in Dairy Cattle. Index plots of Ci , Ci (θ), Ci (ψ), and
of the components of the direction hmax of maximal curvature, when the dropout
model is parameterized in function of Yi1 and Yi2 − Yi1 .

show that they lead to different perturbation schemes of the form (19.1). At first sight, this feature can be seen as both an advantage and a disadvantage. The fact that reparameterizations of the linear predictor of the dropout model lead to different perturbation schemes requires careful reflection, based on substantive knowledge, in order to guide the analysis, such as the considerations on the incremental variable made earlier.

We will present the results of the incremental analysis and then offer further
comments on the rationale behind this particular transformation. From the
diagnostic plots in Figure 19.9, it is obvious that we recover three influential
subjects: #4, #5, and #66. Although Kenward (1998) did not consider #66
to be influential, it does appear to be somewhat distant from the bulk of
the data (Figure 2.6). The main difference between both types is that the
first two were likely sick during year 1, and this is not necessarily so for
#66. An additional feature is that in all cases, both Ci (ψ) and hmax show
the same influential animals. In addition, hmax suggests that the influence
for #66 is different than for the others. It could be conjectured that the
latter one pulls the coefficient ω in a different direction than the other
two. The other values are all relatively small. This could indicate that for
the remaining 104 subjects, MAR is plausible, whereas a deviation in the
direction of the incremental variable, with differing signs, appears to be
necessary for the other three subjects. At this point, a comparison between
FIGURE 19.10. Mastitis in Dairy Cattle. Index plots of the three components of
Ci (ψ) when the dropout model is parameterized in function of Yi1 and Yi2 − Yi1 .

hmax for the direct variable and incremental analyses is useful. Since the
contributions hi sum to 1, these two plots are directly comparable. There is
no pronounced influence indication in the direct variables case and perhaps
only random noise is seen. A more formal way to distinguish between signal
and noise needs to be developed.

In Figure 19.10, we have decomposed (19.13) into its three components: the
variance of the dropout probability Vi , the incremental variable Yi2 − Yi1 ,
which is replaced by its predicted value for a dropout, and the hat-matrix
diagonal. In agreement with the preceding discussion, the influence clearly
stems from an unusually large increment, which survives the fact that Vi
actually downplays the influence because Y41 and Y51 are comparatively
small and dropout increases with the milk yield in the first year. Further,
the sign difference of hmax,4 and hmax,5 versus hmax,66 can be interpreted
better.

We noted already that cows #4 and #5 have relatively small dropout probabilities. In contrast, the dropout probability of #66 is large within the observed range [0.13; 0.37]. Since for those subjects the increment is large, changing its perturbation ωi can have a large impact on the other dropout parameters ψ0 and ψ1. To avoid the effects of the change for #4 and #5 canceling with the effect for #66, the corresponding signs need to be opposite. Such a change implies either that all three dropout
probabilities move toward the center of the range or are pulled away from
it. (Note that −hmax is another normalized eigenvector corresponding to
the largest eigenvalue.)

In the informal approach, extra analyses were considered with #4 and #5 removed. The resulting likelihood ratio statistic reduces to G² = 0.08. When only #66 is removed, the likelihood ratio statistic for nonrandom dropout is G² = 3.57, very similar to the one when #53, #54, #66, and #69 were removed. Removing all three (#4, #5, and #66) results in G² = 0.005 (i.e., complete disappearance of all evidence for nonrandom dropout). Details are given in Table 19.2.

We now provide insight into why the transformation of direct outcomes to increments is useful. We noted already that the associated perturbation schemes (19.1) are different. An important device in this respect is the equality
\[
\psi_0 + \psi_1 y_{i1} + \psi_2 y_{i2} = \psi_0 + (\psi_1 + \psi_2)\, y_{i1} + \psi_2\, (y_{i2} - y_{i1}). \tag{19.18}
\]

Equation (19.18) shows that the direct variables model checks the influence on the random dropout parameter ψ1, whereas the random dropout parameter in the incremental model is ψ1 + ψ2. Not only is this a different parameter, it is also estimated with higher precision. One often observes that ψ̂1 and ψ̂2 exhibit a similar variance and negative correlation, in which case the linear combination with smallest variance is approximately in the direction of the sum ψ1 + ψ2. When the correlation is positive, the difference direction ψ1 − ψ2 is obtained instead. Let us assess this in the case where all 107 observations are included. The estimated covariance matrix is
\[
\begin{pmatrix} 0.59 & -0.54 \\ -0.54 & 0.70 \end{pmatrix},
\]

with correlation −0.84. The variance of ψ1 + ψ2 , on the other hand, is
estimated to be 0.21. In this case, the direction of minimal variance is
along (0.74; 0.67), which is indeed close to the sum direction. When all
three influential subjects are removed, the estimated covariance matrix
becomes
\[
\begin{pmatrix} 3.31 & -3.77 \\ -3.77 & 4.37 \end{pmatrix},
\]
with correlation −0.9897. Removing only #4 and #5 yields an intermediate situation, the results of which are not shown. The variance of the
sum is 0.15, which is a further reduction and still close to the direction of
minimal variance. These considerations reinforce the claim that an incre-
mental analysis is highly recommended. It might therefore be interesting
to routinely construct a plot such as in Figure 2.6 or Figure 19.6, even
with longer measurement sequences. On the other hand, transforming the
dropout model to a size variable (Yi1 +Yi2 )/2 will worsen the problem since
an insensitive parameter for Yi1 will result.
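These claims about the direction of minimal variance are easy to verify numerically. A minimal Python sketch, using only the covariance matrix quoted above for all 107 cows:

# Minimal sketch: direction of minimal variance for (psi1_hat, psi2_hat),
# from the estimated covariance matrix quoted in the text (all 107 cows).
import numpy as np

cov = np.array([[0.59, -0.54],
                [-0.54, 0.70]])

eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
v_min = eigvecs[:, 0] * np.sign(eigvecs[0, 0])  # minimal-variance direction
print(np.round(v_min, 2))                       # approx. [0.74 0.67]: sum direction
print(round(cov.sum(), 2))                      # Var(psi1 + psi2) = 0.21
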

Finally, observe that transforming the dropout model to size and incremental variables simultaneously, for the model with all three influential subjects removed, gives variances for the size and increment variables of 0.15 and 15.22, respectively. In other words, there is no evidence for an incremental effect, confirming that random dropout is plausible.

Although local and global influence are, strictly speaking, not equivalent,
it is insightful to see how the global influence on θ can be linked to the
behavior of Ci (ψ). We observed earlier that all locally influential subjects
are completers and hence Ci (θ) ≡ 0. Yet, removing #4, #5, and #66 shows
some effect on the discrepancy between the random dropout (MAR) and
nonrandom dropout (MNAR) estimates of the time effect βd . In particular,
MAR and MNAR estimates with all three subjects removed are virtually
identical (0.60 and 0.63, respectively). Let us do a small thought experi-
ment. Since these subjects are influential in Ci (ψ), the MAR model could
be improved by including incremental terms for these three subjects. Such
a model would still imply random dropout. In contrast, allowing a depen-
dence on the increment in all subjects will influence E(Yi2 |yi1 , dropout)
for all incomplete observations; hence, the measurement model parameters
under MNAR will change. In conclusion, this provides a way to assess the
indirect influence of the dropout mechanism on the measurement model
parameters through local influence methods. In the milk data set, this influence is likely due to the fact that an exceptional increment, caused by a different mechanism (perhaps a diseased animal during the first year), is nevertheless treated on an equal footing with the other observations within the dropout model. Such an investigation cannot be done with the
case-deletion method because it is not possible to disentangle the various
sources of influence.

In conclusion, it is found that an incremental variable representation of the dropout mechanism is beneficial over a direct variable representation. Contrasting our local influence approach with a case-deletion scheme as applied in Kenward (1998), it is argued that the former approach is advantageous since it allows one to assess direct and indirect influences on the dropout and measurement model parameters, stemming from perturbing the random dropout model in the direction of nonrandom dropout. In contrast, a case-deletion scheme does not allow one to disentangle the various sources of influence.
19.6 Alternative Local Influence Approaches

The perturbation scheme used throughout this chapter has several elegant
properties. The perturbation is around the MAR mechanism, which is often
deemed a sensible starting point. Extra calculations are limited and free of
numerical integration. Influence decomposes into a measurement and a dropout
part, the first of which is zero in the case of a complete observation. Finally,
if the special case of compound symmetry is assumed, the measurement
part can approximately be written in interpretable components for the
fixed effect and variance component parts.

However, other schemes are worth considering as well. Most of the developments presented here can be adapted to such alternatives, although not all schemes will preserve the remarkable computational convenience. Also, interpretation of the influence expressions in an alternative scheme will require additional work.

Apart from MAR, MCAR is often also considered a useful model. It is then natural to consider departures from the MCAR model, rather than from the MAR model. This would change (19.1) to
\[
\mathrm{logit}(g(h_{ij}, y_{ij})) = \mathrm{logit}\left[\mathrm{pr}(D_i = j \mid D_i \ge j, \boldsymbol{y}_i)\right]
= h_{ij}\psi + \omega_{i1}\, y_{i,j-1} + \omega_{i2}\, y_{ij}, \tag{19.19}
\]
with an obvious change in the definition of hij. This way, the perturbation parameter becomes a two-component vector ωi = (ωi1, ωi2). As a result, the ith subject produces a pair (hi1, hi2), which is a normalized vector and hence main interest lies in its direction. Also, Ch = Ci is the local influence on γ of allowing the ith subject to drop out randomly or nonrandomly.
Figure 19.11 shows the result of this procedure, applied to the mastitis
data. Pairs (hi1, hi2) are plotted. The main diagonal corresponds to the size direction, whereas the other diagonal represents the purely incremental direction.
The circles are used to indicate the minimal and maximal distances to the
origin. Finally, squares rather than bullets are used for cows #4, #5, and
#66. Most cows lie in the size direction, but it is noticeable that #4, #5,
and #66 tend toward the nonrandom direction. Further, no extremely large
Ci are seen in this case.

Another extension would result from the observation that the choice of the
incremental analysis in Section 19.5.2 may, although motivated by substan-
tive insight, seem rather arbitrary. Hence, it would be desirable to have a
more automatic, data-driven selection of a direction. One way of doing this
is by considering
\[
\mathrm{logit}(g(h_{ij}, y_{ij})) = \mathrm{logit}\left[\mathrm{pr}(D_i = j \mid D_i \ge j, \boldsymbol{y}_i)\right]
= h_{ij}\psi + \omega_i\, (\sin\theta\, y_{i,j-1} + \cos\theta\, y_{ij}). \tag{19.20}
\]
FIGURE 19.11. Mastitis in Dairy Cattle. Plot of Ci in the direction of hi .

Now, it is possible to apply (19.20) for a selected number of angles θ, to range through a fine grid covering the entire circle, or to consider θ as another influence parameter. In the latter case, θ becomes subject-specific and the pair (ωi, θi) is essentially a reparameterization of the pair ωi = (ωi1, ωi2) in (19.19).

A completely different local influence approach would modify the general form (15.5) as follows:
\[
f(\boldsymbol{y}_i^{o}, \boldsymbol{r}_i \mid \boldsymbol{\theta}, \boldsymbol{\psi}, \omega_i)
= \int f(\boldsymbol{y}_i^{o}, \boldsymbol{y}_i^{m} \mid X_i, Z_i, \boldsymbol{\theta})\,
f(\boldsymbol{r}_i \mid \boldsymbol{y}_i^{o}, \boldsymbol{y}_i^{m}, X_i, \boldsymbol{\psi})^{\omega_i}\, d\boldsymbol{y}_i^{m}. \tag{19.21}
\]

Now, if ωi = 0, then the missing data process is considered ignorable and only the measurement process is modeled. If ωi = 1, the posited, potentially nonrandom, model is considered. Other values of ωi correspond to partial case weighting.
19.7 Random-coefficient-based Models

It has been seen in the mastitis data, and observed elsewhere, such as
in Diggle and Kenward (1994) and Molenberghs, Kenward, and Lesaffre
(1997), that apparent nonrandom dropout in a selection model often mani-
fests itself in terms of a dependence of dropout on the increment or change
in response, C = Yij − Yi,j−1 , say. If a subject is exhibiting a clear trend in
response, then we might regard C as an estimate, with error, of an under-
lying trend. This is supported particularly in examples in which a MNAR
model with dependence on C fits a little better than an MAR model with
dependence on a change calculated wholly from past values. A better way
of approaching such a situation may be to attempt to model the underly-
ing trend directly using a latent, or random-coefficient, model and allow
both response and dropout model to depend on this. Little (1995) uses the
term random-coefficient based to distinguish such models from those used
in this chapter in which the probability of dropout depends directly on
the response Y i , which he terms outcome based. We will now discuss these
types of model in some detail.

Suppose that the latent variable bi (possibly vector valued) describes some aspect of an individual's response. This leads naturally to the linear mixed model (3.8). Let us consider a model with a random intercept and a random slope:
\[
Y_{ij} \mid b_{0i}, b_{1i} \sim N\!\left(\beta_0 + b_{0i} + (\beta_1 + b_{1i})\, t_{ij};\; \sigma^2\right), \qquad j = 1, \ldots, n_i. \tag{19.22}
\]
The subject's random coefficients bi = (b0i, b1i) are assumed to be normally distributed with zero mean and covariance matrix D. Also, the dropout model is assumed to depend on the random effects: P(r|b).

In general, such a selection model can be written, conditional on b, as
\[
f(\boldsymbol{y}, \boldsymbol{r} \mid \boldsymbol{b}) = f(\boldsymbol{y} \mid \boldsymbol{b})\, P(\boldsymbol{r} \mid \boldsymbol{b}). \tag{19.23}
\]
Integration is still required to obtain the marginal distribution of (y^o, r), now with respect to both y^m and b:
\[
f(\boldsymbol{y}^{o}, \boldsymbol{r}) = \int\!\!\int f(\boldsymbol{y} \mid \boldsymbol{b})\, P(\boldsymbol{r} \mid \boldsymbol{b})\, f(\boldsymbol{b})\, d\boldsymbol{b}\, d\boldsymbol{y}^{m},
\]
but the dependence of dropout on y^m through b allows some simplification, given appropriate regularity conditions:
\[
f(\boldsymbol{y}^{o}, \boldsymbol{r}) = \int \left[ \int f(\boldsymbol{y} \mid \boldsymbol{b})\, d\boldsymbol{y}^{m} \right] P(\boldsymbol{r} \mid \boldsymbol{b})\, f(\boldsymbol{b})\, d\boldsymbol{b}
= \int f(\boldsymbol{y}^{o} \mid \boldsymbol{b})\, P(\boldsymbol{r} \mid \boldsymbol{b})\, f(\boldsymbol{b})\, d\boldsymbol{b}. \tag{19.24}
\]
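To make the structure of (19.22)–(19.24) concrete, the following minimal Python sketch simulates from a random-coefficient-based dropout model in which the dropout probability depends on the latent slope b1i only; all parameter values are illustrative and not estimates from any data set:

# Minimal sketch: simulate from a random-coefficient-based dropout model
# of the form (19.22)-(19.23); dropout depends on the latent slope b1i.
import numpy as np

rng = np.random.default_rng(0)
N, n = 200, 5                                  # subjects, intended occasions
beta0, beta1, sigma = 5.0, 0.5, 1.0
D = np.array([[0.8, 0.1],
              [0.1, 0.2]])                     # covariance of (b0i, b1i)
psi0, psi1 = -2.0, -1.5                        # logit P(drop at j) = psi0 + psi1*b1i

records = []
for i in range(N):
    b0, b1 = rng.multivariate_normal([0.0, 0.0], D)
    p_drop = 1.0 / (1.0 + np.exp(-(psi0 + psi1 * b1)))  # depends on b only
    for j in range(n):
        y = beta0 + b0 + (beta1 + b1) * j + rng.normal(0.0, sigma)
        records.append((i, j, y))
        if j < n - 1 and rng.random() < p_drop:
            break                              # subject drops out after occasion j
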
The response y now enters the joint distribution only through the observed
data and the latent variable, and the nonrandom dropout model will typi-
cally be identified.

Wu and Carroll (1988) proposed such a model for what they termed infor-
mative right censoring. The situation they cover extends the earlier setting
to accommodate right censoring of the dropout times. Although this com-
plicates the likelihood to some degree, it does not fundamentally change
the structure of the model and could equally well be used for outcome-
based models. For a continuous response, Wu and Carroll suggested using
a conventional Gaussian random-coefficient model (Laird and Ware 1982,
Longford 1993) combined with an appropriate model for time to dropout,
such as proportional hazards, logistic or probit regression. The combination
of probit and Gaussian response allows explicit solution of the integral and
was used in their application.

In a slightly different approach to modeling dropout time as a continuous
variable in the latent variable setting, Schluchter (1992) and DeGruttola
and Tu (1994) proposed joint multivariate Gaussian distributions for the
latent variable(s) of the response process and a variable representing time
to dropout. The correlation between these variables induces dependence
between dropout and response. To permit more realistic distributions for
dropout time, Schluchter proposed that dropout time itself should be some
monotone transformation of the corresponding Gaussian variable. The use
of a joint Gaussian representation does simplify computational problems
associated with the likelihood. There are clear links here with the Tobit
model and this is made explicit by Cowles, Carlin, and Connett (1996), who
use a number of correlated latent variables to represent various aspects of
an individual’s behavior, such as compliance and attendance at scheduled
visits.

These random-coefficient-based models do have the advantage over the
earlier outcome-based models in providing a simpler framework for non-
dropout patterns of missing values and allowing very general patterns of
observation time among individuals. There are many ways in which such
models can be extended and generalized. Follmann and Wu (1995) introduce
the idea of shared-parameter models in which generalized linear models are
defined for both response and dropout that share latent variable(s). With
some exceptions, the random-coefficient-based models share the main draw-
backs of the outcome-based models: (1) Computational algorithms can be
complex and problem specific, although the EM has been applied to some
problems (Schluchter 1992, DeGruttola and Tu 1994, Molenberghs, Ken-
ward, and Lesaffre 1997), and (2) inferences are necessarily highly depen-
dent on parametric modeling assumptions that cannot be assessed from
the data under analysis. In answer to these concerns, several authors have
considered replacing parametric assumptions, at least partially, by a nonparametric approach (Robins, Rotnitzky, and Zhao 1995, Robins and Rotnitzky 1995, Rotnitzky and Robins 1995, 1997, Robins 1997, Robins and Gill 1997, Robins, Rotnitzky, and Scharfstein 1998).

19.8 Concluding Remarks

Since all models for incomplete longitudinal data rest on unverifiable assumptions, this chapter argues that a careful investigation of the model output is in order. Using two examples, both informal and formal sensitivity analyses have been conducted. The former are based on insight into the modeling process and the distributional assumptions made, as well as on background knowledge of the data problem.

Needless to say, a variety of different approaches to formal sensitivity analysis are possible. We focused on global and local influence measures, these being but one way of assessing sensitivity.

The sensitivity problem at large is receiving a lot of attention and we believe that a number of methodologies will emerge over the coming years.
20
Sensitivity Analysis for Pattern-Mixture Models

20.1 Introduction

Chapter 18 is devoted to the study of pattern-mixture models, thus providing an alternative formulation for the common selection model factorization (see also Section 15.4). In Section 18.1, we observed that pattern-mixture models are chronically underidentified, which is clearly seen by means of the Glynn, Laird, and Rubin (1986) “paradox” (Section 18.1.2). Consequently, Little (1993, 1994a, 1995) suggested the use of so-called identifying restrictions to overcome this underidentification. Choosing a set of different restriction schemes, rather than a single one, is an obvious way to pass from a standard approach to a sensitivity analysis.

The need to use identifying restrictions is often quoted as an advantage of pattern-mixture models since it forces careful reflection on the nature of the assumptions made. On the other hand, neither of the two case studies in Chapter 18 (the toenail data in Section 18.3 and the Vorozole study in Section 18.4) made use of identifying restrictions. The reason is different for the two studies. The pattern-specific models for the toenail data were simple enough (quadratic curves) to allow extrapolation beyond the last measurement obtained in a particular pattern. Only the first two patterns, with a single measurement or only two measurements, posed problems, and an ad hoc solution was employed. In the Vorozole study, pattern was included as a
covariate in both the fixed-effects and variance portions of the models. Subsequent simplification led to a model which was easy to extrapolate.

Thus, in line with the reflections made in Section 18.5, we have three strategies to build a full data model in the pattern-mixture context: identifying restrictions, simple within-pattern models, and the inclusion of pattern as a covariate, the latter of which allows for the combination of information across patterns. A few observations are in order. First, although identifying restrictions impose a careful reflection on the unidentified part of the distribution, the other strategies are more implicit about the assumptions made to identify the full distribution. In this respect, they are open to some of the criticisms directed toward selection models. Second, the identifying-restrictions strategy is harder to implement, except in fairly simple settings, such as a single normal sample or contingency tables (Little 1993, 1994a). This chapter provides tools to conduct such a strategy in realistic longitudinal settings. Third, in the selection modeling framework, the MAR assumption plays a crucial role. It can be seen as a compromise between the very rigid and unrealistic MCAR assumption and the complex and fundamentally problematic MNAR assumptions. A counterpart to the MAR assumption is provided in Section 20.2, which can be exploited as the basis for specific identifying-restrictions strategies.

The identifying-restrictions strategy requires some theoretical justification, which is provided in this chapter. In spite of some long and tedious derivations, the resulting procedure is relatively simple to implement, and a set of SAS macros has been provided which can be downloaded from the website. GAUSS functions are available at the same location.

The first couple of sections provide background material. Section 20.2 de-
scribes the relationship between MAR and the pattern-mixture framework.
Multiple imputation, a tool used in the identifying restrictions strategy, is
reviewed in Section 20.3. The three strategies to fit pattern-mixture models,
mentioned earlier, are described in Section 20.4. The identifying-restrictions
strategy is described in detail in Section 20.5. Application to the Vorozole study is discussed in Section 20.6. Reflections and suggestions for alternative routes of sensitivity analysis are offered in Section 20.7.

20.2 Pattern-Mixture Models and MAR

The missing data taxonomy of Rubin (1976) and Little and Rubin (1987),
which distinguishes between missing completely at random, missing at
random, and nonrandom missingness, is widely used (Section 15.5). It is usually presented in the selection modeling framework rather than in the pattern-mixture context.

Although selection and pattern-mixture models are interchangeable from a probabilistic point of view, in the sense that they represent different factorizations of the same joint distribution, in practice they encourage different kinds of simplifying assumptions. For this reason, it is important to consider their relative merits as scientific models, especially when the probability of missingness depends on the unobserved outcomes. One attraction of selection models is that they fit naturally into Little and Rubin's taxonomy, whereas pattern-mixture models appear not to do so. Here, we show that pattern-mixture models can be classified similarly, and further that the intermediate MAR category is connected to particular kinds of restrictions on the parameters of a pattern-mixture model in the case of monotone missingness. This suggests to us that a purely philosophical debate about the relative merits of the selection and pattern-mixture paradigms is not helpful. Instead, the focus of debate should shift to two other issues.

First, a consideration of the statistical and scientific merits of proposed missing value models on their own terms is needed. For example, if the question of scientific interest regards the treatment effect, averaged over all dropout patterns, then choosing a selection model seems obvious. On the other hand, if one is interested in the treatment effect for various dropout patterns separately, then a pattern-mixture model is a natural choice. Second, selection models and pattern-mixture models can be combined into a sensitivity analysis (Section 20.4). For example, one can select a model family of primary focus and fit a model in the other one as well. In addition, insight gained from both model families can be combined into a richer data-analytic picture.

20.2.1 MAR and ACMV

Assume a complete measurement sequence is of length n. Recall (Section 15.7) that in a pattern-mixture model, the joint density f(y, d) is factorized as
\[
f(\boldsymbol{y}, d) = f(d)\, f(\boldsymbol{y} \mid d).
\]
We will now show how pattern-mixture models can be classified using exactly the same taxonomy as is used for selection models (MCAR, MAR, MNAR). Furthermore, we establish a link between this classification and the identifying restrictions proposed in Little (1993).

Clearly, selection models and pattern-mixture models coincide under the MCAR assumption, since, in either case, the joint density simplifies to f(y)f(d). Next, we show that MAR can be expressed in a pattern-mixture framework through restrictions, related to the complete case missing value (CCMV) restrictions (Little 1993), which we call available case missing value (ACMV) restrictions. Little's CCMV restrictions set a conditional density of unobserved components given a particular set of observed components equal to the corresponding conditional density in the subgroup of completers. Our ACMV restrictions equate this conditional density to the one calculated from the subgroup of all patterns for which all required components have been observed.

In our setting of longitudinal data with dropouts, CCMV can be defined formally as the condition that for each t ≥ 2 and for j < t,
\[
f(y_t \mid y_1, \ldots, y_{t-1}, d = j + 1) = f(y_t \mid y_1, \ldots, y_{t-1}, d = n + 1),
\]
whereas ACMV is the condition that for all t ≥ 2 and j < t,
\[
f(y_t \mid y_1, \ldots, y_{t-1}, d = j + 1) = f(y_t \mid y_1, \ldots, y_{t-1}, d > t). \tag{20.1}
\]
If there are only two time points (n = 2), then ACMV and CCMV coincide. With these definitions, we obtain:

Theorem 20.1 For longitudinal data with dropouts, MAR ⇐⇒ ACMV.

The proof of Theorem 20.1 is given in the Appendix (Section B.2).

An interesting aside of this theorem is that, since MAR corresponds to a set of (untestable) restrictions (ACMV) in the pattern-mixture framework, MAR itself is also untestable. Precisely, given MAR, standard (observed data) methods can be used, but the assumption of MAR itself cannot be tested. This fact is often overlooked in the selection framework.

Little (1993) suggested the possibility of using more than the completers to
construct identifying restrictions for two practical reasons: (1) The set of
completers may be small and (2) there may be a closer similarity between
the conditional distributions given d = t + 1 and some other incomplete
pattern d = s + 1, or set of patterns, than between those for d = t + 1 and
the completers, d = n + 1.

One way to proceed is as follows. First, restrict the data set to the first two components only. Then, missing data patterns d = 3, . . . , n + 1 collapse into a single pattern d > 2. Applying the ACMV restrictions to d = 2 and d > 2 leads to the construction of the density f(y2|y1, d = 2) = f(y2|y1, d > 2), as in (20.1). Multiplying by f(y1|d = 2) leads to f(y1, y2|d = 2), thus determining the joint densities f(y1, y2|d) for all d = 2, . . . , n + 1. Next, f(y3|y1, y2, d) (d = 2, 3) can be calculated from f(y3|y1, y2, d > 3). We then proceed by induction to construct all joint densities.
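For the simplest case, n = 2 (where ACMV reduces to CCMV), the construction amounts to a few lines of code. In the following minimal Python sketch for binary outcomes, the counts are made up purely for illustration:

# Minimal sketch: inductive ACMV construction for n = 2 binary outcomes;
# dropouts (d = 2) borrow f(y2 | y1) from the completers (d = 3), as in (20.1).
import numpy as np

n = np.array([[30.0, 10.0],
              [5.0, 55.0]])          # completers (d = 3): cell counts n[y1, y2]
m = np.array([20.0, 15.0])           # dropouts (d = 2): counts m[y1]

f2_given_1 = n / n.sum(axis=1, keepdims=True)     # f(y2 | y1, d = 3)
f1_dropout = m / m.sum()                          # f(y1 | d = 2)

joint_dropout = f1_dropout[:, None] * f2_given_1  # f(y1, y2 | d = 2)
print(joint_dropout)
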
A precise formulation of this and related strategies is discussed in Section 20.5. The next section is devoted to an insightful counterexample involving nonmonotone patterns.

20.2.2 Nonmonotone Patterns: A Counterexample

Note that the result of Theorem 20.1 does not hold for general missing data patterns. Consider a bivariate outcome (Y1, Y2) where missingness can occur in both components. Let (R1, R2) be the corresponding bivariate missingness indicator, where Rj = 0 if Yj is missing and 1 otherwise (j = 1, 2). Consider the following MAR mechanism:
\[
f(\boldsymbol{r} \mid \boldsymbol{y}) = \Pr(r_1, r_2 \mid y_1, y_2) =
\begin{cases}
p & \text{if } (r_1, r_2) = (0, 0), \\
q_{y_1} & \text{if } (r_1, r_2) = (1, 0), \\
s_{y_2} & \text{if } (r_1, r_2) = (0, 1), \\
1 - p - q_{y_1} - s_{y_2} & \text{if } (r_1, r_2) = (1, 1).
\end{cases} \tag{20.2}
\]

We need to indicate how the concept of ACMV would translate to this setting. Several proposals can be considered. A trivial extension of the ACMV restrictions in the monotone case implies, for the patterns r = (1, 0) and r = (0, 1):
\[
\boldsymbol{r} = (1, 0):\quad f(y_1, y_2 \mid \boldsymbol{r} = (1, 0)) = f(y_1 \mid \boldsymbol{r} = (1, 0))\, f(y_2 \mid y_1, \boldsymbol{r} = (1, 1)), \tag{20.3}
\]
\[
\boldsymbol{r} = (0, 1):\quad f(y_1, y_2 \mid \boldsymbol{r} = (0, 1)) = f(y_2 \mid \boldsymbol{r} = (0, 1))\, f(y_1 \mid y_2, \boldsymbol{r} = (1, 1)). \tag{20.4}
\]
The idea is that the density of the missing components, given the observed components, is replaced by the corresponding density in the pattern for which both are available. Restrictions for the pattern r = (0, 0) will be discussed further on.

From condition (20.3) we derive
\[
\frac{f(\boldsymbol{r} = (1,0) \mid y_1, y_2)\, f(y_1, y_2)}{f(\boldsymbol{r} = (1,0))}
= \frac{f(\boldsymbol{r} = (1,0) \mid y_1)\, f(y_1)}{f(\boldsymbol{r} = (1,0))}
\times \frac{f(\boldsymbol{r} = (1,1) \mid y_1, y_2)\, f(y_1, y_2)}{f(\boldsymbol{r} = (1,1) \mid y_1)\, f(y_1)},
\]
whence
\[
f(\boldsymbol{r} = (1,1) \mid y_1, y_2) = f(\boldsymbol{r} = (1,1) \mid y_1),
\]
since f(r = (1, 0)|y1, y2) = f(r = (1, 0)|y1) = qy1, implying that sy2 is constant. Similarly, condition (20.4) implies that qy1 is constant.

Clearly, since both qy1 and sy2 have to be constant, the mechanism needs to be MCAR. In other words, ACMV ≡ MCAR, independent of the restrictions for f(y1, y2|r = (0, 0)), and hence ACMV and MAR differ.
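This can also be checked by simulation. In the following minimal Python sketch, qy1 and sy2 are non-constant, so mechanism (20.2) is MAR but not MCAR, and the conditional distribution of Y2 given Y1 is seen to differ between the patterns r = (1, 0) and r = (1, 1), so that the trivial extension (20.3) cannot hold; all probabilities are illustrative choices:

# Minimal sketch: MAR mechanism (20.2) with non-constant q_{y1}, s_{y2};
# P(Y2 = 1 | Y1 = 1, r) differs between patterns (1,0) and (1,1).
import numpy as np

rng = np.random.default_rng(1)
N = 200_000
y1 = rng.integers(0, 2, N)
y2 = (rng.random(N) < np.where(y1 == 1, 0.7, 0.3)).astype(int)  # Y2 depends on Y1

p = 0.1                              # Pr(r = (0,0)), constant
q = np.where(y1 == 1, 0.3, 0.1)      # q_{y1}: depends on the observed y1 only
s = np.where(y2 == 1, 0.2, 0.1)      # s_{y2}: depends on the observed y2 only
u = rng.random(N)
pattern = np.select([u < p, u < p + q, u < p + q + s], [0, 1, 2], default=3)
# codes: 0 = (0,0), 1 = (1,0), 2 = (0,1), 3 = (1,1)

for code, label in ((1, "(1,0)"), (3, "(1,1)")):
    sel = (pattern == code) & (y1 == 1)
    print(label, round(y2[sel].mean(), 3))   # conditional means differ by pattern
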

20.3 Multiple Imputation

In Section 20.5, multiple imputation will be used as a tool in developing identifying-restrictions strategies. For this reason, and for its central place in the incomplete-data literature, this section reviews the principles. Multiple imputation was formally introduced by Rubin (1978). Rubin (1987) provides a comprehensive treatment. Several other sources, such as Rubin and Schenker (1986), Little and Rubin (1987), Tanner and Wong (1987), and Schafer's (1997) book give excellent and easy-to-read descriptions of the technique. Efron (1994) discusses connections between multiple imputation and the bootstrap. An important review, containing an extensive list of references and a large bibliography, is given in Rubin (1996).

The concept of multiple imputation refers to replacing each missing value with more than one imputed value. The goal is to combine the simplicity of imputation strategies with unbiasedness in both point estimates and measures of precision. In Section 16.3, we have seen that some simple imputation procedures may yield inconsistent point estimates as soon as the missingness mechanism surpasses MCAR. This could be overcome to a large extent with conditional mean imputation, but the problem of underestimating the variability of the estimators is common to all such methods since they all treat imputed values as observed values. By imputing several values for a single missing component, this uncertainty is explicitly acknowledged.

Rubin (1987) points to another very useful application of multiple imputation. Rather than merely accounting for sampling uncertainty, the method can be used to incorporate model uncertainty. Indeed, when a measurement is missing but the researcher has a good idea about the probabilistic measurement and missingness mechanisms, constructing the appropriate distribution to draw imputations from is, at least in principle, relatively straightforward. In practice, there may be considerable uncertainty about some parts of the joint model. In that case, several mechanisms for drawing imputations might seem equally plausible. They can be combined in a single multiple imputation analysis. As such, multiple imputation can be used as a tool for sensitivity analysis.
Suppose we have a sample of N i.i.d. n × 1 random vectors Y_i. Our interest lies in estimating some parameter vector θ of the distribution of Y_i. Assume the notation is as in Section 15.5. Multiple imputation fills in Y^m using the observed data Y^o, several times, and then the completed data are used to estimate θ.

As discussed by Rubin and Schenker (1986), the theoretical justification for multiple imputation is most easily understood using Bayesian concepts, but a likelihood-based treatment of the subject is equally possible. If we knew the joint distribution of Y_i = (Y_i^o, Y_i^m) with parameter vector γ, say, then we could impute Y_i^m by drawing a value from the conditional distribution
\[
f(\boldsymbol{y}_i^{m} \mid \boldsymbol{y}_i^{o}, \boldsymbol{\gamma}). \tag{20.5}
\]
Note that we explicitly distinguish the parameter of scientific interest θ from the parameter γ in (20.5). Since γ is unknown, we must estimate it from the data, say by γ̂, and use
\[
f(\boldsymbol{y}_i^{m} \mid \boldsymbol{y}_i^{o}, \hat{\boldsymbol{\gamma}}) \tag{20.6}
\]

to impute the missing data. In Bayesian terms, γ in (20.5) is a random variable, of which the distribution is a function of the data. In particular, we first obtain the distribution of γ from the data, depending on γ̂. The construction of model (20.5) is referred to by Rubin (1987) as the Modeling Task.

After formulating the distribution of γ, the imputation algorithm is as follows:

1. Draw γ* from the distribution of γ.

2. Draw Y_i^{m*} from f(y_i^m | y_i^o, γ*).

3. Using the completed data (Y^o, Y^{m*}) and the method of choice (i.e., maximum likelihood, restricted maximum likelihood, method of moments, partial likelihood), estimate the parameter of interest θ̂ = θ̂(Y) = θ̂(Y^o, Y^{m*}) and its variance (called the within-imputation variance) U = v̂ar(θ̂).

4. Independently repeat steps 1–3 M times. The M data sets give rise to θ̂^(m) and U^(m), for m = 1, . . . , M.

Steps 1 and 2 are referred to as the Imputation Task. Step 3 is the Es-
timation Task. Of course, one wants to combine the M inferences into a
single one. Both parameter and precision estimation, on the one hand, and
hypothesis testing, on the other hand, will be discussed next.
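Before turning to the combination rules, here is a minimal Python sketch of steps 1–4 for a toy bivariate problem; the missingness depends on the observed Y1 only (MAR), and a nonparametric bootstrap of the complete cases stands in for drawing γ* from its distribution (all settings, including the regression imputation model, are illustrative):

# Minimal sketch of the imputation/estimation loop (steps 1-4).
import numpy as np

rng = np.random.default_rng(2)
n = 200
y1 = rng.normal(0.0, 1.0, n)
y2 = 0.6 * y1 + rng.normal(0.0, 0.8, n)
miss = rng.random(n) < 0.3 + 0.3 * (y1 > 0)        # MAR: depends on y1 only
cc = np.flatnonzero(~miss)                         # complete cases

M, est, var = 5, [], []
for m in range(M):
    idx = rng.choice(cc, cc.size)                  # step 1: draw gamma* (bootstrap)
    b1, b0 = np.polyfit(y1[idx], y2[idx], 1)       # regression of Y2 on Y1
    sd = np.std(y2[idx] - (b0 + b1 * y1[idx]))
    y2_imp = y2.copy()                             # step 2: draw the missing Y2
    y2_imp[miss] = b0 + b1 * y1[miss] + rng.normal(0.0, sd, miss.sum())
    est.append(y2_imp.mean())                      # step 3: estimate theta = E(Y2)
    var.append(y2_imp.var(ddof=1) / n)             # ... and its completed-data variance
# step 4: est and var now feed the pooling rules of Section 20.3.1
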
20.3.1 Parameter and Precision Estimation

The M within-imputation estimates for θ are pooled to give the multiple imputation estimate
\[
\hat{\boldsymbol{\theta}}^{*} = \frac{1}{M} \sum_{m=1}^{M} \hat{\boldsymbol{\theta}}^{(m)}.
\]
Suppose that complete data inference about θ would be made by (θ − θ̂) ∼ N(0, U). Then, one can make normal-based inferences for θ based upon
\[
(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}^{*}) \sim N(\boldsymbol{0}, V), \tag{20.7}
\]
where
\[
V = \widehat{W} + \frac{M+1}{M}\, \widehat{B}, \tag{20.8}
\]
\[
\widehat{W} = \frac{1}{M} \sum_{m=1}^{M} U^{(m)} \tag{20.9}
\]
is the average within-imputation variance, and
\[
\widehat{B} = \frac{1}{M-1} \sum_{m=1}^{M} \left(\hat{\boldsymbol{\theta}}^{(m)} - \hat{\boldsymbol{\theta}}^{*}\right)\left(\hat{\boldsymbol{\theta}}^{(m)} - \hat{\boldsymbol{\theta}}^{*}\right)' \tag{20.10}
\]
is the between-imputation variance (Rubin 1987). Rubin and Schenker
(1986) report that a small number of imputations (M = 2, 3) already yields
a major improvement over single imputation. Upon noting that the factor
(M + 1)/M approaches 1 for large M , (20.8) is approximately the sum of
the within- and the between-imputations variability.
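In code, the combination rules (20.8)–(20.10) amount to only a few lines for a scalar parameter. A minimal Python sketch, with made-up inputs:

# Minimal sketch of Rubin's combination rules for a scalar parameter;
# est[m] and var[m] come from the m-th completed data set.
import numpy as np

def pool(est, var):
    """Combine M completed-data results via (20.8)-(20.10)."""
    est, var = np.asarray(est, float), np.asarray(var, float)
    M = est.size
    theta_star = est.mean()                  # pooled point estimate
    W = var.mean()                           # within-imputation variance (20.9)
    B = est.var(ddof=1)                      # between-imputation variance (20.10)
    return theta_star, W + (M + 1) / M * B   # total variance (20.8)

theta_star, V = pool([0.72, 0.69, 0.75], [0.012, 0.011, 0.013])
print(theta_star, np.sqrt(V))                # estimate and its standard error
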

Multiple imputation is most useful in situations where γ is an easily estimated set of parameters characterizing the distribution of Y_i, whereas θ is complicated to estimate in the presence of missing data and/or when obtaining a correct estimate for the variance is nontrivial with incomplete data.

20.3.2 Hypothesis Testing

Testing hypotheses could be based on the asymptotic normality results (20.7) and (20.8). However, the rationale for using asymptotic results and hence χ² reference distributions is not just a function of the sample size, N, but also of the number of imputations, M. Therefore, Li, Raghunathan, and Rubin (1991) propose the use of an F reference distribution. Precisely, to test the hypothesis H0 : θ = θ0, they advocate the following method to calculate p-values:
\[
p = P(F_{k,w} > F), \tag{20.11}
\]
where k is the length of the parameter vector θ, F_{k,w} is an F random variable with k numerator and w denominator degrees of freedom, and
\[
F = \frac{(\hat{\boldsymbol{\theta}}^{*} - \boldsymbol{\theta}_0)'\, W^{-1}\, (\hat{\boldsymbol{\theta}}^{*} - \boldsymbol{\theta}_0)}{k(1 + r)}, \tag{20.12}
\]
\[
w = 4 + (\tau - 4)\left[1 + \frac{1 - 2\tau^{-1}}{r}\right]^{2},
\]
\[
r = \frac{1}{k}\left(1 + \frac{1}{M}\right) \mathrm{tr}(BW^{-1}), \tag{20.13}
\]
\[
\tau = k(M - 1).
\]

It is interesting to note that, when M → ∞, the reference distribution of F approaches an F_{k,∞} = χ²/k distribution, in line with intuition. Good operational characteristics of this procedure are reported in Li, Raghunathan, and Rubin (1991), which combines nicely with computational ease.
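A minimal Python sketch of the procedure (20.11)–(20.13) follows, with made-up inputs; W and B denote the pooled within- and between-imputation covariance matrices:

# Minimal sketch of the F-based multiple imputation test (20.11)-(20.13).
import numpy as np
from scipy.stats import f as f_dist

def mi_f_test(theta_star, theta0, W, B, M):
    """p-value of H0: theta = theta0 via the F reference distribution."""
    k = theta_star.size
    r = (1.0 / k) * (1.0 + 1.0 / M) * np.trace(B @ np.linalg.inv(W))  # (20.13)
    tau = k * (M - 1)
    diff = theta_star - theta0
    F = diff @ np.linalg.inv(W) @ diff / (k * (1.0 + r))              # (20.12)
    w = 4 + (tau - 4) * (1 + (1 - 2.0 / tau) / r) ** 2
    return f_dist.sf(F, k, w)                                          # (20.11)

p = mi_f_test(np.array([0.72]), np.array([0.0]),
              np.array([[0.012]]), np.array([[0.0009]]), M=5)
print(p)
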

Clearly, procedure (20.11) can also be used when not the full vector θ but one component, a subvector, or a set of linear contrasts is the subject of hypothesis testing. When a subvector is of interest (a single component being a special case), the corresponding submatrices of B and W need to be used in (20.12) and (20.13). For a set of linear contrasts Lθ, one should use the appropriately transformed covariance matrices: W̃ = LWL′, B̃ = LBL′, and Ṽ = LVL′.

20.4 Pattern-Mixture Models and Sensitivity Analysis

Sensitivity analysis for pattern-mixture models can be conceived in many different ways. Crucial guiding questions are whether a pattern-mixture model is the central focus of interest or should rather be viewed as complementary to another model, such as a selection model.

Following the initial mention of pattern-mixture models (Glynn, Laird, and Rubin 1986, Little and Rubin 1987) and the renewed interest in recent years (Little 1993, 1994a, Hogan and Laird 1997), several authors have contrasted selection models and pattern-mixture models. This is done

to either (1) answer the same scientific question, such as marginal treat-
ment effect or time evolution, based on these two rather different modeling
strategies, or (2) to gain additional insight by supplementing the selection
model results with those from a pattern-mixture approach. Examples can
be found in Verbeke, Lesaffre, and Spiessens (2000), Curran, Pignatti, and
Molenberghs (1997), and Michiels et al . (1999) for continuous outcomes.
The categorical outcome case has been treated in Molenberghs, Michiels,
and Lipsitz (1999), and Michiels, Molenberghs, and Lipsitz (1999). Further
references include Cohen and Cohen (1983), Muthén, Kaplan, and Hollis
(1987), Allison (1987), McArdle and Hamagani (1992) Little and Wang
(1996), Hedeker and Gibbons (1997), Siddiqui and Ali (1998), Ekholm and
Skinner (1998), Molenberghs, Michiels, and Kenward (1998), and Park and
Lee (1999).

We want to stress the usefulness of these two modeling strategies and also
refer to the toenail (Sections 17.2 and 18.3) and Vorozole studies (Sec-
tions 17.6 and 18.4). In the Vorozole case, the treatment effect assessment
is virtually identical under both strategies, but useful additional insight
is obtained from the pattern-specific average profiles. Of course, such bi-
lateral comparisons are not confined to the selection and pattern-mixture
model families. For example, shared-parameter models can be included as
well (see Section 15.4 and the very complete review by Little 1995).

On the other hand, the sensitivity analysis we propose here can also be
conducted within the pattern-mixture family, in analogy with the selection
model case (Chapter 18). The key area on which sensitivity analysis should
focus is the unidentified components of the model and the way(s) in which
these are handled. Indeed, recall that model family (18.6) contains underiden-
tified members since it describes the full set of measurements in pattern ti ,
even though there are no measurements after occasion ti − 1. In the intro-
duction, we mentioned three strategies to deal with this underidentification.
Let us describe these in turn.

• Strategy 1. Little (1993, 1994a) advocated the use of identifying


restrictions, which work well in relatively simple settings, and pre-
sented several examples. Perhaps the best known ones are so-called
complete case missing value restrictions (CCMV), where, for a given
pattern, the conditional distribution of the missing data, given the
observed data, is equated to its counterpart in the completers.
Based in part on the pattern-mixture formulation of MAR as de-
scribed in Section 20.2, we will outline a general framework for iden-
tifying restrictions. An important case is available case missing value
(ACMV), where the conditional distribution of the unobserved out-
comes given the observed outcomes in a specific pattern is equated to
combined information from all patterns on which the outcomes are

observed, in such a way that it corresponds to MAR. This is useful if


one wants to assign special value to the MAR case, just as in the se-
lection model context. It also provides a way to compare ignorable se-
lection models with their counterpart in the pattern-mixture setting.
Molenberghs, Michiels, and Lipsitz (1999) and Michiels, Molenberghs, and
Lipsitz (1999) took up this idea in the context of binary outcomes,
with a marginal global odds ratio model to describe the measurement
process (Molenberghs and Lesaffre 1994).
It will be clear that ACMV is but one particular way of combining in-
formation from patterns on which a given set of outcomes is observed.
Among the family of such approaches, special emphasis can be put
on the so-called neighboring pattern which, in a monotone dropout
setting, is the pattern with one additional measurement obtained.
They are referred to as neighboring case missing value (NCMV) re-
strictions. In a sense, they are opposite to CCMV.
A full account of identifying restrictions is provided in Section 20.5.
• We will now introduce the other two strategies. As opposed to identi-
fying restrictions, model simplification can be done in order to iden-
tify the parameters. The advantage is that the number of parameters
decreases, which is desirable since the length of the parameter vec-
tor is a general issue with pattern-mixture models. Indeed, Hogan
and Laird (1997) noted that in order to estimate the large number of
parameters in general pattern-mixture models, one has to make the
awkward requirement that each dropout pattern occurs sufficiently
often. In other words, one has to require large amounts of dropout.
Broadly, we distinguish between two types of simplifications to iden-
tify pattern-mixture models.
– Strategy 2. First, trends can be restricted to functional forms
supported by the information available within a pattern. For ex-
ample, a linear time trend is easily extrapolated beyond the last
obtained measurement. As discussed in the introduction, this
was the strategy followed in the toenail study (Section 18.3).
It is relatively simple to apply when, for example, a quadratic
curve is assumed for each of the patterns. One only needs to
provide an ad hoc solution for the first or the first few patterns.
However, as was seen in the toenail study, some of the extrapola-
tions, especially when based on traditional polynomials and/or
in sparse patterns, can appear to be unrealistic (Figure 18.2)
and may then require further scrutiny.
In order to fit such models, one simply has to carry out a model-
building exercise in each of the patterns separately. Each of these
comes down to fitting a standard linear mixed model and there-
fore entails no additional complexity.

– Strategy 3. Second, one can let the parameters vary across pat-
terns in a parametric way. Thus, rather than estimating a sepa-
rate time trend in each pattern, one could assume that the time
evolution within a pattern is unstructured, but parallel across
patterns. This is effectuated by treating pattern as a covariate
in the model. The available data can be used to assess whether
such simplifications are supported within the time ranges for
which there is information. Using the so-obtained profiles past
the time of dropout still requires extrapolation.
From a model-building perspective, this modeling strategy can
be viewed as a standard linear mixed model with the pattern
indicator as an additional covariate.
This approach was recently adopted by Park and Lee (1999)
in the context of count data for which generalized estimating
equations (Liang and Zeger 1986) are used.

While the second and third strategies are computationally simple, it is


important to note that there is a price to pay. Indeed, simplified models,
qualified as “assumption rich” by Sheiner, Beal and Dunne (1997), are also
making untestable assumptions, just as in the selection model case. Still,
the need for assumptions and their implications are more obvious. It is, for
example, not possible to assume an unstructured time trend in incomplete
patterns, except if one restricts attention to the time range from onset
until dropout. In contrast, assuming a linear time trend is possible in all
patterns containing at least two measurements. Whereas such an approach
is very simple from a modeler’s point of view, it is then less obvious what
the precise nature of the dropout mechanism is. In any case, the dropout
model is not made explicit in this way. This is in contrast with the identifying-
restrictions route, where the assumptions have to be clear from the start
and, importantly, the MAR (ACMV) case is available as a special case.

A final observation, applying to all strategies, is that pattern-mixture mod-


els do not necessarily provide estimates and standard errors of marginal
quantities of interest, such as overall treatment effect or overall time trend.
Hogan and Laird (1997) provided a way to derive selection model quan-
tities from the pattern-mixture model. Several authors have followed this
idea to formally compare the conclusions from a selection model with the
selection model parameters in a pattern-mixture model. Verbeke, Lesaf-
fre, and Spiessens (2000), Curran, Pignatti, and Molenberghs (1997), and
Michiels, Molenberghs, Bijnens, and Vangeneugden (1999) applied this ap-
proach in the context of linear mixed models for continuous data. We refer
to Sections 18.3, 18.4, and 24.4.2 for illustrations in the toenail study, the
Vorozole study, and the milk protein trial, respectively.

In Section 20.5, we describe identifying restriction strategies, with MAR


(ACMV), CCMV, and NCMV as special cases. They are applied in Sec-
tion 20.6 to the Vorozole study.

20.5 Identifying Restrictions Strategies

In line with the results obtained in Section 20.2, we restrict attention to


monotone patterns. In general, we indicate dropout patterns by t = 1, . . . , n,
where, as in Section 15.9, the dropout indicator is d = t + 1. For pattern t,
the complete data density is given by

ft (y1 , . . . , yn ) = ft (y1 , . . . , yt )ft (yt+1 , . . . , yn |y1 , . . . , yt ). (20.14)

The first factor on the right-hand side of (20.14) is clearly identifiable from
the observed data, whereas the second factor is not. Therefore, identifying
restrictions are applied in order to identify the second component.

20.5.1 Strategy Outline

The above observations suggest the following strategy:

1. Fit a (linear mixed) model to the pattern-specific identifiable densi-


ties:

ft (y1 , . . . , yt ). (20.15)

This results in a parameter estimate γ̂ t which, for example, consists


of fixed-effects parameters and variance components.

2. Select an identification method of choice. This will be discussed in


full detail in Section 20.5.2.

3. Using this identification method, determine the conditional distribu-


tions of the unobserved outcomes, given the observed ones:

ft (yt+1 , . . . , yn |y1 , . . . , yt ). (20.16)

In the case that the observed densities (20.15) are assumed to be


normal, the conditional densities are normal or finite mixtures of
normal densities. This feature will be taken up in Section 20.5.2 as
well.

4. Using the methodology outlined in Section 20.3, draw multiple impu-


tations for the unobserved components, given the observed outcomes
and the correct pattern-specific density (20.16). We will study this
further in Section 20.5.4.

5. Analyze the multiply-imputed sets of data using the method of choice.


It would be most natural to consider a pattern-mixture model, but it
is also possible to use selection models. One has to ensure, however,
that the multiple imputation is proper, in the sense described by
Rubin (1987, 1996), Rubin and Schenker (1986), and Schafer (1997).
Informally, an important concern is that relations between variables
of scientific interest should not be excluded in the imputation stage.
This implies that the original observed-data models, fitted in the first
step, should be rich enough to carry the relations included in the final
completed-data models.

6. Inferences can be conducted in the way described in Sections 20.3.1


and 20.3.2.

In the next sections, a more technical treatment is given to items 2 and 3 of


the above strategy. Section 20.5.2 describes the identifying restrictions in
detail. Section 20.5.4 is dedicated to handling the conditional distributions
of the unobserved components given the observed ones, as a preparation
for the multiple imputation draws.

20.5.2 Identifying Restrictions

Although, in principle, completely arbitrary restrictions can be used by


means of any valid density function over the appropriate support, strategies
which relate back to the observed data deserve privileged interest. Little
(1993) proposes CCMV, which uses the following identification:

ft (yt+1 , . . . , yn |y1 , . . . , yt ) = fn (yt+1 , . . . , yn |y1 , . . . , yt ). (20.17)

In other words, information which is unavailable is always borrowed from


the completers. This strategy can be defended in cases where the bulk of
the subjects are complete and only small proportions are assigned to the
various dropout patterns. Also, extension of this approach to nonmonotone
patterns is particularly easy. On the other hand, the completers may be
rather “distant” in some sense from incomplete patterns, especially in the
case of early dropouts.

In such cases, it is useful to borrow information from other or even all


available patterns. To this end, we first rewrite the unidentified component

as

$$f_t(y_{t+1},\ldots,y_n \mid y_1,\ldots,y_t) = \prod_{s=t+1}^{n} f_t(y_s \mid y_1,\ldots,y_{s-1}). \qquad (20.18)$$

Now, a very general formulation is obtained by allowing each factor on


the right-hand side of (20.18) to be identified from a selected number of
patterns. For example, CCMV follows from using fn for each unidentified
factor. Alternatively, the nearest identified pattern can be used:
ft (ys |y1 , . . . , ys−1 ) = fs (ys |y1 , . . . , ys−1 ), s = t + 1, . . . , n. (20.19)
We will refer to these restrictions as neighboring case missing values or
NCMV. In what follows, we will provide some motivation for this termi-
nology. However, using information from only one pattern can be seen as
wasteful and therefore one can base identification on all patterns for which
the sth component is identified:

$$f_t(y_s \mid y_1,\ldots,y_{s-1}) = \sum_{j=s}^{n} \omega_{sj}\, f_j(y_s \mid y_1,\ldots,y_{s-1}), \qquad s = t+1,\ldots,n. \qquad (20.20)$$

For simplicity, we will use ωs = (ωss, . . . , ωsn)′. Every ωs which sums to


1 and consists of nonnegative components provides a valid identification
scheme. Obviously, (20.20) generalizes the earlier schemes. CCMV restric-
tions are obtained by setting ωsn = 1 and all others equal to zero. The
NCMV system follows from setting ωss = 1 and the others equal to zero.
Further, we will show that there always is a unique choice for ωs which
corresponds to ACMV and hence to MAR.
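To make the ωs parametrization concrete, the two boundary schemes can be written down directly, as in the following illustrative Python sketch (the function name is ours, not from the book's macros); the ACMV weights are data dependent and are derived in Section 20.5.3.

import numpy as np

def omega_s(s, n, scheme):
    """Weight vector (omega_ss,...,omega_sn) over patterns j = s,...,n
    in identification (20.20)."""
    w = np.zeros(n - s + 1)
    if scheme == "CCMV":
        w[-1] = 1.0    # all mass on the completers, pattern n, cf. (20.17)
    elif scheme == "NCMV":
        w[0] = 1.0     # all mass on the neighboring pattern s, cf. (20.19)
    else:
        raise ValueError("ACMV weights depend on the data; see (20.28)")
    return w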

Motivation for NCMV

Neighboring case missing value restrictions, as outlined in (20.19), can be


introduced in two slightly different but perhaps more intuitive ways. A
downward approach identifies (20.16) as
ft (yt+1 , . . . , yn |y1 , . . . , yt ) = ft+1 (yt+1 , . . . , yn |y1 , . . . , yt ),(20.21)
for t = 1, . . . , n − 1. The idea is to start with t = n − 1 and then gradually
step down until t = 1. Thus, the distribution in a given pattern is identified
from the pattern with one more component observed. Similarly, an upward
strategy identifies
fs (yt+1 |y1 , . . . , yt ) = ft+1 (yt+1 |y1 , . . . , yt ) (20.22)
for t = 1, . . . , n and s = 1, . . . , t.

We now state the connection between these strategies.



Result 20.1 Strategies (20.21) and (20.22) are equivalent to neighboring


case missing value restrictions (20.19).

The result is easily shown. First, the equivalence between (20.22) and
(20.19) is immediate. Second, for (20.21) we proceed by inductive reason-
ing. Clearly, the equivalence holds for t = n − 1. Assume now that it holds
for t + 1. Then,
$$\begin{aligned}
f_t(y_{t+1},\ldots,y_n \mid y_1,\ldots,y_t)
&= f_{t+1}(y_{t+1},\ldots,y_n \mid y_1,\ldots,y_t)\\
&= f_{t+1}(y_{t+1} \mid y_1,\ldots,y_t)\,\prod_{s=2}^{n-t} f_{t+1}(y_{t+s} \mid y_1,\ldots,y_{t+s-1})\\
&= f_{t+1}(y_{t+1} \mid y_1,\ldots,y_t)\,\prod_{s=2}^{n-t} f_{t+s}(y_{t+s} \mid y_1,\ldots,y_{t+s-1}),
\end{aligned}$$

where the last equality holds by virtue of the induction hypothesis. This
establishes the result.

Using All Patterns

Similarly to the NCMV case, we can formulate (20.20) in two alternative


ways. The downward strategy is formulated as
$$f_t(y_{t+1},\ldots,y_n \mid y_1,\ldots,y_t) = \sum_{j=t+1}^{n} \omega_{t+1,j}\, f_j(y_{t+1},\ldots,y_n \mid y_1,\ldots,y_t). \qquad (20.23)$$

Similarly, we write the upward strategy as



$$f_s(y_{t+1} \mid y_1,\ldots,y_t) = \sum_{j=t+1}^{n} \omega_{t+1,j}\, f_j(y_{t+1} \mid y_1,\ldots,y_t) \qquad (20.24)$$

for t = 1, . . . , n and s = 1, . . . , t. Now, by setting



$$g_{t+1}(\cdot \mid y_1,\ldots,y_t) \equiv \sum_{j=t+1}^{n} \omega_{t+1,j}\, f_j(\cdot \mid y_1,\ldots,y_t)$$

and reproducing the proof of Result 20.1 for the g(·) functions, it is clear
that the following result holds:

Result 20.2 Strategies (20.23) and (20.24) are equivalent to restrictions


(20.20).

Note that Result 20.2 includes Result 20.1 as a special case.



20.5.3 ACMV Restrictions

A general class of restrictions, based on the information available from


other patterns, is provided by (20.20). Equivalent formulations are (20.23) and (20.24); special cases are provided by (20.19), (20.21), and (20.22), as well as by (20.17). In this section, we will derive expressions for the ωs which
correspond to ACMV, as defined in Section 20.2. This will then constitute
a third special case.

The definition of ACMV is presented in (20.1). We will now show that it


provides an easy way to derive an expression for ωs in (20.20). Indeed, (20.1) can be restated, using notation of this section, as
ft (ys |y1 , . . . , ys−1 ) = f(≥s) (ys |y1 , . . . , ys−1 ), (20.25)
for s = t + 1, . . . , n. Here,
f(≥s) (·|·) ≡ f (·|·, d > s) ≡ f (·|·, t ≥ s),
with d = t + 1 indicating time of dropout. Now, we can transform (20.25)
as follows:
$$\begin{aligned}
f_t(y_s \mid y_1,\ldots,y_{s-1})
&= f_{(\geq s)}(y_s \mid y_1,\ldots,y_{s-1})\\
&= \frac{\sum_{j=s}^{n} \alpha_j f_j(y_1,\ldots,y_s)}{\sum_{j=s}^{n} \alpha_j f_j(y_1,\ldots,y_{s-1})} \qquad (20.26)\\
&= \sum_{j=s}^{n} \left( \frac{\alpha_j f_j(y_1,\ldots,y_{s-1})}{\sum_{l=s}^{n} \alpha_l f_l(y_1,\ldots,y_{s-1})} \right) f_j(y_s \mid y_1,\ldots,y_{s-1}). \qquad (20.27)
\end{aligned}$$

Now, comparing (20.27) to (20.20) yields


$$\omega_{sj} = \frac{\alpha_j f_j(y_1,\ldots,y_{s-1})}{\sum_{l=s}^{n} \alpha_l f_l(y_1,\ldots,y_{s-1})}. \qquad (20.28)$$

We have derived two equivalent explicit expressions of (20.1). Expression


(20.26) is the conditional density of a mixture, whereas (20.20) with (20.28)
is a mixture of conditional densities.

Clearly, ωs defined by (20.28) consists of components which are nonneg-


ative and sum to 1, thus defining a valid density function as soon as its
components are genuine densities.
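In practice, (20.28) is evaluated at the observed history of a subject. The following Python fragment is an illustrative sketch, assuming each pattern's observed-data density has been fitted as a (multivariate) normal:

import numpy as np
from scipy.stats import multivariate_normal

def acmv_weights(y_hist, alpha, mu, Sigma):
    """ACMV weights omega_sj of (20.28) at the history y_1,...,y_{s-1}.
    alpha, mu, Sigma contain, for each pattern j = s,...,n in which
    component s is observed, the pattern proportion and the fitted
    normal mean vector and covariance matrix."""
    q = len(y_hist)
    num = np.array([a * multivariate_normal.pdf(y_hist, m[:q], S[:q, :q])
                    for a, m, S in zip(alpha, mu, Sigma)])
    return num / num.sum()   # nonnegative components summing to 1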

Let us incorporate (20.20) into (20.14):


$$f_t(y_1,\ldots,y_n) = f_t(y_1,\ldots,y_t) \prod_{s=0}^{n-t-1} \left[ \sum_{j=n-s}^{n} \omega_{n-s,j}\, f_j(y_{n-s} \mid y_1,\ldots,y_{n-s-1}) \right]. \qquad (20.29)$$

Expression (20.29) clearly shows which information is used to complement


the observed data density in pattern t in order to establish the complete
data density.

The practical use of (20.20) in multiple imputation, with the CCMV,


NCMV, and ACMV strategies as special cases, will be studied in Sec-
tion 20.5.4. First, we will study the simple case of sequences of length 3
and 4.

Special Case: Three Measurements

In this case, there are only three patterns, and identification (20.29) takes
the following form:
f3 (y1 , y2 , y3 ) = f3 (y1 , y2 , y3 ), (20.30)
f2 (y1 , y2 , y3 ) = f2 (y1 , y2 )f3 (y3 |y1 , y2 ), (20.31)
f1 (y1 , y2 , y3 ) = f1 (y1 ) [ωf2 (y2 |y1 ) + (1 − ω)f3 (y2 |y1 )]
×f3 (y3 |y1 , y2 ). (20.32)
Since f3 (y1 , y2 , y3 ) is completely identifiable from the data, and for density
f2 (y1 , y2 , y3 ) there is only one possible identification, given (20.20), the
only room for choice is in pattern 1. Setting ω = 1 corresponds to NCMV,
whereas ω = 0 implies CCMV. Using (20.28), ACMV boils down to
$$\omega = \frac{\alpha_2 f_2(y_1)}{\alpha_2 f_2(y_1) + \alpha_3 f_3(y_1)}. \qquad (20.33)$$

The conditional density f1(y2|y1) in (20.32) can be rewritten as

$$f_1(y_2 \mid y_1) = \frac{\alpha_2 f_2(y_1,y_2) + \alpha_3 f_3(y_1,y_2)}{\alpha_2 f_2(y_1) + \alpha_3 f_3(y_1)},$$
which is, of course, a special case of (20.26). Upon division by α2 + α3 ,
both numerator and denominator are mixture densities, hence producing a
legitimate conditional density.

Special Case: Four Measurements

The counterparts of (20.30)–(20.32) are


f4 (y1 , y2 , y3 , y4 ) = f4 (y1 , y2 , y3 , y4 ), (20.34)

f3 (y1 , y2 , y3 , y4 ) = f3 (y1 , y2 , y3 )f4 (y4 |y1 , y2 , y3 ), (20.35)

f2 (y1 , y2 , y3 , y4 ) = f2 (y1 , y2 ) [ω33 f3 (y3 |y1 , y2 ) + ω34 f4 (y3 |y1 , y2 )]

×f4 (y4 |y1 , y2 , y3 ), (20.36)

f1 (y1 , y2 , y3 , y4 ) = f1 (y1 )

× [ω22 f2 (y2 |y1 ) + ω23 f3 (y2 |y1 ) + ω24 f4 (y2 |y1 )]

× [ω33 f3 (y3 |y1 , y2 ) + ω34 f4 (y3 |y1 , y2 )]

×f4 (y4 |y1 , y2 , y3 ). (20.37)


Now, setting ω33 = ω22 = 1 corresponds to NCMV, and ω34 = ω24 = 1
implies CCMV. ACMV corresponds to the system
$$\begin{aligned}
\omega_{33} &= \frac{\alpha_3 f_3(y_1,y_2)}{\alpha_3 f_3(y_1,y_2) + \alpha_4 f_4(y_1,y_2)},\\
\omega_{34} &= \frac{\alpha_4 f_4(y_1,y_2)}{\alpha_3 f_3(y_1,y_2) + \alpha_4 f_4(y_1,y_2)},\\
\omega_{22} &= \frac{\alpha_2 f_2(y_1)}{\alpha_2 f_2(y_1) + \alpha_3 f_3(y_1) + \alpha_4 f_4(y_1)},\\
\omega_{23} &= \frac{\alpha_3 f_3(y_1)}{\alpha_2 f_2(y_1) + \alpha_3 f_3(y_1) + \alpha_4 f_4(y_1)},\\
\omega_{24} &= \frac{\alpha_4 f_4(y_1)}{\alpha_2 f_2(y_1) + \alpha_3 f_3(y_1) + \alpha_4 f_4(y_1)}.
\end{aligned}$$

Explicit ACMV expressions for those conditional densities which involve


mixtures are
$$\begin{aligned}
f_1(y_3 \mid y_1,y_2) = f_2(y_3 \mid y_1,y_2)
&= \omega_{33} f_3(y_3 \mid y_1,y_2) + \omega_{34} f_4(y_3 \mid y_1,y_2) \qquad (20.38)\\
&= \frac{\alpha_3 f_3(y_1,y_2,y_3) + \alpha_4 f_4(y_1,y_2,y_3)}{\alpha_3 f_3(y_1,y_2) + \alpha_4 f_4(y_1,y_2)}, \qquad (20.39)\\[6pt]
f_1(y_2 \mid y_1)
&= \omega_{22} f_2(y_2 \mid y_1) + \omega_{23} f_3(y_2 \mid y_1) + \omega_{24} f_4(y_2 \mid y_1)\\
&= \frac{\alpha_2 f_2(y_1,y_2) + \alpha_3 f_3(y_1,y_2) + \alpha_4 f_4(y_1,y_2)}{\alpha_2 f_2(y_1) + \alpha_3 f_3(y_1) + \alpha_4 f_4(y_1)}. \qquad (20.40)
\end{aligned}$$

In general, (20.34)–(20.37) provide three free parameters which can be ex-


ploited as a natural basis for sensitivity analysis.

20.5.4 Drawing from the Conditional Densities

In the previous section, we have seen how general identifying restrictions


(20.20), with CCMV, NCMV, and ACMV as special cases, lead to the
conditional densities for the unobserved components, given the observed
ones. This came down to deriving expressions for ω s , such as in (20.28)
for ACMV. This endeavor corresponds to items 2 and 3 of the strategy
outline (Section 20.5.1). In order to carry out item 4 (drawing multiple
imputations), we need to draw imputations from these conditional densities.

Let us focus on the special case of three measurements. To this end, we


consider identification scheme (20.30)–(20.32). At first, we leave the para-
metric form of these densities unspecified. The following steps are required:

1. Estimate the parameters of the identifiable densities: f3 (y1 , y2 , y3 ),


f2 (y1 , y2 ), and f1 (y1 ). Then, for each of the M imputations, we have
to execute the following steps.
2. Draw from the parameter vectors as in the first step on p. 337. It
will be assumed that in all densities from which we draw next, this
parameter vector is used.
3. For pattern 2. Given an observation (y1 , y2 ) in this pattern, calcu-
late the conditional density f3 (y3 |y1 , y2 ) and draw from it.
4. For pattern 1. We now have to distinguish three substeps.
(a) Given y1 and the proportions α2 and α3 of observations in the
second and third patterns, respectively, determine ω. Every ω in
the unit interval is valid. Special cases are as follows:
• For NCMV, ω = 1.
• For CCMV, ω = 0.
• For ACMV, ω is calculated from (20.33). Note that, given
y1 , this is a constant.
If 0 < ω < 1, generate a random uniform variate, U say. Note
that, strictly speaking, this draw is unnecessary for the boundary
NCMV (ω = 1) and CCMV (ω = 0) cases.
(b) If U ≤ ω, calculate f2 (y2 |y1 ) and draw from it. Otherwise, do
the same based on f3 (y2 |y1 ).
(c) Given the observed y1 and given y2 , which has just been drawn,
calculate the conditional density f3 (y3 |y1 , y2 ) and draw from it.

All steps but the first one have to be repeated M times to obtain the
same number of imputed data sets. Inference then proceeds as outlined in
Sections 20.3.1 and 20.3.2.

When the observed densities are estimated using linear mixed models,
f3 (y1 , y2 , y3 ), f2 (y1 , y2 ), and f1 (y1 ) produce fixed effects and variance com-
ponents. Let us group all of them in γ and assume a draw is made from
their distribution, γ ∗ say. To this end, their precision estimates need to be
computed. In SAS, they are provided by the ‘covb’ option in the MODEL
statement and the ‘asycov’ option in the PROC MIXED statement.

Let us illustrate this procedure for (20.31). Assume that the ith subject
has only two measurements, and hence belongs to the second pattern. Let
its design matrices be Xi and Zi for the fixed effects and random effects,
respectively. Its mean and variance for the third pattern are
µi(3) = Xi β∗(3),  (20.41)
Vi(3) = Zi D∗(3)Zi′ + Σ∗i(3),  (20.42)
where (3) indicates that the parameters are specific to the third pattern,
as in (18.6). The asterisk is reminiscent of these parameters being part of
the draw γ ∗ .

Now, based on (20.41)–(20.42) and the observed values yi1 and yi2 , the
parameters for the conditional density follow immediately:
µi,2|1(3) = µi,2(3) + Vi,21(3)[Vi,11(3)]−1 (yi − µi,1(3)),
Vi,2|1(3) = Vi,22(3) − Vi,21(3)[Vi,11(3)]−1 Vi,12(3),
where a subscript 1 indicates the first two components and a subscript 2
refers to the third component.
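In code, this conditioning step takes only a few lines. The following Python sketch is illustrative (the book's own implementation is a SAS/GAUSS macro); µ and V would be built from the drawn γ∗ as in (20.41)–(20.42):

import numpy as np

def conditional_normal(mu, V, y_obs):
    """Mean and covariance of the missing components given the observed
    ones, for a normal with mean mu and covariance V in which the first
    len(y_obs) components (subscript 1) are the observed ones."""
    q = len(y_obs)
    mu1, mu2 = mu[:q], mu[q:]
    V11, V12 = V[:q, :q], V[:q, q:]
    V21, V22 = V[q:, :q], V[q:, q:]
    A = V21 @ np.linalg.inv(V11)
    return mu2 + A @ (y_obs - mu1), V22 - A @ V12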

Based on (20.20), it is clear that the above procedure readily generalizes.


This holds in particular for NCMV and CCMV, in which case no random
uniform variates are required. Using (20.28), it is also clear that such a gen-
eralization is straightforward in the ACMV case. Formally, drawing from
(20.20) consists of two steps:

• Draw a random uniform variate U to determine from which of the n − s + 1 components one is going to draw. Specifically, the kth component is chosen if

$$\sum_{j=s}^{k-1} \omega_{sj} \,\leq\, U \,<\, \sum_{j=s}^{k} \omega_{sj},$$

where k = s, . . . , n. Note that for k = s, the left-hand sum is empty and hence set equal to zero.

• Draw from the kth component. Since every component of the mixture
is normal, only draws from uniform and normal random variables are
required.
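A minimal Python sketch of this two-step draw is given below (the helper names and the example numbers are hypothetical; the conditional means and variances would come from the conditioning step just described):

import numpy as np

def draw_from_mixture(omega, cond_params, rng):
    """Draw y_s from mixture (20.20): a uniform variate selects the
    component k, and the draw is then made from that component's
    normal conditional density."""
    U = rng.uniform()
    k = int(np.searchsorted(np.cumsum(omega), U))  # first k with cumulative sum > U
    mu_k, var_k = cond_params[k]                   # (mean, variance) per component
    return rng.normal(mu_k, np.sqrt(var_k))

rng = np.random.default_rng(2000)
y_s = draw_from_mixture([0.3, 0.7], [(1.0, 4.0), (2.5, 9.0)], rng)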

All of these steps have been combined in a SAS macro, which is available
from the web site.

20.6 Analysis of the Vorozole Study

In order to concisely illustrate the methodology described in this chapter,


we will apply it to the Vorozole study, restricted to those subjects with one,
two, and three follow-up measurements, respectively. Thus, 190 subjects are
included in the analysis, with subsample sizes 35, 86, and 69, respectively.
The corresponding pattern probabilities are

$$\hat{\pi} = (0.184, 0.453, 0.363)', \qquad (20.43)$$

with asymptotic covariance matrix


$$\widehat{\operatorname{var}}(\hat{\pi}) = \begin{pmatrix} 0.000791 & -0.000439 & -0.000352 \\ -0.000439 & 0.001304 & -0.000865 \\ -0.000352 & -0.000865 & 0.001217 \end{pmatrix}. \qquad (20.44)$$

These figures, apart from giving a feel for the relative importance of the var-
ious patterns, will be needed to calculate marginal effects (such as marginal
treatment effect) from pattern-mixture model parameters, as was done in
Sections 18.3 and 18.4.
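The quantities (20.43) and (20.44) follow directly from the multinomial distribution of the pattern counts, as the following short Python check illustrates:

import numpy as np

n = np.array([35, 86, 69])        # pattern-specific sample sizes
N = n.sum()                       # 190 subjects in total
pi_hat = n / N                    # (0.184, 0.453, 0.363), cf. (20.43)
V_pi = (np.diag(pi_hat) - np.outer(pi_hat, pi_hat)) / N   # cf. (20.44)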

We will apply each of the three strategies, presented in Section 20.5.1, to


these data. First, within each of the strategies, starting models will be
fitted (Section 20.6.1). Second, it will be illustrated how hypothesis testing
can be performed, given the pattern-mixture parameter estimates and their
estimated covariance matrix (Section 20.6.2). Third, model simplification
will be discussed and applied (Section 20.6.3).

20.6.1 Fitting a Model

Strategies 2 and 1

In order to apply the identifying restriction Strategy 1, one needs to fit


a model to the observed data first. We will opt for a simple model, with

parameters specific to each pattern. Using extrapolation, such a model can


be used for the second strategy as well.

We include time and time2 effects, as well as their interactions with treat-
ment. Further, time by baseline value interaction is included. Recall from
Section 18.4 that all effects interact with time, in order to force profiles to
pass through the origin, since we are studying change versus baseline. An
unstructured 3 × 3 covariance matrix is assumed for each pattern.

Parameter estimates are presented in Table 20.1, in the “initial” column.


Of course, not all parameters are estimable. This holds for the variance
components, where in pattern 1 and 2 only the upper 1 × 1 block and the
upper 2 × 2 block are identified, respectively. In the first pattern, the effects
in time2 are unidentified. The linear effects are identified by virtue of the
absence of an intercept term.

Let us present this model graphically. Since there is one binary (treatment
arm) and one continuous covariate (baseline level of FLIC score), and there
are three patterns, we can represent the models using 3 × 2 surface plots,
as in Figure 20.1. Similar plots for the other strategies are displayed in
Figures 20.3, 20.5, 20.7, and 20.9. More insight on the effect of extrapola-
tion can be obtained by presenting time plots for selected values of baseline
value, the only continuous covariate. Precisely, we chose the minimum, av-
erage, and maximum values (Figure 20.2). Bold type is used for the range
over which data are obtained within a particular pattern, and extrapola-
tion is indicated using thinner type. Note that the extrapolation can have
surprising effects, even with these relatively simple models. Thus, although
this form of extrapolation is simple, its plausibility can be called into ques-
tion. Note that this is in line with the experience gained from the toenail
data analysis in Section 18.3 (Figure 18.2).

This initial model provides a basis, and its graphical representation extra
motivation, to consider identifying-restrictions models. Using the method-
ology detailed in Section 20.5, a GAUSS macro and a SAS macro (available
from the web) were written to conduct the multiple imputation, fitting
of imputed data sets, and combination of the results into a single infer-
ence. Results are presented in Table 20.1, for each of the three types of
restrictions (CCMV, NCMV, ACMV). For patterns 1 and 2, there is some
variability in the parameter estimates across the three strategies, although
this is often consistent with random variation (see the standard errors).
Since the data in pattern 3 are complete, there is, of course, no difference
between the initial model parameters and those obtained with each of the
identifying-restrictions techniques. Again, a better impression can be ob-
tained from a graphical representation. In all of the two-dimensional plots,
the same mean response scale as in Figure 20.2 was retained, illustrat-
ing that the identifying-restrictions strategies extrapolate much closer to the observed data mean responses.

TABLE 20.1. Vorozole Study. Multiple imputation estimates and standard errors
for CCMV, NCMV, and ACMV restrictions.
Effect Initial CCMV NCMV ACMV

Pattern 1:
Time 3.40(13.94) 13.21(15.91) 7.56(16.45) 4.43(18.78)
Time∗base −0.11(0.13) −0.16(0.16) −0.14(0.16) −0.11(0.17)
Time∗treat 0.33(3.91) −2.09(2.19) −1.20(1.93) −0.41(2.52)
Time2 −0.84(4.21) −2.12(4.24) −0.70(4.22)
Time2 ∗base 0.01(0.04) 0.03(0.04) 0.02(0.04)
σ11 131.09(31.34) 151.91(42.34) 134.54(32.85) 137.33(34.18)
σ12 59.84(40.46) 119.76(40.38) 97.86(38.65)
σ22 201.54(65.38) 257.07(86.05) 201.87(80.02)
σ13 55.12(58.03) 49.88(44.16) 61.87(43.22)
σ23 84.99(48.54) 99.97(57.47) 110.42(87.95)
σ33 245.06(75.56) 241.99(79.79) 286.16(117.90)

Pattern 2:
Time 53.85(14.12) 29.78(10.43) 33.74(11.11) 28.69(11.37)
Time∗base −0.46(0.12) −0.29(0.09) −0.33(0.10) −0.29(0.10)
Time∗treat −0.95(1.86) −1.68(1.21) −1.56(2.47) −2.12(1.36)
Time2 −18.91(6.36) −4.45(2.87) −7.00(3.80) −4.22(4.20)
Time2 ∗base 0.15(0.05) 0.04(0.02) 0.07(0.03) 0.05(0.04)
σ11 170.77(26.14) 175.59(27.53) 176.49(27.65) 177.86(28.19)
σ12 151.84(29.19) 147.14(29.39) 149.05(29.77) 146.98(29.63)
σ22 292.32(44.61) 297.38(46.04) 299.40(47.22) 297.39(46.04)
σ13 57.22(37.96) 89.10(34.07) 99.18(35.07)
σ23 71.58(36.73) 107.62(47.59) 166.64(66.45)
σ33 212.68(101.31) 264.57(76.73) 300.78(77.97)

Pattern 3:
Time 29.91(9.08) 29.91(9.08) 29.91(9.08) 29.91(9.08)
Time∗base −0.26(0.08) −0.26(0.08) −0.26(0.08) −0.26(0.08)
Time∗treat 0.82(0.95) 0.82(0.95) 0.82(0.95) 0.82(0.95)
Time2 −6.42(2.23) −6.42(2.23) −6.42(2.23) −6.42(2.23)
Time2 ∗base 0.05(0.02) 0.05(0.02) 0.05(0.02) 0.05(0.02)
σ11 206.73(35.86) 206.73(35.86) 206.73(35.86) 206.73(35.86)
σ12 96.97(26.57) 96.97(26.57) 96.97(26.57) 96.97(26.57)
σ22 174.12(31.10) 174.12(31.10) 174.12(31.10) 174.12(31.10)
σ13 87.38(30.66) 87.38(30.66) 87.38(30.66) 87.38(30.66)
σ23 91.66(28.86) 91.66(28.86) 91.66(28.86) 91.66(28.86)
σ33 262.16(44.70) 262.16(44.70) 262.16(44.70) 262.16(44.70)

There are some differences among the
identifying-restrictions methods. Roughly speaking, CCMV extrapolates
rather toward a rise whereas NCMV seems to predict more of a decline,
at least for baseline value 53. Further, ACMV predominantly indicates a
steady state. For the other baseline levels, a status quo or a mild increase is predicted.

FIGURE 20.1. Vorozole Study. Extrapolation based on initial model fitted to ob-
served data (Strategy 2). Per pattern and per treatment group, the mean response
surface is shown as a function of time (month) and baseline value.

This conclusion needs to be considered carefully. Since these


patients drop out mainly because they relapse or die, it seems unlikely to
expect a rise in quality of life. Hence, it is very possible that the dropout
mechanism is not CCMV, since this strategy always refers to the “best”
group, in the sense that it groups patients who stay longer in the study
and hence have, on average, a better prognosis. ACMV, which compro-
mises between all strategies, may be more realistic, but NCMV may be
even better since information is borrowed from the nearest pattern, which
is then based on the nearest patients in terms of dropout time and perhaps
prognosis and quality of life evolution. However, recall that the identifica-
tion is done sequentially, and hence even under NCMV, the first pattern is
identified borrowing from both the second and the third patterns.

Nevertheless, the NCMV prediction looks more plausible since the worst
baseline value shows declining profiles, whereas the best one leaves room for improvement.

FIGURE 20.2. Vorozole Study. Extrapolation based on initial model fitted to ob-
served data (Strategy 2). For three levels of baseline value (minimum, average,
maximum), plots of mean profiles over time are presented. The bold portion of
the curves runs from baseline until the last obtained measurement, and the ex-
trapolated piece is shown in thin type. The dashed line refers to megestrol acetate;
the solid line is the Vorozole arm.

Should one want to explore the effect of assumptions beyond the range of (20.20), one can allow ωs to include components outside of the unit interval. In that situation, one has to ensure that the resulting density is still non-negative over its entire support. Finally, completely different restrictions can be envisaged as well.

SAS Code for Strategies 1 and 2

We first present an example of a Strategy 2 model, which is also the starting


point for the identifying restriction Strategy 1.

proc mixed data = vor01 method = ml
           noclprint asycov info covtest;
title ’Strategy 2 / Initial Model Strategy 1’;
class treat timedisc id;
by pattern;
model y = time base*time treat*time
          time*time base*time*time
          / noint solution covb;
repeated timedisc / subject = id type = un;
run;

FIGURE 20.3. Vorozole Study. Complete case missing value restrictions analysis. Per pattern and per treatment group, the mean response surface is shown as a function of time (month) and baseline value.

Clearly, the essential bit is to include the BY statement in order to achieve a


pattern-specific analysis. To apply the delta method (Agresti 1990) for the
estimation and testing of marginal effects, it is important to generate the
asymptotic covariance matrix of the parameters. This is done by including
the ‘covb’ option in the MODEL statement for the fixed-effects parameters
and the ‘asycov’ option in the PROC MIXED statement for the covariance
parameters.

FIGURE 20.4. Vorozole Study. Complete case missing value restrictions analysis.
For three levels of baseline value (minimum, average, maximum), plots of mean
profiles over time are presented. The bold portion of the curves runs from baseline
until the last obtained measurement, whereas the extrapolated piece is shown in
thin line type. The dashed line refers to megestrol acetate, the solid line is the
Vorozole arm.

It is useful to note that an alternative parameterization is possible as well:

proc mixed data = vor01 method = ml
           noclprint asycov info covtest;
title ’Strategy 2 / Initial Model Strategy 1’;
title2 ’Alternative parameterization’;
class treat timedisc id pattern;
model y = time*pattern base*time*pattern
treat*time*pattern time*time*pattern
base*time*time*pattern
/ noint solution covb;
repeated timedisc / subject = id type = un
group = pattern;
run;

Exactly the same results are obtained as before, by ensuring that every
effect interacts with the class variable pattern, and ‘group=pattern’ is included in the REPEATED statement. The first parameterization is useful to clearly separate the model elements for each of the patterns, whereas the second one is advantageous when further calculations are needed across patterns, such as the assessment of all treatment effect parameters simultaneously. See also Section 20.6.2.

FIGURE 20.5. Vorozole Study. Neighboring case missing value restrictions analysis. Per pattern and per treatment group, the mean response surface is shown as a function of time (month) and baseline value.

Based on the model output, multiple imputation can be conducted. Details


on the macro used to do this are suppressed. After multiple imputation
has been conducted, a single meta-data set is obtained, which contains M
completed data sets. Although it is possible to create M different copies, it
is more convenient to paste them together, so that a single analysis program
suffices to analyze all of them, which can be done as follows:
proc sort data = m.vor02nc;
   by imput pattern;
run;

proc mixed data = m.vor02nc method = ml covtest
           noclprint asycov info;
title ’NCMV’;
class id imput pattern timedisc;
by imput pattern;
model y = time base*time treat*time time*time
          base*time*time / noint solution covb;
repeated timedisc / subject = id type = un;
make ’solutionf’ out = m.fixednc;
make ’CovParms’ out = m.covparnc;
make ’COVB’ out = m.covbnc;
make ’asycov’ out = m.asycovnc;
run;

FIGURE 20.6. Vorozole Study. Neighboring case missing value restrictions analysis. For three levels of baseline value (minimum, average, maximum), plots of mean profiles over time are presented. The bold portion of the curves runs from baseline until the last obtained measurement, and the extrapolated piece is shown in thin type. The dashed line refers to megestrol acetate; the solid line is the Vorozole arm.

FIGURE 20.7. Vorozole Study. Available case missing value restrictions analysis.
Per pattern and per treatment group, the mean response surface is shown as a
function of time (month) and baseline value.

In order to conduct parameter estimation (Section 20.3.1) and hypothesis


testing (Section 20.3.2), parameter estimates and estimated covariance ma-
trices for the fixed effects and variance components need to be retained.
This is done by means of four MAKE statements. The actual combination
of these into a single inference is done in a separate macro.

Strategy 3

In this strategy, pattern is included as a covariate, as was done in Sec-


tion 18.4. An initial model is considered with the following effects: time,
the interaction between time and treatment, baseline value, pattern, treat-
ment∗baseline, treatment∗pattern, and baseline∗pattern. Further, time2 is
included, as well as its interaction with baseline, treatment, and pattern.

FIGURE 20.8. Vorozole Study. Available case missing value restrictions analysis.
For three levels of baseline value (minimum, average, maximum), plots of mean
profiles over time are presented. The bold portion of the curves runs from baseline
until the last obtained measurement, and the extrapolated piece is shown in thin
type. The dashed line refers to megestrol acetate; the solid line is the Vorozole
arm.

No higher order interactions are included, and the unstructured covariance


structure is common to all three patterns. This implies that the current
model is not equivalent to a Strategy 1 model, where all parameters are
pattern-specific. In order to achieve this goal, every effect would have to be
made pattern-dependent.

The estimated model parameters are presented in Table 20.2. Graphical


representations are given in Figures 20.9 and 20.10. From the latter, we
can make two distinct observations. First, the behavior of the model over
the range of the observed data is very similar to the behavior seen in the
analysis of these data in Section 18.4. Precisely, early dropouts decline
immediately, whereas those who stay longer in the study first show a rise
and then decline thereafter. However, this is less pronounced for higher
baseline values. Second, the extrapolation based on the fitted model is very
unrealistic, in the sense that for the early dropout sharp rises are predicted,
which is extremely implausible.

TABLE 20.2. Vorozole Study. Parameter estimates and standard errors for Strat-
egy 3.

Effect Pattern Estimate (s.e.)


Time 1 7.29(15.69)
Time 2 37.05(7.67)
Time 3 39.40(9.97)
Time∗treat 1 5.25(6.41)
Time∗treat 2 3.48(5.46)
Time∗treat 3 3.44(6.04)
Time∗base 1 −0.21(0.15)
Time∗base 2 −0.34(0.06)
Time∗base 3 −0.36(0.08)
Time∗treat∗base −0.06(0.04)
Time2 1
Time2 2 −9.18(2.47)
Time2 3 −7.70(2.29)
Time2 ∗treat 1.10(0.74)
Time2 ∗base 0.07(0.02)
σ11 173.63(18.01)
σ12 117.88(17.80)
σ22 233.86(26.61)
σ13 89.59(24.56)
σ23 116.12(34.27)
σ33 273.98(48.15)

These findings suggest, again, that a more careful reflection on the extrap-
olation method is required. This is very well possible in a pattern-mixture
context, but then the first strategy, rather than the second or third strat-
egy, has to be used, either as model of choice or to supplement insight
gained from another model.

SAS Code for Strategy 3

The following code can be used:

proc mixed data = m.vor01 method = ml noclprint asycov
           covtest info;
title ’Strategy 3’;
class treat timedisc id pattern;
model y = time
time*base
           time*treat
           time*treat*base
           time*pattern
           time*pattern*base
           time*pattern*treat
           time*time
           time*time*base
           time*time*treat
           time*time*pattern
           / noint solution covb;
repeated timedisc / subject = id type = un;
run;

FIGURE 20.9. Vorozole Study. Models with pattern used as a covariate (Strategy 3). Per pattern and per treatment group, the mean response surface is shown as a function of time (month) and baseline value.

FIGURE 20.10. Vorozole Study. Models with pattern used as a covariate (Strategy
3). For three levels of baseline value (minimum, average, maximum), plots of
mean profiles over time are presented. The bold portion of the curves runs from
baseline until the last obtained measurement, and the extrapolated piece is shown
in thin type. The dashed line refers to megestrol acetate; the solid line is the
Vorozole arm.

The above program uses a hierarchical parameterization, implying that all


interactions represent contrasts versus the main effects. For example, treat-
ment effect is represented as time∗treat and in addition time∗treat∗pattern.
An equivalent but more parsimonious program, which includes exactly one
treatment effect parameter for each dropout group, is

proc mixed data = m.vor01 method = ml noclprint asycov
           covtest info;
title ’Strategy 3’;
class timedisc id pattern;
model y = time(pattern)
time*treat(pattern)
time*base(pattern)
time*treat*base
time*time(pattern)
time*time*base
time*time*treat

           / noint solution covb;
repeated timedisc / subject = id type = un;
run;

This program is also more parsimonious in the sense that it treats treatment
as a continuous variable which, when there are only two modalities, comes
down to exactly the same model as when it is treated as a class variable,
but a number of structural zeros are removed from the parameter vector.

20.6.2 Hypothesis Testing

For ease of exposition, let us assume we are interested in a single effect,


treatment effect, say. In the particular case of the Vorozole data, this trans-
lates into the time∗treatment interaction parameter. For simplicity, we will
generically refer to the parameter of interest as treatment effect. For ease of
exposition, we ignore the time2 ∗treatment interaction. It is a simple exten-
sion to include both into the test. In the simplest case of a single parameter
for the effect of interest, the corresponding selection model would contain
exactly this single treatment effect parameter, turning the hypothesis test-
ing task into a very straightforward one. If there were several treatment
effect parameters, such as in a three-armed trial or in an analysis where
interactions between treatment and other effects are included, standard
hypothesis testing theory, such as in Section 6.2, could be applied.

Some pattern-mixture models will have a treatment effect parameter spe-


cific to each pattern. This is the case for all five models in Tables 20.1 and
20.2. Let us note in passing that this does not need to be the case. For ex-
ample, in the final Strategy 3 analysis in Section 20.6.3, treatment effect is
reduced to a single parameter. In such cases, the assessment of treatment ef-
fect is no more difficult than in a corresponding selection model. Therefore,
this section will focus on the situation where there are pattern-dependent
treatment effects.

It is useful to point out a strong analogy with post hoc stratification,


where pattern plays the role of a stratifying variable. A selection model
corresponds to a pooled analysis, where data from all patterns (strata) are
pooled, without correction for the “confounding effect” stemming from het-
erogeneity across dropout patterns. A pattern-mixture model on the other
hand does correct for pattern and hence, in a sense, for the confounding ef-
fect arising from pattern. If treatment effect does not interact with pattern,
such as in the Strategy 3 analysis in Section 20.6.3, then a simple, so-called
corrected, treatment effect estimate is obtained. Finally, if treatment effect
interacts with pattern, such as in all five models above (although it is not

significant in this case), there is heterogeneity of treatment effect across


patterns (cf. heterogeneity of the relative risks in epidemiological studies).

In the latter case, two distinct routes are possible. The more “epidemi-
ologic” viewpoint is to direct inferences toward the vector of treatment
effects. For example, if treatment effects are heterogeneous across patterns,
then it may be deemed better to avoid combining such effects into a single
measure. In our case, this implies, for example, testing for the treatment
by time interaction to be zero in all three patterns simultaneously. Alter-
natively, one can calculate the same quantity as would be obtained in the
corresponding selection model. Then, the marginal treatment effect is cal-
culated, based on the pattern-specific treatment effects and the weighting
probabilities, perhaps irrespective of whether the treatment effects are ho-
mogeneous across patterns or not. This was done in (18.9) and (18.11) for
the toenail data, and in Section 18.4. See also Section 24.4.2 (Eq. (20.45)).

Precisely, let β̂ℓt represent the treatment effect parameter estimates, ℓ = 1, . . . , g (assuming there are g + 1 groups), in pattern t = 1, . . . , n, and let πt be the proportion of patients in pattern t. Then, the estimates of the marginal treatment effects βℓ are

$$\hat{\beta}_\ell = \sum_{t=1}^{n} \hat{\beta}_{\ell t}\, \hat{\pi}_t, \qquad \ell = 1,\ldots,g. \qquad (20.45)$$

The variance is obtained using the delta method. Precisely, it assumes the form

$$\operatorname{var}(\hat{\beta}_1,\ldots,\hat{\beta}_g) = A V A', \qquad (20.46)$$

where

$$V = \begin{pmatrix} \operatorname{var}(\hat{\beta}_{\ell t}) & 0 \\ 0 & \operatorname{var}(\hat{\pi}_t) \end{pmatrix} \qquad (20.47)$$

and

$$A = \frac{\partial(\beta_1,\ldots,\beta_g)}{\partial(\beta_{11},\ldots,\beta_{ng},\pi_1,\ldots,\pi_n)}. \qquad (20.48)$$

The estimate of the variance-covariance matrix of the β̂ℓt is obtained from statistical software (e.g., the ‘covb’ option in the MODEL statement of the SAS procedure MIXED). The multinomial quantities are easy to obtain from the pattern-specific sample sizes. In the case of the Vorozole data, these quantities are presented in (20.43) and (20.44). A Wald test statistic for the null hypothesis H0 : β1 = . . . = βg = 0 is then given by

$$\hat{\beta}_0'\,(A V A')^{-1}\,\hat{\beta}_0, \qquad (20.49)$$



where β̂0 = (β̂1, . . . , β̂g)′.
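For a two-armed trial (g = 1), the delta method reduces to a few lines. The following Python sketch (an illustration, not the book's macro) computes the marginal effect (20.45), its delta-method standard error from (20.46)–(20.48), and the Wald p-value (20.49):

import numpy as np
from scipy.stats import chi2

def marginal_effect(beta, V_beta, pi, V_pi):
    """Marginal treatment effect for g = 1: estimate (20.45), variance
    A V A' with A = (pi', beta'), and Wald test (20.49)."""
    beta, pi = np.asarray(beta), np.asarray(pi)
    b0 = beta @ pi
    var0 = pi @ V_beta @ pi + beta @ V_pi @ beta
    p = chi2.sf(b0 ** 2 / var0, df=1)
    return b0, np.sqrt(var0), p

Applied to the Strategy 2 quantities below, with π̂ and var(π̂) taken from (20.43)–(20.44), this reproduces β̂0 = −0.07 with standard error 1.16.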

We will now apply both testing approaches to the models presented in Ta-
bles 20.1 and 20.2. All three pattern-mixture strategies will be considered.
Since the identifying restrictions strategies are slightly more complicated
than the others, we will consider the other strategies first.

Strategy 2

Recall that the parameters are presented in Table 20.1 as the initial model.
The treatment effect vector is β̂ = (0.33, −0.95, 0.82)′ with, since the patterns are analyzed separately, diagonal covariance matrix V = diag(15.28, 3.44, 0.90).

These quantities are obtained as either the square of the standard errors
reported in the ‘Solution for Fixed Effects’ panel in the output of the SAS
procedure MIXED or, directly, as the appropriate diagonal elements of the
‘Covariance Matrix for Fixed Effects’ panel, produced by means of the
‘covb’ option in the MODEL statement. This leads to the test statistic
β̂′ V⁻¹ β̂ = 1.02 on 3 degrees of freedom, producing p = 0.796.
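This calculation is easily verified, as in the following Python check using the figures just quoted:

import numpy as np
from scipy.stats import chi2

beta = np.array([0.33, -0.95, 0.82])   # pattern-specific treatment effects
V = np.diag([15.28, 3.44, 0.90])       # their (diagonal) covariance matrix
wald = beta @ np.linalg.inv(V) @ beta  # 1.02
p = chi2.sf(wald, df=3)                # 0.796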

In order to calculate the marginal treatment effect, we apply (20.46)–


(20.49). The single (since there are only two groups) marginal effect is
estimated as β̂0 = −0.07 (s.e. 1.16). The corresponding observed asymp-
totic p-value is p = 0.95.

Both approaches agree on the nonsignificance of the treatment effect.

Strategy 3

The parameters are presented in Table 20.2. The treatment effect vector is
β̂ = (5.25, 3.48, 3.44)′ with nondiagonal covariance matrix

$$V = \begin{pmatrix} 41.12 & 23.59 & 25.48 \\ 23.59 & 29.49 & 30.17 \\ 25.48 & 30.17 & 36.43 \end{pmatrix}.$$

These quantities are obtained as the appropriate block of the ‘Covariance


Matrix for Fixed Effects’ panel. The information provided by the square of
the standard errors is insufficient since they fail to report the covariances
between the parameter estimates. In this case, the correlation between them
is quite substantial. The reason is that some parameters, in particular, the

other treatment effects (three-way interaction with baseline and time, inter-
action with time2 ), are common to all three patterns, inducing dependence
across patterns.

This leads to the test statistic β̂′ V⁻¹ β̂ = 0.70 on 3 degrees of freedom,


producing p = 0.874. The same p-value up to three digits is reported in the
‘Tests of Fixed Effects’ panel, where the same test is conducted using an
F -statistic.

Calculating the marginalized treatment effect, we obtain β̂0 = 3.79 (s.e.


5.44). The corresponding asymptotic p-value is p = 0.49. The different
numerical value of the treatment effects, as compared to those obtained
with the other strategies, is entirely due to the presence of a quadratic
treatment effect, which, for ease of exposition, is left out of the picture in
testing here. If deemed necessary, and often it will, it is straightforward to
add this parameter to the contrast(s) being considered.

Strategy 1

The CCMV case will be discussed in detail. The two other restriction types
are entirely similar.

There are three treatment effects, one for each pattern. Hence, multiple
imputation produces five vectors of three treatment effects which are aver-
aged to produce a single treatment effect vector. In addition, the within,
between, and total covariance matrices are calculated:
$$\hat{\beta}_{CC} = (-2.09, -1.68, 0.82)', \qquad (20.50)$$

$$W_{CC} = \begin{pmatrix} 1.67 & 0.00 & 0.00 \\ 0.00 & 0.59 & 0.00 \\ 0.00 & 0.00 & 0.90 \end{pmatrix}, \qquad (20.51)$$

$$B_{CC} = \begin{pmatrix} 2.62 & 0.85 & 0.00 \\ 0.85 & 0.72 & 0.00 \\ 0.00 & 0.00 & 0.00 \end{pmatrix}, \qquad (20.52)$$

and

$$T_{CC} = \begin{pmatrix} 4.80 & 1.02 & 0.00 \\ 1.02 & 1.46 & 0.00 \\ 0.00 & 0.00 & 0.90 \end{pmatrix}. \qquad (20.53)$$
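These matrices obey the combination rules of Section 20.3.1: with M = 5 imputations, TCC = WCC + (1 + 1/M)BCC. The following Python sketch (a hypothetical helper, not the authors' macro) pools M completed-data results in exactly this way:

import numpy as np

def mi_pool(estimates, covariances):
    """Pool M completed-data results: average estimate, within (W),
    between (B), and total (T) covariance matrices."""
    est = np.asarray(estimates)          # shape (M, k)
    M = est.shape[0]
    theta = est.mean(axis=0)
    W = np.mean(covariances, axis=0)     # within-imputation covariance
    dev = est - theta
    B = dev.T @ dev / (M - 1)            # between-imputation covariance
    T = W + (1 + 1 / M) * B              # total covariance
    return theta, W, B, T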

In the stratified case, we want to test the hypothesis H0 : β = 0. Using


(20.50)–(20.52), we can apply the multiple imputation results described in
Section 20.3.2.

TABLE 20.3. Vorozole Study. Tests of treatment effect for CCMV, NCMV, and
ACMV restrictions.
Parameter CCMV NCMV ACMV

Stratified analysis:
k 3 3 3
τ 12 12 12
denominator d.f. w 28.41 17.28 28.06
r 1.12 2.89 1.14
F -statistic 1.284 0.427 0.946
p-value 0.299 0.736 0.432

Marginal Analysis:
Marginal effect (s.e.) −0.85(0.77) −0.63(1.22) −0.74(0.85)
k 1 1 1
τ 4 4 4
denominator d.f. w 4 4 4
r 1.49 4.57 1.53
F -statistic 0.948 0.216 0.579
p-value 0.385 0.667 0.489

Note that, even though the analysis is done per pattern, the between and
total matrices have nonzero off-diagonal elements. This is because imputa-
tion is done based on information from other patterns, hence introducing
interpattern dependence. Results are presented in Table 20.3. All results
are nonsignificant, in line with earlier evidence from Strategies 2 and 3,
although the p-values for CCMV and ACMV are somewhat smaller.

For the marginal parameter, the situation is more complicated here than
with Strategies 2 and 3. Indeed, the theory of Section 20.3.2 assumes in-
ference is geared toward the original vector, or linear contrasts thereof.
Formula (20.45) displays a nonlinear transformation of the parameter vec-
tor and therefore needs further development. First, consider π to be part
of the parameter vector. Since there is no missingness involved in this part,
it contributes to the within matrix, but not to the between matrix. Then,
using (20.46), the approximate within matrix for the marginal treatment
effect is
$$W_0 = \pi' W \pi + \beta' \operatorname{var}(\pi)\, \beta,$$

with, for the between matrix, simply

$$B_0 = \pi' B \pi.$$
The latter formula consists of only one term, since there is no between
variance for π.

The results are presented in the second panel of Table 20.3. All three p-
values are, again, nonsignificant, in agreement with Strategies 2 and 3. Of
course, all five agree on the nonsignificance of the treatment effect. The
reason for the differences is to be found in the way the treatment effect is
extrapolated beyond the period of observation. Indeed, the highest p-value
is obtained for Strategy 2, and from Figure 20.2, we learn that virtually no
separation between both treatment arms is projected. On the other hand,
wider separations are seen in Figure 20.10.

20.6.3 Model Reduction

Model building guidelines for the standard linear mixed-effects model can
be found in Chapter 9. These guidelines can be used without any problem
in a selection model context, but the pattern-mixture case is more compli-
cated. Of course, the same general principles can be applied, taking into
account the intertwining between the mean or fixed-effects structure, on
the one hand, and the components of variability on the other hand, as
graphically represented in Figure 9.1.

In addition to these principles, one has to reflect on the special status of


pattern in a pattern-mixture model. Broadly, we can distinguish between
two cases, reflecting Strategy 2 (a per-pattern analysis) and Strategy 3 (use
pattern as a covariate). In fact, the identifying-restrictions strategy leaves
the method of analysis to be used after multiple imputation unspecified,
as mentioned in item 6 of the strategy outline (Section 20.5.1). In our
analysis, we have chosen to conduct a per-pattern analysis (Table 20.1) as
in Strategy 1, but it is possible to conduct a global analysis, using pattern
as a covariate, or even to use selection modeling. The only requirement is
that the proper nature of the imputation be preserved (Rubin 1987). It
is therefore sufficient to discuss and illustrate model reduction using the
second and third strategies only.

Strategy 3

Model reduction in a context where pattern is used as a covariate is clearly


of the same level of complexity as with complete data or as for a selection
model. Let us reduce the model presented in Table 20.2. It is convenient
to use a hierarchical representation of the model as with the second SAS
program (p. 365). The following effects are removed using a hierarchical se-
quence of models, and using F -test statistics: the time by pattern by treat-
ment interaction (p = 0.934), the time by pattern interaction (p = 0.776),
the time by pattern by baseline value interaction (p = 0.707), the time by
baseline by treatment interaction (p = 0.165), and the time2 by treatment interaction (p = 0.093).

TABLE 20.4. Vorozole Study. Strategy 3. Parameter estimates and standard er-
rors of a reduced model.

Effect Pattern Estimate (s.e.)


Time 33.06(6.67)
Time∗treat 0.40(0.84)
Time∗base −0.29(0.06)
Time2 1 −16.71(3.46)
Time2 2 −8.56(1.90)
Time2 3 −7.09(1.78)
Time2 ∗base 0.06(0.01)
σ11 178.02(18.46)
σ12 121.75(18.30)
σ22 238.31(26.98)
σ13 88.75(24.94)
σ23 121.10(34.70)
σ33 274.58(48.32)

Note that one cannot necessarily conclude that


these parameters automatically are jointly nonsignificant (see Section 5.5).
The reduced model is displayed in Table 20.4.

Strategy 2

For Strategy 2, where a per-pattern analysis is conducted, there are several


model building decisions to be made:

• In the process of simplifying, one can allow that effects be shared


between two or more patterns. For example, a baseline effect, common
to all patterns, can be included. By doing so, this strategy effectively
reduces to Strategy 3 and there is no need to discuss this any further
here.
• When simplifying the model, effects are either absent or common to
all patterns. Again, this approach is close to Strategy 3 and can be
conducted within that framework without any problem if one starts
with a model where all effects, including the covariance parameters,
depend on pattern. For this reason, we will not pursue it further.
• Finally, model reduction is done entirely separately in each of the pat-
terns. This may yield different levels of simplification for each pattern
and certainly a pattern-specific set of covariates found to influence the
response profile. This strategy will be illustrated.

In order to enable treatment effect assessment, the interaction between time


and treatment will not be removed from the models. In pattern 1, one
simplification is possible: the interaction between time and baseline is
not significant (p = 0.415). Thus, the only effects that
remain in the model are time and the time by treatment interaction. For
patterns 2 and 3, there are no non-significant effects to be removed. In
conclusion, baseline FLIC score influences the follow-up scores in patterns
2 and 3, but not in pattern 1.
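As a sketch of how such a per-pattern reduction can be organized (again
with assumed data set and variable names, rather than the program actually
used), each pattern can simply receive its own fit via a BY statement:

   proc sort data=vorozole; by pattern; run;
   proc mixed data=vorozole method=ml;
      by pattern;         /* a separate model for each dropout pattern */
      class treat idnum;
      model flic = time time*treat time*base / noint solution;
      repeated / subject=idnum type=un;
   run;

For pattern 1, the time*base term would then be dropped in a second run,
in line with the p = 0.415 reported above.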

20.7 Thoughts

In this chapter, we have illustrated three distinct strategies to fit pattern-


mixture models. In this way, we have brought together several existing
practices. Little (1993, 1994a) has proposed identifying restrictions, which
we formalized here using the connection with MAR (Section 20.2) and
multiple imputation (Section 20.3). Strategies 2 and 3 refer to fitting a
model per pattern, as in the toenail data (Section 18.3), and using pattern
as a covariate, as in the Vorozole study (Section 18.4).

By contrasting these strategies on a single set of data, one obtains a range


of conclusions rather than a single one, which provides insight into the sen-
sitivity to the assumptions made. Especially with the identifying restric-
tions, one has to be very explicit about the assumptions and, moreover,
this approach offers the possibility to consider several forms of restrictions.
Special attention should go to the ACMV restrictions, since they are the
MAR counterpart within the pattern-mixture context.

In addition, a comparison between the selection and pattern-mixture mod-


eling approaches is useful to obtain additional insight into the data and/or
to assess sensitivity. This has been done, informally, in Chapters 17 and
18, using the toenail data and the Vorozole study.

Section 24.4 offers a case study on the milk protein contents trial (Diggle
and Kenward 1994). Several sensitivity analysis tools, both informal and
formal, are employed to gain insight into the data and into the missingness
mechanism.

The identifying-restrictions strategy provides further opportunity for sensi-


tivity analysis, beyond what has been presented here. Indeed, since CCMV
and NCMV are extremes for the ωs vector in (20.20), it is very natural to
consider ranges in the allowable space of ωs . Clearly, any ωs which consists
of non-negative elements that sum to 1 is allowable, but the idea of
extrapolation could also be useful, where negative components are allowed,
provided they yield valid conditional densities.

As in the previous chapter, we underscore that the strategies presented


here are but one approach to sensitivity analysis in the pattern-mixture
context. Surely, more will be developed and more work is needed in this
area.

The SAS and GAUSS macros which have been used to carry out the mul-
tiple imputation related tasks are available from the authors’ web pages.
21
How Ignorable Is Missing At Random?

21.1 Introduction

For over two decades, following the pioneering work of Rubin (1976) and
Little (1976), there has been a growing literature on incomplete data, with
a lot of emphasis on longitudinal data. Following the original work of Rubin
and Little, there has evolved a general view that “likelihood methods” that
ignore the missing value mechanism are valid under an MAR process, where
likelihood is interpreted in a frequentist sense. The availability of flexible
standard software for incomplete data, such as PROC MIXED, and the
advantages quoted in Section 17.3 contribute to this point of view. This
statement needs careful qualification however. Kenward and Molenberghs
(1998) provided an exposition of the precise sense in which frequentist
methods of inference are justified under MAR processes.

As discussed in Section 15.8, Rubin (1976) has shown that MAR (and
parameter distinctness) is necessary and sufficient to ensure validity of
direct-likelihood inference when ignoring the process that causes missing
data. Here, direct-likelihood inference is defined as an “inference that re-
sults solely from ratios of the likelihood function for various values of the
parameter,” in agreement with the definition in Edwards (1972). In the
concluding section of the same paper, Rubin remarks:

One might argue, however, that this apparent simplicity of


likelihood and Bayesian inference really buries the important
issues. (. . . ) likelihood inferences are at times surrounded with
references to the sampling distributions of likelihood statistics.
Thus, practically, when there is the possibility of missing data,
some interpretations of Bayesian and likelihood inference face
the same restrictions as sampling distribution inference. The
inescapable conclusion seems to be that when dealing with real
data, the practicing statistician should explicitly consider the
process that causes missing data far more often than he does.

In essence, the problem from a frequentist point of view is that of iden-


tifying and using the appropriate sampling distribution. This is obviously
relevant for determining distributions of test statistics, expected values of
the information matrix, and measures of precision.

Little and Rubin (1987) discuss several aspects of this problem and propose,
using the observed information matrix, to circumvent problems associated
with the determination of the correct expected information matrix. Laird
(1988) makes a similar point in the context of incomplete longitudinal data
analysis.

In a variety of settings, several authors have reexpressed this preference


for the observed information matrix and derived methods to compute it:
Meng and Rubin (1991), the supplemented EM algorithm; Baker (1992),
composite link models; Fitzmaurice, Laird, and Lipsitz (1994), incomplete
longitudinal binary data; and Jennrich and Schluchter (1986). A group of
authors has used the observed information matrix, without reference to the
problems associated with the expected information: Louis (1982), Meilijson
(1989), and Kenward, Molenberghs, and Lesaffre (1994).

However, others, while claiming validity of analysis under MAR mecha-


nisms, have used expected information matrices and other measures of
precision that do not account for the missingness mechanism (Murray and
Findlay 1988, Patel 1991). A number of references is given in Baker (1992).
It is clear that the problem as identified in the initial work of Rubin (1976)
is not fully appreciated in the more recent literature. An exception to this
is Heitjan’s (1994) clear restatement of the problem.

A recent exchange of correspondence (Diggle 1992, Heitjan 1993, Diggle


1993) indicates a genuine interest in these issues and suggests a need for
clarification. We will build on the framework of likelihood inference un-
der an MAR process, sketched in Section 15.5. The difference between the
expected information matrix with and without taking the missing data
mechanism into account is elucidated and the relevance of this for Wald
and score statistics is elaborated upon. Analytic and numerical illustra-

tions of this difference are provided using a bivariate Gaussian setting. A


longitudinal example is used for practical illustration.

21.2 Information and Sampling Distributions

In this section, we will drop the subject subscript i from the notation.
We assume that the joint distribution of the full data (Y , R) is regular in
the sense of Cox and Hinkley (1974, p. 281). We are concerned here with
the sampling distributions of certain statistics under MCAR and MAR
mechanisms. Under an MAR process, the joint distribution of Y o (the
observed components) and R factorizes as in (15.6). In terms of the log-
likelihood function, we have
\[ \ell(\theta, \psi; y^o, r) \;=\; \ell_1(\theta; y^o) + \ell_2(\psi; r, y^o). \tag{21.1} \]
It is assumed that θ and ψ satisfy the separability condition. This partition
of the likelihood has, with important exceptions, been taken for granted to
mean that, under an MAR mechanism, likelihood methods based on ℓ1
alone are valid for inferences about θ even when interpreted in the broad
frequentist sense. We now consider more precisely the sense in which the dif-
ferent elements of the frequentist likelihood methodology can be regarded
as valid in general under the MAR mechanism. It is now well known that
such inferences are valid under an MCAR mechanism (Rubin 1976, Sec-
tion 6).

First, we note that under the MAR mechanism, r is not an ancillary sta-
tistic for θ in the extended sense of Cox and Hinkley (1974, p. 35). (A
statistic S(Y , R) is ancillary for θ if its distribution does not depend upon
θ.) Hence, we are not justified in restricting the sample space from that
associated with the pair (Y , R). In considering the properties of frequentist
procedures below, we therefore define the appropriate sampling distribu-
tions to be that determined by this pair. We call this the unconditional
sampling framework. By working within this framework, we do need to
consider the missing value mechanism. We shall be comparing this with
the sampling distribution that would apply if r were fixed by design [i.e.,
if we repeatedly sampled using the distribution f (y o ; θ)]. If this sampling
distribution were appropriate, this would lead directly to the use of ℓ1 as
a basis for inference. We call this the naive sampling framework.

Little (1976), in a comment on the paper by Rubin (1976), mentions ex-


plicitly the role played by the nonresponse pattern. He argues:

For sampling based inferences, a first crucial question con-


cerns when it is justified to condition on the observed pattern,

that is on the event R = r (. . . ). A natural condition is that


R should be ancillary (. . . ). Otherwise the pattern on its own
carries at least some information about θ, which should in prin-
ciple be used.

Certain elements of the frequentist methodology can be justified immedi-


ately from (21.1). The maximum likelihood estimator obtained from max-
imizing ℓ1 (θ; y o ) alone is identical to that obtained from maximizing the
complete log-likelihood function. Similarly, the maximum likelihood estima-
tor of ψ is functionally independent of θ, and so any maximum likelihood
ratio concerning θ, with common ψ, will involve ℓ1 only. Because these sta-
tistics are identical whether derived from ℓ1 or the complete log-likelihood,
it follows that they have the required properties under the naive sampling
framework. See, for example, Rubin (1976), Little (1976), and Little and
Rubin (1987, Section 5.2).

An important element of likelihood-based frequentist inference is the deriva-


tion of measures of precision of the maximum likelihood estimators from
the information. For this, either the observed information, iO , can be used
where
\[ i_O(\theta_j, \theta_k) \;=\; -\,\frac{\partial^2 \ell(\cdot)}{\partial\theta_j\,\partial\theta_k}, \]
or the expected information, iE , where
\[ i_E(\theta_j, \theta_k) \;=\; E\{ i_O(\theta_j, \theta_k) \}. \tag{21.2} \]

The above argument justifying the use of the maximum likelihood esti-
mators from ℓ1 (θ; y o ) applies equally well to the use of the inverse of the
observed information derived from 1 as an estimate of the asymptotic
variance-covariance matrix of these estimators. This has been pointed out
by Little and Rubin (1987, Section 8.2.2) and Laird (1988, p. 307). In addi-
tion, there are other reasons for preferring the observed information matrix
(Efron and Hinkley 1978).

The use of the expected information matrix is more problematical. The


expectation in (21.2) needs to be taken over the unconditional sampling
distribution (the unconditional information iU ) and, consequently, the use
of the naive sampling framework (producing the naive information iN ) can
lead to inconsistent estimates of precision. In the next section, we give
an example of the bias resulting from the use of the naive framework.
It is possible however, as we show below, to calculate the unconditional
information by taking expectations over the appropriate distribution and so
correct this bias. Although this added complication is generally unnecessary
in practice, given the availability of the observed information, it does allow

a direct examination of the effect of ignoring the missing value mechanism


on the expected information.
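Schematically, the two expected information matrices thus differ only in
the sampling distribution over which the expectation in (21.2) is taken:
\[ i_N(\theta_j, \theta_k) \;=\; E_{Y^o \mid r}\{ i_O(\theta_j, \theta_k) \},
   \qquad
   i_U(\theta_j, \theta_k) \;=\; E_{(Y,R)}\{ i_O(\theta_j, \theta_k) \}, \]
the former treating the missingness pattern r as fixed by design, the
latter averaging over it.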

As part of the process of frequentist inference, we also need to consider


the sampling distribution of the test statistics. Provided that use is made
of the likelihood ratio, or Wald and score statistics based on the observed
information, then reference to a null asymptotic χ2 -distribution will be ap-
propriate because this is derived from the implicit use of the unconditional
sampling framework. Only in those situations in which the sampling dis-
tribution is explicitly constructed must care be taken to ensure that the
unconditional framework is used; that is, account must be taken of the
missing data mechanism.

21.3 Illustration

For an incomplete multivariate normal sample, Little and Rubin (1987)


state:

If the data are MCAR, the expected information matrix of


θ = (µ, Σ) represented as a vector is block diagonal. (. . . ) The
observed information matrix, which is calculated and inverted
at each iteration of the Newton-Raphson algorithm, is not block
diagonal with respect to µ and Σ, so this simplification does not
occur if standard errors are based on this matrix. On the other
hand, the standard errors based on the observed information
matrix are more conditional and thus valid when the data are
MAR but not MCAR, and hence should be preferable to those
based on [the expected information] in applications.

Suppose now that we have N independent pairs of observations (Yi1 , Yi2 ),
each with a bivariate Gaussian distribution with mean vector µ = (µ1 , µ2 )′
and variance-covariance matrix
\[ \Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix}. \]
It is assumed that m complete pairs are observed, together with only the
first member (Yi1 ) of each of the remaining pairs. This implies that the
dropout process can be represented by a scalar indicator Ri which is 1 if
the second component is observed and 0 otherwise. The log-likelihood can
be expressed as the sum of the log-likelihoods for the complete and
incomplete pairs:
\[ \ell \;=\; \sum_{i=1}^{m} \ln f(y_{i1}, y_{i2} \,|\, \mu_1, \mu_2, \sigma_{11}, \sigma_{12}, \sigma_{22})
   \;+\; \sum_{i=m+1}^{N} \ln f(y_{i1} \,|\, \mu_1, \sigma_{11}), \]

which, in the Gaussian setting, has kernel
\[ \ell \;=\; -\,\frac{N-m}{2}\,\ln\sigma_{11} \;-\; \frac{m}{2}\,\ln|\Sigma|
   \;-\; \frac{1}{2\sigma_{11}} \sum_{i=m+1}^{N} (y_{i1}-\mu_1)^2
   \;-\; \frac{1}{2} \sum_{i=1}^{m}
   \begin{pmatrix} y_{i1}-\mu_1 \\ y_{i2}-\mu_2 \end{pmatrix}'
   \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix}^{-1}
   \begin{pmatrix} y_{i1}-\mu_1 \\ y_{i2}-\mu_2 \end{pmatrix}. \]

Straightforward differentiation produces the elements of the observed in-
formation matrix that relate to µ:
\[ i_O(\mu, \mu) \;=\; (N-m)\begin{pmatrix} \sigma_{11}^{-1} & 0 \\ 0 & 0 \end{pmatrix} \;+\; m\,\Sigma^{-1}, \]
and
\[ i_O(\mu_1, \sigma_{11}) \;=\; \sum_{i=m+1}^{N} \frac{y_{i1}-\mu_1}{\sigma_{11}^{2}}
   \;+\; \sum_{i=1}^{m} e_1' \,\Sigma^{-1} E_{11} \Sigma^{-1}
   \begin{pmatrix} y_{i1}-\mu_1 \\ y_{i2}-\mu_2 \end{pmatrix} \tag{21.3} \]
and, when at least one of the indices j, k, or l is different from 1,
\[ i_O(\mu_j, \sigma_{kl}) \;=\; \sum_{i=1}^{m} e_j' \,\Sigma^{-1} E_{kl} \Sigma^{-1}
   \begin{pmatrix} y_{i1}-\mu_1 \\ y_{i2}-\mu_2 \end{pmatrix} \tag{21.4} \]
for
\[ e_1 = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \qquad
   e_2 = \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \]
and
\[ E_{11} = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}, \qquad
   E_{12} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \qquad
   E_{22} = \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}. \]

For the naive information, we just take expectations of these quantities


over (Yi1 , Yi2 ) ∼ N (µ, Σ) for i = 1, . . . , m and Yi1 ∼ N (µ1 , σ11 ) for i =
m + 1, . . . , N . It follows at once that the cross-terms linking the mean and
variance-covariance parameters vanish, establishing the familiar orthogo-
nality property of these sets of parameters in the Gaussian setting. We
now examine the behavior of the expected information under the actual
sampling process implied by the MAR mechanism.

We need to consider first the conditional expectation of these quantities


given the occurrence of R, the dropout pattern. Because (Y , R) enters

the expression for iU (µ, µ) only through m, the naive and unconditional
information matrices for µ are effectively equivalent. However, we show
now that this is not true for the cross-term elements of the information
matrices. Define αj = E(Yi1 | ri = j) − µ1 , j = 0, 1. For the conditional
expectation of Yi2 in the complete subgroup, we have
\[ \begin{aligned}
E(Y_{i2} \,|\, r_i = 1)
  &= \int \left\{ \int y_{i2}\, f(y_{i2} \,|\, y_{i1})\, dy_{i2} \right\}
       f(y_{i1} \,|\, r_i = 1)\, dy_{i1} \\
  &= \mu_2 - \sigma_{12}\sigma_{11}^{-1}\mu_1
     + \frac{\sigma_{12}}{\sigma_{11}}\,\frac{1}{P(r_i = 1)}
       \int y_{i1}\, f(y_{i1}, r_i = 1)\, dy_{i1} \\
  &= \mu_2 + \sigma_{12}\sigma_{11}^{-1}\{E(Y_{i1} \,|\, r_i = 1) - \mu_1\}
\end{aligned} \]
or
\[ E_{Y|R}(Y_{i2} - \mu_2) = \beta\alpha_1 \]
for β = σ12 /σ11 . Hence,
\[ E_{Y|R}\begin{pmatrix} Y_{i1}-\mu_1 \\ Y_{i2}-\mu_2 \end{pmatrix}
   = \alpha_1 \begin{pmatrix} 1 \\ \beta \end{pmatrix}. \]

Noting that
\[ \Sigma^{-1}\begin{pmatrix} 1 \\ \beta \end{pmatrix}
   = \begin{pmatrix} \sigma_{11}^{-1} \\ 0 \end{pmatrix}
   = \sigma_{11}^{-1} e_1, \]
we then have from (21.3) and (21.4)
\[ E_{Y|R}\{i_O(\mu_j, \sigma_{kl})\} =
   \begin{cases}
   (N-m)\dfrac{\alpha_0}{\sigma_{11}^{2}}
     + m\,\dfrac{\alpha_1}{\sigma_{11}}\, e_1'\,\Sigma^{-1} E_{11}\, e_1,
     & j = k = l = 1, \\[2ex]
   m\,\dfrac{\alpha_1}{\sigma_{11}}\, e_j'\,\Sigma^{-1} E_{kl}\, e_1,
     & \text{otherwise}.
   \end{cases} \]
Finally, taking expectations over R, we get for the cross-terms of the un-
conditional information matrix
\[ i_U(\mu, \sigma_{11}) = \frac{N}{\sigma_{11}}
   \left[ \frac{(1-\pi)\alpha_0}{\sigma_{11}}\begin{pmatrix} 1 \\ 0 \end{pmatrix}
   + \frac{\pi\alpha_1}{\sigma_{11}\sigma_{22} - \sigma_{12}^{2}}
     \begin{pmatrix} \sigma_{22} \\ -\sigma_{12} \end{pmatrix} \right], \tag{21.5} \]
\[ i_U(\mu, \sigma_{12}) = \frac{N\pi\alpha_1}{\sigma_{11}\sigma_{22} - \sigma_{12}^{2}}
   \begin{pmatrix} -\beta \\ 1 \end{pmatrix}, \tag{21.6} \]
\[ i_U(\mu, \sigma_{22}) = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \tag{21.7} \]

for π = P (ri = 1). In contrast to the naive information, these cross-terms


do not all vanish, and the orthogonality of mean and variance-covariance
parameters is lost under the MAR mechanism. One implication of this is
that although the information relating to the linear model parameters alone
is not affected by the move from an MCAR to an MAR mechanism, the
asymptotic variance-covariance matrix is affected due to the induced non-
orthogonality and, therefore, the dropout mechanism cannot be regarded as
ignorable as far as the estimation of precision of the estimators of the linear
model parameters is concerned. It can also be shown that the expected
information for the variance-covariance parameters is not equivalent under
the MCAR and MAR dropout mechanisms, but the expressions are more
involved. Assuming that π is nonzero, it can be seen that the necessary
and sufficient condition for the terms in (21.5) and (21.6) to be equal to
zero is that α0 = α1 = 0, the condition defining, as expected, an MCAR
mechanism.

We now illustrate these findings with a few numerical results. The off-
diagonal unconditional information elements (21.5)–(21.7) are computed
for sample size N = 1000, mean vector (0, 0)′ , and two covariance matrices:
(1) σ11 = σ22 = 1 and correlation ρ = σ12 = 0.5, and (2) σ11 = 2, σ22 = 3,
and ρ = 0.5, leading to σ12 = √6/2. Further, two MAR dropout mechanisms
are considered. They are both of the logistic form
\[ P(R_i = 0 \,|\, y_{i1}) \;=\;
   \frac{\exp(\gamma_0 + \gamma_1 y_{i1})}{1 + \exp(\gamma_0 + \gamma_1 y_{i1})}. \]
We choose γ0 = 0 and (a) γ1 = 1 or (b) γ1 = −∞. The latter mechanism
implies ri = 0 if yi1 ≥ 0 and ri = 1 otherwise. Both dropout mechanisms
yield π = 0.5. In all cases, α1 = −α0 , with α1 in the four possible combina-
tions of covariance and dropout parameters: (1a) 0.4132, (1b) √(2/π), (2a)
0.7263, and (2b) 2/√π. Numerical values for (21.5)–(21.7) are presented in
Table 21.1, as well as the average from the observed information matrices
in a simulation with 500 replicates.

Obviously, these elements are far from zero, as would be found with the
naive estimator. They are of the same order of magnitude as the upper
left block of the information matrix (pertaining to the mean parameters),
which equals
\[ \begin{pmatrix} 1166.67 & -333.33 \\ -333.33 & 666.67 \end{pmatrix}. \]
We performed a limited simulation study to verify the coverage probability
for the Wald tests under the unconditional and a selection of conditional
frameworks. The hypotheses considered are H01 : µ1 = 0, H02 : µ2 = 0,
and H03 : µ1 = µ2 = 0. The simulations have been restricted to the first
covariance matrix used in Table 21.1 and to the second dropout mecha-
nism (γ1 = −∞). Results are reported in Table 21.2. The coverages for the

TABLE 21.1. Bivariate Normal Data. Computed and simulated values for the
off-diagonal block of the unconditional information matrix. Sample size is N =
1000 (500 replications). (The true model has zero mean vector. Two true covari-
ances Σ and two dropout parameters γ1 are considered.)

   Parameters           Uncond. iU (µ, ·)            Simulated iO (µ, ·)
  Σ          γ1      σ11       σ12      σ22      σ11       σ12      σ22
  1    0.5    1    −68.87    137.75    0.00    −69.36    137.95   −0.04
  0.5  1           137.75   −275.49    0.00    137.88   −276.83   −0.04
  2    √6/2   1    −30.26     49.42    0.00    −30.21     49.54    0.04
  √6/2 3            49.42    −80.70    0.00     49.52    −81.31    0.06
  1    0.5   −∞    132.98   −265.96    0.00    135.67   −267.66    0.16
  0.5  1          −265.96    531.92    0.00   −267.73    537.58   −0.02
  2    √6/2  −∞     47.02    −76.78    0.00     49.52    −78.73   −0.02
  √6/2 3           −76.78    125.38    0.00    −78.58    126.91    0.02

unconditional framework are in good agreement with a χ² reference distri-
bution; the first naive framework (500 complete cases) leads to a conserva-
tive procedure, whereas the second and the third lead to extremely liberal
behavior, which is most marked for hypotheses H01 and H03 . This is to be
expected because, by fixing m = 500, the proportion of positive first out-
comes is constrained to be equal to its predicted value. This has the effect
of reducing the variability of µ̂1 . The second and the third frameworks also
suppress the variability, but introduce bias at the same time. The compar-
ative insensitivity of the test for H02 to the sampling framework arises be-
cause µ1 has only an indirect influence, through the correlation between the
outcomes on both occasions. It should be noted that, due to numerical prob-
lems, not all simulations led to 500 successful estimations. On average, 489
convergences were observed, the lowest value being 460 for H02 in the first
naive sampling framework.

21.4 Example

We will now consider a relatively small example with a continuous re-


sponse, analyzed in Crépeau et al. (1985). Fifty-four rats were divided
into five treatment groups corresponding to exposure to increasing doses of

TABLE 21.2. Bivariate Normal Data. True values are as in the third model of
Table 21.1. Coverage probabilities (× 1000) for Wald test statistics. Sample size is
N = 1000 (500 replications). The null hypotheses are H01 : µ1 = 0, H02 : µ2 = 0,
H03 : µ1 = µ2 = 0. For the naive sampling frameworks, m denotes the fixed
number of complete cases.

Hypothesis Uncond. m = 500 m = 450 m = 400


H01 933 996 187 0
H02 953 952 913 830
H03 952 992 338 0

halothane (0%, 0.25%, 0.5%, 1%, and 2%). The groups were of sizes 11, 10,
11, 11, and 11 rats, respectively. Following an induced heart attack in each
rat, the blood pressure was recorded on nine unequally spaced occasions.
A number of rats died during the course of the experiment, including all
rats from group 5 (2% halothane). Following the original authors we omit
this group from the analysis since they contribute no information at all,
leaving 43 rats, of which 23 survived the experiment.

Examination of the data from these four groups does not provide any ev-
idence of an MAR dropout process, although this observation must be
considered in the light of the small sample size. A Gaussian multivariate
linear model with an unconstrained covariance matrix was fitted to the
data. There was very little evidence of a treatment by time interaction
and the following results are based on the use of a model with additive
effects for treatment and time. The Wald statistics for the treatment main
effect on 3 degrees of freedom are equal to 46.95 and 30.82 respectively
using the expected and observed information matrices. Although leading
to the same qualitative conclusions, the figures are notably discrepant. A
first reaction may be to attribute this difference to the incompleteness of
the data. However, the lack of evidence for an MAR process together with
the relatively small sample size points to another cause. The equivalent
analysis of the 24 completers produces Wald statistics of 45.34 and 26.35,
respectively; that is, the effect can be attributed to a combination of small-
sample variation and possible model misspecification. A theoretical reason
for this difference might be that the expected value of the off-diagonal block
of the information matrix of the maximum likelihood estimates (describing
covariance between mean and covariance parameters) has expectation zero
but is likely to depart from this in small samples. As a consequence, the
variances of the estimated treatment effects will be higher when derived
from the observed information, thereby reducing the Wald statistic.

To summarize, this example provides an illustration of an alternative source


of discrepancy between the expected and observed information matrices,
which is likely to be associated with the use, in smaller samples, of covari-
ance matrices with many parameters.

21.5 Implications for PROC MIXED

The literature indicates an early awareness of problems with conventional


likelihood-based frequentist inference in the MAR setting. Specifically, sev-
eral authors point to the use of the observed information matrix as a way
to circumvent issues with the expected information matrix. In spite of this,
it seems that a broad awareness of this problem has diminished while the
number of methods formulated to deal with the MAR situation has risen
dramatically in recent years. We therefore feel that a restatement and expo-
sition of this important problem is timely, especially since PROC MIXED
allows routine fitting of ignorable models with likelihood-based methods.

The MIXED procedure allows both Newton-Raphson and Fisher scoring


algorithms. Specifying the ‘scoring’ option in the PROC MIXED statement
requests the Fisher scoring algorithm in conjunction with the method of
estimation for a specified number of iterations (1 by default). If convergence
is reached before scoring is stopped, then the expected Hessian is used to
compute approximate standard errors rather than the observed Hessian. In
both cases, the standard errors for the fixed effects are based on inverting a
single block of the Hessian matrix. Since we have shown in Section 21.3 that
the off-diagonal block, pertaining to the covariance between the fixed effects
and covariance parameters, does not have expectation zero, this procedure
is, strictly speaking, incorrect. Correction factors to overcome this problem
have been proposed (e.g., Prasad and Rao 1990) but they tend to be small
for fairly well-balanced data sets. It has to be noted that a substantial
amount of (randomly) missing data will destroy this balance. The extent
to which all this is problematic is illustrated in Table 21.3. Model 7 for the
growth data is reconsidered for both the complete data set, as well as for
the incomplete data on the basis of an ignorable analysis. The fixed-effects
parameters are as in (17.10), whereas the covariance structure consists of
the residual variance σ 2 and the variance of the random intercept d. Apart
from the parameter estimates, two sets of standard errors are shown: (1)
taken from inverting the fixed-effects block from the observed Hessian and
(2) taken from inverting the entire observed Hessian. The first set is found
from the MIXED output, whereas the second one was constructed using
the numerical optimizer OPTMUM of GAUSS (Edlefsen).
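For concreteness, a minimal call requesting Fisher scoring might look as
follows; the data set and variable names are assumptions, and the coding
of the fixed effects is only one possibility for Model 7:

   proc mixed data=growth method=ml scoring=50;
      class sex idnr;
      model measure = sex age*sex / noint solution;
         /* sex-specific intercepts and slopes */
      random intercept / subject=idnr;
   run;

With SCORING=50, Fisher scoring is retained for up to 50 iterations, so
that the expected rather than the observed Hessian underlies the reported
standard errors whenever convergence occurs while scoring is still in
effect.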

TABLE 21.3. Maximum likelihood estimates and standard errors (in parentheses)
for the parameters in Model 7, fitted to the growth data (complete data set and
ignorable analysis).

                    Complete Data                     Ignorable
Parameter   Estimate  (s.e.)(1)  (s.e.)(2)   Estimate  (s.e.)(1)  (s.e.)(2)
β0          17.3727   (1.1615)   (1.1645)    17.2218   (1.2220)   (1.2207)
β01         −1.0321   (1.5089)   (1.5156)    −0.9188   (1.5857)   (1.5814)
β10          0.4795   (0.0923)   (0.0925)     0.4890   (0.0969)   (0.0968)
β11          0.7844   (0.0765)   (0.0767)     0.7867   (0.0802)   (0.0801)
σ²           1.8746   (0.2946)   (0.2946)     2.0173   (0.3365)   (0.3365)
d            3.0306   (0.9552)   (0.9550)     3.0953   (1.0011)   (1.0011)

(1) Standard error based on the Newton-Raphson algorithm of PROC MIXED.
(2) Standard error obtained from inverting the entire observed information
    matrix.

Clearly, there are only minor differences between the two sets of standard
errors and the analysis on an incomplete set of data does not seem to widen
the gap.

We can conclude from this that, with the exception of the expected informa-
tion matrix, conventional likelihood-based frequentist inference, including
standard hypothesis testing, is applicable in the MAR setting. Standard
errors based on inverting the entire Hessian are to be preferred, and in
this sense, it is a pity that this option is presently not available in PROC
MIXED.
22
The Expectation-Maximization
Algorithm

Although the models in Table 17.5 are fitted using direct observed data like-
lihood maximization in PROC MIXED, Little and Rubin (1987) obtained
these same results using the Expectation-Maximization algorithm. Special
forms of the algorithm, designed for specific applications, had been pro-
posed for about half a century (e.g., Yates 1933), but the first unifying and
formal account was given by Dempster, Laird, and Rubin (1977). McLach-
lan and Krishnan (1997) devoted a whole volume to the EM algorithm and
its extensions.

Even though the SAS procedure MIXED uses direct likelihood maximiza-
tion, the EM algorithm is generally useful to maximize certain complicated
likelihood functions. For example, it has been used to maximize mixture
likelihoods in Section 12.3. Liu and Rubin (1995) used it to estimate the
t-distribution, based on EM, its extension ECM (expectation conditional
maximization), and ECME (expectation conditional maximization, either),
which are described in Meng and Rubin (1993), Liu and Rubin (1994), and
van Dyk, Meng, and Rubin (1995). EM methods specifically for mixed-
effects models are discussed in Meng and van Dyk (1998). A nice review
is given in Meng (1997), where the focus is on EM applications in medical
studies.

We will first give a brief description of the algorithm, with emphasis on


incomplete data problems. Suppose we are interested in maximizing the
ignorable observed-data log-likelihood (θ; y o ). Let θ (0) be an initial guess,

which can be found from, for example, a complete case analysis, an available
case analysis, single imputation, or any other convenient method.

The EM algorithm consists of an expectation step (E step) and a maximiza-


tion step (M step).

The E Step. Given the current value θ (t) of the parameter vector, the E
step computes the expected value of the complete data log-likelihood,
given the observed data and the current parameters, which is called
the objective function:

\[ \begin{aligned}
Q(\theta \,|\, \theta^{(t)}) &= \int \ell(\theta; y)\,
      f(y^m \,|\, y^o, \theta^{(t)})\, dy^m \\
  &= E\left\{ \ell(\theta; y) \,\big|\, y^o, \theta^{(t)} \right\}. \qquad (22.1)
\end{aligned} \]

In the case that the log-likelihood is linear in sufficient statistics,


this procedure comes down to substituting the expected value of the
sufficient statistics, given Y o and θ (t) . In particular, for exponential
family models, the E step reduces to the computation of complete-
data sufficient statistics.
The M Step. Next, the M step determines θ (t+1) , the parameter vector
maximizing the objective function (22.1). Formally, θ (t+1) satisfies
Q(θ (t+1) |θ (t) ) ≥ Q(θ|θ (t) ), for all θ.

One then iterates between the E and M steps until convergence.

Consider, for example, a multivariate normal sample. Then, θ = (µ, Σ).


The E step computes the sufficient statistics
\[ E\left( \sum_{i=1}^{N} Y_{ij} \,\Big|\, y^o, \theta^{(t)} \right)
   \quad\text{and}\quad
   E\left( \sum_{i=1}^{N} Y_{ij} Y_{ik} \,\Big|\, y^o, \theta^{(t)} \right). \]

From these, computation of µ(t+1) and Σ(t+1) is straightforward.
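As a small illustration, the following SAS/IML sketch implements this E/M
cycle for the bivariate normal setting of Section 21.3, with Y2 missing
for the incomplete pairs. The toy data and the fixed number of iterations
are assumptions made purely for illustration; a production version would
monitor the change in the observed-data log-likelihood instead.

   proc iml;
   /* toy data: an (N x 2) sample; . denotes a missing Y2 */
   y = {1.2 0.8, -0.3 0.1, 0.5 ., 2.1 1.7, -1.0 ., 0.4 0.6};
   obs = loc(y[,2] ^= .);             /* complete pairs            */
   mis = loc(y[,2]  = .);             /* incomplete pairs          */
   n = nrow(y);  nmis = ncol(mis);
   /* starting values: complete-case means and covariance matrix  */
   yc = y[obs,];
   mu = t(yc[:,]);
   c  = yc - repeat(t(mu), nrow(yc), 1);
   Sigma = t(c)*c / nrow(yc);
   do iter = 1 to 50;
      /* E step: conditional moments of the missing Y2 given Y1   */
      beta = Sigma[1,2]/Sigma[1,1];
      resv = Sigma[2,2] - beta*Sigma[1,2];     /* residual variance */
      ey2  = mu[2] + beta*(y[mis,1] - mu[1]);  /* E(Y2 | y1)        */
      /* expected complete-data sufficient statistics              */
      s1  = sum(y[,1]);
      s2  = sum(y[obs,2]) + sum(ey2);
      s11 = ssq(y[,1]);
      s12 = sum(y[obs,1]#y[obs,2]) + sum(y[mis,1]#ey2);
      s22 = ssq(y[obs,2]) + ssq(ey2) + nmis*resv;
      /* M step: complete-data ML estimates                        */
      mu    = (s1 // s2) / n;
      Sigma = ((s11 || s12) // (s12 || s22)) / n - mu*t(mu);
   end;
   print mu Sigma;
   quit;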

When the covariance matrix is structured or patterned, the E step remains


the same, but the M step is slightly modified. This situation arose when
Little and Rubin (1987) fitted the incomplete growth data Models 5–7 using
the EM algorithm. Let us sketch their procedure. Emphasis is placed on
the patterns in the covariance matrix. The outcomes Y i are assumed to
follow a normal model Y i ∼ N (µi , Σ) where µi = Xi β and Σ = Σ(α),
a known function of α, such as a banded, AR(1), or exchangeable model,
or a model induced by random effects. The complete-data log-likelihood is
linear in y i and y i y i′ . The E step is restricted to computing
\[ E(Y_i \mid y_i^o, X_i, Z_i, \theta)
   \quad\text{and}\quad
   E(Y_i Y_i' \mid y_i^o, X_i, Z_i, \theta). \]

These computations can easily be done using the sweep operator (Little
and Rubin 1987).

The M step consists of a standard estimation procedure for complete data.


Whereas for simple and unstructured covariance models, the M step may
be available in closed form, it is usually iterative for patterned covariance
matrices, turning the EM algorithm into a doubly iterative scheme. To
make the M step noniterative, a GEM (generalized EM) algorithm can
be used. A GEM algorithm merely increases the likelihood in the M step,
rather than maximizing it. For example, a single scoring step can be used
rather than full convergence. Under general conditions, the convergence of
the GEM is the same as for the EM (Dempster, Laird, and Rubin 1977).

Let us write Σ as a function of α for the covariance matrices in the growth


example.

Unstructured: Σ = α.
Banded: σjk = αr with r = |j − k| + 1.
Autoregressive: σjk = α1 α2^{|j−k|} .
Compound symmetry: Σ = α1 J4 + α2 I4 with J4 a matrix of ones.
Random effects: Σ = ZαZ ′ + σ² I with α the covariance matrix of the
random effects.
Independence: Σ = αI4 .
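In PROC MIXED, several of these structures can be requested directly
through the TYPE= option; the following sketch (variable names assumed)
indicates one possible mapping.

   proc mixed data=growth method=ml;
      class idnr;
      model measure = age / solution;
      repeated / subject=idnr type=un;   /* unstructured            */
      /* alternatives for the same REPEATED statement:              */
      /*   type=toep   : banded (Toeplitz)                          */
      /*   type=ar(1)  : autoregressive                             */
      /*   type=cs     : compound symmetry                          */
      /*   type=vc     : independence                               */
      /* the random-effects structure is obtained with a RANDOM     */
      /* statement instead of REPEATED                              */
   run;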

The mean structure design matrices Xi are as discussed in Section 17.4.1.

The main drawback of the EM algorithm is its typically slow rate of con-
vergence. The double iterative structure of many implementations adds to
the problem. Further, the algorithm does not automatically provide preci-
sion estimators. Proposals for overcoming these limitations have been made
by, for example, Louis (1982), Meilijson (1989), Meng and Rubin (1991),
Baker (1992), Meng and van Dyk (1997), and Liu, Rubin, and Wu (1998).

In the light of these observations, one might argue that the existence of
PROC MIXED, enabling the use of Newton-Raphson or Fisher scoring al-
gorithms to maximize the observed data likelihood, is fortunate. Although
this statement is certainly warranted for a wide range of applications, there
may be situations where the EM algorithm is beneficial. Baker (1994) men-
tions advantages of starting with an EM algorithm and then switching to
Newton-Raphson, if necessary, including less sensitivity to poor starting
values and more reliable convergence to a boundary when the maximum
likelihood estimators is indeed a boundary value. In the latter situation,

Newton-Raphson and Fisher scoring algorithms exhibit a tendency to con-


verge to values outside the allowable parameter space. Further, the EM al-
gorithm can be easily extended for use with nonignorable problems, such as
discussed by Diggle and Kenward (1994). This route was explicitly chosen
by Molenberghs, Kenward, and Lesaffre (1997) for a comparable categorical
data setting.

Many of the issues briefly touched upon in this section are discussed at
length in McLachlan and Krishnan (1997). This includes the definition
and basic principles and theory of EM, various ways of obtaining standard
errors and improving the speed of convergence, as well as extensions, in-
cluding those mentioned earlier, but also stochastic EM and Gibbs sampler
versions.
23
Design Considerations

23.1 Introduction

In the first part of this book (Chapters 3 to 13), emphasis was on the for-
mulation and the fitting of, as well as on inference and diagnostics for linear
mixed models in general. Later (Chapters 14 to 22), the problem of miss-
ing data was discussed in full detail, with emphasis on how to obtain valid
inferences from observed longitudinal data and how to perform sensitivity
analyses with respect to assumptions made about the dropout process.

In this chapter, we will reflect on the design of longitudinal studies. In Sec-


tion 23.2, we will briefly discuss how power calculations can be performed
based on linear mixed models. We refer to Mentré, Mallet and Baccar
(1997) for a note on D-optimal designs in random-effects regression mod-
els, and to Liu and Liang (1997), where sample-size calculations are treated
in the context of generalized estimating equations (Liang and Zeger 1986).

In practice (see, e.g., the rat experiment and the Vorozole study introduced
in Chapter 2), longitudinal experiments often do not yield the amount of
information hoped for at the design stage, due to dropout. This results
in realized experiments with (possibly much) less power than originally
planned. In Section 23.4, it will be shown how expected dropout can be
taken into account in sample-size calculations. The basic idea behind this
is that two designs with equal power under the absence of dropout are not

necessarily equally likely to yield realized experiments with high power.


The main question then is how to design experiments with minimal risk
of huge losses in efficiency due to dropout. In Section 23.5, this will be
extensively illustrated in the context of the rat experiment.

23.2 Power Calculations Under Linear Mixed


Models

Chapter 6 was devoted to inference in the marginal linear mixed model


(5.1). Several testing procedures were discussed, including approximate
Wald tests, approximate t-tests, approximate F -tests, and likelihood ra-
tio tests (based on ML as well as REML estimation), for the fixed effects
as well as for the variance components in the model. Obviously, any of these
testing procedures can be used in power calculations. Unfortunately, the
distribution of many of the corresponding test statistics is only known un-
der the null hypothesis. In practice, this means that if such tests are to be
used in sample-size calculations, extensive simulations would be required.
One then would have to sample data sets under the alternative hypothesis
of interest, analyze each of them using the selected testing procedure, and
estimate the probability of correctly rejecting the null hypothesis. Finally,
this whole procedure would have to be repeated for every new design under
consideration.

When interest is in testing a general linear hypothesis of the form

H0 : ξ ≡ Lβ − ξ0 = 0 versus HA : ξ ≠ 0 (23.1)

for some known matrix L and known vector ξ0 , a simplified procedure can
be followed. As explained in Section 6.2.2, (23.1) can be tested based on
the fact that, under the null hypothesis, the test statistic
\[ F \;=\; \hat{\xi}\,' \left[ L \left( \sum_{i} X_i' V_i^{-1} X_i \right)^{-1}
   L' \right]^{-1} \hat{\xi} \;\Big/\; \operatorname{rank}(L) \]

follows approximately an F -distribution. The numerator degrees of freedom


equals rank(L), and several methods can be used to estimate the denomi-
nator degrees of freedom from the data.

Helms (1992) reports simulation results which show that, under the al-
ternative hypothesis HA , the distribution of F can also be approximated
by an F -distribution, now with rank(L) and Σi ni − rank[X|Z] degrees of
freedom, and with noncentrality parameter
\[ \delta \;=\; \xi' \left[ L \left( \sum_{i} X_i' V_i^{-1} X_i \right)^{-1}
   L' \right]^{-1} \xi. \]

The matrices X and Z are as previously defined in Section 5.3.3 [i.e.,
X′ = (X1′ | . . . |XN′ ) and Z = Diag(Z1 , . . . , ZN )]. Hence, under HA , we get
a noncentral F -distribution, from which power calculations immediately
follow.
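Concretely, once δ and the degrees of freedom are available, the power
follows from the noncentral F tail area; a hedged SAS sketch with assumed
values is

   data fpower;
      df1 = 2;  df2 = 120;       /* rank(L) and denominator df      */
      delta = 7.5;               /* assumed noncentrality parameter */
      fcrit = finv(0.95, df1, df2);      /* 5% critical value       */
      power = 1 - probf(fcrit, df1, df2, delta);
   run;
   proc print data=fpower; run;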

An example in which the above results are used for power calculations
can be found in Helms (1992), where it has been shown empirically that
intentionally incomplete designs, where some subjects are intentionally not
measured at all time points, can have more power while being less expensive
to conduct. Another example will be given in the next section. Finally, the
noncentral F -approximation will also be used in Section 23.4 to perform
power calculations, taking into account that dropout is to be expected.

23.3 Example: The Rat Data

We reconsider here the rat experiment, introduced in Section 2.1. Recall


that our final model, derived in Section 6.3.3, is given by


\[ Y_{ij} \;=\;
   \begin{cases}
   \beta_0 + b_i + \beta_1 t_{ij} + \varepsilon_{ij}, & \text{if low dose,} \\
   \beta_0 + b_i + \beta_2 t_{ij} + \varepsilon_{ij}, & \text{if high dose,} \\
   \beta_0 + b_i + \beta_3 t_{ij} + \varepsilon_{ij}, & \text{if control dose,}
   \end{cases} \tag{23.2} \]

where tij represents the logarithmically transformed ages, tij = ln(1 +
(Ageij − 45)/10), at which the repeated measurements have been taken.
As before, the residual components εij only contain measurement error
(i.e., εi = ε(1)i ; see Section 3.3.4). Estimates of all parameters in the corre-
As before, the residual components εij only contain measurement error
(i.e., εi = ε(1)i , see Section 3.3.4). Estimates of all parameters in the corre-
sponding marginal model have been provided in Table 6.4. The estimated
average profiles are shown in Figure 6.2.

The hypothesis of primary interest is H0 : β1 = β2 = β3 , which has already


been tested in Section 6.3.3, yielding a nonsignificant approximate Wald
statistic (p = 0.0987). A similar result (p = 0.1010) is obtained using an
approximate F -test, with Satterthwaite approximation for the denominator
degrees of freedom. We conclude from this that there is little evidence
for any treatment effects. However, the power for detecting the observed
differences (as described in Table 6.4) at the 5% level of significance and

calculated using the F -approximation described in the previous section is


as low as 56%.

Note that, as already mentioned in Section 2.1 and shown in Table 2.1, this
rat experiment suffers from a severe degree of dropout, since many rats do
not survive anesthesia needed to measure the outcome. Indeed, although
50 rats have been randomized at the start of the experiment, only 22 of
them survived the first 6 measurements, so measurements on only 22 rats
are available in the way anticipated at the design stage. For example, at
the second occasion (age = 60 days), only 46 rats were available, implying
that for 4 rats, only 1 measurement has been recorded. As can be expected,
this high dropout rate inevitably leads to severe losses in efficiency of the
statistical inferential procedures. Indeed, if no dropout had occurred (i.e.,
if all 50 rats would have withstood the 7 measurements), the power for
detecting the observed differences at the 5% level of significance would
have been 74%, rather than the 56% previously reported for the realized
experiment.

In the rat example, dropout was not entirely unexpected since it is in-
herently related to the way the response of interest is actually measured
(anesthesia cannot be avoided) and should therefore have been taken into
account at the design stage. In the next section, we will discuss a general,
computationally simple method, proposed by Verbeke and Lesaffre (1999)
for the design of longitudinal experiments, when dropout is to be expected.
Afterward, in Section 23.5, the proposed approach is applied to the rat
data.

23.4 Power Calculations When Dropout Is to Be


Expected

In order to fully understand how the dropout process can be taken into
account at the design stage, we first investigate how it affects the power
of a realized experiment. Note that the power of the F -test described in
Section 23.2 not only depends on the true parameter values β, D, and σ 2
(or, more generally, Σi ) but also on the covariates Xi and Zi . Usually, in
designed experiments, many subjects will have the same covariates, such
that there are only a small number of different sets (Xi , Zi ). For the rat

data, for example, all 15 rats in the control group have Xi and Zi equal to
\[ X_i = \begin{pmatrix}
   1 & 0 & 0 & \ln[1 + (50 - 45)/10] \\
   1 & 0 & 0 & \ln[1 + (60 - 45)/10] \\
   \vdots & \vdots & \vdots & \vdots \\
   1 & 0 & 0 & \ln[1 + (110 - 45)/10]
   \end{pmatrix},
   \qquad
   Z_i = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}. \]
However, due to the dropout mechanism, the above matrices have been
realized for only four of them. Indeed, for a rat that drops out early, say at
the kth occasion, the realized design matrices equal the first k rows of the
above planned matrices; that is,
\[ X_i = \begin{pmatrix}
   1 & 0 & 0 & \ln[1 + (50 - 45)/10] \\
   \vdots & \vdots & \vdots & \vdots \\
   1 & 0 & 0 & \ln[1 + (40 + k \times 10 - 45)/10]
   \end{pmatrix},
   \qquad
   Z_i = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}. \]
Note that the number of rats that drop out at each occasion is a realization
of the stochastic dropout process, from which it follows that the power
of the realized experiment is also a realization of a random variable, the
distribution of which depends on the planned design and on the dropout
process. From now on, we will denote this random power function by P.

In general, a planned design is characterized by a small number of triplets


(X j , Z j , Mj ), j = 1, . . . , M , in which it is indicated that the (nj × p)
design matrix X j for the fixed effects and the (nj × q) design matrix Z j
for the random effects is repeated for Mj subjects, with Σj Mj = N , the
total number of subjects in our sample. Due to the dropout process, the
realized covariate matrices are (X_j^{[k]} , Z_j^{[k]} ) with multiplicity
Mj,k , k = 1, . . . , nj , Σk Mj,k = Mj , j = 1, . . . , M , where X_j^{[k]} and
Z_j^{[k]} denote the first k rows of X j and Z j , respectively. Note that
once all realized values for Mj,k are known, the corresponding realization
of the power P can be calculated.

Since, in the presence of dropout, the power P becomes a stochastic vari-


able, it is not obvious how two different designs with two different associ-
ated power functions P 1 and P 2 should be compared in practice. Several
criteria can be used, such as the average power, E(P), the median power,
median(P), the risk of having a final analysis with power less than for
example 70%, P (P ≤ 70%), and so forth.

Note that all of the above criteria are based on only one specific aspect of
the distribution of P. A criterion which takes into account the full distri-
bution selects the second design over the first one if P 1 is stochastically
smaller than P 2 , P 1 ≺ P 2 , which is defined as (Lehmann and D’Abrera
1975, p. 66)
P1 ≺ P2 ⇐⇒ P (P 1 ≤ p) ≥ P (P 2 ≤ p), ∀p.

This means that, for any power p, the risk of ending up with a final analysis
with power less than p is smaller for the second design than for the first
design. Obviously, if this criterion is to be used, one needs to assess the com-
plete power distribution function for all designs which are to be compared.
We propose doing this via sampling methods in which, for each design un-
der consideration, a large number of realized values ps , s = 1, . . . , S, are
sampled from P and used to construct the empirical distribution function

\[ \widehat{P}(\mathcal{P} \le p) \;=\; \frac{1}{S} \sum_{s=1}^{S} I[p_s \le p] \]

in which I[A] equals one if A is true and zero otherwise.

As indicated above, sampling from P actually comes down to sampling


realized values for all Mj,k , k = 1, . . . , nj , j = 1, . . . , M , and constructing
all necessary realized matrices X_j^{[k]} and Z_j^{[k]} . One then can easily
calculate the implied noncentrality parameter
\[ \delta \;=\; \xi' \left\{ L \left[ \sum_{j=1}^{M} \sum_{k=1}^{n_j}
   M_{j,k}\, X_j^{[k]\prime} \left( Z_j^{[k]} D Z_j^{[k]\prime}
   + \sigma^2 I_k \right)^{-1} X_j^{[k]} \right]^{-1} L' \right\}^{-1} \xi \]

and the appropriate numbers of degrees of freedom for the F -statistic, from
which a realized power follows. Note that the dropout process associates
with each triplet (X j , Z j , Mj ) in the design a vector pj = (pj,1 , . . . , pj,nj )
in which pj,k equals the marginal probability that exactly k measurements
are taken on a subject with planned design matrices X j and Z j . Once
all vectors pj have been specified, we have that all sets (Mj,1 , . . . , Mj,nj )
follow a multinomial distribution with Mj trials and probabilities given by
the elements of the vectors pj . This implies that the sampling procedure
basically reduces to sampling from multinomial distributions, from which
it follows that the implementation is straightforward. This allows one to
explore many different combinations of models for the dropout process with
models for the actual responses and to investigate the effect of possible
model misspecifications. Further, it can easily be seen that the computing
time does not increase with the planned sample size. It only depends on
the number M of triplets (X j , Z j , Mj ) in the design rather than on the
total sample size Σj Mj .
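The following SAS/IML sketch makes this concrete for a single hypothetical
three-occasion design (three groups of M subjects, occasions at ages 50,
80, and 110 days, a random intercept, and placeholder parameter values;
none of the numbers below are the estimates actually used in Section 23.5).

   proc iml;
   call randseed(123);
   age  = {50, 80, 110};
   tvec = log(1 + (age - 45)/10);     /* transformed time scale      */
   n = nrow(tvec);
   beta = {68, 7.3, 6.9, 7.5};        /* placeholder fixed effects   */
   dd = 3.4;  s2 = 1.4;               /* placeholder D and sigma^2   */
   L  = {0 1 -1 0, 0 1 0 -1};         /* H0: beta1 = beta2 = beta3   */
   xi = L*beta;
   M  = 24;                           /* subjects per group          */
   p  = {0.12 0.11 0.77};             /* marginal p_{j,k}            */
   nsim = 1000;  pow = j(nsim,1,.);
   do s = 1 to nsim;
      info = j(4,4,0);  ntot = 0;
      do g = 1 to 3;                  /* one triplet per group       */
         Xg = j(n,4,0);  Xg[,1] = 1;  Xg[,1+g] = tvec;
         Mk = randmultinomial(1, M, p);     /* realized M_{j,k}      */
         do k = 1 to n;
            if Mk[k] > 0 then do;
               Xk = Xg[1:k,];
               Vk = dd*j(k,k,1) + s2*i(k);  /* Z D Z' + sigma^2 I    */
               info = info + Mk[k]*(t(Xk)*inv(Vk)*Xk);
               ntot = ntot + Mk[k]*k;
            end;
         end;
      end;
      delta = t(xi)*inv(L*inv(info)*t(L))*xi;    /* noncentrality    */
      df2 = ntot - 3*M - 3;           /* one possible choice of ddf  */
      fcrit = finv(0.95, 2, df2);
      pow[s] = 1 - probf(fcrit, 2, df2, delta);
   end;
   print (pow[:])[label="average power"]
         ((pow <= 0.70)[:])[label="P(power <= 0.70)"];
   quit;

Because only the Mj,k are random, each replicate requires just a handful
of small matrix inversions, in line with the observation that the computing
time depends on the number of triplets rather than on Σj Mj.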

It should be emphasized that the above approach is not restricted to


any particular statistical test. The idea of sampling designs under spe-
cific dropout patterns is applicable for any testing procedure, as long as it
remains possible to evaluate the power associated to each realized design.
Note also that the only additional information needed, in comparison to
classical power analyses, are the vectors pj of marginal dropout probabil-
ities pj,k . This does not require full knowledge of the underlying dropout

TABLE 23.1. Rat Data. Observed conditional dropout rates pj,k|≥k at each occa-
sion, for all treatment groups simultaneously.

Age (days): 50 60 70 80 90 100

Observed rate: 0.08 0.07 0.12 0.24 0.17 0.08

process. We only need to make assumptions about the dropout rate at


each occasion where observations are designed to be taken. For example,
we do not need to know whether the dropout mechanism is “completely
at random” or “at random” (see Section 15.7). Still, we have to assume
that dropout is “not informative” in the sense that it does not depend on
the response values which would have been recorded if no dropout had oc-
curred, since otherwise our final analysis based on the linear mixed model
would not yield valid results (see Section 15.8 and Chapter 21).

Finally, the proposed method can be used in combination with techniques,


such as those proposed by Helms (1992), which would allow the costs of
performing the designs under consideration to be taken into account. This
could yield less costly experiments with minimal risk of large efficiency
losses due to dropout. This will not be explored any further here.

23.5 Example: The Rat Data

In this section, we will use the sampling procedure described in the previ-
ous section to compare the design that was used in the rat experiment with
alternative designs which could be used in future similar experiments. The
assumption of random dropout is supported by our analyses in Section 19.4.
In order to be able to specify realistic marginal dropout probabilities pj,k ,
we first study the dropout rates observed in the data set at hand. Accord-
ing to the clinicians who collected the data, there is a strong belief that
there is constant probability for a rat to drop out, given that the rat sur-
vived up to that moment. This suggests that the conditional probability
pj,k|≥k that a rat with planned covariates (X j , Z j ) does not survive the
kth occasion, given that it survived all previous measurements, does not
depend on k. Further, it is believed that the dropout process does not de-
pend on treatment (i.e., that pj,k and pj,k|≥k do not depend on j), as long
as measurements are planned at the same occasions for the three treatment
groups. Table 23.1 shows the conditional observed dropout rates pj,k|≥k at
each occasion, for the total sample. For example, 3 rats, out of the 46 rats
who survived the first measurement, died at the second occasion, leading

TABLE 23.2. Rat Data. Marginal probabilities pj,k for several conditional dropout
models and designs. Empty entries correspond to occasions at which no observa-
tions are planned in the design.

Logits of conditional                 Occasions (age in days)
probabilities pj,k|≥k        50    60    70    80    90    100   110
logit(0.12)                 0.12  0.11  0.09  0.08  0.07  0.06  0.46
logit(0.12)                 0.12        0.11        0.09        0.68
logit(0.12)                 0.12              0.11              0.77
logit(0.12)                 0.12                                0.88
logit(0.12)                 0.12  0.11                          0.77
logit(0.12)                 0.12                          0.11  0.77
−3 + 0.06(Age − 45)         0.06  0.10                          0.83
−3 + 0.06(Age − 45)         0.06              0.27              0.67
−3 + 0.06(Age − 45)         0.06                          0.54  0.40
−3 + 0.02(Age − 45)         0.05  0.06                          0.89
−3 + 0.02(Age − 45)         0.05              0.09              0.86
−3 + 0.02(Age − 45)         0.05                          0.13  0.82

to an observed dropout rate of pj,2|≥2 = 3/46 = 0.07. Using a likelihood


ratio test, we tested whether it is reasonable to assume the dropout rates
in Table 23.1 to be observed values of one common dropout probability,
as has been hypothesized by the clinicians. The likelihood ratio statistic
equals 2 ln λ = 7.37, on 5 degrees of freedom, which is clearly not signif-
icant at the 5% level. Further, the maximum likelihood estimate for the
common probability equals pj,k|≥k = 0.122, suggesting that each time a
rat is anesthetized, there is about 12% chance that the rat will not survive
anesthesia, independent of the treatment.

In Sections 23.5.1 and 23.5.2, we will first compare several designs under
the assumption that pj,k|≥k = 0.12. Afterward, in Section 23.5.3, the results
will be compared with those obtained under two alternative dropout models
which assume that pj,k|≥k increases over time. In all designs, the three
treatment groups are measured on the same occasions, and all designs plan
their first and last observation at the age of 50 and 110 days, respectively.
The marginal probabilities are shown in Table 23.2, and the actual designs
are summarized in Table 23.3, together with their power if no dropout

TABLE 23.3. Rat Data. Summary of the designs compared in the simulation
study.

         Occasions                      Number of subjects   Power if
Design   Age (days)                     (M1 , M2 , M3 )      no dropout
A        50-60-70-80-90-100-110         (15, 18, 17)         0.74
B        50-70-90-110                   (15, 18, 17)         0.63
C        50-80-110                      (15, 18, 17)         0.59
D        50-110                         (15, 18, 17)         0.53
E        50-70-90-110                   (22, 22, 22)         0.74
F        50-80-110                      (24, 24, 24)         0.74
G        50-110                         (27, 27, 27)         0.75
H        50-60-110                      (26, 26, 26)         0.74
I        50-100-110                     (20, 20, 20)         0.73

would occur. All calculations are done under the assumption that the true
parameter values are given by the estimates in Table 6.4 and all simulated
power distributions are based on 1000 draws from the correct distribution.

23.5.1 Constant pj,k|≥k , Varying nj

Since, at each occasion, rats may die, it seems natural to reduce the number
of occasions at which measurements are taken. We have therefore simulated
the power distribution of four designs in which the number of rats assigned
to each treatment group is the same as in the original experiment, but the
planned number of measurements per subject is seven, four, three, and two,
respectively. These are the designs A to D in Table 23.3. Note that design
A is the design used in the original rat experiment. The simulated power
distributions are shown in Figure 23.1.

First, note that the solid line is an estimate for the power function of
the originally designed rat experiment under the assumption of constant
dropout probability pj,k|≥k equal to 12%. It shows that there was more
than 80% chance for the final analysis to have realized power less than
the 56% which was observed in the actual experiment. Comparing the four
designs under consideration, we observe that the risk of high power losses
increases as the planned number of measurements per subject decreases. On
the other hand, it should be emphasized that the four designs are, strictly
speaking, not comparable in the sense that, in the absence of dropout, they
have very different powers ranging from 74% for design A to only 53% for

FIGURE 23.1. Rat Data. Comparison of the simulated power distributions for
designs with seven, four, three, or two measurements per rat, with equal number
of rats in each design (designs A, B, C, and D, respectively), under the assumption
of constant pj,k|≥k equal to 12%. The vertical dashed line corresponds to the power
which was realized in the original rat experiment (56%).

design D (Table 23.3). In order to make fair comparisons, we will from


now on only consider designs with comparable powers if no dropout would
occur. As shown in Table 23.3, this can be achieved by considering designs
with different sample sizes.

Designs E, F, and G are the same as designs B, C, and D, but with sample
sizes such that their power is approximately the same as the power of design
A, in the absence of dropout. The simulated power distributions are shown
in Figure 23.2. The figure suggests that P A ≺ P E ≺ P F ≺ P G , from which
it follows that, in practice, the design in which subjects are measured only
at the beginning and at the end of the study is to be preferred, under
the assumed dropout process. This can be explained by the fact that the
probability for surviving up to the age of 110 days is almost twice as high
for design G (88%) as for the original design (46%) (Table 23.2). Note
also that the parameters of interest [β1 , β2 , and β3 in model (23.2)] are
slopes in a linear model such that two measurements are sufficient for the
parameters to be estimable. On the other hand, design G does not allow
testing for possible nonlinearities in the average evolutions. It also follows
from Figure 23.2 that if design E, F, or G had been taken, it would have
been very unlikely to have a final analysis with such small power as in the
original experiment.
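These survival probabilities follow directly from the constant dropout rate:
with pj,k|≥k = 0.12, dropout can occur at every occasion after the first, so
a rat completes all seven occasions of design A with probability 0.88^6 = 0.46,
whereas it completes design G, which has only one occasion at which dropout
can occur, with probability 0.88.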

FIGURE 23.2. Rat Data. Comparison of the simulated power distributions for
designs with seven, four, three, or two measurements per rat, with equal power
if no dropout would occur (designs A, E, F, and G, respectively), under the as-
sumption of constant pj,k|≥k equal to 12%. The vertical dashed line corresponds
to the power which was realized in the original rat experiment (56%).

23.5.2 Constant pj,k|≥k , Constant nj

The results in Section 23.5.1 suggest that similar future experiments should
plan fewer measurements for each rat. From now on, we will, therefore, only
consider designs with three measurements per rat. As before, the first
and last measurement are planned to be taken at the beginning and at
the end of the study, respectively (at the age of 50 and 110 days). Hence,
only the second observation needs to be specified. Designs H, F, and I have
their second observation planned early in the study (at the age of 60 days),
halfway through the study (at the age of 80 days) and late in the study
(at the age of 100 days), respectively. Note that design H needs 18 more
subjects than design I in order to get comparable power in the absence of
dropout (Table 23.3). This is due to the fact that our linear mixed model is
linear as a function of t = ln(1 + (Age − 45)/10) rather than of the original Age
scale and because maximal spread in the values tij is obtained by taking
the second measurement at the end rather than at the beginning of the
experiment.

Figure 23.3 shows the simulated power distributions for designs F, H, and I.
As in Section 23.5.1, these are obtained under the assumption that pj,k|≥k is
constant and equal to 12%. This implies that, under each of these designs,
there is 12% chance for a subject to have only one measurement, 11%
chance for two measurements, and 77% chance that all three observations
will be available at the time of the analysis (Table 23.2).

FIGURE 23.3. Rat Data. Comparison of the simulated power distributions for
designs with three measurements per rat, with equal power if no dropout would
occur (designs H, F, and I), under the assumption of constant pj,k|≥k equal to
12%. The vertical dashed line corresponds to the power which was realized in the
original rat experiment (56%).

First, under all three designs, it is very unlikely to have a final analysis
with such small power as observed in the rat experiment. Further, we have
that, from the perspective of efficiency, designs F and I are almost identical,
but clearly superior to design H, that is, we have that P H ≺ P F ≈ P I .
The relatively poor behavior of design H can be explained as follows. For
subjects which drop out at the first occasion (and therefore have only one
observation available), there is no difference between the three designs. This
is in contrast with subjects which drop out at the second occasion, since
subjects from design H then contain less information on the parameters of
interest than subjects from design I. Note that although designs F and
I are equivalent with respect to efficiency of the final analysis, design I is
to be preferred since it requires the randomization of fewer subjects.

23.5.3 Increasing pj,k|≥k over Time, Constant nj

In order to investigate the effect of the assumed dropout model on the
simulated power distributions, we reconsider designs F, H, and I, which
have been investigated in Section 23.5.2 under the assumption of constant
pj,k|≥k . However, we will now assume that pj,k|≥k increases as a function of
the time (days) elapsed since the start of the treatment. More specifically,
it will be assumed that the conditional dropout probabilities satisfy the
logistic regression model

    pj,k|≥k = exp(ψ0 + ψ1 (Age − 45)) / [1 + exp(ψ0 + ψ1 (Age − 45))],

with ψ0 = −3 and ψ1 = 0.06.

FIGURE 23.4. Rat Data. Comparison of the simulated power distributions
for designs with three measurements per rat, with equal power if no dropout
would occur (designs H, F, and I), under the assumption that pj,k|≥k satisfies
logit(pj,k|≥k ) = −3 + 0.06(Age − 45). The vertical dashed line corresponds to the
power which was realized in the original rat experiment (56%).

The simulated power distributions are shown in Figure 23.4. We now clearly
get that P I ≺ P F ≺ P H , which is opposite to our results under the assump-
tion of constant pj,k|≥k . The most efficient design is now the one where the
second occasion is planned immediately after the first measurement. This
can be explained by observing that there is 83% chance under design H
that a subject will have all measurements available as planned at the de-
sign stage. For design I, this probability drops to only 40%. Hence, the
efficiency gained by having more spread in the covariate values tij for our
linear model is lost by the severely increased risk of dropping out.
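The 83% and 40% completion probabilities quoted above are easily reproduced
from the assumed logistic model. A minimal sketch in SAS (our own
illustration; variable names are ours, and Age in the dropout model is taken
as the age at the last obtained measurement, the interpretation consistent
with both reported probabilities) is

data complete;
   array ageH{3} _temporary_ (50 60 110);    /* design H: planned ages */
   array ageI{3} _temporary_ (50 100 110);   /* design I: planned ages */
   pH = 1; pI = 1;
   do k = 2 to 3;    /* dropout is possible at occasions 2 and 3 */
      etaH = -3 + 0.06*(ageH{k-1} - 45);
      etaI = -3 + 0.06*(ageI{k-1} - 45);
      pH = pH*(1 - exp(etaH)/(1 + exp(etaH)));
      pI = pI*(1 - exp(etaI)/(1 + exp(etaI)));
   end;
   put pH= pI=;      /* pH = 0.835 (design H), pI = 0.399 (design I) */
run;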

The fact that our conclusions are opposite to those in Section 23.5.2 sug-
gests that there exists a dropout model under which designs F, H, and I are
equivalent with respect to efficiency of the final analysis. One such model
is obtained by setting ψ0 = −3 and ψ1 = 0.02 in the above logistic regres-
sion model. The corresponding simulated power distributions are shown in
Figure 23.5. We now have that the gain in efficiency due to more spread in
the covariate values tij is in balance with the loss in efficiency due to an
increased risk of dropping out. In this case, one would prefer design I since
it requires fewer subjects to conduct the study.

FIGURE 23.5. Rat Data. Comparison of the simulated power distributions
for designs with three measurements per rat, with equal power if no dropout
would occur (designs H, F, and I), under the assumption that pj,k|≥k satisfies
logit(pj,k|≥k ) = −3 + 0.02(Age − 45). The vertical dashed line corresponds to the
power which was realized in the original rat experiment (56%).

Note that the results presented here fully rely on the assumed linear mixed
model (23.2). For example, the simulation results reported in Section 23.5.1
show that design G, with only two observations per subject, is to be pre-
ferred over designs A, E, and F, with more than two observations scheduled
for each subject. Obviously, the assumption of linearity is crucial here, and
design G will not allow testing for nonlinearities. Hence, if interest were
in providing support for model (23.2), more simulations would be needed,
comparing the behavior of different designs under different models
for the outcome under consideration, and design G should no longer be
taken into account. As for any sample-size calculation, it would be advis-
able to perform some informal sensitivity analysis to investigate the impact
of model assumptions and imputed parameter values on the final results.
24
Case Studies

Building on the methodology developed in this text, the current chapter
presents five case studies. In Section 24.1, we study the extension of univari-
ate longitudinal data technology to the multivariate setting, where several
measurements (i.e., systolic and diastolic blood pressure) are obtained at
each measurement occasion. Section 24.2 is devoted to a developmental
toxicology experiment where, due to litter effects, fetuses are clustered
within dams. The flexibility of the linear mixed model to combine cluster
effects with individual-specific covariates is illustrated. Even though time is
not a factor in these data, we are able to establish a close connection with
longitudinal modeling. In Section 24.3, we describe how bivariate outcomes
from multicenter trials can be used to study the validity of one outcome
as a surrogate endpoint for the other. A sensitivity analysis on incomplete
longitudinal data on milk protein content is conducted in Section 24.4. The
chapter is concluded with the analysis of hepatitis B vaccination data. It
is shown how rather irregular sequences can be handled within the linear
mixed-models context.

24.1 Blood Pressures

As an illustration of the use of linear mixed models for the analysis of
repeated measures of a bivariate outcome, we analyze data reported by
Hand et al. (1994), data set #72. For 15 patients with moderate essential
(unknown cause) hypertension, the supine (measured while the patient is ly-
ing down) systolic and diastolic blood pressure was measured immediately
before and 2 hours after taking the drug captopril. The individual profiles
are shown in Figure 24.1. The objective of the analysis is to investigate the
effect of treatment on both responses.

FIGURE 24.1. Blood Pressure Data. Systolic and diastolic blood pressure in pa-
tients with moderate essential hypertension, immediately before and 2 hours after
taking captopril.

Note that since we only have two measurements available for each response,
there is no need for modeling the variance or the mean as continuous func-
tions of time. Also, saturated mean structures and covariance structures
can easily be fitted because of the balanced nature of the data. No trans-
formation to normality is needed since none of the four responses has a
distribution which shows clear deviations from normality.

In order to explore the covariance structure of our data, we fitted sev-
eral linear mixed models with the saturated mean structure. Table 24.1
shows minus twice the maximized REML log-likelihood value, the Akaike
and Schwarz information criteria, and the number of degrees of freedom
in the corresponding covariance model. Our first model has a general un-
structured covariance matrix. The REML-estimated covariance matrix and
corresponding standard errors equal
    ⎛ 423 (160)   371 (148)   143 (69)   105 (75) ⎞
    ⎜ 371 (148)   400 (151)   153 (69)   166 (81) ⎟
    ⎜ 143 (69)    153 (69)    110 (41)    97 (44) ⎟ ,          (24.1)
    ⎝ 105 (75)    166 (81)     97 (44)   157 (60) ⎠
for measurements ordered as indicated in Figure 24.1. Note that if one
would model the diastolic and systolic blood pressures separately, a random-
intercepts model would probably fit the data well. This would implicitly
assume that for both responses, the variance before and after the treatment
with captopril is the same, which is not unreasonable in view of the large
standard errors shown in (24.1).

TABLE 24.1. Blood Pressure Data. Summary of the results of fitting several co-
variance models. All models include a saturated mean structure. Notations RIdia,
RSdia, RIsys, RSsys, RI, and RS are used for random intercepts and slopes for
the diastolic blood pressures, random intercepts and slopes for the systolic blood
pressures, and random intercepts and slopes which are the same for both blood
pressures, respectively.

Model                               −2 REML       AIC        SBC     df
Unstructured                         420.095   −220.047   −230.174   10
RIdia + RIsys + RSdia + RSsys        420.095   −221.047   −232.187   10
RIdia + RIsys + RS                   420.656   −217.328   −224.417    7
RIdia + RIsys                        433.398   −220.699   −224.749    4
RI + RIsys + RS                      420.656   −217.328   −224.417    7
RI + RIsys + RS, uncorrelated        424.444   −216.222   −220.273    4

We therefore reparameterize the unstructured covariance model as a
random-effects model with random intercepts
and random slopes for the two responses separately. The random slopes are
random coefficients for the dummy covariate defined as 0 before the treat-
ment and as 1 after the treatment, and therefore represent subject-specific
deviations from the average effect of the treatment, for systolic and dias-
tolic blood pressure, respectively. Since this model includes four linearly
independent covariates in the Zi matrix, it is equivalent to the first model
in Table 24.1, yielding the same maximized log-likelihood value. The ad-
vantage is that we now have expressed the model in terms of interpretable
components. This will prove to be convenient for reducing the number of
variance components. As discussed previously, SAS always adds a residual
component ε(1)i of measurement error to any random-effects model. In our
model this component is not identified. Still, the reported AIC and SBC
values are based on 11 variance components and are therefore not the same
as the ones found for the first model in Table 24.1.
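As a quick check on the entries in Table 24.1, recall that, with ℓ denoting
the maximized REML log-likelihood and q the number of parameters in the
covariance structure, AIC = ℓ − q and SBC = ℓ − (q/2) ln(N − p), where
N = 60 is the total number of observations and p = 4 the number of fixed
effects. For the unstructured model, for instance,

    AIC = −210.047 − 10 = −220.047,
    SBC = −210.047 − 5 ln(56) = −230.174,

in agreement with the first row of the table.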

In a first attempt to reduce the covariance structure, we fitted a linear
mixed model assuming equal random slopes for diastolic and systolic blood
pressure. This model, which is the third model in Table 24.1, assumes that
for each subject, the treatment effect additional to the average effect is the
same for both responses. Note that there is only a difference of 3 degrees of
freedom when compared to the unstructured model, which is due to the es-
timation of the residual variance which was not included in the first model.
Therefore, we can, strictly speaking, not apply the theory of Section 6.3.4
for testing random effects. However, the very small difference in twice the
maximized REML log-likelihood clearly suggests that the random slopes
may be assumed equal for both responses.

As a second step, we refit our model, not including the random slopes. This
is the fourth model in Table 24.1. The p-value calculated using the theory
of Section 6.3.4 on testing the significance of the random slopes equals

    P (χ²₂:₃ ≥ 12.742) = 0.5 P (χ²₂ ≥ 12.742) + 0.5 P (χ²₃ ≥ 12.742)
                       = 0.5 × 0.0017 + 0.5 × 0.0052
                       = 0.0035,

indicating that the treatment effect is not the same for all subjects.
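This mixture p-value is easily reproduced; a minimal sketch in SAS, using
the PROBCHI function, is

data pvalue;
   g2 = 12.742;   /* observed likelihood ratio statistic */
   /* 50:50 mixture of chi-squared tail areas with 2 and 3 df */
   p = 0.5*(1 - probchi(g2, 2)) + 0.5*(1 - probchi(g2, 3));
   put p=;        /* p = 0.0035 */
run;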

We therefore investigate our third model further. The REML-estimated
random-effects covariance matrix D and the corresponding estimated stan-
dard errors are

    RIsys −→  ⎛ 409 (158)   146 (68)   −38 (46) ⎞
    RIdia −→  ⎜ 146 (68)     92 (39)     4 (22) ⎟ ,
    RS    −→  ⎝ −38 (46)      4 (22)    51 (25) ⎠

where the random effects are ordered as indicated in front of the matrix.
Clearly, there is no significant correlation between either one of the random
intercepts on one side and the random slopes on the other side (p = 0.8501
and p = 0.4101 for the diastolic and systolic blood pressure, respectively),
meaning that the treatment effect does not depend on the initial value.
On the other hand, there is a significant positive correlation (p = 0.0321)
between the random intercepts for the diastolic blood pressure and the
random intercepts for the systolic blood pressure, meaning that a patient
with an initial diastolic blood pressure higher than average is likely to have
an initial systolic blood pressure which is also higher than average.
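Note that these p-values are Wald tests of the corresponding covariance
parameters, obtained by comparing each estimate to its standard error: for
the covariance 146 (s.e. 68) between the two sets of random intercepts, for
example, z = 146/68 = 2.15, yielding a two-sided p ≈ 0.032 (the small
discrepancy with the reported 0.0321 being due to rounding of the displayed
estimates).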

This suggests that an overall subject effect may be present in the data. We
can easily reparameterize our fourth model such that an overall random
intercept is included, but a correction term for either systolic or diastolic
blood pressure is then needed. In view of the larger variability in systolic
blood pressures than in diastolic blood pressures, we decided to reparame-
terize our model as a random-effects model, with overall random intercepts,
random intercepts for systolic blood pressure, and random slopes. The over-
all random intercepts can then be interpreted as the random intercepts for
the diastolic blood pressures. The random intercepts for systolic blood pres-
sure are corrections to the overall intercepts, indicating, for each patient,
the deviation from the average initial systolic blood pressure, additional
to the deviation from the average initial diastolic blood pressure.
These correction terms then explain the additional variability for systolic
blood pressure, in comparison to diastolic blood pressure. Information on
the model fit for this fifth model is also shown in Table 24.1. Since this
model is merely a reparameterization of our third model, we obtain the
same results for both models. The REML-estimated random-effects covari-
ance matrix D and the corresponding estimated standard errors are now

    RI    −→  ⎛ 92 (39)    54 (42)     4 (22) ⎞
    RIsys −→  ⎜ 54 (42)   209 (84)   −42 (34) ⎟ ,
    RS    −→  ⎝  4 (22)   −42 (34)    51 (25) ⎠

suggesting that there are no pairwise correlations between the three ran-
dom effects (all p-values larger than 0.19). We therefore fit a sixth model
assuming independent random effects. The results are also shown in Ta-
ble 24.1. Minus twice the difference in maximized REML log-likelihood
between the sixth and fifth model equals 3.788, which is not significant
when compared to a chi-squared distribution with 3 degrees of freedom
(p = 0.2853). We will preserve this covariance structure (four independent
components of stochastic variability: a random subject effect, a random
effect for the overall difference between the systolic and diastolic blood
pressures, a random effect for the overall treatment effect, and a compo-
nent of measurement error) in the models considered next. Note that this
covariance structure is also the one selected by the AIC as well as the SBC
criterion (see Table 24.1).

Using the above covariance structure, we can now try to reduce our satu-
rated mean structure. The average treatment effect was found to be signif-
icantly different for the two blood pressure measurements, and we found
significant treatment effects for the systolic as well as diastolic blood pres-
sures (all p-values smaller than 0.0001). Our final model is now given by


          ⎧ β1 + b1i + ε(1)ij                  diastolic, before,
          ⎪
          ⎪ β2 + b1i + b2i + ε(1)ij            systolic, before,
   Yij =  ⎨                                                           (24.2)
          ⎪ β3 + b1i + b3i + ε(1)ij            diastolic, after,
          ⎪
          ⎩ β4 + b1i + b2i + b3i + ε(1)ij      systolic, after.

As previously, we assume the random effects to be uncorrelated, and the
ε(1)ij represent independent components of measurement error with equal
variance σ². The program needed to fit this model in SAS is

data blood;
   set blood;
   slope = (time = 'after');
   intsys = (meas = 'systolic');
run;

TABLE 24.2. Blood Pressure Data. Results from fitting the final model (24.2),
using restricted maximum likelihood estimation.

Effect                            Parameter    Estimate (s.e.)
Intercepts:
  Diastolic, before               β1           112.333 (2.687)
  Systolic, before                β2           176.933 (4.648)
  Diastolic, after                β3           103.067 (3.269)
  Systolic, after                 β4           158.000 (5.007)
Treatment effects:
  Diastolic                       β1 − β3        9.267 (2.277)
  Systolic                        β2 − β4       18.933 (2.277)
Covariance of bi :
  var(b1i )                       d11           95.405 (39.454)
  var(b2i )                       d22          215.728 (86.315)
  var(b3i )                       d33           52.051 (24.771)
Measurement error variance:
  var(ε(1)ij )                    σ2            12.862 (4.693)
Observations                                    60
REML log-likelihood                           −212.222
−2 REML log-likelihood                         424.444
Akaike's Information Criterion                −216.222
Schwarz's Bayesian Criterion                  −220.273

proc mixed data = blood covtest;
   class time meas id;
   model bp = meas*time / noint s;
   random intercept intsys slope / type = un(1) subject = id;
   estimate 'trt_sys' meas*time 0 -1 0 1 / cl alpha = 0.05;
   estimate 'trt_dia' meas*time -1 0 1 0 / cl alpha = 0.05;
   contrast 'trt_sys = 2xtrt_dia' meas*time 2 -1 -2 1;
run;

The variables meas and time are factors with levels “systolic” and “dias-
tolic” and with levels “before” and “after,” respectively. The ESTIMATE
statements are included to estimate the average treatment effect for the sys-
tolic and diastolic blood pressures separately. The CONTRAST statement
is used to compare these two effects.
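To see why the listed coefficients test the intended hypotheses, note that
the meas*time cell means turn out to be ordered as (after, diastolic),
(after, systolic), (before, diastolic), (before, systolic), that is, as
(β3, β4, β1, β2), as can be verified against the estimates in Table 24.2. The
coefficients −1 0 1 0 thus estimate β1 − β3 = 9.267, the coefficients
0 −1 0 1 estimate β2 − β4 = 18.933, and the CONTRAST coefficients
2 −1 −2 1 evaluate 2β3 − β4 − 2β1 + β2 = −[2(β1 − β3) − (β2 − β4)], which
is zero precisely when the average treatment effect on systolic blood
pressure is twice that on diastolic blood pressure.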

The REML estimates for all parameters in the marginal model are shown in
Table 24.2. The 95% confidence intervals for the average treatment effect on

diastolic and systolic blood pressure are [4.383; 14.151] and [14.050; 23.817],
respectively. Further, the parameter estimates in Table 24.2 suggest that
the average treatment effect on systolic blood pressure is twice the average
treatment effect on diastolic blood pressure. This hypothesis, tested with
the CONTRAST statement in the above program, was indeed not rejected
at the 5% level of significance (p = 0.9099).

24.2 The Heat Shock Study

24.2.1 Introduction

In the last several decades, teratology and developmental toxicity studies
conducted in laboratory animals have served as an important strategy for
evaluating the potential risk of chemical compounds and other environ-
mental agents on fertility, reproduction, and fetal development. Standard
protocols for conducting developmental toxicity studies have been in use
since shortly after the thalidomide tragedy of the early 1960s, yet meth-
ods for quantitative risk assessment are still being developed and refined
(Schwetz and Harris, 1993). One issue that requires further consideration
is the fact that these types of animal study often result in multiple out-
comes of interest. For example, developmental toxicity studies may record
several different types of malformation on each embryo, such as skeletal,
visceral, and gross malformations. Teratology studies also consider multi-
ple outcomes, such as the extent of implantations, resorptions, and several
different clinical signs of maternal toxicity. These responses are usually ex-
amined individually, and risk assessment is based on the most sensitive
outcome. However, statistical methods which incorporate the important
multiple outcomes have several advantages: They can increase the power
of detecting effects of exposure under a common dose effect model, they
allow investigation of the association among the multiple outcomes, and
they provide a more biologically informative and realistic description of
the range of exposure effects on various responses.

A typical study design for evaluating adverse effects of exposure on the
developing fetus, referred to as a "Segment II design," involves 20 to 30
pregnant rodents randomly assigned to each of several exposure groups
and a control group. The pregnant dams are usually exposed for 1 or more
days during the early part of the gestational cycle and are then sacrificed
just prior to delivery to examine the fetuses for abnormalities. The stan-
dard approach for conducting risk assessment for developmental endpoints
has been to use the results of such experimental animal studies to estimate
the “No observed adverse effect level” (NOAEL), by determining the high-

est dose level that shows no significant difference from the control group
in the rate of malformed embryos or fetal deaths (Gaylor 1989). Several
limitations of this approach have been widely recognized, and new regula-
tory guidelines emphasize the use of quantitative methods similar to those
developed for cancer risk assessment to estimate reference concentrations
and benchmark doses (U.S. EPA 1991). Thus, more recent techniques for
risk assessment in this area are based on fitting dose-response models and
estimating the dose that leads to a certain increase in risk of some type of
adverse developmental effect over that of the control group.

Although a wide variety of statistical methods have been developed for can-
cer risk assessment, the issue of multiple endpoints does not present quite
the degree of complexity in this area as it does for developmental toxicity
studies. The endpoint of interest in an animal cancer bioassay is typically
the occurrence of a particular type of tumor, whereas in developmental tox-
icity studies, there is no clear choice for a single type of adverse outcome.
In fact, an entire array of outcomes is needed to define certain birth defect
syndromes (Khoury et al. 1987, Holmes 1988). Ryan (1992a) describes the
data resulting from a developmental toxicity study as consisting of a series
of hierarchical outcomes and has proposed a modeling framework which in-
corporates the hierarchical structure and allows for the combination of data
on fetal death and resorption. Catalano et al. (1993) extend this approach
to account for low birth weight, by modeling this continuous outcome con-
ditionally on other outcomes. In the most general situation, the multiple
outcomes recorded in the course of a developmental toxicity study may
represent a combination of binary, ordinal, and continuous measurements
(Ryan 1992a, 1992b).

Risk assessment methods for developmental toxicity endpoints have tradi-
tionally been based on daily exposure levels and have not taken into account
either the duration or timing of exposure. However, organogenesis is a very
sensitive process and it is acknowledged that exposure during early periods
of organ development will lead to different types of adverse effect than later
exposures. In addition, it is usually anticipated that short acute exposures
at higher concentrations will result in more severe damage to the devel-
oping fetus than longer chronic exposures. One test system developed to
explore the joint effects of exposure levels and durations is a “heat shock”
study, as described by Brown and Fabro (1981) and Kimmel et al. (1993).
In this type of developmental toxicity study, mice are exposed in vitro to
various combinations of heat stress levels and durations. The outcomes of
interest are a series of morphological responses which are measured on an
ordinal scale. Data from such a heat shock study are studied here. Focusing
on continuous responses, we will show that linear mixed models are advan-
tageous in combining clustering effects with covariates of interest, which
are both duration and level of heat stress exposure.

TABLE 24.3. Heat Shock Study. Study design: number of (viable) embryos ex-
posed to each combination of duration and temperature.

                     Duration of Exposure (minutes)
Temperature (°C)    5    10    15    20    30    45    60    Total
37.0 11 11 12 13 12 18 11 88
40.0 11 9 9 8 11 10 11 69
40.5 9 8 10 9 11 10 7 64
41.0 10 9 10 11 9 6 0 55
41.5 9 8 9 10 10 7 0 53
42.0 10 8 10 5 7 6 0 46
Total 60 53 60 56 60 57 29 375

In these heat shock experiments, the embryos are explanted from the uterus
of the maternal dam during the gestation period and cultured in vitro.
Each individual embryo is subjected to a short period of heat stress by
placing the culture vial into a water bath, usually involving an increase
over body temperature of 4°C to 5°C for a duration of 5 to 60 minutes. The
embryos are examined 24 hours later for signs of impaired or accelerated
development.

This type of developmental toxicity test system has several advantages over
the standard Segment II design. First of all, the exposure is administered
directly to the embryo, so controversial issues regarding the unknown (and
often nonlinear) relationship between the level of exposure to the maternal
dam and that received by the developing embryo need not be addressed.
Although genetic factors are still expected to exert an influence on the
vulnerability to injury of embryos from a common dam, direct exposure
to individual embryos reduces the need to account for such litter effects,
but does not remove it. Second, the exposure pattern can be much more
easily controlled than in most developmental toxicity studies, since it is
possible to achieve target temperature levels in the water bath within 1
to 2 minutes. Whereas the typical Segment II study requires waiting 8
to 12 days after exposure to assess its impact, information regarding the
effects of exposure are quickly obtained in heat shock studies. Finally, this
animal test system provides a convenient mechanism for examining the
joint effects of both duration of exposure and exposure levels, which, until
recently, have received little attention. The actual study design for the set
of experiments conducted by Kimmel et al. (1994) is shown in Table 24.3.
For the experiment, 71 dams (clusters) were available, yielding a total of
375 embryos. The distribution of cluster sizes is given in Table 24.4.

TABLE 24.4. Heat Shock Study. Distribution of cluster sizes.

Cluster size ni 1 2 3 4 5 6 7 8 9 10 11
Number of clusters of size ni 6 3 6 12 13 11 8 5 2 3 2

Historically, the strategy for comparing responses among exposures of dif-
ferent durations to a variety of environmental agents (e.g., radiation, in-
halation, chemical compounds) has relied on a conjecture called Haber’s
Law, which states that adverse response levels should be the same for any
equivalent level of dose times duration (Haber 1924). In other words, a
15-minute exposure to an increase of 3 degrees should produce the same
response as a 45-minute exposure to an increase of 1 degree. Clearly, the
appropriateness of applying Haber’s Law depends on the pharmacokinet-
ics of the particular agent, the route of administration, the target organ,
and the dose/duration patterns under consideration. Although much at-
tention has been focused on documenting exceptions to this rule, it is often
used as a simplifying assumption in view of limited testing resources and
the multitude of exposure scenarios. However, given the current desire to
develop regulatory standards for a range of exposure durations, models flex-
ible enough to describe the response patterns over varying levels of both
exposure concentration and duration are greatly needed.

For the heat shock studies, the vector of exposure covariates must incor-
porate both exposure level (also referred to as temperature or dose), dij ,
and duration (time), tij , for the jth embryo within the ith cluster. Fur-
thermore, models must be formulated in such a way that departures from
Haber’s premise of the same adverse response levels for any equivalent mul-
tiple of dose times duration can easily be assessed. The exposure metrics
in these models are the cumulative heat exposure, (dt)ij = dij tij , referred
to as durtemp and the effect of duration of exposure at positive increases
in temperature (the increase in temperature over the normal body temper-
ature of 37◦ C):
(pd)ij = tij I(dij > 37).
We refer to the latter as posdur. There are measurements on 13 morpholog-
ical variables. Some are binary; others are measured on a continuous scale.
Even though we will focus on continuous outcomes, as in Geys, Molen-
berghs, and Williams (1999), it is worth mentioning that a lot of work has
been done in the area of clustered binary outcomes and combined binary
and continuous outcomes.
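For concreteness, here is a sketch of how these exposure metrics might be
computed (our own illustration: temp and duration are hypothetical variable
names, and the dose scale is taken as the increase over the 37°C body
temperature, which matches the maximal durtemp value of 225 min °C
mentioned in Section 24.2.2):

data heatshock;
   set heatshock;
   durtemp = (temp - 37)*duration;    /* cumulative heat exposure (dt)ij */
   posdur  = duration*(temp > 37);    /* (pd)ij: duration at a positive increase */
run;

In Section 24.2.2, both covariates are further rescaled to the unit interval
before fitting.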

Indeed, as a result of the research activity over the past 10 to 15 years,
there are presently several different schools of thought regarding the best
approach to the analysis of correlated binary data. Unlike in the normal
setting, marginal, conditional, and random-effects approaches tend to give
dissimilar results, as do likelihood, quasi-likelihood, and GEE-based infer-
ential methods. There are many excellent reviews (Prentice 1988; Fitzmau-
rice, Laird, and Rotnitzky 1993; Diggle, Liang, and Zeger 1994; Pendergast
et al. 1996).

Several likelihood-based methods have been proposed. Fitzmaurice and
Laird (1993) incorporate marginal parameters for the main effects in their
model and quantify the degree of association by means of conditional odds
ratios. Fully marginal models are presented by Bahadur (1961) and Cox
(1972), using marginal correlations, and by Ashford and Sowden (1970),
using a dichotomized version of a multivariate normal to analyze multi-
variate binary data. Alternatively, marginal odds ratios can be used, as
shown by Dale (1986) and Molenberghs and Lesaffre (1994). Cox (1972) also
describes a model whose parameters have interpretations in terms of con-
ditional probabilities. Similar models were proposed by Rosner (1984) and
Liang and Zeger (1989). A full exponential family model was proposed by
Molenberghs and Ryan (1999) and Ryan and Molenberghs (1999). Pseudo-
likelihood methods were developed by Geys, Molenberghs, and Ryan (1997,
1999). Random-effects approaches have been studied by Stiratelli, Laird,
and Ware (1984), Zeger, Liang, and Albert (1988), Breslow and Clayton
(1993), and Wolfinger and O’Connell (1993). Generalized estimating equa-
tions were developed in Liang and Zeger (1986). A thorough account is
given in Fahrmeir and Tutz (1994). Williams, Molenberghs, and Lipsitz
(1996) consider ordinal outcomes in the context of the heat shock study.

Early work on the combination of continuous and discrete outcomes can
be found in Olkin and Tate (1961). We also refer to Cox and Wermuth
(1992, 1994) and to Schafer (1997). There is a substantial amount of work
which focuses on clustered data: Catalano and Ryan (1992), Catalano et
al. (1993), Chen (1993), Fitzmaurice and Laird (1995, 1997), Regan and
Catalano (1999a, 1999b), and Geys, Regan, Catalano, and Molenberghs
(1999). Molenberghs, Geys, and Buyse (1998) studied such models for use
in surrogate marker evaluation.

24.2.2 Analysis of Heat Shock Data

There are several continuous outcomes recorded in the heat shock study,
such as size measures on crown rump, yolk sac, and head. We will focus on
crown rump length (CRL).

It will be shown that the three components of variability customarily incor-
porated in a linear mixed-effects model of the form (3.11) can usefully be
applied here as well, even in the absence of a repeated-measures structure.

Although there is no doubt that random effects are used to model
interdam variability, and the role of the measurement error term is un-
ambiguous, it is less obvious what the role of the serial association would
be. Generally, serial association results from the fact that within a cluster,
residuals of individuals closer to each other are often more similar than
residuals for individuals further apart. Although this distance concept is
clear in longitudinal and spatial applications, it is less so in this context.
However, covariates like duration and temperature, or relevant transforma-
tions thereof, can play a similar role. This distinction is very useful since
random effects capture the correlation structure which is attributable to
the dam and hence includes genetic components. The serial correlation, on
the other hand, is entirely design driven. If one conjectures that the latter
component is irrelevant, then translation into a statistical hypothesis and,
consequently, testing for it are relatively straightforward. Note that such a
model is not possible in conventional developmental toxicity studies, where
exposure applies at the dam level, not at the individual fetus level.

The model we consider is based on Haber's Law and controlled deviations
thereof, in the sense that the fixed-effects structure includes durtemp and
posdur. For computational convenience, the ranges of these covariates are
transformed to the unit interval. The maximal values correspond to 225
min °C for durtemp and 60 minutes for posdur. The random-effects struc-
ture includes a random intercept and a random slope for dt. The residual
covariance structure is decomposed into a Gaussian serial process in dt and
measurement error. Formally,

Yij = (β1 + bi1 ) + (β2 + bi2 )(dt)ij + β3 (pd)ij + ε(1)ij + ε(2)ij , (24.3)

where the ε(1)ij are uncorrelated and follow a normal distribution with zero
mean and variance σ 2 . The ε(2)ij have zero mean, variance τ 2 , and serial
correlation
    hijk = exp{ −φ [(dt)ij − (dt)ik ]² }.

The random-effects vector (bi1 , bi2 ) is assumed to be a zero-mean normal
variable with covariance matrix D. SAS code to fit this model is

proc mixed data = heatshock method = ml covtest;
   class idnr;
   model crl = durtemp posdur / solution;
   random intercept durtemp / subject = idnr g v type = un;
   repeated / subject = idnr local type = sp(gau)(durtemp) r;
   parms (0.01) (0.07) (-0.03) (0.04) (4.26) (0.09) / nobound;
run;

TABLE 24.5. Heat Shock Study. Parameter estimates (standard errors) for initial
and final model.

Effect                          Parameter   Initial           Final

Fixed effects:
  Intercept                     β1           3.622 (0.034)     3.627 (0.042)
  Durtemp (dt)ij                β2          −1.558 (0.376)    −1.331 (0.353)
  Posdur (pd)ij                 β3           0.019 (0.006)     0.015 (0.006)

Random-effects parameters:
  var(b1i )                     d11          0.010 (0.014)     0.046 (0.014)
  cov(b1i , b2i )               d12         −0.038 (0.065)
  var(b2i )                     d22          0.071 (0.032)

Residual variance parameters:
  var(ε(1)ij )                  σ2           0.097 (0.014)     0.097 (0.014)
  var(ε(2)ij )                  τ2           0.044 (0.017)     0.042 (0.017)
  Spatial corr. parameter       φ            4.268 (5.052)     4.143 (3.772)

Since the 'nobound' option is added to the PARMS statement, variance
components are allowed to assume values on the whole real line. This im-
plies that conventional χ² tests can be used, rather than the mixtures
described in Section 6.3.4. The initial model is reproduced in Table 24.5.

First, the covariance model is simplified. The covariance between both
random effects is not significant and can be removed (G² = 3.35 on 1
degree of freedom, p = 0.067). Next, the random durtemp effect is re-
moved (G² = 3.63, 2 df, p = 0.057). The serial process cannot be removed
(G² = 6.19, 2 df, p = 0.045). Finally, both fixed effects are highly signifi-
cant and cannot be removed. The final model is given in Table 24.5. SAS
code for this model is

proc mixed data = heatshock method = ml covtest;
   class idnr;
   model crl = durtemp posdur / solution;
   random intercept / subject = idnr g v vcorr;
   repeated / subject = idnr local type = sp(gau)(durtemp) r rcorr;
   parms (0.032) (0.043) (2.015) (0.094) / nobound;
run;

FIGURE 24.2. Heat Shock Study. Fixed-effects structure for (a) the final model
and (b) the model with posdur removed.

FIGURE 24.3. Heat Shock Study. Fitted variogram.

FIGURE 24.4. Heat Shock Study. Fitted correlation function.

The fixed-effects structure is presented in Figure 24.2. The left-hand panel
shows the fixed-effects structure of the final model, as listed in Table 24.5.
The coefficient of durtemp is negative, indicating a decreasing crown rump
length with increasing exposure. This effect is reduced by a positive coef-
ficient for posdur. Fitting this model with posdur removed shows qualita-
tively the same trend, but the effect of exposure is much less pronounced,
underscoring that there is a significant deviation from Haber’s Law.

The fitted variogram is presented in Figure 24.3. Roughly half of the vari-
ability is attributed to measurement error, and the remaining half is divided
equally over the random intercept and the serial process. The correspond-
ing fitted correlation function is presented in Figure 24.4. The correlation
is about 0.50 for two fetuses that are at the exact same level of exposure.
It then decreases to 0.25 when the distance between exposures is maximal.
This confirms that half of the correlation is due to the random effect,
with the other half attributed to the serial process in durtemp.
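The numbers quoted here follow directly from the final-model estimates in
Table 24.5. A small sketch (our own illustration; it takes the serial
correlation function as exp(−φu²) with φ = 4.143, the parameterization that
reproduces the values read off Figure 24.4) is

data corrfun;
   d11 = 0.046; tau2 = 0.042; sigma2 = 0.097; phi = 4.143;
   do u = 0 to 1 by 0.05;   /* distance in (rescaled) durtemp */
      /* shared variability: random intercept plus serial component */
      corr = (d11 + tau2*exp(-phi*u**2)) / (d11 + tau2 + sigma2);
      output;   /* corr = 0.48 at u = 0, decreasing to 0.25 at u = 1 */
   end;
run;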

24.3 The Validation of Surrogate Endpoints from
Multiple Trials

24.3.1 Introduction

Surrogate endpoints are referred to as endpoints that can be used in lieu
of other endpoints in the evaluation of experimental treatments or other
interventions. Surrogate endpoints are useful when they can be measured
earlier, more conveniently, or more frequently than the endpoints of in-
terest, which are referred to as the “true” or “final” endpoints (Ellenberg
and Hamilton 1989). Biological markers of the disease process are often
proposed as surrogate endpoints for clinically meaningful endpoints, the
hope being that if a treatment showed benefit on the markers, it would
ultimately also show benefit upon the clinical endpoints of interest. Before
a surrogate endpoint can replace a final endpoint in the evaluation of an
experimental treatment, it must be formally “validated,” a process that has
caused a number of controversies and has not been fully elucidated so far.

In a landmark paper, Prentice (1989) proposed a formal definition of sur-
rogate endpoints, outlined how they could be validated, and, at the same
time, discussed intrinsic limitations in the surrogate marker validation
quest. Much debate ensued, since some authors perceived a formal criteria-
based approach as too stringent and not straightforward to verify (Fleming
et al. 1994). Freedman, Graubard, and Schatzkin (1992) took Prentice's ap-
proach one step further by introducing the proportion explained, which is
the proportion of the treatment effect mediated by the surrogate. Buyse
and Molenberghs (1998) and Molenberghs, Buyse et al. (1999) discussed
some issues with the proportion explained and proposed to enhance insight
by means of two new measures. The first, defined at the population level
and termed relative effect, is the ratio of the overall treatment effect on
the true endpoint over that on the surrogate endpoint. The second is the
individual-level association between both endpoints, after accounting for
the effect of treatment, and referred to as adjusted association.

Buyse et al. (2000) extended these concepts to situations in which data are
available from several randomized experiments. The individual-level associ-
ation between the surrogate and final endpoints carries over naturally, the
only change required being an additional stratification to account for the
presence of multiple experiments. The experimental unit can be the center
in a multicenter trial, or the trial in a meta-analysis context. We emphasize
the latter situation, because a sufficiently informative validation of a surro-
gate endpoint will typically require large numbers of observations coming
from several trials. Moreover, meta-analytic data usually carry a degree
of heterogeneity not encountered in a single trial, caused by differences in

patient population, study design, treatment regimens, and so forth. We
shall argue that these sources of heterogeneity increase one's confidence in
the validity of a surrogate endpoint, when the relationship between the ef-
fects of treatment on the surrogate and the true endpoints tends to remain
constant across such different situations.

The notion of relative effect can then be extended to a trial-level measure
of association between the effects of treatment on both endpoints. The two
measures of association, one at the individual level, and the other at the
trial level, are proposed as an alternative way to assess the usefulness of
a surrogate endpoint. This approach also naturally yields a prediction for
the effect of treatment on the true endpoint, based on the observation of
the effect of treatment on the surrogate endpoint.

In Section 24.3.2, Prentice’s definition and criteria, as well as Freedman’s


proportion explained, are reviewed. Notation and motivating examples are
presented in Section 24.3.3. The new concepts and an alternative valida-
tion strategy are introduced in Section 24.3.4. The examples are analyzed in
Section 24.3.5. Fitting of some of the models in Section 24.3.4 by means of
linear mixed-models methodology is computationally not straightforward.
Section 24.3.6 examines through simulations when numerical problems are
likely to occur. The emphasis is on normally distributed endpoints, for
which standard linear mixed models are appropriate. The mixed-models
methodology provides an easy-to-use framework that avoids many of the
complexities encountered with different response types. In practice, how-
ever, endpoints are seldom normally distributed. In Section 24.3.7, a brief
discussion of possible extensions to more general situations where the surro-
gate and true endpoints are of a different nature, such as the highly relevant
situation where the surrogate endpoint is binary and the final endpoint is
a survival time, possibly censored (Lin, Fleming, and DeGruttola 1997), is
presented. A profound treatment of these extensions is outside the scope
of the current text.

24.3.2 Validation Criteria

Prentice’s Definition

Prentice proposed to define a surrogate endpoint as "a response variable
for which a test of the null hypothesis of no relationship to the treatment
groups under comparison is also a valid test of the corresponding null hy-
pothesis based on the true endpoint” (Prentice 1989, p. 432). We adopt
the following notation: T and S are random variables that denote the true
and surrogate endpoints, respectively, and Z is an indicator variable for

treatment. Prentice's definition can be written

    f (S|Z) = f (S)  ⇔  f (T |Z) = f (T ),                    (24.4)
where f (X) denotes the probability distribution of random variable X and
f (X|Z) denotes the probability distribution of X conditional on the value
of Z. As such, this definition is of limited value since a direct verification
that a triplet (T, S, Z) fulfills the definition would require a large number of
experiments to be conducted with information on the triplet. Even if many
experiments were available, the equivalence of the statistical tests implied
in (24.4) might not be true in all of them because of chance fluctuations
and/or lack of statistical power. Operational criteria are therefore needed
to check if definition (24.4) is fulfilled.

Prentice’s Criteria

Four operational criteria have been proposed to check if a triplet (T, S, Z)
fulfills the definition. The first two verify departures from the null hypothe-
ses implicit in (24.4):

    f (S|Z) ≠ f (S),                                          (24.5)
    f (T |Z) ≠ f (T ).                                        (24.6)
Strictly speaking, (24.5) and (24.6) are not criteria since having both
f (T |Z) = f (T ) and f (S|Z) = f (S) is consistent with the definition (24.4).
However, in this case, the validation is practically impossible since one may
fail to detect differences due to lack of power. Thus, in practice, the vali-
dation requires Z to have an effect on both T and S. Several authors have
pointed out that requiring Z to have a statistically significant effect on T
may be excessively stringent, for in that case, from the limited perspective
of significance testing, there would no longer be a need to establish the
surrogacy of S (Fleming et al. 1994).

The other two criteria are

    f (T |S) ≠ f (T ),                                        (24.7)
    f (T |S, Z) = f (T |S).                                   (24.8)
Buyse and Molenberghs (1998) reproduce the arguments that establish
the sufficiency of conditions (24.7) and (24.8) for (24.4) to hold in the
case of binary responses. It is also easy to show that condition (24.7) is
always necessary for (24.4), and that condition (24.8) is necessary for binary
endpoints but not in general. Indeed, suppose (24.8) does not hold; then,
assuming that f (S|Z) = f (S),

    f (T |Z) = ∫ f (T |S, Z)f (S) dS,                         (24.9)

    f (T ) = ∫ f (T |S)f (S) dS.                              (24.10)

However, (24.9) and (24.10) are, in general, not equal to one another, in
which case definition (24.4) is violated. Nevertheless, it is possible to con-
struct examples where f (T |Z) = f (T ), in which case the definition still holds
despite the fact that (24.8) does not hold. Hence, (24.8) is not a necessary
condition, except for binary endpoints.

Next, assume (24.8) holds but (24.7) does not. Then,

    f (T |Z) = ∫ f (T |S)f (S|Z) dS = ∫ f (T )f (S|Z) dS = f (T ),

and hence f (T |Z) = f (T ) regardless of the relationship between S and Z.
The simplest example is the situation where T is independent of the pair
(S, Z). Thus, (24.7) is necessary to avoid situations where one null hypothe-
sis is true while the other is not. However, criteria (24.5) and (24.6) already
imply that both null hypotheses must be rejected, and therefore criterion
(24.7) is of no additional value. In fact, criterion (24.7) indicates that the
surrogate endpoint has prognostic relevance for the final endpoint, a con-
dition which will obviously be fulfilled by any sensible surrogate endpoint.
Conditions (24.5)–(24.8) are informative and will tend to be fulfilled for
valid surrogate endpoints, but they should not be regarded as strict crite-
ria. Condition (24.8) captures the essential notion of surrogacy by requiring
that the treatment is irrelevant for predicting the true outcome, given the
surrogate. In the next section, we discuss how Freedman, Graubard, and
Schatzkin (1992) used this concept in estimation rather than in testing.
Our meta-analytic development, laid out in Section 24.3.4, also emphasizes
estimation and prediction rather than hypothesis testing.

Freedman’s Proportion Explained

Freedman, Graubard, and Schatzkin (1992) argued that criterion (24.8)
raises a conceptual difficulty in that it requires the statistical test for treat-
ment effect on the true endpoint to be nonsignificant after adjustment for
the surrogate. The nonsignificance of this test does not prove that the effect
of treatment upon the true endpoint is fully captured by the surrogate, and
therefore Freedman, Graubard, and Schatzkin (1992) proposed to calculate
the proportion of the treatment effect explained by the surrogate. In this
paradigm, a good surrogate is one for which this proportion explained (P E)
is close to unity (Prentice’s criterion (24.8) would require that P E = 1).
Buyse and Molenberghs (1998) and Molenberghs, Buyse, et al. (1999) out-
lined some conceptual difficulties with the P E, in particular that it is not
a proportion: P E can be estimated to be anywhere on the real line, which
complicates its interpretation. They argued that P E can advantageously
be replaced by two related quantities: the relative effect (RE), which is the
ratio of the effects of treatment upon the final and the surrogate endpoint,
and the treatment-adjusted association between the surrogate and the true
endpoint, ρZ . Therefore, these proposals are extended using data from sev-
eral experiments. Motivating examples are introduced in the next section,
and our approach in Section 24.3.4.

FIGURE 24.5. Advanced Ovarian Cancer. Scatter plot of progression-free sur-
vival versus survival.

24.3.3 Notation and Motivating Examples

Suppose we have data from N trials, in the ith of which Ni subjects are
enrolled. Let Tij and Sij be random variables that denote the true and
surrogate endpoints, respectively, for the jth subject in the ith trial, and let
Zij be an indicator variable for treatment. Although the main focus of this
work is on binary treatment indicators, the methods proposed generalize
without difficulty to multiple category indicators for treatment, as well as to
situations where covariate information is used in addition to the treatment
indicators.

An Example in Ovarian Cancer

Our methods will first be illustrated using data from a meta-analysis of four
randomized multicenter trials in advanced ovarian cancer (Ovarian Can-
cer Meta-Analysis Project 1991). Individual patient data are available in
these four trials for the comparison of two treatment modalities: cyclophos-
phamide plus cisplatin (CP) versus cyclophosphamide plus adriamycin plus
cisplatin (CAP). The binary indicator for treatment (Zij ) will be set to 0
for CP and to 1 for CAP. The surrogate endpoint Sij will be the logarithm
of time to progression, defined as the time (in weeks) from randomization
to clinical progression of the disease or death due to the disease, and the
final endpoint Tij will be the logarithm of survival, defined as the time (in
weeks) from randomization to death from any cause. The full results of this
meta-analysis were published with a minimum follow-up of 5 years in all
trials (Ovarian Cancer Meta-Analysis Project 1991). The data set was sub-
sequently updated to include a minimum follow-up of 10 years in all trials
(Ovarian Cancer Meta-Analysis Project 1998). After such long follow-up,
most patients have had a disease progression or have died (952 of 1194
patients, i.e., 80%), so censoring will be ignored in our analyses. Methods
that account for censoring would admittedly be preferable, but we ignore it
here for the purposes of illustrating the case where the surrogate and final
endpoints are both normally distributed.

The ovarian cancer data set contains only four trials. This will turn out to
be insufficient to apply the meta-analytic methods of Section 24.3.4. In the
two larger trials, information is also available on the centers in which the
patients had been treated. We can then use center as the unit of analysis for
the two larger trials, and the trial as the unit of analysis for the two smaller
trials. A total of 50 units are thus available for analysis, with a number of
individual patients per unit ranging from 2 to 274. To assess sensitivity, all
analyses will be performed with and without the two smaller trials in which
center is unknown. A scatter plot of the surrogate and true endpoints for
all individuals in the trials included is presented in Figure 24.5.

The first three Prentice criteria (24.5)–(24.7) are provided by tests of sig-
nificance of parameters α, β, and γ in the following models:

    Sij |Zij = µS + αZij + εSij ,                             (24.11)
    Tij |Zij = µT + βZij + εT ij ,                            (24.12)
    Tij |Sij = µ + γSij + εij ,                               (24.13)

where εSij , εT ij , and εij are independent Gaussian errors with mean zero.
If the analysis is restricted to the two large trials in which center is known,
α = 0.228 [standard error (s.e.) 0.091, p = 0.013], β = 0.149 (s.e. 0.085,
p = 0.079), and γ = 0.874 (s.e. 0.011, p < 0.0001). Strictly speaking, the

criteria are not fulfilled because β fails to reach statistical significance. This
will often be the case, since a surrogate endpoint is needed when there is
no convincing evidence of a treatment effect upon the true endpoint.

As emphasized in Section 24.3.2, we cannot strictly show that the last
criterion (24.8) is fulfilled. Instead, we can calculate Freedman's proportion
explained:

    P E = 1 − βS /β,                                          (24.14)

where β is the estimate of the effect of Z on T as in (24.12) and βS is the
estimate of the effect of Z on T after adjustment for S:

    Tij |Zij , Sij = µ̃T + βS Zij + γZ Sij + ε̃T ij .          (24.15)

Here, βS = −0.051 (s.e. 0.028), so that P E = 1 − (−0.051/0.149) = 1.34
(95% delta confidence limits [0.73; 1.95]). The proportion explained is
larger than 100%, because the
direction of the effect of Z on T is reversed after adjustment for S. Another
problem would arise if there were a strong interaction between Z and S,
which would require the following model to be fitted instead of (24.15):

    Tij |Zij , Sij = µ̌T + β̌S Zij + ρ̌Z Sij + δZij Sij + ε̌T ij .    (24.16)

With this model, P E would cease to be captured by a single number and the
validation process would have to stop (Freedman, Graubard, and Schatzkin
1992). In the two large ovarian cancer trials, the interaction term is not
statistically significant (δ = 0.014, s.e. 0.022), and therefore model (24.15)
may be used.

Buyse and Molenberghs (1998) suggested to replace the P E by two quan-
tities: the relative effect

    RE = β/α                                                  (24.17)
and the association ρZ between T and S, adjusted for Z, which can be
calculated from jointly modeling (24.11) and (24.12). To this end, the error
terms of (24.11) and (24.12) are assumed to follow a bivariate Gaussian
distribution with zero mean and general 2 × 2 covariance matrix. In this
case, RE = 0.65 (95% confidence limits [0.36; 0.95]) and ρZ = 0.944 (95%
confidence limits [0.94; 0.95]). Thus, the adjusted correlation is very close to
1 and estimated with high precision. The relative effect is determined with
reasonable precision and enables calculation of the predicted effect of treat-
ment upon survival based on the observed effect upon time to progression
in a new trial. However, this prediction is based on the strong assumption
of a regression through the origin based on a single pair (α̂, β̂).
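Numerically, RE = β̂/α̂ = 0.149/0.228 ≈ 0.65, in agreement with the interval
quoted above. The adjusted association ρZ follows from the joint fit of
(24.11) and (24.12); a minimal PROC MIXED sketch (assuming a data set
with one record per endpoint per patient and variables y, endpoint, treat,
and patid, names that are ours) is

proc mixed data = ovarian covtest;
   class endpoint patid;
   /* endpoint-specific intercepts and treatment effects; the 2 x 2
      unstructured residual covariance links S and T within a patient */
   model y = endpoint endpoint*treat / noint solution;
   repeated endpoint / subject = patid type = un rcorr;
run;

The off-diagonal element of the printed residual correlation matrix is the
adjusted association ρZ .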

When the two smaller trials are included in the analysis, the results change
very little, providing evidence for the validity of considering each of the
smaller trials as a single center. The p-values for α, β, and γ become 0.003,
0.054, and < 0.0001, respectively, and P E = 1.46 (95% confidence limits
[0.80; 2.13]), RE = 0.60 (95% confidence limits [0.32; 0.87]), and ρZ = 0.942
(95% confidence limits [0.94; 0.95]). By including both trials, the precision
is somewhat improved. However, in this case, the interaction term in model
(24.16) is statistically significant (δ = 0.037, s.e. 0.018), further complicat-
ing the interpretation of P E.

An Example in Ophthalmology

The second example concerns a clinical trial for patients with age-related
macular degeneration, a condition in which patients progressively lose vi-
sion (Pharmacological Therapy for Macular Degeneration Study Group
1997). In this example, the binary indicator for treatment (Zij ) is set to
0 for placebo and to 1 for interferon-α. The surrogate endpoint Sij is the
change in the visual acuity (which we assume to be normally distributed) at
6 months after starting treatment, and the final endpoint Tij is the change
in the visual acuity at 1 year. The data are presented in Figure 24.6. The
first three Prentice criteria (24.5)–(24.7) are again provided by tests of sig-
nificance of parameters α, β, and γ. Here, α = −1.90 (s.e. 1.87, p = 0.312),
β = −2.88 (s.e. 2.32, p = 0.216), and γ = 0.92 (s.e. 0.06, p < 0.001). Only
γ is statistically significant and therefore the validation procedure has to
stop inconclusively. Note, however, that the lack of statistical significance
of α and β could merely be due to the insufficient number of observations
available in this trial. Also note that α and β are negative, hinting at a
negative effect of interferon-α upon visual acuity. Freedman’s proportion
explained is calculated as P E = 0.61 (95% confidence limits [−0.19; 1.41]).
The relative effect is RE = 1.51 (95% confidence limits [−0.46; 3.49]), and
the adjusted association ρZ = 0.74 (95% confidence limits [0.68; 0.81]). The
adjusted association is determined rather precisely, but the confidence lim-
its of P E and RE are too wide to convey any useful information. Even so,
as we will see in Section 24.3.5, some conclusions can be reached in this
example that are in sharp contrast to those reached in the ovarian cancer
example.

An Example in Colorectal Cancer

As a third example, we will use data from two randomized multicenter trials
in advanced colorectal cancer (Corfu-A Study Group 1995; Greco et al .
1996). In one trial, treatment with fluorouracil and interferon (5FU/IFN)
was compared to treatment with 5FU plus folinic acid (5FU/LV) (Corfu-A

FIGURE 24.6. Age-Related Macular Degeneration Trial. Scatter plot of visual acuity at 6 months (surrogate) versus visual acuity at 12 months (true endpoint).

Study Group 1995). In the other trial, treatment with 5FU plus interferon
(5FU/IFN) was compared to treatment with 5FU alone (Greco et al . 1996).
The binary indicator for treatment (Zij ) will be set to 0 for 5FU/IFN and to 1
for 5FU/LV or 5FU alone. The surrogate endpoint Sij will be progression-
free survival time, defined as the time (in years) from randomization to
clinical progression of the disease or death, and the final endpoint Tij will
be survival time, defined as the time (in years) from randomization to
death from any cause. Most patients in the two trials have had a disease
progression or have died (694 of 736 patients, i.e., 94.3%).

Similarly to the ovarian cancer example, we will use center as the unit of
analysis. A total of 76 units are thus available for analysis. However, in
eight centers, one of the treatment arms accrued no patients. These eight
centers were therefore excluded from the analysis. As a result, the data
used for illustration contained 68 units, with a number of individual pa-
tients per unit ranging from 2 to 38. The data are graphically depicted
in Figure 24.7. Fitting (24.11)–(24.13) yields α̂ = 0.021 (s.e. 0.066, p = 0.749), β̂ = 0.002 (s.e. 0.075, p = 0.794), and γ̂ = 0.917
(s.e. 0.031, p < 0.0001). As with the ovarian cancer case, the criteria are
not fulfilled because both β and α fail to reach statistical significance. The
proportion explained is estimated as 0.985 (95% delta confidence limits
[−3.44; 5.41]). Further, RE = 0.931 (95% confidence limits [−3.23; 5.10]).
Both quantities have estimated confidence limits that are too wide to be

FIGURE 24.7. Advanced Colorectal Cancer. Scatter plot of progression-free survival versus survival.

practically useful. Finally, ρZ = 0.802 (95% confidence limits [0.77; 0.83]). This estimate of the adjusted association indicates that there is a substantial correlation between an individual’s two endpoints.

24.3.4 A Meta-Analytic Approach

We focus on surrogate and true endpoints which are assumed to be jointly normally distributed. Two distinct modeling strategies will be followed,
based on a two-stage fixed-effects representation on the one hand (see also
Section 3.2) and the linear mixed model on the other hand.

Let us describe the two-stage model first. The first stage is based upon a
fixed-effects model:
\[
S_{ij} \mid Z_{ij} = \mu_{Si} + \alpha_i Z_{ij} + \varepsilon_{Sij}, \qquad (24.18)
\]
\[
T_{ij} \mid Z_{ij} = \mu_{Ti} + \beta_i Z_{ij} + \varepsilon_{Tij}, \qquad (24.19)
\]
where µ_Si and µ_Ti are trial-specific intercepts, α_i and β_i are trial-specific effects of treatment Z on the endpoints in trial i, and ε_Sij and ε_Tij are correlated error terms, assumed to be zero-mean normally distributed with covariance matrix
\[
\Sigma = \begin{pmatrix} \sigma_{SS} & \sigma_{ST} \\ \sigma_{ST} & \sigma_{TT} \end{pmatrix}. \qquad (24.20)
\]

At the second stage, we assume
\[
\begin{pmatrix} \mu_{Si} \\ \mu_{Ti} \\ \alpha_i \\ \beta_i \end{pmatrix}
=
\begin{pmatrix} \mu_S \\ \mu_T \\ \alpha \\ \beta \end{pmatrix}
+
\begin{pmatrix} m_{Si} \\ m_{Ti} \\ a_i \\ b_i \end{pmatrix}, \qquad (24.21)
\]
where the second term on the right-hand side of (24.21) is assumed to follow a mean-zero normal distribution with dispersion matrix
\[
D = \begin{pmatrix}
d_{SS} & d_{ST} & d_{Sa} & d_{Sb} \\
       & d_{TT} & d_{Ta} & d_{Tb} \\
       &        & d_{aa} & d_{ab} \\
       &        &        & d_{bb}
\end{pmatrix}. \qquad (24.22)
\]

Next, the random-effects representation is based upon combining both steps:
\[
S_{ij} \mid Z_{ij} = \mu_S + m_{Si} + \alpha Z_{ij} + a_i Z_{ij} + \varepsilon_{Sij}, \qquad (24.23)
\]
\[
T_{ij} \mid Z_{ij} = \mu_T + m_{Ti} + \beta Z_{ij} + b_i Z_{ij} + \varepsilon_{Tij}, \qquad (24.24)
\]
where now µ_S and µ_T are fixed intercepts, α and β are the fixed effects of treatment Z on the endpoints, m_Si and m_Ti are random intercepts, and a_i and b_i are the random effects of treatment Z on the endpoints in trial i. The vector of random effects (m_Si, m_Ti, a_i, b_i) is assumed to be zero-mean normally distributed with covariance matrix (24.22). The error terms ε_Sij and ε_Tij follow the same assumptions as in the fixed-effects model (24.18)–(24.19), with covariance matrix (24.20), thereby completing the specification of the linear mixed model. Section 24.3.10 provides sample SAS code to fit this particular model.

Much debate has been devoted to the relative merits of fixed versus random
effects, especially in the context of meta-analysis (Thompson and Pocock
1991, Thompson 1993, Fleiss 1993, Senn 1998). Although the underlying
models rest on different assumptions about the nature of the experiments
being analyzed, the two approaches yield discrepant results only in patho-
logical situations, or in very small samples where a fixed-effects analysis can
yield artificially precise results if the experimental units truly constitute a
random sample from a larger population. In our setting, both approaches
are very similar, and the two-stage procedure can be used to introduce
random effects (Section 3.2; see also Laird and Ware 1982). As the data
analyses in Section 24.3.5 will illustrate, the choice between random and
fixed effects can also be guided by pragmatic arguments.

Trial-Level Surrogacy

The key motivation for validating a surrogate endpoint is to be able to predict the effect of treatment on the true endpoint based on the observed
effect of treatment on the surrogate endpoint. It is essential, therefore, to
explore the quality of the prediction of the treatment effect on the true
endpoint in trial i by (a) information obtained in the validation process
based on trials i = 1, . . . , N and (b) the estimate of the effect of Z on S in
a new trial i = 0. Fitting either the fixed-effects model (24.18)–(24.19) or
the mixed-effects model (24.23)–(24.24) to data from a meta-analysis pro-
vides estimates for the parameters and the variance components. Suppose
then the new trial i = 0 is considered for which data are available on the
surrogate endpoint but not on the true endpoint. We then fit the following
linear model to the surrogate outcomes S_0j:
\[
S_{0j} = \mu_{S0} + \alpha_0 Z_{0j} + \varepsilon_{S0j}. \qquad (24.25)
\]
Estimates for m_S0 and a_0 are
\[
\widehat{m}_{S0} = \widehat{\mu}_{S0} - \widehat{\mu}_S, \qquad
\widehat{a}_0 = \widehat{\alpha}_0 - \widehat{\alpha}.
\]

We are interested in the estimated effect of Z on T, given the effect of Z on S. To this end, observe that (β + b_0 | m_S0, a_0) follows a normal distribution with mean and variance:
\[
E(\beta + b_0 \mid m_{S0}, a_0)
= \beta +
\begin{pmatrix} d_{Sb} \\ d_{ab} \end{pmatrix}'
\begin{pmatrix} d_{SS} & d_{Sa} \\ d_{Sa} & d_{aa} \end{pmatrix}^{-1}
\begin{pmatrix} \mu_{S0} - \mu_S \\ \alpha_0 - \alpha \end{pmatrix}, \qquad (24.26)
\]
\[
\mathrm{var}(\beta + b_0 \mid m_{S0}, a_0)
= d_{bb} -
\begin{pmatrix} d_{Sb} \\ d_{ab} \end{pmatrix}'
\begin{pmatrix} d_{SS} & d_{Sa} \\ d_{Sa} & d_{aa} \end{pmatrix}^{-1}
\begin{pmatrix} d_{Sb} \\ d_{ab} \end{pmatrix}. \qquad (24.27)
\]

This suggests calling a surrogate perfect at the trial level if the conditional
variance (24.27) is equal to zero. A measure to assess the quality of the
surrogate at the trial level is the coefficient of determination
\[
R^2_{\mathrm{trial(f)}} = R^2_{b_i \mid m_{Si}, a_i} =
\frac{
\begin{pmatrix} d_{Sb} \\ d_{ab} \end{pmatrix}'
\begin{pmatrix} d_{SS} & d_{Sa} \\ d_{Sa} & d_{aa} \end{pmatrix}^{-1}
\begin{pmatrix} d_{Sb} \\ d_{ab} \end{pmatrix}}{d_{bb}}. \qquad (24.28)
\]
Coefficient (24.28) is unitless and ranges in the unit interval if the cor-
responding variance-covariance matrix is positive definite, two desirable
features, facilitating interpretation. Intuition can be gained by considering

the special case where the prediction of b_0 can be done independently of the random intercept m_S0. Expressions (24.26) and (24.27) then reduce to
\[
E(\beta + b_0 \mid a_0) = \beta + \frac{d_{ab}}{d_{aa}}(\alpha_0 - \alpha),
\]
\[
\mathrm{var}(\beta + b_0 \mid a_0) = d_{bb} - \frac{d_{ab}^2}{d_{aa}},
\]
with corresponding
\[
R^2_{\mathrm{trial(r)}} = R^2_{b_i \mid a_i} = \frac{d_{ab}^2}{d_{aa} d_{bb}}. \qquad (24.29)
\]
Now, R²_trial(r) = 1 if the trial-level treatment effects are simply multiples of each other. We will refer to this simplified version as the reduced random-effects model, and the original expression (24.28) will be said to derive from the full random-effects model.

An estimate for β + b_0 is obtained by replacing the right-hand side of (24.26) with the corresponding parameter estimates. A confidence interval
is obtained by applying the delta method to (24.26). The covariance matrix
of the parameters involved is obtained from the meta-analysis, except for
µS0 and α0 , which are obtained from fitting (24.25) to the data of the
new trial. The corresponding prediction interval is found by adding (24.27)
to the variance obtained for the confidence interval. Details are given in
Section 24.3.9.
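Computationally, (24.26)–(24.28) amount to a handful of matrix operations once estimates of the relevant elements of D are available. The following SAS/IML fragment is a minimal sketch of these calculations; every numerical value is a hypothetical placeholder, not an estimate from any of the case studies.

proc iml;
/* Hypothetical parameter estimates, for illustration only */
beta = 0.70;                     /* fixed treatment effect on T        */
dbb  = 0.20;                     /* d_bb = var(b_i)                    */
D1   = {0.05 0.15};              /* (d_Sb, d_ab)                       */
D2   = {0.10 0.04, 0.04 0.18};   /* [d_SS d_Sa; d_Sa d_aa]             */
D3   = {0.02, 0.25};             /* (mu_S0 - mu_S, alpha_0 - alpha)    */
pred    = beta + D1 * inv(D2) * D3;    /* conditional mean, as in (24.26)      */
condvar = dbb - D1 * inv(D2) * D1`;    /* conditional variance, as in (24.27)  */
R2trial = (D1 * inv(D2) * D1`) / dbb;  /* trial-level R-squared, as in (24.28) */
print pred condvar R2trial;
quit;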

There is a close connection between the prediction approach followed here and empirical Bayes estimation (Chapter 7). To see this, consider a similar but nonidentical approach where all data are analyzed together. This
means that a meta-analysis is performed of the surrogate data on trials
i = 0, . . . , N and of the true endpoint data on trials i = 1, . . . , N . The esti-
mate of b0 will be based only on the surrogate data, since the true endpoint
is unknown for trial i = 0, and on the parameter estimates. The expres-
sion for the empirical Bayes estimate of b0 is identical to (24.26), but the
numerical value will be slightly different since the parameters of the linear
mixed model are determined on a larger set of data. For example, with
the MIXED procedure in SAS, obtaining the empirical Bayes estimate of
b0 is immediate, and the conditional variance follows from the estimated
standard error of prediction (Littell et al . 1996).
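As a minimal sketch of this route, and assuming the program of Section 24.3.10 as the starting point, requesting the random-effects solutions from PROC MIXED prints the empirical Bayes estimates together with their prediction standard errors:

random endpoint*treat / sub = trial type = un solution;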

Individual-Level Surrogacy

To validate a surrogate endpoint, Buyse and Molenberghs (1998) suggested considering the association between the surrogate and the final endpoints

after adjustment for the treatment effect. To this end, we need to construct
the conditional distribution of T , given S and Z. From (24.18)–(24.19), we
derive
\[
T_{ij} \mid Z_{ij}, S_{ij} \sim N\Big(
\mu_{Ti} - \sigma_{TS}\sigma_{SS}^{-1}\mu_{Si}
+ (\beta_i - \sigma_{TS}\sigma_{SS}^{-1}\alpha_i) Z_{ij}
+ \sigma_{TS}\sigma_{SS}^{-1} S_{ij};\;
\sigma_{TT} - \sigma_{TS}^2 \sigma_{SS}^{-1}
\Big). \qquad (24.30)
\]

Similarly, the random-effects model (24.23)–(24.24) yields
\[
T_{ij} \mid Z_{ij}, S_{ij} \sim N\Big(
\mu_T + m_{Ti} - \sigma_{TS}\sigma_{SS}^{-1}(\mu_S + m_{Si})
+ [\beta + b_i - \sigma_{TS}\sigma_{SS}^{-1}(\alpha + a_i)] Z_{ij}
+ \sigma_{TS}\sigma_{SS}^{-1} S_{ij};\;
\sigma_{TT} - \sigma_{TS}^2 \sigma_{SS}^{-1}
\Big), \qquad (24.31)
\]

where conditioning is also on the random effects. The association between both endpoints after adjustment for the treatment effect is in both (24.30) and (24.31) captured by
\[
R^2_{\mathrm{indiv}} = R^2_{\varepsilon_{Ti} \mid \varepsilon_{Si}} = \frac{\sigma_{ST}^2}{\sigma_{SS}\sigma_{TT}}, \qquad (24.32)
\]
the squared correlation between S and T after adjustment for both the trial effects and the treatment effect. Note that R_{ε_Ti|ε_Si} generalizes the adjusted association ρ_Z of Section 24.3.3 to the case of several trials.
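Both (24.30) and (24.31) are instances of the standard conditional distribution for a bivariate normal vector, applied here to the pair of error terms:
\[
\varepsilon_{Tij} \mid \varepsilon_{Sij} \sim N\left(\sigma_{TS}\sigma_{SS}^{-1}\varepsilon_{Sij},\; \sigma_{TT} - \sigma_{TS}^2\sigma_{SS}^{-1}\right).
\]
Writing ε_Sij as S_ij minus its fixed (or fixed-plus-random) regression part then yields the stated conditional means and the common conditional variance.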

A New Approach to Surrogate Evaluation

The above developments suggest calling a surrogate trial-level valid if R²_trial(f) (or R²_trial(r)) is sufficiently close to one, and individual-level valid if R²_indiv is sufficiently close to one. Finally, a surrogate is termed valid if it is both trial-level and individual-level valid. In order to replace the word “valid” with “perfect,” the corresponding R-squared values are required to equal one.

To be useful in practice, a valid surrogate must be able to predict the effect of treatment upon the true endpoint with sufficient precision to distinguish
safely between effects that are clinically worthwhile and effects that are
not. This requires both that the estimate of β + b0 be sufficiently large and
that the prediction interval of this quantity be sufficiently narrow.

It should be noted that the validation criteria proposed here do not require
the treatment to have a significant effect on either endpoint. In particu-
lar, it is possible to have α ≡ 0 and yet have a perfect surrogate. Indeed,
even though the treatment may not have any effect on the surrogate end-
point as a whole, the fluctuations around zero in individual trials (or other
experimental units) can be very strongly predictive of the effect on the

true endpoint. However, such a situation is unlikely to occur since the het-
erogeneity between the trials is generally small compared to that between
individual patients.

Validation in a Single Trial

If data are available on a single trial (or, more generally, on a single exper-
imental unit), the above developments are only partially possible. While
the individual-level reasoning (producing ρZ as in (24.32)) carries over by
virtue of the within-trial replication, the trial-level reasoning breaks down
and one cannot go beyond the relative effect (RE) as suggested in Buyse
and Molenberghs (1998). Recall that the RE is defined as the ratio of the
effects of Z on S and T , respectively, as expressed in (24.17). The confi-
dence limits of RE can be used to assess the uncertainty about the value
of β predicted from that of α, but in contrast to the above developments,
no sensible prediction interval can be calculated for β.

24.3.5 Data Analysis

Advanced Ovarian Cancer

As in Section 24.3.3, all analyses have been performed with and without the
two smaller trials. Excluding the two smaller trials has very little impact
on the estimates of interest, and therefore the results reported are those ob-
tained with all four trials. Two-stage fixed-effects models (24.18)–(24.19)
could be fitted, as well as a reduced version of the mixed-effects model
(24.23)–(24.24), with random treatment effects but no random intercepts.
Point estimates for the two types of model are in close agreement, although
standard errors are smaller by roughly 35% in the random-effects model.
Figure 24.8 shows a plot of the treatment effects on the true endpoint (logarithm of survival) by the treatment effects on the surrogate endpoint (logarithm of time to progression). These effects are highly correlated. Similarly to the random-effects situation, we refer to the models with and without the intercept used for determining R² as the reduced and full fixed-effects models. The reduced fixed-effects model provides R²_trial(r) = 0.939 (s.e. 0.017). When the sample sizes of the experimental units are used to weight the pairs (â_i, b̂_i), then R²_trial(r) = 0.916 (s.e. 0.023). The full fixed-effects model yields R²_trial(f) = 0.940 (s.e. 0.017). In the reduced random-effects model, R²_trial(r) = 0.951 (s.e. 0.098).

Predictions of the effect of treatment on log(survival) based on the observed effect of treatment on log(time to progression) are of interest. Table 24.6

FIGURE 24.8. Advanced Ovarian Cancer. Treatment effects on the true endpoint
(logarithm of survival time) versus treatment effects on the surrogate endpoint
(logarithm of time to progression) for all units of analysis. The size of each point
is proportional to the number of patients in the corresponding unit.

reports prediction intervals for several experimental units: six centers taken at random from the two large trials, and the two small trials in which center is unknown. Note that none of the predictions is significantly different from zero. The predicted values for β + b_0 agree reasonably well with the effects estimated from the data. The ratio β̂_0/α̂_0 ranges from 0.69 to 0.73, which is close to the RE estimated in Section 24.3.3.
At the individual level, R²_indiv = 0.886 (s.e. 0.006) in the fixed-effects model, and R²_indiv = 0.888 (s.e. 0.006) in the reduced random-effects model. The square roots of these quantities are respectively 0.941 and 0.942, very close to the value of ρ_Z estimated in Section 24.3.3. Figure 24.9 displays a scatter plot of the residuals on both endpoints. It exhibits the close relationship which exists between both endpoints at the individual level.

Thus, we conclude that time to progression can be used as a surrogate for survival in advanced ovarian cancer. The effect of treatment can be observed earlier if time to progression is used instead of survival, and it is also more pronounced, as shown by the overall Kaplan-Meier estimates of Figure 24.10. Hence, a trial that used time to progression would require less follow-up time and fewer patients to establish the statistical significance

TABLE 24.6. Advanced Ovarian Cancer. Predictions. Standard errors are shown in parentheses. The number of patients is reported for each unit, as well as which sample is used for the estimation (only two trials or all four). α̂_0 and β̂_0 are values estimated from the data; E(β + b_0|a_0) is the predicted effect of treatment on survival (β̂_0), given its effect upon time to progression (α̂_0). The DACOVA and GONO trials are the two smaller studies, for which predictions are based on parameter estimates from the centers in the two larger studies.

Unit      n_i  Trials  α̂_0           E(β + b_0|a_0)  β̂_0
6          17  2       −0.58 (0.33)  −0.45 (0.29)    −0.56 (0.32)
               4                     −0.45 (0.29)
8          10  2        0.67 (0.76)   0.49 (0.57)     0.76 (0.39)
               4                      0.47 (0.56)
37         12  2        1.02 (0.61)   0.76 (0.54)     1.04 (0.70)
               4                      0.73 (0.53)
49         40  2        0.54 (0.34)   0.39 (0.26)     0.28 (0.28)
               4                      0.37 (0.25)
55         31  2        1.08 (0.56)   0.80 (0.44)     0.79 (0.45)
               4                      0.77 (0.44)
BB         21  2       −1.05 (0.55)  −0.80 (0.46)    −0.79 (0.51)
               4                     −0.79 (0.46)
DACOVA    274  2        0.25 (0.15)   0.17 (0.13)     0.14 (0.14)
GONO      125  2        0.15 (0.25)   0.10 (0.20)     0.03 (0.22)

of a truly superior treatment than a trial that used survival (Chen et al. 1998).

The results derived here are considerably more useful than the conclusions
in Section 24.3.3. Indeed, the first three Prentice criteria provide only mar-
ginal evidence and PE cannot be estimated on the full data set, since there
is a three-way interaction between Z, S, and T . RE is meaningful and esti-
mated with precision, but it is derived from a regression through the origin
based on a single data point. In contrast, the approach used here combines
evidence from several experimental units and allows prediction intervals to
be calculated for the effect of treatment on the true endpoint.

Age-Related Macular Degeneration

The age-related macular degeneration data come from a single multicenter trial. Therefore, it is natural to consider the center in which the patients

FIGURE 24.9. Advanced Ovarian Cancer. Residuals of progression-free survival versus survival, after correction for treatment and center effect.

were treated as the unit of analysis. A total of 36 centers were thus available
for analysis, with a number of individual patients per center ranging from
2 to 18.

Figure 24.11(a) shows a plot of the raw data (true endpoint versus surrogate
endpoint for all individual patients). Irrespective of the software tool used
(SAS, SPlus, MLwiN), the random effects are difficult to obtain. Therefore,
we report only the result of a two-stage fixed-effects model and explore the
computational issues further in Section 24.3.6. Figure 24.11(b) shows a
plot of the treatment effects on the true endpoint by the treatment effects
on the surrogate endpoint. These effects are moderately correlated, with R²_trial(f) = 0.692 (s.e. 0.087). The estimates based on the reduced model are virtually identical. At the individual level, R²_indiv = 0.483 (s.e. 0.053). Note that R_indiv = 0.69 is close to ρ_Z = 0.74, as estimated in Section 24.3.3. The coefficients of determination R²_trial(r) and R²_indiv are both too low to make visual acuity at 6 months a reliable surrogate for visual acuity at 12 months.
Figure 24.11(c) shows that the correlation of the measurements at 6 months
and at 1 year is indeed rather poor at the individual level. Therefore, even
with the limited data available, it is clear that the assessment of visual
acuity at 6 months is not a good surrogate for the same assessment at 1
year. This is in contrast with the inconclusive analysis in Section 24.3.3.

FIGURE 24.10. Advanced Ovarian Cancer. Survival curves. Kaplan-Meier estimates of survival (S) and time to progression (TTP) for the two treatment groups: cyclophosphamide plus cisplatin (CP) and cyclophosphamide plus adriamycin plus cisplatin (CAP).

Advanced Colorectal Cancer

Figure 24.12 shows a plot of the treatment effects on the true endpoint (logarithm of survival) by the treatment effects on the surrogate endpoint (logarithm of time to progression). Clearly, the correlation between both is considerably weaker than in the advanced ovarian cancer case. The corresponding R²_trial(r) = 0.454 (95% confidence limits [0.23; 0.68]).

At the individual level (Figure 24.13), R²_indiv = 0.665 (95% confidence limits [0.62; 0.71]). The square root of this quantity is 0.815, very close to the estimate of ρ_Z, which is 0.805.

In contrast to the results obtained for advanced ovarian cancer, time to progression seems less suitable as a surrogate for survival in the context of advanced colorectal cancer. Both the trial-level and the individual-level R² are relatively low.

FIGURE 24.11. Age-Related Macular Degeneration Trial. (a) True endpoint (change in visual acuity at 1 year) versus surrogate endpoint (change in visual acuity at 6 months) for all individual patients, raw data. (b) Treatment effects on the true endpoint versus treatment effects on the surrogate endpoint in all centers. The size of each point is proportional to the number of patients in the corresponding center. (c) True endpoint versus surrogate endpoint for all individual patients, after correction for treatment effect.

24.3.6 Computational Issues

In this section, we investigate convergence properties of the random-effects approach as proposed in Section 24.3.4. The need for such an investigation
arises from the observation that in many practical instances, convergence of
the Newton-Raphson algorithm yielding (restricted) maximum likelihood
solutions could hardly be achieved (see Section 24.3.5). Therefore, it is
worth exploring what features of the problem at hand may be of influence
in easing convergence of the algorithm, since this may be an additional
factor to decide between a two-stage or a random-effects model.

We explored the following factors: number of trials, size of the between-trial variability (compared to residual variability), number of patients per trial, normality assumption, and strength of the correlation between random treatment effects. Since only the first two factors were found to affect convergence of the algorithm significantly, we do not report on the others in the remainder of this paragraph.

FIGURE 24.12. Advanced Colorectal Cancer. Treatment effects on the true end-
point (logarithm of survival time) versus treatment effects on the surrogate end-
point (logarithm of time to progression) for all units of analysis. The size of each
point is proportional to the number of patients in the corresponding unit.

Table 24.7 shows the number of runs for which convergence could be achieved within 20 iterations. In each case, 500 runs were performed, assuming a model of the following form:
\[
S_{ij} \mid Z_{ij} = 45 + m_{Si} + (3 + a_i) Z_{ij} + \varepsilon_{Sij},
\]
\[
T_{ij} \mid Z_{ij} = 50 + m_{Ti} + (5 + b_i) Z_{ij} + \varepsilon_{Tij},
\]
where (m_Si, m_Ti, a_i, b_i) ∼ N(0, D) with
\[
D = \delta^2 \begin{pmatrix}
1 & 0.8 & 0 & 0 \\
  & 1   & 0 & 0 \\
  &     & 1 & 0.9 \\
  &     &   & 1
\end{pmatrix}
\]
and (ε_Sij, ε_Tij) ∼ N(0, Σ) with
\[
\Sigma = 3 \begin{pmatrix} 1 & 0.8 \\ & 1 \end{pmatrix}.
\]

The number of trials was fixed to either 10, 20, or 50, each trial involving 10 subjects randomly assigned to treatment groups. The δ² parameter was set to 0.1 or 1.
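Incidentally, these choices pin down the true surrogacy measures: by (24.29) and (24.32),
\[
R^2_{\mathrm{trial(r)}} = \frac{(0.9\,\delta^2)^2}{\delta^2 \cdot \delta^2} = 0.81,
\qquad
R^2_{\mathrm{indiv}} = \frac{(3 \times 0.8)^2}{3 \times 3} = 0.64,
\]
irrespective of the value of δ², so that δ² governs only the size of the between-trial variability relative to the residual variability.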

FIGURE 24.13. Advanced Colorectal Cancer. Residuals of progression-free survival versus survival, after correction for treatment and center effect.

TABLE 24.7. Simulation results. Number of runs for which convergence was achieved within 20 iterations. Total number of runs: 500; percentages are given in parentheses.

                 Number of trials
δ²        50          20          10
1    500 (100%)  498 (100%)  412 (82%)
0.1  491 (98%)   417 (83%)   218 (44%)

From Table 24.7, we see that when the between-trial variability is large (δ² = 1), no convergence problems occur, except when the number of trials gets very small. When the between-trial variability gets smaller, convergence problems do arise and worsen as the number of trials decreases.

These simulation results indicate that there should be enough variability at the trial level, and a sufficient number of trials, to obtain convergence
of the Newton-Raphson algorithm for fitting mixed-effects models. When
these requirements are not fulfilled, one must rely on simpler fixed-effects
models, or mixed-effects models with random treatment effects but no ran-
dom intercepts. The investigator should reflect carefully on whether such

simplifications are allowable. It may well be that a two-stage model is the more sensible choice.

24.3.7 Extensions

In Section 24.3.4, we focused on the methodologically appealing case of normally distributed endpoints. In practice, situations abound with binary
and time-to-event endpoints, and in addition with surrogate and final end-
points of a different type (Molenberghs, Geys, and Buyse 1998). Whereas
the linear mixed model provides a unified and flexible framework to analyze
multivariate and/or repeated measurements that are normally distributed,
similar tools for non-normal outcomes are unfortunately less well devel-
oped.

For binary outcomes, there are both marginal models such as generalized
estimating equations (Liang and Zeger 1986) or full likelihood approaches
(Fitzmaurice and Laird 1993, Molenberghs and Lesaffre 1994, Lang and
Agresti 1994, Glonek and McCullagh 1995) and random-effects models (Sti-
ratelli, Laird, and Ware 1984, Zeger, Liang, and Albert 1988, Breslow and
Clayton 1993, Wolfinger and O’Connell, 1993; Lee and Nelder 1996). Re-
views are given in Diggle, Liang, and Zeger (1994) and Fahrmeir and Tutz
(1994). For additional references, see also Section 24.2.1 (p. 414).

Since our validation measures make use not only of main-effect parame-
ters, such as treatment effects, but prominently of association (random-
effects structure and residual covariance structure), standard generalized
estimating equations are less suitable to extend ideas toward noncontin-
uous data. Possible approaches are second-order generalized estimating
equations (Liang, Zeger, and Qaqish 1992, Molenberghs and Ritter 1996)
and random-effects models. Since the latter are computationally involved,
the likelihood-based approaches need to be supplemented with alternative
methods of estimation such as quasi-likelihood. Also, pseudo-likelihood is
a viable alternative.

Burzykowski et al . (1999) studied the situation where Sij and Tij are
failure-time endpoints. In order to extend the approach used in the case
of two normally distributed endpoints described in Section 24.3.4, they re-
placed model (24.18)–(24.19) by a model for two correlated failure-time
random variables. This is based on a copula model (Shih and Louis 1995).
More specifically, they assume that the joint survivor function of (Sij , Tij )
can be written as

\[
F(s, t) = P(S_{ij} \geq s, T_{ij} \geq t) = C_\delta\{F_{Sij}(s), F_{Tij}(t)\} \qquad (24.33)
\]



(s, t ≥ 0), where (F_Sij, F_Tij) denotes the pair of marginal survivor functions and C_δ is a distribution function on [0, 1]² with δ ∈ IR. C_δ is called a copula func-
tion. It describes the association between Sij and Tij . An attractive feature
of model (24.33) is that the margins do not depend on the choice of the
copula function. Specifically, the copulas of Clayton (1978) and Hougaard
(1987) are studied in detail. Burzykowski et al . replace the R2 at the indi-
vidual level by Kendall’s τ , while maintaining R2 as a measure for trial-level
surrogacy.
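As an illustration, one common parameterization of the Clayton copula (which need not coincide exactly with the form adopted by Burzykowski et al.) is
\[
C_\delta(u, v) = \left(u^{1-\delta} + v^{1-\delta} - 1\right)^{1/(1-\delta)}, \qquad \delta > 1,
\]
for which Kendall's τ equals (δ − 1)/(δ + 1), with independence recovered as δ → 1.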

This area is currently in full development. For example, the important situation of a binary (or categorical) response, such as tumor response, as
a surrogate for survival time, is being studied.

24.3.8 Reflections on Surrogacy

The validation of surrogate endpoints is a controversial issue. Difficulties have arisen on several fronts. First, some endpoints used as surrogates
have been shown to provide wholly misleading predictions of the treat-
ment effect upon the important clinical endpoints: The case of encainide
and flecainide, two harmful drugs that were approved by the Food and
Drug Administration based on their anti-arrhythmic effects, will remain a
painful illustration of such an unfortunate circumstance (Fleming 1992).
Second, some endpoints that have not been so catastrophically misleading
have still failed to explain the totality of the treatment effect upon the final
endpoints: The case of the CD4+ lymphocyte counts in patients with AIDS
is an example (Lin, Fischl, and Schoenfeld 1993, DeGruttola et al . 1993,
Choi et al . 1993, DeGruttola and Tu 1995). Many of these problems were
mentioned in Prentice (1989). All these reasons have led some authors to
express reservations about attempts to validate surrogate endpoints statis-
tically (Fleming and DeMets 1996, DeGruttola et al . 1997). Their reserva-
tions rest to a large extent on biological considerations: A good surrogate
must be shown to be causally linked to the true endpoint, and even so,
it is implausible that the surrogate will ever capture the whole effect of
treatment upon the true endpoint. These reservations are well taken, but
biologically complex situations lend themselves to statistical evaluations
that may shed light on the underlying mechanisms involved (Chuang-Stein
and DeMasi 1998). The approach proposed here indirectly addresses these issues: A large individual-level coefficient of determination (R²_indiv close to 1) indicates that the endpoints are likely to be causally linked to each other, whereas a large trial-level coefficient of determination (R²_trial(r) close to 1) indicates that a large proportion of the treatment effect is captured by the surrogate.

We obtain a quantitative assessment of the value of a surrogate, as well as predictions of the expected effect of treatment upon the true endpoint
(Boissel et al . 1992, Chen et al . 1998). It evaluates the “validity” of a
surrogate in terms of coefficients of determination, which are intuitively
appealing quantities in the unit interval and enables the construction of
prediction intervals for the effect of treatment on the true endpoint based
on its action upon the surrogate endpoint. Such an approach is more infor-
mative than a mere dichotomization of surrogate endpoints as being “valid”
or “invalid.” Moreover, the validation procedure no longer requires statistical tests to be significant; for instance, an endpoint with a low individual-level coefficient of determination (R²_indiv ≪ 1) is unlikely to be a good surrogate (even if R²_trial(f) = 1), a conclusion that may be reached with a limited number of observations.

The need for validated surrogate endpoints is as acute as ever, particularly in diseases where an accelerated approval process is deemed necessary
(Cocchetto and Jones 1998, Weihrauch and Demol 1998). Some surrogate
endpoints or combinations of endpoints, such as viral load measures com-
bined with CD4+ lymphocyte counts, have, in fact, already replaced as-
sessment of clinical outcomes in AIDS clinical trials (O’Brien et al . 1996,
Mellors et al . 1997). The approach presented here may offer a better under-
standing of the worth of a surrogate endpoint, provided that large enough
sets of data from multiple randomized experiments are available to esti-
mate the required parameters (Daniels and Hughes 1997). Large numbers
of observations are needed for the estimates to be sufficiently precise, and
multiple studies are needed to distinguish individual-level from trial-level
associations between the endpoints and effects of interest. However, it has
to be emphasized that, even if the results of a surrogate evaluation seem
encouraging based on several trials, applying these results to a new trial
requires a certain amount of extrapolation that may or may not be deemed
acceptable. In particular, when a new treatment is under investigation, is
it reasonable to assume that the quantitative relationship between its ef-
fects on the surrogate and true endpoints will be the same as with other treatments? The leap of faith involved in making that assumption rests
primarily on biological considerations, although the type of statistical in-
formation presented above may provide essential supporting evidence.

24.3.9 Prediction Intervals

Denote f = E(β + b_0 | m_S0, a_0) = β + D_1 D_2^{-1} D_3, where D_1, D_2, and D_3 refer to the corresponding matrices in (24.26). Let f_d be the derivative of f with respect to the parameter vector
\[
(\beta, \mu_S, \alpha, d_{Sb}, d_{ab}, d_{SS}, d_{Sa}, d_{aa}, \mu_{S0}, \alpha_0).
\]

The components of f_d are
\[
\frac{\partial f}{\partial \beta} = 1,
\]
\[
\frac{\partial f}{\partial \mu_{S0}} = -\frac{\partial f}{\partial \mu_S} = D_1 D_2^{-1} \begin{pmatrix} 1 \\ 0 \end{pmatrix},
\]
\[
\frac{\partial f}{\partial \alpha_0} = -\frac{\partial f}{\partial \alpha} = D_1 D_2^{-1} \begin{pmatrix} 0 \\ 1 \end{pmatrix},
\]
\[
\frac{\partial f}{\partial d_{Sb}} = \begin{pmatrix} 1 & 0 \end{pmatrix} D_2^{-1} D_3,
\]
\[
\frac{\partial f}{\partial d_{ab}} = \begin{pmatrix} 0 & 1 \end{pmatrix} D_2^{-1} D_3,
\]
\[
\frac{\partial f}{\partial d_{SS}} = -D_1 D_2^{-1} \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} D_2^{-1} D_3,
\]
\[
\frac{\partial f}{\partial d_{Sa}} = -D_1 D_2^{-1} \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} D_2^{-1} D_3,
\]
\[
\frac{\partial f}{\partial d_{aa}} = -D_1 D_2^{-1} \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix} D_2^{-1} D_3.
\]
Denoting the asymptotic covariance matrix of the estimated parameter vector by V, the asymptotic variance of f is given by f_d' V f_d, producing a confidence interval in the usual way. For a prediction interval, the variance to be used is f_d' V f_d + var(β + b_0 | m_S0, a_0).

24.3.10 SAS Code for Random-Effects Model

We describe how to use the SAS system to fit the random-effects model proposed in Section 24.3.4. Note that other packages such as MLwiN (see also Section A.2) are particularly well suited for fitting this type of multivariate multilevel model and could therefore be utilized instead.

The SAS code to fit model (24.23)–(24.24) may be written as follows:

proc mixed data = dataset covtest;
class endpoint subject trial treat;
model outcome = endpoint|treat / s noint;
random endpoint*treat / sub = trial type = un;
repeated endpoint / subject = subject(trial) type = un;
run;

The above syntax presumes that there are two records per subject in the
input data set, one corresponding to the surrogate endpoint and the other
to the true endpoint. The variable endpoint is an indicator for the kind of
endpoint (coded 0 for the surrogate and 1 for the true endpoint) and the
variable outcome contains measurements obtained from each endpoint.
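As a hypothetical illustration of this layout (the variable names follow the program above; the values are invented), a fragment of such an input data set could be created as:

data example;
input trial subject treat endpoint outcome;
datalines;
1 101 0 0 3.91
1 101 0 1 4.02
1 102 1 0 4.25
1 102 1 1 4.40
;
run;

Here, subject 101 in trial 1 contributes one surrogate record (endpoint = 0) and one true-endpoint record (endpoint = 1).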

The RANDOM statement defines the covariance matrix D in (24.22) of random effects at the trial level, and the REPEATED statement builds up
the residual covariance matrix Σ in (24.20). Note that the nesting nota-
tion in the ‘subject=’ option enables SAS to recognize the nested structure
of the data (subjects are clustered within trials). Acknowledgment of the
hierarchical nature of the data prompts SAS to build a block-diagonal co-
variance matrix, with diagonal blocks corresponding to the different trials,
which speeds up computations considerably.

24.4 The Milk Protein Content Trial

24.4.1 Introduction

Diggle (1990) and Diggle and Kenward (1994) analyzed data taken from
Verbyla and Cullis (1990), who, in turn, had discovered the data at a work-
shop at Adelaide University in 1989. The data consist of assayed protein
content of milk samples taken weekly for 19 weeks from 79 Australian cows.
The cows entered the experiment after calving and were randomly allocated
to 1 of 3 diets: barley, mixed barley-lupins, and lupins alone, with 25, 27,
and 27 animals in the 3 groups, respectively. The time profiles for all 79
cows are plotted in Figure 24.19. All cows remained on study during the
first 14 weeks, whereafter the sample reduced to 59, 50, 46, 46, and 41,
respectively, due to dropout. This means that dropout is as high as 48%
by the end of the study. Table 24.8 shows the number of cows per arm and
per dropout pattern.

The primary objective of the milk protein experiment was to describe the
effects of diet on the mean response profile of milk protein content over
time. Previous analyses of the same data are reported by Diggle (1990,
Chapter 5), Verbyla and Cullis (1990), Diggle, Liang, and Zeger (1994),
and Diggle and Kenward (1994), under different assumptions and with
different modeling of the dropout process. Diggle (1990) assumed random
dropout, whereas Diggle and Kenward (1994) concluded that dropout was

TABLE 24.8. Milk Protein Content Trial. Number of cows per arm and per dropout pattern.

                        Diet
Dropout week   Barley  Mixed  Lupins
Week 15             6      7       7
Week 16             2      3       4
Week 17             2      1       1
Week 18             0      0       0
Week 19             2      2       1
Completers         13     14      14
Total              25     27      27

nonrandom, based on their selection model. Of course, it has been noted in Chapters 17 and 19 that appropriate care should be taken with nonrandom selection models, given their reliance on unverifiable assumptions.

In addition to the usual problems with this type of model, serious doubts
have been raised about even the appropriateness of the “dropout” con-
cept in this study. Cullis (1994) warned that the conclusions inferred from
the statistical model are very unlikely since usually there is no relation-
ship between dropout and a relatively low level of milk protein content. In
the discussion of the Diggle and Kenward (1994) paper, one is informed
by Cullis that Valentine, who originally conducted the experiment, had
previously revealed the real reasons for dropout. The explanation eluci-
dates that the experiment terminated when feed availability declined in
the paddock in which animals were grazing. Thus, this would imply that a
nonrandom dropout mechanism is very implausible. A nonrandom dropout
mechanism would wrongly relate dropout to response, whereas, to the con-
trary, dropout depends on food availability only. Thus, there are actually no
dropouts but rather five cohorts representing the different starting times.
Together with Cullis (1994), and in agreement with our pleas for sensitivity
analysis (Chapters 19 and 20), we conclude that especially with incomplete
data, a statistical analysis should not proceed without a thorough discus-
sion with the experimenters.

The complex and somewhat vague history of the data set probably is the
main cause for so many conflicting issues related to the analysis of the
milk data. At the same time, it becomes a perfect candidate for sensitiv-
ity analysis. Modeling will be based upon the linear mixed-effects model
with serial correlation (3.11), introduced in Section 3.3.4. In Section 24.4.2,

we examine the validity of the conclusions made in Diggle and Kenward


(1994) by incorporating subject matter information into the method of
analysis. As dropout was due to design, the method of analysis should re-
flect this. We will investigate two approaches. The first approach involves
restructuring the data set and then analyzing the resulting data set using
a selection modeling framework, whereas the second method involves fit-
ting pattern-mixture models taking the missingness pattern into account.
Both analyses consider the sequences as unbalanced in length rather than
a formal instance of dropout. Local influence diagnostics, as introduced
in Section 19.3 and supplemented with global influence measures, are pre-
sented in Section 24.4.3.

24.4.2 Informal Sensitivity Analysis

Since there has been some confusion about the actual design employed, we
cannot avoid making subjective assumptions such as the following: Several
matched paddocks are randomly assigned to either of three diets: barley,
lupins, or a mixture of the two. The experiment starts as the first cow
experiences calving. As the first 5 weeks have passed, all 79 cows have
entered their randomly assigned, randomly cultivated paddock. By week
19, all paddocks appear to approach the point of exhausting their food
availability (in a synchronous fashion) and the experiment is terminated
for all animals simultaneously.

All previous analyses assumed a fixed date for entry into the trial and the
crucial issue then becomes how the dropout process should be handled and
analyzed. However, it seems intuitive that, since entry into the study was at random time points (i.e., after calving) and the experiment was terminated at a fixed time point, this termination point should be the reference for all other time points. It is therefore also appealing to reverse the
time axis and to analyze the data in reverse, starting from time of dropout.
Under the aforementioned assumptions, we have found a partial solution
to the problem of potentially nonrandom dropout since dropout has been
replaced by ragged entry. Note, however, that a crucial simplification arises:
Since entry into the trial depends solely on calving and gestation, it can be
thought of as totally independent from the unobserved responses.

A problem with the alignment lies in the fact that virtually all cows showed
a very steep decrease in milk protein content immediately after calving,
lasting until the third week into the experiment. This behavior could be
due to a special hormone regulation of milk composition following calving,
which lasts only for a few weeks. Such a process is most likely totally
independent of diet and, probably, can also be observed in the absence of
food, to the expense of the animal’s natural reserves. Since entry is now

FIGURE 24.14. Milk Protein Content Trial. Data manipulations on five selected
cows: (a) raw profiles; (b) right aligned profiles; (c) deletion of the first three
observations; (d) profiles with in addition time reversal.

ragged, the process is spread and influences the mean response level during
the first 8 weeks. Of course, one might construct an appropriate model for
the first 3 weeks with a separate model specification, in analogy to the one
used in Diggle and Kenward (1994). Instead, we prefer to ignore the first
3 weeks, analogous in spirit to the approach taken in Verbyla and Cullis
(1990). Hence, we have time series of length 16, with some observations
missing at the beginning. Figure 24.14 displays the data manipulations
for five selected cows. In Figure 24.14(a), the raw profiles are shown. In
Figure 24.14(b), the plots are right aligned. Figure 24.14(c) illustrates the
protein content levels for the five cows with the first three observations
deleted and Figure 24.14(d) presents these profiles when time is reversed.

In order to explore the patterns after transformation, we plotted the newly obtained mean profiles. Figures 24.15(a) and 24.15(b) display the mean
profiles before and after the transformation, respectively. Notice that the
mean profiles have become parallel in Figure 24.15(b). To address the issue
of correlation, we shall compare the two variograms (see also Section 10.4).
The two graphs shown in Figure 24.16 are very similar although slight
differences can be noted in the estimated process variance, which is slightly
lower after transformation. Complete decay of serial correlation appears to
happen between time lags 9 and 10 in both variograms. There is virtually

FIGURE 24.15. Milk Protein Content Trial. Mean response profiles on the orig-
inal data and after aligning and reverting.

no evidence for random effects as the serial correlation levels off toward the
process variance.

Table 24.10 presents maximum likelihood estimates for the original data,
similar to the analysis by Diggle and Kenward (1994). The corresponding
parameters after aligning and reverting are 3.45 (s.e. 0.06) for the barley
group, with differences for the lupins and mixed groups estimated to be
−0.21 (s.e. 0.08) and −0.12 (s.e. 0.08), respectively. The variance parame-
ters roughly retain their relative magnitude, although even more weight is
given to the serial process (90% of the total variance).

The analysis using aligned and reverted data shows little difference if com-
pared to the original analysis by Diggle and Kenward (1994). It would be
interesting to acquire knowledge about what mechanisms determined the
systematic increase and decrease observed for the three parallel profiles
illustrated in Figure 24.15(b). It is difficult to envisage that the paral-
lelism of the profiles and their systematic peaks and troughs shown in Fig-
ure 24.15(b) is due entirely to chance. Indeed, many of the previous analyses
had debated the influence on variability for those factors, common to the
paddocks cultivated with the three different diets, such as meteorological
factors, that had not been reported by the experimenter. These factors
may account for a large amount of variability in the data. Hence, the data
exploration performed in this analysis may be shown to be a useful tool

FIGURE 24.16. Milk Protein Content Trial. Variogram for the original data and
after aligning and reverting.

in gaining insight about the response process. For example, we note that
after transformation, the inexplicable trend toward an increase in milk pro-
tein content, as the paddocks approach exhaustion has, in fact, vanished or
even reverted to a possible decrease. This was also confirmed in the strat-
ified analysis where the protein level content tended to decrease prior to
termination of the experiment (see Figure 24.17).

An alternative method of analysis is based on the premise that the protein content levels form distinct homogeneous subgroups for cows based on
their dropout pattern. This leads very naturally to pattern-mixture mod-
els (Chapters 18 and 20). Parameters in (3.11) are now made to depend
on pattern. In its general form, the fixed effects as well as the covariance
parameters are allowed to vary unconstrained according to the dropout
pattern. Alternatively, simplifications can be sought. For example, diet ef-
fect can vary linearly with pattern or can be pattern independent. In the
latter case, this effect becomes marginal. When the diet effect is pattern
dependent, an extra calculation is necessary to obtain the marginal diet
effect, as was illustrated on the toenail data in Section 18.3 and formal-
ized in Section 20.6.2. Precisely, the marginal effect can be computed as in
(20.45), whereas the delta method variance expression is given by (20.46).

Denoting the parameter for diet effect ℓ = 1, 2 (difference with the barley group) in pattern t = 1, 2, 3 by β_ℓt, and letting π_t be the proportion of cows

in pattern t, the matrix A in (20.48) assumes the form
\[
A = \frac{\partial(\beta_1, \beta_2)}{\partial(\beta_{11}, \beta_{12}, \beta_{13}, \beta_{21}, \beta_{22}, \beta_{23}, \pi_1, \pi_2, \pi_3)}
= \begin{pmatrix}
\pi_1 & \pi_2 & \pi_3 & 0 & 0 & 0 & \beta_{11} & \beta_{12} & \beta_{13} \\
0 & 0 & 0 & \pi_1 & \pi_2 & \pi_3 & \beta_{21} & \beta_{22} & \beta_{23}
\end{pmatrix}.
\]
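In this notation, and following (20.45)–(20.46), the marginal diet effects and their delta method covariance matrix take the form (our restatement of the formulas referenced above)
\[
\beta_\ell = \sum_{t=1}^{3} \pi_t \beta_{\ell t} \quad (\ell = 1, 2),
\qquad
\mathrm{var}\begin{pmatrix} \widehat{\beta}_1 \\ \widehat{\beta}_2 \end{pmatrix} \approx A V A',
\]
where V denotes the asymptotic covariance matrix of (β̂_11, ..., β̂_23, π̂_1, π̂_2, π̂_3).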

Note that the simple multinomial model for the dropout probabilities could be extended when additional information concerning the dropout mechanism is available. For example, if covariates are known or believed to influence dropout, the simple multinomial model can be replaced by logistic regression or time-to-event methods (Hogan and Laird 1997).

Recall that Table 24.8 presents the dropout pattern by time in each of
the three diet groups. As few dropouts occurred in weeks 16, 17, and 19,
these three dropout patterns were collapsed into a single pattern. Thus,
three patterns remain with 20, 18, and 41 cows, respectively. The model
fitting results are presented in Table 24.9. The most complex model for the
mean structure assumes a separate mean for each diet by time by dropout
pattern combination. As the variogram indicated no random effects, the
covariance matrix was taken as first-order autoregressive with a residual variance term, σ_jk = σ²ρ^{|j−k|}. Also the variance-covariance parameters are
allowed to vary according to the dropout pattern. This model is equivalent
to including time and diet as covariates in the model and stratifying for
dropout pattern and it provides a starting point for model simplification
through backward selection. We refer to the description of Strategy 3 in
Section 20.5.1. The protein content levels over time are presented by pattern
and diet in Figure 24.17. Note that the protein content profiles appear to
vary considerably according to missingness pattern and time. Additionally,
Diggle and Kenward (1994) suggested an increase in protein content level
toward the end of the experiment. This observation is not consistent for
the three plots in Figure 24.17. In fact, there is a tendency for a decrease
in all diet by pattern subgroups prior to dropout.

To simplify the covariance structure presented in Model 1, Model 2 assumes that the residual covariance parameter is equal in the three patterns. The likelihood ratio test indicates that Model 2 compares favorably with Model 1, suggesting a common residual variance (measurement error component) parameter for the three groups (see Table 24.9 for details). However,
comparing Model 3 with Model 2, we reject a common variance-covariance
structure in the three groups.

Next, we investigate the mean structure. In Model 4, the three-way interaction among pattern, time, and diet is removed. This simplified model is
acceptable when contrasted to Model 2, based on p = 0.987. Models 5, 6,

TABLE 24.9. Milk Protein Content Trial. Model fit summary for pattern-mixture models.

Model  Mean structure                                    Covariance
1      Full interaction                                  AR1(t), meas(t)
2      Full interaction                                  AR1(t), meas
3      Full interaction                                  AR(1), meas
4      Two-way interactions                              AR1(t), meas
5      Diet, time, pattern, diet*time, diet*pattern      AR1(t), meas
6      Diet, time, pattern, diet*time, time*pattern      AR1(t), meas
7      Diet, time, pattern, diet*pattern, time*pattern   AR1(t), meas
8      Time, pattern, time*pattern                       AR1(t), meas
9      Time, diet(time)                                  AR(1), meas
10     Time, diet                                        AR(1), meas

Model  # par  −2ℓ       Ref     G²      df   p
1      162    −474.93
2      160    −470.79     1      4.44    2   0.111
3      156    −428.26     2     42.23    4   <0.001
4      100    −439.96     2     30.53   50   0.987
5       70    −202.40     4    237.56   30   <0.001
6       96    −430.55     4      9.41    4   0.052
7       64    −404.04     4     35.92   36   0.472
8       58    −378.22     7     25.82    6   <0.001
                          6     52.33   38   0.061
9       60
10      24

and 7 are fitted to investigate the pairwise interaction terms. Comparing Models 5 and 4 suggests a strong interaction between dropout pattern and
time. Model 6 results in a borderline decrease in goodness-of-fit (p = 0.052).
From Table 24.9, we observe that Model 7 is a plausible simplification of
Model 4. Moreover, there is an apparent lack of fit for Model 8, which
includes only one interaction term, time by pattern, when compared to
Model 7. In conclusion, among the models presented, Model 7 is the pre-
ferred one to summarize the data, as it is the simplest model consistent
with the data. However, Model 6 should be given some attention as well.
In analogy to Diggle and Kenward (1994), we attempted to include time as
a separate linear factor for the first 3 weeks and the subsequent 16 weeks.
These models did not improve the fit (results not shown).

FIGURE 24.17. Milk Protein Content Trial. Mean response level per diet and per
dropout pattern.

Recall that the objective of the experiment was to assess the influence
of diet on protein content level. With selection models, the corresponding
null hypothesis of no effect can be tested using, for example, the standard
F -tests on 2 numerator degrees of freedom as provided by the SAS proce-
dure MIXED or similar software. In the pattern-mixture framework, such a
standard test can be used only if the treatment effects do not interact with
pattern. Otherwise, the marginal treatment (diet) effect has to be deter-
mined as in (20.45) and the delta method can be used to test the hypothesis
of no effect. In Model 6, the diet effect is independent of pattern and the
reverse holds for Model 7. Reparameterizing Model 6 by including the diet
effect and diet by time interaction as one effect in the model provides us
with an appropriate F -test for the three diet profiles. The F -test rejects
the null hypothesis of no diet effect (F = 1.57 on 38 degrees of freedom,
p = 0.015). In the corresponding selection model, Model 9, we remove all
the terms from Model 6 which include pattern. In that case, the F -test is
not significant (F = 1.26 on 38 degrees of freedom, p = 0.133). The differ-
ence in the tests may be explained by the variance parameters which were
larger in the selection model in the absence of stratification for pattern,
thereby effectively diluting the strength of the difference. Additionally, the
standard errors for the estimates of the fixed effects were slightly smaller
in the pattern-mixture model. This is not surprising, as in the model fit-
ting we found that the means and variance parameters were dependent on

FIGURE 24.18. Milk Protein Content Trial. Diet effect over time for the selection
model (SM), the corresponding pattern-mixture model (PMM), and the estimate
obtained after weighting the PMM contributions (weighted).

pattern. Thus, stratifying for pattern results in more homogeneous subgroups of cows, reducing the variance within each group and subsequently
providing more precise estimates for the diet effect.

Using Model 7, we test the global null hypothesis of no diet effect in any of
the patterns. This analysis can be seen as a stratified analysis where a diet
effect is estimated separately within each pattern. This model results in a
significant F -test for the diet effect (F = 6.05, on 6 degrees of freedom,
p < 0.001). Alternatively, we can consider the pooled estimate for the diet
effect, provided by equation (20.45), and calculate the test statistic using
the delta method. This test also indicates a significant diet effect (F = 17.82
on 2 degrees of freedom, p < 0.001), as does the corresponding selection
model, Model 10 (F = 8.51 on 2 degrees of freedom, p < 0.001).

Figure 24.18 presents the diet by time parameter estimates for selection
Model 10, for the corresponding pattern-mixture Model 7 and the weighted
average estimates used in the delta method. The estimates for the selection
model and the pattern-mixture model appear to differ only slightly. Since
the model building within both families is done separately, this is a very
reassuring sensitivity analysis outcome.

In conclusion, including pattern in the model improves the model fit significantly. In particular, the time by pattern and diet by pattern interactions

are maintained in Model 7, which is considered to be the most parsimonious model consistent with the data (Figure 24.17). In addition, the covariance
parameters are also dependent on the missingness pattern. Dividing cows
into more homogeneous groups based on their missingness patterns reduces
the unexplained variation in the data and subsequently provides more pre-
cise parameter estimates.

This example and, in particular, the absence of genuine dropout illustrate once more that care has to be taken when analyzing longitudinal outcomes
with a nonrectangular structure.

The analyses discussed here provide an alternative to those obtained by Diggle and Kenward (1994) but generally do not contradict them. Rather,
they convey the message that the use of sensitivity analysis should become
standard practice when dropout occurs. We strongly stress the importance
of careful data verification to be undertaken prior to any statistical analysis.
To this end, we might add that the effect of an erroneous initial description
of a data set should not be underestimated, as it can lead to subsequent
mismodeling of the data, thus adding confusion to an already complex
undertaking of analyzing longitudinally measured observations.

Our analysis of the correlation structure appears to agree with the general
conclusions retained in the Diggle and Kenward (1994) analysis. Particu-
larly, it is interesting to notice the absence of random effects. We do not
completely share the surprise expressed by Diggle and Kenward (1994)
since it should be noted that the study animals are highly selected through
centuries of cow eugenics and race selection. Had we dealt with wild ani-
mals, the role played by random effects would most likely have been much
more substantial. To explain the absence of random effects, we may assume
that there were additional eligibility criteria for the trial (e.g., a specific
breed of cow), which made random effects even more unlikely.

Analyzing a data set using various approaches to answer a particular question is seen as a simple and informal way of sensitivity analysis, as is supplementing the main analysis with additional ones to gain extra insight. Each
method used requires certain assumptions about the measurement process
and the dropout process. In particular, pattern-mixture models and selec-
tion models approach the issue of dropout in different ways. It may also be
useful to investigate the fundamental assumptions concerning the design of
the experiment since dropout may be design driven.

24.4.3 Formal Sensitivity Analysis

Supplementing the results of Section 24.4.2 with a more formal look at sen-
sitivity, in the selection-model spirit of Chapter 19, can be done using local
influence methods, described in Section 19.3 and applied in Sections 19.4
and 19.5 to the rats and mastitis data sets, respectively. In addition, we will describe and apply global influence techniques.

Cook (1986) suggests that more confidence can be put in a model which is
relatively stable under small modifications. The best known perturbation
schemes are based on case deletion (Cook and Weisberg 1982, Chatterjee
and Hadi 1988), in which the effect of completely removing cases from the
analysis is studied. They were introduced by Cook (1977a, 1979) for the
linear regression context. Denote the log-likelihood function, corresponding to measurement model (3.8) and dropout model (17.17) and (17.4), by

$$\ell(\gamma) \;=\; \sum_{i=1}^{N} \ell_i(\gamma), \qquad (24.34)$$

in which $\ell_i(\gamma)$ is the contribution of the ith individual to the log-likelihood and where $\gamma = (\theta, \psi, \omega)$ is the s-dimensional vector grouping the parameters of the measurement model and the dropout model. Further, we denote by

$$\ell_{(-i)}(\gamma) \qquad (24.35)$$

the log-likelihood function where the contribution of the ith subject has been removed. Cook's distances (CD) are based on measuring the discrepancy between either the maximized likelihoods (24.34) and (24.35) or (subsets of) the estimated parameter vectors $\widehat{\gamma}$ and $\widehat{\gamma}_{(-i)}$, with obvious notation. Precisely, we will consider both

$$\mathrm{CD}_{1i} = 2\left(\widehat{\ell} - \widehat{\ell}_{(-i)}\right) \qquad (24.36)$$

as well as

$$\mathrm{CD}_{2i}(\gamma) = 2\,\bigl(\widehat{\gamma} - \widehat{\gamma}_{(-i)}\bigr)' \, \ddot{L}^{-1} \, \bigl(\widehat{\gamma} - \widehat{\gamma}_{(-i)}\bigr). \qquad (24.37)$$

Formulation (24.37) easily allows one to consider the global influence of a subvector of $\gamma$, such as the dropout parameters $\psi$ or the nonrandom parameter $\omega$. This will be indicated using notation of the form $\mathrm{CD}_{2i}(\psi)$, $\mathrm{CD}_{2i}(\omega)$, and so forth.
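Carrying out such a case-deletion exercise means refitting the model once per cow. As an aside, later SAS releases (9.1 and up, well beyond the version used in this book) automate a large part of it through the INFLUENCE option of the MODEL statement in PROC MIXED, albeit for the measurement model only, fitted under MAR. A minimal sketch, in which the data set milk and the variables cow, diet, week, protein, and timecls are hypothetical placeholders:

proc mixed data=milk method=ml;
  class diet cow timecls;
  model protein = diet diet*week /
        s influence(effect=cow iter=5);  /* delete whole cows in turn;
           iter=5 also updates the covariance parameters when refitting */
  repeated timecls / subject=cow type=sp(exp)(week);
run;

The reported likelihood displacement and Cook's D statistics then play the roles of CD1i and of the fixed-effects part of CD2i; influence on the dropout parameters ψ and ω falls outside the scope of PROC MIXED and requires explicit refitting with each subject deleted, as done here.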

Recall that Diggle and Kenward (1994) considered model (3.11), where
the mean model includes separate intercepts for the barley, mixed, and
lupins groups, and a common time effect which is linear during the first
3 weeks and constant thereafter. The covariance structure is described by
a random intercept, an exponential serial process, and measurement error.

TABLE 24.10. Milk Protein Content Trial. Maximum likelihood estimates (standard errors) of random and nonrandom dropout models. Dropout starts from week 15 onward.

Effect                          Par.   MAR               MNAR
Measurement model:
  Barley                        μ1     4.147 (0.053)     4.152 (0.053)
  Mixed                         μ2     4.046 (0.052)     4.050 (0.052)
  Lupins                        μ3     3.935 (0.052)     3.941 (0.052)
  Time effect                   β      −0.226 (0.015)    −0.224 (0.015)
  Random intercept variance     d      −0.001 (0.010)    0.002 (0.009)
  Measurement error variance    σ²     0.024 (0.002)     0.025 (0.002)
  Serial variance               τ²     0.073 (0.012)     0.067 (0.011)
  Serial correlation            ρ      0.152 (0.037)     0.163 (0.039)
Dropout model:
  Intercept                     ψ0     17.870 (3.147)    15.642 (3.535)
  Previous measurement          ψ1     −6.024 (0.998)    −10.722 (2.015)
  Current measurement           ψ2                       5.176 (1.487)
−2 log-likelihood                      51.844            37.257

The dropout model includes dependence on the previous and current, pos-
sibly unobserved, measurements. Since dropout only happens from week
15 onward, Diggle and Kenward (1994) chose to set the dropout proba-
bility for earlier occasions equal to zero. Thereafter, they allowed separate
intercepts per time point, but common dependencies on previous and cur-
rent measurements. We will now introduce two models which use the same
measurement model as Diggle and Kenward (1994) but different dropout
models.

A first dropout model is closely related to the one of Diggle and Kenward
(1994), who defined occasion-specific intercepts ψ0k (k = 15, 16, 17, 19),
assumed slopes common, and set the dropout probability equal to zero
at other occasions. We also model dropout from week 15 onward, but we
will keep the intercepts constant for occasions 15 to 19. Precisely, our first
model contains three parameters (intercept ψ0 , dependence on the previous
measurement ψ1 , and dependence on the current measurement ψ2 ).

Parameter estimates for this model under both MAR and MNAR are listed in Table 24.10. The fitted model is qualitatively equivalent to the model used by Diggle and Kenward (1994), who concluded overwhelming evidence for nonrandom dropout (likelihood ratio statistic 13.9). In line with these results, we could also decide in favor of a nonrandom process (likelihood ratio statistic 14.59, the difference between the −2 log-likelihood values in Table 24.10).

TABLE 24.11. Milk Protein Content Trial. Maximum likelihood estimates (standard errors) of random and nonrandom dropout models. Dropout starts from week 1 onward.

Effect                          Par.   MAR               MNAR
Measurement model:
  Barley                        μ1     4.147 (0.053)     4.152 (0.053)
  Mixed                         μ2     4.046 (0.052)     4.050 (0.052)
  Lupins                        μ3     3.935 (0.052)     3.941 (0.052)
  Time effect                   β      −0.226 (0.015)    −0.224 (0.015)
  Random intercept variance     d      −0.001 (0.010)    0.002 (0.009)
  Measurement error variance    σ²     0.024 (0.002)     0.025 (0.002)
  Serial variance               τ²     0.073 (0.012)     0.067 (0.011)
  Serial correlation            ρ      0.152 (0.037)     0.163 (0.040)
Dropout model:
  Intercept                     ψ0     10.483 (2.010)    6.477 (2.867)
  Previous measurement          ψ1     −4.326 (0.651)    −5.917 (1.069)
  Current measurement           ψ2                       2.732 (1.396)
−2 log-likelihood                      194.316           190.691

In our second dropout model, we allow dropout starting from the sec-
ond week. Specifically, this model contains three parameters (intercept ψ0 ,
dependence on the previous measurement ψ1 , and dependence on the cur-
rent measurement ψ2 ) which are assumed constant throughout the whole
19-week period. The fit of this model is listed in Table 24.11. A striking
difference with the previous analysis is that the MAR assumption is now only borderline not rejected (likelihood ratio statistic 3.63). Apparently, this is a major source of sensitivity, to be explored further. As follows from theory, the measurement model parameters do not change under the MAR model compared to those displayed in Table 24.10. The measurement model obtained under MNAR has changed only slightly.

Which of the two analyses is to be preferred is debatable and depends on substantive rather than statistical considerations. The first analysis accounts for the post hoc observation that no dropout occurred prior to week 15. However, there is a, perhaps small, chance for the experiment to terminate in a particular field with fewer than 15 weeks of measurements, and our second model acknowledges this possibility. It is clear that there is an enormous sensitivity of the results due to this model choice and, therefore, an assessment of influence seems appropriate. In general though, it may be questionable whether the dropout model parameters remain constant over an extended period of time. Not only can the rate change over time, but also the dominant causes and the magnitude of their effect can change.

Global Influence

Global influence results are shown in Figures 24.19–24.21. They are based on fitting an MNAR model with each of the cows deleted in turn. The Cook's distances for the first and the second model are shown in Figures 24.20 and 24.21, respectively. The individual curves, with influential subjects highlighted, are plotted in Figure 24.19; note that subject #38 would not be highlighted for the second model.

There is very little difference in some of the Cook’s distance plots, when Fig-
ures 24.20 and 24.21 are compared. Precisely, CD1i , CD2i (γ), and CD2i (θ)
are virtually identical. The three others are similar in the sense that there
is some overlap in the subjects indicated as peaks, but with varying mag-
nitudes. Subject #38 is influential on the dropout measures CD2,38 (ψ, ω),
CD2,38 (ψ), and CD2,38 (ω). This is not surprising since #38 is rather low
in the middle portion of the measurement sequence, whereas it is very high
from week 15 onward. Therefore, this sequence is picked up in the second
analysis only. By looking at a plot with the evolution of the parameters
separately during the deletion process (not shown here), we can conclude
that subject #38 has some impact on the serial correlation parameter while
#65 is rather influential for the measurement error. In view of the fairly
smooth deviation from a straight line of the former and the abrupt peaks
in the latter, this is not a surprise.

Based on our second model, all forms of CD2i (·), whether based on the
entire parameter vector γ, the dropout parameters (ψ0 , ψ1 , ω), or subsets
of the latter, indicate that subjects #51, #59, and #68 are influential. In
contrast, CD1i , which is based directly on the likelihood, does not reveal
these subjects, but rather subject #65 jumps out. Thus, although the for-
mer three subjects have a substantial impact on the parameter estimates,
they do not change the likelihood in a noticeable fashion. From a plot of
the dropout parameter estimates for each deleted case (not shown here),
it is very clear that upward peaks in ψ0(−i) for subjects #51 and #59 are compensated with downward peaks in ω(−i). An explanation for this phenomenon can be found in the variance-covariance matrix of the dropout parameters (correlations shown in the lower triangle):

$$\begin{pmatrix} 8.22 & 0.43 & -2.85 \\ (0.14) & 1.14 & -1.18 \\ (-0.71) & (-0.79) & 1.94 \end{pmatrix}.$$

FIGURE 24.19. Milk Protein Content Trial. Individual profiles, with globally influential subjects highlighted. Dropout modeled from week 15.

FIGURE 24.20. Milk Protein Content Trial. Index plots of CD1i, CD2i(γ), CD2i(θ), CD2i(ψ, ω), CD2i(ψ), and CD2i(ω). Dropout modeled from week 15.

FIGURE 24.21. Milk Protein Content Trial. Index plots of CD1i , CD2i (γ),
CD2i (θ), CD2i (ψ, ω), CD2i (ψ), and CD2i (ω). Dropout modeled from week 1.

From a principal components analysis of the (ψ0, ω) block of this matrix, it follows that more than 90% of the variation is captured in the linear combination 0.93ψ0 − 0.37ω. Hence, there is mass transfer between these two parameters, of course with sign reversal, little impact on the likelihood value, and little effect on the MAR parameter ψ1. Note that a similar plot for the measurement model parameters can be constructed (not shown).
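This claim is easy to reproduce from the printed matrix; a small PROC IML sketch, using the (ψ0, ω) covariance block read off the upper triangle above:

proc iml;
  /* 2 x 2 covariance block of (psi0, omega), taken from the matrix above */
  S = {8.22 -2.85,
      -2.85  1.94};
  call eigen(lambda, E, S);     /* eigenvalues in descending order,
                                   eigenvectors in the columns of E */
  propvar = lambda / sum(lambda);
  print lambda propvar E;       /* first component: loading approximately
                                   (0.93, -0.37), over 90% of the variation */
quit;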

Let us now turn to the subjects which are globally influential. A first and common reason for those subjects to show up is the fact that they all have a rather strange profile. Recall that the overall trend is sloping downward during the first 3 weeks and constant thereafter. Subject #65 appears with large CD1,65 and large CD2,65(θ). The reason for this can be found in the fact that its profile shows extremely low and high peaks. Subjects #51, #59, and #68, on the other hand, only show large values for CD2(ψ, ω), CD2(ψ), and CD2(ω). This means that these subjects are influential for the dropout parameters. For subject #51, this can be explained by the fact that it drops out in spite of its rather high profile. Subjects #59 and #68, on the contrary, stay in the experiment even though they both have rather low profiles.

Local Influence

Local influence plots and individual profiles, with the influential subjects highlighted (bold type), for the first model for raw and incremental data, respectively, are depicted in Figures 24.22–24.25. Corresponding graphs for our second model are shown in Figures 24.26–24.29. It is more convenient to discuss results of the second model up front and then compare them to the first model.

FIGURE 24.22. Milk Protein Content Trial. Index plots of Ci, Ci(θ), Ci(β), Ci(α), and Ci(ψ), and of the components of the direction hmax of maximal curvature. Dropout modeled from week 15.

Observe that the plots for Ci and Ci (ψ) are virtually identical. This is
due to the relative magnitudes of the ψ and θ components. Profiles #51,
#59, and #66–#68 are highlighted in Figure 24.27. An explanation for the
influence in ψ is found by studying (19.12). Indeed, for ψ̂0 and ψ̂1 as in Table 24.11, the maximum is obtained for y = 2.51, exactly as seen in the
influential profiles, which are all in the lupins group (Figure 24.27). Fur-
ther, note that there is some agreement between the locally and globally
influential subjects, although there is no compelling need for the two ap-
proaches to be identical (#51 appears in different influential components in
the two approaches). Indeed, although global influence lumps together all
sources of influence, our local influence approach is designed to detect sub-
jects which, due to several causes, tend to have a strong impact on ω and
therefore on the conclusion about the nature of the dropout mechanism.

FIGURE 24.23. Milk Protein Content Trial. Individual profiles, with locally influential subjects highlighted. Dropout modeled from week 15.

FIGURE 24.24. Milk Protein Content Trial. Index plots of Ci, Ci(θ), Ci(β), Ci(α), and Ci(ψ), and of the components of the direction hmax of maximal curvature. Dropout modeled from week 15. Incremental analysis.

FIGURE 24.25. Milk Protein Content Trial. Individual profiles, with locally influential subjects highlighted. Dropout modeled from week 15. Incremental analysis.

Observe that one factor in (19.12) is the square of the response. This is a direct consequence of our parameterization of the dropout process, the logit of which is in terms of the previous and current outcomes, to which no transformation is applied. As was argued in the mastitis case (p. 321), since two subsequent measurements are usually positively correlated, it is not unusual for both of them to be high. It is therefore wise to reparameterize the dropout model (19.1) in terms of the increment; that is, yij
is replaced by yij − yi,j−1 . This is related to the approach of Diggle and
Kenward (1994), who reparameterized their dropout model in terms of the
increment just introduced and the size (the average of both measurements).
Even though a dropout model in the outcomes themselves, termed direct
variables model, is equivalent to a model in the first variable Yi1 and the in-
crement Yi2 −Yi1 , termed incremental variable representation, it was shown
that they lead to different perturbation schemes of the form (19.1). Indeed,
from equality (19.18) or, more generally, from

$$\psi_0 + \psi_1 y_{i,j-1} + \psi_2 y_{ij} \;=\; \lambda_0 + \lambda_1 y_{i,j-1} + \lambda_2 \left(y_{ij} - y_{i,j-1}\right), \qquad (24.38)$$

it follows that $\psi_0 = \lambda_0$, $\psi_1 = \lambda_1 - \lambda_2$, and $\psi_2 = \lambda_2$. Thus, local influence is now focusing on a different set of parameters and one should not expect it to give the same answer. Therefore, it is crucial to guide the parameterization by careful substantive knowledge. In a sense, dependence on the increment
by careful substantive knowledge. In a sense, dependence on the increment
is most dramatic since, at the time of dropout, there is no information
about the increment, whereas size can be assessed reasonably well from
Yi,j−1 , especially if the correlation is sufficiently high. The results of this
analysis are presented in Figure 24.28 and the most influential profiles
are highlighted in Figure 24.29. A slightly different but overlapping set of
profiles is responsible for the influence now. The most important feature
is that the influence is very minor. The components of the direction of
maximal curvature hmax show virtually no peaks.

Finally, we will compare both models. The direct-variable results found in Figure 24.22 agree fairly well with those in Figure 24.26, the differences
being the absence of #66 and #67 and the appearance of #43. The latter
profile is extremely low at the end of the period, where dropout is modeled,
and therefore yields a large value for (19.12). For #66 and #67, there is a
logical explanation for their disappearance. Indeed, these profiles are very
low during the first part of the experimental period, in spite of which they
do not drop out. However, during the latter part, their profile is still low
and they drop out, which is totally plausible behavior and, hence, their
influence is marked in the second but not in this analysis.

For the incremental analysis, there is a larger discrepancy between both models, as one can observe by comparing Figure 24.24 to Figure 24.28. While
the direction of maximal curvature still shows no unusual subjects, Ci shows
somewhat different subjects to be influential. Precisely, subjects #7, #51,
and #74 are highly influential for the first model, whereas subjects #51
(again), #66, #67, and #73 are the ones detected with the second model.
Although the cutoff is rather arbitrary, it is noteworthy that #51 appears
in both Ci and Ci (θ) for the first model, indicating that the measurement
model influence Ci (θ) is of the same order of magnitude as the dropout
model influence Ci (ψ), which is in contrast to the other analysis. Both #7
and #74 are on average not particularly low profiles, but they are among
the lowest ones during the last month of the experiment and, although
there are some others with the same feature, these two have a low overall
level, but a high increment, which is very unusual.

Overview

Table 24.12 summarizes the subjects which are found to be influential in the various analyses. Although it can be argued that the various influence analyses serve different purposes, it is of some importance to distinguish between those subjects which are influential overall and others which turn up in one or a few analyses. Cow #51 is highlighted (bold type) in all six
analyses and cows #59 and #68 show up four times, all others being seen
three times or less. Clearly, #51 shows up unambiguously in the global
influence plots and it yields the highest Ci (θ), Ci (β), and Ci (α) values in
the local influence analysis, even though one might argue that in some local
influence plots, it is closely followed by slightly lower peaks. Inspecting its
profile more closely, we conclude that it deviates from the typical profile in
a number of ways. First, it is among the highest profiles during the period
of initial drop, whereafter it is fairly low during the first half of the period,
followed by a period of almost linear increase until the end of the study.
The other two, #59 and #68, are, on average, the lowest profiles, not only
within their group, but overall.

FIGURE 24.26. Milk Protein Content Trial. Index plots of Ci , Ci (θ), Ci (β),
Ci (α), and Ci (ψ), and of the components of the direction hmax of maximal cur-
vature. Dropout modeled from week 1.

FIGURE 24.27. Milk Protein Content Trial. Individual profiles, with locally in-
fluential subjects highlighted. Dropout modeled from week 1.

Whereas global influence, as stated earlier, starts from deleting one subject completely, local influence only changes the dropout process for one subject from random dropout to nonrandom dropout. Because of the completely different approach, there is no need for both methods to yield similar results, although by looking at the influential subjects for all cases studied above, we notice some overlap.

TABLE 24.12. Milk Protein Content Trial. Summary of influential subjects.

                Drop From Week 15          Drop From Week 1
                       Local                      Local
Subject    Glob.    Raw     Inc       Glob.    Raw     Inc
1            *                          *
7                           *
38           *
43                   *
51           *       *      *           *       *      *
59           *       *                  *       *
65           *                          *
66                                              *      *
67                                              *      *
68           *       *                  *       *
73                                                     *
74                          *

FIGURE 24.28. Milk Protein Content Trial. Index plots of Ci, Ci(θ), Ci(β), Ci(α), and Ci(ψ), and of the components of the direction hmax of maximal curvature. Dropout modeled from week 1. Incremental analysis.

FIGURE 24.29. Milk Protein Content Trial. Individual profiles, with locally in-
fluential subjects highlighted. Dropout modeled from week 1. Incremental analysis.

Substantial differences are seen between the two models we formulated. The first one models dropout from week 15 onward. A second one allows
dropout during the complete 19-week period. When dropout is based on
the last 5 weeks, the model fitting results are, as expected, very close to
those of Diggle and Kenward (1994), with a highly significant indication
for nonrandom dropout. When the dropout model is based on the entire
period, there is little evidence for nonrandom dropout. Moreover, influential subjects in the two approaches are entirely different. Both analyses concentrate on behavior in the period during which dropout is modeled. This indicates that the choice of the period to which dropout applies is crucial.

In addition, we compared the direct variable analysis with an incremental one, where dropout depends on the difference between the current and previous measurement. In line with our analysis of the mastitis data set (Section 19.5.2), each analysis leads to different influential subjects, indicating that one should carefully discuss which analysis is preferable. Although both model formulations in (24.38) are equivalent, they lead to a different influence analysis, simply because the parameters at which the influence is targeted are different. Which one is chosen may depend on substantive considerations, as well as on the observation made by Molenberghs, Verbeke, et al. (1999) that the parameter of Yi,j−1 is estimated most efficiently in the incremental model, provided ψ̂1 and ψ̂2 are negatively correlated.

The latter condition is satisfied in many longitudinal applications, as was already noted by Diggle and Kenward (1994).

24.5 Hepatitis B Vaccination

The mentally handicapped residing in institutions are at high risk for he-
patitis B virus (HBV) acquisition and subsequent carrier state. The higher
risk of nonparental transmission in this population is due to the typical be-
havior of mentally retarded patients, the type of mental retardation, and
the closed setting of the institutions, which all enhance spreading of the
virus.

Hepatitis B vaccination of residents and staff is a general recommendation and has become part of today's hepatitis B prevention program. Data on
long-term persistence of antibodies against HBV are scarce, especially in
this population. Data available from other high-risk populations showed
that 67% to 85% of the vaccinated individuals still had antibody levels
higher than 10 international units/liter (IU/L), 9 to 12 years after the first
vaccine dose (Hadler et al. 1986, Coursaget et al. 1994).

We describe the use of linear mixed-effects models for the analysis of antibodies against hepatitis B surface antigen (anti-HBs) data from a mentally handicapped population vaccinated against hepatitis B, 11 years earlier (Van Damme et al. 1989). Use of random effects in this setting was already proposed by Coursaget et al. (1994) and Gilks et al. (1993), who considered between-individuals variability in a Bayesian random-effects model. In previous studies, several factors have been described to cause a higher risk in the acquisition of hepatitis B virus infection. These factors include age, age at admission, duration of residency, type of mental retardation [Down's syndrome (DS) or other types of mental retardation (OMR)], sex, and use of anti-epileptic medication (Vellinga et al. 1999a). Sex, age, and type of mental retardation also influence the response to vaccination (Vellinga et al. 1999b). The linear mixed model describes the decline in antibody titer in relation to the significant risk factors across measurement occasions. A detailed account of the medical results is reported in Vellinga et al. (1999c).

In 1985–1986, a hepatitis B vaccination program was conducted in a Belgian institution for the mentally handicapped to evaluate the long-term persistence of anti-HBs after vaccination in this population. Blood samples were drawn from residents in that institution, who were then all vaccinated with three doses of hepatitis B vaccine (Engerix-B™, SmithKline Beecham Biologicals, Rixensart, Belgium) according to a month 0-1-6 schedule. Serum

samples were taken after each vaccine dose, and if residents did not meet the
(arbitrary) anti-HBs level of 100 IU/L at month 7, they received a booster
vaccine dose at month 12. If the requirement of 100 IU/L was still not met
at month 13, additional booster doses were administered (these residents
were, however, not further included in the program). All residents received
a booster dose after 5 years (i.e., at month 60).

Of the 196 seronegative residents originally included in the program, only 97 were included in the analysis of the follow-up after 11 years. They had blood samples taken yearly for the first 5 years and at year 11. Sixty-seven of them received 4 vaccine doses (at months 0, 1, and 6, plus a booster at month 60) and 30 received 5 doses (at months 0, 1, 6, and 12, plus a booster at month 60). Further details can be found in Van Damme et al. (1989) and Vellinga et al. (1999a). We will denote the group of residents vaccinated according to a month 0-1-6 (versus 0-1-6-12) schedule by G1 (versus G2).

Interest focuses on describing the evolution of the mean log(anti-HBs+1) titer over time (where log denotes the natural logarithm), while accounting for prognostic factors such as sex, body mass index, duration of residency, age at admission into the institution, type of mental retardation, use of anti-epileptic drugs, and number of vaccine doses received. It is also of importance to predict antibody level at years 11 (end of study) and 12 (1 year past the end) based on the fitted model.

Although the main epidemiological interest lies in the population-averaged prediction, the model enables one to perform individual-specific predictions as well. Both model building and prediction are complicated by the fact that individual and average profiles are highly nonlinear (Figures 24.30(a) and 24.30(b)), combined with the absence of measurements between years 5 and 11.

In order to deal with these complications, we consider two types of models. The first one is saturated in time effects and is helpful, for example, for making comparisons between groups at different time points. Although the nonlinearity of the profiles raises no particular problem there, such a model does not parsimoniously describe the temporal decline in antibody titer, nor is it truly useful for making long-term predictions. This could be achieved by restricting the analysis to post-vaccination data (i.e., data available after month 6 or 12, depending on the number of doses administered). These data are indeed somewhat "smoother" than the original data and we can employ prebooster data (until month 60) to set up a model that can then be transposed to model postbooster decline in anti-HBs. This is accomplished using fractional polynomials (Royston and Altman 1994), which provide a very flexible tool for parametric modeling (see also Section 10.3).

FIGURE 24.30. Hepatitis B Vaccination Study. (a) Longitudinal trends in log(anti-HBs+1) for residents with DS (solid line) and OMR (dashed line). Cross symbols indicate missing values. (b) Average log(anti-HBs+1) over time for residents with DS (solid line) and OMR (dashed line). (c) Ordinary least squares (OLS) residual profiles obtained upon fitting a saturated mean structure to log(anti-HBs+1). (d) Variance of the OLS residuals over time for residents with DS (solid line) and OMR (dashed line).

Section 24.5.1 presents model building for the two models we just described.
Section 24.5.2 is devoted to the issue of prediction at year 12, where sensi-
tivity to assumptions and type of model used can be assessed.

24.5.1 Time Evolution of Antibodies

In this section, we address the question of building a model that adequately describes the evolution of log antibody titer over time. Hence, we need to consider appropriate mean, variance, and covariance structures. Since the profiles are quite messy due to unequally spaced measurement occasions and booster effects, it is imperative to conduct an exploratory data analysis.

FIGURE 24.31. Hepatitis B Vaccination Study. Sample variogram of log antibody residuals (the horizontal line estimates the process variance; the dashed line represents a smooth estimate of the variogram).

As shown in Figures 24.30(a) and 24.30(b), individual and mean profiles of log(anti-HBs+1) for DS and OMR are clearly nonlinear and show peaks after booster doses. Individual profiles follow approximately the same pattern, with the main difference between profiles residing in the vertical shift. This suggests a strong contribution of an individual random intercept. Average profiles show a difference in anti-HBs between both groups. Also, these profiles exhibit steep increases immediately after boosters, followed by a gradual decrease which appears to be nonlinear.

Figure 24.31 depicts an estimate of the empirical variogram for these data.
It was constructed using standardized ordinary least squares residuals ob-
tained upon fitting a saturated groups by times model (where group is
type of mental retardation). Also shown in this figure is a loess-smoothed
estimate (Cleveland 1979) of the variogram. The between-subject variance
seems relatively large in these data, accounting for about one-half of the
total variability. The measurement error is also substantial, accounting for
the other half of the process variance. This variogram leaves little room for
a serially correlated component. Note that in this context, it is essential
to use standardized residuals to remove variance heterogeneity in the data,
ensuring that the process variance is constant and equal to 1.
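As an aside, the pairwise computation behind such a sample variogram is elementary; a minimal SAS sketch, assuming a data set resid holding one standardized OLS residual r per subject id and time point time (all names hypothetical; PROC LOESS postdates the SAS release used in this book):

proc sql;
  create table vgram as
  select a.id,
         abs(a.time - b.time)  as u,    /* time lag between the pair        */
         0.5 * (a.r - b.r)**2  as v     /* half squared residual difference */
  from resid as a, resid as b
  where a.id = b.id and a.time < b.time;
quit;

proc loess data=vgram;    /* smooth v against u, as in Figure 24.31 */
  model v = u;
run;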

We now turn to model construction and outline the successive steps to retain a final model. Type of mental retardation, duration of residency, and number of vaccine doses (as a group variable) are allowed to have specific effects at each sample occasion. We also include time-constant effects for sex (male versus female), use of anti-epileptic drugs (yes versus no), body mass index, and age at admission in the institution, since a time trend for these covariates could not be detected.

In Chapter 9, it was suggested to select a variance-covariance structure based on the most complex mean structure one is willing to consider. Once such a structure is chosen, simplification of the mean model can proceed.

The preliminary variance model acknowledges presence of serial correlation and includes the following random effects: an intercept, a linear time slope, and number of vaccine doses (as a 0/1-coded group variable). An unstructured form is assumed for the 3 × 3 random-effects variance matrix D.

We first select an appropriate serial process, as shown in Table 24.13. Models with exponential (B) and Gaussian (C) serial correlation are compared to the model with no serial process (A) using the likelihood ratio test statistic (denoted G² in this table). These tests strongly reject the null hypothesis of no serial process. At this stage, it was decided to keep the exponential model.

Next, the random-effects structure can be simplified. Three hierarchically ordered models are presented in the second part of Table 24.13. One has to be very careful in interpreting the significance of random effects using the likelihood ratio test statistic G². As described in Section 6.3.4, the associated testing problem is indeed nonstandard, as the null hypothesis lies on the boundary of the parameter space of the alternative hypothesis. The reference distribution for the B–D comparison is a 50:50 mixture of $\chi^2_2$ and $\chi^2_3$. Similarly, for the comparison of Models D and E, we obtain a 50:50 mixture of $\chi^2_1$ and $\chi^2_2$ variables. These distributions have been utilized to calculate the corresponding p-values. Thus, at this stage, we select Model D, comprising a random intercept and a random time slope.
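The mixture p-values in Table 24.13 follow directly from the G² statistics; a short DATA step reproduces them from the −2 log-likelihood values listed there:

data mixture_p;
  g2_BD = 2578.54 - 2575.23;    /* Model D versus Model B: G2 = 3.31  */
  p_BD  = 0.5*(1 - probchi(g2_BD, 2)) + 0.5*(1 - probchi(g2_BD, 3));
  g2_DE = 2606.22 - 2578.54;    /* Model E versus Model D: G2 = 23.68 */
  p_DE  = 0.5*(1 - probchi(g2_DE, 1)) + 0.5*(1 - probchi(g2_DE, 2));
run;
proc print data=mixture_p; run;   /* yields p = 0.269 and p < 0.0001 */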

Finally, retaining the covariance structure we just selected, the mean model
can be reduced. Effects kept in the final model were time, type of mental
retardation, number of vaccine doses (with unstructured time effects), dura-
tion of residency (with a linear time trend), use of anti-epileptic medication,
and sex (time-constant effects). Although not significant, sex was kept in
the model for reasons of external comparison.

TABLE 24.13. Hepatitis B Vaccination Study. Selection of a serial correlation process and a random-effects structure.

                                        # Param.
Model   Description              Rand.   Ser.    −2ℓ       Comp.∗   G²      df   p
Selection of a serial correlation process:
A       None                     6       0       2623.72
B       Exponential              5†      2       2575.23   A        48.49   1    < 0.0001
C       Gaussian                 6       2       2574.31   A        49.41   2    < 0.0001
Selection of a random-effects structure:
B       Intercept, time,
          # doses                5†      2       2575.23
D       Intercept, time          3       2       2578.54   B        3.31         0.269††
E       Intercept                1       2       2606.22   D        23.68        < 0.0001†††

∗ Comparison model.
† One parameter could not be estimated due to parameter constraints.
†† From a 50:50 mixture of $\chi^2_2$- and $\chi^2_3$-distributions.
††† From a 50:50 mixture of $\chi^2_1$- and $\chi^2_2$-distributions.

Parameter estimates for this model are shown in Tables 24.14 and 24.15. It can be seen that no covariance parameter is given for the random effects in these tables. A comparison between model-based and empirical standard errors revealed large discrepancies (relative increases more than threefold for
most of the estimates) in the final model. Empirically corrected (or robust)
standard errors (Section 6.2.4) counteract the effect of potential misspeci-
fication of the covariance structure (Diggle, Liang, and Zeger 1994, Liang
and Zeger 1986) and disagreement between both types of standard errors
might point to an inadequately specified covariance structure. Arguably, we
had little reason to believe that the selected covariance function is substan-
tially incorrect. Therefore, it is wise to attain a trade-off between model fit
as reported by likelihood ratios, and differences occurring between model-
based and empirical standard errors. In particular, assuming a diagonal
instead of an unstructured covariance matrix for the random effects yields
a much better model in this respect and was therefore retained as our final
model. Most of the estimated empirical standard errors in Tables 24.14 and
24.15 do not exhibit changes of more than 25% compared to model-based
standard errors.

It is worth noting that the effect of Down's syndrome on antibody titer was significant at months 24, 36, and 48, indicating a faster decline in anti-HBs in this population than in the other mentally retarded residents. There did not seem to be

TABLE 24.14. Hepatitis B Vaccination Study. Parameter estimates and standard errors (model based; empirically corrected) for the final model (original data). Part I.

Effect Time Estimate (s.e.)

Mean Structure:
Intercept 9.36 (0.41; 0.38)
Time 1 −6.80 (0.21; 0.24)
Time 2 −4.30 (0.21; 0.20)
Time 7 †
Time 12 −1.66 (0.18; 0.13)
Time 13 −0.54 (0.36; 0.37)
Time 24 −3.36 (0.20; 0.19)
Time 36 −3.68 (0.21; 0.19)
Time 48 −4.21 (0.28; 0.27)
Time 60 −4.69 (0.31; 0.25)
Time 61 1.62 (0.34; 0.25)
Time 132 −1.92 (0.59; 0.55)
DS/OMR 1 0.60 (0.64; 0.54)
DS/OMR 2 0.37 (0.68; 0.62)
DS/OMR 7 −0.02 (0.56; 0.56)
DS/OMR 12 −0.30 (0.61; 0.82)
DS/OMR 13 ††
DS/OMR 24 −1.56 (0.53; 0.74)
DS/OMR 36 −1.75 (0.46; 0.60)
DS/OMR 48 −1.50 (0.56; 0.62)
DS/OMR 60 −0.61 (0.54; 0.44)
DS/OMR 61 −0.83 (0.69; 0.54)
DS/OMR 132 −1.18 (0.69; 0.39)
# doses 1 −1.34 (0.37; 0.24)
# doses 2 −1.78 (0.40; 0.36)
# doses 7 −2.45 (0.33; 0.36)
# doses 12 −2.71 (0.35; 0.42)
# doses 13 †††
# doses 24 −0.11 (0.31; 0.36)

DS/OMR: 1 = DS, 0 = OMR; # doses: 1 = 5 doses, 0 = 4 doses.
Sex: 1 = male, 0 = female; Anti-epileptic drugs: 1 = use, 0 = no use.
† Month 7 taken as reference point because the decision to give an extra booster
dose was taken at that time.
†† No measurements were available at month 13 in DS patients.
††† No measurements were available at month 13 in the group that was
administered four vaccine doses.

TABLE 24.15. Hepatitis B Vaccination Study. Parameter estimates and standard errors (model based; empirically corrected) for the final model (original data). Part II.

Effect Time Estimate (s.e.)

Mean Structure (continued):

# doses 36 −0.25 (0.27; 0.30)


# doses 48 0.00 (0.33; 0.35)
# doses 60 0.24 (0.32; 0.34)
# doses 61 −2.01 (0.41; 0.45)
# doses 132 −2.00 (0.41; 0.41)
Residency −0.04 (0.02; 0.01)
Residency∗Time 0.005 (0.002; 0.002)
Sex −0.01 (0.23; 0.23)
Anti-epileptic −0.63 (0.24; 0.23)

Random Effects:

Intercept 0.66
Time 0.02

Serial Structure:

Variance 0.58
Rate of exponential decrease (1/ρ) 2.32

Measurement Error:

Time 1 1.32
Time 2 1.37
Time 7 0.76
Time 12 0.70
Time 13 0.78
Time 24 0.48
Time 36 0.00
Time 48 0.46
Time 60 0.47
Time 61 1.31
Time 132 0.31

a difference in immediate response to vaccination between these two groups. Also, we see that the extra dose given at month 12 in G2 had sufficiently elevated anti-HBs titer so as to render it almost indistinguishable from anti-HBs titer in G1 until year 5. Yet, administration of a booster dose at

that time again led to better responses in G1 and this was still visible at
year 11.

A similar modeling strategy can be applied to the postvaccination data [i.e., data available after the last vaccination, at month 6 (G1) or 12 (G2)]. We need to specify different models for each of the two groups since postvaccination times are different. We can set up a model for prebooster data (until month 60) and then transpose this model to postbooster data, using an indicator variable for the time of booster administration. For instance, a simple model, ignoring potential covariates, could be written as follows:

$$E(Y_{ij}) \;=\; \begin{cases} \beta_0^{(1)} + \beta_1^{(1)} I(t_j \geq 55) + g\bigl(t_j - 55\,I(t_j \geq 55)\bigr) & \text{in group G1,} \\[1ex] \beta_0^{(2)} + \beta_1^{(2)} I(t_j \geq 49) + g\bigl(t_j - 49\,I(t_j \geq 49)\bigr) & \text{in group G2,} \end{cases}$$

where g(t) is a fractional polynomial (Royston and Altman 1994) (i.e., a linear combination of real-valued powers of t). This strategy is discussed in detail in Section 10.3. Let us briefly recapitulate the key concepts. Royston and Altman (1994) argue that conventional low-order polynomials offer only a limited family of shapes and that high-order polynomials may fit poorly at the extreme values of the covariates. Moreover, polynomials do not have finite asymptotes and cannot fit the data where limiting behavior is expected. This is a severe limitation in many cases. As a result, Royston and Altman (1994) propose an extended family of curves, which they call fractional polynomials. Conventional polynomials are included as a subset of this family. For a given degree m and an argument t > 0 (e.g., time), fractional polynomials are defined as

$$\beta_0 + \sum_{j=1}^{m} \beta_j t^{p_j},$$

where the $\beta_j$ are regression parameters, $t^0 \equiv \ln(t)$, and the powers $p_1 < \cdots < p_m$ are positive or negative integers or fractions (Royston and Altman 1994). They argue that polynomials with degree higher than 2 are rarely required in practice and further restrict the powers to a small predefined set of possibly noninteger values: $\Pi = \{-2, -1, -1/2, 0, 1/2, 1, 2, \ldots, \max(3, m)\}$. For example, setting m = 2 generates (1) four quadratics in powers of t, represented by $(1/t^2, 1/t)$, $(1/t, 1/\sqrt{t})$, $(\sqrt{t}, t)$, and $(t, t^2)$; (2) a quadratic in $\ln(t)$; and (3) other curves which have shapes different from those of conventional low-degree polynomials. The full definition includes possible "repeated powers," which involve powers of $\ln(t)$. For example, a fractional polynomial of degree m = 3 with powers (−1, −1, 2) is of the form

$$\beta_0 + \beta_1 t^{-1} + \beta_2 t^{-1} \ln(t) + \beta_3 t^2$$

(Royston and Altman 1994, Sauerbrei and Royston 1999). In this case, the set of powers we have considered ranged from −3 to 3 with increments of 0.5. We then searched for the best pair or triple of powers among this set. The combination (−3, 0) turned out to give an adequate fit to these data. See also Section 10.3.

TABLE 24.16. Hepatitis B Vaccination Study. Parameter estimates and standard errors (model-based; empirically corrected) for the fixed effects of the final model (postvaccination data).

Effect                             Estimate (s.e.)
Intercept G1                       8.808 (0.299; 0.284)
I(t ≥ 55)                          3.501 (0.151; 0.163)
(t − 55 I(t ≥ 55))^{−3}            −0.384 (0.202; 0.178)
log(t − 55 I(t ≥ 55))              −1.120 (0.052; 0.057)
DS/OMR G1                          −0.494 (0.673; 0.702)
Sex                                0.037 (0.278; 0.276)
Intercept G2                       6.389 (0.503; 0.582)
I(t ≥ 49)                          1.358 (0.272; 0.346)
(t − 49 I(t ≥ 49))^{−3}            1.601 (0.455; 0.446)
log(t − 49 I(t ≥ 49))              −0.435 (0.131; 0.154)
DS/OMR G2                          −2.706 (0.688; 0.533)
Use of anti-epileptic drugs        −0.632 (0.284; 0.281)
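The retained transformation is straightforward to construct as covariates for PROC MIXED; a minimal sketch for group G1, assuming a postvaccination data set postvacc with the measurement time t in months (names and coding hypothetical; the search over Π simply repeats the model fit for each candidate pair of powers):

data fpterms;
  set postvacc;
  ind   = (t >= 55);            /* booster indicator I(t >= 55)          */
  tstar = t - 55*ind;           /* time reset at booster administration  */
  if tstar > 0 then do;
    fp_m3 = tstar**(-3);        /* fractional-polynomial power -3 term   */
    fp_0  = log(tstar);         /* power 0 is defined as log(t)          */
  end;
run;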

Selection of the covariance structure was performed similarly to what was previously done, with the difference that no spatial process was found necessary. For the selection of the mean structure, the antibody titer measurement obtained at the last vaccination could have been included as a baseline covariate on top of the usual covariates, but this was not possible because no measurements were taken at month 6. Instead, the first log antibody titer measurement was used as baseline. Selected covariates were type of mental retardation (with a different effect depending on the number of vaccine doses administered), use of anti-epileptic medication, and sex. Table 24.16 presents the parameter estimates of the final model, for the fixed effects only.

In order to visually assess the fit of these two models, Figure 24.32 shows
observed and predicted average profiles for combinations of number of vac-
cine doses and type of mental retardation. For predicted average profiles,
all other covariate effects were set equal to their mean values.

FIGURE 24.32. Hepatitis B Vaccination Study. Observed and predicted mean profiles for combinations of number of vaccine doses and type of mental retardation: (a) original data; (b) postvaccination data.

On accommodating individual-specific effects, our model enables a much more precise assessment of important explanatory variables, such as number of vaccine doses received, whether or not a person has Down's syndrome, and, of course, time effects. In particular, the strong contributions of random intercepts and serial correlation show the importance of the initial response as well as the individual trajectory for the further evolution of
anti-HBs profiles. Models that restrict attention to geometric mean titers
(GMT) calculation are not able to include such individual-specific effects,
typically resulting in less precise inference, also for the fixed effects.

In this study, no difference can be detected between DS and OMR patients in their immediate response, but we find that DS induces an accelerated decrease in antibody titers, implying that the rate of decline in antibody titers might be different in these two populations. This might explain why some other studies attempting to demonstrate a difference between DS

and OMR patients in their anti-HBs response after vaccination have failed. See Vellinga et al. (1999c) for further discussion. Another point concerns whether anti-epileptic medication has an influence on the immune system. However, it is hard to decide whether this is due to the medication itself, or rather an indication of the influence of epilepsy, or both (De Ponti et al. 1993).

Although there is some interest in modeling the complete set of data from
a descriptive viewpoint, this could only be achieved with the help of a time-
saturated model to account for the high nonlinearity in the profiles. If one is
interested in a more parsimonious, parametric description of the temporal
decline in antibody titer, to address such questions as long-term influence,
one needs to resort to an alternative solution. Focusing the analysis on
postvaccination data is a suitable alternative, enabling simple parametric
modeling of the anti-HBs evolution over time. Obviously, the absence of
intermittent measurements between years 5 and 11 weakens the long-term prediction process, and we need to make certain assumptions, such as that the rate of decline after booster administration at month 60 is similar to the rate of decline after the time of last vaccination. It is nevertheless reassuring
to see that predicted values at year 12 were all in good agreement, inde-
pendently of the model or method chosen. We conclude that each of these
two models may bring their own insight into the data and their combined
use may better serve the purpose of a sensitivity analysis.

24.5.2 Prediction at Year 12

We now address the issue of predicting antibody titer at year 12 (i.e., 1 year after the last follow-up contact). This extrapolation problem is complicated by the design feature that no measurements are available between months 61 and 132, whereas apparently we are dealing with nonlinear profiles. Although the use of a time-saturated model for the mean structure is suitable for building an acceptable model, it is less so when it comes to prediction, in particular when interest centers on future prediction. Therefore, we propose two simple methods to perform such a prediction and compare the results to the fractional-polynomial approach on postvaccination data.

The first approach merely uses a linear extrapolation based on the last two measurements (at months 61 and 132) of an individual. The resulting extrapolations are then averaged out to obtain a prediction at month 144. Obviously, this approach can be criticized as being overly simple, since the profiles are clearly nonlinear over the first 5 years. A refinement of this method might consist of overlaying profiles for the month 61–132 period with profiles from the first part of the study and then extrapolating until month 144. It raises some technical difficulties though, since the starting point of the first period (time of the last vaccine dose) depends upon the group being considered: month 7 for group G1 and month 13 for G2. Using month 24 as a cutoff point to split the first time period into two pieces, we can linearly approximate the profiles in these two time windows, translate them to the month 61–132 period, extrapolate until month 144, and eventually average the results across the two groups.

TABLE 24.17. Hepatitis B Vaccination Study. Predicting log antibody titer (IU/L) at year 12: (a) Approach 1: linear extrapolation (original data); (b) Approach 2: refined linear extrapolation (original data); (c) Approach 3: fractional polynomials model (post-vaccination data).

Group                     Approach 1   Approach 2   Approach 3
4 vaccine doses (G1)      7.05         7.38         7.13
5 vaccine doses (G2)      5.01         5.42         5.41
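Approach 1 amounts to a two-line DATA step; a sketch assuming one record per subject, with the month 61 and month 132 measurements in variables y61 and y132 and the dose group in dosegrp (all names hypothetical):

data pred144;
  set hbv_wide;
  slope   = (y132 - y61) / (132 - 61);   /* per-subject slope, months 61-132 */
  pred144 = y132 + 12*slope;             /* extrapolate 12 months further    */
run;

proc means data=pred144 mean;   /* average the predictions per dose group */
  class dosegrp;
  var pred144;
run;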

Predictions at year 12 based on these two extrapolation methods are displayed in Table 24.17, together with the prediction inferred from the model on postvaccination data. As can be seen, all three approaches yield very similar results.

We conclude with two remarks. First, in this study, prediction takes place only 1 year after completion of the study, which is not too distant in time compared to the duration of the study. Had prediction been required several years later, we would likely have observed larger discrepancies. Second, in
studies where more emphasis is to be put on prediction, it is a good idea
to plan intermittent assessment occasions to aid in modeling long-term
temporal evolution. A simple model (e.g., using fractional polynomials)
might then be used straightforwardly for making long-term inference with
more confidence.

24.5.3 SAS Code for Vaccination Models

We describe how to use the SAS statistical software package to fit our starting (and thus most complex) model on the one hand, and the finally selected model on the other hand. The models are discussed in Section 24.5.1.

The SAS code to fit the initial model is



proc mixed data = hbv method = ml info noclprint;
  class timecls ds_omr dosegrp sex epilepsy;
  model loganti = timecls ds_omr(timecls)
                  dosegrp(timecls) dur_res(timecls)
                  sex bmi age epilepsy / s;
  random intercept time ds_omr / subject = patid
                                 type = un;
  repeated timecls / subject = patid type = sp(exp)(time)
                     local = exp(timecls);
run;

Apart from an unstructured time trend, the fixed-effects structure includes time-varying effects of Down's syndrome versus other mental retardation (DS/OMR), dose group, and duration of residency. Effects of sex, body mass index, age, and influence of epilepsy are time invariant. Note that time is included as a class variable, implying that there is a different time effect parameter for each time point, as well as a different parameter at each time for the time-varying covariates. Apart from a random-intercepts term, there is a linear random time effect and an effect of DS/OMR. A different copy of time is used so that it can be left out of the CLASS statement. The serial correlation structure in time is specified by means of the REPEATED statement, where the 'type=' option is included to indicate that a spatial exponential process is required. To ensure that, in addition to random effects and serial variance, measurement error is also allowed to be present, the 'local' option is used. More precisely, 'local=' allows the measurement error to depend on covariates. In our case, 'exp(timecls)' includes a different measurement error variance component for each time point. Formally, the following contribution is added to $\Sigma_i$:

$$\sigma^2 \, \mathrm{diag}\left[\exp(U_i \delta)\right],$$

where $U_i$ is a design matrix built from the covariates included and $\delta$ are the estimated and reported parameters. Note that the exponential function ensures non-negative components of variance.
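For instance, with the 11 measurement occasions coded as 0/1 dummies in $U_i$, a subject observed at all occasions contributes

$$\sigma^2 \, \mathrm{diag}\bigl[\exp(\delta_1), \ldots, \exp(\delta_{11})\bigr],$$

so that occasion $k$ receives its own measurement error variance $\sigma^2 \exp(\delta_k)$.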

As an aside, the 'info' option in the PROC MIXED statement provides preliminary information, such as an expression for the model fitted and the dimensions of relevant matrices.

The final model is fitted using the following syntax:



proc mixed data = hbv method = ml info noclprint;
  class timecls ds_omr dosegrp sex;
  model loganti = timecls ds_omr(timecls) dosegrp(timecls)
                  dur_res dur_res*time sex / s;
  random intercept time / subject = patid type = vc;
  repeated timecls / subject = patid type = sp(exp)(time)
                     local = exp(timecls);
run;
Appendix A
Software

A.1 The SAS System

A.1.1 Standard Applications

Many of the analyses done in this book have been performed in SAS, in par-
ticular using the procedure MIXED. An extensive overview is to be found
in Chapter 8. Here, the focus is on the Baltimore Longitudinal Study of
Aging (prostate cancer data; Section 2.3.1). The main program features are
discussed, as well as the output. The SAS procedure MIXED is explicitly
discussed in a number of other chapters. In Chapter 9, the prostate cancer
data are used to illustrate model building principles. In Chapter 17, both
complete as well as incomplete versions of the growth data (Section 2.6)
are analyzed using PROC MIXED. The use of SAS for pattern-mixture
based sensitivity analysis is documented in Chapter 20. Finally, SAS code
is included for several case studies in Chapter 24.

A.1.2 New Features in SAS Version 7.0

In this book, all analyses using the SAS System have been carried out using Version 6.12. Although this may seem anomalous to many, given the availability of Version 7.0 and higher, it has to be noted that Version 7.0 (SAS 1999) was not available on a commercial basis in 1999, for example,
in Europe. For a thorough description of PROC MIXED in SAS Version
7.0, we refer to the on-line manual (SAS 1999).

In this section, we will highlight some of the important changes that have
been implemented in Version 7.0 with respect to PROC MIXED. They are
ordered by statement.

PROC MIXED Statement

The option 'CL', requesting confidence intervals for the covariance parameter estimates, has been modified. For those parameters with a default lower bound of zero (diagonal elements in a covariance matrix), Satterthwaite approximations are used:

$$\frac{\nu \widehat{\sigma}^2}{\chi^2_{\nu,\,1-\alpha/2}} \;\leq\; \sigma^2 \;\leq\; \frac{\nu \widehat{\sigma}^2}{\chi^2_{\nu,\,\alpha/2}}, \qquad (A.1)$$

where $\nu = 2Z^2$ and $Z$ is the classical Wald statistic $\widehat{\sigma}^2 / \text{s.e.}(\widehat{\sigma}^2)$. Using 'CL<=Wald>' requests a Wald version, rather than the modified limits in (A.1).
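As an illustration of (A.1), the following DATA step computes the Satterthwaite limits for a variance component, here fed with the measurement error variance 0.024 and standard error 0.002 from Table 24.10:

data satterthwaite;
  sigma2 = 0.024;  se = 0.002;   /* estimate and its standard error  */
  z      = sigma2 / se;          /* Wald statistic                   */
  nu     = 2*z**2;               /* Satterthwaite degrees of freedom */
  alpha  = 0.05;
  lower  = nu*sigma2 / cinv(1 - alpha/2, nu);
  upper  = nu*sigma2 / cinv(alpha/2, nu);
run;
proc print data=satterthwaite; run;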

The ‘method’ option has been extended to include, apart from REML, ML,
and MIVQUE0, also TYPE1, TYPE2, and TYPE3. The new methods re-
quest analysis of variance (ANOVA) estimation of the corresponding type,
producing method-of-moment variance component estimates. Of course,
these methods are available only when there is no ‘subject=<effects>’ op-
tion and no REPEATED statement.

Also, 'namelen=<number>' is a new option, which specifies the length to which long effect names are shortened.

MAKE Statement and Output Delivery

A major change to PROC MIXED (and all other procedures) is the way in
which output is handled. PROC MIXED now makes use of the integrated
ODS (output delivery system). This implies that the MAKE statement has
become obsolete. Although still supported, it is envisaged that this will no
longer be the case in later versions.

For example, a Version 6 statement of the form

make ’covparms’ out = cp;



within the MIXED procedure is replaced by

ods output covparms = cp;

to be placed in front of PROC MIXED. For a thorough discussion on the capabilities of ODS, we refer to SAS (1999).

MODEL Statement

In order to have more efficient access to the covariance structure of the estimated fixed effects, two new options have been added: 'corrb', producing the asymptotic correlation matrix, and 'covbi', producing the inverse of the asymptotic covariance matrix. For the latter, there is still an alias: 'xpvix'.

The ‘ddfm=satterth’ option is now extensively documented.

Given the ODS structure, predicted means and predicted values are now handled in a different way. Precisely, the options 'outpredm=SAS-data set' and 'outpred=SAS-data set' have to be used. Options 'pred' and 'predmeans' have become obsolete. Aliases for the new options are 'outpm=SAS-data set' and 'outp=SAS-data set'.

PRIOR Statement

This statement has undergone a major revision. New options are as follows:

'data=': allows input of user-defined prior densities for variance components.
‘alg=’: specifies the algorithm used for generating the posterior sample.
‘bdata’: allows input of the base densities used by the sampling algorithm.
‘grid=’: a grid of values over which to estimate the posterior density.
‘gridt=’: specifies a transformed grid of values over which to estimate the
posterior density.
‘lognote=number’: writes a log note to screen each time a sample is gen-
erated of which the sequence number is a multiple of the specified
number.
‘logrbound=number’: specifies the bounding constant for rejection sam-
pling. This option has been available since Version 6.12.

‘out=’: output data set with the sample of the posterior density.
‘outg=’: output data set from the grid evaluations.
‘outgt=’: output data set from the transformed grid evaluations.
‘psearch’: displays the search used to determine the parameters for the
inverted gamma densities.
‘ptrans’: displays the transformation of the variance components.
‘seed=’: starting value for the random number generation within a call of
PROC MIXED.
‘tdata=’: allows to input the transformation of the covariance parameters
used by the sampling algorithm.
‘trans=’: specifies the algorithm used to determine the transformation of
the covariance parameters.

RANDOM Statement

The option 'nofullz' eliminates the columns in Z corresponding to missing levels of random effects involving class variables.

REPEATED Statement

Since Version 6.12, multivariate repeated measures can be fitted. To this end, direct-product covariance structures are available, of which the first factor specifies the correlation among the components of the multivariate outcome vector and the second factor specifies the correlation over time within a component. Three structures are available: 'type=un@ar(1)', 'type=un@cs', and 'type=un@un'. An example is

model y = var time var*time;


repeated var time / type = un@ar(1) subject = subject;

Note that all outcomes for a given individual are still stacked, as with uni-
variate repeated measures. Thus, two indicators are needed: var to indicate
which of the multivariate outcomes is listed and time indicating the longi-
tudinal structure. The ‘type=’ option specifies an unstructured covariance
matrix among the components at a given time (a typical assumption in
multivariate data, cf. PROC GLM), and further an AR(1) process for re-
peated measures of a specific outcome. It is then assumed that for different
outcomes at unequal measurement times, the Kronecker product specifies
the appropriate covariance. Note that this is an assumption. Although not
implemented in the SAS procedure MIXED, the modeler could in principle
consider more complex structures.
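
In symbols, and merely as a sketch for a bivariate outcome measured at a
common set of time points, the ‘type=un@ar(1)’ structure assumes

\[
\mathrm{var}(\boldsymbol{Y}_i) \;=\; \Sigma \otimes R,
\qquad
\Sigma \;=\;
\begin{pmatrix}
\sigma_1^2 & \sigma_{12}\\
\sigma_{12} & \sigma_2^2
\end{pmatrix},
\qquad
R_{jk} \;=\; \rho^{|j-k|},
\]

with Σ the unstructured covariance matrix among the two outcome components
and R the AR(1) correlation matrix over time.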

Testing for Variance Components

The PROC MIXED documentation explicitly acknowledges the complexity
of hypothesis testing for variance components, as discussed in Section 6.3.4.

A.2 Fitting Mixed Models Using MLwiN

The package MLwiN can be used to fit a wide variety of linear mixed
models. This package is based on the multilevel model concept (Goldstein
1979, 1995). Using the growth data, introduced in Section 2.6 and analyzed
in Section 17.4, we will briefly introduce multilevel modeling, with emphasis
on longitudinal studies, and illustrate the use of the MLwiN package. The
most important source of information is Goldstein (1995), which contains
ample references to further reading, and the User’s Guide to MLwiN. The
manual and other relevant information can be obtained from the websites
on multilevel models (with Australian and American mirror sites) and the
website on MLwiN:

http://www.ioe.ac.uk/multilevel/
http://www.medent.umontreal/multilevel/
http://www.edfac.unimelb.edu.au/multilevel/
http://www.ioe.ac.uk/mlwin/

For the growth data, we will focus on Model 6, introduced on p. 252.
MLwiN Version 1.02.0002 will be used. This model assumes separate linear
profiles for boys and girls, implying four fixed-effects parameters, as well
as a random intercept and a random age slope, with an unstructured 2 × 2
covariance matrix D. Finally, a measurement error term is added. Similar
to (17.10), which assumes an unstructured covariance matrix Σi , we can
write this model as
Yij = β0 + b0i + β01 xi + β10 tj (1 − xi ) + β11 tj xi + b1i tj + εij ,   (A.2)
where (b0i , b1i ) follows a N (0, D) distribution and the residual errors are
uncorrelated with zero mean and variance σ 2 . Recall that xi is 1 for boys
and 0 for girls and that tj indicates measurement ages 8, 10, 12, and 14.
Now, let us group the model terms per covariate effect:
Yij = (β0 + b0i + εij ) 1   (A.3)
    + β01 xi   (A.4)
    + β10 tj (1 − xi )   (A.5)
    + β11 tj xi   (A.6)
    + b1i tj .   (A.7)

TABLE A.1. Growth Data. Model effects grouped per covariate.

Number   Effect        Fixed    Level 2   Level 1
  0      Intercept      β0       b0i       εij
  1      Male           β01
  2      Female∗Age     β10
  3      Male∗Age       β11
  4      Age                     b1i

Model (A.2) is equivalent to Models (A.3)–(A.7). The intercept or constant
term is explicitly included in (A.3). In the latter, the coefficients are
grouped per effect, as detailed in Table A.1. Fixed effects are listed in the
column labeled “Fixed.” By “Level 2” we indicate all random effects (i.e.,
those effects which are random at the level of the individual subject). “Level
1” contains the measurement error (i.e., the term which is random at the
level of the individual measurement within a subject). The level can also
be deduced from the subscripts i and j affixed to the parameters. Fixed ef-
fects carry no subscripts, neither i nor j. Level 2 effects carry only the first
subscript i, whereas level 1 effects are subscripted by both i and j. Observe
that the coefficient of the intercept consists of three parts: a fixed effect, a
random intercept, and a residual error term. Alternatively, these terms can
be termed fixed effect, random effect at level 2, and random effect at level
1. Covariates 1–3 have a fixed effect only, whereas age only has a random
effect at level 2 (random slope). This is due to the fact that the fixed slope
is allowed to differ for boys and girls, whereas a common random slope is
assumed for both. In multilevel notation, this model can be rewritten as

Yij = β0ij 1 + β1 xj + β2 ti (1 − xj ) + β3 ti xj + u4j ti ,   (A.8)

β0ij = β0 + u0j + e0ij . (A.9)

Using the MLwiN package, Models (A.8)–(A.9) can be implemented very easily
using the drop-down window structure. The result is shown in Figure A.1.
The corresponding maximum likelihood estimates are displayed in Figure A.2.

FIGURE A.1. Growth Data. Symbolic MLwiN equation for Models (A.8)–(A.9).

FIGURE A.2. Growth Data. MLwiN estimates for Models (A.8)–(A.9).

The coefficient of every effect is allowed to include any combination of a
fixed effect and random effects at all levels. In this case, there are only
two levels, but higher-order hierarchies are implemented as well and are no
more difficult to construct. Estimation is done using maximum likelihood
or iterative generalized least squares. Alternative estimation methods are
restricted iterative generalized least squares, parametric bootstrap, Gibbs
sampling, and Metropolis-Hastings.

FIGURE A.3. Growth Data. Observed profiles.

FIGURE A.4. Growth Data. Predicted means.

Model simplification can flexibly be conducted by clicking on terms that
need to be deleted. For example, if the covariance between both random
effects is judged to be redundant, it can be removed by simply clicking on
this term and reestimating the model.

There are ample graphical capabilities. For example, a simple plot of the
raw profiles is shown in Figure A.3. Predicted means are displayed in
Figure A.4. This plot is based on the fixed-effects structure. Therefore,
two fitted straight lines are shown, one for boys and one for girls.
Empirical Bayes predictions are displayed in Figure A.5. Since the model
assumes random intercepts and random slopes, individual linear profiles are
produced.

FIGURE A.5. Growth Data. Predicted individual profiles.

MLwiN uses data by means of a work sheet. This can easily be manipulated.
Variables can be renamed and transformed, and new ones can be created. For
example, if predicted values are computed, they can be added to the work
sheet as a new column. Subsequently, predicted values can be used in the
graphical window to generate plots such as Figures A.4 and A.5.

Multilevel modeling is not restricted to repeated measures. Rather, this
modeling paradigm applies equally well to hierarchical survey data,
hierarchical multivariate outcomes, time series, and so forth. Apart from
random effects at the various levels, correlated errors at level 1 (i.e.,
serial correlation) are allowed and can be fitted using a macro provided
within MLwiN. Apart from linear models, nonlinear models are allowed.
Further, the methodology also applies to categorical and time-to-event
outcomes.

A.3 Fitting Mixed Models Using SPlus

SPlus provides various ways to estimate mixed models. On the one hand,
the built-in function lme() for linear mixed-effects models can be used.
Note that there is a companion function for nonlinear mixed-effects models,
nlme(). These functions are based on work by Lindstrom and Bates (1988),
Laird and Ware (1982), Box, Jenkins, and Reinsel (1994), and Davidian and
Giltinan (1995). The lme() function will be discussed in Section A.3.1. On
the other hand, a third-party suite of SPlus functions, termed OSWALD
(Smith, Robertson, and Diggle 1996), has been developed to fit longitudinal
models. The package is based on the methods described in Diggle, Liang, and
Zeger (1994). In particular, the pcmid() function fits linear mixed models.
Many of the functionalities between lme() and pcmid() are shared, but an
attractive

feature specific to pcmid() is its capability to jointly estimate
measurement and dropout models, based on the model of Diggle and Kenward
(1994). This selection model allows for MCAR, MAR, as well as MNAR dropout.
It will be studied in Section A.3.2.

A.3.1 Standard SPlus Functions

As in Section A.2 on multilevel modeling and MLwiN, we will use the
growth data (Section 2.6) to illustrate the built-in function lme(). SPlus
Version 4.5 is used. Apart from the references mentioned earlier, which give
the theoretical underpinning, there is ample documentation within SPlus.
The on-line manual provides a 53-page discussion of linear and nonlinear
mixed-effects models. The function lme() is generic. The on-line help
system of SPlus provides a brief account of the syntax of this generic
function. Methods functions are being developed for specific classes of
objects. The methods function lme.formula() comes with ample documentation.

Let us discuss the main arguments:

Fixed effects. The structure is specified by means of the fixed argument,
using standard formulas.

Random effects. The random-effects structure is specified through random.
Additional arguments to tune the random-effects model are re.block
(describing the blocking structure), re.structure (specifying the form of
the D matrix), and re.paramtr (specifying how the D matrix is internally
parameterized). The latter argument is included to improve numerical
stability and to ensure that the resulting D matrix is positive definite.
Values of this argument refer to the Cholesky decomposition, the matrix
logarithm, and several others.

Serial correlation. This structure is defined by means of the argument
serial.structure. In the case that a serial correlation structure
depending on time is assumed, the arguments serial.covariate and
serial.covariate.transformation can be used to specify this aspect of the
serial process.

Residual variance. The residual variance function is defined by means of
var.function. Fine-tuning can be done using var.covariate and var.estimate
(indicating whether the variance parameters are to be estimated or to be
kept fixed at their initial values).

Clusters. The clusters (subjects, units, etc.) are defined using cluster.

Method of estimation. Both maximum likelihood and REML are provided. The
user’s preference can be specified by means of the argument est.method.

Other tools include subsetting, specifying the action to be undertaken on
missing data, and control over the estimation algorithm.

Let us apply the function lme.formula() to fit Model 6 to the growth data,
as was done using MLwiN in Section A.2. The following program can be used.

my.lme <- lme.formula(
fixed = MEASURE ~ 1 + MALE + MALEAGE + FEMAGE,
random = ~ 1 + AGE,
cluster = ~ IDNR,
data = growth5.df,
re.structure = "unstructured",
na.action = "na.omit",
est.method = "ML")

Printing the object my.lme produces

Call:
Fixed: MEASURE ~ 1 + MALE + MALEAGE + FEMAGE
Random: ~ 1 + AGE
Cluster: ~ (IDNR)
Data: growth5.df

Variance/Covariance Components Estimate(s):

Structure: unstructured
Parametrization: matrixlog
Standard Deviation(s) of Random Effect(s)
(Intercept) AGE
2.134752 0.1541473
Correlation of Random Effects
(Intercept)
AGE -0.6025632

Cluster Residual Variance: 1.716206

Fixed Effects Estimate(s):


(Intercept) MALE MALEAGE FEMAGE
17.37273 -1.032102 0.784375 0.4795455

Number of Observations: 108


Number of Clusters: 27

Although the above output is rather brief, one can obtain a more extensive
summary:

> my.lme.2 <- summary(my.lme)


> my.lme.2

Call:
Fixed: MEASURE ~ 1 + MALE + MALEAGE + FEMAGE
Random: ~ 1 + AGE
Cluster: ~ (IDNR)
Data: growth5.df

Estimation Method: ML
Convergence at iteration: 6
Log-likelihood: -213.903
AIC: 443.806
BIC: 465.263

Variance/Covariance Components Estimate(s):


Structure: unstructured
Parametrization: matrixlog
Standard Deviation(s) of Random Effect(s)
(Intercept) AGE
2.134752 0.1541473
Correlation of Random Effects
(Intercept)
AGE -0.6025632

Cluster Residual Variance: 1.716206

Fixed Effects Estimate(s):


Value Approx. Std.Error z ratio(C)
(Intercept) 17.3727273 1.18203467 14.6973077
MALE -1.0321023 1.53550808 -0.6721568
MALEAGE 0.7843750 0.08275405 9.4783886
FEMAGE 0.4795455 0.09980513 4.8048175

Conditional Correlation(s) of Fixed Effects Estimates


(Intercept) MALE MALEAGE
MALE -7.698004e-001
MALEAGE 6.198039e-016 -5.617972e-001
FEMAGE -8.801671e-001 6.775530e-001 -1.691642e-016

Random Effects (Conditional Modes):


(Intercept) AGE
1 -0.68278894 -0.039972872
2 -0.45926352 0.071886460
3 -0.03109489 0.093020178
4 1.61182535 0.030832363
5 0.43850471 -0.043000835
.....
25 0.50935427 -0.055453935
26 -0.10573027 0.083999487
27 -0.89462307 -0.076992100

Standardized Population-Average Residuals:


Min Q1 Med Q3 Max
-3.335979 -0.4153858 0.01039114 0.4916851 3.858188

Number of Observations: 108


Number of Clusters: 27

The estimates and standard errors coincide with those obtained with ML-
wiN (Figure A.2). This is immediately clear for the fixed-effects estimates,
their standard errors, and the residual variance. The components of the D
matrix have to be derived from the standard deviations and correlation of
the random effects:
d11 = (2.134752)² = 4.557,
d12 = (−0.6025632)(2.134752)(0.1541473) = −0.198,
d22 = (0.1541473)² = 0.024.
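
As a quick numerical check, D can be rebuilt from the printed standard
deviations and correlation within SPlus itself (a small sketch, with the
numbers copied from the output above):

sd.re   <- c(2.134752, 0.1541473)          # SDs of random intercept and AGE slope
corr.re <- matrix(c(1, -0.6025632,
                    -0.6025632, 1), 2, 2)  # correlation of the random effects
D <- diag(sd.re) %*% corr.re %*% diag(sd.re)
D                                          # d11 = 4.557, d12 = -0.198, d22 = 0.024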

As is the case with MLwiN, SPlus in general and lme() in particular have
extensive graphical capabilities.

A.3.2 OSWALD for Nonrandom Nonresponse

Even though a word of care was issued in Chapters 17–20 about nonrandom
dropout models, restricting the model-building exercise simply to MAR
mechanisms is equally dangerous, since the MAR assumption is in itself
fundamentally untestable. Often, the choice to restrict model building to
MAR is driven by the lack of software for fitting more general models.
Many of the software packages do not allow, at present, the exploration of
nonrandom dropout mechanisms.

In this section, we will illustrate how a family of nonrandom dropout
models can be fitted with the OSWALD (Smith, Robertson, and Diggle 1996)
software in SPlus. Several data sets have been fitted using OSWALD (the
toenail data in Section 17.2 and the Vorozole study in Section 17.6). Here,
we will study another, relatively simple example.

In a heart failure study, the primary efficacy endpoint is based upon the
ability to do physical exercise. This ability is measured in the number of
seconds a subject is able to ride the exercise bike. There are 25 subjects
assigned to placebo and 25 to treatment. The treatment consisted of the ad-
ministration of ACE inhibitors. Four measurements were taken at monthly
intervals. Table A.2 presents outcome scores, transformed to normality. We
will refer to them as the exercise bike data. All 50 subjects are observed
at the first occasion, whereas there are 44, 41, and 38 subjects seen at the
second, the third, and the fourth visits, respectively.

To be able to perform a comparison between PROC MIXED and OSWALD,
we will restrict attention to a set of models that can be fitted reasonably
well with both packages (Diggle 1988). The measurement models belong to
the general class (3.8). We consider it useful to decompose the variability
explicitly into three components:

var(Yi) = Zi D Zi′ + H̃i + τ²Ii.

The notation used here is chosen to reflect the OSWALD output and hence
deviates slightly from earlier conventions. Recall that H̃i represents the
serial correlation and τ² the measurement error variance. For the remainder
of this section, we will restrict the random effects to a random intercept,
and to balanced designs with measurement occasions common to all subjects.
As a result, the variance can be selected to have the following, simplified
form:

var(Yi) = ν²J + H̃ + τ²I.

Finally, writing H̃ in terms of its correlation matrix H yields

var(Yi) = ν²J + σ²H + τ²I.   (A.10)
The symbols ν 2 , σ 2 , and τ 2 are chosen to reflect the names used in OS-
WALD.

The first model we consider is the random intercept Model 7 used for the
growth data (see Section 17.4.1, p. 253). It is given by omitting the H
term from (A.10).

TABLE A.2. Exercise Bike Data.

Placebo Treatment
1 2 3 4 1 2 3 4
0.43 0.94 4.32 4.51 −2.54 −0.20 −0.15 3.53
3.10 5.82 5.59 6.32 4.33 5.57 6.86 6.87
0.56 2.21 1.18 1.54 −2.46 . . .
−1.18 −0.30 2.48 2.67 2.30 4.64 7.37 7.99
1.24 2.83 1.98 3.21 0.73 3.29 5.23 6.12
−1.87 −0.06 1.16 1.84 0.38 1.25 2.91 4.71
−0.28 1.30 . . 1.51 4.00 5.98 .
2.93 . . . 0.38 0.94 3.28 4.05
−0.20 3.34 3.71 3.69 0.42 2.53 . .
−0.12 2.01 2.35 2.70 2.41 4.24 4.79 8.14
−1.60 1.42 0.41 0.72 0.12 1.48 3.12 3.69
0.64 . . . −3.46 −0.93 2.78 3.02
−1.14 −1.20 0.09 2.39 −0.55 . . .
2.24 2.12 3.00 1.52 0.74 2.40 4.04 5.61
−0.44 0.88 2.83 1.47 2.37 2.79 4.05 5.91
0.39 1.77 3.62 4.35 1.94 5.05 3.06 5.89
−4.37 −2.43 −0.43 −0.13 0.77 2.46 . .
0.20 2.05 3.18 5.13 1.32 . . .
1.31 3.82 2.70 3.59 2.15 4.84 7.70 8.29
−0.38 −1.92 −0.12 −0.40 −0.09 2.02 4.68 5.29
−0.78 . . . 2.10 4.91 7.48 8.91
−0.48 0.32 0.66 3.03 1.36 0.62 1.87 .
−0.64 1.53 1.29 . 3.14 5.79 5.95 7.50
0.88 2.10 1.90 3.51 −0.94 −0.08 3.57 3.80
2.02 3.10 4.93 4.76 0.89 1.51 3.14 5.96

The next two models are new and, to avoid confusion, they will be assigned
numbers 9 and 10. The second model (Model 9) supplements a random intercept
with serial correlation. We will choose the serial correlation to be of the
AR(1) type. This model is found by omitting the measurement error component
τ²I from (A.10) and choosing the elements of H to be hjk = ρ^{|j−k|} =
exp(−φ|j − k|). Model 10 combines all three sources of variability. SAS
code for these models is

proc mixed data = m.bike method = ml covtest;


title ’Exercise Bike, Dropout, Model 7’;
class group id;
model y = group time*group / s;
repeated / type = cs subject = id r rcorr;
run;

proc mixed data = m.bike method = ml covtest;


title ’Exercise Bike, Dropout, Model 9’;
class group id;
model y = group time*group / s;
repeated / type = ar(1) subject = id r rcorr;
random intercept / type = un subject = id g;
run;

proc mixed data = m.bike method = ml covtest;


title ’Exercise Bike, Dropout, Model 10’;
class group id;
model y = group time*group / s;
repeated / type = ar(1) local subject = id r rcorr;
random intercept / type = un subject = id g;
run;

The mean and covariance model parameters are summarized in Table A.3.
The parameters supplied by PROC MIXED are supplemented with some
additional quantities in order to obtain both sets of intercepts and slopes
for the two treatment groups. Thus, intercept 0 is the sum of intercept
1 and the group 0 effect. Further, φ and ρ are connected by φ = − ln(ρ).
Although the mean models are straightforward to interpret from the SAS
output, it is necessary to approach the covariance parameters output with
some care.
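
For example, the Model 9 serial correlation parameters in Table A.3 are
linked by a one-line conversion:

phi <- -log(0.3080)   # yields 1.1778, the value reported in Table A.3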

As with the growth data, the random intercept Model 7 is introduced via the
‘type=CS’ option in the REPEATED statement. Alternatively, we could have
specified a random intercept using the RANDOM statement. The fitted
covariance matrix is

\[
\tau^2 I + \nu^2 J =
\begin{pmatrix}
3.1403 & 2.3781 & 2.3781 & 2.3781\\
2.3781 & 3.1403 & 2.3781 & 2.3781\\
2.3781 & 2.3781 & 3.1403 & 2.3781\\
2.3781 & 2.3781 & 2.3781 & 3.1403
\end{pmatrix}.
\]

The SAS summary of the covariance parameters is



TABLE A.3. Exercise Bike Data. SAS output on Models 7, 9, and 10.

Parameter Interpretation Model 7 Model 9 Model 10


Intercept Intercept 1 −0.8082 −0.8218 −0.8236
Group 0 0.1915 0.1608 0.1606
Intercept 0 −0.6167 −0.6610 −0.6630
Time*group 0 Slope 0 0.9200 0.9236 0.9230
Time*group 1 Slope 1 1.6434 1.6449 1.6451
ν2 Random interc. 2.3781 2.1585 2.0938
τ2 Meas. error 0.7622 0.2097
σ2 Serial variance 0.9484 0.8012
ρ Serial corr. 0.3080 0.4440
φ Serial corr. (exp.) 1.1778 0.8119
Deviance 564.27 559.69 559.68

Covariance Parameter Estimates (MLE)

Cov Parm Subject Estimate

CS ID 2.37805794
Residual 0.76223505

from which it follows that ν² = 2.3781 and τ² = 0.7622 = 3.1403 − 2.3781.

Model 9 replaces the independent measurement errors with serially
correlated errors. This requires the use of the RANDOM and REPEATED
statements simultaneously. The ‘r’ matrix produced by SAS is now
interpreted as

\[
\sigma^2 H =
\begin{pmatrix}
0.9484 & 0.2921 & 0.0899 & 0.0277\\
0.2921 & 0.9484 & 0.2921 & 0.0899\\
0.0899 & 0.2921 & 0.9484 & 0.2921\\
0.0277 & 0.0899 & 0.2921 & 0.9484
\end{pmatrix}.
\]

Observe that this ‘r’ matrix has a completely different interpretation in
these two models, since they refer to different sources of variability. The
corresponding correlation matrix is

\[
H =
\begin{pmatrix}
1.0000 & 0.3080 & 0.0948 & 0.0292\\
0.3080 & 1.0000 & 0.3080 & 0.0948\\
0.0948 & 0.3080 & 1.0000 & 0.3080\\
0.0292 & 0.0948 & 0.3080 & 1.0000
\end{pmatrix}
\]



from which we deduce that ρ = 0.3080. In addition, the output labeled
‘G Matrix’ corresponds to the variance of the random intercept, ν² = 2.1585.
The covariance parameters are given in

Covariance Parameter Estimates (MLE)

Cov Parm Subject Estimate

UN(1,1) ID 2.15845062
AR(1) ID 0.30795893
Residual 0.94837375

of which all components have been discussed. It is slightly misleading that
σ² = 0.9484 is labeled “residual,” since it does not refer to measurement
error but rather to the variance of the serially correlated process. This
point will be clearer from Model 10. Indeed, this model combines all three
components of variability. The full covariance structure output is

R Matrix for ID 1

Row COL1 COL2 COL3 COL4

1 1.01091685 0.35573508 0.15795174 0.07013296


2 0.35573508 1.01091685 0.35573508 0.15795174
3 0.15795174 0.35573508 1.01091685 0.35573508
4 0.07013296 0.15795174 0.35573508 1.01091685

R Correlation Matrix for ID 1

Row COL1 COL2 COL3 COL4

1 1.00000000 0.35189351 0.15624603 0.06937559


2 0.35189351 1.00000000 0.35189351 0.15624603
3 0.15624603 0.35189351 1.00000000 0.35189351
4 0.06937559 0.15624603 0.35189351 1.00000000

G Matrix

Parameter Subject Row COL1

INTERCEPT ID 1 1 2.09382622

Covariance Parameter Estimates (MLE)

Cov Parm Subject Estimate

UN(1,1) ID 2.09382622
Variance ID 0.80117791
AR(1) ID 0.44401509
Residual 0.20973894

This output is easier to understand since the parameters are nicely grouped
by source of variability: (1) random intercept, with ν² = 2.0938; (2) serial
correlation, with σ² = 0.8012 and ρ = 0.4440; and (3) measurement error
(residual), with τ² = 0.2097.

Formal inspection of the deviances shows that Model 7 is too simple and
that Model 10 does not improve Model 9 significantly.

Our next goal is to fit the same three models with OSWALD. For full
documentation on OSWALD, we refer to Smith, Robertson, and Diggle (1996)
or to the web page:

http://www.maths.lancs.ac.uk:2080/~maa036/OSWALD/

The output is presented in the form of a typical SPlus list object. For
Model 7,

Longitudinal Data Analysis Model


assuming completely random dropout

Call:
pcmid(formula = bike.bal ~ group * time,
vparms = c(0, 1.5, 0), correxp = 1)

Analysis Method: Maximum Likelihood (ML)


Correlation structure: exp(- phi * |u| ^ 1 )

Maximised likelihood:
[1] -482.4208

Mean Parameters:
(Intercept) group time group:time
PARAMETER -0.6166502 -0.1915053 0.92002700 0.7233457
STD.ERROR 0.3792614 0.5374845 0.08554001 0.1232477

Variance Parameters:
nu.sq sigma.sq tau.sq phi
0 2.378172 0.7622228 0

Since we want to include only a random intercept and measurement error as
components of variability, we would like to omit σ² from the model. One way
to do this is by specifying initial values for the covariance parameters
using the VPARMS argument to the PCMID function. The argument of VPARMS is
a vector with three components, containing initial values for ν², τ², and
φ, respectively. Setting one or more of the initial values equal to 0
excludes these parameters from the maximization. However, σ² is not
included as a component of the initial values vector. One way to circumvent
this problem is to set φ ≡ 0. This implies that the serial correlation
matrix H reduces to a matrix of ones, H = J, whence it takes on the same
role as the random intercept component. As a result, ν² can be omitted from
the model. Thus, the component we are actually interested in needs to be
excluded! Model fit for all three models is summarized in Table A.4.
Comparison with Table A.3 shows that the fits are virtually identical. The
only substantial difference is seen in the deviances. Adding 400.57 to the
deviances in Table A.3 yields the deviances in Table A.4. Presumably,
OSWALD uses a slightly different objective function.

Model 9 omits the measurement error, which is simply done by setting the
initial value for τ² equal to 0:

Longitudinal Data Analysis Model


assuming completely random dropout

Call:
pcmid(formula = bike.bal ~ group * time,
vparms = c(1, 0, 1), correxp = 1)

Analysis Method: Maximum Likelihood (ML)


Correlation structure: exp(- phi * |u| ^ 1 )

Maximised likelihood:
[1] -480.13

Mean Parameters:
(Intercept) group time group:time
PARAMETER -0.6609958 -0.1607851 0.92363457 0.7212975
STD.ERROR 0.3923346 0.5559721 0.09760774 0.1403651

TABLE A.4. Exercise Bike Data. SPlus (OSWALD) output on Models 7, 9,
and 10.

Parameter Interpretation Model 7 Model 9 Model 10


Intercept Intercept 0 −0.6167 −0.6610 −0.6630
Group −0.1915 −0.1608 −0.1606
Intercept 1 −0.8082 −0.8218 −0.8236
Time Slope 0 0.9200 0.9236 0.9230
Group:time 0.7233 0.7213 0.7221
Slope 1 1.6434 1.6449 1.6451
ν2 Random interc. 2.1584 2.0944
τ2 Measurem. error 0.7622 0.2087
σ2 Serial variance 2.3782 0.9484 0.8017
ρ Serial corr. 0.3080 0.4431
φ Expon. param. 1.1778 0.8139
Deviance 964.84 960.26 960.24

Variance Parameters:
nu.sq sigma.sq tau.sq phi
2.158405 0.9483919 0 1.177751

Finally, the unrestricted Model 10 produces the following output:

Longitudinal Data Analysis Model


assuming completely random dropout

Call:
pcmid(formula = bike.bal ~ group * time,
vparms = c(2, 0.2, 1), correxp = 1,
reqmin = 1e-012)

Analysis Method: Maximum Likelihood (ML)


Correlation structure: exp(- phi * |u| ^ 1 )

Maximised likelihood:
[1] -480.1222

Mean Parameters:
(Intercept) group time group:time
PARAMETER -0.6629645 -0.1606256 0.92304454 0.7221020
STD.ERROR 0.3932421 0.5572348 0.09842758 0.1415133

Variance Parameters:
nu.sq sigma.sq tau.sq phi
2.094417 0.8016666 0.2087339 0.8139487

In all of the above models, CORREXP=1 ensures that an AR(1) serial
correlation structure is used. OSWALD accepts any value between 1 and 2 as
the argument to CORREXP (the latter value corresponds to Gaussian decay).
The model formula is equivalent to the SAS model statement:

model y = group time group*time

and follows standard model formula conventions of SPlus. The object
BIKE.BAL is a member of the BALANCED class, designed within OSWALD for
collections of time series with a common set of measurement times. For
unbalanced designs, a different class (LDA.MAT) has been conceived.

All models considered in this section have a simple random-effects
structure. Indeed, Models 7, 9, and 10 include only a random intercept.
Clearly, more elaborate random-effects models can be fitted with PROC MIXED
using the RANDOM statement. In OSWALD Version 2.6, the function PCMID does
not allow more complex structures. The predecessor of the function PCMID
(the function REML.FIT) allows the user to specify several random effects,
but constrains their variance-covariance matrix D to be diagonal.

All analyses done on the exercise bike data so far assumed ignorable
nonresponse, in the spirit of Section 17.3. This means that they are valid
under MAR (and not only under MCAR, in spite of the claim printed in the
OSWALD output). Of great potential value is OSWALD's ability to go beyond
an ignorable analysis and to fit a specific class of nonrandom models. We
will exemplify this power using the exercise bike data. Illustration will
be on the basis of the most general Model 10, even though the slightly
simpler Model 9 fits the data equally well. The analysis featured by OSWALD
couples a linear mixed-effects model for the measurements with a logistic
model for dropout, with predictors given by the current outcome as well as
a set of previous responses. Details of the model are to be found in Diggle
and Kenward (1994). For instance, assuming that the dropout probability at
occasion j depends on both the current outcome Yij and the previous one
Yi,j−1 leads to the following model:

\[
\ln\frac{P(R_{ij}=0 \,|\, \boldsymbol{y}_i)}{1-P(R_{ij}=0 \,|\, \boldsymbol{y}_i)}
= \psi_0 + \psi_1 y_{ij} + \psi_2 y_{i,j-1}.
\tag{A.11}
\]
Such an analysis is done in OSWALD through the DROP.PARMS, DROPMODEL, and
DROP.COV.PARMS arguments of the PCMID function.

Explicitly, DROP.PARMS specifies starting values for a number of time
points, starting from the current one, to be included in the dropout model.
In model (A.11), this number is 2 (ψ1 and ψ2). A starting value for the
intercept ψ0 is given by means of the DROP.COV.PARMS argument. A very
important feature of this argument is that it can be used to include
covariate effects as well. In that case, model (A.11) is extended to

\[
\ln\frac{P(R_{ij}=0 \,|\, \boldsymbol{y}_i)}{1-P(R_{ij}=0 \,|\, \boldsymbol{y}_i)}
= \psi_0 + \boldsymbol{x}_i'\boldsymbol{\psi}_c + \psi_1 y_{ij} + \psi_2 y_{i,j-1},
\tag{A.12}
\]

where xi is a vector of covariates and ψc is an additional vector of
parameters. The actual form of the dropout model is specified in the
DROPMODEL argument, using standard SPlus model-building conventions. An
illustration will be given in the sequel. As before, setting one or more of
the initial values equal to zero prevents their inclusion in the
maximization process. This is a useful feature, since it allows the user to
estimate the dropout parameters under MCAR and MAR assumptions, not only
under nonrandom assumptions. For example, model (A.11) corresponds to an
MAR process by setting the initial value for ψ1 ≡ 0.

Let us discuss OSWALD analyses for dropout model (A.11), in the MCAR,
MAR, and nonrandom contexts. The MCAR analysis output is as follows:

Longitudinal Data Analysis Model


assuming random dropout based on 0 previous observations

Call:
pcmid(formula = bike.bal ~ group * time,
vparms = c(2, 0.2, 1), drop.parms = c(0),
drop.cov.parms = c(-2), dropmodel = ~ 1,
correxp = 1, maxfn = 1600, reqmin = 1e-012)

Analysis Method: Maximum Likelihood (ML)


Correlation structure: exp(- phi * |u| ^ 1 )

Maximised likelihood:
[1] -520.6167

Mean Parameters:
(Intercept) group time group:time
PARAMETER -0.6629601 -0.1606322 0.9230446 0.7221012
STD.ERROR 0.3859817 0.5470223 0.1009714 0.1451370

Variance Parameters:
nu.sq sigma.sq tau.sq phi
2.094412 0.8016692 0.2087333 0.8139474

Dropout parameters:
(Intercept) y.d
-2.32728 0

Iteration converged after 993 iterations.

The indication that the dropout model is random with a dependence on
0 previous observations effectively refers to an MCAR process. The
mechanism is turned into MCAR by specifying DROP.PARMS=c(0) and by setting
DROP.COV.PARMS=c(-2). We increased the maximum number of iterations MAXFN
to 1600, since more complex dropout models tend to require more iterations.
As is seen from the output, a bit under 1000 iterations were actually
needed. In addition, the tolerance of the relative gradient was set equal
to 10⁻¹² by means of the REQMIN argument.

The MAR program and output are similar:

Longitudinal Data Analysis Model


assuming random dropout based on 1 previous observations

Call:
pcmid(formula = bike.bal ~ group * time,
vparms = c(2, 0.2, 1), drop.parms = c(0, -0.1),
drop.cov.parms = c(-2), dropmodel = ~ 1,
correxp = 1, maxfn = 3000, reqmin = 1e-012)

Analysis Method: Maximum Likelihood (ML)


Correlation structure: exp(- phi * |u| ^ 1 )

Maximised likelihood:
[1] -520.3494

Mean Parameters:
(Intercept) group time group:time
PARAMETER -0.6629684 -0.1606169 0.9230426 0.7221043
STD.ERROR 0.3859815 0.5470220 0.1009713 0.1451369

TABLE A.5. Exercise Bike Data. Nonrandom models. Model 10.

                           Dropout modeled
Parameter        Ign.      MCAR       MAR      MNAR
Intercept     −0.6630   −0.6630   −0.6630   −0.6666
Group         −0.1606   −0.1606   −0.1606   −0.1740
Time           0.9230    0.9230    0.9230    0.9377
Group:time     0.7221    0.7221    0.7221    0.7317
ν²             2.0944    2.0944    2.0944    2.1067
σ²             0.8017    0.8017    0.8017    0.8349
τ²             0.2087    0.2087    0.2087    0.1577
φ              0.8139    0.8139    0.8139    0.9311
ψ0                      −2.3273   −2.1661   −2.7316
ψ1                                            0.3280
ψ2                                 −0.0992   −0.3869
Deviance      960.24    1041.23   1040.70   1040.46

Variance Parameters:
nu.sq sigma.sq tau.sq phi
2.09441 0.8016771 0.2087185 0.8139738

Dropout parameters:
(Intercept) y.d y.d-1
-2.166139 0 -0.09920587

Iteration converged after 1203 iterations.

Note that the number of iterations has increased somewhat, even though
the extra dropout parameter, ψ2 , appears to be very small. In fact, the
likelihood has increased only marginally over the MCAR analysis.

Finally, we allow for nonrandom dropout.

Longitudinal Data Analysis Model


assuming informative dropout
based on 1 previous observations

Call:
pcmid(formula = bike.bal ~ group * time,
vparms = c(2, 0.4, 0.5), drop.parms = c(2, -2),
drop.cov.parms = c(-3), dropmodel = ~ 1,

correxp = 1, maxfn = 10000, reqmin = 1e-012)

Analysis Method: Maximum Likelihood (ML)


Correlation structure: exp(- phi * |u| ^ 1 )

Maximised likelihood:
[1] -520.2316

Mean Parameters:
(Intercept) group time group:time
PARAMETER -0.6666244 -0.1740098 0.9377247 0.7317099
STD.ERROR NA NA NA NA

Variance Parameters:
nu.sq sigma.sq tau.sq phi
2.106741 0.8348588 0.1577192 0.9310764

Dropout parameters:
(Intercept) y.d y.d-1
-2.731573 0.3279598 -0.3869054

Iteration converged after 5034 iterations.

The number of iterations has increased considerably, which is a typical
feature of nonrandom dropout models. The likelihood has changed only
marginally, and the dropout parameters are all somewhat larger, even though
this is not a precise statement due to the lack of precision estimates.
Note also that no standard errors for the mean parameters are provided, in
contrast to the other PCMID analyses. This is a very sensible decision
since, unlike with ignorable analyses, standard errors are not obtained as
simple by-products of the maximization process and require in general
considerable extra code. A few methods are listed on p. 376 and on p. 389.

Results for the three dropout models are summarized in Table A.5, under
the headings MCAR, MAR, and MNAR, respectively. Also included is the
earlier ignorable analysis. Since MCAR and MAR are ignorable, whether
or not the dropout model parameters are estimated explicitly, the first
three models in Table A.5 yield exactly the same values for the mean and
covariance parameter estimates, as they should. Note that the deviance
from the ignorable model is not comparable to the other deviances, since
the dropout parameters are not estimated. Comparing the parameters in
these models to the nonrandom dropout models shows some shifts, although
they are very modest.
A.3 Fitting Mixed Models Using SPlus 511

TABLE A.6. Exercise Bike Data. Nonrandom models. Model 10. Treatment
assignment (group) included in the dropout model.

Parameter            MAR       MNAR
Intercept         −0.6630    −0.6214
Group             −0.1606    −0.2288
Time               0.9231     0.9032
Group:time         0.7221     0.7252
ν²                 2.0944     2.0438
σ²                 0.8017     0.7859
τ²                 0.2087     0.2931
φ                  0.8139     0.6289
ψ0                −2.4059    −2.1036
ψ1                           −0.3036
ψ2                −0.1289     0.1238
Group (dropout)    0.5395     0.7979
Deviance         1039.96    1039.83

We may now want to compare the dropout models. The likelihood ratio
test statistic to compare MAR with MCAR is 0.53 on 1 degree of freedom
(p = 0.4666). This means that MCAR would be acceptable provided MAR
were the correct alternative hypothesis and the actual parametric form for
the MAR process were correct. In addition, a comparison between the non-
random and random dropout models yields a likelihood ratio test statistic
of 0.24 (p = 0.6242). Of course, for reasons outlined in Chapters 19 and
20 (see also Section 18.1.2), one should use nonrandom dropout models
with caution, since they rely on assumptions that are at best only partially
verifiable. These issues of sensitivity are illustrated in Sections 19.4, 19.5,
and 24.4.
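
For completeness, both likelihood ratio statistics quoted above can be
verified directly from the deviances in Table A.5:

1 - pchisq(1041.23 - 1040.70, df = 1)   # MCAR versus MAR:  p = 0.4666
1 - pchisq(1040.70 - 1040.46, df = 1)   # MAR versus MNAR:  p = 0.6242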

To conclude, let us illustrate the capability of OSWALD to incorporate
covariates into the dropout model, as in model (A.12). Including the
treatment assignment (group) into the MAR and nonrandom models of Table A.5
yields

Longitudinal Data Analysis Model


assuming random dropout based on 1 previous observations

Call:
pcmid(formula = demo2.bal ~ group * time,
vparms = c(2, 0.2, 0.8), drop.parms = c(0, -0.1),

drop.cov.parms = c(-2, 0.1),


dropmodel = ~ 1 + group,
correxp = 1, maxfn = 5000, reqmin = 1e-012)

Analysis Method: Maximum Likelihood (ML)


Correlation structure: exp(- phi * |u| ^ 1 )

Maximised likelihood:
[1] -519.9814

Mean Parameters:
(Intercept) group time group:time
PARAMETER -0.6629779 -0.1606298 0.9230512 0.7220979
STD.ERROR 0.3859810 0.5470213 0.1009712 0.1451367

Variance Parameters:
nu.sq sigma.sq tau.sq phi
2.094418 0.8016748 0.2087164 0.8139889

Dropout parameters:
(Intercept) group y.d y.d-1
-2.405863 0.5395115 0 -0.1289393

Iteration converged after 2983 iterations.

and

Longitudinal Data Analysis Model


assuming informative dropout
based on 1 previous observations

Call:
pcmid(formula = demo2.bal ~ group * time,
vparms = c(2, 0.2, 0.8),
drop.parms = c(0.3, -0.3),
drop.cov.parms = c(-2, 0.1),
dropmodel = ~ 1 + group,
correxp = 1, maxfn = 10000, reqmin = 1e-012)

Analysis Method: Maximum Likelihood (ML)


Correlation structure: exp(- phi * |u| ^ 1 )

Maximised likelihood:
[1] -519.9143

Mean Parameters:
(Intercept) group time group:time
PARAMETER -0.6213682 -0.228779 0.9032322 0.7251818
STD.ERROR NA NA NA NA

Variance Parameters:
nu.sq sigma.sq tau.sq phi
2.043767 0.7858604 0.2930629 0.6288833

Dropout parameters:
(Intercept) group y.d y.d-1
-2.103619 0.7979308 -0.303576 0.1238007

Iteration converged after 5930 iterations.

The model fit is summarized in Table A.6.

Again, the main effect and variance parameters in the MAR column have
not changed relative to their ignorable counterparts in Table A.5. In con-
trast, the nonrandom dropout model parameters are all different, compared
to the nonrandom model in Table A.5.
Appendix B
Technical Details for Sensitivity Analysis

This appendix contains the more technical material, given for completeness
but not essential to understanding, about sensitivity analysis for selection
models (Section B.1) and pattern-mixture models (Section B.2).

B.1 Local Influence: Derivation of Components of ∆i

Let us provide some computational details that lead to expressions
(19.2)–(19.5). We will consider complete and incomplete sequences in turn.

The log-likelihood contribution for a complete sequence is

\[
\ell_{i\omega} = \ln f(\boldsymbol{y}_i)
+ \sum_{j=2}^{n_i} \ln\bigl[1 - g(h_{ij}, y_{ij})\bigr],
\]

where the parameter dependencies are suppressed for notational ease. The
mixed derivatives are particularly easy to calculate, immediately yielding
expressions (19.2) and (19.3).

The log-likelihood contribution from an incomplete sequence equals

\[
\ell_{i\omega}
= \ln \int f(\boldsymbol{y}_i)
  \prod_{j=2}^{d-1}\bigl[1 - g(h_{ij}, y_{ij})\bigr]\,
  g(h_{id}, y_{id})\, dy_{id}
= \ln f(h_{id})
+ \sum_{j=2}^{d-1} \ln\bigl[1 - g(h_{ij}, y_{ij})\bigr]
+ \ln \int f(y_{id} \,|\, h_{id})\, g(h_{id}, y_{id})\, dy_{id},
\]

of which the first component depends on θ only, the second one on ψ only,
and the third one contains both.

The mixed derivatives of the log-likelihood with respect to ωi can be
written as

\[
\frac{\partial^2 \ell_{i\omega}}{\partial\boldsymbol{\theta}\,\partial\omega_i}
= \frac{\displaystyle
   \int f(y_{id}|h_{id})\,g(h_{id},y_{id})\,dy_{id}
   \int \frac{\partial f(y_{id}|h_{id})}{\partial\boldsymbol{\theta}}\,
        \frac{\partial g(h_{id},y_{id})}{\partial\omega_i}\,dy_{id}
 - \int f(y_{id}|h_{id})\,\frac{\partial g(h_{id},y_{id})}{\partial\omega_i}\,dy_{id}
   \int \frac{\partial f(y_{id}|h_{id})}{\partial\boldsymbol{\theta}}\,g(h_{id},y_{id})\,dy_{id}}
  {\left[\displaystyle\int f(y_{id}|h_{id})\,g(h_{id},y_{id})\,dy_{id}\right]^2},
\tag{B.1}
\]

\[
\frac{\partial^2 \ell_{i\omega}}{\partial\boldsymbol{\psi}\,\partial\omega_i}
= -\sum_{j=2}^{d-1} h_{ij}\,y_{ij}\,g(h_{ij},y_{ij})\bigl[1-g(h_{ij},y_{ij})\bigr]
+ \frac{\displaystyle
   \int f(y_{id}|h_{id})\,g(h_{id},y_{id})\,dy_{id}
   \int f(y_{id}|h_{id})\,\frac{\partial^2 g(h_{id},y_{id})}{\partial\boldsymbol{\psi}\,\partial\omega_i}\,dy_{id}
 - \int f(y_{id}|h_{id})\,\frac{\partial g(h_{id},y_{id})}{\partial\omega_i}\,dy_{id}
   \int f(y_{id}|h_{id})\,\frac{\partial g(h_{id},y_{id})}{\partial\boldsymbol{\psi}}\,dy_{id}}
  {\left[\displaystyle\int f(y_{id}|h_{id})\,g(h_{id},y_{id})\,dy_{id}\right]^2}.
\tag{B.2}
\]

In order to evaluate these expressions under ωi = 0, we set ωi = 0 in the
integrands and calculate the resulting simplified integrals:

\[
\left.\int f(y_{id}|h_{id})\,g(h_{id},y_{id})\,dy_{id}\right|_{\omega_i=0}
= \int f(y_{id}|h_{id})\,g(h_{id})\,dy_{id} = g(h_{id}),
\tag{B.3}
\]
\[
\left.\int f(y_{id}|h_{id})\,\frac{\partial g(h_{id},y_{id})}{\partial\omega_i}\,dy_{id}\right|_{\omega_i=0}
= g(h_{id})\bigl[1-g(h_{id})\bigr]\int y_{id}\,f(y_{id}|h_{id})\,dy_{id}
= g(h_{id})\bigl[1-g(h_{id})\bigr]\,\lambda(y_{id}|h_{id}),
\tag{B.4}
\]
\[
\left.\int \frac{\partial f(y_{id}|h_{id})}{\partial\boldsymbol{\theta}}\,g(h_{id},y_{id})\,dy_{id}\right|_{\omega_i=0}
= g(h_{id})\int \frac{\partial f(y_{id}|h_{id})}{\partial\boldsymbol{\theta}}\,dy_{id} = 0,
\tag{B.5}
\]
\[
\left.\int \frac{\partial f(y_{id}|h_{id})}{\partial\boldsymbol{\theta}}\,
           \frac{\partial g(h_{id},y_{id})}{\partial\omega_i}\,dy_{id}\right|_{\omega_i=0}
= g(h_{id})\bigl[1-g(h_{id})\bigr]\int y_{id}\,\frac{\partial f(y_{id}|h_{id})}{\partial\boldsymbol{\theta}}\,dy_{id}
= g(h_{id})\bigl[1-g(h_{id})\bigr]\,\frac{\partial \lambda(y_{id}|h_{id})}{\partial\boldsymbol{\theta}},
\tag{B.6}
\]
\[
\left.\int f(y_{id}|h_{id})\,\frac{\partial g(h_{id},y_{id})}{\partial\boldsymbol{\psi}}\,dy_{id}\right|_{\omega_i=0}
= g(h_{id})\bigl[1-g(h_{id})\bigr]\,h_{id},
\tag{B.7}
\]
\[
\left.\int f(y_{id}|h_{id})\,\frac{\partial^2 g(h_{id},y_{id})}{\partial\boldsymbol{\psi}\,\partial\omega_i}\,dy_{id}\right|_{\omega_i=0}
= g(h_{id})\bigl[1-g(h_{id})\bigr]\bigl[1-2g(h_{id})\bigr]\,h_{id}\,\lambda(y_{id}|h_{id}).
\tag{B.8}
\]

Combining (B.3)–(B.8) with (B.1)–(B.2) immediately yields expressions
(19.4) and (19.5).

B.2 Proof of Theorem 20.1

The MAR assumption states that

f (d = t + 1|y1 , . . . , yn ) = f (d = t + 1|y1 , . . . , yt ) (B.9)

and the ACMV assumption that, for all t ≥ 2 and all j < t,

f (yt |y1 , . . . , yt−1 , d = j + 1) = f (yt |y1 , . . . , yt−1 , d > t). (B.10)

First, a lemma will be established.

Lemma B.1 In a longitudinal setting with dropout, ACMV holds if and only
if, for all t ≥ 2 and all j < t,
f (yt |y1 , . . . , yt−1 , d = j + 1) = f (yt |y1 , . . . , yt−1 ).

Proof. Take t ≥ 2 and j < t. Then ACMV leads to

\[
\begin{aligned}
f(y_t \,|\, y_1,\ldots,y_{t-1})
&= \sum_{i=1}^{t-1} f(y_t \,|\, y_1,\ldots,y_{t-1}, d=i+1)\, f(d=i+1)
 + f(y_t \,|\, y_1,\ldots,y_{t-1}, d>t)\, f(d>t)\\
&= \sum_{i=1}^{t-1} f(y_t \,|\, y_1,\ldots,y_{t-1}, d=j+1)\, f(d=i+1)
 + f(y_t \,|\, y_1,\ldots,y_{t-1}, d=j+1)\, f(d>t)\\
&= f(y_t \,|\, y_1,\ldots,y_{t-1}, d=j+1)
   \left[\sum_{i=1}^{t-1} f(d=i+1) + f(d>t)\right]\\
&= f(y_t \,|\, y_1,\ldots,y_{t-1}, d=j+1).
\end{aligned}
\]

To show the reverse direction, take again t ≥ 2 and j < t:

\[
\begin{aligned}
f(y_t \,|\, y_1,\ldots,y_{t-1}, d>t)\, f(d>t)
&= f(y_t \,|\, y_1,\ldots,y_{t-1})
 - \sum_{i=1}^{t-1} f(y_t \,|\, y_1,\ldots,y_{t-1}, d=i+1)\, f(d=i+1)\\
&= f(y_t \,|\, y_1,\ldots,y_{t-1})
 - \sum_{i=1}^{t-1} f(y_t \,|\, y_1,\ldots,y_{t-1})\, f(d=i+1)\\
&= f(y_t \,|\, y_1,\ldots,y_{t-1})\left[1 - \sum_{i=1}^{t-1} f(d=i+1)\right]\\
&= f(y_t \,|\, y_1,\ldots,y_{t-1}, d=j+1)\left[1 - \sum_{i=1}^{t-1} f(d=i+1)\right]\\
&= f(y_t \,|\, y_1,\ldots,y_{t-1}, d=j+1)\, f(d>t).
\end{aligned}
\]
This completes the proof. We are now able to prove Theorem 20.1.

MAR ⇒ ACMV

Consider the ratio Q of the complete data likelihood to the observed data
likelihood. This gives, under the MAR assumption,

\[
Q = \frac{f(y_1,\ldots,y_n)\, f(d=i+1 \,|\, y_1,\ldots,y_i)}
         {f(y_1,\ldots,y_i)\, f(d=i+1 \,|\, y_1,\ldots,y_i)}
  = f(y_{i+1},\ldots,y_n \,|\, y_1,\ldots,y_i).
\tag{B.11}
\]

Further, one can always write

\[
Q = f(y_{i+1},\ldots,y_n \,|\, y_1,\ldots,y_i, d=i+1)\,
    \frac{f(y_1,\ldots,y_i \,|\, d=i+1)\, f(d=i+1)}
         {f(y_1,\ldots,y_i \,|\, d=i+1)\, f(d=i+1)}
  = f(y_{i+1},\ldots,y_n \,|\, y_1,\ldots,y_i, d=i+1).
\tag{B.12}
\]

Equating expressions (B.11) and (B.12) for Q, we see that

f (yi+1 , . . . , yn |y1 , . . . , yi , d = i + 1) = f (yi+1 , . . . , yn |y1 , . . . , yi ).   (B.13)
To show that (B.13) implies the ACMV conditions (B.10), we will use the
induction principle on t. First, consider the case t = 2. Using (B.13) for
i = 1, and integrating over y3 , . . . , yn , we obtain
f (y2 |y1 , d = 2) = f (y2 |y1 ),
leading to, using Lemma B.1,
f (y2 |y1 , d = 2) = f (y2 |y1 , d > 2).

Suppose, by induction, that ACMV holds for all t ≤ i. We will now prove the
hypothesis for t = i + 1. Choose j ≤ i. Then, from the induction hypothesis
and Lemma B.1, it follows that for all j < t ≤ i:
f (yt |y1 , . . . , yt−1 , d = j + 1) = f (yt |y1 , . . . , yt−1 , d > t)
= f (yt |y1 , . . . , yt−1 ).
Taking the product over t = j + 1, . . . , i then gives
f (yj+1 , . . . , yi |y1 , . . . , yj , d = j + 1) = f (yj+1 , . . . , yi |y1 , . . . , yj ). (B.14)

After integration over yi+2 , . . . , yn , (B.13) leads to

f (yj+1 , . . . , yi+1 |y1 , . . . , yj , d = j + 1) = f (yj+1 , . . . , yi+1 |y1 , . . . , yj ).   (B.15)

Dividing (B.15) by (B.14) and equating the left- and right-hand sides, we
find that

f (yi+1 |y1 , . . . , yi , d = j + 1) = f (yi+1 |y1 , . . . , yi ).

This holds for all j ≤ i, and Lemma B.1 shows this is equivalent to ACMV.

ACMV ⇒ MAR

Starting from the ACMV assumption and Lemma B.1, we have

∀t ≥ 2, ∀j < t : f (yt |y1 , . . . , yt−1 , d = j + 1) = f (yt |y1 , . . . , yt−1 ).   (B.16)

We now factorize the full data density as

\[
\begin{aligned}
f(y_1,\ldots,y_n, d=i+1)
&= f(y_1,\ldots,y_i, d=i+1)\, f(y_{i+1},\ldots,y_n \,|\, y_1,\ldots,y_i, d=i+1)\\
&= f(y_1,\ldots,y_i, d=i+1) \prod_{t=i+1}^{n} f(y_t \,|\, y_1,\ldots,y_{t-1}, d=i+1).
\end{aligned}
\]

Using (B.16), it follows that

\[
\begin{aligned}
f(y_1,\ldots,y_n, d=i+1)
&= f(y_1,\ldots,y_i \,|\, d=i+1)\, f(d=i+1) \prod_{t=i+1}^{n} f(y_t \,|\, y_1,\ldots,y_{t-1})\\
&= f(y_1,\ldots,y_i \,|\, d=i+1)\, f(d=i+1)\, f(y_{i+1},\ldots,y_n \,|\, y_1,\ldots,y_i)\\
&= \frac{f(y_1,\ldots,y_i \,|\, d=i+1)\, f(d=i+1)}{f(y_1,\ldots,y_i)}\,
   f(y_1,\ldots,y_i)\, f(y_{i+1},\ldots,y_n \,|\, y_1,\ldots,y_i)\\
&= \frac{f(y_1,\ldots,y_i \,|\, d=i+1)\, f(d=i+1)}{f(y_1,\ldots,y_i)}\, f(y_1,\ldots,y_n)\\
&= f(d=i+1 \,|\, y_1,\ldots,y_i)\, f(y_1,\ldots,y_n).
\end{aligned}
\tag{B.17}
\]

An alternative factorization of f (y, d) gives

f (y1 , . . . , yn , d = i + 1) = f (d = i + 1|y1 , . . . , yn )f (y1 , . . . , yn ).   (B.18)



It follows from (B.17) and (B.18) that

f (d = i + 1|y1 , . . . , yn ) = f (d = i + 1|y1 , . . . , yi ),

completing the proof of Theorem 20.1.


References

Afifi, A. and Elashoff, R. (1966) Missing observations in multivariate
statistics I: Review of the literature. Journal of the American Statistical
Association, 61, 595–604.

Agresti, A. (1990) Categorical Data Analysis. New York: John Wiley & Sons.

Aitkin, M. (1999) A general maximum likelihood analysis of variance
components in generalized linear models. Biometrics, 55, 218–234.

Aitkin, M. and Francis, B. (1995) Fitting overdispersed generalized linear
models by nonparametric maximum likelihood. The GLIM Newsletter, 25, 37–45.

Aitkin, M. and Rubin, D.B. (1985) Estimation and hypothesis testing in
finite mixture models. Journal of the Royal Statistical Society, Series B,
47, 67–75.

Akaike, H. (1974) A new look at the statistical model identification. IEEE
Transactions on Automatic Control, 19, 716–723.

Allison, P.D. (1987) Estimation of linear models with incomplete data.
Sociology Methodology, 71–103.

Altham, P.M.E. (1984) Improving the precision of estimation by fitting a
model. Journal of the Royal Statistical Society, Series B, 46, 118–119.

Amemiya, T. (1984) Tobit models: a survey. Journal of Econometrics, 24,
3–61.

Ashford, J.R. and Sowden, R.R. (1970) Multivariate probit analysis.
Biometrics, 26, 535–546.

Bahadur, R.R. (1961) A representation of the joint distribution of
responses to n dichotomous items. In: Studies in Item Analysis and
Prediction, H. Solomon (Ed.). Stanford Mathematical Studies in the Social
Sciences VI. Stanford, CA: Stanford University Press.

Baker, S.G. (1992) A simple method for computing the observed infor-
mation matrix when using the EM algorithm with categorical data.
Journal of Computational and Graphical Statistics, 1, 63–76.

Baker, S.G. (1994) Regression analysis of grouped survival data with in-
complete covariates: non-ignorable missing-data and censoring mech-
anisms. Biometrics, 50, 821–826.

Baker, S.G. and Laird, N.M. (1988) Regression analysis for categorical
variables with outcome subject to non-ignorable non-response. Journal
of the American Statistical Association, 83, 62–69.

Baker, S.G., Rosenberger, W.F., and DerSimonian, R. (1992) Closed-form
estimates for missing counts in two-way contingency tables. Statistics in
Medicine, 11, 643–657.

Bartlett, M.S. (1937) Some examples of statistical methods of research in
agriculture and applied botany. Journal of the Royal Statistical Society,
Series B, 4, 137–170.

Beckman, R.J., Nachtsheim, C.J., and Cook, R.D. (1987) Diagnostics for
mixed-model analysis of variance. Technometrics, 29, 413–426.

Bickel, P.J. and Doksum, K.A. (1977) Mathematical Statistics. Englewood
Cliffs, NJ: Prentice-Hall.

Birnbaum, Z.W. (1952) Numerical tabulation of the distribution of
Kolmogorov’s statistic for finite sample size. Journal of the American
Statistical Association, 47, 425–441.

Böhning, D. and Lindsay, B.G. (1988) Monotonicity of quadratic
approximation algorithms. The Annals of the Institute of Statistical
Mathematics, 40, 641–663.

Boissel, J.P., Collet, J.P., Moleur, P., and Haugh, M. (1992) Surrogate
endpoints: a basis for a rational approach. European Journal of Clin-
ical Pharmacology, 43, 235–244.

Box, G.E.P., Jenkins, G.M., and Reinsel, G.C. (1994) Time Series Analy-
sis: Forecasting and Control (3rd ed.). London: Holden-Day.
Box, G.E.P. and Tiao, G.C. (1992) Bayesian Inference in Statistical
Analysis. Wiley Classics Library edition. New York: John Wiley &
Sons.
Bozdogan, H. (1987) Model selection and Akaike’s Information Criterion
(AIC): The general theory and its analytical extensions. Psychome-
trika, 52, 345–370.
Brant, L.J. and Fozard, J.L. (1990) Age changes in pure-tone hearing
thresholds in a longitudinal study of normal human aging. Journal of
the Acoustic Society of America, 88, 813–820.
Brant, L.J. and Pearson, J.D. (1994) Modeling the variability in lon-
gitudinal patterns of aging. In: Biological Anthropology and Aging:
Perspectives on Human Variation over the Life Span, Ch. 14, D.E.
Crews and R.M.Garruto (Eds.). New York: Oxford University Press,
pp. 373–393.
Brant, L.J., Pearson, J.D., Morrell, C.H., and Verbeke, G. (1992) Statis-
tical methods for studying individual change during aging. Collegium
Antropologicum, 16, 359–369.
Brant, L.J. and Verbeke, G. (1997a) Describing the natural heterogeneity
of aging using multilevel regression models. International Journal of
Sports Medicine, 18, S225–S231.
Brant, L.J. and Verbeke, G. (1997b) Modelling longitudinal studies of
aging. In: Proceedings of the 12th International Workshop on Sta-
tistical Modelling, Schriftenreihe der Osterreichischen Statistischen
Gesellschaft, Vol. 5, C.E. Minder and H. Friedl (Eds.). Biel/Bienne,
Switzerland, pp. 19–30.
Breslow, N.E. and Clayton, D.G. (1993) Approximate inference in gener-
alized linear mixed models. Journal of the American Statistical Asso-
ciation, 88, 9–25.
Brown, N.A. and Fabro, S. (1981) Quantitation of rat embryonic develop-
ment in vitro: a morphological scoring system. Teratology, 24, 65–78.
Buck, S.F. (1960) A method of estimation of missing values in multivari-
ate data suitable for use with an electronic computer. Journal of the
Royal Statistical Society, Series B, 22, 302–306.
Burzykowski, T., Molenberghs, G., Buyse, M., Geys, H., and Renard,
D. (1999) Validation of surrogate endpoints in multiple randomized
clinical trials with failure-time endpoints. Submitted for publication.

Butler, S.M. and Louis, T.A. (1992) Random effects models with non-
parametric priors. Statistics in Medicine, 11, 1981–2000.
Buyse, M. and Molenberghs, G. (1998) The validation of surrogate end-
points in randomized experiments. Biometrics, 54, 1014–1029.
Buyse, M., Molenberghs, G., Burzykowski, T., Renard, D., and Geys,
H. (2000) The validation of surrogate endpoints in meta-analyses of
randomized experiments, Biostatistics, 1, 000–000.
Carlin, B.P. and Louis, T.A. (1996) Bayes and Empirical Bayes Methods
for Data Analysis. London: Chapman & Hall.
Carter, H.B. and Coffey, D.S. (1990) The prostate: an increasing medical
problem. The Prostate, 16, 39–48.
Carter, H.B., Morrell, C.H., Pearson, J.D., Brant, L.J., Plato, C.C., Met-
ter, E.J., Chan, D.W., Fozard, J.L., and Walsh, P.C. (1992a) Estima-
tion of prostate growth using serial prostate-specific antigen measure-
ments in men with and without prostate disease. Cancer Research,
52, 3323–3328.
Carter, H.B., Pearson, J.D., Metter, E.J., Brant, L.J., Chan, D.W., An-
dres, R., Fozard, J.L., and Walsh, P.C. (1992b) Longitudinal eval-
uation of prostate-specific antigen levels in men with and without
prostate disease. Journal of the American Medical Association, 267,
2215–2220.
Catalano, P.J. and Ryan, L.M. (1992) Bivariate latent variable models for
clustered discrete and continuous outcomes. Journal of the American
Statistical Association, 87, 651–658.
Catalano, P.J., Scharfstein, D.O., Ryan, L.M., Kimmel, C.A., and Kim-
mel, G.L. (1993) Statistical model for fetal death, fetal weight, and
malformation in developmental toxicity studies. Teratology, 47, 281–
290.
Chatterjee, S. and Hadi, A.S. (1988) Sensitivity Analysis in Linear Re-
gression. New York: John Wiley & Sons.
Chen, J. (1993) A malformation incidence dose-response model incorpo-
rating fetal weight and/or litter size as covariates. Risk Analysis, 13,
559–564.
Chen, T.T., Simon, R.M., Korn, E.L., Anderson, S.J., Lindblad, A.D.,
Wieand, H.S., Douglass Jr., H.O., Fisher, B., Hamilton, J.M., and
Friedman, M.A. (1998) Investigation of disease-free survival as a sur-
rogate endpoint for survival in cancer clinical trials. Communications
in Statistics, A, 27, 1363–1378.

Chi, E.M. and Reinsel, G.C. (1989) Models for longitudinal data with
random effects and AR(1) errors. Journal of the American Statistical
Association, 84, 452–459.

Choi, S., Lagakos, S., Schooley, R.T., and Volberding, P.A. (1993) CD4+
lymphocytes are an incomplete surrogate marker for clinical progres-
sion in persons with asymptomatic HIV infection taking zidovudine.
Annals of Internal Medicine, 118, 674–680.
Christensen, R., Pearson, L.M., and Johnson, W. (1992) Case-deletion
diagnostics for mixed models. Technometrics, 34, 38–45.

Chuang-Stein, C. and DeMasi, R. (1998) Surrogate endpoints in AIDS drug
development: current status (with discussion). Drug Information Journal,
32, 439–448.

Clayton, D.G. (1978) A model for association in bivariate life tables and
its application in epidemiological studies of familial tendency in chronic
disease incidence. Biometrika, 65, 141–151.

Clayton, D. and Hills, M. (1993) Statistical Methods in Epidemiology.
Oxford: Oxford University Press.

Cleveland, W.S. (1979) Robust locally-weighted regression and smoothing
scatterplots. Journal of the American Statistical Association, 74, 829–836.
Cocchetto, D.M. and Jones, D.R. (1998) Faster access to drugs for serious
or life-threatening illnesses through use of the accelerated approval
regulation in the United States. Drug Information Journal, 32, 27–
35.

Cohen, J. and Cohen, P. (1983) Applied multiple regression/correlation


analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.

Conaway, M.R. (1992) The analysis of repeated categorical measurements


subject to nonignorable nonresponse. Journal of the American Statis-
tical Association, 87, 817–824.

Conaway, M.R. (1993) Non-ignorable non-response models for time-


ordered categorical variables. Applied Statistics, 42, 105–115.

Cook, R.D. (1977a) Detection of influential observations in linear regres-


sion. Technometrics, 19, 15–18.

Cook, R.D. (1977b) Letter to the editor. Technometrics, 19, 348.

Cook, R.D. (1979) Influential observations in linear regression. Journal


of the American Statistical Association, 74, 169–174.
528 References

Cook, R.D. (1986) Assessment of local influence. Journal of the Royal


Statistical Society, Series B, 48, 133–169.

Cook, R.D. and Weisberg, S. (1982) Residuals and Influence in Regres-


sion. London: Chapman & Hall.

Copas, J.B. and Li, H.G. (1997) Inference from non-random samples (with
discussion). Journal of the Royal Statistical Society, Series B, 59, 55–
96.

Corfu-A Study Group (1995) Phase III randomized study of two fluo-
rouracil combinations with either interferon alfa-2a or leucovorin for
advanced colorectal cancer. Journal of Clinical Oncology, 13, 921–928.

Coursaget, P., Leboulleux, D., Soumare, M., le Cann P., Yvonnet, B.,
Chiron, J.P., and Collseck A.M. (1994) Twelve-year follow-up study
of hepatitis immunization of Senegalese infants. Journal of Hepatology,
21, 250–254.

Cowles, M.K., Carlin, B.P., and Connett, J.E. (1996) Bayesian tobit mod-
eling of longitudinal ordinal clinical trial compliance data with nonig-
norable missingness. Journal of the American Statistical Association,
91, 86–98.

Cox, D.R. (1972) The analysis of multivariate binary data. Applied Sta-
tistics, 21, 113–120.
Cox, D.R. and Hinkley, D.V. (1974) Theoretical Statistics. London: Chap-
man & Hall.

Cox, D.R. and Hinkley, D.V. (1990) Theoretical Statistics. London: Chap-
man & Hall.

Cox, D.R. and Wermuth, N. (1992) Response models for mixed binary
and quantitative variables. Biometrika, 79, 441–461.

Cox, D.R. and Wermuth, N. (1994) Multivariate Dependencies: Models, Analysis and Interpretation. London: Chapman & Hall.

Crépeau, H., Koziol, J., Reid, N., and Yuh, Y.S. (1985) Analysis of in-
complete multivariate data from repeated measurements experiments.
Biometrics, 41, 505–514.

Cressie, N.A.C. (1991) Statistics for Spatial Data. New York: John Wiley
& Sons.

Crowder, M.J. and Hand, D.J. (1990) Analysis of Repeated Measures. London: Chapman & Hall.

Cullis, B.R. (1994) Discussion to Diggle, P.J. and Kenward, M.G.: Infor-
mative dropout in longitudinal data analysis. Applied Statistics, 43,
79–80.

Curran, D., Pignatti, F., and Molenberghs, G. (1997) Milk protein trial:
informative dropout versus random drop-in. Submitted for publication.

D’Agostino, R.B. (1971) An omnibus test of normality for moderate and large size samples. Biometrika, 58, 341–348.

Dale, J.R. (1986) Global cross-ratio models for bivariate, discrete, ordered
responses. Biometrics, 42, 909–917.

Daniels, M.J. and Hughes, M.D. (1997) Meta-analysis for the evaluation
of potential surrogate markers. Statistics in Medicine, 16, 1515–1527.

Davidian, M. and Giltinan, D.M. (1995) Nonlinear Models for Repeated Measurement Data. London: Chapman & Hall.

De Backer, M., De Keyser, P., De Vroey, C., and Lesaffre, E. (1996) A 12-week treatment for dermatophyte toe onychomycosis: terbinafine 250 mg/day vs. itraconazole 200 mg/day – a double-blind comparative trial. British Journal of Dermatology, 134, 16–17.

DeGruttola, V., Fleming, T.R., Lin, D.Y., and Coombs, R. (1997) Validating surrogate markers – are we being naive? Journal of Infectious Diseases, 175, 237–246.

DeGruttola, V., Lange, N., and Dafni, U. (1991) Modeling the progression
of HIV infection. Journal of the American Statistical Association, 86,
569–577.

DeGruttola, V. and Tu, X.M. (1994) Modelling progression of CD4 lymphocyte count and its relationship to survival time. Biometrics, 50, 1003–1014.

DeGruttola, V., Ware, J.H., and Louis, T.A. (1987) Influence analysis of
generalized least squares estimators. Journal of the American Statis-
tical Association, 82, 911–917.
DeGruttola, V., Wulfsohn, M., Fischl, M.A., and Tsiatis, A. (1993) Mod-
elling the relationship between survival and CD4 lymphocytes in pa-
tients with AIDS and AIDS-related complex. Journal of Acquired Im-
mune Deficiency Syndrome, 6, 359–365.

Dempster, A.P., Laird, N.M., and Rubin, D.B. (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B, 39, 1–38.

Dempster, A.P. and Rubin, D.B. (1983) Overview. In: Incomplete Data in Sample Surveys, Vol. II: Theory and Annotated Bibliography, W.G. Madow, I. Olkin, and D.B. Rubin (Eds.). New York: Academic Press, pp. 3–10.

Dempster, A.P., Rubin, D.B., and Tsutakawa, R.K. (1981) Estimation in covariance components models. Journal of the American Statistical Association, 76, 341–353.

De Ponti, F., Lecchini, S., Cosentino, M., Castelletti, C.M., Malesci, A., and Frigo, G.M. (1993) Immunological adverse effects of anticonvulsants. What is their clinical relevance? Drug Safety, 8, 235–250.

Diem, J.E. and Liukkonen, J.R. (1988) A comparative study of three methods for analysing longitudinal pulmonary function data. Statistics in Medicine, 7, 19–28.

Diggle, P.J. (1983) Statistical Analysis of Spatial Point Patterns. Mathematics in Biology. London: Academic Press.
Diggle, P.J. (1988) An approach to the analysis of repeated measures.
Biometrics, 44, 959–971.

Diggle, P.J. (1989) Testing for random dropouts in repeated measurement data. Biometrics, 45, 1255–1258.

Diggle, P.J. (1990) Time Series: A Biostatistical Introduction. Oxford: Oxford University Press.

Diggle, P.J. (1992) On informative and random dropouts in longitudinal studies. Letter to the Editor. Biometrics, 48, 947.

Diggle, P.J. (1993) Estimation with missing data. Reply to a Letter to the Editor. Biometrics, 49, 580.

Diggle, P.J. and Kenward, M.G. (1994) Informative drop-out in longitudinal data analysis (with discussion). Applied Statistics, 43, 49–93.

Diggle, P.J., Liang, K.-Y., and Zeger, S.L. (1994) Analysis of Longitudinal
Data. Oxford Science Publications. Oxford: Clarendon Press.

Draper, D. (1995) Assessment and propagation of model uncertainty (with discussion). Journal of the Royal Statistical Society, Series B, 57, 45–97.

Dyer, A.R. (1974) Comparison of tests for normality with a cautionary note. Biometrika, 61, 185–189.
Edlefsen, L.E. and Jones, S.D. GAUSS. Aptech Systems Inc., Kent, WA.
Edwards, A.W.F. (1972) Likelihood. Cambridge: Cambridge University
Press.
Efron, B. (1994) Missing data, imputation, and the bootstrap (with dis-
cussion). Journal of the American Statistical Association, 89, 463–479.
Efron, B. and Hinkley, D.V. (1978) Assessing the accuracy of the maxi-
mum likelihood estimator: observed versus expected Fisher informa-
tion. Biometrika, 65, 457–487.
Ekholm, A. and Skinner, C. (1998) The Muscatine children’s obesity data reanalysed using pattern mixture models. Applied Statistics, 47, 251–263.
Ellenberg, S.S. and Hamilton, J.M. (1989) Surrogate endpoints in clinical
trials: cancer. Statistics in Medicine, 8, 405–413.
Fahrmeir, L. and Tutz, G. (1994) Multivariate Statistical Modelling Based
on Generalized Linear Models. Heidelberg: Springer-Verlag.
Fitzmaurice, G.M. and Laird, N.M. (1993) A likelihood-based method for analysing longitudinal binary responses. Biometrika, 80, 141–151.
Fitzmaurice, G.M. and Laird, N.M. (1995) Regression models for a bi-
variate discrete and continuous outcome with clustering. Journal of
the American Statistical Association, 90, 845–852.
Fitzmaurice, G.M. and Laird, N.M. (1997) Regression models for mixed
discrete and continuous responses with potentially missing values. Bio-
metrics, 53, 110–122.
Fitzmaurice, G.M., Laird, N.M., and Lipsitz, S.R. (1994) Analysing in-
complete longitudinal binary responses: a likelihood-based approach.
Biometrics, 50, 601–612.
Fitzmaurice, G.M., Laird, N.M., and Rotnitzky, A. (1993) Regression
models for discrete longitudinal responses. Statistical Science, 8, 284–
309.
Fitzmaurice, G.M., Molenberghs, G., and Lipsitz, S.R. (1995) Regression
models for longitudinal binary responses with informative dropouts.
Journal of the Royal Statistical Society, Series B, 57, 691–704.
Fleiss, J.L. (1993) The statistical basis of meta-analysis. Statistical Meth-
ods in Medical Research, 2, 121–145.

Fleming, T.R. (1992) Evaluating therapeutic interventions: some issues and experiences (with discussion). Statistical Science, 7, 428–456.

Fleming, T.R. and DeMets, D.L. (1996) Surrogate endpoints in clinical trials: are we being misled? Annals of Internal Medicine, 125, 605–613.

Fleming, T.R., Prentice, R.L., Pepe, M.S., and Glidden, D. (1994) Sur-
rogate and auxiliary endpoints in clinical trials, with potential ap-
plications in cancer and AIDS research. Statistics in Medicine, 13,
955–968.

Follman, D. and Wu, M. (1995) An approximate generalized linear model with random effects for informative missing data. Biometrics, 51, 151–168.

Fowler, F.J. (1988) Survey Research Methods. Newbury Park, CA: Sage.

Freedman, L.S., Graubard, B.I., and Schatzkin, A. (1992) Statistical validation of intermediate endpoints for chronic diseases. Statistics in Medicine, 11, 167–178.

Friedman, L.M., Furberg, C.D., and DeMets, D.L. (1998) Fundamentals of Clinical Trials. New York: Springer-Verlag.
Gaylor, D.W. (1989) Quantitative risk analysis for quantal reproduc-
tive and developmental effects. Environmental Health Perspectives, 79,
243–246.

Gelman, A., Carlin, J.B., Stern, H.S., and Rubin, D.B. (1995) Bayesian
Data Analysis, Texts in Statistical Science. London: Chapman & Hall.

Geys, H., Molenberghs, G., and Ryan, L.M. (1997) Pseudo-likelihood in-
ference for clustered binary data. Communications in Statistics: The-
ory and Methods, 26, 2743–2767.

Geys, H., Molenberghs, G., and Ryan, L. (1999) Pseudolikelihood modeling of multivariate outcomes in developmental toxicology. Journal of the American Statistical Association, 94, 734–745.

Geys, H., Molenberghs, G., and Williams, P. (1999) Analysis of clustered binary data with covariates specific to each observation. Submitted for publication.

Geys, H., Regan, M., Catalano, P., and Molenberghs, G. (1999) Two
latent variable risk assessment approaches for combined continuous
and discrete outcomes from developmental toxicity data. Submitted
for publication.

Ghosh, J.K. and Sen, P.K. (1985) On the asymptotic performance of the log likelihood ratio statistic for the mixture model and related results. In: Proceedings of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer, Vol. 2, L.M. Le Cam and R.A. Olshen (Eds.). Monterey: Wadsworth, Inc., pp. 789–806.
Gilks, W.R., Wang, C.C., Yvonnet, B., and Coursaget, P. (1993) Random-
effects models for longitudinal data using Gibbs sampling. Biometrics,
49, 441–453.
Glonek, G.F.V. and McCullagh, P. (1995) Multivariate logistic models. Journal of the Royal Statistical Society, Series B, 57, 533–546.
Glynn, R.J., Laird, N.M., and Rubin, D.B. (1986) Selection modelling
versus mixture modelling with non-ignorable nonresponse. In: Drawing
Inferences from Self Selected Samples, H. Wainer (Ed.). New York:
Springer-Verlag, pp. 115–142.
Goldstein, H. (1979) The Design and Analysis of Longitudinal Studies.
London: Academic Press.
Goldstein, H. (1995) Multilevel Statistical Models. Kendall’s Library of Statistics 3. London: Arnold.
Golub, G.H. and Van Loan, C.F. (1989) Matrix Computations. (2nd ed.).
Baltimore: The Johns Hopkins University Press.
Goss, P.E., Winer, E.P., Tannock, I.F., and Schwartz, L.H. (1999) Breast
cancer: randomized phase III trial comparing the new potent and se-
lective third-generation aromatase inhibitor vorozole with megestrol
acetate in postmenopausal advanced breast cancer patients. Journal
of Clinical Oncology, 17, 52–63.
Gould, A.L. (1980) A new approach to the analysis of clinical drug trials
with withdrawals. Biometrics, 36, 721–727.
Greco, F.A., Figlin, R., York, M., Einhorn, L., Schilsky, R., Marshall,
E.M., Buys, S.S., Froimtchuk, M.J., Schuller, J., Buyse, M., Ritter,
L., Man, A., and Yap, A.K.L. (1996) Phase III randomized study
to compare interferon alfa-2a in combination with fluorouracil versus
fluorouracil alone in patients with advanced colorectal cancer. Journal
of Clinical Oncology, 14, 2674–2681.
Green, S., Benedetti, J., and Crowley, J. (1997) Clinical Trials in Oncol-
ogy. London: Chapman & Hall.
Greenlees, W.S., Reece, J.S., and Zieschang, K.D. (1982) Imputation of
missing values when the probability of response depends on the vari-
able being imputed. Journal of the American Statistical Association,
77, 251–261.

Gregoire, T., Brillinger, D.R., Diggle, P.J., Russek-Cohen, E., Warren, W.G., and Wolfinger, R.D. (1997) Modelling Longitudinal and Spatially Correlated Data. Lecture Notes in Statistics 122. New York: Springer-Verlag.

Haber, F. (1924) Zur Geschichte des Gaskrieges (On the history of gas warfare). In: Fünf Vorträge aus den Jahren 1920–1923 (Five Lectures from the Years 1920–1923). Berlin: Springer-Verlag, pp. 76–92.
Hadler, S.C., Francis, D.P., Maynard, J.E., Thompson, S.E., Judson,
F.N., Echenberg, D.F., Ostrow, D.G., O’Malley, P.M., Penley, K.A.,
Altman, N.L., et al. (1986) Long-term immunogenicity and efficacy
of hepatitis B vaccine in homosexual men. New England Journal of
Medicine, 315, 209–214.
Hand, D.J., Daly, F., Lunn, A.D., McConway, K.J., and Ostrowski, E.
(1994) A Handbook of Small Data Sets (1st ed.). London: Chapman
& Hall.
Hand, D.J. and Taylor, C.C. (1987) Multivariate Analysis of Variance
and Repeated Measures. London: Chapman & Hall.
Hannan, E.J. and Quinn, B.G. (1979) The determination of the order of
an autoregression. Journal of the Royal Statistical Society, Series B,
41, 190–195.
Hartley, H.O. and Hocking, R. (1971) The analysis of incomplete data. Biometrics, 27, 783–808.
Harville, D.A. (1974) Bayesian inference for variance components using
only error contrasts. Biometrika, 61, 383–385.
Harville, D.A. (1976) Extension of the Gauss-Markov theorem to include
the estimation of random effects. The Annals of Statistics, 4, 384–395.
Harville, D.A. (1977) Maximum likelihood approaches to variance com-
ponent estimation and to related problems. Journal of the American
Statistical Association, 72, 320–340.
Heckman, J.J. (1976) The common structure of statistical models of trun-
cation, sample selection and limited dependent variables and a simple
estimator for such models. Annals of Economic and Social Measure-
ment, 5, 475–492.
Hedeker, D. and Gibbons, R.D. (1994) A random-effects ordinal regres-
sion model for multilevel analysis. Biometrics, 50, 933–944.
Hedeker, D. and Gibbons, R.D. (1996) MIXOR: A computer program
for mixed-effects ordinal regression analysis. Computer Methods and
Programs in Biomedicine, 49, 157–176.

Hedeker, D. and Gibbons, R.D. (1997) Application of random-effects pattern-mixture models for missing data in longitudinal studies. Psychological Methods, 2, 64–78.

Heitjan, D.F. (1993) Estimation with missing data. Letter to the Editor. Biometrics, 49, 580.

Heitjan, D.F. (1994) Ignorability in general incomplete-data models. Biometrika, 81, 701–708.

Helms, R.W. (1992) Intentionally incomplete longitudinal designs: methodology and comparison of some full span designs. Statistics in Medicine, 11, 1889–1913.
Henderson, C.R. (1984) Applications of Linear Models in Animal Breed-
ing. Guelph, Canada: University of Guelph Press.

Henderson, C.R., Kempthorne, O., Searle, S.R., and Von Krosig, C.N.
(1959) Estimation of environmental and genetic trends from records
subject to culling. Biometrics, 15, 192–218.

Heyting, A., Tolboom, J.T.B.M., and Essers, J.G.A. (1992) Statistical handling of drop-outs in longitudinal clinical trials. Statistics in Medicine, 11, 2043–2061.
Hogan, J.W. and Laird, N.M. (1997) Mixture models for the joint distri-
bution of repeated measures and event times. Statistics in Medicine,
16, 239–258.

Holmes, L.B. (1988) Human teratogens: delineating the phenotypic effects, the period of greatest sensitivity, and the dose-response relationship and mechanisms of action. In: Transplacental Effects on Fetal Health. New York: Alan R. Liss, Inc., pp. 171–191.

Hosmer, D.W. and Lemeshow, S. (1989) Applied Logistic Regression. New York: John Wiley & Sons.

Hougaard, P. (1987) Modelling multivariate survival. Scandinavian Journal of Statistics, 14, 291–304.

Jennrich, R.I. and Schluchter, M.D. (1986) Unbalanced repeated measures models with structured covariance matrices. Biometrics, 42, 805–820.

Johnson, R.A. and Wichern, D.W. (1992) Applied Multivariate Statistical Analysis (3rd ed.). Englewood Cliffs, NJ: Prentice-Hall.
Kahn, H. and Sempos, C.T. (1989) Statistical Methods in Epidemiology.
New York: Oxford University Press.

Kenward, M.G. (1998) Selection models for repeated measurements with nonrandom dropout: an illustration of sensitivity. Statistics in Medicine, 17, 2723–2732.
Kenward, M.G. and Molenberghs, G. (1998) Likelihood based frequentist inference when data are missing at random. Statistical Science, 13, 236–247.
Kenward, M.G. and Molenberghs, G. (1999) Parametric models for in-
complete continuous and categorical longitudinal studies data. Statis-
tical Methods in Medical Research, 8, 51–83.
Kenward, M.G., Molenberghs, G. and Lesaffre, E. (1994) An application
of maximum likelihood and estimating equations to the analysis of
ordinal data from a longitudinal study with cases missing at random.
Biometrics, 50, 945–953.
Kenward, M.G. and Roger, J.H. (1997) Small sample inference for fixed
effects from restricted maximum likelihood. Biometrics, 53, 983–997.
Khoury, M.J., Adams, M.M., Rhodes, P., and Erickson, J.D. (1987) Mon-
itoring multiple malformations in the detection of epidemics of birth
defects. Teratology, 36, 345–354.
Kimmel, G.L., Cuff, J.M., Kimmel, C.A., Heredia, D.J., Tudor, N., and
Silverman, P.M. (1993) Embryonic development in vitro following
short-duration exposure to heat. Teratology, 47, 243–251.
Kimmel, G.L., Williams, P.L., Kimmel, C.A., Claggett, T.W., and Tudor,
N. (1994) The effects of temperature and duration of exposure on in
vitro development and response-surface modelling of their interaction.
Teratology, 49, 366–367.
Krzanowski, W.J. (1988) Principles of Multivariate Analysis. Oxford:
Clarendon Press.
Laird, N.M. (1978) Nonparametric maximum likelihood estimation of a
mixing distribution. Journal of the American Statistical Association,
73, 805–811.
Laird, N.M. (1988) Missing data in longitudinal studies. Statistics in
Medicine, 7, 305–315.
Laird, N.M. (1994) Discussion to Diggle, P.J. and Kenward, M.G.: Infor-
mative dropout in longitudinal data analysis. Applied Statistics, 43,
84.
Laird, N.M., Lange, N., and Stram, D. (1987) Maximum likelihood computations with repeated measures: application of the EM algorithm. Journal of the American Statistical Association, 82, 97–105.

Laird, N.M. and Ware, J.H. (1982) Random effects models for longitudinal
data. Biometrics, 38, 963–974.

Lang, J.B. and Agresti, A. (1994) Simultaneously modeling joint and marginal distributions of multivariate categorical responses. Journal of the American Statistical Association, 89, 625–632.

Lange, N. and Ryan, L. (1989) Assessing normality in random effects models. The Annals of Statistics, 17, 624–642.

Lee, Y. and Nelder, J.A. (1996) Hierarchical generalized linear models (with discussion). Journal of the Royal Statistical Society, Series B, 58, 619–678.
Lehmann, E.L. and D’Abrera, H.J.M. (1975) Nonparametrics. Statistical
Methods Based on Ranks. San Francisco: Holden-Day.

Lesaffre, E., Asefa, M., and Verbeke, G. (1999) Assessing the goodness-of-fit of the Laird and Ware model – an example: the Jimma Infant Survival Differential Longitudinal Study. Statistics in Medicine, 18, 835–854.

Lesaffre, E. and Verbeke, G. (1998) Local influence in linear mixed models. Biometrics, 54, 570–582.
Leslie, J.R., Stephens, M.A., and Fotopoulos, S. (1986) Asymptotic dis-
tribution of the Shapiro-Wilk W for testing for normality. The Annals
of Statistics, 14, 1497–1506.

Li, K.H., Raghunathan, T.E., and Rubin, D.B. (1991) Large-sample significance levels from multiply imputed data using moment-based statistics and an F reference distribution. Journal of the American Statistical Association, 86, 1065–1073.

Liang, K.-Y. and Zeger, S.L. (1986) Longitudinal data analysis using
generalized linear models. Biometrika, 73, 13–22.

Liang, K.-Y. and Zeger, S.L. (1989) A class of logistic regression models
for multivariate binary time series. Journal of the American Statistical
Association, 84, 447–451.

Liang, K.-Y., Zeger, S.L., and Qaqish, B. (1992) Multivariate regression analyses for categorical data. Journal of the Royal Statistical Society, Series B, 54, 3–40.

Lilienfeld, D.E. and Stolley, P.D. (1994) Foundations of Epidemiology. New York: Oxford University Press.

Lin, D.Y., Fischl, M.A., and Schoenfeld, D.A. (1993) Evaluating the role
of CD4-lymphocyte change as a surrogate endpoint in HIV clinical
trials. Statistics in Medicine, 12, 835–842.

Lin, D.Y., Fleming T.R., and De Gruttola, V. (1997) Estimating the pro-
portion of treatment effect explained by a surrogate marker. Statistics
in Medicine, 16, 1515–1527.

Lin, X., Raz, J., and Harlow, S. (1997) Linear mixed models with hetero-
geneous within-cluster variances. Biometrics, 53, 910–923.

Lindley, D.V. and Smith, A.F.M. (1972) Bayes estimates for the linear
model. Journal of the Royal Statistical Society, Series B, 34, 1–41.
Lindstrom, M.J. and Bates, D.M. (1988) Newton-Raphson and EM al-
gorithms for linear mixed-effects models for repeated-measures data.
Journal of the American Statistical Association, 83, 1014–1022.

Littell, R.C., Milliken, G.A., Stroup, W.W., and Wolfinger, R.D. (1996)
SAS System for Mixed Models. Cary, NC: SAS Institute Inc.

Little, R.J.A. (1976) Inference about means for incomplete multivariate data. Biometrika, 63, 593–604.

Little, R.J.A. (1986) A note about models for selectivity bias. Econometrica, 53, 1469–1474.

Little, R.J.A. (1993) Pattern-mixture models for multivariate incomplete data. Journal of the American Statistical Association, 88, 125–134.

Little, R.J.A. (1994a) A class of pattern-mixture models for normal incomplete data. Biometrika, 81, 471–483.

Little, R.J.A. (1994b) Discussion to Diggle, P.J. and Kenward, M.G.: Informative dropout in longitudinal data analysis. Applied Statistics, 43, 78.

Little, R.J.A. (1995) Modeling the drop-out mechanism in repeated measures studies. Journal of the American Statistical Association, 90, 1112–1121.

Little, R.J.A. and Rubin, D.B. (1987) Statistical Analysis with Missing
Data. New York: John Wiley & Sons.
Little, R.J.A. and Wang, Y. (1996) Pattern-mixture models for multivari-
ate incomplete data with covariates. Biometrics, 52, 98–111.
Little, R.J.A. and Yau, L. (1996) Intent-to-treat analysis for longitudinal
studies with drop-outs. Biometrics, 52, 1324–1333.

Liu, C. and Rubin, D.B. (1994) The ECME algorithm: a simple extension
of EM and ECM with faster monotone convergence. Biometrika, 81,
633–648.
Liu, C. and Rubin, D.B. (1995) ML estimation of the t distribution using EM and its extensions, ECM and ECME. Statistica Sinica, 5, 19–39.
Liu, C., Rubin, D.B., and Wu, Y.N. (1998) Parameter expansion to ac-
celerate EM: the PX-EM algorithm. Biometrika, 85, 755–770.
Liu, G. and Liang, K.-Y. (1997) Sample size calculations for studies with
correlated observations. Biometrics, 53, 937–947.
Longford, N.T. (1993) Random Coefficient Models. Oxford: Oxford Uni-
versity Press.
Louis, T.A. (1982) Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society, Series B, 44, 226–233.

Louis, T.A. (1984) Estimating a population of parameter values using Bayes and empirical Bayes methods. Journal of the American Statistical Association, 79, 393–398.

Magder, L.S. and Zeger, S.L. (1996) A smooth nonparametric estimate of a mixing distribution using mixtures of Gaussians. Journal of the American Statistical Association, 91, 1141–1152.
Mansour, H., Nordheim, E.V., and Rutledge, J.J. (1985) Maximum likeli-
hood estimation of variance components in repeated measures designs
assuming autoregressive errors. Biometrics, 41, 287–294.
McArdle, J.J. and Hamagami, F. (1992) Modeling incomplete longitu-
dinal and cross-sectional data using latent growth structural models.
Experimental Aging Research, 18, 145–166.
McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models. Lon-
don: Chapman & Hall.
McLachlan, G.J. and Basford, K.E. (1988) Mixture Models: Inference and Applications to Clustering. New York: Marcel Dekker.
McLachlan, G.J. and Krishnan, T. (1997) The EM Algorithm and Exten-
sions. New York: John Wiley & Sons.
McLean, R.A., Sanders, W.L., and Stroup, W.W. (1991) A unified ap-
proach to mixed linear models. The American Statistician, 45, 54–64.
Meilijson, I. (1989) A fast improvement to the EM algorithm on its own
terms. Journal of the Royal Statistical Society, Series B, 51, 127–138.

Mellors, J.W., Munoz, A., Giorgi, J.V., Margolich, J.B., Tassoni, C.J.,
Gupta, P., Kingsley, L.A., Todd, J.A., Saah, A.J., Phair, J.P., and
Rinaldo, C.R. (1997) Plasma viral load and CD4+ lymphocytes as
prognostic markers of HIV-1 infection. Annals of Internal Medicine,
126, 946–954.
Meng, X.-L. (1997) The EM algorithm and medical studies: a historical
link. Statistical Methods in Medical Research, 6, 3–23.
Meng, X.-L. and Rubin, D.B. (1991) Using EM to obtain asymptotic vari-
ance covariance matrices: the SEM algorithm. Journal of the American
Statistical Association, 86, 899–909.
Meng, X.-L. and Rubin, D.B. (1993) Maximum likelihood estimation via
the ECM algorithm: a general framework. Biometrika, 80, 267–278.
Meng, X.-L. and van Dyk, D. (1997) The EM algorithm – an old folk-song sung to a fast new tune. Journal of the Royal Statistical Society, Series B, 59, 511–567.

Meng, X.-L. and van Dyk, D. (1998) Fast EM-type implementation for mixed effects models. Journal of the Royal Statistical Society, Series B, 60, 559–578.
Mentré, F., Mallet, A., and Baccar, D. (1997) Optimal design in random-
effects regression models. Biometrika, 84, 429–442.
Michiels, B., Molenberghs, G., Bijnens, L., and Vangeneugden, T. (1999)
Selection models and pattern-mixture models to analyze longitudinal
quality of life data subject to dropout. Submitted for publication.
Michiels, B., Molenberghs, G., and Lipsitz, S.R. (1999). Selection mod-
els and pattern-mixture models for incomplete categorical data with
covariates. Biometrics, 55, 978–983.
Miller, J.J. (1977) Asymptotic properties of maximum likelihood esti-
mates in the mixed model of the analysis of variance. The Annals of
Statistics, 5, 746–762.
Molenberghs, G., Buyse, M., Geys, H., Renard, D., and Burzykowski, T.
(1999) Statistical challenges in the evaluation of surrogate endpoints
in randomized trials. Submitted for publication.
Molenberghs, G., Geys, H., and Buyse, M. (1998) Validation of surro-
gate endpoints in randomized experiments with mixed discrete and
continuous outcomes. Submitted for publication.
Molenberghs, G., Goetghebeur, E.J.T., Lipsitz, S.R., Kenward, M.G.
(1999) Non-random missingness in categorical data: strengths and lim-
itations. The American Statistician, 53, 110–118.

Molenberghs, G., Kenward, M.G., and Lesaffre, E. (1997) The analysis of longitudinal ordinal data with non-random dropout. Biometrika, 84, 33–44.
Molenberghs, G. and Lesaffre, E. (1994) Marginal modelling of correlated
ordinal data using a multivariate Plackett distribution. Journal of the
American Statistical Association, 89, 633–644.
Molenberghs, G., Michiels, B., and Kenward, M.G. (1998) Pseudo-like-
lihood for combined selection and pattern-mixture models for missing
data problems. Biometrical Journal, 40, 557–572.
Molenberghs, G., Michiels, B., Kenward, M.G., and Diggle, P.J. (1998)
Missing data mechanisms and pattern-mixture models. Statistica
Neerlandica, 52, 153–161.
Molenberghs, G., Michiels, B., and Lipsitz, S.R. (1999) A pattern-mixture
odds ratio model for incomplete categorical data. Communications in
Statistics: Theory and Methods, 28, 000–000.
Molenberghs, G. and Ritter, L. (1996) Likelihood and quasi-likelihood
based methods for analysing multivariate categorical data, with the
association between outcomes of interest. Biometrics, 52, 1121–1133.
Molenberghs, G. and Ryan, L.M. (1999) Likelihood inference for clustered
multivariate binary data. Environmetrics, 10, 279–300.
Molenberghs, G., Verbeke, G., Thijs, H., Lesaffre, E., and Kenward, M.G.
(1999) Mastitis in dairy cattle: influence analysis to assess sensitivity
of the dropout process. Submitted for publication.
Morrell, C.H. (1998) Likelihood ratio testing of variance components in
the linear mixed-effects model using restricted maximum likelihood.
Biometrics, 54, 1560–1568.
Morrell, C.H. and Brant, L.J. (1991) Modelling hearing thresholds in the
elderly. Statistics in Medicine, 10, 1453–1464.
Morrell, C.H., Pearson, J.D., Ballentine Carter, H., and Brant, L.J.
(1995) Estimating unknown transition times using a piecewise non-
linear mixed-effects model in men with prostate cancer. Journal of
the American Statistical Association, 90, 45–53.
Morrell, C.H., Pearson, J.D., and Brant, L.J. (1997) Linear transforma-
tions of linear mixed-effects models. The American Statistician, 51,
338–343.
Murray, G.D. and Findlay, J.G. (1988) Correcting for the bias caused by drop-outs in hypertension trials. Statistics in Medicine, 7, 941–946.

Muthén, B., Kaplan, D., and Hollis, M. (1987) On structural equation modeling with data that are not missing completely at random. Psychometrika, 52, 431–462.

Neave, H.R. (1986) Statistics Tables for Mathematicians, Engineers, Economists and the Behavioural and Management Sciences. London: George Allen & Unwin.

Nelder, J.A. (1954) The interpretation of negative components of variance. Biometrika, 41, 544–548.

Nelder, J.A. and Mead, R. (1965) A simplex method for function min-
imisation. The Computer Journal, 7, 303–313.
Neter, J., Wasserman, W., and Kutner, M.H. (1990) Applied Linear Sta-
tistical Models. Regression, Analysis of Variance and Experimental
Designs (3rd ed.). Homewood, IL: Richard D. Irwin, Inc.

Neuhaus, J.M. and Kalbfleisch, J.D. (1998) Between- and within-cluster covariate effects in the analysis of clustered data. Biometrics, 54, 638–645.

Nordheim, E.V. (1984) Inference from nonrandomly missing categorical data: an example from a genetic study on Turner’s syndrome. Journal of the American Statistical Association, 79, 772–780.

Núñez-Antón, V. and Woodworth, G.G. (1994) Analysis of longitudinal data with unequally spaced observations and time-dependent correlated errors. Biometrics, 50, 445–456.

O’Brien, W.A., Hartigan, P.M., Martin, D., Eisnhart, J., Hill, A., Benoit,
S., Rubin, M., Simberkoff, M.S., and Hamilton, J.D. (1996) Changes
in plasma HIV-1 RNA and CD4+ lymphocyte counts and the risk of
progression to AIDS. New England Journal of Medicine, 334, 426–431.

Olkin, I. and Tate, R.F. (1961) Multivariate correlation models with mixed discrete and continuous variables. Annals of Mathematical Statistics, 32, 448–465 (with correction in 36, 343–344).

O’Neill, B. (1966) Elementary Differential Geometry. New York: Academic Press.

Ovarian Cancer Meta-Analysis Project (1991) Cyclophosphamide plus cisplatin versus cyclophosphamide, doxorubicin, and cisplatin chemotherapy of ovarian carcinoma: a meta-analysis. Journal of Clinical Oncology, 9, 1668–1674.

Ovarian Cancer Meta-Analysis Project (1998) Cyclophosphamide plus cisplatin versus cyclophosphamide, doxorubicin, and cisplatin chemotherapy of ovarian carcinoma: a meta-analysis. Classic Papers and Current Comments, 3, 237–243.

Pan, H. and Goldstein, H. (1998) Multi-level repeated measures growth modelling using extended spline functions. Statistics in Medicine, 17, 2755–2770.

Park, T. and Brown, M.B. (1994) Models for categorical data with nonig-
norable nonresponse. Journal of the American Statistical Association,
89, 44–52.

Park, T. and Lee, S.-L. (1999) Simple pattern-mixture models for longitu-
dinal data with missing observations: analysis of urinary incontinence
data. Statistics in Medicine, 18, 2933–2941.

Patel, H.I. (1991) Analysis of incomplete data from clinical trials with repeated measurements. Biometrika, 78, 609–619.

Patterson, H.D. and Thompson, R. (1971) Recovery of inter-block information when block sizes are unequal. Biometrika, 58, 545–554.

Pearson, E.S., D’Agostino, R.B., and Bowman, K.O. (1977) Tests for departure from normality: comparison of powers. Biometrika, 64, 231–246.

Pearson, J.D., Kaminski, P., Metter, E.J., Fozard, J.L., Brant, L.J., Morrell, C.H., and Carter, H.B. (1991) Modeling longitudinal rates of change in prostate specific antigen during aging. Proceedings of the Social Statistics Section of the American Statistical Association, Washington, DC, pp. 580–585.

Pearson, J.D., Morrell, C.H., Gordon-Salant, S., Brant, L.J., Metter, E.J.,
Klein, L.L., and Fozard J.L. (1995) Gender differences in a longitudinal
study of age-associated hearing loss. Journal of the Acoustical Society
of America, 97, 1196–1205.

Pearson, J.D., Morrell, C.H., Landis, P.K., Carter, H.B., and Brant, L.J.
(1994) Mixed-effects regression models for studying the natural history
of prostate disease. Statistics in Medicine, 13, 587–601.

Peixoto, J.L. (1987) Hierarchical variable selection in polynomial regression models. The American Statistician, 41, 311–313.
Peixoto, J.L. (1990) A property of well-formulated polynomial regression
models. The American Statistician, 44, 26–30.

Pendergast, J.F., Gange, S.J., Newton, M.A., Lindstrom, M.J., Palta, M.,
and Fisher, M.R. (1996) A survey of methods for analyzing clustered
binary response data. International Statistical Review, 64, 89–118.

Pharmacological Therapy for Macular Degeneration Study Group (1997) Interferon α-IIA is ineffective for patients with choroidal neovascularization secondary to age-related macular degeneration. Results of a prospective randomized placebo-controlled clinical trial. Archives of Ophthalmology, 115, 865–872.

Piantadosi, S. (1997) Clinical Trials: A Methodologic Perspective. New York: John Wiley & Sons.
Potthoff, R.F. and Roy, S.N. (1964) A generalized multivariate analysis
of variance model useful especially for growth curve problems. Bio-
metrika, 51, 313–326.

Prasad, N.G.N. and Rao, J.N.K. (1990) The estimation of mean squared
error of small-area estimators. Journal of the American Statistical As-
sociation, 85, 163–171.

Pregibon, D. (1979) Data analytic methods for generalized linear models. Ph.D. Thesis, University of Toronto.

Pregibon, D. (1981) Logistic regression diagnostics. The Annals of Statistics, 9, 705–724.

Prentice, R.L. (1988) Correlated binary regression with covariates specific to each binary observation. Biometrics, 44, 1033–1048.
Prentice, R.L. (1989) Surrogate endpoints in clinical trials: definitions
and operational criteria. Statistics in Medicine, 8, 431–440.
Rang, H.P. and Dale, M.M. (1990) Pharmacology. Edinburgh: Churchill
Livingstone.

Rao, C.R. (1973) Linear Statistical Inference and Its Applications (2nd
ed.). New York: John Wiley & Sons.

Regan, M.M. and Catalano, P.J. (1999a) Likelihood models for clustered
binary and continuous outcomes: Application to developmental toxi-
cology. Biometrics, 55, 760–768.
Regan, M.M. and Catalano, P.J. (1999b) Bivariate dose-response model-
ing and risk estimation in developmental toxicology. Journal of Agri-
cultural, Biological and Environmental Statistics, 4, 217–237.

Ripley, B.D. (1981) Spatial Statistics. New York: John Wiley & Sons.

Roberts, D.T. (1992) Prevalence of dermatophyte onychomycosis in the United Kingdom: results of an omnibus survey. British Journal of Dermatology, 126, 23–27.

Robins, J.M. (1997) Non-response models for the analysis of non-monotone non-ignorable missing data. Statistics in Medicine, 16, 21–38.

Robins, J.M. and Gill, R. (1997) Non-response models for the analysis of non-monotone ignorable missing data. Statistics in Medicine, 16, 39–56.
Robins, J.M. and Rotnitzky, A. (1995) Semiparametric efficiency in multi-
variate regression models with missing data. Journal of the American
Statistical Association, 90, 122–129.
Robins, J.M., Rotnitzky, A., and Scharfstein, D.O. (1998) Semiparamet-
ric regression for repeated outcomes with non-ignorable non-response.
Journal of the American Statistical Association, 93, 1321–1339.
Robins, J.M., Rotnitzky, A., and Zhao, L.P. (1995) Analysis of semi-
parametric regression models for repeated outcomes in the presence
of missing data. Journal of the American Statistical Association, 90,
106–121.
Robinson, G.K. (1991) That BLUP is a good thing: the estimation of random effects. Statistical Science, 6, 15–51.
Rochon, J. (1992) ARMA covariance structures with time heteroscedas-
ticity for repeated measures experiments. Journal of the American
Statistical Association, 87, 777–784.
Roger, J.H. (1993) A new look at the facilities in PROC MIXED. Pro-
ceedings SEUGI, 93, 521–532.
Roger, J.H. and Kenward, M.G. (1993) Repeated measures using PROC MIXED instead of PROC GLM. In: Proceedings of the First Annual South-East SAS Users Group Conference. Cary, NC: SAS Institute Inc., pp. 199–208.

Rosner, B. (1984) Multivariate methods in ophthalmology with applications to other paired-data situations. Biometrics, 40, 1025–1035.
Rotnitzky, A. and Robins, J.M. (1995) Semi-parametric estimation of
models for means and covariances in the presence of missing data.
Scandinavian Journal of Statistics: Theory and Applications, 22, 323–
334.
Rotnitzky, A. and Robins, J.M. (1997) Analysis of semiparametric regres-
sion models with non-ignorable non-response. Statistics in Medicine,
16, 81–102.

Royston, P. and Altman, D.G. (1994) Regression using fractional polynomials of continuous covariates: parsimonious parametric modelling. Applied Statistics, 43, 429–468.

Rubin, D.B. (1976) Inference and missing data. Biometrika, 63, 581–592.

Rubin, D.B. (1978) Multiple imputations in sample surveys – a phenomenological Bayesian approach to nonresponse. In: Imputation and Editing of Faulty or Missing Survey Data. Washington, DC: U.S. Department of Commerce, pp. 1–23.

Rubin, D.B. (1987) Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons.

Rubin, D.B. (1994) Discussion to Diggle, P.J. and Kenward, M.G.: Infor-
mative dropout in longitudinal data analysis. Applied Statistics, 43,
80–82.

Rubin, D.B. (1996) Multiple imputation after 18+ years. Journal of the
American Statistical Association, 91, 473–489.

Rubin, D.B. and Schenker, N. (1986) Multiple imputation for interval estimation from simple random samples with ignorable nonresponse. Journal of the American Statistical Association, 81, 366–374.

Rubin, D.B., Stern, H.S., and Vehovar, V. (1995) Handling “don’t know” survey responses: the case of the Slovenian plebiscite. Journal of the American Statistical Association, 90, 822–828.

Ryan, L.M. (1992a) Quantitative risk assessment for developmental toxicity. Biometrics, 48, 163–174.

Ryan, L.M. (1992b) The use of generalized estimating equations for risk
assessment in developmental toxicity. Risk Analysis, 12, 439–447.
Ryan, L.M. and Molenberghs, G. (1999) Statistical methods for develop-
mental toxicity: analysis of clustered multivariate binary data. Annals
of the New York Academy of Sciences, 00, 000–000.

SAS Institute Inc. (1989) SAS/STAT User’s Guide, Version 6, Volume 1 (4th ed.). Cary, NC: SAS Institute Inc.

SAS Institute Inc. (1991) SAS System for Linear Models (3rd ed.). Cary,
NC: SAS Institute Inc.

SAS Institute Inc. (1992) SAS Technical Report P-229, SAS/STAT Soft-
ware: Changes and Enhancements, Release 6.07. Cary, NC: SAS In-
stitute Inc.

SAS Institute Inc. (1996) SAS/STAT Software: Changes and Enhancements through Release 6.11. Cary, NC: SAS Institute Inc.

SAS Institute Inc. (1997) SAS/STAT Software: Changes and Enhancements through Release 6.12. Cary, NC: SAS Institute Inc.

SAS Institute Inc. (1999) SAS/STAT User’s Guide, Version 7. Cary, NC: SAS Institute Inc.
Satterthwaite, F.E. (1941) Synthesis of variance. Psychometrika, 6, 309–
316.
Sauerbrei, W. and Royston, P. (1999) Building multivariable prognostic
and diagnostic models: transformation of the predictors by using frac-
tional polynomials. Journal of the Royal Statistical Society, Series A,
162, 71–94.
Schafer, J.L. (1997) Analysis of Incomplete Multivariate Data. London: Chapman & Hall.

Schafer, J.L., Khare, M., and Ezatti-Rice, T.M. (1993) Multiple imputation of missing data in NHANES III. In: Proceedings of the Annual Research Conference. Washington, DC: Bureau of the Census, pp. 459–487.
Schipper, H., Clinch, J., and McMurray, A. (1984) Measuring the quality
of life of cancer patients: the Functional-Living Index-Cancer: devel-
opment and validation. Journal of Clinical Oncology, 2, 472–483.
Schluchter, M.D. (1992) Methods for the analysis of informatively cen-
sored longitudinal data. Statistics in Medicine, 11, 1861–1870.
Schwarz, G. (1978) Estimating the dimension of a model. The Annals of
Statistics, 6, 461–464.
Schwetz, B.A. and Harris, M.W. (1993) Developmental toxicology: sta-
tus of the field and contribution of the National Toxicology Program.
Environmental Health Perspectives, 100, 269–282.
Searle, S.R. (1987) Linear Models for Unbalanced Data. New York: John
Wiley & Sons.
Searle, S.R., Casella, G., and McCulloch, C.E. (1992) Variance Compo-
nents. New York: John Wiley & Sons.
Seber, G.A.F. (1977) Linear Regression Analysis. New York: John Wiley
& Sons.
Seber, G.A.F. (1984) Multivariate Observations. New York: John Wiley
& Sons.

Self, S.G. and Liang, K.-Y. (1987) Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. Journal of the American Statistical Association, 82, 605–610.

Selvin, S. (1996) Statistical Analysis of Epidemiologic Data. New York: Oxford University Press.

Sen, A. and Srivastava, M. (1990) Regression Analysis: Theory, Methods and Applications. New York: Springer-Verlag.

Senn, S. (1998) Some controversies in planning and analysing multi-centre trials. Statistics in Medicine, 17, 1753–1765.

Shapiro, S.S. and Wilk, M.B. (1965) An analysis of variance test for
normality (complete samples). Biometrika, 52, 591–611.

Shapiro, S.S. and Wilk, M.B. (1968) Approximations for the null distri-
bution of the W statistic. Technometrics, 10, 861–866.

Sharples, K. and Breslow, N.E. (1992) Regression analysis of correlated binary data: some small sample results for the estimating equation approach. Journal of Statistical Computation and Simulation, 42, 1–20.

Sheiner, L.B., Beal, S.L., and Dunne, A. (1997) Analysis of nonrandomly censored ordered categorical longitudinal data from analgesic trials. Journal of the American Statistical Association, 92, 1235–1244.

Shih, J.H. and Louis, T.A. (1995) Inferences on association parameter in copula models for bivariate survival data. Biometrics, 51, 1384–1399.

Shih, W.J. and Quan, H. (1997) Testing for treatment differences with
dropouts in clinical trials—a composite approach. Statistics in Medi-
cine, 16, 1225–1239.
Shock, N.W., Greullich, R.C., Andres, R., Arenberg, D., Costa, P.T.,
Lakatta, E.G., and Tobin, J.D. (1984) Normal Human Aging: The
Baltimore Longitudinal Study of Aging. National Institutes of Health
Publication 84–2450. Washington, DC: National Institutes of Health.

Siddiqui, O. and Ali, M.W. (1998) A comparison of the random-effects pattern-mixture model with last-observation-carried-forward (LOCF) analysis in longitudinal clinical trials with dropouts. Journal of Biopharmaceutical Statistics, 8, 545–563.

Smith, A.F.M. (1973) A general Bayesian linear model. Journal of the Royal Statistical Society, Series B, 35, 67–75.

Smith, D.M., Robertson, B., and Diggle, P.J. (1996) Object-oriented Soft-
ware for the Analysis of Longitudinal Data in S. Technical Report MA
96/192. Department of Mathematics and Statistics, University of Lan-
caster, LA1 4YF, United Kingdom.

Sprott, D.A. (1975) Marginal and conditional sufficiency. Biometrika, 62, 599–605.

Stasny, E.A. (1986) Estimating gross flows using panel data with nonre-
sponse: an example from the Canadian Labour Force Survey. Journal
of the American Statistical Association, 81, 42–47.

Stiratelli, R., Laird, N., and Ware, J. (1984) Random effects models for
serial observations with dichotomous response. Biometrics, 40, 961–
972.
Stram, D.O. and Lee, J.W. (1994) Variance components testing in the
longitudinal mixed effects model. Biometrics, 50, 1171–1177.

Stram, D.O. and Lee, J.W. (1995) Correction to: Variance components testing in the longitudinal mixed effects model. Biometrics, 51, 1196.

Strenio, J.F., Weisberg, H.J., and Bryk, A.S. (1983) Empirical Bayes es-
timation of individual growth-curve parameters and their relationship
to covariates. Biometrics, 39, 71–86.

Tanner, M.A. and Wong, W.H. (1987) The calculation of posterior dis-
tributions by data augmentation. Journal of the American Statistical
Association, 82, 528–550.

Thijs, H., Molenberghs, G., and Verbeke, G. (2000) The milk protein
trial: influence analysis of the dropout process. Biometrical Journal,
00, 000–000.

Thompson, S.G. (1993) Controversies in meta-analysis: the case of the trials of serum cholesterol reduction. Statistical Methods in Medical Research, 2, 173–192.

Thompson, S.G. and Pocock, S.J. (1991) Can meta-analyses be trusted? Lancet, 338, 1127–1130.

Thompson, W.A., Jr. (1962) The problem of negative estimates of variance components. Annals of Mathematical Statistics, 33, 273–289.
Titterington, D.M., Smith, A.F.M., and Makov, U.E. (1985) Statistical
Analysis of Finite Mixture Distributions. New York: John Wiley &
Sons.

Tomasko, L., Helms, R.W., and Snapinn, S.M. (1999) A discriminant analysis extension to mixed models. Statistics in Medicine, 18, 1249–1260.

Troxel, A.B., Harrington, D.P., and Lipsitz, S.R. (1998) Analysis of longitudinal data with non-ignorable non-monotone missing values. Applied Statistics, 47, 425–438.

U.S. Environmental Protection Agency (1991) Guidelines for developmental toxicity risk assessment. Federal Register, 56, 63,798–63,826.

Vach, W. and Blettner, M. (1995) Logistic regression with incompletely observed categorical covariates – investigating the sensitivity against violation of the missing at random assumption. Statistics in Medicine, 12, 1315–1330.

Van Damme, P., Vranckx, R., Safary, A., Andre, F.E., and Meheus, A. (1989) Protective efficacy of a recombinant desoxyribonucleic acid hepatitis B vaccine in institutionalized mentally handicapped clients. American Journal of Medicine, 87, 26S–29S.

van Dyk, D., Meng, X.-L., and Rubin, D.B. (1995) Maximum likelihood estimation via the ECM algorithm: computing the asymptotic variance. Statistica Sinica, 5, 55–75.

Vellinga, A., Van Damme, P., and Meheus, A. (1999a) Hepatitis B and
C in institutions for individuals with mental retardation: a review.
Journal of Intellectual Disability Research, 00, 000–000.

Vellinga, A., Van Damme, P., Weyler, J.J., Vranckx, R., and Meheus, A. (1999b) Hepatitis B vaccination in mentally retarded: effectiveness after 11 years. Vaccine, 17, 602–606.

Vellinga, A., Van Damme, P., Bruckers, L., Weyler, J.J., Molenberghs, G., and Meheus, A. (1999c) Modelling long term persistence of hepatitis B antibodies after vaccination. Journal of Medical Virology, 57, 100–103.

Verbeke, G. (1995) The linear mixed model. A critical investigation in the context of longitudinal data analysis. Ph.D. thesis, Catholic University of Leuven, Faculty of Science, Department of Mathematics.

Verbeke, G. and Lesaffre, E. (1994) The effect of misspecifying the random effects distribution in a linear mixed effects model. In: Proceedings of the 9th International Workshop on Statistical Modelling, Exeter, U.K.
Verbeke, G. and Lesaffre, E. (1996a) A linear mixed-effects model with
heterogeneity in the random-effects population. Journal of the Amer-
ican Statistical Association, 91, 217–221.

Verbeke, G. and Lesaffre, E. (1996b) Large Sample Properties of the Maximum Likelihood Estimators in Linear Mixed Models with Misspecified Random-Effects Distributions. Technical Report 1996.1, Biostatistical Centre for Clinical Trials, Catholic University of Leuven, Belgium.

Verbeke, G. and Lesaffre, E. (1997a) The effect of misspecifying the random effects distribution in linear mixed models for longitudinal data. Computational Statistics and Data Analysis, 23, 541–556.

Verbeke, G. and Lesaffre, E. (1997b) The linear mixed model. A critical investigation in the context of longitudinal data. In: Proceedings of the Nantucket conference on Modelling Longitudinal and Spatially Correlated Data: Methods, Applications, and Future Directions, T. Gregoire (Ed.), Lecture Notes in Statistics 122. New York: Springer-Verlag, pp. 89–99.

Verbeke, G. and Lesaffre, E. (1999) The effect of drop-out on the efficiency of longitudinal experiments. Applied Statistics, 48, 363–375.

Verbeke, G., Lesaffre, E., and Brant L.J. (1998) The detection of residual
serial correlation in linear mixed models. Statistics in Medicine, 17,
1391–1402.

Verbeke, G., Lesaffre, E., and Spiessens, B. (2000) The practical use of
different strategies to handle dropout in longitudinal studies. Submit-
ted for publication.

Verbeke, G. and Molenberghs, G. (1997) Linear Mixed Models in Practice: A SAS-Oriented Approach. Lecture Notes in Statistics 126. New York: Springer-Verlag.

Verbeke, G., Spiessens, B., and Lesaffre, E. (2000) Conditional linear mixed models. The American Statistician, 00, 000–000.
Verbeke, G., Spiessens, B., Lesaffre, E., and Brant, L.J. (1999) Condi-
tional linear mixed models. In: Proceedings of the 14th International
Workshop on Statistical Modelling, H. Friedl, A. Berghold, and G.
Kauermann (Eds), Graz, Austria, pp. 386–393.

Verbyla, A.P. and Cullis, B.R. (1990) Modelling in repeated measures experiments. Applied Statistics, 39, 341–356.

Verdonck, A., De Ridder, L., Verbeke, G., Bourguignon, J.P., Carels, C.,
Kuhn, E.R., Darras, V., and de Zegher, F. (1998) Comparative effects
of neonatal and prepubertal castration on craniofacial growth in rats.
Archives of Oral Biology, 43, 861–871.

Wang-Clow, F., Lange, N., Laird, N.M., and Ware, J.H. (1995) A sim-
ulation study of estimators for rate of change in longitudinal studies
with attrition. Statistics in Medicine, 14, 283–297.
Waternaux, C., Laird, N.M., and Ware, J.H. (1989) Methods for analysis
of longitudinal data: bloodlead concentrations and cognitive develop-
ment. Journal of the American Statistical Association, 84, 33–41.
Weihrauch, T.R. and Demol, P. (1998) The value of surrogate endpoints for evaluation of therapeutic efficacy. Drug Information Journal, 32, 737–743.
Weiner, D.L. (1981) Design and analysis of bioavailability studies. In:
Statistics in the pharmaceutical industry, C.R. Buncher and J.-Y. Tsay
(Eds.). New York: Marcel Dekker, pp. 205–229.
Welham, S.J. and Thompson, R. (1997) Likelihood ratio tests for fixed
model terms using residual maximum likelihood. Journal of the Royal
Statistical Society, Series B, 59, 701–714.
White, H. (1980) Nonlinear regression on cross-section data. Economet-
rica, 48, 721–746.
White, H. (1982) Maximum likelihood estimation of misspecified models.
Econometrica, 50, 1–25.
Williams, P.L., Molenberghs, G., and Lipsitz, S.R. (1996) Analysis of
multiple ordinal outcomes in developmental toxicity studies. Journal
of Agricultural, Biological, and Environmental Statistics, 1, 250–274.
Wolfinger, R. and O’Connell, M. (1993) Generalized linear mixed models:
a pseudo-likelihood approach. Journal of Statistical Computation and
Simulation, 48, 233–243.
Wu, M.C. and Bailey, K.R. (1988) Analysing changes in the presence of
informative right censoring caused by death and withdrawal. Statistics
in Medicine, 7, 337–346.
Wu, M.C. and Bailey, K.R. (1989) Estimation and comparison of changes
in the presence of informative right censoring: conditional linear
model. Biometrics, 45, 939–955.
Wu, M.C. and Carroll, R.J. (1988) Estimation and comparison of changes
in the presence of informative right censoring by modeling the censor-
ing process. Biometrics, 44, 175–188.
Yates, F. (1933) The analysis of replicated experiments when the field results are incomplete. Empire Journal of Experimental Agriculture, 1, 129–142.

Zeger, S.L., Liang, K.-Y., and Albert, P.S. (1988) Models for longitudinal data: a generalized estimating equation approach. Biometrics, 44, 1049–1060.
Index

age-related macular Bayesian methods, 41, 46, 69,


degeneration, 427 470
coefficient of multiple empirical Bayes estimation
determination, 437 (EB), see random
Akaike information criterion effects
(AIC), see information hierarchical Bayes model,
criteria 172
ancillary statistic, 377 posterior distribution, 78,
ANCOVA, 221 176
ANOVA, 221, 229 posterior mean, 78, 176
autoregressive AR(1), see posterior probability, 175,
covariance structure 177
available case analysis, 222, 227, prior distribution, 78
263 shrinkage, 80–82, 85
best linear unbiased prediction
balanced data, 119, 215 (BLUP), 80
BALANCED object, see bivariate outcome, 405
OSWALD blood pressure data, 405–411
Baltimore Longitudinal Study of Bozdogan information criterion
Aging (BLSA), 10–12 (CAIC), see
hearing data, see hearing information criteria
data BY statement, 357
prostate data, see prostate
data case deletion, see global
banded, see covariance structure influence
Index 555

CLASS statement, 96, 483 ‘chisq’ option, 101


classification of profiles, see convergence problems, see
heterogeneity model estimation problems
cluster analysis, see Cook’s distance, see influence
heterogeneity model analysis
coefficient of multiple covariance structure, 62, 64, 117,
determination 121, 240, 265
random effect, 431, 438 ante-dependence, 232
residual, 433, 438 autoregressive AR(1), 99,
subject-specific, see 122, 252, 258, 388, 452,
subject-specific profiles 499
colorectal cancer, 427–429 banded, 99, 388
coefficient of multiple compound symmetry, 25,
determination, 438 68, 99, 117, 118, 120,
complete case analysis, 211, 253, 260, 262, 302–307,
222–223, 227 326, 389
complete data, see missing data conditional independence,
complete data set, 241 26, 85, 159
compound symmetry, 25, 68, correlation structure, 34,
117, 118 419
Greenhouse-Geiser, 120 exchangeable, 253, 260, 388
Huynh-Feldt, 120 exploration, 33–34
random intercepts, see heterogeneous, 99
random-intercepts independence, 106, 256, 389
model measurement error, 27, 28,
computing time, 47, 97, 118 117, 129, 135, 237, 241,
conditional independence, see 270, 416, 502, 504
covariance structure model building, 125–132
conditional linear mixed model, patterned, 388
194–197 preliminary, 474
empirical Bayes estimation random effects, 28, 135, 416
(EB), 195 residual, 26–29, 128–132,
estimation, 195 271, 416
fixed-effects approach, REPEATED statement,
198–200 98
inference, 195 residuals, 160
justification, 196 serial correlation, 27, 28,
maximum likelihood 128–132, 135–150, 241,
estimation (ML), 195 306, 416, 449, 498
paired t-test, 196 exponential, 28, 100, 139,
restricted maximum 142
likelihood estimation Gaussian, 28, 100, 129,
(REML), 195 139, 142, 270
CONTRAST statement, 101, simple, 99
104, 410 spatial, 100
556 Index

starting values, 131 direct variables, 465–466,


stationarity, 126, 127, 142 469
Toeplitz, 99, 250, 262 increment, 272, 312, 315,
unstructured, 99, 242, 247, 321, 326, 328, 463,
353, 389 465–466, 469
variance function, 25, 33, size, 272, 312, 315, 326
53, 127, 130, 131, 201 dropout rate, 280
coverage probability, 382
cross-sectional component, 189, efficiency, see design
194 considerations
EM algorithm, 173–177, 329,
degrees of freedom 387–390
F-test, see F-test E step, 175, 388
Satterthwaite method, 57, ECM algorithm, 387
112, 486 ECME algorithm, 387
t-test, see t-test GEM algorithm, 389
delta method, 286, 357, 432, 451, heterogeneity model,
454 173–177
design considerations, 391–404 linear mixed model, 47
comparing power M step, 175, 388
distributions, 395, missing data, 210, 222, 232,
397–404 240, 258, 277, 302, 376
designed power, 391, rate of convergence, 47, 389
394–397 versus Newton-Raphson,
dropout mechanism, 397 173
intentionally incomplete, empirical Bayes estimation (EB),
393 see random effects
power, 392–393 empirical variance, see fixed
power distribution, 394–404 effects
random-effects distribution, error contrasts, see REML
87 ESTIMATE statement, 101, 104,
realized power, 391, 394–404 410
sampling methods, 396–404 ‘alpha=’ option, 102
under expected dropout, ‘cl’ option, 102
394–397 estimation problems, 50–54,
design matrix, 215, 241, 247, 439–442
249, 252, 259 model misspecification,
deviance, 242 52–54
discriminant analysis, see model parameterization, 131
heterogeneity model small variance components,
dose-response model, 412 50–52
dropout, see missing data exchangeable, see covariance
dropout model, 235, 236, 269, structure
271, 280, 297, 302, 314, exercise bike data, 498
326, 328
Index 557

exponential serial correlation, see serial correlation
F-test
  degrees of freedom, 57, 112, 393
  fixed effects, see fixed effects
  noncentral, 393, 396
  PROC MIXED versus PROC GLM, 119
  random effects, see random effects
Fisher scoring, 50, 103, 131, 385
fixed effects, 24, 224, 241, 385
  F-test, 56, 112, 115, 392
  general linear hypothesis, 56, 58, 392
    CONTRAST statement, 101
  inference, 55–63, 133
  local influence, see local influence
  LR test, 62, 392
  maximum likelihood, 42
  MODEL statement, 96
  multivariate test, 119
  parameterization in SAS, 114–117
  restricted maximum likelihood, 45
  robust inference, 61–62, 121
    empirical variance, 61, 474
    sandwich estimator, 61, 88
  t-test, 56, 111, 112, 115, 392
  versus random effects, 198–200
  Wald test, 56, 112
fractional polynomials, 20, 137–139, 478–479
full data, see missing data
Gaussian serial correlation, see serial correlation
general linear hypothesis, see fixed effects
generalized estimating equations, 125, 218, 229, 391
Gibbs sampling, 390
global influence, 314, 460–462
  case deletion, 153, 165, 298, 314, 320, 325, 457
  one-step approach, 152, 159
  versus local influence, 153, 466–470
Greenhouse-Geisser, see compound symmetry
growth curves, 248
growth data, 16–17, 240–268, 388
  complete data analysis, 240–256
  incomplete data analysis
    frequentist analysis, 256–257
    likelihood analysis, 257–267
    missingness process, 267–268
  MLwiN, 489–493
  multilevel model, 489–493
Hannan and Quinn information criterion (HQIC), see information criteria
hearing data, 14
  conditional linear mixed model, 197–200
  contaminated data, 191–193
  empirical Bayes estimation (EB), 192
  fixed effects versus random effects, 198–200
  linear mixed model, 190–191
  misspecified cross-sectional component, 191–193
heat shock study, 411–419
heights of schoolgirls, 16
  classification of subjects, 184–187
  cluster analysis, 184–187
  discriminant analysis, 184–187
  heterogeneity model, 184–187
  linear mixed model, 183–184
  two-stage approach, 183
Henderson’s mixed model equations, 79
hepatitis B vaccination, 470–484
  semi-variogram, 473
Hessian matrix, see information matrix
heterogeneity model, 85–87, 90, 91, 169, 171–172
  classification, 169, 177, 182, 186
  cluster analysis, 177, 181, 186
  discriminant analysis, 177, 181
  EM algorithm, 173–177
  empirical Bayes estimation (EB), 176
  goodness-of-fit, 178–179, 181, 185
  identifiability, 174
  Kolmogorov-Smirnov test, 178, 181, 186
  likelihood ratio test, 91, 178
  Newton-Raphson, 173
  number of components, 91, 178–179, 181, 185
  posterior probability, 175, 177
  Shapiro-Wilk test, 179, 181, 186
hierarchical Bayes model, see Bayesian methods
hierarchical model, 24, 41, 52, 65, 67, 69, 77, 117
  versus marginal model, 52, 65, 117
homogeneity model, 85, 90, 172
  goodness-of-fit, 178–179, 181, 185
hot deck imputation, see imputation
Huynh-Feldt, see compound symmetry
ID statement, 97
identifiability, 174, 231, 278, 280
identifiable parameter, see also estimable parameter, 216
identifying restrictions, see pattern-mixture model
ignorable analysis, 239–240, 266
ignorable missing data, see missing data
imputation, 222
  Buck, 225
  conditional mean, 224–226
  hot deck, 224, 226
  last observation carried forward (LOCF), 224, 226
  mean, 224
  multiple, 222, 336–339, 344, 350–352, 359, 371, 373
    estimation, 338
    estimation task, 337
    F-test, 339
    hypothesis testing, 338–339
    imputation task, 337
    modeling task, 337
    proper, 344
    variance, 338
  simple, 211–212, 222–226
  single, 223
  unconditional mean, 225
incomplete data, see missing data
independence model, see covariance structure
influence analysis, 457–470
  Cook’s distance, 151, 457
  global, see global influence
  hat-matrix diagonal, 305
  leverage, 151, 304
  local, see local influence
influence graph, see local influence
information criteria, 74–76, 129, 409
  Akaike (AIC), 74, 106, 107, 406
  Bozdogan (CAIC), 74, 107
  Hannan and Quinn (HQIC), 74, 107
  ML versus REML, 75, 107
  Schwarz (SBC), 74, 106, 107, 406
information matrix, 64, 88
  expected, 103, 131, 132, 376, 378, 382
  Hessian, 50, 131
  naive, 380
  observed, 103, 131, 132, 376, 378, 385
intentionally incomplete designs, 393
intraclass correlation, 25, 68, 86, 260
Kolmogorov-Smirnov test, 178, 181, 186
kurtosis, 318
last observation carried forward, see imputation
latent variable, 328
leverage, see influence analysis
likelihood displacement, see local influence
likelihood function, 104, 106
  complete data, 378
  factorization, 217
  full data, 217
  maximum likelihood, 42
  objective function, 105
  observed data, 258, 318, 387
  restricted maximum likelihood, 46
likelihood ratio test, 239, 375
  asymptotic null distribution, 69–73, 91, 178, 254
  fixed effects, 62, 247, 392
  heterogeneity model, 91, 178
  missing data, 375
  missing data mechanism, 310, 313, 320, 324, 459, 511
  ML versus REML, 63, 66, 69
  pattern specific, 283
  random effects, 69–73, 133, 408
  variance components, 65, 106, 133, 262, 392
  Wilks’ Lambda, 119
likelihood-based frequentist inference, 385
linear mixed model, 23–29
local influence, 153–158, 298–300
  case-weight perturbation, 156
  cutoff value, 161, 162
  fixed effects, 161, 162, 303
  index plot, 160, 161, 163
  influence graph, 154, 299, 308
  interpretable components, 160, 161, 163, 304
  lifted line, 155
  likelihood displacement, 154, 165, 299
  linear mixed model, 158–167
  maximal normal curvature, 156, 165, 300
  normal curvature, 155, 156, 300
  normal section, 155
  perturbation scheme, 153, 154, 299, 305, 326–327
  perturbed log-likelihood, 153, 159, 298
  scatter plot, 162
  selection model, 298–327, 462–466
    compound symmetry, 302–306
    direct variables model, 321–325
    dropout model, 305, 326
    fixed effects, 303
    history, 301
    incremental variables model, 322–325
    interpretable components, 304
    mastitis in dairy cattle, 319–325
    measurement model, 326
    perturbation scheme, 326–327
    rat data, 307–312
    serial correlation, 306
    variance components, 304
  specific parameters, 157
  under REML, 167
  variance components, 161, 162, 304
  versus global influence, 153, 466–470
  weights, 299
logistic regression, 234, 240, 267, 269, 272, 297, 314, 329, 403, 452, 506
longitudinal component, 189, 195
macro, see SAS macro
MAKE statement, 102, 361, 486
  ‘noprint’ option, 102
  ‘out=’ option, 102
marginal model, 24, 31–34, 41, 52, 67, 69, 77, 117, 123
  versus hierarchical model, 52, 65, 117
marginal sufficiency, 46
mastitis in dairy cattle, 18
  local influence, 319–325
  sensitivity analysis, 312–325
maximum likelihood estimation (ML), 42
  comparison with REML, 46–48, 139, 199
  fixed effects, 42
  likelihood function, 42
  variance components, 42
mean structure, 64, 121, 201, 240, 277, 419, 471
  exploration, 31, 124, 204
  likelihood ratio test, 247
  model building, 123–125, 133
  parameterization in SAS, 114–117
  preliminary, 123–125, 133, 136, 139, 452, 474
  residuals, 160
  saturated model, 123, 406
measurement error, see covariance structure, 241
measurement model, 269, 297, 302, 314, 328
measurement process, see missing data
meta-analysis, 420, 429–442
milk protein content trial, 446–470
  global influence, 460–462
  influence analysis, 457–470
  informal sensitivity analysis, 448–456
  local influence, 462–466
  pattern-mixture model, 451–456
  semi-variogram, 449
missing at random, see missing data
missing completely at random, see missing data
missing data, 201–390
  complete data, 214
  dropout, 218, 276, 446–470
  exploration, 201–207
    dropout pattern specific plot, 204, 287
    dropout plot, 202
    individual profiles plot, 205, 288, 307
    mean profiles plot, 204
    scatter plot, 311
    scatter plot matrix, 203
  full data, 215, 240
  identifiable parameter, 216
  ignorability, 213, 217, 239, 302, 382, 506
    Bayesian inference, 218, 376
    frequentist inference, 218, 263, 375, 379, 385
    likelihood inference, 264, 375, 376, 385
  likelihood analysis, 239
  measurement process, 214, 239, 336
  mechanism, 215, 239, 267, 376, 379
    ignorability, 217–218, 375–386
    missing at random (MAR), 212, 217, 222, 225, 233–234, 239, 262, 269, 277, 281, 295, 298, 301, 307, 314, 320, 332–336, 340, 345, 373, 376, 397, 458, 494, 498, 506
    missing completely at random (MCAR), 212, 217, 222, 225–229, 240, 295, 307, 332, 336, 397, 494, 506
    missing not at random (MNAR), 213, 217, 234–238, 240, 270, 295, 298, 307, 320, 332, 397, 448, 458, 494, 497–513
  missing data indicators, 214
  missing data process, 214, 336, 497–513
  nonignorability, 217
  observed data, 214
  outcome-based model, 277, 317, 328
  pattern, 210, 215, 377
    attrition, 215
    dropout, 210, 218–219, 224, 380
    intermittent, 225
    monotone, 215, 224
    nonmonotone, 215, 224
  random-coefficient-based model, 277, 328–329
  separability condition, 218, 280, 282, 377
  shared parameter model, 329
missing data indicators, see missing data
missing data mechanism, see missing data
missing data patterns, see missing data
missing data process, see missing data
missing not at random, see missing data
MIVQUE0, 96
mixture distribution
  LR test, 69–72, 408, 474
  number of components, 91, 178–179, 181, 185
  pattern-mixture model, 276, 292, 343, 347, 349
  random effects, 85, 90, 170–172
MLwiN, 445, 489–493
  comparison with SPlus, 497
  covariance structure, 489, 493
  empirical Bayes estimation (EB), 492
  fixed effects, 490
  Gibbs sampling, 491
  graphs, 492
  iterative generalized least squares, 491
  maximum likelihood, 491
  Metropolis-Hastings, 491
  multilevel model, 489–493
  parametric bootstrap, 491
  random effects, 490
  restricted iterative generalized least squares, 491
  serial correlation, 493
model building, 121–133
  covariance structure, 125–132
  mean structure, 123–125, 133
  model reduction, 132–133
  random effects, 125–128, 133
  serial correlation, 128–132
  two-stage analysis, see two-stage analysis
model misspecification
  covariance structure, 61
  cross-sectional component, 190, 191, 194
  estimation problems, 52–54
  random-effects distribution, 85–89, 187
model reduction, 132–133, 287
  pattern-mixture model, 371–373
MODEL statement, 96, 487
  ‘alpha=’ option, 103
  ‘chisq’ option, 97
  ‘cl’ option, 103
  ‘corrb’ option, 487
  ‘covb’ option, 96, 357, 367, 368
  ‘covbi’ option, 487
  ‘ddfm=’ option, 97, 487
  ‘noint’ option, 96
  ‘pred’ option, 487
  ‘predicted’ option, 97, 103
  ‘predmeans’ option, 97, 103, 487
  ‘solution’ option, 96
  ‘xpvix’ option, 487
  parameterization of mean, 114–117
multilevel model, see MLwiN
multinomial distribution, 281, 396
multiple imputation, see imputation
multivariate regression, 119
multivariate tests, 119
Newton-Raphson, 47, 50, 103, 132, 173, 379, 439, 441
  versus EM algorithm, 173
nonignorable missing data, see missing data
normal curvature, see local influence
objective function, see likelihood function
observed data, see missing data
ordinary least squares, 125, 218, 221, 229
  residual profiles, 32, 34, 125
  residuals, 53, 125, 136, 139, 240
OSWALD, 235, 240, 272, 297, 307, 497–513
  BALANCED object, 506
  PCMID function, 493, 503
    ‘correxp’ argument, 506
    ‘drop.cov.parms’ argument, 507
    ‘drop.parms’ argument, 506
    ‘dropmodel’ argument, 507
    ‘maxfn’ argument, 508
    ‘reqmin’ argument, 508
    ‘vparms’ argument, 504
outcome-based model, see missing data
outliers, 77, 79, 316
output delivery system (ODS), 102, 486
ovarian cancer, 425–427
  coefficient of multiple determination, 434–435
  prediction, 434
  two-stage analysis, 434
paired t-test, see conditional linear mixed model
paradox, 278–279, 331
parameter space, 41, 47, 52
  boundary, 47, 51, 52, 64, 66, 69, 91, 106, 133, 178, 254
  restricted, 52
  unrestricted, 52, 66
parameterization in SAS, 114–117
PARMS statement, 103, 131, 200
  ‘eqcons’ option, 103
  ‘nobound’ option, 104, 417
pattern-mixture model, 216, 275–293, 331–374, 451–456
  extrapolation, 281, 283–285, 331, 341, 342, 353, 358, 360, 362, 371
  global hypothesis, 289, 455
  hypothesis testing, 366–371
  identifying restrictions, 281–282, 331, 340–341, 343–361, 373
    ACMV, 277, 332–336, 340, 346–350, 353, 373
    CCMV, 277, 334, 340, 344, 348–353, 369, 373
    NCMV, 341, 345–346, 348–353, 373
  marginal effect, 367–369, 451
  marginal expectation, 284
  marginal hypothesis, 285–287, 289
  strategy 1, 340, 352–361, 369
  strategy 2, 341, 352–361, 368
  strategy 3, 342, 361–366, 368–369, 452
perturbed log-likelihood, see local influence
posterior distribution, see Bayesian methods
posterior mean, see Bayesian methods
posterior probability, see Bayesian methods
power calculations, see design considerations
prediction
  best linear unbiased, 80
  future observation, 122
  intervals, 444–445
  population-averaged, 471, 481–482
  subject-specific, 432–433, 471
  subject-specific profiles, 77, 80
preliminary mean structure, see mean structure
preliminary random-effects structure, see random effects
principal components, 462
prior distribution, see Bayesian methods
PRIOR statement, 487
  ‘alg=’ option, 487
  ‘bdata’ option, 487
  ‘data=’ option, 487
  ‘grid=’ option, 487
  ‘gridt=’ option, 487
  ‘lognote=’ option, 487
  ‘logrbound=’ option, 487
  ‘out=’ option, 488
  ‘outg=’ option, 488
  ‘outgt=’ option, 488
  ‘psearch=’ option, 488
  ‘ptrans’ option, 488
  ‘seed=’ option, 488
  ‘tdata=’ option, 488
  ‘trans=’ option, 488
probit regression, 329
PROC GLM versus PROC MIXED, 119
PROC MIXED
  output, 104–114
    fixed effects, 111
    information criteria, 106, 107
    iteration history, 104
    model fit, 105
    random effects, 113
    variance components, 107
  program, 94–104
PROC MIXED statement, 95, 486
  ‘CL’ option, 486
  ‘CL=’ option, 486
  ‘asycorr’ option, 96
  ‘asycov’ option, 96, 357
  ‘covtest’ option, 96
  ‘empirical’ option, 103, 246
  ‘ic’ option, 96
  ‘info’ option, 483
  ‘method’ option, 486
  ‘method=’ option, 96
  ‘nobound’ option, 104, 417
  ‘scoring’ option, 103, 131, 385
  ‘scoring=’ option, 103
PROC MIXED versus PROC GLM, 119
profile, 246, 249, 262, 264, 270
profile likelihood, 157, 300
prostate data, 11–13
  classification of subjects, 180–183
  cluster analysis, 180–183
  discriminant analysis, 180–183
  estimation problems, 50, 131
  heterogeneity model, 180–183
  in SAS, 94–117
  inference fixed effects, 57–61, 63, 133
  inference random effects, 82
  linear mixed model, 26, 48, 58, 129
  local influence analysis, 162–167
  marginal testing random effects, 72–73, 133
  mean exploration, 124
  model reduction, 133
  OLS residual profiles, 126
  preliminary mean structure, 124
  preliminary random-effects structure, 127
  robust inference, 62
  semi-variogram, 147–148
  serial correlation, 129, 136, 138–140, 147–148
  two-stage analysis, 21, 39
  variance function, 127, 131
random effects, 24, 28, 241, 252, 270, 388
  classification, see heterogeneity model
  empirical Bayes estimation (EB), 78–79, 113, 170, 176, 195
  F-test, 79
  Henderson’s mixed model equations, 79
  heterogeneity model, see heterogeneity model
  histogram, 79, 82
  homogeneity model, see homogeneity model
  marginal testing, 69–73, 133, 408, 474
  mixture distribution, 85, 90, 169–172
  model building, 125–128, 133
  normal quantile plot, 79, 89
  normality assumption, 79, 83–92, 169, 170
  preliminary, 125–128, 139, 474
  random intercept, 81, 117, 250, 252, 253, 262, 498, 504
  random slope, 252, 262
  RANDOM statement, 97
  scatter plot, 79, 82
  semi-variogram, 144–148
  shrinkage, 80, 82, 84, 85
  t-test, 79
  versus fixed effects, 198–200
  versus serial correlation, 149
RANDOM statement, 97, 117–119, 259, 260, 267, 446, 488, 500, 501
  ‘g’ option, 98, 253
  ‘gcorr’ option, 98
  ‘group=’ option, 103
  ‘nofullz’ option, 488
  ‘solution’ option, 98
  ‘subject=’ option, 97
  ‘type=’ option, 98, 104, 118
  ‘v’ option, 98, 101
  ‘v=’ option, 98
  ‘vcorr’ option, 98, 101
  ‘vcorr=’ option, 98
  versus REPEATED statement, 117–119
random-coefficient-based model, see missing data
random-intercepts model, 25, 68, 117, 118, 120
  compound symmetry, see compound symmetry
  empirical Bayes estimation (EB), 81
  semi-variogram, 142–144
  shrinkage, 81
rat data, 7–9
  efficiency, 394
  inference fixed effects, 67
  inference variance components, 66–68
  information criteria, 75
  linear mixed model, 25
  local influence, 307–312
  marginal versus hierarchical, 52, 67
  model misspecification, 52
  power, 393–394
  power distribution, 397–404
  sensitivity analysis, 307–312
  two-stage analysis, 21, 38
  variance function, 53
REPEATED statement, 98, 117–119, 251, 252, 259, 261, 267, 446, 488, 500, 501
  ‘group=’ option, 103, 245, 359
  ‘local’ option, 104, 130
  ‘local=’ option, 483
  ‘r’ option, 101, 243
  ‘r=’ option, 101, 245
  ‘rcorr’ option, 101, 243
  ‘rcorr=’ option, 101, 245
  ‘subject=’ option, 100, 446
  ‘type=’ option, 100, 104, 118, 119, 129, 139, 256, 483, 488
  ‘type=AR(1)’ option, 252
  versus RANDOM statement, 117–119
residual covariance structure, see covariance structure
residuals, 151, 240
  covariance structure, 160, 163
  marginal, 151
  mean structure, 160, 163
  ordinary least squares, 32, 34, 53, 125, 136, 139
  random effects, 77, 152
  subject-specific, 145, 151
restricted maximum likelihood estimation (REML), 43–47, 195
  comparison with ML, 46–48, 139, 199
  error contrasts, 43–46, 63, 75
  fixed effects, 45
  justification, 46, 195
  likelihood function, 46
  linear mixed model, 44
  linear regression, 43, 48
  normal population, 43, 48
  variance components, 45
ridge regression, 146
robust inference, see fixed effects
sample-size calculations, see design considerations
sampling framework
  naive, 377, 380, 382
  unconditional, 377, 382
sandwich estimator, see fixed effects
SAS data set, 95
SAS macro, 38, 162, 195, 240, 332, 352, 353, 359, 361, 374
Satterthwaite method, see degrees of freedom
saturated mean structure, see mean structure
Schwarz information criterion (SBC), see information criteria
selection model, 216, 231–273, 278–279, 295–330, 333, 448, 454
  Heckman’s model, 296
semi-variogram, 141–148, 270, 271, 419
  random effects, 144–148
  random intercepts, 142–144, 449, 452, 473
sensitivity, 236–238, 270, 297
sensitivity analysis, 213, 270, 277–279, 292, 448–470
  pattern-mixture model, 331–374
  selection model, 295–330
separability condition, see missing data
serial correlation, 26–28, 128–132, 135–150
  check for, 136–137
  exponential, 28, 100, 139, 142, 474
  flexible models, 137–140
  fractional polynomials, 137–139
  Gaussian, 28, 100, 129, 139, 142, 416, 474
  versus random effects, 149
Shapiro-Wilk test, 136, 179, 181, 186
shrinkage, see Bayesian methods
simplex algorithm, 232, 235, 240
SPlus, 493–513
  comparison with MLwiN, 497
  LME function, 493–497
    ‘cluster’ argument, 494
    ‘covariate.transformation’ argument, 494
    ‘est.method’ argument, 495
    ‘fixed’ argument, 494
    ‘random’ argument, 494
    ‘re.block’ argument, 494
    ‘re.paramtr’ argument, 494
    ‘serial’ argument, 494
    ‘var.covariate’ argument, 494
    ‘var.estimate’ argument, 494
    ‘var.function’ argument, 494
  LME.FORMULA function, see SPlus, LME function
  NMLE function, 493
  OSWALD, see OSWALD
starting values, 131
stationarity, see covariance structure
stratification
  posthoc, 366
subject-specific profiles
  alignment, 448–451
  coefficient of multiple determination, 35–38, 40
  exploration, 35–40, 205, 288, 307
  F-test, 37–40
  goodness-of-fit, 35–37
summary statistics, 23
surrogate endpoints, 420–446
sweep operator, 389
t-distribution, 314, 318
  degrees of freedom, 318
t-test
  degrees of freedom, 57, 112
  fixed effects, see fixed effects
  random effects, see random effects
time series, 28
time-independent covariate, 125, 194
time-varying covariate, 95, 120, 125, 190
tobit model, 231–232
toenail data, 9–10, 227–229, 233–238
  MAR analysis, 233–234
  MCAR analysis, 227–229
  MNAR analysis, 234–238
  pattern-mixture model, 281–287
Toeplitz, see covariance structure
two-stage analysis, 20–23, 123, 133, 231, 429–430
  stage 1, 20, 35–40, 429
  stage 2, 20, 430
uncertainty
  modeling, 336
  sampling, 336
unstructured covariance, see covariance structure
untestable assumptions, 236, 270, 281, 297, 329, 334, 342, 498
variance components, 41, 407, 415
  estimation problems, 50–52
  inference, 64–73, 133
  local influence, see local influence
  LR test, 65, 69–73, 106, 392, 408, 474
  maximum likelihood, 42
  negative, 54, 68
  restricted maximum likelihood, 45
  Wald test, 64, 107
variance function, see covariance structure
variogram, see semi-variogram
Vorozole study, 15, 201–207, 270–273
  correlation structure, 34
  mean structure, 32
  pattern-mixture model, 287–291
    sensitivity analysis, 352–373
  selection model, 270–273
  semi-variogram, 144
  variance function, 33
Wald test, 382
  fixed effects, see fixed effects
  pattern-mixture model, 286
  scaled, 57
  variance components, see variance components
WHERE statement, 119
Wilks’ Lambda test, 119
within-imputation variance, 337
Springer Series in Statistics
(continued from p. ii)

Ramsay/Silverman: Functional Data Analysis.
Rao/Toutenburg: Linear Models: Least Squares and Alternatives.
Read/Cressie: Goodness-of-Fit Statistics for Discrete Multivariate Data.
Reinsel: Elements of Multivariate Time Series Analysis, 2nd edition.
Reiss: A Course on Point Processes.
Reiss: Approximate Distributions of Order Statistics: With Applications
to Non-parametric Statistics.
Rieder: Robust Asymptotic Statistics.
Rosenbaum: Observational Studies.
Rosenblatt: Gaussian and Non-Gaussian Linear Time Series and Random Fields.
Särndal/Swensson/Wretman: Model Assisted Survey Sampling.
Schervish: Theory of Statistics.
Shao/Tu: The Jackknife and Bootstrap.
Siegmund: Sequential Analysis: Tests and Confidence Intervals.
Simonoff: Smoothing Methods in Statistics.
Singpurwalla and Wilson: Statistical Methods in Software Engineering:
Reliability and Risk.
Small: The Statistical Theory of Shape.
Sprott: Statistical Inference in Science.
Stein: Interpolation of Spatial Data: Some Theory for Kriging.
Taniguchi/Kakizawa: Asymptotic Theory of Statistical Inference for Time Series.
Tanner: Tools for Statistical Inference: Methods for the Exploration of Posterior
Distributions and Likelihood Functions, 3rd edition.
Tong: The Multivariate Normal Distribution.
van der Vaart/Wellner: Weak Convergence and Empirical Processes: With
Applications to Statistics.
Verbeke/Molenberghs: Linear Mixed Models for Longitudinal Data.
Weerahandi: Exact Statistical Methods for Data Analysis.
West/Harrison: Bayesian Forecasting and Dynamic Models, 2nd edition.
