
EXPLORING THE LIMITS OF BOOTSTRAP
Wiley Series in Probability and Mathematical Statistics:


Applied Probability and Statistics Section—
Vic Barnett, Ralph A. Bradley, Nicholas I. Fisher, J. Stuart Hunter,
J.B. Kadane, David G. Kendall, Adrian F.M. Smith, Stephen M.
Stigler, Jozef Teugels and Geoffrey S. Watson, Advisory Editors
Introduced by Bradley Efron in 1979, the bootstrap is a non-parametric method for inferring the distribution of a statistic derived from a sample. In essence, the bootstrap method depends upon the notion that through the sampling of data one can approximate the sampling variations by which the data were produced.

In recent years the once radical bootstrap concept has spawned a revolution in applications. Exploring the Limits of Bootstrap is the result of a special topics meeting of the Institute of Mathematical Statistics. It represents the work that has been done since Efron's breakthrough by top researchers as well as by relative newcomers.

Some of the astonishing developments represented in this book are simple to summarize: fundamental research suggests that in many cases, computer + bootstrap = basic statistics, without the need for specialized formulas or tables. Also, it seems that computer + bootstrap + basic statistics = second order efficient statistics. Exploring the Limits of Bootstrap offers a number of papers that support these remarkable conclusions, including discussions of:

• Consistency of bootstrap in general circumstances including empirical processes, M-estimation, and U-statistics
• Bias reduction
• Efficient simulation
• Applicability to dependent samples including Markov processes and time series
• Bootstrap bandwidth selection in density estimation

Also, numerous application papers deal with such topics as exploratory regression, model selection, and computing bootstrap distributions without using random numbers. The papers collected in this volume are indispensable reading for anyone concerned with the evolution and applications of the bootstrap method.
WILEY SERIES IN PROBABILITY
AND MATHEMATICAL STATISTICS

ESTABLISHED BY WALTER A. SHEWHART AND SAMUEL S. WILKS


Editors
Vic Barnett, Ralph A. Bradley, Nicholas I. Fisher, J. Stuart Hunter,
Joseph B. Kadane, David G. Kendall, Adrian F. M. Smith,
Stephen M. Stigler, Jozef L. Teugels, Geoffrey S. Watson

Probability and Mathematical Statistics


ANDERSON -° An Introduction to Multivariate Statistical Analysis,
Second Edition
BARNETT + Comparative Statistical Inference, Second Edition
BERNARDO and SMITH ° Bayesian Statistical Concepts and Theory
BHATTACHARYYA and JOHNSON : Statistical Concepts and Methods
BILLINGSLEY ° Probability and Measure, Second Edition
BOROVKOV ° Asymptotic Methods in Queuing Theory
BRANDT, FRANKEN, and LISEK * Stationary Stochastic Models
CAINES ° Linear Stochastic Systems
CHEN ° Recursive Estimation and Control for Stochastic Systems
CONSTANTINE * Combinatorial Theory and Statistical Design
COVER and THOMAS »° Elements of Information Theory
*DOOB -* Stochastic Processes
DUDEWICZ and MISHRA + Modern Mathematical Statistics
ETHIER and KURTZ °* Markov Processes: Characterization and Convergence
FELLER · An Introduction to Probability Theory and Its Applications, Volume I,
Third Edition, Revised; Volume II, Second Edition
FULLER ~* Introduction to Statistical Time Series
FULLER * Measurement Error Models
GIFI + Nonlinear Multivariate Analysis
GUTTORP - Statistical Inference for Branching Processes
HALD °: A History of Probability and Statistics and Their Applications before 1750
HALL »* Introduction to the Theory of Coverage Processes
HANNAN and DEISTLER ° The Statistical Theory of Linear Systems
HEDAYAT and SINHA »° Design and Inference in Finite Population Sampling
HOEL »° Introduction to Mathematical Statistics, Fifth Edition
HUBER -* Robust Statistics
IMAN and CONOVER °* A Modern Approach to Statistics
KAUFMAN and ROUSSEEUW ~»° Finding Groups in Data: An Introduction to Cluster
Analysis
LARSON ~° Introduction to Probability Theory and Statistical Inference,
Third Edition
LE PAGE and BILLARD »* Exploring the Limits of Bootstrap
MORGENTHALER and TUKEY °* Configural Polysampling: A Route to Practical
Robustness
MUIRHEAD »* Aspects of Multivariate Statistical Theory
OLIVER and SMITH »* Influence Diagrams, Belief Nets and Decision Analysis
PILZ + Bayesian Estimation and Experimental Design in Linear Regression Models
PRESS ° Bayesian Statistics: Principles, Models, and Applications
PURI and SEN + Nonparametric Methods in General Linear Models
PURI, VILAPLANA, and WERTZ · New Perspectives in Theoretical and Applied
Statistics
RAO - Asymptotic Theory of Statistical Inference
RAO »* Linear Statistical Inference and Its Applications, Second Edition
ROBERTSON, WRIGHT, and DYKSTRA = Order Restricted Statistical Inference
ROGERS and WILLIAMS »* Diffusions, Markov Processes, and Martingales, Volume
II: Ito Calculus
ROSS ° Stochastic Processes
RUBINSTEIN * Simulation and the Monte Carlo Method
RUZSA and SZEKELY + Algebraic Probability Theory
SCHEFFE * The Analysis of Variance
SEBER ° Linear Regression Analysis
SEBER : Multivariate Observations
SEBER and WILD ° Nonlinear Regression
SERFLING * Approximation Theorems of Mathematical Statistics

*Now available in a lower priced paperback edition in the Wiley Classics Library.
Probability and Mathematical Statistics (Continued)
SHORACK and WELLNER * Empirical Processes with Applications to Statistics
STAUDTE and SHEATHER »* Robust Estimation and Testing
STOYANOV »* Counterexamples in Probability
STYAN - The Collected Papers of T. W. Anderson: 1943-1985
WHITTAKER * Graphical Models in Applied Multivariate Statistics
YANG » The Construction Theory of Denumerable Markov Processes

Applied Probability and Statistics


ABRAHAM and LEDOLTER : Statistical Methods for Forecasting
AGRESTI * Analysis of Ordinal Categorical Data
AGRESTI * Categorical Data Analysis
ANDERSON and LOYNES * The Teaching of Practical Statistics
ANDERSON, AUQUIER, HAUCK, OAKES, VANDAELE, and
WEISBERG » Statistical Methods for Comparative Studies
ASMUSSEN »* Applied Probability and Queues
BAILEY · The Elements of Stochastic Processes with Applications to the Natural
Sciences
BARNETT ° Interpreting Multivariate Data
BARNETT and LEWIS * Outliers in Statistical Data, Second Edition
BARTHOLOMEW, FORBES, and McLEAN : Statistical Techniques for Manpower
Planning, Second Edition
BATES and WATTS + Nonlinear Regression Analysis and Its Applications
BELSLEY ° Conditioning Diagnostics: Collinearity and Weak Data in Regression
BELSLEY, KUH, and WELSCH · Regression Diagnostics: Identifying Influential
Data and Sources of Collinearity
BHAT * Elements of Applied Stochastic Processes, Second Edition
BHATTACHARYA and WAYMIRE -: Stochastic Processes with Applications
BIEMER, GROVES, LYBERG, MATHIOWETZ, and SUDMAN =: Measurement
Errors in Surveys
BLOOMFIELD »* Fourier Analysis of Time Series: An Introduction
BOLLEN ~» Structural Equations with Latent Variables
BOX ° R.A. Fisher, the Life of a Scientist
BOX and DRAPER °* Empirical Model-Building and Response Surfaces
BOX and DRAPER »* Evolutionary Operation: A Statistical Method for Process
Improvement
BOX, HUNTER, and HUNTER - Statistics for Experimenters: An
Introduction to Design, Data Analysis, and Model Building
BROWN and HOLLANDER - Statistics: A Biomedical Introduction
BUCKLEW ~° Large Deviation Techniques in Decision, Simulation, and Estimation
BUNKE and BUNKE ° Nonlinear Regression, Functional Relations and Robust
Methods: Statistical Methods of Model Building
CHATTERJEE and HADI * Sensitivity Analysis in Linear Regression
CHATTERJEE and PRICE ° Regression Analysis by Example, Second Edition
CLARKE and DISNEY °: Probability and Random Processes: A First Course with
Applications, Second Edition
COCHRAN ° Sampling Techniques, Third Edition
COCHRAN and COX * Experimental Designs, Second Edition
CONOVER »° Practical Nonparametric Statistics, Second Edition
CONOVER and IMAN »* Introduction to Modern Business Statistics
CORNELL -° Experiments with Mixtures, Designs, Models, and the Analysis of Mixture
Data, Second Edition
COX + A Handbook of Introductory Statistical Methods
COX ° Planning of Experiments
CRESSIE ° Statistics for Spatial Data
DANIEL ° Applications of Statistics to Industrial Experimentation
DANIEL ° Biostatistics: A Foundation for Analysis in the Health Sciences, Fifth
Edition
DAVID ° Order Statistics, Second Edition
DEGROOT, FIENBERG, and KADANE : Statistics and the Law
*DEMING ° Sample Design in Business Research
DILLON and GOLDSTEIN * Multivariate Analysis: Methods and Applications
DODGE and ROMIG -° Sampling Inspection Tables, Second Edition
DOWDY and WEARDEN : Statistics for Research, Second Edition
DRAPER and SMITH ° Applied Regression Analysis, Second Edition
DUNN - Basic Statistics: A Primer for the Biomedical Sciences, Second Edition

*Now available in a lower priced paperback edition in the Wiley Classics Library.
Exploring the Limits
of
Bootstrap

Edited by
RAOUL LEPAGE
Michigan State University, East Lansing, Michigan

LYNNE BILLARD
University of Georgia, Athens, Georgia

A Wiley-Interscience Publication
JOHN WILEY & SONS, INC.
New York · Chichester · Brisbane · Toronto · Singapore
In recognition of the importance of preserving what has been
written, it is a policy of John Wiley & Sons, Inc., to have books
of enduring value published in the United States printed on
acid-free paper, and we exert our best efforts to that end.

Copyright © 1992 by John Wiley & Sons, Inc.


All rights reserved. Published simultaneously in Canada.

Reproduction or translation of any part of this work


beyond that permitted by Section 107 or 108 of the
1976 United States Copyright Act without the permission
of the copyright owner is unlawful. Requests for
permission or further information should be addressed to
the Permissions Department, John Wiley & Sons, Inc.

Library of Congress Cataloging in Publication Data:


Exploring the limits of bootstrap / edited by Raoul LePage, Lynne
Billard.
p. cm.— (Wiley series in probability and mathematical
statistics. Applied probability and statistics)
"A Wiley-interscience publication".
Includes bibliographical references and index.
ISBN 0-471-53631-8
1. Estimation theory. 2. Distribution (Probability theory)
3. Sampling (Statistics). 4. Nonparametric statistics. I. LePage, Raoul, 1938– . II. Billard, L. (Lynne), 1943– . III. Series.
QA276.8.E97 1992
519.5'44--dc20 91-31546
CIP

Printed and bound in the United States of America by Braun-Brumfield, Inc.

10 9 8 7 6 5 4 3 2
CONTRIBUTORS

MIGUEL A. ARCONES, Department of Mathematics, University of Connecticut, Storrs, Connecticut
K. B. ATHREYA, Departments of Mathematics and Statistics, Iowa State University, Ames, Iowa
WILLIAM A. BAILEY, Kemper National Insurance Companies, Long Grove, Illinois
EDWARD J. BEDRICK, Department of Mathematics and Statistics, University of New Mexico, Albuquerque, New Mexico
P. J. BICKEL, Department of Statistics, University of California, Berkeley, California
L. BILLARD, Department of Statistics, University of Georgia, Athens, Georgia
DAVID BROWNSTONE, Department of Economics, University of California, Irvine, California
D. S. BURDICK, Institute of Statistics and Decision Sciences, Duke University, Durham, North Carolina
SOMNATH DATTA, Department of Statistics, University of Georgia, Athens, Georgia
BRADLEY EFRON, Department of Statistics, Stanford University, Stanford, California
C. D. FUH, Institute of Statistical Science, Academia Sinica, Taipei, Taiwan
EVARIST GINÉ, Department of Mathematics, University of Connecticut, Storrs, Connecticut
PETER HALL, Department of Statistics, Faculty of Economics and Commerce, Australian National University, Canberra, Australia
R. HELMERS, Centre for Mathematics and Computer Science, Amsterdam, The Netherlands
JOE R. HILL, EDS Research, Albuquerque, New Mexico
JOHN J. HSIEH, Department of Preventive Medicine and Biostatistics, University of Toronto, Toronto, Canada
P. JANSSEN, Limburgs Universitair Centrum, Universitaire Campus, Diepenbeek, Belgium
JOHN G. KINATEDER, Department of Statistics, Michigan State University, East Lansing, Michigan
VICTOR KIPNIS, Department of Economics, University of Southern California, Los Angeles, California
S. N. LAHIRI, Department of Statistics, Iowa State University, Ames, Iowa
RAOUL LEPAGE, Department of Statistics, Michigan State University, East Lansing, Michigan
REGINA Y. LIU, Department of Statistics, Rutgers University, New Brunswick, New Jersey
J. S. MARRON, Department of Statistics, University of North Carolina, Chapel Hill, North Carolina
WILLIAM P. McCORMICK, Department of Statistics, University of Georgia, Athens, Georgia
B. C. MITCHELL, Institute of Statistics and Decision Sciences, Duke University, Durham, North Carolina
DIMITRIS N. POLITIS, Department of Statistics, Purdue University, West Lafayette, Indiana
JOSEPH P. ROMANO, Department of Statistics, Stanford University, Stanford, California
NICHOLAS SCHORK, Departments of Statistics and Medicine, University of Michigan, Ann Arbor, Michigan
KESAR SINGH, Department of Statistics, Rutgers University, New Brunswick, New Jersey
R. R. SITTER, Department of Mathematics and Statistics, Carleton University, Ottawa, Canada
MALCOLM S. TAYLOR, US Army Ballistic Research Laboratory, Aberdeen Proving Ground, Maryland
JAMES R. THOMPSON, Department of Statistics, Rice University, Houston, Texas
ROBERT TIBSHIRANI, Department of Preventive Medicine and Biostatistics and Department of Statistics, University of Toronto, Toronto, Canada
XIN M. TU, Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts
N. VERAVERBEKE, Limburgs Universitair Centrum, Universitaire Campus, Diepenbeek, Belgium
DONGSHENG TU, Center for Multivariate Analysis, Pennsylvania State University, University Park, Pennsylvania
CONTENTS

PART 1. INTRODUCTION
Introduction to Bootstrap
Bradley Efron and Raoul LePage
Introduction, 3
Jackknife, 4
Bootstrap, 5
Consistency of Bootstrap, 7
Pivoting and Edgeworth Expansions, 8
New Directions, 9
References, 10

PART 2. GENERAL PRINCIPLES OF THE BOOTSTRAP


On the Bootstrap of M-Estimators and Other Statistical Functionals 13
Miguel A. Arcones and Evarist Giné
Abstract, 13
1. Introduction, 14
2. Empirical Processes, 15
3. The a.s. Bootstrap of M-Estimators, 23
4. The Bootstrap of Differentiable Functionals, 36
References, 45

Bootstrapping Markov Chains 49
K. B. Athreya and C. D. Fuh
Abstract, 49
1. Introduction, 49
2. Bootstrapping a Finite State Space Markov Chain, 50
3. Bootstrapping Markov Chains: Countable Case, 52
4. Accuracy of the Bootstrap in the Finite State Space Case, 62
5. Some Open Problems, 63
References, 63
Theoretical Comparison of Different Bootstrap t Confidence Bounds 65

P. J. Bickel

Summary, 65
1. Introduction, 66
2. Second Order Correctness and Equivalence, 67
3. Second Order Optimality and Robustness, 74
References, 75

Bootstrap for a Finite State Markov Chain Based on I.I.D. Resampling 77
Somnath Datta and William P. McCormick
Abstract, 77
1. Introduction, 77
2. Asymptotics for the Conditional Bootstrap, 79
3. Some New Bootstrap Methods, 81
4. Proofs, 87
References, 97

Six Questions Raised by the Bootstrap 99
B. Efron
Abstract, 99
Introduction, 99
1. Why Do Maximum Likelihood Estimated Distributions Tend to Be Short-Tailed?, 99
2. Why Does the Delta Method Tend to Underestimate Standard Errors?, 103
3. Why Are Cross-Validation Estimators So Variable?, 108
4. What Is a Correct Confidence Interval?, 112
5. What Is a Good Nonparametric Pivotal Quantity?, 116
6. What Are Computationally Efficient Ways To Bootstrap?, 120
References, 124
Efficient Bootstrap Simulation 127
Peter Hall
Abstract, 127
1. Introduction, 127
2. Uniform Resampling, 128
3. Linear Approximation, 129
4. Centering Method, 131
5. Balanced Resampling, 133
6. Antithetic Resampling, 135
7. Importance Resampling, 137
8. Quantile Estimation, 141
References, 142

Bootstrapping U-Quantiles 145
R. Helmers, P. Janssen and N. Veraverbeke
Abstract, 145
1. Introduction, 145
2. Consistency of the Bootstrap for U-Quantiles, 146
3. Accuracy of the Bootstrap for U-Quantiles, 150
4. Applications, 153
References, 154

An Invariance Principle Applicable to the Bootstrap 157
John G. Kinateder
1. Introduction, 157
2. The Stochastic Integral Representation, 160
3. The Invariance Principle, 161
4. The Limit Laws, 172
5. Simulation Results, 173
6. Remarks, 178
Appendix, 179
References, 180
Edgeworth Correction by 'Moving Block' Bootstrap for Stationary
and Nonstationary Data 183
S. N. Lahiri
Abstract, 183
1. Introduction, 183
2. Results on X̄_n, 187
3. Smooth Functions of Mean, 192
4. Nonstationary Data, 195
5. Proofs, 197
References, 212

Bootstrapping Signs 215
Raoul LePage
Abstract, 215
1. Introduction, 215
2. Bootstrapping Signs, 217
3. Examining the Conditional Distributions, 218
4. Performance on Randomly Signed Powers of Uniforms, 221
5. Symmetric Errors Attracted to a Stable Law, 222
6. Comments, 223
References, 224

Moving Blocks Jackknife and Bootstrap Capture Weak Dependence 225
Regina Y. Liu and Kesar Singh
Abstract, 225
1. Introduction, 226
2. Moving Blocks Jackknife, 231
3. Moving Blocks Bootstrap, 238
4. Concluding Remarks, 245
Appendix, 247
References, 248
Bootstrap Bandwidth Selection 249
J. S. Marron
Abstract, 249
1. Introduction, 249
2. Bootstrap MISE Estimation, 251
3. Asymptotics, 254
4. Connection to Other Methods, 256
5. Simulations and an Application, 258
References, 261

A Circular Block-Resampling Procedure for Stationary Data 263
Dimitris N. Politis and Joseph P. Romano
Abstract, 263
1. Introduction, 264
2. A Circular Block-Resampling Bootstrap, 266
References, 270
Some Applications of the Bootstrap in Complex Problems 271
Robert Tibshirani
1. Introduction, 271
2. Prediction Limits for Exercise Output, 271
3. Clustering of Cortical Cells, 274
4. Acknowledgements, 277
References, 277

Approximating the Distribution of a General Standardized Functional
Statistic with That of Jackknife Pseudo Values 279
D. S. Tu
Abstract, 279
1. Introduction, 279
2. Some Preliminary Notations and Basic Ideas, 282
3. The Second Order Accuracy of the Random Weighting Approximation, 287
4. Two Examples, 297
Acknowledgements, 303
References, 343
PART 3. APPLICATIONS OF THE BOOTSTRAP
Bootstrapping for Order Statistics Sans Random Numbers
(Operational Bootstrapping) 309
William A. Bailey
Abstract, 309
1. Meshing and Von Mises Theory, 311
2. Bivariate Generalized Numerical Convolutions, 313
References and Acknowledgments, 318

A Generalized Bootstrap 319

Edward J. Bedrick and Joe R. Hill


Abstract, 319

1. Introduction, 319

2. Well-Known Examples, 321


3. A Conditional EB Bootstrap, 324
References, 325

Bootstrapping Admissible Linear Model Selection Procedures 327
David Brownstone
Abstract, 327
1. Introduction, 327
2. Bootstrapping and Jackknifing Multiple Regression Model Estimators, 328
3. Stein-Rule Estimators and Model Selection, 329
4. Estimator Performance, 333
5. Bootstrap and Jackknife Variance Estimation, 335
6. Bootstrap Confidence Bands, 338
7. Further Refinements, 340
References, 343
A Hazard Process for Survival Analysis 345
John J. Hsieh
Abstract, 345
1. Introduction, 345
2. Deterministic Hazards, 346
3. The Hazard Process, 347
4. Estimation of Deterministic Functions
5. Asymptotic Distributions of Estimates and Hypothesis Testing, 351
6. Censoring, Competing Risks and Time-Varying Covariates, 354
References, 360

Bootstrap Assessment of Prediction in Exploratory Regression Analysis 363


Victor Kipnis
Abstract, 363
1. Introduction, 363
2. Problem Formulation, 364
3. Bootstrap Estimators, 367
4. Experimental Comparison of the Conventional and
the Bootstrap Estimators, 371
5. Conclusion, 381
References, 386

Bootstrapping Likelihood Ratios in Quantitative Genetics 389


Nicholas Schork
1. Introduction, 389
2. Bootstrap Tests for Non-Nested Hypotheses, 390
3. Quantitative Segregation Analysis, 392
4. Conclusion, 393
References, 393
A Nonparametric Density Estimation Based Resampling Algorithm 397
Malcolm S. Taylor and James R. Thompson
Abstract, 397
1. Discussion, 397
References, 403

Nonparametric Rank Estimation Using Bootstrap Resampling


and Canonical Correlation Analysis 405
Xin M. Tu, D. S. Burdick and B. C. Mitchell
Abstract, 405
1. Introduction, 406
2. Rank Estimation by Canonical Correlation, 407
3. The Bootstrap Resampling, 408
4. Results and Discussion, 412
References, 418

Index 419
PREFACE

Remarkable developments in statistics have followed Efron's introduction of the bootstrap. In rapid succession, a great many problems have proven amenable to this new method, which advances the notion that by sampling from our data we can approximate the sampling variations which produced that data in the first place.
The hard work of developing a theory of bootstrap for parametric, non-
parametric, time series, and other areas of statistics, continues at a brisk
pace in a stream of fundamental papers by top researchers, most of whom
are represented in this volume. There have been some extraordinary
revelations.
We have learned that virtually every sample of modest size taken from
a normal population can tell us the shape of the Student’s t-distributions. A
computer can be programmed to do it without any reference to the
mathematical form of a t-distribution, and can just as easily apply the same
elementary method to a sample scatterplot in order to estimate the
sampling errors of the regression coefficients. This is accomplished easily
and naturally, even for the more complicated case of a sample drawn from a
finite population. We might say that in such cases computer + bootstrap =
basic statistics, without much need for specialized formulas or tables.
Consider our astonishment when we learned that simply bootstrapping
the t-statistic will in many cases outperform the normal approximation.
With some caution, we say that in such cases computer + bootstrap + basic
statistics = second order efficient statistics of the kind few statisticians are
able to do without digging into papers on Edgeworth expansions.
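
The claim about recovering the shape of Student's t by computer alone can be illustrated with a short simulation. The sketch below is our own illustration, not part of the book: it bootstraps the t-statistic of a single normal sample (the sample size, the number of resamples, and the NumPy/SciPy calls are all assumptions of ours) and compares the resulting quantiles with the theoretical t distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, B = 20, 10_000
x = rng.normal(loc=5.0, scale=2.0, size=n)     # one sample from a normal population

xbar = x.mean()

# Bootstrap the t-statistic: resample x, recompute (xbar* - xbar) / se*.
t_star = np.empty(B)
for b in range(B):
    xs = rng.choice(x, size=n, replace=True)
    t_star[b] = (xs.mean() - xbar) / (xs.std(ddof=1) / np.sqrt(n))

# Compare bootstrap quantiles with Student's t on n - 1 degrees of freedom;
# the two columns agree closely, with no reference to the t formula.
for q in (0.05, 0.25, 0.75, 0.95):
    print(q, round(np.quantile(t_star, q), 3), round(stats.t.ppf(q, df=n - 1), 3))
```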
The main features of these developments can be appreciated by almost anyone having an understanding of basic statistics. Those with deeper preparations will find rich reward.
In this volume you will find original research papers addressing the theme Exploring the Limits of Bootstrap, most of which were presented at a special topics meeting of the Institute of Mathematical Statistics held in East Lansing in May 1990, at the suggestion of Lynne Billard. Raoul LePage selected the theme and agreed to act as organizer and Program Chair. Nearly all of the top researchers on bootstrap are represented here,
together with many newcomers, some of whom are working in areas of
application. Their contributions to this volume address the full range of
issues surrounding the bootstrap, among them: consistency of bootstrap in
general circumstances including M-estimation, U-statistics and empirical
processes; invariance principles; comparison of bootstraps; efficient bootstrap
simulation; bootstrapping long-tailed errors; bias-reduction; applicability to
dependent samples including Markov processes and time series (moving and
circular block bootstrap); bootstrap bandwidth selection in density
estimation; bootstrap applied to complex problems. Many of the topics
receive attention in more than one paper.
As a complement to these new developments, we have a series of application papers dealing with such topics as: exploratory regression; linear model selection; hazard processes; computing bootstrap distributions sans resampling; nonparametric density estimation; resampling with jackknife.
Taken together, these papers represent some of the finest thinking on
the subject of bootstrap and offer a rich guide to the literature.
We wish to thank the Institute of Mathematical Statistics for its
sponsorship of the meeting, and for its continuing support of research on
bootstrap as represented by the preponderance of basic papers you will find
referenced to the Annals of Statistics or the Annals of Probability. We also
wish to thank the Interface Foundation of North America for cooperation
extended during the joint meeting with INTERFACE ’90, and Joseph C.
Gardiner who assisted with local arrangements for the IMS. Special thanks
are due Frank C. Hoppensteadt, Dean of the College of Natural Science, and
Paul M. Hunt, Vice Provost and head of Academic Computing, who
supported efforts to bring both IMS-BOOTSTRAP and INTERFACE ’90
conferences to Michigan State University. William Noble kindly assisted
with special local arrangements for young researchers and foreign visitors.

Special thanks are also due to the National Science Foundation and the
Office of Naval Research who through their support of INTERFACE ’90
indirectly encouraged participation in the bootstrap conference as well.

Raoul LePage, Editor and Program Chair


Lynne Billard, Editor and Program Secretary
Part 1
Introduction
Introduction to Bootstrap
Bradley Efron and Raoul LePage
Stanford University and Michigan State University

Introduction. The following problem arises in almost every data analysis: the statistician observes some data x, and from x constructs an estimate θ̂ = t(x) for a parameter of interest θ. In the most familiar case, x consists of real-valued observations x₁, ..., xₙ independently sampled from an unknown probability distribution F; the parameter of interest θ is the true mean of F, say μ(F) = ∫ x dF(x); and the statistic t(x) is the sample mean x̄.
The statistician's first job is to select the estimator θ̂ = t(x). Having selected θ̂, a second and in some ways more formidable task arises: to assess the accuracy of θ̂ as an estimator of the true value θ. The standard error of θ̂, the square root of its variance,

    se{θ̂; F} = [var_F{t(x)}]^{1/2} ,                              (1)

is the most common measure of accuracy for estimators θ̂ that are unbiased, or nearly so.
Elementary statistics courses focus on formulas for the standard error of θ̂ = x̄, the mean. Let σ²(F) be the variance of F,

    σ²(F) = ∫ {x − μ(F)}² dF(x) .

A simple but powerful formula relates se{x̄; F} to σ²(F),

    se{x̄; F} = [σ²(F)/n]^{1/2} .                                  (2)

This looks useless for practical purposes since σ²(F) is itself a function of the unknown distribution F. However a simple unbiased estimate exists for σ²(F),

    σ̂² = Σᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1) .                              (3)

Substituting (3) into (2) gives an estimated standard error for x̄,

    sê{x̄} = [ Σᵢ₌₁ⁿ (xᵢ − x̄)² / (n(n − 1)) ]^{1/2} .               (4)
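
As a quick numerical check of (4), the minimal sketch below (our own illustration; the sample values are made up) computes sê{x̄} both directly from (4) and by plugging the unbiased variance estimate (3) into (2).

```python
import numpy as np

x = np.array([2.1, 3.4, 1.9, 4.2, 2.8, 3.1])   # hypothetical sample
n = len(x)

# Formula (4): estimated standard error of the sample mean.
se_hat = np.sqrt(np.sum((x - x.mean()) ** 2) / (n * (n - 1)))

# The same value via the unbiased variance estimate (3) plugged into (2).
se_hat_via_3 = np.sqrt(x.var(ddof=1) / n)

print(se_hat, se_hat_via_3)   # identical up to rounding
```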

What about estimators θ̂ more complicated than x̄, for example a median, a sample correlation coefficient, or the maximum eigenvalue of a sample covariance matrix? Formulas like (2) don't exist for most estimators other than the mean. Much of statistical history concerns finding approximations to formula (2) for more general estimators (often for "mean-like" statistics such as the ratio of two means), and avoiding estimators that do not have such formulas.

Jackknife. In 1958 John Tukey revolutionized error estimation with his "jackknife" method, built upon Quenouille's older technique for estimating biases. Tukey's method avoided formula (2) entirely, going directly to a generalization of (4). Suppose the data x consists of n independent and identically distributed observations xᵢ from some unknown distribution F, which we can write diagrammatically as

    F →iid (x₁, x₂, ..., xₙ) = x .                                 (5)

Let x₍ᵢ₎ be the data set with the i-th datum removed,

    x₍ᵢ₎ = (x₁, x₂, ..., xᵢ₋₁, xᵢ₊₁, ..., xₙ) ,

and let θ̂₍ᵢ₎ equal t(x₍ᵢ₎), the statistic θ̂ reevaluated for the deleted-point data set x₍ᵢ₎. The jackknife estimate of standard error is

    sê_jack{θ̂} = [ ((n − 1)/n) Σᵢ₌₁ⁿ (θ̂₍ᵢ₎ − θ̂₍·₎)² ]^{1/2}        (6)

where θ̂₍·₎ equals Σᵢ₌₁ⁿ θ̂₍ᵢ₎ / n.

It is easy to verify that formula (6) reduces to formula (4) when θ̂ equals x̄. The beauty of Tukey's jackknife is that it automatically produces a standard error estimate for even the most complicated estimator θ̂.

All that is required is the ability to recompute θ̂, n times, once for each deleted-point vector x₍ᵢ₎. The jackknife marked a decisive switch toward computation, and away from the sort of routine theorizing that statisticians traditionally did in extending formula (2) to the complications of real-life statistical practice.
Tukey’s formula didn’t eliminate statistical theory of course. Rather it
focused attention on the theory justifying (6) as an accurate assessment of
standard error. The jackknife turned out to work poorly on very un-smooth
estimators like the sample median, but otherwise to give generally
trustworthy results. See Miller (1964).
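
A direct transcription of (6) into code may make the "recompute θ̂, n times" recipe concrete. The sketch below is our own illustration, assuming NumPy; the exponential sample and the choice of mean and median as test estimators are arbitrary.

```python
import numpy as np

def jackknife_se(x, t):
    """Tukey's jackknife standard error, formula (6), for an estimator t."""
    x = np.asarray(x)
    n = len(x)
    # theta_hat_(i): the estimator recomputed with the i-th datum deleted.
    theta_i = np.array([t(np.delete(x, i)) for i in range(n)])
    theta_dot = theta_i.mean()
    return np.sqrt((n - 1) / n * np.sum((theta_i - theta_dot) ** 2))

rng = np.random.default_rng(1)
x = rng.exponential(size=30)
print(jackknife_se(x, np.mean))    # agrees with formula (4) for the mean
print(jackknife_se(x, np.median))  # runs, but is unreliable for un-smooth estimators
```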
A major disappointment in the development of jackknife theory concerned "studentization." Standard errors are often used to set approximate (1 − α)-level confidence intervals for θ, of the form

    θ̂ ± z_{α/2} · sê                                              (7)

where z_β is the standard normal percentile, z_{0.025} = −1.960, etc. In the case of estimating the true mean μ(F) with the sample mean x̄, where sê is given by (4), and where F is assumed to be a normal distribution, Student's famous result says that z_{α/2} in (7) should be replaced by t_{α/2, n−1}, the Student's t percentile with degrees of freedom ν = n − 1. Considerable efforts were made to find the correct degrees of freedom for sê_jack, in order to get approximate confidence intervals better than θ̂ ± z_{α/2} · sê_jack, but to no conclusive end.

Bootstrap. Efron's bootstrap (1979) began as an attempt to better understand the jackknife. This involved reexamining formula (2), which had been bypassed in going to (4). Suppose that in (2) the empirical distribution F̂ of the observed data, that is the distribution

    F̂ : probability 1/n on xᵢ for i = 1, 2, ..., n,                (8)

is substituted for the unknown true distribution F. Since σ²(F̂) = Σᵢ(xᵢ − x̄)²/n, this gives the standard error estimate

    se{x̄ ; F̂} = [ Σᵢ(xᵢ − x̄)² / n² ]^{1/2} ,

almost the same as the traditional estimate (4).
Efron's (1979) bootstrap paper makes these main points:

• That substituting F̂ for F in formula (1) gives a reasonable estimate of standard error for any estimator θ̂, namely

    sê_boot{θ̂} = se{θ̂; F̂} = [var_F̂{t(x*)}]^{1/2} .                (9)

Here x* indicates a hypothetical data vector generated by i.i.d. sampling from the distribution F̂,

    F̂ →iid (x₁*, x₂*, ..., xₙ*) = x* ,                             (10)

as distinct from the data vector x actually observed.

• That there is a simple computer algorithm for estimating sê_boot{θ̂}, i.e. for substituting F̂ for F in (1), even though no formula like (2) exists for most θ̂.

• That sê_boot agrees asymptotically with sê_jack, and as a matter of fact the jackknife is a linear approximation to the more computer-intensive bootstrap process.

• That the bootstrap is a reasonable error estimator in its own right, easier to understand than the jackknife, and easier to extend to data structures beyond (5).

• That the bootstrap method could potentially be applied to problems of statistical error assessment beyond biases and standard errors, in particular the setting of approximate confidence intervals, but only if further progress were made in understanding the bootstrap's inferential basis.
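
The "simple computer algorithm" for sê_boot referred to in the second point is ordinarily implemented by Monte Carlo resampling from F̂. The sketch below is our own illustration (the estimator, the bivariate normal sample, and B = 2000 are arbitrary choices): it estimates the standard error of a sample correlation coefficient, a case where no formula like (2) exists.

```python
import numpy as np

def bootstrap_se(data, t, B=2000, rng=None):
    """Monte Carlo approximation of se_boot (formula (9)): resample the rows of
    `data` from the empirical distribution F-hat and recompute t each time."""
    rng = rng or np.random.default_rng()
    data = np.asarray(data)
    n = data.shape[0]
    t_star = np.array([t(data[rng.integers(0, n, size=n)]) for _ in range(B)])
    return t_star.std(ddof=1)

# Illustration: standard error of a sample correlation coefficient.
rng = np.random.default_rng(2)
z = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=50)
corr = lambda d: np.corrcoef(d[:, 0], d[:, 1])[0, 1]
print("theta-hat =", corr(z), " se_boot =", bootstrap_se(z, corr, rng=rng))
```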

Consistency of bootstrap. Progress came swiftly. Bickel and Freedman (1981) formulated the problem of the asymptotic consistency of bootstrap in a simple way. We may have some statistic of interest θ̂(n; (x₁, ..., xₙ); G) which, when normalized, possesses a limit distribution L_G whenever the data x are i.i.d. from any distribution G belonging to some "neighborhood" of the true distribution F. We write this,

    { the distribution of θ̂(n; (x₁, ..., xₙ); G) } → L_G .              (11)

Given a sample x = (x₁, ..., xₙ) i.i.d. F, we would like bootstrap to be able to estimate the distribution L_F by resampling x and applying θ̂ with F̂ in place of F, i.e. we want

    { conditional distribution of θ̂(n; (x₁*, ..., xₙ*); F̂) given x } → L_F .   (12)

Bickel and Freedman (1981) isolate a set of three general conditions which together imply the bootstrap consistency result (12) in the circumstances (some of them quite general) studied thus far:

• The limit distribution L_G is continuously dependent upon G, the sampling distribution of the data.

• The convergence (11) is uniform for G belonging to the "neighborhood of F."

• With probability one, F̂ will eventually remain within the "neighborhood of F."

Their paper should be consulted for the precise statement of these results. In the paper, a number of examples are developed which exhibit the failure of (12) due to violation of each of the three conditions individually.

Based on this work, bootstrap was rapidly shown to apply in a broad range of standard applications, including t-statistics, empirical, and quantile processes (Bickel and Freedman, 1981); multiple regression (Freedman, 1981); and stratified sampling (Bickel and Freedman, 1984).

Pivoting and Edgeworth expansions. At the same time, an unexpected major advance in the theory of bootstrap was obtained by Singh (1981), independently of Bickel and Freedman, who showed that one of Efron's bootstrap methods, the pivot or bootstrap-t method, in many cases accomplishes "Edgeworth correction." That is, for estimating a point of the sampling distribution of the studentized statistic, bootstrap automatically produces answers as good as one-term Edgeworth expansions, without requiring special theoretical calculations.

To put these results in some perspective, consider confidence intervals for the population mean μ obtained by using each of three distribution approximations:

• Central limit theorem approximation for the distribution of (x̄ − μ)/sê, with sê given by (4).

• Percentile (naive) bootstrap distribution of x̄* − x̄, conditional on data x, as an approximation for the distribution of x̄ − μ.

• Percentile-t bootstrap distribution of (x̄* − x̄)/sê*, conditional on data x, as an approximation for the distribution of (x̄ − μ)/sê.

Great interest is generated by the fact that one-sided α-intended-level confidence intervals based on the first two methods, i.e. the central limit theorem and naive bootstrap, have actual coverage probabilities which typically differ from α by O(1/√n). The coverage probability achieved by the percentile-t method, on the other hand, is typically correct to O(1/n). This improvement is rooted in the Edgeworth correction effect.

The situation for symmetric two-sided confidence intervals is somewhat different since 1/√n terms cancel in the first two methods as well. In the case of the central limit method,

    P( μ ∈ x̄ ± z_{α/2} · sê ) = 1 − α + O(1/n),                    (13)

while for the naive bootstrap interval I(x) = x̄ ± W(x) obtained by choosing W(x) to satisfy,

    P( x̄* ∈ I(x) | x ) ≈ 1 − α,                                    (14)

we likewise have,

    P( μ ∈ I(x) ) = 1 − α + O(1/n).                                (15)

Finally, for the percentile-t, choosing V(x) so that,

    P( |x̄* − x̄| / σ̂* ≤ V(x) | x ) ≈ 1 − α,                         (16)

where σ̂* is the sample standard deviation of (x₁*, ..., xₙ*), we also have

    P( μ ∈ x̄ ± V(x) · σ̂ ) = 1 − α + O(1/n).                        (17)

Results of this type set in motion a stream of research aimed at producing bootstrap confidence intervals taking advantage of Edgeworth-correction effects without introducing unpleasant side effects, such as excessive width. See Efron (1987). Various bias corrections and other methods have been introduced. Hall (1988) compares seven different bootstrap methods utilizing Edgeworth expansions and makes the intriguing finding that "the bootstrap approximation is so good that bootstrap versions of erroneous theoretical critical points are also erroneous."
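
As a worked illustration of (16)-(17), the following sketch (ours, not from the chapter; α, B, and the lognormal sample are arbitrary assumptions) computes a symmetric percentile-t interval for a mean by bootstrapping the studentized statistic.

```python
import numpy as np

def percentile_t_interval(x, alpha=0.05, B=5000, rng=None):
    """Symmetric percentile-t interval for the mean, in the spirit of (16)-(17)."""
    rng = rng or np.random.default_rng()
    x = np.asarray(x)
    n = len(x)
    xbar, sd = x.mean(), x.std(ddof=1)
    # Bootstrap the studentized statistic |xbar* - xbar| / sd*.
    t_abs = np.empty(B)
    for b in range(B):
        xs = x[rng.integers(0, n, size=n)]
        t_abs[b] = abs(xs.mean() - xbar) / xs.std(ddof=1)
    v = np.quantile(t_abs, 1 - alpha)      # V(x) as in (16)
    return xbar - v * sd, xbar + v * sd    # interval as in (17)

rng = np.random.default_rng(3)
x = rng.lognormal(size=25)                 # a skewed sample
print(percentile_t_interval(x, rng=rng))
```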

New directions. Some effort has gone into studying the precise nature of bootstrap's failure for long-tailed errors. Athreya (1987) studied bootstrapping x̄ when the L_G are stable laws of index α < 2, and established that the limit distribution obtained by bootstrap is random. Recently, Giné and Zinn (1989) have shown that normal limits L_G are in some sense necessary when θ̂(n; x) are normalized sums. Do bootstraps exist that can successfully cope with long-tailed errors in normalized sums, for how is one to know when the normal limits apply?

A potentially important development in bootstrap is "double-dip" bootstrap, i.e. choosing among several competing estimators the one whose bootstrap-estimated sampling error is least; then again using bootstrap to assess the sampling error of this adaptively chosen estimator.
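
A rough illustration of the instability Athreya identified, not his construction: the sketch below (our own; the Pareto tail index, the √n normalization, and the sample sizes are arbitrary assumptions) compares conditional bootstrap quantiles of the normalized mean across independent samples with finite and with infinite variance.

```python
import numpy as np

rng = np.random.default_rng(4)

def boot_quantile(x, q=0.95, B=2000):
    """Conditional bootstrap q-quantile of n^(1/2) * (xbar* - xbar), given the sample x."""
    n = len(x)
    xbar = x.mean()
    stats = [np.sqrt(n) * (x[rng.integers(0, n, n)].mean() - xbar) for _ in range(B)]
    return round(float(np.quantile(stats, q)), 2)

n = 2000
# Finite variance: the conditional bootstrap quantile is stable from sample to sample.
print([boot_quantile(rng.normal(size=n)) for _ in range(5)])
# Infinite variance (Pareto with tail index 1.5): the quantile fluctuates wildly from
# sample to sample, echoing the randomness of the bootstrap limit found by Athreya (1987).
# (The sqrt(n) normalization is kept only for comparability; it is not the stable norming.)
print([boot_quantile(rng.pareto(1.5, size=n) + 1.0) for _ in range(5)])
```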
Other promising new lines of current research are directed toward
extending bootstrap methods to dependent observations, such as Markov


chains and time series, where the idea of moving-block bootstrap is receiving
a lot of attention; and developing bootstrap methods for estimating the
sampling error in density or spectral-density estimation.
Many of the papers presented at this conference concern the higher-
order accuracy of bootstrap methods in these various contexts, and how the
accuracy can be harnessed to the task of producing accurate statistical
inference in an automatic way. "Automatic" here refers to shifting the
burden of computation in any particular applied problem away from
theoretical calculations and toward a computer algorithm. In particular the
goal of automatically producing highly accurate confidence intervals seems
to be moving towards a practical solution.

References

Athreya, K. B. (1987), Bootstrap of the mean in the infinite variance case. Annals of Statistics, 15:724-731.
Bickel, P. J. and Freedman, D. A. (1981), Some asymptotic theory for the bootstrap. Annals of Statistics, 9:1196-1217.
Bickel, P. J. and Freedman, D. A. (1984), Asymptotic normality and the bootstrap in stratified sampling. Annals of Statistics, 12:470-482.
Efron, B. (1979), Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7:1-26.
Efron, B. (1987), Better bootstrap confidence intervals (with discussion). Journal of the American Statistical Association, 82:171-200.
Freedman, D. A. (1981), Bootstrapping regression models. Annals of Statistics, 9:1218-1228.
Giné, E. and Zinn, J. (1989), Necessary conditions for the bootstrap of the mean. Annals of Statistics, 17:684-691.
Hall, P. (1988), Theoretical comparison of bootstrap confidence intervals (with discussion). Annals of Statistics, 16:927-985.
Miller, R. G. (1964), A trustworthy jackknife. Annals of Mathematical Statistics, 35:1594-1605.
Singh, K. (1981), On the asymptotic accuracy of Efron's bootstrap. Annals of Statistics, 9:1187-1195.
Part 2
General Principles of
The Bootstrap
On the bootstrap of M-estimators
and other statistical functionals
Miguel A. Arcones*†
CUNY, Graduate Center
The University of Connecticut

Evarist Giné*‡
CUNY, The College of Staten Island
The University of Connecticut

Abstract

Results on the bootstrap of empirical processes (Giné and Zinn


(1990)) are applied to obtain the a.s. bootstrap CLT for M-estimators
under weak differentiability hypotheses. Also, work of Dudley (1990)
and Sheehy and Wellner (1988) on the bootstrap of differentiable func-
tionals of the empirical process is reviewed and complemented.

*Work partially supported by NSF Grant No. DMS 9000132 and by PSC-CUNY Grants
No. 669336 and No. 661376
†Current address: Mathematical Sciences Research Institute, 1000 Centennial Drive, Berkeley, CA 94720
‡Current address: The University of Connecticut, Department of Mathematics, Storrs, CT 06269-3009


Exploring the Limits of Bootstrap. Edited by Raoul LePage and Lynne Billard.
Copyright©1992 by John Wiley & Sons, Inc. ISBN: 0471-53631-8.

1 Introduction

Huber (1967) proves asymptotic normality of M-estimators under non-standard conditions on the basis of an asymptotic equicontinuity result for the empirical process indexed by a class of functions {ψ(x, θ) : θ ∈ Θ} defined on S × Θ, (S, 𝒮) a general measurable space and Θ ⊂ ℝᵈ. Thus, it was to be expected that the recent theory of empirical processes over general classes of functions would have something to contribute to the theory of M-estimators. In fact, Pollard (1985) extends Huber's results using modern empirical process theory. In a somewhat related development, Dudley (1990) and Sheehy and Wellner (1988) (see also Gill (1989)) introduce different notions of differentiable (von Mises) functionals in the general empirical process context, thus extending the classical notion of Fréchet and compact differentiable statistical functionals.

Giné and Zinn (1990) prove a general bootstrap CLT for empirical processes: under measurability conditions, if the empirical process indexed by a class of functions 𝓕 satisfies the central limit theorem (CLT) then it satisfies also the bootstrap CLT in probability (a.s. if moreover the envelope of 𝓕 is square integrable). Less general results for more general resampling procedures (such as the parametric bootstrap) are also available (Giné and Zinn (1991)).

In this article we combine Huber's and Pollard's work on M-estimators with Giné and Zinn's bootstrap CLT to obtain a.s. bootstrap limit theorems for M-estimators. Dudley and Sheehy and Wellner also used the bootstrap CLT for empirical processes to obtain bootstrap limit theorems for different types of differentiable functionals of the empirical processes. We give results that complement Dudley's, describe, with some complements and precisions, the work of Sheehy and Wellner and then apply these results to M-estimators once more.

We look at M-estimators twice in order to compare Huber's and Pollard's approach (the key term here is Pollard's "stochastic differentiability" condition) with the more classical approach based on different notions of differentiability (Reeds (1976), Gill (1989) and Sheehy and Wellner (1988) among others). It appears that, for M-estimators, both methods give similar results although the former is somewhat better: a condition on stochastic equicontinuity at θ(P) of the empirical process indexed by a certain class of functions in the former approach is replaced in the latter by the requirement that the empirical process over the same class of functions satisfy the CLT, a condition equivalent to stochastic equicontinuity over the whole domain Θ. Of course differentiability is more widely applicable.

The main results are the a.s. bootstrap CLT's for M-estimators, in Section 3. The proofs use three main tools from empirical processes, namely the Giné-Zinn bootstrap CLT, an exponential inequality of Alexander (1984) (see Section 2(c)) and the "square root trick" exponential bound (LeCam (1986, page 546), Giné and Zinn (1984, 1986)). Concrete examples include the bootstrap of the spatial median and the bootstrap of k-means (this, for reasons of expediency, only in ℝ). (Can these two examples be handled with more "classical" methods, i.e. proving some kind of differentiability with respect to the multivariate cdf? Perhaps, but we were not diligent enough to check this since our approach, which follows Pollard's, is so well adapted to the problem.)
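
As a concrete, purely illustrative companion to the first example mentioned above, the sketch below bootstraps the spatial median in ℝ² using Weiszfeld's iteration; the data, the iteration scheme, and the tolerances are our own assumptions and this is not the authors' construction.

```python
import numpy as np

def spatial_median(x, tol=1e-8, max_iter=500):
    """Weiszfeld iteration for the spatial (geometric) median of points in R^d."""
    m = x.mean(axis=0)
    for _ in range(max_iter):
        d = np.linalg.norm(x - m, axis=1)
        d = np.maximum(d, 1e-12)                  # guard against division by zero
        m_new = (x / d[:, None]).sum(axis=0) / (1.0 / d).sum()
        if np.linalg.norm(m_new - m) < tol:
            return m_new
        m = m_new
    return m

rng = np.random.default_rng(5)
n = 100
x = rng.standard_t(df=3, size=(n, 2))             # a heavy-tailed cloud in R^2
theta_hat = spatial_median(x)

# Bootstrap the spatial median: resample rows of x, recompute, summarize the spread.
B = 1000
boot = np.array([spatial_median(x[rng.integers(0, n, n)]) for _ in range(B)])
print("spatial median:", theta_hat)
print("bootstrap covariance of the spatial median:\n", np.cov(boot.T))
```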
The results on differentiable functionals are in Section 4. The emphasis
is on bootstrapping limit theorems for functionals which are differentiable
in a very weak sense (only along a family of subsequences and at a fixed
rate). Both the framework and the ideas of the proof of Theorem 4.6 be-
long to Sheehy and Wellner (1988) and Gill (1989). Our contribution to this
subject consists only of providing some accurate proofs (particularly for the
existence of a representation for which simultaneously the empirical process
converges uniformly a.s. to the Gaussian limit and the bootstrap CLT holds
a.s., which is given in Section 2) and observing that if the bootstrap sample
size is of the order o(n / log log n) then the bootstrap CLT actually holds
a.s. (a phenomenon already pointed out before in similar situations - Yang
(1985), Arcones and Giné (1990)).

2 Empirical processes.

In the first part of this section we give some notation and definitions about empirical processes and describe the results on the bootstrap of empirical processes of Giné and Zinn (1990, 1991). In a second part we give some technical results that will be useful in the study of the bootstrap for differentiable functionals. The third contains an application of Alexander's (1984) exponential inequalities to the almost sure behavior of empirical processes.

(a) The bootstrap of empirical processes. Let (S, 𝒮, P) be a probability space and let 𝓕 ⊂ L₂(S, 𝒮, P) be a class of functions such that

    F(s) := sup{|f(s)| : f ∈ 𝓕} < ∞ for all s ∈ S,

    sup{|Pf| : f ∈ 𝓕} < ∞, where Pf := ∫ f dP.

Under these two conditions, the maps

    δ_s : 𝓕 → ℝ given by δ_s(f) = f(s)

and

    P : 𝓕 → ℝ given by Pf = ∫ f dP

are bounded. So, δ_s, P ∈ ℓ^∞(𝓕), the Banach space of real bounded functions on 𝓕, equipped with the supremum norm

    ‖z‖_𝓕 := sup{|z(f)| : f ∈ 𝓕}.

ℓ^∞(𝓕) plays the role of D(−∞, ∞) in the classical formulation of Donsker's convergence theorem for the empirical cdf (corresponding to the case 𝓕 = {1_(−∞,t] : t ∈ ℝ}). The non-separability of (ℓ^∞(𝓕), ‖·‖_𝓕), just like the non-separability of D(0,1) for the sup norm topology, introduces some measurability complications when dealing with empirical processes. All of them can be satisfactorily dealt with under some assumptions on 𝓕, in particular for 𝓕 countable or for 𝓕 image admissible Suslin (Dudley (1984)). We will write that 𝓕 is measurable to mean that 𝓕 satisfies the image admissible Suslin property.

Let X_i : (S^ℕ, 𝒮^ℕ, P^ℕ) → (S, 𝒮, P) be the coordinate functionals of S^ℕ. The X_i are i.i.d. (P). It will be convenient to have X_i defined on a larger probability space (Ω, Σ, Pr) = (S^ℕ, 𝒮^ℕ, P^ℕ) × ([0,1], ℬ, λ). The empirical measure P_n^ω is defined as

    P_n^ω = n⁻¹ Σ_{i=1}^n δ_{X_i(ω)},  n ∈ ℕ,                       (2.1)

and the (centered, normalized) empirical process ν_n = ν_n^P = ν_n^P(ω) by

    ν_n = n^{1/2}(P_n − P),  n ∈ ℕ.                                 (2.2)

Both ν_n and P_n are random elements with values in ℓ^∞(𝓕). Let {G_P(f) : f ∈ 𝓕}, the P-Brownian bridge indexed by 𝓕, be the centered Gaussian process with covariance

    E G_P(f) G_P(g) = P(fg) − (Pf)(Pg),  f, g ∈ 𝓕.                  (2.3)

For further reference, we let

    ρ_P²(f, g) = P(f − g)² − (P(f − g))² = E(G_P(f) − G_P(g))²,  f, g ∈ 𝓕.   (2.4)

Obviously, for every finite part J ⊂ 𝓕,

    (ν_n(f) : f ∈ J) →_𝓛 (G_P(f) : f ∈ J)

by the finite-dimensional central limit theorem. An important question is whether this convergence can be made "uniform" over all of 𝓕. To make this precise, we define (following Hoffmann-Jørgensen): if {X_n}_{n=0}^∞ are ℓ^∞(𝓕)-valued random elements and X₀ is measurable and has a separable support then

    X_n →_𝓛 X₀                                                     (2.5)

in ℓ^∞(𝓕) iff

    E*H(X_n) → EH(X₀)                                              (2.6)

for all H : ℓ^∞(𝓕) → ℝ bounded and continuous. E* stands for outer integral. Note that if the process G_P has a version with bounded ρ_P-uniformly continuous trajectories, then it is measurable and has support C_u(𝓕, ρ_P) (C_u = bounded uniformly continuous functions), which is separable in (ℓ^∞(𝓕), ‖·‖_𝓕). If G_P has this property we say, for short, that G_P is sample continuous. Dudley's definition for "CLT for the empirical process uniform over 𝓕" is: 𝓕 is a P-Donsker class iff:
(i) G_P is sample continuous and
(ii) ν_n^P →_𝓛 G_P in ℓ^∞(𝓕).
To relate this definition to classical theory, just note that Donsker's theorem on weak convergence in D(−∞, ∞) of the empirical cdf can just be restated as: The class 𝓕 = {1_(−∞,t] : t ∈ ℝ} is P-Donsker for all P ∈ P(ℝ). The use in statistics of this theorem rests in part on the continuous mapping theorem: if 𝓕 is P-Donsker and T : ℓ^∞(𝓕) → ℝ is continuous then

    T(ν_n) →_𝓛 T(G_P)                                              (2.7)

(interpreted as regular convergence in distribution if T(ν_n) is a random variable, and in the sense of (2.5) and (2.6), with ℓ^∞(𝓕) replaced by ℝ, otherwise).

There are many interesting classes 𝓕 besides {1_(−∞,t] : t ∈ ℝ}, both for S = ℝ or ℝᵏ and for more general S. We will encounter some below.

If one views ν_n^P as a measure-valued statistic, then its simple bootstrap is obviously ν_n^{P_n^ω}. In other words, if X*_{n1}, ..., X*_{nn} are i.i.d. (P_n^ω), and if

    P_n^{ω*} = n⁻¹ Σ_{i=1}^n δ_{X*_{ni}}                            (2.8)

is the empirical measure constructed from the bootstrap sample {X*_{ni}}_{i=1}^n, then the bootstrap empirical process is

    ν_n^{ω,n} = n^{1/2}(P_n^{ω*} − P_n^ω).                          (2.9)

We will write ν̂_n or ν̂_n^ω for ν_n^{ω,n}.
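
To make these objects concrete, the sketch below (our own illustration, not from the paper) takes 𝓕 = {1_(−∞,t] : t ∈ ℝ}, the classical Donsker case, and computes ‖ν_n‖_𝓕 together with bootstrap replicates of ‖ν̂_n‖_𝓕, i.e. Kolmogorov-Smirnov type statistics; the sample size and number of resamples are arbitrary assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
n, B = 200, 1000
x = rng.normal(size=n)

def ecdf(sample):
    """Empirical cdf t -> P_n(-inf, t], returned as a callable."""
    s = np.sort(sample)
    return lambda t: np.searchsorted(s, t, side="right") / len(s)

def sup_norm(sample, target_cdf):
    """||nu_n||_F for F = {1_(-inf,t]}: sup over t of n^(1/2)|P_n(-inf,t] - target(t)|.
    The sup is attained at (or just below) the jump points of the empirical cdf."""
    xs = np.sort(sample)
    grid = np.concatenate([xs, xs - 1e-12])
    Pn = ecdf(sample)
    return np.sqrt(len(xs)) * np.max(np.abs(Pn(grid) - target_cdf(grid)))

# Empirical process against the true P (standard normal here).
print("||nu_n||_F =", sup_norm(x, norm.cdf))

# Bootstrap empirical process: resample from P_n and compare with P_n itself.
boot = [sup_norm(x[rng.integers(0, n, n)], ecdf(x)) for _ in range(B)]
print("bootstrap 95% quantile of ||nu_n*||_F:", np.quantile(boot, 0.95))
```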
In order to give a precise meaning to the bootstrap CLT (in probability and a.s.) for empirical processes, we should mention that convergence in law in ℓ^∞(𝓕) is metrizable in the following sense (e.g. Giné and Zinn (1991)): X_n →_𝓛 X₀ if and only if

    d_{BL*}(X_n, X₀) → 0                                           (2.10)

where

    d_{BL*}(X_n, X₀) = sup{ |E*H(X_n) − EH(X₀)| : ‖H‖_∞ ≤ 1, ‖H‖_L ≤ 1 }.

We then say that the bootstrap CLT for the empirical process holds in probability, or that 𝓕 is pr-bootstrap P-Donsker, iff for all ε > 0

    Pr*{ d̂_{BL*}(ν̂_n, G_P) > ε } → 0.                               (2.11)

Here d̂_{BL*} is d_{BL*} with E* replaced by conditional expectation given the sample. The bootstrap CLT for the empirical process holds a.s., or 𝓕 is a.s.-bootstrap P-Donsker, iff

    d̂_{BL*}(ν̂_n, G_P) → 0 almost uniformly.                         (2.12)

Note that the bootstrap CLT for the empirical process gives at once, via the continuous mapping theorem, the bootstrap of a large variety of limit theorems.

The following result states that the bootstrap always works in this general context, even without local uniformity of the CLT in P.

Theorem 2.1 (Giné and Zinn (1990)). Let 𝓕 be a measurable class of functions on S. Then
(i) 𝓕 is P-Donsker ⇔ 𝓕 is pr-bootstrap P-Donsker.
(ii) 𝓕 is P-Donsker and PF² < ∞ ⇔ 𝓕 is a.s.-bootstrap P-Donsker.

We refer to Giné and Zinn (1990) for the proof. There has recently been some interest in considering different bootstrap sample sizes. In the present situation this amounts to taking ν_n^{ω,m} with m not necessarily equal to n. The proofs in Giné and Zinn (1990) can be modified to yield the following:

Proposition 2.2 (Giné and Zinn, unpublished). Under the same measurability as in 2.1,
(i) 𝓕 is P-Donsker and m_n → ∞ ⇒ d̂_{BL*}(ν_n^{ω,m_n}, G_P) → 0 in Pr*.
(ii) 𝓕 is P-Donsker, PF² < ∞, m_n ↗ ∞, m_n/m_{2n} ≥ c > 0 ⇒ ν_n^{ω,m_n} →_𝓛 G_P in ℓ^∞(𝓕) a.s.

We will not prove this proposition, but only mention the additional facts that must be combined with the proof of Theorem 2.1 in Giné and Zinn (1990) to yield Proposition 2.2. Their proofs are not difficult to implement (given the published background). These are:

1) If N(m/n) is the difference of two independent Poisson variables with mean m/n, then E max_{i≤n} |N_i(m/n)| / m^{1/2} → 0, where the N_i are i.i.d. with the law of N, and

    sup_{n,m>0} ∫₀^∞ P{ |N(m/n)| > t (m/n)^{1/2} }^{1/2} dt < ∞

(4 is a bound for this integral).

2) The basic lemma of Ledoux, Talagrand and Zinn (Ledoux and Talagrand (1989)) admits the following modification: If m_n ↗ ∞, m_n/m_{2n} ≥ c > 0, {X_i} are i.i.d. Banach space valued random variables and N_i(m/n) are i.i.d. as in 1) and independent of {X_i}, then there is K < ∞ independent of {X_i} such that
    lim sup_n E_N ‖ Σ_{i=1}^n N_i(m_n/n) X_i ‖ / m_n^{1/2} ≤ K lim sup_n E ‖ Σ_{i=1}^n ε_i X_i ‖ / n^{1/2}  a.s.

The regularity of m_n in 2) seems to be required because the proof uses blocking of the sums (as in the proof of the LIL). However, we would not be surprised if Proposition 2.2 (ii) were true for arbitrary m_n → ∞.
If it is known that P ∈ {P_θ : θ ∈ Θ} we may want to resample from P_{θ̂_n(ω)} instead of P_n^ω, where θ̂_n(ω) is an estimate of θ based on the sample. More generally, we may want to resample from τ_n(P_n) (e.g. a smoothing or a symmetrization of P_n). For this situation there are no general results such as Theorem 2.1 (and there are counterexamples: Beran (1984)). However, if there is uniformity in P in the CLT for ν_n^P then a very general resampling theorem holds.

A uniformly bounded class of functions 𝓕 is uniformly pregaussian (UPG) if the P-Brownian motions Z_P (covariance of Z_P: E Z_P(f) Z_P(g) = P(fg); the relationship to G_P is not unlike that of Brownian motion to Brownian bridge) are "nice" uniformly in P ∈ P_f(S), the probability measures on S with finite support. Note that if P = Σ a_i δ_{s_i} then Z_P = Σ a_i^{1/2} δ_{s_i} g_i, with g_i i.i.d. N(0,1). Concretely, 𝓕 bounded is UPG if

    i) sup_{P ∈ P_f(S)} E ‖Z_P‖_𝓕 < ∞,

    ii) lim_{δ→0} sup_{P ∈ P_f(S)} E sup_{f,g ∈ 𝓕 : ρ_P(f,g) ≤ δ} |Z_P(f) − Z_P(g)| = 0.

Let P(S) denote the set of probability measures on (S, 𝒮). It is proved in Giné and Zinn (1991) that 𝓕 measurable and uniformly bounded is UPG if and only if:

    1) conditions i) and ii) above hold with P_f(S) replaced by P(S), and

    2) lim_n sup_{P ∈ P(S)} d_{BL*}(ν_n^P, G_P) = 0.

Note that 2) means precisely that the CLT for empirical processes indexed by 𝓕 holds uniformly in P ∈ P(S). It is also proved there that if 𝓕 is UPG then, letting 𝓖 = 𝓕 ∪ 𝓕·𝓕 = 𝓕 ∪ {fh : f, h ∈ 𝓕},

    ‖R_n − R₀‖_𝓖 → 0 ⇒ d_{BL*}(G_{R_n}, G_{R₀}) → 0,

R_n ∈ P(S), n = 1, .... Then a simple triangle inequality gives:

Theorem 2.3 (Giné and Zinn (1991)). If 𝓕 measurable and uniformly bounded is UPG then

    ‖R_n − R₀‖_𝓖 → 0 ⇒ ν_n^{R_n} →_𝓛 G_{R₀} in ℓ^∞(𝓕).              (2.13)

Corollary 2.4 Under the conditions of Theorem 2.3, if ‖τ_n(P_n) − P‖_𝓖 → 0 a.s. (in probability) then ν_n^{τ_n(P_n(ω))} →_𝓛 G_P a.s. (in probability) in ℓ^∞(𝓕).

In Corollary 2.4 the bootstrap sample size can be taken arbitrary as long as it tends to infinity: the same proof applies.

(b) Almost sure representations and the bootstrap. When dealing with differentiable functionals of the empirical process it is convenient to have versions for which convergence in law to G_P becomes a.s. convergence in ℓ^∞(𝓕) with the bootstrap CLT still holding. This was recognized by Gill (1989) in ℝ and by Sheehy and Wellner (1988) in general. We prove here a version of Sheehy and Wellner's statement which follows by combining an important representation theorem of Dudley (1985) and Theorem 2.1 above. We recall the definition of perfect functions: a measurable function g : (U, 𝒰, P) → (V, 𝒱) is perfect if (P ∘ g⁻¹)* = P* ∘ g⁻¹ (see Dudley (1985) for different equivalent definitions). We also need the definition of almost uniform convergence (of non-measurable functions): X_n → X almost uniformly if ‖X_n − X‖*_𝓕 → 0 a.s. Almost uniform convergence (contrary to a.s. convergence) implies convergence in pr*.
Theorem 2.5 Let F be a P-Donsker measurable class of functions. Then


there is a probability space (Q, &, Pr) and perfect functions g, : (Q,%, Pr) >
(Q, 4, Pr) such that

Prog,! = Pr,n=0,1,..., Yn ° Gn + Gp 0 go almost uniformly (2.14)


and
dpz- (vine,
Pnogn
Gp) > 0 almost uniformly. (2.15)
Proof By Theorem 2.1,

Ya t= dpr+(vi", Gp)* — 0 in probability .

Therefore, ¥ being P-Donsker, the /°(F) xR-valued random elements (vp, Yn)
converge in law to (Gp,0) (this is immediate from the definition of conver-
gence in law). Then Dudley (1985), Theorem 4.1, implies the existence of a
probability space (2,2, Pr) and perfect functions g, with Prog, = Pr such
that

(Uns Yn) ° Jn + (Gp,0) 0 go almost uniformly.


The result follows since Y,, 0 g, is the variable in (2.15). a

As another useful application of Dudley’s representation theorem, we


prove now a general continuous mapping theorem. Wellner (1989) has a
M-Estimators

similar statement (which is proved in van der Vaart and Wellner (1989)) and
describes how this result is relevant for convergence of differentiable func-
tionals. Our proof is different, closer to the proof of the continuous mapping
theorem for a single functional in Pollard (1989).

Theorem 2.6 Let {Z,}%2, be random elements with values in a metric space
(V,d). Suppose Zo has separable range and is measurable, and that Z, >. Zo
(in the sense of (2.5), (2.6) with I°(F) replaced by V). Let Vo be a separable
Borel subset of V such that Pr{Z € Vo} = 1 and let Vo be its completion.
For each n € N let V, C V be a set containing the range of Z,. Let Ho:
Vo — R be measurabie and let H, : V, +R be functions satisfying

eS Von We 00,%o E Vo, d(Xn', Lo) 7~0> Hy (2n') Se Ho(z20) (2.16)

where we also denote by d the distance in the completion V of V. Then

A (Z,) if Ho(Zo).

Proof By Lusin’s theorem (e.g. Dudley (1989, page 190)) and tightness
of L(Zo) in Vo, for every r > 0 there exists K C Vo,K compact in Vo, such
that Pr{Z € K°} <7 and Hp is (uniformly) continuous on K. This and
the convergence hypothesis on {H,} implies that for all « > 0 there exist
6 > 0 and ng < of such that

sup |H,(tn) — Ho(zo)| < € for n > no.


Zn €Vn,r0EK,d(rn,20)<6

Let gn, go be the perfect maps prescribed by Dudley’s (1985) theorem for
Zn —¢ Zo. By the properties of these maps and the definition of convergence
in law it suffices to prove

Hn(Zn © Gn) + Ho(Zo © go) almost uniformly

(since then, by Dudley (1985, Theorem 3.5), Hn(Zn) +c Ho(Zo)). Since for
m >No
{sup |\Hn(Zn ° In) =A Ho( Zo ° go)| > e}
n>m

C {sup d(Z, © gn, Zo 9 go) = 5} U{Zo © go € K*}


n>m

we have ,
Pr{sup |Hn(Zn © Gn) — Ho(Zo © go)|"> €}
n>m

= Pr*{sup |Hn(Zn © gn) — Ho(Zo ° go)| > €}


n>m

“a Pr*{sup d(Zn © Jn, Z0° Go) = S$ +T


n>m
Arcones and Giné

whose limsup is dominated by Tr because Z, 0g, — Zo° go almost uniformly.


Since 7 is arbitrary, Hn(Zn © gn) — Ho(Zo © go) almost uniformly and the
result follows. @

(c) Almost sure behavior of empirical processes. In Section 3 we


require a.s. limit theorems for the empirical process, such as

(log log n)?||vn||z, tas. 0


usually with F, = {f —g9: pp(f,9) < 6n, f,g € F} for suitable 6,. This will
be a consequence of an inequality of Alexander (1984) for classes of functions
related to the Vapnik-Cérvonenkis (VC) property. A class of sets C is VC if
there exists n < oo such that

sup #{{71,..-
tar CU sC ecr—a2,
res" =

These classes have very interesting properties regarding the CLT because
their metric entropies with respect to the £2(P) distances are small uni-
formly in P (Dudley (1984)). These properties are inherited by VC-subgraph
classes of functions: a class F of functions if VC-subgraph if the class of sets
{{(z,t): 0 <t < f(z) or f(x) < t < 0} : f € F} are VC. The same
is true for VC-subgraph difference classes G = {f —g : f,g € F} with F
VC-sub graph. A typical example of a VC-subgraph class of functions is
{h(x,0) = h(x — 0) : 6 € R} if h: RR is monotone. Uniformly bounded
VC-subgraph classes are UPG and, more importantly, they satisfy the follow-
ing exponential inequality due to Alexander (1984, Theorem 2.8 and 1985,
Theorem 2.2):

Theorem 2.7 (Alezander). Let P € P(S) and let F be a uniformly bounded


VC-subgraph difference class of functions on S. Let a = supyex Varpf,
letn > 1 and M > 0. There exists a constant K (depending on certain
characteristics of F ) such that if

M? > KaL(a7!) and M > KL(n)/n? (2.17)


then
Pr{|lyn|l-¢ > M} < 16 exp(—M?/8ar?)
+ 16 exp(—(
Mn?
4r)"1
L(M/n?ar)) (2.18)
where Lx = log(z Ve), and r = sup{|f(z)|: f € F,2 € S}.

Theorem 2.7 has the following consequence:


M-Estimators 23
EE

Theorem 2.8 Let P € P(S) and let F be a measuranble uniformly bounded


VC-subgraph class of functions. Let F, = {f —g: fg € F,pd(f,g) <
c/(log log n)?*°} and F, = {f—9: f,g © F, p3(f,9) < ¢/(loglog n)!+*} for
somec<0 and6>0. Then
1

(log log n)?||vallz, —a0.0 and ||vn|lz, a». 0. (2.19)


Proof It suffices to show, for the first limit, that for all e > 0,
fore) ’ 1
PEE Lay n)?||vn||z, > €} < 00.

We can replace n~ log log n by 2-* log k and F/, by Fi, so that the above
series is dominated by

DoPr{ may IU) — PPay > «240g kt}.


Qk <n<2kt1

Inequality (2.18) gives

wa, PrUll So(FOE) — Polley, > e2°-974/(og KE > 0


qkt1 5
k-2 2

since

M?/a > e*(log k)!*°/4c and Mn? > €2-/? /(log k)?. (2.20)

We can apply Ottaviani’s inequality (e.g. Araujo and Giné (1980), page 111)
and get the series above dominated by
gk+1

2 = Pr{ll (F(X) ~ Pll, > 2°-/(log 4).


This series is convergent by (2.18) applied with M = e/2(log k)? and
a < c/(log k)**°, essentially as in (2.20). A similar proof gives the second
limit. a

3 The a.s. Bootstrap of M-estimators

In this section we prove an a.s. bootstrap CLT for M-estimators under con-
ditions close to the “non-standard” conditions of Huber (1967) and those of
Pollard (1985). The proofs are based on methods from these two papers,
Arcones and Giné

Theorem 2.1, Alexander’s (1984) exponential bounds (through Theorem 2.8)


and the “square root trick” (LeCam (1986); Giné and Zinn (1984, 1986)).
M-estimators can be formally defined in two ways, either by maximization
(minimization) of P,g(-,@) for some criterion function g, or as solutions of
P,,h(-,8) = 0 (or almost 0). As in Huber (1967), we will treat both types
separately.
Let P be a probability measure on (S,S), let © C R¢ be a Gs set such
that 0 € O°, and let
g:SxO-R
be a jointly measurable function. We let

G(0) = Pg(-,9), Gr(9) = Prg(-,9) for @E ©

where P,, is the empirical measure based on {X;}%, 7.7.d.(P). We make the
following assumptions:

(A.1) G(0) = supseo G8).


(A.2) There is a symmetric positive definite bilinear form Ag such that,
for 9 in a neighborhood of zero,

G0) = G0) — 5Ac(8,8)+ o(16]?/10g |og I). (3.1)


Without loss of generality we will take
Ag = i, i.e. Ac(6, 0’) =e
(A.3) There exists A : S > R? such that P|A|? < oo
and such that, if we let
r(x,0) = |6|-*[9(z,9) — g(z,0) —8- A(z)],z € S,O0 #0
r(z,0) =0
then there is m < oo and functions r;(z,0),7 = 1,...m,
such that r = 0", r; and for some K > 0 the classes
of functions F; = {r;(-,0) : |@| < K} are measurable
uniformly bounded VC-subgraph classes.
(A.4) Pr?(-,0) < 1/(log |log |@||)?*+® for some 6 > 0,
all: = 1,...,m and all 9 in a neighborhood of 0.

Remark 3.1 Conditions (A.3), (A.4) do not imply that g is differentiable,


but they do give a sort of uniform stochastic differentiability of v,(g) at 0
(Pollard (1985)) since

lim lim sup EF sup |v,r(-,@)| = 0


630 noo |0\<6

by the asymptotic equicontinuity condition associated to the CLT for empiri-


cal processes (Dudley (1984); see also Giné and Zinn (1986)). Note that (A.3)
implies that F := {r(-,0) : |@] < K} is P-Donsker for all P (e.g. Alexander
(1987)).
M-Estimators

The estimator 6,, of 8(= 0) is defined as any random variable 0, satisfying

Prg(-, On) re o(n~*) - pes P,g(-, 9). (3.2)


€O
where o(n~") is independent of w. Such a random variable always exists if
(S,S,P) is complete by e.g. the section theorem in Cohn (1980, Corollary
8.5.4, page 286). Similarly, the bootstrap ( of 0, is defined, for each n € N
and w € 2, as a random variable satisfying

P2g(-,02) + 0(n-*) > sup


0€9
P?g(-,9). (3.3)
We also assume the following consistency properties:
(A.5) 6, —a.s. 0,
(AG) 1020, (@) Sp, Oa.
Sufficient conditions for consistency are given in Theorem 3.5 below.

Theorem 3.2 Let g,P,9n,9n satisfy (A.1) - (A.6), (3.2) and (3.3). Then
Jim £(n¥/7(6% — 6,(w))) = a. lim £(n¥/76,) (3.4)
which is N(0, Ag'(CovA)Az').
Proof We proceed in three steps.

Claim 1 There exists c < co such that, letting a, = (n~! log log n)!/?,

lim sup |6,|/an <c a.s. (3.5)

Pr{\6,|/an < c} a.s. 1. (3.6)


Proof Since 0, —a.s. 0 ((A.5)), (3.1) implies

G(On) — G(0) < —|6,|?/4


for all n > no, for some no < 00. Since P|A|? < oo, the law of the iterated
logarithm (LIL) on R? implies lim sup,,_,., |(Pn—P)(A)|/an < da.s. for some
d < co. By (A.3), the class F verifies the LIL (Dudley and Philipp (1983); see
also Alexander and Talagrand (1989)) so that limsup,_,.. ||Pn —P||F/an < d
a.s. for some d < oo. These observations, together with (3.2) and the
definition of r give that, for n large,

Ooh Pr(g(-, On) aa 9(-,0)) atso(n-")

= (Pa—P)(9(-,9n) — 9(-,0)) + P(G(-) — 9(+r0)) + o(n~)


< (Pa —P)(On-A(-) + [Onkr(-,On)) — 1@nl?/4 + o(r-)
< K|On\an — |On|?/4 + 0o(n*) a.s.
26 Arcones and Giné
a

for some K < oo. This inequality implies (3.5) for c = 4K.
Let A, = (P, —P)(A) and A, = (P, — P,)(A). As mentioned above, for
some ¢ < 00,
lim sup |A,,|/an = c a.s. (3.7)

Now, the bootstrap CLT for A(X) (Theorem 2.1, or Bickel and Freedman
(1981)) gives that for any c > 0,

Pr{|A,|/an <c} as. 1. (3.8)

Similarly, since F is a.s.-bootstrap P-Donsker,

Pr{\|P, — Palle/an < c} as. 1. (3.9)

Now, proceeding as above, we have, for large n (by (A.5), (A.6))

0 < [BulllAnl + [al + [LPs— Palle + lIPa— Pllc]— l6nI?/4 + (2),


and this, by (3.7)-(3.9) and the LIL for F, gives (3.6).

Claim 2 The following limits hold for any c < co:

(loglogn)/? sup |v_(r(-,8))| a.s. 0 (3.10)


|8l<can

and

(log log n)!/? | a l?n(r(-,0))| 2 OPr — a.s., Pr— a.s. (3.11)


8\<can

Proof (3.10) is imediate from (A.3), (A.4) and Theorem 2.8 applied to F;,i <
m. In order to prove (3.11) using Alexander’s bound, we must estimate the
size of

onsup{Py(ry(-,8))?
= :[8]<can}.
For this we use the “square root trick” inequality in Giné and Zinn (1984,
Lemma 5.2), which gives

Pr{ay, > 4(log logn)~@*®)} < (log log n)” exp(—n/ (log log n)'+5')

for some T > 0, 6’ > 0 and all n large enough. Therefore, eventually

an < 4(log log n)— (+9) a.s.


This, just as (A.4) in the non-random case, allows us to apply Theorem 2.8
conditionally on the sample, and obtain (3.11) the same way as (3.10).
M-Estimators ped|
SSSSSS SSS SS

Now we proceed to the proof of Theorem 3.2 using the above claims and
an argument of Pollard (1985). By (3.1) and the definitions of A,, A,, 9,
and r we have: ,

Os = Pa(g(-; On) - 9(-5 A, + A,)) + o(n-*)

= (Pr— Pat Pa—P)(9(-,6n) — g(-;An + An))


13 P(g(-, On) — 93AR + A,)) a5 o(n-*)
Se eee Pi Ne Ae AG)
+ [Onkr(-,6n) — [An + An|r(-, An + An)]
. |6.|/2 + |A, at Ta al

+ (On|?/log |log |6n||) + o(|An + An|?/log [log |An + An||) + 0(n7?).


So, collecting terms and multiplying by n

nlOn w/e AG 2 = ni!216,||n1/?(B,, Pie P)(r(-, n))|


inlA, Ann A(P Pi Pe P)(r(-jAy + Ay)
A n o(|4,|?/ log |log 18.11)
+ no(|A, + A,|?/log |log |A, + Anl}) + 0(1) (3.12)
Now (3.6), (3.10) and (3.11) immediately give

n'/?2)6,,||n1/2(P, — Pa + Pa — P)(r(-,4n))| 2p, 0 a.s.


(3.7), (3.8), (3.10) and (3.11) show

nV2iA, + A,||n!/?(b, — Pa + Pa — P)(r(-,4n + An))| p, 0 a-8.

And the “o” terms also tend to zero in Pr, Pr —a.s. by (3.6), (3.7) and (3.8).
Thus, replacing these limits in (3.12) gives

ni/2(6, =A, = An) — p, 0 a.s. (3.13)

Nex we see that A, can be replaced by 9, in (3.13) again as a consequence


of Theorem 2.8. The argument leading to (3.12), applied to the inequality

0S Pa(9(-, On) — 9(-;An)) + o(n™)


(which holds by the definition of 8,), yields

n\n —Anl?/2 < n?|Anllyn(r(-,


An)
+ n/?16,||Yn(r(-,9n))I.
+ no(|On — 4|?/ log |log |An||)
+ no(|A,|?/ log log |An||) + 0(1).
28 Arcones and Giné
a LS

All the terms at the right side of this inequality tend to zero a.s. by (3.5),
(3.7) and (3.10). Hence,

n¥?(6, — An) as. 0


and therefore, the limit (3.13) becomes
n/2(6, — 0,) —nV?A,, 5, 0 a.s.

By the bootstrap CLT in R?,

n/2A,, = in(A) +2 N(0,CovA) a.s.

and therefore, {n1/2(6,, — 0,,)} converges in conditional law a.s. to the same
limit. a

Remark 3.3 If 0(|4|?/ log |log |6||) is replaced by o(|6|*) in (3.1), if (A.4) is
replaced by Pr?(-,@) — 0, if the condition on {r(-,6) : || < A} in (A.3) is
replaced just by stochastic equicontinuity at 0, i.e. by

lim lim sup EF sup |v,(r(-,@))| = 0,


ot Eom Tero
and “a.s.” is replaced by “in probability” in (A.5) and (A.6), then the boos-
trap CLT holds in probability, that is

dpr.(n/?(6, — 6,), N(0, Ag'(CovA)Ag') — 0 in probability

(Romo (1990)). These modified conditions are quite close to Huber’s (1967)
and Pollard’s (1985).

Next we give sufficient conditions for the consistency hypothesis (A.5) and
(A.6) to hold. They are slightly stronger than those in Huber (1967). Let,
as above, P € P(S), O C R¢ be a Gs set such that 0 € O°,g: Sx OR
jointly measurable, and assume, without loss of generality, that G(0) = 0
(where G(#) = Pg(-,6)). The conditions that will imply consistency are as
follows:

(B.1) Plg(-,@)| < 00 for each 0 € O.


(B.2) There exists M > 0 and a non-negative function b(6),4 € O,
such that if C = {6 € ©: |6| < M} then
(i) infggo b(0) sb. > Oy
(ii)
Psupgge g(-,9)/b(0) =a < 0, and
(it)supeec la(-,)1 € La(P).
(B.3) supgec,ajye G(8) < 0 for all e > 0.
(B.4) The class of functions G = {g(-,0):0€ C}isa
measurable P-Glivenko-Cantelli class.
M-Estimators
———— 29

Remark 3.4 A class of functions G is P-Glivenko-Cantelli (P-GC) if

yea ae P\l¢ as. 0.

There are several necessary and/or sufficient conditions for G measurable to


be P-GC (e.g. Giné and Zinn (1984), Dudley (1984) and references therein).

Theorem 3.5 If P,G and © verify conditions (B.1)-(B.4) with G(0)= 0,


and if 6, and 6, verify the inequalities (3.2) and (3.3) then

eh (3.14)
and ’
Be'6,, ii -Onacas (3.15)
Proof — (3.14) is proved in Huber (1967). To prove (3.15) we first observe
that by the bootstrap weak law of large numbers in R and by (B.2),

P, sup
gC
9(-,0)/0(0) >, Psup 9(-,0)/0(8)
gC
as. (3.16)
and ?
Prg(-,9) +p, Pg(-,0) a.s. (3.17)
So, by (3.16), Pr —a.s., with Pr-probability tending to 1 as n > 00 we have
that for all 6 ZC,

Pig(-,9) < (a+ e)0(0) < (at+e)b<0.


On the other hand, by (3.17) and the definition of Bn,

Pr{P,g(-,9,) > —6} 9 (3.18)


for all 6 > 0. Hence ie.
Pr{0, € C} as. 0. (3.19)
Now, since G is P-GC ((B.4)) and (B.3)(iii) holds, the bootstrap Glivenko-
Cantelli theorem in Giné and Zinn (1990) shows that

|0€C,|6|>e
sup Pyg(-,6)— sup
GEC, |9|>e
G(8)|

< sup |(P,—P)g(-,0))|


G€EC,|O|>e
>, 0as.
for all e > 0. But by (3.18), (3.19) and (B.3) this is impossible unless
Pr{|0| > €} as. 0. |

We now consider the second type of M-estimators. Let © C R® with


0 € ©°, PE P(S), and h: S x © +R? jointly measurable be such that
30 Arcones and Giné
EE

(C.1) Ph(-,0) = 0 and P(h(-,0))? < 0.


(C.2) H(0) := Ph(-,6) is “strongly” differentiable at zero with
non-degerenate first derivative. Assuming (without loss of
generality) H’(0) = IJ the differentiability condition is as follows:

H(0) — H(6') = 6 —@ + o(|6 — 6'|) as 8 4 0 and &’ > 0. (3.20)

(C.3) The classes of functions F; = {h;(-,@) — hi(-,0) : |0| < M} for some
M > 0,i =1,...,d, where h; denotes the 2 — th coordinate
of h, satisfy the following property: there is m < oo and uniformly
bounded measurable VC-subgraph classes G;; = {g;;(-,) : |@] < M}
such that h;(-,0) — hj(-,0) = i, 9:;(-,
9).
(C.4) For i < d,j <m,Varpg;;(-,0) < 1/(log| log |6||)!*+° for some
6 > 0 and @ in a neighborhood of 0.
(C.5) There exist symmetric (P”-completion) measurable functions
6(21,...,2n) defined on the support of P”,n € N, such that
if. 0, = O(Xtyeg Aw) shen

n/2 P h(-, On) —,5.9 and 0, —,.,, 0. (3:21)

(C.6) For almost every w € 2, there exist symmetric random


variables 07 = 0 (X?, 0, ak, sucn.thab

nil? Peh(., 6”) >, 04.s. and 6” —0,(w) +p, 0 a.s. (3.22)
As before, consistency ((C.5)), ((C.6)) will be handled separately.

Theorem 3.6 Under (C.1)-(C.6),

Jim £(n¥??(6, — 8)) =a.s. lim £(n1/?0,) (3.23)


which is N(0,Covph(-,0)).

Proof First we show that

n'/? Ph(.,6,) + n'/?P,h(-,0) a.s, 0. (3.24)


Since H’(0) is not degenerate and 0, 4,5. 0 ((3.20) and (3.21)) there exist
ce < and no(w) < 00 a.s. such that for n > no, clOn| < |H(0,)|. Hence, using
(3.21) and the LIL for random vectors in R? and for empirical processes (by
(C.3), the P-empirical process indexed by F satisfies the LIL - see e.g. Dudley
and Philipp (1983)), we have

| nel, |< n¥?|H(6,)|


< n'?|(P, — P)h(-, O,)| + n¥/?|Prh(-,On)|
M-Estimators
a 31
SE EE eS eae ee

S n?\(P, i P)R(-, 0)| ateri/2\(B, re P)(AC-, On) + h((-,0))| trn™/?|P,A(-, On)|

- = 0((log logn)'/”) a.s. (3.25)


Hence, for some c < 00,

lim sup n1/?/6,|/(log log n)!/? < ca.s. (3.26)

Theorem 2.8 gives that for all c > 0,

sup |n'/?(P, — P)(h(-,9))| 2.2. 0 (3.27)


\@|<can

which, by (3.26), implies

ni P,,ThA P)(h(-, On) Sk h(-,0)) a.s. 0.

Hence, by (C.1) and (3.21),

n'!?(Ph(-, On) + Prh(-,0)) as. 0


which is claim (3.24).
By the bootstrap CLT for empirical processes Deore 2.1) and (C.3),
for every sequence b, — oo.

Pr{n'?|(Pq — Px)(R(-,4n) — A(-,9))| $ Pn} ra. 1.


We can use this and (3.22) in a decomposition similar to (3.25), and obtain
that there is c < co such that

Pr{n'/?|6,,|/(log log n)/? < c} 44.5, 1. (3.28)

We consider now, in analogy with (3.24),

n¥?(Ph(-, bn) ay P,,h(-,0))

= nl? Ph(-,8n) — n¥/?((Pa—Pa) + (Pa —P))(A(-,4n) — (-,0))


The first term at the right hand side tends to zero in Pr, Pr —a.s. by (3.22).
The next term, %(h(-,4n) — h(-,0)), also converges to zero in Pr, Pr —a.s.
by Theorem 2.1 (the bootstrap CLT implies, by (C.4),
limg_,o lim sup,, Pr{supjjes |Pp(h(-,0) — h(-,0))| > e} = Oa.s. for all « > 0,
and 6, —p, 0 a.s. by (3.21) and (3.22)). And the same is true for the last
term v,(h(-,4,) — A(-,0)) by (3.27) and (3.28). Therefore,
n/? Ph(-,O,) + n/?P,h(-,0) +p, 0 a.s.

This limit and (3.24) give, by subtraction,

n3/?(H(6,) — H(0n)) + n/?(Pp — Pa)h(-,0) +p, 0 a.s. (3.29)


Arcones and Giné

This shows that for almost every w the sequence {n}/?(H(6”) —H(8,(w)))} is
Pr- stochastically bounded: it converges weakly by the bootstrap CLT in R?.
Since 6, — 9, —p, 0 a.s. ((3.22)), hypothesis (C.2) implies Pr{|On — On |>
2|H(8n) — H(8n)|} as. 0. Hence the sequence {n/2|§% — 0,(w)|} is Pr-
stochastically bounded, w —a.s. Then, n/?0(|6, — 0,|) +, 0 a.s. and (3.29)
and (3.20) give

ni/2(6, — 0,) + n/?(P, — P,)h(-,0) 4p, 0 a.s.


Now (3.23) follows by the bootstrap CLT in R?. gE

The consistency conditions of Huber (1967, Case B) not only give con-
sistency of 6,,, but also consistency of 6,, i.e. (C.5) and (C.6). The proof is
similar to that of Theorem 3.2.

Theorem 3.7 Ifh is jointly measurable and if

P,,h(-,9n) as. 0, and P,,h(-, On) —p, 0 a.s.

for random variables 0,,9,, then


Osos 0 ad Oe One O g's.
assuming the following conditions hold:

(1) h is P-a.s. continuous in 6,


(2) H(@) exists for all 0 € © and has a unique zero at 6 =0,
(3) there exist a continuous function b on ©, bounded away
from zero and M < oo such that, letting
C := {6 : |6| < M}(9O, we have
Psupyee |h(-50)|/8(8) < 00,intoge |H(0)|/6(0) > 1 and
P supgge |h(-,8) — H()|/b(8) <1, and
(4) Psupgeg |h(-,8)| < co and {h(-,0):6 EC} isa
P-Glivenko-Cantelli class of functions.

Example 3.8 (Bootstrapping spatial medians).The spacial median of P €


P(R*) is the value 0(P) of 6 that maximizes Pg(-,@) where

g(x,0)= |x| — |x — 4.
Since the case d = 1 is already studied in Bickel and Freedman (1981) (their
proof contains some inaccuracies that can be fixed using e.g. Theorem 2. 8)
we will assume d > 2. The set of medians of P, which is convex, consists
of a single point unless P is concentrated on a ee (and has more then one
M-Estimators

median there). We assume P is not concentrated on a line and has a bounded


density in a neighborhood of 6(P). From the fact that for z # 0

—— ae
|| — |x — 6] - (SE + th — ee
— eae Ce, a 3-—
—_—_| < 2—_ el
jz] 2\x| |x jz? a|
it follows that

G(0) = G(0) - P (eo a a +0((6))


if d > 3 and the same expression with 0(|6|°) replaced by 0(||9| log |4||) if
d = 2 (note that 6(P) = 0 implies P@- x/|x| = 0 for all 6 € R*). Hence (A.2)
holds. Pollard (1984, page 153) shows that

F = {r(z,9) = [|x| — |x — 6| — 0 - x/|x|]/|6| : || A 0} U{r(z,0)= 0}


is the sum of two VC-subgraph classes, so that (A.3) is satisfied. Moreover,
it is easy to see that

Ir(z,8)| < 2A 2|6|/|z].


Hence, if P has a density f on {|z| < 6} bounded by M, we have
Er?(ax,0) < K(d, M)(|9| v |9|”)
where K(d,M) is a constant depending on d and M. So, (A.4) holds too.
Finally, taking 6(@) = || conditions (B.1)-(B.3) are satisfied, and it is easy
to see (as in Pollard (1984, page 153)) that {|x| — |x — 6| : |0| < M} isa
bounded VC-subgraph class, hence a Glivenko-Cantelli class. The conclusion
from Theorem 3.2 is then that the central limit theorem for the spatial median
can be bootstrapped a.s. assuming P is not concentrated on a line and P has
a bounded density in a neighborhood of @(P) (since in applications @(P) is
unknown, this last condition should be strengthened, in practice, to P having
bounded density).
Example 3.9 (k — meansinR). Given P in R a k-mean of P,1 < k < «w,
is an ordered set of k points in R, 0, <... < 6,, that minimizes the function
P{minj;<,(x — 0;)? — z?] (Hartigan (1978), Pollard (1981, 1982), Cuesta and
Matran (1988)). This notion has also a meaning in R? and even in infinite
dimensions. Since our purpose is only to illustrate the use of the previous
theorems, we restrict our discussion to the simpler one-dimensional case. We
make the following assumptions on P:
(1) Pre =o
(2) P has a unique k-median y and it consists of k distinct points
(f41,---) fe) With py <... < pe.
_ (3) P has a differentiable density at the points
(mi + Hit1)/2, t=1,...,k—-1.
Arcones and Giné

We let O = {(0,,..., 9x) (S R* : 0; < eee < 6x}, g(a, 9) = minj<;(x ree 6;)?
(by hypothesis (1) the —z? term is unnecessary) and G(@) = Pg(-,@), 0 € O.
By a compactness argument there always exists, for each (21,...,2,) € R”,
a point @ in © that minimizes (n~! 7%, 62,)g(-, 9). Then, by the section
theorem in Cohn (1980, Cor. 8.5.4) there exists a universally measurable
selection 0(x1,...,%n). We let 0% := 0(X1(w),...,Xn(w)) be our estimator
of 4, and call it 6,. For each w € N and n € N,O, = OY := O(X4,,...,X%,)
is the bootstrap of 0,.
Pollard (1982) proved a CLT for 6, — yu. We will show here that Theorem
3.2 implies that Pollard’s CLT can be bootstrapped a.s. under conditions
(1)-(3). (See Romo (1990) for the bootstrap in probability in R? and under
somewhat weaker conditions.)
For consistency we follow Cuesta and Matrdn (1988). They show that
if Z,,Zo are B-valued random variables, B a uniformly convex Banach
space, such that Z, a5. Zo,Zo has a unique k-mean pw := 0(Z) and
Emini<é ||Z, — pill? > Emini<s ||Zo — ull’, then 0(Z,) — 0(Zo). They
apply this result and a Skorohod representation to show consistency of 0,
under hypotheses (1) and (2) that is, condition (A.5). This argument boot-
straps as follows: Let C = {(—oo,t] : t¢ € R}. Then by the bootstrap law of
large numbers,
||P’ — Pllc +p, 0 a.s.and
PS min(s — pi)? Sp, Pmin(x — p;)? a.s.
Therefore for each w fixed and for every subsequence there is a further sub-
sequence {n’} for which these two limits hold a.s. Let w’ be in the set where
convergence occurs (a set depending on w). By Skorohod’s representation
there are random variables Y“*’(w”),n = 0,1,..., on (Q”, &”, P”) such that
P"0(Yv")-} = Pe(w!) and P"”0(¥""")-! = P, and Y“" Yo" Plas.
Also P" minsg(¥"”
—pi)?= Pe(w!) minsga(2—pi)? 44, Pminsee(2—p:)?,
Therefore their result gives that 0(P” o(¥4")-1) — yp that is, 6( Py (w’)) > p.
Hence condition (A.6) is satisfied.
It is easily checked that if P has a differentiable density f at (4; +
Hi41)/2,7 = 1,...,k —1, then G(@) is three times differentiable at 0 = y,
hence condition (A.2) holds. If we let My = (—o0, (41 + 2)/2],M; =
(45 + Hj-1)/2,(M5 + H541)/2], 7= 2,...,k-1, and My = ((4n_-1 + Ux)/2, 00),
and if we define A;,j =1,...,k, in the same way but replacing p by 6 then
we have:

A(z) = (—2(2 — wi), (2), -.-,-2(@ — we) Iuy(2))


and

r(x, 0) = ))(Im.a;(2))(2(0; — 8;)a + pw? — 2u:0; + 67)/|0 — p|


4J
M-Estimators 35

(Pollard (1982)). Using the obvious facts that

(a) if |u — 0| < k-*¥/? min(yi41 — y;)/2 then M;A; = ¢ unless |i — j| < 1,


(b) Iii: C [(i + wir)/2 — |0 — pl, (wi + wiss)/2] and
Ima. © [Hi + witr)/2, (Mi + Misr)/2 + [0 - ul],
we obtain
k-1
Ir(z,4)| < |u—O| +40 (Uma, + Imig. 4:)19: — 9411-
a

This inequality implies that supjg<x ||r(z, 9)||.. < 00 for all K < co and that
Pr?(-,8) < C\@ — p| for all 0, in a neighborhood of yz and for some C' < oo.
Hence (A.4) holds. For each 9, r(x, 9) is the sum of k? or less functions which
are linear on an interval and zero outside it. Hence condition (A.3) is also
satisfied.
Theorem 3.2 applies and it follows that, under conditions (1)-(3) on P,
the CLT for k-means of Pollard (1982) bootstraps a.s. (for d = 1).

Theorem 3.6 applies to monotone functions h with a significant simplifi-


cation of the hypotheses:

Theorem 3.10 Leth: R—R be a bounded monotone function and let P. be


a probability measure on R. We let h(z,0) := h(x — 9), c,0 ER. Assume:

(D.1) Letting H(@) = Ph(-,0), we have H(0) = 0, H'(0) =1 and

lim (H(r) — H(s))/(r - 8) = H'(0). (3.30)


(D.2) There is a neighborhood U of0 such that for all 0 € U,
Varp(h(x,6) — h(x,0)) < 1/(log |log |6||)'*+* for some 6 > 0.
(D.3) If C is the set of discontinuity points of h(x,0(P)) = h(x,0)
and Cs ={x:iye€C st. |x —y| < 6}, then P is continuous on
Cs for some 6 > 0.

Then

lim £(n”/?(6,, — On)) =a.s. lim £(n/70,) = N(0,Varph). (3.31)

where

6, = inf{0: P,h(-,0) >0} and 6, = inf{6: P,h(-,0) > O}.


Arcones and Giné

Proof By (D.1), 0 is the only zero of the function H(@). Since


P,,h(-,€) + H(e) > 0 a.s. for all e > 0 it follows that eventually 0, < € a.s.
Likewise 0 > —e a.s., i.e.
6, —a.s. 0. (3.32)
The same argument using the bootstrap law of large numbers gives

bes —p, 0 a.s. (3.33)

Let |0| < 6/2. The sample points X; satisfying X; — 0 € Cs,2 are all a.s.
different by continuity of P on Cs. Moreover, h(x) is continuous at c = X;—0
if X;— 0 ¢ Cs/2. Hence the function P,,h(-,4) has a jump at 0 of size at most
2||h\|../n a.s. This proves, by (3.32) and (3.33), that

nil? P_h(., On) —,.5, 0 and nV? P h(-, On) —p, 04.8.

So, conditions (C.5) and (C.6) are satisfied.


(D.1) is just (C.1) and (C.2). (D.2) is (C.4). Finally (C.3) is satisfied
because of the monotinicity of h: any class of sets ordered by inclusion is
obviously VC. B

Theorem 3.6 and its corollary Theorem 3.10 contain the bootstrap CLT
for the most usual M-estimators in particular for the median, Huber’s esti-
mators, etc. For instance, Theorem 3.10 applies to

h(x) = —kI(~co,-2)
(2) + t][-n,ay(2) + KI(k,00)(2)
under minimal conditions on P, namely that P{k} = P{—k} = 0 and
P(—k,k) # 0 (assuming Ph(-,0) = 0).

4 The bootstrap of differentiable functionals


Let T be a real valued functional defined on a set P C P(S) and let F
be a class of measurable functions on (S,S) with everywhere finite envelope
F (F(s) = sup{|f(s)| : f € F}) contained in £2(S,S,P) for each P € P.
Dudley’s (1990) definition of differentiability is as follows: T is (Fréchet)
differentiable at P for F with derivative f = fpr iff Pf? < co and

T(Q) =T(P)+ ffa(Q- P) + o(|]Q — Plz) (4.1)


as ||Q—Pllz > 0, Q € P. If S = Rand F = {I(_..,4 : t € R} this is the usual
definition of Fréchet differentiability of statistical functionals. Dudley (1990)
further defines: T is C* at P € P for F if there is a ||- ||- neighborhood U
M-Estimators 37
LLL SS

of P in P such that for all Q € U, T is differentiable at Q with derivative fo


uniformly continuous in Q in the sense that

[fr — fallev
a sup{| [(fr— fo)d(@: — Q2)|/l1Q1 — Qalle : Q1,Q2 € U, |1Q1 — Qallz > 0}
tends to zero as ||R — Q||- — 0 for R and Q in U. Here is Dudley’s boot-
strap limit theorem for a single functional T (he also considers families of
functionals).

Theorem 4.1 (Dudley (1990)). Let P be a convex class of probability mea-


sures containing all the p.m.’s with finite support. Let F be a measurable
P-Donsker class of functions on S such that PF? < co. Let T, defined on
P, be aC" functional for F at P such that the derivatives fg,r are in F for
all Q in a neighborhood of P (for ||- ||-). Then

Jim £(n?(P(P,) — T(Pa))) =e. Jima, £(n4??(T(Pq) - T(P)))


— N(0, Varpfpr).

(In the second limit, since T(P,,) may not be measurable, weak conver-
gence is in the sense of Hoffmann-Jgrgensen -(2.6) with [*(F) replaced by
R.)
The conditions of Theorem 4.1 can be relaxed if we allow the bootstrap
CLT to hold only in (outer) probability. Moreover, the parametric or semi-
parametric bootstrap also holds if F is UPG.

Theorem 4.2 Let P be a class of probability measures on (S,S) containing


all the p.m.’s with finite support. Let T be a functional on P differentiable
at P € P. Let F be a measurable P-Donsker class of functions on S. Then

dpr-(n'/?(T(P,) — T(Pa)), Gp(fp,r)) > 0 in Pr*.


Proof Since n¥/2 f fd(P, — P,) +2 Gp(f) a.s.(f = fpr as above) by
the bootstrap CLT in R, it suffices to show

Pr*{E(\n¥/?(T(B,) — T(Pa)) — :fd(B, — P,)|A1) > 6} 30 (4.2)


for all 6 > 0. Consider the set
A(n,M,6) = {n¥/?||P, — Pile < M} (dar: ((lPnlle, IGrllz) < 6/3}.
On this set

Pr{|linlly > M} < 6/3 + Pr{||Gpllz > M — 1}.


38 Arcones and Giné
a

Given n > 0, let M be such that Pr{||Gp||- > M —1} < (6/6) A (n/2)
and let no < oo be so that for n > no, Pr*A(n,M,6)° < 7, which exists
because F is P-Donsker and pr-bootstrap P-Donsker (Theorem 2.1). For
n > no, w € A(n,M,6) and w’ such that ||i%(w’)||- < M, we can apply
(4.1) to T(P,) — T(P) and T(P,) — T(P), to get (assuming, without loss of
generality, that |o(¢)| is monotone in t)

In/9(T(Pq) —T(Pa)) — ffa(Px — Pa)


= |n¥70(\|Px — Pile) — n'0(||Pa — Pll)!
<n? |o(|| Pa — Palle + ||Pa — Plle)| + 2° lo(\|Pa — Pllz)|
< 2n¥/?\0(2M/n/?)| — Oasn — oo. (4.3)
Hence, from some n on, the probability in (4.2) is dominated by

Pr* A(n, M, 6)° + Pr*{A(n, M, 6) OUP {inl > M} > 6/2]},

a sequence whose lim sup as n — oo is not larger that v. a

Remark 4.3 The above proof works also for T taking values in a separable
Banach space B if we further assume that fp7(X), with £(X) = P, satisfies
the CLT in B.

The same proof, using Theorem 2.3 instead of 2.2, gives:

Theorem 4.4 Let F be a measurable UPG class of functions on S. Let


{Py : 8 € O} be a family of probability measures on (S,S) indered by an
_open setO CR. Let T be a map defined on a set P C P(S) containing {Po}
and the p.m.’s with finite support, such that: (a) T(Ps) = 0, 8 € O, and (b)
T is differentiable at Ps. Let 0, = T(P,,). Then, lettingG =FUF-F, we
have the following implication:

\|Po, — Pall — 0 in probability and

{n1/?||
P,, — Pa||%-} stochastically bounded
=> dpz.[n?(T(Px) — 8m), Gr4(f)] + 0 in Pr’,
where P® is the empirical measure constructed from X$. 3:3.d.( Pe).

Note that: (a) © could be a subset of a separable Banach space containing


9 in its interior, and the theorem would still hold, as in Remark 4.2. (b) the
consistency conditions on Py, are quite natural: for any F UPG they are
certainly satisfied by P, hence, if {Ps} dos not satisfy these two conditions,
resampling from Ps, may not make much sense.
M-Estimators

Gill (1989) and Sheehy and Wellner (1988) approach (Hadamard) differ-
entiability via Theorem 2.5. Using versions P, of Pay P,, of P, and Gp of Gp
so that simultaneously ||7,,— Gpllr — Oand dpi (Un, Gp) — 0, a.s. we have
in (4.3) n¥/?0(||P,—P\lz) 3 0 @ —a.s. and, further using Dudley’s theorem
to get that for each&fixed (in a set of Pr-measure one) ||v,— Gate > 0a.s.,
~ we also have n1/?o0 (Pn — P,||- + ||Pa — P\lz) a.s, 0. The bootstrap CLT in
pr for n'/?(T(P,) — T(Pa)) is then obtained by passing back to the original
variables via Theorem 3.5 of Dudley (1985). Making this argument precise,
particularly if T(21,...,2n) = T(n~! O%, 6;,) is not measurable, requires
some extra care since one must prove almost uniform convergence of the
functional applied to the versions, something not always handled with rigor.
A question of some interest is how weak a differentiability requirement can
we impose on T and still obtain a pr-bootstrap limit theorem. The following
definition (which is a slight modification of one in Gill (1989) and Sheehy
and Wellner (1988)) will help to provide an answer.
Definition 4.5 Let F be a set of measurable functions on S, PE P, Ca
family of convergent sequences x = {zn} C I°(P) and H = linear span of
Usec({tn} Uf{lim z,}). Let T = {T,}%o be a family of R¢-valued functionals
defined on subsets of I°(F) such that the domain of Ty contains P and the do-
main of Tp, contains the set {P+n/*z,, : rp is the n-th term of x € C}. Then
T is n—\/?-differentiable along C at P if there exists a linear continuous map
Tt : H — R?, the derivative of T at P, such that
OP +n en). — TolP)— nT5(z,,)] 0 (4.4)
for all {z,} €C. By continuity, Tp(x,) can be replaced by Tp(xo) in (4.4) if
to. = lim7z,.
Denote by M¢ the set of measures of finite total variation on (S,S)
which are in /°(F), and by Pe the set of probability measures in M¢ (i.e.
Q € Pe iff ||Of|lz < 00). For 2, € Mg, 2, > & will mean ||z, — 2||¢:=
sup yer |tn(f) — 2(f)| — 0. We let C,(F, ep) be the space of uniformly con-
tinuous functions on (F, ep), where e2(f,g) = P(f —g)?.
The first and second parts of the following theorem are taken from Sheehy
and Wellner (1988). It strictly contains Theorem 4.1 up to some measura-
bility (which can possibly be removed), and also the main result in Yang
(1985).
Theorem 4.6 Let F be P-Donsker, let
C = {{x,} : 2, € Mr, P+n-V?2, € Pr, lim tn € C,(F, ep)} (4.5)

and let T = {Tn}%9 be a sequence of maps from subsets of I°(F) into R?


such that the domain of Ty contains P and the domain of T,, contains Pr for
n =1,2,..., and such that T is n-'/?-differentiable along C at P. Then
40 Arcones and Giné
eee eee eeee errr rrreer

(i) n?(Tp(Pn) — To(P)) +2 Tp(Ge).


(ii)
If F ts measurable and if the maps
(s1,..-)Sn) + Ta(n7! D2, 5s,) from S” into R® are
measurable, n =1,..., then dpy-(n/?(Tn(Pn) — Tn(Pn)), Tp(Gr))
is measurable for all n and

dpr+(n'/?(Tn (Pn) — T,(Pn)),Gp) 3 9 in probability. (4.6)

(iti) If F is measurable and PF? < co, if T is


n,/? differentiable at P along sequences {xn,} satisfying the
definition of C with n replaced by nz, and if the sequence
N=N, is regular (N, 7 00,Nn/Non > ¢>0,n EN),
and satisfies N, = 0(n/loglogn), then

N*!?(Ty(Pyn) — Tn(Pn)) 22 Tp(Gp) 4.8. (4.7)


Remark 4.7 If F is UPG no regularity for the sequence N in (iii) is required:
see Proposition 2.2 and the next proof. For F UPG it is also possible to prove
an analogue of Theorem 4.3 under the weaker differentiability of Theorem
4.6

Proof of Theorem 4.6 ‘To prove (i) we set in Theorem 2.6 (V,d)

Vo = C(F,ep),Ha(2) = n/(T,(P + nz) — To(P)) and Holy) = Thy),


=e (I-(F), || lz), Zn = Yn, 4 = Gp, Vn = {z eMep JP hae S Pz},

xz € V,,y € Vo. Then (i) follows directly from that theorem. (This nice proof
is Wellner’s (1989).)
The proof of (ii) uses the representation Theorem 2.5 twice in the way
outlined after Theorem 4.4. Since (ii) is just Theorem 3.3A in Sheehy and
Wellner (1988) and their proof is accurate under the present measurability
conditions, we will only prove (iii). (Note that dg, is in fact a sup over
a countable number of functions since R¢ is o-compact and the Lipschitz
functions on a comptact set are a separable set for the sup norm.)
Under the hypothesis of (iii) the empirical process indexed by F satisfies
the compact law of the iterated logarithm (Dudley and Philipp (1983), The-
orem 1.2) with limit set K = {u,(f) = f fhdP : f h?dP <1,h in the £L,(P)
- closed linear span of { f — Pf}} (Kuelbs (1976)). Note that K C C,(F, ep).
So we have that Pr-a.s.

every subsequence of {v,(w)/(2loglogn)'/?} has


a further subsequence that converges in I°(F) to
some Xo € K (xo depending on the subsequence). (4.8)
Note also that Proposition 2.2, implies

dpr+ (Onn, Gp) > 0 a.s. (4.9)


M-Estimators
oo
e Shay a 41
n

Let w be such that (4.8) and (4.9) hold, and let g,,n = 0,1,..., be the
perfect functions of Dudley’s theorem for {vx.,} and Gp. Then

\l7in(Gn(2")) — Gp(go("))|x + 0 Pr’ —a.s. (4.10)


Take now @’ such that (4.10) holds. Every subsequence of the natural num-
bers has a further subsequence {n’} such that

Ny /(n'/ log logn’) — c € [0, 00) and vp(w)/(2 log log n’)'/? ao

for some zg € K. Hence {if w(gn(@’)) + (N’/n’')/?v_(w)} and


{(N'/n’)'/*v,1(w)} converge to points in C,(F, ep) (where we let N’ := N,) ?
and we obtain,

NO (Dyes PR (Gut (@'))) — Trve( Prr(w))


= NAT (P+ NV? (551 a Gnt(!)) + (N'/n')? rn(w)))
—Ty(P + NP? (N'n')? r_s(w))] + Tp(GP(go(@’)))
by (4.4), (4.10) and the linearity of Tp. Hence, this limit holds along the
whole sequence {n}. Moreover, since N?/ (Ty (Px...) — Tn(P,(w))) is mea-
surable in w’ for each w (in fact, finite-valued measurable) the convergence
above takes place Pr’-almost uniformly and therefore, by Dudley (1985, The-
orem 3.5), :
N¥?(Ty(PXn) — Tw(Pa)) 2 Tp(Gr).
Since this holds for almost every w, (iii) is proved. 3

Remark 4.8 In the application to M-estimators that follows, we require


Theorem 4.6 under n~1/?-differentiability of T along a class of sequences C
strictly contained in C. Suppose F is P-Donsker and G is P-Glivenko-Cantelli
(see Remark 3.4 for the definition of P-GC class), and let

C= 147, ine Mr{\Mg, P+n/*z, € Pr,

F — lim an € CF, ep), n-™I|eallo > 0} (4.11)


Then if T is n—1/?-differentiable along C the conclusions (i) and (ii) in Theo-
rem 4.6 are also valid and so is (iii) assuming T is nz! * differentiable along
sequences {z,,} satisfying the definition of C with n replaced by nz. The
proof is as that of Theorem 4.6, with some natural changes that se sketch.
For H,(tn) = N1/?(T,(P + n'/*z,) — To(P)) and Ho(xo) = Tp(zo), we ap-
ply Theorem 2.6 with (V,d) = (I"(F) x (G)), II(c,y)Il = llellx + llelle)
Vz, = {(tn,n-/?2_)},Vo = {(z,0) : « € C,(F,ep)}. This gives (i).
The proof of (ii) is as in Sheehy and Wellner (1988) but now one
Arcones
and Giné

uses Theorem 2.5 for (vn? y,) —c¢ (Gp,0) (m P(F) x (G))
which yields (v,,n7'/?%_) 0 gx — (G@p,0) © g almost uniformly and
dar+v|(da,. nV?) (Gp,0)] — 0 almost uniformly by the bootstrap
CLT and LLN, and then Dudley’s representation theorem for each @ fixed, on
the sequence {(de, n¥29, >}, Similarly in (iii), using both the bootstrap
CLT and the bootstrap LLN, we have not only (4.9) but

dare v[(PXn» NP OR,n)s (EP, 0)] + 0 as. (4.12)


and the proof follows in an analogous way.

Remark 4.9 By Proposition 2.2 (i), statement (ii) in 4.6 also holds for any
bootstrap sample size N = N,, - oo.

Example 4.10 (On the bootstrap in probability of M-esitmaters). Let 8 C


R® be a Gs set such that @ € 6°, let g: S x © — R and let P satisfy the
consistency conditions (B.1) to (B.4) from Section 3 and, instead of (A.1}
(A.4), the following:

(A.1)’ G(0)= supgce G6).


(A.2)’ There is a symmetric positive definite bilinear form Ag such that
G(@)= G(0) — 4Ac(@,@) + of|@?) as @ — 0.
Without loss of yaar we assume Ag(@, é@’)= 6. &.
(A.3)’ There exists A : S — R® such that P|A}? < co and such that if
r(z,@) = \ol-*fola.@)— g(x, 0) —@- A(z)],2 € S\@ FO,
r(x,0) =0
then, for some K > 0, the class of functions
F = {r(-,0) : |@] < K} U{Ag,...,Aa}
is a measurable P-Donsker class
(A4)’ Pr?(-,@) — as @ = 0.
In order to apply Theorem 4.6 and Remark 4.8 to obtain bootstrap results for
the M-estimators corresponding to g we let: F as in (A3)’,
G = {supjej>as
9(-,8)/6(8), o(-,0)} Ufg(-,@) = [0] < M}, and € as in (4.11)
for these F and G. Let 0 < f(n) = o({1/n) as n — oo. Then, as men-
tioned in Section 3, there is a universally measurable map @, : S* + © such
that n-* DE, g(si, On) > supgee 2? DE, 9(si, @) —f(n). We can then define
for each Q in P¢,T,(Q) as: Ta(n7 CR, Os.)= On(si,..., $x) and otherwise
as any @g satisfying Qg(-,@g) > supgce Q(-,@)— F(n). Then the function-
als T,, satisfy the measurability hypothesis of Theorem 4.6 and T,(P,) =
&.. 7. (Pa ) = 6, are respectively an M-estimator and its bootstrap. Finally
To is defined only at P as To(P)= 0. So the conclusions of Theorem 4.6 will
follow if we prove that T = {T,}$2y is n-“/? differentiable along € at P. To
do this we just apply methods of Huber (for consistancy) and Pollard (for
-M-Estimators
SSS
43

_ the CLT):
Step 1. {z,}€C 3% := (P+ 2,/n'l*) 0:
This follows immediately from

(P+n'?z,)(-sup
iru
g(-,6)/b(6)) > a
pst P+n-/7z,)(g(-,8))
n~"""2_)(g(-,9)) +0
and
| n~!7z,(g(-,0)) +0
as in the proofof Theorem 3.5.

Step 2. {zz} €C > 7, = O(n”):


By Step 1, (A.2)’ and the definition of T,, we have

0 < f(x) - (P + nV, \(g(-, In) = I-; 0))

< fi(n) Sa byn{?/4 = nn /*2,(A) = hnln~/72,(r(-, ‘In))-


Hence
ny\[4< 24(A) + 2a(r(-,%0)) + (f(n))'7n?
which is uniformly bounded.

Step 3. {z,} € C = n/*4, — 2,(A) — 0:


By the definition of T,

0< f(x) (P + n~¥7z,)(9(-, in) ag I-; n~/?z,(A)))

so that, (A-2)’ and (A.3)’ give, as in (3.12),

jn¥/2-y, — z(A)[?/2 < nf(n) + no bya!) + n0(jn-¥/7z,(A)|?)


+zn(A)I)zn(r(-.2-*/72,(A))| + 2 al len(r(-,m))]-
Now, nf(n) — 0 by the definition of f,no(|7,)*) —~ 0 by Step 2 and
wih \n-*!*z,,(A)|?) — 0 since {z,(A)} converges. For the next to last term we
observe |z,(A)| — |zo(A)l and z,(r(-,n-/7z,(A))) © zo(r(-,n-/?z,(A))) =
zol(r{-,0)) = 0 by (A4)’ and because zy € C,(F, ep). The last term tends to
zexo by the same argument. Hence

nly, —Z,(A) — 0,

ie. T is differentiable and T;(z) = (A) for z € I*(F).


Arcones and Giné

The conclusion is that (for Ag = J), under conditions (B.1)-(B.4) and


(A.1)’-(A.4)’
n¥/2(6,, — 0,) +c N(0, CovA) in probability
and ,
n'!?(Onn — On) +2 N(0, Cov, A) a.s.
if N = N,, is a regular sequence such that N, = 0(n/ log logn).
Note that the conditions (B.1)-(B.4) and (A.1)’-(A.4)’ are weaker than the
conditions for the validity of Theorem 3.2 (the a.s. bootstrap) and not much
stronger than Pollard’s (1985). Pollard’s (1985) “stochastic differentiability
condition” is thus close to being equivalent to the differentiability condition
(4.4) for C, which does not involve any randomness.
Theorem 4.6 also applies to M-estimators defined as zeros of P,,h(-,@)
(actually with a proof simpler that in the previous example, that we omit).
The result is the following: Assuming that the consistency conditions (1)-(4)
of Theorem 3.7 hold and that
(Cll PRE,O) = 0,
(C.2)’ H(0) := Ph/(-,6) is differentiable at zero with a non-degenerate first
derivative H'(0), which we assume without loss of generality to be J,
(C.3)’ the classes of functions F; = {h;(-,@) : |0] < M},7 =1,...,d, are
P-Donsker,
(C.4)’ 6 — Varph,(-,4) is continuous at 0,7 = 1,...,d,
(C.5)’ for all Q € Pr there exists T,,(Q) such that |Qh(-, Tn(Q))| < en/n/?
for some €, — 0,
the conclusions of Theorem 4.6 hold for To(P) = 0, Tn(P,) and Ty (Py.n)-
We note that the same conclusions follow if (C.3)’ and (C.4)’ are replaced by
asymptotic equicontinuity at zero of {v,(h;) : hi € Fi}, 7 = 1,...,d (Romo
(1990); see Remark 3.3).

Acknowledgements. A seminar talk by Prof. R. Dudley at CUNY stimu-


lated our interest in the bootstrap of M-estimators and differentiable statis-
tical functionals. A. van der Vaart and J. Wellner pointed out to us a gap in
the first version of the proof of Theorem 2.6. We also acknowledge several
interesting conversations with J. Romo.
M-Estimators 45
SS

References

Alexander, K.S. (1984). Probability inequalities for empirical processes and


a law of the iterated logarithm. Ann. Probability 12, 1041-1067.

Alexander, K. S. (1985). Rates of growth for weighted empirical processes.


In Proceedings of the Berkeley Converence in Honor of Jerzy Neyman and
Jack Kiefer (L.M. LeCam and R. A. Olshen, eds.) 2, 475-493. Wadsworth,
Monterey, CA.

Alexander, K. S. (1987). The central limit theorem for empirical processes


on Vapnilk-Cérvonekis classes. Ann. Probability 15, 178-203.

Alexander, K. S. and Talagrand, M. (1989). The law of the iterated loga-


rithm for empirical processes on Vapnik-Cérvonenkis classes. J. Multivariate
Analysis 30, 155-166.

Araujo, A. and Giné, E. (1980). The central limit theorem for real and Ba-
nach valued random variables. Wiley, New York.

Arcones, M. and Giné, E. (1990). The bootstrap of the mean with arbitrary
bootstrap sample size. Ann. Inst..H. Poincaré 25, 457-481.

Beran, R. (1984). Bootstrap methods in statistics. Jber. d. Dt. Math.


Verein 86, 14-30.

Bickel, P. J. and Freedman, D. A. (1981). Some asymptotic theory for the


bootstrap. Ann. Statist. 9, 1196-1217.

Cohn, D. L. (1980). Measure theory. Birkhauser, Boston.

Cuesta, J. A. and Matran, C. (1988). The strong law of large numbers for
k-means and best possible nets of Banach valued variables. Probability The-
ory and Rel. Fields 78, 523-534.

Dudley, R. M. (1984). A course on empirical processes. Lect. Notes in Math.


1097, 1-142. Springer, Berlin.

Dudley, R. M. (1985). An extended Wichura theorem, definition of Donsker


class and weighted empirical distributions. Lect. Notes in Math. 1153, 141-
178. Springer, Berlin.
46 Arcones and Giné

Dudley, R. M. (1990). Nonlinear functionals of empirical measures and the


bootstrap. In Probability in Banach spaces VII, 63-82. Progress in Probabil-
ity Series. Birkhauser, Boston.

Dudley, R. M. (1989). Real Analysis and Probability. Wadsworth, Pacific


Grove, California.

Dudley, R. M. and Philipp, W. (1983). Invariance principles for sums of


Banach space valued random elements and empirical processes. Z. Wahrsch.
v. Geb. 62, 509-552.

Gill, R. R. (1989). Non- and semi-parametric maximum likelihood estima-


tors and the von Mises method (Part II). Scand. J. Statist. 16, 97-128.

Giné E. and Zinn, J. (1984). Some limit theorems for empirical processes.
Ann. Probability 12, 929-989.

Giné E. and Zinn, J. (1986). Lectures on the central limit theorem for em-
pirical processes. Lect. Notes in Math. 1221, 50-113. Springer, Berlin.

Giné E. and Zinn, J. (1990). Bootstrapping general empirical measures. Ann.


Probability 18, 851-869.

Giné E. and Zinn, J. (1991). Gaussian characterization of uniform Donsker


classes of functions. Ann. Probability 19, No. 3, to appear.

Hartigan, J. A. (1978). Asymptotic distributions for clustering criteria. Ann.


Statistics 6, 117-131. ;

Huber, P. J. (1964). Robust estimation of a location parameter. Ann. Math.


Statist. 35, 73-101.

Huber, P. J. (1967). The behavior of maximum likelihood estimates under


non-standard conditions. Proceedings Fifth Berkeley Symposium on Mathe-
matical Statistics and Probability, Vol. 1, 221-233. University of California
Press, Berkeley.

LeCam, L. (1986). Asymptotic methods in statistical decision theory. Springer,


Berlin.

Pollard, D. (1981). Strong consistency of k-means clustering. Ann. Statist.


9, 135-140.
M-Estimator 47

Pollard, D. (1982). A central limit theorem for k-means clustering. Ann.


Probability 10, 919-926.

Pollard. D. (1984). Convergence of stochastic processes. Springer, Berlin.

Pollard, D. (1985). New ways to prove central limit theorems. Econometric


Theory 1, 259-314.

Pollard, D. (1989). Empirical processes: theory and applications. CBMS/NSF


Regional Conference Series in Probability and Statistics, to appear.

Reeds, J. A. (1976). On the definition of von Mises functionals. Research


Report # S-44, Dept. Statist., Harvard University.

Romo, L. (1990). Manuscript.

Sheehy, A. and Wellner, J. A. (1988). Uniformity in P of some limit theorems


for empirical measures and processes. Technical Report #134, Department
of Statistics, University of Washington, Seattle.

van der Vaart, A. W. and Wellner, J. (1989). Prohorov and continuous map-
ping theorems in the Hoffmann-Jgrgensen weak convergence theory, with
applications to convolution and asymptotic minimax theorems. Preprint.

Wellner, J. (1989). Discussion of: “Non- and semi-parametric maximum like-


lihood estimators and the von Mises method, Part I” by R. Gill, Scand. J.
Statist. 16, 124-127.
Yang, S. S. (1985). On Bootstrapping a class of differentiable statistical
functionals with applications to [- and M-estimators. Stat. Neerland. 39,
375-385.
IT SN

thy
. fhaaimlonsesinat
ra Py‘3dion fateh

in 4

ron ont‘siti Sabena .


>. ma i 4 i
BOOTSTRAPPING MARKOV CHAINS

K. B. Athreya! C.D. Fuh?


Departments of Mathematics Institute of Statistical Science
and Statistics Academia Sinica
Iowa State University Taipei, Taiwan, ROC
Ames, lowa, USA

Abstract

We present a brief summary of some recent work on the application of


the bootstrap method to estimate the distribution of the estimator of the
transition probability matrix, and that of the hitting time distribution for
finite and countable state space Markov chains that are ergodic. Some open
problems are also indicated. Pr ion

Introduction
Let {X,; n > 0} be a homogeneous ergodic (positive recurrent, irreducible
and aperiodic) Markov chain countable with state space S and transition
probability matrix P = (p;;). The problem of estimating P and the distri-
bution of the hitting time T, of a state A arises in several areas of applied
probability. The application of the bootstrap method of Efron (1979) to
the finite state Markov chain case was considered by Kulperger and Prakasa
Rao (1987)and Basawa et al (1989). Athreya and Fuh (1989) discussed the
countable state space case. The general state space case is an important
open problem.
The goal of the present paper is to give a brief survey of the results of
the the above mentioned papers and also some related work of Datta and
McCormick (1990) on second order correction of a method of bootstrap pro-
posed by Basawa et al (1989) for the finite state space case. The latter paper
also considers parametric bootstrap for finite state Markov chains. No proofs
are given in the present paper.
Keywords: Primary 62G05; Secondary 60F05, 60J10
1980 Mathematics subject classifications: bootstrap estimation, central limit theorem, hit-
ting times, Markov chains, stationary distributions, transition probabilities
' Research partially supported by NSF Grant 8803639.
2 Research partially supported by NSC of ROC Grant 79-0208-M001-63.

Exploring the Limits of Bootstrap. Edited by Raoul LePage and Lynne Billard.
Copyright©1992 by John Wiley & Sons, Inc. ISBN: 0471-53631-8.
Athreya and Fuh

Bootstrapping a finite state space Markov chain


Let {Xn; n > 0} be a time homogeneous irreducible aperiodic Markov
chain with finite state space S = {1,2,--- ,k} and transition probability
matrix P = (p;;). This implies that there exists an invariant probability
measure II = (m,--- ,7,%) such that 7; > 0, ae me; = lees =), Maay
Bs ae and p\” —+ 1;,asn— oo for allz€ S.
Suppose x = {29,21,°*: ,Zn} is a realization of the process {X;; j =
0,--- ,n} observed up to time n. We estimate P by its maximum likelihood
estimator P, = (pn(z,j)), where
ete nij/ni, ifn; > 0;
Prt )= 7 ij> otherwise,

and estimate II by In = (#n(i)), where


Ae n;
alt) cE: =

nij = observed number of 7j transitions in {r0,--- , Zn},


n; = observed number of visits to state 7 in {z9,--- , Zn}.

Since the state space S is finite, we can consider the non-parametric case
as a special case of the parametric case. So, the consistency and asymptotic
normality of the maximum likelihood estimators can be deduced using the
analogy with the multinomial distribution. This idea also can be used to
prove the consistency of the bootstrap estimators of P, given x
The consistency of II, for II follows from the strong law applied to the
renewal sequence of return times to state 2.
THEOREM 1. For all i,

fn(t) =n; /n — 7; with probability 1 as n — 0d.

The following theorem is a central limit theorem for the maximum likeli-
hood estimator P, of P and is in Billingsley’s book (1961).
THEOREM 2.
Vn(P,-P)—N(0,Zp) __ in distribution,
where Up, the variance-covariance matrix, is given by

(LP)
cissrjry= Si ig (855" — Diz) -
Det 2 efaraoess wees z*}be a realisation of a Markov chain with transition
probability matrix P, and let P, be the P, function evaluated at this z*.
For this bootstrap method Kulperger and Prakasa Rao (1989) established
the following central limit theorem.
Bootstrapping Markov Chains 51
THEOREM 3. Under the notations given above, we have for almost all real-
izations of the Markov chain {X,; n > 0},
UNAPe te ND) Yacin distribution
as n — oo and N, — oo, where ¥p is the same variance-covariance matrix
in Theorem 2.
Let T; be the first hitting time of state k. That is, we let
{intjne 0: i, = Kk}:
Tk = :
co, if no such n exist.
Let Pr(t;P) = P(T, <t| Xo =1;P) denote the probability that T, < t
for t € {1,2,3,---}, where P is the transition probability matrix of a Markov
chain X = i de n > 0} with initial state Xp = 1.
For any k x k stochastic matrix P, let A = A(P) be the stochastic matrix
which is the same as P except that the k‘* row is replaced by (0,--- ,0,1)
with 1 in the k** position. Note that
Pr@R)= (A) (*)
The bootstrap estimate of the distribution Pr(-; P) of the hitting time T;, is
Pr (-Pa) . From (*) and the fact that P, — P with probability 1, we have
for each t, /
Pr(t; P,) — Pr(t;P) —> 0 ‘w.p.T as n — oo.
Here the problem is to estimate the distribution of
G,(t;P)=Jn (Pr(ti P,,) — Pr(t; P)) :
The bootstrap approximation to the above distribution is the conditional
distribution of
G,(t; P,) = Vn (Pr(ti P,) — Pr(t; P,))
The problem here is to verify that these two distributions are asymptotically
close. Kulperger and Prakasa Rao (1987) obtained the next two theorems.
THEOREM 4. Let A, =A el: Then for all t = 1,2,3,---, we have

vn (At, - At) —+ N(Q,Zp) in distribution,


where the variance-covariance matrix Zp is a function of P.
REMARK 1: It turns out that Z} is fairly complicated to compute for general
t. For t = 1,2 we have
Zp = DA;

Z} = Var(AU + UA).
where U ~ N(0 and D4), for any k x k stochastic P, Up is as in Theorem 2.
Athreya and Fuh

THEOREM 5. Let A, = A(Pp) where P, is defined as above. Then, for


almost all realizations of the process, and for t= 1,2,3,---, we have

VN, (At - At.) — N (0,25) in distribution,


where Zb is the same variance-covariance matrix in Theorem 4.

Adapting the same type of analysis, one can estimate


m(P) = E(Ty |Xo = 1; P) the expected value of the hitting time T; by
m(P,) = E(T, | Xo = 1;P,). It can be shown below that m(P,) is a
consistent estimator of m(P), and further

Q,(t;P) = Vn (m(Pn) ~ m(P)) is asymptotically normal.

The bootstrap approximation to the above distribution is the conditional


distribution of : ; - :
Qn(t; Pa) = /Na (m(Pn) a m(Pn))
That these two distributions are asymptotically close can be shown using the
6 method, Theorems 3 and 4, and the following
LEMMA. Let P be a stochastic k x k matrix and {X,; n > 0} be a Markov
chain with transition probability matrix P. Let

mijk(P) = E(visits to 7 before tok |Xo =1;P).

Then
co

mije(P) = > (PE)is,


r=0

where P, is the matrix given by P in which the k** column is replaced by 0.


Note that m;;x(P) is continuously differentiable with respect to the entries
ofP at all P such that inf;,; pj; > 0.

Bootstrapping Markov chains: countable case


Athreya and Fuh (1989) have considered the case of a countable state space
Markov chain. They propose three bootstrap methods.
Method I.
Let x = (z0,21,--+ ,Zn) be a realization of the Markov chain {X,; n > 0}
with transition probability P. Let P, = P(n,x) be an estimator of P based
on the observed data x. A bootstrap method for estimating the sampling
distribution H,, of R(x, P) = (Pp— P) can be described as follows (same as
Bootstrapping Markov Chains 53
ae SSS SSeS

in the finite state space case except that the resample size is changed from
nto Ny,
1) With P, as its transition probability, generate a Markov chain real-
ization of N, steps x* = (r5,2],---,ry_). Call this the bootstrap
sample, and let P,= P(Nn,x*). Note that P,, bears the same relation
to x* as P,, does to x.
2) Approximate the sampling distribution H, of R(x,P) by the condi-
tional distribution H* of R(x*, P,) = (Pa — Pn) given x.
Method II.
The existence of a recurrent state A which is visited infinitely often (i.0.)
for a recurrent Markov chain is well-known. A well-known approach to its
limit theory is via the embedded renewal process of returns to A. This is the
so-called regeneration method. For a fixed state A, by the strong Markov
- property, the cycles {X,; j = gen tee pen —1} areiid. forn =1,2,---,
where 7) is the time of the n** return to A.
Fix an integer k and observe the chain up to the random time n = Tey
Let
{Xo,X1,°°- a

be a realization of the process. Note that in this situation, X, = A. Fix 2,j.


Let ne = {Xj; 7 = Tok vee Rate —1} denote the a*‘* cycle, g(nq) indicate
the number of visits to state 7 during the cycle ng, and h(n.) indicate the
number of ij transitions during the cycle ng. Now, define

Sn, g(a) (su; \= ye h( ne)


i ,(2) —
eee Tae. a Se .0(a)
be the estiraators of m and P, where T, is the length of nq.
By the strong law both 7, and P, are consistent. To estimate the dis-
tributions of (#, — ) and (P, — P), Athreya and Fuh (1989) propose the
following bootstrap algorithm.
1) Decompose the original sample in the following fashion:

{no, 15125" a ets where No = {Xo,X1,°° P Xp _y}-

Let F, denote the uniform probability measure on the cycles


{na} @ = 1,2,--- ,k}. If Xo = A w.p.1., then one could take n = co and
F, to be the uniform probability measure on {nq; a = 0,1,2,--- ,k—- dy:
2) With the original sample fixed, draw a “bootstrap sample” of size k
according to F... Denote this sample by ni, n3,--- , nj. Then, the bootstrap
Athreya and Fuh

analogous of 7,(7), px(2,7) can be defined as follows:

Daa 915) Deas Mns)


mK (t) =
pe SNe: ’ Br(2,J) =
Year 995),
where T% is the length of n3.
3) Approximate the distribution of V/k(p,(2,7) — pij) by the conditional
distribution of Vk(px(i,7)— Be(i,j)) given x. Similarly for Vk(a%(i) — 7).
REMARK 2: A slightly different approach is to fix the original sample size
n. Let kn be the (random) number of full cycles included in the obser-
vation {Xo,X1,:--,Xn}. Now generate the bootstrap sample by making
ii.d. selection uniformly from these k, cycles. Since as n — oo, kp — 00,
this method should work even though the k, cycles are no longer inde-
pendent. Further, we may even choose A to depend on the observation
{Xo,X2,---,Xn}. A natural choice is that A for which we get the largest
number of cycles in {X9,Xj,--- , Xn}.
Method III.
This method is especially suited for estimating the distribution of hitting
times.
Let x = {z0,21,--- ,2n} be a realization of the Markov chain {X,} up to
time n. Fix any two states | and k. Let T,;; 1 = 1,2,--- be the successive
return time to state r. Now arrange {T,
i}, {T;,:} such that

Deeat Ato Li ee ag i ea
(That is, for each m, there is an im such that T;,i,,-1 < Tim < Tk,in-)
Define

Y, = Thi, — Th
Y2 = Ty,i. — Th2

Ye a Lee. Dad fiat

It is clear that by the Markov property, the {Y;}%° are i.i.d.


Let M, = sup{m; Ty,i,, <n}. Then, for each positive integer t, a natural
estimator for the distribution

Fi,(t) = P(Tha S t|Xo =)


is the empirical c.d.f.

ae M
ieee
Fr (t) = 7 Perey
jot
Bootstrapping Markov Chains

The naive bootstrap will consist of fixing Y;,Y2,--- , Ym, and drawing i.i.d.
samples Y,*, Y;,--- , Yn, distributed as Fix (-) and defining

: L N, ire
Fy) =D1}
Nn 44
<0)

For each fixed t, F(t) - Fig (0 — 0 in probability as N, — oo and


w.p.l. if }>1/N2 < oo. Since FO (t) + Fi,(t) w.p.1 (by the strong law) it
is clear that F(. -) and F),(-) are close as n — oo.
This method is yet to be investigated fully and will not be treated further
in this paper.
In order to justify the bootstrap Method I, the following weak law and
central limit theorem for a double array of Markov chains was developed by
Athreya and Fuh (1989).
Let Xn = {Xnz; t > 0} n = 1,2,--- be a sequence of Markov chains on
a countable state space S with transition probability matrices Py = (pni;).
Let 7, = (tn;) be a stationary probability distribution for P,. Suppose
{Xni; t= 0,1,2,--- , Nn} is a realization of the n* chain observed up to N,
steps. Let ;
(n)
Tn(t) i a ?

where
Nn
m\”) = S~ I(Xnt = i);
t=0

is the number of visits to 2 by X,, during {0,1,--- , Nn}.


The following is a weak law for the empirical distribution 7,(-).
THEOREM 6. Assume that for each n, there exists a probability distribution
{tni,t € S} such that for each p > 0 and for fixed |, 1, as n,Nx — 00, we
have
5n(p) = ye |=sopo) — ail — 0. (2)
s=1

Then
it,(t) — 7%; —> 0 in probability.

The hypothesis of Theorem 1, namely that 6,(p) — 0 as n — oo for


each 0 < p < 1 relates the length of observation N, to the rate at which
the n** chain approaches (in Cesaro mean) its stationary distribution. If
P,, converges to a P that is nice, then it is reasonable to expect that the
hypothesis of Theorem 6 holds. In fact, we have the following
Athreya and Fuh

THEOREM 7. Let Pn = (pnij) and P = (p;;) be stochastic matrices such


that
i) P is ergodic,
ii) Pnij —* pi; for all 2,3,
ili) ™,; —> ™ for all i,
where ©, and 7 are stationary probabilities for P, and P respectively. Then
(2) holds and hence the weak law also holds, and for each 1,

in(t) —> 7; in probability.

The weak law of Theorem 1 suggests the possibility of a central limit the-
orem for 7,(-). This turns out to be somewhat intractable and Athreya
and Fuh (1989) address instead a related question motivated by the boot-
strapping of P. Let pn(z,7) denotes the proportion of (2,7) transition to the
number of visits to7 in {Xntz; 0 <t < N,}. That is, let

m\” ay
= hoe=),
t=]
Nn-1
ee = ae TW Xne =t)I(Xaeti) = J);
t=0

and

(n)
a mit, ifm” >0;
Pn(2,j) = :

Since 7,(%) — 7; — 0 in probability, 7; > 0 and N, — oo, we have that


m ;(n) oo in probability and hence

Pn(t,J) — Pnij —> 0 in probability.

The following is a central limit theorem for (P,, — P,).

THEOREM 8. Let P, = (pnij) and P = (p;;) be stochastic matrices such


that
i) Pnij —* pi; for all i,j,
ii) Tay —> 7; for all z,
iii) 7; > 0, for all i and (2) holds.
Bootstrapping Markov Chains

' Then
V Nn(Pn — Pn) —+ N(0,¥p) in distribution,
as n — oo and N, — oo independently of each other, where Sp is the same
varlance-covariance matrix in Theorem 2.

Here the convergence in distribution means that for any finite set A of
pairs (2,7),
{VNn Balt,5)— Pris); (5) € A} — NO,(Ep)a),
where (ip), is a block diagonal matrix involving the states in A. In partic-
ular, if A = {1,2,--- ,k}, then

et?) 0 0
Ov LTP) 0
(Up)a= an
0 0 tL (P)

and T';(P) = (pij(6;1 — pit)), Coa oe ie LP

REMARK 3: By Theorem 7, condition 277) in the above theorem can be


replaced by P being irreducible, positive recurrent and aperiodic.
Let X = {X,; t > 0} be a homogeneous ergodic Markov chain with
a countably infinite state space S = {1,2,---} and transition probability
matrix P = (p;;). Let t = (m,72,---,) such that 2; > 0, Ds 7gs 0
and 45°) -Wapas, J = 1,2;--) It is known that the nt’ step transition
probability pi) t+ mj, as n > 00 for all i € S. .
We now show that the Method I works for estimating the distribution of
(Pr — P) when ‘e: is the maximum likelihood estimator. It is clear that a
finite number of observations will not provide an estimator for all states of
the transition probabilities.
Nevertheless, the a.s. convergence of 7, and P,, can be established using
the recurrence of {X,} and the strong law of large numbers. The following
are extensions of Theorem 1 and 2 to the countable state space case.
PROPOSITION 1. Let X = {X,; n > 0} be a homogeneous irreducible re-
current Markov chain. Then for any initial distribution and any (1,j), we
have
Pn(t,j) —> pijwith probability 1 asn — oo.
Suppose in addition, the chain is also positive recurrent. Then, for any initial
distribution and state 1, we have

itn(t) + m; with probability 1 as n— oo.


58 Athreya and Fuh
LS

This suggests using P,, and it, as estimators of P and 7 respectively and
in order to obtain confidence intervals, we need to look at the distributions
of (P, — P) and (ip — 7).
PROPOSITION 2. (Derman (1956))
Let X = {X,; n > 0} be a homogeneous irreducible, positive recurrent
Markov chain. Then

Vn(P, — P) —+ N(0,Zp)_ in distribution,

where Sp is the variance-covariance matrix.

The convergence is in the same sense as in Theorem 8 and so is Up. It


should be noted that Proposition 2 is a special case of Theorem 8,

PROPOSITION 3. (Derman (1956))


Let X = {Xn; n > 0} be a homogeneous ergodic Markov chain with
ExT? < co for some k. Then for any initial distribution, we have

Vn(itn — 7) —+ N(Q,Z>) in distribution,

where Xp is the variance-covariance matrix.

See Derman (1956) for an explicit form of 5%. Here again the convergence
has the same meaning as Theorem 8.
In order to obtain confidence intervals for P and 7, one can use Proposi-
tions 3 and 4, but use L, and uP in place of Up and X3 respectively.
An alternate approach to finding confidence intervals is to use the method
of bootstrap. Here for the pivotal quantity

Vn( Pp om Ply:

the bootstrap version is


WV Nn(Pn — Pa):
The following theorem states that for almost all realizations of {X,}, the
conditional distribution of VNw(Pr —P,) given x converges in distribution to
N(0,Up) as n — oo. The proof follows from Theorem 8 as P, is consistent
for P. The reader is referred to Athreya and Fuh (1989) for the details.
THEOREM 9. Under the notations given above, we have for almost all real-
izations of {Xn; n > 0},

VNa(Pa — Pa) > N(0; Lp) in distribution,


Bootstrapping Markov Chains 59
nnn SSS SS

as n — oo and N, — oo independently, where Up is the same variance-


covariance matrix in Proposition 3.
REMARK 4: Note that in the above the only properties of m.l.e. P, that we
used were the consistency of P, for P and 7, for 7. So the bootstrap will
work with any other estimator of P enjoying these properties. For the finite
state space, irreducible, aperiodic Markov chain, Theorem 9 is deducible
from the methods of Kulperger and Rao (1989), who use different techniques
and also from Basawa et al (1989). Theorem 9 in its present form is in Fuh’s
thesis (1989).
We now proceed to justify Bootstrap Method II.
Recall that Method II is based on k full cycles. Our goal is to estimate
the distribution of Vk(#,(i) — 7;) and Vk(px(i,7) — Pij)-
The strong consistency property of the bootstrap estimators 7,(7) and
px(t,j) is given by the following theorem.
THEOREM 10. (Athreya and Fuh (1989))
If there exists 6 > 0 such that Bars < oo, then

R(t) —> mj and px(t,j) — pij w.p.l. ask—oo.

It is known that if E,T2 < oo, then the distribution of ©

Vk (ix (2) —1;) is asymptotically normal,


and the bootstrap estimator of this distribution is the conditional distribu-
tion of Vk(a,(i) — w(t) given (m,--- , me):
Method II works well to estimate the distribution of Vk(#,(i) — =;) for an
ergodic Markov chain with infinite state space. The statement is as follows.
The reader is referred to Athreya and Fuh (1989) for the proof.
THEOREM 11. With the notations given above, if Ex,T% < oo, then, for
almost all realizations of {X,; n > 0}, we have for each 1

Vk(ite(i) — #4(7)) —+ N(0,0?) in distribution as k — 00,


_ where 0 < 0? <0.
For any real number 1, [2,--- ,/,,

Vk S| 1 (e(é) — Fe(2))
i=1

can be written as
k * k
Jk (2s!f(nz) See ies
Sever T; py T;
Athreya and Fuh

where for any cycle n, f(n) = >o;-, ligi(n), and gi(n) is the number of visits
to 2. Since

Lem)<(So DT)
the hypothesis E,T2 < oo implies E,a(f(n))? < oo yielding the following
extension of the above theorem.

THEOREM 12. With the above notations, if ExT? < oo, then for any finite
subset A of the state space S, and for almost all realizations of {X,; n > 0},

{VE(ix(3) —(t)); 7 € A} — N(0,x) indistribution as k > oo,

where » is the variance covariance matrix whose (i, j)** element is

oi; = a’ Cov (gi(n),9;(7)) + atV (T(n)) Egi(n)E9;(n)


— 2a°Egi(n)Cov(T(n),9;(n)) ;
where for any cycle n, gi(n) is the number of visits toi during n, and T(n)
is its length.

Turning now to px(t,j) — px(2,7), it is easily proved that mere positive


recurrence (t.e. ExaT, < 00) is enough to yield the asymptotic normality of
Vk(px(i, 3) —pi;)). However, for the conditional distribution of Vk(pe(t, 9) —
Px(2,7)) to converge to normal for almost all realizations of {X,; n > 0}, we
need E,T? < 00.

THEOREM 13. With the notations given above, if ExT < oo, then, for
almost all realizations of the process {Xn; n > 0}, we have for each i,j,

Ve(Be(i,3)- Pr(t,j)) —+ N(0,o*) in distribution,

where 0 < a? < 0.

With the same argument as in Theorem 12, we have the following extension
of Theorem 13.

THEOREM 14. With the above notations, if E,T2 < 00, then for any finite
subset A of S x S, and for almost all realizations of {X,; n > 0}

{Ve(Bx(i, 3) — pr(t,j)); (7,9) € A} +— N(Q,x*) in distribution,


Bootstrapping Markov Chains 61
a

where &* is the variance covariance matrix with the (2,7), (2',7')** element
is

O(i,3),(,3")
= Cov (Ba his(n) —_Ehi(n)
SR 9 ee vpn) Bhoye(n)
AEN ell FV )
Bon) Bain)?
1
Ege(n) —Ege(ny 2
= Pair. < (his (7), hije(7))
Ehi;(n)Ei (0) Cov (9i(7),
9(7))
(Egi(n)Egi(n))?
_ 9 Ehij(n) Cov (hi;(7), 9i(7))
Eg:(n)(Egi(n))?
25 Ehi;(n) Cov (hir5(7), gi(n)) 5
(E9:(n))?
Egi (7)

where h,;(7) is the number of ij transitions during the cycle n.


REMARK 5: Ifthe bootstrap resample size k’ is different from k, Theorem 13
still holds, as long as both k and k' go to oo. The rate of growth is irrelevant,
but the finite second moment for the hitting time Ty is crucial. (see Athreya
(1987) for the details.)
REMARK 6: It was suggested earlier that a variation of method II is to
fix the length n of observations and resample from the k, (random) full
cycles in {Xo, X2,--- , Xn}, and that a natural choice for A is the state that
yields the largest number of cycles in {Xo,X1,---,Xn}. This will make
the cycles {n,,72,--- , 7k, }dependent and hence the use of Proposition 6 in
justifying the CLT for /kn(7x, (2) — ix, (z)) would not be valid. However,
an examination of Singh’s (1981) or Athreya’s (1987) proof shows that even
here the conditional Lindeberg’s condition holds for almost all realizations of
{X1; t > 0} and hence the CLT does hold for the bootstrap. Also by Renyi’s
CLT for random sums Vkn(7x, (i) — 7) obeys a CLT. Thus this modified
bootstrap also works. Similar remarks apply for bootstrapping (P, — P).
62 Athreya and Fuh
a
L S

Accuracy of the bootstrap in the finite state space case


Basawa et al (1989) have also considered the finite state space case. They
propose the following (conditional) bootstrap method. Recall the defini-
tions of n; and n;; from (1). Then, generate independent random variables
Wayvt =1,2,.> (a= 1,2)---5 such that

P (Wi, = J\x) = dij, LS3,7 Sky t = 1,2,---

Define my

ni = DU T(Wi = 3),
t=1

and :
A 135 rap arte yt
Pi eek ts %7 3 :
2

Then, the bootstrap distribution H*. of

{Vn (61; — Bi) 51S 8,5 <b}


is suggested as approximation to Hn, the distribution of

1 S45
{Vn (Bi; — Pis)i Sk}.
Basawa et al show (using the fact that multinomial goes to multivariate
normal) that H* converges w.p.1 to the same multivariate normal as the
limit of H, (as claimed in Theorem 2), thus showing that this bootstrap
method is consistent.
In a recent paper Datta and McCormick (1990) have investigated the ac-
curacy of the above method and have established the following theorems.

THEOREM 15. Fix 2,3. Let 0 < pjj <1. Then

(a) sup, |P (Vni(Bi; — pis) S 2) — P*(/na(


BY,— Bij) < 2)|
=0 ( log
P Gata ~ pij) < z) x p* (2403, — pis)
S z)|

=0 ralase | ?

no? ne(bii —Py) ) o (27!


ail9}, i

-0(c)
P (eh Nl (i ij (1— pi; )) 17? ==
Bootstrapping Markov Chains 63
SSS

They also obtain an Edgeworth type expansion for the case when Di; 18
irrational.
The asymptotic accuracy of the method proposed by Basawa et al is almost
of order O(n-1/?). Datta and McCormick (1990) propose three modified
bootstrap schemes and obtain Edgeworth expansions for each of them. See
Datta and McCormick (1990) for details.

Some Open Problems

The following are further research topics related to this paper.


1. The investigation of the accuracy of the bootstrap method proposed
by Kulperger and Prakasa Rao (1989) for the finite state space and those
proposed by Athreya and Fuh (1989) for the countable state Markov chain.
2. Doing numerical simulation for the comparison of the bootstrap meth-
ods proposed in the above papers, with a view to determine resample sizes
k, Nn etc..
3. Bootstrapping the transition kernel P(-,-) of an ergodic Harris chain, is
an extension of bootstrapping the transition probability matrix of an ergodic
Markov chain with discrete state space. The histogram estimator P,(-,-) of
P(-,-) has asymptotic normality under Doeblin’s condition and regularity
hypotheses. An interesting problem here is to find the asymptotic behavior
of the bootstrap estimator as well as to investigate more general chains that
are Harris recurrent.

Acknowledgement: We would like to thank Raoul LePage for inviting


us to contribute this paper to this Proceedings. We also would like to thank
S. Datta for making his paper with McCormick available to us.

References

[1] K. B. Athreya. Bootstrap of the mean in the infinite variance case.


Ann. Statist. 15 (1987): 724-731.
[2] K. B. Athreya and C. D. Fuh. Bootstrapping Markov chains: count-
able case. Technical Report: B-89-7, Institute of Statistical Science,
Academia Sinica, Taipei, Taiwan, ROC, 1989.
[3] I. Basawa, T. Green, W. McCormick and R. Taylor. Asymptotic boot-
strap validity for finite Markov chains. Submitted, 1989.
[4] P. Billingsley. Statistical Inference for Markov Processes. The Univer-
sity of Chicago Press, Chicago, 1961.
Athreya and Fuh

[5] S. Datta and W. P. McCormick. Bootstrap for a finite Markov chain


based oni.i.d. sampling. Technical Report 139, University of Georgia,
Athens, GA 30602, 1990.
[6] C. Derman. Some asymptotic distribution theory for Markov chains
with a denumerable number of states. Biometrika 43 (1956): 285-294.
[7] B. Efron. Bootstrap method: another look at the jackknife. Ann.
Statist. 7 (1979): 1-26.
[8] C. D. Fuh. The bootstrap method for Markov chains. Ph. D. disser-
tation, Iowa State University, 1989.
[9] R. J. Kulperger and B. L. S. Prakasa Rao. Bootstrapping a finite
state Markov chain. University of Western Ontario, Statistics Dept.
Preprint (1987). To appear in Sankhya.
Theoretical Comparison of Different Bootstrap t
- Confidence Bounds

By
P.J. Bickel
Univ. of California, Berkeley

Summary. We compare in a formal way the behaviour to second order of


bootstrap confidence bounds for a parameter @ based on t statistics. We inves-
tigate the effect of:
1) Varying the estimate of scale in the denominator
2) Varying the estimate 6 of 6 used in the numerator
3) Varying the bootstrap method, parametric or nonparametric in terms of
a) Equivalence of the resulting procedures
b) Correctness of the probability of coverage
c) Minimization of the amount of undershoot
d) Robustness to failure of parametric assumptions

Key Words: (AMS 1980 Classification) Bootstrap, Confidence bounds, Second


order properties.

Department of Statistics, University of California, Berkeley, California 94720

Nee

Exploring the Limits of Bootstrap. Edited by Raoul LePage and Lynne Billard.
Copyright©1992 by John Wiley & Sons, Inc. ISBN: 0471-53631-8.
66 Bickel
neeck lh a al ce

1. Introduction. Recent years have seen the development of a large number of


approximate confidence bounds for a population parameter 6 (F) on the basis of
an iad. sample X,,..., X, from a population F. These bounds, all based on
resampling ideas, include the original Efron (1982) percentile method and BC
(1985) and BCA (1987) modifications as well as the bootstrap t discussed
extensively by Hall (1988), Beran’s (1987) prepivoting approach as well as
many others — see Hall (1988) and diCiccio and Romano (1988) for recent
surveys.
Methods leading to bounds 9,;, 9,2 respectively have been considered
equivalent to first order if,

On — On2 = O(n") (1.1)


and to second order if

On — O2 = O,(n 97). (1.2)


The principal criterion for a good method has been correctness of the cover-
age probability,

P[@, aN< 0] 1-—a+O(n-"?) (1st order) (1.3)


P[@, lA< 0®)] 1—a+O(n!) (2nd order). (1.4)
It is plausible and under assumptions explored in the literature true that bounds
equivalent to a given order have coverage probabilities correct to the same
order.
Hall (1988), (see also my discussion), diCiccio and Romano (1988), Beran
(1987), diCiccio and Efron (1990) have shown that the various methods of con-
struction of bootstrap confidence bounds are equivalent to first or second order
to bootstrap t methods.
In this note we limit ourselves to different bootstrap t methods and investi-
gate the effect of:
1) Varying the estimate of scale in the denominator
2) Varying the estimate 6 of @ used in the numerator
3) Warying the bootstrap method (nonparametric or parametric)
in terms of,
a) Equivalence of the resulting procedures
b) Correctness of probability of coverage and two aspects which have not
been studied as extensively,
Comparison of t Confidence Bounds 67
a
SS

c) Minimization of the amount of undershoot


d) Robustness to failure of parametric assumption.
Our observations in c) are a simple application of the results of Pfanzagl (1981,
1985) — see also Bickel, Cibisov, van Zwet (1981) that first order efficiency
implies second order efficiency.
The robustness remarks are trivial but we believe worth noting — see also
Parr (1985).

2. Second order correctness and equivalence.


As we indicated our observations are X,,..., X, ii.d. F, with empirical
distribution which we denote by F,. We are interested in a parameter 0:
F — R where F is a nonparametric family of distributions containing all distri-
butions with finite support, for example F = {all distributions} or {all distribu-
tions with finite variance}. We consider the natural estimate 6 = 6(F,). We
assume as a model Fy ¢ F and we are interested in lower confidence bounds
for 8 restricted to Fg which are based on 6. The estimate 6 may or may not be
efficient for estimating 8 on Fp but we can essentially always think of efficient
estimates in this way. For instance, if Fp is parametric with densities p(-,6)
then (under regularity conditions) @(F) formally solving

J Vlog p (x,0)dF(x) = 0
where V is the gradient, corresponds to 6= MLE. We also assume that 6 is
asymptotically linear with influence function y(-,F) see Hampel (1986). That
iss

6 = OF) + no! D2, w(X;,F)


+ Opn!) (2.1)
where

Jv@,FdF@ = 0
and
0° (F) = Jy’, F)dF(x) < ~,
for all Fe F. Thus if L(-|F) denotes the distribution of a function of
X, under.,
X,;,.. F then

L vn (6 - 0 (F) )F —> N(,1).


0 (F)
68 Bickel
a ___.__ |

We further suppose that we are given an asymptotically linear estimate 6 of


6 (F) on Fo with influence function r,

6 = o(F) +n! D2) r(X;,F) + Opn’). (2.2)


for Fe Fo. If,

T, (FF) = vn - 0)/6
and we know the exact distribution L(T,(F,,F)|F) we are led to the exact
1-— a LCB
a 6
Gin Se)
where

Pe(T,(FyF) < c,(F)] = 1-a. (2.3)


The bootstrap bound(s) corresponding to Og are as usual obtained by estimat-
ing F by F, and replacing the, in general, unknown c, (F) by c, &,) where,

Pe [T,(Fa>Fa) < nF] = 1-@ (2.4)


and F- is the empirical df. of a sample Xo Sen x- iid. F,. That is
2 Oe cs
Gs0oTr = 8 = vote: (2.5)
If F, = F, we are dealing with the usual nonparametric bootstrap. If Fo = {Fy}
where Y= (8,n) is a Euclidean parameter, and y= (6, 1) is the M.L.E. of y,
then F, = F; is the parametric bootstrap. How do different bootstrap bounds
compare in terms of second order correctness and equivalence? The following
“‘theorem’’, stated under unspecified regularity conditions gives the answer.
Parts A and B of the theorem have been noted by Hall, Beran, and others. We
give a heuristic proof and then indicate what kind of regularity conditions are
needed.

We let Ogoor, O:xacr denote the nonparametric bootstrap bounds


corresponding to F, = F, and the corresponding exact bound. Superscripts 1,2
will indicate different choices of numerator 6“) or 62 while subscripts will
similarly correspond to different choices of 6“ or 6. ®8ppoor With or
without indices will indicate F, = F@ 4) when F = {Fy}, ¥ = (0,7).
Comparison of t Confidence Bounds

Theorem: Under suitable regularity conditions, for F < Fo,


A. ®goor; 8pgoor are second order correct.

B. 8goor = 8ppoor a2 O, (a) = OExact at; O, (i)

C. If 6, = 6 but 6 = 6™ + a where Lp(A,) > Lp (A) then

Lp {n @gdor — 8fdor) > Lp(a - d®) (2.6)


where d(F) is a constant. If A is constant, A=d(F). Thus, unless 6 and 6

differ to second order only in bias i.e. 6? = 6 4 ae +O, (n>?) for A a

constant, then OYor and O,Xor are not equivalent to er order.


D. Suppose 6 = 6? = 6. an efficient estimate for @ on Fo, but r, #
where Tj correspond to 6; via (2.2). Then,

Le {n ®goori
— 9B00T2)} > L(A) # 0

Comments: A,B: If F € Fo there is nothing to choose between the parametric


and nonparametric bootstrap t bounds to second order

(©) If 6 = MLE, b, = 2,6) — 6 then typically b, = a + O(n-2) so that if


b&,)
Sie
§? = 6 , the debiased MLE, we expect (under regularity conditions!)

bootstrap t bounds based on 6, 6 both to be second order correct and second


order equivalent even though the parent estimates differ to second order. This
continues to hold for estimates 6 + d(F,)/n for d smooth. However, if 6 is
an efficient estimate produced by an alternative method such as modified
minimum x? where we can expect Lp(A) to be nondegenerate then the bounds
are no longer second order equivalent. Admittedly such estimates 6 can
d(F,)
always be improved to second order by procedures of the form 6+ . For
n

a discussion see Berkson (1980), Pfanzagl (1981)


(D) If 6 is the MLE of @ for a parametric model Fy suppose

6) = fw? F,)dF,(),
the nonparametric estimate of o? (Fy) while

6; = 0)
where ¥ is the MLE of y. Then r, # rp unless oy is also efficient for Fo. Thus
70 Bickel
_._|| ee
a

the bounds are not equivalent to second order. As a consequence we note the
following phenomenon. Hall (1988) shows that the Efron parametric and non-
parametric BCA bounds are second order equivalent to bootstrap t bounds

based on =e , i1=1,2, respectively. We conclude that, in general, the


i
parametric and nonparametric BCA bounds are not second order equivalent
despite the equivalence in A. Using the parametric or nonparametric bootstrap
for the distribution of 6 which is a starting point in this method does make a
difference.

Formal proof of theorem:


Our heuristic argument supposes that, for each studentized statistic we con-
sider T(F,,F) both L(T(F,,F)|F) and L(T(F,,F,)|F,) admit (Edgeworth)
expansions to order n?2. That is,

PE[T(@,F) < x] = ®(x) — 0'76(x) AG,F) + Or") (2.7)


where A(-,F) is a polynomial of degree 2 and O(n7') is uniform in x. Simi-
larly, we require,

Pp (TQ,F,) <x] = ®(x)-n76@)AGF,)+O0,@7). (2.8)


Agreement of @goor7 and O¢xacr (part B) follows from asymptotic inversion of
(2.7), (2.8) and

A(-,F,) = A(-,.F) + Op”). (2.9)


If we further suppose that we can substitute the random x =c,(F,) into
(2.7) with O(n7!) changing to O, (n~!) then evidently part A follows. We note
in passing that the same type of heuristics indicate that all @goor are first order
equivalent to each other as well as first order correct.
The heuristics for C and D are based on the following lemma from Bai,
Bickel, Olshen (1989).

Lemma: Suppose, for j = 1, 2, statistics T,,,


1) T,j; have Edgeworth expansions to order nr

PT, <x] = ®(x) +017 A;(x) + O(n") (2.10)


with O(n7!) uniform in x.
Comparison of t Confidence Bounds

2) If Tyg= Ta +
n =

L(T,;,4,) > L(U,V) with E]V|<©.


Then,

A, (x) = Ap (x) forall x iff E(V|U)=0


The proof is obtained by considering the Edgeworth expansions of Ee!" which
are valid under (2.10).
To prove C, D we note first that by A we can equivalently consider OQacr
and OSacr-
By (2.7) we expect

ceO@ = Om + PO 4 o@.
Vn
Since

= of) + O,@')
we obiain that (2.6) holds with

c(h) =o)" +").


If A is constant we need only show that
A
et = ®
Le). (2.11)

since then,

OBact
6% ~BWacr ee=eeae
2 etAn ed maA
= Op (n"')

But if,
¥ (6) — @) a (6 — 9) 3 An

Dae Ghlcacadtitg ee his 6 (F)


. .
.
evidently
1/2
n’“(T,y — Ty) =
An
Z Se =0,(1). The lemma applies with

V =0 and (2.11) follows. For D we also need a result from the theory of
efficient estimation, (see Pfanzagl (1981), Bickel, Klaassen, Ritov, and Weliner
Bickel

(1991)) which we again state without explicit regularity conditions. These may
however be found in the works cited above.

Proposition: Let Fy = {F,: ye I}, 0S R* be a regular parametric model and


1: Fo > R given by 1.(F,) = q(y) where q is smooth. Let flere be an efficient
(BAN) estimate of 1 and fl be another regular, asymptotically linear estimate of
pt. That is, for all F € Fo, in a uniform sense,

AQ = p® + oD, wX%,F) + 0, (7!) (2.12)


where {w(x,F)dF(x)= 0, [w?(x,F)dF
(x) < ee and the same holds for Weep
Let = (X,,¥),j=1,...,k be the score function (derivatives of the loglikel-
i
ihood
of X;). Then if F=F, forj=1,...,k,

cov, (w (X), Fy) — Weg (X1; Fy), of KP) —i (0). (2.13)


j

That is, w (X;,


Fy) — Weg¢ (Xj, F,) is orthogonal to the tangent space of Fo at
Fo. This is essentially a consequence of the Héjek-Le Cam convolution
theorem. Alternatively it can be viewed as a consequence of the differential
geometry of statistical models, see for example Efron (1975). If the linear span
of = (X,,¥)} is interpreted as the tangent space of Fo at Fy claim (2.14) is
J
true for semiparametric models as well see Bickel et al (1991). To prove D it
is enough to establish that

Cy) = cy) + O(n). (2.14)


For then, by A, the same applies to c,) (F,) and c,» (F,) and hence,

nN @goor2 — Soon) = 2 eq, (F)Ly (Fy (Xj, F) — rp %;, F))


+ o,(1),
and D follows. But (2.15) follows from the lemma and proposition. Since 6 is
efficient, W(X), Fy) is in the linear span of So Sw pel. tg keels
j
without loss of generality we take, 6, = 6.g then, by the proposition, for all
Fy € Fo,

cov,(W(X), Fy), 1) (X),F) — 1 (X),F)) = 0. (2.15)


Write,
Comparison of t Confidence Bounds
LLL
SS

pees Dies gd WOO)»


aie oP)
Then,

(6, — 64)
Va On Pax (2.16)
82
A,
ye Ue ge
x Vn
where

A, = U,(n"!?
E28) (ry (KF) — 12 (XK,F))o! ®)
+ O, (1).

Evidently, Lp (U,, A,) tends to a L (Z, WZ) where


Z~N(0,o7()[w*(x,F)dF(x)), W is independent of Z and
N (0, 0°? F) f(t) — 12)?(x, F)dF(x)). Evidently E(WZ|Z)=0 and we can
apply the lemma. O

To make these results rigorous we need to justify (2.7), (2.8), (2.9) and sub-
stitution of the random c, (F,) into (2.7). When the estimates are smooth func-
tions of vector means n! D2, M(X,), M)x1, the argument is due to Cibisov
(1973) and Pfanzagl (1981), see also Hall (1986). In general the idea is:
a) To expand T(F,,F) in a von Mises or Hoeffding expansion and show that
the remainder after three terms can be neglected in the Edgeworth expansion.
That is, write for suitable aj

T(F,yF) = Va{fa)@d@,- F)@) (2.17)


+ faoxy)d(F, - Fd, -Fy)
®,— F) (2)
+ fa3(x,y,z)d(F, — F)(x)d (Fy= F)(y)d
at tea!
where P[|r,]| = n 2/23] = O(n!) for some 6 > 0. This is to be expected since
we expect r, = O, (n~*) Conditions such that the sum of the first three terms
has an Edgeworth expansion of the from (2.7) may be gleaned from Bickel,
Gotze, van Zwet (1989) for example. Of course (2.9) can, in principle, be
justified in the same way save that techniques such as those of Singh (1981)
Bickel
___.,_| eel

and Bickel and Freedman (1980) have to be employed to get by the failure of
Cramer’s condition due to the discreteness of F,. Substitution of c,(F,) in
(2.7) can be justified once we express T (F,, F) — c, (F,) in the form (2.17) by
using the inversion of (2.8) for c, (F,).

3. Second order optimality and robustness.


It is natural to define second order efficiency in terms of undershoot for a
lower confidence bound @* by: For all F € Fo,
i) e* is second order correct
ii) If @ is second order correct and 4, > 0, nl2 6, = O(1) then

Pp[O*
< O(F)-5,] < Pp[O< O()-8] + om). (3.1)
In fact to avoid superefficiency phenomena we essentially have to require
second order correctness to hold uniformly on shrinking neighbourhoods of
every fixed F and then require (3.1) to similarly hold uniformly. If o(n7) is
replaced in (3.1) by o(1) then e* is first order efficient. It is shown in Pfan-
zagl (1981) and Bickel, Cibisov, and van Zwet (1981) that first order efficiency

implies second order efficiency. But Ogoor = 6 - ae 2-0 + Op (nr). IF 6


is efficient as an estimate it follows that Ogoor is first order efficient.
We conclude that second order efficiency of bootstrap t confidence bounds
depends only on the first order efficiency of the estimate 6 defining them and
not on the choice of 6 beyond its consistency. If 6 is not first order efficient
then Ogoor is not first order and a fortiori not second order efficient.

Robustness: Suppose 6 (F) is the parameter we wish to estimate for all Fe F.


If Fo = {Fy: yeT}, Cc RK we may wish to estimate o(F)) by o(F,) in form-
ing Qgoor- If F ¢ Fo and y = y(F) + 0, (1) we expect 9 (F;) = 6 (Fy) + 0, (1).
Unless Fyp) = F, 9goor will not even be first order correct. This does not hap-
pen if we use 6’ = fy? (x, F,,) dF, (x) and the nonparametric bootstrap.

On the other hand if we use 6% = Jw? (x, F,) dF, (x) but use the parametric
bootstrap, when F € Fo, we are, in general, first order correct. The reason is
that,

1-o = Pp |—~,~—_ < a) = Pre


Comparison of t Confidence Bounds

But Opgpoor is not second order correct since that depends on


A(- » Fry) = A(-,P), which, in general, is false. Thus, robustness considera-
tions strongly mandate a robust estimate of variance and more weakly mandate
use of the nonparametric bootstrap.

References

Bai, C., Bickel, P.J. and Olshen, R. (1989). The bootstrap for prediction.
Proceedings of an Oberwolfach Conference, Springer-Verlag.

Beran, R. (1987). Prepivoting to reduce level error of confidence sets. Biome-


trika 74, 457-468.

Berkson, J. (1980). Minimum chi-square not maximum likelihood (with discus-


sion). Ann. Statist. 8, 457-487.

Bickel, P.J. (1974). Edgeworth expansion in nonparametric statistics. Ann.


Statist. 2, 1-20.

Bickel, P.J. and Freedman, D.A. (1981). Some asymptotics on the bootstrap.
Ann. Statist. 9, 1196-1217.

Bickel, P.J., Chibisov, D.M. and van Zwet W.R. (1981). On efficiency of first
and second order. International Statistical Review 49, 169-175.

Bickel, P.J., Gotze, F., and van Zwet, W.R. (1983). A simple analysis of
third-order efficiency of estimates. Proceedings of the Berkeley Conference in
Honor of Jerzy Neyman and Jack Kiefer. Wadsworth. Belmont.

Bickel, P.J., Gotze, F. and van Zwet, W.R. (1989). The Edgeworth expansion
for U statistics of degree two. Ann. Statist. 14, 1463-1484.

Chibisov, D.M. (1973). An asymptotic expansion for a class of estimators con-


taining maximum likelihood estimators. Theory Probab. Appl. 18, 295-303.

diCiccio, T. and Romano, J. (1988). A review of bootstrap confidence inter-


vals. JRSS B 50, 338-355.

diCiccio, T. and Efron, B. (1990). Better approxiamte confidence intervals in


exponential families. Tech Report, Stanford University.

Efron, B. (1975). Defining the curvature of a statistical problem (with


76 Bickel
|e

applications to second order efficiency). Ann. Statist. 3, 1189-1242.

Efron, B. (1982). The Jackknife, the Bootstrap and Other Resampling Plans.
SIAM. Philadelphia.

Efron, B. (1985). Bootstrap confidence intervals for a class of parametric prob-


lems. Biometrika 72, 45-58.

Efron, B. (1987). Better bootstrap confidence intervals (with discussion). J.


Amer. Statist. Assoc. 82, 171-200.

Hall, P.J. (1986). On the bootstrap and confidence intervals. Ann. Statist. 14,
1431-1452.

Hall, P.J. (1988). Theoretical comparison of bootstrap confidence intervals


(with discussion). Ann. Statist. 16, 927-985.

Hampel, F., Renchotti, E., Rousseuw, P., Stahel, W. (1986). Robust statistics:
the approach based on influence functions. J. Wiley. New York.

Singh, K. (1981). On the asymptotic accuracy of Efron’s bootstrap. Ann. Sta-


tist. 9, 1187-1195.

Parr, W. (1985). The bootstrap: Some large sample theory and connections
with robustness. Stat. Prob. Letters 3, 97-100.

Pfanzagl, J. (1981). Contributions to a general asymptotic statistical theory.


Lecture Notes in Statistics 13, Springer Verlag.

Pfanzagl, J. (1985). Asymptotic expansions for general statistical models. Lec-


ture Notes in Statistics 31, Springer Verlag.
re

BOOTSTRAP FOR A FINITE STATE MARKOV CHAIN


BASED ON I.LD. RESAMPLING

Somnath Datta and William P. McCormick


University of Georgia

ABSTRACT
This paper investigates the scope of bootstrap schemes based on i.i.d.
resampling for estimating the sampling distribution of the m.lL.e. pi of @ transi-
tion probability p;; of a finite state Markov chain.
The asymptotic accuracy of a bootstrap method proposed by Basawa et al.
is studied. It is shown that the best rate possible with this method is O(n~”),
where n is the sample size. Three modified bootstrap schemes are proposed for
the above problem. It is shown that an Edgeworth correction is possible with
each of these new methods when estimating the sampling distribution of stand-
ardized §;;, if p; is irrational.

1. Introduction
A number of bootstrap methods for estimating the sampling distribution of
hitting time, transition counts and proportions of a Markov chain have recently
been proposed by various authors. Basawa et al. (1989) and Kulperger and
Prakasa Rao (1989) considered Markov chains with finite state spaces; the
countable state space case has been investigated by Athreya and Fuh (1989).
Whereas the asymptotic validity of these methods has been established by
the respective authors, nothing is known about their rates of approximation. In
this paper, we study the asymptotic rates of a method proposed by Basawa et
al. (1989). This method is easy to implement in practice because, given the ori-
ginal data, the bootstrap distribution is constructed using i.i.d. resampling.
We show that the best rate possible with this method is O(n”), where n
is the sample size, which is the same as that of the classical normal approxima-
tion. This method is referred to as "conditional bootstrap" by the previous
authors. The main reason it fails to be any better is that this approximation is
based on a rather naive i.id. sampling which cannot account for a part of the
skewness term arising from the dependent structure in the original sample.

Department of Statistics, Univ. of Georgia, Athens, GA 30605, USA.


The first author’s research was partially supported by a grant from the Univ. of
Georgia.
ET ___...___ eee.

Exploring the Limits of Bootstrap. Edited by Raoul LePage and Lynne Billard.
Copyright©1992 by John Wiley & Sons, Inc. ISBN: 0471-53631-8.
Datta and McCormick

In this paper we propose several new methods to overcome the above


problem. These new bootstrap schemes are also based on i.id. sampling and
therefore easy to use; yet they achieve a rate of o(n-“), which is better than
that of the normal approximation.
Throughout this paper, {X,:n 2 1} will denote an aperiodic, irreducible
Markov chain with finite state space S = {1, 2, .... N}. It is well known that in
this case a unique stationary distribution exists. The one-step transition matrix
will be denoted by P = (pj) and the stationary distribution by p = (p)).
The maximum likelihood estimator P = (f;;) of P based on Xj, ..., Xp41 is
given by

n::

— if n; > 0,
i
Bi = (1.1)
0, otherwise, ;

with
n n

ny = DIX, =i, Xy,=j] and n= YX, =i],


t=1 1

where, for a set A, [A] stands for its indicator function. The problem is to
bootstrap the sampling distribution of pj on the basis of the original data
Xj, --, X43. The following is a method proposed by Basawa et al. with
motivation deriving from a multinomial type representation of the Markov
chain {X,}. See Basawa et al. for the details.
Given the original data Xj, ..., X,,1, obtain §,; by (1.1). Then generate
independent random variables W,, with probability mass function
P*(Wit = j) = fy 1 Si,jsNjt2 1. (1.2)

Define

* ni * °
ny = 2 [Wir = J], (1.3)

and

Pj =ny/n;, 1 <i, j<N. (1.4)


Finite State Markov Chain

Then the distribution of n;“(p;; — pi), given X), ..., X,41, is a bootstrap approxi-
mation to the distribution of n;*(p;; — Piy)-
Basawa et al. showed that this approximation is asymptotically valid, in
sup norm on R, along almost all sample paths X, ..., X,,;. In this paper, we
obtain a one-term Edgeworth expansion for the bootstrap distribution. Com-
parisons of this with the corresponding expansion for the sampling distribution
yield the rate results mentioned earlier. These results are presented in the next
section.
Section 3 describes some modified bootstrap methods and the correspond-
ing rate results. As mentioned, earlier, the main result of these section is that a
better rate of approximation is possible with these methods if Pj is irrational.
All the proofs are presented in Section 4.

2. Asymptotics for the Conditional Bootstrap


Let P denote the probability governing the Markov chain {X,} and P* the
(conditional) probability governing the bootstrap distribution. All the results in
this section involving P* are valid for almost all sample paths X, X, -
All the proofs are given in Section 4. In what follows, 1 <i, j <N.
The first theorem establishes a rate for the bootstrap approximation to the
distribution of centered 5;;. Throughout this section, Pi will stand for the
bootstrap estimate defined by (1.4). Let p; = n,/n.

Theorem 2.1
Let 0 < Pij < 2 Then

(a) sup IP{n;#(p; — py) <x} — P*{n;*(pj — By) < x} I= 0282),

(b) sup 1P{n“n,(6, — py) < x} — P*(n-n,(p, - By) < x} I= (SERED),

The next theorem shows that a better rate of convergence is possible for
the bootstrap approximation to the sampling distribution of the standardized p;;.

Theorem 2.2 :
Let 0< Pij <1. Then

Yan (f Yan (p—p:.


Pee Pr PD <1 00%.
x (ip;(1-p;))” (6:6;(1-B,))”
80 Datta and McCormick

A comparison of the one term Edgeworth expansions would reveal that the
above rate cannot be improved. The Edgeworth expansion for the distribution
of standardized f,; has been obtained by Datta and McCormick (1990). We only
present the case when pj; is irrational. In the other case, the expansion will
have an additional discontinuous n~” term. The Edgeworth expansion for the
bootstrap approximation follows from Theorem 2.2 of Datta (1989).

Theorem 2.3
Suppose that for some 1 <k <N, py > 0. If pj is irrational then

TOE <x) = Ox) + FP OE


y XD
(1-29 ; +3p483) + 00,
py)" p
uniformly in x € R, where @ is the standard normal c.d.f., @ is the standard
normal p.d.f. and

Si= L WK? - vp).


k=1

Remark 2.1 The above theorem can be obtained under a more general condi-
tion. See Datta and McCormick (1990) for details.

Theorem 2.4
Let 0 < Pij <aele Then

aK n-“n;(p;j—B;))
< x} = O(x)+
n(x)(1—-x?)(1—2p;)
(6;6;(1-B;))* 6(p;p;(1—p;))”

n*9(x)Q(n*x(6;h;;(1-f;))*)
tt -4

(pp,(1-p,))” sea
uniformly in x, where Q(x) = [x] — x + 4.

It can be seen from the above two theorems that the n-“ terms in the two
expansions do not match. Hence the closeness of the bootstrap approximation
to the true sampling distribution cannot be any smaller than O(n-%).
Finite State Markov Chain 81
SS

Roughly speaking, this bootstrap technique is ignoring certain dependency


present in the representation of nj; while mimicing it; and therefore over-
simplifying the matter by ‘generating nj; from a binomial distribution. Conse-
quently the third moments of the two distributions do not match.
Note that, because of the latticeness of the summands, the Edgeworth
expansion for the bootstrap has a discontinuous term which contributes to
further mismatch. This term can, however, be removed by adding a small
correction term n 8n to the bootstraped pivot ni*(p;; > 8/16)” where
1) is a zero mean random variable independent of W*’s having a finite third
absolute moment and a characteristic function with a compact support. This
would follow from a continuous version of a special case of the Theorem in
Babu and Singh (1989).
In the next section, we propose a few bootstrap techniques which will take
care of both the above mentioned problems and thereby improve the rate of
approximation.

3. Some New Bootstrap Methods


We now propose three bootstrap methods based on i.i.d. sampling, more
or less along the same line, with the intention of making the first order Edge-
worth expansion match with that of the sampling distribution. Consequently,
the bootstrap schemes may appear to be somewhat artificial. However, because
of the iid. nature of the samples, these should be easier to implement than
some other natural methods, e.g., parametric bootstrap, proposed by Basawa et
al. and Athreya and Fuh.
As observed in Theorem 2.3 the dependence in the data shows up in the
one term Edgeworth expansion. In order to match the Edgeworth expansion in
Theorem 2.3 we choose in Theorem 3.1 to use random weightings applied to
the transition counts. To motivate our bootstrap scheme let Fj be a non-lattice
distribution with
fxdF; =0 5 OF = fx°dF; = PiP;j(1—p3j) and fxcak;; = 14;0;(1—2p;;) where
Ty = (1—-2p,;+3p,S;;)/(1—2p,). Then the one-term Edgeworth expansion for a
standardized sum of iid. random variables with distribution Fj; is exactly the
expansion appearing in the right hand side of Theorem 2.3. Our objective then
is to mimic the maximum likelihood estimate p,; = n,/n, by a bootstrap estimate
Bij whose standardized conditional distribution is appropriately close to that
corresponding to F,; and then invoke a continuous Edgeworth expansion result.
Given the data Xj, ..., Xp413, we compute pf; as before. For k2 1 let
(pf) = PX be the estimated k-step transition matrix. We estimate S,; by
Datta and McCormick

K,
> 6? - ph), if ij
k=1
ee (3.1)

f. = (3.2)
0, otherwise.

It will be shown that fj; is strongly consistent for r;;.


The rest of the plan indicated before can be carried out in several ways.
Here we propose two most natural schemes and a slight modification of one of
them.
The first bootstrap scheme can be described as follows. Generate i.i.d.
pairs (Xj, X>,), 1 $<t <n such that
P*QXG, =i) =aPyordsics Nj (3.3)

and

P*(X>, = jIX}, =i) =p, 1 Si, j<N. (3.4)

Let [Xj, = i, X3, = j] be a transition count. In place of the naive bootstrap esti-
mate 2[Xj, =i, Xz, =i] / ZX}, =i] we will consider a bootstrap estimate of
the form
n n

DY WlXne =i Xy =H / DO XT = i
1 t=1

where the Lie are i.i.d weights, with Bl. =1 and E'lj, = fi.
More precisely, given i and j, generate i.i.d. Liv 1 <t <n such that

P* (Lig = Ayj) = Pip


Finite State Markov Chain
a
t a N ee 83

P* (Li, = By) = paijs (3.5)

P* (I, = 0) = 1 — pyy — Paij»

and I;;’s are independent of (Xj, X>)’s, where


1, if f>-I,
Aj = (3.6)
ijp if hij< -l,

and

fii, if fj Sale

B; (3.7)
1, if %j <1,

Piij
a Ao:2 Rita ijibe ’ (3.8)
Aj(Bi; — Aj)

P2ij
Shope cethie
2
te° (3.9)
By(By — Ajj)

Let

nig = D VilXt = i
nN %* .
(3.10)
t=1

and
n % . * .

a= Ii [Xq, =i, Xp, = j). (3.11)


tl

Then

ning) »if nig#0,


* . *

Pij =
0, otherwise,
Datta and McCormick

is a bootstrap estimate of f;; and we have the following result:

Theorem 3.1
Along almost all sample paths X,, X,, --- ,

=s A log Nv
(a) sup IP(nj*(6;-pi) < x} — P*(nj“*ni@y(pij—B;)) < x} |= OC yee

1 A BYR won log log n


(a’) sup IP{n-“n;(f;-p;;) < x} — P*{n-“nji(pj—P;) < x}! = ree
x

if 0 < Pij < 1;

moreover, if the conditions of Theorem 2.3 hold then

n“ni(B;—p;) bones n “ni(pij—Pi)


(b) sup IP{ < <x}! =o(n-”).
(p;p;(1-p;))* (6:6; (1-6;;))”

Next, we describe another scheme which is slightly simpler than the


above. The resulting method will be a direct modification of the conditional
bootstrap of Basawa et al.(1989).
Given the data X, ..., X,4,, generate W’’s as in Section 1, and indepen-
dently of W"’s, generate i.i.d. Lit 1 St <n,, as in the first method of this sec-
tion. Let
* m1 * * a a :
nig = LY liv nj = DL LlWi = 3),
tl t=1

and

nj;/nig ; if Nici) # 0,

0, otherwise.

Then Pi is our bootstrap estimate of p,; under this method. In order to remove
the effect of latticeness and obtain o(n~”) rate, we need to add a correction
term to the the standardized P; :
Finite State Markov Chain 85
SSS
SD

Theorem 3.2
Along almost all sample paths X;, X2,

(a) ag IP{nj*(6y — py) $x} — P*(nj“*nsg(py — Bi) $x}! = OC see


ifO0< Pij <1:

(a’) sup IP(n-™n,(6;; — pi) $x} — P*(n“*nXi op," - By) <x} 1 = 0 log fs Dy

if0< Py < 1;
moreover if the conditions of Theorem 2.3 hold then

(b)
-Y% 7a -Y4 ¥ 7, # a
n nN; — ee n Yn. . oo ee

sup (jee <x}-P* ery +n; 38n <x}! =o(n-),


. (P;py(1 — P;)) (B;H;(1 — By)

where 1) is a zero mean random variable (independent of W’’s and Ii;’s) having
finite absolute third moment and compactly supported characteristic function.

It is possible to remove the correction term in the last theorem by adjust-


ing the random weights appropriately. The idea is to modify the distribution of
the weights suitably that would guarantee the non-latticeness of the summands
in the limit.
Given the data X,, X>, ..., Xn41, generate W’’s as in Section 1. Let i,j be
as in (3.2).
Define

phy sike fy > a)


Aj a (3.12)
[le tenis t,

and
[F+1]}+€,, if tj >1

— (3.13)
1+ &,> if ij <1

where 0 < €, — €, and € is irrational. Define Piijand Poi as pj; and pj; with
Datta and McCormick
OS a

Aj and Bj replaced by Ajj and B; respectively. Generate Lies 1<tSni,,


independently of W’’s such that I,’ are iid. having probability mass function
given by (3.5) with the p, A and B replaced by the corresponding p’, A’ and B’.
Letting
*“” mh *?
nig = D lit
=
and

n;
* * * .

ny = D Vie (Wir = J)
=

* *” 7 */”
Nij / Nj) 5 if Ni) ye 0,
7?

Pye
0 , otherwise,

defines a bootstrap estimate of fj; For this third method we can conclude the
following:

Theorem 3.3
Along almost all sample paths Xj, Xo, ...,

(a) sup IP(nj*(6jj — Py) $x} — P'{ny*ng(py — By) $x}! = Se


if 0 <p, <1,
(a’)sup lP(n™n,(p; — pi) $x} —P*{n*nyy(p; — By) Sx}! = (ERED)

if 0 < py < 1;
moreover, if the conditions of Theorem 2.3 hold then

bate A

Pi)<x}x}!! =o=o(
n-“n-*..

< x) — Py Pi
p::
™.)
(D:. —

(b) sup ee (nn"”


n~

(B;H,(1 — pi)”
5 (PipPy(l — pi)”y
Finite State Markov Chain

To the best of our knowledge, the present paper is the only work so far
studying the rates of bootstrap approximations for Markov chains. It is hoped
that this paper will initiate further studies in this direction. In particular, the
following questions deserve investigations:

(1) How does the natural parametric bootstrap (see, e.g., Basawa et al.,
Athreya and Fuh (method I)) perform in terms of rates of approximations?
To answer this question, one essentially needs to establish a continuous
version of the Edgeworth expansion for Bj-

(2) How to bootstrap the joint distribution of (p;;) with a rate better (if possi-
ble!) than that of normal approximation? The parametric bootstrap may be
a candidate for this. Clearly, the method in this section can handle only
one f;; at a time. .

(3) Is it possible to bootstrap a studentized version of fj, ¢.g., ni*(p; -


p;)/(B(1 — £;)), with a rate of o(n-)? Certainly, this will be more use-
ful, e.g., in the construction of confidence interval for Pi» than the present
results.

4. Proofs
We are going to use the following terms and notations throughout this sec-
tion. For any sequence Vj, ..., V, of i.id., non-degenerate random variables
with finite second moment, n“(V — 11)/o will be called the standardized sum of
the V’s, where V =n! = V,, p = EV), 0? = E(V; - 1)”. A sequence of distri-
butions {F,,} on R is said converge in d3 to a distribution F on R, written as
z OF if {F,} converges to F weakly, and [ixi34F, = fle ldF. ® and
will denote the standard normal c.d.f. and p.d.f. respectively.
Let Y, = (X;,, X41), t2 1, where {X,} is the original Markov chain. Then
{Y,} is also a Markov chain with state space S,= {(u,v): 1suvs N,
Puy > 0}, which has at most N? states. Moreover the stationary distribution for
Y; is given by Pau,v) = Pu Puv-

Since {X,} is irreducible, aperiodic, and has a finite state space, there exist
M < - and p < 1 such that
max_ |p{*) — p;l < Mp*, for all k2 1. (4.1)
1<i,j<N
88 Datta and McCormick

Consequently,
| pfkt2 -p
max (x,y)
(u,v),(x,y)e S2 IsGey)

1a IpPyx’® —
33 Pxy | — Px p.| < Mo*,
p for all k > 1,

where Penny) denote the k-step transition probabilities for {Y,}. Therefore
{Y,} satisfies the basic condition (0.1) of Nagaev (1961), for some kp large
enough.
Fix 1 <i,j <N. Define
f(Y) = (X,=i, Xy1 = j] — pylX = id.

Note that Ef(Y,)=0, for all t and

DX f(Y) = ni; — Pj). (4.2)


1

Moreover, under the stationary initial distribution,

o? = lim B,(J= ¥ (¥))? = py py(l - py) > 0,


n—eo 1
(4.3)
if 0 < pj < 1. Therefore,

Yiu
n-“n,(B;-Pi,) 27 2K :
SEC
(4.4)
(ppy(1-p,))* no -
Theorem 1 of Nagaev (1961) will be used to prove Theorems 2.1 and 2.2.

Proof of Theorem 2.2


The conditions (0.5) of Theorem 1 of Nagaev are satisfied by g(x) = x and
f(Y,). Therefore it follows by that theorem and the identification (4.4) above
that

n“n;(6;;-Pi) =
sup |P{ — <x} — Ox)! = OM”), (4.5)
x (p;p;(1-p,))” :
Finite State Markov Chain 89
SSS

On the other hand, since nj ~ Binomial (n;, p,) under the P* and
P; = nji/n;, by the ordinary Berry Esseen Theorem for i.id. summands it fol-
lows that
n;*(p;—-P
nj ;) constant
Meee ee OO
(B;;( 1—9;;)) a (;;( 1—p pr?

= O(n”), (4.6)

a.s. (P), because lim inf p;(1—p;;) = pi(1—p;;) > 0 and n/n > p; > 0, as. (P).
n-— eo

Theorem 2.2 follows, by the triangle inequality, from (4.5) and (4.6). O

Proof of Theorem 2.1


(a) From (4.5) and (4.6) we get that

Bj A 1 * a
subir og n;*(B;; — py) < x} — P*{n;4(pj — py) < x}!
1

ae ee i (495
(p;(1-p;))* — (6(1-B,))”

a.s. (P), where M = sup It (t)l.


By the law of iterated logarithm (Chung, I16, Theorem 5), the second term
in the above bound is O((log log n/n)”) a.s. (P).
We now complete the proof by showing that

sup Pe))* n;*(6;;-p;) < x} — P{n;*@;-p;) < x}| = 0(284)%, (4.8)

For any 0 < € < 1, the LHS above is bounded by

sup IP{Z, <x} — P{Z, < x(1+e)”}|


x

+ sup IP{Z, <x} — P{Z, < x(1-€)*}| (4.9)

+ P{ 1p;
— pj! > ep;},
90 Datta and McCormick

where

Z, = (Pin n;*(P;-P;,)
Pi (p(1-p;))”

By (4.5) and the mean value theorem, the first two terms in (4.9) are
O(e) + O(n”). The third term is no more than
y%
pin € 1
A ee On),

where o7 = 0*(f;) = p,(1—p;) + 2p; & (p§9-p;), by the Berry-Esseen bound


tl
(Nagaev, Theorem 1) with f(X,) = [X, = i],
Pi Z

=oe *)+0m~”,,

= (0),: if e= 407log n
1
ye

This completes the proof. 0


(b) We once again get from (4.5) and (4.6) that

ap P(n”“n,(6j; — pj) < x} — P*{n*n,(p5 — fi) < x} |

<on™) + |—_+__- __1___im


(p;P;(1—p;)) (6:B;(1-6;))”

a.s. (P), where M = sup It o(t)l.


By the law of iterated logarithm applied to p; and pj, the second term in
the above bound is O((log log n /n)”) a.s. (P). This completes the proof. 0

Proof of Theorem 2.3


See Datta and McCormick (1990). O
Finite State Markov Chain 91
SSS

Proof of Theorem 2.4


It can be verified that n-“nj(p; — By)/BiB\(1-))* is the standardized
sum corresponding to the eee (Wi, =j], t = 1, 2, ..., nj. Under P*,
(3)
[Wis = j]~ Binomial (1,H;;) — Binomial (1,pi), a.s. (P). Moreover, since 0 <
Pi; < 1, Binomial (1,p,;) and, a.s. for large n, Binomial (1,p;;) are lattice with
span one. Hence the expansion for Dj follows from Theorem 2.2 of Datta
(1989), with p = oo and xg= 0, and the fact that fp;= n,/n > p; ( >0) as. (P).
O

The following lemma is needed for proving the theorems in Section 3.

Lemma 4.1
For any 1 $i,j <N,
8; > Sj as. (P), asn > ©,

Proof
For i = j , Sy=Sj=1- limpg® = 1—p;. Hence the conclusion is
immediate from the fact that p; — p; a.s. (P).
The conclusion in the i#j case will follow if we prove that
Lies
a (k-1) _ ps? = pi) os pi! —> 0, as. (P). (4.10)

Since P > P a.s., it follows that each summand in the above sum con-
verges a.s. to zero, as n> oo. It can be seen from (4.1) that for kg large
enough,

sup Ip (A) — pA) <1, (4.11)

where the supremum is taken over all 1<ij<N and A c S; and


OA) =p Py” for any 1Si<N, ACS.
jeA
Zs.

Fix any r, € (1,1) where r denotes the LHS of (4.11). Since P > P,

sup 1p" (A) — BP (A) <1


ij
O2 Datta and McCormick

for almosts all sample paths, for large enough n (depending upon the path).
Therefore (see Nagaev), a stationary distribution f = (6;) for P exists and

st 1pm — pI STF. for all k > ko.

Thus, for almost all sample paths, for large enough n,


3 Es 1
Ipg—? — pf — pO) + p01 < 2Mp*+ ners

for all k > kp, where M and p are given by (4.1).

The proof now ends by D.C.T. O

Proof of Theorem 3.1


(a) and (a’). Fix 1 <ij.<N. Note that n-” nj (pj — 6;)/(6;6;(1-B,))” is the
standardized sum of the i.i.d summands

Yp = Wj(OXqe = i, Xo = f] — BylXie = iD), (4.12)

1<tsn. It is easy to check that,

E'Y, =0, 0; =E'Y,” = p,6;(1-;),

ROME ee gs PR Rare
(1-2f;+38;,8;)b:5,(1-f;;), if Pij*>>
* *3
U3, =E Y,~ =
. A on 1

0, if Pij = 3 a

as. as. EA a.s.

Since Pi=> pj. > 0, Bj =) Pij E (0,1) and Si —- Si, we get that
, b3
iensup - <oo, a.s. (P). Hence by the Berry-Esseen Theorem for i.i.d.
n

*
n “nici(Pi7-Py)
x (f;P;(1-P ;:))” <x} - O&)!| = OM), (4.13)
iPij ij
Finite State Markov Chain
SS 93
a.s. (P). Rest of the proof is essentially the same as that of Theorem 2.1.

(b) Let Y,. be a random variable generated in the same way as Yi with p;
replaced by Pi» Bj by Pij» Bi by Tj =1+ 3Pij S;/ (1—2p,). Then = has a
non-lattice distribution F;; (say) since Pij is irrational, with

JxdF=0, fxdF, = p,p,(1-p,),

J PdF,; = rypp4(1-p,) 0-2, = PiPij1—py) 1—-2p;+3 p84)

a.s. as. as.


By the Ergodic Theorem and Lemma 4.1, p; > p;, i > py and fj; > Ty.
3
Hence Fj — Fj, as. (P), where F; is the distribution of Y,. Therefore,
Theorem 2.1 of Datta (1989) applies with p = oo, F = F;; and G = F; to yield
Be ny: me oe -% ree D

Rete < x) = O(n) +


py (B;6;1-H;)) SPOS (1-ap.5+ 3p5 $3) + 0(0™),
6(p;p;(1—-p;;))

uniformly in x, a.s. (P). The proof now ends by Theorem 2.3 and the triangle
inequality. O

Next we prove a general result for i.i.d. summands which will be used to
prove Theorem 3.2 (b). This result is a continuous version of special case of
the Corollary in Babu and Singh (1989).

Theorem 4.1
Let Yj,.... Y, iid. with distribution G=G, such that fyaG =." if
(3)
G — F such that fy*dF > 0, [ly |?dF< ©, then

P{ ee
Mey oh n*13(1-x7)()
yt nN <x} = M(x) + foal ci is a +o(n-”) (4.14)
0, 60°

uniformly in x, where 62 = fy’dG, o? = fy?dF, H3 = fy°dF and 1 is a zero


mean random variable which is independent of Yj, Y>, ..., Y,, has finite third
absolute moment and a characteristic function with compact support.
94 Datta and McCormick

Proof
Let og and y be the characteristic functions of G and 1 respectively. Let

y(t) =e*2(1 + a Go,


where, [3,, = jy°dG.
For any 5 > 0, by Esseen’s Lemma (Lemma XVI.3.2 of Feller), the LHS
of (4.14) is bounded by
J. (oa(n-“t/o,,) w (nt) — y,(t))/t
dt (4.15)
Itl<n*5o,

+ f ly(ttldt,
Itl>n*8o,,

for large enough n such that {|Itl>n” 8 o,} (-\ support (wy) = 9.
Since, [3,2 U3<0, 6,-6>0, and et? < @ TOR o4 for
Itl > n“8o,, it follows that the second term in (4.15) goes to zero exponen-
tially fast, as n > 9,
The first term in (4.15) can be further bound by

J 1@8t/o,) — y(Dytldt+ fy (t) yen


341 I/Itldt. (4.16)
Itl<n*do, Itl<n*8o,

The first term above is handled in exactly the same way as in the proofs of
Lemma 4.2 and Theorem 2.1 of Datta (1989). To bound the second term we
use the Taylor expansion for wy:

yi)=1+twO+—
w), 2 ”

0<t’<t. Since n has mean zero, w’(0) =0. Also, vw is continuous, because
Y has finite third absolute moment, and hence is bounded on the compact sup-
port (wy). Let suply” | = 2K. Then the second term in (4.16) is no more than
lua litt
K n 34 f et + Malice 1A Idt,
Itl<n*8o,, oO;

= O(n->4).
Finite State Markov
Chain

This completes
the proof. 0

Proof
of Theorem 3.2
Since 5. as PV Oi<1-Hy)” is the standardized sum of the iid
summands Y = LIWs = jl — B). Saamy poste (a) aed TRY Sotlows "in“the
Lae ea
G)
By similar arguments as inthe proof of Theorem 3.1(b), Fj —» F,, as.
(P), where F5 isthedistribution ofYj, F 1s the distribution of Y_ which is
generated inthesame wayasY; withf,;replaced byp,,andf, by 1. Clearly,
ee - a dF;=pyll-py), fr’dFy=pf 1-py)(1-2p,+3pySy). Hence, by
toes almost all sample paths,
Se 11-294305S
»he) ets yeu) 4 SO
2 nz”), uniformly in x,

n “(x 1-x?(1—2p;; + pS.)


= D(x)+
p; p.<1-p))” +o

in x. The proof once again ends by Theorem 2.3. 0


uniformly

Proof
of Theorem 3.3
It can easily be checked that 1” ni (pz — f,)/(PH,(1-f,))* is the
g
to the summands
standardized sum correspondin
¥; = 1,0W; =i) - 6), 1x, (4.17)

which are iid under P* , with


E*°Y/ =0, E°Y;/?=6,<1-6,), EY,? =t§1-)1-26,).

Thus, parts (a) and (2')ofthe theorem follow asbefore (Le., parts (a) and (a’)
cx-Tiiecaces 3.1) dace nfe py
For past (b), define Y_. in the same way asY{ with p,; replaced by pz f,5
by 1% and €, by €. Also, define Y_ similarly with p,; replaced by py and Ay,
B,, by Az, By given by
96 Datta and McCormick

-1-%, if Tij > -1,

[r;; Co 1] ae , if TiS -l,

and

[r3] +e, if Tj >I,

l+e, if Ty S 1.

Then Y,, and vo are both non-lattice random variables because € is irrational.
Furthermore,

E Ls = 0, E Zz = py(1—pj), ;

E Z3 = ry py(1—p)1—2p;) = py(1—p;)1—2p +3383)»

where Z,, = Y.. Or ve


To obtain the desired expansion, first consider the case when r;; is not an
ba oe Then, by similar arguments as in the proof of Theorem 3.1(b),
F;; — F, ; Fj and F, being the distributions of Y," and Y., respectively. Con-
sequently, by Theorem 2.1 of Datta (1989)

n*nigy (Py — By)


Pre eae ae < x} (4.18)
(B:H,;1—-B;;))

n;*6(x)(1-x?)
= &(x) + ———~(1-2p;, + 3p,S;) + o(n;),
6(p(1—-p;;))” ane ee ;

uniformly in x, a.s. (P),


-l4 2

oO onPijt3pijaaa, a
ahs«) é 6(p;p;(1-p;;))” ( Sij) a o(n 5 (4.19)

a.s.
uniformly in x, a.s. (P) since n/n > p; > 0.
Finite State Markov Chain 97
a

For the case when r;; is an integer, consider two subsequences of n;, viz.,
3
{n;} = {nj: 52 rj) and {n;*} = {nj:f; <1}. Then, as. (P), along nj, Fi > Fi,
and along n;, F;; — F;; where F, is the distribution of Y_. Since Y,, and Y.,
have the same first three moments, we get the same expansions (4.18) and
(4.19) along both the subsequences.
Finally, we get the conclusion of the theorem by comparing (4.19) with
Theorem 2.3. O

REFERENCES

Athreya, K. B. and Fuh, Cheng-Der (1989). Bootstrapping Markov chains:


countable case. Preprint #21, Iowa State University, Ames, Iowa.

Babu, G. J. and Singh, K. (1989). On Edgeworth expansions in the mixture


case. Ann. Statist. 17, 443-447.

Basawa, I. V., Green, T. A., McCormick, W. P. and Taylor, R. L. (1989).


Asymptotic bootstrap validity for finite Markov chains. To appear in
Comm. Statist.

Chung, K. L. (1967). Markov chains with stationary transition probabilities,


Second Edn., Springer-Verlag.

Datta, Somnath (1989). A note on continuous Edgeworth expansions and the


bootstrap. To appear in Sankhya A.

Datta, Somnath and McCormick, W. P. (1990). On first order Edgeworth


expansions for a Markov chain. Technical report, Dept. of Stat., Univ. of
Georgia, Athens.

Feller, William (1966, 1971). An introduction to probability theory and its


applications, Vol II., John Wiley & Sons.

Kulperger, R. J. and Prakasa Rao, B.L.S.P. (1989). Bootstrapping a finite state


Markov chain. To appear in Sankhya A.

Nagaev, S. V. (1961). More exact statements of limit theorems for homogene-


ous Markov chains. Th. Prob. and Appl. 6, 62-80.
coger>
ae
} ag.
a
SIX QUESTIONS RAISED BY THE BOOTSTRAP

B. Efron*

Abstract
Investigations of bootstrap methods often raise more general ques-
tions in statistical inference. This talk discusses six such questions: (1)
Why do distributions estimated by maximum likeklihood tend to have
too short tails? (2) Why does the delta method tend to underestimate
standard errors? (3) Why are cross-validation estimates so variable? (4)
What is a “correct” confidence interval? (5) What is a good nonpara-
metric pivotal quantity? (6) Can we get bootstrap-like answers without
Monte Carlo?

Introduction. Working on the bootstrap tends to raise broader


questions of statistical theory.This paper considers six such questions.
The first three worry about standard methods, and how they sometimes
fail us. The second three questions concern matters not much examined
by standard theory. Only the last question directly concerns the boot-
strap, but bootstrap considerations appear in all six.
The title may give the impression that six answers will be provided.
The actual number is closer to 1.5. My hope in presenting this paper
is to attract more solutions, or at least more interest, in some questions
that seem to me to be of considerable importance.

1. Why do maximum likelihood estimated distributions


tend to be short-tailed? Estimates of probability distributions ob-
tained by maximum likelihood tend to be more concentrated than the
distributions themselves. Here is a familiar example: suppose that 1, y2,
-++, Yn is an independent and identically distributed (i.i.d.) sample from
an unknown distribution F,

£3.d.
Bake Y1,Y2,°°*,Yn- Ett)

* Department of Statistics, Sequoia Hall, Stanford University, Stan-


ford, CA 94305-4065
———=—_—_—_—_=
ener eee err

Exploring the Limits of Bootstrap. Edited by Raoul LePage and Lynne Billard.
Copyright©1992 by John Wiley & Sons, Inc. ISBN: 047 1-53631-8.
100 Efron
_.._|| es
2

The nonparametric maximum likelihood estimate (MLE) of F is the em-


pirical distribution F,

F: probability 1/n on yi, Rese Les trea (eZ)

The empirical distribution assigns probability

Prob,{A} = a (1.3)
to any set A in the sample space of the y’s. This is an unbiased estimate
of the true probability Probr{A},

E{Prob;{A}} = Probr{A}. (1.4)

However the same unbiasedness does not apply to the variance functional:
/n has expectation
Var p{Y} = Dyni(yi — 9)?
n —1
E Vara{Y} = Varr{Y}. (1.5)
n
We see that the variance function is underestimated by maximum
likelihood, albeit mildly so. Elementary statistics courses recommend
estimating the variance by

waa erel¥} = Dua) Mn — 2),


n
(1.6)
rather than by Var; {Y} itself. This removes the bias, but doesn’t explain
it.
The underestimation effect seen in (1.5) is quite small from the
point of view of asymptotic theory. Most often we are estimating o? =
Varr{Y} in order to form a confidence interval for the expectation =
Er{Y}, say of the form
gt 2%, eara,
where z is the 100at® normal percentile, e.g. 2695) = 1.645, and
sé = G/,/n, the estimated standard error of g.
The choice of the MLE versus the unbiased estimate of o? makes a
difference of magnitude only

O(&/n) (1.8)
Six Questions Raised by the Bootstrap 101

in (1.7). This is a third order correction in the usual parlance. By com-


parison, ignoring the skewness of F in forming interval (1.7) makes a
difference of O(sé/./n), a second order error; and using ¥ instead of an
efficient estimate ji, makes a difference of O(sé), a first order error. (See
Efron (1987).) Third order notwithstanding, the underestimation effect
in (1.5) can cause substantial problems when n is small.
We can exacerbate the phenomenon seen in (1.5) by considering a
standard linear model,
Yi = U8 + €:, (1.9)
where x; is an observed p-dimensional covariate vector, @ is an unknown
p-dimensional parameter vector, and the e; are an i.i.d. sample from
some unknown distribution F with mean 0 and variance a”. Suppose we
assume that F is normal, F ~ N(0,07). Then the MLE for o? is

6G? = Du — 2,8)? /n, (1.10)

having expectation
aia
E{67} = ores (1.11)
n

If p is a large function of n, then the MLE 6G? badly underestimates


the dispersion parameter a”, Nevertheless, with p fixed and n going to
infinity, (1.11) still represents a third-order error.
The underestimation seen in (1.5) gets worse as we consider measures
of dispersion more extreme than the variance.
Table 1 shows the results of a Monte Carlo experiment involving 100
samples, each of which consisted of 10 independent observations from a
standard exponential distribution. Four functionals of F' were considered,
the mean, the standard deviation, the skewness and the kurtosis. The
sampling experiment shows that skew(F’) badly underestimates skew(F),
the situation being worse for the kurtosis. Fisher’s theory of k-statistics,
Kendall and Stuart, Chapter 12, (1958), reduces the bias of skew(F’) and
kurt(F), without explaining it.
My own concern with the dispersion-reducing tendencies of maxi-
mum likelihood estimation stems from its effect on bootstrap confidence
intervals. The simplest way to form a bootstrap confidence interval is to
102 Efron

Mean Sd Skew Kurt


True F: uf 1 2 6
Ave F: 1:01 x0! 2868.6\5910 {5,052
%<True: 55% 71% 93% 100%
Table 1. Sampling Experiment. 100 i.i.d. samples of size n = 10, from a
standard exponential distribution. For each sample, the mean, standard
deviation, skewness, and kurtosis of F were evaluated. Shown are the
average values of these four functionals, and the proportion of the 100
samples for which the sample functional was less than the true functional.

use the order statistics of the bootstrap replications. Suppose 0 = t(F’)


is a real-valued parameter of interest, such as a mean, a correlation, an
eigenvalue, etc., estimated by 6 = t(F). We draw some large number
“B” of independent bootstrap samples x* By eet acai cir , where in the
one-sample situation (1.1), each x* is a random sample of size n from F,

~ Lid: * * * pee *

er: (U1 390077 2nYa) = r (1.12)

The y; do not have to be real-valued; F' can be a distribution on an


arbitrary probability space.
Each bootstrap sample gives a bootstrap replication of 6 = uF a
namely 6*® = t(F*), where F'*? is the empirical distribution correspond-
ing to the pth bootstrap sample x*’. The percentile confidence interval,
of approximate coverage 90%, is defined to be the central 90% range of
the ordered bootstrap replication. In the usual order-statistic notation,
the 90% percentile interval is aeeene eet
The length of the percentile interval,

length = 8 958) ‘7 8058) (1.13)

is a dispersion functional of F’, similar to Vars{Y} except that it is


usually evaluated by a Monte Carlo process. Like Var a, length often
tends to be a little too small, leading to undercoverage for the percentile
confidence interval.
Once again, this effect is tiny from the asymptotic point of view.
The standard confidence interval for a parameter 6,

6 + hse, (1.14)
Six Questions Raised by the Bootstrap 103

where 6 = 6(F) and & is any reasonably efficient estimate of standard


error for 6, is typically first-order accurate: its non-coverage probability,
in each tail, approaches the nominal value 1— a at rate 1 /Vn. The boot-
strap confidence interval “BC,”, Efron (1987), which is an improved ver-
sion of the percentile interval, is second-order accurate: its non-coverage
probabilities approach (1 — @) at the factor rate 1/n. Its coverage errors
are of the third order, O(+), corresponding to the third order shortening
effect seen in (1.5). However, even third-order errors can be consequen-
tial in small samples. This is especially true if the confidence intervals
are used for hypothesis testing purposes, where small errors in the end-
points can change the verdict of the test.
There are other third-order effects that produce coverage errors of
order 1/n in the BC, intervals. Efron (1985) details the third-order errors
for bootstrap confidence intervals in a certain class of parametric prob-
lems. Perhaps the moral here is that “plug-in” methods, like maximum
likelihood and the bootstrap, give answers accurate to the second-order
but not to the third; and that the tendency toward shortness of maximum
likelihood estimated distributions accounts for part of the third-order
error.

2. Why does the delta method tend to underestimate stan-


dard errors? The delta method is the oldest device for assessing the
standard errors of complicated statistical estimates, predating the jack-
knife and bootstrap methods by at least 150 years. The delta method is
also known as the method of statistical differentials, propagation of er-
rors formula, and the Taylor series method. Despite its ancient pedigree,
I have found the delta method to be less reliable than the jackknife or
the bootstrap, with an occasional tendency to badly underestimate the
true standard error.
Table 2 shows the results of all four sampling experiments in Efron
(1982) that compare the delta method, bootstrap, and jackknife estimates
of standard error with the true standard error. In the first experiment,
each sample consisted of ten independent pairs (z;, y;), with 7; having a
uniform distribution on [0,1], independent of y;, which had a G (one-
sided exponential) distribution. The ratio statistic 6 = y/zZ then had
true standard error .37, as verified by extensive Monte Carlo sampling.
Each sample in the experiment gave a delta method, bootstrap, and
104 Efron

jackknife estimate of standard error. In this case all three methods were
nearly unbiased, giving average standard error estimates of .35,.37, and
.37 respectively. Details of the four sampling experiments appear in Efron
(1982), Section 3.
In the last three sampling experiments, the delta method has a no-
ticeable downward bias. This is particularly evident in the last case,
where the statistic of interest was the tanh”? transformation of a simple
correlation coefficient from 15 independent bivariate normal points.
A puzzling aspect of the differences seen in Table 2 is that the delta
method is intimately related to both the jackknife and the bootstrap. In
fact the delta method is identical to Jaeckel’s (1972) infinitesimal
jackknife. Suppose 6 is a functional statistic 6 = S(F ), such as the
ordinary mean S(F) = f adF = %. Let FS be a distorted version of
the empirical distribution F that puts extra probability on the zth data
point,
fe a te —e)/n+e probability on tj (2.1)
: (1—e)/n probability on z;, j #7. ;

The empirical influence function is the derivative

statistic se estimate True Sample consists of n independent


delta boot jack se n pairs (z;,y;), with distribution
y/z 35 37 37 37° ~=—«10 xz; ~ U(0,1) indep. of y; ~ G,
9/z 53 64 70 67 =©10 =x; ~ U(0,1) indep. of y; ~ G?/2
sample correlation .175 .206 223.218 «14 (ai,yi)~ No(u, YP) true corr = .5
tanh™? (corr) 244 .301 314.297 «14 =(aj, y;) ~ No(u, YP), true corr = .5

Table 2. Four sampling experiments comparing the delta method, boot-


strap, and jackknife estimates of standard error, from Efron (1982). Num-
ber tabled is the average estimate in the sampling experiment. The delta
method gives the smallest estimates. In the last three cases, the delta
method estimates are considerably too small.

0 Pre

Efron (1981) showed that the usual nonparametric delta function estimate
of standard error using a linear Taylor series expansion of 6 is identical
Six Questions Raised by the Bootstrap 105

to the infinitesimal jackknife formula suggested by Jaeckel,


n

sedelta{9} = [D| U?/n?}/?. (2.3)


1=1

Notice that setting « = —1/(n — 1) in (2.1) gives the empirical dis-


tribution for the reduced data set with the ith point removed,

“Ga
Ae, ay
eSds A a ee
0
a+ n
robability
rete on
probability on z2;, j #2,
2;
(2.4)
Let 6.3) = S(Fiy) and 6.) Ser, 6(:)/n. Tukey’s jackknife estimate of
standard error is
A n—-l_-s» ”
S€jack{9} = {—— =]; — OP}. (2.5)

This is almost the same as (2.3), except that € in (2.2) has been set equal
to —1/(n—1) instead of going to zero. (Tukey’s formula also incorporates
an extra factor of n/(n — 1), in order that Sejack {7} exactly equal the
usual formula for the standard error of the mean, [=(#;—Z)? /n(n—1)]!/?.)
Efron and Stein (1981) showed that the jackknife estimates of vari-
ance, cee oh?> tends to be biased upward as an estimate of the vari-
ance. A moderate upward bias is discernable in the jackknife column of
Table 2. The close relationship of definitions (2.5) and (2.3) suggests that
the jackknife and delta method should behave similarly, but such is not
always the case.
The bootstrap method gives the nonparametric maximum likelihood
estimate (MLE) of standard error: let se{t; F} indicate the true standard
error of a statistic 6 = t(x), where x = (11, %2,--+,2n) is ani.i.d. sample
from F; then the bootstrap estimate of standard error is

Shoot {9} = se{t;


F}, (2.6)

where F is the nonparametric MLE of F, i.e. the empirical distribution of


the sample 71, 22,°-+,2n. In other words, sep got {6} is the nonparametric
MLE of the true standard error.
We expect maximum likelihood estimates to be nearly unbiased, as
demonstrated by the bootstrap column of Table 2. The next argument
shows the close theoretical connection between all three estimators, the
106 Efron
se

bootstrap, the delta method, and the jackknife. This deepens the puzzle
of the delta method’s poor performance. 3
A bootstrap sample is a random sample of size n from F,

Plig(gs ot... a4) =x". (2.7)


Corresponding to x* is the bootstrap replication 6* = t(x*). Definition
(2.6) is an ideal version of se},,,4{9}, which is usually approximated
by Monte Carlo: independent bootstrap samples x*!,x*?,---,x*? give
independent bootstrap replications 6*!, 6*?,---,0*?; then

B
Sehoot{9} = [> (6" — 6")? /(B- 1”, (2.8)
b=1

where 6*° = >, 6**/B. Expression (2.8) approaches (2.6) as B — oo.


Usually B in the range 50 — 200 gives good results, see Efron (1987).
Each member of a bootstrap sample x* = (z]},23,--+,2%) is ran-
domly selected from the original data set (x1, 22,---,@n). Let P* be the
resampling vector (Py, P},---,P*), where

Pi = #{2j = 2i}/n, (2.9)


the proportion of the bootstrap sample equalling z;. Then P* has a
rescaled multinomial distribution, of n draws on n categories each with
probability 1/n,

abou 1
P*~Mult(n,P°)/n [P°=(=,=,---,=)} (2.10)
With the original sample x fixed, we can think of 6* = t(x*) as a function
of P*, say 6(P*). Another way to state (2.6) is

S€hoot
{9} = [Var.{6(P*)}]*/2, (2.11)
where Var, indicates variance under the multinomial distribution (2.11).
Figure 1 schematically represents (2.11). The prone triangle is the
simplex S in which P* takes its values. The function 6(P*) is represented
by a curved surface over S. At the center of S is P®, with 6(P°) St xp
the actual simple value of the statistic. We usually need to approximate
Six Questions Raised by the Bootstrap 107

(2.11) by Monte Carlo because there is no closed formula for the variance
of a non-linear function of a multinomial vector.
Both the jackknife and the delta method avoid Monte Carlo by ap-
proximating 6(P*) with a linear function of P*, say OLIN (P*); and then
theoretically evaluating [Var. {9 tn (P*)}]!/ 2 instead of (2.11), from the
usual formula for the variance of a linear function of a multinomial.

6(P*)

Figure 1. Schematic representation of bootstrap resampling; resam-


pling vector P* takes its values in S; 6(P*) is the curved surface over
S; bootstrap standard error is square root of variance of 6(P*) where
P* has multinomial distribution (2.10). Dashed lines indicate the linear
approximation to 6(P*) used by the jackknife. The delta method uses
the linear function tangent to the resampling surface at the central point
(P2,0(P?)).
The delta method standard error is based on the most obvious linear
approximation to 6(P*): the tangent hyperplane to the surface 6(P*)
through the central point (P°, 6). The jackknife uses the hyperplane
which equals 6(P*) at the n “jackknife points” P(1), P(2),--+, Pin), where
PG) is

(oleenaeOsl se. 1) in —1), (2.12)


0 in the zth place. Once again, it is difficult to see why S€jack should be
biased upwards and seg,jt, biased downwards.
We can set up an artificial estimation problem which gives some
108 Efron

insight into the relationship between seggjz, and sepoo¢- Consider the
data x as fixed, and suppose that we observe a multinomial vector P ~
Mult(n, )/n, where now 7 can be any vector in the simplex S, not just P®
as in (2.10). The artificial problem is to estimate 6() having observed P.
The MLE of 6(7) in the artificial problem is 6(P). It then turns
out that [se,,o¢{9}]? is the variance of the MLE when 7 = P®, while
[segelta{9}]” is the Cramer-Rao lower bound for the unbiased estimation
of 6(7), at = P°. This argument makes it plausible that S€delta would
usually be less than S€hoot: “Plausible” isn’t a proof though, and it isn’t
true that S€delta < Shoot in every case.
The delta method is much too useful a tool to throw away. However,
it’s numerical results shouldn’t be accepted uncritically, since they seem
liable to underestimation. The jackknife and bootstrap standard errors
are both more dependable.

3. Why are cross-validation estimators so variable? Cross-


validation, like the delta method, is a time-honored technique for as-
sessing statistical error. Modern computational equipment has greatly
increased the use of cross-validation for choosing estimation rules and
estimating their prediction errors. “Time-honored” is not the same as
“tried and true”. My own experience, as described in this section, is
that cross-validation can be undependable in some situations, and that
substantially better methods are available. The discussion here is taken
from Efron (1983) and Chapter 7 of Efron (1982). Good references for
cross-validation include Stone (1974), Geisser (1975), and Lachenbruch
and Mickey (1968).
Figure 2 shows a sample obtained in a simulation experiment: we
observe a training set of data (y;, 2;), 7 = 1,2,---,14, where

ised ia) with probability 1/2


ui {1 with probability 1/2, GD

and z; is bivariate normal, with identity covariance, and mean depending


on Yi,

ails ~ No((yi — 5,0)o1) (3.2)


Six Questions Raised by the Bootstrap 109
i

The two means, (5,0) and (—$,0), are indicated by stars in Figure 2.
In real practice, of course, we wouldn’t know the probability mechanism
generating the data.

Fitted Linear Disc

Figure 2. Fisher’s linear discriminant (solid line) fit to 14 data points


generated according to (3.1), (3.2); stars indicate mean values of the “0”
and “1” distributions; the apparent error rate is 4/14 in this case.

Fisher’s linear discriminant divides the prediction space, R? in this


case, into two regions, for the purpose of predicting the y value of a
future pair (y,z), having observed only z. In practice, z might be some
observable predictors, like age, sex, weight, etc. and y some dichotomous
outcome that we want to predict, like the success or failure of a medical
procedure.
The linear discriminant predicts 1 or 0 depending on the size of a
certain linear function,

rie AR. 9 fi at+b'z>0


prea(z) = {j nee 0: (3-3)
The constant a and the vector b are functions of the training set {(yi, zi),
i = 1,2,---,14}, see (2.13) of Efron (1983). One of the main uses of cross-
validation is to assess the prediction error of a data- based prediction rule
like (3.3).
110 Efron
_.| a

The True Error Rate of a prediction rule is the probability that it


misclassifies a future observation,

True Error = Prob{y 4 pred(z)}. (3.4)

The probability in (3.4) is conditional on the training set; pred() is fixed,


only y and z being random. For the sample in Figure 2, True Error =
.344, as calculated from (3.1), (3.2). True Error rates cannot be calcu-
lated in practice since we usually don’t know the probability mechanism
generating (y, z).
An obvious estimate of True Error is the Apparent Error rate

AppError = #{yi # pred(z;)}/14, (3.5)


the proportion of the training set points misclassified by pred. Four
of the 14 points are misclassified in Figure 2, three 1’s and one 0, so
AppError = 4/14 = .286. In this case the difference between True and
Apparent rates,
Diff = True — App, (3.6)

is positive. Diff is usually positive, at least in expectation, because pred


is fit to the training set, and so does better in “predicting” the training
set than in predicting a genuinely new point.
Cross-validation gives a way of using the training set to obtain a
nearly unbiased estimate of Diff. Let pred, ;)be the prediction rule ob-
tained from the reduced training set that excludes point (y;, z;). In our
example
= 1 if a;+bi..z>0
a4 Ge
pred(,)(z) = {0 if a;+ bz 0), (3.7)

where a(;),b(;) are the coefficients of Fisher’s linear discriminant based


on (y1, 21), (y2, z2), ies
(Yi-1, 21-1), (Yi41, 2i41),°°*,(Y14, 214). The cross-validation estimate of
Diff is
Diffoy = #{yi # predy;)(z:)}/14 — AppError. (3.8)
Intuitively, the first term of (3.8) should be a newly unbiased estimate of
True Error, so Diff should be nearly unbiased for Diff. We can illustrate
this with a sampling experiment.
Six Questions Raised by the Bootstrap 111

The sampling experiment comprised 100 independent replications of


(3.1), (3.2). In other words, 100 independent realizations of Figure 2
were constructed. Table 3shows the results. We see that the average of
Diffoy, the cross-validation estimate, averaged .091, compared to .096 for
the true Diff. This corroborates the approximate unbiasedness of Diffoy.

Trial True App Diff Diff Diff


Err Err OA" SR os 2
[1] 458 .286 likes cilia. 42068
(2] 312 .357 -.045 .000 .068
[3] Bis 38e 024 8071.» 095
[4] iSh1e Aco re O71 t--051
[5] soe sor e027) 143.094
[6] asin, gaits’ SRD A = ORS
[7] oLUt e071 230. ©0701... 2083
[3] 380 .286 094 .071 .130
[9] .360 =.429 -069 .071 .119
[10] 2330. .143 192 .000 .042
all mean: .360 =.264 .096 .091 .076
100 sd: 045 = .123 PLS ee Oi) aeO35
RMSE: 149 .117

Table 3. One hundred replications of (3.1), (3.2); shown are results of


the first 10 replications, and summary statistics for all 100. The cross-
validation estimte of Diff, “CV”, is nearly unbiased for the true Diff, but
has a large standard deviation.

Table 3 also shows that Diffcy is highly variable: its standard de-
viation over the 100 simulations was .073, about 80% as big as its mean
.091. Unbiased or not, this makes Diffoy an undependable estimator of
Diff.
Efron (1983) shows that cross-validation is closely related to the
bootstrap, much as the delta method, jackknife, and bootstrap are related
in Figure 1. This leads to several new estimators for Diff, based on
variants of the underlying bootstrap argument. The most successful of
these, “Diff 632”, also appears in Table 3. We see that it is moderately
biased downward, but has much smaller standard deviation than Diffgy.
An objective way to compare the two procedures is in terms of their root
112 Efron
eS
cere i

mean square errors for estimating True Error = AppError + Diff,

RMSE = [E{True — (App + Diff)}?}}/ (3.9)


we see that RMSE was .149 for cross-validation, compared to .117 for the
.632 rule.
Five sampling experiments are considered in Efron (1983), and eight
estimators of Diff. Diff 630 was clearly the winner in terms of RMSE,
and cross-validation was even more clearly the loser. In two of the five
experiments, using cross-validation was considerably worse than simply
estimating True by App, i.e. taking Diff = 0.
We have another small mystery here: all of the Diff estimates, in-
cluding cross-validation, are variants of the same bootstrap argument,
and yet they perform quite differently in small-sample simulation experi-
ments. My belief, or hope, is that future research will produce a depend-
able improvement over cross-validation. The problem of estimating the
prediction error of a data-based prediction rule is important enough to
discuss further study.

4. What is a correct confidence interval? Suppose we wish to


construct a 90% central confidence interval for a real-valued parameter
0, having observed some data x. A proposed interval [65.0 (x), byp (x)]
is said to be accurate if it fails to cover 8 exactly 5% of the time im each
direction,

Prob{6 > 6yp(x)} = Prob{@ < 61,9(x)} = .05. (4.1)


An accurate confidence interval is not necessarily a correct one, though. If
©1,%2,***,% 9 is arandom sample from a N(@, 7) distribution, with both
the mean @ and the variance y unknown, then the student’s t interval for
6 based on only 21, 272,+--,25 is obviously accurate in the sense of (4.1).
Equally obvious, it is not the correct interval for 6. It is inferentially
wrong, though probabilistically right.
The question of correctness arises forcibly in the construction of ap-
proximate confidence intervals. Various theories have been put forth to
construct intervals accurate to a high degree of asymptotic approxima-
tion. We will discuss some of these theories in this section, and Sections
5 and 6 as well. Are the intervals they construct highly correct as well as
Six Questions Raised by the Bootstrap 113

highly accurate? Answering this question seems crucial if we are to avoid


the pitfall of the student’s t example above.
The notion of interval correctness is more difficult to pin down than
interval accuracy. Correctness is clear-cut only in the simplest situations,
where the data x can be reduced to a one-dimension sufficient statistic
6, and where the percentiles of 6 increase monotonically as a function of
6. Then the textbook method of confidence interval construction, taking
6L0 to be that @ for which the observed value 6 is at the 95th percentile
of possible outcomes, and similarly for 6537, gives what most statisti-
cians would call the correct confidence interval for 6. Sometimes more
complicated-looking situations can be reduced to the simple form. In a
bivariate normal model for example, the maximum likelihood estimate /
of the true correlation p has a distribution depending only on p.
Fieller’s construction for the ratio of normal means gives an ex-
ample of a correct confidence interval in a genuinely multiparametric
situation. Suppose y is bivariate normal with identity covariance ma-
trix, y ~ No(u,JI), and we want a confidence interval for the ratio
6(u) = we2/p1. The level surface {6() = 00} is a straight line pass-
ing through the origin at angle tan~!(0)), as shown in the left panel of
Figure 3.
There is an obvious way to test the hypothesis Hp : 6(u) = 4%.
Let Do be the signed distance from y to the straight line {6(u) = 4}.
[Almost any sign convention will do, for instance sign(Do) = sign(y2 —
yo2), where yo is the nearest point to y on {6(44) = 6}.] Then Do has
a standard N(0,1) distribution if uw € {A(u) = %}. The obvious .90
two-sided test rejects Ho if y lies outside the band |Do| < 1.645. Fieller’s
confidence interval comprises those values 9) such that the test accepts
the hypothesis 6(j:) = 69. In other words, it is these values @ such that
the distance from y to {0(j) = 0} is less than 1.645, as shown in the
right panel of Figure 3. See Section 5.14 of Miller (1986). Most, though
not all, statisticians find the pivotality of Do an irresistable argument for
the correctness of the Fieller intervals.
Fieller’s construction depends on the level surface {@() = 90} being
straight lines. Suppose we consider a parameter for which the level surface
Co, = {9() = 9} are curved, for example 0() = pif42, where the Cg, are
114 Efron

hyperbolae. Efron (1985) discusses an approximate version of Fieller’s


construction applying to the curved case.

Hypothesis Test

Reject Bate

Accept ga
-
-

Ve cate Confidence Interval

Figure 3. Fieller’s construction of a 90% confidence interval for the


rates of normal means; we observe y ~ No(u,J) and want a confidence
interval for 6(4) = 2/1; the level surface {@() = 60} is the straight
line through the origin at angle tan—!(6); hypothesis test for Hp : 8 = 9%
accepts Ho for |Do| < 1.645, where Dp is the signed distance from y to
{6(4) = 0}, left panel; the Fieller interval for @ is these values of 09 for
which Hp is accepted, right panel.

Let yo be the nearest point to the observed point y ~ N2(pu,J) on


Co,, and let Do be the signed length of y — yo (using any reasonable sign
convention). Also define curvo to be the curvature of Cg, at the point yo.
Then the adjusted signed distance,

Dy = (4.2)
De

L > cUurvg
is approximately normal. This approximation is very good, the cdf of Dj
differing from the standard normal cdf by only O(n-*/?) if y is actually
a sufficient statistic obtained from n observations yj, y2,°°-, Yn ‘No
(u, I). This is third-order accuracy, in the language of Section 1.
Inverting the approximate pivotal Dj gives a third-order accurate
approximate confidence interval for 6. Table 4 shows the results for the
case y ~ No(u,Z), Ou) = |||], when the observed vector y has length
lly|| = 5. (A version of (4.2) holds in higher dimensions.) In this case
there is an exact confidence interval for 9 based on inverting the non-
central chi-square distribution of ||y||? ~ 2(67).
Six Questions Raised by the Bootstrap 115
a

We see that the approximate interval based on the adjusted signed


distance gives excellent results in this case. The second order accurate
bootstrap interval “BC,” is not quite as good, while the first order accu-
rate standard interval, (1.14), is far off.

.05 .95
Exact: 2.68 6.19
De 2.71 6.19 3rd order
BC,: 2.94 6.06 2nd order
Standard: 3.36 6.64 1st order

Table 4. We observe ||y|| = 5, where y ~ Ne(u,J), and want a confi-


dence interval for #(u) = ||u||; exact two-sided 90% interval (2.68, 6.19) is
based on inverting the noncentral y? distribution of ||y||?; interval based
on inverting the approximate pivotal Dj (4.2), is third order accurate;
bootstrap method BC, is second order accurate; standard method is first
order accurate. From Efron (1985).

In this example, most statisticians, or at least most frequentists,


would consider the non-central chi-square intervals to be correct, as well
as exactly accurate. The Dj intervals are third order correct, as well as
third order accurate, in the sense that their endpoints differ from the
exact ones by only O(n~3/*) if y is obtained from an iid. sampling
situation.
Barndorff-Nielsen (1986) extends (4.2) to general parametric fami-
lies. The signed distance Dp is replaced by the signed square root of the
likelihood ratio statistic. “Bartlett corrections” are made to the mean and
standard deviation of Do, to get an adjusted statistic Dj which has a
standard normal distribution to the third order of asymptotic accuracy.
Then Dj is inverted to give third-order accurate approximate confidence
intervals for 6.
Are these intervals third order correct as well as third order accu-
rate? Efron (1985) argues for the correctness of the Dj intervals based
on the fact that they come from an approximate pivotal that is geometri-
cally reasonable. But (4.2) lacks the immediacy of Fieller’s construction,
even in the normal case. Moreover, there are other third order intervals
available, Hall (1988), Cox and Reid (1987), Welch and Peers (1963),
which may or may not agree with Barndorff-Nielsen’s intervals.
116 Efron
ee
—————

DiCiccio and Efron (1990) show that all of these methods give confi-
dence interval agreeing at the second order, and in a certain sense they all
are second order correct, as well as accurate. The best situation would be
if all the methods continued to agree at the third order. This happy
result might very well be false. If so, the question of correctness will
be a pressing one. Highly accurate confidence statements are not worth
pursuing if they lead to inferential errors.

5. What is a good nonparametric pivotal quantity? Pivotal


quantities play a crucial role in the theory of confidence intervals, as we
saw in the previous section. Much of the recent bootstrap literature con-
cerns the construction of approximate confidence intervals in nonpara-
metric settings. This raises the question of this section: what is a good
approximate pivotal quantity in nonparametric estimation problems?
We consider the one-sample situation, where the observed data x is
an i.i.d. sample from some unknown probability distribution F’

ips
rans (cease -foRy Sx; (5.1)

and where we want an approximate confidence interval for a real-valued


parameter 6(F'). This becomes a nonparametric problem if we assume
that F can be any distribution at all on the sample space of the 7;.
We begin with an example of a bad guess at a nonparametric pivotal
quantity. The left panel of Figure 4 shows the lawschool data, 15 pairs
of points z; = (a;,b;), where a; and b; are measures of excellence for
the entering 1975 class at lawschool 7, i = 1,2,---,5. See Section 2.5
of Efron (1982). Suppose we want a confidence interval for 6(F) the
Pearson correlation coefficient. The nonparametric MLE of 6 is 6 =
0(F) = .776, the sample correlation coefficient based on the 15 points
X= C7; 2; e598 15).
The right side of Figure 4 shows the histogram of B = 1000 boot-
strap replications of 6* — 6; B = 1000 independent bootstrap samples
x*!,x*?,...,x*3 were generated as in (2.7), and for each one the differ-
ence of correlation coefficients 6* — § (= sample correlation coefficient of
bth bootstrap sample x*’, minus .776) were calculated. We see that the
5th and 95th percentiles of the 1000 6*® — 6 values were —.254 and .170
respectively.
Six Questions Raised by the Bootstrap
SSS
117

Data From 15 Law Schools


150

100

ss 880 @0 620 640 660

Figure 4. Lawschool data, n = 15 pairs of points, left panel. B = 1000


bootstrap replications of 6* — 6, where 6 is the Pearson correlation, right
panel. The quantity 6* — @ is a bad guess at a nonparametric pivotal
quantity in this case.

Suppose we believe that 6—@isan approximate nonparametric piv-


otal quantity. Then 6*—6 should have approximately the same percentiles
as 6 — @, so that

Prob{—.254 < 6 — 4 < .170} = .90. (5.2)


Inverting (5.2) gives a bad guess at a confidence interval for @,

6 & [6 — .170,6+ .254] = [.606, 1.03]. (5.3)


The quantity 6 —8@ is a poor choice for an approximate pivotal statis-
tic in most situations, either parametric or nonparametric. If 6 —6 is
long-tailed to the left, as in the correlation example, then usually the
confidence interval for 6 will extend further left of 6 than right. (The
reader can check this by considering binomial or Poisson confidence in-
tervals.) This is the opposite of what would happen if 6 — 6 were actually
pivotal.
The second most obvious guess for an approximate nonparametric
pivotal quantity is a t-like statistic, say

6-86
T= 9
(5.4)
>
118 Efron
ee

where G(x) is some estimate of standard error for 6(x), perhaps the jack-
knife or delta method estimates. If we believe in the pivotality of T,
then we can use the bootstrap to construct a “bootstrap t” approximate
confidence interval for 6; we generate some large number B of bootstrap
replications of T, ? 4
O(x*) — 6
TS (5.5)
os G(x*) ’
compute the 5th and 95th percentiles of the values T*®, b = 1,2,---,B,
say T*(-°5) and T*(95); and assign @ the approximate confidence interval

OE [6 MO) G Lae TICES] (5.6)


a

A surprising and encouraging fact has emerged in the bootstrap lit-


erature. The bootstrap-t method gives second order accurate and correct
intervals in a wide variety of situations. (See P. Hall (1988), Abramovitch
and Singh (1985), and DiCiccio and Efron (1990).) There are good rea-
sons to believe that the T statistic (5.4) is a nonparametric pivotal to the
second order of asymptotic accuracy.
The favorable asymptotics of the bootstrap-t method are no guaran-
tee of good small-sample behavior. In fact, there are practical difficulties
connected with the choice of the denominator 6 in (5.4). This can be seen
in the lawschool correlation example. B = 1000 bootstrap replications of
T were generated from the lawschool data, with

&(x*) = [(1 — 6(x*)*)|/V15 + .03. (5.7)

Here [1 — 6?]/./15 is an approximation to the standard error of 6, sug-


gested by normal theory. The added quantity .03 was necessary to pre-
vent occasional very large values of T*. Section 5 of Efron (1990) gives a
full explanation. |
The 5th and 95th percentiles of the T* values were —.939 and 2.93
respectively. Using (5.6), this gives the bootstrap-t confidence interval
[.40,.91] for 6. The 95th percentile 2.93 is suspiciously large, leading to
a suspiciously low lower endpoint .40, see Table 5.
The same bootstrap replications that gave the bootstrap-t percentiles
can be used to check the pivotality of T*’s distribution. This relieves the
statistician’s reliance on asymptotic theory. Figure 5 shows the 5th, 10th,
Six Questions Raised by the Bootstrap 119

16th, 50th, 84th, 90th, and 95th percentiles of the 1000 T* bootstrap
replications as dashed lines. So, for example, the 5% dashed line is at
height —.939, and the 95% line at height 2.93.

Nonparametric Parametric
6*—6@ Standard Boot-t BC, Boot-t BC, Exact
.05: 61 61 40 48 53 .50 49
95: 1.03 95 91 94 93 .90 .90

Table 5. Approximate confidence intervals for the correlation coefficient,


lawschool data; the bootstrap-t and BC, methods are second order accu-
rate and correct. Parametric intervals assume a bivariate normal model
for the data. The lower endpoint of the nonparametric bootstrap-t inter-
val is suspiciously low, compared with the BC, answer and also with the
parametric results.

The percentiles of the T distribution when sampling from F’, as in


(5.1), are functions of F’. Bootstrap sampling gives these function values
for F =F , the empirical distribution (1.2). If T is approximately pivotal,
then we expect the percentiles not to change much if we change F' from
F to some other nearby distribution. The jagged lines in Figure 5 trace
the percentiles of T* as we change F to the deleted-point distributions
Fi, (2.4). That is, we successively remove one point at a time from
the original data set x, and compute the percentiles of the bootstrap-
t distribution when sampling from F' = Fw, the empirical distribution
of the remaining 14 points. Efron (1990a) shows how this computation
can be done without requiring any bootstrap samples beyond the original
1000.
The upper bootstrap-t percentiles are seen to be quite variable under
small changes in F. This argues against the pivotality of T, in this
situation. The jackknife-after-bootstrap theory in Efron (1990a) uses the
jagged line variability to assign a standard error of £1.8 to the estimated
value 2.93 for the 95th percentile. We simply do not have enough data
to estimate the nonparametric bootstrap-t percentiles very well in this
problem.
120 Efron

Nonparametric Bootstrap-t Percentiles

nace es
a e -90

Br

org
Reaea: - ae
-- - -| -05
ee
=2.97 =1.07 —.25,-=.22) —.10) (—.10 700 0 923) ol, «93 <61)8) 39 71, Olas

Figure 5. The percentiles of the 1000 bootstrap-t replications T*, dashed


lines; also the percentiles of T*, successively removing one point at a time
from the original data set, jagged lines. The upper percentiles are highly
sensitive to removing single data points, indicating a lack of pivotality.
From Section 5 of Efron (1990a). Bottom numbers show point removed,
numbered as in Figure 4.

The T statistic (5.4) is a reasonable answer to the question “what


is a good nonparametric pivotal quantity?”, at least in theory. Further
development will be needed to make the bootstrap-t a dependable method
in practice. It is not yet clear when bootstrap-t methods, even with the
bugs worked out, will be preferable to other nonparametric confidence
interval methods like the BC,.

6. What are computationally efficient ways to bootstrap?


Typical problems require®S0="200"boctstrap "replications to" estimatena,
standard error, and 1000 — 2000 replications to compute a bootstrap
confidenceuntérval, see Section 9 of Efron (1987). These numbers assume
that the bootstrap estimation is done in the most obvious way. Various
computational and probabilistic methods have been suggested to reduce
Six Questions Raised by the Bootstrap 1

the number of replications required. The promise of such methods is


not only a reduction of the computational burden, but also a deeper
understanding of the bootstrap. This section discusses just two of the
methods, with references to many others.
Suppose again that we are in the one-sample nonparametric situation
(5.1), and that we wish to assess the bias of the statistic 6(x) as an
estimate of the parameter 6(F’). The straightforward bootstrap estimate
of bias is calculated as follows: B bootstrap samples x*? give replications
*b — 6(x*>) 6 = 1,2,---, B; then the bias of § is estimated by

B
ae
biasg = B1 ; 6*°A*b — (x).
A
(6.1)

In the resampling vector notation of (2.9)-(2.11),

B
biasg = 5
= A(P**) — 6(P°), (6.2)

where P*? is the resampling vector corresponding to x*’, and P® is


(1,1,---,1)’/n, the center point of the resampling simplex S in Figure 1.
As B goes to infinity, biasg approaches the ideal bootstrap bias estimate

biasoo = E,{6(P*)} — 6(P?), (6.3)

E, indicating expectation with respect to the multinomial distribution


(2.10).
Section 2 of Efron (1990b) discusses an improvement on biasz,

biasg= 2 6(P*>) — 6(P), (6.4)


= a —

where P = By P*>/B, the average of the B resampling vectors. (This


assumes that 6(P*) is continuously defined as a function of P*, so that
6(P) is well specified.) As B increases, the improved estimate bias
approaches the ideal value bias.. more quickly than does biasp.
Table 6 refers to the following simulation experiment: ten samples
were selected, each consisting of n = 10 independent bivariate normal
points x; = (yi, zi), drawn from the distribution F = N2((8, 4)'//10, I).
122 Efron
a

We are interested in the bias of 6(x) = 2/g as an estimate of 0(F) =


E{y}/E{2}. bat
Table 6 compares biasg with biasg for B = 20. The bias estimates
are multiplied by 1000 for each reading. Also shown is biasgoo0, which
we use in place of biasz9 as the ideal bootstrap bias estimate. We see
that biasyo is much better than biasg9 in matching this ideal, the ratio of
squared errors for the 10 samples being

D[biaseo = biasgoo0]? /D[biaseo = biasgooo] ==) 2329)

In a much larger version of Table 6, this ratio was estimated to be 50.3.


In effect, this means that biasg9 is about as effective as bias2ox50!

Sample [email protected] DiasSs biasz

1 7.35 4.08 — 16.53


2 23.45 26.78 31.90
3 10.90 11.49 27.19
4 17.30 15.00 —31.12
5 95 — .60 2.56
6 3.50 3.35 — 10.00
7 14.55 13.99 56
8 3.90 3.18 9.82
9 4.45 6.68 70
10 12.85 23.90 3.55

NOTE: Entries are 1,000 x bias estimate. inthis case biasay isabout 50 times as good an
estimator as biaso.

Table 6. Estimates of bias for 6 = Z/y; ten samples, each of size n = 10


drawn from (y;, 2:1) ~ N2/(8, 4)'/,/10, I); biasoo compared with biasog as
an estimate of biasgoo0 = bias; biasgo tends to be closer to biasgogo.-
Entries are 1000 x bias estimate. From Section 2 of Efron (1990b).

The bias estimate biasp corrects biasp by taking into account the
discrepancy between P, the average of the observed resampling vectors,
and P°, their theoretical expectation. Davison, Hinkley, and Schechtman
(1987) make the correction another way: they draw the resampling vec-
tors P*>, 6 = 1,2,---, B, in a manner which forces P to equal P°. Then
(6.2) performs very much like biasg. Various methods of improved boot-
strap sampling for reducing the number of bootstrap replications appear
Six Questions Raised by the Bootstrap = 123

in Johns (1988), Graham et al. (1987), Ogbonmwan and Wynn (1986),


Therneau (1983), and Hesterberg (1988).
In some situations it is possible to obtain bootstrap results without
any Monte Carlo sampling at all. DiCiccio and Efron (1990) discuss a
method for confidence interval construction within exponential families.
Their method called “ABC”, for “Approximate BC,” or “Approximate
Bootstrap Confidence” intervals, replaces the Monte Carlo bootstrap
replications of the (parametric) BC, method with simple analytic ap-
proximations. This is possible because exponential family problems tend
to be very smooth. The ABC method requires only 4p + 6 recomputa-
tions of the statistic of interest, compared to a few thousand for the full
BC, intervals, where p is the dimension of the exponential family.
Figure 6 shows two examples of the ABC method. The right panel
concerns the lawschool data of Figure 4. Now we assume that the data
is bivariate normal, 2; DE et YL) for 2 = 1,2,---,15, with the mean
vector 4 and covariance matrix { unknown. This is a p = 5 parameter
exponential family. We want a confidence interval for the correlation
coefficient. The first-order correct standard intervals, dashed lines, are
compared with the second-order correct ABC intervals, solid lines. The
ABC limits lie far below the standard limits. The solid points indicate the
exact limits for the correlation coefficient, assuming bivariate normality.
We see that the ABC limits are nearly, but not perfectly, correct.
The right side of Figure 6 concerns the student score data from
Mardia, Kent, and Bibby (1979). This comprises 5 test scores from each
of n = 88 students. We assume that the scores have a multivariate
normal distribution, ee ye Y), i=1,2,---,88 a p = 20 parameter
exponential family. In this case the ABC limits are shifted upwards from
the standard limits. The solid points show the BC, limits from a full
bootstrap analysis involving B = 4000 parametric bootstrap samples.
We see that the ABC limits are indeed good approximations to those
from the BC,, even though they require only a few percent as much
computation.
Efron

Lawdata correlation maximum eigenvalue, score data

coverage-> covera ige->


solid=ABC, dashed=standard solid=ABC, dashed=standard

Figure 6. ABC second-order correct confidence intervals, solid lines,


compared with the standard interval (1.14), dashed lines; horizontal axis
is coverage probability of the two-sided interval; dotted line is MLE 6.
Left_panel lawschool data of Figure 4; assumes 2; 1A ie: ie
1,2,---,15; parameter of interest is correlation coefficient of [I ; solid
points indicate exact confidence intervals for correlation. Right panel
score data, Mardia, Kent, and Bibby (1979); assumes 2; ese td fiw tak
2 = 1,2,---,88; parameter of interest is maximum eigenvalue of Y; solid
points are BC, limits, B = 4000 replications.

REFERENCES
Abramovitch, L. and Singh, K. (1985). Edgeworth corrected pivotal
statistics and the bootstrap. Ann. Stat. 13, 116-132.

Barnforff-Nielsen, O. E. (1986). Inference on full or partial parameters


based on the standardized signed log likelihood ratio. Biometrika
73, 307-322.
Cox D. R., and Reid, N. (1987). Parameter orthogonality and approx-
imate conditional inference (with discussion). J. Royal Stat. Soc.,
Ser. B 49, 1-39.
Six Questions Raised by the Bootstrap 125
LLSS SSS SSS

Davison, A. C., Hinkley, D. V., and Schechtman, E. (1987). Efficient


bootstrap simulation. Biometrika 73, 555-561.

DiCiccio, T., and Efron, B. (1990). Better approximate confidence inter-


vals in exponential families. Submitted to JASA.

Efron, B. (1981). Nonparametric estimates of standard error: the jack-


knife, the bootstrap, and other resampling methods. Biometrika 68,
589-599.

Efron, B. (1982). The jackknife, the bootstrap, and other resampling


plans. CBMS 38, SIAM-NSF.

Efron, B. (1983). Estimating the error rate of a prediction rule: improve-


ments in cross-validation. JASA 78, 316-331.

Efron, B. (1985). Bootstrap confidence intervals for a class of parametric


problems. Biometrika 72, 45-58.

Efron, B. (1987). Better bootstrap confidence intervals. JASA 82, 171-


185.

Efron (1990a). Jackknife-after-bootstrap standard errors and influence


functions. Submitted to J. Royal Stat. Soc..

Efron, B. (1990b). More efficient bootstrap computations. JASA 85, 79-


89.

Efron, B. and Stein, C. (1981). The jackknife estimate of variance. Ann.


Stat. 9, 586-596.
Geisser, S. (1975). The predictive sample reuse method with applications.
JASA 70, 320-328.
Graham, R. L., Hinkley, D. V., John, P. W. M., and Shi, S. (1987). Bal-
anced design of bootstrap simulations. Technical Report 48, Univer-
sity of Texas at Austin, Mathematics Dept.

Hall, P. (1988). Theoretical comparison of bootstrap confidence intervals


(with discussion). Ann. Stat. 16, 927-985.

Hesterberg, T. (1988). Variance reduction techniques for bootstrap and


other Monte Carlo simulations. Unpublished Ph.D. dissertation,
Stanford University, Dept. of Statistics.
126 Efron
ine

Jaeckel, L. (1972). The infinitesimal jackknife. Memorandum MM72-1215-11, Bell Labs, Murray Hill, NJ.

Johns, M. V. (1988). Importance sampling for bootstrap confidence intervals. JASA 83, 709-714.

Kendall, M., and Stuart, A. (1958). The Advanced Theory of Statistics. London: Charles Griffin.

Lachenbruch, P., and Mickey, M. R. (1968). Estimation of error rates in discriminant analysis. Technometrics 10, 1-11.

Mardia, K. V., Kent, J. T., and Bibby, J. M. (1979). Multivariate Analysis. New York: Academic Press.

Miller, R. G. (1986). Beyond ANOVA. New York: John Wiley.

Ogbonmwan, S. M., and Wynn, H. P. (1986). Resampling generated likelihoods. Proc. Fourth Purdue Symp. Decision Theory (Vol. 1), eds. S. Gupta and J. Berger, New York: Springer-Verlag, pp. 137-147.

Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. J. Royal Stat. Soc., Ser. B 36, 111-147.

Therneau, T. (1983). Variance reduction techniques for the bootstrap. Unpublished Ph.D. dissertation, Stanford University, Dept. of Statistics.

Welch, B. L., and Peers, H. W. (1963). On formulae for confidence points based on integrals of weighted likelihoods. J. Royal Stat. Soc., Ser. B 25, 318-329.
EFFICIENT BOOTSTRAP SIMULATION

Peter Hall¹
Australian National University

Abstract. We survey a variety of methods for efficient bootstrap simulation, including linear approximation, the centring method, balanced resampling, antithetic resampling and importance resampling.

1. Introduction
In many problems of practical interest, the nonparametric bootstrap is employed to estimate an expected value. For example, if $\hat\theta$ is an estimate of an unknown quantity $\theta$ then we might wish to estimate bias, $E(\hat\theta - \theta)$, or the distribution function of $\hat\theta$, $E\{I(\hat\theta \le x)\}$. Generally, suppose we wish to estimate $E(U)$, where $U$ is a random variable which will often be a functional of both the population distribution function $F_0$ and the empirical distribution function $F_1$ of the sample $\mathcal{X}$: $U = f(F_0, F_1)$. Let $F_2$ denote the empirical distribution function of a resample $\mathcal{X}^*$ drawn randomly, with replacement, from $\mathcal{X}$, and put $U^* = f(F_1, F_2)$. Then $\hat u = E\{f(F_1, F_2) \mid F_1\} = E(U^* \mid \mathcal{X})$ is "the bootstrap estimate" of $u = E(U)$.

In the bias example considered above, we would have $U = \hat\theta - \theta$ and $U^* = \hat\theta^* - \hat\theta$, where $\hat\theta^*$ is the version of $\hat\theta$ computed from $\mathcal{X}^*$ rather than $\mathcal{X}$. In the case of a distribution function, $U = I(\hat\theta \le x)$ and $U^* = I(\hat\theta^* \le x)$.

Our aim in this paper is to describe some of the available methods for approximating $\hat u$ by Monte Carlo simulation, and to provide a little theory for each. The methods which we treat are uniform resampling, linear approximation, the centring method, balanced resampling, antithetic resampling and importance resampling, and are discussed in Sections 2 to 7 respectively. This account is not exhaustive; for example, we do not treat Richardson extrapolation (Bickel and Yahav 1988), computation by saddlepoint methods (Davison and Hinkley 1988, Reid 1988) or balanced importance resampling (Hall 1990b). Section 8 briefly describes the problem of quantile estimation, which does not quite fit into the format of approximating $\hat u = E(U^* \mid \mathcal{X})$.

¹ Department of Statistics, Faculty of Economics and Commerce, Australian National University, G.P.O. Box 4, Canberra, ACT 2601, Australia.

Exploring the Limits of Bootstrap. Edited by Raoul LePage and Lynne Billard.
Copyright©1992 by John Wiley & Sons, Inc. ISBN: 0471-53631-8.

2. Uniform resampling
Since $\hat u$ is defined in terms of uniform resampling (that is, random resampling with replacement, in which each sample value is drawn with the same probability $n^{-1}$), uniform resampling is the most obvious approach to simulation. Conditional on $\mathcal{X}$, draw $B$ independent resamples $\mathcal{X}_1^*, \dots, \mathcal{X}_B^*$ by resampling uniformly, and let $U_b^*$ denote the version of $U^*$ computed from $\mathcal{X}_b^*$ rather than $\mathcal{X}$. Then
$$\hat u_B^* = B^{-1} \sum_{b=1}^B U_b^*$$

is a Monte Carlo approximation to $\hat u$. With probability one conditional on $\mathcal{X}$, $\hat u_B^*$ converges to $\hat u_\infty^* = \hat u$ as $B \to \infty$. We refer to $\hat u_B^*$ as an approximation rather than an estimate. The bootstrap estimate $\hat u$ is based on $B = \infty$, and only its approximate form is concerned with finite $B$'s. Thus, we draw a distinction between the statistical problem of estimating $u$, and the numerical problem of approximating $\hat u$.

The expected value of $\hat u_B^*$, conditional on $\mathcal{X}$, equals $\hat u$. Therefore $\hat u_B^*$ is an unbiased approximation to $\hat u$, and its performance may be reasonably described in terms of conditional variance:
$$\mathrm{var}(\hat u_B^* \mid \mathcal{X}) = B^{-1}\,\mathrm{var}(U^* \mid \mathcal{X}) = B^{-1}\{E(U^{*2} \mid \mathcal{X}) - \hat u^2\}.$$


In many problems of practical importance, $\mathrm{var}(U^* \mid \mathcal{X})$ is asymptotic to either a constant or a constant multiple of $n^{-1}$, as $n \to \infty$. For example, if we are estimating a distribution function then $U^*$ is an indicator function, and so $U^{*2} = U^*$, whence
$$\mathrm{var}(U^* \mid \mathcal{X}) = \hat u(1 - \hat u) \to u_0(1 - u_0),$$
where $u_0$ (typically a value of a Normal distribution function) equals the limit as $n \to \infty$ of $\hat u$. If we are estimating bias, and $\hat\theta = g(\bar X)$ is a smooth function of a $d$-variate sample mean, then $U^* = \hat\theta^* - \hat\theta = g(\bar X^*) - g(\bar X)$, where $\bar X^*$ is the resample mean. By Taylor expansion,
$$U^* = g(\bar X^*) - g(\bar X) \approx \sum_{j=1}^d (\bar X^* - \bar X)^{(j)} g_j(\bar X), \qquad (2.1)$$
where $x^{(j)}$ denotes the $j$'th element of a $d$-vector $x$ and $g_j(x) = (\partial/\partial x^{(j)})\, g(x)$. Thus,
$$\mathrm{var}(U^* \mid \mathcal{X}) = \mathrm{var}\Big\{\sum_{j=1}^d (\bar X^* - \bar X)^{(j)} g_j(\bar X) \,\Big|\, \mathcal{X}\Big\} + O(n^{-2}) = n^{-1}\hat\sigma^2 + O(n^{-2}),$$
where
$$\hat\sigma^2 = n^{-1}\sum_{i=1}^n \Big\{\sum_{j=1}^d (X_i - \bar X)^{(j)} g_j(\bar X)\Big\}^2 \to \sigma^2 = E\Big\{\sum_{j=1}^d (X - \mu)^{(j)} g_j(\mu)\Big\}^2$$
(as $n \to \infty$) and $\mu = E(X)$ is the population mean.
These two examples describe situations which are typical of a great many that arise in statistical problems. When the target $\hat u$ is a distribution function or a quantile, the conditional variance of the uniform bootstrap approximant is roughly equal to $CB^{-1}$ for large $n$ and large $B$; and when the target is the expected value of a smooth function of a mean, the variance is approximately $CB^{-1}n^{-1}$. In both cases, $C$ is a constant not depending on $B$ or $n$. Efficient approaches to Monte Carlo approximation can reduce the value of $C$ in the case of estimating a distribution function, or increase the power of $n^{-1}$ (say, from $n^{-1}$ to $n^{-2}$) in the case of estimating the expected value of a smooth function of a mean. Most importantly, they usually do not increase the power of $B^{-1}$. Therefore, generally speaking, even the most efficient of Monte Carlo methods has mean squared error which decreases like $B^{-1}$ as $B$ increases, for a given sample.
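To make the uniform resampling recipe concrete, the following sketch (not part of the original text) approximates the bias functional $\hat u = E\{g(\bar X^*) - g(\bar X) \mid \mathcal{X}\}$ by $\hat u_B^*$ in Python/NumPy. The particular function g, the simulated sample and the value of B are assumptions chosen purely for illustration.

import numpy as np

rng = np.random.default_rng(0)

def g(xbar):
    # hypothetical smooth function of a (possibly multivariate) mean
    return np.exp(xbar).sum()

def uniform_bootstrap_bias(x, B=1000):
    """Uniform resampling approximation u*_B to u-hat = E(U* | X),
    with U*_b = g(Xbar*_b) - g(Xbar) (the bias functional)."""
    n = len(x)
    theta_hat = g(x.mean(axis=0))
    u_star = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)     # resample uniformly with replacement
        u_star[b] = g(x[idx].mean(axis=0)) - theta_hat
    return u_star.mean()                      # u*_B

x = rng.normal(size=(50, 2))                  # toy d-variate sample
print(uniform_bootstrap_bias(x))

The conditional variance of this approximant decreases like B^{-1} n^{-1}, as described above.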

3. Linear approximation
We motivate linear approximation by considering the bias estimation example of the previous section. Suppose our aim is to approximate $\hat u = E(U^* \mid \mathcal{X})$, where $U^* = \hat\theta^* - \hat\theta = g(\bar X^*) - g(\bar X)$ and $g$ is a smooth function of $d$ variables. Let us extend the Taylor expansion (2.1) to another term:
$$U^* = \sum_{j=1}^d (\bar X^* - \bar X)^{(j)} g_j(\bar X) + \tfrac{1}{2} \sum_{j=1}^d \sum_{k=1}^d (\bar X^* - \bar X)^{(j)} (\bar X^* - \bar X)^{(k)} g_{jk}(\bar X) + \cdots, \qquad (3.1)$$
where $g_{j_1 \cdots j_r}(x) = (\partial^r / \partial x^{(j_1)} \cdots \partial x^{(j_r)})\, g(x)$. As we noted in Section 2, the conditional variance of $U^*$ is dominated asymptotically by the variance of the first term on the right-hand side of (3.1), which is the linear component in the Taylor expansion. Now, our aim is to approximate $E(U^* \mid \mathcal{X})$, and the linear component does not contribute anything to that expectation:
$$E\Big\{\sum_{j=1}^d (\bar X^* - \bar X)^{(j)} g_j(\bar X) \,\Big|\, \mathcal{X}\Big\} = \sum_{j=1}^d E\{(\bar X^* - \bar X)^{(j)} \mid \mathcal{X}\}\, g_j(\bar X) = 0. \qquad (3.2)$$

Therefore it makes sense to remove the linear component. That is, base the uniform resampling approximation on
$$V^* = U^* - \sum_{j=1}^d (\bar X^* - \bar X)^{(j)} g_j(\bar X)$$
instead of on $U^*$. Conditional on $\mathcal{X}$, draw $B$ independent resamples $\mathcal{X}_1^*, \dots, \mathcal{X}_B^*$ by resampling uniformly (exactly as in Section 2), and put
$$V_b^* = U_b^* - \sum_{j=1}^d (\bar X_b^* - \bar X)^{(j)} g_j(\bar X) = g(\bar X_b^*) - g(\bar X) - \sum_{j=1}^d (\bar X_b^* - \bar X)^{(j)} g_j(\bar X),$$
$1 \le b \le B$, where $\bar X_b^*$ denotes the mean of the resample $\mathcal{X}_b^*$. Define
$$\hat v_B^* = B^{-1} \sum_{b=1}^B V_b^*.$$
Then $\hat v_B^*$ is the linear approximation method approximation of $\hat u$. With probability one conditional on $\mathcal{X}$, $\hat v_B^* \to \hat u$ as $B \to \infty$.
In view of (3.2), $E(V^* \mid \mathcal{X}) = E(U^* \mid \mathcal{X}) = \hat u$, and so $\hat v_B^*$ is an unbiased approximant of $\hat u$. The conditional variance of $\hat v_B^*$ is dominated by that of the first remaining term in the Taylor expansion, just as in the case of $\hat u_B^*$:
$$\mathrm{var}(\hat v_B^* \mid \mathcal{X}) = B^{-1}\, \mathrm{var}(V^* \mid \mathcal{X}),$$
and
$$\mathrm{var}(V^* \mid \mathcal{X}) = \mathrm{var}\Big\{\tfrac{1}{2}\sum_{j=1}^d \sum_{k=1}^d (\bar X^* - \bar X)^{(j)}(\bar X^* - \bar X)^{(k)} g_{jk}(\bar X) \,\Big|\, \mathcal{X}\Big\} + O(n^{-3}) = n^{-2}\hat\beta + O(n^{-3}), \qquad (3.3)$$
with probability one as $n \to \infty$, where²
$$\hat\beta = \tfrac{1}{4} \sum_{j_1=1}^d \sum_{j_2=1}^d \sum_{k_1=1}^d \sum_{k_2=1}^d g_{j_1 k_1}(\bar X)\, g_{j_2 k_2}(\bar X)\, \big(\hat\sigma^{j_1 j_2}\hat\sigma^{k_1 k_2} + \hat\sigma^{j_1 k_2}\hat\sigma^{j_2 k_1}\big), \qquad (3.4)$$
$$\hat\sigma^{jk} = n^{-1} \sum_{i=1}^n (X_i - \bar X)^{(j)} (X_i - \bar X)^{(k)}. \qquad (3.5)$$

² In Hall (1989a), the factor 1/4 was inadvertently omitted from the right-hand side of (3.4).

Therefore, the order of magnitude of the variance has been reduced from $B^{-1}n^{-1}$ (in the case of $\hat u_B^*$) to $B^{-1}n^{-2}$ (in the case of $\hat v_B^*$). The numerical value of the reduction, for a given problem and sample, will depend on values of the first and second derivatives of $g$, as well as on the higher-order terms which our asymptotic argument has ignored.
More generally, we may approximate $U^*$ by an arbitrary number of terms in the Taylor expansion (3.1), and thereby compute a general "polynomial approximation" to $\hat u$. For example, if $m \ge 1$ is an integer then we may define
$$W_b^* = U_b^* - \sum_{r=1}^m (r!)^{-1} \sum_{j_1=1}^d \cdots \sum_{j_r=1}^d (\bar X_b^* - \bar X)^{(j_1)} \cdots (\bar X_b^* - \bar X)^{(j_r)} g_{j_1 \cdots j_r}(\bar X)$$
(a generalization of $V_b^*$), and
$$\hat w_B^* = B^{-1}\sum_{b=1}^B W_b^* + \bar w, \quad \text{where} \quad \bar w = \sum_{r=1}^m (r!)^{-1} \sum_{j_1=1}^d \cdots \sum_{j_r=1}^d E\{(\bar X^* - \bar X)^{(j_1)} \cdots (\bar X^* - \bar X)^{(j_r)} \mid \mathcal{X}\}\, g_{j_1 \cdots j_r}(\bar X).$$
Then $\hat w_B^*$ is an unbiased approximation of $\hat u$. Of course, the approximation is only practicable if we can compute $\bar w$, which is a linear form in the first $m$ central sample moments, with coefficients equal to the derivatives of $g$. In the special case $m = 1$, $\hat w_B^*$ reduces to our linear approximation $\hat v_B^*$, and there $\bar w = 0$. The conditional variance of $\hat w_B^*$ is of order $B^{-1}n^{-(m+1)}$ as $n \to \infty$. Of course, $\hat w_B^* \to \hat u$ as $B \to \infty$, for fixed $\mathcal{X}$. If the function $g$ is analytic then for fixed $\mathcal{X}$ and $B$, $\hat w_B^* \to \hat u$ as $m \to \infty$. Each of these limits is attained with probability one, conditional on $\mathcal{X}$.
Monte Carlo methods based on functional approximation have been dis-
cussed by Oldford (1985) and Davison, Hinkley and Schechtman (1986).
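As a rough illustration of the linear approximation method (again not from the original text), the sketch below removes the linear Taylor component from each uniform resample. The gradient is obtained by a simple finite-difference routine, which is an assumption made for the example; in practice analytic derivatives of g would normally be used.

import numpy as np

rng = np.random.default_rng(1)

def g(xbar):
    return np.exp(xbar).sum()

def num_gradient(f, x0, h=1e-6):
    # central-difference gradient (illustrative choice, not part of the method itself)
    grad = np.zeros_like(x0)
    for j in range(len(x0)):
        e = np.zeros_like(x0)
        e[j] = h
        grad[j] = (f(x0 + e) - f(x0 - e)) / (2 * h)
    return grad

def linear_approx_bias(x, B=1000):
    """v*_B: uniform resampling with the linear Taylor component removed."""
    n = len(x)
    xbar = x.mean(axis=0)
    grad = num_gradient(g, xbar)
    v_star = np.empty(B)
    for b in range(B):
        xb = x[rng.integers(0, n, size=n)].mean(axis=0)
        v_star[b] = g(xb) - g(xbar) - (xb - xbar) @ grad
    return v_star.mean()

x = rng.normal(size=(50, 2))
print(linear_approx_bias(x))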

4. Centring method
To motivate the centring method approximation, recall that the linear approximation method produces the approximant
$$\hat v_B^* = B^{-1}\sum_{b=1}^B V_b^* = B^{-1}\sum_{b=1}^B \Big\{g(\bar X_b^*) - g(\bar X) - \sum_{j=1}^d (\bar X_b^* - \bar X)^{(j)} g_j(\bar X)\Big\} = \hat u_B^* - \sum_{j=1}^d (\bar X^{**} - \bar X)^{(j)} g_j(\bar X),$$

where $\hat u_B^*$ is the uniform resampling approximation of $\hat u$ and
$$\bar X^{**} = B^{-1}\sum_{b=1}^B \bar X_b^*$$
is the grand mean of all the resamples. Now,
$$\sum_{j=1}^d (\bar X^{**} - \bar X)^{(j)} g_j(\bar X) \approx g(\bar X^{**}) - g(\bar X),$$
by Taylor expansion, and so
$$\hat v_B^* \approx \hat u_B^* - \{g(\bar X^{**}) - g(\bar X)\} = \hat z_B^*,$$
where
$$\hat z_B^* = B^{-1}\sum_{b=1}^B g(\bar X_b^*) - g(\bar X^{**}).$$
We call $\hat z_B^*$ the centring method approximation to $\hat u$. It differs from $\hat u_B^*$ in that the mean of the $g(\bar X_b^*)$'s is now centred at $g(\bar X^{**})$ rather than $g(\bar X)$; recall that
$$\hat u_B^* = B^{-1}\sum_{b=1}^B g(\bar X_b^*) - g(\bar X).$$

The centring method approximation was suggested by Efron (1990).
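A minimal sketch of the centring approximation, using the same hypothetical g as in the earlier illustrations (the data and B are again illustrative assumptions), is as follows.

import numpy as np

rng = np.random.default_rng(2)

def g(xbar):
    return np.exp(xbar).sum()

def centring_bias(x, B=1000):
    """z*_B = B^{-1} sum_b g(Xbar*_b) - g(Xbar**): centre at the grand
    resample mean Xbar** rather than at Xbar."""
    n = len(x)
    means = np.array([x[rng.integers(0, n, size=n)].mean(axis=0) for _ in range(B)])
    grand_mean = means.mean(axis=0)            # Xbar**
    return np.mean([g(m) for m in means]) - g(grand_mean)

x = rng.normal(size=(50, 2))
print(centring_bias(x))

The only change from the uniform approximation is the centring point, so the extra computational cost is negligible.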


As expected, $\hat z_B^* \to \hat u$ as $B \to \infty$, conditional on $\mathcal{X}$. The approximant $\hat z_B^*$ is not unbiased for $\hat u$, although the bias is generally very low, of smaller order than the error about the mean. Indeed, it may be proved that
$$E(\hat z_B^* \mid \mathcal{X}) - \hat u = -(Bn)^{-1}\hat a + O\{(Bn)^{-2}\}, \qquad (4.1)$$
$$\mathrm{var}(\hat z_B^* \mid \mathcal{X}) = (Bn^2)^{-1}\hat\beta + O\{(Bn)^{-2} + (Bn^3)^{-1}\}, \qquad (4.2)$$
with probability one as $n \to \infty$, where
$$\hat a = \tfrac{1}{2}\sum_{j=1}^d \sum_{k=1}^d g_{jk}(\bar X)\,\hat\sigma^{jk}, \qquad (4.3)$$
and $\hat\beta$, $\hat\sigma^{jk}$ are given by (3.4), (3.5) respectively. See Hall (1989a). Note particularly that by (3.3) and (4.2), the conditional asymptotic variances of $\hat v_B^*$ and $\hat z_B^*$ are identical. Since $\hat v_B^*$ is an unbiased approximant, and the bias of $\hat z_B^*$ is negligible relative to the error about the mean (order $B^{-1}n^{-1}$ relative to $B^{-1/2}n^{-1}$), the approximations $\hat v_B^*$ and $\hat z_B^*$ have asymptotically equivalent mean squared error.

5. Balanced resampling
If we could ensure that the grand mean of the bootstrap resamples was identical to the sample mean, i.e.
$$\bar X^{**} = \bar X, \qquad (5.1)$$
then the uniform approximation $\hat u_B^*$, the linear approximation $\hat v_B^*$ and the centring approximation $\hat z_B^*$ would all be identical. The only practical way of guaranteeing (5.1) is to resample in such a way that each data point $X_i$ occurs the same number of times in the union of the resamples. To achieve this end, write down each of the sample values $X_1, \dots, X_n$ $B$ times, in a string of length $Bn$; then randomly permute the elements of this string; and finally, divide the permuted string into $B$ chunks of length $n$, putting all the sample values lying between positions $(b-1)n + 1$ and $bn$ of the permuted string into the $b$'th resample $\mathcal{X}_b^{\dagger}$, for $1 \le b \le B$. This is balanced resampling, and amounts to random resampling subject to the constraint that $X_i$ appears just $B$ times in $\cup_b \mathcal{X}_b^{\dagger}$. Balanced resampling was introduced by Davison, Hinkley and Schechtman (1986), and high-order balance has been discussed by Graham, Hinkley, John and Shi (1990). See also Ogbonmwan and Wynn (1986, 1988). An algorithm for performing balanced resampling has been described by Gleason (1988). The method of Latin hypercube sampling (McKay, Beckman and Conover 1979; Stein 1987), used for Monte Carlo simulation in a non-bootstrap setting, is closely related to balanced resampling.
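The permuted-string construction translates directly into code. The sketch below (an illustration added to this text, with a toy g and data) builds the balanced index array and uses it for the bias example; each index appears exactly B times across the B resamples.

import numpy as np

rng = np.random.default_rng(3)

def balanced_resamples(n, B):
    """Return a (B, n) array of indices in which each of the n data points
    appears exactly B times overall: write down 0,...,n-1 B times, permute
    the string of length Bn, and cut it into B chunks of length n."""
    s = np.tile(np.arange(n), B)
    rng.shuffle(s)
    return s.reshape(B, n)

def g(xbar):
    return np.exp(xbar).sum()

def balanced_bias(x, B=1000):
    n = len(x)
    idx = balanced_resamples(n, B)
    return np.mean([g(x[row].mean(axis=0)) for row in idx]) - g(x.mean(axis=0))

x = rng.normal(size=(50, 2))
print(balanced_bias(x))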
The balanced resampling approximation of $\hat u = E\{g(\bar X^*) \mid \mathcal{X}\}$ is
$$\hat u_B^{\dagger} = B^{-1}\sum_{b=1}^B g(\bar X_b^{\dagger}),$$
where $\bar X_b^{\dagger}$ denotes the mean of $\mathcal{X}_b^{\dagger}$. Once again, $\hat u_B^{\dagger} \to \hat u$ as $B \to \infty$, with probability one conditional on $\mathcal{X}$. The balanced resampling approximation shares the asymptotic bias and variance formulae of the centring approximation $\hat z_B^*$ introduced in Section 4:
$$E(\hat u_B^{\dagger} \mid \mathcal{X}) - \hat u = -(Bn)^{-1}\hat a + O(B^{-1}n^{-2}), \qquad (5.2)$$
$$\mathrm{var}(\hat u_B^{\dagger} \mid \mathcal{X}) = (Bn^2)^{-1}\hat\beta + O\{(Bn)^{-2} + (Bn^3)^{-1}\}, \qquad (5.3)$$

with probability one as n — oo, where &, # are given by (4.3), (3.4) re-
spectively; compare (4.1) and (4.2). In particular, in the context of bias
estimation for a smooth function of a mean, balanced resampling reduces
the orders of magnitude of variance and mean squared error by a factor
of n=}. Formulae (5.2) and (5.3) are not entirely trivial to prove, and the
asymptotic equivalence of bias and variance formulae for the centring method
and balanced resampling is not quite obvious; see Hall (1989a).
In Sections 3 and 4, and so far in the present section, we have treated
only the case of approximating a smooth function of a mean. The methods
of linear approximation and centring do not admit a wide range of other
applications. For example, linear approximation relies on Taylor expansion,
and that demands a certain level of smoothness of the statistic U. However,
balanced resampling is not constrained in this way, and in principle applies to
a much wider range of problems, including distribution function and quantile
estimation. In those cases the extent of improvement of variance and mean
squared error is generally by a constant factor, not by a factor n7}.
Suppose that $U$ is an indicator function of the form $U = I(S \le x)$ or $U = I(T \le x)$, where $S = n^{1/2}(\hat\theta - \theta)/\sigma$ and $T = n^{1/2}(\hat\theta - \theta)/\hat\sigma$ are statistics which are asymptotically Normal $N(0,1)$. The bootstrap versions are $U^* = I(S^* \le x)$ and $U^* = I(T^* \le x)$ respectively, where $S^* = n^{1/2}(\hat\theta^* - \hat\theta)/\hat\sigma$ and $T^* = n^{1/2}(\hat\theta^* - \hat\theta)/\hat\sigma^*$. (The pros and cons of pivoting are not relevant to the present discussion.) To construct a balanced resampling approximation to $\hat u = E(U^* \mid \mathcal{X})$, first draw $B$ balanced resamples $\mathcal{X}_b^{\dagger}$, $1 \le b \le B$, as described two paragraphs earlier. Let $\hat\theta_b^{\dagger}$, $\hat\sigma_b^{\dagger}$ denote the versions of $\hat\theta$, $\hat\sigma$ respectively computed from $\mathcal{X}_b^{\dagger}$ instead of $\mathcal{X}$, and put
$$S_b^{\dagger} = n^{1/2}(\hat\theta_b^{\dagger} - \hat\theta)/\hat\sigma, \qquad T_b^{\dagger} = n^{1/2}(\hat\theta_b^{\dagger} - \hat\theta)/\hat\sigma_b^{\dagger},$$
and $U_b^{\dagger} = I(S_b^{\dagger} \le x)$ or $I(T_b^{\dagger} \le x)$, depending on whether $U = I(S \le x)$ or $I(T \le x)$. (We define $T_b^{\dagger} = c$, for an arbitrary but fixed constant $c$, if $\hat\sigma_b^{\dagger} = 0$.) The balanced resampling approximation to $\hat u$ is
$$\hat u_B^{\dagger} = B^{-1}\sum_{b=1}^B U_b^{\dagger}.$$
Recall from Section 2 that the uniform resampling approximation $\hat u_B^*$ is unbiased for $\hat u$, in the sense that $E(\hat u_B^* \mid \mathcal{X}) = \hat u$, and has variance
$$\mathrm{var}(\hat u_B^* \mid \mathcal{X}) = B^{-1}\hat u(1 - \hat u) \approx B^{-1}\Phi(x)\{1 - \Phi(x)\}$$
as $n \to \infty$. The balanced resampling approximant is slightly biased, although the bias is low relative to the error about the mean:
$$E(\hat u_B^{\dagger} \mid \mathcal{X}) - \hat u = O(B^{-1})$$
with probability one. The asymptotic variance of $\hat u_B^{\dagger}$ is less than that of $\hat u_B^*$ by a constant factor $\rho(x)^{-1} < 1$, since
$$\mathrm{var}(\hat u_B^{\dagger} \mid \mathcal{X}) \approx B^{-1}\big[\Phi(x)\{1 - \Phi(x)\} - \phi(x)^2\big];$$
see Hall (1990a). Thus, the asymptotic efficiency of balanced resampling relative to uniform resampling, in this context, is
$$\rho(x) = \Phi(x)\{1 - \Phi(x)\}\big[\Phi(x)\{1 - \Phi(x)\} - \phi(x)^2\big]^{-1},$$
which reaches a maximum at $x = 0$ and decreases to 1 as $|x| \to \infty$. See Hall (1990a).

The same asymptotic efficiencies apply to the case of quantile estimation, which we discuss in more detail in Section 8.

6. Antithetic resampling
The method of antithetic resampling dates back at least to Hammersley and Morton (1956) and Hammersley and Mauldon (1956). See Snijders (1984) for a recent account in connection with Monte Carlo estimation of probabilities. Antithetic resampling may be described as follows. Suppose we have two estimates $\hat\theta_1$ and $\hat\theta_2$ of the same parameter $\theta$, with identical means and variances but negative covariance. Assume that the costs of computing $\hat\theta_1$ and $\hat\theta_2$ are identical. Define
$$\hat\theta_3 = \tfrac{1}{2}(\hat\theta_1 + \hat\theta_2).$$
Then $\hat\theta_3$ has the same mean as either $\hat\theta_1$ or $\hat\theta_2$, but less than half the variance, since
$$\mathrm{var}(\hat\theta_3) = \tfrac{1}{4}\{\mathrm{var}\,\hat\theta_1 + \mathrm{var}\,\hat\theta_2 + 2\,\mathrm{cov}(\hat\theta_1, \hat\theta_2)\} = \tfrac{1}{2}\{\mathrm{var}(\hat\theta_1) + \mathrm{cov}(\hat\theta_1, \hat\theta_2)\} < \tfrac{1}{2}\mathrm{var}(\hat\theta_1).$$
Since the cost of computing $\hat\theta_3$ is scarcely more than twice the cost of computing either $\hat\theta_1$ or $\hat\theta_2$, but the variance is more than halved, there is an advantage from the viewpoint of cost-effectiveness in using $\hat\theta_3$, rather than either $\hat\theta_1$ or $\hat\theta_2$, to estimate $\theta$. Obviously, the advantage increases with increasing negativity of the covariance, all other things being equal.

To appreciate how this idea may be applied to the case of resampling, let $U^*$ denote the version of a statistic $U$ computed from a (uniform) resample $\mathcal{X}^* = \{X_1^*, \dots, X_n^*\}$ rather than the original sample $\mathcal{X} = \{X_1, \dots, X_n\}$. Let $\pi$ be an arbitrary but fixed permutation of the integers $1, \dots, n$, and let $j_1, \dots, j_n$ be the random integers such that $X_i^* = X_{j_i}$ for $1 \le i \le n$. Define $X_i^{**} = X_{\pi(j_i)}$, $1 \le i \le n$, and put $\mathcal{X}^{**} = \{X_1^{**}, \dots, X_n^{**}\}$. That is, $\mathcal{X}^{**}$ is the (uniform) resample obtained by replacing each appearance of $X_i$ in $\mathcal{X}^*$ by $X_{\pi(i)}$. If $U^{**}$ denotes the version of $U$ computed from $\mathcal{X}^{**}$ instead of $\mathcal{X}$, then $U^*$ and $U^{**}$ have the same distributions, conditional on $\mathcal{X}$. In particular, they have the same conditional mean and variance. If we choose the permutation $\pi$ in such a way that the conditional covariance of $U^*$ and $U^{**}$ is negative, we may apply the antithetic argument to the pair $(U^*, U^{**})$. That is, the approximant
$$\bar U^* = \tfrac{1}{2}(U^* + U^{**})$$
will have the same conditional mean as $U^*$ but less than half the conditional variance of $U^*$.
If the $X_i$'s are scalars then in many cases of practical interest, the "asymptotically optimal" permutation $\pi$ (which asymptotically minimizes the covariance of $U^*$ and $U^{**}$) is that which takes the largest $X_i$ into the smallest $X_i$, the second largest $X_i$ into the second smallest $X_i$, and so on. That is, if we index the $X_i$'s such that $X_1 \le \dots \le X_n$, then $\pi(i) = n - i + 1$ for $1 \le i \le n$. For example, this is true when $U = g(\bar X) - g(\mu)$, where $g$ is a smooth function, and also when
$$U = I[n^{1/2}\{g(\bar X) - g(\mu)\} \le x]. \qquad (6.1)$$
The asymptotically optimal permutation $\pi$ in the case of a $d$-dimensional sample, where $U = g(\bar X) - g(\mu)$ or
$$U = I[n^{1/2}\{g(\bar X) - g(\mu)\} \le x], \qquad (6.2)$$
is that one which reverses the order of the quantities
$$Y_i = \sum_{j=1}^d (X_i - \bar X)^{(j)} g_j(\bar X), \qquad 1 \le i \le n.$$
That is, if we index the $X_i$'s such that $Y_1 \le \dots \le Y_n$, then $\pi(i) = n - i + 1$. We shall call $\pi$ the antithetic permutation. These results remain true if we Studentize the arguments of the indicator functions at (6.1) and (6.2); the pros and cons of pivoting do not have a role to play here. The reader is referred to Hall (1989b) for details.
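For scalar data the antithetic permutation is simply a rank reversal, and each uniform resample can be converted into its antithetic partner by index bookkeeping. The following sketch (added for illustration; the data, the choice U* = exp(Xbar*) - exp(Xbar) and B are assumptions) shows one way to do this in NumPy.

import numpy as np

rng = np.random.default_rng(4)

def antithetic_pairs(x, B=500):
    """For scalar data, pair each uniform resample with its antithetic
    counterpart: the i-th smallest value is mapped to the i-th largest."""
    n = len(x)
    order = np.argsort(x)              # indices of x in increasing order
    reverse = order[::-1]              # i-th smallest  <->  i-th largest
    rank = np.empty(n, dtype=int)
    rank[order] = np.arange(n)         # rank of each data point
    theta_hat = np.exp(x.mean())
    pairs = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)          # uniform resample X*
        idx_anti = reverse[rank[idx]]             # replace X_i by X_{pi(i)}
        u = np.exp(x[idx].mean()) - theta_hat     # U*
        u_anti = np.exp(x[idx_anti].mean()) - theta_hat   # U**
        pairs[b] = 0.5 * (u + u_anti)
    return pairs

x = rng.exponential(size=40)
print(antithetic_pairs(x).mean())     # antithetic approximation to the bias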

The method of antithetic resampling may be used in the following way to approximate the bootstrap estimate $\hat u = E(U^* \mid \mathcal{X})$. Draw $B$ independent, uniform resamples $\mathcal{X}_b^*$, $1 \le b \le B$; by applying the antithetic permutation, convert $\mathcal{X}_b^*$ into the corresponding antithetic resample $\mathcal{X}_b^{**}$; compute the versions $U_b^*$ and $U_b^{**}$ of $U$ for $\mathcal{X}_b^*$ and $\mathcal{X}_b^{**}$, respectively; and define
$$\hat u_B^A = (2B)^{-1}\sum_{b=1}^B (U_b^* + U_b^{**}).$$
Note that $\hat u_B^A$ is an unbiased approximant of $\hat u$, in the sense that $E(\hat u_B^A \mid \mathcal{X}) = \hat u$. The conditional variance of $\hat u_B^A$ is given by
$$\mathrm{var}(\hat u_B^A \mid \mathcal{X}) = (2B)^{-1}\{\mathrm{var}(U^* \mid \mathcal{X}) + \mathrm{cov}(U^*, U^{**} \mid \mathcal{X})\},$$
where $(U^*, U^{**})$ denotes a generic pair $(U_b^*, U_b^{**})$. In general, as $n \to \infty$,
$$\mathrm{var}(\hat u_B^A \mid \mathcal{X}) = q\,\mathrm{var}(\hat u_B^* \mid \mathcal{X}) + o\{\mathrm{var}(\hat u_B^* \mid \mathcal{X})\},$$
where $0 \le q < 1$. The exact value of $q$ depends on the situation, it being greater for less symmetric distributions. For example, in the case of distribution function estimation, where $\hat u = P(S^* \le x \mid \mathcal{X})$ or $P(T^* \le x \mid \mathcal{X})$, the value of
$$q = q(x) = \lim_{n \to \infty}\lim_{B \to \infty}\{\mathrm{var}(\hat u_B^A \mid \mathcal{X})/\mathrm{var}(\hat u_B^* \mid \mathcal{X})\}$$
is an increasing function of both $|x|$ and the skewness of the sampling distribution. The minimum value $q = 0$ can only arise in this context when both $x = 0$ and the sampling distribution is symmetric.

Thus, generally speaking, antithetic resampling reduces variance by a constant factor. In this respect, balanced resampling is superior in the case of approximating the conditional mean of a smooth function, since it reduces variance by a factor which converges to zero as sample size increases. Again, the reader is referred to Hall (1989b) for details.

7. Importance resampling
The method of importance resampling is a standard technique for improving the efficiency of Monte Carlo approximations. See Hammersley and Handscomb (1964, p. 60ff). It was first suggested in the context of bootstrap resampling by Johns (1988) and Davison (1988).

To introduce importance resampling, let $\mathcal{X} = \{X_1, \dots, X_n\}$ denote the sample from which a resample will be drawn. (This notation is only for the sake of convenience, and in no way precludes a multivariate sample.) Under importance resampling, each $X_i$ is assigned a probability $p_i$ of being selected

on any given draw, where $\sum p_i = 1$. Sampling is conducted with replacement, so that the chance of drawing a resample of size $n$ in which $X_i$ appears just $m_i$ times ($1 \le i \le n$) is given by a multinomial formula,
$$\frac{n!}{m_1! \cdots m_n!}\prod_{i=1}^n p_i^{m_i}.$$
Of course, $\sum m_i = n$. Taking $p_i = n^{-1}$ for each $i$, we obtain the uniform resampling method of Section 2.
The name “importance” derives from the fact that resampling is de-
signed to take place in a manner which ascribes more importance to some
sample values than others. The aim is to select the p,’s so that the value as-
sumed by a bootstrap statistic is relatively likely to be close to the quantity
whose value we wish to approximate.
There are two parts to the method of importance resampling: firstly,
a technique for passing from a sequence of importance resamples to an ap-
proximation of a quantity which would normally be defined in terms of a
uniform resample; and secondly, a method for computing the appropriate
values of p; for the importance resampling algorithm. One would usually
endeavour to choose the p;’s so as to minimize the error, or variability, of
the approximation.
In most problems, if the parent distribution is continuous then there are $N = \binom{2n-1}{n}$ different possible resamples. Let these be $\mathcal{X}_1, \dots, \mathcal{X}_N$, indexed in any order, and let $m_{ji}$ denote the number of times $X_i$ appears in $\mathcal{X}_j$. The probability of obtaining $\mathcal{X}_j$ after $n$ resampling operations, under uniform resampling or importance resampling, equals
$$\pi_j = \frac{n!}{m_{j1}! \cdots m_{jn}!}\,n^{-n}$$
or
$$\pi_j' = \frac{n!}{m_{j1}! \cdots m_{jn}!}\prod_{i=1}^n p_i^{m_{ji}} = \pi_j \prod_{i=1}^n (np_i)^{m_{ji}}, \qquad (7.1)$$
respectively. Let $U$ be the statistic of interest, a function of the original sample. We wish to construct a Monte Carlo approximation to the bootstrap estimate $\hat u$ of the mean of $U$, $u = E(U)$.

Let $\mathcal{X}^*$ denote a resample drawn by uniform resampling, and write $U^*$ for the value of $U$ computed for $\mathcal{X}^*$. Of course, $\mathcal{X}^*$ will be one of the $\mathcal{X}_j$'s. Write $u_j$ for the value of $U^*$ when $\mathcal{X}^* = \mathcal{X}_j$. In this notation,
$$\hat u = E(U^* \mid \mathcal{X}) = \sum_{j=1}^N u_j \pi_j = \sum_{j=1}^N u_j \pi_j' \prod_{i=1}^n (np_i)^{-m_{ji}}, \qquad (7.2)$$

the last identity following from (7.1).

Let $\mathcal{X}^{\dagger}$ denote a resample drawn by importance resampling, write $U^{\dagger}$ for the value of $U$ computed from $\mathcal{X}^{\dagger}$, and let $M_i^{\dagger}$ be the number of times $X_i$ appears in $\mathcal{X}^{\dagger}$. Then by (7.2),
$$\hat u = E\Big\{U^{\dagger}\prod_{i=1}^n (np_i)^{-M_i^{\dagger}} \,\Big|\, \mathcal{X}\Big\}.$$
Therefore it is possible to approximate $\hat u$ by importance resampling. In particular, if $\mathcal{X}_b^{\dagger}$, $1 \le b \le B$, denote independent resamples drawn by importance resampling, if $u_b^{\dagger}$ equals the value of $U$ computed for $\mathcal{X}_b^{\dagger}$, and if $M_{bi}^{\dagger}$ denotes the number of times $X_i$ appears in $\mathcal{X}_b^{\dagger}$, then the importance resampling approximant of $\hat u$ is given by
$$\hat u_B^I = B^{-1}\sum_{b=1}^B u_b^{\dagger}\prod_{i=1}^n (np_i)^{-M_{bi}^{\dagger}}.$$
This approximation is unbiased, in the sense that $E(\hat u_B^I \mid \mathcal{X}) = \hat u$. Note too that conditional on $\mathcal{X}$, $\hat u_B^I \to \hat u$ with probability one as $B \to \infty$.
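The reweighting step is mechanical once the counts are recorded. The sketch below (not from the original text) implements the approximant with NumPy; the statistic, the tilt used to pick the p_i's and the data are illustrative assumptions only, and the tilt is not claimed to be the optimal choice derived later in this section.

import numpy as np

rng = np.random.default_rng(5)

def importance_approx(x, p, stat, B=2000):
    """Importance resampling approximation u^I_B of E(U* | X): draw resamples
    with probabilities p and reweight each draw by prod_i (n p_i)^(-M_i)."""
    n = len(x)
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    vals = np.empty(B)
    for b in range(B):
        counts = rng.multinomial(n, p)            # M_1,...,M_n
        resample = np.repeat(x, counts)
        weight = np.prod((n * p) ** (-counts))
        vals[b] = stat(resample) * weight
    return vals.mean()

# toy example: a left-tail probability for a Studentized resample mean
x = rng.normal(size=30)
theta_hat = x.mean()
def stat(xs):
    s = xs.std(ddof=0)
    t = np.sqrt(len(xs)) * (xs.mean() - theta_hat) / s if s > 0 else 0.0
    return float(t <= -1.5)

p = np.exp(-0.5 * (x - x.mean()) / x.std())       # tilt towards small observations
print(importance_approx(x, p, stat))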

If we take each $p_i = n^{-1}$ then $\hat u_B^I$ is just the usual uniform resampling approximant $\hat u_B^*$. We wish to choose $p_1, \dots, p_n$ to optimize the performance of $\hat u_B^I$. Since $\hat u_B^I$ is unbiased, its performance may be described in terms of variance:
$$\mathrm{var}(\hat u_B^I \mid \mathcal{X}) = B^{-1}\,\mathrm{var}\Big\{U^{\dagger}\prod_{i=1}^n (np_i)^{-M_i^{\dagger}} \,\Big|\, \mathcal{X}\Big\} = B^{-1}(\theta - \hat u^2), \qquad (7.3)$$
where
$$\theta = \theta(p_1, \dots, p_n) = E\Big[\Big\{U^{\dagger}\prod_{i=1}^n (np_i)^{-M_i^{\dagger}}\Big\}^2 \,\Big|\, \mathcal{X}\Big] = \sum_{j=1}^N u_j^2 \pi_j' \prod_{i=1}^n (np_i)^{-2m_{ji}} = \sum_{j=1}^N u_j^2 \pi_j \prod_{i=1}^n (np_i)^{-m_{ji}} = E\Big\{U^{*2}\prod_{i=1}^n (np_i)^{-M_i^*} \,\Big|\, \mathcal{X}\Big\}. \qquad (7.4)$$

On the last line, $M_i^*$ denotes the number of times $X_i$ appears in the uniform resample $\mathcal{X}^*$. Ideally we would like to choose $p_1, \dots, p_n$ so as to minimize $\theta(p_1, \dots, p_n)$, subject to $\sum p_i = 1$.

In the case of estimating a distribution function there can be a significant advantage in choosing non-identical $p_i$'s, the amount of improvement depending on the argument of the distribution function. To appreciate the extent of improvement, consider the case of estimating the distribution function of a Studentized statistic, $T = n^{1/2}(\hat\theta - \theta)/\hat\sigma$, assuming the "smooth function model" introduced in Section 2. Other cases, such as that where $S = n^{1/2}(\hat\theta - \theta)/\sigma$ is the subject of interest, are similar; the issue of Studentizing does not play a role here.

Take $U^* = I(T^* \le x)$, where $T^* = n^{1/2}(\hat\theta^* - \hat\theta)/\hat\sigma^*$ and
$$\hat\sigma^2 = n^{-1}\sum_{i=1}^n \Big\{\sum_{j=1}^d (X_i - \bar X)^{(j)} g_j(\bar X)\Big\}^2,$$
with $\hat\sigma^{*2}$ denoting the same quantity computed from the resample $\mathcal{X}^*$ instead of $\mathcal{X}$. (We agree to define $T^* = c$, for an arbitrary but fixed constant $c$, in the event that $\hat\sigma^* = 0$.) Now, $\hat u = P(T^* \le x \mid \mathcal{X})$, and for this particular $x$ we wish to choose $p_1, \dots, p_n$ to minimize
$$\theta = \theta(p_1, \dots, p_n) = E\Big\{I(T^* \le x)\prod_{i=1}^n (np_i)^{-M_i^*} \,\Big|\, \mathcal{X}\Big\}.$$
A straightforward asymptotic analysis, given for example by Do and Hall (1990), shows that the asymptotic minimum of $\theta$ is attained with $p_i = n^{-1}e^{-\delta_i}$, where $\delta_i = \lambda\epsilon_i + C$, $\epsilon_i = n^{-1/2}\hat\sigma^{-1}\sum_{j=1}^d (X_i - \bar X)^{(j)} g_j(\bar X)$, $\lambda = \lambda(x) > 0$ is chosen to minimize $\Phi(x - \lambda)e^{\lambda^2}$, and $C = C(\lambda)$ is chosen to ensure that $\sum p_i = 1$. Values of $\lambda = \lambda(x)$, for selected Standard Normal quantiles, are given in the table. For this choice of the $p_i$'s,
$$r(x) = \frac{\text{variance of } \hat u_B^* \text{ (under uniform resampling)}}{\text{variance of } \hat u_B^I \text{ (under importance resampling)}} \approx \frac{\Phi(x)\{1 - \Phi(x)\}}{\Phi(x - \lambda)e^{\lambda^2} - \Phi(x)^2}.$$
The function $r$ is strictly decreasing, with $r(-\infty) = \infty$, $r(0) \approx 1.7$ and $r(+\infty) = 1$. Therefore, importance resampling can be considerably more efficacious for negative $x$ than positive $x$. If we wish to approximate $\hat u(x) = P(T^* \le x \mid \mathcal{X})$ for a value of $x > 0$, it is advisable to work throughout with $-T^*$ rather than $T^*$ and use importance resampling to calculate
$$P(-T^* \le -x \mid \mathcal{X}) = 1 - \hat u(x).$$

TABLE: Optimal values of $\lambda = \lambda(x)$, and values of asymptotic efficiency $r(x)$, for selected $x$ (taken to be quantiles of the Standard Normal distribution). [Table entries not reproduced.]

8. Quantile estimation
Here we consider the problem of estimating the $\alpha$th quantile, $\xi_\alpha$, of the distribution of a random variable $R$ such as $S = n^{1/2}(\hat\theta - \theta)/\sigma$ or $T = n^{1/2}(\hat\theta - \theta)/\hat\sigma$. We define $\xi_\alpha$ to be the solution of the equation $P(R \le \xi_\alpha) = \alpha$. Now, the bootstrap estimate of $\xi_\alpha$ is the solution $\hat\xi_\alpha$ of
$$P(R^* \le \hat\xi_\alpha \mid \mathcal{X}) = \alpha, \qquad (8.1)$$
where $R^* = S^* = n^{1/2}(\hat\theta^* - \hat\theta)/\hat\sigma$ (if $R = S$) or $R^* = T^* = n^{1/2}(\hat\theta^* - \hat\theta)/\hat\sigma^*$ (if $R = T$). The equation (8.1) usually cannot be solved exactly, although the error will be an exponentially small function of $n$. For the sake of definiteness we shall define
$$\hat\xi_\alpha = \hat H^{-1}(\alpha) = \inf\{x : \hat H(x) \ge \alpha\}, \qquad (8.2)$$
where $\hat H(x) = P(R^* \le x \mid \mathcal{X})$.
Approximation of the function $\hat H(x)$ falls neatly into the format described in earlier sections, where we discussed Monte Carlo methods for approximating a conditional expectation of the form $\hat u = E(U^* \mid \mathcal{X})$. Take $U^* = I(R^* \le x)$, and apply any one of the methods of uniform resampling, balanced resampling, antithetic resampling or importance resampling to obtain an approximation $\hat H_B(x)$ to $\hat H(x)$. Then, following the prescription suggested by (8.2), define
$$\hat\xi_{\alpha,B} = \hat H_B^{-1}(\alpha) = \inf\{x : \hat H_B(x) \ge \alpha\}. \qquad (8.3)$$
For each of these methods, the asymptotic performance of $\hat\xi_{\alpha,B}$ as an approximation to $\hat\xi_\alpha$ may be represented in terms of the performance of $\hat H_B(\hat\xi_\alpha)$ as an approximation to $\hat H(\hat\xi_\alpha) = \alpha$, by using the method of Bahadur representation:
$$\hat\xi_{\alpha,B} - \hat\xi_\alpha = \{\alpha - \hat H_B(\hat\xi_\alpha)\}\,\phi(z_\alpha)^{-1}\{1 + o_p(1)\},$$
where $\phi$ is the Standard Normal density and $z_\alpha$ the $\alpha$-level Standard Normal quantile. See Hall (1990a, 1990c). (Note from Section 7 that if $\alpha > 0.5$ and we are using importance resampling, it is efficacious to approximate instead the $(1 - \alpha)$'th quantile of $-T^*$, and then change sign.)
This makes it clear that the asymptotic gain in performance obtained by using an efficient resampling algorithm to approximate $\hat H^{-1}(\alpha)$ is equivalent to the asymptotic improvement when approximating $\hat H(x)$ at the point $x = \hat\xi_\alpha$. In the case of importance resampling, where optimal choice of the resampling probabilities $p_i$ depends very much on the value of $x$, one implication of this result is that we may select the $p_i$'s as though we were approximating $\hat H(z_\alpha)$ (bearing in mind that $\hat\xi_\alpha \to z_\alpha$ as $n \to \infty$). Alternatively, we might construct an approximation $\tilde\xi_\alpha$ of $\hat\xi_\alpha$ via a pilot program of uniform resampling, and then select the $p_i$'s as though we were approximating $\hat H(\tilde\xi_\alpha)$.
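A quantile approximant of the form (8.3), computed here from plain uniform resampling, can be written compactly; the sketch below is an added illustration (the statistic T*, the data and B are assumptions), and the more efficient resampling schemes of Sections 5 to 7 could be substituted for the uniform draws.

import numpy as np

rng = np.random.default_rng(6)

def bootstrap_quantile(x, alpha, B=2000):
    """xi_hat_{alpha,B} = H_B^{-1}(alpha), with H_B the uniform-resampling
    approximation to H(x) = P(T* <= x | X); here theta_hat = Xbar."""
    n = len(x)
    t_star = np.empty(B)
    for b in range(B):
        xs = x[rng.integers(0, n, size=n)]
        s = xs.std(ddof=0)
        t_star[b] = np.sqrt(n) * (xs.mean() - x.mean()) / s if s > 0 else 0.0
    t_star.sort()
    return t_star[int(np.ceil(alpha * B)) - 1]   # smallest x with H_B(x) >= alpha

x = rng.exponential(size=40)
print(bootstrap_quantile(x, 0.05), bootstrap_quantile(x, 0.95))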

References
BICKEL, P.J. and YAHAV, J.A. (1988). Richardson extrapolation and the bootstrap. J. Amer. Statist. Assoc. 83, 387-393.
DAVISON, A.C. (1988). Discussion of papers by D.V. Hinkley and by T.J. DiCiccio and J.P. Romano. J. Roy. Statist. Soc. Ser. B 50, 356-357.
DAVISON, A.C. and HINKLEY, D.V. (1988). Saddlepoint approximations in resampling methods. Biometrika 75, 417-431.
DAVISON, A.C., HINKLEY, D.V. and SCHECHTMAN, E. (1986). Efficient bootstrap simulation. Biometrika 73, 555-566.
DO, K.-A. and HALL, P. (1990). On importance resampling for the bootstrap.
EFRON, B. (1990). More efficient bootstrap computations. J. Amer. Statist. Assoc.
GLEASON, J.R. (1988). Algorithms for balanced bootstrap simulations. Amer. Statistician 42, 263-266.
GRAHAM, R.L., HINKLEY, D.V., JOHN, P.W.M. and SHI, S. (1990). Balanced design of bootstrap simulations. J. Roy. Statist. Soc. Ser. B 52, 185-202.

HALL, P. (1989a). On efficient bootstrap simulation. Biometrika 76, 613-617.
HALL, P. (1989b). Antithetic resampling for the bootstrap. Biometrika 76, 713-724.
HALL, P. (1990a). Performance of bootstrap balanced resampling in distribution function and quantile problems. Probab. Th. Rel. Fields.
HALL, P. (1990b). Balanced importance resampling for the bootstrap.
HALL, P. (1990c). Bahadur representations for uniform resampling and importance resampling, with applications to asymptotic relative efficiency. Ann. Statist.
HAMMERSLEY, J.M. and HANDSCOMB, D.C. (1964). Monte Carlo Methods. London: Methuen.
HAMMERSLEY, J.M. and MORTON, K.W. (1956). A new Monte Carlo technique: antithetic variates. Proc. Camb. Phil. Soc. 52, 449-475.
HAMMERSLEY, J.M. and MAULDON, J.G. (1956). General principles of antithetic variates. Proc. Camb. Phil. Soc. 52, 476-481.
JOHNS, M.V. Jr (1988). Importance sampling for bootstrap confidence intervals. J. Amer. Statist. Assoc. 83, 709-714.
OLDFORD, R.W. (1985). Bootstrapping by Monte Carlo versus approximating the estimator and bootstrapping exactly: cost and performance. Comm. Statist. Ser. B 14, 395-424.
SNIJDERS, T.A.B. (1984). Antithetic variates for Monte Carlo estimation of probabilities. Statist. Neerland. 38, 55-73.
BOOTSTRAPPING U-QUANTILES

R. HELMERS¹
Centre for Mathematics and Computer Science
Amsterdam, The Netherlands

P. JANSSEN and N. VERAVERBEKE²
Limburgs Universitair Centrum
Diepenbeek, Belgium

ABSTRACT

The asymptotic consistency of the bootstrap approximation for gener-


alized quantiles of U-statistic structure (U-quantiles for short) is established.
The same method of proof also yields the asymptotic accuracy of the boot-
strap approximation in this case. Applications to location and spread estima-
tors, such as the classical sample quantile, the Hodges-Lehmann estimator of
location and a spread estimator proposed by Bickel and Lehmann are given.

1. INTRODUCTION
Let $X_1, X_2, \dots$ be independent random variables defined on a common probability space $(\Omega, \mathcal{A}, P)$, having common unknown distribution function (df) $F$. Let $h(x_1, \dots, x_m)$ be a kernel of degree $m$ (i.e. a real-valued measurable function symmetric in its $m$ arguments) and let
$$H_F(y) = P(h(X_1, \dots, X_m) \le y), \qquad y \in \mathbb{R} \qquad (1)$$
denote the df of the random variable $h(X_1, \dots, X_m)$. Define, for each $n \ge m$ and real $y$,
$$H_n(y) = \binom{n}{m}^{-1}\sum_{1 \le i_1 < \dots < i_m \le n} I(h(X_{i_1}, \dots, X_{i_m}) \le y), \qquad (2)$$
the empirical df of U-statistic structure.

Let, for $0 < p < 1$, $\xi_p = H_F^{-1}(p)$ denote the $p$-th quantile corresponding to $H_F$, and let $\xi_{pn} = H_n^{-1}(p)$ denote its empirical counterpart. Generalized

¹ R. Helmers, Centre for Math. and Comp. Science, Kruislaan 413, 1098 SJ Amsterdam (The Netherlands)
² P. Janssen, N. Veraverbeke, Limburgs Universitair Centrum, Universitaire Campus, B-3590 Diepenbeek (Belgium)

Exploring the Limits of Bootstrap. Edited by Raoul LePage and Lynne Billard.
Copyright©1992 by John Wiley & Sons, Inc. ISBN: 0471-53631-8.

quantiles of the form $\xi_{pn} = H_n^{-1}(p)$ are called U-quantiles. Choudhury and Serfling (1988) note that $\xi_{pn} \to \xi_p$ a.s. $[P]$, as $n \to \infty$, and, in addition, that, as $n \to \infty$,
$$n^{1/2}(\xi_{pn} - \xi_p) \to_d N(0, \sigma^2) \qquad (3)$$
where
$$\sigma^2 = m^2\zeta_p\,h_F^{-2}(\xi_p) \qquad (4)$$
with
$$\zeta_p = \mathrm{Var}(g_p(X_1)) > 0 \qquad (5)$$
and
$$g_p(X_1) = E(I(h(X_1, \dots, X_m) \le \xi_p) \mid X_1) - p, \qquad (6)$$
provided $H_F$ has density $h_F$ positive at $\xi_p$.

In applications one often wishes to establish a confidence interval for $\xi_p = H_F^{-1}(p)$ and a studentized version of (3) is required. A strongly consistent estimator of the asymptotic variance $\sigma^2$ is proposed by Choudhury and Serfling (1988). It requires $O(n^{2m-1})$ computational steps. They also propose a strongly consistent but less efficient estimator requiring only $O(n)$ computational steps.

The aim of this paper is to employ bootstrap methods for the construction of a confidence interval for $\xi_p = H_F^{-1}(p)$. In Section 2 we establish a bootstrap analog of (3), under a slightly more stringent smoothness condition on $H_F$, and in Section 3 we establish the asymptotic accuracy of this bootstrap approximation. Applications to certain estimators of location and spread, such as the classical sample quantile, the Hodges-Lehmann estimator of location and a spread estimator proposed in Bickel and Lehmann (1979), are discussed in Section 4.
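For readers who want to experiment with U-quantiles numerically, the following brute-force sketch (added here for illustration, not part of the original paper) evaluates the kernel over all index combinations and inverts the empirical df H_n; the example kernels correspond to the Hodges-Lehmann and Bickel-Lehmann estimators discussed in Section 4. The O(n^m) enumeration is only feasible for small n.

import numpy as np
from itertools import combinations

def u_quantile(x, h, p, m=2):
    """xi_{pn} = H_n^{-1}(p): the p-th quantile of the kernel values
    h(X_{i1},...,X_{im}) over all (n choose m) index combinations."""
    vals = np.sort([h(*c) for c in combinations(x, m)])
    k = int(np.ceil(p * len(vals))) - 1     # smallest y with H_n(y) >= p
    return vals[k]

rng = np.random.default_rng(7)
x = rng.normal(loc=1.0, size=30)

print(u_quantile(x, lambda a, b: 0.5 * (a + b), 0.5))   # Hodges-Lehmann location
print(u_quantile(x, lambda a, b: abs(a - b), 0.5))      # Bickel-Lehmann spread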

2. CONSISTENCY OF THE BOOTSTRAP FOR U-QUANTILES


Let $F_n$ denote the empirical df based on $X_1, \dots, X_n$. Define $\xi_{pn}^* = H_n^{*-1}(p)$, $0 < p < 1$, where $H_n^*$ denotes the empirical df of U-statistic structure based on the bootstrap sample $X_1^*, \dots, X_n^*$. Here and elsewhere $X_1^*, \dots, X_n^*$ denotes a random sample of size $n$ from $F_n$, conditionally given $X_1, \dots, X_n$.

Our first main result is as follows:

Theorem 2.1. Suppose that $H_F$ is continuously differentiable (with density $h_F$) on a neighborhood of $\xi_p$ with $h_F(\xi_p) > 0$. Then, for almost every sample sequence $X_1, X_2, \dots$,
$$n^{1/2}(\xi_{pn}^* - \xi_{pn}) \to_d N(0, \sigma^2) \qquad (7)$$
with $\sigma^2$ as in (4).

Proof. With $P_n$ the probability measure corresponding to $F_n$, we have
$$P_n(n^{1/2}(\xi_{pn}^* - \xi_{pn}) \le x) = P_n(n^{1/2}(H_n^{*-1}(p) - H_n^{-1}(p)) \le x) = P_n(H_n^{*-1}(p) \le H_n^{-1}(p) + xn^{-1/2}) = P_n(H_n^*(H_n^{-1}(p) + xn^{-1/2}) \ge p) = P_n(W_n^* \ge -D_n), \qquad (8)$$
where
$$W_n^* = n^{1/2}\{H_n^*(H_n^{-1}(p) + xn^{-1/2}) - \hat H_n(H_n^{-1}(p) + xn^{-1/2})\} \qquad (9)$$
with, for each $n \ge m$ and real $y$,
$$\hat H_n(y) = n^{-m}\sum_{i_1=1}^n \cdots \sum_{i_m=1}^n I(h(X_{i_1}, \dots, X_{i_m}) \le y), \qquad (10)$$
the empirical df of von Mises structure, and
$$D_n = n^{1/2}\{\hat H_n(H_n^{-1}(p) + xn^{-1/2}) - p\}. \qquad (11)$$
We first consider $D_n$. Note that
$$D_n = \sum_{i=1}^3 D_{in}, \qquad (12)$$
where
$$D_{1n} = n^{1/2}\{\hat H_n(H_n^{-1}(p) + xn^{-1/2}) - H_F(H_n^{-1}(p) + xn^{-1/2})\} - n^{1/2}\{\hat H_n(H_n^{-1}(p)) - H_F(H_n^{-1}(p))\}, \qquad (13)$$
$$D_{2n} = n^{1/2}\{H_F(H_n^{-1}(p) + xn^{-1/2}) - H_F(H_n^{-1}(p))\} \qquad (14)$$

and
$$D_{3n} = n^{1/2}\{\hat H_n(H_n^{-1}(p)) - p\}. \qquad (15)$$


To treat $D_{1n}$, note first that $D_{1n} = \tilde D_{1n} + O(n^{-1/2})$ a.s. $[P]$, as $n \to \infty$, where $\tilde D_{1n}$ is obtained from $D_{1n}$ by replacing $\hat H_n$ by $H_n$, with $H_n$ as in (2). Suppose without loss of generality that $x > 0$. Clearly, for $n$ sufficiently large,
$$|\tilde D_{1n}| \le \sup_{|t - s| \le xn^{-1/2},\; s, t \in J} |U_n(t) - U_n(s)| \quad \text{a.s. } [P], \qquad (16)$$
where $J$ is the neighborhood of $\xi_p$ on which $H_F$ is continuously differentiable, and
$$U_n(t) = n^{1/2}(H_n(t) - H_F(t)), \qquad t \in \mathbb{R}, \qquad (17)$$
denotes the empirical process of U-statistic structure.

Similarly as in Silverman (1983) it is easy to see that
$$|\tilde D_{1n}| \le (n!)^{-1}\sum_{\sigma}\;\sup_{|t - s| \le xn^{-1/2},\; t, s \in J} |U_{[\sigma]}(t) - U_{[\sigma]}(s)|, \qquad (18)$$
where, for any given permutation $\sigma$ of $\{1, 2, \dots, n\}$, $U_{[\sigma]}(t)$ denotes the empirical process based on the $[n/m]$ independent random variables $h(X_{\sigma(mj+1)}, \dots, X_{\sigma(mj+m)})$, $j = 0, 1, \dots, [n/m] - 1$, all with common df $H_F$. With impunity we may replace at stage $n + 1$ any of the $n \cdot n!$ permutations $\sigma$ of $\{1, \dots, n+1\}$ which do not extend those of $\{1, \dots, n\}$ by one of the $n!$ permutations which do extend those of $\{1, \dots, n\}$. Application of relation (2.13) of Stute (1982) to each of the resulting $n!$ terms appearing on the r.h.s. of (18) directly yields that $\tilde D_{1n} = O(n^{-1/4}(\ln n)^{1/2})$ a.s. $[P]$, as $n \to \infty$; hence,
$$D_{1n} = O(n^{-1/4}(\ln n)^{1/2}) \quad \text{a.s. } [P], \text{ as } n \to \infty. \qquad (19)$$
Here we have used the smoothness of $H_F$ as well as the inequality (18).

Next we consider $D_{2n}$. Using again the smoothness assumption on $H_F$, and employing the a.s. $[P]$ convergence of $\xi_{pn}$ to $\xi_p$ as $n \to \infty$, we easily obtain from the mean value theorem that
$$D_{2n} \to x\,h_F(\xi_p) \quad \text{a.s. } [P], \text{ as } n \to \infty. \qquad (20)$$
Finally note that $D_{3n} = O(n^{-1/2})$. We can conclude that
$$D_n \to x\,h_F(\xi_p) \quad \text{a.s. } [P], \text{ as } n \to \infty. \qquad (21)$$


Next we consider the limit behaviour of $W_n^*$, $n = m, m+1, \dots$ (cf. (9)), conditionally given $F_n$. Obviously, given $F_n$, $W_n^*$ is a normalized U-statistic of degree $m$, with bounded kernel, depending on $n$, of the form
$$h_n(x_1, \dots, x_m) = I(h(x_1, \dots, x_m) \le \xi_{pn} + xn^{-1/2}) - \hat H_n(\xi_{pn} + xn^{-1/2}). \qquad (22)$$
Of course $E_{F_n} W_n^* = E_{F_n} h_n(X_1^*, \dots, X_m^*) = 0$, a.s. $[P]$, whereas it is easily checked that
$$\mathrm{Var}_{F_n}(W_n^*) \sim m^2 E_{F_n} g_n^2(X_1^*) \quad \text{a.s. } [P], \text{ as } n \to \infty, \qquad (23)$$
where
$$g_n(X_1^*) = E_{F_n}(h_n(X_1^*, \dots, X_m^*) \mid X_1^*). \qquad (24)$$
A simple argument involving the strong law for U-statistics with estimated parameters (Theorem 2.9 of Iverson and Randles (1989)) directly yields that
$$E_{F_n} g_n^2(X_1^*) \to \zeta_p \quad \text{a.s. } [P], \text{ as } n \to \infty, \qquad (25)$$
with $\zeta_p$ as in (5).
with ¢ as in (5).

At this point we apply the Berry-Esseen bound for U-statistics of degree $m$ of van Zwet (1984) to find that
$$\sup_y \big|P_n(W_n^* \le y) - \Phi\big(y\,m^{-1}(E_{F_n} g_n^2(X_1^*))^{-1/2}\big)\big| = O\Big\{\Big(\frac{E_{F_n}|g_n(X_1^*)|^3}{(E_{F_n} g_n^2(X_1^*))^{3/2}} + \frac{E_{F_n} h_n^2(X_1^*, \dots, X_m^*)}{E_{F_n} g_n^2(X_1^*)}\Big)n^{-1/2}\Big\}. \qquad (26)$$
Note that, in contrast to Corollary 4.1 of van Zwet (1984), the asymptotic variance instead of the exact variance of $W_n^*$ is employed. It is easy to see

that this does not affect the bound (26). The different standardization will give rise to an additional term of the type
$$E_{F_n} h_n(X_1^*, \dots, X_m^*)\,\big(E_{F_n} g_n^2(X_1^*)\big)^{-1/2}, \qquad (27)$$
which is already present in van Zwet's bound. Because $h_n$ is bounded by 1, for all $n$, and combining (25) with the fact that $\zeta_p > 0$, we easily see that the moments appearing on the r.h.s. of (26) are $O(1)$ a.s. $[P]$, as $n \to \infty$. Hence the r.h.s. of (26) is $O(n^{-1/2})$ a.s. $[P]$, as $n \to \infty$.

From (8), (21) and (26) we obtain
$$P_n(n^{1/2}(\xi_{pn}^* - \xi_{pn}) \le x) = 1 - \Phi\big(-D_n\,m^{-1}(E_{F_n} g_n^2(X_1^*))^{-1/2}\big) + O(n^{-1/2}) = \Phi(x\sigma^{-1}) + o(1) \qquad (28)$$
a.s. $[P]$, as $n \to \infty$. This completes the proof of the theorem. □

For the special case $m = 1$, $h(x) = x$, $p = \tfrac{1}{2}$, the classical sample median, our result reduces to Proposition 5.1 of Bickel and Freedman (1981). An insightful proof of their proposition is given in an unpublished note by Sheehy and Wellner (1988). Our proof is in part inspired by their argument.

3. ACCURACY OF THE BOOTSTRAP FOR U-QUANTILES


From (3) and (7) we know that the bootstrap approximation for a normalized U-quantile is asymptotically consistent. In this section we investigate the a.s. rate at which the difference between the bootstrap approximation and the exact distribution of a normalized U-quantile tends to zero, as the sample size gets large.

Theorem 3.1. Suppose that the assumptions of Theorem 2.1 are satisfied. Suppose, in addition, that $h_F$ satisfies a Lipschitz condition of order $\ge \tfrac{1}{2}$ on a neighborhood of $\xi_p$. Then
$$\sup_x \big|P_n(n^{1/2}(\xi_{pn}^* - \xi_{pn}) \le x) - P(n^{1/2}(\xi_{pn} - \xi_p) \le x)\big| = O(n^{-1/4}(\ln n)^{1/2}) \qquad (29)$$
a.s. $[P]$, as $n \to \infty$.

For the special case $m = 1$, $h(x) = x$, the classical $p$-th sample quantile, Singh (1981) obtained a slightly better a.s. rate: the factor $(\ln n)^{1/2}$ in (29) is replaced by $(\ln\ln n)^{1/2}$ in this case. Whether the same improvement holds true for U-quantiles appears to be an interesting open problem.

Proof. First note that
$$\sup_x \big|P_n(n^{1/2}(\xi_{pn}^* - \xi_{pn}) \le x) - P(n^{1/2}(\xi_{pn} - \xi_p) \le x)\big| \le \sum_{i=1}^3 I_{in}, \qquad (30)$$
where, for some constant $K > 0$,
$$I_{1n} = \sup_{|x| \le K(\ln n)^{1/2}} \big|P_n(n^{1/2}(\xi_{pn}^* - \xi_{pn}) \le x) - \Phi(x\sigma^{-1})\big|, \qquad (31)$$
$$I_{2n} = \sup_{|x| > K(\ln n)^{1/2}} \big|P_n(n^{1/2}(\xi_{pn}^* - \xi_{pn}) \le x) - \Phi(x\sigma^{-1})\big|, \qquad (32)$$
and
$$I_{3n} = \sup_x \big|P(n^{1/2}(\xi_{pn} - \xi_p) \le x) - \Phi(x\sigma^{-1})\big|.$$
We first consider $I_{1n}$. Going through the proof of Theorem 2.1 we easily verify that
$$\sup_{|x| \le K(\ln n)^{1/2}} |D_n - x\,h_F(\xi_p)| = O(n^{-1/4}(\ln n)^{1/2}) \quad \text{a.s. } [P], \text{ as } n \to \infty. \qquad (33)$$
Here we have used (see (18)) that
$$\sup_{|x| \le K(\ln n)^{1/2}} |\tilde D_{1n}| \le (n!)^{-1}\sum_{\sigma}\;\sup_{|t - s| \le Kn^{-1/2}(\ln n)^{1/2},\; t, s \in J} |U_{[\sigma]}(t) - U_{[\sigma]}(s)| = O(n^{-1/4}(\ln n)^{1/2}) \quad \text{a.s. } [P], \; n \to \infty,$$
by application of relation (2.13) in Stute (1982). Also (20) is replaced by the stronger assertion that
$$\sup_{|x| \le K(\ln n)^{1/2}} |D_{2n} - x\,h_F(\xi_p)| = O(n^{-1/4}(\ln n)^{1/2}) \quad \text{a.s. } [P], \text{ as } n \to \infty.$$

For this we used Lemma 3.1 of Choudhury and Serfling (1988) and the Lipschitz condition on $h_F$. Combining (33) with (28) directly yields
$$I_{1n} = O(n^{-1/4}(\ln n)^{1/2}) \quad \text{a.s. } [P], \text{ as } n \to \infty. \qquad (34)$$

For the quantity $I_{2n}$ we have
$$I_{2n} \le P_n(\xi_{pn}^* - \xi_{pn} > Kn^{-1/2}(\ln n)^{1/2}) + P_n(\xi_{pn}^* - \xi_{pn} < -Kn^{-1/2}(\ln n)^{1/2}) + 2\{1 - \Phi(K(\ln n)^{1/2}\sigma^{-1})\}. \qquad (35)$$
The third term is $O(n^{-1/2})$ by taking $K$ large enough. It remains to estimate the two other terms. Since the argument is the same for both, we only deal with the first term of the r.h.s. of (35). Similarly as in (8) we write
$$P_n(\xi_{pn}^* - \xi_{pn} > Kn^{-1/2}(\ln n)^{1/2}) = P_n\big(H_n^*(\xi_{pn} + Kn^{-1/2}(\ln n)^{1/2}) - \hat H_n(\xi_{pn} + Kn^{-1/2}(\ln n)^{1/2}) < p - \hat H_n(\xi_{pn} + Kn^{-1/2}(\ln n)^{1/2})\big). \qquad (36)$$
Application of Lemma 3.1 of Choudhury and Serfling (1988) directly yields that for all $n$ sufficiently large,
$$p - \hat H_n(\xi_{pn} + Kn^{-1/2}(\ln n)^{1/2}) \le p - H_n(\xi_p + \tfrac{K}{2}n^{-1/2}(\ln n)^{1/2}) \qquad (37)$$
a.s. $[P]$, provided we take $K$ large enough. A further argument involving Corollary 2.1 of Helmers, Janssen and Serfling (1988) and the a.s. closeness of $H_n$ and $\hat H_n$ gives us (with $C_n$ as in the corollary)
$$p - H_n(\xi_p + \tfrac{K}{2}n^{-1/2}(\ln n)^{1/2}) \le p - H_F(\xi_p + \tfrac{K}{2}n^{-1/2}(\ln n)^{1/2}) + C_n n^{-1/2}(\ln n)^{1/2} + O(n^{-1}) \qquad (38)$$
a.s. $[P]$. The smoothness assumption of the theorem directly implies that
$$p - H_F(\xi_p + \tfrac{K}{2}n^{-1/2}(\ln n)^{1/2}) = -\tfrac{K}{2}h_F(\xi_p)\,n^{-1/2}(\ln n)^{1/2}(1 + o(1)) \quad \text{a.s. } [P], \text{ as } n \to \infty. \qquad (39)$$
Together (37), (38) and (39) yield that $p - \hat H_n(\xi_{pn} + Kn^{-1/2}(\ln n)^{1/2}) < 0$, for all $n$ sufficiently large, a.s. $[P]$, provided we take $K$ large enough.

We can now apply an exponential bound for U-statistics with bounded kernels of Hoeffding (1963) (see also Serfling (1980), p. 201) to find that

$$P_n\big(H_n^*(\xi_{pn} + Kn^{-1/2}(\ln n)^{1/2}) - \hat H_n(\xi_{pn} + Kn^{-1/2}(\ln n)^{1/2}) < p - \hat H_n(\xi_{pn} + Kn^{-1/2}(\ln n)^{1/2})\big) \le \exp\{-\tfrac{1}{2}[n/m]\,n^{-1}(\ln n)\,K^2 h_F^2(\xi_p)\} = O(n^{-1/2}) \quad \text{a.s. } [P], \text{ as } n \to \infty, \qquad (40)$$
provided $K$ is taken sufficiently large. This together with (35) and (36) implies that
$$I_{2n} = O(n^{-1/2}) \quad \text{a.s. } [P], \text{ as } n \to \infty. \qquad (41)$$
Hence $I_{2n}$ is of negligible order for our purposes. It remains to consider $I_{3n}$. Clearly, as $n \to \infty$,
$$I_{3n} = \sup_x \big|P(n^{1/2}(\xi_{pn} - \xi_p) \le x) - \Phi(x\sigma^{-1})\big| = O(n^{-1/2}), \qquad (42)$$
i.e. the Berry-Esseen bound for U-quantiles is valid. To check (42) is an easy matter in view of the classical proof of a Berry-Esseen bound for ordinary sample quantiles (see, e.g., Serfling (1980), pp. 81-84). We have to apply, instead of the Lemma on p. 75 of Serfling (1980), the exponential bound of Hoeffding (1963) for U-statistics with bounded kernels. Also a Berry-Esseen bound for U-statistics is needed.

Combining (34), (41) and (42) with (30), we find that the theorem is proved. □

4. APPLICATIONS
In this section we indicate briefly applications of our results to the problem of obtaining confidence intervals for $\xi_p = H_F^{-1}(p)$. Let $u_\alpha = \Phi^{-1}(1 - \tfrac{\alpha}{2})$. The normal approximation (3) yields an approximate two-sided confidence interval
$$(\xi_{pn} - n^{-1/2}\hat\sigma u_\alpha,\; \xi_{pn} + n^{-1/2}\hat\sigma u_\alpha) \qquad (43)$$
for $\xi_p$. Here $\hat\sigma^2$ denotes a consistent estimator (e.g., the one proposed by Choudhury and Serfling (1988)) of the asymptotic variance $\sigma^2$. Clearly, the error rates corresponding to the upper and lower confidence limits in (43) will depend on the rate at which $\hat\sigma^2$ approaches $\sigma^2$.

A bootstrap based confidence interval for $\xi_p$ is given by
$$(\xi_{pn} - n^{-1/2}c^*_{n,1-\alpha/2},\; \xi_{pn} - n^{-1/2}c^*_{n,\alpha/2}), \qquad (44)$$
where $c^*_{n,\alpha/2}$ and $c^*_{n,1-\alpha/2}$ denote the $\tfrac{\alpha}{2}$-th and $(1 - \tfrac{\alpha}{2})$-th percentile of the (simulated) bootstrap approximation. It is easily verified that the upper and lower confidence limits in (44) have error rates equal to $\tfrac{\alpha}{2} + O(n^{-1/4}(\ln n)^{1/2})$.

We discuss a few specific examples of U-quantiles. In the first of these we take $m = 1$, $h(x) = x$ and obtain the classical $p$-th sample quantile $\xi_{pn} = F_n^{-1}(p)$, $0 < p < 1$. Our second example is obtained by taking $p = \tfrac{1}{2}$, $m = 2$, $h(x_1, x_2) = (x_1 + x_2)/2$. In this case $\xi_{\frac{1}{2}n} = H_n^{-1}(\tfrac{1}{2})$ becomes the well-known Hodges-Lehmann location estimator. In the third and final example we take $p = \tfrac{1}{2}$, $m = 2$, $h(x_1, x_2) = |x_1 - x_2|$. In this case $\xi_{\frac{1}{2}n} = H_n^{-1}(\tfrac{1}{2})$ reduces to an estimator of spread proposed by Bickel and Lehmann (1979).

A further investigation into the relative merits of the normal and bootstrap based confidence intervals (43) and (44) for U-quantiles appears to be worthwhile. The authors hope to report on these matters elsewhere.
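The bootstrap interval (44) is straightforward to simulate. The sketch below (an added illustration, with sample size, B and the Normal toy data chosen arbitrarily) does this for the Hodges-Lehmann estimator; np.median is used as a convenient stand-in for H_n^{-1}(1/2), which differs from the exact definition only in its tie/interpolation convention.

import numpy as np
from itertools import combinations

rng = np.random.default_rng(8)

def hodges_lehmann(x):
    # median of the pairwise averages (x_i + x_j)/2, i < j
    return np.median([0.5 * (a + b) for a, b in combinations(x, 2)])

def bootstrap_ci(x, alpha=0.10, B=500):
    """Interval (44), using percentiles c* of the simulated law of
    sqrt(n) (xi*_pn - xi_pn)."""
    n = len(x)
    est = hodges_lehmann(x)
    roots = np.array([np.sqrt(n) * (hodges_lehmann(x[rng.integers(0, n, n)]) - est)
                      for _ in range(B)])
    c_lo, c_hi = np.percentile(roots, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return est - c_hi / np.sqrt(n), est - c_lo / np.sqrt(n)

x = rng.normal(loc=2.0, size=40)
print(bootstrap_ci(x))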

REFERENCES

Bickel, P.J. and Freedman, D. (1981). Some asymptotic theory for the boot-
strap. Ann. Statist. 9, 1196-1217.

Bickel, P.J. and Lehmann, E.L. (1979). Descriptive statistics for nonpara-
metric models. IV. Spread. Contributions to Statistics (J. Hajek
Memorial Volume), 33-40 (ed. J. Jurečková). Academia, Prague.

Choudhury, J. and Serfling, R. (1988). Generalized order statistics, Bahadur


representations, and sequential nonparametric fixed-width confidence
intervals. J. Statist. Planning Inf. 19, 269-282.

Helmers, R., Janssen, P. and Serfling, R. (1988). Glivenko-Cantelli prop-


erties of some generalized empirical df’s and strong convergence of
generalized L-statistics. Probab. Th. Rel. Fields 79, 75-93.

Hoeffding, W. (1963). Probability inequalities for sums of bounded random


variables. J. Amer. Statist. Assoc. 58, 13-30.

Iverson, H. and Randles, R. (1989). The effect on convergence of sub-


stituting parameter estimates into U-statistics and other families of
statistics. Probab. Th. Rel. Fields 81, 453-471.

Serfling, R.J. (1980). Approximation theorems of mathematical statistics.


J. Wiley, New York.

Sheehy, A. and Wellner, J. (1988). Almost sure convergence of the boot-


strap for the median. Unpublished manuscript.

Silverman, B.W. (1983). Convergence of a class of empirical distribution


functions of dependent random variables. Ann. Probability 11, 745-
751.
Singh, K. (1981). On the asymptotic accuracy of Efron’s bootstrap. Ann.
Statist. 9, 1187-1195.

Stute, W. (1982). The oscillation behavior of empirical processes. Ann.


Probability 10, 86-107.

van Zwet, W.R. (1984). A Berry-Esseen bound for symmetric statistics. Z.


Wahrsch. verw. Gebiete 66, 425-440.
AN INVARIANCE PRINCIPLE
APPLICABLE TO THE BOOTSTRAP

John G. Kinateder¹
Michigan State University

1 Introduction
Suppose $X_1, X_2, \dots$ are independent random variables distributed according to a distribution function $F$ with location parameter $\theta$. In order to make inferences about $\theta$, we may consider the distribution of the sample mean about $\theta$, $\bar X_n - \theta$.

For example, if $EX_1^2 < \infty$, the well-known Lindeberg-Levy Central Limit Theorem [Bil86] tells us that
$$n^{-1/2}\sum_{j=1}^n (X_j - EX_1) \to_d N(0, \sigma^2),$$
where $\sigma^2$ represents the variance of $X_1$. In the finite variance case, we can use this to make inferences about $\theta = EX_1$.

Of course $EX_1^2 < \infty$ is not necessary for convergence in distribution of the sample mean.

Definition 1 F is said to be in the domain of attraction of a distribution $\mu$ (not concentrated at one point) if there exist constants $a_n > 0$, $b_n$, and a random variable $Y$ with distribution $\mu$, such that
$$S_n = a_n^{-1}\sum_{j=1}^n (X_j - b_n) \to_d Y. \qquad (1)$$
Necessarily $a_n \sim c\,n^{1/\alpha}$ for some $\alpha \in (0,2]$ and $c > 0$; $Y$ and $\mu$ are said to be $\alpha$-stable. (See [Fel71].)

In what follows, we assume that $F$ is in the domain of attraction of an $\alpha$-stable distribution and $X_1, X_2, \dots$ is a sequence of i.i.d. $F$ random variables. If $0 < \alpha < 2$, then we fix a sequence $a_n > 0$ such that for each $y > 0$, $n(1 - F(a_n y)) \to y^{-\alpha}$ as $n \to \infty$. For such $a_n$, (1) holds with $b_n = 0$ for $0 < \alpha < 1$, $b_n = EX_1$ if $1 < \alpha < 2$, and $b_n = E\sin(X_1/a_n)$ if $\alpha = 1$. (For existence of such a sequence, see Feller [Fel71].) If $\alpha = 2$, then we choose $a_n$ such that
$$a_n^{-1}\sum_{j=1}^n (X_j - EX_1) \to_d N(0,1).$$
In either case, let $\mu$ denote the limit distribution.

¹ Research partially supported by the Office of Naval Research under grant N000014-85-K-0150.

Exploring the Limits of Bootstrap. Edited by Raoul LePage and Lynne Billard.
Copyright©1992 by John Wiley & Sons, Inc. ISBN: 0471-53631-8.


Since the distribution $F$ is generally unknown, so is the distribution of $S_n$. Thus, if we are to use $S_n$ to estimate $\theta$, then we need to have some idea of the variability of $S_n$. As was suggested by Efron [Efr79], in a wide variety of situations, we can use resampling of the data $X_1, \dots, X_n$ to estimate the distribution of an estimator. This is the essence of bootstrap.

Let $F_n$ be the empirical distribution function of $X_1, \dots, X_n$:
$$F_n(x) = \frac{1}{n}\sum_{k=1}^n I(X_k \le x).$$

For each observation of the data $X_1, \dots, X_n$, we consider the distribution of
$$S_n^* = a_n^{-1}\sum_{j=1}^n (X_j^* - \bar X_n),$$
where $X_1^*, \dots, X_n^*$ are independent and distributed according to $F_n$. This is equivalent to simple random sampling from the original sample, $X_1, \dots, X_n$, with replacement, and applying the same statistic to the resampled data as we would to the original data. The resampled data is often called the bootstrap sample, and the conditional distribution of the statistic applied to the bootstrap sample (given the data $(X_1, \dots, X_n)$) is referred to as the bootstrap distribution.
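As a quick numerical illustration of this definition (added here, not part of the original paper), the sketch below simulates the bootstrap distribution of S*_n for a finite-variance sample; the choice a_n = sqrt(n) times the sample standard deviation is an assumption made so that the simulated law is roughly standard Normal, matching the Bickel-Freedman result cited next.

import numpy as np

rng = np.random.default_rng(9)

def bootstrap_sums(x, a_n, B=2000):
    """Simulate the bootstrap distribution of S*_n = a_n^{-1} sum_j (X*_j - Xbar_n),
    resampling X*_1,...,X*_n i.i.d. from the empirical distribution F_n."""
    n = len(x)
    out = np.empty(B)
    for b in range(B):
        xs = x[rng.integers(0, n, size=n)]
        out[b] = (xs - x.mean()).sum() / a_n
    return out

x = rng.normal(size=200)
a_n = np.sqrt(len(x)) * x.std(ddof=0)
s_star = bootstrap_sums(x, a_n)
print(s_star.mean(), s_star.std())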
Bickel and Freedman [BF81] showed that in the finite variance case, the bootstrap distribution of $S_m^*$, given $(X_1, \dots, X_n)$, converges weakly to $N(0,1)$ as long as $m, n \to \infty$. (Recall that the variance is removed here by the choice of $a_n$.) Singh [Sin81] showed that under the added assumptions that $E|X_1|^3 < \infty$ and $F$ is non-lattice, asymptotically the bootstrap of the pivoted sample mean is actually a better approximation to the true distribution than the normal, based on the Edgeworth expansion.

Hall [Hal88] showed that when $X_1$ has finite variance but $E|X_1|^3 = \infty$, the general situation is that the normal approximation and bootstrap approximation are asymptotically equivalent. Therefore, in this case, it is better to use the normal approximation in lieu of the computational cost of the bootstrap.

For the case of $0 < \alpha < 2$, when the bootstrap sample size $m$ is taken to be the same as the original sample size $n$, it was shown by Athreya ([Ath84] and [Ath87]) that the bootstrap distribution of $S_n^*$ does not converge weakly to a constant distribution along almost all sample sequences. He showed that

it converges in distribution (with respect to the weak topology on the space of bounded measures) to a random distribution. He gave a representation for this random limit distribution in terms of Poisson random measures.

Notice that if $(M_{n1}^*, \dots, M_{nn}^*)$ has the multinomial distribution with parameters $(n, (\tfrac{1}{n}, \dots, \tfrac{1}{n}))$ and is independent of the sample sequence, then
$$\mathcal{L}(S_n^* \mid X_1, \dots, X_n) = \mathcal{L}\Big(a_n^{-1}\sum_{k=1}^n (M_{nk}^* - 1)X_k \,\Big|\, X_1, \dots, X_n\Big). \qquad (2)$$
The $M_{nk}^*$ can be thought of as counts; $X_k$ is chosen $M_{nk}^*$ times in the bootstrap sample.
Following Athreya's work, Knight [Kni89] gave a different representation for this limit law. Using the distributional relationship (2) above, and the sample sequence representation provided by LePage, Woodroofe, and Zinn [LWZ81], he gave the following explicit representation of the limit law. As in [LWZ81], define $p$ by
$$p = \lim_{y \to \infty}\frac{1 - F(y)}{1 - F(y) + F(-y)}.$$
Let $\epsilon_1, \epsilon_2, \dots$ be i.i.d., $P(\epsilon_1 = 1) = p = 1 - P(\epsilon_1 = -1)$. Let $\Gamma = (\Gamma_1, \Gamma_2, \dots)$ represent the arrival times of a Poisson process with unit rate; $\Gamma_i = \sum_{j \le i}\xi_j$ where $P(\xi_i > x) = e^{-x}$ for all $i$ ($\xi_1, \xi_2, \dots$ are independent). Let $M_1^*, M_2^*, \dots$ be independent Poisson mean 1 random variables. Finally, assume that $\{\epsilon_i\}$, $\{\Gamma_i\}$, $\{M_i^*\}$ are mutually independent. Then
$$\mathcal{L}(S_n^* \mid X_1, \dots, X_n) \to_d \mathcal{L}\Big(\sum_{k=1}^{\infty}\epsilon_k\Gamma_k^{-1/\alpha}(M_k^* - 1) \,\Big|\, \{\epsilon_k\}, \{\Gamma_k\}\Big). \qquad (3)$$

It is important to note that the above convergence is in distribution. In fact, Giné and Zinn [GZ89] show that in the infinite variance case, this cannot be strengthened to almost sure convergence (which does occur in the finite variance case).

In the infinite variance case, the bootstrap distribution of $S_n^*$ does not converge in distribution to the limit distribution $\mu$ obtained in the limit of the original $S_n$ sequence. That is, the bootstrap distribution is not a consistent estimate of $\mu$. Because of this phenomenon, it has been said that the bootstrap does not work in this case. (Although it is understood why this claim was made, Section 5 suggests that the method may actually be viable for applications in the $\alpha < 2$ case.)

What happens when the bootstrap sample size $m$ is allowed to differ from the sample size $n$? Athreya [Ath85] showed that in the $0 < \alpha < 2$ case, the bootstrap can still be made to work if the bootstrap sample size is chosen small enough in relation to $n$. More precisely, if the bootstrap sample size $m_n \to \infty$

such that $m_n/n \to 0$, then the bootstrap distribution of $S_{m_n}^*$ converges weakly to $\mu$ in probability.

Arcones and Giné [AG89] added to this answer by showing that if $m_n\log\log m_n/n \to 0$, then the bootstrap central limit theorem holds almost surely. That is, the conditional distribution of $S_{m_n}^*$ converges with respect to the weak topology to $\mu$ almost surely. But if $\liminf m_n\log\log m_n/n > 0$, then there is no almost sure convergence, not even to a random measure!

In this paper, we examine the relationship between the distribution of the


partial sums and the resampling criteria. We give a decomposition of the
bootstrap of the sample mean in the form of a stochastic integral. Then we
state and prove an invariance principle explaining the behavior of the pro-
cesses involved in the decomposition. When these processes are replaced by
their limits in the stochastic integral, the integral obtained turns out to have
the limiting distribution of the bootstrap for all a.? This affords a general
representation of the bootstrap limit law encompassing both the normal and
nonnormal domains of attraction in one expression.
We give the stochastic integral decomposition in Section 2. In Section 3
we introduce the invariance principle. The theorem is proved for the finite
variance case as well as the a < 2 symmetric case. Section 4 gives a de-
composition of the limiting distributions of the same form as that given in
Section 2. We give some simulation results in Section 5 which suggest that in
the a < 2 symmetric case the bootstrap of the sample mean actually performs
very well. Section 6 contains some concluding remarks suggesting some of the
value of the research as well as future directions. The Appendix contains the
statements of some technical results which are referenced in the proof of the
invariance principle. Some of the proofs have been omitted here but all have
been included in the author’s Ph.D. thesis [Kin90].

2 The Stochastic Integral Representation


Here we give a new decomposition of the bootstrap of the sample mean.
² It should be mentioned here that in the infinite variance case the limit result is restricted to distributions symmetric about the location parameter.

Definition 2 For each pair of positive integers $m, n$, we define the following:

(i) $\mathbf{X}_n = (X_{(1)}, \dots, X_{(n)})$ are the absolutely ordered observations,
$$|X_{(1)}| \ge \dots \ge |X_{(n)}|,$$
so that $X_{(i)}$ is the $i$-th largest in absolute value.

(ii) $W^n$ is the scaled partial sum process associated with $(X_1, \dots, X_n)$:
$$W^n(t) = a_n^{-1}\sum_{k=1}^{[nt]} X_k, \qquad t \in [0,1].$$

(iii) $(M_{1n}, \dots, M_{nn})$ is a multinomial $(m, (\tfrac{1}{n}, \dots, \tfrac{1}{n}))$ vector independent of the observations $(X_1, \dots, X_n)$.

(iv) $H^{(m,n)}$ is the empirical distribution function of the centered multinomial vector:
$$H^{(m,n)}(x) = \frac{1}{n}\sum_{i=1}^n I\big(M_{in} - \tfrac{m}{n} \le x\big).$$

Theorem 1 Let $n$ be the number of observations, and let $m$ be the bootstrap sample size. Then
$$\mathcal{L}(S_m^* \mid \mathbf{X}_n) = \mathcal{L}\Big(\int t\, W^n \circ H^{(m,n)}(dt) \,\Big|\, \mathbf{X}_n\Big).$$
We will refer to $W^n \circ H^{(m,n)}$ as the bootstrap partial sum process.

It should be pointed out here that in reference to the bootstrap, since resampling with replacement has no dependence on the order of the data,
$$\mathcal{L}(S_m^* \mid \mathbf{X}_n) = \mathcal{L}(S_m^* \mid X_1, \dots, X_n).$$
But this alternative conditioning is not valid for the stochastic integral representation;
$$\mathcal{L}\Big(\int t\, W^n \circ H^{(m,n)}(dt) \,\Big|\, \mathbf{X}_n\Big) \ne \mathcal{L}\Big(\int t\, W^n \circ H^{(m,n)}(dt) \,\Big|\, X_1, \dots, X_n\Big).$$

3 The Invariance Principle


Since $\bar X_n$ is a function of the partial sum process $W^n$, and conditionally on $\mathbf{X}_n$, $S_m^*$ has distribution dependent on the bootstrap partial sum process $W^n \circ H^{(m,n)}$, it is clear that the behavior of $S_m^*$ jointly with $\bar X_n$ is dependent on the behavior of $W^n \circ H^{(m,n)}$ jointly with $W^n$. The following invariance principle helps explain this behavior for large $n$.

3.1 The Main Theorem

We define the Skorohod metric as it is defined in Billingsley [Bil68].
162 Kinateder

Definition 3 For each pair of functions $x$ and $y$ in $D[0,1]$, define the distance $d_s(x, y)$ as the infimum of all those values of $\delta$ for which there exists a strictly increasing and onto transformation $\lambda : [0,1] \to [0,1]$ such that
$$\|\lambda - I\| \le \delta \quad \text{and} \quad \|x - y \circ \lambda\| \le \delta,$$
where $I$ denotes the identity function on $[0,1]$.

Theorem 2 Suppose $H^n$ is a sequence of stochastic processes on $\mathbb{R}$, independent of $W^n$, converging uniformly to $H$ almost surely. Let $W$ be the homogeneous independent increments symmetric $\alpha$-stable process with scale determined by $S_n \to_d W(1)$.

(A) If $X_1$ has a symmetric distribution in the domain of attraction of an $\alpha$-stable distribution ($\alpha < 2$) and $H$ is the distribution function of a discrete random variable which takes values in a finite set or a set which can be written in the form $\{d_1, d_2, \dots\}$ such that $d_1 < d_2 < \cdots$, then
$$(W^n \circ H^n, W^n) \to_d (W \circ H, W)$$
in the product space $(D(\mathbb{R}), \mathcal{U}) \times (D[0,1], \mathcal{S})$, where $\mathcal{U}$ denotes the uniform topology on $D(\mathbb{R})$ and $\mathcal{S}$ denotes the Skorohod topology on $D[0,1]$.

(B) If $EX_1^2 < \infty$ then
$$(W^n \circ H^n, W^n) \to_d (W \circ H, W)$$
in the product space $(D(\mathbb{R}), \mathcal{U}_1) \times (D[0,1], \mathcal{U}_2)$, where $\mathcal{U}_1$ denotes the uniform topology on $D(\mathbb{R})$ and $\mathcal{U}_2$ denotes the uniform topology on $D[0,1]$.

Notice that in part (B), $W$ is a Brownian motion.

3.2 Proof for the a < 2 Case


Throughout this section we assume the hypotheses of Theorem 2(A).

Definition 4 Let «&,¢€2,... be i.i.d.,

P(e =1) = P(g = —1) = 1/2.

Let T represent the arrival times of a unit rate Poisson process as described in
the introduction.
Let T;,T2,... be t.t.d., uniformly distributed on (0,1).
Define W by
WO) = Seki Tweet eS aan)
k=1
An Invariance Principle
SSS
163

LePage [LeP80] showed that W is a homogeneous independent increments


symmetric a-stable process.
__ We start by exploring the way that a particular LePage-like representation
W” with the same distribution as W” converges to W. Then we will use this
convergence along with the almost sure uniform convergence of H” to H to
finish the proof.
Let 1—G be the distribution function of |X,|. Let G7! be the usual inverse:
For real y € (0,1), G-'(y) = inf{z : G(r) < y}. Define for each n and
| eseae
Ye a GP) Tags)
Notice that
(ery cat eaten) =< 4, (Xt), -2 Xn): (4)

As introduced by LePage [LeP80], we define random variables L? in such a


way that the processes I(L? < t), 7 = 1,...,n facilitate scrambling of the
ordered random variables, €Yni,..-,€nYnn,

[3 (t) min{t: T; < sha

min{t: T; = [nt] = Dies MLE <1) Pages


Dp inte, Tee
Pi(t) ei n+1—j

Lastly, define W” to be the scrambled partial sum process associated with


feginueind Calan)

W(t) = >> exVaeT(LE St) t € [0,1]. (5)


k=)

W(t) is constant on each of the intervals [4,4*), j = 1,...,n, adding a


random selection of one of the e,Y,,%’s at each of the times t = i, 2 Se ee
Thus W” has the same distribution as W” in D(0, 1].

We first examine the behavior of W” truncated to its N — 1 largest jumps.


Let
N-1 N-1
VEG WT ell 2s ride andl) Swe Dy eTp IMTS):
gaat j=1

Proposition 1 For each N, Vi — Sw a.s. in the Skorohod topology.


A ; eye -1/a
Proof. By definition of an, for each j, with probability one, Yn; — r; ‘ ;
Therefore the vector of ordered sizes of the N — 1 jumps of Vy is approaching
the vector of ordered sizes of the N — 1 jumps of Sy.
164 Kinateder

But the vector of locations of the N — 1 jumps of Vx is also converging to


the corresponding vector for Sy. To see this note that

min{t: 7; < SiGe


[nt] et
nS ae golden POT
< min{t: 7; =<
aD Ve

The two outer terms converge to T; almost surely as n — oo.


Define \, to be the piecewise linear function with endpoints given by
An(O) = 0, An(1) = 1, and Ap (b8), = 0 tors ="1,. VY oe eventually,
An is strictly increasing in which case
N-1
d,(Vz,Sw) < max{ > ~|¥n; — Tj "|, 22 — Tj]: 157 < N-1}-
j=1

The right side converges to zero almost surely by the previous analysis. Oo

We will proceed to show that the tail sums W"-Vz and W —Ty converge to
zero in the uniform metric as N — oo rapidly enough so that d,(W”, W) —, 0.
To this end, we will use a weak version of a theorem from Pollard [Pol84].
But first we need to define conditional variance as Pollard does.

Definition 5 The conditional variance process associated with the L?-mar-


tingale € is the unique, increasing, predictable process V with V(0) = 0, such
that £? — V is a martingale. It is denoted by (€):.

Theorem 3 Let (£,) be a sequence of L*-martingales with conditional vari-


ance processes (£,). The following are sufficient conditions for convergence in
distribution of (€,) as random elements of D[{0,1] under the uniform topology
to Brownian motion on [0,1]:
Asn — oo,
(2) €n(0)— 0;
(22) (En)t pt, for each fired t;
(127) EJ(E,)? + 0, where J denotes the functional giving the maximum jump
of the path operated on.

Specifically, Pollard states the theorem for processes on [0, 00) and the con-
vergence is with respect to the uniform on compacta topology. Thus he proves
a generalized version of this statement for processes restricted to intervals of
the form [0,7].
Let hs ry
Ty (ue Sere lar edd): 32 = Sa Bp
dane =n
An Invariance Principle
SSS
165

We will show that T;,/s, converges in distribution to standard Brownian


motion on [0,1] in the uniform topology on D[0, 1]. As a consequence, we shall
get
sup |Tn(t)| +p 0. (6)
Theorem 3 seems almost tailored to our application; the paths of the pro-
cesses we are dealing with are constructed in such a way that the maximum
jump (and all others) can be read right off.

Notation. In what follows, || - || will denote the supremum norm for any
space on which it makes sense; it will be subscripted to remove any ambiguity.
For o-fields F, we will let E? denote conditional expectation given the o-field
F; E" will denote conditional expectation given the o-field generated by the
process {I';}.

Proposition 2 T,/s, converges in distribution to Brownian motion with re-


spect to the uniform topology on D{0, 1].

Proof. For s in [0,1], define

G, = o{Tj, 61{T; Su}, uss,j2 1},


and let F, be the P—completion of G,.

Lemma 1 {(Tn(t)/sn,
Fi) : t € [0,1]} ts an L?-martingale.

This is proved in [Kin90] using the completeness of F;, the continuity of


the operator E7* on L?(F;), and results from LePage [LeP80].

Lemma 2 With probability one, the conditional variance process of Tn/Sn 1s


given by
(T, fen = —8," SsTz 7/*In(1 — T;, At) (for allt). (7)
k=n

Sketch of Proof. Use the method provided by Aalen in [Aal78], and the
monotone convergence theorem, and the Minkowski inequality for conditional
expectation to verify the conditions given in definition 5. 0

Now we check that the conditions of Theorem 3 hold for T;,/sn.


Relation (7) is clear.
166 Kinateder

For (27), by (2) and the MCT for conditional expectation,

E(T,./82), B(BM(—3; =n
Tn -TA2)))
B(s;? > PET (ial — Te A ‘))).
k=n

But the processes {[;} and {T;} are independent and E(—In(1 —T, At)) =
Jo In(l —uAt) du =t. Hence

E(Iq] 8p), =tB So(Uy42/s2) = t


k=n

Lemma 7 shows that o?((T;,/8n),) + 0 as n — oo. Thus, Chebyshev’s inequal-


ity gives (T,,/8n), —p t, for each t.
Finally, Lemma 6 in the Appendix shows that ['=?/*/s? — 0 with proba-
bility one. By the bounded convergence theorem (777) follows. Oo

Corollary 1 sup |T,(t)| >, 0.

Proof. Noting that s?2 — 0 almost surely, this follows by Proposition 2. 0

Notice that this corollary is equivalent to the statement: In the uniform


metric,

ery
g=1
apy Sw. (8)

Now we will work on finding an appropriate bound for the tail sums of the
processes W” described earlier. Define
n

UN(t) = D0 eG Ynjl(L? < t); CD Deore


j=N j=N

We will apply a method similar to that which was used in Proposition 2.


Notice that here we must control the starting and ending indeces, whereas
before we only needed to keep track of the former.

Proposition 3 Indexed by t, UR/snn is an L?-martingale, with conditional


variance Up 7 [nt]-1 (ie > BE)
= ane Ve Snes
ae s > ; ~ Dae
An Invariance Principle

Proof. For each n,t let

Fut = O{T;,61(L? Su), ust; 1<j <n}.


Notice that the process (UX/swn(t)) is adapted to (Fnt), and that for each
n,j, and t, svn and Y,; are measurable F,,,. Use Lemma 5 to show UR/snn is
an L?-martingale.
As is pointed out by Pollard [Pol84, p.176], if an L?-martingale Z changes
only by jumps £1, &2,... occurring at fixed times t; < tz <..., and if F, = Fi,
for t, <t < t,4,, then the conditional variance process is just a partial sum
process of conditional variances: For t, < t < ty41.

CAE Ekg gD Sig nae Manage 0 adh es


Our process and o-fields satisfy these conditions, so for each N,n such that
N<n,

(=e) a S Bn (s52,(us 4)
n - ux)’
n

[nt]-1
ay y- sx22 Baim (J Ypepltl,
(=,et
a y):
k=0 yo=iNi

[nt]-1 n k+1
= Do 8nn Dy YayP(LR =[Fen
k=0 j=N

[nt]-1
= me = pee Saget. eat eh eer
- k=0
Sn DyYj
j=N
n—k

n [nt]—1 1s By
= 85 in Daeg a
j=N k=0

Proposition 4 For eacht in [0,1], the following relations hold:

n(—2) Ls [nt] (9)


SNn/ t n

< os )Aleta (10)


n ¢ 2
Vari i
SNn/ t Sin n (n x 1)

This is proved in [Kin90] using some tedious calculations with Proposi-


tion 3.
168 Kinateder

Proposition 5 There exists a sequence (no(N)) such that

Yn(N)N => 0 a.s.


$Nn/(N)
for every sequence (n'(N)) such that n'(N) > no(N) for each N.

Proof. Since for each j, Y,; > Tek a.s., we have for each N,

ya) Pa a.s
SH ye eek Ee

Hence we can choose no(N) > 2N large enough that for n > no(N),
Saja
es (ae YZ, ue, =a ie! = >) Ze

Fix a sequence (n'(N)) such that n’(N) > no(N) for each N. By a simple
application of a Borel-Cantelli lemma, as N — oo

2 aie
ae ee ore (11)
Dna, ni(N)j ONY iy

By the strong law of large numbers,

iba ” Dee i (['w/N)-2/4

ST” < wig >NCaw/N > 0 as.. (12)

Combining (11) and (12), we see that


2
mun >.<
ween énN);

But since n’(N) > 2N,


2 2
Yn (N)N a Yn(N)N

Sni(ny N41 Ynys

Proposition 6 For each N, as n — 00,

lV oH" —SnoH|| 0 as..


An Invariance Principle
SSS
169

Proof. Fix N. Let By equal the set on which


-
N

V |LF =| ar 0,
j=l
N

j =1
Peli 0,
T; # H(t) for alli >1 and teR,

|H"-H|| -— 0.

By hypotheses, P(By) = 1. We claim that ||Vs oH" — Sy o H|| + 0 on By.


To see this, fix w € By and let 6 > 0. Suppress the argument w from the
following relations. By the assumptions on H, there exists K such that for
kek,
H(d,) > H(dx) > max{T; :i < N}.
Thus, we can choose € > 0 such that

eee Nin Ty (|. (13)


i<N,teR

Fix M’ such that for n > M’'


N
VIE -Tl <€/3,
j=1
(14)
N a

j=1

||H"—H|| <e/3.

Now let n > M. We show that |V,? o H"(t) — Sn o H(t)| < 6, for all real
t. Recall that Sy is constant on [T(, Ti41)) for 7 = 0,1,...,N (with To = 0,
and Ty41 = 1); and Vx is constant on [L{j), Liat): We proceed to show that

L? < H"(t) if and only if T; < H(t). (15)


Pick t € R. Say T; < H(t). Then by (13), T; +e < H(t). Since |L? — T;| < €/3
and |H(™)(t) — H(t)| < ¢/3, we must have L? < H"(t). The argument for the
other direction of (15) is similar. Thus

\Vn(H"(t)) — Sv(H(4))|
170 Kinateder

=| eYal(L? < H"() — 3 al UT: < H(t))|


t=1 t=1

N-1

2 Wee eel
t=1

50:

We chose 6 > 0 arbitrarily, so we are done. O

With existence guaranteed by Propositions 1 and 6, choose ni(N) > no( NV)
(no is defined in Proposition 5) such that for n > ni(N),
P(|\Vi
0H" —Sy 0 H||>N7)<N™, (16)
and
P(d,(VEV SS) > Na cv (17)
Define
N(n) = max{N >1:n(N) <n}V1.
Lemma 3 VN (n) —, W in the Skorohod topology.

Proof. Let (n;,) be a subsequence. Since ni(N) > no(N) > 2N — ov, we can
choose a subsequence (n,,;) such that N(n;;) < N(nx,,,) for each j.
Let e > 0. Now ng, > mi(N(nx,)) by definition of N(-), and for 7 large
enough, N(n;,;) > 1/varepsilon, in which case

P(ds(Vyr,,y W) 2 28)
< P(ds(Vin(a,,)) Swim,)) > €) + P(ds(Swvim,)s W) > €)

S Nm) * Plds( Sem) W) > 2)


1

Since N(n;;) — 00, the claim is proved. oO

Lemma 4 UN (n)/$N(n)n converges in distribution to Brownian motion with


respect to the uniform topology on D{0, 1].

Proof. As in the proof of Lemma 3, choose subsequences (n;) and (nx,).


Again we use Theorem 3. By Proposition 3, UN (n) /SN(n)n is an L?-martingale.
Surely UR(ny/$N(n)n(0) = 0. Also, since
Nk; = ni(N(nx,)) 2 no(N(nz;)),
An Invariance Principle
SSS
171

Propositions 4 and 5 give us that the conditional variance condition is satisfied.


Finally, Proposition 5 gives us that the expected squared maximum jump also
converges to 0. a

Theorem 4 W” —p W in the Skorohod topology on D(0, 1].

Proof. Notice that W” = Viv(n) + UN (n): By the corollary to Proposition 8,


SN(n)n —p 0. Therefore, by Lemma 4, |UNn) | —, 0. By definition of d, it is
easy to show that if d,(z,,2) — 0, and |lyn|| — 0, then d,(r_, + yn,z) — 0.
Apply Lemmas 3 and 4 to complete the proof. O

Proposition 7 W"o H” —, W od in the uniform topology on D(R).

Proof. For any n,

|W" o H™ —Wo H|| < |W" 0 H" — Vx.) 0 H™|


Goo
se
Ai

+ [[Vileq) oH” — Swim) 0H + ||Sniwy 0H — Wo Hl).


————
———
A2 A3

Ay < ||W"- Viinyll = \|UN(nyll +p 0 as was shown in the proof of Theorem 4.


A3 < ||Vnq) — WII = |[Zn|| > 0 by (2.2) because N(n) — oo. Also, since
mi(N(n)) <n, and N(n) > oo, A; -, 0. Oo

Combining Theorem 4 and Proposition 7,

(W" 0 H",W") >, (Wo H,W)


in the prescribed space. Since these processes have the same joint distribution
as (W” o H",W"), Theorem 2(A) is proved. Oo

3.3 Proof in the Finite Variance Case


2
Assume the hypotheses of Theorem 2(B). Here EX? < 00 so ay = n/2¢ and
Donsker’s Theorem applies:

WwW” 4 W,

under the uniform topology on D{0,1], where W is Brownian motion (see


Billingsley [Bil68]).
172 Kinateder

Let W” be a process with the same distribution as W” such that wr


converges uniformly almost surely to a continuous Brownian motion W.

|W oH"|ln
[Wr oH" —WoHllln < ||\WoH"—WoH" t
—Wola
< |W" - Who. + ||\WoH" —Wo Hllr
By construction, the first term on the right converges to 0 almost surely.
With probability one W is uniformly continuous on [0,1]. Since H" > H
uniformly almost surely, the last term converges to 0 almost surely.
With this representation, we have (W"o H”, W”) — (WoH, W) uniformly
almost surely. Therefore,

(W" o H",W") 34 (Wo H,W)


under the uniform topology in each coordinate. Oo

4 The Limit Laws


Under the usual resampling scheme, (i.i.d. resampling from the data), when we
replace the processes involved in the stochastic integral decomposition by their
limits obtained in the invariance principle, we get the limiting distribution of
the bootstrap.
Throughout this section, let M* be a Poisson (mean 1) random variable
and let H(z) = P(M*-1 <2).

4.1 Infinite Variance (a < 2): Symmetric Case


In light of the results of Bickel and Freedman [BF81] and the result (3) by
Knight [Kni89], and the invariance principle, the following theorem shows
why the stochastic integral decomposition may be the natural way to view the
bootstrap.

Let ¢,[, and W be as in Definition 4. Let My, Mj,... be i.i.d. ~ M* inde-


pendent of ¢,T.

Theorem 5

L(>> al g!"(ME = 1) (a0, 5k 21)


k=1

ee tWo H(dt)|e0,/*,k > 1):


An Invariance Principle
SSS
173

4.2 Finite Variance Case


The following propositiori shows that the above representation for the limit
law carries over naturally to the finite variance case.

Theorem 6 [f W is a Brownian motion then [tW o H(dt) has the standard


normal distribution.

Proof. To see this, look at the characteristic functions again. The stochastic
integral can be written as >°°2_, r[W(H(r)) — W(H(r—))]. By independence
of the increments of Brownian motion, its characteristic function is

anes. ee
nr — Co Sane 4

= ji exe(—5 DUH) - HW)


2

= exp(—5 |r dH(r))

= exp(—t?/2).
This is the characteristic function of the standard normal distribution, so the
assertion is proved. Oo

Assume W is a version of Brownian motion with continuous sample paths.


Since conditioning on the ordered jumps of such a process provides us with no
additional information,

£( [two H(at) | ordered jumps of W) =£([tWoH(d)).

By applying Bickel and Freedman’s results [BF81], using the proper scal-
ing, a, = n'/?o, we see that in the symmetric case the result concerning a
distribution in the domain of attraction of an infinite variance stable random
variable can be viewed as an extension of what was already known in the finite
variance case.

5 Simulation Results
For random variables which have infinite variance, we found that in the sym-
metric case, the bootstrap of the sample mean does not perform so badly. In
fact, in some ways, the method gives better results than it does in the finite
variance case.
174 Kinateder

For various indices of stability a, and confidence levels y, we simulated


observations of random variables X;, symmetric about 0, in the domain of
attraction of an a-stable random variable and applied the bootstrap of the
sample mean to create symmetric y-confidence intervals for @ in the following
manner:
For a given sample Xn = (Xj,...,Xn), the confidence interval C,(XKn) is
given by i
Cy(Xn) = [Xn — T,(Xn), Xn + Ty(Xn)],
where T,,(Xn) is estimated as a quantity which satisfies

PIX! — Xa $T(Xn)|[Xal = 7-
A suprising observation was that the empirical coverage of this method was
consistently higher than 7 for a < 2. Figure 1 shows the observed coverage
on the bootstrap method applied 1000 times each for y = .95 confidence with
sample size n = 50, bootstrap resample size m = 50, and 500 bootstrap
observations for various values of a. In our simulations, X, ~ «U~1!/* where
P(e = 1) = 1/2 = P(e = —1), U uniform on (0,1). Notice that for a > 2,
F has finite variance, and hence it is expected that the coverage should be
approximately .95.
Consider the confidence radii obtained by the above method, scaled by
n-1/@ because
St = naz (X* SX),
and a, ~ cn'/* (see Feller [Fel71]).
In the finite variance case, the bootstrap distribution of the scaled and cen-
tered bootstrapped sample mean converges weakly to a fixed (normal) distri-
bution almost surely. Since the limit distribution is continuous, the confidence
radii given by the above method and then scaled as indicated converge almost
surely to a fixed number as the sample size n tends to infinity.
But by Athreya’s early results [Ath84] one might not expect this phe-
nomenon to occur in the infinite variance case. Our simulation results exem-
plify this. Figure 2 shows a frequency histogram of the observed confidence
radii (logarithmically scaled) with n = 50, a = 1.0. The vertical line rep-
resents the logarithm of the radius necessary for an unconditional confidence
interval with confidence level equal to the coverage observed by applying this
method.
Figure 3 shows more of the same phenomena for various values of a.
An Invariance Principle 175
(x 6.81)

8.6 a 1.6 2 2.6 3


e

Figure 1: Coverage of bootstrap method for various a.

ad
ee
|Toes
wv 2
1ee
|41
41
et

' e a 1.3 3.3 6.3 7.3

Figure 2: Distribution of bootstrap confidence radii for a = 1, n = 50.


176 Kinateder

sae

® 4 8 12 i6 2e oe a 3 6 7
n260 elphe:.6e elphe?i.26

t)
“1.1 *@.1 6.9 1.9 2.8 [email protected]@.3 @ 6.3 8.6
alpneas2. 60 elpnass. 6e

Figure 3: Distribution of bootstrap confidence radii for various a.


An Invariance Principle
SSS
177

; As expected, even as n gets large, the distribution of the scaled radii ob-
tained by this method is dispersed apparently continuously over the positive
real axis. Figure 4 shows what happens when n = 200.

> )% a 3 6 7 9 4a

Figure 4: Distribution of bootstrap confidence radii for a = 1, n = 200.

More is observed. Since the confidence widths generated by the method


have such a wide distribution, we examined how well the procedure compares
with the procedure which simply uses the unconditional distribution of the
sample mean about 6.
That is, we estimated T, such that

P(|X, — 6| < T,) = 7.


Since the X/s are in the domain of attraction of the stable random variable
Y, we can do this not only by Monte Carlo simulation, but also by using the
quantiles of Y, because S, 4 Y (see (1)).
The results were startling. For a very small proportion of the applications,
the confidence radii obtained by the method T,(Xn), were larger than T,.
But for the complement, the T,(Xn) was extremely small compared to T,,
suggesting that the bootstrap performs well. :

Example. Here is an illustration of how the bootstrap confidence radii com-


pare with confidence radii obtained using the unconditional distribution of the
sample mean.
We ran a simulation with 1000 observations for a = 1, n = 50, and 2000
bootstrap resamplings for each observation with m = 50. The observed cover-
age of bootstrap method was 0.968. The radius about sample mean necessary
for unconditional confidence of .968 is T.9¢g = 32.13.
The implications of these results could be very far reaching. 94% of the
times the method was applied, the radius of the confidence interval for @ was
178 Kinateder

Table 1: Analysis of sizes of bootstrap confidence radii.


Cc Observed proportion of times T\95(Xn) < cT'968
1 940
1/2 881
1/4 .770
107 500

less than the radius necessary to give confidence equivalent to the empirical
coverage obtained by the method.
Maybe more substantial is how much smaller the observed radii were. Half
of the time the bootstrap confidence interval radius was less than about a
tenth of the radius necessary for unconditional confidence.
More needs to be studied in this direction. The invariance principle proved
in this paper will help to explain the phenomenon.

6 Remarks

6.1 Knight’s Result Follows in the Symmetric Case


In [Kin90] it is shown that Knight’s result follows in the symmetric case from
the technique used in proving the invariance principle. This is a different
approach than that used by Knight — bounding the multipliers as opposed
to bounding the scaled sample. It reveals in another way what is important
about the resampling plan.

6.2 Other Resampling Plans


One of the problems with the usual bootstrap in the infinite variance case is
its inconsistency.
The stochastic integral representation given in Section 2 together with the
invariance principle proved in Section 3 provide a new way of studying the
resampling problem.

6.3 Only Off by a Scale.


Consider the symmetric case.
Recall the result given by Knight [Kni89]. The representation he gives for
the random limit law is the random conditional distribution of

aly "(Mg - 1)
k=1
An Invariance Principle

given the sequence eI'j"/%, e,T';'/%,....


It is shown by LePage in [LWZ81] that for our choice of a, and by,
co
Sa — TP wy Aree:
k=

By Lemma 5.1 in [LeP80], the unconditional distribution of the sum given


by Knight is only a scale away from the correct limit law p :

Sal OG —1)-=2(E(Mi — 1°) Satz.


k=1 al

A Appendix
Below are stated some of the technical results referenced during the proof of
the invariance principle. These are all proved in the author’s Ph.D. thesis.

Lemma 5 Let T be an arbitrary random variable taking on values in (0,1)


and € be a random variable independent of T with P(e = 1) = 5= P(e =-1).
Por s49 (0,1) let F, = cf{el(7. < u),u < s}.. Then for B.C (8, 1),,20ith
probability one,
P(T € B)
bg S B\|F,} = Fitisenh > 3),

and
E**el(T € B) =0.
Lemma 6 With probability one, '=*/~/s? = 0.
Lemma 7 For each t, o7(V,(t)) — 0.
Lemma 8 For0<s<t<l
t—s
E**\n(1 ~T At) = In(l—T, As) — 5 (Ti. = 3}.
— 8
Proposition 8
Co
—2/a
Naa ie py a ry Ea
g=1 k=1

Acknowledgements. The author would like to thank Professor Raoul LeP-


age for the suggestion of the problem and all of the helpful direction in solving
it. He would also like to thank Professors Schlomo Levental and Anil Jain for
their time and interest. Professor Hira Koul was particularly helpful with his
meticulous reading of the original manuscript and many helpful corrections
and suggestions. Finally, the author would like to thank the Office of Naval
Research for supporting him for the last year and one half of his doctoral
study.
180 Kinateder

References
[Aal78] Odd Aalen. Nonparametric inference for a family of counting pro-
cesses. Annals of Statistics, 6:701-726, 1978.

[AG89] Miquel A. Arcones and Evarist Giné. The bootstrap of the mean
with arbitrary sample size. Annals of Inst. of H. Poincare, 1989.

[Ath84] K. B. Athreya. Bootstrap of the mean in the infinite variance case.


Technical report, Iowa State University, 1984.

[Ath85] K. B. Athreya. Bootstrap of the mean in the infinite variance case-


ii. Technical report, Iowa State University, 1985.

[Ath87] K.B. Athreya. Bootstrap of the mean in the infinite variance case.
Annals of Statistics, 15:724-731, 1987.

[BF81] Peter J. Bickel and David A. Freedman. Some asymptotic theory for
the bootstrap. Annals of Statistics, 9:1196-1217, 1981.

[Bil68] Patrick Billingsley. Convergence of Probability Measures. John Wiley,


New York, 1968.

[Bil86] Patrick Billingsley. Probability and Measure. John Wiley, New York,
1986.

[Efr79] Bradley Efron. Bootstrap methods: Another look at the jackknife.


Annals of Statistics, 7:1-26, 1979.

[Fel71] William Feller. An Introduction to Probability Theory and Applica-


tions. John Wiley, New York, 1971.

[GZ89] Evarist Giné and Joel Zinn. Necessary conditions for the bootstrap
of the mean. Annals of Statistics, 17:684-691, 1989.

[Hal88] Peter Hall. Rate of convergence in bootstrap approximations. Annals


of Probability, 16:1665-1684, 1988.

[Kin90]_ John G. Kinateder. An Invariance Principle Applicable to the Boot-


strap. PhD thesis, Michigan State University, 1990.

[Kni89] Keith Knight. On the bootstrap of the sample mean in the infinite
variance case. Annals of Statistics, 17:1168-1175, 1989.

[LeP80] Raoul LePage. Multidimensional infinitely divisible variables and


processes. part i: Stable case. Technical report, Stanford University,
1980.
An Invariance Principle 181

[LWZ81] Raoul LePage, Michael Woodroofe, and Joel Zinn. Convergence to a


stable distribution via order statistics. Annals of Probability, 9:624—
632, 1981.

[Pol84] Henry Pollard. Convergence of Stochastic Processes. Springer-Verlag,


1984.

[Sin81] Kesar Singh. On the asymptotic accuracy of Efron’s bootstrap. An-


nals of Statistics, 9:1187-1195, 1981.
Satibenaliae |
“hOM Bie ehigth an:3
Laie) SOR atalt Newsy
ng , Anmaeleaf peat
tree ipa. aswevv yh:otvenote,
a5 Lage of se
Aestied
rw are. -
witb fetes pom es
ee i lea D

1 lies , 4 etalrage at
id $yaypT easey.

aiei Ri ites ue.


M

ne wT Mi WieeM ones
’ Ae beige ha‘ auts

tT nid "Wow a.

Lain ‘
ie
Seat ri : :
mee ieee ie
we a
ay me
isa ag
Mdiee Pie i.aetna) 73
ugar a
A Veh a Ted

D Pah: Gy)‘ee

ee

ra tay

oe

‘ Ph

Mies
Edgeworth Correction by ‘Moving Block’ Bootstrap
for Stationary and Nonstationary Data
S.N. Lahiri
Iowa State University

ABSTRACT
This paper considers second order properties of the ‘moving

block’ bootstrap procedure, proposed by Kiinsch (1989: Annals of

Statistics, 17, 1217- ). In the case of sample mean of weakly

dependent stationary data, the exact rate of approximation by

Kinsch’s bootstrapped statistic is determined and observed to be

worse than the rate of normal approximation. However, a suitable

modification in the definition of the bootstrapped statistic removes

this deficiency and provides a second order correct approximation.

Similar optimality is proved for the eigen ChZaHiECamnle mean of

m-dependent data. Furthermore, the modified version is shown to be

"robust" in the cases where the observations are not necessarily

stationary.

1. Introduction: It has been more than a decade since Efron (1979)

introduced the ‘bootstrap’ procedure for estimating unknown sampling

distributions of statistics based on independent and identically

distributed (iid) observations. It is well known that, in the iid

set up, bootstrap provides more accurate approximations to the

distributions of many regular statistics than the classical large

sample approximations (e.g. Singh (1981), Babu (1986)). However,

the situation may be totally different when the observations are not

necessarily independent. Remark 2.1 of Singh (1981)) shows that the

iid resampling scheme of usual bootstrap method fails to capture the


Oe eS ee
Exploring the Limits of Bootstrap. Edited by Raoul LePage and Lynne Billard.
Copyright©1992 by John Wiley & Sons, Inc. ISBN: 0471-53631-8.
184 Lahiri

dependence structure even for the sample mean of m-dependent data.

Following this observation, there have been several attempts in the

literature to modify and extend Efron’s idea to dependent models.

In most cases, the modifications cleverly exploit certain special

properties of the specific models under consideration. For ARMA

models, the bootstrap procedure is extended by using iid innova-

tions, e.g. Freedman (1984), Efron and Tibshirani (1986). In the

case of Markov chains with finite state space, Basawa et al. (1990)

defines a version of the bootstrap method by using a well known

representation of the Markov chain in terms of independent random

variables. Also, see Kulperger and Prakasa Rao (1989), Datta and

McCormick (1990). The extension to the countable Markov chains has

been done by Athreya and Fuh (1989) using iid cycles.

Unlike the above works, a general bootstrap procedure for weak

dependent stationary data is formulated by Kiinsch (1989). There it

is shown that the proposed method provides a valid approximation to

the unknown distribution of normalized univariate sample mean. It

should be pointed out that the same modification is also put forward

in an independent work by Liu and Singh (1988) on bootstrapping

m-dependent data. Some other versions of bootstrap are proposed by

Shi and Shao (1988) (for m-dependent data) and by Moore and Rais

(1990) (for observations satisfying the uniform mixing condition).

In this paper we consider the second order properties of the ‘moving

block’ bootstrap procedure of Kiinsch (1989) and show that, under

appropriate conditions, this procedure is second order correct even

for the sample mean of nonstationary data.


Moving Block Bootstrap 185

We begin with a brief description of Kiinsch’s bootstrap. Let

Xp oXo-e- be a sequence of stationary random vectors in RP. For

n>1, let py, denote the empirical distribution (e.d.) of XpoeeeoX,e

Suppose that the random vector of interest rete) derives from a

functional T, defined on a (rich) set of probability measures on

R?. Given the observations XyoeeeoXns and an integer e=e, 1<f<n,

form the blocks Fed COORDS SS Be B for t=1,2,...,b, where b=n-2+1.

Set k=[n/2]. Here, [x] denotes the integer part of a real number

x. For the stationary bootstrap procedure proposed by Kiinsch

(1989), one selects a random, with replacement sample gee, of

size k from the observed blocks Ey kor-- 20, and defines the boot-

strapped statistic as Te = T(p,) where pr denotes the empirical dis-

tribution of the (+k) components of fat eee The distribution

of Th is then approximated by the bootstrap distribution of i

letting 2 increase to infinity with n at a suitable rate.

Kiinsch (1989) proves the validity of his procedure for the

statistic /n(X,-#) in the univariate case. Here, nX, = Yin X. and


p=EX jie In the next three sections we shall show that even in the

multivariate set up, this procedure is second order correct for a

wide class of estimators based on sample means. Write M, =

Cov(n = V2(y 4. . 4K.) y


and T 7/0, (X,,-H) -1/2
where Ae = Mi, / .
(assuming

M, to be nonsingular). In the case of normalized sample mean Ty

an equivalent formulation of Kiinsch’s bootstrap can be given as


* * * * = eet) * c
follows. Let f; = (€549-- +9849) and U: = 0 yn) Cap l<i<k. Note

that U;’s are iid with P(U}=U,) = b! for i<t<b where U. =

(Xte-tKpyy 1)/2- Define p,-E,Uy and M,=E,(U}-H,)(Uj-H,)’


A A * A * A i
where
186 Lahiri
__ | ae

E, denotes the expectation under the bootstrap probability. Set Ke

= kl ie U; and A at, 1/2. Then, according to Kiinsch (1989)’s

definition, the bootstrapped statistic et is given by is = Jk-

A (Xn-X,)« However, this statistic suffers from a problem which is

not very serious for the first order results but becomes predominant

in the second order analysis. The bootstrapped statistic 18 has a

random bias B, which indeed determines the rate of bootstrap

approximation. Under the assumptions of Kiinsch (1989), (8,| tends

to zero almost surely but not at the desired rate o(n71/2), Theorem

2.1 of Section 2 shows that /n B/ft actually has a limiting normal

distribution, so that the rate of approximation for ih is much worse

than the usual normal approximation. However, this problem can be

resolved by redefining the bootstrapped statistic as Th =i Jk

A (XqnH,) + As shown in Lahiri (1990), the bootstrap distribution

of sen approximates the distribution of Ts at the rate o(n71/2)

almost surely. Similar results continue to hold for more general

Statistics that are smooth functions of the sample mean. In

particular, it is shown that the rate of bootstrap approximation to

the distribution of studentized Xy is also o(n72/2) almost surely

when the data are m-dependent. In this case as well, one has to use

a modification of Kiinsch’s original formulation to get hold of the

second order correct version of the bootstrapped statistic.

Next we study the performance of Kiinsch’s bootstrap for non-


identically distributed, weak-dependent data. For independent
observations, Liu (1988) showed that Efron’s bootstrap retains its
second order optimality in certain non-iid models. In the dependent
Moving Block Bootstrap 187

set up, similar optimality holds for Kiinsch’s bootstrap as well. In

fact, it is shown that the bootstrap approximation to the normalized

sample mean is second order correct for observations which may not

be stationary, but satisfy some uniformity conditions only. See

Section 4 for details.

The layout of the paper is as follows. Section 2 contains the

results on the normalized sample mean and Section 3 on studentized

mean. The nonstationary case is treated in Section 4. Proofs of

all the results are given in Section 5.

2. Results on xX, As in the iid case, the key step in proving the

second order correctness of stationary bootstrap procedure is to

derive Edgeworth expansions for the original statistic and the

corresponding bootstrapped statistic. In the dependent situation,

derivations of Edgeworth expansions are usually very much involved.

Depending on the dependency structure, they require regularity

conditions which are often difficult to verify. The most general

results in this direction are obtained by Gétze and Hipp (1983)

(hereafter referred to as [GH]). We formulate our results under

regularity conditions similar to those of [GH]. Suppose that

Xp .Xo--- are defined on a common probability space (2,A,P) and a

sequence Dy»D4129409--- of sub o-fields of A are given. We shall

assume that the conditions (C.1)-(C.5) below hold throughout the

paper.
(Can) Ex, |" < © and M= lim M, is nonsingular.
Nw
188 Lahiri
||

(C.2) There exists a positive constant d such that for n,m=1,2,...

-] n+m measurable p-variate random vector


with mod“, there exists a D,_)

Yam for which (i) EIX,-Y,ml $ d-lexp(-dm) and (ii) with m-a,,
EW ml L(Y, al < nl/4) <a}. Here oP is the o-field generated
by (D;, a<j<b}, {a,} is a sequence of real numbers satisfying logn

= o(a,) and a= 0((log n)1+1/d) as n+, and for any set A, I(A)

denotes the indicator of A.


° n co
(C.3) There exists d>0 such that for all m,n=1,2,...AeD_,,BeD,._,

|P(AnB)-P(A)P(B)| < d-te-@™,


(C.4) There exists d>0 such that for all m,n=1,2..., dd emen, and

all te RP with |t|| > d,


UA 0, ee Le eee Ener ) J
E|E(e oe |D;:J#n) |<e dq

(C.5) There exists d>0 such that for all m,n,q=1,2,... and AD

E|P(A|D;:#n)-P(AID,:0<|n-J|<m+q)| < d-t_-dm_


Remark 2.1 Except for (C.2) (ii), all other conditions have been

used by [GH] for obtaining the Edgeworth expansion for the

unbootstrapped statistic Th Here, the moment condition is used for

proving the almost sure convergence of the third order bootstrap

moments of 1

In comparison, even in the univariate case, Kiinsch (1989)

requires Ex,8 < © for almost sure convergence of the second moment.

Condition (C.2) (ii) somewhat relaxes the moment requirement on the


original observations X,'s. A sufficient condition for (C.2) (ii)
is that E|x, | < 0,
Moving Block Bootstrap
SS
189
SSS SS SS

As for the other assumptions, (C.4) is a conditional Cramer’s

condition and (C.3) is the strong mixing condition stated in terms

of the o-fields D;’s. To verify conditions (C.2)-(C.5), one usually

chooses these o-fields appropriately. As an example, if x; =

B14G45» J>1 for some sequences of iid random variables {¥,} and

constants {C,)}, one may define D;’s in terms of Yj'S rather than

in terms of Xj'S- See [GH] for a number of interesting examples

where conditions (C.2) (i), (C.3)-(C.5) are verified.

For stating the theorems, we need to develop some notation.

Let Z* denote the set of all nonnegative integers. For functions

f: RP > R, and r20, write M.(f)= sup{(1+||x|])


"| f(x) |:xe R?} and
w(f,r,u) = fw(f,r,+)du where # is a finite signed measure and for

xe RP, w(f,r,x) = sup{|f(x)-f(y)|:ye R?, \|x-yl|<r}. For any set B

e RP and 7 > 0, define aB = boundary of B and (B)” = (xe R:||x-y|l


< n for some y € B}. Denote the distribution function and the

density of a standard normal random vector on the relevant Euclidean

space by @ and @ respectively. Write N (HA) for the normal

distribution on RP with mean p and dispersion matrix A. Unless

otherwise specified, all the integrals will range over the entire

Euclidean space in question and the limits in the order symbols are

taken by letting n+. Finally, let P, stand for the bootstrap

probability and E, for the expectation under P,.

Our first result gives the exact rate of approximation for the

bootstrap statistic i, = fk A (Xn-X,) considered by Kiinsch (1989).


As pointed out in the introduction, in this case the convergence
A = A *

rate is entirely determined by the bias B, = Jk A (X,-#y) of The


190 Lahiri
aa,
DN...

More precisely, one has

Theorem 2.1. Suppose that there exist constants C>0, C,>0,

O0<a<B<1/4 such that CyntecconP. Let B be a class of Borel sets in

RP such that for some a > 0,

sup #((8B)7) = 0(n*) as n J 0. (251)


BeB
Then, for almost all sample points w e€ Q,
* A = A
(a) [P.(T, € B) - P(T, € B)| = 0(|[8,|), where B. = /k A.(X,-H,),

(b) there exist constants C3 >70; Cy > 0 such that

cy [BI < (SUB LPT sX) - P(T.<x)| < Cy -[[B,I.


(c) (6n/e) 1/2 B, converges in distribution to the standard normal

distribution on RP.
Under the assumed conditions, the bound (8,|| tends to zero
almost surely (See Lemma 5.2 (ii)-(iv)), but not even at the rate

o(n71/291/4) a.s.(P). Consequently, the bootstrap approximation

provided by the statistic i is worse than the usual normal

approximation, which has an error o(n7!/2) only. However, a

significantly better convergence rate can be obtained if one centers

Xe at I and considers the statistic Ta = Jk A (Xn - It,)« The


following result from Lahiri (1990) shows that the bootstrap

distribution of Tia captures the second term in the Edgeworth

expansion for Ty and thus outperforms the normal approximation.

Theorem 2.2. Suppose there exist constants C)>0, C,>0, a>0, a<B<1/4

such that C, n®<a< ConP Then,


Moving Block Bootstrap 191

(a) Then, for any Borel measurable f: RP + R with M,(f)<, and any

positive real number t,


[E.F(Ty,) - EF(T,)| < CIM, (f)0(n71/2) + w(f,n7'38)]
* = = a
as. (P)
(b) Let B be a class of Borel sets in R? satisfying (2.1). Then,

sup |P. (7; 68) - P(T,€B)| = o(n7!/2) a.s.(P).

Remark 2.2 Theorem 2.2 shows that the stationary bootstrap procedure

is second order correct if the block size 2 is not too small or not

too big compared to the sample size n. The lower bound on the order

of 2 is required for ensuring the error bound o(n71/2) in Theorem

2.2(b) over the class B. Without it, the error of approximation is

of the order o(e-k) for any positive number k. The upper bound on

the order of 2 is imposed to guarantee the almost sure convergence

of the third order sample moments. One can allow @ to increase at

a faster rate if X) has finite higher order moments.

Remark 2.3 In the iid set up, works of Singh (1981) and Babu and

Singh (1984) imply that a strong nonlatticeness condition for the

distribution of xX) is sufficient for second order correctness of the

iid bootstrap procedure. In comparison Theorem 2.2 requires a

stronger (viz. Cramer-type) condition for the second order optimal-

ity of stationary bootstrap. Using the arguments in [GH], one can

derive a two-term Edgeworth expansion for the unbootstrapped statis-

tic ls under a conditional strong-nonlatticeness condition. However

it is not clear if this is enough to guarantee the conclusions of

Theorem 2.2.
192 Lahiri
a
sss.

3. s
of mean:
Smooth function In this section we consider the

asymptotic accuracy of Kiinsch’s bootstrap procedure for more general

statistics based on the means of transformed observations. Let f:

RP = R41 be a Borel measureable function. Let Y; = F(X;), a= EY,

and Y, = nl Dic; Suppose that H: R¢+ R" is twice continu-


ously differentiable in a neighborhood of a. Here we shall consider

the normalized version of the statistic /n(H(Y,) - H(a)). Let D


denote the matrix of its first order partial derivatives at a. Set

zh = nD(Disp Y,,)0’. If zh is positive definite, define the

normalized statistic Ton = Jn SeUCaie | - H(a)). Examples of

such statistics include normalized sample correlation coefficient

and normalized sample coefficient of variation. The bootstrapped

statistic corresponding to Ton can be defined in the same way as

before. Given the bootstrap sample (E55: l<i<k, 1<j<2}, write

Unity Deee BlE i plealareae nei=]


k * * A
Uir Spy = Exy and 2,
= D (Disp U;)D’.
Then the bootstrapped version of Ton is given by ie = Jk

3/2 (Hye )-H(a,,)) The following result of Lahiri (1990) asserts

the second order correctness of Kiinsch’s bootstrap for Ton:

Theorem 3.1. Suppose that conditions (C.1) - (C.5) hold with X,"s

replaced by Y,’s. If the block size 2 satisfies the requirements

of Theorem 2.2 and D is of full row rank, then

sup [P(To_ € 8) - Pe(Tpq € B)| = o(n"l/2) ais. (P)


where B is a class of Borel subsets of R" satisfying (2.1).

Remark 3.1. In some applications, the transformation Y; = F(X; ) may


Moving Block Bootstrap 193

force the limiting dispersion matrix of /n 7 to have rank r < m.

If r > 1, one may still use a suitable linear transformation on the

Y;’s and guarantee conditions (C.1) for the transformed variables.

Theorem 3.1 covers most of the commonly used normalized

statistics based on sample means. However, unlike the iid case, one

cannot directiy apply this result to show the second order

correctness of the bootstrap procedure for studentized sample mean.

Here, the main problem arise from the form of the asymptotic

covariance matrix M of /n Xn Since any sequence of estimators of

M must cover an increasing number of lagged covarianes, the standard

technique of transforming Xs by a fixed q-dimensional function f

does not apply. In the special case of m-dependent observations,

however, it is possible to use a similar technique with suitable

manipulations to prove the second order correctness of the

bootstrap. Accordingly, for the rest of this section, we shall

assume that XyoXoreee are m-dependent random variables with values

in RI, Then by the CLT for m-dependent random r.v.’s, the

asymptotic variance 72 of /n(X,,-#) is given by

re = EX? + 25M) EX)Xy4; - (2mel)pe.


Hence, a natural estimator of M is
n-i =e
2
sez on
-l on ,2
DRLyXG + 2n-%
-1
Dy Day XqXjeq - (2MAL) Kye (3-1)
We shall use Sn to studentize the sample mean Xn and write T3y =

{(X,-H#)/S,> N21. Since s, + 7 a.s. (P), T3, is well defined for

large values of n. To define the corresponding bootstrapped

statistic, first note that Sy and hence T3y are not simple averages
194 Lahiri

of the observation XpoeeeoXpe Rather they are defined in terms of

averages of the products XiXjaie 0 < i < m, which involve (m+1)

observations (Xj,--- Kiam ) at a time. Following Kiinsch’s original

formulation, one may define the bootstrap versions of Sh and T3p as

follows. First redefine the blocks of size 2 with Z; =

(X joes oXeam)s 1s< i<n-m-=N (say) as observations and select

kp = [N/2] iid blocks ee (Cayree Sag)» lnssadi< ky from the


oe * * * 2

‘observed’ blocks {f} = (Z4,15---.Ztyg)? Os t N-2}. Note that

each iv is a (m+l) dimensional vector. Let Figrdenote the rth


: -glys * *
component igr’
of fi 1 <r <m+l. Define U.,=% 3 Caj1 Sigr-1
-ly * F d
2<rsm2, Uy = 0) 3h Sap YU5 = (Uyq>-+-Une2 a)» 1s is
* F *
ky. Let a, = Ex(U)), a, = Ex(Usi)9 1 < ig m2. Write X, =

ky Wart AUK, and u, = ExX, = @),- In accordance with the

original statistic, one may define the bootstrap versions as

-1 k, 2
sy = Dreg@plky) LyaqUng) - x,
(2me1)
and Ton = (ee) an)/s,
where a, = 1 or 2 according as r=0 or r > 1. Unfortunately, the

statistic ie also has a demerit similar to 1 In this case the

problem is not with the bias, rather with the normalizing variable

itself. For second order correctness, one has to use an additional


multiplicative factor which depends on certain bootstrap moments.

Let i = k, 2Var((X_) and re = Yy-08reen - (2m+1)a%. Then, the


corrected bootstrapped statistic corresponding to Le is given by

* A * *
and Lee ‘k,e Ty(Xn - by)/TySp- (3.2)
Moving Block Bootstrap 195

Note that r2/k,e is the actual variance of the bootstrap sample mean
of :
Xne On the other hand, writing fo(y) = Yr=08rY y42 - (2m+1)y?, y =

(Yy>-++9Ymio) € RM2 and U. = ky Us AUK it follows that ore

= fy (Un) and 7? = fo(ExU,)- Under the assumption of Theorem 3.2

below, one can show that T. ate o(1) a.s.(P). However, the rate

of convergence in this case is even worse than lB,|. Since re and

- determine the approximate variance of ie through 1 and T.,

respectively, one must remove the effect of the difference

l(tp/7q)-1 for attaining faster convergence rates. The next result

shows that the modified version of the bootstrap is indeed second

order correct.

Theorem 3.2. Let MXy 5-6 sXnay) denote a linearly independent

subset (in L2(P)) of {X1,X4sXpXpo---


oX4Xme]: For j 21, define Y,
= A(X5,-- +X j+m)- Assume that conditions (C.1), (C.2) and (C.4)

hold with Xj’s replaced by Y;’s and D; = o(Y;). If there exists

constants C,>0, C,>0, 0<a<B<1/4 such that Cn%e<C nF, then, for the

bootstrapped statistic Tas defined by (3.2),

Sup |P.(T3,€B) - P(T3,¢8)| = o(n- 1/2) as. (P)

Here B is a class of Borel subsets of R satisfying (2.1).

Remark 3.2: For studentized sample mean, it is possible to define

the corresponding bootstrapped statistic in more than one ways,

depending on the choice of Sy: See Géetz and Kiinsch (1990) for a

different formulation in the strong mixing case.

Section 4 Nonstationary Data: In many applications, the stationari-


196 Lahiri
ss

ty assumption on the observations {X1»Xoo---) does not hold. Time

series data observed over a long period often tend to show lack of

uniformity in their distributions. Consequently, it is important

to investigate the validity of the bootstrap procedure for

nonstationary data. It is well known that Efron’s (1979) iid

bootstrap works reasonably well even when the data are independent

but not identically distributed (see Freedman (1981), Liu (1988),

Lahiri (1989), Liu and Singh (1989)). In fact, the works of Liu

(1988) and Lahiri (1989) establish the second order correctness of

the bootstrap procedure under some’mild conditions on the degree of

heterogeneity of the data. In the dependent set up, Kiinsch’s

bootstrap also enjoys a similar ‘robustness’ property. It turns out

that one can obtain ‘good’ approximations even when the observations

are nonstationary, but satisfy some uniformity conditions.

Here only the case of sample mean is considered. In view of the

nonstationarity of X;"s, some of the conditions of Section 2 need

to be restated. However, the basic set up remains the same. As in

Section 2, assume that X's are defined on the common probability

space (2,A,P), but are not necessarily stationary. Replace

condition (C.1) by condition (C.1)’ on the moments of X,’s:

(C.1)’ Sup E|x* < o; M = lim M, exists and is nonsingular.


jz n>
In addition to this, some more conditions are needed on the joint

behaviour of the moments of X,"s. For j>l, write Ms EX; and Bs


_ aelej
= j Yya14;- Let Mm, =a EU, =- f p-lye
Yj=1"t+j-1 and Mit :
Disp(/2U , )

where U, = DEED I<tsb. Denote the smallest eigen-value


Moving Block Bootstrap
SSSSSS
197
SSD

of Mat by Ant and let An = Min{A,,:1<tsb). The next two conditions

restrict the heterogeneity of m,’s and Mat

(C.6) — max(02[|m,-7 ||:1<t<b) = 0(1)


(C.7) Tim inf An > 0.
n>o

Condition (C.7) is equivalent to one of the conditions used by Liu

(1988) in the independent case with block size 2=1. In view of

(C.1)’, (C.7) holds if Max(|M,, - M,l:1stsb) = 0(1). Condition


(C.6) is somewhat stronger than necessary. It can be replaced with

a weaker condition which requires potye se?iim,-z,1° = 0(1) and the


uniform boundedness of [Ztea(k, j) (tA) | over the index sets A(k,j)

= {i: l<i<b, j+k--l<i<j}, O<k<2-1, l<j<n for sufficiently large n.

The following result proves the second order correctness of Kiinsch’s

bootstrap for the normalized sample mean T, = Jim, 1/? (Xq-Ha) of


nonstationary data.

Theorem 4.1 Assume that conditions (C.1)’, (C.2)-(C.7) hold and

there exist constants C) 520.5 7) >0, 0<a< B < 1/4 such that Cyn®

<< con’. Then


Sup|P(T,€B) - Px(T;,€B)| = o(n”!/2) a.s. (P)
BeB
where T,, ee and B are as in the statement of Theorem 2.2.

Remark 4.1. Under some additional assumptions, one can prove

similar results for the statistics of the form /n(H(X,) - H(#,)) of

Section 3.

Remark 4.2. Results of this section can be readily applied to the

least square estimators of multiple linear regression parameters.

5. Proofs: For proving the theorems of Section 2, we need the


198 Lahiri
———

following lemmas. All throughout this section, C, C(.) will denote

generic constants depending only on their arguments, if any. Also,

for a,be R and any set A, we write a A b = min{a,b}, a V b =

max{a,b} and |A| = size of A. The following result from Lahiri


(1990) will be frequently used in the proofs.

Lemma 5.1: Let {W),Wo,--->W,} be a collection of random variables

on a probability space (S,5,Q) such that for i=1,...,n, EW;=0 and

|W; |<B a.s. (Q) for some B>O. Then, for any positive integer r and

l<m<C(r)
en,

EDM Wade < C(r)Be


[nme +n2"a(m)]
where a(m)=sup{|P(AyAp)-P(Ay)P(Ap) | :Ay€F] Aner a, j=itm+], l<i<j<n}

and Ft = the o-field generated by W;, ssist.

Proof of Lemma 5.1. Since some of the arguments in the proof of

Lemma 5.1 were omitted in Lahiri (1990), we give a complete proof

here. Fix r>l and let k=2r. Then

n k k j t
(Dyar44g)” = Ljuy LyC(ay---@5) Yo MEL] ae (5.1)

where for each fixed je{1,2,...,k}, 1 extends over all j-jtuples

of positive integers (a),--.,05) such that Q)+...+a5=k and Ye

extends over all ordered j-tuples (i]---75) of integers such that

I<ij}<...<ijsn.

By Lemma 6.1 of Bhattacharya and Rao (1986) (hereafter referred

to as [BR]]), it follows that

r Ly C(ay,---.05) Lp E Up}
[Zja1 siesta: Teyk
fe < C(r)n"EWS (5.2)

For proving Lemma 3.1, it is now enough to show that for each fixed
Moving Block Bootstrap 199
SSS
SSS

j>r and fixed j-tuple (@)...a5) under 2)>

rice
JE >raoaa
De_, kak rokpk
#3 | Wsi, | < C(r)[n*B*a(m) + n'mSB*]

Given j and (@),.--,a5) as above, define s=j-r, A={t:a,=1} and Bo

= size of A. Clearly, l<s<r. Next note that k=y}_) a, 2 Bo +

2(5-Bo)- Hence, it follows that 2s<Bosk. Given an integer m,

l<m<n/3r, define the set By={(iq>-- 45) 21siy<..-<ijzsn, lig_y-ig]>m

and litay - ix > m for some teA}. Next, rearrange the sum Yo as

Y3+Lq4 where Y3 extends over all indices (i)---1;)€Bn-

By Lemma 27.2 of Billingsley (1986), it follows that

A Qa
IZ3 E OY, we 1 < c(r)nkBKo(m).
t
Hence, it remains to find an upper bound on the sum Yq: Since Bp

> 2s, it is enough to find a bound on the size of G=((i,--.15):

1<ij<...<ijsn, |i tte, - ig |sm for all tel}, for some [c{l,...,j},

|T|=2s and Epo---r€go€{-1,1}.

To that end note that the indices t, tte,, tel’ are not necessar-

ily distinct. Let 1<0)<...<O)SJ be the increasing arrangement of

the distinct indices in {t, tte, :tel}. Then, 2s<v<(4sAj). Let q

be the number of unbroken integer sequences in the arrangement

Tyr-- +90, and let By>--+sBq respectively denote their lengths. It

is easy to see that l<q<2s, 2<B.<2s and By+..-+Bo=v. Also, using

a straightforward argument, one can show that 16|<n%m’~9nJ-Y, To

obtain a bound independent of q, yv and j, first note that j+q-v <

r if and only if stq< vy. If q<-s, then the bounds on vy imply that

stq < 2s < v. On the other hand, if q > s, then st+q < 2q <
200 Lahiri

Bi+...+By=v. Hence, it follows that |@| < mk nQt5-¥ ¢ n’mk. This


proves Lemma 5.1.

For Lemmas 5.2 and 5.3, we need to introduce a special trunca-

tion function used by [GH]. Let g € C°[0,) satisfy g(x) = x if x

< 1, g(x) = 2 for x > 2 and g is increasing. Then, define the

function T(n;@,+) by T(n;@,x) = nox g(n-P |x|) / xi, for xe RP, 6>0,

nz. For v = (vy,...,¥,)/ € (z*)P and t = (ty,.-..t))’ € RP, write


Vv v
|v| = Vjt.. 4p, y! = vyjt...vp!, t” = ty 1veaty P and ||t|| for the
Euclidean norm of t. We shall prove Lemmas 5.2 and 5.3 under the

set up of Section 4. The corresponding assertions under the

stationarity of X;’s will then follow as special cases.

Lemma 5.2. Suppose that 2+ and e=0(n(1-8)/4) for some 6 > 0. Then

a.s.(P)

(i) tla HI) = 0(1)


(ii) fat -m, I)= 0(1)
(iii) 07, (U;-H,)” - ne E(X,-H,)” = 0(1) for all |y|=3
) —3 PA
(iv) ¥n(X,-,)] = 0(1).
where L,, =n! iets and Ms = EX..

Proof of Lemma 5.2(i). Let Uy == 2 Onl er


owe :
Xeaiep t=1,...,b. Then

yi,
ny = bo} SP
t=] (Um)
Up-m +E t=1
2) (M,-H#,)/d-
(m-,)/b 5
(5.3)
Using the condition sup(E|X,|}*:321) < » and the definition of ms»
one can show that there exists a constant C > 0 such that

Joo! yes m Ball < cab“).


Moving Block Bootstrap 201
SSS
SSS

For the other term in (5.3), without loss of generality (w.l.g.),


we may assume that Kj = 0 for all j > 1. Then,

bo? Dean(Up— my) = (de) TTD eX; + Spd 541]


where 1 extends over j=2,...,b and ”) extends over j=1,2,...,@-1.

By Condition (C.1) and Borel-Contelli Lemma, it follows that |x,|


= o(n(1+8)/4) a.s.(P). One can use the above fact to show that

a.s.(P)
273
I(oe -lo Dod; (X5+XKy_54g) =o(n F-2)/40)
and (5.4)
[Jb-]°E, af
&j-T0X,))] e=O(n")
-]

where T(+-) = T(n;(1+5)/4,-).

Next, note that for all x,ye RP, there exists a constant Cy

independent of n, such that

IIT(x) - T(y)] sc. x-yl.


Hence, by Bore!-Cantelli Lemma and condition (C.2) (with m=a,), it

follows that
-dm/2)
Jo"= 15,(T0K)-TY;,I = (2-8/2 as. (P)
and (5.5)

oS= E(TOX:)-T0Y4,
qhIIL= (2).
-d

Finally use Lemma 3.1 and condition (C.2) to conclude that for any

ae[0,1/4)

Jo ETVs) - ET(Y,gill= 0(b"4) avs.


This proves Lemma 3.2(i).

Proof of (ii) By part (i), it is enough to show that the matrix

(££, (U;-H,) (U}-#,)/-M,) goes to the zero matrix almost surely (P).

To that end, note that


202 Lahiri
ea a sre nL ove vt sieeo Ss a

tE, (U}-Hi,)
(UpFig)’
= eb TSP sC(Ug—m,) (Uy-m,)/ + (mH) (Uy-my)’
+ (Ug-m,) (my)! + (mB) (mH) T-
Clearly, by assumption (C.6), the last term above goes to zero.

Hence, it is enough to show that a.s.(P),

(a) eo“4pP_, (U,-m,) (U,-m)” - Mall > 0


(b) Jeb" 52_y (m,-H,) (U,-m,)/l] +0 and
(c) Jjeb-'yP_, (U,-m,)(m,-2,) ‘||+ 0.
We shall give a proof of part (a) only. Parts (b) and (c) can be

proved using similar arguments. As in the proof of lemma 5.2(i),

w.l.g. assume that H 5-0 forzall sla Rix I<ip; JysP- With m=a,, and

T as in the proof of Lemma 5.2(i), denote the iv and the Jo

components of TVs mn) - ET(Ys mn) by Z, and Ve respectively. Then,

by (5.5), (i,,5,)th element of 2M,

(0b)! Fey (ier Zein)


her Vtei-n) + (1) as.
i] (eb) 2 (p22)oe eer atkeal (asjek tZyan¥g)) + OC) ass.
where a(k,j) = (jAb) - (j-2+k+1)V1, for k#0

= Min{j, n-j+l, 2}/2 for k=0.

The term k=0 can be shown to converge to its expectation a.s. (P)

as in the proof of Lemma 5.2(i). Hence, it is enough to put a bound

on

A,-E de 1 iaae atk Syn jak - EZ V4)"


Note that Lemma 3.1 is not directly applicable in this case.

However, one can use a similar argument and write


Moving Block Bootstrap 203

Ay = EXjuydy Cle --09) Ep WY(al, k;PNG ote, TEV ak 2s“


where 1 extends over all j tuples (a},-..a,) with a, > 0 and @)
+...4+ a, = 4 and Yo extends over j distinct pairs of indices

(5; >k;), l<i<j.

For j < 2, one uses the trivial bound n@e2 on the number of

summands under Yo: The other values of j require more careful

estimates. First we consider the case when j=4. Note that in this

case it is enough to find a bound on

a4’Yo{E Tj4 4 (V dak, ScEV ij+k, Ar )} = Aan (say),

where e, extends over all distinct pairs (j;>k;) under Yo with

AESPSSESSEe
Write ie = Y3th, where 23 runs over all distinct pairs of

indices (J;5k;), i=l,...,4 such that

Max{J, - (3, +k;), Jy - t,) > 3m+l

where to = max{j;+k,: j,+k, < Iq: i=1,2,3}. Then, by weak depen-

dence, each term under Y3 is at most a(m)n2 and the number of terms

under Da is less than n@(2+m)°. Next expand the product of the

differences to write

PEI; ix] (Vv Z. - EV e zl


Jytky dy Sytky Jj
4
(q E Uyay Vik 25.) * Rn
(Ze + Tg)€( Lfey U peices en
where Ls extends over all indices (j;>k;), i=1,..,4 such that

J $5955355, and at least one of (Jpo--eodgs Ky>+--2kq) differ from


204 Lahiri

both the neighbouring indices by 3m+l or more. Then, by weak depen-

dence, the sum Ys is negligible. To estimate Ye? consider the

number of subsequences of (J) >--Jg, Kyo++-kg} where the adjacent

indices differ by at most 3m. Then, under Ye? there can be at most

four such subsequences, each having length two or more. Consequent-

ly, the number of summands under X6 is less than C.n?eom*. By simi-

lar arguments, one can show that R, < cn222m*. Hence, the bound on

Ag, is 0(n226m*).

The case j=3 also admits the same bound. Hence, it follows that

(ig»d,)th element of 0h. converges to (eb) yf")0 Dy TEZ,;Vj+k 2

(P) for all 1<i»JgSP- The proof of Lemma 5.2(ii) can now be

completed by expressing the elements of Mh as a Similar sum and then

using the weak dependence condition.

Proof of (iii) Similar to part (ii). In this case, the bound on

E(L.H.S.)4 is of the order 0(n~204m®) which necessarily forces 2 =

O(n*) with a < 1/4.


Proof of (iv): Follows from the Borel–Cantelli Lemma and the moment conditions.

For the next lemma, define w,(t) = sy exp(i/Zt’U}), v(t) =

bo Typ exp(-t’M, 5/2), and u,(t) = bo y3_Eexp(iyZt’U,), te RP.


Lemma 5.3. Let @ = 0(n*) for some a < 1/2. Then for any real

number \>0, sup{|w, (t)-v, (t) |:[]t<4) = 0(1) a.s.(P).

Proof of Lemma 5.3. Fix 5>0, d»>0. With T(-) = T(n;(l-a)/2,-),


define U,=e! ee TX i)» t =],...,b. Then, uniformly in t e
RP,

|w,(t) - bo? ybexp(ityz U,)| < 2b) a1! (Uj#0;) (5.6)


= 0(n-4), a.s. (P).
Also, by condition (C.2), it follows that

L = Sup(E|T(Y, ,)l]:md™!, n21} < @.


Let Us y= ee, TVs 45am) and Wyq(t) = E exp(it/Z Us).
Using arguments similar to those in the proof of Lemma 5.2(i) and

taking m=0(f+a_), one can show that uniformly over all t in RP,

bo! ee Jexp(it/E Us 4) 1] < ftVE(L+0(1)) a.s. (P) and


bot yey WWyg(t)-2L s Ute.
Now using Lemma 5.1 above and a discretizing argument similar to the

one given in the proof of Lemma 4.2 of Babu and Singh (1984) (here-

after referred to as [BS]), one can show that |

Sup(|b™? YBy(exp(it/E Us 9) - Wy(td)


Ieltll < 2) = 0(1) a.s. (P).
This, together with equations (5.5) and (5.6), yields

sup(|w,(t) - un(t){: [It] < @) = o(1) a.s. (P).


Next, recall that Mad = Cov(/2U.). Using the arguments in the proof

of Theorem 2.8 of [GH], one can show that uniformly in l<j<b,

Sup{|E exp(i/@t’U;) - exp(-t’M, ;t/2)|:[Itl| < 0) = o(1).


Hence, Lemma 5.3 follows.

Lemma 5.4. Under the hypotheses of Theorem 4.1, there exists a

constant C > 0 such that EA, (U;-a,) 11"Oe <vort ass. (Ph).
Proof of Lemma 5.4: By Lemma 5.2(ii) and condition (C.1)’, @-1/2A,
converges almost surely (P) to M “1/2 and hence, is bounded in norm.
A
Therefore it is enough to estimate Eg /Z(U-1,)II*. By Lemma 5.2(i)

and condition (C.6), there exists a constant C > 0 such that



E,[/0(U;-2,)I* < c + ce2b-1yP_|ju,-m|*.


W.1.g., assume that y 5-0 for all j>l. Next define the truncated

variables x, = T(j;(1-28)/2, Xj), Yj = T(j;(1-26)/2, V5 5m) and ‘5

= T(G30,Y5 a)s for j>l, where m=a, 6=(1/12) + (1-4B)/48 and 0<B<1/4

(as in the hypothesis of Theorem 4.1). Then, using conditions

(C.1), (C.2) and the arguments in (5.5), it can be shown that

(a) ley; < c.g1? for all jel,


(b) Y
X =X; and Y j,m7~3,m7" j , eventually, a.s. (P)

(c) #@b-FpP_fu,-mll* < ce@b-AyP_ lv, + o(1) a.s. (P)


where V, = satis aceiaeel l<t<b. Next, a series of long
and tedious arguments as in the proof of Lemma 5.2(ii) show that

0b FP (Vet
-ellv, 4) = 0(1) a.s. (P).
Let Z denote a p-dimensional standard normal random vector. Then,

using the arguments in the proof of Theorem 2.8 of [GH], one can

show that

boty? (27el|v, 4-€| 41/224) = 0(1).


Since a | = Im, °/? 26 sup¢ (EX 4) 1/4: jzl) for all i<jcn,
the lemma is proved.

For the next lemma, we need to introduce some more notation.

For ve(Z*), |v] < 3, let Sins E,(A,(U}-it,))”. Let p, denote the
finite signed measure corresponding to the Fourier transform

: X
~

V(t) = (1+ Diypeg Ge" +Cit)”) exp(-llt 272), te RP.


Lemma 5.5. Assume that the hypotheses of Theorem 4.1 hold. Let t>0

be a real number. Then, there exist a set rT; C2 with P(T,)=1 such

that for every sample point wel’, » there exists a constant C = C(t,w)

> 0 satisfying :

[E.F(T},) - Sfdv,1 < CIMg(F)-note+ w(Fynét54)]


for every n>] and any Borel measurable function f: RP + R with

My(f) < @.

Proof of Lemma 5.5. The main steps in the proof of Lemma 5.5 are

similar to those in the proof of Theorem 20.1 of [BR]. Therefore,

we only mention the differences. For l<j<k=[n/2], define:


9
Z
==A AM(Uj-Hy)s Zo 5= 245 1([]2, jl <k) andZ; = 7
oe, b-
E.Z
‘ye Ba DES
Then, by Lemma 5.4,

Enz; 4i" SO <wt" fais <(P/. (5.7)


Using (5.7), Lemma 5.2 above and Lemma 14.1 of [BR] one can show

that, almost surely (P),

JExZ>yh= o(k9/4),
=3/2 e,242,' ' - 1] = 0(k?)
-] and

(5.8)

[E23v =iEZ,Vv 1! 2= 0(k -1/2 °/“) for any v €(Z")*


+\p .-
with =
|v|=3 a.s. (P).
Now fix a sample point for which (5.7), (5.8) and the conclusions

of Lemma 5.2 and Lemma 5.3 (for a suitable depending only on @ and

t) hold. Clearly, it is enough to prove the assertion of the lemma

for this sample point. Fix a function f with M4 (f)<o. Using the

arguments in the proof of Theorem 20.1 of [BR], one can show that

[E.f(T,) - Sfdv,| < C.M,(F)K™ + ff, dH,,|


where FO) = f(x + Jk Eyl, 4)» H, is the signed measure with

Fourier transform
-t’D t
A(t) = (Egexp(itZ,))* - pet KS/*p (its (x ae
2- ° A n
>

2D. (E424 24) and xp = vth cumulant of Z, under E,, ve(Z*)P and

Pales*) are polynomials (see page 51, [BR] for definitions of P).

As noted in [BS], the proof of Theorem 20.1 of [BR] has to be modified for estimating the last term above. Instead of Lemma 11.2, one has to use Lemma 24.1 of [BR]. The rest of the proof can be completed following the arguments in the proof of Theorem 20.1 of [BR] and noting that, by condition (C.7), sup{|v_n(t)|: δ ≤ ||t||} < 1 for every δ > 0.

Proof of Theorem 2.1: Note that P,.(T,€B) = Esf tty) where F(x) =

1,(x-B,), xe RP and 1p denotes the indicator function of the set B.

Hence w(f,,n3%) = #(B_+(aB)") < cB, | + $((0B7)) for any n>0.


Consequently, by Lemma 5.2 and the definition of Ya?

P,(Tne8) = Sof.dv, + 0((18,I)


= $(B+B) + 0(([B,,lj+n= 1/2) a.s.(P).
Also, by Theorem 2.8 of [GH],

P(TeB) = #(B) + 0(n-}/2),


Hence, it follows that a.s. (P)

P,(T,€B) - P(T,€8) = #(B+B,) - 4(B) + 0(|Bl] + n-2/2).


This proves part (a) of Theorem 2.1.

Part (b) follows by a similar argument noting that there exists

a constant C, > 0 such that Sup(wW(1(_,, xy (+-¥) 38) :xe RP) = Con for
all n>0, and ye RP.
As for part (c), note that by condition (C.1) (w.1.g., assuming
H5=H=0 for all j>l),

Kacy = Amb)"C-1)99 1X 5405 2180 (Xj4Xq_ gy)



= op(n°/4) + (0b) AS52] (€-3) (X 4Kn-j+1)°


where ay = (2-j)(2b)~ eh l<j<é-1.

Write Sil = TyeEL - tiarn-j+1" Then by stationarity,

Aarne a nnerfor i=1,2 and S_, has the same distribu-


tion as S13 “afStites
-j" By Lemma 27.2 of Billingsley (1986), con-

ditions nee (C.4) and arguments similar to those used in the

derivation of (5.5), one has

JE exp(it’(S., + S.4)) - mn-jE(exp(it’s, I = 0(1) (5.9)


for every t ∈ R^p. A "big block – little block" argument, together with conditions (C.2), (C.4), the Cramér–Wold device and the classical Lindeberg–Feller Theorem, implies that the normalized sum S_{ni}

converges in distribution to N (0,M) for i=1,3. Also, by equation

(5.9), (eBay ies | and Rice are asymptotically indepen-

dent. Hence it follows that (era) eyest15(X;.+X n-j+l ) converges

in distribution to N_p(0, 2M). Therefore, the assertion of part (c) follows from Lemma 5.2(ii) and condition (C.1).

Proof of Theorem 2.2: Note that the signed measure Yn in Lemma 5.5

has density

(1? Ys CY) yn y(D)8(X)


estin(Uj-H,) 1”.
where q,(x)¢(x)= (-D)"4(x), and X, , = ra
By Lenma 5.2, %,,= €1/2nE(A,(X,-u))”+ o(1) a.s. (P). Hence,
Theorem 2.2 follows from Theorem 2.8 of [GH], Lemma 5.5 above and

equation (2.1).

Proof of Theorem 3.1: Following the proofs of Theorem 1 of

Bhattacharya (1985) or Theorem 2.2 of Bhattacharya and Ghosh (1978),



one can show that the distribution of /n(H(Y) - H(a)) has the

following expansion:

sup PCIE (H(Yq)-H(a) )eB)-fg( 140"/2 a(x: (x, .)))8(x)ee]=0(n- 2)


where q is a polynomial whose coefficients are continuous functions of the partial derivatives of H at α and of the cumulants {χ_{ν,n}: |ν| ≤ 3} of Ȳ_n.


Also, with an observation as above, one can show that the boot-

strapped statistic admits a similar expansion:

sup |Px(/k 3/2 (H(¥h) - H(@,))€B) - sas n/2 (x5 {Xy, ))6(x)dx|
= o(n71/2) a.s. (P)

where i Ad vth cumulant of ¥ Theorem 3.1 follows from Lemma 5.2

(stated in terms of Y,;’s).

Proof of Theorem 3.2 W.1.g. assume that A(Xpo+++ Xn)

(agatyiig -X)Xy4m)* Define a function H: R™® + R by H(y)


(y;-#)/Lfg(y)I/2, y=(¥15++++Ym4o) € pre with fy as in Section 3.

Then, H has partial derivatives of all orders on the open set Q

(ye RM2, yeaa > (2m+1)y%) containing a@ = (p, EX, tens]


EX Xi am )’ and the vector D of first order partial derivatives at o

is given by D = [7,0...0]. Let Tan = /nH(Y,, ) where ie oon Ys


ss
and Y; = (X;, Xy>-
2 : x; Xi am)? idl.

Then, using condition (C.2) and the m-dependence of X;'s, one

can show that

PUT go> Taal oie’ seer te


for some constant C > 0. Hence, by condition (2.1) and Theorem 3.1

(together with the m-dependence of X;’s), it follows that uniformly

over BeB, 5

P(T, eB) = P(T, €B) + 0(n-7/12


st os (5.10)
= fy(ten/2p(x, (u,}))6(x)dx + o(n/?)
where P(x, (H,,}) is a polynomial and its co-efficients are continuous

functions of by =EYS for |v|<3, ye(Zt) m2 | Since


Sup{ |p (x;{4,})4(x)|:xER)< @ and n-1/2 -(n-m) "1/2 = o(n-3/2), with
N=n-m, relations (5.10) yields

Sup |P(T3,€B) - P(TyyeB)| = o(n)/2), (5.11)


BeB

Next, recall the definition of the block averages TE I<isk, =

[N/2]. For notational similarity, set Yu" U, = ky (Uy +...+ Uy):

Then it is obvious that the conclusions of Lemma 5.2 hold with X's

replaced by Y,’s and U;’s replaced by Uscs. Next, define the func-

tion HA! Rn2 + R by

HA(y) = (¥y-q)/Lf(y)
11/2
for y=(Y)> roc »Ym42) ’E Rae Then, for every n> Mm, H, is infinitely

differentiable on Q. Since a, 7% as. (P), it follows that a.s.

(P), for sufficiently large n, a <Q and H, has partial derivatives

of all orders ata. Let D, denote the vector of first order par-

tial derivatives of H, at @,. Then, trivially,


n = D, Disp(/k,é y*LF yp
zh )D,, =mit.
7
and 2, = 0 Disp(/n 10; = re/z?, where re = Var(/n X,)-

Note that Tyy=/N(H(¥y)-H(a)) and Tae 13,77? (H, (TN) Hy (Ex TN)

Using arguments similar to those needed in the proof of Theorem 3.1,

one can show that a.s. (P)

SupaA [PCa -1/2 eB) -


|P(z7!/21, fue
P,(Tx,eB)| ea?
= o(n7!/2), (5.12)
Tay 3N
Next observe that the m-dependence of X;'s implies re_g? = o(n-!).

Hence, by condition (2.1) and relation (5.10), one gets

Sup [P(T4yeB) - p(3, 1/27 eB) | = o(n7 1/2),


Theorem 3.2 now follows from (5.11) and (5.12).

Proof of Theorem 4.1: Follows from Lemmas 5.2-5.5.

References

Athreya, K.B. and Fuh, C.D. (1989): Bootstrapping Markov Chains:

Countable Case. Preprint, Dept. of Stat., Iowa State University.

Babu, G.J. (1986). Bootstrapping Statistics with Linear Combina-

tions of Chi-Squares as Weak Limit. Sankhya Ser A., 56, 85-93.

Babu, G.J. and Singh, K. (1984). One-term Edgeworth Correction by

Efron’s Bootstrap. Sankhya Ser. A, 46, 219-232.

Basawa, I.V., Green, T.A., McCormick, W.P. and Taylor, R.L. (1990)

Asymptotic bootstrap validity for finite Markov. chains.

Preprint. Dept. of Statistics, University of Georgia.

Bhattacharya, R.N. (1985). Some Recent results on Cramer-Edgeworth

Expansions with Applications. Multivariate Analysis - VI (Ed.

P.R. Krishnaiah), Elsevier Science Publishers B.V., 57-75.

Bhattacharya, R.N. and Ghosh, J.K. (1978). On the Validity of the

Formal Edgeworth Expansion. Ann. Statist., 6, 436-451.



Bhattacharya, R.N. and Ranga Rao, R. (1986). Normal Approximation

and Asymptotic Expansions. R.E. Krieger Publishing Company,

Malabar, Florida.

Billingsley, P. (1986). Probability and Measure, Wiley, New York.

Datta, S. and McCormick, W.P. (1990). Asymptotic accuracy of

bootstrap for finite Markov chains. Technical Report, Dept. of

Statistics, Univ. of Georgia.

Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife.

Ann. Statist. 7, 1-26.

Efron, B. and Tibshirani, R. (1986). Bootstrap Methods for Standard

Errors, Confidence Intervals and other Measures of Statistical

Accuracy (with discussion). Statist. Sci. 1, 56-77.

Freedman, D.A. (1984). On Bootstrapping Two Stage Least Squares

Estimates in Stationary Linear Models. Ann. Statist. 12, 827-

842.
Götze, F. and Hipp, C. (1983). Asymptotic Expansions for Sums of

Weakly Dependent Random Vectors. Z. Wahrscheinlichkeitstheorie

verw. Gebiete 64, 211-240.


Götze, F. and Künsch, H.R. (1990). Blockwise bootstrap for depen-

dent observations: Higher order approximations for studentized

statistics. (Abstract). Institute of Mathematical Statistics

Bulletin 19, 443.

Kulperger, R.J. and Prakasa Rao, B.L.S. (1987). Bootstrapping a

finite state Markov chain. To appear in Sankhya.

Künsch, H.R. (1989). The Jackknife and the Bootstrap for General

Stationary Observations. Ann. Statist. 17, 1217-1241.



Lahiri, S.N. (1989). Bootstrap Approximation to the Distributions

of M-estimators. Ph.D. dissertation, Michigan State University.

Lahiri S.N. (1990). Second Order Optimality of Stationary Boot-

strap. To appear in Statist. Prob. Letters.

Liu, R.Y. (1988). Bootstrap Procedures Under Some Non-iid Models.

Ann. Statist. 16, 1696-1708.

Liu, R.Y. and Singh, K. (1989). Using iid Bootstrap Inference for

some Non-iid Models. Preprint. Department of Statistics,

Rutgers University.

Liu, R.Y. and Singh, K. (1988). Moving Blocks Jackknife and

Bootstrap Capture Weak Dependence. Preprint. Dept. of

Statistics, Rutgers University.

Moore, M. and Rais, N. (1990). A bootstrap procedure for finite

state stationary uniformly mixing discrete time processes.

Preprint. Ecole Polytechnique de Montreal and Universite de

Montreal.

Shi, X. and Shao, J. (1988). Resampling estimation when observa-

tions are m-dependent. Comm. Statist. A, 17 (11), 3923-3934.

Singh, K. (1981). On the Asymptotic Accuracy of Efron’s Bootstrap.

Ann. Statist. 9, 1187-1195.


Bootstrapping Signs

Raoul LePage
Michigan State University

Abstract. One of the known limitations of the basic bootstrap is its


sensitivity to heavy tails in the error distribution. Athreya (1987) studies
bootstrap’s failure to recover the sampling distribution of the sum of i.i.d.
non-normal stable errors. Gine’ and Zinn (1989) prove that the same
failure occurs essentially whenever i.i.d. errors are not subject to a central
limit theorem (CLT) having a normal limit. This poses a practical
difficulty, for how is one to know whether data meet such a requirement?
Interestingly, in the case of symmetric errors, and probably much more
generally, we can resolve this difficulty through a somewhat different
application of Efron’s (1979) bootstrap.
In the case of symmetric error distributions, without the assumption of
finite variance, we find that a simple bootstrapping of signs, instead of
bootstrapping the observations themselves, approximately recovers the
conditional distribution of (μ̂ − μ) given |ε|, where μ denotes the population
location parameter, μ̂ is the average of the sample, and |ε| is the vector of
the absolute values of the actual errors present in the data.
Even for a seemingly routine application, where the symmetric error
distribution is concentrated on an arbitrarily small interval around zero, the
central limit theorem and bootstrapping residuals can be sub-optimal.
Confidence intervals produced by bootstrapping signs can for some
distributions be far narrower for small to moderate sample sizes.

1. Introduction.
Bootstrap has trouble recovering the distribution of a sum of i.i.d. long
tailed errors €; based on observations p + €; (Athreya 1987, Gine’ and Zinn
1989). There is an idea which seems missing from previous attempts to

* Department of Statistics and Probability, Michigan State University, E. Lansing, MI


48824. Research partially supported by ONR grant N00014-91-J-1087.

Exploring the Limits of Bootstrap. Edited by Raoul LePage and Lynne Billard.
Copyright©1992 by John Wiley & Sons, Inc. ISBN: 0471-53631-8.

deal with this difficulty. I have not found it, for example, in Athreya (1987)
or Knight (1989). Nor does it appear in Wu (1986) among the many topics
relating to resampling methods in regression. This idea is that the
presumed goal of bootstrap, recovery of the unconditional limit law of a
statistic of interest, may be too modest. Perhaps we should seek to recover
the conditional limit law given some characteristic of the errors.
Bootstrap confidence intervals, in cases where bootstrap appears to fail,
are actually narrower than their unconditional counterparts for the same
unconditional confidence level. Kinateder (1990) established an invariance
principle linking bootstrap to the order statistics of the data, and presented
simulations which show that for long tailed errors bootstrap confidence
intervals based on the sample average tend to be smaller than those
obtained from the unconditional limit law. The following histogram, taken
from Figure 6.2 of Kinateder, is illustrative. It depicts the sampling
distribution of loge(A) where 2A is the bootstrap confidence interval width
(calibrated to have 0.95 unconditional confidence level) for samples of n =
50 from a symmetric distribution attracted to the Cauchy distribution. The
vertical line in the figure points to the log-width of a .95-confidence interval
based on the Cauchy limit law. The horizontal scale is about [-0.7, 7.0]
(logarithmic). It can be seen that the sampling distribution of A places
most of its mass far below the confidence interval width obtainable from the
unconditional limit law (in this case Cauchy).

[Histogram of log_e(Δ), horizontal axis roughly −0.7 to 7.3; see the description above.]


One is tempted merely to calibrate bootstrap
by estimating the
unconditional confidence level. This lacks simplicity but, more seriously,

would appear to mask an unusual conditioning for which there is no sensible
inferential interpretation.
Any effort to establish a truly conditional interpretation for bootstrap
meets with one fundamental difficulty. The order statistics of a sample are
approximately a Poisson point process (Lévy 1925; LePage, Woodroofe, and
Zinn, 1981). Bootstrap, through its multinomial sampling of the data,
approximately places i.i.d. mean-one Poisson counts on the order statistics
of sample data points (Athreya 1987, Knight 1989, Kinateder 1990). Except
in special circumstances, bootstrap resampling of a Poisson process is quite
unlike the original, which is why bootstrap will give incorrect results for long
tailed errors (see Proposition 3 below, comparing bootstrap to sign-
bootstrap).

2. Bootstrapping signs.
Our problem is to use i.i.d. observations Y_i = μ + ε_i, i ≤ n, to estimate
the sampling distribution of μ̂ − μ, where μ̂ = Ȳ. We consider the case of
errors whose signs are randomly assigned by coin flips.

Assumption 1. The distribution of the error vector ε is identical to that of
δε defined by δε = (δ₁ε₁, ..., δ_nε_n), where δ = (δ₁, ..., δ_n) are i.i.d.,
independent of ε, and take the values ±1 each with probability 0.5. □

It may be convenient to interpret expressions like δε as the product of a
diagonal matrix, whose diagonal is δ, by a column vector ε. What we
propose is very simple:

a) Find the conditional distribution [ δ(Y−Ȳ)‾ | Y ] of the average of the randomly signed residuals, δ(Y−Ȳ)‾ = n^{-1} Σ_j δ_j(Y_j − Ȳ), given Y.

b) Result (a) estimates the conditional distribution [ (μ̂ − μ) | |ε| ].

Regarding (a) it can be noted that the Y-conditional characteristic
function E[ exp( i r · δ(Y−Ȳ)‾ ) | Y ], involving only a linear expression in the
random signs, may be written as the product Π_{j=1}^n cos( r(Y_j − Ȳ)/n ). Fast
Fourier transforms can be effectively used to determine the conditional
distribution by inverting this characteristic function, addressing one of the
Six Questions Raised by the Bootstrap (B. Efron, in this volume).
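A minimal Monte Carlo sketch of step (a) for the location model is given below. It simulates the random signs directly rather than inverting the characteristic function by FFT as described above; the function name, the example data, and the number of resamples are illustrative assumptions, not from the text.

```python
import numpy as np

def sign_bootstrap_mean(y, n_boot=2000, rng=None):
    """Step (a) by Monte Carlo: attach random signs to the residuals
    Y_i - Ybar and return draws of their average."""
    rng = np.random.default_rng(rng)
    y = np.asarray(y, dtype=float)
    resid = y - y.mean()
    n = y.size
    signs = rng.choice([-1.0, 1.0], size=(n_boot, n))  # i.i.d. +/-1, prob. 0.5 each
    return (signs * resid).mean(axis=1)

# Step (b): treat the draws as the conditional law of (mu_hat - mu) given |eps|.
rng = np.random.default_rng(0)
y = 1.5 + rng.standard_cauchy(50)        # heavy-tailed symmetric errors, mu = 1.5
draws = sign_bootstrap_mean(y, rng=1)
lo, hi = np.quantile(draws, [0.025, 0.975])
print("0.95 interval for mu:", (y.mean() - hi, y.mean() - lo))
```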

3. Examining the conditional distributions.

Proposition 1. If errors ε_i satisfy Assumption 1 (symmetry) then

[ δ(Y−Ȳ)‾ | Y ] = [ δ(Y−Ȳ)‾ | ε ] = [ δ(ε−ε̄)‾ | ε ]  (3.1)

≈ [ δε‾ | ε ] = [ ε̄ | |ε| ] = [ (μ̂−μ) | |ε| ],  (3.2)

where ε̄ = n^{-1}Σ_i ε_i, ε²‾ = n^{-1}Σ_i ε_i², and the approximation ≈ is justified if ε̄² is small compared with ε²‾.

Proof. All relations are obvious, except perhaps the approximation ≈ in

(3.2), which is justified as n → ∞ provided δ̄ε̄ is small compared with δε‾,
conditionally on ε. Both δ̄ε̄ ( = δε‾ − δ(Y−Ȳ)‾ ) and δε‾ have conditional
mean zero, conditional on ε. Their conditional variances given ε are re-
spectively ε̄²/n and ε²‾/n. □

Often (see Lemma 1 below) a common scaling determined by the tail index α < 2 will be required
for distributional convergence of each of (nε̄)² and n ε²‾. In such a case ε̄²/ε²‾
is of order 1/n, which guarantees the relative smallness of the difference
between bootstrapping signs of residuals versus bootstrapping signs on the
actual errors present in the data.
Notice that one need not assume i.i.d. errors ε in Proposition 1. With
the additional assumption of i.i.d. errors a far simpler comparison can be
obtained.

Proposition 1.a. Define F to be the sigma algebra generated by the order
statistics of the absolute values |ε_i|. If errors ε_i satisfy Assumption 1 and are
also i.i.d. then

E[ ( δ(Y−Ȳ)‾ − δε‾ )² | F ] = ε²‾ / n²,  (3.1.a)

E[ ( δε‾ )² | F ] = ε²‾ / n.  (3.2.a)

Proof. These relations are a simple consequence of the fact that, conditional
on F, the vectors δ and ε are independent; δ_i are conditionally i.i.d. signs; ε_i
are conditionally uniform over all 2^n × n! signed permutations of |ε_i|. □
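The 1/n ratio implied by (3.1.a)–(3.2.a) can be checked by simulation. The sketch below is only an illustration under an arbitrarily chosen vector of absolute values (not from the text); it averages the squared discarded term and the squared target over random signs.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
abs_eps = rng.uniform(0.1, 5.0, size=n)      # a fixed |eps| vector (arbitrary)

num = den = 0.0
B = 20000
for _ in range(B):
    delta = rng.choice([-1.0, 1.0], size=n)  # signs that generated the data
    eps = delta * abs_eps                    # a random signing of |eps|
    dstar = rng.choice([-1.0, 1.0], size=n)  # fresh signs used by the bootstrap
    target = np.mean(dstar * eps)            # mean of delta*eps, as in (3.2.a)
    kept = np.mean(dstar * (eps - eps.mean()))   # what sign-bootstrap produces
    num += (kept - target) ** 2              # squared discarded term, (3.1.a)
    den += target ** 2
print("ratio of mean squares:", num / den, "  1/n =", 1.0 / n)
```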

The relevance of relations (3.1.a)–(3.2.a) may be explained as follows.
When attempting to assess the sampling variation of Ȳ by bootstrapping
signs, we would exactly recover the conditional distribution of the error
term ε̄ given the vector |ε| except that we have thrown away a term δ̄ε̄.
Relative to the conditioning F, that missing term has conditional mean zero
and mean square (3.1.a). On the other hand, the target ε̄ (equivalently δε‾)
has conditional mean zero and conditional mean square given by (3.2.a).
Thus the F-relative average magnitude of our error in using the sign-
bootstrap approximation to the conditional sampling distribution of ε̄ given
|ε| can be determined by taking the ratio of (3.1.a) to (3.2.a), which does
not depend on ε and equals 1/n.
Conditioning on F, as in Proposition 1.a, is particularly important for
the regression and time-series counterparts of these results, since Proposition
1 does not extend to sums involving unequal coefficients on the errors, while
Proposition 1.a does have an extension to such sums.

Assumption 2. Y = Xβ + ε (a finite non-singular matrix regression model)
in which the errors ε_i satisfy Assumption 1 (symmetry) and are i.i.d., and β
is an unknown vector. □

Denote by Ŷ the values fitted by linear least squares; ε̂ = the linear least
squares fit to the errors; ε⊥ = Y − Ŷ = ε − ε̂. As before, let F denote the
sigma algebra generated by the order statistics of the absolute values |ε_i|. If
v is any vector in R^n we denote by v/X its projection to the column space
of X.

Proposition 2. Let Y, X, β, ε satisfy Assumption 2. For every vector v
define (v, Y) = Cartesian inner product of Y with v in R^n. Then for every
complete orthonormal basis {u_α} for the column space of X,

E[ ( (v, δε⊥) − (v, δε) )² | F ] = ε²‾ (u², v²),  (3.3)

E[ (v, δε)² | F ] = ε²‾ ||v||²,  (3.4)

E[ (v, δε⊥)(v, δ(ε/X)) | F ] = 0,  (3.5)

where u² = Σ_α u_α² does not depend upon the choice of c.o.n. basis {u_α}; the
vectors u_α², v² are formed by squaring each coordinate.

Proof. Use the conditioning argument of Proposition 1.a. Let {u_α} be any
complete orthonormal basis for the column space of X. We make use of the
fact that for every α, β,

E[ (δε, u_α)(δε, u_β) | F ] = Σ_i u_{αi} u_{βi} E[ε_i² | F] = ε²‾ (α = β),  (3.6)

where (α = β) denotes the indicator function. By expanding ε/X in terms
of {u_α} we obtain from (3.6) that for every i,

E[ (ε/X)_i² | F ] = Σ_α Σ_β u_{αi} u_{βi} ε²‾ (α = β) = ε²‾ Σ_α u_{αi}² = ε²‾ u²_i.  (3.7)

From (3.7) we obtain (3.3),

E[ ( (v, δε⊥) − (v, δε) )² | F ] = E[ (v, δ(ε/X))² | F ]

= Σ_i v_i² E[ (ε/X)_i² | F ] = ε²‾ (u², v²).  (3.8)

Expressions (3.4)–(3.5) have a similar proof. □

The main point of Proposition 2 is that, provided the ratio of (3.3) to
(3.4) (which only depends upon the condition vector u² associated with the
design X) is small, the Y-conditional distribution of (v, δε⊥) will, on F-
average, be close to the conditional distribution of the sampling error (v, ε)
given the absolute values |ε|. The same method may be used to prove that

E[ || (δε⊥)/X − (δε)/X ||² | F ] = ε²‾ Σ_α Σ_β (u_α², u_β²),  (3.9)

E[ || (δε)/X ||² | F ] = ε²‾ d,  (3.10)

where d > 0 is the number of columns of X. Relations (3.3)–(3.10) also hold
for bootstrapping over all 2^n × n! signed permutations of residuals.
To illustrate Proposition 2, consider using least squares linear regression
to fit a cubic polynomial to n = 101 points with equally spaced x-values in
the interval [0, 1], i.e. x₁ = 0, ..., x₁₀₁ = 1. The ratios (3.3)/(3.4),
computed for each of the four coefficients, are 0.0863, 0.0719, 0.0665, 0.0666.
Suppose the actual "errors from the regression" are ε_i = (−1)^i i², i = 1, ..., 101.
In the case of the first coefficient, the y-intercept, we find by computer
calculations that bootstrapping signs gives a 0.95 confidence interval less
than half the width of a confidence interval determined by either the CLT
or by bootstrapping the residuals. Bear in mind that the relative widths of
these three confidence interval types (bootstrap-signs of residuals, CLT,
bootstrap residuals) are unchanged by scale change in the errors, so we could
just as well be working with errors having the same distributional shape but
concentrated arbitrarily close to the origin.
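A sketch of how the ratios (3.3)/(3.4) for this design might be computed is given below. It uses the facts that the j-th least-squares coefficient is (v, Y) with v = X(X'X)^{-1}e_j and that u²_i is the i-th leverage (diagonal of the hat matrix); the factor ε²‾ cancels in the ratio. The code is an assumed reconstruction of the computation, not the author's program, and is not guaranteed to reproduce the quoted figures exactly.

```python
import numpy as np

n = 101
x = np.linspace(0.0, 1.0, n)
X = np.column_stack([np.ones(n), x, x**2, x**3])    # cubic polynomial design

# u2_i = sum_alpha u_{alpha,i}^2 is the i-th leverage (hat-matrix diagonal).
Q, _ = np.linalg.qr(X)                              # orthonormal basis of col(X)
u2 = np.sum(Q**2, axis=1)

# beta_hat_j = (v_j, Y) with v_j = X (X'X)^{-1} e_j; eps^2-bar cancels in the ratio.
V = X @ np.linalg.inv(X.T @ X)
for j in range(X.shape[1]):
    v = V[:, j]
    ratio = np.sum(u2 * v**2) / np.sum(v**2)        # (3.3) / (3.4)
    print(f"coefficient {j}: ratio (3.3)/(3.4) = {ratio:.4f}")
```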

4. Performance on randomly signed powers of uniforms.

It is instructive to compare ordinary vs sign bootstraps for the case of


the sum of i.i.d. symmetric errors having a distribution whose tail is a
power. Such distributions are actually the prototype for errors attracted to
a symmetric stable law. A simple Poisson process construction of such
errors (correct for each n, but not for all n simultaneously) allows us to
track and compare the performance of both bootstraps using almost sure
arguments.

Lemma 1. Suppose {U_i} are i.i.d. uniform on [0, 1]; ε_i = δ_i |ε_i| with |ε_i| = U_i^p
for a real number p, and the δ_i are i.i.d. signs as in Assumption 1. All sequences of r.v. are independent
of one another. Let τ_j be the time of the j-th arrival in a homogeneous Poisson
process on [0, ∞) having a mean of one arrival per unit time. Then for every
n (but not for different n simultaneously) there is a standard construction of
the increasing order statistics U_{(j)} = τ_j/τ_{n+1}, 1 ≤ j ≤ n, in terms of which,

Σ_{j=1}^n δ_j U_{(j)}^p = τ_{n+1}^{-p} Σ_{j=1}^n δ_j τ_j^p.  (4.1)

For p < −1/2 we have a.s., as n → ∞,

[ n^{1+p} δ(Y−Ȳ)‾ | Y ]  and  [ n^{1+p} (μ̂ − μ) | |ε| ]

→ [ Σ_{j≥1} δ_j τ_j^p | {τ_j} ].  (4.2)

Outline of Proof. Breiman (1968) is a good reference for the Poisson
construction of uniform order statistics. The limits (4.2) are obtained by
applying the three series theorem conditional on {δ_j, τ_j}, or a martingale
argument. For additional details see (LePage, 1980; LePage, Woodroofe and Zinn, 1981). □

Only the case p < -1/2 corresponds to a symmetric stable law, yet the
other powers are of some interest for finite samples, before the central limit
theorem takes hold. They are best examined from the conditional viewpoint
of Proposition 1.a. since the limit law is uninteresting.

5. Symmetric errors attracted to a stable law. It is not surprising that the
result (4.2) applies also to i.i.d. symmetric errors belonging to the domain of
attraction of a stable law.

Proposition 3. The result (4.2) (with a scale constant on the right side)
holds for the corresponding Poisson construction of symmetric errors {ε_i,
i ≤ n} belonging to the domain of attraction of a stable law of index 0 < α <
2 (i.e. p = −1/α in the above). Basic bootstrap (Knight, 1989) on the other
hand gives,

[ n^{1+p} (Ȳ* − Ȳ) | Y ] → const × [ Σ_{j≥1} (η_j − 1) δ_j τ_j^p | {δ_j, τ_j} ],  (5.1)

where, as is usual, Ȳ* denotes the bootstrap average, and {η_j} are i.i.d.
Poisson r.v., having expectation one, and independent of δ, etc.
In the case p > −1/2 the conditioning vanishes in the limit.

Remarks on the Proof. The proofs follow by application of an invariance


principle (Kinateder, Theorem 3., 1990; also in this volume), or by direct

arguments along the lines of Lemma 1. The form of the limit law in (5.1) is
due to Knight (1989), who does not address the problem of developing a
consistent bootstrap. Athreya (1987) obtained a characterization of the
unconditional limit law (5.1) as a random distribution. □
The following figure is a plot of confidence interval half-widths for
different samples of size n from a standard normal distribution. The CLT
half-width is 0.258. The pairs being plotted are y = confidence interval half-
width obtained from bootstrapping signs; x = confidence interval half-width
obtained by the same Monte Carlo simulation applied instead to the signs of
the actual errors in the data. Sampling variation is biased upward due to
the limited number of passes used in each bootstrap, but the point being made is
only that the approximation of Lemma 1 is operating very well. That is,
bootstrapping signs is approximately recovering the conditional sampling
distribution given the absolute values of the residuals.

6. Comments. Symmetry does not appear to be essential for the basic
principle. Rather, for symmetric errors we more easily see how to use
resampling to change the object of our inference from the unconditional
limit law to a conditional limit law likely to have narrower confidence limits

since E [ (v, Y) | F] = (v, E Y) (conditionally unbiased). This idea,


appropriately modified, seems to extend to time series and other settings.
The conditional interpretation that we offer for bootstrapping signs
appears to be related to the concept of second-order efficiency of bootstrap
introduced in (Singh, 1981) for the case of finite second moments, since
pivoting adjusts bootstrap to the variance of the sample. Bootstrapping
signs can produce dramatic results, even in problems such as the regression
cited earlier where one might expect the CLT to do reasonably well. In
cases where consistent bootstrap methods, such as bootstrapping signs, are
available, the way is open to compare different estimators based on
bootstrap estimates of their respective sampling errors.

References

Athreya, K. B. (1987), Bootstrap of the mean in the infinite variance case.


Annals of Statistics, 15:724-731.
Breiman, L. (1968), Probability. Addison-Wesley, New York.
Bickel, P. J. and Freedman, D. A. (1981), Some asymptotic theory for the
bootstrap. Annals of Statistics, 9:1196-1217.
Efron, B. (1979), Bootstrap methods: Another look at the jackknife.
Annals of Statistics, 7:1-26.
Giné, E. and Zinn, J. (1989), Necessary conditions for the bootstrap of the
mean. Annals of Statistics, 17:684-691.
Knight, K. (1989), On the bootstrap of the sample mean in the infinite
variance case. Annals of Statistics, 17:1168-1175.
Kinateder, J. (1990), An invariance principle applicable to the bootstrap.
Thesis, Michigan State University.
LePage, R., Woodroofe, M. and Zinn, J. (1981), Convergence to a stable
distribution via order statistics. Annals of Probability, 9:624-632.
LePage, R. (1980), Reprinted in: Conditional moments for coordinates of
stable vectors. Lecture Notes in Mathematics, 1391:148-163.
Singh, K. (1981), On the asymptotic accuracy of Efron’s bootstrap. Annals
of Statistics, 9:1187-1195.
Wu, C. F. (1986), Jackknife, bootstrap, and other resampling methods in
regression analysis. Annals of Statistics, 14:1261-1343.

Moving Blocks Jackknife and Bootstrap Capture


Weak Dependence*

Regina Y. Liu and Kesar Singh


Rutgers University**

Abstract

Let X₁, X₂, ... be a sequence of stationary m-dependent random
variables with the common distribution F for each X_i. Let T_F be the
parameter of interest and T_n be its considered estimator based on
(X₁, X₂, ..., X_n). To study the sampling distribution of T_n, we introduce
two resampling procedures — the moving blocks jackknife (MBJ) and the
moving blocks bootstrap (MBB). These two procedures resample from the
moving blocks B₁, ..., B_{n−b+1}, where b is the size of each block and B_i
stands for the block consisting of b consecutive observations starting from
X_i, i.e., B_i = {X_i, X_{i+1}, ..., X_{i+b−1}}. For the MBJ, pseudo-values are
generated by deleting each of such blocks. As for the MBB, k i.i.d. block
samples are drawn from B₁, ..., B_{n−b+1}. All elements in the k sampled
blocks are then regarded as the bootstrap sample. Both MBJ and MBB tend
to make partial correction for the m-dependence if b is a fixed positive
integer. Furthermore, they both achieve the actual consistency under
m-dependence if b is allowed to grow to infinity with n.

*This research was supported by NSF grants DMS-85—0294,


DMS-88—02558.
**Department of Statistics, Hill Center, Rutgers University, New
Brunswick NJ 08903.

Exploring the Limits of Bootstrap. Edited by Raoul LePage and Lynne Billard.
Copyright©1992 by John Wiley & Sons, Inc. ISBN: 0471-53631-8.

1. Introduction.
A sequence of time dependent observations tends to be
dependent, unless the time indices are sufficiently far apart. The notion of
m-dependence is probably the most basic model which takes into account
such dependence. Let {X₁, X₂, ...} be a sequence of r.v.s (random variables).
Let A and B be two events such that A depends on {X₁, ..., X_t} and B
on {X_{t+m+1}, X_{t+m+2}, ...}. The sequence {X_i} is called m-dependent if
any such pair of events A and B are independent. The sequence {X_i} is
called stationary if the distribution of any subcollection of X_i's does not
change if each of the indices is shifted by the same amount. For instance,
(X₁, X₂, X₆) has the same distribution as (X₁₁, X₁₂, X₁₆) under
stationarity. Consider the variance of the sum Σ_{i=1}^n X_i. If {X_i} is
stationary and m-dependent, m < n, then

(1/n) Var( Σ_{i=1}^n X_i ) = Var(X₁) + 2 Σ_{j=1}^m (1 − j/n) cov(X₁, X_{1+j}).

Thus σ_n² := (1/n) Var( Σ_{i=1}^n X_i ) → σ² = Var(X₁) + 2 Σ_{j=1}^m cov(X₁, X_{1+j}) as n
→ ∞. It is well known that √n(X̄ − μ) is asymptotically normal with mean 0
and variance σ² (see p.174 of Billingsley (1968) for a more general result).
Here X̄ = n^{-1} Σ_{i=1}^n X_i and μ = EX₁. In general, if a statistic √n(T_n − T_F) is
asymptotically N(0, Var g_F(X₁)) under i.i.d. models, then it is
N(0, σ²(T_F)) under stationary m-dependent models, where

σ²(T_F) = Var(g_F(X₁)) + 2 Σ_{j=1}^m cov( g_F(X₁), g_F(X_{1+j}) ).

Here F is the c.d.f. of X₁. For instance, for the sample median, g_F(X₁) =
(1/2 − I(X₁ ≤ F^{-1}(1/2))) / F′(F^{-1}(1/2)), and for the sample variance g_F(X₁) =
(X₁ − μ)².

More general models of time dependence are φ-mixing and
α-mixing. In these two models the dependence, in two different senses, is
assumed to decrease to zero as the time gap tends to infinity. All these models
are generally referred to as weak dependence models. The asymptotics of
weakly dependent r.v.s have been extensively studied. (See Billingsley
(1968) for results on φ-mixing r.v.s.)

In this paper we propose a jackknife procedure and a bootstrap
procedure for estimating sampling errors of general statistics which are
consistent under m-dependent models. First of all, we examine why the
classical jackknife and bootstrap procedures are inconsistent. Consider the
simplest case of the sample mean X̄. Let F_n be the empirical distribution
based on X₁, ..., X_n. The classical bootstrap procedure is to draw a random
sample of size n from F_n. Denote the bootstrapped sample by Y₁, ..., Y_n.
The sampling distribution of √n(X̄_n − μ) is then estimated by that of
√n(Ȳ_n − X̄_n) under F_n, where Ȳ_n is the mean of the bootstrap sample.
The variance of √n(Ȳ_n − X̄_n) under F_n is s_n² = n^{-1} Σ_{i=1}^n (X_i − X̄_n)², which
converges to Var(X₁). This can be seen by writing

s_n² = n^{-1} Σ_{i=1}^n X_i² − (X̄_n)²

and using the law of large numbers for stationary m-dependent r.v.s. This
inconsistency of the classical bootstrap was also mentioned in a Remark of
Singh (1981). Since Y₁, ..., Y_n are i.i.d. observations from F_n, it is obvious
that this bootstrap completely ignores the dependence structure of the X_i's.
Consequently under m-dependent models, the achieved asymptotic variance
is incorrect. As for the classical jackknife procedure, the estimate of the
variance of √n(X̄_n − μ) is (n−1)^{-1} Σ_{i=1}^n (J_i − J̄)², where the J_i's are
pseudo-values; i.e., J_i = nX̄_n − (n−1)X̄_{n,−i} = X_i. Here X̄_{n,−i} =
(n−1)^{-1} Σ_{j=1, j≠i}^n X_j and J̄ is the mean of J₁, ..., J_n. This estimate equals s_n²
(up to the factor n/(n−1)) and hence also converges to Var(X₁). Thus the classical delete-1 jackknife
is also inconsistent.
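A small simulation makes the point concrete. The moving-average model and coefficient below are illustrative assumptions, not taken from the paper; the classical bootstrap variance s_n² settles near Var(X₁) rather than the correct σ².

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
z = rng.normal(size=n + 1)
x = z[1:] + 0.6 * z[:-1]            # stationary, 1-dependent (MA(1)) series

var_x1 = 1 + 0.6**2                 # Var(X_1) -- what the i.i.d. bootstrap targets
sigma2 = var_x1 + 2 * 0.6           # correct asymptotic variance of sqrt(n)*Xbar

s2 = np.mean((x - x.mean())**2)     # classical bootstrap / delete-1 jackknife limit
print("sample variance s_n^2 :", round(s2, 3))
print("Var(X_1)             :", var_x1)
print("true sigma^2         :", sigma2)
```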

Now we describe the proposed bootstrap and jackknife
procedures. Let B_i denote the block of b consecutive sample observations
beginning from the ith one, i.e., B_i = {X_i, X_{i+1}, ..., X_{i+b−1}}, i =
1, 2, ..., n−b+1. Let T_F indicate the parameter of interest and T_n the
estimate of T_F based on X₁, X₂, ..., X_n.

A) Moving Blocks Jackknife (MBJ).

We delete one of the moving blocks, say B_i, and then
compute the functional T based on the remaining observations
{X₁, ..., X_n}\B_i. The latter is denoted by T_{n,−i}. For each i = 1, ..., n−b+1,
the following pseudo-value is formed:

J_i = b^{-1} [ n T_n − (n−b) T_{n,−i} ].  (1.1)

The proposed estimate of the variance of √n T_n is

V̂_{J,b}(T_n) = b (n−b+1)^{-1} Σ_{i=1}^{n−b+1} (J_i − J̄)²,  (1.2)

where J̄ is the average of the pseudo-values (a code sketch of (1.1)–(1.2) follows this description).
At first glance the MBJ procedure proposed here may seem
somewhat to resemble the so-called delete-d jackknife. (See Miller (1974)
for a general review of the jackknife.) However, they are intrinsically
different. In fact, a delete-d version of MBJ would be to delete d of the
moving blocks at a time, for all combinations of d such blocks.
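The following sketch implements the pseudo-values (1.1) and the variance estimate (1.2) for a statistic supplied as a Python callable. The helper name, the test series and the block size are illustrative assumptions.

```python
import numpy as np

def mbj_variance(x, stat, b):
    """MBJ estimate of Var(sqrt(n) T_n) via pseudo-values (1.1) and (1.2)."""
    x = np.asarray(x)
    n = x.size
    t_n = stat(x)
    pseudo = np.empty(n - b + 1)
    for i in range(n - b + 1):
        x_del = np.concatenate([x[:i], x[i + b:]])         # drop the block B_{i+1}
        pseudo[i] = (n * t_n - (n - b) * stat(x_del)) / b  # J_i, as in (1.1)
    return b * np.mean((pseudo - pseudo.mean())**2)        # (1.2)

# Example with a 1-dependent series and the sample mean as the statistic.
rng = np.random.default_rng(1)
z = rng.normal(size=501)
x = z[1:] + 0.6 * z[:-1]
print("MBJ variance estimate (b=10):", mbj_variance(x, np.mean, b=10))
```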

B) Moving Blocks Bootstrap (MBB).

Under the MBB scheme we resample with replacement from
the moving blocks B₁, ..., B_{n−b+1}, with the B_i's equally likely to be drawn.
Suppose we draw k blocks and denote the resulting sampled blocks by
ξ₁, ..., ξ_k. Note that each ξ_i contains b elements which may be expressed
as ξ_i = (ξ_{i,1}, ..., ξ_{i,b}). Let ℓ = kb. We shall refer to (Y₁, ..., Y_ℓ) =
(ξ₁, ξ₂, ..., ξ_k) as the bootstrap sample under the MBB scheme. Here Y_a =
ξ_{j,i} if a = (j−1)b + i. For instance, with b = 2, k = 3, if ξ₁ = (6,8),
ξ₂ = (4,7) and ξ₃ = (7,2), then the bootstrap sample of size 6 is
(6,8,4,7,7,2). Let F*_ℓ denote the e.d.f. based upon (Y₁, ..., Y_ℓ) and let T*_ℓ
be the statistic computed from the bootstrap sample. In order to estimate
the sampling distribution of √n[T_n − T_F] under the MBB scheme, we may
obtain a large number of bootstrap values √ℓ[T*_ℓ − T_n] to form a histogram.
This histogram is the proposed bootstrap approximation of the sampling
distribution of √n[T_n − T_F] (a short code sketch follows).
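A minimal sketch of the MBB resampling step is given below. The choice k = [n/b], the test series and the use of the sample mean are illustrative assumptions.

```python
import numpy as np

def mbb_replicates(x, stat, b, n_boot=1000, rng=None):
    """Replicates of sqrt(l) * (T*_l - T_n) obtained by resampling whole blocks."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x)
    n = x.size
    k = n // b                                   # number of blocks drawn
    l = k * b                                    # bootstrap sample size
    starts = np.arange(n - b + 1)                # blocks B_1, ..., B_{n-b+1}
    t_n = stat(x)
    reps = np.empty(n_boot)
    for r in range(n_boot):
        idx = rng.choice(starts, size=k, replace=True)
        sample = np.concatenate([x[i:i + b] for i in idx])   # (Y_1, ..., Y_l)
        reps[r] = np.sqrt(l) * (stat(sample) - t_n)
    return reps     # a histogram of these approximates the law of sqrt(n)(T_n - T_F)

rng = np.random.default_rng(2)
z = rng.normal(size=501)
x = z[1:] + 0.6 * z[:-1]
reps = mbb_replicates(x, np.mean, b=10, rng=3)
print("MBB variance estimate:", reps.var())
```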

To understand how MBJ and MBB reflect the weak
dependence structure, we first focus on the case of the sample mean with block
size b = 2. The jackknife variance with b = 2 can be expressed as

V̂_{J,2}(X̄_n) = 2 (n−1)^{-1} Σ_{i=1}^{n−1} ( (X_i + X_{i+1})/2 − J̄ )².

Let U_j = (X_j + X_{j+1})/√2, j = 1, ..., n−1; then the U_j's are stationary and
(m+1)-dependent. The average of U₁, ..., U_{n−1} turns out to be √2 X̄ +
O_p(n^{-1}) since

(n−1)^{-1} Σ_{j=1}^{n−1} (X_j + X_{j+1})/√2 = √2 X̄ n/(n−1) − (X₁ + X_n)/(√2 (n−1)).

This shows that V̂_{J,2}(X̄) is essentially the sample variance of
U₁, ..., U_{n−1}. Consequently, the jackknife variance converges to
Var((X₁+X₂)/√2). Now, we examine the estimated variance of X̄ under the
MBB scheme with b = 2. Let S(ξ_i) indicate the sum of the components of
the ith sampled block ξ_i. Then

Var*(√ℓ Ȳ_ℓ) = ℓ · Var*( ℓ^{-1} [ S(ξ₁) + ... + S(ξ_k) ] ).

Here Var*( · ) stands for the variance of · under the MBB procedure.
Since S(ξ_i), i = 1, 2, ..., k are i.i.d. under the MBB probability, the
expression above reduces to Var*( S(ξ₁)/√2 ), which is the sample variance
of the U_j's. Therefore, this estimated variance also converges to Var
((X₁+X₂)/√2). By extending these arguments, one immediately concludes
that for a general fixed b, both V̂_{J,b}(X̄_n) and Var*(√ℓ Ȳ_ℓ) converge to

Var( (X₁ + X₂ + ... + X_b)/√b ).

The latter variance converges to the true asymptotic variance of √n X̄_n as
b → ∞.

The above discussion suggests that, for the case of the sample
mean, MBJ and MBB will make a partial correction for the weak dependence
when b is considered fixed. To demonstrate this point clearly, we consider
the m-dependent model with minimum m, i.e., m = 1. The true
asymptotic variance in this case is

Var(X₁) + 2 cov(X₁, X₂).

The achieved asymptotic variances by MBJ and MBB with b = 2, 3 and 4
are, respectively,

Var(X₁) + cov(X₁, X₂),

Var(X₁) + (4/3) cov(X₁, X₂),

and Var(X₁) + (3/2) cov(X₁, X₂).

Thus, with b = 2 we capture 50% of the dependence factor, namely
2 cov(X₁, X₂). The fraction rises to 66% and 75% with b = 3 and 4,
respectively. Therefore for practical purposes, replacing the classical
jackknife or bootstrap by the moving blocks ones will amount to robustizing
these procedures against suspected serial dependence. As for asymptotic
theory, it is shown that the actual consistency is achieved under
m-dependent models by the proposed moving blocks resampling procedures
if b is allowed to grow to infinity with n. (A short numerical check of these
fractions follows.)
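These fractions follow from Var((X₁ + ... + X_b)/√b) = Var(X₁) + 2((b−1)/b) cov(X₁, X₂) for a 1-dependent sequence. The short check below evaluates this for an assumed MA(1) model; the model and coefficient are for illustration only.

```python
import numpy as np

theta = 0.6                        # MA(1): X_t = Z_t + theta * Z_{t-1}  (1-dependent)
var_x = 1 + theta**2               # Var(X_1)
cov_x = theta                      # cov(X_1, X_2)
target = var_x + 2 * cov_x         # true asymptotic variance of sqrt(n) * Xbar

for b in (2, 3, 4, 20):
    achieved = var_x + 2 * (b - 1) / b * cov_x   # Var((X_1+...+X_b)/sqrt(b))
    captured = (achieved - var_x) / (2 * cov_x)  # share of 2*cov(X_1,X_2) captured
    print(f"b={b:2d}: achieved variance {achieved:.3f} of {target:.3f}, "
          f"fraction captured {captured:.0%}")
```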

Turning to general statistics, usually if a statistic T_n is
asymptotically normal, it can be represented as

T_n = T_F + n^{-1} Σ_{i=1}^n g_F(X_i) + o_p(n^{-1/2}),

where F is the c.d.f. of X₁ and E g_F(X₁) = 0. This comment applies
under i.i.d. models as well as weak dependence models. When the X_i's are i.i.d.,
the pseudo-values of the standard delete-1 jackknife are T_F + g_F(X_i), i =
1, 2, ..., n, with a remainder of order o(1) uniformly in i (see Parr (1985) and
Singh and Liu (1990)). It turns out that the ith pseudo-value of the MBJ
procedure is

T_F + b^{-1} Σ_{j=i}^{i+b−1} g_F(X_j)

plus a remainder term of order o(1) uniformly in i. Thus the foregoing
discussion on the variance estimate of the sample mean by the MBJ carries
over to general statistics, with X_i replaced by g_F(X_i). Similar reasoning
applies to the MBB. It follows from asymptotic representations that the
classical bootstrap on T_n is equivalent, up to the first order limit, to
bootstrapping the sample mean n^{-1} Σ_{i=1}^n g_F(X_i) (see Liu, Singh and Lo
(1989)). Evidently, the classical bootstrap gives the asymptotic variance
Var(g_F(X₁)), which is incorrect under m-dependent models. We establish
in Section 3 some representations which assert that the MBB on a
"smooth" T_n is equivalent to that on the mean n^{-1} Σ g_F(X_i). This leads to
the validity of the MBB for general statistics. In the course of obtaining
these representations, we prove an extension of the Dvoretzky–Kiefer–
Wolfowitz inequality for the empirical d.f. F*_ℓ of the MBB sample. This
inequality in turn implies that ||F*_ℓ − F_n|| is negligible (at the rate given in
Corollary 3) in the MBB probability. This particular fact enables us to neglect second order terms in
pursuing the asymptotic result.

We present the results for MBJ and MBB separately in
Sections 2 and 3. For each scheme we shall discuss successively the case of
the sample mean and the case of more general statistics T_n; in each case, b
starts out as a fixed number and later on it is allowed to tend to infinity as
n tends to infinity. The case when b is a fixed number is technically more
complete than the case when b tends to infinity with n. Indeed, the
arguments are much more involved in the latter case and we have not
attempted to include the most general results in order not to obscure the
main ideas. Section 4 contains some concluding remarks and possibilities
for further investigations.

After this manuscript was written (March, 1988), we were informed
that the paper by Hans Künsch (1989) was in the process of publication in the
Annals of Statistics. Though the same resampling methods were proposed in
both papers, the techniques used differ substantially.

2. Moving Blocks Jackknife (MBJ).

Case (i): T_n = X̄_n.
In this case, (1.1) and (1.2) can be expressed explicitly as

J_i = b^{-1} Σ_{j=i}^{i+b−1} X_j  for i = 1, ..., n−b+1  (2.1)

and

V̂_{J,b}(X̄_n) = b (n−b+1)^{-1} Σ_{i=1}^{n−b+1} (J_i − J̄)².  (2.2)

The following two theorems provide the asymptotic limit of V̂_{J,b}(X̄_n) in
the two situations: b is fixed and b tends to infinity.

Theorem 1. Let X₁, ..., X_n be a stationary m-dependent sequence of r.v.s
with EX₁² < ∞. For any fixed integer b ≥ 1, we have

V̂_{J,b}(X̄_n) → σ_b² := b^{-1} Var( Σ_{i=1}^b X_i )  a.s. as n → ∞.

Proof: (with b = 2) Without loss of generality, we assume E X, = 0.


Note that

V̂_{J,2}(X̄_n) = 2 (n−1)^{-1} Σ_{i=1}^{n−1} ( (X_i + X_{i+1})/2 − J̄ )².

It is easy to see that (X_i + X_{i+1}), i = 1, 2, ..., is a stationary
(m+1)-dependent sequence of r.v.s. Therefore, the claim of the theorem
follows using the SLLN for m-dependent r.v.s (see the Appendix).
If b is treated as a sequence depending on n and if it is
allowed to tend to infinity at some proper rate, then the following theorem
shows that MBJ provides a consistent estimate of the variance of √n X̄_n:

Theorem 2. Let X₁, ..., X_n be a stationary m-dependent sequence of r.v.s
with EX₁⁴ < ∞. Let b = b_n, b → ∞ and b/n → 0 as n → ∞. Then

V̂_{J,b}(X̄_n) → σ² = Var(X₁) + 2 Σ_{j=1}^m cov(X₁, X_{1+j})  in prob. as n → ∞.

Note that σ² is the asymptotic variance of √n X̄_n.

Proof: First we express V̂_{J,b}(X̄_n) as

(n−b+1)^{-1} Σ_{i=1}^{n−b+1} ( B_i − √b X̄_n )²,  (2.3)

where B_i = b^{-1/2} Σ_{j=i}^{i+b−1} X_j. Since E B_i = √b μ, expression (2.3) can
be rewritten as

(n−b+1)^{-1} Σ_{i=1}^{n−b+1} [ (B_i − E B_i) − √b (X̄_n − μ) ]².  (2.4)
Clearly, (B_i − E B_i), i = 1, 2, ..., n−b+1 is a stationary (m+b)-dependent
sequence of r.v.s. We observe that the splitting technique mentioned in the
Appendix for m-dependent r.v.s is not applicable here for getting the law of
large numbers when (m+b) → ∞ as n → ∞. Next we express (2.4) further as
A_n + D_n − 2C_n with

A_n = (n−b+1)^{-1} Σ_{i=1}^{n−b+1} (B_i − E B_i)²,

D_n = b (X̄_n − μ)²,

C_n = √b (X̄_n − μ) (n−b+1)^{-1} Σ_{i=1}^{n−b+1} (B_i − E B_i).

The limit of each of the above terms is established separately to prove the
claim of the theorem. We show that A_n → σ² in probability as n → ∞. The
fact that C_n converges to zero in probability follows from similar
arguments. The term D_n converges to zero in probability in view of the
bound X̄_n − μ = O_p(n^{-1/2}); (in fact, this bound also comes into use in
showing that C_n → 0 in probability.)

Note that

Var(A_n) = (n−b+1)^{-2} { (n−b+1) Var[(B₁ − EB₁)²]

+ 2 Σ_{j=1}^{m+b} (n−b+1−j) cov( (B₁ − EB₁)², (B_{1+j} − EB_{1+j})² ) }.  (2.5)

The last expression is bounded above by

(n−b+1)^{-2} [ (n−b+1) + 2(m+b)(n−b+1) ] Var( (B₁ − EB₁)² ).

Now,

Var( (B₁ − EB₁)² ) ≤ E( (B₁ − EB₁)⁴ ) = b^{-2} E[ ( Σ_{i=1}^b (X_i − μ) )⁴ ],

which is of order O(1) as n → ∞, provided EX₁⁴ < ∞. (See the Appendix.)
Applying this particular observation and the condition b/n → 0 in (2.5), we
obtain that Var(A_n) → 0 as n → ∞. Using the Chebyshev inequality it is easy to
show that A_n − Var(B₁) → 0 in probability. The result now clearly
follows since Var(B₁) → σ².

Case (ii): General statistics T_n.

In order to study the asymptotic behavior of V̂_{J,b}(T_n) given
in (1.2), we first establish a representation theorem for the pseudo-values J_i
(given in (1.1)) when T_F = T(F) is a "sufficiently smooth" functional, and
T_n is simply T(F_n). The representation enables us to view T_n as a
sample mean, so the arguments used for Case (i) can be adapted readily. As
far as the smoothness of T(·) is concerned, we choose to discuss the
consistency under the so-called strong Fréchet differentiability and twice
Fréchet differentiability. These conditions are satisfied by many commonly
used statistics and they turn out to be "just right" in the theoretical
development of the jackknife (see Parr (1985) for examples on the strong
Fréchet differentiability).
Definition 1. A functional T(·) is strongly Fréchet differentiable at F if,
for some function g_F,

T(G) − T(H) = ∫ g_F d(G − H) + o(||G − H||)

as G and H both converge to F in the sup norm ||·||.

Definition 2. A functional T(·) is twice Fréchet differentiable at F if

T(G) − T(F) = ∫ g_F d(G − F) − ∫∫ h_F(x, y) d[G(x) − F(x)] d[G(y) − F(y)]

+ o(||G − F||²)

as G converges to F in the sup norm ||·||, for some functions g_F(·) and
h_F(·, ·). (W.l.o.g., we may assume ∫ g_F dF = 0.)

In both the definitions stated above, F, G and H may be


restricted to some suitable class of distributions in order to achieve the
differentiability in some restricted sense. For instance, for technical reasons
one may restrict oneself to the distributions with compact supports.

We are now ready to state the representations under the above
differentiability conditions. Note that T_{n,−i} = T(F_{n,−i}), where
F_{n,−i} is the empirical d.f. based on the data {X₁, X₂, ..., X_n}\B_i.

Theorem 3. Let X₁, ..., X_n be a stationary m-dependent sequence of r.v.s
with the common distribution F. If the functional T(·) is strongly Fréchet
differentiable at F with the differential g_F(·), then

max_{1 ≤ i ≤ n−b+1} | J_i − T(F) − b^{-1} Σ_{j=i}^{i+b−1} g_F(X_j) |

tends to zero a.s. as n → ∞. Here b is assumed to be a fixed positive integer.

To prove Theorem 3 we note that if we take G = F_n and H
= F_{n,−i} in Definition 1, then ||G − H|| = O(b/n). This is the key element
in the proof. The rest of the proof is straightforward and it is therefore
omitted.

Theorem 3 immediately implies the following corollary
regarding the estimation of the variance of T_n under the MBJ procedure.

Corollary 1. Under the same condition as in Theorem 3 and the additional
moment condition E g_F²(X₁) < ∞, we have

V̂_{J,b}(T_n) → b^{-1} Var( g_F(X₁) + ... + g_F(X_b) )  a.s. as n → ∞.

When the block size b is allowed to go to infinity, we are able to
establish Theorem 4 below. The strong Fréchet differentiability falls short
of the required smoothness for the sake of achieving a representation
similar to the one stated in Theorem 4 for the case b → ∞ as n → ∞. This is
simply because in this situation ||F_{n,−i} − F_n|| is of the order O(b/n), and
hence the difference

[T(F_{n,−i}) − T(F_n)] − ∫ g_F(x) d[F_{n,−i}(x) − F_n(x)]

is of the order o(b/n). To yield a useful representation, this remainder
should rather be o(√b/n). Therefore, when b is allowed to tend to infinity,
we instead work with the twice Fréchet differentiability (as stated in
Definition 2). Note that these two differentiability notions are not directly
comparable with each other (see the example on p.8 of Singh and Liu
(1990)). We also remark here that Theorem 4 itself includes the case of b
being a fixed integer. It is possible to relax the moment conditions to some
extent if b is restricted to be a fixed integer specifically.

Theorem 4. Let T(·) be twice Fréchet differentiable at F as stated in
Definition 2 with ∫ g_F dF = 0. Assume

∫ h_F²(x, x) dF(x) < ∞  and  ∫∫ h_F²(x, y) dF(x) dF(y) < ∞.

If b⁴/n → 0 as n → ∞, then

J_i = T(F) + b^{-1} Σ_{j=i}^{i+b−1} g_F(X_j) + R_{n,i},

where √b max_{1 ≤ i ≤ n−b+1} |R_{n,i}| → 0 in probability as n → ∞.

A direct consequence of Theorems 2 and 4 is the following.

Corollary 2. Under the conditions of Theorem 4 and, additionally, E g_F⁴(X₁)
< ∞, one has

V̂_{J,b}(T_n) → Var(g_F(X₁)) + 2 Σ_{j=1}^m cov( g_F(X₁), g_F(X_{1+j}) )  in probability,

which is the actual asymptotic variance of √n T_n.


Proof of Theorem 4: We write

n T(F_n) − (n−b) T(F_{n,−i}) = n[T(F_n) − T(F)] − (n−b)[T(F_{n,−i}) − T(F)] + b T(F).

Then, we expand each of [T(F_n) − T(F)] and [T(F_{n,−i}) − T(F)] as a linear
term plus a quadratic term utilizing the assumed twice Fréchet
differentiability. Since ||F_{n,−i} − F_n|| = O(b/n), which in turn is o(n^{-1/2})
due to the condition b/√n → 0, the remainders in both expansions mentioned
above are of the order o(n^{-1}), uniformly in i. Thus, the claim of the
theorem follows if we show that the pseudo-values in the quadratic terms are
of appropriate order.

To this end, we define

Qn i = b TasJhp (xy) AIF,(x)—F(2)] alF,(9)- FQ)


— (ab) fop(xy) a[F, _(x)— F(x)] alF, _,(5)— FL
We would like to prove that
1/2 : 5 :
b wemax /enl.| 30, in probability
t as +o.n+

j We rewrite bQ, 5 2
4. 4 1 =
, fa ~—(n—b)
“] 28a, +(n-b) [2Lu,- LE u
4 jx * jk © 5x4, 540-4 ix
al

r
| where ay = bp(XX,) — bp(X,*) —hy(*X,) + hyl*,*) with hp(x,*) =
Eh xX) and bf*,7)=E XX). Next, we express
.

pomn aan:
nf{n-5) D« edPig
and
| 2 1 ts s sina

a where

= £ Ua, and = Zz ua -
: eae Siig ia *
_ ~‘Wkis clear that A_ +B, cannot exceed

b
m4 |max
Cm) Ig|jq..|+2 1max jq.j],
jén la]

where (*) stands for i < j< i+b—1 and 1¢i¢ n-b+1.
Thus, it only remains to show that

max |q,|
= o(n//b),
1¢ jin
and

re 19,,) = ofn/¥6). (2.6)

Following the argument in the proof of the Lemma of Singh and Liu (1990),
and using the splitting technique mentioned in the Appendix, it is
shown that E q_j⁴ = O(n²) uniformly in j and E q_{jk}⁴ = O(n²) uniformly in (*).


Finally by using Bonferroni inequality, we obtain

P( max |q,| Se By cn. DEK ofa) +0.


iia yo n
pets Rare)
Exactly similar argument proves (2.6) which in fact uses the condition b‘/n
+0. The proof is complete.

3. Moving Blocks Bootstrap (MBB).

Recall that ξ₁, ..., ξ_k are i.i.d. sampled blocks from the moving
blocks B₁, ..., B_{n−b+1}; for each i, ξ_i = (ξ_{i,1}, ..., ξ_{i,b}); (Y₁, ..., Y_ℓ) = (ξ₁, ..., ξ_k)
stands for the bootstrap sample of size ℓ; and F*_ℓ is the empirical d.f. based
on Y₁, ..., Y_ℓ. Throughout this section we assume ℓ and n are of the same
order, namely ℓ/n is bounded below and above.

Case (i): T_n = X̄_n.
For the bootstrap approximation of the sampling distribution
of X̄_n based on the MBB procedure, with b fixed, we obtain Theorem 5
below.

Remark. The true asymptotic distribution of √n(X̄_n − μ) is N(0, σ²). As
mentioned in the Introduction, σ_b² → σ² as b → ∞. Furthermore, σ_b² may be
viewed as an intermediate stage value between Var(X₁) (which is achieved
by the classical bootstrap) and σ² (which is the true asymptotic variance).
Thus the discrepancy between the MBB approximation with b fixed and
the actual distribution of √n(X̄_n − μ) is ||Φ(x/σ) − Φ(x/σ_b)|| in the limit.

Theorem 5. Let X₁, ..., X_n be a sequence of stationary m-dependent r.v.s.

If EX₁² < ∞, then

||P*( √ℓ (Ȳ_ℓ − X̄_n) ≤ x ) − Φ(x/σ_b)|| → 0  a.s. as n → ∞,  (3.1)

where P* stands for the bootstrap probability (i.e., under the MBB scheme),
Ȳ_ℓ = ℓ^{-1} Σ_{i=1}^ℓ Y_i, ||·|| stands for the sup norm over all x, and σ_b² =
b^{-1} Var( Σ_{i=1}^b X_i ). Here b is a fixed integer.

Proof: (with b= 2) The proof hinges on the observation that under the
MBB procedure the effective sample is sen where g = (& + &o)/v2.

Let 2 = Sade: then ¥2Y,= yk re Note that Comey are iid. r.v.s
under the MBB scheme. Hence,

Ez, = E*t, = (2X,)/v2 + o(n~!/2) a.s.


Here E* (or Var* as seen later) is the mean (variance) under P*. Using the
same arguments as in Theorem 1, we see that

n—l1 X.+X
Washoe oe pee ttle iJ
prt ys
n—l X.+X,
wreey aug ate a.s.
j=

40% as.
Now we shall use the Berry—Esseen theorem to bound the normed difference

in (3.1) which can be rewritten as ||P*(vK(2,—E*E,


<x)— 8(x/y' Var*(?,))Il
+ o(1) by 2E*|%, — Ete, |?/vk (Var* gore + 0(1) a.s. It only remains
to be shown that 2p |‘4|349 a.s., i.e,

—1
ait a 3

When EX? <, the above statement follows from Marcinkiewicz—Zygmund


SLLN, which does extend to m—dependent r.v.s, using the splitting
technique. Thus, the theorem is proved.

Consider now the case that b = b_n and b → ∞ as n → ∞.

Theorem 6. Let X₁, ..., X_n be a stationary m-dependent sequence of r.v.s with
EX₁ = μ and E|X₁|^{4+δ} < ∞ for some δ > 0. If b/n → 0 as n → ∞, then

||P*( √ℓ (Ȳ_ℓ − E*Ȳ_ℓ) ≤ x ) − P( √n (X̄_n − μ) ≤ x )|| → 0  (3.2)

in probability. If the condition is strengthened to b/√n → 0 as n → ∞,
then (3.2) holds with E*Ȳ_ℓ replaced by X̄_n.

Proof: Note first that under the MBB procedure

E*Y)= E*((634 + b19 tit €1p)/dl

1 n-b+1
= RebET 2 sel Ai eae bd

We may rewrite y2(Y,—E*Y,) as k 1/2 it (g — E*%.), where t=


(61 +--+&,)/vb. Now, the result (3.2) will follow from Katz theorem (see
Katz (1963)) on the Berry—Esseen bound provided that

i) Var* (2,) converges to o in probability, and

ii) E*|t, —Et,|?+°/? is bounded in probability.

In order to replace E*Y, by xX in (3.2) legitimately, we


need to show that

iii)
Pe
E*Y,=X, + of(n—1/2[2
Here o* stands for o_ under the bootstrap probability. We proceed to
prove statements i), ii) and iii) in that order to complete the proof of the
theorem.

: i:
Proof of i) Var* (|) = abelze= it1 (B, — Et) , where B, =
(X; + Xj, 4+--+%)
14, 1)/vb. Let us rewrite Var* (2,) as

12 wb § : 9
nobel cae [((B; — you)
—(E*E, — yby)]°.

Clearly EtG, is a sample mean of (m+b)—dependent r.v.s whose mean is


vybu. Therefore it follows from the arguments of Theorem 2 that
(E*%,—ybu) + 0 in probability as n-, and consequently Var*(?,)
converges to o in probability.

Proof of ii) We first use Minkowski inequality to obtain

Ig, aa EXE llo+ a2 ¢ Ng, vtVoullos 5/2 me |E*e, rsvoullo, 6/2

where Zl, = (E*|z|P)2/ P_ Then, we apply the argument of Theorem 2 to


show that
E*|%, z sou 2+ 4/2= E|B, = Moule +0 in probability.

To prove this we need the result that Ree (x pate = o(b2t 4/a
This is a well—known fact for iid. r.v.s and it can be extended easily to
m—dependent r.v.s by the splitting technique mentioned in the Appendix.

Proof of iii) Note that

E*Y,=E*[(é,, +--+ &)/2

= [b(n —b + 1)]2
- {b z X; — [(b-1)X, + (b-2)X, +... + 2X9 +Xp_4]
S3[(b-1)X, + (D-2)Xp y+ 2X, 9) + Xap yl}
=X, + [n/(n-b+1) -1]X, + 0,(b/n)
=X, +0 p(b/n) in probability.

To be able to replace E*(Y,) by X,, we require

EX(Y) =X, +00),


Since @ and n are assumed to grow at the same rate, this requirement will
be met in view of the above bound, provided b/yn-70 as n-o.
242 Liu and Singh

Case (ii) General statistic ee


Let us investigate the asymptotic behavior of the MBB
approximation when Tp, = T(F), T, = T F,) and T(-) is a smooth
functional. Our main goal is to establish representation theorems, for both
b being fixed and ba, similar to the ones given in Liu, Singh and Lo
1989). The key elements in this task are probability inequalities on
IF, -F,ll and ||F, — Fil which are extensions of the standard
Dvoretzky—Kiefer—Wolfowitz inequality. These extensions not only help us
obtain the representations, they also clarify how FA approximates Lee and
F which does not manifest itself in the description of MBB.

Lemma 1. For any 6> 0, there exists a universal constant c such that

i) P¥*(|[F5—E*F4l| > 6) < cb exp (-2k6"),

ii) P*(\|F%— Fl] > 6) < cb exp [-2k(6 22)".


Proof: Let us define FA; as the empirical d.f. based on the k iid.
univariate observations consisting of all the ith coordinates from the sampled
blocks SS nis Thus

F%(x) =b
=I [FZ 1) +... +F7 pl
and
b
IPA-Fall
ee
$51 2 FY;x —Foll:
Consequently, by Bonferroni inequality we have

PAUFZ-Fyl> 9
* *
2 PUY; —Fall > 9.
ak
(3.3)
Next we shall show that, for any i, 1<i<b,

E*FY — F,||< 2b/(n — b). (3.4)


The claim ii) follows from (3.3) and (3.4) using the standard
irae scr ati ileal ie i :
To prove (3.4), we first write
E*(E4 ;(x)) = [I(Xj<x) +... + W(X) 44 4$*)]/(n—b+1)
= nF (x)/(n—b+1) + 7,
Moving Blocks Jackknife and Bootstrap Capture 243

ee Iagl Saceet + abe = EET


Therefore,

The claim i) is clearly contained in the above arguments.


This lemma immediately lends itself to the following corollary
concerning the order of ||F4 —F, ||.

Corollary 3. Assume b/n-70 as no. We obtain

o*(n2/2) if b is fixed,
IFE-F,l=4? n
o8(b!/ nog pa Rr eete
Next, we extend the Dvoretzky—Kiefer—Wolfowitz inequality
to the m—dependence case.

Lemma 2. Let Ky X) be a sequence of stationary m—dependent r.v.s.


For any 6> 0, there exists a universal constant c such that

P(E, — Fl > 4) § (m+1)c exp [2 =) (6-2).


The proof of this lemma first uses the splitting technique
mentioned in the Appendix to split the m—dependent sequence into (m+1)
iid. sequences and then the standard Dvoretzky—Kiefer—Wolfowitz
inequality on each of the (m+1) i.i.d. empiricals.

We are now ready to present a representation theorem for


T(F4) - T(F,,) under the MBB procedure.

Theorem 7. Assume b is a fixed positive integer. Let T(-) be (once)


Frechét differentiable at F in the usual sense, i.e.,

T(G) — T(F) = J gp d(G —F) + o(||G — FI]).


Then

T(F+)
(FD) -—T(Fet -15
a eeeee (vy)
Pent -2 5 g(x,iad + 0#(a
iota tip 2/2);
244 Liu and Singh —
aE
DN

Proof: The basic idea is to write

T(F¥) — T(F,) = [T(F%) — T(F)] - [T(F) - TH


The right—hand side of the above is equal to

AS
ails g(v)-2 5 e(X)] +0(NF}—Fi)
j) |Te y +o(IF,—FI).
n (3.5)
Now the two exponential bounds of Lemmas 1 and 2 and the arguments of
Theorem 1 of Liu, Singh and Lo (1989) can be used to show that the
remainder term of (3.5) is of the order ot(n ay,

The following result regarding the asymptotic distribution of


vUT(F4) — T(F,)) can be viewed as a direct consequence of Theorem 5
above, the representation of Theorem 7 and the proposition in Section 1 of
Liu, Singh and Lo (1989).

Theorem 8. Assume the conditions of Theorem 7 above and E g2(X,) <o.


Then

P*(VAT(F$) — T(F,)) <x) — &(«/o,(7,))|1 9 0


in probability as n+, where o¢(T,) = b Var [gp(X,) + --. + Bp(X,Jh
Analogous to Theorems 7 and 8, we have the following results
in the case of b= b,7o:

Theorem 9. Assume T(-) admits the following expansion at F for some Sp:

T(G) —T(F) = J gp 4(G -F) + O(|G - FI).


If b log n/yn70, as no, then
T(F+)DaA\
—T(F n=uot
tse yege
&p(Y;) tee *(n
&p(X;) + on(a t/2 ) as.
Before providing a proof of this theorem, we remark that if the
requirement on the remainder term is relaxed to O(||G — Ft 4), 0< 6<i,
then the theorem will still hold though the conditions on b will be more
stringent.
Moving Blocks Jackknife and Bootstrap Capture 245

Proof: To establish the claim, we essentially need to prove that for any ¢ >
0, PA(IF¥— FI? + IF, = FI? > 1%) 40s. Clearly |IF¥— FI? +
{k= FI? ¢ 2||F5- Fl? so) | P|”. It follows from Lemma 2 that

IE, — FI? = O(n 1/? tog n) as.


With each fixed sequence of X,’s for which ||F, — Fl] = o(nt/ ¢ log n)
a.s., we have

PA(QFF-FaI?2 + SFP? 2 > a?)


—1/2
¢ PAYPAL 2 qa —1/4M4), (3.6)
Then ii) of Lemma 1 implies that (3.6) = 0.
The representation of Theorem 9 immediately leads to the following:

Theorem 10. Under the condition of Theorem 9 and the condition b log
n//n-+0 as no, we have

P*(veCT(F4) — T(F,)) < x) — P(vA(T(F,,) - TOF) < x)]] +0


in probability as n-o.

It may be appropriate to mention here that our theorems of this


section do not cover general sample quantiles since the quantiles are not
Frechét differentiable. Instead they are compactly differentiable, which is a
weaker concept. We do expect that our results can be extended to cover
compactly differentiable functionals. In any case, the consistency of MBB
(though not MBJ) can be achieved for quantiles by an alternative approach
via Bahadur—Kiefer representations.

4. Concluding Remarks.

Remark 4.1. Extension to Linear Regression Model: On time series data


which indicate the presence of a trend and a seasonal effect, the linear
regression model with a m—dependence structure on the errors is a commonly
accepted model. This suggests that the extensions of the resampling
procedures proposed in this article to linear regression models will be quite
useful in time series analysis. We delay the investigation of this extension to
a follow—up study.

Remark 4.2. Extension to More General Dependence Models: Our


discussion throughout the paper is restricted to the m—dependence model
which is the most basic weak dependence model. We expect that the MBJ
and MBB procedures are consistent under much more general dependence
246 Liu and Singh
| |

models, such as the a—mixing model, under reasonable conditions. A study


on this line should be worthwhile.

Remark 4.3. oe Increases_to o if the Covariances are Non—Negative:


Recall that
2
m
o”
(X,,) = Var (X,) + 2 ee cov (X,,X14;)-

One can express o-(X,,) = f Var (X,+...+X,) as

min (b,m) F
Var(X,)+2 YU
aa
(1—f) cov (X,,X 1+i)

Thus if all the covariances are non—negative, then o°(X,) increases


monotonically to o°(X,); as bo. This monotonicity means that even in
the case when b_ is considered fixed, there is a definite improvement
compared to the classical procedures; also the improvement increases as b
increases. Consider the example: X; = ape. + aje,_j+...ta,.¢_., where
{e,} is an i.i.d. sequence of r.v.s and the coefficients a.’s are non—negative.
In this instance, all the covariances turn out to be non—negative and the
above comment on oo(X«) applies. This remark extends to a general
statistic T, if the differential p(x) is a monotonic function of x. For
example, this is the case with the sample median.

On the other hand, if all the covariances are negative, then oe

decreases to o and the above comments on MBB and MBB3J still hold.

In the mixed situation when the covariances are not of the


same sign it is not guaranteed that oe is between o and o for every b.
However, the partial correction on the ith covariance (i.e., cov (X,,X, 44)) is
inversely proportional to i. Since the lower terms of covariances (and the
variance term) typically dominate the overall asymptotic variance, such a
correction is expected to be quite effective.

Remark 4.4. Optimal Choice of b and k: Is there an optimal choice of the


block size b for MBJ or the pair (b,k) for MBB? For the classical
bootstrap it follows from the second order theory (i.e, the one term
Edgeworth expansions) that the bootstrap sample size ¢ should satisfy ¢/n
71, as n+. Such a second order theory for MBB with m—dependent r.v.s
Moving Blocks Jackknife and Bootstrap Capture 247
SSSSSS SSS

is yet unavailable. One way to choose (b,k) is to assume {X;} to be iid.


and choose (b,k) which minimizes the loss of efficiency in comparison with
the classical bootstrap. This idea is motivated by the remark in the
Introduction that replacing the classical bootstrap or jackknife by MBB or
MBJ may be viewed as robustizing both procedures against possible serial
dependence of the observations. Preliminary calculations show that this
pepoint leads to the condition ¢/n +1 for MBB. Further work would be
of value.

Remark 4.5. Functions of Several Means: To demonstrate the applicability


of our theorems proved for the functionals, we consider the following class of
many commonly used statistics. Suppose that X,’s are p—variate, and on
each X, there are q functions: h (Xi) for j = 1,2,...,q. The functional
ny, can be expressed as a smooth function H(-) of the q means
Eh; X,), j = 1,...,q. This class includes, for example, the central moments

E(X — Mp)" in the univariate case, the correlation and regression coefficients
in the bivariate case. In the case of variance: p = 1, q = 2; h, (X) = x?
ho(X) = X. In the case of the correlation coefficient: p = 2, q = 5; h, (X)

=x), nicx) = xP), ngcxy = (xD), n(x) = (KO)? and hy(X) =


x()x(2). here x = (x) x(2)),
For this class of functionals, the smoothness required in our
theorems can be checked simply by expanding H(-) using Taylor’s
expansion, provided H(-) has sufficiently many derivatives (which is often
the case) and the class of distributions is suitably restricted. For instance, in
the simple case of the variance,

I(x — pg)? dG — I(x — wp)” AF = f(x — pp)” a(G -F) — (ug — bp)”
The term (uq — Le) = O(\|G - F||?) if the distributions considered have
compact support.

In order to include other examples such as the correlation


coefficient, our theorems need to be extended to multivariate settings. This
should be a simple task in view of the multivariate DKW inequality (see,
e.g., Serfling (1980)).
Appendix: The splitting technique for m—dependent sequence of r.v.s.

The sum of ra i of m—dependent r.v.s can be split into


(m+1) sums of independent r.v.s. For examples, let m= 1 and n be even,
248 Liu and Singh
LL

then n
D,_,X; = (KX, + Xp t+. + Xy) + (Xq + Xy te + X,)-
Applying the triangle inequality and this splitting technique, many
asymptotic bounds for independent r.v.s can be extended immediately.
Several such extensions are used repeatedly in the text. Some of them are
listed here for references: If {X;} is stationary and m—dependent, then

matt| ; .
1) nD) _ 1X, 7 EX, (if EIX,| <®);
2) BP_,X, = O(a 1/2)2) (it4: EX?2 < @);
3) B[D"_,X,|P = o(nP/?) (it BIX,|? < o);
ay ok 0(n!/2(10g log n)/?) as. (if EX? <a).

References

Billingsley, P. (1968). Convergence of Probability Measures. John Wiley &


Sons.

Katz, M. L. (1963). " Note on the Berry—Esseen Theorem." Ann. Math.


Statist. 34, 1107-1108.

Kunsch, H. (1989). "The jackknife and the bootstrap for general stationary
observations." Ann. Stat. 17, 1217-1241.

Liu, R. (1988). "Bootstrap procedures under some non-i.i.d. models." Ann.


Stat. 16, 1696-1708.

Liu, R., Singh, K. and Lo, S. (1989). "On a representation of the


bootstrap." Sankhya A. 51, 168-177.

Miller, R. G. (1974). "The jackknife—a review." Biometrika 6, 1-15.


Parr, W. C. (1985). "Jackknifing differentiable statistical functionals."
JRSS, B, 47, 56-66.
Serfling, R. (1980). Approzimation Theorems of Mathematical Statistics.
John Wiley & Sons.

Singh, K. (1981). "On the asymptotic accuracy of Efron’s bootstrap." Ann.


Statist. 9, 1187-1195.

Singh, K. and Liu, R. (1990). "On the validity of jackknife procedures."


Scand. J. Stat. 17, 11-21.
Bootstrap Bandwidth Selection

J. S. Marron
University of North Carolina

Abstract: Various bootstrap methodologies are discussed for the


selection of the bandwidth of a kernel density estimator. The smoothed
bootstrap is seen to provide new and independent motivation of some
previously proposed methods. A curious feature of bootstrapping in this
context is that no simulated resampling is required, since the needed
functionals of the distribution can be calculated explicitly.

1. Introduction
This is a review of results concerning application of bootstrap ideas to
bandwidth or smoothing parameter selection. The main ideas are useful in
all types of nonparametric curve estimation settings, including regression,
density and hazard estimation, and also apply to a wide variety of
estimators, including those based on kernels, splines, orthogonal series, etc.
However as much of the work so far has focused on perhaps the simplest of
these, kernel density estimation, the discussion here will be given in this
context.
The density estimation problem is often mathematically formulated
by assuming that observations Ky oX n are a random sample from a
probability density f(x), and it is desired to estimate f(x). The kernel
estimator of f(x) is defined by
A es! n

“Department of Statistics, University of North Carolina, Chapel Hill,


NC 27599-3260. Research Partially Supported by NSF Grant DMS-—8902973
nee ere eee ee ——EEEEESSESSEEEE==—_—==

Exploring the Limits of Bootstrap. Edited by Raoul LePage and Lynne Billard.
Copyright©1992 by John Wiley & Sons, Inc. ISBN: 0471-53631-8.
250 Marron

where Kj, denotes the rescaling Ky (+) = K(-/h)/h, of the "kernel


function" K (assumed throughout to be a symmetric probability density) by
the "bandwidth" h. See Silverman (1986) for early references, good practical
motivation, and useful intuitive discussion concerning this estimator.

The most important hurdle in the practical implementation of f(x)


(and indeed essentially any nonparametric curve estimator), is the choice of
the bandwidth h. While many methods have been proposed and studied, see
the survey Marron (1988) for example, this remains an important problem in
the application of curve estimators in general.
For a mathematical approach to this problem, most workers consider

error criteria. Here the focus will be on the expected Peenorm, or Mean
Integrated Squared Error,

MISE(h) = E f [f,(x) —f(x)]? dx.


Devroye and Gyé6rfi (1984) provide an array of arguments in favor of the ihe
norm, but for simplicity of presentation and development of ideas, extensions
to this challenging case will only be bri2fly discussed at the end.
An important advantage of MISE(h) as an error criterion is that it
admits the simple variance—bias” representation

MISE(h) = V(h) + B“(h)


where

V(h) = 0 “{h "yx? + f(K, *f)?}


B7(h) = f(K,*t-1?
using * to denote the convolution. Deep insight into the bandwidth
selection problem comes from the asymptotic analysis, as n+, h70, with
nh, assuming two uniformly continuous derivatives on f,
v(h) = nt am px? + ofan,

B’(h) = h* (f°)? (fx"K/2)? + o(h’,


see Section 3.3.1 of Silverman (1986) for example. From this one can see
why choice of h is not a simple matter, h small gives too much variability
to the estimator (expected intuitively since there are not enough points in
Bootstrap Bandwidth Selection 251
NN

the effective local average when the window width is too small), h_ big
introduces too much bias (again intuitively clear since a large window width
introduces information from too far away).
Section 2 discusses bandwidth selection by minimization of bootstrap
estimates of MISE(h). In particular it is seen why the smoothed bootstrap
is very important here. An interesting and unusual feature of this case is
that the bootstrap expected value can be directly and simply calculated, so
the usual simulation step is unnecessary in this case.
Asymptotic analysis and comparison of these methods is described in
Section 3. Connection to other methods, including Least Squares Cross
Validation, is made in Section 4. Simulation experience and a real data
example are given in Section 5.

2. Bootstrap MISE Estimation


The essential idea of bootstrapping in general is to find a useful
approximation to the probability structure of the estimation process being

studied, denoted in this case by L{f, (x) — f{(x)}. The usual simple means of
doing this involves thinking about "resampling with replacement". One way
of thinking about this probability structure is through random variables
Tel, which are independent of each other, and of Xj X and are
uniformly distributed on the integers {1,....n}. These new random variables
contain the probability structure (conditioned on X,,...,X,) of the
"resample" ny i ae defined for i= 1,...,.n by
*
xX, = oe
As a first attempt at using this new conditional probability structure
to model the bandwidth trade—off in the density estimation problem, one
might define the "bootstrap density estimator"
wit 12
fh (x) =n sa neat

and then hope the approximation

fF (x) — Fy) Xpp-oX_} L{fy(x) —10)},


x * = a

is useful. Faraway and Jhun (1987) have pointed out that this
252 Marron
a aa COO ee Oe roe DE eet ee Se

approximation is in fact not useful for bandwidth selection because


x 0k * "
E {f, (x)} = E Se sre sre) - f(x),

where E is expected value "in the bootstrap world", in other words with

respect to the bootstrap distribution ce [X,5-X,}, i.e. the average over


all possible values of I-51, This shows that there is no bias in this
bootstrap world, which is disastrous because as shown by the MISE(h)
analysis in section 1, bias constitutes one of the two essential quantities to be
balanced in bandwidth selection. Actually this is not too surprising because
a

f,, is not in fact the density of the et -|X,,-.X,} distribution. Note


moreover that this philosophical flaw can not be simply fixed because this is
a discrete distribution (supported on, {X,,...,X,}) and has no density.
This motivates finding another bootstrap probability structure to

approximate L{ fa(x) — f(x)}. A natural candidate, proposed for bandwidth


selection by Faraway and Jhun (1987) and Taylor (1989), which does have a
density is the smoothed bootstrap, introduced in Efron (1979). An
alternative clever idea based on subsampling is proposed in Hall (1990), but
this will not be discussed further here because connections to other methods,
and comparison with them have not yet been well understood.
A means of studying the smoothed bootstrap, using notation as above,
is to define additional random variables Epps independent of each other
and of Xj,--.X, and I,,....1,, having probability density L|(x), where for
g > 0, L, denotes the rescaling L,(-) = L(-/g)/g. Now redefine the
bootstrap sample by, for i = 1,...,n
*
Xx; = aa + G.
*
Observe that the distribution of X, |X 5--)X n’ i.e. the bootstrap
*
distribution £ {- |X,,..-.X,}, has probability density

ig(*)
(x)=n=ot}oP Eee
Hence it seems natural to study when the approximation
Bootstrap Bandwidth Selection 253

{fy(x)— E(x) |Xp-X_} 9 LfFy(x) —f(x)},


is useful. This depends on the choices of L and g which are discussed in
the next section.

This motivates the use of the bandwidth h which minimizes (break


ties arbitrarily) the MISE in the bootstrap world,

MISE(h) = Ef {fy (*) - £00)? dx,


= V (h) +B (nh),
where fh is defined as above but using the smoothed bootstrap data, and
where

Vi(h) = 2! {ho fx? + §(K,*f,)7},


ie0 es f(Ky*t, — f,)”.
An interesting fact about this setup is that the desired functionals of
the bootstrap distribution can be simply calculated, so the usual
computationally expensive simulation step is unnecessary. This is usually
the case only for very simple examples, see section Chapter 5 of Efron (1982),
although an interesting exception is in quantile estimation, see Sheather and
Marron (1990).
* * *
It is straightforward to compute MISE (h) because V and B :
admit the simple representations

Vi ¥ Ch) = HAT? + PEE {Ry KKH;—XV)


—1,—-liy2 , _-2

“ay _ 2 *K. *K *K —2K,*K *K K *K (xX. —X.


pea g 8 hog “et Sg gl ):
ES

Hence calculation of ie requires about the same computational effort as the


least squares cross—validated bandwidth (discussed in section 3.4.3 of section
Silverman 1986).
For a completely different approach to bootstrapping in density
estimation, see Hall (1990).
254 Marron

3. Asymptotics
In this section, choice of g and L is considered. A sensible first
attempt, see Taylor (1989), would be to try g = h and K=L. This can be
easily analyzed using the assumptions and asymptotics at the end of section
1, with the important part being

B (nh) ¥ bY f(f,’")” (Jx-K/2)


This presents a problem, because for g ~ nt/ wf which is the reasonable

range for h see section 3.3.2 of Silverman (1986) for example, fg (x) does
not even converge to f’’(x) (because the variance does not tend to 0). For
this reason, Faraway and Jhun (1987) propose using g > h. However
observe that f’’(x) is not what is needed here, instead we need the
functional {(f’ as which is a different problem. Indeed for g ~ ntl a Hall
and Marron (1987) show that

[liga sa f(e2y
although this choice of the bandwidth g is quite inefficient in the sense that
it gives a slower than necessary rate of convergence.
A means of quantifying this inefficiency, which is relevant to
bandwidth selection, is to study its effect on the relative rate of convergence.
In remark 3.6 of Hall, Marron and Park (1990), it is shown that

(ufleler
z
yp - ma_—1/10 ay,
when g=h, where hp denotes the minimizer of MISE(h). This very slow
rate of convergence is the same as that well known to obtain for least squares
cross—validation, and for the biased cross—validation method of Scott and
Terrell (1987) (which uses g =h in a slightly different way). For this
reason, as well as the fact that the appropriate bandwidth for estimating

f(f aye is different from that for estimation of f(x), the choice g=h does
not seem appropriate.
Bootstrap Bandwidth Selection
SSS
255

Good insight for the problem of how to choose g_has been provided
by the main results of. Hall, Marron and Park (1990). The minor
modification of these results presented here is explicitly given in Jones,
Marron and Park (1990), where it is seen that if f has slightly more than
four derivatives, and L is a probability density, for C,, C, and Cy
2
constants depending on f, K and L,

(a /bg)—1 9 Cyn gz, + (Cog? + Con's),


* d —] -—9/2 2} -l -—

where Z is a standard normal random variable. Note C, = 0. and

n 2/5 gives the slow n~ 1/10 rate in the above paragraph. This
expansion is important because it quantifies the trade-off involved in the
choice of g. In particular there is too much "variance" present if g +0 too
rapidly, and a "bias" term that penalizes g 40 too slowly. This variance
and bias can be combined them into an "asymptotic mean squared error"
which can then be optimized over g to see that the best g has the form
P —1/7
& C,(£,K,L) n / ?

which gives

ayia vsat
Data based methods for estimating C 4 are given in Jones, Marron and Park

(1990). Note that this rate of convergence is much faster than n 2/10)
A natural question at this point is: can the rate nol fe be
improved? As noted in remark 3.3 of Hall, Marron and Jones (1990), by
taking L to be a higher order kernel, this rate can be improved all the way

up to the parametric mon! (L needs to be of “order 6" for this fast rate).
This rate has been shown to be best possible by Hall and Marron (1990).
However there is a distinct intuitive drawback to this in that when L is a
higher order kernel, it can no longer be thought of as a probability density,

sO h is no longer a bootstrap bandwidth, at least in the usual sense of the


word.
A more intuitive way of achieving root n convergence is given in
Jones, Marron and Park (1990), who consider factorizations of g, in terms
256 Marron
aa,

of h, of the form

g=C a

In particular for m = —2 and suitable p they obtain


* =
ge glo
in the much more natural case K = L.

4. Connection to Other Methods


*
Bandwidths that are essentially the same as h have been derived
from considerations much different from bootstrapping. In particular note
that the dominant part of the representation of V(h) at the end of section 1

does not depend on f, so it is natural to estimate V(h) by nh! (K?,


which is asymptotically equivalent to V (h). The fact that B 2(h)
provides a natural estimate of B2(h) can also be derived in a natural way
by thinking about replacing the unknown f in B? by the pilot kernel

estimator te Such non—bootstrap motivations for a bandwidth selector

very close to h. were developed independently in an unpublished paper by


Ramlau—Hansen and in the related regression setting by Miiller (1985). See
Chiu (1990) for related ideas from the Fourier Transform point of view.
Hall, Marron and Park (1990) motivate a very similar bandwidth
selector, but by a different method. They propose decreasing the variability
of the least squares cross—validated bandwidth through a "pre—smoothing" of

the pairwise differences of the data. Note that, using fh to denote the
kernel estimator based on the sample with X. excluded, the least squares
cross—validation criterion can be written in the form

CV) = ffh 2—2nja etx)


hj j
gies 2), Ai aed

where the approximation comes from replacing a factor of n by (n-1)? :


Bootstrap Bandwidth Selection DOL
SS SSS

Note that the first term provides the same sensible estimate of V(h)
discussed in the paragraph above, while the second term has features

reminiscent of the representation of B 4 given at the end of Section 2. To


. . . * .

make this connection more precise, note that when there are no replications
among Xp Xs the second term is the limit as g 70 of

B’(h) = 2(n-1)13 BD, (x, -X),


where
ip jie SBilon ised
D,_ = {K,*K,*K *K_—2K,*K *K_ 4K *K }.
hg = (Ky"Ky", °K, nhKg°K, + K,*K,}
Note also that by the associative law for convolutions, one may view this as
first plugging the differences into K_*K_, and then putting the result into
the bias part of CV, which is why this idea was called smoothed
cross—validation by Hall, Park and Marron (1990).
The important difference between B“(h) and B (h) is whether or
not the "terms on the diagonal" are included in the double sum. At first
glance one may feel uncomfortable about these terms because they do not
depend on the data. At second glance it is not clear that they will have a
very large effect, unless g <= h, when they contribute a term of order

nhl. For this reason Taylor(1989) deleted these terms in his g = h


implementation of the smoothed bootstrap. However more careful analysis,
*
in the case g >> h, shows a rather slight theoretical superiority of B =

over B has been demonstrated by Jones, Marron and Park (1990) in terms
of the relative rate of convergence. Simulation work has also indicated
usually small superiority of the diagonals in approach, although the
improvement is sometimes much larger because the diagonals out version is
*
less stable. One possible explanation as to why this happens is that B 2 is

the smoothed bootstrap estimate of B2, while B 2 does not seem to have
any such representation.
Faraway and Jhun (1987) have pointed out that the bootstrap
approximation can be used to understand the bandwidth trade-off entailed

by other means of assessing the error in f,. For example one could replace
258 Marron
TT

the L? based MISE with the expected L! norm. A major drawback to


this is that it seems that an exact calculation of the bootstrap expected value’

E’ is nolonger realistically available. Hence this expected value will need


to be evaluated by simulation, which will be far more (perhaps —
prohibitively?) expensive from the computational viewpoint. Another
example is the replacement of MISE by the pointwise Mean Squared Error,
where one focuses on estimationof f at one fixed location x. Here the exact

calculation of E can be done, however this has not been explored carefully
yet, mostly because it seems sensible to postpone investigation of this more
challenging pointwise case until more is understood about
the global MISE
problem.

5. Simulations and an Application


To see how these methods worked in a simulation context, various
versions of the bootstrap bandwidth selectors were tested. Several methods
of choosing the pilot bandwidth g, as discussed in Jones, Park and Marron

(1990), including immediate use of aN(0,c°) reference distribution and also


estimation of the unknown functionals as suggested in Jones and Sheather
(1990),were tried. The results were usually better when estimates were
used, so one step estimators of this type were used for the following
discussion. To speed the computations, a binned implementation of the type
describedin Scott and Hardle (1990) was employed.
For this, 500 pseudosamples
of size nan = 100 were generated
from
the normal
mean mixture density described
in Park and Marron (1990). The
results are visually summarized in Figure 1, which is very similar to Figure
3b in that paper. The bandwidth selectors CV, BCV and OS there are
not shown here, because as one would expect from the results of that paper
they were inferior to these newer ones. PI is the main bandwidth selector
discussedby Park and Marron. Note that Taylors g = h method
performed quite poorly in comparison to the others, with a strong bias
towards oversmoothing. This poor performance is not surprising in view of
the theoreticalresults described above. The simple bootstrap, denoted BSS,
Bootstrap Bandwidth Selection 259

huise = 0.385 n = 100

3
2

/MISE(huise)
MISE(h)
1

gar -0.5 (Sr 0 0.5 | ae


Logs(h) — Log3(hwise
Figure 1: MISE and kernel density estimates of the distributions
of the automatically selected bandwidths (log 3 scale). Based on
500 Monte Carlo replications of samples of size 100 from the model
.5N(—1,4/9) + .5N(1,4/9).

which uses a data based g chosen independently of h, gave performance


roughly comparable to the Park and Marron PI. It is not straightforward to
compare these, because there is slightly more bias, but slightly less

variability. However the bandwidth factorized bootstrap, i.e. the nl/ H


method described at the end of Section 3, denoted BSF, gave much better
performance, having less variability and also less bias than the others.
These selectors have also been tried for other sample sizes and other
densities as well. For those densities not too far from the Normal in shape,
the asymptotics describe the situation well, with larger sample sizes giving
260 Marron

more rapid improvements in BSF than the others (as expected from its
faster rate of convergence). For the N(0,1) BSF gave really superlative
performance, in fact even beating out the Normal reference distribution
bandwidth given at (3.28) of Silverman (1986). For densities which are still
unimodal, but depart strongly from the normal in directions of strong
skewness or kurtosis, the performance was not so good (in fact CV is
typically the best in terms of MISE), but can be improved a lot by using a

Figure 2: Expanded respresentation of 16 density estimates


for incomes in the United Kingdom, 1968-1983. Bandwidths
chosen by bandwidth factorized smoothed bootstrap.

scale estimate which is more reasonable than the sample standard deviation
in such situations, such as the interquartile range. On the other hand when
f is far from normal in the direction of heavy multimodality, again most of
these newer bandwidth selectors were inferior to CV in the MISE sense,
but the sample standard deviation was a more reasonable scale estimate than
the IQR. A way to view both of the above situations, is that they are cases
Bootstrap Bandwidth Selection
a
261

where it takes very large sample sizes before the effects described by the
asymptotics take over. There is still work to be done in finding a bandwidth
selector which works acceptably well in all situations.
To see how well these methods work on a real data set, they were
tried on the income data shown in Figure 2 of Park and Marron (1990). The
data and importance of that type of display are discussed there. Several of
the bootstrap bandwidth selectors considered in this paper were tried on this

data set. The best result was for SBF with the N(0,07) reference
distribution used immediately. Figure 2 here, which compares nicely to
Figure 2 in Park and Marron shows the result. The other variants, involving
estimation steps in the choice of g, tended to give smaller bandwidths,
which are probably closer to the MISE value, but gave estimates that are too
rough for effective presentation of this type.

REFERENCES

Chiu, S. T. (1990) Bandwidth selection for kernel density estimation,


unpublished manuscript.

Devroye, L. and Gy6rfi, L. (1984), Nonparametric density estimation: the


L, view. Wiley, New York.

Efron, B. (1979) Bootstrap methods: another look at the jackknife, Annals


of Statistics, 7, 1-26.

Efron, B. (1982) The jackknife, the bootstrap and other resampling plans,
CBMS Regional Conference series in Applied Mathematics, Society
for Industrial and Applied Mathematics, Philadelphia.

Faraway, J. J. and Jhun, M. (1987) Bootstrap choice of bandwidth for


density estimation, unpublished manuscript.

Hall, P. (1990) Using the bootstrap to estimate mean squared error and _
select smoothing parameter in nonparametric problems, to appear in
Journal of Multivariate Analysis.

Hall, P. and Marron, J. S. (1987) Estimation of integrated squared density


derivatives, Statistics and Probability Letters, 6, 109-115.
262 Marron

Hall, P. and Marron, J. S. (1990) Lower bounds for bandwidth selection in


density estimation, to appear in Probability Theory and Related
Fields.

Hall, P., Marron, J. S. and Park, B. U. (1990) Smoothed cross—validation,


unpublished manuscript.

Jones, M. C. and Sheather, S. J. (1990) Using nonstochastic terms to _


advantage in kernel—based estimation of integrated squared density
derivatives, unpublished manuscript.

Jones, M. C., Marron, J. S. and Park, B. U. (1990) A simple root n


bandwidth selector, unpublished manuscript.

Marron, J. S. (1988) Automatic smoothing parameter selection: a survey,


Empirical Economics, 13, 187—208.

Miller, H.—G. (1985) Empirical bandwidth choice for nonparametric kernel


regression by means of pilot estimators, Statistics and Decisions,
Supplement no. 2, 193—206.

Scott, D. W. and Hardle, W. (1990) Weighted averaging using rounded


points, to appear in Journal of the Royal Statistical Society, Series B.

Scott, D. W. and Terrell, G. R. (1987) Biased and unbiased


cross—validation in density estimation, Journal of the American
Statistical Association, 82, 1131-1146.

Sheather, S. J. and Marron, J. S. (1990) Kernel quantile estimation, to


appear in Journal of the American Statistical Association.

Silverman, B. W. (1986) Density Estimation for Statistics and Data


Analysis, Chapman and Hall, New York.

Taylor, C. C. (1989) Bootstrap choice of the smoothing parameter in kernel


density estimation, Biometrika 76, 705—712.
A CIRCULAR BLOCK-RESAMPLING PROCEDURE FOR STATIONARY DATA
by
Dimitris N. Politis and Joseph P. Romano
Department of Statistics Department of Statistics
Purdue University ~ Stanford University
West Lafayette, IN 47907-1399 Stanford, CA 94305

ABSTRACT
A block-resampling bootstrap for the sample mean of weakly dependent stationary
observations has been recently introduced by Kitinsch (1989) and independently by Liu and
Singh (1988). In Lahiri (1990) it was shown that the bootstrap estimate of sampling dis-
tribution is more accurate than the normal approximation, provided it is centered around
the bootstrap mean, and not around the sample mean as customary. In this report, we
introduce a variant of this block-resampling bootstrap that amounts to ‘wrapping’ the data
around in a circle before blocking them. This ‘circular’ block-resampling bootstrap, has
the additional advantage to be automatically centered around the sample mean. The con-
sistency and asymptotic accuracy of the proposed method are demonstrated in comparison
with the corresponding results for the block-resampling bootstrap.
AMS 1980 subject classifications: Primary 62G05; secondary 62M10.
Key words and phrases: Resampling schemes, bootstrap, time series, mixing sequences,
weak dependence, nonparametric estimation.

EES —_——_—=u=
Haaeee essen eee eee

Exploring the Limits of Bootstrap. Edited by Raoul LePage and Lynne Billard.
Copyright©1992 by John Wiley & Sons, Inc. ISBN: 0471-53631-8.
264 Politis and Romano
ae
a ____._..___ |

1. Introduction

Suppose X;,...,Xwy are observations from the (strictly) stationary multivariate sequence
{X,,n € Z}, and the statistic of interest is the sample mean Xy = N-* vu, Xi. The sequence
{X,,n € Z} is assumed to have a weak dependence structure. Specifically, the a-mixing (also
called strong mixing) condition will be assumed, i.e. that ax(k) — 0, as k — oo, where

ax(k) = supyp|P(AN B) — P(A)P(B)|, and A € F2,,,B € Ff are events in the o-algebras


generated by {Xn,n < 0} and {X,,n > k} respectively.
The objective is to set confidence intervals for » = EX,, for which an approximation to
the sampling distribution of Xn is required. For this reason, a block-resampling bootstrap
procedure has been introduced by Kiinsch (1989) and independently by Liu and Singh (1988).
This method can be described as follows:

e Define B; to be the block of 6 consecutive observations starting from X;, that is B; =

(X;,...,Xi4s-1), where i= 1,...,q and g = N—6+1. Sampling with replacement from


the set {B,,...,8,}, defines a (conditional on the original data) probability measure

P*. If k is an integer such that kb ~ N , then letting €,,...,& be drawn i.i.d. from
P", it is seen that each €; is a block of b observations (€;1,...,&,5). If all | = kb of
the &;;’s are concatenated in one long vector denoted by ¥;,...,Y;, then the block-
resampling bootstrap estimate of the variance of /NXw is the variance of VIY; under
P*, and the bootstrap estimate of P{V/N(Xw — #) < 2} is P*{Vi(¥; — Xn) < x}, where
y= +e Y;:

It is obvious that taking 6 = 1 makes the block-resampling bootstrap coincide with the
classical (i.i.d.) bootstrap of Efron(1979) which has well-known optimality properties (cf.
Singh(1981)). It can be shown (cf. Lahiri(1990)) that a slightly modified bootstrap estimate
of sampling distribution turns out to be more accurate than the normal approximation, under
some regularity conditions, resulting to more accurate confidence intervals for ». The modifi-
cation amounts to approximating the quantiles of P{/N: (Xn — pu) < 2} by the corresponding
quantiles of P*{Vi(Y; — E*Y;) < z}, where E*Y; denotes the expected value of Y; under the
P* probability (conditional on the original data). The need for re-centering the bootstrap dis-
Circular Block-Resampling for Stationary Data 265

tribution so as to have mean zero can also be traced back to Kiinsch’s (1989) short calculation
of the skewness of his block-resampling bootstrap.
The reason that the re-centered bootstrap provides a more accurate approximation is that
E*Y; = q7* Di, 6 Litt") X; = Xn + O,(b/N), where, for consistency of the bootstrap in
the dependent data setting, b —- oo as N — oo. In other words, the distribution of Y; under P*

possesses a random bias of significant order. This bias is associated with the block-resampling
bootstrap that assigns reduced weight to X;’s with i < bor i > N—6+1. In other words, if we
let P* be the limit (almost sure in P*) of the proportion /~!(number of the Y;’s that equal X;)
as | — oo, (and assuming no ties among the X;’s), although P* = 6/R, with R = b(N —b+1),
for any i such that b < i < N — 6+11, this proportion drops to P* = i/R, for any i < b, and
Pr =(N -i+1)/R,
for anyi>N-6b+1.
266 Politis and Romano
ee eee

2. A circular block-resampling bootstrap

A simple and ‘automatic’ way to have an unbiased bootstrap distribution is to ‘wrap’ the
X;’s around in a ‘circle’, that is, to define (for i > N) X; = Xj,, where iy = i(modJN), and
Xo = Xn. This idea is closely associated with the definition of the circular autocovariance se-

quence of time series models. The ‘circular’ block-resampling bootstrap amounts to resampling
whole ‘arcs’ of the circularly defined observations, and goes as follows.

e Define the blocks B; as previously, that is, B; = (X;,..., Xji4p-1), but note that now for
any integer 6, there are N such B;, j = 1,...,N. Sampling with replacement from the set
{B,,...,Bn}, defines a (conditional on the original data) probability measure P*. If k is
an integer such that kb ~ N, then letting €,,...,€, be drawni.i.d. from P%, it is seen that
each €; is a block of b observations (€;1,...,&,b). If all / = kb of the €;;’s are concatenated
in one long vector denoted by ¥;,...,¥;, then the ‘circular’ block-resampling bootstrap
estimate of the variance of VN Xn is the variance of VIY; under P*, and the ‘circular’
block-resampling bootstrap estimate of P{/N(Xw — ») < 2} is P*{Vi(¥%; — Xn) < 2},
where Y; = } y-/_, Yj.

The ‘circular’ construction is an integral part of a related resampling method in which blocks
of random size are resampled (cf. Politis and Romano (1991)). It can also be immediately
applied to bootstrapping general linear and nonlinear statistics, as in Kiinsch (1989), Liu and
Singh (1988), and Politis and Romano (1989). Kiinsch’s proposal of ‘tapering’ the observations
in the B; blocks can also be incorporated in the ‘circular’ construction without changing the
first-order asymptotic results.
The following two theorems concern the consistency and asymptotic accuracy of the ‘circu-
lar’ block-resampling bootstrap. The theorems are stated for univariate sequences {X,,n € Z},

although their extension to multivariate settings is straightforward.

Theorem 1 Assume that E|X;|°t* < 00, for some 6 > 0, and 52, k?(ax(k))®? < oo. As
N = 00, let 1/N — 1, and let b + 00, but with bN-! + 0. Then 02, = Var(VNXw) has a
finite limit 02,, and Var*(V1¥;) © 03, where Var*(VIY;) is the variance of VIY; under P*
Circular Block-Resampling for Stationary Data
nn eters se _ rin theres, Jeet pheeee er267

conditional probability, as well as

sup |P*{V1(¥i - Xn) < 2} - P{VN (Xn — 2) < 2}| +0 (1)


for almost all sample sequences X,,...,Xn-

Proof. The proof of Theorem 1 is immediate in view of the proof of the consistency
of the block-resampling bootstrap of Kiinsch(1989). If we let E*, E*,Var*,Var*, represent
expectation and variance under the P* and P™ probabilities, then it is easily calculated that
E*Y, = Xn, and that
i+b—1
Var*(Vi¥;)= Ne »» X;- Xn)
ct

N—b+1 t+b—1 N t+b-1 f E


=f > (0 SS Xi -XwP+ YO (Ot YS Xi - Xw)*} = Var*(Vi¥;)
+ 0,(6/N)
t=1 j= i=N—b+2 j=

where it was used that E*Y; = Xn + O,(b/N), and


Viv b — aS ; :
Ver*(ViN)= —————
N-b+1 & (d- rr X; - E*Y;)

——
FN bat = (ot = oe— Xn)? + P 0,(b/N)

Since Var*(V1¥;) o2,, it is seen that the first two moments of V/Y; under the P* probability
are asymptotically correct.

Finally, the moment and mixing conditions assumed are sufficient (cf. Hall and Heyde(1980))
to imply that /N(Xw — ) is asymptotically normal N(0,02,). Noting that V/I(¥; — Xw) is
also asymptotically normal (conditionally on the data X),...,Xyn), the proof is concluded. O
It is noteworthy that to make the bias of the block-resampling bootstrap distribution to be
of smaller order than its standard deviation, Kiinsch (1989) imposed the additional condition
bN~-1/2 —, 0 which is unnecessarily strong in our setting.

Theorem 2 Assuming that the sequence {X,,n € Z} is defined on the probability space
(Q,A,P), denote D,,n € Z, a sequence of sub o-fields of A, and Dy? the o-field generated
by Dn,,---;Dn,- Also assume that E|X,|®+® < 00, for some 6 > 0, and, as N — o, let
268 Politis and Romano

/N — 1, and b — co, but with bN-1/8 — 0. Under the following regularity conditions:
ao) ax(k) decreases geometrically fast;
a) 3d > 0 such that for all k,n € N, with n > 1/d, there exists a Der measurable random
ariable Zin, for which E|Xk—Zkn| < d-te78", and E|Zrnyl?2(|Zisny| < &/4) < d=}, where
& i8 a sequence of real numbers satisfying logk= o(n,) and nz = O(log k)i+8 |as k + co;
a) 3d > 0 such that for allk,n€ N, with k >n>1/d, and allt > d,

E| E(eit(Xna-ntXnangi tet Xanga) D;, j # ky < en?

here j is used to denote the imaginary unit /—1;


ag) 3d > 0 such that for all k,ni,n2 € N, and A € Dnt,

E|P(A|D;, i# m1) — P(A[D;,0 < [my — i] < k +.n9)| < d-*e7*;


he following is true (provided of course that 02, > 0),
¥, - Xn
sup |P*{Vi /Vare(ViFi) <2} - P(VNANSE < 2}| = o(N-") (2)
Proof. As the proof of Theorem 1 relied on comparing the first two moments of //1(¥;-Xy)
nder P* with the corresponding ones under P*, the proof of Theorem 2 follows by looking at
he third order moment. Specifically, under the regularity conditions we have assumed, first
rder Edgeworth expansions for P* {VI ae < z} and for P{/N Sats < 2} are valid (cf.
ahiri(1990) where an extensive discussion relative to these regularity conditions can be found).
‘urthermore, equation (2) would be true, provided b?E*(Uy - Xn)$ — N7E( Xn —y)§ = 0,(1),
vhere Uf = 67} beer &1,3, and the &1,...,€1,4 are the elements of the first block-resample
rawn from P*.
However, in Lahiri(1990) it was shown that b?E*(Ut — E*U7)* — N?E(Xy — n)§ = o,(1),
there Uf = 67} Bat 1,3, and the €11,...,€1,5 are the elements of the first block-resample
awn from P*. Finally, it is easily seen that
id=
E*(U; - Ry) = 250 Fx Ry)
i=1 j=

N —b+1 i+b—1 i+b-1


xt Yate s X;- Xn) + > (67! e X;-Xy)§}= E*(UZ- E*UZ)>
+0, (b/N)
t=1 j= i=N—b+2 j=i
Circular Block-Resampling for Stationary Data 269

where it was used that E*Ut = Xn + O,(b/N), and

"(Us _~ E*U* re hos es p- ae Xe EY; 3


E*(U; Wes eel ye rt Ys t i)
i=1 =i

1 Neb tbo Ae
Spare) » (o-* So Xi
- Xn)? + 0,(/N)
i= j=i

Hence, E*(Ux — Xy)* — E*(Uy — E*UZ)> = O,(b/N) = 0,(b-?), and the proof is concluded.O
270 Politis and Romano

References

[1] Efron, B. (1979), Bootstrap Methods: Another Look at the Jackknife, Ann. Statist., 7,
1-26.

[2] Hall, P. and Heyde, C. (1980), Martingale Limit Theory and its Applications, Academic
Press, New York.

[3] Kiinsch H.R. (1989), The jackknife and the bootstrap for general stationary observations,
Ann. Statist., 17, 1217-1241.

[4] Lahiri, S.N.(1990), Second order optimality of stationary bootstrap, Technical Report
No.90-1, Department of Statistics, Iowa State University, (to appear in Statist. Prob.
Letters).

[5] Liu, R.Y. and Singh, K. (1988), Moving Blocks Bootstrap and Jackknife Capture Weak
Dependence, unpublished manuscript, Department of Statistics, Rutgers University.

[6] Politis D.N. and Romano, J.P. (1989), A General Resampling Scheme for Triangular
Arrays of a-mixing Random Variables with application to the problem of Spectral Density
Estimation, Technical Report No.338, Department of Statistics, Stanford University.

[7] Politis D.N. and Romano, J.P. (1991), The Stationary Bootstrap, Technical Report No.
365, Department of Statistics, Stanford University.

[8] Singh, K.(1981), On the asymptotic accuracy of Efron’s bootstrap, Ann.Statist., 9, 1187-
1195.
Some applications of the bootstrap
in complex problems
Robert Tibshirani
Department of Preventive Medicine and Biostatistics
and
Department of Statistics
University of Toronto

1 Introduction
In this paper I give two examples of the application of the bootstrap in some
complex problems. These problems arose as part of statistical consultations.

2 Prediction limits for exercise output


Dr. G. Canny of Toronto’s Hospital for Sick Children collected data on the
exercise performance for 118 normal children. Each child was given a series of
increasing workloads on a stationary bicycle and the heart rate was recorded
at each workload. The workloads were irregularly spaced and different for
each child, and the maximum workload used depended somewhat on the
observed heart rate. Sex and height of the children were recorded and were
thought to be important factors. The objective was to construct heart rate
prediction limits for each sex-height group for use in testing children who
were suspected to have asthma.
In order to obtain heart rate values for a common set of workloads, an
interpolating cubic spline was fit to the data for each child. Heart rate values
for workloads 0 to 250 watts in increments of 25 watts were then extracted.
For each sex-height group a cubic smoothing spline was then fit to the mean

Exploring the Limits of Bootstrap. Edited by Raoul LePage and Lynne Billard.
Copyright©1992 by John Wiley & Sons, Inc. ISBN: 0471-53631-8.
212 Tibshirani
ae

heart rates. The smoothing parameter was chosen so that the degrees of
freedom of the fitted curve was about 4 (see Hastie and Tibshirani 1990
chapter 3). This fixed smoothing parameter was used throughout. Despite
the correlation across workloads for a given individual, the mean heart rate
value is the generalized least squares estimate of the within group mean; see
for example Rice and Silverman, 1988. The results are the solid curves in
Figure 1.
The construction of prediction bands for these curves seems very difficult
analytically. Denote the estimated curve at workload w by f(w). Since the
smoothing operation is linear, an estimate of the pointwise standard error of
f(w) is easily obtained, say s(w). Then if our nominal level is 80%,

f(w) + 1.28(1 + ~)s(w) (1)


represents an approximate central pointwise prediction interval for normal
child’s heart rate profile. However,,what is wanted here is a global rather
than pointwise band, so that there is a fixed probability of a new child’s heart
rate falling outside of the bands at any of the workloads 0, 25,...250. For this
reason, the pointwise bands need to be widened. In addition, there is corre-
lation across workloads for a given child and this will also make the pointwise
bands too narrow. Still another factor is the assumption of normality used
in construction of (3).
The bootstrap can be used to estimate the amount by which the pointwise
bands should be widened. Consider prediction bands of the form

(f(w) — ex(1 + =)s(w), Fw) + e2(1 + =)s(0)) (2)


We allow c; and c2 to differ to allow the bands to be different widths below
and above the curve. Then we seek the value of c; so that the probability of
a new curve falls outside of the lower band at any workload is 10%, and sim-
ilarly for cz. To apply the bootstrap, we sample children with replacement,
sampling x; times from the jth sex-height group, where z; is the number of
children in the jth group. For a fixed values of c; and cz, prediction bands
of the form (1) are constructed from the bootstrap data, using interpolating
and cubic smoothing splines as described above. Then the number of times
that the original heart rate profiles fall outside the bands (at least once) is
Applicat
po Ee ions
a SAof Bootstra
EERE p ee
in Complex
tse Problems
ate 293

Global prediction bands for heart rates

250
200
150
100 200
150
100
250
(watts)
workload (watts)
workload
50 50
172-186
height
male, 0 0
height
female,
172-186

eye) yeeY

150
100
250
200 150
100
250
200
(watts)
workload (watts)
workload
50 50

height
male,
166-171 0 0
female,
166-171
height

e181 We8Y

250
200
150
100
(watts)
workload (watts)
workload
50 250
200
150
100
50

height
male,
162-165 0 0
height
female,
162-165
00¢ Ost 001 os

eyes WESY eyes yee

250
200
150
100 250
200
150
100
(watts)
workload workload
(watts)
50 50

height
male,
152-161 0
height
female,
152-161 0

eyes yeey

150
100
250
200 150
100
250
200
workload
(watts) (watts)
workload
50 50

height
male,
137-151 0
height
female,
137-151 0

ey) eoy eyes wey

Fig. 1. Solid curves are estimate of heart rates at each workload, for children in
indicated sezx-height group. Broken lines are 80% global prediction bands. Tick
marks on the horizontal axis indicate workloads where measurements were taken.
274 Tibshirani
eer

computed. This entire process is repeated, increasing c, and cz until the


error rate, accumulated across all groups, is 10% below and above the curve.
This is an example of one of the prediction error estimators studied in
Efron (1983).
The resulting values of c, and cz were 2.8 and 2.4, much larger than the
pointwise values of 1.28. The global prediction bands are shown in Figure 1.
There are clear differences between males and females, and between shorter
and taller children. Since 11 workload values were used, a Bonferroni-type
adjustment for the pointwise values would use ®-1(1—.1/11) = 2.36, which is
close to cy but not c,. It is interesting to note that although the Bonferroni
bounnds are usually conservative, the bootstrap bands, which account for
other potential sources of error (e.g nonnormality) are wider in this problem.
A potential concern, pointed out by a referee, is the bias of a smoother
in estimating the true underlying curve. In this particular problem, the
curvature of the underlying functions does not seem to be substantial and
hence bias is not a major concern. However, in general one might need to
account for bias in forming the prediction bands.

3 Clustering of cortical cells


Dr. Derek Vanderkoy and Les Krushel of the Dept. of Anatomy, University
of Toronto, conducted experiments to study the way in which cells evolve in
the layers of the cortex from stem cells in the interior. The data consists of
cell counts in 6 layers along “spokes” in a cortical cell of 17 animals. The
layers are labelled 2 to 6. Each spoke represents descendents from a single
stem cell; there are measurements on approximately 9 spokes per animal.
A certain theory suggests that the cells do not spread themselves uni-
formly through the layers but cluster in layers 2,3,4, or 5,6 in a given spoke.
Let x;; represent the cell count in the jth layer for the ith spoke, 2;4 be
the total count in the ith spoke, and x44 be the total cell count. Figure 2
shows a plot of (4-2 ti; — D$_s vi;)/zi4 by animal. The plotting symbol is
Uj.

There seems to be an abundance of points near -1 and 1, giving some


credence to the clustering theory.
To formally test this, we consider the mixed multinomial model
Applications of Bootstrap in Complex
Problems 215

+ MDM wn NGto~ comm

Oo cmt mw SO

NN

animal

O'l s‘0 0G S‘0- O'l-


qoid"yIp
Fig. 2. Plot of difference in proportion of cells in layers 2,3,4
vs 5,6, by animal.
The plotting symbol is the total cell count in a given spoke.
276 Tibshirani

(xi2 ers Lis) ~ (1 = 0)M (i+, {p1, Pi, P1, 91, q}) =F OM (i+, {p2, P2, P2, q2, q2}) (3)

where M(m, {p1,p2,---ps}) represents a five-category multinomial dis-


tribution with n draws and success probabilities p1,...ps, and 1/5 <p <
1/3, 0 < pp < 1/5, and 3p; + 2g; = 1, 7 = 1,2. The idea behind this
model is as follows. Examination of the data revealed that r42 % 243 ©
t44, and r45 © ty. Thus a reasonable “no-clustering model” would be
M(zi+,{P1,P1,P1,%,}) for the counts in the ith spoke. The alternative
model (2) (along with the restrictions on the p;s) says that the population is
a mixture of two populations, one that clusters in layers 2,3,4 and the other
in 5,6. The parameter 0 determines the proportion of each type. The “no
clustering” model results if 9 = 0 or 1 and is therefore nested in the mixture
model. Note that at this point we are ignoring potential animal effects.
We fit this model to the data by maximum likelihood. The estimates
were 6 = .556, 61 = .253, po = .060, giving a log-likelihood value of Imax =
—769.97.. The model is suggesting an approximate 50-50 mixture of two
multinomials with probabilities (.253, .253, .253, .12,.12) and (.06, .06, .06, .41, .41).
The estimates for the no clustering model were p = .154 and 1°, = —807.47.
The twice log-likelihood ratio statistic R = 2[lmaz — (°,,,] equals 75.0, a very
large value considering that the mixture model has only 2 more parameters
than the no clustering model. However, it is not clear that the distribution
of R is approximately x3.
To investigate the distribution of R, we apply a simulation or bootstrap
technique. Datasets are generated from the no clustering model, using py,
fixing the x; at their observed values, and computing R for each dataset.
The maximum value of R over 200 simulations was about 6, suggesting that
the observed value of 75.0 is highly unlikely to have occurred by chance if
the no clustering model truly held.
Animal effects could still be inflating the value of R, so we tried sampling
by animal as well. The maximum value of R was 35, so we still conclude that
the observed value is highly significant.
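A minimal sketch of this parametric bootstrap in Python (not part of the original analysis); the spoke totals x_plus, the pooled estimate p_hat, and the numerical optimizers are illustrative assumptions standing in for the actual data and fitting code:

import numpy as np
from scipy.optimize import minimize, minimize_scalar

rng = np.random.default_rng(0)

def loglik_null(p, layer_totals):
    # No-clustering model: probabilities (p, p, p, q, q) with 3p + 2q = 1.
    q = (1.0 - 3.0 * p) / 2.0
    return float(layer_totals @ np.log(np.array([p, p, p, q, q])))

def loglik_mix(params, counts):
    # Mixture model (3): per spoke, (1-theta)*M(p1,...) + theta*M(p2,...).
    theta, p1, p2 = params
    q1, q2 = (1 - 3 * p1) / 2, (1 - 3 * p2) / 2
    l1 = counts @ np.log(np.array([p1, p1, p1, q1, q1]))
    l2 = counts @ np.log(np.array([p2, p2, p2, q2, q2]))
    return float(np.sum(np.logaddexp(np.log(1 - theta) + l1, np.log(theta) + l2)))

def lr_statistic(counts):
    # R = 2 * (l_max - l0_max), each model maximized numerically.
    fit0 = minimize_scalar(lambda p: -loglik_null(p, counts.sum(axis=0)),
                           bounds=(1e-4, 1/3 - 1e-4), method="bounded")
    fit1 = minimize(lambda v: -loglik_mix(v, counts), x0=[0.5, 0.25, 0.06],
                    bounds=[(1e-3, 1 - 1e-3), (0.2 + 1e-3, 1/3 - 1e-3), (1e-3, 0.2 - 1e-3)])
    return 2.0 * (fit0.fun - fit1.fun)

def bootstrap_R(x_plus, p_hat, n_boot=200):
    # Generate datasets from the fitted no-clustering model, keeping the
    # spoke totals x_plus fixed, and recompute R for each dataset.
    q_hat = (1 - 3 * p_hat) / 2
    probs = np.array([p_hat] * 3 + [q_hat] * 2)
    return np.array([lr_statistic(np.array([rng.multinomial(m, probs) for m in x_plus]))
                     for _ in range(n_boot)])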
There were numerous other questions that arose in this study. One in-
volved the alternative clustering 2,3 vs. 4,5,6. The maximized log-likelihood
value was -794.47 for this clustering, or 24.5 units lower than that for the
(2,3,4 vs 5,6) clustering. The investigators wanted to know which of the two
clusterings (2,3,4 vs 5,6) or (2,3 vs. 4,5,6) was better. The two models are not
nested, so this is a difficult question to answer. To address it, I bootstrapped
by sampling with replacement from the rows of the data matrix, and for each
bootstrap sample, I computed the difference in maximized log-likelihood val-
ues for the two models. The standard deviation of this difference was 10.4,
and so the observed difference of 24.5 is “significantly” large.
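The row-resampling computation can be sketched as follows (again an illustrative Python fragment; rows is the spoke-by-layer count matrix and fit_loglik_234 / fit_loglik_23 are assumed wrappers returning the two maximized log-likelihoods):

import numpy as np

def bootstrap_loglik_difference(rows, fit_loglik_234, fit_loglik_23,
                                n_boot=200, seed=0):
    """Resample spokes (rows) with replacement and recompute the difference in
    maximized log-likelihoods between the (2,3,4 vs 5,6) and (2,3 vs 4,5,6)
    clusterings for each bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = rows.shape[0]
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)      # sample rows with replacement
        sample = rows[idx]
        diffs.append(fit_loglik_234(sample) - fit_loglik_23(sample))
    diffs = np.array(diffs)
    # The observed difference (24.5 above) is judged against this spread.
    return diffs.mean(), diffs.std(ddof=1)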

Acknowledgment:
I would like to thank William DuMouchel for helpful suggestions. This
work was supported by the Natural Sciences and Engineering Research Coun-
cil of Canada.

References:
Efron, B. (1983). Estimating the error-rate of a prediction rule: some
improvements on cross-validation. J. Amer. Statist. Assoc., vol 78, pages
316-331.
Hastie, T. and Tibshirani, R. (1990) Generalized Additive Models. Chap-
man and Hall. London.
Rice, J.A and Silverman, B.W. (1988) Estimating the mean and covari-
ance structure nonparametrically when the data are curves. Unpublished
report.

APPROXIMATING THE DISTRIBUTION OF A GENERAL STANDARDIZED FUNCTIONAL


STATISTIC WITH THAT OF JACKKNIFE PSEUDO VALUES

by

D. S. Tu

Center for Multivariate Analysis


The Pennsylvania State University

ABSTRACT

By randomly weighting the jackknifed pseudo values, a method to approximate the distribution of a general standardized differentiable functional statistic is proposed. The computation based on this method is simpler, the variance estimator is similar to the jackknife variance estimator, and the approach also possesses second order accuracy.

1. INTRODUCTION

The jackknife method, deleting one observation at a time from the sample to generate pseudo values and then averaging those values to get a new estimate, was used mainly to reduce the bias of an estimator and to construct variance estimates for complex statistics. The properties of this method have been investigated by many authors since the end of the 1950s. For detailed studies and applications of this method, the reader may refer to the review papers of Miller (1974) and Shi (1987) and the bibliography by Frangos (1987).

In this paper we will consider another use of jackknife pseudo values, that is, to give an approximation for the distribution of a general standardized functional statistic. Let X_1, X_2, ..., X_n be independent identically distributed random variables with unknown cdf F. Suppose that T_n = T(F_n) is an estimate of θ = T(F), where T is a specified real-valued functional and F_n is the empirical distribution of X_1, X_2, ..., X_n. Then our problem is to approximate the unknown distribution H_n of (T_n − θ)/[var(T_n)]^{1/2}.

Since under suitable conditions H_n is asymptotically normal with mean 0 and variance 1, we may simply use the normal distribution to approximate H_n. One disadvantage of the normal approximation is that it does not possess higher order accuracy. To get an approximation with higher order accuracy, we may use the Edgeworth expansions which have been discussed in detail by Pfanzagl (1985). The exact form of the Edgeworth expansion of H_n can be found in Takahashi (1988). An alternative but simpler approach, the bootstrap introduced by Efron (1979), can also give a second order accurate approximation, with some superiorities over the one-term Edgeworth expansion approximation (Bhattacharya and Qumsiyeh, 1989). To use the bootstrap method, we need to resample from the original sample and recompute T_n many times.

Our operation to approximate the distribution of a functional statistic is based on the idea of random weighting. We generate a group of random weights directly from the computer, use them to weight the jackknife pseudo values, and then employ the conditional distribution of the weighted average of jackknife pseudo values (with a suitable transformation), given the original sample, to approximate the distribution of the general functional statistic. The computation in this approach is simpler (it needs the computation of the original statistic n(n+1)/2 times) compared with the bootstrap (which needs the computation of the statistic 1,000~2,000 times (Efron, 1987)) for moderate and small n. The variance estimator deduced from this method is similar to the jackknife estimator. Moreover, this approach also possesses second order accuracy, and in some cases the conditions guaranteeing this property are weaker than those for the bootstrap approach (see the remark in Section 4). By selecting weights properly, the random weighting approximation may even be more accurate than the bootstrap approximation for tail probabilities (Zhang and Tu, 1990). For a detailed discussion about the motivation and general properties of the random weighting method, see Tu and Zheng (1989).
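As an aside for the reader (this sketch is not part of the paper), the basic operation can be coded in a few lines of Python: compute the delete-one jackknife pseudo values of a user-supplied functional statistic T, draw i.i.d. Gamma(shape 4, rate 2) weights ζ_i (whose centered second and third moments equal 1, cf. Remark 3.2 below), and form the standardized weighted average whose conditional distribution — before the second order correction h_n introduced later — approximates that of L_n:

import numpy as np

def jackknife_pseudo_values(x, T):
    """Pseudo values T_{n,i} = n*T(F_n) - (n-1)*T(F_{n-1}^{(i)})."""
    n = len(x)
    t_full = T(x)
    t_loo = np.array([T(np.delete(x, i)) for i in range(n)])
    return n * t_full - (n - 1) * t_loo

def random_weighting_replicates(x, T, n_rep=2000, seed=0):
    """Replicates of L_n^* = W_n^* / [var^*(W_n^*)]^{1/2}, where
    W_n^* = (1/n) * sum_i zeta_i (T_{n,i} - Tbar) and the zeta_i are
    i.i.d. Gamma(shape=4, rate=2) weights drawn independently of the data."""
    rng = np.random.default_rng(seed)
    pv = jackknife_pseudo_values(x, T)
    centered = pv - pv.mean()
    scale = np.sqrt(np.sum(centered ** 2))
    reps = []
    for _ in range(n_rep):
        zeta = rng.gamma(shape=4.0, scale=0.5, size=len(x))   # rate 2 -> scale 1/2
        reps.append(np.sum((zeta - 2.0) * centered) / scale)
    return np.array(reps)

# Example: approximate the distribution of the standardized sample mean.
# x = np.random.default_rng(1).exponential(size=30)
# reps = random_weighting_replicates(x, np.mean)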

Beran (1984) and Hinkley and Wei (1984) also considered the use of the jackknife method to approximate H_n. The idea of their approaches is to use the jackknife to estimate the standardized bias, variance and skewness of T_n and then plug these jackknife estimates into the Edgeworth expansion of H_n to get a jackknife Edgeworth expansion estimate. The difference between their approach and ours lies in that, in our approach, the jackknife method is used before the asymptotic expansion. The improvement of approximation accuracy via a transformation on statistics has also been considered by Hall (1983), Withers (1983) and Abramovitch and Singh (1985). The first two papers gave the modification only in a compact interval of the real line. The improvement given in the latter paper is over the whole line. Our work is closely related to that of Beran (1984), Hinkley and Wei (1984) and Abramovitch and Singh (1985). In fact, an exact form of the transformation suggested in Abramovitch and Singh (1985) is given by combining the jackknife methods used in Beran (1984) and Hinkley and Wei (1984) and the random weighting suggested in Tu and Zheng (1989) for functional statistics. Recently, Konishi (1989) presented an improvement to the normal approximation by finding the normalizing transformation for the statistic.

The rest of the paper is organized as follows. In Section 2 we give some preliminary notations and basic ideas, including the definition of a general functional statistic and its random weighting version. The main results of this paper, concerning the second order accuracy of our approach, are stated in Section 2 and proved in Section 3. Finally, in Section 4 we apply these results to U-statistics and M-estimates.

2. SOME PRELIMINARY NOTATIONS AND BASIC IDEAS

Let Λ be the space of distribution functions containing a fixed distribution F and all discrete distributions with finite support, and let R be the real line. A mapping from Λ to R is called a functional on Λ. For this functional we introduce the definition of strong Fréchet differentiability at F ∈ Λ.

Definition 2.1  T is said to possess a strong Fréchet differential at F if and only if there exists ψ: R → R, depending only on T and F, such that

    [T(G) − T(H) − ∫ψ(x) d(G−H)(x)] / ||G − H|| → 0                         (2.1)

as ||G−F|| + ||H−F|| → 0 for G, H ∈ Λ, where ||h|| = sup_x |h(x)|. Without loss of generality in this definition we can take ∫ψ(x) dF(x) = 0.

This definition of differentiability was used by Parr(1985) to


discuss the consistency of jackknife variance estimates. It is
stricter than that of Frechet differentiability or Hadamard
differentiability. For the relationship between these definitions
and examples satisfying this definition, refer to Parr(1985) and
Reeds (1976).

Further we can give the definition of the second order differentiability of T at F.

DEFINITION 2.2  T is said to possess a strong Fréchet second order differential with remainder of type 2+λ (λ > 0) at F if and only if there exist ψ: R → R and ϕ: R×R → R, depending only on T and F, and a bounded set B ⊂ R such that

    [T(G) − T(H) − ∫ψ(x) d(G−H)(x) − ∫∫ϕ(x, y) d(G−H)(x) d(G−H)(y)] / ||G − H||^{2+λ}  ∈ B      (2.2)

as ||G−F|| + ||H−F|| → 0 for G, H ∈ Λ.

This definition was also given and discussed by Reeds (1976).

Let now X_1, X_2, ..., X_n be n i.i.d. random variables with unknown distribution function F defined on R, and let θ = T(F) be a functional parameter of F. Denote

    F_n(x) = (1/n) Σ_{i=1}^n I(X_i ≤ x),                                   (2.3)

where I(A) stands for the indicator function of the set A. Corresponding to the parameter θ, we consider the estimate T_n = T(F_n), which is called the functional statistic. The main purpose of this paper is to estimate the unknown distribution of

    L_n = (T_n − θ)/[var(T_n)]^{1/2}                                       (2.4)

with that of the jackknife pseudo values of T_n. First we introduce the jackknife pseudo values of T_n. Let

    F_{n−1}^{(i)}(x) = (1/(n−1)) Σ_{j≠i} I(X_j ≤ x),   i = 1, 2, ..., n,   (2.5)

    T_{n−1,i} = T(F_{n−1}^{(i)})  and  T_{n,i} = n T_n − (n−1) T_{n−1,i}   (i = 1, 2, ..., n).   (2.6)

Then T_{n,i} (i = 1, 2, ..., n) are called the jackknife pseudo values of T_n, and

    T_n^* = (1/n) Σ_{i=1}^n T_{n,i}   and   U_n = (n−1)^{−1} Σ_{i=1}^n (T_{n,i} − T_n^*)²        (2.7)

are the classical jackknife estimate of θ and the jackknife variance estimate of √n T_n, respectively. Next we give the random weighting statistic of (T_n − θ). Let ζ_1, ζ_2, ..., ζ_n be n i.i.d. random variables with the common probability density

    p(x) = (2/Γ(4)) (2x)³ e^{−2x} I(x > 0).                                (2.8)

Then

    W_n^* = (1/n) Σ_{i=1}^n ζ_i (T_{n,i} − T_n^*)                          (2.9)

is defined as the random weighting statistic of (T_n − θ). The preliminary idea of random weighting is that we can use the conditional distribution H_n^* of L_n^* = W_n^*/[var^*(W_n^*)]^{1/2}, given X_1, X_2, ..., X_n, to approximate the distribution of L_n. The conditional variance of √n W_n^*,

    s_n² = var^*(√n W_n^*) = (1/n) Σ_{j=1}^n (T_{n,j} − T_n^*)²,           (2.10)

where var^* stands for the conditional variance given X_1, X_2, ..., X_n (below we will use the symbol * to denote conditional quantities given X_1, ..., X_n), can be used as an estimate for the variance of √n T_n. It is easily seen that s_n² is almost the same as the jackknife variance estimate U_n.

The consistency of the above approximation can be proved by applying the Lindeberg central limit theorem to L_n and L_n^* respectively. In comparing asymptotic expansions of H_n and H_n^*, however, we find that this approximation cannot possess the desired second order accuracy (see the discussion in the next section).
Following the idea of Abramovitch and Singh (1985), Beran (1984) and Hinkley and Wei (1984), we can use a transformation on L_n^* via the jackknife method to get a second order accurate approximation. Let

B p B pe B
h(y=(y - OB) - —2my - Om)? , Amy _ —On)3 (2.11)
n yn yn 3n yn

where

A nea hoes
Pon="on- 6 2 Taj’ Bon=on* ~§ 2 Taj?
oe J= (2.12)
A ft A

Bol £325 Gadliah te i. 32)


%on~ 6 ge Son 5
with the jackknife estimates

n °

(T .-T*) me (n-1) ei ltoatgta


(i)
T..= ——_L]1 ko. ae Suaeeaenae:
ij n
[ae ora] A
i=1 ™
fo- 2 fiw? F gld_y 3
32eetS n 4 ‘°n-1 on
s i=l
n

2 ‘i ;
+ SED
3(n-1) pera
(j) UT(k) 8 int, -(n-1)
« (T+)
CU ay)
jzk

+(n-2) 0033) jh (k,j)

(GN del
Diy aeie Cp
lO oe eh (2.43 )
*
Then we can use the conditional distribution of h_n(L_n^*) given X_1, X_2, ..., X_n to approximate the distribution of L_n. For the second order accuracy of this kind of approximation we have the following theorem.
THEOREM 2.1  Assume that T is a twice strongly Fréchet differentiable functional with remainder of type 2+λ (λ > 0) at F. Suppose also that the distribution of ψ(X_1) is nonlattice and E{|T(F_n)|⁴} < ∞. Then we have

    √n sup_y |P^*{h_n(L_n^*) ≤ y} − P{L_n ≤ y}| → 0   a.s.                 (2.14)

The proof of the theorem is deferred to the next section. Next we make some general remarks on this theorem before closing this section.

REMARK 2.1  In the applications of the jackknife and bootstrap, for example to set a confidence interval for θ, we are mostly concerned with the approximation of the distribution of the studentized statistic

    L_n' = √n (T_n − θ)/s_n,

where s_n² is an estimate of the variance of √n T_n, for example the jackknife estimate U_n. The method of this paper can in principle be used in this case. The exact form of the transformation can be obtained using the jackknife estimates for the cumulants of L_n' given in Hinkley and Wei (1984). See also the discussion of Konishi (1989).

REMARK 2.2  In fact the method suggested in this paper can be used to approximate the distribution of the jackknifed statistic T_n^* of T_n and, even more, that of a functional jackknife statistic of T_n as suggested by Sen (1988). The accuracy of the approximation to these statistics needs further investigation.


REMARK 2.3  The jackknife technique used in this paper differs slightly from that in Beran (1984). In fact, Beran used the add-one jackknife rather than the ordinary delete-one jackknife used by most authors. The jackknife estimates k̂_{31} and k̂_{32} of this paper are essentially taken from Hinkley and Wei (1984).

REMARK 2.4  Numerical results for the comparison of the method suggested above with the bootstrap and some other related methods are presented in Zhang and Tu (1990).

3. THE SECOND ORDER ACCURACY OF THE RANDOM WEIGHTING APPROXIMATION

The purpose of this section is to prove Theorem 2.1, which shows that the random weighting approximation suggested in the above section is second order accurate. We first give some results on the Edgeworth expansions for a functional statistic and for a weighted sum of iid random variables. The following theorem is taken from Takahashi (1988).

THEOREM 3.1  Let

    G_n(x) = P{[T(F_n) − T(F)]/[VAR(T(F_n))]^{1/2} ≤ x}                    (3.1)

and assume that T is a twice strongly Fréchet differentiable functional with remainder of type 2+λ (λ > 0) at F. Suppose also that the distribution of ψ(X_1) is nonlattice and E{|T(F_n)|⁴} < ∞; then uniformly in x

    G_n(x) = Φ(x) − n^{−1/2}[k_{31} + (1/6) k_{32} (x² − 1)] φ(x) + o(1/√n)   (3.2)

where Φ(x) and φ(x) are the distribution function and density function of the standard normal, respectively,

    σ² = E{ψ²(X_1)},   k_{31} = σ^{−1} E{ϕ(X_1, X_1)}                      (3.3)

and

    k_{32} = σ^{−3} [6 E{ψ(X_1) ψ(X_2) ϕ(X_1, X_2)} + E{ψ³(X_1)}].         (3.4)

REMARK 3.1  We note that the Edgeworth expansion given above has a slightly different form from that given in Takahashi (1988). The reason is that in Definition 2.2 the second Fréchet derivative of T is one half of that given in Definition 2 of Takahashi (1988).

Next we discuss the expansion for the weighted sum of iid random variables.

THEOREM 3.2  Suppose that ζ_1, ζ_2, ..., ζ_n are iid with density function as in (2.8), and that {a_{nj}} is a triangular array of constants satisfying

    Σ_{j=1}^n a_{nj} = 0,   0 < (1/n) Σ_{j=1}^n a_{nj}² → σ_a² < ∞,   (1/√n) max_{1≤j≤n} |a_{nj}| → 0

and

    limsup_{n→∞} (1/n) Σ_{j=1}^n |a_{nj}|³ < ∞.

Let

    K_n(y) = P{ Σ_{j=1}^n a_{nj}(ζ_j − 2) / [Σ_{j=1}^n a_{nj}²]^{1/2} ≤ y };   (3.5)

then uniformly in y we have

    K_n(y) = Φ(y) − (1/6) φ(y) (y² − 1) Σ_{j=1}^n r_{nj}³ + o(1/√n)         (3.6)

where r_{nj} = a_{nj} / [Σ_{k=1}^n a_{nk}²]^{1/2}.

To prove this result we need a lemma on the Edgeworth expansion


for sums of independent random variables proved by Bai and Zhao
(1986).

LEMMA 3.1 Suppose that Yu Yo: ate Ye is a sequence of



independent random variables with

EY re
;=0, ry 2
EY;>0 and BlY; |3 ’<eo.
i=

Let

G,(y)= 3 Y/( 5By}? (3.7)


etl

then we have

[G,,(y)-O(y)+(1/6)0(y) (y?-1)
tg/(Hy)9? |

(| + gb}
1)n
(3.8)
where C is a absolute constant,

to= SEY ase 2 Wynjja¥j PCY; [>Cug)Takey

y;(t)=Bexp(itY,)
;
and 6, = 26 DENY; 19 -1

THE PROOF OF THEOREM 3.2 . Let Y.J = —ta_.(¢.-2),


yn nj *j
then Re
J j=1

is a sequence of independent random variables. Hence from Lemma 3.1


we have

IK, (y)-O(y)+(1/6) 0(y) Cy eb) osmei



Pnj’ Wj =* “nice
var
Fe a i d
|“B($;
i (¢; -9)1[ 2yt[ |=
ice2 ) |>CHy 1/2 Ih
a.

v.j (t)=Eexp(itY, j )=(1-(1/2) itl)


Vii 4
=
1 (2,
and 6, = ratal y Any
a3 By) 3 |

For Jy, since

s
vi» 3 3 25 PAR
IM,(| 1
Yas. 3 E|¢,-2| a Hla
ml yylS [ag 3/2 2)|
n n
-2|31(+}a_.||¢.-2
=
v¥n(p,)
1/2
ee 3 2
= ae la; | E|¢,-2| [Is 2|: max |a +] a
= 1<jsn ™
Ho =
rig
=p: |
ei) an
240,
jJ=1
we have

et 3
yn y E|W at —0,
ic ieee
that is J,=0(1/yn). Consider Jo. Since

va Y Ely.|* = vm Y,(1/n?)a4 Bl ¢,-2/4


n n

jot; ae j=l Rasy Hh


1
< —max fa
1%
.|(= a
3 )E|¢,-2 4

——0
we have Jo=0(1/yn). Finally we treat Jz. Observe that
n n Ome
sup ty |v; (t) |= sup i a |(1-(1/2) it-BL) 4 |
|t|26 -j=1 |t|26° j=1 vn

1 Oi (l+on
< n.d ecle Ht 6°;
e gee

S 1%
= (1+n--3/2,2.2
6a nj \-2
j=l -
2 -3.4 4
-2n--3/2,2 6a ajo”
< inj e ote 61a ih

=1
=1 on™ja)2(2
- -2(6 (2 8nj) ny4/1(S2a
a2 * -sca

<1 - 2,°n,2f1
2-)?[2 y8 a2,2
n yn Mj=1 nj

6 n
= €2)7(= Tete
max |a eem2 ele
: | ai |
vn

ee
vn
so that

og|vnin®( sup a) |v.(t)| + =)"


[t|>6, j=l J n

< Tlogn+nlog(1 - aL pik


3 vn n 2n

=Tognin(—a, + in)nl (=a, + 5h)?»


and hence

i
VnJg=vnn ( ue Le [Yj (+) | he1,\n 0;
nb,
2 j=

that is J=0(1/yn). By Summarizing all of the results the


conciusion of the theorem follows.

REMARK 3.2  To obtain the asymptotic expansion in Theorem 3.2, the weights ζ_i, i = 1, ..., n, need only satisfy E(ζ_1 − Eζ_1)² = 1 and E(ζ_1 − Eζ_1)³ = 1.


By using Theorem 3.2 we can get the asymptotic expansion for the distributions in the random weighting approach.

THEOREM 3.3  Suppose that T possesses a strong Fréchet differential at F,

    0 < σ² = ∫ψ²(x) dF(x) < ∞   and   ∫|ψ(x)|³ dF(x) < ∞.

Then

    P^*{W_n^*/[VAR^*(W_n^*)]^{1/2} ≤ y} = Φ(y) − (1/6) φ(y)(y² − 1) Σ_{j=1}^n r̂_{nj}³ + o(1/√n)   (3.9)

holds uniformly in y for almost all sample sequences X_1, X_2, ..., X_n, where r̂_{nj} = (T_{n,j} − T_n^*)/[Σ_{k=1}^n (T_{n,k} − T_n^*)²]^{1/2}.
PROOF.  Take a_{n,j} = T_{n,j} − T_n^* in Theorem 3.2. Then, given X_1, X_2, ..., X_n, {a_{n,j}} is a triangular array of constants. Hence to prove this theorem we need only verify that the conditions on a_{n,j} in Theorem 3.2 hold for almost all sample sequences X_1, ..., X_n. First it is obvious that

    Σ_{j=1}^n a_{n,j} = 0

and, from a theorem of Parr (1985),

    (1/n) Σ_{j=1}^n (T_{n,j} − T_n^*)² → σ² < ∞   a.s.;

hence what remain to be proved are

    (1)  limsup_{n→∞} (1/n) Σ_{j=1}^n |T_{n,j} − T_n^*|³ ≤ ∫|ψ(x)|³ dF(x) < ∞   a.s.      (3.10)

and

    (2)  (1/√n) max_{1≤j≤n} |T_{n,j} − T_n^*| → 0   a.s.                                  (3.11)

Note that sup_{1≤i≤n} ||F_{n−1}^{(i)} − F_n|| ≤ 1/n and by the definition of differentiability we have

    T(F_{n−1}^{(i)}) = T(F_n) − ∫ψ(x) d(F_n − F_{n−1}^{(i)})(x) + o(||F_n − F_{n−1}^{(i)}||)
                     = T(F_n) − ψ(X_i)/n + {n(n−1)}^{−1} Σ_{j≠i} ψ(X_j) + o(1/n),

and hence

    T_{n,i} − T_n^* = ψ(X_i) − (1/n) Σ_{j=1}^n ψ(X_j) + o(1).                              (3.12)

Therefore we have

    (1/√n) max_{1≤j≤n} |T_{n,j} − T_n^*| ≤ (1/√n) max_{1≤j≤n} |ψ(X_j)| + (1/√n) |(1/n) Σ_{j=1}^n ψ(X_j)| + o(1/√n) → 0   a.s.

since Eψ²(X_1) < ∞, and also

    (1/n) Σ_{j=1}^n |T_{n,j} − T_n^*|³ → ∫|ψ(x)|³ dF(x) < ∞   a.s.                        (3.13)

Thus (3.10) and (3.11) are proved. The proof of the theorem is now completed.

REMARK 3.3  Comparing the asymptotic expansions of L_n and L_n^* given in Theorems 3.1 and 3.3 respectively, we find that if we use the distribution of L_n^* to approximate that of L_n, second order accuracy cannot be obtained. The reason is that the second terms of the corresponding expansions are not the same. The transformation given in (2.11) can help us to eliminate these differences. This is the conclusion of Theorem 2.1, which we will prove in the following.

For the proof of Theorem 2.1 we state a lemma.

LEMMA 3.2 The h,(y) defined above is a strictly increasing

function with inverse function

u(x) = 7-fon* X+ eho tt 1so(2/V8) a.s.


n n

The proof of the lemma can be obtained by elementary mathematical analysis (see also Yao (1988)).

PROOF OF THEOREM 2.1 Since

F"(y)=P" (h, We /[VAR™ cm*) 1/225) =P" (Ww/(VAR W Bit au)


we have from Theorem 3.3

limit Viisup|F_ (y)-@(u, Cy) )+(1/6)0(u, (y)) (u2(y)- 1) Ss)


rsjl=0- (3.15)
n>o y
Thus

limit Visup|F_ (y)-F nly) |


n> y

=limit vynsup|%(u, (y))-(1/6)0(u, (y)) (u2(y)— 1) Pa7


n> y
-O(y)+ a0 (9) [gy /2-Hegy(1-¥" )/6)|

<limit ynsup|®(u, (y))- (1/6)06(u, (y)) (v2(y)-1) Dhi


n> y
-8(y)+ Phy, /2-fya(a-y" /6)|
n

+limit
n>
sup0(y)
y
|(K34- kgy )/24(k 32m k39 )(1-y? )/6

A limit Yaouy is} (x) |+ Meet sup|Jo(x)| (3.16)


n>
From the definitions of oe and en by the same arguments as in

the proof of Theorem 3.3 of this paper and Theorem 2 of Parr(1985),



we can prove that they are strongly consistent estimates of Lost and

Kyo: Hence we have ~*

limit sup|J,(x)|=0 —a.s. (3.17)


n> x
Now we need only to consider the asymptotic properties of J, (x).

Note that

vaseo dyrE 1/1219, 6x) |+¥0 2uP 1319, (2)|

+n sup 1/1219,
6%)|

By ia baa
+31 Dyfea
5+d lesSec (3.18)
i
For Jia since

uw,(x)-x=(—ta, - ty r3-)+ (a, + SS) ee )x2[1+0(x/Vn) ]


yn ja ™ n j=l j
and
O(u, (x) )-O(x)=0(x) (u, (x)-x) +r,

where |r |=(1/2) |€0(€) | (a, (x)-x)?, € is between x and u(x). We

first prove that limit yn sup 1/12™n=2: ‘In fact, when [x|<nt/12
n>o |x|<n
-5/12
we have |x/y¥n|<n , hence limit sup 1/12
|o(x/¥n)|=0. Observe
n> «|x|<n
that

RES
limit
pagel
n
1" sup u_(x)-x|=limit [n
5 -1/12;4a “yn yor
2
nh 6 Ix|<n i/ial n Aecstes on 6521 fl

+n 1/12)”la,+ La ¥ ro.|(1+
6 nyja 713,| ixlen
sup _ 1/42 |o(x/va)
o(x/¥n |)]-0 a.s.

Therefore
- K 2
yn sup r Sighiait yn sup (u_ (x)-x)
yet / 12! se ae I<n 1/12°°n

Reaves
< Slimit
hae [n’’ 1/4 “su n
2
ix}ent/?? Ju_(x)-x|]°—-0 a.s. (3.19)

However

J,,=limityn sup 0(x)(u_(x)-x)-


1
20(u_(x))(ue(x)-1)
2 oa
2 rh;
ial Eager at/ia n 6° n n wel nj

1 2
- eee Oy )|
n

=limit
noe vayn Be lé
1/12'6
g(x) (x2-1)-6(u, ur (x))
1s 0(x)(x"- (u2(x)-1))
i §
earn

ten
+0(x)(La,, at yin3 )x70(x/yn) |
ri a.s.

=limit yn as aliale
Al 0(x) (x 2 -1)-0(u, (2) (9-191.
2 y 23
ta |
n>o |x|<
and as in the proof of (3.13), we have

6(u, (x)) (u2(x)-1)-0(x) (x2-1)=0 (x) (3x-x3) (u, (x)-x)+0(1/Vn) as.


Hence

J, =limit
nia
sup
[x|<nt/1?
Wreyerm
6
ames
n
y oo
ral nj

n
<Klimit sup |u,(x)-x|vn J r3 =0 a.s. (3.20)
n>o |x|< ni/12 ja 1
For J,, note that limit -0(x)|x|*=0(k=0, 1, 2), and hence
xX 7-0
Jie < limit ynO(u_ (en!/12) ys imit o( -n)/12)
n> 0 n>

simit Jotu,(-a!/1)) (aku!!! eyva D8,


n> j=

+ limito( -n'/1?)(|a, alte, [n?/®)


n> n
=limit vn@(u, (-ni/12)) a.s.
n> ®@

However

limit |u,(-n |
n> @
n
etait [at /2(@ ap Seca, ae
n> + coe 4 ny In

+n =1/3;4(a5,+ av
21,-8 3 |=0
tay) ,)_ a.s.

80 J,o< limit ¥no(-n!/1241)=0, In the same way we can prove that


n>

31350 a.s. Hence

limit yn sup|J,(x)|
en. wale? < J, Lt ,+3,9tJ,,.
t2 5013 <0 as.

that is, yn sup|J,(x)|—>0 a.s. Summarizing the above results we


x
find that the conclusion of the theorem is valid.

REMARK 3.4  From the proof of Theorem 3.2 we know that the conditions of Theorem 3.1 were only used to obtain the asymptotic expansion of P{T_n ≤ x}. If for some special statistic T_n we can obtain the expansion of P{T_n ≤ x} under some other (simpler) conditions, then Theorem 2.1 will be valid under those conditions together with the conditions ensuring the consistency of the jackknife estimates defined in (2.13), such as

    0 < σ² = ∫ψ²(x) dF(x) < ∞,   ∫|ψ(x)|³ dF(x) < ∞,   E|ϕ(X_1, X_1)| < ∞

and

    E|ψ(X_1) ψ(X_2) ϕ(X_1, X_2)| < ∞.

In the next section we will use this remark to give results for U-statistics and M-estimates.

4. TWO EXAMPLES

In this section we will give two examples. One is for U-statistics and the other for M-estimates.

(1) U-statistics

Let X_1, X_2, ..., X_n be n i.i.d. observations coming from an unknown population F. Then the U-statistic based on X_1, X_2, ..., X_n can be written as

    U_n = \binom{n}{2}^{-1} Σ_{i<j} h(X_i, X_j)                            (4.1)

where h is a symmetric kernel function. In fact U_n is not a functional statistic, but it has a strong relationship with the von Mises statistic, a functional statistic defined by V_n = n^{−2} Σ_{i=1}^n Σ_{j=1}^n h(X_i, X_j); so here we take it as an example. The U-statistic has the decomposition

    U_n = θ + (2/n) Σ_{i=1}^n g(X_i) + \binom{n}{2}^{-1} Σ_{i<j} ϕ(X_i, X_j)   (4.2)

with

    g(X_1) = E{h(X_1, X_2) − θ | X_1},   ϕ(X_1, X_2) = h(X_1, X_2) − θ − g(X_1) − g(X_2)

and

    θ = E h(X_1, X_2).

Based on this decomposition, Bickel et al.(1986) gave an Edgeworth


expansion for UL:

THEOREM 4.1 Suppose that there exist constants 6 and C and


functions x, (t) and Xq(t) suth that

(1) limit X,(t)=0;


t>o

(2) r>2+6>2, E| p(X, .X,)|"< ME

(3) Eg 3 (XT ee oo) (1e(X,) [Ds xX,(t) and |Eexp(itg(X,))|<


P

1-X,(t)<1(t>0), then there exist Bir al dependending only on 6, C,


Xi and Xo such that

zy os -1/2
sup|P(o x)-F(x
Up <x)-F_
= | _” a (U_-@) )|
(x)| << en (4.3)
4.3

where

F,(x)=®(x)-(+/?9(x)
1/6 )eg(x?-1)
n,
oP=n(n-1)"Bg(x, +2? (x, ,x,)
and

kg=0,” (Be°(X, )+3Eg (X, )e(Xq) p(X, ,Xp)}


Ber 14rd
with o,-Eg (X,)>0.

Let U_{n,i} be the jackknife pseudo values of U_n. The jackknifed statistic of U_n is U_n itself. Hence the random weighting statistic of U_n can be defined as

    W_n^* = (1/n) Σ_{i=1}^n ζ_i (U_{n,i} − U_n),                           (4.4)

where ζ_1, ζ_2, ..., ζ_n are iid with the probability density as in (2.8). By the same discussion as that in Section 3 we can obtain the following results.

THEOREM 4.2 Suppose that E|h(X, ,X_) |< and the other
conditions of Theorems 4.1 are satisfied. Then for

ee | Bat y ie r
w(y)=(y-B,/va)® + —2y- /Wn)? + —2y-8, via)? (4.5)
yn vn
where

Bivens 2,3/2

and

+8 2) Ung Un) Ugg Py Img (-A) Up (1) 40g 9 (5)


+(n-2)U,,_o(k,
§)1}
with
n-1 =(n-2

i, j+k i, j+k,1
we have

    √n sup_x |P^*{w(W_n^*/[VAR^* W_n^*]^{1/2}) ≤ x} − P{σ_n^{−1}(U_n − θ) ≤ x}| → 0   a.s.   (4.7)

A weaker form of this theorem can also be found in Tu and Shi (1988).

(2) M-estimates

Let Θ be an open interval in the real line R¹, let ℱ be a set of distributions on R¹, and let θ(F) be the functional defined by the equation

    ∫ψ(x, θ(F)) dF(x) = 0,   F ∈ ℱ.                                        (4.8)

The M-estimate of θ(F) is defined by the following empirical equation:

    Σ_{i=1}^n ψ(X_i, θ̂_n) = 0.                                            (4.9)

It is a functional statistic. A sequence of M-estimates is said to be √n-consistent if for every ε > 0 we have P{|θ̂_n − θ| > ε} = o(1/√n). Following the asymptotic expansion for minimum contrast estimates of Pfanzagl (1973), Zheng (1988) gave the corresponding result for M-estimates.

THEOREM 4.3  Suppose that

(i) ℱ_θ = {F: F ∈ ℱ, θ(F) = θ} is not an empty set.

(ii) ψ(x, θ) possesses a second order derivative in θ on Θ except at finitely many points, and at these points ψ''(x, θ) has right and left limits.
(iii) For each 6, there exist Ug and M,(x, 8) such that

bP cx, + over (x, 7) Islt_-7, |My (x, 4) i=, 1, 2,-7,, T5eUy

with rae fakes


(iv) For each @ and FeSp, ¥(x,t) is two times differentiable
at t=0.
(v) For each 0, there exist Ug and M p(x.) such that

sup|¥(x, T) |<Mp (x,9)


TEU py

with
3
J Mp (x, 0)dF(x)<o.
(vi) For each 0, there exists a Ug, such that for any 6>0

sup ET ait T) ]dF(x)


|<1
|t|>5 re
(vii) For each @, ee

Vga S¥(x, 0)7aF(x) /Lfvg(x, @)dF(x)17€(0, @).


Then uniformly in y we have

Pptvn(8
0) /WWCY<y)=8(y)- SAC 8) 40(1/Nin)
n
where
3
A( ,0) = {vy (x,9)dF(x) y 214)

: Pte aac? }
cal S¥G(x, W(x, 8) dF(x)
© 2 LR Cx, 0) d(x) )1/2( fog x, 0) F(x) )
Vp!(x,0)dF(x)
: feels: Fania750)}
(SV? (x, 0)4F(x)dF(x))
The jackknife pseudo values of θ̂_n can be defined as

    θ̂_{n,j} = n θ̂_n − (n−1) θ̂_n^{(j)},

where θ̂_n^{(j)} (j = 1, 2, ..., n) satisfy

    ∫ψ(x, θ̂_n^{(j)}) dF_{n−1}^{(j)}(x) = 0

with

    F_{n−1}^{(j)}(x) = (1/(n−1)) Σ_{i≠j} I(X_i ≤ x).

Let
n
(BR
J,n n j21
os 991
ee
1 2
E,* nh fe (8, 5-8);
i=1
where the definition of €, (isl, 2, ... , mn) is given above, and

w(y) al y<b oi = LG n¥-b, 4 + hia aes se


yn
with
n n
cay. toe (a0)
(i) 1 SNe alba, 5-93,2)
8 SS es St ee ee eee one
n 6 38), n 6 6s,

3
ego tg{a 3 00-9.)
n-l on

Ben)? 5831-8
FO 99639 I(900 -9,1End Cn) ON(k) 905)
sev 95,P3}
2
re
n Mj=1
a) 978g,

and Cnt satisfying


Approximating the Distribution with Jackknife 303

Sucx, 93)
arG 3 (x)=0
(j,k) (x)=
Foo 1 ¥ 1{X.<x}.

Then the same discussion as that in Section 3 gives

THEOREM 4.4  Under the conditions of Theorem 4.3 we have

    √n sup_y |P^*{w(E_n^*/[VAR^* E_n^*]^{1/2}) ≤ y} − P{√n(θ̂_n − θ)/[V_F(θ)]^{1/2} ≤ y}| → 0   a.s.

Zheng (1988) gave another but more complicated random weighting approximation for √n(θ̂_n − θ)/[V_F(θ)]^{1/2} under the same conditions.

REMARK 4.1  In Theorem 4.2 the conditions are only put on h(X_1, X_2), g(X_1) and ϕ(X_1, X_2). We need no conditions on h(X_1, X_1). However, for the consistency of the bootstrap approximation we have to put some conditions on h(X_1, X_1) (refer to the counterexample given by Bickel and Freedman (1981)). Because U-statistics do not involve terms like h(X_i, X_i) (i = 1, 2, ..., n), conditions on h(X_1, X_1) are not natural ones.

ACKNOWLEDGEMENTS

The author would like to thank Dr. C. R. Rao for helpful suggestions and discussions and the encouragement he has given.

REFERENCES

Abramovitch, L. and Singh, K. (1985). Edgeworth corrected pivotal


statistics and the bootstrap, Ann. Statist., 13, 116-132.
Bai, Z. D. and Zhao, L. C. (1986). Asymptotic expansions of sums of independent random variables, Scientia Sinica, Ser. A, 29, 1-22.

Beran, R. (1984), Jackknife approximations to bootstrap estimates,


Ann. Statist., 12, 101-118.

Bhattacharya, R. and Qumsiyeh, M. (1989). Second order and L^p comparisons between the bootstrap and empirical Edgeworth expansion methodologies, Ann. Statist., 17, 160-169.

Bickel, P. J. and Freedman, D. (1981). Some asymptotic theory for


the bootstrap, Ann. Statist., 9, 1196-1217.

Bickel, P. J., Gotze, F. and van Zwet, W. R. (1986). The Edgeworth


expansion for U-statistics of degree two, Ann. Statist., 14,
1463-1484.

Efron, B. (1979). Bootstrap methods: another look at the jackknife.


Ann. Statist.,
7, 1-26.

Efron, B. (1987). Better bootstrap confidence intervals, J. Amer.


Statist. Assc., 82, 171-185.

Frangos, C. C. (1987). An updated bibliography on the jackknife


method, Commun. Statist.-Theory Meth., 16, 1543-1584.

Hall, P. (1983). Inverting an Edgeworth expansion, Ann. Statist.,


9, 569-576.

Hinkley, D. and Wei, B. C. (1984). Improvement of jackknife


confidence limit methods, Biometrika, 71, 331-339.

Konishi, S. (1989). Normalizing transformations and nonparametric confidence intervals, Technical Report 89-01, Center for Multivariate Analysis, The Pennsylvania State University.

Miller, R. G., JR. (1974). The jackknife: a review, Biometrika,


61, 1-17.

Parr, W. C. (1985). Jackknifing differentiable statistical


functional, J. Roy. Statist. Soc., Ser. B, 47, 56-66.

Pfanzagl, J. (1973). Asymptotic expansions related to minimum


contrast estimators, Ann. Statist., 1, 993-1026.

Pfanzagl, J. (1985). Asymptotic Expansions for General Statistical Models, Springer-Verlag, Berlin.

Reeds, J. A. (1976). On the Definitions of von Mises Functionals, Ph. D. thesis, Harvard University.

Sen, P. K. (1988). Functional jackknifing: Rationality and general


asymptotics, Ann. Statist., 16, 450-460.

Shi, X. Q. (1987). The jackknife method in nonparametric


statistics, Chinese J. Appl. Prob. Statist., 3, 69-76.

Takahashi, H. (1988). A note on Edgeworth expansions for the


von—Mises functional, J. Multi. Analysis, 24, 56-65.

Tu, D. S. and Shi, X. Q. (1988). Bootstrapping and randomly


weighting the U-statistics with jackknifed pseudo values,
Math. Statist. Appl. Prob., 3, 205-212.

Tu, D. S. and Zheng, Z. G. (1991). Randomly weighting: an alternative approach to approximate the unknown distribution of pivotal statistics, J. Combinatorics, Information & Sys. Scis., P. R. Krishnaiah Memorial Issue, to appear.

Withers, C. S. (1983). Expansion for the distribution and quantiles


of a regular functional of the empirical distribution with
applications to non-parametric confidence intervals. Ann.
Statist., 11, 577-587.

Yao, Z. Q. (1988). Approximations to distributions admitting


Edgeworth expansions, J. Sys. Sci. Math. Scis., 8, 113-126.

Zhang, L. and Tu, D. S. (1990). A comparison of some jackknife


and bootstrap procedures in estimating sampling distributions
of studentized statistics and constructing confidence
intervals, Technical Report No. 90-28, Center for Multivariate
Analysis, The Pennsylvania State University.

Zheng, Z. G. (1988). M-estimates and the random weighting method,


Acta Sci. Nat. Uni. Pekinensis, 24, 277-286.
Part 3
Applications of
The Bootstrap
BOOTSTRAPPING FOR ORDER STATISTICS
SANS RANDOM NUMBERS (OPERATIONAL BOOTSTRAPPING)
William A. Bailey

Abstract: In contrast to what has come during the past decade to be known as bootstrapping (using random numbers to resample from an original sample), what I am tentatively calling "operational" bootstrapping does not use random numbers. Rather, it uses "generalized" numerical convolutions of distributions to generate distributions of the statistics of interest. The term "generalized" refers here to the convolutions (1) being not only for sums, but for any function of the numbers involved in the two distributions being convolved; (2) involving not only dimensions of 1 (i.e., univariate), but also dimensions higher than 1 (i.e., bivariate, trivariate, 4- and 5-dimensional).

The statistic of interest may be a sample mean or any order statistic which can be generated by a sequence of generalized convolutions. (Continuous distributions would first be discretized.) For example, given an empirical distribution obtained by attaching equal probabilities of 1/n to original sample values in a sample of size n,

(1) a distribution of means from resamples of size n can be obtained by convolving the univariate empirical distribution n times² (call the result [x_i, p_i]) and transforming the resulting distribution into [x_i/n, p_i]. See Examples #1 and #3,³ and the computational sketch following item (4) below.

Example #4 generates the same distribution considering ordered samples on the way.

(2) a distribution of ranges for resamples of size n can be obtained from the univariate empirical distribution [x_i, p_i] by convolving that distribution n times for minimums and maximums jointly (call the result [(x_i, y_i), p_i]) and transforming the resulting distribution into the univariate distribution [y_i − x_i, p_i]. See Example #2.

¹Kemper National Insurance Companies, Long Grove, IL
²"n times" involves n−1 convolutions.
³The Examples (not shown herein) can be obtained from the author.

(3) a distribution of an order statistic such as a 25% trimmed mean for resamples of size n can be obtained by transforming the univariate empirical distribution [x_i, p_i], i = 1, ..., n, into a trivariate distribution [(x_i, y_i, z_i), p_i], convolving this distribution with the empirical distribution n times using formulas specified in Example #5 (call the result [(x_i, y_i, z_i), p_i]), and transforming the result into the univariate distribution [2·z_i/n, p_i]. As the convolutions are performed, the variable in the x position is the largest amount encountered so far in the process, the variable in the y position is the number of duplicates of that amount so far, and the variable in the z position is the trimmed sum so far. See Example #5.

A different approach (referred to here as a "binomial" approach) to generate this same result involves one less dimension, with the variable in the x position being the trimmed sum so far and the variable in the y position being the number of items entering into the trimmed sum so far (including even those items which did not contribute to the trimmed sum). See Examples #7 and #8.

(4) A distribution of ratios of 25% trimmed means to


interquartile ranges is generated in Example #6.
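As a point of reference for items (1)-(4), the following Python sketch (mine, not taken from the author's programs) carries out item (1): it convolves the empirical distribution with itself n−1 times, collapsing duplicate sums, and rescales the support by 1/n to obtain the exact distribution of the resample mean without drawing any random numbers.

from collections import defaultdict

def convolve_sum(dist_a, dist_b):
    """Generalized numerical convolution for sums: both inputs are discrete
    distributions stored as {value: probability}; identical sums collapse."""
    out = defaultdict(float)
    for xa, pa in dist_a.items():
        for xb, pb in dist_b.items():
            out[xa + xb] += pa * pb
    return dict(out)

def resample_mean_distribution(sample):
    """Exact distribution of the mean of a resample of size n, obtained with
    n-1 convolutions of the empirical distribution (item (1))."""
    n = len(sample)
    empirical = defaultdict(float)
    for x in sample:
        empirical[x] += 1.0 / n          # equal probabilities 1/n
    dist = dict(empirical)
    for _ in range(n - 1):               # "n times" = n-1 convolutions
        dist = convolve_sum(dist, empirical)
    return {s / n: p for s, p in dist.items()}   # sums -> means

# resample_mean_distribution([1, 2, 4]) enumerates the 3**3 equally likely
# resamples' means exactly.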

MESHING AND VON MISES THEORY

As the size of the original sample increases, the number of calculations required to obtain distributions of order statistics will become prohibitive, even using fast computers. However, imposing meshes on the coordinate axes can reduce the volume of calculations to a manageable number while at the same time usually permitting results which are sufficiently fine. Using Von Mises Theorem and algorithm (see page 269 of Reference [ ]), the first three moments of each variable (X and Y) can be retained accurately. Using something akin to linear programming we can aim at also keeping the covariance accurate.

Even if the solutions are expressed in terms of multivariate distributions, lower-dimensional distributions may be used if appropriate partial distributions can be captured in a fairly limited number of separate disk files. Such a reduction in the number of dimensions will be practical if the number of values which a marginal random variable⁴ can take on is limited to (say) 2500. The original samples should be restricted to those with sample size 10 to 99, whereas we illustrate the approach using samples of size less than 10.

Illustration A: The solutions of Example #4 and #5 were


described in terms of 3-dimensional distributions
[(X,¥,Z) P], where X is the maximum attained so far in
the recursive process, and Y is the number of
duplicates in the most recent sequence of duplicates in
the recursive process.

Since the number of values Y can take on does not exceed 4, we could choose to restrict attention to the partial bivariate distributions⁵ of (X, Z), one for each value of Y.

These partial distributions would be captured in 4 disk


files, and. combined. at the. time .of) the, final
transformation of Z into a univariate (1-dimensional)
distribution of resample means. The same statements
could be made about Example #5, except that there the
focus is on resample 25% trimmed means.

EE

4Univariate or bivariate.
5RFach of these distributions is conditioned on a
particular value of Y; however, they are partial
distributions, since the probabilities have not be
normalized to unity.

On the other hand, since the number of values the


bivariate random variable (X,Y) can take on does not
exceed 4+3+2+1=10, we could choose to restrict
attention to the partial univariate distributions®
PooRy Pee, Yys
These partial distributions would be captured in 10
disk files, and combined only at the time of the
transformation of Z into a univariate distribution of
resample mean. The same statements could be made about
Example #5, except that there the focus is on resample
25% trimmed mean.

Illustration B: The solution of Example #6 was


described in terms of 4-dimensional distributions
((X,Y,Z,T) Pj], where X is the maximum attained so far,
and Y is the number of duplicates in the most recent
sequence of duplicates so far,in the recursive process.

Since the number of values of the bivariate random


variable (X,Y) can take on does not exceed 4+3+2+1=10,
we could choose to restrict attention to the partial
bivariate (2-dimensional) distributions’
[(4; 7) Bie) 2
These partial distributions would be captured in 10
disk files, and combined at the time of the final
transformation of (Z,T) into a univariate distributions
of resample ratios of 25% trimmed mean to interquartile
range.

Illustration C: The solution of Example #7 was


described in terms of bivariate distributions [(X,Y)
P], where Y is the number of items entering the sample
sum so far in the recursive process.

Since the number of values the random variable Y can


take on is 5, we could choose to restrict attention to
the partial univariate distributions® aes orl ee4
These partial distributions would be captured in 5 disk

SEach of these distributions is conditioned on a


particular value of (X,Y); however, they are partial
distributions, since the probabilities have not be
normalized to unity.
7TIbid
8EFach of these distributions is conditioned on a
particular value of Y; however, they are partial
distributions, since the probabilities have not be
normalized to unity.

files, and combined at the end of the process.

BIVARIATE GENERALIZED NUMERICAL CONVOLUTIONS

If f_{X1,Y1} and f_{X2,Y2} are independent distributions of the bivariate discrete finite random variables (X1,Y1) and (X2,Y2), respectively, then the distribution f_{W1,W2} = f_{X1+X2,Y1+Y2} of the bivariate random variable (X1+X2, Y1+Y2) is the bivariate convolution f_{X1,Y1} + f_{X2,Y2} of f_{X1,Y1} and f_{X2,Y2} for sums.⁹

Let f_{X1,Y1} = [(x1_i, y1_i) p1_i], i = 1, ..., n1. Similarly, let f_{X2,Y2} = [(x2_j, y2_j) p2_j], j = 1, ..., n2. Then

    f_{W1,W2} = f_{X1+X2,Y1+Y2} = f_{X1,Y1} + f_{X2,Y2} = [(x1_i + x2_j, y1_i + y2_j) p1_i·p2_j],  i = 1, ..., n1; j = 1, ..., n2    (Matrix #1)

If n1 and n2 are (say) 10³, then generating this matrix would involve 10⁶ lines.¹⁰ This would be practical if we do not intend to use f_{W1,W2} for further
convolutions. On the other hand, if (for example) we want to generate the distribution f_{X1+X2+X3+X4, Y1+Y2+Y3+Y4} of the bivariate random variable (X1+X2+X3+X4, Y1+Y2+Y3+Y4), then we might be dealing with having to generate 10¹² = 10⁶·10⁶ lines;¹¹ and this would be impractical, because of the amount of both computer storage and computing time required. The following algorithm has been designed to overcome these problems.

⁹We are using + instead of * to indicate convolution for sums; that is, f_{X1,Y1} + f_{X2,Y2} instead of the usual f_{X1,Y1} * f_{X2,Y2}.
¹⁰There may be some collapsing due to identical amount pairs on different lines. The number of lines produced is reduced by representing on a single line all lines with identical amount pairs; on that single line is the amount pair and the sum of the original probabilities.
¹¹Ibid.
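For orientation only (this sketch is not the author's code), the un-meshed bivariate convolution for sums, with the collapsing of identical amount pairs described in footnote 10 and the ε-screening of Loop #1 below, can be written directly as:

from collections import defaultdict

def bivariate_convolve_for_sums(dist1, dist2, eps=0.0):
    """Convolve two discrete bivariate distributions given as
    {(x, y): probability} dictionaries; identical amount pairs are merged,
    and lines with probability below eps are discarded."""
    out = defaultdict(float)
    for (x1, y1), p1 in dist1.items():
        for (x2, y2), p2 in dist2.items():
            p = p1 * p2
            if p >= eps:
                out[(x1 + x2, y1 + y2)] += p
    return dict(out)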

The Bivariate Convolution Algorithm


Choose ε > 0. Typically ε is chosen to be of the order 10⁻¹⁵.

Loop #1: Perform the calculations indicated in Matrix #1 above, discarding any lines for which the resulting probability is less than ε; that is, discard lines for which p1_i·p2_j < ε. The purpose of this is to avoid underflow problems and to increase the fineness of the partitions (meshes) to be imposed.

Calculate low, = min { x1,+x2,#0 | pl,-p2, > © } and


io TAR
$eeke yan yan
high, = max { x1,+x2,#0 | pl,-p2, > € }
i=1,2,...,n
Gt ee cr |

Let nax be a positive integer selected for the purpose


high,,-lovw,,
of creating a partition. Let A = —————; partition
WAX/ cases
the open interval (low,-A,,high,+A,) into nax/2+1
subintervals:
aa Se Subinterval I(r)
[0,0]
(lowv,,-A, lovw,,]
(low,,, low,+A)
[low,+A,low,+2-A)

[low,+(nax/2 - 3)+A,high,)
[high,, high,+A)
Subinterval I(r) is the degenerate interval consisting
of 0 alone. If for some ry>1 Oeliro), then 0 is deleted
from I(ro); that is, that particular subinterval has a
hole at O.

Loop #2: For each r (r=1,2,...,nax/2 + 1) calculate


low,(r)=min{yl,+y2,#0|pl,-p2,>€ and x1,+x2, € Ii} and
ES Were
J=1 525.550
oy
highy(r)=max{y1,+y2 ,#0 | pl,-p2,>€ and x1,+x2, € I(r}
i= 125 tras
j=1,2,...,m

Let nay be a positive integer selected for the purpose


of creating the following partition:
highy(r)-low,(r)
calculate A, y = -——W¥W———_—_; partition each of the
nay/2 = 2
open intervals (low,(r)-Ay(r), high, (r)+A,(r)) into
nay/2+1 subintervals:
| ss: [Subinterval Jr(s)

[0,0]
(Llowy(r)-Ay(r),
Lowy (r) J
(low,(r),
Lowy (r) +A, (r))
[low (r) +A, ir), Lowy (r)+2-Ay(r))

nay/2| [low,(r)+ (nay/2-3) -Ay(r), highy(r))


nay/2 + 1|[(highy(r),highy(r)+A,(r))

Loop #3: Let I(r)xJ,(s) (called a mesh rectangle)


denote the Cartesian product of I(r) and J,(s).

For each ordered pair (r,s) (Gala. ;NaAXx 2:


s=1,2,...,nay/2+1) set to zero the initial value of
each of the following moment accumulators for each
combination of two pairs (r,s) and (i538
(rl2 pcs > pNAX/2+138=172, . ««,.nay/2t1) C12 isc usig ht
j=1,2,.-..,n,) perform the accumulations

Moo [Itr)xT p(s) ] Mo9[I(r)xJ,(s)] + pl,-p2;


Myo[{I(r)xJ,(s) ] Mio[(I(r)xJ,(s)] + (x1,+x2,)"-p1,-p2,

Moo (L(r)xT,(s)] = Mog [Iir)xJT,(s)] + (%1,+x2,)*+p1,-p2,


Myo [Itr)xJ,(s)] = Mgg[I(r)xd,(s)] + (x1,+x2,)°+p1y-p2,
Moi [L(r)xJ,(s)] = Mo, [Tir)xJ,(s)] + (y1\+y2,)'+p1,-p2,
Moa(I(r)xJ,(s)] = Moy[Iirxd,(s)] + (yly+y2,)*+ply-p2,
Mp3 [Iir)xJ,(s)] = Mog[Iir)xd,(s)] + (y1y+y2,)°+p1y-p2,
m,,(I(r)xJd,(s)] = mM, [I (r)xJ,( s) 5) - (y1,+y2,;)-pl,-p2,
]+ (X1,+x2

That is, we generate the probability, the ot through


3rd y-moments, the Ot? through 34 y-moments, and the
joint first moment for each mesh rectangle Ii(r)xJ,(s)
(Lb Spee NAX/ 2+ 17-S1, 2). gMay/2-+ 1).

Loop #4: Von Mises Theorem and algorithm guarantee that



for each pair (r,s) (r=1,2,...,nax/2+1;


s=1,2,...,nay/2+1) there exist and we can find four
pairs of real numbers (X,(r,s),Pj9(r,s)),
(Xo(r,s),Po,(r,s)), (Y1(r,s),P 3 (r,s) ) and
(Yo(r,s),Poa(r,s)) such that xX,(r,s) € I(r), Xg(r,s) €
Iir), yy(r,s) € Its), and yo(r,s) € I,(s) and such
that the following relationships hold:

Pro (r,s)
+Poo (r,s) = Doz (r,s) +Poo(r,s)=Mog9
[ I(r) xT, (s) ]
X1 (r,s) *Py9(r,s)+Xo(r,s) * Poa, (r,s) =m,,[I(r)xJ,(s) ]
2
Xe (r,s) "Pio (r,s) +Xg(r,s) * Poo (r,s) =Mz,
[I (r)xJ,(s) ]
3
Xs (r,s) > Py (r,s)+X3(r,s)° P23, (r,s) =M49 [ I (r)xJ,.(s) J
Yi (r,s) -Po1 (r,s) FYa(r,s) .Poalr,s) =Mp; [It@)xJd,(s) ]
2
yi (r,s) -Poi (r,s) FY2(r,s) .Poo(r,s) =Mp2
[I (r)xJ,(s) ]
3
Yi;(r,s) Poi (r,s) +Ya(r,s) .Poalr,s) =Mo3[
I (r)xJ,(s) J

We note that (x,(r,s),Y,(r,s)), (X (r,s) ,Yolr,s)),


(Xo(r,s),Y,(r,s)) &(Xolr,s),Yolr,s)) all €
[I(r)xJd,(s)].

The "C" program VONMISES in APPENDIX - VONMISES?!2


accepts the ot through 37 moments and produces at
most two points and associated probabilities, with the
feature that the first three moments are accurately
retained. This program is called twice for each pair
(r,S), once to produce x,(r,s), p,.(r.8), Xo(r,s) and
Pa (r,s) and once to produce § y,(r,s), Po,(r,8), Yolr.s)
and Pp,a(r,s).
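Since the appendices are not reproduced here, the following Python routine is only an illustrative stand-in for VONMISES, under the stated assumptions of positive cell probability and positive within-cell variance: given the 0th through 3rd moment accumulators of one coordinate, it returns two points and two probabilities that reproduce those four moments exactly (the two-point Gauss-quadrature construction).

import math

def two_point_support(m0, m1, m2, m3):
    """Given unnormalized moments m_k = sum(x**k * p) over a mesh cell,
    return (x1, p1), (x2, p2) matching m0..m3 exactly.
    Assumes m0 > 0; a cell with zero variance gets all mass at one point."""
    mu1, mu2, mu3 = m1 / m0, m2 / m0, m3 / m0      # normalized raw moments
    var = mu2 - mu1 * mu1
    if var <= 0.0:
        return (mu1, m0), (mu1, 0.0)               # degenerate cell
    # Monic quadratic x**2 - a*x - b orthogonal to 1 and x w.r.t. the cell measure.
    a = (mu3 - mu1 * mu2) / var
    b = mu2 - a * mu1
    disc = math.sqrt(a * a + 4.0 * b)              # positive whenever var > 0
    x1, x2 = (a - disc) / 2.0, (a + disc) / 2.0    # the two support points
    p1 = m0 * (x2 - mu1) / (x2 - x1)
    p2 = m0 - p1
    return (x1, p1), (x2, p2)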

Von Mises Theorem and algorithm do not assure that the


covariance m,,[I(r)xJ,(s)] will equal
22

} YX (r,s) *Yy(ry8) “Py (r,s)

1=1j=1
We can represent the (partial) distribution (of
(X,+X2,Y,+¥2) restricted to IrxJ,(s) by the four
points and associated probabilities

er ee mia e ee Oe
12The Appendices (not shown herein) can be obtained from
the author.

(X,(r,s),Y,(r,s)) py (r,s)
+ (X, (7,8), Yolr,s)) =Pjolr,s)
(X2(r,s) Yi (r,s)) Pdi (r,s)
(X2(r,s),Yoalr,s)) p$o(r,s)
where Pii(r,s), Pi2(r,s), P31 (r,s) and PS2(r,s) are
determined as follows:

We have © p,,(r,s), P2o(r,s), Poi(r,s) and D)p,olr,s). We


want to determine pj,(r,s), Pio(r,s), P51 (r,s) and
P32(r,s), each in the interval [0,1], so that
Pii(r,s) + PSilr,s) = Poy (r,s)
Pi2(r,s) + PS2(r,s) = Poolr,s)
Pii(r,s) + Pjotr,s) = Pyo(r,s)
Pai(r,s) + Polr,s) = Poolr,s)
and so as to minimize the absolute value of the error

2 2

M,(I() X Jyv(s)] - > Xj (r,s) Ys (F,S)*pij(r,s)


i=l j=1

in the joint moment.

We note that choosing a value for pj,(r,s) will


determine the values for Pia(r.s), Pdi (r,s) and
Péo(r,s) in these four equations. Our _ choice of
Pii(r,s) is not completely arbitrary, since Pii(r.s),
Pi2(r,s), Po,(r,s) and pdolr,s) must each e [0,1]. But,
restricting pj,(r,s) to this extent, we can proceed to
choose pj,(r,s) so as to minimize the absolute error in
the joint moment. Choosing pj,t,s) to do this is
straightforward, because the error in the joint moment
can be expressed as a linear function of pj,(r,s).13

The "C" program LINEAR IN APPENDIX - LINEAR!4 embodies


this logic, enabling us to represent the (partial)
distribution in each mesh rectangle I (r)xJ,(s) by the

Fi at I ae ee eR ee ee OE
13fhe absolute value of the error in the joint moment is
linear in two pieces, because of the absolute value
being taken.
14The Appendices (not shown herein) can be obtained from
the author.

original four points and the newly derived associated


probabilities. In the process the 0? through aee
moments of X,+X, and Y,+Y2, respectively, are retained
accurately; and an attempt is made to also retain
accurately the joint moment.

Having kept accurately the first three moments of each


of X,+X, and Y,+Y,, respectively, within each mesh
rectangle, we have automatically kept accurately the
corresponding global moments.

For each mesh rectangle I (r)xJ,(s) (r=1,2,...,nax/2 +


di; S=15 2)...
,May /27teee) there correspond the vectors
[ (X,(r,8),¥,(r,8)) 4, (r,s) J
[ (X,0,8),Yolr,s)) Pyolr,s)]
[ (Xo(r,8),¥,(r,8)) Po (ry8)]
[ (X5(r,8) ,Yolr,s)) Poolr,s)]

We can the express the full distribution £X,+X5,Y,+Y>


of the bivariate random variable (X,+X3,Y,+Y2) as

’ s=nay/2+1
(Xq(r,s),Yy(r,s)) Pay rss) |ponaxy2et
(X1(r,8),Yolr,s)) —Pyo(rys)
(X2(r,s),Y;(r,s)) Poy (r,s)
(X2(r,8),Yolr,s)) Poalr,s) |=?
s=1 .

REFERENCES & ACKNOWLEDGMENTS

Springer, M.D., The Algebra of Random Variables,


John Wiley & Sons, Inc., New York, 1979

I want to thank Larry Hickey of PolySystems, Inc. in Chicago for focusing my attention on Von Mises Theorem and for patiently tutoring me in the "C" programming language. He has been helpful to me over the years as programmer, mathematician and codesigner. Josh Zirin, formerly of Kemper Group, assisted in developing the "C" program "vonmises". Thanks also to Anne Hoban of Kemper National and the CHIWRITER scientific/multifont word processor.
-

A GENERALIZED BOOTSTRAP
By EDWARD J. BEDRICK! AND JOE R. HILL?
University of New Mexico and EDS Research

Abstract
This article defines a generalized bootstrap that is based on the general framework for model-based statistics described in Hill (1990). This bootstrap includes Efron's (1979) frequentist, Rubin's (1981) Bayesian, and Hinkley and Schechtman's (1987) conditional bootstraps as special cases. It also includes new bootstraps for empirical Bayesian (EB) models. In particular, we present a conditional EB bootstrap used for EB inference that is conditional on EB ancillary statistics.

Key words: Bayesian bootstrap; Conditional bootstrap; Empirical Bayesian bootstrap; Frequentist bootstrap; General framework.

1 Introduction
The generalized bootstrap has four inputs:

(i) a general statistical model, P = {p_λ(θ, y), λ ∈ Λ}, which is a collection of joint probability distributions for the parameters θ and the data y, indexed by λ ∈ Λ;

(ii) an estimator of λ, λ̂ = λ̂(y);

(iii) a vector of fixed features of the experiment, z = z(θ, y); and

(iv) observed data, y_obs.

Given these four inputs, a bootstrap replication is conceptually defined in two steps:

1. generate z* from the distribution of z given y = y_obs assuming that λ = λ̂(y_obs); then

2. generate (θ*, y*) from the distribution of (θ, y) given z = z* assuming that λ = λ̂_obs.
1Department of Mathematics and Statistics, University of New Mexico, Albuquerque,
New Mexico, 87131.
2EDS Research, 5951 Jefferson St. NE, Albuquerque, New Mexico, 87110.

Exploring the Limits of Bootstrap. Edited by Raoul LePage and Lynne Billard.
Copyright©1992 by John Wiley & Sons, Inc. ISBN: 0471-53631-8.

REMARK A. The joint distributions in a general statistical model can be represented in two ways: for all λ ∈ Λ, p_λ(θ, y) = p_λ(y | θ) p_λ(θ) = p_λ(θ | y) p_λ(y). From left to right these distributions form the joint, the sampling, the prior, the posterior, and the marginal families. This definition of a statistical model includes frequentist, Bayesian, and classical empirical Bayesian (EB) models as special cases (Hill, 1990).

For a frequentist model (Neyman, 1935),

    p_λ(θ, y) = p(y | θ) δ_λ(θ) = δ_λ(θ) p_λ(y)

where δ_λ(θ) = 1 if θ = λ, 0 otherwise, and p_λ(y) = p(y | θ = λ). That is, the parameter θ is a "fixed but unknown constant" (Neyman, 1977). Consequently, the prior and posterior families are identical and the marginal family is the sampling family with θ replaced by λ.
A Bayesian model is a completely specified joint distribution for the parameter θ and the data y,

    p(θ, y) = p(y | θ) p(θ) = p(θ | y) p(y).

The index λ takes a single value for Bayesian models, so λ̂ = λ for all y.

Classical empirical Bayesian models (Robbins, 1956) are families of joint distributions for parameters θ = (θ_1, ..., θ_k)' and data y = (y_1, ..., y_k)'. Usually, (θ_i, y_i), i = 1, ..., k, are exchangeable. Section 3 discusses the empirical Bayesian model for a one-way layout.
REMARK B. A summary combines a probability model P with observed data y_obs to form an estimate of the underlying probability mechanism (Efron, 1982b). For a frequentist model, P = {p_λ(y), λ ∈ Λ}, Efron (1982b) defined the maximum likelihood summary (MLS), p_obs(y), to be the member of P that satisfies the property: p_obs(y_obs) ≥ p_λ(y_obs) for all λ ∈ Λ. Hence, the MLS is the maximum likelihood estimate (MLE) of p_λ(·); that is, p_obs(y) = p_{λ̂_obs}(y), where λ̂_obs = λ̂(y_obs) is the observed MLE of λ. Applying this concept to a general statistical model P and an estimator λ̂, a summary is a single joint distribution for θ and y; that is, p_obs(θ, y) = p_{λ̂_obs}(θ, y).

The generalized bootstrap generates fixed features z* from the posterior distribution of z given y = y_obs determined by the summary p_obs(θ, y); then it generates outcomes (θ*, y*) from the conditional distribution of (θ, y) given z = z* determined by the summary. As such, the generalized bootstrap is essentially an estimated Bayesian bootstrap.
REMARK C. Rubin (1984) defined the concept of "fixed features of an experiment." If the experiment were going to be repeated, then certain features like the sample size, the configuration of missing data, and the observational units might be kept the same as in the original experiment. Because bootstraps hypothetically repeat the experiment as though the summary were the actual underlying distribution, the bootstrap variability of the fixed features is given by their estimated posterior distribution. The bootstrap variability of the other features of the experiment is determined by their estimated distributions conditioned on the fixed features.
REMARK D. Let Q = Q(θ, y) be a random variable whose expectation Q̄ with respect to repetitions defined by fixed features z is of interest. The bootstrap estimate of Q̄ is (assuming that all variables are absolutely continuous)

    Q̄_boot = [ ∫ { ∫∫ Q(θ*, y*) p_λ(θ*, y* | z*) dθ* dy* } p_λ(z* | y = y_obs) dz* ]_{λ = λ̂_obs}.

The bootstrap takes advantage of the fact that λ̂_obs will be plugged in for λ by directly calculating

    Q̄_boot = ∫ { ∫∫ Q(θ*, y*) p_obs(θ*, y* | z*) dθ* dy* } p_obs(z* | y = y_obs) dz*.

That is, the summary p_obs(θ, y) is the only member of P used by the bootstrap. Hence the bootstrap can avoid calculations involving all λ in Λ, which contrasts with most analytical methods.
REMARK E. The inputs P, λ̂, and z make this bootstrap very general. Section 2 shows that for appropriate choices of these inputs, the generalized bootstrap reduces to specific well-known bootstraps including Efron's (1979) frequentist bootstrap, Rubin's (1981, 1984) Bayesian bootstraps, and Hinkley and Schechtman's (1987) conditional bootstrap. Section 3 defines a conditional EB bootstrap appropriate for inference (Hill, 1987, 1990).

2 Well-Known Examples
2.1 Efron’s frequentist bootstrap
The frequentist bootstrap (Efron, 1979, 1982a, b) has inputs: (i) P is a frequentist model as described in Remark A of Section 1; (ii) usually, λ̂ is taken to be the MLE, but this is not required; and (iii) z is null. A bootstrap replicate sets θ* = λ̂_obs and generates y* from the MLS p_obs(y). Because the estimated prior and posterior distributions are both the same point distribution θ = λ̂_obs with probability one, this bootstrap also arises if the fixed features z = θ.
EXAMPLE 1. Let y_1, ..., y_n be independently and identically distributed with cumulative distribution function (CDF) θ = λ ∈ Λ, where Λ is the collection of all continuous distributions on R. The vector of order statistics t = (y_(1), ..., y_(n))' is complete sufficient for λ, and the empirical CDF, λ̂, is the nonparametric MLE of λ. The bootstrap sets θ* = λ̂_obs and generates y* by randomly sampling from λ̂_obs, which amounts to sampling with replacement from y_obs. This is a nonparametric bootstrap. Parametric bootstraps are also possible, as illustrated by the following two examples.
EXAMPLE 2. Let x_1, ..., x_n be a random sample from the discrete sample
space {1, ..., k} with Pr[x_i = j | θ = (θ_1, ..., θ_k)′] = θ_j, j = 1, ..., k. The vector
of cell counts, y = (y_1, ..., y_k)′, y_j = #{i : x_i = j}, is a multinomial random
variable with

y | θ ~ Mult_k(n, θ).

Assume that θ = λ with probability one for some unknown vector of proba-
bilities λ. Then the marginal distribution of y is

y ~ Mult_k(n, λ).

The MLE of λ is λ̂ = y/n. The bootstrap sets θ* = λ̂_obs and generates y* from
the MLS

y* ~ Mult_k(n, λ̂_obs).

EXAMPLE 3. Let

y_i ~ N(μ, σ²)

independently for i = 1, ..., n, where λ = θ = (μ, σ²) indexes the family. The
MLE of λ is (ȳ, σ̂²), where ȳ = Σ y_i/n and σ̂² = Σ(y_i − ȳ)²/n. The bootstrap
sets θ* = λ̂_obs and generates y* from the MLS

y_i* ~ N(ȳ_obs, σ̂²_obs)

independently for i = 1, ..., n. For example, the bootstrap estimates of the bias
and variance of ȳ are 0 and σ̂²_obs/n, respectively.

2.2 Rubin’s Bayesian bootstraps


The basic Bayesian bootstrap (Rubin, 1981) has inputs: (i) P is a Bayesian
model as defined in Remark A of Section 1; (ii) λ̂ = λ because there is only
one distribution in P; and (iii) z = y. A bootstrap replicate sets y* = y_obs and
generates θ* from the posterior distribution p(θ | y = y_obs).
EXAMPLE 4. Following Example 2, assume that y = (y_1, ..., y_k)′ has a
multinomial sampling distribution

y | θ ~ Mult_k(n, θ).

Furthermore, assume that θ = (θ_1, ..., θ_k)′ has a Dirichlet prior distribution

θ ~ Dir_k(m_1, ..., m_k),

where the m_j are known. Then the posterior distribution of θ is also Dirichlet;
in particular,

θ | y_obs ~ Dir_k(y_1 + m_1, ..., y_k + m_k).

A sample from p(θ | y = y_obs) can be generated using the gaps between ordered
uniform random variables (Rubin, 1981).
If m_j = 0, j = 1, ..., k, and θ̂_obs = y_obs/n, then the posterior mean and
variance-covariance matrix of θ − θ̂_obs are

E(θ − θ̂_obs | y = y_obs) = 0

and

Var(θ − θ̂_obs | y = y_obs) = [Diag(θ̂_obs) − θ̂_obs θ̂_obs′]/(n + 1),

respectively.
These results are close to those for the frequentist bootstrap given in Ex-
ample 2. For that situation, the frequentist bootstrap distribution of θ* − θ̂_obs,
which assumes that θ equals the observed MLE θ̂_obs = y_obs/n, has mean 0 and
variance-covariance matrix

[Diag(θ̂_obs) − θ̂_obs θ̂_obs′]/n.
This similarity was noted by Rubin (1981) and by Efron (1982a).
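The uniform-gaps device mentioned above can be sketched in a few lines of Python (an illustration for the m_j = 0 prior of Example 4; the counts and replication number are arbitrary):

import numpy as np

rng = np.random.default_rng(1)

def bayesian_bootstrap_theta(y, n_rep=2000):
    # One Dirichlet(1,...,1) weight vector over the n observations per replicate,
    # built from the gaps between n - 1 ordered Uniform(0,1) draws (Rubin, 1981).
    # Summing the weights within each cell gives theta* ~ Dir_k(y_1, ..., y_k).
    n, k = int(y.sum()), len(y)
    cell_of_obs = np.repeat(np.arange(k), y)       # one cell label per observation
    draws = np.empty((n_rep, k))
    for r in range(n_rep):
        u = np.sort(rng.uniform(size=n - 1))
        gaps = np.diff(np.concatenate(([0.0], u, [1.0])))
        draws[r] = np.bincount(cell_of_obs, weights=gaps, minlength=k)
    return draws

y_obs = np.array([3, 5, 2])                        # illustrative multinomial counts, n = 10
theta_star = bayesian_bootstrap_theta(y_obs)
# np.cov(theta_star.T) should be close to [Diag(theta_obs) - theta_obs theta_obs']/(n + 1).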
Rubin (1984) defined two other Bayesian bootstraps. When the fixed fea-
tures z = θ, the bootstrap generates θ* from the posterior p(θ | y = y_obs) as be-
fore. But now it also generates y* from the sampling distribution p(y | θ = θ*).
Marginally, y* is generated from the predictive distribution of a future obser-
vation given the original observed data. Rubin (1984) used this bootstrap to
evaluate tests of model adequacy when the experimental units were consid-
ered fixed features of the experiment.
When z is null, the bootstrap generates (θ*, y*) from the joint distribu-
tion p(θ, y). Rubin (1984) used this bootstrap to evaluate the unconditional
calibration of procedures.

2.3. Hinkley and Schechtman’s conditional bootstrap


The conditional bootstrap (Hinkley and Schechtman, 1987) has inputs: (i) P
is a frequentist model; (ii) λ̂ is taken to be the MLE; and (iii) z = a, where a is
ancillary for λ (that is, (λ̂, a) is minimally sufficient for λ and p_λ(a) = p(a) for
all λ). A conditional bootstrap replication sets θ* = λ̂_obs and a* = a_obs, then
generates y* from the estimated conditional distribution p_{λ=λ̂_obs}(y | a = a_obs).
Efron (1982b) related this estimated conditional distribution to his dis-
cussion of maximum likelihood summarization. Fisher (1934, 1956) argued
strongly that inference should be conditioned on any ancillary information in
the minimal sufficient statistic in order to recapture information lost by the
MLE. Hinkley (1980) argued that inference must be conditional in order to
account for the use of the ancillary as a model checking statistic.
For a real-valued parameter, Efron and Hinkley (1978) showed that the
conditional bootstrap estimate of the variance of the MLE equals, approxi-
mately, one over the observed Fisher information. This result contrasts with
the unconditional bootstrap estimate which equals, approximately, one over
the expected Fisher information.

3 A Conditional EB Bootstrap
We illustrate the conditional EB bootstrap using the following example.
EXAMPLE 5. Let θ = (θ_1, ..., θ_k)′ and y = (y_1, ..., y_k)′ have joint distribu-
tions described by

y_i | θ ~ind N(θ_i, 1),   θ_i ~iid N(0, τ²),   i = 1, ..., k,

indexed by τ². This model is equivalent to the model

y_i ~iid N(0, 1/B),   θ_i | y ~ind N{(1 − B)y_i, 1 − B},   i = 1, ..., k,

indexed by B = 1/(1 + τ²). Let η = θ_1 be the parameter of interest. If
S = Σ y_i², a = y_1/(S/k)^{1/2}, and t = (S, a), then t is minimally sufficient and a
is ancillary for η (Hill, 1990).
Let B̂ = B̂(S) be an estimator of B based on S and let B̂_obs = B̂(S_obs). A
conditional EB bootstrap replicate

1. sets a* = a_obs,

2. generates S* ~ χ²_k / B̂_obs,

3. calculates y_1* = a*(S*/k)^{1/2}, and

4. generates η* ~ N{(1 − B̂_obs)y_1*, 1 − B̂_obs}.

Note that this bootstrap differs from the one defined by Laird and Louis
(1987).
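A minimal Python sketch of one such replicate, assuming the Stein-type choice B̂(S) = (k − 2)/S and a truncation of B̂_obs at one so that the step-4 variance stays non-negative (both are illustrative assumptions, not requirements of the text):

import numpy as np

rng = np.random.default_rng(2)

def conditional_eb_replicate(y_obs):
    # Returns (y1*, S*, eta*) from one replicate following steps 1-4 above.
    k = len(y_obs)
    S_obs = float(np.sum(y_obs ** 2))
    a_obs = y_obs[0] / np.sqrt(S_obs / k)
    B_obs = min((k - 2) / S_obs, 1.0)                  # illustrative B(S), truncated so 1 - B_obs >= 0
    S_star = rng.chisquare(k) / B_obs                  # step 2 (step 1 keeps a* = a_obs fixed)
    y1_star = a_obs * np.sqrt(S_star / k)              # step 3
    # step 4: N{mean, variance} convention, so pass the standard deviation to numpy
    eta_star = rng.normal((1.0 - B_obs) * y1_star, np.sqrt(1.0 - B_obs))
    return y1_star, S_star, eta_star

k = 12
y_obs = rng.normal(size=k)                             # illustrative data
reps = [conditional_eb_replicate(y_obs) for _ in range(2000)]
# Recomputing Stein's estimate on each replicate and averaging the squared error
# gives the bootstrap estimate of the conditional EB MSE discussed next.
mse_boot = np.mean([((1 - min((k - 2) / S, 1.0)) * y1 - eta) ** 2 for y1, S, eta in reps])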
To find the conditional mean squared error of an estimator η̂ of η, in the
notation of Remark D of Section 1, let Q(θ, y) = (η̂ − η)² and z = a. Hill
(1990) proved that if η̂(t) = (1 − B̂(S))y_1, then

Q̄ = MSE_λ(η̂ | a) = 1 − B + RSL_λ(η̂) B a²,

where RSL_λ(η̂) is the relative savings loss of η̂ defined by Efron and Morris
(1973).
For Stein's estimator η̂ = (1 − B̂)y_1, where B̂ = (k − 2)/S is an unbiased
estimator of B (in the marginal family),

Q̄ = MSE_λ(η̂ | a) = 1 − B + (2/k)Ba².

So the bootstrap estimate of the conditional EB MSE of Stein's estimator is

Q̄_obs = MSE_{λ=λ̂_obs}(η̂ | a) = 1 − B̂_obs + (2/k)B̂_obs a².

This estimate equals the posterior MSE of η when τ² ~ Unif[−1, ∞), the
hyperprior that gives Stein's rule.
Efron and Morris (1973) derived the relative savings loss for a class of
truncated Bayes rules. We can use their results to provide analytic conditional
EB bootstrap estimates of the conditional MSE for those rules.
Our results on EB conditional confidence bounds and more complicated
EB models will be reported elsewhere.

References
EFRON, B. (1979). Bootstrap methods: Another look at the jackknife. Ann.
Statist. 7, 1-26.
EFRON, B. (1982a). The Jackknife, Bootstrap, and Other Resampling Plans.
Society for Industrial and Applied Mathematics, Philadelphia.
EFRON, B. (1982b). Maximum likelihood and decision theory. Ann. Statist.
10, 340-356.
EFRON, B., AND HINKLEY, D. V. (1978). Assessing the accuracy of the MLE:
Observed versus expected Fisher information (with discussion). Biometrika
65, 457-487.
EFRON, B., AND Morris, C. (1973). Stein’s estimation rule and its competi-
tors — An empirical Bayes approach. J. Amer. Statist. Assoc. 68, 117-130.
FISHER, R. A. (1934). Two new properties of mathematical likelihood. Proc.
Roy. Soc. London Sect. A 144, 285-307.
FISHER, R. A. (1956). Statistical Methods and Scientific Inference. Oliver
and Boyd, Edinburgh.
HILL, J. R. (1987). Comment on “Empirical Bayes confidence intervals based
on bootstrap samples,” by N. M. Laird and T. A. Louis, J. Amer. Statist.
Assoc. 82, 752-754.
HILL, J. R. (1990). A general framework for model-based statistics. Biometri-
ka 77, 115-126.
HINKLEY, D. V. (1980). Likelihood. Can. J. Statist. 8, 151-163.

HINKLEY, D. V., AND SCHECHTMAN, E. (1987). Conditional bootstrap


methods in the mean-shift model. Biometrika 74, 85-93.
LAIRD, N. M., AND LOUIS, T. A. (1987). Empirical Bayes confidence inter-
vals based on bootstrap samples. J. Amer. Statist. Assoc. 82, 739-750.
NEYMAN, J. (1935). On the problem of confidence intervals. Ann. Math.
Statist. 6, 11-116.
NEYMAN, J. (1977). Frequentist probability and frequentist statistics. Syn-
these 36, 97-131.
ROBBINS, H. (1956). An empirical Bayes approach to statistics. Proc. Third
Berkeley Symp. Math. Statist. Probab. 1, 157-164.
RUBIN, D. (1981). The Bayesian bootstrap. Ann. Statist. 9, 130-134.
RUBIN, D. (1984). Bayesianly justifiable and relevant frequency calculations
for the applied statistician. Ann. Statist. 12, 1151-1172.
BOOTSTRAPPING ADMISSIBLE LINEAR
MODEL SELECTION PROCEDURES

David Brownstone, University of California, Irvine*

Abstract
This paper investigates the use of the bootstrap for estimating
sampling distributions of standard and limited-translation Stein-rule
estimation procedures, based on Li (1985 and 1987) and Stein (1981).
These estimators incorporate uncertainty in the model selection process and
dominate both Ordinary Least Squares on the largest model and common
pretest estimators. Since there are no asymptotic approximations available
for these estimators, the bootstrap is necessary to obtain standard errors
needed for their practical application. Monte Carlo studies are performed
to verify the performance of the new procedures and the accuracy of the
bootstrap approximations to the second moments of their sampling
distributions. The bootstrap generates much more accurate standard errors
than the delete—one jackknife for the designs considered here. Additional
Monte Carlo studies are used to study the accuracy of various confidence
bands calculated from the bootstrap distributions. Although reasonably
accurate, there is room for improvement using double bootstrapping or
prepivoting.

1. Introduction
This paper extends the work in Brownstone (1990 a and b) analyzing
the bootstrap estimator for the sampling distributions of standard and
limited—-translation Stein-rule procedures for the linear model. These
procedures, based on Li (1985 and 1987) and Stein (1981), are admissible
alternatives to pretest estimators commonly used in the presence of model
uncertainty. Since there are no asymptotic approximations of the sampling
distribution of these procedures available, the bootstrap is necessary to
obtain standard errors needed for their practical application.
Brownstone (1990b) describes a Monte Carlo study, summarized in
Section 4 of this paper, which verifies the improved performance (relative to
Least Squares) of the Stein-rule procedures and the accuracy of their
bootstrapped variances. Section 5 of this paper describes similar Monte
Carlo studies for models with more exogenous variables, and it also
examines two alternative versions of the delete-one jackknife (see Wu,

* Department of Economics, University of California, Irvine, CA 92717.


Financial support from the UCI Institute of Transportation Studies and the
UCI Research Unit in Mathematical Behavioral Sciences is gratefully
acknowledged.

1986) variance estimators. These jackknife estimators require less
computation than the bootstrap, but they are badly biased for the cases
considered here.
Sampling variances are frequently used to construct approximate
confidence intervals for parameter estimates. The sixth section investigates
the accuracy of these intervals and other simple confidence intervals
calculated directly from the bootstrap distributions. The intervals based on
the bootstrapped variances perform as well as the "percentile" and
"bias—corrected percentile" methods given in Efron and Tibshirani WEED:
but all of these intervals can be improved using prepivoting and/or "double
bootstrapping" (see Beran, 1986, and Hall, 1988). These improved intervals
are discussed in Section 7.

2. Bootstrapping and Jackknifing Multiple Regression Model Estimators


This section describes how Efron’s (1979) bootstrap procedure can be
applied to estimate sampling distributions of estimators for the multiple
regression model. The bootstrap method can be applied to much more
general situations (see Efron, 1982), but all of the essential elements of the
method are clearly seen by concentrating on the familiar multiple regression
model:

y = Xβ + ε, (1)

where X and β are fixed (T×K) and (K×1) matrices with full rank and T > K.
The components of ε are independent identically distributed random
variables with zero mean and common variance σ².
The nonparametric bootstrap estimate of the sampling distribution of
an estimator β* of β is generated by repeatedly drawing with replacement
from the residual vector

e* = y − Xβ*. (2)

If e_b is a (T×1) vector of T independent draws from e*, then the
corresponding bootstrap dependent variable is given by

y_b = Xβ* + e_b. (3)

For each vector y_b the estimator is recomputed, and the sampling
distribution of the estimator is estimated by the empirical distribution of
these estimates computed over a large number of y_b. This nonparametric
bootstrap has been applied to real data by Freedman and Peters (1984) and
others (see Veall's, 1989, survey for further references).
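A minimal Python sketch of this residual bootstrap, with ordinary least squares as the plug-in estimator and made-up dimensions (both illustrative choices):

import numpy as np

rng = np.random.default_rng(3)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def residual_bootstrap(X, y, estimator=ols, n_boot=300):
    beta_star = estimator(X, y)
    resid = y - X @ beta_star                          # e*, eq. (2)
    T = len(y)
    draws = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        e_b = rng.choice(resid, size=T, replace=True)  # resample residuals
        y_b = X @ beta_star + e_b                      # eq. (3)
        draws[b] = estimator(X, y_b)                   # re-estimate on the bootstrap sample
    return draws                                       # empirical sampling distribution

T, K = 100, 10
X = rng.uniform(size=(T, K))
y = X @ np.ones(K) + rng.normal(size=T)
boot_se = residual_bootstrap(X, y).std(axis=0, ddof=1)  # bootstrap standard errors

Any of the estimators considered below (LSR, LISTEIN, NEWSTEIN) can be passed as estimator in place of ols.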
The bootstrap method described above first approximates the true
distribution of the residuals by the empirical cumulative distribution
function of the estimated residuals from eq. 2. It then evaluates the exact
sampling distribution of the estimator by Monte Carlo simulations
assuming that this empirical cumulative distribution function is true. In
contrast, the traditional (non-bootstrap) approach first derives an
asymptotic approximation to the sampling distribution, and then estimates
the unknown parameters of this approximation. Theoretical results by
Bickel and Freedman (1981) and Beran (1986) show that the bootstrap
distribution converges (as the original sample size, T, and the number of
bootstrap repetitions get large) to the correct sampling distribution. The
key assumption, which is satisfied by all of the procedures considered in this
paper, is that the empirical cumulative distribution function of the residuals
in eq. 2 converges uniformly to the true distribution. If one is uncertain
about the consistency of β*, then β* could be replaced by LS (or some other
consistent estimator) to define e* in (2). Since the resulting residuals
satisfy the convergence criterion, the bootstrap sampling distribution
converges no matter how badly behaved β* is.
Jackknifing is an older resampling method which has recently been
reviewed by Wu (1986). I will only consider the delete-one jackknife here
since it is the only one offering significant computational advantages over
the bootstrap. If β*_{-i} is the estimate of β computed with the ith observation
deleted from the sample, then define the pseudovalues P_i = Tβ* − (T−1)β*_{-i}.
The ordinary jackknife variance estimator for β* is given by:

V_J = [1/(T(T−1))] Σ_i (P_i − P̄)(P_i − P̄)′. (4)

Hinkley (1977) proposed an improved version, called the balanced jackknife,
which uses the weighted pseudovalues Q_i = β* + T(1−w_i)(β* − β*_{-i}), where
w_i = x_i′(X′X)⁻¹x_i and x_i is the ith row of X. The balanced jackknife
variance estimator is given by:

V_B = [1/(T(T−1))] Σ_i (Q_i − β*)(Q_i − β*)′. (5)

Both of these jackknife variance estimators are consistent but biased under
assumptions on the X matrix and the smoothness of the mapping from the
least squares estimator to β*. These jackknife estimators are much faster to
compute than the bootstrap since the β*_{-i} can be computed without
recomputing regressions for each reduced data set.
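A minimal Python sketch of both variance estimators, following the reconstruction of eqs. (4)-(5) above; for simplicity it re-fits after each deletion rather than exploiting the closed-form delete-one LS update mentioned in the text:

import numpy as np

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def jackknife_variances(X, y, estimator=ols):
    T, K = X.shape
    beta_star = estimator(X, y)
    w = np.einsum('ij,jk,ik->i', X, np.linalg.inv(X.T @ X), X)     # leverages w_i = x_i'(X'X)^{-1}x_i
    P = np.empty((T, K))
    Q = np.empty((T, K))
    for i in range(T):
        keep = np.arange(T) != i
        beta_i = estimator(X[keep], y[keep])                       # delete-one estimate
        P[i] = T * beta_star - (T - 1) * beta_i                    # ordinary pseudovalue
        Q[i] = beta_star + T * (1 - w[i]) * (beta_star - beta_i)   # balanced pseudovalue
    Pbar = P.mean(axis=0)
    V_J = (P - Pbar).T @ (P - Pbar) / (T * (T - 1))                # eq. (4)
    V_B = (Q - beta_star).T @ (Q - beta_star) / (T * (T - 1))      # eq. (5)
    return V_J, V_B

Passing one of the Stein-rule procedures as estimator gives the jackknife variances examined in Section 5.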
Of course, good asymptotic properties are only useful if the
asymptotic approximation holds for realistic models and sample sizes.
Monte Carlo studies by Brownstone (1990 a and b) and Freedman, Navidi,
and Peters (1988) indicate that, with sufficient degrees of freedom, the
bootstrap can generate accurate standard error estimates for estimated
parameters in linear models. Section 5
of this paper shows that the jackknife estimators may have poor small
sample properties in similar situations.

3. Stein—Rule Estimators and Model Selection


This section briefly reviews the results in Li (1985) and Li and Hwang
(1984) and shows how they can be applied to the models considered in the
rest of this paper. Stein-rule or "shrinkage" estimators were first
investigated by James and Stein (1961), who showed that these estimators
dominate LS under a squared error loss function. There has been a large
amount of work extending these estimators to different models and
situations. Judge and Bock (1978) and Judge and Yancey (1986) provide
good reviews of the application of these developments to multiple regression
and other econometric models, and Stigler (1990) gives an excellent
intuitive proof of Stein’s (1956) result. These authors have also proposed
Stein—-rule estimators as an alternative to commonly used pretest procedures
which are known to have poor sampling properties (see Brownstone, 1990a,
and Hill and Judge, 1987). Li and Hwang (1984) further point out that
appropriately defined Stein-rule estimators are "robust" with respect to
misspecified models. The development in this section, following Li (1985),
assumes that there is a true base model, and all approximate models are
derived by imposing linear restrictions on this base model. Although it is
relatively straightforward to consider nonlinear restrictions, the assumption
that the base model is correct is critical.
I will only consider the orthonormal linear model where X′X = I.
The general linear model in equation (1) can be transformed to this case
using the singular value decomposition. The LS estimator of β, β*, is just
X′y. If we temporarily assume that the ε are Normally distributed, then:

β*_i = β_i + δ_i,   i = 1, ..., K, (6)

where the δ_i are independent normally distributed with mean zero and variance
σ². The estimation of β from β* in eq. 6 is the classic multivariate Normal
mean problem studied by Stein (1956), where he showed that LS is
inadmissible under squared error loss (||β* − β||²).
Consider a class of linear estimators, Γ*(h), of the unknown mean
vector β. Associated with each h in a finite index set H there is a K×K
matrix, M(h), such that Γ*(h) = M(h)β*. The model selection problem is
how to choose h. For example, if the columns of X are derived from a
singular value decomposition of a collinear design matrix, then some of the
characteristic values and the associated β_i will be close to zero. This
suggests considering a class of restricted models with the last (K−h) β*_i set
to zero (i.e. M(h) is a diagonal matrix with the first h elements equal to one
and the rest zero)¹.
The next step is to choose h to minimize the risk function, but for
typical linear estimators (including LS) this risk function is unbounded. Li
(1985) suggests replacing the class of linear estimators, Γ*(h), with their
Stein-rule counterparts:

β^S(h) = β* − [σ² / (β*′B(h)β*)] A(h)β*, (7)

¹Li (1987) shows that this general framework can also be applied to ridge
regression and nearest-neighbor nonparametric regression.

where A(h) = I − M(h) and, if M(h) is symmetric²,

B(h) = {[trace A(h)]I − 2A(h)}⁻¹ A(h)². (8)

Note that eq. (8) requires that the largest characteristic root of A(h) be less
than half the trace of A(h). This condition, which can be restrictive in
practice, will be discussed in later sections. The asymptotic properties of β^S(h)
are unaffected if σ² is replaced by a consistent estimator such as s².
Li (1985) shows that these Stein-rule estimators have the same
asymptotic properties as the associated linear estimators, but for any finite
sample size they have bounded risk. These Stein-rule estimators shrink the
restricted estimator, Γ*(h), toward the unrestricted estimator (β*). In the
linear model selection example discussed above, if the model indexed by h is
appropriate (i.e., the deleted coefficients are small) then the shrinkage
factor σ²/(β*′B(h)β*) is close to one and the Stein-rule estimates are close to
the restricted estimate. If the model indexed by h is inappropriate, then the
Stein-rule estimates are closer to the unrestricted LS estimates.
Finding an optimal h among the class of Stein-rule estimators, β^S(h),
requires an estimator for their risks. Stein (1981) shows that, under
regularity conditions,

S(h) = σ²K − σ⁴ ||A(h)β*||² / (β*′B(h)β*)² (9)

is an unbiased estimator for the risk of β^S(h). Li (1985) further shows that
even when S(h) is not a consistent estimator of the risk, it is still a
consistent estimator of the true loss (||β^S(h) − β||²) divided by K. Note that
choosing h to minimize S(h) is equivalent to minimizing

K (β*′B(h)β*)² / ||A(h)β*||², (10)

which is independent of σ². Li (1987) shows that minimizing eq. (10) is
identical to the Generalized Cross Validation model selection criterion (see
Craven and Wahba, 1979) and is asymptotically optimal in the sense that

L(β^S(h*)) / min_{h∈H} L(β^S(h)) → 1 in probability (as T → ∞), (11)

where L(·) is the loss function and h* denotes the model chosen to minimize
eq. (10).
Li's methodology outlined above can be applied to general linear
model selection problems, but the remainder of this paper will only consider
the problem of choosing how many of the β*_i to restrict to zero. In this case

²Li (1987) gives an alternative formula for general M(h). All the cases
considered in this paper have symmetric M(h).
the trace condition on A(h) implies that there are at least three restrictions.
An alternative approach for improving LS in this situation is Stein's (1981)
truncated limited-translation estimator, which uses different shrinkage
values for each component and therefore limits the bias. Stein's estimator
is defined for each component by:

β^LT_i = [1 − (q−2)σ² min(|β*_i|, Z_q) / (|β*_i| Σ_{j=1}^{K} min(β*_j², Z_q²))] β*_i, (12)

where q is a large fraction of K, and Z_1 ≤ ··· ≤ Z_K are the order statistics of
the |β*_i|. Note that if q = K, eq. 12 simplifies to the James-Stein (1961)
estimator. If q < K, then the estimator defined by eq. 12 shrinks the β*_i's
more if they are closer to zero, but some method for determining q is
needed. Dey and Berger (1983) propose choosing q ≥ 3 to maximize

(q−2)² / Σ_{j=1}^{K} min(β*_j², Z_q²), (13)

which maximizes the improvement in Bayes risk over LS³. Stein (1981)
showed that the estimator defined by eq. (12) is minimax for fixed q. If the
data-dependent q maximizing eq. (13) is used, then it is very difficult to
analyze the resulting estimator, but a limited Monte Carlo study reported
in Dey and Berger suggests that the resulting estimator is also minimax and
compares favorably with other Stein-rule estimators.
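A minimal Python sketch consistent with the reconstruction of eqs. (12)-(13) above; the estimator is written in the equivalent subtractive form (each component is translated by a bounded step), and the set-to-zero "positive part" truncation at the end is an illustrative choice:

import numpy as np

def newstein(beta_ls, sigma2):
    beta_ls = np.asarray(beta_ls, dtype=float)
    K = len(beta_ls)
    absb = np.abs(beta_ls)
    Z = np.sort(absb)                                          # order statistics Z_1 <= ... <= Z_K
    # eq. (13): choose q >= 3 maximizing (q - 2)^2 / sum_j min(beta_j^2, Z_q^2)
    crit = [(q - 2) ** 2 / np.sum(np.minimum(absb ** 2, Z[q - 1] ** 2))
            for q in range(3, K + 1)]
    q = 3 + int(np.argmax(crit))
    denom = np.sum(np.minimum(absb ** 2, Z[q - 1] ** 2))
    # eq. (12): the translation of each component is capped at (q - 2) sigma^2 Z_q / denom
    step = (q - 2) * sigma2 / denom * np.sign(beta_ls) * np.minimum(absb, Z[q - 1])
    est = beta_ls - step
    est[np.sign(est) != np.sign(beta_ls)] = 0.0                # crude positive-part truncation
    return est, q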
In actual applications, the shrinkage factors for both β^S(h*) and β^LT
should be truncated so that the resulting "positive part" estimates lie
between the restricted and unrestricted estimates. σ² also needs to be
replaced by an estimator such as s². The main difference between β^LT and β^S
is that, relative to β^S, the limited-translation estimator limits the bias at
the cost of somewhat less efficiency. The Monte Carlo studies in the
following sections give some indications of the size of the efficiency losses
incurred using β^LT relative to β^S.
For fixed q and A(h) with σ² known, Stein (1981) shows that the two
estimators discussed above are minimax and admissible. Stein's (1981)
unbiased risk estimators also provide a basis for generating confidence
bands for these estimators without bootstrapping or jackknifing.
Unfortunately, these methods do not appear to yield estimators for the
sampling distributions of the estimators used here, where q and h depend on
the data in a complex fashion. The remainder of this paper reports results
from Monte Carlo studies of the bootstrap and jackknife sampling
distributions for these Stein-rule procedures.

³Efron and Morris (1973) show that most Stein-rule estimators can also be
derived as Empirical Bayes procedures, which justifies considering the
Bayes risk here.

4. Estimator Performance
This section describes a Monte Carlo study of the positive part
versions of Li's estimator β^S(h*), hereafter called LISTEIN, and Stein's
limited-translation estimator β^LT, hereafter called NEWSTEIN. The
columns of X are generated as orthonormalized independent draws from a
Uniform distribution on the unit interval, and they are held fixed
throughout the experiments. There are T = 100 observations (rows of X),
and the first set of experiments have K = 10 regressors. The dependent
variables (y) are generated from equation 1 with ε generated as independent
draws from a standard Normal distribution. This design has the risk of
LS = K for all of the experiments. The GAUSS (1989) programming
system was used for all calculations on a 386 personal computer, and the
pseudo-random numbers were all generated using GAUSS' normal and
uniform random number generators.
The parameter vector β was chosen according to β = LΔ, where L is a
scalar. As L increases, the signal-to-noise ratio, as measured by the
population R² = β′β/(β′β + T), varies between 0 and .988. This
definition of R² is appropriate since the regression plane passes through the
origin, and σ² always equals 1. Table 1, abridged from Brownstone
(1990b), shows the results of a Monte Carlo study for two choices of the
direction vector, Δ. "Case 1" has all components of Δ = 1, and "Case 2"
has the first 5 components = 1.9999 and the remaining 5 = .0001.
Each entry in Table 1 is based on 400 Monte Carlo replications. In
addition to population R², these tables give

Σ_{j=1}^{400} (β̂_j − β)′(β̂_j − β) / 400, (14)

where β̂_j denotes the estimate at the jth replication, which is the empirical
(mean) risk or expected prediction error for each of
the two Stein-rule estimators. By construction, LS has theoretical risk 10
for all cases, and its empirical risk was very close to 10 for all cases, so these
empirical risks are omitted from Tables 1 and 2. The tables also provide
the average value of h* (the number of β*_i set to zero), denoted "hopt," for
LISTEIN and the average value of q for NEWSTEIN. The standard errors
of the empirical risk measures are not reported because they are always less
than .30.
Unlike the pretest and Stein—rule pretest estimators considered in
Brownstone (1990a), both LISTEIN and NEWSTEIN dominate LS in these
experiments. For "Case 1", the risk of LISTEIN and NEWSTEIN rises
smoothly toward 10 (the risk of LS in all of these experiments). This is to
be expected since as R? increases all parameters are estimated more
precisely and there is less gain from imposing approximate restrictions. As
LISTEIN and NEWSTEIN approach LS, both hopt and q (which are
restricted to be no less than three) approach their maximum values (10).
When hopt and q are at their maxima, both LISTEIN and NEWSTEIN are
just the standard positive—-part James-Stein estimator. Since the shrinkage
value (i.e. β*′B(h)β* in eq. 8) is large, the resulting estimators are


essentially LS. In actual applications with larger models there will usually
be at least three valid restrictions.
Some of the restricted models (i.e., at least three β_i = 0) considered in
the model selection phase are valid for "Case 2," which is why the risks of
the Stein-rule procedures remain below 10. Except for the smallest R?
values, LISTEIN has lower risk than NEWSTEIN since it does not attempt
to limit the bias in the resulting estimates. As R? increases, the average
values of hopt and q approach 5, which is the number of approximately
correct zero parameter restrictions for this case. The hump in the risk
function for the Stein-rule procedures around R? = 0.26 is caused by noise
in the model selection criteria (equations 10 and 13 respectively) used to
identify the best approximate model.

Table 1: Empirical Risks and Truncation Values for 10 Regressors

                              R²
Case  Estimators    0.00  0.04  0.26  0.50  0.80  0.90  0.96

  1   LISTEIN       2.23  4.79  8.48  9.37  9.91  9.80  10.1
      hopt          9.47  9.50  9.86  9.98  10.0  10.0  10.0
      NEWSTEIN      1.30  4.22  8.27  9.34  9.91  9.80  10.1
      q             8.50  8.68  9.90  10.0  10.0  10.0  10.0

  2   LISTEIN       2.64  4.83  8.11  6.85  6.72  7.45  7.15
      hopt          9.41  8.98  6.06  4.95  4.87  4.82  4.87
      NEWSTEIN      1.62  4.12  9.17  9.64  9.09  9.66  9.60
      q             8.39  8.68  8.06  6.55  4.45  4.47  4.52

Table 2: Empirical Risks and Truncation Values for 20 Regressors


R2
Estimators 0.00 0.04 0.26 0.50 0.80 0.90 0.96
LSR ia 10.8 10.9 11.0 11.1 10.6 11.0
LISTEIN 3.64 6.88 14.1 12.5 12.9 12.1 12.8
hopt 18.9 18.2 13.0 9.57 9.33 9.47 9.33
NEWSTEIN 1.87 5.01 15.2 17.8 18.9 17.7 18.1
q 15.5 14.8 15.7 12.2 7.35 7.23 7.22
Table 2 gives the results of a similar Monte Carlo study where 10
additional independent regressors are added to the design matrix used in
Table 1. The true parameter vector was set to have the last 10 values equal
zero and the first equal to 1. Table 2 gives the empirical risks and
truncation values for the Stein—rule estimators. It also gives the risk for a
pretest estimator (LSR), which is defined to be LS if the F-test for the null
hypothesis that the last 10 parameter values are zero in principal
component space is significant at the 5 percent level. Otherwise it is the
restricted least squares estimator with these last principal components set
to zero.
The risk for LS is always equal to 20 for these designs, and all of these
estimators clearly dominate LS. The LSR estimator performs so well here
because it is performing only one pretest of a clearly true restriction. In
actual applications, investigators do not know which set of parameters to
restrict to zero, so multiple pretests must be performed. The risk of LSR is
very close to 10, which is the theoretical value of least squares on the first
10 regressors, so there was very little uncertainty in the pretest in this
design. Considering that it is searching over all possible zero restrictions,
the LISTEIN procedure performs very well here. NEWSTEIN performs
worse than LISTEIN here because it is limiting the bias relative to LS,
which is a poor estimator for these designs.

5. Bootstrap and Jackknife Variance Estimation


Minimizing risk is not the only criterion for choosing an estimator.
Econometricians are frequently interested in estimating, and making
inferences about, components of §. Two feasible approaches for estimating
the sampling variances of the LISTEIN and NEWSTEIN estimators are the
bootstrap and jackknife procedures described in Section 2. Since these are
only asymptotically valid, Monte Carlo studies are needed to establish their
finite sample properties.
Table 3 gives the results of a small study of the bootstrap for three of
the designs in Table 2. There are 300 Monte Carlo repetitions and 300
bootstrap repetitions for each Monte Carlo repetition. The first two rows
for each estimator give the empirical means and standard errors of the
estimators of the first component of β over these Monte Carlo repetitions.
For each Monte Carlo sample the empirical standard errors of the
estimators of the first component of β across the bootstrap samples were
computed. The italicized rows give the mean ("Bootstrap SE") and
standard deviation ("SD Boot") of these bootstrap standard error estimates
over the Monte Carlo repetitions for each estimator.
Comparisons between the Monte Carlo ("Std. Error") and Bootstrap
("Bootstrap SE") rows in Table 3 show that the bootstrap gives good
estimates with no noticeable biases. The theoretical standard error of the
LS estimator is 1.8 for all these designs, so all these estimators perform
much better than LS. There is some evidence of bias in the Stein—rule
estimators for the R² = .26 case, but these biases are not significant. The
results in Table 3 are similar to those obtained in Brownstone (1990b)
which examined the bootstrap for the designs in Table 1.
The generally good results from this study are in contrast to one case
in Freedman, Navidi, and Peters (1988) which showed the poor performance
of the bootstrap for LS in an orthonormal design with 100 observations and
75 regressors. Two possible causes for the poor performance in that case are
the lack of degrees of freedom and the presence of 50 extraneous regressors
(i.e., those with true parameter values set to zero). When there are
extraneous regressors there is excess variation in the LS residuals, so it is
not surprising that the bootstrap based on these residuals will be biased.
The two Stein-rule procedures considered here should be less subject to this
"extraneous regressor" bias since they shrink the coefficients of these
extraneous regressors towards zero before bootstrap residuals are calculated.
Table 4 gives the results of additional Monte Carlo studies of the
bootstrap and the two jackknife estimators described in Section 2 for the
same designs as Table 3. There were only 100 bootstrap repetitions
performed for each Monte Carlo repetition in Table 4 so that the bootstrap
would be more computationally comparable to the jackknife estimators.
The jackknife estimators are still much faster to compute since they are
based on the delete-one LS estimates which only require one regression to
compute. In addition to the three estimators investigated in Table 3, Table
4 includes results for the LS estimator on the full model, which has
risk = 20 for all the designs. The "Std. Error" rows in Table 4 give the
standard errors of the estimates of the first component of β over the 300
Monte Carlo repetitions. These entries differ from Table 3 and across
estimators because, for computational reasons, different random numbers
were drawn for each estimator. The "Est." and "MSE" rows give the mean
and mean squared error of the jackknife and bootstrap estimators over the
Monte Carlo repetitions.

Table 3: Bootstrap Estimates for First Component of £


(300 Monte Carlo and 300 Bootstrap Repetitions)

True Parameter Value (R?


Estimators 0 (0) .64 (.26) 3.2 (.90)
Mean 0.026 0.64 409.
LSR Std. Error 1.2 12 1.1
Bootstrap SE | Aoi ui
SD Boot. 0.10 0.098 0.098

Mean —0.048 0.40 3:2


LISTEIN Std. Error 0.64 132 1.1
Bootstrap SE 0.75 1.1 1.2
SD Boot. 0.14 0.12 0.10

Mean —0.023 0.44 3.1


NEW- _ Std. Error 0.41 0.98 1.2
STEIN Bootstrap SE 0.48 0.89 LD
SD Boot. 0.083 0.13 0.11

Table 4: Comparison of Bootstrap and Jackknife


Standard Errors

True Parameter Value 0.00 (R? = 0)


Estimator LS LSR LISTEIN NEWSTEIN

Jackknife Std. Error 1.39 1.17 0.674 0.462


Est. 1.57 1.87 1.25 0.959
MSE 0.0579 0.540 2.38 0.636

Balanced Std. Error 1.38 7, 0.665 0.471


Jackknife Est. 1.40 1.26 1.16 0.859
MSE 0.0193 0.516 2.07 0.469

Bootstrap Std. Error 1.28 1.07 0.597 0.379


(100 Rep.) Est. 1.25 1.12 0.729 0.477
MSE 0.0195 .0167 0.0876 0.0176

True Parameter Value 0.636 (R? = .26)


Estimator LS LSR LISTEIN NEWSTEIN

Jackknife Std. Error 1.41 1.16 1.23 1.02


Est. 1.58 1.42 1.86 1.48
MSE 0.0552 0.921 1.20 0.527

Balanced Std. Error 1.46 1.19 1.26 1.06


Jackknife Est. 1.39 1.23 1.63 1.31
MSE 0.0253 0.674 0.728 0.298

Bootstrap Std. Error 1.47 1By 1.28 1.08


(100 Rep.) Est. 1.26 1.18 1.11 0.881
MSE 0.0616 .0179 0.0525 0.0553

True Parameter Value 3.18 (R? = .90)


Estimator LS LSR LISTEIN NEWSTEIN

Jackknife Std. Error 1.41 1.17 1.21 1.30


Est. 1.58 1.40 1.55 1.60
MSE 0.0548 0.658 0.281 0.171

Balanced Std. Error 1.31 1.12 1:13 1.19


Jackknife Est. 1.40 1.28 1.40 ye
MSE 0.0279 0.835 0.248 0.0990

Bootstrap Std. Error 1.45 1.19 1.24 1.33


(100 Rep.) Est. 1.26 1.12 1.15 1.22
MSE 0.0549 .0179 0.0240 0.0289
Except for LS, the jackknife estimators perform much worse than the
bootstrap, and the balanced jackknife performs slightly better than the
regular jackknife. Since the mapping from LS to LSR and the Stein—rule
procedures is not continuous, the jackknife estimators are only known to be
consistent for LS. The balanced jackknife was designed to improve the
jackknife for LS, so its better performance here is not surprising. Even
though the balanced jackknife is an inferior variance estimator (relative to
the bootstrap) for these designs, it may still be good enough to be used to
prepivot the bootstrap sampling distributions.

6. Bootstrap Confidence Bands


One important use of sampling variances is the construction of
confidence bands for unknown parameters. Since the Stein-rule procedures
considered in this paper may not have Normal sampling distributions in
finite samples, it is also important to consider the performance of different
confidence bands computed directly from the bootstrap sampling
distributions. This section describes Monte Carlo studies of a number of
nominal 90 percent confidence procedures using the same design as in Table
3. Estimating tail probabilities of the bootstrap distributions requires more
bootstrap repetitions than variance estimation, so all of the results in this
section are based on 1500 bootstrap repetitions for each Monte Carlo
repetition.

Table 5: Level 2α Confidence Band Definitions

Name             Definition

Boot Std.        [β* + σ_B t(α), β* + σ_B t(1−α)]
Percentile       [G*⁻¹(α), G*⁻¹(1−α)]
BC Percentile    [G*⁻¹(Φ{2z₀ + z(α)}), G*⁻¹(Φ{2z₀ + z(1−α)})]
Std.             [β* + s t(α), β* + s t(1−α)]
T Percentile     [β* + s H*⁻¹(α), β* + s H*⁻¹(1−α)]

Notes: σ_B is the bootstrap standard error for β* and t(α) is the level
α cutoff value from a Student t distribution with 80 degrees of
freedom. G* is the bootstrap distribution of β*, Φ is the standard
normal cumulative distribution function, z(α) is the level α cutoff
value from this distribution, and z₀ = Φ⁻¹{G*(β*)}. s, which is only
calculated for LS, is the square root of the usual LS variance estimate.
H* is the bootstrap distribution of (β*_j − β*)/s_j, where β*_j and s_j are
the values of the estimators at the jth bootstrap repetition.
Table 5 gives the definitions of the intervals, but the "Std." and "T
Percentile" are only calculated for the LS estimator since they require a
reliable estimate of σ. Efron and Tibshirani (1986) show that the "Boot
Std.," "Percentile," and "BC Percentile" intervals are correct under
increasingly general conditions. Note that if the bootstrap bias is zero
(G*(β*) = .5), then the Percentile and BC Percentile bands are identical.
Since the "Boot Std." band only requires bootstrap standard errors, it can
be computed with fewer bootstrap repetitions.
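A minimal Python sketch of the "Boot Std.", "Percentile", and "BC Percentile" bands of Table 5 for a single coefficient, assuming SciPy is available and substituting a standard normal cutoff for the t(80) cutoff:

import numpy as np
from scipy.stats import norm

def table5_bands(beta_hat, boot_draws, alpha=0.05):
    se = boot_draws.std(ddof=1)                              # bootstrap standard error sigma_B
    z = norm.ppf(1 - alpha)
    boot_std = (beta_hat - z * se, beta_hat + z * se)        # 'Boot Std.' (z used instead of t(80))
    pct = tuple(np.quantile(boot_draws, [alpha, 1 - alpha])) # 'Percentile': quantiles of G*
    # 'BC Percentile': shift the quantile levels by z0 = Phi^{-1}{G*(beta_hat)}
    z0 = norm.ppf(np.clip(np.mean(boot_draws < beta_hat), 1e-4, 1 - 1e-4))
    levels = norm.cdf([2 * z0 + norm.ppf(alpha), 2 * z0 + norm.ppf(1 - alpha)])
    bc = tuple(np.quantile(boot_draws, levels))
    return boot_std, pct, bc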
Table 6 shows the performance of 3 different 90 percent confidence
bands for the first coefficient of β. Except for the Stein-rule estimators at
true coefficient = 0, the bootstrap standard errors ("Boot SE" columns in
Table 6) underestimate the true values ("Std. Error" columns in Table 6).
This negative bias is related to the negative bias in the nonparametric
bootstrap standard error estimates for LS⁴. The coverage probability errors
generally follow the biases in the bootstrap standard errors. The "Boot
Std." and "Percentile" bands behaved similarly in these runs. The
bias—corrected intervals ("BC Percentile") behaved similarly for the LSR
estimator, but they displayed much more variability and had poor coverage
for the Stein-rule estimators. The culprit in the bad performance of the
bias—corrected intervals is high variation in the estimates of zo.
One difficulty with interpreting Table 6 is that there are no
well-defined "correct" confidence bands for these estimators to compare
against. Table 7 considers the confidence bands for the LS estimator where
the usual t interval ("Std." column in Table 7) provides a natural basis for
comparison. The bootstrap intervals studied in Table 6 all have some
undercoverage due to the negative bias in the bootstrap standard errors
("Boot SE" column in Table 7). The similarity of these three intervals is
expected since the LS estimator is normally distributed in this example.
Even though it is based on the biased bootstrap residuals ("Boot s" column
in Table 7), the "T Percentile" intervals are almost identical to the correct
"Std." intervals. The exact correctness of the "T Percentile" interval in
this example is not surprising since (4>;—(*)/s; is an exact pivot for the LS
estimator with normally distributed residuals. In more general settings,
Hall (a shows that the "T Percentile" intervals are asymptotically more
accurate than the other bootstrap intervals in Tables 6 and 7 since they
capture additional terms in the Cornish—-Fisher expansions of the
estimators’ sampling distributions. Direct application of the "T Percentile"
intervals to the LISTEIN or NEWSTEIN procedure requires a bootstrap or
jackknife variance estimate for each bootstrap repetition. This "double
bootstrapping" is computationally beyond the scope of the current study.
Tables 6 and 7 also give values of the "Bootstrap Risk," which is
defined as

⁴ The negative bias for LS is expected, since the sampling variability of the
LS residuals underestimates the variability of the true residuals. In
particular, E(e_i²) = σ²(1 − h_i), where e_i is the LS residual for the ith
observation and h_i is the ith element of the diagonal of X(X′X)⁻¹X′. One
solution is to divide each LS residual by the square root of (1 − h_i), but this
is only valid for LS. This bias disappears in large samples.
Σ_{j=1}^{N} (β*_j − β*)′(β*_j − β*) / N, (15)

where β*_j is the estimate from the jth bootstrap repetition and N is the number of
bootstrap repetitions. This is a consistent, but
biased, estimator for the risk of these estimators (see Bunke and Droge,
1984). The results in Tables 6 and 7 show that this bias can be substantial
in these designs, but these "Bootstrap Risks" generally give the correct
ordering between the estimators.

7. Further Refinements
The Monte Carlo studies presented here and in Brownstone (1990a
and b) show that bootstrapping estimation strategies for linear regression
models yields reasonably accurate estimates of their sampling distributions.
Although the bootstrap standard errors and confidence intervals are
adequate for most applications with sufficient degrees of freedom, the biases
in Tables 6 and 7 suggest the need for more accurate estimates. One
approach is to use exact finite sample theory and/or higher-order
asymptotic expansions. These theoretical results are available for simple
pretest estimators, such as LSR (Judge and Yancey, 1986), and for simple
Stein-tule estimators (Phillips, 1984), but not for the more complex
Stein-rule procedures (LISTEIN and NEWSTEIN) considered here. Even
when exact finite sample theory is available, the results are typically quite
sensitive to the functional form of the residual distribution.
Another approach which preserves the distributional robustness of the
bootstrap is to use improved bootstrap techniques (see Beran, 1986, Efron,
1987, Hall, 1988, and Loh, 1987). These work by improving the rate of
convergence of the bootstrap sampling distribution, but they generally
require an order of magnitude more computations than the simple
bootstrap. The most general approach (Beran, 1986 and Hall, 1988) is to
use the bootstrap distribution as an approximate pivot, which is a
generalization of the "T Percentile" interval in Table 7. Efron’s (1987)
accelerated bias—corrected bootstrap intervals have the same second-order
asymptotic properties as the T Percentile, but they are easier to compute in
many cases. Nevertheless these accelerated intervals require the equivalent
of a jackknife variance computation for each bootstrap repetition for the
LISTEIN and NEWSTEIN procedures. Monte Carlo studies of these
improved techniques require supercomputer resources, but their application
to even large applied problems is feasible on current high-end workstations.
Even though the techniques described above will improve the
accuracy of the methods studied in this paper, the Stein—-rule procedures
with the simple bootstrap confidence intervals are a_ substantial
improvement relative to current popular techniques. For the setting in
Tables 6 and 7, most applied econometricians would use extensive
pretesting to find the "best" set of regressors, and then perform inference
conditional on the final model chosen being correct. Brownstone (1990a)
and Hill and Judge (1987) show that these procedures may have much
higher risk than LS applied to the full model, and the conditional inferences

Table 6: 90% Bootstrap Confidence Intervals for First Component of 8


(275 Monte Carlo Repetitions and 1500 Bootstrap Repetitions)
True coefficient = 0.00
Estimator Mean __ Std. Error Risk Boot SE Boot Risk

LSR 0.0452 1.18 11.0 1.13 10.3


LISTEIN 0.0153 0.683 4.32 0.762 5.55
NEWSTEIN 0.0496 0.495 2.90 0.516 3.78

Mean confidence bands and actual coverage probabilities


Estimator Boot Std. Percentile BC Percentile

LSR [-1.83, 1.92] [-1.85, 1.84] [-1.77, 1.88]


Cov. Pr. 0.892 0.900 0.852

LISTEIN —_[ -1.25, 1.28] [23 1.22] [-1.48, 1.23]


Cov. Pr. 0.956 0.980 0.836

NEWSTEIN[-0.809, 0.908] [-0.783, 0.807] [-0.996, 0.977]


Cov. Pr. 0.940 1.00 0.720

True coefficient = 0.636


Estimator Mean Std. Error Risk Boot SE’ Boot Risk

LSR 0.652 1.20 11.1 1.12 9.98


LISTEIN 0.474 1.28 14.5 1.10 11.6
NEWSTEIN 0.422 4.11 15.0 0.892 dist

Mean confidence bands and actual coverage probabilities


Estimator Boot Std. Percentile BC Percentile

LSR {-1.21, 2.51] [-1.18, 2.48] [-1.18, 2.48]


Cov. Pr. 0.887 0.865 0.887

LISTEIN [ -1.35, 2.30] [-1.51, 2.10] [-1.16, 2.50]


Cov. Pr. 0.858 0.855 0.805

NEWSTEIN [-1.06, 1.91] [-1.33, 1.59] (0.771, 2.31]


Cov. Pr. 0.816 0.876 0.674
(Table 6 continued)

True coefficient = 3.18


Estimator Mean Std. Error Risk Boot SE Boot Risk

LSR 3.16 1.24 10.2 1.13 10.2


LISTEIN 3.11 1.31 11.9 1.16 peer
NEWSTEIN 2.93 1.39 17.2 1.23 15.8

Mean confidence bands and actual coverage probabilities


Estimator Boot Std. Percentile BC Percentile

LSR [ 1.28, 5.04] [ 1.31, §:01] [ 1.30, 5.01]


Cov. Pr. 0.844 0.844 0.829

LISTEIN [ 1.19, 5.04] [ 1.23, 5.03] [ 1.19, 5.00]


Cov. Pr. 0.829 0.848 0.829

NEWSTEIN [ 0.886, 4.97] [0.670, 4.70] [ 1.16, 5.20]


Cov. Pr. 0. 856 0.848 0.844

Table 7: Bootstrap Confidence Intervals for LS for First Component of 8


(275 Monte Carlo Repetitions and 1500 Bootstrap Repetitions)

True Coef. Mean Std. Error Risk Boot SE Boot s Boot Risk

0.00 0.0655 1.40 20.1 1.26 1.27 16.0


0.636 0.608 Si 19.4 1.26 1.26 16.0
3.18 3.37 1.53 20.6 1.26 1.26 15.9

Mean confidence bands and actual coverage probabilities


Coeff. Std. Boot Std. Percentile BC Percentile T Percentile

[-2.28, 2.41] [-2.03, 2.16] [-2.02, 2.13] [+-2.02, 2.13] [-2.29, 2.39]
0 0.896 0.860 0.864 0.860 0.888
[-1.73, 2.95] | [-1.48, 2.70] [-1.46, 2.67] [-1.47, 2.66] _[-1.73, 2.94]
0.636 0.922 0.896 0.887 0.878 0.926
[ 1.03, 5.71] [ 1.28, 5.46] [ 1.30, 5.43] [ 1.30, 5.43] _[ 1.03, 5.70]
3.18 0.872 0.830 0.819 0.823 0.872
ignoring model selection can be badly biased. Relative to this type of
estimation procedure, the Stein-rule procedures in Table 6 perform very
well.

8. References

Beran, R. (1986), "Simulated Power Functions," The Annals of Statistics,


V. 14, Pp. 151-173.
Bickel, P. J. and D. A. Freedman (1981), "Some Asymptotic Theory for the
Bootstrap," The Annals of Statistics, V. 9, Pp. 1196-1217.
Brownstone, D. (1990a), "Bootstrapping Improved Estimators for Linear
Regression Models," Journal of Econometrics, V. 44, Pp. 171-188.
Brownstone, D. (1990b), "How to 'Data Mine' If You Must: Bootstrapping
Pretest and Stein-Rule Estimators," UC Irvine Research Unit in
Mathematical Behavioral Science Technical Report MBS 90-08, June,
1990.
Bunke, O. and B. Droge (1984), "Bootstrap and Cross—Validation
Estimates of the Prediction Error for Linear Regression Model," The
Annals of Statistics, V. 12, Pp. 1400-1424.
Craven, P. and G. Wahba (1979), "Smoothing Noisy Data with Spline
Functions: Estimating the Correct Degrees of Smoothing by the
Method of Generalized Cross—Validation," Numer. Math., V. 31, Pp.
377-403.
Dey, D. K. and J. O. Berger (1983), "On Truncation of Shrinkage
Estimators in Simultaneous Estimation of Normal Means," Journal of
the American Statistical Association, V. 78, Pp. 865-869.
Efron, B. (1979), "Bootstrap Methods: Another Look at the Jackknife," The
Annals of Statistics, V. 7, Pp. 1-26.
Efron, B. (1982), The Jackknife, Bootstrap and Other Resampling Plans,
SIAM, Philadelphia.
Efron, B. (1987), "Better Bootstrap Confidence Intervals (with
discussion)," Journal of the American Statistical Association, V. 82,
Pp. 171-200.
Efron, B. and C. Morris (1973), "Stein’s Estimation Rule and its
Competitors — An Empirical Bayes Approach," Journal of the
American Statistical Association, V. 68, Pp. 117-130.
Efron, B. and R. Tibshirani (1986), "Bootstrap Methods for Standard
Errors, Confidence Intervals, and Other Measures of Statistical
Accuracy," Statistical Science, V. 1, Pp. 54-74.
Freedman, D.A. and S.C. Peters (1984), "Bootstrapping a Regression
Equation: Some Empirical Results," Journal of the American
Statistical Association, V. 79, Pp. 97 — 106.
Freedman, D.A., W. Navidi, and S.C. Peters (1988), "On the Impact of
Variable Selection in Fitting Regression Equations," in T. K. Dijkstra
(ed.), On Model Uncertainty and its Statistical Implications,
Springer-Verlag, Berlin, Pp. 1-16.
GAUSS (1989), Version 2.0 of the Gauss Programming Language, Aptech
Systems, Kent, Washington.
Hall, P. (1988), "Theoretical Comparison of Bootstrap Confidence
Intervals," Annals of Statistics, V. 16, Pp. 927-953.
Hill, R.C. and G.G. Judge (1987), "Improved Prediction in the Presence of
Multicollinearity," Journal of Econometrics, V. 35, Pp. 83 -100.
Hinkley, D.V. (1977), "Jackknifing in Unbalanced Situations,"
Technometrics, V. 19, Pp. 285-292.
James, W. and C. Stein (1961), "Estimation with Quadratic Loss,"
Proceedings of the Fourth Berkeley Symposium on Mathematical
Statistics and Probability, University of California Press, Berkeley,
CA, Pp. 316 — 379.
Judge, G.G. and M.E. Bock (1978), The Statistical Implications of
Pre—Test and Stein—Rule Estimators in Econometrics,
North-Holland, Amsterdam.
Judge, G.G. and T.A. Yancey (1986), Improved Methods of Inference in
Econometrics, North-Holland, Amsterdam.
Li, K-C. (1985), "From Stein’s Unbiased Risk Estimates to the Method of
Generalized Cross Validation," Annals of Statistics, V.13, Pp.
1352-1377.
Li, K-C. (1987), "Asymptotic Optimality of Cp, Cl, Cross—Validation and
Generalized Cross Validation: Discrete Index Set," Annals of
Statistics, V.15, Pp. 958-975.
Li, K-C. and J. T. Hwang (1984), "The Data-Smoothing Aspect of Stein
Estimates," Annals of Statistics, V.12, Pp. 887-897.
Loh, W-Y. (1987), "Calibrating Confidence Coefficients," Journal of the
American Statistical Association, V. 82, Pp. 155-162.
Phillips, P. (1984), "The Exact Distribution of the Stein—Rule Estimator,"
Journal of Econometrics, V. 25, Pp. 123-132.
Stein, C. (1956), "Inadmissibility of the Usual Estimator for the Mean of a
Multivariate Normal Distribution," Proceedings of the Third Berkeley
Symposium on Mathematical Statistics and Probability, University of
California Press, Berkeley, Pp. 197-206.
Stein, C. (1981), "Estimation of the Parameters of a Multivariate Normal
Distribution," Annals of Statistics, V. 9, Pp. 1135-1151.
Stigler, S. M. (1990), "A Galtonian Perspective on Shrinkage Estimators,"
Statistical Science, V. 5, Pp. 147-155.
Veall, M.R. (1989), "Applications of Computationally—Intensive Methods to
Econometrics," Working Paper No. 89-05, Dept. of Economics,
McMaster University, Hamilton, Ontario.
Wu, C. F. J. (1986), "Jackknife, Bootstrap and Other Resampling Methods
in Regression Analysis," Annals of Statistics, V. 14, Pp. 1261-1295.
A HAZARD PROCESS FOR SURVIVAL ANALYSIS

John J. Hsieh, University of Toronto

Abstract This article develops a hazard process and applies it to non-


parametric survival analysis. The approach is a direct generalization of the
filtered counting process martingale techniques and can easily accommodate
right censoring, competing risks and time-varying covariates. The hazard
processes arising from the point process, marked point process and marked pro-
cess with covariate path, respectively, are derived from the corresponding deter-
ministic hazard functions. The derived processes are employed to obtain non-
parametric estimates of various deterministic hazard and other functions as well
as asymptotic distributions of the estimators and test statistics, using martingale
methods. In addition to the usual functions describing the distribution of the
failure time, we also introduce the occurrence/exposure rates and show their
applications in significance tests comparing the hazard level of a cohort with that of
the general population.

1. Introduction
Let T be a non-negative random variable representing the length of the inter-
val from some initial zero point to the occurrence of an absorbing or non-
recurrent event of interest such as waiting time, failure time, recurrence time.
For a given sample of n individuals, T(ω_i) represents the duration time for the
member ω_i, i = 1, 2, ..., n. Since the probability measure induced by T is equivalent
to the original one defined on the σ-algebra F generated by subsets of the sample
space S, we will henceforth deal directly with the former.
The (cumulative) distribution function F(t) = P[T ≤ t] of T is right continu-
ous, monotone non-decreasing on [0, ∞) with F(0) = 0 and lim_{t→∞} F(t) = 1. The com-
plementary probability S(t) = 1 − F(t) is the survival function. If T is further assumed
to be absolutely continuous, then the density function

f(t) = lim_{Δ→0} P[t < T ≤ t + Δ]/Δ

exists for t ≥ 0, almost everywhere, and is given by f(t) = F′(t) = −S′(t). The hazard
rate function is defined for S(t) > 0 by

h(t) = lim_{Δ→0} P[t ≤ T < t + Δ | T ≥ t]/Δ = f(t)/S(t), (1)

with the cumulative hazard given by

H(t) = ∫₀ᵗ h(u)du = −ln S(t),

upon using the initial condition S(0) = 1, so that

S(t) = exp[−H(t)] = exp[−∫₀ᵗ h(u)du]. (2)

Note that if the distribution function of T is not absolutely continuous, then the
hazard rate function cannot be defined. If T has a finite mean, then the life expec-
tancy or mean residual life function is defined by

e(t) = E(T − t | T ≥ t) = ∫ₜ^∞ (u − t)dF(u)/S(t⁻) = ∫ₜ^∞ S(u)du/S(t⁻), (3)

for S(t⁻) > 0 and e(t) = 0 whenever S(t⁻) = 0, where S(t⁻) = lim_{u↑t} S(u) = P[T ≥ t]. Note
that S(t⁻) = S(t) if T is continuous. This fact was used in (1). The five functions
defined above are equivalent.
Finally, we define an important rate function -- the duration-specific
occurrence/exposure rate. The [t_1, t_2]-specific O/E rate is defined for t_2 > t_1 ≥ 0
by

[S(t_1⁻) − S(t_2)] / ∫_{(t_1,t_2]} S(t)dt. (4)

Expression (4) represents the ratio of the number of events in [t_1, t_2] to the
number of individual-time units of exposure in (t_1, t_2]. If T is (absolutely) con-
tinuous, then S(t_1⁻) = S(t_1) and an application of the mean value theorem of
integral calculus yields

lim_{t_2↓t_1} [S(t_1) − S(t_2)] / ∫_{(t_1,t_2]} S(t)dt = f(t_1)/S(t_1) = h(t_1), (4a)

and an instantaneous O/E rate reduces to a hazard rate function. Thus, an O/E rate
is an average rate and as such is a set function while a hazard rate is an instan-
taneous rate and as such is a point function. Death rates and incidence rates are
examples of the former while force of mortality and force of morbidity are exam-
ples of the latter. For more functions useful in describing the distribution of the
duration time, see Hsieh (1990).
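A minimal Python sketch of the empirical counterpart of (4), assuming complete (uncensored) durations; the exponential sample used to exercise it is illustrative:

import numpy as np

def oe_rate(durations, t1, t2):
    d = np.asarray(durations, dtype=float)
    events = np.sum((d > t1) & (d <= t2))              # occurrences in (t1, t2]
    exposure = np.sum(np.clip(d, t1, t2) - t1)         # person-time at risk spent in (t1, t2]
    return events / exposure

rng = np.random.default_rng(4)
t = rng.exponential(scale=2.0, size=500)               # illustrative failure times
print(oe_rate(t, 1.0, 2.0))   # for an exponential, every O/E rate is near the hazard 1/scale = 0.5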

2. Deterministic Hazards
To cover both discrete and continuous distributions, we begin by defining the
differential of F as follows:

dF(t) = F(t) − F(t⁻) = P[T = t], if F is discrete,
dF(t) = (dF/dt)dt = f(t)dt, if F is continuous, (5)

where F(t⁻) = lim_{u↑t} F(u) = P[T < t]. Note that in the discrete case dF(t) becomes the
probability mass function representing the size of the jump of F at t, which is zero for
all t except at a countable number of jump points T(ω_i). The differential of H is
then defined for S(t⁻) > 0 as

dH(t) = dF(t)/S(t⁻) = P[T = t | T ≥ t], if F is discrete,
dH(t) = f(t)dt/S(t) = h(t)dt, if F is continuous, (6)

the second equality following directly from (5).

We shall call dH(t), the hazard (probability) of T at time t. The quantity


dH(t) is the fundamental function on which the developments in this paper will be
based.In the remainder of this section we shall derive other functions of use in
terms of dH(t) for the discrete case which is most relevant to estimation and
hypothesis testing to be developed in Sections 4 and 5.(The expressions for the
continuous distribution are as given in Section 1).
Clearly the cumulative hazard function H(t) is simply a Stieltjes integral
over (0,t], which is
Ho, dH (t)= dH (u). (7)
it] ust

Like F(t), H(0) = 0 and H(t) is nondecreasing and right-continuous. Furthermore,
H(t) is a function of bounded variation and has the same set of discontinuity
points as F(t), which is countable and contains no limit (cluster) points. Since we
have from (6),

1 - dH(t) = 1 - P[T = t | T ≥ t] = P[T > t | T ≥ t],

this, together with the left-continuity of S(t⁻) = P[T ≥ t] (implied by the right-
continuity of F(t) = P[T ≤ t]), implies that the survival function can be expressed in
terms of dH(t) as follows:

S(t) = P(T > t) = Π_{u≤t} [P(T > u)/P(T ≥ u)] = Π_{u≤t} [1 - dH(u)].   (8)

The distribution function F(t) is obtained from (5) and (6) as

F(t) = ∫_(0,t] dF(u) = Σ_{u≤t} S(u⁻) dH(u),   (9)

where

S(u⁻) = Π_{s<u} [1 - dH(s)].   (10)

The mean residual life function e(t) and the [t_1, t_2]-specific O/E rate for a
discrete distribution are as given in (3) and (4), respectively, with S(t) and S(t⁻)
replaced by (8) and (10), respectively.
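For a purely discrete distribution these relations reduce to finite sums and products. A minimal numerical sketch (the jump times and probability masses below are hypothetical) of the hazard probability (6) and the reconstruction formulas (8)-(10):

import numpy as np

jump_times = np.array([1.0, 2.5, 4.0])     # hypothetical support of a discrete T
dF = np.array([0.2, 0.5, 0.3])             # P[T = t] at each jump point

S_minus = 1.0 - np.concatenate(([0.0], np.cumsum(dF)[:-1]))   # S(t-) = P[T >= t]
dH = dF / S_minus                          # hazard probability dH(t), eq. (6)

S = np.cumprod(1.0 - dH)                   # eq. (8)
F = np.cumsum(S_minus * dH)                # eq. (9) combined with (10)

print(np.allclose(F, np.cumsum(dF)), np.allclose(S, 1.0 - F))   # True True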

3. The Hazard Process


To make statistical inference on dH(t), it is necessary to relate the hazard
defined in (6) to the observed information. Let F_{t-} represent all the accumulated
information available up to just before time t. F_{t-} is called the filtration, or
pre-t information, or history. We assume from now on that t takes values in some
bounded interval [0,τ], for some τ > 0. F_t is a right-continuous nested σ-algebra
such that F_s ⊂ F_t (F_t is finer than F_s) for all 0 ≤ s < t. A typical subcollection of F_t
is the internal history

F_t^I = σ{ I(T(ω_i) ≤ s), 0 ≤ s < t, i = 1, ..., n },   (11)

the σ-algebra generated by the observable indicator processes { I(T(ω_i) ≤ s) } up to
time s < t. Clearly, the accumulated information represented by F_t increases as
the observation time increases, and so F_t is an increasing family of σ-algebras. In
Sections 3 through 5 we shall take F_{t-} as the natural filtration.
Now, for each member ω_i of the sample space Ω, equation (6) may be
rewritten as

dH_i(t) = dF_i(t)/S_i(t⁻) = E[ dI(T(ω_i) ≤ t) | T(ω_i) ≥ t ].   (12)

This follows because

dF_i(t) = dP[T(ω_i) ≤ t] = dE[I(T(ω_i) ≤ t)] = E[dI(T(ω_i) ≤ t)].

The indicator I(T(ω_i) ≤ t) is a random measure, so that { I(T(ω_i) ≤ t), t ≥ 0 } can be
seen as a point process or counting process which counts whether or not the event
(of interest) has occurred to ω_i in (0,t]. We assume that I(T(ω_i) ≤ 0) = 0.
We are now in a position to randomize the deterministic hazard dH_i(t) given
in (12) by conditioning on the observed pre-t information F_{t-} to obtain the sto-
chastic hazard for the member ω_i defined by

dΛ_i(t) = E[ dI(T(ω_i) ≤ t) | F_{t-} ].   (13)

If we let the filtration F_t = F_t^I, the sample path of I(T(ω_i) ≤ s), for s < t, defined in
(11), then since the accumulated information represented by F_{t-} changes only at
the stopping times T(ω_i), equation (13) reduces to a dichotomy solely determined
by whether or not the event occurred before t:

dΛ_i(t) = E[ dI(T(ω_i) ≤ t) | T(ω_i) ≥ t ],   if I(T(ω_i) ≥ t) = 1,
dΛ_i(t) = 0,   if I(T(ω_i) ≥ t) = 0.   (14)
Using (12), equation (14) may be more compactly written as

dΛ_i(t) = y_i(t) dH_i(t),   (15)

where y_i(t) = I(T(ω_i) ≥ t), indicating (by the value 1 or 0) whether or not ω_i is still
at risk at time t⁻. The observable process { y_i(t), t ≥ 0 } is predictable, i.e., fixed
given F_{t-}. We see from (15) that, unlike dH_i(t), dΛ_i(t) is random because y_i(t)
is random. Hence, we call dΛ_i(t) the stochastic hazard and dH_i(t) the deterministic
hazard.
If the distribution of T(ω_i) has a density, then (15) may be written as

dΛ_i(t)/dt = y_i(t) h_i(t).   (15')

We shall call (15') the stochastic hazard rate of T (while h_i(t) is a deterministic
hazard rate). Integrating (13) and (15) over (0,t] we obtain (noting that Λ_i(0) = 0)

Λ_i(t) = ∫_(0,t] y_i(u) dH_i(u) = H_i[min(t, T(ω_i))] = E[ I(T(ω_i) ≤ t) | F_{t-} ].   (16)

We shall call Λ_i(t) = ∫_(0,t] y_i(u) dH_i(u) the hazard process associated with the count-
ing process { I(T(ω_i) ≤ t) } for member ω_i. Thus a hazard process is nothing but
an integrated stochastic hazard. It is a random process, though considerably less
random than its associated counting process if the process admits a stochastic
hazard rate.
If we define the stochastic differential dM_i(t) by

dM_i(t) = dI(T(ω_i) ≤ t) - y_i(t) dH_i(t),   (17)

which represents the non-predictable portion (i.e., the innovation) of the observ-
able jump process dI(T(ω_i) ≤ t), then, since y_i(t) is predictable, it follows from
(17), (13) and (15) that

E[ dM_i(t) | F_{t-} ] = 0,   (18)

so that the integral of (17) (with M_i(0) = 0)

M_i(t) = I(T(ω_i) ≤ t) - ∫_(0,t] y_i(u) dH_i(u)   (19)

is an F_{t-}-martingale with zero mean

E[M_i(t)] = 0.   (20)

In other words, the difference between the observable counting process
I(T(ω_i) ≤ t) and its smooth, the hazard process Λ_i(t) = ∫_(0,t] y_i(u) dH_i(u), represents
a random noise (i.e., the zero-mean martingale M_i(t)). In general, the conditional
expectation of a counting process given the history F_{t-} is a predictable measure
called the compensator of the counting process with respect to F_{t-} (see, e.g.,
Liptser and Shiryayev 1978 and Karr 1986). Thus a hazard process is a compen-
sator. But a compensator (of a counting process) need not be a hazard process.
For example, if in (13) we let F_t = { ∅, Ω }, the smallest possible σ-algebra, then
dΛ_i(t) = dF_i(t) and the compensator E[ I(T(ω_i) ≤ t) | F_{t-} ] = F_i(t) is deterministic and
hence not a hazard process. Similarly, both Poisson and Wiener processes have
deterministic compensators as a result of the independent increment property.
Now, if the ω_i's in Ω are all copies of ω (i.e., if the population is homo-
geneous), then, for each i = 1, 2, ..., n,

dH_i(t) = dH(t),   (21)

(i.e., the same hazard for every individual in the population). In this case, we can
superimpose all the counting processes { I(T(ω_i) ≤ t) } for individual members ω_i,
i = 1, 2, ..., n, to obtain the counting process

N(t) = Σ_{i=1}^{n} I(T(ω_i) ≤ t)   (22)
for the whole sample space Ω, which counts the total number of occurrences of
the event of interest in (0,t]. The stochastic hazard dΛ(t) and the hazard process
Λ(t) of (22) are respectively obtained by summing both sides of (15) and (16)
over i and using (13) and (21):

dΛ(t) = Σ_i dΛ_i(t) = E[ dN(t) | F_{t-} ],   (23)

where dN(t) = N(t) - N(t⁻), and

Λ(t) = Σ_i Λ_i(t) = ∫_(0,t] Y(u) dH(u) = E[ N(t) | F_{t-} ],   (24)

where

Y(t) = Σ_{i=1}^{n} y_i(t) = Σ_{i=1}^{n} I(T(ω_i) ≥ t)   (25)


is the number at risk just before t. If there is no censoring, as is the case here,
then Y(t) = n - N(t⁻). Clearly, both N(t) and Y(t) are bounded by n. That Λ(t)
given by (24) is the hazard process of N(t) follows from the "no simultaneous
occurrences" assumption, which says that with probability one each jump of N(t)
is of unit magnitude (i.e., no two individual counting processes jump at the same
time).
By summing each of the formulas (17)-(20) for the individual counting
processes over i and using the relations (21), (22) and (25), we obtain the
corresponding formulas for the space counting process (22), with M(t) = Σ_{i=1}^{n} M_i(t):

The innovation gain

dM(t) = dN(t) - Y(t) dH(t)   (17')

with zero conditional mean

E[ dM(t) | F_{t-} ] = 0,   (18')

and the F_{t-}-martingale

M(t) = N(t) - ∫_(0,t] Y(u) dH(u),   (19')

with zero mean

E[ M(t) | F_0 ] = 0.   (20')
If F is absolutely continuous, then so is the hazard process. In this case the
hazard rate exists, so that the dH_i(t) in (15), (16), (17), (19) and (21) may be
replaced by h_i(t)dt and the dH(t) in (21), (23), (24), (17') and (19') replaced by
h(t)dt. The existence of the hazard rate function implies the continuity of the
hazard process and hence simplifies results in statistical inference (see (27) below
and Section 5). It is noted that the counting processes I(T(ω_i) ≤ t) and N(t) are
right-continuous functions of t with left-hand limits, while the predictable (sto-
chastic) processes y_i(t) and Y(t) are left-continuous with right-hand limits. The
relationship among the hazard process, its associated counting process and the
martingale is akin, in that order, to the relationship among the random effects
μ(X), the response data Y and the error e in the random-effects (linear statistical)
model:

Y = μ(X) + e,   (26)

with X playing the role of the pre-t information F_{t-}.
While the conditional mean of a martingale M(t) given F_{t-} is M(t⁻) (see
(20')), the conditional variance is not. The conditional variance of M(t) given F_{t-}
is equal to the predictable variation process given by

<M>(t) = ∫_(0,t] Y(u) dH(u),   (27)
as a result of Theorem 2.21 of Karr (1986) and equation (24), provided that the
hazard process is a continuous function of t. Note that the variance of M(t) given
by (27) is a random variable because Y(t) is. Note also that the hazard process
given on the right side of (27) equals both the conditional mean and variance of its
associated counting process given F_{t-}. This implies that a counting process asso-
ciated with a continuous hazard process is conditionally Poisson given F_{t-}. How-
ever, a noncontinuous hazard process does not yield (27) for <M>(t) and hence
does not have a conditional Poisson counting process. Under the "no simultane-
ous occurrences" assumption, it can similarly be shown that the predictable
covariation process of any two F_{t-}-martingales defined by (19) or (19') is zero,
provided that the hazard processes are continuous functions of t. That is to say, the
martingales defined by (19) or (19') for continuous hazard processes are orthogo-
nal.
4. Estimation of Deterministic Functions
In this and the next sections we shall use the martingale method to calculate
the nonparametric estimators of the deterministic functions and to derive various
test statistics. The martingale method is employed in preference to other methods
(such as the likelihood method ) mainly because of its direct link to the counting
and hazard processes (see (19) and (19”) ) and its ease in making such calcula-
tions ( see, e.g., Aalen 1978, Andersen et.al. 1982 and Karr 1986 ).
To estimate the deterministic functions, we shall first obtain the estimate for
the unobserved dH(t). The martingale estimate of dH(t) is obtained by solving
(17') for dH(t) and using the assumption of asymptotic negligibility (dM(t)/Y(t)
→ 0, as n → ∞) to yield

dĤ(t) = I(Y(t) > 0) dN(t)/Y(t),   (28)

where dN(t) = N(t) - N(t⁻) is the number of events at time t or in (t⁻, t], which is
zero for all t except at the jump points t = T(ω_i). The indicator function inserted in
(28) as a multiplier is to cover the possibility that Y(t) may take zero values. In
view of the definition of dH(t) given in (6), formula (28) is the natural estimate of
dH(t). Integrating (28) over (0,t] one obtains the martingale estimate of H(t):

Ĥ(t) = ∫_(0,t] I(Y(s) > 0) dN(s)/Y(s) = Σ_{s_i ≤ t} 1/Y(s_i),   (29)

where s_i, i = 1, ..., n, are the sequence of ordered occurrence times. Formula (29) is
the natural estimate of the cumulative hazard H(t) (see also Section 6) and, as
expected, is essentially identical to the ones obtained by Nelson (1969), Aalen
(1975) and Altshuler (1970). The estimator (28) is then substituted into (8) and
(10) to obtain the estimates of S(t) and S(t⁻), respectively, which in turn are sub-
stituted into (3) and (4) to obtain the estimates of the mean residual life e(t) and
the O/E rate, respectively.
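As a small illustration, the sketch below (hypothetical uncensored failure times, generated only for the example) computes the martingale estimate (29) and the plug-in estimate of S(t) obtained by inserting (28) into (8):

import numpy as np

def martingale_estimates(times):
    """Cumulative hazard (29) and plug-in survival (8) from uncensored failure times."""
    times = np.sort(np.asarray(times, dtype=float))
    n = times.size
    uniq, counts = np.unique(times, return_counts=True)          # event times and dN
    at_risk = n - np.concatenate(([0], np.cumsum(counts)[:-1]))  # Y(t) just before each event time
    dH = counts / at_risk                                        # eq. (28)
    return uniq, np.cumsum(dH), np.cumprod(1.0 - dH)             # t, H-hat(t), S-hat(t)

rng = np.random.default_rng(0)
t_uniq, H_hat, S_hat = martingale_estimates(rng.exponential(2.0, size=200))
print(H_hat[-1], S_hat[-1])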
To obtain the (asymptotic) distribution of the estimator we will need to use
two main results from martingale theory (see, e.g., Aalen 1978, Gill 1984,
Kopp 1984, Liptser and Shiryayev 1978, and Rebolledo 1980). The first is the
martingale transform theorem, which states that stochastic integrals of predictable
processes with respect to martingales are martingales. The proof is very simple.
The second is the central limit theorem for a sequence of martingales, which
states that for a sequence of (mean zero) square-integrable martingales
M_1, M_2, ..., M_n, ..., if there exists a non-negative continuous non-decreasing
deterministic function V(t) with V(0) = 0 such that (a) for each t ∈ (0,τ], <M_n>(t)
converges to V(t) in distribution and (b) there exist constants c_n ↓ 0 such that, as
n → ∞, P[ sup_t |dM_n(t)| ≤ c_n ] → 1, then, for t ∈ (0,τ], M_n(t) converges to M(t) in dis-
tribution, where M(t) is normally distributed with mean zero and variance equal
to <M>(t) = V(t). The proof of this theorem is more technical and makes use of the
fact that a martingale has orthogonal increments to reach a weak convergence to a
time-transformed Wiener process as n increases (for details of the proof with
different degrees of generality, see Rebolledo 1980, Liptser and Shiryayev 1980,
Hall and Heyde 1980, and Helland 1982).

[Note: Equation (29) is to be followed by equation (33).]



5. Asymptotic Distributions of Estimates and Hypothesis Testing


In this section we assume that F is absolutely continuous. Our next task is to
obtain the asymptotic variance of the estimator Ĥ(t) and to derive test statistics
for testing hypotheses about the unknown H(t). To do so, we shall investigate the
asymptotic distribution of the process M_1(t) defined by

M_1(t) = √n [Ĥ(t) - H(t)].   (33)
The first step is to show that M_1(t) is asymptotically a martingale. Using
(29) and (17'), we obtain

√n [Ĥ(t) - H(t)] = √n ∫_(0,t] [ I(Y(s) > 0) dN(s)/Y(s) - dH(s) ]
  = √n ∫_(0,t] [ I(Y(s) > 0)/Y(s) ] dM(s) + √n ∫_(0,t] [ I(Y(s) > 0) - 1 ] dH(s)
  →^P √n ∫_(0,t] [ I(Y(s) > 0)/Y(s) ] dM(s),   as n → ∞,   (34)

(where →^P signifies convergence in probability) which, by the martingale
transform theorem, is a martingale because Y is F_t-predictable. Note that
E[ dM_1(t) | F_{t-} ] = 0 as n increases. This follows upon taking expectation
of (33) and using (29) and (23).
The second step is to obtain the asymptotic variation process of M_1(t).
Using (34), (27) and the F_{t-}-predictability of Y, we obtain for large n,

d<M_1>(t) = n I(Y(t) > 0) Var[ dM(t) | F_{t-} ] / [Y(t)]^2 = n I(Y(t) > 0) dH(t)/Y(t).   (35)

Hence, we have the large sample variation process

<M_1>(t) = ∫_(0,t] d<M_1>(s) = n ∫_(0,t] I(Y(s) > 0) dH(s)/Y(s).   (36)

The third step is to find the asymptotic distribution of M_1(t). We will use the
so-called asymptotic stability condition (the Glivenko-Cantelli theorem), which
states that Y(t)/n converges uniformly (see Pollard 1984) with probability one to
the deterministic mean function p(t) = E[Y(t)/n], as n → ∞. (If there is no cen-
soring, as is the case here, p(t) = S(t⁻). This, by the way, establishes the asymp-
totic negligibility condition of the martingale estimate (28).) Now we are ready to
show that the two conditions of the martingale central limit theorem are satisfied
by the martingale M_1(t), so that it has a (continuous) Gaussian distribution with
mean zero and variance estimated by (36) for large n. We have, as n → ∞,

<M_1>(t) →^P ∫_(0,t] dH(s)/p(s),

which is deterministic and hence satisfies the first condition (a). To show that
condition (b) holds, we note that jumps in the martingale √n [Ĥ(t) - H(t)] arise
only from Ĥ(t), whose jumps are of size 1/Y(t) (see (28)), which is of order 1/n,
so that the jumps of √n [Ĥ(t) - H(t)] tend to zero as n → ∞ and a Lindeberg-type
condition holds. Note that conditions weaker than (b), such as those of
Lindeberg, Lyapounov and Rebolledo, suffice to yield the same results (see Ander-
sen and Gill 1982 and Karr 1986). We therefore conclude that the estimator Ĥ(t)
given by (29) is consistent and asymptotically normal.

Now, from (36) we immediately obtain the conditional variance of Ĥ(t) as

Var[ Ĥ(t) | F_{t-} ] = ∫_(0,t] I(Y(s) > 0) dH(s)/Y(s)   (37)

and, upon substituting (28) in (37), the sample variance

Vâr[ Ĥ(t) | F_{t-} ] = ∫_(0,t] I(Y(s) > 0) dN(s)/Y(s)^2.   (38)
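The practical content of the preceding argument is that Ĥ(t), centered at H(t) and scaled by the square root of the sample variance (38), is approximately standard normal for large n. A small simulation sketch (exponential durations with a hypothetical rate, chosen only for illustration):

import numpy as np

rng = np.random.default_rng(1)
t0, rate, n, reps = 1.0, 0.5, 400, 1000
H_true = rate * t0                               # exponential: H(t) = rate * t
z = np.empty(reps)
for r in range(reps):
    x = np.sort(rng.exponential(1.0 / rate, size=n))
    at_risk = n - np.arange(n)                   # Y at each ordered failure time
    use = x <= t0                                # failures in (0, t0]
    H_hat = np.sum(1.0 / at_risk[use])           # eq. (29)
    var_hat = np.sum(1.0 / at_risk[use] ** 2)    # eq. (38)
    z[r] = (H_hat - H_true) / np.sqrt(var_hat)
print(z.mean().round(3), z.std().round(3))       # approximately 0 and 1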


We now derive test statistics for various hypotheses on h(t), t ∈ (0,τ]. In gen-
eral, two types of hypotheses are of interest. One type is to compare the unknown
hazard h(t) in the target population with some known hazard h⁰(t) (such as that
in the general population). The second type is to compare the unknown hazards
among different target populations.
In the first type, the hypothesis of interest may be either

Hypo 1:  h(t) = h⁰(t)   (39)

or

Hypo 2:  h_j(t) = h_j⁰(t),  j = 1, ..., k,   (40)

depending on whether we are dealing with one target population or k target popu-
lations. In (39) we might be interested in comparing the hazard level of a cohort
with that of a general population of which the cohort is a subset. In (40) we might
be interested in comparing the hazard levels for various occupational groups with
those for the nation as a whole.
We shall derive two test statistics for (39). Using the asymptotic distribution
of M_1(t), namely the Gaussian distribution with mean zero and variance
estimated by (36), one immediately obtains the asymptotic unit normal test statis-
tic for Hypo 1 as follows:

Z_1 = [Ĥ(t) - H⁰(t)] / [<Ĥ - H⁰>(t)]^{1/2}
    = [ ∫_(0,t] I(Y(s) > 0) dN(s)/Y(s) - ∫_(0,t] h⁰(s) ds ] / [ ∫_(0,t] I(Y(s) > 0) h⁰(s) ds/Y(s) ]^{1/2},   (41)

where we have used (29) and (38) to obtain the first term in the numerator and the
denominator, respectively, with dN(s) in (38) evaluated under the null hypothesis
as Y(s)h⁰(s)ds. In actual applications, the hazard rate function h⁰(t) is
normally given in terms of the duration-specific occurrence/exposure rates
h_j⁰, j = 1, ..., r, with (0,t] partitioned into r subintervals I_j = (t_j, t_{j+1}], where t_1 = 0 and
t_{r+1} = t, the cutoff date of the study. Accordingly, the two terms involving h⁰ on
the right side of (41) are written in terms of the h_j⁰'s as follows:

∫_(0,t] h⁰(s) ds = Σ_{j=1}^{r} h_j⁰ (t_{j+1} - t_j),   (41a)

where h_j⁰ is defined by (4), and

∫_(0,t] I(Y(s) > 0) h⁰(s) ds/Y(s) = Σ_{j=1}^{r} h_j⁰ ∫_{I_j} I(Y(s) > 0) ds/Y(s).   (41b)

An alternative test statistic can be derived by utilizing the result obtained at
the end of Section 3 that, for the continuous hazard process

Λ(t) = ∫_(0,t] Y(s) h(s) ds,   (24')

the associated counting process N(t) is conditionally Poisson with mean and vari-
ance both equal to (24'). This implies an asymptotic unit normal test statistic for
Hypo 1 given by

Z = [ N(t) - ∫_(0,t] Y(s) h⁰(s) ds ] / [ ∫_(0,t] Y(s) h⁰(s) ds ]^{1/2}.   (42)

When, as above, h⁰(t) = h_j⁰ for t ∈ I_j, then the terms involving h⁰ in (42) become

∫_(0,t] Y(s) h⁰(s) ds = Σ_{j=1}^{r} h_j⁰ ∫_{I_j} Y(s) ds.   (42a)

The test statistic Z given in (42) could also be derived by defining the
transformed martingale (see Andersen et al. 1982, and Andersen and Borgan 1985)

M'_1(t) = ∫_(0,t] Y(s) dM_1(s),   (43)

with M_1(t) as defined in (33). M'_1(t) as defined in (43), being a sum of linear
functions of the zero-mean Gaussian process M_1(t), is a zero-mean Gaussian pro-
cess with variance estimated by

<M'_1>(t) = n ∫_(0,t] Y(s) dĤ(s),   (44)

since we have from (35) and (43),

d<M'_1>(t) = [Y(t)]^2 d<M_1>(t) = n Y(t) dH(t).

Finally, substitution of (33) into (43), followed by normalization by the square
root of (44), yields the test statistic (42) for Hypo 1.
In many applied disciplines, the hazard process (24') is referred to as the
"expected" counts. (The use of the word "expected" is inappropriate because, as
noted earlier, the hazard process is a random measure.) Thus, the test statistic (42)
simply represents a standardized difference between observed and "expected"
counts, (O - E)/√E. The ratio O/E of the observed to the "expected" counts is
known in demography as the standardized mortality ratio (SMR) and in econom-
ics as the Paasche index.
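A minimal numerical sketch of the statistic (42) in the simplest case of a single interval, i.e. a constant reference rate h⁰ (the cohort data and the reference rate below are hypothetical):

import numpy as np

rng = np.random.default_rng(2)
n, tau = 300, 2.0
rate_true, rate_ref = 0.7, 0.5                   # cohort hazard vs. reference hazard h0
x = rng.exponential(1.0 / rate_true, size=n)     # uncensored durations, as in this section

observed = np.sum(x <= tau)                      # O = N(tau)
exposure = np.sum(np.minimum(x, tau))            # int_0^tau Y(s) ds
expected = rate_ref * exposure                   # "expected" count, eq. (42a) with one interval
z = (observed - expected) / np.sqrt(expected)    # eq. (42), approximately N(0,1) under Hypo 1
print(observed, round(expected, 1), round(z, 2))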
Now, we come to the second hypothesis (40). Since the martingales defined
by (33) are orthogonal, we may form the sum of squares of the k test statistics of (42)
to obtain

X_k^2 = Σ_{j=1}^{k} (O_j - E_j)^2 / E_j,   (45)

where O_j and E_j represent, respectively, the two terms in the numerator of (42)
for each sample j = 1, ..., k, as a test statistic for (40), which has an asymptotic Chi-
square distribution with k degrees of freedom.
Hypotheses without specification of hazard values are also of interest. For
example, an investigator might be interested in comparing the hazard levels

among various population groups classified by certain fixed external discrete or


categorical covariates of interest. Suppose the intersection of these discrete
covariates produces a total of k categories. Then we are here interested in testing
the second type of hypothesis given by
Hypo:  h_1(t) = h_2(t) = ... = h_k(t).   (46)
To derive the k-sample test statistic for the hypothesis (46), we make use of
the fact that under the null hypothesis the k samples are homogeneous with com-
mon hazard rate h(t) = h_j(t), j = 1, 2, ..., k. Hence, both the counting processes N_j(t)
and the F_{t-}-predictable processes Y_j(t) can be superimposed over j to obtain the
martingale

M.(t) = N.(t) - ∫_(0,t] Y.(s) dH(s),   (47)

where N.(t) = Σ_{j=1}^{k} N_j(t), with n_j = Y_j(0), so that n. = Y.(0) = Σ_{j=1}^{k} n_j and Y.(t) = Σ_{j=1}^{k} Y_j(t).
The unknown h(t), or equivalently H(t), in this case has to be estimated in order to
construct a test statistic for (46).
Corresponding to (29), we obtain the martingale estimate of H(t) from (47)
as follows:

Ĥ(t) = ∫_(0,t] I(Y.(s) > 0) dN.(s)/Y.(s) = Σ_{s_i ≤ t} 1/Y.(s_i).   (48)

Corresponding to (42), we obtain the test statistic for the hypothesis (46) as

X_{k-1}^2 = Σ_{j=1}^{k} (O_j - Ê_j)^2 / Ê_j,   (49)

where O_j = N_j(τ) and

Ê_j = ∫_(0,τ] Y_j(s) dĤ(s) = ∫_(0,τ] I(Y.(s) > 0) Y_j(s) dN.(s)/Y.(s).   (49a)

The test statistic (49) is known as the Chi-square test of homogeneity. It has an
asymptotic Chi-square distribution with k-1 degrees of freedom. The loss of one
degree of freedom from the total k degrees of freedom is due to the requirement
of estimating the unknown H(t).
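The following sketch (two hypothetical uncensored samples) computes the observed counts O_j, the "expected" counts (49a) from the pooled estimate (48), and the homogeneity statistic (49):

import numpy as np

rng = np.random.default_rng(3)
samples = [rng.exponential(2.0, 150), rng.exponential(2.5, 120)]   # k = 2 hypothetical groups
k = len(samples)
pooled_times = np.sort(np.concatenate(samples))

O = np.array([s.size for s in samples], dtype=float)   # O_j = N_j(tau), tau beyond the last failure
E = np.zeros(k)
for t in pooled_times:                                 # each pooled failure time contributes dN.(t) = 1
    at_risk = np.array([np.sum(s >= t) for s in samples], dtype=float)   # Y_j(t)
    E += at_risk / at_risk.sum()                       # eq. (49a)

chi2 = np.sum((O - E) ** 2 / E)                        # eq. (49), Chi-square with k-1 d.f.
print(O, E.round(1), round(chi2, 2))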

6. Censoring, Competing Risks and Time-Varying Covariates


With no loss of generality, we shall from now on take "failure" as the
absorbing or nonrecurrent event of interest. Suppose there are q modes (causes) of
failure and, for any member ω_i of the sample space Ω, let T_δ(ω_i) be ω_i's failure
time due to cause δ in the absence of all other causes, δ = 1, 2, ..., q. T_δ is known as
the latent life in competing risks theory. With F_δ(t) = P[T_δ ≤ t], the hazard rate h_δ,
hazard probability dH_δ, cumulative hazard H_δ, density f_δ, survival S_δ and mean
residual life function e_δ of T_δ are all as defined in Sections 1 and 2 with T there
replaced by T_δ. These functions are not identifiable because the T_δ's are not
observable.
Let

T*(ω_i) = min( T_1(ω_i), ..., T_q(ω_i) ).   (50)

Then T*(ω_i) is the failure time of ω_i in the presence of all causes. Note that right
censoring is considered and included in (50) as a mode of failure. To be specific,
let T_q(ω_i) be the censoring time for ω_i. Just as in the last paragraph, with
F*(t) = P[T* ≤ t], the functions describing the distribution of T*, namely
h*, dH*, H*, f*, S* and e*, are similarly defined. These functions are
identifiable because T*(ω_i) is observable.
Let X(ω_i) index the mode (cause) of failure and Z_t(ω_i) be a vector of inter-
nal covariates ω_i assumes at time t. Clearly, X(ω_i) takes values from the finite
index set A = {1, 2, ..., q} which lists the causes of failure. Let A' = {1, 2, ..., q-1},
which is the subset of A with censoring excluded. Suppose for simplicity that
Z_t(ω_i) takes values from a set of discrete vector values Γ = {z_1, ..., z_p}. Note that
when Z_t(ω_i) is continuous we may categorize it and use z_γ, γ = 1, 2, ..., p, to
represent the mean value for the γth category. Clearly, Z_t(ω_i) is a bounded
predictable vector covariate process and can change the values it takes, from z_1 to
z_2, say, as t progresses from 0 to T*(ω_i).
As there are several possible ways for the covariate to vary with time, dif-
ferent classifications of the time-varying covariates prevail in the literature. The
covariates Z_t(ω_i), or Z(t), introduced here correspond to the class of internal
covariates in Kalbfleisch and Prentice's (1980) classification. Most of the work on
statistical inference involving time-varying covariates done so far has employed
either a parametric approach or semiparametric regression models and used likeli-
hood or least-squares methods, which require (1) a given functional form for the
whole or part of the hazard rate function, with covariates written as explicit func-
tions of time, and (2) integration of the hazard rate function to obtain the cumula-
tive hazard and survival functions (see Petersen 1986, Blossfeld et al. 1989 and
references therein). These procedures are mathematically unwieldy and compu-
tationally lengthy. Besides, it is not always possible to write the time path of a
covariate as an explicit function of time. Our approach to the treatment of the
time-varying covariate is nonparametric and is akin to those of Leung and Wong
(1990) and McKeague and Utikal (1990). It requires a Markovian assumption on
the covariates and provides a straightforward approach to the estimation and
hypothesis testing problem.
In this section we formulate the competing-risks problem as a marked point
process { T*(ω_i), X(ω_i) } with the internal history F_{t-}^{T*,X} defined by

F_t^{T*,X} = σ{ I(T*(ω_i) ≤ s, X(ω_i) = δ); 0 ≤ s < t, δ ∈ A, i = 1, ..., n }   (51)

and employ the sub σ-algebra

F_t^{T*} = σ{ I(T*(ω_i) ≤ s), 0 ≤ s < t, i = 1, ..., n }   (52)

and the covariate history

F_t^Z = σ{ Z_s(ω_i) = z; 0 ≤ s < min(T*(ω_i), t), z ∈ Γ, i = 1, ..., n }   (53)

as natural filtrations to accommodate time-varying covariates. Note that the two
random variables T*(ω_i) and X(ω_i) are realized simultaneously. We cannot
observe one without the other. Thus, when failure occurs to ω_i, in addition to the
observed failure time T*(ω_i), we also observe the mode of failure X(ω_i) and the
covariate history Z_t(ω_i) up to the time of failure T*(ω_i). This is a considerable
enlargement from the previous sections, where we only dealt with the point pro-
cess { T(ω_i) } and employed the internal history F_t^I defined in (11) as the natural
filtration.

An appropriate extension of the definition of the deterministic hazard rate
h(t) given in (1) to incorporate censoring/competing risks and time-varying
covariates yields the two net hazard rate functions of the latent life T_δ given by

h_δ(t) = lim_{Δ→0} P[ T_δ < t+Δ | T_δ ≥ t ]/Δ   (54a)

and

h_δ(t;z) = lim_{Δ→0} P[ T_δ < t+Δ | T_δ ≥ t, Z(t) = z ]/Δ,   (54b)

and the two crude hazard rate functions of the marked point { T*, X } given by

h*(t,δ) = lim_{Δ→0} P[ T* < t+Δ, X = δ | T* ≥ t ]/Δ   (55a)

and

h*(t,δ;z) = lim_{Δ→0} P[ T* < t+Δ, X = δ | T* ≥ t, Z(t) = z ]/Δ.   (55b)

Similarly, extensions of the deterministic hazard probability (6) yield the two
net hazard probabilities:

dH_δ(t) = P[ T_δ = t | T_δ ≥ t ],   if F_δ is discrete,
dH_δ(t) = h_δ(t) dt,   if F_δ is continuous,   (56a)

and

dH_δ(t;z) = P[ T_δ = t | T_δ ≥ t, Z(t) = z ],   if F_δ is discrete,
dH_δ(t;z) = h_δ(t;z) dt,   if F_δ is continuous;   (56b)

and the two crude hazard probabilities:

dH*(t,δ) = P[ T* = t, X = δ | T* ≥ t ],   if F* is discrete,
dH*(t,δ) = h*(t,δ) dt,   if F* is continuous,   (57a)

and

dH*(t,δ;z) = P[ T* = t, X = δ | T* ≥ t, Z(t) = z ],   if F* is discrete,
dH*(t,δ;z) = h*(t,δ;z) dt,   if F* is continuous.   (57b)
In (54b), (55b), (56b) and (57b) above we have replaced the conditioning
events { T_δ ≥ t, Z(s), 0 ≤ s < t } and { T* ≥ t, Z(s), 0 ≤ s < t } in the general definitions of
conditional hazard functions by { T_δ ≥ t, Z(t) = z } and { T* ≥ t, Z(t) = z } because we
assume the Markovian property for Z(t) (see Heckman and Singer (1984) for
illustrations), which implies that h_δ(t;z), h*(t,δ;z), dH_δ(t;z) and dH*(t,δ;z) will
remain unchanged with the condition Z(t) = z in these formulas replaced by
Z(s) = z, 0 ≤ s ≤ t. That is, the hazard at duration time t depends on the covariates
only through their values at the duration time t, and the above four hazard formu-
las are identical to those with the covariate values remaining unchanged at z over
the entire course of the spell and hence are interpretable as such.
Just as in Sections 4 and 5, our interest here lies in making inference on
dH_δ(t) or h_δ(t) and dH_δ(t;z) or h_δ(t;z). However, these net hazard functions are
not identifiable, i.e., they cannot be estimated from the observed information (51)-(53)
without further assumptions. If we make the common assumption that the latent
lives T_δ, δ ∈ A, are mutually independent or conditionally independent given F_t^Z,
it follows that (54a) equals (55a), (54b) equals (55b), (56a) equals (57a) and
(56b) equals (57b) (see, e.g., Hsieh 1989 for the proof of these and related
results). Since the crude hazards are identifiable, the net hazards can thus be
estimated through their relationships to the crude hazards under the assumption of
independent or conditionally independent latent lives.
As in Section 3 we shall employ the homogeneity assumption that the ω_i's in
Ω are all copies of the same ω (homogeneous population) and the "no simultane-
ous occurrences" assumption, which in the present context means that with proba-
bility 1, T_{δ_1}(ω_i) ≠ T_{δ_2}(ω_i) for δ_1 ≠ δ_2, i = 1, ..., n, and T_δ(ω_{i_1}) ≠ T_δ(ω_{i_2}) for i_1 ≠ i_2,
δ = 1, ..., q. The counting process N(t) of Section 3 (see (22)) is directly extended as
follows. For the problem of competing risks accounting for right censoring, we
have

N(t,δ) = Σ_{i=1}^{n} I( T*(ω_i) ≤ t, X(ω_i) = δ ),   t ≥ 0, δ ∈ A,   (58a)

which counts the number of failures in (0,t] due to cause δ ∈ A. For the problem
of competing risks accounting for right censoring and time-varying covariates, we
have

N(t,δ;z) = Σ_{i=1}^{n} I( T*(ω_i) ≤ t, X(ω_i) = δ, Z_{T*}(ω_i) = z ),   t ≥ 0, δ ∈ A, z ∈ Γ,   (58b)

which counts the number of failures in (0,t] due to cause δ ∈ A with covariate value z
at the time of failure.
The F_{t-}-predictable process for the counting process N(t,δ) defined in
(58a) is

Y*(t) = Σ_{i=1}^{n} y_i*(t) = Σ_{i=1}^{n} I( T*(ω_i) ≥ t ),   t ≥ 0,   (59a)

which represents the number at risk at time t of failure from any cause (including
censoring), and the F_{t-}-predictable process for N(t,δ;z) defined in (58b) is

Y*(t;z) = Σ_{i=1}^{n} y_i*(t;z) = Σ_{i=1}^{n} I( T*(ω_i) ≥ t, Z_t(ω_i) = z ),   t ≥ 0, z ∈ Γ,   (59b)

which represents the number at risk at time t with covariate value z at time t. Note that
y_i*(t) and Y*(t) are F_{t-}^{T*}-predictable while y_i*(t;z) and Y*(t;z) are F_{t-}^{T*} ∨ F_{t-}^Z
-predictable. Note also that Y(t) defined in (25) and Y*(t) defined in (59a) are
decrement-life-table functions and hence are monotone nonincreasing functions of
t, while for fixed covariate z, Y*(t;z) defined in (59b) is an increment-decrement-
life-table function and hence may either increase or decrease with t.
The hazard process Λ(t) of Section 3 (see (24) and (24')) is extended to
correspond to (58a) and (58b), respectively, as follows:

Λ(t,δ) = ∫_(0,t] Y*(s) dH*(s,δ) = ∫_(0,t] Y*(s) dH_δ(s),   (60a)

and

Λ(t,δ;z) = ∫_(0,t] Y*(s;z) dH*(s,δ;z) = ∫_(0,t] Y*(s;z) dH_δ(s;z).   (60b)

The second equalities in (60a) and (60b) follow from the assumptions of indepen-
dent and conditionally independent latent lives.
The assumption of independent latent lives implies preservation of mar-
tingale property. The martingale equations (19’) and (20’) of Section 3 are

extended to correspond to the counting processes (58a) and (58b) and their respec-
tive compensators (60a) and (60b) as follows:

M(t,δ) = N(t,δ) - ∫_(0,t] Y*(s) dH_δ(s),   δ ∈ A,   (61a)

with E[ M(t,δ) | F_{t-} ] = 0, where F_t = F_t^{T*,X}, and

M(t,δ;z) = N(t,δ;z) - ∫_(0,t] Y*(s;z) dH_δ(s;z),   δ ∈ A, z ∈ Γ,   (61b)

with E[ M(t,δ;z) | F_{t-} ] = 0, where F_t = F_t^{T*,X} ∨ F_t^Z.
It follows immediately from (61a) and (61b) that

E[ N(t,δ) | F_{t-} ] = ∫_(0,t] Y*(s) dH_δ(s)

and

E[ N(t,δ;z) | F_{t-} ] = ∫_(0,t] Y*(s;z) dH_δ(s;z),

which are the corresponding extensions of (24) of Section 3.
Thus, the martingale estimation formulas (28) and (29) of Section 4 for the
deterministic hazard dH(t) are extended to those for dH_δ(t) as follows. For each
δ ∈ A, we have

dĤ_δ(t) = I( Y*(t) > 0 ) dN(t,δ)/Y*(t),   t ≥ 0,   (62a)

and

Ĥ_δ(t) = ∫_(0,t] I( Y*(s) > 0 ) dN(s,δ)/Y*(s) = Σ_{s_i ≤ t} dN(s_i,δ)/Y*(s_i),   t ≥ 0,   (62b)

where s_i, i = 1, ..., n, is the sequence of ordered failure times irrespective of cause
and dN(t,δ) = N(t,δ) - N(t⁻,δ) is the number of failures due to cause δ ∈ A at
time t, which is zero except at the mark points ( T*(ω_i) = t, X(ω_i) = δ ). Similar
extensions for the estimation of dH_δ(t;z) and H_δ(t;z) yield, for each (δ,z) ∈ A×Γ,

dĤ_δ(t;z) = I( Y*(t;z) > 0 ) dN(t,δ;z)/Y*(t;z),   t ≥ 0,   (63a)

and

Ĥ_δ(t;z) = ∫_(0,t] I( Y*(s;z) > 0 ) dN(s,δ;z)/Y*(s;z),   t ≥ 0,   (63b)

where dN(t,δ;z) = N(t,δ;z) - N(t⁻,δ;z) is the number of failures due to cause
δ ∈ A at time t with covariate z at the time of failure, which is zero except at the
points ( T*(ω_i) = t, X(ω_i) = δ, Z_{T*}(ω_i) = z ). Note that, just as in Section 4, we have
used the assumption of asymptotic negligibility in (62) and (63). The validity of
this assumption will be shown in the next paragraph.
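A short sketch of the cause-specific estimates (62a)-(62b) on hypothetical competing-risks data, with right censoring treated as one of the q causes as in (50):

import numpy as np

rng = np.random.default_rng(4)
n = 200
latent = np.column_stack([rng.exponential(3.0, n),      # latent life, cause 1
                          rng.exponential(5.0, n),      # latent life, cause 2
                          rng.exponential(4.0, n)])     # cause 3 = right censoring
t_star = latent.min(axis=1)                             # T*, eq. (50)
cause = latent.argmin(axis=1)                           # X, the mode of failure

order = np.argsort(t_star)
cause = cause[order]
at_risk = n - np.arange(n)                              # Y*(t) just before each ordered time

H_hat = {delta: np.cumsum((cause == delta) / at_risk)   # eqs. (62a)-(62b)
         for delta in range(3)}
print({delta: round(H_hat[delta][-1], 2) for delta in H_hat})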
The hypotheses and test statistics of Section 5 on h(t) are to be
extended to the corresponding hypotheses and test statistics on the h_δ(t)'s and
h_δ(t;z)'s. In order for the corresponding test statistics for these hypotheses to
hold, one has to show that the two martingale extensions of M_1(t) defined by
(33), namely

M_2(t) = √n [ Ĥ_δ(t) - H_δ(t) ]   (64a)

and

M_3(t) = √n [ Ĥ_δ(t;z) - H_δ(t;z) ]   (64b)

satisfy the two conditions of the martingale central limit theorem stated in Section
4. In the case of (64a), these two conditions are satisfied because, just as Y(t) in
Section 5, Y*(t) in this section is monotone non-increasing, for which almost
surely uniform convergence of Y*(t)/n holds and the Glivenko-Cantelli theorem
applies, yielding the deterministic function E[ Y*(t)/n ] = S*(t⁻) = Π_{δ=1}^{q} S_δ(t⁻). In the
case of (64b), the two conditions of the martingale central limit theorem are also
satisfied but, because of the non-monotonicity of Y*(t;z) in this case, one arrives
at convergence in probability rather than almost surely uniform convergence (see
Pollard 1984 and McKeague and Utikal 1990), and the asymptotic stability condi-
tion implies that Y*(t;z)/n converges in probability to a deterministic function
p(t,z) as n → ∞. (The exact expression for p(t,z), which is rather complicated, is
not required to obtain the asymptotic Gaussian distribution of M_3(t).) These
results also imply the validity of the asymptotic negligibility, namely, 1/Y*(t) → 0
and 1/Y*(t;z) → 0, as n → ∞, used in the martingale estimation of dH_δ(t) in (62a)
and dH_δ(t;z) in (63a) above. We thus conclude that both M_2(t) and M_3(t) have
asymptotic Gaussian distributions with mean zero and respective variances
<M_2>(t) and <M_3>(t) estimated by (36) with Y(s), Ĥ(s) replaced by
Y*(s), Ĥ_δ(s) and by Y*(s;z), Ĥ_δ(s;z), respectively, and that the estimators Ĥ_δ(t)
given by (62b) and Ĥ_δ(t;z) given by (63b) are consistent and asymptotically nor-
mal.
Corresponding to Hypo 1 (equation (39)) of Section 5 we have two exten-
sions in this section. For the competing risks problem we have:

Hypo 1':  h_δ(t) = h_δ⁰(t),   δ ∈ A'.   (39')

It follows from the results of the last paragraph that the first test statistic for (39')
is given by (41) with Y(s), N(s), h⁰(s) replaced by Y*(s), N(s,δ), h_δ⁰(s), respec-
tively, together with similar replacements for (41a) and (41b), and that the second
test statistic for (39') is given by (42) with N(t), Y(s), h⁰(s) replaced by N(t,δ), Y*(s),
h_δ⁰(s), respectively, together with similar replacements for (42a). For the problem
of competing risks with time-varying covariates we have the extension:

Hypo 1'':  h_δ(t;z) = h_δ⁰(t;z),   δ ∈ A', z ∈ Γ.   (39'')

The two test statistics for (39'') are given by (41) with Y(s), N(s), h⁰(s) replaced
by Y*(s;z), N(s,δ;z), h_δ⁰(s;z), and given by (42) with N(t), Y(s), h⁰(s) replaced
by N(t,δ;z), Y*(s;z), h_δ⁰(s;z), respectively, together with similar replacements
for (42a).
Corresponding to Hypo 2 (equation (40)) we also have two extensions:

Hypo 2':  h_{δj}(t) = h_{δj}⁰(t),   j = 1, ..., k,  δ ∈ A',   (40')

and

Hypo 2'':  h_{δj}(t;z) = h_{δj}⁰(t;z),   j = 1, ..., k,  δ ∈ A', z ∈ Γ.   (40'')

The test statistic for (40') and that for (40'') are given by (45), wherein O_j and E_j
represent the two terms in the numerator of (42) for each sample j = 1, ..., k, with
the appropriate replacements of N_j(t), Y_j(s) and h⁰(s) as described in the last para-
graph for the competing risks problems with and without time-varying

covariates, respectively.
Finally, for the second type of hypothesis, we have, in correspondence to the
hypothesis (46), the following two extensions:

Hypo':  h_{δ1}(t) = h_{δ2}(t) = ... = h_{δk}(t),   δ ∈ A',   (46')

and

Hypo'':  h_{δ1}(t;z) = ... = h_{δk}(t;z),   δ ∈ A', z ∈ Γ.   (46'')

The k-sample test statistics for (46') and (46'') are given by (49) and (49a) with
N_j(t), Y_j(s), Y.(s) and N.(s) replaced by N_j(t,δ), Y_j*(s), Y.*(s) and N.(s,δ),
respectively, in the case of (46'), and by N_j(t,δ;z), Y_j*(s;z), Y.*(s;z) and
N.(s,δ;z), respectively, in the case of (46'').

REFERENCES

Aalen, O. O. (1978): "Non-parametric Inference for a Family of Counting Processes," Ann. Statist. 6, 701-726.
Altshuler, B. (1970): "Theory for the Measurement of Competing Risk in Animal Experiments," Math. Biosci. 6, 1-11.
Andersen, P. K. and Borgan, O. (1985): "Counting Process Models for Life History Data: A Review," Scand. J. Statist. 12, 97-158.
Andersen, P. K., Borgan, O., Gill, R. D. and Keiding, N. (1982): "Linear Nonparametric Tests for Comparison of Counting Processes, with Applications to Censored Survival Data (with discussion)," Int. Statist. Rev. 50, 219-258. Correction 52, 225.
Andersen, P. K. and Gill, R. D. (1982): "Cox's Regression Model for Counting Processes: A Large Sample Study," Ann. Statist. 10, 1100-1120.
Barlow, R. and Proschan, F. (1975): Statistical Theory of Reliability and Life Testing. New York: Holt, Rinehart and Winston.
Blossfeld, H. P., Hamerle, A. and Mayer, K. U. (1989): Event History Analysis: Statistical Theory and Application in Social Sciences. Hillsdale, NJ: Lawrence Erlbaum Associates.
Cox, D. R. and Oakes, D. (1984): Analysis of Survival Data. London: Chapman and Hall.
Gill, R. D. (1984): "Understanding Cox's Regression Model: A Martingale Approach," J. Am. Statist. Ass., 441-447.
Hall, P. and Heyde, C. (1980): Martingale Limit Theory and Its Applications. New York, NY: Academic Press.
Heckman, J. J. and Singer, B. (1984): "Econometric Duration Analysis," Journal of Econometrics, 24, 63-132.
Helland, I. S. (1982): "Central Limit Theorems for Martingales with Continuous Time," Scand. J. Statist. 9, 79-94.
Hsieh, J. J. (1989): "A Probability Approach to the Construction of Competing-Risk Life Tables," Biometrical Journal, 31, 339-357.
Hsieh, J. J. (1990): "Construction of Expanded Continuous Life Tables --- A Generalization of Complete and Abridged Life Tables," Mathematical Biosciences (in press).
Kalbfleisch, J. D. and Prentice, R. L. (1980): The Statistical Analysis of Failure Time Data. New York: Wiley.
Karr, A. F. (1986): Point Processes and their Statistical Inference. New York: Dekker.
Kopp, P. E. (1984): Martingales and Stochastic Integrals. Cambridge: Cambridge University Press.
Leung, S. F. and Wong, W. H. (1990): "Nonparametric Hazard Estimation with Time-Varying Discrete Covariates," Journal of Econometrics, 45, 309-330.
Liptser, R. S. and Shiryayev, A. N. (1978): Statistics of Random Processes II: Applications. New York: Springer-Verlag.
McKeague, I. W. and Utikal, K. J. (1990): "Inference for a Nonlinear Counting Process Regression Model," Annals of Statistics 18, 1172-1187.
Nelson, W. (1969): "Hazard Plotting for Incomplete Failure Data," J. Qual. Technol. 1, 27-52.
Petersen, T. (1986): "Fitting Parametric Survival Models with Time-Dependent Covariates," Journal of the Royal Statistical Society, Series C, 35, 281-288.
Pollard, D. (1984): Convergence of Stochastic Processes. New York, NY: Springer-Verlag.
Rebolledo, R. (1980): "Central Limit Theorems for Local Martingales," Z. Wahrsch. verw. Gebiete 51, 269-286.

Bootstrap Assessment of Prediction in Exploratory


Regression Analysis

Victor Kipnis
University of Southern California*

Abstract

The paper concentrates on the evaluation of the impact of ex-


ploratory analysis for selecting the 'best' predictor with respect to the
mean squared error of prediction (MSEP). The very selection process
affects the distribution of the conventional MSEP estimators and, in
particular, leads to their substantial overoptimism. To allow for the
selection effect, it is suggested to construct a pseudomodel for generat-
ing bootstrap pseudodata and to apply to them the selection procedure
used for the original data. The simulation study demonstrates that
the suggested approach can be used not only for assessing existing
procedures, but it also may be helpful in providing a criterion for a
stopping rule in multistep model building.
1. Introduction.
As opposed to the traditional inference, which is based on an a
priori specified model, a major feature of modern regression analysis
is model building within the framework of exploratory data analysis.
This process may include checking model assumptions, applying re-
gression diagnostics, assessment of influence, selection of regressors,
choosing the functional form of the equation, along with parameter esti-
mation, hypotheses testing, etc. As was pointed out by Box (1983),
not one but at least two kinds of inference are involved within this
exploratory process: fitting the data according to a tentative model
and data and model criticism. The latter includes assessment of the
fit with regard to a specified criterion and replacement, if possible,
of the current model with a new one. It is used iteratively and in
alternation with fitting. This makes contemporary regression anal-
ysis a multistep, iterative process in which the regression model is
not fixed in advance but rather is continually evolving. In practice,
when exploratory activity is usually carried out using and reusing the
same data, the conventional means of inference both at the interme-
diate steps and for the finally chosen equation could be misleading,
sometimes badly so (e.g., Freedman, 1983; Lovell, 1983; Miller, 1984;
Pinsker et al., 1985, 1987).
When the main goal of regression analysis is prediction of new
observations, model building is usually reduced to selection of a re-
gression equation that will give the ‘best’ predictions. There is a
good deal of literature on different selection procedures with regard
to their computational (logical) scheme (e.g., see a review in Hocking,
1976 and Thompson, 1978), but much less has been done concerning
their impact on statistical properties of a selected predictor. This pa-
per concentrates on estimating the mean squared error of prediction

*University Park, Los Angeles, California 90089-0641.



(MSEP) of a regression equation selected by an exploratory proce-


dure. For illustrative purposes, regression procedures based on subset
selection methods are considered. This choice has been motivated
by the fact that subset selection methods have a relatively simple
logic and can be easily formalized and, on the other hand, they well
demonstrate the vast conceptual and statistical problems associated
with model selection. Besides, they have been very popular in applied
work and have received much attention in the literature. It should be
emphasized, though, that the general concepts of the present study
apply to any other exploratory procedure for model building.
After formulating a problem of MSEP estimation in exploratory
regression analysis in section 2, three bootstrap estimators are in-
troduced and discussed in section 3. The results of the experimental
comparison of the bootstrap and conventional estimators are analyzed
in section 4. The final section 5 is a conclusion.
2. Problem Formulation.
Consider the linear regression model

Y = Y^0 + ε = Xβ + ε,   (1)

where Y is an n-vector of observations on the response variable, Y^0 =
Xβ is an n-vector of unobservable mean components of the model, X
is an n × k full-rank matrix of observations on each of the k predictor
variables, β is a k-vector of unknown parameters, and ε is an n-vector
of unobservable independent and identically distributed disturbances
with mean 0 and variance σ². The X's are assumed fixed. The given
n > k observations on the response and the predictor variables consti-
tute, we suppose, a construction set V = (X_V, Y_V). Let W = (X_W, Y_W)
be a new set of n_W observations based on the same model (1), where
Y_W is independent from Y_V. Given the construction set V and the
matrix X_W, the regression goal is prediction of the hypothetical new ob-
servations Y_W.
Consider an exploratory procedure

Ŷ_W = g(V; X_W)

that selects a p-subset, p ≤ k, of the predictor variables and yields the
n_W-vector Ŷ_W of the predicted values of Y_W based on the OLS fitting
of the selected subset. To be more specific, let S_p = [i_1, i_2, ..., i_p] be a
set of indices of the predictor variables to be included in the selected p-
subset. Let D_p be a corresponding k × p 'indicator' matrix consisting of
zeros and ones such that X_Vp = X_V D_p contains only those p columns of
X_V with indices from S_p. Then the procedure g consists of (i) choosing
the subset S_p; (ii) estimating the vector β of the regression coefficients by
β̂ = D_p β̂_p, where β̂_p = (D_p' X_V' X_V D_p)^{-1} D_p' X_V' Y_V; (iii) predicting Y_W by
Ŷ_W = X_W β̂. Note that although β̂_p is the OLS estimator of β_p = D_p' β,
the resulting vector β̂ is not the OLS estimator of β. Its distribution,

as well as the distribution of the predictor Ŷ_W = g(V; X_W), depends


on the procedure g. It is important to emphasize here that the chosen
predictor does not pretend to represent the ‘real’ model and, actually,
is based on a random subset S_p of predictor variables. Its justification
should rest on a justification of the prediction procedure g by which
the data are transformed into actual prediction values.
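To fix ideas, here is a small sketch of one concrete version of such a procedure g (the data, the fixed subset size p and the minimum-RSS selection rule are illustrative assumptions; this rule is the one used within best subset regression in Section 4):

import numpy as np
from itertools import combinations

def g(XV, YV, XW, p):
    """Select the p-subset with minimum RSS on V, fit it by OLS, predict at XW."""
    best_rss, best_idx, best_beta = np.inf, None, None
    for idx in combinations(range(XV.shape[1]), p):      # screen all p-subsets S_p
        Xp = XV[:, idx]
        beta_p = np.linalg.lstsq(Xp, YV, rcond=None)[0]
        rss = np.sum((YV - Xp @ beta_p) ** 2)
        if rss < best_rss:
            best_rss, best_idx, best_beta = rss, idx, beta_p
    beta = np.zeros(XV.shape[1])                          # beta-hat = D_p beta_p-hat
    beta[list(best_idx)] = best_beta
    return XW @ beta, best_idx                            # Y_W-hat = X_W beta-hat

rng = np.random.default_rng(5)
n, k = 40, 6
XV, XW = rng.normal(size=(n, k)), rng.normal(size=(n, k))
YV = XV @ np.array([1.0, 0.5, 0, 0, 0, 0]) + rng.normal(size=n)
print(g(XV, YV, XW, p=2)[1])                              # indices of the selected 2-subset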
For the quadratic loss function

L_W(Ŷ_W; Y_W) = (1/n_W) (Ŷ_W - Y_W)'(Ŷ_W - Y_W),
two distinct risk functions are important measures of efficiency of the


procedure g. The first one is the conditional MSEP at Xy, given the
construction set V,

MSEPy (9;V) = E[Ly (9;V) |V]


poly Saat p (2)
= 07 + —(6
ny
~B)'Xi Xw (8 - A).
MSEP_W(g; V) is a random variable, together with β̂, and could be con-
sidered as measuring both the predictive ability of a selected model
and the efficiency of the selection procedure g for a given construc-
tion set V. To be able to analyze existing procedures and to invent
new ones, one needs to know their statistical properties for different
realizations of V. A summarizing characteristic is the unconditional
MSEP, i.e., the average risk over all possible construction sets:

MSEP_W(g) = E[ MSEP_W(g; V) ]
          = σ² + (1/n_W) (Eβ̂ - β)' X_W' X_W (Eβ̂ - β)
          + (1/n_W) tr( X_W' X_W VAR[β̂] ),   (3)

where VAR[β̂] = E[ (β̂ - Eβ̂)(β̂ - Eβ̂)' ] is the variance-covariance matrix
of β̂. As opposed to the conditional risk (2), the unconditional MSEP
measures the average efficiency of the selection procedure g, not of a
selected model that may vary for different realizations of V.
The simplest empirical estimator of MSEP is the autoloss func-
tion, or 'apparent error' in the terminology of Efron (1982),

AL(g; V) = L_V( Y_V, g(V; X_V) )
         = (1/n) (Y_V - Ŷ_V)'(Y_V - Ŷ_V),   (4)

that measures the goodness-of-fit of the procedure g over the construc-
tion set V. Usually AL tends to underestimate MSEP, because the

same data have been used for both construction and evaluation. This
is a familiar fact that could be easily demonstrated for a very sim-
ple procedure consisting of OLS-fitting of an a priori specified subset
X_Vp = X_V D_p, with fixed D_p, and X_W = X_V, i.e., when a new response
vector Y_W is predicted for the same set X_V of observations on the ex-
planatory variables. For future reference we will denote this procedure
by g_p. We have

Ŷ_W = g_p(V) = P_p Y_V,   (5)

where P_p = X_Vp (X_Vp' X_Vp)^{-1} X_Vp' is the projection matrix onto the fixed
linear space spanned by the column-vectors of the matrix X_Vp. In this
case, X_V Eβ̂ = P_p Y^0, and it follows from (3)-(4) that

E[ AL(g_p; V) ] = (σ²/n)(n - p) + (1/n) Y^0' (I_n - P_p) Y^0   (6)

and

MSEP(g_p) = (σ²/n)(n + p) + (1/n) Y^0' (I_n - P_p) Y^0,   (7)

so that AL underestimates MSEP with the negative bias (-2pσ²/n).


In the literature different 'adjusted' forms of AL have been sug-
gested (e.g., J_p, Rothman, 1968; Amemiya, 1980; C_p, Mallows, 1973;
AIC, Akaike, 1973; BIC, Sawa, 1978, and others). These statistics,
or their simple transformations in the case of the information criteria, are
unbiased estimators of MSEP under certain conditions. The major as-
sumption requires that g be based on OLS-fitting a subset S_p specified
independently from the construction data set V. Since model selec-
tion violates this assumption, it may seriously affect the distribution
of the conditional MSEP(g; V) in a way that is not reflected by any
of the adjusted statistics. The fact that the conventional estimators
become quite unsatisfactory in exploratory analysis has been repeat-
edly pointed out (e.g., Berk, 1978; Hjorth, 1982; Miller, 1984), but
up to now no theory has been put forward to account for it. In our
view, such a theory could be developed only after an explicit introduc-
tion of the concept of an exploratory procedure into regression analysis
and making it an object of statistical study. For instance, the 'pro-
cedural approach' allows one to clearly understand and interpret the
fact that the conventional adjusted estimators become substantially
biased under the selection process. These estimators have been derived for
one particular procedure, namely g_p, that does not involve selection,
and cannot be used with other procedures without further adjust-
ment. Many misunderstandings in the literature on model selection
have resulted from underestimation of this fact. Without the concept
of a regression procedure, it is very tempting to 'forget' how S_p has been
arrived at and to apply the theory developed for an a priori fixed subset
in the case when S_p is a random subset selected by some procedure
g. As a result, conventional estimators, such as C_p or AIC, have often
been used as criteria for screening different subsets and selecting the

'best' one, with implied false inference about the thereby obtained predic-
tors. To be able to get more adequate estimators one has to study the
distribution of Ŷ_W under that very exploratory procedure which has
yielded this predictor.
As exact distributions are extremely difficult to study analyti-
cally, even for relatively simple selection procedures, prediction as-
sessment requires data sets independent from the construction data
V. To expedite this process, the independent sets could be replaced
by pseudosamples which in some sense are close to the original set.
There are different methods for construction of pseudosamples. One
leads to different forms of splitting the data or cross-validation (e.g.,
Stone, 1974; Geisser, 1975). Cross-validation methods could be very
helpful in realizing the behaviour of MSEP for some selection proce-
dures as was demonstrated by Hjorth (1982), and Picard and Cook
(1984) among others. Complications arise, though, for more compli-
cated procedures, which examine different classes of models and sets of
predictor variables that have not been planned in advance, but evolve
in the process of selection itself. Proper cross-validation in this case
requires excluding more and more observations for assessment of the
ensuing results at each new iteration, which could become impractical.
Another method of generating pseudosamples, which does not
have these limitations, is bootstrap. Recently bootstrap methods have
been successfully used for solving various statistical problems includ-
ing estimating the predictive ability for an a priori specified regression
model (e.g., Bunke and Droge, 1984). In the discussion on applying
bootstrap to estimating MSEP under subset selection (e.g., Miller,
1984), it is usually pointed out that this approach seems to suffer
from the complication that the selected set of predictor variables will
vary for different pseudosamples and could be quite different from
that fitted for the given construction sample. The key to our present
approach is just a small step from the usual bootstrap regression anal-
ysis but an important one nevertheless. We bring in the procedural
approach and suggest that assessment of the efficiency of the predictor
should rest on the bootstrap assessment of the exploratory procedure
by which this predictor has been chosen, rather than the evaluation
of any particular subset of variables. It is interesting to note that this
approach has been implicitly applied in Efron (1983, 1986) and Gong
(1986), where it was found better than cross-validation, and in Freed-
man et al. (1987), where it was declared a failure. We will briefly
comment on this failure in Section 3 (see Remark 1).

3. Bootstrap Estimators.
The idea is to analyze performance characteristics of the proce-
dure g on the data generated by a known random mechanism. The
main requirement is that this mechanism, or as we will call it, pseu-
domodel, should simulate the unknown regression model (1). In other
words, it should generate pseudosamples that are ‘close’ to the ob-
served one with regard to their statistical structure. There are several
ways to construct such a pseudomodel. One possible approach is to

use
Ỹ = Ỹ^0 + ε̃ = Xβ̃ + ε̃,   (8)
where β̃ and the distribution F̃ of ε̃ are estimated from the data V.
In a parametric bootstrap, when the form of the distribution F of ε is
assumed known, F̃ is obtained by estimating the unknown parameters
of F. In a nonparametric bootstrap, ε̃ is usually a random sample
(perhaps times a weighting factor) from the empirical distribution F̃ =
F_n(r_1, ..., r_n) of the residuals r = Y_V - X_V β̃. When the true model (1)
is unknown, choosing β̃ seems to be one of the most sensitive elements
of constructing a pseudomodel. If procedure g is rather complicated,
e.g., involves subset selection, misspecification of the mean component
Ỹ^0 in (8) could lead to quite a different performance on pseudodata
as compared to the real data.
In subset selection, it is usually assumed that X in (1) includes all
relevant variables plus, perhaps, some extraneous variables. In this sit-
uation it seems reasonable to use the OLS estimate β̃ = (X_V' X_V)^{-1} X_V' Y_V
based on the 'full' set of explanatory variables. By using the best un-
biased estimator of β, one hopes to get a pseudomodel that is as close
to the real one as possible (see also Remark 3 below). A different
approach, resampling rows from the matrix V, does not seem to
be appropriate here. First, it would lead to pseudodata with a matrix
X̃ different from its counterpart X, which is assumed fixed. Second,
if k is close to n, there is a high probability of getting fewer than k
distinct rows and, thus, not a full rank matrix X̃.
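A minimal sketch of the nonparametric version of the pseudomodel (8): the full-model OLS fit supplies the mean component, and pseudo-errors are resampled from the empirical distribution of the full-model residuals (the data and function names below are hypothetical):

import numpy as np

def make_pseudomodel(XV, YV, rng):
    """Return a generator of pseudo-response vectors according to (8)."""
    beta_full = np.linalg.lstsq(XV, YV, rcond=None)[0]    # OLS on the 'full' set of k variables
    residuals = YV - XV @ beta_full
    def draw(X):
        eps = rng.choice(residuals, size=X.shape[0], replace=True)   # resampled pseudo-errors
        return X @ beta_full + eps                        # Y-tilde = X beta-tilde + eps-tilde
    return draw

rng = np.random.default_rng(6)
XV = rng.normal(size=(40, 6))
YV = XV @ np.array([1.0, 0.5, 0, 0, 0, 0]) + rng.normal(size=40)
draw = make_pseudomodel(XV, YV, rng)
YV_pseudo, YW_pseudo = draw(XV), draw(XV)                 # pseudo-construction and pseudo-target responses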
Consider now two pseudosamples, a pseudoconstruction set Ṽ =
(X_V, Ỹ_V) and a pseudotarget set W̃ = (X_W, Ỹ_W), where Ỹ_V and Ỹ_W
are n × 1 and n_W × 1 random vectors, respectively, independently
generated from model (8). Applying the same selection procedure
g that is used for the original construction set V to the pseudoset
Ṽ, we get a pseudopredictor Ŷ_W̃ = g(Ṽ; X_W) and the pseudolosses
L_W̃(g; Ṽ) = (1/n_W)(Ŷ_W̃ - Ỹ_W)'(Ŷ_W̃ - Ỹ_W). As the pseudomodel (8) is com-
pletely known, Monte Carlo replications could be used to analyze the
distribution of L_W̃(g; Ṽ). Characteristics of this distribution serve as
estimates of their counterparts for the distribution of the real losses
L_W(g; V). This approach leads to the so-called direct bootstrap esti-
mator

R_W^D(g) = E*[ MSEP_W̃(g; Ṽ) ],   (9)

where E* denotes expectation with respect to the random mechanism
(8).
The idea of direct bootstrap estimation has recently been applied
in Freedman et al. (1987). Disappointingly, the bootstrap did not per-
form very well. That R^D is generally biased may be easily illustrated
with the procedure g_p.

PROPOSITION 1: Let g_p be defined by (5), where X_Vp = X_V D_p is fixed in
advance and X_W = X_V. Let the parameters β̃ and σ̃² in the pseudomodel
(8) be unbiased estimators for β and σ², respectively, in the model (1). Then
the direct bootstrap estimator R^D(g_p) for MSEP(g_p) has a positive (unless
p = k) bias (k - p)σ²/n.

PROOF: By analogy with (7),

R^D(g_p) = (σ̃²/n)(n + p) + (1/n) Ỹ^0' (I_n - P_p) Ỹ^0,

so that

E[ R^D(g_p) ] = (σ²/n)(n + p) + (1/n) E[ Ỹ^0' (I_n - P_p) Ỹ^0 ]
             = (σ²/n)(n + p) + (1/n) Y^0' (I_n - P_p) Y^0 + (1/n) tr( (I_n - P_p) VAR[Ỹ^0] )
             = (σ²/n)(n + k) + (1/n) Y^0' (I_n - P_p) Y^0.

Recalling MSEP(g_p) from (7) gives E[ R^D(g_p) ] - MSEP(g_p) = (k - p)σ²/n.

REMARK 1: The above result explains why the direct bootstrap estimator
performs well in the traditional regression framework, when the procedure
g_p is applied to the 'true' model (1). Then p = k, and R_W^D is unbiased, as has
been demonstrated in the literature. On the other hand, the fact that the
direct bootstrap estimator becomes biased even for such a simple procedure
as g_p may explain its failure in subset selection as was observed in Freedman
et al. (1987).
Another possible approach to deriving MSEP estimators is based on
using the pseudomodel (8) for evaluating the 'overoptimism' of the autolosses
AL(g; V) in order to make an appropriate adjustment. At least two choices
present themselves for representing average overoptimism: the difference

Q_W^A(g) = MSEP_W(g) - E[ AL(g; V) ]

and the ratio

Q_W^M(g) = MSEP_W(g) / E[ AL(g; V) ].

We will estimate Q_W^A(g) and Q_W^M(g) by their pseudo-counterparts

Q̃^A(g) = E*[ MSEP_W̃(g; Ṽ) - AL(g; Ṽ) ]

and

Q̃^M(g) = E*[ MSEP_W̃(g; Ṽ) ] / E*[ AL(g; Ṽ) ].

As a result, we arrive at the following two bootstrap estimators of MSEP:
the additive estimator

R^A(g; V) = AL(g; V) + Q̃^A(g),   (10)

and the multiplicative one

R^M(g; V) = AL(g; V) Q̃^M(g).   (11)

The reason behind these estimators is to get more pivotal statistics as com-
pared to the direct estimator (9). It is interesting to note that the additive
estimator (10) was used in Efron (1983, 1986) and Gong (1986) in the frame-
work of logistic regression, but its properties were not compared with those
of the direct estimator. In what follows we will omit the subscripts W, W̃, etc.,
and auxiliary arguments if it does not cause confusion.
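In general the pseudo-expectations E* have to be approximated by Monte Carlo. The sketch below (which reuses the hypothetical g and make_pseudomodel functions from the earlier sketches, and is only an illustration of the scheme) computes the direct estimator (9) together with the additive and multiplicative estimators (10)-(11):

import numpy as np

def bootstrap_mseps(XV, YV, XW, p, n_boot, rng):
    """Monte Carlo versions of R^D (9), R^A (10) and R^M (11) for the procedure g."""
    AL = np.mean((YV - g(XV, YV, XV, p)[0]) ** 2)         # apparent error (4) on the real data
    draw = make_pseudomodel(XV, YV, rng)
    losses, autolosses = [], []
    for _ in range(n_boot):
        YV_b, YW_b = draw(XV), draw(XW)                   # independent pseudo-sets V-tilde, W-tilde
        losses.append(np.mean((YW_b - g(XV, YV_b, XW, p)[0]) ** 2))      # pseudo-loss on W-tilde
        autolosses.append(np.mean((YV_b - g(XV, YV_b, XV, p)[0]) ** 2))  # AL(g; V-tilde)
    R_direct = np.mean(losses)                            # eq. (9)
    R_add = AL + (np.mean(losses) - np.mean(autolosses))  # eq. (10)
    R_mult = AL * np.mean(losses) / np.mean(autolosses)   # eq. (11)
    return R_direct, R_add, R_mult

# e.g.: bootstrap_mseps(XV, YV, XW, p=2, n_boot=200, rng=np.random.default_rng(7))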
If the selection procedure g is simple enough, estimators (10), (11) could
be calculated analytically, without recourse to Monte Carlo. Consider again
procedure g_p (5). This case seems to be especially interesting since, as is
mentioned above, the conventional MSEP estimators have been originally
derived for this particular procedure.

PROPOSITION 2: Let the parameters β̂ and σ̂² in the pseudomodel (8) be unbiased estimators for β and σ², respectively. Then for the procedure g_p (5), the additive bootstrap estimator R^A(g_p) is an unbiased estimator for MSEP(g_p).

PROOF: By analogy with (6)-(7), Q̂^A(g_p) = 2pσ̂²/n, so that from (10)

R^A(g_p) = (1/n) Y_V'(I_n - P_p)Y_V + 2pσ̂²/n.    (12)

Then it follows from (6) that

E[R^A(g_p)] = (σ²/n)(n + p) + (1/n)(X_V β)'(I_n - P_p)(X_V β),

which according to (7) equals MSEP(g_p).

REMARK 2: Since the residual sum of squares for the p-subset is RSS_p =
Y_V'(I_n - P_p)Y_V, it follows from (12) that the additive estimator R^A(g_p)
coincides with the conventional MSEP estimator

R^S = (1/n)(RSS_p + 2pσ̂²),    (13)

where σ̂² = RSS_k/(n - k), described, e.g., in Seber (1977). It is interesting to
note that R^S (and R^A(g_p), for that matter) is actually an unscaled extension

of Mallows' C_p (e.g., Mallows, 1973) to the case when we predict Y_W instead of EY_W, since R^S = (1/n)σ̂²(C_p + n).

REMARK 3: The fact that the additive estimator R^A is unbiased, at least
for the simple procedure g_p, if only E[β̂] = β and E[σ̂²] = σ², may be
viewed as another rationale for using the full-size set of predictor variables for
constructing the pseudomodel (8).

PROPOSITION 3: If, in addition to the conditions of Proposition 2, the
fitted subset X_Vp represents the true model, that is, all components of β
not included in β_p equal zero, and if β̂ = D_p(X_Vp' X_Vp)^{-1} X_Vp' Y_V in (8), then
the multiplicative bootstrap estimator R^M(g_p) is an unbiased estimator for
MSEP(g_p).

The proof follows immediately from the fact that in the considered case
Q̂^M(g_p) = (n + p)/(n - p) and (X_V β)'(I_n - P_p)(X_V β) = 0, so that according to (11),
E[R^M(g_p)] = (σ²/n)(n + p), which equals MSEP(g_p).
REMARK 4: Under the conditions of Proposition 3, the multiplicative estimator R^M(g_p) coincides with another conventional estimator

J_p = (RSS_p/n)·((n + p)/(n - p)),    (14)

suggested by Rothman (1968) and Amemiya (1980).
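For concreteness, both conventional estimators can be computed from residual sums of squares alone. The following fragment is an illustrative helper (the function and argument names are ours), with σ̂² taken from the full k-variable fit as in (13):

```python
def conventional_mse_estimates(rss_p, rss_k, n, p, k):
    """Conventional MSEP estimates for a fixed p-variable fit.

    rss_p: residual sum of squares of the p-variable subset;
    rss_k: residual sum of squares of the full k-variable model.
    Returns (R^S, J_p) following (13) and (14).
    """
    sigma2_hat = rss_k / (n - k)            # full-model error variance estimate
    r_s = (rss_p + 2 * p * sigma2_hat) / n  # Mallows-type estimator (13)
    j_p = (rss_p / n) * (n + p) / (n - p)   # Rothman/Amemiya estimator (14)
    return r_s, j_p
```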
4. Experimental Comparison of the Conventional and the Bootstrap Estimators.

To illustrate the effect of subset selection on conventional MSEP estimators and to compare these estimators with the bootstrap estimators, the
following simulation study was conducted. In all the experiments the simulated data satisfied model (1), with ε ~ N(0, σ²I_n), where the matrices X_V and
X_W were orthonormal with the same number n = n_W of observations. As
was pointed out in Miller (1984), the orthogonal case is far from being the
simplest one with respect to MSEP estimation and actually gives an example
of intermediate deterioration of the conventional estimators under subset selection. An advantage of considering the orthogonal case (besides obvious
calculational simplifications) lies in the fact that all major subset selection
procedures, such as the best subset regression, forward selection, backward
elimination, and stepwise regression lead here to the same ‘best’ p-subset, for
each p = 0,1,...,k, a property that does not hold for nonorthogonal predic-
tor variables. Consider the best subset regression procedure g, which consists
of screening all 2* subsets and selecting the best one with regard to some
criterion (e.g., Hocking, 1976). Usually such a criterion is based on one of
the conventional adjusted estimators, mentioned above. Then for any fixed
p, the ‘best’ p-subset is the one with minimum RSS,, and g could be con-
ceived of as a two-step procedure. At the first step, for each p,p = 0,1...,k,
a p-subset corresponding to the minimum RSS, is found. The second step
consists in comparing these subsets and choosing the overall best according
to the adopted criterion. In the experimental study the three bootstrap es-
timators (9)-(11) were compared with the two conventional estimators (13)
and (14) with regard to the evaluation of the MSEP for each of the best
p-subsets at the first step of the procedure. Then each estimator served as
the criterion for the overall choice at the second step. This made it possible
to compare these statistics not only as MSEP estimators under the search
process, but also as stopping rules in subset selection.
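In the orthonormal case the first step reduces to ranking the squared least-squares coefficients, since each column's squared coefficient is its contribution to the regression sum of squares. A minimal sketch of the two-step search, with illustrative names only, might read:

```python
import numpy as np

def best_subsets_orthonormal(X, y):
    """Step 1: for each p, the minimum-RSS p-subset in the orthonormal case.

    With X'X = I_k the least-squares coefficients are b = X'y, and the best
    p-subset simply keeps the p largest b_i^2.  Returns the column ordering
    and the RSS_p sequence, p = 0, ..., k.  (Illustrative sketch.)
    """
    b = X.T @ y                              # coefficients under an orthonormal design
    order = np.argsort(-b ** 2)              # columns ranked by explained sum of squares
    rss = np.empty(len(b) + 1)
    rss[0] = y @ y                           # RSS with no predictors
    rss[1:] = rss[0] - np.cumsum(b[order] ** 2)
    return order, rss

def choose_subset(order, rss, criterion):
    """Step 2: pick the overall best p by minimizing criterion(rss_p, p)."""
    scores = [criterion(rss[p], p) for p in range(len(rss))]
    p_best = int(np.argmin(scores))
    return order[:p_best], p_best
```

With (13) as the criterion, for instance, one would pass `criterion = lambda rss_p, p: (rss_p + 2 * p * sigma2_hat) / n`.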

REMARK 4: In general, exploratory procedures often represent a multi-step


model building strategy. Although the importance of having good MSEP
estimators for the finally chosen model is quite obvious, it is also essential to
be able to assess the results of the intermediate steps. The latter would allow
one to choose the best among the available alternatives at each step of the
procedure before moving to the next step, thus improving the exploratory
process itself.
The results of applying subset selection depend on the ratio k/n and on
the value of the true parameter vector β scaled by σ². A small theoretical
calculation might clarify matters. For any fixed p, let S = S_p be the best
p-subset selected by g. Then in the orthonormal case it follows from (3) that
the unconditional risk equals

MSEP(g) = σ² + (1/n) E(Σ_{i∈S} σ² + Σ_{i∉S} β_i²).    (15)

Note that S in (15) is a random set of indices. If the true model (1) were
known, the best predictor with minimum MSEP should be based on the
subset S* = {i_1, ..., i_q} such that β_i²/σ² > 1. This fact easily follows from
(3), since for any two fixed subsets S_{p_1} and S_{p_2}, p_1 < p_2,

MSEP(g_{p_1}) - MSEP(g_{p_2}) = -(1/n)[σ²(p_2 - p_1) - Σ_{i∈S_{p_2}\S_{p_1}} β_i²].

The situation, though, might be quite different when S_p is selected according
to minimum RSS_p = RSS_k + Σ_{i∉S_p} β̂_i² or, equivalently, maximum Σ_{i∈S_p} β̂_i².
Here S_p includes the indices of the first p order statistics from the k independent
random variables β̂_1², ..., β̂_k², where β̂_i² ~ σ² χ²_1(β_i²/σ²).

REMARK 5: This consideration explains why a chosen subset S, does not


necessarily minimize measure (15) and, in fact, may lead to a higher MSEP
than the optimal subset S* defined above.
In the experiments we put n = 50 and σ² = 1. As is often the case in
applications, the number of potential predictor variables may be quite large
as compared to the number of available observations, so we considered
k = 15, 25, and 35. Since MSEP does not depend on the choice of X_V and
X_W, as far as they remain orthonormal, we put X_V = X_W = [e_1 e_2 ... e_k],

where the i-th component of the vector e_i equals one and all other components equal zero. Three values of the true vector β were considered: (i)
β_I = (0, 0, ..., 0)', which represents the model with no relevant variables;
(ii) β_II = (β_1, β_2, ..., β_q, 0, 0, ..., 0)', where β_i = E[Z(i; q)], Z(i; q) is the
i-th order statistic from q N(1, .25) random variables, and q is the integer closest to k/3. Here the first q elements of β_II are near the resolving power of the system, with the 'signal-to-noise' ratio (β_i)²/σ² ≈ 1; (iii)
β_III = (7.0, 5.0, 0, 0, ..., 0)', which represents the case with two very significant predictor variables, with signal-to-noise ratios of 49 and 25,
respectively.
For each model specification, as defined by σ² = 1, n = 50, k =
15, 25, 35 and β = β_I, β_II, β_III, 1000 basic data sets V_m = (X_V, Y_V(m))
and W_m = (X_W, Y_W(m)), m = 1, 2, ..., 1000, were generated following
(1). To each data set V_m the best subset procedure g was applied, and for
each p = 0, 1, ..., k, the predictor Ŷ_W(m, p) based on the p-subset X_Vp(m) =
X_V D_p(m) with the minimum RSS_p was found. The 'true' conditional
MSEP was calculated by (2). The two conventional estimators were calculated by formulas (13) and (14) as based on the subset X_Vp(m). The adopted
nonparametric pseudomodel was based on (8) with β̂ = (X_V' X_V)^{-1} X_V' Y_V
and ε̃_i drawn from F̂, the empirical distribution of the last n - k components of the residual vector r = Y_V - X_V β̂. For each simulated set m,
the direct, the additive, and the multiplicative estimators were calculated
by generating 200 pseudosamples Ṽ_i(m) = (X_V, Ỹ_V(m, i)) and W̃_i(m) =
(X_W, Ỹ_W(m, i)) from the chosen pseudomodel.
REMARK 6: The decision to generate ε̃ from the empirical distribution of
the last n - k residuals, as opposed to the traditionally used empirical distribution of all the residuals, reflects a very important feature of an orthogonal
regression. Due to the choice of the matrix X_V, r_1 = ... = r_k = 0, so that
when k is comparable with n, there is a high probability of getting too many
zero pseudo-disturbances ε̃_i. This would make the pseudodata quite different
from the real observations. Another advantage of the chosen F̂ lies in the
fact that it is based on independent residuals, as opposed to the full set of
components of r. Besides, σ̂² becomes an unbiased estimator for σ².
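A sketch of this pseudomodel, with hypothetical names and under the assumptions of the remark (orthogonal design with X_V = [e_1 ... e_k], pseudo-errors resampled from the last n - k residuals), could be:

```python
import numpy as np

def make_pseudomodel(X_V, y_V, k, rng=None):
    """Nonparametric pseudomodel of Remark 6 for the orthogonal design.

    Uses full-model least-squares coefficients and resamples pseudo-errors
    from the empirical distribution of the last n - k residuals, which are
    the only nonzero ones when X_V = [e_1 ... e_k].  Illustrative sketch.
    """
    rng = np.random.default_rng(rng)
    beta_hat, *_ = np.linalg.lstsq(X_V, y_V, rcond=None)
    resid = y_V - X_V @ beta_hat
    pool = resid[k:]                       # the last n - k residuals

    def draw(X_new):
        eps = rng.choice(pool, size=X_new.shape[0], replace=True)
        return X_new @ beta_hat + eps      # one pseudo response vector
    return draw
```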
To judge the performance of each estimator R̂ in the simulations, two
criteria were used: the mean squared error

MSE_C(R̂) = E(R̂ - MSEP(g; V))²

about the conditional MSEP, and the mean squared error

MSE_U(R̂) = E(R̂ - MSEP(g))²

about the unconditional MSEP, where the expected values were approximated
by averages based on the corresponding 1000 replications.

TABLE 1.1 β = β_I. The means and standard deviations (based on 1,000 experiments) of different MSEP estimators.

TABLE 1.2 β = β_II. The means and standard deviations (based on 1,000 experiments) of different MSEP estimators.

TABLE 1.3 β = β_III. The means and standard deviations (based on 1,000 experiments) of different MSEP estimators.

REMARK 7: There is no full agreement in the literature on whether conditional or unconditional MSEP is more appropriate for prediction assessment.
The conventional statistics have been originally proposed for estimating the
unconditional MSEP (e.g., Hocking, 1976; Seber, 1977), but later most of
them were looked upon as estimators of the conditional MSEP (e.g., Hjorth,
1982; Efron, 1983, 1986; Picard and Cook, 1984). Recently, though, the
fashion seems to have been changing again. As is mentioned in Gong (1986)
and Boudreau (1988), it is, perhaps, unfair to consider MSEP estimates as
evaluating the conditional risk, so that they must be viewed as estimators of
the unconditional MSEP. Without trying to resolve this dispute, it is worth
mentioning that from the procedural point of view it is very important to
estimate the unconditional risk. As was mentioned in Section 2, the uncon-
ditional MSEP is the only measure that evaluates the exploratory procedure,
as opposed to any particular chosen model.
Since the results for the three considered k values were not qualitatively
different, in Tables 1 - 4 we report three model specifications with β =
β_I, β_II, and β_III, for k = 25.
(i) Bias. From Tables 1.1 - 1.3 one can see that both conventional estimators J_p and R^S are considerably biased downward when the number of
regressors in a subset exceeds the number of non-zero coefficients of β. For
p = 9 the bias of J_p is about 40% and the bias of R^S is more than 30% of the
actual MSEP values in all three considered examples. When p approaches
the full model size, the bias gets smaller and becomes negligible when p = 25.

As was expected, the direct bootstrap estimator is biased upward. When
p is small this bias almost follows the same pattern as for the procedure g_p,
i.e. (k - p)σ²/n. When p exceeds 3 the bias gets smaller as compared to
J_p, and it becomes almost negligible when p > 10. The 'indirect' bootstrap
estimators have considerably smaller bias than the conventional estimators.
The additive estimator R^A is a clear winner here, having the smallest bias
throughout. For all three considered cases, its bias is less than 2% of the
actual MSEP for p between 0 and 9; it then increases slightly and reaches
about 3% of the MSEP when p = 25. The bias of the multiplicative estimator, R^M, is higher, approaching 10% of the MSEP value when p = 9, but
it still remains relatively small as compared to the bias of the conventional
estimators.
(ii) MSE. As follows from Tables 1.1 - 1.3, both conventional estimators have smaller variances than the bootstrap estimators, since they do
not account for the prediction error due to selection. Still, in terms of
MSE_C and/or MSE_U (Tables 2.1 - 2.3), the indirect bootstrap estimators
considerably outperform J_p and R^S: their larger variances are more than
compensated for by substantially lower biases. As compared between themselves, although R^A is less biased than R^M, its lower bias trades off against
its higher variability, so that the MSE values for both estimators are practically the same in the three considered examples. The direct bootstrap
estimator occupies some middle ground, except for small values of p (p < 5),
where R^D has the highest MSE due to a considerable bias.
TABLE 2.1 β = β_I. The conditional and unconditional mean squared errors (based on 1,000 experiments) of different MSEP estimators.

TABLE 2.2 β = β_II. The conditional and unconditional mean squared errors (based on 1,000 experiments) of different MSEP estimators.

TABLE 2.3 β = β_III. The conditional and unconditional mean squared errors (based on 1,000 experiments) of different MSEP estimators.

(iii) MSEP as a function of p. What is even worse than biasedness,
both J_p and R^S do not follow the actual MSEP and lead to wrong conclusions
as to which subset is the overall best for prediction. From Tables 1.1 - 1.3
one can see that on average the best predictor (with regard to the actual
minimum MSEP) includes no variables if β = β_I and β = β_II, and includes two
variables for β = β_III. At the same time R^S, the less corrupted of the
two conventional estimators, has its minimum when p = 4 for β_I and p = 6
for β_II and β_III.
REMARK 8: The results for β = β_II (Table 1.2) are especially interesting,
because they illustrate the fact that subset selection is not the best way of
deciding which variables should be included in the predictor equation, at least
when some of the components of β are near the level of predictor significance.
Due to selection, even when p equals the number of relevant regressors, the
'best' p-subset often does not include all the significant variables but contains
some 'noisy' regressors with small or zero coefficients. As a result, on average
we are still better off with the 'naive' prediction Ŷ_W = 0. Copas (1983)
and Miller (1984), among others, reach similar conclusions. Unfortunately,
both conventional estimators, J_p and R^S, do not reflect this very important
feature of subset selection and mislead a researcher with an overoptimistic
assessment.
The same remains true with regard to the direct bootstrap estimator.
On the contrary, the indirect bootstrap estimators behave similarly to the
true MSEP. R^A has the smallest average values at p = 0 when β = β_I and
β = β_II, and at p = 2 for β = β_III. R^M differs only in the most difficult case of
β = β_II, where it has its smallest average value when p = 1. The fact that
the indirect bootstrap estimators closely follow the actual MSEP indicates
that they could be used as criteria for choosing the overall best predictor.

Tables 3.1 - 3.3 and 4 display characteristics of the final predictor selected
from among the (k + 1) best p-subsets, p = 0, 1, ..., k, according to a criterion
based on each of the six statistics: the actual MSEP(g; V), J_p, R^S, R^D, R^A, and
R^M. Tables 3.1 - 3.3 contain the empirical distributions of the number
of variables in these final predictors. It follows that the direct bootstrap
estimator provides the worst criterion with respect to the distribution of the
optimal p. The conventional estimators obviously lead to overfitting. The
indirect bootstrap estimators provide distributions very similar to the one based
on using the actual MSEP, except for somewhat thicker right-hand tails.
These long tails could be explained by the fact that all k components of β̂
in the pseudomodel (8) are always different from zero, as opposed to the
three considered true vectors β in model (1). Table 4 contains the average
MSEP values for the final predictors. It follows that although R^D still leads
to some poor results, the two indirect bootstrap estimators, used as selection
criteria, provide substantially better final predictors than those based on the
conventional criteria.
5. Conclusion.
The theory behind conventional estimators for predictive efficiency is
not valid when predictor selection and estimation are from the same data.
The very selection process affects the distribution of those estimators and,
in particular, leads to their substantial bias when the selection effect is not
allowed for. It is suggested that each estimator should be developed for
the selection procedure it is used with. As exact distributional results are
extremely difficult to study analytically, even for relatively simple subset se-
lection procedures, the indirect bootstrap assessment described above seems
to be helpful in solving the problem. The bootstrap method appears general

TABLE 3.1 β = β_I. The empirical distributions (based on 1,000 experiments) of the number p of variables for the final subsets selected from among the (k+1) best p-subsets, p = 0, 1, ..., k, by different criteria.

p    MSEP(g,V)   J_p    R^S    R^D    R^A    R^M
0    1000        6      25     372    868    710
1    0           14     94     62     49     99
2    0           38     119    50     18     55
3    0           54     179    27     11     36
4    0           115    161    26     5      19
5    0           127    164    25     13     21
6    0           156    86     16     4      15
7    0           153    74     17     2      7
8    0           101    47     12     3      4
9    0           87     27     8      1      7
10   0           56     12     10     2      2
11   0           43     5      13     1      4
12   0           25     5      8      1      3
13   0           13     1      4      0      4
14   0           9      0      12     2      2
15   0           2      1      15     0      1
16   0           0      0      6      1      1
17   0           0      0      10     0      1
18   0           1      0      14     2      0
19   0           0      0      17     0      2
20   0           0      0      17     2      0
21   0           0      0      24     2      1
22   0           0      0      29     3      43
23   0           0      0      34     1      0
24   0           0      0      42     7      2
25   0           0      0      130    2      1

TABLE 3.2 β = β_II. The empirical distributions (based on 1,000 experiments) of the number p of variables for the final subsets selected from among the (k+1) best p-subsets, p = 0, 1, ..., k, by different criteria.

TABLE 3.3 β = β_III. The empirical distributions (based on 1,000 experiments) of the number p of variables for the final subsets selected from among the (k+1) best p-subsets, p = 0, 1, ..., k, by different criteria.

p    MSEP(g,V)   J_p    R^S    R^D    R^A    R^M
0    0           0      0      0      0      0
1    0           0      0      3      26     13
2    993         8      31     252    786    640
3    5           16     115    104    78     143
4    0           45     136    58     33     64
5    2           78     177    51     14     45
6    0           130    185    27     11     21
7    0           148    148    25     7      21
8    0           151    70     25     3      14
9    0           148    69     15     5      7
10   0           99     37     13     2      4
11   0           70     17     17     3      5
12   0           52     8      14     0      4
13   0           30     4      13     2      3
14   0           13     1      10     1      3
15   0           8      1      9      1      0
16   0           3      0      12     0      2
17   0           0      1      11     0      2
18   0           0      0      14     3      0
19   0           0      0      19     2      1
20   0           1      0      20     2      1
21   0           0      0      23     4      2
22   0           0      0      32     2      1
23   0           0      0      44     1      2
24   0           0      0      48     8      1
25   0           0      0      141    6      1

TABLE 4 MSEP (averaged over 1,000 experiments) for the final subsets selected from among the (k+1) best p-subsets, p = 0, 1, ..., k, by different criteria.

Criterion     β = β_I    β = β_II    β = β_III
MSEP(g,V)     1.00       1.17        1.04
J_p           1.36       1.42        1.37
R^S           1.29       1.38        1.30
R^D           1.29       1.44        1.34
R^A           1.05       1.17        1.11
R^M           1.09       1.29        1.14

and flexible enough, and in principle could be used for any exploratory pro-
cedure. Indeed, after generating a necessary number of pseudosamples the
same exploratory process that was used for the original data is applied to
each one of them, and the corresponding empirical distribution of predictor
errors provides all characteristics of interest.
One of the major problems in applying bootstrap to model building pro-
cedures consists in choosing an appropriate pseudomodel. Since the ‘true’
model is unknown, the usual bootstrap regression idea of using estimated pa-
rameters of the true model as the pseudoparameters in the pseudomodel does
not work. This difficulty is aggravated by the fact that rather complicated
exploratory procedures prove to be quite sensitive to the choice of the pseu-
domodel. In the framework of subset selection this problem seems to have
been resolved by using the ‘full’ model estimates to construct a pseudomodel.
But this approach is by no means mandatory, and other pseudomodels could
and, perhaps, should be used in different situations.
It is also important to note that the direct bootstrap method does not
work well enough in exploratory analysis, as was demonstrated by the per-
formance of R^D. The indirect approach, which tries to improve, or adjust,
the existing estimators seems to be crucial in this case. As a result, the
suggested method can be used to assess the efficiency of different regression
procedures, to compare those procedures with each other, and to choose the
most efficient one. It also may be helpful in correcting the existing proce-
dures by providing, for example, a criterion for a stopping rule in multistep
model building.

References

Akaike, H. (1973). Information Theory and an Extension of the Maximum Likelihood Principle, 2nd Int. Symp. on Information Theory (B.N. Petrov and F. Csaki, eds.). Budapest: Akademiai Kiado, 267-281.
Amemiya, T. (1980). Selection of Regressors, Int. Econ. Rev., 21,
331-354.
Berk, K.N. (1978). Comparing Subset Regression Procedures, Techno-
metrics, 20, 1-6.
Boudreau, R. (1988). A Monte Carlo Assessment of Cross-validation
and the C_p Criterion for Model Selection in Multiple Linear Regression,
Proceed. 20th Symp. Interface, 603-607.
Box, G.E.P. (1983). An Apology for Ecumenism in Statistics, Scientific
Inference, Data Analysis, and Robustness (G.E.P. Box, T. Leonard, and C.-
F. Wu, eds.). New York: Academic Press, 51-84.
Bunke, O. and Droge, B. (1984). Bootstrap and Cross-Validation Esti-
mates of the Prediction Error for Linear Regression Models, Ann. of Statist.,
12, 1400-1424.
_ Copas, J.B. (1983). Regression, Prediction and Shrinkage (with Discus-
sion), J. R. Statist. Soc., B, 45, 311-354.
Efron, B. (1982). The Jackknife, the Bootstrap and Other Resampling
Plans. Philadelphia: SIAM.
Efron, B. (1983). Estimating the Error Rate of a Prediction Rule: Im-
provement on Cross-Validation, J. Amer. Statist. Ass., 78, 316-331.

Efron, B. (1986). How Biased Is the Apparent Error Rate of a Prediction


Rule? J. Amer. Statist. Ass., 81, 461-470.
Freedman, D. (1983). A Note on Screening Regression Equations, Amer.
Statist., 37, 152-155.
Freedman D., Navidi W., and Peters, S. (1987). On the Impact of
Variable Selection in Fitting Regression Equations, Techn. Rep. 87, Univ.
California, Berkeley, CA.
Geisser, S. (1975). The Predictive Sample Reuse Method With Appli-
cations, J. Amer. Statist. Ass., 70, 320-328.
Gong, G. (1986). Cross-Validation, the Jackknife, and the Bootstrap:
Excess Error Estimation in Forward Logistic Regression, J. Amer. Statist.
Ass., 81, 108-113.
Hjorth, U. (1982). Model Selection and Forward Validation, Scand. J.
Statist., 9, 95-105.
Hocking, R.R. (1976). The Analysis and Selection of Variables in Linear
Regression, Biometrics, 32, 1-49.
Lovell, M. (1983) Data Mining, Rev. Econ. Statist., LXV, 1-11.
Mallows, C. (1973). Some Comments on C_p, Technometrics, 15, 661-
675.
Miller, A.J. (1984). Selection of Subsets of Regression Variables (with
Discussion), J. R. Statist. Soc., A, 147, 389-425.
Picard, R.R., and Cook, R.D. (1984). Cross-Validation of Regression
Models, J. Amer. Statist. Ass., 79, 575-583.
Pinsker, I.Sh., Kipnis, V. and Grechanovsky, E. (1985). The use of the
F-Statistic in the Forward Selection Regression Algorithm, ASA Proceed.
Statist. Comput. Sect., 419-423.
Pinsker, I. Sh., Kipnis, V. and Grechanovsky, E. (1987). The Use of
Conditional Cutoffs in a Forward Selection Procedure, Commun. Statist. -
Theor. Meth., A, 16, 2227-2241.
Rothman, D. (1968). Letter to the Editor, Technometrics, 10, 432.
Sawa, T. (1978). Information Criteria for Discriminating Among Alter-
native Regression Models, Econometrica, 46, 1273-1291.
Seber, G. (1977). Linear Regression Analysis. New York: John Wiley.
Stone, M. (1974). Cross-Validatory Choice and Assessment of Statistical Predictions (with Discussion), J. R. Statist. Soc., B, 36, 111-147.
Thompson, M.L. (1978). Selection of Variables in Multiple Regression.
Parts I and II, Int. Statist. Rev., 46, 1-19, 129-146.
BOOTSTRAPPING LIKELIHOOD RATIOS


IN QUANTITATIVE GENETICS

Nicholas Schork†
University of Michigan

I. Introduction

Normal mixture distributions play a prominent role in the modeling of
the genotypic determinants of quantitative traits [Morton, 1967; Schork and
Schork, 1988; Elston, 1981]. Though the theory behind many models in genetics is quite complex, the following provides a simple introduction to the use
of normal mixture distributions in quantitative genetic modeling.

Consider a genetic locus with two alleles A and a, whose allelic combinations form three distinct genotypes, AA, Aa (or aA), and aa, occurring
in the population with the Hardy-Weinberg equilibrium dictated frequencies
p_AA = q², p_Aa = 2q(1 - q), and p_aa = (1 - q)², where q is the frequency
of the A allele. If one assumes each genotype acts so as to produce a normally
distributed phenotype with a mean, μ_i, and variance, σ²_i, unique to that genotype, then one can model the overall distribution of the relevant trait, t, as a
mixture of 3 normal distributions:

f(t | q, μ_1, μ_2, μ_3, σ²_1, σ²_2, σ²_3) = Σ_{i=1}^{3} p_i · φ(t | μ_i, σ²_i),    (1)

where φ is the normal density function and i indexes the genotype; i.e., AA
(i = 1), Aa or aA (i = 2), and aa (i = 3). Modeling the effects of dominance
and recessivity at the assumed locus is straightforward; one simply assumes
either μ_Aa = μ_aa or μ_AA = μ_Aa. Often the assumption of equal variances
can be made (i.e., σ²_AA = σ²_Aa = σ²_aa = σ²). A graphical representation of the
modeling of quantitative traits with normal mixture distributions is given in
Figure 1 of the appendix, where dominance (of the a allele over the A allele)
is assumed.
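As a small illustration only (this is not part of the original paper; the function and parameter names are ours and SciPy is assumed available), the mixture density (1) with Hardy-Weinberg weights can be evaluated as follows:

```python
import numpy as np
from scipy.stats import norm

def trait_mixture_density(t, q, means, sds):
    """Density of a trait under a one-locus, two-allele model, as in (1).

    q is the frequency of the A allele; means/sds give (mu, sigma) for the
    genotypes AA, Aa (or aA), and aa.  Illustrative sketch only.
    """
    weights = np.array([q ** 2, 2 * q * (1 - q), (1 - q) ** 2])  # Hardy-Weinberg frequencies
    t = np.asarray(t, dtype=float)
    comps = np.array([norm.pdf(t, m, s) for m, s in zip(means, sds)])
    return weights @ comps

# Dominance of the a allele would be imposed by setting means[1] equal to means[2].
density = trait_mixture_density(np.linspace(-3, 6, 7), q=0.3,
                                means=[0.0, 2.0, 2.0], sds=[1.0, 1.0, 1.0])
```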
An oft-used statistical practice in quantitative genetics is the fitting of
mixture distribution models (e.g., based on equation 1) to data with unknown

† Division of Hypertension, Department of Medicine and Department of Statistics, University of Michigan, R6592 Kresge I, Ann Arbor, Michigan 48109-0500


genetic determinants in an effort to gauge the strength of single-locus, 2-allele


models for those data. Traditionally, likelihood ratio tests involving 3 and 2
component normal mixture distribution models and single normal distribu-
tion models have been used as formal testing methods for relevant hypotheses.
Many problems are associated with this modeling and hypothesis testing strat-
egy. It is clear from Figure 1 that when the separation between the mean effects
of the various genotypes is small, it is not obvious that a normal mixture distri-
bution characterizes the data better than other models of “skewed” data (e.g.,
log normal distributions, Weibull distributions, etc.). In addition, Schork and
Schork [1988] have shown that for data with given levels of skewness, one will
almost always accept the normal mixture model as characterizing those data if
a single normal distribution is taken as the alternative to which it is compared
in a likelihood ratio test setting.

What is needed then is a test strategy which permits normal mixture


distributions to be compared to skewed alternatives, such as the log normal
distribution. In Section II we describe a strategy based on bootstrap principles
for this purpose and examine its properties through some Monte Carlo studies.
In Section III we examine the usefulness of this strategy in some complicated
situations arising in quantitative genetics. In Section IV we offer some brief
summary remarks.

II. Bootstrap Tests for Non-Nested Hypotheses

Section I introduced the problem of comparing the support given to nor-


mal mixture and lognormal distributional models by a given set of data. Such
a problem involves hypotheses (e.g., models) from separate families of hy-
potheses. As such, many traditional asymptotic tests (such as the χ²-based
likelihood ratio tests) cannot be used, since the underlying distribution of the
likelihood ratio implicating normal mixture and lognormal models is unknown
and potentially intractable mathematically. One can overcome this shortcoming, however, by simulating the unknown distribution of a likelihood ratio test
statistic implicating both models and estimating critical values from this simulated distribution. More formally, given data X = x_1, ..., x_n following an
unknown distribution f(x), a test for the hypothesis H_0: f(x) = g(x; θ) vs.
H_1: f(x) = h(x; δ), where θ and δ are unknown parameter vectors, can be
constructed as follows. For X compute T = 2 Σ_i log[g(x_i; θ̂)/h(x_i; δ̂)], the likelihood ratio statistic under g(x; θ), where θ̂ and δ̂ are consistent estimators of θ
and δ, respectively. Draw data sets X*_1, ..., X*_r from g(x; θ̂) and compute T*_i
(i = 1, ..., r). From the T*_i's estimate a relevant critical quantile Ĉ. Compare
Ĉ to T to make inferences about H_0. This test construction has been referred
to as the "parametric bootstrap" test in the literature, and several authors
have investigated its usefulness in nested hypotheses situations (see Hall and
Titterington [1989]; Jockel [1986]; and Hope [1968]).
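A minimal sketch of the parametric bootstrap test just described is given below. It is an illustration only: the fitting and log-density routines are user-supplied assumptions, all names are ours, and the one-sided rejection rule shown (reject when T is small) is one natural convention given the definition of T as the log ratio of the null to the alternative likelihoods.

```python
import numpy as np

def parametric_bootstrap_test(x, fit_g, logpdf_g, sample_g,
                              fit_h, logpdf_h, r=19, alpha=0.05, rng=None):
    """Parametric bootstrap test of H0: f = g versus H1: f = h.

    fit_* return parameter estimates from data, logpdf_*(x, params) return
    pointwise log-densities, and sample_g(params, n) simulates from the
    fitted null model.  All callables are user-supplied assumptions.
    """
    rng = np.random.default_rng(rng)
    theta, delta = fit_g(x), fit_h(x)
    T = 2.0 * np.sum(logpdf_g(x, theta) - logpdf_h(x, delta))
    T_star = []
    for _ in range(r):
        x_star = sample_g(theta, len(x))
        th_s, de_s = fit_g(x_star), fit_h(x_star)
        T_star.append(2.0 * np.sum(logpdf_g(x_star, th_s) - logpdf_h(x_star, de_s)))
    crit = np.quantile(T_star, alpha)     # small T favours the alternative model
    return T, crit, T < crit              # statistic, critical value, rejection decision
```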
We examined the usefulness of the parametric bootstrap test for settings
involving mixed normal and lognormal distributional hypotheses. To investi-
gate the level of the test assuming the lognormal model is the correct model, we
generated 100 data sets following lognormal distributions with mean 0 and dif-
fering variances for various sample sizes and computed a parametric bootstrap
test with r=19 and a nominal level of 0.05 for each of the 100 replications. The
number of rejections resulting from these simulations were tallied and taken
as an estimate of the level of the test. These results are outlined in Table 1
of the appendix. This table suggests the parametric bootstrap test is at or
near the nominal level of the test. The small number of replications (i.e., 100)
used to estimate the significance levels was used to keep the computational
burden assumed in the estimation of the mixture distribution parameters for
each replication to a minimum.
To investigate the power of the test in this situation, we again kept the
lognormal distribution as the null model, but generated data following a 2
component normal mixture with mixing weights of 0.75 and 0.25, standard
deviation 1, and means separated at a distance varying from 1 to 4 “within”
component standard deviation units. The rationale for this was that as the
separation between the component means gets larger, “bimodality” (or the
conspicuous presence of a normal mixture distribution) would become more
pronounced (this is not the case in Figure 1) and therefore the lognormal
hypothesis should be more easily rejected. Power estimates based on this
strategy were computed from 100 replications assuming four different sample
sizes. A plot of the results is given in Figure 2 of the appendix. It strongly
suggests that, as expected, the power to reject lognormality increases as the
separation between the means of the normal mixture increases. It should be
emphasized that it has been our experience that a homoscedastic 2 component
normal mixture with mixing weights 0.75 and 0.25 setting provides one of the

poorest environments for rejecting the lognormal hypothesis at low mean com-
ponent separations. That is, other settings (e.g., mixing weights of 0.5 and
0.5) produce higher power to reject the lognormal hypothesis at small mean
component separations (which is intuitive, given that certain normal mixture
distribution settings don’t look or behave like lognormal distributions). Our
investigations of the level and power properties of the test and test setting as
described are by no means exhaustive; further studies involving other lognor-
mal and mixture parameter settings are called for, as are studies investigating
the level and power properties of the test when the roles of the hypotheses
are reversed (i.e., the normal mixture is the null and the lognormal is the
alternative hypothesis).

III. Quantitative Segregation Analysis

The goal of segregation analysis in genetics is to identify models that can


explain the “transmission” of a trait from generation to generation. The unit
of observation for segregation analysis is not an individual, but rather a “pedi-
gree” comprised of, for example, parents, offspring, grandparents, cousins, etc.
Because of the very detailed assumptions and modeling devices that go into
segregation analytic tools, we forego an elaborate description of them (the
interested reader is invited to read Elston and Stewart’s [1971] excellent intro-
duction to the subject and Elston’s [1981] review). Instead we simply want to
emphasize the role normal mixture distributions have in modeling the segre-
gation of genes controlling quantitative traits. Figure 3 depicts a hypothetical
nuclear family (i.e., parents and offspring only) segregating for a dominant
gene controlling a quantitative trait. The arrows under the x-axes represent
hypothetical trait values for each member of the family. From the figure it
appears persons 2 (the mother, say) and 5 have trait values well within the
normal component characterizing affliction (i.e., they have the “disease”). The
question becomes, “how likely is it that a single locus, 2 allele model can ex-
plain the variation in trait values possessed by the family as a whole?” For our
purposes it will suffice to say the null model, the assumption of a single locus
with 2 alleles, is based on mixture distributions, whereas an alternative model,
implicating, say, only environmental or polygenic (i.e., many loci) forces is not.
In this way segregation models for quantitative traits parallel the hypothesis
testing situation outlined in Section II.

In order to test hypotheses about single locus segregation for a quanti-


tative trait, one can use the parametric bootstrap test. To illustrate this, we
applied the parametric bootstrap test of segregation hypotheses to some pub-
lished data. Penno and Vesell [1983] studied the elimination of the model com-
pound antipyrine (AP) and other drugs in an effort to elucidate their genetic
determinants, if any. They concluded, in part, that a single locus with 2 alleles
appeared to control the variation in the elimination of N-demethylantipyrine
(NDM-AP). Penno and Vesell did not base their findings on a formal segrega-
tion analysis of their data, but rather merely observed the consistency of their
13 pedigrees to single locus hypotheses after classifying each individual (based
on their NDM-AP value) into 1 of 3 “disease” groups.

We subjected Penno and Vesell’s data on NDM-AP to a segregation anal-


ysis using the parametric bootstrap test methodology. We fit single locus, 2
allele (i.e., normal mixture based) and polygenic (i.e., based on a single nor-
mal distribution) models to their data and computed likelihoods for each. We
then simulated the distribution of the likelihood ratio first assuming the single
locus model was correct and then assuming the polygenic model was correct,
using r = 99. The simulated test statistic distributions and results are de-
picted in Figures 4 and 5, and suggest that though the major locus model is
not rejected, the polygenic model is easily rejected. We conclude that there is
evidence that a single locus with 2 alleles is, in part no doubt, responsible for
variation in the rate of NDM-AP elimination.

IV. Conclusion

Often times researchers in genetics, as in other disciplines, derive models


for phenomena so disparate that easy comparison of them is not straightfor-
ward. The non-nested nature of such models can be largely overcome through
the use of the computer-intensive, though highly intuitive, parametric boot-
strap test.

References

Morton NE (1967). The detection of major genes under additive continuous


variation. American Journal of Human Genetics 19:23-34.

Schork NJ and Schork MA (1988). Skewness and mixtures of normal distri-


394 Schork

butions. Communications in Statistics 17:3951-3969.


Elston RC (1981). Segregation Analysis. In, Advances in Human Genetics,
ed. H Harris and K Hirshhorn. New York: Plenum, pp. 63-120.
Hall P and Titterington DM (1989). The effect of simulation order on level
accuracy and power of Monte Carlo tests. J.R. Statist. Soc. B. 51:459-467.
Jockel KH (1986). Finite sample properties and asymptotic efficiency of Monte
Carlo tests. Annals of Statistics 14:336-347.
Hope AC (1968). A simplified Monte Carlo significance test procedure. J.R.
Statist. Soc. B. 30:582-598.
Elston RC and Stewart J (1971). A general model for the analysis of pedigree
data. Human Heredity 21:523-542.
Penno MB and Vesell ES (1983). Monogenic control of variations in antipyrine
metabolite formation. J. Clin. Invest. 71:1698-1709.
Appendix

Figure 2. Estimated power of the parametric bootstrap test (null model: lognormal; alternative model: normal mixture) as a function of the separation between the mixture component means, in within-component standard deviation units, for sample sizes n = 25, 50, 100, and 250.

Table 1. Estimated significance levels for the Monte Carlo tests of a lognormal null versus a homoscedastic 2-component normal mixture alternative, based on 100 replicates of lognormal data with mean 0 and different variances, for several sample sizes.

Figures 4 and 5. Simulated distributions of the likelihood ratio test statistic for the NDM-AP data under the major locus and the polygenic null models, with the observed statistics marked.
A Nonparametric Density Estimation Based Resampling
Algorithm. !

Malcolm S. Taylor, US Army Ballistics Research Laboratory ?


James R. Thompson, Department of Statistics, Rice University 3

Abstract. The standard bootstrap algorithm may be viewed as the use of


a Dirac-comb as a nonparametric probability density estimator from which a
random number generator is created. In most cases, it will be more natural
to use a continuous nonparametric probability density estimator to build the
generator. Such an approach may be mathematically cumbersome and com-
putationally inefficient. It is demonstrated that it is possible to create such a
generator without explicitly obtaining the nonparametric probability density
estimator.

Discussion. It is seldom the case that a graphical display of a nonparamet-


ric density estimator is the final end product needed. This is fortunate, since
such a display becomes difficult for dimensions of three. Past a dimensionality
of five, such a display is seldom practical.
There are cases, however, where the “curse of dimensionality” can be readily
overcome. One of these is that of the creation of a pseudo-data set with
properties very much like those of a real data set.
To motivate the SIMDAT algorithm of Taylor and Thompson, let us first
consider the Dirac-comb density estimator associated with a one dimensional
data set {x_i}, i = 1, ..., n. The Dirac-comb density estimator is given by

f_δ(x) = (1/n) Σ_{i=1}^{n} δ(x - x_i),    (1)

where

δ(x) = lim_{σ→0} (1/(√(2π) σ)) exp(-x²/(2σ²)).    (2)

Such a density is a “function” which is zero everywhere except at the data


points. At each of these points, the density becomes a line stretching to infinity
and with mass 1/n. As a nonparametric density estimator, f_δ(x) would appear
1This research was supported in part by the United States Army Research Office
(Durham) under DAAL-03-88-K0131.
2 Aberdeen Proving Ground, Forest Hill, Maryland 21005-5066
3Houston, Texas 77251-1892


to be terrible. It says that all that can happen in any other experiment is a
repeat of the data points already observed, each occurring with probability 1/n.
Yet, for many purposes f_δ(x) is quite satisfactory. For example, the mean
of f_δ is simply

μ̂ = ∫ x f_δ(x) dx = (1/n) Σ_{i=1}^{n} x_i = x̄.    (3)

Similarly,

σ̂² = ∫ (x - x̄)² f_δ(x) dx = (1/n) Σ_{i=1}^{n} (x_i - x̄)² = s².    (4)

The Dirac-comb density estimator may clearly be extended to higher dimensions. For example, if we have a sample of size n from a density of dimension
p, f_δ(X) becomes

f_δ(X) = (1/n) Σ_{i=1}^{n} δ(X - X_i),    (5)

where

δ(X) = lim_{σ→0} (1/(√(2π) σ))^p exp(-Σ_{j=1}^{p} x_j²/(2σ²)),    (6)

with x_j being the j-th component of X.
We could, for example, for the two dimensional case, develop a 95% confidence interval for the correlation coefficient,

ρ = Cov(x, y)/(σ_x σ_y).    (7)

Suppose that we have a sample of size n: {x_i, y_i}, i = 1, ..., n. Then we construct

f_δ(x, y) = (1/n) Σ_{i=1}^{n} δ((x, y) - (x_i, y_i)).    (8)

Using this estimator, we construct 10,000 resamplings of size n, i.e., for
each of the 10,000 resamplings we draw samples from the n data points (with
replacement) of size n. For each of the resamplings, we compute the sample
correlation

r_j = Σ_{i=1}^{n} (x_{ji} - x̄_j)(y_{ji} - ȳ_j) / [Σ_{i=1}^{n} (x_{ji} - x̄_j)² Σ_{i=1}^{n} (y_{ji} - ȳ_j)²]^{1/2}.    (9)

Then, we rank the sample correlations from smallest to largest. A 95%
confidence estimate is given by

r_(250) ≤ ρ ≤ r_(9,750).    (10)


Such a resampling procedure has been proposed by Efron, who calls it the
“bootstrap.” Although it is clear that such a procedure may have some use for
estimating the lower moments of some interesting parameters, we should never
lose sight of the fact that it is, after all, based on the profoundly discontinuous
Dirac-comb estimator f_δ.
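For reference, the resampling scheme of (8)-(10) can be written in a few lines; the sketch below (illustrative names only, not the authors' code) draws pairs with replacement, ranks the resampled correlations, and reads off the percentile interval:

```python
import numpy as np

def bootstrap_correlation_ci(x, y, n_boot=10_000, level=0.95, rng=None):
    """Percentile bootstrap confidence interval for the correlation, as in (10)."""
    rng = np.random.default_rng(rng)
    n = len(x)
    r_star = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)        # resample the n pairs with replacement
        r_star[b] = np.corrcoef(x[idx], y[idx])[0, 1]
    r_star.sort()
    k_lo = round((1 - level) / 2 * n_boot) - 1  # e.g. the 250th ordered value of 10,000
    k_hi = round((1 + level) / 2 * n_boot) - 1  # e.g. the 9,750th ordered value
    return r_star[k_lo], r_star[k_hi]
```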
It is quite easy to find an example where Dirac-comb nonparametric density
estimator based techniques (i.e., the bootstrap) produce disastrous results. For
example, suppose we have a sample of size 100 of firings at a bullseye of radius
5 centimeters. If the distribution of the shots is circular normal with mean
the center of the bullseye and deviation one meter, then with a probability
in excess of .88, none of the shots will hit inside the bullseye. Then any
Dirac-comb resampling procedure will tell us the bullseye is a safe place, if
we get a base sample (as we probably will) with no shots in the bullseye. We
note that any realization of a bootstrap simulation will be different from the
original sample. Some of the sample points will disappear. Others will be
repeated multiple times. Indeed, the concatenation of a bootstrap followed
by a bootstrap based on that bootstrapped simulation, and so on, will lead
ultimately to a simulated sample which consists of a single sample point. This
is hardly desirable. It might be hoped that a single resampling would be of
such a character that we would be almost indifferent as to whether we had
this simulation or the original data set. But, of course, it would be dangerous
to wander too far from the original sample. A resampling of a resampling of
a resampling, etc., is not nearly as desirable as resamples which always point
directly to the original sample.
Clearly, it would usually be preferred to use resampling schemes based on
better nonparametric density estimators. One such would be

f̂(X) = (1/n) Σ_{i=1}^{n} K(X - X_i; Σ_i),    (11)

where K(·; Σ_i) is a normal density centered at zero with locally estimated
covariance matrix Σ_i.
Although a simulation algorithm employing such an estimator has much to
recommend it, it would appear to be extremely difficult to execute. It is at
this point that we should recall what it is we seek: not a good nonparametric density estimator, but a random sample from such an estimator. One is

tempted to try a scheme which goes directly from the actual sample to the
pseudo-sample. Of course, this is precisely what the bootstrap estimator does,
with the very bad properties associated with a Dirac-comb. It is possible,
however, to go from the sample directly to the pseudo-sample in such a way
that the resulting estimator behaves very much like that of the normal kernel
approach above. This is the SIMDAT algorithm of Taylor and Thompson.
We assume that we have a data set of size n from a p-dimensional variable
X, {X_i}, i = 1, ..., n. First of all, we shall assume that we have already rescaled our
data set so that the marginal sample variances in each vector component are
the same. For a given integer m, we find, for each of the n data points, the
m - 1 nearest neighbors. These will be stored in an array of size n × (m - 1).
Let us suppose we wish to generate a pseudo-sample of size N. Of course,
there is no reason to suppose that n and N will, in general, be the same (as
is the case generally with the bootstrap). To start the algorithm, we sample
one of the n data points with probability 1/n (just as with the bootstrap).
Starting with this point, we recall it and its m — 1 nearest neighbors from
memory, and compute the mean of the resulting set of points:

X̄ = (1/m) Σ_{j=1}^{m} X_j.    (12)

Next, we code each of the m data points about X̄:

{X_j'} = {X_j - X̄}.    (13)


Clearly, although we go through the computations of sample means and
coding about them here as though they were a part of the simulation process,
the operation will be done once only, just as with the determination of the
m - 1 nearest neighbors of each data point. The {X_j'} values as well as the X̄
values will be stored in an array of dimension n × (m + 1).
Next, we shall generate a random sample of size m from the one-dimensional
uniform distribution:

u_l ~ U(1/m - √(3(m - 1))/m, 1/m + √(3(m - 1))/m),  l = 1, ..., m.    (14)

We now generate our centered pseudo-data point X', via

X' = Σ_{l=1}^{m} u_l X_l'.    (15)

Finally, we add back on X̄ to obtain our pseudo-data point X̃:



X̃ = X' + X̄.    (16)
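A compact sketch of the SIMDAT steps (12)-(16) in Python follows. It is an illustration only (the names are ours, not the authors'), it assumes the data have already been rescaled to equal marginal sample variances, and it uses uniform weights with mean 1/m and variance (m - 1)/m², as discussed below.

```python
import numpy as np

def simdat(X, m, N, rng=None):
    """Generate N pseudo-observations from an n x p data matrix X via SIMDAT.

    Follows steps (12)-(16): pick a data point at random, take it and its
    m - 1 nearest neighbours, draw uniform weights with mean 1/m and
    variance (m-1)/m^2, and form the weighted combination about the local
    mean.  Assumes X is already rescaled to equal marginal variances.
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    # Precompute each point's m nearest neighbours (the point itself included).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    nbrs = np.argsort(d2, axis=1)[:, :m]
    half = np.sqrt(3.0 * (m - 1)) / m                  # half-width of the uniform in (14)
    out = np.empty((N, p))
    for t in range(N):
        i = rng.integers(n)                            # sample a data point with probability 1/n
        cloud = X[nbrs[i]]                             # the point and its m - 1 neighbours
        centre = cloud.mean(axis=0)                    # local mean, as in (12)
        u = rng.uniform(1.0 / m - half, 1.0 / m + half, size=m)   # weights, as in (14)
        out[t] = u @ (cloud - centre) + centre         # steps (15) and (16)
    return out
```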
The procedure above, as m and n get large, becomes very like that of the
normal kernel approach mentioned earlier. To see why this is so, we consider
the sampled vector X_i and its m - 1 nearest neighbors:

{X_l}, X_l = (x_{1l}, x_{2l}, ..., x_{pl})', l = 1, ..., m.    (17)

For a moment, let us treat this collection as m points from a truncated
distribution with mean vector μ and covariance matrix Σ. Now, if {u_l}, l = 1, ..., m, is
an independent sample from the uniform distribution in (14), then

E(u_l) = 1/m,  Var(u_l) = (m - 1)/m²,  Cov(u_i, u_j) = 0 for i ≠ j.    (18)
Next, we form the linear combination:

Z = Σ_{l=1}^{m} u_l X_l.    (19)

For the r-th component of the vector Z, z_r = u_1 x_{r1} + u_2 x_{r2} + ... + u_m x_{rm}, we
observe the following relationships:

E(z_r) = μ_r,    (20)

Var(z_r) = σ²_r + ((m - 1)/m) μ_r²,    (21)

and

Cov(z_r, z_s) = σ_{rs} + ((m - 1)/m) μ_r μ_s.    (22)

Note that if the mean vector of X were (0,0,...,0), then the mean vector
and covariance matrix of Z would be the same as that of X, i.e., E(z_r) = 0,
Var(z_r) = σ²_r, and Cov(z_r, z_s) = σ_{rs}. Naturally, by translation to the local
sample mean of the nearest neighbor cloud, we will not quite have achieved
this result. But we will come very close to the generation of an observation
from the truncated distribution which generated the points in the nearest
neighbor cloud.
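The moment properties (18), (20), and (21) are easy to check numerically; the short script below (illustrative only, not from the paper) does so for m = 10, with the component values drawn independently of the weights.

```python
import numpy as np

rng = np.random.default_rng(0)
m, mu, sigma = 10, 1.5, 2.0
half = np.sqrt(3.0 * (m - 1)) / m
reps = 200_000

u = rng.uniform(1.0 / m - half, 1.0 / m + half, size=(reps, m))
print(u.mean(), 1.0 / m)                      # E(u) = 1/m, as in (18)
print(u.var(), (m - 1) / m ** 2)              # Var(u) = (m-1)/m^2, as in (18)

# z = sum_l u_l x_l with x_l ~ N(mu, sigma^2), independent of the weights:
x = rng.normal(mu, sigma, size=(reps, m))
z = (u * x).sum(axis=1)
print(z.mean(), mu)                                        # matches (20)
print(z.var(), sigma ** 2 + (m - 1) / m * mu ** 2)         # matches (21)
```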

Clearly, for m moderately large, by the central limit theorem, SIMDAT
comes close to sampling from n normal distributions with the means and covariance matrices corresponding to those of the n, m nearest neighbor clouds.
If we were seeking rules for consistency of the nonparametric density estimator
corresponding to SIMDAT, we could use the formula of Mack and Rosenblatt
for nearest neighbor nonparametric density estimators:

m = C n^{4/(p+4)}.    (23)
Of course, people who carry out nonparametric density estimation realize
that such formulae have little practical relevance, since C is usually not avail-
able. Beyond this, we ought to remember that our goal is not to obtain a
nonparametric density estimator, but rather to generate a data set which ap-
pears like that of the data set before us. Let us suppose we err on the side of
making m far too small, namely, m = 1. That would yield simply the boot-
strap. Suppose we err on the side of making m far too large, namely, m = n.
That would yield an estimator which roughly sampled from a multivariate nor-
mal distribution with the mean vector and covariance matrix computed from
the data.
Below, in Figure 1, we show a sample of size 85 from a mixture of three
normal distributions with the weights indicated, and a pseudo data set of size
85 generated by SIMDAT with m = 5. We note that the emulation of the data
is reasonably good. In Figure 2, we go through the same exercise, but with
m = 15. The effects of a modest oversmoothing are noted. In general, if the
data set is very large, say of size 1,000 or greater, good results are generally
obtained with m ≈ .02n. For smaller values of n, m values in the .05n range
appear to work well. A version of SIMDAT in the S language, written by
E. Neely Atkinson, is available under the name "gendat" from the S Library
available through email from Bell Labs. The new edition of the IMSL Library
will contain a SIMDAT subroutine entitled "RNDAT" (beta versions may be
obtained free by writing IMSL).
In this paper, we have observed the usefulness of noting what we really seek
rather than using graphically displayed nonparametric density estimators as
our end product. The cases where we really must worry about obtaining the
graphical representation of a density are, happily, rare. But, as with the case
of SIMDAT, the context of nonparametric density estimation is very useful in
many problems where the explicit graphical representation of a nonparametric
density estimator is not required.
References

Efron, Bradley (1979). Bootstrap methods: another look at the jackknife,


Annals of Statistics, 7, 1-26.

Mack, Y.P. and Rosenblatt, Murray (1979). Multivariate k-nearest neighbor


density estimates, Journal of Multivariate Analysis, 9, 1-15.

Tapia, Richard A. and Thompson, James R. (1978). Nonparametric


Probability Density Estimation, Baltimore: Johns Hopkins.

Taylor, Malcolm S. and Thompson, James R. (1986). A data based algo-


rithm for the generation of random vectors, Computational Statistics and
Data Analysis, 4, 93-101.
NONPARAMETRIC RANK
ESTIMATION USING BOOTSTRAP
RESAMPLING AND CANONICAL
CORRELATION ANALYSIS
Xin M. Tu, Harvard School of Public Health;
D.S. Burdick and B.C. Mitchell, Duke University *

Abstract
Canonical correlation analysis has proven to be remarkably successful as an
alternative to the eigenvalue approach in rank estimation, a problem that has
challenged analytical chemists for more than a decade. A methodological
advance of this new approach is that it focuses on the difference in structure
rather than in magnitude in characterizing the difference between the signal
and the noise. This structural difference is quantified through the analysis of
canonical correlation, which is a well established data reduction technique in
multivariate statistics. Unfortunately, there is a price to be paid for having this
structural difference: at least two replicate data matrices are needed to carry
out the analysis.

In this paper, we propose a bootstrap resampling method to extend the
canonical correlation analysis to a single data matrix. Such a procedure not
only removes the requirement of replicate data matrices but also leads to a robust
estimator for the rank. With the percentile method, statistical inference about
the rank can proceed without any distributional assumption about the noise.
This "distribution-free" feature is especially desirable and useful in practice
since it frees us from hinging results on some crucial assumptions about the
random noise which are often difficult to justify and may even be erroneous.
The procedure is illustrated with real as well as simulated mixture samples.
*X.M. Tu, Dept. of Biostat., Harvard School of Public Health, 677 Huntington Ave.,
Boston, MA 02115; D.S. Burdick and B.C. Mitchell, Institute of Stat. and Decision Sciences,
Duke University, Durham, NC 27706

Exploring the Limits of Bootstrap. Edited by Raoul LePage and Lynne Billard.
Copyright©1992 by John Wiley & Sons, Inc. ISBN: 0471-53631-8.

KEY WORDS: Rank estimation; Bootstrap resampling; Canonical Correla-
tion; Excitation-emission matrix; Singular value decomposition.

INTRODUCTION
One of the problems that has challenged chemists in analytical chemistry
for more than a decade is to determine the number of components in a
multicomponent mixture sample. Often the sample data is expressed in the
form of a matrix whose rank, in the absence of noise, is equal to the number
of components. However, the presence of noise in the data generally causes
the rank to exceed the number of components in the mixture.
Most methods which have been proposed to estimate the rank in the pres-
ence of noise all, in essence, rely on the information summarized by the eigen-
values from the singular value decomposition of the underlying matrix (Rossi
and Warner 1986; Malinowski 1990; Tway, Cline Love and H.B. Woodruff
1980; Wold 1978). Even though it is difficult to evaluate the amount of
information lost through this type of data summary (or reduction), it is
not hard to convince oneself, at least heuristically, that some information
is lost, since the summary ignores the information contained in the eigenvectors,
which may well be more informative than the eigenvalues. A different
approach which also incorporates the information in the eigenvectors was
therefore introduced recently as an alternative to the eigenvalue approach (Tu
et al. 1989). A methodological turning point of this new approach is that it fo-
cuses on the difference in structure rather than in magnitude in characterizing
the difference between the signal and the noise. This structural difference is
then quantified through the analysis of canonical correlation, a well established
data reduction methodology in multivariate statistics. Unfortunately, this new
approach can only be applied to situations where replicate data matrices are
available.
The objective of this paper is to continue to explore the potential and to
extend the scope of this new approach. In particular, we propose a bootstrap
resampling procedure to extend this new methodology to a single data matrix.
An important feature of the data matrix arising from such chemical experi-
ments is the structure imposed on the data matrix by the signal. Such a signal
structure plays a crucial role in rank estimation. As a consequence, a standard
bootstrap resampling procedure, which relies on a random sample drawn with
replacement from the elements of the data matrix, is not readily applicable
since it will destroy this signal structure which depends both on the ordered
rows and columns. We therefore discuss a variant of the bootstrap resampling
to circumvent this problem when resampling the observed data matrix.
In the following discussion, we use the Excitation-Emission Matrix (Warner
1982; Warner, Neal and Rossi 1985) as a vehicle for exemplifying the proposed
methodology. However, the method can be applied to any matrix-formatted
data where the rank of the matrix is determined by the number of components
in the mixture.

RANK ESTIMATION BY
CANONICAL CORRELATION
In this section, we briefly review the procedure RECCAMP (Rank Estimation
by Canonical Correlation Analysis of Matrix Pairs) for analyzing replicate data
matrices. A detailed discussion of the RECCAMP can be found in Tu et al.
(1989).
Let the I by J matrix S be the EEM for an R-component mixture in the
absence of noise. Then S can be expressed as:

        S = X Y^T                                                    (1)

where X = (x_1, ..., x_R) is an I by R matrix with the r-th column being the
excitation vector for the r-th component and Y = (y_1, ..., y_R) is a J by R
matrix with the r-th column being the emission vector for the r-th component.
The columns of X as well as those of Y will be assumed to be linearly inde-
pendent. Under this assumption, the rank of S will be equal to R, which is
the number of components in the mixture. In the absence of noise, the rank
can be determined by the number of non-zero singular values from the singular
value decomposition (SVD) of S.
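As a small illustration of this point, the singular values of a noise-free S = X Y^T reveal the rank directly. The sketch below (Python with NumPy) uses invented dimensions and random spectra purely as a stand-in; it is not data from this paper.

    import numpy as np

    rng = np.random.default_rng(0)
    I, J, R = 30, 30, 2                       # hypothetical EEM dimensions and rank
    X = rng.random((I, R))                    # excitation spectra as columns
    Y = rng.random((J, R))                    # emission spectra as columns
    S = X @ Y.T                               # noise-free EEM of rank R

    sing_vals = np.linalg.svd(S, compute_uv=False)
    rank = int(np.sum(sing_vals > 1e-10 * sing_vals[0]))   # count non-negligible singular values
    print(rank)                               # prints 2, i.e. R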
In practice, however, since the measured EEM also contains a noise term
N, i.e.,

        M = S + N                                                    (2)

the number of non-zero singular values will exceed the number of components
because the random noise term N generally increases the rank. The eigenvalue
approach for correcting this is to watch for a “large drop” between the singular
values, which seems very reasonable since one would expect a relatively small
singular value contributed from the low-magnitude noise. However, if these
singular values corresponding to the noise are viewed as a summary statistic
for the noise (which makes perfect sense from a statistical point of view, since
this is what is used to make inference about the rank under the eigenvalue
approach), it is not difficult to see that the random nature of the noise may
not be well summarized by this statistic, since it clearly ignores any
information contained in the eigenvectors. An alternative to this eigenvalue
approach was therefore introduced, hoping to provide a better statistic to sum-
marize the information about the random noise. A methodological advance in

this new approach is that it focuses on the random nature of the noise, which
is manifested in an Euclidean subspace, and utilizes the analysis of canonical
correlation, a well established data reduction method in multivariate statis-
tics, to provide a statistic which fully reflects this random nature. Below, we
briefly describe this approach.
Let M_1 and M_2 be two replicate EEM's. Then these matrices can be
expressed as

        M_1 = S + N_1
        M_2 = S + N_2.

In the absence of noise, M_1 and M_2 will have a common column space which
we denote by Col(S) and a common row space which we denote by Row(S),
both of dimension R. In the presence of noise, let

        M_1 = U_1 D_1 V_1^T
        M_2 = U_2 D_2 V_2^T

be the SVD's of M_1 and M_2, respectively. If the signal structure is retained
at a given noise level, then since the entries of the noise terms N_1 and N_2
are independently distributed, the subspaces Col(U_1^{(R+1)}) and Col(U_2^{(R+1)}), as
well as Col(V_1^{(R+1)}) and Col(V_2^{(R+1)}), will be highly correlated only in an R-
dimensional subspace, where the matrix U_i^{(R+1)} (or V_i^{(R+1)}) is formed by the
(R+1) leading vectors of U_i (or V_i). Since canonical correlations provide a
summary measure of closeness between two subspaces, we therefore expect R
high correlations and one low correlation between Col(U_1^{(R+1)}) and Col(U_2^{(R+1)}),
and also between Col(V_1^{(R+1)}) and Col(V_2^{(R+1)}). This pattern of the correlation
structure provides a basis for estimating the rank or the number of components
in the mixture.
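A brief sketch of this computation may help fix ideas. Since the columns of U_1^{(R+1)} and U_2^{(R+1)} are orthonormal, the canonical correlations between the two column spaces are the singular values of U_1^{(R+1)T} U_2^{(R+1)}. The Python/NumPy sketch below illustrates this with a made-up rank-2 signal and noise level; it is not the authors' RECCAMP implementation, and the names are ours.

    import numpy as np

    def subspace_canonical_corr(M1, M2, k):
        """Canonical correlations between the leading k left singular subspaces."""
        U1 = np.linalg.svd(M1)[0][:, :k]      # orthonormal basis of the leading subspace of M1
        U2 = np.linalg.svd(M2)[0][:, :k]
        return np.linalg.svd(U1.T @ U2, compute_uv=False)

    # hypothetical replicate EEM's: same rank-2 signal, independent noise
    rng = np.random.default_rng(1)
    S = rng.random((30, 2)) @ rng.random((2, 30))
    M1 = S + 0.01 * rng.standard_normal(S.shape)
    M2 = S + 0.01 * rng.standard_normal(S.shape)
    print(subspace_canonical_corr(M1, M2, k=3))

With a true rank of 2, one would expect the first two canonical correlations to be close to 1 and the third to be noticeably smaller, which is the pattern used to estimate the rank.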
Unfortunately, there is a price to be paid for having this correlation struc-
ture in the data. The trade-off is the requirement for replicate EEM's, which
provide realizations of the random noise that are at least uncorrelated. This
might not be a stringent assumption in practice. However, it is at least a
methodological deficiency compared to the eigenvalue approach. We there-
fore discuss a bootstrap resampling method in the next section to remove this
restriction.

THE BOOTSTRAP RESAMPLING


With a single EEM, if we can somehow change the configuration of the noise
in the observed data, we may still be able to obtain the structural difference as
with replicate EEM’s. This is possible since an observed EEM contains a total
of I x J realizations of the random noise, where I is the number of rows and J is
the number of columns of the EEM. However, a standard bootstrap resampling
procedure, which relies on a random sample drawn with replacement from
the elements of the data matrix is not readily applicable here since it will
clearly destroy the signal structure which depends on both the ordered rows
(excitation wavelength) and the ordered columns (emission wavelength). We,
therefore, describe a variant of the bootstrap resampling procedure below to
avoid this problem, but still to allow us to change the configuration of the
noise.
Let m_j denote the j-th column of the EEM M for 1 ≤ j ≤ J. It follows
from (1) and (2) that

        m_j = Σ_{r=1}^{R} x_r y_{jr} + n_j                           (3)

where x_r or y_r denotes the r-th column of X or Y, y_{jr} denotes the j-th element
of the vector y_r, and n_j denotes the j-th column of N. In the absence of noise,
if A = [m_{j_1}, ..., m_{j_J}] is a matrix formed by any selected J columns of M, the
subspace spanned by the columns of A and that spanned by the columns of
M satisfy

        Col(A) = Col(M)                                              (4)

provided that the matrix A = [m_{j_1}, ..., m_{j_J}] is of full rank or, equivalently,
none of the individual components is completely missing in the selected
columns. Equations (3) and (4) provide a basis for our bootstrap resampling
procedure. Suppose that we draw with replacement 2J vectors from the set
of columns of M with each column in M being selected equally likely. Now
arbitrarily group the selected columns into two sets of vectors each of size J
and let A and B denote the matrices formed by these vectors from each set.
It follows from (4) that in the absence of noise

        Col(A) = Col(B)                                              (5)

provided that both A and B are of full rank or, equivalently, none of the
components is completely missing in the columns of A and those of B. In the
presence of noise, let

        A^{(R+1)} = U_A^{(R+1)} D_A^{(R+1)} V_A^{(R+1)T}             (6)

        B^{(R+1)} = U_B^{(R+1)} D_B^{(R+1)} V_B^{(R+1)T}

be the rank-(R+1) fits of A and B from the SVD's of A and B, respectively.
It follows from (5) that Col(U_A^{(R+1)}) and Col(U_B^{(R+1)}) are highly correlated
in an R-dimensional subspace, whereas their orthogonal complements are not
because of the random rearrangement of the columns n_j resulting from the
resampling process. This gives rise to the same structural difference between
Col(U_A^{(R+1)}) and Col(U_B^{(R+1)}) as between Col(U_1^{(R+1)}) and Col(U_2^{(R+1)}) with
replicate EEM's. We would therefore expect a large drop between the R-th and
(R+1)-th canonical correlation coefficients among the ordered coefficients cal-
culated from Col(U_A^{(R+1)}) and Col(U_B^{(R+1)}). However, due to the randomness
in selecting the columns for the two matrices, it is still possible to observe a
relatively large (R+1)-th coefficient, which may make it difficult to observe this
drop. It is therefore necessary to find the distribution of each of the coeffi-
cients if we wish to base our inference about the rank on these coefficients.
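A self-contained sketch of this column-resampling step is given below (Python with NumPy; the function names, the test matrix, and the candidate rank are invented for illustration and are not from the paper). It draws 2J columns of M with replacement, splits them into A and B, and records the (R+1)-th canonical correlation between the leading (R+1)-dimensional left singular subspaces.

    import numpy as np

    def leading_left_subspace(M, k):
        """Orthonormal basis for the span of the k leading left singular vectors."""
        return np.linalg.svd(M)[0][:, :k]

    def resample_coefficient(M, R, rng):
        """One bootstrap replicate of the (R+1)-th canonical correlation coefficient."""
        J = M.shape[1]
        cols = rng.integers(J, size=2 * J)                 # draw 2J columns with replacement
        A, B = M[:, cols[:J]], M[:, cols[J:]]              # arbitrary split into two J-column matrices
        UA = leading_left_subspace(A, R + 1)
        UB = leading_left_subspace(B, R + 1)
        corr = np.linalg.svd(UA.T @ UB, compute_uv=False)  # ordered canonical correlations
        return corr[R]                                     # the (R+1)-th coefficient

    # hypothetical rank-2 EEM with additive noise, only to make the sketch runnable
    rng = np.random.default_rng(0)
    M = rng.random((30, 2)) @ rng.random((2, 30)) + 0.01 * rng.standard_normal((30, 30))
    coeffs = np.array([resample_coefficient(M, R=2, rng=rng) for _ in range(500)])

The 500 values in coeffs form the bootstrap sample of the (R+1)-th coefficient used in the inference described next.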
Here, it would be difficult to derive the distribution functions in closed
form for these correlation coefficients. On the other hand, such an attempt
would require some tenuous assumptions about the noise distribution which
may be difficult to verify and may result in limited applications. We therefore
propose a criterion based on the empirical distribution of a bootstrap sample.
To obtain a bootstrap sample for the (R+1)-th canonical correlation coef-
ficient, we repeat the resampling process, say, k times and observe a sequence
of correlation coefficients c_i^{(R+1)} for 1 ≤ i ≤ k. Because of the random resam-
pling, we would expect each of these c_i^{(R+1)} to appear equally likely between
0 and 1. We may, therefore, view this bootstrap sample as arising from a
distribution similar to the uniform distribution between 0 and 1 with mean
or median less than or equal to 0.5. As for statistical inference, we use the
sample median, which is not only more robust but also more appropriate in
our context since inference about the rank using this estimator can proceed
without resort to any distributional assumption or Monte Carlo method.
The p-th percentiles can be calculated using the order statistics

        c_{(1)}^{(R+1)} ≤ c_{(2)}^{(R+1)} ≤ ... ≤ c_{(k)}^{(R+1)}.

In particular, the sample median is given by m = c_{(t)}^{(R+1)}, where t is the smallest
integer such that t/k ≥ 0.5.
Following the above discussion, a null hypothesis at a given α-level can be
constructed as

        H_0 : m ≤ 0.5                                                (7)

Such a test in general requires some distributional assumption or the calcula-
tion of standard errors if the normal distribution is used. However, using the
sample median, we can carry out such a test without making such parametric
assumptions which are often difficult to justify and may even be erroneous in
practice.
We note that the null hypothesis in (7) is equivalent to testing whether
the lower limit of the 1 − 2α confidence interval lies to the left of 0.5. With
the bootstrap sample, we can calculate the confidence interval for the median
using the percentile method (Efron 1982) without any assumption for the
underlying distribution and without recourse to Monte Carlo simulation. Using
the percentile method, the 1 − 2α confidence interval is expressed in the form

        ( c_{(t_1)}^{(R+1)}, c_{(t_2)}^{(R+1)} ).

Here the integers t_1 and t_2 are given by the smallest integers which satisfy

        F(t_1) > α

and

        F(t_2) > 1 − α,

where F(t) denotes the cumulative distribution function of a binomial distribution
B(t | n, p) with the sample size parameter and the probability-of-success parameter
fixed at n = k and p = 0.5, respectively. We therefore reject the
null hypothesis at the α-level if the interval lies to the right of 0.5.
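A minimal sketch of this inference step is shown below (Python, using SciPy's binomial distribution for F; the helper name and the placeholder bootstrap sample are ours). It returns the sample median and the percentile-method interval based on B(t | k, 0.5), and applies the decision rule for H_0.

    import numpy as np
    from scipy.stats import binom

    def median_and_percentile_ci(coeffs, alpha=0.05):
        """Sample median and 1 - 2*alpha percentile-method CI based on Binomial(k, 0.5)."""
        c = np.sort(np.asarray(coeffs))
        k = len(c)
        t = int(np.ceil(0.5 * k))                       # smallest t with t/k >= 0.5
        F = binom.cdf(np.arange(1, k + 1), k, 0.5)      # F(t) for t = 1, ..., k
        t1 = int(np.argmax(F > alpha)) + 1              # smallest t with F(t) > alpha
        t2 = int(np.argmax(F > 1 - alpha)) + 1          # smallest t with F(t) > 1 - alpha
        return c[t - 1], (c[t1 - 1], c[t2 - 1])

    coeffs = np.random.default_rng(0).uniform(0, 1, size=500)   # placeholder bootstrap sample
    median, (low, high) = median_and_percentile_ci(coeffs, alpha=0.05)
    reject = low > 0.5   # reject H_0: m <= 0.5 when the whole interval lies to the right of 0.5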
Note that in the above discussion we have implicitly assumed that the
matrices A^{(R+1)} and B^{(R+1)} constructed at each resampling step contain the
signal for all the components. Even though it is possible for A^{(R+1)} or B^{(R+1)},
or even both, to contain only some of the components, it is unlikely to occur in
a random resampling process unless some components are only represented by
a few columns of the EEM, or in other words, they have very narrow spectral
range of excitation or emission. Moreover, since we use a large sample robust
estimator, the sample median, in making statistical inference about the rank,
a few such instances in the resampling process will not cause much difference
in estimation.
Also, we have focused on the subspaces Col(U_A^{(R+1)}) and Col(U_B^{(R+1)}), which
correspond to the process of resampling columns in the above discussion. Sim-
ilar statements apply to the subspaces Col(V_A^{(R+1)}) and Col(V_B^{(R+1)}) as well
when resampling the rows. In practice, it may be helpful to use both of the re-
sampling processes, especially in dealing with samples which have a relatively
low signal-to-noise ratio. For example, we may wish to reject the hypothesis
that the rank is R—1 if the lower limit of the confidence interval for the median
of the R-th canonical correlation coefficient from the process of resampling the
columns barely falls below 0.5 while that from the process of resampling the
rows stays above 0.5. Situations like that, on the other hand, may well depend
on the data at hand. However, the more information we can extract from the
data the more we will know about the sample, which could ultimately help
reach a compromise if it is too difficult to ascertain the rank in such cases.

RESULTS AND DISCUSSION


The proposed bootstrap resampling procedure was applied to some real as
well as simulated mixture samples. The real sample consists of two fluorescent
emitters, 9,10-diphenylanthracene (DPA) and anthracene (ANT), and the
EEM is a 30 by 30 matrix. Plotted in Figure 1 is the EEM for this mixture
at 5 MHz generated with the phase-modulation technique. The simulated
mixture is a four component sample whose signal was extracted from some
real sample data and has been analyzed by the RECCAMP (Tu et al. 1989).
The signal strength of the simulated mixture sample is controlled by varying
a pseudo-intensity parameter c according to
        M = c S + N                                                  (8)
where N is generated by a normal deviate with fixed mean 0 and variance 0.01.
With two replicate EEM’s, the RECCAMP is able to determine the rank for
c as low as 4 at which the noise level is very close to that of the signal (Tu et
al. 1989). It is therefore interesting to see whether we would still be able to
correctly estimate the rank at the same low signal level with only one EEM.
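For concreteness, a simulated EEM of this form can be generated as in the sketch below (Python/NumPy). The stand-in four-component signal is invented, since the extracted signal of Tu et al. (1989) is not reproduced here; only the structure M = cS + N with mean-zero, variance-0.01 normal noise follows equation (8).

    import numpy as np

    rng = np.random.default_rng(3)
    # stand-in four-component signal: S = X Y^T with four excitation/emission pairs
    X = rng.random((50, 4))
    Y = rng.random((50, 4))
    S = X @ Y.T

    def simulated_eem(S, c, rng):
        """M = c*S + N with i.i.d. normal noise of mean 0 and variance 0.01, as in (8)."""
        N = rng.normal(loc=0.0, scale=0.1, size=S.shape)   # standard deviation 0.1 = sqrt(0.01)
        return c * S + N

    M4 = simulated_eem(S, c=4, rng=rng)   # pseudo-intensity c = 4
    M3 = simulated_eem(S, c=3, rng=rng)   # pseudo-intensity c = 3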
A consideration in implementing a bootstrap resampling procedure in gen-
eral is to choose the right sample size so that the minimum amount of computation
is performed without much loss of efficiency. As investigated by Efron
(1982, 1983), a bootstrap sample of 500 or 1,000 is usually used as a guide.
We have chosen a sample size of 500 for both mixtures after taking into con-
sideration the amount of computation that is needed to compute the canonical
correlation coefficients.
For each sample, canonical correlation coefficients were calculated for each
pair of the rank K fits of the SVD's of the matrices corresponding to the
column resampling and row resampling processes at each step of the resampling
procedure for K = 1,2, etc. So, for each value of K, we have a sample of 500
coefficients each for the two resampling processes, which we shall refer to as
the bootstrap sample of canonical correlation coefficients. Shown in Tables
I and II are the percentiles of the distributions of these bootstrap samples
for each mixture sample for K = 1,2, etc. Also listed in these tables are
the confidence intervals (C. I.) for the respective medians, calculated by the
percentile method.
For the real mixture, the drop in the percentiles between the second and
third coefficients is obvious in all the resampling cases. This same idea is also
graphically portrayed by the plots of the distributions from the column resam-
pling process shown in Figure 2 (those from the row resampling process are
similar and are therefore not shown). Following the discussion in the previous
section, we can use confidence intervals to quantify our empirical assessment
in the form of a statistical test at a given nominal level. In this example, the
90% and 95% confidence intervals shown in Table I clearly illustrate this point.

                    Percentiles                    Confidence Interval for the Median

          25%       Median    75%        90% C.I.                 95% C.I.

Column Resampling

K=1      0.9986    0.9993    0.9996     (0.99921, 0.99932)       (0.99919, 0.999633)
K=2      0.933     0.961     0.976      (0.957, 0.964)           (0.956, 0.965)
K=3      0.207     0.418     0.649      (0.386, 0.458)           (0.379, 0.466)
K=4      0.151     0.279     0.455      (0.264, 0.301)           (0.259, 0.302)
K=5                0.239     0.402      (0.223, 0.255)           (0.217, 0.263)
K=6      0.098     0.208     0.355      (0.188, 0.229)           (0.186, 0.235)

Row Resampling

K=1      0.9982    0.9990    0.9994     (0.99891, 0.99907)       (0.99888, 0.99909)
K=2                0.949     0.969      (0.945, 0.949)           (0.944, 0.950)
K=3      0.248     0.476     0.661      (0.443, 0.508)           (0.439, 0.517)
K=4      0.165     0.327     0.512      (0.298, 0.349)           (0.295, 0.359)
K=5      0.119     0.252     0.429      (0.229, 0.279)           (0.226, 0.284)
K=6      0.101     0.212     0.331      (0.197, 0.225)           (0.189, 0.228)

Table I: Percentiles and confidence intervals (C. I.) from the distributions of the
bootstrap samples of canonical correlation coefficients for the mixture ANT and
DPA.

                    Percentiles                    Confidence Interval for the Median

          25%       Median    75%        90% C.I.                 95% C.I.

c = 4

K=1      0.9984    0.99904   0.9994     (0.99894, 0.99913)       (0.99892, 0.99915)
K=2      0.9853    0.9888    0.9921     (0.9885, 0.9894)         (0.9884, 0.9895)
K=3      0.8602    0.9048    0.9368     (0.9009, 0.9109)         (0.9008, 0.9122)
K=4      0.7069    0.7936    0.8459     (0.7823, 0.8015)         (0.7796, 0.8022)
K=5      0.1331    0.2881    0.4616     (0.2574, 0.3165)         (0.2541, 0.3195)
K=6      0.0806    0.1625    0.2967     (0.1483, 0.1752)         (0.1452, 0.1775)
K=7      0.0652    0.1367    0.2313     (0.1238, 0.1538)         (0.1214, 0.1577)

c = 3

K=1      0.9981    0.9989    0.9993     (0.9988, 0.9989)         (0.9987, 0.9990)
K=2      0.9776    0.9832    0.9870     (0.9824, 0.9836)         (0.9823, 0.9838)
K=3      0.7168    0.8112    0.8704     (0.8056, 0.8197)         (0.8030, 0.8222)
K=4      0.3257    0.5221    0.6609     (0.4993, 0.5445)         (0.4956, 0.5486)
K=5      0.1157    0.2597    0.4225     (0.2426, 0.2821)         (0.2374, 0.2863)
K=6      0.0765    0.1641    0.2934     (0.1486, 0.1778)         (0.1465, 0.1814)
K=7      0.0503    0.1217    0.2106     (0.1077, 0.1353)         (0.1065, 0.1387)

Table II: Percentiles and confidence intervals (C. I.) from the distributions of
the bootstrap samples of canonical correlation coefficients corresponding to the
process of resampling the columns for the simulated mixture.
Figure 1: The EEM for the mixture of ANT and DPA.

Figure 2: The empirical distributions of the bootstrap samples of canonical correlation
coefficients from the column resampling process for the mixture of ANT and DPA: (a) at
K = 2; (b) at K = 3.

As for the simulated mixture sample, our major reason to study this sample
is to see whether reduction in sample data would have any significant impact
on the canonical correlation coefficients. In statistical inference, reduction in
data reduces efficiency and may lead to bias in estimation. In our context,
both could have influence on the canonical correlation coefficients on which
our inference about the rank is based. It may be difficult, if not impossible, to
theoretically characterize this impact. However, with this simulated mixture
sample, we can at least make some empirical observation about this effect.
The percentiles and confidence intervals for the two critical intensity levels
are shown in Table II. Note that reported in Table II are the results from
the process of resampling the columns, since the results from resampling the
rows are similar and therefore are not reported. At c = 4, the drop in the
percentiles occurs between the fourth and fifth coefficients. The lower limits
of both confidence intervals for the median of the fourth canonical correlation
coefficient are way above 0.5. On the other hand, even the upper limits of the
confidence intervals for the median of the fifth canonical correlation coefficient
fall below 0.5. Hence, had we not known the composition of this mixture,
we would have correctly estimated the rank at this low signal level without
difficulty. The percentiles and confidence intervals for c = 3, however, show
no sign of the four-component nature of the mixture sample. As a matter
of fact, the rank would be estimated to be three if based on the confidence
interval, which coincides with the estimate from the analysis of the RECCAMP
using two replicate EEM’s in this case (Tu et al. 1989). So at least with this
simulated data, it seems that there is no loss of efficiency in inference, which
may not be surprising since the information of the noise distribution (normal
distribution in this case) may very well be summarized by the 2500 elements
in one EEM. In this example, it is quite likely that one of the components is
overwhelmed by the noise when the signal level drops down to c = 3.
Plotted in Figure 3 are the distributions of some of the bootstrap samples.
The drop between the fourth and fifth canonical correlation coefficients for
c = 4 can also be seen from the striking difference in skewness between the
distributions shown in (a) and (b). However, for c = 3, it is difficult to tell from
the shape of the histogram plotted in (c) whether the coefficients correspond
to the signal even though the C.I. in this case does cover the point 0.5 as
discussed earlier.
In summary, we have proposed a bootstrap resampling procedure to extend
the canonical correlation technique to the analysis of a single data matrix.
Also, since a robust estimator is introduced for inference, the procedure may
be applied to a wide range of data without any restriction on the noise distri-
bution. On the other hand, the procedure may be computationally expensive
compared to the eigenvalue approach, especially in case of low signal-to-noise
ratio. Our experience shows that if the signal-to-noise ratio is high as in the
Figure 3: The empirical distributions of the bootstrap samples of canonical correlation
coefficients from the column resampling process for the simulated mixture: (a) at c = 4,
K = 4; (b) at c = 4, K = 5; (c) at c = 3, K = 4; (d) at c = 3, K = 5.

real mixture sample in our example, a sample size as small as 50 will suffice to
make the correct inference. As a guide for general practice, we may start the
procedure with a small bootstrap sample and then increase the sample size
if necessary to improve efficiency for the bootstrap estimate of the confidence
interval. However, as discussed earlier, there is no point in increasing the sample
size beyond 1,000.
The procedure may also be used in conjunction with some other procedures,
such as those from the eigenvalue approach, not only to reduce the amount of
computation but also to make valid statistical inference. For example, an initial
analysis using the eigenvalue approach may help narrow the search for the rank
to a small range of possible values. The bootstrap resampling may then be
employed to quantify our uncertainty in terms of a valid statistical test.

References
[1] B. Efron, The Jackknife, the Bootstrap, and Other Resampling Plans,
    SIAM, monograph #38, CBMS-NSF (1982).

[2] B. Efron and G. Gong, American Statistician 37, 36 (1983).

[3] E.R. Malinowski, J. Chemometrics 4, 102 (1990).

[4] T.M. Rossi and I.M. Warner, Anal. Chem. 54, 810 (1986).

[5] X.M. Tu, D.S. Burdick, D.W. Millican and L.B. McGown, Anal. Chem.
    61, 2219 (1989).

[6] P.C. Tway, L.J. Cline Love and H.B. Woodruff, Anal. Chim. Acta 117,
    45 (1980).

[7] I.M. Warner, in Contemporary Topics in Analytical and Clinical Chem-
    istry, Vol. 4, D.M. Hercules, G.M. Hieftje, L.R. Snyder and M.A. Evenson,
    Eds., Plenum Press, New York (1982).

[8] I.M. Warner, S.L. Neal and T.M. Rossi, J. of Research of the National
    Bureau of Standards, 90, 487 (1985).

[9] S. Wold, Technometrics 20, 397 (1978).


Index

A Bayesian models, 320


bootstrap, 322-323
ABC method, 123-124 conditional empirical, 324-325
Alexander’s exponential inequality, 22 Berry-Esseen bounds, 149-150, 153
Algorithms: Bias, bootstrap estimate, 121-122
for bivariate convolution, 314-318 Bivariate convolution, 313-318
SIMDAT, 400-402 Bonferroni bounds, 274
Antithetic resampling, 135-137 Bonferroni inequality, 238
Apparent error, 110 Bootstrap methods, 5-6. See also
as estimator of MSEP, 365 Functions, classes of
Approximate bootstrap confidence applications of, overview, 7, 9-10
intervals method, 123-124 approximate bootstrap confidence
Asymptotic stability criterion, 352 intervals method (ABC), 123-124
Autoloss function, 365 Bayesian, 322-323
conditional empirical, 324-325
in bias estimation, 121-122
B bootstrap density estimator, 251-252
bootstrap distribution, 158
Balanced resampling, 133-135 convergence in distribution, 158-159
Banach space, 16 convergence in probability, 159-160
Bandwidth selection, 249-250 bootstrap sample, 158
by bootstrapping, 251-256 calibration, 216-217
comparison of selection methods, computational efficiency, 120-124
258-261 conditional, 323-324
by cross-validation, 256-257 empirical Bayesian, 324-325
Barndorff-Nielsen intervals, 115 confidence intervals, see Confidence
Bartlett corrections, 115 intervals

Bootstrap methods (Continued) nonstationary data, 195-197
consistency, asymptotic, 7 second order correctness, 185, 187,
convergence, 7, 329 197
and differentiable functionals, 36-44 naive, 8
as Dirac-comb density estimators, in nonparametric rank estimation,
397-402 408-418
problems with, 399 for non-tested hypotheses, 390-393, 395
double-dip-, 9 “operational” bootstrapping, 309-318
Dudley’s limit theorem, 36-37 bivariate convolution, 313-318
for Edgeworth correction, 8-9. See pivoting, 224
also Edgeworth expansions for prediction band construction,
and empirical processes, 17-22 272-274
central limit theorem, 17-18 quantile estimation, 141-142
and estimators, see Estimators U-quantile estimation, 146-154
and M-estimators, 25-36. See also versus random weighting
M-Estimators approximation, 281
central limit theorem, 30-32, 35-36 for ratio statistic distribution, 276-277
pr-bootstrap, 42-44 resampling, 106-107, 251
in exploratory regression (model ~ antithetic, 135-137
building): balanced, 133-135
compared to conventional methods, circular block, 266-269
371-386 i.i.d., 77, 81, 183-184
MSEP estimators, 367-371 importance, 137-141
pseudomodel choice, 386 improved approaches, 399-402
frequentist, 321-322 nonrandom, 309-318
generalized, 319-321 uniform, 128-129, 130
inconsistency of, 227 vector, 106, 121
versus jackknife and delta methods, and sample mean, see Sample mean,
103-108 bootstrapping
kernel density estimator bandwidth in segregation analysis, 393, 396
selection, 249-250 sign bootstrapping, 217-224
by asymptotic analysis, 254-256 conditional distributions, 218-221
comparison to other methods, SIMDAT algorithm, 400-402
258-261 smoothed, 252-253
by MISE estimation, 251-253 and spatial medians, 32-33
in likelihood ratio simulation, 390-393, and standard error estimate, 5-6
395 comparison to delta and jackknife
limit distribution, 7 methods, 103-108
and long-tailed errors, 9, 215-217 sign bootstrapping, 217-224
and Markov chains, see Markov chains for Stein-rule estimators, 325
meshing, 311-318 comparison with jackknife method,
Monte Carlo simulation in, see Monte 335-340
Carlo simulations Bootstrap samples, 106
moving block, 185-187, 228-231, Bootstrap t methods, 8-9
238-245, 264 confidence intervals, 67-68
applied to sample mean, 187-191 first order efficiency, 74
applied to studentized mean, parametric versus nonparametric,
192-195 74-75
choice of block size, 246-247 second order correctness/
Edgeworth expansions, 187-188 equivalence, 68-74

second order optimality/robustness, D
74-75
undershoot, 74 Decomposition, singular value, 407
as pivotal quantity, 118-120 Decrement-life-table functions, 358
P-Brownian bridge, 16 Delta method, 103-108
Brownian motion: Deterministic hazards, 346-348
convergence to, 162, 164-165, 170, 172, Dirac-comb density estimator, 397-399
173 Distributions, maximum likelihood
estimated, 99-103
P-Donsker function classes, 17, 18, 20
C Donsker’s convergence theorem, 16, 17
Dudley’s limit theorem, 36-37
Canonical correlation analysis, 407-408 Dudley’s representation theorem, 20-21
Centring method, 131-133 Dvoretzky-Kiefer-Wolfowitz inequality,
Compensators, of counting processes, 242, 243
349
Competing risks theory, 355-361
Conditional bootstrap models, 323-325 E
Conditional distributions, sign
bootstrapping, 218-221 Edgeworth expansion, 8-9
Condition vector, 220 in bootstrapping Markov chains, 63,
Confidence intervals, 8-9, 101-103, 216, 80-81
223 of functional statistics, 287-291
Barndorff-Nielsen, 115 from moving block bootstrap, 187-188
choice of pivotal quantities, 116-120 Empirical influence function, 104
correctness versus accuracy, 112-116 Empirical measure, 16
defined, 338-339 Empirical processes, 15-17
Fieller’s construction, 113-114 almost sure behavior, 22-23
non-central chi-square, 114-115 bootstrap method applied for, 17-22
percentile, 102 central limit theorem, 17-18
standard, 102-103 Errors, standard, see Standard errors
Consistency, of bootstrap method, 7 Estimators, 3-4
Convergence, of bootstrap method, 7 cross-validation, 108-112
Convolutions, of distributions, 309-318 Dirac-comb density, 397-399
Correlation analysis, canonical, 407-408 generalized bootstrap for, 319-321
Correlation coefficient, Pearson, Hodges-Lehmann, bootstrapping, 154
116-117 jackknife methods for, 329
Cortical cells, clustering of, 274-277 kernel density, bandwidth, see Kernel
Countable state Markov chains, see density estimator bandwidth
Markov chains, countable state linear, 330-331
Counting processes, 348 martingale method for determining,
compensators of, 349 351-361
Covariates, in hazard processes, 355-361 maximum likelihood, 99-103
Cramer’s condition, 189 underestimation, 100-101
Cross-validation, 331 in multiple regression, bootstrapping,
estimators, 108-112 328-329
least squares, bandwidth selection, nonparametric density, 399-402
253, 256-257 orthogonal series, 249
Crude hazard probabilities, 357 spline, 249
Crude hazard rate functions, 357 spread, bootstrapping, 154

Estimators (Continued) G
standard error of, see Standard error
Stein-rule, 329-335 Gaussian processes, 16-17
bootstrapping, 325 sample continuous, 17
jackknife versus bootstrap methods, Genetics, quantitative, bootstrapping in,
335-338 389-393, 395-396
M-Estimators: Glivenko-Cantelli theorem, 352
approximation by pseudo jackknife
method, 300-303
bootstrap of, 25-36 H
central limit theorem, 30-32, 35-36
definition, 24-25, 29-30 Hadamard differentiability, 282
pr-bootstrap of, 42-44 Hajek-Le Cam convolution theorem, 72
Excitation-emission matrix, 406, 407, Hardy-Weinberg equilibrium, 389
408-409 Hazard probabilities, 357
Exercise output prediction, 272-274 Hazard processes, 346-350
decrement-life-table functions, 358
homogeneity, chi-square test, 355
F increment-decrement-life-table
functions, 358
Fast Fourier transforms, 217 latent life, 355
Fieller’s construction, 113-114 Martingale method for, 352-361
Filtration, 347 asymptotic estimate distribution,
Finite state Markov chains, see Markov 352-353
chains, finite state censoring, competing risks,
First order properties, 66, 74 covariates, 355-361
Fischer’s linear discriminant, 109, 110 hypothesis testing, 353-355
Frechét differentiability, 36, 234, 282-283 mortality ratio, standardized, 354
second order, 283 probabilities, 357
Frequentist models, 320 rate functions, 357
bootstrap, 321-322 stochastic hazards, 348
Functionals: History, 347
differentiable, 20-22 internal, 347
bootstrap method and, 36-44 Hitting time distribution, 49, 51, 54-55,
Frechét differentiability, 36, 234, 77
282-283 Hodges-Lehmann location estimator,
variance, 100 bootstrapping, 154
Functional statistics: Hoeffding expansion, 73
distribution approximation, 279-281, Homogeneity, chi-square test, 355
283-287
Edgeworth expansions, 287-291
Functions, classes of: I
measurable, 16
P-Donsker, 17, 18, 20 Importance resampling, 137-141
a.s.-bootstrap, 18 Increment-decrement-life-table
pr-bootstrap, 18 functions, 358
P-Glivenko-Cantelli, 29 Infinitesimal jackknife method, 104
uniformly pregaussian, 19 Internal history, 347
VC(Vapnik-Cérvonenkis)-subgraph, 22 Invariance principle, in sample mean
Functions, perfect, 20 bootstrapping, 161-172

J Markov chains, countable state:
bootstrapping, 52-61
Jackknife methods, 4-5 hitting time distribution estimation,
versus bootstrap and delta methods, 54-55
103-108 t sampling distribution estimation,
delete-d, 228 52-53
delete-one, 329 transition probability matrix
inconsistency of, 227 estimation, 57-58
infinitesimal, 104, 105 Markov chains, finite state:
moving block, 227-228, 231-238 bootstrapping, 50-52, 77-79, 81-87
pseudo, 279-281, 283-287, 297-303, 329 accuracy of, 62-63
for Stein-rule estimators, 335-338 asymptotic accuracy, 79-81
Jaeckel’s infinitesimal jackknife method, Edgeworth expansion, 80-81
104, 105 hitting time distribution estimation,
49, 51, 54-55, 77
transition probability matrix
K estimation, 49, 50-51, 57-58, 77
Martingale method, for deterministic
Kernel density estimator bandwidth, functions, 351-361
249-250 Martingales, 164, 165, 166-167, 349, 350
comparison of selection methods, sequences of, central limit theorem,
258-261 351
selection by bootstrapping, 251-256 Martingale transform theorem, 351
selection by cross-validation, 256-257 Maximum likelihood estimated
Ktinsch’s moving block bootstrap distributions, 99-103
method, see Moving block bootstrap Mean, standard error of, 3-4
method Mean integrated square error (MISE),
250
bootstrap estimation, 251-253
L Mean square error of prediction (MSEP):
conditional, 365
Latent life, 355 estimators of:
Least squares cross-validation, bandwidth bootstrap, 367-371
selection, 253, 256-257 comparison of methods, 371-386
Likelihood ratios, bootstrapping, conventional, 365-367
390-393, 395 unconditional, 365
Limit distribution, of bootstrap method, 7 Medians, spatial:
Lindeberg central limit theorem, 284 bootstrapping, 32-33
Lindeberg-Levy central limit theorem, Meshing, 311-318
157 Minkowski inequality, 241
Linear approximation, 129-131 MISE, see Mean integrated square error
Linear discriminant, Fischer's, 109, 110 Mixing sequences, 226, 245-246, 264
Linear estimators, 330-331 Model building, see Regression
Linear programming, 311 analysis, exploratory
Linear regression analysis, 221, 224, 245 Monte Carlo simulations, 106, 107, 123,
127-140, 328
antithetic resampling, 135-137
M balanced resampling, 133-135
centring method, 131-133
Marked point processes, 356 importance resampling, 137-141

Monte Carlo simulations (Continued) Polynomial approximation, 131
linear approximation, 129-131 Predictable variation processes, 350
polynomial approximation, 131 Prediction bands, estimation by
uniform resampling, 128-129, 130 bootstrapping, 272-274
Mortality ratio, standardized, 354 Prediction errors, 109-112
Moving block bootstrap method, Propagation of errors formula, 103-108
185-187, 228-231, 238-245
applied to sample mean, 187-191
applied to studentized mean, 192-195 Q
choice of block size, 246-247
Edgeworth expansions, 187-188 U-Quantiles:
nonstationary data, 195-197 estimation, 141-142
second order correctness, 185, 187, 197 by bootstrapping, 146-154
Moving block jackknife method, 227-228, by pseudo jackknife method,
231-238 297-300
MSEP, see Mean square error of Quantitative genetics, bootstrapping in,
prediction 389-393, 395-396
Multinomial models, bootstrapping, Quantitative segregation analysis,
276-277 392-393
Multiple regression, bootstrapping in,
328-329
R
N Random weighting approximation,
280-281, 284
Net hazard probabilities, 357 second order accuracy, 287-297
Net hazard rate functions, 357 Rank estimation, nonparametric, 407-408
Nonparametric estimation, 397-402 bootstrap resampling, 408-418
choice of pivotal quantity, 116-120 Ratio statistic distribution, bootstrapping,
Nonparametric rank estimation, 407-408 276-277
bootstrap resampling, 408-418 ~ RECCAMP (Rank estimation), 407-408
Regression analysis, exploratory, 363-367
MSEP estimator determination:
O bootstrap approach, 367-371
comparison of methods, 371-386
Order statistics, “operational” conventional approach, 365-367
bootstrapping, 309-318 Regression analysis, linear, 221, 224,
Orthogonal series estimators, 249 245
Ottaviani’s inequality, 23 Regression analysis, multiple, 328-329
Resampling, 106-107, 251
antithetic, 135-137
P balanced, 133-135
circular block, 266-269
Paasche index, 354 i.i.d., 77, 81
Pearson correlation coefficient, 116-117 problems with, 183-184
Percentile confidence intervals, 102 importance, 137-141
Pivotal quantities, choice of, 116-120 improved approaches, 399-402
Pivot method, see Bootstrap t methods nonrandom, 309-318
Point processes, 348 uniform, 128-129, 130
Poisson processes, 221-222 vector, 106, 121

424
Risk function, minimization, 330-331, distribution approximation, 279-281,
333-335 283-287
Edgeworth expansions, 287-291
S .
Statistics, order, “operational”
bootstrapping, 309-318
Stein-rule estimators, 329-335
Sample mean, bootstrapping: bootstrapping, 325
convergence to Brownian motion, 162, jackknife versus bootstrap methods,
164-165, 170, 172, 173 335-340
invariance principle, 161-172 Stieltjes integral, 347
limiting distribution: Stochastic differential, 348
finite variance, 173, 174-178 Stochastic hazards, 348
infinite variance, 172, 173-178 Stochastic integral representation,
“operational” bootstrapping, 309-318 sample mean bootstrapping,
partial sum process, 161, 163 160-161
stochastic integral representation, Studentization, 5
160-161 Student's r, 5
Second order properties, 66, 68-77 bootstrap in estimating, 8-9
of “moving block” bootstrap, 185, 195, Survival analysis, 345-346
197 hazard processes for, see Hazard
Segregation analysis, 392-393 processes
Shrinkage estimators, see Stein-rule
estimators
Signs, of errors, bootstrapping, 217-224 ii
SIMDAT algorithm, 400-402
Singular value decomposition, 407 Taylor series method, 103-108
Skorohod topology, 161-162, 163, 170 Time dependence, models of, 226,
Spatial medians, bootstrapping of, 245-246
32-33 Training sets, 108-109
Spline estimators, 249 Transition probability matrix, 49, 50-51,
Spread estimators, bootstrapping, 154 57-58, 77
Square root trick inequality, 26 True error, 110
Standard confidence intervals, Tukey’s jackknife methods, see Jackknife
102-103 methods
Standard errors, 3-4
bootstrap estimate, 5-6
by signs, 217-224 U
comparison of delta/jackknife/
bootstrap methods, 103-108 Undershoot, of t confidence intervals,
jackknife estimate, 4-5 74
Standardized mortality ratio, 354 Uniform resampling, 128-129, 130
Standard normal distribution, 173
Statistical differentials, method of,
103-108 Vv
k-Statistics, 101
t-Statistics: Vapnik-Cérvonenkis property, 22
bootstrapping, see Bootstrap t methods Variance, 3
as pivotal quantities, 117-120 Variance functionals, 100
U-Statistics, see U-Quantiles von Mises expansion, 73
Statistics, functional: von Mises theorem, 311, 315-316

W . moving block bootstrap for, 228-231,
238-245, 264
Weak dependence: moving block jackknife for, 227-228,
circular block-resampling bootstrap 231-238
for, 266-269 Wiener processes, 351

Applied Probability and Statistics (Continued)
PANKRATZ · Forecasting with Univariate Box-Jenkins Models: Concepts and Cases
RACHEV · Probability Metrics and the Stability of Stochastic Models
RENYI · A Diary on Information Theory
RIPLEY · Spatial Statistics
RIPLEY · Stochastic Simulation
ROSS · Introduction to Probability and Statistics for Engineers and Scientists
ROUSSEEUW and LEROY · Robust Regression and Outlier Detection
RUBIN · Multiple Imputation for Nonresponse in Surveys
RUBINSTEIN · Monte Carlo Optimization, Simulations, and Sensitivity of Queueing
    Networks
RYAN · Statistical Methods for Quality Improvement
SCHUSS · Theory and Applications of Stochastic Differential Equations
SEARLE · Linear Models
SEARLE · Linear Models for Unbalanced Data
SEARLE · Matrix Algebra Useful for Statistics
SEARLE, CASELLA, and McCULLOCH · Variance Components
SKINNER, HOLT, and SMITH · Analysis of Complex Surveys
STOYAN · Comparison Methods for Queues and Other Stochastic Models
STOYAN, KENDALL, and MECKE · Stochastic Geometry and Its Applications
THOMPSON · Empirical Model Building
TIERNEY · LISP-STAT: An Object-Oriented Environment for Statistical Computing
    and Dynamic Graphics
TIJMS · Stochastic Modeling and Analysis: A Computational Approach
TITTERINGTON, SMITH, and MAKOV · Statistical Analysis of Finite Mixture
    Distributions
UPTON and FINGLETON · Spatial Data Analysis by Example, Volume I: Point
    Pattern and Quantitative Data
UPTON and FINGLETON · Spatial Data Analysis by Example, Volume II:
    Categorical and Directional Data
VAN RIJCKEVORSEL and DE LEEUW · Component and
    Correspondence Analysis
WEISBERG · Applied Linear Regression, Second Edition
WHITTLE · Optimization Over Time: Dynamic Programming and Stochastic Control,
    Volume I and Volume II
WHITTLE · Systems in Stochastic Equilibrium
WONNACOTT and WONNACOTT · Econometrics, Second Edition
WONNACOTT and WONNACOTT · Introductory Statistics, Fifth Edition
WONNACOTT and WONNACOTT · Introductory Statistics for Business and
    Economics, Fourth Edition
WOOLSON · Statistical Methods for the Analysis of Biomedical Data

Tracts on Probability and Statistics

BILLINGSLEY · Convergence of Probability Measures
TOUTENBURG · Prior Information in Linear Models
About the editors
RAOUL LEPAGE is Professor of
Statistics and Probability at Michi-
gan State University and Chairman
of the Board and Chief Executive Of-
ficer of the Interface Foundation of
North America. Dr. LePage earned a
BS in mathematics and an MS in sta-
tistics and probability at Michigan
State University, and a PhD in statis-
tics from the University of Minne-
sota. He has served as a consultant
and given courses on simulation
methods, bootstrap, and time series
analysis for government agencies
and private industry. He is active in
the study of stable random pro-
cesses where his series representa-
tion and Gaussian-slicing methods
are widely referenced. His current
work is on conditional resampling
methods designed to cope with
long-tailed errors in observations,
and also problems having applica-
tions in finance and control which
involve optimal methods for driving
a random process to achieve its
most rapid growth.
LYNNE BILLARD is Professor of
Statistics at the University of Geor-
gia. She was formerly head of the
statistics and computer science de-
partment at that university and has
held faculty positions and visiting
positions at a number of U.S. univer-
sities as well as in England and Can-
ada. Her current research interests
include time series, sequential anal-
ysis, stochastic processes, and
AIDS. She is a fellow of the American
Statistical Association and the Insti-
tute of Mathematical Statistics and
an elected member of the Interna-
tional Statistical Institute. She has
held many professional offices, in-
cluding President of the Biometric
Society, Eastern North American Re-
gion, Program Secretary of the Insti-
tute of Mathematical Statistics, and
Associate Editor with the Journal of
the American Statistical Associa-
tion. She is currently a member of
the International Council of the Bio-
metric Society and the Council of
the International Statistical Insti-
tute. She received a BS in mathe-
matics and statistics and a PhD in
statistics from the University of New
South Wales, Australia.
Of related interest...
ROBUST ESTIMATION AND TESTING
Robert G. Staudte and Simon J. Sheather
Designed to provide students with practical methods for carrying out
robust procedures in a variety of statistical contexts, this practical
guide: reviews some traditional finite sample methods of analyzing
estimators of a scale parameter; introduces the main tools of robust
statistics; examines the important joint location-scale estimation prob-
lem; and comprehensively treats the linear regression model. The text
also includes an appendix containing a large number of Minitab macros
which calculate robust estimators and their standard errors.
1990 (0 471-85547-2) 376pp.

CONFIGURAL POLYSAMPLING
A Route to Practical Robustness
Edited by Stephan Morgenthaler and John W. Tukey
Beginning with the historical background of robustness, this unique
text introduces the essential ideas for a small sample approach to
robust statistics. Using models for distributional shape which contain
at least two alternative distributions, the book treats in great detail
inference based on such models for the location and scale case. Point
estimation and interval estimation are examined in depth, as is classi-
cal material concerning Pitman’s optimal invariant estimators. Configu-
ral Polysampling’s blend of theoretical and empirical facts makes it
especially useful to both novices and experienced users’ understand-
ing of traditional robust methods.
1991 (0 471-52372-0) 248 pp.

LISP-STAT
An Object-Oriented Environment for
Statistical Computing and Dynamic Graphics
Luke Tierney
This hands-on guide shows users how to carry out basic and extensive
programming projects within the context of the Lisp-Stat system. It also
clearly demonstrates how to use functional and object-oriented pro-
gramming styles in statistical computing to develop new numerical and
graphical methods. Using the Lisp system for statistical calculations
and graphs, the book addresses computations ranging from summary
statistics through fitting linear and nonlinear regression models to
general maximum likelihood estimation and approximate Bayesian
computations.
1990 (0 471-50916-7) 416 pp.

WILEY-INTERSCIENCE ISBN O-471-53631-4


ohn Wiley & Sons, Inc.
Professional, Reference and Trade Group a
605 Third Avenue, New York, N.Y. 10158-0012
New York ¢ Chichester ¢ Brisbane * Toronto .
¢ Singapore |||
9"780471"5 36314
