
David J. Olive

Theory for Linear Models

January 17, 2025

Springer

Preface

Many statistics departments offer a one semester graduate course in linear
model theory. Linear models include multiple linear regression and many
experimental design models. Three good books on linear model theory, in
increasing order of difficulty, are Myers and Milton (1991), Seber and Lee
(2003), and Christensen (2020). Other texts include Agresti (2015), Freed-
man (2005), Graybill (1976, 2000), Guttman (1982), Harville (2018), Hock-
ing (2013), Monahan (2008), Muller and Stewart (2006), Rao (1973), Rao et
al. (2008), Ravishanker, Chi, and Dey (2021), Rencher and Schaalje (2008),
Scheffé (1959), Searle and Gruber (2017), Sengupta and Jammalamadaka
(2019), Stapleton (2009), Wang and Chow (1994), and Zimmerman (2020ab).
A good summary is Olive (2017a, ch. 11).
The prerequisites for this text are i) a calculus based course in statistics
at the level of Chihara and Hesterberg (2011), Hogg et al. (2015), Larsen and
Marx (2017), Wackerly et al. (2008), and Walpole et al. (2016). ii) Linear
algebra at the level of Anton et al. (2019), and Leon (2015). iii) A calcu-
lus based course in multiple linear regression at the level of Abraham and
Ledolter (2006), Cook and Weisberg (1999), Kutner et al. (2005), Olive (2010,
2017a), and Weisberg (2014).
This text emphasizes large sample theory over normal theory, and shows
how to do inference after variable selection. The text is at a Master’s level
for the United States. Let n be the sample size and p the number of predic-
tor variables. Chapter 1 reviews some of the material from a calculus based
course in multiple linear regression as well as some of the material to be cov-
ered in the text. Chapter 1 also covers the multivariate normal distribution
and large sample theory. Most of these sections can be skimmed and then
reviewed as needed. Chapters 2 and 3 cover full and nonfull rank linear mod-
els, respectively, with emphasis on least squares. Chapter 4 considers variable
selection when n >> p. Chapter 5 considers Statistical Learning alternatives
to least squares when n >> p, including lasso, lasso variable selection, and
the elastic net. Chapter 6 shows how to use data splitting for inference if
n/p is not large. Chapter 7 gives theory for robust regression, using results
from robust multivariate location and dispersion. Chapter 8 gives theory for
the multivariate linear model where there are m ≥ 2 response variables.
Chapter 9 examines the one way MANOVA model, which is a special case of
the multivariate linear model. Chapter 10 generalizes much of the material
from Chapters 2–6 to many other regression models, including generalized
linear models and some survival regression models. Chapter 11 gives some
information about R and some hints for homework problems.
Chapters 2–4 are the most important for a standard course in Linear Model
Theory, along with the multivariate normal distribution and some large sam-
ple theory from Chapter 1. Some highlights of this text follow.
• Prediction intervals are given that can be useful even if n < p.
• The response plot is useful for checking the model.
• The large sample theory for the elastic net, lasso, and ridge regression
is greatly simplified. Large sample theory for variable selection and lasso
variable selection is given.
• The bootstrap is used for inference after variable selection if n ≥ 10p.
• Data splitting is used for inference after variable selection or model build-
ing if n < 5p.
• Most of the above highlights are extended to many other regression models
such as generalized linear models and some survival regression models.
The website (http://parker.ad.siu.edu/Olive/linmodbk.htm) for this book
provides R programs in the file linmodpack.txt and several R data sets in
the file linmoddata.txt. Section 11.1 discusses how to get the data sets and
programs into the software, but the following commands will work.
Downloading the book’s R functions linmodpack.txt and data files
linmoddata.txt into R: The following commands
source("http://parker.ad.siu.edu/Olive/linmodpack.txt")
source("http://parker.ad.siu.edu/Olive/linmoddata.txt")

can be used to download the R functions and data sets into R. (Copy and paste
these two commands into R from near the top of the file (http://parker.ad.
siu.edu/Olive/linmodhw.txt), which contains commands that are useful for
doing many of the R homework problems.) Type ls(). Over 100 R functions
from linmodpack.txt should appear. Exit R with the command q() and click
No.
The R software is used in this text. See R Core Team (2016). Some pack-
ages used in the text include glmnet Friedman et al. (2015), leaps Lum-
ley (2009), MASS Venables and Ripley (2010), mgcv Wood (2017), and pls
Mevik et al. (2015).

Acknowledgments

Teaching this course in 2014 as Math 583 and in 2019 and 2021 as Math
584 at Southern Illinois University was very useful.
Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Response Plots and Response Transformations . . . . . . . 4
1.2.1 Response and Residual Plots . . . . . . . . . . . . . . . . . . . 5
1.2.2 Response Transformations . . . . . . . . . . . . . . . . . . . . . 8
1.3 A Review of Multiple Linear Regression . . . . . . . . . . . . . 13
1.3.1 The ANOVA F Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.3.2 The Partial F Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.3.3 The Wald t Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.3.4 The OLS Criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.3.5 The Location Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.3.6 Simple Linear Regression . . . . . . . . . . . . . . . . . . . . . . 28
1.3.7 The No Intercept MLR Model . . . . . . . . . . . . . . . . . 29
1.4 The Multivariate Normal Distribution . . . . . . . . . . . . . . . 31
1.5 Large Sample Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
1.5.1 The CLT and the Delta Method . . . . . . . . . . . . . . . 34
1.5.2 Modes of Convergence and Consistency . . . . . . . . 37
1.5.3 Slutsky’s Theorem and Related Results . . . . . . . . 45
1.5.4 Multivariate Limit Theorems . . . . . . . . . . . . . . . . . . 48
1.6 Mixture Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
1.7 Elliptically Contoured Distributions . . . . . . . . . . . . . . . . . . 53
1.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
1.9 Complements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
1.10 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

2 Full Rank Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71


2.1 Projection Matrices and the Column Space . . . . . . . . . . 71
2.2 Quadratic Forms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
2.3 Least Squares Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
2.3.1 Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
2.4 WLS and Generalized Least Squares . . . . . . . . . . . . . . . . . 97


2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102


2.6 Complements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
2.7 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

3 Nonfull Rank Linear Models and Cell Means Models . . . . . 117


3.1 Nonfull Rank Linear Models . . . . . . . . . . . . . . . . . . . . . . . . 117
3.2 Cell Means Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
3.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
3.4 Complements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
3.5 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

4 Prediction and Variable Selection When n >> p . . . . . . . . . . 141


4.1 Variable Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
4.1.1 OLS Variable Selection . . . . . . . . . . . . . . . . . . . . . . . . . 142
4.2 Large Sample Theory for Some Variable Selection
Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
4.3 Prediction Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
4.4 Prediction Regions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
4.5 Bootstrapping Hypothesis Tests and Confidence
Regions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
4.5.1 The Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
4.5.2 Bootstrap Confidence Regions for Hypothesis
Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
4.5.3 Theory for Bootstrap Confidence Regions . . . . . . 178
4.5.4 Bootstrapping the Population Coefficient of
Multiple Determination . . . . . . . . . . . . . . . . . . . . . . . . 183
4.6 Bootstrapping Variable Selection . . . . . . . . . . . . . . . . . . . . . 186
4.6.1 The Parametric Bootstrap . . . . . . . . . . . . . . . . . . . . . 188
4.6.2 The Residual Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . 189
4.6.3 The Nonparametric Bootstrap . . . . . . . . . . . . . . . . . 191
4.6.4 Bootstrapping OLS Variable Selection . . . . . . . . . 192
4.6.5 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
4.7 Data Splitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
4.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
4.9 Complements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
4.10 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207

5 Statistical Learning Alternatives to OLS . . . . . . . . . . . . . . . . . . 211


5.1 The MLR Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
5.2 Forward Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
5.3 Principal Components Regression . . . . . . . . . . . . . . . . . . . . 221
5.4 Partial Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
5.5 Ridge Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
5.6 Lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
5.7 Lasso Variable Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237

5.8 The Elastic Net . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240


5.9 Prediction Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
5.10 Cross Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
5.11 Hypothesis Testing After Model Selection, n/p Large . 252
5.12 Data Splitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
5.13 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
5.14 Complements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
5.15 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264

6 What if n is not >> p? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273


6.1 Sparse Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
6.2 Data Splitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
6.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
6.4 Complements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
6.5 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277

7 Robust Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279


7.1 The Location Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
7.2 The Multivariate Location and Dispersion Model . . . . 281
7.2.1 Affine Equivariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
7.2.2 Breakdown . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
7.2.3 The Concentration Algorithm . . . . . . . . . . . . . . . . . . 287
7.2.4 Theory for Practical Estimators . . . . . . . . . . . . . . . . 291
7.2.5 Outlier Resistance and Simulations . . . . . . . . . . . . 301
7.2.6 The RMVN and RFCH Sets . . . . . . . . . . . . . . . . . . . 310
7.3 Outlier Detection for the MLD Model . . . . . . . . . . . . . . . . 312
7.3.1 MLD Outlier Detection if p > n . . . . . . . . . . . . . . . . 318
7.4 Outlier Detection for the MLR Model . . . . . . . . . . . . . . . . 321
7.5 Resistant Multiple Linear Regression . . . . . . . . . . . . . . . . . 324
7.6 Robust Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
7.6.1 MLR Breakdown and Equivariance . . . . . . . . . . . . 335
7.6.2 A Practical High Breakdown Consistent
Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
7.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
7.8 Complements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
7.9 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353

8 Multivariate Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . 361


8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361
8.2 Plots for the Multivariate Linear Regression Model . . 365
8.3 Asymptotically Optimal Prediction Regions . . . . . . . . . . 368
8.4 Testing Hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
8.5 An Example and Simulations . . . . . . . . . . . . . . . . . . . . . . . . . 383
8.5.1 Simulations for Testing . . . . . . . . . . . . . . . . . . . . . . . . . 388
8.6 The Robust rmreg2 Estimator . . . . . . . . . . . . . . . . . . . . . . . 391

8.7 Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 394


8.7.1 Parametric Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . 394
8.7.2 Residual Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . 394
8.7.3 Nonparametric Bootstrap . . . . . . . . . . . . . . . . . . . . . . 395
8.8 Data Splitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
8.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
8.10 Complements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
8.11 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402

9 One Way MANOVA Type Models . . . . . . . . . . . . . . . . . . . . . . . 407


9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407
9.2 Plots for MANOVA Models . . . . . . . . . . . . . . . . . . . . . . . . . . 410
9.3 One Way MANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414
9.4 An Alternative Test Based on Large Sample Theory . 418
9.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
9.6 Complements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 424
9.7 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 424

10 1D Regression Models Such as GLMs . . . . . . . . . . . . . . . . . . . . . 425


10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
10.2 Additive Error Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . 430
10.3 Binary, Binomial, and Logistic Regression . . . . . . . . . . . . 431
10.4 Poisson Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439
10.5 GLM Inference, n/p Large . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445
10.6 Variable and Model Selection . . . . . . . . . . . . . . . . . . . . . . . . 454
10.6.1 When n/p is Large . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 454
10.6.2 When n/p is Not Necessarily Large . . . . . . . . . . . . . 462
10.7 Generalized Additive Models . . . . . . . . . . . . . . . . . . . . . . . . . 465
10.7.1 Response Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 467
10.7.2 The EE Plot for Variable Selection . . . . . . . . . . . . . 468
10.7.3 An EE Plot for Checking the GLM . . . . . . . . . . . . 469
10.7.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 469
10.8 Overdispersion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474
10.9 Inference After Variable Selection for GLMs . . . . . . . . . 477
10.9.1 The Parametric and Nonparametric Bootstrap . 477
10.9.2 Bootstrapping Variable Selection . . . . . . . . . . . . . . . 479
10.9.3 Examples and Simulations . . . . . . . . . . . . . . . . . . . . . 482
10.10 Prediction Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487
10.11 OLS and 1D Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 495
10.11.1 Inference for 1D Regression With a Linear
Predictor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 497
10.12 Data Splitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 500
10.13 Complements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 500
10.14 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 503

11 Stuff for Students . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505


11.1 R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505
11.2 Hints for Selected Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 509
11.3 Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 551
Chapter 1
Introduction

This chapter provides a preview of the book, and contains several sections
that will be useful for linear model theory. Section 1.2 defines 1D regression
and gives some techniques useful for checking the 1D regression model and vi-
sualizing data in the background of the data. Section 1.3 reviews the multiple
linear regression model. Sections 1.4 and 1.7 cover the multivariate normal
distribution and elliptically contoured distributions. Some large sample the-
ory is presented in Section 1.5, and Section 1.6 covers mixture distributions.
Section 1.4 is important, but the remaining sections can be skimmed and
then reviewed as needed.

1.1 Overview

Linear Model Theory provides theory for the multiple linear regression model
and some experimental design models. This text will also give theory for the
multivariate linear regression model where there are m ≥ 2 response vari-
ables. Emphasis is on least squares, but some alternative Statistical Learning
techniques, such as lasso and the elastic net, will also be covered. Chapter 10
considers theory for 1D regression models which include the multiple linear
regression model and generalized linear models.
Statistical Learning could be defined as the statistical analysis of multivari-
ate data. Machine learning, data mining, analytics, business analytics, data
analytics, and predictive analytics are synonymous terms. The techniques are
useful for Data Science and Statistics, the science of extracting information
from data. The R software will be used. See R Core Team (2020).
Let z = (z1 , ..., zk)T where z1 , ..., zk are k random variables. Often z =
(xT , Y )T where xT = (x1 , ..., xp) is the vector of predictors and Y is the
variable of interest, called a response variable. Predictor variables are also
called independent variables, covariates, or features. The response variable
is also called the dependent variable. Usually context will be used to decide
whether z is a random vector or the observed random vector.

Definition 1.1. A case or observation consists of k random variables


measured for one person or thing. The ith case z i = (zi1 , ..., zik)T . The
training data consists of z 1 , ..., zn . A statistical model or method is fit
(trained) on the training data. The test data consists of z n+1 , ..., zn+m , and
the test data is often used to evaluate the quality of the fitted model.

Following James et al. (2013, p. 30), the previously unseen test data is not
used to train the Statistical Learning method, but interest is in how well the
method performs on the test data. If the training data is (x1 , Y1 ), ..., (xn , Yn ),
and the previously unseen test data is (xf , Yf ), then particular interest is in
the accuracy of the estimator Ŷf of Yf obtained when the Statistical Learning
method is applied to the predictor xf . The two Pelawa Watagoda and Olive
(2021b) prediction intervals, developed in Section 4.3, will be tools for eval-
uating Statistical Learning methods for the additive error regression model
Yi = m(xi ) + ei = E(Yi |xi ) + ei for i = 1, ..., n where E(W ) is the expected
value of the random variable W . The multiple linear regression (MLR) model,
Yi = β1 + x2 β2 + · · · + xp βp + e = xT β + e, is an important special case.
The estimator Ŷf is a prediction if the response variable Yf is continuous,
as occurs in regression models. If Yf is categorical, then Ŷf is a classification.
For example, if Yf can be 0 or 1, then xf is classified to belong to group i if
Ŷf = i for i = 0 or 1.

Following Marden (2006, pp. 5,6), the focus of supervised learning is pre-
dicting a future value of the response variable Yf given xf and the training
data (x1 , Y1 ), ..., (xn , Yn ). Hence the focus is not on hypothesis testing, con-
fidence intervals, parameter estimation, or which model fits best, although
these four inference topics can be useful for better prediction.

Notation: Typically lower case boldface letters such as x denote column


vectors, while upper case boldface letters such as S or Y are used for ma-
trices or column vectors. If context is not enough to determine whether y
is a random vector or an observed random vector, then Y = (Y1 , ..., Yp)T
may be used for the random vector, and y = (y1 , ..., yp)T for the observed
value of the random vector. An upper case letter such as Y will usually be a
random variable. A lower case letter such as x1 will also often be a random
variable. An exception to this notation is the generic multivariate location
and dispersion estimator (T, C) where the location estimator T is a p × 1
vector such as T = x. C is a p × p dispersion estimator and conforms to the
above notation.

The main focus of the first seven chapters is developing tools to analyze
the multiple linear regression model Yi = xTi β + ei for i = 1, ..., n. Classical
regression techniques use (ordinary) least squares (OLS) and assume n >> p,
but Statistical Learning methods often give useful results if p >> n. OLS
forward selection, lasso, ridge regression, and the elastic net will be some of
the techniques examined.

For classical regression and multivariate analysis, we often want n ≥ 10p,


and a model with n < 5p is overfitting: the model does not have enough data
to estimate parameters accurately. Statistical Learning methods often use a
model with d predictor variables, where n ≥ Jd with J ≥ 5 and preferably
J ≥ 10.

Acronyms are widely used in regression and Statistical Learning, and some
of the more important acronyms appear in Table 1.1. Also see the text’s index.

Table 1.1 Acronyms


Acronym Description
AER additive error regression
AP additive predictor = SP for a GAM
BLUE best linear unbiased estimator
cdf cumulative distribution function
cf characteristic function
CI confidence interval
CLT central limit theorem
CV cross validation
EC elliptically contoured
EAP estimated additive predictor = ESP for a GAM
ESP estimated sufficient predictor
ESSP estimated sufficient summary plot = response plot
GAM generalized additive model
GLM generalized linear model
iff if and only if
iid independent and identically distributed
lasso an MLR method
LR logistic regression
MAD the median absolute deviation
MCLT multivariate central limit theorem
MED the median
mgf moment generating function
MLD multivariate location and dispersion
MLR multiple linear regression
MVN multivariate normal
OLS ordinary least squares
pdf probability density function
PI prediction interval
pmf probability mass function
SE standard error
SP sufficient predictor
SSP sufficient summary plot

Remark 1.1. There are several important Statistical Learning principles.


1) There is more interest in prediction or classification, e.g. producing Ŷf ,
than in other types of inference such as parameter estimation, hypothesis
testing, confidence intervals, or which model fits best.
2) Often the focus is on extracting useful information for high dimensional
statistics where n/p is not large, e.g. p > n. If d is a crude estimator of the
fitted model complexity, such as the number of predictor variables used by
the model, we want n/d large. A sparse model has few nonzero coefficients.
We can have sparse population models and sparse fitted models. Sometimes
sparse fitted models are useful even if the population model is not sparse.
Often the number of nonzero coefficients of a sparse fitted model = d. Sparse
fitted models are often useful for prediction.
3) Interest is in how well the method performs on test data. Performance on
training data is overly optimistic for estimating performance on test data.
4) Some methods are flexible while others are inflexible. For inflexible re-
gression methods, the sufficient predictor is often a hyperplane SP = xT β
(see Definition 1.2), and often the mean function E(Y |x) = M (xT β) where
the function M is known but the p×1 vector of parameters β is unknown and
must be estimated (e.g. generalized linear models). Flexible methods tend to
be useful for more complicated regression methods where E(Y |x) = m(x)
for an unknown function m or SP ≠ xT β (e.g. generalized additive models).
Flexibility tends to increase with d.

1.2 Response Plots and Response Transformations

This section will consider tools for visualizing the regression model in the
background of the data. The definitions in this section tend not to depend
on whether n/p is large or small, but the estimator ĥ tends to be better if
n/p is large. In regression, the response variable is the variable of interest:
the variable you want to predict. The predictors or features x1 , ..., xp are
variables used to predict Y . See Chapter 10 for more on the 1D regression
model.

Definition 1.2. Regression investigates how the response variable Y


changes with the value of a p × 1 vector x of predictors. Often this con-
ditional distribution Y |x is described by a 1D regression model, where Y
is conditionally independent of x given the sufficient predictor SP = h(x),
written
Y ⫫ x|SP or Y ⫫ x|h(x), (1.1)
where the real valued function h : Rp → R. The estimated sufficient predictor
ESP = ĥ(x). An important special case is a model with a linear predictor
h(x) = xT β where ESP = xT β̂. This class of models includes the gener-
alized linear model (GLM). Another important special case is a generalized
additive model (GAM), where Y is independent of x = (x1 , ..., xp)T given the
additive predictor AP = SP = α + Σ_{j=2}^p Sj (xj ) for some (usually unknown)
functions Sj where x1 ≡ 1. The estimated additive predictor EAP = ESP =
α̂ + Σ_{j=2}^p Ŝj (xj ).
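
The ESP and EAP can be computed in R. The following commands give a minimal sketch (not from the text) on simulated, hypothetical data, using glm for a model with a linear predictor and the Wood (2017) mgcv package for a GAM.
library(mgcv) #for gam; assumed to be installed
set.seed(1)   #hypothetical simulated data
n <- 100
dat <- data.frame(x2 = rnorm(n), x3 = rnorm(n))
dat$Y <- 1 + dat$x2 + sin(dat$x3) + rnorm(n)
outglm <- glm(Y ~ x2 + x3, data = dat)       #linear predictor h(x) = xT beta
ESP <- predict(outglm)                       #ESP = xT betahat
outgam <- gam(Y ~ s(x2) + s(x3), data = dat) #additive predictor
EAP <- predict(outgam)                       #EAP = alphahat + sum of Shatj(xj)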

Notation. Often the index i will be suppressed. For example, the multiple
linear regression model
Yi = xTi β + ei (1.2)
for i = 1, ..., n where β is a p × 1 unknown vector of parameters, and ei is a
random error. This model could be written Y = xT β + e. More accurately,
Y |x = xT β + e, but the conditioning on x will often be suppressed. Often
the errors e1 , ..., en are iid (independent and identically distributed) from a
distribution that is known except for a scale parameter. For example, the
ei ’s might be iid from a normal (Gaussian) distribution with mean 0 and
unknown standard deviation σ. For this Gaussian model, estimation of α, β,
and σ is important for inference and for predicting a new future value of the
response variable Yf given a new vector of predictors xf .

1.2.1 Response and Residual Plots

Definition 1.3. An estimated sufficient summary plot (ESSP) or response


plot is a plot of the ESP versus Y . A residual plot is a plot of the ESP versus
the residuals.

Notation: In this text, a plot of x versus Y will have x on the horizontal


axis, and Y on the vertical axis. For the additive error regression model
Y = m(x) + e, the ith residual is ri = Yi − m̂(xi ) = Yi − Ŷi where Ŷi = m̂(xi )
is the ith fitted value. The additive error regression model is a 1D regression
model with sufficient predictor SP = h(x) = m(x).

For the additive error regression model, the response plot is a plot of Ŷ
versus Y where the identity line with unit slope and zero intercept is added as
a visual aid. The residual plot is a plot of Ŷ versus r. Assume the errors ei are
iid from a unimodal distribution that is not highly skewed. Then the plotted
points should scatter about the identity line and the r = 0 line (the horizontal
axis) with no other pattern if the fitted model (that produces m̂(x)) is good.
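
The two plots are easy to make with OLS and base R graphics. The following commands give a minimal sketch on simulated, hypothetical data; lsfit is also used in the R commands later in this section.
set.seed(2) #hypothetical data from an additive error regression model
n <- 100
X <- matrix(rnorm(n*2), n, 2)
Y <- drop(1 + X %*% c(2, -1) + rnorm(n))
out <- lsfit(X, Y)
res <- out$res   #residuals r
yhat <- Y - res  #fitted values Yhat
par(mfrow = c(2, 1))
plot(yhat, Y, xlab = "FIT", ylab = "Y", main = "Response Plot")
abline(0, 1)     #identity line: unit slope and zero intercept
plot(yhat, res, xlab = "FIT", ylab = "RES", main = "Residual Plot")
abline(h = 0)    #the r = 0 line
par(mfrow = c(1, 1))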

Example 1.1. Tremearne (1911) presents a data set of about 17 mea-


surements on 115 people of Hausa nationality. We deleted 3 cases because
of missing values and used height as the response variable Y . Along with a
constant xi,1 ≡ 1, the five additional predictor variables used were height
when sitting, height when kneeling, head length, nasal breadth, and span (perhaps
from left hand to right hand).

Fig. 1.1 Residual and Response Plots for the Tremearne Data. [Figure omitted: a response plot of FIT versus Y and a residual plot of FIT versus RES, with cases 3, 44, and 63 highlighted.]

Figure 1.1 presents the (ordinary) least
squares (OLS) response and residual plots for this data set. These plots show
that an MLR model Y = xT β + e should be a useful model for the data
since the plotted points in the response plot are linear and follow the identity
line while the plotted points in the residual plot follow the r = 0 line with
no other pattern (except for a possible outlier marked 44). Note that many
important acronyms, such as OLS and MLR, appear in Table 1.1.
To use the response plot to visualize the conditional distribution of Y |xT β,
use the fact that the fitted values Ŷ = xT β̂. For example, suppose the height
given fit = 1700 is of interest. Mentally examine the plot about a narrow
vertical strip about fit = 1700, perhaps from 1685 to 1715. The cases in the
narrow strip have a mean close to 1700 since they fall close to the identity
line. Similarly, when the fit = w for w between 1500 and 1850, the cases have
heights near w, on average.

Cases 3, 44, and 63 are highlighted. The 3rd person was very tall while
the 44th person was rather short. Beginners often label too many points
as outliers: cases that lie far away from the bulk of the data. See Chapter
7. Mentally draw a box about the bulk of the data ignoring any outliers.
Double the width of the box (about the identity line for the response plot
and about the horizontal line for the residual plot). Cases outside of this
imaginary doubled box are potential outliers. Alternatively, visually estimate
the standard deviation of the residuals in both plots. In the residual plot look
for residuals that are more than 5 standard deviations from the r = 0 line.
In Figure 1.1, the standard deviation of the residuals appears to be around
10. Hence cases 3 and 44 are certainly worth examining.
The identity line can also pass through or near an outlier or a cluster
of outliers. Then the outliers will be in the upper right or lower left of the
response plot, and there will be a large gap between the cluster of outliers and
the bulk of the data. Figure 1.1 was made with the following R commands,
using linmodpack function MLRplot and the major.lsp data set from the
text’s webpage.
major <- matrix(scan(),nrow=112,ncol=7,byrow=T)
#copy and paste the data set, then press enter
major <- major[,-1]
X<-major[,-6]
Y <- major[,6]
MLRplot(X,Y) #left click the 3 highlighted cases,
#then right click Stop for each of the two plots
A problem with response and residual plots is that there can be a lot of
black in the plot if the sample size n is large (more than a few thousand). A
variant of the response plot for the additive error regression model would plot
the identity line, the two lines parallel to the identity line corresponding to the
Section 4.1 large sample 100(1 − δ)% prediction intervals for Yf that depends
on Ŷf . Then plot points corresponding to training data cases that do not lie in
their 100(1 − δ)% PI. Use δ = 0.01 or 0.05. Try the following commands that
used δ = 0.2 since n is small. The commands use the linmodpack functions
AERplot and AERplot2. See Problem 1.31.
out<-lsfit(X,Y) #X and Y from the above R code
res<-out$res
yhat<-Y-res #usual response plot
AERplot(yhat,Y,res=res,d=2,alph=1)
AERplot(yhat,Y,res=res,d=2,alph=0.2)
#plots data outside the 80% pointwise PIs

n<-100000; q<-7 #q=p-1


b <- 0 * 1:q + 1
x <- matrix(rnorm(n * q), nrow = n, ncol = q)

y <- 1 + x %*% b + rnorm(n)


out<-lsfit(x,y)
res<-out$res
yhat<-y-res
dd<-length(out$coef) #usual response plot
AERplot(yhat,y,res=res,d=dd,alph=1)
AERplot(yhat,y,res=res,d=dd,alph=0.01)
#plots data outside the 99% pointwise PIs
AERplot2(yhat,y,res=res,d=2)
#response plot with 90% pointwise prediction bands

1.2.2 Response Transformations

A response transformation Y = tλ (Z) can make the MLR model or additive


error regression model hold if the variable of interest Z is measured on the
wrong scale. For MLR, Y = tλ (Z) = xT β + e, while for additive error regres-
sion, Y = tλ (Z) = m(x) + e. Predictor transformations are used to remove
gross nonlinearities in the predictors, and this technique is often very useful.
However, if there are hundreds or more predictors, graphical methods for
predictor transformations take too long. Olive (2017a, Section 3.1) describes
graphical methods for predictor transformations.
Power transformations are particularly effective, and a power transforma-
tion has the form x = tλ (w) = w λ for λ ≠ 0 and x = t0 (w) = log(w) for
λ = 0. Often λ ∈ ΛL where

ΛL = {−1, −1/2, −1/3, 0, 1/3, 1/2, 1} (1.3)

is called the ladder of powers. Often when a power transformation is needed,


a transformation that goes “down the ladder,” e.g. from λ = 1 to λ = 0 will
be useful. If the transformation goes too far down the ladder, e.g. if λ = 0
is selected when λ = 1/2 is needed, then it will be necessary to go back “up
the ladder.” Additional powers such as ±2 and ±3 can always be added. The
following rules are useful for both response transformations and predictor
transformations. In this text, log(x) = ln(x) = loge (x).
a) The log rule states that a positive variable that has the ratio between
the largest and smallest values greater than ten should be transformed to
logs. So W > 0 and max(W )/ min(W ) > 10 suggests using log(W ).
b) The ladder rule appears in Cook and Weisberg (1999, p. 86), and is
used for a plot of two variables, such as ESP versus Y for response transfor-
mations or x1 versus x2 for predictor transformations.
Ladder rule: To spread small values of a variable, make λ smaller.
To spread large values of a variable, make λ larger.

Consider the ladder of powers. Often no transformation (λ = 1) is best,


then the log transformation, then the square root transformation, then the
reciprocal transformation.

Fig. 1.2 Plots to Illustrate the Ladder Rule. [Figure omitted: four scatterplots a), b), c), and d) of a variable x (vertical axis) versus a variable w (horizontal axis), showing which values need spreading.]

Example 1.2. Examine Figure 1.2. Since w is on the horizontal axis,


mentally add a narrow vertical slice to the plot. If a large amount of data falls
in the slice at the left of the plot, then small values need spreading. Similarly,
if a large amount of data falls in the slice at the right of the plot (compared
to the middle and left of the plot), then large values need spreading. For
the variable on the vertical axis, make a narrow horizontal slice. If the plot
looks roughly like the northwest corner of a square then small values of the
horizontal and large values of the vertical variable need spreading. Hence in
Figure 1.2a, small values of w need spreading. If the plot looks roughly like
the northeast corner of a square, then large values of both variables need
spreading. Hence in Figure 1.2b, large values of x need spreading. If the plot
looks roughly like the southwest corner of a square, as in Figure 1.2c, then
small values of both variables need spreading. If the plot looks roughly like
the southeast corner of a square, then large values of the horizontal and
small values of the vertical variable need spreading. Hence in Figure 1.2d,
small values of x need spreading.

Consider the additive error regression model Y = m(x) + e. Then the


response transformation model is Y = tλ (Z) = mλ (x) + e, and the graphical
method for selecting the response transformation is to plot m̂λi (x) versus
tλi (Z) for several values of λi , choosing the value of λ = λ0 where the plotted
points follow the identity line with unit slope and zero intercept. For the
multiple linear regression model, m̂λi (x) = xT β̂ λi where β̂ λi can be found
using the desired fitting method, e.g. OLS or lasso.

Definition 1.4. Assume that all of the values of the “response” Zi are
positive. A power transformation has the form Y = tλ (Z) = Z λ for λ ≠ 0
and Y = t0 (Z) = log(Z) for λ = 0 where

λ ∈ ΛL = {−1, −1/2, −1/3, 0, 1/3, 1/2, 1}.

Definition 1.5. Assume that all of the values of the “response” Zi are
positive. Then the modified power transformation family

tλ (Zi ) ≡ Zi^(λ) = (Zi^λ − 1)/λ      (1.4)

for λ ≠ 0 and Zi^(0) = log(Zi ). Generally λ ∈ Λ where Λ is some interval such
as [−1, 1] or a coarse subset such as ΛL . This family is a special case of the
response transformations considered by Tukey (1957).
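
A small R function (not from the text) implements the modified power transformation family (1.4), assuming all values of Z are positive.
tlambda <- function(Z, lambda){ #modified power transformation (1.4), Z > 0
  if(lambda == 0) log(Z) else (Z^lambda - 1)/lambda
}
tlambda(c(1, 10, 100), 0)   #log transformation
tlambda(c(1, 10, 100), 0.5) #lambda = 1/2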

A graphical method for response transformations refits the model using


the same fitting method: changing only the “response” from Z to tλ (Z).
Compute the “fitted values” Ŵi using Wi = tλ (Zi ) as the “response.” Then
a transformation plot of Ŵi versus Wi is made for each of the seven values of
λ ∈ ΛL with the identity line added as a visual aid. Vertical deviations from
the identity line are the “residuals” ri = Wi − Ŵi . Then a candidate response
transformation Y = tλ∗ (Z) is reasonable if the plotted points follow the
identity line in a roughly evenly populated band if the MLR or additive error
regression model is reasonable for Y = W and x. Curvature from the identity
line suggests that the candidate response transformation is inappropriate.

Notice that the graphical method is equivalent to making “response plots”


for the seven values of W = tλ (Z), and choosing the “best response plot”
where the MLR model seems “most reasonable.” The seven “response plots”
are called transformation plots below. Our convention is that a plot of X
versus Y means that X is on the horizontal axis and Y is on the vertical
axis.

Definition 1.6. A transformation plot is a plot of Ŵ versus W with the


identity line added as a visual aid.

There are several reasons to use a coarse grid of powers. First, several of the
powers correspond to simple transformations such as the log, square root, and
cube root. These powers are easier to interpret than λ = 0.28, for example.

Fig. 1.3 Four Transformation Plots for the Textile Data. [Figure omitted: transformation plots of TZHAT versus Z, sqrt(Z), log(Z), and 1/Z for λ = 1, 0.5, 0, and −1.]

According to Mosteller and Tukey (1977, p. 91), the most commonly used
power transformations are the λ = 0 (log), λ = 1/2, λ = −1, and λ = 1/3
transformations in decreasing frequency of use. Secondly, if the estimator λ̂n
can only take values in ΛL , then sometimes λ̂n will converge (e.g. in prob-
ability) to λ∗ ∈ ΛL . Thirdly, Tukey (1957) showed that neighboring power
transformations are often very similar, so restricting the possible powers to
a coarse grid is reasonable. Note that powers can always be added to the
grid ΛL . Useful powers are ±1/4, ±2/3, ±2, and ±3. Powers from numerical
methods can also be added.

Application 1.1. This graphical method for selecting a response trans-


formation is very simple. Let Wi = tλ (Zi ). Then for each of the seven values
of λ ∈ ΛL , perform the regression fitting method, such as OLS or lasso, on
(Wi , xi ) and make the transformation plot of Ŵi versus Wi . If the plotted
points follow the identity line for λ∗ , then take λ̂o = λ∗ , that is, Y = tλ∗ (Z)
is the response transformation.
If more than one value of λ ∈ ΛL gives a linear plot, take the simplest or
most reasonable transformation or the transformation that makes the most
sense to subject matter experts. Also check that the corresponding “residual
plots” of Ŵ versus W −Ŵ look reasonable. The values of λ in decreasing order
of importance are 1, 0, 1/2, −1, and 1/3. So the log transformation would be
chosen over the cube root transformation if both transformation plots look
equally good.
After selecting the transformation, the usual checks should be made. In
particular, the transformation plot for the selected transformation is the re-
sponse plot, and a residual plot should also be made. The following example
illustrates the procedure, and the plots show W = tλ (Z) on the vertical axis.
The label “TZHAT” of the horizontal axis are the “fitted values” Ŵ that
result from using W = tλ (Z) as the “response” in the OLS software.
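
The following commands give a rough sketch of this graphical method with OLS as the fitting method. The data are simulated and hypothetical, with Z generated so that the log transformation should be appropriate.
set.seed(3) #hypothetical positive "response" Z and predictor matrix x
n <- 100
x <- matrix(rnorm(n*3), n, 3)
Z <- exp(drop(1 + x %*% c(1, 0.5, -0.5) + rnorm(n)))
lamL <- c(-1, -1/2, -1/3, 0, 1/3, 1/2, 1) #the ladder of powers
par(mfrow = c(3, 3))
for(lam in lamL){
  W <- if(lam == 0) log(Z) else Z^lam #W = tlambda(Z)
  out <- lsfit(x, W)                  #OLS with W as the "response"
  What <- W - out$res                 #"fitted values" What
  plot(What, W, xlab = "TZHAT", main = paste("lambda =", lam))
  abline(0, 1)                        #identity line as a visual aid
}
par(mfrow = c(1, 1))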

Example 1.3: Textile Data. In their pioneering paper on response trans-


formations, Box and Cox (1964) analyze data from a 3³ experiment on the
behavior of worsted yarn under cycles of repeated loadings. The “response”
Z is the number of cycles to failure and a constant is used along with the
three predictors length, amplitude, and load. Using the normal profile log
likelihood for λo , Box and Cox determine λ̂o = −0.06 with approximate 95
percent confidence interval −0.18 to 0.06. These results give a strong indi-
cation that the log transformation may result in a relatively simple model,
as argued by Box and Cox. Nevertheless, the numerical Box–Cox transfor-
mation method provides no direct way of judging the transformation against
the data.
Shown in Figure 1.3 are transformation plots of Ŵ versus W = Z λ for
four values of λ except log(Z) is used if λ = 0. The plots show how the trans-
formations bend the data to achieve a homoscedastic linear trend. Perhaps
more importantly, they indicate that the information on the transformation
is spread throughout the data in the plot since changing λ causes all points
along the curvilinear scatter in Figure 1.3a to form along a linear scatter in
Figure 1.3c. Dynamic plotting using λ as a control seems quite effective for
judging transformations against the data and the log response transformation
does indeed seem reasonable.
Note the simplicity of the method: Figure 1.3a shows that a response trans-
formation is needed since the plotted points follow a nonlinear curve while
Figure 1.3c suggests that Y = log(Z) is the appropriate response transforma-
tion since the plotted points follow the identity line. If all 7 plots were made
for λ ∈ ΛL , then λ = 0 would be selected since this plot is linear. Also, Figure
1.3a suggests that the log rule is reasonable since max(Z)/ min(Z) > 10.

1.3 A Review of Multiple Linear Regression

The following review follows Olive (2017a: ch. 2) closely. Several of the results
in this section will be covered in more detail or proven in Chapter 2.

Definition 1.7. Regression is the study of the conditional distribution


Y |x of the response variable Y given the vector of predictors x = (x1 , ..., xp)T .

Definition 1.8. A quantitative variable takes on numerical values while


a qualitative variable takes on categorical values.
Definition 1.9. Suppose that the response variable Y and at least one
predictor variable xi are quantitative. Then the multiple linear regression
(MLR) model is

Yi = xi,1 β1 + xi,2 β2 + · · · + xi,p βp + ei = xTi β + ei (1.5)

for i = 1, . . . , n. Here n is the sample size and the random variable ei is the
ith error. Suppressing the subscript i, the model is Y = xT β + e.

In matrix notation, these n equations become

Y = Xβ + e, (1.6)

where Y is an n × 1 vector of dependent variables, X is an n × p matrix


of predictors, β is a p × 1 vector of unknown coefficients, and e is an n × 1
vector of unknown errors. Equivalently,

    [Y1]   [x1,1 x1,2 . . . x1,p] [β1]   [e1]
    [Y2]   [x2,1 x2,2 . . . x2,p] [β2]   [e2]
    [ .] = [  .    .   . .    . ] [ .] + [ .]     (1.7)
    [Yn]   [xn,1 xn,2 . . . xn,p] [βp]   [en]

Often the first column of X is X1 = 1, the n × 1 vector of ones. The ith


case (xTi , Yi ) = (xi1 , xi2, ..., xip, Yi ) corresponds to the ith row xTi of X and
the ith element of Y (if xi1 ≡ 1, then xi1 could be omitted). In the MLR
model Y = xT β + e, the Y and e are random variables, but we only have
observed values Yi and xi . If the ei are iid (independent and identically
distributed) with zero mean E(ei ) = 0 and variance VAR(ei ) = V (ei ) = σ 2 ,
then regression is used to estimate the unknown parameters β and σ 2 .

Definition 1.10. The constant variance MLR model uses the as-
sumption that the errors e1 , ..., en are iid with mean E(ei ) = 0 and variance
VAR(ei ) = σ 2 < ∞. Also assume that the errors are independent of the pre-
dictor variables xi . The predictor variables xi are assumed to be fixed and
measured without error. The cases (xTi , Yi ) are independent for i = 1, ..., n.

If the predictor variables are random variables, then the above MLR model
is conditional on the observed values of the xi . That is, observe the xi and
then act as if the observed xi are fixed.

Definition 1.11. The unimodal MLR model has the same assumptions
as the constant variance MLR model, as well as the assumption that the zero
mean constant variance errors e1 , ..., en are iid from a unimodal distribution
that is not highly skewed. Note that E(ei ) = 0 and V (ei ) = σ 2 < ∞.

Definition 1.12. The normal MLR model or Gaussian MLR model has
the same assumptions as the unimodal MLR model but adds the assumption
that the errors e1 , ..., en are iid N (0, σ 2 ) random variables. That is, the ei are
iid normal random variables with zero mean and variance σ 2 .

The unknown coefficients for the above 3 models are usually estimated
using (ordinary) least squares (OLS).

Notation. The symbol A ≡ B = f(c) means that A and B are equivalent


and equal, and that f(c) is the formula used to compute A and B.

Definition 1.13. Given an estimate b of β, the corresponding vector of


predicted values or fitted values is Ŷ ≡ Ŷ (b) = Xb. Thus the ith fitted value

Ŷi ≡ Ŷi (b) = xTi b = xi,1 b1 + · · · + xi,pbp .

The vector of residuals is r ≡ r(b) = Y − Ŷ (b). Thus the ith residual ri ≡


ri (b) = Yi − Ŷi (b) = Yi − xi,1 b1 − · · · − xi,p bp .

Most regression methods attempt to find an estimate β̂ of β which mini-


mizes some criterion function Q(b) of the residuals.

Definition 1.14. The ordinary least squares (OLS) estimator β̂ OLS min-
imizes
QOLS (b) = Σ_{i=1}^n ri2 (b),      (1.8)

and β̂ OLS = (X T X)−1 X T Y .


The vector of predicted or fitted values Ŷ OLS = X β̂ OLS = HY where the
hat matrix H = X(X T X)−1 X T provided the inverse exists. Typically the
subscript OLS is omitted, and the least squares regression equation is
Ŷ = β̂1 x1 + β̂2 x2 + · · · + β̂p xp where x1 ≡ 1 if the model contains a constant.
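
As a small numerical check (on simulated, hypothetical data), the OLS quantities in Definitions 1.13 and 1.14 can be computed directly with matrix algebra in R.
set.seed(4) #simulated data with a constant: n = 100, p = 3
n <- 100; p <- 3
X <- cbind(1, matrix(rnorm(n*(p-1)), n, p-1))
Y <- X %*% c(1, 2, -1) + rnorm(n)
bhat <- solve(t(X) %*% X, t(X) %*% Y) #betahat = (XT X)^{-1} XT Y
H <- X %*% solve(t(X) %*% X) %*% t(X) #hat matrix
Yhat <- X %*% bhat                    #fitted values
r <- Y - Yhat                         #residuals
max(abs(Yhat - H %*% Y))              #essentially 0 since Yhat = H Y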

Definition 1.15. For MLR, the response plot is a plot of the ESP = fitted
values = Ŷi versus the response Yi , while the residual plot is a plot of the
ESP = Ŷi versus the residuals ri .

Theorem 1.1. Suppose that the regression estimator b of β is used to


find the residuals ri ≡ ri (b) and the fitted values Ŷi ≡ Ŷi (b) = xTi b. Then
in the response plot of Ŷi versus Yi , the vertical deviations from the identity
line (that has unit slope and zero intercept) are the residuals ri (b).

Proof. The identity line in the response plot is Y = xT b. Hence the


vertical deviation is Yi − xTi b = ri (b). 

The results in the following theorem are properties of least squares (OLS),
not of the underlying MLR model. Chapter 2 gives linear model theory for
the full rank model. Definitions 1.13 and 1.14 define the hat matrix H, vector
of fitted values Ŷ , and vector of residuals r. Parts f) and g) make residual
plots useful. If the plotted points are linear with roughly constant variance
and the correlation is zero, then the plotted points scatter about the r = 0
line with no other pattern. If the plotted points in a residual plot of w versus
r do show a pattern such as a curve or a right opening megaphone, zero
correlation will usually force symmetry about either the r = 0 line or the
w = median(w) line. Hence departures from the ideal plot of random scatter
about the r = 0 line are often easy to detect.

Let the n × p design matrix of predictor variables be

        [x1,1 x1,2 . . . x1,p]                          [xT1]
        [x2,1 x2,2 . . . x2,p]                          [xT2]
    X = [  .    .   . .    . ] = [v 1 v 2 . . . v p ] = [ . ]
        [xn,1 xn,2 . . . xn,p]                          [xTn]

where v 1 = 1.
Warning: If n > p, as is usually the case for the full rank linear model,
X is not square, so (X T X)−1 ≠ X −1 (X T )−1 since X −1 does not exist.

Theorem 1.2. Suppose that X is an n × p matrix of full rank p. Then


a) H is symmetric: H = H T .
b) H is idempotent: HH = H .
c) X T r = 0 so that v Tj r = 0.
d) If there is a constant v 1 = 1 in the model, then the sum of the residuals
is zero: Σ_{i=1}^n ri = 0.
e) rT Ŷ = 0.
f) If there is a constant in the model, then the sample correlation of the
fitted values and the residuals is 0: corr(r, Ŷ ) = 0.
g) If there is a constant in the model, then the sample correlation of the
jth predictor with the residuals is 0: corr(r, v j ) = 0 for j = 1, ..., p.
Proof. a) X T X is symmetric since (X T X)T = X T (X T )T = X T X.
Hence (X T X)−1 is symmetric since the inverse of a symmetric matrix is
symmetric. (Recall that if A has an inverse then (AT )−1 = (A−1 )T .) Thus
using (AT )T = A and (ABC)T = C T B T AT shows that

H T = X T [(X T X)−1 ]T (X T )T = H.

b) HH = X(X T X)−1 X T X(X T X)−1 X T = H since (X T X)−1 X T X =


I p , the p × p identity matrix.
c) X T r = X T (I n − H)Y = [X T − X T X(X T X)−1 X T ]Y =
[X T − X T ]Y = 0. Since v j is the jth column of X, v Tj is the jth row of X T
and v Tj r = 0 for j = 1, ..., p.
d) Since v 1 = 1, v T1 r = Σ_{i=1}^n ri = 0 by c).
e) rT Ŷ = [(I n − H)Y ]T H Y = Y T (I n − H )HY = Y T (H − H)Y = 0.
f) The sample correlation between W and Z is

corr(W, Z) = [Σ_{i=1}^n (wi − w̄)(zi − z̄)] / [(n − 1)sw sz ]
           = [Σ_{i=1}^n (wi − w̄)(zi − z̄)] / sqrt[ Σ_{i=1}^n (wi − w̄)2 Σ_{i=1}^n (zi − z̄)2 ]

where sm is the sample standard deviation of m for m = w, z. So the result
follows if A = Σ_{i=1}^n (Ŷi − Ŷ̄ )(ri − r̄) = 0. Now r̄ = 0 by d), and thus

A = Σ_{i=1}^n Ŷi ri − Ŷ̄ Σ_{i=1}^n ri = Σ_{i=1}^n Ŷi ri

by d) again. But Σ_{i=1}^n Ŷi ri = r T Ŷ = 0 by e).
g) Following the argument in f), the result follows if
A = Σ_{i=1}^n (xi,j − x̄j )(ri − r̄) = 0 where x̄j = Σ_{i=1}^n xi,j /n is the sample mean of
the jth predictor. Now r̄ = Σ_{i=1}^n ri /n = 0 by d), and thus

A = Σ_{i=1}^n xi,j ri − x̄j Σ_{i=1}^n ri = Σ_{i=1}^n xi,j ri

by d) again. But Σ_{i=1}^n xi,j ri = v Tj r = 0 by c). □
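
The properties in Theorem 1.2 are easy to verify numerically. The following sketch uses simulated, hypothetical data with a constant in the model; each printed quantity should be 0 up to rounding error.
set.seed(5) #simulated data with a constant in the model
n <- 100; p <- 3
X <- cbind(1, matrix(rnorm(n*(p-1)), n, p-1))
Y <- X %*% c(1, 2, -1) + rnorm(n)
H <- X %*% solve(t(X) %*% X) %*% t(X)
Yhat <- H %*% Y; r <- Y - Yhat
max(abs(H - t(H)))    #a) H is symmetric
max(abs(H %*% H - H)) #b) H is idempotent
max(abs(t(X) %*% r))  #c) XT r = 0
sum(r)                #d) the residuals sum to 0
drop(t(r) %*% Yhat)   #e) rT Yhat = 0
cor(r, Yhat)          #f) corr(r, Yhat) = 0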

1.3.1 The ANOVA F Test

After fitting least squares and checking the response and residual plots to see
that an MLR model is reasonable, the next step is to check whether there is
an MLR relationship between Y and the nontrivial predictors x2 , ..., xp . If at
least one of these predictors is useful, then the OLS fitted values Ŷi should be
used. If none of the nontrivial predictors is useful, then Ȳ will give as good
predictions as Ŷi . Here the sample mean
Ȳ = (1/n) Σ_{i=1}^n Yi .      (1.9)

In the definition below, SSE is the sum of squared residuals and a residual
ri = êi = “errorhat.” In the literature “errorhat” is often rather misleadingly
abbreviated as “error.”

Definition 1.16. Assume that a constant is in the MLR model.


a) The total sum of squares
SSTO = Σ_{i=1}^n (Yi − Ȳ )2 .      (1.10)

b) The regression sum of squares


SSR = Σ_{i=1}^n (Ŷi − Ȳ )2 .      (1.11)

c) The residual sum of squares or error sum of squares is


SSE = Σ_{i=1}^n (Yi − Ŷi )2 = Σ_{i=1}^n ri2 .      (1.12)

The result in the following theorem is a property of least squares (OLS),


not of the underlying MLR model. An obvious application is that given any
two of SSTO, SSE, and SSR, the 3rd sum of squares can be found using the
formula SST O = SSE + SSR.

Theorem 1.3. Assume that a constant is in the MLR model. Then


SST O = SSE + SSR.
Proof.
SSTO = Σ_{i=1}^n (Yi − Ŷi + Ŷi − Ȳ )2 = SSE + SSR + 2 Σ_{i=1}^n (Yi − Ŷi )(Ŷi − Ȳ ).

Hence the result follows if


A ≡ Σ_{i=1}^n ri (Ŷi − Ȳ ) = 0.

But
A = Σ_{i=1}^n ri Ŷi − Ȳ Σ_{i=1}^n ri = 0

by Theorem 1.2 d) and e). 



Definition 1.17. Assume that a constant is in the MLR model and that
SSTO ≠ 0. The coefficient of multiple determination

R2 = [corr(Yi , Ŷi )]2 = SSR/SSTO = 1 − SSE/SSTO

where corr(Yi , Ŷi ) is the sample correlation of Yi and Ŷi .
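
A quick numerical check of Theorem 1.3 and of the equivalent formulas for R2 on simulated, hypothetical data:
set.seed(6) #simulated data with a constant in the model
n <- 100
x <- matrix(rnorm(n*2), n, 2)
Y <- drop(1 + x %*% c(2, -1) + rnorm(n))
out <- lsfit(x, Y)
r <- out$res; Yhat <- Y - r
SSTO <- sum((Y - mean(Y))^2)
SSE <- sum(r^2)
SSR <- sum((Yhat - mean(Y))^2)
SSTO - (SSE + SSR)                        #essentially 0
c(SSR/SSTO, 1 - SSE/SSTO, cor(Y, Yhat)^2) #three equal forms of R2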

Warnings: i) 0 ≤ R2 ≤ 1, but small R2 does not imply that the MLR


model is bad.
ii) If the MLR model contains a constant, then there are several equivalent
formulas for R2 . If the model does not contain a constant, then R2 depends
on the software package.
iii) R2 does not have much meaning unless the response plot and residual
plot both look good.
iv) R2 tends to be too high if n is small.
v) R2 tends to be too high if there are two or more separated clusters of
data in the response plot.
vi) R2 is too high if the number of predictors p is close to n.
vii) In large samples R2 will be large (close to one) if σ 2 is small compared
to the sample variance SY2 of the response variable Y . R2 is also large if the
sample variance of Ŷ is close to SY2 . Thus R2 is sometimes interpreted as
the proportion of the variability of Y explained by conditioning on x, but
warnings i) - v) suggest that R2 may not have much meaning.

The following 2 theorems suggest that R2 does not behave well when many
predictors that are not needed in the model are included in the model. Such
a variable is sometimes called a noise variable and the MLR model is “fitting
noise.” Theorem 1.5 appears, for example, in Cramér (1946, pp. 414-415),
and suggests that R2 should be considerably larger than p/n if the predictors
are useful. Note that if n = 10p and p ≥ 2, then under the conditions of
Theorem 1.5, E(R2 ) ≤ 0.1.

Theorem 1.4. Assume that a constant is in the MLR model. Adding a


variable to the MLR model does not decrease (and usually increases) R2 .

Theorem 1.5. Assume that a constant β1 is in the MLR model, that


β2 = · · · = βp = 0 and that the ei are iid N (0, σ 2 ). Hence the Yi are iid
N (β1 , σ 2 ). Then
a) $R^2$ follows a beta distribution: $R^2 \sim \text{beta}\left(\frac{p-1}{2}, \frac{n-p}{2}\right)$.
b)
$$E(R^2) = \frac{p-1}{n-1}.$$
c)
$$\text{VAR}(R^2) = \frac{2(p-1)(n-p)}{(n-1)^2(n+1)}.$$
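Theorem 1.5 is easy to illustrate by simulation. The sketch below is an assumption-laden illustration (simulated standard normal response, noise predictors, and arbitrary n, p, and seed), not a result from the text: the average simulated R^2 should be close to (p-1)/(n-1).
# minimal sketch: R^2 when the nontrivial predictors are pure noise
set.seed(2)
n <- 25; p <- 5; nruns <- 2000
R2 <- replicate(nruns, {
  x <- matrix(rnorm(n*(p-1)), n, p-1)
  y <- rnorm(n)                     # beta_2 = ... = beta_p = 0
  summary(lm(y ~ x))$r.squared
})
mean(R2)            # close to (p-1)/(n-1) = 4/24
(p-1)/(n-1)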
1.3 A Review of Multiple Linear Regression 19

Notice that each SS/n estimates the variability of some quantity. SST O/n
≈ SY2 , SSE/n ≈ Se2 = σ 2 , and SSR/n ≈ SŶ2 .

Definition 1.18. Assume that a constant is in the MLR model. Associated


with each SS in Definition 1.16 is a degrees of freedom (df) and a mean
square = SS/df. For SSTO, df = n − 1 and M ST O = SST O/(n − 1).
For SSR, df = p − 1 and M SR = SSR/(p − 1). For SSE, df = n − p and
M SE = SSE/(n − p).

Under mild conditions, if the MLR model is appropriate, then MSE is a $\sqrt{n}$ consistent estimator of $\sigma^2$ by Su and Cook (2012).

The ANOVA F test tests whether any of the nontrivial predictors x2 , ..., xp
are needed in the OLS MLR model, that is, whether Yi should be predicted
by the OLS fit Ŷi = β̂1 + xi,2 β̂2 + · · · + xi,pβ̂p or with the sample mean Y .
ANOVA stands for analysis of variance, and the computer output needed
to perform the test is contained in the ANOVA table. Below is an ANOVA
table given in symbols. Sometimes “Regression” is replaced by “Model” and
“Residual” by “Error.”

Summary Analysis of Variance Table

Source      df       SS     MS     F               p-value
Regression  p − 1    SSR    MSR    F0 = MSR/MSE    for H0: β2 = · · · = βp = 0
Residual    n − p    SSE    MSE

Remark 1.2. Recall that for a 4 step test of hypotheses, the p–value is the
probability of getting a test statistic as extreme as the test statistic actually
observed and that H0 is rejected if the p–value < δ. As a benchmark for this
textbook, use δ = 0.05 if δ is not given. The 4th step is the nontechnical
conclusion which is crucial for presenting your results to people who are not
familiar with MLR. Replace Y and x2 , ..., xp by the actual variables used in
the MLR model.

Notation. The p–value ≡ pvalue given by output tends to only be cor-


rect for the normal MLR model. Hence the output is usually only giving an
estimate of the pvalue, which will often be denoted by pval. So reject H0 if
pval ≤ δ. Often
$$pval - pvalue \xrightarrow{P} 0$$
(converges to 0 in probability, so pval is a consistent estimator of pvalue) as
the sample size n → ∞. See Section 1.5 and Chapter 2. Then the computer
output pval is a good estimator of the unknown pvalue. We will use F o ≡ F0 ,
Ho ≡ H0 , and Ha ≡ HA ≡ H1 .

Be able to perform the 4 step ANOVA F test of hypotheses.


i) State the hypotheses H0 : β2 = · · · = βp = 0 HA: not H0 .

ii) Find the test statistic F0 = M SR/M SE or obtain it from output.


iii) Find the pval from output or use the F –table: pval =

P (Fp−1,n−p > F0 ).

iv) State whether you reject H0 or fail to reject H0 . If H0 is rejected, conclude


that there is an MLR relationship between Y and the predictors x2 , ..., xp . If
you fail to reject H0 , conclude that there is not an MLR relationship between
Y and the predictors x2 , ..., xp . (Or there is not enough evidence to conclude
that there is an MLR relationship between Y and the predictors.)
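In R, the F statistic and an estimate of its p-value can be read from the summary() output. The following is a minimal sketch with simulated data; the data, seed, and variable names are illustrative assumptions.
# minimal sketch: the ANOVA F test for H0: beta_2 = ... = beta_p = 0
set.seed(3)
n <- 50
x2 <- rnorm(n); x3 <- rnorm(n)
y <- 1 + 0.5*x2 + rnorm(n)
out <- summary(lm(y ~ x2 + x3))
out$fstatistic                   # F0 with its numerator and denominator df
F0 <- out$fstatistic[1]
pval <- 1 - pf(F0, out$fstatistic[2], out$fstatistic[3])
pval                             # reject H0 if pval <= 0.05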

Some assumptions are needed on the ANOVA F test. Assume that both
the response and residual plots look good. It is crucial that there are no
outliers. Then a rule of thumb is that if n − p is large, then the ANOVA
F test p–value is approximately correct. An analogy can be made with the central limit theorem: $\overline{Y}$ is a good estimator of $\mu$ if the $Y_i$ are iid $N(\mu, \sigma^2)$, and also a good estimator of $\mu$ if the data are iid with mean $\mu$ and variance $\sigma^2$ and $n$ is large enough.
If all of the xi are different (no replication) and if the number of predictors
p = n, then the OLS fit Ŷi = Yi and R2 = 1. Notice that H0 is rejected if the
statistic F0 is large. More precisely, reject H0 if

F0 > Fp−1,n−p,1−δ

where
P (F ≤ Fp−1,n−p,1−δ) = 1 − δ
when F ∼ Fp−1,n−p. Since R2 increases to 1 while (n − p)/(p − 1) decreases
to 0 as p increases to n, Theorem 1.6a below implies that if p is large then
the F0 statistic may be small even if some of the predictors are very good. It
is a good idea to use n ≥ 10p or at least n ≥ 5p if possible.

Theorem 1.6. Assume that the MLR model has a constant β1 .


a)
$$F_0 = \frac{MSR}{MSE} = \frac{R^2}{1-R^2}\,\frac{n-p}{p-1}.$$
b) If the errors ei are iid N (0, σ 2 ), and if H0 : β2 = · · · = βp = 0 is true,
then F0 has an F distribution with p − 1 numerator and n − p denominator
degrees of freedom: F0 ∼ Fp−1,n−p.
c) If the errors are iid with mean 0 and variance σ 2 , if the error distribution
is close to normal, and if n − p is large enough, and if H0 is true, then
F0 ≈ Fp−1,n−p in that the p-value from the software (pval) is approximately
correct.

Remark 1.3. When a constant is not contained in the model (i.e. xi,1 is
not equal to 1 for all i), then the computer output still produces an ANOVA
table with the test statistic and p–value, and nearly the same 4 step test of
hypotheses can be used. The hypotheses are now H0 : β1 = · · · = βp = 0
HA: not H0 , and you are testing whether or not there is an MLR relationship
between Y and x1 , ..., xp. An MLR model without a constant (no intercept)
is sometimes called a “regression through the origin.” See Section 1.3.7.

1.3.2 The Partial F Test

Suppose that there is data on variables Z, w1 , ..., wr and that a useful MLR
model has been made using Y = t(Z), x1 ≡ 1, x2 , ..., xp where each xi is
some function of w1 , ..., wr. This useful model will be called the full model. It
is important to realize that the full model does not need to use every variable
wj that was collected. For example, variables with outliers or missing values
may not be used. Forming a useful full model is often very difficult, and it is
often not reasonable to assume that the candidate full model is good based
on a single data set, especially if the model is to be used for prediction.
Even if the full model is useful, the investigator will often be interested in
checking whether a model that uses fewer predictors will work just as well.
For example, perhaps xp is a very expensive predictor but is not needed given
that x1 , ..., xp−1 are in the model. Also a model with fewer predictors tends
to be easier to understand.

Definition 1.19. Let the full model use Y , x1 ≡ 1, x2 , ..., xp and let the
reduced model use Y , x1 , xi2 , ..., xiq where {i2 , ..., iq} ⊂ {2, ..., p}.

The partial F test is used to test whether the reduced model is good in
that it can be used instead of the full model. It is crucial that the reduced
and full models be selected before looking at the data. If the reduced model
is selected after looking at the full model output and discarding the worst
variables, then the p–value for the partial F test will be too high. If the
data needs to be looked at to build the full model, as is often the case, data
splitting is useful. See Section 6.2.
For (ordinary) least squares, usually a constant is used, and we are assum-
ing that both the full model and the reduced model contain a constant. The
partial F test has null hypothesis H0 : βiq+1 = · · · = βip = 0, and alternative
hypothesis HA : at least one of the βij 6= 0 for j > q. The null hypothesis is
equivalent to H0 : “the reduced model is good.” Since only the full model and
reduced model are being compared, the alternative hypothesis is equivalent
to HA: “the reduced model is not as good as the full model, so use the full
model,” or more simply, HA : “use the full model.”

To perform the partial F test, fit the full model and the reduced model
and obtain the ANOVA table for each model. The quantities dfF , SSE(F)
and MSE(F) are for the full model and the corresponding quantities from
the reduced model use an R instead of an F . Hence SSE(F) and SSE(R) are
the residual sums of squares for the full and reduced models, respectively.
Shown below is output only using symbols.

Full model

Source      df             SS        MS        F0 and p-value
Regression  p − 1          SSR       MSR       F0 = MSR/MSE
Residual    dfF = n − p    SSE(F)    MSE(F)    for H0: β2 = · · · = βp = 0

Reduced model

Source      df             SS        MS        F0 and p-value
Regression  q − 1          SSR       MSR       F0 = MSR/MSE
Residual    dfR = n − q    SSE(R)    MSE(R)    for H0: β2 = · · · = βq = 0

Be able to perform the 4 step partial F test of hypotheses. i) State


the hypotheses. H0 : the reduced model is good HA : use the full model
ii) Find the test statistic
$$F_R = \left[\frac{SSE(R) - SSE(F)}{df_R - df_F}\right] / MSE(F).$$

iii) Find the pval = P(FdfR −dfF ,dfF > FR ). ( Here dfR −dfF = p−q = number
of parameters set to 0, and dfF = n − p, while pval is the estimated p–value.)
iv) State whether you reject H0 or fail to reject H0 . Reject H0 if the pval ≤ δ
and conclude that the full model should be used. Otherwise, fail to reject H0
and conclude that the reduced model is good.

Sometimes software has a shortcut. In particular, the R software uses the


anova command. As an example, assume that the full model uses x2 and
x3 while the reduced model uses x2 . Both models contain a constant. Then
the following commands will perform the partial F test. (On the computer screen the second command looks more like red <- lm(y~x2).)
full <- lm(y ~ x2 + x3)
red <- lm(y ~ x2)
anova(red, full)
For an $n \times 1$ vector $a$, let
$$\|a\| = \sqrt{a_1^2 + \cdots + a_n^2} = \sqrt{a^T a}$$
be the Euclidean norm of $a$. If $r$ and $r_R$ are the vectors of residuals from the full and reduced models, respectively, notice that $SSE(F) = \|r\|^2$ and $SSE(R) = \|r_R\|^2$.

The following theorem suggests that H0 is rejected in the partial F test if


the change in residual sum of squares SSE(R) − SSE(F ) is large compared
to SSE(F ). If the change is small, then FR is small and the test suggests
that the reduced model can be used.

Theorem 1.7. Let R2 and R2R be the multiple coefficients of determi-


nation for the full and reduced models, respectively. Let Ŷ and Ŷ R be the
vectors of fitted values for the full and reduced models, respectively. Then
the test statistic in the partial F test is
$$F_R = \left[\frac{SSE(R) - SSE(F)}{df_R - df_F}\right]/MSE(F) = \left[\frac{\|\hat{Y}\|^2 - \|\hat{Y}_R\|^2}{df_R - df_F}\right]/MSE(F) = \frac{SSE(R) - SSE(F)}{SSE(F)}\,\frac{n-p}{p-q} = \frac{R^2 - R_R^2}{1 - R^2}\,\frac{n-p}{p-q}.$$
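The statistic FR can also be computed by hand from the two residual sums of squares and checked against the anova(red, full) output shown earlier. The sketch below uses simulated data; the data and seed are assumptions made for illustration only.
# minimal sketch: partial F statistic computed by hand
set.seed(4)
n <- 60
x2 <- rnorm(n); x3 <- rnorm(n)
y <- 1 + x2 + 0.2*x3 + rnorm(n)
full <- lm(y ~ x2 + x3); red <- lm(y ~ x2)
SSEF <- sum(resid(full)^2); SSER <- sum(resid(red)^2)
dfF <- full$df.residual; dfR <- red$df.residual
FR <- ((SSER - SSEF)/(dfR - dfF)) / (SSEF/dfF)
FR
anova(red, full)    # the F column should match FR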

Definition 1.20. An FF plot is a plot of fitted values from 2 different


models or fitting methods. An RR plot is a plot of residuals from 2 different
models or fitting methods.

Six plots are useful diagnostics for the partial F test: the RR plot with
the full model residuals on the vertical axis and the reduced model residuals
on the horizontal axis, the FF plot with the full model fitted values on the
vertical axis, and always make the response and residual plots for the full
and reduced models. Suppose that the full model is a useful MLR model. If
the reduced model is good, then the response plots from the full and reduced
models should be very similar, visually. Similarly, the residual plots from
the full and reduced models should be very similar, visually. Finally, the
correlation of the plotted points in the RR and FF plots should be high,
≥ 0.95, say, and the plotted points in the RR and FF plots should cluster
tightly about the identity line. Add the identity line to both the RR and
FF plots as a visual aid. Also add the OLS line from regressing r on rR to
the RR plot (the OLS line is the identity line in the FF plot). If the reduced
model is good, then the OLS line should nearly coincide with the identity line
in that it should be difficult to see that the two lines intersect at the origin.
If the FF plot looks good but the RR plot does not, the reduced model may
be good if the main goal of the analysis is to predict Y.

1.3.3 The Wald t Test

Often investigators hope to examine βk in order to determine the importance


of the predictor xk in the model; however, βk is the coefficient for xk given
that the other predictors are in the model. Hence βk depends strongly on
the other predictors in the model. Suppose that the model has an intercept:
x1 ≡ 1. The predictor xk is highly correlated with the other predictors if
the OLS regression of xk on x1 , ..., xk−1, xk+1 , ..., xp has a high coefficient of
determination R2k . If this is the case, then often xk is not needed in the model
given that the other predictors are in the model. If at least one R2k is high
for k ≥ 2, then there is multicollinearity among the predictors.
As an example, suppose that Y = height, x1 ≡ 1, x2 = left leg length, and
x3 = right leg length. Then x2 should not be needed given x3 is in the model
and β2 = 0 is reasonable. Similarly β3 = 0 is reasonable. On the other hand,
if the model only contains x1 and x2 , then x2 is extremely important with β2
near 2. If the model contains x1 , x2 , x3, x4 = height at shoulder, x5 = right
arm length, x6 = head length, and x7 = length of back, then R2i may be high
for each i ≥ 2. Hence xi is not needed in the MLR model for Y given that
the other predictors are in the model.

Definition 1.21. The 100 (1 − δ) % CI for βk is β̂k ± tn−p,1−δ/2 se(β̂k ).


If the degrees of freedom d = n − p ≥ 30, the N(0,1) cutoff z1−δ/2 may be
used.

Know how to do the 4 step Wald t–test of hypotheses.


i) State the hypotheses H0 : βk = 0 HA : βk 6= 0.
ii) Find the test statistic to,k = β̂k /se(β̂k ) or obtain it from output.
iii) Find pval from output or use the t–table: pval =

2P (tn−p < −|to,k |) = 2P (tn−p > |to,k |).

Use the normal table or the d = Z line in the t–table if the degrees of freedom
d = n − p ≥ 30. Again pval is the estimated p–value.
iv) State whether you reject H0 or fail to reject H0 and give a nontechnical
sentence restating your conclusion in terms of the story problem.

Recall that H0 is rejected if the pval ≤ δ. As a benchmark for this textbook,


use δ = 0.05 if δ is not given. If H0 is rejected, then conclude that xk is needed
in the MLR model for Y given that the other predictors are in the model.
If you fail to reject H0 , then conclude that xk is not needed in the MLR
model for Y given that the other predictors are in the model. (Or there is
not enough evidence to conclude that xk is needed in the MLR model given
that the other predictors are in the model.) Note that xk could be a very
useful individual predictor, but may not be needed if other predictors are
added to the model.
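The Wald t statistics, their estimated p-values, and the corresponding confidence intervals are part of standard R output. A minimal sketch, with simulated data and settings that are assumptions for illustration:
# minimal sketch: Wald t tests and CIs for the coefficients
set.seed(5)
n <- 40
x2 <- rnorm(n); x3 <- rnorm(n)
y <- 1 + 2*x2 + rnorm(n)
out <- lm(y ~ x2 + x3)
summary(out)$coefficients    # columns: estimate, se, t value, pval
confint(out, level = 0.95)   # betahat_k +/- t_{n-p,0.975} se(betahat_k)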

1.3.4 The OLS Criterion

Fig. 1.4 The OLS Fit Minimizes the Sum of Squared Residuals. [Figure: panel a), OLS Minimizes Sum of Squared Vertical Deviations, plots Y versus OLSESP; panel b), This ESP Has a Much Larger Sum, plots Y versus BADESP.]

The OLS estimator $\hat{\beta}$ minimizes the OLS criterion
$$Q_{OLS}(\eta) = \sum_{i=1}^n r_i^2(\eta)$$
where the residual $r_i(\eta) = Y_i - x_i^T\eta$. In other words, let $r_i = r_i(\hat{\beta})$ be the OLS residuals. Then $\sum_{i=1}^n r_i^2 \leq \sum_{i=1}^n r_i^2(\eta)$ for any $p \times 1$ vector $\eta$, and the equality holds iff $\eta = \hat{\beta}$ if the $n \times p$ design matrix $X$ is of full rank $p \leq n$. In particular, if $X$ has full rank $p$, then $\sum_{i=1}^n r_i^2 < \sum_{i=1}^n r_i^2(\beta) = \sum_{i=1}^n e_i^2$ even if the MLR model $Y = X\beta + e$ is a good approximation to the data.
Warning: Often $\eta$ is replaced by $\beta$: $Q_{OLS}(\beta) = \sum_{i=1}^n r_i^2(\beta)$. This notation is often used in Statistics when there are estimating equations. For example, maximum likelihood estimation uses the log likelihood $\log(L(\theta))$ where $\theta$ is the vector of unknown parameters and the dummy variable in the log likelihood.

Example 1.4. When a model depends on the predictors x only through


the linear combination xT β, then xT β is called a sufficient predictor and
xT β̂ is called an estimated sufficient predictor (ESP). For OLS the model is
Y = xT β + e, and the fitted value Ŷ = ESP . To illustrate the OLS criterion
graphically, consider the Gladstone (1905) data where we used brain weight as
the response. A constant, x2 = age, x3 = sex, and x4 = (size)1/3 were used
as predictors after deleting five “infants” from the data set. In Figure 1.4a, the
OLS response plot of the OLS ESP = Ŷ versus Y is shown. The vertical devi-
ations from the identity line are the residuals, and OLS minimizes the sum of
squared residuals. If any other ESP xT η is plotted versus Y , then the vertical
deviations from the identity line are the residuals ri (η). For this data, the OLS
estimator β̂ = (498.726, −1.597, 30.462, 0.696)T . Figure 1.4b shows the re-
sponse plot using the ESP xT η where η = (498.726, −1.597, 30.462, 0.796)T .
Hence only the coefficient for x4 was changed; however, the residuals ri (η) in
the resulting plot are much larger in magnitude on average than the residuals
in the OLS response plot. With slightly larger changes in the OLS ESP, the
resulting η will be such that the squared residuals are massive.

Theorem 1.8. The OLS estimator β̂ is the unique minimizer of the OLS
criterion if X has full rank p ≤ n.
Proof: Seber and Lee (2003, pp. 36-37). Recall that the hat matrix
H = X(X T X)−1 X T and notice that (I −H)T = I −H, that (I −H)H = 0
and that HX = X. Let η be any p × 1 vector. Then

$$(Y - X\hat{\beta})^T(X\hat{\beta} - X\eta) = (Y - HY)^T(HY - HX\eta) = Y^T(I - H)H(Y - X\eta) = 0.$$
Thus $Q_{OLS}(\eta) = \|Y - X\eta\|^2 = \|Y - X\hat{\beta} + X\hat{\beta} - X\eta\|^2 = \|Y - X\hat{\beta}\|^2 + \|X\hat{\beta} - X\eta\|^2 + 2(Y - X\hat{\beta})^T(X\hat{\beta} - X\eta)$. Hence
$$\|Y - X\eta\|^2 = \|Y - X\hat{\beta}\|^2 + \|X\hat{\beta} - X\eta\|^2. \qquad (1.13)$$
So $\|Y - X\eta\|^2 \geq \|Y - X\hat{\beta}\|^2$ with equality iff $X(\hat{\beta} - \eta) = 0$
iff β̂ = η since X is full rank. 

Alternatively calculus can be used. Notice that $r_i(\eta) = Y_i - x_{i,1}\eta_1 - x_{i,2}\eta_2 - \cdots - x_{i,p}\eta_p$. Recall that $x_i^T$ is the $i$th row of $X$ while $v_j$ is the $j$th column. Since
$$Q_{OLS}(\eta) = \sum_{i=1}^n (Y_i - x_{i,1}\eta_1 - x_{i,2}\eta_2 - \cdots - x_{i,p}\eta_p)^2,$$
the $j$th partial derivative
$$\frac{\partial Q_{OLS}(\eta)}{\partial \eta_j} = -2\sum_{i=1}^n x_{i,j}(Y_i - x_{i,1}\eta_1 - x_{i,2}\eta_2 - \cdots - x_{i,p}\eta_p) = -2(v_j)^T(Y - X\eta)$$
for $j = 1, ..., p$. Combining these equations into matrix form, setting the derivative to zero and calling the solution $\hat{\beta}$ gives
$$X^T Y - X^T X\hat{\beta} = 0,$$
or
$$X^T X\hat{\beta} = X^T Y. \qquad (1.14)$$
Equation (1.14) is known as the normal equations. If X has full rank then
β̂ = (X T X)−1 X T Y . To show that β̂ is the global minimizer of the OLS
criterion, use the argument following Equation (1.13).
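The normal equations (1.14) can be solved directly and compared with the lm() coefficients. A minimal sketch with simulated data (the data, seed, and names are assumptions for illustration):
# minimal sketch: solve the normal equations X^T X betahat = X^T Y
set.seed(6)
n <- 30
x2 <- rnorm(n); x3 <- rnorm(n)
Y <- 2 + x2 - x3 + rnorm(n)
X <- cbind(1, x2, x3)                  # n x p design matrix, full rank
bhat <- solve(crossprod(X), crossprod(X, Y))
bhat
coef(lm(Y ~ x2 + x3))                  # same values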

1.3.5 The Location Model

The location model


Yi = µ + ei , i = 1, . . . , n (1.15)
is a special case of the multiple linear regression model where p = 1, X = 1,
and β = β1 = µ. This model contains a constant but no nontrivial predictors.
In the location model, $\hat{\beta}_{OLS} = \hat{\beta}_1 = \hat{\mu} = \overline{Y}$. To see this, notice that
$$Q_{OLS}(\eta) = \sum_{i=1}^n (Y_i - \eta)^2 \quad \text{and} \quad \frac{dQ_{OLS}(\eta)}{d\eta} = -2\sum_{i=1}^n (Y_i - \eta).$$
Setting the derivative equal to 0 and calling the unique solution $\hat{\mu}$ gives $\sum_{i=1}^n Y_i = n\hat{\mu}$ or $\hat{\mu} = \overline{Y}$. The second derivative
$$\frac{d^2 Q_{OLS}(\eta)}{d\eta^2} = 2n > 0,$$
hence $\hat{\mu}$ is the global minimizer.

1.3.6 Simple Linear Regression

The simple linear regression (SLR) model is

Yi = β1 + β2 Xi + ei = α + βXi + ei

where the ei are iid with E(ei ) = 0 and VAR(ei ) = σ 2 for i = 1, ..., n. The Yi
and ei are random variables while the Xi are treated as known constants.
The SLR model is a special case of the MLR model with p = 2, xi,1 ≡ 1, and
xi,2 = Xi . For SLR, E(Yi ) = β1 + β2 Xi and the line E(Y ) = β1 + β2 X is the
regression function. VAR(Yi ) = σ 2 .
For SLR, the least squares estimators $\hat{\beta}_1$ and $\hat{\beta}_2$ minimize the least squares criterion $Q(\eta_1, \eta_2) = \sum_{i=1}^n (Y_i - \eta_1 - \eta_2 X_i)^2$. For a fixed $\eta_1$ and $\eta_2$, $Q$ is the sum of the squared vertical deviations from the line $Y = \eta_1 + \eta_2 X$. The least squares (OLS) line is $\hat{Y} = \hat{\beta}_1 + \hat{\beta}_2 X$ where the slope
$$\hat{\beta}_2 \equiv \hat{\beta} = \frac{\sum_{i=1}^n (X_i - \overline{X})(Y_i - \overline{Y})}{\sum_{i=1}^n (X_i - \overline{X})^2}$$
and the intercept $\hat{\beta}_1 \equiv \hat{\alpha} = \overline{Y} - \hat{\beta}_2\overline{X}$.


By the chain rule,
$$\frac{\partial Q}{\partial \eta_1} = -2\sum_{i=1}^n (Y_i - \eta_1 - \eta_2 X_i) \quad \text{and} \quad \frac{\partial^2 Q}{\partial \eta_1^2} = 2n.$$
Similarly,
$$\frac{\partial Q}{\partial \eta_2} = -2\sum_{i=1}^n X_i(Y_i - \eta_1 - \eta_2 X_i) \quad \text{and} \quad \frac{\partial^2 Q}{\partial \eta_2^2} = 2\sum_{i=1}^n X_i^2.$$
Setting the first partial derivatives to zero and calling the solutions $\hat{\beta}_1$ and $\hat{\beta}_2$ shows that the OLS estimators $\hat{\beta}_1$ and $\hat{\beta}_2$ satisfy the normal equations:
$$\sum_{i=1}^n Y_i = n\hat{\beta}_1 + \hat{\beta}_2\sum_{i=1}^n X_i \quad \text{and} \quad \sum_{i=1}^n X_i Y_i = \hat{\beta}_1\sum_{i=1}^n X_i + \hat{\beta}_2\sum_{i=1}^n X_i^2.$$

The first equation gives β̂1 = Y − β̂2 X.


There are several equivalent formulas for the slope $\hat{\beta}_2$:
$$\hat{\beta}_2 \equiv \hat{\beta} = \frac{\sum_{i=1}^n (X_i - \overline{X})(Y_i - \overline{Y})}{\sum_{i=1}^n (X_i - \overline{X})^2} = \frac{\sum_{i=1}^n X_i Y_i - \frac{1}{n}(\sum_{i=1}^n X_i)(\sum_{i=1}^n Y_i)}{\sum_{i=1}^n X_i^2 - \frac{1}{n}(\sum_{i=1}^n X_i)^2} = \frac{\sum_{i=1}^n (X_i - \overline{X})Y_i}{\sum_{i=1}^n (X_i - \overline{X})^2} = \frac{\sum_{i=1}^n X_i Y_i - n\overline{X}\,\overline{Y}}{\sum_{i=1}^n X_i^2 - n(\overline{X})^2} = \hat{\rho}\, s_Y/s_X.$$
Here the sample correlation $\hat{\rho} \equiv \hat{\rho}(X,Y) = corr(X,Y) =$
$$\frac{\sum_{i=1}^n (X_i - \overline{X})(Y_i - \overline{Y})}{(n-1)s_X s_Y} = \frac{\sum_{i=1}^n (X_i - \overline{X})(Y_i - \overline{Y})}{\sqrt{\sum_{i=1}^n (X_i - \overline{X})^2 \sum_{i=1}^n (Y_i - \overline{Y})^2}}$$
where the sample standard deviation
$$s_W = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (W_i - \overline{W})^2}$$
for $W = X, Y$. Notice that the term $n - 1$ that occurs in the denominator of $\hat{\rho}$, $s_Y^2$, and $s_X^2$ can be replaced by $n$ as long as $n$ is used in all 3 quantities. Also notice that the slope $\hat{\beta}_2 = \sum_{i=1}^n k_i Y_i$ where the constants
$$k_i = \frac{X_i - \overline{X}}{\sum_{j=1}^n (X_j - \overline{X})^2}. \qquad (1.16)$$
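The equivalent slope formulas are easy to check numerically; the sketch below, with simulated (X, Y) data that are an assumption for illustration, compares the moment formula, the correlation form, and the lm() slope.
# minimal sketch: equivalent formulas for the SLR slope
set.seed(7)
n <- 50
X <- rnorm(n); Y <- 3 + 2*X + rnorm(n)
b2a <- sum((X - mean(X))*(Y - mean(Y))) / sum((X - mean(X))^2)
b2b <- cor(X, Y) * sd(Y) / sd(X)
c(b2a, b2b, coef(lm(Y ~ X))[2])    # all three agree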

1.3.7 The No Intercept MLR Model

The no intercept MLR model, also known as regression through the origin, is
still Y = Xβ+e, but there is no intercept in the model, so X does not contain
a column of ones 1. Hence the intercept term β1 = β1 (1) is replaced by β1 xi1 .
Software gives output for this model if the “no intercept” or “intercept = F”
option is selected. For the no intercept model, the assumption E(e) = 0 is
important, and this assumption is rather strong.
Many of the usual MLR results still hold: $\hat{\beta}_{OLS} = (X^T X)^{-1} X^T Y$, the vector of predicted fitted values $\hat{Y} = X\hat{\beta}_{OLS} = HY$ where the hat matrix $H = X(X^T X)^{-1}X^T$ provided the inverse exists, and the vector of residuals is $r = Y - \hat{Y}$. The response plot and residual plot are made in the same way and should be made before performing inference.
The main difference in the output is the ANOVA table. The ANOVA F
test in Section 1.3.1 tests H0 : β2 = · · · = βp = 0. The test in this subsection
tests H0 : β1 = · · · = βp = 0 ≡ H0 : β = 0. The following definition and test


follows Guttman (1982, p. 147) closely.

Definition 1.22. Assume that Y = Xβ + e where the ei are iid. Assume


that it is desired to test $H_0: \beta = 0$ versus $H_A: \beta \neq 0$.
a) The uncorrected total sum of squares
$$SST = \sum_{i=1}^n Y_i^2. \qquad (1.17)$$
b) The model sum of squares
$$SSM = \sum_{i=1}^n \hat{Y}_i^2. \qquad (1.18)$$
c) The residual sum of squares or error sum of squares is
$$SSE = \sum_{i=1}^n (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^n r_i^2. \qquad (1.19)$$

d) The degrees of freedom (df) for SSM is p, the df for SSE is n − p and
the df for SST is n. The mean squares are MSE = SSE/(n − p) and MSM =
SSM/p.

The ANOVA table given for the “no intercept” or “intercept = F” option
is below.

Summary Analysis of Variance Table

Source      df       SS     MS     F               p-value
Model       p        SSM    MSM    F0 = MSM/MSE    for H0: β = 0
Residual    n − p    SSE    MSE

The 4 step no intercept ANOVA F test for β = 0 is below.


i) State the hypotheses H0 : β = 0, HA : β 6= 0.
ii) Find the test statistic F0 = M SM/M SE or obtain it from output.
iii) Find the pval from output or use the F –table: pval = P (Fp,n−p > F0 ).
iv) State whether you reject H0 or fail to reject H0 . If H0 is rejected, conclude
that there is an MLR relationship between Y and the predictors x1 , ..., xp . If
you fail to reject H0 , conclude that there is not an MLR relationship between
Y and the predictors x1 , ..., xp . (Or there is not enough evidence to conclude
that there is an MLR relationship between Y and the predictors.)
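In R the no intercept model is fit by adding -1 (or +0) to the formula. The sketch below is an illustration with simulated data (the data and seed are assumptions); summary() then reports the F statistic for H0: β = 0 on p and n − p degrees of freedom, and SST = SSM + SSE can be checked directly.
# minimal sketch: regression through the origin
set.seed(8)
n <- 40
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 2*x1 - x2 + rnorm(n)
out <- lm(y ~ x1 + x2 - 1)    # no intercept
summary(out)                  # F statistic for H0: beta = 0 with p and n-p df
sum(fitted(out)^2)            # SSM
sum(resid(out)^2)             # SSE
sum(y^2)                      # SST = SSM + SSE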

1.4 The Multivariate Normal Distribution

For much of this book, X is an n × p design matrix, but this section will usu-
ally use the notation X = (X1 , ..., Xp)T and Y for the random vectors, and
x = (x1 , ..., xp)T for the observed value of the random vector. This notation
will be useful to avoid confusion when studying conditional distributions such
as Y |X = x. It can be shown that Σ is positive semidefinite and symmetric.

Definition 1.23: Rao (1965, p. 437). A p × 1 random vector X has


a p−dimensional multivariate normal distribution Np (µ, Σ) iff tT X has a
univariate normal distribution for any p × 1 vector t.

If $\Sigma$ is positive definite, then $X$ has a pdf
$$f(z) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\, e^{-(1/2)(z-\mu)^T\Sigma^{-1}(z-\mu)} \qquad (1.20)$$

where |Σ|1/2 is the square root of the determinant of Σ. Note that if p = 1,


then the quadratic form in the exponent is (z − µ)(σ 2 )−1 (z − µ) and X has
the univariate N (µ, σ 2 ) pdf. If Σ is positive semidefinite but not positive
definite, then X has a degenerate distribution. For example, the univariate
N (0, 02) distribution is degenerate (the point mass at 0).

Definition 1.24. The population mean of a random p × 1 vector X =


(X1 , ..., Xp)T is
E(X) = (E(X1 ), ..., E(Xp))T
and the p × p population covariance matrix

Cov(X) = E(X − E(X))(X − E(X))T = (σij ).

That is, the ij entry of Cov(X) is Cov(Xi , Xj ) = σij .

The covariance matrix is also called the variance–covariance matrix and


variance matrix. Sometimes the notation Var(X) is used. Note that Cov(X)
is a symmetric positive semidefinite matrix. If X and Y are p × 1 random
vectors, a a conformable constant vector, and A and B are conformable
constant matrices, then

E(a + X) = a + E(X) and E(X + Y ) = E(X) + E(Y ) (1.21)

and
E(AX) = AE(X) and E(AXB) = AE(X)B. (1.22)
Thus
Cov(a + AX) = Cov(AX) = ACov(X)AT . (1.23)

Some important properties of multivariate normal (MVN) distributions are


given in the following three theorems. These theorems can be proved using
results from Johnson and Wichern (1988, pp. 127-132) or Severini (2005, ch.
8).

Theorem 1.9. a) If X ∼ Np (µ, Σ), then E(X) = µ and

Cov(X) = Σ.

b) If X ∼ Np (µ, Σ), then any linear combination tT X = t1 X1 + · · · +


tp Xp ∼ N1 (tT µ, tT Σt). Conversely, if tT X ∼ N1 (tT µ, tT Σt) for every p × 1
vector t, then X ∼ Np (µ, Σ).
c) The joint distribution of independent normal random variables
is MVN. If X1 , ..., Xp are independent univariate normal N (µi , σi2 ) random
vectors, then X = (X1 , ..., Xp)T is Np (µ, Σ) where µ = (µ1 , ..., µp)T and
Σ = diag(σ12 , ..., σp2) (so the off diagonal entries σij = 0 while the diagonal
entries of Σ are σii = σi2 ).
d) If X ∼ Np (µ, Σ) and if A is a q×p matrix, then AX ∼ Nq (Aµ, AΣAT ).
If a is a p × 1 vector of constants and b is a constant, then a + bX ∼
Np (a + bµ, b2 Σ). (Note that bX = bI p X with A = bI p .)

It will be useful to partition X, µ, and Σ. Let X 1 and µ1 be q ×1 vectors,


let X 2 and µ2 be (p − q) × 1 vectors, let Σ 11 be a q × q matrix, let Σ 12
be a q × (p − q) matrix, let Σ 21 be a (p − q) × q matrix, and let Σ 22 be a
$(p - q) \times (p - q)$ matrix. Then
$$X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}, \quad \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad \text{and} \quad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.$$

Theorem 1.10. a) All subsets of a MVN are MVN: (Xk1 , ..., Xkq )T
∼ Nq (µ̃, Σ̃) where µ̃i = E(Xki ) and Σ̃ ij = Cov(Xki , Xkj ). In particular,
X 1 ∼ Nq (µ1 , Σ 11 ) and X 2 ∼ Np−q (µ2 , Σ 22 ).
b) If X 1 and X 2 are independent, then Cov(X 1 , X 2 ) = Σ 12 =
E[(X 1 − E(X 1 ))(X 2 − E(X 2 ))T ] = 0, a q × (p − q) matrix of zeroes.
c) If X ∼ Np (µ, Σ), then X 1 and X 2 are independent iff Σ 12 = 0.
d) If $X_1 \sim N_q(\mu_1, \Sigma_{11})$ and $X_2 \sim N_{p-q}(\mu_2, \Sigma_{22})$ are independent, then
$$\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N_p\left(\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{22} \end{pmatrix}\right).$$

Theorem 1.11. The conditional distribution of a MVN is MVN. If


X ∼ Np (µ, Σ), then the conditional distribution of X 1 given that X 2 = x2
is multivariate normal with mean $\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)$ and covariance matrix $\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$. That is,
$$X_1 | X_2 = x_2 \sim N_q(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}).$$

Example 1.5. Let $p = 2$ and let $(Y, X)^T$ have a bivariate normal distribution. That is,
$$\begin{pmatrix} Y \\ X \end{pmatrix} \sim N_2\left(\begin{pmatrix} \mu_Y \\ \mu_X \end{pmatrix}, \begin{pmatrix} \sigma_Y^2 & \text{Cov}(Y,X) \\ \text{Cov}(X,Y) & \sigma_X^2 \end{pmatrix}\right).$$
Also, recall that the population correlation between $X$ and $Y$ is given by
$$\rho(X,Y) = \frac{\text{Cov}(X,Y)}{\sqrt{\text{VAR}(X)}\sqrt{\text{VAR}(Y)}} = \frac{\sigma_{X,Y}}{\sigma_X\sigma_Y}$$
if $\sigma_X > 0$ and $\sigma_Y > 0$. Then $Y|X = x \sim N(E(Y|X=x), \text{VAR}(Y|X=x))$ where the conditional mean
$$E(Y|X=x) = \mu_Y + \text{Cov}(Y,X)\frac{1}{\sigma_X^2}(x - \mu_X) = \mu_Y + \rho(X,Y)\sqrt{\frac{\sigma_Y^2}{\sigma_X^2}}\,(x - \mu_X)$$
and the conditional variance
$$\text{VAR}(Y|X=x) = \sigma_Y^2 - \text{Cov}(X,Y)\frac{1}{\sigma_X^2}\text{Cov}(X,Y) = \sigma_Y^2 - \rho(X,Y)\sqrt{\frac{\sigma_Y^2}{\sigma_X^2}}\,\rho(X,Y)\sqrt{\sigma_X^2}\sqrt{\sigma_Y^2} = \sigma_Y^2 - \rho^2(X,Y)\sigma_Y^2 = \sigma_Y^2[1 - \rho^2(X,Y)].$$
Also $aX + bY$ is univariate normal with mean $a\mu_X + b\mu_Y$ and variance
$$a^2\sigma_X^2 + b^2\sigma_Y^2 + 2ab\,\text{Cov}(X,Y).$$
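A short simulation can illustrate Example 1.5. The sketch below generates bivariate normal data directly from the conditional representation (the parameter values and seed are assumptions made for illustration), then checks that the OLS slope of Y on X is near Cov(Y,X)/σ_X^2 and that the residual variance is near σ_Y^2[1 − ρ^2].
# minimal sketch: conditional mean and variance of a bivariate normal
set.seed(9)
n <- 100000; muX <- 1; muY <- 2; sX <- 2; sY <- 3; rho <- 0.6
X <- muX + sX*rnorm(n)
Y <- muY + rho*(sY/sX)*(X - muX) + sqrt(1 - rho^2)*sY*rnorm(n)
c(coef(lm(Y ~ X))[2], rho*sY/sX)                 # slope Cov(Y,X)/sigma_X^2
c(var(resid(lm(Y ~ X))), sY^2*(1 - rho^2))       # VAR(Y | X = x)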

Remark 1.4. There are several common misconceptions. First, it is not


true that every linear combination tT X of normal random variables
is a normal random variable, and it is not true that all uncorrelated
normal random variables are independent. The key condition in The-
orem 1.9b and Theorem 1.10c is that the joint distribution of X is MVN. It
is possible that X1 , X2 , ..., Xp each has a marginal distribution that is uni-
variate normal, but the joint distribution of X is not MVN. See Seber and
Lee (2003, p. 23), and examine the following example from Rohatgi (1976,
p. 229). Suppose that the joint pdf of X and Y is a mixture of two bivariate
normal distributions both with EX = EY = 0 and VAR(X) = VAR(Y ) = 1,
but Cov(X,Y) = ±ρ. Hence
$$f(x,y) = \frac{1}{2}\,\frac{1}{2\pi\sqrt{1-\rho^2}}\exp\left(\frac{-1}{2(1-\rho^2)}(x^2 - 2\rho xy + y^2)\right) + \frac{1}{2}\,\frac{1}{2\pi\sqrt{1-\rho^2}}\exp\left(\frac{-1}{2(1-\rho^2)}(x^2 + 2\rho xy + y^2)\right) \equiv \frac{1}{2}f_1(x,y) + \frac{1}{2}f_2(x,y)$$
where $x$ and $y$ are real and $0 < \rho < 1$. Since both marginal distributions of $f_i(x,y)$ are N(0,1) for $i = 1$ and 2 by Theorem 1.10 a), the marginal distributions of $X$ and $Y$ are N(0,1). Since $\int\!\int xy f_i(x,y)\,dx\,dy = \rho$ for $i = 1$ and $-\rho$ for $i = 2$, $X$ and $Y$ are uncorrelated, but $X$ and $Y$ are not independent since $f(x,y) \neq f_X(x)f_Y(y)$.

Remark 1.5. In Theorem 1.11, suppose that X = (Y, X2 , ..., Xp)T . Let
X1 = Y and X 2 = (X2 , ..., Xp)T . Then E[Y |X 2 ] = β1 + β2 X2 + · · · + βp Xp
and VAR[Y |X 2 ] is a constant that does not depend on X 2 . Hence Y |X 2 =
β1 + β2 X2 + · · · + βp Xp + e follows the multiple linear regression model.

1.5 Large Sample Theory

The first three subsections will review large sample theory for the univariate
case, then multivariate theory will be given.

1.5.1 The CLT and the Delta Method

Large sample theory, also called asymptotic theory, is used to approximate


the distribution of an estimator when the sample size n is large. This the-
ory is extremely useful if the exact sampling distribution of the estimator is
complicated or unknown. To use this theory, one must determine what the
estimator is estimating, the rate of convergence, the asymptotic distribution,
and how large n must be for the approximation to be useful. Moreover, the
(asymptotic) standard error (SE), an estimator of the asymptotic standard
deviation, must be computable if the estimator is to be useful for inference.
Often the bootstrap can be used to compute the SE.

Theorem 1.12: the Central Limit Theorem (CLT). Let $Y_1, ..., Y_n$ be iid with $E(Y) = \mu$ and $\text{VAR}(Y) = \sigma^2$. Let the sample mean $\overline{Y}_n = \frac{1}{n}\sum_{i=1}^n Y_i$. Then
$$\sqrt{n}(\overline{Y}_n - \mu) \xrightarrow{D} N(0, \sigma^2).$$
Hence
$$\sqrt{n}\left(\frac{\overline{Y}_n - \mu}{\sigma}\right) = \sqrt{n}\left(\frac{\sum_{i=1}^n Y_i - n\mu}{n\sigma}\right) \xrightarrow{D} N(0,1).$$
Note that the sample mean is estimating the population mean $\mu$ with a $\sqrt{n}$ convergence rate, the asymptotic distribution is normal, and the SE $= S/\sqrt{n}$
where S is the sample standard deviation. For distributions “close” to the


normal distribution, the central limit theorem provides a good approximation
if the sample size n ≥ 30. Hesterberg (2014, pp. 41, 66) suggests n ≥ 5000
is needed for moderately skewed distributions. A special case of the CLT is
proven after Theorem 1.25.
Notation. The notation $X \sim Y$ and $X \stackrel{D}{=} Y$ both mean that the random variables $X$ and $Y$ have the same distribution. Hence $F_X(y) = F_Y(y)$ for all real $y$. The notation $Y_n \xrightarrow{D} X$ means that for large $n$ we can approximate the cdf of $Y_n$ by the cdf of $X$. The distribution of $X$ is the limiting distribution or asymptotic distribution of $Y_n$. For the CLT, notice that
$$Z_n = \sqrt{n}\left(\frac{\overline{Y}_n - \mu}{\sigma}\right) = \frac{\overline{Y}_n - \mu}{\sigma/\sqrt{n}}$$
is the z–score of $\overline{Y}$. If $Z_n \xrightarrow{D} N(0,1)$, then the notation $Z_n \approx N(0,1)$, also written as $Z_n \sim AN(0,1)$, means approximate the cdf of $Z_n$ by the standard normal cdf. See Definition 1.25. Similarly, the notation
$$\overline{Y}_n \approx N(\mu, \sigma^2/n),$$
also written as $\overline{Y}_n \sim AN(\mu, \sigma^2/n)$, means approximate the cdf of $\overline{Y}_n$ as if $\overline{Y}_n \sim N(\mu, \sigma^2/n)$. The distribution of $X$ does not depend on $n$, but the approximate distribution $\overline{Y}_n \approx N(\mu, \sigma^2/n)$ does depend on $n$.

The two main applications of the CLT are to give the limiting distribution of $\sqrt{n}(\overline{Y}_n - \mu)$ and the limiting distribution of $\sqrt{n}(Y_n/n - \mu_X)$ for a random variable $Y_n$ such that $Y_n = \sum_{i=1}^n X_i$ where the $X_i$ are iid with $E(X) = \mu_X$ and $\text{VAR}(X) = \sigma_X^2$.

Example 1.6. a) Let $Y_1, ..., Y_n$ be iid Ber($\rho$). Then $E(Y) = \rho$ and $\text{VAR}(Y) = \rho(1-\rho)$. (The Bernoulli ($\rho$) distribution is the binomial (1,$\rho$) distribution.) Hence
$$\sqrt{n}(\overline{Y}_n - \rho) \xrightarrow{D} N(0, \rho(1-\rho))$$
by the CLT.
b) Now suppose that $Y_n \sim BIN(n, \rho)$. Then $Y_n \stackrel{D}{=} \sum_{i=1}^n X_i$ where $X_1, ..., X_n$ are iid Ber($\rho$). Hence
$$\sqrt{n}\left(\frac{Y_n}{n} - \rho\right) \xrightarrow{D} N(0, \rho(1-\rho))$$
since
$$\sqrt{n}\left(\frac{Y_n}{n} - \rho\right) \stackrel{D}{=} \sqrt{n}(\overline{X}_n - \rho) \xrightarrow{D} N(0, \rho(1-\rho))$$
by a).
c) Now suppose that $Y_n \sim BIN(k_n, \rho)$ where $k_n \to \infty$ as $n \to \infty$. Then
$$\sqrt{k_n}\left(\frac{Y_n}{k_n} - \rho\right) \approx N(0, \rho(1-\rho))$$
or
$$\frac{Y_n}{k_n} \approx N\left(\rho, \frac{\rho(1-\rho)}{k_n}\right) \quad \text{or} \quad Y_n \approx N(k_n\rho, k_n\rho(1-\rho)).$$

Theorem 1.13: the Delta Method. If $g$ does not depend on $n$, $g'(\theta) \neq 0$, and
$$\sqrt{n}(T_n - \theta) \xrightarrow{D} N(0, \sigma^2),$$
then
$$\sqrt{n}(g(T_n) - g(\theta)) \xrightarrow{D} N(0, \sigma^2[g'(\theta)]^2).$$

Example 1.7. Let $Y_1, ..., Y_n$ be iid with $E(Y) = \mu$ and $\text{VAR}(Y) = \sigma^2$. Then by the CLT,
$$\sqrt{n}(\overline{Y}_n - \mu) \xrightarrow{D} N(0, \sigma^2).$$
Let $g(\mu) = \mu^2$. Then $g'(\mu) = 2\mu \neq 0$ for $\mu \neq 0$. Hence
$$\sqrt{n}((\overline{Y}_n)^2 - \mu^2) \xrightarrow{D} N(0, 4\sigma^2\mu^2)$$
for $\mu \neq 0$ by the delta method.
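The delta method in Example 1.7 can be illustrated by simulation: for iid data with mean µ ≠ 0, the variance of (Ybar_n)^2 should be near 4σ^2µ^2/n. The sketch below is an illustration under assumed parameter values and seed, not a result from the text.
# minimal sketch: delta method for g(mu) = mu^2
set.seed(10)
n <- 200; mu <- 3; sigma <- 2; nruns <- 5000
g <- replicate(nruns, mean(rnorm(n, mu, sigma))^2)
var(g)                 # simulated variance of (Ybar_n)^2
4*sigma^2*mu^2/n       # delta method approximation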

Example 1.8. Let $X \sim$ Binomial($n, p$) where the positive integer $n$ is large and $0 < p < 1$. Find the limiting distribution of $\sqrt{n}\left[\left(\frac{X}{n}\right)^2 - p^2\right]$.
Solution. Example 1.6b gives the limiting distribution of $\sqrt{n}\left(\frac{X}{n} - p\right)$. Let $g(p) = p^2$. Then $g'(p) = 2p$ and by the delta method,
$$\sqrt{n}\left[\left(\frac{X}{n}\right)^2 - p^2\right] = \sqrt{n}\left[g\left(\frac{X}{n}\right) - g(p)\right] \xrightarrow{D} N(0, p(1-p)(g'(p))^2) = N(0, p(1-p)4p^2) = N(0, 4p^3(1-p)).$$
Example 1.9. Let $X_n \sim$ Poisson($n\lambda$) where the positive integer $n$ is large and $\lambda > 0$.
a) Find the limiting distribution of $\sqrt{n}\left(\frac{X_n}{n} - \lambda\right)$.
b) Find the limiting distribution of $\sqrt{n}\left(\sqrt{\frac{X_n}{n}} - \sqrt{\lambda}\right)$.
Solution. a) $X_n \stackrel{D}{=} \sum_{i=1}^n Y_i$ where the $Y_i$ are iid Poisson($\lambda$). Hence $E(Y) = \lambda = Var(Y)$. Thus by the CLT,
$$\sqrt{n}\left(\frac{X_n}{n} - \lambda\right) \stackrel{D}{=} \sqrt{n}\left(\frac{\sum_{i=1}^n Y_i}{n} - \lambda\right) \xrightarrow{D} N(0, \lambda).$$
b) Let $g(\lambda) = \sqrt{\lambda}$. Then $g'(\lambda) = \frac{1}{2\sqrt{\lambda}}$ and by the delta method,
$$\sqrt{n}\left(\sqrt{\frac{X_n}{n}} - \sqrt{\lambda}\right) = \sqrt{n}\left(g\left(\frac{X_n}{n}\right) - g(\lambda)\right) \xrightarrow{D} N(0, \lambda(g'(\lambda))^2) = N\left(0, \lambda\frac{1}{4\lambda}\right) = N\left(0, \frac{1}{4}\right).$$
Example 1.10. Let $Y_1, ..., Y_n$ be independent and identically distributed (iid) from a Gamma($\alpha, \beta$) distribution.
a) Find the limiting distribution of $\sqrt{n}\,(\overline{Y} - \alpha\beta)$.
b) Find the limiting distribution of $\sqrt{n}\,((\overline{Y})^2 - c)$ for appropriate constant $c$.
Solution: a) Since $E(Y) = \alpha\beta$ and $V(Y) = \alpha\beta^2$, by the CLT
$$\sqrt{n}\,(\overline{Y} - \alpha\beta) \xrightarrow{D} N(0, \alpha\beta^2).$$
b) Let $\mu = \alpha\beta$ and $\sigma^2 = \alpha\beta^2$. Let $g(\mu) = \mu^2$ so $g'(\mu) = 2\mu$ and $[g'(\mu)]^2 = 4\mu^2 = 4\alpha^2\beta^2$. Then by the delta method, $\sqrt{n}\,((\overline{Y})^2 - c) \xrightarrow{D} N(0, \sigma^2[g'(\mu)]^2) = N(0, 4\alpha^3\beta^4)$ where $c = \mu^2 = \alpha^2\beta^2$.

1.5.2 Modes of Convergence and Consistency

Definition 1.25. Let $\{Z_n, n = 1, 2, ...\}$ be a sequence of random variables with cdfs $F_n$, and let $X$ be a random variable with cdf $F$. Then $Z_n$ converges in distribution to $X$, written $Z_n \xrightarrow{D} X$, or $Z_n$ converges in law to $X$, written $Z_n \xrightarrow{L} X$, if
$$\lim_{n\to\infty} F_n(t) = F(t)$$
at each continuity point $t$ of $F$. The distribution of $X$ is called the limiting distribution or the asymptotic distribution of $Z_n$.

An important fact is that the limiting distribution does not depend on the sample size $n$. Notice that the CLT and delta method give the limiting distributions of $Z_n = \sqrt{n}(\overline{Y}_n - \mu)$ and $Z_n = \sqrt{n}(g(T_n) - g(\theta))$, respectively.
Convergence in distribution is useful if the distribution of Xn is unknown
or complicated and the distribution of X is easy to use. Then for large n we
can approximate the probability that Xn is in an interval by the probability
D
that X is in the interval. To see this, notice that if Xn → X, then P (a <
Xn ≤ b) = Fn (b) − Fn (a) → F (b) − F (a) = P (a < X ≤ b) if F is continuous
at a and b.
Warning: convergence in distribution says that the cdf $F_n(t)$ of $X_n$ gets close to the cdf $F(t)$ of $X$ as $n \to \infty$ provided that $t$ is a continuity point of $F$. Hence for any $\epsilon > 0$ there exists $N_t$ such that if $n > N_t$, then $|F_n(t) - F(t)| < \epsilon$. Notice that $N_t$ depends on the value of $t$. Convergence in distribution does not imply that the random variables $X_n \equiv X_n(\omega)$ converge to the random variable $X \equiv X(\omega)$ for all $\omega$.

Example 1.11. Suppose that $X_n \sim U(-1/n, 1/n)$. Then the cdf $F_n(x)$ of $X_n$ is
$$F_n(x) = \begin{cases} 0, & x \leq \frac{-1}{n} \\ \frac{nx}{2} + \frac{1}{2}, & \frac{-1}{n} \leq x \leq \frac{1}{n} \\ 1, & x \geq \frac{1}{n}. \end{cases}$$
Sketching $F_n(x)$ shows that it has a line segment rising from 0 at $x = -1/n$ to 1 at $x = 1/n$ and that $F_n(0) = 0.5$ for all $n \geq 1$. Examining the cases $x < 0$, $x = 0$, and $x > 0$ shows that as $n \to \infty$,
$$F_n(x) \to \begin{cases} 0, & x < 0 \\ \frac{1}{2}, & x = 0 \\ 1, & x > 0. \end{cases}$$
Notice that the right hand side is not a cdf since right continuity does not hold at $x = 0$. Notice that if $X$ is a random variable such that $P(X = 0) = 1$, then $X$ has cdf
$$F_X(x) = \begin{cases} 0, & x < 0 \\ 1, & x \geq 0. \end{cases}$$
Since $x = 0$ is the only discontinuity point of $F_X(x)$ and since $F_n(x) \to F_X(x)$ for all continuity points of $F_X(x)$ (i.e. for $x \neq 0$),
$$X_n \xrightarrow{D} X.$$

Example 1.12. Suppose Yn ∼ U (0, n). Then Fn (t) = t/n for 0 < t ≤ n
and Fn (t) = 0 for t ≤ 0. Hence limn→∞ Fn (t) = 0 for t ≤ 0. If t > 0 and
n > t, then Fn (t) = t/n → 0 as n → ∞. Thus limn→∞ Fn (t) = 0 for all
t, and Yn does not converge in distribution to any random variable Y since
H(t) ≡ 0 is not a cdf.

Definition 1.26. A sequence of random variables $X_n$ converges in distribution to a constant $\tau(\theta)$, written $X_n \xrightarrow{D} \tau(\theta)$, if $X_n \xrightarrow{D} X$ where $P(X = \tau(\theta)) = 1$. The distribution of the random variable $X$ is said to be degenerate at $\tau(\theta)$ or to be a point mass at $\tau(\theta)$.

Definition 1.27. A sequence of random variables $X_n$ converges in probability to a constant $\tau(\theta)$, written $X_n \xrightarrow{P} \tau(\theta)$, if for every $\epsilon > 0$,
$$\lim_{n\to\infty} P(|X_n - \tau(\theta)| < \epsilon) = 1 \quad \text{or, equivalently,} \quad \lim_{n\to\infty} P(|X_n - \tau(\theta)| \geq \epsilon) = 0.$$
The sequence $X_n$ converges in probability to $X$, written $X_n \xrightarrow{P} X$, if $X_n - X \xrightarrow{P} 0$.
Notice that $X_n \xrightarrow{P} X$ if for every $\epsilon > 0$,
$$\lim_{n\to\infty} P(|X_n - X| < \epsilon) = 1, \quad \text{or, equivalently,} \quad \lim_{n\to\infty} P(|X_n - X| \geq \epsilon) = 0.$$

Definition 1.28. Let the parameter space Θ be the set of possible values
of θ. A sequence of estimators Tn of τ (θ) is consistent for τ (θ) if
P
Tn → τ (θ)

for every θ ∈ Θ. If Tn is consistent for τ (θ), then Tn is a consistent esti-


mator of τ (θ).

Consistency is a weak property that is usually satisfied by good estimators.


Tn is a consistent estimator for τ (θ) if the probability that Tn falls in any
neighborhood of τ (θ) goes to one, regardless of the value of θ ∈ Θ.

Definition 1.29. For a real number $r > 0$, $Y_n$ converges in rth mean to a random variable $Y$, written $Y_n \xrightarrow{r} Y$, if
$$E(|Y_n - Y|^r) \to 0$$
as $n \to \infty$. In particular, if $r = 2$, $Y_n$ converges in quadratic mean to $Y$, written $Y_n \xrightarrow{2} Y$ or $Y_n \xrightarrow{qm} Y$, if
$$E[(Y_n - Y)^2] \to 0$$
as $n \to \infty$.

Theorem 1.14: Generalized Chebyshev's Inequality. Let $u : \mathbb{R} \to [0, \infty)$ be a nonnegative function. If $E[u(Y)]$ exists then for any $c > 0$,
$$P[u(Y) \geq c] \leq \frac{E[u(Y)]}{c}.$$
If $\mu = E(Y)$ exists, then taking $u(y) = |y - \mu|^r$ and $\tilde{c} = c^r$ gives
Markov's Inequality: for $r > 0$ and any $c > 0$,
$$P[|Y - \mu| \geq c] = P[|Y - \mu|^r \geq c^r] \leq \frac{E[|Y - \mu|^r]}{c^r}.$$
If $r = 2$ and $\sigma^2 = \text{VAR}(Y)$ exists, then we obtain
Chebyshev's Inequality:
$$P[|Y - \mu| \geq c] \leq \frac{\text{VAR}(Y)}{c^2}.$$
Proof. The proof is given for pdfs. For pmfs, replace the integrals by sums. Now
$$E[u(Y)] = \int_{\mathbb{R}} u(y)f(y)\,dy = \int_{\{y: u(y) \geq c\}} u(y)f(y)\,dy + \int_{\{y: u(y) < c\}} u(y)f(y)\,dy \geq \int_{\{y: u(y) \geq c\}} u(y)f(y)\,dy$$
since the integrand $u(y)f(y) \geq 0$. Hence
$$E[u(Y)] \geq c\int_{\{y: u(y) \geq c\}} f(y)\,dy = cP[u(Y) \geq c]. \quad \square$$

The following theorem gives sufficient conditions for Tn to be a consistent


estimator of τ (θ). Notice that Eθ [(Tn − τ (θ))2 ] = M SEτ(θ) (Tn ) → 0 for all
qm
θ ∈ Θ is equivalent to Tn → τ (θ) for all θ ∈ Θ.

Theorem 1.15. a) If
$$\lim_{n\to\infty} MSE_{\tau(\theta)}(T_n) = 0$$
for all $\theta \in \Theta$, then $T_n$ is a consistent estimator of $\tau(\theta)$.
b) If
$$\lim_{n\to\infty} \text{VAR}_\theta(T_n) = 0 \quad \text{and} \quad \lim_{n\to\infty} E_\theta(T_n) = \tau(\theta)$$
for all $\theta \in \Theta$, then $T_n$ is a consistent estimator of $\tau(\theta)$.
Proof. a) Using Theorem 1.14 with $Y = T_n$, $u(T_n) = (T_n - \tau(\theta))^2$ and $c = \epsilon^2$ shows that for any $\epsilon > 0$,
$$P_\theta(|T_n - \tau(\theta)| \geq \epsilon) = P_\theta[(T_n - \tau(\theta))^2 \geq \epsilon^2] \leq \frac{E_\theta[(T_n - \tau(\theta))^2]}{\epsilon^2}.$$
Hence
$$\lim_{n\to\infty} E_\theta[(T_n - \tau(\theta))^2] = \lim_{n\to\infty} MSE_{\tau(\theta)}(T_n) \to 0$$
is a sufficient condition for $T_n$ to be a consistent estimator of $\tau(\theta)$.
b) Recall that
$$MSE_{\tau(\theta)}(T_n) = \text{VAR}_\theta(T_n) + [\text{Bias}_{\tau(\theta)}(T_n)]^2$$
where $\text{Bias}_{\tau(\theta)}(T_n) = E_\theta(T_n) - \tau(\theta)$. Since $MSE_{\tau(\theta)}(T_n) \to 0$ if both $\text{VAR}_\theta(T_n) \to 0$ and $\text{Bias}_{\tau(\theta)}(T_n) = E_\theta(T_n) - \tau(\theta) \to 0$, the result follows from a). $\square$

The following result shows that estimators that converge at a $n^\delta$ rate are consistent. Use this result and the delta method to show that $g(T_n)$ is a consistent estimator of $g(\theta)$. Note that b) follows from a) with $X_\theta \sim N(0, v(\theta))$. The WLLN shows that $\overline{Y}$ is a consistent estimator of $E(Y) = \mu$ if $E(Y)$ exists.

Theorem 1.16. a) Let $X_\theta$ be a random variable with distribution depending on $\theta$, and $0 < \delta \leq 1$. If
$$n^\delta(T_n - \tau(\theta)) \xrightarrow{D} X_\theta$$
then $T_n \xrightarrow{P} \tau(\theta)$.
b) If
$$\sqrt{n}(T_n - \tau(\theta)) \xrightarrow{D} N(0, v(\theta))$$
for all $\theta \in \Theta$, then $T_n$ is a consistent estimator of $\tau(\theta)$.

Definition 1.30. A sequence of random variables $X_n$ converges almost everywhere (or almost surely, or with probability 1) to $X$ if
$$P(\lim_{n\to\infty} X_n = X) = 1.$$
This type of convergence will be denoted by $X_n \xrightarrow{ae} X$.

Notation such as “Xn converges to X ae” will also be used. Sometimes “ae”
will be replaced with “as” or “wp1.” We say that Xn converges almost ev-
erywhere to τ (θ), written
ae
Xn → τ (θ),
if P (limn→∞ Xn = τ (θ)) = 1.

Theorem 1.17. Let $Y_n$ be a sequence of iid random variables with $E(Y_i) = \mu$. Then
a) Strong Law of Large Numbers (SLLN): $\overline{Y}_n \xrightarrow{ae} \mu$, and
b) Weak Law of Large Numbers (WLLN): $\overline{Y}_n \xrightarrow{P} \mu$.
Proof of WLLN when $V(Y_i) = \sigma^2$: By Chebyshev's inequality, for every $\epsilon > 0$,
$$P(|\overline{Y}_n - \mu| \geq \epsilon) \leq \frac{V(\overline{Y}_n)}{\epsilon^2} = \frac{\sigma^2}{n\epsilon^2} \to 0$$
as $n \to \infty$. $\square$
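The WLLN and CLT are easy to visualize by simulation. The sketch below uses exponential data with rate 1 (so µ = 1 and σ^2 = 1); the choice of distribution, sample sizes, and seed are assumptions made only for illustration.
# minimal sketch: sample means concentrate at mu, and sqrt(n)(Ybar - mu) is near normal
set.seed(11)
n <- 400; nruns <- 5000
ybar <- replicate(nruns, mean(rexp(n, rate = 1)))
mean(abs(ybar - 1) > 0.1)     # WLLN: small for large n
z <- sqrt(n)*(ybar - 1)
c(mean(z), var(z))            # CLT: near 0 and 1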

In proving consistency results, there is an infinite sequence of estimators


that depend on the sample size n. Hence the subscript n will be added to the
estimators.

Definition 1.31. Lehmann (1999, pp. 53-54): a) A sequence of random variables $W_n$ is tight or bounded in probability, written $W_n = O_P(1)$, if for every $\epsilon > 0$ there exist positive constants $D_\epsilon$ and $N_\epsilon$ such that
$$P(|W_n| \leq D_\epsilon) \geq 1 - \epsilon$$
for all $n \geq N_\epsilon$. Also $W_n = O_P(X_n)$ if $|W_n/X_n| = O_P(1)$. Similarly, $W_n = O_P(n^{-1/2})$ if $|\sqrt{n}\,W_n| = O_P(1)$.
b) The sequence $W_n = o_P(n^{-\delta})$ if $n^\delta W_n = o_P(1)$ which means that $n^\delta W_n \xrightarrow{P} 0$.
c) $W_n$ has the same order as $X_n$ in probability, written $W_n \asymp_P X_n$, if for every $\epsilon > 0$ there exist positive constants $N_\epsilon$ and $0 < d_\epsilon < D_\epsilon$ such that
$$P\left(d_\epsilon \leq \left|\frac{W_n}{X_n}\right| \leq D_\epsilon\right) = P\left(\frac{1}{D_\epsilon} \leq \left|\frac{X_n}{W_n}\right| \leq \frac{1}{d_\epsilon}\right) \geq 1 - \epsilon$$
for all $n \geq N_\epsilon$.
d) Similar notation is used for a $k \times r$ matrix $A_n = A = [a_{i,j}(n)]$ if each element $a_{i,j}(n)$ has the desired property. For example, $A = O_P(n^{-1/2})$ if each $a_{i,j}(n) = O_P(n^{-1/2})$.

Definition 1.32. Let $W_n = \|\hat{\mu}_n - \mu\|$.
a) If $W_n \asymp_P n^{-\delta}$ for some $\delta > 0$, then both $W_n$ and $\hat{\mu}_n$ have (tightness) rate $n^\delta$.
b) If there exists a constant $\kappa$ such that
$$n^\delta(W_n - \kappa) \xrightarrow{D} X$$
for some nondegenerate random variable $X$, then both $W_n$ and $\hat{\mu}_n$ have convergence rate $n^\delta$.

Theorem 1.18. Suppose there exists a constant $\kappa$ such that
$$n^\delta(W_n - \kappa) \xrightarrow{D} X.$$
a) Then $W_n = O_P(n^{-\delta})$.
b) If $X$ is not degenerate, then $W_n \asymp_P n^{-\delta}$.

The above result implies that if Wn has convergence rate nδ , then Wn has
tightness rate nδ , and the term “tightness” will often be omitted. Part a) is
proved, for example, in Lehmann (1999, p. 67).
The following result shows that if $W_n \asymp_P X_n$, then $X_n \asymp_P W_n$, $W_n = O_P(X_n)$, and $X_n = O_P(W_n)$. Notice that if $W_n = O_P(n^{-\delta})$, then $n^\delta$ is a lower bound on the rate of $W_n$. As an example, if the CLT holds then $\overline{Y}_n = O_P(n^{-1/3})$, but $\overline{Y}_n \asymp_P n^{-1/2}$.

Theorem 1.19. a) If $W_n \asymp_P X_n$, then $X_n \asymp_P W_n$.
b) If $W_n \asymp_P X_n$, then $W_n = O_P(X_n)$.
c) If $W_n \asymp_P X_n$, then $X_n = O_P(W_n)$.
d) $W_n \asymp_P X_n$ iff $W_n = O_P(X_n)$ and $X_n = O_P(W_n)$.
Proof. a) Since $W_n \asymp_P X_n$,
$$P\left(d_\epsilon \leq \left|\frac{W_n}{X_n}\right| \leq D_\epsilon\right) = P\left(\frac{1}{D_\epsilon} \leq \left|\frac{X_n}{W_n}\right| \leq \frac{1}{d_\epsilon}\right) \geq 1 - \epsilon$$
for all $n \geq N_\epsilon$. Hence $X_n \asymp_P W_n$.
b) Since $W_n \asymp_P X_n$,
$$P(|W_n| \leq |X_n D_\epsilon|) \geq P\left(d_\epsilon \leq \left|\frac{W_n}{X_n}\right| \leq D_\epsilon\right) \geq 1 - \epsilon$$
for all $n \geq N_\epsilon$. Hence $W_n = O_P(X_n)$.
c) Follows by a) and b).
d) If $W_n \asymp_P X_n$, then $W_n = O_P(X_n)$ and $X_n = O_P(W_n)$ by b) and c). Now suppose $W_n = O_P(X_n)$ and $X_n = O_P(W_n)$. Then
$$P(|W_n| \leq |X_n| D_{\epsilon/2}) \geq 1 - \epsilon/2$$
for all $n \geq N_1$, and
$$P(|X_n| \leq |W_n| 1/d_{\epsilon/2}) \geq 1 - \epsilon/2$$
for all $n \geq N_2$. Hence
$$P(A) \equiv P\left(\left|\frac{W_n}{X_n}\right| \leq D_{\epsilon/2}\right) \geq 1 - \epsilon/2$$
and
$$P(B) \equiv P\left(d_{\epsilon/2} \leq \left|\frac{W_n}{X_n}\right|\right) \geq 1 - \epsilon/2$$
for all $n \geq N = \max(N_1, N_2)$. Since $P(A \cap B) = P(A) + P(B) - P(A \cup B) \geq P(A) + P(B) - 1$,
$$P(A \cap B) = P\left(d_{\epsilon/2} \leq \left|\frac{W_n}{X_n}\right| \leq D_{\epsilon/2}\right) \geq 1 - \epsilon/2 + 1 - \epsilon/2 - 1 = 1 - \epsilon$$
for all $n \geq N$. Hence $W_n \asymp_P X_n$. $\square$

The following result is used to prove Theorem 1.21, which says that if there are $K$ estimators $T_{j,n}$ of a parameter $\beta$ such that $\|T_{j,n} - \beta\| = O_P(n^{-\delta})$ where $0 < \delta \leq 1$, and if $T_n^*$ picks one of these estimators, then $\|T_n^* - \beta\| = O_P(n^{-\delta})$.
Theorem 1.20: Pratt (1959). Let $X_{1,n}, ..., X_{K,n}$ each be $O_P(1)$ where $K$ is fixed. Suppose $W_n = X_{i_n,n}$ for some $i_n \in \{1, ..., K\}$. Then
$$W_n = O_P(1). \qquad (1.24)$$
Proof.
$$P(\max\{X_{1,n}, ..., X_{K,n}\} \leq x) = P(X_{1,n} \leq x, ..., X_{K,n} \leq x) \leq F_{W_n}(x) \leq P(\min\{X_{1,n}, ..., X_{K,n}\} \leq x) = 1 - P(X_{1,n} > x, ..., X_{K,n} > x).$$
Since $K$ is finite, there exists $B > 0$ and $N$ such that $P(X_{i,n} \leq B) > 1 - \epsilon/2K$ and $P(X_{i,n} > -B) > 1 - \epsilon/2K$ for all $n > N$ and $i = 1, ..., K$. Bonferroni's inequality states that $P(\cap_{i=1}^K A_i) \geq \sum_{i=1}^K P(A_i) - (K-1)$. Thus
$$F_{W_n}(B) \geq P(X_{1,n} \leq B, ..., X_{K,n} \leq B) \geq K(1 - \epsilon/2K) - (K-1) = K - \epsilon/2 - K + 1 = 1 - \epsilon/2$$
and
$$-F_{W_n}(-B) \geq -1 + P(X_{1,n} > -B, ..., X_{K,n} > -B) \geq -1 + K(1 - \epsilon/2K) - (K-1) = -1 + K - \epsilon/2 - K + 1 = -\epsilon/2.$$
Hence
$$F_{W_n}(B) - F_{W_n}(-B) \geq 1 - \epsilon \ \text{ for } n > N. \quad \square$$
Theorem 1.21. Suppose $\|T_{j,n} - \beta\| = O_P(n^{-\delta})$ for $j = 1, ..., K$ where $0 < \delta \leq 1$. Let $T_n^* = T_{i_n,n}$ for some $i_n \in \{1, ..., K\}$ where, for example, $T_{i_n,n}$ is the $T_{j,n}$ that minimized some criterion function. Then
$$\|T_n^* - \beta\| = O_P(n^{-\delta}). \qquad (1.25)$$
Proof. Let $X_{j,n} = n^\delta\|T_{j,n} - \beta\|$. Then $X_{j,n} = O_P(1)$ so by Theorem 1.20, $n^\delta\|T_n^* - \beta\| = O_P(1)$. Hence $\|T_n^* - \beta\| = O_P(n^{-\delta})$. $\square$

1.5.3 Slutsky’s Theorem and Related Results

Theorem 1.22: Slutsky's Theorem. Suppose $Y_n \xrightarrow{D} Y$ and $W_n \xrightarrow{P} w$ for some constant $w$. Then
a) $Y_n + W_n \xrightarrow{D} Y + w$,
b) $Y_n W_n \xrightarrow{D} wY$, and
c) $Y_n/W_n \xrightarrow{D} Y/w$ if $w \neq 0$.

Theorem 1.23. a) If $X_n \xrightarrow{P} X$, then $X_n \xrightarrow{D} X$.
b) If $X_n \xrightarrow{ae} X$, then $X_n \xrightarrow{P} X$ and $X_n \xrightarrow{D} X$.
c) If $X_n \xrightarrow{r} X$, then $X_n \xrightarrow{P} X$ and $X_n \xrightarrow{D} X$.
d) $X_n \xrightarrow{P} \tau(\theta)$ iff $X_n \xrightarrow{D} \tau(\theta)$.
e) If $X_n \xrightarrow{P} \theta$ and $\tau$ is continuous at $\theta$, then $\tau(X_n) \xrightarrow{P} \tau(\theta)$.
f) If $X_n \xrightarrow{D} \theta$ and $\tau$ is continuous at $\theta$, then $\tau(X_n) \xrightarrow{D} \tau(\theta)$.

Suppose that for all $\theta \in \Theta$, $T_n \xrightarrow{D} \tau(\theta)$, $T_n \xrightarrow{r} \tau(\theta)$, or $T_n \xrightarrow{ae} \tau(\theta)$. Then $T_n$ is a consistent estimator of $\tau(\theta)$ by Theorem 1.23. We are assuming that the function $\tau$ does not depend on $n$.

Example 1.13. Let Y1 , ..., Yn be iid with mean E(Yi ) = µ and variance
V (Yi ) = σ 2 . Then the sample mean Y n is a consistent estimator of µ since i)
the SLLN holds (use Theorems 1.17 and 1.23), ii) the WLLN holds, and iii)
the CLT holds (use Theorem 1.16). Since

lim VARµ (Y n ) = lim σ 2 /n = 0 and lim Eµ (Y n ) = µ,


n→∞ n→∞ n→∞

Y n is also a consistent estimator of µ by Theorem 1.15b. By the delta method


and Theorem 1.16b, Tn = g(Y n ) is a consistent estimator of g(µ) if g0 (µ) 6= 0
for all µ ∈ Θ. By Theorem 1.23e, g(Y n ) is a consistent estimator of g(µ) if g
is continuous at µ for all µ ∈ Θ.

Theorem 1.24. Assume that the function $g$ does not depend on $n$.
a) Generalized Continuous Mapping Theorem: If $X_n \xrightarrow{D} X$ and the function $g$ is such that $P[X \in C(g)] = 1$ where $C(g)$ is the set of points where $g$ is continuous, then $g(X_n) \xrightarrow{D} g(X)$.
b) Continuous Mapping Theorem: If $X_n \xrightarrow{D} X$ and the function $g$ is continuous, then $g(X_n) \xrightarrow{D} g(X)$.

Remark 1.6. For Theorem 1.23, a) follows from Slutsky's Theorem by taking $Y_n \equiv X = Y$ and $W_n = X_n - X$. Then $Y_n \xrightarrow{D} Y = X$ and $W_n \xrightarrow{P} 0$. Hence $X_n = Y_n + W_n \xrightarrow{D} Y + 0 = X$. The convergence in distribution parts of
b) and c) follow from a). Part f) follows from d) and e). Part e) implies that
if Tn is a consistent estimator of θ and τ is a continuous function, then τ (Tn )
is a consistent estimator of τ (θ). Theorem 1.24 says that convergence in dis-
tribution is preserved by continuous functions, and even some discontinuities
are allowed as long as the set of continuity points is assigned probability 1
by the asymptotic distribution. Equivalently, the set of discontinuity points
is assigned probability 0.
Example 1.14. (Ferguson 1996, p. 40): If $X_n \xrightarrow{D} X$, then $1/X_n \xrightarrow{D} 1/X$ if $X$ is a continuous random variable since $P(X = 0) = 0$ and $x = 0$ is the only discontinuity point of $g(x) = 1/x$.

Example 1.15. Show that if $Y_n \sim t_n$, a t distribution with $n$ degrees of freedom, then $Y_n \xrightarrow{D} Z$ where $Z \sim N(0,1)$.
Solution: $Y_n \stackrel{D}{=} Z/\sqrt{V_n/n}$ where $Z$ is independent of $V_n \sim \chi^2_n$. If $W_n = \sqrt{V_n/n} \xrightarrow{P} 1$, then the result follows by Slutsky's Theorem. But $V_n \stackrel{D}{=} \sum_{i=1}^n X_i$ where the iid $X_i \sim \chi^2_1$. Hence $V_n/n \xrightarrow{P} 1$ by the WLLN and $\sqrt{V_n/n} \xrightarrow{P} 1$ by Theorem 1.23e.

Theorem 1.25: Continuity Theorem. Let $Y_n$ be a sequence of random variables with characteristic functions $\phi_n(t)$. Let $Y$ be a random variable with characteristic function (cf) $\phi(t)$.
a) $Y_n \xrightarrow{D} Y$ iff $\phi_n(t) \to \phi(t)$ $\forall t \in \mathbb{R}$.
b) Also assume that $Y_n$ has moment generating function (mgf) $m_n$ and $Y$ has mgf $m$. Assume that all of the mgfs $m_n$ and $m$ are defined on $|t| \leq d$ for some $d > 0$. Then if $m_n(t) \to m(t)$ as $n \to \infty$ for all $|t| < c$ where $0 < c < d$, then $Y_n \xrightarrow{D} Y$.

Application: Proof of a Special Case of the CLT. Following Rohatgi (1984, pp. 569-9), let $Y_1, ..., Y_n$ be iid with mean $\mu$, variance $\sigma^2$, and mgf $m_Y(t)$ for $|t| < t_o$. Then
$$Z_i = \frac{Y_i - \mu}{\sigma}$$
has mean 0, variance 1, and mgf $m_Z(t) = \exp(-t\mu/\sigma)m_Y(t/\sigma)$ for $|t| < \sigma t_o$. We want to show that
$$W_n = \sqrt{n}\left(\frac{\overline{Y}_n - \mu}{\sigma}\right) \xrightarrow{D} N(0,1).$$
Notice that
$$W_n = n^{-1/2}\sum_{i=1}^n Z_i = n^{-1/2}\sum_{i=1}^n \frac{Y_i - \mu}{\sigma} = n^{-1/2}\,\frac{\sum_{i=1}^n Y_i - n\mu}{\sigma} = \frac{n^{-1/2}(\overline{Y}_n - \mu)}{\frac{1}{n}\sigma}.$$
Thus
$$m_{W_n}(t) = E(e^{tW_n}) = E\left[\exp\left(tn^{-1/2}\sum_{i=1}^n Z_i\right)\right] = E\left[\exp\left(\sum_{i=1}^n tZ_i/\sqrt{n}\right)\right] = \prod_{i=1}^n E[e^{tZ_i/\sqrt{n}}] = \prod_{i=1}^n m_Z(t/\sqrt{n}) = [m_Z(t/\sqrt{n})]^n.$$
Set $\psi(x) = \log(m_Z(x))$. Then
$$\log[m_{W_n}(t)] = n\log[m_Z(t/\sqrt{n})] = n\psi(t/\sqrt{n}) = \frac{\psi(t/\sqrt{n})}{\frac{1}{n}}.$$
Now $\psi(0) = \log[m_Z(0)] = \log(1) = 0$. Thus by L'Hôpital's rule (where the derivative is with respect to $n$), $\lim_{n\to\infty}\log[m_{W_n}(t)] =$
$$\lim_{n\to\infty} \frac{\psi(t/\sqrt{n})}{\frac{1}{n}} = \lim_{n\to\infty} \frac{\psi'(t/\sqrt{n})\left[\frac{-t/2}{n^{3/2}}\right]}{\left(\frac{-1}{n^2}\right)} = \frac{t}{2}\lim_{n\to\infty} \frac{\psi'(t/\sqrt{n})}{\frac{1}{\sqrt{n}}}.$$
Now
$$\psi'(0) = \frac{m_Z'(0)}{m_Z(0)} = E(Z_i)/1 = 0,$$
so L'Hôpital's rule can be applied again, giving $\lim_{n\to\infty}\log[m_{W_n}(t)] =$
$$\frac{t}{2}\lim_{n\to\infty} \frac{\psi''(t/\sqrt{n})\left[\frac{-t}{2n^{3/2}}\right]}{\left(\frac{-1}{2n^{3/2}}\right)} = \frac{t^2}{2}\lim_{n\to\infty}\psi''(t/\sqrt{n}) = \frac{t^2}{2}\psi''(0).$$
Now
$$\psi''(t) = \frac{d}{dt}\,\frac{m_Z'(t)}{m_Z(t)} = \frac{m_Z''(t)m_Z(t) - (m_Z'(t))^2}{[m_Z(t)]^2}.$$
So
$$\psi''(0) = m_Z''(0) - [m_Z'(0)]^2 = E(Z_i^2) - [E(Z_i)]^2 = 1.$$
Hence $\lim_{n\to\infty}\log[m_{W_n}(t)] = t^2/2$ and
$$\lim_{n\to\infty} m_{W_n}(t) = \exp(t^2/2)$$
which is the N(0,1) mgf. Thus by the continuity theorem,
$$W_n = \sqrt{n}\left(\frac{\overline{Y}_n - \mu}{\sigma}\right) \xrightarrow{D} N(0,1). \quad \square$$

1.5.4 Multivariate Limit Theorems

Many of the univariate results of the previous 3 subsections can be extended


to random vectors. For the limit theorems, the vector $X$ is typically a $k \times 1$ column vector and $X^T$ is a row vector. Let $\|x\| = \sqrt{x_1^2 + \cdots + x_k^2}$ be the Euclidean norm of $x$.

Definition 1.33. Let $X_n$ be a sequence of random vectors with joint cdfs $F_n(x)$ and let $X$ be a random vector with joint cdf $F(x)$.
a) $X_n$ converges in distribution to $X$, written $X_n \xrightarrow{D} X$, if $F_n(x) \to F(x)$ as $n \to \infty$ for all points $x$ at which $F(x)$ is continuous. The distribution of $X$ is the limiting distribution or asymptotic distribution of $X_n$.
b) $X_n$ converges in probability to $X$, written $X_n \xrightarrow{P} X$, if for every $\epsilon > 0$, $P(\|X_n - X\| > \epsilon) \to 0$ as $n \to \infty$.
c) Let $r > 0$ be a real number. Then $X_n$ converges in rth mean to $X$, written $X_n \xrightarrow{r} X$, if $E(\|X_n - X\|^r) \to 0$ as $n \to \infty$.
d) $X_n$ converges almost everywhere to $X$, written $X_n \xrightarrow{ae} X$, if $P(\lim_{n\to\infty} X_n = X) = 1$.

Theorems 1.26 and 1.27 below are the multivariate extensions of the limit theorems in subsection 1.5.1. When the limiting distribution of $Z_n = \sqrt{n}(g(T_n) - g(\theta))$ is multivariate normal $N_k(0, \Sigma)$, approximate the joint cdf of $Z_n$ with the joint cdf of the $N_k(0, \Sigma)$ distribution. Thus to find probabilities, manipulate $Z_n$ as if $Z_n \approx N_k(0, \Sigma)$. To see that the CLT is a special case of the MCLT below, let $k = 1$, $E(X) = \mu$, and $V(X) = \Sigma_x = \sigma^2$.

Theorem 1.26: the Multivariate Central Limit Theorem (MCLT). If $X_1, ..., X_n$ are iid $k \times 1$ random vectors with $E(X) = \mu$ and $\text{Cov}(X) = \Sigma_x$, then
$$\sqrt{n}(\overline{X}_n - \mu) \xrightarrow{D} N_k(0, \Sigma_x)$$
where the sample mean
$$\overline{X}_n = \frac{1}{n}\sum_{i=1}^n X_i.$$

To see that the delta method is a special case of the multivariate delta method, note that if $T_n$ and parameter $\theta$ are real valued, then $D_{g(\theta)} = g'(\theta)$.

Theorem 1.27: the Multivariate Delta Method. If $g$ does not depend on $n$ and
$$\sqrt{n}(T_n - \theta) \xrightarrow{D} N_k(0, \Sigma),$$
then
$$\sqrt{n}(g(T_n) - g(\theta)) \xrightarrow{D} N_d(0, D_{g(\theta)}\Sigma D_{g(\theta)}^T)$$
where the $d \times k$ Jacobian matrix of partial derivatives
$$D_{g(\theta)} = \begin{pmatrix} \frac{\partial}{\partial\theta_1}g_1(\theta) & \cdots & \frac{\partial}{\partial\theta_k}g_1(\theta) \\ \vdots & & \vdots \\ \frac{\partial}{\partial\theta_1}g_d(\theta) & \cdots & \frac{\partial}{\partial\theta_k}g_d(\theta) \end{pmatrix}.$$
Here the mapping $g : \mathbb{R}^k \to \mathbb{R}^d$ needs to be differentiable in a neighborhood of $\theta \in \mathbb{R}^k$.
Definition 1.34. If the estimator $g(T_n) \xrightarrow{P} g(\theta)$ for all $\theta \in \Theta$, then $g(T_n)$ is a consistent estimator of $g(\theta)$.

Theorem 1.28. If $0 < \delta \leq 1$, $X$ is a random vector, and
$$n^\delta(g(T_n) - g(\theta)) \xrightarrow{D} X,$$
then $g(T_n) \xrightarrow{P} g(\theta)$.

Theorem 1.29. If $X_1, ..., X_n$ are iid, $E(\|X\|) < \infty$, and $E(X) = \mu$, then
a) WLLN: $\overline{X}_n \xrightarrow{P} \mu$, and
b) SLLN: $\overline{X}_n \xrightarrow{ae} \mu$.

Theorem 1.30: Continuity Theorem. Let $X_n$ be a sequence of $k \times 1$ random vectors with characteristic functions $\phi_n(t)$, and let $X$ be a $k \times 1$ random vector with cf $\phi(t)$. Then
$$X_n \xrightarrow{D} X \ \text{ iff } \ \phi_n(t) \to \phi(t)$$
for all $t \in \mathbb{R}^k$.

Theorem 1.31: Cramér Wold Device. Let $X_n$ be a sequence of $k \times 1$ random vectors, and let $X$ be a $k \times 1$ random vector. Then
$$X_n \xrightarrow{D} X \ \text{ iff } \ t^T X_n \xrightarrow{D} t^T X$$
for all $t \in \mathbb{R}^k$.

Application: Proof of the MCLT Theorem 1.26. Note that for fixed $t$, the $t^T X_i$ are iid random variables with mean $t^T\mu$ and variance $t^T\Sigma t$. Hence by the CLT, $t^T\sqrt{n}(\overline{X}_n - \mu) \xrightarrow{D} N(0, t^T\Sigma t)$. The right hand side has distribution $t^T X$ where $X \sim N_k(0, \Sigma)$. Hence by the Cramér Wold Device, $\sqrt{n}(\overline{X}_n - \mu) \xrightarrow{D} N_k(0, \Sigma)$. $\square$

Theorem 1.32. a) If $X_n \xrightarrow{P} X$, then $X_n \xrightarrow{D} X$.
b) $X_n \xrightarrow{P} g(\theta)$ iff $X_n \xrightarrow{D} g(\theta)$.

Let $g(n) \geq 1$ be an increasing function of the sample size $n$: $g(n) \uparrow \infty$, e.g. $g(n) = \sqrt{n}$. See White (1984, p. 15). If a $k \times 1$ random vector $T_n - \mu$ converges to a nondegenerate multivariate normal distribution with convergence rate $\sqrt{n}$, then $T_n$ has (tightness) rate $\sqrt{n}$.

Definition 1.35. Let $A_n = [a_{i,j}(n)]$ be an $r \times c$ random matrix.
a) $A_n = O_P(X_n)$ if $a_{i,j}(n) = O_P(X_n)$ for $1 \leq i \leq r$ and $1 \leq j \leq c$.
b) $A_n = o_p(X_n)$ if $a_{i,j}(n) = o_p(X_n)$ for $1 \leq i \leq r$ and $1 \leq j \leq c$.
c) $A_n \asymp_P 1/(g(n))$ if $a_{i,j}(n) \asymp_P 1/(g(n))$ for $1 \leq i \leq r$ and $1 \leq j \leq c$.
d) Let $A_{1,n} = T_n - \mu$ and $A_{2,n} = C_n - c\Sigma$ for some constant $c > 0$. If $A_{1,n} \asymp_P 1/(g(n))$ and $A_{2,n} \asymp_P 1/(g(n))$, then $(T_n, C_n)$ has (tightness) rate $g(n)$.

Theorem 1.33: Continuous Mapping Theorem. Let $X_n \in \mathbb{R}^k$. If $X_n \xrightarrow{D} X$ and if the function $g : \mathbb{R}^k \to \mathbb{R}^j$ is continuous, then $g(X_n) \xrightarrow{D} g(X)$.

The following two theorems are taken from Severini (2005, pp. 345-349,
354).

Theorem 1.34. Let X n = (X1n , ..., Xkn)T be a sequence of k × 1


random vectors, let Y n be a sequence of k × 1 random vectors, and let
X = (X1 , ..., Xk)T be a k × 1 random vector. Let W n be a sequence of k × k
nonsingular random matrices, and let C be a k × k constant nonsingular
matrix.
a) X n →P X iff Xin →P Xi for i = 1, ..., k.
b) Slutsky's Theorem: If X n →D X and Y n →P c for some constant k × 1
vector c, then i) X n + Y n →D X + c and
ii) Y Tn X n →D cT X.
c) If X n →D X and W n →P C, then W n X n →D CX, X Tn W n →D X T C,
W n^{-1} X n →D C^{-1} X, and X Tn W n^{-1} →D X T C^{-1}.

Theorem 1.35. Let Wn , Xn , Yn , and Zn be sequences of random variables


such that Yn > 0 and Zn > 0. (Often Yn and Zn are deterministic, e.g.
Yn = n−1/2 .)

a) If Wn = OP (1) and Xn = OP (1), then Wn + Xn = OP (1) and Wn Xn =


OP (1), thus OP (1) + OP (1) = OP (1) and OP (1)OP (1) = OP (1).
b) If Wn = OP (1) and Xn = oP (1), then Wn + Xn = OP (1) and Wn Xn =
oP (1), thus OP (1) + oP (1) = OP (1) and OP (1)oP (1) = oP (1).
c) If Wn = OP (Yn ) and Xn = OP (Zn ), then Wn +Xn = OP (max(Yn , Zn ))
and Wn Xn = OP (Yn Zn ), thus OP (Yn ) + OP (Zn ) = OP (max(Yn , Zn )) and
OP (Yn )OP (Zn ) = OP (Yn Zn ).
√ D
Theorem 1.36. i) Suppose n(Tn − µ) → Np (θ, Σ). Let A be a q × p
√ √ D
constant matrix. Then A n(Tn − µ) = n(ATn − Aµ) → Nq (Aθ, AΣAT ).
ii) Let Σ > 0. Assume n is large enough so that C > 0. If (T, C)
is a consistent estimator of (µ, s Σ) where s > 0 is some constant, then
Dx2
(T, C) = (x − T )T C −1 (x − T ) = s−1 Dx
2 2
(µ, Σ) + oP (1), so Dx (T, C) is
a consistent estimator of s−1 Dx2
(µ, Σ).
√ D
iii) Let Σ > 0. Assume n is large enough so that C > 0. If n(T − µ) →
T −1
Np (0, Σ) and if C is a consistent estimator of Σ, then n(T − µ) C (T −
D
µ) → χ2p . In particular,
D
n(x − µ)T S −1 (x − µ) → χ2p .
Proof: ii) Dx^2(T, C) = (x − T)^T C^{-1}(x − T) =
(x − µ + µ − T)^T [C^{-1} − s^{-1}Σ^{-1} + s^{-1}Σ^{-1}](x − µ + µ − T)
= (x − µ)^T [s^{-1}Σ^{-1}](x − µ) + (x − T)^T [C^{-1} − s^{-1}Σ^{-1}](x − T)
+ (x − µ)^T [s^{-1}Σ^{-1}](µ − T) + (µ − T)^T [s^{-1}Σ^{-1}](x − µ)
+ (µ − T)^T [s^{-1}Σ^{-1}](µ − T) = s^{-1} Dx^2(µ, Σ) + oP(1).
(Note that Dx^2(T, C) = s^{-1} Dx^2(µ, Σ) + OP(n^{-δ}) if (T, C) is a consistent
estimator of (µ, sΣ) with rate n^δ where 0 < δ ≤ 0.5 if [C^{-1} − s^{-1}Σ^{-1}] =
OP(n^{-δ}).)
Alternatively, Dx^2(T, C) is a continuous function of (T, C) if C > 0 for
n > 10p. Hence Dx^2(T, C) →P Dx^2(µ, sΣ).
iii) Note that Z n = √n Σ^{-1/2}(T − µ) →D Np(0, I p). Thus Z Tn Z n =
n(T − µ)^T Σ^{-1}(T − µ) →D χ2p. Now n(T − µ)^T C^{-1}(T − µ) =
n(T − µ)^T [C^{-1} − Σ^{-1} + Σ^{-1}](T − µ) = n(T − µ)^T Σ^{-1}(T − µ) +
n(T − µ)^T [C^{-1} − Σ^{-1}](T − µ) = n(T − µ)^T Σ^{-1}(T − µ) + oP(1) →D χ2p since
√n(T − µ)^T [C^{-1} − Σ^{-1}] √n(T − µ) = OP(1)oP(1)OP(1) = oP(1). □
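A minimal R sketch (not from the text) of iii) with T = x and C = S; the Np population, dimension, and sample sizes are made up for illustration, and MASS::mvrnorm is used to generate the data.

library(MASS)
set.seed(3)
p <- 3; n <- 200; runs <- 2000
mu <- rep(0, p); Sigma <- 0.5 + diag(p)    # positive definite
d2 <- replicate(runs, {
  x <- mvrnorm(n, mu, Sigma)
  xbar <- colMeans(x); S <- cov(x)
  n * t(xbar - mu) %*% solve(S) %*% (xbar - mu)
})
c(mean(d2), p); c(var(d2), 2*p)            # compare with the chi^2_p moments
qqplot(qchisq(ppoints(runs), df = p), d2)  # points should track the identity line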
Example 1.16. Suppose that xn ⫫ y n for n = 1, 2, .... Suppose xn →D x
and y n →D y where x ⫫ y. Then
(xn^T, y n^T)^T →D (x^T, y^T)^T
by Theorem 1.30. To see this, let t = (t1^T, t2^T)^T, z n = (xn^T, y n^T)^T, and z =
(x^T, y^T)^T. Since xn ⫫ y n and x ⫫ y, the characteristic function

φz n (t) = φxn (t1 )φy n (t2 ) → φx (t1 )φy (t2 ) = φz (t).

D
Hence g(z n ) → g(z) by Theorem 1.33.

Remark 1.7. In the above example, we can show x ⫫ y instead of assuming
x ⫫ y. See Ferguson (1996, p. 42).

1.6 Mixture Distributions

Mixture distributions are useful for model and variable selection since β̂ Imin ,0
is a mixture distribution of β̂ Ij ,0 , and the lasso estimator β̂ L is a mixture
distribution of β̂L,λi for i = 1, ..., M . See Chapter 4. A random vector u has
a mixture distribution if u equals a random vector uj with probability πj
for j = 1, ..., J. See Definition 1.24 for the population mean and population
covariance matrix of a random vector.

Definition 1.36. The distribution of a g ×1 random vector u is a mixture


distribution if the cumulative distribution function (cdf) of u is
Fu(t) = Σ_{j=1}^J πj Fuj(t)    (1.26)
where the probabilities πj satisfy 0 ≤ πj ≤ 1 and Σ_{j=1}^J πj = 1, J ≥ 2,
and Fuj(t) is the cdf of a g × 1 random vector uj. Then u has a mixture
distribution of the uj with probabilities πj.

Theorem 1.37. Suppose E(h(u)) and the E(h(uj )) exist. Then


E[h(u)] = Σ_{j=1}^J πj E[h(uj)].    (1.27)
Hence
E(u) = Σ_{j=1}^J πj E[uj],    (1.28)
and Cov(u) = E(uu^T) − E(u)E(u^T) = E(uu^T) − E(u)[E(u)]^T =
Σ_{j=1}^J πj E[uj uj^T] − E(u)[E(u)]^T =
Σ_{j=1}^J πj Cov(uj) + Σ_{j=1}^J πj E(uj)[E(uj)]^T − E(u)[E(u)]^T.    (1.29)
If E(uj) = θ for j = 1, ..., J, then E(u) = θ and
Cov(u) = Σ_{j=1}^J πj Cov(uj).

This theorem is easy to prove if the uj are continuous random vectors with
(joint) probability density functions (pdfs) fuj (t). Then u is a continuous
random vector with pdf
fu(t) = Σ_{j=1}^J πj fuj(t),  and  E[h(u)] = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} h(t) fu(t) dt
= Σ_{j=1}^J πj ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} h(t) fuj(t) dt = Σ_{j=1}^J πj E[h(uj)]
where E[h(uj)] is the expectation with respect to the random vector uj. Note
that
E(u)[E(u)]^T = Σ_{j=1}^J Σ_{k=1}^J πj πk E(uj)[E(uk)]^T.    (1.30)
Alternatively, with respect to a Riemann Stieltjes integral, E[h(u)] =
∫ h(t) dF(t) provided the expected value exists, and the integral is a linear
operator with respect to both h and F. Hence for a mixture distribution,
E[h(u)] = ∫ h(t) dF(t) = ∫ h(t) d[Σ_{j=1}^J πj Fuj(t)] = Σ_{j=1}^J πj ∫ h(t) dFuj(t) = Σ_{j=1}^J πj E[h(uj)].
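A minimal R sketch (not from the text) checking (1.28) and (1.29) for a two component mixture of bivariate normals. The mixing probabilities, means, and covariance matrices are made up, and MASS::mvrnorm is used for the simulation.

library(MASS)
set.seed(4)
p1 <- 0.3; p2 <- 0.7
th1 <- c(0, 0); th2 <- c(2, -1)
S1 <- diag(2);  S2 <- matrix(c(2, 1, 1, 2), 2, 2)
Eu <- p1*th1 + p2*th2                                                  # (1.28)
Cu <- p1*S1 + p2*S2 + p1*th1%*%t(th1) + p2*th2%*%t(th2) - Eu%*%t(Eu)   # (1.29)
n <- 1e5
pick <- rbinom(n, 1, p2)   # 1 means the observation comes from component 2
u <- (1 - pick)*mvrnorm(n, th1, S1) + pick*mvrnorm(n, th2, S2)
rbind(Eu, colMeans(u))     # population vs sample mean
Cu; cov(u)                 # population vs sample covariance matrix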

1.7 Elliptically Contoured Distributions

Definition 1.37: Johnson (1987, pp. 107-108). A p×1 random vector X


has an elliptically contoured distribution, also called an elliptically symmetric
distribution, if X has joint pdf

f(z) = kp|Σ|−1/2 g[(z − µ)T Σ −1 (z − µ)], (1.31)

and we say X has an elliptically contoured ECp (µ, Σ, g) distribution.

If X has an elliptically contoured (EC) distribution, then the characteristic


function of X is
φX (t) = exp(itT µ)ψ(tT Σt) (1.32)
for some function ψ. If the second moments exist, then

E(X) = µ (1.33)

and
Cov(X) = cX Σ (1.34)
where
cX = −2ψ′(0).

Definition 1.38. The population squared Mahalanobis distance

U ≡ D2 = D2 (µ, Σ) = (X − µ)T Σ −1 (X − µ). (1.35)

For elliptically contoured distributions, U has pdf

h(u) = [π^{p/2}/Γ(p/2)] kp u^{p/2−1} g(u).    (1.36)

For c > 0, an ECp (µ, cI, g) distribution is spherical about µ where I is


the p × p identity matrix. The multivariate normal distribution Np (µ, Σ) has
kp = (2π)−p/2 , ψ(u) = g(u) = exp(−u/2), and h(u) is the χ2p pdf.

The following theorem is useful for proving properties of EC distributions


without using the characteristic function (1.32). See Eaton (1986) and Cook
(1998, pp. 57, 130).

Theorem 1.38. Let X be a p × 1 random vector with 1st moments; i.e.,


E(X) exists. Let B be any constant full rank p × r matrix where 1 ≤ r ≤ p.
Then X is elliptically contoured iff for all such conforming matrices B,

E(X|B T X) = µ + M B B T (X − µ) = aB + M B B T X (1.37)

where the p × 1 constant vector aB and the p × r constant matrix M B both


depend on B.

A useful fact is that aB and M B do not depend on g:

aB = µ − M B B T µ = (I p − M B B T )µ,

and
M B = ΣB(B T ΣB)−1 .
See Problem 1.19. Notice that in the formula for M B , Σ can be replaced by
cΣ where c > 0 is a constant. In particular, if the EC distribution has 2nd
moments, Cov(X) can be used instead of Σ.

To use Theorem 1.38 to prove interesting properties, partition X, µ, and


Σ. Let X 1 and µ1 be q × 1 vectors, and let X 2 and µ2 be (p − q) × 1 vectors.
Let Σ 11 be a q × q matrix, let Σ 12 be a q × (p − q) matrix, let Σ 21 be a
(p − q) × q matrix, and let Σ 22 be a (p − q) × (p − q) matrix. Write [U; V]
for the matrix with U stacked above V, and [A  B; C  D] for the corresponding
2 × 2 block matrix. Then
X = [X 1; X 2],  µ = [µ1; µ2],  and  Σ = [Σ 11  Σ 12; Σ 21  Σ 22].
Also assume that the (p + 1) × 1 vector (Y, X T)T is ECp+1(µ, Σ, g) where
Y is a random variable, X is a p × 1 vector, and use
(Y, X T)T = [Y; X],  µ = [µY; µX],  and  Σ = [ΣY Y  Σ Y X; Σ XY  Σ XX].

Theorem 1.39. Let X ∼ ECp (µ, Σ, g) and assume that E(X) exists.
a) Any subset of X is EC, in particular X 1 is EC.
b) (Cook 1998 p. 131, Kelker 1970). If Cov(X) is nonsingular,

Cov(X|B T X) = dg (B T X)[Σ − ΣB(B T ΣB)−1 B T Σ]

where the real valued function dg (B T X) is constant iff X is MVN.

Proof of a). Let A be an arbitrary full rank q × r matrix where 1 ≤ r ≤ q.


Let B = [A; 0]. Then B TX = ATX 1, and
E[X|B TX] = E[ [X 1; X 2] | ATX 1 ] = [µ1; µ2] + [M 1B; M 2B] [AT 0T] [X 1 − µ1; X 2 − µ2]
by Theorem 1.38. Hence E[X 1|ATX 1] = µ1 + M 1B AT(X 1 − µ1). Since A
was arbitrary, X 1 is EC by Theorem 1.38. Notice that M B = ΣB(B TΣB)−1 =
[Σ 11  Σ 12; Σ 21  Σ 22] [A; 0] ([AT 0T] [Σ 11  Σ 12; Σ 21  Σ 22] [A; 0])−1 = [M 1B; M 2B].
Hence M 1B = Σ 11 A(AT Σ 11 A)−1 and X 1 is EC with location and dispersion
parameters µ1 and Σ 11. □

Theorem 1.40. Let (Y, X T )T be ECp+1 (µ, Σ, g) where Y is a random


variable.

a) Assume that E[(Y, X T)T] exists. Then E(Y|X) = α + β T2 X where
α = µY − β T2 µX and β 2 = Σ XX^{-1} Σ XY.

b) Even if the first moment does not exist, the conditional median

MED(Y |X) = α + β T2 X

where α and β 2 are given in a).

Proof. a) The trick is to choose B so that Theorem 1.38 applies. Let
B = [0T; I p], a (p + 1) × p matrix. Then B TΣB = Σ XX and
ΣB = [Σ Y X; Σ XX].
Now
E[ [Y; X] | X ] = E[ [Y; X] | B T [Y; X] ] = µ + ΣB(B TΣB)−1 B T [Y − µY; X − µX]
by Theorem 1.38. The right hand side of the last equation is equal to
µ + [Σ Y X; Σ XX] Σ XX^{-1} (X − µX) = [µY − Σ Y X Σ XX^{-1} µX + Σ Y X Σ XX^{-1} X;  X],
and the result follows since
β T2 = Σ Y X Σ XX^{-1}.

b) See Croux et al. (2001) for references.
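For the MVN special case of Theorem 1.40, α and β 2 are easy to compute. The following R sketch (not from the text; the partitioned covariance matrix and means are made up) does the arithmetic.

muY <- 5; muX <- c(1, -2)
SigXX <- matrix(c(4, 1, 1, 3), 2, 2)
SigXY <- c(2, 1)                 # Sigma_XY = Cov(X, Y)
beta2 <- solve(SigXX, SigXY)     # Sigma_XX^{-1} Sigma_XY
alpha <- muY - sum(beta2 * muX)
alpha; beta2                     # E(Y|X) = alpha + beta2^T X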

Example 1.17. This example illustrates another application of Theorem


1.38. Suppose that X comes from a mixture of two multivariate normals with
the same mean and proportional covariance matrices. That is, let

X ∼ (1 − γ)Np (µ, Σ) + γNp (µ, cΣ)

where c > 0 and 0 < γ < 1. Since the multivariate normal distribution is
elliptically contoured (and see Theorem 1.37),

E(X|B T X) = (1 − γ)[µ + M 1 B T (X − µ)] + γ[µ + M 2 B T (X − µ)]

= µ + [(1 − γ)M 1 + γM 2 ]BT (X − µ) ≡ µ + M B T (X − µ).



Since M B only depends on B and Σ, it follows that M 1 = M 2 = M = M B .


Hence X has an elliptically contoured distribution by Theorem 1.38. See
Problem 1.13 for a related result.

Let x ∼ Np (µ, Σ) and y ∼ χ2d be independent. Let wi = xi /(y/d)1/2 for


i = 1, ..., p. Then w has a multivariate t-distribution with parameters µ and
Σ and degrees of freedom d, an important elliptically contoured distribution.
Cornish (1954) showed that the covariance matrix of w is Cov(w) = [d/(d − 2)] Σ
for d > 2. The case d = 1 is known as a multivariate Cauchy distribution.
The joint pdf of w is
f(z) = [Γ((d + p)/2) |Σ|^{−1/2} / ((πd)^{p/2} Γ(d/2))] [1 + d^{−1}(z − µ)T Σ^{−1}(z − µ)]^{−(d+p)/2}.

See Mardia et al. (1979, pp. 43, 57). See Johnson and Kotz (1972, p. 134) for
the special case where the xi ∼ N (0, 1).
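A minimal R sketch (not from the text; d, Σ, and the sample size are made up) generates w i = x i/√(y/d) directly and checks Cornish's covariance formula by simulation.

library(MASS)
set.seed(6)
p <- 2; d <- 5; n <- 1e5
Sigma <- matrix(c(2, 1, 1, 3), 2, 2)
x <- mvrnorm(n, rep(0, p), Sigma)
y <- rchisq(n, df = d)
w <- x / sqrt(y/d)        # row i of x divided by sqrt(y_i/d)
cov(w)                    # approximately d Sigma/(d - 2)
(d/(d - 2)) * Sigma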
The following EC(µ, Σ, g) distribution for a p × 1 random vector x is
the uniform distribution on a hyperellipsoid, where f(z) = c for z in the
hyperellipsoid and c is the reciprocal of the volume of the hyperellipsoid.
The pdf of the distribution is
f(z) = [Γ(p/2 + 1) / ((p + 2)π)^{p/2}] |Σ|^{−1/2} I[(z − µ)T Σ^{−1}(z − µ) ≤ p + 2].
Then E(x) = µ by symmetry, and it can be shown that Cov(x) = Σ.


If x ∼ Np (µ, Σ) and ui = exp(xi ) for i = 1, ..., p, then u has a multivariate
lognormal distribution with parameters µ and Σ. This distribution is not an
elliptically contoured distribution. See Problem 1.8.

1.8 Summary

1) A case or observation consists of k random variables measured for one


person or thing. The ith case z i = (zi1 , ..., zik)T . The training data consists
of z1 , ..., zn . A statistical model or method is fit (trained) on the training
data. The test data consists of z n+1 , ..., zn+m , and the test data is often
used to evaluate the quality of the fitted model.
2) For classical regression and multivariate analysis, we often want n ≥
10p, and a model with n < 5p is overfitting: the model does not have enough
data to estimate parameters accurately if x is p × 1. Statistical Learning
methods often use a model with d variables, where n ≥ Jd with J ≥ 5 and
preferably J ≥ 10. A model is underfitting if it omits important predictors.
Fix p. If the probability that a model underfits goes to 0 as the sample size

n → ∞, then overfitting may not be too serious if n ≥ Jd. Underfitting can


cause the model to fail to hold.
3) Regression investigates how the response variable Y changes with the
value of a p × 1 vector x of predictors. For a 1D regression model, Y is
conditionally independent of x given the sufficient predictor SP = h(x),
written Y ⫫ x|h(x), where the real valued function h : Rp → R. The estimated
sufficient predictor ESP = ĥ(x). A response plot is a plot of the ESP versus
the response Y . Often SP = xT β and ESP = xT β̂. A residual plot is a plot
of the ESP versus the residuals. Tip: if the model for Y (more accurately
for Y |x) depends on x only through the real valued function h(x), then
SP = h(x).
4) If X and Y are p × 1 random vectors, a a conformable constant vector,
and A and B are conformable constant matrices, then

E(X+Y ) = E(X)+E(Y ), E(a+Y ) = a+E(Y ), & E(AXB) = AE(X)B.

Also
Cov(a + AX) = Cov(AX) = ACov(X)AT .
Note that E(AY ) = AE(Y ) and Cov(AY ) = ACov(Y )AT .
5) If X ∼ Np (µ, Σ), then E(X) = µ and Cov(X) = Σ.
6) If X ∼ Np (µ, Σ) and if A is a q×p matrix, then AX ∼ Nq (Aµ, AΣAT ).
If a is a p × 1 vector of constants, then X + a ∼ Np (µ + a, Σ).
7) All subsets of a MVN are MVN: (Xk1 , ..., Xkq )T ∼ Nq (µ̃, Σ̃) where
µ̃i = E(Xki ) and Σ̃ ij = Cov(Xki , Xkj ). In particular, X 1 ∼ Nq (µ1 , Σ 11 )
and X 2 ∼ Np−q (µ2 , Σ 22 ). If X ∼ Np (µ, Σ), then X 1 and X 2 are indepen-
dent iff Σ 12 = 0.
8) Let (Y, X)^T ∼ N2((µY, µX)^T, Σ) where Σ has rows (σY^2, Cov(Y, X)), (Cov(X, Y), σX^2).
Also recall that the population correlation between X and Y is given by
ρ(X, Y) = Cov(X, Y)/[√VAR(X) √VAR(Y)] = σX,Y/(σX σY)

if σX > 0 and σY > 0.


9) The conditional distribution of a MVN is MVN. If X ∼ Np(µ, Σ), then
the conditional distribution of X 1 given that X 2 = x2 is multivariate normal
with mean µ1 + Σ12Σ22^{-1}(x2 − µ2) and covariance matrix Σ11 − Σ12Σ22^{-1}Σ21.
That is,
X 1|X 2 = x2 ∼ Nq(µ1 + Σ12Σ22^{-1}(x2 − µ2), Σ11 − Σ12Σ22^{-1}Σ21).
10) Notation:
X 1|X 2 ∼ Nq(µ1 + Σ12Σ22^{-1}(X 2 − µ2), Σ11 − Σ12Σ22^{-1}Σ21).

11) Be able to compute the above quantities if X1 and X2 are scalars.


12) Let X n be a sequence of random vectors with joint cdfs Fn (x) and let
X be a random vector with joint cdf F (x).
a) X n converges in distribution to X, written X n →D X, if Fn(x) →
F(x) as n → ∞ for all points x at which F(x) is continuous. The distribution
of X is the limiting distribution or asymptotic distribution of X n. Note
that X does not depend on n.
b) X n converges in probability to X, written X n →P X, if for every
ε > 0, P(‖X n − X‖ > ε) → 0 as n → ∞.
13) Multivariate Central Limit Theorem (MCLT): If X 1, ..., X n are iid
k × 1 random vectors with E(X) = µ and Cov(X) = Σ x, then
√n(X n − µ) →D Nk(0, Σ x)
where the sample mean X n = (1/n) Σ_{i=1}^n X i.
14) Suppose √n(Tn − µ) →D Np(θ, Σ). Let A be a q × p constant matrix.
Then A√n(Tn − µ) = √n(ATn − Aµ) →D Nq(Aθ, AΣA^T).
15) Suppose A is a conformable constant matrix and X n →D X. Then
AX n →D AX.
16) A g × 1 random vector u has a mixture distribution of the uj
with probabilities πj if u is equal to uj with probability πj. The cdf of
u is Fu(t) = Σ_{j=1}^J πj Fuj(t) where the probabilities πj satisfy 0 ≤ πj ≤ 1
and Σ_{j=1}^J πj = 1, J ≥ 2, and Fuj(t) is the cdf of a g × 1 random
vector uj. Then E(u) = Σ_{j=1}^J πj E[uj] and Cov(u) = E(uu^T) −
E(u)E(u^T) = E(uu^T) − E(u)[E(u)]^T = Σ_{j=1}^J πj E[uj uj^T] − E(u)[E(u)]^T =
Σ_{j=1}^J πj Cov(uj) + Σ_{j=1}^J πj E(uj)[E(uj)]^T − E(u)[E(u)]^T. If E(uj) = θ for
j = 1, ..., J, then E(u) = θ and Cov(u) = Σ_{j=1}^J πj Cov(uj). Note that
E(u)[E(u)]^T = Σ_{j=1}^J Σ_{k=1}^J πj πk E(uj)[E(uk)]^T.

1.9 Complements

Graphical response transformation methods similar to those in Section 1.2


include Cook and Olive (2001) and Olive (2004b, 2017a: section 3.2). A nu-
merical method is given by Zhang and Yang (2017).

Section 1.5 followed Olive (2014, ch. 8) closely, which is a good Master’s
level treatment of large sample theory. There are several PhD level texts
on large sample theory including, in roughly increasing order of difficulty,
Lehmann (1999), Ferguson (1996), Sen and Singer (1993), and Serfling (1980).
White (1984) considers asymptotic theory for econometric applications.
For a nonsingular matrix, the inverse of the matrix, the determinant of
the matrix, and the eigenvalues of the matrix are continuous functions of
the matrix. Hence if Σ̂ is a consistent estimator of Σ, then the inverse,
determinant, and eigenvalues of Σ̂ are consistent estimators of the inverse,
determinant, and eigenvalues of Σ > 0. See, for example, Bhatia et al. (1990),
Stewart (1969), and Severini (2005, pp. 348-349).
Big Data
Sometimes n is huge and p is small. Then importance sampling and se-
quential analysis with sample size less than 1000 can be useful for inference
for regression and time series models. Sometimes n is much smaller than p,
for example with microarrays. Sometimes both n and p are large.

1.10 Problems

Problems from old qualifying exams are marked with a Q since these problems
take longer than quiz and exam problems.

crancap hdlen hdht Data for 1.1


1485 175 132
1450 191 117
1460 186 122
1425 191 125
1430 178 120
1290 180 117
90 75 51
1.1∗ . The table (W ) above represents 3 head measurements on 6 people
and one ape. Let X1 = cranial capacity, X2 = head length, and X3 = head
height. Let x = (X1 , X2 , X3 )T . Several multivariate location estimators, in-
cluding the coordinatewise median and sample mean, are found by applying
a univariate location estimator to each random variable and then collecting
the results into a vector. a) Find the coordinatewise median MED(W ).
b) Find the sample mean x.

1.2Q . Suppose that the regression model is Yi = 7+βXi +ei for i = 1, ..., n
where the ei are iid N(0, σ2) random variables. The least squares criterion is
Q(η) = Σ_{i=1}^n (Yi − 7 − ηXi)2.

a) What is E(Yi )?

b) Find the least squares estimator β̂ of β by setting the first derivative
(d/dη)Q(η) equal to zero.
c) Show that your β̂ is the global minimizer of the least squares criterion
Q by showing that the second derivative (d2/dη2)Q(η) > 0 for all values of η.
1.3Q. The location model is Yi = µ + ei for i = 1, ..., n where the ei are iid
with mean E(ei) = 0 and constant variance VAR(ei) = σ2. The least squares
estimator µ̂ of µ minimizes the least squares criterion Q(η) = Σ_{i=1}^n (Yi − η)2.
To find the least squares estimator, perform the following steps.
a) Find the derivative (d/dη)Q, set the derivative equal to zero, and solve for
η. Call the solution µ̂.
b) To show that the solution was indeed the global minimizer of Q, show
that (d2/dη2)Q > 0 for all real η. (Then the solution µ̂ is a local min and Q is
convex, so µ̂ is the global min.)

1.4Q . The normal error model for simple linear regression through the
origin is
Yi = βXi + ei
for i = 1, ..., n where e1 , ..., en are iid N (0, σ 2 ) random variables.

a) Show that the least squares estimator for β is
β̂ = Σ_{i=1}^n Xi Yi / Σ_{i=1}^n Xi2.
b) Find E(β̂).
c) Find VAR(β̂).
(Hint: Note that β̂ = Σ_{i=1}^n ki Yi where the ki depend on the Xi which are
treated as constants.)

1.5Q. Suppose that the regression model is Yi = 10 + 2Xi2 + β3Xi3 + ei for
i = 1, ..., n where the ei are iid N(0, σ2) random variables. The least squares
criterion is Q(η3) = Σ_{i=1}^n (Yi − 10 − 2Xi2 − η3Xi3)2. Find the least squares
estimator β̂3 of β3 by setting the first derivative (d/dη3)Q(η3) equal to zero. Show
that your β̂3 is the global minimizer of the least squares criterion Q by showing
that the second derivative (d2/dη32)Q(η3) > 0 for all values of η3.
1.6. Suppose x1, ..., xn are iid p × 1 random vectors from a multivariate
t-distribution with parameters µ and Σ with d degrees of freedom. Then
E(xi) = µ and Cov(xi) = [d/(d − 2)] Σ for d > 2. Assuming d > 2, find the
limiting distribution of √n(x − c) for appropriate vector c.

1.7. Suppose x1, ..., xn are iid p × 1 random vectors where E(xi) = e^{0.5} 1
and Cov(xi) = (e^2 − e)I p. Find the limiting distribution of √n(x − c) for
appropriate vector c.

1.8. Suppose x1, ..., xn are iid 2 × 1 random vectors from a multivariate
lognormal LN(µ, Σ) distribution. Let xi = (Xi1, Xi2)T. Following Press
(2005, pp. 149-150), E(Xij) = exp(µj + σj^2/2),
V(Xij) = exp(σj^2)[exp(σj^2) − 1] exp(2µj) for j = 1, 2, and
Cov(Xi1, Xi2) = exp[µ1 + µ2 + 0.5(σ1^2 + σ2^2) + σ12][exp(σ12) − 1]. Find the
limiting distribution of √n(x − c) for appropriate vector c.

1.9. The most used Poisson regression model is Y |x ∼ Poisson(exp(xT β)).


What is the sufficient predictor SP = h(x)?

1.10∗. Suppose that x = (X1, X2, X3, X4)T ∼ N4(µ, Σ) where
µ = (49, 100, 17, 7)T and Σ has rows (3, 1, −1, 0), (1, 6, 1, −1), (−1, 1, 4, 0), (0, −1, 0, 2).

a) Find the distribution of X2 .

b) Find the distribution of (X1 , X3 )T .

c) Which pairs of random variables Xi and Xj are independent?

d) Find the correlation ρ(X1 , X3 ).

1.11∗. Recall that if X ∼ Np(µ, Σ), then the conditional distribution of
X 1 given that X 2 = x2 is multivariate normal with mean µ1 + Σ12Σ22^{-1}(x2 −
µ2) and covariance matrix Σ11 − Σ12Σ22^{-1}Σ21.
Let σ12 = Cov(Y, X) and suppose Y and X follow a bivariate normal
distribution
(Y, X)T ∼ N2((49, 100)T, Σ) where Σ has rows (16, σ12), (σ12, 25).

a) If σ12 = 0, find Y |X. Explain your reasoning.

b) If σ12 = 10, find E(Y |X).

c) If σ12 = 10, find Var(Y |X).

1.12. Let σ12 = Cov(Y, X) and suppose Y and X follow a bivariate normal
distribution
(Y, X)T ∼ N2((15, 20)T, Σ) where Σ has rows (64, σ12), (σ12, 81).

a) If σ12 = 10, find E(Y |X).

b) If σ12 = 10, find Var(Y |X).

c) If σ12 = 10, find ρ(Y, X), the correlation between Y and X.

1.13. Suppose that

X ∼ (1 − γ)ECp (µ, Σ, g1 ) + γECp (µ, cΣ, g2 )

where c > 0 and 0 < γ < 1. Following Example 1.17, show that X has
an elliptically contoured distribution assuming that all relevant expectations
exist.
1.14. In Theorem 1.39b, show that if the second moments exist, then Σ
can be replaced by Cov(X).
1.15. Using the notation in Theorem 1.40, show that if the second mo-
ments exist, then

Σ XX^{-1} Σ XY = [Cov(X)]^{-1} Cov(X, Y).

1.16. Using the notation under Theorem 1.38, show that if X is elliptically
contoured, then the conditional distribution of X 1 given that X 2 = x2 is
also elliptically contoured.

1.17∗ . Suppose Y ∼ Nn (Xβ, σ 2 I). Find the distribution of


(X T X)−1 X T Y if X is an n × p full rank constant matrix and β is a p × 1
constant vector.

1.18. Recall that Cov(X, Y ) = E[(X − E(X))(Y − E(Y ))T ]. Using the
notation of Theorem 1.40, let (Y, X T )T be ECp+1 (µ, Σ, g) where Y is a
random variable. Let the covariance matrix of (Y, X T ) be
   
Cov((Y, X T)T) = c [ΣY Y  Σ Y X; Σ XY  Σ XX] = [VAR(Y)  Cov(Y, X); Cov(X, Y)  Cov(X)]
where c is some positive constant. Show that E(Y|X) = α + β T X where
α = µY − β T µX and
β = [Cov(X)]^{-1} Cov(X, Y).
1.19. (Due to R.D. Cook.) Let X be a p×1 random vector with E(X) = 0
and Cov(X) = Σ. Let B be any constant full rank p × r matrix where
1 ≤ r ≤ p. Suppose that for all such conforming matrices B,

E(X|BT X) = M B B T X

where M B is a p × r constant matrix that depends on B.


Using the fact that ΣB = Cov(X, B T X) = E(XX T B) =
E[E(XX T B|BT X)], compute ΣB and show that M B = ΣB(B T ΣB)−1 .
Hint: what acts as a constant in the inner expectation?

1.20. Let x be a p × 1 random vector with covariance matrix Cov(x). Let


A be an r × p constant matrix and let B be a q × p constant matrix. Find
Cov(Ax, Bx) in terms of A, B, and Cov(x).

1.21. Suppose that x = (X1, X2, X3, X4)T ∼ N4(µ, Σ) where µ = (9, 16, 4, 1)T
and Σ has rows (1, 0.8, −0.4, 0), (0.8, 1, −0.56, 0), (−0.4, −0.56, 1, 0), (0, 0, 0, 1).

a) Find the distribution of X3 .


b) Find the distribution of (X2 , X4 )T .
c) Which pairs of random variables Xi and Xj are independent?
d) Find the correlation ρ(X1 , X3 ).
1.22. Suppose x1 , ..., xn are iid p × 1 random vectors where

xi ∼ (1 − γ)Np (µ, Σ) + γNp (µ, cΣ)

with 0 < γ < 1 and c > 0. Then E(xi) = µ and Cov(xi) = [1 + γ(c − 1)]Σ.
Find the limiting distribution of √n(x − d) for appropriate vector d.

1.23. Let X be an n × p constant matrix and let β be a p × 1 constant


vector. Suppose Y ∼ Nn (Xβ, σ 2 I). Find the distribution of HY if H T =
H = H 2 is an n × n matrix and if HX = X. Simplify.

1.24. Recall that if X ∼ Np(µ, Σ), then the conditional distribution of X 1
given that X 2 = x2 is multivariate normal with mean µ1 + Σ12Σ22^{-1}(x2 − µ2)
and covariance matrix Σ11 − Σ12Σ22^{-1}Σ21. Let Y and X follow a bivariate
normal distribution
(Y, X)T ∼ N2((134, 96)T, Σ) where Σ has rows (24.5, 1.1), (1.1, 23.0).

a) Find E(Y |X).

b) Find Var(Y |X).


1.25. Suppose that x = (X1, X2, X3, X4)T ∼ N4(µ, Σ) where µ = (1, 7, 3, 0)T
and Σ has rows (4, 0, 2, 1), (0, 1, 0, 0), (2, 0, 3, 1), (1, 0, 1, 5).

a) Find the distribution of (X1 , X4 )T .

b) Which pairs of random variables Xi and Xj are independent?

c) Find the correlation ρ(X1 , X4 ).

1.26. Suppose that x = (X1, X2, X3, X4)T ∼ N4(µ, Σ) where µ = (3, 4, 2, 3)T
and Σ has rows (3, 2, 1, 1), (2, 4, 1, 0), (1, 1, 2, 0), (1, 0, 0, 3).

a) Find the distribution of (X1 , X3 )T .

b) Which pairs of random variables Xi and Xj are independent?

c) Find the correlation ρ(X1 , X3 ).

1.27. Suppose that x = (X1, X2, X3, X4)T ∼ N4(µ, Σ) where µ = (49, 25, 9, 4)T
and Σ has rows (2, −1, 3, 0), (−1, 5, −3, 0), (3, −3, 5, 0), (0, 0, 0, 4).

a) Find the distribution of (X1 , X3 )T .


b) Which pairs of random variables Xi and Xj are independent?
c) Find the correlation ρ(X1 , X3 ).
1.28. Recall that if X ∼ Np(µ, Σ), then the conditional distribution of X 1
given that X 2 = x2 is multivariate normal with mean µ1 + Σ12Σ22^{-1}(x2 − µ2)
and covariance matrix Σ11 − Σ12Σ22^{-1}Σ21. Let Y and X follow a bivariate
normal distribution
(Y, X)T ∼ N2((49, 17)T, Σ) where Σ has rows (3, −1), (−1, 4).

a) Find E(Y |X).


b) Find Var(Y |X).
1.29. Following Srivastava and Khatri (1979, p. 47), let
X = [X 1; X 2] ∼ Np([µ1; µ2], [Σ11  Σ12; Σ21  Σ22]) (in block form).
a) Show that the nonsingular linear transformation
[I  −Σ12Σ22^{-1}; 0  I] [X 1; X 2] = [X 1 − Σ12Σ22^{-1}X 2; X 2]
∼ Np([µ1 − Σ12Σ22^{-1}µ2; µ2], [Σ11 − Σ12Σ22^{-1}Σ21  0; 0  Σ22]).
b) Then X 1 − Σ12Σ22^{-1}X 2 ⫫ X 2, and
X 1 − Σ12Σ22^{-1}X 2 ∼ Nq(µ1 − Σ12Σ22^{-1}µ2, Σ11 − Σ12Σ22^{-1}Σ21).
By independence, X 1 − Σ12Σ22^{-1}X 2 has the same distribution as
(X 1 − Σ12Σ22^{-1}X 2)|X 2, and the term −Σ12Σ22^{-1}X 2 is a constant, given X 2.
Use this result to show that
X 1|X 2 ∼ Nq(µ1 + Σ12Σ22^{-1}(X 2 − µ2), Σ11 − Σ12Σ22^{-1}Σ21).

1.30. Let Tn be an estimator of θ with µ = E(Tn). Assume Cov(Tn) exists.
Then the mean square error MSEθ(Tn) = tr(E[(Tn − θ)(Tn − θ)T]) =
E[(Tn − θ)T(Tn − θ)]. Show that MSEθ(Tn) = tr[Cov(Tn)] + (µ − θ)T(µ − θ).
Hint: Let tr be the trace operator. If AB is a square matrix, then
tr(AB) = tr(BA). Also, tr(A + B) = tr(A) + tr(B), and E[tr(X)] =
tr(E[X]) when the expected value of the random matrix X exists.

1.31Q . For the simple linear regression model, Yi = β1 + xi β2 + ei for


i = 1, ..., n or Y = Xβ + e where X = [1 x] and β = (β1 β2 )T . Find β̂1 and
β̂2 by minimizing the least squares criterion.

1.32. Consider the following two simple linear regression models:


Model I: Yi = β0 + β1 xi + ei
Model II: Yi = β1 xi + ei
with ei iid with mean 0 and variance σ 2 and i = 1, ..., n,

a) State (but do not derive) the least squares estimators of β1 for both
models. Are these estimators "BLUE"? Why or why not? Quote the relevant
theorem(s) in support of your assertion.
b) Prove that V(β̂1) = σ2/Σ_{i=1}^n (xi − x̄)2 for model I, and V(β̂1) =
σ2/Σ_{i=1}^n xi2 for model II.
c) Referring to b), show that the variance V(β̂1) for Model I is never
smaller than the variance V(β̂1) for model II.

1.33.

1.34.

1.35.

1.36.

1.37.

1.38.

1.39.

R Problem

Use the command source(“G:/linmodpack.txt”) to download the


functions and the command source(“G:/linmoddata.txt”) to download the
data. See Preface or Section 11.1. Typing the name of the slpack func-
tion, e.g. tplot2, will display the code for the function. Use the args com-
mand, e.g. args(tplot2), to display the needed arguments for the function.
For the following problem, the R command can be copied and pasted from
(http://parker.ad.siu.edu/Olive/linmodrhw.txt) into R.
1.40. This problem uses some of the R commands at the end of Section
1.2.1. A problem with response and residual plots is that there can be a lot
of black in the plot if the sample size n is large (more than a few thousand).
A variant of the response plot for the additive error regression model Y =
m(x)+e would plot the identity line, the two lines parallel to the identity line
corresponding to the Section 4.3 large sample 100(1−δ)% prediction intervals
for Yf that depends on Ŷf . Then plot points corresponding to training data
cases that do not lie in their 100(1−δ)% PI. We will use δ = 0.01, n = 100000,
and p = 8.
a) Copy and paste the commands for this part from linmodrhw into R.
They make the usual response plot with a lot of black. Do not include the
plot in Word.

b) Copy and paste the commands for this part into R. They make the
response plot with the points within the pointwise 99% prediction interval
bands omitted. Include this plot in Word. For example, left click on the plot
and hit the Ctrl and c keys at the same time to make a copy. Then paste the
plot into Word, e.g., get into Word and hit the Ctrl and v keys at the same
time.
c) The additive error regression model is a 1D regression model. What is
the sufficient predictor = h(x)?
1.41. The linmodpack function tplot2 makes transformation plots for
the multiple linear regression model Y = t(Z) = xT β + e. Type = 1 for full
model OLS and should not be used if n < 5p, type = 2 for elastic net, 3 for
lasso, 4 for ridge regression, 5 for PLS, 6 for PCR, and 7 for forward selection
with Cp if n ≥ 10p and EBIC if n < 10p. These methods are discussed in
Chapter 5.
Copy and paste the three library commands near the top of linmodrhw
into R.
For parts a) and b), n = 100, p = 4 and Y = log(Z) = 0x1 + x2 + 0x3 +
0x4 + e = x2 + e. (Y and Z are swapped in the R code.)
a) Copy and paste the commands for this part into R. This makes the
response plot for the elastic net using Y = Z and x when the linear model
needs Y = log(Z). Do not include the plot in Word, but explain why the plot
suggests that something is wrong with the model Z = xT β + e.
b) Copy and paste the command for this part into R. Right click Stop 3
times until the horizontal axis has log(z). This is the response plot for the
true model Y = log(Z) = xT β + e = x2 + e. Include the plot in Word. Right
click Stop 3 more times so that the cursor returns in the command window.
c) Is the response plot linear?
For the remaining parts, n = p − 1 = 100 and Y = log(Z) = 0x1 + x2 +
0x3 + · · · + 0x101 + e = x2 + e. Hence the model is sparse.
d) Copy and paste the commands for this part into R. Right click Stop 3
times until the horizontal axis has log(z). This is the response plot for the
true model Y = log(Z) = xT β + e = x2 + e. Include the plot in Word. Right
click Stop 3 more times so that the cursor returns in the command window.
e) Is the plot linear?
f) Copy and paste the commands for this part into R. Right click Stop 3
times until the horizontal axis has log(z). This is the response plot for the true
model Y = log(Z) = xT β + e = x2 + e. Include the plot in Word. Right click
Stop 3 more times so that the cursor returns in the command window. PLS
is probably overfitting since the identity line nearly interpolates the fitted
points.
1.42. Get the R commands for this problem. The data is such that Y =
2 + x2 + x3 + x4 + e where the zero mean errors are iid [exponential(2) -
2]. Hence the residual and response plots should show high skew. Note that

β = (2, 1, 1, 1)T . The R code uses 3 nontrivial predictors and a constant, and
the sample size n = 1000.
a) Copy and paste the commands for part a) of this problem into R. Include
the response plot in Word. Is the lowess curve fairly close to the identity line?
b) Copy and paste the commands for part b) of this problem into R.
Include the residual plot in Word: press the Ctrl and c keys as the same time.
Then use the menu command “Paste” in Word. Is the lowess curve fairly
close to the r = 0 line? The lowess curve is a flexible scatterplot smoother.
c) The output out$coef gives β̂. Write down β̂ or copy and paste β̂ into
Word. Is β̂ close to β?
Chapter 2
Full Rank Linear Models

2.1 Projection Matrices and the Column Space

Vector spaces, subspaces, and column spaces should be familiar from linear
algebra, but are reviewed below.

Definition 2.1. A set V ⊆ Rk is a vector space if for any vectors


x, y, z ∈ V, and scalars a and b, the operations of vector addition and scalar
multiplication are defined as follows.
1) (x + y) + z = x + (y + z).
2) x + y = y + x.
3) There exists 0 ∈ V such that x + 0 = x = 0 + x.
4) For any x ∈ V, there exists y = −x such that x + y = y + x = 0.
5) a(x + y) = ax + ay.
6) (a + b)x = ax + bx.
7) (ab) x = a(b x).
8) 1 x = x.

Hence for a vector space, addition is associative and commutative, there


is an additive identity vector 0, there is an additive inverse −x for each
x ∈ V, scalar multiplication is distributive and associative, and 1 is the
scalar identity element.
Two important vector spaces are Rk and V = {0}. Showing that a set M
is a subspace is a common method to show that M is a vector space.

Definition 2.2. Let M be a nonempty subset of a vector space V. If i)


ax ∈ M ∀x ∈ M and for any scalar a, and ii) x + y ∈ M ∀x, y ∈ M, then
M is a vector space known as a subspace.

Definition 2.3. The set of all linear combinations of x1, ..., xn is the
vector space known as span(x1, ..., xn) = {y ∈ R^k : y = Σ_{i=1}^n ai xi for some
constants a1, ..., an}.


Definition 2.4. Let x1, ..., xk ∈ V. If ∃ scalars α1, ..., αk not all zero such
that Σ_{i=1}^k αi xi = 0, then x1, ..., xk are linearly dependent. If Σ_{i=1}^k αi xi = 0
only if αi = 0 ∀ i = 1, ..., k, then x1, ..., xk are linearly independent. Suppose
{x1, ..., xk} is a linearly independent set and V = span(x1, ..., xk). Then
{x1, ..., xk} is a linearly independent spanning set for V, known as a basis.

Definition 2.5. Let A = [a1 a2 ... am ] be an n × m matrix. The space


spanned by the columns of A = column space of A = C(A). Then C(A) =
{y ∈ Rn : y = Aw for some w ∈ Rm } = {y : y = w1 a1 + w2 a2 + · · · + wm am
for some scalars w1 , ...., wm} = span(a1 , ..., am ).

The space spanned by the rows of A is the row space of A. The row space
of A is the column space C(A^T) of A^T. Note that
Aw = [a1 a2 ... am](w1, ..., wm)^T = Σ_{i=1}^m wi ai.

With the design matrix X, different notation is used to denote the columns
of X since both the columns and rows of X are important. Let
X = [v1 v2 ... vp]
be an n × p matrix with ith row xi^T for i = 1, ..., n. Note that C(X) = {y ∈ R^n : y = Xb for some b ∈ R^p}.
Hence Xb is a typical element of C(X) and Aw is a typical element of C(A).
Note that
Xb = (x1^T b, ..., xn^T b)^T = [v1 v2 ... vp](b1, ..., bp)^T = Σ_{i=1}^p bi vi.

If the function X f (b) = Xb where the f indicates that the operation


X f : Rp → Rn is being treated as a function, then C(X) is the range of X f .
Hence some authors call the column space of A the range of A.
Let B be n × k, and let A be n × m. One way to show C(A) = C(B)
is to show that i) ∀x ∈ Rm , ∃ y ∈ Rk such that Ax = By ∈ C(B) so
C(A) ⊆ C(B), and ii) ∀y ∈ Rk , ∃ x ∈ Rm such that By = Ax ∈ C(A) so
C(B) ⊆ C(A). Another way to show C(A) = C(B) is to show that a basis
for C(A) is also a basis for C(B).
Definition 2.6. The dimension of a vector space V = dim(V) = the
number of vectors in a basis of V. The rank of a matrix A = rank(A) =
dim(C(A)), the dimension of the column space of A. Let A be n × m. Then

rank(A) = rank(AT ) ≤ min(m, n). If rank(A) = min(m, n), then A has full
rank, or A is a full rank matrix.

Definition 2.7. The null space of A = N (A) = {x : Ax = 0} = kernel


of A. The nullity of A = dim[N (A)]. The subspace V ⊥ = {y ∈ Rk : y ⊥ V}
is the orthogonal complement of V, where y ⊥ V means y T x = 0 ∀ x ∈ V.
N (AT ) = [C(A)]⊥, so N (A) = [C(AT )]⊥ .

Theorem 2.1: Rank Nullity Theorem. Let A be n × m. Then


rank(A) + dim(N (A)) = m.

Generalized inverses are useful for the non-full rank linear model and for
defining projection matrices.

Definition 2.8. A generalized inverse of an n × m matrix A is any


m × n matrix A− satisfying AA− A = A.

Other names are conditional inverse, pseudo inverse, g-inverse, and p-


inverse. Usually a generalized inverse is not unique, but if A−1 exists, then
A− = A−1 is unique.

Notation: G := A− means G is a generalized inverse of A.

Recall that if A is idempotent, then A2 = A. A matrix A is tripotent if


A = A. For both these cases, A := A− since AAA = A. It will turn out
3

that symmetric idempotent matrices are projection matrices.

Definition 2.9. Let V be a subspace of Rn . Then every y ∈ Rn can be


expressed uniquely as y = w + z where w ∈ V and z ∈ V ⊥ . Let X =
[v1 v 2 ... v p ] be n × p, and let V = C(X) = span(v 1 , ..., vp ). Then the n × n
matrix P V = P X is a projection matrix on C(X) if P X y = w ∀ y ∈ Rn .
(Here y = w + z = wy + z y , so w depends on y.)

Note: Some authors call a projection matrix an “orthogonal projection


matrix,” and call an idempotent matrix a “projection matrix.”
Theorem 2.2: Projection Matrix Theorem. a) P X is unique.
b) P X = X(X T X)− X T where (X T X)− is any generalized inverse of
X T X.
c) A is a projection matrix on C(A) iff A is symmetric and idempotent. Hence
P X is a projection matrix on C(P X ) = C(X), and P X is symmetric and
idempotent. Also, each column pi of P X satisfies P X pi = pi ∈ C(X).
d) I n − P X is the projection matrix on [C(X)]⊥.
e) A = P X iff i) y ∈ C(X) implies Ay = y and ii) y ⊥ C(X) implies
Ay = 0.
f) P X X = X, and P X W = W if each column of W ∈ C(X).
g) P X v i = v i .
h) If C(X R ) is a subspace of C(X), then P X P X R = P X R P X = P X R .

i) The eigenvalues of P X are 0 or 1.


j) Let tr(A) = trace(A). Then rank(P X ) = tr(P X ) = rank(X).
k) P X is singular unless X is a nonsingular n×n matrix, and then P X = I n .
l) Let X = [Z X r ] where rank(X) = rank(X r ) = r so the columns of X r
form a basis for C(X). Then diag(0, (X rT X r)^{-1}), the block diagonal matrix
with blocks 0 and (X rT X r)^{-1},
is a generalized inverse of X T X, and P X = X r(X rT X r)^{-1} X rT.
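A minimal R sketch (not from the text): it forms P X = X(X T X)^− X T using the Moore-Penrose generalized inverse from MASS::ginv for a deliberately non-full rank X, and checks parts c), f), and j) numerically. The simulated data are made up.

library(MASS)
set.seed(7)
X <- cbind(1, rnorm(10), rnorm(10))
X <- cbind(X, X[,2] + X[,3])    # 4th column is a linear combination, so rank(X) = 3
G <- ginv(t(X) %*% X)           # a generalized inverse of X^T X
P <- X %*% G %*% t(X)
all.equal(P, t(P))              # symmetric
all.equal(P, P %*% P)           # idempotent
all.equal(P %*% X, X)           # P_X X = X
c(sum(diag(P)), qr(X)$rank)     # tr(P_X) = rank(X) = 3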

Two important consequences of the above theorem follow. First, P is


a projection matrix iff P is symmetric and idempotent. Partition X as
X = [X 1 X 2 ], let P be the projection matrix for C(X) and let P 1 be
the projection matrix for C(X 1 ). Since C(P 1 ) = C(X 1 ) ⊆ C(X), P P 1 = P 1 .
Hence P 1 P = (P P 1 )T = P T1 = P 1 .
Some results from linear algebra are needed to prove parts of the above
theorem. Unless told otherwise, matrices in this text are real. Then the
eigenvalues of a symmetric matrix A are real. If A is symmetric, then
rank(A) = number of nonzero eigenvalues of A. Recall that if AB is
a square matrix, then tr(AB) = tr(BA). Similarly, if A1 is m1 × m2 ,
A2 is m2 × m3 , ..., Ak−1 is mk−1 × mk , and Ak is mk × m1 , then
tr(A1 A2 · · · Ak ) = tr(Ak A1 A2 · · · Ak−1 ) = tr(Ak−1 Ak A1 A2 · · · Ak−2 ) =
· · · = tr(A2 A3 · · · Ak A1 ). Also note that a scalar is a 1 × 1 matrix, so
tr(a) = a. The next two paragraphs follow Christensen (1987, pp. 335-338)
closely.
If P and A are n × n matrices, then P = A iff P y = Ay for all y ∈ Rn
iff y T P = y T A for all y ∈ Rn . Let V be a subspace of Rn . Let y ∈ Rn
with y = w + z where w ∈ V and z ∈ V ⊥ . Let A and P be projection
matrices on V. Then Ay = w = P y. Since y was arbitrary, A = P and
projection matrices are unique. We prove that P X is symmetric below. Then
the projection matrix A = A(AT A)− A is symmetric by replacing X by A.
Hence Az = AT z = 0. Thus A2 y = Aw = w = Ay, and A2 = A since y
was arbitrary.
Now suppose A2 = A = AT , and let w ∈ C(A). Hence w = Aa for some
vector a. Thus Aw = A2 a = Aa = w. Let z ⊥ C(A) = C(AT ). Then
zT A = z T AT = 0. Thus Ay = Aw = w, and A is a projection matrix on
C(A). Note that C(PX ) ⊆ C(X) since P X X = X, and C(X) ⊆ C(PX )
since P X = XW where W = (X T X)− X T . Thus C(X) = C(P X ). To
show that P X X = X, let y = w + z with w = Xa and z T X = 0.
Note that y T P X X = w T X(X T X)− X T X = aT X T X(X T X)− X T X =
aT X T X = wT X = y T X. Since y was arbitrary, P X X = X. Note that
P X y = P X (w+z) = P X w = X(X T X)− X T Xa = P X Xa = Xa = w.
Thus P X is a projection matrix on C(X).

Note that if G is a generalized linear inverse of a symmetric matrix A,


then AT = AT GT AT = AGT A = A. Hence GT is a generalized linear
inverse of A. Also, AGAGT A = AGT A = A. Hence GAGT , a symmetric
matrix, is a generalized inverse of A. Thus a symmetric matrix A always
has a symmetric generalized linear inverse. Hence let B := (X T X)− be a
symmetric matrix. Then P X = X T BX = X T (X T X)− X is symmetric
since P X is unique, even if (X T X)− is not symmetric.
For part d), note that if y = w + z, then (I n − P X )y = z ∈ [C(X)]⊥ .
Hence the result follows from the definition of a projection matrix by in-
terchanging the roles of w and z. Part e) follows from the definition of
a projection matrix since if y ∈ C(X) then y = y + 0 where y = w
and 0 = z. If y ⊥ C(X) then y = 0 + y where 0 = w and y = z.
Part g) is a special case of f). In k), P X is singular unless p = n since
rank(X) = r ≤ min(p, n) < max(n, p) unless p = n, and P X is an
n × n matrix. Need rank(P X ) = n for P X to be nonsingular. For h),
P X P X R = P X R by f) since each column of P X r ∈ C(P X ). Taking
transposes and using symmetry shows P X R P X = P X R . For i), if λ is an
eigenvalue of P X , then for some x 6= 0, λx = P X x = P 2X x = λ2 x since
P X is idempotent by c). Hence λ = λ2 is real since P X is symmetric, so
λ = 0 or λ = 1. Then j) follows from i) since rank(P X ) = number of nonzero
eigenvalues of P X = tr(P X ).
For l), note that C(X) = C(X r). Thus X r(X rT X r)^{-1} X rT = P X. Then, in block form,
X T X = [Z T Z  Z T X r; X rT Z  X rT X r],  and
X T X diag(0, (X rT X r)^{-1}) X T X = [Z T X r(X rT X r)^{-1} X rT Z  Z T X r; X rT Z  X rT X r] = X T X
since Z T P X Z = Z T Z because each column of Z ∈ C(X).

Most of the above results apply to full rank and nonfull rank matrices.
A corollary of the following theorem is that if X is full rank, then P X =
X(X T X)−1 X T = H.
Suppose A is p × p. Then the following are equivalent. 1) A is nonsingular,
2) A has a left inverse L with LA = I p , and 3) A has a right inverse R
with AR = I p . To see this, note that 1) implies (2) and 3) since A−1 A =
I p = AA−1 by the definition of an inverse matrix. Suppose AR = I p . Then
the determinant det(I p ) = 1 = det(AR) = det(A) det(R). Hence det(A) 6= 0
and A is nonsingular. Hence R = A−1 AR = A−1 and 3) implies 1). Similarly
2) implies 1). Also note that L = LI p = LAR = I p R = R = A−1 . Hence
in the proof below, we could just show that A− = L or A− = R.

Theorem 2.3. If A is nonsingular, the unique generalized inverse of A is


A−1 .

Proof. Let A^− be any generalized inverse of A. We give two proofs. i)
A^− = A^{-1}AA^−AA^{-1} = A^{-1}AA^{-1} = A^{-1}. ii) A^−A = A^{-1}AA^−A =
A^{-1}A = I and AA^− = AA^−AA^{-1} = AA^{-1} = I. Thus A^− = A^{-1}. □

2.2 Quadratic Forms

Definition 2.10. Let A be an n × n matrix and let x ∈ R^n. Then a quadratic
form is x^T Ax = Σ_{i=1}^n Σ_{j=1}^n aij xi xj, and a linear form is Ax. Suppose A
is a symmetric matrix. Then A is positive definite (A > 0) if x^T Ax >
0 ∀ x ≠ 0, and A is positive semidefinite (A ≥ 0) if x^T Ax ≥ 0 ∀ x.

Notation: The matrix A in a quadratic form xT Ax will be symmetric


unless told otherwise. Suppose B is not symmetric. Since the quadratic form
is a scalar, xT Bx = (xT Bx)T = xT B T x = xT (B+B T )x/2, and the matrix
A = (B + B T )/2 is symmetric. If A ≥ 0 then the eigenvalues λi of A are
real and nonnegative. If A ≥ 0, let λ1 ≥ λ2 ≥ · · · ≥ λn ≥ 0. If A > 0, then
λn > 0. Some authors say symmetric A is nonnegative definite if A ≥ 0, and
that A is positive semidefinite if A ≥ 0 and there exists a nonzero x such
that xT Ax = 0. Then A is singular.

The spectral decomposition theorem is very useful. One application for


linear models is defining the square root matrix.

Theorem 2.4: Spectral Decomposition Theorem. Let A be an n × n


symmetric matrix with eigenvalue eigenvector pairs (λ1, t1), (λ2, t2), ..., (λn, tn)
where ti^T ti = 1 and ti^T tj = 0 if i ≠ j for i = 1, ..., n. Hence A ti = λi ti. Then
the spectral decomposition of A is
A = Σ_{i=1}^n λi ti ti^T = λ1 t1 t1^T + · · · + λn tn tn^T.
Let T = [t1 t2 · · · tn] be the n × n orthogonal matrix with ith column
ti. Then T T^T = T^T T = I. Let Λ = diag(λ1, ..., λn) and let Λ^{1/2} =
diag(√λ1, ..., √λn). Then A = T Λ T^T.

Definition 2.11. If A is a positive definite n × n symmetric matrix with
spectral decomposition A = Σ_{i=1}^n λi ti ti^T, then A = T Λ T^T and
A^{-1} = T Λ^{-1} T^T = Σ_{i=1}^n (1/λi) ti ti^T.
The square root matrix A^{1/2} = T Λ^{1/2} T^T is a positive definite symmetric
matrix such that A^{1/2} A^{1/2} = A.
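A minimal R sketch (not from the text; the matrix is made up) computes the spectral decomposition with eigen() and forms A^{-1} and the square root matrix A^{1/2}.

A <- matrix(c(4, 1, 1, 3), 2, 2)
e <- eigen(A, symmetric = TRUE)
Tmat <- e$vectors; lam <- e$values
Ainv  <- Tmat %*% diag(1/lam) %*% t(Tmat)
Aroot <- Tmat %*% diag(sqrt(lam)) %*% t(Tmat)   # square root matrix A^{1/2}
all.equal(Aroot %*% Aroot, A)
all.equal(Ainv, solve(A))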

The following theorem is often useful. Both the expected value and trace
are linear operators. Hence tr(A + B) = tr(A) + tr(B), and E[tr(X)] =
tr(E[X]) when the expected value of the random matrix X exists.

Theorem 2.5: expected value of a quadratic form. Let x be a ran-


dom vector with E(x) = µ and Cov(x) = Σ. Then

E(xT Ax) = tr(AΣ) + µT Aµ.

Proof. Two proofs are given. i) Searle (1971, p. 55): Note that E(xxT ) =
Σ + µµT . Since the quadratic form is a scalar and the trace is a linear
operator, E[xT Ax] = E[tr(xT Ax)] = E[tr(AxxT )] = tr(E[AxxT ]) =
tr(AΣ + AµµT ) = tr(AΣ) + tr(AµµT ) = tr(AΣ) + µT Aµ.
ii) Graybill (1976, p. 140): Using E(xi xj) = σij + µi µj, E[x^T Ax] =
Σ_{i=1}^n Σ_{j=1}^n aij E(xi xj) = Σ_{i=1}^n Σ_{j=1}^n aij(σij + µi µj) = tr(AΣ) + µ^T Aµ. □
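A minimal R sketch (not from the text; A, µ, and Σ are made up) checks Theorem 2.5 by Monte Carlo, again using MASS::mvrnorm.

library(MASS)
set.seed(9)
A <- matrix(c(2, 1, 1, 3), 2, 2)
mu <- c(1, -1); Sigma <- matrix(c(1, 0.5, 0.5, 2), 2, 2)
x <- mvrnorm(1e5, mu, Sigma)
mean(rowSums((x %*% A) * x))                  # Monte Carlo estimate of E(x^T A x)
sum(diag(A %*% Sigma)) + t(mu) %*% A %*% mu   # tr(A Sigma) + mu^T A mu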

Much of the theoretical results for quadratic forms assumes that the ei
are iid N (0, σ 2 ). These exact results are often special cases of large sample
theory that holds for a large class of iid zero mean error distributions that
have V (ei ) ≡ σ 2 . For linear models, Y is typically an n × 1 random vector.
The following theorem from statistical inference will be useful.

Theorem 2.6. Suppose x ⫫ y, g(x) is a function of x alone, and h(y) is
a function of y alone. Then g(x) ⫫ h(y).

The following theorem shows that independence of linear forms implies


independence of quadratic forms.

Theorem 2.7. If A and B are symmetric matrices and AY ⫫ BY, then
Y^T AY ⫫ Y^T BY.
Proof. Let g(AY) = Y^T A^T A^− AY = Y^T AA^− AY = Y^T AY, and
let h(BY) = Y^T B^T B^− BY = Y^T BB^− BY = Y^T BY. Then the result
follows by Theorem 2.6. □

Theorem 2.8. Let Y ∼ Nn(µ, Σ). a) Let u = AY and w = BY.
Then AY ⫫ BY iff Cov(u, w) = AΣB^T = 0 iff BΣA^T = 0. Note that if
Σ = σ^2 I n, then AY ⫫ BY iff AB^T = 0 iff BA^T = 0.
b) If A is a symmetric n × n matrix, and B is an m × n matrix, then
Y^T AY ⫫ BY if AΣB^T = 0 if BΣA^T = BΣA = 0. Note that if Σ =
σ^2 I n, then Y^T AY ⫫ BY if AB^T = 0 if BA = 0.
Proof. a) Note that (u^T, w^T)^T = ((AY)^T, (BY)^T)^T = [A; B] Y
has a multivariate normal distribution. Hence AY ⫫ BY iff Cov(u, w) = 0.
Taking transposes shows Cov(u, w) = AΣB^T = 0 iff BΣA^T = 0.
b) If AΣB^T = 0, then AY ⫫ BY by a). Let g(AY) = Y^T A^T A^− AY =
Y^T AA^− AY = Y^T AY. Then g(AY) = Y^T AY ⫫ BY by Theorem 2.6. □

One of the most useful theorems for proving that Y^T AY ⫫ Y^T BY is
Craig's Theorem. Taking transposes shows AΣB = 0 iff BΣA = 0. Note
that if AΣB = 0, then (∗) holds. Note AΣB = 0 is a sufficient condition
for Y^T AY ⫫ Y^T BY if Σ ≥ 0, but necessary and sufficient if Σ > 0. If
Y ∼ Nn(µ, Σ) and AY ⫫ BY, then Y^T AY ⫫ Y^T BY, but if Σ is singular,
it is possible that Y^T AY ⫫ Y^T BY even if AY and BY are dependent.
Theorem 2.9: Craig's Theorem. Let Y ∼ Nn(µ, Σ).
a) If Σ > 0, then Y^T AY ⫫ Y^T BY iff AΣB = 0 iff BΣA = 0.
b) If Σ ≥ 0, then Y^T AY ⫫ Y^T BY if AΣB = 0 (or if BΣA = 0).
c) If Σ ≥ 0, then Y^T AY ⫫ Y^T BY iff
(∗) ΣAΣBΣ = 0, ΣAΣBµ = 0, ΣBΣAµ = 0, and µ^T AΣBµ = 0.
Proof. For a) and b), AΣB = 0 implies Y^T AY ⫫ Y^T BY by c)
or by Theorems 2.6, 2.7, and 2.8. See Reid and Driscoll (1988) for why
Y^T AY ⫫ Y^T BY implies AΣB = 0 in a).
c) See Driscoll and Krasnicka (1995).

The following theorem is a corollary of Craig’s Theorem.

Theorem 2.10. Let Y ∼ Nn(0, I n), with A and B symmetric. If
Y^T AY ∼ χ2r and Y^T BY ∼ χ2d, then Y^T AY ⫫ Y^T BY iff AB = 0.

Theorem 2.11. If Y ∼ Nn (µ, Σ) with Σ > 0, then the population


squared Mahalanobis distance (Y − µ)T Σ −1 (Y − µ) ∼ χ2n .
Proof. Let Z = Σ^{-1/2}(Y − µ) ∼ Nn(0, I). Then Z = (Z1, ..., Zn)^T where
the Zi are iid N(0, 1). Hence (Y − µ)^T Σ^{-1}(Y − µ) = Z^T Z = Σ_{i=1}^n Zi^2 ∼ χ2n. □

For large sample theory, the noncentral χ2 distribution is important. If
Z1, ..., Zn are independent N(0, 1) random variables, then Σ_{i=1}^n Zi^2 ∼ χ2n.
The noncentral χ2(n, γ) distribution is the distribution of Σ_{i=1}^n Yi^2 where
Y1, ..., Yn are independent N(µi, 1) random variables. Note that if Y ∼
N(µ, 1), then Y^2 ∼ χ2(n = 1, γ = µ2/2), and if Y ∼ N(√(2γ), 1), then
Y^2 ∼ χ2(n = 1, γ).

Definition 2.12. Suppose Y1, ..., Yn are independent N(µi, 1) random
variables so that Y = (Y1, ..., Yn)^T ∼ Nn(µ, I n). Then Y^T Y = Σ_{i=1}^n Yi^2 ∼
χ2(n, γ = µ^T µ/2), a noncentral χ2(n, γ) distribution, with n degrees of free-
dom and noncentrality parameter γ = µ^T µ/2 = (1/2) Σ_{i=1}^n µi^2 ≥ 0. The noncen-
trality parameter δ = µ^T µ = 2γ is also used. If W ∼ χ2n, then W ∼ χ2(n, 0)
so γ = 0. The χ2n distribution is also called the central χ2 distribution.

Some of the proof ideas for the following theorem came from Marden
(2012, pp. 48, 96-97). Recall that if Y1, ..., Yk are independent with moment
generating functions (mgfs) mYi(t), then the mgf of Σ_{i=1}^k Yi is
mΣYi(t) = Π_{i=1}^k mYi(t). If Y ∼ χ2(n, γ), then the probability density function (pdf) of Y
is rather hard to use, but is given by
f(y) = Σ_{j=0}^∞ [e^{−γ} γ^j / j!] [y^{n/2+j−1} e^{−y/2} / (2^{n/2+j} Γ(n/2 + j))] = Σ_{j=0}^∞ pγ(j) fn+2j(y)
where pγ(j) = P(W = j) is the probability mass function of a Poisson(γ)
random variable W, and fn+2j(y) is the pdf of a χ2n+2j random variable. If
γ = 0, define γ^0 = 1 in the first sum, and p0(0) = 1 with p0(j) = 0 for
j > 0 in the second sum. For computing moments and the moment gen-
erating function, the integration and summation operations can be inter-
changed. Hence ∫_0^∞ f(y)dy = Σ_{j=0}^∞ pγ(j) ∫_0^∞ fn+2j(y)dy = Σ_{j=0}^∞ pγ(j) = 1.
Similarly, if mn+2j(t) = (1 − 2t)^{−(n+2j)/2} is the mgf of a χ2n+2j ran-
dom variable, then the mgf of Y is mY(t) = E(e^{tY}) = ∫_0^∞ e^{ty} f(y)dy =
Σ_{j=0}^∞ pγ(j) ∫_0^∞ e^{ty} fn+2j(y)dy = Σ_{j=0}^∞ pγ(j) mn+2j(t).

Theorem 2.12. a) If Y ∼ χ2 (n, γ), then the moment generating function


of Y is mY(t) = (1 − 2t)^{-n/2} exp(−γ[1 − (1 − 2t)^{-1}]) =
(1 − 2t)^{-n/2} exp[2γt/(1 − 2t)] for t < 0.5.
b) If Yi ∼ χ2(ni, γi) are independent for i = 1, ..., k, then
Σ_{i=1}^k Yi ∼ χ2(Σ_{i=1}^k ni, Σ_{i=1}^k γi).
c) If Y ∼ χ2(n, γ), then E(Y) = n + 2γ and V(Y) = 2n + 8γ.
Proof. Two proofs are given. a) i) From the above remarks, and using ex =
 j
∞ ∞ ∞ e−γ γ
X xj X e−γ γ j X 1−2t
, mY (t) = (1−2t)−(n+2j)/2 = (1−2t)−n/2 =
j=0
j! j=0
j! j=0
j!
   
−n/2 γ −n/2 2γt
(1 − 2t) exp −γ + = (1 − 2t) exp .
1 − 2t 1 − 2t

ii) Let W ∼ N(√δ, 1) where δ = 2γ. Then W^2 ∼ χ2(1, δ/2) = χ2(1, γ).
Let W ⫫ X where X ∼ χ2n−1 ∼ χ2(n − 1, 0), and let Y = W^2 + X ∼ χ2(n, γ)
by b). Then mW^2(t) =
Z ∞  
2 2 1 −1 √
E(etW ) = etw √ exp (w − δ)2 dw =
−∞ 2π 2
Z ∞  
1 2 1 √
√ exp tw 2 − (w 2 − 2 δ w + δ) dw =
−∞ 2π 2 2
Z ∞  
1 −1 2 2

√ exp (w − 2tw − 2 δ w + δ) dw =
−∞ 2π 2
Z   Z ∞  

1 −1 2 √ 1 −1
√ exp (w (1 − 2t) − 2 δw + δ) dw = √ exp A dw
−∞ 2π 2 −∞ 2π 2

where A = [ 1 − 2t (w − b)]2 + c with

δ −2tδ
b= and c =
1 − 2t 1 − 2t
after algebra. Hence m2W (t) =
r Z " # r

−c/2 1 1 1 −1 1 2 −c/2 1
e √ q exp 1 (w − b) dw = e
1 − 2t −∞ 2π 1 2 1−2t 1 − 2t
1−2t

R∞
since the integral = 1 = f(w)dw where f(w) is the N (b, 1/(1 − 2t)) pdf.
−∞
Thus  
1 tδ
mW 2 (t) = √ exp .
1 − 2t 1 − 2t
So mY (t) = mW 2 +X (t) = mW 2 (t)mX (t) =
  (n−1)/2  
1 tδ 1 1 tδ
√ exp = exp =
1 − 2t 1 − 2t 1 − 2t (1 − 2t)n/2 1 − 2t
 
−n/2 2γt
(1 − 2t) exp .
1 − 2t
b) i) By a), mPk Yi (t) =
i=1

k
Y k
Y
mYi (t) = (1 − 2t)−ni/2 exp(−γi [1 − (1 − 2t)−1 ]) =
i=1 i=1

k
!
Pk X
− ni /2 −1
(1 − 2t) i=1 exp − γi [1 − (1 − 2t) ] ,
i=1
k k
!
X X
2
the χ ni , γi mgf.
i=1 i=1
ii) Let Yi = Z Ti Z i
where the Z i ∼ Nni (µi , I ni ) are independent. Let
    
Z1 µ1
 Z2   µ2  
    
Z =  .  ∼ NPk ni  .  , I Pk ni  ∼ NPk ni (µZ , I Pk ni ).
 ..  i=1  ..  i=1  i=1 i=1

Zk µk

k k k
!
X X X
T
Then Z Z = Z Ti Z i = Yi ∼ χ 2
ni , γZ where
i=1 i=1 i=1

k k
µTZ µZ X µTi µi X
γZ = = = γi .
2 2
i=1 i=1

c) i) Let W ∼ χ2 (1, γ) X ∼ χ2n−1 ∼ χ2 (n − 1, 0). Then by b) Y =


√ √
W +X √ ∼ χ2 (n, γ). Let Z ∼ N (0, 1) √
and δ = 2γ. Then√ δ+Z ∼ N ( δ, 1), and
W = ( δ +Z)2 . Thus E(W ) = E[( δ +Z)2 ] = δ +2 δE(Z)+E(Z 2 ) = δ +1.
Using the binomial theorem
n  
X n i n−i
(x + y)n = xy
i
i=0
√ √
with x = δ, y = Z, and n = 4, E(W 2 ) = E[( δ + Z)4 ] =

E[δ 2 + 4δ 3/2 Z + 6δZ 2 + 4 δZ 3 + Z 4 ] = δ 2 + 6δ + 3

since E(Z) = E(Z 3 ) = 0, and E(Z 4 ) = 3 by Problem 2.8. Hence V (W ) =


E(W 2 ) − [E(W )]2 = δ 2 + 6δ + 3 − (δ + 1)2 = δ 2 + 6δ + 3 − δ 2 − 2δ − 1 = 4δ + 2.
Thus E(Y ) = E(W ) + E(X) = δ + 1 + n − 1 = n + δ = n + 2γ, and
V (Y ) = V (W ) + V (X) = 4δ + 2 + 2(n − 1) = 8δ + 2n.
ii) Let Zi ∼ N (µi , 1) so E(Zi2 ) = σ 2 + µ2i = 1 + µ2i . By Problem 2.8,
E(Zi3 ) = µ3i + P 3µi , and E(Zi4 ) = µ4i + 6µ2i + 3. Hence Y ∼ P χ2 (n, γ) where
T n 2 n 2
Y = Z Z =
P i=1 Zi where Z ∼ Nn (µ, I). So PE(Y ) = i=1 E(Zi ) =
n 2 T n 2
i=1 (1 + µi ) = n + µ µ = n + 2γ, and V (Y ) = i=1 V (Zi ) =

n
X n
X n
X
[E(Zi4 ) − (E[Zi2 ])2 ] = [µ4i + 6µ2i + 3 − µ4i − 2µ2i − 1] = [4µ2i + 2]
i=1 i=1 i=1

= 2n + 4µT µ = 2n + 8γ. 
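A minimal R sketch (not from the text) checks part c) by simulation. Note that rchisq() is parameterized by ncp = δ = 2γ, so the claimed moments become df + ncp and 2 df + 4 ncp; the df and γ values used here are made up.

set.seed(10)
n <- 4; gam <- 3
y <- rchisq(1e6, df = n, ncp = 2*gam)   # noncentral chi^2(n, gamma) draws
c(mean(y), n + 2*gam)                   # E(Y) = n + 2 gamma
c(var(y),  2*n + 8*gam)                 # V(Y) = 2n + 8 gamma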

For the following theorem, see Searle (1971, p. 57). Most of the results in
Theorem 2.14 are corollaries of Theorem 2.13. Recall that the matrix in a
quadratic form is symmetric, unless told otherwise.

Theorem 2.13. If Y ∼ Nn(µ, Σ) where Σ > 0, then Y^T AY ∼
χ2(rank(A), µ^T Aµ/2) iff AΣ is idempotent.

For the following theorem, note that if A = AT = A2 , then A is a


projection matrix since A is symmetric and idempotent. An n × n projection
matrix A is not a full rank matrix unless A = I n . See Theorem 2.2 j) and
k). Often results are given for Y ∼ Nn (0, I), and then the Y ∼ Nn (0, σ 2 I)
case is handled as in c) and g) below, since Y /σ ∼ Nn (0, I).

Theorem 2.14. Let A = A^T be symmetric.
a) If Y ∼ Nn(0, Σ) where Σ is a projection matrix, then Y^T Y ∼
χ2(rank(Σ)) where rank(Σ) = tr(Σ).
b) If Y ∼ Nn(0, I), then Y^T AY ∼ χ2r iff A is idempotent with rank(A) =
tr(A) = r.
c) Let Y ∼ Nn(0, σ^2 I). Then
Y^T AY/σ^2 ∼ χ2r, or Y^T AY ∼ σ^2 χ2r,
iff A is idempotent of rank r.
d) If Y ∼ Nn(0, Σ) where Σ > 0, then Y^T AY ∼ χ2r iff AΣ is idempotent
with rank(A) = r = rank(AΣ).
e) If Y ∼ Nn(µ, σ^2 I) then Y^T Y/σ^2 ∼ χ2(n, µ^T µ/(2σ^2)).
f) If Y ∼ Nn(µ, I) then Y^T AY ∼ χ2(r, µ^T Aµ/2) iff A is idempotent
with rank(A) = tr(A) = r.
g) If Y ∼ Nn(µ, σ^2 I) then Y^T AY/σ^2 ∼ χ2(r, µ^T Aµ/(2σ^2)) iff A is idempotent
with rank(A) = tr(A) = r.

Note that A is a projection matrix iff A is idempotent in b) since A is
symmetric. Thus b) is a special case of d). To see that c) holds, note Z = Y/σ ∼
Nn(0, I). Hence by b)
Y^T AY/σ^2 = Z^T AZ ∼ χ2r
iff A is idempotent of rank r. Much of Theorem 2.14 follows from Theorem
2.13. For f), we give another proof from Christensen (1987, p. 8). Since A is a
projection matrix with rank(A) = r, let {b1, ..., br} be an orthonormal basis
for C(A) and let B = [b1 b2 ... br]. Then B^T B = I r and the projection
matrix A = B(B^T B)^{-1} B^T = BB^T. Thus Y^T AY = Y^T BB^T Y = Z^T Z
where Z = B^T Y ∼ Nr(B^T µ, B^T I B) ∼ Nr(B^T µ, I r). Thus Y^T AY =
Z^T Z ∼ χ2(r, µ^T BB^T µ/2) ∼ χ2(r, µ^T Aµ/2) by Definition 2.12.

The following theorem is useful for constructing ANOVA tables. See Searle
(1971, pp. 60-61).

Theorem 2.15: Generalized Cochran's Theorem. Let Y ∼ Nn(µ, Σ). Let A_i = A_i^T have rank r_i for i = 1, ..., k, and let A = Σ_{i=1}^k A_i = A^T have rank r. Then Y^T A_i Y ∼ χ²(r_i, µ^T A_i µ/2), the Y^T A_i Y are independent, and Y^T AY ∼ χ²(r, µ^T Aµ/2), iff
I) any 2 of a) A_iΣ are idempotent ∀ i, b) A_iΣA_j = 0 ∀ i < j, c) AΣ is idempotent are true;
or II) c) is true and d) r = Σ_{i=1}^k r_i;
or III) c) is true and e) A_1Σ, ..., A_{k−1}Σ are idempotent and A_kΣ ≥ 0 is singular.

2.3 Least Squares Theory

Definition 2.13. Estimating equations are used to find estimators of


unknown parameters. The least squares criterion and log likelihood for max-
imum likelihood estimators are important examples.

Estimating equations are often used with a model, like Y = Xβ + e,


and often have a variable β that is used in the equations to find the es-
timator β̂ of the vector of parameters in the model. For example, the log
likelihood log(L(β, σ 2 )) has β and σ 2 as variables for a parametric statistical
model where β and σ 2 are fixed unknown parameters, and maximizing the
log likelihood with respect to these variables gives the maximum likelihood
estimators of the parameters β and σ 2 . So the term β is both a variable in
the estimating equations, which could be replaced by another variable such
as η, and a vector of parameters in the model. In the theorem below, we
could replace η by β where β is a vector of parameters in the linear model
and a variable in the least squares criterion which is an estimating equation.

Theorem 2.16. Let θ = Xη ∈ C(X) where Y_i = x_i^T η + r_i(η) and the residual r_i(η) depends on η. The least squares estimator β̂ is the value of η ∈ R^p that minimizes the least squares criterion Σ_{i=1}^n r_i²(η) = ‖Y − Xη‖².
Proof. Following Seber and Lee (2003, pp. 36-38), let Ŷ = θ̂ = P X Y ∈
C(X), r = (I − P X )Y ∈ [C(X)]⊥, and θ ∈ C(X). Then (Y − θ̂)T (θ̂ −
θ) = (Y − P X Y )T (P X Y − P X θ) = Y T (I − P X )P X (Y − θ) = 0 since
P X θ = θ. Thus kY − θk2 = (Y − θ̂ + θ̂ − θ)T (Y − θ̂ + θ̂ − θ) =

kY − θ̂k2 + kθ̂ − θk2 + 2(Y − θ̂)T (θ̂ − θ) ≥ kY − θ̂k2

with equality iff kθ̂ − θk2 = 0 iff θ̂ = θ = Xη. Since θ̂ = X β̂ the result
follows. 

Definition 2.14. The normal equations are

X^T X β̂ = X^T Y.

To see that the normal equations hold, note that r = Y − Ŷ ⊥ C(X) by


Theorem 1.2 c) (and Theorem 2.20 i)). Thus r ∈ [C(X)]⊥ = N (X T ), and
X T (Y − Ŷ ) = 0. Hence X T Ŷ = X T X β̂ = X T Y .
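The normal equations are easy to check numerically. The following R sketch uses simulated data; the object names (X, Y, bhat) are illustrative and not from the text. It verifies that the solution of X^T X β̂ = X^T Y matches the coefficients from lm, and that X^T r = 0.

set.seed(1)
n <- 100; p <- 4
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))  # full rank design with intercept
beta <- c(1, 2, 0, -1)
Y <- X %*% beta + rnorm(n)
bhat <- solve(t(X) %*% X, t(X) %*% Y)   # solves the normal equations X'X bhat = X'Y
r <- Y - X %*% bhat                     # residual vector
crossprod(X, r)                         # approximately 0: columns of X are orthogonal to r
coef(lm(Y ~ X[, -1]))                   # lm gives the same coefficients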

The maximum likelihood estimator uses the log likelihood as an estimating


equation. Note that it is crucial to observe that the likelihood function is a
function of θ (and that y1 , ..., yn act as fixed constants). Also, if the MLE θ̂
exists, then θ̂ ∈ Θ, the parameter space.

Definition 2.15. Let f(y|θ) be the joint pdf of Y1 , ..., Yn. If Y = y is


observed, then the likelihood function L(θ) = f(y|θ). For each sample
point y = (y1 , ..., yn), let θ̂(y) be a parameter value at which L(θ|y) attains
its maximum as a function of θ with y held fixed. Then a maximum likelihood
estimator (MLE) of the parameter θ based on the sample Y is θ̂(Y ).

Definition 2.16. Let the log likelihood of θ 1 and θ 2 be log[L(θ 1 , θ2 )]. If θ̂ 2


is the MLE of θ 2 , then the log profile likelihood is log[Lp (θ 1 )] = log[L(θ 1 , θ̂ 2 )].

We can often fix σ and then show β̂ is the MLE by direct maximization.
Then the MLE σ̂ or σ̂ 2 can be found by maximizing the log profile likelihood
function log[Lp (σ)] or log[Lp (σ 2 )] where Lp (σ) = L(σ, β = β̂).

Remark 2.1. a) Know how to find the max and min of a function h that is continuous on an interval [a, b] and differentiable on (a, b). Solve h′(x) ≡ 0 and find the places where h′(x) does not exist. These values are the critical points. Evaluate h at a, b, and the critical points. One of these values will be the min and one the max.
b) Assume h is continuous. Then a critical point θ_o is a local max of h(θ) if h is increasing for θ < θ_o in a neighborhood of θ_o and if h is decreasing for θ > θ_o in a neighborhood of θ_o. The first derivative test is often used.
c) If h is strictly concave (d²h(θ)/dθ² < 0 for all θ), then any local max of h is a global max.
d) Suppose h′(θ_o) = 0. The 2nd derivative test states that if d²h(θ_o)/dθ² < 0, then θ_o is a local max.
e) If h(θ) is a continuous function on an interval with endpoints a < b (not necessarily finite), and differentiable on (a, b), and if the critical point is unique, then the critical point is a global maximum if it is a local maximum (because otherwise there would be a local minimum and the critical point would not be unique). To show that θ̂ is the MLE (the global maximizer of h(θ) = log L(θ)), show that log L(θ) is differentiable on (a, b). Then show that θ̂ is the unique solution to the equation d log L(θ)/dθ = 0 and that the 2nd derivative evaluated at θ̂ is negative: d² log L(θ)/dθ² |_{θ̂} < 0. Similar remarks hold for finding σ̂² using the profile likelihood.

Theorem 2.17. Let Y = Xβ + e = Ŷ + r where X is full rank, and


Y ∼ Nn (Xβ, σ 2 I). Then the MLE of β is the least squares estimator β̂ and
the MLE of σ 2 is RSS/n = (n − p)M SE/n.
Proof. The Y_i = Y_i|x_i are independent N(x_i^T β, σ²) random variables with probability density functions (pdfs) f_{Y_i}(y_i). Let y_i be the observed values of Y_i. Thus the likelihood function

L(β, σ²) = ∏_{i=1}^n f_{Y_i}(y_i) = ∏_{i=1}^n (σ√(2π))^{−1} exp( −(y_i − x_i^T β)²/(2σ²) )

= (2πσ²)^{−n/2} exp( −(1/(2σ²)) Σ_{i=1}^n (y_i − x_i^T β)² ) = (2πσ²)^{−n/2} exp( −‖y − Xβ‖²/(2σ²) ).
The least squares criterion Q(β) = Σ_{i=1}^n (y_i − x_i^T β)² = Σ_{i=1}^n r_i²(β) = ‖y − Xβ‖² = (y − Xβ)^T(y − Xβ). For fixed σ², maximizing the likelihood is equivalent to maximizing

exp( −‖y − Xβ‖²/(2σ²) ),

which is equivalent to minimizing ‖y − Xβ‖². But the least squares estimator minimizes ‖y − Xβ‖² by Theorem 2.16. Hence β̂ is the MLE of β.
Let Q = ‖y − Xβ̂‖². Then the MLE of σ² can be found by maximizing the log profile likelihood log(L_P(σ²)) where

L_P(σ²) = (2πσ²)^{−n/2} exp( −Q/(2σ²) ).

Let τ = σ². Then

log(L_P(σ²)) = c − (n/2) log(σ²) − Q/(2σ²),  and  log(L_P(τ)) = c − (n/2) log(τ) − Q/(2τ).

Hence

d log(L_P(τ))/dτ = −n/(2τ) + Q/(2τ²), which is set equal to 0,

or −nτ + Q = 0 or nτ = Q or

τ̂ = Q/n = σ̂² = Σ_{i=1}^n r_i²/n = ((n − p)/n) MSE,

which is a unique solution. Now

d² log(L_P(τ))/dτ² = n/(2τ²) − 2Q/(2τ³), which evaluated at τ = τ̂ equals n/(2τ̂²) − 2nτ̂/(2τ̂³) = −n/(2τ̂²) < 0.

Thus by Remark 2.1, σ̂² is the MLE of σ².  □
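A quick numerical check of Theorem 2.17 in R (a minimal sketch with simulated data; fit, mse, and sigma2.mle are illustrative names): the MLE σ̂² = RSS/n = (n − p)MSE/n reproduces the maximized log likelihood reported by logLik.

set.seed(2)
n <- 50; p <- 3
x <- matrix(rnorm(n * (p - 1)), n, p - 1)
y <- 1 + x %*% c(2, -1) + rnorm(n, sd = 2)
fit <- lm(y ~ x)
mse <- sum(resid(fit)^2) / (n - p)             # unbiased estimator of sigma^2
sigma2.mle <- (n - p) * mse / n                # MLE of sigma^2 = RSS/n
logLik(fit)                                    # maximized Gaussian log likelihood
-(n / 2) * (log(2 * pi) + log(sigma2.mle) + 1) # same value, computed from RSS/n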

Now assume the n × p matrix X has full rank p. There are two ways to compute β̂: use β̂ = (X^T X)^{−1}X^T Y, or use sample covariance matrices. The population OLS coefficients are defined below. Let x_i^T = (1, u_i^T) where u_i is the vector of nontrivial predictors. Let X̄_{ok} = (1/n) Σ_{j=1}^n X_{jk} = ū_{ok} for k = 2, ..., p. The subscript "ok" means sum over the first subscript j. Let ū = (ū_{o,2}, ..., ū_{o,p})^T be the sample mean of the u_i. Note that regressing on u is equivalent to regressing on x if there is an intercept β_1 in the model.

Definition 2.17. Using the above notation, let x_i^T = (1, u_i^T), and let β^T = (β_1, β_2^T) where β_1 is the intercept and the slopes vector β_2 = (β_2, ..., β_p)^T. Let the population covariance matrices be

Cov(u) = E[(u − E(u))(u − E(u))^T] = Σ_u, and Cov(u, Y) = E[(u − E(u))(Y − E(Y))] = Σ_{uY}.

Then the population coefficients from an OLS regression of Y on x (even if a linear model does not hold) are

β_1 = E(Y) − β_2^T E(u) and β_2 = Σ_u^{−1} Σ_{uY}.
Definition 2.18. Let the sample covariance matrices be

Σ̂_u = (1/(n − 1)) Σ_{i=1}^n (u_i − ū)(u_i − ū)^T and Σ̂_{uY} = (1/(n − 1)) Σ_{i=1}^n (u_i − ū)(Y_i − Ȳ).

Let the method of moments or maximum likelihood estimators be

Σ̃_u = (1/n) Σ_{i=1}^n (u_i − ū)(u_i − ū)^T and Σ̃_{uY} = (1/n) Σ_{i=1}^n (u_i − ū)(Y_i − Ȳ) = (1/n) Σ_{i=1}^n u_i Y_i − ū Ȳ.

Refer to Definitions 1.27, 1.28, and 1.33 for the notation "θ̂ →P θ as n → ∞," which means that θ̂ is a consistent estimator of θ, or that θ̂ converges in probability to θ. Note that D = X_1^T X_1 − n ū ū^T = (n − 1)Σ̂_u.

Theorem 2.18: Seber and Lee (2003, p. 106). Let X = (1 X_1). Then

X^T Y = ( nȲ, (X_1^T Y)^T )^T = ( nȲ, (Σ_{i=1}^n u_i Y_i)^T )^T,   X^T X =
[ n     n ū^T     ]
[ n ū   X_1^T X_1 ],

and

(X^T X)^{−1} =
[ 1/n + ū^T D^{−1} ū   −ū^T D^{−1} ]
[ −D^{−1} ū             D^{−1}     ]

where the (p − 1) × (p − 1) matrix D^{−1} = [(n − 1)Σ̂_u]^{−1} = Σ̂_u^{−1}/(n − 1).

Theorem 2.19: Second way to compute β̂.
a) If Σ̂_u^{−1} exists, then β̂_1 = Ȳ − β̂_2^T ū and

β̂_2 = (n/(n − 1)) Σ̂_u^{−1} Σ̃_{uY} = Σ̃_u^{−1} Σ̃_{uY} = Σ̂_u^{−1} Σ̂_{uY}.

b) Suppose that (Y_i, u_i^T)^T are iid random vectors such that σ_Y², Σ_u^{−1}, and Σ_{uY} exist. Then β̂_1 →P β_1 and β̂_2 →P β_2 as n → ∞.

Proof. Note that

Y^T X_1 = (Y_1 · · · Y_n) [u_1^T ; ... ; u_n^T] = Σ_{i=1}^n Y_i u_i^T   and   X_1^T Y = [u_1 · · · u_n] (Y_1, ..., Y_n)^T = Σ_{i=1}^n u_i Y_i.

So (β̂_1, β̂_2^T)^T = (X^T X)^{−1} X^T Y where (X^T X)^{−1} is given in Theorem 2.18 and X^T Y = ( nȲ, (X_1^T Y)^T )^T. Thus

β̂_2 = −nD^{−1} ū Ȳ + D^{−1} X_1^T Y = D^{−1}( X_1^T Y − n ū Ȳ ) = D^{−1}[ Σ_{i=1}^n u_i Y_i − n ū Ȳ ] = (Σ̂_u^{−1}/(n − 1)) n Σ̃_{uY} = (n/(n − 1)) Σ̂_u^{−1} Σ̃_{uY} = Σ̂_u^{−1} Σ̂_{uY}. Then

β̂_1 = Ȳ + n ū^T D^{−1} ū Ȳ − ū^T D^{−1} X_1^T Y = Ȳ + [ nȲ ū^T D^{−1} − Y^T X_1 D^{−1} ] ū = Ȳ − β̂_2^T ū.

The convergence in probability results hold since sample means and sample covariance matrices are consistent estimators of the population means and population covariance matrices.  □
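The two ways of computing β̂ can be compared numerically in R. A minimal sketch with simulated data (the names u, Y, b.ols, b2, b1 are illustrative); cov uses n − 1 denominators, matching Σ̂_u and Σ̂_{uY}.

set.seed(3)
n <- 200
u <- matrix(rnorm(n * 3), n, 3)            # nontrivial predictors
Y <- 2 + u %*% c(1, -1, 0.5) + rnorm(n)
X <- cbind(1, u)
b.ols <- solve(t(X) %*% X, t(X) %*% Y)     # first way: (X'X)^{-1} X'Y
b2 <- solve(cov(u), cov(u, Y))             # second way: slopes from sample covariances
b1 <- mean(Y) - crossprod(b2, colMeans(u)) # intercept = Ybar - b2' ubar
c(b1, b2)                                  # agrees with b.ols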

It is important to note that the convergence in probability results are for iid (Y_i, u_i^T)^T with second moments and nonsingular Σ_u: a linear model Y = Xβ + e does not need to hold. Also, X is a random matrix, and the least squares regression is conditional on X. When the linear model does hold, the second method for computing β̂ is still valid even if X is a constant matrix, and β̂ →P β by the LS CLT. Some properties of the least squares estimators and related quantities are given below, where X is a constant matrix. The population results of Definition 2.17 were also shown when

(Y, x_2, ..., x_p)^T ∼ Np( (E(Y), E(u)^T)^T,
[ σ_Y²     Σ_{Y u} ]
[ Σ_{uY}   Σ_{uu}  ] )

in Remark 1.5. Also see Theorem 1.40. The following theorem is similar to Theorem 1.2.

Theorem 2.20. Let Y = Xβ + e = Ŷ + r where X has full rank p,


E(e) = 0, and Cov(e) = σ²I. Let P = P_X be the projection matrix on C(X) so Ŷ = P Y, r = Y − Ŷ = (I − P)Y, and P X = X so X^T P = X^T.
i) The predictor variables and residuals are orthogonal. Hence the columns
of X and the residual vector are orthogonal: X T r = 0.
ii) E(Y ) = Xβ.
iii) Cov(Y ) = Cov(e) = σ 2 I.
iv) The fitted values and residuals are uncorrelated: Cov(r, Ŷ ) = 0.
v) The least squares estimator β̂ is an unbiased estimator of β : E(β̂) = β.
vi) Cov(β̂) = σ 2 (X T X)−1 .

Proof. i) X T r = X T (I−P )Y = 0Y = 0, while ii) and iii) are immediate.


iv) Cov(r, Ŷ ) = E([r − E(r)][Ŷ − E(Ŷ )]T ) =

E([(I − P )Y − (I − P )E(Y )][P Y − P E(Y )]T ) =

E[(I − P )[Y − E(Y )][Y − E(Y )]T P ] = (I − P )σ 2 IP = σ 2 (I − P )P = 0.


v) E(β̂) = E[(X T X)−1 X T Y ] = (X T X)−1 X T E[Y ] = (X T X)−1 X T Xβ
= β.
vi) Cov(β̂) = Cov[(X T X)−1 X T Y ] = Cov(AY ) = ACov(Y )AT =

σ 2 (X T X)−1 X T IX(X T X)−1 = σ 2 (X T X)−1 . 

Definition 2.19. Let a, b, and c be n × 1 constant vectors. A linear


estimator aT Y of cT θ is the best linear unbiased estimator (BLUE) of cT θ
if E(aT Y ) = cT θ, and for any other unbiased linear estimator bT Y of cT θ,
V ar(aT Y ) ≤ V ar(bT Y ).

The following theorem is useful for finding the BLUE when X has full
rank. Note that if W is a random variable, then the covariance matrix of

W is Cov(W ) = Cov(W, W ) = V (W ). Note that the theorem shows that


bT X β̂ = bT P Y = aT β̂ is the BLUE of bT Xβ = aT β where aT = bT X
and θ = Xβ. Also, if bT Y is an unbiased estimator of aT β = bT Xβ, then
bT P Y = aT β̂ is a better unbiased estimator in that V (bT P Y ) ≤ V (bT Y ).
Since X is full rank, a^T β is estimable with BLUE a^T β̂ for every p × 1 constant vector a. Note that the e_i are uncorrelated with zero mean, but not
necessarily independent or identically distributed in the following theorem.
Note that if b = d = P b, then P b = P P b = P b = d. The proof of the more
general Theorem 3.2 c) also proves Theorem 2.21.

Theorem 2.21: Gauss Markov Theorem-Full Rank Case. Let Y =


Xβ + e where X is full rank, E(e) = 0, and Cov(e) = σ 2 I. Then aT β̂ is
the unique BLUE of aT β for every constant p × 1 vector a.
Proof. Let bT Y be any linear unbiased estimator of aT β. Then E(bT Y ) =
aT β = bT E(Y ) = bT Xβ for any β ∈ Rp , the parameter space of β. Hence
aT = bT X. The least squares estimator aT β̂ = aT (X T X)−1 X T Y =
dT Y = bT X β̂ = bT P Y is a linear unbiased estimator of aT β since
E(aT β̂) = aT β. Now V (bT Y ) − V (aT β̂) = V (bT Y ) − V (bT P Y ) =
Cov(bT Y ) − Cov(bT P Y ) = σ 2 bT b − σ 2 bT P b = σ 2 bT (I − P )b = σ 2 zT z ≥ 0
with equality iff z = (I − P )b = 0 iff b = d = P b iff bT Y = bT P Y = aT β̂.
Since bT Y was an arbitrary unbiased linear estimator, the least squares es-
timator aT β̂ is BLUE. 

Lai et al. (1979) note that if E(β̂) = β and Cov(β̂) = σ 2 (X T X)−1 → 0


as n → ∞, then β̂ is a consistent estimator of β. Also see Zhang (2019).
The following theorem gives some properties of the least squares estimators
β̂ and MSE under the normal least squares model. Similar properties will be
developed without the normality assumption.

Theorem 2.22. Suppose Y = Xβ + e where X is full rank, e ∼ Nn(0, σ²I), and Y ∼ Nn(Xβ, σ²I).
a) β̂ ∼ Np(β, σ²(X^T X)^{−1}).
b) (β̂ − β)^T X^T X(β̂ − β)/σ² ∼ χ²_p.
c) β̂ ⫫ MSE, that is, β̂ and MSE are independent.
d) RSS/σ² = (n − p)MSE/σ² ∼ χ²_{n−p}.
Proof. Let P = P X .
a) Since A = (X T X)−1 X T is a constant matrix,

β̂ = AY ∼ Np (AE(Y ), ACov(Y )AT ) ∼

Np ((X T X)−1 X T Xβ, σ 2 (X T X)−1 X T IX(X T X)−1 ) ∼


Np (β, σ 2 (X T X)−1 ).

b) The population Mahalanobis distance of β̂ is

(β̂ − β)^T X^T X(β̂ − β)/σ² = (β̂ − β)^T [Cov(β̂)]^{−1}(β̂ − β) ∼ χ²_p

by Theorem 2.11.
c) Since Cov(β̂, r) = Cov((X^T X)^{−1}X^T Y, (I − P)Y) = σ²(X^T X)^{−1}X^T I(I − P) = 0, β̂ ⫫ r. Thus β̂ ⫫ RSS = ‖r‖², and β̂ ⫫ MSE.
d) Since P X = X and X T P = X T , it follows that X T (I − P ) = 0 and
(I − P )X = 0. Thus RSS = rT r = Y T (I − P )Y =

(Y − Xβ)T (I − P )(Y − Xβ) = eT (I − P )e.

Since e ∼ Nn (0, σ 2 I), then by Theorem 2.14 c), eT (I − P )e/σ 2 ∼ χ2n−p


where n − p = rank(I − P ) = tr(I − P ). 

2.3.1 Hypothesis Testing

Suppose Y = Xβ + e where rank(X) = p, E(e) = 0 and Cov(e) = σ 2 I. Let


L be an r × p constant matrix with rank(L) = r, let c be an r × 1 constant
vector, and consider testing H0 : Lβ = c. First theory will be given for when
e ∼ Nn (0, σ 2 I). The large sample theory will be given for when the iid zero
mean ei have V (ei ) = σ 2 . Note that the normal model will satisfy the large
sample theory conditions.
The partial F test, and its special cases the ANOVA F test and the Wald
t test, use c = 0. Let the full model use Y , x1 ≡ 1, x2 , ..., xp, and let
the reduced model use Y , x1 = xj1 ≡ 1, xj2 , ..., xjk where {j1 , ..., jk} ⊂
{1, ..., p} and j1 = 1. Here 1 ≤ k < p, and if k = 1, then the model is
Yi = β1 + ei . Hence the full model is Yi = β1 + β2 xi,2 + · · · + βp xi,p + ei , while
the reduced model is Yi = β1 + βj2 xi,j2 + · · · + βjk xi,jk + ei . In matrix form,
the full model is Y = Xβ + e and the reduced model is Y = X R βR + eR
where the columns of X R are a proper subset of the columns of X. i) The
partial F test has H0 : βjk+1 = · · · = βjp = 0, or H0 : the reduced model is
good, or H0 : Lβ = 0 where L is a (p − k) × p matrix where the ith row of L
has a 1 in the jk+i th position and zeroes elsewhere. In particular, if β1 , ..., βk
are the only βi in the reduced model, then L = [0 I p−k ] and 0 is a (p−k)×k
matrix. Hence r = p − k = number of predictors in the full model but not in
the reduced model. ii) The ANOVA F test is the special case of the partial
F test where the reduced model is Y_i = β_1 + e_i. Hence H0 : β2 = · · · = βp = 0,
or H0 : none of the nontrivial predictors x2 , ..., xp are needed in the linear
model, or H0 : Lβ = 0 where L = [0 I p−1 ] and 0 is a (p − 1) × 1 vector.
Hence r = p − 1. iii) The Wald t test uses the reduced model that deletes
the jth predictor from the full model. Hence H0 : βj = 0, or H0 : the jth
predictor xj is not needed in the linear model given that the other predictors

are in the model, or H0 : Lj β = 0 where Lj = [0, ..., 0, 1, 0, ..., 0] is a 1 × p


row vector with a 1 in the jth position for j = 1, ..., p. Hence r = 1.
A way to get the test statistic FR for the partial F test is to fit the
full model and the reduced model. Let RSS be the RSS of the full model,
and let RSS(R) be the RSS of the reduced model. Similarly, let M SE and
M SE(R) be the MSE of the full and reduced models. Let dfR = n − k and
dfF = n − p be the degrees of freedom for the reduced and full models. Then
FR = [RSS(R) − RSS]/(r MSE) where r = df_R − df_F = p − k = number of predictors in the full model but not in the reduced model.
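In R, this computation is automated by fitting the full and reduced models and calling anova. A hedged sketch with simulated data (variable names are illustrative; x3 and x4 are the predictors tested by the reduced model):

set.seed(4)
n <- 100
x2 <- rnorm(n); x3 <- rnorm(n); x4 <- rnorm(n)
Y <- 1 + 2 * x2 + rnorm(n)                 # x3 and x4 are not needed
full <- lm(Y ~ x2 + x3 + x4)
red  <- lm(Y ~ x2)
anova(red, full)                           # partial F test of H0: the reduced model is good
r <- 2                                     # r = p - k predictors dropped
FR <- (deviance(red) - deviance(full)) / (r * deviance(full) / full$df.residual)
FR                                         # matches the F statistic in the anova output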
If β̂ ∼ Np (β, σ 2 (X T X)−1 ), then

Lβ̂ − c ∼ Nr (Lβ − c, σ 2 L(X T X)−1 LT ).

If H0 is true then Lβ̂ − c ∼ Nr(0, σ²L(X^T X)^{−1}L^T), and by Theorem 2.11,

rF_1 = (1/σ²)(Lβ̂ − c)^T [L(X^T X)^{−1}L^T]^{−1}(Lβ̂ − c) ∼ χ²_r.

Let rF_R = σ²rF_1/MSE. If H0 is true, rF_R →D χ²_r for a large class of zero mean error distributions. See Theorem 2.26 c).
From Definition 1.25, if Z_n →D Z as n → ∞, then Z_n converges in distribution to the random vector Z, and "Z is the limiting distribution of Z_n" means that the distribution of Z is the limiting distribution of Z_n. The notation Z_n →D Nk(µ, Σ) means Z ∼ Nk(µ, Σ).

Remark 2.2. a) Z is the limiting distribution of Z_n, and does not depend on the sample size n (since Z is found by taking the limit as n → ∞).
b) When Z_n →D Z, the distribution of Z can be used to approximate probabilities P(Z_n ≤ c) ≈ P(Z ≤ c) at continuity points c of the cdf F_Z(z). Often the limiting distribution is a continuous distribution, so all points c are continuity points.
c) Often the two quantities Z_n →D Nk(µ, Σ) and Z_n ∼ Nk(µ, Σ) behave similarly. A big difference is that the distribution on the RHS (right hand side) can depend on n for ∼ but not for →D. In particular, if Z_n →D Nk(µ, Σ), then AZ_n + b →D Nm(Aµ + b, AΣA^T), provided the RHS does not depend on n, where A is an m × k constant matrix and b is an m × 1 constant vector.
d) We often want a normal approximation where the RHS can depend on n. Write Z_n ∼ ANk(µ, Σ) for an approximate multivariate normal distribution where the RHS may depend on n. For the normal linear model, if e ∼ Nn(0, σ²I), then β̂ ∼ Np(β, σ²(X^T X)^{−1}). If the e_i are iid with E(e_i) = 0 and V(e_i) = σ², use the multivariate normal approximation β̂ ∼ ANp(β, σ²(X^T X)^{−1}) or β̂ ∼ ANp(β, MSE(X^T X)^{−1}). The RHS depends on n since the number of rows of X is n.

Theorem 2.23. Suppose Σ̂_n and Σ are positive definite and symmetric. If W_n →D Nk(µ, Σ) and Σ̂_n →P Σ, then Z_n = Σ̂_n^{−1/2}(W_n − µ) →D Nk(0, I), and Z_n^T Z_n = (W_n − µ)^T Σ̂_n^{−1}(W_n − µ) →D χ²_k.

Proof. Z_n = (Σ̂_n^{−1/2} − Σ^{−1/2} + Σ^{−1/2})(W_n − µ) = (Σ̂_n^{−1/2} − Σ^{−1/2})(W_n − µ) + Σ^{−1/2}(W_n − µ) →D 0 + Nk(0, I) ∼ Nk(0, I) by Slutsky's Theorem 1.34 b). Hence Z_n^T Z_n →D χ²_k.  □

See Remark 2.3 for why Theorem 2.24 is useful.

Theorem 2.24. If W_n ∼ F_{r,d_n} where the positive integer d_n → ∞ as n → ∞, then rW_n →D χ²_r.

Proof. If X_1 ∼ χ²_{d_1} is independent of X_2 ∼ χ²_{d_2}, then

(X_1/d_1)/(X_2/d_2) ∼ F_{d_1,d_2}.

If the U_i ∼ χ²_1 are iid, then Σ_{i=1}^k U_i ∼ χ²_k. Let d_1 = r and k = d_2 = d_n. Hence if X_2 ∼ χ²_{d_n}, then

X_2/d_n = Σ_{i=1}^{d_n} U_i/d_n = Ū →P E(U_i) = 1

by the law of large numbers. Hence if W_n ∼ F_{r,d_n}, then rW_n →D χ²_r.  □

The following theorem is analogous to the central limit theorem and the
theory for the t–interval for µ based on Y and the sample standard deviation
(SD) SY . If the data Y1 , ..., Yn are iid with mean 0 and variance σ 2 , then Y
is asymptotically normal and the t–interval will perform well if the sample
size is large enough. The result below suggests that the OLS estimators Ŷi
and β̂ are good if the sample size is large enough. The condition max hi → 0
in probability usually holds if the researcher picked the design matrix X or
if the xi are iid random vectors from a well behaved population. Outliers
can cause the condition to fail. Convergence in distribution, Z_n →D Np(0, Σ), means the multivariate normal approximation can be used for probability calculations involving Z_n. When p = 1, the univariate normal distribution can be used. See Sen and Singer (1993, p. 280) for the theorem, which implies that β̂ ≈ Np(β, σ²(X^T X)^{−1}). Let h_i = H_{ii} where H = P_X. Note that the following theorem is for the full rank model since X^T X is nonsingular.

Theorem 2.25, LS CLT (Least Squares Central Limit Theorem): Consider the MLR model Y_i = x_i^T β + e_i and assume that the zero mean errors are iid with E(e_i) = 0 and VAR(e_i) = σ². Also assume that max(h_1, ..., h_n) → 0 in probability as n → ∞ and

X^T X / n → W^{−1}

as n → ∞. Then the least squares (OLS) estimator β̂ satisfies

√n(β̂ − β) →D Np(0, σ²W).    (2.1)

Equivalently,

(X^T X)^{1/2}(β̂ − β) →D Np(0, σ²I_p).    (2.2)

If Σ = σ²W, then Σ̂_n = nMSE(X^T X)^{−1}. Hence β̂ ∼ ANp(β, MSE(X^T X)^{−1}), and

rF_R = (1/MSE)(Lβ̂ − c)^T [L(X^T X)^{−1}L^T]^{−1}(Lβ̂ − c) →D χ²_r    (2.3)

as n → ∞ if H0: Lβ = c is true, so that √n(Lβ̂ − c) →D Nr(0, σ²LWL^T).
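A small simulation illustrates the LS CLT: with iid nonnormal errors, the slope estimator is approximately normal with covariance close to σ²(X^T X)^{−1}. The R sketch below uses exponential(1) − 1 errors and a fixed uniform design, which are illustrative choices and not from the text.

set.seed(5)
n <- 100; nsim <- 5000
x <- runif(n)                              # fixed design for all runs
X <- cbind(1, x)
b2 <- replicate(nsim, {
  e <- rexp(n) - 1                         # iid zero mean, nonnormal errors (variance 1)
  Y <- 1 + 2 * x + e
  coef(lm(Y ~ x))[2]
})
c(sd(b2), sqrt(solve(t(X) %*% X)[2, 2]))   # simulated sd vs sqrt of sigma^2 (X'X)^{-1}[2,2]
qqnorm(b2)                                 # roughly straight line: normal approximation is reasonable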

Definition 2.20. A test with test statistic Tn is a large sample right tail
δ test if the test rejects H0 if Tn > an and P (Tn > an ) = δn → δ as n → ∞
when H0 is true.

Typically we want δ ≤ 0.1, and the values δ = 0.05 or δ = 0.01 are


common. (An analogy is a large sample 100(1 − δ)% confidence interval or
prediction interval.)

Remark 2.3. Suppose P (W ≤ χ2q (1−δ)) = 1−δ and P (W > χ2q (1−δ)) =
δ where W ∼ χ2q . Suppose P (W ≤ Fq,dn (1 − δ)) = 1 − δ when W ∼ Fq,dn .
Also write χ2q (1 − δ) = χ2q,1−δ and Fq,dn (1 − δ) = Fq,dn ,1−δ . Suppose P (W >
z1−δ ) = δ when W ∼ N (0, 1), and P (W > tdn ,1−δ ) = δ when W ∼ tdn .
i) Theorem 2.24 is important because it can often be shown that a statistic T_n = rW_n →D χ²_r when H0 is true. Then tests that reject H0 when T_n > χ²_r(1 − δ) or when T_n/r = W_n > F_{r,d_n}(1 − δ) are both large sample right tail δ tests if the positive integer d_n → ∞ as n → ∞. Large sample F tests and intervals are used instead of χ² tests and intervals since the F tests and intervals are more accurate for moderate n.
ii) An analogy is that if the test statistic T_n →D N(0, 1) when H0 is true, then
tests that reject H0 if Tn > z1−δ or if Tn > tdn ,1−δ are both large sample
right tail δ tests if the positive integer dn → ∞ as n → ∞. Large sample t
tests and intervals are used instead of Z tests and intervals since the t tests
and intervals are more accurate for moderate n.
iii) Often n ≥ 10p starts to give good results for the OLS output for error
distributions not too far from N (0, 1). Larger values of n tend to be needed

if the zero mean iid errors have a distribution that is far from a normal
distribution. Also see Theorem 1.5.

Theorem 2.26, Partial F Test Theorem. Suppose H0: Lβ = 0 is true for the partial F test. Under the OLS full rank model, a)

FR = (1/(r MSE)) (Lβ̂)^T [L(X^T X)^{−1}L^T]^{−1}(Lβ̂).

b) If e ∼ Nn(0, σ²I), then FR ∼ F_{r,n−p}.
c) For a large class of zero mean error distributions, rFR →D χ²_r.
d) The partial F test that rejects H0: Lβ = 0 if FR > F_{r,n−p}(1 − δ) is a large sample right tail δ test for the OLS model for a large class of zero mean error distributions.

Proof sketch. a) Seber and Lee (2003, p. 100) show that

RSS(R) − RSS = (Lβ̂)^T [L(X^T X)^{−1}L^T]^{−1}(Lβ̂).

b) Let the full model be Y = Xβ + e with a constant β_1 in the model: 1 is the 1st column of X. Let the reduced model Y = X_R β_R + e also have a constant in the model where the columns of X_R are a subset of k of the columns of X. Let P_R be the projection matrix on C(X_R) so P P_R = P_R. Then FR = [SSE(R) − SSE(F)]/(r MSE(F)) where r = df_R − df_F = p − k = number of predictors in the full model but not in the reduced model. MSE = MSE(F) = SSE(F)/(n − p) where SSE = SSE(F) = Y^T(I − P)Y. SSE(R) − SSE(F) = Y^T(P − P_R)Y where SSE(R) = Y^T(I − P_R)Y. Now assume Y ∼ Nn(Xβ, σ²I), and when H0 is true, Y ∼ Nn(X_R β_R, σ²I). Since (I − P)(P − P_R) = 0, [SSE(R) − SSE(F)] ⫫ MSE(F) by Craig's Theorem. When H0 is true, µ = X_R β_R and µ^T Aµ = 0 where A = (I − P) or A = (P − P_R). Hence the noncentrality parameter is 0, and by Theorem 2.14 g), SSE ∼ σ²χ²_{n−p} and SSE(R) − SSE(F) ∼ σ²χ²_{p−k} since rank(P − P_R) = tr(P − P_R) = p − k. Hence under H0, FR ∼ F_{p−k,n−p}.

Alternatively, let Y ∼ Nn(Xβ, σ²I_n) where X is an n × p matrix of rank p. Let X = [X_1 X_2] and β = (β_1^T β_2^T)^T where X_1 is an n × k matrix and r = p − k. Consider testing H0: β_2 = 0. (The columns of X can be rearranged so that H0 corresponds to the partial F test.) Let P be the projection matrix on C(X). Then r^T r = Y^T(I − P)Y = e^T(I − P)e = (Y − Xβ)^T(I − P)(Y − Xβ) since P X = X and X^T P = X^T imply that X^T(I − P) = 0 and (I − P)X = 0.

Suppose that H0: β_2 = 0 is true so that Y ∼ Nn(X_1β_1, σ²I_n). Let P_1 be the projection matrix on C(X_1). By the above argument, r_R^T r_R = Y^T(I − P_1)Y = (Y − X_1β_1)^T(I − P_1)(Y − X_1β_1) = e_R^T(I − P_1)e_R where e_R ∼ Nn(0, σ²I_n) when H0 is true. Or use RHS = Y^T(I − P_1)Y − β_1^T X_1^T(I − P_1)Y + β_1^T X_1^T(I − P_1)X_1β_1 − Y^T(I − P_1)X_1β_1, and the last three terms equal 0 since X_1^T(I − P_1) = 0 and (I − P_1)X_1 = 0. Hence

Y^T(I − P)Y/σ² ∼ χ²_{n−p}  ⫫  Y^T(P − P_1)Y/σ² ∼ χ²_r

by Theorem 2.14 c) using e and e_R instead of Y, and Craig's Theorem 2.9 b) since n − p = rank(I − P) = tr(I − P), r = rank(P − P_1) = tr(P − P_1) = p − k, and (I − P)(P − P_1) = 0.

If X_1 ∼ χ²_{d_1} is independent of X_2 ∼ χ²_{d_2}, then (X_1/d_1)/(X_2/d_2) ∼ F_{d_1,d_2}. Hence

[Y^T(P − P_1)Y/r] / [Y^T(I − P)Y/(n − p)] = Y^T(P − P_1)Y/(r MSE) ∼ F_{r,n−p}

when H0 is true. Since RSS = Y^T(I − P)Y and RSS(R) = Y^T(I − P_1)Y, RSS(R) − RSS = Y^T(I − P_1 − [I − P])Y = Y^T(P − P_1)Y, and thus

FR = Y^T(P − P_1)Y/(r MSE) ∼ F_{r,n−p}.

c) Assume H0 is true. By the OLS CLT, √n(Lβ̂ − Lβ) = √n Lβ̂ →D Nr(0, σ²LWL^T). Thus √n(Lβ̂)^T(σ²LWL^T)^{−1}√n Lβ̂ →D χ²_r. Let σ̂² = MSE and Ŵ = n(X^T X)^{−1}. Then

n(Lβ̂)^T [MSE L n(X^T X)^{−1}L^T]^{−1}Lβ̂ = rFR →D χ²_r.

d) By Theorem 2.24, if W_n ∼ F_{r,d_n}, then rW_n →D χ²_r as n → ∞ and d_n → ∞. Hence the result follows by c).  □
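The matrix form of FR in part a) can be computed directly and compared to the partial F statistic from anova. A minimal R sketch with simulated data (L below tests the last two coefficients; all names are illustrative):

set.seed(6)
n <- 80
X <- cbind(1, matrix(rnorm(n * 3), n, 3))
Y <- X %*% c(1, 2, 0, 0) + rnorm(n)
p <- ncol(X); rL <- 2
L <- cbind(matrix(0, rL, p - rL), diag(rL))      # L = [0 I], tests the last 2 coefficients
bhat <- solve(t(X) %*% X, t(X) %*% Y)
MSE <- sum((Y - X %*% bhat)^2) / (n - p)
Lb <- L %*% bhat
FR <- t(Lb) %*% solve(L %*% solve(t(X) %*% X) %*% t(L)) %*% Lb / (rL * MSE)
FR                                               # matches the partial F reported below
anova(lm(Y ~ X[, 2]), lm(Y ~ X[, -1]))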

An ANOVA table for the partial F test is shown below, where k = p_R is the number of predictors used by the reduced model, and r = p − p_R = p − k is the number of predictors in the full model that are not in the reduced model.

Source    df        SS                         MS                          F
Reduced   n − p_R   SSE(R) = Y^T(I − P_R)Y     MSE(R)                      FR = [SSE(R) − SSE]/(r MSE) = Y^T(P − P_R)Y/(r MSE)
Full      n − p     SSE = Y^T(I − P)Y          MSE = Y^T(I − P)Y/(n − p)

The ANOVA F test is the special case where k = 1, X_R = 1, P_R = P_1, and SSE(R) − SSE(F) = SSTO − SSE = SSR.

ANOVA table: Y = Xβ + e with a constant β_1 in the model: 1 is the 1st column of X. MS = SS/df.

SSTO = Y^T(I − (1/n)11^T)Y = Σ_{i=1}^n (Y_i − Ȳ)², SSE = Σ_{i=1}^n r_i², SSR = Σ_{i=1}^n (Ŷ_i − Ȳ)², and SSTO = SSR + SSE. SSTO is the SSE (residual sum of squares) for the location model Y = 1β_1 + e that contains a constant but no nontrivial predictors. The location model has projection matrix P_1 = 1(1^T 1)^{−1}1^T = (1/n)11^T. Hence P P_1 = P_1 and P 1 = P_1 1 = 1.

Source       df      SS                           MS     F               p-value
Regression   p − 1   SSR = Y^T(P − (1/n)11^T)Y    MSR    F_0 = MSR/MSE   for H0: β_2 = · · · = β_p = 0
Residual     n − p   SSE = Y^T(I − P)Y            MSE

The matrices in the quadratic forms for SSR and SSE are symmetric and idempotent and their product is 0. Hence if e ∼ Nn(0, σ²I) so Y ∼ Nn(Xβ, σ²I), then SSE ⫫ SSR by Craig's Theorem. If H0 is true under normality, then Y ∼ Nn(1β_1, σ²I), and by Theorem 2.14 g), SSE ∼ σ²χ²_{n−p} and SSR ∼ σ²χ²_{p−1} since rank(I − P) = tr(I − P) = n − p and rank(P − (1/n)11^T) = tr(P − (1/n)11^T) = p − 1. Hence under normality, F_0 ∼ F_{p−1,n−p}.

Let X ∼ t_{n−p}. Then X² ∼ F_{1,n−p}. The two tail Wald t test for H0: β_j = 0 versus H1: β_j ≠ 0 is equivalent to the corresponding right tailed F test since rejecting H0 if |X| > t_{n−p}(1 − δ/2) is equivalent to rejecting H0 if X² > F_{1,n−p}(1 − δ).
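The t² = F equivalence is visible in standard R output: the squared Wald t statistic for a coefficient equals the partial F statistic for deleting that predictor. A short sketch with simulated data (names are illustrative):

set.seed(7)
n <- 60
x2 <- rnorm(n); x3 <- rnorm(n)
Y <- 1 + 0.5 * x2 + 0.3 * x3 + rnorm(n)
full <- lm(Y ~ x2 + x3)
tval <- summary(full)$coefficients["x3", "t value"]
Fval <- anova(lm(Y ~ x2), full)$F[2]     # partial F test deleting x3
c(tval^2, Fval)                          # the two values agree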

Definition 2.21. The pvalue of a test is the probability, assuming H0 is


true, of observing a test statistic as extreme as the test statistic Tn actually
observed. For a right tail test, pvalue = PH0 (of observing a test statistic
≥ Tn ).

Under the OLS model where FR ∼ F_{q,n−p} when H0 is true (so the e_i are iid N(0, σ²)), the pvalue = P(W > FR) where W ∼ F_{q,n−p}. In general, we can only estimate the pvalue. Let pval be the estimated pvalue. Then pval = P(W > FR) where W ∼ F_{q,n−p}, and pval →P pvalue as n → ∞ for the large sample partial F test. The pvalues in output are usually actually pvals (estimated pvalues).

Definition 2.22. Let Y ∼ F(d_1, d_2) ∼ F(d_1, d_2, 0). Let X_1 ∼ χ²(d_1, γ) be independent of X_2 ∼ χ²(d_2, 0). Then W = (X_1/d_1)/(X_2/d_2) ∼ F(d_1, d_2, γ), a noncentral F distribution with d_1 and d_2 numerator and denominator degrees of freedom, and noncentrality parameter γ.

Theorem 2.27, distribution of FR under normality when H0 may not hold. Assume Y = Xβ + e where e ∼ Nn(0, σ²I). Let X = [X_1 X_2] be full rank, and let the reduced model be Y = X_1β_1 + e_R. Then

FR = [Y^T(P − P_1)Y/r] / [Y^T(I − P)Y/(n − p)] ∼ F( r, n − p, β^T X^T(P − P_1)Xβ/(2σ²) ).

If H0: β_2 = 0 is true, then γ = 0.
Proof. Note that the denominator Y^T(I − P)Y/(n − p) is the MSE, and (n − p)MSE/σ² ∼ χ²_{n−p} by the proof of Theorem 2.26. By Theorem 2.14 f),

Y^T(P − P_1)Y/σ² ∼ χ²( r, β^T X^T(P − P_1)Xβ/(2σ²) )

where r = rank(P − P_1) = tr(P − P_1) = p − k since P − P_1 is a projection matrix (symmetric and idempotent).  □

Consider the test H0: Lβ = c versus H1: Lβ ≠ c, and suppose H0 is true. Then √n(Lβ̂ − c) →D Nr(0, σ²LWL^T). Hence

rF_0 = (1/MSE)(Lβ̂ − c)^T (L(X^T X)^{−1}L^T)^{−1}(Lβ̂ − c) →D χ²_r,

and rejecting H0 if F_0 > F_{r,n−p,1−δ} is a large sample right tail δ test for a large class of zero mean error distributions. Seber and Lee (2003, pp. 100-101) show that F_0 ∼ F_{r,n−p} if H0 is true and e ∼ Nn(0, σ²I), but the above result is far stronger: if the iid e_i had to satisfy e_i ∼ N(0, σ²), OLS inference would rarely be useful.

Remark 2.4. Suppose tests and confidence intervals are derived under
the assumption e ∼ Nn (0, σ 2 I). Then by the LS CLT and Remark 2.3,
the inference tends to give large sample tests and confidence intervals for
a large class of zero mean error distributions. For linear models, often the
error distribution has heavier tails than the normal distribution. See Huber
and Ronchetti (2009, p. 3). If some points stick out a bit in residual and/or
response plots, then the error distribution likely has heavier tails than the
normal distribution. See Figure 1.1.

2.4 WLS and Generalized Least Squares

Definition 2.23. Suppose that the response variable and at least one of the
predictor variables is quantitative. Then the generalized least squares (GLS)
model is

Y = Xβ + e, (2.4)
where Y is an n × 1 vector of dependent variables, X is an n × p matrix
of predictors, β is a p × 1 vector of unknown coefficients, and e is an n × 1
vector of unknown errors. Also E(e) = 0 and Cov(e) = σ 2 V where V is a
known n × n positive definite matrix.

Definition 2.24. The GLS estimator

β̂ GLS = (X T V −1 X)−1 X T V −1 Y . (2.5)

The fitted values are Ŷ GLS = X β̂GLS .

Definition 2.25. Suppose that the response variable and at least one of
the predictor variables is quantitative. Then the weighted least squares (WLS)
model with weights w1 , ..., wn is the special case of the GLS model where V
is diagonal: V = diag(v1 , ..., vn) and wi = 1/vi . Hence

Y = Xβ + e, (2.6)

E(e) = 0, and Cov(e) = σ 2 V = σ 2 diag(v1 , ..., vn) = σ 2 diag(1/w1 , ..., 1/wn).

Definition 2.26. The WLS estimator

β̂ W LS = (X T V −1 X)−1 X T V −1 Y . (2.7)

The fitted values are Ŷ W LS = X β̂ W LS .

Definition 2.27. The feasible generalized least squares (FGLS) model is


the same as the GLS estimator except that V = V (θ) is a function of an
unknown q × 1 vector of parameters θ. Let the estimator of V be V̂ = V (θ̂).
Then the FGLS estimator

β̂_FGLS = (X^T V̂^{−1} X)^{−1} X^T V̂^{−1} Y.    (2.8)

The fitted values are Ŷ F GLS = X β̂ F GLS . The feasible weighted least squares
(FWLS) estimator is the special case of the FGLS estimator where V =
V (θ) is diagonal. Hence the estimated weights ŵi = 1/v̂i = 1/vi (θ̂). The
FWLS estimator and fitted values will be denoted by β̂ F W LS and Ŷ F W LS ,
respectively.

Notice that the ordinary least squares (OLS) model is a special case of
GLS with V = I n , the n × n identity matrix. It can be shown that the GLS
estimator minimizes the GLS criterion

QGLS (η) = (Y − Xη)T V −1 (Y − Xη).



Notice that the FGLS and FWLS estimators have p + q + 1 unknown param-
eters. These estimators can perform very poorly if n < 10(p + q + 1).
The GLS and WLS estimators can be found from the OLS regression
(without an intercept) of a transformed model. Typically there will be a
constant in the model: the first column of X is a vector of ones. Let the
symmetric, nonsingular n × n square root matrix R = V 1/2 with V = RR.
Let Z = R^{−1}Y, U = R^{−1}X, and ε = R^{−1}e.

Theorem 2.28. a)

Z = Uβ + ε    (2.9)

follows the OLS model since E(ε) = 0 and Cov(ε) = σ²I_n.
b) The GLS estimator β̂_GLS can be obtained from the OLS regression (without an intercept) of Z on U.
c) For WLS, Y_i = x_i^T β + e_i. The corresponding OLS model Z = Uβ + ε is equivalent to Z_i = u_i^T β + ε_i for i = 1, ..., n where u_i^T is the ith row of U. Then Z_i = √w_i Y_i and u_i = √w_i x_i. Hence β̂_WLS can be obtained from the OLS regression (without an intercept) of Z_i = √w_i Y_i on u_i = √w_i x_i.
Proof. a) E(ε) = R^{−1}E(e) = 0 and

Cov(ε) = R^{−1}Cov(e)(R^{−1})^T = σ²R^{−1}V(R^{−1})^T = σ²R^{−1}RR(R^{−1}) = σ²I_n.

Notice that OLS without an intercept needs to be used since U does not contain a vector of ones. The first column of U is R^{−1}1 ≠ 1.
b) Let β̂_ZU denote the OLS estimator obtained by regressing Z on U. Then

β̂_ZU = (U^T U)^{−1}U^T Z = (X^T(R^{−1})^T R^{−1}X)^{−1}X^T(R^{−1})^T R^{−1}Y,

and the result follows since V^{−1} = (RR)^{−1} = R^{−1}R^{−1} = (R^{−1})^T R^{−1}.
c) The result follows from b) if Z_i = √w_i Y_i and u_i = √w_i x_i. But for WLS, V = diag(v_1, ..., v_n) and hence R = diag(√v_1, ..., √v_n). Hence

R^{−1} = diag(1/√v_1, ..., 1/√v_n) = diag(√w_1, ..., √w_n),

and Z = R^{−1}Y has ith element Z_i = √w_i Y_i. Similarly, U = R^{−1}X has ith row u_i^T = √w_i x_i^T.  □
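Theorem 2.28 c) can be checked in R: lm with a weights argument gives the same coefficients as the no-intercept OLS regression of √w_i Y_i on √w_i x_i. A hedged sketch with known weights and simulated data (the names w, Z, U are illustrative):

set.seed(8)
n <- 100
x <- runif(n, 1, 5)
w <- 1 / x^2                              # known weights, so V = diag(x_i^2)
Y <- 2 + 3 * x + rnorm(n, sd = x)         # error variance proportional to x^2
fit.wls <- lm(Y ~ x, weights = w)
Z <- sqrt(w) * Y                          # transformed response
U <- sqrt(w) * cbind(1, x)                # transformed design (its first column is not 1)
fit.zu <- lm(Z ~ U - 1)                   # OLS without an intercept
cbind(coef(fit.wls), coef(fit.zu))        # same estimates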

Remark 2.5. Standard software produces WLS output and the ANOVA
F test and Wald t tests are performed using this output.

Remark 2.6. The FGLS estimator can also be found from the OLS regression (without an intercept) of Z on U where V(θ̂) = RR. Similarly the FWLS estimator can be found from the OLS regression (without an intercept) of Z_i = √ŵ_i Y_i on u_i = √ŵ_i x_i. But now U is a random matrix instead of a constant matrix. Hence these estimators are highly nonlinear. OLS output can be used for exploratory purposes, but the p-values are generally not correct. The Olive (2018) bootstrap tests may be useful for FGLS and FWLS. See Chapter 4.

Under regularity conditions, the OLS estimator β̂ OLS is a consistent esti-


mator of β when the GLS model holds, but β̂GLS should be used because it
generally has higher efficiency.

Definition 2.28. Let β̂ ZU be the OLS estimator from regressing Z on


U. The vector of fitted values is Ẑ = U β̂ZU and the vector of residuals
is rZU = Z − Ẑ. Then β̂ ZU = β̂ GLS for GLS, β̂ ZU = β̂F GLS for FGLS,
β̂ZU = β̂W LS for WLS, and β̂ ZU = β̂ F W LS for FWLS. For GLS, FGLS,
WLS, and FWLS, a residual plot is a plot of Ẑi versus rZU,i and a response
plot is a plot of Ẑi versus Zi .

Inference for the GLS model Y = Xβ + e can be performed by using the partial F test for the equivalent no intercept OLS model Z = Uβ + ε. Following Section 1.3.7, create Z and U, fit the full and reduced model using the "no intercept" or "intercept = F" option. Let pval be the estimated pvalue.
The 4 step partial F test of hypotheses: i) State the hypotheses: H0: the reduced model is good; HA: use the full model.
ii) Find the test statistic FR = [(SSE(R) − SSE(F))/(df_R − df_F)]/MSE(F).
iii) Find the pval = P(F_{df_R−df_F, df_F} > FR). (On exams often an F table is used. Here df_R − df_F = p − q = number of parameters set to 0, and df_F = n − p.)
iv) State whether you reject H0 or fail to reject H0. Reject H0 if pval ≤ δ and conclude that the full model should be used. Otherwise, fail to reject H0 and conclude that the reduced model is good.

Assume that the GLS model contains a constant β_1. The GLS ANOVA F test of H0: β_2 = · · · = β_p = 0 versus HA: not H0 uses the reduced model that contains the first column of U. The GLS partial F test of H0: β_i = 0 versus HA: β_i ≠ 0 uses the reduced model with the ith column of U deleted. For the special case of WLS, the software will often have a weights option that will also give correct output for inference.
Freedman (1981) shows that the nonparametric bootstrap can be useful for the WLS model with the e_i independent. For this case, the sandwich estimator is Ĉov(β̂_OLS) = (X^T X)^{−1} X^T Ŵ X (X^T X)^{−1} with Ŵ = n diag(r_1², ..., r_n²)/(n − p) where the r_i are the OLS residuals and W = σ²V.
See Hinkley (1977), MacKinnon and White (1985), and White (1980).
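A minimal R sketch of this sandwich estimator, built directly from the displayed formula with simulated heteroscedastic data (object names are illustrative, and the construction here follows the text's Ŵ rather than any particular package):

set.seed(9)
n <- 200
x <- runif(n)
Y <- 1 + 2 * x + rnorm(n, sd = 1 + 2 * x)           # nonconstant variance
X <- cbind(1, x)
fit <- lm(Y ~ x)
r <- resid(fit); p <- ncol(X)
XtXinv <- solve(t(X) %*% X)
What <- n * diag(r^2) / (n - p)
sand <- XtXinv %*% t(X) %*% What %*% X %*% XtXinv   # sandwich estimate of Cov(betahat)
sqrt(diag(sand))                                    # heteroscedasticity consistent SEs
sqrt(diag(vcov(fit)))                               # usual OLS SEs, for comparison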

A major problem with the following theorem from Christensen (1987, p.


23) is that the weights wi are rarely known if heterogeneity (nonconstant vari-
ance) is present. Another problem is that normality is rare: the assumption
that the ei are independent with ei ∼ N (0, σ 2 /wi ) is too strong. However,
the theorem is useful for qualifying exam problems. From Definition 2.26, the
WLS estimator of β is β̂W LS = (X T V −1 X)−1 X T V −1 Y . The OLS estima-
tor is the special case with V = I. We will say that β̂ W LS is the BLUE of β
for the WLS model.

Theorem 2.29. Consider the WLS model Y = Xβ + e where E(e) = 0


and Cov(e) = σ 2 V = σ 2 diag(v1 , ..., vn) = σ 2 diag(1/w1 , ..., 1/wn). Suppose
the n × p matrix X has full rank p. Let a be a p × 1 constant vector.
a) The WLS estimator aT β̂ W LS is the BLUE of aT β.
b) If e ∼ N (0, σ 2 V ), then the WLS estimator aT β̂ W LS is the UMVUE
(uniformly minimum variance unbiased estimator) of aT β.
c) If e ∼ N(0, σ²V), then the WLS estimator β̂_WLS is the MLE of β. Hence the WLS estimator a^T β̂_WLS is the MLE of a^T β.

Example 2.1. Let Y_1, ..., Y_n be independent random variables, and let Y_i have a N(iθ, i²σ²) distribution for i = 1, ..., n. A statistician decided to construct two estimators for the parameter θ by using two models. [Leave the sums Σ_{i=1}^n i, Σ_{i=1}^n i², Σ_{i=1}^n i^4, etc. as they are, without replacing them with their exact values.]

a) Write the linear model and state the assumptions.


b) Simplify the weighted least squares estimate of θ, and call it θ̂1 . Then,
simplify the distribution of θ̂1 .
c) Simplify the ordinary least squares estimator, and call it θ̂_2. Simplify
the distribution of θ̂2 .
d) Which estimator has a smaller variance? Is any of θ̂1 , θ̂2 a BLUE (Best
Linear Unbiased Estimator)?
Solution: When a WLS problem asks for a distribution and no other in-
formation is given, assume the errors are independent with ei ∼ N (0, σ 2 /wi )
and e ∼ N (0, σ 2 V ).
a) Y = Xθ + e, or

(Y_1, Y_2, ..., Y_n)^T = (1, 2, ..., n)^T θ + (e_1, e_2, ..., e_n)^T where X = (1, 2, ..., n)^T,

and e ∼ Nn(0, σ²V) with V = diag(1, 2², ..., n²).


b) Note that X^T = (1, 2, ..., n), V^{−1} = diag(1, 1/2², ..., 1/n²), X^T V^{−1} = (1, 1/2, ..., 1/n), and X^T V^{−1}X = 1 + 1 + · · · + 1 = n. Thus (X^T V^{−1}X)^{−1} = 1/n and X^T V^{−1}Y = Y_1 + Y_2/2 + · · · + Y_n/n = Σ_{i=1}^n Y_i/i. Thus the WLS estimator

θ̂_1 = (X^T V^{−1}X)^{−1} X^T V^{−1}Y = (1/n) Σ_{i=1}^n Y_i/i.

Now

E(θ̂_1) = (1/n) Σ_{i=1}^n iθ/i = θ, and V(θ̂_1) = V( Σ_{i=1}^n Y_i/(ni) ) = Σ_{i=1}^n i²σ²/(n²i²) = σ²/n.

Thus θ̂_1 ∼ N(θ, σ²/n).


c) The OLS estimator

θ̂_2 = (X^T X)^{−1}X^T Y = Σ_{i=1}^n iY_i / Σ_{i=1}^n i².

Now

E(θ̂_2) = Σ_{i=1}^n i(iθ) / Σ_{i=1}^n i² = θ, and V(θ̂_2) = V( Σ_{i=1}^n iY_i / Σ_{i=1}^n i² ) = Σ_{i=1}^n i²(i²σ²) / (Σ_{i=1}^n i²)² = σ² Σ_{i=1}^n i^4 / (Σ_{i=1}^n i²)².

Thus θ̂_2 ∼ N( θ, σ² Σ_{i=1}^n i^4 / (Σ_{i=1}^n i²)² ).
d) The WLS estimator θ̂1 is BLUE and thus has smaller variance than
the OLS estimator θ̂2 (which is a linear unbiased estimator: WLS is “better
than” OLS when the weights are known).
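A quick simulation check of Example 2.1 (illustrative values n = 10, θ = 3, σ = 1, not from the text): the empirical variances of θ̂_1 and θ̂_2 match σ²/n and σ²Σi^4/(Σi²)², and the WLS estimator has the smaller variance.

set.seed(10)
n <- 10; theta <- 3; sigma <- 1; i <- 1:n
nsim <- 20000
est <- replicate(nsim, {
  Y <- rnorm(n, mean = i * theta, sd = i * sigma)
  c(wls = mean(Y / i), ols = sum(i * Y) / sum(i^2))
})
apply(est, 1, var)                                  # empirical variances of theta1hat, theta2hat
c(sigma^2 / n, sigma^2 * sum(i^4) / sum(i^2)^2)     # theoretical variances from b) and c)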

2.5 Summary

1) The set of all linear combinations of x_1, ..., x_n is the vector space known as span(x_1, ..., x_n) = {y ∈ R^k : y = Σ_{i=1}^n a_i x_i for some constants a_1, ..., a_n}.
2) Let A = [a_1 a_2 ... a_m] be an n × m matrix. The space spanned by the columns of A = column space of A = C(A). Then C(A) = {y ∈ R^n : y = Aw for some w ∈ R^m} = {y : y = w_1 a_1 + w_2 a_2 + · · · + w_m a_m for some scalars w_1, ..., w_m} = span(a_1, ..., a_m).
3) A generalized inverse of an n × m matrix A is any m × n matrix A−
satisfying AA− A = A.

4) The projection matrix P = P X onto the column space of X is


unique, symmetric, and idempotent. P X = X, and P W = W if each
column of W ∈ C(X). The eigenvalues of P X are 0 or 1. Rank(P ) = tr(P ).
Hence P is singular unless X is a nonsingular n×n matrix, and then P = I n .
If C(X R ) is a subspace of C(X), then P X P X R = P X R P X = P X R .
5) I n − P is the projection matrix on [C(X)]⊥ .
6) Let A be a positive definite symmetric matrix. The square root matrix A^{1/2} is a positive definite symmetric matrix such that A^{1/2} A^{1/2} = A.
7) The matrix A in a quadratic form xT Ax will be symmetric unless
told otherwise.
8) Theorem 2.5. Let x be a random vector with E(x) = µ and Cov(x) =
Σ. Then E(xT Ax) = tr(AΣ) + µT Aµ.
9) Theorem 2.7. If A and B are symmetric matrices and AY ⫫ BY, then Y^T AY ⫫ Y^T BY.
10) The important part of Craig's Theorem is that if Y ∼ Nn(µ, Σ), then Y^T AY ⫫ Y^T BY if AΣB = 0.
11) Theorem 2.14. Let A = AT be symmetric. b) If Y ∼ Nn (0, I),
then Y T AY ∼ χ2r iff A is idempotent of rank r. c) If Y ∼ Nn (0, σ 2 I), then
Y T AY ∼ σ 2 χ2r iff A is idempotent of rank r.
12) Often theorems are given for when Y ∼ Nn (0, I). If Y ∼ Nn (0, σ 2 I),
then apply the theorem using Z = Y /σ ∼ Nn (0, I).
13) Suppose Y_1, ..., Y_n are independent N(µ_i, 1) random variables so that Y = (Y_1, ..., Y_n)^T ∼ Nn(µ, I_n). Then Y^T Y = Σ_{i=1}^n Y_i² ∼ χ²(n, γ = µ^T µ/2), a noncentral χ²(n, γ) distribution, with n degrees of freedom and noncentrality parameter γ = µ^T µ/2 = (1/2) Σ_{i=1}^n µ_i² ≥ 0. The noncentrality parameter δ = µ^T µ = 2γ is also used.
14) Theorem 2.16. Let θ = Xη ∈ C(X) where Y_i = x_i^T η + r_i(η) and the residual r_i(η) depends on η. The least squares estimator β̂ is the value of η ∈ R^p that minimizes the least squares criterion Σ_{i=1}^n r_i²(η) = ‖Y − Xη‖².
15) Let x_i^T = (1, u_i^T), and let β^T = (β_1, β_2^T) where β_1 is the intercept and the slopes vector β_2 = (β_2, ..., β_p)^T. Let the population covariance matrices Cov(u) = Σ_u, and Cov(u, Y) = Σ_{uY}. If the (Y_i, u_i^T)^T are iid, then the population coefficients from an OLS regression of Y on x are

β_1 = E(Y) − β_2^T E(u) and β_2 = Σ_u^{−1} Σ_{uY}.
16) Theorem 2.19: Second way to compute β̂. a) If Σ̂_u^{−1} exists, then β̂_1 = Ȳ − β̂_2^T ū and

β̂_2 = (n/(n − 1)) Σ̂_u^{−1} Σ̃_{uY} = Σ̃_u^{−1} Σ̃_{uY} = Σ̂_u^{−1} Σ̂_{uY}.

b) Suppose that (Y_i, u_i^T)^T are iid random vectors such that σ_Y², Σ_u^{−1}, and Σ_{uY} exist. Then β̂_1 →P β_1 and β̂_2 →P β_2 as n → ∞ even if the OLS model Y = Xβ + e does not hold.
17) Theorem 2.20. Let Y = Xβ + e = Ŷ + r where X is full rank, E(e) = 0, and Cov(e) = σ²I. Let P = P_X be the projection matrix on C(X) so Ŷ = P Y, r = Y − Ŷ = (I − P)Y, and P X = X so X^T P = X^T.
i) The predictor variables and residuals are orthogonal. Hence the columns of X and the residual vector are orthogonal: X^T r = 0.
ii) E(Y) = Xβ.
iii) Cov(Y) = Cov(e) = σ²I.
iv) The fitted values and residuals are uncorrelated: Cov(r, Ŷ) = 0.
v) The least squares estimator β̂ is an unbiased estimator of β: E(β̂) = β.
vi) Cov(β̂) = σ²(X^T X)^{−1}.
18) LS CLT. Suppose that the e_i are iid and X^T X/n → W^{−1}. Then the least squares (OLS) estimator β̂ satisfies

√n(β̂ − β) →D Np(0, σ²W).

Also,

(X^T X)^{1/2}(β̂ − β) →D Np(0, σ²I_p).
19) Theorem 2.26, Partial F Test Theorem. Suppose H0: Lβ = 0 is true for the partial F test. Under the OLS full rank model, a)

FR = (1/(r MSE)) (Lβ̂)^T [L(X^T X)^{−1}L^T]^{−1}(Lβ̂).

b) If e ∼ Nn(0, σ²I), then FR ∼ F_{r,n−p}.
c) For a large class of zero mean error distributions, rFR →D χ²_r.
d) The partial F test that rejects H0: Lβ = 0 if FR > F_{r,n−p}(1 − δ) is a large sample right tail δ test for the OLS model for a large class of zero mean error distributions.

2.6 Complements

A good reference for quadratic forms and the noncentral χ2 , t, and F distri-
butions is Johnson and Kotz (1970, ch. 28-31).
The theory for GLS and WLS is similar to the theory for the OLS MLR
model, but the theory for FGLS and FWLS is often lacking or huge sample

sizes are needed. However, FGLS and FWLS are often used in practice be-
cause usually V is not known and V̂ must be used instead. See Eicker (1963,
1967).
Least squares theory can be extended in at least two ways. For the first extension, see Chang and Olive (2010) and Chapter 10. The second extension of least squares theory is to an autoregressive AR(p) time series model: Y_t = φ_0 + φ_1 Y_{t−1} + · · · + φ_p Y_{t−p} + e_t. In matrix form, this model is Y = Xβ + e =

[ Y_{p+1} ]   [ 1  Y_p      Y_{p−1}  . . .  Y_1     ] [ φ_0 ]   [ e_{p+1} ]
[ Y_{p+2} ] = [ 1  Y_{p+1}  Y_p      . . .  Y_2     ] [ φ_1 ] + [ e_{p+2} ]
[   ...   ]   [ ...                                 ] [ ... ]   [   ...   ]
[ Y_n     ]   [ 1  Y_{n−1}  Y_{n−2}  . . .  Y_{n−p} ] [ φ_p ]   [ e_n     ].

If the AR(p) model is stationary, then under regularity conditions, OLS partial F tests are large sample tests for this model. See Anderson (1971, pp. 210-217).
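The OLS fit of an AR(p) model can be sketched in base R by building the lagged design matrix directly. Below p = 2 and the series is simulated (an illustrative stationary AR(2) with a nonzero mean); embed constructs the columns (Y_t, Y_{t−1}, Y_{t−2}).

set.seed(11)
n <- 300
Yt <- arima.sim(model = list(ar = c(0.5, -0.3)), n = n) + 10  # stationary AR(2) plus a mean
p <- 2
M <- embed(as.numeric(Yt), p + 1)        # columns: Y_t, Y_{t-1}, Y_{t-2}
ar.ols.fit <- lm(M[, 1] ~ M[, -1])       # OLS regression of Y_t on its p lags
coef(ar.ols.fit)                         # estimates of (phi_0, phi_1, ..., phi_p)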

2.7 Problems

Problems from old qualifying exams are marked with a Q since these problems
take longer than quiz and exam problems.

2.1Q . Suppose Yi = xTi β + ei for i = 1, ..., n where the errors are indepen-
dent N (0, σ 2 ). Then the likelihood function is
 
L(β, σ²) = (2πσ²)^{−n/2} exp( −‖Y − Xβ‖²/(2σ²) ).

a) Since the least squares estimator β̂ minimizes kY − Xβk2 , show that


β̂ is the MLE of β.
b) Then find the MLE σ̂ 2 of σ 2 .
2.2Q . Suppose Yi = xTi β +ei for i = 1, ..., n where the errors are iid double
exponential (0, σ) with σ > 0. Then the likelihood function is
L(β, σ) = (1/2^n)(1/σ^n) exp( −(1/σ) Σ_{i=1}^n |Y_i − x_i^T β| ).

Suppose that β̃ is a minimizer of Q(β) = Σ_{i=1}^n |Y_i − x_i^T β|.
a) By direct maximization, show that β̃ is an MLE of β regardless of the
value of σ.
b) Find an MLE of σ by maximizing
L(σ) ≡ L(β̃, σ) = (1/2^n)(1/σ^n) exp( −(1/σ) Σ_{i=1}^n |Y_i − x_i^T β̃| ).

2.3Q . Suppose Yi = xTi β+ei where the errors are independent N (0, σ 2 /wi )
where wi > 0 are known constants. Then the likelihood function is
L(β, σ²) = ( ∏_{i=1}^n √w_i ) (1/(2π)^{n/2}) (1/σ^n) exp( −(1/(2σ²)) Σ_{i=1}^n w_i(y_i − x_i^T β)² ).

a) Suppose that β̂_W minimizes Σ_{i=1}^n w_i(y_i − x_i^T β)². Show that β̂_W is the MLE of β.
b) Then find the MLE σ̂² of σ².

2.4Q . Suppose Y ∼ Nn (Xβ, σ 2 V ) for known positive definite n×n matrix


V . Then the likelihood function is
L(β, σ²) = ( 1/√(2π) )^n (1/|V|^{1/2}) (1/σ^n) exp( −(1/(2σ²)) (y − Xβ)^T V^{−1}(y − Xβ) ).

a) Suppose that β̂ G minimizes (y − Xβ)T V −1 (y − Xβ). Show that β̂G


is the MLE of β.
b) Find the MLE σ̂ 2 of σ 2 .

2.5. Find the vector a such that aT Y is an unbiased estimator for E(Yi )
if the usual linear model holds.

2.6. Write the following quantities as b^T Y or Y^T AY or AY.
a) Ȳ, b) Σ_i (Y_i − Ŷ_i)², c) Σ_i (Ŷ_i)², d) β̂, e) Ŷ.

2.7. Show that I − H = I − X(X T X)−1 X T is idempotent, that is, show


that (I − H)(I − H ) = (I − H )2 = I − H.

2.8. Let Y ∼ N (µ, σ 2 ) so that E(Y ) = µ and Var(Y ) = σ 2 = E(Y 2 ) −


[E(Y )]2 . If k ≥ 2 is an integer, then

E(Y k ) = (k − 1)σ 2 E(Y k−2 ) + µE(Y k−1).

Let Z = (Y − µ)/σ ∼ N (0, 1). Hence µk = E(Y − µ)k = σ k E(Z k ). Use this
fact and the above recursion relationship E(Z k ) = (k − 1)E(Z k−2 ) to find
a) µ3 and b) µ4 .

2.9. Let A and B be matrices with the same number of rows. If C is


another matrix such that A = BC, is it true that rank(A) = rank(B)?
Prove or give a counterexample.

2.10. Let x be an n × 1 vector and let B be an n × n matrix. Show that x^T Bx = x^T B^T x.
(The point of this problem is that if B is not a symmetric n × n matrix, then x^T Bx = x^T Ax where A = (B + B^T)/2 is a symmetric n × n matrix.)
2.11. Consider the model Y_i = β_1 + β_2 X_{i,2} + · · · + β_p X_{i,p} + e_i = x_i^T β + e_i. The least squares estimator β̂ minimizes

Q_OLS(η) = Σ_{i=1}^n (Y_i − x_i^T η)²

and the weighted least squares estimator minimizes

Q_WLS(η) = Σ_{i=1}^n w_i (Y_i − x_i^T η)²

where the w_i, Y_i, and x_i are known quantities. Show that

Σ_{i=1}^n w_i (Y_i − x_i^T η)² = Σ_{i=1}^n (Ỹ_i − x̃_i^T η)²

by identifying Ỹ_i and x̃_i. (Hence the WLS estimator is obtained from the least squares regression of Ỹ_i on x̃_i without an intercept.)

2.12. Suppose that X is an n × p matrix but the rank of X < p < n.


Then the normal equations X T Xβ = X T Y have infinitely many solutions.
Let β̂ be a solution to the normal equations. So X T X β̂ = X T Y . Let G =
(X T X)− be a generalized inverse of (X T X). Assume that E(Y ) = Xβ and
Cov(Y ) = σ 2 I. It can be shown that all solutions to the normal equations
have the form bz given below.

a) Show that bz = GX T Y + (GX T X − I)z is a solution to the normal


equations where the p × 1 vector z is arbitrary.

b) Show that E(bz ) 6= β.

(Hence some authors suggest that bz should be called a solution to the


normal equations but not an estimator of β.)

c) Show that Cov(bz ) = σ 2 GX T XGT .

d) Although G is not unique, the projection matrix P = XGX T onto


C(X) is unique. Use this fact to show that Ŷ = Xbz does not depend on G
or z.

e) There are two ways to show that aT β is an estimable function. Either


show that there exists a vector c such that E(cT Y ) = aT β, or show that
a ∈ C(X T ). Suppose that a = X T w for some fixed vector w. Show that
E(aT bz ) = aT β.

(Hence aT β is estimable by aT bz where bz is any solution of the normal


equations.)

f) Suppose that a = X^T w for some fixed vector w. Show that Var(a^T b_z) = σ² w^T P w.

2.13. Let P be a projection matrix.


a) Show that P is a generalized inverse of P .
b) Show that P = P (P T P )− P T .

2.14Q . Suppose Yi = xTi β + ei with Q(β) ≥ 0. Let cn be a constant that


does not depend on β or σ. Suppose the likelihood function is
 
L(β, σ) = c_n (1/σ^n) exp( −Q(β)/σ ).

a) Suppose that β̂ Q minimizes Q(β). Show that β̂ Q is an MLE of β.


b) Then find an MLE σ̂ of σ.

2.15Q. Suppose Y_i = x_i^T β + e_i with Q(β) ≥ 0. Let c_n be a constant that does not depend on β or σ². Suppose the likelihood function is

L(β, σ²) = c_n (1/σ^n) exp( −Q(β)/(2σ²) ).

a) Suppose that β̂ Q minimizes Q(β). Show that β̂ Q is the MLE of β.


b) Then find the MLE σ̂ 2 of σ 2 .

2.16. Suppose that G is a generalized inverse of a symmetric matrix A.


a) Show that GT is a generalized inverse of A.
b) Show that GAGT is a generalized inverse of A. (Hence, since a gener-
alized inverse always exists, a symmetric generalized inverse of a symmetric
matrix A always exists.)
 
2.17. (Searle (1971, p. 217)): Let

A =
[ 1   2   4   3 ]
[ 3  −1   2  −2 ]
[ 5  −4   0  −7 ]

and show that

A^− = (1/7)
[ 1   2   0 ]
[ 3  −1   0 ]
[ 0   0   0 ]
[ 0   0   0 ]

is a generalized inverse of A.

2.18. Find the projection matrix P for C(X) where X is the 2 × 1 vector
X = (1, 2)T .
2.19. Let y ∼ Np (θ, Σ) where Σ is positive definite. Let A be a symmetric
p × p matrix.
a) Let x = y − θ. What is the distribution of x?
b) Show that
E[(y − θ)T A(y − θ)] = E[xT Ax]
is a function of A and Σ but not of θ.
2.20. (Hocking (2003, p. 61)): Let y ∼ N_3(µ, σ²I) where y = (Y_1, Y_2, Y_3)^T and µ = (µ_1, µ_2, µ_3)^T. Let

A = (1/2)
[  1  −1   0 ]
[ −1   1   0 ]
[  0   0   0 ]

and B = (1/6)
[  1   1  −2 ]
[  1   1  −2 ]
[ −2  −2   4 ].

Are y^T Ay and y^T By independent? Explain.
2.21Q. Let Y = Xβ + e where e ∼ Nn(0, σ²I_n). Assume X has full rank. Let r be the vector of residuals. Then the residual sum of squares RSS = r^T r. The sum of squared fitted values is Ŷ^T Ŷ. Prove whether r^T r and Ŷ^T Ŷ are independent (or dependent).
(Hint: write each term as a quadratic form.)
 
2.22. Let B =
[ 1  2 ]
[ 2  4 ].

a) Find rank(B).
b) Find a basis for C(B).
c) Find [C(B)]^⊥ = nullspace of B^T.
d) Show that B^− =
[ 1  −1 ]
[ 1   0 ]
is a generalized inverse of B.
2.23. Suppose that Y = Xβ+e where Cov(e) = σ 2 Σ and Σ = Σ 1/2 Σ 1/2
where Σ 1/2 is nonsingular and symmetric. Hence Σ −1/2 Y = Σ −1/2 Xβ +
Σ −1/2 e. Find Cov(Σ −1/2 e). Simplify.
2.24. Let y ∼ N_2(µ, σ²I) where y = (Y_1, Y_2)^T and µ = (µ_1, µ_2)^T. Let

A =
[ 1/2  1/2 ]
[ 1/2  1/2 ]

and B =
[  1/2  −1/2 ]
[ −1/2   1/2 ].

Are y^T Ay and y^T By independent? Explain.
2.25. Assuming the assumptions of the least squares central limit theorem hold, what is the limiting distribution of √n(β̂ − β) if (X′X)/n → W^{−1} as n → ∞?
√n(β̂ − β) →D ?

2.26. Let the model be Y_i = β_1 + β_2 x_{i2} + β_3 x_{i3} + β_4 x_{i4} + ... + β_{10} x_{i10} + e_i. The model in matrix form is Y = Xβ + e where e ∼ Nn(0, σ²I). Let P be the projection matrix on C(X) where the n × p matrix X has full rank p. What is the distribution of Y^T P Y?
Hint: If Y ∼ Nn(µ, I), then Y^T AY ∼ χ²(rank(A), µ^T Aµ/2) iff A = A^T is idempotent. Y ∼ Nn(Xβ, σ²I), so Y/σ ∼ Nn(Xβ/σ, I). Simplify.
2.27. Let Y 0 = Y T . Let Y ∼ Nn (Xβ, σ 2 I). Recall that E(Y 0 AY ) =
tr(ACov(Y )) + E(Y 0 )AE(Y ).
Find E(Y 0 Y ) = E(Y 0 IY ).
2.28. Let y ∼ N_2(µ, σ²I) where y = (Y_1, Y_2)^T and µ = (µ_1, µ_2)^T. Let

A =
[ 1/2  1/2 ]
[ 1/2  1/2 ]

and B =
[ 1/4    √3/4 ]
[ √3/4   3/4  ].

Are Ay and By independent? Explain.
 
2.29. Let X =
[ 1  0 ]
[ 1  0 ]
[ 1  1 ].

a) Find rank(X).
b) Find a basis for C(X).
c) Find [C(X)]^⊥ = nullspace of X^T.
2.30Q. Let Y = Xβ + e where e ∼ Nn(0, σ²I_n). Assume X has full rank and that the first column of X is 1 so that a constant is in the model. Let r be the vector of residuals. Then the residual sum of squares RSS = r^T r = ‖(I − P)Y‖². The sample mean is Ȳ = (1/n)1^T Y. Prove whether r^T r and Ȳ are independent (or dependent).
(Hint: If Y ∼ Nn(µ, Σ), then AY ⫫ BY iff AΣB^T = 0. So prove whether (I − P)Y ⫫ (1/n)1^T Y.)
2.31. Let the full model be Y_i = β_1 + β_2 x_{i2} + β_3 x_{i3} + β_4 x_{i4} + β_5 x_{i5} + β_6 x_{i6} + e_i and let the reduced model be Y_i = β_1 + β_3 x_{i3} + e_i for i = 1, ..., n. Write the full model as Y = Xβ + e = X_1β_1 + X_2β_2 + e, and consider testing H0: β_2 = 0 where β_1 corresponds to the reduced model. Let P_1 be the projection matrix on C(X_1) and let P be the projection matrix on C(X). Then

FR = [(n − p)/q] · Y^T(P − P_1)Y / [Y^T(I − P)Y].

Assume e ∼ Nn(0, σ²I). Assume H0 is true.
a) What is q?
b) What is the distribution of Y^T(P − P_1)Y?
c) What is the distribution of Y^T(I − P)Y?
d) What is the distribution of FR?
2.32Q . If P is a projection matrix, prove a) the eigenvalues of P are 0 or
1, b) rank(P ) = tr(P ).
2.33Q . Suppose that AY and BY are independent where A and B are
symmetric matrices. Are Y 0 AY and Y 0 BY independent? (Hint: show that

the quadratic form Y 0 AY is a function of AY by using the definition of the


generalized inverse A− .)
2.34. Craig’s theorem states that if x ∼ Nn (µ, V ) and if A and B are
symmetric matrices, then the quadratic forms x0 Ax and x0 Bx are indepen-
dent iff i) V AV BV = 0, ii) V AV Bµ = 0, iii) V BV Aµ = 0, and iv)
µ0 AV Bµ = 0. Here V is positive semidefinite. Hence V could be singular.
Notice that V is symmetric since it is a covariance matrix.
Suppose that AV B = 0. Are x0 Ax and x0 Bx are independent? Explain
briefly.
2.35Q. Let Y be an n × 1 random vector and A an n × n symmetric matrix. Let E(Y) = θ and Cov(Y) = Σ = (σ_{ij}).
a) Prove that E(Y^T AY) = tr(AΣ) + θ^T Aθ.
b) Let E(Y_i) = θ for all i, σ_{ii} = σ² for all i, and σ_{ij} = ρσ² for i ≠ j where −1 < ρ < 1. Show that Σ_i (Y_i − Ȳ)² is an unbiased estimator of σ²(1 − ρ)(n − 1). Hint: write Σ_i (Y_i − Ȳ)² = Y^T AY and use a).
c) Show that Σ_i (Y_i − Ȳ)² and Ȳ are independent if Σ = σ²I. State the theorems clearly wherever used in your proof.
2.36Q (NIU, summer 1991). Consider the regression model Y_i = βx_i + e_i for i = 1, ..., n where the e_i are iid N(0, σ²).
a) Show that the least squares estimator of β is

β̂ = Σ_{i=1}^n x_i Y_i / Σ_{i=1}^n x_i².

b) Express β̂ as a linear combination of the responses and derive its mean and variance.
c) Show that Ŷ_i = β̂x_i is an unbiased estimator of E(Y_i) and derive its variance.
d) Derive the maximum likelihood estimators of β and σ².
2.37Q . a) For an n × 1 vector Y with E(Y ) = µ and Cov(Y ) = Σ, show
E(Y T AY ) = trace(AΣ) + µT Aµ. Is normality necessary here?
b) Consider the usual full rank linear model Y = Xβ + e where X is
n × p, the first column of X is 1, β is p × 1 and e ∼ Nn (0, σ 2 I).
i) Write down an ANOVA table to test (β2 , ..., βp)T = 0, giving expressions
for the regression sum of squares (SSR) and the error sum of squares (SSE).
ii) Find E(SSR) and E(SSE) when H0 is true.
iii) Derive the distribution of SSE/σ 2 if H0 is true. State any theorems
used.
2.38Q . a) Define a generalized inverse of a matrix A.
b) i) Suppose X is n × p with rank r < p. Give the formula for the
projection matrix P onto the column space of X.
ii) For
X =
[ 1 −2 ]
[ 1 −2 ]
[ 1 −2 ],
calculate P .
iii) With X as above and Y = (1, 2, 3)T , calculate the error sum of squares
SSE.
2.39Q . Consider the usual full rank model Y = Xβ + e where X is n × p
and e ∼ Nn (0, σ 2 I n ). Let β = (β T1 β T2 )T where β i is pi × 1.
a) Write down the complete ANOVA table for the test H0 : β 2 = 0,
including the expected mean squares.
b) Prove that SSE(R) − SSE and M SE are independent.
c) If H0 is true, show FR ∼ Fp2 ,n−p .
2.40Q . Let Y ∼ Nn (µ, Σ) where Σ > 0, and let A be a symmetric matrix.
a) State the necessary and sufficient condition(s) for Y T AY to be a chi-
square random variable.
b) Suppose rank(Σ) = n and BΣA = 0 where B is a q × n matrix. Prove
that Y T AY and BY are independent.
c) If µ = µ1 and Σ = σ 2 I where σ 2 > 0, prove that
Y = (1/n) Σ_{i=1}^{n} Yi and [1/(n − 1)] Σ_{i=1}^{n} (Yi − Y )2 are independent.
2.41Q . Let Y = Xβ + e where e ∼ Nn (0, σ 2 I), X is an n × p matrix of
rank p, and β is a p × 1 vector.
a) Write down (do not derive) the MLEs of β and σ 2 .
b) If σ̂ 2 is the MLE of σ 2 , derive the distribution of (n − p)σ̂ 2 /σ 2 .
c) Prove that β̂ (MLE of β) and σ̂ 2 are independent.
d) Now suppose e ∼ Nn (0, σ 2 V ) where V is a known positive definite
matrix. Write down the MLE of β.
2.42Q . a) Suppose Y ∼ Nn (µ, Σ). Let A be an n × n symmetric matrix.
i) Show E[(Y − µ)T A(Y − µ)] = tr(AΣ). Is normality of Y necessary
here?
ii) State a necessary and sufficient condition for (Y − µ)T A(Y − µ) to be
a chi-square random variable.
iii) State a necessary and sufficient condition for (Y − µ)T A(Y − µ) and
BY to be independent where B is a q × n matrix.
b) Suppose Y ∼ Nn (Xβ, σ 2 I) where X is an n × p matrix of rank p and
β is p × 1.
i) Derive the distribution of (1/σ)(I − H)Y where H is the projection matrix onto the column space C(X).
ii) Derive the distribution of u = Y T (I − H)Y /σ 2 .
iii) Show that u and v = HY are independent.
2.43Q . Consider the regression model yi = βxi + ei for i = 1, ..., n where
the ei are iid N (0, σ 2 ).
a) Derive the least squares estimator of β.


b) Write down an unbiased estimator of σ 2 .
c) Derive the maximum likelihood estimators of β and σ 2 .
2.44Q . Let Y1 and Y2 be independent random variables with mean θ and
2θ respectively. Find the least squares estimate of θ and the residual sum of
squares.
2.45Q . a) By the least squares central limit theorem, √n(β̂ − β) →D Np (0, σ 2 W ). Hence the limiting distribution of √n(β̂ − β) is the Np (0, σ 2 W ) distribution. Let A be a constant r × p matrix. Find the limiting distribution of A√n(β̂ − β).
b) Suppose Z n →D Nk (µ, I). Let A be a constant r × k matrix. Find the limiting distribution of A(Z n − µ).
2.46. Suppose that Y1 , . . . , Yn are independent random variables with Yi ∼ N (βxi , σ 2 ), where x1 , . . . , xn are fixed known constants, and β and σ 2 are unknown.
a) Find the MLE of β, and show that it is an unbiased estimator of β.
b) Find the distribution of the MLE of β.
c) Two other possible estimators for β are given by U = Σ_{i} Yi / Σ_{i} xi and V = (1/n) Σ_{i} (Yi /xi ).
i) Show these two estimators are also unbiased estimators of β.
ii) Calculate their variances and compare them with the variance of the MLE.
2.47. Consider the usual multiple linear regression model, written in matrix notation as Y = Xβ + ε, where ε ∼ Nn (0, σ 2 I). Assume that X has full rank. Recall that the various sums of squares from the ANOVA table for this model have the following forms:
a) SSTot = Y T (I − n−1 J)Y
b) SSE = Y T (I − H)Y
c) SSRegr = Y T (H − n−1 J)Y
where the hat matrix is H = X(X T X)−1 X T and J = 11T , with 1T =
[1 1 . . . 1]. As is well-known, these sums of squares are quadratic forms.
Show that in each case, the matrix of the quadratic form is symmetric and
idempotent.
(Hint: where necessary, you may assume that the design matrix can be
partitioned as X = [1 X ∗ ], where X ∗ is an n × (p − 1) submatrix made up
of columns that are the individual p − 1 predictor variables.)
2.48. Suppose that the regression model is Yi = ai + βxi + ei for i =
1, ..., n where the ai are known constants and the ei are iid N (0, σ 2 ) random
variables. The least squares criterion is Q(η) = Σ_{i=1}^{n} (Yi − ai − ηxi )2 .
a) What is E(Yi |xi )?
b) Find the least squares estimator β̂ of β. Prove that your β̂ is the global
minimizer of the least squares criterion Q.
c) If each xi = 1 for i = 1, ..., n, what are β̂, (d/dη)Q(η), and (d2 /dη 2 )Q(η)?
d) The likelihood function is
L(β, σ 2 ) = (2πσ 2 )−n/2 exp[ −(1/(2σ 2 )) Σ_{i=1}^{n} (Yi − ai − βxi )2 ].
Since the least squares estimator β̂ minimizes Σ_{i=1}^{n} (Yi − ai − βxi )2 , show that β̂ is the (maximum likelihood estimator) MLE of β.
e) Then find the MLE σ̂ 2 of σ 2 .

2.49. Let A0 = AT be the transpose of A.


a) Suppose that the usual Gaussian linear model holds and that the sample
size is n. Find E(Y 0 Y ).
b) Let y ∼ Np (θ, Σ) where Σ is positive definite. Let A be a symmetric
p × p matrix. Let x = y − θ. Find

E[(y − θ)0 A(y − θ)] = E[x0 Ax].

2.50Q . Consider the regression model yi = βxi + ei for i = 1, ..., n where


the ei are iid N (0, σ 2 ).
a) Derive the least squares estimator of β.
b) Write down an unbiased estimator of σ 2 .
c) Derive the maximum likelihood estimators of β and σ 2 .

2.51Q . Let Y1 , . . . , Yn be independent random variables, and let Yi have a


N (iθ, i2 σ 2 ) distribution for i = 1, . . . , n. A statistician decided to construct
two estimators for the parameter θ by using two models. [Leave the sums Σ_{i=1}^{n} i, Σ_{i=1}^{n} i2 , Σ_{i=1}^{n} i4 , etc. as they are, without replacing them with their exact values.]

a) Write the linear model and state the assumptions.


b) Simplify the weighted least squares estimate of θ, and call it θ̂1 . Then,
simplify the distribution of θ̂1 .
c) Simplify the ordinary least squares estimator, and call it (θ̂2 ). Simplify
the distribution of θ̂2 .
d) Which estimator has a smaller variance? Is any of θ̂1 , θ̂2 a BLUE (Best
Linear Unbiased Estimator)?
2.52.
2.53.

R Problems
Use the command source(“G:/linmodpack.txt”) to download the
functions and the command source(“G:/linmoddata.txt”) to download the
data. See Preface or Section 11.1. Typing the name of the linmodpack
function, e.g. regbootsim2, will display the code for the function. Use the
args command, e.g. args(regbootsim2), to display the needed arguments for


the function. For the following problem, the R commands can be copied and
pasted from (http://parker.ad.siu.edu/Olive/linmodrhw.txt) into R.
2.74. Generalized and weighted least squares are each equivalent to a least squares regression without intercept. Let w 0 = wT . Let V = diag(1, 1/2, 1/3, ..., 1/9) = diag(1/wi ) where n = 9 and the weights wi = i for i = 1, ..., 9. Let x0 = (1, x1 , x2 , x3 ). Then the weighted least squares with weight vector w 0 = (1, 2, ..., 9) is equivalent to the OLS regression of √wi Yi = Zi on u where u = √wi x = (√wi , √wi x1 , √wi x2 , √wi x3 )0 . There is no intercept because the vector of ones has been replaced by a vector of the √wi ’s. Copy and paste the commands for this problem into R. The commands fit weighted least squares and the equivalent OLS regression without an intercept. Include one page of output in Word.
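For readers who want to see the idea before downloading the linmodpack commands, the following R lines give a minimal sketch of the equivalence; the simulated predictors and variable names are illustrative assumptions, not the linmodrhw.txt code.

# Hedged sketch: WLS with weights w_i equals OLS of sqrt(w_i)*Y on sqrt(w_i)*x
# with no intercept. Simulated data; only base R functions are used.
set.seed(1)
n <- 9
w <- 1:n                                   # weights w_i = i
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + x1 + x2 + x3 + rnorm(n, sd = 1/sqrt(w))   # Cov(e) = sigma^2 diag(1/w_i)
fitw <- lm(y ~ x1 + x2 + x3, weights = w)           # weighted least squares
z  <- sqrt(w) * y
u0 <- sqrt(w); u1 <- sqrt(w)*x1; u2 <- sqrt(w)*x2; u3 <- sqrt(w)*x3
fito <- lm(z ~ u0 + u1 + u2 + u3 - 1)               # OLS without an intercept
cbind(WLS = coef(fitw), OLS = coef(fito))           # the two coefficient vectors agree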
Chapter 3
Nonfull Rank Linear Models and Cell
Means Models

Much of Sections 2.1 and 2.2 apply to both full rank and nonfull rank linear
models. In this chapter we often assume X has rank r < p ≤ n.

3.1 Nonfull Rank Linear Models

Definition 3.1. The nonfull rank linear model is Y = Xβ + e where X


has rank r < p ≤ n, X is an n × p matrix, E(e) = 0 and Cov(e) = σ 2 I.

Nonfull rank models are often used in experimental design models. Much
of the nonfull rank model theory is similar to that of the full rank model,
but there are some differences. Now the generalized inverse (X T X)− is not
unique. Similarly, β̂ is a solution to the normal equations, but depends on the
generalized inverse and is not unique. Some properties of the least squares
estimators are summarized below. Let P = P X be the projection matrix
on C(X). Recall that projection matrices are symmetric and idempotent but
singular unless P = I. Also recall that P X = X, so X T P = X T .

Theorem 3.1. Let Y = Xβ + e where X has rank r < p ≤ n, E(e) = 0,


and Cov(e) = σ 2 I.
i) P = X(X T X)− X T is the unique projection matrix on C(X) and does
not depend on the generalized inverse (X T X)− .
ii) β̂ = (X T X)− X T Y does depend on (X T X)− and is not unique.
iii) Ŷ = X β̂ = P Y , r = Y − Ŷ = Y − X β̂ = (I − P )Y and RSS = rT r
are unique and so do not depend on (X T X)− .
iv) β̂ is a solution to the normal equations: X T X β̂ = X T Y .
v) Rank(P ) = r and rank(I − P ) = n − r.
vi) M SE = RSS/(n − r) = rT r/(n − r) is an unbiased estimator of σ 2 .


vii) Let the columns of X 1 form a basis for C(X). For example, take r lin-
early independent columns of X to form X 1 . Then P = X 1 (X T1 X 1 )−1 X T1 .

Proof. Part i) follows from Theorem 2.2 a), b). For part iii), P and I − P are projection matrices and projections P w and (I − P )w are unique since projection matrices are unique. For ii), since (X T X)− is not unique, β̂ is not unique. Note that iv) holds since X T X β̂ = X T P Y = X T Y because P X = X and X T P = X T . From the proof of Theorem 2.2, if M is a projection matrix, then rank(M ) = tr(M ) = the number of nonzero eigenvalues of M = rank(X). Thus v) holds. vi) E(r T r) = E(eT (I − P )e) = tr[(I − P )σ 2 I] = σ 2 (n − r) by Theorem 2.5. Part vii) follows from Theorem 2.2.
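The uniqueness in Theorem 3.1 i) is easy to check numerically. The R sketch below uses an assumed 6 × 3 rank 2 design matrix and compares the projection matrix built from the Moore–Penrose generalized inverse (MASS::ginv) with one built from a different generalized inverse, obtained by inverting a nonsingular 2 × 2 block of X T X and padding with zeros.

# Hedged sketch: P = X (X'X)^- X' does not depend on the generalized inverse used.
library(MASS)                                    # ginv(): Moore-Penrose generalized inverse
X  <- cbind(1, c(1,1,1,0,0,0), c(0,0,0,1,1,1))   # n = 6, p = 3, rank(X) = 2
A  <- t(X) %*% X
P1 <- X %*% ginv(A) %*% t(X)
G  <- matrix(0, 3, 3)                            # a second generalized inverse of A:
G[2:3, 2:3] <- solve(A[2:3, 2:3])                # invert a nonsingular block, pad with zeros
max(abs(A %*% G %*% A - A))                      # essentially 0, so G is a generalized inverse
P2 <- X %*% G %*% t(X)
max(abs(P1 - P2))                                # essentially 0: the projection matrix is unique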


Definition 3.2. Let a and b be constant vectors. Then aT β is estimable


if there exists a linear unbiased estimator bT Y so E(bT Y ) = aT β.
The term “estimable” is misleading since there are nonestimable quantities
aT β that can be estimated with biased estimators. For full rank models, aT β
is estimable for any p × 1 constant vector a since aT β̂ is a linear unbiased
estimator of aT β. See the Gauss Markov Theorem (Full Rank Case) 2.22.
Estimable quantities tend to go with the nonfull rank linear model. We can
avoid nonestimable functions by using a full rank model instead of a nonfull
rank model (delete columns of X until it is full rank). From Chapter 2, the
linear estimator aT Y of cT θ is the best linear unbiased estimator (BLUE) of
cT θ if E(aT Y ) = cT θ, and if for any other unbiased linear estimator bT Y
of cT θ, V (aT Y ) ≤ V (bT Y ). Note that E(bT Y ) = cT θ.

Since r ≤ p ≤ n, the model is full rank in the following theorem if r = p.


Then the next theorem shows that the least squares estimator of an estimable
function aT β is aT β̂ = bT X β̂ = bT P Y .

Theorem 3.2. Let Y = Xβ + e where X has rank r ≤ p ≤ n, E(e) = 0,


and Cov(e) = σ 2 I.
a) The quantity aT β is estimable iff aT = bT X iff a = X T b (for some
constant vector b) iff a ∈ C(X T ).
b) Let θ̂ = X β̂ and θ = Xβ. Suppose there exists a constant vector c
such that E(cT θ̂) = cT θ. Then among the class of linear unbiased estimators
of cT θ, the least squares estimator cT θ̂ is the unique BLUE.
c) Gauss Markov Theorem: If aT β is estimable and a least squares
estimator β̂ is any solution to the normal equations X T X β̂ = X T Y , then
aT β̂ is the unique BLUE of aT β.
Proof. a) If aT β is estimable, then aT β = E(bT Y ) = bT Xβ for all
β ∈ Rp . Thus aT = bT X or a = X T b. Hence aT β is estimable iff aT = bT X
iff a = X T b iff a ∈ C(X T ).
For part b), we use the proof from Seber and Lee (2003, p. 43). Since θ̂ =
X β̂ = P Y , it follows that E(cT θ̂) = E(cT P Y ) = cT P Xβ = cT Xβ = cT θ.
Thus cT θ̂ = cT P Y = (P c)T Y is a linear unbiased estimator of cT θ. Let
dT Y be any other linear unbiased estimator of cT θ. Hence E(dT Y ) = dT θ =
cT θ for all θ ∈ C(X). So (c − d)T θ = 0 for all θ ∈ C(X). Hence (c − d) ∈
[C(X)]⊥ and P (c − d) = 0, or P c = P d. Thus V (cT θ̂) = V (cT P Y ) =
V (dT P Y ) = σ 2 dT P T P d = σ 2 dT P d. Then V (dT Y )−V (cT θ̂) = V (dT Y )−
V (dT P Y ) = σ 2 [dT d − dT P d] = σ 2 dT (I n − P )d = σ 2 dT (I n − P )T (I n −
P )d = g T g ≥ 0 with equality iff g = (I n − P )d = 0, or d = P d = P c. Thus
cT θ̂ has minimum variance and is unique.
c) Since aT β is estimable, aT β̂ = bT X β̂. Then aT β̂ = bT θ̂ is the unique
BLUE of aT β = bT θ by part b). 

Remark 3.1. There are several ways to show whether aT β is estimable


or nonestimable. i) For the full rank model, aT β is estimable: use the BLUE
aT β̂. Let θ̂ = X β̂ be the least squares estimator of Xβ where X has full
rank p. a) cT θ̂ is the unique BLUE of cT θ. b) aT β̂ is the BLUE of aT β for
every vector a.
Now consider the nonfull rank model. ii) If aT β is estimable: use the BLUE aT β̂.
iii) There are two more ways to check whether aT β is estimable.
a) If there is a constant vector b such that E(bT Y ) = aT β, then aT β is
estimable.
b) If aT = bT X or a = X T b or a ∈ C(X T ), then aT β is estimable.
Then bT Y is a linear unbiased estimator of aT β, and the least squares esti-
mator bT P Y = aT β̂ is the best linear unbiased estimator (BLUE) in that
V (aT β̂) = V (bT P Y ) ≤ V (bT Y ).
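A numerical check of iii) b) is straightforward: aT β is estimable iff a ∈ C(X T ), i.e. iff appending aT as an extra row of X does not increase the rank. The R sketch below uses an assumed nonfull rank design matrix (an intercept column plus two group indicators whose sum equals the intercept column).

# Hedged sketch: test whether a is in C(X^T) by comparing ranks (base R only).
is_estimable <- function(a, X) qr(rbind(X, t(a)))$rank == qr(X)$rank
X <- cbind(1, c(1,1,1,0,0,0), c(0,0,0,1,1,1))   # rank 2 < p = 3
is_estimable(c(0, 1, -1), X)   # TRUE:  the contrast beta_2 - beta_3 is estimable
is_estimable(c(0, 1,  0), X)   # FALSE: beta_2 by itself is not estimable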

3.2 Cell Means Models

Nonfull rank models are often used for experimental design models, but cell
means models have full rank. The cell means models will be illustrated with
the one way Anova model. See Problem 3.9 for the cell means model for the
two way Anova model.

Definition 3.3. Models in which the response variable Y is quantitative,


but all of the predictor variables are qualitative are called analysis of vari-
ance (ANOVA or Anova) models, experimental design models, or design of
experiments (DOE) models. Each combination of the levels of the predictors
gives a different distribution for Y . A predictor variable W is often called a
factor and a factor level ai is one of the categories W can take.
The one way Anova model is used to compare p treatments. Usually there
is replication and H0 : µ1 = µ2 = · · · = µp is a hypothesis of interest.
Investigators may also want to rank the population means from smallest to
largest.
Definition 3.4. Let fZ (z) be the pdf of Z. Then the family of pdfs fY (y) =
fZ (y − µ) indexed by the location parameter µ, −∞ < µ < ∞, is the location
family for the random variable Y = µ + Z with standard pdf fZ (z).

Definition 3.5. A one way fixed effects Anova model has a single quali-
tative predictor variable W with p categories a1 , ..., ap . There are p different
distributions for Y , one for each category ai . The distribution of

Y |(W = ai ) ∼ fZ (y − µi )

where the location family has second moments. Hence all p distributions come
from the same location family with different location parameter µi and the
same variance σ 2 .

Notation. It is convenient to relabel the response variable Y1 , ..., Yn as


the vector Y = (Y11 , ..., Y1,n1, Y21 , ..., Y2,n2 , ..., Yp1, ..., Yp,np)T where the Yij
are independent and Yi1 , ..., Yi,ni are iid. Here j = 1, ..., ni where ni is the
number of cases from the ith level where i = 1, ..., p. Thus n1 + · · · + np =
n. Similarly use double subscripts on the errors. Then there will be many
equivalent parameterizations of the one way fixed effects Anova model.

Definition 3.6. The cell means model is the parameterization of the one
way fixed effects Anova model such that

Yij = µi + eij

where Yij is the value of the response variable for the jth trial of the ith
factor level. The µi are the unknown means and E(Yij ) = µi . The eij are
iid from the location family with pdf fZ (z) and unknown variance σ 2 =
VAR(Yij ) = VAR(eij ). For the normal cell means model, the eij are iid
N (0, σ 2 ) for i = 1, ..., p and j = 1, ..., ni.

The cell means model is a linear model (without intercept) of the form Y = X c β c + e, written out as

(Y11 , ..., Y1,n1 , Y21 , ..., Y2,n2 , ..., Yp,1 , ..., Yp,np )T = X c (µ1 , µ2 , ..., µp )T + (e11 , ..., e1,n1 , e21 , ..., e2,n2 , ..., ep,1 , ..., ep,np )T     (3.1)

where, in block form,

X c =
[ 1n1   0    ...   0   ]
[  0   1n2   ...   0   ]
[  ⋮     ⋮           ⋮  ]
[  0    0    ...  1np  ]

with 1ni an ni × 1 vector of ones and 0 denoting vectors of zeroes of conforming dimensions.
Notation. Let Yi0 = Σ_{j=1}^{ni} Yij and let
µ̂i = Y i0 = Yi0 /ni = (1/ni ) Σ_{j=1}^{ni} Yij .     (3.2)
Hence the “dot notation” means sum over the subscript corresponding to the 0, e.g. j. Similarly, Y00 = Σ_{i=1}^{p} Σ_{j=1}^{ni} Yij is the sum of all of the Yij .

Let X c = [v 1 v 2 · · · v p ], and notice that the indicator variables used in


the cell means model (3.1) are v hk = xhk = 1 if the hth case has W = ak , and
v hk = xhk = 0, otherwise, for k = 1, ..., p and h = 1, ..., n. So Yij has xhk = 1
only if i = k and j = 1, ..., ni. The model can use p indicator variables for the
factor instead of p − 1 indicator variables because the model does not contain
an intercept. Also notice that (X Tc X c ) = diag(n1 , ..., np),

E(Y ) = X c βc = (µ1 , ..., µ1, µ2 , ..., µ2, ..., µp, ..., µp)T ,

and X Tc Y = (Y10 , Y20 , ..., Yp0 )T . Hence (X Tc X c )−1 = diag(1/n1 , ..., 1/np ) and the OLS estimator
β̂ c = (X Tc X c )−1 X Tc Y = (Y 10 , ..., Y p0 )T = (µ̂1 , ..., µ̂p )T .

Thus Ŷ = X c β̂ c = (Y 10 , ..., Y 10 , ..., Y p0 , ..., Y p0 )T . Hence the ijth fitted


value is
Ŷij = Y i0 = µ̂i (3.3)
and the ijth residual is

rij = Yij − Ŷij = Yij − µ̂i . (3.4)
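In R, fitting the cell means model amounts to regressing Y on the factor with the intercept removed; the coefficients are then the µ̂i = Y i0 . A small sketch with simulated data (the means 5, 7, 9 are illustrative assumptions):

# Hedged sketch: OLS for the cell means model returns the group sample means.
set.seed(1)
W <- factor(rep(1:3, each = 10))           # p = 3 levels, n_i = 10
Y <- rnorm(30, mean = c(5, 7, 9)[W])       # same location family, different mu_i
fit <- lm(Y ~ W - 1)                       # no intercept: model (3.1)
coef(fit)                                  # muhat_i = Y_i0, the group means
tapply(Y, W, mean)                         # the same values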

Since the cell means model is a linear model, there is an associated response
plot and residual plot. However, many of the interpretations of the OLS
quantities for Anova models differ from the interpretations for MLR models.
First, for MLR models, the conditional distribution Y |x makes sense even if
x is not one of the observed xi provided that x is not far from the xi . This
fact makes MLR very powerful. For MLR, at least one of the variables in x
is a continuous predictor. For the one way fixed effects Anova model, the p
distributions Y |xi make sense where xTi is a row of X c .
Also, the OLS MLR ANOVA F test for the cell means model tests H0 :
βc = 0 ≡ H0 : µ1 = · · · = µp = 0, while the one way fixed effects ANOVA F
test given after Definition 3.10 tests H0 : µ1 = · · · = µp .

Definition 3.7. Consider the one way fixed effects Anova model. The
response plot is a plot of Ŷij ≡ µ̂i versus Yij and the residual plot is a plot of
Ŷij ≡ µ̂i versus rij .
The points in the response plot scatter about the identity line and the
points in the residual plot scatter about the r = 0 line, but the scatter need
not be in an evenly populated band. A dot plot of Z1 , ..., Zm consists of an
axis and m points each corresponding to the value of Zi . The response plot
consists of p dot plots, one for each value of µ̂i . The dot plot corresponding
to µ̂i is the dot plot of Yi1 , ..., Yi,ni. The p dot plots should have roughly the
same amount of spread, and each µ̂i corresponds to level ai . If a new level
af corresponding to xf was of interest, hopefully the points in the response
plot corresponding to af would form a dot plot at µ̂f similar in spread to
the other dot plots, but it may not be possible to predict the value of µ̂f .
Similarly, the residual plot consists of p dot plots, and the plot corresponding
to µ̂i is the dot plot of ri1 , ..., ri,ni.
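The two plots can be made with a few lines of base R; the sketch below uses simulated data like that above (illustrative names), and jitter could be added with jitter() as in Figure 3.1c and d.

# Hedged sketch: response plot and residual plot for a one way Anova fit.
set.seed(1)
W <- factor(rep(1:3, each = 10))
Y <- rnorm(30, mean = c(5, 7, 9)[W])
fit <- lm(Y ~ W - 1)
op <- par(mfrow = c(1, 2))
plot(fitted(fit), Y, xlab = "FIT", ylab = "Y", main = "Response Plot")
abline(0, 1)                               # identity line
plot(fitted(fit), resid(fit), xlab = "FIT", ylab = "RESID", main = "Residual Plot")
abline(h = 0)                              # r = 0 line
par(op)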
Assume that each ni ≥ 10. Under the assumption that the Yij are from
the same location family with different parameters µi , each of the p dot plots
should have roughly the same shape and spread. This assumption is easier
to judge with the residual plot. If the response plot looks like the residual
plot, then a horizontal line fits the p dot plots about as well as the identity
line, and there is not much difference in the µi . If the identity line is clearly
superior to any horizontal line, then at least some of the means differ.
Definition 3.8. An outlier corresponds to a case that is far from the
bulk of the data. Look for a large vertical distance of the plotted point from
the identity line or the r = 0 line.
Rule of thumb 3.1. Mentally add 2 lines parallel to the identity line and
2 lines parallel to the r = 0 line that cover most of the cases. Then a case is
an outlier if it is well beyond these 2 lines.
This rule often fails for large outliers since often the identity line goes
through or near a large outlier so its residual is near zero. A response that is
far from the bulk of the data in the response plot is a “large outlier” (large
in magnitude). Look for a large gap between the bulk of the data and the
large outlier.
Suppose there is a dot plot of nj cases corresponding to level aj that is


far from the bulk of the data. This dot plot is probably not a cluster of “bad
outliers” if nj ≥ 4 and n ≥ 5p. If nj = 1, such a case may be a large outlier.

The assumption of the Yij coming from the same location family with
different location parameters µi and the same constant variance σ 2 is a big
assumption and often does not hold. Another way to check this assumption is
to make a box plot of the Yij for each i. The box in the box plot corresponds
to the lower, middle, and upper quartiles of the Yij . The middle quartile
is just the sample median of the data mij : at least half of the Yij ≥ mij
and at least half of the Yij ≤ mij . The p boxes should be roughly the same
length and the median should occur in roughly the same position (e.g. in
the center) of each box. The “whiskers” in each plot should also be roughly
similar. Histograms for each of the p samples could also be made. All of the
histograms should look similar in shape.

Example 3.1. Kuehl (1994, p. 128) gives data for counts of hermit crabs
on 25 different transects in each of six different coastline habitats. Let Z be
the count. Then the response variable Y = log10 (Z + 1/6). Although the
counts Z varied greatly, each habitat had several counts of 0 and often there
were several counts of 1, 2, or 3. Hence Y is not a continuous variable. The
cell means model was fit with ni = 25 for i = 1, ..., 6. Each of the six habitats
was a level. Figure 3.1a and b shows the response plot and residual plot.
There are 6 dot plots in each plot. Because several of the smallest values in
each plot are identical, it does not always look like the identity line is passing
through the six sample means Y i0 for i = 1, ..., 6. In particular, examine the
dot plot for the smallest mean (look at the 25 dots furthest to the left that
fall on the vertical line FIT ≈ 0.36). Random noise (jitter) has been added to
the response and residuals in Figure 3.1c and d. Now it is easier to compare
the six dot plots. They seem to have roughly the same spread.
The plots contain a great deal of information. The response plot can be
used to explain the model, check that the sample from each population (treat-
ment) has roughly the same shape and spread, and to see which populations
have similar means. Since the response plot closely resembles the residual plot
in Figure 3.1, there may not be much difference in the six populations. Lin-
earity seems reasonable since the samples scatter about the identity line. The
residual plot makes the comparison of “similar shape” and “spread” easier.

Definition 3.9. a) The total sum of squares
SST O = Σ_{i=1}^{p} Σ_{j=1}^{ni} (Yij − Y 00 )2 .
b) The treatment sum of squares
SST R = Σ_{i=1}^{p} ni (Y i0 − Y 00 )2 .
c) The residual sum of squares or error sum of squares
SSE = Σ_{i=1}^{p} Σ_{j=1}^{ni} (Yij − Y i0 )2 .

Fig. 3.1 Plots for Crab Data: a) Response Plot, b) Residual Plot, c) Jittered Response Plot, d) Jittered Residual Plot (horizontal axis: FIT).

Definition 3.10. Associated with each SS in Definition 3.9 is a degrees


of freedom (df) and a mean square = SS/df. For SSTO, df = n − 1 and
M ST O = SST O/(n−1). For SSTR, df = p−1 and M ST R = SST R/(p−1).
For SSE, df = n − p and M SE = SSE/(n − p).
Let Si2 = Σ_{j=1}^{ni} (Yij − Y i0 )2 /(ni − 1) be the sample variance of the ith group. Then the MSE is a weighted sum of the Si2 :
σ̂ 2 = M SE = [1/(n − p)] Σ_{i=1}^{p} Σ_{j=1}^{ni} rij2 = [1/(n − p)] Σ_{i=1}^{p} Σ_{j=1}^{ni} (Yij − Y i0 )2 = [1/(n − p)] Σ_{i=1}^{p} (ni − 1)Si2 = Spool2
where Spool2 is known as the pooled variance estimator.

The ANOVA F test tests whether the p means are equal. If H0 is not
rejected and the means are equal, then it is possible that the factor is unim-
portant, but it is also possible that the factor is important but the
level is not. For example, the factor might be type of catalyst. The yield
may be equally good for each type of catalyst, but there would be no yield if
no catalyst was used.
The ANOVA table is the same as that for MLR, except that SSTR re-
places the regression sum of squares. The MSE is again an estimator of σ 2 .
The ANOVA F test tests whether all p means µi are equal. Shown below
is an ANOVA table given in symbols. Sometimes “Treatment” is replaced
by “Between treatments,” “Between Groups,” “Between,” “Model,” “Fac-
tor,” or “Groups.” Sometimes “Error” is replaced by “Residual,” or “Within
Groups.” Sometimes “p-value” is replaced by “P”, “P r(> F ),” or “PR > F.”
The “p-value” is nearly always an estimated p-value, denoted by pval. An ex-
ception is when the ei are iid N (0, σe2 ). Normality is rare and the constant
variance assumption rarely holds.

Summary Analysis of Variance Table

Source      df      SS     MS = SS/df   F                 p-value
Treatment   p − 1   SSTR   MSTR         F0 = MSTR/MSE     for H0 :
Error       n − p   SSE    MSE                            µ1 = · · · = µp

Here is the 4 step fixed effects one way ANOVA F test of hy-
potheses.
i) State the hypotheses H0 : µ1 = µ2 = · · · = µp and HA: not H0 .
ii) Find the test statistic F0 = M ST R/M SE or obtain it from output.
iii) Find the pval from output or use the F –table: pval =

P (Fp−1,n−p > F0 ).

iv) State whether you reject H0 or fail to reject H0 . If the pval ≤ δ, reject H0
and conclude that the mean response depends on the factor level. (Hence not
all of the treatment means are equal.) Otherwise fail to reject H0 and conclude
that the mean response does not depend on the factor level. (Hence all of the
treatment means are equal, or there is not enough evidence to conclude that
the mean response depends on the factor level.) Give a nontechnical sentence.

Rule of thumb 3.2. If

max(S1 , ..., Sp) ≤ 2 min(S1 , ..., Sp),

then the one way ANOVA F test results will be approximately correct if the
response and residual plots suggest that the remaining one way Anova model
assumptions are reasonable. See Moore (2007, p. 634). If all of the ni ≥ 5,
replace the standard deviations by the ranges of the dot plots when exam-
126 3 Nonfull Rank Linear Models and Cell Means Models

ining the response and residual plots. The range Ri = max(Yi,1 , ..., Yi,ni ) −
min(Yi,1 , ..., Yi,ni) = length of the ith dot plot for i = 1, ..., p.
The assumption that the zero mean iid errors have constant variance
V (eij ) ≡ σ 2 is much stronger for the one way Anova model than for the mul-
tiple linear regression model. The assumption implies that the p population
distributions have pdfs from the same location family with different means
µ1 , ..., µp but the same variances σ12 = · · · = σp2 ≡ σ 2 . The one way ANOVA F
test has some resistance to the constant variance assumption, but confidence
intervals have much less resistance to the constant variance assumption. Consider confidence intervals for µi such as Y i0 ± tni −1,1−δ/2 √M SE/√ni . MSE is a weighted average of the Si2 . Hence MSE overestimates small σi2 and underestimates large σi2 when the σi2 are not equal. Hence using √M SE instead of Si will make the CI too long or too short, and Rule of thumb 3.2 does not apply to confidence intervals based on MSE.

Sometimes SSTR is written as RSSH − RSS as in the Table below. Note


that RSS = SSE.
Summary Analysis of Variance Table

Source    df      SS            MS = SS/df   F                p-value
Between   p − 1   RSSH − RSS    MSTR         F0 = MSTR/MSE    for H0 :
Error     n − p   RSS           MSE                           µ1 = · · · = µp

Example 3.2. An experiment was run to compare three different primitive


altimeters (an altimeter is a device which measures altitude). The response
is the error in reading.
Altimeter 1: 3, 6, 3
Altimeter 2: 4, 5, 4
Altimeter 3: 7, 8, 7
We would like to compare the means of these three altimeters.
a) Write the linear model. Describe all terms and assumptions. Use βi instead
of µi .
b) Given that RSSH − RSS = 20.22 and RSS = 7.33, state the hypotheses
that the means are equal, and complete the ANOVA table if the p-value =
0.0188.
c) Find the distribution of the test statistic under normality, and show how
to precisely make the decision. (no calculation necessary, only show the steps)

Solution. a) Let Y be 9 × 1. Then Y = Xβ + e with β = (β1 , β2 , β3 )T and

X =
[ 1 0 0 ]
[ 1 0 0 ]
[ 1 0 0 ]
[ 0 1 0 ]
[ 0 1 0 ]
[ 0 1 0 ]
[ 0 0 1 ]
[ 0 0 1 ]
[ 0 0 1 ]

where e ∼ N (0, σ 2 I).


b) H0 : β1 = β2 = β3 versus H1 : not H0
Note that n = 9, p = 3, and ni = 3 for i = 1, 2, 3.

Source    df          SS      MS = SS/df        F = MSB/MSE            p-value
Between   2 = p − 1   20.22   20.22/2 = 10.11   10.11/1.222 = 8.273    0.0188
Error     6 = n − p   7.33    7.33/6 = 1.222
c) Reject H0 if 8.273 > F (2, 6, 0.05) where P [F (2, 6) > F (2, 6, 0.05)] =
0.05 and F (2, 6) is an F random variable with 2 numerator and 6 denominator
degrees of freedom.
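The arithmetic in b) and c) can be checked in R with the altimeter data (only base R functions are used):

# R check of Example 3.2: one way Anova for the three altimeters.
y   <- c(3, 6, 3,  4, 5, 4,  7, 8, 7)
alt <- factor(rep(1:3, each = 3))
anova(lm(y ~ alt))     # SSTR = 20.22, SSE = 7.33, F = 8.27 on (2, 6) df, pval = 0.0188
qf(0.95, 2, 6)         # F(2, 6, 0.05) cutoff used for the decision in c)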

All of the parameterizations of the one way fixed effects Anova model
yield the same predicted values, residuals, and ANOVA F test, but the inter-
pretations of the parameters differ. The cell means model is a linear model
(without intercept) of the form Y = X c βc + e that can be fit using OLS.
The OLS MLR output gives the correct fitted values and residuals but an
incorrect ANOVA table. An equivalent linear model (with intercept) with
correct OLS MLR ANOVA table as well as residuals and fitted values can
be formed by replacing any column of the cell means model by a column of
ones 1. Removing the last column of the cell means model and making the
first column 1 gives the model Y = β0 + β1 x1 + · · · + βp−1 xp−1 + e given in
matrix form by (3.5) below.
It can be shown that the OLS estimators corresponding to (3.5) are β̂0 =
Y p0 = µ̂p , and β̂i = Y i0 − Y p0 = µ̂i − µ̂p for i = 1, ..., p − 1. The cell means
model has β̂i = µ̂i = Y i0 .
Model (3.5) is Y = Xβ + e with β = (β0 , β1 , ..., βp−1 )T , Y and e as in (3.1), and, in block form,

X =
[ 1n1     1n1    0    ...    0     ]
[ 1n2      0    1n2   ...    0     ]
[  ⋮       ⋮     ⋮            ⋮    ]
[ 1np−1    0     0    ...  1np−1   ]
[ 1np      0     0    ...    0     ]        (3.5)

where 1ni is an ni × 1 vector of ones; the first column of X is 1 and, for i = 1, ..., p − 1, the (i + 1)th column is the indicator of the ith factor level.
Definition 3.11. A contrast C = Σ_{i=1}^{p} ki µi where Σ_{i=1}^{p} ki = 0. The estimated contrast is Ĉ = Σ_{i=1}^{p} ki Y i0 .

If the null hypothesis of the fixed effects one way ANOVA test is not true,
then not all of the means µi are equal. Researchers will often have hypotheses,
before examining the data, that they desire to test. Often such a hypothesis
can be put in the form of a contrast. For example, the contrast C = µi − µj
is used to compare the means of the ith and jth groups while the contrast
µ1 − (µ2 + · · · + µp )/(p − 1) is used to compare the last p − 1 groups with
the 1st group. This contrast is useful when the 1st group corresponds to a
standard or control treatment while the remaining groups correspond to new
treatments.
Assume that the normal cell means model is a useful approximation to the data. Then the Y i0 ∼ N (µi , σ 2 /ni ) are independent, and
Ĉ = Σ_{i=1}^{p} ki Y i0 ∼ N ( C, σ 2 Σ_{i=1}^{p} ki2 /ni ).
Hence the standard error
SE(Ĉ) = √( M SE Σ_{i=1}^{p} ki2 /ni ).
The degrees of freedom is equal to the MSE degrees of freedom = n − p.
Consider a family of null hypotheses for contrasts {Ho : Σ_{i=1}^{p} ki µi = 0 where Σ_{i=1}^{p} ki = 0 and the ki may satisfy other constraints}. Let δS denote the probability of a type I error for a single test from the family where a type I error is a false rejection. The family level δF is an upper bound on the
(usually unknown) size δT . Know how to interpret δF ≈ δT =


P(of making at least one type I error among the family of contrasts).
Two important families of contrasts are the family of all possible contrasts and the family of pairwise differences Cij = µi − µj where i ≠ j. The Scheffé multiple comparisons procedure has a δF for the family of all possible contrasts, while the Tukey multiple comparisons procedure has a δF for the family of all p(p − 1)/2 pairwise contrasts.
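In R the Tukey procedure for the family of pairwise contrasts is available through the base functions aov() and TukeyHSD(); a sketch with simulated data (the means are illustrative assumptions):

# Hedged sketch: Tukey simultaneous CIs for all pairwise contrasts mu_i - mu_j.
set.seed(2)
W <- factor(rep(1:4, each = 10))
Y <- rnorm(40, mean = c(10, 10, 12, 14)[W])
out <- aov(Y ~ W)
summary(out)       # one way Anova F test
TukeyHSD(out)      # family of p(p - 1)/2 = 6 pairwise comparisons, family level 0.05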

3.3 Summary

1) The nonfull rank linear model: suppose Y = Xβ + e where X has


rank r < p and X is an n × p matrix.
i) P X = X(X T X)− X T is the unique projection matrix on C(X) and
does not depend on the generalized inverse (X T X)− .
ii) β̂ = (X T X)− X T Y does depend on (X T X)− and is not unique.
iii) Ŷ = X β̂ = P X Y , e = Y − Ŷ = Y − X β̂ = (I − P X )Y and
RSS = eT e are unique and so do not depend on (X T X)− .
iv) β̂ is a solution to the normal equations: X T X β̂ = X T Y .
v) It can be shown that rank(P X ) = r and rank(I − P X ) = n − r.
vi) Let θ̂ = X β̂ and θ = Xβ. Suppose there exists a constant vector c such that E(cT θ̂) = cT θ. Then among the class of linear unbiased estimators of cT θ, the least squares estimator cT θ̂ is BLUE.
vii) If Cov(Y ) = Cov(ε) = σ 2 I, then M SE = RSS/(n − r) = eT e/(n − r) is an unbiased estimator of σ 2 .
viii) Let the columns of X 1 form a basis for C(X). For example, take r lin-
early independent columns of X to form X 1 . Then P X = X 1 (X T1 X 1 )−1 X T1 .

2) Let a and b be constant vectors. Then aT β is estimable if there exists


a linear unbiased estimator bT Y so E(bT Y ) = aT β.
3) The quantity aT β is estimable iff aT = bT X iff a = X T b (for some
constant vector b) iff a ∈ C(X T ).
4) If aT β is estimable and a least squares estimator β̂ is any solution to the normal equations X T X β̂ = X T Y , then aT β̂ is unique and is the BLUE of aT β.
5) The term “estimable” is misleading since there are nonestimable quan-
tities aT β that can be estimated with biased or nonlinear estimators.
6) Estimable quantities tend to go with the nonfull rank linear model. Can
avoid nonestimable functions by using a full rank model instead of a nonfull
rank model (delete columns of X until it is full rank).
7) The linear estimator aT Y of cT θ is the best linear unbiased estimator


(BLUE) of cT θ if E(aT Y ) = cT θ, and if for any other unbiased linear
estimator bT Y of cT θ, V (aT Y ) ≤ V (bT Y ). Note that E(bT Y ) = cT θ.
8) Let θ̂ = X β̂ be the least squares estimator of Xβ where X has full
rank p. a) cT θ̂ is the unique BLUE of cT θ. b) aT β̂ is the BLUE of aT β for
every vector a.
9) In experimental design models or design of experiments (DOE), the
entries of X are coded, often as −1, 0 or 1. Often X is not a full rank matrix.
10) Some DOE models have one Yi per xi and lots of xi ’s. Then the
response and residual plots are used like those for MLR.
11) Some DOE models have ni Yi ’s per xi , and only a few distinct values
of xi . Then the response and residual plots no longer look like those for MLR.
12) A dot plot of Z1 , ..., Zm consists of an axis and m points each corre-
sponding to the value of Zi .
13) Let fZ (z) be the pdf of Z. Then the family of pdfs fY (y) = fZ (y − µ)
indexed by the location parameter µ, −∞ < µ < ∞, is the location family
for the random variable Y = µ + Z with standard pdf fZ (y). A one way fixed
effects ANOVA model has a single qualitative predictor variable W with p
categories a1 , ..., ap . There are p different distributions for Y , one for each
category ai . The distribution of

Y |(W = ai ) ∼ fZ (y − µi )

where the location family has second moments. Hence all p distributions come
from the same location family with different location parameter µi and the
same variance σ 2 . The one way fixed effects normal ANOVA model is the
special case where Y |(W = ai ) ∼ N (µi , σ 2 ).
14) The response plot is a plot of Ŷ versus Y . For the one way Anova model,
the response plot is a plot of Ŷij = µ̂i versus Yij . Often the identity line with
unit slope and zero intercept is added as a visual aid. Vertical deviations from
the identity line are the residuals eij = Yij − Ŷij = Yij − µ̂i . The plot will
consist of p dot plots that scatter about the identity line with similar shape
and spread if the fixed effects one way ANOVA model is appropriate. The
ith dot plot is a dot plot of Yi,1 , ..., Yi,ni. Assume that each ni ≥ 10. If the
response plot looks like the residual plot, then a horizontal line fits the p dot
plots about as well as the identity line, and there is not much difference in
the µi . If the identity line is clearly superior to any horizontal line, then at
least some of the means differ.
The residual plot is a plot of Ŷ versus e where the residual e = Y − Ŷ . The
plot will consist of p dot plots that scatter about the e = 0 line with similar
shape and spread if the fixed effects one way ANOVA model is appropriate.
The ith dot plot is a dot plot of ei,1 , ..., ei,ni. Assume that each ni ≥ 10.
Under the assumption that the Yij are from the same location scale family
with different parameters µi , each of the p dot plots should have roughly the
same shape and spread. This assumption is easier to judge with the residual
plot than with the response plot.
15) Rule of thumb: Let Ri be the range of the ith dot plot =
max(Yi1 , ..., Yi,ni)−min(Yi1 , ..., Yi,ni). If the ni ≈ n/p and if max(R1 , ..., Rp) ≤
2 min(R1 , ..., Rp), then the one way ANOVA F test results will be approxi-
mately correct if the response and residual plots suggest that the remaining
one way ANOVA model assumptions are reasonable. Confidence intervals
need stronger assumptions.
16) Let Yi0 = Σ_{j=1}^{ni} Yij and let µ̂i = Y i0 = Yi0 /ni = (1/ni ) Σ_{j=1}^{ni} Yij . Hence the “dot notation” means sum over the subscript corresponding to the 0, e.g. j. Similarly, Y00 = Σ_{i=1}^{p} Σ_{j=1}^{ni} Yij is the sum of all of the Yij . Be able to find µ̂i from data.
17) The cell means model for the fixed effects one way Anova is Yij = µi + εij where Yij is the value of the response variable for the jth trial of the ith factor level for i = 1, ..., p and j = 1, ..., ni . The µi are the unknown means and E(Yij ) = µi . The εij are iid from the location family with pdf fZ (z), zero mean and unknown variance σ 2 = V (Yij ) = V (εij ). For the normal cell means model, the εij are iid N (0, σ 2 ). The estimator µ̂i = Y i0 = Σ_{j=1}^{ni} Yij /ni = Ŷij . The ijth residual is eij = Yij − Y i0 , and Y 00 is the sample mean of all of the Yij and n = Σ_{i=1}^{p} ni . The total sum of squares SSTO = Σ_{i=1}^{p} Σ_{j=1}^{ni} (Yij − Y 00 )2 , the treatment sum of squares SSTR = Σ_{i=1}^{p} ni (Y i0 − Y 00 )2 , and the error sum of squares SSE = RSS = Σ_{i=1}^{p} Σ_{j=1}^{ni} (Yij − Y i0 )2 . The MSE is an estimator of σ 2 . The Anova table is the same as that for multiple linear regression, except that SSTR replaces the regression sum of squares and that SSTO, SSTR and SSE have n − 1, p − 1 and n − p degrees of freedom.
Summary Analysis of Variance Table

Source      df      SS     MS     F                p-value
Treatment   p − 1   SSTR   MSTR   F0 = MSTR/MSE    for H0 :
Error       n − p   SSE    MSE                     µ1 = · · · = µp

18) Shown is a one way ANOVA table given in symbols. Sometimes “Treat-
ment” is replaced by “Between treatments,” “Between Groups,” “Model,”
“Factor” or “Groups.” Sometimes “Error” is replaced by “Residual,” or
“Within Groups.” Sometimes “p-value” is replaced by “P”, “P r(> F )” or
“PR > F.” SSE is often replaced by RSS = residual sum of squares.
19) In matrix form, the cell means model is the linear model without an intercept (although 1 ∈ C(X)), where µ = β = (µ1 , ..., µp )T and Y = Xµ + ε. Here X is the n × p matrix of indicator variables of model (3.1) (the kth column is 1 for the cases from the kth factor level and 0 otherwise), Y = (Y11 , ..., Y1,n1 , ..., Yp,1 , ..., Yp,np )T , and ε = (ε11 , ..., ε1,n1 , ..., εp,1 , ..., εp,np )T .
20) For the cell means model, X T X = diag(n1 , ..., np), (X T X)−1 =
diag(1/n1 , ..., 1/np), and X T Y = (Y10 , ..., Yp0)T . So β̂ = µ̂ = (X T X)−1 X T Y
= (Y 10 , ..., Y p0 )T . Then Ŷ = X(X T X)−1 X T Y = X µ̂, and Ŷij = Y i0 .
Hence the ijth residual eij = Yij − Ŷij = Yij − Y i0 for i = 1, ..., p and
j = 1, ..., ni.
21) In the response plot, the dot plot for the jth treatment crosses the
identity line at Y j0 .
22) The one way Anova F test has hypotheses H0 : µ1 = · · · = µp and HA :
not H0 (not all of the p population means are equal). The one way Anova
table for this test is given above 18). Let RSS = SSE. The test statistic

F = M ST R/M SE = {[RSS(H) − RSS]/(p − 1)}/M SE ∼ Fp−1,n−p
if the εij are iid N (0, σ 2 ). If H0 is true, then Yij = µ + εij and µ̂ = Y 00 . Hence RSS(H) = SST O = Σ_{i=1}^{p} Σ_{j=1}^{ni} (Yij − Y 00 )2 . Since SST O = SSE + SST R, the quantity SST R = RSS(H) − RSS, and M ST R = SST R/(p − 1).
23) The one way Anova F test is a large sample test if the εij are iid with mean 0 and variance σ 2 . Then the Yij come from the same location family with the same variance σi2 = σ 2 and different mean µi for i = 1, ..., p. Thus the p treatments (groups, populations) have the same variance σi2 = σ 2 . The V (εij ) ≡ σ 2 assumption (which implies that σi2 = σ 2 for i = 1, ..., p) is a much stronger assumption for the one way Anova model than for MLR, but the test has some resistance to the assumption that σi2 = σ 2 by 15).
24) Other design matrices X can be used for the full model. One design
matrix adds a column of ones to the cell means design matrix. This model is
no longer a full rank model.
Y = Xβ + ε where β = (β0 , β1 , ..., βp−1 )T and X is the design matrix of model (3.5):

X =
[ 1n1     1n1    0    ...    0     ]
[ 1n2      0    1n2   ...    0     ]
[  ⋮       ⋮     ⋮            ⋮    ]
[ 1np−1    0     0    ...  1np−1   ]
[ 1np      0     0    ...    0     ]
25) A full rank one way Anova model with an intercept adds a constant but
deletes the last column of the X for the cell means model. Then Y = Xβ + ε where Y and ε are as in the cell means model. Then β = (β0 , β1 , ..., βp−1 )T =
(µp , µ1 − µp , µ2 − µp , ..., µp−1 − µp )T . So β0 = µp and βi = µi − µp for
i = 1, ..., p − 1.
It can be shown that the OLS estimators are β̂0 = Y p0 = µ̂p , and β̂i =
Y i0 − Y p0 = µ̂i − µ̂p for i = 1, ..., p− 1. (The cell means model has β̂i = µ̂i =
Y i0 .) In matrix form the model is shown above.
Then X T Y = (Y00 , Y10 , Y20 , ..., Yp−1,0 )T ,

X T X =
[ n                     (n1 n2 · · · np−1 )        ]
[ (n1 n2 · · · np−1 )T   diag(n1 , ..., np−1 )     ],

and

(X T X)−1 = (1/np )
[ 1     −1T                                        ]
[ −1    11T + np diag(1/n1 , ..., 1/np−1 )         ].

Elementwise, the (1, 1) entry of (X T X)−1 is 1/np , the other entries of its first row and first column are −1/np , the (i + 1, i + 1) diagonal entries are 1/np + 1/ni for i = 1, ..., p − 1, and the remaining entries are 1/np .
This model is interesting since the one way Anova F test of H0 : µ1 = · · · = µp versus HA : not H0 corresponds to the MLR Anova F test of H0 : β1 = · · · = βp−1 = 0 versus HA : not H0 .
26) A contrast θ = Σ_{i=1}^{p} ci µi where Σ_{i=1}^{p} ci = 0. The estimated contrast is θ̂ = Σ_{i=1}^{p} ci Y i0 . Then SE(θ̂) = √( M SE Σ_{i=1}^{p} c2i /ni ) and a 100(1 − δ)% CI for θ is θ̂ ± tn−p,1−δ/2 SE(θ̂). CIs for one way Anova are less robust to the assumption that σi2 ≡ σ 2 than the one way Anova F test.
27) Two important families of contrasts are the family of all possible contrasts and the family of pairwise differences θij = µi − µj where i ≠ j. The Scheffé multiple comparisons procedure has a δF for the family of all possible contrasts while the Tukey multiple comparisons procedure has a δF for the family of all p(p − 1)/2 pairwise contrasts.

3.4 Complements

Section 3.2 followed Olive (2017a, ch. 5) closely. The one way Anova model
assumption that the groups have the same variance is very strong. Chapter
9 shows how to use large sample theory to create better one way MANOVA
type tests, and better one way Anova tests are a special case. The tests tend
to be better when all of the ni are large enough for the CLT to hold for each
Y i0 . Also see Rupasinghe Arachchige Don and Olive (2019).

3.5 Problems

3.1. When X is not full rank, the projection matrix P X for C(X) is P X =
X(X 0 X)− X 0 where X 0 = X T . To show that C(P X ) = C(X), you can show
that a) P X w = Xy ∈ C(X) where w is an arbitrary conformable constant
vector, and b) Xy = P X w ∈ C(P X ) where y is an arbitrary conformable
constant vector.
a) Show P X w = Xy and identify y.
b) Show Xy = P X w and identify w. Hint: P X X = X.
3.2. Let P = X(X T X)− X T be the projection matrix onto the column
space of X. Using P X = X, show P is idempotent.
3.3. Suppose that X is an n × p matrix but the rank of X < p < n. Then
the normal equations X 0 Xβ = X 0 Y have infinitely many solutions. Let β̂ be
a solution to the normal equations. So X 0 X β̂ = X 0 Y . Let G = (X 0 X)− be a
generalized inverse of (X 0 X). Assume that E(Y ) = Xβ and Cov(Y ) = σ 2 I.
It can be shown that all solutions to the normal equations have the form bz
given below.

a) Show that bz = GX 0 Y + (GX 0 X − I)z is a solution to the normal


equations where the p × 1 vector z is arbitrary.

b) Show that E(bz ) 6= β.

(Hence some authors suggest that bz should be called a solution to the


normal equations but not an estimator of β.)

c) Show that Cov(bz ) = σ 2 GX 0 XG0 .

d) Although G is not unique, the projection matrix P = XGX 0 onto


C(X) is unique. Use this fact to show that Ŷ = Xbz does not depend on G
or z.

e) There are two ways to show that a0 β is an estimable function. Either


show that there exists a vector c such that E(c0 Y ) = a0 β, or show that
a ∈ C(X 0 ). Suppose that a = X 0 w for some fixed vector w. Show that
E(a0 bz ) = a0 β.

(Hence a0 β is estimable by a0 bz where bz is any solution of the normal


equations.)

f) Suppose that a = X 0 w for some fixed vector w. Show that V ar(a0 bz ) = σ 2 w 0 P w.

3.4. Let Y = Xβ + e where E(e) = 0, Cov(e) = σ 2 I n , and X has full


rank. Let a be a constant vector. (Hint: full rank model formulas are rather
simple.)
a) Find E(aT β̂).
b) Is aT β estimable? Explain briefly.
 
3.5. Let Y = Xβ + e where Y = (Y1 , Y2 , Y3 )0 , β = (β1 , β2 )0 , E(e) = 0, Cov(e) = σ 2 I, and
X =
[ 1 2 ]
[ 1 2 ]
[ 2 4 ].
a) Find [C(X 0 )].
Show whether or not the following functions are estimable.
b) 5β1 + 10β2
c) β1
d) β1 − 2β2
3.6. Let Y = Xβ + e where E(e) = 0, Cov(e) = σ 2 I n , and X has full
rank. Note that Yi = xTi β + ei . Assume X is a constant matrix.
a) Find E(Yi ).
b) Is E(Yi ) estimable? Explain briefly.
3.7. An overparameterized two way Anova model is Yijk = µ + αi + βj +


τij + eijk for i = 1, ..., a and j = 1, ..., b and k = 1, ..., m. Suppose a = 2,
b = 2, and m = 2. Then
(Y111 , Y112 , Y121 , Y122 , Y211 , Y212 , Y221 , Y222 )T = X (µ, α1 , α2 , β1 , β2 , τ11 , τ12 , τ21 , τ22 )T + (e111 , e112 , e121 , e122 , e211 , e212 , e221 , e222 )T .

a) Give the matrix X.


b) We can write the above model as Y = Xβ + e. This model is not full
rank. What is the projection matrix P (onto the column space of X)? Hint:
X T X is singular, so use the generalized inverse.
3.8. Suppose that Y = (Y1 , Y2 )0 , Var(Y ) = σ 2 I, E(Y1 ) = E(Y2 ) = β1 −
2β2 . Show whether or not the following functions are estimable. Hint E(Y ) =
Xβ, so find X.
a) β1
b) β2
c) −β1 + 2β2
d) 4β1 − 8β2
3.9. The cell means model for the two way Anova model is Yijk = µij +eijk
for i = 1, ..., a and j = 1, ..., b and k = 1, ..., m. Suppose a = 2, b = 2, and
m = 2. Then
(Y111 , Y112 , Y121 , Y122 , Y211 , Y212 , Y221 , Y222 )T = X (µ11 , µ12 , µ21 , µ22 )T + (e111 , e112 , e121 , e122 , e211 , e212 , e221 , e222 )T .
a) Give the matrix X.
b) Suppose that a full rank cell means two way Anova model is written in
matrix form as Y = Xβ + e. What is the vector of residuals r?
3.10. Note that C(X 0 X) = C(X 0 ) since C(X 0 X) ⊆ C(X 0 ) and rank(X 0 X) =
rank(X 0 ).
Use this result to explain why there is always a solution β̂ to the normal
equations:
X 0 X β̂ = X 0 Y .
3.11. An alternative parameterization of the one way Anova model is


Yij = µ + αi + eij for i = 1, ..., p and j = 1, ..., ni. Hence µi = µ + αi . Suppose
p = 3 and ni = 2. Then
(Y11 , Y12 , Y21 , Y22 , Y31 , Y32 )T = X (µ, α1 , α2 , α3 )T + (e11 , e12 , e21 , e22 , e31 , e32 )T .

Give the matrix X.


3.12Q . Consider the linear regression model Yi = β1 +β2 xi2 +· · ·+βp xip +ei
or Y = Xβ+e where Y ∼ Nn (Xβ, σ 2 I). Assume X is n×p with rank(X) =
r ≤ p.
a) Give expressions for SSE and SSR using matrix notation.
b) Find E(SSE) and E(SSR).
c) Find the distribution of i) SSE, ii) SSR, and iii) MSR/MSE under the
assumption β2 = · · · = βp = 0.
3.13Q . Consider the linear regression model Y = Xβ + e where Y ∼
Nn (Xβ, σ 2 I). Assume X is n × p with rank(X) = r ≤ p.
a) i) Define what is meant by an estimable linear function of β.
ii) Write down the least squares estimator of an estimable function of β.
iii) Write down an unbiased estimator of σ 2 .
b) Show the estimators of part a) ii) and iii) are unbiased.
c) State the Gauss Markov Theorem.
d) Give expressions for SSE and SSR using matrix notation.
3.14Q . Let E(Y ) = Xβ where Y is 3 × 1, X is 3 × 2, and β is 2 × 1. Let
   
i) X =
[ 2 0 ]
[ 1 1 ]
[ 0 2 ]
and ii) X =
[ 3 6 ]
[ 2 4 ]
[ 1 2 ].

a) In each of cases i) and ii), state whether β is estimable and explain your
answer.
b) If the answer is “yes,” then determine the matrix B in β̂ = BY .
c) If the answer is “no,” then produce one estimable parametric function
and its unbiased estimator.
3.15Q . Let y ∼ Np (Aβ, σ 2 I p ), where A is a known p × n matrix of
constants and β an n×1 vector of unknown parameters. Let r = rank(A), 0 <
r < p. Define the vector of fitted values ŷ and the vector of residuals e as ŷ = P A y and e = y − ŷ (P A is the projection matrix on C(A), the column space of A).
(a) Provide the distribution of ŷ.
(b) Provide the distribution of e.
(c) Are y and e distributed independently? Explain your answer.


(d) Are ŷ and e distributed independently? Explain your answer.
3.16Q . Let Y ∼ Nn (Xβ, σ 2 I n ), where X is n × k matrix with n > k ≥ 2,
and β ∈ Rk . Suppose a hypothesis H0 states that under H0 the data vector
Y has the mean E[Y ] = Zγ, where Z is a suitable matrix with C(Z) is a
proper subset of C(X).
(a) Show that there is a matrix B so that Z = XB.
(b) Show that P X P Z = P Z .
(c) Show that P X − P Z is an idempotent matrix.
(d) Define SSE = Y > [I −P X ]Y and SSE2 = Y > [I −P Z ]Y . Show that
SSE2 ≥ SSE.
(e) Show that SSE2 − SSE and SSE are independently distributed.
(f) Can you suggest a test of H0 based on SSE2 − SSE and SSE?
3.17Q . Consider a two-way cross-classified data where the factor A has 3
levels and the factor B has 4 levels. The numbers of observations for the 12
cells in the two-way classification are as given in the following table. Thus
we have no observations in a number of cells. If nij denotes the number of
observations in the cell corresponding to the ith level of A and the jth level
of B, we have in our data n11 = 1, n12 = 1, n13 = 1, n21 = 1, n22 = 2, n23 =
1, n34 = 2, and all other nij s are zero. For a non-empty cell (i, j), we use Yijk
to denote the kth observation in the cell. We also assume the additive model
given by (when nij > 0)

E(Yijk ) = µ + αi + βj , k = 1, ..., nij , i = 1, 2, 3, j = 1, 2, 3, 4,

             B
          1   2   3   4
      1   1   1   1   0
A     2   1   2   1   0
      3   0   0   0   2
Table 3.1 Frequency Table

We denote the data in vector notation as Y = (Y111 , Y121, Y131, Y211 , Y221, Y222, Y231, Y341 , Y342)> .
Also, we write β = (µ, α1 , α2 , α3, β1 , β2 , β3 , β4 )> .
(a) Find the model matrix (design matrix) X for the model so that
E(Y ) = Xβ.
(b) Find the vector X > Y .
(c) Decide whether Ȳ.1. − Ȳ.3. is the OLS estimator for β1 − β3 . Explain your answer. Here Ȳ.j. = Σ_{i=1}^{3} Σ_{k=1}^{nij} Yijk / Σ_{i=1}^{3} nij .
(d) Decide whether Ȳ1.. − Ȳ3.. is the OLS estimator for α1 − α3 . Explain your answer. Here Ȳi.. = Σ_{j=1}^{4} Σ_{k=1}^{nij} Yijk / Σ_{j=1}^{4} nij .
3.18Q . Let Y = Xβ + ε where Y = (Y1 , Y2 , Y3 )0 , β = (β1 , β2 )0 , E(ε) = 0, Cov(ε) = σ 2 I, and
X =
[ 1 2 ]
[ 1 2 ]
[ 2 4 ].
a) Find C(X 0 ).
Show whether or not the following functions are estimable.
b) 5β1 + 10β2
c) β1
d) β1 − 2β2
3.19Q . Let Y = Xβ + ε where Y = (Y1 , Y2 , Y3 )0 = (1, 2, 3)0 , β = (β1 , β2 )0 , E(ε) = 0, Cov(ε) = σ 2 I, and
X =
[ 1 −2 ]
[ 1 −2 ]
[ 1 −2 ].
a) Calculate P , the projection matrix P onto the column space of X.
b) Calculate the error sum of squares SSE.
c) Find C(X 0 ).
Show whether or not the following functions are estimable.
d) 5β1 + 10β2
e) β1
f) β1 − 2β2
3.20Q . Let Y = Xβ + ε. Suppose that aT1 β, ..., aTk β are estimable functions. Prove or disprove: Σ_{i=1}^{k} ci aTi β is estimable where c1 , ..., ck are known
constants.
3.21Q . a) Let X be an n×1 random vector with E(X) = µ and Cov(X) =
Σ of rank r. Find E(X T Σ − X).
b) Consider the one way fixed effects ANOVA model with 2 replications
per group so that Y is a 2p × 1 random vector: Y = Xβ + e where Y = (Y1,1 , Y1,2 , Y2,1 , Y2,2 , ..., Yp,1 , Yp,2 )T , β = (β0 , β1 , ..., βp−1 )T , e = (e1,1 , e1,2 , ..., ep,1 , ep,2 )T , and the 2p × p design matrix X has two rows equal to (1, 1, 0, ..., 0) for group 1, two rows equal to (1, 0, 1, 0, ..., 0) for group 2, ..., two rows equal to (1, 0, ..., 0, 1) for group p − 1, and two rows equal to (1, 0, ..., 0) for group p,

with E(e) = 0.
i) Simplify E(Y ) = Xβ.
ii) If
E(Y ) = Xβ = (0, . . . , β0 , β0 )T ,
find β1 , ..., βp−1 in terms of β0 .
3.22Q . An experiment was run to compare three different primitive al-
timeters (an altimeter is a device which measures altitude). The response is
the error in reading.
Altimeter 1: 3, 6, 3
Altimeter 2: 4, 5, 4
Altimeter 3: 7, 8, 7
We would like to compare the means of these three altimeters.
a) Write the linear model. Describe all terms and assumptions. Use βi instead
of µi .
b) Given that RSSH − RSS = 20.22 and RSS = 7.33, state the hypotheses
that the means are equal, and complete the ANOVA table (omit the p-value).
c) Find the distribution of the test statistic under normality, and show how
to precisely make the decision. (no calculation necessary, only show the steps)
Chapter 4
Prediction and Variable Selection When
n >> p

This chapter considers variable selection when n >> p and prediction in-
tervals that can work if n > p or p > n. Prediction regions and prediction
intervals applied to a bootstrap sample can result in confidence regions and
confidence intervals. The bootstrap confidence regions will be used for infer-
ence after variable selection.

4.1 Variable Selection

Variable selection, also called subset or model selection, is the search for a
subset of predictor variables that can be deleted with little loss of information
if n/p is large. Consider the 1D regression model where Y ⫫ x|SP where
SP = xT β. See Chapters 1 and 10. A model for variable selection can be
described by
xT β = xTS β S + xTE β E = xTS β S (4.1)
where x = (xTS , xTE )T is a p × 1 vector of predictors, xS is an aS × 1 vector,
and xE is a (p − aS ) × 1 vector. Given that xS is in the model, β E = 0 and
E denotes the subset of terms that can be eliminated given that the subset
S is in the model.
Since S is unknown, candidate subsets will be examined. Let xI be the
vector of a terms from a candidate subset indexed by I, and let xO be the
vector of the remaining predictors (out of the candidate submodel). Then

xT β = xTI βI + xTO β O .

Suppose that S is a subset of I and that model (4.1) holds. Then

xT β = xTS β S = xTS βS + xTI/S β(I/S) + xTO 0 = xTI β I


where xI/S denotes the predictors in I that are not in S. Since this is true
regardless of the values of the predictors, β O = 0 and the sample correlation
corr(xTi β, xTI,i β I ) = 1.0 for the population model if S ⊆ I. The estimated
sufficient predictor (ESP) is xT β̂, and a submodel I is worth considering if
the correlation corr(ESP, ESP (I)) ≥ 0.95.
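As a quick illustration of this screen, the following hedged R sketch (simulated data with an arbitrary seed, not from the text) computes corr(ESP, ESP(I)) for a submodel that contains S:

    # Compare the ESP of the full OLS model with the ESP of a submodel.
    set.seed(1)
    n <- 100
    x <- matrix(rnorm(n * 3), n, 3)          # nontrivial predictors x2, x3, x4
    y <- 1 + 2 * x[, 1] + rnorm(n)           # S = {constant, x2}; x3, x4 inactive
    full <- lm(y ~ x)                        # full model
    sub  <- lm(y ~ x[, 1])                   # candidate submodel I = {constant, x2}
    cor(fitted(full), fitted(sub))           # corr(ESP, ESP(I)), typically >= 0.95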

Definition 4.1. The model Y ⫫ x|xT β that uses all of the predictors is
called the full model. A model Y ⫫ xI |xTI β I that uses a subset xI of the
predictors is called a submodel. The full model is always a submodel.
The full model has sufficient predictor SP = xT β and the submodel has
SP = xTI βI .

Forward selection or backward elimination with the Akaike (1973) AIC


criterion or Schwarz (1978) BIC criterion are often used for variable selec-
tion. The relaxed lasso or relaxed elastic net estimator fits the regression
method, such as a GLM or Cox (1972) proportional hazards regression, to
the predictors than had nonzero lasso or elastic net coefficients. See Chapters
5 and 10.
To clarify notation, suppose p = 4, a constant x1 = 1 corresponding to β1 is
always in the model, and β = (β1 , β2 , 0, 0)T . Then the J = 2p−1 = 8 possible
subsets of {1, 2, ..., p} that always contain 1 are I1 = {1}, S = I2 = {1, 2},
I3 = {1, 3}, I4 = {1, 4}, I5 = {1, 2, 3}, I6 = {1, 2, 4}, I7 = {1, 3, 4}, and
I8 = {1, 2, 3, 4}. There are 2p−aS = 4 subsets I2 , I5 , I6 , and I8 such that
S ⊆ Ij . Let β̂ I7 = (β̂1 , β̂3 , β̂4 )T and xI7 = (x1 , x3 , x4)T .
Underfitting occurs if submodel I does not contain S. Following, for ex-
ample, Pelawa Watagoda (2019), let X = [X I X O ] and β = (β TI , βTO )T .
Then Xβ = X I β I + X O β O , and β̂ I = (X TI X I )−1 X TI Y = AY . Assuming
the usual MLR model, Cov(β̂ I ) = Cov(AY ) = Aσ 2 IAT = σ 2 (X TI X I )−1 .
Now E(β̂ I ) = E(AY ) = AXβ = (X TI X I )−1 X TI (X I β I + X O β O ) =

    β I + (X TI X I )−1 X TI X O β O = β I + AX O β O .

If S ⊆ I, then β O = 0, but if underfitting occurs then the bias vector AX O β O can be large.
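A small R simulation (simulated data and arbitrary coefficient values, offered only as an illustration and not as part of the text) shows the underfitting bias E(β̂ I ) − β I ≈ AX O β O when a predictor in S is omitted:

    # Underfitting bias: regress Y on (1, x2) while omitting x3, where beta_3 = 3.
    set.seed(2)
    n <- 100
    x2 <- rnorm(n); x3 <- 0.5 * x2 + rnorm(n)      # correlated predictors
    XI <- cbind(1, x2); XO <- cbind(x3)            # included and omitted columns
    A  <- solve(t(XI) %*% XI) %*% t(XI)            # A = (X_I^T X_I)^{-1} X_I^T
    bias <- A %*% XO %*% 3                         # bias vector A X_O beta_O
    bhatI <- replicate(1000, {
      y <- 1 + 2 * x2 + 3 * x3 + rnorm(n)
      coef(lm(y ~ x2))                             # betahat_I from the underfit model
    })
    rowMeans(bhatI) - c(1, 2)                      # close to the bias vector below
    drop(bias)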

4.1.1 OLS Variable Selection

Simpler models are easier to explain and use than more complicated mod-
els, and there are several other important reasons to perform variable se-
lection. For example, an OLS MLR model with unnecessary predictors has
Σ_{i=1}^n V (Ŷi ) that is too large. If (4.1) holds, S ⊆ I, β S is an aS × 1 vector,
and β I is a j × 1 vector with j > aS , then

    (1/n) Σ_{i=1}^n V (ŶIi ) = σ 2 j/n > σ 2 aS /n = (1/n) Σ_{i=1}^n V (ŶSi ).     (4.2)

In particular, the full model has j = p. Hence having unnecessary predic-


tors decreases the precision for prediction. Fitting unnecessary predictors is
sometimes called fitting noise or overfitting. As an extreme case, suppose
that the full model contains p = n predictors, including a constant, so that
the hat matrix H = I n , the n × n identity matrix. Then Ŷ = Y so that
VAR(Ŷ |x) = VAR(Y ). A model I underfits if it does not include all of the
predictors in S. A model I does not underfit if S ⊆ I.
To see that (4.2) holds, assume that the full model includes all p possible
terms so the full model may overfit but does not underfit. Then Ŷ = HY
and Cov(Ŷ ) = σ 2 HIH T = σ 2 H. Thus

    (1/n) Σ_{i=1}^n V (Ŷi ) = (1/n) tr(σ 2 H) = (σ 2 /n) tr((X T X)−1 X T X) = σ 2 p/n

where tr(A) is the trace operation. Replacing p by j and aS and replacing H
by H I and H S implies Equation (4.2). Hence if only aS parameters are needed
and p >> aS , then serious overfitting occurs and increases (1/n) Σ_{i=1}^n V (Ŷi ).
Two important summaries for submodel I are R2 (I), the proportion of
the variability of Y explained by the nontrivial predictors in the model,
and M SE(I) = σ̂I2 , the estimated error variance. See Definitions 1.17 and
1.18. Suppose that model I contains k predictors, including a constant. Since
adding predictors does not decrease R2 , the adjusted R2A (I) is often used,
where

    R2A (I) = 1 − (1 − R2 (I)) n/(n − k) = 1 − M SE(I) n/SST.
See Seber and Lee (2003, pp. 400-401). Hence the model with the maximum
R2A(I) is also the model with the minimum M SE(I).

For multiple linear regression, recall that if the candidate model of xI


has k terms (including the constant), then the partial F statistic for testing
whether the p − k predictor variables in xO can be deleted is

    FI = {[SSE(I) − SSE]/[(n − k) − (n − p)]} / [SSE/(n − p)] = [(n − p)/(p − k)] [SSE(I)/SSE − 1]

where SSE is the error sum of squares from the full model, and SSE(I) is the
error sum of squares from the candidate submodel. An extremely important
criterion for variable selection is the Cp criterion.

Definition 4.2.

    Cp (I) = SSE(I)/M SE + 2k − n = (p − k)(FI − 1) + k

where MSE is the error mean square for the full model.
Note that when H0 is true, (p − k)(FI − 1) + k →D χ2p−k + 2k − p for a large
class of iid error distributions. Minimizing Cp (I) is equivalent to minimizing
M SE [Cp (I)] = SSE(I) + (2k − n)M SE = rT (I)r(I) + (2k − n)M SE. The
following theorem helps explain why Cp is a useful criterion and suggests that
for subsets I with k terms, submodels with Cp(I) ≤ min(2k, p) are especially
interesting. Olive and Hawkins (2005) show that this interpretation of Cp can
be generalized to 1D regression models with a linear predictor β T x = xT β,
such as generalized linear models. Denote the residuals and fitted values from
the full model by ri = Yi −xTi β̂ = Yi −Ŷi and Ŷi = xTi β̂ respectively. Similarly,
let β̂I be the estimate of β I obtained from the regression of Y on xI and
denote the corresponding residuals and fitted values by rI,i = Yi − xTI,i β̂I
and ŶI,i = xTI,i β̂ I where i = 1, ..., n.
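The quantities in Definition 4.2 are easy to compute from two nested OLS fits. The following R helper (not from the text; the function and object names are arbitrary) returns FI and Cp (I), with MSE taken from the full model:

    # Cp(I) = SSE(I)/MSE + 2k - n = (p - k)(FI - 1) + k for a submodel I.
    cpI <- function(sub, full) {
      n <- length(residuals(full))
      p <- length(coef(full))                      # terms in the full model
      k <- length(coef(sub))                       # terms in I, including the constant
      SSE <- deviance(full); SSEI <- deviance(sub)
      MSE <- SSE / (n - p)
      FI  <- ((SSEI - SSE) / (p - k)) / MSE        # partial F statistic
      c(FI = FI, Cp = SSEI / MSE + 2 * k - n)
    }
    set.seed(3)
    x <- matrix(rnorm(100 * 3), 100, 3)
    y <- 1 + x[, 1] + rnorm(100)
    cpI(lm(y ~ x[, 1]), lm(y ~ x))                 # Cp(I) <= min(2k, p) suggests I is good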

Theorem 4.1. Suppose that a numerical variable selection method sug-


gests several submodels with k predictors, including a constant, where 2 ≤
k ≤ p.
a) The model I that minimizes Cp (I) maximizes corr(r, rI ).
b) Cp (I) ≤ 2k implies that corr(r, rI ) ≥ √(1 − p/n).
c) As corr(r, rI ) → 1,

    corr(xT β̂, xTI β̂ I ) = corr(ESP, ESP(I)) = corr(Ŷ , ŶI ) → 1.

Proof. These results are a corollary of Theorem 4.2 below. □

Remark 4.1. Consider the model Ii that deletes the predictor xi . Then
the model has k = p − 1 predictors including the constant, and the test
statistic is ti where
t2i = FIi .
Using Definition 4.2 and Cp (Ifull ) = p, it can be shown that

Cp (Ii ) = Cp (Ifull ) + (t2i − 2).

Using the screen Cp (I) ≤ min(2k, p) suggests that the predictor xi should
not be deleted if |ti | > √2 ≈ 1.414.
If |ti | < √2 then the predictor can probably be deleted since Cp decreases.
The literature suggests using the Cp (I) ≤ k screen, but this screen eliminates
too many potentially useful submodels.
More generally, it can be shown that Cp (I) ≤ 2k iff

    FI ≤ p/(p − k).
Now k is the number of terms in the model I including a constant while p − k
is the number of terms set to 0. As k → 0, the partial F test will reject Ho:
βO = 0 (i.e. say that the full model should be used instead of the submodel
I) unless FI is not much larger than 1. If p is very large and p − k is very
small, then the partial F test will tend to suggest that there is a model I
that is about as good as the full model even though model I deletes p − k
predictors.

Definition 4.3. The “fit–fit” or FF plot is a plot of ŶI,i versus Ŷi while
a “residual–residual” or RR plot is a plot of rI,i versus ri . A response plot is a
plot of ŶI,i versus Yi . An EE plot is a plot of ESP(I) versus ESP. For MLR,
the EE and FF plots are equivalent.

Six graphs will be used to compare the full model and the candidate sub-
model: the FF plot, RR plot, the response plots from the full and submodel,
and the residual plots from the full and submodel. These six plots will con-
tain a great deal of information about the candidate subset provided that
Equation (4.1) holds and that a good estimator (such as OLS) for β̂ and β̂I
is used.

Application 4.1. To visualize whether a candidate submodel using pre-


dictors xI is good, use the fitted values and residuals from the submodel and
full model to make an RR plot of the rI,i versus the ri and an FF plot of ŶI,i
versus Ŷi . Add the OLS line to the RR plot and identity line to both plots as
visual aids. The subset I is good if the plotted points cluster tightly about
the identity line in both plots. In particular, the OLS line and the identity
line should “nearly coincide” so that it is difficult to tell that the two lines
intersect at the origin in the RR plot.
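A hedged R sketch (simulated data, not from the text) of the RR and FF plots described in Application 4.1, with the identity line and the OLS line added as visual aids:

    # RR and FF plots for a candidate submodel I versus the full model.
    set.seed(4)
    x <- matrix(rnorm(100 * 4), 100, 4)
    y <- 1 + x[, 1] + x[, 2] + rnorm(100)
    full <- lm(y ~ x); sub <- lm(y ~ x[, 1] + x[, 2])
    r  <- residuals(full); rI  <- residuals(sub)
    Yh <- fitted(full);    YhI <- fitted(sub)
    par(mfrow = c(1, 2))
    plot(r, rI, main = "RR Plot")                  # w = r, z = rI as in Theorem 4.2 v)
    abline(0, 1)                                   # identity line
    abline(lm(rI ~ r), lty = 2)                    # OLS line, nearly coincides if I is good
    plot(YhI, Yh, main = "FF Plot")                # w = YhatI, z = Yhat
    abline(0, 1)
    par(mfrow = c(1, 1))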

To verify that the six plots are useful for assessing variable selection,
the following notation will be useful. Suppose that all submodels include
a constant and that X is the full rank n × p design matrix for the full
model. Let the corresponding vectors of OLS fitted values and residuals
be Ŷ = X(X T X)−1 X T Y = HY and r = (I − H)Y , respectively.
Suppose that X I is the n × k design matrix for the candidate submodel
and that the corresponding vectors of OLS fitted values and residuals are
Ŷ I = X I (X TI X I )−1 X TI Y = H I Y and r I = (I − H I )Y , respectively.

A plot can be very useful if the OLS line can be compared to a reference
line and if the OLS slope is related to some quantity of interest. Suppose that
a plot of w versus z places w on the horizontal axis and z on the vertical axis.
Then denote the OLS line by ẑ = a + bw. The following theorem shows that
the plotted points in the FF, RR, and response plots will cluster about the
identity line. Notice that the theorem is a property of OLS and holds even if
the data does not follow an MLR model. Let corr(x, y) denote the correlation
between x and y.

Theorem 4.2. Suppose that every submodel contains a constant and that
X is a full rank matrix.
Response Plot: i) If w = ŶI and z = Y then the OLS line is the identity
line.
ii) If w = Y and z = ŶI then the OLS line has slope b = [corr(Y, ŶI )]2 = R2 (I)
and intercept a = Y (1 − R2 (I)) where Y = Σ_{i=1}^n Yi /n and R2 (I) is the
coefficient of multiple determination from the candidate model.
FF or EE Plot: iii) If w = ŶI and z = Ŷ then the OLS line is the identity
line. Note that ESP (I) = ŶI and ESP = Ŷ .
iv) If w = Ŷ and z = ŶI then the OLS line has slope b = [corr(Ŷ , ŶI )]2 =
SSR(I)/SSR and intercept a = Y [1 − (SSR(I)/SSR)] where SSR is the
regression sum of squares.
RR Plot: v) If w = r and z = rI then the OLS line is the identity line.
vi) If w = rI and z = r then a = 0 and the OLS slope b = [corr(r, rI )]2 and

    corr(r, rI ) = √[SSE/SSE(I)] = √[(n − p)/(Cp (I) + n − 2k)] = √[(n − p)/((p − k)FI + n − p)].

Proof: Recall that H and H I are symmetric idempotent matrices and


that H H I = H I . The mean of OLS fitted values is equal to Y and the
mean of OLS residuals is equal to 0. If the OLS line from regressing z on w
is ẑ = a + bw, then a = z − bw and

    b = Σ(wi − w)(zi − z) / Σ(wi − w)2 = [SD(z)/SD(w)] corr(z, w).

Also recall that the OLS line passes through the means of the two variables
(w, z).
(*) Notice that the OLS slope from regressing z on w is equal to one if
and only if the OLS slope from regressing w on z is equal to [corr(z, w)]2 .
i) The slope b = 1 if Σ ŶI,i Yi = Σ ŶI,i2 . This equality holds since Ŷ TI Y =
Y T H I Y = Y T H I H I Y = Ŷ TI Ŷ I . Since b = 1, a = Y − Y = 0.

ii) By (*), the slope

    b = [corr(Y, ŶI )]2 = R2 (I) = Σ(ŶI,i − Y )2 / Σ(Yi − Y )2 = SSR(I)/SSTO.

The result follows since a = Y − bY .


iii) The slope b = 1 if Σ ŶI,i Ŷi = Σ ŶI,i2 . This equality holds since
Ŷ T Ŷ I = Y T HH I Y = Y T H I Y = Ŷ TI Ŷ I . Since b = 1, a = Y − Y = 0.

iv) From iii),

    1 = [SD(Ŷ )/SD(ŶI )] [corr(Ŷ , ŶI )].

Hence

    corr(Ŷ , ŶI ) = SD(ŶI )/SD(Ŷ )

and the slope

    b = [SD(ŶI )/SD(Ŷ )] corr(Ŷ , ŶI ) = [corr(Ŷ , ŶI )]2 .

Also the slope

    b = Σ(ŶI,i − Y )2 / Σ(Ŷi − Y )2 = SSR(I)/SSR.

The result follows since a = Y − bY .

v) The OLS line passes through the origin. Hence a = 0. The slope b =
rT rI /rT r. Since rT rI = Y T (I − H )(I − H I )Y and (I − H )(I − H I ) =
I − H, the numerator rT rI = rT r and b = 1.

vi) Again a = 0 since the OLS line passes through the origin. From v),

    1 = √[SSE(I)/SSE] [corr(r, rI )].

Hence

    corr(r, rI ) = √[SSE/SSE(I)]

and the slope

    b = √[SSE/SSE(I)] [corr(r, rI )] = [corr(r, rI )]2 .

Algebra shows that

    corr(r, rI ) = √[(n − p)/(Cp (I) + n − 2k)] = √[(n − p)/((p − k)FI + n − p)]. □

Remark 4.2. Let Imin be the model that minimizes Cp (I) among the
models I generated from the variable selection method such as forward se-
lection. Assuming that the full model Ip is one of the models generated, then
Cp (Imin ) ≤ Cp (Ip ) = p, and corr(r, rImin ) → 1 as n → ∞ by Theorem 4.2
vi). Referring to Equation (4.1), if P (S ⊆ Imin ) does not go to 1 as n → ∞,
then the above correlation would not go to one. Hence P (S ⊆ Imin ) → 1 as
n → ∞.

A standard model selection procedure will often be needed to suggest


models. For example, forward selection or backward elimination could be
used. If p < 30, Furnival and Wilson (1974) provide a technique for selecting
a few candidate subsets after examining all possible subsets.

Remark 4.3. Daniel and Wood (1980, p. 85) suggest using Mallows’
graphical method for screening subsets by plotting k versus Cp (I) for models
close to or under the Cp = k line. Theorem 4.2 vi) implies that if Cp(I) ≤ k
or FI < 1, then corr(r, rI ) and corr(ESP, ESP (I)) both go to 1.0 as n → ∞.
Hence models I that satisfy the Cp(I) ≤ k screen will contain the true model
S with high probability when n is large. This result does not guarantee that
the true model S will satisfy the screen, but overfit is likely. Let d be a lower
bound on corr(r, rI ). Theorem 4.2 vi) implies that if

    Cp (I) ≤ 2k + n[(1/d2 ) − 1] − p/d2 ,

then corr(r, rI ) ≥ d. The simple screen Cp (I) ≤ 2k corresponds to

    d ≡ dn = √(1 − p/n).
To avoid excluding too many good submodels, consider models I with
Cp(I) ≤ min(2k, p). Models under both the Cp = k line and the Cp = 2k line
are of interest.

Rule of thumb 4.1. a) After using a numerical method such as forward


selection or backward elimination, let Imin correspond to the submodel with
the smallest Cp . Find the submodel II with the fewest predictors
such that Cp (II ) ≤ Cp (Imin ) + 1. Then II is the initial submodel that should
be examined. It is possible that II = Imin or that II is the full model. Do
not use more predictors than model II to avoid overfitting.
b) Models I with fewer predictors than II such that Cp (I) ≤ Cp (Imin ) + 4
are interesting and should also be examined.
c) Models I with k predictors, including a constant and with fewer predic-
tors than II such that Cp (Imin ) + 4 < Cp (I) ≤ min(2k, p) should be checked
but often underfit: important predictors are deleted from the model. Underfit
is especially likely to occur if a predictor with one degree of freedom is deleted
(if the c − 1 indicator variables corresponding to a factor are deleted, then

the factor has c − 1 degrees of freedom) and the jump in Cp is large, greater
than 4, say.
d) If there are no models I with fewer predictors than II such that Cp (I) ≤
min(2k, p), then model II is a good candidate for the best subset found by
the numerical procedure.

Forward selection forms a sequence of submodels I1 , ..., Ip where Ij uses j


predictors including the constant. Let I1 use x∗1 = x1 ≡ 1: the model has a
constant but no nontrivial predictors. To form I2 , consider all models I with
two predictors including x∗1 . Compute SSE(I) = RSS(I) = r T (I)r(I) =
Σ_{i=1}^n ri2 (I) = Σ_{i=1}^n (Yi − Ŷi (I))2 . Let I2 minimize SSE(I) for the p − 1

models I that contain x1 and one other predictor. Denote the predictors in
I2 by x∗1 , x∗2 . In general, to form Ij consider all models I with j predictors
including variables x∗1 , ..., x∗j−1. Compute SSE(I) and let Ij minimize SSE(I)
for the p − j + 1 models I that contain x∗1 , ..., x∗j−1 and one other predictor
not already selected. Denote the predictors in Ij by x∗1 , ..., x∗j . Continue in
this manner for j = 2, ..., M = p.
Backward elimination also forms a sequence of submodels I1 , ..., Ip where
Ij uses j predictors including the constant. Let Ip be the full model. To
form Ip−1 consider all models I with p − 1 predictors including the constant.
Compute SSE(I), and let Ip−1 minimize SSE(I) for the p − 1 models I
that exclude one of the predictors x2 , ..., xp. Denote the predictors in Ip−1
by x∗1 , x∗2 , ..., x∗p−1. In general, to form Ij consider all models I with j predictors
chosen from the variables x∗1 , ..., x∗j+1. Compute SSE(I), and let Ij minimize
SSE(I) for the j models I that exclude one of the predictors
x∗2 , ..., x∗j+1. Denote the predictors in Ij by x∗1 , ..., x∗j . Continue in this manner
for j = p = M, p − 1, ..., 2, 1 where I1 uses x∗1 = x1 ≡ 1.
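Both searches can be run in base R. The sketch below (simulated data, arbitrary names) uses the built-in step function, which at each step orders candidate models of a given size by SSE and compares sizes with AIC; as noted in the next paragraph, for OLS several criteria give the same search path.

    # Forward selection and backward elimination with AIC via step().
    set.seed(5)
    dat <- data.frame(matrix(rnorm(100 * 5), 100, 5))
    names(dat) <- paste0("x", 1:5)
    dat$y <- 1 + 2 * dat$x1 + dat$x2 + rnorm(100)
    null <- lm(y ~ 1, data = dat)
    full <- lm(y ~ ., data = dat)
    fwd <- step(null, scope = list(lower = ~ 1, upper = formula(full)),
                direction = "forward", trace = 0)
    bwd <- step(full, direction = "backward", trace = 0)
    coef(fwd); coef(bwd)                           # selected predictors from each search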
Several criteria produce the same sequence of models if forward selection
or backward elimination is used, including M SE(I), Cp (I), R2A (I), AIC(I),
BIC(I), and EBIC(I). This result holds since if the number of predictors
k in the model I is fixed, each criterion is equivalent to minimizing SSE(I)
plus a constant. The constants differ, so the models Imin that minimize the
criteria often differ. Heuristically, backward elimination tries to delete the
variable that will increase Cp the least while forward selection tries to add
the variable that will decrease Cp the most.
When there is a sequence of M submodels, the final submodel Id needs to
be selected with ad terms, including a constant. Let the candidate model I
contain a terms, including a constant, and let xI and β̂ I be a × 1 vectors.
Then there are many criteria used to select the final submodel Id . For a given
data set, the quantities p, n, and σ̂ 2 act as constants, and a criterion below
may add a constant or be divided by a positive constant without changing
the subset Imin that minimizes the criterion.
Let criteria CS (I) have the form

CS (I) = SSE(I) + aKn σ̂ 2 .



These criteria need a good estimator of σ 2 and n/p large. See Shibata (1984).
The criterion Cp (I) = AICS (I) uses Kn = 2 while the BICS (I) criterion uses
Kn = log(n). See Jones (1946) and Mallows (1973) for Cp . It can be shown
that Cp (I) = AICS (I) is equivalent to the CP (I) criterion of Definition 4.2.
Typically σ̂ 2 is the OLS full model M SE when n/p is large.
The following criteria also need n/p large. AIC is due to Akaike (1973),
AICC is due to Hurvich and Tsai (1989), and BIC to Schwarz (1978) and
Akaike (1977, 1978). Also see Burnham and Anderson (2004).
 
    AIC(I) = n log(SSE(I)/n) + 2a,

    AICC (I) = n log(SSE(I)/n) + 2a(a + 1)/(n − a − 1),

    and BIC(I) = n log(SSE(I)/n) + a log(n).
Forward selection with Cp and AIC often gives useful results if n ≥ 5p
and if the final model has n ≥ 10ad . For p < n < 5p, forward selection with
Cp and AIC tends to pick the full model (which overfits since n < 5p) too
often, especially if σ̂ 2 = M SE. The Hurvich and Tsai (1989, 1991) AICC
criterion can be useful if n ≥ max(2p, 10ad).
The EBIC criterion given in Luo and Chen (2013) may be useful when
n/p is not large. Let 0 ≤ γ ≤ 1 and |I| = a ≤ min(n, p) if β̂ I is a × 1. We
may use a ≤ min(n/5, p). Then

    EBIC(I) = n log(SSE(I)/n) + a log(n) + 2γ log[C(p, a)] = BIC(I) + 2γ log[C(p, a)],

where C(p, a) = p!/[a!(p − a)!] is the binomial coefficient.

This criterion can give good results if p = pn = O(nk ) and γ > 1 − 1/(2k).
Hence we will use γ = 1. Then minimizing EBIC(I) is equivalent to mini-
mizing BIC(I) − 2 log[(p − a)!] − 2 log(a!) since log(p!) is a constant.
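A hedged R helper (not from the text) that evaluates these criteria for an OLS submodel I from its SSE; here a counts the terms in I including the constant, p counts all candidate predictors including the constant, and additive constants are ignored so only differences between submodels matter.

    # AIC(I), BIC(I), and EBIC(I) from a fitted OLS submodel.
    crit <- function(fit, p, gamma = 1) {
      n <- length(residuals(fit))
      a <- length(coef(fit))
      SSE <- deviance(fit)
      AIC  <- n * log(SSE / n) + 2 * a
      BIC  <- n * log(SSE / n) + a * log(n)
      EBIC <- BIC + 2 * gamma * lchoose(p, a)      # 2 gamma log C(p, a)
      c(AIC = AIC, BIC = BIC, EBIC = EBIC)
    }
    set.seed(6)
    x <- matrix(rnorm(100 * 5), 100, 5)
    y <- 1 + x[, 1] + rnorm(100)
    crit(lm(y ~ x[, 1]), p = 6)                    # submodel with a = 2
    crit(lm(y ~ x), p = 6)                         # full model with a = 6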
The above criteria can be applied to forward selection and relaxed lasso.
The Cp criterion can also be applied to lasso. See Efron and Hastie (2016,
pp. 221, 231).
Now suppose p = 6 and S in Equation (4.1) corresponds to x1 ≡ 1, x2 ,
and x3 . Suppose the data set is such that underfitting (omitting a predic-
tor in S) does not occur. Then there are eight possible submodels that
contain S: i) x1 , x2 , x3 ; ii) x1 , x2 , x3 , x4 ; iii) x1 , x2 , x3 , x5 ; iv) x1 , x2 , x3 , x6 ;
v) x1 , x2 , x3, x4 , x5 ; vi) x1 , x2 , x3 , x4, x6 ; vii) x1 , x2 , x3, x5 , x6 ; and the full
model viii) x1 , x2 , x3 , x4 , x5, x6 . The possible submodel sizes are k = 3, 4, 5,
or 6. Since the variable selection criteria for forward selection described above
minimize the MSE given that x∗1 , ..., x∗k−1 are in the model, the M SE(Ik ) are
too small and underestimate σ 2 . Also the model Imin fits the data a bit too
well. Suppose Imin = Id . Compared to selecting a model Ik before examining

the data, the residuals ri (Imin ) are too small in magnitude, the |ŶImin ,i − Yi |
are too small, and M SE(Imin ) is too small. Hence using Imin = Id as the full
model for inference does not work. In particular, the partial F test statistic
FR in Theorem 2.27, using Id as the full model, is too large since the M SE
is too small. Thus the partial F test rejects H0 too often. Similarly, the con-
fidence intervals for βi are too short, and hypothesis tests reject H0 : βi = 0
too often when H0 is true. The fact that the selected model Imin from vari-
able selection cannot be used as the full model for classical inference is known
as selection bias. Also see Hurvich and Tsai (1990).
This chapter offers two remedies: i) use the large sample theory of β̂ Imin ,0
(defined two paragraphs below) and the bootstrap for inference after variable
selection, and ii) use data splitting for inference after variable selection.

4.2 Large Sample Theory for Some Variable Selection Estimators

Large sample theory is often tractable if the optimization problem is convex.


The optimization problem for variable selection is not convex, so new tools
are needed. Tibshirani et al. (2018) and Leeb and Pötscher (2006, 2008) note
that we cannot find the limiting distribution of Z n = √n A(β̂ Imin − β I )
after variable selection. One reason is that with positive probability, β̂ Imin
does not have the same dimension as β I if AIC or Cp is used. Hence Z n is
not defined with positive probability.
The large sample theory for OLS variable selection estimators such as for-
ward selection and lasso variable selection in this section is due to Pelawa
Watagoda and Olive (2019, 2020). Rathnayake and Olive (2020) extend this
theory to many other variable selection estimators such as generalized lin-
ear models. Charkhi and Claeskens (2018) have a related result for forward
selection with AIC when the iid errors are N (0, σ 2 ). Assume p is fixed, and
n → ∞. Suppose that model (4.1) holds. Assume the maximum leverage

    max_{i=1,...,n} xTiIj (X TIj X Ij )−1 xiIj → 0

in probability as n → ∞ for each Ij with S ⊆ Ij where the dimension of Ij
is aj . For the OLS model with S ⊆ Ij , √n(β̂ Ij − β Ij ) →D Naj (0, V j ) where
V j = σ 2 W j and (X TIj X Ij )/n →P W −1 j by the LS CLT Theorem 2.26. Then

    ujn = √n(β̂ Ij ,0 − β) →D uj ∼ Np (0, V j,0 )     (4.3)

where V j,0 adds columns and rows of zeros corresponding to the xi not in
Ij , and V j,0 is singular unless Ij corresponds to the full model.

For MLR, V j,0 = σ 2 W j,0 . For example, if p = 3 and model Ij uses a
constant x1 ≡ 1 and x3 with

          ⎛ V11  V12 ⎞                      ⎛ V11  0  V12 ⎞
    V j = ⎝ V21  V22 ⎠ ,   then   V j,0 =   ⎜  0   0   0  ⎟ .
                                            ⎝ V21  0  V22 ⎠

Let Imin correspond to the set of predictors selected by a variable selection


method such as forward selection or lasso variable selection. Use zero padding
to form the p × 1 variable selection estimator β̂ V S . For example, if p = 4 and
β̂Imin = (β̂1 , β̂3 )T , then β̂ V S = β̂ Imin ,0 = (β̂1 , 0, β̂3 , 0)T . In the following
definition, if each subset contains at least one variable, then there are J =
2p − 1 subsets.

Definition 4.4. The variable selection estimator β̂ V S = β̂ Imin ,0 , and


β̂V S = β̂ Ik ,0 with probabilities πkn = P (Imin = Ik ) for k = 1, ..., J where
there are J subsets.

Definition 4.5. Let β̂ M IX be a random vector with a mixture distribu-


tion of the β̂ Ik ,0 with probabilities equal to πkn . Hence β̂ M IX = β̂ Ik ,0 with the
same probabilities πkn as the variable selection estimator β̂ V S , but the Ik are
randomly selected.

The large sample distribution of β̂ M IX is simpler than that of β̂ V S , and


is useful for explaining the large sample distribution of β̂ V S . For how to
bootstrap β̂ M IX , see Rathnayake and Olive (2020). For mixture distributions,
see Section 1.6.
The first assumption in Theorem 4.3 is P (S ⊆ Imin ) → 1 as n → ∞. Then
the variable selection estimator corresponding to Imin underfits with prob-
ability going to zero, and the assumption holds under regularity conditions
if BIC or AIC is used. See Charkhi and Claeskens (2018) and Claeskens and
Hjort (2008, pp. 70, 101, 102, 114, 232). For multiple linear regression with
Mallows (1973) Cp or AIC, see Li (1987), Nishii (1984), and Shao (1993).
For a shrinkage estimator that does variable selection, let β̂ Imin be the OLS
estimator applied to a constant and the variables with nonzero shrinkage es-
timator coefficients. If the shrinkage estimator is a consistent estimator of β,
then P (S ⊆ Imin ) → 1 as n → ∞. See Zhao and Yu (2006, p. 2554). Hence
Theorem 4.3c) proves that the lasso variable selection and elastic net variable
selection estimators are √n consistent estimators of β if lasso and elastic net
are consistent. Also see Theorem 4.4 and Remark 4.5. The assumption on
ujn in Theorem 4.3 is reasonable by (4.3) since S ⊆ Ij for each πj , and since
β̂M IX uses random selection.

Theorem 4.3. Assume P (S ⊆ Imin ) → 1 as n → ∞, and let β̂ M IX =


β̂Ik ,0 with probabilities πkn where πkn → πk as n → ∞. Denote the positive
πk by πj . Assume ujn = √n(β̂ Ij ,0 − β) →D uj ∼ Np (0, V j,0 ). a) Then

    un = √n(β̂ M IX − β) →D u     (4.4)

where the cdf of u is Fu (t) = Σ_j πj Fuj (t). Thus u has a mixture distribution
of the uj with probabilities πj , E(u) = 0, and Cov(u) = Σ u = Σ_j πj V j,0 .
b) Let A be a g × p full rank matrix with 1 ≤ g ≤ p. Then

    v n = Aun = √n(Aβ̂ M IX − Aβ) →D Au = v     (4.5)

where v has a mixture distribution of the v j = Auj ∼ Ng (0, AV j,0 AT ) with


probabilities πj .

c) The estimator β̂ V S is a n consistent estimator of β. Hence

n(β̂ V S − β) = OP (1).
√ D
d) If πd = 1, then n(β̂ SEL − β) → u ∼ Np (0, V d,0 ) where SEL is V S
or M IX.
Proof. a) Since un has a mixture distribution of the ukn with probabilities
πkn , the cdf of un is Fun (t) = Σ_k πkn Fukn (t) → Fu (t) = Σ_j πj Fuj (t) at
continuity points of the Fuj (t) as n → ∞.
b) Since un →D u, then Aun →D Au.
c) The result follows since selecting from a finite number J of √n consistent
estimators (even on a set that goes to one in probability) results in a √n
consistent estimator by Pratt (1959).
d) If πd = 1, there is no selection bias, asymptotically. The result also follows
by Pötscher (1991, Lemma 1). □
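A small R simulation (arbitrary settings, offered only as an illustration of Theorems 4.3 and 4.4) shows that the sampling distribution of √n(β̂ V S,2 − β2 ) after Cp selection is a mixture: an approximately normal part from samples where x2 is selected, plus a point mass at −√n β2 from samples where x2 is deleted and zero padding is used.

    # Mixture distribution of the variable selection estimator of a weak beta2.
    set.seed(7)
    n <- 100; B <- 2000; b2 <- 0.15                # beta = (1, 0.15, 0)^T
    wn2 <- replicate(B, {
      x2 <- rnorm(n); x3 <- rnorm(n)
      y <- 1 + b2 * x2 + rnorm(n)
      fits <- list(lm(y ~ 1), lm(y ~ x2), lm(y ~ x3), lm(y ~ x2 + x3))
      MSE <- deviance(fits[[4]]) / (n - 3)         # full model MSE
      Cp <- sapply(fits, function(f) deviance(f) / MSE + 2 * length(coef(f)) - n)
      best <- fits[[which.min(Cp)]]
      b2hat <- if ("x2" %in% names(coef(best))) coef(best)[["x2"]] else 0  # zero padding
      sqrt(n) * (b2hat - b2)
    })
    hist(wn2, breaks = 50)                         # normal-looking part plus a spike
    mean(abs(wn2 + sqrt(n) * b2) < 1e-12)          # proportion of samples deleting x2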

The following subscript notation is useful. Subscripts before the M IX


are used for subsets of β̂ M IX = (β̂1 , ..., β̂p)T . Let β̂ i,M IX = β̂i . Similarly, if
I = {i1 , ..., ia}, then β̂ I,M IX = (β̂i1 , ..., β̂ia )T . Subscripts after M IX denote
the ith vector from a sample β̂ M IX,1 , ..., β̂M IX,B . Similar notation is used for
other estimators such as β̂V S . The subscript 0 is still used for zero padding.
We may use F U LL to denote the full model β̂ = β̂ F U LL .
Typically the mixture distribution is not asymptotically normal unless a
πd = 1 (e.g. if S is the full model), or if for each πj , Auj ∼ Ng (0, AV j,0 AT ) =
Ng (0, AΣAT ). Then √n(Aβ̂ M IX − Aβ) →D Au ∼ Ng (0, AΣAT ). This spe-
cial case occurs for β̂ S,M IX if √n(β̂ − β) →D Np (0, V ) where the asymptotic
covariance matrix V is diagonal and nonsingular. Then β̂ S,M IX and β̂ S,F U LL
have the same multivariate normal limiting distribution. For several criteria,
this result should hold for β̂ V S since asymptotically, √n(Aβ̂ V S − Aβ) is
selecting from the Auj which have the same distribution. Then the confidence
regions applied to Aβ̂ ∗SEL = β̂ ∗S,SEL should have similar volume and
cutoffs where SEL is M IX, V S, or F U LL.

Theorem 4.3 can be used to justify prediction intervals after variable se-
lection. See Pelawa Watagoda and Olive (2020). Theorem 4.3d) is useful for
variable selection consistency and the oracle property where πd = πS = 1 if
P (Imin = S) → 1 as n → ∞. See Claeskens and Hjort (2008, pp. 101-114) and
Fan and Li (2001) for references. A necessary condition for P (Imin = S) → 1
is that S is one of the models considered with probability going to one.
This condition holds under strong regularity conditions for fast methods. See
Wieczorek and Lei (2021) for forward selection and Hastie et al. (2015, pp.
295-302) for lasso, where the predictors need a “near orthogonality” condi-
tion.
Remark 4.4. If A1 , A2 , ..., Ak are pairwise disjoint and if ∪ki=1 Ai = S,
then the collection of sets A1 , A2 , ..., Ak is a partition of S. Then the Law of
Total Probability states that if A1 , A2 , ..., Ak form a partition of S such that
P (Ai ) > 0 for i = 1, ..., k, then

    P (B) = Σ_{j=1}^k P (B ∩ Aj ) = Σ_{j=1}^k P (B|Aj )P (Aj ).

Let sets Ak+1 , ..., Am satisfy P (Ai ) = 0 for i = k + 1, ..., m. Define P (B|Aj ) =
0 if P (Aj ) = 0. Then a Generalized Law of Total Probability is

    P (B) = Σ_{j=1}^m P (B ∩ Aj ) = Σ_{j=1}^m P (B|Aj )P (Aj ),

and will be used in the following paragraph.

Pötscher (1991) used the conditional distribution of β̂ V S |(β̂ V S = β̂ Ik ,0 )
to find the distribution of w n = √n(β̂ V S − β). Let W = WV S = k if β̂ V S =
β̂ Ik ,0 where P (WV S = k) = πkn for k = 1, ..., J. Then (β̂ V S:n , WV S:n ) =
(β̂ V S , WV S ) has a joint distribution where the sample size n is usually sup-
pressed. Note that β̂ V S = β̂ IW ,0 . Define P (B|Ak )P (Ak ) = 0 if P (Ak ) = 0.
Let β̂ CIk ,0 be a random vector from the conditional distribution β̂ Ik ,0 |(WV S =
k). Let w kn = √n(β̂ Ik ,0 − β)|(WV S = k) ∼ √n(β̂ CIk ,0 − β). Denote
Fz (t) = P (z1 ≤ t1 , ..., zp ≤ tp ) by P (z ≤ t). Then

    Fw n (t) = P [n1/2 (β̂ V S − β) ≤ t]
    = Σ_{k=1}^J P [n1/2 (β̂ V S − β) ≤ t|(β̂ V S = β̂ Ik ,0 )]P (β̂ V S = β̂ Ik ,0 )
    = Σ_{k=1}^J P [n1/2 (β̂ Ik ,0 − β) ≤ t|(β̂ V S = β̂ Ik ,0 )]πkn
    = Σ_{k=1}^J P [n1/2 (β̂ CIk ,0 − β) ≤ t]πkn = Σ_{k=1}^J Fw kn (t)πkn .
Hence β̂ V S has a mixture distribution of the β̂ CIk ,0 with probabilities πkn ,
and w n has a mixture distribution of the w kn with probabilities πkn .
Charkhi and Claeskens (2018) showed that w jn = √n(β̂ CIj ,0 − β) →D w j if
S ⊆ Ij for the MLE with AIC. Here w j is a multivariate truncated normal
distribution (where no truncation is possible) that is symmetric about 0.
Hence E(w j ) = 0, and Cov(w j ) = Σ j exists. Referring to Definitions 4.4
and 4.5, note that both √n(β̂ M IX − β) and √n(β̂ V S − β) are selecting from
the ukn = √n(β̂ Ik ,0 − β) and asymptotically from the uj of Equation (4.3).
The random selection for β̂ M IX does not change the distribution of ujn , but
selection bias does change the distribution of the selected ujn to that of w jn .
Similarly, selection bias does change the distribution of the selected uj to
that of w j . The reasonable Theorem 4.4 assumption that w jn →D w j may
not be mild.

Theorem 4.4, Variable Selection CLT. Assume P (S ⊆ Imin ) → 1


as n → ∞, and let β̂ V S = β̂ Ik ,0 with probabilities πkn where πkn → πk as
n → ∞. Denote the positive πk by πj . Assume w jn = √n(β̂ CIj ,0 − β) →D w j .
Then

    w n = √n(β̂ V S − β) →D w     (4.6)

where the cdf of w is Fw (t) = Σ_j πj Fw j (t). Thus w is a mixture distribution
of the w j with probabilities πj .

Proof. Since w n has a mixture distribution of the w kn with probabilities
πkn , the cdf of w n is Fw n (t) = Σ_k πkn Fw kn (t) → Fw (t) = Σ_j πj Fw j (t) at
continuity points of the Fw j (t) as n → ∞. □

Remark 4.5. If P (S ⊆ Imin ) → 1 as n → ∞, then β̂ V S is a √n consistent
estimator of β since selecting from a finite number J of √n consistent estima-
tors (even on a set that goes to one in probability) results in a √n consistent
estimator by Pratt (1959). By both this result and Theorems 4.3 and 4.4, the
lasso variable selection and elastic net variable selection estimators are √n
consistent if lasso and elastic net are consistent.

Mixture distributions are useful for variable selection since β̂ Imin ,0 has a
mixture distribution of the β̂ Ij ,0 . Review mixture distributions from Section
1.6. The following theorem is due to Pelawa Watagoda and Olive (2019a).
Note that the cdf of Tn is FTn (z) = Σ_j πjn FTjn (z) where FTjn (z) is the cdf
of Tjn .

Theorem 4.5, Mixture Distribution CLT. Suppose the g × 1 statistic


Tn is equal to the estimator Tjn with probability πjn for j = 1, ..., J where
Σ_j πjn = 1, πjn → πj as n → ∞, and ujn = √n(Tjn − θ) →D uj with
E(uj ) = 0 and Cov(uj ) = Σ j . Then

    √n(Tn − θ) →D u     (4.7)

where the cdf of u is Fu (z) = Σ_j πj Fuj (z) and Fuj (z) is the cdf of uj .
Thus, u is a mixture distribution of the uj with probabilities πj , E(u) = 0,
and Cov(u) = Σ u = Σ_j πj Σ j .
Proof: Note that Tn has a mixture distribution of the Tjn with prob-
abilities πjn . Hence √n(Tn − θ) has a mixture distribution of the ujn =
√n(Tjn − θ), and the cdf of √n(Tn − θ) is Σ_j πjn Fujn (z) → Σ_j πj Fuj (z)
at continuity points z of the Fuj . □

Remark 4.6. Another variable selection model is xT β = xTSi β Si for i =


1, ..., K. Then submodel I underfits if no Si ⊆ I. A necessary condition for
an estimator to be consistent is P(no Si ⊆ Imin ) → 0 as n → ∞. Then in
Theorem 4.4, we can replace P (S ⊆ Imin ) → 1 by P(no Si ⊆ Imin ) → 0 as
n → ∞.

4.3 Prediction Intervals

Prediction intervals for regression and prediction regions for multivariate re-
gression are important topics. Inference after variable selection will consider
bootstrap hypothesis testing. Applying certain prediction intervals or pre-
diction regions to the bootstrap sample will result in confidence intervals or
confidence regions. The prediction intervals and regions are based on samples
of size n, while the bootstrap sample size is B = Bn . Hence this section and
the following section are important.

Definition 4.6. Consider predicting a future test value Yf given a p × 1


vector of predictors xf and training data (Y1 , x1 ), ..., (Yn, xn ). A large sam-
ple 100(1 − δ)% prediction interval (PI) for Yf has the form [L̂n , Ûn ] where
P (L̂n ≤ Yf ≤ Ûn ) is eventually bounded below by 1 − δ as the sample size
n → ∞. A large sample 100(1 − δ)% PI is asymptotically optimal if it has the
shortest asymptotic length: the length of [L̂n , Ûn ] converges to Us − Ls as
n → ∞ where [Ls , Us ] is the population shorth: the shortest interval covering
at least 100(1 − δ)% of the mass.

If Yf |xf has a pdf, we often want P (L̂n ≤ Yf ≤ Ûn ) → 1 − δ as n → ∞.


The interpretation of a 100 (1 − δ)% PI for a random variable Yf is similar
to that of a confidence interval (CI). Collect data, then form the PI, and
repeat for a total of k times where the k trials are independent from the
same population. If Yfi is the ith random variable and P Ii is the ith PI,

then the number of PIs j with Yfi ∈ P Ii approximately follows a
binomial(k, ρ = 1 − δ) distribution. Hence if 100 95% PIs are made, ρ = 0.95
and Yfi ∈ P Ii happens about 95 times.
There are two big differences between CIs and PIs. First, the length of the
CI goes to 0 as the sample size n goes to ∞ while the length of the PI con-
verges to some nonzero number J, say. Secondly, many confidence intervals
work well for large classes of distributions while many prediction intervals
assume that the distribution of the data is known up to some unknown pa-
rameters. Usually the N (µ, σ 2 ) distribution is assumed, and the parametric
PI may not perform well if the normality assumption is violated. This section
will describe three nonparametric PIs for the additive error regression model,
Y = m(x) + e, that work well for a large class of unknown zero mean error
distributions.
First we will consider the location model, Yi = µ + ei , where Y1 , ..., Yn, Yf
are iid and there are no vectors of predictors xi and xf . Let Z(1) ≤ Z(2) ≤
· · · ≤ Z(n) be the order statistics of n iid random variables Z1 , ..., Zn. Let a
future random variable Zf be such that Z1 , ..., Zn , Zf are iid. Let k1 = ⌈nδ/2⌉
and k2 = ⌈n(1 − δ/2)⌉ where ⌈x⌉ is the smallest integer ≥ x. For example,
⌈7.7⌉ = 8. Then a common nonparametric large sample 100(1−δ)% prediction
interval for Zf is
[Z(k1) , Z(k2 ) ] (4.8)
where 0 < δ < 1. See Frey (2013) for references.
The shorth(c) estimator of the population shorth is useful for making
asymptotically optimal prediction intervals. With the Zi and Z(i) as in the
above paragraph, let the shortest closed interval containing at least c of the
Zi be
shorth(c) = [Z(s) , Z(s+c−1) ]. (4.9)
Let
    kn = ⌈n(1 − δ)⌉.     (4.10)

Frey (2013) showed that for large nδ and iid data, the shorth(kn ) prediction
interval has maximum undercoverage ≈ 1.12√(δ/n), and used the shorth(c)
estimator as the large sample 100(1 − δ)% PI where

    c = min(n, ⌈n[1 − δ + 1.12√(δ/n) ]⌉).     (4.11)

An interesting fact is that the maximum undercoverage occurs for the family
of uniform U (θ1 , θ2 ) distributions where such a distribution has pdf f(y) =
1/(θ2 − θ1 ) for θ1 ≤ y ≤ θ2 where f(y) = 0, otherwise, and θ1 < θ2 .
A problem with the prediction intervals that cover ≈ 100(1 − δ)% of the
training data cases Yi (such as (4.8) using c = kn given by (4.9)), is that they
have coverage lower than the nominal coverage of 1 − δ for moderate n. This
result is not surprising since empirically statistical methods perform worse on
test data. For iid data, Frey (2013) used (4.10) to correct for undercoverage.

Example 4.1. Given below were votes for preseason 1A basketball poll
from Nov. 22, 2011 WSIL News where the 778 was a typo: the actual value
was 78. As shown below, finding shorth(3) from the ordered data is simple.
If the outlier was corrected, shorth(3) = [76,78].
111 89 778 78 76

order data: 76 78 89 111 778

13 = 89 - 76

33 = 111 - 78

689 = 778 - 89
shorth(3) = [76,89]
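A short R function (not from the text; names are arbitrary) implementing shorth(c) from (4.9); applied to the Example 4.1 data it reproduces the intervals above.

    # shorth(cc): shortest closed interval containing at least cc of the data.
    shorth <- function(z, cc) {
      z <- sort(z)
      widths <- z[cc:length(z)] - z[1:(length(z) - cc + 1)]
      s <- which.min(widths)
      c(z[s], z[s + cc - 1])
    }
    shorth(c(111, 89, 778, 78, 76), 3)             # [76, 89]
    shorth(c(111, 89, 78, 78, 76), 3)              # [76, 78] after correcting the typo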
[Figure 4.1: plot of the pdf f(z) = e−z for 0 ≤ z ≤ 5; the area under the pdf from 0 to 1 is 0.368.]
Fig. 4.1 The 36.8% Highest Density Region is [0,1]

Remark 4.7. The large sample 100(1 − δ)% shorth PI (4.10) may or
may not be asymptotically optimal if the 100(1 − δ)% population shorth is
[Ls , Us ] and F (x) is not strictly increasing in intervals (Ls − ε, Ls + ε) and
(Us − ε, Us + ε) for some ε > 0. To see the issue, suppose Y has probability
mass function (pmf) p(0) = 0.4, p(1) = 0.3, p(2) = 0.2, p(3) = 0.06, and
p(4) = 0.04. Then the 90% population shorth is [0,2] and the 100(1 − δ)%

population shorth is [0,3] for (1 − δ) ∈ (0.9, 0.96]. Let Wi = I(Yi ≤ x) = 1 if
Yi ≤ x and 0, otherwise. The empirical cdf

    F̂n (x) = (1/n) Σ_{i=1}^n I(Yi ≤ x) = (1/n) Σ_{i=1}^n I(Y(i) ≤ x)

is the sample proportion of Yi ≤ x. If Y1 , ..., Yn are iid, then for fixed x,


nF̂n (x) ∼ binomial(n, F (x)). Thus F̂n (x) ∼ AN (F (x), F (x)(1 − F (x))/n).
For the Y with the above pmf, F̂n (2) →P 0.9 as n → ∞ with P (F̂n (2) < 0.9) →
0.5 and P (F̂n (2) ≥ 0.9) → 0.5 as n → ∞. Hence the large sample 90% PI
(4.10) will be [0,2] or [0,3] with probabilities → 0.5 as n → ∞ with expected
asymptotic length of 2.5 and expected asymptotic coverage converging to
0.93. However, the large sample 100(1−δ)% PI (4.10) converges to [0,3] and is
asymptotically optimal with asymptotic coverage 0.96 for (1−δ) ∈ (0.9, 0.96).

For a random variable Y , the 100(1 −δ)% highest density region is a union
of k ≥ 1 disjoint intervals such that the mass within the intervals ≥ 1 − δ
and the sum of the k interval lengths is as small as possible. Suppose that
f(z) is a unimodal pdf that has interval support, and that the pdf f(z) of Y
decreases rapidly as z moves away from the mode. Let [a, b] be the shortest
interval such that FY (b) − FY (a) = 1 − δ where the cdf FY (z) = P (Y ≤ z).
Then the interval [a, b] is the 100(1 − δ)% highest density region. To find the
100(1 − δ)% highest density region of a pdf, move a horizontal line down
from the top of the pdf. The line will intersect the pdf or the boundaries of
the support of the pdf at [a1 , b1 ], ..., [ak, bk ] for some k ≥ 1. Stop moving the
line when the areas under the pdf corresponding to the intervals is equal to
1 − δ. As an example, let f(z) = e−z for z > 0. See Figure 4.1 where the
area under the pdf from 0 to 1 is 0.368. Hence [0,1] is the 36.8% highest
density region. The shorth PI estimates the highest density interval which is
the highest density region for a distribution with a unimodal pdf. Often the
highest density region is an interval [a, b] where f(a) = f(b), especially if the
support where f(z) > 0 is (−∞, ∞).

The additive error regression model is Y = m(x) + e where m(x) is a real


valued function and the ei are iid, often with zero mean and constant variance
V (e) = σ 2 . The large sample theory for prediction intervals is simple for this
model, and variable selection models for the multiple linear regression model
have this form with m(x) = xT β = xTI βI if S ⊆ I. Let the residuals ri = Yi −
m̂(xi ) = Yi −Ŷi for i = 1, ..., n. Assume m̂(x) is a consistent estimator of m(x)
such that the sample percentiles [L̂n (r), Ûn (r)] of the residuals are consistent
estimators of the population percentiles [L, U ] of the error distribution where
P (e ∈ [L, U ]) = 1 − δ. Let Ŷf = m̂(xf ). Then P (Yf ∈ [Ŷf + L̂n (r), Ŷf +
Ûn (r)] → P (Yf ∈ [m(xf )+L, m(xf )+U ]) = P (e ∈ [L, U ]) = 1−δ as n → ∞.
Three common choices are a) P (e ≤ U ) = 1 − δ/2 and P (e ≤ L) = δ/2, b)

P (e2 ≤ U 2 ) = P (|e| ≤ U ) = P (−U ≤ e ≤ U ) = 1 − δ with L = −U , and c)


the population shorth is the shortest interval (with length U − L) such that
P (e ∈ [L, U ]) = 1 − δ. The PI c) is asymptotically optimal while a) and b)
are asymptotically optimal on the class of symmetric zero mean unimodal
error distributions. The split conformal PI (4.16), described below, estimates
[−U, U ] in b).
Prediction intervals based on the shorth of the residuals need a correction
factor for good coverage since the residuals tend to underestimate the errors
in magnitude. With the exception of ridge regression, let d be the number
of “variables” used by the method. For MLR, forward selection, lasso, and
relaxed lasso use variables x∗1 , ..., x∗d while PCR and PLS use variables that
are linear combinations of the predictors Vj = γ Tj x for j = 1, ..., d. (We could
let d = j if j is the degrees of freedom of the selected model if that model
was chosen in advance without model or variable selection. Hence d = j is
not the model degrees of freedom if model selection was used.) See Chapter
5 for more about these estimators. See Hong et al. (2018) for why classical
prediction intervals after variable selection fail to work.
For n/p large and d = p, Olive (2013a) developed prediction intervals for
models of the form Yi = m(xi ) + ei , and variable selection models for MLR
have this form, as noted by Olive (2018). Pelawa Watagoda and Olive (2019b)
gave two prediction intervals that can be useful even if n/p is not large. These
PIs will be defined below. The first PI modifies the Olive (2013a) PI that can
only be computed if n > p. Olive (2007, 2017a, 2017b, 2018) used similar
correction factors for several prediction intervals and prediction regions with
d = p. We want n ≥ 10d so that the model does not overfit.
If the OLS model I has d predictors, and S ⊆ I, then

    E(M SE(I)) = E( Σ_{i=1}^n ri2 /(n − d) ) = σ 2 = E( Σ_{i=1}^n e2i /n )

and M SE(I) is a √n consistent estimator of σ 2 for many error distributions
by Su and Cook (2012). Also see Freedman (1981). For a wide range of regres-
sion models, extrapolation occurs if the leverage hf = xTI,f (X TI X I )−1 xI,f >
2d/n: if xI,f is too far from the data xI,1 , ..., xI,n , then the model may not
hold and prediction can be arbitrarily bad. These results suggests that
    √[n/(n − d)] √(1 + hf ) ri ≈ √[(n + 2d)/(n − d)] ri ≈ ei .

In simulations for prediction intervals and prediction regions with n = 20d,


the maximum simulated undercoverage was near 5% if qn in (4.11) is changed
to qn = 1 − δ.
Next we give the correction factor and the first prediction interval. Let
qn = min(1 − δ + 0.05, 1 − δ + d/n) for δ > 0.1 and

qn = min(1 − δ/2, 1 − δ + 10δd/n), otherwise. (4.12)

If 1 − δ < 0.999 and qn < 1 − δ + 0.001, set qn = 1 − δ. Let

    c = ⌈nqn ⌉,     (4.13)

and let

    bn = (1 + 15/n) √[(n + 2d)/(n − d)]     (4.14)

if d ≤ 8n/9, and

    bn = 5 (1 + 15/n),
otherwise. As d gets close to n, the model overfits and the coverage will
be less than the nominal. The piecewise formula for bn allows the prediction
interval to be computed even if d ≥ n. Compute the shorth(c) of the residuals
= [r(s), r(s+c−1)] = [ξ̃δ1 , ξ̃1−δ2 ]. Then the first 100 (1 − δ)% large sample PI
for Yf is
    [m̂(xf ) + bn ξ̃δ1 , m̂(xf ) + bn ξ̃1−δ2 ].     (4.15)
The second PI randomly divides the data into two half sets H and V
where H has nH = ⌈n/2⌉ of the cases and V has the remaining nV = n − nH
cases i1 , ..., inV . The estimator m̂H (x) is computed using the training data
set H. Then the validation residuals vj = Yij − m̂H (xij ) are computed for the
j = 1, ..., nV cases in the validation set V . Find the Frey PI [v(s) , v(s+c−1)]
of the validation residuals (replacing n in (4.10) by nV = n − nH ). Then the
second new 100(1 − δ)% large sample PI for Yf is

[m̂H (xf ) + v(s) , m̂H (xf ) + v(s+c−1) ]. (4.16)
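The first PI is straightforward to code. The following hedged R sketch (for an OLS fit with an intercept; the function and variable names are arbitrary) combines the shorth of the residuals with qn, c, and bn as defined above:

    # Shorth prediction interval (first PI) for Y_f from an OLS fit.
    shorthPI <- function(fit, xf, delta = 0.05) {
      r <- residuals(fit); n <- length(r); d <- length(coef(fit))
      qn <- if (delta > 0.1) min(1 - delta + 0.05, 1 - delta + d / n) else
            min(1 - delta / 2, 1 - delta + 10 * delta * d / n)
      if (1 - delta < 0.999 && qn < 1 - delta + 0.001) qn <- 1 - delta
      cc <- ceiling(n * qn)
      bn <- if (d <= 8 * n / 9) (1 + 15 / n) * sqrt((n + 2 * d) / (n - d)) else
            5 * (1 + 15 / n)
      r <- sort(r)
      w <- r[cc:n] - r[1:(n - cc + 1)]             # candidate shorth(cc) widths
      s <- which.min(w)
      yf <- sum(coef(fit) * c(1, xf))              # mhat(xf) for an OLS fit
      c(yf + bn * r[s], yf + bn * r[s + cc - 1])
    }
    set.seed(8)
    x <- matrix(rnorm(200 * 3), 200, 3)
    y <- 1 + x[, 1] + rnorm(200)
    shorthPI(lm(y ~ x), xf = c(0.5, -1, 2))        # 95% PI for Yf at xf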

Remark 4.8. Note that correction factors bn → 1 are used in large sample
confidence intervals and tests if the limiting distribution is N(0,1) or χ2p , but
a tdn or pFp,dn cutoff is used: tdn ,1−δ /z1−δ → 1 and pFp,dn ,1−δ /χ2p,1−δ → 1 if
dn → ∞ as n → ∞. Using correction factors for large sample confidence inter-
vals, tests, prediction intervals, prediction regions, and bootstrap confidence
regions improves the performance for moderate sample size n.

Remark 4.9. For a good fitting model, residuals ri tend to be smaller in


magnitude than the errors ei , while validation residuals vi tend to be larger
in magnitude than the ei . Thus the Frey correction factor can be used for PI
(4.15) while PI (4.14) needs a stronger correction factor.

We can also motivate PI (4.15) by modifying the justification for the Lei
et al. (2018) split conformal prediction interval

[m̂H (xf ) − aq , m̂H (xf ) + aq ] (4.17)



where aq is the 100(1 − α)th quantile of the absolute validation residuals.


PI (4.15) is a modification of the split conformal PI that is asymptotically
optimal. Suppose (Yi , xi ) are iid for i = 1, ..., n, n + 1 where (Yf , xf ) =
(Yn+1 , xn+1 ). Compute m̂H (x) from the cases in H. For example, get β̂ H
from the cases in H. Consider the validation residuals vi for i = 1, ..., nV and
the validation residual vnV +1 for case (Yf , xf ). Since these nV + 1 cases are
iid, the probability that vt has rank j for j = 1, ..., nV + 1 is 1/(nV + 1) for
each t, i.e., the ranks follow the discrete uniform distribution. Let t = nV + 1
and let the v(j) be the ordered residuals using j = 1, ..., nV . That is, get the
order statistics without using the unknown validation residual vnV +1 . Then
v(i) has rank i if v(i) < vnV +1 but rank i + 1 if v(i) > vnV +1 . Thus

P (Yf ∈ [m̂H (xf )+v(k) , m̂H (xf )+v(k+b−1) ]) = P (v(k) ≤ vnV +1 ≤ v(k+b−1)) ≥

P (vnV +1 has rank between k + 1 and k + b − 1 and there are no tied ranks)
≥ (b − 1)/(nV + 1) ≈ 1 − δ if b = ⌈(nV + 1)(1 − δ)⌉ + 1 and k + b − 1 ≤ nV .
This probability statement holds for a fixed k such as k = ⌈nV δ/2⌉. The
statement is not true when the shorth(b) estimator is used since the shortest
interval using k = s can have s change with the data set. That is, s is not
fixed. Hence if PI’s were made from J independent data sets, the PI’s with
fixed k would contain Yf about J(1−δ) times, but this value would be smaller
for the shorth(b) prediction intervals where s can change with the data set.
The above argument works if the estimator m̂(x) is “symmetric in the data,”
which is satisfied for multiple linear regression estimators.
The PIs (4.14) to (4.16) can be used with m̂(x) = Ŷf = xTId β̂Id where Id
denotes the index of predictors selected from the model or variable selection
method. If β̂ is a consistent estimator of β, the PIs (4.14) and (4.15) are
asymptotically optimal for a large class of error distributions while the split
conformal PI (4.16) needs the error distribution to be unimodal and symmet-
ric for asymptotic optimality. Since m̂H uses n/2 cases, m̂H has about half
the efficiency of m̂. When p ≥ n, the regularity conditions for consistent esti-
mators are strong. For example, EBIC and lasso can have P (S ⊆ Imin ) → 1
as n → ∞. Then forward selection with EBIC and relaxed lasso can produce
consistent estimators. PLS can be √n consistent. See Chapter 5 for the large
sample theory for many MLR estimators.
None of the three prediction intervals (4.14), (4.15), and (4.16) dominates
the other two. Recall that βS is an aS × 1 vector in (4.1). If a good fit-
ting method, such as lasso or forward selection with EBIC, is used, and
1.5aS ≤ n ≤ 5aS , then PI (4.14) can be much shorter than PIs (4.15) and
(4.16). For n/d large, PIs (4.14) and (4.15) can be shorter than PI (4.16) if
the error distribution is not unimodal and symmetric; however, PI (4.16) is
often shorter if n/d is not large since the sample shorth converges to the pop-
ulation shorth rather slowly. Grübel (1982) shows that for iid data, the length
and center of the shorth(kn ) interval are √n consistent and n1/3 consistent es-
timators of the length and center of the population shorth interval. For a

unimodal and symmetric error distribution, the three PIs are asymptotically
equivalent, but PI (4.16) can be the shortest PI due to different correction
factors.
If the estimator is poor, the split conformal PI (4.16) and PI (4.15) can
have coverage closer to the nominal coverage than PI (4.14). For example, if
m̂ interpolates the data and m̂H interpolates the training data from H, then
the validation residuals will be huge. Hence PI (4.15) will be long compared
to PI (4.16).
Asymptotically optimal PIs estimate the population shorth of the zero
mean error distribution. Hence PIs that use the shorth of the residuals, such
as PIs (4.14) and (4.15), are the only easily computed asymptotically optimal
PIs for a wide range of consistent estimators β̂ of β for the multiple linear
regression model. If the error distribution is e ∼ EXP (1) − 1, then the
asymptotic length of the 95% PI (4.14) or (4.15) is 2.996 while that of the
split conformal PI is 2(1.996) = 3.992. For more about these PIs applied to
MLR models, see Section 5.10 and Pelawa Watagoda and Olive (2019b).

4.4 Prediction Regions

Consider predicting a p × 1 future test value xf , given past training data


x1 , ..., xn where x1 , ..., xn , xf are iid. Much as confidence regions and inter-
vals give a measure of precision for the point estimator θ̂ of the parameter
θ, prediction regions and intervals give a measure of precision of the point
estimator T = x̂f of the future random vector xf .

Definition 4.7. A large sample 100(1 − δ)% prediction region is a set


An such that P (xf ∈ An ) is eventually bounded below by 1 − δ as n →
∞. A prediction region is asymptotically optimal if its volume converges in
probability to the volume of the minimum volume covering region or the
highest density region of the distribution of xf .

If xf has a pdf, we often want P (xf ∈ An ) → 1 − δ as n → ∞. A PI


is a prediction region where p = 1. Highest density regions are usually hard
to estimate for p not much larger than four, but many elliptically contoured
distributions with a nonsingular population covariance matrix, including the
multivariate normal distribution, have highest density regions that can be
estimated by the nonparametric prediction region (4.24). For more about
highest density regions, see Olive (2017b, pp. 148-155) and Hyndman (1996).
For multivariate data, sample Mahalanobis distances play a role similar to
that of residuals in multiple linear regression. Let the observed training data
be collected in an n × p matrix W . Let the p × 1 column vector T = T (W )
be a multivariate location estimator, and let the p × p symmetric positive
definite matrix C = C(W ) be a dispersion estimator.

Definition 4.8. Let x1j , ..., xnj be measurements on the jth random
variable Xj corresponding to the jth column of the data matrix W . The
jth sample mean is xj = (1/n) Σ_{k=1}^n xkj . The sample covariance Sij estimates
Cov(Xi , Xj ) = σij = E[(Xi − E(Xi ))(Xj − E(Xj ))], and

    Sij = [1/(n − 1)] Σ_{k=1}^n (xki − xi )(xkj − xj ).

Sii = Si2 is the sample variance that estimates the population variance
σii = σi2 . The sample correlation rij estimates the population correlation
Cor(Xi , Xj ) = ρij = σij /(σi σj ), and

    rij = Sij /(Si Sj ) = Sij /√(Sii Sjj ) = Σ_{k=1}^n (xki − xi )(xkj − xj ) / √[Σ_{k=1}^n (xki − xi )2 Σ_{k=1}^n (xkj − xj )2 ].

Definition 4.9. Let x1 , ..., xn be the data where xi is a p × 1 vector. The
sample mean or sample mean vector

    x = (1/n) Σ_{i=1}^n xi = (x1 , ..., xp )T = (1/n) W T 1

where 1 is the n × 1 vector of ones. The sample covariance matrix

    S = [1/(n − 1)] Σ_{i=1}^n (xi − x)(xi − x)T = (Sij ).

That is, the ij entry of S is the sample covariance Sij . The classical estima-
tor of multivariate location and dispersion is (T, C) = (x, S). The sample
correlation matrix R = (rij ). That is, the ij entry of R is the sample
correlation rij .
It can be shown that (n − 1)S = Σ_{i=1}^n xi xTi − n x xT =

    W T W − (1/n) W T 11T W .

Hence if the centering matrix G = I − (1/n) 11T , then (n − 1)S = W T GW .
n
See Definition 1.24 for the population mean and population covariance
matrix. Definition 2.18 also defined a sample covariance matrix. The Ma-

halanobis distance in Definition 4.9 is a random variable that estimates the


population Mahalanobis distance of Definition 1.38.
Definition 4.9. The ith Mahalanobis distance Di = √(Di2 ) where the ith
squared Mahalanobis distance is

    Di2 = Di2 (T (W ), C(W )) = (xi − T (W ))T C −1 (W )(xi − T (W ))     (4.18)

for each point xi . Notice that Di2 is a random variable (scalar valued). Let
(T, C) = (T (W ), C(W )). Then

    Dx2 (T, C) = (x − T )T C −1 (x − T ).

Hence Di2 uses x = xi .

Let the p × 1 location vector be µ, often the population mean, and let
the p × p dispersion matrix be Σ, often the population covariance matrix.
Notice that if x is a random vector, then the population squared Mahalanobis
distance from Definition 1.38 is

    Dx2 (µ, Σ) = (x − µ)T Σ −1 (x − µ)     (4.19)

and that the term Σ −1/2 (x − µ) is the p−dimensional analog to the z-score
used to transform a univariate N (µ, σ 2 ) random variable into a N (0, 1) ran-
dom variable. Hence the sample Mahalanobis distance Di = √(Di2 ) is an ana-
log of the absolute value |Zi | of the sample Z-score Zi = (Xi − X)/σ̂. Also
notice that the Euclidean distance of xi from the estimate of center T (W )
is Di (T (W ), I p ) where I p is the p × p identity matrix.

Consider the hyperellipsoid

    An = {x : Dx2 (x, S) ≤ D2(c) } = {x : Dx (x, S) ≤ D(c) }.     (4.20)

If n is large, we can use c = kn = dn(1 − δ)e. If n is not large, using c = Un


where Un decreases to kn , can improve small sample performance. Un will be
defined in the paragraph below Equation (4.23). Olive (2013a) showed that
(4.19) is a large sample 100(1 − δ)% prediction region under mild conditions,
although regions with smaller volumes may exist. Note that the result follows
since if Σ x and S are nonsingular, then the Mahalanobis distance is a con-
D
tinuous function of (x, S). Let µ = E(x) and D = D(µ, Σ x ). Then Di → D
D
and Di2 → D2 . Hence the sample percentiles of the Di are consistent estima-
tors of the population percentiles of D at continuity points of the cumulative
distribution function of D.
A problem with the prediction regions that cover ≈ 100(1 − δ)% of the
training data cases xi (such as (4.19) for c = kn ), is that they have coverage
lower than the nominal coverage of 1 − δ for moderate n. This result is not
surprising since empirically statistical methods perform worse on test data.
166 4 Prediction and Variable Selection When n >> p

Increasing c will improve the coverage for moderate samples. Also see Remark
4.8. Empirically for many distributions, for n ≈ 20p, the prediction region
(4.19) applied to iid data using c = kn = dn(1 − δ)e tended to have under-
coverage as high as 5%. The undercoverage decreases rapidly as n increases.
Let qn = min(1 − δ + 0.05, 1 − δ + p/n) for δ > 0.1 and

qn = min(1 − δ/2, 1 − δ + 10δp/n), otherwise. (4.21)

If 1 − δ < 0.999 and qn < 1 − δ + 0.001, set qn = 1 − δ. Using

c = dnqn e (4.22)

in (4.19) decreased the undercoverage. Note that Equations (4.11) and (4.12)
are similar to Equations
√ (4.20) and (4.21), but replace p by d.
If (T, C) is a n consistent estimator of (µ, d Σ) for some constant d > 0
where Σ is nonsingular, then D2 (T, C) = (x − T )T C −1 (x − T ) =

(x − µ + µ − T )T [C −1 − d−1 Σ −1 + d−1 Σ −1 ](x − µ + µ − T )

= d−1 D2 (µ, Σ) + op (1).


Thus the sample percentiles of Di2 (T, C) are consistent estimators of the per-
centiles of d−1 D2 (µ, Σ) (at continuity points D1−δ of the cdf of D2 (µ, Σ)).
2
If x ∼ Nm (µ, Σ), then Dx (µ, Σ) = D2 (µ, Σ) ∼ χ2m .
Suppose (T, C) = (xM , b S M ) is the sample mean and scaled sample
covariance matrix applied to some subset of the data. The classical estimator
and RMVN estimator from Section 7.1 satisfy this assumption. For h > 0,
the hyperellipsoid

{z : (z − T )T C −1 (z − T ) ≤ h2 } = {z : Dz
2
≤ h2 } = {z : Dz ≤ h} (4.23)

has volume equal to

2π p/2 p p 2π p/2 p p/2 p


h det(C) = h b det(S M ). (4.24)
pΓ (p/2) pΓ (p/2)

A future observation (random vector) xf is in the region (4.22) if Dxf ≤ h.


If (T, C) is a consistent estimator of (µ, dΣ) for some constant d > 0 where
Σ is nonsingular, then (4.22) is a large sample 100(1 − δ)% prediction region
if h = D(Un ) where D(Un ) is the 100qnth sample quantile of the Di where qn
is defined above (4.21). If x1 , ..., xn and xf are iid, then prediction region
(4.24) is asymptotically optimal for a large class of elliptically contoured
distributions since the volume of (4.24) converges in probability to the volume
of the highest density region. (These distributions have a highest density
region which is a hyperellipsoid determined by a population Mahalanobis
distance. See Section 1.7.)
4.4 Prediction Regions 167

The Olive (2013a) nonparametric prediction region uses (T, C) = (x, S).
For the classical prediction region, see Chew (1966) and Johnson and Wichern
(1988, pp. 134, 151). Refer to the above paragraph for D(Un ) .

Definition 4.10. The large sample 100(1 − δ)% nonparametric prediction


region for a future value xf given iid data x1 , ..., xn is
2 2
{z : Dz (x, S) ≤ D(U n)
}, (4.25)

while the large sample 100(1 − δ)% classical prediction region is


2
{z : Dz (x, S) ≤ χ2p,1−δ }. (4.26)

If p is small, Mahalanobis distances tend to be right skewed with a pop-


ulation shorth that discards the right tail. For p = 1 and n ≥ 20, the finite
sample correction factors c/n for c given by (4.10) and (4.21) do not differ
by much more than 3% for 0.01 ≤ δ ≤ 0.5. See Figure 4.2 where ol = (Eq.
4.21)/n is plotted versus fr = (Eq. 4.10)/n for n = 20, 21, ..., 500. The top
plot is for δ = 0.01, while the bottom plot is for δ = 0.3. The identity line is
added to each plot as a visual aid. The value of n increases from 20 to 500
from the right of the plot to the left of the plot. Examining the axes of each
plot shows that the correction factors do not differ greatly. R code to create
Figure 4.2 is shown below.
cmar <- par("mar"); par(mfrow = c(2, 1))
par(mar=c(4.0,4.0,2.0,0.5))
frey(0.01); frey(0.3)
par(mfrow = c(1, 1)); par(mar=cmar)
Remark 4.10. The nonparametric prediction region (4.24) is useful if
x1 , ..., xn , xf are iid from a distribution with a nonsingular covariance matrix,
and the sample size n is large enough. The distribution could be continuous,
discrete, or a mixture. The asymptotic coverage is 1 − δ if D has a pdf,
although prediction regions with smaller volume may exist. If the 100(1−δ)th
percentile D1−δ of D is not a continuity point of the distribution of D, then
the asymptotic coverage tends to be ≥ 1 − δ since a sample percentile with
cutoff qn that decreases to 1−δ is used and a closed region is used. Often D has
a continuous distribution and hence has no discontinuity points for 0 < δ < 1.
(If there is a jump in the distribution from 0.9 to 0.96 at discontinuity point a,
and the nominal coverage is 0.95, we want 0.96 coverage instead of 0.9. So we
want the sample percentile to decrease to a.) The nonparametric prediction
region (4.24) contains Un of the training data cases xi provided that S is
nonsingular, even if the model is wrong. For many distributions, the coverage
started to be close to 1 − δ for n ≥ 10p where the coverage is the simulated
percentage of times that the prediction region contained xf .
168 4 Prediction and Variable Selection When n >> p

0.999
fr

0.997
0.995

0.991 0.992 0.993 0.994 0.995

ol
0.82
0.78
fr

0.74

0.72 0.74 0.76 0.78 0.80 0.82 0.84

ol

Fig. 4.2 Correction Factor Comparison when δ = 0.01 (Top Plot) and δ = 0.3
(Bottom Plot)

Remark 4.11. The most used prediction regions assume that the error
vectors are iid from a multivariate normal distribution. Using (4.23), the ratio
of the volumes of regions (4.25) and (4.24) is
!p/2
χ2p,1−δ
2 ,
D(U n)

which can become close to zero rapidly as p gets large if the xi are not
from the light tailed multivariate normal distribution. For example, suppose
χ24,0.5 ≈ 3.33 and D(U2
n)
≈ Dx 2 2
,0.5 = 6. Then the ratio is (3.33/6) ≈ 0.308.
Hence if the data is not multivariate normal, severe undercoverage can occur
if the classical prediction region is used, and the undercoverage tends to get
worse as the dimension p increases. The coverage need not to go to 0, since by
2
the multivariate Chebyshev’s inequality, P (Dx (µ, Σ x ) ≤ γ) ≥ 1 − p/γ > 0
4.5 Bootstrapping Hypothesis Tests and Confidence Regions 169

for γ > p where the population covariance matrix Σ x = Cov(x). See Budny
(2014), Chen (2011), and Navarro (2014, 2016). Using γ = h2 = p/δ in (4.22)
usually results in prediction regions with volume and coverage that is too
large.

Remark 4.12. The nonparametric prediction region (4.24) starts to have


good coverage for n ≥ 10p for a large class of distributions. Olive (2013a)
suggests n ≥ 50p may be needed for the prediction region to have a good
volume. Of course for any n there are error distributions that will have severe
undercoverage. Statisticians often say that correction factors are ad hoc, but
doing nothing is much more ad hoc than using correction factors.

For the multivariate lognormal distribution with n = 20p, the large sample
nonparametric 95% prediction region (4.24) had coverages 0.970, 0.959, and
0.964 for p = 100, 200, and 500. Some R code is below.
nruns=1000 #lognormal, p = 100, n = 20p = 2000
count<-0
for(i in 1:nruns){
x <- exp(matrix(rnorm(200000),ncol=100,nrow=2000))
xff <- exp(as.vector(rnorm(100)))
count <- count + predrgn(x,xf=xff)$inr}
count #970/1000, may take a few minutes
Notice that for the training data x1 , ..., xn , if C −1 exists, then c ≈ 100qn %
of the n cases are in the prediction regions for xf = xi , and qn → 1 −δ even if
(T, C) is not a good estimator. Hence the coverage qn of the training data is
robust to model assumptions. Of course the volume of the prediction region
could be large if a poor estimator (T, C) is used or if the xi do not come
from an elliptically contoured distribution. Also notice that qn = 1 − δ/2 or
qn = 1 − δ + 0.05 for n ≤ 20p and qn → 1 − δ as n → ∞. If qn ≡ 1 − δ and
(T, C) is a consistent estimator of (µ, dΣ) where d > 0 and Σ is nonsingular,
then (4.22) with h = D(Un ) is a large sample prediction region, but taking
qn given by (4.20) improves the finite sample performance of the prediction
region. Taking qn ≡ 1 − δ does not take into account variability of (T, C),
and for n = 20p the resulting prediction region tended to have undercoverage
as high as min(0.05, δ/2). Using (4.20) helped reduce undercoverage for small
n ≥ 20p due to the unknown variability of (T, C).

4.5 Bootstrapping Hypothesis Tests and Confidence


Regions

This section shows that, under regularity conditions, applying the nonpara-
metric prediction region of Section 4.4 to a bootstrap sample results in a
170 4 Prediction and Variable Selection When n >> p

confidence region. The volume of a confidence region → 0 as n → 0, while


the volume of a prediction region goes to that of a population region that
would contain a new xf with probability 1 − δ. The nominal coverage is
100(1 − δ). If the actual coverage 100(1 − δn ) > 100(1 − δ), then the region is
conservative. If 100(1 − δn ) < 100(1 − δ), then the region is liberal. A region
that is 5% conservative is considered “much better” than a region that is 5%
liberal.
When teaching confidence intervals, it is often noted that by the central
limit theorem, the √ probability that Y n is within two standard deviations
(2SD(Y n ) = 2σ/ n) of θ = µ is about 95%. Hence the probability that θ is
within √ two standard √deviations of Y n is about 95%. Thus the interval [θ −
1.96S/ n, θ+1.96S/ n] is a large sample 95% prediction interval for √ a future
value of√ the sample mean Y n,f if θ is known, while [Y n − 1.96S/ n, Y n +
1.96S/ n] is a large sample 95% confidence interval for the population mean
θ. Note that the lengths of the two intervals are the same. Where the interval
is centered, at the parameter θ or the statistic Y n , determines whether the
interval is a prediction or a confidence interval. See Theorem 4.7 for a similar
relationship between confidence regions and prediction regions.

Definition 4.11. A large sample 100(1−δ)% confidence region for a vector


of parameters θ is a set An such that P (θ ∈ An ) is eventually bounded below
by 1 − δ as n → ∞.

If An is based on a squared Mahalanobis distance D2 with a limiting


distribution that has a pdf, we often want P (θ ∈ An ) → 1 − δ as n → ∞.

There are several methods for obtaining a bootstrap sample T1∗ , ...., TB∗
where the sample size n is suppressed: Ti∗ = Tin

. The parametric bootstrap,
nonparametric bootstrap, and residual bootstrap will be used. Applying pre-
diction region (4.24) to the bootstrap sample will result in a confidence region
for θ. When g = 1, applying the shorth PI (4.10) or PI (4.7) to the bootstrap
sample results in a confidence interval for θ. Section 4.5.2 will help clarify
ideas.

When g = 1, a confidence interval is a special case of a confidence region.


One sided confidence intervals give a lower or upper confidence bound for θ.
A large sample 100(1 − δ)% lower confidence interval (−∞, Un ] uses an upper
confidence bound Un and is in the lower tail of the distribution of θ̂. A large
sample 100(1 −δ)% upper confidence interval [Ln , ∞) uses a lower confidence
bound Ln and is in the upper tail of the distribution of θ̂. These CIs can be
useful if θ ∈ [a, b] and θ = a or θ = b is of interest for a hypothesis test. For
example, [a, b] = [0, 1] if θ = ρ2 , the squared population correlation. Then use
[0, Un] and [Ln , 1] as CIs, e.g. if we expect θ = 0 we might test H0 : θ ≤ 0.05
versus H0 : θ > 0.05, and fail to reject H0 if Un < 0.05. See Section 4.5.4 for
an illustration. Again we often want the probability to converge to 1 − δ if
4.5 Bootstrapping Hypothesis Tests and Confidence Regions 171

the confidence interval is based on a statistic with an asymptotic distribution


that has a pdf.

Definition 4.12. The interval [Ln , Un ] is a large sample 100(1 − δ)%


confidence interval for θ if P (Ln ≤ θ ≤ Un ) is eventually bounded below by
1 − δ as n → ∞. The interval (−∞, Un ] is a large sample 100(1 − δ)% lower
confidence interval for θ if P (θ ≤ Un ) is eventually bounded below by 1 − δ
as n → ∞. The interval [Ln , ∞) is large sample 100(1 − δ)% upper confidence
interval for θ if P (θ ≥ Ln ) is eventually bounded below by 1 − δ as n → ∞.

Next we discuss bootstrap confidence intervals that are obtained by ap-


plying prediction intervals (4.7) and (4.10) to the bootstrap sample. Some
additional bootstrap CIs are obtained from bootstrap confidence regions
from Section 4.5.2 when g = 1. See Efron (1982) and Chen (2016) for the
percentile Pmethod CI. Let Tn be an estimator of a parameter θ such as
n
Tn = Z = i=1 Zi /n with θ = E(Z1 ). Let T1∗ , ..., TB∗ be a bootstrap sample
∗ ∗
for Tn . Let T(1) , ..., T(B) be the order statistics of the the bootstrap sample.
The CI (4.26) is obtained by applying PI (4.7) to the bootstrap sample with
B used instead of n. Hence (4.26) is also a large sample prediction interval for
a future value of Tf∗ if the Ti∗ are iid from the empirical distribution discussed
in Section 4.5.1.

Definition 4.13. The bootstrap percentile method large sample 100(1 −


∗ ∗
δ)% confidence interval for θ is an interval [T(k L)
, T(K U)
] containing ≈ dB(1 −

δ)e of the Ti . Let k1 = dBδ/2e and k2 = dB(1 − δ/2)e. A common choice is
∗ ∗
[T(k 1)
, T(k 2)
]. (4.27)

The large sample 100(1 − δ)% lower percentile method CI for θ is



(−∞, T(dB(1−δ)e) ]. The large sample 100(1 − δ)% upper percentile method CI

for θ is [T(dBδe) , ∞).

Definition 4.14. The large sample 100(1 − δ)% lower shorth CI for θ

is (−∞, T(c) ], while the large sample 100(1 − δ)% upper shorth CI for θ is

[T(B−c+1) , ∞). The large sample 100(1 − δ)% shorth(c) CI uses the interval
∗ ∗ ∗ ∗ ∗ ∗
[T(1) , T(c) ], [T(2) , T(c+1) ], ..., [T(B−c+1) , T(B) ] of shortest length. Here
p
c = min(B, dB[1 − δ + 1.12 δ/B ] e). (4.28)

Applied to a bootstrap sample, the Frey shorth interval can be regarded as


the shortest percentile method confidence interval, asymptotically. Hence the
shorth confidence interval is a practical implementation of the Hall (1988)
shortest bootstrap interval based on all possible bootstrap samples. See Re-
mark 4.16 for some theory for bootstrap CIs such as (4.26) and (4.27).
172 4 Prediction and Variable Selection When n >> p

4.5.1 The Bootstrap

This subsection illustrates the nonparametric bootstrap with some examples.


Suppose a statistic Tn is computed from a data set of n cases. The nonpara-
metric bootstrap draws n cases with replacement from that data set. Then
T1∗ is the statistic Tn computed from the sample. This process is repeated
B times to produce the bootstrap sample T1∗ , ..., TB∗ . Sampling cases with
replacement uses the empirical distribution.

Definition 4.15. Suppose that data x1 , ..., xn has been collected and
observed. Often the data is a random sample (iid) from a distribution with
cdf F . The empirical distribution is a discrete distribution where the xi are
the possible values, and each value is equally likely. If w is a random variable
having the empirical distribution, then pi = P (w = xi ) = 1/n for i = 1, ..., n.
The cdf of the empirical distribution is denoted by Fn .

Example 4.2. Let w be a random variable having the empirical distri-


bution given by Definition 4.15. Show that E(w) = x ≡ xn and Cov(w) =
n−1 n−1
S≡ S n.
n n
Solution: Recall
P that for a discrete random vector, the population expected
value E(w) = xi pi where xi are the values that w takes with positive
probability pi . Similarly, the population covariance matrix
X
Cov(w) = E[(w − E(w))(w − E(w))T ] = (xi − E(w))(xi − E(w))T pi .

Hence
n
X 1
E(w) = xi = x,
n
i=1

and
n
X 1 n−1
Cov(w) = (xi − x)(xi − x)T = S. 
n n
i=1

Example 4.3. If W1 , ..., Wn are iid from a distribution with cdf FW , then
the empirical cdf Fn corresponding to FW is given by
n
1X
Fn (y) = I(Wi ≤ y)
n
i=1

where the indicator I(Wi ≤ y) = 1 if Wi ≤ y and I(Wi ≤ y) = 0 if Wi > y.


Fix n and y. Then nFn (y) ∼ binomial (n, FW (y)). Thus E[Fn (y)] = FW (y)
and V [Fn(y)] = FW (y)[1 − FW (y)]/n. By the central limit theorem,
√ D
n(Fn (y) − FW (y)) → N (0, FW (y)[1 − FW (y)]).
4.5 Bootstrapping Hypothesis Tests and Confidence Regions 173

Thus Fn (y) − FW (y) = OP (n−1/2 ), and Fn is a reasonable estimator of FW


if the sample size n is large.

Suppose there is data w 1 , ..., wn collected into an n × p matrix W . Let


the statistic Tn = t(W ) = T (Fn ) be computed from the data. Suppose the
statistic estimates µ = T (F ), and let t(W ∗ ) = t(Fn∗) = Tn∗ indicate that
t was computed from an iid sample from the empirical distribution Fn : a
sample w ∗1 , ..., w∗n of size n was drawn with replacement from the observed
sample w1 , ..., wn . This notation is used for von Mises differentiable statistical
functions in large sample theory. See Serfling (1980, ch. 6). The empirical
distribution is also important for the influence function (widely used in robust
statistics). The nonparametric bootstrap draws B samples of size n from the

rows of W , e.g. from the empirical distribution of w1 , ..., wn . Then Tjn is
computed from the jth bootstrap sample for j = 1, ..., B.

Example 4.4. Suppose the data is 1, 2, 3, 4, 5, 6, 7. Then n = 7 and the


sample median Tn is 4. Using R, we drew B = 2 bootstrap samples (samples
of size n drawn with replacement from the original data) and computed the
∗ ∗
sample median T1,n = 3 and T2,n = 4.
b1 <- sample(1:7,replace=T)
b1
[1] 3 2 3 2 5 2 6
median(b1)
[1] 3
b2 <- sample(1:7,replace=T)
b2
[1] 3 5 3 4 3 5 7
median(b2)
[1] 4
The bootstrap has been widely used to estimate the population covariance
matrix of the statistic Cov(Tn ), for testing hypotheses, and for obtaining
confidence regions (often confidence intervals). An iid sample T1n , ..., TBn of
size B of the statistic would be very useful for inference, but typically we
only have one sample of data and one value Tn = T1n of the statistic. Often
∗ ∗
Tn = t(w 1 , ..., wn ), and the bootstrap sample T1n , ..., TBn is formed where
∗ ∗ ∗ ∗ ∗
Tjn = t(w j1 , ..., wjn ). Section 4.5.3 will show that T1n − Tn , ..., TBn − Tn is
√ D
pseudodata for T1n − θ, ..., TBn − θ when n is large in that n(Tn − θ) → u
√ D
and n(T ∗ − Tn ) → u.

Example 4.5. Suppose there is training data (y i , xi ) for the model y i =


m(xi ) + i for i = 1, ..., n, and it is desired to predict a future test value
y f given xf and the training data. The model can be fit and the residual
vectors formed. One method for obtaining a prediction region for y f is to
form the pseudodata ŷ f + ˆi for i = 1, ..., n, and apply the nonparametric
174 4 Prediction and Variable Selection When n >> p

prediction region (4.24) to the pseudodata. See Section 8.3 and Olive (2017b,
2018). The residual bootstrap could also be used to make a bootstrap sample
ŷ f + ˆ∗1 , ..., ŷf + ˆ∗B where the ˆ∗j are selected with replacement from the
residual vectors for j = 1, ..., B. As B → ∞, the bootstrap sample will take
on the n values ŷ f +ˆ i (the pseudodata) with probabilities converging to 1/n
for i = 1, ..., n.
Suppose there is a statistic Tn that is a g × 1 vector. Let
B B
∗ 1 X ∗ 1 X ∗ ∗ ∗
T = Ti and S ∗T = (T − T )(Ti∗ − T )T (4.29)
B i=1 B − 1 i=1 i

be the sample mean and sample covariance matrix of the bootstrap sample
T1∗ , ..., TB∗ where Ti∗ = Ti,n
∗ ∗
. Fix n, and let E(Ti,n ∗
) = θ n and Cov(Ti,n ) = Σn.
√ D
We will often assume that Cov(Tn ) = Σ T , and n(Tn − θ) → Ng (0, Σ A )
P
where Σ A > 0 is positive definite and nonsingular. Often nΣ̂ T → Σ A .
For example, using least squares and the residual bootstrap for the multiple
n−p
linear regression model, Σ n = M SE(X T X)−1 , Tn = θ n = β̂, θ = β,
n
Σ̂ T = M SE(X T X)−1 and Σ A = σ 2 limn→∞(X T X/n)−1 . See Example 4.6
in Section 4.6.
Suppose the Ti∗ = Ti,n∗
are iid from some distribution with cdf F̃n . For
∗ ∗
example, if Ti,n = t(Fn ) where iid samples from Fn are used, then F̃n is the
cdf of t(Fn∗). With respect to F̃n , both θn and Σ n are parameters, but with
respect to F , θ n is a random vector and Σ n is a random matrix. For fixed
n, by the multivariate central limit theorem,
√ ∗ D ∗ ∗ D
B(T − θn ) → Ng (0, Σ n ) and B(T − θ n )T [S∗T ]−1 (T − θ n ) → χ2r

as B → ∞.

Remark 4.13. For Examples 4.2, 4.5, and 4.6, the bootstrap works but
is expensive compared to alternative methods. For Example 4.2, fix n, then
∗ P P
T → θ n = x and S ∗T → (n − 1)S/n as B → ∞, but using (x, S) makes
more sense. For Example 4.5, use the pseudodata instead of the residual boot-
strap. For Example 4.6, using β̂ and the classical estimated covariance ma-
d β̂) = M SE(X T X)−1 makes more sense than using the bootstrap.
trix Cov(
For these three examples, it is known how the bootstrap sample behaves as
√ D
B → ∞. The bootstrap can be very useful when n(Tn − θ) → Ng (0, Σ A ),
but it not known how to estimate Σ A without using a resampling method
√ D
like the bootstrap. The bootstrap may be useful when n(Tn − θ) → u, but
the limiting distribution (the distribution of u) is unknown.
4.5 Bootstrapping Hypothesis Tests and Confidence Regions 175

4.5.2 Bootstrap Confidence Regions for Hypothesis


Testing

When the bootstrap is used, a large sample 100(1 − δ)% confidence region
for a g × 1 parameter vector θ is a set An = An,B such that P (θ ∈ An,B ) is
eventually bounded below by 1 − δ as n, B → ∞. The B is often suppressed.
Consider testing H0 : θ = θ0 versus H1 : θ 6= θ0 where θ 0 is a known g × 1
vector. Then reject H0 if θ 0 is not in the confidence region An . Let the g × 1
vector Tn be an estimator of θ. Let T1∗ , ..., TB∗ be the bootstrap sample for Tn .
Let A be a full rank g × p constant matrix. For variable selection, consider
testing H0 : Aβ = θ0 versus H1 : Aβ 6= θ 0 with θ = Aβ where often

θ0 = 0. Then let Tn = Aβ̂ Imin ,0 and let Ti∗ = Aβ̂ Imin ,0,i for i = 1, ..., B.
The statistic β̂Imin ,0 is the variable selection estimator padded with zeroes.

See Section 4.2. Let T and S ∗T be the sample mean and sample covariance
matrix of the bootstrap sample T1∗ , ..., TB∗ . See Equation (4.28). See Theorem
2.25 for why dn Fg,dn ,1−δ → χ2g,1−δ as dn → ∞. Here P (X ≤ χ2g,1−δ ) = 1 − δ
if X ∼ χ2g , and P (X ≤ Fg,dn ,1−δ ) = 1 − δ if X ∼ Fg,dn . Let kB = dB(1 − δ)e.

Definition 4.16. a) The standard bootstrap large sample 100(1 − δ)%


confidence region for θ is {w : (w − Tn )T [S∗T ]−1 (w − Tn ) ≤ D1−δ
2
}=
2
{w : Dw (Tn , S ∗T ) ≤ D1−δ
2
} (4.30)
2
where D1−δ = χ2g,1−δ or D1−δ2
= dn Fg,dn ,1−δ where dn → ∞ as n → ∞. b)
The Bickel and Ren (2001) large sample 100(1 − δ)% confidence region for θ
is {w : (w − Tn )T [Σ̂ A /n]−1(w − Tn ) ≤ D(k2
B ,T )
}=

2 2
{w : Dw (Tn , Σ̂ A /n) ≤ D(k B ,T )
} (4.31)
2
where the cutoff D(k B ,T )
is the 100kB th sample quantile of the
Di2 = (Ti∗ − Tn )T [Σ̂ A /n]−1(Ti∗ − Tn ) = n(Ti∗ − Tn )T [Σ̂ A ]−1 (Ti∗ − Tn ).
√ D P
Confidence region (4.29) needs n(Tn − θ) → Ng (0, Σ A ) and nS ∗T →
Σ A > 0 as n, B → ∞. See Machado and Parente (2005) for regularity con-
ditions for this assumption. Bickel and Ren (2001) have interesting sufficient
conditions for (4.30) to be a confidence region when Σ̂ A is a consistent esti-
mator of positive definite Σ A . Let the vector of parameters θ = T (F ), the
statistic Tn = T (Fn ), and the bootstrapped statistic T ∗ = T (Fn∗ ) where F
is the cdf of iid x1 , ..., xn , Fn is the empirical cdf, and Fn∗ is the empiri-
cal cdf of x∗1 , ..., x∗n , a sample from Fn using the nonparametric bootstrap.
√ D
If n(Fn − F ) → z F , a Gaussian random process, and if T is sufficiently
√ D
smooth (has a Hadamard derivative Ṫ (F )), then n(Tn − θ) → u and
176 4 Prediction and Variable Selection When n >> p
√ D
n(Ti∗ −Tn ) → u with u = Ṫ (F )z F . Note that Fn is a perfectly good cdf “F ”

and Fn is a perfectly good empirical cdf from Fn = “F .” Thus if n is fixed,
and a sample of size m is drawn with replacement from the empirical distribu-
√ ∗ D
tion, then m(T (Fm )−Tn ) → Ṫ (Fn )z Fn . Now let n → ∞ with m = n. Then
√ D
bootstrap theory gives n(Ti∗ − Tn ) → limn→∞ Ṫ (Fn )z Fn = Ṫ (F )zF ∼ u.
The following three confidence regions will be used for inference after vari-
able selection. The Olive (2017ab, 2018) prediction region method applies
prediction region (4.24) to the bootstrap sample. Olive (2017ab, 2018) also
gave the modified Bickel and Ren confidence region that uses Σ̂ A = nS ∗T .
The hybrid confidence region is due to Pelawa Watagoda and Olive (2019a).
Let qB = min(1 − δ + 0.05, 1 − δ + g/B) for δ > 0.1 and

qB = min(1 − δ/2, 1 − δ + 10δg/B), otherwise. (4.32)

If 1 − δ < 0.999 and qB < 1 − δ + 0.001, set qB = 1 − δ. Let D(UB ) be the


100qB th sample quantile of the Di . Use (4.31) as a correction factor for finite
B ≥ 50g.

Definition 4.17. a) The prediction region method large sample 100(1 −


∗ ∗
δ)% confidence region for θ is {w : (w − T )T [S ∗T ]−1 (w − T ) ≤ D(U
2
B)
}=

2
{w : Dw (T , S∗T ) ≤ D(U
2
B)
} (4.33)
∗ ∗
2
where D(U B)
is computed from Di2 = (Ti∗ − T )T [S ∗T ]−1 (Ti∗ − T ) for i =

1, ..., B. Note that the corresponding test for H0 : θ = θ 0 rejects H0 if (T −

θ0 )T [S ∗T ]−1 (T − θ 0 ) > D(U
2
B)
. (This procedure is basically the one sample
Hotelling’s T test applied to the Ti∗ using S ∗T as the estimated covariance
2

matrix and replacing the χ2g,1−δ cutoff by D(U 2


B)
.) b) The modified Bickel
and Ren (2001) large sample 100(1 − δ)% confidence region is {w : (w −
Tn )T [S∗T ]−1 (w − Tn ) ≤ D(U2
B ,T )
}=
2
{w : Dw (Tn , S∗T ) ≤ D(U
2
B ,T )
} (4.34)
2
where the cutoff D(U B ,T )
is the 100qB th sample quantile of the Di2 = (Ti∗ −
T ∗ −1 ∗
Tn ) [ST ] (Ti − Tn ). Note that the corresponding test for H0 : θ = θ 0
rejects H0 if (Tn − θ0 )T [S ∗T ]−1 (Tn − θ 0 ) > D(U
2
B ,T )
. c) Shift region (4.32) to
2
have center Tn , or equivalently, change the cutoff of region (4.33) to D(U B)
to get the hybrid large sample 100(1 − δ)% confidence region: {w : (w −
Tn )T [S∗T ]−1 (w − Tn ) ≤ D(U 2
B)
}=
2
{w : Dw (Tn , S∗T ) ≤ D(U
2
B)
}. (4.35)

Note that the corresponding test for H0 : θ = θ 0 rejects H0 if


(Tn − θ 0 )T [S∗T ]−1 (Tn − θ 0 ) > D(U
2
B)
.
4.5 Bootstrapping Hypothesis Tests and Confidence Regions 177

Hyperellipsoids (4.32) and (4.34) have the same volume since they are the
same region shifted to have a different center. The ratio of the volumes of
regions (4.32) and (4.33) is
   
|S ∗T |1/2 D(UB ) g D(UB ) g
= . (4.36)
|S ∗T |1/2 D(UB ,T ) D(UB ,T )

The volume of confidence region (4.33) tends to be greater than that of (4.32)

since the Ti∗ are closer to T than Tn on average.
If g = 1, then a hyperellipsoid is an interval, and confidence intervals are
special cases of confidence regions. Suppose the parameter of interest is θ, and
there is a bootstrap sample T1∗ , ..., TB∗ where the statistic Tn is an estimator
of θ based on a sample of size n. The percentile method uses an interval that
∗ ∗
contains UB ≈ kB = dB(1 −δ)e of the Ti∗ . Let ai = |Ti∗ −T |. Let T and ST2∗
be the sample mean and variance of the Ti∗ . Then the squared Mahalanobis
∗ ∗ ∗
distance Dθ2 = (θ−T )2 /ST∗2 ≤ D(U 2
B)
is equivalent to θ ∈ [T −ST∗ D(UB ) , T +
∗ ∗ ∗
ST∗ D(UB ) ] = [T − a(UB ) , T + a(UB ) ], which is an interval centered at T just
long enough to cover UB of the Ti∗ . Hence the prediction region method is
a special case of the percentile method if g = 1. See Definition 4.13. Efron
(2014) used a similar large sample 100(1 − δ)% confidence interval assuming

that T is asymptotically normal. The CI corresponding to (4.33) is defined
similarly, and [Tn − a(UB ) , Tn + a(UB ) ] is the CI for (4.34). Note that the
three CIs corresponding to (4.32)–(4.34) can be computed without finding
ST∗ or D(UB ) even if ST∗ = 0. The Frey (2013) shorth(c) CI (4.27) computed
from the Ti∗ can be much shorter than the Efron (2014) or prediction region
method confidence intervals. See Remark 4.16 for some theory for bootstrap
CIs.
∗ n−p
Remark 4.14. From Example 4.6, Cov(β̂ ) = M SE(X T X)−1 =
n
n−pd d β̂) = M SE(X T X)−1 starts to give good estimates
Cov(β̂) where Cov(
n
of Cov(β̂) = Σ T for many error distributions if n ≥ 10p and T = β̂. For
the residual bootstrap with large B, note that S ∗T ≈ 0.95Cov(d β̂) for n = 20p
∗ d
and S T ≈ 0.99Cov(β̂) for n = 100p. Hence we may need n >> √ p before the
S ∗T is a good estimator of Cov(T ) = Σ √ T . The distribution of n(Tn − θ) is
approximated by the distribution of n(T ∗ − Tn ) or by the distribution of
√ ∗
n(T ∗ − T ), but n may need to be large before the approximation is good.

Suppose the bootstrap sample mean T estimates θ, and the bootstrap

sample covariance matrix S T estimates cn Cov(T d n ) ≈ cn Σ T where cn in-
creases to 1 as n → ∞. Then S ∗T is not a good estimator of Cov(T d n ) un-
til cn ≈ 1 (n ≥ 100p for OLS β̂), but the squared Mahalanobis distance

Dw 2∗
(T , S∗T ) ≈ Dw
2 2∗
(θ, Σ T )/cn and D(U B)
2
≈ D1−δ /cn . Hence the prediction
2∗ 2
region method has a cutoff D(UB ) that estimates the cutoff D1−δ /cn . Thus
the prediction region method may give good results for much smaller n than
178 4 Prediction and Variable Selection When n >> p

a bootstrap method that uses a χ2g,1−δ cutoff when a cutoff χ2g,1−δ /cn should
be used for moderate n.

Remark 4.15. For bootstrapping the p × 1 vector β̂ Imin ,0 , we will often


want n ≥ 20p and B ≥ max(100, n, 50p). If Tn is g × 1, we might replace p
by g or replace p by d if d is the model degrees of freedom. Sometimes much
larger n is needed to avoid undercoverage. We want B ≥ 50g so that S ∗T is a
good estimator of Cov(Tn∗ ). Prediction region theory uses correction factors
like (4.21) and (4.10) to compensate for finite n. The bootstrap confidence
regions (4.32)–(4.34) and the shorth CI use the correction factors (4.31) and
(4.27) to compensate for finite B ≥ 50g. Note that the correction factors
make the volume of the confidence region larger as B decreases. Hence a test
with larger B will have more power.

4.5.3 Theory for Bootstrap Confidence Regions

Consider testing H0 : θ = θ0 versus H1 : θ 6= θ0 where θ is g × 1. This


section gives some theory for bootstrap confidence regions and for the bag-

ging estimator T , also called the smoothed bootstrap estimator. Empirically,
bootstrapping with the bagging estimator often outperforms bootstrapping
with Tn . See Breiman (1996), Yang (2003), and Efron (2014). See Büchlmann
and Yu (2002) and Friedman and Hall (2007) for theory and references for
the bagging estimator. Since (4.33) is a large sample confidence region by
√ ∗ P
Bickel and Ren (2001), (4.32) and (4.34) are too, provided n(T − Tn ) → 0.
√ D √ D
If i) n(Tn −θ) → u, then under regularity conditions, ii) n(Ti∗ −Tn ) →
√ ∗ D √ ∗ D P
u, iii) n(T − θ) → u, iv) n(Ti∗ − T ) → u, and v) nS ∗T → Cov(u).
Suppose i) and ii) hold with E(u) = 0 and Cov(u) √ = Σ u . With respect
to the bootstrap sample, Tn is a constant and the n(Ti∗ − Tn ) are iid for
√ D
i = 1, ..., B. Let n(Ti∗ − Tn ) → v i ∼ u where the√v i are iid with the same
distribution as u. Fix B. Then the average of the n(Ti∗ − Tn ) is
B  
√ ∗ D1 X Σu
n(T − Tn ) → v i ∼ ANg 0,
B i=1 B

where z ∼ ANg (0, Σ) is an asymptotic multivariate normal approximation.


√ ∗ P
Hence as B → ∞, n(T − Tn ) → 0, and iii) and iv) hold. If B is fixed and
u ∼ Ng (0, Σ u ), then
B  
1 X Σu √ √ ∗ D
v i ∼ Ng 0, and B n(T − Tn ) → Ng (0, Σ u ).
B B
i=1
4.5 Bootstrapping Hypothesis Tests and Confidence Regions 179

Hence the prediction region method gives a large sample confidence region√ for

2
θ provided that the sample percentile D̂1−δ of the DT2 i∗ (T , S ∗T ) = n(Ti∗ −
∗ √ ∗
T )T (nS ∗T )−1 n(Ti∗ − T ) is a consistent estimator of the percentile Dn,1−δ
2
∗ ∗ √ ∗ ∗ √ ∗
of the random variable D2 (T , S T ) = n(θ − T )T (nS T )−1 n(θ − T ) in
θ
2 2 P
that D̂1−δ − Dn,1−δ → 0. Since iii) and iv) hold, the sample percentile will
be consistent Hunder much weaker conditions than v) if Σ u is nonsingular.
Olive (2017b: 5.3.3, 2018) proved that the prediction region method gives a
large sample confidence region under the much stronger conditions of v) and
u ∼ Ng (0, Σ u ), but the above Pelawa Watagoda and Olive (2019a) proof is
simpler.
√ D √ D
Remark 4.16. Note that if n(Tn −θ) → U and n(Ti∗ −Tn ) → U where
U has a unimodal probability density function symmetric about zero, then
the confidence intervals from the three confidence regions (4.32)–(4.34), the
shorth confidence interval (4.27), and the “usual” percentile method confi-
dence interval (4.26) are asymptotically equivalent (use the central proportion
of the bootstrap sample, asymptotically).
P
Assume nS ∗T → Σ A as n, B → ∞ where Σ A and S ∗T are nonsingular g ×g
matrices, and Tn is an estimator of θ such that
√ D
n (Tn − θ) → u (4.37)

as n → ∞. Then
√ −1/2 D −1/2
n ΣA (Tn − θ) → Σ A u = z,
−1 D
n (Tn − θ)T Σ̂ A (Tn − θ) → z T z = D2
as n → ∞ where Σ̂ A is a consistent estimator of Σ A , and
D
(Tn − θ)T [S∗T ]−1 (Tn − θ) → D2 (4.38)

as n, B → ∞. Assume the cumulative distribution function of D2 is continu-


2
ous and increasing in a neighborhood of D1−δ where P (D2 ≤ D1−δ
2
) = 1−δ. If
2
the distribution of D is known, then we could use the large sample confidence
region (4.29) {w : (w − Tn )T [S ∗T ]−1 (w − Tn ) ≤ D1−δ
2
}. Often by a central
√ D
limit theorem or the multivariate delta method, n(Tn − θ) → Ng (0, Σ A ),
−1
and D2 ∼ χ2g . Note that [S∗T ]−1 could be replaced by nΣ̂ A .
√ D
Remark 4.17. Under reasonable conditions, i) n(Tn − θ) → u, ii)
√ D √ ∗ D √ ∗ D
n(Ti∗ − Tn ) → u, iii) n(T − θ) → u, and iv) n(Ti∗ − T ) → u. Then
∗ √ ∗ √ ∗
D12 = DT2 i∗ (T , S ∗T ) = n(Ti∗ − T )T (nS ∗T )−1 n(Ti∗ − T ),
180 4 Prediction and Variable Selection When n >> p
√ √
D22 = Dθ 2
(Tn , S ∗T ) = n(Tn − θ)T (nS ∗T )−1 n(Tn − θ),
∗ √ ∗ √ ∗
D32 = Dθ 2
(T , S∗T ) = n(T − θ)T (nS ∗T )−1 n(T − θ), and
√ √
D42 = DT2 i∗ (Tn , S ∗T ) = n(Ti∗ − Tn )T (nS ∗T )−1 n(Ti∗ − Tn ),
P D
are well behaved. If (nS ∗T )−1 → Σ −1 2 2 T −1 ∗ −1
T , then Dj → D = u Σ T u. If (nS T )
2 T ∗ −1
is “not too ill conditioned” then Dj ≈ u (nS T ) u for large n, and the
confidence regions (4.32), (4.33), and (4.34) will have coverage near 1 − δ.
The regularity conditions for (4.32)–(4.34) are weaker when g = 1, since S ∗T
does not need to be computed.

The following Pelawa Watagoda and Olive (2019a) theorem is very useful.
2
Let D(U B)
be the cutoff for the nonparametric prediction region (4.24) com-
puted from the Di2 (T , ST ) for i = 1, ..., B. Hence n is replaced by B. Since
Tn depends on the sample size n, we need (nS T )−1 to be fairly well behaved
(“not too ill conditioned”) for each n ≥ 20g, say. This condition is weaker
P
than (nS T )−1 → Σ −1
A . Note that Ti = Tin .
√ D
Theorem 4.7: Geometric Argument. Suppose n(Tn − θ) → u with
E(u) = 0 and Cov(u) = Σ u . Assume T1 , ..., TB are iid with nonsingular
covariance matrix Σ Tn . Then the large sample 100(1 − δ)% prediction region
2 2
Rp = {w : Dw (T , ST ) ≤ D(U B)
} centered at T contains a future value of
the statistic Tf with probability 1 − δB → 1 − δ as B → ∞. Hence the region
2 2
Rc = {w : Dw (Tn , S T ) ≤ D(U B)
} is a large sample 100(1 − δ)% confidence
region for θ where Tn is a randomly selected Ti .
Proof. The region Rc centered at a randomly selected Tn contains T with
probability√1 − δB which is eventually bounded below by 1 − δ as B → ∞.
Since the n(Ti − θ) are iid,
√   
n(T1 − θ) v1
 ..  D  .. 
 . → . 

n(TB − θ) vB

where the v i are iid with the same distribution as u. (Use Theorems 1.30
and 1.31, and see Example 1.16.) For fixed B, the average of these random
vectors is
XB  
√ D 1 Σu
n(T − θ) → v i ∼ ANg 0,
B i=1 B

by Theorem 1.33. Hence (T − θ) = OP ((nB)−1/2 ), and T gets arbitrarily


close to θ compared to Tn as B → ∞. Thus Rc is a large sample 100(1 − δ)%
confidence region for θ as n, B → ∞. 
4.5 Bootstrapping Hypothesis Tests and Confidence Regions 181

Fig. 4.3 Confidence Regions for 2 Statistics with MVN Distributions


182 4 Prediction and Variable Selection When n >> p

Examining the iid data cloud T1 , ..., TB and the bootstrap sample √ data
cloud√T1∗ , ..., TB∗ is often useful for understanding the bootstrap. If n(Tn −θ)
and n(Ti∗ − Tn ) both converge in distribution to u, then the bootstrap
sample data cloud of T1∗ , ..., TB∗ is like the data cloud of iid T1 , ..., TB shifted
to be centered at Tn . The nonparametric confidence region (4.32) applies the
prediction region to the bootstrap. Then the hybrid region (4.34) centers that
region at Tn . Hence (4.34) is a confidence region by the geometric argument,
√ ∗ P
and (4.32) is a confidence region if n(T − Tn ) → 0. Since the Ti∗ are closer
∗ 2 2
to T than Tn on average, D(U B ,T )
tends to be greater than D(U B)
. Hence
the coverage and volume of (4.33) tend to be at least as large as the coverage
and volume of (4.34).

The hyperellipsoid corresponding to the squared Mahalanobis distance


D2 (Tn , C) is centered at Tn , while the hyperellipsoid corresponding to
the squared Mahalanobis distance D2 (T , C) is centered at T . Note that
DT2 (Tn , C) = (T −Tn )T C −1 (T −Tn ) = (Tn −T )T C −1 (Tn −T ) = DT2 n (T , C).
Thus DT2 (Tn , C) ≤ D(U2
B)
2
iff DT2 n (T , C) ≤ D(U B)
.
The prediction region method will often simulate well even if B is rather

small. If the ellipses are centered at Tn or T , Figure 4.3 shows confidence
regions if the plotted points are T1∗ , ..., TB∗ where the Ti∗ are approximately
multivariate normal. If the ellipses are centered at T , Figure 4.3 shows 10%,
30%, 50%, 70%, 90%, and 98% prediction regions for a future value of Tf for
two multivariate normal statistics. Then the plotted points are iid T1 , ..., TB.
P
If nCov(T ) → Σ A , and the Ti∗ are iid from the bootstrap distribution, then
∗ ∗
Cov(T ) ≈ Cov(T )/B ≈ Σ A /(nB). By Theorem 4.7, if T is in the 90% pre-
diction region with probability near 90%, then the confidence region should
give simulated coverage near 90% and the volume of the confidence region

should be near that of the 90% prediction region. If B = 100, then T falls
in a covering region of the same shape as the prediction √ region, but centered
near Tn and the lengths of the axes are divided by B. Hence if B = 100,
then the axes lengths of this covering region are about one tenth of those in
Figure 4.3. Hence when Tn falls within the 70% prediction region, the prob-

ability that T falls in the 90% prediction region is near one. If Tn is just

within or just without the boundary of the 90% prediction region, T tends
to be just within or just without of the 90% prediction region. Hence the
coverage and volume of prediction region confidence region is near that of
the nominal coverage 90% and near the volume of the 90% prediction region.
Hence B does not need to be large provided that n and B are large enough
so that ST∗ ≈ Cov(T ∗ ) ≈ Σ A /n. If n is large, the sample covariance matrix
starts to be a good estimator of the population covariance matrix when B ≥
Jg where J = 20 or 50. For small g, using B = 1000 often led to good
simulations, but B = max(50g, 100) may work well.

Remark 4.18. Remark 4.14 suggests that even if the statistic Tn is asymp-
totically normal so the Mahalanobis distances are asymptotically χ2g , the pre-
4.5 Bootstrapping Hypothesis Tests and Confidence Regions 183

diction region method can give better results for moderate n by using the
2
cutoff D(U B)
instead of the cutoff χ2g,1−δ . Theorem 4.7 says that the hyper-
ellipsoidal prediction and confidence regions have exactly the same volume.
We compensate for the prediction region undercoverage when n is moderate
2 2
by using D(U n)
. If n is large, by using D(U B)
, the prediction region method
confidence region compensates for undercoverage when B is moderate, say
B ≥ Jg where J = 20 or 50. See Remark 4.15. This result can be useful
if a simulation with B = 1000 or B = 10000 is much slower than a simu-
lation with B = Jg. The price to pay is that the prediction region method
confidence region is inflated to have better coverage, so the power of the
hypothesis test is decreased if moderate B is used instead of larger B.

4.5.4 Bootstrapping the Population Coefficient of


Multiple Determination

This subsection illustrates a case where the shorth(c) bootstrap CI fails, but
the lower shorth CI can be useful. See Definition 4.14.
The multiple linear regression (MLR) model is

Yi = β1 + xi,2 β2 + · · · + xi,p βp + ei = xTi β + ei

for i = 1, ..., n. See Definition 1.17 for the coefficient of multiple determination
SSR SSE
R2 = [corr(Yi , Ŷi )]2 = =1−
SSTO SSTO

where corr(Yi , Ŷi ) is the sample correlation of Yi and Ŷi .


2
Assume that the variance of the errors
Pp is σe and that the variance
Pp of Y is
2
σY . Let the linear combination L = i=2 xi βi where Y = β1 + i=2 xi βi +
2
e = β1 + L + e. Let the variance of L be σL . Then
Pn
r2 P σ2 σ2
R2 = 1 − Pn i=1 i 2 → τ 2 = 1 − 2e = 1 − 2 e 2 .
i=1 (Yi − Y )
σY σe + σL

Here we assume that e is independent of the predictors x2 , ..., xp. Hence e is


independent of L and the variance σY2 = V (L + e) = V (L) + V (e) = σL 2
+ σe2 .
One of the sufficient conditions for the shorth(c) interval to be a large
√ D
sample CI for θ is n(T − θ) → N (0, σ 2 ). If the function t(θ) has an inverse,
√ D
and n(t(T ) − t(θ)) → N (0, v2 ), then the above condition typically holds by
the delta method. See Remark 4.16.
For T = R2 and θ = τ 2 , the test statistic F0 for testing H0 : β2 = · · · =
D
βp = 0 in the Anova F test has (p − 1)F0 → χ2p−1 for a large class of error
distributions when H0 is true, where
184 4 Prediction and Variable Selection When n >> p

R2 n−p
F0 = 2
1−R p−1
if the MLR model has a constant. If H0 is false, then F0 has an asymptotic
2
scaled noncentral √ χ distribution. These results suggest that the large sample
distribution of n(R2 − τ 2 ) may not be N (0, σ 2 ) if H0 is false so τ 2 > 0. If
√ D
τ 2 = 0, we may have n(R2 − 0) → N (0, 0), the point mass at 0. Hence the
shorth CI may not be a large sample CI for τ 2 . The lower shorth CI should
be useful for testing H0 : τ 2 = 0 versus HA : τ 2 > a where 0 < a ≤ 1 since
the coverage is 1 and the length of the CI converges to 0. So reject H0 if a is
not in the CI.
The simulation simulated iid data w with u = Aw and Aij = ψ for i 6= j
and Aii = 1 where 0 ≤ ψ < 1 and u = (x2 , ..., xp)T√. Hence Cor(xi , xj ) = ρ =
[2ψ + (p − 3)ψ2]/[1 + (p − 2)ψ2] for i 6= j. If ψ = 1/ kp, then ρ → 1/(k + 1) as
p → ∞ where k > 0. We used w ∼ Np−1 (0, I p−1 ). If ψ is high or if p is large
with ψ ≥ 0.5, then the data are clustered tightly about the line with direction
1 = (1, ..., 1)T , and there is a dominant principal component with eigenvector

1 and eigenvalue λ1 . We used ψ = 0, 1/ p, and 0.9. Then ρ = 0, ρ → 0.5, or
ρ → 1 as p → ∞.
We also used V (x2 ) = · · · = V (xp ) = σx2 . If p > 2, then Cov(xi , xj ) = ρσx2
for i 6= j and Cov(xi , xj ) = V (xi ) = σx2 for i = j. Then V (Y ) = σY2 = σL 2
+σe2
where
Xp Xp p
X p X
X p
2
σL = V (L) = V ( βi xi ) = Cov( βi xi , βj xj ) = βi βj Cov(xi , xj )
i=2 i=2 j=2 i=2 j=2

p
X p
X p
X
= βi2 σx2 + 2ρσx2 βi βj .
i=2 i=2 j=i+1

The simulations took βi ≡ 0 or βi ≡ 1 for i = 2, ..., p. For the latter case,


2
σL = V (L) = (p − 1)σx2 + 2ρσx2 p(p − 1)/2.

The zero mean errors ei were from 5 distributions: i) N(0,1), ii) t3 , iii)
EXP (1) − 1, iv) uniform(−1, 1), and v) (1 − )N (0, 1) + N (0, (1 + s)2 ) with
 = 0.1 and s = 9 in the simulation. Then Y = 1 + bx2 + bx3 + · · · + bxp + e
with b = 0 or b = 1.
Remark 4.19. Suppose the simulation uses K runs and Wi = 1 if µ is
in the ith CI, and Wi = 0 otherwise, for i = 1, ..., K. Then the Wi are iid
binomial(1,1 − δn) where ρn = 1 − δn is the true coverage of the CI when the
PK
sample size is n. Let
p ρ̂n = W . Since i=1 Wi ∼ binomial(K, ρn ), the standard
error SE(W ) = ρn (1 − ρn )/K. For K = 5000 and ρn near 0.9, we have
3SE(W ) ≈ 0.01. Hence an observed coverage of ρ̂n within 0.01 of the nominal
coverage 1 − δ suggests that there is no reason to doubt that the nominal
CI coverage is different from the observed coverage. So for a large sample
4.5 Bootstrapping Hypothesis Tests and Confidence Regions 185

95% CI, we want the observed coverage to be between 0.94 and 0.96. Also
a difference of 0.01 is not large. Coverage slightly higher than the nominal
coverage is better than coverage slightly lower than the nominal coverage.

Bootstrapping confidence intervals for quantities like ρ2 and τ 2 is notori-


2
ously difficult. If β2 = · · · = βp = 0, then σL = 0 and τ 2 = 0. However, the
2∗
probability that Ri > 0 = 1. Hence the usual two sided bootstrap percentile
and shorth intervals for τ 2 will never contain 0. The one sided bootstrap CI

[0, T(c) ] always contains 0, and is useful if the length of the CI goes to 0 as
n → ∞. In the table below, βi = b for i = 2, ..., p. If b = 0, then τ 2 = 0.
The simulation for the table used 5000 runs with the bootstrap sample
size B = 1000. When n = 400, the shorth(c) CI never contains τ 2 = 0 and
the average length of the CI is 0.035. See ccov and clen. The lower shorth CI
always contained τ 2 = 0 with lcov = 1, and the average CI length was llen =
0.036. The upper shorth CI never contains τ 2 = 0, and the average length is
near 1.

Table 4.1 Bootstrapping τ 2 with R2 and B = 1000


etype n p b ψ τ 2 ccov clen lcov llen ucov ulen
1 100 4 0 0 0 0 0.135 1 0.137 0 0.990
1 200 4 0 0 0 0 0.0693 1 0.0702 0 0.995
1 400 4 0 0 0 0 0.0354 1 0.0358 0 0.988

Three linmodpack functions were used in the simulation. The function


shorthLU gets the shorth(c) CI, the lower shorth CI, and the upper shorth
CI. The function Rsqboot bootstraps R2 , while the function Rsqbootsim
does the simulation. Some R code for the first line of Table 4.1 is below where
b = cc.
Rsqbootsim(n=100,p=4,BB=1000,nruns=5000,type=1,psi=0,
cc=0)
$rho
[1] 0
$sigesq
[1] 1
$sigLsq
[1] 0
$poprsq
[1] 0
$cicov
[1] 0
$avelen
[1] 0.1348881
$lcicov
186 4 Prediction and Variable Selection When n >> p

[1] 1
$lavelen
[1] 0.13688
$ucicov
[1] 0
$uavelen
[1] 0.9896608

4.6 Bootstrapping Variable Selection

This section considers bootstrapping the MLR variable selection model. Rath-
nayake and Olive (2020) shows how to bootstrap variable selection for many
other regression models. This section will explain why the bootstrap con-
fidence regions (4.32), (4.33), and (4.34) give useful results. Much of the
theory in Section 4.5.3 does not apply to the variable selection estimator
Tn = Aβ̂ Imin ,0 with θ = Aβ, because Tn is not smooth since Tn is equal to
the estimator Tjn with probability πjn for j = 1, ..., J. Here A is a known
full rank g × p matrix with 1 ≤ g ≤ p.
Obtaining the bootstrap samples for β̂V S and β̂ M IX is simple. Generate

Y ∗ and X ∗ that would be used to produce β̂ if the full model estimator β̂

was being bootstrapped. Instead of computing β̂ , compute the variable selec-
∗ ∗C
tion estimator β̂ V S,1 = β̂ Ik1 ,0 . Then generate another Y ∗ and X ∗ and com-
∗ ∗
pute β̂ M IX,1 = β̂ Ik1 ,0 (using the same subset Ik1 ). This process is repeated
B times to get the two bootstrap samples for i = 1, ..., B. Let the selection
probabilities for the bootstrap variable selection estimator be ρkn . Then this
bootstrap procedure bootstraps both β̂V S and β̂ M IX with πkn = ρkn .
The key idea is to show that the bootstrap data cloud is slightly more
variable than the iid data cloud, so confidence region (4.33) applied to the
bootstrap data cloud has coverage bounded below by (1 − δ) for large enough
n and B.
For the bootstrap, P suppose that Ti∗ is equal to Tij∗ with probability ρjn
for j = 1, ..., J where j ρjn = 1, and ρjn → πj as n → ∞. Let Bjn count
the number of times Ti∗ = Tij∗ in the bootstrap sample. Then the bootstrap
sample T1∗ , ..., TB∗ can be written as

T1,1 , ..., TB∗ 1n,1 , ..., T1,J

, ..., TB∗ Jn,J

P
where the Bjn follow a multinomial distribution and Bjn /B → ρjn as B →

∞. Denote T1j , ..., TB∗ jn,j as the jth bootstrap component of the bootstrap

sample with sample mean T j and sample covariance matrix S ∗T ,j . Then
4.6 Bootstrapping Variable Selection 187

B Bjn
∗ 1 X ∗ X Bjn 1 X ∗ X ∗
T = Ti = Tij = ρ̂jn T j .
B B Bjn
i=1 j i=1 j

Similarly, we can define the jth component of the iid sample T1 , ..., TB to
have sample mean T j and sample covariance matrix S T ,j .
√ D
Let Tn = β̂ M IX and Tij = β̂ Ij ,0 . If S ⊆ Ij , assume n(β̂ Ij − β Ij ) →
√ ∗ D
Naj (0, V j ) and n(β̂ Ij − β̂ Ij ) → Naj (0, V j ). Then by Equation (4.3),
√ D √ ∗ D
n(β̂ Ij ,0 −β) → Np (0, V j,0 ) and n(β̂ Ij ,0 − β̂ Ij ,0 ) → Np (0, V j,0 ). (4.39)

This result means that the component clouds have the same variability
asymptotically. The iid data component clouds are all centered at β. If the
bootstrap data component clouds were all centered at the same value β̃, then
the bootstrap cloud would be like an iid data cloud shifted to be centered at
β̃, and (4.33) would be a confidence region for θ = β. Instead, the bootstrap
data component clouds are shifted slightly from a common center, and are
each centered at a β̂ Ij ,0 . Geometrically, the shifting of the bootstrap compo-
nent data clouds makes the bootstrap data cloud similar but more variable
than the iid data cloud asymptotically (we want n ≥ 20p), and centering
the bootstrap data cloud at Tn results in the confidence region (4.33) hav-
ing slightly higher asymptotic coverage than applying (4.33) to the iid data
cloud. Also, (4.33) tends to have higher coverage than (4.34) since the cutoff
for (4.33) tends to be larger than the cutoff for (4.34). Region (4.32) has
the same volume as region (4.34), but tends to have higher coverage since

empirically, the bagging estimator T tends to estimate θ at least as well as
Tn for a mixture distribution. A similar argument holds if Tn = Aβ̂ M IX ,
Tij = Aβ̂ Ij ,0 , and θ = Aβ.
To see that T ∗ has more variability than Tn , asymptotically, look at Figure
4.3. Imagine that n is huge and the J = 6 ellipsoids are 99.9% covering
regions for the component data clouds corresponding to Tjn for j = 1, ..., J.
Separating the clouds slightly, without rotation, increases the variability of
the overall data cloud. The bootstrap distribution of T ∗ corresponds to the
separated clouds. The shape of the overall data cloud does not change much,
but the volume does increase.
In the simulations for H0 : Aβ = BβS = θ 0 with n ≥ 20p, the coverage
tended to get close to 1 − δ for B ≥ max(200, 50p) so that S ∗T is a good esti-
mator of Cov(T ∗ ). In the simulations where S is not the full model, inference
with backward elimination with Imin using AIC was often more precise than
inference with the full model if n ≥ 20p and B ≥ 50p.
The matrix S ∗T can be singular due to one or more columns of zeros
in the bootstrap sample for β1 , ..., βp. The variables corresponding to these
columns are likely not needed in the model given that the other predictors
are in the model. A simple remedy is to add d bootstrap samples of the
188 4 Prediction and Variable Selection When n >> p
∗ ∗
full model estimator β̂ = β̂ F U LL to the bootstrap sample. For example,
take d = dcBe with c = 0.01. A confidence interval [Ln , Un ] can be com-
puted without S ∗T for (4.32), (4.33), and (4.34). Using the confidence interval
∗ ∗
[max(Ln , T(1) ), min(Un , T(B) )] can give a shorter covering region.
Undercoverage can occur if bootstrap sample data cloud is less variable
than the iid data cloud, e.g., if (n − p)/n is not close to one. Coverage can be
higher than the nominal coverage for two reasons: i) the bootstrap data cloud
is more variable than the iid data cloud of T1 , ..., TB, and ii) zero padding.

The bootstrap component clouds for β̂ V S are again separated compared
to the iid clouds for β̂ V S , which are centered about β. Heuristically, most of
the selection bias is due to predictors in E, not to the predictors in S. Hence
∗ ∗ ∗
β̂S,V S is roughly similar to β̂ S,M IX . Typically the distributions of β̂ E,V S

and β̂ E,M IX are not similar, but use the same zero padding. In simulations,
confidence regions for β̂V S tended to have less undercoverage than confidence

regions for β̂ M IX .

4.6.1 The Parametric Bootstrap

The parametric bootstrap generates Y ∗j = (Yi∗ ) from a parametric distribu-



tion. Then regress Y ∗j on X to get β̂ j for j = 1, ..., B. Consider the paramet-
ric bootstrap for the MLR model with Y ∗ ∼ Nn (X β̂, σ̂n2 I) ∼ Nn (HY , σ̂n2 I)
where we are not assuming that the ei ∼ N (0, σ 2 ), and
n
1 X 2
σ̂n2 = M SE = ri
n−p
i=1

where the residuals are from the full OLS model. Then M SE is a n con-
sistent estimator of σ 2 under mild conditions by Su and Cook (2012). Hence

Y ∗ = X β̂ OLS + e∗

where the e∗i are iid N (0, M SE) and β̂ = β̂ OLS .


∗ ∗
Thus β̂ I = (X TI X I )−1 X TI Y ∗ ∼ NaI (β̂ I , σ̂n2 (X TI X I )−1 ) since E(β̂ I ) =

(X TI X I )−1 X TI H Y = β̂I because HX I = X I , and Cov(β̂ I ) = σ̂n2 (X TI X I )−1 .
Hence √ ∗ D
n(β̂ I − β̂ I ) ∼ NaI (0, nσ̂n2 (X TI X I )−1 ) → NaI (0, V I )
as n, B → ∞ if S ⊆ I.
4.6 Bootstrapping Variable Selection 189

4.6.2 The Residual Bootstrap

The residual bootstrap is often useful for additive error regression models of
the form Yi = m(xi ) + ei = m̂(xi ) + ri = Ŷi + ri for i = 1, ..., n where the
ith residual ri = Yi − Ŷi . Let Y = (Y1 , ..., Yn)T , r = (r1 , ..., rn)T , and let
X be an n × p matrix with ith row xTi . Then the fitted values Ŷi = m̂(xi ),
and the residuals are obtained by regressing Y on X. Here the errors ei are
iid, and it would be useful to be able to generate B iid samples e1j , ..., enj
from the distribution of ei where j = 1, ..., B. If the m(xi ) were known, then
we could form a vector Y j where the ith element Yij = m(xi ) + eij for
∗ ∗
i = 1, ..., n. Then regress Y j on X. Instead, draw samples r1j , ..., rnj with

replacement from the residuals, then form a vector Y j where the ith element
Yij∗ = m̂(xi ) + rij

for i = 1, ..., n. Then regress Y ∗j on X. If the residuals do

not sum to 0, it is often useful to replace ri by i = ri − r, and rij by ∗ij .

Example 4.6. For multiple linear regression, Yi = xTi β + ei is written in


matrix form as Y = Xβ + e. Regress Y on X to obtain β̂, r, and Ŷ with
ith element Ŷi = m̂(xi ) = xTi β̂. For j = 1, ..., B, regress Y ∗j on X to form
∗ ∗
β̂1,n , ..., β̂B,n using the residual bootstrap.
Now examine the OLS model. Let Ŷ = Ŷ OLS = X β̂ OLS = HY be the
fitted values from the OLS full model. Let r W denote an n × 1 random vector
of elements selected with replacement from the OLS full model residuals.
Following Freedman (1981) and Efron (1982, p. 36),

Y ∗ = X β̂ OLS + rW

follows a standard linear model where the elements riW of rW are iid from
the empirical distribution of the OLS full model residuals ri . Hence
n n
1X 1X 2 n−p
E(riW ) = ri = 0, V (riW ) = σn2 = r = M SE,
n i=1 n i=1 i n

E(r W ) = 0, and Cov(Y ∗ ) = Cov(r W ) = σn2 I n .


∗ ∗
Let β̂ = β̂ OLS . Then β̂ = (X T X)−1 X T Y ∗ with Cov(β̂ ) = σn2 (X T X)−1 =
n−p ∗
M SE(X T X)−1 , and E(β̂ ) = (X T X)−1 X T E(Y ∗ ) =
n
(X T X)−1 X T H Y = β̂ = β̂n since HX = X. The expectations are with
respect to the bootstrap distribution where Ŷ acts as a constant.
For the OLS estimator β̂ = β̂ OLS , the estimated covariance matrix
d β̂
of β̂ OLS is Cov( T −1
OLS ) = M SE(X X) . The sample covariance matrix
∗ ∗
of the β̂ is estimating Cov(β̂ ) r as B → ∞. Hence the residual boot-
n−p
strap standard error SE(β̂i∗ ) ≈ SE(β̂i ) for i = 1, ..., p where
n
190 4 Prediction and Variable Selection When n >> p

β̂OLS = β̂ = (β̂1 , ..., β̂p)T . The LS CLT Theorem 2.26 says


√ D d β̂ OLS )) ∼ Np (0, σ 2 W )
n(β̂ − β) → Np (0, lim nCov(
n→∞

where n(X T X)−1 → W . Since Y ∗ = X β̂ OLS +rW follows a standard linear


model, it may not be surprising that
√ ∗ D ∗
d β̂ )) ∼ Np (0, σ 2 W ).
n(β̂ − β̂ OLS ) → Np (0, lim nCov(
n→∞

See Freedman (1981).



For the above residual bootstrap, β̂*_{Ij} = (X_{Ij}^T X_{Ij})^{-1} X_{Ij}^T Y* = D_j Y*
with Cov(β̂*_{Ij}) = σ_n² (X_{Ij}^T X_{Ij})^{-1} and E(β̂*_{Ij}) = (X_{Ij}^T X_{Ij})^{-1} X_{Ij}^T E(Y*) =
(X_{Ij}^T X_{Ij})^{-1} X_{Ij}^T HY = β̂_{Ij} since HX_{Ij} = X_{Ij}. The expectations are with
respect to the bootstrap distribution where Ŷ acts as a constant.
Thus for S ⊆ I and the residual bootstrap using residuals from the full
OLS model, E(β̂*_I) = β̂_I and n Cov(β̂*_I) = n[(n − p)/n] σ̂_n² (X_I^T X_I)^{-1} →_P V_I
as n → ∞ with σ̂_n² = MSE. Hence β̂*_I − β̂_I →_P 0 as n → ∞ by Lai et al.
(1979). Note that β̂_I = β̂_{I,n} and β̂*_I = β̂*_{I,n} depend on n.

Remark 4.20. The Cauchy-Schwarz inequality says |a^T b| ≤ ‖a‖ ‖b‖.
Suppose √n(β̂ − β) = O_P(1) is bounded in probability. This will occur if
√n(β̂ − β) →_D N_p(0, Σ), e.g. if β̂ is the OLS estimator. Then

|ri − ei| = |Yi − x_i^T β̂ − (Yi − x_i^T β)| = |x_i^T (β̂ − β)|.

Hence

√n max_{i=1,...,n} |ri − ei| ≤ (max_{i=1,...,n} ‖xi‖) ‖√n(β̂ − β)‖ = O_P(1)

since max ‖xi‖ = O_P(1) or there is extrapolation. Hence OLS residuals behave
well if the zero mean error distribution of the iid ei has a finite variance σ².

Remark 4.21. Note that both the residual bootstrap and parametric
bootstrap for OLS are robust to the unknown error distribution of the iid ei.
For the residual bootstrap with S ⊆ I where I is not the full model, it may
not be true that √n(β̂*_I − β̂_I) →_D N_{a_I}(0, V_I) as n, B → ∞. For the model
Y = Xβ + e, the ei are iid from a distribution that does not depend on n,
and β_E = 0. For Y* = X β̂ + r^W, the distribution of the r_i^W depends on n
and β̂_E ≠ 0 although √n β̂_E = O_P(1).

4.6.3 The Nonparametric Bootstrap

The nonparametric bootstrap (also called the empirical bootstrap, naive


bootstrap, the pairwise bootstrap, and the pairs bootstrap) draws a sample
of n cases (Y*_i, x*_i) with replacement from the n cases (Yi, xi), and
regresses the Y*_i on the x*_i to get β̂*_{VS,1}, and then draws another sample to get
β̂*_{MIX,1}. This process is repeated B times to get the two bootstrap samples
for i = 1, ..., B.
Then for the full model,

Y ∗ = X ∗ β̂ OLS + rW

and for a submodel I,

Y* = X*_I β̂_{I,OLS} + r^W_I.

Freedman (1981) showed that under regularity conditions for the OLS MLR
model, √n(β̂* − β̂) →_D N_p(0, σ² W) ∼ N_p(0, V). Hence if S ⊆ I_j,

√n(β̂*_I − β̂_I) →_D N_{a_I}(0, V_I)

as n, B → ∞. (Treat I_j as if I_j is the full model.)


One set of regularity conditions is that the MLR model holds, and if xi =
(1 u_i^T)^T, then the wi = (Yi u_i^T)^T are iid from some population with a
nonsingular covariance matrix. Since cases are sampled with replacement, we
have Y*_i = x*_i^T β + e*_i for i = 1, ..., n. In matrix form Y* = X* β + e*, but
X* is a random matrix and the e*_i are not iid from the distribution of the ei
since the e*_i are “sampled with replacement” from the unknown e1, ..., en.
The nonparametric bootstrap uses w*_1, ..., w*_n where the w*_i are sampled
with replacement from w1, ..., wn. By Example 4.2, E(w*) = w̄, and

Cov(w*) = (1/n) Σ_{i=1}^n (wi − w̄)(wi − w̄)^T = Σ̃_w = [ S̃²_Y  Σ̃_{Yu} ; Σ̃_{uY}  Σ̃_u ],

a 2 × 2 block matrix.

Note that β̂ is a constant with respect to the bootstrap distribution. Assume
all inverse matrices exist. Then by Theorem 2.20,

β̂* = (β̂*_1, (β̂*_u)^T)^T = (Ȳ* − (β̂*_u)^T ū*, ((Σ̃*_u)^{-1} Σ̃*_{uY})^T)^T →_P (Ȳ − β̂_u^T ū, (Σ̃_u^{-1} Σ̃_{uY})^T)^T = (β̂_1, β̂_u^T)^T = β̂

as B → ∞. This result suggests that the nonparametric bootstrap for OLS
MLR might work under milder regularity conditions than the wi being iid
from some population with a nonsingular covariance matrix.
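A minimal R sketch of this pairs bootstrap for the OLS full model (the linmodpack function rowboot does the empirical nonparametric bootstrap); the name pairsboot_mlr is illustrative.

pairsboot_mlr <- function(x, y, B = 1000){
  # nonparametric (pairs) bootstrap: draw n cases (Y_i, x_i) with
  # replacement and refit OLS on each bootstrap sample
  x <- as.matrix(x); n <- length(y)
  betas <- matrix(0, nrow = B, ncol = ncol(x) + 1)
  for(j in 1:B){
    id <- sample(1:n, n, replace = TRUE)            # rows of (Y*, X*)
    betas[j, ] <- lsfit(x[id, , drop = FALSE], y[id])$coef
  }
  betas
}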

4.6.4 Bootstrapping OLS Variable Selection

Undercoverage can occur if the bootstrap sample data cloud is less variable
than the iid data cloud, e.g., if (n − p)/n is not close to one. Coverage can be
higher than the nominal coverage for two reasons: i) the bootstrap data cloud
is more variable than the iid data cloud of T1 , ..., TB, and ii) zero padding.
To see the effect of zero padding, consider H0: Aβ = β_O = 0 where
β_O = (β_{i1}, ..., β_{ig})^T and O ⊆ E in (4.1) so that H0 is true. Suppose a nominal
95% confidence region is used and UB = 0.96. Hence the confidence region
(4.32) or (4.33) covers at least 96% of the bootstrap sample. If β̂*_{O,j} = 0 for
more than 4% of the β̂*_{O,1}, ..., β̂*_{O,B}, then 0 is in the confidence region and the
bootstrap test fails to reject H0. If this occurs for each run in the simulation,
then the observed coverage will be 100%.

Now suppose β̂*_{O,j} = 0 for j = 1, ..., B. Then S*_T is singular, but the
singleton set {0} is the large sample 100(1 − δ)% confidence region (4.32),
(4.33), or (4.34) for β_O and δ ∈ (0, 1), and the pvalue for H0: β_O = 0 is
one. (This result holds since {0} contains 100% of the β̂*_{O,j} in the bootstrap
sample.) For large sample theory tests, the pvalue estimates the population
pvalue. Let I denote the other predictors in the model so β = (β_I^T, β_O^T)^T. For
the Imin model from forward selection, there may be strong evidence that xO
is not needed in the model given xI is in the model if the “100%” confidence
region is {0}, n ≥ 20p, B ≥ 50p, and the error distribution is unimodal and
not highly skewed. (Since the pvalue is one, this technique may be useful
for data snooping: applying OLS theory to submodel I may have negligible
selection bias.)
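The zero padding check described above can be carried out directly from the B × p bootstrap matrix by counting the bootstrap samples where the tested coefficients are all zero. A minimal sketch, assuming betas holds the bootstrap estimators as rows and Ocols indexes the coefficients in β_O; the name zeroprop is illustrative.

zeroprop <- function(betas, Ocols){
  # proportion of bootstrap samples with beta-hat*_O identically 0
  mean(apply(abs(as.matrix(betas)[, Ocols, drop = FALSE]), 1, max) == 0)
}
# e.g. zeroprop(outvs$betas, 2:4); if this proportion exceeds 1 minus the
# coverage of the confidence region, then 0 is in the region and the
# bootstrap test fails to reject H0: beta_O = 0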

Remark 4.22. The assumption ρjn → πj as n → ∞ seems to be the most


reasonable for the residual bootstrap since |ri − ei | → 0 fast by Remark 4.20.
The assumption may not hold for the parametric bootstrap of Section 4.6.1 if
the ei are not iid N (0, σ 2 ). Another way to look at the bootstrap confidence
region for OLS variable selection estimators is to consider the estimator T2,n
that chooses Ij with probability equal to the observed bootstrap proportion
ρ̂jn . The bootstrap sample T1∗ , ..., TB∗ tends to be slightly more variable than
an iid sample T2,1 , ..., T2,B, and the geometric argument suggests that the
large sample coverage of the nominal 100(1 − δ)% confidence region will be
at least as large as the nominal coverage 100(1 − δ)%.

Remark 4.23. Note that there are several important variable selection
models, including the model given by Equation (4.1) where xT β = xTS β S .
Another model is xT β = xTSi β Si for i = 1, ..., K. Then there are K ≥ 2
competing “true” nonnested submodels where β Si is aSi × 1. For example,
suppose the K = 2 models have predictors x1 , x2 , x3 for S1 and x1 , x2 , x4 for
S2 . Then x3 and x4 are likely to be selected and omitted often by forward
selection for the B bootstrap samples. Hence omitting all predictors xi that
have β̂*_{ij} = 0 for at least one of the bootstrap samples j = 1, ..., B could
result in underfitting, e.g. using just x1 and x2 in the above K = 2 example.


If n and B are large enough, the singleton set {0} could still be the “100%”
confidence region for a vector βO . See Remark 4.6.
Suppose the predictors xi have been standardized. Then another important
regression model has the βi taper off rapidly, but no coefficients are equal to
zero. For example, βi = e^{−i} for i = 1, ..., p.

Example 4.7. Cook and Weisberg (1999, pp. 351, 433, 447) gives a data
set on 82 mussels sampled off the coast of New Zealand. Let the response
variable be the logarithm log(M ) of the muscle mass, and the predictors are
the length L and height H of the shell in mm, the logarithm log(W ) of the shell
width W, the logarithm log(S) of the shell mass S, and a constant. Inference
for the full model is shown below along with the shorth(c) nominal 95%
confidence intervals for βi computed using the nonparametric and residual
bootstraps. As expected, the residual bootstrap intervals are close to the
classical least squares confidence intervals ≈ β̂i ± 1.96SE(β̂i ).
large sample full model inference
Est. SE t Pr(>|t|) nparboot resboot
int -1.249 0.838 -1.49 0.14 [-2.93,-0.093][-3.045,0.473]
L -0.001 0.002 -0.28 0.78 [-0.005,0.003][-0.005,0.004]
logW 0.130 0.374 0.35 0.73 [-0.457,0.829][-0.703,0.890]
H 0.008 0.005 1.50 0.14 [-0.002,0.018][-0.003,0.016]
logS 0.640 0.169 3.80 0.00 [ 0.244,1.040][ 0.336,1.012]
output and shorth intervals for the min Cp submodel FS
Est. SE 95% shorth CI 95% shorth CI
int -0.9573 0.1519 [-3.294, 0.495] [-2.769, 0.460]
L 0 [-0.005, 0.004] [-0.004, 0.004]
logW 0 [ 0.000, 1.024] [-0.595, 0.869]
H 0.0072 0.0047 [ 0.000, 0.016] [ 0.000, 0.016]
logS 0.6530 0.1160 [ 0.322, 0.901] [ 0.324, 0.913]
for forward selection for all subsets
The minimum Cp model from all subsets variable selection and forward
selection both used a constant, H, and log(S). The shorth(c) nominal 95%
confidence intervals for βi using the residual bootstrap are shown. Note that
the intervals for H are right skewed and contain 0 when closed intervals
are used instead of open intervals. Some least squares output is shown, but
should only be used for inference if the model was selected before looking at
the data.
It was expected that log(S) may be the only predictor needed, along with
a constant, since log(S) and log(M ) are both log(mass) measurements and
likely highly correlated. Hence we want to test H0 : β2 = β3 = β4 = 0 with
the Imin model selected by all subsets variable selection. (Of course this test
would be easy to do with the full model using least squares theory.) Then
H0: Aβ = (β2, β3, β4)^T = 0. Using the prediction region method with the
full model gave an interval [0, 2.930] with D0 = 1.641. Note that √χ²_{3,0.95} =
2.795. So fail to reject H0. Using the prediction region method with the Imin
variable selection model had [0, D_(U_B)] = [0, 3.293] while D0 = 1.134. So fail
to reject H0.
Then we redid the bootstrap with the full model and forward selection. The
full model had [0, D(UB ) ] = [0, 2.908] with D0 = 1.577. So fail to reject H0 .
Using the prediction region method with the Imin forward selection model
had [0, D(UB ) ] = [0, 3.258] while D0 = 1.245. So fail to reject H0 . The ratio of
the volumes of the bootstrap confidence regions for this test was 0.392. (Use
(4.35) with S ∗T and D from forward selection for the numerator, and from
the full model for the denominator.) Hence the forward selection bootstrap
test was more precise than the full model bootstrap test. Some R code used
to produce the above output is shown below.
library(leaps)
y <- log(mussels[,5]); x <- mussels[,1:4]
x[,4] <- log(x[,4]); x[,2] <- log(x[,2])
out <- regboot(x,y,B=1000)
tem <- rowboot(x,y,B=1000)
outvs <- vselboot(x,y,B=1000) #get bootstrap CIs
outfs <- fselboot(x,y,B=1000) #get bootstrap CIs
apply(out$betas,2,shorth3);
apply(tem$betas,2,shorth3);
apply(outvs$betas,2,shorth3) #for all subsets
apply(outfs$betas,2,shorth3) #for forward selection
ls.print(outvs$full)
ls.print(outvs$sub)
ls.print(outfs$sub)
#test if beta_2 = beta_3 = beta_4 = 0
Abeta <- out$betas[,2:4] #full model
#prediction region method with residual bootstrap
out<-predreg(Abeta)
Abeta <- outvs$betas[,2:4]
#prediction region method with Imin all subsets
outvs <- predreg(Abeta)
Abeta <- outfs$betas[,2:4]
#prediction region method with Imin forward sel.
outfs<-predreg(Abeta)
#ratio of volumes for forward selection and full model
(sqrt(det(outfs$cov))*outfs$D0^3)/(sqrt(det(out$cov))*out$D0^3)
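The linmodpack function predreg used above applies the nonparametric prediction region to the bootstrap sample. A minimal sketch of that computation is given below; the name predregtest and the use of a sample quantile for the cutoff are illustrative simplifications, not the linmodpack implementation.

predregtest <- function(Tstar, theta0 = rep(0, ncol(Tstar)), delta = 0.05){
  # prediction region method: reject H0: theta = theta0 if
  # D^2(theta0; Tbar*, S*_T) exceeds the cutoff D^2_(U_B)
  Tstar <- as.matrix(Tstar); B <- nrow(Tstar); g <- ncol(Tstar)
  Tbar <- colMeans(Tstar); ST <- cov(Tstar)
  D2 <- mahalanobis(Tstar, Tbar, ST)      # D_i^2 for the bootstrap sample
  qB <- min(1 - delta/2, 1 - delta + 10 * delta * g/B)
  if(delta > 0.1) qB <- min(1 - delta + 0.05, 1 - delta + g/B)
  if(qB < 1 - delta + 0.001) qB <- 1 - delta
  cutoff <- as.numeric(quantile(D2, qB, type = 1))
  D0sq <- mahalanobis(rbind(theta0), Tbar, ST)
  list(D0 = sqrt(D0sq), cutoff = sqrt(cutoff), reject = (D0sq > cutoff))
}
# e.g. predregtest(outfs$betas[,2:4]) for H0: (beta_2, beta_3, beta_4) = 0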
Example 4.8. Consider the Gladstone (1905) data set that has 12 vari-
ables on 267 persons after death. The response variable was brain weight.
Head measurements were breadth, circumference, head height, length, and
size as well as cephalic index and brain weight. Age, height, and two categor-

ical variables ageclass (0: under 20, 1: 20-45, 2: over 45) and sex were also
given. The eight predictor variables shown in the output were used.
Output is shown below for the full model and the bootstrapped minimum
Cp forward selection estimator. Note that the shorth intervals for length and
sex are quite long. These variables are often in and often deleted from the
bootstrap forward selection. Model I_I is the model with the fewest predictors
such that C_p(I_I) ≤ C_p(I_min) + 1. For this data set, I_I = I_min. The bootstrap
CIs differ due to different random seeds.
large sample full model inference for Ex. 4.8
Estimate SE t Pr(>|t|) 95% shorth CI
Int -3021.255 1701.070 -1.77 0.077 [-6549.8,322.79]
age -1.656 0.314 -5.27 0.000 [ -2.304,-1.050]
breadth -8.717 12.025 -0.72 0.469 [-34.229,14.458]
cephalic 21.876 22.029 0.99 0.322 [-20.911,67.705]
circum 0.852 0.529 1.61 0.109 [ -0.065, 1.879]
headht 7.385 1.225 6.03 0.000 [ 5.138, 9.794]
height -0.407 0.942 -0.43 0.666 [ -2.211, 1.565]
len 13.475 9.422 1.43 0.154 [ -5.519,32.605]
sex 25.130 10.015 2.51 0.013 [ 6.717,44.19]
output and shorth intervals for the min Cp submodel
Estimate SE t Pr(>|t|) 95% shorth CI
Int -1764.516 186.046 -9.48 0.000 [-6151.6,-415.4]
age -1.708 0.285 -5.99 0.000 [ -2.299,-1.068]
breadth 0 [-32.992, 8.148]
cephalic 5.958 2.089 2.85 0.005 [-10.859,62.679]
circum 0.757 0.512 1.48 0.140 [ 0.000, 1.817]
headht 7.424 1.161 6.39 0.000 [ 5.028, 9.732]
height 0 [ -2.859, 0.000]
len 6.716 1.466 4.58 0.000 [ 0.000,30.508]
sex 25.313 9.920 2.55 0.011 [ 0.000,42.144]
output and shorth for I_I model
Estimate Std.Err t-val Pr(>|t|) 95% shorth CI
Int -1764.516 186.046 -9.48 0.000 [-6104.9,-778.2]
age -1.708 0.285 -5.99 0.000 [ -2.259,-1.003]
breadth 0 [-31.012, 6.567]
cephalic 5.958 2.089 2.85 0.005 [ -6.700,61.265]
circum 0.757 0.512 1.48 0.140 [ 0.000, 1.866]
headht 7.424 1.161 6.39 0.000 [ 5.221,10.090]
height 0 [ -2.173, 0.000]
len 6.716 1.466 4.58 0.000 [ 0.000,28.819]
sex 25.313 9.920 2.55 0.011 [ 0.000,42.847]
The R code used to produce the above output is shown below. The last
four commands are useful for examining the variable selection output.
x<-cbrainx[,c(1,3,5,6,7,8,9,10)]

y<-cbrainy
library(leaps)
out <- regboot(x,y,B=1000)
outvs <- fselboot(x,cbrainy) #get bootstrap CIs,
apply(out$betas,2,shorth3)
apply(outvs$betas,2,shorth3)
ls.print(outvs$full)
ls.print(outvs$sub)
outvs <- modIboot(x,cbrainy) #get bootstrap CIs,
apply(outvs$betas,2,shorth3)
ls.print(outvs$sub)
tem<-regsubsets(x,y,method="forward")
tem2<-summary(tem)
tem2$which
tem2$cp

4.6.5 Simulations

For variable selection with the p × 1 vector β̂_{Imin,0}, consider testing H0:
Aβ = θ_0 versus H1: Aβ ≠ θ_0 with θ = Aβ where often θ_0 = 0. Then let
T_n = Aβ̂_{Imin,0} and let T*_i = Aβ̂*_{Imin,0,i} for i = 1, ..., B. The shorth estimator
can be applied to a bootstrap sample β̂*_{i1}, ..., β̂*_{iB} to get a confidence interval
for βi. Here T_n = β̂_i and θ = β_i.
Assume p is fixed, n ≥ 20p, and that the error distribution is unimodal
and not highly skewed. Then the plotted points in the response and residual
plots should scatter in roughly even bands about the identity line (with unit
slope and zero intercept) and the r = 0 line, respectively. See Figure 1.1. If
the error distribution is skewed or multimodal, then much larger sample sizes
may be needed.
Next, we describe a small simulation study that was done using B =
max(1000, n/25, 50p) and 5000 runs. The simulation used p = 4, 6, 7, 8, and
10; n = 25p and 50p; ψ = 0, 1/√p, and 0.9; and k = 1 and p − 2 where
k and ψ are defined in the following paragraph. In the simulations, we use
θ = Aβ = β_i, θ = Aβ = β_S = 1, and θ = Aβ = β_E = 0.
Let x = (1 u^T)^T where u is the (p − 1) × 1 vector of nontrivial predictors.
In the simulations, for i = 1, ..., n, we generated w_i ∼ N_{p−1}(0, I) where the
m = p − 1 elements of the vector w_i are iid N(0,1). Let the m × m matrix
A = (a_ij) with a_ii = 1 and a_ij = ψ where 0 ≤ ψ < 1 for i ≠ j. Then the
vector u_i = A w_i so that Cov(u_i) = Σ_u = AA^T = (σ_ij) where the diagonal
entries σ_ii = [1 + (m − 1)ψ²] and the off diagonal entries σ_ij = [2ψ + (m − 2)ψ²].
Hence the correlations are Cor(x_i, x_j) = ρ = (2ψ + (m − 2)ψ²)/(1 + (m − 1)ψ²)
for i ≠ j where x_i and x_j are nontrivial predictors. If ψ = 1/√(cp),
then ρ → 1/(c + 1) as p → ∞ where c > 0. As ψ gets close to 1, the
predictor vectors cluster about the line in the direction of (1, ..., 1)^T. Let
Y_i = 1 + 1x_{i,2} + · · · + 1x_{i,k+1} + e_i for i = 1, ..., n. Hence β = (1, ..., 1, 0, ..., 0)^T
with k + 1 ones and p − k − 1 zeros. The zero mean errors e_i were iid from
five distributions: i) N(0,1), ii) t_3, iii) EXP(1) − 1, iv) uniform(−1, 1), and v)
0.9 N(0,1) + 0.1 N(0,100). Only distribution iii) is not symmetric.
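A minimal R sketch of this simulation design for generating the predictors and response (shown here with N(0,1) errors); the name simdat is illustrative.

simdat <- function(n, p, k, psi){
  # u_i = A w_i with a_ii = 1 and a_ij = psi gives
  # cor(x_i, x_j) = (2 psi + (m-2) psi^2)/(1 + (m-1) psi^2), m = p - 1
  m <- p - 1
  A <- matrix(psi, nrow = m, ncol = m); diag(A) <- 1
  w <- matrix(rnorm(n * m), nrow = n, ncol = m)    # iid N(0,1) entries
  u <- w %*% t(A)                                  # nontrivial predictors
  beta <- c(rep(1, k + 1), rep(0, m - k))          # k+1 ones, p-k-1 zeros
  y <- as.vector(cbind(1, u) %*% beta + rnorm(n))  # N(0,1) errors
  list(x = u, y = y, beta = beta)
}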
When ψ = 0, the full model least squares confidence intervals for βi should
have length near 2 t_{96,0.975} σ/√n ≈ 2(1.96)σ/10 = 0.392σ when n = 100 and
the iid zero mean errors have variance σ². The simulation computed the Frey
shorth(c) interval for each βi and used bootstrap confidence regions to test
H0: β_S = 1 (whether the first k + 1 βi = 1) and H0: β_E = 0 (whether the last
p − k − 1 βi = 0). The nominal coverage was 0.95 with δ = 0.05. Observed
coverage between 0.94 and 0.96 suggests coverage is close to the nominal
value.
The regression models used the residual bootstrap on the forward selection
estimator β̂ Imin ,0 . Table 4.2 gives results for when the iid errors ei ∼ N (0, 1)
with n = 100, p = 4, and k = 1. Table 4.2 shows two rows for each model
giving the observed confidence interval coverages and average lengths of the
confidence intervals. The term “reg” is for the full model regression, and the
term “vs” is for forward selection. The last six columns give results for the
tests. The terms pr, hyb, and br are for the prediction region method (4.32),
hybrid region (4.34), and Bickel and Ren region (4.33). The 0 indicates the
test was H0 : β E = 0, while the 1 indicates that the test was H0 : βS = 1.
The length and coverage = P(fail to reject H0 ) for the interval [0, D(UB ) ] or
[0, D_(U_B,T)] where D_(U_B) or D_(U_B,T) is the cutoff for the confidence region.
The cutoff will often be near √χ²_{g,0.95} if the statistic T is asymptotically normal.
Note that √χ²_{2,0.95} = 2.448 is close to 2.45 for the full model regression
bootstrap tests.
Volume ratios of the three confidence regions can be compared using (4.35),
but there is not enough information in Table 4.2 to compare the volume of
the confidence region for the full model regression versus that for the forward
selection regression since the two methods have different determinants |S ∗T |.
The inference for forward selection was often as precise or more precise
than the inference for the full model. The coverages were near 0.95 for the
regression bootstrap on the full model, although there was slight undercov-
erage for the tests since (n − p)/n = 0.96 when n = 25p. Suppose ψ = 0.
Then from Section 4.2, β̂ S has the same limiting distribution for Imin and
the full model. Note that the average lengths and coverages were similar for
the full model and forward selection Imin for β1 , β2 , and β S = (β1 , β2 )T .
Forward selection inference was more precise for βE = (β3 , β4 )T . The Bickel
and Ren (4.33) cutoffs and coverages were at least as high as those of the
hybrid region (4.34).
For ψ > 0 and Imin , the coverages for the βi corresponding to β S were
near 0.95, but the average length could be shorter since Imin tends to have

Table 4.2 Bootstrapping OLS Forward Selection with Cp , ei ∼ N (0, 1)


ψ β1 β2 βp−1 βp pr0 hyb0 br0 pr1 hyb1 br1
reg,0 0.946 0.950 0.947 0.948 0.940 0.941 0.941 0.937 0.936 0.937
len 0.396 0.399 0.399 0.398 2.451 2.451 2.452 2.450 2.450 2.451
vs,0 0.948 0.950 0.997 0.996 0.991 0.979 0.991 0.938 0.939 0.940
len 0.395 0.398 0.323 0.323 2.699 2.699 3.002 2.450 2.450 2.457
reg,0.5 0.946 0.944 0.946 0.945 0.938 0.938 0.938 0.934 0.936 0.936
len 0.396 0.661 0.661 0.661 2.451 2.451 2.452 2.451 2.451 2.452
vs,0.5 0.947 0.968 0.997 0.998 0.993 0.984 0.993 0.955 0.955 0.963
len 0.395 0.658 0.537 0.539 2.703 2.703 2.994 2.461 2.461 2.577
reg,0.9 0.946 0.941 0.944 0.950 0.940 0.940 0.940 0.935 0.935 0.935
len 0.396 3.257 3.253 3.259 2.451 2.451 2.452 2.451 2.451 2.452
vs,0.9 0.947 0.968 0.994 0.996 0.992 0.981 0.992 0.962 0.959 0.970
len 0.395 2.751 2.725 2.735 2.716 2.716 2.971 2.497 2.497 2.599

less multicollinearity than the full model. For ψ ≥ 0, the Imin coverages were
higher than 0.95 for β3 and β4 and for testing H0: β_E = 0 since zeros often
occurred for β̂*_j for j = 3, 4. The average CI lengths were shorter for Imin
than for the OLS full model for β3 and β4. Note that for Imin, the coverage
for testing H0: β_S = 1 was higher than that for the OLS full model.

Table 4.3 Bootstrap CIs with Cp , p = 10, k = 8, ψ = 0.9, error type v)


n β1 β2 β3 β4 β5 β6 β7 β8 β9 β10
250 0.945 0.824 0.822 0.827 0.827 0.824 0.826 0.817 0.827 0.999
shlen 0.825 6.490 6.490 6.482 6.485 6.479 6.512 6.496 6.493 6.445
250 0.946 0.979 0.980 0.985 0.981 0.983 0.983 0.977 0.983 0.998
prlen 0.807 7.836 7.850 7.842 7.830 7.830 7.851 7.840 7.839 7.802
250 0.947 0.976 0.978 0.984 0.978 0.978 0.979 0.973 0.980 0.996
brlen 0.811 8.723 8.760 8.765 8.736 8.764 8.745 8.747 8.753 8.756
2500 0.951 0.947 0.948 0.948 0.948 0.947 0.949 0.944 0.951 0.999
shlen 0.263 2.268 2.271 2.271 2.273 2.262 2.632 2.277 2.272 2.047
2500 0.945 0.961 0.959 0.955 0.960 0.960 0.961 0.958 0.961 0.998
prlen 0.258 2.630 2.639 2.640 2.632 2.632 2.641 2.638 2.642 2.517
2500 0.946 0.958 0.954 0.960 0.956 0.960 0.962 0.955 0.961 0.997
brlen 0.258 2.865 2.875 2.882 2.866 2.871 2.887 2.868 2.875 2.830
25000 0.952 0.940 0.939 0.935 0.940 0.942 0.938 0.937 0.942 1.000
shlen 0.083 0.809 0.808 0.806 0.805 0.807 0.808 0.808 0.809 0.224
25000 0.948 0.964 0.968 0.962 0.964 0.966 0.964 0.964 0.967 0.991
prlen 0.082 0.806 0.805 0.801 0.800 0.805 0.805 0.803 0.806 0.340
25000 0.949 0.969 0.972 0.968 0.967 0.971 0.969 0.969 0.973 0.999
brlen 0.082 0.810 0.810 0.805 0.804 0.809 0.810 0.808 0.810 0.317

Results for other values of n, p, k, and distributions of ei were similar. For


forward selection with ψ = 0.9 and Cp , the hybrid region (4.34) and shorth
confidence intervals occasionally had coverage less than 0.93. It was also rare
for the bootstrap to have one or more columns of zeroes so S ∗T was singular.

For error distributions i)-iv) and ψ = 0.9, sometimes the shorth CIs needed
n ≥ 100p for all p CIs to have good coverage. For error distribution v) and
ψ = 0.9, even larger values of n were needed. Confidence intervals based on
(4.32) and (4.33) worked for much smaller n, but tended to be longer than
the shorth CIs.
See Table 4.3 for one of the worst scenarios for the shorth, where shlen,
prlen, and brlen are for the average CI lengths based on the shorth, (4.32), and
(4.33), respectively. In Table 4.3, k = 8 and the two nonzero πj correspond
to the full model β̂ and β̂_{S,0}. Hence βi = 1 for i = 1, ..., 9 and β10 = 0.
Hence confidence intervals for β10 had the highest coverage and usually the
shortest average length (for i ≠ 1) due to zero padding. Theory in Section
4.2 showed that the CI lengths are proportional to 1/√n. When n = 25000,
the shorth CI uses the 95.16th percentile while CI (4.32) uses the 95.00th
percentile, allowing the average CI length of (4.32) to be shorter than that of
the shorth CI, but the distribution for β̂*_i is likely approximately symmetric
for i ≠ 10 since the average lengths of the three confidence intervals were
about the same for each i ≠ 10.
When BIC was used, undercoverage was a bit more common and severe,
and undercoverage occasionally occurred with regions (4.32) and (4.33). BIC
also occasionally had 100% coverage since BIC produces more zeroes than
Cp .
Some R code for the simulation is shown below.
record coverages and "lengths" for
b1, b2, bp-1, bp, pm0, hyb0, br0, pm1, hyb1, br1

regbootsim3(n=100,p=4,k=1,nruns=5000,type=1,psi=0)
$cicov
[1] 0.9458 0.9500 0.9474 0.9484 0.9400 0.9408 0.9410
0.9368 0.9362 0.9370
$avelen
[1] 0.3955 0.3990 0.3987 0.3982 2.4508 2.4508 2.4521
[8] 2.4496 2.4496 2.4508
$beta
[1] 1 1 0 0
$k
[1] 1
library(leaps)
vsbootsim4(n=100,p=4,k=1,nruns=5000,type=1,psi=0)
$cicov
[1] 0.9480 0.9496 0.9972 0.9958 0.9910 0.9786 0.9914
0.9384 0.9394 0.9402
$avelen
[1] 0.3954 0.3987 0.3233 0.3231 2.6987 2.6987 3.0020
[8] 2.4497 2.4497 2.4570

$beta
[1] 1 1 0 0
$k
[1] 1

4.7 Data Splitting

Data splitting is used for inference after model selection. Use a training set
to select a full model, and a validation set for inference with the selected full
model. Here p >> n is possible. See Chapter 6, Hurvich and Tsai (1990, p.
216) and Rinaldo et al. (2019). Typically when training and validation sets
are used, the training set is bigger than the validation set or half sets are
used, often causing large efficiency loss.
Let J be a positive integer and let ⌊x⌋ be the integer part of x, e.g.,
⌊7.7⌋ = 7. Initially divide the data into two sets H1 with n1 = ⌊n/(2J)⌋
cases and V1 with n − n1 cases. If the fitted model from H1 is not good
enough, randomly select n1 cases from V1 to add to H1 to form H2 . Let V2
have the remaining cases from V1 . Continue in this manner, possibly forming
sets (H1 , V1 ), (H2 , V2 ), ..., (HJ , VJ ) where Hi has ni = in1 cases. Stop when
Hd gives a reasonable model Id with ad predictors if d < J. Use d = J,
otherwise. Use the model Id as the full model for inference with the data in
Vd .
This procedure is simple for a fixed data set, but it would be good to
automate the procedure. Forward selection with the Chen and Chen (2008)
EBIC criterion and lasso are useful for finding a reasonable fitted model.
BIC and the Hurvich and Tsai (1989) AICC criterion can be useful if n ≥
max(2p, 10ad). For example, if n = 500000 and p = 90, using n1 = 900 would
result in a much smaller loss of efficiency than n1 = 250000.
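A minimal R sketch of this splitting scheme, assuming a user supplied function good_enough(fit) that decides whether the model fitted on H_d is reasonable; both split_seq and good_enough are illustrative names.

split_seq <- function(x, y, J = 10, good_enough, fitfun = lsfit){
  # H_1 gets floor(n/(2J)) cases; n_1 more cases are moved from V to H
  # until the fitted model is reasonable or d = J
  x <- as.matrix(x); n <- length(y); n1 <- floor(n/(2*J))
  H <- sample(1:n, n1)                    # cases in H_1
  for(d in 1:J){
    fit <- fitfun(x[H, , drop = FALSE], y[H])
    if(d == J || good_enough(fit)) break
    V <- setdiff(1:n, H)                  # remaining cases
    H <- c(H, sample(V, n1))              # H_{d+1} has (d+1) n_1 cases
  }
  list(H = H, V = setdiff(1:n, H), fit = fit)   # do inference with V_d
}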

4.8 Summary

1) A model for variable selection can be described by xT β = xTS β S +xTE β E =


xTS β S where x = (xTS , xTE )T is a p × 1 vector of predictors, xS is an aS × 1
vector, and xE is a (p−aS )×1 vector. Given that xS is in the model, β E = 0.
Assume p is fixed while n → ∞.
2) If β̂ I is a × 1, form the p × 1 vector β̂ I,0 from β̂ I by adding 0s cor-
responding to the omitted variables. For example, if p = 4 and β̂Imin =
(β̂1 , β̂3 )T , then β̂ Imin ,0 = (β̂1 , 0, β̂3 , 0)T . For the OLS model with S ⊆ I,
√n(β̂_I − β_I) →_D N_{a_I}(0, V_I) where (X_I^T X_I)/(n σ²) →_P V_I^{-1}.

3) Theorem 4.4, Variable Selection CLT. Assume P(S ⊆ Imin) → 1
as n → ∞, and let T_n = β̂_{Imin,0} and T_{jn} = β̂_{Ij,0}. Let T_n = T_{kn} = β̂_{Ik,0}
with probabilities π_{kn} where π_{kn} → π_k as n → ∞. Denote the π_k with
S ⊆ I_k by π_j. The other π_k = 0 since P(S ⊆ Imin) → 1 as n → ∞. Assume
√n(β̂_{Ij} − β_{Ij}) →_D N_{a_j}(0, V_j) and u_{jn} = √n(β̂_{Ij,0} − β) →_D u_j ∼ N_p(0, V_{j,0}).
a) Then
√n(β̂_{Imin,0} − β) →_D u
where the cdf of u is F_u(z) = Σ_j π_j F_{u_j}(z). Thus u is a mixture distribution
of the u_j with probabilities π_j, E(u) = 0, and Cov(u) = Σ_u = Σ_j π_j V_{j,0}.
b) Let A be a g × p full rank matrix with 1 ≤ g ≤ p. Then
√n(Aβ̂_{Imin,0} − Aβ) →_D Au = v
where Au has a mixture distribution of the Au_j ∼ N_g(0, AV_{j,0} A^T) with
probabilities π_j.

4) For h > 0, the hyperellipsoid {z : (z − T)^T C^{-1} (z − T) ≤ h²} =
{z : D²_z ≤ h²} = {z : D_z ≤ h}. A future observation (random vector) x_f is
in this region if D_{x_f} ≤ h. A large sample 100(1 − δ)% prediction region is a
set A_n such that P(x_f ∈ A_n) is eventually bounded below by 1 − δ as n → ∞
where 0 < δ < 1. A large sample 100(1 − δ)% confidence region for a vector of
parameters θ is a set A_n such that P(θ ∈ A_n) is eventually bounded below
by 1 − δ as n → ∞.
5) Let q_n = min(1 − δ + 0.05, 1 − δ + p/n) for δ > 0.1 and q_n =
min(1 − δ/2, 1 − δ + 10δp/n), otherwise. If q_n < 1 − δ + 0.001, set q_n = 1 − δ. If
(T, C) is a consistent estimator of (µ, dΣ), then {z : D_z(T, C) ≤ h} is a large
sample 100(1 − δ)% prediction region if h = D_(U_n) where D_(U_n) is the 100 q_n th
sample quantile of the D_i. The large sample 100(1 − δ)% nonparametric
prediction region {z : D²_z(x̄, S) ≤ D²_(U_n)} uses (T, C) = (x̄, S). We want
n ≥ 10p for good coverage and n ≥ 50p for good volume.
6) Consider testing H0: θ = θ_0 versus H1: θ ≠ θ_0 where θ_0 is a known
g × 1 vector. Make a confidence region and reject H0 if θ_0 is not in the
confidence region. Let q_B and U_B be as in 5) with n replaced by B and p
replaced by g. Let T̄* and S*_T be the sample mean and sample covariance
matrix of the bootstrap sample T*_1, ..., T*_B. a) The prediction region method
large sample 100(1 − δ)% confidence region for θ is {w : (w − T̄*)^T [S*_T]^{-1} (w −
T̄*) ≤ D²_(U_B)} = {w : D²_w(T̄*, S*_T) ≤ D²_(U_B)} where D²_(U_B) is computed from
D_i² = (T*_i − T̄*)^T [S*_T]^{-1} (T*_i − T̄*) for i = 1, ..., B. Note that the corresponding
test for H0: θ = θ_0 rejects H0 if (T̄* − θ_0)^T [S*_T]^{-1} (T̄* − θ_0) > D²_(U_B).
This procedure applies the nonparametric prediction region to the bootstrap
sample. b) The modified Bickel and Ren (2001) large sample 100(1 − δ)%
confidence region is {w : (w − T_n)^T [S*_T]^{-1} (w − T_n) ≤ D²_(U_B,T)} = {w :
D²_w(T_n, S*_T) ≤ D²_(U_B,T)} where the cutoff D²_(U_B,T) is the 100 q_B th sample
quantile of the D_i² = (T*_i − T_n)^T [S*_T]^{-1} (T*_i − T_n). c) The hybrid large sample
100(1 − δ)% confidence region: {w : (w − T_n)^T [S*_T]^{-1} (w − T_n) ≤ D²_(U_B)} =
{w : D²_w(T_n, S*_T) ≤ D²_(U_B)}.
If g = 1, confidence intervals can be computed without S*_T or D² for a),
b), and c).
For some data sets, S*_T may be singular due to one or more columns of
zeroes in the bootstrap sample for β1, ..., βp. The variables corresponding to
these columns are likely not needed in the model given that the other predictors
are in the model if n and B are large enough. Let β_O = (β_{i1}, ..., β_{ig})^T,
and consider testing H0: Aβ_O = 0. If Aβ̂*_{O,i} = 0 for greater than Bδ of the
bootstrap samples i = 1, ..., B, then fail to reject H0. (If S*_T is nonsingular,
the 100(1 − δ)% prediction region method confidence region contains 0.)
√ D
7) Theorem 4.7: Geometric Argument. Suppose √n(T_n − θ) →_D u
with E(u) = 0 and Cov(u) = Σ_u. Assume T_1, ..., T_B are iid with nonsingular
covariance matrix Σ_{T_n}. Then the large sample 100(1 − δ)% prediction region
R_p = {w : D²_w(T̄, S_T) ≤ D²_(U_B)} centered at T̄ contains a future value of
the statistic T_f with probability 1 − δ_B → 1 − δ as B → ∞. Hence the region
R_c = {w : D²_w(T_n, S_T) ≤ D²_(U_B)} is a large sample 100(1 − δ)% confidence
region for θ.
8) Applying the nonparametric prediction region (4.24) to the iid data
T_1, ..., T_B results in the 100(1 − δ)% confidence region {w : (w − T_n)^T S_T^{-1} (w −
T_n) ≤ D²_(U_B)(T_n, S_T)} where D²_(U_B)(T_n, S_T) is computed from the (T_i −
T_n)^T S_T^{-1} (T_i − T_n) provided the S_T = S_{T_n} are “not too ill conditioned.”
For OLS variable selection, assume there are two or more component clouds.
The bootstrap component data clouds have the same asymptotic covariance
matrix as the iid component data clouds, which are centered at θ. The jth
bootstrap component data cloud is centered at E(T*_{ij}) and often E(T*_{jn}) =
T_{jn}. Confidence region (4.32) is the prediction region (4.24) applied to the
bootstrap sample, and (4.32) is slightly larger in volume than (4.24) applied
to the iid sample, asymptotically. The hybrid region (4.34) shifts (4.32) to be
centered at T_n. Shifting the component clouds slightly and computing (4.24)
does not change the axes of the prediction region (4.24) much compared
to not shifting the component clouds. Hence by the geometric argument, we
expect (4.34) to have coverage at least as high as the nominal, asymptotically,
provided the S*_T are “not too ill conditioned.” The Bickel and Ren confidence
region (4.33) tends to have higher coverage and volume than (4.34). Since T̄*
tends to be closer to θ than T_n, (4.32) tends to have good coverage.
9) Suppose m independent large sample 100(1 − δ)% prediction regions
are made where x1 , ..., xn , xf are iid from the same distribution for each of
the m runs. Let Y count the number of times xf is in the prediction region.
Then Y ∼ binomial (m, 1 − δn ) where 1 − δn is the true coverage. Simulation
can be used to see if the true or actual coverage 1 − δn is close to the nominal
coverage 1 − δ. A prediction region with 1 − δn < 1 − δ is liberal and a region
with 1 − δn > 1 − δ is conservative. It is better to be conservative by 3% than
liberal by 3%. Parametric prediction regions tend to have large undercoverage


and so are too liberal. Similar definitions are used for confidence regions.
10) For the bootstrap, perform variable selection on Y*_i and X (or X*
for the nonparametric bootstrap), fit the model that minimizes the criterion,
and add 0s corresponding to the omitted variables, resulting in estimators
β̂*_1, ..., β̂*_B where β̂*_i = β̂*_{Imin,0,i}.
11) Let Z1 , ..., Zn be random variables, let Z(1) , ..., Z(n) be the order
statistics, and let c be a positive integer. Compute Z(c) − Z(1) , Z(c+1) −
Z(2), ..., Z(n) − Z(n−c+1) . Let shorth(c) = [Z(d) , Z(d+c−1) ] correspond to the
interval with the shortest length.
The large sample 100(1 − δ)% shorth(c) CI uses the interval [T*_(1), T*_(c)], [T*_(2),
T*_(c+1)], ..., [T*_(B−c+1), T*_(B)] of shortest length. Here c = min(B, ⌈B[1 − δ +
1.12√(δ/B)]⌉). The shorth CI is computed by applying the shorth PI to the
bootstrap sample.
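A minimal R sketch of this shorth(c) CI applied to a bootstrap sample T*_1, ..., T*_B (linmodpack's shorth3 also applies the Frey correction used when g = 1); the name shorthci is illustrative.

shorthci <- function(Tstar, delta = 0.05){
  # shortest interval [T*_(d), T*_(d+c-1)] containing
  # c = min(B, ceiling(B(1 - delta + 1.12 sqrt(delta/B)))) order statistics
  B <- length(Tstar)
  cc <- min(B, ceiling(B * (1 - delta + 1.12 * sqrt(delta/B))))
  Ts <- sort(Tstar)
  widths <- Ts[cc:B] - Ts[1:(B - cc + 1)]   # lengths of the candidate intervals
  d <- which.min(widths)
  c(Ts[d], Ts[d + cc - 1])
}
# e.g. apply(out$betas, 2, shorthci) gives a shorth CI for each beta_i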

4.9 Complements

This chapter followed Olive (2017b, ch. 5) and Pelawa Watagoda and Olive
(2019ab) closely. Also see Olive (2013a, 2018), Pelawa Watagoda (2017), and
Rathnayake and Olive (2019). For MLR, Olive (2017a: p. 123, 2017b: p. 176)
showed that β̂Imin ,0 is a consistent estimator. Olive (2014: p. 283, 2017ab,
2018) recommended using the shorth(c) estimator for the percentile method.
Olive (2017a: p. 128, 2017b: p. 181, 2018) showed that the prediction region
method can simulate well for the p × 1 vector β̂ Imin ,0 . Hastie et al. (2009, p.
57) noted that variable selection is a shrinkage estimator: the coefficients are
shrunk to 0 for the omitted variables.
Good references for the bootstrap include Efron (1979, 1982), Efron and
Hastie (2016, ch. 10–11), and Efron and Tibshirani (1993). Also see Chen
(2016) and Hesterberg (2014). One of the sufficient conditions for the boot-
strap confidence region is that T has a well behaved Hadamard derivative.
Fréchet differentiability implies Hadamard differentiability, and many statis-
tics are shown to be Hadamard differentiable in Bickel and Ren (2001), Clarke
(1986, 2000), Fernholtz (1983), Gill (1989), Ren (1991), and Ren and Sen
(1995). Bickel and Ren (2001) showed that their method can work when
Hadamard differentiability fails.
There is a massive literature on variable selection and a fairly large litera-
ture for inference after variable selection. See, for example, Leeb and Pötscher
(2005, 2006, 2008), Leeb et al. (2015), Tibshirani et al. (2016), and Tibshi-
rani et al. (2018). Knight and Fu (2000) have some results on the residual
bootstrap that uses residuals from one estimator, such as full model OLS,
but fit another estimator, such as lasso.

Inference techniques for the variable selection model, other than data split-
ting, have not had much success. For multiple linear regression, the methods
are often inferior to data splitting, often assume normality, or are asymptot-
ically equivalent to using the full model, or find a quantity to test that is not
Aβ. See Ewald and Schneider (2018). Berk et al. (2013) assumes normality,
needs p no more than about 30, assumes σ 2 can be estimated independently
of the data, and Leeb et al. (2015) say the method does not work. The
bootstrap confidence region (4.32) is centered at T̄* ≈ Σ_j ρ_{jn} T_{jn}, which is
closely related to a model averaging estimator. Wang and Zhou (2013) show
that the Hjort and Claeskens (2003) confidence intervals based on frequentist
model averaging are asymptotically equivalent to those obtained from the
full model. See Buckland et al. (1997) and Schomaker and Heumann (2014)
for standard errors when using the bootstrap or model averaging for linear
model confidence intervals.
Efron (2014) used the confidence interval T̄* ± z_{1−δ} SE(T̄*) assuming T̄*
is asymptotically normal and using delta method techniques, which require
nonsingular covariance matrices. There is not yet rigorous theory for this
method. Section 4.2 proved that T̄* is asymptotically normal under regularity
conditions: if √n(T_n − θ) →_D N_g(0, Σ_A) and √n(T*_i − T_n) →_D N_g(0, Σ_A),
then under regularity conditions √n(T̄* − θ) →_D N_g(0, Σ_A). If g = 1,
then the prediction region method large sample 100(1 − δ)% CI for θ has
P(θ ∈ [T̄* − a_(U_B), T̄* + a_(U_B)]) → 1 − δ as n → ∞. If the Frey CI also has
coverage converging to 1 − δ, then the two methods have the same asymptotic
length (scaled by multiplying by √n), since otherwise the shorter interval will
have lower asymptotic coverage.
For the mixture distribution with two or more component groups, √n(T_n −
θ) →_D v by Theorem 4.4 b). If √n(T*_i − c_n) →_D u then c_n must be a value
such as c_n = T̄*, c_n = Σ_j ρ_{jn} T_{jn}, or c_n = Σ_j π_j T_{jn}. Next we will examine
T̄*. If S ⊆ I_j, then √n(β̂_{Ij,0} − β) →_D N_p(0, V_{j,0}), and for the parametric
and nonparametric bootstrap, √n(β̂*_{Ij,0} − β̂_{Ij,0}) →_D N_p(0, V_{j,0}). Let T_n =
Aβ̂_{Imin,0} and T_{jn} = Aβ̂_{Ij,0} = AD_{j0} Y using notation from Section 4.6. Let
θ = Aβ. Hence from Section 4.5.3, √n(T̄*_j − T_{jn}) →_P 0. Assume ρ̂_{in} →_P ρ_i as
n → ∞. Then √n(T̄* − θ) =

Σ_i ρ̂_{in} √n(T̄*_i − θ) = Σ_j ρ̂_{jn} √n(T̄*_j − θ) + Σ_k ρ̂_{kn} √n(T̄*_k − θ)

= d_n + a_n where a_n →_P 0 since ρ_k = 0. Now

d_n = Σ_j ρ̂_{jn} √n(T̄*_j − T_{jn} + T_{jn} − θ) = Σ_j ρ̂_{jn} √n(T_{jn} − θ) + c_n

where c_n = o_P(1) since √n(T̄*_j − T_{jn}) = o_P(1). Hence under regularity
conditions, if √n(T̄* − θ) →_D w then Σ_j ρ_j √n(T_{jn} − θ) →_D w.
To examine the last term and w, let the n × 1 vector Y have characteristic
function φ_Y, E(Y) = Xβ, and Cov(Y) = σ² I. Let Z = (Y^T, ..., Y^T)^T be a
Jn × 1 vector with J copies of Y stacked into a vector. Let t = (t_1^T, ..., t_J^T)^T.
Then Z has characteristic function φ_Z(t) = φ_Y(Σ_{j=1}^J t_j) = φ_Y(s). Now
assume Y ∼ N_n(Xβ, σ² I). Then t^T Z = s^T Y ∼ N(s^T Xβ, σ² s^T s). Hence
Z has a multivariate normal distribution by Definition 1.23 with E(Z) =
((Xβ)^T, ..., (Xβ)^T)^T, and Cov(Z) a block matrix with J × J blocks each equal
to σ² I. Then

Σ_j ρ_j T_{jn} = Σ_j ρ_j AD_{j0} Y = BY ∼ N_g(θ, σ² BB^T) = N_g(θ, σ² Σ_j Σ_k ρ_j ρ_k AD_{j0} D_{k0}^T A^T)

since E(T_{jn}) = E(Aβ̂_{Ij,0}) = Aβ = θ if S ⊆ I_j. Since (T_{1n}^T, ..., T_{Jn}^T)^T =
diag(AD_{10}, ..., AD_{J0}) Z, then (T_{1n}^T, ..., T_{Jn}^T)^T is multivariate normal and

Σ_j ρ_j T_{jn} ∼ N_g[θ, Σ_j Σ_k π_j π_k Cov(T_{jn}, T_{kn})].

Now assume nD_{j0} D_{k0}^T →_P W_{jk} as n → ∞. Then

Σ_j ρ_j √n(T_{jn} − θ) →_D w ∼ N_g(0, σ² Σ_j Σ_k ρ_j ρ_k AW_{jk} A^T).

We conjecture that this result may hold under milder conditions than
Y ∼ N_n(Xβ, σ² I), but even the above results are not yet rigorous. If
√n(T_{jn} − θ) →_D w_j ∼ N_g(0, Σ_j), then a possibly poor approximation is
T̄* ≈ Σ_j ρ_j T_{jn} ≈ N_g[θ, Σ_j Σ_k ρ_j ρ_k Cov(T_{jn}, T_{kn})], and estimating
Σ_j Σ_k ρ_j ρ_k Cov(T_{jn}, T_{kn}) with delta method techniques may not be possible.
The double bootstrap technique may be useful. See Hall (1986) and Chang
and Hall (2015) for references. The double bootstrap for T̄* = T̄*_B says that
T_n = T̄* is a statistic that can be bootstrapped. Let B_d ≥ 50 g_max where
1 ≤ g_max ≤ p is the largest dimension of θ to be tested with the double
bootstrap. Draw a bootstrap sample of size B and compute T̄* = T*_1. Repeat
for a total of B_d times. Apply the confidence region (4.32), (4.33), or (4.34) to
the double bootstrap sample T*_1, ..., T*_{B_d}. If D_(U_{B_d}) ≈ D_(U_{B_d},T) ≈ √χ²_{g,1−δ},
then T̄* may be approximately multivariate normal. The CI (4.32) applied
to the double bootstrap sample could be regarded as a modified Frey CI

without delta method techniques. Of course the double bootstrap tends to


be too computationally expensive to simulate.

We can get a prediction region by randomly dividing the data into two
half sets H and V where H has n_H = ⌈n/2⌉ of the cases and V has the
remaining m = n_V = n − n_H cases. Compute (x̄_H, S_H) from the cases in
H. Then compute the distances D_i² = (x_i − x̄_H)^T S_H^{-1} (x_i − x̄_H) for the m
vectors x_i in V. Then a large sample 100(1 − δ)% prediction region for x_F is
{x : D_x²(x̄_H, S_H) ≤ D²_(k_m)} where k_m = ⌈m(1 − δ)⌉. This prediction region
may give better coverage than the nonparametric prediction region (4.24) if
5p ≤ n ≤ 20p.
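A minimal R sketch of this split prediction region; the name splitpredreg is illustrative.

splitpredreg <- function(x, xf, delta = 0.1){
  # compute (xbar_H, S_H) from half set H, take the cutoff D^2_(k_m) from
  # the distances of the m cases in V, and check whether xf is covered
  x <- as.matrix(x); n <- nrow(x)
  H <- sample(1:n, ceiling(n/2)); V <- setdiff(1:n, H); m <- length(V)
  xbarH <- colMeans(x[H, , drop = FALSE])
  SH <- cov(x[H, , drop = FALSE])
  D2 <- mahalanobis(x[V, , drop = FALSE], xbarH, SH)
  cutoff <- sort(D2)[ceiling(m * (1 - delta))]   # k_m = ceiling(m(1 - delta))
  covered <- mahalanobis(rbind(xf), xbarH, SH) <= cutoff
  list(cutoff = cutoff, covered = as.logical(covered))
}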
The iid sample T_1, ..., T_B has sample mean T̄. Let T_{in} = T_{ijn} if T_{jn} is
chosen D_{jn} times where the random variables D_{jn}/B →_P π_{jn}. The D_{jn} follow
a multinomial distribution. Then the iid sample can be written as

T_{1,1}, ..., T_{D_{1n},1}, ..., T_{1,J}, ..., T_{D_{Jn},J},

where the T_{ij} are not iid. Denote T_{1j}, ..., T_{D_{jn},j} as the jth component of the
iid sample with sample mean T̄_j and sample covariance matrix S_{T,j}. Thus

T̄ = (1/B) Σ_{i=1}^B T_{in} = Σ_j (D_{jn}/B) (1/D_{jn}) Σ_{i=1}^{D_{jn}} T_{ij} = Σ_j π̂_{jn} T̄_j.

Hence T̄ is a random linear combination of the T̄_j. Conditionally on the D_{jn},
the T_{ij} are independent, and T̄ is a linear combination of the T̄_j. Note that
Cov(T̄) = Cov(T_n)/B.
Software. The simulations were done in R. See R Core Team (2016). We
used several R functions including forward selection as computed with the
regsubsets function from the leaps library. Several linmodpack functions
were used. The function predrgn makes the nonparametric prediction re-
gion and determines whether xf is in the region. The function predreg also
makes the nonparametric prediction region, and determines if 0 is in the re-
gion. For multiple linear regression, the function regboot does the residual
bootstrap for multiple linear regression, regbootsim simulates the residual
bootstrap for regression, and the function rowboot does the empirical non-
parametric bootstrap. The function vsbootsim simulates the bootstrap for
all subsets variable selection, so needs p small, while vsbootsim2 simulates
the prediction region method for forward selection. The functions fselboot
and vselboot bootstrap the forward selection and all subsets variable selec-
tion estimators that minimize Cp. See Examples 4.7 and 4.8. The shorth3
function computes the shorth(c) intervals with the Frey (2013) correction
used when g = 1. Table 4.2 was made using regbootsim3 for the OLS full
model and vsbootsim4 for forward selection. The functions bicboot and
bicbootsim are useful if BIC is used instead of Cp . For forward selection

with Cp , the function vscisim was used to make Table 4.3, and can be used
to compare the shorth, prediction region method, and Bickel and Ren CIs for
βi .

4.10 Problems

4.1. Consider the Cushny and Peebles data set (see Staudte and Sheather
1990, p. 97) listed below. Find shorth(7). Show work.
0.0 0.8 1.0 1.2 1.3 1.3 1.4 1.8 2.4 4.6
4.2. Find shorth(5) for the following data set. Show work.
6 76 90 90 94 94 95 97 97 1008
4.3. Find shorth(5) for the following data set. Show work.
66 76 90 90 94 94 95 95 97 98
4.4. Suppose you are estimating the mean θ of losses with the maximum
likelihood estimator (MLE) X̄ assuming an exponential (θ) distribution.
Compute the sample mean of the fourth bootstrap sample.
actual losses 1, 2, 5, 10, 50: X̄ = 13.6
bootstrap samples:
2, 10, 1, 2, 2: X̄ = 3.4
50, 10, 50, 2, 2: X̄ = 22.8
10, 50, 2, 1, 1: X̄ = 12.8
5, 2, 5, 1, 50: X̄ = ?

4.5. The data below are the sorted residuals from a least squares regression
where n = 100 and p = 4. Find shorth(97) of the residuals.
number 1 2 3 4 ... 97 98 99 100
residual -2.39 -2.34 -2.03 -1.77 ... 1.76 1.81 1.83 2.16
4.6. To find the sample median of a list of n numbers where n is odd, order
the numbers from smallest to largest and the median is the middle ordered
number. The sample median estimates the population median. Suppose the
sample is {14, 3, 5, 12, 20, 10, 9}. Find the sample median for each of the three
bootstrap samples listed below.
Sample 1: 9, 10, 9, 12, 5, 14, 3
Sample 2: 3, 9, 20, 10, 9, 5, 14
Sample 3: 14, 12, 10, 20, 3, 3, 5

4.7. Suppose you are estimating the mean µ of losses with T = X̄.
actual losses 1, 2, 5, 10, 50: X̄ = 13.6,
a) Compute T*_1, ..., T*_4, where T*_i is the sample mean of the ith bootstrap
sample. bootstrap samples:

2, 10, 1, 2, 2:
50, 10, 50, 2, 2:
10, 50, 2, 1, 1:
5, 2, 5, 1, 50:
b) Now compute the bagging estimator which is the sample mean of the
T*_i: the bagging estimator T̄* = (1/B) Σ_{i=1}^B T*_i where B = 4 is the number of
bootstrap samples.
4.8. Consider the output for Example 4.7 for the minimum Cp forward
selection model.
a) What is β̂ Imin ?
b) What is β̂Imin ,0 ?
c) The large sample 95% shorth CI for H is [0,0.016]. Is H needed in the
minimum Cp model given that the other predictors are in the model?
d) The large sample 95% shorth CI for log(S) is [0.324,0.913] for all subsets.
Is log(S) needed in the minimum Cp model given that the other predictors
are in the model?
e) Suppose x1 = 1, x4 = H = 130, and x5 = log(S) = 5.075. Find
Ŷ = (x1 x4 x5 )β̂ Imin . Note that Y = log(M ).

4.9Q. Suppose Y* = X β̂ + r^W where E(r^W) = 0 and Cov(r^W) =
Cov(Y*) = MSE I_n. Then β̂* = (X^T X)^{-1} X^T Y*. Recall that X is an
n × p constant matrix. Simplify quantities when possible.
a) What is E(β̂*)?
b) What is Cov(β̂*)?
c) Recall that X β̂ = PY. What is E(β̂*_I) = E[(X_I^T X_I)^{-1} X_I^T Y*]?
d) What is Cov(β̂*_I)?

4.10Q. Suppose Y* ∼ N_n(X β̂, σ_n² I_n). Hence Y*_i = x_i^T β̂ + e*_i where
E(e*_i) = 0 and V(e*_i) = σ_n². Hence AY* ∼ N_g(AX β̂, σ_n² AA^T) if A is a
g × n constant matrix. Recall that X is an n × p constant matrix. Simplify
quantities when possible.
a) What is the distribution of β̂* = (X^T X)^{-1} X^T Y*?
b) Using a), what is E(β̂*)?
c) Recall that X β̂ = PY. What is the distribution of β̂*_I = (X_I^T X_I)^{-1} X_I^T Y*
if β̂*_I is k × 1?
4.11Q. Suppose Y* = X β̂ + r^W where E(r^W) = 0 and Cov(r^W) =
Cov(Y*) = diag(r_i²) = diag(r_1², ..., r_n²). Then β̂* = (X^T X)^{-1} X^T Y* is the
least squares estimator from regressing Y* on X, an n × p constant matrix.
This model is used for the wild bootstrap. Simplify quantities when possible.
(Can simplify a) and c), but can't simplify b) and d) much.)
a) What is E(β̂*)?
b) What is Cov(β̂*)?
c) Recall that X β̂ = PY. What is E(β̂*_I) = E[(X_I^T X_I)^{-1} X_I^T Y*]?
d) What is Cov(β̂*_I)?
4.12.
4.13.
4.14.
4.15.
4.16.
4.17.
4.18.
4.19.
4.20.

R Problems
Use the command source(“G:/linmodpack.txt”) to download the
functions and the command source(“G:/linmoddata.txt”) to download the
data. See Preface or Section 11.1. Typing the name of the linmodpack
function, e.g. regbootsim2, will display the code for the function. Use the
args command, e.g. args(regbootsim2), to display the needed arguments for
the function. For the following problem, the R command can be copied and
pasted from (http://parker.ad.siu.edu/Olive/linmodrhw.txt) into R.
4.21. a) Type the R command predsim() and paste the output into
Word.
This program computes xi ∼ N4 (0, diag(1, 2, 3, 4)) for i = 1, ..., 100 and
xf = x101 . One hundred such data sets are made, and ncvr, scvr, and mcvr
count the number of times xf was in the nonparametric, semiparametric,
and parametric MVN 90% prediction regions. The volumes of the prediction
regions are computed and voln, vols, and volm are the average ratio of the
volume of the ith prediction region over that of the semiparametric region.
Hence vols is always equal to 1. For multivariate normal data, these ratios
should converge to 1 as n → ∞.
b) Were the three coverages near 90%?
4.22. Consider the multiple linear regression model Yi = β1 + β2 xi,2 +
β3 xi,3 + β4 xi,4 + ei where β = (1, 1, 0, 0)T . The function regbootsim2
bootstraps the regression model, finds bootstrap confidence intervals for βi
and a bootstrap confidence region for (β3 , β4 )T corresponding to the test
H0 : β3 = β4 = 0 versus HA: not H0 . See the R code near Table 4.3. The
lengths of the CIs along with the proportion of times the CI for βi contained
βi are given. The fifth interval gives the length of the interval [0, D(c)] where
H0 is rejected if D0 > D(c) and the fifth “coverage” is the proportion of times
the test fails to reject H0 . Since nominal 95% CIs were used and the nominal

level of the test is 0.05 when H0 is true, we want the coverages near 0.95.
The CI lengths for the first 4 intervals should be near 0.392. The residual
bootstrap is used.
Copy and paste the commands for this problem into R, and include the
output in Word.
Chapter 5
Statistical Learning Alternatives to OLS

This chapter considers several alternatives to OLS for the multiple linear
regression model. Large sample theory is given for p fixed, but the prediction
intervals can have p > n.

5.1 The MLR Model

From Definition 1.9, the multiple linear regression (MLR) model is

Yi = β1 + xi,2 β2 + · · · + xi,p βp + ei = xTi β + ei (5.1)

for i = 1, ..., n. This model is also called the full model. Here n is the sample
size and the random variable ei is the ith error. Assume that the ei are iid
with variance V (ei ) = σ 2 . In matrix notation, these n equations become
Y = Xβ + e where Y is an n × 1 vector of dependent variables, X is an
n × p matrix of predictors, β is a p × 1 vector of unknown coefficients, and e
is an n × 1 vector of unknown errors.
There are many methods for estimating β, including (ordinary) least
squares (OLS) for the full model, forward selection with OLS, elastic net,
principal components regression (PCR), partial least squares (PLS), lasso,
lasso variable selection, and ridge regression (RR). For the last six methods,
it is convenient to use centered or scaled data. Suppose U has observed val-
ues U1 , ..., Un. For example, if Ui = Yi then U corresponds to the response
variable Y . The observed values of a random variable V are centered if their
sample mean is 0. The centered values of U are Vi = Ui − U for i = 1, ..., n.
Let g be an integer near 0. If the sample variance of the Ui is

σ̂_g² = [1/(n − g)] Σ_{i=1}^n (U_i − Ū)²,


then the sample standard deviation of Ui is σ̂_g. If the values of Ui are not all
the same, then σ̂_g > 0, and the standardized values of the Ui are

W_i = (U_i − Ū)/σ̂_g.

Typically g = 1 or g = 0 are used: g = 1 gives an unbiased estimator
of σ² while g = 0 gives the method of moments estimator. Note that the
standardized values are centered, W̄ = 0, and the sample variance of the
standardized values is

[1/(n − g)] Σ_{i=1}^n W_i² = 1.     (5.2)

Remark 5.1. Let the nontrivial predictors u_i^T = (x_{i,2}, ..., x_{i,p}) = (u_{i,1}, ...,
u_{i,p−1}). Then x_i = (1, u_i^T)^T. Let the n × (p − 1) matrix of standardized
nontrivial predictors W_g = (W_{ij}) when the predictors are standardized using
σ̂_g. Thus, Σ_{i=1}^n W_{ij} = 0 and Σ_{i=1}^n W_{ij}² = n − g for j = 1, ..., p − 1. Hence

W_{ij} = (x_{i,j+1} − x̄_{j+1})/σ̂_{j+1} where σ̂_{j+1}² = [1/(n − g)] Σ_{i=1}^n (x_{i,j+1} − x̄_{j+1})²,

i.e. σ̂_{j+1} is σ̂_g for the (j + 1)th variable x_{j+1}. Let w_i^T = (w_{i,1}, ..., w_{i,p−1}) be the
standardized vector of nontrivial predictors for the ith case. Since the standardized
data are also centered, w̄ = 0. Then the sample covariance matrix
of the w_i is the sample correlation matrix of the u_i:

ρ̂_u = R_u = (r_{ij}) = W_g^T W_g/(n − g)

where r_{ij} is the sample correlation of u_i = x_{i+1} and u_j = x_{j+1}. Thus the
sample correlation matrix R_u does not depend on g. Let Z = Y − Ȳ where
Ȳ = Ȳ 1. Since the R software tends to use g = 0, let W = W_0. Note that the
n × (p − 1) matrix W does not include a vector 1 of ones. Then regression
through the origin is used for the model

Z = Wη + e     (5.3)

where Z = (Z_1, ..., Z_n)^T and η = (η_1, ..., η_{p−1})^T. The vector of fitted values
Ŷ = Ȳ + Ẑ.

Remark 5.2. i) Interest is in model (5.1): estimate Ŷf and β̂. For many
regression estimators, a method is needed so that everyone who uses the same
units of measurements for the predictors and Y gets the same (Ŷ , β̂). Also,
see Remark 7.7. Equation (5.3) is a commonly used method for achieving this
2
goal. Suppose g = 0. The method of moments estimator of the variance σw
is
5.1 The MLR Model 213
n
2 2 1X
σ̂g=0 = SM = (wi − w)2 .
n i=1
2
When data xi are standardized to have w = 0 and SM = 1, the standardized
data wi has no units. ii) Hence the estimators Ẑ and η̂ do not depend on
the units of measurement of the xi if standardized data and Equation (5.3)
are used. Linear combinations of the w i are linear combinations of the ui ,
which are linear combinations of the xi . (Note that γ T u = (0 γ T ) x.) Thus
the estimators Ŷ and β̂ are obtained using Ẑ, η̂, and Y . The linear trans-
formation to obtain (Ŷ , β̂) from (Ẑ, η̂) is unique for a given set of units of
measurements for the xi and Y . Hence everyone using the same units of mea-
2
surements gets the same (Ŷ , β̂). iii) Also, since W j = 0 and SM,j = 1, the
standardized predictor variables have similar spread, and the magnitude of
η̂i is a measure of the importance of the predictor variable Wj for predicting
Y.

Remark 5.3. Let σ̂_j be the sample standard deviation of variable x_j
(often with g = 0) for j = 2, ..., p. Let Ŷ_i = β̂_1 + x_{i,2} β̂_2 + · · · + x_{i,p} β̂_p = x_i^T β̂.
If standardized nontrivial predictors are used, then

Ŷ_i = γ̂ + w_{i,1} η̂_1 + · · · + w_{i,p−1} η̂_{p−1} = γ̂ + [(x_{i,2} − x̄_2)/σ̂_2] η̂_1 + · · · + [(x_{i,p} − x̄_p)/σ̂_p] η̂_{p−1}

= γ̂ + w_i^T η̂ = γ̂ + Ẑ_i     (5.4)

where
η̂_j = σ̂_{j+1} β̂_{j+1}     (5.5)

for j = 1, ..., p − 1. Often γ̂ = Ȳ so that Ŷ_i = Ȳ if x_{i,j} = x̄_j for j = 2, ..., p.
Then Ŷ = Ȳ + Ẑ where Ȳ = Ȳ 1. Note that

γ̂ = β̂_1 + (x̄_2/σ̂_2) η̂_1 + · · · + (x̄_p/σ̂_p) η̂_{p−1}.

Notation. The symbol A ≡ B = f(c) means that A and B are equivalent


and equal, and that f(c) is the formula used to compute A and B.

Most regression methods attempt to find an estimate β̂ of β which


minimizes some criterion function Q(b) of the residuals. As in Definition
1.13, given an estimate b of β, the corresponding vector of fitted values is
Ŷ ≡ Ŷ(b) = Xb, and the vector of residuals is r ≡ r(b) = Y − Ŷ(b). See
Definition 1.14 for the OLS model for Y = Xβ + e. The following model is
useful for the centered response and standardized nontrivial predictors, or if
Z = Y , W = X I , and η = βI corresponds to a submodel I.

Definition 5.1. If Z = Wη + e, where the n × q matrix W has full rank
q = p − 1, then the OLS estimator

η̂_OLS = (W^T W)^{-1} W^T Z

minimizes the OLS criterion Q_OLS(η) = r(η)^T r(η) over all vectors η ∈
R^{p−1}. The vector of predicted or fitted values Ẑ_OLS = W η̂_OLS = HZ where
H = W(W^T W)^{-1} W^T. The vector of residuals r = r(Z, W) = Z − Ẑ =
(I − H)Z.
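A minimal R sketch of Remark 5.3 and Definition 5.1 with g = 0: standardize the nontrivial predictors, regress the centered response Z on W without an intercept, and recover β̂ from η̂ using (5.5). The name stdfit is illustrative.

stdfit <- function(u, y){
  u <- as.matrix(u)
  sds <- apply(u, 2, function(v) sqrt(mean((v - mean(v))^2)))  # sigma-hat with g = 0
  W <- scale(u, center = TRUE, scale = sds)   # standardized nontrivial predictors
  Z <- y - mean(y)                            # centered response
  etahat <- lsfit(W, Z, intercept = FALSE)$coef   # eta-hat_OLS = (W'W)^{-1} W'Z
  betahat <- etahat / sds                     # beta-hat_{j+1} = eta-hat_j / sigma-hat_{j+1}
  b1 <- mean(y) - sum(betahat * colMeans(u))  # intercept, so that gamma-hat = Ybar
  list(etahat = etahat, betahat = c(b1, betahat))
}
# stdfit(u, y)$betahat should match lsfit(u, y)$coef up to rounding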

Assume that the sample correlation matrix

R_u = W^T W/n →_P V^{-1}.     (5.6)

Note that V^{-1} = ρ_u, the population correlation matrix of the nontrivial
predictors u_i, if the u_i are a random sample from a population. Let H =
W(W^T W)^{-1} W^T = (h_{ij}), and assume that max_{i=1,...,n} h_{ii} →_P 0 as n → ∞.
Then by Theorem 2.26 (the LS CLT), the OLS estimator satisfies

√n(η̂_OLS − η) →_D N_{p−1}(0, σ² V).     (5.7)

Remark 5.4: Variable selection is the search for a subset of predictor


variables that can be deleted without important loss of information if n/p is
large (and the search for a useful subset of predictors if n/p is not large). Refer
to Chapter 4 for variable selection and Equation (4.1) where xT β = xTS β S +
xTE β E = xTS βS . Let p be the number of predictors in the full model, including
a constant. Let q = p − 1 be the number of nontrivial predictors in the full
model. Let a = aI be the number of predictors in the submodel I, including
a constant. Let k = kI = aI − 1 be the number of nontrivial predictors
in the submodel. For submodel I, think of I as indexing the predictors in
the model, including the constant. Let A index the nontrivial predictors in
the model. Hence I adds the constant (trivial predictor) to the collection
of nontrivial predictors in A. In Equation (4.1), there is a “true submodel”
Y = X S β S + e where all of the elements of β S are nonzero but all of the
elements of β that are not elements of βS are zero. Then a = aS is the
number of predictors in that submodel, including a constant, and k = kS is
the number of active predictors = number of nonnoise variables = number
of nontrivial predictors in the true model S = IS . Then there are p − a noise
variables (xi that have coefficient βi = 0) in the full model. The true model
is generally only known in simulations. For Equation (4.1), we also assume
that if xT β = xTI βI , then S ⊆ I. Hence S is the unique smallest subset of
predictors such that xT β = xTS β S . Two alternative variable selection models
were given by Remark 4.24.

Model selection generates M models. Then a hopefully good model is


selected from these M models. Variable selection is a special case of model
selection. Many methods for variable and model selection have been suggested
for the MLR model. We will consider several R functions including i) forward
selection computed with the regsubsets function from the leaps library,
ii) principal components regression (PCR) with the pcr function from the
pls library, iii) partial least squares (PLS) with the plsr function from the
pls library, iv) ridge regression with the cv.glmnet or glmnet function
from the glmnet library, v) lasso with the cv.glmnet or glmnet function
from the glmnet library, and vi) relaxed lasso which is OLS applied to
the lasso active set (nontrivial predictors with nonzero coefficients) and a
constant. See Sections 5.2–5.7 and James et al. (2013, ch. 6).
These six methods produce M models and use a criterion to select the
final model (e.g. Cp or 10-fold cross validation (CV)). See Section 5.10. The
number of models M depends on the method. Often one of the models is the
full model (5.1) that uses all p − 1 nontrivial predictors. The full model is
(approximately) fit with (ordinary) least squares. For one of the M models,
some of the methods use η̂ = 0 and fit the model Yi = β1 + ei with Ŷi ≡ Y
that uses none of the nontrivial predictors. Forward selection, PCR, and PLS
use variables v1 = 1 (the constant or trivial predictor) and vj = γ Tj x that are
linear combinations of the predictors for j = 2, ..., p. Model Ii uses variables
v1 , v2 , ..., vi for i = 1, ..., M where M ≤ p and often M ≤ min(p, n/10). Then
M models Ii are used. (For forward selection and PCR, OLS is used to regress
Y (or Z) on v1 , ..., vi.) Then a criterion chooses the final submodel Id from
candidates I1 , ..., IM .

Remark 5.5. Prediction interval (4.14) used a number d that was often
the number of predictors in the selected model. For forward selection, PCR,
PLS, lasso, and relaxed lasso, let d be the number of predictors vj = γ Tj x in
the final model (with nonzero coefficients), including a constant v1 . For for-
ward selection, lasso, and relaxed lasso, vj corresponds to a single nontrivial
predictor, say vj = x∗j = xkj . Another method for obtaining d is to let d = j
if j is the degrees of freedom of the selected model if that model was chosen
in advance without model or variable selection. Hence d = j is not the model
degrees of freedom if model selection was used.

Overfitting or “fitting noise” occurs when there is not enough data to


estimate the p × 1 vector β well with the estimation method, such as OLS.
The OLS model is overfitting if n < 5p. When n > p, X is not invertible,
but if n = p, then Ŷ = HY = X(X T X)−1 X T Y = I n Y = Y regardless of
how bad the predictors are. If n < p, then the OLS program fails or Ŷ = Y :
the fitted regression plane interpolates the training data response variables
Y1 , ..., Yn. The following rule of thumb is useful for many regression methods.
Note that d = p for the full OLS model.
216 5 Statistical Learning Alternatives to OLS

Rule of thumb 5.1. We want n ≥ 10d to avoid overfitting. Occasionally


n as low as 5d is used, but models with n < 5d are overfitting.

Remark 5.6. Use Z n ∼ ANr (µn , Σ n ) to indicate that a normal approx-


imation is used: Z n ≈ Nr (µn , Σ n ). Let a be a constant, let A be a k × r
constant matrix (often with full rank k ≤ r), and let c be a k × 1 constant
√ D
vector. If n(θ̂ n − θ) → Nr (0, V ), then aZ n = aI r Z n with A = aI r ,
  
aZ n ∼ ANr aµn , a2 Σ n , and AZ n + c ∼ ANk Aµn + c, AΣ n AT ,

  !
V AV AT
θ̂n ∼ ANr θ, , and Aθ̂ n + c ∼ ANk Aθ + c, .
n n
Theorem 2.26 gives the large sample theory for the OLS full model. Then
β̂ ≈ Np (β, σ 2 (X T X)−1 )) or β̂ ∼ ANp (β, M SE(X T X)−1 )).

When minimizing or maximizing a real valued function Q(η) of the k × 1


vector η, the solution η̂ is found by setting the gradient of Q(η) equal to
0. The following definition and lemma follow Graybill (1983, pp. 351-352)
closely. Maximum likelihood estimators are examples of estimating equations.
There is a vector of parameters η, and the gradient of the log likelihood
function log L(η) is set to zero. The solution η̂ is the MLE, an estimator
of the parameter vector η, but in the log likelihood, η is a dummy variable
vector, not the fixed unknown parameter vector.

Definition 5.2. Let Q(η) be a real valued function of the k × 1 vector η.


The gradient of Q(η) is the k × 1 vector
 ∂ 
∂η1
Q(η)
 ∂ Q(η) 
∂Q ∂Q(η)  ∂η2 
5Q = 5Q(η) = = = . .

∂η ∂η  .
. 

∂ηk Q(η)

Suppose there is a model with unknown parameter vector η. A set of esti-


mating equations f(η) is used to maximize or minimize Q(η) where η is a
dummy variable vector.
set
Often f(η) = 5Q, and we solve f(η) = 5Q = 0 for the solution η̂, and
f : Rk → Rk . Note that η̂ is an estimator of the unknown parameter vector
η in the model, but η is a dummy variable in Q(η). Hence we could use Q(b)
instead of Q(η), but the solution of the estimating equations would still be
b̂ = η̂.
5.1 The MLR Model 217

As a mnemonic (memory aid) for the following theorem, note that the
d d d d
derivative ax = xa = a and ax2 = xax = 2ax.
dx dx dx dx
Theorem 5.1. a) If Q(η) = aT η = ηT a for some k × 1 constant vector
a, then 5Q = a.
b) If Q(η) = ηT Aη for some k × k constant matrix A, then 5Q = 2Aη.
Pk
c) If Q(η) = i=1 |ηi | = kηk1 , then 5Q = s = sη where si = sign(ηi )
where sign(ηi ) = 1 if ηi > 0 and sign(ηi ) = −1 if ηi < 0. This gradient is only
defined for η where none of the k values of ηi are equal to 0.

Example 5.1. If Z = W η +e, then the OLS estimator minimizes Q(η) =


kZ − W ηk22 = (Z − W η)T (Z − W η) = Z T Z − 2Z T W η + η T (W T W )η.
Using Theorem 5.1 with aT = Z T W and A = W T W shows that 5Q =
−2W T Z +2(W T W )η. Let 5Q(η̂) denote the gradient evaluated at η̂. Then
the OLS estimator satisfies the normal equations (W T W )η̂ = W T Z.

Example 5.2. The Hebbler (1847) data was collected from n = 26 dis-
tricts in Prussia in 1843. We will study the relationship between Y = the
number of women married to civilians in the district with the predictors x1
= constant, x2 = pop = the population of the district in 1843, x3 = mmen
= the number of married civilian men in the district, x4 = mmilmen = the
number of married men in the military in the district, and x5 = milwmn =
the number of women married to husbands in the military in the district.
Sometimes the person conducting the survey would not count a spouse if
the spouse was not at home. Hence Y is highly correlated but not equal to
x3 . Similarly, x4 and x5 are highly correlated but not equal. We expect that
Y = x3 + e is a good model, but n/p = 5.2 is small. See the following output.
ls.print(out)
Residual Standard Error=392.8709
R-Square=0.9999, p-value=0
F-statistic (df=4, 21)=67863.03
Estimate Std.Err t-value Pr(>|t|)
Intercept 242.3910 263.7263 0.9191 0.3685
pop 0.0004 0.0031 0.1130 0.9111
mmen 0.9995 0.0173 57.6490 0.0000
mmilmen -0.2328 2.6928 -0.0864 0.9319
milwmn 0.1531 2.8231 0.0542 0.9572
res<-out$res
yhat<-Y-res #d = 5 predictors used including x_1
AERplot2(yhat,Y,res=res,d=5)
#response plot with 90% pointwise PIs
$respi #90% PI for a future residual
[1] -950.4811 1445.2584 #90% PI length = 2395.74
218 5 Statistical Learning Alternatives to OLS

5.2 Forward Selection

Variable selection methods such as forward selection were covered in Chapter


4 where model Ij uses j predictors x∗1 , ..., x∗j including the constant x∗1 ≡ 1. If
n/p is not large, forward selection can be done as in Chapter 4 except instead
of forming p submodels I1 , ..., Ip, form the sequence of M submodels I1 , ..., IM
where M = min(dn/Je, p) for some positive integer J such as J = 5, 10, or 20.
Here dxe is the smallest integer ≥ x, e.g., d7.7e = 8. Then for each submodel
Ij , OLS is used to regress Y on 1, x∗2 , ..., x∗j . Then a criterion chooses which
model Id from candidates I1 , ..., IM is to be used as the final submodel.

Remark 5.7. Suppose n/J is an integer. If p ≤ n/J, then forward selec-


tion fits (p − 1) + (p − 2) + · · ·+ 2 + 1 = p(p − 1)/2 ≈ p2 /2 models, where p − i
models are fit at step i for i = 1, ..., (p−1). If n/J < p, then forward selection
uses (n/J) − 1 steps and fits ≈ (p − 1) + (p − 2) + · · · + (p − (n/J) + 1) =
p((n/J) − 1) − (1 + 2 + · · · + ((n/J) − 1)) =
n n
n J (J− 1) n (2p − Jn )
p( − 1) − ≈
J 2 J 2
models. Thus forward selection can be slow if n and p are both large, al-
though the R package leaps uses a branch and bound algorithm that likely
eliminates many of the possible fits. Note that after step i, the model has
i + 1 predictors, including the constant.

The R function regsubsets can be used for forward selection if p < n,


and if p ≥ n if the maximum number of variables is less than n. Then warning
messages are common. Some R code is shown below.
#regsubsets works if p < n, e.g. p = n-1, and works
#if p > n with warnings if nvmax is small enough
set.seed(13)
n<-100
p<-200
k<-19 #the first 19 nontrivial predictors are active
J<-5
q <- p-1
b <- 0 * 1:q
b[1:k] <- 1 #beta = (1, 1, ..., 1, 0, 0, ..., 0)ˆT
x <- matrix(rnorm(n * q), nrow = n, ncol = q)
y <- 1 + x %*% b + rnorm(n)
nc <- ceiling(n/J)-1 #the constant will also be used
nc <- min(nc,q)
nc <- max(nc,1) #nc is the maximum number of
#nontrivial predictors used by forward selection
pp <- nc+1 #d = pp is used for PI (4.14)
5.2 Forward Selection 219

vars <- as.vector(1:(p-1))


temp<-regsubsets(x,y,nvmax=nc,method="forward")
out<-summary(temp)
num <- length(out$cp)
mod <- out$which[num,] #use the last model
#do not need the constant in vin
vin <- vars[mod[-1]]
out$rss
[1] 1496.49625 1342.95915 1214.93174 1068.56668
973.36395 855.15436 745.35007 690.03901
638.40677 590.97644 542.89273 503.68666
467.69423 420.94132 391.41961 328.62016
242.66311 178.77573 79.91771
out$bic
[1] -9.4032 -15.6232 -21.0367 -29.2685
-33.9949 -42.3374 -51.4750 -54.5804
-57.7525 -60.8673 -64.7485 -67.6391
-70.4479 -76.3748 -79.0410 -91.9236
-117.6413 -143.5903 -219.498595
tem <- lsfit(x[,1:19],y) #last model used the
sum(tem$residˆ2) #first 19 predictors
[1] 79.91771 #SSE(I) = RSS(I)
n*log(out$rss[19]/n) + 20*log(n)
[1] 69.68613 #BIC(I)
for(i in 1:19) #a formula for BIC(I)
print( n*log(out$rss[i]/n) + (i+1)*log(n) )
bic <- c(279.7815, 273.5616, 268.1480, 259.9162,
255.1898, 246.8474, 237.7097, 234.6043, 231.4322,
228.3175, 224.4362, 221.5456, 218.7368, 212.8099,
210.1437, 197.2611, 171.5435, 145.5944, 69.6861)
tem<-lsfit(bic,out$bic)
tem$coef
Intercept X
-289.1846831 0.9999998 #bic - 289.1847 = out$bic
xx <- 1:min(length(out$bic),p-1)+1
ebic <- out$bic+2*log(dbinom(x=xx,size=p,prob=0.5))
#actually EBIC(I) - 2 p log(2).
Example 5.2, continued. The output below shows results from forward
selection for the marry data. The minimum Cp model Imin uses a constant
and mmem. The forward selection PIs are shorter than the OLS full model
PIs.
library(leaps);Y <- marry[,3]; X <- marry[,-3]
temp<-regsubsets(X,Y,method="forward")
220 5 Statistical Learning Alternatives to OLS

out<-summary(temp)
Selection Algorithm: forward
pop mmen mmilmen milwmn
1 ( 1 ) " " "*" " " " "
2 ( 1 ) " " "*" "*" " "
3 ( 1 ) "*" "*" "*" " "
4 ( 1 ) "*" "*" "*" "*"
out$cp
[1] -0.8268967 1.0151462 3.0029429 5.0000000
#mmen and a constant = Imin
mincp <- out$which[out$cp==min(out$cp),]
#do not need the constant in vin
vin <- vars[mincp[-1]]
sub <- lsfit(X[,vin],Y)
ls.print(sub)
Residual Standard Error=369.0087
R-Square=0.9999
F-statistic (df=1, 24)=307694.4
Estimate Std.Err t-value Pr(>|t|)
Intercept 241.5445 190.7426 1.2663 0.2175
X 1.0010 0.0018 554.7021 0.0000
res<-sub$res
yhat<-Y-res #d = 2 predictors used including x_1
AERplot2(yhat,Y,res=res,d=2)
#response plot with 90% pointwise PIs
$respi #90% PI for a future residual
[1] -778.2763 1336.4416 #length 2114.72
Consider forward selection where xI is a × 1. Underfitting occurs if S
is not a subset of I so xI is missing important predictors. A special case
of underfitting is d = a < aS . Overfitting for forward selection occurs if i)
n < 5a so there is not enough data to estimate the a parameters in βI well,
or ii) S ⊆ I but S 6= I. Overfitting is serious if n < 5a, but “not much of a
problem” if n > Jp where J = 10 or 20 for many data sets. Underfitting is a
serious problem. Let Yi = xTI,i β I + eI,i . Then V (eI,i ) may not be a constant
σ 2 : V (eI,i ) could depend on case i, and the model may no longer be linear.
Check model I with response and residual plots.
Forward selection is a shrinkage method: p models are produced and except
for the full model, some |β̂i | are shrunk to 0. Lasso and ridge regression are
also shrinkage methods. Ridge regression is a shrinkage method, but |β̂i | is
not shrunk to 0. Shrinkage methods that shrink β̂i to 0 are also variable
selection methods. See Sections 5.5, 5.6, and 5.8.

Definition 5.3. Suppose the population MLR model has β S an aS × 1


vector. The population MLR model is sparse if aS is small. The population
MLR model is dense or abundant if n/aS < J where J = 5 or J = 10, say.
5.3 Principal Components Regression 221

The fitted model β̂ = β̂ Imin ,0 is sparse if d = number of nonzero coefficients


is small. The fitted model is dense if n/d < J where J = 5 or J = 10.

5.3 Principal Components Regression

Some notation for eigenvalues, eigenvectors, orthonormal eigenvectors, posi-


tive definite matrices, and positive semidefinite matrices will be useful before
defining principal components regression, which is also called principal com-
ponent regression.

Notation: Recall that a square symmetric p × p matrix A has an eigen-


value λ with corresponding eigenvector x 6= 0 if

Ax = λx. (5.8)

The eigenvalues of A are real since A is symmetric. Note that if constant


c 6= 0 and x is an eigenvector of A, then c x is an √ eigenvector of A. Let
e be an eigenvector of A with unit length kek2 = eT e = 1. Then e and
−e are eigenvectors with unit length, and A has p eigenvalue eigenvector
pairs (λ1 , e1 ), (λ2 , e2 ), ..., (λp, ep ). Since A is symmetric, the eigenvectors are
chosen such that the ei are orthonormal: eTi ei = 1 and eTi ej = 0 for i 6=
j. The symmetric matrix A is positive definite iff all of its eigenvalues are
positive, and positive semidefinite iff all of its eigenvalues are nonnegative.
If A is positive semidefinite, let λ1 ≥ λ2 ≥ · · · ≥ λp ≥ 0. If A is positive
definite, then λp > 0.
Theorem 5.2. Let A be a p × p symmetric matrix with eigenvector eigen-
value pairs (λ1 , e1 ), (λ2 , e2 ), ..., (λp, ep ) where eTi ei = 1 and eTi ej = 0 if i 6= j
for i = 1, ..., p. Then the spectral decomposition of A is
p
X
A= λi ei eTi = λ1 e1 eT1 + · · · + λp ep eTp .
i=1

Using the same notation as Johnson and Wichern (1988, pp. 50-51),
let P = [e1 e2 · · · ep ] be the p × p orthogonal matrix with ith column
T
ei . Then P Pp = P T P = I. Let Λ = diag(λ1 , ..., λp) and let Λ1/2 =

diag( λ1 , ..., λp ). If A is a positive definite p × p symmetric matrix with
Pp
spectral decomposition A = i=1 λi ei eTi , then A = P ΛP T and
p
X 1
A−1 = P Λ−1 P T = ei eTi .
λi
i=1
222 5 Statistical Learning Alternatives to OLS

Theorem 5.3. Let A be aP positive definite p × p symmetric matrix with


p
spectral decomposition A = i=1 λi ei eTi . The square root matrix A1/2 =
P Λ P is a positive definite symmetric matrix such that A1/2 A1/2 = A.
1/2 T

Principal components regression (PCR) uses OLS regression on the prin-


cipal components of the correlation matrix Ru of the p − 1 nontrivial pre-
dictors u1 = x2 , ..., up−1 = xp . Suppose Ru has eigenvalue eigenvector pairs
(λ̂1 , ê1 ), ..., (λ̂K , êK ) where λ̂1 ≥ λ̂2 ≥ · · · ≥ λ̂K ≥ 0 where K = min(n, p−1).
Then Ru êi = λ̂i êi for i = 1, ..., K. Since Ru is a symmetric positive semidef-
inite matrix, the λ̂i are real and nonnegative.

The eigenvectors êi are orthonormal: êTi êi = 1 and êTi êj = 0 for i 6= j.
If the eigenvalues are unique, then êi and −êi are the only orthonormal
eigenvectors corresponding to λ̂i . For example, the eigenvalue eigenvector
pairs can be found using the singular value decomposition of the matrix

W g / n − g where W g is the matrix of the standardized nontrivial predictors
wi , the sample covariance matrix
n n
W Tg W g 1 X T 1 X
Σ̂ w = = (w i − w)(w i − w) = wi w Ti = Ru ,
n−g n−g n−g
i=1 i=1

and usually g = 0 or g = 1. If n > K = p − 1, then the spectral decomposition


of Ru is
p−1
X
Ru = λ̂i êi êTi = λ̂1 ê1 êT1 + · · · + λ̂p−1 êp−1 êTp−1 ,
i=1
Pp−1
and i=1 λ̂i = p − 1.
Let w1 , ..., wn denote the standardized vectors of nontrivial predictors.
Then the K principal components corresponding to the jth case w j are
Pj1 = êT1 w j , ..., PjK = êTK w j . Following Hastie et al. (2009, p. 66), the ith
eigenvector ei is known as the ith principal component direction or Karhunen
Loeve direction of W g .
Principal components have a nice geometric interpretation if n > K =
p − 1. If n > K and Ru is nonsingular, then the hyperellipsoid
2
{w|Dw (0, Ru ) ≤ h2 } = {w : w T R−1 2
u w≤h }
is centered at 0. The volume of the hyperellipsoid is

2π K/2
|Ru |1/2hK .
KΓ (K/2)

Then points at squared distance wT R−1 2


u w = h from the origin lie on the
hyperellipsoid centered at the origin whose axes are given by the eigenvectors
5.3 Principal Components Regression 223
p
êi where the half length in the direction of êi is h λ̂i . Let j = 1, ..., n. Then
the first principal component Pj1 is obtained by projecting the wj on the
(longest) major axis of the hyperellipsoid, the second principal component Pj2
is obtained by projecting the w j on the next longest axis of the hyperellipsoid,
..., and the (p − 1)th principal component Pj,p−1 is obtained by projecting
the w j on the (shortest) minor axis of the hyperellipsoid. Examine Figure 4.3
for two ellipsoids with 2 nontrivial predictors. The axes of the hyperellipsoid
are a rotation of the usual axes about the origin.
Let the random variable Vi correspond to the ith principal component,
and let (P1i , ..., Pni)T = (V1i , ..., Vni)T be the observed data for Vi . Let g = 1.
Then the sample mean
n n
1X 1X T
Vi = Vki = êi w k = êTi w = êTi 0 = 0,
n n
k=1 k=1

and the sample covariance of Vi and Vj is Cov(Vi , Vj ) =


n n
1X 1X T
(Vki − V i )(Vkj − V j ) = êi wk wTk êj = êTi Ru êj
n n
k=1 k=1

= λ̂j êTi êj = 0 for i 6= j since the sample covariance matrix of the standard-
ized data is n
1X
w k w Tk = Ru
n
k=1

and Ru êj = λ̂j êj . Hence Vi and Vj are uncorrelated.


PCR uses linear combinations of the standardized data as predictors. Let
Vj = êTj w for j = 1, ..., K. Let model Ji contain V1 , ..., Vi. Then for model Ji ,
use OLS regression of Z = Y − Y on V1 , ..., Vi with Ŷ = Ẑ + Y . Since linear
combinations of w are linear combinations of x, Ŷ = X β̂ P CR,Ij where the
model Ij uses a constant and the first j − 1 PCR components.

Notation: Just as we use xi or Xi to denote the ith predictor, we will


use vj or Vj to denote predictors that are linear combinations of the original
predictors: e.g. vj = Vj = γ Tj x or vj = Vj = γ Tj u.
Remark 5.8. The set of (p − 1) × 1 vectors {(1, 0, ..., 0)T , (0, 1, 0, ..., 0)T ,
(0, ...0, 1)T } is the standard basis for Rp−1 . The set of vectors {ê1 , ..., êp−1 }
Pj T
is also a basis for Rp−1 . For PCR and some constants θi , i=1 θi êj w =
Pp−1
i=1 ηi wi if j = p − 1, but not if j < p − 1 in general. Hence PCR tends to
give inconsistent estimators unless P(j = p − 1) = P(PCR uses the OLS full
model) goes to one.

There are at least two problems with PCR. i) In general, β̂ P CR,Ij is an


inconsistent estimator of β̂ unless P (j → p − 1) = P (β̂ P CR,Ij → β̂ OLS ) → 1
224 5 Statistical Learning Alternatives to OLS

as n → ∞. ii) Generally there is no reason why the predictors should be


ranked from best to worst by V1 , V2 , ..., VK . For example, the last few prin-
cipal components (and a constant) could be much better for prediction than
the other principal components. See Jolliffe (1983) and Cook and Forzani
(2008). If n ≥ 10p, often PCR needs to use all p − 1 components (i.e., PCR
= OLS full model) to be competitive with other regression models. Per-
forming OLS forward selection orPlasso on V1 , ..., VK may be more effective.
J
There is one exception. Suppose i=1 λ̂i ≥ q(p − 1) where 0.5 ≤ q ≤ 1, e.g.
q = 0.8 where J is a lot smaller than p − 1. Then the J predictors V1 , ..., VJ
capture much of the information of the standardized nontrivial predictors
w1 , ..., wp−1. Then regressing Y on 1, V1 , ..., VJ may be competitive with re-
gressing Y on 1, w1 , ..., wp−1. PCR is equivalent to OLS on the full model
when Y is regressed on a constant and all K of the principal components.
PCR can also be useful if X is singular or nearly singular (ill conditioned).

Example 5.2, continued. The PCR output below shows results for the
marry data where 10-fold CV was used. The OLS full model was selected.
library(pls); y <- marry[,3]; x <- marry[,-3]
z <- as.data.frame(cbind(y,x))
out<-pcr(y˜.,data=z,scale=T,validation="CV")
tem<-MSEP(out)
tem
(Int) 1 comps 2 comps 3 comps 4 comps
CV 1.743e+09 449479706 8181251 371775 197132
cvmse<-tem$val[,,1:(out$ncomp+1)][1,]
nc <-max(which.min(cvmse)-1,1)
res <- out$residuals[,,nc]
yhat<-y-res #d = 5 predictors used including constant
AERplot2(yhat,y,res=res,d=5)
#response plot with 90% pointwise PIs
$respi #90% PI same as OLS full model
-950.4811 1445.2584 #PI length = 2395.74

5.4 Partial Least Squares

Partial least squares (PLS) uses variables v1 = 1 (the constant or trivial


predictor) and “PLS components” vj = γ Tj x for j = 2, ..., p. Next let the
response Y be used with the standardized predictors Wj . Let the “PLS com-
ponents” Vj = ĝ Tj w. Let model Ji contain V1 , ..., Vi. Often k–fold cross val-
idation is used to pick the PLS model from J1 , ..., JM . PLS seeks directions
ĝ j such that the PLS components Vj are highly correlated with Y , subject to
being uncorrelated with other PLS components Vi for i 6= j. Note that PCR
components are formed without using Y .
5.5 Ridge Regression 225

Remark 5.9. PLS may or may not give a consistent estimator of β if p/n
does not go to zero: rather strong regularity conditions have been used to
prove consistency or inconsistency if p/n does not go to zero. See Chun and
Keleş (2010), Cook (2018), Cook et al. (2013), and Cook and Forzani (2018,
2019).

Following Hastie et al. (2009, pp. 80-81), let W = [s1 , ..., sp−1 ] so sj is
the vector corresponding to the standardized jth nontrivial predictor. Let
ĝ1i = sTj Y be n times the least squares coefficient from regressing Y on
si . Then the first PLS direction ĝ 1 = (ĝ11 , ..., ĝ1,p−1)T . Note that W ĝ i =
(Vi1 , ..., Vin)T = pi is the ith PLS component. This process is repeated using
matrices W k = [sk1 , ..., skp−1 ] where W 0 = W and W k is orthogonalized
with respect to pk for k = 1, ..., p − 2. So skj = sjk−1 − [pTk sjk−1 /(pTk pk )]pk
for j = 1, ..., p− 1. If the PLS model Ii uses a constant and PLS components
V1 , ..., Vi−1, let Ŷ Ii be the predicted values from the PLS model using Ii .
Then Ŷ Ii = Ŷ Ii−1 + θ̂i pi where Ŷ I0 = Y 1 and θ̂i = pTi Y /(pTi pi ). Since
linear combinations of w are linear combinations of x, Ŷ = X β̂P LS,Ij where
Ij uses a constant and the first j − 1 PLS components. If j = p, then the
PLS model Ip is the OLS full model.

Example 5.2, continued. The PLS output below shows results for the
marry data where 10-fold CV was used. The OLS full model was selected.
library(pls); y <- marry[,3]; x <- marry[,-3]
z <- as.data.frame(cbind(y,x))
out<-plsr(y˜.,data=z,scale=T,validation="CV")
tem<-MSEP(out)
tem
(Int) 1 comps 2 comps 3 comps 4 comps
CV 1.743e+09 256433719 6301482 249366 206508
cvmse<-tem$val[,,1:(out$ncomp+1)][1,]
nc <-max(which.min(cvmse)-1,1)
res <- out$residuals[,,nc]
yhat<-y-res #d = 5 predictors used including constant
AERplot2(yhat,y,res=res,d=5)
$respi #90% PI same as OLS full model
-950.4811 1445.2584 #PI length = 2395.74
The Mevik et al. (2015) pls library is useful for computing PLS and PCR.

5.5 Ridge Regression

Consider the MLR model Y = Xβ + e. Ridge regression uses the centered


response Zi = Yi − Y and standardized nontrivial predictors in the model
226 5 Statistical Learning Alternatives to OLS

Z = W η + e. Then Yˆi = Ẑi + Y . Note that in Definition 5.5, λ1,n is a tuning


parameter, not an eigenvalue. The residuals r = r(β̂ R ) = Y − Ŷ . Refer to
Definition 5.1 for the OLS estimator η̂OLS = (W T W )−1 W T Z.

Definition 5.4. Consider the MLR model Z = W η + e. Let b be a


(p − 1) × 1 vector. Then the fitted value Ẑi (b) = wTi b and the residual
ri (b) = Zi − Ẑi (b). The vector of fitted values Ẑ(b) = W b and the vector of
residuals r(b) = Z − Ẑ(b).

Definition 5.5. Consider fitting the MLR model Y = Xβ + e using


Z = W η + e. Let λ ≥ 0 be a constant. The ridge regression estimator η̂R
minimizes the ridge regression criterion
p−1
1 λ1,n X 2
QR (η) = (Z − W η)T (Z − W η) + η (5.9)
a a i=1 i

over all vectors η ∈ Rp−1 where λ1,n ≥ 0 and a > 0 are known constants
with a = 1, 2, n, and 2n common. Then

η̂ R = (W T W + λ1,n I p−1 )−1 W T Z. (5.10)

The residual sum of squares RSS(η) = (Z − W η)T (Z − W η), and λ1,n = 0


corresponds to the OLS estimator η̂ OLS . The ridge regression vector of fitted
values is Ẑ = Ẑ R = W η̂ R , and the ridge regression vector of residuals
rR = r(η̂R ) = Z − Ẑ R . The estimator is said to be regularized if λ1,n > 0.
Obtain Ŷ and β̂ R using η̂R , Ẑ, and Y .

Using a vector of parameters η and a dummy vector η in QR is common


for minimizing a criterion Q(η), often with estimating equations. See the
paragraphs above and below Definition 5.2. We could also write
1 λ1,n T
QR (b) = r(b)T r(b) + b b
a a
Pp−1
where the minimization is over all vectors b ∈ Rp−1 . Note that i=1 ηi2 =
ηT η = kηk22 . The literature often uses λa = λ = λ1,n /a.
Pp−1
Note that λ1,n bT b = λ1,n i=1 b2i . Each coefficient bi is penalized equally
by λ1,n . Hence using standardized nontrivial predictors makes sense so that
if ηi is large in magnitude, then the standardized variable wi is important.

Remark 5.10. i) If λ1,n = 0, the ridge regression estimator becomes the


OLS full model estimator: η̂R = η̂ OLS .
ii) If λ1,n > 0, then W T W + λ1,n I p−1 is nonsingular. Hence η̂R exists
even if X and W are singular or ill conditioned, or if p > n.
5.5 Ridge Regression 227

iii) Following Hastie et al. (2009, p. 96), let the augmented matrix W A
and the augmented response vector Z A be defined by
   
p W Z
WA = , and Z A = ,
λ1,n I p−1 0

where 0 is the (p − 1) × 1 zero vector. For λ1,n > 0, the OLS estimator from
regressing Z A on W A is

η̂A = (W TA W A )−1 W TA Z A = η̂R

since W TA Z A = W T Z and
   
p
W TA W A = W T
λ1,n I p−1 p W = W T W + λ1,n I p−1 .
λ1,n I p−1

iv) A simple way to regularize a regression estimator, such as the L1 esti-


mator, is to compute that estimator from regressing Z A on W A .

Remark 5.10 iii) is interesting. Note that for λ1,n > 0, the (n+p−1)×(p−1)
matrix W A has full rank p−1. The augmented OLS model consists of adding
p − 1 pseudo-cases (wTn+1 , Zn+1 )T , ..., (wTn+p−1 , Zn+p−1 )T where Zj = 0 and
p
wj = (0, ..., λ1,n , 0, ..., 0)T for j = n+1, ..., n+p−1 where the nonzero entry
is in the kth position if j = n + k. For centered response and standardized
nontrivial predictors, the population OLS regression fit runs through the
origin (w T , Z)T = (0T , 0)T . Hence for λ1,n = 0, the augmented OLS model
adds p − 1 typical cases at the origin. If λ1,n is not large, then the pseudo-
data can still be regarded as typical cases. If λ1,n is large, the pseudo-data
act as w–outliers (outliers in the standardized predictor variables), and the
OLS slopes go to zero as λ1,n gets large, making Ẑ ≈ 0 so Ŷ ≈ Y .
To prove Remark 5.10 ii), let (ψ, g) be an eigenvalue eigenvector pair of
W T W = nRu . Then [W T W + λ1,n I p−1 ]g = (ψ + λ1,n )g, and (ψ + λ1,n , g)
is an eigenvalue eigenvector pair of W T W + λ1,n I p−1 > 0 provided λ1,n > 0.

The degrees of freedom for a ridge regression with known λ1,n is also
interesting and will be found in the next paragraph. The sample correlation
matrix of the nontrivial predictors
1
Ru = WTWg
n−g g

where we will use g = 0 and W = W 0 . Then W T W = nRu . By singular


value decomposition (SVD) theory, the SVD of W is W = U ΛV T where
the positive singular values σi are square roots of the positive eigenvalues of
both W T W and of W W T . Also V = (ê1 ê2 · · · êp ), and W T W êi = σi2 êi .
228 5 Statistical Learning Alternatives to OLS

Hence λ̂i = σi2 where λ̂i = λ̂i (W T W ) is the ith eigenvalue of W T W , and êi
is the ith orthonormal eigenvector of Ru and of W T W . The SVD of W T is
W T = V ΛT U T , and the Gram matrix
 T 
w 1 w 1 wT1 w2 . . . w T1 w n
 .. 
W W T =  ... ..
.
..
. . 
w Tn w 1 wTn w2 . . . w Tn w n

which is the matrix of scalar products. Warning: Note that σi is the ith
singular value of W , not the standard deviation of wi .
Following Hastie et al. (2009, p. 68), if λ̂i = λ̂i (W T W ) is the ith eigenvalue
of W T W where λ̂1 ≥ λ̂2 ≥ · · · ≥ λ̂p−1 , then the (effective) degrees of freedom
for the ridge regression of Z on W with known λ1,n is df(λ1,n ) =
p−1 p−1
X σi2 X λ̂i
T −1 T
tr[W (W W + λ1,n I p−1 ) W ]= 2 = (5.11)
i=1
σi + λ1,n i=1 λ̂i + λ1,n

where the trace of a square (p − 1) × (p − 1) matrix A = (aij ) is tr(A) =


Pp−1 Pp−1
i=1 aii = i=1 λ̂i (A). Note that the trace of A is the sum of the diagonal
elements of A = the sum of the eigenvalues of A.
Note that 0 ≤ df(λ1,n ) ≤ p − 1 where df(λ1,n ) = p − 1 if λ1,n = 0 and
df(λ1,n ) → 0 as λ1,n → ∞. The R code below illustrates how to compute
ridge regression degrees of freedom.
set.seed(13)
n<-100; q<-3 #q = p-1
b <- 0 * 1:q + 1
u <- matrix(rnorm(n * q), nrow = n, ncol = q)
y <- 1 + u %*% b + rnorm(n) #make MLR model
w1 <- scale(u) #t(w1) %*% w1 = (n-1) R = (n-1)*cor(u)
w <- sqrt(n/(n-1))*w1 #t(w) %*% w = n R = n cor(u)
t(w) %*% w/n
[,1] [,2] [,3]
[1,] 1.00000000 -0.04826094 -0.06726636
[2,] -0.04826094 1.00000000 -0.12426268
[3,] -0.06726636 -0.12426268 1.00000000
cor(u) #same as above
rs <- t(w)%*%w #scaled correlation matrix n R
svs <-svd(w)$d #singular values of w
lambda <- 0
d <- sum(svsˆ2/(svsˆ2+lambda))
#effective df for ridge regression using w
d
[1] 3 #= q = p-1
112.60792 103.88089 83.51119
5.5 Ridge Regression 229

svsˆ2 #as above


uu<-scale(u,scale=F) #centered but not scaled
svs <-svd(uu)$d #singular values of uu
svsˆ2
[1] 135.78205 108.85903 85.83395
d <- sum(svsˆ2/(svsˆ2+lambda))
#effective df for ridge regression using uu
#d is again 3 if lambda = 0
In general, if Ẑ = H λ Z, then df(Ẑ) = tr(H λ ) where H λ is a (p − 1) ×
(p − 1) “hat matrix.” For computing Ŷ , df(Ŷ ) = df(Ẑ) + 1 since a constant
β̂1 also needs to be estimated. These formulas for degrees of freedom assume
that λ is known before fitting the model. The formulas do not give the model
degrees of freedom if λ̂ is selected from M values λ1 , ..., λM using a criterion
such as k-fold cross validation.
Suppose the ridge regression criterion is written, using a = 2n, as
1
QR,n (b) = r(b)T r(b) + λ2n bT b, (5.12)
2n
as in Hastie et al. (2015, p. 10). Then λ2n = λ1,n /(2n) using the λ1,n from
(5.9).

The following remark is interesting if λ1,n and p are fixed. However, λ̂1,n is
usually used, for example, after 10-fold cross validation. The fact that η̂R =
An,λ η̂OLS appears in Efron and Hastie (2016, p. 98), and Marquardt and
Snee (1975). See Theorem 5.4 for the ridge regression central limit theorem.

Remark 5.11. Ridge regression has a simple relationship with OLS if


n > p and (W T W )−1 exists. Then η̂R = (W T W + λ1,n I p−1 )−1 W T Z =
(W T W +λ1,n I p−1 )−1 (W T W )(W T W )−1 W T Z = An,λ η̂OLS where An,λ ≡
An = (W T W + λ1,n I p−1 )−1 W T W . By the LS CLT Equation (5.7) with
V̂ /n = (W T W )−1 , a normal approximation for OLS is

η̂ OLS ∼ ANn−p (η, M SE (W T W )−1 ).

Hence a normal approximation for ridge regression is

η̂ R ∼ ANp−1 (An η, M SE An (W T W )−1 ATn ) ∼

ANp−1 [An η, M SE (W T W + λ1,n I p−1 )−1 (W T W )(W T W + λ1,n I p−1 )−1 ].


P
If Equation (5.7) holds and λ1,n /n → 0 as n → ∞, then An → I p−1 .

Remark 5.12. The ridge regression criterion from Definition 5.5 can also
be defined by
QR (η) = kZ − W ηk22 + λ1,n η T η. (5.13)
230 5 Statistical Learning Alternatives to OLS

Then by Theorem 5.1, the gradient 5QR = −2W T Z +2(W T W )η +2λ1,n η.


Cancelling constants and evaluating the gradient at η̂ R gives the score equa-
tions
−W T (Z − W η̂R ) + λ1,n η̂ R = 0. (5.14)
Following Efron and Hastie (2016, pp. 381-382, 392), this means η̂R = W T a
for some n × 1 vector a. Hence −W T (Z − W W T a) + λ1,n W T a = 0, or

W T (W W T + λ1,n I n )]a = W T Z

which has solution a = (W W T + λ1,n I n )−1 Z. Hence

η̂R = W T a = W T (W W T + λ1,n I n )−1 Z = (W T W + λ1,n I p−1 )−1 W T Z.

Using the n × n matrix W W T is computationally efficient if p > n while


using the p × p matrix W T W is computationally efficient if n > p. If A is
k × k, then computing A−1 has O(k 3 ) complexity.

The following identity from Gunst and Mason (1980, p. 342) is useful for
ridge regression inference: η̂ R =(W T W + λ1,n I p−1 )−1 W T Z

= (W T W + λ1,n I p−1 )−1 W T W (W T W )−1 W T Z

= (W T W + λ1,n I p−1 )−1 W T W η̂OLS = An η̂ OLS =


[I p−1 − λ1,n (W T W + λ1,n I p−1 )−1 ]η̂OLS = B n η̂OLS =
λ1n
η̂OLS − n(W T W + λ1,n I p−1 )−1 η̂ OLS
n
since An − B n = 0. See Problem 5.3. Assume Equation (5.6) holds. If
λ1,n /n → 0 then

W T W + λ1,n I p−1 P P
→ V −1 , and n(W T W + λ1,n I p−1 )−1 → V .
n
Note that
!−1
W T W + λ1,n I p−1 WTW P
An = An,λ = → V V −1 = I p−1
n n

if λ1,n /n → 0 since matrix inversion is a continuous function of a positive


definite matrix. See, for example, Bhatia et al. (1990), Stewart (1969), and
Severini (2005, pp. 348-349).
For model selection, the M values of λ = λ1,n are denoted by λ1 , λ2 , ..., λM
where λi = λ1,n,i depends on n for i = 1, ..., M . If λs corresponds to the model
selected, then λ̂1,n = λs . The following theorem shows that ridge regression
5.5 Ridge Regression 231

and the OLS full model are asymptotically equivalent if λ̂1,n = oP (n1/2 ) so
√ P
λ̂1,n / n → 0.

Theorem 5.4, RR CLT (Ridge Regression Central Limit Theo-


rem. Assume p is fixed and that the conditions of the LS CLT Theorem
Equation (5.7) hold for the model Z = W η + e.
√ P
a) If λ̂1,n / n → 0, then
√ D
n(η̂ R − η) → Np−1 (0, σ 2 V ).
√ P
b) If λ̂1,n / n → τ ≥ 0 then
√ D
n(η̂ R − η) → Np−1 (−τ V η, σ 2 V ).
√ P
Proof: If λ̂1,n / n → τ ≥ 0, then by the above Gunst and Mason (1980)
identity,
η̂ R = [I p−1 − λ̂1,n (W T W + λ̂1,n I p−1 )−1 ]η̂OLS .
Hence √ √
n(η̂R − η) = n(η̂R − η̂ OLS + η̂ OLS − η) =
√ √ λ̂1,n
n(η̂ OLS − η) − n n(W T W + λ̂1,n I p−1 )−1 η̂OLS
n
D
→ Np−1 (0, σ 2 V ) − τ V η ∼ Np−1 (−τ V η, σ 2 V ). 
For p fixed, Knight and Fu (2000) note i) that η̂ R is a consistent estimator
of η if λ1,n = o(n) so λ1,n /n → 0 as√n → ∞, ii) OLS and ridge regression
are asymptotically
√ equivalent if λ1,n / n → 0 as
√ n → ∞, iii)√ridge regression
is a n consistent√ estimator of η if λ1,n = O( n) (so λ1,n / n is bounded),
and iv) if λ1,n / n → τ ≥ 0, then
√ D
n(η̂ R − η) → Np−1 (−τ V η, σ 2 V ).

Hence the bias can be considerable if τ 6= 0. If τ = 0, then OLS and ridge


regression have the same limiting distribution.
Even if p is fixed, there are several problems with ridge regression infer-
ence if λ̂1,n is selected, e.g. after 10-fold cross validation. For OLS forward
selection, the probability that the √ model Imin underfits goes to zero, and
each model with S ⊆ I produced a n consistent estimator β̂ I,0 of β. Ridge
regression with 10-fold CV often shrinks β̂ R too much if both i) the number
of population active predictors kS = aS − 1 in Equation (4.1) and Remark
5.4 is greater than about√20, and ii) the predictors are highly correlated. If
p is fixed and λ1,n = oP ( n), then the OLS full model and ridge regression
are asymptotically equivalent, but much larger sample sizes may be needed
for the normal approximation to be good for ridge regression since the ridge
232 5 Statistical Learning Alternatives to OLS

regression estimator can have large bias for moderate n. Ten fold CV does
√ P P
not appear to guarantee that λ̂1,n / n → 0 or λ̂1,n /n → 0.
Ridge regression can be a lot better than the OLS full model if i) X T X is
singular or ill conditioned or ii) n/p is small. Ridge regression can be much
faster than forward selection if M = 100 and n and p are large.
Roughly speaking, the biased estimation of the ridge regression estimator
can make the MSE of β̂ R or η̂ R less than that of β̂OLS or η̂ OLS , but the
large sample inference may need larger n for ridge regression than for OLS.
However, the large sample theory has n >> p. We will try to use prediction
intervals to compare OLS, forward selection, ridge regression, and lasso for
data sets where p > n. See Sections 5.9, 5.10, 5.11, and 5.12.

Warning. Although the R functions glmnet and cv.glmnet appear to


do ridge regression, getting the fitted values, λ̂1,n , and degrees of freedom to
match up with the formulas of this section can be difficult.

Example 5.2, continued. The ridge regression output below shows results
for the marry data where 10-fold CV was used. A grid of 100 λ values was
used, and λ0 > 0 was selected. A problem with getting the false degrees of
freedom d for ridge regression is that it is not clear that λ = λ1,n /(2n). We
need to know the relationship between λ and λ1,n in order to compute d. It
seems unlikely that d ≈ 1 if λ0 is selected.
library(glmnet); y <- marry[,3]; x <- marry[,-3]
out<-cv.glmnet(x,y,alpha=0)
lam <- out$lambda.min #value of lambda that minimizes
#the 10-fold CV criterion
yhat <- predict(out,s=lam,newx=x)
res <- y - yhat
n <- length(y)
w1 <- scale(x)
w <- sqrt(n/(n-1))*w1 #t(w) %*% w = n R_u, u = x
diag(t(w)%*%w)
pop mmen mmilmen milwmn
26 26 26 26
#sum w_iˆ2 = n = 26 for i = 1, 2, 3, and 4
svs <- svd(w)$d #singular values of w,
pp <- 1 + sum(svsˆ2/(svsˆ2+2*n*lam)) #approx 1
# d for ridge regression if lam = lam_{1,n}/(2n)
AERplot2(yhat,y,res=res,d=pp)
$respi #90% PI for a future residual
[1] -5482.316 14854.268 #length = 20336.584
#try to reproduce the fitted values
z <- y - mean(y)
q<-dim(w)[2]
I <- diag(q)
5.6 Lasso 233

M<- w%*%solve(t(w)%*%w + lam*I/(2*n))%*%t(w)


fit <- M%*%z + mean(y)
plot(fit,yhat) #they are not the same
max(abs(fit-yhat))
[1] 46789.11
M<- w%*%solve(t(w)%*%w + lam*I/(1547.1741))%*%t(w)
fit <- M%*%z + mean(y)
max(abs(fit-yhat)) #close
[1] 8.484979

5.6 Lasso

Consider the MLR model Y = Xβ + e. Lasso uses the centered response


Zi = Yi −Y and standardized nontrivial predictors in the model Z = W η +e
as described in Remark 5.1. Then Yˆi = Ẑi + Y . The residuals r = r(β̂ L) =
Y − Ŷ . Recall that Y = Y 1.

Definition 5.6. Consider fitting the MLR model Y = Xβ + e using


Z = W η + e. The lasso estimator η̂L minimizes the lasso criterion
p−1
1 λ1,n X
QL(η) = (Z − W η)T (Z − W η) + |ηi | (5.15)
a a
i=1

over all vectors η ∈ Rp−1 where λ1,n ≥ 0 and a > 0 are known constants
with a = 1, 2, n, and 2n are common. The residual sum of squares RSS(η) =
(Z − W η)T (Z − W η), and λ1,n = 0 corresponds to the OLS estimator
η̂OLS = (W T W )−1 W T Z if W has full rank p − 1. The lasso vector of fitted
values is Ẑ = Ẑ L = W η̂ L , and the lasso vector of residuals r(η̂ L ) = Z − Ẑ L .
The estimator is said to be regularized if λ1,n > 0. Obtain Ŷ and β̂L using
η̂L , Ẑ, and Y .

Using a vector of parameters η and a dummy vector η in QL is common


for minimizing a criterion Q(η), often with estimating equations. See the
paragraphs above and below Definition 5.2. We could also write
p−1
1 λ1,n X
QL (b) = r(b)T r(b) + |bj |, (5.16)
a a j=1

where the minimization is over all vectors b ∈ Rp−1 . The literature often uses
λa = λ = λ1,n /a.

For fixed λ1,n , the lasso optimization problem is convex. Hence fast algo-
rithms exist. As λ1,n increases, some of the η̂i = 0. If λ1,n is large enough,
234 5 Statistical Learning Alternatives to OLS

then η̂ L = 0 and Ŷi = Y for i = 1, ..., n. If none of the elements η̂i of η̂L are
zero, then η̂ L can be found, in principle, by setting the partial derivatives of
QL(η) to 0. Potential minimizers also occur at values of η where not all of the
partial derivatives exist. An analogy is finding the minimizer of a real valued
function of one variable h(x). Possible values for the minimizer include values
of xc satisfying h0 (xc ) = 0, and values xc where the derivative does not exist.
Typically some of the elements η̂i of η̂L that minimizes QL(η) are zero, and
differentiating does not work.

The following identity from Efron and Hastie (2016, p. 308), for example,
is useful for inference for the lasso estimator η̂L :

−1 T λ1,n λ1,n
W (Z − W η̂L ) + sn = 0 or − W T (Z − W η̂L ) + sn = 0
n 2n 2
where sin ∈ [−1, 1] and sin = sign(η̂i,L) if η̂i,L 6= 0. Here sign(ηi ) = 1 if ηi > 0
and sign(ηi ) = −1 if ηi < 0. Note that sn = sn,η̂ depends on η̂L . Thus η̂L
L

λ1,n λ1,n
= (W T W )−1 W T Z − n(W T W )−1 sn = η̂OLS − n(W T W )−1 sn .
2n 2n
If none of the elements of η are zero, and if η̂ L is a consistent estimator of η,
P √
then sn → s = sη . If λ1,n / n → 0, then OLS and lasso are asymptotically
equivalent even if sn does not converge to a vector s as n → ∞ since sn is
bounded. For model selection, the M values of λ are denoted by 0 ≤ λ1 <
λ2 < · · · < λM where λi = λ1,n,i depends on n for i = 1, ..., M . Also, λM
is the smallest value of λ such that η̂λM = 0. Hence η̂λi 6= 0 for i < M . If
λs corresponds to the model selected, then λ̂1,n = λs . The following theorem
shows that lasso and the OLS full model are asymptotically equivalent if
√ P √
λ̂1,n = oP (n1/2 ) so λ̂1,n / n → 0: thus n(η̂ L − η̂ OLS ) = op (1).

Theorem 5.5, Lasso CLT. Assume p is fixed and that the conditions of
the LS CLT Theorem Equation (5.7) hold for the model Z = W η + e.
√ P
a) If λ̂1,n / n → 0, then
√ D
n(η̂L − η) → Np−1 (0, σ 2 V ).
√ P P
b) If λ̂1,n / n → τ ≥ 0 and sn → s = sη , then
 
√ D −τ
n(η̂ L − η) → Np−1 V s, σ 2 V .
2

√ P P
Proof. If λ̂1,n / n → τ ≥ 0 and sn → s = sη , then
√ √
n(η̂ L − η) = n(η̂L − η̂OLS + η̂OLS − η) =
5.6 Lasso 235

√ √ λ1,n D τ
n(η̂ OLS − η) − n n(W T W )−1 sn → Np−1 (0, σ 2 V ) − V s
2n 2
 
−τ
∼ Np−1 V s, σ 2 V
2
P
since under the LS CLT, n(W T W )−1 → V .
P
Part a) does not need sn → s as n → ∞, since sn is bounded. 

Suppose p is fixed. Knight and Fu (2000) note i) that η̂L is a consistent


estimator of η if λ1,n = o(n) so λ1,n /n → 0 as n → ∞, ii) OLS and lasso are
asymptotically equivalent √ if λ1,n → ∞ too slowly as n → ∞ (e.g. if λ√1,n = λ
is fixed),
√ iii) lasso is a n consistent estimator of η if λ1,n = O( n) (so
λ1,n / n is bounded). Note that Theorem
√ 5.5 shows that OLS and lasso are
asymptotically equivalent if λ1,n / n → 0 as n → 0.

In the literature, the criterion often uses λa = λ1,n /a:


p−1
1 X
QL,a(b) = r(b)T r(b) + λa |bj |.
a j=1

The values a = 1, 2, and 2n are common. Following Hastie et al. (2015, pp.
9, 17, 19) for the next two paragraphs, it is convenient to use a = 2n:
p−1
1 X
QL,2n(b) = r(b)T r(b) + λ2n |bj |, (5.17)
2n j=1

where the ZiPare centered and the wj are standardized using g = 0 so w j = 0


and nσ̂j2 = ni=1 wi,j2
= n. Then λ = λ2n = λ1,n /(2n) in Equation (5.15).
For model selection, the M values of λ are denoted by 0 ≤ λ2n,1 < λ2n,2 <
· · · < λ2n,M where η̂ λ = 0 iff λ ≥ λ2n,M and

1 T
λ2n,max = λ2n,M = max s Z
j n j

and sj is the jth column of W corresponding to the jth standardized non-


trivial predictor Wj . In terms of the 0 ≤ λ1 < λ2 < · · · < λM , used above
Theorem 5.5, we have λi = λ1,n,i = 2nλ2n,i and

λM = 2nλ2n,M = 2 max sTj Z .


j

For model selection we let I denote the index set of the predictors in the
fitted model including the constant. The set A defined below is the index set
without the constant.
236 5 Statistical Learning Alternatives to OLS

Definition 5.7. The active set A is the index set of the nontrivial predic-
tors in the fitted model: the predictors with nonzero η̂i .

Suppose that there are k active nontrivial predictors. Then for lasso, k ≤ n.
Let the n × k matrix W A correspond to the standardized active predictors.
If the columns of W A are in general position, then the lasso vector of fitted
values

Ẑ L = W A (W TA W A )−1 W TA Z − nλ2n W A (W TA W A )−1 sA

where sA is the vector of signs of the active lasso coefficients. Here we are
using the λ2n of (5.17), and nλ2n = λ1,n /2. We could replace n λ2n by λ2 if
we used a = 2 in the criterion
p−1
1 X
QL,2 (b) = r(b)T r(b) + λ2 |bj |. (5.18)
2 j=1

See, for example, Tibshirani (2015). Note that W A (W TA W A )−1 W TA Z is the


vector of OLS fitted values from regressing Z on W A without an intercept.

Example 5.2, continued. The lasso output below shows results for the
marry data where 10-fold CV was used. A grid of 38 λ values was used, and
λ0 > 0 was selected.
library(glmnet); y <- marry[,3]; x <- marry[,-3]
out<-cv.glmnet(x,y)
lam <- out$lambda.min #value of lambda that minimizes
#the 10-fold CV criterion
yhat <- predict(out,s=lam,newx=x)
res <- y - yhat
pp <- out$nzero[out$lambda==lam] + 1 #d for lasso
AERplot2(yhat,y,res=res,d=pp)
$respi #90% PI for a future residual
-4102.672 4379.951 #length = 8482.62
There are some problems with lasso. i) Lasso large sample theory is worse
or as good as that of the OLS full model if n/p is large. ii) Ten fold CV does
√ P P
not appear to guarantee that λ̂1,n / n → 0 or λ̂1,n /n → 0. iii) Lasso often
shrinks β̂ too much if aS ≥ 20 and the predictors are highly correlated. iv)
Ridge regression can be better than lasso if aS > n.
Lasso can be a lot better than the OLS full model if i) X T X is singular
or ill conditioned or ii) n/p is small. iii) For lasso, M = M (lasso) is often
near 100. Let J ≥ 5. If n/J and p are both a lot larger than M (lasso), then
lasso can be considerably faster than forward selection, PLS, and PCR if
M = M (lasso) = 100 and M = M (F ) = min(dn/Je, p) where F stands for
forward selection, PLS, or PCR. iv) The number of nonzero coefficients in
5.7 Lasso Variable Selection 237

η̂L ≤ n even if p > n. This property of lasso can be useful if p >> n and the
population model is sparse.

5.7 Lasso Variable Selection

Lasso variable selection applies OLS on a constant and the active predictors
that have nonzero lasso η̂i . The method is called relaxed lasso by Hastie et al.
(2015, p. 12), and the relaxed lasso (φ = 0) estimator by Meinshausen (2007).
The method is also called OLS-post lasso and post model selection OLS.
Let X A denote the matrix with a column of ones and the unstandardized
active nontrivial predictors. Hence the lasso variable selection estimator is
β̂LV S = (X TA X A )−1 X TA Y , and lasso variable selection is an alternative to
forward selection. Let k be the number of active (nontrivial) predictors so
β̂V LS is (k + 1) × 1.
Let Imin correspond to the lasso variable selection estimator and β̂ V S =
β̂LV S,0 = β̂ Imin ,0 to the zero padded lasso variable selection estimator. Then

by Remark 4.5 where p is fixed, β̂LV S,0 is n consistent when lasso is consis-
tent, with the limiting distribution for β̂ LV S,0 given by Theorem 4.4. Hence,
relaxed lasso can be bootstrapped with the same methods used for forward
selection in Chapter 4. Lasso variable selection will often be better than lasso
when the model is sparse or if n ≥ 10(k + 1). Lasso can be better than lasso
variable selection if (X TA X A ) is ill conditioned or if n/(k + 1) < 10. Also see
Pelawa Watagoda and Olive (2020) and Rathnayake and Olive (2020).
Suppose the n × q matrix x has the q = p − 1 nontrivial predictors. The
following R code gives some output for a lasso estimator and then the corre-
sponding relaxed lasso estimator.
library(glmnet)
y <- marry[,3]
x <- marry[,-3]
out<-glmnet(x,y,dfmax=2) #Use 2 for illustration:
#often dfmax approx min(n/J,p) for some J >= 5.
lam<-out$lambda[length(out$lambda)]
yhat <- predict(out,s=lam,newx=x)
#lasso with smallest lambda in grid such that df = 2
lcoef <- predict(out,type="coefficients",s=lam)
as.vector(lcoef) #first term is the intercept
#3.000397e+03 1.800342e-03 9.618035e-01 0.0 0.0
res <- y - yhat
AERplot(yhat,y,res,d=3,alph=1) #lasso response plot
##relaxed lasso =
#OLS on lasso active predictors and a constant
vars <- 1:dim(x)[2]
238 5 Statistical Learning Alternatives to OLS

lcoef<-as.vector(lcoef)[-1] #don’t need an intercept


vin <- vars[lcoef>0] #the lasso active set
vin
#1 2 since predictors 1 and 2 are active
sub <- lsfit(x[,vin],y) #lasso variable selection
sub$coef
# Intercept pop mmen
#2.380912e+02 6.556895e-05 1.000603e+00
# 238.091 6.556895e-05 1.0006
res <- sub$resid
yhat <- y - res
AERplot(yhat,y,res,d=3,alph=1) #response plot
Example 5.2, continued. The lasso variable selection output below shows
results for the marry data where 10-fold CV was used to choose the lasso
estimator. Then lasso variable selection is OLS applied to the active variables
with nonzero lasso coefficients and a constant. A grid of 38 λ values was used,
and λ0 > 0 was selected. The OLS SE, t statistic and pvalue are generally
not valid for relaxed lasso by Remark 4.5 and Theorem 4.4.
library(glmnet); y <- marry[,3]; x <- marry[,-3]
out<-cv.glmnet(x,y)
lam <- out$lambda.min #value of lambda that minimizes
#the 10-fold CV criterion
pp <- out$nzero[out$lambda==lam] + 1
#d for lasso variable selection
#get lasso variable selection
lcoef <- predict(out,type="coefficients",s=lam)
lcoef<-as.vector(lcoef)[-1]
vin <- vars[lcoef!=0]
sub <- lsfit(x[,vin],y)
ls.print(sub)
Residual Standard Error=376.9412
R-Square=0.9999
F-statistic (df=2, 23)=147440.1
Estimate Std.Err t-value Pr(>|t|)58
Intercept 238.0912 248.8616 0.9567 0.3487
pop 0.0001 0.0029 0.0223 0.9824
mmen 1.0006 0.0164 60.9878 0.0000
res <- sub$resid
yhat <- y - res
AERplot2(yhat,y,res=res,d=pp)
$respi #90% PI for a future residual
-822.759 1403.771 #length = 2226.53
To summarize Example 5.2, forward selection selected the model with the
minimum Cp while the other methods used 10-fold CV. PLS and PCR used
5.7 Lasso Variable Selection 239

a) Forward Selection b) Ridge Regression

100000

100000
y

y
50000

50000
50000 100000 150000 50000 100000 150000

yhat yhat

c) Lasso d) Lasso Variable Selection


100000

100000
y

y
50000

50000

50000 100000 150000 50000 100000 150000

yhat yhat

Fig. 5.1 Marry Data Response Plots

the OLS full model with PI length 2395.74, forward selection used a constant
and mmen with PI length 2114.72, ridge regression had PI length 20336.58,
lasso and lasso variable selection used a constant, mmen, and pop with lasso
PI length 8482.62 and relaxed lasso PI length 2226.53. PI (4.14) was used.
Figure 5.1 shows the response plots for forward selection, ridge regression,
lasso, and lasso variable selection. The plots for PLS=PCR=OLS full model
were similar to those of forward selection and lasso variable selection. The
plots suggest that the MLR model is appropriate since the plotted points
scatter about the identity line. The 90% pointwise prediction bands are also
shown, and consist of two lines parallel to the identity line. These bands are
very narrow in Figure 5.1 a) and d).
240 5 Statistical Learning Alternatives to OLS

5.8 The Elastic Net

Following Hastie et al. (2015, p. 57), let β = (β1 , βTS )T , let λ1,n ≥ 0, and let
α ∈ [0, 1]. Let

RSS(β) = (Y − Xβ)T (Y − Xβ) = kY − Xβk22 .


Pk
For a k×1 vector η, the squared (Euclidean) L2 norm kηk22 = ηT η = i=1 ηi2
Pk
and the L1 norm kηk1 = i=1 |ηi |.

Definition 5.8. The elastic net estimator β̂ EN minimizes the criterion


 
1 1
QEN (β) = RSS(β) + λ1,n (1 − α)kβS k22 + αkβ S k1 , or (5.19)
2 2

Q2 (β) = RSS(β) + λ1 kβS k22 + λ2 kβ S k1 (5.20)


where 0 ≤ α ≤ 1, λ1 = (1 − α)λ1,n and λ2 = 2αλ1,n .

Note that α = 1 corresponds to lasso (using λa=0.5 ), and α = 0 corresponds


to ridge regression. For α < 1 and λ1,n > 0, the optimization problem is
strictly convex with a unique solution. The elastic net is due to Zou and
Hastie (2005). It has been observed that the elastic net can have much better
prediction accuracy than lasso when the predictors are highly correlated.
As with lasso, it is often convenient to use the centered response Z = Y −Y
where Y = Y 1, and the n×(p−1) matrix of standardized nontrivial predictors
W . Then regression through the origin is used for the model

Z = Wη + e (5.21)

where the vector of fitted values Ŷ = Y + Ẑ.


Ridge regression can be computed using OLS on augmented matrices.
Similarly, the elastic net can be computed using lasso on augmented matrices.
Let the elastic net estimator η̂EN minimize

QEN (η) = RSSW (η) + λ1 kηk22 + λ2 kηk1 (5.22)

where λ1 = (1 − α)λ1,n and λ2 = 2αλ1,n . Let the (n + p − 1) × (p − 1)


augmented matrix W A and the (n + p − 1) × 1 augmented response vector
Z A be defined by
   
√ W Z
WA = , and Z A = ,
λ1 I p−1 0

where 0 is the (p − 1) × 1 zero vector. Let RSSA (η) = kZ A − W A ηk22 . Then


η̂EN can be obtained from the lasso of Z A on W A : that is, η̂ EN minimizes
5.8 The Elastic Net 241

QL (η) = RSSA (η) + λ2 kηk1 = QEN (η). (5.23)

Proof: We need to show that QL(η) = QEN (η). Note that Z TA Z A = Z T Z,


 

WA η = √ ,
λ1 η

and Z TA W A η = Z T W η. Then

RSSA (η) = kZ A − W A ηk22 = (Z A − W A η)T (Z A − W A η) =

Z TA Z A − Z TA W A η − η T W TA Z A + ηT W TA W A η =
 p  Wη 
T T T T T T T √
Z Z − Z Wη − η W Z + η W λ1 η .
λ1 η
Thus

QL (η) = Z T Z − Z T W η − ηT W T Z + η T W T W η + λ1 ηT η + λ2 kηk1 =

RSS(η) + λ1 kηk22 + λ2 kηk1 = QEN (η). 


Remark 5.13. i) You could compute the elastic net estimator using a
grid of 100 λ1,n values and a grid of J ≥ 10 α values, which would take
about J ≥ 10 times as long to compute as lasso. The above equivalent lasso
problem (5.23) still needs a grid of λ1 = (1 − α)λ1,n and λ2 = 2αλ1,n values.
Often J = 11, 21, 51, or 101. The elastic net estimator tends to be com-
puted with fast methods for optimizing convex problems, such as coordinate
descent. ii) Like lasso and ridge regression, the elastic net estimator is asymp-

totically equivalent to the OLS full model if p is fixed and λ̂1,n = oP ( n),
but behaves worse than the OLS full model otherwise. See Theorem 5.6. iii)
For prediction intervals, let d be the number of nonzero coefficients from
the equivalent augmented lasso problem (5.23). Alternatively, use d2 with
d ≈ d2 = tr[W AS (W TAS W AS + λ2,n I p−1 )−1 W TAS ] where W AS corresponds
to the active set (not the augmented matrix). See Tibshirani and Taylor
(2012, p. 1214). Again λ2,n may not be the λ2 given by the software. iv)
The number of nonzero lasso components (not including the constant) is at
most min(n, p − 1). Elastic net tends to do variable selection, but the number
of nonzero components can equal p − 1 (make the elastic net equal to ridge
regression). Note that the number of nonzero components in the augmented
lasso problem (5.23) is at most min(n + p − 1, p − 1) = p − 1. vi) The elastic
net can be computed with glmnet, and there is an R package elasticnet.
vii) For fixed α > 0, we could get λM for elastic net from the equivalent lasso
problem. For ridge regression, we could use the λM for an α near 0.

Since lasso uses at most min(n, p − 1) nontrivial predictors, elastic net and
ridge regression can perform better than lasso if the true number of active
242 5 Statistical Learning Alternatives to OLS

nontrivial predictors aS > min(n, p − 1). For example, suppose n = 1000,


p = 5000, and aS = 1500.
Following Jia and Yu (2010), by standard Karush-Kuhn-Tucker (KKT)
conditions for convex optimality for Equation (5.20), η̂EN is optimal if

2W T W η̂ EN − 2W T Z + 2λ1 η̂ EN + λ2 sn = 0, or

λ2
(W T W + λ1 I p−1 )η̂EN = W T Z − sn , or
2
λ2
η̂EN = η̂ R − n(W T W + λ1 I p−1 )−1 sn . (5.24)
2n
Hence
λ1 λ2
η̂EN = η̂OLS − n(W T W +λ1 I p−1 )−1 η̂ OLS − n(W T W +λ1 I p−1 )−1 sn
n 2n
λ1 λ2
= η̂OLS − n(W T W + λ1 I p−1 )−1 [ η̂OLS + sn ].
n 2n
√ P P √ P √ P
Note that if λ̂1,n / n → τ and α̂ → ψ, then λ̂1 / n → (1 −ψ)τ and λ̂2 / n →
2ψτ. The following theorem shows elastic net is asymptotically equivalent to
√ P
the OLS full model if λ̂1,n / n → 0. Note that we get the RR CLT if ψ = 0
√ P
and the lasso CLT (using 2λ̂1,n / n → 2τ ) if ψ = 1. Under these conditions,

√ √ λ̂1 λ̂2
n(η̂ EN −η) = n(η̂OLS −η)−n(W T W + λ̂1 I p−1 )−1 [ √ η̂OLS + √ sn ].
n 2 n

The following theorem is due to Slawski et al. (2010), and summarized in


Pelawa Watagoda and Olive (2020).

Theorem 5.6, Elastic Net CLT. Assume p is fixed and that the condi-
tions of the LS CLT Equation (5.7) hold for the model Z = W η + e.
√ P
a) If λ̂1,n / n → 0, then
√ D
n(η̂ EN − η) → Np−1 (0, σ 2 V ).
√ P P P
b) If λ̂1,n / n → τ ≥ 0, α̂ → ψ ∈ [0, 1], and sn → s = sη , then
√ D 
n(η̂EN − η) → Np−1 −V [(1 − ψ)τ η + ψτ s], σ 2 V .

Proof. By the above remarks and the RR CLT Theorem 5.4,


√ √ √ √
n(η̂EN − η) = n(η̂EN − η̂R + η̂R − η) = n(η̂R − η) + n(η̂ EN − η̂R )

D  2ψτ
→ Np−1 −(1 − ψ)τ V η, σ 2 V − Vs
2
5.9 Prediction Intervals 243

∼ Np−1 −V [(1 − ψ)τ η + ψτ s], σ 2 V .
The mean of the normal distribution is 0 under a) since α̂ and sn are bounded.


Example 5.2, continued. The slpack function enet does elastic net
using 10-fold CV and a grid of α values {0, 1/am, 2/am, ..., am/am = 1}. The
default uses am = 10. The default chose lasso with alph = 1. The function
also makes a response plot, but does not add the lines for the pointwise
prediction intervals since the false degrees of freedom d is not computed.
library(glmnet); y <- marry[,3]; x <- marry[,-3]
tem <- enet(x,y)
tem$alph
[1] 1 #elastic net was lasso
tem<-enet(x,y,am=100)
tem$alph
[1] 0.97 #elastic net was not lasso with a finer grid
The elastic net variable selection estimator applies OLS to a constant
and the active predictors that have nonzero elastic net η̂_i. Hence elastic net
is used as a variable selection method. Let X_A denote the matrix with a
column of ones and the unstandardized active nontrivial predictors. Hence
the relaxed elastic net estimator is β̂_REN = (X_A^T X_A)^{−1} X_A^T Y, and relaxed
elastic net is an alternative to forward selection. Let k be the number of
active (nontrivial) predictors so β̂_REN is (k + 1) × 1. Let Imin correspond to
the elastic net variable selection estimator and β̂_VS = β̂_ENVS,0 = β̂_Imin,0
to the zero padded relaxed elastic net estimator. Then by Remark 4.5 where
p is fixed, β̂_ENVS,0 is √n consistent when elastic net is consistent, with the
limiting distribution for β̂_REN,0 given by Theorem 4.4. Hence, relaxed elastic
net can be bootstrapped with the same methods used for forward selection
in Chapter 4. Elastic net variable selection will often be better than elastic
net when the model is sparse or if n ≥ 10(k + 1). The elastic net can be
better than elastic net variable selection if (X_A^T X_A) is ill conditioned or if
n/(k + 1) < 10. Also see Olive (2019) and Rathnayake and Olive (2020).
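A minimal R sketch of this two step procedure follows (illustrative code, not from
the text; it assumes the glmnet package, and the choice α = 0.5 and the simulated
data are for illustration only).

#Elastic net variable selection (relaxed elastic net) sketch:
#step 1: elastic net picks the active set; step 2: OLS on a constant
#plus the unstandardized active predictors.
library(glmnet)
set.seed(1)
n <- 100; p <- 10
x <- matrix(rnorm(n*p),n,p)
y <- 1 + x[,1] + x[,2] + rnorm(n)
out <- cv.glmnet(x,y,alpha=0.5)
co <- as.vector(coef(out,s="lambda.min"))[-1] #drop the intercept
act <- which(co != 0)                         #active nontrivial predictors
relaxed <- lm(y ~ x[,act,drop=FALSE])         #OLS on the active set
coef(relaxed)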

5.9 Prediction Intervals

This section will use the prediction intervals from Section 4.3 applied to the
MLR model with m̂(x) = x_I^T β̂_I where I corresponds to the predictors used
by the MLR method. We will use the six methods forward selection with
OLS, PCR, PLS, lasso, relaxed lasso, and ridge regression. When p > n,
results from Hastie et al. (2015, pp. 20, 296, ch. 6, ch. 11) and Luo and Chen
(2013) suggest that lasso, relaxed lasso, and forward selection with EBIC can
perform well for sparse models: the subset S in Equation (4.1) and Remark
5.4 has aS small.
Consider d for the prediction interval (4.14). As in Chapter 4, with the
exception of ridge regression, let d be the number of “variables” used by the
method, including a constant. Hence for lasso, relaxed lasso, and forward
selection, d − 1 is the number of active predictors while d − 1 is the number
of “components” used by PCR and PLS.
Many things can go wrong with prediction. It is assumed that the test
data follows the same MLR model as the training data. Population drift is a
common reason why the above assumption, which assumes that the various
distributions involved do not change over time, is violated. Population drift
occurs when the population distribution does change over time.
A second thing that can go wrong is that the training or test data set is
distorted away from the population distribution. This could occur if outliers
are present or if the training data set and test data set are drawn from
different populations. For example, the training data set could be drawn
from three hospitals, and the test data set could be drawn from two more
hospitals. These two populations of three and two hospitals may differ.
A third thing that can go wrong is extrapolation: if xf is added to
x1 , ..., xn , then there is extrapolation if xf is not like the xi , e.g. xf is an
outlier. Predictions based on extrapolation are not reliable. Check whether
the Euclidean distance of xf from the coordinatewise median MED(X) of
the x1 , ..., xn satisfies Dxf (MED(X), I p ) ≤ maxi=1,...,n Di (MED(X), I p ).
Alternatively, use the ddplot5 function, described in Chapter 7, applied to
x1 , ..., xn , xf to check whether xf is an outlier.
When n ≥ 10p, let the hat matrix H = X(X T X)−1 X T . Let hi = hii
be the ith diagonal element of H for i = 1, ..., n. Then hi is called the
ith leverage and hi = xTi (X T X)−1 xi . Then the leverage of xf is hf =
xTf (X T X)−1 xf . Then a rule of thumb is that extrapolation occurs if hf >
max(h1 , ..., hn). This rule works best if the predictors are linearly related in
that a plot of xi versus xj should not have any strong nonlinearities. If there
are strong nonlinearities among the predictors, then xf could be far from the
xi but still have hf < max(h1 , ..., hn). If the regression method, such as lasso
or forward selection, uses a set I of a predictors, including a constant, where
n ≥ 10a, the above rule of thumb could be used for extrapolation where xf ,
xi , and X are replaced by xI,f , xI,i , and X I .
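A short R sketch of this leverage rule of thumb is below (illustrative code, not from
the text).

#Flag extrapolation if hf > max(h1,...,hn).
set.seed(1)
n <- 100; p <- 4
X <- cbind(1, matrix(rnorm(n*(p-1)),n,p-1))  #design matrix with a constant
XtXinv <- solve(crossprod(X))
h <- rowSums((X %*% XtXinv) * X)             #hii = xi' (X'X)^{-1} xi
xf <- c(1, 5, 0, 0)                          #hypothetical new case
hf <- drop(t(xf) %*% XtXinv %*% xf)
hf > max(h)   #TRUE suggests xf requires extrapolation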
For the simulation from Pelawa Watagoda and Olive (2019b), we used
several R functions including forward selection (FS) as computed with the
regsubsets function from the leaps library, principal components regres-
sion (PCR) with the pcr function and partial least squares (PLS) with the
plsr function from the pls library, and ridge regression (RR) and lasso
with the cv.glmnet function from the glmnet library. Relaxed lasso (RL)
was applied to the selected lasso model.
Let x = (1 uT )T where u is the (p − 1) × 1 vector of nontrivial predictors.
In the simulations, for i = 1, ..., n, we generated w i ∼ Np−1 (0, I) where the

Table 5.1 Simulated Large Sample 95% PI Coverages and Lengths, ei ∼ N (0, 1)
n p ψ k FS lasso RL RR PLS PCR
100 20 0 1 cov 0.9644 0.9750 0.9666 0.9560 0.9438 0.9772
len 4.4490 4.8245 4.6873 4.5723 4.4149 5.5647
100 40 0 1 cov 0.9654 0.9774 0.9588 0.9274 0.8810 0.9882
len 4.4294 4.8889 4.6226 4.4291 4.0202 7.3393
100 200 0 1 cov 0.9648 0.9764 0.9268 0.9584 0.6616 0.9922
len 4.4268 4.9762 4.2748 6.1612 2.7695 12.412
100 50 0 49 cov 0.8996 0.9719 0.9736 0.9820 0.8448 1.0000
len 22.067 6.8345 6.8092 7.7234 4.2141 38.904
200 20 0 19 cov 0.9788 0.9766 0.9788 0.9792 0.9550 0.9786
len 4.9613 4.9636 4.9613 5.0458 4.3211 4.9610
200 40 0 19 cov 0.9742 0.9762 0.9740 0.9738 0.9324 0.9792
len 4.9285 5.2205 5.1146 5.2103 4.2152 5.3616
200 200 0 19 cov 0.9728 0.9778 0.9098 0.9956 0.3500 1.0000
len 4.8835 5.7714 4.5465 22.351 2.1451 51.896
400 20 0.9 19 cov 0.9664 0.9748 0.9604 0.9726 0.9554 0.9536
len 4.5121 10.609 4.5619 10.663 4.0017 3.9771
400 40 0.9 19 cov 0.9674 0.9608 0.9518 0.9578 0.9482 0.9646
len 4.5682 14.670 4.8656 14.481 4.0070 4.3797
400 400 0.9 19 cov 0.9348 0.9636 0.9556 0.9632 0.9462 0.9478
len 4.3687 47.361 4.8530 48.021 4.2914 4.4764
400 400 0 399 cov 0.9486 0.8508 0.5704 1.0000 0.0948 1.0000
len 78.411 37.541 20.408 244.28 1.1749 305.93
400 800 0.9 19 cov 0.9268 0.9652 0.9542 0.9672 0.9438 0.9554
len 4.3427 67.294 4.7803 66.577 4.2965 4.6533

m = p − 1 elements of the vector w_i are iid N(0,1). Let the m × m matrix
A = (a_ij) with a_ii = 1 and a_ij = ψ where 0 ≤ ψ < 1 for i ≠ j. Then the
vector u_i = A w_i so that Cov(u_i) = Σ_u = AA^T = (σ_ij) where the diagonal
entries σ_ii = [1 + (m − 1)ψ²] and the off diagonal entries σ_ij = [2ψ + (m − 2)ψ²].
Hence the correlations are cor(x_i, x_j) = ρ = (2ψ + (m − 2)ψ²)/(1 + (m − 1)ψ²)
for i ≠ j where x_i and x_j are nontrivial predictors. If ψ = 1/√(cp), then
ρ → 1/(c + 1) as p → ∞ where c > 0. As ψ gets close to 1, the predictor
vectors cluster about the line in the direction of (1, ..., 1)^T. Let Y_i = 1 + 1x_{i,2} +
· · · + 1x_{i,k+1} + e_i for i = 1, ..., n. Hence β = (1, ..., 1, 0, ..., 0)^T with k + 1 ones
and p − k − 1 zeros. The zero mean errors e_i were iid from five distributions:
i) N(0,1), ii) t_3, iii) EXP(1) − 1, iv) uniform(−1, 1), and v) 0.9 N(0,1) +
0.1 N(0,100). Normal distributions usually appear in simulations, and the
uniform distribution is the distribution where the shorth undercoverage is
maximized by Frey (2013). Distributions ii) and v) have heavy tails, and
distribution iii) is not symmetric.
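The following R sketch generates data as described above for one run (illustrative
code with small n, p, and k; the simulations in the text used the functions cited
earlier in this section).

#Generate ui = A wi with aii = 1, aij = psi, and
#Yi = 1 + xi,2 + ... + xi,k+1 + ei with N(0,1) errors.
set.seed(1)
n <- 100; p <- 20; k <- 1; psi <- 0.9
m <- p - 1
A <- matrix(psi,m,m); diag(A) <- 1
w <- matrix(rnorm(n*m),n,m)
u <- w %*% t(A)                 #rows are ui = A wi
b <- c(rep(1,k), rep(0,m-k))
y <- as.vector(1 + u %*% b + rnorm(n))
cor(u[,1],u[,2])  #near (2 psi + (m-2) psi^2)/(1 + (m-1) psi^2)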
The population shorth 95% PI lengths estimated by the asymptotically
optimal 95% PIs are i) 3.92 = 2(1.96), ii) 6.365, iii) 2.996, iv) 1.90 = 2(0.95),
and v) 13.490. The split conformal PI (4.16) is not asymptotically optimal
for iii), and for iii) PI (4.16) has asymptotic length 2(1.966) = 3.992. The
simulation used 5000 runs, so an observed coverage in [0.94, 0.96] gives no
reason to doubt that the PI has the nominal coverage of 0.95. The simulation
used p = 20, 40, 50, n, or 2n; ψ = 0, 1/√p, or 0.9; and k = 1, 19, or p − 1. The
OLS full model fails when p = n and p = 2n, where regularity conditions
for consistent estimators are strong. The values k = 1 and k = 19 are sparse
models where lasso, relaxed lasso, and forward selection with EBIC can perform
well when n/p is not large. If k = p − 1 and p ≥ n, then the model
is dense. When ψ = 0, the predictors are uncorrelated; when ψ = 1/√p,
the correlation goes to 0.5 as p increases and the predictors are moderately
correlated. For ψ = 0.9, the predictors are highly correlated with 1 dominant
principal component, a setting favorable for PLS and PCR. The simulated
data sets are rather small since some of the R estimators are rather slow.
The simulations were done in R. See R Core Team (2016). The results
were similar for all five error distributions, and we show some results for
the normal and shifted exponential distributions. Tables 5.1 and 5.2 show
some simulation results for PI (4.14) where forward selection used Cp for
n ≥ 10p and EBIC for n < 10p. The other methods minimized 10-fold CV. For
forward selection, the maximum number of variables used was approximately
min(dn/5e, p). Ridge regression used the same d that was used for lasso.
For n ≥ 5p, coverages tended to be near or higher than the nominal value
of 0.95. The average PI length was often near 1.3 times the asymptotically
optimal length for n = 10p and close to the optimal length for n = 100p. Cp
and EBIC produced good PIs for forward selection, and 10-fold CV produced
good PIs for PCR and PLS. For lasso and ridge regression, 10-fold CV pro-
duced good PIs if ψ = 0 or if k was small, but if both k ≥ 19 and ψ ≥ 0.5,
then 10-fold CV tended to shrink too much and the PI lengths were often
too long. Lasso did appear to select S ⊆ Imin since relaxed lasso was good.
For n/p not large, good performance needed stronger regularity conditions,
and all six methods can have problems. PLS tended to have severe undercov-
erage with small average length, but sometimes performed well for ψ = 0.9.
The PCR length was often too long for ψ = 0. If there was k = 1 active
population predictor, then forward selection with EBIC, lasso, and relaxed
lasso often performed well. For k = 19, forward selection with EBIC often
performed well, as did lasso and relaxed lasso for ψ = 0. For dense models
with k = p − 1 and n/p not large, there was often undercoverage. Here for-
ward selection would use about n/5 variables. Let d − 1 be the number of
active nontrivial predictors in the selected model. For N(0,1) errors, ψ = 0,
and d < k, an asymptotic population 95% PI has length 3.92√(k − d + 1).
Note that when the (Y_i, u_i^T)^T follow a multivariate normal distribution, every
subset follows a multiple linear regression model. EBIC occasionally had
undercoverage, especially for k = 19 or p − 1, which was usually more severe
for ψ = 0.9 or 1/√p.
Tables 5.3 and 5.4 show some results for PIs (4.15) and (4.16). Here forward
selection used the minimum Cp model if nH > 10p and EBIC otherwise. The
coverage was very good. Labels such as CFS and CRL used PI (4.16). For
relaxed lasso, the program sometimes failed to run for 5000 runs, e.g., if the

Table 5.2 Simulated Large Sample 95% PI Coverages and Lengths, ei ∼ EXP (1)−1

n p ψ k FS lasso RL RR PLS PCR


100 20 0 1 cov 0.9622 0.9728 0.9648 0.9544 0.9460 0.9724
len 3.7909 4.4344 4.3865 4.4375 4.2818 5.5065
2000 20 0 1 cov 0.9506 0.9502 0.9500 0.9488 0.9486 0.9542
len 3.1631 3.1199 3.1444 3.2380 3.1960 3.3220
200 20 0.9 1 cov 0.9588 0.9666 0.9664 0.9666 0.9556 0.9612
len 3.7985 3.6785 3.7002 3.7491 3.5049 3.7844
200 20 0.9 19 cov 0.9704 0.9760 0.9706 0.9784 0.9578 0.9592
len 4.6128 12.1188 4.8732 12.0363 3.3929 3.7374
200 200 0.9 19 cov 0.9338 0.9750 0.9564 0.9740 0.9440 0.9596
len 4.6271 37.3888 5.1167 56.2609 4.0550 4.6994
400 40 0.9 19 cov 0.9678 0.9654 0.9492 0.9624 0.9426 0.9574
len 4.3433 14.7390 4.7625 14.6602 3.6229 4.1045

Table 5.3 Validation Residuals: Simulated Large Sample 95% PI Coverages and
Lengths, ei ∼ N (0, 1)

n,p,ψ,k FS CFS RL CRL Lasso CL RR CRR


200,20, 0,19 cov 0.9574 0.9446 0.9522 0.9420 0.9538 0.9382 0.9542 0.9430
len 4.6519 4.3003 4.6375 4.2888 4.6547 4.2964 4.7215 4.3569
200,40,0,19 cov 0.9564 0.9412 0.9524 0.9440 0.9550 0.9406 0.9548 0.9404
len 4.9188 4.5426 5.2665 4.8637 5.1073 4.7193 5.3481 4.9348
200,200, 0,19 cov 0.9488 0.9320 0.9548 0.9392 0.9480 0.9380 0.9536 0.9394
len 7.0096 6.4739 5.1671 4.7698 31.1417 28.7921 47.9315 44.3321
400,20,0.9,19 cov 0.9498 0.9406 0.9488 0.9438 0.9524 0.9426 0.9550 0.9426
len 4.4153 4.1981 4.5849 4.3591 9.4405 8.9728 9.2546 8.8054
400,40,0.9,19 cov 0.9504 0.9404 0.9476 0.9388 0.9496 0.9400 0.9470 0.9410
len 4.7796 4.5423 4.9704 4.7292 13.3756 12.7209 12.9560 12.3118
400,400,0.9,19 cov 0.9480 0.9398 0.9554 0.9444 0.9506 0.9422 0.9506 0.9408
len 5.2736 5.0131 4.9764 4.7296 43.5032 41.3620 42.6686 40.5578
400,800,0.9,19 cov 0.9550 0.9474 0.9522 0.9412 0.9550 0.9450 0.9550 0.9446
len 5.3626 5.0943 4.9382 4.6904 60.9247 57.8783 60.3589 57.3323

number of variables selected d = nH . In Table 5.3, PIs (4.15) and (4.16) are
asymptotically equivalent, but PI (4.16) had shorter lengths for moderate
n. In Table 5.4, PI (4.15) is shorter than PI (4.16) asymptotically, but for
moderate n, PI (4.16) was often shorter.
Table 5.5 shows some results for PIs (4.14) and (4.15) for lasso and ridge
regression. The header lasso indicates PI (4.14) was used while vlasso indi-
cates that PI (4.15) was used. PI (4.15) tended to work better when the fit
was poor while PI (4.14) was better for n = 2p and k = p − 1. The PIs are
asymptotically equivalent for consistent estimators.

Table 5.4 Validation Residuals: Simulated Large Sample 95% PI Coverages and
Lengths, ei ∼ EXP (1) − 1

n,p,ψ,k FS CFS RL CRL Lasso CL RR CRR


200,20,0,1 cov 0.9596 0.9504 0.9588 0.9374 0.9604 0.9432 0.9574 0.9438
len 4.6055 4.2617 4.5984 4.2302 4.5899 4.2301 4.6807 4.2863
2000,20,0,1 cov 0.9560 0.9508 0.9530 0.9464 0.9544 0.9462 0.9530 0.9462
len 3.3469 3.9899 3.3240 3.9849 3.2709 3.9786 3.4307 3.9943
200,20,0.9,1 cov 0.9564 0.9402 0.9584 0.9362 0.9634 0.9412 0.9638 0.9418
len 3.9184 3.8957 3.8765 3.8660 3.8406 3.8483 3.8467 3.8509
200,20,0.9,19 cov 0.9630 0.9448 0.9510 0.9368 0.9554 0.9430 0.9572 0.9420
len 5.0543 4.6022 4.8139 4.3841 9.8640 9.0748 9.5218 8.7366
200,200,0.9,19 cov 0.9570 0.9434 0.9588 0.9418 0.9552 0.9392 0.9544 0.9394
len 5.8095 5.2561 5.2366 4.7292 31.1920 28.8602 47.9229 44.3251
400,40,0.9,19 cov 0.9476 0.9402 0.9494 0.9416 0.9584 0.9496 0.9562 0.9466
len 4.6992 4.4750 4.9314 4.6703 13.4070 12.7442 13.0579 12.4015

Table 5.5 PIs (4.14) and (4.15): Simulated Large Sample 95% PI Coverages and
Lengths

n p ψ k dist lasso vlasso RR vRR


100 20 0 1 cov N(0,1) 0.9750 0.9632 0.9564 0.9606
len 4.8245 4.7831 4.5741 5.3277
100 20 0 1 cov EXP(1)−1 0.9728 0.9582 0.9546 0.9612
len 4.4345 5.0089 4.4384 5.6692
100 50 0 49 cov N(0,1) 0.9714 0.9606 0.9822 0.9618
len 6.8345 22.3265 7.7229 27.7275
100 50 0 49 cov EXP(1)−1 0.9716 0.9618 0.9814 0.9608
len 6.9460 22.4097 7.8316 27.8306
400 400 0 399 cov N(0,1) 0.8508 0.9518 1.0000 0.9548
len 37.5418 78.0652 244.1004 69.5812
400 400 0 399 cov EXP(1)−1 0.8446 0.9586 1.0000 0.9558
len 37.5185 78.0564 243.7929 69.5474

5.10 Cross Validation

For MLR variable selection there are many methods for choosing the final
submodel, including AIC, BIC, Cp , and EBIC. See Section 4.1. Variable se-
lection is a special case of model selection where there are M models and a final
model needs to be chosen. Cross validation is a common criterion for model
selection.
Definition 5.9. For k-fold cross validation (k-fold CV), randomly divide
the training data into k groups or folds of approximately equal size nj ≈ n/k
for j = 1, ..., k. Leave out the first fold, fit the statistical method to the k − 1
remaining folds, and then compute some criterion for the first fold. Repeat
for folds 2, ..., k.

Following James et al. (2013, p. 181), if the statistical method is an MLR
method, we often compute Ŷ_i(j) for each Y_i in the fold j left out. Then

MSE_j = (1/n_j) Σ_{i=1}^{n_j} (Y_i − Ŷ_i(j))²,

and the overall criterion is

CV_(k) = (1/k) Σ_{j=1}^{k} MSE_j.

Note that if each n_j = n/k, then

CV_(k) = (1/n) Σ_{i=1}^{n} (Y_i − Ŷ_i(j))².

Then CV(k) ≡ CV(k) (Ii ) is computed for i = 1, ..., M , and the model Ic with
the smallest CV(k)(Ii ) is selected.
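A minimal R sketch of Definition 5.9 for a single OLS model is below (illustrative
code; it does not use the linmodpack or slpack functions).

#k-fold CV criterion CV_(k) for OLS.
set.seed(1)
n <- 100; k <- 5
x <- matrix(rnorm(n*3),n,3)
y <- 1 + x[,1] + rnorm(n)
folds <- sample(rep(1:k, length.out=n))
mse <- numeric(k)
for(j in 1:k){
  test <- which(folds == j)
  fit <- lm(y[-test] ~ x[-test,])
  yhat <- cbind(1, x[test,]) %*% coef(fit)
  mse[j] <- mean((y[test] - yhat)^2)
}
mean(mse)  #CV_(k); equals (1/n) sum (Yi - Yhat_i(j))^2 if folds have equal size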
Assume that model (4.1) holds: Y = xT β + e = xTS β S + e where β S is an
aS × 1 vector. Suppose p is fixed and n → ∞. If β̂ I is a × 1, form the p × 1
vector β̂ I,0 from β̂ I by adding 0s corresponding to the omitted variables. If
P(S ⊆ Imin) → 1 as n → ∞, then Theorem 4.4 and Remark 4.5 showed that
β̂_Imin,0 is a √n consistent estimator of β under mild regularity conditions.
Note that if aS = p, then β̂ Imin ,0 is asymptotically equivalent to the OLS full
model β̂ (since S is equal to the full model).

Choosing folds for k-fold cross validation is similar to randomly allocating


cases to treatment groups. The following code is useful for a simulation. It
makes copies of 1 to k in a vector of length n called tfolds. The sample
command makes a permutation of tfolds to get the folds. The lengths of the
k folds differ by at most 1.
n<-26
k<-5
J<-as.integer(n/k)+1
tfolds<-rep(1:k,J)
tfolds<-tfolds[1:n] #can pass tfolds to a loop
folds<-sample(tfolds)
folds
4 2 3 5 3 3 1 5 2 2 5 1 2 1 3 4 2 1 5 5 1 4 1 4 4 3
Example 5.2, continued. The linmodpack function pifold uses k-fold
CV to get the coverage and average PI lengths. We used 5-fold CV with
coverage and average 95% PI length to compare the forward selection models.
All 4 models had coverage 1, but the average 95% PI lengths were 2591.243,
2741.154, 2902.628, and 2972.963 for the models with 2 to 5 predictors. See
the following R code.
y <- marry[,3]; x <- marry[,-3]
x1 <- x[,2]
x2 <- x[,c(2,3)]
x3 <- x[,c(1,2,3)]
pifold(x1,y) #nominal 95% PI
$cov
[1] 1
$alen
[1] 2591.243
pifold(x2,y)
$cov
[1] 1
$alen
[1] 2741.154
pifold(x3,y)
$cov
[1] 1
$alen
[1] 2902.628
pifold(x,y)
$cov
[1] 1
$alen
[1] 2972.963
#Validation PIs for submodels: the sample size is
#likely too small and the validation PI is formed
#from the validation set.
n<-dim(x)[1]
nH <- ceiling(n/2)
indx<-1:n
perm <- sample(indx,n)
H <- perm[1:nH]
vpilen(x1,y,H) #13/13 were in the validation PI
$cov
[1] 1.0
$len
[1] 116675.4
vpilen(x2,y,H)
$cov
[1] 1.0
$len
[1] 116679.8
vpilen(x3,y,H)
$cov
[1] 1.0
$len
[1] 116312.5
vpilen(x,y,H)
$cov
[1] 1.0
$len #shortest length
[1] 116270.7
Some more code is below.
n <- 100
p <- 4
k <- 1
q <- p-1
x <- matrix(rnorm(n * q), nrow = n, ncol = q)
b <- 0 * 1:q
b[1:k] <- 1
y <- 1 + x %*% b + rnorm(n)
x1 <- x[,1]
x2 <- x[,c(1,2)]
x3 <- x[,c(1,2,3)]
pifold(x1,y)
$cov
[1] 0.96
$alen
[1] 4.2884
pifold(x2,y)
$cov
[1] 0.98
$alen
[1] 4.625284
pifold(x3,y)
$cov
[1] 0.98
$alen
[1] 4.783187
pifold(x,y)
$cov
[1] 0.98
$alen
[1] 4.713151

n <- 10000
p <- 4
k <- 1
q <- p-1
x <- matrix(rnorm(n * q), nrow = n, ncol = q)
b <- 0 * 1:q
b[1:k] <- 1
y <- 1 + x %*% b + rnorm(n)
x1 <- x[,1]
x2 <- x[,c(1,2)]
x3 <- x[,c(1,2,3)]
pifold(x1,y)
$cov
[1] 0.9491
$alen
[1] 3.96021
pifold(x2,y)
$cov
[1] 0.9501
$alen
[1] 3.962338
pifold(x3,y)
$cov
[1] 0.9492
$alen
[1] 3.963305
pifold(x,y)
$cov
[1] 0.9498
$alen
[1] 3.96203

5.11 Hypothesis Testing After Model Selection, n/p Large

Section 4.6 showed how to use the bootstrap for the hypothesis test H0 : θ =
Aβ = θ_0 versus H1 : θ = Aβ ≠ θ_0 with the statistic Tn = Aβ̂_Imin,0
where β̂ Imin ,0 is the zero padded OLS estimator computed from the variables
corresponding to Imin . The theory needs P (S ⊆ Imin ) → 1 as n → ∞, and
hence applies to OLS variable selection with AIC, BIC, and Cp , and to relaxed
lasso and relaxed elastic net if lasso and elastic net are consistent.
Assume n ≥ 20p and that the error distribution is unimodal and not highly
skewed. The response plot and residual plot are plots with Ŷ = xT β̂ on the
horizontal axis and Y or r on the vertical axis, respectively. Then the plotted
points in these plots should scatter in roughly even bands about the identity
line (with unit slope and zero intercept) and the r = 0 line, respectively.
See Figure 1.1. If the plots for the OLS full model suggest that the error
distribution is skewed or multimodal, then much larger sample sizes may be
needed.
Let p be fixed. Then lasso is asymptotically equivalent to OLS if λ̂_1n/√n → 0,
and hence should not have any β̂_i = 0, asymptotically. If aS < p, then lasso
tends not to be √n consistent if lasso selects S with high probability by Ewald
and Schneider (2018), but then relaxed lasso tends to be √n consistent. If
λ̂_1n/n → 0, then lasso is consistent so P(S ⊆ I) → 1 as n → ∞. Hence often
if lasso has more than one β̂_i = 0, then lasso is not √n consistent.
Suppose we use the residual bootstrap where Y ∗ = X β̂ OLS +r W follows a
standard linear model where the elements riW of rW are iid from the empirical
distribution of the OLS full model residuals ri . In Section 4.6 we used forward
selection when regressing Y ∗ on X, but we could use lasso or ridge regression
instead. Since these estimators are consistent if λ̂_1n/n → 0 as n → ∞, we
expect β̂*_L and β̂*_R to be centered at β̂_OLS. If the variability of the β̂* is similar
to or greater than that of β̂_OLS, then by the geometric argument Theorem
4.5, we might get simulated coverage close to or higher than the nominal.
If lasso or ridge regression shrink β̂* too much, then the coverage could be
bad. In limited simulations, the prediction region method only simulated well
for ridge regression with ψ = 0. Results from Ewald and Schneider (2018, p.
1365) suggest that the lasso confidence region volume is greater than OLS
confidence region volume when lasso uses λ_1n = √n/2.
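A sketch of the residual bootstrap step described above is given next (illustrative
code, not the lassobootsim functions; it assumes the glmnet package and refits lasso
to each Y*).

#Residual bootstrap: Y* = X betahatOLS + r^W with r^W drawn iid from the
#OLS full model residuals; lasso is refit to each Y*.
library(glmnet)
set.seed(1)
n <- 100; p <- 4; B <- 100
x <- matrix(rnorm(n*(p-1)),n,p-1)
y <- 1 + x[,1] + rnorm(n)
full <- lm(y ~ x)
fits <- fitted(full); res <- resid(full)
betas <- matrix(NA,B,p)
for(b in 1:B){
  ystar <- fits + sample(res,n,replace=TRUE)
  out <- cv.glmnet(x,ystar)
  betas[b,] <- as.vector(coef(out,s="lambda.min"))
}
colMeans(betas)  #bootstrap lasso estimates, expected to center near betahatOLS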
A small simulation was done for confidence intervals and confidence re-
gions, using the same type of data as for the variable selection simula-
tion in Section 4.6 and the prediction interval simulation in Section 5.9,
with B = max(1000, n, 20p) and 5000 runs. The regression model used
β = (1, 1, 0, 0)T with n = 100 and p = 4. When ψ = 0, the design matrix
X consisted of iid N(0,1) random variables. See Table 5.6 which was taken
from Pelawa Watagoda (2017). The residual bootstrap was used. Types 1)–5)
correspond to types i)–v), and the ε value only applies to the type 5)
error distribution. The function lassobootsim3 uses the prediction region
method for lasso and ridge regression. The function lassobootsim4 can
be used to simulate confidence intervals for the β_i if S*_T is singular for lasso.
The test was for H0 : (β3, β4)^T = (0, 0)^T.

5.12 Data Splitting

A common method for data splitting randomly divides the data set into two
half sets. On the first half set, fit the model selection method, e.g. forward

Table 5.6 Bootstrapping Lasso, ψ = 0


n ε type β1 β2 β3 β4 test
100 1 cov 0.9440 0.9376 0.9910 0.9946 0.9790
len 0.4143 0.4470 0.3759 0.3763 2.6444
2 cov 0.9468 0.9428 0.9946 0.9944 0.9816
len 0.6870 0.7565 0.6238 0.6226 2.6832
3 cov 0.9418 0.9408 0.9930 0.9948 0.9840
len 0.4110 0.4506 0.3743 0.3746 2.6684
4 cov 0.9468 0.9370 0.9938 0.9948 0.9838
len 0.2392 0.2578 0.2151 0.2153 2.6454
0.5 5 cov 0.9438 0.9344 0.9988 0.9970 0.9924
len 2.9380 2.5042 2.4912 2.4715 2.8536
0.9 5 cov 0.9506 0.9290 0.9974 0.9976 0.9956
len 3.9180 3.2760 3.7356 3.2739 2.8836

selection or lasso, to get the a predictors. Use this model as the full model
for the second half set: use the standard OLS inference from regressing the
response on the predictors found from the first half set. This method can
be inefficient if n ≥ 10p, but is useful for a sparse model if n ≤ 5p, if the
probability that the model underfits goes to zero, and if n ≥ 20a. A model is
sparse if the number of predictors with nonzero coefficients is small.
For lasso, the active set I from the first half set (training data) is found,
and the data splitting estimator is the OLS estimator β̂_I,D computed from the
second half set (test data). This estimator is not the relaxed lasso estimator.
The estimator β̂ I,D has the same large sample theory as if I was chosen
before obtaining the data.
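A sketch of this data splitting procedure with lasso is below (illustrative code
assuming the glmnet package; the simulated data are for illustration only).

#Data splitting: lasso on the first half picks the active set I, then
#standard OLS inference on the second half uses only those predictors.
library(glmnet)
set.seed(1)
n <- 200; p <- 10
x <- matrix(rnorm(n*p),n,p)
y <- 1 + x[,1] + x[,2] + rnorm(n)
H <- sample(1:n, n/2)                         #first half set (training data)
out <- cv.glmnet(x[H,], y[H])
act <- which(as.vector(coef(out,s="lambda.min"))[-1] != 0)
fit2 <- lm(y[-H] ~ x[-H,act,drop=FALSE])      #betahat_{I,D} from the second half
summary(fit2)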

5.13 Summary

1) The MLR model is Yi = β1 + xi,2 β2 + · · · + xi,pβp + ei = xTi β + ei for


i = 1, ..., n. This model is also called the full model. In matrix notation,
these n equations become Y = Xβ + e. Note that xi,1 ≡ 1.
2) The ordinary least squares OLS full model estimator β̂_OLS minimizes
Q_OLS(β) = Σ_{i=1}^{n} r_i²(β) = RSS(β) = (Y − Xβ)^T(Y − Xβ). In the estimating
equations Q_OLS(β), the vector β is a dummy variable. The minimizer
β̂OLS estimates the parameter vector β for the MLR model Y = Xβ + e.
Note that β̂ OLS ∼ ANp (β, M SE(X T X)−1 ).
3) Given an estimate b of β, the corresponding vector of predicted values
or fitted values is Ŷ ≡ Ŷ(b) = Xb. Thus the ith fitted value Ŷ_i ≡ Ŷ_i(b) =
x_i^T b = x_{i,1}b_1 + · · · + x_{i,p}b_p. The vector of residuals is r ≡ r(b) = Y − Ŷ(b).
Thus the ith residual r_i ≡ r_i(b) = Y_i − Ŷ_i(b) = Y_i − x_{i,1}b_1 − · · · − x_{i,p}b_p.
A response plot for MLR is a plot of Ŷ_i versus Y_i. A residual plot is a plot
of Ŷ_i versus r_i. If the e_i are iid from a unimodal distribution that is not
highly skewed, the plotted points should scatter about the identity line and
the r = 0 line.

4)
Label                   coef   SE        shorth 95% CI for β_i
Constant=intercept=x_1  β̂_1   SE(β̂_1)  [L̂_1, Û_1]
x_2                     β̂_2   SE(β̂_2)  [L̂_2, Û_2]
...
x_p                     β̂_p   SE(β̂_p)  [L̂_p, Û_p]
The classical OLS large sample 95% CI for βi is β̂i ± 1.96SE(β̂i ). Consider
testing H0 : β_i = 0 versus HA : β_i ≠ 0. If 0 ∈ CI for β_i, then fail to reject H0,
and conclude x_i is not needed in the MLR model given the other predictors
are in the model. If 0 ∉ CI for β_i, then reject H0, and conclude x_i is needed
in the MLR model.
5) Let x_i^T = (1 u_i^T). It is often convenient to use the centered response
Z = Y − Ȳ where Ȳ = Ȳ 1, and the n × (p − 1) matrix of standardized
nontrivial predictors W = (W_ij). For j = 1, ..., p − 1, let W_ij denote the
(j + 1)th variable standardized so that Σ_{i=1}^{n} W_ij = 0 and Σ_{i=1}^{n} W_ij² = n.
Then the sample correlation matrix of the nontrivial predictors u_i is

R_u = W^T W / n.

Then regression through the origin is used for the model Z = W η + e
where the vector of fitted values Ŷ = Ȳ + Ẑ. Thus the centered response
Z_i = Y_i − Ȳ and Ŷ_i = Ẑ_i + Ȳ. Then η̂ does not depend on the units of
measurement of the predictors. Linear combinations of the u_i can be written
as linear combinations of the x_i, hence β̂ can be found from η̂.
6) A model for variable selection is xT β = xTS β S + xTE β E = xTS β S where
x = (xTS , xTE )T , xS is an aS × 1 vector, and xE is a (p − aS ) × 1 vector. Let
xI be the vector of a terms from a candidate subset indexed by I, and let xO
be the vector of the remaining predictors (out of the candidate submodel). If
S ⊆ I, then xT β = xTS β S = xTS β S + xTI/S β (I/S) + xTO 0 = xTI β I where xI/S
denotes the predictors in I that are not in S. Since this is true regardless
of the values of the predictors, β O = 0 if S ⊆ I. Note that β E = 0. Let
kS = aS − 1 = the number of population active nontrivial predictors. Then
k = a − 1 is the number of active predictors in the candidate submodel I.
7) Let Q(η) be a real valued function of the k × 1 vector η. The gradient
of Q(η) is the k × 1 vector

∇Q = ∇Q(η) = ∂Q/∂η = ∂Q(η)/∂η = ( ∂Q(η)/∂η_1, ∂Q(η)/∂η_2, ..., ∂Q(η)/∂η_k )^T.

Suppose there is a model with unknown parameter vector η. A set of estimat-


ing equations f(η) is minimized or maximized where η is a dummy variable
vector in the function f : Rk → Rk .
8) As a mnemonic (memory aid) for the following results, note that the
derivative (d/dx) ax = (d/dx) xa = a and (d/dx) ax² = (d/dx) xax = 2ax.
a) If Q(η) = a^T η = η^T a for some k × 1 constant vector a, then ∇Q = a.
b) If Q(η) = η^T Aη for some k × k constant matrix A, then ∇Q = 2Aη.
c) If Q(η) = Σ_{i=1}^{k} |η_i| = ‖η‖_1, then ∇Q = s = s_η where s_i = sign(η_i)
where sign(η_i) = 1 if η_i > 0 and sign(η_i) = −1 if η_i < 0. This gradient is only
defined for η where none of the k values of η_i are equal to 0.
9) Forward selection with OLS generates a sequence of M models I1 , ..., IM
where Ij uses j predictors x∗1 ≡ 1, x∗2 , ..., x∗M . Often M = min(dn/Je, p) where
J is a positive integer such as J = 5.
10) For the model Y = Xβ + e, methods such as forward selection, PCR,
PLS, ridge regression, relaxed lasso, and lasso each generate M fitted mod-
els I1 , ..., IM , where M depends on the method. For forward selection the
simulation used Cp for n ≥ 10p and EBIC for n < 10p. The other meth-
ods minimized 10-fold CV. For forward selection, the maximum number of
variables used was approximately min(dn/5e, p).
11) Consider choosing η̂ to minimize the criterion

Q(η) = (1/a)(Z − W η)^T(Z − W η) + (λ_{1,n}/a) Σ_{i=1}^{p−1} |η_i|^j    (5.25)

where λ_{1,n} ≥ 0, a > 0, and j > 0 are known constants. Then j = 2
corresponds to ridge regression η̂_R, j = 1 corresponds to lasso η̂_L, and
a = 1, 2, n, and 2n are common. The residual sum of squares RSS_W(η) =
(Z − W η)^T(Z − W η), and λ_{1,n} = 0 corresponds to the OLS estimator
η̂_OLS = (W^T W)^{−1} W^T Z. Note that for a k × 1 vector η, the squared (Euclidean)
L_2 norm ‖η‖_2² = η^T η = Σ_{i=1}^{k} η_i² and the L_1 norm ‖η‖_1 = Σ_{i=1}^{k} |η_i|.
Lasso and ridge regression have a parameter λ. When λ = 0, the OLS
full model is used. Otherwise, the centered response and scaled nontrivial
predictors are used with Z = W η + e. See 5). These methods also use a
maximum value λM of λ and a grid of M λ values 0 ≤ λ1 < λ2 < · · · <
λM −1 < λM where often λ1 = 0. For lasso, λM is the smallest value of λ such
that η̂_λM = 0. Hence η̂_λi ≠ 0 for i < M.

12) The elastic net estimator η̂EN minimizes


5.13 Summary 257

QEN (η) = RSS(η) + λ1 kηk22 + λ2 kηk1 (5.26)

where λ1 = (1 − α)λ1,n and λ2 = 2αλ1,n with 0 ≤ α ≤ 1.


13) Use Z_n ∼ AN_g(µ_n, Σ_n) to indicate that a normal approximation is
used: Z_n ≈ N_g(µ_n, Σ_n). Let a be a constant, let A be a k × g constant
matrix, and let c be a k × 1 constant vector. If √n(θ̂_n − θ) →_D N_g(0, V), then
aZ_n = aI_g Z_n with A = aI_g,

aZ_n ∼ AN_g(aµ_n, a²Σ_n), and AZ_n + c ∼ AN_k(Aµ_n + c, AΣ_n A^T),

θ̂_n ∼ AN_g(θ, V/n), and Aθ̂_n + c ∼ AN_k(Aθ + c, AVA^T/n).

14) Assume η̂_OLS = (W^T W)^{−1} W^T Z. Let s_n = (s_{1n}, ..., s_{p−1,n})^T where
s_{in} ∈ [−1, 1] and s_{in} = sign(η̂_i) if η̂_i ≠ 0. Here sign(η_i) = 1 if η_i > 0 and
sign(η_i) = −1 if η_i < 0. Then
i) η̂_R = η̂_OLS − (λ_{1n}/n) n(W^T W + λ_{1,n} I_{p−1})^{−1} η̂_OLS.
ii) η̂_L = η̂_OLS − (λ_{1,n}/(2n)) n(W^T W)^{−1} s_n.
iii) η̂_EN = η̂_OLS − n(W^T W + λ_1 I_{p−1})^{−1} [ (λ_1/n) η̂_OLS + (λ_2/(2n)) s_n ].
15) Assume that the sample correlation matrix R_u = W^T W / n →_P V^{−1}.
Let H = W(W^T W)^{−1} W^T = (h_ij), and assume that max_{i=1,...,n} h_ii →_P 0 as
n → ∞. Let η̂_A be η̂_EN, η̂_L, or η̂_R. Let p be fixed.
i) LS CLT: √n(η̂_OLS − η) →_D N_{p−1}(0, σ²V).
ii) If λ̂_{1,n}/√n →_P 0, then √n(η̂_A − η) →_D N_{p−1}(0, σ²V).
iii) If λ̂_{1,n}/√n →_P τ ≥ 0, α̂ →_P ψ ∈ [0, 1], and s_n →_P s = s_η, then
√n(η̂_EN − η) →_D N_{p−1}(−V[(1 − ψ)τ η + ψτ s], σ²V).
iv) If λ̂_{1,n}/√n →_P τ ≥ 0, then √n(η̂_R − η) →_D N_{p−1}(−τ V η, σ²V).
v) If λ̂_{1,n}/√n →_P τ ≥ 0 and s_n →_P s = s_η, then
√n(η̂_L − η) →_D N_{p−1}((−τ/2) V s, σ²V).

ii) and v) are the Lasso CLT, ii) and iv) are the RR CLT, and ii) and iii)
are the EN CLT.
16) Under the conditions of 15), relaxed lasso = VS-lasso and relaxed
elastic net = VS-elastic net are √n consistent under much milder conditions
than lasso and elastic net, since the relaxed estimators are √n consistent when
lasso and elastic net are consistent. Let Imin correspond to the predictors
chosen by lasso, elastic net, or forward selection, including a constant. Let
β̂_Imin be the OLS estimator applied to these predictors, and let β̂_Imin,0 be the
zero padded estimator. The large sample theory for β̂_Imin,0 (from forward
selection, relaxed lasso, and relaxed elastic net) is given by Theorem 4.4.
Note that the large sample theory for the estimators β̂ is given for p × 1
vectors. The theory for η̂ is given for (p − 1) × 1 vectors. In particular, the
theory for lasso and elastic net does not cast away the η̂_i = 0.
17) Under Equation (4.1) with p fixed, if lasso or elastic net are consistent,
then P(S ⊆ Imin) → 1 as n → ∞. Hence when lasso and elastic net do
variable selection, they are often not √n consistent.
18) Refer to 6). a) The OLS full model tends to be useful if n ≥ 10p with
large sample theory better than that of lasso, ridge regression, and elastic
net. Testing is easier and the Olive (2007) PI tailored to the OLS full model
will work better for smaller sample sizes than PI (4.14) if n ≥ 10p. If n ≥ 10p
but X T X is singular or ill conditioned, other methods can perform better.
Forward selection, relaxed lasso, and relaxed elastic net are competitive
with the OLS full model even when n ≥ 10p and X T X is well conditioned.
If n ≤ p then OLS interpolates the data and is a poor method. If n = Jp,
then as J decreases from 10 to 1, other methods become competitive.
b) If n ≥ 10p and kS < p − 1, then forward selection can give more precise
inference than the OLS full model. When n/p is small, the PI (4.14) for
forward selection can perform well if n/kS is large. Forward selection can
be worse than ridge regression or elastic net if kS > min(n/J, p). Forward
selection can be too slow if both n and p are large. Forward selection, relaxed
lasso, and relaxed elastic net tend to be bad if (X TA X A )−1 is ill conditioned
where A = Imin .
c) If n ≥ 10p, lasso can be better than the OLS full model if X T X is ill
conditioned. Lasso seems to perform best if kS is not much larger than 10
or if the nontrivial predictors are orthogonal or uncorrelated. Lasso can be
outperformed by ridge regression or elastic net if kS > min(n, p − 1).
d) If n ≥ 10p ridge regression and elastic net can be better than the OLS
full model if X T X is ill conditioned. Ridge regression (and likely elastic net)
seems to perform best if kS is not much larger than 10 or if the nontrivial
predictors are orthogonal or uncorrelated. Ridge regression and elastic net
can outperform lasso if kS > min(n, p − 1).
e) The PLS PI (4.14) can perform well if n ≥ 10p if some of the other five
methods used in the simulations start to perform well when n ≥ 5p. PLS may
or may not be inconsistent if n/p is not large. Ridge regression tends to be
inconsistent unless P(d → p) → 1 so that ridge regression is asymptotically
equivalent to the OLS full model.
19) Under strong regularity conditions, lasso and relaxed lasso with k–fold
CV, and forward selection with EBIC can perform well even if n/p is small.
So PI (4.14) can be useful when n/p is small.

5.14 Complements

Good references for forward selection, PCR, PLS, ridge regression, and lasso
are Hastie et al. (2009, 2015), James et al. (2013), Olive (2019), Pelawa
Watagoda (2017) and Pelawa Watagoda and Olive (2019b). Also see Efron
and Hastie (2016). An early reference for forward selection is Efroymson
(1960). Under strong regularity conditions, Gunst and Mason (1980, ch. 10)
covers inference for ridge regression (and a modified version of PCR) when
the iid errors ei ∼ N (0, σ 2 ).
Xu et al. (2011) notes that sparse algorithms are not stable. Belsley (1984)
shows that centering can mask ill conditioning of X.
Classical principal component analysis based on the correlation matrix can
be done using the singular value decomposition (SVD) of the scaled matrix
W_S = W_g/√(n − 1) using ê_i and λ̂_i = σ_i² where λ̂_i = λ̂_i(W_S^T W_S) is the ith
eigenvalue of W_S^T W_S. Here the scaling is using g = 1. For more information
about the SVD, see Datta (1995, pp. 552-556) and Fogel et al. (2013).
There is massive literature on variable selection and a fairly large literature
for inference after variable selection. See, for example, Bertsimas et al. (2016),
Fan and Lv (2010), Ferrari and Yang (2015), Fithian et al. (2014), Hjort and
Claeskins (2003), Knight and Fu (2000), Lee et al. (2016), Leeb and Pötscher
(2005, 2006), Lockhart et al. (2014), Qi et al. (2015), and Tibshirani et al.
(2016).
For post-selection inference, the methods in the literature are often for mul-
tiple linear regression assuming normality, or are asymptotically equivalent
to using the full model, or find a quantity to test that is not Aβ. Typically
the methods have not been shown to perform better than data splitting. See
Ewald and Schneider (2018). When n/p is not large, inference is currently
much more difficult. Under strong regularity conditions, lasso and forward
selection with EBIC can work well. Leeb et al. (2015) suggests that the Berk
et al. (2013) method does not really work. Also see Dezeure et al. (2015),
Javanmard and Montanari (2014), Lu et al. (2017), Tibshirani et al. (2016),
van de Geer et al. (2014), and Zhang and Cheng (2017). Fan and Lv (2010)
gave large sample theory for some methods if p = o(n1/5 ). See Tibshirani et
al. (2016) for an R package.
Warning: For n < 5p, every estimator is unreliable, to my knowledge.
Regularity conditions for consistency are strong if they exist. For example,
PLS is sometimes inconsistent and sometimes √n consistent. Validating the
MLR estimator with PIs can help. Also make response and residual plots.
Full OLS Model: A sufficient condition for β̂ OLS to be a consistent
estimator of β is Cov(β̂ OLS ) = σ 2 (X T X)−1 → 0 as n → ∞. See Lai et al.
(1979).
Forward Selection: See Olive and Hawkins (2005), Pelawa Watagoda
and Olive (2019ab), and Rathnayake and Olive (2019).
Principal Components Regression: Principal components are Karhunen
Loeve directions of centered X. See Hastie et al. (2009, p. 66). A useful PCR
paper is Cook and Forzani (2008).
Partial Least Squares: PLS was introduced by Wold (1975). Also see
Wold (1985, 2006). Two useful papers are Cook et al. (2013) and Cook and
Su (2016). PLS tends to be √n consistent if p is fixed and n → ∞. If p > n,
under two sets of strong regularity conditions, PLS can be √n consistent
or inconsistent. See Chun and Keleş (2010), Cook (2018), Cook and Forzani
(2018, 2019), and Cook et al. (2013). Denham (1997) suggested a PI for PLS
that assumes the number of components is selected in advance.
Ridge Regression: An important ridge regression paper is Hoerl and
Kennard (1970). Also see Gruber (1998). Ridge regression is known as
Tikhonov regularization in the numerical analysis literature.
Lasso: Lasso was introduced by Tibshirani (1996). Efron et al. (2004)
and Tibshirani et al. (2012) are important papers. Su et al. (2017) note some
problems with lasso. If n/p is large, see Knight and Fu (2000) for the residual
bootstrap with OLS full model residuals. Camponovo (2015) suggested that
the nonparametric bootstrap does not work for lasso. Chatterjee and Lahiri
(2011) stated that the residual bootstrap with lasso does not work. Hall et
al. (2009) stated that the residual bootstrap with OLS full model residuals
does not work, but the m out of n residual bootstrap with OLS full model
residuals does work. Rejchel (2016) gave a good review of lasso theory. Fan
and Lv (2010) reviewed large sample theory for some alternative methods.
See Lockhart et al. (2014) for a partial remedy for hypothesis testing with
lasso. The Ning and Liu (2017) method needs a log likelihood. Knight and
Fu (2000) gave theory for fixed p.
Regularity conditions for testing are strong. Often lasso tests assume that
Y and the nontrivial predictors follow a multivariate normal (MVN) distri-
bution. For the MVN distribution, the MLR model tends to be dense not
sparse if n/p is small.
Lasso Variable Selection:
Applying OLS on a constant and the k nontrivial predictors that have
nonzero lasso η̂i is called lasso variable selection. We want n ≥ 10(k + 1).
If λ1 = 0, a variant of lasso variable selection computes the OLS submodel
for the subset corresponding to λi for i = 1, ..., M . If Cp is used, then this
variant has large sample theory given by Theorem 2.4.
Lasso can also be used for other estimators, such as generalized linear
models (GLMs). Then lasso variable selection is the “classical estimator,”
such as a GLM, applied to the lasso active set. In other words, use lasso
variable selection as a variable selection method. For prediction, lasso variable
selection is often better than lasso, but sometimes lasso is better.
See Meinshausen (2007) for the relaxed lasso method with R package
relaxo for MLR: apply lasso with penalty λ to get a subset of variables
with nonzero coefficients. Then reduce the shrinkage of the nonzero elements
by applying lasso again to the nonzero coefficients but with a smaller penalty
φ. This two stage estimator could be used for other estimators. Lasso variable
selection corresponds to the limit as φ → 0.
Dense Regression or Abundant Regression: occurs when most of the
predictors contribute to the regression. Hence the regression is not sparse. See
Cook et al. (2013).
Other Methods: Consider the MLR model Z = W η + e. Let λ ≥ 0 be
a constant and let q ≥ 0. The estimator η̂ q minimizes the criterion

Q_q(b) = r(b)^T r(b) + λ Σ_{j=1}^{p−1} |b_j|^q,    (5.27)

over all vectors b ∈ R^{p−1} where we take 0⁰ = 0. Then q = 1 corresponds
to lasso and q = 2 corresponds to ridge regression. If q = 0, the penalty
λ Σ_{j=1}^{p−1} |b_j|⁰ = λk where k is the number of nonzero components of b. Hence
the q = 0 estimator is often called the “best subset” estimator. See Frank
and Friedman (1993). For fixed p, large sample theory is given by Knight and
Fu (2000). Following Hastie et al. (2009, p. 72), the optimization problem is
convex if q ≥ 1 and λ is fixed.
If n ≤ 400 and p ≤ 3000, Bertsimas et al. (2016) give a fast “all subsets”
variable selection method. Lin et al. (2012) claim to have a very fast method
for variable selection. Lee and Taylor (2014) suggest the marginal screening
algorithm: let W be the matrix of standardized nontrivial predictors. Com-
pute W T Y = (c1 , ..., cp−1)T and select the J variables corresponding to the
J largest |ci |. These are the J standardized variables with the largest absolute
correlations with Y . Then do an OLS regression of Y on these J variables
and a constant. A slower algorithm somewhat similar but much slower than
the Lin et al. (2012) algorithm follows. Let a constant x1 be in the model, and
let W = [a1 , ..., ap−1 ] and r = Y − Y . Compute W T r and let x∗2 correspond
to the variable with the largest absolute entry. Remove the corresponding
aj from W to get W 1 . Let r 1 be the OLS residuals from regressing Y on
x1 and x∗2 . Compute W T r1 and let x∗3 correspond to the variable with the
largest absolute entry. Continue in this manner to get x1 , x∗2 , ..., x∗J where
J = min(p, dn/5e). Like forward selection, evaluate the J − 1 models Ij con-
taining the first j predictors x1 , x∗2 , ..., x∗J for j = 2, ..., J with a criterion such
as Cp .
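An R sketch of the marginal screening step is below (illustrative code, not from
the text).

#Marginal screening: keep the J standardized predictors with the largest
#|ci| from W'Y, then OLS on those J variables and a constant.
set.seed(1)
n <- 100; p <- 50; J <- 5
x <- matrix(rnorm(n*p),n,p)
y <- 1 + x[,1] + x[,2] + rnorm(n)
W <- scale(x)                        #standardized nontrivial predictors
cvec <- as.vector(crossprod(W,y))    #c1,...,cp-1
keep <- order(abs(cvec), decreasing=TRUE)[1:J]
screened <- lm(y ~ x[,keep])
coef(screened)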

Following Sun and Zhang (2012), let (5.6) hold and let

Q(η) = (1/(2n))(Z − W η)^T(Z − W η) + λ² Σ_{i=1}^{p−1} ρ(|η_i|/λ)

where ρ is scaled such that the derivative ρ′(0+) = 1. As for lasso and elastic
net, let s_j = sgn(η̂_j) where s_j ∈ [−1, 1] if η̂_j = 0. Let ρ′_j = ρ′(|η̂_j|/λ) if
η̂_j ≠ 0, and ρ′_j = 1 if η̂_j = 0. Then η̂ is a critical point of Q(η) iff
w_j^T(Z − W η̂) = nλ s_j ρ′_j for j = 1, ..., p − 1. If ρ is convex, then these
conditions are the KKT conditions. Let d_j = s_j ρ′_j. Then W^T Z − W^T W η̂ = nλd,
and η̂ = η̂_OLS − nλ(W^T W)^{−1} d. If the d_j are bounded, then η̂ is consistent
if λ → 0 as n → ∞, and η̂ is asymptotically equivalent to η̂_OLS if n^{1/2}λ → 0.
Note that ρ(t) = t for t > 0 gives lasso with λ = λ_{1,n}/(2n).
Gao and Huang (2010) give theory for a LAD–lasso estimator, and Qi et
al. (2015) is an interesting lasso competitor.
Multivariate linear regression has m ≥ 2 response variables. See Olive
(2017ab: ch. 12). PLS also works if m ≥ 1, and methods like ridge regression
and lasso can also be extended to multivariate linear regression. See, for ex-
ample, Haitovsky (1987) and Obozinski et al. (2011). Sparse envelope models
are given in Su et al. (2016).
AIC and BIC Type Criterion:
Olive and Hawkins (2005) and Burnham and Anderson (2004) are useful
reference when p is fixed. Some interesting theory for AIC appears in Zhang
(1992ab). Zheng and Loh (1995) show that BICS can work if p = pn =
o(log(n)) and there is a consistent estimator of σ 2 . For the Cp criterion, see
Jones (1946) and Mallows (1973).
AIC and BIC type criterion and variable selection for high dimensional re-
gression are discussed in Chen and Chen (2008), Fan and Lv (2010), Fujikoshi
et al. (2014), and Luo and Chen (2013). Wang (2009) suggests using

WBIC(I) = log[SSE(I)/n] + n^{−1} |I| [log(n) + 2 log(p)].

See Bogdan et al. (2004), Cho and Fryzlewicz (2012), and Kim et al. (2012).
Luo and Chen (2013) state that WBIC(I) needs p/n^a < 1 for some 0 < a < 1.
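A direct R transcription of this criterion is below (illustrative; SSE(I), |I|, n, and
p must be supplied by the fitting routine, and the numbers in the example calls are
arbitrary).

#WBIC(I) = log[SSE(I)/n] + |I|[log(n) + 2 log(p)]/n
wbic <- function(sseI, sizeI, n, p){
  log(sseI/n) + sizeI*(log(n) + 2*log(p))/n
}
wbic(sseI=250, sizeI=3, n=100, p=1000)  #compare candidate submodels
wbic(sseI=240, sizeI=6, n=100, p=1000)  #smaller WBIC is better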
If n/p is large and one of the models being considered is the true model
S (shown to occur with probability going to one only under very strong
assumptions by Wieczorek and Lei (2021)), then BIC tends to outperform
AIC. If none of the models being considered is the true model, then AIC
tends to outperform BIC. See Yang (2003).
Robust Versions: Hastie et al. (2015, pp. 26-27) discuss some modifica-
tions of lasso that are robust to certain types of outliers. Robust methods
for forward selection and LARS are given by Uraibi et al. (2017, 2019) that
need n >> p. If n is not much larger than p, then Hoffman et al. (2015)
have a robust Partial Least Squares–Lasso type estimator that uses a clever
weighting scheme.

A simple method to make an MLR method robust to certain types of


outliers is to find the covmb2 set B of Chapter 7 applied to the quantitative
predictors. Then use the MLR method (such as elastic net, lasso, PLS, PCR,
ridge regression, or forward selection) applied to the cases corresponding to
the xj in B. Make a response and residual plot, based on the robust estimator
β̂B , using all n cases.
Prediction Intervals:
Lei et al. (2018) and Wasserman (2014) suggested prediction intervals for
estimators such as lasso. The method has interesting theory if the (xi , Yi ) are
iid from some population. Also see Butler and Rothman (1980). Steinberger
and Leeb (2016) used leave-one-out residuals, but delete the upper and lower
2.5% of the residuals to make a 95% PI. Hence the PI will have undercoverage
and the shorth PI will tend to be shorter when the error distribution is not
symmetric.
Let p be fixed, d be for PI (4.14), and n → ∞. For elastic net, forward
selection, PCR, PLS, ridge regression, relaxed lasso, and lasso, if P (d → p) →
1 as n → ∞ then the seven methods are asymptotically equivalent to the
OLS full model, and the PI (4.14) is asymptotically optimal on a large class
of iid unimodal zero mean error distributions. The asymptotic optimality
holds since the sample quantile of the OLS full model residuals are consistent
estimators of the population quantiles of the unimodal error distribution for
a large class of distributions. Note that d →_P p if P(λ̂_1n → 0) → 1 for elastic
net, lasso, and ridge regression, and d →_P p if the number d − 1 of components
(γ_j^T x or γ_j^T w) used by the method satisfies P(d − 1 → p − 1) → 1. Consistent
estimators β̂ of β also produce residuals such that the sample quantiles of the
residuals are consistent estimators of quantiles of the error distribution. See
Remark 4.21, Olive and Hawkins (2003), and Rousseeuw and Leroy (1987, p.
128).

Degrees of Freedom:
A formula for the model degrees of freedom df tend to be given for a model
when there is no model selection or variable selection. For many estimators,
the degrees of freedom is not known if model selection is used. A d for PI
(4.15) is often obtained by plugging in the degrees of freedom formula as if
model selection did not occur. Then the resulting d is rarely an actual degrees
of freedom. As an example, if Ŷ = H λ Y , then often df = trace(H λ ) if λ is
selected before examining the data. If model selection is used to pick λ̂, then
d = trace(H λ̂ ) is not the model degrees of freedom.
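For example, for ridge regression on the standardized model, Ẑ = H_λ Z with
H_λ = W(W^T W + λ I_{p−1})^{−1} W^T, and the plug-in value is trace(H_λ). A minimal
R sketch is below (illustrative code, not from the text); as noted above, if λ is
selected by the data this trace is not an actual model degrees of freedom.

#Plug-in d - 1 = trace(H_lambda) for ridge regression on Z = W eta + e,
#where H_lambda = W (W'W + lambda I)^{-1} W'; add 1 for the constant.
#If lambda is chosen by model selection, this is not an actual df.
set.seed(1)
n <- 100; p <- 5
W <- scale(matrix(rnorm(n*(p-1)),n,p-1)) * sqrt(n/(n-1))
lambda <- 10
Hlam <- W %*% solve(crossprod(W) + lambda*diag(p-1), t(W))
sum(diag(Hlam))   #approximate "degrees of freedom" for the nontrivial predictors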

5.15 Problems

5.1. For ridge regression, suppose V = ρ_u^{−1}. Show that if p/n and λ/n =
λ_{1,n}/n are both small, then

η̂_R ≈ η̂_OLS − (λ/n) V η̂_OLS.

5.2. Consider choosing η̂ to minimize the criterion


Q(η) = (1/a)(Z − W η)^T(Z − W η) + (λ_{1,n}/a) Σ_{i=1}^{p−1} |η_i|^j

where λ1,n ≥ 0, a > 0, and j > 0 are known constants. Consider the regres-
sion methods OLS, forward selection, lasso, PLS, PCR, ridge regression, and
relaxed lasso.
a) Which method corresponds to j = 1?
b) Which method corresponds to j = 2?
c) Which method corresponds to λ1,n = 0?

5.3. For ridge regression, let An = (W T W + λ1,n I p−1 )−1 W T W and


Bn = [I p−1 − λ1,n (W T W + λ1,n I p−1 )−1 ]. Show An − B n = 0.

5.4. Suppose Ŷ = HY where H is an n × n hat matrix. Then the de-


grees of freedom df(Ŷ ) = tr(H ) = sum of the diagonal elements of H. An
estimator with low degrees of freedom is inflexible while an estimator with
high degrees of freedom is flexible. If the degrees of freedom is too low, the
estimator tends to underfit while if the degrees of freedom is too high, the
estimator tends to overfit.
a) Find df(Ŷ ) if Ŷ = Y 1 which uses H = (hij ) where hij ≡ 1/n for all
i and j. This inflexible estimator uses the sample mean Y of the response
variable as Ŷi for i = 1, ..., n.
b) Find df(Ŷ ) if Ŷ = Y = I n Y which uses H = I n where hii = 1. This
bad flexible estimator interpolates the response variable.

5.5. Suppose Y = Xβ + e, Z = W η + e, Ẑ = W η̂, Z = Y − Y , and


Ŷ = Ẑ + Y . Let the n × p matrix W 1 = [1 W ] and the p × 1 vector
η̂1 = (Y η̂T )T where the scalar Y is the sample mean of the response
variable. Show Ŷ = W 1 η̂1 .

5.6. Let Z = Y − Ȳ where Ȳ = Ȳ 1, and the n × (p − 1) matrix of standardized
nontrivial predictors G = (G_ij). For j = 1, ..., p − 1, let G_ij denote
the (j + 1)th variable standardized so that Σ_{i=1}^{n} G_ij = 0 and Σ_{i=1}^{n} G_ij² = 1.
Note that the sample correlation matrix of the nontrivial predictors u_i is
R_u = G^T G. Then regression through the origin is used for the model

Z = Gη + e    (5.28)

where the vector of fitted values Ŷ = Ȳ + Ẑ. The standardization differs from
that used for earlier regression models (see Remark 5.1), since Σ_{i=1}^{n} G_ij² =
1 ≠ n = Σ_{i=1}^{n} W_ij². Note that

G = (1/√n) W.

Following Zou and Hastie (2005), the naive elastic net η̂_N estimator is the
minimizer of

Q_N(η) = RSS(η) + λ_2* ‖η‖_2² + λ_1* ‖η‖_1    (5.29)

where λ_i* ≥ 0. The term "naive" is used because the elastic net estimator
is better. Let τ = λ_2*/(λ_1* + λ_2*), γ = λ_1*/√(1 + λ_2*), and η_A = √(1 + λ_2*) η. Let the
(n + p − 1) × (p − 1) augmented matrix G_A and the (n + p − 1) × 1 augmented
response vector Z_A be defined by

G_A = ( G ; √(λ_2*) I_{p−1} ) (G stacked on top of √(λ_2*) I_{p−1}), and Z_A = ( Z ; 0 ),

where 0 is the (p − 1) × 1 zero vector. Let η̂_A = √(1 + λ_2*) η̂ be obtained from
the lasso of Z_A on G_A: that is, η̂_A minimizes

Q_N(η_A) = ‖Z_A − G_A η_A‖_2² + γ‖η_A‖_1 = Q_N(η).

Prove Q_N(η_A) = Q_N(η).

(Then

η̂_N = (1/√(1 + λ_2*)) η̂_A and η̂_EN = √(1 + λ_2*) η̂_A = (1 + λ_2*) η̂_N.

The above elastic net estimator minimizes the criterion

Q_G(η) = η^T G^T G η/(1 + λ_2*) − 2Z^T Gη + (λ_2*/(1 + λ_2*)) ‖η‖_2² + λ_1* ‖η‖_1,

and hence is not the elastic net estimator corresponding to Equation (5.22).)

5.7. Let β = (β1 , βTS )T . Consider choosing β̂ to minimize the criterion

Q(β) = RSS(β) + λ_1 ‖β_S‖_2² + λ_2 ‖β_S‖_1



where λi ≥ 0 for i = 1, 2.
a) Which values of λ1 and λ2 correspond to ridge regression?
b) Which values of λ1 and λ2 correspond to lasso?
c) Which values of λ1 and λ2 correspond to elastic net?
d) Which values of λ1 and λ2 correspond to the OLS full model?

5.8. For the output below, an asterisk means the variable is in the model.
All models have a constant, so model 1 contains a constant and mmen.
a) List the variables, including a constant, that models 2, 3, and 4 contain.
b) The term out$cp lists the Cp criterion. Which model (1, 2, 3, or 4) is
the minimum Cp model Imin ?
c) Suppose β̂Imin = (241.5445, 1.001)T . What is β̂ Imin ,0 ?
Selection Algorithm: forward #output for Problem 5.8
pop mmen mmilmen milwmn
1 ( 1 ) " " "*" " " " "
2 ( 1 ) " " "*" "*" " "
3 ( 1 ) "*" "*" "*" " "
4 ( 1 ) "*" "*" "*" "*"
out$cp
[1] -0.8268967 1.0151462 3.0029429 5.0000000
5.9. Consider the output for Example 4.7 for the OLS full model. The
column resboot gives the large sample 95% CI for β_i using the shorth applied
to the β̂*_ij for j = 1, ..., B using the residual bootstrap. The standard large
sample 95% CI for βi is β̂i ±1.96SE(β̂i ). Hence for β2 corresponding to L, the
standard large sample 95% CI is −0.001 ± 1.96(0.002) = −0.001 ± 0.00392 =
[−0.00492, 0.00292] while the shorth 95% CI is [−0.005, 0.004].
a) Compute the standard 95% CIs for βi corresponding to W, H, and S.
Also write down the shorth 95% CI. Are the standard and shorth 95% CIs
fairly close?
b) Consider testing H0 : β_i = 0 versus HA : β_i ≠ 0. If the corresponding
95% CI for βi does not contain 0, then reject H0 and conclude that the
predictor variable Xi is needed in the MLR model. If 0 is in the CI then fail
to reject H0 and conclude that the predictor variable Xi is not needed in the
MLR model given that the other predictors are in the MLR model.
Which variables, if any, are needed in the MLR model? Use the standard
CI if the shorth CI gives a different result. The nontrivial predictor variables
are L, W, H, and S.

5.10. Tremearne (1911) presents a data set of about 17 measurements on


112 people of Hausa nationality. We used Y = height. Along with a constant
xi,1 ≡ 1, the five additional predictor variables used were xi,2 = height when
sitting, xi,3 = height when kneeling, xi,4 = head length, xi,5 = nasal breadth,
and xi,6 = span (perhaps from left hand to right hand). The output below is
for the OLS full model.

Estimate Std.Err 95% shorth CI


Intercept -77.0042 65.2956 [-208.864,55.051]
X2 0.0156 0.0992 [-0.177, 0.217]
X3 1.1553 0.0832 [ 0.983, 1.312]
X4 0.2186 0.3180 [-0.378, 0.805]
X5 0.2660 0.6615 [-1.038, 1.637]
X6 0.1396 0.0385 [0.0575, 0.217]
a) Give the shorth 95% CI for β2 .
b) Compute the standard 95% CI for β2 .
c) Which variables, if any, are needed in the MLR model given that the
other variables are in the model?
Now we use forward selection and Imin is the minimum Cp model.
Estimate Std.Err 95% shorth CI
Intercept -42.4846 51.2863 [-192.281, 52.492]
X2 0 [ 0.000, 0.268]
X3 1.1707 0.0598 [ 0.992, 1.289]
X4 0 [ 0.000, 0.840]
X5 0 [ 0.000, 1.916]
X6 0.1467 0.0368 [ 0.0747, 0.215]
(Intercept) a b c d e
1 TRUE FALSE TRUE FALSE FALSE FALSE
2 TRUE FALSE TRUE FALSE FALSE TRUE
3 TRUE FALSE TRUE TRUE FALSE TRUE
4 TRUE FALSE TRUE TRUE TRUE TRUE
5 TRUE TRUE TRUE TRUE TRUE TRUE
> tem2$cp
[1] 14.389492 0.792566 2.189839 4.024738 6.000000
d) What is the value of Cp(Imin ) and what is β̂ Imin ,0 ?
e) Which variables, if any, are needed in the MLR model given that the
other variables are in the model?
f) List the variables, including a constant, that model 3 contains.
5.11. Table 5.7 below shows simulation results for bootstrapping OLS (reg)
and forward selection (vs) with Cp when β = (1, 1, 0, 0, 0)T . The βi columns
give coverage = the proportion of CIs that contained βi and the average
length of the CI. The test is for H0 : (β3 , β4 , β5 )T = 0 and H0 is true. The
“coverage” is the proportion of times the prediction region method bootstrap
test failed to reject H0 . Since 1000 runs were used, a cov in [0.93,0.97] is
reasonable for a nominal value of 0.95. Output is given for three different
error distributions. If the coverage for both methods ≥ 0.93, the method
with the shorter average CI length was more precise. (If one method had
coverage ≥ 0.93 and the other had coverage < 0.93, we will say the method
with coverage ≥ 0.93 was more precise.)

a) For β3 , β4 , and β5 , which method, forward selection or the OLS full


model, was more precise?

Table 5.7 Bootstrapping Forward Selection, n = 100, p = 5, ψ = 0, B = 1000


β1 β2 β3 β4 β5 test
reg cov 0.95 0.93 0.93 0.93 0.94 0.93
len 0.658 0.672 0.673 0.674 0.674 2.861
vs cov 0.95 0.94 0.998 0.998 0.999 0.993
len 0.661 0.679 0.546 0.548 0.544 3.11
reg cov 0.96 0.93 0.94 0.96 0.93 0.94
len 0.229 0.230 0.229 0.231 0.230 2.787
vs cov 0.95 0.94 0.999 0.997 0.999 0.995
len 0.228 0.229 0.185 0.187 0.186 3.056
reg cov 0.94 0.94 0.95 0.94 0.94 0.93
len 0.393 0.398 0.399 0.399 0.398 2.839
vs cov 0.94 0.95 0.997 0.997 0.996 0.990
len 0.392 0.400 0.320 0.322 0.321 3.077

b) The test “length” is the average length of the interval [0, D(UB )] = D(UB )
where the test fails to reject H0 if D0 ≤ D(UB ). The OLS full model is
asymptotically normal, and hence for large enough n and B the reg len row
for the test column should be near √χ²3,0.95 = 2.795.
Were the three values in the test column for reg within 0.1 of 2.795?
5.12. Suppose the MLR model Y = Xβ + e, and the regression method
fits Z = W η + e. Suppose Ẑ = 245.63 and Y = 105.37. What is Ŷ ?
5.13. To get a large sample 90% PI for a future value Yf of the response
variable, find a large sample 90% PI for a future residual and add Ŷf to the
endpoints of the of that PI. Suppose forward selection is used and the large
sample 90% PI for a future residual is [−778.28, 1336.44]. What is the large
sample 90% PI for Yf if β̂ Imin = (241.545, 1.001)T used a constant and the
predictor mmen with corresponding xImin ,f = (1, 75000)T ?
5.14. Table 5.8 below shows simulation results for bootstrapping OLS
(reg), lasso, and ridge regression (RR) with 10-fold CV when β = (1, 1, 0, 0)T .
The βi columns give coverage = the proportion of CIs that contained βi and
the average length of the CI. The test is for H0 : (β3 , β4 )T = 0 and H0 is
true. The “coverage” is the proportion of times the prediction region method
bootstrap test failed to reject H0 . OLS used 1000 runs while 100 runs were
used for lasso and ridge regression. Since 100 runs were used, a cov in [0.89,
1] is reasonable for a nominal value of 0.95. If the coverage for both methods
≥ 0.89, the method with the shorter average CI length was more precise.
(If one method had coverage ≥ 0.89 and the other had coverage < 0.89, we
will say the method with coverage ≥ 0.89 was more precise.) The results
for the lasso test were omitted since sometimes S ∗T was singular. (Lengths

for the test column are not comparable unless the statistics have the same
asymptotic distribution.)

Table 5.8 Bootstrapping lasso and RR, n = 100, ψ = 0.9, p = 4, B = 250


β1 β2 β3 β4 test
reg cov 0.942 0.951 0.949 0.943 0.943
len 0.658 5.447 5.444 5.438 2.490
RR cov 0.97 0.02 0.11 0.10 0.05
len 0.681 0.329 0.334 0.334 2.546
reg cov 0.947 0.955 0.950 0.951 0.952
len 0.658 5.511 5.497 5.500 2.491
lasso cov 0.93 0.91 0.92 0.99
len 0.698 3.765 3.922 3.803

a) For β3 and β4 which method, ridge regression or the OLS full model,
was better?
b) For β3 and β4 which method, lasso or the OLS full model, was more
precise?
5.15. Suppose n = 15 and 5-fold CV is used. Suppose observations are
measured for the following people. Use the output below to determine which
people are in the first fold.
folds: 4 3 4 2 1 4 3 5 2 2 3 1 5 5 1
1) Athapattu, 2) Azizi, 3) Cralley 4) Gallage, 5) Godbold, 6) Gunawar-
dana, 7) Houmadi, 8) Mahappu, 9) Pathiravasan, 10) Rajapaksha, 11)
Ranaweera, 12) Safari, 13) Senarathna, 14) Thakur, 15) Ziedzor
5.16. Table 5.9 below shows simulation results for a large sample 95% pre-
diction interval. Since 5000 runs were used, a cov in [0.94, 0.96] is reasonable
for a nominal value of 0.95. If the coverage for a method ≥ 0.94, the method
with the shorter average PI length was more precise. Ignore methods with
cov < 0.94. The MLR model had β = (1, 1, ..., 1, 0, ..., 0)T where the first
k + 1 coefficients were equal to 1. If ψ = 0 then the nontrivial predictors were
uncorrelated, but highly correlated if ψ = 0.9.

Table 5.9 Simulated Large Sample 95% PI Coverages and Lengths, ei ∼ N (0, 1)
n p ψ k FS lasso RL RR PLS PCR
100 40 0 1 cov 0.9654 0.9774 0.9588 0.9274 0.8810 0.9882
len 4.4294 4.8889 4.6226 4.4291 4.0202 7.3393
400 400 0.9 19 cov 0.9348 0.9636 0.9556 0.9632 0.9462 0.9478
len 4.3687 47.361 4.8530 48.021 4.2914 4.4764

a) Which method was most precise, given cov ≥ 0.94, when n = 100?

b) Which method was most precise, given cov ≥ 0.94, when n = 400?
5.17. When doing a PI or CI simulation for a nominal 100(1 − δ)% = 95%
interval, there are m runs. For each run, a data set and interval are generated,
and for the ith run Yi = 1 if µ or Yf is in the interval, and Yi = 0, otherwise.
Hence the Yi are iid Bernoulli(1 − δn ) random variables where 1 − δn is
the true probability (true coverage) that the interval will contain µ or Yf .
The observed coverage (= coverage) in the simulation is Ȳ = ∑_{i} Yi /m. The
variance V (Ȳ ) = σ 2 /m where σ 2 = (1 − δn )δn ≈ (1 − δ)δ ≈ (0.95)0.05 if
δn ≈ δ = 0.05. Hence

SD(Ȳ ) ≈ √[0.95(0.05)/m].
If the (observed) coverage is within 0.95 ± k SD(Ȳ ) where the integer k is near 3,
then there is no reason to doubt that the actual coverage 1 − δn differs from
the nominal coverage 1−δ = 0.95 if m ≥ 1000 (and as a crude benchmark, for
m ≥ 100). In the simulation, the length of each interval is computed, and the
average length is computed. For intervals with coverage ≥ 0.95 − kSD(Y ),
intervals with shorter average length are better (have more precision).
a) If m = 5000 what is 3 SD(Y ), using the above approximation? Your
answer should be close to 0.01.
b) If m = 1000 what is 3 SD(Y ), using the above approximation?
5.18. Let Yi = β1 + β2 xi2 + · · · + βp xip + εi for i = 1, ..., n where the εi are
independent and identically distributed (iid) with expected value E(εi ) = 0
and variance V (εi ) = σ 2 . In matrix form, this model is Y = Xβ + ε. As-
sume X has full rank p where p < n. Let β̂R = (X T X + λn I p )−1 X T Y =
(X T X + λn I p )−1 (X T X)(X T X)−1 X T Y where λn ≥ 0 is a constant that
may depend on n and I p is the p × p identity matrix. Let β̂ = β̂ OLS be the
ordinary least squares estimator. Let Cov(Z) = V ar(Z) be the covariance
matrix of random vector Z.
a) Find E(β̂).
b) Find E(β̂ R ).

c) Find Cov(β̂).
d) Find Cov(β̂ R ). Simplify.
e) Suppose (X T X)/n → V −1 as n → ∞. Then n(X T X)−1 → V as
n → ∞ and if λn /n → 0 as n → ∞, then (X T X + λn I p )/n → V −1 and
n(X T X + λn I p )−1 → V as n → ∞. If λn /n → 0, show nCov(β̂ R ) → σ 2 V
as n → ∞. Hint: nA−1 BA−1 = nA−1 (B/n)nA−1 .
5.19.
5.20.

R Problem

Use the command source(“G:/linmodpack.txt”) to download the


functions and the command source(“G:/linmoddata.txt”) to download the

data. See Preface or Section 11.1. Typing the name of the linmodpack func-
tion, e.g. vsbootsim3, will display the code for the function. Use the args com-
mand, e.g. args(vsbootsim3), to display the needed arguments for the function.
For the following problem, the R command can be copied and pasted from
(http://parker.ad.siu.edu/Olive/linmodrhw.txt) into R.
5.21. The R program generates data satisfying the MLR model

Y = β1 + β2 x2 + β3 x3 + β4 x4 + e

where β = (β1 , β2 , β3 , β4 )T = (1, 1, 0, 0).


a) Copy and paste the commands for this part into R. The output gives
β̂OLS for the OLS full model. Give β̂ OLS . Is β̂ OLS close to β = (1, 1, 0, 0)T ?
b) The commands for this part bootstrap the OLS full model using the
residual bootstrap. Copy and paste the output into Word. The output shows

Tj∗ = β̂ j for j = 1, ..., 5.
c) B = 1000 Tj∗ were generated. The commands for this part compute the
∗ ∗
sample mean T of the Tj∗ . Copy and paste the output into Word. Is T close
to β̂ OLS found in a)?
d) The commands for this part bootstrap the forward selection using the
residual bootstrap. Copy and paste the output into Word. The output shows

Tj∗ = β̂ Imin ,0,j for j = 1, ..., 5. The last two variables may have a few 0s.
e) B = 1000 Tj∗ were generated. The commands for this part compute the

sample mean T of the Tj∗ where Tj∗ is as in d). Copy and paste the output

into Word. Is T close to β = (1, 1, 0, 0)?
5.22. This simulation is similar to that used to form Table 4.2, but 1000
runs are used so coverage in [0.93,0.97] suggests that the actual coverage is
close to the nominal coverage of 0.95.
The model is Y = xT β + e = xTS β S + e where β S = (β1 , β2 , ..., βk+1)T =
(β1 , β2 )T and k = 1 is the number of active nontrivial predictors in the popu-
lation model. The output for test tests H0 : (βk+2 , ..., βp)T = (β3 , ..., βp)T = 0
and H0 is true. The output gives the proportion of times the prediction region
method bootstrap test fails to reject H0 . The nominal proportion is 0.95.
After getting your output, make a table similar to Table 4.2 with 4 lines.
If your p = 5 then you need to add a column for β5 . Two lines are for reg
(the OLS full model) and two lines are for vs (forward selection with Imin ).
The βi columns give the coverage and lengths of the 95% CIs for βi . If the
coverage ≥ 0.93, then the shorter CI length is more precise. Were the CIs
for forward selection more precise than the CIs for the OLS full model for β3
and β4 ?
To get the output, copy and paste the source commands from
(http://parker.ad.siu.edu/Olive/linmodrhw.txt) into R. Copy and paste the
library command for this problem into R.
If you are person j then copy and paste the R code for person j for this
problem into R.

5.23. This problem is like Problem 5.22, but ridge regression is used in-
stead of forward selection. This simulation is similar to that used to form
Table 4.2, but 100 runs are used so coverage in [0.89,1.0] suggests that the
actual coverage is close to the nominal coverage of 0.95.
The model is Y = xT β + e = xTS β S + e where β S = (β1 , β2 , ..., βk+1)T =
(β1 , β2 )T and k = 1 is the number of active nontrivial predictors in the popu-
lation model. The output for test tests H0 : (βk+2 , ..., βp)T = (β3 , ..., βp)T = 0
and H0 is true. The output gives the proportion of times the prediction region
method bootstrap test fails to reject H0 . The nominal proportion is 0.95.
After getting your output, make a table similar to Table 4.2 with 4 lines.
If your p = 5 then you need to add a column for β5 . Two lines are for reg
(the OLS full model) and two lines are for ridge regression (with 10 fold CV).
The βi columns give the coverage and lengths of the 95% CIs for βi . If the
coverage ≥ 0.89, then the shorter CI length is more precise. Were the CIs for
ridge regression more precise than the CIs for the OLS full model for β3 and
β4 ?
To get the output, copy and paste the source commands from
(http://parker.ad.siu.edu/Olive/linmodrhw.txt) into R. Copy and paste the
library command for this problem into R.
If you are person j then copy and paste the R code for person j for this
problem into R.
5.24. This is like Problem 5.23, except lasso is used. If you are person j in
Problem 5.23, then copy and paste the R code for person j for this problem
into R. Make a table with 4 lines: two for OLS and 2 for lasso. Were the CIs
for lasso more precise than the CIs for the OLS full model for β3 and β4 ?
Chapter 6
What if n is not >> p?

When p > n, the fitted model should do better than i) interpolating the data
or ii) discarding all of the predictors and using the location model of Section
1.3.5 for inference. If p > n, forward selection, lasso, relaxed lasso, elastic
net, and relaxed elastic net can be useful for several regression models. Ridge
regression, partial least squares, and principal components regression can also
be computed for multiple linear regression. Sections 4.3, 5.9, and 10.7 give
prediction intervals.
One of the biggest errors in regression is to use the response variable
to build the regression model using all n cases, and then do inference as if
the built model was selected without using the response, e.g., selected before
gathering data. Using the response variable to build the model is called data
snooping; after data snooping, inference is generally no longer valid, and the
model built from data snooping tends to fit the data too well. In particular, do
not use data snooping and then use variable selection or cross validation. See
Hastie et al. (2009, p. 245) and Olive (2017a, pp. 85-89).
Building a regression model from data is one of the most challenging regres-
sion problems. The “final full model” will have response variable Y = t(Z), a
constant x1 , and predictor variables x2 = t2 (w2 , ..., wr), ..., xp = tp (w2 , ..., wr)
where the initial data consists of Z, w2 , ..., wr . Choosing t, t2 , ..., tp so that
the final full model is a useful regression approximation to the data can be
difficult.

As a rule of thumb, if strong nonlinearities are apparent in the predictors


w2 , ..., wp, it is often useful to remove the nonlinearities by transforming the
predictors using power transformations. When p is large, a scatterplot matrix
of w2 , ..., wp cannot be made, but the log rule of Section 1.2 can be useful.
Plots from Chapter 7, such as the DD plot, can also be useful. A scatterplot
is a plot of one variable wi versus another variable wj , and a scatterplot matrix
of the wi is an array of such scatterplots.
In the literature, it is sometimes stated that predictor transformations
that are made without looking at the response are “free.” The reasoning


is that the conditional distribution of Y |(x2 = a2 , ..., xp = ap ) is the same


as the conditional distribution of Y |[t2 (x2 ) = t2 (a2 ), ..., tp(xp ) = tp (ap )]:
there is simply a change of labelling. Certainly if Y |(x = 9) ∼ N (0, 1), then
Y |(√x = 3) ∼ N (0, 1). To see that the above rule of thumb does not always
work, suppose that Y = β1 + β2 x2 + · · · + βp xp + e where the xi are iid
lognormal(0,1) random variables. Then wi = log(xi ) ∼ N (0, 1) for i = 2, ..., p
and the scatterplot matrix of the wi will be linear while the scatterplot matrix
of the xi will show strong nonlinearities if the sample size is large. However,
there is an MLR relationship between Y and the xi while the relationship
between Y and the wi is nonlinear: Y = β1 + β2 e^{w2} + · · · + βp e^{wp} + e ≠ βT w + e.
Given Y and the wi with no information of the relationship, it would be
difficult to find the exponential transformation and to estimate the βi . The
moral is that predictor transformations, especially the log transformation, can
and often do greatly simplify the MLR analysis, but predictor transformations
can turn a simple MLR analysis into a very complex nonlinear analysis.
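The phenomenon is easy to see in a small simulation. The R sketch below is illustrative only (the sample size, p = 4 nontrivial predictors, and unit slopes are assumed values, not from the text): the wi = log(xi ) look linear in a scatterplot matrix, but the MLR model holds in the xi , not in the wi .

# lognormal predictor example (assumed values: n = 500, p = 4, slopes = 1)
set.seed(1)
n <- 500; p <- 4
w <- matrix(rnorm(n * p), n, p)   # wi = log(xi) ~ N(0,1)
x <- exp(w)                       # xi are lognormal(0,1)
y <- 1 + rowSums(x) + rnorm(n)    # the MLR model holds in the xi
pairs(w)                          # scatterplot matrix of the wi looks linear
pairs(x)                          # strong nonlinearities among the xi
summary(lm(y ~ x))                # linear fit in the xi is correct
summary(lm(y ~ w))                # linear fit in the wi is misspecified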
Recall the 1D regression model from Definition 1.2 with

Y ⫫ x | SP or Y ⫫ x | h(x),

where the real valued function h : Rp → R. An important special case is a


model with a linear predictor h(x) = xT β.
For the 1D regression model, let the ith case be (Yi , xi ) for i = 1, ..., n
where the n cases are independent. Variable selection is the search for a
subset of predictor variables that can be deleted with little loss of information
if n/p is large, and so that the model with the remaining predictors is useful
for prediction even if n/p is not large. The model for variable selection given
by Equation (4.1) can be useful even if n/p is not large:

xT β = xTS β S + xTE β E = xTS β S (6.1)

where x = (xTS , xTE )T , xS is an aS × 1 vector, and xE is a (p − aS ) × 1 vector.


Given that xS is in the model, β E = 0 and E denotes the subset of terms
that can be eliminated given that the subset S is in the model. Let xI be the
vector of a terms from a candidate subset indexed by I, and let xO be the
vector of the remaining predictors (out of the candidate submodel). Suppose
that S is a subset of I and that model (6.1) holds. Then

xT β = xTS β S = xTS βS + xTI/S β(I/S) + xTO 0 = xTI β I

where xI/S denotes the predictors in I that are not in S. Since this is true
regardless of the values of the predictors, β O = 0 if S ⊆ I.

6.1 Sparse Models

When n/p → 0 as n → ∞, consistent estimators generally cannot be found


unless the model has a simplifying structure. A sparse model is one such
structure. For Equation (6.1), a population regression model is sparse if aS
is small. We want n ≥ 10aS .
For multiple linear regression with p > n, results from Hastie et al. (2015,
pp. 20, 296, ch. 6, ch. 11) and Luo and Chen (2013) suggest that lasso, relaxed
lasso, and forward selection with EBIC can perform well for sparse models.
Least angle regression, elastic net, and relaxed elastic net can also be useful.
Suppose the selected model is Id , and βId is ad × 1. For multiple linear
regression, forward selection with Cp and AIC often gives useful results if
n ≥ 5p and if the final model I has n ≥ 10ad. For p < n < 5p, forward
selection with Cp and AIC tends to pick the full model (which overfits since
n < 5p) too often, especially if σ̂ 2 = M SE. The Hurvich and Tsai (1989)
AICC criterion can be useful for MLR and time series if n ≥ max(2p, 10ad).
If n ≥ 5p, AIC and BIC are useful for many regression models, and forward
selection with EBIC can be used for some models if n/p is small. See Section
4.1 and Chen and Chen (2008).

6.2 Data Splitting

Data splitting is useful for many regression models when the n cases are in-
dependent, including multiple linear regression, multivariate linear regression
where there are m ≥ 2 response variables, generalized linear models (GLMs),
the Cox (1972) proportional hazards regression model, and parametric sur-
vival regression models.
Consider a regression model with response variable Y and a p × 1 vector
of predictors x. This model is the full model. Suppose the n cases are inde-
pendent. To perform data splitting, randomly divide the data into two sets
H and V where H has nH of the cases and V has the remaining nV = n−nH
cases i1 , ..., inV . Find a model I, possibly with data snooping or model se-
lection, using the data in the training set H. Use the model I as the full
model to perform inference using the data in the validation set V . That is,
regress YV on X V,I and perform the usual inference for the model using the
j = 1, ..., nV cases in the validation set V . If βI uses a predictors, we want
nV ≥ 10a and we want P (S ⊆ I) → 1 as n → ∞ or for (YV , X V,I ) to follow
the regression model.
In the literature, often nH ≈ ⌈n/2⌉. For model selection, use the training
data set to fit the model selection method, e.g. forward selection or lasso, to
get the a predictors. On the test set, use the standard regression inference
from regressing the response on the predictors found from the training set.
This method can be inefficient if n ≥ 10p, but is useful for a sparse model

if n ≤ 5p, if the probability that the model underfits goes to zero, and if
n ≥ 20a.
The method is simple, use one half set to get the predictors, then fit
the regression model, such as a GLM or OLS, to the validation half set
(Y V , X V,I ). The regression model needs to hold for (Y V , X V,I ) and we want
nV ≥ 10a if I uses a predictors. The regression model can hold if S ⊆ I
and the model is sparse. Let x = (x1 , ..., xp )T where x1 is a constant. If
(Y, x2 , ..., xp )T follows a multivariate normal distribution, then (Y, xI ) follows
a multiple linear regression model for every I. Hence the full model need not
be sparse, although the selected model may be suboptimal.
Of course other sample sizes than half sets could be used. For example, if
n = 1000p, use nH = 10p cases for the training set and nV = 990p cases for
the validation set.
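A minimal base R sketch of data splitting is given below (simulated data; stepwise AIC selection stands in for the model selection method, and all names are illustrative, not from linmodpack). The selection uses only the training half H, and the usual OLS inference uses only the validation half V.

# data splitting sketch: select on H, do inference on V (illustrative only)
set.seed(1)
n <- 200; p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- 1 + x[, 1] + x[, 2] + rnorm(n)        # sparse model: only x1, x2 active
dat <- data.frame(y = y, x)
H <- sample(1:n, n %/% 2)                  # training half set H
V <- setdiff(1:n, H)                       # validation half set V
full <- lm(y ~ ., data = dat[H, ])         # full model fit on H only
sel <- step(lm(y ~ 1, data = dat[H, ]), scope = formula(full),
            direction = "forward", trace = 0)
fitV <- lm(formula(sel), data = dat[V, ])  # model I used as full model on V
summary(fitV)                              # standard OLS inference on V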
Remark 6.1. i) One use of data splitting is to try to transform the
p ≥ n problem into an n ≥ 10k problem. This method can work if
the model is sparse. For multiple linear regression, this method can work
if Y ∼ Nn (Xβ, σ 2 I), since then all subsets I satisfy the MLR model:
Yi = xTI,i βI + eI,i . See Remark 1.5. If βI is k × 1, we want n ≥ 10k and
V (eI,i ) = σI2 to be small. For binary logistic regression, the discriminant
function model of Definition 10.7 can be useful if xI |Y = j ∼ Nk (µj , Σ)
for j = 0, 1. Of course, the models may not be sparse, and the multivariate
normal assumptions for MLR and binary logistic regression rarely hold.
ii) Data splitting can be tricky for lasso, ridge regression, and elastic net
if the sample sizes of the training and validation sets differ. Roughly set
λ1,n1 /(2n1 ) = λ2,n2 /(2n2 ). Data splitting is much easier for variable selection
methods such as forward selection, relaxed lasso, and relaxed elastic net. Find
the variables x∗1 , ..., x∗k indexed by I from the training set, and use model I
as the full model for the validation set.
iii) Another use of data splitting is that data snooping can be used on the
training set: use the model as the full model for the validation set.

6.3 Summary

1) Using the response variable to build a model is known as data snooping,


and invalidates inference if data snooping is used on the entire data set of n
cases.
2) Suppose xT β = xTS βS + xTE βE = xTS β S where βS is an aS × 1 vector.
A regression model is sparse if aS is small. We want n ≥ 10aS .
3) Assume the cases are independent. To perform data splitting, randomly
divide the data into two half sets H and V where H has nH of the cases and
V has the remaining nV = n − nH cases i1 , ..., inV . Find a model I, possibly
with data snooping or model selection, using the data in the training set H.

Use the model I as the full model to perform inference using the data in the
validation set V .

6.4 Complements

Suppose model Ik contains k predictors including a constant. For multiple


linear regression, the forward selection algorithm in Chapter 4 adds a pre-
dictor x∗k+1 that minimizes the residual sum of squares, while the Pati et al.
(1993) “orthogonal matching pursuit algorithm” uses predictors (scaled to
have unit norm: xTi xi = 1 for the nontrivial predictors), and adds the scaled
predictor x∗k+1 that maximizes |(x∗k+1 )T r k | where the maximization is over vari-
ables not yet selected and the rk are the OLS residuals from regressing Y
on X ∗Ik . Fan and Li (2001) and Candes and Tao (2007) gave competitors to
lasso. Some fast methods seem similar to the first PLS component. A useful
reference for data splitting is Rinaldo et al. (2019).
Fan and Li (2002) give a method of variable selection for the Cox (1972)
proportional hazards regression model. Using AIC is also useful if p is fixed.
For a time series Y1 , ..., Yn, we could use Y1 , ..., Ym as one set and Ym+1 , ..., Yn
as the other set. Three set inference may be needed. Use Y1 , ..., Ym as the first
set (training data), Ym+1 , ..., Ym+k as a burn-in set, and Ym+k+1 , ..., Yn as the
third set for inference.
When the entire data set is used to build a model with the response vari-
able, the inference tends to be invalid, and cross validation should not be used
to check the model. See Hastie et al. (2009, p. 245). In order for the inference
and cross validation to be useful, the response variable and the predictors
for the regression should be chosen before looking at the response variable.
Predictor transformations can be done as long as the response variable is not
used to choose the transformation. You can do model building on the training
set, and then do inference for the chosen (built) model as the full model with
the validation set, provided this model follows the regression model used for
inference (e.g. multiple linear regression or a GLM). This process is difficult
to simulate.

6.5 Problems
Chapter 7
Robust Regression

This chapter considers outlier detection and then develops robust regression
estimators. Robust estimators of multivariate location and dispersion are
useful for outlier detection and for developing robust regression estimators.
Outliers and dot plots were discussed in Chapter 3.

Definition 7.1. An outlier corresponds to a case that is far from the bulk
of the data.

Definition 7.2. A dot plot of Z1 , ..., Zm consists of an axis and m points


each corresponding to the value of Zi .

The following plots and techniques will be developed in this chapter. For
the location model, use a dot plot to detect outliers. For the multivariate
location model with p = 2 make a scatterplot. For multiple linear regression
with one nontrivial predictor x, plot x versus Y . For the multiple linear
regression model, make the residual and response plots. For the multivariate
location model, make the DD plot if n ≥ 5p, and use ddplot5 if p > n. If p
is not much larger than 5, elemental sets are useful for outlier detection for
multiple linear regression and multivariate location and dispersion.

7.1 The Location Model

The location model is

Yi = µ + ei , i = 1, . . . , n (7.1)

where e1 , ..., en are error random variables, often iid with zero mean. The
location model is used when there is one variable Y , such as height, of interest.
The location model is a special case of the multiple linear regression model
and of the multivariate location and dispersion model, where there are p


variables x1 , ..., xp of interest, such as height and weight if p = 2. The dot


plot of Definition 7.2 is useful for detecting outliers in the location model.
The location model is often summarized by obtaining point estimates and
confidence intervals for a location parameter and a scale parameter. Assume
that there is a sample Y1 , . . . , Yn of size n where the Yi are iid from a distri-
bution with median MED(Y ), mean E(Y ), and variance V (Y ) if they exist.
The location parameter µ is often the population mean or median while the
scale parameter is often the population standard deviation √V (Y ). The ith
case is Yi .
Point estimation is one of the oldest problems in statistics and four impor-
tant statistics for the location model are the sample mean, median, variance,
and the median absolute deviation (MAD). Let Y1 , . . . , Yn be the random
sample; i.e., assume that Y1 , ..., Yn are iid. The sample mean is a measure of
location and estimates the population mean (expected value) µ = E(Y ). The
sample mean Ȳ = ∑_{i=1}^n Yi /n. The sample variance

Sn2 = ∑_{i=1}^n (Yi − Ȳ )2 /(n − 1) = [∑_{i=1}^n Yi2 − n(Ȳ )2 ]/(n − 1),

and the sample standard deviation Sn = √Sn2 .
If the data set Y1 , ..., Yn is arranged in ascending order from smallest to
largest and written as Y(1) ≤ · · · ≤ Y(n), then Y(i) is the ith order statistic
and the Y(i) ’s are called the order statistics. If the data Y1 = 1, Y2 = 4, Y3 =
2, Y4 = 5, and Y5 = 3, then Y = 3, Y(i) = i for i = 1, ..., 5 and MED(n) = 3
where the sample size n = 5. The sample median is a measure of location
while the sample standard deviation is a measure of spread. The sample mean
and standard deviation are vulnerable to outliers, while the sample median
and MAD, defined below, are outlier resistant.

Definition 7.3. The sample median

MED(n) = Y((n+1)/2) if n is odd, (7.2)

Y(n/2) + Y((n/2)+1)
MED(n) = if n is even.
2
The notation MED(n) = MED(Y1 , ..., Yn) will also be used.
Definition 7.4. The sample median absolute deviation is

MAD(n) = MED(|Yi − MED(n)|, i = 1, . . . , n). (7.3)

Since MAD(n) is the median of n distances, at least half of the observations


are within a distance MAD(n) of MED(n) and at least half of the observations
are a distance of MAD(n) or more away from MED(n). Like the standard
deviation, MAD(n) is a measure of spread.

Example 7.1. Let the data be 1, 2, 3, 4, 5, 6, 7, 8, 9. Then MED(n) = 5


and MAD(n) = 2 = MED{0, 1, 1, 2, 2, 3, 3, 4, 4}.

The trimmed mean is used in Chapter 9. We recommend the 25% trimmed


mean. Let ⌊x⌋ denote the “greatest integer function” (e.g., ⌊7.7⌋ = 7).

Definition 7.5. The symmetrically trimmed mean or the δ trimmed mean is

Tn = Tn (Ln , Un ) = [1/(Un − Ln )] ∑_{i=Ln +1}^{Un} Y(i)     (7.4)

where Ln = ⌊nδ⌋ and Un = n − Ln . If δ = 0.25, say, then the δ trimmed
mean is called the 25% trimmed mean.
The (δ, 1 − γ) trimmed mean uses Ln = ⌊nδ⌋ and Un = ⌊nγ⌋.

Estimators that use order statistics are common. Theory for the MAD,
median, and trimmed mean is given, for example, in Olive (2008), which
also gives confidence intervals based on the median and trimmed mean. The
shorth estimator of Section 4.3 was used for prediction intervals.
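These estimators are available in base R; the sketch below checks Example 7.1. Note that R's mad() multiplies by the constant 1.4826 by default, so constant = 1 gives MAD(n) of Definition 7.4, and mean(y, trim = 0.25) trims ⌊0.25n⌋ cases from each end, matching the 25% trimmed mean.

y <- 1:9
median(y)              # MED(n) = 5
mad(y, constant = 1)   # MAD(n) = 2 (the default constant is 1.4826)
mean(y, trim = 0.25)   # 25% trimmed mean = 5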

7.2 The Multivariate Location and Dispersion Model

The multivariate location and dispersion (MLD) model is a special case of the
multivariate linear model, just like the location model is a special case of the
multiple linear regression model. Robust estimators of multivariate location
and dispersion are useful for detecting outliers in the predictor variables and
for developing an outlier resistant multiple linear regression estimator.
The practical, highly outlier resistant, √n consistent FCH, RFCH, and
RMVN estimators of (µ, cΣ) are developed along with proofs. The RFCH
and RMVN estimators are reweighted versions of the FCH estimator. It is
shown why competing “robust estimators” fail to work, are impractical, or are
not yet backed by theory. The RMVN and RFCH sets are defined and will be
used for outlier detection and to create practical robust methods of multiple
linear regression and multivariate linear regression. Many more applications
are given in Olive (2017b).
Warning: This section contains many acronyms, abbreviations, and es-
timator names such as FCH, RFCH, and RMVN. Often the acronyms start
with the added letter A, C, F, or R: A stands for algorithm, C for con-
centration, F for estimators that use a fixed number of trial fits, and R for
reweighted.

Definition 7.6. The multivariate location and dispersion model is

Y i = µ + ei , i = 1, . . . , n (7.5)

where e1 , ..., en are p × 1 error random vectors, often iid with zero mean and
covariance matrix Cov(e) = Cov(Y ) = Σ Y = Σ e .

Note that the location model is a special case of the MLD model with
p = 1. If E(e) = 0, then E(Y ) = µ. A p × p dispersion matrix is a symmetric
matrix that measures the spread of a random vector. Covariance and corre-
lation matrices are dispersion matrices. One way to get a robust estimator
of multivariate location is to stack the marginal estimators of location into
a vector. The coordinatewise median MED(W ) is an example. The sample
mean x also stacks the marginal estimators into a vector, but is not outlier
resistant.
Let µ be a p × 1 location vector and Σ a p × p symmetric dispersion
matrix. Because of symmetry, the first row of Σ has p distinct unknown
parameters, the second row has p − 1 distinct unknown parameters, the third
row has p − 2 distinct unknown parameters, ..., and the pth row has one
distinct unknown parameter for a total of 1 + 2 + · · · + p = p(p + 1)/2
unknown parameters. Since µ has p unknown parameters, an estimator (T, C)
of multivariate location and dispersion, needs to estimate p(p+3)/2 unknown
parameters when there are p random variables. If the p variables can be
transformed into an uncorrelated set then there are only 2p parameters, the
means and variances, while if the dimension can be reduced from p to p − 1,
the number of parameters is reduced by p(p + 3)/2 − (p − 1)(p + 2)/2 = p + 1.
The sample covariance or sample correlation matrices estimate these pa-
rameters very efficiently since Σ = (σij ) where σij is a population covariance
or correlation. These quantities can be estimated with the sample covariance
or correlation taking two variables Xi and Xj at a time. Note that there are
p(p + 1)/2 pairs that can be chosen from p random variables X1 , ..., Xp.

Rule of thumb 7.1. For the classical estimators of multivariate location


and dispersion, (x, S) or (z = 0, R), we want n ≥ 10p. We want n ≥ 20p for
the robust MLD estimators (FCH, RFCH, or RMVN) described later in this
section.

7.2.1 Affine Equivariance

Before defining an important equivariance property, some notation is needed.


Assume that the data is collected in an n × p data matrix W . Let B = 1bT
where 1 is an n × 1 vector of ones and b is a p × 1 constant vector. Hence
the ith row of B is bTi ≡ bT for i = 1, ..., n. For such a matrix B, consider
the affine transformation Z = W AT + B where A is any nonsingular p × p
matrix. An affine transformation changes xi to z i = Axi + b for i = 1, ..., n,
and affine equivariant multivariate location and dispersion estimators change
in natural ways.

Definition 7.7. The multivariate location and dispersion estimator (T, C)


is affine equivariant if

T (Z) = T (W AT + B) = AT (W ) + b, (7.6)

and C(Z) = C(W AT + B) = AC(W )AT . (7.7)

The following theorem shows that the Mahalanobis distances are invariant
under affine transformations. See Rousseeuw and Leroy (1987, pp. 252-262)
for similar results. Thus if (T, C) is affine equivariant, so is
(T, D²(cn) (T, C) C) where D²(j) (T, C) is the jth order statistic of the Di2 .

Theorem 7.1. If (T, C) is affine equivariant, then

Di2 (W ) ≡ Di2 (T (W ), C(W )) = Di2 (T (Z), C(Z)) ≡ Di2 (Z). (7.8)

Proof. Since Z = W AT + B has ith row z Ti = xTi AT + bT ,

Di2 (Z) = [zi − T (Z)]T C −1 (Z)[z i − T (Z)]

= [A(xi − T (W ))]T [AC(W )AT ]−1 [A(xi − T (W ))]


= [xi − T (W )]T C −1 (W )[xi − T (W )] = Di2 (W ). 
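Theorem 7.1 can be checked numerically with the affine equivariant classical estimator (x, S). In the sketch below the data, the nonsingular matrix A, and the vector b are arbitrary illustrative choices; the squared Mahalanobis distances are unchanged by z i = Axi + b.

# check that classical Mahalanobis distances are affine invariant
set.seed(2)
n <- 100; p <- 3
W <- matrix(rnorm(n * p), n, p)
A <- matrix(c(2, 1, 0, 0, 1, 3, 1, 0, 1), p, p)  # nonsingular p x p matrix
b <- c(1, -2, 5)
Z <- W %*% t(A) + matrix(b, n, p, byrow = TRUE)  # Z = W A^T + B
d2W <- mahalanobis(W, colMeans(W), cov(W))
d2Z <- mahalanobis(Z, colMeans(Z), cov(Z))
max(abs(d2W - d2Z))                              # essentially zero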

Definition 7.8. For MLD, an elemental set J = {m1 , ..., mp+1} is a set of
p + 1 cases drawn without replacement from the data set of n cases. The ele-
mental fit (TJ , C J ) = (xJ , SJ ) is the sample mean and the sample covariance
matrix computed from the cases in the elemental set.

If the data are iid, then the elemental fit gives an unbiased but inconsistent
estimator of (E(x), Cov(x)). Note that the elemental fit uses the smallest
sample size p + 1 such that S J is nonsingular if the data are in “general
position” defined in Definition 7.10. See Definition 4.7 for the sample mean
and sample covariance matrix.

7.2.2 Breakdown

This subsection gives a standard definition of breakdown for estimators of


multivariate location and dispersion. The following notation will be useful.
Let W denote the n × p data matrix with ith row xTi corresponding to the
ith case. Let w 1 , ...wn be the contaminated data after dn of the xi have been
replaced by arbitrarily bad contaminated cases. Let W nd denote the n×p data
matrix with ith row w Ti . Then the contamination fraction is γn = dn /n. Let
(T (W ), C(W )) denote an estimator of multivariate location and dispersion

where the p × 1 vector T (W ) is an estimator of location and the p × p


symmetric positive semidefinite matrix C(W ) is an estimator of dispersion.

Theorem 7.2. Let B > 0 be a p × p symmetric matrix with eigenvalue
eigenvector pairs (λ1 , e1 ), ..., (λp , ep ) where λ1 ≥ λ2 ≥ · · · ≥ λp > 0 and the
orthonormal eigenvectors satisfy eTi ei = 1 while eTi ej = 0 for i ≠ j. Let d
be a given p × 1 vector and let a be an arbitrary nonzero p × 1 vector.
a) max_{a≠0} (aT ddT a)/(aT Ba) = dT B −1 d where the max is attained for
a = cB −1 d for any constant c ≠ 0. Note that the numerator = (aT d)².
b) max_{a≠0} (aT Ba)/(aT a) = max_{‖a‖=1} aT Ba = λ1 where the max is attained
for a = e1 .
c) min_{a≠0} (aT Ba)/(aT a) = min_{‖a‖=1} aT Ba = λp where the min is attained
for a = ep .
d) max_{a⊥e1 ,...,ek} (aT Ba)/(aT a) = max_{‖a‖=1, a⊥e1 ,...,ek} aT Ba = λk+1 where the
max is attained for a = ek+1 for k = 1, 2, ..., p − 1.
e) Let (x, S) be the observed sample mean and sample covariance matrix
where S > 0. Then max_{a≠0} [naT (x − µ)(x − µ)T a]/(aT Sa) = n(x − µ)T S −1 (x − µ) = T²
where the max is attained for a = cS −1 (x − µ) for any constant c ≠ 0.
f) Let A be a p × p symmetric matrix. Let C > 0 be a p × p symmetric
matrix. Then max_{a≠0} (aT Aa)/(aT Ca) = λ1 (C −1 A), the largest eigenvalue of C −1 A.
The value of a that achieves the max is the eigenvector g 1 of C −1 A corresponding
to λ1 (C −1 A). Similarly min_{a≠0} (aT Aa)/(aT Ca) = λp (C −1 A), the smallest eigenvalue
of C −1 A. The value of a that achieves the min is the eigenvector g p of C −1 A
corresponding to λp (C −1 A).

Proof Sketch. See Johnson and Wichern (1988, pp. 64-65, 184). For a),
note that rank(C −1 A) = 1, where C = B and A = ddT , since rank(C −1 A)
= rank(A) = rank(d) = 1. Hence C −1 A has one nonzero eigenvalue eigen-
vector pair (λ1 , g1 ). Since

(λ1 = dT B −1 d, g 1 = B −1 d)

is a nonzero eigenvalue eigenvector pair for C −1 A, and λ1 > 0, the result


follows by f).
Note that b) and c) are special cases of f) with A = B and C = I.
Note that e) is a special case of a) with d = (x − µ) and B = S.
(Also note that (λ1 = (x − µ)T S −1 (x − µ), g1 = S −1 (x − µ)) is a nonzero
eigenvalue eigenvector pair for the rank 1 matrix C −1 A where C = S and
A = (x − µ)(x − µ)T .)

For f), see Mardia et al. (1979, p. 480). 

From Theorem 7.2, if C(W nd ) > 0, then max_{‖a‖=1} aT C(W nd )a = λ1 and
min_{‖a‖=1} aT C(W nd )a = λp . A high breakdown dispersion estimator C is positive
definite if the amount of contamination is less than the breakdown value.
Since aT Ca = ∑_{i=1}^p ∑_{j=1}^p cij ai aj , the largest eigenvalue λ1 is bounded as
W nd varies iff C(W nd ) is bounded as W nd varies.
Definition 7.9. The breakdown value of the multivariate location estima-
tor T at W is

B(T, W ) = min{ dn /n : sup_{W nd} ‖T (W nd )‖ = ∞ }

where the supremum is over all possible corrupted samples W nd and 1 ≤
dn ≤ n. Let λ1 (C(W )) ≥ · · · ≥ λp (C(W )) ≥ 0 denote the eigenvalues of the
dispersion estimator applied to data W . The estimator C breaks down if the
smallest eigenvalue can be driven to zero or if the largest eigenvalue can be
driven to ∞. Hence the breakdown value of the dispersion estimator is

B(C, W ) = min{ dn /n : sup_{W nd} max[ 1/λp (C(W nd )), λ1 (C(W nd )) ] = ∞ }.

Definition 7.10. Let γn be the breakdown value of (T, C). High break-
down (HB) statistics have γn → 0.5 as n → ∞ if the (uncontaminated) clean
data are in general position: no more than p points of the clean data lie on
any (p − 1)-dimensional hyperplane. Estimators are zero breakdown if γn → 0
and positive breakdown if γn → γ > 0 as n → ∞.

Note that if the number of outliers is less than the number needed to cause
breakdown, then kT k is bounded and the eigenvalues are bounded away from
0 and ∞. Also, the bounds do not depend on the outliers but do depend on
the estimator (T, C) and on the clean data W .
The following result shows that a multivariate location estimator T basi-
cally “breaks down” if the d outliers can make the median Euclidean distance
MED(kwi −T (W nd )k) arbitrarily large where w Ti is the ith row of W nd . Thus
a multivariate location estimator T will not break down if T can not be driven
out of some ball of (possibly huge) radius r about the origin. For an affine
equivariant estimator, the largest possible breakdown value is n/2 or (n+1)/2
for n even or odd, respectively. Hence in the proof of the following result, we
could replace dn < dT by dn < min(n/2, dT ).

Theorem 7.3. Fix n. If nonequivariant estimators (that may have a break-


down value of greater than 1/2) are excluded, then a multivariate location
estimator has a breakdown value of dT /n iff dT = dT ,n is the smallest num-

ber of arbitrarily bad cases that can make the median Euclidean distance
MED(kwi − T (W nd )k) arbitrarily large.
Proof. Suppose the multivariate location estimator T satisfies kT (W nd )k ≤
M for some constant M if dn < dT . Note that for a fixed data set W nd
with ith row wi , the median Euclidean distance MED(kw i − T (W nd )k) ≤
maxi=1,...,n kxi − T (W nd )k ≤ maxi=1,...,n kxi k + M if dn < dT . Similarly,
suppose MED(kwi − T (W nd )k) ≤ M for some constant M if dn < dT , then
kT (W nd )k is bounded if dn < dT . 

Since the coordinatewise median MED(W ) is a HB estimator of multi-


variate location, it is also true that a multivariate location estimator T will
not break down if T can not be driven out of some ball of radius r about
MED(W ). Hence (MED(W ), I p ) is a HB estimator of MLD.
If a high breakdown estimator (T, C) ≡ (T (W nd ), C(W nd )) is evaluated
on the contaminated data W nd , then the location estimator T is contained in
some ball about the origin of radius r, and 0 < a < λp ≤ λ1 < b where the
constants a, r, and b depend on the clean data and (T, C), but not on W nd if
the number of outliers dn satisfies 0 ≤ dn < nγn < n/2 where the breakdown
value γn → 0.5 as n → ∞.
The following theorem will be used to show that if the classical estimator
(X B , SB ) is applied to cn ≈ n/2 cases contained in a ball about the origin of
radius r where r depends on the clean data but not on W nd , then (X B , S B )
is a high breakdown estimator.

Theorem 7.4. If the classical estimator (X B , S B ) is applied to cn cases


that are contained in some bounded region where p + 1 ≤ cn ≤ n, then the
maximum eigenvalue λ1 of S B is bounded.
Proof. The largest eigenvalue of a p × p matrix A is bounded above by
p max |ai,j | where ai,j is the (i, j) entry of A. See Datta (1995, p. 403). Denote
the cn cases by z 1 , ..., z cn . Then the (i, j)th element ai,j of A = S B is

ai,j = [1/(cn − 1)] ∑_{m=1}^{cn} (zi,m − z̄ i )(zj,m − z̄ j ).

Hence the maximum eigenvalue λ1 is bounded. 

The determinant det(S) = |S| of S is known as the generalized sample


variance. Consider the hyperellipsoid

{z : (z − T )T C −1 (z − T ) ≤ D²(cn) }     (7.9)

where D²(cn) is the cn th smallest squared Mahalanobis distance based on
is the cn th smallest squared Mahalanobis distance based on
(T, C). This hyperellipsoid contains the cn cases with the smallest Di2 . Sup-
pose (T, C) = (xM , b S M ) is the sample mean and scaled sample covariance
matrix applied to some subset of the data where b > 0. The classical, RFCH,

and RMVN estimators satisfy this assumption. For h > 0, the hyperellipsoid

{z : (z − T )T C −1 (z − T ) ≤ h²} = {z : D²z ≤ h²} = {z : Dz ≤ h}

has volume equal to

[2π^{p/2}/(pΓ (p/2))] h^p √det(C) = [2π^{p/2}/(pΓ (p/2))] h^p b^{p/2} √det(S M ).

If h² = D²(cn) , then the volume is proportional to the square root of the deter-
minant |S M |^{1/2} , and this volume will be positive unless extreme degeneracy
is present among the cn cases. See Johnson and Wichern (1988, pp. 103-104).

7.2.3 The Concentration Algorithm

Concentration algorithms are widely used since impractical brand name es-
timators, such as the MCD estimator given in Definition 7.11, take too long
to compute. The concentration algorithm, defined in Definition 7.12, use K
starts and attractors. A start is an initial estimator, and an attractor is an
estimator obtained by refining the start. For example, let the start be the
classical estimator (x, S). Then the attractor could be the classical estima-
tor (T1 , C 1 ) applied to the half set of cases with the smallest Mahalanobis
distances. This concentration algorithm uses one concentration step, but the
process could be iterated for k concentration steps, producing an estimator
(Tk , C k ).
If more than one attractor is used, then some criterion is needed to select
which of the K attractors is to be used in the final estimator. If each attractor
(Tk,j , C k,j ) is the classical estimator applied to cn ≈ n/2 cases, then the
minimum covariance determinant (MCD) criterion is often used: choose the
attractor that has the minimum value of det(C k,j ) where j = 1, ..., K.
The remainder of this section will explain the concentration algorithm,
explain why the MCD criterion is useful but can be improved, provide some
theory for practical robust multivariate location and dispersion estimators,
and show how the set of cases used to compute the recommended RMVN or
RFCH estimator can be used to create outlier resistant regression estimators.
The RMVN and RFCH estimators are reweighted versions of the practical
FCH estimator, given in Definition 7.15.

Definition 7.11. Consider the subset Jo of cn ≈ n/2 observations whose


sample covariance matrix has the lowest determinant among all C(n, cn ) sub-
sets of size cn . Let TM CD and C M CD denote the sample mean and sample
covariance matrix of the cn cases in Jo . Then the minimum covariance de-
terminant MCD(cn ) estimator is (TM CD (W ), C M CD (W )).

Here C(n, i) = n!/[i! (n − i)!] is the binomial coefficient.
The MCD estimator is a high breakdown (HB) estimator, and the value
cn = ⌊(n + p + 1)/2⌋ is often used as the default. The MCD estimator is the
pair
(β̂LT S , QLT S (β̂LT S )/(cn − 1))
in the location model where LTS stands for the least trimmed sum of squares
estimator. See Section 7.6. The population analog of the MCD estimator is
closely related to the hyperellipsoid of highest concentration that contains
cn /n ≈ half of the mass. The MCD estimator is a √n consistent HB asymp-
totically normal estimator for (µ, aM CD Σ) where aM CD is some positive
constant when the data xi are iid from a large class of distributions. See
Cator and Lopuhaä (2010, 2012) who extended some results of Butler et al.
(1993).
Computing robust covariance estimators can be very expensive. For exam-
ple, to compute the exact MCD(cn ) estimator (TM CD , CM CD ), we need to
consider the C(n, cn ) subsets of size cn . Woodruff and Rocke (1994, p. 893)
noted that if 1 billion subsets of size 101 could be evaluated per second, it
would require 10^33 millennia to search through all C(200, 101) subsets if the
sample size n = 200. See Section 7.8 for the MCD complexity.
Hence algorithm estimators will be used to approximate the robust esti-
mators. Elemental sets are the key ingredient for both basic resampling and
concentration algorithms.

Definition 7.12. Suppose that x1 , ..., xn are p × 1 vectors of observed


data. For the multivariate location and dispersion model, an elemental set J
is a set of p + 1 cases. An elemental start is the sample mean and sample
covariance matrix of the data corresponding to J. In a concentration algo-
rithm, let (T−1,j , C −1,j ) be the jth start (not necessarily elemental) and
compute all n Mahalanobis distances Di (T−1,j , C −1,j ). At the next iter-
ation, the classical estimator (T0,j , C 0,j ) = (x0,j , S0,j ) is computed from
the cn ≈ n/2 cases corresponding to the smallest distances. This itera-
tion can be continued for k concentration steps resulting in the sequence
of estimators (T−1,j , C −1,j ), (T0,j , C 0,j ), ..., (Tk,j , C k,j ). The result of the it-
eration (Tk,j , C k,j ) is called the jth attractor. If Kn starts are used, then
j = 1, ..., Kn. The concentration attractor, (TA , C A ), is the attractor chosen
by the algorithm. The attractor is used to obtain the final estimator. A com-
mon choice is the attractor that has the smallest determinant det(C k,j ). The
basic resampling algorithm estimator is a special case where k = −1 so that
the attractor is the start: (xk,j , Sk,j ) = (x−1,j , S−1,j ).
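The sketch below is a bare-bones one-start concentration algorithm in base R (illustrative only, not the linmodpack implementation). With the classical start (x, S) it mimics the DGK estimator of Definition 7.13, and with the start (MED(W ), I p ) it mimics the MB estimator of Definition 7.14.

# one start, k concentration steps (illustrative sketch)
concstep <- function(x, T0, C0, k = 10) {
  n <- nrow(x); cn <- ceiling(n / 2)          # cn is roughly n/2
  Tj <- T0; Cj <- C0
  for (i in 1:k) {
    d2 <- mahalanobis(x, Tj, Cj)              # squared distances D_i^2(T, C)
    half <- order(d2)[1:cn]                   # cn cases with smallest distances
    Tj <- colMeans(x[half, , drop = FALSE])   # classical estimator applied
    Cj <- cov(x[half, , drop = FALSE])        #   to the half set
  }
  list(center = Tj, cov = Cj, crit = det(Cj))
}
set.seed(3)
x <- matrix(rnorm(100 * 2), 100, 2)
x[1:10, ] <- x[1:10, ] + 10                   # 10% of the cases are outliers
concstep(x, colMeans(x), cov(x))              # classical start (DGK type)
concstep(x, apply(x, 2, median), diag(2))     # median ball start (MB type)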

This concentration algorithm is a simplified version of the algorithms given


by Rousseeuw and Van Driessen (1999) and Hawkins and Olive (1999a). Using

k = 10 concentration steps often works well. The following proposition is


useful and shows that det(S 0,j ) tends to be greater than the determinant of
the attractor det(S k,j ).

Theorem 7.5: Rousseeuw and Van Driessen (1999, p. 214). Sup-


pose that the classical estimator (xt,j , St,j ) is computed from cn cases and
that the n Mahalanobis distances Di ≡ Di (xt,j , S t,j ) are computed. If
(xt+1,j , St+1,j ) is the classical estimator computed from the cn cases with
the smallest Mahalanobis distances Di , then det(S t+1,j ) ≤ det(S t,j ) with
equality iff (xt+1,j , S t+1,j ) = (xt,j , St,j ).

Starts that use a consistent initial estimator could be used. Kn is the


number of starts and k is the number of concentration steps used in the
algorithm. Suppose the algorithm estimator uses some criterion to choose an
attractor as the final estimator where there are K attractors and K is fixed,
e.g. K = 500, so K does not depend on n. A crucial observation is that the
theory of the algorithm estimator depends on the theory of the attractors,
not on the estimator corresponding to the criterion.
For example, let (0, I p ) and (1, diag(1, 3, ..., p)) be the high breakdown
attractors where 0 and 1 are the p × 1 vectors of zeroes and ones. If the
minimum determinant criterion is used, then the final estimator is (0, I p ).
Although the MCD criterion is used, the algorithm estimator does not have
the same properties as the MCD estimator.
Hawkins and Olive (2002) showed that if K randomly selected elemental
starts are used with concentration to produce the attractors, then the result-
ing estimator is inconsistent and zero breakdown if K and k are fixed and free
of n. Note that each elemental start can be made to breakdown by changing
one case. Hence the breakdown value of the final estimator is bounded by
K/n → 0 as n → ∞. Note that the classical estimator computed from hn
randomly drawn cases is an inconsistent estimator unless hn → ∞ as n → ∞.
Thus the classical estimator applied to a randomly drawn elemental set of
hn ≡ p + 1 cases is an inconsistent estimator, so the K starts and the K
attractors are inconsistent.
This theory shows that the Maronna et al. (2006, pp. 198-199) estimators
that use K = 500 and one concentration step (k = 0) are inconsistent and
zero breakdown. The following theorem is useful because it does not depend
on the criterion used to choose the attractor.
Suppose there are K consistent estimators (Tj , C j ) of (µ, a Σ) for some
constant a > 0, each with the same rate nδ . If (TA , C A ) is an estimator
obtained by choosing one of the K estimators, then (TA , C A ) is a consistent
estimator of (µ, a Σ) with rate nδ by Pratt (1959). See Theorem 1.21.

Theorem 7.6. Suppose the algorithm estimator chooses an attractor as


the final estimator where there are K attractors and K is fixed.
i) If all of the attractors are consistent estimators of (µ, a Σ), then the
algorithm estimator is a consistent estimator of (µ, a Σ).

ii) If all of the attractors are consistent estimators of (µ, a Σ) with the
same rate, e.g. nδ where 0 < δ ≤ 0.5, then the algorithm estimator is a
consistent estimator of (µ, a Σ) with the same rate as the attractors.
iii) If all of the attractors are high breakdown, then the algorithm estimator
is high breakdown.
iv) Suppose the data x1 , ..., xn are iid and P (xi = µ) < 1. The elemental
basic resampling algorithm estimator (k = −1) is inconsistent.
v) The elemental concentration algorithm is zero breakdown.
Proof. i) Choosing from K consistent estimators of (µ, a Σ) results in a
consistent estimator of (µ, a Σ), and ii) follows from Pratt (1959). iii) Let
γn,i be the breakdown value of the ith attractor if the clean data x1 , ..., xn are
in general position. The breakdown value γn of the algorithm estimator can
be no lower than that of the worst attractor: γn ≥ min(γn,1 , ..., γn,K ) → 0.5
as n → ∞.
iv) Let (x−1,j , S−1,j ) be the classical estimator applied to a randomly
drawn elemental set. Then x−1,j is the sample mean applied to p + 1 iid
cases. Hence E(S j ) = Σ x , E[x−1,j ] = E(x) = µ, and Cov(x−1,j ) =
Cov(x)/(p+1) = Σ x /(p+1) assuming second moments. So the (x−1,j , S−1,j )
are identically distributed and inconsistent estimators of (µ, Σ x ). Even with-
out second moments, there exists ε > 0 such that P (kx−1,j − µk > ε) = δ > 0
where the probability, ε, and δ do not depend on n since the distribution
of x−1,j only depends on the distribution of the iid xi , not on n. Then
P (minj kx−1,j − µk > ε) = P (all kx−1,j − µk > ε) → δ^K > 0 as n → ∞
where equality would hold if the x−1,j were iid. Hence the “best start” that
minimizes kx−1,j − µk is inconsistent.
v) The classical estimator with breakdown 1/n is applied to each elemental
start. Hence γn ≤ K/n → 0 as n → ∞. 

Since the FMCD estimator is a zero breakdown elemental concentration


algorithm, the Hubert et al. (2008) claim that “MCD can be efficiently com-
puted with the FAST-MCD estimator” is false. Suppose K is fixed, but at
least one randomly drawn start is iterated to convergence so that k is not
fixed. Then it is not known whether the attractors are inconsistent or consis-
tent estimators, so it is not known whether FMCD is consistent. It is possible
to produce consistent estimators if K ≡ Kn is allowed to increase to ∞.

Remark 7.1. Let γo be the highest percentage of large outliers that an


elemental concentration algorithm can detect reliably. For many data sets,

γo ≈ min{ (n − cn )/n, 1 − [1 − (0.2)^{1/K} ]^{1/h} } 100% (7.10)

if n is large, cn ≥ n/2 and h = p + 1.


Proof. Suppose that the data set contains n cases with d outliers and
n − d clean cases. Suppose K elemental sets are chosen with replacement.

If Wi is the number of outliers in the ith elemental set, then the Wi are
iid hypergeometric(d, n − d, h) random variables. Suppose that it is desired
to find K such that the probability P(that at least one of the elemental
sets is clean) ≡ P1 ≈ 1 − α where 0 < α < 1. Then P1 = 1− P(none of
the K elemental sets is clean) ≈ 1 − [1 − (1 − γ)^h ]^K by independence. If the
contamination proportion γ is fixed, then the probability of obtaining at least
one clean subset of size h with high probability (say 1 − α = 0.8) is given by
0.8 = 1 − [1 − (1 − γ)^h ]^K . Fix the number of starts K and solve this equation
for γ. 
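Equation (7.10) is simple to evaluate; in the sketch below the values of n, p, and K = 500 are illustrative.

# percentage of large outliers detected reliably, from (7.10)
gamo <- function(n, p, K = 500, cn = ceiling(n / 2)) {
  h <- p + 1
  100 * min((n - cn) / n, 1 - (1 - 0.2^(1 / K))^(1 / h))
}
gamo(n = 1000, p = 5)    # about 50%
gamo(n = 1000, p = 50)   # the resistance drops sharply as p grows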

7.2.4 Theory for Practical Estimators

It is convenient to let the xi be random vectors for large sample theory,


but the xi are fixed clean observed data vectors when discussing breakdown.
This subsection presents the FCH estimator to be used along with the classi-
cal estimator. Recall from Definition 7.12 that a concentration algorithm uses
Kn starts (T−1,j , C −1,j ). After finding (T0,j , C 0,j ), each start is refined with
k concentration steps, resulting in Kn attractors (Tk,j , C k,j ), and the con-
centration attractor (TA , C A ) is the attractor that optimizes the criterion.

Concentration algorithms include the basic resampling algorithm as a spe-


cial case with k = −1. Using k = 10 concentration steps works well, and
iterating until convergence is usually fast. The DGK estimator (Devlin et
al. 1975, 1981) defined below is one example. The DGK estimator is affine
equivariant since the classical estimator is affine equivariant and Mahalanobis
distances are invariant under affine transformations by Theorem 7.1. This
subsection will show that the Olive (2004a) MB estimator is a high break-
down estimator and that the DGK estimator is a √n consistent estimator
of (µ, aM CD Σ), the same quantity estimated by the MCD estimator. Both
estimators use the classical estimator computed from cn ≈ n/2 cases. The
breakdown point of the DGK estimator has been conjectured to be “at most
1/p.” See Rousseeuw and Leroy (1987, p. 254).

Definition 7.13. The DGK estimator (Tk,D , C k,D ) = (TDGK , C DGK )


uses the classical estimator (T−1,D , C −1,D ) = (x, S) as the only start.

Definition 7.14. The median ball (MB) estimator (Tk,M , C k,M ) =


(TM B , C M B ) uses (T−1,M , C −1,M ) = (MED(W ), I p ) as the only start where
MED(W ) is the coordinatewise median. So (T0,M , C 0,M ) is the classical es-
timator applied to the “half set” of data closest to MED(W ) in Euclidean
distance.

The proof of the following theorem implies that a high breakdown estima-
tor (T, C) has MED(Di2 ) ≤ V and that the hyperellipsoid {x|D²x ≤ D²(cn) }

that contains cn ≈ n/2 of the cases is in some ball about the origin of ra-
dius r, where V and r do not depend on the outliers even if the number of
outliers is close to n/2. Also the attractor of a high breakdown estimator is
a high breakdown estimator if the number of concentration steps k is fixed,
e.g. k = 10. The theorem implies that the MB estimator (TM B , C M B ) is high
breakdown.

Theorem 7.7. Suppose (T, C) is a high breakdown estimator where C is


a symmetric, positive definite p × p matrix if the contamination proportion
dn /n is less than the breakdown value. Then the concentration attractor
(Tk , C k ) is a high breakdown estimator if the coverage cn ≈ n/2 and the
data are in general position.
Proof. Following Leon (1986, p. 280), if A is a symmetric positive definite
matrix with eigenvalues τ1 ≥ · · · ≥ τp , then for any nonzero vector x,

0 < kxk2 τp ≤ xT Ax ≤ kxk2 τ1 . (7.11)

Let λ1 ≥ · · · ≥ λp be the eigenvalues of C. By (7.11),

(1/λ1 )kx − T k2 ≤ (x − T )T C −1 (x − T ) ≤ (1/λp )kx − T k2 . (7.12)
By (7.12), if the D²(i) are the order statistics of the Di2 (T, C), then D²(i) < V
for some constant V that depends on the clean data but not on the outliers
even if i and dn are near n/2. (Note that 1/λp and MED(kxi − T k2 ) are both
bounded for high breakdown estimators even for dn near n/2.)
Following Johnson and Wichern (1988, pp. 50, 103), the boundary of
the set {x | D_x² ≤ h²} = {x | (x − T)^T C^{−1} (x − T) ≤ h²} is a hyperellipsoid centered at T with axes of length 2h√λ_i. Hence {x | D_x² ≤ D_(cn)²} is
contained in some ball about the origin of radius r where r does not de-
pend on the number of outliers even for dn near n/2. This is the set con-
taining the cases used to compute (T0 , C 0 ). Since the set is bounded, T0
is bounded and the largest eigenvalue λ1,0 of C 0 is bounded by Theorem
7.4. The determinant det(C M CD ) of the HB minimum covariance deter-
minant estimator satisfies 0 < det(C M CD ) ≤ det(C 0 ) = λ1,0 · · · λp,0 , and
λ_{p,0} > inf det(C M CD)/λ_{1,0}^{p−1} > 0 where the infimum is over all possible data
sets with n−dn clean cases and dn outliers. Since these bounds do not depend
on the outliers even for dn near n/2, (T0 , C 0 ) is a high breakdown estimator.
Now repeat the argument with (T0 , C 0 ) in place of (T, C) and (T1 , C 1 ) in
place of (T0 , C 0 ). Then (T1 , C 1 ) is high breakdown. Repeating the argument
iteratively shows (Tk , C k ) is high breakdown. 

The following result shows that it is easy to find a subset J of cn ≈ n/2 cases such that the classical estimator (xJ, SJ) applied to J is a HB estimator of MLD.

Theorem 7.8. Let J consist of the cn cases xi such that ‖xi − MED(W)‖ ≤ MED(‖xi − MED(W)‖). Then the classical estimator (xJ, SJ) applied to J is a HB estimator of MLD.
To investigate the consistency and rate of robust estimators of multivariate
location and dispersion, review Definitions 1.34 and 1.35.

The following assumption (E1) gives a class of distributions where we can prove that the new robust estimators are √n consistent. Cator and Lop-
uhaä (2010, 2012) showed that MCD is consistent provided that the MCD
functional is unique. Distributions where the functional is unique are called
“unimodal,” and rule out, for example, a spherically symmetric uniform dis-
tribution. Theorem 7.9 is crucial for theory and Theorem 7.10 shows that
under (E1), both MCD and DGK are estimating (µ, aM CD Σ).

Assumption (E1): The x1, ..., xn are iid from a “unimodal” elliptically contoured ECp(µ, Σ, g) distribution with nonsingular covariance matrix Cov(xi) where g is continuously differentiable with finite 4th moment:
∫ (x^T x)² g(x^T x) dx < ∞.

Lopuhaä (1999) showed that if a start (T, C) is a consistent affine equivariant estimator of (µ, sΣ), then the classical estimator applied to the cases
with Di2 (T, C) ≤ h2 is a consistent estimator of (µ, aΣ) where a, s > 0 are
some constants. Affine equivariance is not used for Σ = I p . Also, the attrac-
tor and the start have the same rate. If the start is inconsistent, then so is
the attractor. The weight function I(Di2 (T, C) ≤ h2 ) is an indicator that is
1 if Di2 (T, C) ≤ h2 and 0 otherwise.

Theorem 7.9, Lopuhaä (1999). Assume the number of concentration steps k is fixed. a) If the start (T, C) is inconsistent, then so is the attractor.
b) Suppose (T, C) is a consistent estimator of (µ, sI p ) with rate nδ where
s > 0 and 0 < δ ≤ 0.5. Assume (E1) holds and Σ = I p . Then the classical
estimator (T0 , C 0 ) applied to the cases with Di2 (T, C) ≤ h2 is a consistent
estimator of (µ, aI p ) with the same rate nδ where a > 0.
c) Suppose (T, C) is a consistent affine equivariant estimator of (µ, sΣ)
with rate nδ where s > 0 and 0 < δ ≤ 0.5. Assume (E1) holds. Then the
classical estimator (T0 , C 0 ) applied to the cases with Di2 (T, C) ≤ h2 is a
consistent affine equivariant estimator of (µ, aΣ) with the same rate nδ where
a > 0. The constant a depends on the positive constants s, h, p, and the
elliptically contoured distribution, but does not otherwise depend on the
consistent start (T, C).

Let δ = 0.5. Applying Theorem 7.9c) iteratively for a fixed number k of steps produces a sequence of estimators (T0, C0), ..., (Tk, Ck) where (Tj, Cj) is a √n consistent affine equivariant estimator of (µ, aj Σ) where the con-
stants aj > 0 depend on s, h, p, and the elliptically contoured distribution,
but do not otherwise depend on the consistent start (T, C) ≡ (T−1 , C −1 ).

The 4th moment assumption was used to simplify theory, but likely holds
under 2nd moments. Affine equivariance is needed so that the attractor is
affine equivariant, but probably is not needed to prove consistency.

Conjecture 7.1. Change the finite 4th moment assumption to a finite 2nd moment assumption in (E1). Suppose (T, C) is a consistent estimator
of (µ, sΣ) with rate nδ where s > 0 and 0 < δ ≤ 0.5. Then the classical
estimator applied to the cases with Di2 (T, C) ≤ h2 is a consistent estimator
of (µ, aΣ) with the same rate nδ where a > 0.

Remark 7.2. To see that the Lopuhaä (1999) theory extends to concentration where the weight function uses h² = D_(cn)²(T, C), note that (T, C̃) ≡ (T, D_(cn)²(T, C) C) is a consistent estimator of (µ, bΣ) where b > 0 is derived in (7.14), and the weight function I(D_i²(T, C̃) ≤ 1) is equivalent to the concentration weight function I(D_i²(T, C) ≤ D_(cn)²(T, C)). As noted above Theorem 7.1, (T, C̃) is affine equivariant if (T, C) is affine equivariant. Hence the Lopuhaä (1999) theory applied to (T, C̃) with h = 1 is equivalent to the theory applied to the affine equivariant (T, C) with h² = D_(cn)²(T, C).
If (T, C) is a consistent estimator of (µ, s Σ) with rate nδ where 0 < δ ≤
0.5, then D2 (T, C) = (x − T )T C −1 (x − T ) =

(x − µ + µ − T )T [C −1 − s−1 Σ −1 + s−1 Σ −1 ](x − µ + µ − T )

= s−1 D2 (µ, Σ) + OP (n−δ ). (7.13)


Thus the sample percentiles of Di2 (T, C) are consistent estimators of the per-
centiles of s−1 D2 (µ, Σ). Suppose cn /n → ξ ∈ (0, 1) as n → ∞, and let
D_ξ²(µ, Σ) be the 100ξth percentile of the population squared distances. Then
D_(cn)²(T, C) → s^{−1} D_ξ²(µ, Σ) in probability, and bΣ = s^{−1} D_ξ²(µ, Σ) sΣ = D_ξ²(µ, Σ) Σ. Thus

b = D_ξ²(µ, Σ)     (7.14)
does not depend on s > 0 or δ ∈ (0, 0.5]. 

Concentration applies the classical estimator to the cases with D_i²(T, C) ≤ D_(cn)²(T, C). Let cn ≈ n/2 and

b = D_0.5²(µ, Σ)

be the population median of the population squared distances. By Remark 7.2, if (T, C) is a √n consistent affine equivariant estimator of (µ, sΣ) then (T, C̃) ≡ (T, D_(cn)²(T, C) C) is a √n consistent affine equivariant estimator of (µ, bΣ), and D_i²(T, C̃) ≤ 1 is equivalent to D_i²(T, C) ≤ D_(cn)²(T, C).
Hence Lopuhaä (1999) theory applied to (T, C̃) with h = 1 is equivalent
to theory applied to the concentration estimator using the affine equivariant
estimator (T, C) ≡ (T−1, C−1) as the start. Since b does not depend on s, concentration produces a sequence of estimators (T0, C0), ..., (Tk, Ck) where (Tj, Cj) is a √n consistent affine equivariant estimator of (µ, aΣ) where the constant a > 0 is the same for j = 0, 1, ..., k.
Theorem 7.10 shows that a = aM CD where ξ = 0.5. Hence concentration
with a consistent affine equivariant estimator of (µ, sΣ) with rate nδ as a start
results in a consistent affine equivariant estimator of (µ, aM CD Σ) with rate
nδ. This result can be applied iteratively for a finite number of concentration steps. Hence DGK is a √n consistent affine equivariant estimator of the
same quantity that MCD is estimating. It is not known if the results hold
if concentration is iterated to convergence. For multivariate normal data,
D2 (µ, Σ) ∼ χ2p .
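For multivariate normal data the constant b of (7.14) with ξ = 0.5 is therefore just the chi-square median, which can be checked in R:

# b = D^2_{0.5}(mu, Sigma) = chi^2_{p, 0.5} for MVN data, e.g. p = 4:
qchisq(0.5, df = 4)  # approximately 3.36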

Theorem 7.10. Assume that (E1) holds and that (T, C) is a consistent
affine equivariant estimator of (µ, sΣ) with rate nδ where the constants s > 0
and 0 < δ ≤ 0.5. Then the classical estimator (xt,j , St,j ) computed from the
cn ≈ n/2 of cases with the smallest distances Di (T, C) is a consistent affine
equivariant estimator of (µ, aM CD Σ) with the same rate nδ .
Proof. By Remark 7.2 the estimator is a consistent affine equivariant esti-
mator of (µ, aΣ) with rate nδ . By the remarks above, a will be the same for
any consistent affine equivariant estimator of (µ, sΣ) and a does not depend
on s > 0 or δ ∈ (0, 0.5]. Hence the result follows if a = aM CD. The MCD estimator is a √n consistent affine equivariant estimator of (µ, aM CD Σ) by
Cator and Lopuhaä (2010, 2012). If the MCD estimator is the start, then it
is also the attractor by Theorem 7.5 which shows that concentration does not
increase the MCD criterion. Hence a = aM CD . 

Next we define the easily computed robust √n consistent FCH estimator, so named since it is fast, consistent, and uses a high breakdown attractor. The FCH and MBA estimators use the √n consistent DGK estimator
(TDGK , C DGK ) and the high breakdown MB estimator (TM B , C M B ) as at-
tractors.

Definition 7.15. Let the “median ball” be the hypersphere containing the
“half set” of data closest to MED(W ) in Euclidean distance. The FCH esti-
mator uses the MB attractor if the DGK location estimator TDGK is outside
of the median ball, and the attractor with the smallest determinant, other-
wise. Let (TA , C A ) be the attractor used. Then the estimator (TF CH , C F CH )
takes TF CH = TA and

C F CH = [MED(D_i²(TA, C A)) / χ²_{p,0.5}] C A     (7.15)

where χ²_{p,0.5} is the 50th percentile of a chi-square distribution with p degrees of freedom.
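To make the attractor selection and the scaling in (7.15) concrete, here is a hedged base R sketch (an illustration, not the linmodpack covfch code); the arguments dgk and mb are assumed to be lists with components center and cov from previously computed attractors.

# Sketch of the FCH attractor choice and scaling (7.15).
fch.sketch <- function(x, dgk, mb) {
  x <- as.matrix(x); p <- ncol(x)
  medw <- apply(x, 2, median)                              # MED(W)
  medball <- median(sqrt(mahalanobis(x, medw, diag(p))))   # radius of the median ball
  if (sqrt(sum((dgk$center - medw)^2)) > medball) {
    att <- mb                                  # DGK location outside median ball: use MB
  } else if (det(dgk$cov) <= det(mb$cov)) {
    att <- dgk                                 # otherwise use the attractor with
  } else att <- mb                             # the smallest determinant
  scl <- median(mahalanobis(x, att$center, att$cov)) / qchisq(0.5, p)
  list(center = att$center, cov = scl * att$cov)
}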

Remark 7.3. The MBA estimator (TM BA, C M BA) uses the attractor (TA, C A) with the smallest determinant. Hence the DGK estimator is used
as the attractor if det(C DGK ) ≤ det(C M B ), and the MB estimator is used
as the attractor, otherwise. Then TM BA = TA and C M BA is computed using
the right hand side of (7.15). The difference between the FCH and MBA
estimators is that the FCH estimator also uses a location criterion to choose
the attractor: if the DGK location estimator TDGK has a greater Euclidean
distance from MED(W ) than half the data, then FCH uses the MB attractor.
The FCH estimator only uses the attractor with the smallest determinant if
‖TDGK − MED(W)‖ ≤ MED(Di(MED(W), I p)). Using the location crite-
rion increases the outlier resistance of the FCH estimator for certain types of
outliers, as will be seen in Section 7.2.5.

The following theorem shows the FCH estimator has good statistical prop-
erties. We conjecture that FCH is high breakdown. Note that the location
estimator TF CH is high breakdown and that det(C F CH ) is bounded away
from 0 and ∞ if the data is in general position, even if nearly half of the
cases are outliers.

Theorem 7.11. TF CH is high breakdown if the clean data are in general position. Suppose (E1) holds. If (TA, C A) is the DGK or MB attractor with the smallest determinant, then (TA, C A) is a √n consistent estimator of (µ, aM CD Σ). Hence the MBA and FCH estimators are outlier resistant √n consistent estimators of (µ, cΣ) where c = u_0.5/χ²_{p,0.5}, and c = 1 for
multivariate normal data.
Proof. TF CH is high breakdown since it is a bounded distance from
MED(W ) even if the number of outliers is close to n/2. Under (E1) the
FCH and MBA estimators are asymptotically equivalent since kTDGK −
MED(W )k → 0 in probability. The estimator satisfies 0 < det(C M CD ) ≤
det(C A ) ≤ det(C 0,M ) < ∞ by Theorem 7.7 if up to nearly 50% of the cases
are outliers. If the distribution is spherical about µ, then the result follows from Pratt (1959) and Theorem 7.5 since both starts are √n consistent. Otherwise, the MB estimator C M B is a biased estimator of aM CD Σ. But the DGK estimator C DGK is a √n consistent estimator of aM CD Σ by Theorem 7.10 and ‖C M CD − C DGK‖ = OP(n^{−1/2}). Thus the probability that
the DGK attractor minimizes the determinant goes to one as n → ∞, and
(TA , C A ) is asymptotically equivalent to the DGK estimator (TDGK , C DGK ).
Let C F = C F CH or C F = C M BA . Let P (U ≤ uα ) = α where U is given
by (1.35). Then the scaling in (7.15) makes C F a consistent estimator of cΣ
where c = u0.5 /χ2p,0.5, and c = 1 for multivariate normal data. 

A standard method of reweighting can be used to produce the RMBA and RFCH estimators. RMVN uses a slightly modified method of reweighting so
that RMVN gives good estimates of (µ, Σ) for multivariate normal data,
even when certain types of outliers are present.

Definition 7.16. The RFCH estimator uses two standard reweighting steps. Let (µ̂1, Σ̃ 1) be the classical estimator applied to the n1 cases with
Di2 (TF CH , C F CH ) ≤ χ2p,0.975, and let

Σ̂ 1 = [MED(D_i²(µ̂1, Σ̃ 1)) / χ²_{p,0.5}] Σ̃ 1.

Then let (TRF CH , Σ̃ 2 ) be the classical estimator applied to the cases with
Di2 (µ̂1 , Σ̂ 1 ) ≤ χ2p,0.975, and let

C RF CH = [MED(D_i²(TRF CH, Σ̃ 2)) / χ²_{p,0.5}] Σ̃ 2.

RMBA and RFCH are √n consistent estimators of (µ, cΣ) by Lopuhaä
(1999) where the weight function uses h2 = χ2p,0.975, but the two estimators
use nearly 97.5% of the cases if the data is multivariate normal.
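A minimal base R sketch of the two reweighting steps of Definition 7.16 follows (illustrative only, not the linmodpack code); it assumes fch is a list with components center and cov from a previous FCH-type fit.

# Sketch of the RFCH reweighting: two steps, each trims at chi^2_{p, 0.975}
# and rescales as in Definition 7.16.
rfch.sketch <- function(x, fch) {
  x <- as.matrix(x); p <- ncol(x)
  cutoff <- qchisq(0.975, p)
  reweight <- function(center, disp) {
    keep <- mahalanobis(x, center, disp) <= cutoff  # cases with small distances
    m <- colMeans(x[keep, , drop = FALSE])          # classical estimator on kept cases
    S <- cov(x[keep, , drop = FALSE])
    list(center = m,
         cov = (median(mahalanobis(x, m, S)) / qchisq(0.5, p)) * S)
  }
  step1 <- reweight(fch$center, fch$cov)            # gives (muhat_1, Sigmahat_1)
  reweight(step1$center, step1$cov)                 # gives (T_RFCH, C_RFCH)
}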

Definition 7.17. The RMVN estimator uses (µ̂1, Σ̃ 1) and n1 as above. Let q1 = min{0.5(0.975)n/n1, 0.995}, and

Σ̂ 1 = [MED(D_i²(µ̂1, Σ̃ 1)) / χ²_{p,q1}] Σ̃ 1.

Then let (TRM V N , Σ̃ 2 ) be the classical estimator applied to the n2 cases with
D_i²(µ̂1, Σ̂ 1) ≤ χ²_{p,0.975}. Let q2 = min{0.5(0.975)n/n2, 0.995}, and

C RM V N = [MED(D_i²(TRM V N, Σ̃ 2)) / χ²_{p,q2}] Σ̃ 2.

Definition 7.18. Let the n2 cases in Definition 7.17 be known as the RMVN set U. Hence (TRM V N, Σ̃ 2) = (xU, SU) is the classical estimator
applied to the RMVN set U , which can be regarded as the untrimmed data
(the data not trimmed by ellipsoidal trimming) or the cleaned data. Also
S U is the unscaled estimated dispersion matrix while C RM V N is the scaled
estimated dispersion matrix.

Remark 7.4. Classical methods can be applied to the RMVN subset U to make robust methods. Under (E1), (xU, SU) is a √n consistent estimator of
(µ, cU Σ) for some constant cU > 0 that depends on the underlying distribu-
tion of the iid xi . For a general estimator of multivariate location and disper-
sion (TA , C A ), typically a reweight for efficiency step is performed, resulting
in a set U such that the classical estimator (xU , S U ) is the classical estima-
tor applied to a set U . For example, use U = {xi |Di2 (TA , C A ) ≤ χ2p,0.975}.
Then the final estimator is (TF , C F ) = (xU , aS U ) where scaling is done as
in Equation (7.15) in an attempt to make C F a good estimator of Σ if the iid data are from a Np(µ, Σ) distribution. Then (xU, S U) can be shown to be a √n consistent estimator of (µ, cU Σ) for a large class of distributions for the RMVN set, for the RFCH set, or if (TA, C A) is an affine equivariant √n consistent estimator of (µ, cA Σ) on a large class of distributions. The
necessary theory is not yet available for other practical robust reweighted
estimators such as OGK and Det-MCD.

Table 7.1 Average Dispersion Matrices for Near Point Mass Outliers
        RMVN               FMCD               OGK                 MB
  1.002  −0.014      0.055   0.685      0.185   0.089      2.570  −0.082
 −0.014   2.024      0.685   122.5      0.089   36.24     −0.082   5.241

Table 7.2 Average Dispersion Matrices for Mean Shift Outliers
        RMVN               FMCD               OGK                 MB
  0.990   0.004      2.530   0.003      19.67   12.88      2.552   0.003
  0.004   2.014      0.003   5.146      12.88   39.72      0.003   5.118


The RMVN estimator is a √n consistent estimator of (µ, dΣ) by Lopuhaä
(1999) where the weight function uses h2 = χ2p,0.975 and d = u0.5 /χ2p,q where
q2 → q in probability as n → ∞. Here 0.5 ≤ q < 1 depends on the elliptically
contoured distribution, but q = 0.5 and d = 1 for multivariate normal data.
If the bulk of the data is Np (µ, Σ), the RMVN estimator can give useful
estimates of (µ, Σ) for certain types of outliers where FCH and RFCH esti-
mate (µ, dE Σ) for dE > 1. To see this claim, let 0 ≤ γ < 0.5 be the outlier
proportion. If γ = 0, then ni/n → 0.975 and qi → 0.5 in probability. If γ > 0, suppose
the outlier configuration is such that the Di2 (TF CH , C F CH ) are roughly χ2p
for the clean cases, and the outliers have larger Di2 than the clean cases.
Then MED(Di2 ) ≈ χ2p,q where q = 0.5/(1 − γ). For example, if n = 100 and
γ = 0.4, then there are 60 clean cases, q = 5/6, and the quantile χ2p,q is
being estimated instead of χ2p,0.5 . Now ni ≈ n(1 − γ)0.975, and qi estimates
q. Thus C RM V N ≈ Σ. Of course consistency cannot generally be claimed
when outliers are present.
Simulations suggested (TRM V N , C RM V N ) gives useful estimates of (µ, Σ)
for a variety of outlier configurations. Using 20 runs and n = 1000, the aver-
ages of the dispersion matrices were computed when the bulk of the data are
iid N2(0, Σ) where Σ = diag(1, 2). For clean data, FCH, RFCH, and RMVN
give √n consistent estimators of Σ, while FMCD and the Maronna and Za-
mar (2002) OGK estimator seem to be approximately unbiased for Σ. The
median ball estimator was scaled using (7.15) and estimated diag(1.13, 1.85).
Next the data had γ = 0.4 and the outliers had x ∼ N2 ((0, 15)T , 0.0001I2 ),
a near point mass at the major axis. FCH, MB, and RFCH estimated 2.6Σ
while RMVN estimated Σ. FMCD and OGK failed to estimate dΣ. Note that χ²_{2,5/6}/χ²_{2,0.5} = 2.585. See Table 7.1. The following R commands were
used where mldsim is from linmodpack.
qchisq(5/6,2)/qchisq(.5,2)  # = 2.584963
mldsim(n=1000,p=2,outliers=6,pm=15)
Next the data had γ = 0.4 and the outliers had x ∼ N2 ((20, 20)T , Σ), a
mean shift with the same covariance matrix as the clean cases. Rocke and
Woodruff (1996) suggest that outliers with mean shift are hard to detect.
FCH, FMCD, MB, and RFCH estimated 2.6Σ while RMVN estimated Σ,
and OGK failed. See Table 7.2. The R command is shown below.
mldsim(n=1000,p=2,outliers=3,pm=20)
Remark 7.5. The RFCH and RMVN estimators are recommended. If
these estimators are too slow and outlier detection is of interest, try the RMB
estimator, the reweighted MB estimator. If RMB is too slow or if n < 2(p+1),
the Euclidean distances Di (MED(W ), I) of xi from the coordinatewise me-
dian MED(W ) may be useful. A DD plot of Di (x, I) versus Di (MED(W ), I)
is also useful for outlier detection and for whether x and MED(W ) are giving
similar estimates of multivariate location. Also see Section 7.3.

Hubert et al. (2008, 2012) claim that FMCD computes the MCD estimator.
This claim is trivially shown to be false in the following theorem.

Theorem 7.12. Neither FMCD nor Det-MCD compute the MCD esti-
mator.
Proof. A necessary condition for an estimator to be the MCD estimator
is that the determinant of the covariance matrix for the estimator be the
smallest for every run in a simulation. Sometimes FMCD had the smaller
determinant and sometimes Det-MCD had the smaller determinant in the
simulations done by Hubert et al. (2012). 

Example 7.2. Tremearne (1911) recorded height = x[,1] and height while
kneeling = x[,2] of 112 people. Figure 7.1a shows a scatterplot of the data.
Case 3 has the largest Euclidean distance of 214.767 from MED(W ) =
(1680, 1240)T , but if the distances correspond to the contours of a cover-
ing ellipsoid, then case 44 has the largest distance. For k = 0, (T0,M , C 0,M )
is the classical estimator applied to the “half set” of cases closest to MED(W )
in Euclidean distance. The hypersphere (circle) centered at MED(W ) that
covers half the data is small because the data density is high near MED(W ).
The median Euclidean distance is 59.661 and case 44 has Euclidean distance
77.987. Hence the intersection of the sphere and the data is a highly corre-
lated clean ellipsoidal region. Figure 7.1b shows the DD plot of the classical
distances versus the MB distances. Notice that both the classical and MB
estimators give the largest distances to cases 3 and 44. Notice that case 44
could not be detected using marginal methods.

[Figure: a) Major Data (scatterplot of x[,1] versus x[,2]); b) MB DD Plot (MD versus RD), with cases 3 and 44 labeled]
Fig. 7.1 Plots for Major Data

As the dimension p gets larger, outliers that cannot be detected by marginal methods (case 44 in Example 7.2) become harder to detect. When
p = 3 imagine that the clean data is a baseball bat or stick with one end
at the SW corner of the bottom of the box (corresponding to the coordinate
axes) and one end at the NE corner of the top of the box. If the outliers are
a ball, there is much more room to hide them in the box than in a covering
rectangle when p = 2.
Example 7.3. The estimators can be useful when the data is not ellipti-
cally contoured. The Gladstone (1905) data has 11 variables on 267 persons
after death. Head measurements were breadth, circumference, head height,
length, and size as well as cephalic index and brain weight. Age, height, and
two categorical variables ageclass (0: under 20, 1: 20-45, 2: over 45) and
sex were also given. Figure 7.2 shows the DD plots for the FCH, RMVN,
cov.mcd, and MB estimators. The DD plots from the DGK, MBA, and
RFCH estimators were similar, and the six outliers in Figure 7.2 correspond
to the six infants in the data set.

Section 7.3 shows that if a consistent robust estimator is scaled as in (7.15), then the plotted points in the DD plot will cluster about the identity
line with unit slope and zero intercept if the data is multivariate normal,
and about some other line through the origin if the data is from some other
elliptically contoured distribution with a nonsingular covariance matrix. Since
multivariate procedures tend to perform well for elliptically contoured data,
the DD plot is useful even if outliers are not present.

[Figure with four panels: FCH DD Plot, RMVN DD Plot, cov.mcd DD Plot, and MB DD Plot (each MD versus RD)]
Fig. 7.2 DD Plots for Gladstone Data

7.2.5 Outlier Resistance and Simulations

          RMVN                              FMCD
 0.996  0.014  0.002 -0.001      0.931  0.017  0.011  0.000
 0.014  2.012 -0.001  0.029      0.017  1.885 -0.003  0.022
 0.002 -0.001  2.984  0.003      0.011 -0.003  2.803  0.010
-0.001  0.029  0.003  3.994      0.000  0.022  0.010  3.752

Simulations were used to compare (TF CH, C F CH), (TRF CH, C RF CH), (TRM V N, C RM V N), and (TF M CD, C F M CD). Shown above are the averages, using 20 runs and n = 1000, of the dispersion matrices when the bulk of the data are iid N4(0, Σ) where Σ = diag(1, 2, 3, 4). The first pair of matrices used γ = 0. Here the FCH, RFCH, and RMVN estimators are √n consistent estimators of Σ, while C F M CD seems to be approximately unbiased for
0.94Σ.
Next the data had γ = 0.4 and the outliers had x ∼ N4 ((0, 0, 0, 15)T ,
0.0001 I 4 ), a near point mass at the major axis. FCH and RFCH estimated
1.93Σ while RMVN estimated Σ. The FMCD estimator failed to estimate
dΣ. Note that χ²_{4,5/6}/χ²_{4,0.5} = 1.9276.

          RMVN                              FMCD
 0.988 -0.023 -0.007  0.021      0.227 -0.016  0.002  0.049
-0.023  1.964 -0.022 -0.002     -0.016  0.435 -0.014  0.013
-0.007 -0.022  3.053  0.007      0.002 -0.014  0.673  0.179
 0.021 -0.002  0.007  3.870      0.049  0.013  0.179  55.65

Next the data had γ = 0.4 and the outliers had x ∼ N4 (15 1, Σ), a mean
shift with the same covariance matrix as the clean cases. Again FCH and
RFCH estimated 1.93Σ while RMVN and FMCD estimated Σ.

          RMVN                              FMCD
 1.013  0.008  0.006 -0.026      1.024  0.002  0.003 -0.025
 0.008  1.975 -0.022 -0.016      0.002  2.000 -0.034 -0.017
 0.006 -0.022  2.870  0.004      0.003 -0.034  2.931  0.005
-0.026 -0.016  0.004  3.976     -0.025 -0.017  0.005  4.046

Geometrical arguments suggest that the MB estimator has considerable outlier resistance. Suppose the outliers are far from the bulk of the data. Let
the “median ball” correspond to the half set of data closest to MED(W ) in
Euclidean distance. If the outliers are outside of the median ball, then the
initial half set in the iteration leading to the MB estimator will be clean. Thus
the MB estimator will tend to give the outliers the largest MB distances unless
the initial clean half set has very high correlation in a direction about which
the outliers lie. This property holds for very general outlier configurations.
The FCH estimator tries to use the DGK attractor if the det(C DGK ) is small
and the DGK location estimator TDGK is in the median ball. Distant outliers
that make det(C DGK ) small also drag TDGK outside of the median ball. Then
FCH uses the MB attractor.
Compared to OGK and FMCD, the MB estimator is vulnerable to outliers
that lie within the median ball. If the bulk of the data is highly correlated
with the major axis of a hyperellipsoidal region, then the distances based on
the clean data can be very large for outliers that fall within the median ball.
The outlier resistance of the MB estimator decreases as p increases since the
volume of the median ball rapidly increases with p.
A simple simulation for outlier resistance is to count the number of times
the minimum distance of the outliers is larger than the maximum distance of
the clean cases. The simulation used 100 runs. If the count was 97, then in 97
data sets the outliers can be separated from the clean cases with a horizontal
line in the DD plot, but in 3 data sets the robust distances did not achieve
complete separation. In Spring 2015, Det-MCD simulated much like FMCD,
but was more likely to cause an error in R.
The clean cases had x ∼ Np (0, diag(1, 2, ..., p)). Outlier types were the
mean shift x ∼ Np (pm1, diag(1, 2, ..., p)) where 1 = (1, ..., 1)T and x ∼
Np ((0, ..., 0, pm)T , 0.0001I p ), a near point mass at the major axis. Notice that
the clean data can be transformed to a Np(0, I p) distribution by multiplying xi by diag(1, 1/√2, ..., 1/√p), and this transformation changes the location of the near point mass to (0, ..., 0, pm/√p)^T.
Suppose the attractor is (xk,j , S k,j ) computed from a subset of cn cases.
The MCD(cn ) criterion is the determinant det(S k,j ). The volume of the hy-
perellipsoid {z : (z − x_{k,j})^T S_{k,j}^{−1} (z − x_{k,j}) ≤ h²} is equal to

Table 7.3 Number of Times Mean Shift Outliers had the Largest Distances
p γ n pm MBA FCH RFCH RMVN OGK FMCD MB
10 .1 100 4 49 49 85 84 38 76 57
10 .1 100 5 91 91 99 99 93 98 91
10 .4 100 7 90 90 90 90 0 48 100
40 .1 100 5 3 3 3 3 76 3 17
40 .1 100 8 36 36 37 37 100 49 86
40 .25 100 20 62 62 62 62 100 0 100
40 .4 100 20 20 20 20 20 0 0 100
40 .4 100 35 44 98 98 98 95 0 100
60 .1 200 10 49 49 49 52 100 30 100
60 .1 200 20 97 97 97 97 100 35 100
60 .25 200 25 60 60 60 60 100 0 100
60 .4 200 30 11 21 21 21 17 0 100
60 .4 200 40 21 100 100 100 100 0 100

[2π^{p/2} / (p Γ(p/2))] h^p √det(S_{k,j}),     (7.16)

see Johnson and Wichern (1988, pp. 103-104).
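A quick base R check of (7.16) (the function name is illustrative):

# Hyperellipsoid volume (7.16) for a p x p dispersion matrix S and radius h.
ellipsoid.volume <- function(S, h) {
  p <- nrow(S)
  (2 * pi^(p/2) / (p * gamma(p/2))) * h^p * sqrt(det(S))
}
ellipsoid.volume(diag(2), h = 1)  # area of the unit disk, pi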


For near point mass outliers, a hyperellipsoid with very small volume can
cover half of the data if the outliers are at one end of the hyperellipsoid and
some of the clean data are at the other end. This half set will produce a
classical estimator with very small determinant by (7.16). In the simulations
for large γ, as the near point mass is moved very far away from the bulk
of the data, only the classical, MB, and OGK estimators did not have nu-
merical difficulties. Since the MCD estimator has smaller determinant than
DGK, estimators like FMCD and MBA that use the MCD criterion without
using location information will be vulnerable to these outliers. FMCD is also
vulnerable to outliers if γ is slightly larger than γo given by (7.10).

Table 7.4 Number of Times Near Point Mass Outliers had the Largest Distances
p γ n pm MBA FCH RFCH RMVN OGK FMCD MB
10 .1 100 40 73 92 92 92 100 95 100
10 .25 100 25 0 99 99 90 0 0 99
10 .4 100 25 0 100 100 100 0 0 100
40 .1 100 80 0 0 0 0 79 0 80
40 .1 100 150 0 65 65 65 100 0 99
40 .25 100 90 0 88 87 87 0 0 88
40 .4 100 90 0 91 91 91 0 0 91
60 .1 200 100 0 0 0 0 13 0 91
60 .25 200 150 0 100 100 100 0 0 100
60 .4 200 150 0 100 100 100 0 0 100
60 .4 200 20000 0 100 100 100 64 0 100

Tables 7.3 and 7.4 help illustrate the results for the simulation. Large
counts and small pm for fixed γ suggest greater ability to detect outliers.

Values of p were 5, 10, 15, ..., 60. First consider the mean shift outliers and
Table 7.3. For γ = 0.25 and 0.4, MB usually had the highest counts. For
5 ≤ p ≤ 20 and the mean shift, the OGK estimator often had the smallest
counts, and FMCD could not handle 40% outliers for p = 20. For 25 ≤ p ≤ 60,
OGK usually had the highest counts for γ = 0.05 and 0.1. For p ≥ 30, FMCD
could not handle 25% outliers even for enormous values of pm.
In Table 7.4, FCH greatly outperformed MBA although the only difference
between the two estimators is that FCH uses a location criterion as well as
the MCD criterion. OGK performed well for γ = 0.05 and 20 ≤ p ≤ 60 (not
tabled). For large γ, OGK often has large bias for cΣ. Then the outliers may
need to be enormous before OGK can detect them. Also see Table 7.2, where
OGK gave the outliers the largest distances for all runs, but C OGK does not
give a good estimate of cΣ = c diag(1, 2).

[Figure: FMCD DD Plot (MD versus RD) with cases 72, 34, 19, 44, 32, and 50 labeled]
Fig. 7.3 The FMCD Estimator Failed

The DD plot of MDi versus RDi is useful for detecting outliers. The resistant estimator will be useful if (T, C) ≈ (µ, cΣ) where c > 0 since scaling
by c affects the vertical labels of the RDi but not the shape of the DD plot.
For the outlier data, the MBA estimator is biased, but the mean shift outliers
in the MBA DD plot will have large RDi since C M BA ≈ 2C F M CD ≈ 2Σ.
In an older mean shift simulation, when p was 8 or larger, the cov.mcd
estimator was usually not useful for detecting the mean shift outliers. Figure

[Figure: Resistant (MBA) DD Plot (MD versus RD) with several cases labeled in the upper portion]
Fig. 7.4 The Outliers are Large in the MBA DD Plot

7.3 shows that now the FMCD RDi are highly correlated with the M Di . The
DD plot based on the MBA estimator detects the outliers. See Figure 7.4.
For many data sets, Equation (7.10) gives a rough approximation for the
number of large outliers that concentration algorithms using K starts each
consisting of h cases can handle. However, if the data set is multivariate and
the bulk of the data falls in one compact hyperellipsoid while the outliers
fall in another hugely distant compact hyperellipsoid, then a concentration
algorithm using a single start can sometimes tolerate nearly 25% outliers.
For example, suppose that all p + 1 cases in the elemental start are outliers
but the covariance matrix is nonsingular so that the Mahalanobis distances
can be computed. Then the classical estimator is applied to the cn ≈ n/2
cases with the smallest distances. Suppose the percentage of outliers is less
than 25% and that all of the outliers are in this “half set.” Then the sample
mean applied to the cn cases should be closer to the bulk of the data than
to the cluster of outliers. Hence after a concentration step, the percentage
of outliers will be reduced if the outliers are very far away. After the next
concentration step the percentage of outliers will be further reduced and after
several iterations, all cn cases will be clean.
In a small simulation study, 20% outliers were planted for various values of
p. If the outliers were distant enough, then the minimum DGK distance for
the outliers was larger than the maximum DGK distance for the nonoutliers.

Hence the outliers would be separated from the bulk of the data in a DD plot
of classical versus robust distances. For example, when the clean data comes
from the Np (0, I p ) distribution and the outliers come from the Np (2000 1, I p )
distribution, the DGK estimator with 10 concentration steps was able to
separate the outliers in 17 out of 20 runs when n = 9000 and p = 30. With
10% outliers, a shift of 40, n = 600, and p = 50, 18 out of 20 runs worked.
Olive (2004a) showed similar results for the Rousseeuw and Van Driessen
(1999) FMCD algorithm and that the MBA estimator could often correctly
classify up to 49% distant outliers. The following theorem shows that it is
very difficult to drive the determinant of the dispersion estimator from a
concentration algorithm to zero.

Theorem 7.13. Consider the concentration and MCD estimators that both cover cn cases. For multivariate data, if at least one of the starts is
nonsingular, then the concentration attractor C A is less likely to be singular
than the high breakdown MCD estimator C M CD .

Proof. If all of the starts are singular, then the Mahalanobis distances
cannot be computed and the classical estimator can not be applied to cn
cases. Suppose that at least one start was nonsingular. Then C A and C M CD
are both sample covariance matrices applied to cn cases, but by definition
C M CD minimizes the determinant of such matrices. Hence 0 ≤ det(C M CD ) ≤
det(C A ). 

Software

The robustbase library was downloaded from (www.r-project.org/#doc). Section 11.1 explains how to use the source command to get the linmodpack
functions in R and how to download a library from R. Type the commands
library(MASS) and library(robustbase) to compute the FMCD and
OGK estimators with the cov.mcd and covOGK functions. To use Det-MCD
instead of FMCD, change
out <- covMcd(x) to out <- covMcd(x,nsamp="deterministic"),

but in Spring 2015 this change was more likely to cause errors.
The linmodpack function
mldsim(n=200,p=5,gam=.2,runs=100,outliers=1,pm=15)
can be used to produce Tables 7.1–7.4. Change outliers to 0 to examine the
average of µ̂ and Σ̂. The function mldsim6 is similar but does not need the
library command since it compares the FCH, RFCH, MB estimators, and
the covmb2 estimator of Section 7.3.
The function covfch computes FCH and RFCH, while covrmvn
computes the RMVN and MB estimators. The function covrmb computes MB
and RMB where RMB is like RMVN except the MB estimator is reweighted
instead of FCH. Functions covdgk, covmba, and rmba compute the scaled
DGK, MBA, and RMBA estimators. Better programs would use MB if
DGK causes an error.

[Figure: scatterplot of x[,1] versus x[,2]]
Fig. 7.5 highlighted cases = half set with smallest RD = (T0, C0)


[Figure: scatterplot of x[,1] versus x[,2]]
Fig. 7.6 highlighted cases = half set with smallest RD = (T1, C1)



[Figure: scatterplot of x[,1] versus x[,2]]
Fig. 7.7 highlighted cases = half set with smallest RD = (T2, C2)


[Figure: DD plot of MD versus rd]
Fig. 7.8 highlighted cases = outliers, RD = (T0,D, C0,D)



[Figure: DD plot of MD versus rd]
Fig. 7.9 highlighted cases = outliers, RD = (T1,D, C1,D)


[Figure: DD plot of MD versus rd]
Fig. 7.10 highlighted cases = outliers, RD = (T2,D, C2,D)



[Figure: DD plot of MD versus rd]
Fig. 7.11 highlighted cases = outliers, RD = (T3,D, C3,D)

The concmv function described in Problem 7.6 illustrates concentration where the start is (MED(W), diag([MAD(Xi)]²)). In Figures 7.5, 7.6, and
7.7, the highlighted cases are the half set with the smallest distances, and
the initial half set shown in Figure 7.5 is not clean, where n = 100 and there
are 40 outliers. The attractor shown in Figure 7.7 is clean. This type of data
set has too many outliers for DGK while the MB starts and attractors are
almost always clean.
The ddmv function in Problem 7.7 illustrates concentration for the DGK
estimator where the start is the classical estimator. Now n = 100, p = 4,
and there are 25 outliers. A DD plot of classical distances MD versus robust
distances RD is shown. See Figures 7.8, 7.9, 7.10, and 7.11. The half set of
cases with the smallest RDs is used, and the initial half set shown in Figure
7.8 is not clean. The attractor in Figure 7.11 is the DGK estimator which
uses a clean half set. The clean cases xi ∼ N4(0, diag(1, 2, 3, 4)) while the outliers xi ∼ N4((10, 10√2, 10√3, 20)^T, diag(1, 2, 3, 4)).

7.2.6 The RMVN and RFCH Sets

Both the RMVN and RFCH estimators compute the classical estimator
(xU, SU) on some set U containing nU ≥ n/2 of the cases. Referring to Definition 7.16, for the RFCH estimator, (xU, S U) = (TRF CH, Σ̃ 2), and then S U
is scaled to form C RF CH . Referring to Definition 7.17, for the RMVN esti-
mator, (xU , SU ) = (TRM V N , Σ̃ 2 ), and then S U is scaled to form C RM V N .
See Definition 7.18.
The two main ways to handle outliers are i) apply the multivariate method
to the cleaned data, and ii) plug in robust estimators for classical estimators.
Subjectively cleaned data may work well for a single data set, but we can’t
get large sample theory since sometimes too many cases are deleted (delete
outliers and some nonoutliers) and sometimes too few (do not get all of the
outliers). Practical plug-in robust estimators have rarely been shown to be √n consistent and highly outlier resistant.
Using the RMVN or RFCH set U is simultaneously a plug in method and
an objective way to clean the data such that the resulting robust method is
often backed by theory. This result is extremely useful computationally: find
the RMVN set or RFCH set U , then apply the classical method to the cases
in the set U . This procedure is often equivalent to using (xU , SU ) as plug
in estimators. The method can be applied if n > 2(p + 1) but may not work
well unless n > 20p. The linmodpack function getu gets the RMVN set U as
well as the case numbers corresponding to the cases in U .
The set U is a small volume hyperellipsoid containing at least half of the
cases since concentration is used. The set U can also be regarded as the
“untrimmed data”: the data that was not trimmed by ellipsoidal trimming.
Theory has been proved for a large class of elliptically contoured distributions,
but it is conjectured that theory holds for a much wider class of distributions.
See Olive (2017b, pp. 127-128).
In simulations RFCH and RMVN seem to estimate cΣ x if x = Az + µ
where z = (z1 , ..., zp)T and the zi are iid from a continuous distribution with
variance σ 2 . Here Σ x = Cov(x) = σ 2 AAT . The bias for the MB estimator
seemed to be small. It is known that affine equivariant estimators give unbi-
ased estimators of cΣ x if the distribution of zi is also symmetric. DGK is
affine equivariant and RFCH and RMVN are asymptotically equivalent to a
scaled DGK estimator. But in the simulations the results also held for skewed
distributions.
Several illustrative applications of the RMVN set U are given next, where
the theory usually assumes that the cases are iid from a large class of ellip-
tically contoured distributions.
i) The classical estimator of multivariate location and dispersion applied to the cases in U gives (xU, S U), a √n consistent estimator of (µ, cΣ) for
some constant c > 0. See Remark 7.4.
ii) The classical estimator of the correlation matrix applied to the cases in
U gives RU , a consistent estimator of the population correlation matrix ρx .
iii) For multiple linear regression, let Y be the response variable, x1 = 1
and x2 , ..., xp be the predictor variables. Let z i = (Yi , xi2, ..., xip)T . Let U
be the RMVN or RFCH set formed using the z i . Then a classical regression
estimator applied to the set U results in a robust regression estimator. For least squares, this is implemented with the linmodpack function rmreg3.
iv) For multivariate linear regression, let Y1 , ..., Ym be the response vari-
ables, x1 = 1 and x2 , ..., xp be the predictor variables. Let

z i = (Yi1, ..., Yim, xi2, ..., xip)^T.

Let U be the RMVN or RFCH set formed using the z i . Then a classical least
squares multivariate linear regression estimator applied to the set U results
in a robust multivariate linear regression estimator. For least squares, this is
implemented with the linmodpack function rmreg2. The method for multiple
linear regression in iii) corresponds to m = 1. See Section 8.6.
There are also several variants on the method. Suppose there are tentative
predictors Z1 , ..., ZJ . After transformations assume that predictors X1 , ..., Xk
are linearly related. Assume the set U used cases i1 , i2 , ..., inU . To add vari-
ables like Xk+1 = X12 , Xk+2 = X3 X4 , Xk+3 = gender, ..., Xp , augment
U with the variables Xk+1 , ..., Xp corresponding to cases i1 , ..., inU . Adding
variables results in cleaned data that is more likely to contain outliers.
If there are g groups (g = G for discriminant analysis, g = 2 for binary
regression, and g = p for one way MANOVA), the function getubig gets
the RMVN set Ui for each group and combines the g RMVN sets into one
large set Ubig = U1 ∪ U2 ∪ · · ·∪ Ug . Olive (2017b) has many more applications.

7.3 Outlier Detection for the MLD Model

Now suppose the multivariate data has been collected into an n × p matrix

W = X = [x1^T, ..., xn^T]^T =

    [ x1,1  x1,2  ...  x1,p ]
    [ x2,1  x2,2  ...  x2,p ]
    [  ...   ...  ...   ... ]
    [ xn,1  xn,2  ...  xn,p ]

  = [ v1  v2  ...  vp ]

where the ith row of W is the ith case xTi and the jth column v j of W
corresponds to n measurements of the jth random variable Xj for j = 1, ..., p.
Hence the n rows of the data matrix W correspond to the n cases, while the
p columns correspond to measurements on the p random variables X1 , ..., Xp.
For example, the data may consist of n visitors to a hospital where the p = 2
variables height and weight of each individual were measured.
Definition 7.19. The coordinatewise median MED(W ) = (MED(X1 ), ...,
MED(Xp ))T where MED(Xi ) is the sample median of the data in column i
corresponding to variable Xi and v i .

Example 7.4. Let the data for X1 be 1, 2, 3, 4, 5, 6, 7, 8, 9 while the data for
X2 is 7, 17, 3, 8, 6, 13, 4, 2, 1. Then MED(W ) = (MED(X1 ), MED(X2 ))T =
(5, 6)T .
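A quick base R check of Example 7.4:

# Coordinatewise median MED(W) for the data of Example 7.4
W <- cbind(X1 = 1:9, X2 = c(7, 17, 3, 8, 6, 13, 4, 2, 1))
apply(W, 2, median)  # returns (5, 6)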

Definition 7.20: Rousseeuw and Van Driessen (1999). The DD plot is a plot of the classical Mahalanobis distances MDi versus robust Maha-
lanobis distances RDi .

The DD plot is used as a diagnostic for multivariate normality, elliptical symmetry, and for outliers. Assume that the data set consists of iid vectors
from an ECp (µ, Σ, g) distribution with second moments. See Section 1.7 for
notation. Then the classical sample mean and covariance matrix (TM , C M ) =
(x, S) is a consistent estimator for (µ, cx Σ) = (E(x), Cov(x)). Assume that
an alternative algorithm estimator (TA , C A ) is a consistent estimator for
(µ, aAΣ) for some constant aA > 0. By scaling the algorithm estimator,
the DD plot can be constructed to follow the identity line with unit slope
and zero intercept. Let (TR , C R ) = (TA , C A /τ 2 ) denote the scaled algorithm
estimator where τ > 0 is a constant to be determined. Notice that (TR , C R )
is a valid estimator of location and dispersion. Hence the robust distances
used in the DD plot are given by
RDi = RDi(TR, C R) = √[(xi − TR(W))^T [C R(W)]^{−1} (xi − TR(W))] = τ Di(TA, C A) for i = 1, ..., n.


The following theorem shows that if consistent estimators are used to
construct the distances, then the DD plot will tend to cluster tightly about the
line segment through (0, 0) and (MDn,α , RDn,α ) where 0 < α < 1 and MDn,α
is the 100αth sample percentile of the MDi . Nevertheless, the variability in
the DD plot may increase with the distances. Let K > 0 be a constant, e.g.
the 99th percentile of the χ2p distribution.

Theorem 7.14. Assume that x1 , ..., xn are iid observations from a dis-
tribution with parameters (µ, Σ) where Σ is a symmetric positive definite
matrix. Let aj > 0 and assume that (µ̂j,n , Σ̂ j,n ) are consistent estimators of
(µ, aj Σ) for j = 1, 2.
a) D_x²(µ̂j, Σ̂ j) − (1/aj) D_x²(µ, Σ) = oP(1).
b) Let 0 < δ ≤ 0.5. If (µ̂j, Σ̂ j) − (µ, aj Σ) = OP(n^{−δ}) and aj Σ̂ j^{−1} − Σ^{−1} = OP(n^{−δ}), then

D_x²(µ̂j, Σ̂ j) − (1/aj) D_x²(µ, Σ) = OP(n^{−δ}).

c) Let Di,j ≡ Di (µ̂j,n , Σ̂ j,n ) be the ith Mahalanobis distance computed


from (µ̂j,n , Σ̂ j,n ). Consider the cases in the region R = {i|0 ≤ Di,j ≤ K, j =
1, 2}. Let rn denote the correlation between Di,1 and Di,2 for the cases in R
(thus rn is the correlation of the distances in the “lower left corner” of the
DD plot). Then rn → 1 in probability as n → ∞.

Proof. Let Bn denote the subset of the sample space on which both Σ̂ 1,n
and Σ̂ 2,n have inverses. Then P (Bn ) → 1 as n → ∞.
a) and b): D_x²(µ̂j, Σ̂ j) = (x − µ̂j)^T Σ̂ j^{−1} (x − µ̂j)

= (x − µ̂j)^T [ −Σ^{−1}/aj + Σ^{−1}/aj + Σ̂ j^{−1} ] (x − µ̂j)

= (x − µ̂j)^T [ −Σ^{−1}/aj + Σ̂ j^{−1} ] (x − µ̂j) + (x − µ̂j)^T [ Σ^{−1}/aj ] (x − µ̂j)

= (1/aj) (x − µ̂j)^T ( −Σ^{−1} + aj Σ̂ j^{−1} ) (x − µ̂j) + (x − µ + µ − µ̂j)^T [ Σ^{−1}/aj ] (x − µ + µ − µ̂j)

= (1/aj) (x − µ)^T Σ^{−1} (x − µ) + (2/aj) (x − µ)^T Σ^{−1} (µ − µ̂j) + (1/aj) (µ − µ̂j)^T Σ^{−1} (µ − µ̂j)

+ (1/aj) (x − µ̂j)^T [ aj Σ̂ j^{−1} − Σ^{−1} ] (x − µ̂j)     (7.17)

on Bn, and the last three terms are oP(1) under a) and OP(n^{−δ}) under b).
c) Following the proof of a), Dj² ≡ D_x²(µ̂j, Σ̂ j) → (x − µ)^T Σ^{−1} (x − µ)/aj in probability for fixed x, and the result follows. 

The above result implies that a plot of the MDi versus the Di (TA , C A ) ≡
Di (A) will follow a line through the origin with some positive slope since if
x = µ, then both the classical and the algorithm distances should be close to
zero. We want to find τ such that RDi = τ Di (TA , C A ) and the DD plot of
MDi versus RDi follows the identity line. By Theorem 7.14, the plot of MDi
versus Di (A) will follow the line segment defined by the origin (0, 0) and the
point of observed median Mahalanobis distances, (med(MDi ), med(Di (A))).
This line segment has slope

med(Di (A))/med(MDi )

which is generally not one. By taking τ = med(MDi )/med(Di (A)), the plot
will follow the identity line if (x, S) is a consistent estimator of (µ, cx Σ)
and if (TA , C A ) is a consistent estimator of (µ, aA Σ). (Using the notation
from Theorem 7.14, let (a1 , a2 ) = (cx , aA ).) The classical estimator is con-
sistent if the population has a nonsingular covariance matrix. The algorithm
estimators (TA , C A ) from Theorem 7.11 are consistent on a large class of


EC distributions that have a nonsingular covariance matrix, but tend to be
biased for non–EC distributions. We recommend using RFCH or RMVN as
the robust estimators in DD plots.
By replacing the observed median med(MDi ) of the classical Mahalanobis
distances with the target population analog, say MED, τ can be chosen so
that the DD plot is simultaneously a diagnostic for elliptical symmetry and a
diagnostic for the target EC distribution. That is, the plotted points follow
the identity line if the data arise from a target EC distribution such as the
multivariate normal distribution, but the points follow a line with non-unit
slope if the data arise from an alternative EC distribution. In addition the DD
plot can often detect departures from elliptical symmetry such as outliers,
the presence of two groups, or the presence of a mixture distribution.

Example 7.5. We will use the multivariate normal Np(µ, Σ) distribution as the target. If the data are indeed iid MVN vectors, then the (MDi)² are asymptotically χ²_p random variables, and MED = √(χ²_{p,0.5}) where χ²_{p,0.5} is the
median of the χ2p distribution. Since the target distribution is Gaussian, let
RDi = [√(χ²_{p,0.5}) / med(Di(A))] Di(A)   so that   τ = √(χ²_{p,0.5}) / med(Di(A)).     (7.18)
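A minimal base R sketch of the DD plot with this scaling follows (illustrative only); the robust fit (Ta, Ca) is assumed given, e.g., from an RFCH or RMVN type estimator.

# Sketch of a DD plot with the MVN scaling of (7.18).
ddplot.sketch <- function(x, Ta, Ca) {
  x <- as.matrix(x); p <- ncol(x)
  md <- sqrt(mahalanobis(x, colMeans(x), cov(x)))  # classical distances MD_i
  da <- sqrt(mahalanobis(x, Ta, Ca))               # algorithm distances D_i(A)
  tau <- sqrt(qchisq(0.5, p)) / median(da)         # tau from (7.18)
  rd <- tau * da                                   # robust distances RD_i
  plot(md, rd, xlab = "MD", ylab = "RD")
  abline(0, 1)                                     # identity line for MVN data
  invisible(cbind(MD = md, RD = rd))
}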

Since every nonsingular estimator of multivariate location and dispersion defines a hyperellipsoid, the DD plot can be used to examine which points are in the robust hyperellipsoid

{x : (x − TR)^T C_R^{−1} (x − TR) ≤ RD_(h)²}     (7.19)

where RD_(h)² is the hth smallest squared robust Mahalanobis distance, and which points are in a classical hyperellipsoid

{x : (x − x̄)^T S^{−1} (x − x̄) ≤ MD_(h)²}.     (7.20)

In the DD plot, points below RD(h) correspond to cases that are in the
hyperellipsoid given by Equation (7.19) while points to the left of M D(h) are
in a hyperellipsoid determined by Equation (7.20). In particular, we can use
the DD plot to examine which points are in the nonparametric prediction
region (4.24).

Application 7.1. Consider the DD plot with RFCH or RMVN. The DD plot can be used simultaneously as a diagnostic for whether the data arise from
a multivariate normal distribution or from another EC distribution with non-
singular covariance matrix. EC data will cluster about a straight line through
the origin; MVN data in particular will cluster about the identity line. Thus
the DD plot can be used to assess the success of numerical transformations
towards elliptical symmetry. The DD plot can be used to detect multivariate outliers. Use the DD plot to detect outliers and leverage groups if n ≥ 10p
for the predictor variables in regression.

[Figure with four panels (each MD versus RD): a) DD Plot, 200 N(0, I3) Cases; b) DD Plot, 200 EC Cases; c) DD Plot, 200 Lognormal Cases; d) Weighted DD Plot for Lognormal Data]
Fig. 7.12 4 DD Plots

For this application, the RFCH and RMVN estimators may be best. For
MVN data, the RDi from the RFCH estimator tend to have a higher correla-
tion with the MDi from the classical estimator than the RDi from the FCH
estimator, and the cov.mcd estimator may be inconsistent.

Figure 7.12 shows the DD plots for 3 artificial data sets using cov.mcd.
The DD plot for 200 N3 (0, I 3 ) points shown in Figure 7.12a resembles the
identity line. The DD plot for 200 points from the elliptically contoured
distribution 0.6N3 (0, I 3 ) + 0.4N3 (0, 25 I 3 ) in Figure 7.12b clusters about a
line through the origin with a slope close to 2.0.
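As an aside, data like that of Figure 7.12b can be generated in base R as follows (a sketch; the seed is arbitrary, and the robust distances would then come from a robust fit of choice).

# Generate n = 200 cases from the mixture 0.6 N3(0, I3) + 0.4 N3(0, 25 I3).
set.seed(1)
n <- 200; p <- 3
mix <- rbinom(n, 1, 0.4)                                  # 1 with probability 0.4
x <- matrix(rnorm(n * p), n, p) * ifelse(mix == 1, 5, 1)  # sd 5 gives variance 25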

A weighted DD plot magnifies the lower left corner of the DD plot by omitting the cases with RDi ≥ √(χ²_{p,.975}). This technique can magnify features
that are obscured when large RDi ’s are present. If the distribution of x is EC
with nonsingular Σ, Theorem 7.14 implies that the correlation of the points
in the weighted DD plot will tend to one and that the points will cluster
about a line passing through the origin. For example, the plotted points in
the weighted DD plot (not shown) for the non-MVN EC data of Figure 7.12b
are highly correlated and still follow a line through the origin with a slope
close to 2.0.

Figures 7.12c and 7.12d illustrate how to use the weighted DD plot. The
ith case in Figure 7.12c is (exp(xi,1 ), exp(xi,2 ), exp(xi,3 ))T where xi is the
ith case in Figure 7.12a; i.e. the marginals follow a lognormal distribution.
The plot does not resemble the identity line, correctly suggesting that the
distribution of the data is not MVN; however, the correlation of the plotted
points is rather high. Figure 7.12d is the weighted DD plot where cases with RDi ≥ √(χ²_{3,.975}) ≈ 3.06 have been removed. Notice that the correlation of the
plotted points is not close to one and that the best fitting line in Figure 7.12d
may not pass through the origin. These results suggest that the distribution
of x is not EC.

[Figure with two panels (each MD versus RD): a) DD Plot for Buxton Data, with cases 61-65 having very large RD; b) DD Plot with Outliers Removed]
Fig. 7.13 DD Plots for the Buxton Data

Example 7.6. Buxton (1920, pp. 232-5) gave 20 measurements of 88 men. We will examine whether the multivariate normal distribution is a reasonable
model for the measurements head length, nasal height, bigonal breadth, and
cephalic index where one case has been deleted due to missing values. Figure
7.13a shows the DD plot. Five head lengths were recorded to be around 5
feet and are massive outliers. Figure 7.13b is the DD plot computed after
deleting these points and suggests that the multivariate normal distribution
is reasonable. (The recomputation of the DD plot means that the plot is not
a weighted DD plot which would simply omit the outliers and then rescale
the vertical axis.)

library(MASS)
x <- cbind(buxy,buxx)
ddplot(x,type=3) #Figure 7.13a), right click Stop

zx <- x[-c(61:65),]
ddplot(zx,type=3) #Figure 7.13b), right click Stop

7.3.1 MLD Outlier Detection if p > n

Most outlier detection methods work best if n ≥ 20p, but often data sets have
p > n, and outliers are a major problem. One of the simplest outlier detection
methods uses the Euclidean distances of the xi from the coordinatewise me-
dian Di = Di (MED(W ), I p ). Concentration type steps compute the weighted
median MEDj : the coordinatewise median computed from the “half set” of
cases xi with Di2 ≤ MED(Di2 (MEDj−1 , I p )) where MED0 = MED(W ).
We often used j = 0 (no concentration type steps) or j = 9. Let Di =
Di (MEDj , I p ). Let Wi = 1 if Di ≤ MED(D1 , ..., Dn) + kMAD(D1 , ..., Dn)
where k ≥ 0 and k = 5 is the default choice. Let Wi = 0, otherwise. Using
k ≥ 0 insures that at least half of the cases get weight 1. This weighting
corresponds to the weighting that would be used in a one sided metrically
trimmed mean (Huber type skipped mean) of the distances.
Application 7.2. This outlier resistant regression method uses terms from
the following definition. Let the ith case w i = (Yi , xTi )T where the continuous
predictors from xi are denoted by ui for i = 1, ..., n. Apply the covmb2
estimator to the ui , and then run the regression method on the m cases w i
corresponding to the covmb2 set B indices i1 , ...im, where m ≥ n/2.

Definition 7.21. Let the covmb2 set B of at least n/2 cases correspond
to the cases with weight Wi = 1. The cases not in set B get weight Wi = 0.
Then the covmb2 estimator (T, C) is the sample mean and sample covariance
matrix applied to the cases in set B. Hence
T = ( Σ_{i=1}^n Wi xi ) / ( Σ_{i=1}^n Wi )   and   C = ( Σ_{i=1}^n Wi (xi − T)(xi − T)^T ) / ( Σ_{i=1}^n Wi − 1 ).
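A minimal base R sketch of this weighting follows, using j = 0 concentration type steps and the default k = 5 (illustrative only, not the linmodpack covmb2 code).

# Sketch of the covmb2 estimator of Definition 7.21 with j = 0 steps.
covmb2.sketch <- function(x, k = 5) {
  x <- as.matrix(x)
  med0 <- apply(x, 2, median)                      # coordinatewise median MED(W)
  d <- sqrt(rowSums(sweep(x, 2, med0)^2))          # Euclidean distances D_i
  w <- d <= median(d) + k * mad(d, constant = 1)   # weights W_i in {0, 1}
  B <- x[w, , drop = FALSE]                        # the covmb2 set B
  list(center = colMeans(B), cov = cov(B), indx = which(w))
}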

Example 7.7. Let the clean data (nonoutliers) be i 1 for i = 1, 2, 3, 4, and 5 while the outliers are j 1 for j = 16, 17, 18, and 19. Here n = 9 and 1 is p × 1.
Making a plot of the data for p = 2 may be useful. Then the coordinatewise
median MED0 = MED(W ) = 5 1. The median Euclidean distance of the data
is the Euclidean distance of 5 1 from 1 1 = the Euclidean distance of 5 1 from
9 1. The median ball is the hypersphere centered at the coordinatewise median
with radius r = MED(Di (MED(W ), I p ), i = 1, ..., n) that tends to contain
(n + 1)/2 of the cases if n is odd. Hence the clean data are in the median ball
and the outliers are outside of the median ball. The coordinatewise median
of the cases with the 5 smallest distances is the coordinatewise median of
the clean data: MED1 = 3 1. Then the median Euclidean distance of the
data from MED1 is the Euclidean distance of 3 1 from 1 1 = the Euclidean
distance of 3 1 from 5 1. Again the clean cases are the cases with the 5 smallest
Euclidean distances. Hence MEDj = 3 1 for j ≥ 1. For j ≥ 1, if xi = j 1, then
Di = |j − 3|√p. Thus D(1) = 0, D(2) = D(3) = √p, and D(4) = D(5) = 2√p.
Hence MED(D1, ..., Dn) = D(5) = 2√p = MAD(D1, ..., Dn) since the median
distance of the Di from D(5) is 2√p − 0 = 2√p. Note that the 5 smallest
absolute distances |Di − D(5)| are 0, 0, √p, √p, and 2√p. Hence Wi = 1 if
Di ≤ 2√p + 10√p = 12√p. The clean data get weight 1 while the outliers
get weight 0 since the smallest distance Di for the outliers is the Euclidean
distance of 3 1 from 16 1 with Di = ‖16 1 − 3 1‖ = 13√p. Hence the
covmb2 estimator (T, C) is the sample mean and sample covariance matrix
of the clean data. Note that the distance for the outliers to get zero weight is proportional to the square root of the dimension p.

The covmb2 estimator can also be used for n > p. The covmb2 estimator
attempts to give a robust dispersion estimator that reduces the bias by using
a big ball about MEDj instead of a ball that contains half of the cases. The
linmodpack function getB gives the set B of cases that got weight 1 along
with the index indx of the case numbers that got weight 1. The function
ddplot5 plots the Euclidean distances from the coordinatewise median ver-
sus the Euclidean distances from the covmb2 location estimator. Typically
the plotted points in this DD plot cluster about the identity line, and outliers
appear in the upper right corner of the plot with a gap between the bulk of
the data and the outliers. An alternative for outlier detection is to replace C
by C d = diag(σ̂11 , ..., σ̂pp). For example, use σ̂ii = C ii . See Ro et al. (2015)
and Tarr et al. (2016) for references.
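As a sketch of Application 7.2 (illustrative only; covmb2.sketch is the base R function given above, u is the matrix of continuous predictors, and y is the response), OLS can be run on the cases in the covmb2 set B:

# Fit OLS to the cases in the covmb2 set B of the continuous predictors.
rob.ols.sketch <- function(y, u) {
  B <- covmb2.sketch(u)$indx          # case numbers with weight 1
  lm(y[B] ~ u[B, , drop = FALSE])     # regression on the cases in set B
}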

Example 7.8. For the Buxton (1920) data with multiple linear regression,
height was the response variable while an intercept, head length, nasal height,
bigonal breadth, and cephalic index were used as predictors in the multiple
linear regression model. Observation 9 was deleted since it had missing values.
Five individuals, cases 61–65, were reported to be about 0.75 inches tall with
head lengths well over five feet! See Problem 7.11 to reproduce the following
plots.


Fig. 7.14 Response plot for lasso and lasso applied to the covmb2 set B.

Figure 7.14a) shows the response plot for lasso. The identity line passes
right through the outliers which are obvious because of the large gap. Figure
7.14b) shows the response plot from lasso for the cases in the covmb2 set
B applied to the predictors, and the set B included all of the clean cases
and omitted the 5 outliers. The response plot was made for all of the data,
including the outliers. Prediction interval (PI) bands are also included for
both plots. Both plots are useful for outlier detection, but the method for
plot 7.14b) is better for data analysis: impossible outliers should be deleted
or given 0 weight, we do not want to predict that some people are about 0.75
inches tall, and we do want to predict that the people were about 1.6 to 1.8
meters tall. Figure 7.15 shows the DD plot made using ddplot5. The five
outliers are in the upper right corner.

Also see Problem 7.12 b) for the Gladstone (1905) data where the covmb2
set B deleted the 8 cases with the largest Di , including 5 outliers and 3 clean
cases.


Fig. 7.15 DD plot with ddplot5.

7.4 Outlier Detection for the MLR Model

For multiple linear regression, the OLS response and residual plots are very
useful for detecting outliers. The DD plot of the continuous predictors is also
useful. Use the linmodpack functions MLRplot and ddplot4. Response and
residual plots from outlier resistant methods are also useful. See Figure 7.14.
Huber and Ronchetti (2009, p. 154) noted that efficient methods for iden-
tifying leverage groups are needed. Such groups are often difficult to detect
with regression diagnostics and residuals, but often have outlying fitted val-
ues and responses that can be detected with response and residual plots. The
following rules of thumb are useful for finding influential cases and outliers.
Look for points with large absolute residuals and for points far away from
Y . Also look for gaps separating the data into clusters. The OLS fit often
passes through a cluster of outliers, causing a large gap between a cluster
corresponding to the bulk of the data and the cluster of outliers. When such
a gap appears, it is possible that the smaller cluster corresponds to good
leverage points: the cases follow the same model as the bulk of the data. To
determine whether small clusters are outliers or good leverage points, give
zero weight to the clusters, and fit an MLR estimator such as OLS to the
bulk of the data. Denote the weighted estimator by β̂w . Then plot Ŷw versus
Y using the entire data set. If the identity line passes through the cluster,
then the cases in the cluster may be good leverage points, otherwise they

may be outliers. The trimmed views estimator of Section 7.5 is also useful.
Dragging the plots, so that they are roughly square, can be useful.

Definition 7.22. Suppose that some analysis to detect outliers is performed. Masking occurs if the analysis suggests that one or more outliers are
in fact good cases. Swamping occurs if the analysis suggests that one or more
good cases are outliers. Suppose that a subset of h cases is selected from the
n cases making up the data set. Then the subset is clean if none of the h
cases are outliers.

Influence diagnostics such as Cook’s distances CDi from Cook (1977) and
the weighted Cook’s distances W CDi from Peña (2005) are sometimes useful.
Although an index plot of Cook’s distance CDi may be useful for flagging
influential cases, the index plot provides no direct way of judging the model
against the data. As a remedy, cases in the response and residual plots with
CDi > min(0.5, 2p/n) are highlighted with open squares, and cases with
|W CDi − median(WCDi )| > 4.5MAD(WCDi) are highlighted with crosses,
where the median absolute deviation MAD(wi ) = median(|wi −median(wi )|).

Example 7.9. Figure 7.16 shows the response plot and residual plot for
the Buxton (1920) data. Notice that the OLS fit passes through the outliers,
but the response plot is resistant to Y –outliers since Y is on the vertical
axis. Also notice that although the outlying cluster is far from Y , only two of
the outliers had large Cook’s distance and only one case had a large W CDi .
Hence masking occurred for the Cook’s distances, the W CDi , and for the
OLS residuals, but not for the OLS fitted values. Figure 7.16 was made with
the following R commands.
source("G:/linmodpack.txt"); source("G:/linmoddata.txt")
mlrplot4(buxx,buxy) #right click Stop twice
High leverage outliers are a particular challenge to conventional numerical
MLR diagnostics such as Cook’s distance, but can often be visualized using
the response and residual plots. (Using the trimmed views of Section 7.5
is also effective for detecting outliers and other departures from the MLR
model.)

Example 7.10. Hawkins et al. (1984) gave a well known artificial data
set where the first 10 cases are outliers while cases 11-14 are good leverage
points. Figure 7.17 shows the residual and response plots based on the OLS
estimator. The highlighted cases have Cook’s distance > min(0.5, 2p/n), and
the identity line is shown in the response plot. Since the good cases 11-14
have the largest Cook’s distances and absolute OLS residuals, swamping has
occurred. (Masking has also occurred since the outliers have small Cook’s
distances, and some of the outliers have smaller OLS residuals than clean
cases.) To determine whether both clusters are outliers or if one cluster con-
sists of good leverage points, cases in both clusters could be given weight

Fig. 7.16 Plots for Buxton Data


Fig. 7.17 Plots for HBK Data



zero and the resulting response plot created. (Alternatively, response plots
based on the tvreg estimator of Section 7.5 could be made where the cases
with weight one are highlighted. For high levels of trimming, the identity line
often passes through the good leverage points.)
The above example is typical of many “benchmark” outlier data sets for
MLR. In these data sets traditional OLS diagnostics such as Cook’s distance
and the residuals often fail to detect the outliers, but the combination of the
response plot and residual plot is usually able to detect the outliers. The CDi
and W CDi are the most effective when there is a single cluster about the
identity line. If there is a second cluster of outliers or good leverage points
or if there is nonconstant variance, then these numerical diagnostics tend to
fail.

7.5 Resistant Multiple Linear Regression

Consider the multiple linear regression model, written in matrix form as


Y = Xβ + e. The OLS response and residual plots are very useful for de-
tecting outliers and checking the model. Resistant estimators are useful for
detecting certain types of outliers. Some good resistant regression estimators
are rmreg2 from Section 8.6, the hbreg estimator from Section 7.7, and the
Olive (2005) MBA and trimmed views estimators described below. Also apply
a multiple linear regression method such as OLS or lasso to the cases cor-
responding to the RFCH, RMVN, or covmb2 set applied to the continuous
predictors. See Sections 7.2.6 and 7.3.1.
The L1 estimator or least absolute deviations estimator is a competitor for OLS. The L1 estimator β̂_L1 minimizes the criterion Q_L1(b) = Σ_{i=1}^n |ri(b)| where ri(b) = Yi − x_i^T b is the ith residual corresponding to b. Response and
residual plots from these two estimators are useful for detecting outliers.
Resistant estimators are often created by computing several trial fits bi
that are estimators of β. Then a criterion is used to select the trial fit to be
used in the resistant estimator. Suppose c ≈ n/2. The LMS(c) criterion is QLMS(b) = r²_(c)(b) where r²_(1) ≤ · · · ≤ r²_(n) are the ordered squared residuals, the LTS(c) criterion is QLTS(b) = Σ_{i=1}^c r²_(i)(b), and the LTA(c) criterion is QLTA(b) = Σ_{i=1}^c |r(b)|_(i) where |r(b)|_(i) is the ith ordered absolute
residual. Three impractical high breakdown robust estimators are the Ham-
pel (1975) least median of squares (LMS) estimator, the Rousseeuw (1984)
least trimmed sum of squares (LTS) estimator, and the Hössjer (1991) least
trimmed sum of absolute deviations (LTA) estimator. Also see Hawkins and
Olive (1999ab). These estimators correspond to the β̂ L ∈ Rp that minimizes
the corresponding criterion. LMS, LTA, and LTS have O(n^p) or O(n^{p+1}) complexity. See Bernholt (2005), Hawkins and Olive (1999b), Klouda (2015), and Mount et al. (2014). Estimators with O(n^4) or higher complexity take too long to compute. LTS and LTA are √n consistent while LMS has the lower n^{1/3} rate. See Kim and Pollard (1990), Čížek (2006, 2008), and Mašíček
(2004). If c = n, the LTS and LTA criteria are the OLS and L1 criteria. See
Olive (2008, 2017b: ch. 14) for more on these estimators.
A good resistant estimator is the Olive (2005) median ball algorithm (MBA
or mbareg). The Euclidean distance of the ith vector of predictors xi from
the jth vector of predictors xj is
Di(xj) = Di(xj, I_p) = √[(xi − xj)^T (xi − xj)].

For a fixed xj consider the ordered distances D(1) (xj ), ..., D(n)(xj ). Next,
let β̂_j(α) denote the OLS fit to the min(p + 3 + ⌊αn/100⌋, n) cases with the smallest distances where the approximate percentage of cases used is α ∈ {1, 2.5, 5, 10, 20, 33, 50}. (Here ⌊x⌋ is the greatest integer function, so ⌊7.7⌋ = 7. The extra p + 3 cases are added so that OLS can be computed for
small n and α.) This yields seven OLS fits corresponding to the cases with
predictors closest to xj . A fixed number of K cases are selected at random
without replacement to use as the xj . Hence 7K OLS fits are generated. We
use K = 7 as the default. A robust criterion Q is used to evaluate the 7K
fits and the OLS fit to all of the data. Hence 7K + 1 OLS fits are generated
and the MBA estimator is the fit that minimizes the criterion. The median
squared residual is a good choice for Q.
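A bare-bones R sketch of an MBA-type algorithm is given below. It is a toy version rather than the linmodpack mbareg code, using lsfit for the OLS fits, the seven coverages listed above, and the median squared residual as the criterion Q.

mbasketch <- function(x, y, K = 7) {
  x <- as.matrix(x); n <- nrow(x); p <- ncol(x)
  covs <- c(1, 2.5, 5, 10, 20, 33, 50)
  crit <- function(b) median((y - cbind(1, x) %*% b)^2)  #median squared residual
  bestb <- coef(lsfit(x, y))                 #OLS fit to all of the data
  bestQ <- crit(bestb)
  centers <- sample(n, K)                    #K randomly selected centers x_j
  for (j in centers) {
    d <- sqrt(rowSums(sweep(x, 2, x[j, ])^2))   #distances from the jth center
    for (a in covs) {
      m <- min(p + 3 + floor(a * n / 100), n)
      sub <- order(d)[1:m]
      b <- coef(lsfit(x[sub, , drop = FALSE], y[sub]))
      if (crit(b) < bestQ) { bestQ <- crit(b); bestb <- b }
    }
  }
  list(coef = bestb, crit = bestQ)           #the fit minimizing the criterion
}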
Three ideas motivate this estimator. First, x-outliers, which are outliers in
the predictor space, tend to be much more destructive than Y -outliers which
are outliers in the response variable. Suppose that the proportion of outliers
is γ and that γ < 0.5. We would like the algorithm to have at least one
“center” xj that is not an outlier. The probability of drawing a center that is not an outlier is approximately 1 − γ^K > 0.99 for K ≥ 7 and this result is free of p. Secondly, by using the different percentages of coverages, for many data sets there will be a center and a coverage that contains no outliers. Third, by Theorem 1.21, the MBA estimator is a √n consistent estimator of the same parameter vector β estimated by OLS under mild conditions.
Ellipsoidal trimming can be used to create resistant multiple linear regres-
sion (MLR) estimators. To perform ellipsoidal trimming, an estimator (T, C)
is computed and used to create the squared Mahalanobis distances Di2 for
each vector of observed predictors xi . If the ordered distance D(j) is unique,
then j of the xi ’s are in the ellipsoid

{x : (x − T)^T C^{-1} (x − T) ≤ D²_(j)}.        (7.21)

The ith case (Yi , xTi )T is trimmed if Di > D(j) . Then an estimator of β is
computed from the remaining cases. For example, if j ≈ 0.9n, then about
10% of the cases are trimmed, and OLS or L1 could be used on the cases
that remain. Ellipsoidal trimming differs from using the RFCH, RMVN, or

covmb2 set since these sets use a random amount of trimming. (The ellip-
soidal trimming technique can also be used for other regression models, and
the theory of the regression method tends to apply to the method applied to
the cleaned data that was not trimmed since the response variables were not
used to select the cases. See Chapter 10.)
Use ellipsoidal trimming on the RFCH, RMVN, or covmb2 set applied to
the continuous predictors to get a fit β̂C . Then make a response and residual
plot using all of the data, not just the cleaned data that was not trimmed.

The resistant trimmed views estimator combines ellipsoidal trimming and


the response plot. First compute (T, C) on the xi , perhaps using the RMVN
estimator. Trim the M % of the cases with the largest Mahalanobis distances,
and then compute the MLR estimator β̂M from the remaining cases. Use
M = 0, 10, 20, 30, 40, 50, 60, 70, 80, and 90 to generate ten response plots of the fitted values β̂_M^T xi versus Yi using all n cases. (Fewer plots are used
for small data sets if β̂ M can not be computed for large M .) These plots are
called “trimmed views.”
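A rough version of the trimmed views computation is sketched below. To keep the sketch self-contained it uses the classical estimator (colMeans, cov) in place of the RMVN estimator, and it is not the linmodpack trimmed views code.

tvsketch <- function(x, y, trims = seq(0, 90, by = 10)) {
  x <- as.matrix(x)
  d2 <- mahalanobis(x, colMeans(x), cov(x))   #squared distances from (T, C)
  op <- par(mfrow = c(2, 5)); on.exit(par(op))
  for (M in trims) {
    keep <- d2 <= quantile(d2, 1 - M/100)     #trim the M% largest distances
    b <- coef(lsfit(x[keep, , drop = FALSE], y[keep]))
    fit <- cbind(1, x) %*% b                  #fitted values for all n cases
    plot(fit, y, main = paste0(M, "% trimming")); abline(0, 1)
  }
}
#e.g. tvsketch(buxx, buxy)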

Definition 7.23. The trimmed views (TV) estimator β̂T ,n corresponds


to the trimmed view where the bulk of the plotted points follow the identity
line with smallest variance function, ignoring any outliers.

Example 7.11. For the Buxton (1920) data, height was the response
variable while an intercept, head length, nasal height, bigonal breadth, and
cephalic index were used as predictors in the multiple linear regression model.
Observation 9 was deleted since it had missing values. Five individuals, cases
61–65, were reported to be about 0.75 inches tall with head lengths well
over five feet! OLS was used on the cases remaining after trimming, and
Figure 7.18 shows four trimmed views corresponding to 90%, 70%, 40%,
and 0% trimming. The OLS TV estimator used 70% trimming since this
trimmed view was best. Since the vertical distance from a plotted point to the
identity line is equal to the case’s residual, the outliers had massive residuals
for 90%, 70%, and 40% trimming. Notice that the OLS trimmed view with
0% trimming “passed through the outliers” since the cluster of outliers is
scattered about the identity line.

The TV estimator β̂T ,n has good statistical properties if an estimator with


good statistical properties is applied to the cases (X M,n , Y M,n ) that remain
after trimming. Candidates include OLS, L1 , Huber’s M–estimator, Mallows’
GM–estimator, or the Wilcoxon rank estimator. See Rousseeuw and Leroy
(1987, pp. 12-13, 150). The basic idea is that if an estimator with OP (n−1/2 )
convergence rate is applied to a set of nM ∝ n cases, then the resulting
estimator β̂M,n also has OP (n−1/2 ) rate provided that the response Y was
not used to select the nM cases in the set. If kβ̂M,n − βk = OP (n−1/2 ) for
M = 0, ..., 90 then kβ̂T ,n − βk = OP (n−1/2 ) by Theorem 1.21.


Fig. 7.18 4 Trimmed Views for the Buxton Data

Let X n = X 0,n denote the full design matrix. Often when proving asymp-
totic normality of an MLR estimator β̂ 0,n , it is assumed that

X_n^T X_n / n → W^{-1}.

If β̂_{0,n} has OP(n^{-1/2}) rate and if for big enough n all of the diagonal elements of (X_{M,n}^T X_{M,n}/n)^{-1} are all contained in an interval [0, B) for some B > 0, then ‖β̂_{M,n} − β‖ = OP(n^{-1/2}).
The distribution of the estimator β̂ M,n is especially simple when OLS is
used and the errors are iid N (0, σ 2 ). Then

β̂_{M,n} = (X_{M,n}^T X_{M,n})^{-1} X_{M,n}^T Y_{M,n} ∼ Np(β, σ²(X_{M,n}^T X_{M,n})^{-1})

and √n(β̂_{M,n} − β) ∼ Np(0, σ²(X_{M,n}^T X_{M,n}/n)^{-1}). This result does not imply
that β̂T ,n is asymptotically normal. See the following paragraph for the large
sample theory of a modified trimmed views estimator.

Warning: When Yi = xTi β + e, MLR estimators tend to estimate the


same slopes β2 , ..., βp, but the constant β1 tends to depend on the estimator
unless the errors are symmetric. The MBA and trimmed views estimators
do estimate the same β as OLS asymptotically, but samples may need to
be huge before the MBA and trimmed views estimates of the constant are
close to the OLS estimate of the constant. If the trimmed views estimator
is modified so that the LTS, LTA, or LMS criterion is used to select the
final estimator, then a conjecture is that the limiting distribution is similar
to that of the variable selection estimator: √n(β̂_MTV − β) →^D Σ_{i=1}^k πi wi where 0 ≤ πi ≤ 1 and Σ_{i=1}^k πi = 1. The index i corresponds to the fits
considered by the modified trimmed views estimator with k = 10. For the
MBA estimator and the modified trimmed views estimator, the prediction
region method, described in Section 4.5, may be useful for testing hypotheses.
Large sample sizes may be needed if the error distribution is not symmetric
since the constant β̂ 1 needs large samples. See Olive (2017b, p. 444) for
an explanation for why large sample sizes may be needed to estimate the
constant.

The conditions under which the rmreg2 estimator of Section 8.6 has been shown to be √n consistent are quite strong, but it seems likely that the estimator is a √n consistent estimator of β under mild conditions where the
parameter vector β is not, in general, the parameter vector estimated by OLS.
For MLR, the linmodpack function rmregboot bootstraps the rmreg2 es-
timator, and the function rmregbootsim can be used to simulate rmreg2.
Both functions use the residual bootstrap where the residuals come from
OLS. See the R code below.
out<-rmregboot(belx,bely)
plot(out$betas)
ddplot4(out$betas) #right click Stop

out<-rmregboot(cbrainx,cbrainy)
ddplot4(out$betas) #right click Stop
Often practical “robust estimators” generate a sequence of K trial fits
called attractors: b1 , ..., bK . Then some criterion is evaluated and the attractor
bA that minimizes the criterion is used in the final estimator.

Definition 7.24. For MLR, an elemental set J is a set of p cases drawn


with replacement from the data set of n cases. The elemental fit is the OLS
estimator β̂_{Ji} = (X_{Ji}^T X_{Ji})^{-1} X_{Ji}^T Y_{Ji} = X_{Ji}^{-1} Y_{Ji} applied to the cases corre-
sponding to the elemental set provided that the inverse of X Ji exists. In a
concentration algorithm, let b0,j be the jth start, not necessarily elemental,
and compute all n residuals ri (b0,j ) = Yi − xTi b0,j . At the next iteration, the
OLS estimator b1,j is computed from the cn ≈ n/2 cases corresponding to
the smallest squared residuals ri2 (b0,j ). This iteration can be continued for


Fig. 7.19 The Highlighted Points are More Concentrated about the Attractor

k steps resulting in the sequence of estimators b0,j , b1,j , ..., bk,j . Then bk,j is
the jth attractor for j = 1, ..., K. Then the attractor bA that minimizes the
LTS criterion is used in the final estimator. Using k = 10 concentration steps
often works well, and the basic resampling algorithm is a special case with
k = 0, i.e., the attractors are the starts. Such an algorithm is called a CLTS
concentration algorithm or CLTS.

A CLTA concentration algorithm would replace the OLS estimator by


the L1 estimator, and the smallest cn squared residuals by the smallest cn
absolute residuals. Many other variants are possible, but obtaining theoretical
results may be difficult.
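To make the iteration concrete, here is a minimal CLTS-type sketch in R, a toy version rather than the FLTS = ltsreg code: elemental starts, k concentration steps of OLS on the cn cases with the smallest squared residuals, and the LTS criterion to pick the attractor.

cltssketch <- function(x, y, K = 500, k = 10) {
  x <- as.matrix(x); n <- nrow(x); p <- ncol(x) + 1   #p counts the intercept
  cn <- ceiling(n/2)
  qlts <- function(b) sum(sort((y - cbind(1, x) %*% b)^2)[1:cn])
  best <- coef(lsfit(x, y)); bestQ <- qlts(best)
  for (j in 1:K) {
    J <- sample(n, p)                      #an elemental set of p distinct cases
    b <- coef(lsfit(x[J, , drop = FALSE], y[J]))
    if (any(is.na(b))) next                #skip singular elemental fits
    for (i in 1:k) {                       #concentration steps
      keep <- order((y - cbind(1, x) %*% b)^2)[1:cn]
      b <- coef(lsfit(x[keep, , drop = FALSE], y[keep]))
    }
    if (qlts(b) < bestQ) { bestQ <- qlts(b); best <- b }
  }
  list(coef = best, crit = bestQ)
}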

Example 7.12. As an illustration of the CLTA concentration algorithm,


consider the animal data from Rousseeuw and Leroy (1987, p. 57). The re-
sponse Y is the log brain weight and the predictor x is the log body weight
for 25 mammals and 3 dinosaurs (outliers with the highest body weight).
Suppose that the first elemental start uses cases 20 and 14, corresponding to
mouse and man. Then the start b_{s,1} = b_{0,1} = (2.952, 1.025)^T and the sum of the c = 14 smallest absolute residuals Σ_{i=1}^{14} |r|_(i)(b_{0,1}) = 12.101. Figure 7.19a
shows the scatterplot of x and y. The start is also shown and the 14 cases
corresponding to the smallest absolute residuals are highlighted. The L1 fit to


Fig. 7.20 Starts and Attractors for the Animal Data

these c highlighted cases is b_{1,1} = (2.076, 0.979)^T and Σ_{i=1}^{14} |r|_(i)(b_{1,1}) = 6.990. The iteration consists of finding the cases corresponding to the c smallest absolute residuals, obtaining the corresponding L1 fit, and repeating. The attractor b_{a,1} = b_{7,1} = (1.741, 0.821)^T and the LTA(c) criterion evaluated at the attractor is Σ_{i=1}^{14} |r|_(i)(b_{a,1}) = 2.172. Figure 7.19b shows the attractor
and that the c highlighted cases corresponding to the smallest absolute resid-
uals are much more concentrated than those in Figure 7.19a. Figure 7.20a
shows 5 randomly selected starts while Figure 7.20b shows the corresponding
attractors. Notice that the elemental starts have more variability than the
attractors, but if the start passes through an outlier, so does the attractor.

Remark 7.6. Consider drawing K elemental sets J1, ..., JK with replacement to use as starts. For multivariate location and dispersion, use the attrac-
tor with the smallest MCD criterion to get the final estimator. For multiple
linear regression, use the attractor with the smallest LMS, LTA, or LTS cri-
terion to get the final estimator. For 500 ≤ K ≤ 3000 and p not much larger
than 5, the elemental set algorithm is very good for detecting certain “outlier
configurations,” including i) a mixture of two regression hyperplanes that
cross in the center of the data cloud for MLR (not an outlier configuration
since outliers are far from the bulk of the data) and ii) a cluster of outliers
that can often be placed close enough to the bulk of the data so that an MB,
RFCH, or RMVN DD plot can not detect the outliers. However, the outlier
resistance of elemental algorithms decreases rapidly as p increases.

Suppose the data set has n cases where d are outliers and n − d are “clean” (not outliers). Then the outlier proportion γ = d/n. Suppose that K elemental sets are chosen with replacement and that it is desired to find K such that the probability P(at least one of the elemental sets is clean) ≡ P1 ≈ 1 − α where α = 0.05 is a common choice. Then P1 = 1 − P(none of the K elemental sets is clean) ≈ 1 − [1 − (1 − γ)^p]^K by independence. Hence α ≈ [1 − (1 − γ)^p]^K or

K ≈ log(α)/log([1 − (1 − γ)^p]) ≈ log(α)/[−(1 − γ)^p]        (7.22)

using the approximation log(1 − x) ≈ −x for small x. Since log(0.05) ≈ −3, if α = 0.05, then K ≈ 3/(1 − γ)^p. Frequently a clean subset is wanted even if the contamination proportion γ ≈ 0.5. Then for a 95% chance of obtaining at least one clean elemental set, K ≈ 3(2^p) elemental sets need to be drawn. If
the start passes through an outlier, so does the attractor. For concentration
algorithms for multivariate location and dispersion, if the start passes through
a cluster of outliers, sometimes the attractor would be clean. See Figure 7.5–
7.11.

Table 7.5 Largest p for a 95% Chance of a Clean Subsample.


γ \ K  500  3000  10000  10^5  10^6  10^7  10^8  10^9
0.01 509 687 807 1036 1265 1494 1723 1952
0.05 99 134 158 203 247 292 337 382
0.10 48 65 76 98 120 142 164 186
0.15 31 42 49 64 78 92 106 120
0.20 22 30 36 46 56 67 77 87
0.25 17 24 28 36 44 52 60 68
0.30 14 19 22 29 35 42 48 55
0.35 11 16 18 24 29 34 40 45
0.40 10 13 15 20 24 29 33 38
0.45 8 11 13 17 21 25 28 32
0.50 7 9 11 15 18 21 24 28

Notice that the number of subsets K needed to obtain a clean elemental set
with high probability is an exponential function of the number of predictors
p but is free of n. Hawkins and Olive (2002) showed that if K is fixed and
free of n, then the resulting elemental or concentration algorithm (that uses k
concentration steps), is inconsistent and zero breakdown. See Theorem 7.21.
Nevertheless, many practical estimators tend to use a value of K that is free
of both n and p (e.g. K = 500 or K = 3000). Such algorithms include ALMS
= FLMS = lmsreg and ALTS = FLTS = ltsreg. The “A” denotes that
an algorithm was used. The “F” means that a fixed number of trial fits (K

elemental fits) was used and the criterion (LMS or LTS) was used to select
the trial fit used in the final estimator.
To examine the outlier resistance of such inconsistent zero breakdown es-
timators, fix both K and the contamination proportion γ and then find the
largest number of predictors p that can be in the model such that the proba-
bility of finding at least one clean elemental set is high. Given K and γ, P(at least one of K subsamples is clean) = 0.95 ≈ 1 − [1 − (1 − γ)^p]^K. Thus the largest value of p satisfies 3/(1 − γ)^p ≈ K, or

p ≈ ⌊log(3/K)/log(1 − γ)⌋        (7.23)

if the sample size n is very large. Again ⌊x⌋ is the greatest integer function: ⌊7.7⌋ = 7.
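Equations (7.22) and (7.23) are simple to evaluate. For example, the R lines below give K ≈ 3(2^p) for γ = 0.5 and reproduce the γ = 0.5 row of Table 7.5.

gam <- 0.5; p <- 10
3/(1 - gam)^p                         #K needed: 3 * 2^10 = 3072
K <- c(500, 3000, 10000, 1e5, 1e6, 1e7, 1e8, 1e9)
floor(log(3/K)/log(1 - gam))          #largest p: 7 9 11 15 18 21 24 28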
Table 7.5 shows the largest value of p such that there is a 95% chance
that at least one of K subsamples is clean using the approximation given by
Equation (7.23). Hence if p = 28, even with one billion subsamples, there
is a 5% chance that none of the subsamples will be clean if the contami-
nation proportion γ = 0.5. Since clean elemental fits have great variability,
an algorithm needs to produce many clean fits in order for the best fit to
be good. When contamination is present, all K elemental sets could contain
outliers. Hence basic resampling and concentration algorithms that only use
K elemental starts are doomed to fail if γ and p are large.
The outlier resistance of elemental algorithms that use K elemental sets
decreases rapidly as p increases. However, for p < 10, such elemental algo-
rithms are often useful for outlier detection. They can perform better than
MBA, trimmed views, and rmreg2 if p is small and the outliers are close
to the bulk of the data or if p is small and there is a mixture distribution:
the bulk of the data follows one MLR model, but “outliers” and some of the
clean data are fit well by another MLR model. For example, if there is one
nontrivial predictor, suppose the plot of x versus Y looks like the letter X.
Such a mixture distribution is not really an outlier configuration since out-
liers lie far from the bulk of the data. All practical estimators have outlier
configurations where they perform poorly. If p is small, elemental algorithms
tend to have trouble when there is a weak regression relationship for the bulk
of the data and a cluster of outliers that are not good leverage points (do
not fall near the hyperplane followed by the bulk of the data). The Buxton
(1920) data set is an example.

Theorem 7.15. Let h = p be the number of randomly selected cases in


an elemental set, and let γo be the highest percentage of massive outliers that
a resampling algorithm can detect reliably. If n is large, then

γo ≈ min( (n − c)/n , 1 − [1 − (0.2)^{1/K}]^{1/h} ) 100%.        (7.24)

Proof. As in Remark 7.1, if the contamination proportion γ is fixed, then


the probability of obtaining at least one clean subset of size h with high
probability (say 1 − α = 0.8) is given by 0.8 = 1 − [1 − (1 − γ)h ]K . Fix the
number of starts K and solve this equation for γ. 

The value of γo depends on c ≥ n/2 and h. To maximize γo , take c ≈ n/2


and h = p. For example, with K = 500 starts, n > 100, and h = p ≤ 20 the
resampling algorithm should be able to detect up to 24% outliers provided
every clean start is able to at least partially separate inliers (clean cases)
from outliers. However, if h = p = 50, this proportion drops to 11%.

Definition 7.25. Let b1 , ..., bJ be J estimators of β. Assume that J ≥ 2


and that OLS is included. A fit-fit (FF) plot is a scatterplot matrix of the
fitted values Ŷ(b1), ..., Ŷ(bJ). Often Y is also included in the top or bottom row of the FF plot to see the response plots. A residual-residual (RR) plot is a scatterplot matrix of the residuals r(b1), ..., r(bJ). Often Ŷ is also included
in the top or bottom row of the RR plot to see the residual plots.

If the multiple linear regression model holds, if the predictors are bounded,
and if all J regression estimators are consistent estimators of β, then the
subplots in the FF and RR plots should be linear with a correlation tending
to one as the sample size n increases. To prove this claim, let the ith residual
from the jth fit bj be ri (bj ) = Yi −xTi bj where (Yi , xTi ) is the ith observation.
Similarly, let the ith fitted value from the jth fit be Ybi (bj ) = xTi bj . Then

kri (b1 ) − ri (b2 )k = kYbi (b1 ) − Ybi (b2 )k = kxTi (b1 − b2 )k

≤ kxi k (kb1 − βk + kb2 − βk). (7.25)


The FF plot is a powerful way for comparing fits. The commonly suggested
alternative is to look at a table of the estimated coefficients, but coefficients
can differ greatly while yielding similar fits if some of the predictors are highly
correlated or if several of the predictors are independent of the response. See
Olive (2017b, pp. 408-412).
Table 7.6 compares the TV, MBA (for MLR), lmsreg, ltsreg, L1 , and
OLS estimators on 7 data sets available from the text’s website. The column
headers give the file name while the remaining rows of the table give the
sample size n, the number of predictors p, the amount of trimming M used
by the TV estimator, the correlation of the residuals from the TV estimator
with the corresponding alternative estimator, and the cases that were out-
liers. If the correlation was greater than 0.9, then the method was effective in detecting the outliers; otherwise the method failed. Sometimes the
trimming percentage M for the TV estimator was picked after fitting the
bulk of the data in order to find the good leverage points and outliers. Each
model included a constant.

Table 7.6 Summaries for Seven Data Sets, the Correlations of the Residuals from
TV(M) and the Alternative Method are Given in the 1st 5 Rows

Method Buxton Gladstone glado hbk major nasty wood


MBA 0.997 1.0 0.455 0.960 1.0 -0.004 0.9997
LMSREG -0.114 0.671 0.938 0.977 0.981 0.9999 0.9995
LTSREG -0.048 0.973 0.468 0.272 0.941 0.028 0.214
L1 -0.016 0.983 0.459 0.316 0.979 0.007 0.178
OLS 0.011 1.0 0.459 0.780 1.0 0.009 0.227
outliers 61-65 none 115 1-10 3,44 2,6,...,30 4,6,8,19
n 87 267 267 75 112 32 20
p 5 7 7 4 6 5 6
M 70 0 30 90 0 90 20

Notice that the TV, MBA, and OLS estimators were the same for the
Gladstone (1905) data and for the Tremearne (1911) major data which had
two small Y –outliers. For the Gladstone data, there is a cluster of infants
that are good leverage points, and we attempt to predict brain weight with
the head measurements height, length, breadth, size, and cephalic index. Orig-
inally, the variable length was incorrectly entered as 109 instead of 199 for
case 115, and the glado data contains this outlier. In 1997, lmsreg was not
able to detect the outlier while ltsreg did. Due to changes in the Splus 2000
code, lmsreg detected the outlier but ltsreg did not. These two functions
change often, not always for the better.

To end this section, we describe resistant regression with the RMVN set
U or covmb2 set B in more detail. Assume that predictor transformations
have been performed to make a p × 1 vector of predictors x, and that w
consists of k ≤ p continuous predictor variables that are linearly related. Find
the RMVN set based on the w to obtain nu cases (y ci , xci ), and then run
the regression method on the cleaned data. Often the theory of the method
applies to the cleaned data set since y was not used to pick the subset of
the data. Efficiency can be much lower since nu cases are used where n/2 ≤
nu ≤ n, and the trimmed cases tend to be the “farthest” from the center of
w. The method will have the most outlier resistance if k = p − 1 when there is a trivial predictor X1 ≡ 1.
In R, assume Y is the vector of response variables, x is the data matrix of
the predictors (often not including the trivial predictor), and w is the data
matrix of the wi . Then the following R commands can be used to get the
cleaned data set. We could use the covmb2 set B instead of the RMVN set
U computed from the w by replacing the command getu(w) by getB(w).
indx <- getu(w)$indx #often w = x
Yc <- Y[indx]
Xc <- x[indx,]
#example

indx <- getu(buxx)$indx


Yc <- buxy[indx]
Xc <- buxx[indx,]
outr <- lsfit(Xc,Yc)
MLRplot(Xc,Yc) #right click Stop twice

7.6 Robust Regression

This section will consider the breakdown of a regression estimator and then
develop the practical high breakdown hbreg estimator.

7.6.1 MLR Breakdown and Equivariance

Breakdown and equivariance properties have received considerable attention


in the literature. Several of these properties involve transformations of the
data, and are discussed below. If X and Y are the original data, then the
vector of the coefficient estimates is
β̂ = β̂(X, Y) = T(X, Y),        (7.26)

the vector of predicted values is

Ŷ = Ŷ(X, Y) = X β̂(X, Y),        (7.27)

and the vector of residuals is

r = r(X, Y) = Y − Ŷ.        (7.28)

If the design matrix X is transformed into W and the vector of dependent


variables Y is transformed into Z, then (W , Z) is the new data set.

Definition 7.26. Regression Equivariance: Let u be any p × 1 vector.


Then β̂ is regression equivariant if

β̂(X, Y + Xu) = T(X, Y + Xu) = T(X, Y) + u = β̂(X, Y) + u.        (7.29)

Hence if W = X and Z = Y + Xu, then Ẑ = Ŷ + Xu and r(W, Z) = Z − Ẑ = r(X, Y). Note that the residuals are invariant under this type of transformation, and note that if u = −β̂, then regression equivariance implies
that we should not find any linear structure if we regress the residuals on X.
Also see Problem 7.2.

Definition 7.27. Scale Equivariance: Let c be any scalar. Then β̂ is scale equivariant if

β̂(X, cY) = T(X, cY) = cT(X, Y) = cβ̂(X, Y).        (7.30)

Hence if W = X and Z = cY, then Ẑ = cŶ and r(X, cY) = c r(X, Y).


Scale equivariance implies that if the Yi ’s are stretched, then the fits and the
residuals should be stretched by the same factor.

Definition 7.28. Affine Equivariance: Let A be any p × p nonsingular matrix. Then β̂ is affine equivariant if

β̂(XA, Y) = T(XA, Y) = A^{-1} T(X, Y) = A^{-1} β̂(X, Y).        (7.31)

Hence if W = XA and Z = Y, then Ẑ = W β̂(XA, Y) = XA A^{-1} β̂(X, Y) = Ŷ, and r(XA, Y) = Z − Ẑ = Y − Ŷ = r(X, Y). Note
that both the predicted values and the residuals are invariant under an affine
transformation of the predictor variables.

Definition 7.29. Permutation Invariance: Let P be an n × n permutation matrix. Then P^T P = P P^T = I_n where I_n is an n × n identity matrix and the superscript T denotes the transpose of a matrix. Then β̂ is permutation invariant if

β̂(P X, P Y) = T(P X, P Y) = T(X, Y) = β̂(X, Y).        (7.32)

Hence if W = P X and Z = P Y, then Ẑ = P Ŷ and r(P X, P Y) = P r(X, Y). If an estimator is not permutation invariant, then swapping
rows of the n × (p + 1) augmented matrix (X, Y ) will change the estimator.
Hence the case number is important. If the estimator is permutation invariant,
then the position of the case in the data cloud is of primary importance.
Resampling algorithms are not permutation invariant because permuting the
data causes different subsamples to be drawn.

Remark 7.7. OLS has the above invariance properties, but most Statis-
tical Learning alternatives such as lasso and ridge regression do not have all
four properties. Hence Remark 5.1 is used to fit the data with Z = W η + e.
Then obtain β̂ from η̂.
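These properties are easy to verify numerically for OLS; the short sketch below (simulated data, with hypothetical choices of u, c, and A) checks regression, scale, and affine equivariance up to rounding error.

set.seed(1)
n <- 50; x <- matrix(rnorm(n * 3), n, 3)
y <- 1 + x %*% c(2, -1, 3) + rnorm(n)
X <- cbind(1, x)                              #design matrix with a constant
bhat <- function(X, y) solve(t(X) %*% X, t(X) %*% y)
b <- bhat(X, y)
u <- c(1, 2, 3, 4); cc <- 5
A <- diag(4); A[1, 2] <- 2                    #a nonsingular 4 x 4 matrix
max(abs(bhat(X, y + X %*% u) - (b + u)))      #regression equivariance, about 0
max(abs(bhat(X, cc * y) - cc * b))            #scale equivariance, about 0
max(abs(bhat(X %*% A, y) - solve(A) %*% b))   #affine equivariance, about 0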

The remainder of this subsection gives a standard definition of breakdown


and then shows that if the median absolute residual is bounded in the presence
of high contamination, then the regression estimator has a high breakdown
value. The following notation will be useful. Let W denote the data matrix
where the ith row corresponds to the ith case. For regression, W is the
n × (p + 1) matrix with ith row (xTi , Yi ). Let W nd denote the data matrix
where any dn of the cases have been replaced by arbitrarily bad contaminated

cases. Then the contamination fraction is γ ≡ γn = dn /n, and the breakdown


value of β̂ is the smallest value of γn needed to make kβ̂k arbitrarily large.

Definition 7.30. Let 1 ≤ dn ≤ n. If T (W ) is a p × 1 vector of regression


coefficients, then the breakdown value of T is
B(T, W) = min{ dn/n : sup_{W^n_d} ‖T(W^n_d)‖ = ∞ }

where the supremum is over all possible corrupted samples W nd .

Definition 7.31. High breakdown regression estimators have γn → 0.5


as n → ∞ if the clean (uncontaminated) data are in general position: any
p clean cases give a unique estimate of β. Estimators are zero breakdown if
γn → 0 and positive breakdown if γn → γ > 0 as n → ∞.

The following result greatly simplifies some breakdown proofs and shows
that a regression estimator basically breaks down if the median absolute
residual MED(|ri |) can be made arbitrarily large. The result implies that if
the breakdown value ≤ 0.5, breakdown can be computed using the median
absolute residual MED(|ri |(W nd )) instead of kT (W nd )k. Similarly β̂ is high
breakdown if the median squared residual, or the cn th largest absolute residual |ri|_(cn) or squared residual r²_(cn), stays bounded under high contamination
where cn ≈ n/2. Note that kβ̂k ≡ kβ̂(W nd )k ≤ M for some constant M that
depends on T and W but not on the outliers if the number of outliers dn is
less than the smallest number of outliers needed to cause breakdown.

Theorem 7.16. If the breakdown value ≤ 0.5, computing the break-


down value using the median absolute residual MED(|ri |(W nd )) instead of
kT (W nd )k is asymptotically equivalent to using Definition 7.30.
Proof. Consider any contaminated data set W nd with ith row (w Ti , Zi )T .
If the regression estimator T (W nd ) = β̂ satisfies kβ̂k ≤ M for some constant
M if d < dn, then the median absolute residual MED(|Zi − β̂^T wi|) is bounded by max_{i=1,...,n} |Yi − β̂^T xi| ≤ max_{i=1,...,n} [|Yi| + Σ_{j=1}^p M|xi,j|] if dn < n/2.
If the median absolute residual is bounded by M when d < dn , then kβ̂k
is bounded provided fewer than half of the cases lie on the hyperplane (and
so have absolute residual of 0), as shown next. Now suppose that kβ̂k = ∞.
Since the absolute residual is the vertical distance of the observation from the
hyperplane, the absolute residual |ri | = 0 if the ith case lies on the regression
hyperplane, but |ri | = ∞ otherwise. Hence MED(|ri |) = ∞ if fewer than
half of the cases lie on the regression hyperplane. This will occur unless the
proportion of outliers dn /n > (n/2 − q)/n → 0.5 as n → ∞ where q is the
number of “good” cases that lie on a hyperplane of lower dimension than p.

In the literature it is usually assumed that the original data are in general
position: q = p − 1. 

Suppose that the clean data are in general position and that the number of
outliers is less than the number needed to make the median absolute residual
and kβ̂k arbitrarily large. If the xi are fixed, and the outliers are moved up
and down by adding a large positive or negative constant to the Y values
of the outliers, then for high breakdown (HB) estimators, β̂ and MED(|ri |)
stay bounded where the bounds depend on the clean data W but not on the
outliers even if the number of outliers is nearly as large as n/2. Thus if the
|Yi | values of the outliers are large enough, the |ri | values of the outliers will
be large.
If the Yi ’s are fixed, arbitrarily large x-outliers tend to drive the slope
estimates to 0, not ∞. If both x and Y can be varied, then a cluster of
outliers can be moved arbitrarily far from the bulk of the data but may still
have small residuals. For example, move the outliers along the regression
hyperplane formed by the clean cases.

If the (xTi , Yi ) are in general position, then the contamination could be


such that β̂ passes exactly through p − 1 “clean” cases and dn “contam-
inated” cases. Hence dn + p − 1 cases could have absolute residuals equal
to zero with kβ̂k arbitrarily large (but finite). Nevertheless, if T possesses
reasonable equivariant properties and kT (W nd )k is replaced by the median
absolute residual in the definition of breakdown, then the two breakdown val-
ues are asymptotically equivalent. (If T (W ) ≡ 0, then T is neither regression
nor affine equivariant. The breakdown value of T is one, but the median ab-
solute residual can be made arbitrarily large if the contamination proportion
is greater than 1/2.)

If the Yi ’s are fixed, arbitrarily large x-outliers will rarely drive kβ̂k to
∞. The x-outliers can drive kβ̂k to ∞ if they can be constructed so that
the estimator is no longer defined, e.g. so that X T X is nearly singular. The
examples following some results on norms may help illustrate these points.

Definition 7.32. Let y be an n × 1 vector. Then kyk is a vector norm if


vn1) kyk ≥ 0 for every y ∈ Rn with equality iff y is the zero vector,
vn2) kayk = |a| kyk for all y ∈ Rn and for all scalars a, and
vn3) kx + yk ≤ kxk + kyk for all x and y in Rn .
Definition 7.33. Let G be an n × p matrix. Then kGk is a matrix norm if
mn1) kGk ≥ 0 for every n×p matrix G with equality iff G is the zero matrix,
mn2) kaGk = |a| kGk for all scalars a, and
mn3) kG + Hk ≤ kGk + kHk for all n × p matrices G and H .

Example 7.13. The q-norm of a vector y is ‖y‖_q = (|y1|^q + · · · + |yn|^q)^{1/q}. In particular, ‖y‖_1 = |y1| + · · · + |yn|, the Euclidean norm ‖y‖_2 = √(y1² + · · · + yn²), and ‖y‖_∞ = max_i |yi|. Given a matrix G and a vector norm ‖y‖_q, the q-norm or subordinate matrix norm of the matrix G is ‖G‖_q = max_{y≠0} ‖Gy‖_q / ‖y‖_q. It can be shown that the maximum column sum norm ‖G‖_1 = max_{1≤j≤p} Σ_{i=1}^n |g_{ij}|, the maximum row sum norm ‖G‖_∞ = max_{1≤i≤n} Σ_{j=1}^p |g_{ij}|, and the spectral norm ‖G‖_2 = √(maximum eigenvalue of G^T G). The Frobenius norm

‖G‖_F = √( Σ_{j=1}^p Σ_{i=1}^n |g_{ij}|² ) = √(trace(G^T G)).
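In R, base::norm computes these matrix norms directly, as the short sketch below illustrates.

G <- matrix(c(1, -2, 3, 4, 0, -5), nrow = 3)  #a 3 x 2 matrix
norm(G, type = "O")    #maximum column sum norm
norm(G, type = "I")    #maximum row sum norm
norm(G, type = "2")    #spectral norm
norm(G, type = "F")    #Frobenius norm
y <- c(3, 4)
c(sum(abs(y)), sqrt(sum(y^2)), max(abs(y)))   #vector 1-, 2-, and sup-norms: 7 5 4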

Several useful results involving matrix norms will be used. First, for any
subordinate matrix norm, ‖Gy‖_q ≤ ‖G‖_q ‖y‖_q. Let J = Jm = {m1, ..., mp} denote the p cases in the mth elemental fit bJ = X_J^{-1} Y_J. Then for any elemental fit bJ (suppressing q = 2),

‖bJ − β‖ = ‖X_J^{-1}(X_J β + e_J) − β‖ = ‖X_J^{-1} e_J‖ ≤ ‖X_J^{-1}‖ ‖e_J‖.        (7.33)

The following results (Golub and Van Loan 1989, pp. 57, 80) on the Euclidean norm are useful. Let 0 ≤ σp ≤ σp−1 ≤ · · · ≤ σ1 denote the singular values of X_J = (x_{mi,j}). Then

‖X_J^{-1}‖ = σ1 / (σp ‖X_J‖),        (7.34)

max_{i,j} |x_{mi,j}| ≤ ‖X_J‖ ≤ p max_{i,j} |x_{mi,j}|, and        (7.35)

1 / (p max_{i,j} |x_{mi,j}|) ≤ 1 / ‖X_J‖ ≤ ‖X_J^{-1}‖.        (7.36)

From now on, unless otherwise stated, we will use the spectral norm as the
matrix norm and the Euclidean norm as the vector norm.

Example 7.14. Suppose the response values Y are near 0. Consider the fit
from an elemental set: bJ = X_J^{-1} Y_J and examine Equations (7.34), (7.35), and (7.36). Now ‖bJ‖ ≤ ‖X_J^{-1}‖ ‖Y_J‖, and since x-outliers make ‖X_J‖ large, x-outliers tend to drive ‖X_J^{-1}‖ and ‖bJ‖ towards zero, not towards ∞. The x-outliers may make ‖bJ‖ large if they can make the trial design X_J nearly singular. Notice that the Euclidean norm ‖bJ‖ can easily be made large if
one or more of the elemental response variables is driven far away from zero.

Example 7.15. Without loss of generality, assume that the clean Y ’s are
contained in an interval [a, f] for some a and f. Assume that the regression

model contains an intercept β1 . Then there exists an estimator β̂ M of β such


that kβ̂ M k ≤ max(|a|, |f|) if dn < n/2.

Proof. Let MED(n) = MED(Y1 , ..., Yn) and MAD(n) = MAD(Y1 , ..., Yn).
Take β̂ M = (MED(n), 0, ..., 0)T . Then kβ̂M k = |MED(n)| ≤ max(|a|, |f|).
Note that the median absolute residual for the fit β̂ M is equal to the median
absolute deviation MAD(n) = MED(|Yi − MED(n)|, i = 1, ..., n) ≤ f − a if
dn < b(n + 1)/2c. 

Note that β̂ M is a poor high breakdown estimator of β and Ŷi (β̂ M ) tracks
the Yi very poorly. If the data are in general position, a high breakdown
regression estimator is an estimator which has a bounded median absolute
residual even when close to half of the observations are arbitrary. Rousseeuw
and Leroy (1987, pp. 29, 206) conjectured that high breakdown regression
estimators can not be computed cheaply, and that if the algorithm is also
affine equivariant, then the complexity of the algorithm must be at least
O(np ). The following theorem shows that these two conjectures are false.

Theorem 7.17. If the clean data are in general position and the model has
an intercept, then a scale and affine equivariant high breakdown estimator
β̂w can be found by computing OLS on the set of cases that have Yi ∈
[MED(Y1 , ..., Yn) ± w MAD(Y1 , ..., Yn)] where w ≥ 1 (so at least half of the
cases are used).

Proof. Note that β̂ w is obtained by computing OLS on the set J of the


nj cases which have

Yi ∈ [MED(Y1 , ..., Yn) ± wMAD(Y1 , ..., Yn)] ≡ [MED(n) ± wMAD(n)]

where w ≥ 1 (to guarantee that nj ≥ n/2). Consider the estimator β̂ M =


(MED(n), 0, ..., 0)T which yields the predicted values Ŷi ≡ MED(n). The
squared residual ri2 (β̂ M ) ≤ (w MAD(n))2 if the ith case is in J. Hence the
weighted LS fit β̂ w is the OLS fit to the cases in J and has
Σ_{i∈J} r_i²(β̂_w) ≤ nj (w MAD(n))².

Thus

MED(|r1(β̂_w)|, ..., |rn(β̂_w)|) ≤ √nj w MAD(n) < √n w MAD(n) < ∞.

Thus the estimator β̂_w has a median absolute residual bounded by √n w MAD(Y1, ..., Yn). Hence β̂_w is high breakdown, and it is affine equiv-
ariant since the design is not used to choose the observations. It is scale
equivariant since for constant c = 0, β̂_w = 0, and for c ≠ 0 the set of

cases used remains the same under scale transformations and OLS is scale
equivariant. 

Note that if w is huge and MAD(n) ≠ 0, then the high breakdown estima-
tor β̂ w and β̂ OLS will be the same for most data sets. Thus high breakdown
estimators can be very nonrobust. Even if w = 1, the HB estimator β̂ w only
resists large Y outliers.

An ALTA concentration algorithm uses the L1 estimator instead of OLS


in the concentration step and uses the LTA criterion. Similarly an ALMS
concentration algorithm uses the L∞ estimator and the LMS criterion.
Theorem 7.18. If the clean data are in general position and if a high
breakdown start is added to an ALTA, ALTS, or ALMS concentration algo-
rithm, then the resulting estimator is HB.
Proof. Concentration reduces (or does not increase) the corresponding HB
criterion that is based on cn ≥ n/2 absolute residuals, so the median absolute
residual of the resulting estimator is bounded as long as the criterion applied
to the HB estimator is bounded. 

For example, consider the LTS(cn ) criterion. Suppose the ordered squared
residuals from the high breakdown mth start b0m are obtained. If the data
are in general position, then QLT S (b0m ) is bounded even if the number of
outliers dn is nearly as large as n/2. Then b1m is simply the OLS fit to
the cases corresponding to the cn smallest squared residuals r²_(i)(b0m) for i = 1, ..., cn. Denote these cases by i1, ..., i_{cn}. Then QLTS(b1m) =

Σ_{i=1}^{cn} r²_(i)(b1m) ≤ Σ_{j=1}^{cn} r²_{i_j}(b1m) ≤ Σ_{j=1}^{cn} r²_{i_j}(b0m) = Σ_{i=1}^{cn} r²_(i)(b0m) = QLTS(b0m)

where the second inequality follows from the definition of the OLS estimator.
Hence concentration steps reduce or at least do not increase the LTS criterion.
If cn = (n+1)/2 for n odd and cn = 1+n/2 for n even, then the LTS criterion
is bounded iff the median squared residual is bounded.

Theorem 7.18 can be used to show that the following two estimators are high breakdown. The estimator β̂_B is the high breakdown attractor used by the √n consistent high breakdown hbreg estimator of Definition 7.35.

Definition 7.34. Make an OLS fit to the cn ≈ n/2 cases whose Y values
are closest to the MED(Y1 , ..., Yn) ≡ MED(n) and use this fit as the start
for concentration. Define β̂ B to be the attractor after k concentration steps.
Define bk,B = 0.9999β̂B .
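An R sketch of β̂_B, a toy version rather than the linmodpack code, is given below; it uses the start of Definition 7.34 followed by k concentration steps as in Definition 7.24.

betaBsketch <- function(x, y, k = 10) {
  x <- as.matrix(x); n <- nrow(x); cn <- ceiling(n/2)
  keep <- order(abs(y - median(y)))[1:cn]    #cases with Y closest to MED(n)
  b <- coef(lsfit(x[keep, , drop = FALSE], y[keep]))   #the start
  for (i in 1:k) {                           #concentration steps
    keep <- order((y - cbind(1, x) %*% b)^2)[1:cn]
    b <- coef(lsfit(x[keep, , drop = FALSE], y[keep]))
  }
  b                                          #the attractor betahat_B
}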

Theorem 7.19. If the clean data are in general position, then β̂ B and
bk,B are high breakdown regression estimators.

Proof. The start can be taken to be β̂ w with w = 1 from Theorem 7.17.


Since the start is high breakdown, so is the attractor β̂ B by Theorem 7.18.
Multiplying a HB estimator by a positive constant does not change the break-
down value, so bk,B is HB. 

The following result shows that it is easy to make a HB estimator that is


asymptotically equivalent to a consistent estimator on a large class of iid zero
mean symmetric error distributions, although the outlier resistance of the HB
estimator is poor. The following result may not hold if β̂ C estimates β C and
β̂LM S estimates β LM S where β C 6= β LM S . Then bk,B could have a smaller
median squared residual than β̂C even if there are no outliers. The two param-
eter vectors could differ because the constant term is different if the error dis-
tribution is not symmetric. For a large class of symmetric error distributions,
βLMS = βOLS = βC ≡ β, and the ratio MED(ri²(β̂))/MED(ri²(β)) → 1 as
n → ∞ for any consistent estimator of β. The estimator below has two attrac-
tors, β̂ C and bk,B , and the probability that the final estimator β̂ D is equal
to β̂ C goes to one under the strong assumption that the error distribution is
such that both β̂ C and β̂ LM S are consistent estimators of β.

Theorem 7.20. Assume the clean data are in general position, and that
the LMS estimator is a consistent estimator of β. Let β̂C be any practical con-
sistent estimator of β, and let β̂ D = β̂ C if MED(ri2 (β̂ C )) ≤ MED(ri2 (bk,B )).
Let β̂ D = bk,B , otherwise. Then β̂ D is a HB estimator that is asymptotically
equivalent to β̂ C .
Proof. The estimator is HB since the median squared residual of β̂ D
is no larger than that of the HB estimator bk,B . Since β̂ C is consistent,
MED(ri2 (β̂ C )) → MED(e2 ) in probability where MED(e2 ) is the population
median of the squared error e2 . Since the LMS estimator is consistent, the
probability that β̂C has a smaller median squared residual than the biased
estimator bk,B goes to 1 as n → ∞. Hence β̂D is asymptotically equivalent
to β̂ C . 

The elemental concentration and elemental resampling algorithms use K


elemental fits where K is a fixed number that does not depend on the sample
size n, e.g. K = 500. See Definitions 7.12 and 7.24. Note that an estimator can
not be consistent for θ unless the number of randomly selected cases goes to
∞, except in degenerate situations. The following theorem shows the widely
used elemental estimators are zero breakdown estimators. (If K = Kn → ∞,
then the elemental estimator is zero breakdown if Kn = o(n). A necessary
condition for the elemental basic resampling estimator to be consistent is
Kn → ∞.)

Theorem 7.21: a) The elemental basic resampling algorithm estimators


are inconsistent. b) The elemental concentration and elemental basic resam-
pling algorithm estimators are zero breakdown.
Proof: a) Note that you can not get a consistent estimator by using Kh
randomly selected cases since the number of cases Kh needs to go to ∞ for
consistency except in degenerate situations.
b) Contaminating all Kh cases in the K elemental sets shows that the
breakdown value is bounded by Kh/n → 0, so the estimator is zero break-
down. 

7.6.2 A Practical High Breakdown Consistent Estimator

Olive and Hawkins (2011) showed that the practical hbreg estimator is a high breakdown √n consistent robust estimator that is asymptotically equiv-
alent to the least squares estimator for many error distributions. This sub-
section follows Olive (2017b, pp. 420-423).
The outlier resistance of the hbreg estimator is not very good, but roughly
comparable to the best of the practical “robust regression” estimators avail-
able in R packages as of 2019. The estimator is of some interest since it proved
that practical high breakdown consistent estimators are possible. Other prac-
tical regression estimators that claim to be high breakdown and consistent
appear to be zero breakdown because they use the zero breakdown elemental
concentration algorithm. See Theorem 7.21.
The following theorem is powerful because it does not depend on the crite-
rion used to choose the attractor. Suppose there are K consistent estimators
β̂j of β, each with the same rate nδ . If β̂ A is an estimator obtained by choos-
ing one of the K estimators, then β̂ A is a consistent estimator of β with rate
nδ by Pratt (1959). See Theorem 1.21.

Theorem 7.22. Suppose the algorithm estimator chooses an attractor as


the final estimator where there are K attractors and K is fixed.
i) If all of the attractors are consistent, then the algorithm estimator is
consistent.
ii) If all of the attractors are consistent with the same rate, e.g., nδ where
0 < δ ≤ 0.5, then the algorithm estimator is consistent with the same rate as
the attractors.
iii) If all of the attractors are high breakdown, then the algorithm estimator
is high breakdown.
Proof. i) Choosing from K consistent estimators results in a consistent
estimator, and ii) follows from Pratt (1959). iii) Let γn,i be the breakdown
value of the ith attractor if the clean data are in general position. The break-
down value γn of the algorithm estimator can be no lower than that of the
worst attractor: γn ≥ min(γn,1 , ..., γn,K) → 0.5 as n → ∞. 

The consistency of the algorithm estimator changes dramatically if K is


fixed but the start size h = hn = g(n) where g(n) → ∞. In particular, if
K starts with rate n1/2 are used, the final estimator also has rate n1/2 . The
drawback to these algorithms is that they may not have enough outlier resis-
tance. Notice that the basic resampling result below is free of the criterion.

Theorem 7.23. Suppose Kn ≡ K starts are used and that all starts have
subset size hn = g(n) ↑ ∞ as n → ∞. Assume that the estimator applied to
the subset has rate nδ .
i) For the hn -set basic resampling algorithm, the algorithm estimator has
rate [g(n)]δ .
ii) Under regularity conditions (e.g. given by He and Portnoy 1992), the k–
step CLTS estimator has rate [g(n)]δ .

Proof. i) The hn = g(n) cases are randomly sampled without replacement.


Hence the classical estimator applied to these g(n) cases has rate [g(n)]δ . Thus
all K starts have rate [g(n)]δ , and the result follows by Pratt (1959). ii) By
He and Portnoy (1992), all K attractors have [g(n)]δ rate, and the result
follows by Pratt (1959). 

Remark 7.8. Theorem 7.16 shows that β̂ is HB if the median absolute or squared residual (or |r(β̂)|_(cn) or r²_(cn) where cn ≈ n/2) stays bounded under
high contamination. Let QL (β̂ H ) denote the LMS, LTS, or LTA criterion for
an estimator β̂H ; therefore, the estimator β̂H is high breakdown if and only
if QL(β̂ H ) is bounded for dn near n/2 where dn < n/2 is the number of out-
liers. The concentration operator refines an initial estimator by successively
reducing the LTS criterion. If β̂ F refers to the final estimator (attractor) ob-
tained by applying concentration to some starting estimator β̂ H that is high
breakdown, then since QLT S (β̂ F ) ≤ QLT S (β̂ H ), applying concentration to
a high breakdown start results in a high breakdown attractor. See Theorem
7.18.

High breakdown estimators are, however, not necessarily useful for detect-
ing outliers. Suppose γn < 0.5. On the one hand, if the xi are fixed, and the
outliers are moved up and down parallel to the Y axis, then for high break-
down estimators, β̂ and MED(|ri |) will be bounded. Thus if the |Yi | values of
the outliers are large enough, the |ri | values of the outliers will be large, sug-
gesting that the high breakdown estimator is useful for outlier detection. On
the other hand, if the Yi ’s are fixed at any values and the x values perturbed,
sufficiently large x-outliers tend to drive the slope estimates to 0, not ∞. For
many estimators, including LTS, LMS, and LTA, a cluster of Y outliers can
be moved arbitrarily far from the bulk of the data but still, by perturbing
their x values, have arbitrarily small residuals. See Example 7.16.

Our practical high breakdown procedure is made up of three components.


1) A practical estimator β̂C that is consistent for clean data. Suitable choices
would include the full-sample OLS and L1 estimators.
2) A practical estimator β̂ A that is effective for outlier identification. Suitable
choices include the mbareg, rmreg2, lmsreg, or FLTS estimators.
3) A practical high-breakdown estimator such as β̂ B from Definition 7.34
with k = 10.

By selecting one of these three estimators according to the features each of them uncovers in the data, we may inherit some of the good properties of each of them.

Definition 7.35. The hbreg estimator β̂ H is defined as follows. Pick a constant a > 1 and set β̂ H = β̂ C . If aQL(β̂ A ) < QL(β̂ C ), set β̂ H = β̂ A . If aQL(β̂ B ) < min[QL (β̂C ), aQL (β̂ A )], set β̂ H = β̂ B .

That is, find the smallest of the three scaled criterion values QL (β̂ C ),
aQL(β̂ A ), aQL (β̂ B ). According to which of the three estimators attains this
minimum, set β̂H to β̂ C , β̂A , or β̂B respectively.
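This selection rule is simple to code. The following R sketch is not the linmodpack hbreg function; the names QLTA and hbreg_select are hypothetical, it assumes the three candidate coefficient vectors have already been computed, and it uses the LTA criterion with cn = ⌊n/2⌋ + 1.

# Sketch of the selection rule of Definition 7.35: given coefficient vectors
# bC, bA, bB and a criterion QL, keep the fit with the smallest scaled
# criterion, scaling the A and B criteria by a > 1.
QLTA <- function(b, x, y, cn = floor(length(y)/2) + 1) {
  r <- y - as.matrix(x) %*% b        # residuals; x includes a column of ones
  sum(sort(abs(r))[1:cn])            # LTA criterion: sum of cn smallest |r_i|
}
hbreg_select <- function(bC, bA, bB, x, y, a = 1.4) {
  qC <- QLTA(bC, x, y); qA <- QLTA(bA, x, y); qB <- QLTA(bB, x, y)
  bH <- bC                               # default: the consistent estimator
  if (a * qA < qC) bH <- bA              # switch to the outlier resistant fit
  if (a * qB < min(qC, a * qA)) bH <- bB # switch to the high breakdown fit
  bH
}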
Large sample theory for hbreg is simple and given in the following theo-
rem. Let β̂ L be the LMS, LTS, or LTA estimator that minimizes the criterion
QL. Note that the impractical estimator β̂ L is never computed. The following
theorem shows that β̂ H is asymptotically equivalent to β̂ C on a large class

of zero mean finite variance symmetric error distributions. Thus if β̂ C is √n consistent or asymptotically efficient, so is β̂ H . Notice that β̂ A does not need
to be consistent. This point is crucial since lmsreg is not consistent and it is
not known whether FLTS is consistent. The clean data are in general position
if any p clean cases give a unique estimate of β̂.

Theorem 7.24. Assume the clean data are in general position, and sup-
pose that both β̂ L and β̂ C are consistent estimators of β where the regression
model contains a constant. Then the hbreg estimator β̂ H is high breakdown
and asymptotically equivalent to β̂ C .
Proof. Since the clean data are in general position and QL (β̂ H ) ≤
aQL(β̂ B ) is bounded for γn near 0.5, the hbreg estimator is high break-
down. Let Q∗L = QL for LMS and Q∗L = QL /n for LTS and LTA. As n → ∞,
consistent estimators β̂ satisfy Q∗L (β̂) − Q∗L (β) → 0 in probability. Since
LMS, LTS, and LTA are consistent and the minimum value is Q∗L(β̂ L ), it
follows that Q∗L (β̂ C ) − Q∗L (β̂ L ) → 0 in probability, while Q∗L(β̂ L ) < aQ∗L(β̂)
for any estimator β̂. Thus with probability tending to one as n → ∞,
QL(β̂ C ) < a min(QL (β̂ A ), QL(β̂ B )). Hence β̂ H is asymptotically equivalent
to β̂ C . 

Remark 7.9. i) Let β̂ C = β̂ OLS . Then hbreg is asymptotically equivalent to OLS when the errors ei are iid from a large class of zero mean finite variance symmetric distributions, including the N(0, σ²) distribution, since
the probability that hbreg uses OLS instead of β̂ A or β̂ B goes to one as
n → ∞.
ii) The above theorem proves that practical high breakdown estimators
with 100% asymptotic Gaussian efficiency exist; however, such estimators
are not necessarily good.
iii) The theorem holds when both β̂ L and β̂ C are consistent estimators of
β, for example, when the iid errors come from a large class of zero mean finite
variance symmetric distributions. For asymmetric distributions, β̂ C estimates
βC and β̂ L estimates β L where the constants usually differ. The theorem
holds for some distributions that are not symmetric because of the penalty
a. As a → ∞, the class of asymmetric distributions where the theorem holds
greatly increases, but the outlier resistance decreases rapidly as a increases
for a > 1.4.
iv) The default hbreg estimator used OLS, mbareg, and β̂ B with a = 1.4
and the LTA criterion. For the simulated data with symmetric error distri-
butions, β̂ B appeared to give biased estimates of the slopes. However, for the
simulated data with right skewed error distributions, β̂ B appeared to give
good estimates of the slopes but not the constant estimated by OLS, and the
probability that the hbreg estimator selected β̂ B appeared to go to one.
v) Both MBA and OLS are √n consistent estimators of β, even for a large
class of skewed distributions. Using β̂ A = β̂ MBA and removing β̂ B from the hbreg estimator results in a √n consistent estimator of β when β̂ C = OLS is a √n consistent estimator of β, but massive sample sizes were still needed to
get good estimates of the constant for skewed error distributions. For skewed
distributions, if OLS needed n = 1000 to estimate the constant well, mbareg
might need n > one million to estimate the constant well.
The situation is worse for multivariate linear regression when hbreg is
used instead of OLS, since there are m constants to be estimated. If the
distribution of the iid error vectors ei is not elliptically contoured, getting
all m mbareg estimators to estimate all m constants well needs even larger
sample sizes.
vi) The outlier resistance of hbreg is not especially good.
The family of hbreg estimators is enormous and depends on i) the prac-
tical high breakdown estimator β̂ B , ii) β̂ C , iii) β̂ A , iv) a, and v) the criterion
QL. Note that the theory needs the error distribution to be such that both
β̂C and β̂ L are consistent. Sufficient conditions for LMS, LTS, and LTA to be
consistent are rather strong. To have reasonable sufficient conditions for the
hbreg estimator to be consistent, β̂C should be consistent under weak condi-
tions. Hence OLS is a good choice that results in 100% asymptotic Gaussian
efficiency.

We suggest using the LTA criterion since in simulations, hbreg behaved like β̂ C for smaller sample sizes than those needed by the LTS and LMS criteria. We want a near 1 so that hbreg has outlier resistance similar to β̂ A , but we want a large enough so that hbreg performs like β̂ C for moderate n on clean data. Simulations suggest that a = 1.4 is a reasonable choice. The default hbreg program from linmodpack uses the √n consistent outlier resistant estimator mbareg as β̂ A .
There are at least three reasons for using β̂ B as the high breakdown es-
timator. First, β̂ B is high breakdown and simple to compute. Second, the
fitted values roughly track the bulk of the data. Lastly, although β̂ B has
rather poor outlier resistance, β̂ B does perform well on several outlier con-
figurations where some common alternatives fail.
Next we will show that the hbreg estimator implemented with a = 1.4
using QLT A , β̂ C = OLS, and β̂B can greatly improve the estimator β̂A . We
will use β̂ A = ltsreg in R and Splus 2000. Depending on the implemen-
tation, the ltsreg estimators use the elemental resampling algorithm, the
elemental concentration algorithm, or a genetic algorithm. Coverage is 50%,
75%, or 90%. The Splus 2000 implementation is an unusually poor genetic
algorithm with 90% coverage. The R implementation appears to be the zero
breakdown inconsistent elemental basic resampling algorithm that uses 50%
coverage. The ltsreg function changes often.
Simulations were run in R with the xij (for j > 1) and ei iid N (0, σ 2 )
and β = 1, the p × 1 vector of ones. Then β̂ was recorded for 100 runs. The
mean and standard deviation of the β̂j were recorded for j = 1, ..., p. For
n ≥ 10p and OLS, the vector of means should be close to 1 and the vector of standard deviations should be close to 1/√n. The √n consistent high
breakdown hbreg estimator performed like OLS if n ≈ 35p and 2 ≤ p ≤ 6,
if n ≈ 20p and 7 ≤ p ≤ 14, or if n ≈ 15p and 15 ≤ p ≤ 40. See Table 7.7
for p = 5 and 100 runs. ALTS denotes ltsreg, HB denotes hbreg, and
BB denotes β̂ B . In the simulations, hbreg estimated the slopes well for the
highly skewed lognormal data, but not the OLS constant. Use the linmodpack
function hbregsim.
As implemented in linmodpack, the hbreg estimator is a practical √n
consistent high breakdown estimator that appears to perform like OLS for
moderate n if the errors are unimodal and symmetric, and to have outlier
resistance comparable to competing practical “outlier resistant” estimators.
The hbreg, lmsreg, ltsreg, OLS, and β̂ B estimators were compared
on the same 25 benchmark data sets. Also see Park et al. (2012). The HB
estimator β̂B was surprisingly good in that the response plots showed that it
was the best estimator for 2 data sets and that it usually tracked the data, but
it performed poorly in 7 of the 25 data sets. The hbreg estimator performed
well, but for a few data sets hbreg did not pick the attractor with the best
response plot, as illustrated in the following example.

Table 7.7 MEAN β̂i and SD(β̂i )


n method mn or sd β̂1 β̂2 β̂3 β̂4 β̂5
25 HB mn 0.9921 0.9825 0.9989 0.9680 1.0231
sd 0.4821 0.5142 0.5590 0.4537 0.5461
OLS mn 1.0113 1.0116 0.9564 0.9867 1.0019
sd 0.2308 0.2378 0.2126 0.2071 0.2441
ALTS mn 1.0028 1.0065 1.0198 1.0092 1.0374
sd 0.5028 0.5319 0.5467 0.4828 0.5614
BB mn 1.0278 0.5314 0.5182 0.5134 0.5752
sd 0.4960 0.3960 0.3612 0.4250 0.3940
400 HB mn 1.0023 0.9943 1.0028 1.0103 1.0076
sd 0.0529 0.0496 0.0514 0.0459 0.0527
OLS mn 1.0023 0.9943 1.0028 1.0103 1.0076
sd 0.0529 0.0496 0.0514 0.0459 0.0527
ALTS mn 1.0077 0.9823 1.0068 1.0069 1.0214
sd 0.1655 0.1542 0.1609 0.1629 0.1679
BB mn 1.0184 0.8744 0.8764 0.8679 0.8794
sd 0.1273 0.1084 0.1215 0.1206 0.1269
Fig. 7.21 Response Plots Comparing Robust Regression Estimators (four response plots of Y versus HBFIT, OLSFIT, ALTSFIT, and BBFIT)

Example 7.16. The LMS, LTA, and LTS estimators are determined by a
“narrowest band” covering half of the cases. Hawkins and Olive (2002) sug-
gested that the fit will pass through outliers if the band through the outliers
is narrower than the band through the clean cases. This behavior tends to
occur if the regression relationship is weak, and if there is a tight cluster

of outliers where |Y | is not too large. As an illustration, Buxton (1920, pp. 232-5) gave 20 measurements of 88 men. Consider predicting stature using an
intercept, head length, nasal height, bigonal breadth, and cephalic index. One
case was deleted since it had missing values. Five individuals, numbers 61-65,
were reported to be about 0.75 inches tall with head lengths well over five
feet! Figure 7.21 shows the response plots for hbreg, OLS, ltsreg, and β̂B .
Notice that only the fit from β̂ B (BBFIT) did not pass through the outliers,
but hbreg selected the OLS attractor. There are always outlier configura-
tions where an estimator will fail, and hbreg should fail on configurations
where LTA, LTS, and LMS would fail.

7.7 Summary
1) For the location model, the sample mean Y = (1/n) Σ_{i=1}^n Yi, the sample variance Sn² = Σ_{i=1}^n (Yi − Y)²/(n − 1), and the sample standard deviation Sn = √(Sn²).
If the data Y1 , ..., Yn is arranged in ascending order from smallest to largest
and written as Y(1) ≤ · · · ≤ Y(n), then Y(i) is the ith order statistic and the
Y(i)’s are called the order statistics. The sample median

MED(n) = Y((n+1)/2) if n is odd,

MED(n) = [Y(n/2) + Y((n/2)+1)]/2 if n is even.
The notation MED(n) = MED(Y1 , ..., Yn) will also be used. The sample me-
dian absolute deviation is MAD(n) = MED(|Yi − MED(n)|, i = 1, . . . , n).
2) Suppose the multivariate data has been collected into an n × p matrix W = X with ith row x_i^T (so the rows of X are x_1^T, ..., x_n^T).

The coordinatewise median MED(W) = (MED(X_1), ..., MED(X_p))^T where MED(X_i) is the sample median of the data in column i corresponding to variable X_i. The sample mean x̄ = (1/n) Σ_{i=1}^n x_i = (X̄_1, ..., X̄_p)^T where X̄_i is the sample mean of the data in column i corresponding to variable X_i. The sample covariance matrix

S = [1/(n − 1)] Σ_{i=1}^n (x_i − x̄)(x_i − x̄)^T = (S_ij).

That is, the ij entry of S is the sample covariance S_ij. The classical estimator of multivariate location and dispersion is (T, C) = (x̄, S).
3) Let (T, C) = (T(W), C(W)) be an estimator of multivariate location and dispersion. The ith Mahalanobis distance D_i = √(D_i²) where the ith squared Mahalanobis distance is D_i² = D_i²(T(W), C(W)) = (x_i − T(W))^T C^{−1}(W) (x_i − T(W)).
4) The squared Euclidean distance of x_i from the coordinatewise median is D_i² = D_i²(MED(W), I_p). Concentration type steps compute the weighted median MED_j: the coordinatewise median computed from the cases x_i with D_i² ≤ MED(D_i²(MED_{j−1}, I_p)) where MED_0 = MED(W). Often j = 0 (no concentration type steps) or j = 9 is used. Let D_i = D_i(MED_j, I_p). Let W_i = 1 if D_i ≤ MED(D_1, ..., D_n) + k MAD(D_1, ..., D_n) where k ≥ 0 and k = 5 is the default choice. Let W_i = 0, otherwise.
5) Let the covmb2 set B of at least n/2 cases correspond to the cases with
weight Wi = 1. Then the covmb2 estimator (T, C) is the sample mean and
sample covariance matrix applied to the cases in set B. Hence
T = Σ_{i=1}^n W_i x_i / Σ_{i=1}^n W_i   and   C = Σ_{i=1}^n W_i (x_i − T)(x_i − T)^T / (Σ_{i=1}^n W_i − 1).

The function ddplot5 plots the Euclidean distances from the coordinatewise
median versus the Euclidean distances from the covmb2 location estimator.
Typically the plotted points in this DD plot cluster about the identity line,
and outliers appear in the upper right corner of the plot with a gap between
the bulk of the data and the outliers.
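A minimal R sketch of the covmb2 computation in 4) and 5) with j = 0 concentration type steps is given below; the function name covmb2_sketch is hypothetical, and in practice the linmodpack functions should be used. Note that mad(d, constant = 1) gives MAD(D_1, ..., D_n) without the usual rescaling.

# covmb2 with j = 0 concentration type steps: weight cases by their Euclidean
# distance from the coordinatewise median, then apply the classical estimator
# (sample mean and covariance) to the cases in the set B with weight 1.
covmb2_sketch <- function(x, k = 5) {
  x <- as.matrix(x)
  med <- apply(x, 2, median)                  # coordinatewise median MED(W)
  d <- sqrt(rowSums(sweep(x, 2, med)^2))      # Euclidean distances D_i
  cut <- median(d) + k * mad(d, constant = 1) # MED(D) + k MAD(D)
  B <- which(d <= cut)                        # covmb2 set B
  list(center = colMeans(x[B, , drop = FALSE]),
       cov = cov(x[B, , drop = FALSE]))
}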

7.8 Complements

Most of this chapter was taken from Olive (2017b). See that text for references
to concepts such as breakdown. The fact that response plots are extremely
useful for model assessment and for detecting influential cases and outliers
for an enormous variety of statistical models does not seem to be well known.
Certainly in any multiple linear regression analysis, the response plot and the
residual plot of Ŷ versus r should always be made. Cook and Olive (2001)
used response plots to select a response transformation graphically. Olive
(2005) suggested using residual, response, RR, and FF plots to detect outliers
while Hawkins and Olive (2002, pp. 141, 158) suggested using the RR and
FF plots. The four plots are best for n ≥ 5p. Olive (2008: 6.4, 2017a: ch.
5-9) showed that the residual and response plots are useful for experimental
design models. Park et al. (2012) showed response plots are competitive with
the best robust regression methods for outlier detection on some outlier data
sets that have appeared in the literature.

Olive (2002) found applications for the DD plot. The TV estimator was
proposed by Olive (2002, 2005a). Although both the TV and MBA estimators
have the good OP (n−1/2 ) convergence rate, their efficiency under normality
may be very low. Chang and Olive (2010) suggested a method of adaptive
trimming such that the resulting estimator is asymptotically equivalent to
the OLS estimator.
If n is not much larger than p, then Hoffman et al. (2015) gave a ro-
bust Partial Least Squares–Lasso type estimator that uses a clever weighting
scheme. See Uraibi et al. (2017, 2019) for robust methods of forward selection
and least angle regression.
Robust MLD
For the FCH, RFCH, and RMVN estimators, see Olive and Hawkins
(2010), Olive (2017b, ch. 4), and Zhang et al. (2012). See Olive (2017b, p.
120) for the covmb2 estimator.
The fastest estimators of multivariate location and dispersion that have
been shown to be both consistent and high breakdown are the minimum
covariance determinant (MCD) estimator with O(n^v) complexity where
v = 1 + p(p + 3)/2 and possibly an all elemental subset estimator of He
and Wang (1997). See Bernholt and Fischer (2004). The minimum volume
ellipsoid (MVE) complexity is far higher, and for p > 2 there may be no
known method for computing S, τ , projection based, and constrained M
estimators. For some depth estimators, like the Stahel-Donoho estimator, the
exact algorithm of Liu and Zuo (2014) appears to take too long if p ≥ 6 and
n ≥ 100, and simulations may need p ≤ 3. It is possible to compute the MCD
and MVE estimators for p = 4 and n = 100 in a few hours using branch
and bound algorithms (like estimators with O(100^4) complexity). See Agulló
(1996, 1998) and Pesch (1999). These algorithms take too long if both p ≥ 5
and n ≥ 100. Simulations may need p ≤ 2. Two stage estimators such as
the MM estimator, that need an initial high breakdown consistent estimator,
take longer to compute than the initial estimator. Rousseeuw (1984) intro-
duced the MCD and MVE estimators. See Maronna et al. (2006, ch. 6) for
descriptions and references.
Estimators with complexity higher than O[(n^3 + n^2 p + np^2 + p^3) log(n)] take
too long to compute and will rarely be used. Reyen et al. (2009) simulated
the OGK and the Olive (2004a) median ball algorithm (MBA) estimators for
p = 100 and n up to 50000, and noted that the OGK complexity is O[p^3 + np^2 log(n)] while that of MBA is O[p^3 + np^2 + np log(n)]. FCH, RMBA, and
RMVN have the same complexity as MBA. FMCD has the same complexity
as FCH, but FCH is roughly 100 to 200 times faster.
Robust Regression
For the hbreg estimator, see Olive and Hawkins (2011) and Olive (2017b,
ch. 14). Robust regression estimators have unsatisfactory outlier resistance
and large sample theory. The hbreg estimator is fast and high breakdown,
but does not provide an adequate remedy for outliers, and the symmetry
condition for consistency is too strong. OLS response and residual plots, and

RMVN or RFCH DD plots are useful for detecting multiple linear regression
outliers.
Many of the robust statistics for the location model are practical to com-
pute, outlier resistant, and backed by theory. See Huber and Ronchetti (2009).
A few estimators of multivariate location and dispersion, such as the coordi-
natewise median, are practical to compute, outlier resistant, and backed by
theory.
For practical estimators for MLR and MCD, hbreg and FCH appear to
be the only estimators proven to be consistent (for a large class of symmetric
error distributions and for a large class of EC distributions, respectively) with
some breakdown theory (T_FCH is HB). Perhaps all other “robust statistics”
for MLR and MLD that have been shown to be both consistent and high
breakdown are impractical to compute for p > 4: the impractical “brand
name” estimators have at least O(n^p) complexity, while the practical esti-
mators used in the software for the “brand name estimators” have not been
shown to be both high breakdown and consistent. See Theorems 7.12 and
7.21, Hawkins and Olive (2002), Olive (2008, 2017b), Hubert et al. (2002),
and Maronna and Yohai (2002). Huber and Ronchetti (2009, pp. xiii, 8-9,
152-154, 196-197) suggested that high breakdown regression estimators do
not provide an adequate remedy for the ill effects of outliers, that their sta-
tistical and computational properties are not adequately understood, that
high breakdown estimators “break down for all except the smallest regres-
sion problems by failing to provide a timely answer!” and that “there are no
known high breakdown point estimators of regression that are demonstrably
stable.”
A large number of impractical high breakdown regression estimators have
been proposed, including LTS, LMS, LTA, S, LQD, τ , constrained M, re-
peated median, cross checking, one step GM, one step GR, t-type, and re-
gression depth estimators. See Rousseeuw and Leroy (1987) and Maronna et
al. (2006). The practical algorithms used in the software use a brand name
criterion to evaluate a fixed number of trial fits and should be denoted as
an F-brand name estimator such as FLTS. Two stage estimators, such as
the MM estimator, that need an initial consistent high breakdown estima-
tor often have the same breakdown value and consistency rate as the initial
estimator. These estimators are typically implemented with a zero break-
down inconsistent initial estimator and hence are zero breakdown with zero
efficiency.
Maronna and Yohai (2015) used OLS and 500 elemental sets as the 501
trial fits to produce an FS estimator used as the initial estimator for an
FMM estimator. Since the 501 trial fits are zero breakdown, so is the FS
estimator. Since the FMM estimator has the same breakdown as the initial
estimator, the FMM estimator is zero breakdown. For regression, they show
that the FS estimator is consistent on a large class of zero mean finite variance
symmetric distributions. Consistency follows since the elemental fits and OLS
are unbiased estimators of β OLS but an elemental fit is an OLS fit to p cases.

Hence the elemental fits are very variable, and the probability that the OLS
fit has a smaller S-estimator criterion than a randomly chosen elemental
fit (or K randomly chosen elemental fits) goes to one as n → ∞. (OLS and the S-estimator are both √n consistent estimators of β, so the ratio of
their criterion values goes to one, and the S-estimator minimizes the criterion
value.) Hence the FMM estimator is asymptotically equivalent to the MM
estimator that has the smallest criterion value for a large class of iid zero
mean finite variance symmetric error distributions. This FMM estimator is
asymptotically equivalent to the FMM estimator that uses OLS as the initial
estimator. When the error distribution is skewed the S-estimator and OLS
population constant are not the same, and the probability that an elemental
fit is selected is close to one for a skewed error distribution as n → ∞. (The
OLS estimator β̂ gets very close to β OLS while the elemental fits are highly
variable unbiased estimators of βOLS , so one of the elemental fits is likely to
have a constant that is closer to the S-estimator constant while still having
good slope estimators.) Hence the FS estimator is inconsistent, and the FMM
estimator is likely inconsistent for skewed distributions. No practical method is known for computing a √n consistent FS or FMM estimator that has the
same breakdown and maximum bias function as the S or MM estimator that
has the smallest S or MM criterion value.
The L1 CLT is

√n(β̂ L1 − β) →_D N_p(0, W/(4[f(0)]²))     (7.37)

when X^T X/n → W^{−1}, and when the errors ei are iid with a cdf F and a pdf
f such that the unique population median is 0 with f(0) > 0. If a constant β1
is in the model or if the column space of X contains 1, then this assumption
is mild, but if the pdf is not symmetric about 0, then the L1 β1 tends to differ
from the OLS β1 . See Bassett and Koenker (1978). Estimating f(0) can be
difficult, so the residual bootstrap using OLS residuals or using êi = ri − r
where the ri are the L1 residuals with the prediction region method may be
useful.
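For illustration, a residual bootstrap for the L1 estimator using the centered L1 residuals êi = ri − r̄ might be sketched as follows. This assumes the quantreg package, whose rq function with tau = 0.5 gives the L1 (least absolute deviations) fit; the helper name l1_resid_boot is hypothetical, and the resulting bootstrap coefficients could then be passed to a confidence region method such as the prediction region method.

# Residual bootstrap for the L1 estimator using centered L1 residuals.
library(quantreg)                  # rq() with tau = 0.5 is the L1 fit
l1_resid_boot <- function(x, y, B = 1000) {
  dat <- data.frame(y = y, as.data.frame(x))
  fit <- rq(y ~ ., data = dat, tau = 0.5)     # L1 fit to the original data
  r <- resid(fit)
  ehat <- r - mean(r)                         # centered L1 residuals
  yhat <- y - r                               # fitted values
  betas <- replicate(B, {
    dat$y <- yhat + sample(ehat, length(y), replace = TRUE)
    coef(rq(y ~ ., data = dat, tau = 0.5))    # refit to the bootstrap sample
  })
  t(betas)                                    # B x p matrix of bootstrap coefficients
}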

7.9 Problems

PROBLEMS WITH AN ASTERISK * ARE ESPECIALLY USEFUL.

7.1. Referring to Definition 7.25, let Ŷi,j = x_i^T β̂_j = Ŷi(β̂_j) and let ri,j = ri(β̂_j). Show that ‖ri,1 − ri,2‖ = ‖Ŷi,1 − Ŷi,2‖.

7.2. Assume that the model has a constant β1 so that the first column of
X is 1. Show that if the regression estimator is regression equivariant, then
adding 1 to Y changes β̂1 but does not change the slopes β̂2 , ..., β̂p.

R Problems
Use the command source(“G:/linmodpack.txt”) to download the
functions and the command source(“G:/linmoddata.txt”) to download the
data. See Preface or Section 11.1. Typing the name of the linmodpack
function, e.g. trviews, will display the code for the function. Use the args
command, e.g. args(trviews), to display the needed arguments for the func-
tion. For some of the following problems, the R commands can be copied and
pasted from (http://parker.ad.siu.edu/Olive/mrsashw.txt) into R.

7.3. Paste the command for this problem into R to produce the second
column of Table 7.5. Include the output in Word.
7.4. a) To get an idea for the amount of contamination a basic resam-
pling or concentration algorithm for MLR can tolerate, enter or download
the gamper function (with the source(“G:/linmodpack.txt”) command) that
evaluates Equation (7.24) at different values of h = p.
b) Next enter the following commands and include the output in Word.
zh <- c(10,20,30,40,50,60,70,80,90,100)
for(i in 1:10) gamper(zh[i])
7.5∗ . a) Assuming that you have done the two source commands above
Problem 7.3 (and the R command library(MASS)), type the command
ddcomp(buxx). This will make 4 DD plots based on the DGK, FCH, FMCD,
and median ball estimators. The DGK and median ball estimators are the
two attractors used by the FCH estimator. With the leftmost mouse button,
move the cursor to an outlier and click. This data is the Buxton (1920) data
and cases with numbers 61, 62, 63, 64, and 65 were the outliers with head
lengths near 5 feet. After identifying at least three outliers in each plot, hold
the rightmost mouse button down (and in R click on Stop) to advance to the
next plot. When done, hold down the Ctrl and c keys to make a copy of the
plot. Then paste the plot in Word.
b) Repeat a) but use the command ddcomp(cbrainx). This data is the
Gladstone (1905) data and some infants are multivariate outliers.
c) Repeat a) but use the command ddcomp(museum[,-1]). This data is the
Schaaffhausen (1878) skull measurements and cases 48–60 were apes while
the first 47 cases were humans.

7.6∗ . (Perform the source(“G:/linmodpack.txt”) command if you have not


already done so.) The concmv function illustrates concentration with p = 2
and a scatterplot of X1 versus X2 . The outliers are such that the robust
estimators can not always detect them. Type the command concmv(). Hold
the rightmost mouse button down (and in R click on Stop) to see the DD

plot after one concentration step. The start uses the coordinatewise median
and diag([MAD(Xi)]²). Repeat 4 more times to see the DD plot based on
the attractor. The outliers have large values of X2 and the highlighted cases
have the smallest distances. Repeat the command concmv() several times.
Sometimes the start will contain outliers but the attractor will be clean (none
of the highlighted cases will be outliers), but sometimes concentration causes
more and more of the highlighted cases to be outliers, so that the attractor
is worse than the start. Copy one of the DD plots where none of the outliers
are highlighted into Word.

7.7∗ . (Perform the source(“G:/linmodpack.txt”) command if you have not


already done so.) The ddmv function illustrates concentration with the DD
plot. The outliers are highlighted. The first graph is the DD plot after one
concentration step. Hold the rightmost mouse button down (and in R click
on Stop) to see the DD plot after two concentration steps. Repeat 4 more
times to see the DD plot based on the attractor. In this problem, try to
determine the proportion of outliers gam that the DGK estimator can detect
for p = 2, 4, 10, and 20. Make a table of p and gam. For example the command
ddmv(p=2,gam=.4) suggests that the DGK estimator can tolerate nearly 40%
outliers with p = 2, but the command ddmv(p=4,gam=.4) suggest that gam
needs to be lowered (perhaps by 0.1 or 0.05). Try to make 0 < gam < 0.5 as
large as possible.

7.8∗ . a) If necessary, use the commands source(“G:/linmodpack.txt”) and


source(“G:/linmoddata.txt”).
b) Enter the command mbamv(belx,bely) in R. Click on the rightmost
mouse button (and in R, click on Stop). You need to do this 7 times be-
fore the program ends. There is one predictor x and one response Y . The
function makes a scatterplot of x and Y and cases that get weight one are
shown as highlighted squares. Each MBA sphere covers half of the data.
When you find a good fit to the bulk of the data, hold down the Ctrl and c
keys to make a copy of the plot. Then paste the plot in Word.
c) Enter the command mbamv2(buxx,buxy) in R. Click on the rightmost
mouse button (and in R, click on Stop). You need to do this 14 times before
the program ends. There are four predictors x1 , ..., x4 and one response Y .
The function makes the response and residual plots based on the OLS fit to
the highlighted cases. Each MBA sphere covers half of the data. When you
find a good fit to the bulk of the data, hold down the Ctrl and c keys to make
a copy of the two plots. Then paste the plots in Word.

7.9. This problem compares the MBA estimator that uses the median
squared residual MED(ri²) criterion with the MBA estimator that uses the LATA criterion. On clean data, both estimators are √n consistent since both use 50 √n consistent OLS estimators. The MED(ri²) criterion has trouble
with data sets where the multiple linear regression relationship is weak and

there is a cluster of outliers. The LATA criterion tries to give all x–outliers,
including good leverage points, zero weight.
a) If necessary, use the commands source(“G:/linmodpack.txt”) and
source(“G:/linmoddata.txt”). The mlrplot2 function is used to compute
both MBA estimators. Use the rightmost mouse button to advance the plot
(and in R, highlight stop).
b) Use the command mlrplot2(belx,bely) and include the resulting plot in
Word. Is one estimator better than the other, or are they about the same?
c) Use the command mlrplot2(cbrainx,cbrainy) and include the resulting
plot in Word. Is one estimator better than the other, or are they about the
same? (The infants are likely good leverage cases instead of outliers.)
d) Use the command mlrplot2(museum[,3:11],museum[,2]) and include the
resulting plot in Word. For this data set, most of the cases are based on
humans but a few are based on apes. The MBA LATA estimator will often
give the cases corresponding to apes larger absolute residuals than the MBA
estimator based on MED(ri2 ), but the apes appear to be good leverage cases.
e) Use the command mlrplot2(buxx,buxy) until the outliers are clustered
about the identity line in one of the two response plots. (This will usually
happen within 10 or fewer runs. Pressing the “up arrow” will bring the pre-
vious command to the screen and save typing.) Then include the resulting
plot in Word. Which estimator went through the outliers and which one gave
zero weight to the outliers?
f) Use the command mlrplot2(hx,hy) several times. Usually both MBA
estimators fail to find the outliers for this artificial Hawkins data set that is
also analyzed by Atkinson and Riani (2000, section 3.1). The lmsreg estimator
can be used to find the outliers. In R use the commands library(MASS) and
ffplot2(hx,hy). Include the resulting plot in Word.

7.10. a) After entering the two source commands above Problem 7.3, enter
the following command.
MLRplot(buxx,buxy)
Click the rightmost mouse button (and in R click on Stop). The response
plot should appear. Again, click the rightmost mouse button (and in R click
on Stop). The residual plot should appear. Hold down the Ctrl and c keys to
make a copy of the two plots. Then paste the plots in Word.
b) The response variable is height, but 5 cases were recorded with heights
about 0.75 inches tall. The highlighted squares in the two plots correspond
to cases with large Cook’s distances. With respect to the Cook’s distances,
what is happening, swamping or masking?
7.11. For the Buxton (1920) data with multiple linear regression, height
was the response variable while an intercept, head length, nasal height, bigonal
breadth, and cephalic index were used as predictors in the multiple linear
regression model. Observation 9 was deleted since it had missing values. Five

individuals, cases 61–65, were reported to be about 0.75 inches tall with head
lengths well over five feet!
a) Copy and paste the commands for this problem into R. Include the lasso
response plot in Word. The identity line passes right through the outliers
which are obvious because of the large gap. Prediction interval (PI) bands
are also included in the plot.
b) Copy and paste the commands for this problem into R. Include the
lasso response plot in Word. This did lasso for the cases in the covmb2 set
B applied to the predictors which included all of the clean cases and omitted
the 5 outliers. The response plot was made for all of the data, including the
outliers.
c) Copy and paste the commands for this problem into R. Include the DD
plot in Word. The outliers are in the upper right corner of the plot.
7.12. Consider the Gladstone (1905) data set that has 12 variables on
267 persons after death. There are 5 infants in the data set. The response
variable was brain weight. Head measurements were breadth, circumference,
head height, length, and size as well as cephalic index and brain weight. Age,
height, and three categorical variables cause, ageclass (0: under 20, 1: 20-45,
2: over 45) and sex were also given. The constant x1 was the first variable.
The variables cause and ageclass were not coded as factors. Coding as factors
might improve the fit.
a) Copy and paste the commands for this problem into R. Include the
lasso response plot in Word. The identity line passes right through the infants
which are obvious because of the large gap. Prediction interval (PI) bands
are also included in the plot.
b) Copy and paste the commands for this problem into R. Include the
lasso response plot in Word. This did lasso for the cases in the covmb2 set
B applied to the nontrivial predictors which are not categorical (omit the
constant, cause, ageclass and sex) which omitted 8 cases, including the 5
infants. The response plot was made for all of the data.
c) Copy and paste the commands for this problem into R. Include the DD
plot in Word. The infants are in the upper right corner of the plot.
7.13. The linmodpack function mldsim6 compares 7 estimators: FCH,
RFCH, CMVE, RCMVE, RMVN, covmb2, and MB described in Olive
(2017b, ch. 4). Most of these estimators need n > 2p, need a nonsingu-
lar dispersion matrix, and work best with n > 10p. The function generates
data sets and counts how many times the minimum Mahalanobis distance
Di (T, C) of the outliers is larger than the maximum distance of the clean
data. The value pm controls how far the outliers need to be from the bulk of

the data, and pm roughly needs to increase with p.
For data sets with p > n possible, the function mldsim7 used the Eu-
clidean distances Di (T, I p ) and the Mahalanobis distances Di (T, C d ) where
C d is the diagonal matrix with the same diagonal entries as C where (T, C)
is the covmb2 estimator using j concentration type steps. Dispersion ma-

trices are effected more by outliers than good robust location estimators,
so when the outlier proportion is high, it is expected that the Euclidean
distances Di (T, I p ) will outperform the Mahalanobis distance Di (T, C d ) for
many outlier configurations. Again the function counts the number of times
the minimum outlier distance is larger than the maximum distance of the
clean data.
Both functions used several outlier types. The simulations generated 100
data sets. The clean data had xi ∼ Np (0, diag(1, ..., p)). Type 1 had outliers
in a tight cluster (near point mass) at the major axis (0, ..., 0, pm)T . Type 2
had outliers in a tight cluster at the minor axis (pm, 0, ..., 0)T . Type 3 had
mean shift outliers xi ∼ Np ((pm, ..., pm)T , diag(1, ..., p)). Type 4 changed
the pth coordinate of the outliers to pm. Type 5 changed the 1st coordinate
of the outliers to pm. (If the outlier xi = (x1i, ..., xpi)T , then x1i = pm.)

Table 7.8 Number of Times All Outlier Distances > Clean Distances, otype=1
n p γ osteps pm FCH RFCH CMVE RCMVE RMVN covmb2 MB
100 10 0.25 0 20 85 85 85 85 86 67 89

a) Table 7.8 suggests with osteps = 0, covmb2 had the worst count. When
pm is increased to 25, all counts become 100. Copy and paste the commands
for this part into R and make a table similar to Table 7.8, but now osteps=9
and p = 45 is close to n/2 for the second line where pm = 60. Your table
should have 2 lines from output.

Table 7.9 Number of Times All Outlier Distances > Clean Distances, otype=1
n p γ osteps pm covmb2 diag
100 1000 0.4 0 1000 100 41
100 1000 0.4 9 600 100 42

b) Copy and paste the commands for this part into R and make a table
similar to Table 7.9, but type 2 outliers are used.
c) When you have two reasonable outlier detectors, there are outlier con-
figurations where one will beat the other. Simulations suggest that “covmb2”
using Di (T, I p ) outperforms “diag” using Di (T, C d ) for many outlier config-
urations, but there are some exceptions. Copy and paste the commands for
this part into R and make a table similar to Table 7.9, but type 3 outliers
are used.
7.14. a) In addition to the source(“G:/linmodpack.txt”) command, also
use the source(“G:/linmoddata.txt”) command, and type the library(MASS)
command.

b) Type the command tvreg(buxx,buxy,ii=1). Click the rightmost mouse


button and highlight Stop. The response plot should appear. Repeat 10 times
and remember which plot percentage M (say M = 0) had the best response
plot. Then type the command tvreg2(buxx,buxy, M = 0) (except use your
value of M, not 0). Again, click the rightmost mouse button (and in R, high-
light Stop). The response plot should appear. Hold down the Ctrl and c keys
to make a copy of the plot. Then paste the plot in Word.
c) The estimated coefficients β̂T V from the best plot should have appeared
on the screen. Copy and paste these coefficients into Word.

7.15. This problem is like Problem 7.11, except elastic net is used instead
of lasso.
a) Copy and paste the commands for this problem into R. Include the
elastic net response plot in Word. The identity line passes right through the
outliers which are obvious because of the large gap. Prediction interval (PI)
bands are also included in the plot.
b) Copy and paste the commands for this problem into R. Include the
elastic net response plot in Word. This did elastic net for the cases in the
covmb2 set B applied to the predictors which included all of the clean cases
and omitted the 5 outliers. The response plot was made for all of the data,
including the outliers. (Problem 7.11 c) shows the DD plot for the data.)
Chapter 8
Multivariate Linear Regression

This chapter will show that multivariate linear regression with m ≥ 2 re-
sponse variables is nearly as easy to use, at least if m is small, as multiple
linear regression which has 1 response variable. For multivariate linear re-
gression, at least one predictor variable is quantitative. Plots for checking
the model, including outlier detection, are given. Prediction regions that are
robust to nonnormality are developed. For hypothesis testing, it is shown
that the Wilks’ lambda statistic, Hotelling Lawley trace statistic, and Pillai’s
trace statistic are robust to nonnormality.

8.1 Introduction

Definition 8.1. The response variables are the variables that you want
to predict. The predictor variables are the variables used to predict the
response variables.

Definition 8.2. The multivariate linear regression model

y_i = B^T x_i + ε_i

for i = 1, ..., n has m ≥ 2 response variables Y_1, ..., Y_m and p predictor variables x_1, x_2, ..., x_p where x_1 ≡ 1 is the trivial predictor. The ith case is (x_i^T, y_i^T) = (1, x_{i2}, ..., x_{ip}, Y_{i1}, ..., Y_{im}) where the 1 could be omitted. The model is written in matrix form as Z = XB + E where the matrices are defined below. The model has E(ε_k) = 0 and Cov(ε_k) = Σ_ε = (σ_{ij}) for k = 1, ..., n. Then the p × m coefficient matrix B = [β_1 β_2 . . . β_m] and the m × m covariance matrix Σ_ε are to be estimated, and E(Z) = XB while E(Y_{ij}) = x_i^T β_j. The ε_i are assumed to be iid. Multiple linear regression corresponds to m = 1 response variable, and is written in matrix form as Y = Xβ + e. Subscripts are needed for the m multiple linear regression


models Y j = Xβ j +ej for j = 1, ..., m where E(ej ) = 0. For the multivariate


linear regression model, Cov(ei , ej ) = σij I n for i, j = 1, ..., m where I n is
the n × n identity matrix.

Notation. The multiple linear regression model uses m = 1. See Definition 1.9. The multivariate linear model y_i = B^T x_i + ε_i for i = 1, ..., n has m ≥ 2, and multivariate linear regression and MANOVA models are special cases. See Definition 9.2. This chapter will use x_1 ≡ 1 for the multivariate linear regression model. The multivariate location and dispersion model is the special case where X = 1 and p = 1.

The data matrix W = [X Z] except usually the first column 1 of X is omitted for software. The n × m matrix of response variables Z = [Y_1 Y_2 . . . Y_m] has ij entry Y_{i,j} and ith row y_i^T, so Y_j is the n × 1 vector of values of the jth response variable. The n × p design matrix of predictor variables X = [v_1 v_2 . . . v_p] has ij entry x_{i,j} and ith row x_i^T, where v_1 = 1. The p × m coefficient matrix B = [β_1 β_2 . . . β_m] has ij entry β_{i,j}. The n × m error matrix E = [e_1 e_2 . . . e_m] has ij entry ε_{i,j} and ith row ε_i^T.
Considering the ith row of Z, X, and E shows that y_i^T = x_i^T B + ε_i^T.

Each response variable in a multivariate linear regression model follows a multiple linear regression model Y_j = Xβ_j + e_j for j = 1, ..., m where it is assumed that E(e_j) = 0 and Cov(e_j) = σ_jj I_n. Hence the errors corresponding to the jth response are uncorrelated with variance σ_j² = σ_jj. Notice that the same design matrix X of predictors is used for each of the m models, but the jth response variable vector Y_j, coefficient vector β_j, and error vector e_j change and thus depend on j.
Now consider the ith case (x_i^T, y_i^T) which corresponds to the ith row of Z and the ith row of X. Then

Y_i1 = β_11 x_i1 + · · · + β_p1 x_ip + ε_i1 = x_i^T β_1 + ε_i1,
Y_i2 = β_12 x_i1 + · · · + β_p2 x_ip + ε_i2 = x_i^T β_2 + ε_i2,
...
Y_im = β_1m x_i1 + · · · + β_pm x_ip + ε_im = x_i^T β_m + ε_im,

or y_i = µ_xi + ε_i = E(y_i) + ε_i where

E(y_i) = µ_xi = B^T x_i = (x_i^T β_1, x_i^T β_2, ..., x_i^T β_m)^T.

The notation y_i | x_i and E(y_i | x_i) is more accurate, but usually the conditioning is suppressed. Taking µ_xi to be a constant (or condition on x_i if the predictor variables are random variables), y_i and ε_i have the same covariance matrix. In the multivariate regression model, this covariance matrix Σ_ε does not depend on i. Observations from different cases are uncorrelated (often independent), but the m errors for the m different response variables for the same case are correlated. If X is a random matrix, then assume X and E are independent and that expectations are conditional on X.
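A short simulation sketch may help fix the notation: generate X, choose B and Σ_ε, and form Z = XB + E with iid error rows. This is only an illustration (not taken from the text), and it assumes MASS::mvrnorm for the multivariate normal errors.

# Simulate from the multivariate linear regression model Z = XB + E with
# m = 2 response variables and p = 3 predictors (x1 = 1 is the intercept).
library(MASS)                                         # for mvrnorm
set.seed(1)
n <- 100; p <- 3; m <- 2
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))   # n x p design matrix
B <- matrix(c(1, 2, 0,                                # p x m coefficient matrix
              1, 0, 3), nrow = p, ncol = m)
SigmaE <- matrix(c(1, 0.5, 0.5, 1), 2, 2)             # Cov of an error row
E <- mvrnorm(n, mu = rep(0, m), Sigma = SigmaE)       # iid error rows
Z <- X %*% B + E                                      # n x m response matrix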

Example 8.1. Suppose it is desired to predict the response variables Y1 = height and Y2 = height at shoulder of a person from partial skeletal remains.
A model for prediction can be built from nearly complete skeletons or from
living humans, depending on the population of interest (e.g. ancient Egyp-
tians or modern US citizens). The predictor variables might be x1 ≡ 1, x2 =
femur length, and x3 = ulna length. The two heights of individuals with
x2 = 200mm and x3 = 140mm should be shorter on average than the two
heights of individuals with x2 = 500mm and x3 = 350mm. In this example
Y1 , Y2 , x2 , and x3 are quantitative variables. If x4 = gender is a predictor
variable, then gender (coded as male = 1 and female = 0) is qualitative.

Definition 8.3. Least squares is the classical method for fitting multivariate linear regression. The least squares estimators are

B̂ = (X^T X)^{−1} X^T Z = [β̂_1 β̂_2 . . . β̂_m].

The predicted values or fitted values form the n × m matrix

Ẑ = X B̂ = [Ŷ_1 Ŷ_2 . . . Ŷ_m]

with ij entry Ŷ_{i,j}. The residuals form the n × m matrix

Ê = Z − Ẑ = Z − X B̂ = [r_1 r_2 . . . r_m]

with ij entry ε̂_{i,j} and ith row ε̂_i^T. These quantities can be found from the m multiple linear regressions of Y_j on the predictors: β̂_j = (X^T X)^{−1} X^T Y_j, Ŷ_j = X β̂_j, and r_j = Y_j − Ŷ_j for j = 1, ..., m. Hence ε̂_{i,j} = Y_{i,j} − Ŷ_{i,j} where Ŷ_j = (Ŷ_{1,j}, ..., Ŷ_{n,j})^T. Finally,

Σ̂_{ε,d} = (Z − Ẑ)^T (Z − Ẑ)/(n − d) = (Z − X B̂)^T (Z − X B̂)/(n − d) = Ê^T Ê/(n − d) = [1/(n − d)] Σ_{i=1}^n ε̂_i ε̂_i^T.

The choices d = 0 and d = p are common. If d = 1, then Σ̂_{ε,d=1} = S_r, the sample covariance matrix of the residual vectors ε̂_i, since the sample mean of the ε̂_i is 0. Let Σ̂_ε = Σ̂_{ε,p} be the unbiased estimator of Σ_ε. Also,

Σ̂_{ε,d} = (n − d)^{−1} Z^T [I − X(X^T X)^{−1} X^T] Z,

and

Ê = [I − X(X^T X)^{−1} X^T] Z.
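These formulas are easy to compute and check numerically. A minimal R sketch is given below (the helper name mlr_ls is hypothetical); the commented line verifies the equivalence with the m separate multiple linear regressions.

# Least squares for multivariate linear regression: Bhat, fitted values,
# residuals, and Sigma-hat_{eps,d} with d = p by default.
mlr_ls <- function(X, Z, d = ncol(X)) {
  Bhat <- solve(t(X) %*% X, t(X) %*% Z)     # (X^T X)^{-1} X^T Z
  Zhat <- X %*% Bhat                        # fitted values
  Ehat <- Z - Zhat                          # residual matrix
  SigE <- t(Ehat) %*% Ehat / (nrow(X) - d)  # Sigma-hat_{eps,d}
  list(Bhat = Bhat, Zhat = Zhat, Ehat = Ehat, SigE = SigE)
}
# Check against m separate least squares fits (X already contains the column
# of ones, so lm's automatic intercept is suppressed):
# fit <- mlr_ls(X, Z); coef(lm(Z ~ X - 1))   # same matrix as fit$Bhat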
The following two theorems show that the least squares estimators are fairly good. Also see Theorem 8.7 in Section 8.4. Theorem 8.2 can also be used for Σ̂_{ε,d} = [(n − 1)/(n − d)] S_r.

Theorem 8.1, Johnson and Wichern (1988, p. 304): Suppose X has full rank p < n and the covariance structure of Definition 8.2 holds. Then E(B̂) = B so E(β̂_j) = β_j, Cov(β̂_j, β̂_k) = σ_{jk} (X^T X)^{−1} for j, k = 1, ..., m. Also Ê and B̂ are uncorrelated, E(Ê) = 0, and

E(Σ̂_ε) = E(Ê^T Ê/(n − p)) = Σ_ε.
Theorem 8.2. S_r = Σ_ε + O_P(n^{−1/2}) and (1/n) Σ_{i=1}^n ε_i ε_i^T = Σ_ε + O_P(n^{−1/2}) if the following three conditions hold: B − B̂ = O_P(n^{−1/2}), (1/n) Σ_{i=1}^n ε_i x_i^T = O_P(1), and (1/n) Σ_{i=1}^n x_i x_i^T = O_P(n^{1/2}).

Proof. Note that y_i = B^T x_i + ε_i = B̂^T x_i + ε̂_i. Hence ε̂_i = (B − B̂)^T x_i + ε_i. Thus

Σ_{i=1}^n ε̂_i ε̂_i^T = Σ_{i=1}^n (ε_i − ε_i + ε̂_i)(ε_i − ε_i + ε̂_i)^T = Σ_{i=1}^n [ε_i ε_i^T + ε_i (ε̂_i − ε_i)^T + (ε̂_i − ε_i) ε̂_i^T]

= Σ_{i=1}^n ε_i ε_i^T + (Σ_{i=1}^n ε_i x_i^T)(B − B̂) + (B − B̂)^T (Σ_{i=1}^n x_i ε_i^T) + (B − B̂)^T (Σ_{i=1}^n x_i x_i^T)(B − B̂).

Thus (1/n) Σ_{i=1}^n ε̂_i ε̂_i^T = (1/n) Σ_{i=1}^n ε_i ε_i^T +

O_P(1) O_P(n^{−1/2}) + O_P(n^{−1/2}) O_P(1) + O_P(n^{−1/2}) O_P(n^{1/2}) O_P(n^{−1/2}),

and the result follows since (1/n) Σ_{i=1}^n ε_i ε_i^T = Σ_ε + O_P(n^{−1/2}) and

S_r = [n/(n − 1)] (1/n) Σ_{i=1}^n ε̂_i ε̂_i^T. □


S_r and Σ̂_ε are also √n consistent estimators of Σ_ε by Su and Cook (2012, p. 692). See Theorem 8.7.

8.2 Plots for the Multivariate Linear Regression Model

This section suggests using residual plots, response plots, and the DD plot to
examine the multivariate linear model. The DD plot is used to examine the
distribution of the iid error vectors. The residual plots are often used to check
for lack of fit of the multivariate linear model. The response plots are used
to check linearity and to detect influential cases for the linearity assumption.
The response and residual plots are used exactly as in the m = 1 case corre-
sponding to multiple linear regression and experimental design models. See
Olive (2010, 2017a), Olive et al. (2015), Olive and Hawkins (2005), and Cook
and Weisberg (1999, p. 432).

Notation. Plots will be used to simplify the regression analysis, and in


this text a plot of W versus Z uses W on the horizontal axis and Z on the
vertical axis.

Definition 8.4. A response plot for the jth response variable is a plot of the fitted values Ŷij versus the response Yij. The identity line with slope one and zero intercept is added to the plot as a visual aid. A residual plot corresponding to the jth response variable is a plot of Ŷij versus rij.

Remark 8.1. Make the m response and residual plots for any multivariate
linear regression. In a response plot, the vertical deviations from the identity
line are the residuals rij = Yij − Ŷij . Suppose the model is good, the jth error
distribution is unimodal and not highly skewed for j = 1, ..., m, and n ≥ 10p.
Then the plotted points should cluster about the identity line in each of the
m response plots. If outliers are present or if the plot is not linear, then the
current model or data need to be transformed or corrected. If the model is
good, then each of the m residual plots should be ellipsoidal with no trend
and should be centered about the r = 0 line. There should not be any pattern
in the residual plot: as a narrow vertical strip is moved from left to right, the
behavior of the residuals within the strip should show little change. Outliers
and patterns such as curvature or a fan shaped plot are bad.
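A minimal base R sketch that makes the m response plots and m residual plots for a least squares fit is given below; the helper name mlr_plots is hypothetical, x holds the nontrivial predictors, and Z is the n × m response matrix.

# Response plots (top row, with the identity line) and residual plots
# (bottom row) for each of the m response variables.
mlr_plots <- function(x, Z) {
  fit <- lm(Z ~ x)                        # m multiple linear regressions
  Zhat <- fitted(fit); R <- resid(fit)
  op <- par(mfrow = c(2, ncol(Z))); on.exit(par(op))
  for (j in 1:ncol(Z)) {                  # response plots
    plot(Zhat[, j], Z[, j], xlab = "FIT", ylab = "Y")
    abline(0, 1)                          # identity line
  }
  for (j in 1:ncol(Z)) {                  # residual plots
    plot(Zhat[, j], R[, j], xlab = "FIT", ylab = "RES")
    abline(h = 0)
  }
}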
Rule of thumb 8.1. Use multivariate linear regression if

n ≥ max((m + p)², mp + 30, 10p)

provided that the m response and residual plots all look good. Make the DD
plot of the ε̂i. If a residual plot would look good after several points have
been deleted, and if these deleted points were not gross outliers (points far
from the point cloud formed by the bulk of the data), then the residual plot
is probably good. Beginners often find too many things wrong with a good
model. For practice, use the computer to generate several multivariate linear
regression data sets, and make the m response and residual plots for these
data sets. This exercise will help show that the plots can have considerable
variability even when the multivariate linear regression model is good. The
linmodpack function MLRsim simulates response and residual plots for various
distributions when m = 1.

Rule of thumb 8.2. If the plotted points in the residual plot look like
a left or right opening megaphone, the first model violation to check is the
assumption of nonconstant variance. (This is a rule of thumb because it is
possible that such a residual plot results from another model violation such
as nonlinearity, but nonconstant variance is much more common.)

Remark 8.2. Residual plots magnify departures from the model while the
response plots emphasize how well the multivariate linear regression model
fits the data.
Definition 8.5. An RR plot is a scatterplot matrix of the m sets of
residuals r1 , ..., rm .

Definition 8.6. An FF plot is a scatterplot matrix of the m sets of fitted


values of response variables Ŷ 1 , ..., Ŷ m . The m response variables Y 1 , ..., Y m
can be added to the plot.

Remark 8.3. Some applications for multivariate linear regression need the
m error vectors to be linearly related, and larger sample sizes may be needed
if the error vectors are not linearly related. For example, the asymptotic
optimality of the prediction regions of Section 8.3 needs the error vectors to
be iid from an elliptically contoured distribution. Make the RR plot and a
DD plot of the residual vectors ε̂i to check that the error vectors are linearly
related. Make a DD plot of the continuous predictor variables to check for
x-outliers. Make a DD plot of Y1 , ...., Ym to check for outliers, especially if
it is assumed that the response variables come from an elliptically contoured
distribution.
The RMVN DD plot of the residual vectors ε̂i is used to check the error
vector distribution, to detect outliers, and to display the nonparametric pre-
diction region developed in Section 8.3. The DD plot suggests that the error
vector distribution is elliptically contoured if the plotted points cluster tightly
about a line through the origin as n → ∞. The plot suggests that the error
vector distribution is multivariate normal if the line is the identity line. If n
is large and the plotted points do not cluster tightly about a line through the
origin, then the error vector distribution may not be elliptically contoured.
These applications of the DD plot for iid multivariate data are discussed in
Olive (2002, 2008, 2013a, 2017b) and Chapter 7. The RMVN estimator has
not yet been proven to be a consistent estimator when computed from resid-
ual vectors, but simulations suggest that the RMVN DD plot of the residual
vectors is a useful diagnostic plot. The linmodpack function mregddsim can
be used to simulate the DD plots for various distributions.
Predictor transformations for the continuous predictors can be made ex-
actly as in Section 1.2.

Warning: The log rule and other transformations do not always work. For
example, the log rule may fail. If the relationships in the scatterplot matrix are
already linear or if taking the transformation does not increase the linearity,
then no transformation may be better than taking a transformation. For
the Cook and Weisberg (1999) data set evaporat.lsp with m = 1, the log
rule suggests transforming the response variable Evap, but no transformation
works better.

Response transformations can also be made as in Section 1.2, but also


make the response plot of Ŷ j versus Y j , and use the rules of Section 1.2
on Yj to linearize the response plot for each of the m response variables
Y1 , ..., Ym.

8.3 Asymptotically Optimal Prediction Regions

In this section, we will consider a more general multivariate regression model,


and then consider the multivariate linear model as a special case. Given n
cases of training or past data (x1 , y1 ), ..., (xn , y n ) and a vector of predictors
xf , suppose it is desired to predict a future test vector y f .

Definition 8.7. A large sample 100(1 − δ)% prediction region is a set An


such that P (y f ∈ An ) → 1 −δ as n → ∞, and is asymptotically optimal if the
volume of the region converges in probability to the volume of the population
minimum volume covering region.

The classical large sample 100(1 − δ)% prediction region for a future value x_f given iid data x_1, ..., x_n is {x : D_x²(x̄, S) ≤ χ²_{p,1−δ}}, while for multivariate linear regression, the classical large sample 100(1 − δ)% prediction region for a future value y_f given x_f and past data (x_1, y_1), ..., (x_n, y_n) is {y : D_y²(ŷ_f, Σ̂_ε) ≤ χ²_{m,1−δ}}. See Johnson and Wichern (1988, pp. 134, 151, 312). By Equation (1.36), these regions may work for multivariate normal x_i or ε_i, but otherwise tend to have undercoverage. Section 4.4 and Olive (2013a) replaced χ²_{p,1−δ} by the order statistic D²_(Un) where Un decreases to ⌈n(1 − δ)⌉.
This section will use a similar technique from Olive (2018) to develop possibly
the first practical large sample prediction region for the multivariate linear
model with unknown error distribution. The following technical theorem will
be needed to prove Theorem 8.4.

Theorem 8.3. Let a > 0 and assume that (µ̂_n, Σ̂_n) is a consistent estimator of (µ, aΣ).
a) D_x²(µ̂_n, Σ̂_n) − (1/a) D_x²(µ, Σ) = o_P(1).
b) Let 0 < δ ≤ 0.5. If (µ̂_n, Σ̂_n) − (µ, aΣ) = O_P(n^{−δ}) and a Σ̂_n^{−1} − Σ^{−1} = O_P(n^{−δ}), then

D_x²(µ̂_n, Σ̂_n) − (1/a) D_x²(µ, Σ) = O_P(n^{−δ}).

Proof. Let B_n denote the subset of the sample space on which Σ̂_n has an inverse. Then P(B_n) → 1 as n → ∞. Now

D_x²(µ̂_n, Σ̂_n) = (x − µ̂_n)^T Σ̂_n^{−1} (x − µ̂_n)
= (x − µ̂_n)^T [−Σ^{−1}/a + Σ^{−1}/a + Σ̂_n^{−1}] (x − µ̂_n)
= (x − µ̂_n)^T [−Σ^{−1}/a + Σ̂_n^{−1}] (x − µ̂_n) + (x − µ̂_n)^T [Σ^{−1}/a] (x − µ̂_n)
= (1/a)(x − µ̂_n)^T (−Σ^{−1} + a Σ̂_n^{−1})(x − µ̂_n) + (x − µ + µ − µ̂_n)^T [Σ^{−1}/a] (x − µ + µ − µ̂_n)
= (1/a)(x − µ)^T Σ^{−1} (x − µ) + (2/a)(x − µ)^T Σ^{−1} (µ − µ̂_n) + (1/a)(µ − µ̂_n)^T Σ^{−1} (µ − µ̂_n) + (1/a)(x − µ̂_n)^T [a Σ̂_n^{−1} − Σ^{−1}] (x − µ̂_n)

on B_n, and the last three terms are o_P(1) under a) and O_P(n^{−δ}) under b). □


Now suppose a prediction region for an m × 1 random vector y_f given a vector of predictors x_f is desired for the multivariate linear model. If we had many cases z_i = B^T x_f + ε_i, then we could use the multivariate prediction region for m variables from Section 4.4. Instead, Theorem 8.4 will use the prediction region from Section 4.4 on the pseudodata ẑ_i = B̂^T x_f + ε̂_i = ŷ_f + ε̂_i for i = 1, ..., n. This takes the data cloud of the n residual vectors ε̂_i and centers the cloud at ŷ_f. Note that ẑ_i = (B − B + B̂)^T x_f + (ε_i − ε_i + ε̂_i) = z_i + (B̂ − B)^T x_f + ε̂_i − ε_i = z_i + (B̂ − B)^T x_f − (B̂ − B)^T x_i = z_i + O_P(n^{−1/2}). Hence the distances based on the z_i and the distances based on the ẑ_i have the same quantiles, asymptotically (for quantiles that are continuity points of the distribution of z_i).
If the ε_i are iid from an EC_m(0, Σ, g) distribution with continuous decreasing g and nonsingular covariance matrix Σ_ε = cΣ for some constant c > 0, then the population asymptotically optimal prediction region is {y : D_y(B^T x_f, Σ_ε) ≤ D_{1−δ}} where P(D_y(B^T x_f, Σ_ε) ≤ D_{1−δ}) = 1 − δ. For example, if the iid ε_i ∼ N_m(0, Σ_ε), then D_{1−δ} = √(χ²_{m,1−δ}). If the error distribution is not elliptically contoured, then the above region still has 100(1 − δ)% coverage, but prediction regions with smaller volume may exist.
A natural way to make a large sample prediction region is to estimate the
target population minimum volume covering region, but for moderate sam-
ples and many error distributions, the natural estimator that covers ⌈n(1 − δ)⌉
of the cases tends to have undercoverage as high as min(0.05, δ/2). This em-
pirical result is not too surprising since it is well known that the performance
of a prediction region on the training data is superior to the performance on
future test data, due in part to the unknown variability of the estimator. To
compensate for the undercoverage, let qn be as in Theorem 8.4.

Theorem 8.4. Suppose y_i = E(y_i | x_i) + ε_i = ŷ_i + ε̂_i where Cov(ε_i) = Σ_ε > 0, and where the zero mean ε_f and the ε_i are iid for i = 1, ..., n. Given x_f, suppose the fitted model produces ŷ_f and nonsingular Σ̂_ε. Let ẑ_i = ŷ_f + ε̂_i and

D_i² ≡ D_i²(ŷ_f, Σ̂_ε) = (ẑ_i − ŷ_f)^T Σ̂_ε^{−1} (ẑ_i − ŷ_f)

for i = 1, ..., n. Let qn = min(1 − δ + 0.05, 1 − δ + m/n) for δ > 0.1 and

qn = min(1 − δ/2, 1 − δ + 10δm/n), otherwise.

If qn < 1 − δ + 0.001, set qn = 1 − δ. Let 0 < δ < 1 and h = D_(Un) where D_(Un) is the 100 qn th sample quantile of the Mahalanobis distances D_i. Let the nominal 100(1 − δ)% prediction region for y_f be given by

{z : (z − ŷ_f)^T Σ̂_ε^{−1} (z − ŷ_f) ≤ D²_(Un)} = {z : D_z²(ŷ_f, Σ̂_ε) ≤ D²_(Un)} = {z : D_z(ŷ_f, Σ̂_ε) ≤ D_(Un)}.     (8.1)

a) Consider the n prediction regions for the data where (y_{f,i}, x_{f,i}) = (y_i, x_i) for i = 1, ..., n. If the order statistic D_(Un) is unique, then Un of the n prediction regions contain y_i where Un/n → 1 − δ as n → ∞.
b) If (ŷ_f, Σ̂_ε) is a consistent estimator of (E(y_f), Σ_ε), then (8.1) is a large sample 100(1 − δ)% prediction region for y_f.
c) If (ŷ_f, Σ̂_ε) is a consistent estimator of (E(y_f), Σ_ε), and the ε_i come from an elliptically contoured distribution such that the unique highest density region is {z : D_z(0, Σ_ε) ≤ D_{1−δ}}, then the prediction region (8.1) is asymptotically optimal.

Proof. a) Suppose $(x_f, y_f) = (x_i, y_i)$. Then
$$D^2_{y_i}(\hat{y}_i, \hat{\Sigma}_\epsilon) = (y_i - \hat{y}_i)^T \hat{\Sigma}_\epsilon^{-1} (y_i - \hat{y}_i) = \hat{\epsilon}_i^T \hat{\Sigma}_\epsilon^{-1} \hat{\epsilon}_i = D^2_{\hat{\epsilon}_i}(0, \hat{\Sigma}_\epsilon).$$
Hence $y_i$ is in the $i$th prediction region $\{z : D_z(\hat{y}_i, \hat{\Sigma}_\epsilon) \leq D_{(U_n)}(\hat{y}_i, \hat{\Sigma}_\epsilon)\}$ iff $\hat{\epsilon}_i$ is in the prediction region $\{z : D_z(0, \hat{\Sigma}_\epsilon) \leq D_{(U_n)}(0, \hat{\Sigma}_\epsilon)\}$, but exactly $U_n$ of the $\hat{\epsilon}_i$ are in the latter region by construction, if $D_{(U_n)}$ is unique. Since $D_{(U_n)}$ is the $100(1-\delta)$th percentile of the $D_i$ asymptotically, $U_n/n \to 1 - \delta$.
b) Let $P[D_z(E(y_f), \Sigma_\epsilon) \leq D_{1-\delta}(E(y_f), \Sigma_\epsilon)] = 1 - \delta$. Since $\Sigma_\epsilon > 0$, Theorem 8.3 shows that if $(\hat{y}_f, \hat{\Sigma}_\epsilon) \xrightarrow{P} (E(y_f), \Sigma_\epsilon)$ then $D(\hat{y}_f, \hat{\Sigma}_\epsilon) \xrightarrow{D} D_z(E(y_f), \Sigma_\epsilon)$. Hence the percentiles of the distances converge in distribution, and the probability that $y_f$ is in $\{z : D_z(\hat{y}_f, \hat{\Sigma}_\epsilon) \leq D_{1-\delta}(\hat{y}_f, \hat{\Sigma}_\epsilon)\}$ converges to $1 - \delta$ = the probability that $y_f$ is in $\{z : D_z(E(y_f), \Sigma_\epsilon) \leq D_{1-\delta}(E(y_f), \Sigma_\epsilon)\}$ at continuity points $D_{1-\delta}$ of the distribution of $D(E(y_f), \Sigma_\epsilon)$.
c) The asymptotically optimal prediction region is the region with the smallest volume (hence highest density) such that the coverage is $1 - \delta$, as $n \to \infty$. This region is $\{z : D_z(E(y_f), \Sigma_\epsilon) \leq D_{1-\delta}(E(y_f), \Sigma_\epsilon)\}$ if the asymptotically optimal region for the $\epsilon_i$ is $\{z : D_z(0, \Sigma_\epsilon) \leq D_{1-\delta}(0, \Sigma_\epsilon)\}$. Hence the result follows by b). □
Notice that if $\hat{\Sigma}_\epsilon^{-1}$ exists, then $100 q_n\%$ of the $n$ training data $y_i$ are in their corresponding prediction region with $x_f = x_i$, and $q_n \to 1 - \delta$ even if $(\hat{y}_i, \hat{\Sigma}_\epsilon)$ is not a good estimator or if the regression model is misspecified. Hence the coverage $q_n$ of the training data is robust to model assumptions. Of course the volume of the prediction region could be large if a poor estimator $(\hat{y}_i, \hat{\Sigma}_\epsilon)$ is used or if the $\epsilon_i$ do not come from an elliptically contoured distribution. The response, residual, and DD plots can be used to check model assumptions.
If the plotted points in the RMVN DD plot cluster tightly about some line
through the origin and if n ≥ max[3(m + p)2 , mp + 30], we expect the volume
of the prediction region may be fairly low for the least squares estimators.
If n is too small, then multivariate data is sparse and the covering ellipsoid
for the training data may be far too small for future data, resulting in severe
undercoverage. Also notice that qn = 1 − δ/2 or qn = 1 − δ + 0.05 for n ≤ 20p.
At the training data, the coverage qn ≥ 1 − δ, and qn converges to the
nominal coverage 1 − δ as n → ∞. Suppose n ≤ 20p. Then the nominal 95%
prediction region uses qn = 0.975 while the nominal 50% prediction region
uses qn = 0.55. Prediction distributions depend both on the error distribution
and on the variability of the estimator (ŷ f , Σ̂  ). This variability is typically
unknown but converges to 0 as n → ∞. Also, residuals tend to underestimate
errors for small n. For moderate n, ignoring estimator variability and using
qn = 1 − δ resulted in undercoverage as high as min(0.05, δ/2). Letting the
“coverage” qn decrease to the nominal coverage 1 − δ inflates the volume of
the prediction region for small n, compensating for the unknown variability
of (ŷ f , Σ̂  ).
Consider the multivariate linear regression model. Let $\hat{\Sigma}_\epsilon = \hat{\Sigma}_{\epsilon,d=p}$, $\hat{z}_i = \hat{y}_f + \hat{\epsilon}_i$, and $D_i^2(\hat{y}_f, S_r) = (\hat{z}_i - \hat{y}_f)^T S_r^{-1} (\hat{z}_i - \hat{y}_f)$ for $i = 1, ..., n$. Then the large sample nonparametric $100(1-\delta)\%$ prediction region is
$$\{z : D^2_z(\hat{y}_f, S_r) \leq D^2_{(U_n)}\} = \{z : D_z(\hat{y}_f, S_r) \leq D_{(U_n)}\}. \quad (8.2)$$

Theorem 8.5 will show that this prediction region (8.2) can also be found
by applying the nonparametric prediction region (4.24) on the ẑ i . Recall that
S r defined in Definition 8.3 is the sample covariance matrix of the residual
vectors ˆi . For the multivariate linear regression model, if D1−δ is a continuity
point of the distribution of D, Assumption D1 above Theorem 8.7 holds, and
the i have a nonsingular covariance matrix, then (8.2) is a large sample
100(1 − δ)% prediction region for y f .

Theorem 8.5. For multivariate linear regression, when least squares is used to compute $\hat{y}_f$, $S_r$, and the pseudodata $\hat{z}_i$, prediction region (8.2) is the nonparametric prediction region (4.24) applied to the $\hat{z}_i$.
Proof. Multivariate linear regression with least squares satisfies Theorem 8.4 by Su and Cook (2012). (See Theorem 8.7.) Let $(T, C)$ be the sample mean and sample covariance matrix (see Definition 4.7) applied to the $\hat{z}_i$. The sample mean and sample covariance matrix of the residual vectors is $(0, S_r)$ since least squares was used. Hence the $\hat{z}_i = \hat{y}_f + \hat{\epsilon}_i$ have sample covariance matrix $S_r$, and sample mean $\hat{y}_f$. Hence $(T, C) = (\hat{y}_f, S_r)$, and the $D_i(\hat{y}_f, S_r)$ are used to compute $D_{(U_n)}$. □
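A minimal R sketch of region (8.2) is given below, assuming a matrix of residual vectors res (n x m) and a new fitted value yfhat from a least squares fit; the function name predrgn and the object names are hypothetical, and the linmodpack functions automate these steps.

# sketch: nonparametric prediction region (8.2) for y_f, not the packaged code
predrgn <- function(res, yfhat, delta = 0.05){
  n <- nrow(res); m <- ncol(res)
  # coverage proportion qn from Theorem 8.4
  qn <- if(delta > 0.1) min(1 - delta + 0.05, 1 - delta + m/n) else
        min(1 - delta/2, 1 - delta + 10*delta*m/n)
  if(qn < 1 - delta + 0.001) qn <- 1 - delta
  Sr <- cov(res)                       # sample covariance of residual vectors
  zhat <- sweep(res, 2, yfhat, "+")    # pseudodata zhat_i = yfhat + residual_i
  D2 <- mahalanobis(zhat, center = yfhat, cov = Sr)  # squared distances
  cutoff <- sort(D2)[ceiling(n * qn)]  # order statistic D^2_(Un)
  list(center = yfhat, cov = Sr, cutoff = cutoff)
}
# a point y is in the region iff mahalanobis(y, yfhat, Sr) <= cutoff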

The RMVN DD plot of the residual vectors will be used to display the
prediction regions for multivariate linear regression. See Example 8.3. The
nonparametric prediction region for multivariate linear regression of Theorem
8.5 uses (T, C) = (ŷ f , S r ) in (8.1), and has simple geometry. Let Rr be the
nonparametric prediction region (8.2) applied to the residuals ˆi with ŷ f = 0.
Then Rr is a hyperellipsoid with center 0, and the nonparametric prediction
region is the hyperellipsoid Rr translated to have center ŷ f . Hence in a DD
plot, all points to the left of the line M D = D(Un ) correspond to y i that are
in their prediction region, while points to the right of the line are not in their
prediction region.
The nonparametric prediction region has some interesting properties. This prediction region is asymptotically optimal if the $\epsilon_i$ are iid for a large class of elliptically contoured $EC_m(0, \Sigma, g)$ distributions. Also, if there are 100 different values $(x_{jf}, y_{jf})$ to be predicted, we only need to update $\hat{y}_{jf}$ for $j = 1, ..., 100$; we do not need to update the covariance matrix $S_r$.
It is common practice to examine how well the prediction regions work
on the training data. That is, for i = 1, ..., n, set xf = xi and see if y i is
in the region with probability near to 1 − δ with a simulation study. Note
that ŷ f = ŷ i if xf = xi . Simulation is not needed for the nonparametric
prediction region (8.2) for the data since the prediction region (8.2) centered
at ŷ i contains y i iff Rr , the prediction region centered at 0, contains ˆ
i since
ˆi = y i − ŷ i . Thus 100qn % of prediction regions corresponding to the data
(y i , xi ) contain y i , and 100qn% → 100(1 − δ)%. Hence the prediction regions
work well on the training data and should work well on (xf , y f ) similar to
the training data. Of course simulation should be done for test data (xf , y f )
that are not equal to training data cases. See Problem 8.11.
This training data result holds provided that the multivariate linear regres-
sion using least squares is such that the sample covariance matrix S r of the
residual vectors is nonsingular, the multivariate regression model need
not be correct. Hence the coverage at the n training data cases (xi , yi )
is robust to model misspecification. Of course, the prediction regions may
be very large if the model is severely misspecified, but severity of misspec-
ification can be checked with the response and residual plots. Coverage for
a future value y f can also be arbitrarily bad if there is extrapolation or if
(xf , yf ) comes from a different population than that of the data.

8.4 Testing Hypotheses

This section considers testing a linear hypothesis $H_0: LB = 0$ versus $H_1: LB \neq 0$ where $L$ is a full rank $r \times p$ matrix.

Definition 8.8. Assume rank($X$) = $p$. The total corrected (for the mean) sum of squares and cross products matrix is
$$T = R + W_e = Z^T \left( I_n - \frac{1}{n} 1 1^T \right) Z.$$
Note that $T/(n-1)$ is the usual sample covariance matrix $\hat{\Sigma}_y$ if all $n$ of the $y_i$ are iid, e.g. if $B = 0$. The regression sum of squares and cross products matrix is
$$R = Z^T \left[ X(X^TX)^{-1}X^T - \frac{1}{n} 1 1^T \right] Z = Z^T X \hat{B} - \frac{1}{n} Z^T 1 1^T Z.$$
Let $H = \hat{B}^T L^T [L(X^TX)^{-1}L^T]^{-1} L \hat{B}$. The error or residual sum of squares and cross products matrix is
$$W_e = (Z - \hat{Z})^T(Z - \hat{Z}) = Z^T Z - Z^T X \hat{B} = Z^T [I_n - X(X^TX)^{-1}X^T] Z.$$
Note that $W_e = \hat{E}^T \hat{E}$ and $W_e/(n-p) = \hat{\Sigma}_\epsilon$.

Warning: SAS output uses $E$ instead of $W_e$.

The MANOVA table is shown below.

Summary MANOVA Table

Source                    matrix   df
Regression or Treatment   R        p − 1
Error or Residual         W_e      n − p
Total (corrected)         T        n − 1

Definition 8.9. Let $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_m$ be the ordered eigenvalues of $W_e^{-1}H$. Then there are four commonly used test statistics.
The Roy's maximum root statistic is $\lambda_{max}(L) = \lambda_1$.
The Wilks' $\Lambda$ statistic is $\Lambda(L) = |(H + W_e)^{-1} W_e| = |W_e^{-1}H + I|^{-1} = \prod_{i=1}^m (1 + \lambda_i)^{-1}$.
The Pillai's trace statistic is $V(L) = tr[(H + W_e)^{-1}H] = \sum_{i=1}^m \frac{\lambda_i}{1 + \lambda_i}$.
The Hotelling-Lawley trace statistic is $U(L) = tr[W_e^{-1}H] = \sum_{i=1}^m \lambda_i$.
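A minimal R sketch of Definition 8.9, assuming the m x m matrices H and We have already been computed; the function name manova.stats is hypothetical.

# sketch: the four MANOVA test statistics from the eigenvalues of We^{-1} H
manova.stats <- function(H, We){
  lam <- Re(eigen(solve(We) %*% H, only.values = TRUE)$values)
  list(Roy = max(lam),                 # largest root
       Wilks = prod(1/(1 + lam)),      # Wilks' Lambda
       Pillai = sum(lam/(1 + lam)),    # Pillai's trace
       HotellingLawley = sum(lam))     # Hotelling-Lawley trace
}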

Typically some function of one of the four above statistics is used to get
pval, the estimated pvalue. Output often gives the pvals for all four test
statistics. Be cautious about inference if the last three test statistics do not
lead to the same conclusions (Roy’s test may not be trustworthy for r > 1).
Theory and simulations developed below for the four statistics will provide
more information about the sample sizes needed to use the four test statistics.
See the paragraphs after the following theorem for the notation used in that
theorem.
Theorem 8.6. The Hotelling-Lawley trace statistic
$$U(L) = \frac{1}{n-p} [vec(L\hat{B})]^T [\hat{\Sigma}_\epsilon^{-1} \otimes (L(X^TX)^{-1}L^T)^{-1}] [vec(L\hat{B})]. \quad (8.3)$$

Proof. Using the Searle (1982, p. 333) identity $tr(AG^TDGC) = [vec(G)]^T[CA \otimes D^T][vec(G)]$, it follows that
$$(n-p)U(L) = tr[\hat{\Sigma}_\epsilon^{-1} \hat{B}^T L^T [L(X^TX)^{-1}L^T]^{-1} L\hat{B}] = [vec(L\hat{B})]^T [\hat{\Sigma}_\epsilon^{-1} \otimes (L(X^TX)^{-1}L^T)^{-1}] [vec(L\hat{B})] = T$$
where $A = \hat{\Sigma}_\epsilon^{-1}$, $G = L\hat{B}$, $D = [L(X^TX)^{-1}L^T]^{-1}$, and $C = I$. Hence (8.3) holds. □
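A minimal R sketch of Equation (8.3), using base R's kronecker() for the ⊗ operation defined below; X, Z, and L are assumed available and the function name hotlaw is hypothetical.

# sketch: Hotelling-Lawley statistic U(L) via Equation (8.3)
hotlaw <- function(L, X, Z){
  n <- nrow(X); p <- ncol(X)
  XtXinv <- solve(crossprod(X))              # (X^T X)^{-1}
  Bhat <- XtXinv %*% crossprod(X, Z)         # least squares coefficient matrix
  E <- Z - X %*% Bhat                        # residual matrix
  Sigehat <- crossprod(E)/(n - p)            # Sigma_eps hat = We/(n-p)
  v <- as.vector(L %*% Bhat)                 # vec(L Bhat), columns stacked
  M <- kronecker(solve(Sigehat), solve(L %*% XtXinv %*% t(L)))
  drop(t(v) %*% M %*% v)/(n - p)             # U(L)
}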
Some notation is useful to show (8.3) and to show that $(n-p)U(L) \xrightarrow{D} \chi^2_{rm}$ under mild conditions if $H_0$ is true. Following Henderson and Searle (1979), let matrix $A = [a_1\ a_2\ \ldots\ a_p]$. Then the vec operator stacks the columns of $A$ on top of one another so
$$vec(A) = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{pmatrix}.$$

Let $A = (a_{ij})$ be an $m \times n$ matrix and $B$ a $p \times q$ matrix. Then the Kronecker product of $A$ and $B$ is the $mp \times nq$ matrix
$$A \otimes B = \begin{pmatrix} a_{11}B & a_{12}B & \cdots & a_{1n}B \\ a_{21}B & a_{22}B & \cdots & a_{2n}B \\ \vdots & \vdots & & \vdots \\ a_{m1}B & a_{m2}B & \cdots & a_{mn}B \end{pmatrix}.$$
An important fact is that if $A$ and $B$ are nonsingular square matrices, then $[A \otimes B]^{-1} = A^{-1} \otimes B^{-1}$. The following assumption is important.
Assumption D1: Let $h_i$ be the $i$th diagonal element of $X(X^TX)^{-1}X^T$. Assume $\max_{1 \leq i \leq n} h_i \to 0$ as $n \to \infty$, assume that the zero mean iid error vectors have finite fourth moments, and assume that $\frac{1}{n} X^TX \xrightarrow{P} W^{-1}$.
Su and Cook (2012) proved a central limit type theorem for $\hat{\Sigma}_\epsilon$ and $\hat{B}$ for the partial envelopes estimator, and the least squares estimator is a special case. These results prove the following theorem. Their theorem also shows that for multiple linear regression ($m = 1$), $\hat{\sigma}^2 = MSE$ is a $\sqrt{n}$ consistent estimator of $\sigma^2$.

Theorem 8.7: Multivariate Least Squares Central Limit Theorem (MLS CLT). For the least squares estimator, if assumption D1 holds, then $\hat{\Sigma}_\epsilon$ is a $\sqrt{n}$ consistent estimator of $\Sigma_\epsilon$ and
$$\sqrt{n}\, vec(\hat{B} - B) \xrightarrow{D} N_{pm}(0, \Sigma_\epsilon \otimes W).$$

Theorem 8.8. If assumption D1 holds and if $H_0$ is true, then
$$(n-p)U(L) \xrightarrow{D} \chi^2_{rm}.$$
Proof. By Theorem 8.7, $\sqrt{n}\, vec(\hat{B} - B) \xrightarrow{D} N_{pm}(0, \Sigma_\epsilon \otimes W)$. Then under $H_0$, $\sqrt{n}\, vec(L\hat{B}) \xrightarrow{D} N_{rm}(0, \Sigma_\epsilon \otimes LWL^T)$, and $n\,[vec(L\hat{B})]^T[\Sigma_\epsilon^{-1} \otimes (LWL^T)^{-1}][vec(L\hat{B})] \xrightarrow{D} \chi^2_{rm}$. This result also holds if $W$ and $\Sigma_\epsilon$ are replaced by $\hat{W} = n(X^TX)^{-1}$ and $\hat{\Sigma}_\epsilon$. Hence under $H_0$ and using the proof of Theorem 8.6,
$$T = (n-p)U(L) = [vec(L\hat{B})]^T[\hat{\Sigma}_\epsilon^{-1} \otimes (L(X^TX)^{-1}L^T)^{-1}][vec(L\hat{B})] \xrightarrow{D} \chi^2_{rm}. \quad \square$$
Some more details on the above results may be useful. Consider testing a linear hypothesis $H_0: LB = 0$ versus $H_1: LB \neq 0$ where $L$ is a full rank $r \times p$ matrix. For now assume the error distribution is multivariate normal $N_m(0, \Sigma_\epsilon)$. Then
$$vec(\hat{B} - B) = \begin{pmatrix} \hat{\beta}_1 - \beta_1 \\ \hat{\beta}_2 - \beta_2 \\ \vdots \\ \hat{\beta}_m - \beta_m \end{pmatrix} \sim N_{pm}(0, \Sigma_\epsilon \otimes (X^TX)^{-1})$$
where
$$C = \Sigma_\epsilon \otimes (X^TX)^{-1} = \begin{pmatrix} \sigma_{11}(X^TX)^{-1} & \sigma_{12}(X^TX)^{-1} & \cdots & \sigma_{1m}(X^TX)^{-1} \\ \sigma_{21}(X^TX)^{-1} & \sigma_{22}(X^TX)^{-1} & \cdots & \sigma_{2m}(X^TX)^{-1} \\ \vdots & \vdots & & \vdots \\ \sigma_{m1}(X^TX)^{-1} & \sigma_{m2}(X^TX)^{-1} & \cdots & \sigma_{mm}(X^TX)^{-1} \end{pmatrix}.$$

Now let $A$ be an $rm \times pm$ block diagonal matrix: $A = diag(L, ..., L)$. Then $A\, vec(\hat{B} - B) = vec(L(\hat{B} - B)) =$
$$\begin{pmatrix} L(\hat{\beta}_1 - \beta_1) \\ L(\hat{\beta}_2 - \beta_2) \\ \vdots \\ L(\hat{\beta}_m - \beta_m) \end{pmatrix} \sim N_{rm}(0, \Sigma_\epsilon \otimes L(X^TX)^{-1}L^T)$$
where $D = \Sigma_\epsilon \otimes L(X^TX)^{-1}L^T = ACA^T =$
$$\begin{pmatrix} \sigma_{11}L(X^TX)^{-1}L^T & \sigma_{12}L(X^TX)^{-1}L^T & \cdots & \sigma_{1m}L(X^TX)^{-1}L^T \\ \sigma_{21}L(X^TX)^{-1}L^T & \sigma_{22}L(X^TX)^{-1}L^T & \cdots & \sigma_{2m}L(X^TX)^{-1}L^T \\ \vdots & \vdots & & \vdots \\ \sigma_{m1}L(X^TX)^{-1}L^T & \sigma_{m2}L(X^TX)^{-1}L^T & \cdots & \sigma_{mm}L(X^TX)^{-1}L^T \end{pmatrix}.$$

Under $H_0$, $vec(LB) = A\, vec(B) = 0$, and
$$vec(L\hat{B}) = \begin{pmatrix} L\hat{\beta}_1 \\ L\hat{\beta}_2 \\ \vdots \\ L\hat{\beta}_m \end{pmatrix} \sim N_{rm}(0, \Sigma_\epsilon \otimes L(X^TX)^{-1}L^T).$$
Hence under $H_0$,
$$[vec(L\hat{B})]^T[\Sigma_\epsilon^{-1} \otimes (L(X^TX)^{-1}L^T)^{-1}][vec(L\hat{B})] \sim \chi^2_{rm},$$
and
$$T = [vec(L\hat{B})]^T[\hat{\Sigma}_\epsilon^{-1} \otimes (L(X^TX)^{-1}L^T)^{-1}][vec(L\hat{B})] \xrightarrow{D} \chi^2_{rm}. \quad (8.4)$$
A large sample level $\delta$ test will reject $H_0$ if pval $\leq \delta$ where
$$\text{pval} = P\left( \frac{T}{rm} < F_{rm,\, n-mp} \right). \quad (8.5)$$
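In R, with the observed statistic T from (8.4), the pval of (8.5) can be obtained with pf(); the sketch below reuses the hypothetical hotlaw function from the sketch after Theorem 8.6 and assumes n, p, r, and m are defined.

# sketch: pval of (8.5) for the large sample level delta test
Tstat <- (n - p) * hotlaw(L, X, Z)               # (n-p)U(L) from Equation (8.3)
pval  <- 1 - pf(Tstat/(r*m), df1 = r*m, df2 = n - m*p)
# reject H0 if pval <= delta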

Since least squares estimators are asymptotically normal, if the $\epsilon_i$ are iid for a large class of distributions,
$$\sqrt{n}\, vec(\hat{B} - B) = \sqrt{n} \begin{pmatrix} \hat{\beta}_1 - \beta_1 \\ \hat{\beta}_2 - \beta_2 \\ \vdots \\ \hat{\beta}_m - \beta_m \end{pmatrix} \xrightarrow{D} N_{pm}(0, \Sigma_\epsilon \otimes W)$$
where
$$\frac{X^TX}{n} \xrightarrow{P} W^{-1}.$$
Then under $H_0$,
$$\sqrt{n}\, vec(L\hat{B}) = \sqrt{n} \begin{pmatrix} L\hat{\beta}_1 \\ L\hat{\beta}_2 \\ \vdots \\ L\hat{\beta}_m \end{pmatrix} \xrightarrow{D} N_{rm}(0, \Sigma_\epsilon \otimes LWL^T),$$
and
$$n\, [vec(L\hat{B})]^T[\Sigma_\epsilon^{-1} \otimes (LWL^T)^{-1}][vec(L\hat{B})] \xrightarrow{D} \chi^2_{rm}.$$
Hence (8.4) holds, and (8.5) gives a large sample level $\delta$ test if the least squares estimators are asymptotically normal.
Kakizawa (2009) showed, under stronger assumptions than Theorem 8.8, that for a large class of iid error distributions, the following test statistics have the same $\chi^2_{rm}$ limiting distribution when $H_0$ is true, and the same noncentral $\chi^2_{rm}(\omega^2)$ limiting distribution with noncentrality parameter $\omega^2$ when $H_0$ is false under a local alternative. Hence the three tests are robust to the assumption of normality. The limiting null distribution is well known when the zero mean errors are iid from a multivariate normal distribution. See Khattree and Naik (1999, p. 68): $(n-p)U(L) \xrightarrow{D} \chi^2_{rm}$, $(n-p)V(L) \xrightarrow{D} \chi^2_{rm}$, and $-[n - p - 0.5(m - r + 3)]\log(\Lambda(L)) \xrightarrow{D} \chi^2_{rm}$. Results from Kshirsagar (1972, p. 301) suggest that the third chi-square approximation is very good if $n \geq 3(m+p)^2$ for multivariate normal error vectors.
Theorems 8.6 and 8.8 are useful for relating multivariate tests with the partial F test for multiple linear regression that tests whether a reduced model that omits some of the predictors can be used instead of the full model that uses all $p$ predictors. The partial F test statistic is
$$F_R = \left[ \frac{SSE(R) - SSE(F)}{df_R - df_F} \right] / MSE(F)$$
where the residual sums of squares $SSE(F)$ and $SSE(R)$ and degrees of freedom $df_F$ and $df_R$ are for the full and reduced model while the mean square error $MSE(F)$ is for the full model. Let the null hypothesis for the partial F test be $H_0: L\beta = 0$ where $L$ sets the coefficients of the predictors in the full model but not in the reduced model to 0. Seber and Lee (2003, p. 100) shows that
$$F_R = \frac{[L\hat{\beta}]^T (L(X^TX)^{-1}L^T)^{-1} [L\hat{\beta}]}{r\hat{\sigma}^2}$$
is distributed as $F_{r,n-p}$ if $H_0$ is true and the errors are iid $N(0, \sigma^2)$. Note that for multiple linear regression with $m = 1$, $F_R = (n-p)U(L)/r$ since $\hat{\Sigma}_\epsilon^{-1} = 1/\hat{\sigma}^2$. Hence the scaled Hotelling Lawley test statistic is the partial F test statistic extended to $m > 1$ response variables by Theorem 8.6.
By Theorem 8.8, for example, $rF_R \xrightarrow{D} \chi^2_r$ for a large class of nonnormal error distributions. If $Z_n \sim F_{k,d_n}$, then $Z_n \xrightarrow{D} \chi^2_k/k$ as $d_n \to \infty$. Hence using the $F_{r,n-p}$ approximation gives a large sample test with correct asymptotic level, and the partial F test is robust to nonnormality.
Similarly, using an $F_{rm,n-pm}$ approximation for the following test statistics gives large sample tests with correct asymptotic level by Kakizawa (2009) and similar power for large $n$. The large sample test will have correct asymptotic level as long as the denominator degrees of freedom $d_n \to \infty$ as $n \to \infty$, and $d_n = n - pm$ reduces to the partial F test if $m = 1$ and $U(L)$ is used. Then the three test statistics are
$$\frac{-[n - p - 0.5(m - r + 3)]}{rm} \log(\Lambda(L)), \quad \frac{n-p}{rm} V(L), \quad \text{and} \quad \frac{n-p}{rm} U(L).$$
By Berndt and Savin (1977) and Anderson (1984, pp. 333, 371),

V (L) ≤ − log(Λ(L)) ≤ U (L).

Hence the Hotelling Lawley test will have the most power and Pillai’s test
will have the least power.
Following Khattree and Naik (1999, pp. 67-68), there are several approximations used by the SAS software. For the Roy's largest root test, if $h = \max(r, m)$, use
$$\frac{n - p - h + r}{h} \lambda_{max}(L) \approx F(h, n - p - h + r).$$
The simulations in Section 8.5 suggest that this approximation is good for $r = 1$ but poor for $r > 1$. Anderson (1984, p. 333) stated that Roy's largest root test has the greatest power if $r = 1$ but is an inferior test for $r > 1$. Let $g = n - p - (m - r + 1)/2$, $u = (rm - 2)/4$ and $t = \sqrt{r^2m^2 - 4}/\sqrt{m^2 + r^2 - 5}$ for $m^2 + r^2 - 5 > 0$ and $t = 1$, otherwise. Assume $H_0$ is true. Thus $U \xrightarrow{P} 0$, $V \xrightarrow{P} 0$, and $\Lambda \xrightarrow{P} 1$ as $n \to \infty$. Then
$$\frac{gt - 2u}{rm}\, \frac{1 - \Lambda^{1/t}}{\Lambda^{1/t}} \approx F(rm, gt - 2u) \quad \text{or} \quad (n-p)\, t\, \frac{1 - \Lambda^{1/t}}{\Lambda^{1/t}} \approx \chi^2_{rm}.$$
For large $n$ and $t > 0$, $-\log(\Lambda) = -t\log(\Lambda^{1/t}) = -t\log(1 + \Lambda^{1/t} - 1) \approx t(1 - \Lambda^{1/t}) \approx t(1 - \Lambda^{1/t})/\Lambda^{1/t}$. If it can not be shown that
$$(n-p)[-\log(\Lambda) - t(1 - \Lambda^{1/t})/\Lambda^{1/t}] \xrightarrow{P} 0 \text{ as } n \to \infty,$$
then it is possible that the approximate $\chi^2_{rm}$ distribution may be the limiting distribution for only a small class of iid error distributions. When the $\epsilon_i$ are iid $N_m(0, \Sigma_\epsilon)$, there are some exact results. For $r = 1$,
$$\frac{n - p - m + 1}{m}\, \frac{1 - \Lambda}{\Lambda} \sim F(m, n - p - m + 1).$$
For $r = 2$,
$$\frac{2(n - p - m + 1)}{2m}\, \frac{1 - \Lambda^{1/2}}{\Lambda^{1/2}} \sim F(2m, 2(n - p - m + 1)).$$
For $m = 2$,
$$\frac{2(n - p)}{2r}\, \frac{1 - \Lambda^{1/2}}{\Lambda^{1/2}} \sim F(2r, 2(n - p)).$$
Let $s = \min(r, m)$, $m_1 = (|r - m| - 1)/2$ and $m_2 = (n - p - m - 1)/2$. Note that $s(|r - m| + s) = \min(r, m)\max(r, m) = rm$. Then
$$\frac{n-p}{rm}\, \frac{V}{1 - V/s} = \frac{n-p}{s(|r - m| + s)}\, \frac{V}{1 - V/s} \approx \frac{2m_2 + s + 1}{2m_1 + s + 1}\, \frac{V}{s - V} \approx$$
$$F(s(2m_1 + s + 1), s(2m_2 + s + 1)) \approx F(s(|r-m| + s), s(n-p)) = F(rm, s(n-p)).$$
This approximation is asymptotically correct by Slutsky's theorem since $1 - V/s \xrightarrow{P} 1$. Finally,
$$\frac{n-p}{rm}\, U = \frac{n-p}{s(|r - m| + s)}\, U \approx \frac{2(sm_2 + 1)}{s^2(2m_1 + s + 1)}\, U \approx F(s(2m_1 + s + 1), 2(sm_2 + 1))$$
$$\approx F(s(|r-m| + s), s(n-p)) = F(rm, s(n-p)).$$
This approximation is asymptotically correct for a wide range of iid error distributions.
Multivariate analogs of tests for multiple linear regression can be derived with appropriate choice of $L$. Assume a constant $x_1 = 1$ is in the model. As a textbook convention, use $\delta = 0.05$ if $\delta$ is not given.
The four step MANOVA test of linear hypotheses is useful.
i) State the hypotheses $H_0: LB = 0$ and $H_1: LB \neq 0$.
ii) Get the test statistic from output.
iii) Get the pval from output.
iv) State whether you reject $H_0$ or fail to reject $H_0$. If pval $\leq \delta$, reject $H_0$ and conclude that $LB \neq 0$. If pval $> \delta$, fail to reject $H_0$ and conclude that $LB = 0$ or that there is not enough evidence to conclude that $LB \neq 0$.

The MANOVA test of $H_0: B = 0$ versus $H_1: B \neq 0$ is the special case corresponding to $L = I$ and $H = \hat{B}^T X^TX \hat{B} = \hat{Z}^T\hat{Z}$, but is usually not a test of interest.

The analog of the ANOVA F test for multiple linear regression is the
MANOVA F test that uses L = [0 I p−1 ] to test whether the nontrivial
predictors are needed in the model. This test should reject H0 if the response
and residual plots look good, n is large enough, and at least one response
plot does not look like the corresponding residual plot. A response plot for
Yj will look like a residual plot if the identity line appears almost horizontal,
hence the range of Ŷj is small. Response and residual plots are often useful
for n ≥ 10p.
The 4 step MANOVA F test of hypotheses uses L = [0 I p−1 ].
i) State the hypotheses H0 : the nontrivial predictors are not needed in the
mreg model H1 : at least one of the nontrivial predictors is needed.
ii) Find the test statistic F0 from output.
iii) Find the pval from output.
iv) If pval ≤ δ, reject H0 . If pval > δ, fail to reject H0 . If H0 is rejected,
conclude that there is a mreg relationship between the response variables
Y1 , ..., Ym and the predictors x2 , ..., xp . If you fail to reject H0 , conclude
that there is a not a mreg relationship between Y1 , ..., Ym and the predictors
x2 , ..., xp . (Or there is not enough evidence to conclude that there is a
mreg relationship between the response variables and the predictors. Get the
variable names from the story problem.)

The $F_j$ test of hypotheses uses $L_j = [0, ..., 0, 1, 0, ..., 0]$, where the 1 is in the $j$th position, to test whether the $j$th predictor $x_j$ is needed in the model given that the other $p - 1$ predictors are in the model. This test is an analog of the t tests for multiple linear regression. Note that $x_j$ is not needed in the model corresponds to $H_0: B_j = 0$ while $x_j$ needed in the model corresponds to $H_1: B_j \neq 0$ where $B_j^T$ is the $j$th row of $B$.
The 4 step Fj test of hypotheses uses Lj = [0, ..., 0, 1, 0, ..., 0] where the 1
is in the jth position.
i) State the hypotheses H0 : xj is not needed in the model
H1 : xj is needed.
ii) Find the test statistic Fj from output.
iii) Find pval from output.
iv) If pval ≤ δ, reject H0 . If pval > δ, fail to reject H0 . Give a nontechnical
sentence restating your conclusion in terms of the story problem. If H0 is
rejected, then conclude that xj is needed in the mreg model for Y1 , ..., Ym
given that the other predictors are in the model. If you fail to reject H0 , then
conclude that xj is not needed in the mreg model for Y1 , ..., Ym given that
the other predictors are in the model. (Or there is not enough evidence to
conclude that xj is needed in the model. Get the variable names from the
story problem.)

The Hotelling Lawley statistic
$$F_j = \frac{1}{d_j} \hat{B}_j^T \hat{\Sigma}_\epsilon^{-1} \hat{B}_j = \frac{1}{d_j} (\hat{\beta}_{j1}, \hat{\beta}_{j2}, ..., \hat{\beta}_{jm}) \hat{\Sigma}_\epsilon^{-1} \begin{pmatrix} \hat{\beta}_{j1} \\ \hat{\beta}_{j2} \\ \vdots \\ \hat{\beta}_{jm} \end{pmatrix}$$
where $\hat{B}_j^T$ is the $j$th row of $\hat{B}$ and $d_j = (X^TX)^{-1}_{jj}$, the $j$th diagonal entry of $(X^TX)^{-1}$. The statistic $F_j$ could be used for forward selection and backward elimination in variable selection.
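A minimal R sketch of the $F_j$ statistics, assuming Bhat (p x m), Sigehat (= $W_e/(n-p)$), and XtXinv from a least squares fit; the object names are hypothetical.

# sketch: Fj statistics for j = 1, ..., p from a least squares fit
d  <- diag(XtXinv)                       # d_j = jth diagonal of (X^T X)^{-1}
Fj <- sapply(1:nrow(Bhat), function(j)
        drop(Bhat[j, ] %*% solve(Sigehat) %*% Bhat[j, ]) / d[j])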

The 4 step MANOVA partial F test of hypotheses has a full model


using all of the variables and a reduced model where r of the variables are
deleted. The ith row of L has a 1 in the position corresponding to the ith
variable to be deleted. Omitting the jth variable corresponds to the Fj test
while omitting variables x2 , ..., xp corresponds to the MANOVA F test. Using
L = [0 I k ] tests whether the last k predictors are needed in the multivariate
linear regression model given that the remaining predictors are in the model.
i) State the hypotheses H0 : the reduced model is good H1 : use the full
model.
ii) Find the test statistic FR from output.
iii) Find the pval from output.
iv) If pval ≤ δ, reject H0 and conclude that the full model should be used.
If pval > δ, fail to reject H0 and conclude that the reduced model is good.

The linmodpack function mltreg produces the m response and residual


plots, gives B̂, Σ̂  , the MANOVA partial F test statistic and pval corre-
sponding to the reduced model that leaves out the variables given by indices
(so x2 and x4 in the output below with F = 0.77 and pval = 0.614), Fj and
the pval for the Fj test for variables 1, 2, ..., p (where p = 4 in the output
below so F2 = 1.51 with pval = 0.284), and F0 and pval for the MANOVA
F test (in the output below F0 = 3.15 and pval= 0.06). Right click Stop
on the plots m times to advance the plots and to get the cursor back on the
command line in R.
The command out <- mltreg(x,y,indices=c(2)) would produce
a MANOVA partial F test corresponding to the F2 test while the command
out <- mltreg(x,y,indices=c(2,3,4)) would produce a MANOVA
partial F test corresponding to the MANOVA F test for a data set with
p = 4 predictor variables. The Hotelling Lawley trace statistic is used in the
tests.
out <- mltreg(x,y,indices=c(2,4))
$Bhat
[,1] [,2] [,3]
[1,] 47.96841291 623.2817463 179.8867890
[2,] 0.07884384 0.7276600 -0.5378649


[3,] -1.45584256 -17.3872206 0.2337900
[4,] -0.01895002 0.1393189 -0.3885967
$Covhat
[,1] [,2] [,3]
[1,] 21.91591 123.2557 132.339
[2,] 123.25566 2619.4996 2145.780
[3,] 132.33902 2145.7797 2954.082
$partial
partialF Pval
[1,] 0.7703294 0.6141573

$Ftable
Fj pvals
[1,] 6.30355375 0.01677169
[2,] 1.51013090 0.28449166
[3,] 5.61329324 0.02279833
[4,] 0.06482555 0.97701447

$MANOVA
MANOVAF pval
[1,] 3.150118 0.06038742
#Output for Example 8.2
y<-marry[,c(2,3)]; x<-marry[,-c(2,3)];
mltreg(x,y,indices=c(3,4))
$partial

partialF Pval
[1,] 0.2001622 0.9349877
$Ftable
Fj pvals
[1,] 4.35326807 0.02870083
[2,] 600.57002201 0.00000000
[3,] 0.08819810 0.91597268
[4,] 0.06531531 0.93699302
$MANOVA
MANOVAF pval
[1,] 295.071 1.110223e-16
Example 8.2. The above output is for the Hebbler (1847) data from
the 1843 Prussia census. Sometimes if the wife or husband was not at the
household, then s/he would not be counted. Y1 = number of married civilian
men in the district, Y2 = number of women married to civilians in the district,
x2 = population of the district in 1843, x3 = number of married military men
in the district, and x4 = number of women married to military men in the


district. The reduced model deletes x3 and x4 . The constant uses x1 = 1.
a) Do the MANOVA F test.
b) Do the F2 test.
c) Do the F4 test.
d) Do an appropriate 4 step test for the reduced model that deletes x3
and x4 .
e) The output for the reduced model that deletes x1 and x2 is shown below.
Do an appropriate 4 step test.
$partial
partialF Pval
[1,] 569.6429 0
Solution:
a) i) H0 : the nontrivial predictors are not needed in the mreg model
H1 : at least one of the nontrivial predictors is needed
ii) F0 = 295.071
iii) pval = 0
iv) Reject H0 , the nontrivial predictors are needed in the mreg model.
b) i) H0 : x2 is not needed in the model H1 : x2 is needed
ii) F2 = 600.57
iii) pval = 0
iv) Reject H0 , population of the district is needed in the model.
c) i) H0 : x4 is not needed in the model H1 : x4 is needed
ii) F4 = 0.065
iii) pval = 0.937
iv) Fail to reject H0 , number of women married to military men is not
needed in the model given that the other predictors are in the model.
d) i) H0 : the reduced model is good H1 : use the full model.
ii) FR = 0.200
iii) pval = 0.935
iv) Fail to reject H0 , so the reduced model is good.
e) i) H0 : the reduced model is good H1 : use the full model.
ii) FR = 569.6
iii) pval = 0.00
iv) Reject H0 , so use the full model.

8.5 An Example and Simulations

In the DD plot, cases to the left of the vertical line are in their nonparametric
prediction region. The long horizontal line corresponds to a similar cutoff
based on the RD. The shorter horizontal line that ends at the identity line
is the parametric MVN prediction region from Section 4.4 applied to the
ẑi . Points below these two lines are only conjectured to be large sample
prediction regions, but are added to the DD plot as visual aids. Note that
ẑi = ŷ f + ˆ
i , and adding a constant ŷ f to all of the residual vectors does not
change the Mahalanobis distances, so the DD plot of the residual vectors can
be used to display the prediction regions.

[Figure: Response Plot (Y versus FIT) and Residual Plot (RES versus FIT); cases 11, 37, 48, and 79 are labeled.]
Fig. 8.1 Plots for Y1 = log(S).

Example 8.3. Cook and Weisberg (1999, pp. 351, 433, 447) gave a data
set on 82 mussels sampled off the coast of New Zealand. Let Y1 = log(S)
and Y2 = log(M ) where S is the shell mass and M is the muscle mass.
The predictors are X2 = L, X3 = log(W ), and X4 = H: the shell length,
log(width), and height. To check linearity of the multivariate linear regression
model, Figures 8.1 and 8.2 give the response and residual plots for Y1 and
Y2 . The response plots show strong linear relationships. For Y1 , case 79 sticks
out while for Y2 , cases 8, 25, and 48 are not fit well. Highlighted cases had
Cook’s distance > min(0.5, 2p/n). See Cook (1977).
To check the error vector distribution, the DD plot should be used instead of univariate residual plots, which do not take into account the correlations of the random variables $\epsilon_1, ..., \epsilon_m$ in the error vector $\epsilon$. A residual vector $\hat{\epsilon} = (\hat{\epsilon} - \epsilon) + \epsilon$ is a combination of $\epsilon$ and a discrepancy $\hat{\epsilon} - \epsilon$ that tends to have an approximate multivariate normal distribution. The $\hat{\epsilon} - \epsilon$ term can dominate for small to moderate $n$ when $\epsilon$ is not multivariate normal,
[Figure: Response Plot (Y versus FIT) and Residual Plot (RES versus FIT); cases 8, 25, and 48 are labeled.]
Fig. 8.2 Plots for Y2 = log(M).

[Figure: DD plot of RD versus MD; cases 8, 48, and 79 are labeled.]
Fig. 8.3 DD Plot of the Residual Vectors for the Mussels Data.
incorrectly suggesting that the distribution of the error vector $\epsilon$ is closer to a


multivariate normal distribution than is actually the case. Figure 8.3 shows
the DD plot of the residual vectors. The plotted points are highly correlated
but do not cover the identity line, suggesting an elliptically contoured error
distribution that is not multivariate normal. The nonparametric 90% predic-
tion region for the residuals consists of the points to the left of the vertical
line M D = 2.60. Cases 8, 48, and 79 have especially large distances.
The four Hotelling Lawley Fj statistics were greater than 5.77 with pvalues
less than 0.005, and the MANOVA F statistic was 337.8 with pvalue ≈ 0.
The response, residual, and DD plots are effective for finding influential
cases, for checking linearity, for checking whether the error distribution is
multivariate normal or some other elliptically contoured distribution, and
for displaying the nonparametric prediction region. Note that cases to the
right of the vertical line correspond to cases with y i that are not in their
prediction region. These are the cases corresponding to residual vectors with
large Mahalanobis distances. Adding a constant does not change the distance,
so the DD plot for the residual vectors is the same as the DD plot for the ẑ i .

[Figure: Response Plot (Y versus FIT) and Residual Plot (RES versus FIT).]
Fig. 8.4 Plots for Y2 = M.

c) Now suppose the same model is used except Y2 = M . Then the response
and residual plots for Y1 remain the same, but the plots shown in Figure 8.4
show curvature about the identity and r = 0 lines. Hence the linearity condi-
tion is violated. Figure 8.5 shows that the plotted points in the DD plot have
correlation well less than one, suggesting that the error vector distribution
[Figure: DD plot of RD versus MD.]
Fig. 8.5 DD Plot When Y2 = M.
is no longer elliptically contoured. The nonparametric 90% prediction region


for the residual vectors consists of the points to the left of the vertical line
M D = 2.52, and contains 95% of the training data. Note that the plots can
be used to quickly assess whether power transformations have resulted in a
linear model, and whether influential cases are present. R code for producing
the five figures is shown below.
y <- log(mussels)[,4:5]
x <- mussels[,1:3]
x[,2] <- log(x[,2])
z<-cbind(x,y) #scatterplot matrix
pairs(z, labels=c("L","log(W)","H","log(S)","log(M)"))
ddplot4(z) #right click Stop, DD plot of MLD model
out <- mltreg(x,y) #right click Stop 4 times, Fig. 8.1, 8.2
ddplot4(out$res) #right click Stop, Fig. 8.3
y[,2] <- mussels[,5]
tem <- mltreg(x,y) #right click Stop 4 times, Fig. 8.4
ddplot4(tem$res) #right click Stop, Fig. 8.5

8.5.1 Simulations for Testing

A small simulation was used to study the Wilks’ Λ test, the Pillai’s trace
test, the Hotelling Lawley trace test, and the Roy’s largest root test for the
Fj tests and the MANOVA F test for multivariate linear regression. The first
row of B was always 1T and the last row of B was always 0T . When the null
hypothesis for the MANOVA F test is true, all but the first row corresponding
to the constant are equal to 0T . When p ≥ 3 and the null hypothesis for the
MANOVA F test is false, then the second to last row of B is (1, 0, ..., 0),
the third to last row is (1, 1, 0, ..., 0) et cetera as long as the first row is
not changed from $1^T$. First $m \times 1$ error vectors $w_i$ were generated such that the $m$ random variables in the vector $w_i$ are iid with variance $\sigma^2$. Let the $m \times m$ matrix $A = (a_{ij})$ with $a_{ii} = 1$ and $a_{ij} = \psi$ where $0 \leq \psi < 1$ for $i \neq j$. Then $\epsilon_i = A w_i$ so that $\Sigma_\epsilon = \sigma^2 A A^T = (\sigma_{ij})$ where the diagonal entries $\sigma_{ii} = \sigma^2[1 + (m-1)\psi^2]$ and the off diagonal entries $\sigma_{ij} = \sigma^2[2\psi + (m-2)\psi^2]$ where $\psi = 0.10$. Hence the correlations are $(2\psi + (m-2)\psi^2)/(1 + (m-1)\psi^2)$. As $\psi$ gets close to 1, the error vectors cluster about the line in the direction of $(1, ..., 1)^T$. We used $w_i \sim N_m(0, I)$, $w_i \sim (1-\tau)N_m(0, I) + \tau N_m(0, 25I)$ with $0 < \tau < 1$ and $\tau = 0.25$ in the simulation, $w_i \sim$ multivariate $t_d$ with $d = 7$ degrees of freedom, or $w_i \sim$ lognormal $-$ E(lognormal): where the $m$ components of $w_i$ were iid with distribution $e^z - E(e^z)$ where $z \sim N(0,1)$. Only the lognormal distribution is not elliptically contoured.
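A minimal R sketch of this error generation scheme for the multivariate normal case; the object names and settings below are hypothetical illustrations, not the mregsim code.

# sketch: generate n correlated error vectors eps_i = A w_i with psi = 0.1
n <- 100; m <- 5; psi <- 0.1; sigma <- 1
A <- matrix(psi, m, m); diag(A) <- 1
w <- matrix(rnorm(n*m, sd = sigma), n, m)   # iid N(0, sigma^2) components
eps <- w %*% t(A)                           # rows are eps_i^T = (A w_i)^T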

Table 8.1 Test Coverages: MANOVA F H0 is True.


w dist n test F1 F2 Fp−1 Fp FM
MVN 300 W 1 0.043 0.042 0.041 0.018
MVN 300 P 1 0.040 0.038 0.038 0.007
MVN 300 HL 1 0.059 0.058 0.057 0.045
MVN 300 R 1 0.051 0.049 0.048 0.993
MVN 600 W 1 0.048 0.043 0.043 0.034
MVN 600 P 1 0.046 0.042 0.041 0.026
MVN 600 HL 1 0.055 0.052 0.050 0.052
MVN 600 R 1 0.052 0.048 0.047 0.994
MIX 300 W 1 0.042 0.043 0.044 0.017
MIX 300 P 1 0.039 0.040 0.042 0.008
MIX 300 HL 1 0.057 0.059 0.058 0.039
MIX 300 R 1 0.050 0.050 0.051 0.993
MVT(7) 300 W 1 0.048 0.036 0.045 0.020
MVT(7) 300 P 1 0.046 0.032 0.042 0.011
MVT(7) 300 HL 1 0.064 0.049 0.058 0.045
MVT(7) 300 R 1 0.055 0.043 0.051 0.993
LN 300 W 1 0.043 0.047 0.040 0.020
LN 300 P 1 0.039 0.045 0.037 0.009
LN 300 HL 1 0.057 0.061 0.058 0.041
LN 300 R 1 0.049 0.055 0.050 0.994
Table 8.2 Test Coverages: MANOVA F H0 is False.


n m = p test F1 F2 Fp−1 Fp FM
30 5 W 0.012 0.222 0.058 0.000 0.006
30 5 P 0.000 0.000 0.000 0.000 0.000
30 5 HL 0.382 0.694 0.322 0.007 0.579
30 5 R 0.799 0.871 0.549 0.047 0.997
50 5 W 0.984 0.955 0.644 0.017 0.963
50 5 P 0.971 0.940 0.598 0.012 0.871
50 5 HL 0.997 0.979 0.756 0.053 0.991
50 5 R 0.996 0.978 0.744 0.049 1
105 10 W 0.650 0.970 0.191 0.000 0.633
105 10 P 0.109 0.812 0.050 0.000 0.000
105 10 HL 0.964 0.997 0.428 0.000 1
105 10 R 1 1 0.892 0.052 1
150 10 W 1 1 0.948 0.032 1
150 10 P 1 1 0.941 0.025 1
150 10 HL 1 1 0.966 0.060 1
150 10 R 1 1 0.965 0.057 1
450 20 W 1 1 0.999 0.020 1
450 20 P 1 1 0.999 0.016 1
450 20 HL 1 1 0.999 0.035 1
450 20 R 1 1 0.999 0.056 1

The simulation used 5000 runs, and $H_0$ was rejected if the F statistic was greater than $F_{d_1,d_2}(0.95)$ where $P(F_{d_1,d_2} < F_{d_1,d_2}(0.95)) = 0.95$ with $d_1 = rm$ and $d_2 = n - mp$ for the test statistics
$$\frac{-[n - p - 0.5(m - r + 3)]}{rm}\log(\Lambda(L)), \quad \frac{n-p}{rm}V(L), \quad \text{and} \quad \frac{n-p}{rm}U(L),$$
while $d_1 = h = \max(r, m)$ and $d_2 = n - p - h + r$ for the test statistic
$$\frac{n - p - h + r}{h}\lambda_{max}(L).$$
Denote these statistics by W , P , HL, and R. Let the coverage be the propor-
tion of times that H0 is rejected. We want coverage near 0.05 when H0 is true
and coverage close to 1 for good power when H0 is false. With 5000 runs,
coverage outside of (0.04,0.06) suggests that the true coverage is not 0.05.
Coverages are tabled for the F1 , F2 , Fp−1 , and Fp test and for the MANOVA
F test denoted by FM . The null hypothesis H0 was always true for the Fp
test and always false for the F1 test. When the MANOVA F test was true,
H0 was true for the Fj tests with j ≠ 1. When the MANOVA F test was false, H0 was false for the Fj tests with j ≠ p, but the Fp−1 test should be hardest to reject for j ≠ p by construction of B and the error vectors.
When the null hypothesis H0 was true, simulated values started to get close
to nominal levels for n ≥ 0.8(m+p)2 , and were fairly good for n ≥ 1.5(m+p)2 .
The exception was Roy’s test which rejects H0 far too often if r > 1. See Table
8.1 where we want values for the F1 test to be close to 1 since H0 is false
for the F1 test, and we want values close to 0.05, otherwise. Roy’s test was
very good for the Fj tests but very poor for the MANOVA F test. Results
are shown for m = p = 10. As expected from Berndt and Savin (1977),
Pillai’s test rejected H0 less often than Wilks’ test which rejected H0 less
often than the Hotelling Lawley test. Based on a much larger simulation
study, using the four types of error vector distributions and m = p, the tests
had approximately correct level if n ≥ 0.83(m + p)2 for the Hotelling Lawley
test, if n ≥ 2.80(m + p)2 for the Wilks’ test (agreeing with Kshirsagar (1972)
n ≥ 3(m + p)2 for multivariate normal data), and if n ≥ 4.2(m + p)2 for
Pillai’s test.
In Table 8.2, H0 is only true for the Fp test where p = m, and we want
values in the Fp column near 0.05. We want values near 1 for high power
otherwise. If H0 is false, often H0 will be rejected for small n. For example,
if n ≥ 10p, then the m residual plots should start to look good, and the
MANOVA F test should be rejected. For the simulated data, the test had
fair power for n not much larger than mp. Results are shown for the lognormal
distribution.
Some R output for reproducing the simulation is shown below. The linmod-
pack function is mregsim and etype = 1 uses data from a MVN distribution.
The fcov line computed the Hotelling Lawley statistic using Equation (8.3)
while the hotlawcov line used Definition 8.9. The mnull=T part of the com-
mand means we want the first value near 1 for high power and the next three
numbers near the nominal level 0.05 except for mancv where we want all
of the MANOVA F test statistics to be near the nominal level of 0.05. The mnull=F part of the command means we want all values near 1 for high power except for the last column (for the terms other than mancv) corresponding to the Fp test where H0 is true, so we want values near the nominal level of 0.05.
The “coverage” is the proportion of times that H0 is rejected, so “coverage”
is short for “power” and “level”: we want the coverage near 1 for high power
when H0 is false and we want the coverage near the nominal level 0.05 when
H0 is true. Also see Problem 8.10.
mregsim(nruns=5000,etype=1,mnull=T)
$wilkcov
[1] 1.0000 0.0450 0.0462 0.0430
$pilcov
[1] 1.0000 0.0414 0.0432 0.0400
$hotlawcov
[1] 1.0000 0.0522 0.0516 0.0490
$roycov
[1] 1.0000 0.0512 0.0500 0.0480
$fcov
[1] 1.0000 0.0522 0.0516 0.0490
$mancv
wcv pcv hlcv rcv fcv
[1,] 0.0406 0.0332 0.049 0.1526 0.049

mregsim(nruns=5000,etype=2,mnull=F)

$wilkcov
[1] 0.9834 0.9814 0.9104 0.0408
$pilcov
[1] 0.9824 0.9804 0.9064 0.0372
$hotlawcov
[1] 0.9856 0.9838 0.9162 0.0480
$roycov
[1] 0.9848 0.9834 0.9156 0.0462
$fcov
[1] 0.9856 0.9838 0.9162 0.0480
$mancv
wcv pcv hlcv rcv fcv
[1,] 0.993 0.9918 0.9942 0.9978 0.9942
See Olive (2017b, Section 12.5.2) for simulations for the prediction region. Also see Problem 8.11.

8.6 The Robust rmreg2 Estimator

The robust multivariate linear regression estimator rmreg2 is the classi-


cal multivariate linear regression estimator applied to the RMVN set when
RMVN is computed from the vectors ui = (xi2 , ..., xip, Yi1 , ..., Yim)T for
i = 1, ..., n. Hence ui is the ith case with xi1 = 1 deleted. This regression
estimator has considerable outlier resistance, and is one of the most outlier
resistant practical robust regression estimator for the m = 1 multiple linear
regression case. See Chapter 7. The rmreg2 estimator has been shown to be
consistent if the ui are iid from a large class of elliptically contoured distri-
butions, which is a much stronger assumption than having iid error vectors
i .
Theorem 2.20 gave a second way to compute $\hat{\beta}$, and there is a similar result for multivariate linear regression. Let $x = (1, u^T)^T$ and let $\beta = (\beta_1, \beta_2^T)^T = (\alpha, \eta^T)^T$. Now for multivariate linear regression, $\hat{\beta}_j = (\hat{\alpha}_j, \hat{\eta}_j^T)^T$ where $\hat{\alpha}_j = \overline{Y}_j - \hat{\eta}_j^T \overline{u}$ and $\hat{\eta}_j = \hat{\Sigma}_u^{-1} \hat{\Sigma}_{u Y_j}$ by Theorem 2.20. Let $\hat{\Sigma}_{uy} = \frac{1}{n-1}\sum_{i=1}^n (u_i - \overline{u})(y_i - \overline{y})^T$, which has $j$th column $\hat{\Sigma}_{u Y_j}$ for $j = 1, ..., m$. Let
$$v = \begin{pmatrix} u \\ y \end{pmatrix}, \quad E(v) = \mu_v = \begin{pmatrix} E(u) \\ E(y) \end{pmatrix} = \begin{pmatrix} \mu_u \\ \mu_y \end{pmatrix}, \quad \text{and} \quad Cov(v) = \Sigma_v = \begin{pmatrix} \Sigma_{uu} & \Sigma_{uy} \\ \Sigma_{yu} & \Sigma_{yy} \end{pmatrix}.$$
Let the vector of constants be $\alpha^T = (\alpha_1, ..., \alpha_m)$ and the matrix of slope vectors $B_S = [\eta_1\ \eta_2\ \ldots\ \eta_m]$. Then the population least squares coefficient matrix is
$$B = \begin{pmatrix} \alpha^T \\ B_S \end{pmatrix}$$
where $\alpha = \mu_y - B_S^T \mu_u$ and $B_S = \Sigma_u^{-1} \Sigma_{uy}$ where $\Sigma_u = \Sigma_{uu}$.
If the $u_i$ are iid with nonsingular covariance matrix Cov($u$), the least squares estimator is
$$\hat{B} = \begin{pmatrix} \hat{\alpha}^T \\ \hat{B}_S \end{pmatrix}$$
where $\hat{\alpha} = \overline{y} - \hat{B}_S^T \overline{u}$ and $\hat{B}_S = \hat{\Sigma}_u^{-1} \hat{\Sigma}_{uy}$. The least squares multivariate linear regression estimator can be calculated by computing the classical estimator $(\overline{v}, S_v) = (\overline{v}, \hat{\Sigma}_v)$ of multivariate location and dispersion on the $v_i$, and then plugging the results into the formulas for $\hat{\alpha}$ and $\hat{B}_S$.
Let $(T, C) = (\tilde{\mu}_v, \tilde{\Sigma}_v)$ be a robust estimator of multivariate location and dispersion. If $\tilde{\mu}_v$ is a consistent estimator of $\mu_v$ and $\tilde{\Sigma}_v$ is a consistent estimator of $c\,\Sigma_v$ for some constant $c > 0$, then a robust estimator of multivariate linear regression is the plug in estimator $\tilde{\alpha} = \tilde{\mu}_y - \tilde{B}_S^T \tilde{\mu}_u$ and $\tilde{B}_S = \tilde{\Sigma}_u^{-1} \tilde{\Sigma}_{uy}$.
For the rmreg2 estimator, $(T, C)$ is the classical estimator applied to the RMVN set when RMVN is applied to the vectors $v_i$ for $i = 1, ..., n$ (we could use $(T, C)$ = RMVN estimator since the scaling does not matter for this application). Then $(T, C)$ is a $\sqrt{n}$ consistent estimator of $(\mu_v, c\,\Sigma_v)$ if the $v_i$ are iid from a large class of $EC_d(\mu_v, \Sigma_v, g)$ distributions where $d = m + p - 1$. Thus the classical and robust estimators of multivariate linear regression are both $\sqrt{n}$ consistent estimators of $B$ if the $v_i$ are iid from a large class of elliptically contoured distributions. This assumption is quite strong, but the robust estimator is useful for detecting outliers. When there are categorical predictors or the joint distribution of $v$ is not elliptically contoured, it is possible that the robust estimator is bad and very different from the good classical least squares estimator. The linmodpack function rmreg2 computes the rmreg2 estimator and produces the response and residual plots.
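A minimal R sketch of the plug in idea, using the classical $(\overline{v}, S_v)$ for illustration (rmreg2 instead uses the estimator applied to the RMVN set); the function name plugin.mreg and arguments are hypothetical.

# sketch: plug in multivariate linear regression from (center, Cov) of v = (u, y)
plugin.mreg <- function(u, y, center = colMeans(cbind(u, y)),
                        Cov = cov(cbind(u, y))){
  q <- ncol(u)                                  # q = p - 1 predictors in u
  Suu <- Cov[1:q, 1:q]; Suy <- Cov[1:q, -(1:q), drop = FALSE]
  Bs <- solve(Suu, Suy)                         # slope matrix B_S
  alpha <- center[-(1:q)] - t(Bs) %*% center[1:q]
  rbind(t(alpha), Bs)                           # estimate of B = (alpha^T; B_S)
}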

Example 8.4. Buxton (1920) gave various measurements of 88 men. Let


Y1 = nasal height and Y2 = height with x2 = head length, x3 = bigonal breadth,
and x4 = cephalic index. Five individuals, numbers 62–66, were reported to
be about 0.75 inches tall with head lengths well over five feet! Thus Y2 and
x2 have massive outliers. Figures 8.6 and 8.7 show that the response and
residual plots corresponding to rmreg2 do not have fits that pass through
the outliers.
These figures can be made with the following R commands.

[Figure: Response Plot (y[,1] versus fit[,1]) and Residual Plot (res[,1] versus fit[,1]).]
Fig. 8.6 Plots for Y1 = nasal height using rmreg2.

[Figure: Response Plot (y[,i] versus fit[,i]) and Residual Plot (res[,i] versus fit[,i]).]
Fig. 8.7 Plots for Y2 = height using rmreg2.

ht <- buxy; z <- cbind(buxx,ht);


y <- z[,c(2,5)]; x <- z[,c(1,3,4)]
# compare mltreg(x,y) #right click Stop 4 times
out <- rmreg2(x,y) #right click Stop 4 times
# try ddplot4(out$res) #right click Stop
The residual bootstrap for the test $H_0: LB = 0$ may be useful. Take a sample of size $n$ with replacement from the residual vectors to form $Z_1^*$ with $i$th row $y_i^{*T}$ where $y_i^* = \hat{y}_i + \epsilon_i^*$. The function rmreg3 gets the rmreg2 estimator without the plots. Using rmreg3, regress $Z_1^*$ on $X$ to get $vec(L\hat{B}_1^*)$. Repeat $B$ times to get a bootstrap sample $w_1, ..., w_B$ where $w_i = vec(L\hat{B}_i^*)$. The nonparametric bootstrap uses $n$ cases drawn with replacement, and may also be useful. Apply the nonparametric prediction region to the $w_i$ and see if 0 is in the region. If $L$ is $r \times p$, then $w$ is $rm \times 1$, and we likely need $n \geq \max[50rm, 3(m+p)^2]$.

8.7 Bootstrap

8.7.1 Parametric Bootstrap

The parametric bootstrap for the multivariate linear regression model uses $y_i^* \sim N_m(\hat{B}^T x_i, \hat{\Sigma}_\epsilon)$ for $i = 1, ..., n$ where we are not assuming that the $\epsilon_i \sim N_m(0, \Sigma_\epsilon)$. Let $Z_j^*$ have $i$th row $y_i^{*T}$ and regress $Z_j^*$ on $X$ to obtain $\hat{B}_j^*$ for $j = 1, ..., B$. Let $S \subseteq I$, let $\hat{B}_I^* = (X_I^TX_I)^{-1}X_I^T Z^*$, and assume $n(X_I^TX_I)^{-1} \xrightarrow{P} W_I$ for any $I$ such that $S \subseteq I$. Then with calculations similar to those for the multiple linear regression model parametric bootstrap of Section 4.6.1, $E(\hat{B}_I^*) = \hat{B}_I$,
$$\sqrt{n}\, vec(\hat{B}_I - B_I) \xrightarrow{D} N_{a_I m}(0, \Sigma_\epsilon \otimes W_I),$$
and $\sqrt{n}\, vec(\hat{B}_I^* - \hat{B}_I) \sim N_{a_I m}(0, \hat{\Sigma}_\epsilon \otimes n(X_I^TX_I)^{-1}) \xrightarrow{D} N_{a_I m}(0, \Sigma_\epsilon \otimes W_I)$ as $n, B \to \infty$ if $S \subseteq I$. Let $\hat{B}_{I,0}^*$ be formed from $\hat{B}_I^*$ by adding rows of zeros corresponding to omitted variables.
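A minimal R sketch of this parametric bootstrap (full model), assuming X, Bhat, and Sigehat from a least squares fit and using MASS::mvrnorm; the object names are hypothetical.

# sketch: parametric bootstrap sample Bhatstar[,,1], ..., Bhatstar[,,B]
library(MASS)
B <- 1000; n <- nrow(X); p <- ncol(X); m <- ncol(Bhat)
XtXinvXt <- solve(crossprod(X), t(X))
Bhatstar <- array(0, c(p, m, B))
for(j in 1:B){
  Zstar <- X %*% Bhat + mvrnorm(n, mu = rep(0, m), Sigma = Sigehat)
  Bhatstar[,, j] <- XtXinvXt %*% Zstar    # least squares of Zstar on X
}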

8.7.2 Residual Bootstrap

The residual bootstrap uses the multivariate linear regression model
$$Z^* = X\hat{B} + \hat{E}^W$$
where the rows of $\hat{E}^W$ are sampled with replacement from the rows of $\hat{E}$. Regress $Z^*$ on $X$ and repeat to get the bootstrap sample $\hat{B}_1^*, ..., \hat{B}_B^*$.
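A minimal sketch, continuing the hypothetical names from the parametric bootstrap sketch above.

# sketch: residual bootstrap for multivariate linear regression
Ehat <- Z - X %*% Bhat                      # residual matrix
for(j in 1:B){
  Zstar <- X %*% Bhat + Ehat[sample(1:n, n, replace = TRUE), ]
  Bhatstar[,, j] <- XtXinvXt %*% Zstar
}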

8.7.3 Nonparametric Bootstrap

The nonparametric bootstrap samples cases (y Ti , xTi )T with replacement to



form (Z ∗j , X ∗j ), and regresses Z ∗j on X ∗j to get B̂ j for j = 1, ..., B. The
nonparametric bootstrap can be useful even if heteroscedasticity or overdis-
persion is present, if the cases are an iid sample from some population, a
very strong assumption. See Eck (2018) for using the residual bootstrap and
nonparametric bootstrap to bootstrap multivariate linear regression.
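A minimal sketch, again with the same hypothetical names.

# sketch: nonparametric (cases) bootstrap
for(j in 1:B){
  idx <- sample(1:n, n, replace = TRUE)
  Xs <- X[idx, ]; Zs <- Z[idx, ]
  Bhatstar[,, j] <- solve(crossprod(Xs), crossprod(Xs, Zs))
}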

8.8 Data Splitting

The theory for multivariate linear regression assumes that the model is known
before gathering data. If variable selection and response transformations are
performed to build a model, then the estimators are biased and results for
inference fail to hold in that pvalues and coverage of confidence and prediction
regions will be wrong.
Data splitting can be used in a manner similar to how data splitting is
used for MLR and other regression models. A pilot study is an alternative to
data splitting.

8.9 Summary

1) The multivariate linear regression model is a special case of the multi-


variate linear model where at least one predictor variable xj is continuous.
The MANOVA model in Chapter 9 is a multivariate linear model where all
of the predictors are categorical variables so the xj are coded and are often
indicator variables.
2) The multivariate linear regression model $y_i = B^T x_i + \epsilon_i$ for $i = 1, ..., n$ has $m \geq 2$ response variables $Y_1, ..., Y_m$ and $p$ predictor variables $x_1, x_2, ..., x_p$. The $i$th case is $(x_i^T, y_i^T) = (x_{i1}, x_{i2}, ..., x_{ip}, Y_{i1}, ..., Y_{im})$. The constant $x_{i1} = 1$ is in the model, and is often omitted from the case and the data matrix. The model is written in matrix form as $Z = XB + E$. The model has $E(\epsilon_k) = 0$ and $Cov(\epsilon_k) = \Sigma_\epsilon = (\sigma_{ij})$ for $k = 1, ..., n$. Also $E(e_i) = 0$ while $Cov(e_i, e_j) = \sigma_{ij} I_n$ for $i, j = 1, ..., m$. Then $B$ and $\Sigma_\epsilon$ are unknown matrices of parameters to be estimated, and $E(Z) = XB$ while $E(Y_{ij}) = x_i^T \beta_j$.
3) Each response variable in a multivariate linear regression model follows


a multiple linear regression model Y j = Xβ j + ej for j = 1, ..., m where it
is assumed that E(ej ) = 0 and Cov(ej ) = σjj I n .
4) For each variable Yk make a response plot of Ŷik versus Yik and a residual
plot of Ŷik versus rik = Yik − Ŷik . If the multivariate linear regression model
is appropriate, then the plotted points should cluster about the identity line
in each of the m response plots. If outliers are present or if the plot is not
linear, then the current model or data need to be transformed or corrected.
If the model is good, then each of the m residual plots should be ellipsoidal
with no trend and should be centered about the r = 0 line. There should not
be any pattern in the residual plot: as a narrow vertical strip is moved from
left to right, the behavior of the residuals within the strip should show little
change. Outliers and patterns such as curvature or a fan shaped plot are bad.
5) Make a scatterplot matrix of Y1 , ..., Ym and of the continuous predictors.
Use power transformations to remove strong nonlinearities.
6) Consider testing $LB = 0$ where $L$ is an $r \times p$ full rank matrix. Let $W_e = \hat{E}^T\hat{E}$ and $W_e/(n-p) = \hat{\Sigma}_\epsilon$. Let $H = \hat{B}^T L^T[L(X^TX)^{-1}L^T]^{-1}L\hat{B}$. Let $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_m$ be the ordered eigenvalues of $W_e^{-1}H$. Then there are four commonly used test statistics.
The Wilks' $\Lambda$ statistic is $\Lambda(L) = |(H + W_e)^{-1}W_e| = |W_e^{-1}H + I|^{-1} = \prod_{i=1}^m (1 + \lambda_i)^{-1}$.
The Pillai's trace statistic is $V(L) = tr[(H + W_e)^{-1}H] = \sum_{i=1}^m \frac{\lambda_i}{1 + \lambda_i}$.
The Hotelling-Lawley trace statistic is $U(L) = tr[W_e^{-1}H] = \sum_{i=1}^m \lambda_i$.
The Roy's maximum root statistic is $\lambda_{max}(L) = \lambda_1$.
7) Theorem: The Hotelling-Lawley trace statistic
$$U(L) = \frac{1}{n-p}[vec(L\hat{B})]^T[\hat{\Sigma}_\epsilon^{-1} \otimes (L(X^TX)^{-1}L^T)^{-1}][vec(L\hat{B})].$$

8) Assumption D1: Let $h_i$ be the $i$th diagonal element of $X(X^TX)^{-1}X^T$. Assume $\max(h_1, ..., h_n) \to 0$ as $n \to \infty$, assume that the zero mean iid error vectors have finite fourth moments, and assume that $\frac{1}{n}X^TX \xrightarrow{P} W^{-1}$.
9) Multivariate Least Squares Central Limit Theorem (MLS CLT): For the least squares estimator, if assumption D1 holds, then $\hat{\Sigma}_\epsilon$ is a $\sqrt{n}$ consistent estimator of $\Sigma_\epsilon$, and $\sqrt{n}\, vec(\hat{B} - B) \xrightarrow{D} N_{pm}(0, \Sigma_\epsilon \otimes W)$.
10) Theorem: If assumption D1 holds and if $H_0$ is true, then $(n-p)U(L) \xrightarrow{D} \chi^2_{rm}$.
11) Under regularity conditions, $-[n - p + 1 - 0.5(m - r + 3)]\log(\Lambda(L)) \xrightarrow{D} \chi^2_{rm}$, $(n-p)V(L) \xrightarrow{D} \chi^2_{rm}$, and $(n-p)U(L) \xrightarrow{D} \chi^2_{rm}$. These statistics are robust against nonnormality.
12) For the Wilks' Lambda test, pval = $P\left( \frac{-[n - p + 1 - 0.5(m - r + 3)]}{rm}\log(\Lambda(L)) < F_{rm,\, n-rm} \right)$.
For the Pillai's trace test, pval = $P\left( \frac{n-p}{rm}V(L) < F_{rm,\, n-rm} \right)$.
For the Hotelling Lawley trace test, pval = $P\left( \frac{n-p}{rm}U(L) < F_{rm,\, n-rm} \right)$.
The above three tests are large sample tests, P(reject $H_0 | H_0$ is true) $\to \delta$ as $n \to \infty$, under regularity conditions.
13) The 4 step MANOVA F test of hypotheses uses L = [0 I p−1 ].
i) State the hypotheses H0 : the nontrivial predictors are not needed in the
mreg model H1 : at least one of the nontrivial predictors is needed.
ii) Find the test statistic Fo from output.
iii) Find the pval from output.
iv) If pval ≤ δ, reject H0 . If pval > δ, fail to reject H0 . If H0 is rejected,
conclude that there is a mreg relationship between the response variables
Y1 , ..., Ym and the predictors x2 , ..., xp . If you fail to reject H0 , conclude that
there is a not a mreg relationship between Y1 , ..., Ym and the predictors x2 ,
..., xp . (Get the variable names from the story problem.)
14) The 4 step Fj test of hypotheses uses Lj = [0, ..., 0, 1, 0, ..., 0] where
the 1 is in the jth position. Let B Tj be the jth row of B. The hypotheses are
equivalent to H0 : B Tj = 0 H1 : B Tj 6= 0. i) State the hypotheses
H0 : xj is not needed in the model H1 : xj is needed in the model.
ii) Find the test statistic Fj from output.
iii) Find pval from output.
iv) If pval ≤ δ, reject H0 . If pval > δ, fail to reject H0 . Give a nontechnical
sentence restating your conclusion in terms of the story problem. If H0 is
rejected, then conclude that xj is needed in the mreg model for Y1 , ..., Ym. If
you fail to reject H0 , then conclude that xj is not needed in the mreg model
for Y1 , ..., Ym given that the other predictors are in the model.
15) The 4 step MANOVA partial F test of hypotheses has a full model
using all of the variables and a reduced model where r of the variables are
deleted. The ith row of L has a 1 in the position corresponding to the ith
variable to be deleted. Omitting the jth variable corresponds to the Fj test
while omitting variables x2 , ..., xp corresponds to the MANOVA F test.
i) State the hypotheses H0 : the reduced model is good
H1 : use the full model.
ii) Find the test statistic FR from output.
iii) Find the pval from output.
iv) If pval ≤ δ, reject H0 and conclude that the full model should be used.
If pval > δ, fail to reject H0 and conclude that the reduced model is good.

16) The 4 step MANOVA F test should reject H0 if the response and
residual plots look good, n is large enough, and at least one response plot
does not look like the corresponding residual plot. A response plot for Yj will
look like a residual plot if the identity line appears almost horizontal, hence
the range of Ŷj is small.
17) The linmodpack function mltreg produces the m response and resid-
ual plots, gives B̂, Σ̂  , the MANOVA partial F test statistic and pval cor-
responding to the reduced model that leaves out the variables given by in-
dices (so x2 and x4 in the output below with F = 0.77 and pval = 0.614),
Fj and the pval for the Fj test for variables 1, 2, ..., p (where p = 4 in
the output below so F2 = 1.51 with pval = 0.284), and F0 and pval for
the MANOVA F test (in the output below F0 = 3.15 and pval= 0.06).
The command out <- mltreg(x,y,indices=c(2)) would produce a
MANOVA partial F test corresponding to the F2 test while the command
out <- mltreg(x,y,indices=c(2,3,4)) would produce a MANOVA
partial F test corresponding to the MANOVA F test for a data set with
p = 4 predictor variables. The Hotelling Lawley trace statistic is used in the
tests.
out <- mltreg(x,y,indices=c(2,4))
$Bhat [,1] [,2] [,3]
[1,] 47.96841291 623.2817463 179.8867890
[2,] 0.07884384 0.7276600 -0.5378649
[3,] -1.45584256 -17.3872206 0.2337900
[4,] -0.01895002 0.1393189 -0.3885967
$Covhat
[,1] [,2] [,3]
[1,] 21.91591 123.2557 132.339
[2,] 123.25566 2619.4996 2145.780
[3,] 132.33902 2145.7797 2954.082
$partial
partialF Pval
[1,] 0.7703294 0.6141573
$Ftable
Fj pvals
[1,] 6.30355375 0.01677169
[2,] 1.51013090 0.28449166
[3,] 5.61329324 0.02279833
[4,] 0.06482555 0.97701447
$MANOVA
MANOVAF pval
[1,] 3.150118 0.06038742
18) Given $\hat{B} = [\hat{\beta}_1\ \hat{\beta}_2\ \cdots\ \hat{\beta}_m]$ and $x_f$, find $\hat{y}_f = (\hat{y}_1, ..., \hat{y}_m)^T$ where $\hat{y}_i = \hat{\beta}_i^T x_f$.
19) $\hat{\Sigma}_\epsilon = \frac{\hat{E}^T\hat{E}}{n-p} = \frac{1}{n-p}\sum_{i=1}^n \hat{\epsilon}_i\hat{\epsilon}_i^T$ while the sample covariance matrix of the residuals is $S_r = \frac{n-p}{n-1}\hat{\Sigma}_\epsilon = \frac{\hat{E}^T\hat{E}}{n-1}$. Both $\hat{\Sigma}_\epsilon$ and $S_r$ are $\sqrt{n}$ consistent estimators of $\Sigma_\epsilon$ for a large class of distributions for the error vectors $\epsilon_i$.
20) The $100(1-\delta)\%$ nonparametric prediction region for $y_f$ given $x_f$ is the nonparametric prediction region from Section 4.4 applied to $\hat{z}_i = \hat{y}_f + \hat{\epsilon}_i = \hat{B}^T x_f + \hat{\epsilon}_i$ for $i = 1, ..., n$. This takes the data cloud of the $n$ residual vectors $\hat{\epsilon}_i$ and centers the cloud at $\hat{y}_f$. Let
$$D_i^2(\hat{y}_f, S_r) = (\hat{z}_i - \hat{y}_f)^T S_r^{-1}(\hat{z}_i - \hat{y}_f)$$

for $i = 1, ..., n$. Let $q_n = \min(1 - \delta + 0.05, 1 - \delta + m/n)$ for $\delta > 0.1$ and $q_n = \min(1 - \delta/2, 1 - \delta + 10\delta m/n)$, otherwise. If $q_n < 1 - \delta + 0.001$, set $q_n = 1 - \delta$. Let $0 < \delta < 1$ and $h = D_{(U_n)}$ where $D_{(U_n)}$ is the $q_n$th sample quantile of the $D_i$. The $100(1-\delta)\%$ nonparametric prediction region for $y_f$ is
$$\{y : (y - \hat{y}_f)^T S_r^{-1}(y - \hat{y}_f) \leq D^2_{(U_n)}\} = \{y : D_y(\hat{y}_f, S_r) \leq D_{(U_n)}\}.$$

a) Consider the n prediction regions for the data where (y f,i , xf,i ) =
(y i , xi ) for i = 1, ..., n. If the order statistic D(Un ) is unique, then Un of the
n prediction regions contain y i where Un /n → 1 − δ as n → ∞.
b) If (ŷ f , S r ) is a consistent estimator of (E(y f ), Σ  ) then the nonpara-
metric prediction region is a large sample 100(1 − δ)% prediction region for
yf .
c) If (ŷ f , S r ) is a consistent estimator of (E(y f ), Σ  ), and the i come
from an elliptically contoured distribution such that the unique highest den-
sity region is {y : Dy (0, Σ  ) ≤ D1−δ }, then the nonparametric prediction
region is asymptotically optimal.
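A rough R sketch of the prediction region in 20), assuming Bhat (p × m), the residual matrix Ehat (n × m), and a new xf are available from a least squares fit; the object names are placeholders and this is not the linmodpack code.

yfhat <- as.vector(crossprod(Bhat, xf))   # predicted value Bhat^T xf
n <- nrow(Ehat); m <- ncol(Ehat); delta <- 0.1      # 90% region
Sr <- crossprod(Ehat)/(n - 1)
D2 <- mahalanobis(Ehat, center = rep(0, m), cov = Sr)  # D_i^2 since zhat_i - yfhat = residual_i
qn <- if (delta > 0.1) min(1 - delta + 0.05, 1 - delta + m/n) else
  min(1 - delta/2, 1 - delta + 10*delta*m/n)
if (qn < 1 - delta + 0.001) qn <- 1 - delta
cutoff <- sort(D2)[ceiling(n*qn)]          # D_(Un)^2
# a new yf is in the region if mahalanobis(yf, center = yfhat, cov = Sr) <= cutoff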
21) On the DD plot for the residual vectors, the cases to the left of the
vertical line correspond to cases that would have y f = y i in the nonpara-
metric prediction region if xf = xi , while the cases to the right of the line
would not have y f = y i in the nonparametric prediction region.
22) The DD plot for the residual vectors is interpreted almost exactly as
a DD plot for iid multivariate data is interpreted. Plotted points clustering
about the identity line suggests that the i may be iid from a multivariate
normal distribution, while plotted points that cluster about a line through
the origin with slope greater than 1 suggests that the i may be iid from an
elliptically contoured distribution that is not MVN. Points to the left of the vertical line correspond to the cases that are in their nonparametric prediction region. Robust distances have not been shown to be consistent estimators of
the population distances, but are useful for a graphical diagnostic.
Multiple Linear Regression               Multivariate Linear Regression

    Y = Xβ + e                               Z = XB + E
 1) E(Y) = Xβ                                E[Z] = XB
 2) Y_i = x_i^T β + e_i                      y_i = B^T x_i + ε_i
 3) E(e) = 0                                 E[E] = 0
 4) H = P = X(X^T X)^{−1} X^T                H = P = X(X^T X)^{−1} X^T
 5) β̂ = (X^T X)^{−1} X^T Y                  B̂ = (X^T X)^{−1} X^T Z
 6) Ŷ = P Y                                 Ẑ = P Z
 7) r = ê = (I − P) Y                        Ê = (I − P) Z
 8) E[β̂] = β                                E[B̂] = B
 9) E(Ŷ) = E(Y) = Xβ                        E[Ẑ] = XB
10) σ̂² = r^T r/(n − p)                      Σ̂_ε = Ê^T Ê/(n − p)
11) V(e_i) = σ²                              Cov(ε_i) = Σ_ε
12) E(Y_i) = β^T x_i                         E[y_i] = B^T x_i
13) H_0: Lβ = 0, rF_R →D χ²_r                H_0: LB = 0, (n − p)U(L) →D χ²_{rm}
14) LS CLT:                                  MLS CLT:
    √n (β̂ − β) →D N_p(0, σ² W)              √n vec(B̂ − B) →D N_{pm}(0, Σ_ε ⊗ W)
23) The table above compares MLR and MREG.
24) The robust multivariate linear regression method rmreg2 computes
the classical estimator on the RMVN set where RMVN is computed from
the n cases v i = (xi2 , ..., xpi, Yi1 , ..., Yim)T . This estimator has considerable
outlier resistance but theory currently needs very strong assumptions. The
response and residual plots and DD plot of the residuals from this estimator
are useful for outlier detection. The rmreg2 estimator is superior to the
rmreg estimator for outlier detection.

8.10 Complements

This chapter followed Olive (2017b, ch. 12) closely. Multivariate linear re-
gression is a semiparametric method that is nearly as easy to use as multiple
linear regression if m is small. Section 8.3 followed Olive (2018) closely. The
material on plots and testing followed Olive et al. (2015) closely. The m re-
sponse and residual plots should be made as well as the DD plot, and the
response and residual plots are very useful for the m = 1 case of multiple
linear regression and experimental design. These plots speed up the model
building process for multivariate linear models since the success of power
transformations achieving linearity can be quickly assessed, and influential
cases can be quickly detected. See Cook and Olive (2001).
Work is needed on variable selection and on determining the sample sizes
for when the tests and prediction regions start to work well. Response and
residual plots can look good for n ≥ 10p, but for testing and prediction
regions, we may need n ≥ a(m + p)^2 where 0.8 ≤ a ≤ 5 even for well behaved
elliptically contoured error distributions. Variable selection for multivariate
linear regression is discussed in Fujikoshi et al. (2014). R programs are needed
to make variable selection easy. Forward selection would be especially useful.
Often observations (Y1 , ..., Ym, x2, ..., xp) are collected on the same person
or thing and hence are correlated. If transformations can be found such that
the DD plot and the m response plots and residual plots look good, and
n is large (n ≥ max[(m + p)^2, mp + 30] starts to give good results), then
multivariate linear regression can be used to efficiently analyze the data.
Examining m multiple linear regressions is an incorrect method for analyzing
the data.
In addition to robust estimators and seemingly unrelated regressions, en-
velope estimators and partial least squares (PLS) are competing methods for
multivariate linear regression. See recent work by Cook such as Cook (2018),
Cook and Su (2013), Cook et al. (2013), and Su and Cook (2012). Methods
like ridge regression and lasso can also be extended to multivariate linear re-
gression. See, for example, Obozinski et al. (2011). Relaxed lasso extensions
are likely useful. Prediction regions for alternative methods with n >> p
could be made following Section 8.3.
Plugging in robust dispersion estimators in place of the covariance matrices, as done in Section 8.6, is not a new idea. Maronna and Morgenthaler
(1986) used M –estimators when m = 1. Problems can occur if the error
distribution is not elliptically contoured. See Nordhausen and Tyler (2015).
Khattree and Naik (1999, pp. 91-98) discussed testing H0 : LBM = 0
versus H1 : LBM ≠ 0 where M = I gives a linear test of hypotheses.
Johnstone and Nadler (2017) gave useful approximations for Roy’s largest
root test when the error vector distribution is multivariate normal.

8.11 Problems

PROBLEMS WITH AN ASTERISK * ARE ESPECIALLY USEFUL.

8.1∗. Consider the Hotelling Lawley test statistic. Let

T(W) = n [vec(LB̂)]^T [Σ̂_ε^{−1} ⊗ (LWL^T)^{−1}] [vec(LB̂)].

Let (X^T X)/n = Ŵ^{−1}. Show

T(Ŵ) = [vec(LB̂)]^T [Σ̂_ε^{−1} ⊗ (L(X^T X)^{−1} L^T)^{−1}] [vec(LB̂)].

8.2. Consider the Hotelling Lawley test statistic. Let

T = [vec(LB̂)]^T [Σ̂_ε^{−1} ⊗ (L(X^T X)^{−1} L^T)^{−1}] [vec(LB̂)].
Let L = L_j = [0, ..., 0, 1, 0, ..., 0] have a 1 in the jth position. Let b̂_j^T = L_j B̂ be the jth row of B̂. Let d_j = L_j (X^T X)^{−1} L_j^T = (X^T X)^{−1}_{jj}, the jth diagonal entry of (X^T X)^{−1}. Then T_j = (1/d_j) b̂_j^T Σ̂_ε^{−1} b̂_j. The Hotelling Lawley statistic

U = tr([(n − p)Σ̂_ε]^{−1} B̂^T L^T [L(X^T X)^{−1} L^T]^{−1} L B̂).

Hence if L = L_j, then U_j = [1/(d_j (n − p))] tr(Σ̂_ε^{−1} b̂_j b̂_j^T).
Using tr(ABC) = tr(CAB) and tr(a) = a for scalar a, show that (n − p)U_j = T_j.

8.3. Consider the Hotelling Lawley test statistic. Using the Searle (1982, p. 333) identity

tr(AG^T DGC) = [vec(G)]^T [CA ⊗ D^T] [vec(G)],

show (n − p)U(L) = tr[Σ̂_ε^{−1} B̂^T L^T [L(X^T X)^{−1} L^T]^{−1} L B̂]
= [vec(LB̂)]^T [Σ̂_ε^{−1} ⊗ (L(X^T X)^{−1} L^T)^{−1}] [vec(LB̂)] by identifying A, G, D, and C.

#Output for problem 8.4.
$Ftable
            Fj        pvals
[1,] 82.147221 0.000000e+00
[2,] 58.448961 0.000000e+00
[3,] 15.700326 4.258563e-09
[4,] 9.072358 1.281220e-05
[5,] 45.364862 0.000000e+00

$MANOVA
MANOVAF pval
[1,] 67.80145 0
8.4. The output above is for the R Seatbelts data set where Y1 = drivers =
number of drivers killed or seriously injured, Y2 = front = number of front
seat passengers killed or seriously injured, and Y3 = back = number of back
seat passengers killed or seriously injured. The predictors were x2 = kms =
distance driven, x3 = price = petrol price, x4 = van = number of van drivers
killed, and x5 = law = 0 if the law was in effect that month and 1 otherwise.
The data consists of 192 monthly totals in Great Britain from January 1969 to
December 1984, and the compulsory wearing of seat belts law was introduced
in February 1983.
a) Do the MANOVA F test.
b) Do the F4 test.
8.5. a) Sketch a DD plot of the residual vectors ˆi for the multivariate
linear regression model if the error vectors i are iid from a multivariate
normal distribution. b) Does the DD plot change if the one way MANOVA
model is used instead of the multivariate linear regression model?

8.6. The output below is for the R judge ratings data set consisting of
lawyer ratings for n = 43 judges. Y1 = oral = sound oral rulings, Y2 = writ =
sound written rulings, and Y3 = rten = worthy of retention. The predictors
were x2 = cont = number of contacts of lawyer with judge, x3 = intg =
judicial integrity, x4 = dmnr = demeanor, x5 = dilg = diligence, x6 =
cfmg = case flow managing, x7 = deci = prompt decisions, x8 = prep =
preparation for trial, x9 = fami = familiarity with law, and x10 = phys =
physical ability.
a) Do the MANOVA F test.
b) Do the MANOVA partial F test for the reduced model that deletes
x2 , x5 , x6, x7 , and x8 .

y<-USJudgeRatings[,c(9,10,12)] #See problem 8.6.



x<-USJudgeRatings[,-c(9,10,12)]
mltreg(x,y,indices=c(2,5,6,7,8))
$partial
partialF Pval
[1,] 1.649415 0.1855314

$MANOVA
MANOVAF pval
[1,] 340.1018 1.121325e-14
8.7. Let β_i be p × 1 and suppose
\[
\begin{pmatrix} \hat{\beta}_1 - \beta_1 \\ \hat{\beta}_2 - \beta_2 \end{pmatrix}
\sim N_{2p}\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix},
\begin{pmatrix} \sigma_{11}(X^TX)^{-1} & \sigma_{12}(X^TX)^{-1} \\ \sigma_{21}(X^TX)^{-1} & \sigma_{22}(X^TX)^{-1} \end{pmatrix} \right).
\]
Find the distribution of
\[
[L \ \ 0] \begin{pmatrix} \hat{\beta}_1 - \beta_1 \\ \hat{\beta}_2 - \beta_2 \end{pmatrix} = L\hat{\beta}_1
\]
where Lβ_1 = 0 and L is r × p with r ≤ p. Simplify.


8.8. Let y = B^T x + ε. Suppose x = (1, x_2, ..., x_p)^T = (1 w^T)^T where w = (x_2, ..., x_p)^T. Let
\[
B = \begin{pmatrix} \alpha^T \\ B_S \end{pmatrix}.
\]
Suppose
\[
\begin{pmatrix} y \\ w \end{pmatrix} \sim N_{m+p-1}\left( \begin{pmatrix} \mu_y \\ \mu_w \end{pmatrix},
\begin{pmatrix} \Sigma_{yy} & \Sigma_{yw} \\ \Sigma_{wy} & \Sigma_{ww} \end{pmatrix} \right).
\]
Then y|w ∼ N_m(μ_y + Σ_yw Σ_ww^{−1}(w − μ_w), Σ_yy − Σ_yw Σ_ww^{−1} Σ_wy), and ε ∼ N_m(0, Σ_yy − Σ_yw Σ_ww^{−1} Σ_wy) = N_m(0, Σ_ε). Now
\[
y|x = y \left| \begin{pmatrix} 1 \\ w \end{pmatrix} \right. = B^T x + \epsilon,
\]
and
\[
y|w = B^T x + \epsilon = \begin{pmatrix} \alpha^T \\ B_S \end{pmatrix}^T \begin{pmatrix} 1 \\ w \end{pmatrix} + \epsilon
= (\alpha \ \ B_S^T) \begin{pmatrix} 1 \\ w \end{pmatrix} + \epsilon = \alpha + B_S^T w + \epsilon.
\]
Hence E(y|w) = μ_y + Σ_yw Σ_ww^{−1}(w − μ_w) = α + B_S^T w.
a) Show α = μ_y − B_S^T μ_w.
b) Show B_S = Σ_w^{−1} Σ_wy where Σ_w = Σ_ww. (Hence B_S^T = Σ_yw Σ_w^{−1}.)
R Problems

Warning: Use the command source(“G:/linmodpack.txt”) to download the programs. See Preface or Section 11.1. Typing the name of
the mpack function, e.g. ddplot, will display the code for the function. Use
the args command, e.g. args(ddplot), to display the needed arguments for
the function. For some of the following problems, the R commands can be
copied and pasted from (http://parker.ad.siu.edu/Olive/linmodrhw.txt) into
R.
8.9. This problem examines multivariate linear regression on the Cook
and Weisberg (1999) mussels data with Y1 = log(S) and Y2 = log(M ) where
S is the shell mass and M is the muscle mass. The predictors are X2 = L,
X3 = log(W ), and X4 = H: the shell length, log(width), and height.
a) The R command for this part makes the response and residual plots
for each of the two response variables. Click the rightmost mouse button and
highlight Stop to advance the plot. When you have the response and residual
plots for one variable on the screen, copy and paste the two plots into Word.
Do this two times, once for each response variable. The plotted points fall in
roughly evenly populated bands about the identity or r = 0 line.
b) Copy and paste the output produced from the R command for this part
from $partial on. This gives the output needed to do the MANOVA F test,
MANOVA partial F test, and the Fj tests.
c) The R command for this part makes a DD plot of the residual vectors
and adds the lines corresponding to those in Figure 8.3. Place the plot in
Word. Do the residual vectors appear to follow a multivariate normal distri-
bution? (Right click Stop once.)
d) Do the MANOVA partial F test where the reduced model deletes X3
and X4 .
e) Do the F2 test.
f) Do the MANOVA F test.
8.10. This problem examines multivariate linear regression on the SAS
Institute (1985, p. 146) Fitness Club Data with Y1 = chinups, Y2 = situps,
and Y3 = jumps. The predictors are X2 = weight, X3 = waist, and X4 =
pulse.
a) The R command for this part makes the response and residual plots for
each of the three variables. Click the rightmost mouse button and highlight
Stop to advance the plot. When you have the response and residual plots for
one variable on the screen, copy and paste the three plots into Word. Do this
three times, once for each response variable. Are there any outliers?
b) The R command for this part makes a DD plot of the residual vectors
and adds the lines corresponding to those in Figure 8.3. Place the plot in
Word. Are there any outliers? (Right click Stop once.)
8.11. This problem uses the linmodpack function mregsim to simulate the
Wilks’ Λ test, Pillai’s trace test, Hotelling Lawley trace test, and Roy’s largest
root test for the Fj tests and the MANOVA F test for multivariate linear
regression. When mnull = T the first row of B is 1T while the remaining

rows are equal to 0T . Hence the null hypothesis for the MANOVA F test is
true. When mnull = F the null hypothesis is true for p = 2, but false for
p > 2. Now the first row of B is 1T and the last row of B is 0T . If p > 2,
then the second to last row of B is (1, 0, ..., 0), the third to last row is (1,
1, 0, ..., 0) et cetera as long as the first row is not changed from 1T . First
m iid errors z i are generated such that the m errors are iid with variance
σ 2 . Then i = Azi so that Σ̂  = σ 2 AAT = (σij ) where the diagonal entries
σii = σ 2 [1 + (m − 1)ψ2 ] and the off diagonal entries σij = σ 2 [2ψ + (m − 2)ψ2 ]
where ψ = 0.10. Terms like Wilkcov give the percentage of times the Wilks’
test rejected the F1 , F2 , ..., Fp tests. The $mancv wcv pcv hlcv rcv fcv output
gives the percentage of times the 4 test statistics reject the MANOVA F test.
Here hlcov and fcov both correspond to the Hotelling Lawley test using the
formulas in Problem 8.3.
5000 runs will be used so the simulation may take several minutes. Sample
sizes n = (m + p)^2, n = 3(m + p)^2, and n = 4(m + p)^2 were interesting. We
want coverage near 0.05 when H0 is true and coverage close to 1 for good
power when H0 is false. Multivariate normal errors were used in a) and b)
below.
a) Copy the coverage parts of the output produced by the R commands
for this part where n = 20, m = 2, and p = 4. Here H0 is true except for
the F1 test. Wilks’ and Pillai’s tests had low coverage < 0.05 when H0 was
false. Roy’s test was good for the Fj tests, but why was Roy’s test bad for
the MANOVA F test?
b) Copy the coverage parts of the output produced by the R commands
for this part where n = 20, m = 2, and p = 4. Here H0 is false except for the
F4 test. Which two tests seem to be the best for this part?
8.12. This problem uses the linmodpack function mpredsim to simulate
the prediction regions for y f given xf for multivariate regression. With 5000
runs this simulation may take several minutes. The R command for this
problem generates iid lognormal errors then subtracts the mean, producing
zi . Then the i = Az i are generated as in Problem 8.11 with n=100, m=2,
and p=4. The nominal coverage of the prediction region is 90%, and 92%
of the training data is covered. The ncvr output gives the coverage of the
nonparametric region. What was ncvr?
Chapter 9
One Way MANOVA Type Models

Multivariate regression is the study of the conditional distribution y|x of the


m × 1 vector of response variables y given the p × 1 vector of nontrivial pre-
dictors x. The multivariate linear model includes the following two models. i)
The multivariate linear regression model of Chapter 8 has at least one quan-
titative predictor variable. ii) For the MANOVA model, the predictors are
indicator variables. Often observations (Y1 , ..., Ym, x1 , x2 , ..., xp) are collected
on the same person or thing and hence are correlated. If transformations can
be found such that the m response plots and residual plots of Section 9.2
look good, and n ≥ (m + p)^2 (and n_i ≥ 10m if there are p treatment groups and n = Σ_{i=1}^p n_i), then the MANOVA model can often be used to efficiently
analyze the data. These two plots and the DD plot of the residuals are useful
for checking the model and for outlier detection.

9.1 Introduction

Definition 9.1. The response variables are the variables that you want
to predict. The predictor variables are the variables used to predict the
response variables.

Notation. A multivariate linear model has m ≥ 2 response variables. A


multiple linear model = univariate linear model has m = 1 response variable,
but at least two nontrivial predictors, and usually a constant (so p ≥ 3).
A simple linear model has m = 1, one nontrivial predictor, and usually a
constant (so p = 2). Multiple linear regression models and ANOVA models
are special cases of multiple linear models.

Definition 9.2. The multivariate linear model

y i = B T xi + i


for i = 1, ..., n has m ≥ 2 response variables Y1 , ..., Ym and p predictor vari-


ables x1 , x2, ..., xp. The ith case is (xTi , yTi ) = (xi1 , xi2 , ..., xip, Yi1 , ..., Yim). If
a constant xi1 = 1 is in the model, then xi1 could be omitted from the case.
The model is written in matrix form as Z = XB + E where the matrices are
the same as those between Definitions 8.2 and 8.3. The model has E(k ) = 0
and Cov(ε_k) = Σ_ε = (σ_ij) for k = 1, ..., n. Then the p × m coefficient matrix B = [β_1 β_2 . . . β_m] and the m × m covariance matrix Σ_ε are to be esti-
mated, and E(Z) = XB while E(Yij ) = xTi β j . The i are assumed to be
iid. The univariate linear model corresponds to m = 1 response variable, and
is written in matrix form as Y = Xβ + e. Subscripts are needed for the m
univariate linear models Y j = Xβj + ej for j = 1, ..., m where E(ej ) = 0.
For the multivariate linear model, Cov(ei , ej ) = σij I n for i, j = 1, ..., m
where I n is the n × n identity matrix.

Definition 9.3. The multivariate analysis of variance (MANOVA model)


y i = B T xi + i for i = 1, ..., n has m ≥ 2 response variables Y1 , ..., Ym
and p predictor variables X1 , X2 , ..., Xp. The MANOVA model is a special
case of the multivariate linear model. For the MANOVA model, the predic-
tors are not quantitative variables, so the predictors are indicator variables.
Sometimes the trivial predictor 1 is also in the model. In matrix form, the
MANOVA model is Z = XB + E. The model has E(k ) = 0 and Cov(k ) =
Σ  = (σij ) for k = 1, ..., n. Also E(ei ) = 0 while Cov(ei , ej ) = σij I n for
i, j = 1, ..., m. Then B and Σ  are unknown matrices of parameters to be
estimated, and E(Z) = XB while E(Yij ) = xTi β j .

The data matrix W d = [X Z]. If the model contains a constant, then


usually the first column of ones 1 of X is omitted from the data matrix for
software such as R and SAS.

Each response variable in a MANOVA model follows an ANOVA model


Y j = Xβ j + ej for j = 1, ..., m where it is assumed that E(ej ) = 0 and
Cov(ej ) = σjj I n . Hence the errors corresponding to the jth response are
uncorrelated with variance σj2 = σjj . Notice that the same design matrix
X of predictors is used for each of the m models, but the jth response variable
vector Y j , coefficient vector β j , and error vector ej change and thus depend
on j. Hence for a one way MANOVA model, each response variable follows a
one way ANOVA model, while for a two way MANOVA model, each response
variable follows a two way ANOVA model for j = 1, ..., m.
Once the ANOVA model is fixed, e.g. a one way ANOVA model, the design
matrix X depends on the parameterization of the ANOVA model. See Chap-
ter 3. The fitted values and residuals are the same for each parameterization,
but the interpretation of the parameters depends on the parameterization.
Now consider the ith case (xTi , y Ti ) which corresponds to the ith row of
X and the ith row of Z. Then y i = E(y i ) + i where
\[
E(y_i) = B^T x_i = \begin{pmatrix} x_i^T \beta_1 \\ x_i^T \beta_2 \\ \vdots \\ x_i^T \beta_m \end{pmatrix}.
\]

The notation y i |xi and E(y i |xi ) is more accurate, but usually the con-
ditioning is suppressed. Taking E(y i |xi ) to be a constant, y i and i have
the same covariance matrix. In the MANOVA model, this covariance matrix
Σ  does not depend on i. Observations from different cases are uncorrelated
(often independent), but the m errors for the m different response variables
for the same case are correlated.

Let B̂ be the MANOVA estimator of B. MANOVA models are often fit


by least squares. Then the least squares estimators are
 
B̂ = B̂_g = (X^T X)^− X^T Z = [β̂_1 β̂_2 . . . β̂_m]

where (X T X)− is a generalized inverse of X T X. Here B̂ g depends on the


generalized inverse. If X has full rank p then (X T X)− = (X T X)−1 and B̂
is unique.

Definition 9.4. The predicted values or fitted values
\[
\hat{Z} = X\hat{B} = [\hat{Y}_1 \ \hat{Y}_2 \ \dots \ \hat{Y}_m] =
\begin{pmatrix}
\hat{Y}_{1,1} & \hat{Y}_{1,2} & \dots & \hat{Y}_{1,m} \\
\hat{Y}_{2,1} & \hat{Y}_{2,2} & \dots & \hat{Y}_{2,m} \\
\vdots & \vdots & \ddots & \vdots \\
\hat{Y}_{n,1} & \hat{Y}_{n,2} & \dots & \hat{Y}_{n,m}
\end{pmatrix}.
\]
The residuals
\[
\hat{E} = Z - \hat{Z} = Z - X\hat{B} =
\begin{pmatrix} \hat{\epsilon}_1^T \\ \hat{\epsilon}_2^T \\ \vdots \\ \hat{\epsilon}_n^T \end{pmatrix}
= [\hat{r}_1 \ \hat{r}_2 \ \dots \ \hat{r}_m] =
\begin{pmatrix}
\hat{\epsilon}_{1,1} & \hat{\epsilon}_{1,2} & \dots & \hat{\epsilon}_{1,m} \\
\hat{\epsilon}_{2,1} & \hat{\epsilon}_{2,2} & \dots & \hat{\epsilon}_{2,m} \\
\vdots & \vdots & \ddots & \vdots \\
\hat{\epsilon}_{n,1} & \hat{\epsilon}_{n,2} & \dots & \hat{\epsilon}_{n,m}
\end{pmatrix}.
\]
These quantities can be found by fitting m ANOVA models Y_j = Xβ_j + e_j to get β̂_j, Ŷ_j = Xβ̂_j, and r̂_j = Y_j − Ŷ_j for j = 1, ..., m. Hence ε̂_{i,j} = Y_{i,j} − Ŷ_{i,j} where Ŷ_j = (Ŷ_{1,j}, ..., Ŷ_{n,j})^T. Finally,
\[
\hat{\Sigma}_{\epsilon,d} = \frac{(Z - \hat{Z})^T(Z - \hat{Z})}{n-d}
= \frac{(Z - X\hat{B})^T(Z - X\hat{B})}{n-d}
= \frac{\hat{E}^T\hat{E}}{n-d}
= \frac{1}{n-d}\sum_{i=1}^n \hat{\epsilon}_i\hat{\epsilon}_i^T.
\]

The choices d = 0 and d = p are common. Let Σ̂_ε be the usual estimator of Σ_ε for the MANOVA model. If least squares is used with a full rank X, then Σ̂_ε = Σ̂_{ε,d=p}.
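The least squares quantities of Definition 9.4 (with d = p) can be obtained in base R as sketched below; y is an n × m response matrix and group is a factor with p levels, both hypothetical names rather than linmodpack objects.

fit <- lm(y ~ group)          # fits the m ANOVA models Y_j = X beta_j + e_j at once
Bhat <- coef(fit)             # p x m matrix Bhat (reference cell coding)
Zhat <- fitted(fit)           # n x m matrix of fitted values
Ehat <- residuals(fit)        # n x m matrix of residual vectors
n <- nrow(y); p <- nrow(Bhat)
Sigmahat.eps <- crossprod(Ehat)/(n - p)   # Sigmahat_{eps, d = p}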

9.2 Plots for MANOVA Models

As in Chapter 8, this section suggests using residual plots, response plots,


and the DD plot to examine the multivariate linear model. The residual plots
are often used to check for lack of fit of the multivariate linear model. The
response plots are used to check linearity (and to detect influential cases and
outliers for linearity). The response and residual plots are used exactly as in
the m = 1 case corresponding to multiple linear regression and experimental
design models. See Olive (2010, 2017a), Olive and Hawkins (2005), and Cook
and Weisberg (1999, p. 432). Chapter 8 used the response and residual plots
for MLR for each response variable Yj . The one way MANOVA model will
use the response and residual plots for the one way ANOVA model for each
response variable Yj . See Chapter 3.

Definition 9.5. A response plot for the jth response variable is a plot
of the fitted values Ybij versus the response Yij . The identity line with slope
one and zero intercept is added to the plot as a visual aid. A residual plot
corresponding to the jth response variable is a plot of Ŷij versus rij .

Remark 9.1. Make the m response and residual plots for any MANOVA
model. In a response plot, the vertical deviations from the identity line are the
residuals rij = Yij − Ŷij . Suppose the model is good, the error distribution is
not highly skewed, and n ≥ 10p. Then the plotted points should cluster about
the identity line in each of the m response plots. If outliers are present or if
the plot is not linear, then the current model or data need to be transformed
or corrected. If the model is good, then the each of the m residual plots
should be ellipsoidal with no trend and should be centered about the r = 0
line. There should not be any pattern in the residual plot: as a narrow vertical
strip is moved from left to right, the behavior of the residuals within the strip
should show little change. Outliers and patterns such as curvature or a fan
shaped plot are bad.

For some MANOVA models that do not use replication, the response and
residual plots look much like those for multivariate linear regression in Section
8.2. The response and residual plots for the one way MANOVA model need
some notation, and it is useful to use three subscripts. Suppose there are inde-
pendent random samples of size ni from p different populations (treatments),
or n_i cases are randomly assigned to p treatment groups with n = Σ_{i=1}^p n_i.
Assume that m response variables y ij = (Yij1 , ..., Yijm)T are measured for the
ith treatment. Hence i = 1, ..., p and j = 1, ..., ni. The Yijk follow different one

way ANOVA models for k = 1, ..., m. Assume E(y ij ) = µi = (µi1 , ..., µim)T
and Cov(y ij ) = Σ  . Hence the p treatments have possibly different mean
vectors µi , but common covariance matrix Σ  .
Then for the kth response variable, the response plot is a plot of Ŷijk ≡ µ̂ik
versus Yijk and the residual plot is a plot of Ŷijk ≡ µ̂ik versus rijk where µ̂ik is
the sample mean of the ni responses Yijk corresponding to the ith treatment
for the kth response variable. Add the identity line to the response plot and
r = 0 line to the residual plot as visual aids. The points in the response
plot scatter about the identity line and the points in the residual plot scatter
about the r = 0 line, but the scatter need not be in an evenly populated band.
A dot plot of Z1 , ..., Zn consists of an axis and n points each corresponding to
the value of Zi . The response plot for the kth response variable consists of p
dot plots, one for each value of µ̂ik . The dot plot corresponding to µ̂ik is the
dot plot of Yi,1,k , ..., Yi,ni,k . Similarly, the residual plot for the kth response
variable consists of p dot plots, and the plot corresponding to µ̂ik is the dot
plot of ri,1,k , ..., ri,ni,k . Assuming the ni ≥ 10, the p dot plots for the kth
response variable should have roughly the same shape and spread in both
the response and residual plots. Note that μ̂_ik = Ȳ_iok = (1/n_i) Σ_{j=1}^{n_i} Y_ijk.
Assume that each ni ≥ 10. It is easier to check shape and spread in the
residual plot. If the response plot looks like the residual plot, then a horizontal
line fits the p dot plots about as well as the identity line, and there may not
be much difference in the µik . In the response plot, if the identity line fits
the plotted points better than any horizontal line, then conclude that at least
some of the means µik differ.
Definition 9.6. An outlier corresponds to a case that is far from the
bulk of the data. Look for a large vertical distance of the plotted point from
the identity line or the r = 0 line.
Rule of thumb 9.1. Mentally add 2 lines parallel to the identity line and
2 lines parallel to the r = 0 line that cover most of the cases. Then a case is
an outlier if it is well beyond these 2 lines.

This rule often fails for large outliers since often the identity line goes
through or near a large outlier so its residual is near zero. A response that is
far from the bulk of the data in the response plot is a “large outlier” (large
in magnitude). Look for a large gap between the bulk of the data and the
large outlier.
Suppose there is a dot plot of ni cases corresponding to treatment i with
mean µik that is far from the bulk of the data. This dot plot is probably not
a cluster of “bad outliers” if ni ≥ 4 and n ≥ 5p. If ni = 1, such a case may
be a large outlier.

Rule of thumb 9.2. Often an outlier is very good, but more often an
outlier is due to a measurement error and is very bad.

Remark 9.2. Rule of thumb 3.2 for the one way ANOVA F test may also
be useful for the one way MANOVA model tests of hypotheses.
Remark 9.3. The above rules are mainly for linearity and tend to use
marginal models. The marginal models are useful for checking linearity, but
are not very useful for checking other model violations such as outliers in the
error vector distribution. The RMVN DD plot of the residual vectors is a
global method (takes into account the correlations of Y1 , ..., Ym) for checking
the error vector distribution, but is not real effective for detecting outliers
since OLS is used to find the residual vectors. A DD plot of residual vectors
from a robust MANOVA method might be more effective for detecting out-
liers. This remark also applies to the plots used in Section 8.2 for multivariate
linear regression.

The RMVN DD plot of the residual vectors ˆi is used to check the er-
ror vector distribution, to detect outliers, and to display the nonparametric
prediction region developed in Section 8.3. The DD plot suggests that the
error vector distribution is elliptically contoured if the plotted points cluster
tightly about a line through the origin as n → ∞. The plot suggests that
the error vector distribution is multivariate normal if the line is the identity
line. If n is large and the plotted points do not cluster tightly about a line
through the origin, then the error vector distribution may not be elliptically
contoured. These applications of the DD plot for iid multivariate data are
discussed in Olive (2002, 2008, 2013a) and Chapter 7. The RMVN estimator
has not yet been proven to be a consistent estimator for residual vectors,
but simulations suggest that the RMVN DD plot of the residual vectors is a
useful diagnostic plot.
Response transformations can also be made as in Section 1.2, but also make
the response plot of Ŷ j versus Y j and use the rules of Section 1.2 on Yj to
linearize the response plot for each of the m response variables Y1 , ..., Ym.

Example 9.1. Consider the one way MANOVA model on the famous
iris data set with n = 150 and p = 3 species of iris: setosa, versicolor, and
virginica. The m = 4 variables are Y1 = sepal length, Y2 = sepal width, Y3 =
petal length, and Y4 = petal width. See Becker et al. (1988). The plots for the
m = 4 response variables look similar, and Figure 9.1 shows the response and
residual plots for Y4 . Note that the spread of the three dot plots is similar.
The dot plot intersects the identity line at the sample mean of the cases in
the dot plot. The setosa cases in lowest dot plot have a sample mean of 0.246
and the horizontal line Y4 = 0.246 is below the dot plots for versicolor and
virginica which have means of 1.326 and 2.026. Hence the mean petal widths
differ for the three species, and it is easier to see this difference in the response
plot than the residual plot. The plots for the other three variables are similar.
Figure 9.2 shows that the DD plot of the residual vectors suggests that the
error vector distribution is elliptically contoured but not multivariate normal.

The DD plot also shows the prediction regions of Section 8.3 computed
using the residual vectors ε̂_i. From Section 8.3, if {ε̂ | D_ε̂(0, S_r) ≤ h} is a prediction region for the residual vectors, then {y | D_y(ŷ_f, S_r) ≤ h} is a
prediction region for y f . For the one way MANOVA model, a prediction
region for y f would only be valid for an xf which was observed, i.e., for
xf = xj , since only observed values of the categorical predictor variables
make sense. The 90% nonparametric prediction region corresponds to y with
distances to the left of the vertical line M D = 3.2.

[Figure 9.1: response plot (fit[, i] versus y[, i]) and residual plot (fit[, i] versus res[, i]) for Y4.]
Fig. 9.1 Plots for Y4 = Petal Width.

R commands for these two figures are shown below, and will also show
the plots for Y1 , Y2 , and Y3 . The linmodpack function manova1w makes the
response and residual plots while ddplot4 makes the DD plot. The last
command shows that the pvalue = 0 for the one way MANOVA test discussed
in the following section.
library(MASS)
y <- iris[,1:4] #m = 4 = number of response variables
group <- iris[,5]
#p = number of groups = number of dot plots
out<- manova1w(y,p=3,group=group) #right click
#Stop 8 times
ddplot4(out$res) #right click Stop
summary(out$out) #default is Pillai’s test

[Figure 9.2: DD plot of MD versus RD for the residual vectors.]
Fig. 9.2 DD Plot of the Residual Vectors for Iris Data.

9.3 One Way MANOVA

Using double subscripts will be useful for describing the one way MANOVA
model. Suppose there are independent random samples of size ni from p
different populations (treatments),
or n_i cases are randomly assigned to p treatment groups. Then n = Σ_{i=1}^p n_i and the group sample sizes are n_i for
i = 1, ..., p. Assume that m response variables y ij = (Yij1 , ..., Yijm)T are
measured for the ith treatment group and the jth case (often an individual
or thing) in the group. Hence i = 1, ..., p and j = 1, ..., ni. The Yijk follow
different one way ANOVA models for k = 1, ..., m. Assume E(y ij ) = µi and
Cov(y ij ) = Σ  . Hence the p treatments have different mean vectors µi , but
common covariance matrix Σ  . (The common covariance matrix assumption
can be relaxed for p = 2 with the appropriate 2 sample Hotelling’s T 2 test.)
The one way MANOVA is used to test H0 : µ1 = µ2 = · · · = µp . Often
µi = µ + τ i , so H0 becomes H0 : τ 1 = · · · = τ p . If m = 1, the one
way MANOVA model is the one way ANOVA model. MANOVA is useful
since it takes into account the correlations between the m response variables.
Performing m ANOVA tests fails to account for these correlations, but can
be a useful diagnostic. The Hotelling’s T 2 test that uses a common covariance
matrix is a special case of the one way MANOVA model with p = 2.
Let μ_i = μ + τ_i where Σ_{i=1}^p n_i τ_i = 0. The jth case from the ith population
or treatment group is y ij = µ+τ i +ij where ij is an error vector, i = 1, ..., p
and j = 1, ..., n_i. Let ȳ = μ̂ = Σ_{i=1}^p Σ_{j=1}^{n_i} y_ij / n be the overall mean. Let ȳ_i = Σ_{j=1}^{n_i} y_ij / n_i so τ̂_i = ȳ_i − ȳ. Let the residual vector ε̂_ij = y_ij − ȳ_i = y_ij − μ̂ − τ̂_i. Then y_ij = ȳ + (ȳ_i − ȳ) + (y_ij − ȳ_i) = μ̂ + τ̂_i + ε̂_ij.
Several m×m matrices will be useful. Let S i be the sample covariance ma-
trix corresponding to the ith treatment group. Then the within sum of squares
and cross products matrix is W = W_e = (n_1 − 1)S_1 + · · · + (n_p − 1)S_p = Σ_{i=1}^p Σ_{j=1}^{n_i} (y_ij − ȳ_i)(y_ij − ȳ_i)^T. Then Σ̂_ε = W/(n − p). The treatment or between sum of squares and cross products matrix is

B_T = Σ_{i=1}^p n_i (ȳ_i − ȳ)(ȳ_i − ȳ)^T.

The total corrected (for the mean) sum of squares and cross products matrix is T = B_T + W = Σ_{i=1}^p Σ_{j=1}^{n_i} (y_ij − ȳ)(y_ij − ȳ)^T. Note that S = T/(n − 1)
is the usual sample covariance matrix of the y ij if it is assumed that all n of
the y ij are iid so that the µi ≡ µ for i = 1, ..., p.
The one way MANOVA model is y ij = µ + τ i + ij where the ij are iid
with E(ij ) = 0 and Cov(ij ) = Σ  . The MANOVA table is shown below.

Summary One Way MANOVA Table

Source                         matrix     df
Treatment or Between           B_T        p − 1
Residual or Error or Within    W          n − p
Total (corrected)              T          n − 1

If all n of the y_ij are iid with E(y_ij) = μ and Cov(y_ij) = Σ_ε, it can be shown that A/df →P Σ_ε where A = W, B_T, or T, and df is the corresponding degrees of freedom. Let t_0 be the test statistic. Often Pillai's trace statistic, the Hotelling Lawley trace statistic, or Wilks' lambda are used. Wilks' lambda

Λ = |W|/|B_T + W| = |W|/|T| = |Σ_{i=1}^p (n_i − 1)S_i| / |(n − 1)S|
  = |Σ_{i=1}^p Σ_{j=1}^{n_i} (y_ij − ȳ_i)(y_ij − ȳ_i)^T| / |Σ_{i=1}^p Σ_{j=1}^{n_i} (y_ij − ȳ)(y_ij − ȳ)^T|.

Then t_0 = −[n − 0.5(m + p − 2)] log(Λ) and pval = P(χ²_{m(p−1)} > t_0). Hence reject H_0 if t_0 > χ²_{m(p−1)}(1 − α). See Johnson and Wichern (1988, p. 238).
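A direct computation of W, B_T, and the Wilks' lambda chi-square test in R is sketched below for hypothetical y (an n × m matrix) and group (a factor with p levels); for large n its pval should be close to that of summary(manova(y ~ group), test = "Wilks"), which uses an F approximation instead of the chi-square.

n <- nrow(y); m <- ncol(y); p <- nlevels(group)
ybar <- colMeans(y)
W <- matrix(0, m, m); BT <- matrix(0, m, m)
for (g in levels(group)) {
  yi <- y[group == g, , drop = FALSE]
  ni <- nrow(yi); ybari <- colMeans(yi)
  W  <- W  + (ni - 1)*cov(yi)               # within SSCP matrix
  BT <- BT + ni*tcrossprod(ybari - ybar)    # between SSCP matrix
}
Lambda <- det(W)/det(BT + W)                # Wilks' lambda
t0 <- -(n - 0.5*(m + p - 2))*log(Lambda)
pval <- pchisq(t0, df = m*(p - 1), lower.tail = FALSE)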

The four steps of the one way MANOVA test follow.


i) State the hypotheses H0 : µ1 = · · · = µp and H1 : not H0 .
ii) Get t0 from output.
iii) Get pval from output.
iv) State whether you reject H0 or fail to reject H0 . If pval ≤ α, reject H0


and conclude that not all of the p treatment means are equal. If pval > α, fail
to reject H0 and conclude that all p treatment means are equal or that there
is not enough evidence to conclude that not all of the p treatment means are
equal. As a textbook convention, use α = 0.05 if α is not given.

Another way to perform the one way MANOVA test is to get R output.
The default test is Pillai’s test, but other tests can be obtained with the R
output shown below.
summary(out$out) #default is Pillai’s test
summary(out$out, test = "Wilks")
summary(out$out, test = "Hotelling-Lawley")
summary(out$out, test = "Roy")
Example 9.1, continued. The R output for the iris data gives a Pillai’s
F statistic of 53.466 and pval = 0.
i) H0 : µ1 = µ2 = µ3 H1 : not H0
ii) F = 53.466
iii) pval = 0
iv) Reject H0 . The means for the three varieties of iris do differ.

Following Mardia et al. (1979, p. 335), let λ_1 ≥ λ_2 ≥ · · · ≥ λ_m be the eigenvalues of W^{−1} B_T. Then 1 + λ_i for i = 1, ..., m are the eigenvalues of W^{−1} T, and Λ = Π_{i=1}^m (1 + λ_i)^{−1}.
Following Fujikoshi (2002), let the Hotelling Lawley trace statistic U = tr(B_T W^{−1}) = tr(W^{−1} B_T) = Σ_{i=1}^m λ_i, and let Pillai's trace statistic V = tr(B_T T^{−1}) = tr(T^{−1} B_T) = Σ_{i=1}^m λ_i/(1 + λ_i). If the y_ij − μ_i are iid with common covariance matrix Σ_ε, and if H_0 is true, then under regularity conditions −[n − 0.5(m + p − 2)] log(Λ) →D χ²_{m(p−1)}, (n − m − p − 1)U →D χ²_{m(p−1)}, and (n − 1)V →D χ²_{m(p−1)}. Note that the common covariance matrix assumption
implies that each of the p treatment groups or populations has the same
covariance matrix Σ i = Σ  for i = 1, ..., p, an extremely strong assumption.

Remark 9.4. Another method for one way MANOVA is to use the model
Z = XB + E or
\[
\begin{pmatrix}
Y_{111} & Y_{112} & \cdots & Y_{11m}\\
\vdots & \vdots & & \vdots\\
Y_{1,n_1,1} & Y_{1,n_1,2} & \cdots & Y_{1,n_1,m}\\
Y_{211} & Y_{212} & \cdots & Y_{21m}\\
\vdots & \vdots & & \vdots\\
Y_{2,n_2,1} & Y_{2,n_2,2} & \cdots & Y_{2,n_2,m}\\
\vdots & \vdots & & \vdots\\
Y_{p,1,1} & Y_{p,1,2} & \cdots & Y_{p,1,m}\\
\vdots & \vdots & & \vdots\\
Y_{p,n_p,1} & Y_{p,n_p,2} & \cdots & Y_{p,n_p,m}
\end{pmatrix}
=
\begin{pmatrix}
1 & 1 & 0 & \cdots & 0\\
\vdots & \vdots & \vdots & & \vdots\\
1 & 1 & 0 & \cdots & 0\\
1 & 0 & 1 & \cdots & 0\\
\vdots & \vdots & \vdots & & \vdots\\
1 & 0 & 1 & \cdots & 0\\
\vdots & \vdots & \vdots & & \vdots\\
1 & 0 & 0 & \cdots & 1\\
\vdots & \vdots & \vdots & & \vdots\\
1 & 0 & 0 & \cdots & 1\\
1 & 0 & 0 & \cdots & 0\\
\vdots & \vdots & \vdots & & \vdots\\
1 & 0 & 0 & \cdots & 0
\end{pmatrix}
\begin{pmatrix}
\beta_{1,1} & \beta_{1,2} & \cdots & \beta_{1,m}\\
\beta_{2,1} & \beta_{2,2} & \cdots & \beta_{2,m}\\
\vdots & \vdots & \ddots & \vdots\\
\beta_{p,1} & \beta_{p,2} & \cdots & \beta_{p,m}
\end{pmatrix}
+ E.
\]

Then X is full rank where the ith column of X is an indicator for group i − 1
for i = 2, ..., p, β̂1k = Y pok = µ̂pk for k = 1, ..., m, and

β̂ik = Y i−1,ok − Y pok = µ̂i−1,k − µ̂pk

for k = 1, ..., m and i = 2, ..., p. Thus testing H0 : µ1 = · · · = µp is equivalent


to testing H0 : LB = 0 where L = [0 I p−1 ]. Such tests are discussed in
Section 8.4. Then y_ij = μ_i + ε_ij and
\[
B = \begin{pmatrix}
\mu_p^T \\
\mu_1^T - \mu_p^T \\
\mu_2^T - \mu_p^T \\
\vdots \\
\mu_{p-2}^T - \mu_p^T \\
\mu_{p-1}^T - \mu_p^T
\end{pmatrix}. \qquad (9.1)
\]

Equation (3.5) used the same X for one way ANOVA model with m = 1
as the X used in the above one way MANOVA model. Then the MLR F test
was the same as the one way ANOVA F test. Similarly, if L = (0 I p−1 ) then
the multivariate linear regression Hotelling Lawley test statistic for testing
H0 : LB = 0 versus H1 : LB ≠ 0 is U = tr(W^{−1} H) while the Hotelling
Lawley test statistic for the one way MANOVA test with H0 : µ1 = µ2 =
· · · = µp is U = tr(W −1 B T ). Rupasinghe Arachchige Don (2018) showed
that these two test statistics are the same for the above X by showing
that B T = H. Here H is given in Section 8.4 and is not the hat matrix.
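The parameterization in Remark 9.4 is easy to check numerically; the R sketch below (hypothetical y and group objects) puts group p last as the reference level so that the design matrix matches the X above.

p <- nlevels(group)
group2 <- relevel(group, ref = levels(group)[p])   # make group p the reference level
fit <- lm(y ~ group2)
Bhat <- coef(fit)       # row 1 = muhat_p, row 1 + i = muhat_i - muhat_p for i = 1, ..., p-1
grpmeans <- rowsum(y, group)/as.vector(table(group))   # p x m matrix of group means
# check: Bhat[1, ] equals grpmeans[p, ] and Bhat[1 + i, ] equals grpmeans[i, ] - grpmeans[p, ]
# summary(manova(y ~ group), test = "Hotelling-Lawley") should then give the
# one way MANOVA test for H0: LB = 0, i.e., mu_1 = ... = mu_p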

9.4 An Alternative Test Based on Large Sample Theory

Large sample theory can also be used to derive a competing test. Let Σ_i be the nonsingular population covariance matrix of the ith treatment group or population. To simplify the large sample theory, assume n_i = π_i n where 0 < π_i < 1 and Σ_{i=1}^p π_i = 1. Let T_i be a multivariate location estimator such that √n_i (T_i − μ_i) →D N_m(0, Σ_i) and √n (T_i − μ_i) →D N_m(0, Σ_i/π_i). Let T = (T_1^T, T_2^T, ..., T_p^T)^T, ν = (μ_1^T, μ_2^T, ..., μ_p^T)^T, and A be an r × mp matrix with full rank r. Then a large sample test of the form H_0 : Aν = θ_0 versus H_1 : Aν ≠ θ_0 uses

A √n (T − ν) →D u ∼ N_r(0, A diag(Σ_1/π_1, Σ_2/π_2, ..., Σ_p/π_p) A^T).    (9.2)

Let the Wald-type statistic

t_0 = [AT − θ_0]^T [A diag(Σ̂_1/n_1, Σ̂_2/n_2, ..., Σ̂_p/n_p) A^T]^{−1} [AT − θ_0].    (9.3)

These results prove the following theorem.


Theorem 9.1. Under the above conditions, t_0 →D χ²_r if H_0 is true.

This test is due to Rupasinghe Arachchige Don and Olive (2019), and a
special case was used by Zhang and Liu (2013) and Konietschke et al. (2015)
with Ti = y i and Σ̂ i = S i . The p = 2 case gives analogs to the two sample
Hotelling’s T 2 test. See Rupasinghe Arachchige Don and Pelawa Watagoda
(2018). The m = 1 case gives analogs of the one way ANOVA test. If m = 1,
see competing tests in Brown and Forsythe (1974a,b), Olive (2017a, pp. 200-
202), and Welch (1947, 1951).
For the one way MANOVA type test, let A be the block matrix
\[
A = \begin{pmatrix}
I & 0 & 0 & \cdots & -I \\
0 & I & 0 & \cdots & -I \\
\vdots & \vdots & \vdots & & \vdots \\
0 & 0 & \cdots & I & -I
\end{pmatrix}.
\]

Let μ_i ≡ μ, let H_0 : μ_1 = · · · = μ_p or, equivalently, H_0 : Aν = 0, and let
\[
w = AT = \begin{pmatrix} T_1 - T_p \\ T_2 - T_p \\ \vdots \\ T_{p-2} - T_p \\ T_{p-1} - T_p \end{pmatrix}. \qquad (9.4)
\]
Then √n w →D N_{m(p−1)}(0, Σ_w) if H_0 is true with Σ_w = (Σ_ij) where Σ_ij = Σ_p/π_p for i ≠ j, and Σ_ii = Σ_i/π_i + Σ_p/π_p for i = j. Hence

t_0 = n w^T Σ̂_w^{−1} w = w^T (Σ̂_w/n)^{−1} w →D χ²_{m(p−1)}

as the n_i → ∞ if H_0 is true. Here
\[
\frac{\hat{\Sigma}_w}{n} =
\begin{pmatrix}
\frac{\hat{\Sigma}_1}{n_1} + \frac{\hat{\Sigma}_p}{n_p} & \frac{\hat{\Sigma}_p}{n_p} & \cdots & \frac{\hat{\Sigma}_p}{n_p}\\
\frac{\hat{\Sigma}_p}{n_p} & \frac{\hat{\Sigma}_2}{n_2} + \frac{\hat{\Sigma}_p}{n_p} & \cdots & \frac{\hat{\Sigma}_p}{n_p}\\
\vdots & \vdots & \ddots & \vdots\\
\frac{\hat{\Sigma}_p}{n_p} & \frac{\hat{\Sigma}_p}{n_p} & \cdots & \frac{\hat{\Sigma}_{p-1}}{n_{p-1}} + \frac{\hat{\Sigma}_p}{n_p}
\end{pmatrix} \qquad (9.5)
\]
is a block matrix where the off diagonal block entries equal Σ̂_p/n_p and the ith diagonal block entry is Σ̂_i/n_i + Σ̂_p/n_p for i = 1, ..., (p − 1).
Reject H0 if
t0 > m(p − 1)Fm(p−1),dn (1 − δ) (9.6)
where dn = min(n1 , ..., np). See Theorem 2.25. It may make sense to relabel
the groups so that np is the largest ni or Σ̂ p /np has the smallest general-
ized variance of the Σ̂ i /ni . This test may start to outperform the one way
MANOVA test if n ≥ (m + p)^2 and ni ≥ 40m for i = 1, ..., p.
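The test of Theorem 9.1 with T_i = ȳ_i and Σ̂_i = S_i can be coded directly from Equations (9.4)–(9.6); the R sketch below uses hypothetical y (n × m) and group (a factor with p levels) and δ = 0.05, and is not the linmodpack implementation.

p <- nlevels(group); m <- ncol(y)
Tlist <- lapply(levels(group), function(g) colMeans(y[group == g, , drop = FALSE]))
Clist <- lapply(levels(group), function(g) cov(y[group == g, , drop = FALSE])/sum(group == g))
w <- unlist(lapply(1:(p - 1), function(i) Tlist[[i]] - Tlist[[p]]))     # Equation (9.4)
Sw <- kronecker(matrix(1, p - 1, p - 1), Clist[[p]])    # off diagonal blocks Sigmahat_p/n_p
for (i in 1:(p - 1)) {                                  # add Sigmahat_i/n_i to diagonal blocks
  idx <- ((i - 1)*m + 1):(i*m)
  Sw[idx, idx] <- Sw[idx, idx] + Clist[[i]]
}
t0 <- as.numeric(crossprod(w, solve(Sw, w)))       # t0 = w^T (Sigmahat_w/n)^{-1} w
dn <- min(table(group))
reject <- t0 > m*(p - 1)*qf(0.95, m*(p - 1), dn)   # cutoff from (9.6) with delta = 0.05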

If Σ_i ≡ Σ and Σ̂_i is replaced by Σ̂, we will show that, for the one way MANOVA test, t_0 = (n − p)U where U is the Hotelling Lawley statistic.
For the proof, some results on the vec and Kronecker product will be useful.
Following Henderson and Searle (1979), vec(G) and vec(GT ) contain the
same elements in different sequences. Define the permutation matrix P r,m
such that
vec(G) = P r,m vec(GT ) (9.7)
where G is r × m. Then P Tr,m = P m,r , and P r,m P m,r = P m,r P r,m = I rm .
If C is s × m and D is p × r, then

C ⊗ D = P_{p,s}(D ⊗ C)P_{m,r}. (9.8)


Also
(C ⊗ D)vec(G) = vec(DGC T ) = P p,s (D ⊗ C)vec(GT ). (9.9)
If C is m × m and D is r × r, then C ⊗ D = P_{r,m}(D ⊗ C)P_{m,r}, and

[vec(G)]T (C ⊗ D)vec(G) = [vec(GT )]T (D ⊗ C)vec(GT ). (9.10)

Theorem 9.2. For the one way MANOVA test using A as defined below
Theorem 9.1, let the Hotelling Lawley trace statistic U = tr(W^{−1} B_T). Then

(n − p)U = t_0 = [AT − θ_0]^T [A diag(Σ̂/n_1, Σ̂/n_2, ..., Σ̂/n_p) A^T]^{−1} [AT − θ_0].

Hence if the Σ_i ≡ Σ and H_0 : μ_1 = · · · = μ_p is true, then (n − p)U = t_0 →D χ²_{m(p−1)}.
Proof. Let B and X be as in Remark 9.4. Let L = [0 I p−1 ] be an s × p
matrix with s = p − 1. For this choice of X, U = tr(W −1 B T ) = tr(W −1 H)
by Remark 9.4. Hence by Theorem 8.6,
(n − p)U = [vec(LB̂)]^T [Σ̂_ε^{−1} ⊗ (L(X^T X)^{−1} L^T)^{−1}] [vec(LB̂)].    (9.11)

Now vec([LB̂]^T) = w = AT of Equation (9.4) with T_i = ȳ_i. Then

t_0 = w^T (Σ̂_w/n)^{−1} w

where

Σ̂_w/n = L(X^T X)^{−1} L^T ⊗ Σ̂

is given by Equation (9.5) with each Σ̂_i replaced by Σ̂. Thus

t_0 = [vec([LB̂]^T)]^T [(L(X^T X)^{−1} L^T)^{−1} ⊗ Σ̂_ε^{−1}] [vec([LB̂]^T)].    (9.12)

Then t_0 = (n − p)U by Equation (9.10) with G = LB̂. □

Hence the one way MANOVA test is a special case of Equation (9.3) where
θ0 = 0 and Σ̂ i ≡ Σ̂, but then Theorem 9.1 only holds if H0 is true and
Σ i ≡ Σ. Note that the large sample theory of Theorem 9.1 is trivial compared
to the large sample theory of (n − p)U given in Theorem 9.2. Fujikoshi (2002)
showed (n − m − p − 1)U →D χ²_{m(p−1)} while (n − p)U →D χ²_{m(p−1)} by Theorem 9.2 if H_0 is true under the common covariance matrix assumption. There is no contradiction since (m + 1)U →P 0 as the n_i → ∞. Note that A is m(p − 1) × mp.
For tests corresponding to Theorem 9.1, we will use bootstrap with the
prediction region method of Chapter 4 to test H0 when Σ̂ w or the Σ̂ i are
unknown or difficult to estimate. To bootstrap the test H0 : Aν = θ 0 versus
H1 : Aν ≠ θ_0, use Z_n = AT. Take a sample of size n_j with replacement from
the nj cases for each group for j = 1, 2, ..., p to obtain Tj∗ and T ∗1 . Repeat B
times to obtain T ∗1 , ..., T ∗B . Then Zi∗ = AT ∗i for i = 1, ..., B. We will illustrate
this method with the analog for the one way MANOVA test for H0 : Aν = 0
which is equivalent to H0 : µ1 = · · · = µp , where 0 is an r × 1 vector of
zeroes with r = m(p − 1). Then Zn = AT = w given by Equation (9.4).
Hence the m(p − 1) × 1 vector Z_i^* = AT_i^* = ((T_1^* − T_p^*)^T, ..., (T_{p−1}^* − T_p^*)^T)^T
where Tj is a multivariate location estimator (such as the sample mean,
coordinatewise median, or trimmed mean), applied to the cases in the jth
treatment group. The prediction region method fails to reject H0 if 0 is in
the resulting confidence region.
We may need B ≥ 50m(p − 1), n ≥ (m + p)^2, and ni ≥ 40m. If the ni are not
large, the one way MANOVA test can be regarded as a regularized estimator,
and can perform better than the tests that do not assume equal population
covariance matrices. See the simulations in Rupasinghe Arachchige Don and
Olive (2019).
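A sketch of the bootstrap just described, using the sample mean as T_j and hypothetical y and group objects; the resulting Z_i^* would then be fed to the Chapter 4 prediction region method, which is not reproduced here.

Tfun <- colMeans    # or, e.g., function(x) apply(x, 2, median) for the coordinatewise median
p <- nlevels(group); B <- 1000
Zstar <- t(replicate(B, {
  Tstar <- lapply(levels(group), function(g) {
    yg <- y[group == g, , drop = FALSE]
    Tfun(yg[sample(nrow(yg), replace = TRUE), , drop = FALSE])
  })
  unlist(lapply(1:(p - 1), function(i) Tstar[[i]] - Tstar[[p]]))
}))
# Zstar is B x m(p-1); fail to reject H0 if 0 is in the prediction region
# computed from its rows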
If H_0 : Aν = θ_0 is true and if the Σ_i ≡ Σ for i = 1, ..., p, then

t_0 = [AT − θ_0]^T [A diag(Σ̂/n_1, Σ̂/n_2, ..., Σ̂/n_p) A^T]^{−1} [AT − θ_0] →D χ²_r.

If H_0 is true but the Σ_i are not equal, we may be able to get a bootstrap cutoff by using

t_{0i}^* = [AT_i^* − AT]^T [A diag(Σ̂/n_1, Σ̂/n_2, ..., Σ̂/n_p) A^T]^{−1} [AT_i^* − AT]
        = D²_{AT_i^*}(AT, A diag(Σ̂/n_1, Σ̂/n_2, ..., Σ̂/n_p) A^T).

9.5 Summary

1) The multivariate linear model y i = B T xi +i for i = 1, ..., n has m ≥ 2


response variables Y1 , ..., Ym and p predictor variables x1 , x2, ..., xp. The ith
case is (xTi , y Ti ) = (xi1 , xi2, ..., xip, Yi1 , ..., Yim). If a constant xi1 = 1 is in
the model, then xi1 could be omitted from the case. The model is written
in matrix form as Z = XB + E. The model has E(k ) = 0 and Cov(k ) =
Σ  = (σij ) for k = 1, ..., n. Also E(ei ) = 0 while Cov(ei , ej ) = σij I n for
i, j = 1, ..., m. Then B and Σ  are unknown matrices of parameters to be


estimated, and E(Z) = XB while E(Yij ) = xTi β j .
The data matrix W = [X Z] except usually the first column 1 of X is
omitted if x_{i,1} ≡ 1. The n × m matrix
\[
Z = \begin{pmatrix}
Y_{1,1} & Y_{1,2} & \dots & Y_{1,m} \\
Y_{2,1} & Y_{2,2} & \dots & Y_{2,m} \\
\vdots & \vdots & \ddots & \vdots \\
Y_{n,1} & Y_{n,2} & \dots & Y_{n,m}
\end{pmatrix}
= [Y_1 \ Y_2 \ \dots \ Y_m]
= \begin{pmatrix} y_1^T \\ \vdots \\ y_n^T \end{pmatrix}.
\]
The n × p matrix
\[
X = \begin{pmatrix}
x_{1,1} & x_{1,2} & \dots & x_{1,p} \\
x_{2,1} & x_{2,2} & \dots & x_{2,p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n,1} & x_{n,2} & \dots & x_{n,p}
\end{pmatrix}
= [v_1 \ v_2 \ \dots \ v_p]
= \begin{pmatrix} x_1^T \\ \vdots \\ x_n^T \end{pmatrix}
\]
where often v_1 = 1.
The p × m matrix
\[
B = \begin{pmatrix}
\beta_{1,1} & \beta_{1,2} & \dots & \beta_{1,m} \\
\beta_{2,1} & \beta_{2,2} & \dots & \beta_{2,m} \\
\vdots & \vdots & \ddots & \vdots \\
\beta_{p,1} & \beta_{p,2} & \dots & \beta_{p,m}
\end{pmatrix}
= [\beta_1 \ \beta_2 \ \dots \ \beta_m].
\]
The n × m matrix
\[
E = \begin{pmatrix}
\epsilon_{1,1} & \epsilon_{1,2} & \dots & \epsilon_{1,m} \\
\epsilon_{2,1} & \epsilon_{2,2} & \dots & \epsilon_{2,m} \\
\vdots & \vdots & \ddots & \vdots \\
\epsilon_{n,1} & \epsilon_{n,2} & \dots & \epsilon_{n,m}
\end{pmatrix}
= [e_1 \ e_2 \ \dots \ e_m]
= \begin{pmatrix} \epsilon_1^T \\ \vdots \\ \epsilon_n^T \end{pmatrix}.
\]

2) The univariate linear model is Yi = xi,1 β1 + xi,2 β2 + · · · + xi,p βp + ei =


xTi β + ei = β T xi + ei for i = 1, . . . , n. In matrix notation, these n equations
become Y = Xβ + e, where Y is an n × 1 vector of response variables, X
is an n × p matrix of predictors, β is a p × 1 vector of unknown coefficients,
and e is an n × 1 vector of unknown errors.
3) Each response variable in a multivariate linear model follows a univari-
ate linear model Y j = Xβ j + ej for j = 1, ..., m where it is assumed that
E(ej ) = 0 and Cov(ej ) = σjj I n .
4) In a MANOVA model, y k = B T xk + k for k = 1, ..., n is written in
matrix form as Z = XB+E. The model has E(k ) = 0 and Cov(k ) = Σ  =
(σij ) for k = 1, ..., n. Each response variable in a MANOVA model follows
an ANOVA model Y j = Xβ j + ej for j = 1, ..., m where it is assumed that


E(ej ) = 0 and Cov(ej ) = σjj I n .
5) The one way MANOVA model is as above where Y j = Xβ j + ej
is a one way ANOVA model for j = 1, ..., m. Check the model by making m
response and residual plots and a DD plot of the residual vectors ˆ i .
6) The one way MANOVA model is a generalization of the Hotelling’s
T 2 test from 2 groups to p ≥ 2 groups, assumed to have different means
but a common covariance matrix Σ  . Want to test H0 : µ1 = · · · = µp .
This model is a multivariate linear model so there are m response variables
Y1 , ..., Ym measured for each group. Each Yi follows a one way ANOVA model
for i = 1, ..., m.
7) For the one way MANOVA model, make a DD plot of the residual
vectors ˆ i where i = 1, ..., n. Use the plot to check whether the i follow a
multivariate normal distribution or some other elliptically contoured distri-
bution. We want n ≥ (m + p)^2 and ni ≥ 10m.
8) For the one way MANOVA model, write the data as Yijk where i =
1, ..., p and j = 1, ..., ni. So k corresponds to the kth variable Yk for k =
1, ..., m. Then Ŷijk = µ̂ik = Y iok for i = 1, ..., p. So for the kth variable, the
means µ1k , ..., µpk are of interest. The residuals are rijk = Yijk − Ŷijk . For
each variable Yk make a response plot of Y iok versus Yijk and a residual plot
of Y iok versus rijk . Both plots will consist of p dot plots of ni cases located
at the Y iok . The dot plots should follow the identity line in the response plot
and the horizontal r = 0 line in the residual plot for each of the m response
variables Y1 , ..., Ym. For each variable Yk , let Rik be the range of the ith dot
plot. If each ni ≥ 5, we want max(R1k , ..., Rpk) ≤ 2 min(R1k , ..., Rpk). The
one way MANOVA model may be reasonable for the test in point 9) if the
m response and residual plots satisfy the above graphical checks.
9) The four steps of the one way MANOVA test follow.
i) State the hypotheses H0 : µ1 = · · · = µp and H1 : not H0 .
ii) Get t0 from output.
iii) Get pval from output.
iv) State whether you reject H0 or fail to reject H0 . If pval ≤ α, reject H0
and conclude that not all of the p treatment means are equal. If pval > α, fail
to reject H0 and conclude that all p treatment means are equal or that there
is not enough evidence to conclude that not all of the p treatment means
are equal. Give a nontechnical sentence as the conclusion, if possible. As a
textbook convention, use α = 0.05 if α is not given.
10) The one way MANOVA test assumes that the p treatment groups or
populations have the same covariance matrix: Σ 1 = · · · = Σ p , but the test
has some resistance to this assumption. See points 6) and 8).

9.6 Complements

The linmodpack function manbtsim2 simulates the bootstrap tests corre-


sponding to Theorem 9.1 using the sample mean, coordinatewise median, and
coordinatewise 25% trimmed mean. The function manbtsim4 adds the test
corresponding to Equation (9.6). The function manbtsim is like manbtsim2,
but adds TRM V N from Definition 7.17 to the simulation, making the simu-
lation very slow. The prediction region method was proven to work for the
sample mean, coordinatewise median, and coordinatewise trimmed means in
Rupasinghe Arachchige Don and Olive (2019). We only conjecture that the
prediction region method works for TRM V N .

9.7 Problems

9.1∗ . If X is of full rank and least squares is used to fit the MANOVA
model, then β̂ i = (X T X)−1 X T Y i , and Y i = Xβ i + ei . Treating Xβ i as a
constant, Cov(Y i , Y j ) = Cov(ei , ej ) = σij I n . Using this information, show
Cov(β̂ i , β̂j ) = σij (X T X)−1 .
Chapter 10
1D Regression Models Such as GLMs

... estimates of the linear regression coefficients are relevant to the linear
parameters of a broader class of models than might have been suspected.
Brillinger (1977, p. 509)
After computing β̂, one may go on to prepare a scatter plot of the points
(β̂xj , yj ), j = 1, ..., n and look for a functional form for g(·).
Brillinger (1983, p. 98)
This chapter considers 1D regression models including additive error re-
gression (AER), generalized linear models (GLMs), and generalized additive
models (GAMs). Multiple linear regression is a special case of these four
models.
See Definition 1.2 for the 1D regression model, sufficient predictor (SP =
h(x)), estimated sufficient predictor (ESP = ĥ(x)), generalized linear model
(GLM), and the generalized additive model (GAM). When using a GAM to
check a GLM, the notation ESP may be used for the GLM, and EAP (esti-
mated additive predictor) may be used for the ESP of the GAM. Definition
1.3 defines the response plot of ESP versus Y .
Suppose the sufficient predictor SP = h(x). Often SP = xT β. If u only
contains the nontrivial predictors, then SP = β1 + uT β2 = α + uT η is often
used where β = (β1 , βT2 )T = (α, ηT )T and x = (1, uT )T .

10.1 Introduction

First we describe some regression models in the following three definitions.


The most general model uses SP = h(x) as defined in Definition 1.2. The
GAM with SP = AP will be useful for checking the model (often a GLM)
with SP = xT β. Thus the additive error regression model with SP = AP
is useful for checking the multiple linear regression model. The model with
SP = β T x = xT β tends to have the most theory for inference and variable


selection. For the models below, the model estimated mean function and
often a nonparametric estimator of the mean function, such as lowess, will
be added to the response plot as a visual aid. For all of the models in the
following three definitions, Y1 , ..., Yn are independent, but often the subscripts
are suppressed. For example, Y = SP + e is used instead of Yi = Yi |xi =
Yi |SPi = SPi + ei = h(xi ) + ei for i = 1, ..., n.

Definition 10.1. i) The additive error regression (AER) model


Y = SP + e has conditional mean function E(Y |SP ) = SP and conditional
variance function V (Y |SP ) = σ 2 = V (e). See Section 10.2. The response
plot of ESP versus Y and the residual plot of ESP versus r = Y − Ŷ are
used just as for multiple linear regression. The estimated model (conditional)
mean function is the identity line Y = ESP . The response transformation
model is Y = t(Z) = SP + e where the response transformation t(Z) can be
found using a graphical method similar to Section 1.2.
 
ii) The binary regression model is Y ∼ binomial(1, ρ = e^SP/(1 + e^SP)). This model has E(Y|SP) = ρ = ρ(SP) and V(Y|SP) = ρ(SP)(1 − ρ(SP)). Then ρ̂ = e^ESP/(1 + e^ESP) is the estimated mean function. See Section 10.3.
iii) The binomial regression model is Y_i ∼ binomial(m_i, ρ = e^SP/(1 + e^SP)). Then E(Y_i|SP_i) = m_i ρ(SP_i) and V(Y_i|SP_i) = m_i ρ(SP_i)(1 − ρ(SP_i)), and Ê(Y_i|x_i) = m_i ρ̂ = m_i e^ESP/(1 + e^ESP) is the estimated mean function. See Section 10.3.
iv) The Poisson regression (PR) model Y ∼ Poisson(e^SP) has
E(Y |SP ) = V (Y |SP ) = exp(SP ). The estimated mean and variance func-
tions are Ê(Y |x) = eESP . See Section 10.4.
v) Suppose Y has a gamma G(ν, λ) distribution so that E(Y ) = νλ and
V (Y ) = νλ2 . The Gamma regression model Y ∼ G (ν, λ = µ(SP )/ν)
has E(Y |SP ) = µ(SP ) and V (Y |SP ) = [µ(SP )]2 /ν. The estimated mean
function is Ê(Y |x) = µ(ESP ). The choices µ(SP ) = SP , µ(SP ) = exp(SP )
and µ(SP ) = 1/SP are common. Since µ(SP ) > 0, Gamma regression mod-
els that use the identity or reciprocal link run into problems if µ(ESP ) is
negative for some of the cases.
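As an illustration of iv), the response plot for a Poisson regression can be made in R as below; the data frame dat with response y and predictors x2 and x3 is hypothetical.

out <- glm(y ~ x2 + x3, family = poisson, data = dat)
ESP <- predict(out, type = "link")        # estimated sufficient predictor
plot(ESP, dat$y, xlab = "ESP", ylab = "Y")
curve(exp(x), add = TRUE)                 # estimated mean function exp(ESP)
lines(lowess(ESP, dat$y), lty = 2)        # nonparametric estimate of the mean function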

Alternatives to the binomial and Poisson regression models are needed


because often the mean function for the model is good, but the variance
function is not: there is overdispersion. See Section 10.8.

A useful alternative to the binomial regression model is a beta–binomial


regression (BBR) model. Following Simonoff (2003, pp. 93-94) and Agresti
(2002, pp. 554-555), let δ = ρ/θ and ν = (1 − ρ)/θ, so ρ = δ/(δ + ν) and
θ = 1/(δ + ν). Let B(δ, ν) = Γ(δ)Γ(ν)/Γ(δ + ν). If Y has a beta–binomial distribution, Y ∼ BB(m, ρ, θ), then the probability mass function of Y is
\[
P(Y = y) = \binom{m}{y} \frac{B(\delta + y, \nu + m - y)}{B(\delta, \nu)}
\]
for y = 0, 1, 2, ..., m where 0 < ρ < 1 and θ > 0.
Hence δ > 0 and ν > 0. Then E(Y ) = mδ/(δ + ν) = mρ and V(Y ) =
mρ(1 − ρ)[1 + (m − 1)θ/(1 + θ)]. If Y |π ∼ binomial(m, π) and π ∼ beta(δ, ν),
then Y ∼ BB(m, ρ, θ). As θ → 0, it can be shown that V (π) → 0, and the
beta–binomial distribution converges to the binomial distribution.

Definition 10.2. The BBR model states that Y1 , ..., Yn are independent
random variables where Yi |SPi ∼ BB(mi , ρ(SPi ), θ). Hence E(Yi |SPi ) =
mi ρ(SPi ) and

V (Yi |SPi ) = mi ρ(SPi )(1 − ρ(SPi ))[1 + (mi − 1)θ/(1 + θ)].

The BBR model has the same mean function as the binomial regression
model, but allows for overdispersion. As θ → 0, it can be shown that the
BBR model converges to the binomial regression model.

A useful alternative to the PR model is a negative binomial regression


(NBR) model. If Y has a (generalized) negative binomial distribution, Y ∼
NB(μ, κ), then the probability mass function of Y is
\[
P(Y = y) = \frac{\Gamma(y + \kappa)}{\Gamma(\kappa)\Gamma(y + 1)}
\left(\frac{\kappa}{\mu + \kappa}\right)^{\kappa}
\left(1 - \frac{\kappa}{\mu + \kappa}\right)^{y}
\]
for y = 0, 1, 2, ... where μ > 0 and κ > 0. Then E(Y) = μ and V(Y) = μ + μ²/κ. (This distribution is a generalization of the negative binomial (κ, ρ)
distribution where ρ = κ/(µ + κ) and κ > 0 is an unknown real parameter
rather than a known integer.)

Definition 10.3. The negative binomial regression (NBR) model


is Y |SP ∼ NB(exp(SP), κ). Thus E(Y |SP) = exp(SP) and

V(Y |SP) = exp(SP)[1 + exp(SP)/κ] = exp(SP) + τ exp(2 SP), where τ = 1/κ.

The NBR model has the same mean function as the PR model but allows
for overdispersion. Following Agresti (2002, p. 560), as τ ≡ 1/κ → 0, it can
be shown that the NBR model converges to the PR model.
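For example, the NBR model can be fit in R with glm.nb from the MASS package, which estimates κ (called theta there) along with β; the data frame dat and the variable names y, x2, x3 below are hypothetical.

library(MASS)
fitp  <- glm(y ~ x2 + x3, family = poisson, data = dat)   # PR fit
fitnb <- glm.nb(y ~ x2 + x3, data = dat)                  # NBR fit
fitnb$theta                     # estimate of kappa; small kappa suggests much overdispersion
cbind(PR = coef(fitp), NBR = coef(fitnb))   # both models use the same mean function exp(SP)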

Several important survival regression models are 1D regression models


with SP = xT β, including the Cox (1972) proportional hazards regression
model. The following survival regression models are parametric. The accel-
erated failure time model has log(Y ) = α + SPA + σe where SPA = uT β A ,
V (e) = 1, and the ei are iid from a location scale family. If the Yi are log-
normal, the ei are normal. If the Yi are loglogistic, the ei are logistic. If the
Yi are Weibull, the ei are from a smallest extreme value distribution. The
Weibull regression model is a proportional hazards model using Yi and an
accelerated failure time model using log(Yi) with βP = βA/σ. Let Y have a
Weibull W(γ, λ) distribution if the pdf of Y is

f(y) = λ γ y^(γ−1) exp[−λ y^γ]

for y > 0. Prediction intervals for parametric survival regression models are
for survival times Y , not censored survival times. See Section 10.10.

Definition 10.4. The Weibull proportional hazards regression model is

Y |SP ∼ W (γ = 1/σ, λ0 exp(SP )),

where λ0 = exp(−α/σ).

Generalized linear models are an important class of parametric 1D regres-


sion models that include multiple linear regression, logistic regression, and
Poisson regression. Assume that there is a response variable Y and a q × 1
vector of nontrivial predictors x. Before defining a generalized linear model,
the definition of a one parameter exponential family is needed. Let f(y) be
a probability density function (pdf) if Y is a continuous random variable,
and let f(y) be a probability mass function (pmf) if Y is a discrete random
variable. Assume that the support of the distribution of Y is Y and that the
parameter space of θ is Θ.

Definition 10.5. A family of pdfs or pmfs {f(y|θ) : θ ∈ Θ} is a


1-parameter exponential family if

f(y|θ) = k(θ)h(y) exp[w(θ)t(y)] (10.1)

where k(θ) ≥ 0 and h(y) ≥ 0. The functions h, k, t, and w are real valued
functions.

In the definition, it is crucial that k and w do not depend on y and that


h and t do not depend on θ. The parameterization is not unique since, for
example, w could be multiplied by a nonzero constant m if t is divided by m.
Many other parameterizations are possible. If h(y) = g(y)IY (y), then usually
k(θ) and g(y) are positive, so another parameterization is

f(y|θ) = exp[w(θ)t(y) + d(θ) + S(y)]IY (y) (10.2)

where S(y) = log(g(y)), d(θ) = log(k(θ)), and the support Y does not depend
on θ. Here the indicator function IY (y) = 1 if y ∈ Y and IY (y) = 0, otherwise.
Definition 10.6. Assume that the data is (Yi , xi ) for i = 1, ..., n. An


important type of generalized linear model (GLM) for the data states
that the Y1 , ..., Yn are independent random variables from a 1-parameter ex-
ponential family with pdf or pmf
 
f(yi |θ(xi)) = k(θ(xi)) h(yi) exp[c(θ(xi)) yi / a(φ)].   (10.3)

Here φ is a known constant (often a dispersion parameter), a(·) is a known


function, and θ(xi ) = η(xTi β). Let E(Yi ) ≡ E(Yi |xi ) = µ(xi ). The GLM
also states that g(µ(xi )) = xTi β where the link function g is a differen-
tiable monotone function. Then the canonical link function is g(µ(xi )) =
c(µ(xi )) = βT xi , and the quantity β T x is called the linear predictor.

The GLM parameterization (10.3) can be written in several ways. By


Equation (10.2), f(yi |θ(xi)) = exp[w(θ(xi)) yi + d(θ(xi)) + S(y)] IY (y) =

exp[ (c(θ(xi))/a(φ)) yi − b(c(θ(xi)))/a(φ) + S(y) ] IY (y)

= exp[ (νi/a(φ)) yi − b(νi)/a(φ) + S(y) ] IY (y)
where νi = c(θ(xi )) is called the natural parameter, and b(·) is some known
function.
Notice that a GLM is a parametric model determined by the 1-parameter
exponential family, the link function, and the linear predictor. Since the link
function is monotone, the inverse link function g−1 (·) exists and satisfies

µ(xi ) = g−1 (xTi β). (10.4)

Also notice that the Yi follow a 1-parameter exponential family where

t(yi) = yi and w(θ) = c(θ)/a(φ),

and notice that the value of the parameter θ(xi ) = η(xTi β) depends on the
value of xi . Since the model depends on x only through the linear predictor
xT β, a GLM is a 1D regression model. Thus the linear predictor is also a
sufficient predictor.

The following three sections illustrate three of the most important gen-
eralized linear models. Inference and variable selection for these GLMs are
discussed in Sections 10.5 and 10.6. Their generalized additive model analogs
are discussed in Section 10.7.
10.2 Additive Error Regression

The linear regression model Y = SP + e = xT β + e includes multiple linear


regression (MLR) and many experimental design models as special cases. See
Chapters 1–4.
If Y is quantitative, a useful extension is the additive error regression
(AER) model Y = SP + e where SP = h(x). See Definition 10.1 i). If
e ∼ N (0, σ 2 ), then Y ∼ N (SP, σ 2 ). If e ∼ N (0, σ 2 ) and SP = xT β, then the
resulting multiple linear regression model is also a GLM and an additive error
regression model. The normality assumption is too restrictive since the error
distribution is rarely normal. If m is a smooth function, the additive error
single index model, where SP = h(x) = m(xT β), is an important special
case.
Response plots, residual plots, and response transformations for the addi-
tive error regression model are very similar to those for the multiple linear
regression model. See Olive (2004b). To avoid overfitting, assume n ≥ 10d
where d is the model degrees of freedom, possibly estimated. Hence d = p for
multiple linear regression with OLS. Prediction intervals are given in Section
4.3.
The GAM additive error regression model is useful for checking the mul-
tiple linear regression (MLR) model. Let ESP = xT β̂ be the ESP for MLR
where x = (1, x2, ..., xp)T. Let ESP = EAP = α̂ + Ŝ2(x2) + · · · + Ŝp(xp) be the ESP
for the GAM additive error regression model.
After making the usual checks on the MLR model, there are two useful
plots that use the GAM. If the plotted points of the EE plot of EAP versus
ESP cluster tightly about the identity line, then the MLR and the GAM
produce similar fitted values. A plot of xj versus Ŝj (xj ) can be useful for
visualizing whether a predictor transformation tj (xj ) is needed for the jth
predictor xj . If the plot is linear then no transformation may be needed. If the
plot is nonlinear, the shape of the plot, along with the graphical methods of
Section 1.2, may be useful for suggesting the transformation tj . The additive
error regression GAM can be fit with all p of the Sj unspecified, or fit p GAMs
where Si is linear except for unspecified Sj where j = 2, ..., p. Some of these
applications for checking GLMs with GAMs will be discussed in Section 10.7.
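A minimal sketch of these two checks with the mgcv package is given below; the data frame dat and the predictor names x2 and x3 are hypothetical.

library(mgcv)
fitmlr <- lm(y ~ x2 + x3, data = dat)
fitgam <- gam(y ~ s(x2) + s(x3), data = dat)     # additive error regression GAM
ESP <- fitted(fitmlr); EAP <- fitted(fitgam)
plot(EAP, ESP); abline(0, 1)    # EE plot: tight clustering about the identity
                                # line suggests the MLR and GAM fits agree
plot(fitgam, pages = 1)         # plots of each estimated S_j(x_j) versus x_j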
Suppose n/p is large and SP = m(xT β). Olive (2008: ch. 12, 2010: ch.
15), Olive and Hawkins (2005), and Chang and Olive (2010) show that vari-
able selection methods using Cp and the partial F test, originally meant for
multiple linear regression, can be used (under regularity conditions) for the
additive error single index model. See Section 10.11.
10.3 Binary, Binomial, and Logistic Regression

Multiple linear regression is used when the response variable is quantitative,


but for many data sets the response variable is categorical and takes on two
values: 0 or 1. The occurrence of the category that is counted is labelled as a
1 or a “success,” while the nonoccurrence of the category that is counted is
labelled as a 0 or a “failure.” For example, a “success” = “occurrence” could
be a person who contracted lung cancer and died within 5 years of detection.
Often the labelling is arbitrary, e.g., if the response variable is gender taking
on the two categories female and male. If males are counted then Y = 1 if the
subject is male and Y = 0 if the subject is female. If females are counted then
this labelling is reversed. For a binary response variable, a binary regression
model is often appropriate.

Definition 10.7. The binomial regression model states that Y1 , ..., Yn


are independent random variables with Yi ∼ binomial(mi , ρ(xi )). The binary
regression model is the special case where mi ≡ 1 for i = 1, ..., n while the
logistic regression (LR) model is the special case of binomial regression
where
P (success|xi) = ρ(xi) = exp(h(xi))/[1 + exp(h(xi))].   (10.5)

If the sufficient predictor SP = h(x) = xT β, then the most used binomial


regression models are such that Y1 , ..., Yn are independent random variables
with Yi ∼ binomial(mi , ρ(xT β)), or

Yi |SPi ∼ binomial(mi , ρ(SPi )). (10.6)

Note that the conditional mean function E(Yi |SPi ) = mi ρ(SPi ) and the
conditional variance function V (Yi |SPi ) = mi ρ(SPi )(1 − ρ(SPi )).
Thus the binary logistic regression model says that

Y |SP ∼ binomial(1, ρ(SP))

where ρ(SP) = exp(SP)/(1 + exp(SP))
for the LR model. Note that the conditional mean function E(Y |SP ) =
ρ(SP ) and the conditional variance function V (Y |SP ) = ρ(SP )(1 − ρ(SP )).
For the LR model, the Y are independent and
 
Y |x ≈ binomial(1, exp(ESP)/[1 + exp(ESP)]),

or Y |SP ≈ Y |ESP ≈ binomial(1, ρ(ESP)).


Although the logistic regression model is the most important model for
binary regression, several other models are also used. Notice that ρ(x) =
P (S|x) is the population probability of success S given x, while 1 − ρ(x) =
P (F |x) is the probability of failure F given x. In particular, for binary re-
gression, ρ(x) = P (Y = 1|x) = 1 −P (Y = 0|x). If this population proportion
ρ = ρ(h(x)), then the model is a 1D regression model. The model is a GLM if
the link function g is differentiable and monotone so that g(ρ(x T β)) = xT β
and g−1 (xT β) = ρ(xT β). Usually the inverse link function corresponds to
the cumulative distribution function of a location scale family. For example,
for logistic regression, g−1 (x) = exp(x)/(1 + exp(x)) which is the cdf of the
logistic L(0, 1) distribution. For probit regression, g−1 (x) = Φ(x) which is the
cdf of the normal N (0, 1) distribution. For the complementary log-log link,
g−1 (x) = 1 − exp[− exp(x)] which is the cdf for the smallest extreme value
distribution. For this model, g(ρ(x)) = log[− log(1 − ρ(x))] = xT β.
Another important binary regression model is the discriminant function
model. See Hosmer and Lemeshow (2000, pp. 43–44). Assume that πj =
P (Y = j) and that x|Y = j ∼ Nk (µj , Σ) for j = 0, 1. That is, the conditional
distribution of x given Y = j follows a multivariate normal distribution with
mean vector µj and covariance matrix Σ which does not depend on j. Notice
that Σ = Cov(x|Y) ≠ Cov(x). Then as for the binary logistic regression
model with x = (1, uT )T and β = (α, ηT )T ,

P (Y = 1|x) = ρ(x) = exp(α + uT η)/[1 + exp(α + uT η)] = exp(xT β)/[1 + exp(xT β)].

Definition 10.8. Under the conditions above, the discriminant func-


tion parameters are given by

η = Σ−1(µ1 − µ0)   (10.7)

and α = log(π1/π0) − 0.5(µ1 − µ0)T Σ−1(µ1 + µ0).
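A small sketch of the plug-in estimates of (10.7) is below, using sample means, sample proportions, and a pooled covariance matrix; the matrix u of nontrivial predictors and the 0/1 response y are hypothetical objects.

mu0 <- colMeans(u[y == 0, , drop = FALSE])
mu1 <- colMeans(u[y == 1, , drop = FALSE])
Sp  <- ((sum(y == 0) - 1) * cov(u[y == 0, , drop = FALSE]) +
        (sum(y == 1) - 1) * cov(u[y == 1, , drop = FALSE])) / (length(y) - 2)
etahat   <- solve(Sp, mu1 - mu0)                     # eta = Sigma^{-1}(mu1 - mu0)
alphahat <- log(mean(y == 1)/mean(y == 0)) -
  0.5 * sum((mu1 - mu0) * solve(Sp, mu1 + mu0))
c(alphahat, etahat)     # compare with coef(glm(y ~ u, family = binomial))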
The logistic regression (maximum likelihood) estimator also tends to per-
form well for this type of data. An exception is when the Y = 0 cases and
Y = 1 cases can be perfectly or nearly perfectly classified by the ESP. Let
the logistic regression ESP = xT β̂. Consider the response plot of the ESP
versus Y . If the Y = 0 values can be separated from the Y = 1 values by
the vertical line ESP = 0, then there is perfect classification. See Figure 10.1
b). In this case the maximum likelihood estimator for the logistic regression
parameters β does not exist because the logistic curve can not approximate
a step function perfectly. See Atkinson and Riani (2000, pp. 251-254). If only
a few cases need to be deleted in order for the data set to have perfect clas-
sification, then the amount of “overlap” is small and there is nearly “perfect
classification.”
Ordinary least squares (OLS) can also be useful for logistic regression. The
ANOVA F test, partial F test, and OLS t tests are often asymptotically valid
when the conditions in Definition 10.8 are met, and the OLS ESP and LR
ESP are often highly correlated. See Haggstrom (1983). For binary data the
Yi only take two values, 0 and 1, and the residuals do not behave very well.
Hence the response plot will be used both as a goodness of fit plot and as a
lack of fit plot.

Definition 10.9. For binary logistic regression, the response plot or esti-
mated sufficient summary plot is the plot of the ESP = ĥ(xi ) versus Yi with
the estimated mean function
ρ̂(ESP) = exp(ESP)/[1 + exp(ESP)]

added as a visual aid.

A scatterplot smoother such as lowess is also added as a visual aid. Alter-


natively, divide the ESP into J slices with approximately the same number
of cases in each slice. Then compute the sample mean = sample proportion
in slice s: ρ̂s = Ȳs = ∑s Yi / ∑s mi where mi ≡ 1 and the sum is over the
cases in slice s. Then plot the resulting step function.
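A minimal sketch of this response plot in R is given below; the fitted binary logistic regression object fit and the number of slices J are hypothetical.

ESP <- predict(fit)                        # linear predictor = ESP
Y <- fit$y
plot(ESP, Y)
curve(exp(x)/(1 + exp(x)), add = TRUE)     # estimated LR mean function
J <- 4                                     # slice the ESP into J groups
slice <- cut(ESP, quantile(ESP, seq(0, 1, length.out = J + 1)),
             include.lowest = TRUE)
points(tapply(ESP, slice, mean), tapply(Y, slice, mean), pch = 3)
# the plotted slice proportions play the role of the step function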
Suppose that x = (1, uT)T is a p × 1 vector of predictors where q =
p − 1, N1 = ∑ Yi = the number of 1s and N0 = n − N1 = the number of
0s. Also assume that q ≤ min(N0 , N1 )/5. Then if the parametric estimated
mean function ρ̂(ESP ) looks like a smoothed version of the step function,
then the LR model is likely to be useful. In other words, the observed slice
proportions should scatter fairly closely about the logistic curve ρ̂(ESP ) =
exp(ESP )/[1 + exp(ESP )].
The response plot is a powerful method for assessing the adequacy of the
binary LR regression model. Suppose that both the number of 0s and the
number of 1s is large compared to the number of predictors q, that the ESP
takes on many values and that the binary LR model is a good approximation
to the data. Then Y |ESP ≈ binomial(1, ρ̂(ESP)). Unlike the response plot
for multiple linear regression where the mean function is always the identity
line, the mean function in the response plot for LR can take a variety of
shapes depending on the range of the ESP. For LR, the (estimated) mean
function is
ρ̂(ESP) = exp(ESP)/[1 + exp(ESP)].
If the ESP = 0 then Y |SP ≈ binomial(1,0.5). If the ESP = −5, then Y |SP ≈
binomial(1,ρ ≈ 0.007) while if the ESP = 5, then Y |SP ≈ binomial(1,ρ ≈
0.993). Hence if the range of the ESP is in the interval (−∞, −5) then the
mean function is flat and ρ̂(ESP ) ≈ 0. If the range of the ESP is in the
interval (5, ∞) then the mean function is again flat but ρ̂(ESP ) ≈ 1. If
−5 < ESP < 0 then the mean function looks like a slide. If −1 < ESP < 1
then the mean function looks linear. If 0 < ESP < 5 then the mean function
first increases rapidly and then less and less rapidly. Finally, if −5 < ESP < 5
then the mean function has the characteristic “ESS” shape shown in Figure
10.1 c).
This plot is very useful as a goodness of fit diagnostic. Divide the ESP into
J “slices” each containing approximately n/J cases. Compute the sample
mean = sample proportion of the Y s in each slice and add the resulting step
function to the response plot. This is done in Figure 10.1 c) with J = 4
slices. This step function is a simple nonparametric estimator of the mean
function ρ(SP ). If the step function follows the estimated LR mean function
(the logistic curve) closely, then the LR model fits the data well. The plot
of these two curves is a graphical approximation of the goodness of fit tests
described in Hosmer and Lemeshow (2000, pp. 147–156).
The deviance test described in Section 10.5 is used to test whether β = 0,
and is the analog of the ANOVA F test for multiple linear regression. If the
binary LR model is a good approximation to the data but β = 0, then the
predictors x are not needed in the model and ρ̂(xi ) ≡ ρ̂ = Y (the usual
univariate estimator of the success proportion) should be used instead of the
LR estimator
ρ̂(xi) = exp(xiT β̂)/[1 + exp(xiT β̂)].
If the logistic curve clearly fits the step function better than the line Y = Y ,
then H0 will be rejected, but if the line Y = Y fits the step function about
as well as the logistic curve (which should only happen if the logistic curve
is linear with a small slope), then Y may be independent of the predictors.
See Figure 10.1 a).

For binomial logistic regression, the response plot needs to be modified


and a check for overdispersion is needed.

Definition 10.10. Let Zi = Yi /mi . Then the conditional distribution


Zi |xi of the LR binomial regression model can be visualized with a response
plot of the ESP = β̂T xi versus Zi with the estimated mean function

ρ̂(ESP) = exp(ESP)/[1 + exp(ESP)]

added as a visual aid. Divide the ESP into J slices with approximately the
same number of cases in each slice. Then compute ρ̂s = ∑s Yi / ∑s mi where
the sum is over the cases in slice s. Then plot the resulting step function
or the lowess curve. For binary data the step function is simply the sample
proportion in each slice.

Both the lowess curve and step function are simple nonparametric estima-
tors of the mean function ρ(SP ). If the lowess curve or step function tracks
the logistic curve (the estimated mean) closely, then the LR mean function
is a reasonable approximation to the data.
Checking the LR model in the nonbinary case is more difficult because
the binomial distribution is not the only distribution appropriate for data
that takes on values 0, 1, ..., m if m ≥ 2. Hence both the mean and variance
functions need to be checked. Often the LR mean function is a good approx-
imation to the data, the LR MLE is a consistent estimator of β, but the
LR model is not appropriate. The problem is that for many data sets where
E(Yi |xi ) = mi ρ(SPi ), it turns out that V (Yi |xi ) > mi ρ(SPi )(1 − ρ(SPi )).
This phenomenon is called overdispersion. The BBR model of Definition 10.2
is a useful alternative to LR.
For both the LR and BBR models, the conditional distribution of Y |x can
still be visualized with a response plot of the ESP versus Zi = Yi /mi with the
estimated mean function Ê(Zi |xi ) = ρ̂(SP ) = ρ(ESP ) and a step function
or lowess curve added as visual aids.
Since the binomial regression model is simpler than the BBR model, graph-
ical diagnostics for the goodness of fit of the LR model would be useful. The
following plot was suggested by Olive (2013b) to check for overdispersion.
Definition 10.11. To check for overdispersion, use the OD plot of the
estimated model variance V̂M ≡ V̂ (Y |SP ) versus the squared residuals V̂ =
[Y − Ê(Y |SP )]2 . For the LR model, V̂ (Yi |SP ) = mi ρ(ESPi )(1 − ρ(ESPi ))
and Ê(Yi |SP ) = mi ρ(ESPi ).
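A minimal sketch of this OD plot for a binomial logistic regression is below; the counts y, the numbers of trials m, and the fitted object fit <- glm(cbind(y, m - y) ~ ., family = binomial, data = dat) are hypothetical.

rhohat <- predict(fit, type = "response")   # estimated rho(ESP_i)
Ehat <- m * rhohat                          # estimated mean function
Vmodhat <- m * rhohat * (1 - rhohat)        # estimated model variance
Vhat <- (y - Ehat)^2                        # squared residuals
plot(Vmodhat, Vhat)
abline(0, 1)                                # identity line
abline(0, 4, lty = 2)                       # slope 4 line through the origin
abline(lm(Vhat ~ Vmodhat), lty = 3)         # OLS line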

Numerical summaries are also available. The deviance G2 is a statistic


used to assess the goodness of fit of the logistic regression model much as R2
is used for multiple linear regression. When the mi are small, G2 may not be
reliable but the response plot is still useful. If the Yi are not too close to 0
or mi, if the response and OD plots look good, and the deviance G2 satisfies
G2/(n − p) ≈ 1, then the LR model is likely useful. If G2 > (n − p) + 3√(n − p),
then a more complicated count model may be needed.
Combining the response plot with the OD plot is a powerful method for
assessing the adequacy of the LR model. To motivate the OD plot, recall that
if a count Y is not too close to 0 or m, then a normal approximation is good
for the binomial distribution. Notice that if Yi = E(Y |SP) + 2√V(Y |SP),
then [Yi − E(Y |SP )]2 = 4V (Y |SP ). Hence if both the estimated mean and
estimated variance functions are good approximations, and if the counts are
not too close to 0 or mi , then the plotted points in the OD plot will scatter
about a wedge formed by the V̂ = 0 line and the line through the origin
with slope 4: V̂ = 4V̂ (Y |SP ). Only about 5% of the plotted points should
be above this line.
When the counts are small, the OD plot is not wedge shaped, but if the LR
model is correct, the least squares (OLS) line should be close to the identity
line through the origin with unit slope. If the data are binary, the response
plot is enough to check the binomial regression assumption.
Suppose the bulk of the plotted points in the OD plot fall in a wedge.
Then the identity line, slope 4 line, and OLS line will be added to the plot
as visual aids. It is easier to use the OD plot to check the variance function
than the response plot since judging the variance function with the straight
lines of the OD plot is simpler than judging the variability about the logistic
curve. Also outliers are often easier to spot with the OD plot. For the LR
model, V̂ (Yi |SP ) = mi ρ(ESPi )(1 − ρ(ESPi )) and Ê(Yi |SP ) = mi ρ(ESPi ).
The evidence of overdispersion increases from slight to high as the scale of the
vertical axis increases from 4 to 10 times that of the horizontal axis. There is
considerable evidence of overdispersion if the scale of the vertical axis is more
than 10 times that of the horizontal, or if the percentage of points above the
slope 4 line through the origin is much larger than 5%.
If the binomial LR OD plot is used but the data follows a beta–binomial re-
gression model, then V̂mod = V̂ (Yi |SP ) ≈ mi ρ(ESP )(1−ρ(ESP )) while V̂ =
[Yi − mi ρ(ESP )]2 ≈ (Yi − E(Yi ))2 . Hence E(V̂ ) ≈ V (Yi ) ≈ mi ρ(ESP )(1 −
ρ(ESP ))[1 + (mi − 1)θ/(1 + θ)], so the plotted points with mi = m should
scatter about a line with slope ≈ 1 + (m − 1)θ/(1 + θ) = (1 + mθ)/(1 + θ).

[Figure omitted: a) and b) response plots of ESP versus Y; c) ESSP; d) OD Plot of Vmodhat versus Vhat.]

Fig. 10.1 Response Plots for Museum Data

The first example is for binary data. For binary data, G2 is not approxi-
mately χ2 and some plots of residuals have a pattern whether the model is
correct or not. For binary data the OD plot is not needed, and the plotted
points follow a curve rather than falling in a wedge. The response plot is
very useful if the logistic curve and step function of observed proportions are
added as visual aids. The logistic curve gives the estimated LR probability of
success. For example, when ESP = 0, the estimated probability is 0.5. The
following three examples used SP = xT β.

Example 10.1. Schaaffhausen (1878) gives data on skulls at a museum.


The 1st 47 skulls are humans while the remaining 13 are apes. The response
variable ape is 1 for an ape skull. The response plot in Figure 10.1a) uses
the predictor face length. The model fits very poorly since the probability
of a 1 decreases then increases. The response plot in Figure 10.1b) uses the
predictor head height and perfectly classifies the data since the ape skulls can
be separated from the human skulls with a vertical line at ESP = 0. The
response plot in Figure 10.1c uses predictors lower jaw length, face length,
and upper jaw length. None of the predictors is good individually, but together
they provide a good LR model since the observed proportions (the step function)
track the model proportions (logistic curve) closely. The OD plot in Figure
10.1d) is curved and is not needed for a binary response.

[Figure omitted: a) ESSP (ESP versus Z); b) OD Plot of Vmodhat versus Vhat.]

Fig. 10.2 Visualizing the Death Penalty Data

Example 10.2. Abraham and Ledolter (2006, pp. 360-364) describe death
penalty sentencing in Georgia. The predictors are aggravation level from 1 to
6 (treated as a continuous variable) and race of victim coded as 1 for white
and 0 for black. There were 362 jury decisions and 12 level race combinations.
The response variable was the number of death sentences in each combination.
The response plot (ESSP) in Figure 10.2a shows that the Yi /mi are close to
the estimated LR mean function (the logistic curve). The step function based
on 5 slices also tracks the logistic curve well. The OD plot is shown in Figure
10.2b with the identity, slope 4, and OLS lines added as visual aids. The
vertical scale is less than the horizontal scale, and there is no evidence of
overdispersion.

[Figure omitted: a) ESSP (ESP versus Z); b) OD Plot of Vmodhat versus Vhat.]

Fig. 10.3 Plots for Rotifer Data

Example 10.3. Collett (1999, pp. 216-219) describes a data set where
the response variable is the number of rotifers that remain in suspension in
a tube. A rotifer is a microscopic invertebrate. The two predictors were the
density of a stock solution of Ficolli and the species of rotifer coded as 1
for polyarthra major and 0 for keratella cochlearis. Figure 10.3a shows the
response plot (ESSP). Both the observed proportions and the step function
track the logistic curve well, suggesting that the LR mean function is a good
approximation to the data. The OD plot suggests that there is overdispersion
since the vertical scale is about 30 times the horizontal scale. The OLS line
has slope much larger than 4 and two outliers seem to be present.
10.4 Poisson Regression

If the response variable Y is a count, then the Poisson regression model is


often useful. For example, counts often occur in wildlife studies where a region
is divided into subregions and Yi is the number of a specified type of animal
found in the subregion.

Definition 10.12. The Poisson regression (PR) model states that


Y1 , ..., Yn are independent random variables with Yi ∼ Poisson(µ(xi )) where
µ(xi ) = exp(h(xi )). Thus Y |SP ∼ Poisson(exp(SP)). Notice that Y |SP =
0 ∼ Poisson(1). Note that the conditional mean and variance functions are
equal: E(Y |SP ) = V (Y |SP ) = exp(SP ).

In the response plot for Poisson regression, the shape of the estimated
mean function µ̂(ESP ) = exp(ESP ) depends strongly on the range of the
ESP. The variety of shapes occurs because the plotting software attempts
to fill the vertical axis. Hence if the range of the ESP is narrow, then the
exponential function will be rather flat. If the range of the ESP is wide, then
the exponential curve will look flat in the left of the plot but will increase
sharply in the right of the plot.

Definition 10.13. The estimated sufficient summary plot (ESSP) or re-


sponse plot, is a plot of the ESP = ĥ(xi ) versus Yi with the estimated mean
function
µ̂(ESP ) = exp(ESP )
added as a visual aid. A scatterplot smoother such as lowess is also added as
a visual aid.
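A minimal sketch of this response plot is below; the fitted Poisson regression object fit <- glm(y ~ ., family = poisson, data = dat) is hypothetical.

ESP <- predict(fit)                  # linear predictor
Y <- fit$y
plot(ESP, Y)
curve(exp(x), add = TRUE)            # estimated PR mean function
lines(lowess(ESP, Y), lty = 2)       # lowess curve as a nonparametric check
abline(h = mean(Y), lty = 3)         # the line Y = Ybar used by the deviance test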

This plot is very useful as a goodness of fit diagnostic. The lowess curve
is a nonparametric estimator of the mean function and is represented as a
jagged curve to distinguish it from the estimated PR mean function (the
exponential curve). See Figure 10.4 a). If the number of nontrivial predictors
q < n/10, if there is no overdispersion, and if the lowess curve follows the
exponential curve closely (except possibly for the largest values of the ESP),
then the PR mean function may be a useful approximation for E(Y |x). A
useful lack of fit plot is a plot of the ESP versus the deviance residuals
that are often available from the software.

The deviance test described in Section 10.5 is used to test whether β = 0,


and is the analog of the ANOVA F test for multiple linear regression. If the
PR model is a good approximation to the data but β = 0, then the predictors
x are not needed in the model and µ̂(xi ) ≡ µ̂ = Y (the sample mean) should
be used instead of the PR estimator

µ̂(xi ) = exp(xTi β̂).


If the exponential curve clearly fits the lowess curve better than the line
Y = Y , then H0 should be rejected, but if the line Y = Y fits the lowess
curve about as well as the exponential curve (which should only happen if
the exponential curve is approximately linear with a small slope), then Y
may be independent of the predictors. See Figure 10.6 a).

Warning: For many count data sets where the PR mean function is
good, the PR model is not appropriate but the PR MLE is still a con-
sistent estimator of β. The problem is that for many data sets where
E(Y |x) = µ(x) = exp(SP ), it turns out that V (Y |x) > exp(SP ). This
phenomenon is called overdispersion. Adding parametric and nonparamet-
ric estimators of the standard deviation function to the response plot can
be useful. See Cook and Weisberg (1999, pp. 401-403). The NBR model of
Definition 10.3 is a useful alternative to PR.

Since the Poisson regression model is simpler than the NBR model, graph-
ical diagnostics for the goodness of fit of the PR model would be useful. The
following plot was suggested by Winkelmann (2000, p. 110).
Definition 10.14. To check for overdispersion, use the OD plot of the
estimated model variance V̂M ≡ V̂ (Y |SP ) versus the squared residuals V̂ =
[Y − Ê(Y |SP )]2 . For the PR model, V̂ (Y |SP ) = exp(ESP ) = Ê(Y |SP ) and
V̂ = [Y − exp(ESP )]2 .
Numerical summaries are also available. The deviance G2 , described in
Section 10.5, is a statistic used to assess the goodness of fit of the Poisson
regression model much as R2 is used for multiple linear regression. For Poisson
regression, G2 is approximately chi-square with n − p degrees of freedom.
Since a χ2d random variable has mean d and standard deviation √(2d), the 98th
percentile of the χ2d distribution is approximately d + 3√d ≈ d + 2.121√(2d). If
the response and OD plots look good, and G2/(n − p) ≈ 1, then the PR model
is likely useful. If G2 > (n − p) + 3√(n − p), then a more complicated count
model than PR may be needed. A good discussion of such count models is in
Simonoff (2003).
For PR, Winkelmann (2000, p. 110) suggested that the plotted points in
the OD plot should scatter about the identity line through the origin with unit
slope and that the OLS line should be approximately equal to the identity
line if the PR model is appropriate. But in simulations, it was found that the
following two observations make the OD plot much easier to use for Poisson
regression.
First, recall that a normal approximation is good for both the Poisson
and negative binomial distributions if the count Y is not too small. Notice
that if Y = E(Y |SP) + 2√V(Y |SP), then [Y − E(Y |SP)]2 = 4V(Y |SP).
Hence if both the estimated mean and estimated variance functions are good
approximations, the plotted points in the OD plot for Poisson regression will
scatter about a wedge formed by the V̂ = 0 line and the line through the
origin with slope 4: V̂ = 4V̂ (Y |SP ). If the normal approximation is good,


only about 5% of the plotted points should be above this line.
Second, the evidence of overdispersion increases from slight to high as the
scale of the vertical axis increases from 4 to 10 times that of the horizontal
axis. (The scale of the vertical axis tends to depend on the few cases with
the largest V̂ (Y |SP ), and P [(Y − Ê(Y |SP ))2 > 10V̂ (Y |SP )] can be ap-
proximated with a normal approximation or Chebyshev’s inequality.) There
is considerable evidence of overdispersion if the scale of the vertical axis is
more than 10 times that of the horizontal, or if the percentage of points above
the slope 4 line through the origin is much larger than 5%. Hence the identity
line and slope 4 line are added to the OD plot as visual aids, and one should
check whether the scale of the vertical axis is more than 10 times that of the
horizontal.
Combining the response plot with the OD plot is a powerful method for
assessing the adequacy of the Poisson regression model. It is easier to use the
OD plot to check the variance function than the response plot since judging
the variance function with the straight lines of the OD plot is simpler than
judging two curves. Also outliers are often easier to spot with the OD plot.
For Poisson regression, judging the mean function from the response plot
may be rather difficult for large counts since the mean function is curved
and lowess does not track the exponential function very well for large counts.
Definition 10.16 will give some useful plots. Since P (Yi = 0) > 0, the estima-
tors given in the following definition are used. Let Zi = Yi if Yi > 0, and let
Zi = 0.5 if Yi = 0. Let x = (1, uT )T .

Definition 10.15. The minimum chi–square estimator of the pa-


rameters β = (α, ηT )T in a Poisson regression model are (α̂M , η̂M ), and are
found from the weighted least squares regression of log(Zi ) on ui with weights
wi = Zi. Equivalently, use the ordinary least squares (OLS) regression (without
intercept) of √Zi log(Zi) on √Zi (1, uiT)T.

The minimum chi–square estimator tends to be consistent if n is fixed


and all n counts Yi increase to ∞, while the Poisson regression maximum
likelihood estimator β̂ = (α̂, η̂T )T tends to be consistent if the sample size
n → ∞. See Agresti (2002, pp. 611-612). However, the two estimators are
often close for many data sets.
The basic idea of the following two plots for Poisson regression is to trans-
form the data towards a linear model, then make the response plot of Ŵ
versus W and residual plot of the residuals W − Ŵ for the transformed re-
sponse variable W . The mean function is the identity line and the vertical
deviations from the identity line are the WLS residuals. If ESP = xiT β̂, the
plots are based on weighted least squares (WLS) regression. Use the equivalent
OLS regression (without intercept) of W = √Zi log(Zi) on √Zi (1, uiT)T.
Then the plot of the "fitted values" Ŵ = √Zi (α̂M + η̂TM ui) versus the
"response" √Zi log(Zi) should have points that scatter about the identity line.
These results and the equivalence of the minimum chi–square estimator to


an OLS estimator suggest the following diagnostic plots.
Definition 10.16. For a Poisson regression model, a weighted fit response
plot is a plot of √Zi ESP versus √Zi log(Zi). The weighted residual plot
is a plot of √Zi ESP versus the "WLS" residuals rWi = √Zi log(Zi) − √Zi ESP.
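A minimal sketch of these weighted plots, and of the minimum chi–square estimator of Definition 10.15 via OLS without intercept, is below; the count vector y, the ESP from a fitted PR model, and the matrix u of nontrivial predictors are hypothetical objects.

Z <- ifelse(y > 0, y, 0.5)
W <- sqrt(Z) * log(Z)                       # "response" sqrt(Z) log(Z)
WFIT <- sqrt(Z) * ESP                       # "fitted values" based on the MLE ESP
plot(WFIT, W); abline(0, 1)                 # weighted fit response plot
plot(WFIT, W - WFIT); abline(h = 0)         # weighted residual plot
mcs <- lm(W ~ I(sqrt(Z)) + I(sqrt(Z) * u) - 1)   # minimum chi-square estimator
coef(mcs)                                   # (alphahat_M, etahat_M)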

If the Poisson regression model is appropriate and the PR estimator is


good, then the plotted points in the weighted fit response plot should follow
the identity line. When the counts Yi are small, the “WLS” residuals can
not be expected to be approximately normal. Often the larger counts are fit
better than the smaller counts and hence the residual plots have a “left open-
ing megaphone” shape. This fact makes residual plots for Poisson regression
rather hard to use, but cases with large “WLS” residuals may not be fit very
well by the model. Both the weighted fit response and residual plots perform
better for simulated PR data with many large counts than for data where all
of the counts are less than 10. The following three examples use SP = xT β.

Example 10.4. For the Ceriodaphnia data of Myers et al. (2002, pp.
136-139), the response variable Y is the number of Ceriodaphnia organisms
counted in a container. The sample size was n = 70, and the predictors were
a constant (x1 ), seven concentrations of jet fuel (x2 ), and an indicator for
two strains of organism (x3 ). The jet fuel was believed to impair reproduction
so high concentrations should have smaller counts. Figure 10.4 shows the 4
plots for this data. In the response plot of Figure 10.4a, the lowess curve
is represented as a jagged curve to distinguish it from the estimated PR
mean function (the exponential curve). The horizontal line corresponds to
the sample mean Y . The OD plot in Figure 10.4b suggests that there is little
evidence of overdispersion. These two plots as well as Figures 10.4c and 10.4d
suggest that the Poisson regression model is a useful approximation to the
data.

Example 10.5. For the crab data, the response Y is the number of satel-
lites (male crabs) near a female crab. The sample size n = 173 and the pre-
dictor variables were the color, spine condition, caparice width, and weight
of the female crab. Agresti (2002, pp. 126-131) first uses Poisson regression,
and then uses the NBR model with κ̂ = 0.98 ≈ 1. Figure 10.5a suggests that
there is one case with an unusually large value of the ESP. The lowess curve
does not track the exponential curve all that well. Figure 10.5b suggests that
overdispersion is present since the vertical scale is about 10 times that of
the horizontal scale and too many of the plotted points are large and greater
than the slope 4 line. Figure 10.5c also suggests that the Poisson regression
mean function is a rather poor fit since the plotted points fail to cover the
identity line. Although the exponential mean function fits the lowess curve
better than the line Y = Y , an alternative model to the NBR model may fit
[Figure omitted: a) ESSP; b) OD Plot (Ehat versus Vhat); c) WFRP Based on MLE; d) WRP Based on MLE.]

Fig. 10.4 Plots for Ceriodaphnia Data

[Figure omitted: a) ESSP; b) OD Plot (Ehat versus Vhat); c) WFRP Based on MLE; d) WRP Based on MLE.]

Fig. 10.5 Plots for Crab Data


[Figure omitted: a) ESSP; b) OD Plot (Ehat versus Vhat); c) WFRP Based on MLE; d) WRP Based on MLE.]

Fig. 10.6 Plots for Popcorn Data

the data better. In later chapters, Agresti uses binomial regression models
for this data.
Example 10.6. For the popcorn data of Myers et al. (2002, p. 154), the
response variable Y is the number of inedible popcorn kernels. The sample
size was n = 15 and the predictor variables were temperature (coded as 5,
6, or 7), amount of oil (coded as 2, 3, or 4), and popping time (75, 90, or
105). One batch of popcorn had more than twice as many inedible kernels
as any other batch and is an outlier. Ignoring the outlier in Figure 10.6a
suggests that the line Y = Y will fit the data and lowess curve better than
the exponential curve. Hence Y seems to be independent of the predictors.
Notice that the outlier sticks out in Figure 10.6b and that the vertical scale is
well over 10 times that of the horizontal scale. If the outlier was not detected,
then the Poisson regression model would suggest that temperature and time
are important predictors, and overdispersion diagnostics such as the deviance
would be greatly inflated. However, we probably need to delete the high
temperature, low oil, and long popping time combination, to conclude that
the response is independent of the predictors.
10.5 GLM Inference, n/p Large

This section gives a very brief discussion of inference for the logistic regression
(LR) and Poisson regression (PR) models. Inference for these two models is
very similar to inference for the multiple linear regression (MLR) model. For
all three of these models, Y is independent of the p × 1 vector of predictors
x = (x1 , x2 , ..., xp)T given the sufficient predictor xT β where the constant
x1 ≡ 1.
To perform inference for LR and PR, computer output is needed. Shown
below is output using symbols and output from a real data set with p = 3
nontrivial predictors. This data set is the banknote data set described in Cook
and Weisberg (1999, p. 524). There were 200 Swiss bank notes of which 100
were genuine (Y = 0) and 100 counterfeit (Y = 1). The goal of the analysis
was to determine whether a selected bill was genuine or counterfeit from
physical measurements of the bill.

Label Estimate Std. Error Est/SE p-value


Constant β̂1 se(β̂1 ) zo,1 for H0 : β1 = 0
x2 β̂2 se(β̂2 ) zo,2 = β̂2 /se(β̂2 ) for H0 : β2 = 0
...
xp β̂p se(β̂p ) zo,p = β̂p /se(β̂p ) for H0 : βp = 0
Number of cases: n
Degrees of freedom: n - p
Pearson X2:
Deviance: D = Gˆ2
Binomial Regression
Kernel mean function = Logistic
Response = Status
Terms = (Bottom Left)
Trials = Ones

Coefficient Estimates
Label Estimate Std. Error Est/SE p-value
Constant -389.806 104.224 -3.740 0.0002
Bottom 2.26423 0.333233 6.795 0.0000
Left 2.83356 0.795601 3.562 0.0004

Scale factor: 1.
Number of cases: 200
Degrees of freedom: 197
Pearson X2: 179.809
Deviance: 99.169
Point estimators for the mean function are important. Given values of
x = (x1 , ..., xp)T , a major goal of binary logistic regression is to estimate the
success probability P (Y = 1|x) = ρ(x) with the estimator

ρ̂(x) = exp(xT β̂)/[1 + exp(xT β̂)].   (10.8)

Similarly, a major goal of Poisson regression is to estimate the mean


E(Y |x) = µ(x) with the estimator

µ̂(x) = exp(xT β̂). (10.9)
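In R, the estimators (10.8) and (10.9) are obtained at new values of x with predict(); the fitted objects lrfit and prfit and the data frame newx are hypothetical.

predict(lrfit, newdata = newx)                     # default type = "link" gives the ESP
predict(lrfit, newdata = newx, type = "response")  # rhohat(x) for logistic regression
predict(prfit, newdata = newx, type = "response")  # muhat(x) = exp(x^T betahat) for PR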

For tests, pval, the estimated p–value, is an important quantity. Again


what output labels as p–value is typically pval. Recall that H0 is rejected if
the pval ≤ δ. A pval between 0.07 and 1.0 provides little evidence that H0
should be rejected, a pval between 0.01 and 0.07 provides moderate evidence
and a pval less than 0.01 provides strong statistical evidence that H0 should
be rejected. Statistical evidence is not necessarily practical evidence, and
reporting the pval along with a statement of the strength of the evidence is
more informative than stating that the pval is less than some chosen value
such as δ = 0.05. Nevertheless, as a homework convention, use δ = 0.05 if
δ is not given.

Investigators also sometimes test whether a predictor xj is needed in the


model given that the other p−1 predictors are in the model with the following
4 step Wald test of hypotheses.
i) State the hypotheses H0 : βj = 0 HA : βj ≠ 0.
ii) Find the test statistic zo,j = β̂j /se(β̂j ) or obtain it from output.
iii) The pval = 2P (Z < −|zoj |) = 2P (Z > |zoj |). Find the pval from output
or use the standard normal table.
iv) State whether you reject H0 or fail to reject H0 and give a nontechnical
sentence restating your conclusion in terms of the story problem.

If H0 is rejected, then conclude that xj is needed in the GLM model for


Y given that the other p − 1 predictors are in the model. If you fail to reject
H0 , then conclude that xj is not needed in the GLM model for Y given that
the other p − 1 predictors are in the model. (Or there is not enough evidence
to conclude that xj is needed in the model.) Note that xj could be a very
useful GLM predictor, but may not be needed if other predictors are added
to the model.

The Wald confidence interval (CI) for βj can also be obtained using the
output: the large sample 100(1 − δ)% CI for βj is β̂j ± z1−δ/2 se(β̂j).
The Wald test and CI tend to give good results if the sample size n is large.
Here 1 − δ refers to the coverage of the CI. A 90% CI uses z1−δ/2 = 1.645, a
95% CI uses z1−δ/2 = 1.96, and a 99% CI uses z1−δ/2 = 2.576.
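A small sketch of extracting the Wald statistics and forming these large sample CIs from a fitted GLM is below; fit is a hypothetical glm object.

tab <- summary(fit)$coefficients     # columns: Estimate, Std. Error, z value, Pr(>|z|)
z <- tab[, "Estimate"]/tab[, "Std. Error"]            # Wald statistics z_{o,j}
ci95 <- cbind(lower = tab[, "Estimate"] - 1.96 * tab[, "Std. Error"],
              upper = tab[, "Estimate"] + 1.96 * tab[, "Std. Error"])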

For a GLM, often 3 models are of interest: the full model that uses all
p of the predictors xT = (xTR , xTO ), the reduced model that uses the r
predictors xR , and the saturated model that uses n parameters θ1 , ..., θn
where n is the sample size. For the full model the p parameters β1 , ..., βp are
estimated while the reduced model has r + 1 parameters. Let lSAT (θ1 , ..., θn)
be the likelihood function for the saturated model and let lF U LL (β) be the
likelihood function for the full model. Let LSAT = log lSAT (θ̂1 , ..., θ̂n) be the
log likelihood function for the saturated model evaluated at the maximum
likelihood estimator (MLE) (θ̂1 , ..., θ̂n) and let LF U LL = log lF U LL (β̂) be the
log likelihood function for the full model evaluated at the MLE (β̂). Then
the deviance D = G2 = −2(LF U LL − LSAT ). The degrees of freedom for
the deviance = dfF U LL = n − p where n is the number of parameters for the
saturated model and p is the number of parameters for the full model.
The saturated model for logistic regression states that for i = 1, ..., n, the
Yi |xi are independent binomial(mi, ρi ) random variables where ρ̂i = Yi /mi .
The saturated model is usually not very good for binary data (all mi = 1)
or if the mi are small. The saturated model can be good if all of the mi are
large or if ρi is very close to 0 or 1 whenever mi is not large.
The saturated model for Poisson regression states that for i = 1, ..., n,
the Yi |xi are independent Poisson(µi ) random variables where µ̂i = Yi . The
saturated model is usually not very good for Poisson data, but the saturated
model may be good if n is fixed and all of the counts Yi are large.
If X ∼ χ2d then E(X) = d and VAR(X) = 2d. An observed value of
X > d + 3√d is unusually large and an observed value of X < d − 3√d is
unusually small.

When the saturated model is good, a rule of thumb is that the logistic or
Poisson regression model is ok if G2 ≤ n − p (or if G2 ≤ n − p + 3√(n − p)).
For binary LR, the χ2n−p approximation for G2 is rarely good even for large
sample sizes n. For LR, the response plot is often a much better diagnostic
for goodness of fit, especially when ESP = xiT β takes on many values and
when p << n. For PR, both the response plot and G2 ≤ n − p + 3√(n − p)
should be checked.
Response = Y
Terms = (x1 , ..., xp)
Sequential Analysis of Deviance

Total Change
Predictor df Deviance df Deviance
Ones n − 1 = dfo G2o
x2 n−2 1
x3 n−3 1
...
xp n − p = dfF U LL G2F U LL 1
-----------------------------------------
Data set = cbrain, Name of Fit = B1
Response = sex
Terms = (cephalic size log[size])
Sequential Analysis of Deviance
Total Change
Predictor df Deviance | df Deviance
Ones 266 363.820 |
cephalic 265 363.605 | 1 0.214643
size 264 315.793 | 1 47.8121
log[size] 263 305.045 | 1 10.7484
The above output, shown in symbols and for a real data set, is used for the
deviance test described below. Assume that the response plot has been made
and that the logistic or Poisson regression model fits the data well in that the
nonparametric step or lowess estimated mean function follows the estimated
model mean function closely and there is no evidence of overdispersion. The
deviance test is used to test whether β 2 = 0 where β = (β1 , βT2 )T = (α, ηT )T .
If this is the case, then the nontrivial predictors are not needed in the GLM
model. If H0 : β 2 = 0 is not rejected, then for Poisson regression the estimator
µ̂ = Y should be used while for logistic regression ρ̂ = ∑ Yi / ∑ mi should
be used. Note that ρ̂ = Y for binary logistic regression since mi ≡ 1 for
i = 1, ..., n. This test is similar to the ANOVA F test for multiple linear
regression.

The 4 step deviance test is


i) H0 : β2 = 0 HA : β2 ≠ 0,
ii) test statistic G2 (o|F ) = G2o − G2F U LL.
iii) The pval = P (χ2 > G2 (o|F )) where χ2 ∼ χ2q has a chi–square dis-
tribution with q = p − 1 degrees of freedom. Note that q = p − 1 =
dfo − dfFULL = n − 1 − (n − p).
iv) Reject H0 if the pval ≤ δ and conclude that there is a GLM relationship
between Y and the predictors X2 , ..., Xp. If pval > δ, then fail to reject H0 and
conclude that there is not a GLM relationship between Y and the predictors
X2 , ..., Xp. (Or there is not enough evidence to conclude that there is a GLM
relationship between Y and the predictors.)

This test can be performed in R by obtaining output from the full and
null model.
outf <- glm(Y ~ x2 + x3 + ... + xp, family = binomial)
outn <- glm(Y ~ 1, family = binomial)
anova(outn,outf,test="Chi")
Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1 *** ****
2 *** **** k Gˆ2(0|F) pvalue
The output below, shown both in symbols and for a real data set, can be
used to perform the change in deviance test. If the reduced model leaves out
a single variable xi , then the change in deviance test becomes H0 : βi = 0
versus HA : βi 6= 0. This test is a competitor of the Wald test. This change in
deviance test is usually better than the Wald test if the sample size n is not
large, but the Wald test is often easier for software to produce. For large n
the test statistics from the two tests tend to be very similar (asymptotically
equivalent tests).
If the reduced model is good, then the EE plot of ESP (R) = xTRi β̂ R
versus ESP = xTi β̂ should be highly correlated with the identity line with
unit slope and zero intercept.

Response = Y Terms = (x1 , ..., xp) (Full Model)

Label Estimate Std. Error Est/SE p-value


Constant β̂1 se(β̂1 ) zo,1 for H0 : β1 = 0
x2 β̂2 se(β̂2) zo,2 = β̂2/se(β̂2) for H0 : β2 = 0
...
xp β̂p se(β̂p) zo,p = β̂p/se(β̂p) for H0 : βp = 0
Degrees of freedom: n − p = dfF U LL
Deviance: D = G2F U LL

Response = Y Terms = (x1 , ..., xr) (Reduced Model)

Label Estimate Std. Error Est/SE p-value


Constant β̂1 se(β̂1 ) zo,1 for H0 : β1 = 0
x2 β̂2 se(β̂2) zo,2 = β̂2/se(β̂2) for H0 : β2 = 0
...
xr β̂r se(β̂r) zo,r = β̂r/se(β̂r) for H0 : βr = 0
Degrees of freedom: n − r = dfRED
Deviance: D = G2RED
(Full Model) Response = Status,
Terms = (Diagonal Bottom Top)

Label Estimate Std. Error Est/SE p-value


Constant 2360.49 5064.42 0.466 0.6411
Diagonal -19.8874 37.2830 -0.533 0.5937
Bottom 23.6950 45.5271 0.520 0.6027
Top 19.6464 60.6512 0.324 0.7460

Degrees of freedom: 196


Deviance: 0.009

(Reduced Model) Response = Status, Terms = (Diagonal)


Label Estimate Std. Error Est/SE p-value
Constant 989.545 219.032 4.518 0.0000
Diagonal -7.04376 1.55940 -4.517 0.0000

Degrees of freedom: 198


Deviance: 21.109
After obtaining an acceptable full model where

SP = β1 + β2 x2 + · · · + βp xp = xT β = xTR β R + xTO β O

try to obtain a reduced model

SP (red) = β1 + βR2 xR2 + · · · + βRr xRr = xTR βR

where the reduced model uses r of the predictors used by the full model and
xO denotes the vector of p − r predictors that are in the full model but not
the reduced model. For logistic regression, the reduced model is Yi |xRi ∼
independent Binomial(mi , ρ(xRi )) while for Poisson regression the reduced
model is Yi |xRi ∼ independent Poisson(µ(xRi )) for i = 1, ..., n.
Assume that the response plot looks good. Then we want to test H0 : the
reduced model is good (can be used instead of the full model) versus HA :
use the full model (the full model is significantly better than the reduced
model). Fit the full model and the reduced model to get the deviances G2F U LL
and G2RED . The next test is similar to the partial F test for multiple linear
regression.

The 4 step change in deviance test is


i) H0 : the reduced model is good HA: use the full model,
ii) test statistic G2 (R|F ) = G2RED − G2F U LL.
iii) The pval = P (χ2 > G2 (R|F )) where χ2 ∼ χ2p−r has a chi–square
distribution with p − r degrees of freedom. Note that p − 1 is the number of
nontrivial predictors in the full model while r − 1 is the number of nontrivial
predictors in the reduced model. Also notice that p − r = dfRED − dfF U LL =


n − r − (n − p) = (p − 1) − (r − 1).
iv) Reject H0 if the pval ≤ δ and conclude that the full model should be
used. If pval > δ, then fail to reject H0 and conclude that the reduced model
is good.

This test can be performed in R by obtaining output from the full and
reduced model.
outf <- glm(Y ~ x2 + x3 + ... + xp, family = binomial)
outr <- glm(Y ~ x4 + x6 + x8, family = binomial)
anova(outr,outf,test="Chi")
Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1 *** ****
2 *** **** p-r Gˆ2(R|F) pvalue
Interpretation of coefficients: if x2 , ..., xi−1, xi+1 , ..., xp can be held fixed,
then increasing xi by 1 unit increases the sufficient predictor SP by βi units.
As a special case, consider logistic regression. Let ρ(x) = P (success|x) = 1 −
P(failure|x) where a “success” is what is counted and a “failure” is what is not
counted (so if the Yi are binary, ρ(x) = P (Yi = 1|x)). Then the estimated
odds of success is Ω̂(x) = ρ̂(x)/[1 − ρ̂(x)] = exp(xT β̂). In logistic regression,
increasing a predictor xi by 1 unit (while holding all other predictors fixed)
multiplies the estimated odds of success by a factor of exp(β̂i ).
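A small sketch computing the estimated odds multipliers exp(β̂i) and Wald intervals for them from a fitted logistic regression is below; fit is a hypothetical glm object.

est <- coef(fit); se <- sqrt(diag(vcov(fit)))
exp(cbind(odds.multiplier = est,
          lower = est - 1.96 * se, upper = est + 1.96 * se))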
Output for Full Model, Response = gender, Terms =
(age log[age] breadth circum headht
height length size log[size])
Number of cases: 267, Degrees of freedom: 257,
Deviance: 234.792

Logistic Regression Output for Reduced Model,


Response = gender, Terms = (height size)
Label Estimate Std. Error Est/SE p-value
Constant -6.26111 1.34466 -4.656 0.0000
height -0.0536078 0.0239044 -2.243 0.0249
size 0.0028215 0.000507935 5.555 0.0000

Number of cases: 267, Degrees of freedom: 264


Deviance: 313.457
Example 10.7. Let the response variable Y = gender = 0 for F and 1
for M. Let x2 = height (in inches) and x3 = size of head (in mm3 ). Logistic
regression is used, and data is from Gladstone (1905). There is output above.

a) Predict ρ̂(x) if height = x2 = 65 and size = x3 = 3500.


b) The full model uses the predictors listed above to the right of Terms.
Perform a 4 step change in deviance test to see if the reduced model can be
used. Both models contain a constant.
Solution: a) ESP = β̂1 + β̂2 x2 + β̂3 x3 = −6.26111 − 0.0536078(65) +
0.0028215(3500) = 0.1296. So

ρ̂(x) = exp(ESP)/[1 + exp(ESP)] = 1.1384/(1 + 1.1384) = 0.5324.
b) i) H0 : the reduced model is good HA: use the full model
ii) G2 (R|F ) = 313.457 − 234.792 = 78.665
iii) Now df = 264 − 257 = 7, and comparing 78.665 with χ27,0.999 = 24.32
shows that the pval = 0 < 1 − 0.999 = 0.001.
iv) Reject H0 , use the full model.

Example 10.8. Suppose that Y is a 1 or 0 depending on whether the


person is or is not credit worthy. Let x2 through x7 be the predictors and
use the following output to perform a 4 step deviance test. The credit data is
available from the text’s website as file credit.lsp, and is from Fahrmeir and
Tutz (2001).
Response = y
Sequential Analysis of Deviance
All fits include an intercept.
Total Change
Predictor df Deviance | df Deviance
Ones 999 1221.73 |
x2 998 1177.11 | 1 44.6148
x3 997 1176.55 | 1 0.561629
x4 996 1168.33 | 1 8.21723
x5 995 1168.20 | 1 0.137583
x6 994 1163.44 | 1 4.75625
x7 993 1158.22 | 1 5.21846
Solution: i) H0 : β2 = · · · = β7 = 0 HA: not H0
ii) G2 (0|F ) = 1221.73 − 1158.22 = 63.51
iii) Now df = 999 − 993 = 6, and comparing 63.51 with χ26,0.999 = 22.46
shows that the pval = 0 < 1 − 0.999 = 0.001.
iv) Reject H0 , there is a LR relationship between Y = credit worthiness
and the predictors x2 , ..., x7.

Coefficient Estimates
Label Estimate Std. Error Est/SE p-value
Constant -5.84211 1.74259 -3.353 0.0008
jaw ht 0.103606 0.0383650 ? ??
Example 10.9. A museum has 60 skulls, some of which are human and
some of which are from apes. Consider trying to estimate whether the skull
type is human or ape from the height of the lower jaw. Use the above logistic
regression output to answer the following problems. The museum data is
available from the text’s website as file museum.lsp, and is from Schaaffhausen
(1878). Here x = x2 .
a) Predict ρ̂(x) if x = 40.0.
b) Find a 95% CI for β2 .
c) Perform the 4 step Wald test for H0 : β2 = 0.
Solution: a) exp[ESP ] = exp[β̂1 +β̂2 (40)] = exp[−5.84211+0.103606(40)] =
exp[−1.69787] = 0.1830731. So

ρ̂(x) = exp(ESP)/[1 + exp(ESP)] = 0.1830731/(1 + 0.1830731) = 0.1547.

b) β̂2 ± 1.96 SE(β̂2) = 0.103606 ± 1.96(0.038365) = 0.103606 ± 0.0751954 =
[0.02841, 0.1788].
c) i) H0 : β2 = 0 HA : β2 ≠ 0
ii) Z0 = β̂2/SE(β̂2) = 0.103606/0.038365 = 2.7005.
iii) Using a standard normal table, pval = 2P (Z < −2.70) = 2(0.0035) =
0.0070.
iv) Reject H0 , jaw height is a useful LR predictor for whether the skull is
human or ape (so is needed in the LR model).

Label Estimate Std. Error Est/SE p-value


Constant -0.406023 0.877382 -0.463 0.6435
bombload 0.165425 0.0675296 2.450 0.0143
exper -0.0135223 0.00827920 -1.633 0.1024
type 0.568773 0.504297 1.128 0.2594
Example 10.10. Use the above output to perform inference on the num-
ber of locations where aircraft was damaged. The output is from a Poisson
regression. The variable exper = total months of aircrew experience while
type of aircraft was coded as 0 or 1. There were n = 30 cases. Data is from
Montgomery et al. (2001).
a) Predict µ̂(x) if bombload = x2 = 7.0, exper = x3 = 80.2, and type
= x4 = 1.0.
b) Perform the 4 step Wald test for H0 : β3 = 0.
c) Find a 95% confidence interval for β4 .
Solution: a) ESP = β̂1 + β̂2 x2 + β̂3 x3 + β̂4 x4 = −0.406023 + 0.165425(7) −
0.0135223(80.2) + 0.568773(1) = 0.2362. So µ̂(x) = exp(ESP) = exp(0.2362) =
1.2665.
b) i) H0 : β3 = 0 HA : β3 ≠ 0
ii) t0,3 = −1.633.
iii) pval = 0.1024
iv) Fail to reject H0 , exper is not needed in the PR model for number of
locations given that bombload and type are in the model.
c) β̂4 ± 1.96SE(β̂4 ) = 0.568773 ± 1.96(0.504297) = 0.568773 ± 0.9884 =
[−0.4196, 1.5572].
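A minimal R sketch of the prediction in a), assuming a hypothetical fitted object out <- glm(y ~ bombload + exper + type, family = poisson):

newx <- data.frame(bombload = 7.0, exper = 80.2, type = 1.0)
ESP <- predict(out, newdata = newx)   # estimated sufficient predictor
exp(ESP)                              # muhat(x); same as predict(out, newx, type = "response")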

10.6 Variable and Model Selection

10.6.1 When n/p is Large

This subsection gives some rules of thumb for variable selection for logistic
and Poisson regression when SP = xT β. Before performing variable selection,
a useful full model needs to be found. The process of finding a useful full
model is an iterative process. Given a predictor x, sometimes x is not used
by itself in the full model. Suppose that Y is binary. Then to decide what
functions of x should be in the model, look at the conditional distribution of
x|Y = i for i = 0, 1. The rules shown in Table 10.1 are used if x is an indicator
variable or if x is a continuous variable. Replace normality by “symmetric
with similar spreads” and “symmetric with different spreads” in the second
and third lines of the table. See Cook and Weisberg (1999, p. 501) and Kay
and Little (1987).

The full model will often contain factors and interactions. If w is a nominal
variable with K levels, make w into a factor by using K − 1 (indicator or)
dummy variables x1,w , ..., xK−1,w in the full model. For example, let xi,w = 1
if w is at its ith level, and let xi,w = 0, otherwise. An interaction is a product
of two or more predictor variables. Interactions are difficult to interpret.
Often interactions are included in the full model, and then the reduced model
without any interactions is tested. The investigator is often hoping that the
interactions are not needed.

Table 10.1 Building the Full Logistic Regression Model

distribution of x|y = i                  variables to include in the model
x|y = i is an indicator                  x
x|y = i ∼ N(µi , σ²)                     x
x|y = i ∼ N(µi , σi²)                    x and x²
x|y = i has a skewed distribution        x and log(x)
x|y = i has support on (0,1)             log(x) and log(1 − x)
A scatterplot matrix is used to examine the marginal relationships of
the predictors and response. Place Y on the top or bottom of the scatterplot
matrix. Variables with outliers, missing values, or strong nonlinearities may
be so bad that they should not be included in the full model. Suppose that
all values of the variable x are positive. The log rule says add log(x) to the
full model if max(xi )/ min(xi ) > 10. For the binary logistic regression model,
it is often useful to mark the plotted points by a 0 if Y = 0 and by a + if
Y = 1.
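A minimal R sketch of the log rule and the marked scatterplot matrix, assuming a hypothetical data frame dat whose first column is the binary response Y and whose remaining columns are positive predictors:

x <- dat[, -1]
sapply(x, function(xj) min(xj) > 0 && max(xj)/min(xj) > 10)  # TRUE: add log(xj) to the full model
pairs(dat, pch = c("0", "+")[dat$Y + 1])                     # mark plotted points by Y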

To make a full model, use the above discussion and then make a response
plot to check that the full model is good. The number of predictors in the
full model should be much smaller than the number of data cases n. Suppose
that the Yi are binary for i = 1, ..., n. Let N1 = Σ Yi = the number of 1s and
N0 = n−N1 = the number of 0s. A rough rule of thumb is that the full model
should use no more than min(N0 , N1 )/5 predictors and the final submodel
should have r predictor variables where r is small with r ≤ min(N0 , N1 )/10.
For Poisson regression, a rough rule of thumb is that the full model should
use no more than n/5 predictors and the final submodel should use no more
than n/10 predictors.
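A minimal R sketch of these rules of thumb, assuming a binary response vector y:

N1 <- sum(y); N0 <- length(y) - N1
floor(min(N0, N1)/5)      # rough bound on the number of full model predictors
floor(min(N0, N1)/10)     # rough bound for the final submodel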

Variable selection is the search for a subset of predictor variables that
can be deleted without important loss of information. A model for variable
selection for many models, including GLMs, is given in Section 4.1. Let ESP
correspond to the full model and let ESP (I) correspond to the submodel I.

Definition 10.17. An EE plot is a plot of ESP (I) versus ESP .

Variable selection is closely related to the change in deviance test for
a reduced model. You are seeking a subset I of the variables to keep in the
model. The AIC(I) statistic is used as an aid in backward elimination and
forward selection. The full model and the model Imin found with the smallest
AIC are always of interest. Burnham and Anderson (2004) suggest that if
∆(I) = AIC(I) − AIC(Imin ), then models with ∆(I) ≤ 2 are good, models
with 4 ≤ ∆(I) ≤ 7 are borderline, and models with ∆(I) > 10 should not be
used as the final submodel. Create a full model. The full model has a deviance
at least as small as that of any submodel. The final submodel should have an
EE plot that clusters tightly about the identity line. As a rough rule of thumb,
a good submodel I has corr(ESP (I), ESP ) ≥ 0.95. Find the submodel II
with the smallest number of predictors such that ∆(II ) ≤ 2. Then submodel
II is the initial submodel to examine. Also examine submodels I with fewer
predictors than II with ∆(I) ≤ 7. Based on these cutoffs, ∆(I) + 2 seems to
be near a χ21 distribution for a model I that leaves one predictor (that has
one degree of freedom) out of Imin . Perhaps ∆(I) + 1.84 would be a better
approximation.
Backward elimination starts with the full model with q = p − 1 non-
trivial variables, and the predictor that optimizes some criterion is deleted. A
constant x∗1 = x1 ≡ 1 is always in the model. Then there are q − 1 nontrivial
variables left, and the predictor that optimizes some criterion is deleted. This
process continues for models with q − 2, q − 3, ..., 2, and 1 predictors.
Forward selection starts with the model with a constant x∗1 = x1 ≡ 1,
and the predictor that optimizes some criterion is added. Then there are 2
variables in the model, and the predictor that optimizes some criterion is
added. This process continues for models with 2, 3, ..., p − 1, and p predictors.
Both forward selection and backward elimination result in a sequence, often
different, of p models {x∗1 }, {x∗1, x∗2 }, ..., {x∗1, x∗2 , ..., x∗p−1}, {x∗1, x∗2 , ..., x∗p} =
full model.
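Backward elimination and forward selection with the AIC criterion can be done with the R function step. A minimal sketch, assuming a hypothetical fitted full model full <- glm(y ~ ., family = binomial, data = dat):

back <- step(full, direction = "backward", trace = 0)                    # backward elimination by AIC
null <- glm(y ~ 1, family = binomial, data = dat)
fwd  <- step(null, scope = formula(full), direction = "forward", trace = 0)  # forward selection by AIC
AIC(back) - AIC(full)                                                    # a Delta(I) type comparison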

All subsets variable selection can be performed with the following pro-
cedure. Compute the ESP of the GLM and compute the OLS ESP found by
the OLS regression of Y on x. Check that |corr(ESP, OLS ESP)| ≥ 0.95. This
high correlation will exist for many data sets. Then perform multiple linear
regression and the corresponding all subsets OLS variable selection with the
Cp(I) criterion. If the sample size n is large and Cp (I) ≤ 2r where the subset
I has r variables including a constant, then corr(OLS ESP, OLS ESP(I))
will be high by Olive and Hawkins (2005), and hence corr(ESP, ESP(I))
will be high. In other words, if the OLS ESP and GLM ESP are highly
correlated, then performing multiple linear regression and the corresponding
MLR variable selection (e.g. forward selection, backward elimination, or all
subsets selection) based on the Cp (I) criterion may provide many interesting
submodels.
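A minimal R sketch of this procedure, assuming a binary response vector y and a predictor matrix x:

library(leaps)
ESP <- predict(glm(y ~ x, family = binomial))    # GLM ESP
OLSESP <- predict(lm(y ~ x))                     # OLS ESP
abs(cor(ESP, OLSESP))                            # want >= 0.95
cpsearch <- summary(regsubsets(x, y, nvmax = ncol(x)))  # all subsets OLS search (add really.big=TRUE if p is large)
cpsearch$cp                                      # look for submodels I with Cp(I) <= 2r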

Know how to find good models from output. The following rules of thumb
(roughly in order of decreasing importance) may be useful. It is often not
possible to have all 12 rules of thumb to hold simultaneously. Let submodel
I have rI predictors, including a constant. Do not use more predictors than
submodel II , which has no more predictors than the minimum AIC model.
It is possible that II = Imin = Ifull . Assume the response plot for the full
model is good. Then the submodel I is good if
i) the response plot for the submodel looks like the response plot for the full
model.
ii) corr(ESP,ESP(I)) ≥ 0.95.
iii) The plotted points in the EE plot cluster tightly about the identity line.
iv) Want the pval ≥ 0.01 for the change in deviance test that uses I as the
reduced model.
v) For binary LR want rI ≤ min(N1 , N0 )/10. For PR, want rI ≤ n/10.
vi) Fit OLS to the full and reduced models. The plotted points in the plot of
the OLS residuals from the submodel versus the OLS residuals from the full
model should cluster tightly about the identity line.
vii) Want the deviance G2 (I) ≥ G2 (full) but close. (G2 (I) ≥ G2 (full) since
adding predictors to I does not increase the deviance.)
viii) Want AIC(I) ≤ AIC(Imin ) + 7 where Imin is the minimum AIC model
found by the variable selection procedure.
ix) Want hardly any predictors with pvals > 0.05.
x) Want few predictors with pvals between 0.01 and 0.05.
xi) Want G2 (I) ≤ n − rI + 3√(n − rI ).
xii) The OD plot should look good.
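Rules ii) and iii) can be checked with a few lines of R. A minimal sketch, assuming hypothetical fitted full and submodel GLM objects full and sub on the same data:

ESP <- predict(full)
ESPI <- predict(sub)
plot(ESPI, ESP)
abline(0, 1)            # EE plot: points should cluster tightly about the identity line
cor(ESPI, ESP)          # rule ii): want >= 0.95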

Heuristically, forward selection tries to add the variable that will decrease
the deviance the most. A decrease in deviance less than 4 (if the predictor
has 1 degree of freedom) may be troubling in that a bad predictor may have
been added. In practice, the forward selection program may add the variable
such that the submodel I with j nontrivial predictors has a) the smallest
AIC(I), b) the smallest deviance G2 (I), or c) the smallest pval (preferably
from a change in deviance test but possibly from a Wald test) in the test
H0 : βi = 0 versus HA : βi ≠ 0 where the current model with j terms plus
the predictor xi is treated as the full model (for all variables xi not yet in
the model).

Suppose that the full model is good and is stored in M1. Let M2, M3,
M4, and M5 be candidate submodels found after forward selection, backward
elimination, etc. Make a scatterplot matrix of the ESPs for M2, M3, M4,
M5, and M1. Good candidates should have estimated sufficient predictors
that are highly correlated with the full model estimated sufficient predictor
(the correlation should be at least 0.9 and preferably greater than 0.95). For
binary logistic regression, mark the symbols (0 and +) using the response
variable Y .

The final submodel should have few predictors, few variables with large
Wald pvals (0.01 to 0.05 is borderline), a good response plot, and an EE plot
that clusters tightly about the identity line. If a factor has K − 1 dummy
variables, either keep all K − 1 dummy variables or delete all K − 1 dummy
variables, do not delete some of the dummy variables.

Some logistic regression output can be unreliable if ρ̂(x) = 1 or ρ̂(x) = 0
exactly. Then ESP = ∞ or ESP = −∞ respectively. Some binary logistic
regression output can also be unreliable if there is perfect classification of 0s
and 1s so that the 0s are to the left and the 1s to the right of ESP = 0 in
the response plot. Then the logistic regression MLE β̂ LR does not exist, and
variable selection rules of thumb may fail. Note that when there is perfect
classification, the logistic regression model is very useful, but the logistic
curve can not approximate a step function rising from 0 to 1 at ESP = 0,
arbitrarily closely.

Example 10.11. The following output is for forward selection. All models
use a constant. For forward selection, the min AIC model uses {F}LOC, TYP,
AGE, CAN, SYS, PCO, and PH. Model II uses {F}LOC, TYP, AGE, CAN,
and SYS. Let model I use {F}LOC, TYP, AGE, and CAN. This model
may be good, so for forward selection, models II and I are the first models
to examine. {F}LOC is notation used for a factor with K − 1 = 3 dummy
variables, while k is the number of variables in I, including a constant. Output
is from the Cook and Weisberg (1999) Arc software.
Forward Selection                                comment

Base terms: ({F}LOC TYP)
          Deviance  Pearson X2 | k  AIC
Add:AGE   141.873   187.84     | 5  151.873      > min AIC + 7

Base terms: ({F}LOC TYP AGE)
          Deviance  Pearson X2 | k  AIC
Add:CAN   134.595   170.367    | 6  146.595      < min AIC + 7
          ({F}LOC TYP AGE CAN) could be a good model

Base terms: ({F}LOC TYP AGE CAN)
          Deviance  Pearson X2 | k  AIC
Add:SYS   128.441   179.753    | 7  142.441      < min AIC + 2
          ({F}LOC TYP AGE CAN SYS) could be a good model

Base terms: ({F}LOC TYP AGE CAN SYS)
          Deviance  Pearson X2 | k  AIC
Add:PCO   126.572   186.71     | 8  142.572      < min AIC + 2
          PCO not important since AIC < min AIC + 2

Base terms: ({F}LOC TYP AGE CAN SYS PCO)
          Deviance  Pearson X2 | k  AIC
Add:PH    123.285   191.264    | 9  141.285      min AIC
          PH not important since AIC < min AIC + 2
                                      B1       B2       B3       B4
df                                    255      258      259      263
# of predictors                       11       8        7        3
# with 0.01 ≤ Wald p-value ≤ 0.05     2        1        0        0
# with Wald p-value > 0.05            4        0        0        0
G2                                    233.765  237.212  243.482  278.787
AIC                                   257.765  255.212  259.482  286.787
corr(ESP,ESP(I))                      1.0      0.99     0.97     0.80
p-value for change in deviance test   1.0      0.328    0.045    0.000

Example 10.12. The above table gives summary statistics for 4 models
considered as final submodels after performing variable selection. One pre-
dictor was a factor, and a factor was considered to have a bad Wald p-value
> 0.05 if all of the dummy variables corresponding to the factor had p-values
> 0.05. Similarly the factor was considered to have a borderline p-value with
0.01 ≤ p-value ≤ 0.05 if none of the dummy variables corresponding to the
factor had a p-value < 0.01 but at least one dummy variable had a p-value
between 0.01 and 0.05. The response was binary and logistic regression was
used. The response plot for the full model B1 was good. Model B2 was the
minimum AIC model found. There were 267 cases: for the response, 113 were
0’s and 154 were 1’s.
Which two models are the best candidates for the final submodel? Explain
briefly why each of the other 2 submodels should not be used.
Solution: B2 and B3 are best. B1 has too many predictors with rather
large p-values. For B4, the AIC is too high and the corr and p-value are too
low.

Fig. 10.7 Visualizing the ICU Data (response plot of ESP versus Y)

Example 10.13. The ICU data is available from the text’s website and
from STATLIB (http://lib.stat.cmu.edu/DASL/Datafiles/ICU.html). Also
see Hosmer and Lemeshow (2000, pp. 23-25). The survival of 200 patients
following admission to an intensive care unit was studied with logistic regres-
sion. The response variable was STA (0 = Lived, 1 = Died). Predictors were
AGE, SEX (0 = Male, 1 = Female), RACE (1 = White, 2 = Black, 3 =
Other), SER= Service at ICU admission (0 = Medical, 1 = Surgical), CAN=
Is cancer part of the present problem? (0 = No, 1 = Yes), CRN= History
of chronic renal failure (0 = No, 1 = Yes), INF= Infection probable at ICU
admission (0 = No, 1 = Yes), CPR= CPR prior to ICU admission (0 = No, 1
Fig. 10.8 EE Plot Suggests Race is an Important Predictor (EE plot for the model without RACE: full model ESP versus submodel ESPS)

= Yes), SYS= Systolic blood pressure at ICU admission (in mm Hg), HRA=
Heart rate at ICU admission (beats/min), PRE= Previous admission to an
ICU within 6 months (0 = No, 1 = Yes), TYP= Type of admission (0 =
Elective, 1 = Emergency), FRA= Long bone, multiple, neck, single area, or
hip fracture (0 = No, 1 = Yes), PO2= PO2 from initial blood gases (0 if >60,
1 if ≤ 60), PH= PH from initial blood gases (0 if ≥ 7.25, 1 if <7.25), PCO=
PCO2 from initial blood gases (0 if ≤ 45, 1 if >45), Bic= Bicarbonate from
initial blood gases (0 if ≥ 18, 1 if <18), CRE= Creatinine from initial blood
gases (0 if ≤ 2.0, 1 if >2.0), and LOC= Level of consciousness at admission
(0 = no coma or stupor, 1= deep stupor, 2 = coma).
Factors LOC and RACE had two indicator variables to model the three
levels. The response plot in Figure 10.7 shows that the logistic regression
model using the 19 predictors is useful for predicting survival, although the
output has ρ̂(x) = 1 or ρ̂(x) = 0 exactly for some cases. Note that the
step function of slice proportions tracks the model logistic curve fairly well.
Variable selection, using forward selection and backward elimination with
the AIC criterion, suggested the submodel using AGE, CAN, SYS, TYP, and
LOC. The EE plot of ESP(sub) versus ESP(full) is shown in Figure 10.8.
The plotted points in the EE plot should cluster tightly about the identity
line if the full model and the submodel are good. Since this clustering did
not occur, the submodel seems to be poor. The lowest cluster of points and
the case on the right nearest to the identity line correspond to black patients.
Fig. 10.9 EE Plot Suggests Race is an Important Predictor (EE plot for the model with RACE: full model ESP versus submodel ESPS)

The main cluster and upper right cluster correspond to patients who are not
black.
Figure 10.9 shows the EE plot when RACE is added to the submodel.
Then all of the points cluster about the identity line. Although numerical
variable selection did not suggest that RACE is important, perhaps since
output had ρ̂(x) = 1 or ρ̂(x) = 0 exactly for some cases, the two EE plots
suggest that RACE is important. Also the RACE variable could be replaced
by an indicator for black. This example illustrates how the plots can be
used to quickly improve and check the models obtained by following logistic
regression with variable selection even if the MLE β̂ LR does not exist.

                                      P1       P2       P3       P4
df                                    144      147      148      149
# of predictors                       6        3        2        1
# with 0.01 ≤ Wald p-value ≤ 0.05     1        0        0        0
# with Wald p-value > 0.05            3        0        1        0
G2                                    127.506  131.644  147.151  149.861
AIC                                   141.506  139.604  153.151  153.861
corr(ESP,ESP(I))                      1.0      0.954    0.810    0.792
p-value for change in deviance test   1.0      0.247    0.0006   0.0

Example 10.14. The above table gives summary statistics for 4 models
considered as final submodels after performing variable selection. Poisson
regression was used. The response plot for the full model P1 was good. Model
P2 was the minimum AIC model found.
Which model is the best candidate for the final submodel? Explain briefly
why each of the other 3 submodels should not be used.
Solution: P2 is best. P1 has too many predictors with large pvalues and
more predictors than the minimum AIC model. P3 and P4 have corr and
pvalue too low and AIC too high.

Warning. Variable selection for GLMs is very similar to that for multiple
linear regression. Finding a model II from variable selection, and using GLM
output for model II does not give valid tests and confidence intervals. If there
is a good full model that was found before examining the response, and if II
is the minimum AIC model, then Section 10.9 describes how to do inference
after variable selection. If the model needs to be built using the response, use
data splitting. A pilot study can also be useful.

10.6.2 When n/p is Not Necessarily Large

Forward selection with EBIC, lasso, and/or elastic net can be used for the
Cox proportional hazards regression model and for some GLMs, including
binomial and Poisson regression. The relaxed lasso = VS-lasso and relaxed
elastic net = VS-elastic net estimators apply the GLM or Cox regression
model to the predictors with nonzero lasso or elastic net coefficients. As
with multiple linear regression, the population number of active nontrivial
predictors = kS , but for a GLM, model I with SP = xTI βI has k active
nontrivial predictors. See Section 4.1.

Remark 10.1. Most of the plots in this chapter use ESP = xT β̂, and
can also be made using ESP (I) = xTI β̂ I . Obtaining a good ESP becomes
more difficult as n/p becomes smaller.

Remark 10.2. Suppose the 1D regression model, such as a GLM, has
SP = xT β. If n > 10p, then fit the model using Chapter 5 MLR type
methods, such as relaxed lasso and forward selection (using Cp ), to find a
subset of predictors I. If n < 10p, fit the model with MLR lasso. (Limited
experience suggests that MLR with EBIC leads to severe underfitting if n <
10p if the 1D regression model is not MLR.) Then fit the 1D regression
with Y and xI . Check the model with the response plot and the EE plot
of the MLR ESP versus the 1D regression ESP. High correlation in the EE
plot suggests MLR model selection may be useful for the 1D regression model
selection. For some GLMs, make the OD plot. If xI is an a×1 vector, we want
n ≥ Ja where J ≥ 5 and preferably J ≥ 10. For binary logistic regression, we
want min(N0 , N1 ) ≥ Ja. Note that if n < 5p, the EE plot of the submodel
ESP versus the full model ESP should not be used since the full model is
overfitting. This method should be best when the predictors are linearly
related: there should be no strong nonlinear relationships. See Olive and
Hawkins (2005) for this method when n > 10p.

Some R commands for GLM lasso and Remark 10.2 are shown below. Note
that the family command indicates whether a binomial regression (including
binary regression) or a Poisson regression is being fit. The default for GLM
lasso uses 10-fold CV with a deviance criterion.
set.seed(1976) #Binary regression
library(glmnet)
n<-100
m<-1 #binary regression
q <- 100 #100 nontrivial predictors, 95 inactive
k <- 5 #k_S = 5 population active predictors
y <- 1:n
mv <- m + 0 * y
vars <- 1:q
beta <- 0 * 1:q
beta[1:k] <- beta[1:k] + 1
beta
alpha <- 0
x <- matrix(rnorm(n * q), nrow = n, ncol = q)
SP <- alpha + x[,1:k] %*% beta[1:k]
pv <- exp(SP)/(1 + exp(SP))
y <- rbinom(n,size=m,prob=pv)
y
out<-cv.glmnet(x,y,family="binomial")
lam <- out$lambda.min
bhat <- as.vector(predict(out,type="coefficients",s=lam))
ahat <- bhat[1] #alphahat
bhat<-bhat[-1]
vin <- vars[bhat!=0] #want 1-5, overfit
[1] 1 2 3 4 5 6 16 59 61 74 75 76 96
ind <- as.data.frame(cbind(y,x[,vin])) #relaxed lasso GLM
tem <- glm(y~.,family="binomial",data=ind)
tem$coef
(Inter) V2 V3 V4 V5 V6
0.2103 1.0037 1.4304 0.6208 1.8805 0.3831
V7 V8 V9 V10 V11 V12
0.8971 0.4716 0.5196 0.8900 0.6673 -0.7611
V13 V14
-0.5918 0.6926
lrplot3(tem=tem,x=x[,vin]) #binary response plot
#now use MLR lasso
outm<-cv.glmnet(x,y)
lamm <- outm$lambda.min
bm <- as.vector(predict(outm,type="coefficients",s=lamm))
am <- bm[1] #alphahat
bm<-bm[-1]
vm <- vars[bm!=0] #1 more variable than GLM lasso
vm
[1] 1 2 3 4 5 6 16 35 59 61 74 75 76 96
vin
[1] 1 2 3 4 5 6 16 59 61 74 75 76 96
inm <- as.data.frame(cbind(y,x[,vm])) #relaxed lasso GLM
tm <- glm(y~.,family="binomial",data=inm)
lrplot3(tem=tm,x=x[,vm]) #binary response plot
#Now use MLR forward selection with EBIC since n < 10p.
library(leaps)
out<-fsel(x,y)
vin<-out$vin
vin #severe underfit
[1] 4
inm <- as.data.frame(cbind(y,x[,vin]))
tm <- glm(y~.,family="binomial",data=inm)
lrplot3(tem=tm,x=x[,vin]) #binary response plot
#Poisson regression, using same x and beta as above
y <- rpois(n,lambda=exp(SP))
out<-cv.glmnet(x,y,family="poisson")
lam <- out$lambda.min
bhat <- as.vector(predict(out,type="coefficients",s=lam))
ahat <- bhat[1] #alphahat
bhat<-bhat[-1]
vin <- vars[bhat!=0] #want 1-5, overfit
vin
[1] 1 2 3 4 5 7 9 10 13 16 17 18 21 23 25
26 27 30 37 39 40 42 44 46 51 53 57 59 62 71 74 84 85 93 95 97 99
ind <- as.data.frame(cbind(y,x[,vin])) #relaxed lasso GLM
out <- glm(y~.,family="poisson",data=ind)
ESP <- predict(out)
prplot2(ESP,x=x[,vin],y) #response and OD plots
#now use MLR lasso
outm<-cv.glmnet(x,y)
lamm <- outm$lambda.min
bm <- as.vector(predict(outm,type="coefficients",s=lamm))
am <- bm[1] #alphahat
bm<-bm[-1]
vm <- vars[bm!=0]
vm #much less overfit than GLM lasso
[1] 1 2 3 4 5 9 17 21 22 27 29 60 75 95
inm <- as.data.frame(cbind(y,x[,vm])) #relaxed lasso GLM
out <- glm(y~.,family="poisson",data=inm)
ESP <- predict(out)
prplot2(ESP,x=x[,vm],y) #response and OD plots
#Now use MLR forward selection with EBIC since n < 10p.
library(leaps)
out<-fsel(x,y)
vin<-out$vin
vin #severe underfit causes poor fit and overdispersion
[1] 5
inm <- as.data.frame(cbind(y,x[,vin]))
out <- glm(y~.,family="poisson",data=inm)
ESP <- predict(out)
prplot2(ESP,x=x[,vin],y) #response and OD plots

10.7 Generalized Additive Models

There are many alternatives to the binomial and Poisson regression GLMs.
Alternatives to the binomial GLM of Definition 10.7 include the discriminant
function model of Definition 10.8, the quasi-binomial model, the binomial
generalized additive model (GAM), and the beta-binomial model of Definition
10.2.
Alternatives to the Poisson GLM of Definition 10.12 include the quasi-
Poisson model, the Poisson GAM, and the negative binomial regression model
of Definition 10.3. Other alternatives include the zero truncated Poisson
model, the zero truncated negative binomial model, the hurdle or zero in-
flated Poisson model, the hurdle or zero inflated negative binomial model,
the hurdle or zero inflated additive Poisson model, and the hurdle or zero
inflated additive negative binomial model. See Zuur et al. (2009), Simonoff
(2003), and Hilbe (2011).

Many of these models can be visualized with response plots. An interesting
research project would be to make response plots for these models, adding
the conditional mean function and lowess to the plot. Also make OD plots to
check whether the model handled overdispersion. This section will examine
several of the above models, especially GAMs. A GAM is a 1D regression
model with SP=AP and ESP=EAP. We may use ESP for a GLM and EAP
for a GAM.

Definition 10.18. In a 1D regression, Y is independent of x given the
sufficient predictor SP = h(x) where SP = xT β for a GLM. In a generalized
additive model, Y is independent of x = (x1 , ..., xp)T given the additive
predictor AP = α + Σ_{j=2}^p Sj (xj ) for some (usually unknown) functions Sj .
The estimated sufficient predictor ESP = ĥ(x) and ESP = xT β̂ for a GLM.
The estimated additive predictor EAP = α̂ + Σ_{j=2}^p Ŝj (xj ). An ESP–response
plot is a plot of ESP versus Y while an EAP–response plot is a plot of EAP
versus Y .

Note that a GLM is a special case of the GAM using Sj (xj ) = βj xj for j =
2, ..., p with α = β1 . A GLM with SP = α + β2 x2 + β3 x3 + β4 x1 x2 is a special
case of a GAM with x4 ≡ x1 x2 . A GLM with SP = α + β2 x2 + β3 x22 + β4 x3
is a special case of a GAM with S2 (x2 ) = β2 x2 + β3 x22 and S3 (x3 ) = β4 x3 .
A GLM with p terms may be equivalent to a GAM with k terms w1 , ..., wk
where k < p.
The plotted points in the EE plot defined below should scatter tightly
about the identity line if the GLM is appropriate and if the sample size is
large enough so that the ESP is a good estimator of the SP and the EAP is a
good estimator of the AP. If the clustering is not tight but the GAM gives a
reasonable approximation to the data, as judged by the EAP–response plot,
then examine the Ŝj of the GAM to see if some simple terms such as x2i can
be added to the GLM so that the modified GLM has a good ESP–response
plot. (This technique is easiest if the GLM and GAM have the same p terms
x1 , ..., xp. The technique is more difficult, for example, if the GLM has terms
x1 , x2 , x22, and x3 while the GAM has terms x1 , x2 and x3 .)

Definition 10.19. An EE plot is a plot of EAP versus ESP.

Definition 10.20. Recall the binomial GLM

Yi |SPi ∼ binomial(mi , exp(SPi )/[1 + exp(SPi )]).

Let ρ(w) = exp(w)/[1 + exp(w)].
i) The binomial GAM is Yi |APi ∼ binomial(mi , exp(APi )/[1 + exp(APi )]). The
EAP–response plot adds the estimated mean function ρ(EAP ) and a step
function to the plot as done for the ESP–response plot of Section 10.3.
ii) The quasi-binomial model is a 1D regression model with E(Yi |xi ) =
mi ρ(SPi ) and V (Yi |xi ) = φ mi ρ(SPi )(1 − ρ(SPi )) where the dispersion
parameter φ > 0. Note that this model and the binomial GLM have the
same conditional mean function, and the conditional variance functions are
the same if φ = 1.

Definition 10.21. Recall the Poisson GLM Y |SP ∼ Poisson(exp(SP )).
i) The Poisson GAM is Y |AP ∼ Poisson(exp(AP )). The EAP–response
plot adds the estimated mean function exp(EAP ) and lowess to the plot as
done for the ESP–response plot of Section 10.4.
ii) The quasi-Poisson model is a 1D regression model with E(Y |x) =
exp(SP ) and V (Y |x) = φ exp(SP ) where the dispersion parameter φ > 0.
Note that this model and the Poisson GLM have the same conditional mean
function, and the conditional variance functions are the same if φ = 1.

For the quasi-binomial model, the conditional mean and variance functions
are similar to those of the binomial distribution, but it is not assumed that
Y |SP has a binomial distribution. Similarly, it is not assumed that Y |SP
has a Poisson distribution for the quasi-Poisson model.
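In R, the quasi-binomial and quasi-Poisson models are fit with the quasibinomial and quasipoisson families, which estimate the dispersion parameter φ. A minimal sketch, assuming a hypothetical data frame dat with counts y out of m trials, a count response z, and predictors x2 and x3:

qb <- glm(cbind(y, m - y) ~ x2 + x3, family = quasibinomial, data = dat)
qp <- glm(z ~ x2 + x3, family = quasipoisson, data = dat)
summary(qb)$dispersion    # estimated phi for the quasi-binomial model
summary(qp)$dispersion    # estimated phi for the quasi-Poisson model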

Next, some notation is needed to derive the zero truncated Poisson regression model. Y has a zero truncated Poisson distribution, Y ∼ ZTP(µ), if the probability mass function (pmf) of Y is

f(y) = exp(−µ) µ^y / [(1 − exp(−µ)) y!]

for y = 1, 2, 3, ... where µ > 0. The ZTP pmf is obtained from a Poisson distribution where y = 0 values are truncated, so not allowed. If W ∼ Poisson(µ) with pmf fW (y), then P (W = 0) = exp(−µ), so Σ_{y=1}^∞ fW (y) = Σ_{y=0}^∞ fW (y) − fW (0) = 1 − exp(−µ). So the ZTP pmf f(y) = fW (y)/(1 − exp(−µ)) for y ≠ 0.
Now E(Y ) = Σ_{y=1}^∞ y f(y) = Σ_{y=0}^∞ y f(y) = Σ_{y=0}^∞ y fW (y)/(1 − exp(−µ)) = E(W )/(1 − exp(−µ)) = µ/(1 − exp(−µ)).
Similarly, E(Y ²) = Σ_{y=1}^∞ y² f(y) = Σ_{y=0}^∞ y² f(y) = Σ_{y=0}^∞ y² fW (y)/(1 − exp(−µ)) = E(W ²)/(1 − exp(−µ)) = [µ² + µ]/(1 − exp(−µ)). So

V (Y ) = E(Y ²) − (E(Y ))² = (µ² + µ)/(1 − exp(−µ)) − [µ/(1 − exp(−µ))]².

Definition 10.22. The zero truncated Poisson regression model has
Y |SP ∼ ZTP(exp(SP )). Hence the parameter µ(SP ) = exp(SP ),

E(Y |x) = exp(SP )/[1 − exp(− exp(SP ))], and

V (Y |SP ) = ([exp(SP )]² + exp(SP ))/[1 − exp(− exp(SP ))] − [exp(SP )/(1 − exp(− exp(SP )))]².

The quasi-binomial, quasi-Poisson, and zero truncated Poisson regression
models have GAM analogs that replace SP by AP. Definitions 10.1, 10.2, and
10.3 give important GAM models where SP = AP. Several of these models
are GAM analogs of models discussed in Sections 10.2, 10.3, and 10.4.
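The ZTP mean and variance functions of Definition 10.22 are easy to code. A minimal R sketch:

ztpmean <- function(SP) exp(SP)/(1 - exp(-exp(SP)))
ztpvar <- function(SP) {
  mu <- exp(SP); d <- 1 - exp(-mu)
  (mu^2 + mu)/d - (mu/d)^2
}
ztpmean(0)    # = 1/(1 - exp(-1)), approximately 1.582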

10.7.1 Response Plots

For a 1D regression model, there are several useful plots using the ESP. A
GAM is a 1D regression model with ESP = EAP . It is well known that the
residual plot of ESP or EAP versus the residuals (on the vertical axis) is
useful for checking the model. Similarly, the response plot of ESP or EAP
versus the response Y is useful. Assume that the ESP or EAP takes on many
values. For a GAM, substitute EAP for ESP for the plots in Definitions 10.9,
10.10, 10.11, 10.13, 10.14, and 10.16.
The response plot for the beta-binomial GAM is similar to that for the
binomial GAM. The plots for the negative binomial GAM are similar to those
of the Poisson regression GAM, including the plots in Definition 10.16. See
Examples 10.4, 10.5, and 10.6.

10.7.2 The EE Plot for Variable Selection

Variable selection is the search for a subset of variables that can be deleted
without important loss of information. Olive and Hawkins (2005) make an
EE plot of ESP (I) versus ESP where ESP (I) is for a submodel I and ESP
is for the full model. This plot can also be used to complement the hypothesis
test that the reduced model I (which is selected before gathering data) can
be used instead of the full model. The obvious extension to GAMs is to make
the EE plot of EAP (I) versus EAP . If the fitted full model and submodel
I are good, then the plotted points should follow the identity line with high
correlation (use correlation ≥ 0.95 as a benchmark).
To justify this claim, assume that there exists a subset S of predictor
variables such that if xS is in the model, then none of the other predictors
is needed in the model. Write E for these (‘extraneous’) variables not in S,
partitioning x = (xTS , xTE )T . Then
AP = α + Σ_{j=2}^p Sj (xj ) = α + Σ_{j∈S} Sj (xj ) + Σ_{k∈E} Sk (xk ) = α + Σ_{j∈S} Sj (xj ).    (10.10)

The extraneous terms that can be eliminated given that the subset S is in
the model have Sk (xk ) = 0 for k ∈ E.
Now suppose that I is a candidate subset of predictors and that S ⊆ I.
Then
AP = α + Σ_{j=2}^p Sj (xj ) = α + Σ_{j∈S} Sj (xj ) = α + Σ_{k∈I} Sk (xk ) = AP (I),

(if I includes predictors from E, these will have Sk (xk ) = 0). For any subset
I that includes all relevant predictors, the correlation corr(AP, AP(I)) = 1.
Hence if the full model and submodel are reasonable and if EAP and EAP(I)
are good estimators of AP and AP(I), then the plotted points in the EE plot
of EAP(I) versus EAP will follow the identity line with high correlation.
10.7.3 An EE Plot for Checking the GLM

One useful application of a GAM is for checking whether the corresponding
GLM has the correct form of the predictors xj in the model. Suppose a GLM
and the corresponding GAM are both fit with the same link function where
at least one general Sj (xj ) was used. Since the GLM is a special case of the
GAM, the plotted points in the EE plot of EAP versus ESP should follow
the identity line with very high correlation if the fitted GLM and GAM are
roughly equivalent. If the correlation is not very high and the GAM has some
nonlinear Ŝj (xj ), update the GLM, and remake the EE plot. For example,
update the GLM by adding terms such as xj² and possibly xj³, or add log(xj )
if xj is highly skewed. Then remake the EAP versus ESP plot.
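A minimal R sketch of this check, assuming a hypothetical data frame dat with binary response y and continuous predictors x2 and x3:

library(mgcv)
outglm <- glm(y ~ x2 + x3, family = binomial, data = dat)
outgam <- gam(y ~ s(x2) + s(x3), family = binomial, data = dat)
ESP <- predict(outglm); EAP <- predict(outgam)
plot(EAP, ESP); abline(0, 1)    # EE plot: want very high correlation
plot(outgam)                    # examine the Shat_j to suggest terms like x2^2 or log(x2)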

10.7.4 Examples

For the binary logistic GAM, the EAP will not be a consistent estimator
of the AP if the estimated probability ρ̂(AP ) = ρ(EAP ) is exactly zero or
one. The following example will show that GAM output and plots can still
be used for exploratory data analysis. The example also illustrates that EE
plots are useful for detecting cases with high leverage and clusters of cases.
Fig. 10.10 Visualizing the ICU GAM (response plot of EAP versus Y)


Fig. 10.11 GAM and GLM give Similar Success Probabilities (ESP versus EAP)

Example 10.15. For the ICU data of Example 10.13, a binary general-
ized additive model was fit with unspecified functions for AGE, SYS, and
HRA, and linear functions for the remaining 16 variables. Output suggested
that functions for SYS and HRA are linear but the function for AGE may
be slightly curved. Several cases had ρ̂(AP ) equal to zero or one, but the
response plot in Figure 10.10 suggests that the full model is useful for pre-
dicting survival. Note that the ten slice step function closely tracks the logistic
curve. To visualize the model with the response plot, use Y |x ≈ binomial[1,
ρ(EAP ) = eEAP /(1+eEAP )]. When x is such that EAP < −5, ρ(EAP ) ≈ 0.
If EAP > 5, ρ(EAP ) ≈ 1, and if EAP = 0, then ρ(EAP ) = 0.5. The logistic
curve gives ρ(EAP ) ≈ P (Y = 1|x) = ρ(AP ). The different estimated bi-
nomial distributions have ρ̂(AP ) = ρ(EAP ) that increases according to the
logistic curve as EAP increases. If the step function tracks the logistic curve
closely, the binary GAM gives useful smoothed estimates of ρ(AP ) provided
that the number of 0s and 1s are both much larger than the model degrees
of freedom so that the GAM is not overfitting.
A binary logistic regression was also fit, and Figure 10.11 shows the plot of
EAP versus ESP. The plot shows that the near zero and near one probabilities
are handled differently by the GAM and GLM, but the estimated success
probabilities for the two models are similar: ρ̂(ESP ) ≈ ρ̂(EAP ). Hence we
used the GLM and performed variable selection as in Example 10.13. Some R
code is below.

##ICU data from Statlib or URL
#http://parker.ad.siu.edu/Olive/ICU.lsp
#delete header of ICU.lsp and delete last parentheses
#at the end of the file. Save the file on F drive as
#icu.txt.

icu <- read.table("F:\\icu.txt")

names(icu) <- c("ID", "STA", "AGE", "SEX", "RACE",
"SER", "CAN", "CRN", "INF", "CPR", "SYS", "HRA",
"PRE", "TYP", "FRA", "PO2", "PH", "PCO", "Bic",
"CRE", "LOC")

icu[,5] <- as.factor(icu[,5])
icu[,21] <- as.factor(icu[,21])
icu2 <- icu[,-1]
outf <- glm(formula=STA~.,family=binomial,data=icu2)
ESP <- predict(outf)

library(mgcv)
outgam <- gam(STA ~ s(AGE)+SEX+RACE+SER+CAN+CRN+INF+
CPR+s(SYS)+s(HRA)+PRE+TYP+FRA+PO2+PH+PCO+Bic+CRE+LOC,
family=binomial,data=icu2)
EAP <- predict.gam(outgam)
plot(EAP,ESP)
abline(0,1)
#Figure 10.11

Y <- icu2[,1]
lrplot3(ESP=EAP,Y,slices=18)
#Figure 10.10

lrplot3(ESP,Y,slices=18)
#Figure 10.7
Example 10.16. For binary data, Kay and Little (1987) suggest exam-
ining the two distributions x|Y = 0 and x|Y = 1. Use predictor x if the two
distributions are roughly symmetric with similar spread. Use x and x2 if the
distributions are roughly symmetric with different spread. Use x and log(x)
if one or both of the distributions are skewed. The log rule says add log(x)
to the model if min(x) > 0 and max(x)/ min(x) > 10. The Gladstone (1905)
data is useful for illustrating these suggestions. The response was gender with
Y = 1 for male and Y = 0 for female. The predictors were age, height, and
the head measurements circumference, length, and size. When the GAM was
fit without log(age) or log(size), the Ŝj for age, height, and circumference
were nonlinear. The log rule suggested adding log(age), and log(size) was
added because size is skewed. The GAM for this model had plots of Ŝj (xj )
472 10 1D Regression Models Such as GLMs

that were fairly linear. The response plot is not shown but was similar to
Figure 10.10, and the step function tracked the logistic curve closely. When
EAP = 0, the estimated probability of Y = 1 (male) is 0.5. When EAP > 5
the estimated probability is near 1, but near 0 for EAP < −5. The response
plot for the binomial GLM, not shown, is similar.

Fig. 10.12 EE plot for cubic GLM for Heart Attack Data (ESPp versus EAP)

Example 10.17. Wood (2017, pp. 125-130) describes heart attack data
where the response Y is the number of heart attacks for mi patients suspected
of suffering a heart attack. The enzyme ck (creatine kinase) was measured for
the patients and it was determined whether the patient had a heart attack
or not. A binomial GLM with predictors x1 = ck, x2 = [ck]2 , and x3 = [ck]3
was fit and had AIC = 33.66. The binomial GAM with predictor x1 was fit in
R, and Figure 10.12 shows that the EE plot for the GLM was not too good.
The log rule suggests using ck and log(ck), but ck was not significant. Hence
a GLM with the single predictor log(ck) was fit. Figure 10.13 shows the EE
plot, and Figure 10.14 shows the response plot where the Zi = Yi /mi track
the logistic curve closely. There was no evidence of overdispersion and the
model had AIC = 33.45. The GAM using log(ck) had a linear Ŝ, and the
correlation of the plotted points in the EE plot, not shown, was one.
Fig. 10.13 EE plot with log(ck) in the GLM (ESPl versus EAP)


Fig. 10.14 Response Plot for Heart Attack Data (Z = Y/m versus ESPl)



10.8 Overdispersion

Definition 10.23. Overdispersion occurs when the actual conditional vari-
ance function V (Y |x) is larger than the model conditional variance function
VM (Y |x).

Overdispersion can occur if the model underfits, if the response variables
are correlated, if the population follows a mixture distribution, or if outliers
are present. Typically it is assumed that the model is correct so V (Y |x) =
VM (Y |x). Hence the subscript M is usually suppressed. A GAM has condi-
tional mean and variance functions EM (Y |AP ) and VM (Y |AP ) where the
subscript M indicates that the function depends on the model. Then overdis-
persion occurs if V (Y |x) > VM (Y |AP ) where E(Y |x) and V (Y |x) denote
the actual conditional mean and variance functions. Then the assumptions
that E(Y |x) = EM (Y |x) ≡ m(AP ) and V (Y |x) = VM (Y |AP ) ≡ v(AP )
need to be checked.
First check that the assumption E(Y |x) = m(SP ) is a reasonable approx-
imation to the data using the response plot with lowess and the estimated
conditional mean function ÊM (Y |x) = m̂(SP ) added as visual aids. Overdis-
persion can occur even if the model conditional mean function E(Y |SP )
is a good approximation to the data. For example, for many data sets
where E(Yi |xi ) = mi ρ(SPi ), the binomial regression model is inappropriate
since V (Yi |xi ) > mi ρ(SPi )(1 − ρ(SPi )). Similarly, for many data sets where
E(Y |x) = µ(x) = exp(SP ), the Poisson regression model is inappropriate
since V (Y |x) > exp(SP ). If the conditional mean function is adequate, then
we suggest checking for overdispersion using the OD plot.

Definition 10.24. For 1D regression, the OD plot is a plot of the estimated
model variance V̂M (Y |SP ) versus the squared residuals
V̂ = [Y − ÊM (Y |SP )]2 . Replace SP by AP for a GAM.
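A minimal R sketch of the OD plot for a Poisson regression, assuming a hypothetical fitted object out <- glm(y ~ ., family = poisson, data = dat) with response vector y:

ESP <- predict(out)
modvar <- exp(ESP)          # hat V_M(Y|SP) = hat E_M(Y|SP) for the Poisson model
Vhat <- (y - exp(ESP))^2    # squared residuals
plot(modvar, Vhat)
abline(0, 1)                # identity line
abline(0, 4)                # slope 4 line through the origin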

The OD plot has been used by Winkelmann (2000, p. 110) for the Poisson
regression model where V̂M (Y |SP ) = ÊM (Y |SP ) = exp(ESP ). For binomial
and Poisson regression, the OD plot can be used to complement tests and
diagnostics for overdispersion such as those given in Cameron and Trivedi
(2013), Collett (1999, ch. 6), and Winkelmann (2000). See discussion below
Definitions 10.11 and 10.14 for how to interpret the OD plot with the identity
line, OLS line, and slope 4 line added as visual aids, and for discussion of the
numerical summaries G2 and X 2 for GLMs.
Definition 10.1, with SP = AP, gives EM (Y |AP ) = m(AP ) and VM (Y |AP )
= v(AP ) for several models. Often m̂(AP ) = m(EAP ) and v̂(AP ) =
v(EAP ), but additional parameters sometimes need to be estimated. Hence
v̂(AP ) = mi ρ(EAPi )(1−ρ(EAPi ))[1+(mi −1)θ̂/(1+θ̂)], v̂(AP ) = exp(EAP )+
τ̂ exp(2 EAP ), and v̂(AP ) = [m(EAP )]2 /ν̂ for the beta-binomial, nega-
tive binomial, and gamma GAMs, respectively. The beta-binomial regres-
sion model is often used if the binomial regression is inadequate because of
overdispersion, and the negative binomial GAM is often used if the Poisson
GAM is inadequate.
Since the Poisson regression (PR) model is simpler than the negative bi-
nomial regression (NBR) model, and the binomial logistic regression (LR)
model is simpler than the beta-binomial regression (BBR) model, the graphical di-
agnostics for the goodness of fit of the PR and LR models are very useful.
Combining the response plot with the OD plot is a powerful method for as-
sessing the adequacy of the Poisson and logistic regression models. NBR and
BBR models should also be checked with response and OD plots. See Exam-
ples 10.2–10.6 and the R code at the end of Section 10.6 (where q = p − 1).

Example 10.18. The species data is from Cook and Weisberg (1999,
pp. 285-286) and Johnson and Raven (1973). The response variable is the
total number of species recorded on each of 29 islands in the Galápagos
Archipelago. Predictors include area of island, areanear = the area of the
closest island, the distance to the closest island, the elevation, and endem =
the number of endemic species (those that were not introduced from else-
where). A scatterplot matrix of the predictors suggested that log transfor-
mations should be taken. Poisson regression suggested that log(endem) and
log(areanear) were the important predictors, but the deviance and Pear-
son X 2 statistics suggested overdispersion was present since both statistics
were near 71.4 with 26 degrees of freedom. The residual plot also suggested
increasing variance with increasing fitted value. A negative binomial regres-
sion suggested that only log(endem) was needed in the model, and had a
deviance of 26.12 on 27 degrees of freedom. The residual plot for this model
was roughly ellipsoidal. The negative binomial GAM with log(endem) had
an Ŝ that was linear and the plotted points in the EE plot had correlation
near 1.
The response plot with the exponential and lowess curves added as visual
aids is shown in Figure 10.15. The interpretation is that Y |x ≈ negative
binomial with E(Y |x) ≈ exp(EAP ). Hence if EAP = 0, E(Y |x) ≈ 1. The
negative binomial and Poisson GAM have the same conditional mean func-
tion. If the plot was for a Poisson GAM, the interpretation would be that
Y |x ≈ Poisson(exp(EAP )). Hence if EAP = 0, Y |x ≈ Poisson(1).
Figure 10.16 shows the OD plot for the negative binomial GAM with the
identity line and slope 4 line through the origin added as visual aids. The
plotted points fall within the “slope 4 wedge,” suggesting that the negative
binomial regression model has successfully dealt with overdispersion. Here
Ê(Y |AP ) = exp(EAP ) and V̂ (Y |AP ) = exp(EAP ) + τ̂ exp(2EAP ) where
τ̂ = 1/37.
Fig. 10.15 Response Plot for Negative Binomial GAM (Y versus EAP)


Fig. 10.16 OD Plot for Negative Binomial GAM (Vhat versus ModVar)


10.9 Inference After Variable Selection for GLMs

Inference after variable selection for GLMs is very similar to inference after
variable selection for multiple linear regression. AIC, BIC, EBIC, lasso, and
elastic net can be used for variable selection. Read Section 4.2 for the large
sample theory for β̂ Imin ,0 . We assume that n >> p. Theorem 4.4, the Vari-
able Selection CLT, still applies, as does Remark 4.4. Hence if lasso or elastic
net is consistent, then relaxed lasso or relaxed elastic net is √n consistent.
The geometric argument of Theorem 4.5 also applies. We follow Rathnayake
and Olive (2019) closely. Read Sections 4.2, 4.5, and 4.6 before reading this
section. We will describe the parametric bootstrap, and then consider boot-
strapping variable selection.

10.9.1 The Parametric and Nonparametric Bootstrap

Consider a parametric 1D regression model Y |x ∼ D(xT β, γ) where D is a parametric distribution that depends on the p × 1 vector of predictors x only through SP = xT β, and γ is a q × 1 vector of parameters.
Suppose Yi |xi ∼ D(xTi β, γ), √n(β̂ − β) →D Np (0, V (β)), and that V (β̂) →P V (β) as n → ∞. These assumptions tend to be mild for a parametric
regression model where the maximum likelihood estimator (MLE) β̂ is used.
Then V (β) = I −1 (β), the inverse Fisher information matrix. If I n (β) is the Fisher information matrix based on a sample of size n, then I n (β)/n →P I(β).
For GLMs, see, for example, Sen and Singer (1993, p. 309). For the paramet-
ric regression model, we regress Y on X to obtain (β̂, γ̂) where the n × 1
vector Y = (Yi ) and the ith row of the n × p design matrix X is xTi .
The parametric bootstrap uses Y ∗j = (Yi∗ ) where Yi∗ |xi ∼ D(xTi β̂, γ̂) for i = 1, ..., n. Regress Y ∗j on X to get β̂∗j for j = 1, ..., B. The large sample theory for β̂∗ is simple. Note that if Yi∗ |xi ∼ D(xTi b, γ̂) where b does not depend on n, then (Y ∗ , X) follows the parametric regression model with parameters (b, γ̂). Hence √n(β̂∗ − b) →D Np (0, V (b)). Now fix a large integer n0 , and let b = β̂ n0 . Then √n(β̂∗ − β̂ n0 ) →D Np (0, V (β̂ n0 )). Since Np (0, V (β̂)) →D Np (0, V (β)), we have

√n(β̂∗ − β̂) →D Np (0, V (β))    (10.11)

as n → ∞.
Now suppose S ⊆ I. Without loss of generality, let β = (β TI , βTO )T and β̂ =
(β̂(I)T , β̂(O)T )T . Then (Y , X I ) follows the parametric regression model with
parameters (β I , γ). Hence √n(β̂ I − β I ) →D NaI (0, V (β I )). Now (Y ∗ , X I )
only follows the parametric regression model asymptotically, since β̂(O) ≠ 0. However, under regularity conditions, E(β̂∗I ) ≈ β̂ I and Cov(β̂∗I ) − Cov(β̂ I ) → 0 as n, B → ∞.
To see the above claim for GLMs, consider a GLM with ηi = SPi = xTi β = g(µi ) where µi = E(Yi |xi ) = g−1 (ηi ). Let Vi = V (Yi |xi ). Let

zi = g(µi ) + g′(µi )(Yi − µi ) = ηi + (∂ηi /∂µi )(Yi − µi ), Z = (zi ),

wi = (∂µi /∂ηi )² (1/Vi ), W = diag(wi ), Ŵ = W |β̂ , and Ẑ = Z|β̂ .
Then

β̂ = (X T Ŵ X)−1 X T Ŵ Ẑ and β̂ I = (X TI Ŵ I X I )−1 X TI Ŵ I Ẑ I

while
β̂∗I = (X TI Ŵ ∗I X I )−1 X TI Ŵ ∗I Ẑ ∗I    (10.12)

where β̂∗I is fit as if (Y ∗ , X I ) follows the GLM with parameters (β̂(I), γ̂).

If S ⊆ I, then this approximation is correct asymptotically since √n β̂(O) = OP (1). Hence η∗iI = xTiI β̂(I) = g(µ∗iI ), and ViI∗ = VM (Yi∗ |xiI ) where VM is the model variance from the GLM with parameters (β̂(I), γ̂). Also, the estimated asymptotic covariance matrices are

Ĉov(β̂) = (X T Ŵ X)−1 and Ĉov(β̂ I ) = (X TI Ŵ I X I )−1 .

See, for example, Agresti (2002, pp. 138, 147), Hillis and Davis (1994), and McCullagh and Nelder (1989). From Sen and Singer (1994, p. 307), n(X TI Ŵ I X I )−1 →P I −1 (β I ) as n → ∞ if S ⊆ I.
Let β̃ = (X T W X)−1 X T W Z. Then E(β̃) = β since E(Z) = Xβ, and
Cov(Y ) = Cov(Y |X) = diag(Vi ). Since

∂µi /∂ηi = 1/g′(µi ) and ∂ηi /∂µi = g′(µi ),

Cov(Z) = Cov(Z|X) = W −1 . Thus Cov(β̃) = (X T W X)−1 . Although β̂ − β = OP (n−1/2 ), we have n(X T Ŵ X)−1 − n(X T W X)−1 →P I −1 (β) − I −1 (β) = 0 as n → ∞.

Let β̃∗I = (X TI W ∗I X I )−1 X TI W ∗I Z ∗I where W ∗I and Z ∗I are evaluated using β̂(I). Then Cov(Y ∗ ) = diag(Vi∗ ) → diag(ViI∗ ). Hence Cov(Z ∗I ) → W ∗−1 I and Cov(β̃∗I ) → (X TI W ∗I X I )−1 as n, B → ∞. Hence Cov(β̂∗I ) − Cov(β̂ I ) → 0 as n, B → ∞ if S ⊆ I.
As an example, consider the Poisson regression model from Section 10.4.
Then µ∗iI = exp(xTiI β̂(I)) = exp(η∗iI ) = ViI∗ . Hence
∂µ∗iI /∂η∗iI = exp(η∗iI ) = µ∗iI = ViI∗ ,

w∗iI = exp(xTiI β̂(I)), and ŵ∗iI = exp(xTiI β̂∗I ). Similarly, η∗iI = log(µ∗iI ),

z∗iI = η∗iI + (∂η∗iI /∂µ∗iI )(Yi∗ − µ∗iI ) = η∗iI + (1/µ∗iI )(Yi∗ − µ∗iI ), and

ẑ∗iI = xTiI β̂∗I + (Yi∗ − exp(xTiI β̂∗I ))/exp(xTiI β̂∗I ).
Note that for (Y , X I ), the formulas are the same with the asterisks removed
and µiI = exp(xTiI β I ).
The nonparametric bootstrap samples cases (Yi , xi ) with replacement to form (Y ∗j , X ∗j ), and regresses Y ∗j on X ∗j to get β̂∗j for j = 1, ..., B. The
nonparametric bootstrap can be useful even if heteroscedasticity or overdis-
persion is present, if the cases are an iid sample from some population, a very
strong assumption.
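A minimal R sketch of the parametric bootstrap for a Poisson regression, assuming a hypothetical fitted full model out <- glm(y ~ ., family = poisson, data = dat); percentile intervals are shown in place of the shorth intervals used in the text:

B <- 1000
X <- model.matrix(out)
mu <- fitted(out)
betas <- matrix(NA, B, ncol(X))
for(j in 1:B){
  ystar <- rpois(length(mu), mu)                         # Y*_j with Y*_i drawn from D(x_i^T betahat)
  betas[j,] <- coef(glm(ystar ~ X[,-1], family = poisson))
}
apply(betas, 2, quantile, probs = c(0.025, 0.975))       # bootstrap CIs for the beta_i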

10.9.2 Bootstrapping Variable Selection

Consider testing H0 : θ = θ 0 versus H1 : θ ≠ θ 0 where θ is g × 1. Let the
variable selection estimator Tn = Aβ̂ Imin ,0 with θ = Aβ. Recall Tn is equal
to the estimator Tjn with probability πjn for j = 1, ..., J. Here A is a known
full rank g × p matrix with 1 ≤ g ≤ p. We have √n(Tn − θ) →D v by (4.6) where E(v) = 0, and Σ v = Σ_j πj A V j,0 AT . Hence the geometric argument of
Theorem 4.5 holds: if we had iid data T1 , ..., TB , then the prediction region
applied to the iid data and centered at a randomly chosen Tn would be a
large sample confidence region for θ.
Next use the argument for multiple linear regression in Section 4.6.4. For the bootstrap, suppose that Ti∗ is equal to Tij∗ with probability ρjn for j = 1, ..., J where Σ_j ρjn = 1, and ρjn → πj as n → ∞. Let Bjn count the
number of times Ti∗ = Tij∗ in the bootstrap sample. Then the bootstrap
sample T1∗ , ..., TB∗ can be written as

T∗1,1 , ..., T∗B1n ,1 , ..., T∗1,J , ..., T∗BJn ,J

where the Bjn follow a multinomial distribution and Bjn /B →P ρjn as B → ∞. Denote T∗1,j , ..., T∗Bjn ,j as the jth bootstrap component of the bootstrap sample with sample mean T̄∗j and sample covariance matrix S ∗T ,j . Then

T̄∗ = (1/B) Σ_{i=1}^B Ti∗ = Σ_j (Bjn /B) (1/Bjn ) Σ_{i=1}^{Bjn} Tij∗ = Σ_j ρ̂jn T̄∗j .

Similarly, we can define the jth component of the iid sample T1 , ..., TB to
have sample mean T j and sample covariance matrix S T ,j .
Suppose the jth component of an iid sample T1 , ..., TB and the jth compo-
nent of the bootstrap sample T1∗ , ..., TB∗ have the same variability asymptot-
ically. Since E(Tjn ) ≈ θ, each component of the iid sample is approximately centered at θ. The bootstrap components are centered at E(T∗jn ), and often E(T∗jn ) = Tjn . Geometrically, separating the component clouds so that they
are no longer centered at one value makes the overall data cloud larger. Thus
the variability of Tn∗ is larger than that of Tn for a mixture distribution,
asymptotically. Hence the prediction region applied to the bootstrap sample
is slightly larger than the prediction region applied to the iid sample, asymp-
2 2
totically (we want n ≥ 20p). Hence cutoff D̂1,1−δ = D(U B)
gives coverage
close to or higher than the nominal coverage for confidence regions (4.32)
and (4.34), using the geometric argument. The deviation Ti∗ − Tn tends to

be larger in magnitude than the deviation and Ti∗ − T . Hence the cutoff
2 2 2
D̂2,1−δ = D(U B ,T )
tends to be larger than D(U B)
, and region (4.33) tends to
have higher coverage than region (4.34) for a mixture distribution.
The full model should be checked with the response plot before do-
ing variable selection inference. Assume p is fixed and n ≥ 20p. Assume
P (S ⊆ Imin ) → 1 as n → ∞, and that S ⊆ Ij . For multiple linear re-
gression with the residual bootstrap that uses residuals from the full OLS
model, Chapter 4 showed that the components of the iid sample and boot-
strap sample have the same variability asymptotically. The components of the
iid sample are centered at Aβ while the components of the bootstrap sample
are centered at Aβ̂Ij ,0 . Now consider regression models with Y x|xT β.
√ D
Assume nA(β̂ Ij ,0 − β) → Naj (0, Σ j ) where Σ j = AV j,0 AT . For the non-
√ ∗ D
parametric bootstrap, assume n(Aβ̂Ij ,0 − Aβ̂ Ij ,0 ) → Naj (0, Σ j ). Then the
components of the iid sample and bootstrap sample have the same variability
asymptotically. The components of iid sample are centered at Aβ while the
components of the bootstrap sample are centered at Aβ̂Ij ,0 . For the nonpara-
√ D
metric bootstrap, the above results tend to hold if n(β̂ − β) → Np (0, V )
√ ∗ D
and if n(β̂ − β̂) → Np (0, V ). Assumptions for the nonparametric boot-
strap tend to be rather strong: often one assumption is that the n cases
(Yi , xTi )T are iid from some population. See Shao and Tu (1995, pp. 335-349)
for the nonparametric bootstrap for GLMs, nonlinear regression, and Cox’s
proportional hazards regression. Also see Burr (1994), Efron and Tibshirani
(1993), Freedman (1981), and Tibshirani (1997).
For the parametric bootstrap, Section 10.9.1 showed that under regularity conditions, Cov(β̂∗I ) − Cov(β̂ I ) → 0 as n, B → ∞ if S ⊆ I. Hence
Cov(T∗jn ) − Cov(Tjn ) → 0 as n, B → ∞ if S ⊆ I. Here Tn = Aβ̂ Imin ,0 , Tjn = Aβ̂ Ij ,0 , Tn∗ = Aβ̂∗Imin ,0 , and T∗jn = Aβ̂∗Ij ,0 . Then E(Tjn ) ≈ Aβ = θ while the E(T∗jn ) are more variable than the E(Tjn ) with E(T∗jn ) ≈ Aβ̂(Ij , 0),
roughly, where β̂(Ij , 0) is formed from β̂(Ij ) by adding zeros corresponding
to variables not in Ij . Hence the jth component of an iid sample T1 , ..., TB
and the jth component of the bootstrap sample T1∗ , ..., TB∗ have the same
variability asymptotically.
In simulations for n ≥ 20p for H0 : Aβ S = θ 0 , the coverage tended to
get close to 1 − δ for B ≥ max(200, 50p) so that S ∗T is a good estimator of
Cov(T ∗ ). In the simulations where S is not the full model, inference with
backward elimination with Imin using AIC was often more precise than in-
ference with the full model if n ≥ 20p and B ≥ 50p. It is possible that S ∗T is
singular if a column of the bootstrap sample is equal to 0. If the regression
model has a q × 1 vector of parameters γ, we may need to replace p by p + q.
Undercoverage can occur if the bootstrap sample data cloud is less variable
than the iid data cloud, e.g., if (n − p)/n is not close to one. Coverage can be
higher than the nominal coverage for two reasons: i) the bootstrap data cloud
is more variable than the iid data cloud of T1 , ..., TB, and ii) zero padding.
To see the effect of zero padding, consider H0 : Aβ = β O = 0 where
βO = (βi1 , ...., βig )T and O ⊆ E in (4.1) so that H0 is true. Suppose a
nominal 95% confidence region is used and UB is the 96th percentile. Hence
the confidence region (4.32) or (4.33) covers at least 96% of the bootstrap
sample. If β̂∗O,j = 0 for more than 4% of the β̂∗O,1 , ..., β̂∗O,B , then 0 is in the
confidence region and the bootstrap test fails to reject H0 . If this occurs for
each run in the simulation, then the observed coverage will be 100%.

Now suppose β̂∗O,j = 0 for j = 1, ..., B. Then S ∗T is singular, but the
singleton set {0} is the large sample 100(1 − δ)% confidence region (4.32),
(4.33), or (4.34) for β O and δ ∈ (0, 1), and the pvalue for H0 : βO = 0 is

one. (This result holds since {0} contains 100% of the β̂∗O,j in the bootstrap
sample.) For large sample theory tests, the pvalue estimates the population
pvalue. Let I denote the other predictors in the model so β = (β TI , βTO )T . For
the Imin model from variable selection, there may be strong evidence that xO
is not needed in the model given xI is in the model if the “100%” confidence
region is {0}, n ≥ 20p, and B ≥ 50p. (Since the pvalue is one, this technique
may be useful for data snooping: applying MLE theory to submodel I may
have negligible selection bias.)
Remark 10.3. As in Chapter 4, another way to look at the bootstrap con-
fidence region for variable selection estimators is to consider the estimator
T2,n that chooses Ij with probability equal to the observed bootstrap propor-
tion ρ̂jn . The bootstrap sample T1∗ , ..., TB∗ tends to be slightly more variable
than an iid sample T2,1 , ..., T2,B, and the geometric argument suggests that
the large sample coverage of the nominal 100(1 − δ)% confidence region will
be at least as large as the nominal coverage 100(1 − δ)%.

10.9.3 Examples and Simulations

Pelawa Watagoda and Olive (2019a) have an example and simulations for
multiple linear regression using the residual bootstrap. See Chapter 4. We
will use Poisson and binomial regression.
Example 10.19. Lindenmayer et al. (1991) and Cook and Weisberg (1999,
p. 533) give a data set with 151 cases where Y is the number of possum
species found in a tract of land in Australia. The predictors are acacia=basal
area of acacia + 1, bark=bark index, habitat=habitat score, shrubs=number
of shrubs + 1, stags= number of hollow trees + 1, stumps=indicator for
presence of stumps, and a constant. Inference for the full Poisson regression
model is shown along with the shorth(c) nominal 95% confidence intervals for
βi computed using the parametric bootstrap with B = 1000. As expected, the
bootstrap intervals are close to the large sample GLM confidence intervals
≈ β̂i ± 2SE(β̂i ).
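A minimal sketch of this full model parametric bootstrap is given below, where y and x are hypothetical stand-ins for the possum count response and the matrix of nontrivial predictors; the shorth(c) intervals are then computed from the columns of betas.
#parametric bootstrap for the full Poisson regression model (sketch)
out <- glm(y ~ x, family = poisson)
B <- 1000
betas <- matrix(NA, nrow = B, ncol = length(coef(out)))
mu <- fitted(out)   #exp(x_i^T betahat)
for(b in 1:B){
  ystar <- rpois(length(mu), lambda = mu)   #responses generated from the fitted model
  betas[b,] <- coef(glm(ystar ~ x, family = poisson))
}
#apply the shorth(c) interval to betas[,i] to get the bootstrap CI for beta_i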
The minimum AIC model from backward elimination used a constant,
bark, habitat, and stags. The shorth(c) nominal 95% confidence intervals for
βi using the parametric bootstrap are shown. Note that most of the confidence
intervals contain 0 when closed intervals are used instead of open intervals.
The Poisson regression output is also shown, but should only be used for
inference if the model was selected before looking at the data.
large sample full model inference
Est. SE z Pr(>|z|) 95% shorth CI
int -1.0428 0.2480 -4.205 0.0000 [-1.562,-0.538]
acacia 0.0166 0.0103 1.612 0.1070 [-0.004, 0.035]
bark 0.0361 0.0140 2.579 0.0099 [ 0.007, 0.065]
habitat 0.0762 0.0375 2.032 0.0422 [-0.003, 0.144]
shrubs 0.0145 0.0205 0.707 0.4798 [-0.028, 0.056]
stags 0.0325 0.0103 3.161 0.0016 [ 0.013, 0.054]
stumps -0.3907 0.2866 -1.364 0.1727 [-1.010, 0.171]
output and shorth intervals for the min AIC submodel
Est. SE z Pr(>|z|) 95% shorth CI
int -0.8994 0.2135 -4.212 0.0000 [-1.438,-0.428]
acacia 0 [ 0.000, 0.037]
bark 0.0336 0.0121 2.773 0.0056 [ 0.000, 0.060]
habitat 0.1069 0.0297 3.603 0.0003 [ 0.000, 0.156]
shrubs 0 [ 0.000, 0.060]
stags 0.0302 0.0094 3.210 0.0013 [ 0.000, 0.054]
stumps 0 [-0.970, 0.000]
We tested H0 : β2 = β5 = β7 = 0 with the Imin model selected by
backward elimination. (Of course this test would be easy to do with the
full model using GLM theory.) Then H0 : Aβ = (β2 , β5 , β7 )T = 0. Using
the prediction region method with the full model had [0, D(UB ) ] = [0, 2.836] with D0 = 2.135. Note that √χ²_{3,0.95} = 2.795. So fail to reject H0 . Using
the prediction region method with the Imin backward elimination model had
[0, D(UB ) ] = [0, 2.804] while D0 = 1.269. So fail to reject H0 . The ratio of
the volumes of the bootstrap confidence regions for this test was 0.322. (Use
(3.35) with S ∗T and D from backward elimination for the numerator, and
from the full model for the denominator.) Hence the backward elimination
bootstrap test was more precise than the full model bootstrap test.
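The prediction region computations reported above can be sketched as follows, where bmat is a hypothetical B × g matrix whose rows are the bootstrap values of Aβ̂, and the simple 95th percentile cutoff below omits the small sample correction factor used in Chapter 4.
#prediction region method test of H0: A beta = 0 (sketch)
B <- nrow(bmat); g <- ncol(bmat)
Tbar <- colMeans(bmat)    #bootstrap sample mean
ST <- cov(bmat)           #S*_T
D2 <- mahalanobis(bmat, center = Tbar, cov = ST)
DUB <- sqrt(sort(D2)[ceiling(0.95*B)])   #roughly D_(UB)
D0 <- sqrt(mahalanobis(rep(0, g), center = Tbar, cov = ST))
D0 <= DUB   #TRUE means fail to reject H0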

Example 10.20. For binary logistic regression, the MLE tends to converge
if max(|xTi β̂|) ≤ 7 and if the Y values of 0 and 1 are not nearly perfectly
classified by the rule Ŷ = 1 if xTi β̂ > 0.5 and Ŷ = 0, otherwise. If there
is perfect classification, the MLE does not exist. Let ρ̂(x) = P̂ (Y = 1|x)
under the binary logistic regression. If |xTi β̂| ≥ 10, some of the ρ̂(xi ) tend
to be estimated to be exactly equal to 0 or 1, which causes problems for
the MLE. The Flury and Riedwyl (1988, pp. 5-6) banknote data consists of
100 counterfeit and 100 genuine Swiss banknotes. The response variable is
an indicator for whether the banknote is counterfeit. The six predictors are
measurements on the banknote: bottom, diagonal, left, length, right, and top.
When the logistic regression model is fit with these predictors and a constant,
there is almost perfect classification and backward elimination had problems.
We deleted diagonal, which is likely an important predictor, so backward
elimination would run. For this full model, classification is very good, but
the xTi β̂ run from −20 to 20. In a plot of xTi β̂ versus Y on the vertical axis
(not shown), the logistic regression mean function is tracked closely by the
lowess scatterplot smoother. The full model and backward elimination output
is below. Inference using the logistic regression normal approximation appears
to greatly underestimate the variability of β̂ compared to the parametric full
model bootstrap variability. We tested H0 : β2 = β3 = β4 = 0 with the Imin
model selected by backward elimination. Using the prediction region method
with the full model had [0, D(UB ) ] = [0, 1.763] with D0 = 0.2046. Note that √χ²_{3,0.95} = 2.795. So fail to reject H0 . Using the prediction region method
with the Imin backward elimination model had [0, D(UB ) ] = [0, 1.511] while
D0 = 0.2297. So fail to reject H0 . The ratio of the volumes of the bootstrap
confidence regions for this test was 16.2747. Hence the full model bootstrap
inference was much more precise. Backward elimination produced many zeros,
but also produced many estimates that were very large in magnitude.
large sample full model inference
Est. SE z Pr(>|z|) 95% shorth CI
int -475.581 404.913 -1.175 0.240 [-83274.99,1939.72]
length 0.375 1.418 0.265 0.791 [ -98.902,137.589]
left -1.531 4.080 -0.375 0.708 [ -364.814,611.688]
right 3.628 3.285 1.104 0.270 [ -261.034,465.675]
bottom 5.239 1.872 2.798 0.005 [ 3.159,567.427]
top 6.996 2.181 3.207 0.001 [ 4.137,666.010]

output and shorth intervals for the min AIC submodel


Est. SE z Pr(>|z|) 95% shorth CI
int -472.999 269.271 -1.757 0.079 [-168131.6,35623.9]
length 0 [ -110.850,286.265]
left 0 [ -752.695,724.702]
right 2.725 2.050 1.329 0.184 [-656.1549,906.136]
bottom 5.005 1.657 3.020 0.003 [ 2.985,1428.346]
top 6.821 2.071 3.294 0.001 [ 4.333,1957.107]
Binary regression data sets like the one in Example 10.20 are common:
the response plot of xTi β̂ versus Y suggests that the logistic regression mean
function is good, but the range of xTi β̂ is such that the GLM normal ap-
proximation to the MLE β̂ is likely invalid. Since the parametric bootstrap
produces datasets very similar to the actual dataset, the bootstrap distri-
bution of the logistic regression MLE may be superior to the GLM normal
approximation. For Example 10.20, the GLM and bootstrap inference for the
full model both suggest that bottom and top are important predictors.

The results of the following simulation are similar to those of Chapter 4


for multiple linear regression using the residual bootstrap with residuals from
the OLS full model. This simulation was for Poisson regression and binomial
regression, using B = max(200, n/10, 50p) and 5000 runs. The simulation

used p = 4, 6, 7, 8, and 10; n = 25p, n = 50p; ψ = 0, 1/√p, and 0.9; and
k = 1 and p − 2 where k and ψ are defined in the following paragraph. A
larger simulation study is in Rathnayake (2019). In the simulations, we used
θ = Aβ = βi , θ = Aβ = βS = (β1 , 1, ..., 1)T and θ = Aβ = βE = 0.
Let x = (1, uT )T where u is the (p − 1) × 1 vector of nontrivial predictors. In the simulations, for i = 1, ..., n, we generated wi ∼ Np−1 (0, I) where the q = p − 1 elements of the vector wi are iid N(0,1). Let the q × q matrix A = (aij ) with aii = 1 and aij = ψ where 0 ≤ ψ < 1 for i ≠ j. Then the vector zi = Awi so that Cov(zi ) = Σ z = AAT = (σij ) where the diagonal entries σii = [1 + (q − 1)ψ2 ] and the off diagonal entries σij = [2ψ + (q − 2)ψ2 ]. Hence the correlations are cor(zi , zj ) = ρ = (2ψ + (q − 2)ψ2 )/(1 + (q − 1)ψ2 ) for i ≠ j. Then Σ_{j=1}^k zj ∼ N (0, kσii + k(k − 1)σij ) = N (0, v2 ). Let u = az/v. Then cor(xi , xj ) = ρ for i ≠ j where xi and xj are nontrivial predictors. If ψ = 1/√(cp), then ρ → 1/(c + 1) as p → ∞ where c > 0. As ψ gets close to 1, the predictor vectors ui cluster about the line in the direction of (1, ..., 1)T . Let SP = xT β = β1 + 1xi,2 + · · · + 1xi,k+1 ∼ N (β1 , a2 ) for i = 1, ..., n. Hence β = (β1 , 1, ..., 1, 0, ..., 0)T with β1 , k ones, and p − k − 1 zeros. Binomial regression used β1 = 0, a = 5/3, and mi = m with m = 1 or 20. Poisson regression used β1 = 1 = a and β1 = 5 with a = 2.
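The following sketch generates the predictors and sufficient predictor just described; simx is a hypothetical helper, not a linmodpack function.
#generate x = (1, u^T)^T with cor(x_i, x_j) = rho and SP ~ N(beta1, a^2) (sketch)
simx <- function(n, p, psi, k, a){
  q <- p - 1
  A <- matrix(psi, q, q); diag(A) <- 1
  w <- matrix(rnorm(n*q), n, q)     #rows w_i are iid N_{p-1}(0, I)
  z <- w %*% t(A)                   #rows z_i = A w_i
  sii <- 1 + (q-1)*psi^2
  sij <- 2*psi + (q-2)*psi^2
  v <- sqrt(k*sii + k*(k-1)*sij)
  cbind(1, a*z/v)                   #u = a z/v
}
n <- 100; p <- 4; psi <- 0.9; k <- 1; a <- 5/3; beta1 <- 0
x <- simx(n, p, psi, k, a)
SP <- beta1 + rowSums(x[, 2:(k+1), drop = FALSE])        #SP ~ N(beta1, a^2)
y <- rbinom(n, size = 1, prob = exp(SP)/(1 + exp(SP)))   #binomial case with m = 1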
The simulation computed the Frey shorth(c) interval for each βi and used
bootstrap confidence regions to test H0 : β S = (β1 , 1, ..., 1)T where β2 =
· · · = βk+1 = 1, and H0 : β E = 0 (whether the last p − k − 1 βi = 0). The
nominal coverage was 0.95 with δ = 0.05. Observed coverage between 0.94

and 0.96 would suggest coverage is close to the nominal value. The parametric
bootstrap was used with AIC.
In the tables, there are two rows for each model giving the observed confi-
dence interval coverages and average lengths of the confidence intervals. The
term “reg” is for the full model regression, and the term “vs” is for backward
elimination. The last six columns give results for the tests. The terms pr,
hyb, and br are for the prediction region method (4.32), hybrid region (4.34),
and Bickel and Ren region (4.33). The 0 indicates the test was H0 : β E = 0,
while the 1 indicates that the test was H0 : βS = (β1 , 1, ..., 1)T . The length and coverage = P(fail to reject H0 ) for the interval [0, D(UB ) ] or [0, D(UB ,T ) ] where D(UB ) or D(UB ,T ) is the cutoff for the confidence region. The cutoff will often be near √χ²_{g,0.95} if the statistic T is asymptotically normal. Note that √χ²_{2,0.95} = 2.448 is close to 2.45 for the full model regression bootstrap tests for β S if k = 1.
Volume ratios of the three confidence regions can be compared using (4.35),
but there is not enough information in the tables to compare the volume of
the confidence region for the full model regression versus that for the variable
selection regression since the two methods have different determinants |S ∗T |.
The inference for backward elimination was often as precise or more precise
than the inference for the full model. The coverages tended to be near 0.95
for the parametric bootstrap on the full model. Variable selection coverage
tended to be near 0.95 unless the β̂i could equal 0. An exception was binary
logistic regression with m = 1 where variable selection and the full model
often had higher coverage than the nominal 0.95 for the hypothesis tests,
especially for n = 25p. Compare Tables 10.2 and 10.3. For binary regression,
the bootstrap confidence regions using smaller a and larger n resulted in
coverages closer to 0.95 for the full model, and convergence problems caused
the programs to fail for a > 4. The Bickel and Ren (4.33) average cutoffs
were at least as high as those of the hybrid region (4.34).
If βi was a component of βE , then the backward elimination confidence
intervals had higher coverage but were shorter than those of the full model
due to zero padding. The zeros in β̂E tend to result in higher than nominal
coverage for the variable selection estimator, but can greatly decrease the
volume of the confidence region compared to that of the full model.
For the simulated data, when ψ = 0, the asymptotic covariance matrix
I −1 (β) is diagonal. Hence β̂ S has the same multivariate normal limiting
distribution for Imin and the full model by Remark 4.4. For Tables 10.2-
10.5, β S = (β1 , β2 )T , and βp−1 and βp are components of β E . For Table
10.6, β S = (β1 , ..., β9)T . Hence β1 , β2 , and βp−1 are components of β S , while
βE = β10 . For the n in the tables and ψ = 0, the coverages and “lengths”
did tend to be close for the βi that are components of β S , and for pr1, hyb1,
and br1.

Table 10.2 Bootstrapping Binomial Logistic Regression, Backward Elimination with AIC, B = 200, n = 100, p = 4, k = 1, and m = 1

ψ β1 β2 βp−1 βp pr0 hyb0 br0 pr1 hyb1 br1


reg,0 0.9516 0.9328 0.9524 0.9504 0.9724 0.9872 0.9920 0.9802 0.9838 0.9888
len 1.1605 1.0953 0.7171 0.7151 2.5225 2.5225 2.5476 2.5173 2.5173 2.6893
vs,0 0.9564 0.9322 0.9976 0.9976 0.9960 0.9964 0.9988 0.9774 0.9794 0.9948
len 1.1483 1.0798 0.6143 0.6204 2.7329 2.7329 3.0386 2.5160 2.5160 2.6899
reg,0.5 0.9538 0.9428 0.9440 0.9544 0.9680 0.9854 0.9896 0.9724 0.9828 0.9858
len 1.1622 1.6737 1.4547 1.4588 2.5221 2.5221 2.5475 2.5165 2.5165 2.6037
vs,0.5 0.9528 0.9662 0.9978 0.9982 0.9948 0.9918 0.9978 0.9760 0.9756 0.9872
len 1.1462 1.6714 1.2879 1.2883 2.7230 2.7230 3.0170 2.5379 2.5379 2.6860
reg,0.9 0.9662 0.9578 0.9520 0.9500 0.9690 0.9846 0.9884 0.9724 0.9848 0.9876
len 1.1606 9.4523 9.4241 9.4379 2.5220 2.5220 2.5454 2.5142 2.5142 2.5389
vs,0.9 0.9566 0.9422 0.9960 0.9974 0.9958 0.9972 0.9982 0.9866 0.9932 0.9956
len 1.1502 8.4654 8.4806 8.4951 2.7700 2.7700 3.0182 2.6176 2.6176 2.7644

Table 10.3 Bootstrapping Binomial Logistic Regression, Backward Elimination with AIC, B = 200, n = 200, p = 4, k = 1, and m = 1

ψ β1 β2 βp−1 βp pr0 hyb0 br0 pr1 hyb1 br1


reg,0 0.9504 0.9440 0.9552 0.9544 0.9584 0.9662 0.9674 0.9580 0.9662 0.9728
len 0.7539 0.6771 0.4583 0.4587 2.4884 2.4884 2.4992 2.4846 2.4846 2.5745
vs,0 0.9552 0.9490 0.9986 0.9978 0.9954 0.9908 0.9968 0.9600 0.9698 0.9762
len 0.7510 0.6736 0.3909 0.3926 2.7226 2.7226 3.0310 2.4814 2.4814 2.5740
reg,0.5 0.9538 0.9508 0.9550 0.9578 0.9590 0.9686 0.9690 0.9578 0.9658 0.9714
len 0.7548 1.0543 0.9337 0.9309 2.4858 2.4858 2.4958 2.4828 2.4828 2.5266
vs,0.5 0.9538 0.9602 0.9984 0.9974 0.9930 0.9922 0.9958 0.9708 0.9786 0.9828
len 0.7501 1.0607 0.8064 0.8047 2.7022 2.7023 2.9948 2.5004 2.5004 2.6164
reg,0.9 0.9462 0.9536 0.9522 0.9496 0.9548 0.9642 0.9658 0.9496 0.9610 0.9626
len 0.7546 6.0844 6.0691 6.0800 2.4888 2.4888 2.4990 2.4860 2.4860 2.4967
vs,0.9 0.9562 0.9520 0.9958 0.9954 0.9936 0.9922 0.9968 0.9822 0.9870 0.9896
len 0.7502 5.3338 5.3737 5.3847 2.7934 2.7934 3.0392 2.5873 2.5873 2.7225

Table 10.4 Bootstrapping Binomial Logistic Regression, Backward Elimination with AIC, B = 500, n = 250, p = 10, k = 1, and m = 20

ψ β1 β2 βp−1 βp pr0 hyb0 br0 pr1 hyb1 br1


reg,0 0.9576 0.9502 0.9520 0.9548 0.9500 0.9528 0.9530 0.9480 0.9496 0.9502
len 0.1428 0.1232 0.0860 0.0860 3.9837 3.9837 3.9876 2.4538 2.4538 2.4653
vs,0 0.9510 0.9510 0.9992 0.9978 0.9980 0.9982 0.9998 0.9412 0.9458 0.9478
len 0.1424 0.1229 0.0706 0.0707 4.3081 4.3081 4.7454 2.4531 2.4531 2.4747
reg,0.32 0.9536 0.9534 0.9514 0.9548 0.9496 0.9524 0.9530 0.9474 0.9490 0.9506
len 0.1426 0.1833 0.1609 0.1610 3.9840 3.9840 3.9884 2.4528 2.4528 2.4589
vs,0.32 0.9534 0.9620 0.9966 0.9976 0.9968 0.9976 0.9988 0.9534 0.9544 0.9582
len 0.1424 0.1837 0.1347 0.1352 4.2607 4.2607 4.6891 2.4527 2.4527 2.5042
reg,0.9 0.9514 0.9432 0.9552 0.9498 0.9434 0.9448 0.9446 0.9430 0.9440 0.9450
len 0.1427 2.2178 2.2170 2.2175 3.9846 3.9846 3.9887 2.4530 2.4530 2.4553
vs,0.9 0.9590 0.9656 0.9982 0.9986 0.9982 0.9978 0.9996 0.9532 0.9478 0.9654
len 0.1425 2.0342 1.8778 1.8862 4.2368 4.2368 4.6742 2.4449 2.4449 2.5661

Table 10.5 Bootstrapping Poisson Regression, Backward Elimination with AIC, B = 500, n = 250, p = 10, k = 1, a = 1, β1 = 1

ψ β1 β2 βp−1 βp pr0 hyb0 br0 pr1 hyb1 br1


reg,0 0.9480 0.9526 0.9526 0.9520 0.9502 0.9512 0.9524 0.9432 0.9454 0.9472
len 0.1752 0.1325 0.1275 0.1276 3.9859 3.9859 3.9901 2.4528 2.4528 2.4740
vs,0 0.9552 0.9574 0.9982 0.9982 0.9984 0.9982 0.9998 0.9524 0.9574 0.9628
len 0.1752 0.1323 0.1051 0.1047 4.3004 4.3004 4.7408 2.4543 2.4543 2.5009
reg,0.32 0.9552 0.9518 0.9520 0.9536 0.9538 0.9536 0.9538 0.9510 0.9532 0.9552
len 0.1752 0.2419 0.2390 0.2386 3.9852 3.9852 3.9894 2.4518 2.4518 2.4689
vs,0.32 0.9562 0.9632 0.9986 0.9992 0.9980 0.9982 0.9992 0.9630 0.9644 0.9712
len 0.1750 0.2419 0.2005 0.2004 4.2618 4.2618 4.6811 2.4520 2.4520 2.5384
reg,0.9 0.9478 0.9530 0.9570 0.9554 0.9458 0.9478 0.9484 0.9448 0.9448 0.9476
len 0.1754 3.2873 3.2859 3.2912 3.9831 3.9831 3.9872 2.4536 2.4536 2.4691
vs,0.9 0.9500 0.9574 0.9984 0.9994 0.9970 0.9966 0.9984 0.9638 0.9626 0.9742
len 0.1752 2.8710 2.7922 2.7879 4.2597 4.2597 4.6886 2.4809 2.4809 2.6402

Table 10.6 Bootstrapping Poisson Regression, Backward Elimination with AIC, B = 500, n = 250, p = 10, k = 8, a = 2, β1 = 5

ψ β1 β2 βp−1 βp pr0 hyb0 br0 pr1 hyb1 br1


reg,0 0.9522 0.9468 0.9540 0.9518 0.9496 0.9492 0.9488 0.9474 0.9464 0.9478
len 0.0210 0.0146 0.0146 0.0142 1.9593 1.9593 1.9609 4.1633 4.1633 4.1675
vs,0 0.9544 0.9546 0.9518 0.9980 0.9966 0.9374 0.9966 0.9534 0.9524 0.9552
len 0.0210 0.0146 0.0146 0.0117 2.1470 2.1470 2.3955 4.1655 4.1655 4.1880
reg,0.32 0.9522 0.9510 0.9486 0.9540 0.9494 0.9504 0.9516 0.9460 0.9468 0.9472
len 0.0210 0.0664 0.0664 0.0663 1.9595 1.9595 1.9614 4.1636 4.1636 4.1684
vs,0.32 0.9508 0.9596 0.9496 0.9992 0.9986 0.9434 0.9986 0.9634 0.9646 0.9696
len 0.0210 0.0663 0.0662 0.0541 2.1434 2.1434 2.3960 4.1970 4.1970 4.2703
reg,0.9 0.9536 0.9580 0.9550 0.9584 0.9538 0.9538 0.9548 0.9496 0.9512 0.9524
len 0.0210 1.0357 1.0361 1.0336 1.9585 1.9585 1.9605 4.1603 4.1603 4.1643
vs,0.9 0.9486 0.9484 0.9492 0.9988 0.9982 0.9492 0.9982 0.9688 0.9546 0.9676
len 0.0212 1.0742 1.0745 0.8793 2.1387 2.1387 2.3860 4.2883 4.2883 4.3818

10.10 Prediction Intervals

We use two prediction intervals from Olive et al. (2019). The first predic-
tion interval for Yf applies the shorth prediction interval of Section 4.3 to
the parametric bootstrap sample Y1∗ , ..., YB∗ where the Yi∗ are iid from the
distribution D(ĥ(xf ), γ̂). If the regression method produces a consistent es-
timator (ĥ(x), γ̂) of (h(x), γ), then this new prediction interval is a large
sample 100(1 − δ)% PI that is a consistent estimator of the shortest popula-
tion interval [L, U ] that contains at least 1 − δ of the mass as B, n → ∞. The
new large sample 100(1 − δ)% PI using Y1∗ , ..., YB∗ uses the shorth(c) PI with
p
c = min(B, dB[1 − δ + 1.12 δ/B ] e). (10.13)
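A sketch of PI (10.13) for a Poisson regression working model is given below; out is a hypothetical fitted glm object, xf holds the nontrivial predictor values for the new case, and D(ĥ(xf ), γ̂) is the Poisson distribution with mean exp(ESP).
#PI (10.13) via the parametric bootstrap and the shorth(c) interval (sketch)
B <- 1000; delta <- 0.05
espf <- sum(c(1, xf) * coef(out))            #estimated linear predictor at xf
ystar <- sort(rpois(B, lambda = exp(espf)))  #Y*_1,...,Y*_B
cc <- min(B, ceiling(B*(1 - delta + 1.12*sqrt(delta/B))))
lens <- ystar[cc:B] - ystar[1:(B - cc + 1)]  #lengths of the candidate intervals
j <- which.min(lens)
c(ystar[j], ystar[j + cc - 1])               #shorth(c) PI for Yf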

For models with a linear predictor xT β, we will want prediction intervals


after variable selection or model selection. Refer to Equation (4.1) and Section
10.6.1. Forward selection or backward elimination with the Akaike (1973) AIC
criterion or Schwarz (1978) BIC criterion are often used for GLM variable
selection. The Chen and Chen (2008) EBIC criterion can be useful, especially
if n/p is not large. GLM model selection with lasso and the elastic net is
also common. See Hastie et al. (2015, ch. 3), Tibshirani (1996), Friedman et
al. (2007), and Friedman et al. (2010). Relaxed lasso applies the regression
method, such as a GLM, to the active predictors with nonzero coefficients
selected by lasso. For n ≥ 10p, Olive and Hawkins (2005) suggested using
multiple linear regression variable selection software with the Mallows (1973)
Cp criterion to get a subset I, then fit the GLM using Y and xI . If the
regression model contains a q × 1 vector of parameters γ, then we may need
n ≥ 10(p + q).
The prediction interval (10.13) can have undercoverage if n is small com-
pared to the number of estimated parameters. The modified shorth PI (10.14)
inflates PI (10.13) to compensate for parameter estimation and model selec-
tion. Let d be the number of variables x∗1 , ..., x∗d used by the full model, for-
ward selection, lasso, or relaxed lasso. (We could let d = j if j is the degrees
of freedom of the selected model if that model was chosen in advance without
model or variable selection. Hence d = j is not the model degrees of freedom
if model selection was used. For a GAM full model, suppose the “degrees of freedom” di for S(xi ) is bounded by k. We could let d = 1 + Σ_{i=2}^p di with p ≤ d ≤ pk.) We want n ≥ 10d, and the prediction interval length will be
increased (penalized) if n/d is not large. Let qn = min(1−δ+0.05, 1−δ+d/n)
for δ > 0.1 and

qn = min(1 − δ/2, 1 − δ + 10δd/n), otherwise.

If 1 − δ < 0.999 and qn < 1 − δ + 0.001, set qn = 1 − δ. Then compute the


shorth PI with

cmod = min(B, ⌈B[qn + 1.12√(δ/B)]⌉).     (10.14)
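The correction factor is simple to compute; a sketch follows, with cmodB a hypothetical helper name.
#number of bootstrap values covered by the modified shorth PI (10.14) (sketch)
cmodB <- function(B, delta, d, n){
  qn <- if(delta > 0.1) min(1 - delta + 0.05, 1 - delta + d/n) else
        min(1 - delta/2, 1 - delta + 10*delta*d/n)
  if(1 - delta < 0.999 && qn < 1 - delta + 0.001) qn <- 1 - delta
  min(B, ceiling(B*(qn + 1.12*sqrt(delta/B))))
}
cmodB(B = 1000, delta = 0.05, d = 20, n = 200)  #larger than c from (10.13) when n/d is small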

Olive (2007, 2018) and Pelawa Watagoda and Olive (2019b) used similar
correction factors since the maximum simulated undercoverage was about
0.05 when n = 20d. If a q × 1 vector of parameters γ is also estimated, we
may need to replace d by dq = d + q.
If β̂ I is a×1, form the p×1 vector β̂ I,0 from β̂ I by adding 0s corresponding
to the omitted variables. For example, if p = 4 and β̂ Imin = (β̂1 , β̂3 )T is the
estimator that minimized the variable selection criterion, then β̂ Imin ,0 =
(β̂1 , 0, β̂3 , 0)T .
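Zero padding simply scatters the submodel coefficients into a p × 1 vector of zeros; a sketch:
#zero padding of a submodel estimator (sketch); I holds the retained indices
padzeros <- function(bhatI, I, p){ b <- rep(0, p); b[I] <- bhatI; b }
padzeros(c(1.2, -0.7), I = c(1, 3), p = 4)  #gives (1.2, 0, -0.7, 0)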
Hong et al. (2018) explain why classical PIs after AIC variable selection
may not work. Fix p and let Imin correspond to the predictors used after
variable selection, including AIC, BIC, and relaxed lasso. Suppose P (S ⊆

Imin ) → 1 as n → ∞. See Charkhi and Claeskens (2018), Claeskens and


Hjort (2008, pp. 70, 101, 102, 114, 232), Hastie et al. (2015, pp. 295-302)
and Haughton (1988, 1989) for more information and references about this
assumption. For relaxed lasso, the assumption holds if lasso is a √n consistent estimator. Suppose model (4.1) holds, and that if S ⊆ Ij , then √n(β̂ Ij − β Ij ) →D Naj (0, V j ). Hence

√n(β̂ Ij ,0 − β) →D Np (0, V j,0 )     (10.15)

where V j,0 adds columns and rows of zeros corresponding to the xi not in Ij . Then β̂ Imin ,0 is a √n consistent estimator of β under model (4.1) if the variable selection criterion is used with forward selection, backward elimination, or all subsets. Hence (10.13) and (10.14) are large sample PIs. Rathnayake and Olive (2019) gave the limiting distribution of √n(β̂ Imin ,0 − β), generalizing the Pelawa Watagoda and Olive (2019a) result for multiple
linear regression. Regularity conditions for (10.13) and (10.14) to be large
sample PIs when p > n are much stronger.
Prediction intervals (10.13) and (10.14) often have higher than the nominal
coverage if n is large and Yf can only take on a few values. Consider binary
regression where Yf ∈ {0, 1} and the PIs (10.13) and (10.14) are [0,1] with
100% coverage, [0,0], or [1,1]. If [0,0] or [1,1] is the PI, coverage tends to be
higher than nominal coverage unless P (Yf = 1|xf ) is near δ or 1 − δ, e.g., if
P (Yf = 1|xf ) = 0.01, then [0,0] has coverage near 99% even if 1 − δ < 0.99.

Example 10.21. For the Ceriodaphnia data of Example 10.4, Figure 10.17
shows the response plot of ESP versus Y for this data. In this plot, the lowess
curve is represented as a jagged curve to distinguish it from the estimated
Poisson regression mean function (the exponential curve). The horizontal line
corresponds to the sample mean Y . The circles correspond to the Yi and the
×’s to the PIs (10.13) with d = p = 3. The n large sample 95% PIs contained
97% of the Yi . There was no evidence of overdispersion: see Example 10.4.
There were 5 replications for each of the 14 strain–species combinations,
which helps show the bootstrap PI variability when B = 1000. This example
illustrates a useful goodness of fit diagnostic: if the model D is a useful
approximation for the data and n is large enough, we expect the coverage on
the training data to be close to or higher than the nominal coverage 1 − δ.
For example, there may be undercoverage if a Poisson regression model is
used when a negative binomial regression model is needed.

Example 10.22. For the banknote data of Example 10.20, after variable
selection, we decided to use a constant, right, and bottom as predictors. The
response plot for this submodel is shown in the left plot of Figure 10.18 with
Z = Zi = Yi /mi = Yi and the large sample 95% PIs for Zi = Yi . The circles
correspond to the Yi and the ×’s to the PIs (10.13) with d = 3, and 199 of the
200 PIs contain Yi . The PI [0,0] that did not contain Yi corresponds to the

Fig. 10.17 Ceriodaphnia Data Response Plot.

circle in the upper left corner. The PIs were [0,0], [0,1], or [1,1] since the data
is binary. The mean function is the smooth curve and the step function gives
the sample proportion of ones in the interval. The step function approximates
the smooth curve closely, hence the binary logistic regression model seems
reasonable. The right plot of Figure 10.18 shows the GAM using right and
bottom with d = 3. The coverage was 100% and the GAM had many [1,1]
intervals.

Example 10.23. For the species data of Examples 10.18, we used a con-
stant and log(endem), log(area), log(distance), and log(areanear). The re-
sponse plot looks good, but the OD plot (not shown) suggests overdispersion.
When the response plot for the Poisson regression model was made, the n
large sample 95% PIs (10.13) contained 89.7% of the Yi .

For the simulations, generating xT β is important. For example, for bino-


mial logistic regression, typically −5 ≤ xT β ≤ 5 or there can be problems
with the MLE. We used the same simulated data as that used for variable
selection in Section 10.9.3. Thus SP = xT β = β1 + 1xi,2 + · · · + 1xi,k+1 ∼
N (β1 , a2 ) for i = 1, ..., n. Hence β = (β1 , 1, ..., 1, 0, ..., 0)T with β1 , k ones and
p − k − 1 zeros. The default settings for Poisson regression use β1 = 1 = a.
The default settings for binomial regression use β1 = 0 and a = 5/3.
The simulation used 5000 runs, so an observed coverage in [0.94, 0.96]
gives no reason to doubt that the PI has the nominal coverage of 0.95. The

Fig. 10.18 Banknote Data GLM and GAM Response Plots.


simulation used B = 1000; p = 4, 50, n, or 2n; ψ = 0, 1/√p, or 0.9; and
k = 1, 19, or p − 1. The simulated data sets are rather small since the R
estimators are rather slow. For binomial and Poisson regression, we only
computed the GAM for p = 4 with SP = AP = α + S2 (x2 ) + S2 (x3 ) + S4 (x4 )
and d = p = 4. We only computed the full model GLM if n ≥ 5p. Lasso and
relaxed lasso were computed for all cases. The regression model was computed
from the training data, and a prediction interval was made for the test case
Yf given xf . The “length” and “coverage” were the average length and the
proportion of the 5000 prediction intervals that contained Yf . Two rows per
table were used to display these quantities.
Tables 10.7 to 10.9 show some simulation results for Poisson regression.
Lasso minimized 10-fold cross validation and relaxed lasso was applied to the
selected lasso model. The full GLM, full GAM and backward elimination (BE
in the tables) used PI (10.13) while lasso, relaxed lasso (RL in the tables),
and forward selection using the Olive and Hawkins (2005) method (OHFS
in the tables) used PI (10.14). For n ≥ 10p, coverages tended to be near
or higher than the nominal value of 0.95, except for lasso and the Olive and
Hawkins (2005) method in Tables 10.8 and 10.9. In Table 10.7, coverages were
high because the Poisson counts were small and the Poisson distribution is
discrete. In Table 10.8, the Poisson counts were not small, so the discreteness
of the distribution did not affect the coverage much. For Table 10.9, p = 50,
and PI (10.13) has slight undercoverage for the full GLM since n = 10p. Table
10.9 helps illustrate the importance of the correction factor: PI (10.14) would

Table 10.7 Simulated Large Sample 95% PI Coverages and Lengths for Poisson
Regression, p = 4, β1 = 1 = a

n ψ k GLM GAM lasso RL OHFS BE


100 0 1 cov 0.9712 0.9714 0.9810 0.9800 0.9792 0.9734
len 6.6448 6.6118 7.2770 7.2004 7.0680 6.6632
400 0 1 cov 0.9692 0.9694 0.9728 0.9714 0.9722 0.9665
len 6.6392 6.6474 6.7996 6.7722 6.7588 6.6778
100 0.5 1 cov 0.9642 0.9644 0.9796 0.9786 0.9760 0.9689
len 6.6922 6.6806 7.3136 7.2824 7.1160 6.7767
400 0.5 1 cov 0.9668 0.9670 0.9722 0.9716 0.9702 0.9754
len 6.6720 6.6896 6.8342 6.8140 6.7992 6.7802
100 0.9 1 cov 0.9672 0.9674 0.9766 0.9768 0.9738 0.9665
len 6.6038 6.6186 7.1480 7.1214 7.0002 6.5789
400 0.9 1 cov 0.9660 0.9662 0.9734 0.9700 0.9692 0.9798
len 6.5838 6.5746 6.7526 6.7196 6.7004 6.7443
100 0 3 cov 0.9696 0.9698 0.9848 0.9834 0.9818 0.9654
len 6.7080 6.7084 7.5632 7.5442 7.5348 6.7408
400 0 3 cov 0.9728 0.9730 0.9750 0.9746 0.9748 0.9657
len 6.5718 6.5684 6.7690 6.7356 6.7406 6.7063
100 0.5 3 cov 0.9672 0.9674 0.9842 0.9838 0.9736 0.9592
len 6.6992 6.7044 7.5804 7.5494 7.3810 6.7128
400 0.5 3 cov 0.9682 0.9684 0.9730 0.9722 0.9702 0.9772
len 6.6794 6.6890 6.8726 6.8520 6.8466 6.7504
100 0.9 3 cov 0.9664 0.9666 0.9804 0.9810 0.9750 0.9678
len 6.6704 6.6646 7.2880 7.2672 7.0722 6.7635
400 0.9 3 cov 0.9690 0.9692 0.9744 0.9742 0.9736 0.9667
len 6.7960 6.8092 6.9696 6.9682 6.9120 6.6987

have higher coverage and longer average length. Lasso was good at choosing
subsets that contain S since relaxed lasso had good coverage. The Olive and
Hawkins (2005) method is partly graphical, and graphs were not used in the
simulation.
Tables 10.10 and 10.11 are for binomial regression where only PI (10.13)
was used. For large n, coverage is likely to be higher than the nominal if the
binomial probability of success can get close to 0 or 1. For binomial regression,
neither lasso nor the Olive and Hawkins (2005) method had undercoverage
in any of the simulations with n ≥ 10p.
For n ≤ p, good performance needed stronger regularity conditions, and
Table 10.12 shows some results with n = 100 and p = 200. For k = 1,
relaxed lasso performed well as did lasso except in the second to last column
of Table 10.12. With k = 19 and ψ = 0, there was undercoverage since
n < 10(k + 1). For the dense models with k = 199 and ψ = 0, there was often
severe undercoverage, lasso sometimes picked 100 predictors including the
constant, and then relaxed lasso caused the program to fail with 5000 runs.
Coverage was usually good for ψ > 0 except for the second to last column
and sometimes the last column of Table 10.12. With ψ = 0.9, each predictor
was highly correlated with the one dominant principal component.

Table 10.8 Simulated Large Sample 95% PI Coverages and Lengths for Poisson
Regression, p = 4, β1 = 5, a = 2

n ψ k GLM GAM lasso RL OHFS BE


100 0 1 cov 0.9500 0.9440 0.7730 0.9664 0.9654 0.9520
len 77.6072 77.6306 84.1066 81.8374 82.4752 84.1432
400 0 1 cov 0.9580 0.9564 0.7566 0.9622 0.9628 0.9534
len 82.0126 82.0212 85.5704 83.2692 83.4374 80.9897
100 0.5 1 cov 0.9456 0.9424 0.7646 0.9634 0.9408 0.9512
len 83.0236 82.9034 90.5822 88.3060 88.6700 79.6887
400 0.5 1 cov 0.9530 0.9500 0.7584 0.9604 0.9566 0.9678
len 83.8588 83.8292 87.4336 85.1042 85.1434 79.9855
100 0.9 1 cov 0.9492 0.9452 0.7688 0.9646 0.7712 0.9654
len 78.3554 78.3798 87.0086 84.6072 83.4980 81.5432
400 0.9 1 cov 0.9550 0.9574 0.7606 0.9606 0.7928 0.9513
len 76.7028 76.7594 80.5070 78.2308 78.2538 80.1298
100 0 3 cov 0.9544 0.9466 0.7798 0.9708 0.9404 0.9487
len 80.1476 80.1362 92.1372 89.8532 90.3456 79.4565
400 0 3 cov 0.9560 0.9548 0.7514 0.9582 0.9566 0.9567
len 80.7868 80.8976 85.0642 82.7982 82.7912 79.4522
100 0.5 3 cov 0.9516 0.9478 0.7848 0.9694 0.3324 0.9515
len 77.1120 77.1130 88.9346 86.4680 85.8634 81.5643
400 0.5 3 cov 0.9568 0.9558 0.7534 0.9636 0.5214 0.9528
len 80.4226 80.4932 84.7646 82.5590 83.7526 79.9786
100 0.9 3 cov 0.9492 0.9456 0.7882 0.9620 0.7510 0.9554
len 79.5374 79.6172 91.2052 89.0692 84.5648 81.8544
400 0.9 3 cov 0.9544 0.9546 0.7638 0.9554 0.7384 0.9586
len 79.7384 79.6906 83.8318 81.6862 81.0882 80.7521

Table 10.9 Simulated Large Sample 95% PI Coverages and Lengths for Poisson
Regression, p = 50, β1 = 5, a = 2

n ψ k GLM lasso RL OHFS BE


500 0 1 cov 0.9352 0.7564 0.9598 0.9640 0.9476
len 81.2668 84.3188 81.8934 85.2922 81.1010
500 0.14 1 cov 0.9370 0.7508 0.9580 0.9628 0.9458
len 81.1820 84.4530 82.1894 85.2304 81.1146
500 0.9 1 cov 0.9368 0.7630 0.9620 0.8994 0.9456
len 80.4568 86.3506 84.4942 84.1448 80.4202
500 0 19 cov 0.9388 0.7592 0.9756 0.3778 0.9472
len 81.6922 96.8546 94.6350 99.7436 81.7218
500 0.14 19 cov 0.9368 0.7556 0.9730 0.2770 0.9438
len 80.0654 95.2964 93.2748 87.3814 80.1276
500 0.9 19 cov 0.9350 0.7544 0.9536 0.9480 0.9352
len 79.7324 86.3448 84.0674 83.2958 79.6172
500 0 49 cov 0.9386 0.7104 0.9666 0.1004 0.9364
len 81.1422 96.4304 94.8818 108.0518 81.2516
500 0.14 49 cov 0.9396 0.7194 0.9558 0.2858 0.9402
len 79.7874 94.8908 93.2538 86.4234 79.8692
500 0.9 49 cov 0.9380 0.7640 0.9480 0.9512 0.9430
len 78.8146 85.5786 83.2812 82.4104 78.8316

Table 10.10 Simulated Large Sample 95% PI Coverages and Lengths for Binomial
Regression, p = 4, m = 40

n ψ k GLM GAM lasso RL OHFS BE


100 0 1 cov 0.9786 0.9788 0.9774 0.9744 0.9720 0.9726
len 10.7696 10.7656 10.5332 10.4430 10.1990 10.2016
400 0 1 cov 0.9708 0.9700 0.9696 0.9708 0.9702 0.9688
len 9.8374 9.8426 9.8292 9.7866 9.7518 9.7548
100 0.5 1 cov 0.9792 0.9720 0.9742 0.9750 0.9724 0.9708
len 10.6668 10.6426 10.3790 10.3282 10.1060 10.1012
400 0.5 1 cov 0.9678 0.9676 0.9692 0.9670 0.9668 0.9656
len 9.8352 9.8452 9.8196 9.7890 9.7612 9.7590
100 0.9 1 cov 0.9780 0.9766 0.9762 0.9742 0.9704 0.9714
len 10.7324 10.7222 10.3774 10.3186 10.1438 10.1602
400 0.9 1 cov 0.9688 0.9672 0.9680 0.9674 0.9684 0.9672
len 9.7554 9.7646 9.7392 9.7012 9.6778 9.6790
100 0 3 cov 0.9790 0.9750 0.9782 0.9772 0.9780 0.9776
len 10.6974 10.6960 10.7388 10.7030 10.6956 10.7020
400 0 3 cov 0.9652 0.9652 0.9654 0.9656 0.9650 0.9626
len 9.7838 9.7878 9.8244 9.7864 9.7800 9.7722
100 0.5 3 cov 0.9780 0.9734 0.9776 0.9766 0.9770 0.9784
len 10.7224 10.7034 10.7482 10.7042 10.7162 10.7134
400 0.5 3 cov 0.9686 0.9688 0.9726 0.9702 0.9704 0.9706
len 9.7250 9.7170 9.7460 9.7172 9.7152 9.7290
100 0.9 3 cov 0.9800 0.9798 0.9802 0.9786 0.9698 0.9720
len 10.6978 10.6994 10.5820 10.5414 10.0660 10.1802
400 0.9 3 cov 0.9682 0.9684 0.9696 0.9674 0.9678 0.9676
len 9.8146 9.8074 9.8364 9.8190 9.7594 9.7764

Table 10.11 Simulated Large Sample 95% PI Coverages and Lengths for Binomial
Regression, p = 50, m = 7

n ψ k GLM lasso RL OHFS BE


1000 0 1 cov 0.9896 0.9838 0.9802 0.9798 0.9798
len 4.0008 3.6666 3.5744 3.5838 3.5842
1000 0.14 1 cov 0.9868 0.9818 0.9782 0.9774 0.9770
len 4.0422 3.6836 3.6158 3.6226 3.6312
1000 0.9 1 cov 0.9894 0.9794 0.9796 0.9800 0.9798
len 4.0214 3.5994 3.5794 3.6122 3.6114
1000 0 19 cov 0.9888 0.9870 0.9848 0.9814 0.9812
len 4.0294 3.9730 3.8438 3.7110 3.7030
1000 0.14 19 cov 0.9872 0.9846 0.9852 0.9804 0.9806
len 4.0376 3.8350 3.7834 3.7170 3.7066
1000 0.9 19 cov 0.9884 0.9804 0.9808 0.9802 0.9772
len 4.0348 3.6170 3.5948 3.6226 3.6216
1000 0 49 cov 0.990 0.9904 0.9904 0.9900 0.9904
len 4.0428 4.0726 4.0528 4.0490 4.0460
1000 0.14 49 cov 0.9866 0.9866 0.9856 0.9806 0.9796
len 4.0396 3.9044 3.8640 3.7046 3.6988
1000 0.9 49 cov 0.9874 0.9808 0.9792 0.9790 0.9772
len 4.0660 3.6444 3.6230 3.6556 3.6490

Table 10.12 Simulated Large Sample 95% PI Coverages and Lengths, n = 100,
p = 200
BR m=7 BR m=40 PR,a=1 β1 = 1 PR,a=2 β1 = 5
ψ,k lasso RL lasso RL lasso RL lasso RL
0 cov 0.9912 0.9654 0.9836 0.9602 0.9816 0.9612 0.7620 0.9662
1 len 4.2774 3.8356 11.3482 11.001 7.8350 7.5660 93.7318 91.4898
0.07 cov 0.9904 0.9698 0.9796 0.9644 0.9790 0.9696 0.7652 0.9706
1 len 4.2570 3.9256 11.4018 11.1318 7.8488 7.6680 92.0774 89.7966
0.9 cov 0.9844 0.9832 0.9820 0.9820 0.9880 0.9858 0.7850 0.9628
1 len 3.8242 3.7844 10.9600 10.8716 7.6380 7.5954 98.2158 95.9954
0 cov 0.9146 0.8216 0.8532 0.7874 0.8678 0.8038 0.1610 0.6754
19 len 4.7868 3.8632 12.0152 11.3966 7.8126 7.5188 88.0896 90.6916
0.07 cov 0.9814 0.9568 0.9424 0.9208 0.9620 0.9444 0.3790 0.5832
19 len 4.1992 3.8266 11.3818 11.0382 7.9010 7.7828 92.3918 92.1424
0.9 cov 0.9858 0.9840 0.9812 0.9802 0.9838 0.9848 0.7884 0.9594
19 len 3.8156 3.7810 10.9194 10.8166 7.6900 7.6454 97.744 95.2898
0.07 cov 0.9820 0.9640 0.9604 0.9390 0.9720 0.9548 0.3076 0.4394
199 len 4.1260 3.7730 11.2488 10.9248 8.0784 7.9956 90.4494 88.0354
0.9 cov 0.9886 0.9870 0.9822 0.9804 0.9834 0.9814 0.7888 0.9586
199 len 3.8558 3.8172 10.9714 10.8778 7.6728 7.6602 97.0954 94.7604

10.11 OLS and 1D Regression

For this section let SP = xT β = α + uT η. An important 1D regression


model, introduced by Li and Duan (1989), has the form

Y = g(α + uT η, e) (10.16)

where g is a bivariate (inverse link) function and e is a zero mean error that
is independent of x. The constant term α may be absorbed by g if desired.
An important special case is the response transformation model where

g(xT β, e) = t−1 (xT β + e) (10.17)

and t−1 is a one to one (typically monotone) function. Hence

t(Y ) = xT β + e.

Dimension reduction can greatly simplify our understanding of the con-


ditional distribution Y |x. If a 1D regression model is appropriate, then the
p–dimensional vector x can be replaced by the 1–dimensional scalar xT β
with “no loss of information about the conditional distribution.” Cook and
Weisberg (1999, p. 411) define a sufficient summary plot (SSP) to be a plot
that contains all the sample regression information about the conditional dis-
tribution Y |x of the response given the predictors. The response plot of ESP
versus Y is an estimated sufficient summary plot (ESSP).

Remark 10.4. Suppose the 1D regression model is Y ⫫ x|xT β. Then Y ⫫ x|(a + cβT x) for any constants a and c ≠ 0. Hence a + cxT β is a sufficient predictor (SP) with ESP = α̃ + xT β̃ where β̃ is an estimator of cβ for some nonzero constant c. Let x = (1, uT )T . We can also use ESP = α̃ + uT η̃ where η̃ is an estimator of c η for some nonzero constant c.
T
Consider the OLS estimator β̂ = (β̂1 , β̂ 2 )T = (α̂, η̂ T )T . Li and Duan (1989, p. 1031) showed that under regularity conditions, η̂ is a √n consistent estimator of cη for some constant c. If η̂ ≈ cη when Y ⫫ x|xT β, then the response plot of

α̂ + uT η̂ versus Y or xT β̂ versus Y

can be used to visualize the conditional distribution Y |xT β provided that c ≠ 0. Often if no strong nonlinearities are present among the predictors, uT η̂ is a useful ESP.

Remark 10.5. For OLS, call the plot of xT β̂ versus Y the OLS view.
The fact that the OLS view is frequently a useful response plot was perhaps
first noted by Brillinger (1977, 1983) and called the 1D Estimation Result by
Cook and Weisberg (1999, p. 432).

Olive (2002, 2004b, 2008: ch.12) showed that the trimmed views esti-
mator of Chapter 7 also gives useful response plots for 1D regression. If
Y = m(xT β) + e = m(α + uT η) + e, look for a plot with a smooth mean
function and the smallest variance function. The trimmed view with 0% trim-
ming is the OLS view.
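A sketch of an OLS view for simulated single index data is below; the cubic mean function and the coefficients are arbitrary choices for illustration.
#OLS view (Remark 10.5) for a simulated 1D regression (sketch)
set.seed(1)
n <- 200; u <- matrix(rnorm(n*3), n, 3)
y <- (drop(u %*% c(1, 1, 0)))^3 + rnorm(n)   #Y = m(u^T eta) + e with m(t) = t^3
b <- lsfit(u, y)$coef                        #OLS (alphahat, etahat)
ESP <- drop(cbind(1, u) %*% b)               #OLS ESP
plot(ESP, y); lines(lowess(ESP, y))          #OLS view with a lowess smooth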
Recall from Definition 2.17 and Theorem 2.20 that if x = (1, uT )T and
β = (α, ηT )T , then ηOLS = Σ −1u Σ u,Y . Let q = p−1. The following notation
will be useful for studying the OLS estimator. Let the sufficient predictor z =
uT η = ηT u and let w = u−E(u). Let r = w−(Σ u η)ηT w. The proof of the
next result is outlined in Problem 10.1 using an argument due to Aldrin, et al.
(1993). If the 1D regression model is appropriate, then typically Cov(u, Y ) ≠
0 unless uT β follows a symmetric distribution and m is symmetric about the
median of uT η.

Theorem 10.1. Suppose that (Yi , uTi )T are iid observations and that the
positive definite q × q matrix Cov(u) = Σ u and the q × 1 vector Cov(u, Y ) =
Σ u,Y . Assume that Yi = m(uTi η)+ei where the zero mean constant variance
iid errors ei are independent of the predictors ui . Then

ηOLS = Σ−1u Σ u,Y = cm,u η + bm,u     (10.18)

where the scalar

cm,u = E[ηT (u − E(u)) m(uT η)]     (10.19)

and the bias vector

bm,u = Σ−1u E[m(uT η)r].     (10.20)

Moreover, bm,u = 0 if u is from an elliptically contoured distribution with nonsingular Σ u , and cm,u ≠ 0 unless Cov(u, Y ) = 0. If the multiple linear regression model holds, then cm,u = 1, and bm,u = 0.

Olive and Hawkins (2005) and Olive (2008, ch. 12) suggested using variable
selection methods with Cp , originally meant for multiple linear regression,
for 1D regression models with SP = xT β. In particular, Theorem 4.2 is still
useful.
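A sketch of this idea with the leaps package cited in Chapter 11 is given below; y and x are hypothetical, and a Poisson GLM is used as the example 1D model.
#MLR all subsets variable selection with Cp, then fit the GLM on x_I (sketch)
library(leaps)
vs <- summary(regsubsets(x, y, nvmax = ncol(x)))
I <- which(vs$which[which.min(vs$cp), -1])  #predictors in the minimum Cp subset
fit <- glm(y ~ x[, I], family = poisson)    #e.g., Poisson regression using Y and x_I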

10.11.1 Inference for 1D Regression With a Linear Predictor

This section follows Chang and Olive (2010) closely. Theorem 2.20 is useful.
Some notation is needed for the following results. Many 1D regression models
have an error e with
σ 2 = Var(e) = E(e2 ). (10.21)
Let ê be the error residual for e. Let the population OLS residual

v = Y − αOLS − uT ηOLS (10.22)

with
τ 2 = E[(Y − αOLS − uT ηOLS )2 ] = E(v2 ), (10.23)
and let the OLS residual be

r = Y − α̂OLS − uT η̂OLS . (10.24)

Typically the OLS residual r is not estimating the error e and τ 2 ≠ σ 2 , but
the following results show that the OLS residual is of great interest for 1D
regression models.
Assume that a 1D model holds, Y ⫫ u|(α + uT η), which is equivalent to Y ⫫ u|uT η. Then under regularity conditions, results i) – iii) below hold.

i) Li and Duan (1989): ηOLS = cη for some constant c.


ii) Li and Duan (1989) and Chen and Li (1998):
√n(η̂ OLS − cη) →D Np−1 (0, C OLS )     (10.25)

where

C OLS = Σ−1u E[(Y − αOLS − uT ηOLS )2 (u − E(u))(u − E(u))T ] Σ−1u .     (10.26)

iii) Chen and Li (1998): Let A be a known full rank constant k × (p − 1)


matrix. If the null hypothesis H0 : Aη = 0 is true, then
√n(Aη̂ OLS − cAη) = √n Aη̂ OLS →D Nk (0, AC OLS AT )

and
AC OLS AT = τ 2 AΣ−1u AT .     (10.27)
Notice that C OLS = τ 2 Σ−1u if v = Y − αOLS − uT ηOLS ⫫ u or if the MLR model holds. If the MLR model holds, τ 2 = σ 2 .
To create test statistics, the estimator
τ̂ 2 = MSE = (1/(n − p)) Σ_{i=1}^n r_i^2 = (1/(n − p)) Σ_{i=1}^n (Yi − α̂OLS − uTi η̂ OLS )2

will be useful. The estimator

Ĉ OLS = Σ̂−1u [ (1/n) Σ_{i=1}^n (Yi − α̂OLS − uTi η̂ OLS )2 (ui − ū)(ui − ū)T ] Σ̂−1u     (10.28)

can also be useful. Notice that for general 1D regression models, the OLS
MSE estimates τ 2 rather than the error variance σ 2 .
iv) Result iii) suggests that a test statistic for H0 : Aη = 0 is
WOLS = n η̂ TOLS AT [AΣ̂−1u AT ]−1 Aη̂ OLS /τ̂ 2 →D χ2k ,     (10.29)

the chi–square distribution with k degrees of freedom.

Before presenting the main theoretical result, some results from OLS MLR
theory are needed. Let the p×1 vector β = (α, ηT )T , the known k×p constant
matrix à = [a A] where a is a k×1 vector, and let c be a known k×1 constant
vector. Using Equation (2.6), the usual F statistic for testing H0 : Ãβ = c is
F0 = (Ãβ̂ − c)T [Ã(X T X)−1 ÃT ]−1 (Ãβ̂ − c)/(k τ̂ 2 )     (10.30)

where MSE = τ̂ 2 . Recall that if H0 is true, the MLR model holds and the errors ei are iid N (0, σ 2 ), then F0 ∼ Fk,n−p , the F distribution with k and n − p degrees of freedom. By Theorem 2.25, if Zn ∼ Fk,n−p , then

Zn →D χ2k /k     (10.31)

as n → ∞.
The main theoretical result of this section is Theorem 10.2 below. This
theorem and (10.31) suggest that OLS output, originally meant for testing
with the MLR model, can also be used for testing with many 1D regression
data sets. Without loss of generality, let the 1D model Y ⫫ x|(α + uT η) be written as

Y ⫫ x|(α + uTR β R + uTO β O )

where the reduced model is Y ⫫ x|(α + uTR ηR ) and uO denotes the terms
outside of the reduced model. Notice that the OLS ANOVA F test corresponds to H0 : η = 0 and uses A = I p−1 . The tests for H0 : βi = 0 use A =
(0, ..., 0, 1, 0, ..., 0) where the 1 is in the (i − 1)th position for i = 2, ..., p and
are equivalent to the OLS t tests. The test H0 : ηO = 0 uses A = [0 I j ] if
ηO is a j × 1 vector, and the test statistic (10.30) can be computed with the
OLS partial F test: run OLS on the full model to obtain SSE and on the
reduced model to obtain SSE(R).
In the theorem below, it is crucial that H0 : Aη = 0. Tests for H0 : Aη =
1, say, may not be valid even if the sample size n is large. Also, confidence
intervals corresponding to the t tests are for cβi , and are usually not very
useful when c is unknown.

Theorem 10.2. Assume that a 1D regression model Y ⫫ x|xT β holds and that Equation (10.29) holds when H0 : Aη = 0 is true. Then the test
statistic (10.30) satisfies

F0 = [(n − 1)/(kn)] WOLS →D χ2k /k

as n → ∞.
Proof. Notice that by (10.29), the result follows if F0 = (n−1)WOLS /(kn).
Let à = [0 A] so that H0 : Ãβ = 0 is equivalent to H0 : Aη = 0. By
Theorem 2.19,
(X T X)−1 =
[ 1/n + ūT D−1 ū   −ūT D−1 ]
[ −D−1 ū               D−1     ]     (10.32)

where the (p − 1) × (p − 1) matrix


D−1 = [(n − 1)Σ̂ u ]−1 = Σ̂−1u /(n − 1).     (10.33)

Using à and (10.32) in (10.30) shows that

F0 = (Aη̂ OLS )T ( [0 A] (X T X)−1 [0 A]T )−1 Aη̂ OLS /(k τ̂ 2 )

with (X T X)−1 as in (10.32), and the result follows from (10.33) after algebra. □

See Chang and Olive (2010) and Olive (2008: ch. 12, 2010: ch. 15) for
simulations and more information.

10.12 Data Splitting

Data splitting is used for inference after model selection. Use a training set
to select a full model, and a validation set for inference with the selected full
model. Here p >> n is possible. See Hurvich and Tsai (1990, p. 216) and
Rinaldo et al. (2019). Typically when training and validation sets are used,
the training set is bigger than the validation set or half sets are used, often
causing large efficiency loss.
Let J be a positive integer and let ⌊x⌋ be the integer part of x, e.g., ⌊7.7⌋ = 7. Initially divide the data into two sets H1 with n1 = ⌊n/(2J)⌋
cases and V1 with n − n1 cases. If the fitted model from H1 is not good
enough, randomly select n1 cases from V1 to add to H1 to form H2 . Let V2
have the remaining cases from V1 . Continue in this manner, possibly forming
sets (H1 , V1 ), (H2 , V2 ), ..., (HJ , VJ ) where Hi has ni = in1 cases. Stop when
Hd gives a reasonable model Id with ad predictors if d < J. Use d = J,
otherwise. Use the model Id as the full model for inference with the data in
Vd .
This procedure is simple for a fixed data set, but it would be good to
automate the procedure. For example, if n = 500000 and p = 90, using
n1 = 900 would result in a much smaller loss of efficiency than n1 = 250000.
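A sketch that automates the splitting scheme is below; splitgrow is a hypothetical helper, and the check that Hd gives a reasonable model is left to the user.
#data splitting: training sets H_i grow in chunks of n1 = floor(n/(2J)) cases (sketch)
splitgrow <- function(n, J){
  n1 <- floor(n/(2*J))
  idx <- sample(n)   #random ordering of the cases
  lapply(1:J, function(i) list(H = idx[1:(i*n1)], V = idx[-(1:(i*n1))]))
}
sets <- splitgrow(n = 10000, J = 5)   #H_i has i*1000 cases, V_i the rest
#select a full model using sets[[d]]$H, then do inference with it on sets[[d]]$V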

10.13 Complements

This chapter used material from Chang and Olive (2010), Olive (2013b,
2017a: ch. 13), Olive et al. (2019), and Rathnayake and Olive (2019). GLMs
were introduced by Nelder and Wedderburn (1972). Useful references for
generalized additive models include Hastie and Tibshirani (1986, 1990), and
Wood (2017). Zhou (2001) is useful for simulating the Weibull regression
model. Also see McCullagh and Nelder (1989), Agresti (2013, 2015), and Cook
and Weisberg (1999, ch. 21-23). Collett (2003) and Hosmer and Lemeshow
(2000) are excellent texts on logistic regression while Cameron and Trivedi
(2013) and Winkelmann (2008) cover Poisson regression. Alternatives to Pois-
son regression mentioned in Section 10.7 are covered by Zuur et al. (2009),
Simonoff (2003), and Hilbe (2011). Cook and Zhang (2015) show that enve-
lope methods have the potential to significantly improve GLMs. Some GLM
large sample theory is given by Claeskens and Hjort (2008, p. 27), Cook and
Zhang (2015), and Sen and Singer (1993, p. 309).
An introduction to 1D regression and regression graphics is Cook and
Weisberg (1999a, ch. 18, 19, and 20), while Olive (2010) considers 1D regres-
sion. A more advanced treatment is Cook (1998). Important papers include
Brillinger (1977, 1983) and Li and Duan (1989). Li (1997) shows that OLS F
tests can be asymptotically valid for model (10.18) if u is multivariate nor-
mal and Σ−1u Σ uY ≠ 0. The scatterplot smoother lowess is due to Cleveland
(1979, 1981).

Suppose n ≥ 10p. Results from Cameron and Trivedi (1998, p. 89) suggest
that if a Poisson regression model is fit using OLS software for MLR, then
a rough approximation is β̂ P R ≈ β̂ OLS /Ȳ . So a rough approximation is PR ESP ≈ (OLS ESP)/Ȳ . Results from Haggstrom (1983) suggest that if
a binary regression model is fit using OLS software for MLR, then a rough
approximation is β̂ LR ≈ β̂ OLS /M SE.
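The Poisson approximation is easy to check on simulated data with small coefficients; a hedged sketch:
#rough check of betahat_PR ~ betahat_OLS / Ybar (sketch)
set.seed(3)
n <- 1000; x <- matrix(rnorm(n*3), n, 3)
y <- rpois(n, lambda = exp(1 + drop(x %*% rep(0.1, 3))))
coef(glm(y ~ x, family = poisson))[-1]   #Poisson regression slopes
(lsfit(x, y)$coef/mean(y))[-1]           #OLS slopes divided by Ybar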
Haughton (1988, 1989) showed P (S ⊆ Imin ) → 1 as n → ∞ if BIC is
used. AIC has a smaller penalty than BIC, so often overfits. According to
Claeskens and Hjort (2008, p. xi), inference after variable selection has been
called “the quiet scandal of statistics.”
Plots were made in R and Splus, see R Core Team (2016). The Wood
(2017) library mgcv was used for fitting a GAM, and the Venables and Ripley
(2010) library MASS was used for the negative binomial family. The gam
library is also useful. The Lesnoff and Lancelot (2010) R package aod has
function betabin for beta binomial regression and is also useful for fitting
negative binomial regression. SAS has proc genmod, proc gam, and proc
countreg which are useful for fitting GLMs such as Poisson regression,
GAMs such as the Poisson GAM, and overdispersed count regression models.
In Section 10.9, the functions binregbootsim and pregbootsim are
useful for the full binomial regression and full Poisson regression models. The
functions vsbrbootsim and vsprbootsim were used to bootstrap back-
ward elimination for binomial and Poisson regression. The functions LRboot
and vsLRboot bootstrap the logistic regression full model and backward
elimination. The functions PRboot and vsPRboot bootstrap the Poisson
regression full model and backward elimination.
In Section 10.10, table entries for Poisson regression were made with
prpisim2 while entries for binomial regression were made with brpisim.
The functions prpiplot2 and lrpiplot were used to make Figures 10.17
and 10.18. The function prplot can be used to check the full Poisson regres-
sion model for overdispersion. The function prplot2 can be used to check
other Poisson regression models such as a GAM or lasso.
i) Resistant regression: Suppose the regression model has an m×1 response
vector y, and a p × 1 vector of predictors x. Assume that predictor trans-
formations have been performed to make x, and that w consists of k ≤ p
continuous predictor variables that are linearly related. Find the RMVN set
based on the w to obtain nu cases (y ci , xci ), and then run the regression
method on the cleaned data. Often the theory of the method applies to the
cleaned data set since y was not used to pick the subset of the data. Effi-
ciency can be much lower since nu cases are used where n/2 ≤ nu ≤ n, and
the trimmed cases tend to be the “farthest” from the center of w.
The method will have the most outlier resistance if k = p (or k = p − 1 if
there is a trivial predictor X1 ≡ 1). If m = 1, make the response plot of Ŷc versus Yc with the identity line added as a visual aid, and make the residual
plot of Ŷc versus rc = Yc − Ŷc .
In R, assume Y is the vector of response variables, x is the data matrix of
the predictors (often not including the trivial predictor), and w is the data
matrix of the wi . Then the following R commands can be used to get the
cleaned data set. We could use the covmb2 set B instead of the RMVN set
U computed from the w by replacing the command getu(w) by getB(w).
indx <- getu(w)$indx #often w = x
Yc <- Y[indx]
Xc <- x[indx,]
#example
indx <- getu(buxx)$indx
Yc <- buxy[indx]
Xc <- buxx[indx,]
outr <- lsfit(Xc,Yc)
MLRplot(Xc,Yc) #right click Stop twice
a) Resistant additive error regression: An additive error regression model
has the form Y = h(x) + e where there is m = 1 response variable Y , and the
p × 1 vector of predictors x is assumed to be known and independent of the
additive error e. An enormous variety of regression models have this form,
including multiple linear regression, nonlinear regression, nonparametric re-
gression, partial least squares, lasso, ridge regression, etc. Find the RMVN
set (or covmb2 set) based on the w to obtain nU cases (Yci , xci ), and then
run the additive error regression method on the cleaned data.
b) Resistant Additive Error Multivariate Regression
Assume y = g(x) + ε = E(y|x) + ε where g : Rp → Rm , y = (Y1 , ..., Ym )T , and ε = (ε1 , ..., εm )T . Many models have this form, including multivariate
linear regression, seemingly unrelated regressions, partial envelopes, partial
least squares, and the models in a) with m = 1 response variable. Clean the
data as in a) but let the cleaned data be stored in (Z c , X c ). Again, the theory
of the method tends to apply to the method applied to the cleaned data since
the response variables were not used to select the cases, but the efficiency is
often much lower. In the R code below, assume the y are stored in z.
indx <- getu(w)$indx #often w = x
Zc <- z[indx,]
Xc <- x[indx,]
#example
ht <- buxy
t <- cbind(buxx,ht);
z <- t[,c(2,5)];
x <- t[,c(1,3,4)]
indx <- getu(x)$indx
Zc <- z[indx,]

Xc <- x[indx,]
mltreg(Xc,Zc) #right click Stop four times

10.14 Problems

10.1∗. (Aldrin et al. 1993). Suppose

Y = m(uT η) + e (10.34)

where m is a possibly unknown function and the zero mean errors e are inde-
pendent of the predictors. Let z = uT η and let w = u − E(u). Let Σ u,Y =
Cov(u, Y ), and let Σ u = Cov(u) = Cov(w). Let r = w − (Σ u η)ηT w.

a) Recall that Cov(u, Y ) = E[(u − E(u))(Y − E(Y ))T ] and show that
Σ u,Y = E(wY ).

b) Show that E(wY ) = Σ u,Y = E[(r + (Σ u η)η T w) m(z)] =

E[m(z)r] + E[ηT w m(z)]Σ u η.

c) Using η OLS = Σ−1u Σ u,Y , show that η OLS = c(u)η + b(u) where the
constant
c(u) = E[ηT (u − E(u))m(uT η)]
and the bias vector b(u) = Σ−1u E[m(uT η)r].
d) Show that E(wz) = Σ u η. (Hint: Use E(wz) = E[(u − E(u))uT η] =
E[(u − E(u))(uT − E(uT ) + E(uT ))η].)

e) Assume m(z) = z. Using d), show that c(u) = 1 if ηT Σ u η = 1.

f) Assume that η T Σ u η = 1. Show that E(zr) = E(rz) = 0. (Hint: Find


E(rz) and use d).)

g) Suppose that ηT Σ u η = 1 and that the distribution of u is multivariate


normal. Then the joint distribution of z and r is multivariate normal. Using
the fact that E(zr) = 0, show Cov(r, z) = 0 so that z and r are independent.
Then show that b(u) = 0.

(Note: the assumption η T Σ u η = 1 can be made without loss of generality


since if ηT Σ u η = d2 > 0 (assuming Σ u is positive definite), then y =
m(d(η/d)T u) + e ≡ md (θ T u) + e where md (v) = m(dv), θ = η/d and
θT Σ u θ = 1.)
Chapter 11
Stuff for Students

11.1 R

R is available from the CRAN website (https://cran.r-project.org/). As of January 2020, the author's personal computer has Ver-
sion 3.3.1 (June 21, 2016) of R. R is similar to Splus, but is free. R is very
versatile since many people have contributed useful code, often as packages.

Downloading the book’s files into R


Many of the homework problems use R functions contained in the book’s
website (http://parker.ad.siu.edu/Olive/linmodbk.htm) under the file name linmodpack.txt. The following two R commands can be copied and pasted into R from near the top of the file (http://parker.ad.siu.edu/Olive/linmodrhw.txt).
Downloading the book’s R functions linmodpack.txt and data files
linmoddata.txt into R: the commands
source("http://parker.ad.siu.edu/Olive/linmodpack.txt")
source("http://parker.ad.siu.edu/Olive/linmoddata.txt")

can be used to download the R functions and data sets into R. Type ls().
Nearly 10 R functions from linmodpack.txt should appear. In R, enter the
command q(). A window asking “Save workspace image?” will appear. Click
on No to remove the functions from the computer (clicking on Yes saves the
functions in R, but the functions and data are easily obtained with the source
commands).
Citing packages
We will use R packages often in this book. The following R command is
useful for citing the Mevik et al. (2015) pls package.
citation("pls")


Other packages cited in this book include MASS and class: both from Ven-
ables and Ripley (2010), glmnet: Friedman et al. (2015), and leaps: Lumley
(2009).
This section gives tips on using R, but is no replacement for books such
as Becker et al. (1988), Crawley (2005, 2013), Fox and Weisberg (2010), or
Venables and Ripley (2010). Also see Mathsoft (1999ab) and use the website
(www.google.com) to search for useful websites. For example enter the search
words R documentation.

The command q() gets you out of R.


Least squares regression can be done with the function lsfit or lm.
The commands help(fn) and args(fn) give information about the function
fn, e.g. if fn = lsfit.
Type the following commands.
x <- matrix(rnorm(300),nrow=100,ncol=3)
y <- x%*%1:3 + rnorm(100)
out<- lsfit(x,y)
out$coef
ls.print(out)
The first line makes a 100 by 3 matrix x with N(0,1) entries. The second
line makes y[i] = 0 + 1 ∗ x[i, 1] + 2 ∗ x[i, 2] + 3 ∗ x[i, 3] + e where e is N(0,1). The
term 1:3 creates the vector (1, 2, 3)T and the matrix multiplication operator is
% ∗ %. The function lsfit will automatically add the constant to the model.
Typing “out” will give you a lot of irrelevant information, but out$coef and
out$resid give the OLS coefficients and residuals respectively.
To make a residual plot, type the following commands.
fit <- y - out$resid
plot(fit,out$resid)
title("residual plot")
The first term in the plot command is always the horizontal axis while the
second is on the vertical axis.

To put a graph in Word, hold down the Ctrl and c buttons simulta-
neously. Then select “Paste” from the Word menu, or hit Ctrl and v at the
same time.
To enter data, open a data set in Notepad or Word. You need to know
the number of rows and the number of columns. Assume that each case is
entered in a row. For example, assuming that the file cyp.lsp has been saved
on your flash drive from the webpage for this book, open cyp.lsp in Word. It
has 76 rows and 8 columns. In R , write the following command.
cyp <- matrix(scan(),nrow=76,ncol=8,byrow=T)
Then copy the data lines from Word and paste them in R. If a cursor does
not appear, hit enter. The command dim(cyp) will show if you have entered
the data correctly.

Enter the following commands


cypy <- cyp[,2]
cypx<- cyp[,-c(1,2)]
lsfit(cypx,cypy)$coef
to produce the output below.
Intercept X1 X2 X3
205.40825985 0.94653718 0.17514405 0.23415181
X4 X5 X6
0.75927197 -0.05318671 -0.30944144
Making functions in R is easy.

For example, type the following commands.


mysquare <- function(x){
# this function squares x
r <- x^2
r }
The second line in the function shows how to put comments into functions.

Modifying your function is easy.

Use the fix command.


fix(mysquare)
This will open an editor such as Notepad and allow you to make changes. (In
Splus, the command Edit(mysquare) may also be used to modify the function
mysquare.)

To save data or a function in R, when you exit, click on Yes when the
“Save workspace image?” window appears. When you reenter R, type ls().
This will show you what is saved. You should rarely need to save anything
for this book. To remove unwanted items from the worksheet, e.g. x, type
rm(x),
pairs(x) makes a scatterplot matrix of the columns of x,
hist(y) makes a histogram of y,
boxplot(y) makes a boxplot of y,
stem(y) makes a stem and leaf plot of y,
scan(), source(), and sink() are useful on a Unix workstation.
To type a simple list, use y <- c(1,2,3.5).
The commands mean(y), median(y), var(y) are self explanatory.

The following commands are useful for a scatterplot created by the com-
mand plot(x,y).
lines(x,y), lines(lowess(x,y,f=.2))

identify(x,y)
abline(out$coef ), abline(0,1)

The usual arithmetic operators are 2 + 4, 3 − 7, 8 ∗ 4, 8/4, and 2^10.
The ith element of vector y is y[i] while the ij element of matrix x is x[i, j].
The second row of x is x[2, ] while the 4th column of x is x[, 4]. The transpose
of x is t(x).

The command apply(x,1,fn) will compute the row means if fn = mean.


The command apply(x,2,fn) will compute the column variances if fn = var.
The commands cbind and rbind combine column vectors or row vectors with
an existing matrix or vector of the appropriate dimension.
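For example, the following commands (a small illustration, not from the book's homework files) apply these functions to a 10 by 3 matrix.
x <- matrix(rnorm(30),nrow=10,ncol=3)
apply(x,1,mean)     #the 10 row means
apply(x,2,var)      #the 3 column variances
cbind(x,rep(1,10))  #append a column of ones
rbind(x,c(1,2,3))   #append an extra row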

Getting information about a library in R


In R, a library is an add–on package of R code. The command library()
lists all available libraries, and information about a specific library, such as
leaps for variable selection, can be found, e.g., with the command
library(help=leaps).

Downloading a library into R


Many researchers have contributed a library or package of R code that can
be downloaded for use. To see what is available, go to the website
(http://cran.us.r-project.org/) and click on the Packages icon.
Following Crawley (2013, p. 8), you may need to “Run as administrator”
before you can install packages (right click on the R icon to find this). Then
use the following command to install the glmnet package.
install.packages("glmnet")
Open R and type the following command.
library(glmnet)
Next type help(glmnet) to make sure that the library is available for use.

Warning: R is free but not foolproof. If you have an old version of R
and want to download a library, you may need to update your version of
R. The libraries for robust statistics may be useful for outlier detection, but
the methods have not been shown to be consistent or high breakdown. All
software has some bugs. For example, Version 1.1.1 (August 15, 2000) of R
had a random generator for the Poisson distribution that produced variates
with too small a mean θ for θ ≥ 10. Hence simulated 95% confidence
intervals might contain θ 0% of the time. This bug seems to have been fixed
in Versions 2.4.1 and later. Also, some functions in linmodpack may no longer
work in new versions of R.

11.2 Hints for Selected Problems

Chapter 1
1.1 a) Sort each column, then find the median of each column. Then
MED(W ) = (1430, 180, 120)T .
b) The sample mean of (X1 , X2 , X3 )T is found by finding the sample mean
of each column. Hence x = (1232.8571, 168.00, 112.00)T .

1.2 a) 7 + βXi
b) β̂ = Σ (Yi − 7)Xi / Σ Xi²
1.3 See Section 1.3.5.
1.5 a) β̂ = Σ X3i(Yi − 10 − 2X2i) / Σ X3i². The second partial derivative
= 2 Σ X3i² > 0.

1.10 a) X2 ∼ N (100, 6).


b) (X1, X3)^T ∼ N2((49, 17)^T, Σ) where Σ has rows (3, −1) and (−1, 4).
c) X1 and X4 are independent, and X3 and X4 are independent.
d)
ρ(X1, X3) = Cov(X1, X3)/√(VAR(X1)VAR(X3)) = −1/(√3 √4) = −0.2887.
1.11 a) Y|X ∼ N(49, 16) since Y and X are independent. (Or use E(Y|X) =
µY + Σ12 Σ22⁻¹(X − µX) = 49 + 0(1/25)(X − 100) = 49 and VAR(Y|X) =
Σ11 − Σ12 Σ22⁻¹ Σ21 = 16 − 0(1/25)0 = 16.)
b) E(Y|X) = µY + Σ12 Σ22⁻¹(X − µX) = 49 + 10(1/25)(X − 100) = 9 + 0.4X.
c) VAR(Y|X) = Σ11 − Σ12 Σ22⁻¹ Σ21 = 16 − 10(1/25)10 = 16 − 4 = 12.
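A quick arithmetic check in R reproduces the numbers above (the specific values are those given in the solutions):
-1/sqrt(3*4)               #the correlation in 1.10 d), about -0.2887
49 + 10*(1/25)*(90 - 100)  #E(Y|X=90) = 45, which equals 9 + 0.4(90) from 1.11 b)
16 - 10*(1/25)*10          #VAR(Y|X) = 12 as in 1.11 c)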

1.13 The proof is identical to that given in Example 3.2. (In addition, it
is fairly simple to show that M1 = M2 ≡ M . That is, M depends on Σ but
not on c or g.)

1.19 ΣB = E[E(X|B^T X)X^T B] = E(M_B B^T X X^T B) = M_B B^T ΣB.
Hence M_B = ΣB(B^T ΣB)⁻¹.

1.26 a) N2((3, 2)^T, Σ) where Σ has rows (3, 1) and (1, 2).
b) X2 and X4 are independent, and X3 and X4 are independent.
c) σ12/√(σ11 σ33) = 1/(√2 √3) = 1/√6 = 0.4082.

1.31 See Section 1.3.6.


1.32 a) Model I:
β̂1 = Σ_{i=1}^n (xi − x̄)Yi / Σ_{j=1}^n (xj − x̄)² = Σ_{i=1}^n ki Yi with
ki = (xi − x̄) / Σ_{j=1}^n (xj − x̄)².

Model II:
β̂1 = Σ_{i=1}^n xi Yi / Σ_{j=1}^n xj² = Σ_{i=1}^n ki Yi with
ki = xi / Σ_{j=1}^n xj².

b) Model I:
V(β̂1) = Σ_{i=1}^n ki² V(Yi) = σ² Σ_{i=1}^n ki²
= σ² Σ_{i=1}^n (xi − x̄)² / [Σ_{j=1}^n (xj − x̄)²]² = σ² / Σ_{i=1}^n (xi − x̄)².

Model II:
V(β̂1) = Σ_{i=1}^n ki² V(Yi) = σ² Σ_{i=1}^n ki²
= σ² Σ_{i=1}^n xi² / [Σ_{j=1}^n xj²]² = σ² / Σ_{i=1}^n xi².

The models are full rank, so the estimators are BLUE.
c) The result follows if Σ_{i=1}^n xi² ≥ Σ_{i=1}^n (xi − x̄)², but Σ_{i=1}^n (xi − µ)² is
the least squares criterion for the model xi = µ + ei, and the criterion is
minimized by the least squares estimator µ̂ = x̄. Hence using µ̃ = 0 gives a
least squares criterion at least as large as that using µ̂, and the result holds.
1.33 a) E(r) = E[(I − P )Y ] = (I − P )Xβ = 0. Cov(r) = Cov[(I −
P )Y ] = (I − P )Cov(Y )(I − P )T = σ 2 (I − P ).
b) Cov(r, Y ) = E([r − E(r)][Y − E(Y )]T ) =

E([(I − P )Y − (I − P )E(Y )][Y − E(Y )]T ) =

E[(I−P )[Y −E(Y )][Y −E(Y )]T ] = (I−P )Cov(Y ) = (I−P )σ 2 I = σ 2 (I−P ).
c) Cov(r, Ŷ ) = E([r − E(r)][Ŷ − E(Ŷ )]T ) =

E([(I − P )Y − (I − P )E(Y )][P Y − P E(Y )]T ) =

E[(I − P )[Y − E(Y )][Y − E(Y )]T P ] = (I − P )σ 2 IP = σ 2 (I − P )P = 0.
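A small simulation (an illustration of the sample analog of c), not a proof) shows that least squares residuals and fitted values are uncorrelated:
set.seed(1)
x <- matrix(rnorm(300),nrow=100,ncol=3)
y <- x%*%1:3 + rnorm(100)
out <- lsfit(x,y)
fit <- y - out$resid
cor(fit,out$resid)   #essentially zero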

Chapter 2
2.1 See the proof of Theorem 2.18.
2.14 For fixed σ > 0, L(β, σ 2 ) is maximized by minimizing Q(β) ≥ 0. So
β̂Q maximizes L(β, σ 2 ) regardless of the value of σ 2 > 0. So β̂ Q is the MLE.

b) Let Q = Q(β̂Q). Then the MLE σ̂² is found by maximizing the profile
likelihood Lp(σ²) = L(β̂Q, σ²) = cn (1/σ^n) exp[−Q/(2σ²)]. Let τ = σ². Then
Lp(τ) = cn (1/τ^{n/2}) exp[−Q/(2τ)], and the log profile likelihood is
log Lp(τ) = d − (n/2) log(τ) − Q/(2τ). Thus
d log Lp(τ)/dτ = −n/(2τ) + Q/(2τ²), which is set equal to 0,
or −nτ + Q = 0, or τ̂ = σ̂² = Q/n, which is unique. Then
d² log Lp(τ)/dτ² = n/(2τ²) − 2Q/(2τ³) evaluated at τ̂ equals
n/(2τ̂²) − 2nτ̂/(2τ̂³) = −n/(2τ̂²) < 0,
which proves that σ̂² is the MLE of σ².


2.32 a) If λ is an eigenvalue of P , then for some x 6= 0, λx = P x =
P 2 x = λ2 x. So λ(λ − 1) = 0, which only has possible solutions λ = 0 or
λ = 1.
b) Thus rank(P ) = number of nonzero eigenvalues of P = tr(P ) by a).
2.35 a) Note that E(Y Y^T) = Σ + θθ^T. Since the quadratic form is
a scalar and the trace is a linear operator, E[Y^T AY] = E[tr(Y^T AY)] =
E[tr(AY Y^T)] = tr(E[AY Y^T]) = tr(AΣ + Aθθ^T) = tr(AΣ) + tr(Aθθ^T) =
tr(AΣ) + θ^T Aθ.
b) Note that Σ_i (Yi − Ȳ)² is the residual sum of squares for the linear
model Y = θ1 + e. Hence Σ_i (Yi − Ȳ)² = Y^T(I − H)Y = Y^T(I − (1/n)11^T)Y
where H = 1(1^T 1)⁻¹1^T. Now tr(AΣ) = tr(Σ) − tr((1/n)11^T Σ). Now 1^T Σ =
(σ²[1 + (n−1)ρ], ..., σ²[1 + (n−1)ρ]), tr(11^T Σ) = 1^T Σ1 = nσ²[1 + (n−1)ρ],
and tr((1/n)11^T Σ) = σ²[1 + (n−1)ρ]. So tr(AΣ) = nσ² − σ²[1 + (n−1)ρ] =
σ²[n − 1 − (n−1)ρ] = σ²(n−1)(1−ρ). Now θ^T Aθ = θ² 1^T(I − (1/n)11^T)1 =
θ²(n − n²/n) = 0. Hence the result follows by a).
c) Assume Y ∼ Nn(θ, σ²I). Then Ȳ = BY where B = (1/n)1^T. Now
Y^T AY = Y^T A^T AY. Hence the two terms are independent if AY and BY are
independent, which holds iff AB^T = 0, but AB^T = (I − (1/n)11^T)(1/n)1 =
(1/n)(1 − 1) = 0.
2.36 a) Q(β) = Σ_{i=1}^n (Yi − βxi)². By the chain rule,
dQ(β)/dβ = −2 Σ_{i=1}^n (Yi − βxi)xi.
Setting the derivative equal to 0 and calling the unique solution β̂ gives
Σ_{i=1}^n xi Yi = β̂ Σ_{i=1}^n xi², or
β̂ = Σ_{i=1}^n xi Yi / Σ_{i=1}^n xi².
b) β̂ = Σ_{i=1}^n ki Yi where ki = xi / Σ_{j=1}^n xj². Hence E(β̂) = Σ_{i=1}^n ki E(Yi) =
Σ_{i=1}^n ki βxi = β Σ_{i=1}^n xi² / Σ_{j=1}^n xj² = β. V(β̂) = Σ_{i=1}^n ki² V(Yi) = σ² Σ_{i=1}^n ki²
using Yi = Yi|xi has V(Yi) = σ². Note that Σ_{i=1}^n ki² = 1/Σ_{i=1}^n xi².
c) E(Ŷi) = βxi = E(Yi) = E(Yi|xi), suppressing the conditioning. V(Ŷi) =
V(β̂xi) = xi² V(β̂) = σ² xi² / Σ_{j=1}^n xj² by b).
d) Under this normal model, the MLE of β is β̂ and the MLE of σ² is
σ̂² = (1/n) Σ_{i=1}^n ri² = [(n − p)/n] MSE
with p = 1.
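The closed form for β̂ in a) can be checked numerically (a sketch; the data below are simulated, and lm with the intercept removed fits the same no-intercept model):
set.seed(2)
x <- rnorm(50,mean=5)
y <- 3*x + rnorm(50)
sum(x*y)/sum(x^2)   #closed form from a)
lm(y~x-1)$coef      #least squares without an intercept gives the same value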
2.37 a) Use either proof of Theorem 2.5. Normality is not necessary.
b) i)

Source      df     SS                          MS    F
Regression  p − 1  SSR = Y^T(P − (1/n)11^T)Y   MSR   F0 = MSR/MSE for H0: β2 = · · · = βp = 0
Residual    n − p  SSE = Y^T(I − P)Y           MSE

ii) E(MSE) = σ², so E(SSE) = (n − p)σ². By a),
E(SSR) = β^T X^T(P − (1/n)11^T)Xβ + tr[σ²(P − (1/n)11^T)]
= β^T X^T(P − (1/n)11^T)Xβ + σ²(p − 1).
When H0 is true, Xβ = 1β1 and E(SSR) = σ²(p − 1).
iii) By Theorem 2.14 g), if Y ∼ Nn(µ, σ²I) then Y^T AY/σ² ∼ χ²(r, µ^T Aµ/(2σ²))
iff A is idempotent with rank(A) = tr(A) = r.
This theorem applies to SSE/σ² with A = I − P, r = n − p, and µ = Xβ.
Then µ^T(I − P)µ = 0 since PX = X. Hence SSE/σ² ∼ χ²(n − p, 0) ∼ χ²_{n−p}.
2.38 a) A⁻ is a generalized inverse of A if AA⁻A = A.
b) i) P = X(X^T X)⁻X^T.
ii) C(X) = C(1). Hence P = 1(1^T 1)⁻¹1^T = (1/3)11^T.
iii) SSE = Y^T(I − P)Y = Y^T Y − (1/3)(Σ Yi)² = 1 + 4 + 9 − (1 + 2 + 3)²/3 =
14 − 36/3 = 2.
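The arithmetic in iii) can be verified in R (a quick check using Y = (1, 2, 3)^T and the design X = 1 as above):
y <- c(1,2,3)
one <- rep(1,3)
P <- one %*% solve(t(one)%*%one) %*% t(one)  #P = (1/3) 1 1^T
t(y) %*% (diag(3) - P) %*% y                 #SSE = 2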
2.39 a)

Source   df      SS                      MS       E(MS)
Reduced  n − p1  SSE(R) = Y^T(I − P1)Y   MSE(R)   E(MSE(R))
Full     n − p   SSE = Y^T(I − P)Y       MSE      σ²

with test statistic
FR = [SSE(R) − SSE]/(p2 MSE) = [Y^T(P − P1)Y/p2] / [Y^T(I − P)Y/(n − p)],
where
E(MSE(R)) = [σ² tr(I − P1) + β^T X^T(I − P1)Xβ]/(n − p1)
= [σ²(n − p1) + β^T X^T(I − P1)Xβ]/(n − p1).
If H0 is true, then Y ∼ Nn(X1β1, σ²I), and E(MSE(R)) = σ².
b) Need to show that SSE(R) − SSE = Y^T(P − P1)Y and SSE = Y^T(I − P)Y
are independent. This result follows from Craig's Theorem since
(P − P1)(I − P) = P − P1 − P + P1 = 0.
c) By Theorem 2.14 g), if Y ∼ Nn(µ, σ²I) then Y^T AY/σ² ∼ χ²(r, µ^T Aµ/(2σ²))
iff A is idempotent with rank(A) = tr(A) = r.
This theorem applies to SSE/σ² with A = I − P and r = n − p. Then µ = Xβ,
and µ^T(I − P)µ = 0 since PX = X. Hence SSE/σ² ∼ χ²(n − p, 0) ∼ χ²_{n−p}.
Similarly, when H0 is true, the theorem applies to Y^T(P − P1)Y/σ² with
A = P − P1 and r = p − p1 = p2. Then µ = X1β1, and µ^T(P − P1)µ = 0
since PX1 = P1X1 = X1. Hence Y^T(P − P1)Y/σ² ∼ χ²(p2, 0) ∼ χ²_{p2}. Thus
FR = [Y^T(P − P1)Y/p2] / [Y^T(I − P)Y/(n − p)] ∼ F_{p2, n−p}.
2.40 a) Y^T AY ∼ χ²(rank(A)) iff AΣ is idempotent and µ^T Aµ = 0 by
Theorem 2.13.
b) This proof is similar to the proof of Theorem 2.8. Let u = AY and
w = BY. Then AY and BY are independent iff Cov(w, u) = BΣA = 0. Thus AY
and BY are independent. Let g(AY) = Y^T A^T A⁻AY = Y^T AA⁻AY = Y^T AY.
Then g(AY) = Y^T AY is independent of BY since AY is independent of BY.
c) Ȳ = 1^T Y/n and Σ_{i=1}^n (Yi − Ȳ)² = Y^T(I − P1)Y where P1 = 11^T/n
is the projection matrix on C(1) since Σ_{i=1}^n (Yi − Ȳ)² is the residual sum of
squares for the model Y = 1µ + e with least squares estimator µ̂ = Ȳ. Hence
the quantities are independent if BY = 1^T Y and Y^T AY = Y^T(I − P1)Y
are independent, or if 1^T I(I − P1) = 0 by b). This result holds since
1^T P1 = 1^T, because P1 being the projection matrix on C(1) means P1 1 = 1.
2.41 a) β̂ = (X^T X)⁻¹X^T Y and σ̂² = (1/n) Σ_{i=1}^n ri² = (1/n)SSE =
(1/n)Y^T(I − P)Y.
b) By Theorem 2.14 g), if Y ∼ Nn(µ, σ²I) then Y^T AY/σ² ∼ χ²(r, µ^T Aµ/(2σ²))
iff A is idempotent with rank(A) = tr(A) = r.
This theorem applies to SSE/σ² with A = I − P, r = n − p, and µ = Xβ.
Then µ^T(I − P)µ = 0 since PX = X. Hence SSE/σ² ∼ χ²(n − p, 0) ∼ χ²_{n−p}.
Thus
(n − p)σ̂²/σ² = [(n − p)/n] SSE/σ² ∼ [(n − p)/n] χ²_{n−p}.
c) BY and Y^T AY are independent if BA = 0 by Theorem 2.8 b). Here
BA = (X^T X)⁻¹X^T(I − P) = 0 since X^T P = X^T. Thus the MLEs are independent.
d) The MLE is the generalized least squares estimator β̂ = (X^T V⁻¹X)⁻¹X^T V⁻¹Y.
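A small numerical illustration of the formula in d) (a sketch with an assumed known diagonal V; the choices of n, β, and V below are arbitrary):
set.seed(5)
n <- 30
X <- cbind(1,rnorm(n))
V <- diag(seq(1,3,length.out=n))            #assumed known covariance matrix
y <- X %*% c(1,2) + rnorm(n,sd=sqrt(diag(V)))
Vi <- solve(V)
solve(t(X)%*%Vi%*%X) %*% t(X)%*%Vi%*%y      #generalized least squares estimate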

2.42 Note that H = P and that Z = Y − µ ∼ Nn (0, Σ).


a) i) E[(Y − µ)T A(Y − µ)] = E[Z T AZ] = tr(AΣ) + 0T A0 = tr(AΣ)
by Theorem 2.5 using E(Z) = 0.
Alternatively, E(ZZ T ) = Σ since E(Z) = 0. Since the quadratic form
is a scalar and the trace is a linear operator, E[Z T AZ] = E[tr(Z T AZ)] =
E[tr(AZZ T )] = tr(E[AZZ T ]) = tr(AΣ).
Normality is not needed for this result.
ii) AΣ is idempotent by Theorem 2.13.
iii) BΣA = 0 (or AΣB T = 0) by Theorem 2.8.
b) i) (1/σ)(I − H)Y ∼ Nn((1/σ)(I − H)Xβ, (1/σ)(I − H)σ²I(1/σ)(I − H)) ∼
Nn(0, I − H) since HX = X.
ii) By Theorem 2.14 g), if Y ∼ Nn(µ, σ²I) then Y^T AY/σ² ∼ χ²(r, µ^T Aµ/(2σ²))
iff A is idempotent with rank(A) = tr(A) = r.
This theorem applies to u = Y^T(I − H)Y/σ² = SSE/σ² with A = I − H,
r = n − p, and µ = Xβ. Then µ^T(I − H)µ = 0 since HX = X. Hence
SSE/σ² ∼ χ²(n − p, 0) ∼ χ²_{n−p}.
iii) By Theorem 2.8 b), independence follows since H(I − H) = 0.
2.43 a) Q(β) = Σ_{i=1}^n (yi − βxi)². By the chain rule,
dQ(β)/dβ = −2 Σ_{i=1}^n (yi − βxi)xi.
Setting the derivative equal to 0 and calling the unique solution β̂ gives
Σ_{i=1}^n xi yi = β̂ Σ_{i=1}^n xi², or
β̂ = Σ_{i=1}^n xi yi / Σ_{i=1}^n xi².
b) MSE = [1/(n − 1)] Σ_{i=1}^n ri² since p = 1.
c) Since yi ∼ N(xiβ, σ²), the likelihood function
L(β, σ²) = Π_{i=1}^n f_{yi}(yi) = Π_{i=1}^n [1/(√(2π) σ)] exp[−(yi − xiβ)²/(2σ²)]
= cn (1/σ^n) exp[−(1/(2σ²)) Σ_{i=1}^n (yi − xiβ)²] = cn (1/σ^n) exp[−Q(β)/(2σ²)]
where Q(β) is the least squares criterion. For fixed σ > 0, maximizing L(β, σ)
is equivalent to minimizing the least squares criterion Q(β). Thus β̂ from a)
is the MLE of β. To find the MLE of σ², use the profile likelihood function
Lp(σ²) = Lp(τ) = cn (1/σ^n) exp[−Q/(2σ²)] = cn (1/τ^{n/2}) exp[−Q/(2τ)]
where Q = Q(β̂). Then the log profile likelihood function is
log(Lp(τ)) = dn − (n/2) log(τ) − Q/(2τ),
and
d log(Lp(τ))/dτ = −n/(2τ) + Q/(2τ²), which is set equal to 0.
Thus nτ = Q or τ̂ = σ̂² = Q/n = Σ_{i=1}^n ri²/n, which is a unique solution.
Now
d² log(Lp(τ))/dτ² = n/(2τ²) − 2Q/(2τ³) evaluated at τ̂ equals
n/(2τ̂²) − 2nτ̂/(2τ̂³) = −n/(2τ̂²) < 0.
Thus σ̂² is the MLE of σ².
2.44 Let Y1 and Y2 be independent random variables with means θ and
2θ respectively. Find the least squares estimate of θ and the residual sum of
squares.
Solution:
Y = Xβ + e, that is, (Y1, Y2)^T = (1, 2)^T θ + (e1, e2)^T.
Then
θ̂ = (X^T X)⁻¹X^T Y = [(1 2)(1, 2)^T]⁻¹(1 2)(Y1, Y2)^T = (Y1 + 2Y2)/5.
Now Ŷ = Xθ̂ = (1, 2)^T (Y1 + 2Y2)/5 = ((Y1 + 2Y2)/5, (2Y1 + 4Y2)/5)^T.
Thus
RSS = [Y1 − (Y1 + 2Y2)/5]² + [Y2 − (2Y1 + 4Y2)/5]².
2.45 a) √n A(β̂ − β) converges in distribution to Nr(0, σ²AWA^T).
b) A(Zn − µ) converges in distribution to Nr(0, AA^T).
2.46 a)
L(β, σ²) = f(y1, . . . , yn | β, σ²) = (2π)^{−n/2}(σ²)^{−n/2} exp[−(1/(2σ²)) Σ_{i=1}^n (yi − βxi)²].
Fix σ > 0. Then L(β, σ²) is maximized by minimizing Σ_{i=1}^n (yi − βxi)²,
which gives the least squares estimator. Taking the derivative with respect to
β and setting it equal to 0, the solution is the MLE if the second derivative
is positive. The solution is
β̂ = Σ_{i=1}^n xi Yi / Σ_{i=1}^n xi²,
and it is easy to check that the second derivative is positive.
b)
E(β̂) = E[Σ_{i=1}^n xi Yi / Σ_{i=1}^n xi²] = Σ_{i=1}^n xi E[Yi] / Σ_{i=1}^n xi²
= Σ_{i=1}^n xi βxi / Σ_{i=1}^n xi² = β Σ_{i=1}^n xi² / Σ_{i=1}^n xi² = β.
c) Note that β̂ is a linear combination of independent normal random
variables, so β̂ has a normal distribution with the following mean and variance.
We have already computed the mean, so we need only compute the variance:
Var(β̂) = Var[Σ_{i=1}^n xi Yi / Σ_{i=1}^n xi²] = [1/(Σ_{i=1}^n xi²)²] Var(Σ_{i=1}^n xi Yi)
= [1/(Σ_{i=1}^n xi²)²] Σ_{i=1}^n xi² Var(Yi) = σ² / Σ_{i=1}^n xi².
Therefore, β̂ ∼ N(β, σ²/Σ_{i=1}^n xi²).
d) For the expectation we have:
E[U] = E[Σ Yi / Σ xi] = Σ E[Yi] / Σ xi = Σ βxi / Σ xi = β,
E[V] = E[(1/n) Σ Yi/xi] = (1/n) Σ E[Yi]/xi = (1/n) Σ βxi/xi = β.
For the variances we have
Var[U] = Var[Σ Yi / Σ xi] = Σ Var[Yi] / (Σ xi)² = nσ²/(Σ xi)² = σ²/(n x̄²),
Var[V] = Var[(1/n) Σ Yi/xi] = (1/n²) Σ Var[Yi/xi] = (1/n²) Σ σ²/xi²
= (σ²/n²) Σ 1/xi².
We do know that if ai > 0 then 1/[(1/n) Σ_{i=1}^n (1/ai)] ≤ (1/n) Σ_{i=1}^n ai. Now
set ai = 1/xi². Then we have
n / Σ_{i=1}^n xi² ≤ (1/n) Σ_{i=1}^n 1/xi²,
and therefore Var(β̂) ≤ Var(V).
Moreover, since Σ_{i=1}^n (xi − x̄)² ≥ 0, we have Σ_{i=1}^n xi² ≥ n x̄², hence
Var(β̂) ≤ Var(U).
Finally, since f(t) = 1/t² is convex, by Jensen's inequality
1/x̄² ≤ (1/n) Σ_{i=1}^n 1/xi²,
thus
Var(β̂) ≤ Var(U) ≤ Var(V).
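A simulation sketch illustrates the variance ordering; the sample size, β, σ, and the distribution of the xi below are arbitrary choices.
set.seed(3)
x <- runif(20,1,5); beta <- 2
bhat <- u <- v <- numeric(5000)
for(i in 1:5000){
  y <- beta*x + rnorm(20)
  bhat[i] <- sum(x*y)/sum(x^2)
  u[i] <- sum(y)/sum(x)
  v[i] <- mean(y/x)
}
c(var(bhat),var(u),var(v))  #increasing, as the inequalities predict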
2.47 For symmetry the solutions are obvious, since
(1) the transpose of a difference is the difference of the transpose, and
(2) we know H is symmetric, I is symmetric, and since the constant n−1
does not affect the transpose operation, n−1 J T = n−1 is a symmetric matrix.
For idempotent, we need to show that squaring each matrix returns the
original. Recall that H is idempotent, because

H 2 = [X(X T X)−1 X T ][X(X T X)−1 X T ]


= X(X T X)−1 (X T X)(X T X)−1 X T
= X(X T X)−1 X T = H

Now, we can write (a) (I − n−1 J)2 = I 2 − 2n−1 J + n−2 J 2 . But, since
J = 11T , we have J 2 = (11T )2 = 11T 11T = 1n1T = n11T . Thus,

(I − n−1 J)2 = I 2 − 2n−1 J + n−1 J = I − n−1 J



(b) For SSE, we have (I − H)² = I² − 2H + H². Since I and H are idempotent,
we can see this matrix is idempotent, i.e., (I − H)² = I − 2H + H = I − H.
(c) Lastly, for SSRegr take (H − n⁻¹J)² = H² − n⁻¹HJ − n⁻¹JH + n⁻²J².
We know i) H² = H, and ii) n⁻²J² = n⁻¹J.
Further, from the hint, let X = [1 X∗] so that HX = H[1 X∗] =
[H1 HX∗]. But HX = X(X^T X)⁻¹X^T X = XI = X. So we have
HX = [H1 HX∗] = X = [1 X∗]. Since the partitioned components in
this equality have the same orders, we can therefore conclude from the first
partitioned component that H1 = 1. Then HJ = H11^T = 11^T = J. A
similar argument applied to X^T = [1 X∗]^T and X^T H yields 1^T H = 1^T,
so that JH = 11^T H = 11^T = J. Combining these various
results together gives

(H − n−1 J)2 = H 2 − n−1 HJ − n−1 JH + n−2 J 2


= H − n−1 J − n−1 J + n−1 J
= H − n−1 J
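A numerical check of the three idempotent matrices (an illustration; any design matrix with a column of ones will do):
set.seed(4)
n <- 6
X <- cbind(1,rnorm(n),rnorm(n))
H <- X %*% solve(t(X)%*%X) %*% t(X)
J <- matrix(1,n,n)
A1 <- diag(n) - J/n; A2 <- diag(n) - H; A3 <- H - J/n
max(abs(A1%*%A1 - A1),abs(A2%*%A2 - A2),abs(A3%*%A3 - A3))  #essentially zero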

2.48 a) E(Yi|xi) = ai + βxi.
b) By the chain rule,
dQ(η)/dη = −2 Σ_{i=1}^n (Yi − ai − ηxi)xi = −2[Σ_{i=1}^n xi(Yi − ai) − η Σ_{i=1}^n xi²].
Setting the derivative equal to 0 and calling the unique solution β̂ gives
η Σ_{i=1}^n xi² = Σ_{i=1}^n xi(Yi − ai), or
β̂ = Σ_{i=1}^n xi(Yi − ai) / Σ_{i=1}^n xi².
Now
d²Q(η)/dη² = 2 Σ_{i=1}^n xi² > 0.
Hence β̂ is the least squares estimator.
c) If xi ≡ 1, then
β̂ = Σ_{i=1}^n (Yi − ai)/n, dQ(η)/dη = 2nη − 2 Σ_{i=1}^n (Yi − ai), and
d²Q(η)/dη² = 2n > 0.
d) For fixed σ², maximizing the likelihood is equivalent to maximizing
exp[−(1/(2σ²)) Σ (Yi − ai − βxi)²],
which is equivalent to minimizing Σ (Yi − ai − βxi)². So β̂ maximizes L(β, σ²)
regardless of the value of σ² > 0. Hence β̂ is the MLE of β.
e) Let Q = Σ_{i=1}^n (Yi − ai − β̂xi)². Then the MLE of σ² can be found by
maximizing the log profile likelihood log(LP(σ²)) where
LP(σ²) = [1/(2πσ²)^{n/2}] exp[−Q/(2σ²)].
Let τ = σ². Then
log(Lp(σ²)) = c − (n/2) log(σ²) − Q/(2σ²),
and
log(Lp(τ)) = c − (n/2) log(τ) − Q/(2τ).
Hence
d log(LP(τ))/dτ = −n/(2τ) + Q/(2τ²), which is set equal to 0,
or −nτ + Q = 0 or nτ = Q or
τ̂ = Q/n = σ̂²,
which is a unique solution.
Now
d² log(LP(τ))/dτ² = n/(2τ²) − 2Q/(2τ³) evaluated at τ = τ̂ equals
n/(2τ̂²) − 2nτ̂/(2τ̂³) = −n/(2τ̂²) < 0.
Thus σ̂² is the MLE of σ².

2.49 a) Use E(Y′AY) = tr(AΣ) + E(Y′)AE(Y) with A = I, Σ =
Cov(Y) = σ²I, and E(Y) = Xβ. Note that Y ∼ Nn(Xβ, σ²I).
Then E(Y′IY) = tr(Iσ²I) + β′X′IXβ = nσ² + β′X′Xβ.
Alternatively, E(Y′Y) = Σ_{i=1}^n E(Yi²) = Σ_{i=1}^n [V(Yi) + (E[Yi])²] =
Σ_{i=1}^n [σ² + (xi′β)²] = nσ² + β′X′Xβ.
b) Note that x ∼ Np(0, Σ). Then E(x′Ax) = tr(A Cov(x)) + E(x′)AE(x) =
tr(AΣ) + E(x′)AE(x) = tr(AΣ) since E(x) = 0.
2.49 See Example 2.1.
Chapter 3
3.7 Note that ZA^T ZA = Z^T Z,
GA ηA = [Gη ; √λ*_2 η] (a stacked vector),
and ZA^T GA ηA = Z^T Gη. Then
RSS(ηA) = ‖ZA − GA ηA‖₂² = (ZA − GA ηA)^T(ZA − GA ηA) =
ZA^T ZA − ZA^T GA ηA − ηA^T GA^T ZA + ηA^T GA^T GA ηA =
Z^T Z − Z^T Gη − η^T G^T Z + [η^T G^T  √λ*_2 η^T][Gη ; √λ*_2 η].
Thus
QN(ηA) = Z^T Z − Z^T Gη − η^T G^T Z + η^T G^T Gη + λ*_2 η^T η + γ‖ηA‖₁ =
‖Z − Gη‖₂² + λ*_2 ‖η‖₂² + [λ*_1/√(1 + λ*_2)] ‖ηA‖₁ =
RSS(η) + λ*_2 ‖η‖₂² + λ*_1 ‖η‖₁ = Q(η).
3.12 a) SSE = Y^T(I − P)Y and SSR = Y^T(P − (1/n)11^T)Y = Y^T(P − P1)Y
where P1 = (1/n)11^T = 1(1^T 1)⁻¹1^T is the projection matrix on C(1).
b) E(MSE) = σ², so E(SSE) = (n − r)σ². By a) and Theorem 2.5,
E(SSR) = β^T X^T(P − (1/n)11^T)Xβ + tr[σ²(P − (1/n)11^T)]
= β^T X^T(P − (1/n)11^T)Xβ + σ²(r − 1).
When H0 is true, Xβ = 1β1 and E(SSR) = σ²(r − 1).
c) By Theorem 2.14 g), if Y ∼ Nn(µ, σ²I) then Y^T AY/σ² ∼ χ²(a, µ^T Aµ/(2σ²))
iff A is idempotent with rank(A) = tr(A) = a.
i) Theorem 2.14 g) applies to SSE/σ² with A = I − P and a = n − r. Since
µ = Xβ and µ^T(I − P)µ = 0 because PX = X, SSE/σ² ∼ χ²(n − r, 0) ∼ χ²_{n−r}.
Thus SSE ∼ σ²χ²_{n−r} regardless of whether H0 is true or false.
ii) Theorem 2.14 g) applies to SSR/σ² with A = P − P1 and a = r − 1.
If H0 is true, then µ = 1β1 and µ^T(P − P1)µ = 0 since 1 is the first
column of X and P1 is the projection matrix on C(1), so both P and P1 fix
1: P 1 = P1 1 = 1. Hence SSR/σ² ∼ χ²(r − 1, 0) ∼ χ²_{r−1}. Thus SSR ∼ σ²χ²_{r−1}.
iii) SSE and SSR are independent by Craig's theorem since (I − P)(P − P1) =
P − P1 − P + P1 = 0. MSE = SSE/(n − r) and MSR = SSR/(r − 1). Thus
MSR/MSE = [SSR/(σ²(r − 1))] / [SSE/(σ²(n − r))] ∼ F_{r−1, n−r}.
3.13 a) i) Let a and b be constant vectors. Then aT β is estimable if there
exists a linear unbiased estimator bT Y so E(bT Y ) = aT β. Also, the quantity
aT β is estimable iff aT = bT X iff a = X T b iff a ∈ C(X T ).
ii) Let a least squares estimator β̂ be any solution to the normal equations
X T X β̂ = X T Y . Then the least squares estimator of aT β is aT β̂ = bT X β̂ =
bT P Y .

iii) M SE = Y T (I − P )Y /(n − r) = SSE/(n − r).


b) ii) E(bT P Y ) = bT P Xβ = bT Xβ = aT β.
iii) E(SSE) = E(Y^T(I − P)Y) = tr[σ²(I − P)I] + µ^T(I − P)µ by
Theorem 2.5 where µ = Xβ. Hence E(SSE) = σ² tr(I − P) = σ²(n − r).
Hence E(MSE) = E(SSE)/(n − r) = σ².
c) If aT β is estimable and a least squares estimator β̂ is any solution to
the normal equations X T X β̂ = X T Y , then aT β̂ is the unique BLUE of
aT β.
d) SSE = Y^T(I − P)Y and SSR = Y^T(P − (1/n)11^T)Y = Y^T(P − P1)Y
where P1 = (1/n)11^T = 1(1^T 1)⁻¹1^T is the projection matrix on C(1).
3.14 a) Note that β is estimable for i) since X for i) has full rank 2.
Note that β is not estimable for ii) since X for ii) does not have full rank
(rank(X) = 1).
b) For i), the design matrix X has rows (2, 0), (1, 1), and (0, 2), so

    B = (X^T X)⁻¹X^T = [ 5 1 ]⁻¹ X^T.
                       [ 1 5 ]

If A is the 2 × 2 matrix with rows (a11, a12) and (a21, a22) and
d = a11 a22 − a21 a12 ≠ 0, then A⁻¹ = (1/d) times the matrix with rows
(a22, −a12) and (−a21, a11). Thus

    B = (1/24) [  5 −1 ] [ 2 1 0 ] = (1/24) [ 10 4 −2 ].
               [ −1  5 ] [ 0 1 2 ]          [ −2 4 10 ]

c) Note that b^T Y is an unbiased estimator of b^T Xβ = a^T β with
a^T = b^T X. If b = 1, then, using the design for ii) with rows (3, 6), (2, 4),
and (1, 2),
a^T = 1^T X = (1 1 1)X = (6 12).
Thus the estimable function a^T β = 6β1 + 12β2 has unbiased estimator
b^T Y = 1^T Y = Y1 + Y2 + Y3.
Alternatively, let b = 1 and a be as above. Then the unbiased least squares
estimator is a^T β̂ = b^T PY where

    P = (3, 2, 1)^T [(3 2 1)(3, 2, 1)^T]⁻¹ (3 2 1) = (1/14) [ 9 6 3 ].
                                                            [ 6 4 2 ]
                                                            [ 3 2 1 ]

Since b = 1, the unbiased least squares estimator is
(1/14)(18 12 6)(Y1, Y2, Y3)^T = (18/14)Y1 + (12/14)Y2 + (6/14)Y3.
Since E(Y) = Xβ, note that E(a^T β̂) =
(18/14)(3β1 + 6β2) + (12/14)(2β1 + 4β2) + (6/14)(β1 + 2β2) =
(84/14)β1 + (168/14)β2 = 6β1 + 12β2.
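The matrices in b) and c) can be reproduced numerically (a verification sketch using the two design matrices described above):
Xi <- matrix(c(2,1,0, 0,1,2),ncol=2)  #design for i) with rows (2,0),(1,1),(0,2)
round(24*solve(t(Xi)%*%Xi)%*%t(Xi))   #24 B from b)
x1 <- c(3,2,1)                        #C(X) for ii) is spanned by (3,2,1)^T
P <- x1 %*% t(x1)/sum(x1^2)
round(14*P)                           #rows (9,6,3),(6,4,2),(3,2,1)
t(rep(1,3)) %*% P                     #(18,12,6)/14, the coefficients of Y1,Y2,Y3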
14 14 14
3.15 (a)
Since y ∼ Np (Aβ, σ 2 I p ), it follows that yb = P A y ∼ Np (P A Aβ , P A σ2 I p P >
A ).
But P A Aβ = Aβ, and P A σ2 I p P > 2 >
A = σ P AP A = σ P AP A = σ P A .
2 2

Hence,
yb ∼ Np (Aβ , σ2 P A ).
(b)
e = y−y b=y− 
P A y = (I − P A )y. Therefore, we have

(I − P A )y ∼ Np (I − P A )Aβ , (I − P A )σ2 I p (I − P A )> where
(I p − P A )Aβ = Aβ − Aβ = 0, and
(I p − P A )σ2 I p (I p − P A )> = σ2 (I p − P A ). Hence

 
e ∼ Np 0, σ (I p − P A ) .
2

(c)
 
Cov(y, e) = Cov y , (I − P A )y = Cov(y )(I p −P A )> = σ2 I p (I p −P A ) = σ2 (I p −P A ) 6= 0.

Hence y and e are not independent.


(d)

Cov(b
y , e) = Cov (P A y, (I − P A )y) = P A Cov(y )(I p − P A )>
= P A σ I p (I p − P A ) = σ P A (I p − P A ) = 0
2 2

b and e are independent by Theorem 2.8a).


This proves that y
3.16 (a) Given that C(Z) ⊂ C(X), let z j be the jth column of Z. Then
zj ∈ C(Z) ⊂ C(X). Thus, zj = Xbj for some bj . Hence

Z = (z 1 , . . . , zt ) = X(b1 , . . . , bt ) = XB.

(b)

P_X P_Z = X(X^T X)⁻X^T Z(Z^T Z)⁻Z^T
= X(X^T X)⁻X^T XB(Z^T Z)⁻Z^T since Z = XB
= P_X XB(Z^T Z)⁻Z^T = XB(Z^T Z)⁻Z^T = Z(Z^T Z)⁻Z^T = P_Z.
(c)
(P_X − P_Z)² = P_X² − P_X P_Z − P_Z P_X + P_Z²
= P_X − P_Z − (P_X P_Z)^T + P_Z = P_X − P_Z.
(d)
SSE2 − SSE = Y^T(P_X − P_Z)Y
= Y^T(P_X − P_Z)(P_X − P_Z)Y
= Y^T(P_X − P_Z)^T(P_X − P_Z)Y
= {(P_X − P_Z)Y}^T{(P_X − P_Z)Y} ≥ 0.
(e) Use Craig's Theorem: true since (P_X − P_Z)(I − P_X) = 0.
(f)
SSE/σ² ∼ χ²(df1, ncp1 = 0), df1 = n − rank(X),
(SSE2 − SSE)/σ² ∼ χ²(df2, ncp2), df2 = rank(X) − rank(Z) > 0.
Here df2 > 0 because C(Z) is a proper subset of C(X).
ncp2 = [1/(2σ²)](Xβ)^T(P_X − P_Z)Xβ = [1/(2σ²)]β^T(X^T P_X X − X^T P_Z X)β
= [1/(2σ²)]β^T(X^T X − X^T P_Z X)β = [1/(2σ²)](Xβ)^T(I − P_Z)Xβ > 0.
The last inequality follows from the fact that C(Z) is a proper subset of
C(X).
Under the null hypothesis H0 : Xβ = Zγ, we have ncp2 = 0. Therefore,
F > c will be a test for H0 : E(Y) ∈ C(Z), where
F = [(SSE2 − SSE)/df2] / [SSE/df1]
has an F distribution under H0.

3.17 (a)

        1 1 0 0 1 0 0 0
        1 1 0 0 0 1 0 0
        1 1 0 0 0 0 1 0
        1 0 1 0 1 0 0 0
X =     1 0 1 0 0 1 0 0
        1 0 1 0 0 1 0 0
        1 0 1 0 0 0 1 0
        1 0 0 1 0 0 0 1
        1 0 0 1 0 0 0 1

(b)
X^T Y = (Σi Σj Σk Yijk, Σj Σk Y1jk, Σj Σk Y2jk, Σj Σk Y3jk,
Σi Σk Yi1k, Σi Σk Yi2k, Σi Σk Yi3k, Σi Σk Yi4k)^T.
(c)
First, note that
E(Ȳ.j.) = Σ_{i=1}^3 Σ_{k=1}^{nij} (µ + αi + βj) / n.j
= (n.j µ + Σ_{i=1}^3 nij αi + n.j βj) / n.j = µ + βj + Σi nij αi / n.j.
Then
E(Ȳ.1.) = µ + β1 + (α1 + α2)/2,
E(Ȳ.3.) = µ + β3 + (α1 + α2)/2.
Hence E(Ȳ.1.) − E(Ȳ.3.) = β1 − β3, and Ȳ.1. − Ȳ.3. is a LUE for β1 − β3. More
work is needed to show Ȳ.1. − Ȳ.3. is an OLS estimator of β1 − β3.
(d)
E(Ȳ1..) = µ + α1 + (β1 + β2 + β3)/3,
E(Ȳ3..) = µ + α3 + (2β4)/2 = µ + α3 + β4.
⇒ E(Ȳ1.. − Ȳ3..) = α1 − α3 + (β1 + β2 + β3 − 3β4)/3 ≠ α1 − α3.
Therefore, Ȳ1.. − Ȳ3.. is not an unbiased estimator for α1 − α3, hence it cannot
be the OLS estimator of α1 − α3.
3.18 a) X′ has rows (1, 1, 2) and (2, 2, 4), so C(X′) = span{(1, 2)^T}.
For b), c), and d), if a is a 2 × 1 constant vector, then a′β is estimable iff
a ∈ C(X′).
b) Yes, estimable since 5β1 + 10β2 = (5 10)β, and (5, 10)^T = 5(1, 2)^T ∈ C(X′).
c) No, not estimable since β1 = (1 0)β, and (1, 0)^T ∉ C(X′).
d) No, not estimable since β1 − 2β2 = (1 −2)β, and (1, −2)^T ∉ C(X′).

3.20 Since ai^T β is estimable, ai ∈ C(X^T). Thus the constant vector
a = Σ_{i=1}^k ci ai ∈ C(X^T). Hence a^T β = Σ_{i=1}^k ci ai^T β is estimable.
There are several other correct solutions, such as: there exist constant
vectors bi such that E(bi^T Y) = ai^T β. Let b = Σ_{i=1}^k ci bi. Then E(b^T Y) =
Σ_{i=1}^k ci E(bi^T Y) = Σ_{i=1}^k ci ai^T β. Hence Σ_{i=1}^k ci ai^T β is estimable.
(This problem proves that an arbitrary linear combination of estimable
functions is an estimable function.)
3.21 a) E(X^T AX) = tr(AΣ) + [E(X)]^T AE(X) with A = Σ⁻. Hence
E(X^T AX) = tr(Σ⁻Σ) + µ^T Σ⁻µ.
b) i) Xβ = (β0 + β1 , β0 + β1 , β0 + β2 , β0 + β2 , ..., β0 + βp−1 , β0 +
βp−1 , β0 , β0 )T .
ii) β1 = β2 = · · · = βp−1 = −β0
3.21 See Example 3.2 with the p-value omitted from the ANOVA table.
Chapter 4
4.11 a) (X^T X)⁻¹X^T E(Y∗) = (X^T X)⁻¹X^T Xβ̂ = β̂.
b) ACov(Y∗)A^T = (X^T X)⁻¹X^T diag(ri²)X(X^T X)⁻¹.
c) We will use Xβ̂ = PY and P X_I = X_I. Then E(β̂_I) = (X_I^T X_I)⁻¹X_I^T E(Y∗) =
(X_I^T X_I)⁻¹X_I^T Xβ̂ = (X_I^T X_I)⁻¹X_I^T PY = (X_I^T X_I)⁻¹X_I^T Y = β̂_I.
d) ACov(Y∗)A^T = (X_I^T X_I)⁻¹X_I^T diag(ri²)X_I(X_I^T X_I)⁻¹.
Chapter 10
10.1
a) Since Y is a (random) scalar and E(w) = 0, Σ u,Y = E[(u−E(u))(Y −
E(Y ))T ] = E[w(Y − E(Y ))] = E(wY ) − E(w)E(Y ) = E(wY ).

b) Using the definition of z and r, note that Y = m(z) + e and


w = r + (Σ u η)η T w. Hence E(wY ) = E[(r + (Σ u η)η T w)(m(z) + e)] =
E[(r + (Σ u η)ηT w)m(z)] + E[r + (Σ u η)ηT w]E(e) since e is independent of
x. Since E(e) = 0, the latter term drops out. Since m(z) and η T wm(z) are
(random) scalars, E(wY ) = E[m(z)r] + E[ηT w m(z)]Σ u η.

c) Using result b), Σ_u⁻¹ Σ_{u,Y} = Σ_u⁻¹ E[m(z)r] + Σ_u⁻¹ E[η^T w m(z)]Σ_u η
= E[η^T w m(z)]Σ_u⁻¹Σ_u η + Σ_u⁻¹ E[m(z)r] = E[η^T w m(z)]η + Σ_u⁻¹ E[m(z)r]
and the result follows.

d) E(wz) = E[(u − E(u))uT η] = E[(u − E(u))(uT − E(uT ) + E(uT ))η]


= E[(u − E(u))(uT − E(uT ))]η + E[u − E(u)]E(uT )η = Σ u η.

e) If m(z) = z, then c(u) = E(ηT wz) = ηT E(wz) = η T Σ u η = 1 by


result d).

f) Since z is a (random) scalar, E(zr) = E(rz) = E[(w −(Σ u η)η T w)z] =


E(wz) − (Σ u η)η T E(wz). Using result d), E(rz) = Σ u η − Σ u ηη T Σ u η =
Σ u η − Σ u η = 0.

g) Since z and r are linear combinations of u, the joint distribution of z and


r is multivariate normal. Since E(r) = 0, z and r are uncorrelated and thus
independent. Hence m(z) and r are independent and b(u) = Σ −1 u E[m(z)r] =
Σ −1
u E[m(z)]E(r) = 0.

11.3 Tables

Tabled values are F(k,d, 0.95) where P (F < F (k, d, 0.95)) = 0.95.
00 stands for ∞. Entries were produced with the qf(.95,k,d) command
in R. The numerator degrees of freedom are k while the denominator degrees
of freedom are d.
k 1 2 3 4 5 6 7 8 9 00
d
1 161 200 216 225 230 234 237 239 241 254
2 18.5 19.0 19.2 19.3 19.3 19.3 19.4 19.4 19.4 19.5
3 10.1 9.55 9.28 9.12 9.01 8.94 8.89 8.85 8.81 8.53
4 7.71 6.94 6.59 6.39 6.26 6.16 6.09 6.04 6.00 5.63
5 6.61 5.79 5.41 5.19 5.05 4.95 4.88 4.82 4.77 4.37
6 5.99 5.14 4.76 4.53 4.39 4.28 4.21 4.15 4.10 3.67
7 5.59 4.74 4.35 4.12 3.97 3.87 3.79 3.73 3.68 3.23
8 5.32 4.46 4.07 3.84 3.69 3.58 3.50 3.44 3.39 2.93
9 5.12 4.26 3.86 3.63 3.48 3.37 3.29 3.23 3.18 2.71
10 4.96 4.10 3.71 3.48 3.33 3.22 3.14 3.07 3.02 2.54
11 4.84 3.98 3.59 3.36 3.20 3.09 3.01 2.95 2.90 2.41
12 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.80 2.30
13 4.67 3.81 3.41 3.18 3.03 2.92 2.83 2.77 2.71 2.21
14 4.60 3.74 3.34 3.11 2.96 2.85 2.76 2.70 2.65 2.13
15 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 2.59 2.07
16 4.49 3.63 3.24 3.01 2.85 2.74 2.66 2.59 2.54 2.01
17 4.45 3.59 3.20 2.96 2.81 2.70 2.61 2.55 2.49 1.96
18 4.41 3.55 3.16 2.93 2.77 2.66 2.58 2.51 2.46 1.92
19 4.38 3.52 3.13 2.90 2.74 2.63 2.54 2.48 2.42 1.88
20 4.35 3.49 3.10 2.87 2.71 2.60 2.51 2.45 2.39 1.84
25 4.24 3.39 2.99 2.76 2.60 2.49 2.40 2.34 2.28 1.71
30 4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27 2.21 1.62
00 3.84 3.00 2.61 2.37 2.21 2.10 2.01 1.94 1.88 1.00
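For example, the k = 4, d = 10 and k = 9, d = 20 entries can be reproduced with
qf(0.95,4,10)   #3.48
qf(0.95,9,20)   #2.39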

Tabled values are tα,d where P (t < tα,d ) = α where t has a t distribution
with d degrees of freedom. If d > 29 use the N (0, 1) cutoffs d = Z = ∞.
alpha                                                            pvalue
d    0.005   0.01    0.025   0.05    0.5  0.95   0.975  0.99   0.995   left tail
1  -63.66  -31.82  -12.71  -6.314     0  6.314  12.71  31.82  63.66
2   -9.925  -6.965  -4.303  -2.920    0  2.920  4.303  6.965  9.925
3   -5.841  -4.541  -3.182  -2.353    0  2.353  3.182  4.541  5.841
4   -4.604  -3.747  -2.776  -2.132    0  2.132  2.776  3.747  4.604
5   -4.032  -3.365  -2.571  -2.015    0  2.015  2.571  3.365  4.032
6   -3.707  -3.143  -2.447  -1.943    0  1.943  2.447  3.143  3.707
7   -3.499  -2.998  -2.365  -1.895    0  1.895  2.365  2.998  3.499
8   -3.355  -2.896  -2.306  -1.860    0  1.860  2.306  2.896  3.355
9   -3.250  -2.821  -2.262  -1.833    0  1.833  2.262  2.821  3.250
10  -3.169  -2.764  -2.228  -1.812    0  1.812  2.228  2.764  3.169
11  -3.106  -2.718  -2.201  -1.796    0  1.796  2.201  2.718  3.106
12  -3.055  -2.681  -2.179  -1.782    0  1.782  2.179  2.681  3.055
13  -3.012  -2.650  -2.160  -1.771    0  1.771  2.160  2.650  3.012
14  -2.977  -2.624  -2.145  -1.761    0  1.761  2.145  2.624  2.977
15  -2.947  -2.602  -2.131  -1.753    0  1.753  2.131  2.602  2.947
16  -2.921  -2.583  -2.120  -1.746    0  1.746  2.120  2.583  2.921
17  -2.898  -2.567  -2.110  -1.740    0  1.740  2.110  2.567  2.898
18  -2.878  -2.552  -2.101  -1.734    0  1.734  2.101  2.552  2.878
19  -2.861  -2.539  -2.093  -1.729    0  1.729  2.093  2.539  2.861
20  -2.845  -2.528  -2.086  -1.725    0  1.725  2.086  2.528  2.845
21  -2.831  -2.518  -2.080  -1.721    0  1.721  2.080  2.518  2.831
22  -2.819  -2.508  -2.074  -1.717    0  1.717  2.074  2.508  2.819
23  -2.807  -2.500  -2.069  -1.714    0  1.714  2.069  2.500  2.807
24  -2.797  -2.492  -2.064  -1.711    0  1.711  2.064  2.492  2.797
25  -2.787  -2.485  -2.060  -1.708    0  1.708  2.060  2.485  2.787
26  -2.779  -2.479  -2.056  -1.706    0  1.706  2.056  2.479  2.779
27  -2.771  -2.473  -2.052  -1.703    0  1.703  2.052  2.473  2.771
28  -2.763  -2.467  -2.048  -1.701    0  1.701  2.048  2.467  2.763
29  -2.756  -2.462  -2.045  -1.699    0  1.699  2.045  2.462  2.756
Z   -2.576  -2.326  -1.960  -1.645    0  1.645  1.960  2.326  2.576
CI 90% 95% 99%
0.995 0.99 0.975 0.95 0.5 0.05 0.025 0.01 0.005 right tail
0.01 0.02 0.05 0.10 1 0.10 0.05 0.02 0.01 two tail
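The t cutoffs can be reproduced in the same way with the qt command (and qnorm for the Z row); for example:
qt(0.975,10)   #2.228, the d = 10, 0.975 entry
qt(0.05,20)    #-1.725, the d = 20, 0.05 entry
qnorm(0.975)   #1.960, the Z row entry for 0.975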
REFERENCES

Abraham, B., and Ledolter, J. (2006), Introduction to Regression Modeling,
Thomson Brooks/Cole, Belmont, CA.
Agresti, A. (2002), Categorical Data Analysis, 2nd ed., Wiley, Hoboken,
NJ.
Agresti, A. (2013), Categorical Data Analysis, 3rd ed., Wiley, Hoboken,
NJ.
Agresti, A. (2015), Foundations of Linear and Generalized Linear Models,
Wiley, Hoboken, NJ.
Agulló, J. (1996), “Exact Iterative Computation of the Multivariate Min-
imum Volume Ellipsoid Estimator with a Branch and Bound Algorithm,” in
Proceedings in Computational Statistics, ed. Prat, A., Physica-Verlag, Hei-
delberg, 175-180.
Agulló, J. (1998), “Computing the Minimum Covariance Determinant Es-
timator,” unpublished manuscript, Universidad de Alicante.
Akaike, H. (1973), “Information Theory and an Extension of the Maxi-
mum Likelihood Principle,” in Proceedings, 2nd International Symposium on
Information Theory, eds. Petrov, B.N., and Csakim, F., Akademiai Kiado,
Budapest, 267-281.
Akaike, H. (1977), “On Entropy Maximization Principle,” in Applications
of Statistics, ed. Krishnaiah, P.R, North Holland, Amsterdam, 27-41.
Akaike, H. (1978), “A New Look at the Bayes Procedure,” Biometrics, 65,
53-59.
Aldrin, M., Bølviken, E., and Schweder, T. (1993), “Projection Pursuit
Regression for Moderate Non-linearities,” Computational Statistics & Data
Analysis, 16, 379-403.
Anderson, T.W. (1971), The Statistical Analysis of Time Series, Wiley,
New York, NY.
Anderson, T.W. (1984), An Introduction to Multivariate Statistical Anal-
ysis, 2nd ed., Wiley, New York, NY.
Anton, H., Rorres, C., and Kaul, A. (2019), Elementary Linear Algebra,
Applications Version, 12th ed., Wiley, New York, NY.
Atkinson, A., and Riani, R. (2000), Robust Diagnostic Regression Analysis,
Springer, New York, NY.
Basa, J., Cook, R.D., Forzani, L., and Marcos, M. (2024), “Asymptotic
Distribution of One-Component Partial Least Squares Regression Estimators
in High Dimensions,” The Canadian Journal of Statistics, 52, 118-130.
Bassett, G.W., and Koenker, R.W. (1978), “Asymptotic Theory of Least
Absolute Error Regression,” Journal of the American Statistical Association,
73, 618-622.
Becker, R.A., Chambers, J.M., and Wilks, A.R. (1988), The New S
Language: a Programming Environment for Data Analysis and Graphics,
Wadsworth and Brooks/Cole, Pacific Grove, CA.
Belsley, D.A. (1984), “Demeaning Conditioning Diagnostics Through Cen-
tering,” The American Statistician, 38, 73-77.

Berk, R., Brown, L., Buja, A., Zhang, K., and Zhao, L. (2013), “Valid
Post-Selection Inference,” The Annals of Statistics, 41, 802-837.
Berndt, E.R., and Savin, N.E. (1977), “Conflict Among Criteria for Testing
Hypotheses in the Multivariate Linear Regression Model,” Econometrika, 45,
1263-1277.
Bernholt, T. (2005), “Computing the Least Median of Squares Estimator
in Time O(nd ),” Proceedings of ICCSA 2005, LNCS, 3480, 697-706.
Bernholt, T., and Fischer, P. (2004), “The Complexity of Computing the
MCD-Estimator,” Theoretical Computer Science, 326, 383-398.
Bertsimas, D., King, A., and Mazmunder, R. (2016), “Best Subset Selec-
tion Via a Modern Optimization Lens,” The Annals of Statistics, 44, 813-852.
Bhatia, R., Elsner, L., and Krause, G. (1990), “Bounds for the Variation of
the Roots of a Polynomial and the Eigenvalues of a Matrix,” Linear Algebra
and Its Applications, 142, 195-209.
Bickel, P.J., and Ren, J.–J. (2001), “The Bootstrap in Hypothesis Testing,”
in State of the Art in Probability and Statistics: Festschrift for William R. van
Zwet, eds. de Gunst, M., Klaassen, C., and van der Vaart, A., The Institute
of Mathematical Statistics, Hayward, CA, 91-112.
Bogdan, M., Ghosh, J., and Doerge, R. (2004), “Modifying the Schwarz
Bayesian Information Criterions to Locate Multiple Interacting Quantitative
Trait Loci,” Genetics, 167, 989-999.
Box, G.E.P., and Cox, D.R. (1964), “An Analysis of Transformations,”
Journal of the Royal Statistical Society, B, 26, 211-246.
Breiman, L. (1996), “Bagging Predictors,” Machine Learning, 24, 123-140.
Brillinger, D.R. (1977), “The Identification of a Particular Nonlinear Time
Series,” Biometrika, 64, 509-515.
Brillinger, D.R. (1983), “A Generalized Linear Model with “Gaussian”
Regressor Variables,” in A Festschrift for Erich L. Lehmann, eds. Bickel,
P.J., Doksum, K.A., and Hodges, J.L., Wadsworth, Pacific Grove, CA, 97-
114.
Brown, M.B., and Forsythe, A.B. (1974a), “The ANOVA and Multiple
Comparisons for Data with Heterogeneous Variances,” Biometrics, 30, 719-
724.
Brown, M.B., and Forsythe, A.B. (1974b), “The Small Sample Behavior of
Some Statistics Which Test the Equality of Several Means,” Technometrics,
16, 129-132.
Büchlmann, P., and Yu, B. (2002), “Analyzing Bagging,” The Annals of
Statistics, 30, 927-961.
Buckland, S.T., Burnham, K.P., and Augustin, N.H. (1997), “Model Se-
lection: an Integral Part of Inference,” Biometrics, 53, 603-618.
Budny, K. (2014), “A Generalization of Chebyshev’s Inequality for Hilbert-
Space-Valued Random Variables,” Statistics & Probability Letters, 88, 62-65.
Burnham, K.P., and Anderson, D.R. (2002), Model Selection and Mul-
timodel Inference: a Practical Information-Theoretic Approach, 2nd ed.,
Springer, New York, NY.
Burnham, K.P., and Anderson, D.R. (2004), “Multimodel Inference Understanding
AIC and BIC in Model Selection,” Sociological Methods & Research,
33, 261-304.
Burr, D. (1994), “A Comparison of Certain Bootstrap Confidence Intervals
in the Cox Model,” Journal of the American Statistical Association, 89, 1290-
1302.
Butler, R., and Rothman, E. (1980), “Predictive Intervals Based on Reuse
of the Sample,” Journal of the American Statistical Association, 75, 881-889.
Butler, R.W., Davies, P.L., and Jhun, M. (1993), “Asymptotics for the
Minimum Covariance Determinant Estimator,” The Annals of Statistics, 21,
1385-1400.
Buxton, L.H.D. (1920), “The Anthropology of Cyprus,” The Journal of
the Royal Anthropological Institute of Great Britain and Ireland, 50, 183-235.
Cameron, A.C., and Trivedi, P.K. (1998), Regression Analysis of Count
Data, 1st ed., Cambridge University Press, Cambridge, UK.
Cameron, A.C., and Trivedi, P.K. (2013), Regression Analysis of Count
Data, 2nd ed., Cambridge University Press, Cambridge, UK.
Camponovo, L. (2015), “On the Validity of the Pairs Bootstrap for Lasso
Estimators,” Biometrika, 102, 981-987.
Candes, E., and Tao, T. (2007), “The Dantzig Selector: Statistical Estima-
tion When p Is Much Larger Than n, The Annals of Statistics, 35, 2313-2351.
Cator, E.A., and Lopuhaä, H.P. (2010), “Asymptotic Expansion of the
Minimum Covariance Determinant Estimators,” Journal of Multivariate Anal-
ysis, 101, 2372-2388.
Cator, E.A., and Lopuhaä, H.P. (2012), “Central Limit Theorem and In-
fluence Function for the MCD Estimators at General Multivariate Distribu-
tions,” Bernoulli, 18, 520-551.
Chang, J., and Hall, P. (2015), “Double Bootstrap Methods That Use a
Single Double-Bootstrap Simulation,” Biometrika, 102, 203-214.
Chang, J., and Olive, D.J. (2010), “OLS for 1D Regression Models,” Com-
munications in Statistics: Theory and Methods, 39, 1869-1882.
Charkhi, A., and Claeskens, G. (2018), “Asymptotic Post-Selection Infer-
ence for the Akaike Information Criterion,” Biometrika, 105, 645-664.
Chatterjee, A., and Lahiri, S.N. (2011), “Bootstrapping Lasso Estimators,”
Journal of the American Statistical Association, 106, 608-625.
Chen, C.H., and Li, K.C. (1998), “Can SIR be as Popular as Multiple
Linear Regression?,” Statistica Sinica, 8, 289-316.
Chen, J., and Chen, Z. (2008), “Extended Bayesian Information Criterion
for Model Selection with Large Model Spaces,” Biometrika, 95, 759-771.
Chen, S.X. (2016), “Peter Hall’s Contributions to the Bootstrap,” The
Annals of Statistics, 44, 1821-1836.
Chen, X. (2011), “A New Generalization of Chebyshev Inequality for Ran-
dom Vectors,” see arXiv:0707.0805v2.
Chew, V. (1966), “Confidence, Prediction and Tolerance Regions for the
Multivariate Normal Distribution,” Journal of the American Statistical As-
sociation, 61, 605-617.
Chihara, L., and Hesterberg, T. (2011), Mathematical Statistics with Re-
sampling and R, Wiley, Hoboken, NJ.
Cho, H., and Fryzlewicz, P. (2012), “High Dimensional Variable Selection
Via Tilting,” Journal of the Royal Statistical Society, B, 74, 593-622.
Christensen, R. (1987), Plane Answers to Complex Questions: the Theory
of Linear Models, 1st ed., Springer, New York, NY.
Christensen, R. (2020), Plane Answers to Complex Questions: the Theory
of Linear Models, 5th ed., Springer, New York, NY.
Chun, H., and Keleş, S. (2010), “Sparse Partial Least Squares Regression
for Simultaneous Dimension Reduction and Predictor Selection,” Journal of
the Royal Statistical Society, B, 72, 3-25.
Čı́žek, P. (2006), “Least Trimmed Squares Under Dependence,” Journal
of Statistical Planning and Inference, 136, 3967-3988.
Čı́žek, P. (2008), “General Trimmed Estimation: Robust Approach to Non-
linear and Limited Dependent Variable Models,” Econometric Theory, 24,
1500-1529.
Claeskens, G., and Hjort, N.L. (2008), Model Selection and Model Averag-
ing, Cambridge University Press, New York, NY.
Clarke, B.R. (1986), “Nonsmooth Analysis and Fréchet Differentiability of
M Functionals,” Probability Theory and Related Fields, 73, 137-209.
Clarke, B.R. (2000), “A Review of Differentiability in Relation to Robust-
ness with an Application to Seismic Data Analysis,” Proceedings of the Indian
National Science Academy, A, 66, 467-482.
Cleveland, W. (1979), “Robust Locally Weighted Regression and Smooth-
ing Scatterplots,” Journal of the American Statistical Association, 74, 829-
836.
Cleveland, W.S. (1981), “LOWESS: a Program for Smoothing Scatterplots
by Robust Locally Weighted Regression,” The American Statistician, 35, 54.
Collett, D. (1999), Modelling Binary Data, 1st ed., Chapman & Hall/CRC,
Boca Raton, FL.
Collett, D. (2003), Modelling Binary Data, 2nd ed., Chapman & Hall/CRC,
Boca Raton, FL.
Cook, R.D. (1977), “Deletion of Influential Observations in Linear Regres-
sion,” Technometrics, 19, 15-18.
Cook, R.D. (1998), Regression Graphics: Ideas for Studying Regression
Through Graphics, Wiley, New York, NY.
Cook, R.D. (2018), An Introduction to Envelopes: Dimension Reduction
for Efficient Estimation in Multivariate Statistics, Wiley, Hoboken, NJ.
Cook, R.D., and Forzani, L. (2008), “Principal Fitted Components for
Dimension Reduction in Regression,” Statistical Science, 23, 485-501.
Cook, R.D., and Forzani, L. (2018), “Big Data and Partial Least Squares
Prediction,” The Canadian Journal of Statistics, 46, 62-78.
Cook, R.D., and Forzani, L. (2019), “Partial Least Squares Prediction in
High-Dimensional Regression,” The Annals of Statistics, 47, 884-908.
Cook, R.D., and Forzani, L. (2024), Partial Least Squares Regression,
Chapman and Hall/CRC, Boca Raton, FL.
Cook, R.D., Forzani, L., and Rothman, A. (2013), “Prediction in Abun-
dant High-Dimensional Linear Regression,” Electronic Journal of Statistics,
7, 3059-3088.
Cook, R.D., Helland, I.S., and Su, Z. (2013), “Envelopes and Partial Least
Squares Regression,” Journal of the Royal Statistical Society, B, 75, 851-877.
Cook, R.D., and Olive, D.J. (2001), “A Note on Visualizing Response
Transformations in Regression,” Technometrics, 43, 443-449.
Cook, R.D., and Su, Z. (2013), “Scaled Envelopes: Scale-Invariant and
Efficient Estimation in Multivariate Linear Regression,” Biometrika, 100,
929-954.
Cook, R.D., and Su, Z. (2016), “Scaled Predictor Envelopes and Partial
Least-Squares Regression,” Technometrics, 58, 155-165.
Cook, R.D., and Weisberg, S. (1999), Applied Regression Including Com-
puting and Graphics, Wiley, New York, NY.
Cook, R.D., and Zhang, X. (2015), “Foundations of Envelope Models and
Methods,” Journal of the American Statistical Association, 110, 599-611.
Cox, D.R. (1972), “Regression Models and Life-Tables,” Journal of the
Royal Statistical Society, B, 34, 187-220.
Cornish, E.A. (1954), “The Multivariate t-Distribution Associated with a
Set of Normal Sample Deviates,” Australian Journal of Physics, 7, 531-542.
Cramér, H. (1946), Mathematical Methods of Statistics, Princeton Univer-
sity Press, Princeton, NJ.
Crawley, M.J. (2005), Statistics an Introduction Using R, Wiley, Hoboken,
NJ.
Crawley, M.J. (2013), The R Book, 2nd ed., Wiley, Hoboken, NJ.
Croux, C., Dehon, C., Rousseeuw, P.J., and Van Aelst, S. (2001), “Ro-
bust Estimation of the Conditional Median Function at Elliptical Models,”
Statistics & Probability Letters, 51, 361-368.
Daniel, C., and Wood, F.S. (1980), Fitting Equations to Data, 2nd ed.,
Wiley, New York, NY.
Datta, B.N. (1995), Numerical Linear Algebra and Applications,
Brooks/Cole Publishing Company, Pacific Grove, CA.
Denham, M.C. (1997), “Prediction Intervals in Partial Least Squares,”
Journal of Chemometrics, 11, 39-52.
Devlin, S.J., Gnanadesikan, R., and Kettenring, J.R. (1975), “Robust Es-
timation and Outlier Detection with Correlation Coefficients,” Biometrika,
62, 531-545.
Devlin, S.J., Gnanadesikan, R., and Kettenring, J.R. (1981), “Robust Es-
timation of Dispersion Matrices and Principal Components,” Journal of the
American Statistical Association, 76, 354-362.

Dezeure, R., Bühlmann, P., Meier, L., and Meinshausen, N. (2015), “High-
Dimensional Inference: Confidence Intervals, p-Values and R-Software hdi,”
Statistical Science, 30, 533-558.
Draper, N.R., and Smith, H. (1966, 1981, 1998), Applied Regression Anal-
ysis, 1st, 2nd, and 3rd ed., Wiley, New York, NY.
Driscoll, M.F., and Krasnicka, B. (1995), “An Accessible Proof of Craig’s
Theorem in the General Case,” The American Statistician, 49, 59-62.
Eaton, M.L. (1986), “A Characterization of Spherical Distributions,” Jour-
nal of Multivariate Analysis, 20, 272-276.
Eck, D.J. (2018), “Bootstrapping for Multivariate Linear Regression Mod-
els,” Statistics & Probability Letters, 134, 141-149.
Efron, B. (1979), “Bootstrap Methods, Another Look at the Jackknife,”
The Annals of Statistics, 7, 1-26.
Efron, B. (1982), The Jackknife, the Bootstrap and Other Resampling
Plans, SIAM, Philadelphia, PA.
Efron, B. (2014), “Estimation and Accuracy After Model Selection,” (with
discussion), Journal of the American Statistical Association, 109, 991-1007.
Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004), “Least
Angle Regression,” (with discussion), The Annals of Statistics, 32, 407-451.
Efron, B., and Hastie, T. (2016), Computer Age Statistical Inference, Cam-
bridge University Press, New York, NY.
Efron, B., and Tibshirani, R.J. (1993), An Introduction to the Bootstrap,
Chapman & Hall/CRC, New York, NY.
Efroymson, M.A. (1960), “Multiple Regression Analysis,” in Mathematical
Methods for Digital Computers, eds. Ralston, A., and Wilf, H.S., Wiley, New
York, NY, 191-203.
Eicker, F. (1963), “Asymptotic Normality and Consistency of the Least
Squares Estimators for Families of Linear Regressions,” Annals of Mathe-
matical Statistics, 34, 447-456.
Eicker, F. (1967), “Limit Theorems for Regressions with Unequal and De-
pendent Errors,” in Proceedings of the Fifth Berkeley Symposium on Mathe-
matical Statistics and Probability, Vol. I: Statistics, eds. Le Cam, L.M., and
Neyman, J., University of California Press, Berkeley, CA, 59-82.
Ewald, K., and Schneider, U. (2018), “Uniformly Valid Confidence Sets
Based on the Lasso,” Electronic Journal of Statistics, 12, 1358-1387.
Fahrmeir, L. and Tutz, G. (2001), Multivariate Statistical Modelling Based
on Generalized Linear Models, 2nd ed., Springer, New York, NY.
Fan, J., and Li, R. (2001), “Variable Selection Via Noncave Penalized
Likelihood and Its Oracle Properties,” Journal of the American Statistical
Association, 96, 1348-1360.
Fan, J., and Li, R. (2002), “Variable Selection for Cox’s Proportional Haz-
ard Model and Frailty Model,” The Annals of Statistics, 30, 74-99.
Fan, J., and Lv, J. (2010), “A Selective Overview of Variable Selection in
High Dimensional Feature Space,” Statistica Sinica, 20, 101-148.
Ferguson, T.S. (1996), A Course in Large Sample Theory, Chapman &
Hall, New York, NY.
Fernholtz, L.T. (1983), von Mises Calculus for Statistical Functionals,
Springer, New York, NY.
Ferrari, D., and Yang, Y. (2015), “Confidence Sets for Model Selection by
F –Testing,” Statistica Sinica, 25, 1637-1658.
Fithian, W., Sun, D., and Taylor, J. (2014), “Optimal Inference after
Model Selection,” ArXiv e-prints.
Flury, B., and Riedwyl, H. (1988), Multivariate Statistics: a Practical Ap-
proach, Chapman & Hall, New York.
Fogel, P., Hawkins, D.M., Beecher, C., Luta, G., and Young, S. (2013), “A
Tale of Two Matrix Factorizations,” The American Statistician, 67, 207-218.
Fox, J., and Weisberg, S. (2010), An R Companion to Applied Regression,
2nd ed., Sage Publications, Thousand Oaks, CA.
Frank, I.E., and Friedman, J.H. (1993), “A Statistical View of Some
Chemometrics Regression Tools,” (with discussion), Technometrics, 35, 109-
148.
Freedman, D.A. (1981), “Bootstrapping Regression Models,” The Annals
of Statistics, 9, 1218-1228.
Freedman, D.A. (2005), Statistical Models Theory and Practice, Cam-
bridge University Press, New York, NY.
Frey, J. (2013), “Data-Driven Nonparametric Prediction Intervals,” Jour-
nal of Statistical Planning and Inference, 143, 1039-1048.
Friedman, J., Hastie, T., Hoefling, H., and Tibshirani, R. (2007), “Pathwise
Coordinate Optimization,” Annals of Applied Statistics, 1, 302-332.
Friedman, J., Hastie, T., Simon, N., and Tibshirani, R. (2015), glmnet:
Lasso and Elastic-net Regularized Generalized Linear Models, R Package ver-
sion 2.0, (http://cran.r-project.org/package=glmnet).
Friedman, J., Hastie, T., and Tibshirani, R. (2010), “Regularization Paths
for Generalized Linear Models Via Coordinate Descent,” Journal of Statistical
Software, 33, 1-22.
Friedman, J.H., and Hall, P. (2007), “On Bagging and Nonlinear Estima-
tion,” Journal of Statistical Planning and Inference, 137, 669-683.
Fujikoshi, Y. (2002), “Asymptotic Expansions for the Distributions of Mul-
tivariate Basic Statistics and One-Way MANOVA Tests Under Nonnormal-
ity,” Journal of Statistical Planning and Inference, 108, 263-282.
Fujikoshi, Y., Sakurai, T., and Yanagihara, H. (2014), “Consistency of
High-Dimensional AIC–Type and Cp –Type Criteria in Multivariate Linear
Regression,” Journal of Multivariate Analysis, 123, 184-200.
Furnival, G., and Wilson, R. (1974), “Regression by Leaps and Bounds,”
Technometrics, 16, 499-511.
Gao, X., and Huang, J. (2010), “Asymptotic Analysis of High-Dimensional
LAD Regression with Lasso,” Statistica Sinica, 20, 1485-1506.
Gill, R.D. (1989), “Non- and Semi-Parametric Maximum Likelihood Estimators
and the von Mises Method, Part 1,” Scandinavian Journal of Statis-
tics, 16, 97-128.
Gladstone, R.J. (1905), “A Study of the Relations of the Brain to the Size
of the Head,” Biometrika, 4, 105-123.
Golub, G.H., and Van Loan, C.F. (1989), Matrix Computations, 2nd ed.,
John Hopkins University Press, Baltimore, MD.
Graybill, F.A. (1976), Theory and Application of the Linear Model, Dux–
bury Press, North Scituate, MA.
Graybill, F.A. (1983), Matrices with Applications to Statistics, 2nd ed.,
Wadsworth, Belmont, CA.
Graybill, F.A. (2000), Theory and Application of the Linear Model, Brooks/
Cole, Pacific Grove, CA.
Grübel, R. (1988), “The Length of the Shorth,” The Annals of Statistics,
16, 619-628.
Gruber, M.H.J. (1998), Improving Efficiency by Shrinkage: the James-
Stein and Ridge Regression Estimators, Marcel Dekker, New York, NY.
Gunst, R.F., and Mason, R.L. (1980), Regression Analysis and Its Appli-
cation: a Data Oriented Approach, Marcel Dekker, New York, NY.
Guttman, I. (1982), Linear Models: an Introduction, Wiley, New York,
NY.
Haggstrom, G.W. (1983), “Logistic Regression and Discriminant Analysis
by Ordinary Least Squares,” Journal of Business & Economic Statistics, 1,
229-238.
Haitovsky, Y. (1987), “On Multivariate Ridge Regression,” Biometrika,
74, 563-570.
Hall, P. (1986), “On the Bootstrap and Confidence Intervals,” The Annals
of Statistics, 14, 1431-1452.
Hall, P. (1988), “Theoretical Comparisons of Bootstrap Confidence Inter-
vals,” (with discussion), The Annals of Statistics, 16, 927-985.
Hall, P., Lee, E.R., and Park, B.U. (2009), “Bootstrap-Based Penalty
Choice for the Lasso Achieving Oracle Performance,” Statistica Sinica, 19,
449-471.
Hampel, F.R. (1975), “Beyond Location Parameters: Robust Concepts and
Methods,” Bulletin of the International Statistical Institute, 46, 375-382.
Harville, D.A. (2018), Linear Models and the Relevant Distributions and
Matrix Algebra, Chapman & Hall/CRC Press, Boca Raton, FL.
Hastie, T.J., and Tibshirani, R.J. (1986), “Generalized Additive Models”
(with discussion), Statistical Science, 1, 297-318.
Hastie, T.J., and Tibshirani, R.J. (1990), Generalized Additive Models,
Chapman & Hall, London, UK.
Hastie, T., Tibshirani, R., and Friedman, J. (2009), The Elements of Sta-
tistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer,
New York, NY.
Hastie, T., Tibshirani, R., and Wainwright, M. (2015), Statistical Learning
with Sparsity: the Lasso and Generalizations, CRC Press Taylor & Francis,
Boca Raton, FL.
Haughton, D.M.A. (1988), “On the Choice of a Model to Fit Data From
an Exponential Family,” The Annals of Statistics, 16, 342-355.
Haughton, D. (1989), “Size of the Error in the Choice of a Model to Fit
Data From an Exponential Family,” Sankhyā, A, 51, 45-58.
Hawkins, D.M., Bradu, D., and Kass, G.V. (1984), “Location of Several
Outliers in Multiple Regression Data Using Elemental Sets,” Technometrics,
26, 197-208.
Hawkins, D.M., and Olive, D.J. (1999a), “Improved Feasible Solution Al-
gorithms for High Breakdown Estimation,” Computational Statistics & Data
Analysis, 30, 1-11.
Hawkins, D.M., and Olive, D. (1999b), “Applications and Algorithms
for Least Trimmed Sum of Absolute Deviations Regression,” Computational
Statistics & Data Analysis, 32, 119-134.
Hawkins, D.M., and Olive, D.J. (2002), “Inconsistency of Resampling Al-
gorithms for High Breakdown Regression Estimators and a New Algorithm,”
(with discussion), Journal of the American Statistical Association, 97, 136-
159.
He, X., and Portnoy, S. (1992), “Reweighted LS Estimators Converge at
the Same Rate as the Initial Estimator,” The Annals of Statistics, 20, 2161-
2167.
He, X., and Wang, G. (1997), “Qualitative Robustness of S*-Estimators of
Multivariate Location and Dispersion,” Statistica Neerlandica, 51, 257-268.
Hebbler, B. (1847), “Statistics of Prussia,” Journal of the Royal Statistical
Society, A, 10, 154-186.
Henderson, H.V., and Searle, S.R. (1979), “Vec and Vech Operators for
Matrices, with Some Uses in Jacobians and Multivariate Statistics,” The
Canadian Journal of Statistics, 7, 65-81.
Hesterberg, T. (2014), “What Teachers Should Know about the Boot-
strap: Resampling in the Undergraduate Statistics Curriculum,” available
from (http://arxiv.org/pdf/1411.5279v1.pdf). (An abbreviated version was
published (2015), The American Statistician, 69, 371-386.)
Hilbe, J.M. (2011), Negative Binomial Regression, Cambridge University
Press, 2nd ed., Cambridge, UK.
Hillis, S.L., and Davis, C.S. (1994), “A Simple Justification of the Iterative
Fitting Procedure for Generalized Linear Models,” The American Statisti-
cian, 48, 288-289.
Hinkley, D.V. (1977), “Jackknifing in Unbalanced Situations,” Technomet-
rics, 19, 285-292.
Hjort, N.L., and Claeskens, G. (2003), “The Focused Information Crite-
rion,” Journal of the American Statistical Association, 98, 900-945.
Hocking, R.R. (2003), Methods and Applications of Linear Models: Regres-
sion and the Analysis of Variance, 2nd ed., Wiley, New York, NY.
Hocking, R.R. (2013), Methods and Applications of Linear Models: Regres-
sion and the Analysis of Variance, 3rd ed., Wiley, New York, NY.
Hoerl, A.E., and Kennard, R. (1970), “Ridge Regression: Biased Estima-
tion for Nonorthogonal Problems,” Technometrics, 12, 55-67.
Hoffman, I., Serneels, S., Filzmoser, P., and Croux, C. (2015), “Sparse
Partial Robust M Regression,” Chemometrics and Intelligent Laboratory Sys-
tems, 149, Part A, 50-59.
Hogg, R.V., Tanis, E.A., and Zimmerman, D.L. (2015), Probability and
Statistical Inference, 9th ed., Pearson, Boston, MA.
Hong, L., Kuffner, T.A., and Martin, R. (2018), “On Overfitting and Post-
Selection Uncertainty Assessments,” Biometrika, 105, 221-224.
Hosmer, D.W., and Lemeshow, S. (2000), Applied Logistic Regression, 2nd
ed., Wiley, New York, NY.
Hössjer, O. (1991), Rank-Based Estimates in the Linear Model with High
Breakdown Point, Ph.D. Thesis, Report 1991:5, Department of Mathematics,
Uppsala University, Uppsala, Sweden.
Huber, P.J., and Ronchetti, E.M. (2009), Robust Statistics, 2nd ed., Wiley,
Hoboken, NJ.
Hubert, M., Rousseeuw, P.J., and Van Aelst, S. (2002), “Comment on
‘Inconsistency of Resampling Algorithms for High Breakdown Regression and
a New Algorithm’ by D.M. Hawkins and D.J. Olive,” Journal of the American
Statistical Association, 97, 151-153.
Hubert, M., Rousseeuw, P.J., and Van Aelst, S. (2008), “High Breakdown
Multivariate Methods,” Statistical Science, 23, 92-119.
Hubert, M., Rousseeuw, P.J., and Verdonck, T. (2012), “A Deterministic
Algorithm for Robust Location and Scatter,” Journal of Computational and
Graphical Statistics, 21, 618-637.
Hurvich, C., and Tsai, C.L. (1989), “Regression and Time Series Model
Selection in Small Samples,” Biometrika, 76, 297-307.
Hurvich, C., and Tsai, C.L. (1990), “The Impact of Model Selection on
Inference in Linear Regression,” The American Statistician, 44, 214-217.
Hurvich, C.M., and Tsai, C.-L. (1991), “Bias of the Corrected AIC Cri-
terion for Underfitted Regression and Time Series Models,” Biometrika, 78,
499-509.
Hyndman, R.J. (1996), “Computing and Graphing Highest Density Re-
gions,” The American Statistician, 50, 120-126.
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013), An Intro-
duction to Statistical Learning with Applications in R, Springer, New York,
NY.
Javanmard, A., and Montanari, A. (2014), “Confidence Intervals and
Hypothesis Testing for High-Dimensional Regression,” Journal of Machine
Learning Research, 15, 2869-2909.
Jia, J., and Yu, B. (2010), “On Model Selection Consistency of the Elastic
Net When p >> n,” Statistica Sinica, 20, 595-611.
Johnson, M.E. (1987), Multivariate Statistical Simulation, Wiley, New
York, NY.
Johnson, M.P., and Raven, P.H. (1973), “Species Number and Endemism,
the Galápagos Archipelago Revisited,” Science, 179, 893-895.
Johnson, N.L., and Kotz, S. (1970), Distributions in Statistics: Continuous
Univariate Distributions–2, Wiley, New York, NY.
Johnson, N.L., and Kotz, S. (1972), Distributions in Statistics: Continuous
Multivariate Distributions, Wiley, New York, NY.
Johnson, R.A., and Wichern, D.W. (1988), Applied Multivariate Statistical
Analysis, 2nd ed., Prentice Hall, Englewood Cliffs, NJ.
Johnstone, I.M., and Nadler, B. (2017), “Roy’s Largest Root Test Under
Rank-One Alternatives,” Biometrika, 104, 181-193.
Jolliffe, I.T. (1983), “A Note on the Use of Principal Components in Re-
gression,” Applied Statistics, 31, 300-303.
Jones, H.L. (1946), “Linear Regression Functions with Neglected Vari-
ables,” Journal of the American Statistical Association, 41, 356-369.
Kakizawa, Y. (2009), “Third-Order Power Comparisons for a Class of Tests
for Multivariate Linear Hypothesis Under General Distributions,” Journal of
Multivariate Analysis, 100, 473-496.
Kay, R., and Little, S. (1987), “Transformations of the Explanatory Vari-
ables in the Logistic Regression Model for Binary Data,” Biometrika, 74,
495-501.
Kelker, D. (1970), “Distribution Theory of Spherical Distributions and a
Location Scale Parameter Generalization,” Sankhyā, A, 32, 419-430.
Khattree, R., and Naik, D.N. (1999), Applied Multivariate Statistics with
SAS Software, 2nd ed., SAS Institute, Cary, NC.
Kim, J., and Pollard, D. (1990), “Cube Root Asymptotics,” The Annals
of Statistics, 18, 191-219.
Kim, Y., Kwon, S., and Choi, H. (2012), “Consistent Model Selection
Criteria on High Dimensions,” Journal of Machine Learning Research, 13,
1037-1057.
Klouda, K. (2015), “An Exact Polynomial Time Algorithm for Comput-
ing the Least Trimmed Squares Estimate,” Computational Statistics & Data
Analysis, 84, 27-40.
Knight, K., and Fu, W.J. (2000), “Asymptotics for Lasso-Type Estima-
tors,” Annals of Statistics, 28, 1356-1378.
Konietschke, F., Bathke, A.C., Harrar, S.W., and Pauly, M. (2015), “Para-
metric and Nonparametric Bootstrap Methods for General MANOVA,” Jour-
nal of Multivariate Analysis, 140, 291-301.
Kshirsagar, A.M. (1972), Multivariate Analysis, Marcel Dekker, New York,
NY.
Kuehl, R.O. (1994), Statistical Principles of Research Design and Analysis,
Duxbury, Belmont, CA.
Kutner, M.H., Nachtsheim, C.J., Neter, J., and Li, W. (2005), Applied
Linear Statistical Models, 5th ed., McGraw-Hill/Irwin, Boston, MA.
Lai, T.L., Robbins, H., and Wei, C.Z. (1979), “Strong Consistency of Least
Squares Estimates in Multiple Regression II,” Journal of Multivariate Anal-
ysis, 9, 343-361.
Larsen, R.J., and Marx, M.L. (2017), Introduction to Mathematical Statis-
tics and Its Applications, 6th ed., Pearson, Boston, MA.
Lee, J., Sun, D., Sun, Y., and Taylor, J. (2016), “Exact Post-Selection
Inference with Application to the Lasso,” The Annals of Statistics, 44, 907-
927.
Lee, J.D., and Taylor, J.E. (2014), “Exact Post Model Selection Infer-
ence for Marginal Screening,” in Advances in Neural Information Processing
Systems, 136-144.
Leeb, H., and Pötscher, B.M. (2005), “Model Selection and Inference: Facts
and Fiction,” Econometric Theory, 21, 21-59.
Leeb, H., and Pötscher, B.M. (2006), “Can One Estimate the Conditional
Distribution of Post–Model-Selection Estimators?” The Annals of Statistics,
34, 2554-2591.
Leeb, H., and Pötscher, B.M. (2008), “Can One Estimate the Unconditional
Distribution of Post-Model-Selection Estimators?” Econometric Theory, 24,
338-376.
Leeb, H., Pötscher, B.M., and Ewald, K. (2015), “On Various Confidence
Intervals Post-Model-Selection,” Statistical Science, 30, 216-227.
Lehmann, E.L. (1999), Elements of Large–Sample Theory, Springer, New
York, NY.
Lei, J., G’Sell, M., Rinaldo, A., Tibshirani, R.J., and Wasserman, L.
(2018), “Distribution-Free Predictive Inference for Regression,” Journal of
the American Statistical Association, 113, 1094-1111.
Leon, S.J. (1986), Linear Algebra with Applications, 2nd ed., Macmillan
Publishing Company, New York, NY.
Leon, S.J. (2015), Linear Algebra with Applications, 9th ed., Pearson,
Boston, MA.
Li, K.-C. (1987), “Asymptotic Optimality for Cp, CL, Cross-Validation
and Generalized Cross-Validation: Discrete Index Set,” The Annals of Statis-
tics, 15, 958-975.
Li, K.–C., and Duan, N. (1989), “Regression Analysis Under Link Viola-
tion,” The Annals of Statistics, 17, 1009-1052.
Lin, D., Foster, D.P., and Ungar, L.H. (2012), “VIF Regression, a Fast
Regression Algorithm for Large Data,” Journal of the American Statistical
Association, 106, 232-247.
Lindenmayer, D.B., Cunningham, R., Tanton, M.T., Nix, H.A., and Smith,
A.P. (1991), “The Conservation of Arboreal Marsupials in the Montane Ash
Forests of Central Highlands of Victoria, South-East Australia: III. The Habi-
tat Requirements of Leadbeater’s Possum Gymnobelideus Leadbeateri and
Models of the Diversity and Abundance of Arboreal Marsupials,” Biological
Conservation, 56, 295-315.
Liu, X., and Zuo, Y. (2014), “Computing Projection Depth and Its Asso-
ciated Estimators,” Statistics and Computing, 24, 51-63.
Lockhart, R., Taylor, J., Tibshirani, R.J., and Tibshirani, R. (2014), “A
Significance Test for the Lasso,” (with discussion), The Annals of Statistics,
42, 413-468.
Lopuhaä, H.P. (1999), “Asymptotics of Reweighted Estimators of Multi-
variate Location and Scatter,” The Annals of Statistics, 27, 1638-1665.
Lu, S., Liu, Y., Yin, L., and Zhang, K. (2017), “Confidence Intervals and
Regions for the Lasso by Using Stochastic Variational Inequality Techniques
in Optimization,” Journal of the Royal Statistical Society, B, 79, 589-611.
Lumley, T. (using Fortran code by Alan Miller) (2009), leaps: Regression
Subset Selection, R package version 2.9, (https://CRAN.R-project.org/package
=leaps).
Luo, S., and Chen, Z. (2013), “Extended BIC for Linear Regression Models
with Diverging Number of Relevant Features and High or Ultra-High Feature
Spaces,” Journal of Statistical Planning and Inference, 143, 494-504.
Machado, J.A.F., and Parente, P. (2005), “Bootstrap Estimation of Covari-
ance Matrices Via the Percentile Method,” Econometrics Journal, 8, 70-78.
MacKinnon, J.G., and White, H. (1985), “Some Heteroskedasticity-Consistent
Covariance Matrix Estimators with Improved Finite Sample Properties,”
Journal of Econometrics, 29, 305-325.
Mallows, C. (1973), “Some Comments on Cp,” Technometrics, 15, 661-676.
Marden, J.I. (2017), Mathematical Statistics: Old School, available at
(www.stat.istics.net and www.amazon.com).
Mardia, K.V., Kent, J.T., and Bibby, J.M. (1979), Multivariate Analysis,
Academic Press, London, UK.
Maronna, R.A., Martin, R.D., and Yohai, V.J. (2006), Robust Statistics:
Theory and Methods, Wiley, Hoboken, NJ.
Maronna, R.A., and Morgenthaler, S. (1986), “Robust Regression Through
Robust Covariances,” Communications in Statistics: Theory and Methods, 15,
1347-1365.
Maronna, R.A., and Yohai, V.J. (2002), “Comment on ‘Inconsistency of
Resampling Algorithms for High Breakdown Regression and a New Algo-
rithm’ by D.M. Hawkins and D.J. Olive,” Journal of the American Statistical
Association, 97, 154-155.
Maronna, R.A., and Yohai, V.J. (2015), “High-Sample Efficiency and Ro-
bustness Based on Distance-Constrained Maximum Likelihood,” Computa-
tional Statistics & Data Analysis, 83, 262-274.
Maronna, R.A., and Zamar, R.H. (2002), “Robust Estimates of Location
and Dispersion for High-Dimensional Datasets,” Technometrics, 44, 307-317.
Marquardt, D.W., and Snee, R.D. (1975), “Ridge Regression in Practice,”
The American Statistician, 29, 3-20.
Mašíček, L. (2004), “Optimality of the Least Weighted Squares Estima-
tor,” Kybernetika, 40, 715-734.
MathSoft (1999a), S-Plus 2000 User’s Guide, Data Analysis Products Di-
vision, MathSoft, Seattle, WA.
MathSoft (1999b), S-Plus 2000 Guide to Statistics, Volume 2, Data Anal-
ysis Products Division, MathSoft, Seattle, WA.
McCullagh, P., and Nelder, J.A. (1989), Generalized Linear Models, 2nd
ed., Chapman & Hall, London, UK.
Meinshausen, N. (2007), “Relaxed Lasso,” Computational Statistics &
Data Analysis, 52, 374-393.
Mevik, B.–H., Wehrens, R., and Liland, K.H. (2015), pls: Partial Least
Squares and Principal Component Regression, R package version 2.5-0, (https:
//CRAN.R-project.org/package=pls).
Monahan, J.F. (2008), A Primer on Linear Models, Chapman & Hall/CRC,
Boca Raton, FL.
Montgomery, D.C., Peck, E.A., and Vining, G. (2001), Introduction to
Linear Regression Analysis, 3rd ed., Wiley, Hoboken, NJ.
Montgomery, D.C., Peck, E.A., and Vining, G. (2021), Introduction to
Linear Regression Analysis, 6th ed., Wiley, Hoboken, NJ.
Moore, D.S. (2007), The Basic Practice of Statistics, 4th ed., W.H. Free-
man, New York, NY.
Mosteller, F., and Tukey, J.W. (1977), Data Analysis and Regression,
Addison-Wesley, Reading, MA.
Mount, D.M., Netanyahu, N.S., Piatko, C.D., Silverman, R., and Wu, A.Y.
(2014), “On the Least Trimmed Squares Estimator,” Algorithmica, 69, 148-
183.
Muller, K.E., and Stewart, P.W. (2006), Linear Model Theory: Univariate,
Multivariate, and Mixed Models, Wiley, Hoboken, NJ.
Myers, R.H., and Milton, J.S. (1991), A First Course in the Theory of
Linear Statistical Models, Duxbury, Belmont, CA.
Myers, R.H., Montgomery, D.C., and Vining, G.G. (2002), Generalized
Linear Models with Applications in Engineering and the Sciences, Wiley, New
York, NY.
Navarro, J. (2014), “Can the Bounds in the Multivariate Chebyshev In-
equality be Attained?” Statistics & Probability Letters, 91, 1-5.
Navarro, J. (2016), “A Very Simple Proof of the Multivariate Chebyshev’s
Inequality,” Communications in Statistics: Theory and Methods, 45, 3458-
3463.
Nelder, J.A., and Wedderburn, R.W.M. (1972), “Generalized Linear Mod-
els,” Journal of the Royal Statistical Society, A, 135, 370-384.
Ning, Y., and Liu, H. (2017), “A General Theory of Hypothesis Tests and
Confidence Regions for Sparse High Dimensional Models,” The Annals of
Statistics, 45, 158-195.
Nishii, R. (1984), “Asymptotic Properties of Criteria for Selection of Vari-
ables in Multiple Regression,” The Annals of Statistics, 12, 758-765.
Nordhausen, K., and Tyler, D.E. (2015), “A Cautionary Note on Robust
Covariance Plug-In Methods,” Biometrika, 102, 573-588.
Obozinski, G., Wainwright, M.J., and Jordan, M.I. (2011), “Support
Union Recovery in High-Dimensional Multivariate Regression,” The Annals
of Statistics, 39, 1-47.
Olive, D.J. (2002), “Applications of Robust Distances for Regression,”
Technometrics, 44, 64-71.
Olive, D.J. (2004a), “A Resistant Estimator of Multivariate Location and
Dispersion,” Computational Statistics & Data Analysis, 46, 99-102.
Olive, D.J. (2004b), “Visualizing 1D Regression,” in Theory and Applica-
tions of Recent Robust Methods, eds. Hubert, M., Pison, G., Struyf, A., and
Van Aelst, S., Birkhäuser, Basel, Switzerland, 221-233.
Olive, D.J. (2005), “Two Simple Resistant Regression Estimators,” Com-
putational Statistics & Data Analysis, 49, 809-819.
Olive, D.J. (2007), “Prediction Intervals for Regression Models,” Compu-
tational Statistics & Data Analysis, 51, 3115-3122.
Olive, D.J. (2008), Applied Robust Statistics, unpublished online text, see
(http://parker.ad.siu.edu/Olive/ol-bookp.htm).
Olive, D.J. (2010), Multiple Linear and 1D Regression, online course notes,
see (http://parker.ad.siu.edu/Olive/regbk.htm).
Olive, D.J. (2013a), “Asymptotically Optimal Regression Prediction In-
tervals and Prediction Regions for Multivariate Data,” International Journal
of Statistics and Probability, 2, 90-100.
Olive, D.J. (2013b), “Plots for Generalized Additive Models,” Communi-
cations in Statistics: Theory and Methods, 42, 2610-2628.
Olive, D.J. (2014), Statistical Theory and Inference, Springer, New York,
NY.
Olive, D.J. (2017a), Linear Regression, Springer, New York, NY.
Olive, D.J. (2017b), Robust Multivariate Analysis, Springer, New York,
NY.
Olive, D.J. (2018), “Applications of Hyperellipsoidal Prediction Regions,”
Statistical Papers, 59, 913-931.
Olive, D.J. (2025a), Prediction and Statistical Learning, online course
notes, see (http://parker.ad.siu.edu/Olive/slearnbk.htm).
Olive, D.J. (2025b), Robust Statistics, online course notes, (http://parker.
ad.siu.edu/Olive/robbook.htm).
Olive, D.J. (2025c), Large Sample Theory, online course notes, (http://parker.ad.
siu.edu/Olive/lsampbk.pdf).
Olive, D.J., Alshammari, A., Pathiranage, K.G., and Hettige, L.A.W.
(2025), “Testing with the One Component Partial Least Squares and the
Marginal Maximum Likelihood Estimators,” is at (http://parker.ad.
siu.edu/Olive/pphdwls.pdf).
Olive, D.J., and Hawkins, D.M. (2003), “Robust Regression with High
Coverage,” Statistics & Probability Letters, 63, 259-266.
Olive, D.J., and Hawkins, D.M. (2005), “Variable Selection for 1D Regres-
sion Models,” Technometrics, 47, 43-50.
Olive, D.J., and Hawkins, D.M. (2010), “Robust Multivariate Location and
Dispersion,” preprint, see (http://parker.ad.siu.edu/Olive/pphbmld.pdf).
Olive, D.J., and Hawkins, D.M. (2011), “Practical High Breakdown Re-
gression,” preprint at (http://parker.ad.siu.edu/Olive/pphbreg.pdf).
Olive, D.J., Pelawa Watagoda, L.C.R., and Rupasinghe Arachchige Don,
H.S. (2015), “Visualizing and Testing the Multivariate Linear Regression
Model,” International Journal of Statistics and Probability, 4, 126-137.
Olive, D.J., Rathnayake, R.C., and Haile, M.G. (2022), “Prediction Inter-
vals for GLMs, GAMs, and Some Survival Regression Models,” Communica-
tions in Statistics: Theory and Methods, 51, 8012-8026.
Olive, D.J., and Zhang, L. (2025), “One Component Partial Least Squares,
High Dimensional Regression, Data Splitting, and the Multitude of Models,”
Communications in Statistics: Theory and Methods, 54, 130-145.
Park, Y., Kim, D., and Kim, S. (2012), “Robust Regression Using Data
Partitioning and M-Estimation,” Communications in Statistics: Simulation
and Computation, 8, 1282-1300.
Pati, Y.C., Rezaiifar, R., and Krishnaprasad, P.S. (1993), “Orthogonal
Matching Pursuit: Recursive Function Approximation with Applications to
Wavelet Decomposition,” in Conference Record of the Twenty-Seventh Asilo-
mar Conference on Signals, Systems and Computers, IEEE, 40-44.
Pelawa Watagoda, L.C.R. (2017), “Inference after Variable Selection,”
Ph.D. Thesis, Southern Illinois University. See (http://parker.ad.siu.edu/
Olive/slasanthiphd.pdf).
Pelawa Watagoda, L.C.R. (2019), “A Sub-Model Theorem for Ordinary
Least Squares,” International Journal of Statistics and Probability, 8, 40-43.
Pelawa Watagoda, L.C.R., and Olive, D.J. (2021a), “Bootstrapping Mul-
tiple Linear Regression after Variable Selection,” Statistical Papers, 62, 681-
700.
Pelawa Watagoda, L.C.R., and Olive, D.J. (2021b), “Comparing Six
Shrinkage Estimators with Large Sample Theory and Asymptotically Op-
timal Prediction Intervals,” Statistical Papers, 62, 2407-2431.
Peña, D. (2005), “A New Statistic for Influence in Regression,” Techno-
metrics, 47, 1-12.
Pesch, C. (1999), “Computation of the Minimum Covariance Determinant
Estimator,” in Classification in the Information Age, Proceedings of the 22nd
Annual GfKl Conference, Dresden 1998, eds. Gaul, W., and Locarek-Junge,
H., Springer, Berlin, 225–232.
Pratt, J.W. (1959), “On a General Concept of “in Probability”,” The
Annals of Mathematical Statistics, 30, 549-558.
Press, S.J. (2005), Applied Multivariate Analysis: Using Bayesian and Fre-
quentist Methods of Inference, 2nd ed., Dover, Mineola, NY.
Qi, X., Luo, R., Carroll, R.J., and Zhao, H. (2015), “Sparse Regression
by Projection and Sparse Discriminant Analysis,” Journal of Computational
and Graphical Statistics, 24, 416-438.
R Core Team (2024), “R: a Language and Environment for Statisti-
cal Computing,” R Foundation for Statistical Computing, Vienna, Austria,
(www.R-project.org).
Rao, C.R. (1965, 1973), Linear Statistical Inference and Its Applications,
1st and 2nd ed., Wiley, New York, NY.
Rao, C.R., Toutenburg, H., Shalabh, and Heumann, C. (2008), Lin-
ear Models and Generalizations: Least Squares and Alternatives, 3rd ed.,
Springer, New York, NY.
Rathnayake, R.C. (2019), Inference for Some GLMs and Survival Regres-
sion Models after Variable Selection, Ph.D. thesis, Southern Illinois Univer-
sity, at (http://parker.ad.siu.edu/Olive/srasanjiphd.pdf).
Rathnayake, R.C., and Olive, D.J. (2023), “Bootstrapping Some GLM
and Survival Regression Variable Selection Estimators,” Communications in
Statistics: Theory and Methods, 52, 2625-2645.
Ravishanker, N., Chi, Z., and Dey, D.K. (2021), A First Course in Linear
Model Theory, 2nd ed., Chapman & Hall/CRC, Boca Raton, FL.
Reid, J.G., and Driscoll, M.F. (1988), “An Accessible Proof of Craig’s
Theorem in the Noncentral Case,” The American Statistician, 42, 139-142.
Rejchel, W. (2016), “Lasso with Convex Loss: Model Selection Consistency
and Estimation,” Communications in Statistics: Theory and Methods, 45,
1989-2004.
Ren, J.-J. (1991), “On Hadamard Differentiability of Extended Statistical
Functional,” Journal of Multivariate Analysis, 39, 30-43.
Ren, J.-J., and Sen, P.K. (1995), “Hadamard Differentiability on D[0,1]^p,”
Journal of Multivariate Analysis, 55, 14-28.
Rencher, A.C., and Schaalje, G.B. (2008), Linear Models in Statistics, 2nd
ed., Wiley, Hoboken, NJ.
Reyen, S.S., Miller, J.J., and Wegman, E.J. (2009), “Separating a Mixture
of Two Normals with Proportional Covariances,” Metrika, 70, 297-314.
Rinaldo, A., Wasserman, L., and G’Sell, M. (2019), “Bootstrapping and
Sample Splitting for High-Dimensional, Assumption-Lean Inference,” The
Annals of Statistics, 47, 3438-3469.
Ro, K., Zou, C., Wang, W., and Yin, G. (2015), “Outlier Detection for
High–Dimensional Data,” Biometrika, 102, 589-599.
Rocke, D.M., and Woodruff, D.L. (1996), “Identification of Outliers in
Multivariate Data,” Journal of the American Statistical Association, 91,
1047-1061.
Rohatgi, V.K. (1976), An Introduction to Probability Theory and Mathe-
matical Statistics, Wiley, New York, NY.
Rohatgi, V.K. (1984), Statistical Inference, Wiley, New York, NY.
Rousseeuw, P.J. (1984), “Least Median of Squares Regression,” Journal of
the American Statistical Association, 79, 871-880.
Rousseeuw, P.J., and Leroy, A.M. (1987), Robust Regression and Outlier
Detection, Wiley, New York, NY.
Rousseeuw, P.J., and Van Driessen, K. (1999), “A Fast Algorithm for the
Minimum Covariance Determinant Estimator,” Technometrics, 41, 212-223.
Rupasinghe Arachchige Don, H.S. (2018), “A Relationship Between the
One-Way MANOVA Test Statistic and the Hotelling Lawley Trace Test
Statistic,” International Journal of Statistics and Probability, 7, 124-131.
Rupasinghe Arachchige Don, H.S., and Olive, D.J. (2019), “Bootstrapping
Analogs of the One Way MANOVA Test,” Communications in Statistics:
Theory and Methods, 48, 5546-5558.
Rupasinghe Arachchige Don, H.S., and Pelawa Watagoda, L.C.R. (2018),
“Bootstrapping Analogs of the Two Sample Hotelling’s T^2 Test,” Communi-
cations in Statistics: Theory and Methods, 47, 2172-2182.
SAS Institute (1985), SAS User’s Guide: Statistics, Version 5, SAS Insti-
tute, Cary, NC.
Schaaffhausen, H. (1878), “Die Anthropologische Sammlung Des Anatomischen
Der Universität Bonn,” Archiv für Anthropologie, 10, 1-65, Appendix.
Scheffé, H. (1959), The Analysis of Variance, Wiley, New York, NY.
Schomaker, M., and Heumann, C. (2014), “Model Selection and Model Av-
eraging After Multiple Imputation,” Computational Statistics & Data Anal-
ysis, 71, 758-770.
Schwarz, G. (1978), “Estimating the Dimension of a Model,” The Annals
of Statistics, 6, 461-464.
Searle, S.R. (1971), Linear Models, Wiley, New York, NY.
Searle, S.R. (1982), Matrix Algebra Useful for Statistics, Wiley, New York,
NY.
Searle, S.R., and Gruber, M.H.J. (2017), Linear Models, 2nd ed., Wiley,
Hoboken, NJ.
Seber, G.A.F., and Lee, A.J. (2003), Linear Regression Analysis, 2nd ed.,
Wiley, New York, NY.
Sen, P.K., and Singer, J.M. (1993), Large Sample Methods in Statistics:
an Introduction with Applications, Chapman & Hall, New York, NY.
Sengupta, D., and Jammalamadaka, S.R. (2019), Linear Models and Re-
gression with R: an Integrated Approach, World Scientific, Singapore.
Serfling, R.J. (1980), Approximation Theorems of Mathematical Statistics,
Wiley, New York, NY.
Severini, T.A. (1998), “Some Properties of Inferences in Misspecified Lin-
ear Models,” Statistics & Probability Letters, 40, 149-153.
Severini, T.A. (2005), Elements of Distribution Theory, Cambridge Uni-
versity Press, New York, NY.
Shao, J. (1993), “Linear Model Selection by Cross-Validation,” Journal of
the American Statistical Association, 88, 486-494.
Shao, J., and Tu, D.S. (1995), The Jackknife and the Bootstrap, Springer,
New York, NY.
Shibata, R. (1984), “Approximate Efficiency of a Selection Procedure for
the Number of Regression Variables,” Biometrika, 71, 43-49.
Simon, N., Friedman, J., Hastie, T., and Tibshirani, R. (2011), “Regular-
ization Paths for Cox’s Proportional Hazards Model via Coordinate Descent,”
Journal of Statistical Software, 39, 1-13.
Simonoff, J.S. (2003), Analyzing Categorical Data, Springer, New York,
NY.
Slawski, M., zu Castell, W., and Tutz, G. (2010), “Feature Selection
Guided by Structural Information,” Annals of Applied Statistics, 4, 1056-
1080.
Srivastava, M.S., and Khatri, C.G. (1979), An Introduction to Multivariate
Statistics, North Holland, New York, NY.
Stapleton, J.H. (2009), Linear Statistical Models, 2nd ed., Wiley, Hoboken,
NJ.
Staudte, R.G., and Sheather, S.J. (1990), Robust Estimation and Testing,
Wiley, New York, NY.
Steinberger, L., and Leeb, H. (2023), “Conditional Predictive Inference for
Stable Algorithms,” The Annals of Statistics, 51, 290-311.
Stewart, G.W. (1969), “On the Continuity of the Generalized Inverse,”
SIAM Journal on Applied Mathematics, 17, 33-45.
Su, W., Bogdan, M., and Candès, E. (2017), “False Discoveries Occur
Early on the Lasso Path,” The Annals of Statistics, 45, 2133-2150.
Su, Z., and Cook, R.D. (2012), “Inner Envelopes: Efficient Estimation in
Multivariate Linear Regression,” Biometrika, 99, 687-702.
Su, Z., Zhu, G., and Yang, Y. (2016), “Sparse Envelope Model: Efficient
Estimation and Response Variable Selection in Multivariate Linear Regres-
sion,” Biometrika, 103, 579-593.
Sun, T., and Zhang, C.-H. (2012), “Scaled Sparse Linear Regression,”
Biometrika, 99, 879-898.
Tarr, G., Müller, S., and Weber, N.C. (2016), “Robust Estimation of Pre-
cision Matrices Under Cellwise Contamination,” Computational Statistics &
Data Analysis, 93, 404-420.
Tibshirani, R. (1996), “Regression Shrinkage and Selection via the Lasso,”
Journal of the Royal Statistical Society, B, 58, 267-288.
Tibshirani, R. (1997), “The Lasso Method for Variable Selection in the
Cox Model,” Statistics in Medicine, 16, 385-395.
Tibshirani, R., Bien, J., Friedman, J., Hastie, T., Simon, N., Taylor, J.,
and Tibshirani, R.J. (2012), “Strong Rules for Discarding Predictors in Lasso-
Type Problems,” Journal of the Royal Statistical Society, B, 74, 245–266.
Tibshirani, R.J. (2013), “The Lasso Problem and Uniqueness,” Electronic
Journal of Statistics, 7, 1456-1490.
Tibshirani, R.J. (2015), “Degrees of Freedom and Model Search,” Statistica
Sinica, 25, 1265-1296.
Tibshirani, R.J., Rinaldo, A., Tibshirani, R., and Wasserman, L. (2018),
“Uniform Asymptotic Inference and the Bootstrap after Model Selection,”
The Annals of Statistics, 46, 1255-1287.
Tibshirani, R.J., and Taylor, J. (2012), “Degrees of Freedom in Lasso
Problems,” The Annals of Statistics, 40, 1198-1232.
Tibshirani, R.J., Taylor, J., Lockhart, R., and Tibshirani, R. (2016), “Ex-
act Post-Selection Inference for Sequential Regression Procedures,” Journal
of the American Statistical Association, 111, 600-620.
Tremearne, A.J.N. (1911), “Notes on Some Nigerian Tribal Marks,” Jour-
nal of the Royal Anthropological Institute of Great Britain and Ireland, 41,
162-178.
Tukey, J.W. (1957), “Comparative Anatomy of Transformations,” Annals
of Mathematical Statistics, 28, 602-632.
Uraibi, H.S., Midi, H., and Rana, S. (2017), “Robust Multivariate Least
Angle Regression,” Science Asia, 43, 56-60.
Uraibi, H.S., Midi, H., and Rana, S. (2017), “Selective Overview of Forward
Selection in Terms of Robust Correlations,” Communications in Statistics:
Simulation and Computation, 46, 5479-5503.
van de Geer, S., Bühlmann, P., Ritov, Y., and Dezeure, R. (2014), “On
Asymptotically Optimal Confidence Regions and Tests for High-Dimensional
Models,” The Annals of Statistics, 42, 1166-1202.
Venables, W.N., and Ripley, B.D. (2010), Modern Applied Statistics with
S, 4th ed., Springer, New York, NY.
Wackerly, D.D., Mendenhall, W., and Scheaffer, R.L. (2008), Mathematical
Statistics with Applications, 7th ed., Thomson Brooks/Cole, Belmont, CA.
Walpole, R.E., Myers, R.H., Myers, S.L., and Ye, K. (2016), Probability &
Statistics for Engineers & Scientists, 9th ed., Pearson, Boston, MA.
Wang, H. (2009), “Forward Regression for Ultra-High Dimensional Vari-
able Screening,” Journal of the American Statistical Association, 104, 1512-
1524.
Wang, H., and Zhou, S.Z.F. (2013), “Interval Estimation by Frequentist
Model Averaging,” Communications in Statistics: Theory and Methods, 42,
4342-4356.
Wang, S.-G., and Chow, S.-C. (1994), Advanced Linear Models: Theory
and Applications, Marcel Dekker, New York, NY.
Wasserman, L. (2014), “Discussion: A Significance Test for the Lasso,”
The Annals of Statistics, 42, 501-508.
Weisberg, S. (2014), Applied Linear Regression, 4th ed., Wiley, Hoboken,
NJ.
Welch, B.L. (1947), “The Generalization of Student’s Problem When Sev-
eral Different Population Variances Are Involved,” Biometrika, 34, 28-35.
Welch, B.L. (1951), “On the Comparison of Several Mean Values: an Al-
ternative Approach,” Biometrika, 38, 330-336.
White, H. (1980), “A Heteroskedasticity-Consistent Covariance Matrix Es-
timator and a Direct Test for Heteroskedasticity,” Econometrica, 48, 817-838.
White, H. (1984), Asymptotic Theory for Econometricians, Academic
Press, San Diego, CA.
Wieczorek, J., and Lei, J. (2022), “Model-Selection Properties of Forward
Selection and Sequential Cross-Validation for High-Dimensional Regression,”
Canadian Journal of Statistics, 50, 454-470.
Winkelmann, R. (2000), Econometric Analysis of Count Data, 3rd ed.,
Springer, New York, NY.
Winkelmann, R. (2008), Econometric Analysis of Count Data, 5th ed.,
Springer, New York, NY.
Wold, H. (1975), “Soft Modelling by Latent Variables: the Non-Linear
Iterative Partial Least Squares (NIPALS) Approach,” Journal of Applied Probability,
12, 117-142.
Wold, H. (1985), “Partial Least Squares,” International Journal of Cardi-
ology, 147, 581-591.
Wold, H. (2006), “Partial Least Squares,” Encyclopedia of Statistical Sci-
ences, Wiley, New York, NY.
Wood, S.N. (2017), Generalized Additive Models: an Introduction with R,
2nd ed., Chapman & Hall/CRC, Boca Raton, FL.
Woodruff, D.L., and Rocke, D.M. (1994), “Computable Robust Estimation
of Multivariate Location and Shape in High Dimension Using Compound
Estimators,” Journal of the American Statistical Association, 89, 888-896.
Xu, H., Caramanis, C., and Mannor, S. (2011), “Sparse Algorithms are Not
Stable: a No-Free-Lunch Theorem,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, PP(99), 1-9.
Yang, Y. (2003), “Regression with Multiple Candidate Models: Selecting
or Mixing?” Statistica Sinica, 13, 783-809.
Zhang, J. (2020), “Consistency of MLE, LSE and M-Estimation Under
Mild Conditions,” Statistical Papers, 61, 189-199.
Zhang, J., Olive, D.J., and Ye, P. (2012), “Robust Covariance Matrix
Estimation with Canonical Correlation Analysis,” International Journal of
Statistics and Probability, 1, 119-136.
Zhang, J.-T., and Liu, X. (2013), “A Modified Bartlett Test for Het-
eroscedastic One-Way MANOVA,” Metrika, 76, 135–152.
Zhang, P. (1992a), “On the Distributional Properties of Model Selection
Criterion,” Journal of the American Statistical Association, 87, 732-737.
Zhang, P. (1992b), “Inference After Variable Selection in Linear Regression
Models,” Biometrika, 79, 741-746.
Zhang, T., and Yang, B. (2017), “Box-Cox Transformation in Big Data,”
Technometrics, 59, 189-201.
Zhang, X., and Cheng, G. (2017), “Simultaneous Inference for High-
Dimensional Linear Models,” Journal of the American Statistical Association,
112, 757-768.
Zhao, P., and Yu, B. (2006), “On Model Selection Consistency of Lasso,”
Journal of Machine Learning Research, 7, 2541-2563.
Zheng, Z., and Loh, W.-Y. (1995), “Consistent Variable Selection in Linear
Models,” Journal of the American Statistical Association, 90, 151-156.
Zhou, M. (2001), “Understanding the Cox Regression Models with Time-
Change Covariates,” The American Statistician, 55, 153-155.
Zimmerman, D.L. (2020a), Linear Model Theory with Examples and Ex-
ercises, Springer, New York, NY.
Zimmerman, D.L. (2020b), Linear Model Theory: Exercises and Solutions,
Springer, New York, NY.
Zou, H., and Hastie, T. (2005), “Regularization and Variable Selection
Via the Elastic Net,” Journal of the Royal Statistical Society, B, 67,
301-320.
Zuur, A.F., Ieno, E.N., Walker, N.J., Saveliev, A.A., and Smith, G.M.
(2009), Mixed Effects Models and Extensions in Ecology with R, Springer,
New York, NY.
Index

Čížek, 325 beta–binomial regression, 426


1D regression, 4, 141, 274 Bhatia, 60, 230
1D regression model, 425 Bickel, 175, 178, 201, 203
binary regression, 426, 431
Abraham, v, 437 binomial regression, 426, 431
active set, 236 bivariate normal, 33
additive error regression, 2, 5, 159, 426 BLUE, 3, 88, 118
additive error single index model, 430 Bogdan, 262
additive predictor, 5 bootstrap, 34, 203
AER, 3 Box, 12
affine equivariant, 282, 336 Box–Cox transformation, 12
affine transformation, 282, 336 breakdown, 283, 336
Agresti, v, 426, 427, 441, 442, 500 Breiman, 178
Agulló, 351 Brillinger, 425, 496, 500
Akaike, 142, 150, 488 Brown, 418
Aldrin, 496, 503 Buckland, 204
Anderson, 105, 150, 262, 378, 455 Budny, 169
ANOVA, 119 Burnham, 150, 262, 455
Anton, v Butler, 263, 288
AP, 3 Buxton, 317, 319, 322, 326, 332, 349,
asymptotic distribution, 34, 37 354, 356, 392
asymptotic theory, 34
asymptotically optimal, 156, 163
Atkinson, 356, 432 C, v
attractor, 288 Cameron, 474, 500, 501
Camponovo, 260
Büchlmann, 178 Candes, 277
bagging estimator, 178 case, 2, 13
basic resampling, 288 Cator, 288, 293, 295
Bassett, 353 Cauchy Schwartz inequality, 190
Becker, 412, 506 cdf, 3
Belsley, 259 centering matrix, 164
Berk, 204, 259 cf, 3, 46
Berndt, 378, 390 Chang, 105, 205, 351, 430, 497, 499, 500
Bernholt, 324, 351 Charkhi, 151, 155, 489
Bertsimas, 259, 261 Chatterjee, 260
best linear unbiased estimator, 88 Chebyshev’s Inequality, 40

Chen, 150, 169, 171, 200, 203, 243, 262, degrees of freedom, 19, 264
275, 488, 497 Delta Method, 36
Cheng, 259 Denham, 260
Chew, 167 Det-MCD, 302, 306
Cho, 262 Devlin, 291
Chow, v Dey, v
Christensen, v, 74, 82, 101 Dezeure, 259
Chun, 225, 260 df, 19
CI, 3 DGK estimator, 291
Claeskens, 151, 154, 155, 204, 489, 500, discriminant function, 432
501 dispersion matrix, 282
Claeskins, 259 DOE, 119
Clarke, 203 dot plot, 122, 279, 411
classical prediction region, 167 double bootstrap, 205
Cleveland, 501 Driscoll, 78
CLT, 3 Duan, 495–497, 500
CLTS, 329
coefficient of multiple determination, 18 EAP, 3
Collett, 438, 474, 500 Eaton, 54
column space, 72, 102 EC, 3
concentration, 288, 291 Eck, 395
conditional distribution, 32 EE plot, 145, 455
confidence region, 170, 201 Efron, 150, 171, 177, 178, 189, 203, 229,
consistent, 39 230, 234, 259, 260
consistent estimator, 39 Efroymson, 259
constant variance MLR model, 13 Eicker, 105
Continuity Theorem, 46 eigenvalue, 221
Continuous Mapping Theorem:, 46 eigenvector, 221
converges almost everywhere, 41, 42 elastic net, 240
converges in distribution, 37 elastic net variable selection, 243
converges in law, 37 elemental set, 283, 288, 290, 328, 331
converges in probability, 39 ellipsoidal trimming, 325
converges in quadratic mean, 40 elliptically contoured, 53, 57, 316
Cook, v, 8, 19, 54, 55, 59, 64, 160, 188, elliptically contoured distribution, 166
193, 224, 225, 260, 261, 322, 350, elliptically symmetric, 53
365, 367, 371, 375, 384, 401, 405, empirical cdf, 172
410, 440, 454, 458, 475, 482, 495, empirical distribution, 172
500 envelope estimators, 401
coordinatewise median, 282 error sum of squares, 17, 30
Cornish, 57 ESP, 3
covariance matrix, 31 ESSP, 3
coverage, 167 estimable, 118
covmb2, 318, 350 estimated additive predictor, 5, 425
Cox, 12, 142, 275, 427 estimated sufficient predictor, 4, 425
Craig’s Theorem, 78, 103 estimated sufficient summary plot, 5,
Cramér, 18 495
Crawley, 506, 508 Euclidean norm, 48, 338
Croux, 56 Ewald, 204, 253, 259
CV, 3 experimental design, 119
exponential family, 428
Daniel, 148 extrapolation, 160, 244
data splitting, 462
Datta, 259, 286 Fahrmeir, 452
DD plot, 313 Fan, 154, 259, 260, 262, 277
feasible generalized least squares, 98 Hampel, 324


Ferguson, 46, 60 Harville, v
Fernholtz, 203 Hastie, 150, 154, 203, 222, 225, 227–230,
Ferrari, 259 234, 235, 237, 240, 243, 259–262,
FF plot, 23, 145, 333, 367 265, 273, 275, 488, 489, 500
Fischer, 351 hat matrix, 14, 29
Fithian, 259 Haughton, 489, 501
fitted values, 14, 213, 254 Hawkins, 144, 260, 262, 263, 288, 322,
Flury, 483 324, 331, 343, 350, 351, 356, 365,
Fogel, 259 410, 430, 456, 463, 468, 488, 497
Forsythe, 418 hbreg, 345
Forzani, 224, 225, 260 He, 344, 351
Fox, 506 Hebbler, 217, 382
Frank, 261 Henderson, 374, 419
Freedman, v, 100, 160, 189–191 Hesterberg, 35, 203
Frey, 157, 171, 206 Heumann, 204
Friedman, vi, 178, 261, 488, 506 high dimensional statistics, 4
Fryzlewicz, 262 highest density region, 159, 163
Fu, 203, 231, 235, 259–261 Hilbe, 465, 500
Fujikoshi, 262, 401, 416, 420 Hinkley, 100
full model, 142, 211, 254 Hjort, 152, 154, 204, 259, 489, 500, 501
full rank, 73 Hocking, v, 109
Furnival, 148 Hoerl, 260
Hoffman, 262, 351
GAM, 3 Hogg, v
Gamma regression model, 426 Hong, 160, 488
Gao, 262 Hosmer, 432, 434, 459, 500
Gauss Markov Theorem-Full Rank Case, Huang, 262
89 Huber, 97, 321, 326, 352
Gaussian MLR model, 14 Hubert, 290, 299, 352
general position, 285, 338, 345 Hurvich, 150, 151, 200, 275, 500
generalized additive model, 5, 425, 465 Hyndman, 163
Generalized Cochran’s Theorem, 82
generalized inverse, 73, 102 i, 173
generalized least squares, 97 identity line, 5, 15, 196, 366, 410
generalized linear model, 4, 425, 428, iff, 3
429 iid, 3, 5, 13, 280, 281
Gill, 203
Gladstone, 26, 194, 300, 320, 334, 354, Jacobian matrix, 49
357, 451, 471 James, 2, 215, 249, 259
GLM, 3, 429, 455 Jammalamadaka, v
Golub, 339 Javanmard, 259
Grübel, 162 Jia, 242
Gram matrix, 228 Johnson, 32, 53, 57, 104, 167, 221, 284,
Graybill, v, 77, 216 287, 292, 303, 364, 368, 415, 475
Gruber, v, 260 Johnstone, 402
Gunst, 230, 231, 259 joint distribution, 32
Guttman, v, 30 Jolliffe, 224
Jones, 150, 262
Hössjer, 324
Hadamard derivative, 203 Kakizawa, 377, 378
Haggstrom, 433, 501 Karhunen Loeve direction, 222
Haitovsky, 262 Karhunen Loeve directions, 260
Hall, 171, 178, 205, 260 Kay, 454, 471
Keleş, 225, 260 Lopuhaä, 288, 293, 295


Kelker, 55 LR, 3, 431
Kennard, 260 LS CLT, 92, 104
Khatri, 66 LTA, 324
Khattree, 377, 378, 402 LTS, 324
Kim, 262, 325 Lu, 259
Klouda, 324 Lumley, vi, 506
Knight, 203, 231, 235, 259–261 Luo, 150, 243, 262, 275
Koenker, 353 Lv, 259, 260, 262
Konietschke, 418
Kotz, 57, 104 Mašı̈ček, 325
Krasnicka, 78 Machado, 175
Kshirsagar, 377, 390 MacKinnon, 100
Kuehl, 123 MAD, 3, 280
Kutner, v Mahalanobis distance, 54, 163, 165, 283,
313, 350
ladder of powers, 8 Mallows, 148, 150, 152, 262, 326, 488
ladder rule, 8 MANOVA model, 408
Lahiri, 260 Marden, 2, 78
Lai, 89, 190, 260 Mardia, 57, 285, 416
Lancelot, 501 Markov’s Inequality, 40
Larsen, v Maronna, 289, 298, 351, 352, 402
lasso, 3, 10, 215, 262, 401 Marquardt, 229
lasso variable selection, 260 Marx, v
Law of Total Probability, 154 Masking, 322
least squares, 14 Mason, 230, 231, 259
least squares estimators, 363, 409 Mathsoft, 506
Ledolter, v, 437 matrix norm, 338
Lee, v, 26, 33, 83, 86, 94, 97, 119, 143, MB estimator, 292
259, 261, 377 McCullagh, 500
Leeb, 151, 203, 204, 259, 263 MCD, 288
Lehmann, 42, 43, 60 MCLT, 3
Lei, 154, 161, 262, 263 mean, 280
Lemeshow, 432, 434, 459, 500 mean square error, 66
Leon, v, 292 MED, 3
Leroy, 263, 283, 291, 326, 329, 340, 352 median, 280, 349
Lesnoff, 501 median absolute deviation, 280, 349
leverage, 160, 244 Meinshausen, 237, 261
Li, 152, 154, 277, 495–497, 500 Mevik, vi, 225, 505
limiting distribution, 35, 37 mgf, 3, 46
Lin, 261 Milton, v
Lindenmayer, 482 minimum chi–square estimator, 441
linearly dependent, 72 minimum covariance determinant, 287
linearly independent, 72 minimum volume ellipsoid, 351
linmodpack, vi mixture distribution, 52, 59
Little, 454, 471 MLD, 3, 281
Liu, 260, 351, 418 MLR, 2, 3, 13
LMS, 324 MLS CLT, 375
location family, 120 model averaging, 204
location model, 27, 279 model sum of squares, 30
Lockhart, 259, 260 modified power transformation, 10
log rule, 8, 455 moment generating function, 79
logistic regression, 276, 431 Monahan, v
Loh, 262 Montanari, 259
Montgomery, 453 overdispersion, 435


Moore, 125 overfit, 143
Morgenthaler, 402
Mosteller, 10 Pötscher, 151, 153, 154, 203, 259
Mount, 324 Parente, 175
Muller, v Park, 347, 350
multicollinearity, 24 Partial F Test Theorem, 94, 104
multiple linear regression, 2, 5, 13 partial least squares, 215, 401
multiple linear regression model, 362 Pati, 277
Multivariate Central Limit Theorem, 48 pdf, 3
multivariate Chebyshev’s inequality, 168 Peña, 322
Multivariate Delta Method, 49 Pelawa Watagoda, 2, 142, 151, 154, 155,
multivariate linear model, 362, 407 160, 163, 176, 179, 180, 203, 237,
multivariate linear regression model, 361 242, 244, 253, 259, 260, 418, 482,
multivariate location and dispersion, 288 488, 489
multivariate location and dispersion percentile method, 171
model, 281, 362 permutation invariant, 336
multivariate normal, 31, 54, 313, 315 Pesch, 351
multivariate t-distribution, 57 PI, 3
MVN, 3, 32 pmf, 3
Myers, v, 442, 444 Poisson regression, 426, 439, 500
Pollard, 325
Nadler, 402 pooled variance estimator, 124
Naik, 377, 378, 402 population correlation, 33
Navarro, 169 population mean, 31
Nelder, 500 Portnoy, 344
Ning, 260 positive breakdown, 285
Nishii, 152 positive definite, 76, 221
noncentral χ2 distribution, 78 positive semidefinite, 76, 221
nonparametric bootstrap, 173, 206 power transformation, 10
nonparametric prediction region, 167 Pratt, 153, 289, 296, 343, 344
Nordhausen, 402 predicted values, 14, 254
norm, 240, 339 prediction region, 163
normal equations, 27 predictor variables, 361, 407
normal MLR model, 14 Press, 62
null space, 73 principal component direction, 222
principal component regression, 221
Obozinski, 262, 401 principal components regression, 215,
observation, 2 221
OD plot, 474 projection matrix, 73
Olive, v, 2, 13, 59, 60, 100, 105, 134, Projection Matrix Theorem, 73
144, 151, 152, 154, 155, 160, 163, pval, 19, 24, 100, 125
165, 167, 169, 174, 176, 180, 203, pvalue, 19, 96
237, 242–244, 259, 260, 262, 263,
273, 281, 288, 291, 306, 311, 324, Qi, 259, 262
325, 328, 331, 343, 350, 351, 365, quadratic form, 76
367, 368, 401, 410, 412, 418, 421, qualitative variable, 13
424, 430, 435, 456, 463, 468, 477, quantitative variable, 13
482, 487–489, 497, 499, 500
OLS, 3, 10, 14 R, 505
order statistics, 157, 280, 349 R Core Team, vi, 206, 501
outlier, 122, 279, 411 rank, 72
outlier resistant regression, 318 Rank Nullity Theorem, 73
outliers, 7, 322 Rao, v, 31
Rathnayake, 151, 152, 203, 237, 243, Schwarz, 142, 150, 488
260, 477, 484, 489, 500 score equations, 230
Raven, 475 SE, 3, 34
Ravishanker, v Searle, v, 77, 81, 82, 108, 374, 402, 419
regression equivariance, 335 Seber, v, 26, 33, 83, 86, 94, 97, 119, 143,
regression equivariant, 335 377
regression sum of squares, 17 selection bias, 151
regression through the origin, 29 Sen, 60, 92, 203, 477
Reid, 78 Sengupta, v
Rejchel, 260 Serfling, 60, 173
relaxed elastic net, 252 Severini, 32, 50, 60, 230
relaxed lasso, 215, 252 Shao, 152, 480
Ren, 175, 178, 201, 203 Sheather, 207
Rencher, v Shibata, 150
residual plot, 5, 14, 366, 410 shrinkage estimator, 203
residuals, 14, 213, 255 Simonoff, 426, 440, 465, 500
response plot, 5, 14, 100, 145, 366, 410, simple linear regression, 28
425, 496 Singer, 60, 92, 477
response transformation, 11 singular value decomposition, 227
response transformation model, 426, 495 Slawski, 242
response variable, 1, 4 SLR, 28
response variables, 361, 407 Slutsky’s Theorem, 45, 50
Reyen, 351 smallest extreme value distribution, 432
RFCH estimator, 297 smoothed bootstrap estimator, 178
Riani, 356, 432 Snee, 229
ridge regression, 215, 262, 401 SP, 3
Riedwyl, 483 span, 71, 102
Rinaldo, 200, 277, 500 sparse model, 4
Ripley, vi, 501, 506 spectral decomposition, 221
Ro, 319 Spectral Decomposition Theorem, 76
Rocke, 288, 299 spectral norm, 339
Rohatgi, 33, 46 spherical, 54
Ronchetti, 97, 321, 352 split conformal prediction interval, 161
Rothman, 263 square root matrix, 76, 99, 103, 222
Rousseeuw, 263, 283, 288, 291, 306, 313, Srivastava, 66
324, 326, 329, 340, 351, 352 SSP, 3, 495
row space, 72 Stahel-Donoho estimator, 351
RR plot, 23, 145, 366 standard deviation, 280
Rupasinghe Arachchige Don, 134, 417, standard error, 34
418, 421, 424 Stapleton, v
STATLIB, 459
S, 42 Staudte, 207
sample correlation matrix, 164 Steinberger, 263
sample covariance matrix, 164, 349 Stewart, v, 60, 230
sample mean, 16, 34, 164, 349 Su, 19, 160, 188, 260, 262, 365, 371, 375,
sandwich estimator, 100 401
SAS Institute, 405 submodel, 142
Savin, 378, 390 subspace, 71
scale equivariant, 336 sufficient predictor, 4, 142, 425
Schaaffhausen, 354, 437, 453 sufficient summary plot, 495
Schaalje, v Sun, 262
Scheffé, v supervised learning, 2
Schneider, 204, 253, 259 SVD, 227
Schomaker, 204 Swamping, 322
symmetrically trimmed mean, 281 W, 42
Wackerly, v
Tao, 277 Walpole, v
Tarr, 319 Wang, v, 204, 262, 351
Taylor, 241, 261 Wasserman, 263
test data, 2 Wedderburn, 500
Tibshirani, 151, 203, 236, 241, 259, 260, weighted least squares, 98
488, 500 Weisberg, v, 8, 193, 365, 367, 384, 405,
Tikhonov regularization, 260 410, 440, 454, 458, 475, 482, 495,
time series, 275 500, 506
total sum of squares, 17 Welch, 418
trace, 66, 77, 228 White, 50, 60, 100
training data, 2 Wichern, 32, 167, 221, 284, 287, 292,
transformation plot, 10, 11 303, 364, 368, 415
Tremearne, 5, 266, 299, 334 Wieczorek, 154, 262
trimmed views estimator, 326 Wilcoxon rank estimator, 326
Trivedi, 474, 500, 501 Wilson, 148
Tsai, 150, 151, 200, 275, 500
Winkelmann, 440, 474, 500
Tu, 480
Wold, 260
Tukey, 10, 11
Wood, vi, 148, 472, 500
Tutz, 452
Woodruff, 288, 299
TV estimator, 326, 351
Tyler, 402
Xu, 259
uncorrected total sum of squares, 30
underfit, 143, 148 Yang, 59, 178, 259, 262
underfitting, 142 Yohai, 352
unimodal MLR model, 14 Yu, 152, 178, 242
Uraibi, 262

van de Geer, 259 Zamar, 298


Van Driessen, 288, 306, 313 zero breakdown, 285
Van Loan, 339 Zhang, 59, 89, 259, 262, 351, 418, 500
variable selection, 455 Zhao, 152
variance, 280 Zheng, 262
vector norm, 338 Zhou, 204, 500
vector space, 71 Zimmerman, v
Venables, vi, 501, 506 Zou, 240, 265
von Mises differentiable statistical Zuo, 351
functions, 173 Zuur, 465, 500
