Lin Mod Book
Olive
Springer
Preface
from robust multivariate location and dispersion. Chapter 8 gives theory for
the multivariate linear model where there are m ≥ 2 response variables.
Chapter 9 examines the one way MANOVA model, which is a special case of
the multivariate linear model. Chapter 10 generalizes much of the material
from Chapters 2–6 to many other regression models, including generalized
linear models and some survival regression models. Chapter 11 gives some
information about R and some hints for homework problems.
Chapters 2–4 are the most important for a standard course in Linear Model
Theory, along with the multivariate normal distribution and some large sam-
ple theory from Chapter 1. Some highlights of this text follow.
• Prediction intervals are given that can be useful even if n < p.
• The response plot is useful for checking the model.
• The large sample theory for the elastic net, lasso, and ridge regression
is greatly simplified. Large sample theory for variable selection and lasso
variable selection is given.
• The bootstrap is used for inference after variable selection if n ≥ 10p.
• Data splitting is used for inference after variable selection or model build-
ing if n < 5p.
• Most of the above highlights are extended to many other regression models
such as generalized linear models and some survival regression models.
The website (http://parker.ad.siu.edu/Olive/linmodbk.htm) for this book
provides R programs in the file linmodpack.txt and several R data sets in
the file linmoddata.txt. Section 11.1 discusses how to get the data sets and
programs into the software, but the following commands will work.
Downloading the book’s R functions linmodpack.txt and data files
linmoddata.txt into R: The following commands
source("http://parker.ad.siu.edu/Olive/linmodpack.txt")
source("http://parker.ad.siu.edu/Olive/linmoddata.txt")
can be used to download the R functions and data sets into R. (Copy and paste
these two commands into R from near the top of the file (http://parker.ad.
siu.edu/Olive/linmodhw.txt), which contains commands that are useful for
doing many of the R homework problems.) Type ls(). Over 100 R functions
from linmodpack.txt should appear. Exit R with the command q() and click
No.
The R software is used in this text. See R Core Team (2016). Some pack-
ages used in the text include glmnet Friedman et al. (2015), leaps Lum-
ley (2009), MASS Venables and Ripley (2010), mgcv Wood (2017), and pls
Mevik et al. (2015).
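As a minimal sketch (assuming an internet connection and that these CRAN package names are current), the packages can be installed and loaded with:
install.packages(c("glmnet", "leaps", "MASS", "mgcv", "pls")) #one time install
library(glmnet); library(leaps); library(MASS); library(mgcv); library(pls) #load for the session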
Acknowledgments
Teaching this course in 2014 as Math 583 and in 2019 and 2021 as Math
584 at Southern Illinois University was very useful.
Contents
1 Introduction
  1.1 Overview
  1.2 Response Plots and Response Transformations
    1.2.1 Response and Residual Plots
    1.2.2 Response Transformations
  1.3 A Review of Multiple Linear Regression
    1.3.1 The ANOVA F Test
    1.3.2 The Partial F Test
    1.3.3 The Wald t Test
    1.3.4 The OLS Criterion
    1.3.5 The Location Model
    1.3.6 Simple Linear Regression
    1.3.7 The No Intercept MLR Model
  1.4 The Multivariate Normal Distribution
  1.5 Large Sample Theory
    1.5.1 The CLT and the Delta Method
    1.5.2 Modes of Convergence and Consistency
    1.5.3 Slutsky’s Theorem and Related Results
    1.5.4 Multivariate Limit Theorems
  1.6 Mixture Distributions
  1.7 Elliptically Contoured Distributions
  1.8 Summary
  1.9 Complements
  1.10 Problems
Index
Chapter 1
Introduction
This chapter provides a preview of the book, and contains several sections
that will be useful for linear model theory. Section 1.2 defines 1D regression
and gives some techniques useful for checking the 1D regression model and visualizing the regression model in the background of the data. Section 1.3 reviews the multiple
linear regression model. Sections 1.4 and 1.7 cover the multivariate normal
distribution and elliptically contoured distributions. Some large sample the-
ory is presented in Section 1.5, and Section 1.6 covers mixture distributions.
Section 1.4 is important, but the remaining sections can be skimmed and
then reviewed as needed.
1.1 Overview
Linear Model Theory provides theory for the multiple linear regression model
and some experimental design models. This text will also give theory for the
multivariate linear regression model where there are m ≥ 2 response vari-
ables. Emphasis is on least squares, but some alternative Statistical Learning
techniques, such as lasso and the elastic net, will also be covered. Chapter 10
considers theory for 1D regression models which include the multiple linear
regression model and generalized linear models.
Statistical Learning could be defined as the statistical analysis of multivari-
ate data. Machine learning, data mining, analytics, business analytics, data
analytics, and predictive analytics are synonymous terms. The techniques are
useful for Data Science and Statistics, the science of extracting information
from data. The R software will be used. See R Core Team (2020).
Let z = (z1, ..., zk)^T where z1, ..., zk are k random variables. Often z = (x^T, Y)^T where x = (x1, ..., xp)^T is the vector of predictors and Y is the response variable, which is also called the dependent variable. Usually context will be used to decide
whether z is a random vector or the observed random vector.
Following James et al. (2013, p. 30), the previously unseen test data is not
used to train the Statistical Learning method, but interest is in how well the
method performs on the test data. If the training data is (x1 , Y1 ), ..., (xn , Yn ),
and the previously unseen test data is (xf , Yf ), then particular interest is in
the accuracy of the estimator Ŷf of Yf obtained when the Statistical Learning
method is applied to the predictor xf . The two Pelawa Watagoda and Olive
(2021b) prediction intervals, developed in Section 4.3, will be tools for eval-
uating Statistical Learning methods for the additive error regression model
Yi = m(xi ) + ei = E(Yi |xi ) + ei for i = 1, ..., n where E(W ) is the expected
value of the random variable W . The multiple linear regression (MLR) model,
Yi = β1 + xi,2 β2 + · · · + xi,p βp + ei = xi^T β + ei, is an important special case.
The estimator Ŷf is a prediction if the response variable Yf is continuous,
as occurs in regression models. If Yf is categorical, then Ŷf is a classification.
For example, if Yf can be 0 or 1, then xf is classified to belong to group i if
Ŷf = i for i = 0 or 1.
Following Marden (2006, pp. 5,6), the focus of supervised learning is pre-
dicting a future value of the response variable Yf given xf and the training
data (x1 , Y1 ), ..., (xn , Yn ). Hence the focus is not on hypothesis testing, con-
fidence intervals, parameter estimation, or which model fits best, although
these four inference topics can be useful for better prediction.
The main focus of the first seven chapters is developing tools to analyze
the multiple linear regression model Yi = xTi β + ei for i = 1, ..., n. Classical
regression techniques use (ordinary) least squares (OLS) and assume n >> p,
but Statistical Learning methods often give useful results if p >> n. OLS
forward selection, lasso, ridge regression, and the elastic net will be some of
the techniques examined.
Acronyms are widely used in regression and Statistical Learning, and some
of the more important acronyms appear in Table 1.1. Also see the text’s index.
This section will consider tools for visualizing the regression model in the
background of the data. The definitions in this section tend not to depend
on whether n/p is large or small, but the estimator ĥ tends to be better if
n/p is large. In regression, the response variable is the variable of interest:
the variable you want to predict. The predictors or features x1 , ..., xp are
variables used to predict Y . See Chapter 10 for more on the 1D regression
model.
Notation. Often the index i will be suppressed. For example, the multiple
linear regression model
Yi = xTi β + ei (1.2)
for i = 1, ..., n where β is a p × 1 unknown vector of parameters, and ei is a
random error. This model could be written Y = xT β + e. More accurately,
Y |x = xT β + e, but the conditioning on x will often be suppressed. Often
the errors e1 , ..., en are iid (independent and identically distributed) from a
distribution that is known except for a scale parameter. For example, the
ei ’s might be iid from a normal (Gaussian) distribution with mean 0 and
unknown standard deviation σ. For this Gaussian model, estimation of β
and σ is important for inference and for predicting a new future value of the
response variable Yf given a new vector of predictors xf .
For the additive error regression model, the response plot is a plot of Ŷ
versus Y where the identity line with unit slope and zero intercept is added as
a visual aid. The residual plot is a plot of Ŷ versus r. Assume the errors ei are
iid from a unimodal distribution that is not highly skewed. Then the plotted
points should scatter about the identity line and the r = 0 line (the horizontal
axis) with no other pattern if the fitted model (that produces m̂(x)) is good.
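A minimal sketch of these two plots in base R, using simulated data rather than a data set from the text (the variable names are arbitrary), is:
set.seed(1)
n <- 100
x <- matrix(rnorm(n*3), nrow=n) #three nontrivial predictors
y <- as.vector(10 + x %*% c(1,2,3) + rnorm(n)) #additive error MLR data
out <- lsfit(x,y) #OLS fit
yhat <- y - out$res #fitted values
plot(yhat, y, xlab="FIT", ylab="Y"); abline(0,1) #response plot with identity line
plot(yhat, out$res, xlab="FIT", ylab="RES"); abline(h=0) #residual plot with r = 0 line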
[Figure 1.1: a Response Plot of Y versus FIT and a Residual Plot of RES versus FIT, with cases 3, 44, and 63 highlighted.]
Fig. 1.1 Residual and Response Plots for the Tremearne Data
haps from left hand to right hand). Figure 1.1 presents the (ordinary) least
squares (OLS) response and residual plots for this data set. These plots show
that an MLR model Y = xT β + e should be a useful model for the data
since the plotted points in the response plot are linear and follow the identity
line while the plotted points in the residual plot follow the r = 0 line with
no other pattern (except for a possible outlier marked 44). Note that many
important acronyms, such as OLS and MLR, appear in Table 1.1.
To use the response plot to visualize the conditional distribution of Y |xT β,
use the fact that the fitted values Ŷ = xT β̂. For example, suppose the height
given fit = 1700 is of interest. Mentally examine the plot about a narrow
vertical strip about fit = 1700, perhaps from 1685 to 1715. The cases in the
narrow strip have a mean close to 1700 since they fall close to the identity
line. Similarly, when the fit = w for w between 1500 and 1850, the cases have
heights near w, on average.
Cases 3, 44, and 63 are highlighted. The 3rd person was very tall while
the 44th person was rather short. Beginners often label too many points
as outliers: cases that lie far away from the bulk of the data. See Chapter
7. Mentally draw a box about the bulk of the data ignoring any outliers.
Double the width of the box (about the identity line for the response plot
and about the horizontal line for the residual plot). Cases outside of this
imaginary doubled box are potential outliers. Alternatively, visually estimate
the standard deviation of the residuals in both plots. In the residual plot look
for residuals that are more than 5 standard deviations from the r = 0 line.
In Figure 1.1, the standard deviation of the residuals appears to be around
10. Hence cases 3 and 44 are certainly worth examining.
The identity line can also pass through or near an outlier or a cluster
of outliers. Then the outliers will be in the upper right or lower left of the
response plot, and there will be a large gap between the cluster of outliers and
the bulk of the data. Figure 1.1 was made with the following R commands,
using linmodpack function MLRplot and the major.lsp data set from the
text’s webpage.
major <- matrix(scan(),nrow=112,ncol=7,byrow=T)
#copy and paste the data set, then press enter
major <- major[,-1]
X<-major[,-6]
Y <- major[,6]
MLRplot(X,Y) #left click the 3 highlighted cases,
#then right click Stop for each of the two plots
A problem with response and residual plots is that there can be a lot of
black in the plot if the sample size n is large (more than a few thousand). A
variant of the response plot for the additive error regression model would plot
the identity line, the two lines parallel to the identity line corresponding to the
Section 4.1 large sample 100(1 − δ)% prediction intervals for Yf that depend
on Ŷf . Then plot points corresponding to training data cases that do not lie in
their 100(1 − δ)% PI. Use δ = 0.01 or 0.05. Try the following commands that
used δ = 0.2 since n is small. The commands use the linmodpack functions
AERplot and AERplot2. See Problem 1.31.
out<-lsfit(X,Y) #X and Y from the above R code
res<-out$res
yhat<-Y-res #usual response plot
AERplot(yhat,Y,res=res,d=2,alph=1)
AERplot(yhat,Y,res=res,d=2,alph=0.2)
#plots data outside the 80% pointwise PIs
[Figure: four panels a), b), c), and d) plotting x against w.]
method for selecting the response transformation is to plot m̂λi (x) versus
tλi (Z) for several values of λi , choosing the value of λ = λ0 where the plotted
points follow the identity line with unit slope and zero intercept. For the
multiple linear regression model, m̂λi (x) = xT β̂ λi where β̂ λi can be found
using the desired fitting method, e.g. OLS or lasso.
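A minimal sketch of this graphical method with OLS, simulated data, and an assumed coarse grid of powers is below; tlambda, Z, and x are hypothetical names, not functions from the text.
tlambda <- function(z, lam) if(lam == 0) log(z) else z^lam #power transformation
set.seed(2)
n <- 100
x <- matrix(rnorm(n*2), nrow=n)
Z <- exp(1 + x[,1] + 0.5*rnorm(n)) #positive "response"; log is the true transformation
lams <- c(-1, -1/2, 0, 1/3, 1/2, 1) #assumed coarse grid of powers
op <- par(mfrow=c(2,3))
for(lam in lams){
  W <- tlambda(Z, lam)
  fit <- lsfit(x, W) #OLS of t_lambda(Z) on x
  What <- W - fit$res #"fitted values" TZHAT
  plot(What, W, xlab="TZHAT", ylab=paste("lambda =", round(lam,2)))
  abline(0,1) #identity line
}
par(op)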
Definition 1.4. Assume that all of the values of the “response” Zi are
positive. A power transformation has the form Y = tλ(Z) = Z^λ for λ ≠ 0
and Y = t0(Z) = log(Z) for λ = 0, where λ lies in a coarse grid of powers ΛL.
Definition 1.5. Assume that all of the values of the “response” Zi are
positive. Then the modified power transformation family
tλ(Zi) ≡ Zi^(λ) = (Zi^λ − 1)/λ   (1.4)
for λ ≠ 0 and Zi^(0) = log(Zi). Generally λ ∈ Λ where Λ is some interval such
as [−1, 1] or a coarse subset such as ΛL . This family is a special case of the
response transformations considered by Tukey (1957).
There are several reasons to use a coarse grid of powers. First, several of the
powers correspond to simple transformations such as the log, square root, and
cube root. These powers are easier to interpret than λ = 0.28, for example.
According to Mosteller and Tukey (1977, p. 91), the most commonly used
[Figure: transformation plots with vertical axes Z, sqrt(Z), log(Z), and 1/Z (λ = 1, 1/2, 0, −1) and horizontal axis TZHAT.]
points follow the identity line for λ∗ , then take λ̂o = λ∗ , that is, Y = tλ∗ (Z)
is the response transformation.
If more than one value of λ ∈ ΛL gives a linear plot, take the simplest or
most reasonable transformation or the transformation that makes the most
sense to subject matter experts. Also check that the corresponding “residual
plots” of Ŵ versus W −Ŵ look reasonable. The values of λ in decreasing order
of importance are 1, 0, 1/2, −1, and 1/3. So the log transformation would be
chosen over the cube root transformation if both transformation plots look
equally good.
After selecting the transformation, the usual checks should be made. In
particular, the transformation plot for the selected transformation is the re-
sponse plot, and a residual plot should also be made. The following example
illustrates the procedure, and the plots show W = tλ (Z) on the vertical axis.
The label “TZHAT” on the horizontal axis denotes the “fitted values” Ŵ that
result from using W = tλ (Z) as the “response” in the OLS software.
The following review follows Olive (2017a: ch. 2) closely. Several of the results
in this section will be covered in more detail or proven in Chapter 2.
for i = 1, . . . , n. Here n is the sample size and the random variable ei is the
ith error. Suppressing the subscript i, the model is Y = xT β + e.
In matrix form, these n equations become Y = Xβ + e,   (1.6)
Definition 1.10. The constant variance MLR model uses the as-
sumption that the errors e1 , ..., en are iid with mean E(ei ) = 0 and variance
VAR(ei ) = σ 2 < ∞. Also assume that the errors are independent of the pre-
dictor variables xi . The predictor variables xi are assumed to be fixed and
measured without error. The cases (xTi , Yi ) are independent for i = 1, ..., n.
If the predictor variables are random variables, then the above MLR model
is conditional on the observed values of the xi . That is, observe the xi and
then act as if the observed xi are fixed.
Definition 1.11. The unimodal MLR model has the same assumptions
as the constant variance MLR model, as well as the assumption that the zero
mean constant variance errors e1 , ..., en are iid from a unimodal distribution
that is not highly skewed. Note that E(ei ) = 0 and V (ei ) = σ 2 < ∞.
Definition 1.12. The normal MLR model or Gaussian MLR model has
the same assumptions as the unimodal MLR model but adds the assumption
that the errors e1 , ..., en are iid N (0, σ 2 ) random variables. That is, the ei are
iid normal random variables with zero mean and variance σ 2 .
The unknown coefficients for the above 3 models are usually estimated
using (ordinary) least squares (OLS).
Definition 1.14. The ordinary least squares (OLS) estimator β̂ OLS min-
imizes
QOLS(b) = ∑_{i=1}^n ri²(b),   (1.8)
Definition 1.15. For MLR, the response plot is a plot of the ESP = fitted
values = Ŷi versus the response Yi , while the residual plot is a plot of the
ESP = Ŷi versus the residuals ri .
The results in the following theorem are properties of least squares (OLS),
not of the underlying MLR model. Chapter 2 gives linear model theory for
the full rank model. Definitions 1.13 and 1.14 define the hat matrix H, vector
of fitted values Ŷ , and vector of residuals r. Parts f) and g) make residual
plots useful. If the plotted points are linear with roughly constant variance
and the correlation is zero, then the plotted points scatter about the r = 0
line with no other pattern. If the plotted points in a residual plot of w versus
r do show a pattern such as a curve or a right opening megaphone, zero
correlation will usually force symmetry about either the r = 0 line or the
w = median(w) line. Hence departures from the ideal plot of random scatter
about the r = 0 line are often easy to detect.
where v 1 = 1.
Warning: If n > p, as is usually the case for the full rank linear model,
X is not square, so (X^T X)^{-1} ≠ X^{-1}(X^T)^{-1} since X^{-1} does not exist.
H^T = (X^T)^T [(X^T X)^{-1}]^T X^T = X(X^T X)^{-1} X^T = H.
After fitting least squares and checking the response and residual plots to see
that an MLR model is reasonable, the next step is to check whether there is
an MLR relationship between Y and the nontrivial predictors x2 , ..., xp . If at
least one of these predictors is useful, then the OLS fitted values Ŷi should be
used. If none of the nontrivial predictors is useful, then Y will give as good
predictions as Ŷi . Here the sample mean
Y = (1/n) ∑_{i=1}^n Yi.   (1.9)
In the definition below, SSE is the sum of squared residuals and a residual
ri = êi = “errorhat.” In the literature “errorhat” is often rather misleadingly
abbreviated as “error.”
But
A = ∑_{i=1}^n ri Ŷi − Y ∑_{i=1}^n ri = 0
Definition 1.17. Assume that a constant is in the MLR model and that
SSTO 6= 0. The coefficient of multiple determination
R² = [corr(Yi, Ŷi)]² = SSR/SSTO = 1 − SSE/SSTO
The following 2 theorems suggest that R2 does not behave well when many
predictors that are not needed in the model are included in the model. Such
a variable is sometimes called a noise variable and the MLR model is “fitting
noise.” Theorem 1.5 appears, for example, in Cramér (1946, pp. 414-415),
and suggests that R2 should be considerably larger than p/n if the predictors
are useful. Note that if n = 10p and p ≥ 2, then under the conditions of
Theorem 1.5, E(R2 ) ≤ 0.1.
Notice that each SS/n estimates the variability of some quantity. SSTO/n ≈ S²Y, SSE/n ≈ S²e = σ², and SSR/n ≈ S²Ŷ.
The ANOVA F test tests whether any of the nontrivial predictors x2 , ..., xp
are needed in the OLS MLR model, that is, whether Yi should be predicted
by the OLS fit Ŷi = β̂1 + xi,2 β̂2 + · · · + xi,pβ̂p or with the sample mean Y .
ANOVA stands for analysis of variance, and the computer output needed
to perform the test is contained in the ANOVA table. Below is an ANOVA
table given in symbols. Sometimes “Regression” is replaced by “Model” and
“Residual” by “Error.”
Source df SS MS F p-value
Regression p − 1 SSR MSR F0 =MSR/MSE for H0 :
Residual n − p SSE MSE β2 = · · · = βp = 0
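A minimal sketch of obtaining F0 in R with simulated data (using lm rather than the text's linmodpack functions) compares the OLS fit with the model that uses only a constant:
set.seed(3)
n <- 100
x2 <- rnorm(n); x3 <- rnorm(n)
Y <- 1 + 2*x2 + rnorm(n)
full <- lm(Y ~ x2 + x3) #constant plus nontrivial predictors
null <- lm(Y ~ 1) #constant only
anova(null, full) #F test of H0: beta2 = beta3 = 0
summary(full)$fstatistic #same F0 with p - 1 and n - p df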
Remark 1.2. Recall that for a 4 step test of hypotheses, the p–value is the
probability of getting a test statistic as extreme as the test statistic actually
observed and that H0 is rejected if the p–value < δ. As a benchmark for this
textbook, use δ = 0.05 if δ is not given. The 4th step is the nontechnical
conclusion which is crucial for presenting your results to people who are not
familiar with MLR. Replace Y and x2 , ..., xp by the actual variables used in
the MLR model.
P (Fp−1,n−p > F0 ).
Some assumptions are needed on the ANOVA F test. Assume that both
the response and residual plots look good. It is crucial that there are no
outliers. Then a rule of thumb is that if n − p is large, then the ANOVA
F test p–value is approximately correct. An analogy can be made with the
central limit theorem: Y is a good estimator for µ if the Yi are iid N(µ, σ²),
and also a good estimator for µ if the data are iid with mean µ and variance
σ² and n is large enough.
If all of the xi are different (no replication) and if the number of predictors
p = n, then the OLS fit Ŷi = Yi and R2 = 1. Notice that H0 is rejected if the
statistic F0 is large. More precisely, reject H0 if
F0 > Fp−1,n−p,1−δ
where
P (F ≤ Fp−1,n−p,1−δ) = 1 − δ
when F ∼ Fp−1,n−p. Since R2 increases to 1 while (n − p)/(p − 1) decreases
to 0 as p increases to n, Theorem 1.6a below implies that if p is large then
the F0 statistic may be small even if some of the predictors are very good. It
is a good idea to use n ≥ 10p or at least n ≥ 5p if possible.
Remark 1.3. When a constant is not contained in the model (i.e. xi,1 is
not equal to 1 for all i), then the computer output still produces an ANOVA
table with the test statistic and p–value, and nearly the same 4 step test of
hypotheses can be used. The hypotheses are now H0 : β1 = · · · = βp = 0 versus
HA : not H0, and you are testing whether or not there is an MLR relationship
between Y and x1 , ..., xp. An MLR model without a constant (no intercept)
is sometimes called a “regression through the origin.” See Section 1.3.7.
Suppose that there is data on variables Z, w1 , ..., wr and that a useful MLR
model has been made using Y = t(Z), x1 ≡ 1, x2 , ..., xp where each xi is
some function of w1 , ..., wr. This useful model will be called the full model. It
is important to realize that the full model does not need to use every variable
wj that was collected. For example, variables with outliers or missing values
may not be used. Forming a useful full model is often very difficult, and it is
often not reasonable to assume that the candidate full model is good based
on a single data set, especially if the model is to be used for prediction.
Even if the full model is useful, the investigator will often be interested in
checking whether a model that uses fewer predictors will work just as well.
For example, perhaps xp is a very expensive predictor but is not needed given
that x1 , ..., xp−1 are in the model. Also a model with fewer predictors tends
to be easier to understand.
Definition 1.19. Let the full model use Y , x1 ≡ 1, x2 , ..., xp and let the
reduced model use Y , x1 , xi2 , ..., xiq where {i2 , ..., iq} ⊂ {2, ..., p}.
The partial F test is used to test whether the reduced model is good in
that it can be used instead of the full model. It is crucial that the reduced
and full models be selected before looking at the data. If the reduced model
is selected after looking at the full model output and discarding the worst
variables, then the p–value for the partial F test will be too high. If the
data needs to be looked at to build the full model, as is often the case, data
splitting is useful. See Section 6.2.
For (ordinary) least squares, usually a constant is used, and we are assum-
ing that both the full model and the reduced model contain a constant. The
partial F test has null hypothesis H0 : βiq+1 = · · · = βip = 0, and alternative
hypothesis HA : at least one of the βij ≠ 0 for j > q. The null hypothesis is
equivalent to H0 : “the reduced model is good.” Since only the full model and
reduced model are being compared, the alternative hypothesis is equivalent
to HA: “the reduced model is not as good as the full model, so use the full
model,” or more simply, HA : “use the full model.”
To perform the partial F test, fit the full model and the reduced model
and obtain the ANOVA table for each model. The quantities dfF , SSE(F)
and MSE(F) are for the full model and the corresponding quantities from
the reduced model use an R instead of an F . Hence SSE(F) and SSE(R) are
the residual sums of squares for the full and reduced models, respectively.
Shown below is output only using symbols.
Full model
Reduced model
iii) Find the pval = P(FdfR −dfF ,dfF > FR ). ( Here dfR −dfF = p−q = number
of parameters set to 0, and dfF = n − p, while pval is the estimated p–value.)
iv) State whether you reject H0 or fail to reject H0 . Reject H0 if the pval ≤ δ
and conclude that the full model should be used. Otherwise, fail to reject H0
and conclude that the reduced model is good.
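A minimal sketch of the partial F test in R with simulated data is below; the full and reduced models are chosen in advance, and anova() returns FR and the pval.
set.seed(4)
n <- 100
x2 <- rnorm(n); x3 <- rnorm(n); x4 <- rnorm(n)
Y <- 1 + 2*x2 + 3*x3 + rnorm(n)
full <- lm(Y ~ x2 + x3 + x4) #full model
red <- lm(Y ~ x2 + x3) #reduced model: H0 says beta4 = 0
anova(red, full) #partial F statistic FR and its pval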
Six plots are useful diagnostics for the partial F test: the RR plot with
the full model residuals on the vertical axis and the reduced model residuals
on the horizontal axis, the FF plot with the full model fitted values on the
vertical axis and the reduced model fitted values on the horizontal axis, and
always make the response and residual plots for the full
and reduced models. Suppose that the full model is a useful MLR model. If
the reduced model is good, then the response plots from the full and reduced
models should be very similar, visually. Similarly, the residual plots from
the full and reduced models should be very similar, visually. Finally, the
correlation of the plotted points in the RR and FF plots should be high,
≥ 0.95, say, and the plotted points in the RR and FF plots should cluster
tightly about the identity line. Add the identity line to both the RR and
FF plots as a visual aid. Also add the OLS line from regressing r on rR to
the RR plot (the OLS line is the identity line in the FF plot). If the reduced
model is good, then the OLS line should nearly coincide with the identity line
in that it should be difficult to see that the two lines intersect at the origin.
If the FF plot looks good but the RR plot does not, the reduced model may
be good if the main goal of the analysis is to predict Y.
Use the normal table or the d = Z line in the t–table if the degrees of freedom
d = n − p ≥ 30. Again pval is the estimated p–value.
iv) State whether you reject H0 or fail to reject H0 and give a nontechnical
sentence restating your conclusion in terms of the story problem.
[Figure 1.4: two panels with horizontal axes OLSESP and BADESP.]
Fig. 1.4 The OLS Fit Minimizes the Sum of Squared Residuals
where the residual ri(η) = Yi − xi^T η. In other words, let ri = ri(β̂) be the OLS
residuals. Then ∑_{i=1}^n ri² ≤ ∑_{i=1}^n ri²(η) for any p × 1 vector η, and the equality
holds if and only if η = β̂ if the n × p design matrix X is of full rank p ≤ n.
In particular, if X has full rank p, then ∑_{i=1}^n ri² < ∑_{i=1}^n ri²(β) = ∑_{i=1}^n ei²
even if the MLR model Y = Xβ + e is a good approximation to the data.
Warning: Often η is replaced by β: QOLS(β) = ∑_{i=1}^n ri²(β). This no-
tation is often used in Statistics when there are estimating equations. For
example, maximum likelihood estimation uses the log likelihood log(L(θ))
where θ is the vector of unknown parameters and the dummy variable in the
log likelihood.
Theorem 1.8. The OLS estimator β̂ is the unique minimizer of the OLS
criterion if X has full rank p ≤ n.
Proof: Seber and Lee (2003, pp. 36-37). Recall that the hat matrix
H = X(X T X)−1 X T and notice that (I −H)T = I −H, that (I −H)H = 0
and that HX = X. Let η be any p × 1 vector. Then
Y T (I − H)H(Y − Xη) = 0.
Thus QOLS(η) = ‖Y − Xη‖² = ‖Y − Xβ̂ + Xβ̂ − Xη‖² =
‖Y − Xβ̂‖² + ‖Xβ̂ − Xη‖² + 2(Y − Xβ̂)^T(Xβ̂ − Xη).
Hence
‖Y − Xη‖² = ‖Y − Xβ̂‖² + ‖Xβ̂ − Xη‖².   (1.13)
So
‖Y − Xη‖² ≥ ‖Y − Xβ̂‖²
with equality iff
X(β̂ − η) = 0
for j = 1, ..., p. Combining these equations into matrix form, setting the
derivative to zero and calling the solution β̂ gives
X T Y − X T X β̂ = 0,
or
X T X β̂ = X T Y . (1.14)
The equations in (1.14) are known as the normal equations. If X has full rank then
β̂ = (X T X)−1 X T Y . To show that β̂ is the global minimizer of the OLS
criterion, use the argument following Equation (1.13).
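A minimal numerical sketch of solving the normal equations (1.14) in R with simulated data, checked against the OLS software:
set.seed(5)
n <- 100
X <- cbind(1, matrix(rnorm(n*2), nrow=n)) #design matrix with a constant, p = 3
Y <- X %*% c(1, 2, -1) + rnorm(n)
bhat <- solve(t(X) %*% X, t(X) %*% Y) #solves X^T X bhat = X^T Y
cbind(bhat, lsfit(X[,-1], Y)$coef) #agrees with the OLS software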
Setting the derivative equal to 0 and calling the unique solution µ̂ gives
∑_{i=1}^n Yi = nµ̂, or µ̂ = Y. The second derivative
d²QOLS(η)/dη² = 2n > 0,
hence µ̂ is the global minimizer.
Yi = β1 + β2 Xi + ei = α + βXi + ei
where the ei are iid with E(ei ) = 0 and VAR(ei ) = σ 2 for i = 1, ..., n. The Yi
and ei are random variables while the Xi are treated as known constants.
The SLR model is a special case of the MLR model with p = 2, xi,1 ≡ 1, and
xi,2 = Xi . For SLR, E(Yi ) = β1 + β2 Xi and the line E(Y ) = β1 + β2 X is the
regression function. VAR(Yi ) = σ 2 .
For SLR, the least squares estimators β̂1 and β̂2 minimize the least squares
criterion Q(η1, η2) = ∑_{i=1}^n (Yi − η1 − η2 Xi)². For a fixed η1 and η2,
Q is the sum of the squared vertical deviations from the line Y = η1 + η2 X.
The least squares (OLS) line is Ŷ = β̂1 + β̂2 X where the slope
β̂2 ≡ β̂ = ∑_{i=1}^n (Xi − X)(Yi − Y) / ∑_{i=1}^n (Xi − X)²
and the partial derivatives satisfy
∂Q/∂η1 = −2 ∑_{i=1}^n (Yi − η1 − η2 Xi) and ∂²Q/∂η1² = 2n.
Similarly,
∂Q/∂η2 = −2 ∑_{i=1}^n Xi (Yi − η1 − η2 Xi)
and
∂²Q/∂η2² = 2 ∑_{i=1}^n Xi².
Setting the first partial derivatives to zero and calling the solutions β̂1 and
β̂2 shows that the OLS estimators β̂1 and β̂2 satisfy the normal equations:
∑_{i=1}^n Yi = nβ̂1 + β̂2 ∑_{i=1}^n Xi and
∑_{i=1}^n Xi Yi = β̂1 ∑_{i=1}^n Xi + β̂2 ∑_{i=1}^n Xi².
ki = (Xi − X) / ∑_{j=1}^n (Xj − X)².   (1.16)
The no intercept MLR model, also known as regression through the origin, is
still Y = Xβ+e, but there is no intercept in the model, so X does not contain
a column of ones 1. Hence the intercept term β1 = β1 (1) is replaced by β1 xi1 .
Software gives output for this model if the “no intercept” or “intercept = F”
option is selected. For the no intercept model, the assumption E(e) = 0 is
important, and this assumption is rather strong.
Many of the usual MLR results still hold: β̂OLS = (X^T X)^{-1} X^T Y, the
vector of predicted fitted values Ŷ = Xβ̂OLS = HY where the hat matrix
H = X(X^T X)^{-1} X^T provided the inverse exists, and the vector of residuals
is r = Y − Ŷ. The response plot and residual plot are made in the same way
and should be made before performing inference.
The main difference in the output is the ANOVA table. The ANOVA F
test in Section 1.3.1 tests H0 : β2 = · · · = βp = 0. The test in this subsection
d) The degrees of freedom (df) for SSM is p, the df for SSE is n − p and
the df for SST is n. The mean squares are MSE = SSE/(n − p) and MSM =
SSM/p.
The ANOVA table given for the “no intercept” or “intercept = F” option
is below.
Source df SS MS F p-value
Model p SSM MSM F0 =MSM/MSE for H0 :
Residual n − p SSE MSE β=0
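A minimal sketch in R with simulated data; in lm() the "- 1" in the formula is one way to select the no intercept option.
set.seed(6)
n <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
Y <- 2*x1 + 3*x2 + rnorm(n)
noint <- lm(Y ~ x1 + x2 - 1) #regression through the origin
coef(noint) #betahat_OLS = (X^T X)^{-1} X^T Y
summary(noint) #F statistic tests H0: beta = 0 with p and n - p df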
For much of this book, X is an n × p design matrix, but this section will usu-
ally use the notation X = (X1 , ..., Xp)T and Y for the random vectors, and
x = (x1 , ..., xp)T for the observed value of the random vector. This notation
will be useful to avoid confusion when studying conditional distributions such
as Y |X = x. It can be shown that Σ is positive semidefinite and symmetric.
and
E(AX) = AE(X) and E(AXB) = AE(X)B. (1.22)
Thus
Cov(a + AX) = Cov(AX) = ACov(X)AT . (1.23)
Cov(X) = Σ.
Theorem 1.10. a) All subsets of a MVN are MVN: (Xk1 , ..., Xkq )T
∼ Nq (µ̃, Σ̃) where µ̃i = E(Xki ) and Σ̃ ij = Cov(Xki , Xkj ). In particular,
X 1 ∼ Nq (µ1 , Σ 11 ) and X 2 ∼ Np−q (µ2 , Σ 22 ).
b) If X 1 and X 2 are independent, then Cov(X 1 , X 2 ) = Σ 12 =
E[(X 1 − E(X 1 ))(X 2 − E(X 2 ))T ] = 0, a q × (p − q) matrix of zeroes.
c) If X ∼ Np (µ, Σ), then X 1 and X 2 are independent iff Σ 12 = 0.
d) If X 1 ∼ Nq (µ1 , Σ 11 ) and X 2 ∼ Np−q (µ2 , Σ 22 ) are independent, then
(X1^T, X2^T)^T ∼ Np((µ1^T, µ2^T)^T, Σ) where Σ is block diagonal with upper left block Σ11 and lower right block Σ22.
X1 | X2 = x2 ∼ Nq(µ1 + Σ12 Σ22^{-1}(x2 − µ2), Σ11 − Σ12 Σ22^{-1} Σ21).
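A minimal sketch of computing this conditional mean and covariance matrix numerically in R; the mean vector, covariance matrix, and partition below are arbitrary choices.
mu <- c(1, 2, 3, 4)
Sigma <- matrix(c(4,1,1,0, 1,3,1,1, 1,1,2,1, 0,1,1,3), nrow=4, byrow=TRUE)
i1 <- 1:2; i2 <- 3:4 #X1 = first two coordinates, X2 = last two
x2 <- c(2.5, 3.5) #observed value of X2
S12 <- Sigma[i1,i2]; S22inv <- solve(Sigma[i2,i2])
condmean <- mu[i1] + S12 %*% S22inv %*% (x2 - mu[i2])
condcov <- Sigma[i1,i1] - S12 %*% S22inv %*% Sigma[i2,i1]
condmean; condcov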
Example 1.5. Let p = 2 and let (Y, X)T have a bivariate normal distri-
bution. That is,
(Y, X)^T ∼ N2((µY, µX)^T, Σ) where Σ11 = VAR(Y) = σY², Σ12 = Σ21 = Cov(Y, X) = Cov(X, Y), and Σ22 = VAR(X) = σX².
ρ(X, Y) = Cov(X, Y) / [√VAR(X) √VAR(Y)] = σX,Y / (σX σY)
a²σX² + b²σY² + 2ab Cov(X, Y).
f(x, y) = (1/2) [2π√(1 − ρ²)]^{-1} exp(−(x² − 2ρxy + y²)/(2(1 − ρ²))) + (1/2) [2π√(1 − ρ²)]^{-1} exp(−(x² + 2ρxy + y²)/(2(1 − ρ²))) ≡ (1/2) f1(x, y) + (1/2) f2(x, y)
where x and y are real and 0 < ρ < 1. Since both marginal distributions
of fi(x, y) are N(0,1) for i = 1 and 2 by Theorem 1.10 a), the marginal
distributions of X and Y are N(0,1). Since ∫∫ xy fi(x, y) dx dy = ρ for i = 1
and −ρ for i = 2, X and Y are uncorrelated, but X and Y are not independent
since f(x, y) ≠ fX(x) fY(y).
Remark 1.5. In Theorem 1.11, suppose that X = (Y, X2 , ..., Xp)T . Let
X1 = Y and X 2 = (X2 , ..., Xp)T . Then E[Y |X 2 ] = β1 + β2 X2 + · · · + βp Xp
and VAR[Y |X 2 ] is a constant that does not depend on X 2 . Hence Y |X 2 =
β1 + β2 X2 + · · · + βp Xp + e follows the multiple linear regression model.
The first three subsections will review large sample theory for the univariate
case, then multivariate theory will be given.
Note that the sample mean is estimating the population mean µ with a √n
convergence rate, the asymptotic distribution is normal, and the SE = S/√n.
Y n ≈ N(µ, σ²/n),
The two main applications of the CLT are to give the limiting distribution
of √n(Y n − µ) and the limiting distribution of √n(Yn/n − µX) for a random
variable Yn such that Yn = ∑_{i=1}^n Xi where the Xi are iid with E(X) = µX
and VAR(X) = σX².
by the CLT.
b) Now suppose that Yn ∼ BIN(n, ρ). Then Yn =D ∑_{i=1}^n Xi where
X1, ..., Xn are iid Ber(ρ). Hence
√n(Yn/n − ρ) →D N(0, ρ(1 − ρ))
since
√n(Yn/n − ρ) = √n(X n − ρ) →D N(0, ρ(1 − ρ))
by a).
c) Now suppose that Yn ∼ BIN(kn, ρ) where kn → ∞ as n → ∞. Then
√kn (Yn/kn − ρ) ≈ N(0, ρ(1 − ρ))
or
Yn/kn ≈ N(ρ, ρ(1 − ρ)/kn), or Yn ≈ N(kn ρ, kn ρ(1 − ρ)).
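A minimal numerical check of this normal approximation in R (the values of n, ρ, and the cutoff are arbitrary):
n <- 400; rho <- 0.3; q <- 135
pbinom(q, size=n, prob=rho) #exact P(Yn <= q)
pnorm(q, mean=n*rho, sd=sqrt(n*rho*(1-rho))) #N(n rho, n rho(1-rho)) approximation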
Example 1.8. Let X ∼ Binomial(n, p) where the positive integer n is
large and 0 < p < 1. Find the limiting distribution of √n[(X/n)² − p²].
Solution. Example 1.6 b) gives the limiting distribution of √n(X/n − p). Let
g(p) = p². Then g′(p) = 2p and by the delta method,
√n[(X/n)² − p²] = √n[g(X/n) − g(p)] →D
N(0, p(1 − p)(g′(p))²) = N(0, p(1 − p)4p²) = N(0, 4p³(1 − p)).
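A minimal simulation sketch checking this limiting variance in R (n, p, and the number of runs are arbitrary):
set.seed(7)
n <- 1000; p <- 0.4; nruns <- 10000
X <- rbinom(nruns, size=n, prob=p)
W <- sqrt(n)*((X/n)^2 - p^2) #sqrt(n)[(X/n)^2 - p^2]
var(W) #close to the limiting variance
4*p^3*(1-p)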
Example 1.9. Let Xn ∼ Poisson(nλ) where the positive integer n is large
and λ > 0.
a) Find the limiting distribution of √n(Xn/n − λ).
b) Find the limiting distribution of √n(√(Xn/n) − √λ).
Solution. a) Xn =D ∑_{i=1}^n Yi where the Yi are iid Poisson(λ). Hence E(Y) =
λ = Var(Y). Thus by the CLT,
√n(Xn/n − λ) = √n(∑_{i=1}^n Yi/n − λ) →D N(0, λ).
b) Let g(λ) = √λ. Then g′(λ) = 1/(2√λ) and by the delta method,
√n(√(Xn/n) − √λ) = √n(g(Xn/n) − g(λ)) →D
N(0, λ(g′(λ))²) = N(0, λ/(4λ)) = N(0, 1/4).
Example 1.10. Let Y1, ..., Yn be independent and identically distributed
(iid) from a Gamma(α, β) distribution.
a) Find the limiting distribution of √n(Y − αβ).
b) Find the limiting distribution of √n((Y)² − c) for appropriate constant c.
Example 1.11. Suppose that Xn ∼ U (−1/n, 1/n). Then the cdf Fn (x) of
Xn is
Fn(x) = 0 for x ≤ −1/n, Fn(x) = nx/2 + 1/2 for −1/n ≤ x ≤ 1/n, and Fn(x) = 1 for x ≥ 1/n.
Sketching Fn (x) shows that it has a line segment rising from 0 at x = −1/n
to 1 at x = 1/n and that Fn (0) = 0.5 for all n ≥ 1. Examining the cases
x < 0, x = 0, and x > 0 shows that as n → ∞,
Fn(x) → 0 for x < 0, Fn(x) → 1/2 for x = 0, and Fn(x) → 1 for x > 0.
Notice that the right hand side is not a cdf since right continuity does not
hold at x = 0. Notice that if X is a random variable such that P (X = 0) = 1,
then X has cdf
FX(x) = 0 for x < 0 and FX(x) = 1 for x ≥ 0.
Since x = 0 is the only discontinuity point of FX(x) and since Fn(x) → FX(x)
for all continuity points of FX(x) (i.e. for x ≠ 0),
Xn →D X.
Example 1.12. Suppose Yn ∼ U (0, n). Then Fn (t) = t/n for 0 < t ≤ n
and Fn (t) = 0 for t ≤ 0. Hence limn→∞ Fn (t) = 0 for t ≤ 0. If t > 0 and
n > t, then Fn (t) = t/n → 0 as n → ∞. Thus limn→∞ Fn (t) = 0 for all
t, and Yn does not converge in distribution to any random variable Y since
H(t) ≡ 0 is not a cdf.
Definition 1.28. Let the parameter space Θ be the set of possible values
of θ. A sequence of estimators Tn of τ (θ) is consistent for τ (θ) if
Tn →P τ(θ)
P[u(Y) ≥ c] ≤ E[u(Y)]/c.
If µ = E(Y) exists, then taking u(y) = |y − µ|^r and c̃ = c^r gives
Markov’s Inequality: for r > 0 and any c > 0,
P[|Y − µ| ≥ c] = P[|Y − µ|^r ≥ c^r] ≤ E[|Y − µ|^r]/c^r.
If r = 2 and σ² = VAR(Y) exists, then we obtain
Chebyshev’s Inequality:
P[|Y − µ| ≥ c] ≤ VAR(Y)/c².
Proof. The proof is given for pdfs. For pmfs, replace the integrals by sums.
Now
E[u(Y)] = ∫_R u(y)f(y)dy = ∫_{y: u(y) ≥ c} u(y)f(y)dy + ∫_{y: u(y) < c} u(y)f(y)dy
≥ ∫_{y: u(y) ≥ c} u(y)f(y)dy
Theorem 1.15. a) If
Pθ(|Tn − τ(θ)| ≥ ε) = Pθ[(Tn − τ(θ))² ≥ ε²] ≤ Eθ[(Tn − τ(θ))²]/ε².
Hence
lim_{n→∞} Eθ[(Tn − τ(θ))²] = lim_{n→∞} MSE_{τ(θ)}(Tn) → 0
P(lim_{n→∞} Xn = X) = 1.
Notation such as “Xn converges to X ae” will also be used. Sometimes “ae”
will be replaced with “as” or “wp1.” We say that Xn converges almost everywhere to τ(θ), written
Xn →ae τ(θ),
if P(lim_{n→∞} Xn = τ(θ)) = 1.
P(|Wn| ≤ Dε) ≥ 1 − ε
for all n ≥ N.
d) Similar notation is used for a k × r matrix An = A = [ai,j (n)] if each
element ai,j (n) has the desired property. For example, A = OP (n−1/2 ) if
each ai,j (n) = OP (n−1/2 ).
for some nondegenerate random variable X, then both Wn and µ̂n have
convergence rate nδ .
a) Then Wn = OP (n−δ ).
b) If X is not degenerate, then Wn ≍P n^{-δ}.
The above result implies that if Wn has convergence rate nδ , then Wn has
tightness rate nδ , and the term “tightness” will often be omitted. Part a) is
proved, for example, in Lehmann (1999, p. 67).
The following result shows that if Wn ≍P Xn, then Xn ≍P Wn, Wn =
OP(Xn), and Xn = OP(Wn). Notice that if Wn = OP(n^{-δ}), then n^δ is
a lower bound on the rate of Wn. As an example, if the CLT holds then
Y n = OP(n^{-1/3}), but Y n ≍P n^{-1/2}.
and
P(B) ≡ P(d_{ε/2} ≤ Wn/Xn) ≥ 1 − ε/2
for all n ≥ N = max(N1, N2). Since P(A ∩ B) = P(A) + P(B) − P(A ∪ B) ≥
P(A) + P(B) − 1,
P(A ∩ B) = P(d_{ε/2} ≤ Wn/Xn ≤ D_{ε/2}) ≥ 1 − ε/2 + 1 − ε/2 − 1 = 1 − ε
The following result is used to prove Theorem 1.21, which says that if
there are K estimators Tj,n of a parameter β such that ‖Tj,n − β‖ =
OP(n^{-δ}) where 0 < δ ≤ 1, and if Tn* picks one of these estimators, then
‖Tn* − β‖ = OP(n^{-δ}).
Theorem 1.20: Pratt (1959). Let X1,n , ..., XK,n each be OP (1) where
K is fixed. Suppose Wn = Xin ,n for some in ∈ {1, ..., K}. Then
Wn = OP (1). (1.24)
Proof.
FWn(x) ≤ P(min{X1,n, ..., XK,n} ≤ x) = 1 − P(X1,n > x, ..., XK,n > x).
Since K is finite, there exists B > 0 and N such that P(Xi,n ≤ B) > 1 − ε/(2K)
and P(Xi,n > −B) > 1 − ε/(2K) for all n > N and i = 1, ..., K. Bonferroni’s
inequality states that P(∩_{i=1}^K Ai) ≥ ∑_{i=1}^K P(Ai) − (K − 1). Thus
Proof. Let Xj,n = n^δ ‖Tj,n − β‖. Then Xj,n = OP(1) so by Theorem 1.20,
n^δ ‖Tn* − β‖ = OP(1). Hence ‖Tn* − β‖ = OP(n^{-δ}).
Theorem 1.22: Slutsky’s Theorem. Suppose Yn →D Y and Wn →P w for
some constant w. Then
a) Yn + Wn →D Y + w,
b) Yn Wn →D wY, and
c) Yn/Wn →D Y/w if w ≠ 0.
Theorem 1.23. a) If Xn →P X, then Xn →D X.
b) If Xn →ae X, then Xn →P X and Xn →D X.
c) If Xn →r X, then Xn →P X and Xn →D X.
d) Xn →P τ(θ) iff Xn →D τ(θ).
e) If Xn →P θ and τ is continuous at θ, then τ(Xn) →P τ(θ).
f) If Xn →D θ and τ is continuous at θ, then τ(Xn) →D τ(θ).
Suppose that for all θ ∈ Θ, Tn →D τ(θ), Tn →r τ(θ), or Tn →ae τ(θ). Then
Tn is a consistent estimator of τ(θ) by Theorem 1.23. We are assuming that
the function τ does not depend on n.
Example 1.13. Let Y1, ..., Yn be iid with mean E(Yi) = µ and variance
V(Yi) = σ². Then the sample mean Y n is a consistent estimator of µ since i)
the SLLN holds (use Theorems 1.17 and 1.23), ii) the WLLN holds, and iii)
the CLT holds (use Theorem 1.16). Since
Zi = (Yi − µ)/σ
has mean 0, variance 1, and mgf mZ(t) = exp(−tµ/σ) mY(t/σ) for |t| < σto,
we want to show that
Wn = √n (Y n − µ)/σ →D N(0, 1).
Notice that
Wn = n^{-1/2} ∑_{i=1}^n Zi = n^{-1/2} ∑_{i=1}^n (Yi − µ)/σ = n^{-1/2} (∑_{i=1}^n Yi − nµ)/σ = n^{-1/2} (Y n − µ)/[(1/n)σ].
Thus
mWn(t) = E(e^{tWn}) = E[exp(t n^{-1/2} ∑_{i=1}^n Zi)] = E[exp(∑_{i=1}^n tZi/√n)]
= ∏_{i=1}^n E[e^{tZi/√n}] = ∏_{i=1}^n mZ(t/√n) = [mZ(t/√n)]^n.
Now ψ(0) = log[mZ(0)] = log(1) = 0. Thus by L’Hôpital’s rule (where the
derivative is with respect to n), lim_{n→∞} log[mWn(t)] =
lim_{n→∞} ψ(t/√n)/(1/n) = lim_{n→∞} [ψ′(t/√n)(−t/(2n^{3/2}))]/(−1/n²) = (t/2) lim_{n→∞} ψ′(t/√n)/(1/√n).
Now
ψ′(0) = m′Z(0)/mZ(0) = E(Zi)/1 = 0,
so L’Hôpital’s rule can be applied again, giving lim_{n→∞} log[mWn(t)] =
(t/2) lim_{n→∞} [ψ″(t/√n)(−t/(2n^{3/2}))]/(−1/(2n^{3/2})) = (t²/2) lim_{n→∞} ψ″(t/√n) = (t²/2) ψ″(0).
Now
ψ″(t) = d/dt [m′Z(t)/mZ(t)] = [m″Z(t) mZ(t) − (m′Z(t))²]/[mZ(t)]².
So
ψ″(0) = m″Z(0) − [m′Z(0)]² = E(Zi²) − [E(Zi)]² = 1.
Hence lim_{n→∞} log[mWn(t)] = t²/2 and
Theorems 1.26 and 1.27 below are the multivariate extensions of the
limit theorems in subsection 1.5.1. When the limiting distribution of Zn =
√n(g(Tn) − g(θ)) is multivariate normal Nk(0, Σ), approximate the joint
cdf of Zn with the joint cdf of the Nk(0, Σ) distribution. Thus to find probabilities, manipulate Zn as if Zn ≈ Nk(0, Σ). To see that the CLT is a special
case of the MCLT below, let k = 1, E(X) = µ, and V(X) = Σx = σ².
To see that the delta method is a special case of the multivariate delta
method, note that if Tn and parameter θ are real valued, then Dg(θ) = g′(θ).
Theorem 1.29. If X1, ..., Xn are iid, E(‖X‖) < ∞, and E(X) = µ, then
a) WLLN: X n →P µ, and
b) SLLN: X n →ae µ.
for all t ∈ Rk .
for all t ∈ Rk .
Application: Proof of the MCLT Theorem 1.26. Note that for fixed
t, the t^T Xi are iid random variables with mean t^T µ and variance t^T Σt.
Hence by the CLT, t^T √n(X n − µ) →D N(0, t^T Σt). The right hand side has
distribution t^T X where X ∼ Nk(0, Σ). Hence by the Cramér Wold Device,
√n(X n − µ) →D Nk(0, Σ).
Theorem 1.32. a) If Xn →P X, then Xn →D X.
b) Xn →P g(θ) iff Xn →D g(θ).
Let g(n) ≥ 1 be an increasing function of the sample size n: g(n) ↑ ∞, e.g.
g(n) = √n. See White (1984, p. 15). If a k × 1 random vector Tn − µ converges
to a nondegenerate multivariate normal distribution with convergence rate
√n, then Tn has (tightness) rate √n.
The following two theorems are taken from Severini (2005, pp. 345-349,
354).
Hence g(zn) →D g(z) by Theorem 1.33.
Mixture distributions are useful for model and variable selection since β̂ Imin ,0
is a mixture distribution of β̂ Ij ,0 , and the lasso estimator β̂ L is a mixture
distribution of β̂L,λi for i = 1, ..., M . See Chapter 4. A random vector u has
a mixture distribution if u equals a random vector uj with probability πj
for j = 1, ..., J. See Definition 1.24 for the population mean and population
covariance matrix of a random vector.
where the probabilities πj satisfy 0 ≤ πj ≤ 1 and ∑_{j=1}^J πj = 1, J ≥ 2,
and Fuj (t) is the cdf of a g × 1 random vector uj . Then u has a mixture
distribution of the uj with probabilities πj .
Hence
E(u) = ∑_{j=1}^J πj E[uj],   (1.28)
and
Cov(u) = ∑_{j=1}^J πj Cov(uj) + ∑_{j=1}^J πj E(uj)[E(uj)]^T − E(u)[E(u)]^T.   (1.29)

Cov(u) = ∑_{j=1}^J πj Cov(uj).
This theorem is easy to prove if the uj are continuous random vectors with
(joint) probability density functions (pdfs) fuj (t). Then u is a continuous
random vector with pdf
fu(t) = ∑_{j=1}^J πj fuj(t), and E[h(u)] = ∫_{-∞}^{∞} ··· ∫_{-∞}^{∞} h(t) fu(t) dt
= ∑_{j=1}^J πj ∫_{-∞}^{∞} ··· ∫_{-∞}^{∞} h(t) fuj(t) dt = ∑_{j=1}^J πj E[h(uj)]
where E[h(uj )] is the expectation with respect to the random vector uj . Note
that
E(u)[E(u)]^T = ∑_{j=1}^J ∑_{k=1}^J πj πk E(uj)[E(uk)]^T.   (1.30)
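A minimal sketch of (1.28) and (1.29) in R for a hypothetical two component mixture; the probabilities, means, and covariance matrices are arbitrary.
pis <- c(0.4, 0.6)
mus <- list(c(0,0), c(2,1))
covs <- list(diag(2), matrix(c(2,0.5,0.5,1), 2, 2))
Eu <- pis[1]*mus[[1]] + pis[2]*mus[[2]] #(1.28)
Covu <- pis[1]*covs[[1]] + pis[2]*covs[[2]] +
  pis[1]*tcrossprod(mus[[1]]) + pis[2]*tcrossprod(mus[[2]]) - tcrossprod(Eu) #(1.29)
Eu; Covu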
E(X) = µ (1.33)
and
Cov(X) = cX Σ (1.34)
where
cX = −2ψ′(0).
h(u) = [π^{p/2}/Γ(p/2)] kp u^{p/2 − 1} g(u).   (1.36)
E(X|B^T X) = µ + MB B^T(X − µ) = aB + MB B^T X   (1.37)
where
aB = µ − MB B^T µ = (Ip − MB B^T)µ,
and
M B = ΣB(B T ΣB)−1 .
See Problem 1.19. Notice that in the formula for M B , Σ can be replaced by
cΣ where c > 0 is a constant. In particular, if the EC distribution has 2nd
moments, Cov(X) can be used instead of Σ.
Theorem 1.39. Let X ∼ ECp (µ, Σ, g) and assume that E(X) exists.
a) Any subset of X is EC, in particular X 1 is EC.
b) (Cook 1998 p. 131, Kelker 1970). If Cov(X) is nonsingular,
b) Even if the first moment does not exist, the conditional median
MED(Y |X) = α + β T2 X
Then B^T ΣB = ΣXX and

ΣB = [ ΣYX ]
     [ ΣXX ] .

Now

E( (Y, X^T)^T | X ) = E( (Y, X^T)^T | B^T (Y, X^T)^T )
= µ + ΣB(B^T ΣB)^{-1} B^T ( (Y, X^T)^T − µ )

by Theorem 1.38. The right hand side of the last equation is equal to

µ + [ ΣYX ] ΣXX^{-1} (X − µX) = [ µY − ΣYX ΣXX^{-1} µX + ΣYX ΣXX^{-1} X ]
    [ ΣXX ]                     [                  X                   ]

β2^T = ΣYX ΣXX^{-1}.
where c > 0 and 0 < γ < 1. Since the multivariate normal distribution is
elliptically contoured (and see Theorem 1.37),
See Mardia et al. (1979, pp. 43, 57). See Johnson and Kotz (1972, p. 134) for
the special case where the xi ∼ N (0, 1).
The following EC(µ, Σ, g) distribution for a p × 1 random vector x is
the uniform distribution on a hyperellipsoid where f(z) = c for z in the
hyperellipsoid where c is the reciprocal of the volume of the hyperellipsoid.
The pdf of the distribution is
f(z) = [Γ(p/2 + 1)/((p + 2)π)^{p/2}] |Σ|^{-1/2} I[(z − µ)^T Σ^{-1} (z − µ) ≤ p + 2].
1.8 Summary
Also
Cov(a + AX) = Cov(AX) = ACov(X)AT .
Note that E(AY ) = AE(Y ) and Cov(AY ) = ACov(Y )AT .
5) If X ∼ Np (µ, Σ), then E(X) = µ and Cov(X) = Σ.
6) If X ∼ Np (µ, Σ) and if A is a q×p matrix, then AX ∼ Nq (Aµ, AΣAT ).
If a is a p × 1 vector of constants, then X + a ∼ Np (µ + a, Σ).
7) All subsets of a MVN are MVN: (Xk1 , ..., Xkq )T ∼ Nq (µ̃, Σ̃) where
µ̃i = E(Xki ) and Σ̃ ij = Cov(Xki , Xkj ). In particular, X 1 ∼ Nq (µ1 , Σ 11 )
and X 2 ∼ Np−q (µ2 , Σ 22 ). If X ∼ Np (µ, Σ), then X 1 and X 2 are indepen-
dent iff Σ 12 = 0.
8) Let (Y, X)^T ∼ N2((µY, µX)^T, Σ) where Σ11 = σY², Σ12 = Σ21 = Cov(Y, X) = Cov(X, Y), and Σ22 = σX².
Also recall that the population correlation between X and Y is given by
ρ(X, Y) = Cov(X, Y)/[√VAR(X) √VAR(Y)] = σX,Y/(σX σY).
X1 | X2 = x2 ∼ Nq(µ1 + Σ12 Σ22^{-1}(x2 − µ2), Σ11 − Σ12 Σ22^{-1} Σ21).
10) Notation:
X1 | X2 ∼ Nq(µ1 + Σ12 Σ22^{-1}(X2 − µ2), Σ11 − Σ12 Σ22^{-1} Σ21).
1.9 Complements
Section 1.5 followed Olive (2014, ch. 8) closely, which is a good Master’s
level treatment of large sample theory. There are several PhD level texts
on large sample theory including, in roughly increasing order of difficulty,
Lehmann (1999), Ferguson (1996), Sen and Singer (1993), and Serfling (1980).
White (1984) considers asymptotic theory for econometric applications.
For a nonsingular matrix, the inverse of the matrix, the determinant of
the matrix, and the eigenvalues of the matrix are continuous functions of
the matrix. Hence if Σ̂ is a consistent estimator of Σ, then the inverse,
determinant, and eigenvalues of Σ̂ are consistent estimators of the inverse,
determinant, and eigenvalues of Σ > 0. See, for example, Bhatia et al. (1990),
Stewart (1969), and Severini (2005, pp. 348-349).
Big Data
Sometimes n is huge and p is small. Then importance sampling and se-
quential analysis with sample size less than 1000 can be useful for inference
for regression and time series models. Sometimes n is much smaller than p,
for example with microarrays. Sometimes both n and p are large.
1.10 Problems
Problems from old qualifying exams are marked with a Q since these problems
take longer than quiz and exam problems.
1.2Q . Suppose that the regression model is Yi = 7+βXi +ei for i = 1, ..., n
where the ei are iid N (0, σ 2 ) random variables. The least squares criterion is
Q(η) = ∑_{i=1}^n (Yi − 7 − ηXi)².
a) What is E(Yi )?
c) Show that your β̂ is the global minimizer of the least squares criterion
Q by showing that the second derivative d²Q(η)/dη² > 0 for all values of η.
1.3Q. The location model is Yi = µ + ei for i = 1, ..., n where the ei are iid
with mean E(ei) = 0 and constant variance VAR(ei) = σ². The least squares
estimator µ̂ of µ minimizes the least squares criterion Q(η) = ∑_{i=1}^n (Yi − η)².
To find the least squares estimator, perform the following steps.
a) Find the derivative dQ/dη, set the derivative equal to zero and solve for
η. Call the solution µ̂.
b) To show that the solution was indeed the global minimizer of Q, show
that d²Q/dη² > 0 for all real η. (Then the solution µ̂ is a local min and Q is
convex, so µ̂ is the global min.)
1.4Q . The normal error model for simple linear regression through the
origin is
Yi = βXi + ei
for i = 1, ..., n where e1 , ..., en are iid N (0, σ 2 ) random variables.
b) Find E(β̂).
c) Find VAR(β̂).
(Hint: Note that β̂ = ∑_{i=1}^n ki Yi where the ki depend on the Xi which are
treated as constants.)
estimator β̂3 of β3 by setting the first derivative dQ(η3)/dη3 equal to zero. Show
that your β̂3 is the global minimizer of the least squares criterion Q by showing
that the second derivative d²Q(η3)/dη3² > 0 for all values of η3.
1.6. Suppose x1, ..., xn are iid p × 1 random vectors from a multivariate
t-distribution with parameters µ and Σ with d degrees of freedom. Then
E(xi) = µ and Cov(xi) = [d/(d − 2)]Σ for d > 2. Assuming d > 2, find the
limiting distribution of √n(x − c) for appropriate vector c.
1.7. Suppose x1, ..., xn are iid p × 1 random vectors where E(xi) = e^{0.5} 1
and Cov(xi) = (e² − e)Ip. Find the limiting distribution of √n(x − c) for
appropriate vector c.
1.12. Let σ12 = Cov(Y, X) and suppose Y and X follow a bivariate normal
distribution
(Y, X)^T ∼ N2((15, 20)^T, Σ) where Σ has diagonal entries 64 and 81 and off-diagonal entries σ12.
where c > 0 and 0 < γ < 1. Following Example 1.17, show that X has
an elliptically contoured distribution assuming that all relevant expectations
exist.
1.14. In Theorem 1.39b, show that if the second moments exist, then Σ
can be replaced by Cov(X).
1.15. Using the notation in Theorem 1.40, show that if the second mo-
ments exist, then
ΣXX^{-1} ΣXY = [Cov(X)]^{-1} Cov(X, Y).
1.16. Using the notation under Theorem 1.38, show that if X is elliptically
contoured, then the conditional distribution of X 1 given that X 2 = x2 is
also elliptically contoured.
1.18. Recall that Cov(X, Y ) = E[(X − E(X))(Y − E(Y ))T ]. Using the
notation of Theorem 1.40, let (Y, X T )T be ECp+1 (µ, Σ, g) where Y is a
random variable. Let the covariance matrix of (Y, X T ) be
ΣY Y ΣY X VAR(Y ) Cov(Y, X)
Cov((Y, X T )T ) = c =
Σ XY Σ XX Cov(X, Y ) Cov(X)
64 1 Introduction
α = µY − βT µX and
β = [Cov(X)]−1 Cov(X, Y ).
1.19. (Due to R.D. Cook.) Let X be a p×1 random vector with E(X) = 0
and Cov(X) = Σ. Let B be any constant full rank p × r matrix where
1 ≤ r ≤ p. Suppose that for all such conforming matrices B,
E(X|BT X) = M B B T X
with 0 < γ < 1 and c > 0. Then E(xi) = µ and Cov(xi) = [1 + γ(c − 1)]Σ.
Find the limiting distribution of √n(x − d) for appropriate vector d.
normal distribution
(Y, X)^T ∼ N2((134, 96)^T, Σ) with Σ11 = 24.5, Σ12 = Σ21 = 1.1, and Σ22 = 23.0.
X1 − Σ12 Σ22^{-1} X2 ∼ Nq(µ1 − Σ12 Σ22^{-1} µ2, Σ11 − Σ12 Σ22^{-1} Σ21).
X1 | X2 ∼ Nq(µ1 + Σ12 Σ22^{-1}(X2 − µ2), Σ11 − Σ12 Σ22^{-1} Σ21).
a) State (but do not derive) the least squares estimators of β1 for both
models. Are these estimators “BLUE”? Why or why not? Quote the relevant
theorem(s) in support of your assertion.
b) Prove that V(β̂1) = σ²/∑_{i=1}^n (xi − x)² for model I, and V(β̂1) =
σ²/∑_{i=1}^n xi² for model II.
i=1
c) Referring to b), show that the variance V (β̂1 ) for Model I is never
smaller than the variance V (β̂1 ) for model II.
1.33.
1.34.
1.35.
1.36.
1.37.
1.38.
1.39.
R Problem
b) Copy and paste the commands for this part into R. They make the
response plot with the points within the pointwise 99% prediction interval
bands omitted. Include this plot in Word. For example, left click on the plot
and hit the Ctrl and c keys at the same time to make a copy. Then paste the
plot into Word, e.g., get into Word and hit the Ctrl and v keys at the same
time.
c) The additive error regression model is a 1D regression model. What is
the sufficient predictor = h(x)?
1.41. The linmodpack function tplot2 makes transformation plots for
the multiple linear regression model Y = t(Z) = xT β + e. Type = 1 for full
model OLS and should not be used if n < 5p, type = 2 for elastic net, 3 for
lasso, 4 for ridge regression, 5 for PLS, 6 for PCR, and 7 for forward selection
with Cp if n ≥ 10p and EBIC if n < 10p. These methods are discussed in
Chapter 5.
Copy and paste the three library commands near the top of linmodrhw
into R.
For parts a) and b), n = 100, p = 4 and Y = log(Z) = 0x1 + x2 + 0x3 +
0x4 + e = x2 + e. (Y and Z are swapped in the R code.)
a) Copy and paste the commands for this part into R. This makes the
response plot for the elastic net using Y = Z and x when the linear model
needs Y = log(Z). Do not include the plot in Word, but explain why the plot
suggests that something is wrong with the model Z = xT β + e.
b) Copy and paste the command for this part into R. Right click Stop 3
times until the horizontal axis has log(z). This is the response plot for the
true model Y = log(Z) = xT β + e = x2 + e. Include the plot in Word. Right
click Stop 3 more times so that the cursor returns in the command window.
c) Is the response plot linear?
For the remaining parts, n = p − 1 = 100 and Y = log(Z) = 0x1 + x2 +
0x3 + · · · + 0x101 + e = x2 + e. Hence the model is sparse.
d) Copy and paste the commands for this part into R. Right click Stop 3
times until the horizontal axis has log(z). This is the response plot for the
true model Y = log(Z) = xT β + e = x2 + e. Include the plot in Word. Right
click Stop 3 more times so that the cursor returns in the command window.
e) Is the plot linear?
f) Copy and paste the commands for this part into R. Right click Stop 3
times until the horizontal axis has log(z). This is the response plot for the true
model Y = log(Z) = xT β + e = x2 + e. Include the plot in Word. Right click
Stop 3 more times so that the cursor returns in the command window. PLS
is probably overfitting since the identity line nearly interpolates the fitted
points.
1.42. Get the R commands for this problem. The data is such that Y =
2 + x2 + x3 + x4 + e where the zero mean errors are iid [exponential(2) -
2]. Hence the residual and response plots should show high skew. Note that
β = (2, 1, 1, 1)T . The R code uses 3 nontrivial predictors and a constant, and
the sample size n = 1000.
a) Copy and paste the commands for part a) of this problem into R. Include
the response plot in Word. Is the lowess curve fairly close to the identity line?
b) Copy and paste the commands for part b) of this problem into R.
Include the residual plot in Word: press the Ctrl and c keys at the same time.
Then use the menu command “Paste” in Word. Is the lowess curve fairly
close to the r = 0 line? The lowess curve is a flexible scatterplot smoother.
c) The output out$coef gives β̂. Write down β̂ or copy and paste β̂ into
Word. Is β̂ close to β?
Chapter 2
Full Rank Linear Models
Vector spaces, subspaces, and column spaces should be familiar from linear
algebra, but are reviewed below.
Definition 2.4. Let x_1, ..., x_k ∈ V. If ∃ scalars α_1, ..., α_k, not all zero, such that ∑_{i=1}^k α_i x_i = 0, then x_1, ..., x_k are linearly dependent. If ∑_{i=1}^k α_i x_i = 0 only if α_i = 0 ∀ i = 1, ..., k, then x_1, ..., x_k are linearly independent. Suppose {x_1, ..., x_k} is a linearly independent set and V = span(x_1, ..., x_k). Then {x_1, ..., x_k} is a linearly independent spanning set for V, known as a basis.
The space spanned by the rows of A is the row space of A. The row space
of A is the column space C(AT ) of AT . Note that
Aw = [a_1 a_2 ... a_m] (w_1, ..., w_m)^T = ∑_{i=1}^m w_i a_i.
With the design matrix X, different notation is used to denote the columns of X since both the columns and rows of X are important. Let
X = [v_1 v_2 ... v_p] = [x_1^T ; ... ; x_n^T],
so the columns of X are v_1, ..., v_p and the rows of X are x_1^T, ..., x_n^T.
rank(A) = rank(AT ) ≤ min(m, n). If rank(A) = min(m, n), then A has full
rank, or A is a full rank matrix.
Generalized inverses are useful for the non-full rank linear model and for
defining projection matrices.
Most of the above results apply to full rank and nonfull rank matrices.
A corollary of the following theorem is that if X is full rank, then P X =
X(X T X)−1 X T = H.
Suppose A is p × p. Then the following are equivalent: 1) A is nonsingular, 2) A has a left inverse L with LA = I_p, and 3) A has a right inverse R with AR = I_p. To see this, note that 1) implies 2) and 3) since A^{-1}A = I_p = AA^{-1} by the definition of an inverse matrix. Suppose AR = I_p. Then the determinant det(I_p) = 1 = det(AR) = det(A) det(R). Hence det(A) ≠ 0 and A is nonsingular. Hence R = A^{-1}AR = A^{-1} and 3) implies 1). Similarly 2) implies 1). Also note that L = LI_p = LAR = I_p R = R = A^{-1}. Hence in the proof below, we could just show that A^− = L or A^− = R.
A^{-1} = T Λ^{-1} T^T = ∑_{i=1}^n (1/λ_i) t_i t_i^T.
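For a symmetric positive definite matrix, this formula can be checked numerically in R with the eigen function; the matrix A below is a hypothetical example, not from the text.

A <- matrix(c(4, 1, 1, 3), nrow = 2)     # hypothetical symmetric positive definite matrix
ed <- eigen(A, symmetric = TRUE)         # A = T Lambda T^T
Tmat <- ed$vectors; lam <- ed$values
Ainv <- Tmat %*% diag(1/lam) %*% t(Tmat) # = sum of (1/lambda_i) t_i t_i^T
all.equal(Ainv, solve(A))                # TRUE up to rounding error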
The following theorem is often useful. Both the expected value and trace
are linear operators. Hence tr(A + B) = tr(A) + tr(B), and E[tr(X)] =
tr(E[X]) when the expected value of the random matrix X exists.
Proof. Two proofs are given. i) Searle (1971, p. 55): Note that E(xxT ) =
Σ + µµT . Since the quadratic form is a scalar and the trace is a linear
operator, E[xT Ax] = E[tr(xT Ax)] = E[tr(AxxT )] = tr(E[AxxT ]) =
tr(AΣ + AµµT ) = tr(AΣ) + tr(AµµT ) = tr(AΣ) + µT Aµ.
ii) Graybill (1976, p. 140): Using E(x_i x_j) = σ_{ij} + µ_i µ_j, E[x^T Ax] = ∑_{i=1}^n ∑_{j=1}^n a_{ij} E(x_i x_j) = ∑_{i=1}^n ∑_{j=1}^n a_{ij}(σ_{ij} + µ_i µ_j) = tr(AΣ) + µ^T Aµ.
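A small Monte Carlo sketch of the identity E[x^T Ax] = tr(AΣ) + µ^T Aµ; the choices of µ, Σ, and A below are hypothetical and mvrnorm is from the MASS package.

library(MASS)                                  # for mvrnorm
set.seed(1)
mu <- c(1, 2); Sigma <- matrix(c(2, 0.5, 0.5, 1), 2, 2)
A <- matrix(c(3, 1, 1, 2), 2, 2)
x <- mvrnorm(100000, mu, Sigma)
mean(rowSums((x %*% A) * x))                   # Monte Carlo estimate of E[x^T A x]
sum(diag(A %*% Sigma)) + t(mu) %*% A %*% mu    # tr(A Sigma) + mu^T A mu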
Many of the theoretical results for quadratic forms assume that the e_i are iid N(0, σ²). These exact results are often special cases of large sample
theory that holds for a large class of iid zero mean error distributions that
have V (ei ) ≡ σ 2 . For linear models, Y is typically an n × 1 random vector.
The following theorem from statistical inference will be useful.
Some of the proof ideas for the following theorem came from Marden
(2012, pp. 48, 96-97). Recall that if Y1 , ..., Yk are independent with moment
generating functions (mgfs) m_{Y_i}(t), then the mgf of ∑_{i=1}^k Y_i is m_{∑ Y_i}(t) = ∏_{i=1}^k m_{Y_i}(t). If Y ∼ χ²(n, γ), then the probability density function (pdf) of Y is rather hard to use, but is given by
f(y) = ∑_{j=0}^∞ [e^{−γ} γ^j / j!] [y^{n/2+j−1} e^{−y/2} / (2^{n/2+j} Γ(n/2 + j))] = ∑_{j=0}^∞ p_γ(j) f_{n+2j}(y),
where p_γ(j) is the Poisson(γ) probability of j and f_{n+2j}(y) is the χ²_{n+2j} pdf
since the integral ∫_{−∞}^{∞} f(w) dw = 1, where f(w) is the N(b, 1/(1 − 2t)) pdf. Thus
m_{W²}(t) = (1 − 2t)^{−1/2} exp( tδ/(1 − 2t) ).
So m_Y(t) = m_{W²+X}(t) = m_{W²}(t) m_X(t) =
(1 − 2t)^{−1/2} exp( tδ/(1 − 2t) ) (1 − 2t)^{−(n−1)/2} = (1 − 2t)^{−n/2} exp( tδ/(1 − 2t) ) = (1 − 2t)^{−n/2} exp( 2γt/(1 − 2t) ).
b) i) By a), m_{∑_{i=1}^k Y_i}(t) = ∏_{i=1}^k m_{Y_i}(t) = ∏_{i=1}^k (1 − 2t)^{−n_i/2} exp(−γ_i[1 − (1 − 2t)^{−1}]) =
(1 − 2t)^{−∑_{i=1}^k n_i/2} exp( −∑_{i=1}^k γ_i [1 − (1 − 2t)^{−1}] ),
the χ²( ∑_{i=1}^k n_i, ∑_{i=1}^k γ_i ) mgf.
ii) Let Y_i = Z_i^T Z_i where the Z_i ∼ N_{n_i}(µ_i, I_{n_i}) are independent. Let
Z = (Z_1^T, Z_2^T, ..., Z_k^T)^T ∼ N_{∑_{i=1}^k n_i}( (µ_1^T, µ_2^T, ..., µ_k^T)^T, I_{∑_{i=1}^k n_i} ) ∼ N_{∑_{i=1}^k n_i}(µ_Z, I_{∑_{i=1}^k n_i}).
Then Z^T Z = ∑_{i=1}^k Z_i^T Z_i = ∑_{i=1}^k Y_i ∼ χ²( ∑_{i=1}^k n_i, γ_Z ) where
γ_Z = µ_Z^T µ_Z/2 = ∑_{i=1}^k µ_i^T µ_i/2 = ∑_{i=1}^k γ_i.
∑_{i=1}^n [E(Z_i^4) − (E[Z_i^2])²] = ∑_{i=1}^n [µ_i^4 + 6µ_i² + 3 − µ_i^4 − 2µ_i² − 1] = ∑_{i=1}^n [4µ_i² + 2] = 2n + 4µ^T µ = 2n + 8γ.
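A simulation sketch of these moments: if Y = ∑ Z_i² with the Z_i ∼ N(µ_i, 1) independent, then Y ∼ χ²(n, γ) with γ = µ^T µ/2, E(Y) = n + 2γ, and V(Y) = 2n + 8γ. The values of n and µ below are hypothetical.

set.seed(2)
n <- 5; mu <- c(1, 2, 0, -1, 3); gam <- sum(mu^2)/2
Z <- matrix(rnorm(100000 * n, mean = mu), ncol = n, byrow = TRUE)  # row i has means mu
Y <- rowSums(Z^2)
c(mean(Y), n + 2*gam)        # simulated versus theoretical mean
c(var(Y), 2*n + 8*gam)       # simulated versus theoretical variance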
For the following theorem, see Searle (1971, p. 57). Most of the results in
Theorem 2.14 are corollaries of Theorem 2.13. Recall that the matrix in a
quadratic form is symmetric, unless told otherwise.
Y^T AY/σ² ∼ χ²_r, or Y^T AY ∼ σ²χ²_r,
iff A is idempotent of rank r.
d) If Y ∼ Nn (0, Σ) where Σ > 0, then Y T AY ∼ χ2r iff AΣ is idempotent
with rank(A) = r = rank(AΣ).
e) If Y ∼ N_n(µ, σ²I) then Y^T Y/σ² ∼ χ²( n, µ^T µ/(2σ²) ).
f) If Y ∼ Nn (µ, I) then Y T AY ∼ χ2 (r, µT Aµ/2) iff A is idempotent
with rank(A) = tr(A) = r.
g) If Y ∼ N_n(µ, σ²I) then Y^T AY/σ² ∼ χ²( r, µ^T Aµ/(2σ²) ) iff A is idempotent with rank(A) = tr(A) = r.
Y^T AY/σ² = Z^T AZ ∼ χ²_r
iff A is idempotent of rank r. Much of Theorem 2.14 follows from Theorem 2.13. For f), we give another proof from Christensen (1987, p. 8). Since A is a projection matrix with rank(A) = r, let {b_1, ..., b_r} be an orthonormal basis for C(A) and let B = [b_1 b_2 ... b_r]. Then B^T B = I_r and the projection matrix A = B(B^T B)^{-1}B^T = BB^T. Thus Y^T AY = Y^T BB^T Y = Z^T Z where Z = B^T Y ∼ N_r(B^T µ, B^T I B) ∼ N_r(B^T µ, I_r). Thus Y^T AY = Z^T Z ∼ χ²(r, µ^T BB^T µ/2) ∼ χ²(r, µ^T Aµ/2) by Definition 2.12.
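A simulation sketch of part f) with a hypothetical projection matrix of rank r = 3 (none of the numbers below come from the text).

set.seed(3)
n <- 20; X <- cbind(1, rnorm(n), rnorm(n))      # hypothetical design matrix
A <- X %*% solve(t(X) %*% X) %*% t(X)           # idempotent with rank(A) = tr(A) = 3
mu <- as.vector(X %*% c(1, 2, -1))              # mean vector in C(X), so A mu = mu
qf <- replicate(10000, { y <- rnorm(n, mean = mu); sum(y * (A %*% y)) })
c(mean(qf), 3 + sum(mu * (A %*% mu)))           # E[Y^T A Y] = r + 2*gamma = r + mu^T A mu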
The following theorem is useful for constructing ANOVA tables. See Searle
(1971, pp. 60-61).
with equality iff ‖θ̂ − θ‖² = 0 iff θ̂ = θ = Xη. Since θ̂ = X β̂ the result
follows.
X T X β̂ = X T Y .
We can often fix σ and then show β̂ is the MLE by direct maximization.
Then the MLE σ̂ or σ̂ 2 can be found by maximizing the log profile likelihood
function log[Lp (σ)] or log[Lp (σ 2 )] where Lp (σ) = L(σ, β = β̂).
Remark 2.1. a) Know how to find the max and min of a function h that
is continuous on an interval [a,b] and differentiable on (a, b). Solve h′(x) ≡ 0
and find the places where h′(x) does not exist. These values are the critical
points. Evaluate h at a, b, and the critical points. One of these values will
be the min and one the max.
b) Assume h is continuous. Then a critical point θo is a local max of h(θ)
if h is increasing for θ < θo in a neighborhood of θo and if h is decreasing for
θ > θo in a neighborhood of θo . The first derivative test
is often used.
c) If h is strictly concave (d²h(θ)/dθ² < 0 for all θ), then any local max of h is a global max.
d) Suppose h′(θ_o) = 0. The 2nd derivative test states that if d²h(θ_o)/dθ² < 0, then θ_o is a local max.
e) If h(θ) is a continuous function on an interval with endpoints a < b
(not necessarily finite), and differentiable on (a, b) and if the critical point
is unique, then the critical point is a global maximum if it is a local
maximum (because otherwise there would be a local minimum and the critical
point would not be unique). To show that θ̂ is the MLE (the global maximizer
of h(θ) = log L(θ)), show that log L(θ) is differentiable on (a, b). Then show
that θ̂ is the unique solution to the equation (d/dθ) log L(θ) = 0 and that the 2nd derivative evaluated at θ̂ is negative: (d²/dθ²) log L(θ)|_{θ̂} < 0. Similar remarks hold for finding σ̂² using the profile likelihood.
(2πσ²)^{−n/2} exp( −(1/(2σ²)) ∑_{i=1}^n (y_i − x_i^T β)² ) = (2πσ²)^{−n/2} exp( −(1/(2σ²)) ‖y − Xβ‖² ).
The least squares criterion Q(β) = ∑_{i=1}^n (y_i − x_i^T β)² = ∑_{i=1}^n r_i²(β) = ‖y − Xβ‖² = (y − Xβ)^T(y − Xβ). For fixed σ², maximizing the likelihood is equivalent to maximizing
exp( −(1/(2σ²)) ‖y − Xβ‖² ),
Let τ = σ². Then
log(L_p(σ²)) = c − (n/2) log(σ²) − Q/(2σ²),
and
log(L_p(τ)) = c − (n/2) log(τ) − Q/(2τ).
Hence
d log(L_p(τ))/dτ = −n/(2τ) + Q/(2τ²) set= 0,
or −nτ + Q = 0, or nτ = Q, or
τ̂ = Q/n = σ̂² = ∑_{i=1}^n r_i²/n = ((n − p)/n) MSE,
Now assume the n × p matrix X has full rank p. There are two ways to
compute β̂. Use β̂ = (X T X)−1 X T Y , and use sample covariance matrices.
The population OLS coefficients are defined below. Let x_i^T = (1, u_i^T) where u_i is the vector of nontrivial predictors. Let X̄_{ok} = (1/n) ∑_{j=1}^n X_{jk} = ū_{ok} for k = 2, ..., p. The subscript “ok” means sum over the first subscript j. Let ū = (ū_{o,2}, ..., ū_{o,p})^T be the sample mean of the u_i. Note that regressing on u is equivalent to regressing on x if there is an intercept β_1 in the model.
Definition 2.17. Using the above notation, let x_i^T = (1, u_i^T), and let β^T = (β_1, β_2^T) where β_1 is the intercept and the slopes vector β_2 = (β_2, ..., β_p)^T. Let the population covariance matrices Cov(u) = Σ_u and Cov(u, Y) = Σ_{uY}.
Refer to Definitions 1.27, 1.28, and 1.33 for the notation “θ̂ →P θ as n → ∞,” which means that θ̂ is a consistent estimator of θ, or that θ̂ converges in probability to θ. Note that D = X_1^T X_1 − n ū ū^T = (n − 1)Σ̂_u.
Theorem 2.18: Seber and Lee (2003, p. 106). a) Let X = (1 X_1). Then
X^T Y = ( nȲ ; X_1^T Y ) = ( nȲ ; ∑_{i=1}^n u_i Y_i ),
X^T X = [ n, nū^T ; nū, X_1^T X_1 ],
and
(X^T X)^{-1} = [ 1/n + ū^T D^{-1} ū, −ū^T D^{-1} ; −D^{-1} ū, D^{-1} ]
where the (p − 1) × (p − 1) matrix D^{-1} = [(n − 1)Σ̂_u]^{-1} = Σ̂_u^{-1}/(n − 1).
b) Suppose that (Y_i, u_i^T)^T are iid random vectors such that σ_Y², Σ_u^{-1}, and Σ_{uY} exist. Then β̂_1 →P β_1 and β̂_2 →P β_2 as n → ∞.
and
X_1^T Y = [u_1 ··· u_n] (Y_1, ..., Y_n)^T = ∑_{i=1}^n u_i Y_i.
So
( β̂_1 ; β̂_2 ) = [ 1/n + ū^T D^{-1} ū, −ū^T D^{-1} ; −D^{-1} ū, D^{-1} ] ( 1^T ; X_1^T ) Y
= [ 1/n + ū^T D^{-1} ū, −ū^T D^{-1} ; −D^{-1} ū, D^{-1} ] ( nȲ ; X_1^T Y ).
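The two computations of β̂ can be compared directly in R; the data below are simulated and purely illustrative.

set.seed(4)
n <- 100; u <- matrix(rnorm(3*n), ncol = 3)     # nontrivial predictors
Y <- 1 + u %*% c(2, 0, -1) + rnorm(n)
X <- cbind(1, u)
t(solve(t(X) %*% X, t(X) %*% Y))                # beta-hat from the normal equations
Sigu <- cov(u); SiguY <- cov(u, Y)              # sample covariance matrices
b2 <- solve(Sigu, SiguY)                        # slopes
b1 <- mean(Y) - t(b2) %*% colMeans(u)           # intercept
c(b1, b2)                                       # agrees with the first computation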
Y = Xβ+e does not need to hold. Also, X is a random matrix, and the least
squares regression is conditional on X. When the linear model does hold, the
second method for computing β̂ is still valid even if X is a constant matrix,
P
and β̂ → β by the LS CLT. Some properties of the least squares estimators
and related quantities are given below, where X is a constant matrix. The
population results of Definition 2.17 were also shown when
(Y, x_2, ..., x_p)^T ∼ N_p( (E(Y), E(u)^T)^T, [ σ_Y², Σ_{Yu} ; Σ_{uY}, Σ_{uu} ] )
in Remark 1.5. Also see Theorem 1.40. The following theorem is similar to
Theorem 1.2.
The following theorem is useful for finding the BLUE when X has full
rank. Note that if W is a random variable, then the covariance matrix of
(X_1/d_1)/(X_2/d_2) ∼ F_{d_1,d_2}.
If the U_i ∼ χ²_1 are iid, then ∑_{i=1}^k U_i ∼ χ²_k. Let d_1 = r and k = d_2 = d_n. Hence if X_2 ∼ χ²_{d_n}, then
X_2/d_n = ∑_{i=1}^{d_n} U_i/d_n = Ū →P E(U_i) = 1
by the law of large numbers. Hence if W_n ∼ F_{r,d_n}, then rW_n →D χ²_r.
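A quick numerical check of this limit (the values of r, the quantile level, and d_n below are arbitrary):

r <- 4
sapply(c(10, 50, 1000), function(dn) r * qf(0.95, r, dn))  # r times F quantiles
qchisq(0.95, r)                                            # limiting chi-square quantile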
The following theorem is analogous to the central limit theorem and the
theory for the t–interval for µ based on Y and the sample standard deviation
(SD) SY . If the data Y1 , ..., Yn are iid with mean 0 and variance σ 2 , then Y
is asymptotically normal and the t–interval will perform well if the sample
size is large enough. The result below suggests that the OLS estimators Ŷi
and β̂ are good if the sample size is large enough. The condition max hi → 0
in probability usually holds if the researcher picked the design matrix X or
if the xi are iid random vectors from a well behaved population. Outliers
can cause the condition to fail. Convergence in distribution, Z_n →D N_p(0, Σ),
means the multivariate normal approximation can be used for probability
calculations involving Z n . When p = 1, the univariate normal distribution
can be used. See Sen and Singer (1993, p. 280) for the theorem, which implies
that β̂ ≈ N_p(β, σ²(X^T X)^{-1}). Let h_i = H_{ii} where H = P_X. Note that
the following theorem is for the full rank model since X T X is nonsingular.
(X^T X)/n → W^{-1}.
Equivalently,
(X^T X)^{1/2}(β̂ − β) →D N_p(0, σ² I_p).   (2.2)
rF_R = (1/MSE)(Lβ̂ − c)^T [L(X^T X)^{-1} L^T]^{-1} (Lβ̂ − c) →D χ²_r   (2.3)
as n → ∞ if H_0: Lβ = c is true, so that √n(Lβ̂ − c) →D N_r(0, σ² L W L^T).
Definition 2.20. A test with test statistic Tn is a large sample right tail
δ test if the test rejects H0 if Tn > an and P (Tn > an ) = δn → δ as n → ∞
when H0 is true.
Remark 2.3. Suppose P (W ≤ χ2q (1−δ)) = 1−δ and P (W > χ2q (1−δ)) =
δ where W ∼ χ2q . Suppose P (W ≤ Fq,dn (1 − δ)) = 1 − δ when W ∼ Fq,dn .
Also write χ2q (1 − δ) = χ2q,1−δ and Fq,dn (1 − δ) = Fq,dn ,1−δ . Suppose P (W >
z1−δ ) = δ when W ∼ N (0, 1), and P (W > tdn ,1−δ ) = δ when W ∼ tdn .
i) Theorem 2.24 is important because it can often be shown that a statistic T_n = rW_n →D χ²_r when H_0 is true. Then tests that reject H_0 when T_n >
χ2r (1 − δ) or when Tn /r = Wn > Fr,dn (1 − δ) are both large sample right
tail δ tests if the positive integer dn → ∞ as n → ∞. Large sample F tests
and intervals are used instead of χ2 tests and intervals since the F tests and
intervals are more accurate for moderate n.
ii) An analogy is that if a test statistic T_n →D N(0, 1) when H_0 is true, then
tests that reject H0 if Tn > z1−δ or if Tn > tdn ,1−δ are both large sample
right tail δ tests if the positive integer dn → ∞ as n → ∞. Large sample t
tests and intervals are used instead of Z tests and intervals since the t tests
and intervals are more accurate for moderate n.
iii) Often n ≥ 10p starts to give good results for the OLS output for error
distributions not too far from N (0, 1). Larger values of n tend to be needed
if the zero mean iid errors have a distribution that is far from a normal
distribution. Also see Theorem 1.5.
−β_1^T X_1^T(I − P_1)Y + β_1^T X_1^T(I − P_1)X_1 β_1 − Y^T(I − P_1)X_1 β_1,
(X_1/d_1)/(X_2/d_2) ∼ F_{d_1,d_2}.
Hence
[Y^T(P − P_1)Y/r] / [Y^T(I − P)Y/(n − p)] = Y^T(P − P_1)Y/(r MSE) ∼ F_{r,n−p}
when H_0 is true. Since RSS = Y^T(I − P)Y and RSS(R) = Y^T(I − P_1)Y,
RSS(R) − RSS = Y^T(I − P_1 − [I − P])Y = Y^T(P − P_1)Y, and thus
F_R = Y^T(P − P_1)Y/(r MSE) ∼ F_{r,n−p}.
c) Assume H_0 is true. By the OLS CLT, √n(Lβ̂ − Lβ) = √n Lβ̂ →D N_r(0, σ² L W L^T). Thus √n(Lβ̂)^T(σ² L W L^T)^{-1} √n Lβ̂ →D χ²_r. Let σ̂² = MSE and Ŵ = n(X^T X)^{-1}. Then
n(Lβ̂)^T[MSE L n(X^T X)^{-1} L^T]^{-1} Lβ̂ = rF_R →D χ²_r.
d) By Theorem 2.24, if W_n ∼ F_{r,d_n} then rW_n →D χ²_r as n → ∞ and d_n → ∞. Hence the result follows by c).
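A sketch of the partial F statistic on simulated data (hypothetical, not from the text): anova() compares the reduced and full models, and F_R is also computed directly from the residual sums of squares.

set.seed(5)
n <- 100; x2 <- rnorm(n); x3 <- rnorm(n); x4 <- rnorm(n)
Y <- 1 + 2*x2 + rnorm(n)                  # x3 and x4 are unneeded
full <- lm(Y ~ x2 + x3 + x4); red <- lm(Y ~ x2)
anova(red, full)                          # reports F_R and its pval
r <- 2                                    # number of parameters set to 0
(deviance(red) - deviance(full)) / (r * summary(full)$sigma^2)   # F_R by hand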
Source    df       SS                        MS       F
Reduced   n − p_R  SSE(R) = Y^T(I − P_R)Y    MSE(R)   F_R = [SSE(R) − SSE]/(r MSE)
Full      n − p    SSE = Y^T(I − P)Y         MSE          = [Y^T(P − P_R)Y/r]/[Y^T(I − P)Y/(n − p)]
The ANOVA F test is the special case where k = 1, X R = 1, P R = P 1 ,
and SSE(R) − SSE(F ) = SST O − SSE = SSR.
Source      df     SS                          MS    F              p-value
Regression  p − 1  SSR = Y^T(P − (1/n)11^T)Y   MSR   F_0 = MSR/MSE  for H_0:
Under the OLS model where FR ∼ Fq,n−p when H0 is true (so the ei are
iid N (0, σ 2 )), the pvalue = P (W > FR ) where W ∼ Fq,n−p . In general, we
can only estimate the pvalue. Let pval be the estimated pvalue. Then pval = P(W > F_R) where W ∼ F_{q,n−p}, and pval →P pvalue as n → ∞ for the
large sample partial F test. The pvalues in output are usually actually pvals
(estimated pvalues).
If H0 : β 2 = 0 is true, then γ = 0.
Proof. Note that the denominator is the MSE, and (n − p)MSE/σ² ∼ χ²_{n−p} by the proof of Theorem 2.26. By Theorem 2.14 f),
Y^T(P − P_1)Y/σ² ∼ χ²( r, β^T X^T(P − P_1)Xβ/(2σ²) )
Remark 2.4. Suppose tests and confidence intervals are derived under
the assumption e ∼ Nn (0, σ 2 I). Then by the LS CLT and Remark 2.3,
the inference tends to give large sample tests and confidence intervals for
a large class of zero mean error distributions. For linear models, often the
error distribution has heavier tails than the normal distribution. See Huber
and Ronchetti (2009, p. 3). If some points stick out a bit in residual and/or
response plots, then the error distribution likely has heavier tails than the
normal distribution. See Figure 1.1.
Definition 2.23. Suppose that the response variable and at least one of the
predictor variables is quantitative. Then the generalized least squares (GLS)
model is
Y = Xβ + e, (2.4)
where Y is an n × 1 vector of dependent variables, X is an n × p matrix
of predictors, β is a p × 1 vector of unknown coefficients, and e is an n × 1
vector of unknown errors. Also E(e) = 0 and Cov(e) = σ 2 V where V is a
known n × n positive definite matrix.
Definition 2.25. Suppose that the response variable and at least one of
the predictor variables is quantitative. Then the weighted least squares (WLS)
model with weights w1 , ..., wn is the special case of the GLS model where V
is diagonal: V = diag(v1 , ..., vn) and wi = 1/vi . Hence
Y = Xβ + e, (2.6)
β̂ W LS = (X T V −1 X)−1 X T V −1 Y . (2.7)
The fitted values are Ŷ F GLS = X β̂ F GLS . The feasible weighted least squares
(FWLS) estimator is the special case of the FGLS estimator where V =
V (θ) is diagonal. Hence the estimated weights ŵi = 1/v̂i = 1/vi (θ̂). The
FWLS estimator and fitted values will be denoted by β̂ F W LS and Ŷ F W LS ,
respectively.
Notice that the ordinary least squares (OLS) model is a special case of
GLS with V = I n , the n × n identity matrix. It can be shown that the GLS
estimator minimizes the GLS criterion
Notice that the FGLS and FWLS estimators have p + q + 1 unknown param-
eters. These estimators can perform very poorly if n < 10(p + q + 1).
The GLS and WLS estimators can be found from the OLS regression
(without an intercept) of a transformed model. Typically there will be a
constant in the model: the first column of X is a vector of ones. Let the
symmetric, nonsingular n × n square root matrix R = V 1/2 with V = RR.
Let Z = R^{-1}Y, U = R^{-1}X, and ε = R^{-1}e.
Theorem 2.28. a)
Z = Uβ + ε   (2.9)
follows the OLS model since E(ε) = 0 and Cov(ε) = σ²I_n.
b) The GLS estimator β̂ GLS can be obtained from the OLS regression
(without an intercept) of Z on U .
c) For WLS, Y_i = x_i^T β + e_i. The corresponding OLS model Z = Uβ + ε is equivalent to Z_i = u_i^T β + ε_i for i = 1, ..., n where u_i^T is the ith row of U. Then Z_i = √w_i Y_i and u_i = √w_i x_i. Hence β̂_WLS can be obtained from the OLS regression (without an intercept) of Z_i = √w_i Y_i on u_i = √w_i x_i.
Proof. a) E(ε) = R^{-1}E(e) = 0 and Cov(ε) = R^{-1}Cov(e)(R^{-1})^T = σ² R^{-1}RR R^{-1} = σ² I_n.
Notice that OLS without an intercept needs to be used since U does not
contain a vector of ones. The first column of U is R^{-1}1 ≠ 1.
b) Let β̂_ZU denote the OLS estimator obtained by regressing Z on U. Then
β̂_ZU = (U^T U)^{-1}U^T Z = (X^T(R^{-1})^T R^{-1}X)^{-1} X^T(R^{-1})^T R^{-1}Y,
and the result follows since V^{-1} = (RR)^{-1} = R^{-1}R^{-1} = (R^{-1})^T R^{-1}.
c) The result follows from b) if Z_i = √w_i Y_i and u_i = √w_i x_i. But for WLS, V = diag(v_1, ..., v_n) and hence R = diag(√v_1, ..., √v_n). Hence
R^{-1} = diag(1/√v_1, ..., 1/√v_n) = diag(√w_1, ..., √w_n),
and Z = R^{-1}Y has ith element Z_i = √w_i Y_i. Similarly, U = R^{-1}X has ith row u_i^T = √w_i x_i^T.
Remark 2.5. Standard software produces WLS output and the ANOVA
F test and Wald t tests are performed using this output.
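A sketch on simulated data (hypothetical weights and coefficients) showing that the weights option of lm() agrees with the transformed OLS regression of Theorem 2.28 c).

set.seed(6)
n <- 50; x <- rnorm(n); v <- 1 + x^2; w <- 1/v
Y <- 1 + 2*x + rnorm(n, sd = sqrt(v))
X <- cbind(1, x)
coef(lm(Y ~ x, weights = w))     # software WLS
Z <- sqrt(w) * Y; U <- sqrt(w) * X
coef(lm(Z ~ U - 1))              # OLS of sqrt(w_i) Y_i on sqrt(w_i) x_i, no intercept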
Remark 2.6. The FGLS estimator can also be found from the OLS re-
gression (without an intercept) of Z on U where V (θ̂) = RR. Similarly the
FWLS estimator can be found from the OLS regression (without an inter-
√ √
cept) of Zi = ŵi Yi on ui = ŵi xi . But now U is a random matrix instead
of a constant matrix. Hence these estimators are highly nonlinear. OLS out-
put can be used for exploratory purposes, but the p–values are generally not
correct. The Olive (2018) bootstrap tests may be useful for FGLS and FWLS.
See Chapter 4.
iii) Find the pval = P(FdfR −dfF ,dfF > FR ). (On exams often an F table is
used. Here dfR −dfF = p−q = number of parameters set to 0, and dfF = n−p.)
iv) State whether you reject H0 or fail to reject H0 . Reject H0 if pval ≤ δ
and conclude that the full model should be used. Otherwise, fail to reject H0
and conclude that the reduced model is good.
Assume that the GLS model contains a constant β_1. The GLS ANOVA F test of H_0: β_2 = ··· = β_p = 0 versus H_A: not H_0 uses the reduced model that contains the first column of U. The GLS ANOVA F test of H_0: β_i = 0 versus H_A: β_i ≠ 0 uses the reduced model with the ith column of U deleted.
For the special case of WLS, the software will often have a weights option
that will also give correct output for inference.
Freedman (1981) shows that the nonparametric bootstrap can be use-
ful for the WLS model with the ei independent. For this case, the sand-
wich estimator is Ĉov(β̂_OLS) = (X^T X)^{-1} X^T Ŵ X (X^T X)^{-1} with Ŵ = n diag(r_1², ..., r_n²)/(n − p) where the r_i are the OLS residuals and W = σ² V.
See Hinkley (1977), MacKinnon and White (1985), and White (1980).
estimator
θ̂_1 = (X^T V^{-1} X)^{-1} X^T V^{-1} Y = (1/n) ∑_{i=1}^n Y_i/i.
Now
E(θ̂_1) = (1/n) ∑_{i=1}^n iθ/i = θ,
and
V(θ̂_1) = V( ∑_{i=1}^n Y_i/(ni) ) = ∑_{i=1}^n i²σ²/(n²i²) = σ²/n.
Now
E(θ̂_2) = ∑_{i=1}^n i(iθ) / ∑_{i=1}^n i² = θ,
and
V(θ̂_2) = V( ∑_{i=1}^n iY_i / ∑_{i=1}^n i² ) = ∑_{i=1}^n i²(i²σ²)/(∑_{i=1}^n i²)² = σ² ∑_{i=1}^n i⁴/(∑_{i=1}^n i²)².
Thus
θ̂_2 ∼ N( θ, σ² ∑_{i=1}^n i⁴/(∑_{i=1}^n i²)² ).
d) The WLS estimator θ̂1 is BLUE and thus has smaller variance than
the OLS estimator θ̂2 (which is a linear unbiased estimator: WLS is “better
than” OLS when the weights are known).
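A numerical sketch of the variance comparison for this example (assuming V(θ̂_1) = σ²/n as computed above; the values of n and σ below are arbitrary):

n <- 20; sigma <- 1; i <- 1:n
c(WLS = sigma^2/n, OLS = sigma^2 * sum(i^4)/(sum(i^2))^2)   # the WLS variance is smaller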
2.5 Summary
1) The set of all linear combinations of x_1, ..., x_n is the vector space known as span(x_1, ..., x_n) = {y ∈ R^k : y = ∑_{i=1}^n a_i x_i for some constants a_1, ..., a_n}.
2) Let A = [a1 a2 ... am ] be an n × m matrix. The space spanned by the
columns of A = column space of A = C(A). Then C(A) = {y ∈ Rn : y =
Aw for some w ∈ Rm } = {y : y = w1 a1 + w2 a2 + · · · + wm am for some
scalars w1 , ...., wm} = span(a1 , ..., am ).
3) A generalized inverse of an n × m matrix A is any m × n matrix A−
satisfying AA− A = A.
the slopes vector β 2 = (β2 , ..., βp)T . Let the population covariance matrices
Cov(u) = Σ u , and Cov(u, Y ) = Σ uY . If the (Yi , uTi )T are iid, then the
population coefficients from an OLS regression of Y on x are
b) Suppose that (Y_i, u_i^T)^T are iid random vectors such that σ_Y², Σ_u^{-1}, and Σ_{uY} exist. Then β̂_1 →P β_1 and β̂_2 →P β_2 as n → ∞ even if the OLS model
Y = Xβ + e does not hold.
17) Theorem 2.20. Let Y = Xβ + e = Ŷ + r where X is full rank,
E(e) = 0, and Cov(e) = σ 2 I. Let P = P X be the projection matrix on
C(X) so Ŷ = P Y, r = Y − Ŷ = (I − P)Y, and P X = X so X^T P = X^T.
i) The predictor variables and residuals are orthogonal. Hence the columns
of X and the residual vector are orthogonal: X T r = 0.
ii) E(Y ) = Xβ.
iii) Cov(Y ) = Cov(e) = σ 2 I.
iv) The fitted values and residuals are uncorrelated: Cov(r, Ŷ ) = 0.
v) The least squares estimator β̂ is an unbiased estimator of β : E(β̂) = β.
vi) Cov(β̂) = σ 2 (X T X)−1 .
18) LS CLT. Suppose that the e_i are iid and (X^T X)/n → W^{-1}. Then √n(β̂ − β) →D N_p(0, σ² W). Also,
(X^T X)^{1/2}(β̂ − β) →D N_p(0, σ² I_p).
19) Theorem 2.26, Partial F Test Theorem. Suppose H0 : Lβ = 0 is
true for the partial F test. Under the OLS full rank model, a)
F_R = (1/(r MSE)) (Lβ̂)^T [L(X^T X)^{-1} L^T]^{-1} (Lβ̂).
b) If e ∼ Nn (0, σ 2 I), then FR ∼ Fr,n−p .
c) For a large class of zero mean error distributions, rF_R →D χ²_r.
d) The partial F test that rejects H0 : Lβ = 0 if FR > Fr,n−p (1 − δ) is a
large sample right tail δ test for the OLS model for a large class of zero mean
error distributions.
2.6 Complements
A good reference for quadratic forms and the noncentral χ2 , t, and F distri-
butions is Johnson and Kotz (1970, ch. 28-31).
The theory for GLS and WLS is similar to the theory for the OLS MLR
model, but the theory for FGLS and FWLS is often lacking or huge sample
sizes are needed. However, FGLS and FWLS are often used in practice be-
cause usually V is not known and V̂ must be used instead. See Eicker (1963,
1967).
Least squares theory can be extended in at least two ways. For the first
extension, see Chang and Olive (2010) and Chapter 10. The second extension
of least squares theory is to an autoregressive AR(p) time series model: Yt =
φ_0 + φ_1 Y_{t−1} + ··· + φ_p Y_{t−p} + e_t. In matrix form, this model is Y = Xβ + e =
[ Y_{p+1} ]   [ 1  Y_p      Y_{p−1}  ...  Y_1     ] [ φ_0 ]   [ e_{p+1} ]
[ Y_{p+2} ] = [ 1  Y_{p+1}  Y_p      ...  Y_2     ] [ φ_1 ] + [ e_{p+2} ]
[   ...   ]   [ ...                              ] [ ... ]   [   ...   ]
[ Y_n     ]   [ 1  Y_{n−1}  Y_{n−2}  ...  Y_{n−p} ] [ φ_p ]   [ e_n     ].
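A sketch of fitting an AR(2) model by least squares with lagged predictors; the simulated series below is hypothetical.

set.seed(7)
y <- as.numeric(arima.sim(list(ar = c(0.5, -0.3)), n = 200)) + 10
p <- 2; n <- length(y)
Yresp <- y[(p + 1):n]
X <- cbind(y[p:(n - 1)], y[(p - 1):(n - 2)])    # columns Y_{t-1} and Y_{t-2}
coef(lm(Yresp ~ X))                             # estimates of phi_0, phi_1, phi_2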
2.7 Problems
Problems from old qualifying exams are marked with a Q since these problems
take longer than quiz and exam problems.
2.1Q . Suppose Yi = xTi β + ei for i = 1, ..., n where the errors are indepen-
dent N (0, σ 2 ). Then the likelihood function is
L(β, σ²) = (2πσ²)^{−n/2} exp( −(1/(2σ²)) ‖Y − Xβ‖² ).
2.3Q . Suppose Yi = xTi β+ei where the errors are independent N (0, σ 2 /wi )
where wi > 0 are known constants. Then the likelihood function is
L(β, σ²) = ( ∏_{i=1}^n √w_i ) (2π)^{−n/2} σ^{−n} exp( −(1/(2σ²)) ∑_{i=1}^n w_i (y_i − x_i^T β)² ).
a) Suppose that β̂_W minimizes ∑_{i=1}^n w_i (y_i − x_i^T β)². Show that β̂_W is the
MLE of β.
b) Then find the MLE σ̂ 2 of σ 2 .
2.5. Find the vector a such that aT Y is an unbiased estimator for E(Yi )
if the usual linear model holds.
Let Z = (Y − µ)/σ ∼ N (0, 1). Hence µk = E(Y − µ)k = σ k E(Z k ). Use this
fact and the above recursion relationship E(Z k ) = (k − 1)E(Z k−2 ) to find
a) µ3 and b) µ4 .
by identifying Ỹi , and x̃i . (Hence the WLS estimator is obtained from the
least squares regression of Ỹi on x̃i without an intercept.)
2.18. Find the projection matrix P for C(X) where X is the 2 × 1 vector
X = (1, 2)T .
2.19. Let y ∼ Np (θ, Σ) where Σ is positive definite. Let A be a symmetric
p × p matrix.
a) Let x = y − θ. What is the distribution of x?
b) Show that
E[(y − θ)T A(y − θ)] = E[xT Ax]
is a function of A and Σ but not of θ.
2.20. Hocking (2003, p. 61): Let y ∼ N_3(µ, σ²I) where y = (Y_1, Y_2, Y_3)^T and µ = (µ_1, µ_2, µ_3)^T. Let
A = (1/2) [ 1 −1 0 ; −1 1 0 ; 0 0 0 ]  and  B = (1/6) [ 1 1 −2 ; 1 1 −2 ; −2 −2 4 ].
Are y T Ay and y T By independent? Explain.
2.21Q . Let Y = Xβ +e where e ∼ Nn (0, σ 2 I n ). Assume X has full rank.
Let r be the vector of residuals. Then the residual sum of squares RSS =
r^T r. The sum of squared fitted values is Ŷ^T Ŷ. Prove whether r^T r and Ŷ^T Ŷ are independent (or dependent).
(Hint: write each term as a quadratic form.)
2.22. Let B = [ 1 2 ; 2 4 ].
a) Find rank(B).
b) Find a basis for C(B).
c) Find [C(B)]^⊥ = nullspace of B^T.
d) Show that B^− = [ 1 −1 ; 1 0 ] is a generalized inverse of B.
2.23. Suppose that Y = Xβ+e where Cov(e) = σ 2 Σ and Σ = Σ 1/2 Σ 1/2
where Σ 1/2 is nonsingular and symmetric. Hence Σ −1/2 Y = Σ −1/2 Xβ +
Σ −1/2 e. Find Cov(Σ −1/2 e). Simplify.
2.24. Let y ∼ N_2(µ, σ²I) where y = (Y_1, Y_2)^T and µ = (µ_1, µ_2)^T. Let
A = [ 1/2 1/2 ; 1/2 1/2 ]  and  B = [ 1/2 −1/2 ; −1/2 1/2 ].
Are y^T Ay and y^T By independent? Explain.
2.25. Assuming the assumptions of the least squares central limit theorem hold, what is the limiting distribution of √n(β̂ − β) if (X′X)/n → W^{-1} as n → ∞?
√n(β̂ − β) →D
2.26. Let the model be Yi = β1 + β2 xi2 + β3 xi3 + β4 xi4 + ... + β10 xi10 + ei .
The model in matrix form is Y = Xβ + e where e ∼ Nn (0, σ 2 I). Let P be
the projection matrix on C(X) where the n × p matrix X has full rank p.
What is the distribution of Y T P Y ?
Hint: If Y ∼ N_n(µ, I), then Y^T AY ∼ χ²(rank(A), µ^T Aµ/2) iff A = A^T is idempotent. Y ∼ N_n(Xβ, σ²I), so Y/σ ∼ N_n(Xβ/σ, I). Simplify.
2.27. Let Y 0 = Y T . Let Y ∼ Nn (Xβ, σ 2 I). Recall that E(Y 0 AY ) =
tr(ACov(Y )) + E(Y 0 )AE(Y ).
Find E(Y 0 Y ) = E(Y 0 IY ).
2.28. Let y ∼ N_2(µ, σ²I) where y = (Y_1, Y_2)^T and µ = (µ_1, µ_2)^T. Let
A = [ 1/2 1/2 ; 1/2 1/2 ]  and  B = [ 1/4 √3/4 ; √3/4 3/4 ].
Are Ay and By independent? Explain.
2.29. Let X = [ 1 0 ; 1 0 ; 1 1 ] (a 3 × 2 matrix with rows (1, 0), (1, 0), and (1, 1)).
a) Find rank(X).
b) Find a basis for C(X).
c) Find [C(X)]⊥ = nullspace of X T .
2.30Q . Let Y = Xβ + e where e ∼ Nn (0, σ 2 I n ). Assume X has full
rank and that the first column of X = 1 so that a constant is in the model.
Let r be the vector of residuals. Then the residual sum of squares RSS =
r^T r = ‖(I − P)Y‖². The sample mean Ȳ = (1/n)1^T Y. Prove whether r^T r and Ȳ are independent (or dependent).
(Hint: If Y ∼ N_n(µ, Σ), then AY and BY are independent iff AΣB^T = 0. So prove whether (I − P)Y and (1/n)1^T Y are independent.)
2.31. Let the full model be Yi = β1 +β2 xi2 +β3 xi3 +β4 xi4 +β5 xi5 +β6 xi6 +ei
and let the reduced model be Yi = β1 +β3 xi3 +ei for i = 1, ..., n. Write the full
model as Y = Xβ +e = X 1 β 1 +X 2 β2 +e, and consider testing H0 : β 2 = 0
where β 1 corresponds to the reduced model. Let P 1 be the projection matrix
on C(X 1 ) and let P be the projection matrix on C(X).
Then F_R = [(n − p)/q] · Y^T(P − P_1)Y / Y^T(I − P)Y.
Assume e ∼ N_n(0, σ²I). Assume H_0 is true.
a) What is q?
b) What is the distribution of Y T (P − P 1 )Y ?
c) What is the distribution of Y T (I − P )Y ?
d) What is the distribution of FR ?
2.32Q . If P is a projection matrix, prove a) the eigenvalues of P are 0 or
1, b) rank(P ) = tr(P ).
2.33Q . Suppose that AY and BY are independent where A and B are
symmetric matrices. Are Y 0 AY and Y 0 BY independent? (Hint: show that
c) If each x_i = 1 for i = 1, ..., n, what are β̂, (d/dη)Q(η), and (d²/dη²)Q(η)?
d) The likelihood function is
L(β, σ²) = (2πσ²)^{−n/2} exp( −(1/(2σ²)) ∑_{i=1}^n (Y_i − a_i − βx_i)² ).
Since the least squares estimator β̂ minimizes ∑_{i=1}^n (Y_i − a_i − βx_i)², show
that β̂ is the (maximum likelihood estimator) MLE of β.
e) Then find the MLE σ̂ 2 of σ 2 .
R Problems
Use the command source(“G:/linmodpack.txt”) to download the
functions and the command source(“G:/linmoddata.txt”) to download the
data. See Preface or Section 11.1. Typing the name of the linmodpack
function, e.g. regbootsim2, will display the code for the function. Use the
Chapter 3
Nonfull Rank Linear Models and Cell Means Models
Much of Sections 2.1 and 2.2 applies to both full rank and nonfull rank linear
models. In this chapter we often assume X has rank r < p ≤ n.
Nonfull rank models are often used in experimental design models. Much
of the nonfull rank model theory is similar to that of the full rank model,
but there are some differences. Now the generalized inverse (X T X)− is not
unique. Similarly, β̂ is a solution to the normal equations, but depends on the
generalized inverse and is not unique. Some properties of the least squares
estimators are summarized below. Let P = P X be the projection matrix
on C(X). Recall that projection matrices are symmetric and idempotent but
singular unless P = I. Also recall that P X = X, so X T P = X T .
vii) Let the columns of X 1 form a basis for C(X). For example, take r lin-
early independent columns of X to form X 1 . Then P = X 1 (X T1 X 1 )−1 X T1 .
Proof. Part i) follows from Theorem 2.2 a), b). For part iii), P and I − P
are projection matrices and projections P w and (I − P )w are unique since
projection matrices are unique. For ii), since (X T X)− is not unique, β̂ is not
unique. Note that iv) holds since X T X β̂ = X T P Y = X T Y since P X = X
and X T P = X T . From the proof of Theorem 2.2, if M is a projection
matrix, then rank(M ) = tr(M ) = the number of nonzero eigenvalues of
M = rank(X). Thus v) holds. vi) E(r^T r) = E(e^T(I − P)e) = tr[(I − P)σ²I] = σ²(n − r) by Theorem 2.5. Part vii) follows from Theorem 2.2.
For part b), we use the proof from Seber and Lee (2003, p. 43). Since θ̂ =
X β̂ = P Y , it follows that E(cT θ̂) = E(cT P Y ) = cT P Xβ = cT Xβ = cT θ.
Thus cT θ̂ = cT P Y = (P c)T Y is a linear unbiased estimator of cT θ. Let
dT Y be any other linear unbiased estimator of cT θ. Hence E(dT Y ) = dT θ =
cT θ for all θ ∈ C(X). So (c − d)T θ = 0 for all θ ∈ C(X). Hence (c − d) ∈
[C(X)]⊥ and P (c − d) = 0, or P c = P d. Thus V (cT θ̂) = V (cT P Y ) =
V (dT P Y ) = σ 2 dT P T P d = σ 2 dT P d. Then V (dT Y )−V (cT θ̂) = V (dT Y )−
V (dT P Y ) = σ 2 [dT d − dT P d] = σ 2 dT (I n − P )d = σ 2 dT (I n − P )T (I n −
P )d = g T g ≥ 0 with equality iff g = (I n − P )d = 0, or d = P d = P c. Thus
cT θ̂ has minimum variance and is unique.
c) Since aT β is estimable, aT β̂ = bT X β̂. Then aT β̂ = bT θ̂ is the unique
BLUE of aT β = bT θ by part b).
Nonfull rank models are often used for experimental design models, but cell
means models have full rank. The cell means models will be illustrated with
the one way Anova model. See Problem 3.9 for the cell means model for the
two way Anova model.
Investigators may also want to rank the population means from smallest to
largest.
Definition 3.4. Let fZ (z) be the pdf of Z. Then the family of pdfs fY (y) =
fZ (y − µ) indexed by the location parameter µ, −∞ < µ < ∞, is the location
family for the random variable Y = µ + Z with standard pdf fZ (z).
Definition 3.5. A one way fixed effects Anova model has a single quali-
tative predictor variable W with p categories a1 , ..., ap . There are p different
distributions for Y , one for each category ai . The distribution of
Y |(W = ai ) ∼ fZ (y − µi )
where the location family has second moments. Hence all p distributions come
from the same location family with different location parameter µi and the
same variance σ 2 .
Definition 3.6. The cell means model is the parameterization of the one
way fixed effects Anova model such that
Yij = µi + eij
where Yij is the value of the response variable for the jth trial of the ith
factor level. The µi are the unknown means and E(Yij ) = µi . The eij are
iid from the location family with pdf fZ (z) and unknown variance σ 2 =
VAR(Yij ) = VAR(eij ). For the normal cell means model, the eij are iid
N (0, σ 2 ) for i = 1, ..., p and j = 1, ..., ni.
The cell means model is a linear model (without intercept) of the form
Y = X c βc + e =
(Y_11, ..., Y_{1,n_1}, Y_21, ..., Y_{2,n_2}, ..., Y_{p,1}, ..., Y_{p,n_p})^T = X_c (µ_1, µ_2, ..., µ_p)^T + (e_11, ..., e_{1,n_1}, e_21, ..., e_{2,n_2}, ..., e_{p,1}, ..., e_{p,n_p})^T   (3.1)
where the ith column of the n × p matrix X_c is the indicator for the ith factor level: it has ones in the rows corresponding to Y_{i1}, ..., Y_{i,n_i} and zeroes elsewhere.
Notation. Let Y_{i0} = ∑_{j=1}^{n_i} Y_{ij} and let
µ̂_i = Ȳ_{i0} = Y_{i0}/n_i = (1/n_i) ∑_{j=1}^{n_i} Y_{ij}.   (3.2)
E(Y) = X_c β_c = (µ_1, ..., µ_1, µ_2, ..., µ_2, ..., µ_p, ..., µ_p)^T, and X_c^T Y = (Y_{10}, Y_{20}, ..., Y_{p0})^T. Hence (X_c^T X_c)^{-1} = diag(1/n_1, ..., 1/n_p) and the OLS estimator β̂_c = µ̂ = (X_c^T X_c)^{-1} X_c^T Y = (Ȳ_{10}, ..., Ȳ_{p0})^T.
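A sketch in R: fitting the cell means model by OLS without an intercept reproduces the group sample means (the data below are simulated and hypothetical).

set.seed(8)
W <- factor(rep(c("a1", "a2", "a3"), each = 10))
Y <- c(rnorm(10, 5), rnorm(10, 7), rnorm(10, 6))
coef(lm(Y ~ W - 1))          # mu-hat_i for each level
tapply(Y, W, mean)           # the group sample means agree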
Since the cell means model is a linear model, there is an associated response
plot and residual plot. However, many of the interpretations of the OLS
quantities for Anova models differ from the interpretations for MLR models.
First, for MLR models, the conditional distribution Y |x makes sense even if
x is not one of the observed xi provided that x is not far from the xi . This
fact makes MLR very powerful. For MLR, at least one of the variables in x
is a continuous predictor. For the one way fixed effects Anova model, the p
distributions Y |xi make sense where xTi is a row of X c .
Also, the OLS MLR ANOVA F test for the cell means model tests H0 :
βc = 0 ≡ H0 : µ1 = · · · = µp = 0, while the one way fixed effects ANOVA F
test given after Definition 3.10 tests H0 : µ1 = · · · = µp .
Definition 3.7. Consider the one way fixed effects Anova model. The
response plot is a plot of Ŷij ≡ µ̂i versus Yij and the residual plot is a plot of
Ŷij ≡ µ̂i versus rij .
The points in the response plot scatter about the identity line and the
points in the residual plot scatter about the r = 0 line, but the scatter need
not be in an evenly populated band. A dot plot of Z1 , ..., Zm consists of an
axis and m points each corresponding to the value of Zi . The response plot
consists of p dot plots, one for each value of µ̂i . The dot plot corresponding
to µ̂i is the dot plot of Yi1 , ..., Yi,ni. The p dot plots should have roughly the
same amount of spread, and each µ̂i corresponds to level ai . If a new level
af corresponding to xf was of interest, hopefully the points in the response
plot corresponding to af would form a dot plot at µ̂f similar in spread to
the other dot plots, but it may not be possible to predict the value of µ̂f .
Similarly, the residual plot consists of p dot plots, and the plot corresponding
to µ̂i is the dot plot of ri1 , ..., ri,ni.
Assume that each ni ≥ 10. Under the assumption that the Yij are from
the same location family with different parameters µi , each of the p dot plots
should have roughly the same shape and spread. This assumption is easier
to judge with the residual plot. If the response plot looks like the residual
plot, then a horizontal line fits the p dot plots about as well as the identity
line, and there is not much difference in the µi . If the identity line is clearly
superior to any horizontal line, then at least some of the means differ.
Definition 3.8. An outlier corresponds to a case that is far from the
bulk of the data. Look for a large vertical distance of the plotted point from
the identity line or the r = 0 line.
Rule of thumb 3.1. Mentally add 2 lines parallel to the identity line and
2 lines parallel to the r = 0 line that cover most of the cases. Then a case is
an outlier if it is well beyond these 2 lines.
This rule often fails for large outliers since often the identity line goes
through or near a large outlier so its residual is near zero. A response that is
far from the bulk of the data in the response plot is a “large outlier” (large
in magnitude). Look for a large gap between the bulk of the data and the
large outlier.
The assumption of the Yij coming from the same location family with
different location parameters µi and the same constant variance σ 2 is a big
assumption and often does not hold. Another way to check this assumption is
to make a box plot of the Yij for each i. The box in the box plot corresponds
to the lower, middle, and upper quartiles of the Yij . The middle quartile
is just the sample median of the data mij : at least half of the Yij ≥ mij
and at least half of the Yij ≤ mij . The p boxes should be roughly the same
length and the median should occur in roughly the same position (e.g. in
the center) of each box. The “whiskers” in each plot should also be roughly
similar. Histograms for each of the p samples could also be made. All of the
histograms should look similar in shape.
Example 3.1. Kuehl (1994, p. 128) gives data for counts of hermit crabs
on 25 different transects in each of six different coastline habitats. Let Z be
the count. Then the response variable Y = log10 (Z + 1/6). Although the
counts Z varied greatly, each habitat had several counts of 0 and often there
were several counts of 1, 2, or 3. Hence Y is not a continuous variable. The
cell means model was fit with ni = 25 for i = 1, ..., 6. Each of the six habitats
was a level. Figure 3.1a and b shows the response plot and residual plot.
There are 6 dot plots in each plot. Because several of the smallest values in
each plot are identical, it does not always look like the identity line is passing
through the six sample means Y i0 for i = 1, ..., 6. In particular, examine the
dot plot for the smallest mean (look at the 25 dots furthest to the left that
fall on the vertical line FIT ≈ 0.36). Random noise (jitter) has been added to
the response and residuals in Figure 3.1c and d. Now it is easier to compare
the six dot plots. They seem to have roughly the same spread.
The plots contain a great deal of information. The response plot can be
used to explain the model, check that the sample from each population (treat-
ment) has roughly the same shape and spread, and to see which populations
have similar means. Since the response plot closely resembles the residual plot
in Figure 3.1, there may not be much difference in the six populations. Lin-
earity seems reasonable since the samples scatter about the identity line. The
residual plot makes the comparison of “similar shape” and “spread” easier.
[Figure 3.1. a) Response plot (Y versus FIT) and b) residual plot (RESID versus FIT) for the hermit crab data; c) and d) show the jittered response (JY) and jittered residuals (JR) versus FIT.]
SSTR = ∑_{i=1}^p n_i (Ȳ_{i0} − Ȳ_{00})².
σ̂² = MSE = (1/(n − p)) ∑_{i=1}^p ∑_{j=1}^{n_i} r_{ij}² = (1/(n − p)) ∑_{i=1}^p ∑_{j=1}^{n_i} (Y_{ij} − Ȳ_{i0})² = (1/(n − p)) ∑_{i=1}^p (n_i − 1)S_i² = S²_{pool}
where S²_{pool} is known as the pooled variance estimator.
The ANOVA F test tests whether the p means are equal. If H0 is not
rejected and the means are equal, then it is possible that the factor is unim-
portant, but it is also possible that the factor is important but the
level is not. For example, the factor might be type of catalyst. The yield
may be equally good for each type of catalyst, but there would be no yield if
no catalyst was used.
The ANOVA table is the same as that for MLR, except that SSTR re-
places the regression sum of squares. The MSE is again an estimator of σ 2 .
The ANOVA F test tests whether all p means µi are equal. Shown below
is an ANOVA table given in symbols. Sometimes “Treatment” is replaced
by “Between treatments,” “Between Groups,” “Between,” “Model,” “Fac-
tor,” or “Groups.” Sometimes “Error” is replaced by “Residual,” or “Within
Groups.” Sometimes “p-value” is replaced by “P”, “P r(> F ),” or “PR > F.”
The “p-value” is nearly always an estimated p-value, denoted by pval. An ex-
ception is when the ei are iid N (0, σe2 ). Normality is rare and the constant
variance assumption rarely holds.
Here is the 4 step fixed effects one way ANOVA F test of hy-
potheses.
i) State the hypotheses H0 : µ1 = µ2 = · · · = µp and HA: not H0 .
ii) Find the test statistic F0 = M ST R/M SE or obtain it from output.
iii) Find the pval from output or use the F –table: pval =
P (Fp−1,n−p > F0 ).
iv) State whether you reject H0 or fail to reject H0 . If the pval ≤ δ, reject H0
and conclude that the mean response depends on the factor level. (Hence not
all of the treatment means are equal.) Otherwise fail to reject H0 and conclude
that the mean response does not depend on the factor level. (Hence all of the
treatment means are equal, or there is not enough evidence to conclude that
the mean response depends on the factor level.) Give a nontechnical sentence.
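A sketch of the 4 step test carried out with aov() on hypothetical data with p = 3 levels; the output gives the df, SS, MS, F_0 = MSTR/MSE, and the pval.

set.seed(9)
W <- factor(rep(c("a1", "a2", "a3"), each = 15))
Y <- c(rnorm(15, 5), rnorm(15, 7), rnorm(15, 6))
summary(aov(Y ~ W))          # Treatment and Error rows of the one way ANOVA table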
then the one way ANOVA F test results will be approximately correct if the
response and residual plots suggest that the remaining one way Anova model
assumptions are reasonable. See Moore (2007, p. 634). If all of the ni ≥ 5,
replace the standard deviations by the ranges of the dot plots when exam-
ining the response and residual plots. The range Ri = max(Yi,1 , ..., Yi,ni ) −
min(Yi,1 , ..., Yi,ni) = length of the ith dot plot for i = 1, ..., p.
The assumption that the zero mean iid errors have constant variance
V (eij ) ≡ σ 2 is much stronger for the one way Anova model than for the mul-
tiple linear regression model. The assumption implies that the p population
distributions have pdfs from the same location family with different means
µ1 , ..., µp but the same variances σ12 = · · · = σp2 ≡ σ 2 . The one way ANOVA F
test has some resistance to the constant variance assumption, but confidence
intervals have much less resistance to the constant variance assumption. Consider confidence intervals for µ_i such as Ȳ_{i0} ± t_{n_i−1,1−δ/2} √MSE/√n_i. MSE is a weighted average of the S_i². Hence MSE overestimates small σ_i² and underestimates large σ_i² when the σ_i² are not equal. Hence using √MSE instead of S_i will make the CI too long or too short, and Rule of thumb 3.2 does not apply to confidence intervals based on MSE.
All of the parameterizations of the one way fixed effects Anova model
yield the same predicted values, residuals, and ANOVA F test, but the inter-
pretations of the parameters differ. The cell means model is a linear model
(without intercept) of the form Y = X c βc + e = that can be fit using OLS.
The OLS MLR output gives the correct fitted values and residuals but an
incorrect ANOVA table. An equivalent linear model (with intercept) with
correct OLS MLR ANOVA table as well as residuals and fitted values can
be formed by replacing any column of the cell means model by a column of
ones 1. Removing the last column of the cell means model and making the
first column 1 gives the model Y = β0 + β1 x1 + · · · + βp−1 xp−1 + e given in
matrix form by (3.5) below.
It can be shown that the OLS estimators corresponding to (3.5) are β̂0 =
Y p0 = µ̂p , and β̂i = Y i0 − Y p0 = µ̂i − µ̂p for i = 1, ..., p − 1. The cell means
model has β̂i = µ̂i = Y i0 .
(Y_11, ..., Y_{1,n_1}, Y_21, ..., Y_{2,n_2}, ..., Y_{p,1}, ..., Y_{p,n_p})^T = X (β_0, β_1, ..., β_{p−1})^T + (e_11, ..., e_{1,n_1}, ..., e_{p,1}, ..., e_{p,n_p})^T   (3.5)
where the first column of X is a column of ones and, for i = 1, ..., p − 1, the (i + 1)th column of X is the indicator for the ith factor level, so the rows of X for the last level p are (1, 0, ..., 0).
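A sketch of the parameterization in (3.5) on hypothetical data: making the last level the reference level in R gives β̂_0 = µ̂_p and β̂_i = µ̂_i − µ̂_p.

set.seed(10)
W <- factor(rep(c("a1", "a2", "a3"), each = 10))
Y <- c(rnorm(10, 5), rnorm(10, 7), rnorm(10, 6))
coef(lm(Y ~ relevel(W, ref = "a3")))   # intercept = mean of a3; slopes = differences
tapply(Y, W, mean)                     # compare with the group means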
Definition 3.11. A contrast is C = ∑_{i=1}^p k_i µ_i where ∑_{i=1}^p k_i = 0. The estimated contrast is Ĉ = ∑_{i=1}^p k_i Ȳ_{i0}.
If the null hypothesis of the fixed effects one way ANOVA test is not true,
then not all of the means µi are equal. Researchers will often have hypotheses,
before examining the data, that they desire to test. Often such a hypothesis
can be put in the form of a contrast. For example, the contrast C = µi − µj
is used to compare the means of the ith and jth groups while the contrast
µ1 − (µ2 + · · · + µp )/(p − 1) is used to compare the last p − 1 groups with
the 1st group. This contrast is useful when the 1st group corresponds to a
standard or control treatment while the remaining groups correspond to new
treatments.
Assume that the normal cell means model is a useful approximation to the
data. Then the Y i0 ∼ N (µi , σ 2 /ni ) are independent, and
Ĉ = ∑_{i=1}^p k_i Ȳ_{i0} ∼ N( C, σ² ∑_{i=1}^p k_i²/n_i ).
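A sketch of an estimated contrast and its standard error √(MSE ∑ k_i²/n_i) on hypothetical data, comparing the first group with the average of the other two (k = (1, −1/2, −1/2)).

set.seed(11)
W <- factor(rep(c("a1", "a2", "a3"), each = 12)); ni <- tabulate(W)
Y <- c(rnorm(12, 5), rnorm(12, 7), rnorm(12, 6))
k <- c(1, -1/2, -1/2)
Chat <- sum(k * tapply(Y, W, mean))            # estimated contrast
MSE <- summary(lm(Y ~ W))$sigma^2
c(Chat = Chat, SE = sqrt(MSE * sum(k^2/ni)))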
3.3 Summary
Y |(W = ai ) ∼ fZ (y − µi )
where the location family has second moments. Hence all p distributions come
from the same location family with different location parameter µi and the
same variance σ 2 . The one way fixed effects normal ANOVA model is the
special case where Y |(W = ai ) ∼ N (µi , σ 2 ).
14) The response plot is a plot of Ŷ versus Y . For the one way Anova model,
the response plot is a plot of Ŷij = µ̂i versus Yij . Often the identity line with
unit slope and zero intercept is added as a visual aid. Vertical deviations from
the identity line are the residuals eij = Yij − Ŷij = Yij − µ̂i . The plot will
consist of p dot plots that scatter about the identity line with similar shape
and spread if the fixed effects one way ANOVA model is appropriate. The
ith dot plot is a dot plot of Yi,1 , ..., Yi,ni. Assume that each ni ≥ 10. If the
response plot looks like the residual plot, then a horizontal line fits the p dot
plots about as well as the identity line, and there is not much difference in
the µi . If the identity line is clearly superior to any horizontal line, then at
least some of the means differ.
The residual plot is a plot of Ŷ versus e where the residual e = Y − Ŷ . The
plot will consist of p dot plots that scatter about the e = 0 line with similar
shape and spread if the fixed effects one way ANOVA model is appropriate.
The ith dot plot is a dot plot of ei,1 , ..., ei,ni. Assume that each ni ≥ 10.
Under the assumption that the Yij are from the same location scale family
with different parameters µi , each of the p dot plots should have roughly the
same shape and spread. This assumption is easier to judge with the residual
plot than with the response plot.
15) Rule of thumb: Let Ri be the range of the ith dot plot =
max(Yi1 , ..., Yi,ni)−min(Yi1 , ..., Yi,ni). If the ni ≈ n/p and if max(R1 , ..., Rp) ≤
2 min(R1 , ..., Rp), then the one way ANOVA F test results will be approxi-
mately correct if the response and residual plots suggest that the remaining
one way ANOVA model assumptions are reasonable. Confidence intervals
need stronger assumptions.
16) Let Y_{i0} = ∑_{j=1}^{n_i} Y_{ij} and let
µ̂_i = Ȳ_{i0} = Y_{i0}/n_i = (1/n_i) ∑_{j=1}^{n_i} Y_{ij}.
Hence the “dot notation” means sum over the subscript corresponding to the 0, e.g. j. Similarly, Y_{00} = ∑_{i=1}^p ∑_{j=1}^{n_i} Y_{ij} is the sum of all of the Y_{ij}. Be able to find µ̂_i from data.
17) The cell means model for the fixed effects one way Anova is Y_{ij} = µ_i + ε_{ij} where Y_{ij} is the value of the response variable for the jth trial of the ith factor level for i = 1, ..., p and j = 1, ..., n_i. The µ_i are the unknown means and E(Y_{ij}) = µ_i. The ε_{ij} are iid from the location family with pdf f_Z(z), zero mean, and unknown variance σ² = V(Y_{ij}) = V(ε_{ij}). For the normal cell means model, the ε_{ij} are iid N(0, σ²). The estimator µ̂_i = Ȳ_{i0} = ∑_{j=1}^{n_i} Y_{ij}/n_i = Ŷ_{ij}. The ijth residual is e_{ij} = Y_{ij} − Ȳ_{i0}, and Ȳ_{00} is the sample mean of all of the Y_{ij} and n = ∑_{i=1}^p n_i. The total sum of squares SSTO = ∑_{i=1}^p ∑_{j=1}^{n_i} (Y_{ij} − Ȳ_{00})², the treatment sum of squares SSTR = ∑_{i=1}^p n_i(Ȳ_{i0} − Ȳ_{00})², and the error sum of squares SSE = RSS = ∑_{i=1}^p ∑_{j=1}^{n_i} (Y_{ij} − Ȳ_{i0})². The MSE is an estimator of σ². The Anova table is the same as that for multiple linear regression, except that SSTR replaces the regression sum of squares and that SSTO, SSTR, and SSE have n − 1, p − 1, and n − p degrees of freedom.
Summary Analysis of Variance Table
Source     df     SS     MS     F                p-value
Treatment  p − 1  SSTR   MSTR   F_0 = MSTR/MSE   for H_0: µ_1 = ··· = µ_p
Error      n − p  SSE    MSE
18) Shown is a one way ANOVA table given in symbols. Sometimes “Treat-
ment” is replaced by “Between treatments,” “Between Groups,” “Model,”
“Factor” or “Groups.” Sometimes “Error” is replaced by “Residual,” or
“Within Groups.” Sometimes “p-value” is replaced by “P”, “P r(> F )” or
“PR > F.” SSE is often replaced by RSS = residual sum of squares.
19) In matrix form, the cell means model is the linear model without an intercept (although 1 ∈ C(X)), where µ = β = (µ_1, ..., µ_p)^T, and Y = Xµ + ε, where X is the cell means design matrix of (3.1) (the ith column of X is the indicator for the ith factor level) and ε = (ε_11, ..., ε_{1,n_1}, ε_21, ..., ε_{2,n_2}, ..., ε_{p,1}, ..., ε_{p,n_p})^T.
20) For the cell means model, X T X = diag(n1 , ..., np), (X T X)−1 =
diag(1/n1 , ..., 1/np), and X T Y = (Y10 , ..., Yp0)T . So β̂ = µ̂ = (X T X)−1 X T Y
= (Y 10 , ..., Y p0 )T . Then Ŷ = X(X T X)−1 X T Y = X µ̂, and Ŷij = Y i0 .
Hence the ijth residual eij = Yij − Ŷij = Yij − Y i0 for i = 1, ..., p and
j = 1, ..., ni.
21) In the response plot, the dot plot for the jth treatment crosses the
identity line at Y j0 .
22) The one way Anova F test has hypotheses H0 : µ1 = · · · = µp and HA :
not H0 (not all of the p population means are equal). The one way Anova
table for this test is given in 18) above. Let RSS = SSE. The test statistic
F = MSTR/MSE = {[RSS(H) − RSS]/(p − 1)}/MSE ∼ F_{p−1,n−p}
if the ε_{ij} are iid N(0, σ²). If H_0 is true, then Y_{ij} = µ + ε_{ij} and µ̂ = Ȳ_{00}. Hence RSS(H) = SSTO = ∑_{i=1}^p ∑_{j=1}^{n_i} (Y_{ij} − Ȳ_{00})². Since SSTO = SSE + SSTR, the quantity SSTR = RSS(H) − RSS, and MSTR = SSTR/(p − 1).
23) The one way Anova F test is a large sample test if the ε_{ij} are iid with mean 0 and variance σ². Then the Y_{ij} come from the same location family with the same variance σ_i² = σ² and different mean µ_i for i = 1, ..., p. Thus the p treatments (groups, populations) have the same variance σ_i² = σ². The V(ε_{ij}) ≡ σ² assumption (which implies that σ_i² = σ² for i = 1, ..., p) is a much stronger assumption for the one way Anova model than for MLR, but the test has some resistance to the assumption that σ_i² = σ² by 15).
24) Other design matrices X can be used for the full model. One design
matrix adds a column of ones to the cell means design matrix. This model is
no longer a full rank model.
Y = (Y_11, ..., Y_{1,n_1}, Y_21, ..., Y_{2,n_2}, ..., Y_{p,1}, ..., Y_{p,n_p})^T = X(β_0, β_1, ..., β_{p−1})^T + ε, where the first column of X is 1 and, for i = 1, ..., p − 1, the (i + 1)th column of X is the indicator for the ith factor level (so the rows for the last level p are (1, 0, ..., 0)).
25) A full rank one way Anova model with an intercept adds a constant but
deletes the last column of the X for the cell means model. Then Y = Xβ + ε where Y and ε are as in the cell means model. Then β = (β_0, β_1, ..., β_{p−1})^T =
(µp , µ1 − µp , µ2 − µp , ..., µp−1 − µp )T . So β0 = µp and βi = µi − µp for
i = 1, ..., p − 1.
It can be shown that the OLS estimators are β̂0 = Y p0 = µ̂p , and β̂i =
Y i0 − Y p0 = µ̂i − µ̂p for i = 1, ..., p− 1. (The cell means model has β̂i = µ̂i =
Y i0 .) In matrix form the model is shown above.
Then X^T Y = (Y_{00}, Y_{10}, Y_{20}, ..., Y_{p−1,0})^T and
X^T X = [ n, (n_1, n_2, ..., n_{p−1}) ; (n_1, n_2, ..., n_{p−1})^T, diag(n_1, ..., n_{p−1}) ],
that is, the first row and first column of X^T X are (n, n_1, ..., n_{p−1}) and the lower right (p − 1) × (p − 1) block is diag(n_1, ..., n_{p−1}). Hence
(X^T X)^{-1} = (1/n_p) [ 1, −1^T ; −1, 11^T + n_p diag(1/n_1, ..., 1/n_{p−1}) ].
3.4 Complements
Section 3.2 followed Olive (2017a, ch. 5) closely. The one way Anova model
assumption that the groups have the same variance is very strong. Chapter
9 shows how to use large sample theory to create better one way MANOVA
type tests, and better one way Anova tests are a special case. The tests tend
to be better when all of the ni are large enough for the CLT to hold for each
Y io . Also see Rupasinghe Arachchige Don and Olive (2019).
3.5 Problems
3.1. When X is not full rank, the projection matrix P X for C(X) is P X =
X(X 0 X)− X 0 where X 0 = X T . To show that C(P X ) = C(X), you can show
that a) P X w = Xy ∈ C(X) where w is an arbitrary conformable constant
vector, and b) Xy = P X w ∈ C(P X ) where y is an arbitrary conformable
constant vector.
a) Show P X w = Xy and identify y.
b) Show Xy = P X w and identify w. Hint: P X X = X.
3.2. Let P = X(X T X)− X T be the projection matrix onto the column
space of X. Using P X = X, show P is idempotent.
3.3. Suppose that X is an n × p matrix but the rank of X < p < n. Then
the normal equations X 0 Xβ = X 0 Y have infinitely many solutions. Let β̂ be
a solution to the normal equations. So X 0 X β̂ = X 0 Y . Let G = (X 0 X)− be a
generalized inverse of (X 0 X). Assume that E(Y ) = Xβ and Cov(Y ) = σ 2 I.
It can be shown that all solutions to the normal equations have the form bz
given below.
a) In each of cases i) and ii), state whether β is estimable and explain your
answer.
b) If the answer is “yes,” then determine the matrix B in β̂ = BY .
c) If the answer is “no,” then produce one estimable parametric function
and its unbiased estimator.
3.15Q. Let y ∼ N_p(Aβ, σ²I_p), where A is a known p × n matrix of constants and β an n × 1 vector of unknown parameters. Let r = rank(A), 0 < r < p. Define the vector of fitted values ŷ and the vector of residuals e as ŷ = P_A y and e = y − ŷ (P_A is the projection matrix on C(A), the column space of A).
(a) Provide the distribution of ŷ.
(b) Provide the distribution of e.
Table 3.1 Frequency Table (cell counts n_ij)

            B
            1  2  3  4
        1   1  1  1  0
   A    2   1  2  1  0
        3   0  0  0  2
We denote the data in vector notation as Y = (Y111 , Y121, Y131, Y211 , Y221, Y222, Y231, Y341 , Y342)> .
Also, we write β = (µ, α1 , α2 , α3, β1 , β2 , β3 , β4 )> .
(a) Find the model matrix (design matrix) X for the model so that
E(Y ) = Xβ.
(b) Find the vector X > Y .
(c) Decide whether Ȳ.1. − Ȳ.3. is the OLS estimator for β1 − β3 . Explain
your answer. Here Ȳ_{.j.} = ∑_{i=1}^3 ∑_{k=1}^{n_{ij}} Y_{ijk} / ∑_{i=1}^3 n_{ij}.
(d) Decide whether Ȳ_{1..} − Ȳ_{3..} is the OLS estimator for α_1 − α_3. Explain your answer. Here Ȳ_{i..} = ∑_{j=1}^4 ∑_{k=1}^{n_{ij}} Y_{ijk} / ∑_{j=1}^4 n_{ij}.
3.18Q. Let Y = Xβ + ε where Y = (Y_1, Y_2, Y_3)′, X = [ 1 2 ; 1 2 ; 2 4 ] (rows (1, 2), (1, 2), (2, 4)), β = (β_1, β_2)′, E(ε) = 0, and Cov(ε) = σ²I.
a) Find C(X 0 ).
Show whether or not the following functions are estimable.
b) 5β1 + 10β2
c) β1
d) β1 − 2β2
3.19Q. Let Y = Xβ + ε where Y = (Y_1, Y_2, Y_3)′ = (1, 2, 3)′, X = [ 1 −2 ; 1 −2 ; 1 −2 ] (each row (1, −2)), β = (β_1, β_2)′, E(ε) = 0, and Cov(ε) = σ²I.
a) Calculate P , the projection matrix P onto the column space of X.
b) Calculate the error sum of squares SSE.
c) Find C(X 0 ).
Show whether or not the following functions are estimable.
d) 5β1 + 10β2
e) β1
f) β1 − 2β2
3.20Q. Let Y = Xβ + ε. Suppose that a_1^T β, ..., a_k^T β are estimable functions. Prove or disprove: ∑_{i=1}^k c_i a_i^T β is estimable where c_1, ..., c_k are known constants.
3.21Q . a) Let X be an n×1 random vector with E(X) = µ and Cov(X) =
Σ of rank r. Find E(X T Σ − X).
b) Consider the one way fixed effects ANOVA model with 2 replications
per group so that Y is a 2p × 1 random vector:
Y = Xβ + e = (Y_{1,1}, Y_{1,2}, Y_{2,1}, Y_{2,2}, ..., Y_{p,1}, Y_{p,2})^T, where β = (β_0, β_1, ..., β_{p−1})^T, e = (e_{1,1}, e_{1,2}, ..., e_{p,1}, e_{p,2})^T, the first column of X is 1, and for i = 1, ..., p − 1 the (i + 1)th column of X is the indicator of the ith group (ones in the two rows for Y_{i,1} and Y_{i,2}), so the two rows of X for group p are (1, 0, ..., 0),
with E(e) = 0.
i) Simplify E(Y ) = Xβ.
ii) If E(Y) = Xβ = (0, β_0, ..., β_0)^T,
find β1 , ..., βp−1 in terms of β0 .
3.22Q . An experiment was run to compare three different primitive al-
timeters (an altimeter is a device which measures altitude). The response is
the error in reading.
Altimeter 1: 3, 6, 3
Altimeter 2: 4, 5, 4
Altimeter 3: 7, 8, 7
We would like to compare the means of these three altimeters.
a) Write the linear model. Describe all terms and assumptions. Use βi instead
of µi .
b) Given that RSSH − RSS = 20.22 and RSS = 7.33, state the hypotheses
that the means are equal, and complete the ANOVA table (omit the p-value).
c) Find the distribution of the test statistic under normality, and show how
to precisely make the decision. (no calculation necessary, only show the steps)
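The following R sketch (not part of the original problem) reproduces the quantities in part b) for the altimeter data; the object names y and group are chosen only for illustration.

y <- c(3, 6, 3,  4, 5, 4,  7, 8, 7)
group <- factor(rep(1:3, each = 3))
full <- lm(y ~ group)        # full model: a separate mean for each altimeter
null <- lm(y ~ 1)            # reduced model: a common mean
RSS  <- sum(resid(full)^2)   # approximately 7.33
RSSH <- sum(resid(null)^2)   # RSSH - RSS is approximately 20.22
Fstat <- ((RSSH - RSS)/2) / (RSS/6)   # F statistic with 2 and 6 df
anova(null, full)            # the same partial F test in table form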
Chapter 4
Prediction and Variable Selection When
n >> p
This chapter considers variable selection when n >> p and prediction in-
tervals that can work if n > p or p > n. Prediction regions and prediction
intervals applied to a bootstrap sample can result in confidence regions and
confidence intervals. The bootstrap confidence regions will be used for infer-
ence after variable selection.
Variable selection, also called subset or model selection, is the search for a
subset of predictor variables that can be deleted with little loss of information
if n/p is large. Consider the 1D regression model where Y is independent of x given the sufficient predictor SP (written Y ⊥ x | SP) where SP = xT β. See Chapters 1 and 10. A model for variable selection can be
described by
xT β = xTS β S + xTE β E = xTS β S (4.1)
where x = (xTS , xTE )T is a p × 1 vector of predictors, xS is an aS × 1 vector,
and xE is a (p − aS ) × 1 vector. Given that xS is in the model, β E = 0 and
E denotes the subset of terms that can be eliminated given that the subset
S is in the model.
Since S is unknown, candidate subsets will be examined. Let xI be the
vector of a terms from a candidate subset indexed by I, and let xO be the
vector of the remaining predictors (out of the candidate submodel). Then
xT β = xTI βI + xTO β O .
If S ⊆ I, then xT β = xTS β S = xTI β I where xI/S denotes the predictors in I that are not in S (these predictors are in E, so β I/S = 0). Since this is true regardless of the values of the predictors, β O = 0, and the correlation corr(xTi β, xTI,i β I ) = 1.0 for the population model if S ⊆ I. The estimated sufficient predictor (ESP) is xT β̂, and a submodel I is worth considering if the correlation corr(ESP, ESP (I)) ≥ 0.95.
Definition 4.1. The model Y ⊥ x | xT β that uses all of the predictors is called the full model. A model Y ⊥ xI | xTI β I that uses a subset xI of the
predictors is called a submodel. The full model is always a submodel.
The full model has sufficient predictor SP = xT β and the submodel has
SP = xTI βI .
If the submodel I omits predictors with nonzero coefficients, then the OLS estimator β̂ I from regressing Y on xI estimates β I + (X TI X I )−1 X TI X O β O = β I + AX O β O rather than β I .
Simpler models are easier to explain and use than more complicated models, and there are several other important reasons to perform variable selection. For example, an OLS MLR model with unnecessary predictors has Σni=1 V (Ŷi ) that is too large. If (4.1) holds, S ⊆ I, β S is an aS × 1 vector, and β I is a j × 1 vector with j > aS , then
(1/n) Σni=1 V (ŶIi ) = σ 2 j/n > σ 2 aS /n = (1/n) Σni=1 V (ŶSi ). (4.2)
For the partial F test that compares submodel I with k terms to the full model with p terms, the test statistic is FI = [(SSE(I) − SSE)/(p − k)]/[SSE/(n − p)], where SSE is the error sum of squares from the full model, and SSE(I) is the error sum of squares from the candidate submodel. An extremely important criterion for variable selection is the Cp criterion.
Definition 4.2.
Cp (I) = SSE(I)/M SE + 2k − n = (p − k)(FI − 1) + k
where MSE is the error mean square for the full model.
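A minimal R sketch of this criterion is below; the simulated data and the choice of submodel are illustrative assumptions, not from the text.

set.seed(1)
n <- 100; p <- 4
x <- matrix(rnorm(n * (p - 1)), nrow = n)   # predictors x2, x3, x4
y <- 1 + x[, 1] + rnorm(n)                  # only x2 is needed
full <- lm(y ~ x)                           # full model with p = 4 terms
sub  <- lm(y ~ x[, 1])                      # submodel I with k = 2 terms
MSE  <- sum(resid(full)^2) / (n - p)        # error mean square, full model
SSEI <- sum(resid(sub)^2)                   # SSE(I)
k <- 2
Cp <- SSEI / MSE + 2 * k - n                # Cp(I) = SSE(I)/MSE + 2k - n
Cp                                          # compare with min(2k, p) = 4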
Note that when H0 is true, (p − k)(FI − 1) + k →D χ2p−k + 2k − p for a large
class of iid error distributions. Minimizing Cp (I) is equivalent to minimizing
M SE [Cp (I)] = SSE(I) + (2k − n)M SE = rT (I)r(I) + (2k − n)M SE. The
following theorem helps explain why Cp is a useful criterion and suggests that
for subsets I with k terms, submodels with Cp(I) ≤ min(2k, p) are especially
interesting. Olive and Hawkins (2005) show that this interpretation of Cp can
be generalized to 1D regression models with a linear predictor β T x = xT β,
such as generalized linear models. Denote the residuals and fitted values from
the full model by ri = Yi −xTi β̂ = Yi −Ŷi and Ŷi = xTi β̂ respectively. Similarly,
let β̂I be the estimate of β I obtained from the regression of Y on xI and
denote the corresponding residuals and fitted values by rI,i = Yi − xTI,i β̂I
and ŶI,i = xTI,i β̂ I where i = 1, ..., n.
corr(xT β̂, xTI β̂ I ) = corr(ESP, ESP(I)) = corr(Ŷ , ŶI ) → 1.
Remark 4.1. Consider the model Ii that deletes the predictor xi . Then
the model has k = p − 1 predictors including the constant, and the test
statistic is ti where
t2i = FIi .
Using Definition 4.2 and Cp (Ifull ) = p, it can be shown that Cp (Ii ) = Cp (Ifull ) + (t2i − 2) = p + t2i − 2.
Using the screen Cp (I) ≤ min(2k, p) suggests that the predictor xi should not be deleted if |ti | > √2 ≈ 1.414. If |ti | < √2, then the predictor can probably be deleted since Cp decreases.
The literature suggests using the Cp (I) ≤ k screen, but this screen eliminates
too many potentially useful submodels.
Definition 4.3. The “fit–fit” or FF plot is a plot of ŶI,i versus Ŷi while
a “residual–residual” or RR plot is a plot of rI,i versus ri . A response plot is a
plot of ŶI,i versus Yi . An EE plot is a plot of ESP(I) versus ESP. For MLR,
the EE and FF plots are equivalent.
Six graphs will be used to compare the full model and the candidate sub-
model: the FF plot, RR plot, the response plots from the full and submodel,
and the residual plots from the full and submodel. These six plots will con-
tain a great deal of information about the candidate subset provided that
Equation (4.1) holds and that a good estimator (such as OLS) for β̂ and β̂I
is used.
To verify that the six plots are useful for assessing variable selection,
the following notation will be useful. Suppose that all submodels include
a constant and that X is the full rank n × p design matrix for the full
model. Let the corresponding vectors of OLS fitted values and residuals
be Ŷ = X(X T X)−1 X T Y = HY and r = (I − H)Y , respectively.
Suppose that X I is the n × k design matrix for the candidate submodel
and that the corresponding vectors of OLS fitted values and residuals are
Ŷ I = X I (X TI X I )−1 X TI Y = H I Y and r I = (I − H I )Y , respectively.
A plot can be very useful if the OLS line can be compared to a reference
line and if the OLS slope is related to some quantity of interest. Suppose that
a plot of w versus z places w on the horizontal axis and z on the vertical axis.
Then denote the OLS line by ẑ = a + bw. The following theorem shows that
the plotted points in the FF, RR, and response plots will cluster about the
identity line. Notice that the theorem is a property of OLS and holds even if
the data does not follow an MLR model. Let corr(x, y) denote the correlation
between x and y.
Theorem 4.2. Suppose that every submodel contains a constant and that
X is a full rank matrix.
Response Plot: i) If w = ŶI and z = Y then the OLS line is the identity
line.
ii) If w = Y and z = ŶI then the OLS line has slope b = [corr(Y, ŶI )]2 = R2 (I) and intercept a = Y (1 − R2 (I)) where Y = Σni=1 Yi /n and R2 (I) is the coefficient of multiple determination from the candidate model.
FF or EE Plot: iii) If w = ŶI and z = Ŷ then the OLS line is the identity
line. Note that ESP (I) = ŶI and ESP = Ŷ .
iv) If w = Ŷ and z = ŶI then the OLS line has slope b = [corr(Ŷ , ŶI )]2 =
SSR(I)/SSR and intercept a = Y [1 − (SSR(I)/SSR)] where SSR is the
regression sum of squares.
RR Plot: v) If w = r and z = rI then the OLS line is the identity line.
vi) If w = rI and z = r then a = 0 and the OLS slope b = [corr(r, rI )]2 and
corr(r, rI ) = √[SSE/SSE(I)] = √[(n − p)/(Cp (I) + n − 2k)] = √[(n − p)/((p − k)FI + n − p)].
Also recall that the OLS line passes through the means of the two variables
(w, z).
(*) Notice that the OLS slope from regressing z on w is equal to one if
and only if the OLS slope from regressing w on z is equal to [corr(z, w)]2 .
i) The slope b = 1 if Σi ŶI,i Yi = Σi ŶI,i2 . This equality holds since Ŷ TI Y = Y T H I Y = Y T H I H I Y = Ŷ TI Ŷ I . Since b = 1, a = Y − Y = 0.
v) The OLS line passes through the origin. Hence a = 0. The slope b =
rT rI /rT r. Since rT rI = Y T (I − H )(I − H I )Y and (I − H )(I − H I ) =
I − H, the numerator rT rI = rT r and b = 1.
vi) Again a = 0 since the OLS line passes through the origin. From v),
1 = √[SSE(I)/SSE] [corr(r, rI )].
Hence
corr(r, rI ) = √[SSE/SSE(I)]
and the slope
b = √[SSE/SSE(I)] [corr(r, rI )] = [corr(r, rI )]2 .
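The following R sketch (an illustration, with simulated data) checks part vi) numerically: the sample correlation corr(r, rI ) matches √[SSE/SSE(I)], and the RR plot clusters about the identity line.

set.seed(2)
n <- 50; x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y <- 1 + x1 + x2 + rnorm(n)
full <- lm(y ~ x1 + x2 + x3)    # full model
sub  <- lm(y ~ x1 + x2)         # candidate submodel I
r  <- resid(full); rI <- resid(sub)
SSE <- sum(r^2); SSEI <- sum(rI^2)
c(cor(r, rI), sqrt(SSE / SSEI))  # the two numbers should agree
plot(r, rI); abline(0, 1)        # RR plot clusters about the identity line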
Remark 4.2. Let Imin be the model that minimizes Cp (I) among the models I generated from the variable selection method such as forward selection. Assuming that the full model Ip is one of the models generated, then Cp (Imin ) ≤ Cp (Ip ) = p, and corr(r, rImin ) → 1 as n → ∞ by Theorem 4.2 vi). Referring to Equation (4.1), if P (S ⊆ Imin ) does not go to 1 as n → ∞, then the above correlation would not go to one. Hence P (S ⊆ Imin ) → 1 as n → ∞.
Remark 4.3. Daniel and Wood (1980, p. 85) suggest using Mallows’
graphical method for screening subsets by plotting k versus Cp (I) for models
close to or under the Cp = k line. Theorem 4.2 vi) implies that if Cp(I) ≤ k
or FI < 1, then corr(r, rI ) and corr(ESP, ESP (I)) both go to 1.0 as n → ∞.
Hence models I that satisfy the Cp(I) ≤ k screen will contain the true model
S with high probability when n is large. This result does not guarantee that
the true model S will satisfy the screen, but overfit is likely. Let d be a lower
bound on corr(r, rI ). Theorem 4.2 vi) implies that if
Cp (I) ≤ 2k + n (1/d2 − 1) − p/d2 , then corr(r, rI ) ≥ d.
the factor has c − 1 degrees of freedom) and the jump in Cp is large, greater
than 4, say.
d) If there are no models I with fewer predictors than II such that Cp (I) ≤
min(2k, p), then model II is a good candidate for the best subset found by
the numerical procedure.
These criteria need a good estimator of σ 2 and n/p large. See Shibata (1984).
The criterion Cp (I) = AICS (I) uses Kn = 2 while the BICS (I) criterion uses
Kn = log(n). See Jones (1946) and Mallows (1973) for Cp . It can be shown
that Cp (I) = AICS (I) is equivalent to the Cp (I) criterion of Definition 4.2.
Typically σ̂ 2 is the OLS full model M SE when n/p is large.
The following criteria also need n/p large. AIC is due to Akaike (1973),
AICC is due to Hurvich and Tsai (1989), and BIC to Schwarz (1978) and
Akaike (1977, 1978). Also see Burnham and Anderson (2004).
AIC(I) = n log(SSE(I)/n) + 2a,
AICC (I) = n log(SSE(I)/n) + 2a(a + 1)/(n − a − 1),
and BIC(I) = n log(SSE(I)/n) + a log(n).
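A short R sketch computing these three criteria from an lm() fit is below; the simulated data are an illustrative assumption, and these values differ from R's built-in AIC() only by an additive constant, which does not change which submodel minimizes the criterion.

crit <- function(fit) {
  n <- length(resid(fit))
  a <- length(coef(fit))                 # number of terms in submodel I
  SSEI <- sum(resid(fit)^2)
  c(AIC  = n * log(SSEI / n) + 2 * a,
    AICC = n * log(SSEI / n) + 2 * a * (a + 1) / (n - a - 1),
    BIC  = n * log(SSEI / n) + a * log(n))
}
set.seed(3); n <- 100
x1 <- rnorm(n); x2 <- rnorm(n); y <- 1 + x1 + rnorm(n)
rbind(I1 = crit(lm(y ~ x1)), I2 = crit(lm(y ~ x1 + x2)))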
Forward selection with Cp and AIC often gives useful results if n ≥ 5p
and if the final model has n ≥ 10ad . For p < n < 5p, forward selection with
Cp and AIC tends to pick the full model (which overfits since n < 5p) too
often, especially if σ̂ 2 = M SE. The Hurvich and Tsai (1989, 1991) AICC
criterion can be useful if n ≥ max(2p, 10ad).
The EBIC criterion given in Luo and Chen (2013) may be useful when
n/p is not large. Let 0 ≤ γ ≤ 1 and |I| = a ≤ min(n, p) if β̂ I is a × 1. We
may use a ≤ min(n/5, p). Then EBIC(I) =
n log(SSE(I)/n) + a log(n) + 2γ log[C(p, a)] = BIC(I) + 2γ log[C(p, a)],
where C(p, a) is the binomial coefficient “p choose a.”
This criterion can give good results if p = pn = O(nk ) and γ > 1 − 1/(2k).
Hence we will use γ = 1. Then minimizing EBIC(I) is equivalent to mini-
mizing BIC(I) − 2 log[(p − a)!] − 2 log(a!) since log(p!) is a constant.
The above criteria can be applied to forward selection and relaxed lasso.
The Cp criterion can also be applied to lasso. See Efron and Hastie (2016,
pp. 221, 231).
Now suppose p = 6 and S in Equation (4.1) corresponds to x1 ≡ 1, x2 ,
and x3 . Suppose the data set is such that underfitting (omitting a predic-
tor in S) does not occur. Then there are eight possible submodels that
contain S: i) x1 , x2 , x3 ; ii) x1 , x2 , x3 , x4 ; iii) x1 , x2 , x3 , x5 ; iv) x1 , x2 , x3 , x6 ;
v) x1 , x2 , x3, x4 , x5 ; vi) x1 , x2 , x3 , x4, x6 ; vii) x1 , x2 , x3, x5 , x6 ; and the full
model viii) x1 , x2 , x3 , x4 , x5, x6 . The possible submodel sizes are k = 3, 4, 5,
or 6. Since the variable selection criteria for forward selection described above
minimize the MSE given that x∗1 , ..., x∗k−1 are in the model, the M SE(Ik ) are
too small and underestimate σ 2 . Also the model Imin fits the data a bit too
well. Suppose Imin = Id . Compared to selecting a model Ik before examining
the data, the residuals ri (Imin ) are too small in magnitude, the |ŶImin ,i − Yi |
are too small, and M SE(Imin ) is too small. Hence using Imin = Id as the full
model for inference does not work. In particular, the partial F test statistic
FR in Theorem 2.27, using Id as the full model, is too large since the M SE
is too small. Thus the partial F test rejects H0 too often. Similarly, the con-
fidence intervals for βi are too short, and hypothesis tests reject H0 : βi = 0
too often when H0 is true. The fact that the selected model Imin from vari-
able selection cannot be used as the full model for classical inference is known
as selection bias. Also see Hurvich and Tsai (1990).
This chapter offers two remedies: i) use the large sample theory of β̂ Imin ,0
(defined two paragraphs below) and the bootstrap for inference after variable
selection, and ii) use data splitting for inference after variable selection.
where V j,0 adds columns and rows of zeros corresponding to the xi not in
Ij , and V j,0 is singular unless Ij corresponds to the full model.
Theorem 4.3 can be used to justify prediction intervals after variable se-
lection. See Pelawa Watagoda and Olive (2020). Theorem 4.3d) is useful for
variable selection consistency and the oracle property where πd = πS = 1 if
P (Imin = S) → 1 as n → ∞. See Claeskens and Hjort (2008, pp. 101-114) and
Fan and Li (2001) for references. A necessary condition for P (Imin = S) → 1
is that S is one of the models considered with probability going to one.
This condition holds under strong regularity conditions for fast methods. See
Wieczorek and Lei (2021) for forward selection and Hastie et al. (2015, pp.
295-302) for lasso, where the predictors need a “near orthogonality” condi-
tion.
Remark 4.4. If A1 , A2 , ..., Ak are pairwise disjoint and if ∪ki=1 Ai = S,
then the collection of sets A1 , A2 , ..., Ak is a partition of S. Then the Law of
Total Probability states that if A1 , A2 , ..., Ak form a partition of S such that
P (Ai ) > 0 for i = 1, ..., k, then
P (B) = Σkj=1 P (B ∩ Aj ) = Σkj=1 P (B|Aj )P (Aj ).
Let sets Ak+1 , ..., Am satisfy P (Ai ) = 0 for i = k + 1, ..., m. Define P (B|Aj ) = 0 if P (Aj ) = 0. Then a Generalized Law of Total Probability is
P (B) = Σmj=1 P (B ∩ Aj ) = Σmj=1 P (B|Aj )P (Aj ),
Fwn (t) = P [n1/2 (β̂ V S − β) ≤ t] = ΣJk=1 P [n1/2 (β̂ V S − β) ≤ t | β̂ V S = β̂ Ik ,0 ] P (β̂ V S = β̂ Ik ,0 )
= ΣJk=1 P [n1/2 (β̂ Ik ,0 − β) ≤ t | β̂ V S = β̂ Ik ,0 ] πkn = ΣJk=1 P [n1/2 (β̂ CIk ,0 − β) ≤ t] πkn = ΣJk=1 Fwkn (t) πkn .
Hence β̂ V S has a mixture distribution of the β̂ CIk ,0 with probabilities πkn , and w n has a mixture distribution of the w kn with probabilities πkn .
Charkhi and Claeskens (2018) showed that w jn = √n(β̂ CIj ,0 − β) →D w j if S ⊆ Ij for the MLE with AIC. Here w j is a multivariate truncated normal distribution (where no truncation is possible) that is symmetric about 0. Hence E(w j ) = 0, and Cov(w j ) = Σ j exists. Referring to Definitions 4.4 and 4.5, note that both √n(β̂ M IX − β) and √n(β̂ V S − β) are selecting from the ukn = √n(β̂ Ik ,0 − β) and asymptotically from the uj of Equation (4.3). The random selection for β̂ M IX does not change the distribution of ujn , but selection bias does change the distribution of the selected ujn to that of w jn . Similarly, selection bias does change the distribution of the selected uj to that of w j . The reasonable Theorem 4.4 assumption that w jn →D w j may not be mild.
Mixture distributions are useful for variable selection since β̂ Imin ,0 has a mixture distribution of the β̂ Ij ,0 . Review mixture distributions from Section 1.6. The following theorem is due to Pelawa Watagoda and Olive (2019a). Note that the cdf of Tn is FTn (z) = Σj πjn FTjn (z) where FTjn (z) is the cdf of Tjn .
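The following small simulation sketch (not from the text) illustrates the mixture distribution: backward elimination with AIC, via R's step() function, stands in for the variable selection method, and the coefficient of an unneeded predictor is exactly zero whenever that predictor is deleted, giving a point mass at 0 plus a continuous component.

set.seed(4)
B <- 200; n <- 100
b3 <- numeric(B)
for (i in 1:B) {
  x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
  y  <- 1 + x1 + rnorm(n)                 # S = {constant, x1}; x2, x3 are in E
  fit <- step(lm(y ~ x1 + x2 + x3), trace = 0)   # backward elimination with AIC
  b3[i] <- if ("x3" %in% names(coef(fit))) coef(fit)[["x3"]] else 0  # zero padding
}
mean(b3 == 0)     # point mass at 0 (runs that delete x3)
hist(b3[b3 != 0]) # continuous part of the mixture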
Prediction intervals for regression and prediction regions for multivariate re-
gression are important topics. Inference after variable selection will consider
bootstrap hypothesis testing. Applying certain prediction intervals or pre-
diction regions to the bootstrap sample will result in confidence intervals or
confidence regions. The prediction intervals and regions are based on samples
of size n, while the bootstrap sample size is B = Bn . Hence this section and
the following section are important.
then the number j of the k PIs with Yfi ∈ P Ii approximately follows a binomial(k, ρ = 1 − δ) distribution. Hence if 100 large sample 95% PIs are made, ρ = 0.95 and Yfi ∈ P Ii happens about 95 times.
There are two big differences between CIs and PIs. First, the length of the
CI goes to 0 as the sample size n goes to ∞ while the length of the PI con-
verges to some nonzero number J, say. Secondly, many confidence intervals
work well for large classes of distributions while many prediction intervals
assume that the distribution of the data is known up to some unknown pa-
rameters. Usually the N (µ, σ 2 ) distribution is assumed, and the parametric
PI may not perform well if the normality assumption is violated. This section
will describe three nonparametric PIs for the additive error regression model,
Y = m(x) + e, that work well for a large class of unknown zero mean error
distributions.
First we will consider the location model, Yi = µ + ei , where Y1 , ..., Yn, Yf
are iid and there are no vectors of predictors xi and xf . Let Z(1) ≤ Z(2) ≤
· · · ≤ Z(n) be the order statistics of n iid random variables Z1 , ..., Zn. Let a
future random variable Zf be such that Z1 , ..., Zn, Zf are iid. Let k1 = dnδ/2e
and k2 = dn(1 − δ/2)e where dxe is the smallest integer ≥ x. For example,
d7.7e = 8. Then a common nonparametric large sample 100(1−δ)% prediction
interval for Zf is
[Z(k1) , Z(k2 ) ] (4.8)
where 0 < δ < 1. See Frey (2013) for references.
The shorth(c) estimator of the population shorth is useful for making
asymptotically optimal prediction intervals. With the Zi and Z(i) as in the
above paragraph, let the shortest closed interval containing at least c of the
Zi be
shorth(c) = [Z(s) , Z(s+c−1) ]. (4.9)
Let
kn = dn(1 − δ)e. (4.10)
Frey (2013) showed that for large nδ and iid data, the shorth(kn ) prediction interval has maximum undercoverage ≈ 1.12√(δ/n), and used the shorth(c) estimator as the large sample 100(1 − δ)% PI where
c = min(n, dn[1 − δ + 1.12√(δ/n) ]e). (4.11)
An interesting fact is that the maximum undercoverage occurs for the family
of uniform U (θ1 , θ2 ) distributions where such a distribution has pdf f(y) =
1/(θ2 − θ1 ) for θ1 ≤ y ≤ θ2 where f(y) = 0, otherwise, and θ1 < θ2 .
A problem with the prediction intervals that cover ≈ 100(1 − δ)% of the
training data cases Yi (such as (4.8) using c = kn given by (4.9)), is that they
have coverage lower than the nominal coverage of 1 − δ for moderate n. This
result is not surprising since empirically statistical methods perform worse on
test data. For iid data, Frey (2013) used (4.10) to correct for undercoverage.
Example 4.1. Given below were votes for preseason 1A basketball poll
from Nov. 22, 2011 WSIL News where the 778 was a typo: the actual value
was 78. As shown below, finding shorth(3) from the ordered data is simple.
If the outlier was corrected, shorth(3) = [76,78].
111 89 778 78 76
13 = 89 - 76
33 = 111 - 78
689 = 778 - 89
shorth(3) = [76,89]
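A minimal R sketch of the shorth(c) computation for this data set is below; the function name shorth is chosen only for illustration.

shorth <- function(z, cc) {   # shortest closed interval covering cc order statistics
  z <- sort(z)
  n <- length(z)
  widths <- z[cc:n] - z[1:(n - cc + 1)]   # lengths of the intervals [z(s), z(s+cc-1)]
  s <- which.min(widths)
  c(z[s], z[s + cc - 1])
}
votes <- c(111, 89, 778, 78, 76)
shorth(votes, 3)                      # [76, 89], as in Example 4.1
shorth(c(111, 89, 78, 78, 76), 3)     # [76, 78] with the typo corrected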
[Figure 4.1: plot of the EXP(1) pdf f(z) = exp(−z) for 0 ≤ z ≤ 5; the area under the pdf from 0 to 1 is 0.368.]
Remark 4.7. The large sample 100(1 − δ)% shorth PI (4.10) may or
may not be asymptotically optimal if the 100(1 − δ)% population shorth is
[Ls , Us ] and F (x) is not strictly increasing in intervals (Ls − ε, Ls + ε) and (Us − ε, Us + ε) for some ε > 0. To see the issue, suppose Y has probability
mass function (pmf) p(0) = 0.4, p(1) = 0.3, p(2) = 0.2, p(3) = 0.06, and
p(4) = 0.04. Then the 90% population shorth is [0,2] and the 100(1 − δ)%
For a random variable Y , the 100(1 −δ)% highest density region is a union
of k ≥ 1 disjoint intervals such that the mass within the intervals ≥ 1 − δ
and the sum of the k interval lengths is as small as possible. Suppose that
f(z) is a unimodal pdf that has interval support, and that the pdf f(z) of Y
decreases rapidly as z moves away from the mode. Let [a, b] be the shortest
interval such that FY (b) − FY (a) = 1 − δ where the cdf FY (z) = P (Y ≤ z).
Then the interval [a, b] is the 100(1 − δ)% highest density region. To find the
100(1 − δ)% highest density region of a pdf, move a horizontal line down
from the top of the pdf. The line will intersect the pdf or the boundaries of
the support of the pdf at [a1 , b1 ], ..., [ak , bk ] for some k ≥ 1. Stop moving the line when the total area under the pdf corresponding to these intervals equals
1 − δ. As an example, let f(z) = e−z for z > 0. See Figure 4.1 where the
area under the pdf from 0 to 1 is 0.368. Hence [0,1] is the 36.8% highest
density region. The shorth PI estimates the highest density interval which is
the highest density region for a distribution with a unimodal pdf. Often the
highest density region is an interval [a, b] where f(a) = f(b), especially if the
support where f(z) > 0 is (−∞, ∞).
c = dnqn e, (4.13)
and let
bn = (1 + 15/n) √[(n + 2d)/(n − d)] (4.14)
if d ≤ 8n/9, and
bn = 5 (1 + 15/n),
otherwise. As d gets close to n, the model overfits and the coverage will be less than the nominal. The piecewise formula for bn allows the prediction interval to be computed even if d ≥ n. Compute the shorth(c) of the residuals = [r(s) , r(s+c−1) ] = [ξ̃δ1 , ξ̃1−δ2 ]. Then the first 100(1 − δ)% large sample PI for Yf is
[m̂(xf ) + bn ξ̃δ1 , m̂(xf ) + bn ξ̃1−δ2 ]. (4.15)
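A sketch of this PI in R is below. It assumes d equals the number of coefficients of the fitted OLS model and uses qn = min(1 − δ + 0.05, 1 − δ + d/n), the form stated later for the prediction regions (given there for δ > 0.1); both choices are simplifying assumptions, and the function names shorth3 and respi are illustrative.

shorth3 <- function(z, cc) {      # shortest interval covering cc order statistics
  z <- sort(z); n <- length(z)
  s <- which.min(z[cc:n] - z[1:(n - cc + 1)])
  c(z[s], z[s + cc - 1])
}
respi <- function(fit, xf, del = 0.05) {
  r <- resid(fit); n <- length(r); d <- length(coef(fit))
  qn <- min(1 - del + 0.05, 1 - del + d / n)   # assumed form of qn
  cc <- ceiling(n * qn)
  bn <- if (d <= 8 * n / 9) (1 + 15 / n) * sqrt((n + 2 * d) / (n - d)) else 5 * (1 + 15 / n)
  xi <- shorth3(r, cc)                         # shorth(c) of the residuals
  yfhat <- sum(c(1, xf) * coef(fit))           # m-hat(xf) for a model with a constant
  c(yfhat + bn * xi[1], yfhat + bn * xi[2])
}
# Example: set.seed(5); x <- rnorm(100); y <- 1 + x + rnorm(100)
# respi(lm(y ~ x), xf = 0.5)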
The second PI randomly divides the data into two half sets H and V
where H has nH = dn/2e of the cases and V has the remaining nV = n − nH
cases i1 , ..., inV . The estimator m̂H (x) is computed using the training data
set H. Then the validation residuals vj = Yij − m̂H (xij ) are computed for the
j = 1, ..., nV cases in the validation set V . Find the Frey PI [v(s) , v(s+c−1)]
of the validation residuals (replacing n in (4.10) by nV = n − nH ). Then the
second new 100(1 − δ)% large sample PI for Yf is [m̂H (xf ) + v(s) , m̂H (xf ) + v(s+c−1) ].
Remark 4.8. Note that correction factors bn → 1 are used in large sample
confidence intervals and tests if the limiting distribution is N(0,1) or χ2p , but
a tdn or pFp,dn cutoff is used: tdn ,1−δ /z1−δ → 1 and pFp,dn ,1−δ /χ2p,1−δ → 1 if
dn → ∞ as n → ∞. Using correction factors for large sample confidence inter-
vals, tests, prediction intervals, prediction regions, and bootstrap confidence
regions improves the performance for moderate sample size n.
We can also motivate PI (4.15) by modifying the justification for the Lei
et al. (2018) split conformal prediction interval
P (Yf ∈ [m̂H (xf )+v(k) , m̂H (xf )+v(k+b−1) ]) = P (v(k) ≤ vnV +1 ≤ v(k+b−1)) ≥
P (vnV +1 has rank between k + 1 and k + b − 1 and there are no tied ranks)
≥ (b − 1)/(nV + 1) ≈ 1 − δ if b = d(nV + 1)(1 − δ)e + 1 and k + b − 1 ≤ nV .
This probability statement holds for a fixed k such as k = dnV δ/2e. The
statement is not true when the shorth(b) estimator is used since the shortest
interval using k = s can have s change with the data set. That is, s is not
fixed. Hence if PI’s were made from J independent data sets, the PI’s with
fixed k would contain Yf about J(1−δ) times, but this value would be smaller
for the shorth(b) prediction intervals where s can change with the data set.
The above argument works if the estimator m̂(x) is “symmetric in the data,”
which is satisfied for multiple linear regression estimators.
The PIs (4.14) to (4.16) can be used with m̂(x) = Ŷf = xTId β̂Id where Id
denotes the index of predictors selected from the model or variable selection
method. If β̂ is a consistent estimator of β, the PIs (4.14) and (4.15) are
asymptotically optimal for a large class of error distributions while the split
conformal PI (4.16) needs the error distribution to be unimodal and symmet-
ric for asymptotic optimality. Since m̂H uses n/2 cases, m̂H has about half
the efficiency of m̂. When p ≥ n, the regularity conditions for consistent esti-
mators are strong. For example, EBIC and lasso can have P (S ⊆ Imin ) → 1
as n → ∞. Then forward selection with EBIC and relaxed lasso can produce consistent estimators. PLS can be √n consistent. See Chapter 5 for the large sample theory for many MLR estimators.
None of the three prediction intervals (4.14), (4.15), and (4.16) dominates
the other two. Recall that βS is an aS × 1 vector in (4.1). If a good fit-
ting method, such as lasso or forward selection with EBIC, is used, and
1.5aS ≤ n ≤ 5aS , then PI (4.14) can be much shorter than PIs (4.15) and
(4.16). For n/d large, PIs (4.14) and (4.15) can be shorter than PI (4.16) if
the error distribution is not unimodal and symmetric; however, PI (4.16) is
often shorter if n/d is not large since the sample shorth converges to the pop-
ulation shorth rather slowly. Grübel (1982) shows that for iid data, the length and center of the shorth(kn ) interval are √n consistent and n1/3 consistent estimators of the length and center of the population shorth interval. For a
unimodal and symmetric error distribution, the three PIs are asymptotically
equivalent, but PI (4.16) can be the shortest PI due to different correction
factors.
If the estimator is poor, the split conformal PI (4.16) and PI (4.15) can
have coverage closer to the nominal coverage than PI (4.14). For example, if
m̂ interpolates the data and m̂H interpolates the training data from H, then
the validation residuals will be huge. Hence PI (4.15) will be long compared
to PI (4.16).
Asymptotically optimal PIs estimate the population shorth of the zero
mean error distribution. Hence PIs that use the shorth of the residuals, such
as PIs (4.14) and (4.15), are the only easily computed asymptotically optimal
PIs for a wide range of consistent estimators β̂ of β for the multiple linear
regression model. If the error distribution is e ∼ EXP (1) − 1, then the
asymptotic length of the 95% PI (4.14) or (4.15) is 2.966 while that of the
split conformal PI is 2(1.966) = 3.992. For more about these PIs applied to
MLR models, see Section 5.10 and Pelawa Watagoda and Olive (2019b).
Definition 4.8. Let x1j , ..., xnj be measurements on the jth random
variable Xj corresponding to the jth column of the data matrix W . The
jth sample mean is xj = (1/n) Σnk=1 xkj . The sample covariance Sij estimates Cov(Xi , Xj ) = σij = E[(Xi − E(Xi ))(Xj − E(Xj ))], and
Sij = [1/(n − 1)] Σnk=1 (xki − xi )(xkj − xj ).
Sii = Si2 is the sample variance that estimates the population variance
σii = σi2 . The sample correlation rij estimates the population correlation
Cor(Xi , Xj ) = ρij = σij /(σi σj ), and
rij = Sij /(Si Sj ) = Sij /√(Sii Sjj ) = Σnk=1 (xki − xi )(xkj − xj ) / √[Σnk=1 (xki − xi )2 Σnk=1 (xkj − xj )2 ].
That is, the ij entry of S is the sample covariance Sij . The classical estima-
tor of multivariate location and dispersion is (T, C) = (x, S). The sample
correlation matrix
R = (rij ).
That is, the ij entry of R is the sample correlation rij .
It can be shown that (n − 1)S = Σni=1 xi xTi − n x xT = W T W − (1/n) W T 11T W . Hence if the centering matrix G = I − (1/n) 11T , then (n − 1)S = W T GW .
n
See Definition 1.24 for the population mean and population covariance
matrix. Definition 2.18 also defined a sample covariance matrix. The Mahalanobis distance Di uses Di2 = Di2 (T (W ), C(W )) = (xi − T (W ))T [C(W )]−1 (xi − T (W )) for each point xi . Notice that Di2 is a random variable (scalar valued). Let
(T, C) = (T (W ), C(W )). Then
Dx2 (T, C) = (x − T )T C −1 (x − T ).
Let the p × 1 location vector be µ, often the population mean, and let
the p × p dispersion matrix be Σ, often the population covariance matrix.
Notice that if x is a random vector, then the population squared Mahalanobis
distance from Definition 1.38 is
Dx2 (µ, Σ) = (x − µ)T Σ −1 (x − µ) (4.19)
and that the term Σ −1/2 (x − µ) is the p−dimensional analog to the z-score
used to transform a univariate N (µ, σ 2 ) random variable into a N (0, 1) random variable. Hence the sample Mahalanobis distance Di = √(Di2 ) is an ana-
log of the absolute value |Zi | of the sample Z-score Zi = (Xi − X)/σ̂. Also
notice that the Euclidean distance of xi from the estimate of center T (W )
is Di (T (W ), I p ) where I p is the p × p identity matrix.
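The classical estimator and the sample Mahalanobis distances are easily computed in R; the following sketch (with simulated data, an illustrative assumption) also checks the centering matrix identity (n − 1)S = W T GW given above.

set.seed(6)
w <- matrix(rnorm(100 * 3), ncol = 3)         # data matrix W with n = 100, p = 3
xbar <- colMeans(w)
S <- cov(w)                                    # sample covariance matrix
R <- cor(w)                                    # sample correlation matrix
D2 <- mahalanobis(w, center = xbar, cov = S)   # D_i^2 = (x_i - T)' C^{-1} (x_i - T)
n <- nrow(w); G <- diag(n) - matrix(1, n, n) / n
max(abs((n - 1) * S - t(w) %*% G %*% w))       # essentially zero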
Increasing c will improve the coverage for moderate samples. Also see Remark
4.8. Empirically for many distributions, for n ≈ 20p, the prediction region
(4.19) applied to iid data using c = kn = dn(1 − δ)e tended to have under-
coverage as high as 5%. The undercoverage decreases rapidly as n increases.
Let qn = min(1 − δ + 0.05, 1 − δ + p/n) for δ > 0.1 and
c = dnqn e (4.22)
in (4.19) decreased the undercoverage. Note that Equations (4.11) and (4.12)
are similar to Equations (4.20) and (4.21), but replace p by d.
If (T, C) is a √n consistent estimator of (µ, d Σ) for some constant d > 0 where Σ is nonsingular, then D2 (T, C) = (x − T )T C −1 (x − T ) =
{z : (z − T )T C −1 (z − T ) ≤ h2 } = {z : Dz2 ≤ h2 } = {z : Dz ≤ h} (4.23)
The Olive (2013a) nonparametric prediction region uses (T, C) = (x, S).
For the classical prediction region, see Chew (1966) and Johnson and Wichern
(1988, pp. 134, 151). Refer to the above paragraph for D(Un ) .
Fig. 4.2 Correction Factor Comparison when δ = 0.01 (Top Plot) and δ = 0.3 (Bottom Plot)
Remark 4.11. The most used prediction regions assume that the error
vectors are iid from a multivariate normal distribution. Using (4.23), the ratio
of the volumes of regions (4.25) and (4.24) is
[χ2p,1−δ / D2(Un ) ]p/2 ,
which can become close to zero rapidly as p gets large if the xi are not from the light tailed multivariate normal distribution. For example, suppose χ24,0.5 ≈ 3.33 and D2(Un ) ≈ Dx2 ,0.5 = 6. Then the ratio is (3.33/6)2 ≈ 0.308.
Hence if the data is not multivariate normal, severe undercoverage can occur
if the classical prediction region is used, and the undercoverage tends to get
worse as the dimension p increases. The coverage need not go to 0, since by the multivariate Chebyshev’s inequality, P (Dx2 (µ, Σ x ) ≤ γ) ≥ 1 − p/γ > 0 for γ > p where the population covariance matrix Σ x = Cov(x). See Budny
(2014), Chen (2011), and Navarro (2014, 2016). Using γ = h2 = p/δ in (4.22)
usually results in prediction regions with volume and coverage that is too
large.
For the multivariate lognormal distribution with n = 20p, the large sample
nonparametric 95% prediction region (4.24) had coverages 0.970, 0.959, and
0.964 for p = 100, 200, and 500. Some R code is below.
nruns=1000 #lognormal, p = 100, n = 20p = 2000
count<-0
for(i in 1:nruns){
x <- exp(matrix(rnorm(200000),ncol=100,nrow=2000))
xff <- exp(as.vector(rnorm(100)))
count <- count + predrgn(x,xf=xff)$inr}
count #970/1000, may take a few minutes
Notice that for the training data x1 , ..., xn , if C −1 exists, then c ≈ 100qn %
of the n cases are in the prediction regions for xf = xi , and qn → 1 −δ even if
(T, C) is not a good estimator. Hence the coverage qn of the training data is
robust to model assumptions. Of course the volume of the prediction region
could be large if a poor estimator (T, C) is used or if the xi do not come
from an elliptically contoured distribution. Also notice that qn = 1 − δ/2 or
qn = 1 − δ + 0.05 for n ≤ 20p and qn → 1 − δ as n → ∞. If qn ≡ 1 − δ and
(T, C) is a consistent estimator of (µ, dΣ) where d > 0 and Σ is nonsingular,
then (4.22) with h = D(Un ) is a large sample prediction region, but taking
qn given by (4.20) improves the finite sample performance of the prediction
region. Taking qn ≡ 1 − δ does not take into account variability of (T, C),
and for n = 20p the resulting prediction region tended to have undercoverage
as high as min(0.05, δ/2). Using (4.20) helped reduce undercoverage for small
n ≥ 20p due to the unknown variability of (T, C).
This section shows that, under regularity conditions, applying the nonparametric prediction region of Section 4.4 to a bootstrap sample results in a confidence region.
There are several methods for obtaining a bootstrap sample T1∗ , ...., TB∗
where the sample size n is suppressed: Ti∗ = Tin
∗
. The parametric bootstrap,
nonparametric bootstrap, and residual bootstrap will be used. Applying pre-
diction region (4.24) to the bootstrap sample will result in a confidence region
for θ. When g = 1, applying the shorth PI (4.10) or PI (4.7) to the bootstrap
sample results in a confidence interval for θ. Section 4.5.2 will help clarify
ideas.
Definition 4.14. The large sample 100(1 − δ)% lower shorth CI for θ is (−∞, T(c)∗ ], while the large sample 100(1 − δ)% upper shorth CI for θ is [T(B−c+1)∗ , ∞). The large sample 100(1 − δ)% shorth(c) CI uses the interval [T(1)∗ , T(c)∗ ], [T(2)∗ , T(c+1)∗ ], ..., [T(B−c+1)∗ , T(B)∗ ] of shortest length. Here
c = min(B, dB[1 − δ + 1.12√(δ/B) ]e). (4.28)
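A small R sketch of this shorth(c) CI applied to a nonparametric bootstrap sample of the sample median is below; the statistic and data are illustrative assumptions.

set.seed(7)
x <- rexp(100)                     # observed data
B <- 1000; del <- 0.05
Tstar <- replicate(B, median(sample(x, replace = TRUE)))   # bootstrap sample
cc <- min(B, ceiling(B * (1 - del + 1.12 * sqrt(del / B))))
Ts <- sort(Tstar)
s  <- which.min(Ts[cc:B] - Ts[1:(B - cc + 1)])             # shortest covering interval
c(Ts[s], Ts[s + cc - 1])           # shorth(c) CI for the population median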
Definition 4.15. Suppose that data x1 , ..., xn has been collected and
observed. Often the data is a random sample (iid) from a distribution with
cdf F . The empirical distribution is a discrete distribution where the xi are
the possible values, and each value is equally likely. If w is a random variable
having the empirical distribution, then pi = P (w = xi ) = 1/n for i = 1, ..., n.
The cdf of the empirical distribution is denoted by Fn .
Hence
E(w) = Σni=1 (1/n) xi = x,
and
Cov(w) = Σni=1 (1/n) (xi − x)(xi − x)T = [(n − 1)/n] S.
Example 4.3. If W1 , ..., Wn are iid from a distribution with cdf FW , then
the empirical cdf Fn corresponding to FW is given by
Fn (y) = (1/n) Σni=1 I(Wi ≤ y)
prediction region (4.24) to the pseudodata. See Section 8.3 and Olive (2017b,
2018). The residual bootstrap could also be used to make a bootstrap sample
ŷ f + ε̂∗1 , ..., ŷ f + ε̂∗B where the ε̂∗j are selected with replacement from the residual vectors for j = 1, ..., B. As B → ∞, the bootstrap sample will take on the n values ŷ f + ε̂i (the pseudodata) with probabilities converging to 1/n
for i = 1, ..., n.
Suppose there is a statistic Tn that is a g × 1 vector. Let
T ∗ = (1/B) ΣBi=1 Ti∗ and S ∗T = [1/(B − 1)] ΣBi=1 (Ti∗ − T ∗ )(Ti∗ − T ∗ )T (4.29)
be the sample mean and sample covariance matrix of the bootstrap sample T1∗ , ..., TB∗ where Ti∗ = Ti,n∗ . Fix n, and let E(Ti,n∗ ) = θ n and Cov(Ti,n∗ ) = Σ n . We will often assume that Cov(Tn ) = Σ T , and √n(Tn − θ) →D Ng (0, Σ A ) where Σ A > 0 is positive definite and nonsingular. Often nΣ̂ T →P Σ A . For example, using least squares and the residual bootstrap for the multiple linear regression model, Σ n = [(n − p)/n] M SE(X T X)−1 , Tn = θ n = β̂, θ = β, Σ̂ T = M SE(X T X)−1 and Σ A = σ 2 limn→∞ (X T X/n)−1 . See Example 4.6 in Section 4.6.
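The following R sketch (illustration only, with simulated data) carries out the residual bootstrap for an OLS MLR fit and forms the bootstrap sample mean and covariance matrix of Equation (4.29).

set.seed(8)
n <- 100; x <- matrix(rnorm(n * 2), ncol = 2)
y <- 1 + x[, 1] + rnorm(n)
fit  <- lm(y ~ x)
yhat <- fitted(fit); r <- resid(fit)
B <- 1000
Tstar <- t(replicate(B, {
  ystar <- yhat + sample(r, n, replace = TRUE)   # Y* = X beta-hat + r^W
  coef(lm(ystar ~ x))                            # T_i* = beta-hat*
}))
Tbar   <- colMeans(Tstar)      # bootstrap sample mean
SstarT <- cov(Tstar)           # bootstrap sample covariance matrix
round(SstarT / vcov(fit), 2)   # roughly (n - p)/n times the classical estimate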
Suppose the Ti∗ = Ti,n∗ are iid from some distribution with cdf F̃n . For example, if Ti,n∗ = t(Fn∗ ) where iid samples from Fn are used, then F̃n is the cdf of t(Fn∗ ). With respect to F̃n , both θ n and Σ n are parameters, but with
cdf of t(Fn∗). With respect to F̃n , both θn and Σ n are parameters, but with
respect to F , θ n is a random vector and Σ n is a random matrix. For fixed
n, by the multivariate central limit theorem,
√ ∗ D ∗ ∗ D
B(T − θn ) → Ng (0, Σ n ) and B(T − θ n )T [S∗T ]−1 (T − θ n ) → χ2r
as B → ∞.
Remark 4.13. For Examples 4.2, 4.5, and 4.6, the bootstrap works but is expensive compared to alternative methods. For Example 4.2, fix n, then T ∗ →P θ n = x and S ∗T →P (n − 1)S/n as B → ∞, but using (x, S) makes more sense. For Example 4.5, use the pseudodata instead of the residual bootstrap. For Example 4.6, using β̂ and the classical estimated covariance matrix Ĉov(β̂) = M SE(X T X)−1 makes more sense than using the bootstrap. For these three examples, it is known how the bootstrap sample behaves as B → ∞. The bootstrap can be very useful when √n(Tn − θ) →D Ng (0, Σ A ), but it is not known how to estimate Σ A without using a resampling method like the bootstrap. The bootstrap may be useful when √n(Tn − θ) →D u, but the limiting distribution (the distribution of u) is unknown.
When the bootstrap is used, a large sample 100(1 − δ)% confidence region
for a g × 1 parameter vector θ is a set An = An,B such that P (θ ∈ An,B ) is
eventually bounded below by 1 − δ as n, B → ∞. The B is often suppressed.
Consider testing H0 : θ = θ0 versus H1 : θ 6= θ0 where θ 0 is a known g × 1
vector. Then reject H0 if θ 0 is not in the confidence region An . Let the g × 1
vector Tn be an estimator of θ. Let T1∗ , ..., TB∗ be the bootstrap sample for Tn .
Let A be a full rank g × p constant matrix. For variable selection, consider
testing H0 : Aβ = θ0 versus H1 : Aβ 6= θ 0 with θ = Aβ where often
∗
θ0 = 0. Then let Tn = Aβ̂ Imin ,0 and let Ti∗ = Aβ̂ Imin ,0,i for i = 1, ..., B.
The statistic β̂Imin ,0 is the variable selection estimator padded with zeroes.
∗
See Section 4.2. Let T and S ∗T be the sample mean and sample covariance
matrix of the bootstrap sample T1∗ , ..., TB∗ . See Equation (4.28). See Theorem
2.25 for why dn Fg,dn ,1−δ → χ2g,1−δ as dn → ∞. Here P (X ≤ χ2g,1−δ ) = 1 − δ
if X ∼ χ2g , and P (X ≤ Fg,dn ,1−δ ) = 1 − δ if X ∼ Fg,dn . Let kB = dB(1 − δ)e.
{w : Dw2 (Tn , Σ̂ A /n) ≤ D2(kB ,T ) } (4.31)
where the cutoff D2(kB ,T ) is the 100kB th sample quantile of the
Di2 = (Ti∗ − Tn )T [Σ̂ A /n]−1(Ti∗ − Tn ) = n(Ti∗ − Tn )T [Σ̂ A ]−1 (Ti∗ − Tn ).
Confidence region (4.29) needs √n(Tn − θ) →D Ng (0, Σ A ) and nS ∗T →P Σ A > 0 as n, B → ∞. See Machado and Parente (2005) for regularity con-
ditions for this assumption. Bickel and Ren (2001) have interesting sufficient
conditions for (4.30) to be a confidence region when Σ̂ A is a consistent esti-
mator of positive definite Σ A . Let the vector of parameters θ = T (F ), the
statistic Tn = T (Fn ), and the bootstrapped statistic T ∗ = T (Fn∗ ) where F
is the cdf of iid x1 , ..., xn , Fn is the empirical cdf, and Fn∗ is the empiri-
cal cdf of x∗1 , ..., x∗n , a sample from Fn using the nonparametric bootstrap.
If √n(Fn − F ) →D z F , a Gaussian random process, and if T is sufficiently smooth (has a Hadamard derivative Ṫ (F )), then √n(Tn − θ) →D u and √n(Ti∗ − Tn ) →D u with u = Ṫ (F )z F . Note that Fn is a perfectly good cdf “F ” and Fn∗ is a perfectly good empirical cdf from Fn = “F .” Thus if n is fixed, and a sample of size m is drawn with replacement from the empirical distribution, then √m(T (Fm∗ ) − Tn ) →D Ṫ (Fn )z Fn . Now let n → ∞ with m = n. Then bootstrap theory gives √n(Ti∗ − Tn ) →D limn→∞ Ṫ (Fn )z Fn = Ṫ (F )z F ∼ u.
The following three confidence regions will be used for inference after vari-
able selection. The Olive (2017ab, 2018) prediction region method applies
prediction region (4.24) to the bootstrap sample. Olive (2017ab, 2018) also
gave the modified Bickel and Ren confidence region that uses Σ̂ A = nS ∗T .
The hybrid confidence region is due to Pelawa Watagoda and Olive (2019a).
Let qB = min(1 − δ + 0.05, 1 − δ + g/B) for δ > 0.1 and
Hyperellipsoids (4.32) and (4.34) have the same volume since they are the
same region shifted to have a different center. The ratio of the volumes of
regions (4.32) and (4.33) is
|S ∗T |1/2 [D(UB ) ]g / ( |S ∗T |1/2 [D(UB ,T ) ]g ) = [D(UB ) /D(UB ,T ) ]g . (4.36)
The volume of confidence region (4.33) tends to be greater than that of (4.32)
∗
since the Ti∗ are closer to T than Tn on average.
If g = 1, then a hyperellipsoid is an interval, and confidence intervals are
special cases of confidence regions. Suppose the parameter of interest is θ, and
there is a bootstrap sample T1∗ , ..., TB∗ where the statistic Tn is an estimator
of θ based on a sample of size n. The percentile method uses an interval that
∗ ∗
contains UB ≈ kB = dB(1 −δ)e of the Ti∗ . Let ai = |Ti∗ −T |. Let T and ST2∗
be the sample mean and variance of the Ti∗ . Then the squared Mahalanobis
∗ ∗ ∗
distance Dθ2 = (θ−T )2 /ST∗2 ≤ D(U 2
B)
is equivalent to θ ∈ [T −ST∗ D(UB ) , T +
∗ ∗ ∗
ST∗ D(UB ) ] = [T − a(UB ) , T + a(UB ) ], which is an interval centered at T just
long enough to cover UB of the Ti∗ . Hence the prediction region method is
a special case of the percentile method if g = 1. See Definition 4.13. Efron
(2014) used a similar large sample 100(1 − δ)% confidence interval assuming
that T ∗ is asymptotically normal. The CI corresponding to (4.33) is defined
similarly, and [Tn − a(UB ) , Tn + a(UB ) ] is the CI for (4.34). Note that the
three CIs corresponding to (4.32)–(4.34) can be computed without finding
ST∗ or D(UB ) even if ST∗ = 0. The Frey (2013) shorth(c) CI (4.27) computed
from the Ti∗ can be much shorter than the Efron (2014) or prediction region
method confidence intervals. See Remark 4.16 for some theory for bootstrap
CIs.
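A sketch of the g = 1 case in R is below: an interval centered at the bootstrap sample mean just long enough to cover UB of the Ti∗ , together with the hybrid interval centered at Tn . The qB formula is the one given above for δ > 0.1 and is used here as a simplifying assumption; the function name is illustrative.

prmethodCI <- function(Tstar, Tn, del = 0.05) {
  B <- length(Tstar)
  qB <- min(1 - del + 0.05, 1 - del + 1 / B)   # g = 1; assumed form of qB
  UB <- ceiling(B * qB)
  Tbar <- mean(Tstar)
  a <- sort(abs(Tstar - Tbar))[UB]             # a_(UB)
  c(lower = Tbar - a, upper = Tbar + a,        # CI corresponding to (4.32)
    hyb.lower = Tn - a, hyb.upper = Tn + a)    # hybrid CI corresponding to (4.34)
}
# Example use with a bootstrap sample of a slope estimate, e.g. from the
# residual bootstrap sketch above: prmethodCI(Tstar[, 2], Tn = coef(fit)[2])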
Remark 4.14. From Example 4.6, Cov(β̂ ∗ ) = [(n − p)/n] M SE(X T X)−1 = [(n − p)/n] Ĉov(β̂) where Ĉov(β̂) = M SE(X T X)−1 starts to give good estimates of Cov(β̂) = Σ T for many error distributions if n ≥ 10p and T = β̂. For the residual bootstrap with large B, note that S ∗T ≈ 0.95 Ĉov(β̂) for n = 20p and S ∗T ≈ 0.99 Ĉov(β̂) for n = 100p. Hence we may need n >> p before the S ∗T is a good estimator of Cov(T ) = Σ T . The distribution of √n(Tn − θ) is approximated by the distribution of √n(Ti∗ − Tn ) or by the distribution of √n(Ti∗ − T ∗ ), but n may need to be large before the approximation is good.
Suppose the bootstrap sample mean T ∗ estimates θ, and the bootstrap sample covariance matrix S ∗T estimates cn Ĉov(Tn ) ≈ cn Σ T where cn increases to 1 as n → ∞. Then S ∗T is not a good estimator of Ĉov(Tn ) until cn ≈ 1 (n ≥ 100p for OLS β̂), but the squared Mahalanobis distance Dw2∗ (T ∗ , S ∗T ) ≈ Dw2 (θ, Σ T )/cn and D2∗(UB ) ≈ D21−δ /cn . Hence the prediction region method has a cutoff D2∗(UB ) that estimates the cutoff D21−δ /cn . Thus the prediction region method may give good results for much smaller n than a bootstrap method that uses a χ2g,1−δ cutoff when a cutoff χ2g,1−δ /cn should be used for moderate n.
Hence the prediction region method gives a large sample confidence region for θ provided that the sample percentile D̂21−δ of the D2Ti∗ (T ∗ , S ∗T ) = √n(Ti∗ − T ∗ )T (nS ∗T )−1 √n(Ti∗ − T ∗ ) is a consistent estimator of the percentile D2n,1−δ of the random variable D2θ (T ∗ , S ∗T ) = √n(θ − T ∗ )T (nS ∗T )−1 √n(θ − T ∗ ) in that D̂21−δ − D2n,1−δ →P 0. Since iii) and iv) hold, the sample percentile will be consistent under much weaker conditions than v) if Σ u is nonsingular. Olive (2017b: 5.3.3, 2018) proved that the prediction region method gives a large sample confidence region under the much stronger conditions of v) and u ∼ Ng (0, Σ u ), but the above Pelawa Watagoda and Olive (2019a) proof is simpler.
Remark 4.16. Note that if √n(Tn − θ) →D U and √n(Ti∗ − Tn ) →D U where
U has a unimodal probability density function symmetric about zero, then
the confidence intervals from the three confidence regions (4.32)–(4.34), the
shorth confidence interval (4.27), and the “usual” percentile method confi-
dence interval (4.26) are asymptotically equivalent (use the central proportion
of the bootstrap sample, asymptotically).
Assume nS ∗T →P Σ A as n, B → ∞ where Σ A and S ∗T are nonsingular g × g matrices, and Tn is an estimator of θ such that
√n (Tn − θ) →D u (4.37)
as n → ∞. Then
√n Σ A−1/2 (Tn − θ) →D Σ A−1/2 u = z,
n (Tn − θ)T Σ̂ A−1 (Tn − θ) →D z T z = D2
as n → ∞ where Σ̂ A is a consistent estimator of Σ A , and
(Tn − θ)T [S ∗T ]−1 (Tn − θ) →D D2 (4.38)
The following Pelawa Watagoda and Olive (2019a) theorem is very useful.
Let D2(UB ) be the cutoff for the nonparametric prediction region (4.24) computed from the Di2 (T , S T ) for i = 1, ..., B. Hence n is replaced by B. Since Tn depends on the sample size n, we need (nS T )−1 to be fairly well behaved (“not too ill conditioned”) for each n ≥ 20g, say. This condition is weaker than (nS T )−1 →P Σ −1A . Note that Ti = Tin .
Theorem 4.7: Geometric Argument. Suppose √n(Tn − θ) →D u with E(u) = 0 and Cov(u) = Σ u . Assume T1 , ..., TB are iid with nonsingular covariance matrix Σ Tn . Then the large sample 100(1 − δ)% prediction region Rp = {w : Dw2 (T , S T ) ≤ D2(UB ) } centered at T contains a future value of the statistic Tf with probability 1 − δB → 1 − δ as B → ∞. Hence the region Rc = {w : Dw2 (Tn , S T ) ≤ D2(UB ) } is a large sample 100(1 − δ)% confidence region for θ where Tn is a randomly selected Ti .
Proof. The region Rc centered at a randomly selected Tn contains T with probability 1 − δB which is eventually bounded below by 1 − δ as B → ∞. Since the √n(Ti − θ) are iid,
(√n(T1 − θ)T , ..., √n(TB − θ)T )T →D (v T1 , ..., v TB )T
where the v i are iid with the same distribution as u. (Use Theorems 1.30 and 1.31, and see Example 1.16.) For fixed B, the average of these random vectors is
√n(T − θ) →D (1/B) ΣBi=1 v i ∼ ANg (0, Σ u /B)
Examining the iid data cloud T1 , ..., TB and the bootstrap sample data cloud T1∗ , ..., TB∗ is often useful for understanding the bootstrap. If √n(Tn − θ) and √n(Ti∗ − Tn ) both converge in distribution to u, then the bootstrap
sample data cloud of T1∗ , ..., TB∗ is like the data cloud of iid T1 , ..., TB shifted
to be centered at Tn . The nonparametric confidence region (4.32) applies the
prediction region to the bootstrap. Then the hybrid region (4.34) centers that
region at Tn . Hence (4.34) is a confidence region by the geometric argument,
and (4.32) is a confidence region if √n(T ∗ − Tn ) →P 0. Since the Ti∗ are closer to T ∗ than Tn on average, D2(UB ,T ) tends to be greater than D2(UB ) . Hence
the coverage and volume of (4.33) tend to be at least as large as the coverage
and volume of (4.34).
Remark 4.18. Remark 4.14 suggests that even if the statistic Tn is asymp-
totically normal so the Mahalanobis distances are asymptotically χ2g , the prediction region method can give better results for moderate n by using the cutoff D2(UB ) instead of the cutoff χ2g,1−δ . Theorem 4.7 says that the hyperellipsoidal prediction and confidence regions have exactly the same volume. We compensate for the prediction region undercoverage when n is moderate by using D2(Un ) . If n is large, by using D2(UB ) , the prediction region method
confidence region compensates for undercoverage when B is moderate, say
B ≥ Jg where J = 20 or 50. See Remark 4.15. This result can be useful
if a simulation with B = 1000 or B = 10000 is much slower than a simu-
lation with B = Jg. The price to pay is that the prediction region method
confidence region is inflated to have better coverage, so the power of the
hypothesis test is decreased if moderate B is used instead of larger B.
This subsection illustrates a case where the shorth(c) bootstrap CI fails, but
the lower shorth CI can be useful. See Definition 4.14.
The multiple linear regression (MLR) model is Yi = xTi β + ei for i = 1, ..., n. See Definition 1.17 for the coefficient of multiple determination
R2 = [corr(Yi , Ŷi )]2 = SSR/SSTO = 1 − SSE/SSTO.
The ANOVA F statistic for testing H0 : β2 = · · · = βp = 0 is
F0 = [R2 /(1 − R2 )] [(n − p)/(p − 1)]
if the MLR model has a constant. If H0 is false, then F0 has an asymptotic
scaled noncentral χ2 distribution. These results suggest that the large sample distribution of √n(R2 − τ 2 ) may not be N (0, σ 2 ) if H0 is false so τ 2 > 0. If τ 2 = 0, we may have √n(R2 − 0) →D N (0, 0), the point mass at 0. Hence the
shorth CI may not be a large sample CI for τ 2 . The lower shorth CI should
be useful for testing H0 : τ 2 = 0 versus HA : τ 2 > a where 0 < a ≤ 1 since
the coverage is 1 and the length of the CI converges to 0. So reject H0 if a is
not in the CI.
The simulation simulated iid data w with u = Aw and Aij = ψ for i ≠ j and Aii = 1 where 0 ≤ ψ < 1 and u = (x2 , ..., xp )T . Hence Cor(xi , xj ) = ρ = [2ψ + (p − 3)ψ2 ]/[1 + (p − 2)ψ2 ] for i ≠ j. If ψ = 1/√(kp), then ρ → 1/(k + 1) as p → ∞ where k > 0. We used w ∼ Np−1 (0, I p−1 ). If ψ is high or if p is large with ψ ≥ 0.5, then the data are clustered tightly about the line with direction 1 = (1, ..., 1)T , and there is a dominant principal component with eigenvector 1 and eigenvalue λ1 . We used ψ = 0, 1/√p, and 0.9. Then ρ = 0, ρ → 0.5, or ρ → 1 as p → ∞.
We also used V (x2 ) = · · · = V (xp ) = σx2 . If p > 2, then Cov(xi , xj ) = ρσx2
for i ≠ j and Cov(xi , xj ) = V (xi ) = σx2 for i = j. Then V (Y ) = σY2 = σL2 + σe2 where
σL2 = V (L) = V (Σpi=2 βi xi ) = Cov(Σpi=2 βi xi , Σpj=2 βj xj ) = Σpi=2 Σpj=2 βi βj Cov(xi , xj )
= Σpi=2 βi2 σx2 + 2ρσx2 Σpi=2 Σpj=i+1 βi βj .
The zero mean errors ei were from 5 distributions: i) N(0,1), ii) t3 , iii)
EXP (1) − 1, iv) uniform(−1, 1), and v) (1 − ε)N (0, 1) + εN (0, (1 + s)2 ) with ε = 0.1 and s = 9 in the simulation. Then Y = 1 + bx2 + bx3 + · · · + bxp + e
with b = 0 or b = 1.
Remark 4.19. Suppose the simulation uses K runs and Wi = 1 if µ is
in the ith CI, and Wi = 0 otherwise, for i = 1, ..., K. Then the Wi are iid
binomial(1,1 − δn) where ρn = 1 − δn is the true coverage of the CI when the
sample size is n. Let ρ̂n = W . Since ΣKi=1 Wi ∼ binomial(K, ρn ), the standard error SE(W ) = √[ρn (1 − ρn )/K]. For K = 5000 and ρn near 0.9, we have
3SE(W ) ≈ 0.01. Hence an observed coverage of ρ̂n within 0.01 of the nominal
coverage 1 − δ suggests that there is no reason to doubt that the nominal
CI coverage is different from the observed coverage. So for a large sample
95% CI, we want the observed coverage to be between 0.94 and 0.96. Also
a difference of 0.01 is not large. Coverage slightly higher than the nominal
coverage is better than coverage slightly lower than the nominal coverage.
[1] 1
$lavelen
[1] 0.13688
$ucicov
[1] 0
$uavelen
[1] 0.9896608
This section considers bootstrapping the MLR variable selection model. Rath-
nayake and Olive (2020) shows how to bootstrap variable selection for many
other regression models. This section will explain why the bootstrap con-
fidence regions (4.32), (4.33), and (4.34) give useful results. Much of the
theory in Section 4.5.3 does not apply to the variable selection estimator
Tn = Aβ̂ Imin ,0 with θ = Aβ, because Tn is not smooth since Tn is equal to
the estimator Tjn with probability πjn for j = 1, ..., J. Here A is a known
full rank g × p matrix with 1 ≤ g ≤ p.
Obtaining the bootstrap samples for β̂V S and β̂ M IX is simple. Generate
Y ∗ and X ∗ that would be used to produce β̂ ∗ if the full model estimator β̂ was being bootstrapped. Instead of computing β̂ ∗ , compute the variable selection estimator β̂ ∗V S,1 = β̂ ∗C Ik1 ,0 . Then generate another Y ∗ and X ∗ and compute β̂ ∗M IX,1 = β̂ ∗Ik1 ,0 (using the same subset Ik1 ). This process is repeated
B times to get the two bootstrap samples for i = 1, ..., B. Let the selection
probabilities for the bootstrap variable selection estimator be ρkn . Then this
bootstrap procedure bootstraps both β̂V S and β̂ M IX with πkn = ρkn .
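A sketch of this bootstrap in R is below. Backward elimination with AIC via step() stands in for the variable selection method, and the residual bootstrap (keeping X ∗ = X) generates Y ∗ ; the zero padding fills in the coefficients of deleted predictors. The data and object names are illustrative assumptions.

set.seed(9)
n <- 200; p <- 5
x <- matrix(rnorm(n * (p - 1)), ncol = p - 1)
colnames(x) <- paste0("x", 2:p)
y <- 1 + x[, 1] + rnorm(n)
dat  <- data.frame(y, x)
full <- lm(y ~ ., data = dat)
yhat <- fitted(full); r <- resid(full)
B <- 200
betaVS <- matrix(0, B, p, dimnames = list(NULL, names(coef(full))))
for (i in 1:B) {
  dat$y <- yhat + sample(r, n, replace = TRUE)   # Y* = X beta-hat + r^W
  sel <- step(lm(y ~ ., data = dat), trace = 0)  # variable selection on (Y*, X)
  betaVS[i, names(coef(sel))] <- coef(sel)       # zero padding for deleted terms
}
colMeans(betaVS == 0)   # proportion of bootstrap samples deleting each term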
The key idea is to show that the bootstrap data cloud is slightly more
variable than the iid data cloud, so confidence region (4.33) applied to the
bootstrap data cloud has coverage bounded below by (1 − δ) for large enough
n and B.
For the bootstrap, suppose that Ti∗ is equal to Tij∗ with probability ρjn for j = 1, ..., J where Σj ρjn = 1, and ρjn → πj as n → ∞. Let Bjn count the number of times Ti∗ = Tij∗ in the bootstrap sample. Then the bootstrap sample T1∗ , ..., TB∗ can be written as
T1,1∗ , ..., TB∗1n ,1 , ..., T1,J∗ , ..., TB∗Jn ,J
where the Bjn follow a multinomial distribution and Bjn /B →P ρjn as B → ∞. Denote T1,j∗ , ..., TB∗jn ,j as the jth bootstrap component of the bootstrap sample with sample mean T ∗j and sample covariance matrix S ∗T ,j . Then
T ∗ = (1/B) ΣBi=1 Ti∗ = Σj (Bjn /B) (1/Bjn ) ΣBjn i=1 Tij∗ = Σj ρ̂jn T ∗j .
Similarly, we can define the jth component of the iid sample T1 , ..., TB to
have sample mean T j and sample covariance matrix S T ,j .
Let Tn = β̂ M IX and Tij = β̂ Ij ,0 . If S ⊆ Ij , assume √n(β̂ Ij − β Ij ) →D Naj (0, V j ) and √n(β̂ ∗Ij − β̂ Ij ) →D Naj (0, V j ). Then by Equation (4.3),
√n(β̂ Ij ,0 − β) →D Np (0, V j,0 ) and √n(β̂ ∗Ij ,0 − β̂ Ij ,0 ) →D Np (0, V j,0 ). (4.39)
This result means that the component clouds have the same variability
asymptotically. The iid data component clouds are all centered at β. If the
bootstrap data component clouds were all centered at the same value β̃, then
the bootstrap cloud would be like an iid data cloud shifted to be centered at
β̃, and (4.33) would be a confidence region for θ = β. Instead, the bootstrap
data component clouds are shifted slightly from a common center, and are
each centered at a β̂ Ij ,0 . Geometrically, the shifting of the bootstrap compo-
nent data clouds makes the bootstrap data cloud similar but more variable
than the iid data cloud asymptotically (we want n ≥ 20p), and centering
the bootstrap data cloud at Tn results in the confidence region (4.33) hav-
ing slightly higher asymptotic coverage than applying (4.33) to the iid data
cloud. Also, (4.33) tends to have higher coverage than (4.34) since the cutoff
for (4.33) tends to be larger than the cutoff for (4.34). Region (4.32) has
the same volume as region (4.34), but tends to have higher coverage since
empirically, the bagging estimator T ∗ tends to estimate θ at least as well as
Tn for a mixture distribution. A similar argument holds if Tn = Aβ̂ M IX ,
Tij = Aβ̂ Ij ,0 , and θ = Aβ.
To see that T ∗ has more variability than Tn , asymptotically, look at Figure
4.3. Imagine that n is huge and the J = 6 ellipsoids are 99.9% covering
regions for the component data clouds corresponding to Tjn for j = 1, ..., J.
Separating the clouds slightly, without rotation, increases the variability of
the overall data cloud. The bootstrap distribution of T ∗ corresponds to the
separated clouds. The shape of the overall data cloud does not change much,
but the volume does increase.
In the simulations for H0 : Aβ = BβS = θ 0 with n ≥ 20p, the coverage
tended to get close to 1 − δ for B ≥ max(200, 50p) so that S ∗T is a good esti-
mator of Cov(T ∗ ). In the simulations where S is not the full model, inference
with backward elimination with Imin using AIC was often more precise than
inference with the full model if n ≥ 20p and B ≥ 50p.
The matrix S ∗T can be singular due to one or more columns of zeros
in the bootstrap sample for β1 , ..., βp. The variables corresponding to these
columns are likely not needed in the model given that the other predictors
are in the model. A simple remedy is to add d bootstrap samples of the
full model estimator β̂ ∗ = β̂ ∗F U LL to the bootstrap sample. For example,
take d = dcBe with c = 0.01. A confidence interval [Ln , Un ] can be com-
puted without S ∗T for (4.32), (4.33), and (4.34). Using the confidence interval
[max(Ln , T ∗(1) ), min(Un , T ∗(B) )] can give a shorter covering region.
Undercoverage can occur if the bootstrap sample data cloud is less variable
than the iid data cloud, e.g., if (n − p)/n is not close to one. Coverage can be
higher than the nominal coverage for two reasons: i) the bootstrap data cloud
is more variable than the iid data cloud of T1 , ..., TB, and ii) zero padding.
The bootstrap component clouds for β̂ ∗V S are again separated compared to the iid clouds for β̂ V S , which are centered about β. Heuristically, most of the selection bias is due to predictors in E, not to the predictors in S. Hence β̂ ∗S,V S is roughly similar to β̂ ∗S,M IX . Typically the distributions of β̂ ∗E,V S and β̂ ∗E,M IX are not similar, but use the same zero padding. In simulations, confidence regions for β̂ V S tended to have less undercoverage than confidence regions for β̂ ∗M IX .
Y ∗ = X β̂ OLS + e∗
The residual bootstrap is often useful for additive error regression models of the form Yi = m(xi ) + ei = m̂(xi ) + ri = Ŷi + ri for i = 1, ..., n where the ith residual ri = Yi − Ŷi . Let Y = (Y1 , ..., Yn )T , r = (r1 , ..., rn )T , and let X be an n × p matrix with ith row xTi . Then the fitted values Ŷi = m̂(xi ), and the residuals are obtained by regressing Y on X. Here the errors ei are iid, and it would be useful to be able to generate B iid samples e1j , ..., enj from the distribution of ei where j = 1, ..., B. If the m(xi ) were known, then we could form a vector Y j where the ith element Yij = m(xi ) + eij for i = 1, ..., n. Then regress Y j on X. Instead, draw samples r1j∗ , ..., rnj∗ with replacement from the residuals, then form a vector Y ∗j where the ith element Yij∗ = m̂(xi ) + rij∗ for i = 1, ..., n. Then regress Y ∗j on X. If the residuals do not sum to 0, it is often useful to replace ri by εi = ri − r, and rij∗ by ε∗ij .
Y ∗ = X β̂ OLS + rW
follows a standard linear model where the elements riW of rW are iid from
the empirical distribution of the OLS full model residuals ri . Hence
E(r_i^W) = (1/n) ∑_{i=1}^n r_i = 0,    V(r_i^W) = σ_n² = (1/n) ∑_{i=1}^n r_i² = [(n − p)/n] MSE.
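A minimal R sketch of this residual bootstrap for an OLS MLR model follows; the toy data and the crude percentile intervals at the end are illustrative assumptions, while the linmodpack functions regboot and rowboot are the practical tools used in this chapter.
#Residual bootstrap sketch for OLS MLR: regress Y*_j = Yhat + r*_j on X,
#where r*_j is drawn with replacement from the centered residuals.
set.seed(1)
n <- 100; p <- 4
x <- matrix(rnorm(n*(p-1)), n, p-1)
y <- as.vector(1 + x %*% c(1,1,0) + rnorm(n))
fit <- lsfit(x,y)
res <- fit$res - mean(fit$res) #center the residuals
yhat <- y - fit$res
B <- 200
betas <- matrix(0, B, p)
for(j in 1:B){
  ystar <- yhat + sample(res, n, replace=TRUE)
  betas[j,] <- lsfit(x, ystar)$coef
}
apply(betas, 2, quantile, c(0.025,0.975)) #crude percentile intervals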
Remark 4.20. The Cauchy-Schwarz inequality says |a^T b| ≤ ‖a‖ ‖b‖.
Suppose √n(β̂ − β) = O_P(1) is bounded in probability. This will occur if √n(β̂ − β) →D N_p(0, Σ), e.g. if β̂ is the OLS estimator. Then r_i − e_i = −x_i^T(β̂ − β), so |r_i − e_i| ≤ ‖x_i‖ ‖β̂ − β‖. Hence

√n max_{i=1,...,n} |r_i − e_i| ≤ (max_{i=1,...,n} ‖x_i‖) ‖√n(β̂ − β)‖ = O_P(1)
since max ‖x_i‖ = O_P(1) or there is extrapolation. Hence OLS residuals behave well if the zero mean error distribution of the iid e_i has a finite variance σ².
Remark 4.21. Note that both the residual bootstrap and parametric
bootstrap for OLS are robust to the unknown error distribution of the iid ei .
For the residual bootstrap with S ⊆ I where I is not the full model, it may not be true that √n(β̂*_I − β̂_I) →D N_{a_I}(0, V_I) as n, B → ∞. For the model Y = Xβ + e, the e_i are iid from a distribution that does not depend on n, and β_E = 0. For Y* = Xβ̂ + r^W, the distribution of the r_i^W depends on n and β̂_E ≠ 0 although √n β̂_E = O_P(1).
Y* = X*β̂_OLS + r^W
Y*= X*_I β̂_{I,OLS} + r^W_I.
Freedman (1981) showed that under regularity conditions for the OLS MLR model, √n(β̂* − β̂) →D N_p(0, σ²W) ∼ N_p(0, V). Hence if S ⊆ I_j,
√n(β̂*_I − β̂_I) →D N_{a_I}(0, V_I)
Undercoverage can occur if the bootstrap sample data cloud is less variable
than the iid data cloud, e.g., if (n − p)/n is not close to one. Coverage can be
higher than the nominal coverage for two reasons: i) the bootstrap data cloud
is more variable than the iid data cloud of T1 , ..., TB, and ii) zero padding.
To see the effect of zero padding, consider H0: Aβ = β_O = 0 where β_O = (β_{i1}, ..., β_{ig})^T and O ⊆ E in (4.1) so that H0 is true. Suppose a nominal 95% confidence region is used and U_B = 0.96. Hence the confidence region (4.32) or (4.33) covers at least 96% of the bootstrap sample. If β̂*_{O,j} = 0 for more than 4% of the β̂*_{O,1}, ..., β̂*_{O,B}, then 0 is in the confidence region and the
bootstrap test fails to reject H0 . If this occurs for each run in the simulation,
then the observed coverage will be 100%.
Now suppose β̂*_{O,j} = 0 for j = 1, ..., B. Then S*_T is singular, but the singleton set {0} is the large sample 100(1 − δ)% confidence region (4.32), (4.33), or (4.34) for β_O and δ ∈ (0, 1), and the pvalue for H0: β_O = 0 is one. (This result holds since {0} contains 100% of the β̂*_{O,j} in the bootstrap sample.) For large sample theory tests, the pvalue estimates the population pvalue. Let I denote the other predictors in the model so β = (β_I^T, β_O^T)^T. For
the Imin model from forward selection, there may be strong evidence that xO
is not needed in the model given xI is in the model if the “100%” confidence
region is {0}, n ≥ 20p, B ≥ 50p, and the error distribution is unimodal and
not highly skewed. (Since the pvalue is one, this technique may be useful
for data snooping: applying OLS theory to submodel I may have negligible
selection bias.)
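A minimal R sketch of this check follows; the matrix betas stands for a B × p matrix of bootstrap estimates such as outvs$betas from Example 4.7, and the function name zeroFreqTest is an illustrative assumption.
#Sketch: if betahat*_O,j = 0 for more than B*delta of the bootstrap
#samples, then 0 is in the nominal 100(1-delta)% confidence region, so
#the bootstrap test fails to reject H0: beta_O = 0.
zeroFreqTest <- function(betas, Ocols, delta = 0.05){
  zeroprop <- mean(apply(betas[, Ocols, drop=FALSE] == 0, 1, all))
  list(zeroprop = zeroprop, failtoreject = (zeroprop > delta))
}
#usage sketch: zeroFreqTest(outvs$betas, Ocols = 2:4)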
Remark 4.23. Note that there are several important variable selection
models, including the model given by Equation (4.1) where x^T β = x_S^T β_S. Another model is x^T β = x_{S_i}^T β_{S_i} for i = 1, ..., K. Then there are K ≥ 2 competing “true” nonnested submodels where β_{S_i} is a_{S_i} × 1. For example, suppose the K = 2 models have predictors x1, x2, x3 for S1 and x1, x2, x4 for S2. Then x3 and x4 are likely to be selected and omitted often by forward selection for the B bootstrap samples. Hence omitting all predictors x_i that have a β̂*_{ij} = 0 for at least one of the bootstrap samples j = 1, ..., B could result in underfitting.
Example 4.7. Cook and Weisberg (1999, pp. 351, 433, 447) give a data
set on 82 mussels sampled off the coast of New Zealand. Let the response
variable be the logarithm log(M ) of the muscle mass, and the predictors are
the length L and height H of the shell in mm, the logarithm log(W ) of the shell
width W, the logarithm log(S) of the shell mass S, and a constant. Inference
for the full model is shown below along with the shorth(c) nominal 95%
confidence intervals for βi computed using the nonparametric and residual
bootstraps. As expected, the residual bootstrap intervals are close to the
classical least squares confidence intervals ≈ β̂i ± 1.96SE(β̂i ).
large sample full model inference
Est. SE t Pr(>|t|) nparboot resboot
int -1.249 0.838 -1.49 0.14 [-2.93,-0.093][-3.045,0.473]
L -0.001 0.002 -0.28 0.78 [-0.005,0.003][-0.005,0.004]
logW 0.130 0.374 0.35 0.73 [-0.457,0.829][-0.703,0.890]
H 0.008 0.005 1.50 0.14 [-0.002,0.018][-0.003,0.016]
logS 0.640 0.169 3.80 0.00 [ 0.244,1.040][ 0.336,1.012]
output and shorth intervals for the min Cp submodel FS
Est. SE 95% shorth CI 95% shorth CI
int -0.9573 0.1519 [-3.294, 0.495] [-2.769, 0.460]
L 0 [-0.005, 0.004] [-0.004, 0.004]
logW 0 [ 0.000, 1.024] [-0.595, 0.869]
H 0.0072 0.0047 [ 0.000, 0.016] [ 0.000, 0.016]
logS 0.6530 0.1160 [ 0.322, 0.901] [ 0.324, 0.913]
for forward selection for all subsets
The minimum Cp model from all subsets variable selection and forward
selection both used a constant, H, and log(S). The shorth(c) nominal 95%
confidence intervals for βi using the residual bootstrap are shown. Note that
the intervals for H are right skewed and contain 0 when closed intervals
are used instead of open intervals. Some least squares output is shown, but
should only be used for inference if the model was selected before looking at
the data.
It was expected that log(S) may be the only predictor needed, along with
a constant, since log(S) and log(M ) are both log(mass) measurements and
likely highly correlated. Hence we want to test H0 : β2 = β3 = β4 = 0 with
the Imin model selected by all subsets variable selection. (Of course this test
would be easy to do with the full model using least squares theory.) Then
H0 : Aβ = (β2 , β3 , β4 )T = 0. Using the prediction region method with the
full model gave an interval [0, 2.930] with D0 = 1.641. Note that √(χ²_{3,0.95}) = 2.795. So fail to reject H0. Using the prediction region method with the Imin variable selection model had [0, D_(U_B)] = [0, 3.293] while D0 = 1.134. So fail to reject H0.
Then we redid the bootstrap with the full model and forward selection. The
full model had [0, D(UB ) ] = [0, 2.908] with D0 = 1.577. So fail to reject H0 .
Using the prediction region method with the Imin forward selection model
had [0, D(UB ) ] = [0, 3.258] while D0 = 1.245. So fail to reject H0 . The ratio of
the volumes of the bootstrap confidence regions for this test was 0.392. (Use
(4.35) with S ∗T and D from forward selection for the numerator, and from
the full model for the denominator.) Hence the forward selection bootstrap
test was more precise than the full model bootstrap test. Some R code used
to produce the above output is shown below.
library(leaps)
y <- log(mussels[,5]); x <- mussels[,1:4]
x[,4] <- log(x[,4]); x[,2] <- log(x[,2])
out <- regboot(x,y,B=1000)
tem <- rowboot(x,y,B=1000)
outvs <- vselboot(x,y,B=1000) #get bootstrap CIs
outfs <- fselboot(x,y,B=1000) #get bootstrap CIs
apply(out$betas,2,shorth3);
apply(tem$betas,2,shorth3);
apply(outvs$betas,2,shorth3) #for all subsets
apply(outfs$betas,2,shorth3) #for forward selection
ls.print(outvs$full)
ls.print(outvs$sub)
ls.print(outfs$sub)
#test if beta_2 = beta_3 = beta_4 = 0
Abeta <- out$betas[,2:4] #full model
#prediction region method with residual bootstrap
out<-predreg(Abeta)
Abeta <- outvs$betas[,2:4]
#prediction region method with Imin all subsets
outvs <- predreg(Abeta)
Abeta <- outfs$betas[,2:4]
#prediction region method with Imin forward sel.
outfs<-predreg(Abeta)
#ratio of volumes for forward selection and full model
(sqrt(det(outfs$cov))*outfs$D0^3)/(sqrt(det(out$cov))*out$D0^3)
Example 4.8. Consider the Gladstone (1905) data set that has 12 vari-
ables on 267 persons after death. The response variable was brain weight.
Head measurements were breadth, circumference, head height, length, and
size as well as cephalic index and brain weight. Age, height, and two categor-
ical variables ageclass (0: under 20, 1: 20-45, 2: over 45) and sex were also
given. The eight predictor variables shown in the output were used.
Output is shown below for the full model and the bootstrapped minimum Cp forward selection estimator. Note that the shorth intervals for length and sex are quite long. These variables are often included in, and often deleted from, the models selected by bootstrap forward selection. Model I_I is the model with the fewest predictors such that Cp(I_I) ≤ Cp(Imin) + 1. For this data set, I_I = Imin. The bootstrap CIs differ due to different random seeds.
large sample full model inference for Ex. 4.8
Estimate SE t Pr(>|t|) 95% shorth CI
Int -3021.255 1701.070 -1.77 0.077 [-6549.8,322.79]
age -1.656 0.314 -5.27 0.000 [ -2.304,-1.050]
breadth -8.717 12.025 -0.72 0.469 [-34.229,14.458]
cephalic 21.876 22.029 0.99 0.322 [-20.911,67.705]
circum 0.852 0.529 1.61 0.109 [ -0.065, 1.879]
headht 7.385 1.225 6.03 0.000 [ 5.138, 9.794]
height -0.407 0.942 -0.43 0.666 [ -2.211, 1.565]
len 13.475 9.422 1.43 0.154 [ -5.519,32.605]
sex 25.130 10.015 2.51 0.013 [ 6.717,44.19]
output and shorth intervals for the min Cp submodel
Estimate SE t Pr(>|t|) 95% shorth CI
Int -1764.516 186.046 -9.48 0.000 [-6151.6,-415.4]
age -1.708 0.285 -5.99 0.000 [ -2.299,-1.068]
breadth 0 [-32.992, 8.148]
cephalic 5.958 2.089 2.85 0.005 [-10.859,62.679]
circum 0.757 0.512 1.48 0.140 [ 0.000, 1.817]
headht 7.424 1.161 6.39 0.000 [ 5.028, 9.732]
height 0 [ -2.859, 0.000]
len 6.716 1.466 4.58 0.000 [ 0.000,30.508]
sex 25.313 9.920 2.55 0.011 [ 0.000,42.144]
output and shorth for I_I model
Estimate Std.Err t-val Pr(>|t|) 95% shorth CI
Int -1764.516 186.046 -9.48 0.000 [-6104.9,-778.2]
age -1.708 0.285 -5.99 0.000 [ -2.259,-1.003]
breadth 0 [-31.012, 6.567]
cephalic 5.958 2.089 2.85 0.005 [ -6.700,61.265]
circum 0.757 0.512 1.48 0.140 [ 0.000, 1.866]
headht 7.424 1.161 6.39 0.000 [ 5.221,10.090]
height 0 [ -2.173, 0.000]
len 6.716 1.466 4.58 0.000 [ 0.000,28.819]
sex 25.313 9.920 2.55 0.011 [ 0.000,42.847]
The R code used to produce the above output is shown below. The last
four commands are useful for examining the variable selection output.
x<-cbrainx[,c(1,3,5,6,7,8,9,10)]
y<-cbrainy
library(leaps)
out <- regboot(x,y,B=1000)
outvs <- fselboot(x,cbrainy) #get bootstrap CIs,
apply(out$betas,2,shorth3)
apply(outvs$betas,2,shorth3)
ls.print(outvs$full)
ls.print(outvs$sub)
outvs <- modIboot(x,cbrainy) #get bootstrap CIs,
apply(outvs$betas,2,shorth3)
ls.print(outvs$sub)
tem<-regsubsets(x,y,method="forward")
tem2<-summary(tem)
tem2$which
tem2$cp
4.6.5 Simulations
less multicollinearity than the full model. For ψ ≥ 0, the Imin coverages were
higher than 0.95 for β3 and β4 and for testing H0 : β E = 0 since zeros often
occurred for β̂j∗ for j = 3, 4. The average CI lengths were shorter for Imin
than for the OLS full model for β3 and β4 . Note that for Imin , the coverage
for testing H0 : β S = 1 was higher than that for the OLS full model.
For error distributions i)-iv) and ψ = 0.9, sometimes the shorth CIs needed
n ≥ 100p for all p CIs to have good coverage. For error distribution v) and
ψ = 0.9, even larger values of n were needed. Confidence intervals based on
(4.32) and (4.33) worked for much smaller n, but tended to be longer than
the shorth CIs.
See Table 4.3 for one of the worst scenarios for the shorth, where shlen,
prlen, and brlen are for the average CI lengths based on the shorth, (4.32), and
(4.33), respectively. In Table 4.3, k = 8 and the two nonzero πj correspond
to the full model β̂ and β̂ S,0 . Hence βi = 1 for i = 1, ..., 9 and β10 = 0.
Hence confidence intervals for β10 had the highest coverage and usually the shortest average length (for i ≠ 1) due to zero padding. Theory in Section 4.2 showed that the CI lengths are proportional to 1/√n. When n = 25000, the shorth CI uses the 95.16th percentile while CI (4.32) uses the 95.00th percentile, allowing the average CI length of (4.32) to be shorter than that of the shorth CI, but the distribution for β̂_i* is likely approximately symmetric for i ≠ 10 since the average lengths of the three confidence intervals were about the same for each i ≠ 10.
When BIC was used, undercoverage was a bit more common and severe,
and undercoverage occasionally occurred with regions (4.32) and (4.33). BIC
also occasionally had 100% coverage since BIC produces more zeroes than
Cp .
Some R code for the simulation is shown below.
#record coverages and "lengths" for
#b1, b2, bp-1, bp, pm0, hyb0, br0, pm1, hyb1, br1
regbootsim3(n=100,p=4,k=1,nruns=5000,type=1,psi=0)
$cicov
[1] 0.9458 0.9500 0.9474 0.9484 0.9400 0.9408 0.9410
0.9368 0.9362 0.9370
$avelen
[1] 0.3955 0.3990 0.3987 0.3982 2.4508 2.4508 2.4521
[8] 2.4496 2.4496 2.4508
$beta
[1] 1 1 0 0
$k
[1] 1
library(leaps)
vsbootsim4(n=100,p=4,k=1,nruns=5000,type=1,psi=0)
$cicov
[1] 0.9480 0.9496 0.9972 0.9958 0.9910 0.9786 0.9914
0.9384 0.9394 0.9402
$avelen
[1] 0.3954 0.3987 0.3233 0.3231 2.6987 2.6987 3.0020
[8] 2.4497 2.4497 2.4570
$beta
[1] 1 1 0 0
$k
[1] 1
Data splitting is used for inference after model selection. Use a training set
to select a full model, and a validation set for inference with the selected full
model. Here p >> n is possible. See Chapter 6, Hurvich and Tsai (1990, p.
216) and Rinaldo et al. (2019). Typically when training and validation sets
are used, the training set is bigger than the validation set or half sets are
used, often causing large efficiency loss.
Let J be a positive integer and let ⌊x⌋ be the integer part of x, e.g., ⌊7.7⌋ = 7. Initially divide the data into two sets H1 with n1 = ⌊n/(2J)⌋ cases and V1 with n − n1 cases. If the fitted model from H1 is not good enough, randomly select n1 cases from V1 to add to H1 to form H2. Let V2 have the remaining cases from V1. Continue in this manner, possibly forming sets (H1, V1), (H2, V2), ..., (HJ, VJ) where Hi has ni = i n1 cases. Stop when
Hd gives a reasonable model Id with ad predictors if d < J. Use d = J,
otherwise. Use the model Id as the full model for inference with the data in
Vd .
This procedure is simple for a fixed data set, but it would be good to
automate the procedure. Forward selection with the Chen and Chen (2008)
EBIC criterion and lasso are useful for finding a reasonable fitted model.
BIC and the Hurvich and Tsai (1989) AICC criterion can be useful if n ≥
max(2p, 10ad). For example, if n = 500000 and p = 90, using n1 = 900 would
result in a much smaller loss of efficiency than n1 = 250000.
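A minimal R sketch of one pass of this scheme follows; the function name splitData and the default adequacy check (which never stops early, so d = J) are illustrative placeholders for a check such as forward selection with EBIC fit to H_d.
#Sketch of the sequential data splitting scheme: grow the model building
#set H_d in steps of n1 = floor(n/(2J)) randomly ordered cases until the
#fitted model on H_d is judged adequate, then keep V_d for inference.
splitData <- function(n, J = 5, adequate = function(H) FALSE){
  n1 <- floor(n/(2*J))
  idx <- sample(n) #random order of the cases
  for(d in 1:J){
    H <- idx[1:(d*n1)] #model building set H_d
    if(d == J || adequate(H)) break
  }
  list(H = H, V = setdiff(idx, H), d = d)
}
#usage sketch: sp <- splitData(length(y), J = 5)
#then select a model with the cases in sp$H and do inference with sp$V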
4.8 Summary
quantile of the D_i² = (T_i* − Tn)^T [S*_T]^{-1} (T_i* − Tn). c) The hybrid large sample 100(1 − δ)% confidence region: {w : (w − Tn)^T [S*_T]^{-1} (w − Tn) ≤ D²_(U_B)} = {w : D²_w(Tn, S*_T) ≤ D²_(U_B)}.
If g = 1, confidence intervals can be computed without S ∗T or D2 for a),
b), and c).
For some data sets, S ∗T may be singular due to one or more columns of
zeroes in the bootstrap sample for β1 , ..., βp. The variables corresponding to
these columns are likely not needed in the model given that the other predic-
tors are in the model if n and B are large enough. Let β_O = (β_{i1}, ..., β_{ig})^T, and consider testing H0: Aβ_O = 0. If Aβ̂*_{O,i} = 0 for greater than Bδ of the bootstrap samples i = 1, ..., B, then fail to reject H0. (If S*_T is nonsingular, the 100(1 − δ)% prediction region method confidence region contains 0.)
7) Theorem 4.7: Geometric Argument. Suppose √n(Tn − θ) →D u with E(u) = 0 and Cov(u) = Σ_u. Assume T1, ..., TB are iid with nonsingular covariance matrix Σ_{Tn}. Then the large sample 100(1 − δ)% prediction region R_p = {w : D²_w(T̄, S_T) ≤ D²_(U_B)} centered at T̄ contains a future value of the statistic T_f with probability 1 − δ_B → 1 − δ as B → ∞. Hence the region R_c = {w : D²_w(Tn, S_T) ≤ D²_(U_B)} is a large sample 100(1 − δ)% confidence region for θ.
8) Applying the nonparametric prediction region (4.24) to the iid data T1, ..., TB results in the 100(1 − δ)% confidence region {w : (w − Tn)^T S_T^{-1} (w − Tn) ≤ D²_(U_B)(Tn, S_T)} where D²_(U_B)(Tn, S_T) is computed from the (T_i − Tn)^T S_T^{-1} (T_i − Tn), provided the S_T = S_{Tn} are “not too ill conditioned.”
For OLS variable selection, assume there are two or more component clouds.
The bootstrap component data clouds have the same asymptotic covariance
matrix as the iid component data clouds, which are centered at θ. The jth
bootstrap component data cloud is centered at E(T*_ij) and often E(T*_jn) = T_jn. Confidence region (4.32) is the prediction region (4.24) applied to the
bootstrap sample, and (4.32) is slightly larger in volume than (4.24) applied
to the iid sample, asymptotically. The hybrid region (4.34) shifts (4.32) to be
centered at Tn . Shifting the component clouds slightly and computing (4.24)
does not change the axes of the prediction region (4.24) much compared
to not shifting the component clouds. Hence by the geometric argument, we
expect (4.34) to have coverage at least as high as the nominal, asymptotically,
provided the S*_T are “not too ill conditioned.” The Bickel and Ren confidence region (4.33) tends to have higher coverage and volume than (4.34). Since T̄* tends to be closer to θ than Tn, (4.32) tends to have good coverage. (A short R sketch of the prediction region method confidence region is given after this summary.)
9) Suppose m independent large sample 100(1 − δ)% prediction regions
are made where x1 , ..., xn , xf are iid from the same distribution for each of
the m runs. Let Y count the number of times xf is in the prediction region.
Then Y ∼ binomial (m, 1 − δn ) where 1 − δn is the true coverage. Simulation
can be used to see if the true or actual coverage 1 − δn is close to the nominal
coverage 1 − δ. A prediction region with 1 − δn < 1 − δ is liberal and a region
with 1 − δn > 1 − δ is conservative. It is better to be conservative by 3% than liberal by 3%.
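A minimal R sketch of the prediction region method confidence region of 7) and 8), applied to a bootstrap sample, follows; the function name predRegionTest is an illustrative assumption, the sketch omits any small sample correction to the cutoff, and the linmodpack function predreg is the practical tool.
#Sketch: for a B x g matrix Tstar of bootstrap values, compute the
#Mahalanobis distances from the sample mean, take the cutoff D_(U_B),
#and check whether theta0 (e.g. 0) is inside the confidence region.
predRegionTest <- function(Tstar, theta0, delta = 0.05){
  B <- nrow(Tstar)
  center <- colMeans(Tstar)
  S <- var(Tstar)
  D2 <- mahalanobis(Tstar, center, S)
  UB <- min(B, ceiling(B*(1-delta)))
  cut2 <- sort(D2)[UB] #squared cutoff D^2_(U_B)
  D02 <- mahalanobis(matrix(theta0, nrow=1), center, S)
  list(D0 = sqrt(D02), cutoff = sqrt(cut2), reject = (D02 > cut2))
}
#usage sketch for Example 4.7: predRegionTest(out$betas[,2:4], rep(0,3))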
4.9 Complements
This chapter followed Olive (2017b, ch. 5) and Pelawa Watagoda and Olive
(2019ab) closely. Also see Olive (2013a, 2018), Pelawa Watagoda (2017), and
Rathnayake and Olive (2019). For MLR, Olive (2017a: p. 123, 2017b: p. 176)
showed that β̂Imin ,0 is a consistent estimator. Olive (2014: p. 283, 2017ab,
2018) recommended using the shorth(c) estimator for the percentile method.
Olive (2017a: p. 128, 2017b: p. 181, 2018) showed that the prediction region
method can simulate well for the p × 1 vector β̂ Imin ,0 . Hastie et al. (2009, p.
57) noted that variable selection is a shrinkage estimator: the coefficients are
shrunk to 0 for the omitted variables.
Good references for the bootstrap include Efron (1979, 1982), Efron and
Hastie (2016, ch. 10–11), and Efron and Tibshirani (1993). Also see Chen
(2016) and Hesterberg (2014). One of the sufficient conditions for the boot-
strap confidence region is that T has a well behaved Hadamard derivative.
Fréchet differentiability implies Hadamard differentiability, and many statis-
tics are shown to be Hadamard differentiable in Bickel and Ren (2001), Clarke
(1986, 2000), Fernholtz (1983), Gill (1989), Ren (1991), and Ren and Sen
(1995). Bickel and Ren (2001) showed that their method can work when
Hadamard differentiability fails.
There is a massive literature on variable selection and a fairly large litera-
ture for inference after variable selection. See, for example, Leeb and Pötscher
(2005, 2006, 2008), Leeb et al. (2015), Tibshirani et al. (2016), and Tibshi-
rani et al. (2018). Knight and Fu (2000) have some results on the residual
bootstrap that uses residuals from one estimator, such as full model OLS,
but fit another estimator, such as lasso.
Inference techniques for the variable selection model, other than data split-
ting, have not had much success. For multiple linear regression, the methods
are often inferior to data splitting, often assume normality, or are asymptot-
ically equivalent to using the full model, or find a quantity to test that is not
Aβ. See Ewald and Schneider (2018). The Berk et al. (2013) method assumes normality, needs p no more than about 30, and assumes σ² can be estimated independently of the data; Leeb et al. (2015) say the method does not work. The bootstrap confidence region (4.32) is centered at T̄* ≈ ∑_j ρ_jn T_jn, which is
closely related to a model averaging estimator. Wang and Zhou (2013) show
that the Hjort and Claeskens (2003) confidence intervals based on frequentist
model averaging are asymptotically equivalent to those obtained from the
full model. See Buckland et al. (1997) and Schomaker and Heumann (2014)
for standard errors when using the bootstrap or model averaging for linear
model confidence intervals.
Efron (2014) used the confidence interval T̄* ± z_{1−δ} SE(T̄*) assuming T̄* is asymptotically normal and using delta method techniques, which require nonsingular covariance matrices. There is not yet rigorous theory for this method. Section 4.2 proved that T̄* is asymptotically normal under regularity conditions: if √n(Tn − θ) →D N_g(0, Σ_A) and √n(T_i* − Tn) →D N_g(0, Σ_A), then √n(T̄* − θ) →D N_g(0, Σ_A). If g = 1, then the prediction region method large sample 100(1 − δ)% CI for θ has P(θ ∈ [T̄* − a_(U_B), T̄* + a_(U_B)]) → 1 − δ as n → ∞. If the Frey CI also has coverage converging to 1 − δ, then the two methods have the same asymptotic length (scaled by multiplying by √n), since otherwise the shorter interval would have lower asymptotic coverage.
For the mixture distribution with two or more component groups, √n(Tn − θ) →D v by Theorem 4.4 b). If √n(T_i* − c_n) →D u then c_n must be a value such as c_n = T̄*, c_n = ∑_j ρ_jn T_jn, or c_n = ∑_j π_j T_jn. Next we will examine T̄*. If S ⊆ I_j, then √n(β̂_{I_j,0} − β) →D N_p(0, V_{j,0}), and for the parametric and nonparametric bootstrap, √n(β̂*_{I_j,0} − β̂_{I_j,0}) →D N_p(0, V_{j,0}). Let Tn = Aβ̂_{Imin,0} and T_jn = Aβ̂_{I_j,0} = A D_{j0} Y using notation from Section 4.6. Let θ = Aβ. Hence from Section 4.5.3, √n(T̄*_j − T_jn) →P 0. Assume ρ̂_in →P ρ_i as n → ∞. Then √n(T̄* − θ) =

∑_i ρ̂_in √n(T̄*_i − θ) = ∑_j ρ̂_jn √n(T̄*_j − θ) + ∑_k ρ̂_kn √n(T̄*_k − θ)

= d_n + a_n where a_n →P 0 since ρ_k = 0. Now

d_n = ∑_j ρ̂_jn √n(T̄*_j − T_jn + T_jn − θ) = ∑_j ρ̂_jn √n(T_jn − θ) + c_n
where c_n = o_P(1) since √n(T̄*_j − T_jn) = o_P(1). Hence under regularity conditions, if √n(T̄* − θ) →D w then ∑_j ρ_j √n(T_jn − θ) →D w.
To examine the last term and w, let the n × 1 vector Y have characteristic function φ_Y, E(Y) = Xβ, and Cov(Y) = σ²I. Let Z = (Y^T, ..., Y^T)^T be a Jn × 1 vector with J copies of Y stacked into a vector. Let t = (t_1^T, ..., t_J^T)^T. Then Z has characteristic function φ_Z(t) = φ_Y(∑_{j=1}^J t_j) = φ_Y(s). Now assume Y ∼ N_n(Xβ, σ²I). Then t^T Z = s^T Y ∼ N(s^T Xβ, σ² s^T s). Hence Z has a multivariate normal distribution by Definition 1.23 with E(Z) = ((Xβ)^T, ..., (Xβ)^T)^T, and Cov(Z) a block matrix with J × J blocks each equal to σ²I. Then

∑_j ρ_j T_jn = ∑_j ρ_j A D_{j0} Y = BY ∼ N_g(θ, σ² BB^T) = N_g(θ, σ² ∑_j ∑_k ρ_j ρ_k A D_{j0} D_{k0}^T A^T)

since E(T_jn) = E(Aβ̂_{I_j,0}) = Aβ = θ if S ⊆ I_j. Since (T_{1n}^T, ..., T_{Jn}^T)^T = diag(AD_{10}, ..., AD_{J0}) Z, then (T_{1n}^T, ..., T_{Jn}^T)^T is multivariate normal and

∑_j ρ_j T_jn ∼ N_g[θ, ∑_j ∑_k π_j π_k Cov(T_jn, T_kn)].

Now assume n D_{j0} D_{k0}^T →P W_{jk} as n → ∞. Then

∑_j ρ_j √n(T_jn − θ) →D w ∼ N_g(0, σ² ∑_j ∑_k ρ_j ρ_k A W_{jk} A^T).
We conjecture that this result may hold under milder conditions than
Y ∼ N_n(Xβ, σ²I), but even the above results are not yet rigorous. If √n(T_jn − θ) →D w_j ∼ N_g(0, Σ_j), then a possibly poor approximation is T̄* ≈ ∑_j ρ_j T_jn ≈ N_g[θ, ∑_j ∑_k ρ_j ρ_k Cov(T_jn, T_kn)], and estimating ∑_j ∑_k ρ_j ρ_k Cov(T_jn, T_kn) with delta method techniques may not be possible.
The double bootstrap technique may be useful. See Hall (1986) and Chang and Hall (2015) for references. The double bootstrap for T̄* = T̄*_B says that Tn = T̄* is a statistic that can be bootstrapped. Let B_d ≥ 50 g_max where 1 ≤ g_max ≤ p is the largest dimension of θ to be tested with the double bootstrap. Draw a bootstrap sample of size B and compute T̄* = T_1*. Repeat for a total of B_d times. Apply the confidence region (4.32), (4.33), or (4.34) to the double bootstrap sample T_1*, ..., T*_{B_d}. If D_(U_{B_d}) ≈ D_(U_{B_d},T) ≈ √(χ²_{g,1−δ}), then T̄* may be approximately multivariate normal. The CI (4.32) applied to the double bootstrap sample could be regarded as a modified Frey CI.
We can get a prediction region by randomly dividing the data into two half sets H and V where H has n_H = ⌈n/2⌉ of the cases and V has the remaining m = n_V = n − n_H cases. Compute (x̄_H, S_H) from the cases in H. Then compute the distances D_i² = (x_i − x̄_H)^T S_H^{-1} (x_i − x̄_H) for the m vectors x_i in V. Then a large sample 100(1 − δ)% prediction region for x_F is {x : D²_x(x̄_H, S_H) ≤ D²_(k_m)} where k_m = ⌈m(1 − δ)⌉. This prediction region may give better coverage than the nonparametric prediction region (4.24) if 5p ≤ n ≤ 20p.
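A minimal R sketch of this half set prediction region with toy data follows; the multivariate normal data and the future case xf are illustrative assumptions.
#Sketch: estimate (xbar_H, S_H) from H, compute distances for the
#validation cases in V, and use the cutoff D^2_(k_m) with
#k_m = ceiling(m(1-delta)).
set.seed(2)
n <- 200; p <- 3; delta <- 0.1
x <- matrix(rnorm(n*p), n, p)
H <- sample(n, ceiling(n/2))
V <- setdiff(1:n, H)
xbarH <- colMeans(x[H,]); SH <- var(x[H,])
D2 <- mahalanobis(x[V,], xbarH, SH)
m <- length(V)
cut2 <- sort(D2)[ceiling(m*(1-delta))] #D^2_(k_m)
xf <- rnorm(p) #future case
mahalanobis(matrix(xf, nrow=1), xbarH, SH) <= cut2 #TRUE if xf is in the region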
The iid sample T1, ..., TB has sample mean T̄. Let T_in = T_ijn if T_jn is chosen D_jn times where the random variables D_jn/B →P π_jn. The D_jn follow a multinomial distribution. Then the iid sample can be written as

T_{1,1}, ..., T_{D_{1n},1}, ..., T_{1,J}, ..., T_{D_{Jn},J}

where the T_ij are not iid. Denote T_{1j}, ..., T_{D_{jn},j} as the jth component of the iid sample with sample mean T̄_j and sample covariance matrix S_{T,j}. Thus

T̄ = (1/B) ∑_{i=1}^B T_in = ∑_j (D_jn/B) (1/D_jn) ∑_{i=1}^{D_jn} T_ij = ∑_j π̂_jn T̄_j.
with Cp , the function vscisim was used to make Table 4.3, and can be used
to compare the shorth, prediction region method, and Bickel and Ren CIs for
βi .
4.10 Problems
4.1. Consider the Cushny and Peebles data set (see Staudte and Sheather
1990, p. 97) listed below. Find shorth(7). Show work.
0.0 0.8 1.0 1.2 1.3 1.3 1.4 1.8 2.4 4.6
4.2. Find shorth(5) for the following data set. Show work.
6 76 90 90 94 94 95 97 97 1008
4.3. Find shorth(5) for the following data set. Show work.
66 76 90 90 94 94 95 95 97 98
4.4. Suppose you are estimating the mean θ of losses with the maxi-
mum likelihood estimator (MLE) X assuming an exponential (θ) distribution.
Compute the sample mean of the fourth bootstrap sample.
actual losses 1, 2, 5, 10, 50: X = 13.6
bootstrap samples:
2, 10, 1, 2, 2: X = 3.4
50, 10, 50, 2, 2: X = 22.8
10, 50, 2, 1, 1: X = 12.8
5, 2, 5, 1, 50: X =?
4.5. The data below are sorted residuals from a least squares regression where n = 100 and p = 4. Find the shorth(97) of the residuals.
number 1 2 3 4 ... 97 98 99 100
residual -2.39 -2.34 -2.03 -1.77 ... 1.76 1.81 1.83 2.16
4.6. To find the sample median of a list of n numbers where n is odd, order
the numbers from smallest to largest and the median is the middle ordered
number. The sample median estimates the population median. Suppose the
sample is {14, 3, 5, 12, 20, 10, 9}. Find the sample median for each of the three
bootstrap samples listed below.
Sample 1: 9, 10, 9, 12, 5, 14, 3
Sample 2: 3, 9, 20, 10, 9, 5, 14
Sample 3: 14, 12, 10, 20, 3, 3, 5
2, 10, 1, 2, 2:
50, 10, 50, 2, 2:
10, 50, 2, 1, 1:
5, 2, 5, 1, 50:
b) Now compute the bagging estimator, which is the sample mean of the T_i*: the bagging estimator T̄* = (1/B) ∑_{i=1}^B T_i* where B = 4 is the number of bootstrap samples.
4.8. Consider the output for Example 4.7 for the minimum Cp forward
selection model.
a) What is β̂ Imin ?
b) What is β̂Imin ,0 ?
c) The large sample 95% shorth CI for H is [0, 0.016]. Is H needed in the minimum Cp model given that the other predictors are in the model?
d) The large sample 95% shorth CI for log(S) is [0.324, 0.913] for all subsets. Is log(S) needed in the minimum Cp model given that the other predictors are in the model?
e) Suppose x1 = 1, x4 = H = 130, and x5 = log(S) = 5.075. Find
Ŷ = (x1 x4 x5 )β̂ Imin . Note that Y = log(M ).
R Problems
Use the command source(“G:/linmodpack.txt”) to download the
functions and the command source(“G:/linmoddata.txt”) to download the
data. See Preface or Section 11.1. Typing the name of the linmodpack
function, e.g. regbootsim2, will display the code for the function. Use the
args command, e.g. args(regbootsim2), to display the needed arguments for
the function. For the following problem, the R command can be copied and
pasted from (http://parker.ad.siu.edu/Olive/linmodrhw.txt) into R.
4.21. a) Type the R command predsim() and paste the output into
Word.
This program computes xi ∼ N4 (0, diag(1, 2, 3, 4)) for i = 1, ..., 100 and
xf = x101 . One hundred such data sets are made, and ncvr, scvr, and mcvr
count the number of times xf was in the nonparametric, semiparametric,
and parametric MVN 90% prediction regions. The volumes of the prediction
regions are computed and voln, vols, and volm are the average ratio of the
volume of the ith prediction region over that of the semiparametric region.
Hence vols is always equal to 1. For multivariate normal data, these ratios
should converge to 1 as n → ∞.
b) Were the three coverages near 90%?
4.22. Consider the multiple linear regression model Yi = β1 + β2 xi,2 +
β3 xi,3 + β4 xi,4 + ei where β = (1, 1, 0, 0)T . The function regbootsim2
bootstraps the regression model, finds bootstrap confidence intervals for βi
and a bootstrap confidence region for (β3 , β4 )T corresponding to the test
H0 : β3 = β4 = 0 versus HA: not H0 . See the R code near Table 4.3. The
lengths of the CIs along with the proportion of times the CI for βi contained
βi are given. The fifth interval gives the length of the interval [0, D(c)] where
H0 is rejected if D0 > D(c) and the fifth “coverage” is the proportion of times
the test fails to reject H0 . Since nominal 95% CIs were used and the nominal
level of the test is 0.05 when H0 is true, we want the coverages near 0.95.
The CI lengths for the first 4 intervals should be near 0.392. The residual
bootstrap is used.
Copy and paste the commands for this problem into R, and include the
output in Word.
Chapter 5
Statistical Learning Alternatives to OLS
This chapter considers several alternatives to OLS for the multiple linear
regression model. Large sample theory is given for p fixed, but the prediction
intervals can have p > n.
5.1 The MLR Model
The multiple linear regression (MLR) model is Y_i = β_1 + x_{i,2} β_2 + · · · + x_{i,p} β_p + e_i = x_i^T β + e_i   (5.1)
for i = 1, ..., n. This model is also called the full model. Here n is the sample
size and the random variable ei is the ith error. Assume that the ei are iid
with variance V (ei ) = σ 2 . In matrix notation, these n equations become
Y = Xβ + e where Y is an n × 1 vector of dependent variables, X is an
n × p matrix of predictors, β is a p × 1 vector of unknown coefficients, and e
is an n × 1 vector of unknown errors.
There are many methods for estimating β, including (ordinary) least
squares (OLS) for the full model, forward selection with OLS, elastic net,
principal components regression (PCR), partial least squares (PLS), lasso,
lasso variable selection, and ridge regression (RR). For the last six methods,
it is convenient to use centered or scaled data. Suppose U has observed val-
ues U1 , ..., Un. For example, if Ui = Yi then U corresponds to the response
variable Y . The observed values of a random variable V are centered if their
sample mean is 0. The centered values of U are Vi = Ui − U for i = 1, ..., n.
Let g be an integer near 0. If the sample variance of the Ui is
σ̂_g² = (1/(n − g)) ∑_{i=1}^n (U_i − Ū)²,
then the sample standard deviation of Ui is σ̂g . If the values of Ui are not all
the same, then σ̂g > 0, and the standardized values of the Ui are
W_i = (U_i − Ū)/σ̂_g.
Remark 5.1. Let the nontrivial predictors u_i^T = (x_{i,2}, ..., x_{i,p}) = (u_{i,1}, ..., u_{i,p−1}). Then x_i = (1, u_i^T)^T. Let W_g = (W_ij) be the n × (p − 1) matrix of standardized nontrivial predictors when the predictors are standardized using σ̂_g. Thus ∑_{i=1}^n W_ij = 0 and ∑_{i=1}^n W_ij² = n − g for j = 1, ..., p − 1. Hence

W_ij = (x_{i,j+1} − x̄_{j+1})/σ̂_{j+1}   where   σ̂²_{j+1} = (1/(n − g)) ∑_{i=1}^n (x_{i,j+1} − x̄_{j+1})²
and σ̂_{j+1} is σ̂_g for the (j + 1)th variable x_{j+1}. Let w_i^T = (w_{i,1}, ..., w_{i,p−1}) be the standardized vector of nontrivial predictors for the ith case. Since the standardized data are also centered, w̄ = 0. Then the sample covariance matrix of the w_i is the sample correlation matrix of the u_i:

ρ̂_u = R_u = (r_ij) = W_g^T W_g/(n − g)

where r_ij is the sample correlation of u_i = x_{i+1} and u_j = x_{j+1}. Thus the sample correlation matrix R_u does not depend on g. Let Z = Y − Ȳ where Ȳ = Ȳ 1. Since the R software tends to use g = 0, let W = W_0. Note that the n × (p − 1) matrix W does not include a vector 1 of ones. Then regression through the origin is used for the model
Z = Wη + e (5.3)
where Z = (Z_1, ..., Z_n)^T and η = (η_1, ..., η_{p−1})^T. The vector of fitted values Ŷ = Ȳ + Ẑ.
Remark 5.2. i) Interest is in model (5.1): estimate Ŷf and β̂. For many
regression estimators, a method is needed so that everyone who uses the same
units of measurements for the predictors and Y gets the same (Ŷ , β̂). Also,
see Remark 7.7. Equation (5.3) is a commonly used method for achieving this
goal. Suppose g = 0. The method of moments estimator of the variance σ²_w is
σ̂²_{g=0} = S²_M = (1/n) ∑_{i=1}^n (w_i − w̄)².
When data x_i are standardized to have w̄ = 0 and S²_M = 1, the standardized data w_i have no units. ii) Hence the estimators Ẑ and η̂ do not depend on
the units of measurement of the xi if standardized data and Equation (5.3)
are used. Linear combinations of the w i are linear combinations of the ui ,
which are linear combinations of the xi . (Note that γ T u = (0 γ T ) x.) Thus
the estimators Ŷ and β̂ are obtained using Ẑ, η̂, and Y . The linear trans-
formation to obtain (Ŷ , β̂) from (Ẑ, η̂) is unique for a given set of units of
measurements for the xi and Y . Hence everyone using the same units of mea-
2
surements gets the same (Ŷ , β̂). iii) Also, since W j = 0 and SM,j = 1, the
standardized predictor variables have similar spread, and the magnitude of
η̂i is a measure of the importance of the predictor variable Wj for predicting
Y.
Ŷ_i = γ̂ + w_i^T η̂ = γ̂ + Ẑ_i   (5.4)
where
η̂_j = σ̂_{j+1} β̂_{j+1}   (5.5)
for j = 1, ..., p − 1. Often γ̂ = Ȳ so that Ŷ_i = Ȳ if x_{i,j} = x̄_j for j = 2, ..., p. Then Ŷ = Ȳ + Ẑ where Ȳ = Ȳ 1. Note that

γ̂ = β̂_1 + (x̄_2/σ̂_2) η̂_1 + · · · + (x̄_p/σ̂_p) η̂_{p−1}.
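A minimal R sketch checking (5.4) and (5.5) numerically with g = 0 follows; the toy data, and the use of lsfit with intercept = FALSE for the regression through the origin, are illustrative choices.
#Sketch: OLS of the centered response Z on the standardized predictors W
#gives etahat with etahat_j = sigmahat_{j+1} betahat_{j+1}, so the full
#model OLS fit is recovered from the standardized fit.
set.seed(3)
n <- 50; q <- 3
x <- matrix(rnorm(n*q), n, q)
y <- as.vector(2 + x %*% c(1,-1,0.5) + rnorm(n))
betahat <- lsfit(x,y)$coef #OLS full model
sighat <- sqrt(apply(x, 2, var)*(n-1)/n) #sigmahat_g with g = 0
w <- scale(x, center=TRUE, scale=sighat) #standardized predictors
z <- y - mean(y)
etahat <- lsfit(w, z, intercept=FALSE)$coef #regression through the origin
max(abs(etahat - sighat*betahat[-1])) #approximately 0, checking (5.5)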
η̂_OLS = (W^T W)^{-1} W^T Z

minimizes the OLS criterion Q_OLS(η) = r(η)^T r(η) over all vectors η ∈ R^{p−1}. The vector of predicted or fitted values Ẑ_OLS = W η̂_OLS = HZ where H = W(W^T W)^{-1} W^T. The vector of residuals r = r(Z, W) = Z − Ẑ = (I − H)Z.
R_u = W^T W/n →P V^{-1}.   (5.6)

Note that V^{-1} = ρ_u, the population correlation matrix of the nontrivial predictors u_i, if the u_i are a random sample from a population. Let H = W(W^T W)^{-1} W^T = (h_ij), and assume that max_{i=1,...,n} h_ii →P 0 as n → ∞.
Then by Theorem 2.26 (the LS CLT), the OLS estimator satisfies
√n(η̂_OLS − η) →D N_{p−1}(0, σ²V).   (5.7)
Remark 5.5. Prediction interval (4.14) used a number d that was often
the number of predictors in the selected model. For forward selection, PCR,
PLS, lasso, and relaxed lasso, let d be the number of predictors vj = γ Tj x in
the final model (with nonzero coefficients), including a constant v1 . For for-
ward selection, lasso, and relaxed lasso, vj corresponds to a single nontrivial
predictor, say vj = x∗j = xkj . Another method for obtaining d is to let d = j
if j is the degrees of freedom of the selected model if that model was chosen
in advance without model or variable selection. Hence d = j is not the model
degrees of freedom if model selection was used.
θ̂_n ∼ AN_r(θ, V/n),  and  Aθ̂_n + c ∼ AN_k(Aθ + c, AVA^T/n).
Theorem 2.26 gives the large sample theory for the OLS full model. Then
β̂ ≈ N_p(β, σ²(X^T X)^{-1}) or β̂ ∼ AN_p(β, MSE (X^T X)^{-1}).
As a mnemonic (memory aid) for the following theorem, note that the derivative (d/dx) ax = (d/dx) xa = a and (d/dx) ax² = (d/dx) xax = 2ax.
dx dx dx dx
Theorem 5.1. a) If Q(η) = a^T η = η^T a for some k × 1 constant vector a, then ∇Q = a.
b) If Q(η) = η^T Aη for some k × k constant matrix A, then ∇Q = 2Aη.
c) If Q(η) = ∑_{i=1}^k |η_i| = ‖η‖_1, then ∇Q = s = s_η where s_i = sign(η_i) where sign(η_i) = 1 if η_i > 0 and sign(η_i) = −1 if η_i < 0. This gradient is only defined for η where none of the k values of η_i are equal to 0.
Example 5.2. The Hebbler (1847) data was collected from n = 26 dis-
tricts in Prussia in 1843. We will study the relationship between Y = the
number of women married to civilians in the district with the predictors x1
= constant, x2 = pop = the population of the district in 1843, x3 = mmen
= the number of married civilian men in the district, x4 = mmilmen = the
number of married men in the military in the district, and x5 = milwmn =
the number of women married to husbands in the military in the district.
Sometimes the person conducting the survey would not count a spouse if
the spouse was not at home. Hence Y is highly correlated but not equal to
x3 . Similarly, x4 and x5 are highly correlated but not equal. We expect that
Y = x3 + e is a good model, but n/p = 5.2 is small. See the following output.
ls.print(out)
Residual Standard Error=392.8709
R-Square=0.9999, p-value=0
F-statistic (df=4, 21)=67863.03
Estimate Std.Err t-value Pr(>|t|)
Intercept 242.3910 263.7263 0.9191 0.3685
pop 0.0004 0.0031 0.1130 0.9111
mmen 0.9995 0.0173 57.6490 0.0000
mmilmen -0.2328 2.6928 -0.0864 0.9319
milwmn 0.1531 2.8231 0.0542 0.9572
res<-out$res
yhat<-Y-res #d = 5 predictors used including x_1
AERplot2(yhat,Y,res=res,d=5)
#response plot with 90% pointwise PIs
$respi #90% PI for a future residual
[1] -950.4811 1445.2584 #90% PI length = 2395.74
out<-summary(temp)
Selection Algorithm: forward
pop mmen mmilmen milwmn
1 ( 1 ) " " "*" " " " "
2 ( 1 ) " " "*" "*" " "
3 ( 1 ) "*" "*" "*" " "
4 ( 1 ) "*" "*" "*" "*"
out$cp
[1] -0.8268967 1.0151462 3.0029429 5.0000000
#mmen and a constant = Imin
mincp <- out$which[out$cp==min(out$cp),]
#do not need the constant in vin
vin <- vars[mincp[-1]]
sub <- lsfit(X[,vin],Y)
ls.print(sub)
Residual Standard Error=369.0087
R-Square=0.9999
F-statistic (df=1, 24)=307694.4
Estimate Std.Err t-value Pr(>|t|)
Intercept 241.5445 190.7426 1.2663 0.2175
X 1.0010 0.0018 554.7021 0.0000
res<-sub$res
yhat<-Y-res #d = 2 predictors used including x_1
AERplot2(yhat,Y,res=res,d=2)
#response plot with 90% pointwise PIs
$respi #90% PI for a future residual
[1] -778.2763 1336.4416 #length 2114.72
Consider forward selection where xI is a × 1. Underfitting occurs if S
is not a subset of I so xI is missing important predictors. A special case
of underfitting is d = a < aS . Overfitting for forward selection occurs if i)
n < 5a so there is not enough data to estimate the a parameters in βI well,
or ii) S ⊆ I but S 6= I. Overfitting is serious if n < 5a, but “not much of a
problem” if n > Jp where J = 10 or 20 for many data sets. Underfitting is a
serious problem. Let Yi = xTI,i β I + eI,i . Then V (eI,i ) may not be a constant
σ 2 : V (eI,i ) could depend on case i, and the model may no longer be linear.
Check model I with response and residual plots.
Forward selection is a shrinkage method: p models are produced, and except for the full model, some |β̂_i| are shrunk to 0. Lasso and ridge regression are also shrinkage methods, but ridge regression does not shrink any |β̂_i| to 0. Shrinkage methods that shrink some β̂_i to 0 are also variable selection methods. See Sections 5.5, 5.6, and 5.8.
Ax = λx. (5.8)
Using the same notation as Johnson and Wichern (1988, pp. 50-51), let P = [e_1 e_2 ··· e_p] be the p × p orthogonal matrix with ith column e_i. Then P P^T = P^T P = I. Let Λ = diag(λ_1, ..., λ_p) and let Λ^{1/2} = diag(√λ_1, ..., √λ_p). If A is a positive definite p × p symmetric matrix with spectral decomposition A = ∑_{i=1}^p λ_i e_i e_i^T, then A = PΛP^T and

A^{-1} = PΛ^{-1}P^T = ∑_{i=1}^p (1/λ_i) e_i e_i^T.
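A minimal R sketch checking these identities numerically on a small toy matrix follows; for a symmetric matrix, eigen() returns the orthonormal eigenvectors as the columns of P.
#Sketch: for a positive definite symmetric A, eigen() gives P and Lambda
#with A = P Lambda P^T and A^{-1} = P Lambda^{-1} P^T.
A <- matrix(c(2, 0.5, 0.5, 1), 2, 2)
e <- eigen(A)
P <- e$vectors; Lam <- diag(e$values)
max(abs(A - P %*% Lam %*% t(P))) #approximately 0
max(abs(solve(A) - P %*% solve(Lam) %*% t(P))) #approximately 0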
The eigenvectors ê_i are orthonormal: ê_i^T ê_i = 1 and ê_i^T ê_j = 0 for i ≠ j. If the eigenvalues are unique, then ê_i and −ê_i are the only orthonormal eigenvectors corresponding to λ̂_i. For example, the eigenvalue eigenvector pairs can be found using the singular value decomposition of the matrix W_g/√(n − g) where W_g is the matrix of the standardized nontrivial predictors w_i, the sample covariance matrix

Σ̂_w = W_g^T W_g/(n − g) = (1/(n − g)) ∑_{i=1}^n (w_i − w̄)(w_i − w̄)^T = (1/(n − g)) ∑_{i=1}^n w_i w_i^T = R_u,
2π^{K/2}/(K Γ(K/2)) |R_u|^{1/2} h^K.
= λ̂_j ê_i^T ê_j = 0 for i ≠ j since the sample covariance matrix of the standardized data is

(1/n) ∑_{k=1}^n w_k w_k^T = R_u
Example 5.2, continued. The PCR output below shows results for the
marry data where 10-fold CV was used. The OLS full model was selected.
library(pls); y <- marry[,3]; x <- marry[,-3]
z <- as.data.frame(cbind(y,x))
out<-pcr(y˜.,data=z,scale=T,validation="CV")
tem<-MSEP(out)
tem
(Int) 1 comps 2 comps 3 comps 4 comps
CV 1.743e+09 449479706 8181251 371775 197132
cvmse<-tem$val[,,1:(out$ncomp+1)][1,]
nc <-max(which.min(cvmse)-1,1)
res <- out$residuals[,,nc]
yhat<-y-res #d = 5 predictors used including constant
AERplot2(yhat,y,res=res,d=5)
#response plot with 90% pointwise PIs
$respi #90% PI same as OLS full model
-950.4811 1445.2584 #PI length = 2395.74
Remark 5.9. PLS may or may not give a consistent estimator of β if p/n
does not go to zero: rather strong regularity conditions have been used to
prove consistency or inconsistency if p/n does not go to zero. See Chun and
Keleş (2010), Cook (2018), Cook et al. (2013), and Cook and Forzani (2018,
2019).
Following Hastie et al. (2009, pp. 80-81), let W = [s_1, ..., s_{p−1}] so s_j is the vector corresponding to the standardized jth nontrivial predictor. Let ĝ_{1j} = s_j^T Y be n times the least squares coefficient from regressing Y on s_j. Then the first PLS direction ĝ_1 = (ĝ_{11}, ..., ĝ_{1,p−1})^T. Note that W ĝ_i = (V_{i1}, ..., V_{in})^T = p_i is the ith PLS component. This process is repeated using matrices W^k = [s_1^k, ..., s_{p−1}^k] where W^0 = W and W^k is orthogonalized with respect to p_k for k = 1, ..., p − 2. So s_j^k = s_j^{k−1} − [p_k^T s_j^{k−1}/(p_k^T p_k)] p_k for j = 1, ..., p − 1. If the PLS model I_i uses a constant and PLS components V_1, ..., V_{i−1}, let Ŷ_{I_i} be the predicted values from the PLS model using I_i. Then Ŷ_{I_i} = Ŷ_{I_{i−1}} + θ̂_i p_i where Ŷ_{I_0} = Ȳ 1 and θ̂_i = p_i^T Y/(p_i^T p_i). Since linear combinations of w are linear combinations of x, Ŷ = X β̂_{PLS,I_j} where I_j uses a constant and the first j − 1 PLS components. If j = p, then the PLS model I_p is the OLS full model.
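A minimal R sketch of the first PLS direction and component, following the description above with toy data, is given below; the data are illustrative assumptions, and the pls package function plsr is the practical tool.
#Sketch: ghat_1j = s_j^T Y, the first component is p_1 = W ghat_1, and
#the one component PLS fit is Yhat_I1 = Ybar + thetahat_1 p_1.
set.seed(4)
n <- 60; q <- 4
x <- matrix(rnorm(n*q), n, q)
y <- as.vector(1 + x %*% c(2,0,1,0) + rnorm(n))
w <- scale(x)*sqrt(n/(n-1)) #standardized predictors with g = 0
g1 <- as.vector(t(w) %*% y) #ghat_1j = s_j^T Y
p1 <- as.vector(w %*% g1) #first PLS component
theta1 <- sum(p1*y)/sum(p1*p1)
yhatI1 <- mean(y) + theta1*p1 #PLS fit using a constant and one component
cor(yhatI1, y)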
Example 5.2, continued. The PLS output below shows results for the
marry data where 10-fold CV was used. The OLS full model was selected.
library(pls); y <- marry[,3]; x <- marry[,-3]
z <- as.data.frame(cbind(y,x))
out<-plsr(y˜.,data=z,scale=T,validation="CV")
tem<-MSEP(out)
tem
(Int) 1 comps 2 comps 3 comps 4 comps
CV 1.743e+09 256433719 6301482 249366 206508
cvmse<-tem$val[,,1:(out$ncomp+1)][1,]
nc <-max(which.min(cvmse)-1,1)
res <- out$residuals[,,nc]
yhat<-y-res #d = 5 predictors used including constant
AERplot2(yhat,y,res=res,d=5)
$respi #90% PI same as OLS full model
-950.4811 1445.2584 #PI length = 2395.74
The Mevik et al. (2015) pls library is useful for computing PLS and PCR.
over all vectors η ∈ Rp−1 where λ1,n ≥ 0 and a > 0 are known constants
with a = 1, 2, n, and 2n common. Then
iii) Following Hastie et al. (2009, p. 96), let the augmented matrix W A
and the augmented response vector Z A be defined by
W_A = [W^T  √λ_{1,n} I_{p−1}]^T,  and  Z_A = [Z^T  0^T]^T,

where 0 is the (p − 1) × 1 zero vector. For λ_{1,n} > 0, the OLS estimator from regressing Z_A on W_A is

(W_A^T W_A)^{-1} W_A^T Z_A = (W^T W + λ_{1,n} I_{p−1})^{-1} W^T Z = η̂_R

since W_A^T Z_A = W^T Z and

W_A^T W_A = [W^T  √λ_{1,n} I_{p−1}] [W^T  √λ_{1,n} I_{p−1}]^T = W^T W + λ_{1,n} I_{p−1}.
Remark 5.10 iii) is interesting. Note that for λ1,n > 0, the (n+p−1)×(p−1)
matrix W A has full rank p−1. The augmented OLS model consists of adding
p − 1 pseudo-cases (w_{n+1}^T, Z_{n+1})^T, ..., (w_{n+p−1}^T, Z_{n+p−1})^T where Z_j = 0 and w_j = (0, ..., √λ_{1,n}, 0, ..., 0)^T for j = n+1, ..., n+p−1 where the nonzero entry
is in the kth position if j = n + k. For centered response and standardized
nontrivial predictors, the population OLS regression fit runs through the
origin (w T , Z)T = (0T , 0)T . Hence for λ1,n = 0, the augmented OLS model
adds p − 1 typical cases at the origin. If λ1,n is not large, then the pseudo-
data can still be regarded as typical cases. If λ1,n is large, the pseudo-data
act as w–outliers (outliers in the standardized predictor variables), and the
OLS slopes go to zero as λ1,n gets large, making Ẑ ≈ 0 so Ŷ ≈ Y .
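A minimal R sketch checking Remark 5.10 iii) numerically with toy data follows; the data and the choice of λ are illustrative assumptions.
#Sketch: OLS regression of Z_A on W_A (W with sqrt(lambda_1n) I_{p-1}
#appended as pseudo-cases and Z padded with p-1 zeroes) reproduces the
#ridge estimator etahat_R = (W^T W + lambda_1n I)^{-1} W^T Z.
set.seed(5)
n <- 40; q <- 3; lam <- 10
x <- matrix(rnorm(n*q), n, q)
y <- as.vector(1 + x %*% c(1,0,-1) + rnorm(n))
w <- scale(x)*sqrt(n/(n-1)) #standardized predictors with g = 0
z <- y - mean(y)
wA <- rbind(w, sqrt(lam)*diag(q)) #augmented predictor matrix
zA <- c(z, rep(0, q)) #augmented response
etaA <- lsfit(wA, zA, intercept=FALSE)$coef
etaR <- solve(t(w) %*% w + lam*diag(q), t(w) %*% z)
max(abs(etaA - as.vector(etaR))) #approximately 0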
To prove Remark 5.10 ii), let (ψ, g) be an eigenvalue eigenvector pair of
W T W = nRu . Then [W T W + λ1,n I p−1 ]g = (ψ + λ1,n )g, and (ψ + λ1,n , g)
is an eigenvalue eigenvector pair of W T W + λ1,n I p−1 > 0 provided λ1,n > 0.
The degrees of freedom for a ridge regression with known λ1,n is also
interesting and will be found in the next paragraph. The sample correlation
matrix of the nontrivial predictors
R_u = (1/(n − g)) W_g^T W_g
Hence λ̂i = σi2 where λ̂i = λ̂i (W T W ) is the ith eigenvalue of W T W , and êi
is the ith orthonormal eigenvector of Ru and of W T W . The SVD of W T is
W^T = V Λ^T U^T, and the Gram matrix W W^T is the n × n matrix with (i, j) entry w_i^T w_j,
which is the matrix of scalar products. Warning: Note that σi is the ith
singular value of W , not the standard deviation of wi .
Following Hastie et al. (2009, p. 68), if λ̂i = λ̂i (W T W ) is the ith eigenvalue
of W T W where λ̂1 ≥ λ̂2 ≥ · · · ≥ λ̂p−1 , then the (effective) degrees of freedom
for the ridge regression of Z on W with known λ1,n is df(λ1,n ) =
tr[W(W^T W + λ_{1,n} I_{p−1})^{-1} W^T] = ∑_{i=1}^{p−1} σ_i²/(σ_i² + λ_{1,n}) = ∑_{i=1}^{p−1} λ̂_i/(λ̂_i + λ_{1,n}).   (5.11)
The following remark is interesting if λ1,n and p are fixed. However, λ̂1,n is
usually used, for example, after 10-fold cross validation. The fact that η̂R =
An,λ η̂OLS appears in Efron and Hastie (2016, p. 98), and Marquardt and
Snee (1975). See Theorem 5.4 for the ridge regression central limit theorem.
Remark 5.12. The ridge regression criterion from Definition 5.5 can also
be defined by
Q_R(η) = ‖Z − Wη‖_2² + λ_{1,n} η^T η.   (5.13)
W^T (W W^T + λ_{1,n} I_n) a = W^T Z
The following identity from Gunst and Mason (1980, p. 342) is useful for ridge regression inference: η̂_R = (W^T W + λ_{1,n} I_{p−1})^{-1} W^T Z. Also,

(W^T W + λ_{1,n} I_{p−1})/n →P V^{-1},  and  n(W^T W + λ_{1,n} I_{p−1})^{-1} →P V.
Note that

A_n = A_{n,λ} = ((W^T W + λ_{1,n} I_{p−1})/n)^{-1} (W^T W/n) →P V V^{-1} = I_{p−1}.

Hence ridge regression and the OLS full model are asymptotically equivalent if λ̂_{1,n} = o_P(n^{1/2}) so λ̂_{1,n}/√n →P 0. Otherwise, the ridge regression estimator can have large bias for moderate n. Ten fold CV does not appear to guarantee that λ̂_{1,n}/√n →P 0 or λ̂_{1,n}/n →P 0.
Ridge regression can be a lot better than the OLS full model if i) X T X is
singular or ill conditioned or ii) n/p is small. Ridge regression can be much
faster than forward selection if M = 100 and n and p are large.
Roughly speaking, the biased estimation of the ridge regression estimator
can make the MSE of β̂ R or η̂ R less than that of β̂OLS or η̂ OLS , but the
large sample inference may need larger n for ridge regression than for OLS.
However, the large sample theory has n >> p. We will try to use prediction
intervals to compare OLS, forward selection, ridge regression, and lasso for
data sets where p > n. See Sections 5.9, 5.10, 5.11, and 5.12.
Example 5.2, continued. The ridge regression output below shows results
for the marry data where 10-fold CV was used. A grid of 100 λ values was
used, and λ0 > 0 was selected. A problem with getting the false degrees of
freedom d for ridge regression is that it is not clear that λ = λ1,n /(2n). We
need to know the relationship between λ and λ1,n in order to compute d. It
seems unlikely that d ≈ 1 if λ0 is selected.
library(glmnet); y <- marry[,3]; x <- marry[,-3]
out<-cv.glmnet(x,y,alpha=0)
lam <- out$lambda.min #value of lambda that minimizes
#the 10-fold CV criterion
yhat <- predict(out,s=lam,newx=x)
res <- y - yhat
n <- length(y)
w1 <- scale(x)
w <- sqrt(n/(n-1))*w1 #t(w) %*% w = n R_u, u = x
diag(t(w)%*%w)
pop mmen mmilmen milwmn
26 26 26 26
#sum w_i^2 = n = 26 for i = 1, 2, 3, and 4
svs <- svd(w)$d #singular values of w
pp <- 1 + sum(svs^2/(svs^2+2*n*lam)) #approx 1
# d for ridge regression if lam = lam_{1,n}/(2n)
AERplot2(yhat,y,res=res,d=pp)
$respi #90% PI for a future residual
[1] -5482.316 14854.268 #length = 20336.584
#try to reproduce the fitted values
z <- y - mean(y)
q<-dim(w)[2]
I <- diag(q)
5.6 Lasso
over all vectors η ∈ R^{p−1} where λ_{1,n} ≥ 0 and a > 0 are known constants, with a = 1, 2, n, and 2n common. The residual sum of squares RSS(η) =
(Z − W η)T (Z − W η), and λ1,n = 0 corresponds to the OLS estimator
η̂OLS = (W T W )−1 W T Z if W has full rank p − 1. The lasso vector of fitted
values is Ẑ = Ẑ L = W η̂ L , and the lasso vector of residuals r(η̂ L ) = Z − Ẑ L .
The estimator is said to be regularized if λ1,n > 0. Obtain Ŷ and β̂L using
η̂L , Ẑ, and Y .
where the minimization is over all vectors b ∈ Rp−1 . The literature often uses
λa = λ = λ1,n /a.
For fixed λ1,n , the lasso optimization problem is convex. Hence fast algo-
rithms exist. As λ1,n increases, some of the η̂i = 0. If λ1,n is large enough,
then η̂ L = 0 and Ŷi = Y for i = 1, ..., n. If none of the elements η̂i of η̂L are
zero, then η̂ L can be found, in principle, by setting the partial derivatives of
QL(η) to 0. Potential minimizers also occur at values of η where not all of the
partial derivatives exist. An analogy is finding the minimizer of a real valued
function of one variable h(x). Possible values for the minimizer include values
of xc satisfying h0 (xc ) = 0, and values xc where the derivative does not exist.
Typically some of the elements η̂i of η̂L that minimizes QL(η) are zero, and
differentiating does not work.
The following identity from Efron and Hastie (2016, p. 308), for example, is useful for inference for the lasso estimator η̂_L:

(−1/n) W^T (Z − W η̂_L) + (λ_{1,n}/(2n)) s_n = 0,  or  −W^T (Z − W η̂_L) + (λ_{1,n}/2) s_n = 0,

where s_in ∈ [−1, 1] and s_in = sign(η̂_{i,L}) if η̂_{i,L} ≠ 0. Here sign(η_i) = 1 if η_i > 0 and sign(η_i) = −1 if η_i < 0. Note that s_n = s_{n,η̂_L} depends on η̂_L. Thus

η̂_L = (W^T W)^{-1} W^T Z − (λ_{1,n}/(2n)) n (W^T W)^{-1} s_n = η̂_OLS − (λ_{1,n}/(2n)) n (W^T W)^{-1} s_n.
If none of the elements of η are zero, and if η̂_L is a consistent estimator of η, then s_n →P s = s_η. If λ_{1,n}/√n → 0, then OLS and lasso are asymptotically equivalent even if s_n does not converge to a vector s as n → ∞ since s_n is bounded. For model selection, the M values of λ are denoted by 0 ≤ λ_1 < λ_2 < ··· < λ_M where λ_i = λ_{1,n,i} depends on n for i = 1, ..., M. Also, λ_M is the smallest value of λ such that η̂_{λ_M} = 0. Hence η̂_{λ_i} ≠ 0 for i < M. If λ_s corresponds to the model selected, then λ̂_{1,n} = λ_s. The following theorem shows that lasso and the OLS full model are asymptotically equivalent if λ̂_{1,n} = o_P(n^{1/2}) so λ̂_{1,n}/√n →P 0: thus √n(η̂_L − η̂_OLS) = o_P(1).
Theorem 5.5, Lasso CLT. Assume p is fixed and that the conditions of the LS CLT Equation (5.7) hold for the model Z = Wη + e.
a) If λ̂_{1,n}/√n →P 0, then

√n(η̂_L − η) →D N_{p−1}(0, σ²V).

b) If λ̂_{1,n}/√n →P τ ≥ 0 and s_n →P s = s_η, then

√n(η̂_L − η) →D N_{p−1}(−(τ/2) V s, σ²V).

Proof. If λ̂_{1,n}/√n →P τ ≥ 0 and s_n →P s = s_η, then

√n(η̂_L − η) = √n(η̂_L − η̂_OLS + η̂_OLS − η) =
√n(η̂_OLS − η) − √n (λ̂_{1,n}/(2n)) n (W^T W)^{-1} s_n →D N_{p−1}(0, σ²V) − (τ/2) V s ∼ N_{p−1}(−(τ/2) V s, σ²V)

since under the LS CLT, n(W^T W)^{-1} →P V.
Part a) does not need s_n →P s as n → ∞, since s_n is bounded.
The values a = 1, 2, and 2n are common. Following Hastie et al. (2015, pp. 9, 17, 19) for the next two paragraphs, it is convenient to use a = 2n:

Q_{L,2n}(b) = (1/(2n)) r(b)^T r(b) + λ_{2n} ∑_{j=1}^{p−1} |b_j|,   (5.17)

λ_{2n,max} = λ_{2n,M} = max_j (1/n) s_j^T Z
For model selection we let I denote the index set of the predictors in the
fitted model including the constant. The set A defined below is the index set
without the constant.
Definition 5.7. The active set A is the index set of the nontrivial predic-
tors in the fitted model: the predictors with nonzero η̂i .
Suppose that there are k active nontrivial predictors. Then for lasso, k ≤ n.
Let the n × k matrix W A correspond to the standardized active predictors.
If the columns of W A are in general position, then the lasso vector of fitted
values
where sA is the vector of signs of the active lasso coefficients. Here we are
using the λ2n of (5.17), and nλ2n = λ1,n /2. We could replace n λ2n by λ2 if
we used a = 2 in the criterion
Q_{L,2}(b) = (1/2) r(b)^T r(b) + λ_2 ∑_{j=1}^{p−1} |b_j|.   (5.18)
Example 5.2, continued. The lasso output below shows results for the
marry data where 10-fold CV was used. A grid of 38 λ values was used, and
λ0 > 0 was selected.
library(glmnet); y <- marry[,3]; x <- marry[,-3]
out<-cv.glmnet(x,y)
lam <- out$lambda.min #value of lambda that minimizes
#the 10-fold CV criterion
yhat <- predict(out,s=lam,newx=x)
res <- y - yhat
pp <- out$nzero[out$lambda==lam] + 1 #d for lasso
AERplot2(yhat,y,res=res,d=pp)
$respi #90% PI for a future residual
-4102.672 4379.951 #length = 8482.62
There are some problems with lasso. i) Lasso large sample theory is at best as good as that of the OLS full model if n/p is large. ii) Ten fold CV does not appear to guarantee that λ̂_{1,n}/√n →P 0 or λ̂_{1,n}/n →P 0. iii) Lasso often
shrinks β̂ too much if aS ≥ 20 and the predictors are highly correlated. iv)
Ridge regression can be better than lasso if aS > n.
Lasso can be a lot better than the OLS full model if i) X T X is singular
or ill conditioned or ii) n/p is small. iii) For lasso, M = M (lasso) is often
near 100. Let J ≥ 5. If n/J and p are both a lot larger than M (lasso), then
lasso can be considerably faster than forward selection, PLS, and PCR if
M = M (lasso) = 100 and M = M (F ) = min(dn/Je, p) where F stands for
forward selection, PLS, or PCR. iv) The number of nonzero coefficients in
η̂_L is ≤ n even if p > n. This property of lasso can be useful if p >> n and the
population model is sparse.
5.7 Lasso Variable Selection
Lasso variable selection applies OLS on a constant and the active predictors
that have nonzero lasso η̂i . The method is called relaxed lasso by Hastie et al.
(2015, p. 12), and the relaxed lasso (φ = 0) estimator by Meinshausen (2007).
The method is also called OLS-post lasso and post model selection OLS.
Let X A denote the matrix with a column of ones and the unstandardized
active nontrivial predictors. Hence the lasso variable selection estimator is
β̂LV S = (X TA X A )−1 X TA Y , and lasso variable selection is an alternative to
forward selection. Let k be the number of active (nontrivial) predictors so
β̂_LVS is (k + 1) × 1.
Let Imin correspond to the lasso variable selection estimator and β̂ V S =
β̂LV S,0 = β̂ Imin ,0 to the zero padded lasso variable selection estimator. Then
by Remark 4.5 where p is fixed, β̂_{LVS,0} is √n consistent when lasso is consistent, with the limiting distribution for β̂_{LVS,0} given by Theorem 4.4. Hence,
relaxed lasso can be bootstrapped with the same methods used for forward
selection in Chapter 4. Lasso variable selection will often be better than lasso
when the model is sparse or if n ≥ 10(k + 1). Lasso can be better than lasso
variable selection if (X TA X A ) is ill conditioned or if n/(k + 1) < 10. Also see
Pelawa Watagoda and Olive (2020) and Rathnayake and Olive (2020).
Suppose the n × q matrix x has the q = p − 1 nontrivial predictors. The
following R code gives some output for a lasso estimator and then the corre-
sponding relaxed lasso estimator.
library(glmnet)
y <- marry[,3]
x <- marry[,-3]
out<-glmnet(x,y,dfmax=2) #Use 2 for illustration:
#often dfmax approx min(n/J,p) for some J >= 5.
lam<-out$lambda[length(out$lambda)]
yhat <- predict(out,s=lam,newx=x)
#lasso with smallest lambda in grid such that df = 2
lcoef <- predict(out,type="coefficients",s=lam)
as.vector(lcoef) #first term is the intercept
#3.000397e+03 1.800342e-03 9.618035e-01 0.0 0.0
res <- y - yhat
AERplot(yhat,y,res,d=3,alph=1) #lasso response plot
##relaxed lasso =
#OLS on lasso active predictors and a constant
vars <- 1:dim(x)[2]
[Figure 5.1: response plots of Y versus Ŷ, with 90% pointwise prediction bands, for forward selection, ridge regression, lasso, and lasso variable selection for the marry data.]
the OLS full model with PI length 2395.74, forward selection used a constant
and mmen with PI length 2114.72, ridge regression had PI length 20336.58,
lasso and lasso variable selection used a constant, mmen, and pop with lasso
PI length 8482.62 and relaxed lasso PI length 2226.53. PI (4.14) was used.
Figure 5.1 shows the response plots for forward selection, ridge regression,
lasso, and lasso variable selection. The plots for PLS=PCR=OLS full model
were similar to those of forward selection and lasso variable selection. The
plots suggest that the MLR model is appropriate since the plotted points
scatter about the identity line. The 90% pointwise prediction bands are also
shown, and consist of two lines parallel to the identity line. These bands are
very narrow in Figure 5.1 a) and d).
Following Hastie et al. (2015, p. 57), let β = (β1, βS^T)^T, let λ1,n ≥ 0, and let α ∈ [0, 1]. Let

Z = W η + e    (5.21)

and let ZA = (Z^T, 0^T)^T and WA be the matrix with W stacked on top of √λ1 I_{p−1}, so that WA η = ((W η)^T, (√λ1 η)^T)^T and ZA^T WA η = Z^T W η. Then

ZA^T ZA − ZA^T WA η − η^T WA^T ZA + η^T WA^T WA η = Z^T Z − Z^T W η − η^T W^T Z + η^T W^T W η + λ1 η^T η.

Thus

QL(η) = Z^T Z − Z^T W η − η^T W^T Z + η^T W^T W η + λ1 η^T η + λ2 ‖η‖1 = ‖ZA − WA η‖2² + λ2 ‖η‖1,

which is a lasso criterion for regressing ZA on WA.
Since lasso uses at most min(n, p − 1) nontrivial predictors, elastic net and ridge regression can perform better than lasso if the true number of active nontrivial predictors is larger than min(n, p − 1).
A critical point η̂EN of the elastic net criterion satisfies

2W^T W η̂EN − 2W^T Z + 2λ1 η̂EN + λ2 sn = 0, or

(W^T W + λ1 I_{p−1}) η̂EN = W^T Z − (λ2/2) sn, or

η̂EN = η̂R − (λ2/(2n)) n(W^T W + λ1 I_{p−1})^{−1} sn.    (5.24)

Hence

η̂EN = η̂OLS − (λ1/n) n(W^T W + λ1 I_{p−1})^{−1} η̂OLS − (λ2/(2n)) n(W^T W + λ1 I_{p−1})^{−1} sn

= η̂OLS − n(W^T W + λ1 I_{p−1})^{−1} [ (λ1/n) η̂OLS + (λ2/(2n)) sn ].
Note that if λ̂1,n/√n →P τ and α̂ →P ψ, then λ̂1/√n →P (1 − ψ)τ and λ̂2/√n →P 2ψτ. The following theorem shows elastic net is asymptotically equivalent to the OLS full model if λ̂1,n/√n →P 0. Note that we get the RR CLT if ψ = 0 and the lasso CLT (using 2λ̂1,n/√n →P 2τ) if ψ = 1. Under these conditions,

√n(η̂EN − η) = √n(η̂OLS − η) − n(W^T W + λ̂1 I_{p−1})^{−1} [ (λ̂1/√n) η̂OLS + (λ̂2/(2√n)) sn ].
Theorem 5.6, Elastic Net CLT. Assume p is fixed and that the condi-
tions of the LS CLT Equation (5.7) hold for the model Z = W η + e.
a) If λ̂1,n/√n →P 0, then

√n(η̂EN − η) →D Np−1(0, σ² V).

b) If λ̂1,n/√n →P τ ≥ 0, α̂ →P ψ ∈ [0, 1], and sn →P s = sη, then

√n(η̂EN − η) →D Np−1(−V[(1 − ψ)τ η + ψτ s], σ² V)

since

√n(η̂EN − η) →D Np−1(−(1 − ψ)τ V η, σ² V) − (2ψτ/2) V s ∼ Np−1(−V[(1 − ψ)τ η + ψτ s], σ² V).
The mean of the normal distribution is 0 under a) since α̂ and sn are bounded.
Example 5.2, continued. The slpack function enet does elastic net
using 10-fold CV and a grid of α values {0, 1/am, 2/am, ..., am/am = 1}. The
default uses am = 10. The default chose lasso with alph = 1. The function
also makes a response plot, but does not add the lines for the pointwise
prediction intervals since the false degrees of freedom d is not computed.
library(glmnet); y <- marry[,3]; x <- marry[,-3]
tem <- enet(x,y)
tem$alph
[1] 1 #elastic net was lasso
tem<-enet(x,y,am=100)
tem$alph
[1] 0.97 #elastic net was not lasso with a finer grid
The elastic net variable selection estimator applies OLS to a constant
and the active predictors that have nonzero elastic net η̂i . Hence elastic net
is used as a variable selection method. Let X A denote the matrix with a
column of ones and the unstandardized active nontrivial predictors. Hence
the relaxed elastic net estimator is β̂REN = (XA^T XA)^{−1} XA^T Y, and relaxed elastic net is an alternative to forward selection. Let k be the number of active (nontrivial) predictors so β̂REN is (k + 1) × 1. Let Imin correspond to the elastic net variable selection estimator and β̂VS = β̂ENVS,0 = β̂Imin,0 to the zero padded relaxed elastic net estimator. Then by Remark 4.5 where p is fixed, β̂ENVS,0 is √n consistent when elastic net is consistent, with the limiting distribution for β̂REN,0 given by Theorem 4.4. Hence, relaxed elastic
net can be bootstrapped with the same methods used for forward selection
in Chapter 4. Elastic net variable selection will often be better than elastic
net when the model is sparse or if n ≥ 10(k + 1). The elastic net can be
better than elastic net variable selection if (X TA X A ) is ill conditioned or if
n/(k + 1) < 10. Also see Olive (2019) and Rathnayake and Olive (2020).
5.9 Prediction Intervals

This section will use the prediction intervals from Section 4.3 applied to the MLR model with m̂(x) = xI^T β̂I where I corresponds to the predictors used
by the MLR method. We will use the six methods forward selection with
OLS, PCR, PLS, lasso, relaxed lasso, and ridge regression. When p > n,
results from Hastie et al. (2015, pp. 20, 296, ch. 6, ch. 11) and Luo and Chen
(2013) suggest that lasso, relaxed lasso, and forward selection with EBIC can
perform well for sparse models: the subset S in Equation (4.1) and Remark
5.4 has aS small.
Consider d for the prediction interval (4.14). As in Chapter 4, with the
exception of ridge regression, let d be the number of “variables” used by the
method, including a constant. Hence for lasso, relaxed lasso, and forward
selection, d − 1 is the number of active predictors while d − 1 is the number
of “components” used by PCR and PLS.
Many things can go wrong with prediction. It is assumed that the test data follows the same MLR model as the training data, i.e., that the various distributions involved do not change over time. Population drift, which occurs when the population distribution does change over time, is a common way this assumption is violated.
A second thing that can go wrong is that the training or test data set is
distorted away from the population distribution. This could occur if outliers
are present or if the training data set and test data set are drawn from
different populations. For example, the training data set could be drawn
from three hospitals, and the test data set could be drawn from two more
hospitals. These two populations of three and two hospitals may differ.
A third thing that can go wrong is extrapolation: if xf is added to
x1 , ..., xn , then there is extrapolation if xf is not like the xi , e.g. xf is an
outlier. Predictions based on extrapolation are not reliable. Check whether
the Euclidean distance of xf from the coordinatewise median MED(X) of
the x1 , ..., xn satisfies Dxf (MED(X), I p ) ≤ maxi=1,...,n Di (MED(X), I p ).
Alternatively, use the ddplot5 function, described in Chapter 7, applied to
x1 , ..., xn , xf to check whether xf is an outlier.
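A minimal R sketch of the coordinatewise median distance check described above is below (assuming x is the n × p matrix of cases x1, ..., xn and xf is the new predictor vector; the ddplot5 function is not used here).

med <- apply(x, 2, median)               #coordinatewise median MED(X)
Di <- sqrt(rowSums(sweep(x, 2, med)^2))  #distances Di(MED(X), Ip)
Df <- sqrt(sum((xf - med)^2))            #distance of xf from MED(X)
Df > max(Di)   #TRUE suggests xf is unlike the xi (possible extrapolation)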
When n ≥ 10p, let the hat matrix H = X(X^T X)^{−1} X^T. Let hi = hii be the ith diagonal element of H for i = 1, ..., n. Then hi is called the ith leverage and hi = xi^T (X^T X)^{−1} xi, while the leverage of xf is hf = xf^T (X^T X)^{−1} xf. A rule of thumb is that extrapolation occurs if hf >
max(h1 , ..., hn). This rule works best if the predictors are linearly related in
that a plot of xi versus xj should not have any strong nonlinearities. If there
are strong nonlinearities among the predictors, then xf could be far from the
xi but still have hf < max(h1 , ..., hn). If the regression method, such as lasso
or forward selection, uses a set I of a predictors, including a constant, where
n ≥ 10a, the above rule of thumb could be used for extrapolation where xf ,
xi , and X are replaced by xI,f , xI,i , and X I .
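A minimal R sketch of the leverage rule of thumb follows (assuming n ≥ 10p, X is the n × p design matrix including the column of ones, and xf includes the leading 1).

XtXinv <- solve(crossprod(X))              #(X^T X)^{-1}
h <- diag(X %*% XtXinv %*% t(X))           #leverages h_1,...,h_n
hf <- as.numeric(t(xf) %*% XtXinv %*% xf)  #leverage of xf
hf > max(h)    #TRUE suggests extrapolation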
For the simulation from Pelawa Watagoda and Olive (2019b), we used
several R functions including forward selection (FS) as computed with the
regsubsets function from the leaps library, principal components regres-
sion (PCR) with the pcr function and partial least squares (PLS) with the
plsr function from the pls library, and ridge regression (RR) and lasso
with the cv.glmnet function from the glmnet library. Relaxed lasso (RL)
was applied to the selected lasso model.
Let x = (1 uT )T where u is the (p − 1) × 1 vector of nontrivial predictors.
In the simulations, for i = 1, ..., n, we generated w i ∼ Np−1 (0, I) where the
Table 5.1 Simulated Large Sample 95% PI Coverages and Lengths, ei ∼ N (0, 1)
n p ψ k FS lasso RL RR PLS PCR
100 20 0 1 cov 0.9644 0.9750 0.9666 0.9560 0.9438 0.9772
len 4.4490 4.8245 4.6873 4.5723 4.4149 5.5647
100 40 0 1 cov 0.9654 0.9774 0.9588 0.9274 0.8810 0.9882
len 4.4294 4.8889 4.6226 4.4291 4.0202 7.3393
100 200 0 1 cov 0.9648 0.9764 0.9268 0.9584 0.6616 0.9922
len 4.4268 4.9762 4.2748 6.1612 2.7695 12.412
100 50 0 49 cov 0.8996 0.9719 0.9736 0.9820 0.8448 1.0000
len 22.067 6.8345 6.8092 7.7234 4.2141 38.904
200 20 0 19 cov 0.9788 0.9766 0.9788 0.9792 0.9550 0.9786
len 4.9613 4.9636 4.9613 5.0458 4.3211 4.9610
200 40 0 19 cov 0.9742 0.9762 0.9740 0.9738 0.9324 0.9792
len 4.9285 5.2205 5.1146 5.2103 4.2152 5.3616
200 200 0 19 cov 0.9728 0.9778 0.9098 0.9956 0.3500 1.0000
len 4.8835 5.7714 4.5465 22.351 2.1451 51.896
400 20 0.9 19 cov 0.9664 0.9748 0.9604 0.9726 0.9554 0.9536
len 4.5121 10.609 4.5619 10.663 4.0017 3.9771
400 40 0.9 19 cov 0.9674 0.9608 0.9518 0.9578 0.9482 0.9646
len 4.5682 14.670 4.8656 14.481 4.0070 4.3797
400 400 0.9 19 cov 0.9348 0.9636 0.9556 0.9632 0.9462 0.9478
len 4.3687 47.361 4.8530 48.021 4.2914 4.4764
400 400 0 399 cov 0.9486 0.8508 0.5704 1.0000 0.0948 1.0000
len 78.411 37.541 20.408 244.28 1.1749 305.93
400 800 0.9 19 cov 0.9268 0.9652 0.9542 0.9672 0.9438 0.9554
len 4.3427 67.294 4.7803 66.577 4.2965 4.6533
Since 5000 runs were used, an observed coverage in [0.94, 0.96] gives no reason to doubt that the PI has the nominal coverage of 0.95. The simulation used p = 20, 40, 50, n, or 2n; ψ = 0, 1/√p, or 0.9; and k = 1, 19, or p − 1.
OLS full model fails when p = n and p = 2n, where regularity conditions
for consistent estimators are strong. The values k = 1 and k = 19 are sparse
models where lasso, relaxed lasso, and forward selection with EBIC can per-
form well when n/p is not large. If k = p − 1 and p ≥ n, then the model
is dense. When ψ = 0, the predictors are uncorrelated; when ψ = 1/√p,
the correlation goes to 0.5 as p increases and the predictors are moderately
correlated. For ψ = 0.9, the predictors are highly correlated with 1 dominant
principal component, a setting favorable for PLS and PCR. The simulated
data sets are rather small since some of the R estimators are rather slow.
The simulations were done in R. See R Core Team (2016). The results
were similar for all five error distributions, and we show some results for
the normal and shifted exponential distributions. Tables 5.1 and 5.2 show
some simulation results for PI (4.14) where forward selection used Cp for
n ≥ 10p and EBIC for n < 10p. The other methods minimized 10-fold CV. For
forward selection, the maximum number of variables used was approximately
min(dn/5e, p). Ridge regression used the same d that was used for lasso.
For n ≥ 5p, coverages tended to be near or higher than the nominal value
of 0.95. The average PI length was often near 1.3 times the asymptotically
optimal length for n = 10p and close to the optimal length for n = 100p. Cp
and EBIC produced good PIs for forward selection, and 10-fold CV produced
good PIs for PCR and PLS. For lasso and ridge regression, 10-fold CV pro-
duced good PIs if ψ = 0 or if k was small, but if both k ≥ 19 and ψ ≥ 0.5,
then 10-fold CV tended to shrink too much and the PI lengths were often
too long. Lasso did appear to select S ⊆ Imin since relaxed lasso was good.
For n/p not large, good performance needed stronger regularity conditions,
and all six methods can have problems. PLS tended to have severe undercov-
erage with small average length, but sometimes performed well for ψ = 0.9.
The PCR length was often too long for ψ = 0. If there was k = 1 active
population predictor, then forward selection with EBIC, lasso, and relaxed
lasso often performed well. For k = 19, forward selection with EBIC often
performed well, as did lasso and relaxed lasso for ψ = 0. For dense models
with k = p − 1 and n/p not large, there was often undercoverage. Here for-
ward selection would use about n/5 variables. Let d − 1 be the number of
active nontrivial predictors in the selected model. For N(0, 1) errors, ψ = 0, and d < k, an asymptotic population 95% PI has length 3.92√(k − d + 1).
Note that when the (Yi , uTi )T follow a multivariate normal distribution, ev-
ery subset follows a multiple linear regression model. EBIC occasionally had
undercoverage, especially for k = 19 or p − 1, which was usually more severe for ψ = 0.9 or 1/√p.
Tables 5.3 and 5.4 show some results for PIs (4.15) and (4.16). Here forward selection used the minimum Cp model if nH > 10p and EBIC otherwise. The
coverage was very good. Labels such as CFS and CRL used PI (4.16). For
relaxed lasso, the program sometimes failed to run for 5000 runs, e.g., if the
Table 5.2 Simulated Large Sample 95% PI Coverages and Lengths, ei ∼ EXP (1)−1
Table 5.3 Validation Residuals: Simulated Large Sample 95% PI Coverages and
Lengths, ei ∼ N (0, 1)
number of variables selected d = nH . In Table 5.3, PIs (4.15) and (4.16) are
asymptotically equivalent, but PI (4.16) had shorter lengths for moderate
n. In Table 5.4, PI (4.15) is shorter than PI (4.16) asymptotically, but for
moderate n, PI (4.16) was often shorter.
Table 5.5 shows some results for PIs (4.14) and (4.15) for lasso and ridge
regression. The header lasso indicates PI (4.14) was used while vlasso indi-
cates that PI (4.15) was used. PI (4.15) tended to work better when the fit
was poor while PI (4.14) was better for n = 2p and k = p − 1. The PIs are
asymptotically equivalent for consistent estimators.
Table 5.4 Validation Residuals: Simulated Large Sample 95% PI Coverages and
Lengths, ei ∼ EXP (1) − 1
Table 5.5 PIs (4.14) and (4.15): Simulated Large Sample 95% PI Coverages and
Lengths
5.10 Cross Validation

For MLR variable selection there are many methods for choosing the final submodel, including AIC, BIC, Cp, and EBIC. See Section 4.1. Variable selection is a special case of model selection where there are M models and a final model needs to be chosen. Cross validation is a common criterion for model selection.
Definition 5.9. For k-fold cross validation (k-fold CV), randomly divide
the training data into k groups or folds of approximately equal size nj ≈ n/k
for j = 1, ..., k. Leave out the first fold, fit the statistical method to the k − 1
remaining folds, and then compute some criterion for the first fold. Repeat
for folds 2, ..., k.
Then CV(k) ≡ CV(k) (Ii ) is computed for i = 1, ..., M , and the model Ic with
the smallest CV(k)(Ii ) is selected.
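A minimal sketch of k-fold CV for choosing among M candidate OLS submodels is below (this is not the book's pifold function; x is n × q, y is n × 1, and models is a hypothetical list of column-index vectors, one per candidate submodel). The criterion here is the CV estimate of the mean squared prediction error.

kfoldcv <- function(x, y, models, k = 5) {
  n <- length(y)
  folds <- sample(rep(1:k, length.out = n)) #random folds of size about n/k
  cv <- numeric(length(models))
  for (i in seq_along(models)) {
    sse <- 0
    for (j in 1:k) {
      test <- which(folds == j)
      fit <- lsfit(x[-test, models[[i]], drop = FALSE], y[-test])
      pred <- cbind(1, x[test, models[[i]], drop = FALSE]) %*% fit$coefficients
      sse <- sse + sum((y[test] - pred)^2)
    }
    cv[i] <- sse/n
  }
  which.min(cv)  #index of the model Ic with the smallest CV criterion
}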
Assume that model (4.1) holds: Y = xT β + e = xTS β S + e where β S is an
aS × 1 vector. Suppose p is fixed and n → ∞. If β̂ I is a × 1, form the p × 1
vector β̂ I,0 from β̂ I by adding 0s corresponding to the omitted variables. If
P (S ⊆ Imin ) → 1 as n → ∞, then Theorem 4.4 and Remark 4.5 showed that
β̂Imin,0 is a √n consistent estimator of β under mild regularity conditions.
Note that if aS = p, then β̂ Imin ,0 is asymptotically equivalent to the OLS full
model β̂ (since S is equal to the full model).
coverage and average 95% PI length to compare the forward selection models.
All 4 models had coverage 1, but the average 95% PI lengths were 2591.243,
2741.154, 2902.628, and 2972.963 for the models with 2 to 5 predictors. See
the following R code.
y <- marry[,3]; x <- marry[,-3]
x1 <- x[,2]
x2 <- x[,c(2,3)]
x3 <- x[,c(1,2,3)]
pifold(x1,y) #nominal 95% PI
$cov
[1] 1
$alen
[1] 2591.243
pifold(x2,y)
$cov
[1] 1
$alen
[1] 2741.154
pifold(x3,y)
$cov
[1] 1
$alen
[1] 2902.628
pifold(x,y)
$cov
[1] 1
$alen
[1] 2972.963
#Validation PIs for submodels: the sample size is
#likely too small and the validation PI is formed
#from the validation set.
n<-dim(x)[1]
nH <- ceiling(n/2)
indx<-1:n
perm <- sample(indx,n)
H <- perm[1:nH]
vpilen(x1,y,H) #13/13 were in the validation PI
$cov
[1] 1.0
$len
[1] 116675.4
vpilen(x2,y,H)
$cov
[1] 1.0
$len
[1] 116679.8
vpilen(x3,y,H)
$cov
[1] 1.0
$len
[1] 116312.5
vpilen(x,y,H)
$cov
[1] 1.0
$len #shortest length
[1] 116270.7
Some more code is below.
n <- 100
p <- 4
k <- 1
q <- p-1
x <- matrix(rnorm(n * q), nrow = n, ncol = q)
b <- 0 * 1:q
b[1:k] <- 1
y <- 1 + x %*% b + rnorm(n)
x1 <- x[,1]
x2 <- x[,c(1,2)]
x3 <- x[,c(1,2,3)]
pifold(x1,y)
$cov
[1] 0.96
$alen
[1] 4.2884
pifold(x2,y)
$cov
[1] 0.98
$alen
[1] 4.625284
pifold(x3,y)
$cov
[1] 0.98
$alen
[1] 4.783187
pifold(x,y)
$cov
[1] 0.98
$alen
[1] 4.713151
n <- 10000
p <- 4
k <- 1
q <- p-1
x <- matrix(rnorm(n * q), nrow = n, ncol = q)
b <- 0 * 1:q
b[1:k] <- 1
y <- 1 + x %*% b + rnorm(n)
x1 <- x[,1]
x2 <- x[,c(1,2)]
x3 <- x[,c(1,2,3)]
pifold(x1,y)
$cov
[1] 0.9491
$alen
[1] 3.96021
pifold(x2,y)
$cov
[1] 0.9501
$alen
[1] 3.962338
pifold(x3,y)
$cov
[1] 0.9492
$alen
[1] 3.963305
pifold(x,y)
$cov
[1] 0.9498
$alen
[1] 3.96203
Section 4.6 showed how to use the bootstrap for the hypothesis test H0 : θ = Aβ = θ0 versus H1 : θ = Aβ ≠ θ0 with the statistic Tn = Aβ̂Imin,0
where β̂ Imin ,0 is the zero padded OLS estimator computed from the variables
corresponding to Imin . The theory needs P (S ⊆ Imin ) → 1 as n → ∞, and
hence applies to OLS variable selection with AIC, BIC, and Cp , and to relaxed
lasso and relaxed elastic net if lasso and elastic net are consistent.
Assume n ≥ 20p and that the error distribution is unimodal and not highly
skewed. The response plot and residual plot are plots with Ŷ = xT β̂ on the
horizontal axis and Y or r on the vertical axis, respectively. Then the plotted
points in these plots should scatter in roughly even bands about the identity
line (with unit slope and zero intercept) and the r = 0 line, respectively.
See Figure 1.1. If the plots for the OLS full model suggest that the error
distribution is skewed or multimodal, then much larger sample sizes may be
needed.
Let p be fixed. Then lasso is asymptotically equivalent to OLS if λ̂1n/√n → 0, and hence should not have any β̂i = 0, asymptotically. If aS < p, then lasso tends not to be √n consistent if lasso selects S with high probability by Ewald and Schneider (2018), but then relaxed lasso tends to be √n consistent. If λ̂1n/n → 0, then lasso is consistent so P(S ⊆ I) → 1 as n → ∞. Hence often if lasso has more than one β̂i = 0, then lasso is not √n consistent.
Suppose we use the residual bootstrap where Y* = X β̂OLS + r^W follows a standard linear model where the elements ri^W of r^W are iid from the empirical distribution of the OLS full model residuals ri. In Section 4.6 we used forward selection when regressing Y* on X, but we could use lasso or ridge regression instead. Since these estimators are consistent if λ̂1n/n → 0 as n → ∞, we expect β̂*L and β̂*R to be centered at β̂OLS. If the variability of the β̂* is similar to or greater than that of β̂OLS, then by the geometric argument Theorem 4.5, we might get simulated coverage close to or higher than the nominal. If lasso or ridge regression shrink β̂* too much, then the coverage could be bad. In limited simulations, the prediction region method only simulated well for ridge regression with ψ = 0. Results from Ewald and Schneider (2018, p. 1365) suggest that the lasso confidence region volume is greater than the OLS confidence region volume when lasso uses λ1n = √n/2.
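A minimal sketch of this residual bootstrap with lasso is below (assuming the glmnet library; this is not one of the linmodpack simulation functions mentioned below). The rows of betas could then be used for shorth confidence intervals or the prediction region method.

library(glmnet)
ols <- lsfit(x, y)
fit <- y - ols$residuals                 #OLS full model fitted values
B <- 200
betas <- matrix(0, nrow = B, ncol = ncol(x) + 1)
for (b in 1:B) {
  ystar <- fit + sample(ols$residuals, length(y), replace = TRUE)
  out <- cv.glmnet(as.matrix(x), ystar)  #lasso with 10-fold CV
  betas[b, ] <- as.vector(coef(out, s = "lambda.min"))
}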
A small simulation was done for confidence intervals and confidence re-
gions, using the same type of data as for the variable selection simula-
tion in Section 4.6 and the prediction interval simulation in Section 5.9,
with B = max(1000, n, 20p) and 5000 runs. The regression model used
β = (1, 1, 0, 0)T with n = 100 and p = 4. When ψ = 0, the design matrix
X consisted of iid N(0,1) random variables. See Table 5.6 which was taken
from Pelawa Watagoda (2017). The residual bootstrap was used. Types 1)–
5) correspond to types i)–v), and the value only applies to the type 5)
error distribution. The function lassobootsim3 uses the prediction region
method for lasso and ridge regression. The function lassobootsim4 can
be used to simulate confidence intervals for the βi if S*T is singular for lasso.
The test was for H0 : (β3 , β4 )T = (0, 0)T .
5.12 Data Splitting

A common method for data splitting randomly divides the data set into two half sets. On the first half set, fit the model selection method, e.g. forward
selection or lasso, to get the a predictors. Use this model as the full model
for the second half set: use the standard OLS inference from regressing the
response on the predictors found from the first half set. This method can
be inefficient if n ≥ 10p, but is useful for a sparse model if n ≤ 5p, if the
probability that the model underfits goes to zero, and if n ≥ 20a. A model is
sparse if the number of predictors with nonzero coefficients is small.
For lasso, the active set I from the first half set (training data) is found,
and the data splitting estimator is the OLS estimator β̂I,D computed from the
second half set (test data). This estimator is not the relaxed lasso estimator.
The estimator β̂ I,D has the same large sample theory as if I was chosen
before obtaining the data.
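A minimal sketch of this data splitting estimator is below (assuming the glmnet library, an n × q predictor matrix x, and response y): lasso chooses the active set I on the first half set, and OLS with the usual inference is then fit on the second half set using only those predictors.

library(glmnet)
n <- length(y)
H <- sample(1:n, ceiling(n/2))             #training half set
out <- cv.glmnet(as.matrix(x[H, ]), y[H])  #lasso with 10-fold CV
acoef <- as.vector(coef(out, s = "lambda.min"))[-1]
I <- which(acoef != 0)                     #active nontrivial predictors
summary(lm(y[-H] ~ ., data = data.frame(x[-H, I, drop = FALSE])))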
5.13 Summary
Ru = W^T W / n.
Then regression through the origin is used for the model Z = W η + e where the vector of fitted values Ŷ = Ȳ + Ẑ. Thus the centered response Zi = Yi − Ȳ and Ŷi = Ẑi + Ȳ. Then η̂ does not depend on the units of
measurement of the predictors. Linear combinations of the ui can be written
as linear combinations of the xi , hence β̂ can be found from η̂.
6) A model for variable selection is xT β = xTS β S + xTE β E = xTS β S where
x = (xTS , xTE )T , xS is an aS × 1 vector, and xE is a (p − aS ) × 1 vector. Let
xI be the vector of a terms from a candidate subset indexed by I, and let xO
be the vector of the remaining predictors (out of the candidate submodel). If
S ⊆ I, then xT β = xTS β S = xTS β S + xTI/S β (I/S) + xTO 0 = xTI β I where xI/S
denotes the predictors in I that are not in S. Since this is true regardless
of the values of the predictors, β O = 0 if S ⊆ I. Note that β E = 0. Let
kS = aS − 1 = the number of population active nontrivial predictors. Then
k = a − 1 is the number of active predictors in the candidate submodel I.
7) Let Q(η) be a real valued function of the k × 1 vector η. The gradient
of Q(η) is the k × 1 vector
∇Q = ∇Q(η) = ∂Q/∂η = ∂Q(η)/∂η = ( ∂Q(η)/∂η1, ∂Q(η)/∂η2, ..., ∂Q(η)/∂ηk )^T.
θ̂n ∼ ANg(θ, V/n), and A θ̂n + c ∼ ANk(Aθ + c, A V A^T / n).
ii) and v) are the Lasso CLT, ii) and iv) are the RR CLT, and ii) and iii)
are the EN CLT.
16) Under the conditions of 15), relaxed lasso = VS-lasso and relaxed elastic net = VS-elastic net are √n consistent under much milder conditions than lasso and elastic net, since the relaxed estimators are √n consistent when
lasso and elastic net are consistent. Let Imin correspond to the predictors
chosen by lasso, elastic net, or forward selection, including a constant. Let
β̂Imin be the OLS estimator applied to these predictors, let β̂ Imin ,0 be the
zero padded estimator. The large sample theory for β̂ Imin ,0 (from forward
selection, relaxed lasso, and relaxed elastic net) is given by Theorem 4.4.
Note that the large sample theory for the estimators β̂ is given for p × 1
vectors. The theory for η̂ is given for (p − 1) × 1 vectors. In particular, the
theory for lasso and elastic net does not cast away the η̂i = 0.
17) Under Equation (4.1) with p fixed, if lasso or elastic net are consistent, then P(S ⊆ Imin) → 1 as n → ∞. Hence when lasso and elastic net do variable selection, they are often not √n consistent.
18) Refer to 6). a) The OLS full model tends to be useful if n ≥ 10p with
large sample theory better than that of lasso, ridge regression, and elastic
net. Testing is easier and the Olive (2007) PI tailored to the OLS full model
will work better for smaller sample sizes than PI (4.14) if n ≥ 10p. If n ≥ 10p
but X T X is singular or ill conditioned, other methods can perform better.
Forward selection, relaxed lasso, and relaxed elastic net are competitive
with the OLS full model even when n ≥ 10p and X T X is well conditioned.
If n ≤ p then OLS interpolates the data and is a poor method. If n = Jp,
then as J decreases from 10 to 1, other methods become competitive.
b) If n ≥ 10p and kS < p − 1, then forward selection can give more precise
inference than the OLS full model. When n/p is small, the PI (4.14) for
forward selection can perform well if n/kS is large. Forward selection can
be worse than ridge regression or elastic net if kS > min(n/J, p). Forward
selection can be too slow if both n and p are large. Forward selection, relaxed
lasso, and relaxed elastic net tend to be bad if (X TA X A )−1 is ill conditioned
where A = Imin .
c) If n ≥ 10p, lasso can be better than the OLS full model if X T X is ill
conditioned. Lasso seems to perform best if kS is not much larger than 10
or if the nontrivial predictors are orthogonal or uncorrelated. Lasso can be
outperformed by ridge regression or elastic net if kS > min(n, p − 1).
d) If n ≥ 10p ridge regression and elastic net can be better than the OLS
full model if X T X is ill conditioned. Ridge regression (and likely elastic net)
seems to perform best if kS is not much larger than 10 or if the nontrivial
predictors are orthogonal or uncorrelated. Ridge regression and elastic net
can outperform lasso if kS > min(n, p − 1).
e) The PLS PI (4.14) can perform well if n ≥ 10p if some of the other five
methods used in the simulations start to perform well when n ≥ 5p. PLS may
or may not be inconsistent if n/p is not large. Ridge regression tends to be
5.14 Complements
Good references for forward selection, PCR, PLS, ridge regression, and lasso
are Hastie et al. (2009, 2015), James et al. (2013), Olive (2019), Pelawa
Watagoda (2017) and Pelawa Watagoda and Olive (2019b). Also see Efron
and Hastie (2016). An early reference for forward selection is Efroymson
(1960). Under strong regularity conditions, Gunst and Mason (1980, ch. 10)
covers inference for ridge regression (and a modified version of PCR) when
the iid errors ei ∼ N (0, σ 2 ).
Xu et al. (2011) notes that sparse algorithms are not stable. Belsley (1984)
shows that centering can mask ill conditioning of X.
Classical principal component analysis based on the correlation matrix can
be done using the singular value decomposition (SVD) of the scaled matrix
W_S = W_g/√(n − 1) using êi and λ̂i = σi² where λ̂i = λ̂i(W_S^T W_S) is the ith
eigenvalue of W TS W S . Here the scaling is using g = 1. For more information
about the SVD, see Datta (1995, pp. 552-556) and Fogel et al. (2013).
There is massive literature on variable selection and a fairly large literature
for inference after variable selection. See, for example, Bertsimas et al. (2016),
Fan and Lv (2010), Ferrari and Yang (2015), Fithian et al. (2014), Hjort and
Claeskins (2003), Knight and Fu (2000), Lee et al. (2016), Leeb and Pötscher
(2005, 2006), Lockhart et al. (2014), Qi et al. (2015), and Tibshirani et al.
(2016).
For post-selection inference, the methods in the literature are often for mul-
tiple linear regression assuming normality, or are asymptotically equivalent
to using the full model, or find a quantity to test that is not Aβ. Typically
the methods have not been shown to perform better than data splitting. See
Ewald and Schneider (2018). When n/p is not large, inference is currently
much more difficult. Under strong regularity conditions, lasso and forward
selection with EBIC can work well. Leeb et al. (2015) suggests that the Berk
et al. (2013) method does not really work. Also see Dezeure et al. (2015),
Javanmard and Montanari (2014), Lu et al. (2017), Tibshirani et al. (2016),
van de Geer et al. (2014), and Zhang and Cheng (2017). Fan and Lv (2010)
gave large sample theory for some methods if p = o(n1/5 ). See Tibshirani et
al. (2016) for an R package.
Warning: For n < 5p, every estimator is unreliable, to my knowledge.
Regularity conditions for consistency are strong if they exist. For example,
PLS is sometimes inconsistent and sometimes √n consistent. Validating the
MLR estimator with PIs can help. Also make response and residual plots.
Full OLS Model: A sufficient condition for β̂ OLS to be a consistent
estimator of β is Cov(β̂ OLS ) = σ 2 (X T X)−1 → 0 as n → ∞. See Lai et al.
(1979).
Forward Selection: See Olive and Hawkins (2005), Pelawa Watagoda
and Olive (2019ab), and Rathnayake and Olive (2019).
Principal Components Regression: Principal components are Karhunen
Loeve directions of centered X. See Hastie et al. (2009, p. 66). A useful PCR
paper is Cook and Forzani (2008).
Partial Least Squares: PLS was introduced by Wold (1975). Also see
Wold (1985, 2006). Two useful papers are Cook et al. (2013) and Cook and Su (2016). PLS tends to be √n consistent if p is fixed and n → ∞. If p > n, under two sets of strong regularity conditions, PLS can be √n consistent
or inconsistent. See Chun and Keleş (2010), Cook (2018), Cook and Forzani
(2018, 2019), and Cook et al. (2013). Denham (1997) suggested a PI for PLS
that assumes the number of components is selected in advance.
Ridge Regression: An important ridge regression paper is Hoerl and
Kennard (1970). Also see Gruber (1998). Ridge regression is known as
Tikhonov regularization in the numerical analysis literature.
Lasso: Lasso was introduced by Tibshirani (1996). Efron et al. (2004)
and Tibshirani et al. (2012) are important papers. Su et al. (2017) note some
problems with lasso. If n/p is large, see Knight and Fu (2000) for the residual
bootstrap with OLS full model residuals. Camponovo (2015) suggested that
the nonparametric bootstrap does not work for lasso. Chatterjee and Lahiri
(2011) stated that the residual bootstrap with lasso does not work. Hall et
al. (2009) stated that the residual bootstrap with OLS full model residuals
does not work, but the m out of n residual bootstrap with OLS full model
residuals does work. Rejchel (2016) gave a good review of lasso theory. Fan
and Lv (2010) reviewed large sample theory for some alternative methods.
See Lockhart et al. (2014) for a partial remedy for hypothesis testing with
lasso. The Ning and Liu (2017) method needs a log likelihood. Knight and
Fu (2000) gave theory for fixed p.
Regularity conditions for testing are strong. Often lasso tests assume that
Y and the nontrivial predictors follow a multivariate normal (MVN) distri-
bution. For the MVN distribution, the MLR model tends to be dense not
sparse if n/p is small.
Lasso Variable Selection:
Applying OLS on a constant and the k nontrivial predictors that have
nonzero lasso η̂i is called lasso variable selection. We want n ≥ 10(k + 1).
If λ1 = 0, a variant of lasso variable selection computes the OLS submodel
for the subset corresponding to λi for i = 1, ..., M . If Cp is used, then this
variant has large sample theory given by Theorem 2.4.
Lasso can also be used for other estimators, such as generalized linear
models (GLMs). Then lasso variable selection is the “classical estimator,”
such as a GLM, applied to the lasso active set. In other words, use lasso
variable selection as a variable selection method. For prediction, lasso variable
selection is often better than lasso, but sometimes lasso is better.
See Meinshausen (2007) for the relaxed lasso method with R package
relaxo for MLR: apply lasso with penalty λ to get a subset of variables
with nonzero coefficients. Then reduce the shrinkage of the nonzero elements
by applying lasso again to the nonzero coefficients but with a smaller penalty
φ. This two stage estimator could be used for other estimators. Lasso variable
selection corresponds to the limit as φ → 0.
Dense Regression or Abundant Regression: occurs when most of the
predictors contribute to the regression. Hence the regression is not sparse. See
Cook et al. (2013).
Other Methods: Consider the MLR model Z = W η + e. Let λ ≥ 0 be
a constant and let q ≥ 0. The estimator η̂ q minimizes the criterion
Qq(b) = r(b)^T r(b) + λ Σ_{j=1}^{p−1} |bj|^q,    (5.27)
Following Sun and Zhang (2012), let (5.6) hold and let

Q(η) = (1/(2n)) (Z − W η)^T (Z − W η) + λ² Σ_{i=1}^{p−1} ρ(|ηi|/λ)

where ρ is scaled such that the derivative ρ′(0+) = 1. As for lasso and elastic net, let sj = sgn(η̂j) where sj ∈ [−1, 1] if η̂j = 0. Let ρ′j = ρ′(|η̂j|/λ) if η̂j ≠ 0, and ρ′j = 1 if η̂j = 0. Then η̂ is a critical point of Q(η) iff wj^T(Z − W η̂) = nλ sj ρ′j for j = 1, ..., p − 1. If ρ is convex, then these conditions are the KKT conditions. Let dj = sj ρ′j. Then W^T Z − W^T W η̂ = nλ d, and η̂ = η̂OLS − nλ(W^T W)^{−1} d. If the dj are bounded, then η̂ is consistent if λ → 0 as n → ∞, and η̂ is asymptotically equivalent to η̂OLS if n^{1/2} λ → 0. Note that ρ(t) = t for t > 0 gives lasso with λ = λ1,n/(2n).
Gao and Huang (2010) give theory for a LAD–lasso estimator, and Qi et
al. (2015) is an interesting lasso competitor.
Multivariate linear regression has m ≥ 2 response variables. See Olive
(2017ab: ch. 12). PLS also works if m ≥ 1, and methods like ridge regression
and lasso can also be extended to multivariate linear regression. See, for ex-
ample, Haitovsky (1987) and Obozinski et al. (2011). Sparse envelope models
are given in Su et al. (2016).
AIC and BIC Type Criterion:
Olive and Hawkins (2005) and Burnham and Anderson (2004) are useful
references when p is fixed. Some interesting theory for AIC appears in Zhang
(1992ab). Zheng and Loh (1995) show that BICS can work if p = pn =
o(log(n)) and there is a consistent estimator of σ 2 . For the Cp criterion, see
Jones (1946) and Mallows (1973).
AIC and BIC type criterion and variable selection for high dimensional re-
gression are discussed in Chen and Chen (2008), Fan and Lv (2010), Fujikoshi
et al. (2014), and Luo and Chen (2013). Wang (2009) suggests using
See Bogdan et al. (2004), Cho and Fryzlewicz (2012), and Kim et al. (2012).
Luo and Chen (2013) state that WBIC(I) needs p/n^a < 1 for some 0 < a < 1.
If n/p is large and one of the models being considered is the true model
S (shown to occur with probability going to one only under very strong
assumptions by Wieczorek and Lei (2021)), then BIC tends to outperform
AIC. If none of the models being considered is the true model, then AIC
tends to outperform BIC. See Yang (2003).
Robust Versions: Hastie et al. (2015, pp. 26-27) discuss some modifica-
tions of lasso that are robust to certain types of outliers. Robust methods
for forward selection and LARS are given by Uraibi et al. (2017, 2019) that
need n >> p. If n is not much larger than p, then Hoffman et al. (2015)
have a robust Partial Least Squares–Lasso type estimator that uses a clever
weighting scheme.
Degrees of Freedom:
A formula for the model degrees of freedom df tends to be given for a model when there is no model selection or variable selection. For many estimators,
the degrees of freedom is not known if model selection is used. A d for PI
(4.15) is often obtained by plugging in the degrees of freedom formula as if
model selection did not occur. Then the resulting d is rarely an actual degrees
of freedom. As an example, if Ŷ = H λ Y , then often df = trace(H λ ) if λ is
selected before examining the data. If model selection is used to pick λ̂, then
d = trace(H λ̂ ) is not the model degrees of freedom.
5.15 Problems
η̂R ≈ η̂OLS − (λ/n) V η̂OLS.
where λ1,n ≥ 0, a > 0, and j > 0 are known constants. Consider the regres-
sion methods OLS, forward selection, lasso, PLS, PCR, ridge regression, and
relaxed lasso.
a) Which method corresponds to j = 1?
b) Which method corresponds to j = 2?
c) Which method corresponds to λ1,n = 0?
Z = Gη + e (5.28)
G = (1/√n) W.
Following Zou and Hastie (2005), the naive elastic net η̂N estimator is the
minimizer of
QN(η) = RSS(η) + λ2* ‖η‖2² + λ1* ‖η‖1    (5.29)
where λ∗i ≥ 0. The term “naive” is used because the elastic net estimator
is better. Let τ = λ2*/(λ1* + λ2*), γ = λ1*/√(1 + λ2*), and ηA = √(1 + λ2*) η. Let the (n + p − 1) × (p − 1) augmented matrix GA and the (n + p − 1) × 1 augmented response vector ZA be defined by stacking rows:

GA = [ G ; √λ2* I_{p−1} ],  and  ZA = [ Z ; 0 ],

where 0 is the (p − 1) × 1 zero vector. Let η̂A = √(1 + λ2*) η̂ be obtained from the lasso of ZA on GA: that is, η̂A minimizes

QG(η) = η^T G^T G η/(1 + λ2*) − 2Z^T G η + (λ2*/(1 + λ2*)) ‖η‖2² + λ1* ‖η‖1,
and hence is not the elastic net estimator corresponding to Equation (5.22).)
where λi ≥ 0 for i = 1, 2.
a) Which values of λ1 and λ2 correspond to ridge regression?
b) Which values of λ1 and λ2 correspond to lasso?
c) Which values of λ1 and λ2 correspond to elastic net?
d) Which values of λ1 and λ2 correspond to the OLS full model?
5.8. For the output below, an asterisk means the variable is in the model.
All models have a constant, so model 1 contains a constant and mmen.
a) List the variables, including a constant, that models 2, 3, and 4 contain.
b) The term out$cp lists the Cp criterion. Which model (1, 2, 3, or 4) is
the minimum Cp model Imin ?
c) Suppose β̂Imin = (241.5445, 1.001)T . What is β̂ Imin ,0 ?
Selection Algorithm: forward #output for Problem 5.8
pop mmen mmilmen milwmn
1 ( 1 ) " " "*" " " " "
2 ( 1 ) " " "*" "*" " "
3 ( 1 ) "*" "*" "*" " "
4 ( 1 ) "*" "*" "*" "*"
out$cp
[1] -0.8268967 1.0151462 3.0029429 5.0000000
5.9. Consider the output for Example 4.7 for the OLS full model. The
column resboot gives the large sample 95% CI for βi using the shorth applied
∗
to the β̂ij for j = 1, ..., B using the residual bootstrap. The standard large
sample 95% CI for βi is β̂i ±1.96SE(β̂i ). Hence for β2 corresponding to L, the
standard large sample 95% CI is −0.001 ± 1.96(0.002) = −0.001 ± 0.00392 =
[−0.00492, 0.00292] while the shorth 95% CI is [−0.005, 0.004].
a) Compute the standard 95% CIs for βi corresponding to W, H, and S.
Also write down the shorth 95% CI. Are the standard and shorth 95% CIs
fairly close?
b) Consider testing H0 : βi = 0 versus HA : βi ≠ 0. If the corresponding
95% CI for βi does not contain 0, then reject H0 and conclude that the
predictor variable Xi is needed in the MLR model. If 0 is in the CI then fail
to reject H0 and conclude that the predictor variable Xi is not needed in the
MLR model given that the other predictors are in the MLR model.
Which variables, if any, are needed in the MLR model? Use the standard
CI if the shorth CI gives a different result. The nontrivial predictor variables
are L, W, H, and S.
b) The test “length” is the average length of the interval [0, D(UB)] = D(UB) where the test fails to reject H0 if D0 ≤ D(UB). The OLS full model is asymptotically normal, and hence for large enough n and B the reg len row for the test column should be near √χ²_{3,0.95} = 2.795.
Were the three values in the test column for reg within 0.1 of 2.795?
5.12. Suppose the MLR model Y = Xβ + e, and the regression method
fits Z = W η + e. Suppose Ẑ = 245.63 and Ȳ = 105.37. What is Ŷ?
5.13. To get a large sample 90% PI for a future value Yf of the response
variable, find a large sample 90% PI for a future residual and add Ŷf to the
endpoints of that PI. Suppose forward selection is used and the large
sample 90% PI for a future residual is [−778.28, 1336.44]. What is the large
sample 90% PI for Yf if β̂ Imin = (241.545, 1.001)T used a constant and the
predictor mmen with corresponding xImin ,f = (1, 75000)T ?
5.14. Table 5.8 below shows simulation results for bootstrapping OLS
(reg), lasso, and ridge regression (RR) with 10-fold CV when β = (1, 1, 0, 0)T .
The βi columns give coverage = the proportion of CIs that contained βi and
the average length of the CI. The test is for H0 : (β3 , β4 )T = 0 and H0 is
true. The “coverage” is the proportion of times the prediction region method
bootstrap test failed to reject H0 . OLS used 1000 runs while 100 runs were
used for lasso and ridge regression. Since 100 runs were used, a cov in [0.89,
1] is reasonable for a nominal value of 0.95. If the coverage for both methods
≥ 0.89, the method with the shorter average CI length was more precise.
(If one method had coverage ≥ 0.89 and the other had coverage < 0.89, we
will say the method with coverage ≥ 0.89 was more precise.) The results
for the lasso test were omitted since sometimes S ∗T was singular. (Lengths
for the test column are not comparable unless the statistics have the same
asymptotic distribution.)
a) For β3 and β4 which method, ridge regression or the OLS full model,
was better?
b) For β3 and β4 which method, lasso or the OLS full model, was more
precise?
5.15. Suppose n = 15 and 5-fold CV is used. Suppose observations are
measured for the following people. Use the output below to determine which
people are in the first fold.
folds: 4 3 4 2 1 4 3 5 2 2 3 1 5 5 1
1) Athapattu, 2) Azizi, 3) Cralley 4) Gallage, 5) Godbold, 6) Gunawar-
dana, 7) Houmadi, 8) Mahappu, 9) Pathiravasan, 10) Rajapaksha, 11)
Ranaweera, 12) Safari, 13) Senarathna, 14) Thakur, 15) Ziedzor
5.16. Table 5.9 below shows simulation results for a large sample 95% pre-
diction interval. Since 5000 runs were used, a cov in [0.94, 0.96] is reasonable
for a nominal value of 0.95. If the coverage for a method ≥ 0.94, the method
with the shorter average PI length was more precise. Ignore methods with
cov < 0.94. The MLR model had β = (1, 1, ..., 1, 0, ..., 0)T where the first
k + 1 coefficients were equal to 1. If ψ = 0 then the nontrivial predictors were
uncorrelated, but highly correlated if ψ = 0.9.
Table 5.9 Simulated Large Sample 95% PI Coverages and Lengths, ei ∼ N (0, 1)
n p ψ k FS lasso RL RR PLS PCR
100 40 0 1 cov 0.9654 0.9774 0.9588 0.9274 0.8810 0.9882
len 4.4294 4.8889 4.6226 4.4291 4.0202 7.3393
400 400 0.9 19 cov 0.9348 0.9636 0.9556 0.9632 0.9462 0.9478
len 4.3687 47.361 4.8530 48.021 4.2914 4.4764
a) Which method was most precise, given cov ≥ 0.94, when n = 100?
b) Which method was most precise, given cov ≥ 0.94, when n = 400?
5.17. When doing a PI or CI simulation for a nominal 100(1 − δ)% = 95% interval, there are m runs. For each run, a data set and interval are generated, and for the ith run Yi = 1 if µ or Yf is in the interval, and Yi = 0, otherwise. Hence the Yi are iid Bernoulli(1 − δn) random variables where 1 − δn is the true probability (true coverage) that the interval will contain µ or Yf. The observed coverage (= coverage) in the simulation is Ȳ = Σi Yi/m. The variance V(Ȳ) = σ²/m where σ² = (1 − δn)δn ≈ (1 − δ)δ ≈ (0.95)0.05 if δn ≈ δ = 0.05. Hence

SD(Ȳ) ≈ √(0.95(0.05)/m).

If the (observed) coverage is within 0.95 ± k SD(Ȳ) where the integer k is near 3, then there is no reason to doubt that the actual coverage 1 − δn differs from the nominal coverage 1 − δ = 0.95 if m ≥ 1000 (and as a crude benchmark, for m ≥ 100). In the simulation, the length of each interval is computed, and the average length is computed. For intervals with coverage ≥ 0.95 − k SD(Ȳ), intervals with shorter average length are better (have more precision).
a) If m = 5000, what is 3 SD(Ȳ), using the above approximation? Your answer should be close to 0.01.
b) If m = 1000, what is 3 SD(Ȳ), using the above approximation?
5.18. Let Yi = β1 + β2 xi2 + · · · + βp xip + ei for i = 1, ..., n where the ei are independent and identically distributed (iid) with expected value E(ei) = 0 and variance V(ei) = σ². In matrix form, this model is Y = Xβ + e. Assume X has full rank p where p < n. Let β̂R = (X^T X + λn I_p)^{−1} X^T Y = (X^T X + λn I_p)^{−1}(X^T X)(X^T X)^{−1} X^T Y where λn ≥ 0 is a constant that
may depend on n and I p is the p × p identity matrix. Let β̂ = β̂ OLS be the
ordinary least squares estimator. Let Cov(Z) = V ar(Z) be the covariance
matrix of random vector Z.
a) Find E(β̂).
b) Find E(β̂ R ).
c) Find Cov(β̂).
d) Find Cov(β̂ R ). Simplify.
e) Suppose (X T X)/n → V −1 as n → ∞. Then n(X T X)−1 → V as
n → ∞ and if λn /n → 0 as n → ∞, then (X T X + λn I p )/n → V −1 and
n(X T X + λn I p )−1 → V as n → ∞. If λn /n → 0, show nCov(β̂ R ) → σ 2 V
as n → ∞. Hint: nA−1 BA−1 = nA−1 (B/n)nA−1 .
5.19.
5.20.
R Problem
data. See Preface or Section 11.1. Typing the name of the slpack func-
tion, e.g. vsbootsim3, will display the code for the function. Use the args com-
mand, e.g. args(vsbootsim3), to display the needed arguments for the function.
For the following problem, the R command can be copied and pasted from
(http://parker.ad.siu.edu/Olive/linmodrhw.txt) into R.
5.21. The R program generates data satisfying the MLR model
Y = β1 + β2 x2 + β3 x3 + β4 x4 + e
5.23. This problem is like Problem 5.19, but ridge regression is used in-
stead of forward selection. This simulation is similar to that used to form
Table 4.2, but 100 runs are used so coverage in [0.89,1.0] suggests that the
actual coverage is close to the nominal coverage of 0.95.
The model is Y = xT β + e = xTS β S + e where β S = (β1 , β2 , ..., βk+1)T =
(β1 , β2 )T and k = 1 is the number of active nontrivial predictors in the popu-
lation model. The output for test tests H0 : (βk+2 , ..., βp)T = (β3 , ..., βp)T = 0
and H0 is true. The output gives the proportion of times the prediction region
method bootstrap test fails to reject H0 . The nominal proportion is 0.95.
After getting your output, make a table similar to Table 4.2 with 4 lines.
If your p = 5 then you need to add a column for β5 . Two lines are for reg
(the OLS full model) and two lines are for ridge regression (with 10 fold CV).
The βi columns give the coverage and lengths of the 95% CIs for βi . If the
coverage ≥ 0.89, then the shorter CI length is more precise. Were the CIs for
ridge regression more precise than the CIs for the OLS full model for β3 and
β4 ?
To get the output, copy and paste the source commands from
(http://parker.ad.siu.edu/Olive/linmodrhw.txt) into R. Copy and past the
library command for this problem into R.
If you are person j then copy and paste the R code for person j for this
problem into R.
5.21. This is like Problem 5.20, except lasso is used. If you are person j in
Problem 5.20, then copy and paste the R code for person j for this problem
into R. Make a table with 4 lines: two for OLS and 2 for lasso. Were the CIs
for lasso more precise than the CIs for the OLS full model for β3 and β4 ?
Chapter 6
What if n is not >> p?
When p > n, the fitted model should do better than i) interpolating the data
or ii) discarding all of the predictors and using the location model of Section
1.3.5 for inference. If p > n, forward selection, lasso, relaxed lasso, elastic
net, and relaxed elastic net can be useful for several regression models. Ridge
regression, partial least squares, and principal components regression can also
be computed for multiple linear regression. Sections 4.3, 5.9, and 10.7 give
prediction intervals.
One of the biggest errors in regression is to use the response variable to build the regression model using all n cases, and then do inference as if the built model was selected without using the response, e.g., selected before gathering data. Using the response variable to build the model is called data snooping; then inference is generally no longer valid, and the model built from data snooping tends to fit the data too well. In particular, do not use data snooping and then use variable selection or cross validation. See Hastie et al. (2009, p. 245) and Olive (2017a, pp. 85-89).
Building a regression model from data is one of the most challenging regres-
sion problems. The “final full model” will have response variable Y = t(Z), a
constant x1 , and predictor variables x2 = t2 (w2 , ..., wr), ..., xp = tp (w2 , ..., wr)
where the initial data consists of Z, w2 , ..., wr . Choosing t, t2 , ..., tp so that
the final full model is a useful regression approximation to the data can be
difficult.
Y ⫫ x | SP or Y ⫫ x | h(x),
where xI/S denotes the predictors in I that are not in S. Since this is true
regardless of the values of the predictors, β O = 0 if S ⊆ I.
6.2 Data Splitting
Data splitting is useful for many regression models when the n cases are in-
dependent, including multiple linear regression, multivariate linear regression
where there are m ≥ 2 response variables, generalized linear models (GLMs),
the Cox (1972) proportional hazards regression model, and parametric sur-
vival regression models.
Consider a regression model with response variable Y and a p × 1 vector
of predictors x. This model is the full model. Suppose the n cases are inde-
pendent. To perform data splitting, randomly divide the data into two sets
H and V where H has nH of the cases and V has the remaining nV = n−nH
cases i1 , ..., inV . Find a model I, possibly with data snooping or model se-
lection, using the data in the training set H. Use the model I as the full
model to perform inference using the data in the validation set V . That is,
regress YV on X V,I and perform the usual inference for the model using the
j = 1, ..., nV cases in the validation set V . If βI uses a predictors, we want
nV ≥ 10a and we want P (S ⊆ I) → 1 as n → ∞ or for (YV , X V,I ) to follow
the regression model.
In the literature, often nH ≈ ⌈n/2⌉. For model selection, use the training
data set to fit the model selection method, e.g. forward selection or lasso, to
get the a predictors. On the test set, use the standard regression inference
from regressing the response on the predictors found from the training set.
This method can be inefficient if n ≥ 10p, but is useful for a sparse model
if n ≤ 5p, if the probability that the model underfits goes to zero, and if
n ≥ 20a.
The method is simple, use one half set to get the predictors, then fit
the regression model, such as a GLM or OLS, to the validation half set
(Y V , X V,I ). The regression model needs to hold for (Y V , X V,I ) and we want
nV ≥ 10a if I uses a predictors. The regression model can hold if S ⊆ I
and the model is sparse. Let x = (x1 , ..., xp )T where x1 is a constant. If
(Y, x2 , ..., xp )T follows a multivariate normal distribution, then (Y, xI ) follows
a multiple linear regression model for every I. Hence the full model need not
be sparse, although the selected model may be suboptimal.
Of course other sample sizes than half sets could be used. For example if n = 1000p, use nH = 10p for the training set and nV = 990p for the validation set.
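As a minimal sketch (assuming the leaps library, an n × q predictor matrix x with q ≤ 8 or an added nvmax argument, and response y), forward selection could be run on the training half set and the minimum Cp submodel used as the full model for OLS inference on the validation half set:

library(leaps)
n <- length(y)
H <- sample(1:n, ceiling(n/2))            #training half set
fs <- regsubsets(x[H, ], y[H], method = "forward")
sm <- summary(fs)
best <- which.min(sm$cp)                  #minimum Cp submodel I
I <- which(sm$which[best, -1])            #active nontrivial predictors
summary(lm(y[-H] ~ ., data = data.frame(x[-H, I, drop = FALSE])))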
Remark 6.1. i) One use of data splitting is to try to transform the
p ≥ n problem into an n ≥ 10k problem. This method can work if
the model is sparse. For multiple linear regression, this method can work
if Y ∼ Nn (Xβ, σ 2 I), since then all subsets I satisfy the MLR model:
Yi = xTI,i βI + eI,i . See Remark 1.5. If βI is k × 1, we want n ≥ 10k and
V (eI,i ) = σI2 to be small. For binary logistic regression, the discriminant
function model of Definition 10.7 can be useful if xI |Y = j ∼ Nk (µj , Σ)
for j = 0, 1. Of course, the models may not be sparse, and the multivariate
normal assumptions for MLR and binary logistic regression rarely hold.
ii) Data splitting can be tricky for lasso, ridge regression, and elastic net
if the sample sizes of the training and validation sets differ. Roughly set
λ1,n1 /(2n1 ) = λ2,n2 /(2n2 ). Data splitting is much easier for variable selection
methods such as forward selection, relaxed lasso, and relaxed elastic net. Find
the variables x∗1 , ..., x∗k indexed by I from the training set, and use model I
as the full model for the validation set.
iii) Another use of data splitting is that data snooping can be used on the
training set: use the model as the full model for the validation set.
6.3 Summary
Use the model I as the full model to perform inference using the data in the
validation set V .
6.4 Complements
6.5 Problems
Chapter 7
Robust Regression
This chapter considers outlier detection and then develops robust regression
estimators. Robust estimators of multivariate location and dispersion are
useful for outlier detection and for developing robust regression estimators.
Outliers and dot plots were discussed in Chapter 3.
Definition 7.1. An outlier corresponds to a case that is far from the bulk
of the data.
The following plots and techniques will be developed in this chapter. For
the location model, use a dot plot to detect outliers. For the multivariate
location model with p = 2 make a scatterplot. For multiple linear regression
with one nontrivial predictor x, plot x versus Y . For the multiple linear
regression model, make the residual and response plots. For the multivariate
location model, make the DD plot if n ≥ 5p, and use ddplot5 if p > n. If p
is not much larger than 5, elemental sets are useful for outlier detection for
multiple linear regression and multivariate location and dispersion.
Yi = µ + ei , i = 1, . . . , n (7.1)
where e1 , ..., en are error random variables, often iid with zero mean. The
location model is used when there is one variable Y , such as height, of interest.
The location model is a special case of the multiple linear regression model
and of the multivariate location and dispersion model, where there are p
MED(n) = (Y(n/2) + Y((n/2)+1))/2 if n is even.
The notation MED(n) = MED(Y1 , ..., Yn) will also be used.
Definition 7.4. The sample median absolute deviation is
Estimators that use order statistics are common. Theory for the MAD,
median, and trimmed mean is given, for example, in Olive (2008), which
also gives confidence intervals based on the median and trimmed mean. The
shorth estimator of Section 4.3 was used for prediction intervals.
The multivariate location and dispersion (MLD) model is a special case of the
multivariate linear model, just like the location model is a special case of the
multiple linear regression model. Robust estimators of multivariate location
and dispersion are useful for detecting outliers in the predictor variables and
for developing an outlier resistant multiple linear regression estimator.
The practical, highly outlier resistant, √n consistent FCH, RFCH, and
RMVN estimators of (µ, cΣ) are developed along with proofs. The RFCH
and RMVN estimators are reweighted versions of the FCH estimator. It is
shown why competing “robust estimators” fail to work, are impractical, or are
not yet backed by theory. The RMVN and RFCH sets are defined and will be
used for outlier detection and to create practical robust methods of multiple
linear regression and multivariate linear regression. Many more applications
are given in Olive (2017b).
Warning: This section contains many acronyms, abbreviations, and es-
timator names such as FCH, RFCH, and RMVN. Often the acronyms start
with the added letter A, C, F, or R: A stands for algorithm, C for con-
centration, F for estimators that use a fixed number of trial fits, and R for
reweighted.
Y i = µ + ei , i = 1, . . . , n (7.5)
where e1 , ..., en are p × 1 error random vectors, often iid with zero mean and
covariance matrix Cov(e) = Cov(Y ) = Σ Y = Σ e .
Note that the location model is a special case of the MLD model with
p = 1. If E(e) = 0, then E(Y ) = µ. A p × p dispersion matrix is a symmetric
matrix that measures the spread of a random vector. Covariance and corre-
lation matrices are dispersion matrices. One way to get a robust estimator
of multivariate location is to stack the marginal estimators of location into
a vector. The coordinatewise median MED(W ) is an example. The sample
mean x also stacks the marginal estimators into a vector, but is not outlier
resistant.
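As a small illustration (simulated data, not from the text), the coordinatewise median resists a single gross outlier while the coordinatewise sample mean does not:

set.seed(1)
w <- matrix(rnorm(100*3), ncol = 3)  #n = 100 clean cases, p = 3
w[1, ] <- c(1000, 1000, 1000)        #replace one case with an outlier
apply(w, 2, median)                  #coordinatewise median MED(W)
apply(w, 2, mean)                    #sample mean, pulled toward the outlier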
Let µ be a p × 1 location vector and Σ a p × p symmetric dispersion
matrix. Because of symmetry, the first row of Σ has p distinct unknown
parameters, the second row has p − 1 distinct unknown parameters, the third
row has p − 2 distinct unknown parameters, ..., and the pth row has one
distinct unknown parameter for a total of 1 + 2 + · · · + p = p(p + 1)/2
unknown parameters. Since µ has p unknown parameters, an estimator (T, C)
of multivariate location and dispersion, needs to estimate p(p+3)/2 unknown
parameters when there are p random variables. If the p variables can be
transformed into an uncorrelated set then there are only 2p parameters, the
means and variances, while if the dimension can be reduced from p to p − 1,
the number of parameters is reduced by p(p + 3)/2 − (p − 1)(p + 2)/2 = p + 1.
The sample covariance or sample correlation matrices estimate these pa-
rameters very efficiently since Σ = (σij ) where σij is a population covariance
or correlation. These quantities can be estimated with the sample covariance
or correlation taking two variables Xi and Xj at a time. Note that there are
p(p + 1)/2 pairs that can be chosen from p random variables X1 , ..., Xp.
T(Z) = T(W A^T + B) = A T(W) + b,    (7.6)
The following theorem shows that the Mahalanobis distances are invariant
under affine transformations. See Rousseeuw and Leroy (1987, pp. 252-262)
for similar results. Thus if (T, C) is affine equivariant, so is
(T, D²_{(c_n)}(T, C) C), where D²_{(j)}(T, C) is the jth order statistic of the D_i².
Definition 7.8. For MLD, an elemental set J = {m1 , ..., mp+1} is a set of
p + 1 cases drawn without replacement from the data set of n cases. The ele-
mental fit (TJ , C J ) = (xJ , SJ ) is the sample mean and the sample covariance
matrix computed from the cases in the elemental set.
If the data are iid, then the elemental fit gives an unbiased but inconsistent
estimator of (E(x), Cov(x)). Note that the elemental fit uses the smallest
sample size p + 1 such that S J is nonsingular if the data are in “general
position” defined in Definition 7.10. See Definition 4.7 for the sample mean
and sample covariance matrix.
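A minimal R sketch of an elemental fit; the function name elemental_fit is hypothetical, and the sample covariance matrix S_J can be singular if the drawn cases are not in general position.

elemental_fit <- function(x) {
  x <- as.matrix(x)
  J <- sample(nrow(x), ncol(x) + 1)           # elemental set of p + 1 case indices
  list(TJ = colMeans(x[J, , drop = FALSE]),   # sample mean of the elemental set
       CJ = cov(x[J, , drop = FALSE]))        # sample covariance matrix S_J
}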
7.2.2 Breakdown
Proof Sketch. See Johnson and Wichern (1988, pp. 64-65, 184). For a),
note that rank(C −1 A) = 1, where C = B and A = ddT , since rank(C −1 A)
= rank(A) = rank(d) = 1. Hence C^{−1}A has one nonzero eigenvalue–eigenvector pair (λ_1, g_1) = (d^T B^{−1} d, B^{−1} d), and the result follows.
Definition 7.10. Let γn be the breakdown value of (T, C). High break-
down (HB) statistics have γn → 0.5 as n → ∞ if the (uncontaminated) clean
data are in general position: no more than p points of the clean data lie on
any (p − 1)-dimensional hyperplane. Estimators are zero breakdown if γn → 0
and positive breakdown if γn → γ > 0 as n → ∞.
Note that if the number of outliers is less than the number needed to cause
breakdown, then kT k is bounded and the eigenvalues are bounded away from
0 and ∞. Also, the bounds do not depend on the outliers but do depend on
the estimator (T, C) and on the clean data W .
The following result shows that a multivariate location estimator T basi-
cally “breaks down” if the d outliers can make the median Euclidean distance MED(‖w_i − T(W^n_d)‖) arbitrarily large, where w_i^T is the ith row of W^n_d. Thus a multivariate location estimator T will not break down if T cannot be driven
out of some ball of (possibly huge) radius r about the origin. For an affine
equivariant estimator, the largest possible breakdown value is n/2 or (n+1)/2
for n even or odd, respectively. Hence in the proof of the following result, we
could replace dn < dT by dn < min(n/2, dT ).
The median Euclidean distance MED(‖w_i − T(W^n_d)‖) is bounded if and only if ‖T(W^n_d)‖ is bounded, provided d_n < d_T, where d_T is the smallest number of arbitrarily bad cases that can make the median Euclidean distance MED(‖w_i − T(W^n_d)‖) arbitrarily large.
Proof. Suppose the multivariate location estimator T satisfies ‖T(W^n_d)‖ ≤ M for some constant M if d_n < d_T. Note that for a fixed data set W^n_d with ith row w_i, the median Euclidean distance MED(‖w_i − T(W^n_d)‖) ≤ max_{i=1,...,n} ‖x_i − T(W^n_d)‖ ≤ max_{i=1,...,n} ‖x_i‖ + M if d_n < d_T. Similarly, suppose MED(‖w_i − T(W^n_d)‖) ≤ M for some constant M if d_n < d_T; then ‖T(W^n_d)‖ is bounded if d_n < d_T.
a_{i,j} = [1/(c_n − 1)] Σ_{m=1}^{c_n} (z_{i,m} − z̄_i)(z_{j,m} − z̄_j).
{z : (z − T)^T C^{−1}(z − T) ≤ D²_{(c_n)}}   (7.9)
where D²_{(c_n)} is the c_n th smallest squared Mahalanobis distance based on
(T, C). This hyperellipsoid contains the cn cases with the smallest Di2 . Sup-
pose (T, C) = (xM , b S M ) is the sample mean and scaled sample covariance
matrix applied to some subset of the data where b > 0. The classical, RFCH,
and RMVN estimators satisfy this assumption. For h > 0, the hyperellipsoid
{z : (z − T)^T C^{−1}(z − T) ≤ h²} = {z : D²_z ≤ h²} = {z : D_z ≤ h}
has volume equal to 2π^{p/2} h^p √det(C) / (p Γ(p/2)). If h² = D²_{(c_n)}, then the volume is proportional to the square root of the determinant |S_M|^{1/2}, and this volume will be positive unless extreme degeneracy
is present among the cn cases. See Johnson and Wichern (1988, pp. 103-104).
Concentration algorithms are widely used since impractical brand name es-
timators, such as the MCD estimator given in Definition 7.11, take too long
to compute. The concentration algorithm, defined in Definition 7.12, uses K
starts and attractors. A start is an initial estimator, and an attractor is an
estimator obtained by refining the start. For example, let the start be the
classical estimator (x, S). Then the attractor could be the classical estima-
tor (T1 , C 1 ) applied to the half set of cases with the smallest Mahalanobis
distances. This concentration algorithm uses one concentration step, but the
process could be iterated for k concentration steps, producing an estimator (T_k, C_k).
If more than one attractor is used, then some criterion is needed to select
which of the K attractors is to be used in the final estimator. If each attractor
(Tk,j , C k,j ) is the classical estimator applied to cn ≈ n/2 cases, then the
minimum covariance determinant (MCD) criterion is often used: choose the
attractor that has the minimum value of det(C k,j ) where j = 1, ..., K.
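A minimal R sketch of a concentration step and of the MCD criterion for choosing among attractors; the function names are hypothetical and this is not the linmodpack code.

concstep <- function(x, T0, C0) {           # refine the start (T0, C0)
  d2 <- mahalanobis(x, T0, C0)              # squared distances D_i^2(T0, C0)
  half <- order(d2)[1:ceiling(nrow(x)/2)]   # cn cases with the smallest D_i^2
  list(center = colMeans(x[half, , drop = FALSE]),
       cov = cov(x[half, , drop = FALSE]))
}
conc <- function(x, k = 10) {               # k concentration steps from the classical start
  est <- list(center = colMeans(x), cov = cov(x))
  for (i in 1:k) est <- concstep(x, est$center, est$cov)
  est
}
# With K attractors, the MCD criterion keeps the attractor with the smallest
# determinant:  which.min(sapply(attractors, function(a) det(a$cov)))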
The remainder of this section will explain the concentration algorithm,
explain why the MCD criterion is useful but can be improved, provide some
theory for practical robust multivariate location and dispersion estimators,
and show how the set of cases used to compute the recommended RMVN or
RFCH estimator can be used to create outlier resistant regression estimators.
The RMVN and RFCH estimators are reweighted versions of the practical
FCH estimator, given in Definition 7.15.
Here C(n, i) = n!/[i!(n − i)!] is the binomial coefficient.
The MCD estimator is a high breakdown (HB) estimator, and the value
cn = b(n + p + 1)/2c is often used as the default. The MCD estimator is the
pair
(β̂LT S , QLT S (β̂LT S )/(cn − 1))
in the location model where LTS stands for the least trimmed sum of squares
estimator. See Section 7.6. The population analog of the MCD estimator is
closely related to the hyperellipsoid of highest concentration that contains c_n/n ≈ half of the mass. The MCD estimator is a √n consistent HB asymptotically normal estimator for (µ, a_MCD Σ) where a_MCD is some positive
constant when the data xi are iid from a large class of distributions. See
Cator and Lopuhaä (2010, 2012) who extended some results of Butler et al.
(1993).
Computing robust covariance estimators can be very expensive. For exam-
ple, to compute the exact MCD(cn ) estimator (TM CD , CM CD ), we need to
consider the C(n, cn ) subsets of size cn . Woodruff and Rocke (1994, p. 893)
noted that if 1 billion subsets of size 101 could be evaluated per second, it
would require 10^{33} millennia to search through all C(200, 101) subsets if the
sample size n = 200. See Section 7.8 for the MCD complexity.
Hence algorithm estimators will be used to approximate the robust esti-
mators. Elemental sets are the key ingredient for both basic resampling and
concentration algorithms.
i) If each attractor is a consistent estimator of (µ, a Σ), then the algorithm estimator that chooses one of the K attractors is a consistent estimator of (µ, a Σ).
ii) If all of the attractors are consistent estimators of (µ, a Σ) with the
same rate, e.g. nδ where 0 < δ ≤ 0.5, then the algorithm estimator is a
consistent estimator of (µ, a Σ) with the same rate as the attractors.
iii) If all of the attractors are high breakdown, then the algorithm estimator
is high breakdown.
iv) Suppose the data x1 , ..., xn are iid and P (xi = µ) < 1. The elemental
basic resampling algorithm estimator (k = −1) is inconsistent.
v) The elemental concentration algorithm is zero breakdown.
Proof. i) Choosing from K consistent estimators for (µ, a Σ) results in a
consistent estimator of (µ, a Σ), and ii) follows from Pratt (1959). iii) Let
γn,i be the breakdown value of the ith attractor if the clean data x1 , ..., xn are
in general position. The breakdown value γn of the algorithm estimator can
be no lower than that of the worst attractor: γn ≥ min(γn,1 , ..., γn,K ) → 0.5
as n → ∞.
iv) Let (x̄_{−1,j}, S_{−1,j}) be the classical estimator applied to a randomly drawn elemental set. Then x̄_{−1,j} is the sample mean applied to p + 1 iid cases. Hence E(S_{−1,j}) = Σ_x, E[x̄_{−1,j}] = E(x) = µ, and Cov(x̄_{−1,j}) = Cov(x)/(p + 1) = Σ_x/(p + 1) assuming second moments. So the (x̄_{−1,j}, S_{−1,j}) are identically distributed and inconsistent estimators of (µ, Σ_x). Even without second moments, there exists ε > 0 such that P(‖x̄_{−1,j} − µ‖ > ε) = δ > 0 where the probability, ε, and δ do not depend on n since the distribution of x̄_{−1,j} only depends on the distribution of the iid x_i, not on n. Then P(min_j ‖x̄_{−1,j} − µ‖ > ε) = P(all ‖x̄_{−1,j} − µ‖ > ε) → δ^K > 0 as n → ∞, where equality would hold if the x̄_{−1,j} were iid. Hence the “best start” that minimizes ‖x̄_{−1,j} − µ‖ is inconsistent.
v) The classical estimator with breakdown 1/n is applied to each elemental
start. Hence γn ≤ K/n → 0 as n → ∞.
If Wi is the number of outliers in the ith elemental set, then the Wi are
iid hypergeometric(d, n − d, h) random variables. Suppose that it is desired
to find K such that the probability P(that at least one of the elemental
sets is clean) ≡ P1 ≈ 1 − α where 0 < α < 1. Then P1 = 1− P(none of
the K elemental sets is clean) ≈ 1 − [1 − (1 − γ)^h]^K by independence. If the
contamination proportion γ is fixed, then the probability of obtaining at least
one clean subset of size h with high probability (say 1 − α = 0.8) is given by
0.8 = 1 − [1 − (1 − γ)^h]^K. Fix the number of starts K and solve this equation
for γ.
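A small R sketch of this calculation, using uniroot to solve 0.8 = 1 − [1 − (1 − γ)^h]^K for γ; the values K = 500 and h = p + 1 = 11 are only illustrative.

gam_max <- function(K, h, P1 = 0.8) {
  f <- function(g) 1 - (1 - (1 - g)^h)^K - P1   # P(at least one clean set) - P1
  uniroot(f, c(0, 0.9999))$root
}
gam_max(K = 500, h = 11)   # largest gamma handled with probability about 0.8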
The proof of the following theorem implies that a high breakdown estimator (T, C) has MED(D_i²) ≤ V and that the hyperellipsoid {x | D²_x ≤ D²_{(c_n)}}
that contains cn ≈ n/2 of the cases is in some ball about the origin of ra-
dius r, where V and r do not depend on the outliers even if the number of
outliers is close to n/2. Also the attractor of a high breakdown estimator is
a high breakdown estimator if the number of concentration steps k is fixed,
e.g. k = 10. The theorem implies that the MB estimator (TM B , C M B ) is high
breakdown.
(1/λ_1) ‖x − T‖² ≤ (x − T)^T C^{−1}(x − T) ≤ (1/λ_p) ‖x − T‖².   (7.12)
By (7.12), if the D²_{(i)} are the order statistics of the D_i²(T, C), then D²_{(i)} < V for some constant V that depends on the clean data but not on the outliers even if i and d_n are near n/2. (Note that 1/λ_p and MED(‖x_i − T‖²) are both bounded for high breakdown estimators even for d_n near n/2.)
Following Johnson and Wichern (1988, pp. 50, 103), the boundary of the set {x | D²_x ≤ h²} = {x | (x − T)^T C^{−1}(x − T) ≤ h²} is a hyperellipsoid centered at T with axes of length 2h√λ_i. Hence {x | D²_x ≤ D²_{(c_n)}} is
contained in some ball about the origin of radius r where r does not de-
pend on the number of outliers even for dn near n/2. This is the set con-
taining the cases used to compute (T0 , C 0 ). Since the set is bounded, T0
is bounded and the largest eigenvalue λ1,0 of C 0 is bounded by Theorem
7.4. The determinant det(C_MCD) of the HB minimum covariance determinant estimator satisfies 0 < det(C_MCD) ≤ det(C_0) = λ_{1,0} · · · λ_{p,0}, and λ_{p,0} > inf det(C_MCD)/λ_{1,0}^{p−1} > 0 where the infimum is over all possible data
sets with n−dn clean cases and dn outliers. Since these bounds do not depend
on the outliers even for dn near n/2, (T0 , C 0 ) is a high breakdown estimator.
Now repeat the argument with (T0 , C 0 ) in place of (T, C) and (T1 , C 1 ) in
place of (T0 , C 0 ). Then (T1 , C 1 ) is high breakdown. Repeating the argument
iteratively shows (Tk , C k ) is high breakdown.
The 4th moment assumption was used to simplify the theory, but the result likely holds
under 2nd moments. Affine equivariance is needed so that the attractor is
affine equivariant, but probably is not needed to prove consistency.
Remark 7.2. To see that the Lopuhaä (1999) theory extends to concentration where the weight function uses h² = D²_{(c_n)}(T, C), note that (T, C̃) ≡ (T, D²_{(c_n)}(T, C) C) is a consistent estimator of (µ, bΣ) where b > 0 is derived in (7.14), and the weight function I(D_i²(T, C̃) ≤ 1) is equivalent to the concentration weight function I(D_i²(T, C) ≤ D²_{(c_n)}(T, C)). As noted above Theorem 7.1, (T, C̃) is affine equivariant if (T, C) is affine equivariant. Hence the Lopuhaä (1999) theory applied to (T, C̃) with h = 1 is equivalent to the theory applied to the affine equivariant (T, C) with h² = D²_{(c_n)}(T, C).
If (T, C) is a consistent estimator of (µ, sΣ) with rate n^δ where 0 < δ ≤ 0.5, then D²(T, C) = (x − T)^T C^{−1}(x − T) = D²(µ, Σ)/s + O_P(n^{−δ}). Hence D²_{(c_n)}(T, C) estimates D²_{0.5}(µ, Σ)/s, and C̃ = D²_{(c_n)}(T, C) C estimates bΣ with
b = D²_{0.5}(µ, Σ),
the population median of D²(µ, Σ).
Theorem 7.10. Assume that (E1) holds and that (T, C) is a consistent
affine equivariant estimator of (µ, sΣ) with rate nδ where the constants s > 0
and 0 < δ ≤ 0.5. Then the classical estimator (xt,j , St,j ) computed from the
c_n ≈ n/2 cases with the smallest distances D_i(T, C) is a consistent affine
equivariant estimator of (µ, aM CD Σ) with the same rate nδ .
Proof. By Remark 7.2 the estimator is a consistent affine equivariant esti-
mator of (µ, aΣ) with rate nδ . By the remarks above, a will be the same for
any consistent affine equivariant estimator of (µ, sΣ) and a does not depend
on s > 0 or δ ∈ (0, 0.5]. Hence the result follows if a = a_MCD. The MCD estimator is a √n consistent affine equivariant estimator of (µ, a_MCD Σ) by
Cator and Lopuhaä (2010, 2012). If the MCD estimator is the start, then it
is also the attractor by Theorem 7.5 which shows that concentration does not
increase the MCD criterion. Hence a = aM CD .
Next we define the easily computed robust √n consistent FCH estimator, so named since it is fast, consistent, and uses a high breakdown attractor. The FCH and MBA estimators use the √n consistent DGK estimator (T_DGK, C_DGK) and the high breakdown MB estimator (T_MB, C_MB) as attractors.
Definition 7.15. Let the “median ball” be the hypersphere containing the
“half set” of data closest to MED(W ) in Euclidean distance. The FCH esti-
mator uses the MB attractor if the DGK location estimator TDGK is outside
of the median ball, and the attractor with the smallest determinant, other-
wise. Let (TA , C A ) be the attractor used. Then the estimator (TF CH , C F CH )
takes TF CH = TA and
C_FCH = [ MED(D_i²(T_A, C_A)) / χ²_{p,0.5} ] C_A.   (7.15)
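A minimal R sketch of the scaling step (7.15), assuming an attractor (T_A, C_A) has already been computed; scale_attractor is a hypothetical name, not a linmodpack function.

scale_attractor <- function(x, TA, CA) {
  d2 <- mahalanobis(x, TA, CA)                  # D_i^2(T_A, C_A)
  CA * median(d2) / qchisq(0.5, df = ncol(x))   # C_FCH of (7.15)
}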
The following theorem shows the FCH estimator has good statistical prop-
erties. We conjecture that FCH is high breakdown. Note that the location
estimator TF CH is high breakdown and that det(C F CH ) is bounded away
from 0 and ∞ if the data is in general position, even if nearly half of the
cases are outliers.
Definition 7.16. The RFCH estimator uses two reweighting steps. Let (µ̂_1, Σ̃_1) be the classical estimator applied to the cases with D_i²(T_FCH, C_FCH) ≤ χ²_{p,0.975}, and let
Σ̂_1 = [ MED(D_i²(µ̂_1, Σ̃_1)) / χ²_{p,0.5} ] Σ̃_1.
Then let (T_RFCH, Σ̃_2) be the classical estimator applied to the cases with D_i²(µ̂_1, Σ̂_1) ≤ χ²_{p,0.975}, and let
C_RFCH = [ MED(D_i²(T_RFCH, Σ̃_2)) / χ²_{p,0.5} ] Σ̃_2.
RMBA and RFCH are √n consistent estimators of (µ, cΣ) by Lopuhaä (1999) where the weight function uses h² = χ²_{p,0.975}, but the two estimators use nearly 97.5% of the cases if the data is multivariate normal.
Definition 7.17. The RMVN estimator also uses two reweighting steps. Let (µ̂_1, Σ̃_1) be the classical estimator applied to the n_1 cases with D_i²(T_FCH, C_FCH) ≤ χ²_{p,0.975}, let q_1 = min{0.5(0.975)n/n_1, 0.995}, and let
Σ̂_1 = [ MED(D_i²(µ̂_1, Σ̃_1)) / χ²_{p,q_1} ] Σ̃_1.
Then let (T_RMVN, Σ̃_2) be the classical estimator applied to the n_2 cases with D_i²(µ̂_1, Σ̂_1) ≤ χ²_{p,0.975}. Let q_2 = min{0.5(0.975)n/n_2, 0.995}, and
C_RMVN = [ MED(D_i²(T_RMVN, Σ̃_2)) / χ²_{p,q_2} ] Σ̃_2.
Table 7.1 Average Dispersion Matrices for Near Point Mass Outliers

RMVN:   1.002  −0.014     FMCD:   0.055   0.685
       −0.014   2.024             0.685  122.5

OGK:    0.185   0.089     MB:     2.570  −0.082
        0.089  36.24             −0.082   5.241
The RMVN estimator is a √n consistent estimator of (µ, dΣ) by Lopuhaä (1999) where the weight function uses h² = χ²_{p,0.975} and d = u_{0.5}/χ²_{p,q} where q_2 → q in probability as n → ∞. Here 0.5 ≤ q < 1 depends on the elliptically contoured distribution, but q = 0.5 and d = 1 for multivariate normal data.
If the bulk of the data is Np (µ, Σ), the RMVN estimator can give useful
estimates of (µ, Σ) for certain types of outliers where FCH and RFCH esti-
mate (µ, d_E Σ) for d_E > 1. To see this claim, let 0 ≤ γ < 0.5 be the outlier proportion. If γ = 0, then n_i/n → 0.975 and q_i → 0.5 in probability. If γ > 0, suppose
the outlier configuration is such that the Di2 (TF CH , C F CH ) are roughly χ2p
for the clean cases, and the outliers have larger Di2 than the clean cases.
Then MED(Di2 ) ≈ χ2p,q where q = 0.5/(1 − γ). For example, if n = 100 and
γ = 0.4, then there are 60 clean cases, q = 5/6, and the quantile χ2p,q is
being estimated instead of χ2p,0.5 . Now ni ≈ n(1 − γ)0.975, and qi estimates
q. Thus C RM V N ≈ Σ. Of course consistency cannot generally be claimed
when outliers are present.
Simulations suggested (TRM V N , C RM V N ) gives useful estimates of (µ, Σ)
for a variety of outlier configurations. Using 20 runs and n = 1000, the aver-
ages of the dispersion matrices were computed when the bulk of the data are
iid N_2(0, Σ) where Σ = diag(1, 2). For clean data, FCH, RFCH, and RMVN give √n consistent estimators of Σ, while FMCD and the Maronna and Za-
mar (2002) OGK estimator seem to be approximately unbiased for Σ. The
median ball estimator was scaled using (7.15) and estimated diag(1.13, 1.85).
Next the data had γ = 0.4 and the outliers had x ∼ N_2((0, 15)^T, 0.0001 I_2), a near point mass at the major axis. FCH, MB, and RFCH estimated 2.6Σ while RMVN estimated Σ. See Table 7.1.
Hubert et al. (2008, 2012) claim that FMCD computes the MCD estimator.
This claim is trivially shown to be false in the following theorem.
Theorem 7.12. Neither FMCD nor Det-MCD compute the MCD esti-
mator.
Proof. A necessary condition for an estimator to be the MCD estimator
is that the determinant of the covariance matrix for the estimator be the
smallest for every run in a simulation. Sometimes FMCD had the smaller
determinant and sometimes Det-MCD had the smaller determinant in the
simulations done by Hubert et al. (2012).
Example 7.2. Tremearne (1911) recorded height = x[,1] and height while
kneeling = x[,2] of 112 people. Figure 7.1a shows a scatterplot of the data.
Case 3 has the largest Euclidean distance of 214.767 from MED(W ) =
(1680, 1240)T , but if the distances correspond to the contours of a cover-
ing ellipsoid, then case 44 has the largest distance. For k = 0, (T0,M , C 0,M )
is the classical estimator applied to the “half set” of cases closest to MED(W )
in Euclidean distance. The hypersphere (circle) centered at MED(W ) that
covers half the data is small because the data density is high near MED(W ).
The median Euclidean distance is 59.661 and case 44 has Euclidean distance
77.987. Hence the intersection of the sphere and the data is a highly corre-
lated clean ellipsoidal region. Figure 7.1b shows the DD plot of the classical
distances versus the MB distances. Notice that both the classical and MB
estimators give the largest distances to cases 3 and 44. Notice that case 44
could not be detected using marginal methods.
[Figure 7.1: a) scatterplot of x[,1] versus x[,2] with cases 3 and 44 marked; b) DD plot of MD versus RD. Additional panels: DD plots of MD versus RD.]
RMVN                              FMCD
 0.996   0.014   0.002  −0.001     0.931   0.017   0.011   0.000
 0.014   2.012  −0.001   0.029     0.017   1.885  −0.003   0.022
 0.002  −0.001   2.984   0.003     0.011  −0.003   2.803   0.010
−0.001   0.029   0.003   3.994     0.000   0.022   0.010   3.752

RMVN                              FMCD
 0.988  −0.023  −0.007   0.021     0.227  −0.016   0.002   0.049
−0.023   1.964  −0.022  −0.002    −0.016   0.435  −0.014   0.013
−0.007  −0.022   3.053   0.007     0.002  −0.014   0.673   0.179
 0.021  −0.002   0.007   3.870     0.049   0.013   0.179  55.65
Next the data had γ = 0.4 and the outliers had x ∼ N4 (15 1, Σ), a mean
shift with the same covariance matrix as the clean cases. Again FCH and
RFCH estimated 1.93Σ while RMVN and FMCD estimated Σ.
RMVN                              FMCD
 1.013   0.008   0.006  −0.026     1.024   0.002   0.003  −0.025
 0.008   1.975  −0.022  −0.016     0.002   2.000  −0.034  −0.017
 0.006  −0.022   2.870   0.004     0.003  −0.034   2.931   0.005
−0.026  −0.016   0.004   3.976    −0.025  −0.017   0.005   4.046
Table 7.3 Number of Times Mean Shift Outliers had the Largest Distances
p γ n pm MBA FCH RFCH RMVN OGK FMCD MB
10 .1 100 4 49 49 85 84 38 76 57
10 .1 100 5 91 91 99 99 93 98 91
10 .4 100 7 90 90 90 90 0 48 100
40 .1 100 5 3 3 3 3 76 3 17
40 .1 100 8 36 36 37 37 100 49 86
40 .25 100 20 62 62 62 62 100 0 100
40 .4 100 20 20 20 20 20 0 0 100
40 .4 100 35 44 98 98 98 95 0 100
60 .1 200 10 49 49 49 52 100 30 100
60 .1 200 20 97 97 97 97 100 35 100
60 .25 200 25 60 60 60 60 100 0 100
60 .4 200 30 11 21 21 21 17 0 100
60 .4 200 40 21 100 100 100 100 0 100
[2π^{p/2} / (p Γ(p/2))] h^p √det(S_{k,j}),   (7.16)
Table 7.4 Number of Times Near Point Mass Outliers had the Largest Distances
p γ n pm MBA FCH RFCH RMVN OGK FMCD MB
10 .1 100 40 73 92 92 92 100 95 100
10 .25 100 25 0 99 99 90 0 0 99
10 .4 100 25 0 100 100 100 0 0 100
40 .1 100 80 0 0 0 0 79 0 80
40 .1 100 150 0 65 65 65 100 0 99
40 .25 100 90 0 88 87 87 0 0 88
40 .4 100 90 0 91 91 91 0 0 91
60 .1 200 100 0 0 0 0 13 0 91
60 .25 200 150 0 100 100 100 0 0 100
60 .4 200 150 0 100 100 100 0 0 100
60 .4 200 20000 0 100 100 100 64 0 100
Tables 7.3 and 7.4 help illustrate the results for the simulation. Large
counts and small pm for fixed γ suggest greater ability to detect outliers.
304 7 Robust Regression
Values of p were 5, 10, 15, ..., 60. First consider the mean shift outliers and
Table 7.3. For γ = 0.25 and 0.4, MB usually had the highest counts. For
5 ≤ p ≤ 20 and the mean shift, the OGK estimator often had the smallest
counts, and FMCD could not handle 40% outliers for p = 20. For 25 ≤ p ≤ 60,
OGK usually had the highest counts for γ = 0.05 and 0.1. For p ≥ 30, FMCD
could not handle 25% outliers even for enormous values of pm.
In Table 7.4, FCH greatly outperformed MBA although the only difference
between the two estimators is that FCH uses a location criterion as well as
the MCD criterion. OGK performed well for γ = 0.05 and 20 ≤ p ≤ 60 (not
tabled). For large γ, OGK often has large bias for cΣ. Then the outliers may
need to be enormous before OGK can detect them. Also see Table 7.2, where
OGK gave the outliers the largest distances for all runs, but C OGK does not
give a good estimate of cΣ = c diag(1, 2).
[Figure 7.3: FMCD DD Plot, RD versus MD. Figure 7.4: Resistant DD Plot, RD versus MD.]
Figure 7.3 shows that now the FMCD RD_i are highly correlated with the MD_i. The
DD plot based on the MBA estimator detects the outliers. See Figure 7.4.
For many data sets, Equation (7.10) gives a rough approximation for the
number of large outliers that concentration algorithms using K starts each
consisting of h cases can handle. However, if the data set is multivariate and
the bulk of the data falls in one compact hyperellipsoid while the outliers
fall in another hugely distant compact hyperellipsoid, then a concentration
algorithm using a single start can sometimes tolerate nearly 25% outliers.
For example, suppose that all p + 1 cases in the elemental start are outliers
but the covariance matrix is nonsingular so that the Mahalanobis distances
can be computed. Then the classical estimator is applied to the cn ≈ n/2
cases with the smallest distances. Suppose the percentage of outliers is less
than 25% and that all of the outliers are in this “half set.” Then the sample
mean applied to the cn cases should be closer to the bulk of the data than
to the cluster of outliers. Hence after a concentration step, the percentage
of outliers will be reduced if the outliers are very far away. After the next
concentration step the percentage of outliers will be further reduced and after
several iterations, all cn cases will be clean.
In a small simulation study, 20% outliers were planted for various values of
p. If the outliers were distant enough, then the minimum DGK distance for
the outliers was larger than the maximum DGK distance for the nonoutliers.
306 7 Robust Regression
Hence the outliers would be separated from the bulk of the data in a DD plot
of classical versus robust distances. For example, when the clean data comes
from the Np (0, I p ) distribution and the outliers come from the Np (2000 1, I p )
distribution, the DGK estimator with 10 concentration steps was able to
separate the outliers in 17 out of 20 runs when n = 9000 and p = 30. With
10% outliers, a shift of 40, n = 600, and p = 50, 18 out of 20 runs worked.
Olive (2004a) showed similar results for the Rousseeuw and Van Driessen
(1999) FMCD algorithm and that the MBA estimator could often correctly
classify up to 49% distant outliers. The following theorem shows that it is
very difficult to drive the determinant of the dispersion estimator from a
concentration algorithm to zero.
Proof. If all of the starts are singular, then the Mahalanobis distances
cannot be computed and the classical estimator cannot be applied to c_n
cases. Suppose that at least one start was nonsingular. Then C A and C M CD
are both sample covariance matrices applied to cn cases, but by definition
C M CD minimizes the determinant of such matrices. Hence 0 ≤ det(C M CD ) ≤
det(C A ).
Software
but in Spring 2015 this change was more likely to cause errors.
The linmodpack function
mldsim(n=200,p=5,gam=.2,runs=100,outliers=1,pm=15)
can be used to produce Tables 7.1–7.4. Change outliers to 0 to examine the
average of µ̂ and Σ̂. The function mldsim6 is similar but does not need the
library command since it compares the FCH, RFCH, MB estimators, and
the covmb2 estimator of Section 7.3.
The function covfch computes FCH and RFCH, while covrmvn
computes the RMVN and MB estimators. The function covrmb computes MB
and RMB where RMB is like RMVN except the MB estimator is reweighted
instead of FCH. Functions covdgk, covmba, and rmba compute the scaled
DGK, MBA, and RMBA estimators. Better programs would use MB if
DGK causes an error.
[Figures: scatterplots of x[,2] versus x[,1] and DD plots of rd versus MD illustrating outlier configurations.]
Both the RMVN and RFCH estimators compute the classical estimator
(xU , SU ) on some set U containing nU ≥ n/2 of the cases. Referring to Defi-
nition 7.16, for the RFCH estimator, (xU , S U ) = (TRF CH , Σ̃ 2 ), and then S U
is scaled to form C RF CH . Referring to Definition 7.17, for the RMVN esti-
mator, (xU , SU ) = (TRM V N , Σ̃ 2 ), and then S U is scaled to form C RM V N .
See Definition 7.18.
The two main ways to handle outliers are i) apply the multivariate method
to the cleaned data, and ii) plug in robust estimators for classical estimators.
Subjectively cleaned data may work well for a single data set, but we can’t
get large sample theory since sometimes too many cases are deleted (delete
outliers and some nonoutliers) and sometimes too few (do not get all of the
outliers). Practical plug in robust estimators have rarely been shown to be √n consistent and highly outlier resistant.
Using the RMVN or RFCH set U is simultaneously a plug in method and
an objective way to clean the data such that the resulting robust method is
often backed by theory. This result is extremely useful computationally: find
the RMVN set or RFCH set U , then apply the classical method to the cases
in the set U . This procedure is often equivalent to using (xU , SU ) as plug
in estimators. The method can be applied if n > 2(p + 1) but may not work
well unless n > 20p. The linmodpack function getu gets the RMVN set U as
well as the case numbers corresponding to the cases in U .
The set U is a small volume hyperellipsoid containing at least half of the
cases since concentration is used. The set U can also be regarded as the
“untrimmed data”: the data that was not trimmed by ellipsoidal trimming.
Theory has been proved for a large class of elliptically contoured distributions,
but it is conjectured that theory holds for a much wider class of distributions.
See Olive (2017b, pp. 127-128).
In simulations RFCH and RMVN seem to estimate cΣ x if x = Az + µ
where z = (z1 , ..., zp)T and the zi are iid from a continuous distribution with
variance σ 2 . Here Σ x = Cov(x) = σ 2 AAT . The bias for the MB estimator
seemed to be small. It is known that affine equivariant estimators give unbi-
ased estimators of cΣ x if the distribution of zi is also symmetric. DGK is
affine equivariant and RFCH and RMVN are asymptotically equivalent to a
scaled DGK estimator. But in the simulations the results also held for skewed
distributions.
Several illustrative applications of the RMVN set U are given next, where
the theory usually assumes that the cases are iid from a large class of ellip-
tically contoured distributions.
i) The classical estimator of multivariate location and dispersion applied to the cases in U gives (x̄_U, S_U), a √n consistent estimator of (µ, cΣ) for
some constant c > 0. See Remark 7.4.
ii) The classical estimator of the correlation matrix applied to the cases in
U gives RU , a consistent estimator of the population correlation matrix ρx .
iii) For multiple linear regression, let Y be the response variable, x1 = 1
and x2 , ..., xp be the predictor variables. Let z i = (Yi , xi2, ..., xip)T . Let U
be the RMVN or RFCH set formed using the z_i. Then a classical regression estimator applied to the cases in U results in a robust regression estimator.
iv) For the multivariate linear regression model with m response variables, let z_i = (Y_{i1}, ..., Y_{im}, x_{i2}, ..., x_{ip})^T. Let U be the RMVN or RFCH set formed using the z_i. Then a classical least
squares multivariate linear regression estimator applied to the set U results
in a robust multivariate linear regression estimator. For least squares, this is
implemented with the linmodpack function rmreg2. The method for multiple
linear regression in iii) corresponds to m = 1. See Section 8.6.
There are also several variants on the method. Suppose there are tentative
predictors Z1 , ..., ZJ . After transformations assume that predictors X1 , ..., Xk
are linearly related. Assume the set U used cases i1 , i2 , ..., inU . To add vari-
ables like Xk+1 = X12 , Xk+2 = X3 X4 , Xk+3 = gender, ..., Xp , augment
U with the variables Xk+1 , ..., Xp corresponding to cases i1 , ..., inU . Adding
variables results in cleaned data that is more likely to contain outliers.
If there are g groups (g = G for discriminant analysis, g = 2 for binary
regression, and g = p for one way MANOVA), the function getubig gets
the RMVN set Ui for each group and combines the g RMVN sets into one
large set Ubig = U1 ∪ U2 ∪ · · ·∪ Ug . Olive (2017b) has many more applications.
Now suppose the multivariate data has been collected into an n × p matrix
W = X = \begin{bmatrix} x_1^T \\ \vdots \\ x_n^T \end{bmatrix} = \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,p} \end{bmatrix} = \begin{bmatrix} v_1 & v_2 & \cdots & v_p \end{bmatrix}
where the ith row of W is the ith case xTi and the jth column v j of W
corresponds to n measurements of the jth random variable Xj for j = 1, ..., p.
Hence the n rows of the data matrix W correspond to the n cases, while the
p columns correspond to measurements on the p random variables X1 , ..., Xp.
For example, the data may consist of n visitors to a hospital where the p = 2
variables height and weight of each individual were measured.
Definition 7.19. The coordinatewise median MED(W ) = (MED(X1 ), ...,
MED(Xp ))T where MED(Xi ) is the sample median of the data in column i
corresponding to variable Xi and v i .
Example 7.4. Let the data for X1 be 1, 2, 3, 4, 5, 6, 7, 8, 9 while the data for
X2 is 7, 17, 3, 8, 6, 13, 4, 2, 1. Then MED(W ) = (MED(X1 ), MED(X2 ))T =
(5, 6)T .
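In R, the coordinatewise median of Example 7.4 can be checked with apply():

W <- cbind(X1 = 1:9, X2 = c(7, 17, 3, 8, 6, 13, 4, 2, 1))
apply(W, 2, median)   # coordinatewise median MED(W) = (5, 6)^T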
Theorem 7.14. Assume that x1 , ..., xn are iid observations from a dis-
tribution with parameters (µ, Σ) where Σ is a symmetric positive definite
matrix. Let aj > 0 and assume that (µ̂j,n , Σ̂ j,n ) are consistent estimators of
(µ, aj Σ) for j = 1, 2.
a) D²_x(µ̂_j, Σ̂_j) − (1/a_j) D²_x(µ, Σ) = o_P(1).
b) Let 0 < δ ≤ 0.5. If (µ̂_j, Σ̂_j) − (µ, a_j Σ) = O_P(n^{−δ}) and a_j Σ̂_j^{−1} − Σ^{−1} = O_P(n^{−δ}), then
D²_x(µ̂_j, Σ̂_j) − (1/a_j) D²_x(µ, Σ) = O_P(n^{−δ}).
c) Let r_n be the sample correlation of the D_i(µ̂_1, Σ̂_1) and D_i(µ̂_2, Σ̂_2) computed from the cases with small distances (thus r_n is the correlation of the distances in the “lower left corner” of the DD plot). Then r_n → 1 in probability as n → ∞.
Proof. Let Bn denote the subset of the sample space on which both Σ̂ 1,n
and Σ̂ 2,n have inverses. Then P (Bn ) → 1 as n → ∞.
a) and b): D²_x(µ̂_j, Σ̂_j) = (x − µ̂_j)^T Σ̂_j^{−1} (x − µ̂_j)
= (x − µ̂_j)^T [ −Σ^{−1}/a_j + Σ^{−1}/a_j + Σ̂_j^{−1} ] (x − µ̂_j)
= (x − µ̂_j)^T [ −Σ^{−1}/a_j + Σ̂_j^{−1} ] (x − µ̂_j) + (x − µ̂_j)^T [ Σ^{−1}/a_j ] (x − µ̂_j)
= (1/a_j) (x − µ̂_j)^T ( −Σ^{−1} + a_j Σ̂_j^{−1} ) (x − µ̂_j)
  + (x − µ + µ − µ̂_j)^T [ Σ^{−1}/a_j ] (x − µ + µ − µ̂_j)
= (1/a_j) (x − µ)^T Σ^{−1} (x − µ)
  + (2/a_j) (x − µ)^T Σ^{−1} (µ − µ̂_j) + (1/a_j) (µ − µ̂_j)^T Σ^{−1} (µ − µ̂_j)
  + (1/a_j) (x − µ̂_j)^T [ a_j Σ̂_j^{−1} − Σ^{−1} ] (x − µ̂_j)   (7.17)
on B_n, and the last three terms are o_P(1) under a) and O_P(n^{−δ}) under b).
c) Following the proof of a), D_j² ≡ D²_x(µ̂_j, Σ̂_j) → (x − µ)^T Σ^{−1} (x − µ)/a_j in probability for fixed x, and the result follows.
The above result implies that a plot of the MDi versus the Di (TA , C A ) ≡
Di (A) will follow a line through the origin with some positive slope since if
x = µ, then both the classical and the algorithm distances should be close to
zero. We want to find τ such that RDi = τ Di (TA , C A ) and the DD plot of
MDi versus RDi follows the identity line. By Theorem 7.14, the plot of MDi
versus Di (A) will follow the line segment defined by the origin (0, 0) and the
point of observed median Mahalanobis distances, (med(MDi ), med(Di (A))).
This line segment has slope
med(Di (A))/med(MDi )
which is generally not one. By taking τ = med(MDi )/med(Di (A)), the plot
will follow the identity line if (x, S) is a consistent estimator of (µ, cx Σ)
and if (TA , C A ) is a consistent estimator of (µ, aA Σ). (Using the notation
from Theorem 7.14, let (a1 , a2 ) = (cx , aA ).) The classical estimator is con-
sistent if the population has a nonsingular covariance matrix. The algorithm
{x : (x − T_R)^T C_R^{−1} (x − T_R) ≤ RD²_{(h)}}   (7.19)
where RD²_{(h)} is the hth smallest squared robust Mahalanobis distance, and which points are in a classical hyperellipsoid
{x : (x − x̄)^T S^{−1} (x − x̄) ≤ MD²_{(h)}}.   (7.20)
In the DD plot, points below RD(h) correspond to cases that are in the
hyperellipsoid given by Equation (7.19) while points to the left of M D(h) are
in a hyperellipsoid determined by Equation (7.20). In particular, we can use
the DD plot to examine which points are in the nonparametric prediction
region (4.24).
[Figure 7.12: four DD plots of MD versus RD for the artificial data sets described below.]
For this application, the RFCH and RMVN estimators may be best. For
MVN data, the RDi from the RFCH estimator tend to have a higher correla-
tion with the MDi from the classical estimator than the RDi from the FCH
estimator, and the cov.mcd estimator may be inconsistent.
Figure 7.12 shows the DD plots for 3 artificial data sets using cov.mcd.
The DD plot for 200 N3 (0, I 3 ) points shown in Figure 7.12a resembles the
identity line. The DD plot for 200 points from the elliptically contoured
distribution 0.6N3 (0, I 3 ) + 0.4N3 (0, 25 I 3 ) in Figure 7.12b clusters about a
line through the origin with a slope close to 2.0.
in the weighted DD plot will tend to one and that the points will cluster
about a line passing through the origin. For example, the plotted points in
the weighted DD plot (not shown) for the non-MVN EC data of Figure 7.12b
are highly correlated and still follow a line through the origin with a slope
close to 2.0.
Figures 7.12c and 7.12d illustrate how to use the weighted DD plot. The
ith case in Figure 7.12c is (exp(xi,1 ), exp(xi,2 ), exp(xi,3 ))T where xi is the
ith case in Figure 7.12a; i.e. the marginals follow a lognormal distribution.
The plot does not resemble the identity line, correctly suggesting that the
distribution of the data is not MVN; however, the correlation of the plotted
points is rather high. Figure 7.12d is the weighted DD plot where cases with RD_i ≥ √χ²_{3,0.975} ≈ 3.06 have been removed. Notice that the correlation of the
plotted points is not close to one and that the best fitting line in Figure 7.12d
may not pass through the origin. These results suggest that the distribution
of x is not EC.
[Figure 7.13: a) DD plot of MD versus RD with outlying cases 61–65 labeled; b) DD plot after deleting these cases.]
Consider the Buxton (1920) data and a multivariate normal model for the measurements head length, nasal height, bigonal breadth, and
cephalic index where one case has been deleted due to missing values. Figure
7.13a shows the DD plot. Five head lengths were recorded to be around 5
feet and are massive outliers. Figure 7.13b is the DD plot computed after
deleting these points and suggests that the multivariate normal distribution
is reasonable. (The recomputation of the DD plot means that the plot is not
a weighted DD plot which would simply omit the outliers and then rescale
the vertical axis.)
library(MASS)
x <- cbind(buxy,buxx)
ddplot(x,type=3) #Figure 7.13a), right click Stop
zx <- x[-c(61:65),]
ddplot(zx,type=3) #Figure 7.13b), right click Stop
Most outlier detection methods work best if n ≥ 20p, but often data sets have
p > n, and outliers are a major problem. One of the simplest outlier detection
methods uses the Euclidean distances of the xi from the coordinatewise me-
dian Di = Di (MED(W ), I p ). Concentration type steps compute the weighted
median MEDj : the coordinatewise median computed from the “half set” of
cases xi with Di2 ≤ MED(Di2 (MEDj−1 , I p )) where MED0 = MED(W ).
We often used j = 0 (no concentration type steps) or j = 9. Let Di =
Di (MEDj , I p ). Let Wi = 1 if Di ≤ MED(D1 , ..., Dn) + kMAD(D1 , ..., Dn)
where k ≥ 0 and k = 5 is the default choice. Let Wi = 0, otherwise. Using
k ≥ 0 insures that at least half of the cases get weight 1. This weighting
corresponds to the weighting that would be used in a one sided metrically
trimmed mean (Huber type skipped mean) of the distances.
Application 7.2. This outlier resistant regression method uses terms from
the following definition. Let the ith case w i = (Yi , xTi )T where the continuous
predictors from xi are denoted by ui for i = 1, ..., n. Apply the covmb2
estimator to the ui , and then run the regression method on the m cases w i
corresponding to the covmb2 set B indices i1 , ...im, where m ≥ n/2.
Definition 7.21. Let the covmb2 set B of at least n/2 cases correspond
to the cases with weight Wi = 1. The cases not in set B get weight Wi = 0.
Then the covmb2 estimator (T, C) is the sample mean and sample covariance
matrix applied to the cases in set B. Hence
T = Σ_{i=1}^n W_i x_i / Σ_{i=1}^n W_i   and   C = Σ_{i=1}^n W_i (x_i − T)(x_i − T)^T / (Σ_{i=1}^n W_i − 1).
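A minimal R sketch of Definition 7.21 with j = 0 concentration type steps and the default k = 5; covmb2_sketch is a hypothetical name and is not the linmodpack covmb2 function.

covmb2_sketch <- function(x, k = 5) {
  x <- as.matrix(x)
  med <- apply(x, 2, median)                   # coordinatewise median MED(W)
  d <- sqrt(rowSums(sweep(x, 2, med)^2))       # Euclidean distances D_i
  cut <- median(d) + k * mad(d, constant = 1)  # MED(D_i) + k MAD(D_i)
  B <- which(d <= cut)                         # covmb2 set B: cases with weight 1
  list(center = colMeans(x[B, , drop = FALSE]),
       cov = cov(x[B, , drop = FALSE]), indx = B)
}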
The covmb2 estimator can also be used for n > p. The covmb2 estimator
attempts to give a robust dispersion estimator that reduces the bias by using
a big ball about MEDj instead of a ball that contains half of the cases. The
linmodpack function getB gives the set B of cases that got weight 1 along
with the index indx of the case numbers that got weight 1. The function
ddplot5 plots the Euclidean distances from the coordinatewise median ver-
sus the Euclidean distances from the covmb2 location estimator. Typically
the plotted points in this DD plot cluster about the identity line, and outliers
appear in the upper right corner of the plot with a gap between the bulk of
the data and the outliers. An alternative for outlier detection is to replace C
by C d = diag(σ̂11 , ..., σ̂pp). For example, use σ̂ii = C ii . See Ro et al. (2015)
and Tarr et al. (2016) for references.
Example 7.8. For the Buxton (1920) data with multiple linear regression,
height was the response variable while an intercept, head length, nasal height,
bigonal breadth, and cephalic index were used as predictors in the multiple
linear regression model. Observation 9 was deleted since it had missing values.
Five individuals, cases 61–65, were reported to be about 0.75 inches tall with
head lengths well over five feet! See Problem 7.11 to reproduce the following
plots.
Fig. 7.14 Response plot for lasso and lasso applied to the covmb2 set B.
Figure 7.14a) shows the response plot for lasso. The identity line passes
right through the outliers which are obvious because of the large gap. Figure
7.14b) shows the response plot from lasso for the cases in the covmb2 set
B applied to the predictors, and the set B included all of the clean cases
and omitted the 5 outliers. The response plot was made for all of the data,
including the outliers. Prediction interval (PI) bands are also included for
both plots. Both plots are useful for outlier detection, but the method for
plot 7.14b) is better for data analysis: impossible outliers should be deleted
or given 0 weight, we do not want to predict that some people are about 0.75
inches tall, and we do want to predict that the people were about 1.6 to 1.8
meters tall. Figure 7.15 shows the DD plot made using ddplot5. The five
outliers are in the upper right corner.
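A rough R sketch of the workflow behind Figure 7.14b), assuming linmodpack and linmoddata have been sourced so that getB, buxx, and buxy are available, and that the glmnet package is installed; the exact calls are illustrative.

library(glmnet)
indx <- getB(buxx)$indx                      # covmb2 set B from the predictors
fit <- cv.glmnet(buxx[indx, ], buxy[indx])   # lasso fit to the cases in B
yhat <- predict(fit, newx = buxx, s = "lambda.min")  # fitted values, all cases
plot(yhat, buxy); abline(0, 1)               # response plot for all of the data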
Also see Problem 7.12 b) for the Gladstone (1905) data where the covmb2
set B deleted the 8 cases with the largest Di , including 5 outliers and 3 clean
cases.
[Figure 7.15: DD plot from ddplot5 for the Buxton data, RDMED (distances from the coordinatewise median) versus RDCOVMB2 (distances from the covmb2 estimator), with the five outliers in the upper right corner.]

7.4 Outlier Detection for the MLR Model
For multiple linear regression, the OLS response and residual plots are very
useful for detecting outliers. The DD plot of the continuous predictors is also
useful. Use the linmodpack functions MLRplot and ddplot4. Response and
residual plots from outlier resistant methods are also useful. See Figure 7.14.
Huber and Ronchetti (2009, p. 154) noted that efficient methods for iden-
tifying leverage groups are needed. Such groups are often difficult to detect
with regression diagnostics and residuals, but often have outlying fitted val-
ues and responses that can be detected with response and residual plots. The
following rules of thumb are useful for finding influential cases and outliers.
Look for points with large absolute residuals and for points far away from
Y . Also look for gaps separating the data into clusters. The OLS fit often
passes through a cluster of outliers, causing a large gap between a cluster
corresponding to the bulk of the data and the cluster of outliers. When such
a gap appears, it is possible that the smaller cluster corresponds to good
leverage points: the cases follow the same model as the bulk of the data. To
determine whether small clusters are outliers or good leverage points, give
zero weight to the clusters, and fit an MLR estimator such as OLS to the
bulk of the data. Denote the weighted estimator by β̂w . Then plot Ŷw versus
Y using the entire data set. If the identity line passes through the cluster,
then the cases in the cluster may be good leverage points, otherwise they
may be outliers. The trimmed views estimator of Section 7.5 is also useful.
Dragging the plots, so that they are roughly square, can be useful.
Influence diagnostics such as Cook’s distances CDi from Cook (1977) and
the weighted Cook’s distances W CDi from Peña (2005) are sometimes useful.
Although an index plot of Cook’s distance CDi may be useful for flagging
influential cases, the index plot provides no direct way of judging the model
against the data. As a remedy, cases in the response and residual plots with
CDi > min(0.5, 2p/n) are highlighted with open squares, and cases with
|W CDi − median(WCDi )| > 4.5MAD(WCDi) are highlighted with crosses,
where the median absolute deviation MAD(wi ) = median(|wi −median(wi )|).
Example 7.9. Figure 7.16 shows the response plot and residual plot for
the Buxton (1920) data. Notice that the OLS fit passes through the outliers,
but the response plot is resistant to Y –outliers since Y is on the vertical
axis. Also notice that although the outlying cluster is far from Y , only two of
the outliers had large Cook’s distance and only one case had a large W CDi .
Hence masking occurred for the Cook’s distances, the W CDi , and for the
OLS residuals, but not for the OLS fitted values. Figure 7.16 was made with
the following R commands.
source("G:/linmodpack.txt"); source("G:/linmoddata.txt")
mlrplot4(buxx,buxy) #right click Stop twice
High leverage outliers are a particular challenge to conventional numerical
MLR diagnostics such as Cook’s distance, but can often be visualized using
the response and residual plots. (Using the trimmed views of Section 7.5
is also effective for detecting outliers and other departures from the MLR
model.)
Example 7.10. Hawkins et al. (1984) gave a well known artificial data
set where the first 10 cases are outliers while cases 11-14 are good leverage
points. Figure 7.17 shows the residual and response plots based on the OLS
estimator. The highlighted cases have Cook’s distance > min(0.5, 2p/n), and
the identity line is shown in the response plot. Since the good cases 11-14
have the largest Cook’s distances and absolute OLS residuals, swamping has
occurred. (Masking has also occurred since the outliers have small Cook’s
distances, and some of the outliers have smaller OLS residuals than clean
cases.) To determine whether both clusters are outliers or if one cluster con-
sists of good leverage points, cases in both clusters could be given weight
Fig. 7.16 Plots for Buxton Data
[Figure 7.17: response plot (Y versus FIT) and residual plot (RES versus FIT) for the Hawkins et al. (1984) data, with individual cases labeled.]
zero and the resulting response plot created. (Alternatively, response plots
based on the tvreg estimator of Section 7.5 could be made where the cases
with weight one are highlighted. For high levels of trimming, the identity line
often passes through the good leverage points.)
The above example is typical of many “benchmark” outlier data sets for
MLR. In these data sets traditional OLS diagnostics such as Cook’s distance
and the residuals often fail to detect the outliers, but the combination of the
response plot and residual plot is usually able to detect the outliers. The CDi
and W CDi are the most effective when there is a single cluster about the
identity line. If there is a second cluster of outliers or good leverage points
or if there is nonconstant variance, then these numerical diagnostics tend to
fail.
For a fixed xj consider the ordered distances D(1) (xj ), ..., D(n)(xj ). Next,
let β̂ j (α) denote the OLS fit to the min(p + 3 + bαn/100c, n) cases with
the smallest distances where the approximate percentage of cases used is
α ∈ {1, 2.5, 5, 10, 20, 33, 50}. (Here bxc is the greatest integer function so
b7.7c = 7. The extra p + 3 cases are added so that OLS can be computed for
small n and α.) This yields seven OLS fits corresponding to the cases with
predictors closest to xj . A fixed number of K cases are selected at random
without replacement to use as the xj . Hence 7K OLS fits are generated. We
use K = 7 as the default. A robust criterion Q is used to evaluate the 7K
fits and the OLS fit to all of the data. Hence 7K + 1 OLS fits are generated
and the MBA estimator is the fit that minimizes the criterion. The median
squared residual is a good choice for Q.
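The following R sketch illustrates the scheme just described, assuming Euclidean distances in the predictor space are used to find the cases closest to each center (the metric is not specified above) and Q = median squared residual; mba_sketch is a hypothetical name, and the linmodpack implementation may differ.

mba_sketch <- function(x, y, K = 7) {
  x <- as.matrix(x); n <- nrow(x); p <- ncol(x) + 1       # p counts the intercept
  covs <- c(1, 2.5, 5, 10, 20, 33, 50)                    # coverage percentages
  crit <- function(b) median((y - cbind(1, x) %*% b)^2)   # criterion Q
  best <- coef(lm(y ~ x)); bestQ <- crit(best)            # OLS fit to all of the data
  for (j in sample(n, K)) {                               # K random centers x_j
    d <- sqrt(rowSums(sweep(x, 2, x[j, ])^2))             # distances to the center
    for (a in covs) {
      idx <- order(d)[1:min(p + 3 + floor(a * n / 100), n)]
      b <- coef(lm(y[idx] ~ x[idx, , drop = FALSE]))
      if (!any(is.na(b)) && crit(b) < bestQ) { best <- b; bestQ <- crit(b) }
    }
  }
  best                                                    # the fit minimizing Q
}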
Three ideas motivate this estimator. First, x-outliers, which are outliers in
the predictor space, tend to be much more destructive than Y -outliers which
are outliers in the response variable. Suppose that the proportion of outliers
is γ and that γ < 0.5. We would like the algorithm to have at least one
“center” xj that is not an outlier. The probability of drawing a center that is
not an outlier is approximately 1 − γ^K > 0.99 for K ≥ 7, and this result is free of p. Secondly, by using the different percentages of coverages, for many data sets there will be a center and a coverage that contains no outliers. Third, by Theorem 1.21, the MBA estimator is a √n consistent estimator of the same
parameter vector β estimated by OLS under mild conditions.
Ellipsoidal trimming can be used to create resistant multiple linear regres-
sion (MLR) estimators. To perform ellipsoidal trimming, an estimator (T, C)
is computed and used to create the squared Mahalanobis distances Di2 for
each vector of observed predictors xi . If the ordered distance D(j) is unique,
then j of the xi ’s are in the ellipsoid
{x : (x − T)^T C^{−1} (x − T) ≤ D²_{(j)}}.   (7.21)
The ith case (Yi , xTi )T is trimmed if Di > D(j) . Then an estimator of β is
computed from the remaining cases. For example, if j ≈ 0.9n, then about
10% of the cases are trimmed, and OLS or L1 could be used on the cases
that remain. Ellipsoidal trimming differs from using the RFCH, RMVN, or
covmb2 set since these sets use a random amount of trimming. (The ellip-
soidal trimming technique can also be used for other regression models, and
the theory of the regression method tends to apply to the method applied to
the cleaned data that was not trimmed since the response variables were not
used to select the cases. See Chapter 10.)
Use ellipsoidal trimming on the RFCH, RMVN, or covmb2 set applied to
the continuous predictors to get a fit β̂C . Then make a response and residual
plot using all of the data, not just the cleaned data that was not trimmed.
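A minimal R sketch of ellipsoidal trimming with roughly 10% trimming based on the classical estimator of the predictors (a robust (T, C) could be plugged in instead); elltrim_ols is a hypothetical name.

elltrim_ols <- function(x, y, keep = 0.9) {
  x <- as.matrix(x)
  d2 <- mahalanobis(x, colMeans(x), cov(x))     # D_i^2 for the predictors
  idx <- order(d2)[1:ceiling(keep * nrow(x))]   # cases inside the ellipsoid (7.21)
  fit <- lm(y[idx] ~ x[idx, , drop = FALSE])    # OLS on the cases that remain
  yhat <- cbind(1, x) %*% coef(fit)             # fitted values for all of the data
  plot(yhat, y); abline(0, 1)                   # response plot with the identity line
  invisible(fit)
}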
Example 7.11. For the Buxton (1920) data, height was the response
variable while an intercept, head length, nasal height, bigonal breadth, and
cephalic index were used as predictors in the multiple linear regression model.
Observation 9 was deleted since it had missing values. Five individuals, cases
61–65, were reported to be about 0.75 inches tall with head lengths well
over five feet! OLS was used on the cases remaining after trimming, and
Figure 7.18 shows four trimmed views corresponding to 90%, 70%, 40%,
and 0% trimming. The OLS TV estimator used 70% trimming since this
trimmed view was best. Since the vertical distance from a plotted point to the
identity line is equal to the case’s residual, the outliers had massive residuals
for 90%, 70%, and 40% trimming. Notice that the OLS trimmed view with
0% trimming “passed through the outliers” since the cluster of outliers is
scattered about the identity line.
[Figure 7.18: trimmed views (y versus fit) for the Buxton data with 90%, 70%, 40%, and 0% trimming.]
Let X n = X 0,n denote the full design matrix. Often when proving asymp-
totic normality of an MLR estimator β̂ 0,n , it is assumed that
X_n^T X_n / n → W^{−1}.
If β̂_{0,n} has O_P(n^{−1/2}) rate and if for big enough n all of the diagonal elements of (X_{M,n}^T X_{M,n}/n)^{−1} are all contained in an interval [0, B) for some B > 0, then ‖β̂_{M,n} − β‖ = O_P(n^{−1/2}).
The distribution of the estimator β̂_{M,n} is especially simple when OLS is used and the errors are iid N(0, σ²). Then
β̂_{M,n} = (X_{M,n}^T X_{M,n})^{−1} X_{M,n}^T Y_{M,n} ∼ N_p(β, σ² (X_{M,n}^T X_{M,n})^{−1})
and √n(β̂_{M,n} − β) ∼ N_p(0, σ² (X_{M,n}^T X_{M,n}/n)^{−1}). This result does not imply
that β̂T ,n is asymptotically normal. See the following paragraph for the large
sample theory of a modified trimmed views estimator.
The conditions under which the rmreg2 estimator of Section 8.6 has been shown to be √n consistent are quite strong, but it seems likely that the estimator is a √n consistent estimator of β under mild conditions where the parameter vector β is not, in general, the parameter vector estimated by OLS.
For MLR, the linmodpack function rmregboot bootstraps the rmreg2 es-
timator, and the function rmregbootsim can be used to simulate rmreg2.
Both functions use the residual bootstrap where the residuals come from
OLS. See the R code below.
out<-rmregboot(belx,bely)
plot(out$betas)
ddplot4(out$betas) #right click Stop
out<-rmregboot(cbrainx,cbrainy)
ddplot4(out$betas) #right click Stop
Often practical “robust estimators” generate a sequence of K trial fits
called attractors: b1 , ..., bK . Then some criterion is evaluated and the attractor
bA that minimizes the criterion is used in the final estimator.
Fig. 7.19 The Highlighted Points are More Concentrated about the Attractor: a) A Start for the Animal Data; b) The Attractor for the Start (Y versus X).
k steps resulting in the sequence of estimators b0,j , b1,j , ..., bk,j . Then bk,j is
the jth attractor for j = 1, ..., K. Then the attractor bA that minimizes the
LTS criterion is used in the final estimator. Using k = 10 concentration steps
often works well, and the basic resampling algorithm is a special case with
k = 0, i.e., the attractors are the starts. Such an algorithm is called a CLTS
concentration algorithm or CLTS.
[Figure 7.20: a) five randomly selected starts; b) the corresponding attractors (Y versus X) for the animal data.]
these c highlighted cases is b_{1,1} = (2.076, 0.979)^T and Σ_{i=1}^{14} |r|_{(i)}(b_{1,1}) = 6.990. The iteration consists of finding the cases corresponding to the c smallest absolute residuals, obtaining the corresponding L1 fit and repeating. The attractor b_{a,1} = b_{7,1} = (1.741, 0.821)^T and the LTA(c) criterion evaluated at the attractor is Σ_{i=1}^{14} |r|_{(i)}(b_{a,1}) = 2.172. Figure 7.19b shows the attractor
uals are much more concentrated than those in Figure 7.19a. Figure 7.20a
shows 5 randomly selected starts while Figure 7.20b shows the corresponding
attractors. Notice that the elemental starts have more variability than the
attractors, but if the start passes through an outlier, so does the attractor.
Suppose the data set has n cases where d are outliers and n − d are “clean” (not outliers). Then the outlier proportion γ = d/n. Suppose that K elemental sets are chosen with replacement and that it is desired to find K such that the probability P(that at least one of the elemental sets is clean) ≡ P1 ≈ 1 − α where α = 0.05 is a common choice. Then P1 = 1 − P(none of the K elemental sets is clean) ≈ 1 − [1 − (1 − γ)^p]^K by independence. Hence α ≈ [1 − (1 − γ)^p]^K or
K ≈ log(α) / log([1 − (1 − γ)^p]) ≈ log(α) / [−(1 − γ)^p]   (7.22)
using the approximation log(1 − x) ≈ −x for small x. Since log(0.05) ≈ −3, if α = 0.05, then K ≈ 3/(1 − γ)^p. Frequently a clean subset is wanted even if
the contamination proportion γ ≈ 0.5. Then for a 95% chance of obtaining at
least one clean elemental set, K ≈ 3(2^p) elemental sets need to be drawn. If
the start passes through an outlier, so does the attractor. For concentration
algorithms for multivariate location and dispersion, if the start passes through
a cluster of outliers, sometimes the attractor would be clean. See Figures 7.5–7.11.
Notice that the number of subsets K needed to obtain a clean elemental set
with high probability is an exponential function of the number of predictors
p but is free of n. Hawkins and Olive (2002) showed that if K is fixed and
free of n, then the resulting elemental or concentration algorithm (that uses k
concentration steps), is inconsistent and zero breakdown. See Theorem 7.21.
Nevertheless, many practical estimators tend to use a value of K that is free
of both n and p (e.g. K = 500 or K = 3000). Such algorithms include ALMS
= FLMS = lmsreg and ALTS = FLTS = ltsreg. The “A” denotes that
an algorithm was used. The “F” means that a fixed number of trial fits (K
332 7 Robust Regression
elemental fits) was used and the criterion (LMS or LTS) was used to select
the trial fit used in the final estimator.
To examine the outlier resistance of such inconsistent zero breakdown es-
timators, fix both K and the contamination proportion γ and then find the
largest number of predictors p that can be in the model such that the proba-
bility of finding at least one clean elemental set is high. Given K and γ, P (at
least one of K subsamples is clean) = 0.95 ≈ 1 − [1 − (1 − γ)^p]^K. Thus the largest value of p satisfies 3/(1 − γ)^p ≈ K, or
p ≈ log(3/K) / log(1 − γ)   (7.23)
if the sample size n is very large. Again bxc is the greatest integer function:
b7.7c = 7.
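A small R illustration of (7.22) and (7.23); the chosen K and γ are only examples.

Kneeded <- function(gam, p, alpha = 0.05)        # starts needed, Equation (7.22)
  ceiling(log(alpha) / log(1 - (1 - gam)^p))
pmax_clean <- function(K, gam)                   # largest p handled, Equation (7.23)
  floor(log(3 / K) / log(1 - gam))
Kneeded(gam = 0.5, p = 10)      # roughly 3 * 2^10 starts
pmax_clean(K = 3000, gam = 0.5) # p = 9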
Table 7.5 shows the largest value of p such that there is a 95% chance
that at least one of K subsamples is clean using the approximation given by
Equation (7.23). Hence if p = 28, even with one billion subsamples, there
is a 5% chance that none of the subsamples will be clean if the contami-
nation proportion γ = 0.5. Since clean elemental fits have great variability,
an algorithm needs to produce many clean fits in order for the best fit to
be good. When contamination is present, all K elemental sets could contain
outliers. Hence basic resampling and concentration algorithms that only use
K elemental starts are doomed to fail if γ and p are large.
The outlier resistance of elemental algorithms that use K elemental sets
decreases rapidly as p increases. However, for p < 10, such elemental algo-
rithms are often useful for outlier detection. They can perform better than
MBA, trimmed views, and rmreg2 if p is small and the outliers are close
to the bulk of the data or if p is small and there is a mixture distribution:
the bulk of the data follows one MLR model, but “outliers” and some of the
clean data are fit well by another MLR model. For example, if there is one
nontrivial predictor, suppose the plot of x versus Y looks like the letter X.
Such a mixture distribution is not really an outlier configuration since out-
liers lie far from the bulk of the data. All practical estimators have outlier
configurations where they perform poorly. If p is small, elemental algorithms
tend to have trouble when there is a weak regression relationship for the bulk
of the data and a cluster of outliers that are not good leverage points (do
not fall near the hyperplane followed by the bulk of the data). The Buxton
(1920) data set is an example.
If the multiple linear regression model holds, if the predictors are bounded,
and if all J regression estimators are consistent estimators of β, then the
subplots in the FF and RR plots should be linear with a correlation tending
to one as the sample size n increases. To prove this claim, let the ith residual
from the jth fit b_j be r_i(b_j) = Y_i − x_i^T b_j where (Y_i, x_i^T) is the ith observation.
Similarly, let the ith fitted value from the jth fit be Ŷ_i(b_j) = x_i^T b_j. Then
Table 7.6 Summaries for Seven Data Sets, the Correlations of the Residuals from
TV(M) and the Alternative Method are Given in the 1st 5 Rows
Notice that the TV, MBA, and OLS estimators were the same for the
Gladstone (1905) data and for the Tremearne (1911) major data which had
two small Y –outliers. For the Gladstone data, there is a cluster of infants
that are good leverage points, and we attempt to predict brain weight with
the head measurements height, length, breadth, size, and cephalic index. Orig-
inally, the variable length was incorrectly entered as 109 instead of 199 for
case 115, and the glado data contains this outlier. In 1997, lmsreg was not
able to detect the outlier while ltsreg did. Due to changes in the Splus 2000
code, lmsreg detected the outlier but ltsreg did not. These two functions
change often, not always for the better.
To end this section, we describe resistant regression with the RMVN set
U or covmb2 set B in more detail. Assume that predictor transformations
have been performed to make a p × 1 vector of predictors x, and that w
consists of k ≤ p continuous predictor variables that are linearly related. Find
the RMVN set based on the w to obtain n_u cases (y_ci, x_ci), and then run
the regression method on the cleaned data. Often the theory of the method
applies to the cleaned data set since y was not used to pick the subset of
the data. Efficiency can be much lower since n_u cases are used where n/2 ≤
n_u ≤ n, and the trimmed cases tend to be the “farthest” from the center of
w. The method will have the most outlier resistance if k = p − 1 if there is a
trivial predictor X1 ≡ 1.
In R, assume Y is the vector of response variables, x is the data matrix of
the predictors (often not including the trivial predictor), and w is the data
matrix of the wi . Then the following R commands can be used to get the
cleaned data set. We could use the covmb2 set B instead of the RMVN set
U computed from the w by replacing the command getu(w) by getB(w).
indx <- getu(w)$indx  #indices of the RMVN set U; often w = x
Yc <- Y[indx]         #cleaned response vector
Xc <- x[indx,]        #cleaned predictor matrix
#example: then run the regression method on (Xc, Yc), e.g. lsfit(Xc, Yc)
This section will consider the breakdown of a regression estimator and then
develop the practical high breakdown hbreg estimator.
Definition 7.27. Scale Equivariance: Let c be any scalar. Then β̂ is
scale equivariant if

β̂(X, cY) = T(X, cY) = cT(X, Y) = cβ̂(X, Y).    (7.30)
Remark 7.7. OLS has the above invariance properties, but most Statis-
tical Learning alternatives such as lasso and ridge regression do not have all
four properties. Hence Remark 5.1 is used to fit the data with Z = W η + e.
Then obtain β̂ from η̂.
The following result greatly simplifies some breakdown proofs and shows
that a regression estimator basically breaks down if the median absolute
residual MED(|ri |) can be made arbitrarily large. The result implies that if
the breakdown value ≤ 0.5, breakdown can be computed using the median
absolute residual MED(|r_i|(W_n^d)) instead of ‖T(W_n^d)‖. Similarly β̂ is high
breakdown if the median squared residual, or the c_n th largest absolute residual
|r_i|_(c_n) or squared residual r²_(c_n), stays bounded under high contamination
where c_n ≈ n/2. Note that ‖β̂‖ ≡ ‖β̂(W_n^d)‖ ≤ M for some constant M that
depends on T and W but not on the outliers if the number of outliers dn is
less than the smallest number of outliers needed to cause breakdown.
In the literature it is usually assumed that the original data are in general
position: q = p − 1.
Suppose that the clean data are in general position and that the number of
outliers is less than the number needed to make the median absolute residual
and kβ̂k arbitrarily large. If the xi are fixed, and the outliers are moved up
and down by adding a large positive or negative constant to the Y values
of the outliers, then for high breakdown (HB) estimators, β̂ and MED(|ri |)
stay bounded where the bounds depend on the clean data W but not on the
outliers even if the number of outliers is nearly as large as n/2. Thus if the
|Yi | values of the outliers are large enough, the |ri | values of the outliers will
be large.
If the Yi ’s are fixed, arbitrarily large x-outliers tend to drive the slope
estimates to 0, not ∞. If both x and Y can be varied, then a cluster of
outliers can be moved arbitrarily far from the bulk of the data but may still
have small residuals. For example, move the outliers along the regression
hyperplane formed by the clean cases.
If the Yi ’s are fixed, arbitrarily large x-outliers will rarely drive kβ̂k to
∞. The x-outliers can drive kβ̂k to ∞ if they can be constructed so that
the estimator is no longer defined, e.g. so that X T X is nearly singular. The
examples following some results on norms may help illustrate these points.
Several useful results involving matrix norms will be used. First, for any
subordinate matrix norm, ‖Gy‖_q ≤ ‖G‖_q ‖y‖_q. Let J = J_m = {m_1, ..., m_p}
denote the p cases in the mth elemental fit b_J = X_J^{-1} Y_J. Then for any
elemental fit b_J (suppressing q = 2),

‖b_J − β‖ = ‖X_J^{-1}(X_J β + e_J) − β‖ = ‖X_J^{-1} e_J‖ ≤ ‖X_J^{-1}‖ ‖e_J‖.    (7.33)

The following results (Golub and Van Loan 1989, pp. 57, 80) on the Euclidean
norm are useful. Let 0 ≤ σ_p ≤ σ_{p−1} ≤ · · · ≤ σ_1 denote the singular values of
X_J = (x_{mi,j}). Then

‖X_J^{-1}‖ = σ_1/(σ_p ‖X_J‖),    (7.34)

max_{i,j} |x_{mi,j}| ≤ ‖X_J‖ ≤ p max_{i,j} |x_{mi,j}|,  and    (7.35)

1/(p max_{i,j} |x_{mi,j}|) ≤ 1/‖X_J‖ ≤ ‖X_J^{-1}‖.    (7.36)
From now on, unless otherwise stated, we will use the spectral norm as the
matrix norm and the Euclidean norm as the vector norm.
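As a quick numerical illustration (our own R sketch using a randomly generated elemental design X_J), the three displays can be checked with the base R functions svd, norm, and solve:

set.seed(1)
p <- 4
XJ <- matrix(rnorm(p^2), p, p)           # a random p x p elemental design
sv <- svd(XJ)$d                          # singular values sigma_1 >= ... >= sigma_p
c(norm(solve(XJ), "2"), sv[1]/(sv[p]*norm(XJ, "2")))            # both sides of (7.34)
c(max(abs(XJ)), norm(XJ, "2"), p*max(abs(XJ)))                  # bounds in (7.35)
c(1/(p*max(abs(XJ))), 1/norm(XJ, "2"), norm(solve(XJ), "2"))    # bounds in (7.36)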
Example 7.14. Suppose the response values Y are near 0. Consider the fit
from an elemental set: b_J = X_J^{-1} Y_J and examine Equations (7.34), (7.35),
and (7.36). Now ‖b_J‖ ≤ ‖X_J^{-1}‖ ‖Y_J‖, and since x-outliers make ‖X_J‖
large, x-outliers tend to drive ‖X_J^{-1}‖ and ‖b_J‖ towards zero, not towards ∞.
The x-outliers may make ‖b_J‖ large if they can make the trial design X_J
nearly singular. Notice that the Euclidean norm ‖b_J‖ can easily be made large if
one or more of the elemental response variables is driven far away from zero.
Example 7.15. Without loss of generality, assume that the clean Y ’s are
contained in an interval [a, f] for some a and f. Assume that the regression
Proof. Let MED(n) = MED(Y_1, ..., Y_n) and MAD(n) = MAD(Y_1, ..., Y_n).
Take β̂_M = (MED(n), 0, ..., 0)^T. Then ‖β̂_M‖ = |MED(n)| ≤ max(|a|, |f|).
Note that the median absolute residual for the fit β̂_M is equal to the median
absolute deviation MAD(n) = MED(|Y_i − MED(n)|, i = 1, ..., n) ≤ f − a if
d_n < ⌊(n + 1)/2⌋.
Note that β̂ M is a poor high breakdown estimator of β and Ŷi (β̂ M ) tracks
the Yi very poorly. If the data are in general position, a high breakdown
regression estimator is an estimator which has a bounded median absolute
residual even when close to half of the observations are arbitrary. Rousseeuw
and Leroy (1987, pp. 29, 206) conjectured that high breakdown regression
estimators can not be computed cheaply, and that if the algorithm is also
affine equivariant, then the complexity of the algorithm must be at least
O(np ). The following theorem shows that these two conjectures are false.
Theorem 7.17. If the clean data are in general position and the model has
an intercept, then a scale and affine equivariant high breakdown estimator
β̂w can be found by computing OLS on the set of cases that have Yi ∈
[MED(Y1 , ..., Yn) ± w MAD(Y1 , ..., Yn)] where w ≥ 1 (so at least half of the
cases are used).
Thus

MED(|r_1(β̂_w)|, ..., |r_n(β̂_w)|) ≤ √(n_j) w MAD(n) < √n w MAD(n) < ∞.

The estimator β̂_w is scale equivariant since the set of cases used remains the
same under scale transformations and OLS is scale equivariant.
Note that if w is huge and MAD(n) 6= 0, then the high breakdown estima-
tor β̂ w and β̂ OLS will be the same for most data sets. Thus high breakdown
estimators can be very nonrobust. Even if w = 1, the HB estimator β̂ w only
resists large Y outliers.
For example, consider the LTS(c_n) criterion. Suppose the ordered squared
residuals from the high breakdown mth start b_0m are obtained. If the data
are in general position, then Q_LTS(b_0m) is bounded even if the number of
outliers d_n is nearly as large as n/2. Then b_1m is simply the OLS fit to
the cases corresponding to the c_n smallest squared residuals r²_(i)(b_0m) for
i = 1, ..., c_n. Denote these cases by i_1, ..., i_{c_n}. Then

Q_LTS(b_1m) = Σ_{i=1}^{c_n} r²_(i)(b_1m) ≤ Σ_{j=1}^{c_n} r²_{i_j}(b_1m) ≤ Σ_{j=1}^{c_n} r²_{i_j}(b_0m) = Σ_{i=1}^{c_n} r²_(i)(b_0m) = Q_LTS(b_0m)
where the second inequality follows from the definition of the OLS estimator.
Hence concentration steps reduce or at least do not increase the LTS criterion.
If cn = (n+1)/2 for n odd and cn = 1+n/2 for n even, then the LTS criterion
is bounded iff the median squared residual is bounded.
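A concentration step as described above is easy to code. The following R sketch (our own illustration, not the linmodpack code) computes one step: OLS applied to the c_n cases with the smallest squared residuals from the start b0, where X is assumed to contain the column of ones.

conc_step <- function(X, Y, b0, cn = ceiling((nrow(X) + 1)/2)) {
  r2 <- as.vector(Y - X %*% b0)^2         # squared residuals from the start
  keep <- order(r2)[1:cn]                 # cases i_1, ..., i_cn
  lm.fit(X[keep, , drop = FALSE], Y[keep])$coefficients  # OLS fit b1m to the retained cases
}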
Theorem 7.18 can be used to show that the following two estimators are
high breakdown. The estimator β̂_B is the high breakdown attractor used by
the √n consistent high breakdown hbreg estimator of Definition 7.35.
Definition 7.34. Make an OLS fit to the cn ≈ n/2 cases whose Y values
are closest to the MED(Y1 , ..., Yn) ≡ MED(n) and use this fit as the start
for concentration. Define β̂ B to be the attractor after k concentration steps.
Define b_{k,B} = 0.9999 β̂_B.
Theorem 7.19. If the clean data are in general position, then β̂ B and
bk,B are high breakdown regression estimators.
Theorem 7.20. Assume the clean data are in general position, and that
the LMS estimator is a consistent estimator of β. Let β̂C be any practical con-
sistent estimator of β, and let β̂_D = β̂_C if MED(r_i²(β̂_C)) ≤ MED(r_i²(b_{k,B})).
Otherwise, let β̂_D = b_{k,B}. Then β̂_D is a HB estimator that is asymptotically
equivalent to β̂_C.
Proof. The estimator is HB since the median squared residual of β̂ D
is no larger than that of the HB estimator bk,B . Since β̂ C is consistent,
MED(ri2 (β̂ C )) → MED(e2 ) in probability where MED(e2 ) is the population
median of the squared error e2 . Since the LMS estimator is consistent, the
probability that β̂_C has a smaller median squared residual than the biased
estimator b_{k,B} goes to 1 as n → ∞. Hence β̂_D is asymptotically equivalent
to β̂ C .
Olive and Hawkins (2011) showed that the practical hbreg estimator is a
high breakdown √n consistent robust estimator that is asymptotically equiv-
alent to the least squares estimator for many error distributions. This sub-
section follows Olive (2017b, pp. 420-423).
The outlier resistance of the hbreg estimator is not very good, but roughly
comparable to the best of the practical “robust regression” estimators avail-
able in R packages as of 2019. The estimator is of some interest since it proved
that practical high breakdown consistent estimators are possible. Other prac-
tical regression estimators that claim to be high breakdown and consistent
appear to be zero breakdown because they use the zero breakdown elemental
concentration algorithm. See Theorem 7.21.
The following theorem is powerful because it does not depend on the crite-
rion used to choose the attractor. Suppose there are K consistent estimators
β̂j of β, each with the same rate nδ . If β̂ A is an estimator obtained by choos-
ing one of the K estimators, then β̂ A is a consistent estimator of β with rate
nδ by Pratt (1959). See Theorem 1.21.
Theorem 7.23. Suppose Kn ≡ K starts are used and that all starts have
subset size hn = g(n) ↑ ∞ as n → ∞. Assume that the estimator applied to
the subset has rate nδ .
i) For the hn -set basic resampling algorithm, the algorithm estimator has
rate [g(n)]δ .
ii) Under regularity conditions (e.g. given by He and Portnoy 1992), the k–
step CLTS estimator has rate [g(n)]δ .
High breakdown estimators are, however, not necessarily useful for detect-
ing outliers. Suppose γn < 0.5. On the one hand, if the xi are fixed, and the
outliers are moved up and down parallel to the Y axis, then for high break-
down estimators, β̂ and MED(|ri |) will be bounded. Thus if the |Yi | values of
the outliers are large enough, the |ri | values of the outliers will be large, sug-
gesting that the high breakdown estimator is useful for outlier detection. On
the other hand, if the Yi ’s are fixed at any values and the x values perturbed,
sufficiently large x-outliers tend to drive the slope estimates to 0, not ∞. For
many estimators, including LTS, LMS, and LTA, a cluster of Y outliers can
be moved arbitrarily far from the bulk of the data but still, by perturbing
their x values, have arbitrarily small residuals. See Example 7.16.
That is, find the smallest of the three scaled criterion values Q_L(β̂_C),
aQ_L(β̂_A), and aQ_L(β̂_B). According to which of the three estimators attains this
minimum, set β̂_H to β̂_C, β̂_A, or β̂_B, respectively.
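A minimal R sketch of this selection step follows (our own illustration, not the linmodpack hbreg function). It assumes three candidate coefficient vectors betaC, betaA, and betaB are already available, uses an LTS-type criterion for Q_L, and takes the scaling constant a as an argument; the default a = 1.4 is only a placeholder.

hb_select <- function(X, Y, betaC, betaA, betaB, a = 1.4) {
  cn <- ceiling(nrow(X)/2)
  QLTS <- function(b) sum(sort(as.vector(Y - X %*% b)^2)[1:cn])  # LTS-type criterion
  crit <- c(QLTS(betaC), a * QLTS(betaA), a * QLTS(betaB))       # scaled criteria
  list(betaC, betaA, betaB)[[which.min(crit)]]                   # the selected fit betaH
}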
Large sample theory for hbreg is simple and given in the following theo-
rem. Let β̂ L be the LMS, LTS, or LTA estimator that minimizes the criterion
QL. Note that the impractical estimator β̂ L is never computed. The following
theorem shows that β̂_H is asymptotically equivalent to β̂_C on a large class
of zero mean finite variance symmetric error distributions. Thus if β̂_C is √n
consistent or asymptotically efficient, so is β̂_H. Notice that β̂_A does not need
to be consistent. This point is crucial since lmsreg is not consistent and it is
not known whether FLTS is consistent. The clean data are in general position
if any p clean cases give a unique estimate of β̂.
Theorem 7.24. Assume the clean data are in general position, and sup-
pose that both β̂ L and β̂ C are consistent estimators of β where the regression
model contains a constant. Then the hbreg estimator β̂ H is high breakdown
and asymptotically equivalent to β̂ C .
Proof. Since the clean data are in general position and QL (β̂ H ) ≤
aQL(β̂ B ) is bounded for γn near 0.5, the hbreg estimator is high break-
down. Let Q∗L = QL for LMS and Q∗L = QL /n for LTS and LTA. As n → ∞,
consistent estimators β̂ satisfy Q∗L (β̂) − Q∗L (β) → 0 in probability. Since
LMS, LTS, and LTA are consistent and the minimum value is Q∗L(β̂ L ), it
follows that Q∗L (β̂ C ) − Q∗L (β̂ L ) → 0 in probability, while Q∗L(β̂ L ) < aQ∗L(β̂)
for any estimator β̂. Thus with probability tending to one as n → ∞,
QL(β̂ C ) < a min(QL (β̂ A ), QL(β̂ B )). Hence β̂ H is asymptotically equivalent
to β̂ C .
[Figure: four response plots of Y versus the HBFIT, OLSFIT, ALTSFIT, and BBFIT fitted values.]
Example 7.16. The LMS, LTA, and LTS estimators are determined by a
“narrowest band” covering half of the cases. Hawkins and Olive (2002) sug-
gested that the fit will pass through outliers if the band through the outliers
is narrower than the band through the clean cases. This behavior tends to
occur if the regression relationship is weak, and if there is a tight cluster of outliers.
7.7 Summary
1) For the location model, the sample mean Ȳ = (Σ_{i=1}^n Y_i)/n, the sample
variance S_n² = Σ_{i=1}^n (Y_i − Ȳ)²/(n − 1), and the sample standard deviation
S_n = √(S_n²).
If the data Y1 , ..., Yn is arranged in ascending order from smallest to largest
and written as Y(1) ≤ · · · ≤ Y(n), then Y(i) is the ith order statistic and the
Y(i)’s are called the order statistics. The sample median
MED(n) = Y_((n+1)/2) if n is odd, and MED(n) = (Y_(n/2) + Y_((n/2)+1))/2 if n is even.
The notation MED(n) = MED(Y1 , ..., Yn) will also be used. The sample me-
dian absolute deviation is MAD(n) = MED(|Yi − MED(n)|, i = 1, . . . , n).
2) Suppose the multivariate data has been collected into an n × p matrix
W = X with ith row x_i^T. The sample mean x̄ = (1/n) Σ_{i=1}^n x_i and the
sample covariance matrix S = (1/(n − 1)) Σ_{i=1}^n (x_i − x̄)(x_i − x̄)^T.
That is, the ij entry of S is the sample covariance S_ij. The classical estimator
of multivariate location and dispersion is (T, C) = (x̄, S).
3) Let (T, C) = (T(W), C(W)) be an estimator of multivariate location
and dispersion. The ith Mahalanobis distance D_i = √(D_i²) where the ith
squared Mahalanobis distance is

D_i² = D_i²(T(W), C(W)) = (x_i − T(W))^T C^{-1}(W) (x_i − T(W)).
4) The squared Euclidean distances of the xi from the coordinatewise
median is Di2 = Di2 (MED(W ), I p ). Concentration type steps compute the
weighted median MEDj : the coordinatewise median computed from the cases
x_i with D_i² ≤ MED(D_i²(MED_{j−1}, I_p)) where MED_0 = MED(W). Often
j = 0 (no concentration type steps) or j = 9 is used. Let D_i = D_i(MED_j, I_p). Let
Wi = 1 if Di ≤ MED(D1 , ..., Dn)+kMAD(D1 , ..., Dn) where k ≥ 0 and k = 5
is the default choice. Let Wi = 0, otherwise.
5) Let the covmb2 set B of at least n/2 cases correspond to the cases with
weight Wi = 1. Then the covmb2 estimator (T, C) is the sample mean and
sample covariance matrix applied to the cases in set B. Hence
T = Σ_{i=1}^n W_i x_i / Σ_{i=1}^n W_i   and   C = Σ_{i=1}^n W_i (x_i − T)(x_i − T)^T / (Σ_{i=1}^n W_i − 1).
The function ddplot5 plots the Euclidean distances from the coordinatewise
median versus the Euclidean distances from the covmb2 location estimator.
Typically the plotted points in this DD plot cluster about the identity line,
and outliers appear in the upper right corner of the plot with a gap between
the bulk of the data and the outliers.
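The weighting in points 4) and 5) is simple to sketch in R. The following illustration (our own code, not the linmodpack covmb2 function) uses j = 0 concentration type steps and does not enforce the "at least n/2 cases" requirement.

covmb2_sketch <- function(x, k = 5) {
  med <- apply(x, 2, median)                       # coordinatewise median
  D <- sqrt(rowSums(sweep(x, 2, med)^2))           # Euclidean distances D_i(MED(W), I_p)
  W <- D <= median(D) + k * mad(D, constant = 1)   # weights W_i, using the raw MAD
  list(center = colMeans(x[W, , drop = FALSE]),    # T: sample mean of the cases in set B
       cov = cov(x[W, , drop = FALSE]))            # C: sample covariance of the cases in B
}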
7.8 Complements
Most of this chapter was taken from Olive (2017b). See that text for references
to concepts such as breakdown. The fact that response plots are extremely
useful for model assessment and for detecting influential cases and outliers
for an enormous variety of statistical models does not seem to be well known.
Certainly in any multiple linear regression analysis, the response plot and the
residual plot of Ŷ versus r should always be made. Cook and Olive (2001)
used response plots to select a response transformation graphically. Olive
(2005) suggested using residual, response, RR, and FF plots to detect outliers
while Hawkins and Olive (2002, pp. 141, 158) suggested using the RR and
FF plots. The four plots are best for n ≥ 5p. Olive (2008: 6.4, 2017a: ch.
5-9) showed that the residual and response plots are useful for experimental
design models. Park et al. (2012) showed response plots are competitive with
the best robust regression methods for outlier detection on some outlier data
sets that have appeared in the literature.
Olive (2002) found applications for the DD plot. The TV estimator was
proposed by Olive (2002, 2005a). Although both the TV and MBA estimators
have the good OP (n−1/2 ) convergence rate, their efficiency under normality
may be very low. Chang and Olive (2010) suggested a method of adaptive
trimming such that the resulting estimator is asymptotically equivalent to
the OLS estimator.
If n is not much larger than p, then Hoffman et al. (2015) gave a ro-
bust Partial Least Squares–Lasso type estimator that uses a clever weighting
scheme. See Uraibi et al. (2017, 2019) for robust methods of forward selection
and least angle regression.
Robust MLD
For the FCH, RFCH, and RMVN estimators, see Olive and Hawkins
(2010), Olive (2017b, ch. 4), and Zhang et al. (2012). See Olive (2017b, p.
120) for the covmb2 estimator.
The fastest estimators of multivariate location and dispersion that have
been shown to be both consistent and high breakdown are the minimum
covariance determinant (MCD) estimator with O(nv ) complexity where
v = 1 + p(p + 3)/2 and possibly an all elemental subset estimator of He
and Wang (1997). See Bernholt and Fischer (2004). The minimum volume
ellipsoid (MVE) complexity is far higher, and for p > 2 there may be no
known method for computing S, τ , projection based, and constrained M
estimators. For some depth estimators, like the Stahel-Donoho estimator, the
exact algorithm of Liu and Zuo (2014) appears to take too long if p ≥ 6 and
n ≥ 100, and simulations may need p ≤ 3. It is possible to compute the MCD
and MVE estimators for p = 4 and n = 100 in a few hours using branch
and bound algorithms (like estimators with O(100^4) complexity). See Agulló
(1996, 1998) and Pesch (1999). These algorithms take too long if both p ≥ 5
and n ≥ 100. Simulations may need p ≤ 2. Two stage estimators such as
the MM estimator, that need an initial high breakdown consistent estimator,
take longer to compute than the initial estimator. Rousseeuw (1984) intro-
duced the MCD and MVE estimators. See Maronna et al. (2006, ch. 6) for
descriptions and references.
Estimators with complexity higher than O[(n3 +n2 p+np2 +p3 ) log(n)] take
too long to compute and will rarely be used. Reyen et al. (2009) simulated
the OGK and the Olive (2004a) median ball algorithm (MBA) estimators for
p = 100 and n up to 50000, and noted that the OGK complexity is O[p3 +
np2 log(n)] while that of MBA is O[p3 + np2 + np log(n)]. FCH, RMBA, and
RMVN have the same complexity as MBA. FMCD has the same complexity
as FCH, but FCH is roughly 100 to 200 times faster.
Robust Regression
For the hbreg estimator, see Olive and Hawkins (2011) and Olive (2017b,
ch. 14). Robust regression estimators have unsatisfactory outlier resistance
and large sample theory. The hbreg estimator is fast and high breakdown,
but does not provide an adequate remedy for outliers, and the symmetry
condition for consistency is too strong. OLS response and residual plots, and
352 7 Robust Regression
RMVN or RFCH DD plots are useful for detecting multiple linear regression
outliers.
Many of the robust statistics for the location model are practical to com-
pute, outlier resistant, and backed by theory. See Huber and Ronchetti (2009).
A few estimators of multivariate location and dispersion, such as the coordi-
natewise median, are practical to compute, outlier resistant, and backed by
theory.
For practical estimators for MLR and MCD, hbreg and FCH appear to
be the only estimators proven to be consistent (for a large class of symmetric
error distributions and for a large class of EC distributions, respectively) with
some breakdown theory (T_FCH is HB). Perhaps all other “robust statistics”
for MLR and MLD that have been shown to be both consistent and high
breakdown are impractical to compute for p > 4: the impractical “brand
name” estimators have at least O(np ) complexity, while the practical esti-
mators used in the software for the “brand name estimators” have not been
shown to be both high breakdown and consistent. See Theorems 7.12 and
7.21, Hawkins and Olive (2002), Olive (2008, 2017b), Hubert et al. (2002),
and Maronna and Yohai (2002). Huber and Ronchetti (2009, pp. xiii, 8-9,
152-154, 196-197) suggested that high breakdown regression estimators do
not provide an adequate remedy for the ill effects of outliers, that their sta-
tistical and computational properties are not adequately understood, that
high breakdown estimators “break down for all except the smallest regres-
sion problems by failing to provide a timely answer!” and that “there are no
known high breakdown point estimators of regression that are demonstrably
stable.”
A large number of impractical high breakdown regression estimators have
been proposed, including LTS, LMS, LTA, S, LQD, τ , constrained M, re-
peated median, cross checking, one step GM, one step GR, t-type, and re-
gression depth estimators. See Rousseeuw and Leroy (1987) and Maronna et
al. (2006). The practical algorithms used in the software use a brand name
criterion to evaluate a fixed number of trial fits and should be denoted as
an F-brand name estimator such as FLTS. Two stage estimators, such as
the MM estimator, that need an initial consistent high breakdown estima-
tor often have the same breakdown value and consistency rate as the initial
estimator. These estimators are typically implemented with a zero break-
down inconsistent initial estimator and hence are zero breakdown with zero
efficiency.
Maronna and Yohai (2015) used OLS and 500 elemental sets as the 501
trial fits to produce an FS estimator used as the initial estimator for an
FMM estimator. Since the 501 trial fits are zero breakdown, so is the FS
estimator. Since the FMM estimator has the same breakdown as the initial
estimator, the FMM estimator is zero breakdown. For regression, they show
that the FS estimator is consistent on a large class of zero mean finite variance
symmetric distributions. Consistency follows since the elemental fits and OLS
are unbiased estimators of β OLS but an elemental fit is an OLS fit to p cases.
7.9 Problems 353
Hence the elemental fits are very variable, and the probability that the OLS
fit has a smaller S-estimator criterion than a randomly chosen elemental
fit (or K randomly chosen elemental fits) goes to one as n → ∞. (OLS
and the S-estimator are both √n consistent estimators of β, so the ratio of
their criterion values goes to one, and the S-estimator minimizes the criterion
value.) Hence the FMM estimator is asymptotically equivalent to the MM
estimator that has the smallest criterion value for a large class of iid zero
mean finite variance symmetric error distributions. This FMM estimator is
asymptotically equivalent to the FMM estimator that uses OLS as the initial
estimator. When the error distribution is skewed the S-estimator and OLS
population constant are not the same, and the probability that an elemental
fit is selected is close to one for a skewed error distribution as n → ∞. (The
OLS estimator β̂ gets very close to β OLS while the elemental fits are highly
variable unbiased estimators of βOLS , so one of the elemental fits is likely to
have a constant that is closer to the S-estimator constant while still having
good slope estimators.) Hence the FS estimator is inconsistent, and the FMM
estimator is likely inconsistent for skewed distributions. No practical method
is known for computing a √n consistent FS or FMM estimator that has the
same breakdown and maximum bias function as the S or MM estimator that
has the smallest S or MM criterion value.
The L1 CLT is

√n(β̂_L1 − β) →D N_p(0, W/(4[f(0)]²))    (7.37)

when X^T X/n → W^{-1}, and when the errors e_i are iid with a cdf F and a pdf
f such that the unique population median is 0 with f(0) > 0. If a constant β1
is in the model or if the column space of X contains 1, then this assumption
is mild, but if the pdf is not symmetric about 0, then the L1 β1 tends to differ
from the OLS β1 . See Bassett and Koenker (1978). Estimating f(0) can be
difficult, so the residual bootstrap using OLS residuals or using ê_i = r_i − r̄
where the ri are the L1 residuals with the prediction region method may be
useful.
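A rough R sketch of the residual bootstrap for L1 mentioned above is given below (our own illustration; it assumes the quantreg package for the L1 fits and simply returns the matrix of bootstrap coefficient vectors, to which a prediction region method could then be applied).

library(quantreg)
l1_resboot <- function(x, y, B = 200) {
  fit <- rq(y ~ x, tau = 0.5)                 # L1 (least absolute deviations) fit
  ehat <- resid(fit) - mean(resid(fit))       # centered L1 residuals ê_i = r_i - rbar
  betas <- replicate(B, {
    ystar <- fitted(fit) + sample(ehat, length(y), replace = TRUE)
    coef(rq(ystar ~ x, tau = 0.5))            # refit L1 to the bootstrap sample
  })
  t(betas)                                    # B x (number of coefficients) matrix
}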
7.9 Problems
7.1. Referring to Definition 7.25, let Ŷ_{i,j} = x_i^T β̂_j = Ŷ_i(β̂_j) and let r_{i,j} =
r_i(β̂_j). Show that ‖r_{i,1} − r_{i,2}‖ = ‖Ŷ_{i,1} − Ŷ_{i,2}‖.
7.2. Assume that the model has a constant β1 so that the first column of
X is 1. Show that if the regression estimator is regression equivariant, then
adding 1 to Y changes β̂1 but does not change the slopes β̂2 , ..., β̂p.
R Problems
Use the command source(“G:/linmodpack.txt”) to download the
functions and the command source(“G:/linmoddata.txt”) to download the
data. See Preface or Section 11.1. Typing the name of the linmodpack
function, e.g. trviews, will display the code for the function. Use the args
command, e.g. args(trviews), to display the needed arguments for the func-
tion. For some of the following problems, the R commands can be copied and
pasted from (http://parker.ad.siu.edu/Olive/mrsashw.txt) into R.
7.3. Paste the command for this problem into R to produce the second
column of Table 7.5. Include the output in Word.
7.4. a) To get an idea for the amount of contamination a basic resam-
pling or concentration algorithm for MLR can tolerate, enter or download
the gamper function (with the source(“G:/linmodpack.txt”) command) that
evaluates Equation (7.24) at different values of h = p.
b) Next enter the following commands and include the output in Word.
zh <- c(10,20,30,40,50,60,70,80,90,100)
for(i in 1:10) gamper(zh[i])
7.5∗ . a) Assuming that you have done the two source commands above
Problem 7.3 (and the R command library(MASS)), type the command
ddcomp(buxx). This will make 4 DD plots based on the DGK, FCH, FMCD,
and median ball estimators. The DGK and median ball estimators are the
two attractors used by the FCH estimator. With the leftmost mouse button,
move the cursor to an outlier and click. This data is the Buxton (1920) data
and cases with numbers 61, 62, 63, 64, and 65 were the outliers with head
lengths near 5 feet. After identifying at least three outliers in each plot, hold
the rightmost mouse button down (and in R click on Stop) to advance to the
next plot. When done, hold down the Ctrl and c keys to make a copy of the
plot. Then paste the plot in Word.
b) Repeat a) but use the command ddcomp(cbrainx). This data is the
Gladstone (1905) data and some infants are multivariate outliers.
c) Repeat a) but use the command ddcomp(museum[,-1]). This data is the
Schaaffhausen (1878) skull measurements and cases 48–60 were apes while
the first 47 cases were humans.
plot after one concentration step. The start uses the coordinatewise median
and diag([MAD(X_i)]²). Repeat 4 more times to see the DD plot based on
the attractor. The outliers have large values of X2 and the highlighted cases
have the smallest distances. Repeat the command concmv() several times.
Sometimes the start will contain outliers but the attractor will be clean (none
of the highlighted cases will be outliers), but sometimes concentration causes
more and more of the highlighted cases to be outliers, so that the attractor
is worse than the start. Copy one of the DD plots where none of the outliers
are highlighted into Word.
7.9. This problem compares the MBA estimator that uses the median
squared residual MED(r_i²) criterion with the MBA estimator that uses the
LATA criterion. On clean data, both estimators are √n consistent since both
use 50 √n consistent OLS estimators. The MED(r_i²) criterion has trouble
with data sets where the multiple linear regression relationship is weak and
there is a cluster of outliers. The LATA criterion tries to give all x–outliers,
including good leverage points, zero weight.
a) If necessary, use the commands source(“G:/linmodpack.txt”) and
source(“G:/linmoddata.txt”). The mlrplot2 function is used to compute
both MBA estimators. Use the rightmost mouse button to advance the plot
(and in R, highlight stop).
b) Use the command mlrplot2(belx,bely) and include the resulting plot in
Word. Is one estimator better than the other, or are they about the same?
c) Use the command mlrplot2(cbrainx,cbrainy) and include the resulting
plot in Word. Is one estimator better than the other, or are they about the
same? (The infants are likely good leverage cases instead of outliers.)
d) Use the command mlrplot2(museum[,3:11],museum[,2]) and include the
resulting plot in Word. For this data set, most of the cases are based on
humans but a few are based on apes. The MBA LATA estimator will often
give the cases corresponding to apes larger absolute residuals than the MBA
estimator based on MED(ri2 ), but the apes appear to be good leverage cases.
e) Use the command mlrplot2(buxx,buxy) until the outliers are clustered
about the identity line in one of the two response plots. (This will usually
happen within 10 or fewer runs. Pressing the “up arrow” will bring the pre-
vious command to the screen and save typing.) Then include the resulting
plot in Word. Which estimator went through the outliers and which one gave
zero weight to the outliers?
f) Use the command mlrplot2(hx,hy) several times. Usually both MBA
estimators fail to find the outliers for this artificial Hawkins data set that is
also analyzed by Atkinson and Riani (2000, section 3.1). The lmsreg estimator
can be used to find the outliers. In R use the commands library(MASS) and
ffplot2(hx,hy). Include the resulting plot in Word.
7.10. a) After entering the two source commands above Problem 7.3, enter
the following command.
MLRplot(buxx,buxy)
Click the rightmost mouse button (and in R click on Stop). The response
plot should appear. Again, click the rightmost mouse button (and in R click
on Stop). The residual plot should appear. Hold down the Ctrl and c keys to
make a copy of the two plots. Then paste the plots in Word.
b) The response variable is height, but 5 cases were recorded with heights
about 0.75 inches tall. The highlighted squares in the two plots correspond
to cases with large Cook’s distances. With respect to the Cook’s distances,
what is happening, swamping or masking?
7.11. For the Buxton (1920) data with multiple linear regression, height
was the response variable while an intercept, head length, nasal height, bigonal
breadth, and cephalic index were used as predictors in the multiple linear
regression model. Observation 9 was deleted since it had missing values. Five
7.9 Problems 357
individuals, cases 61–65, were reported to be about 0.75 inches tall with head
lengths well over five feet!
a) Copy and paste the commands for this problem into R. Include the lasso
response plot in Word. The identity line passes right through the outliers
which are obvious because of the large gap. Prediction interval (PI) bands
are also included in the plot.
b) Copy and paste the commands for this problem into R. Include the
lasso response plot in Word. This did lasso for the cases in the covmb2 set
B applied to the predictors which included all of the clean cases and omitted
the 5 outliers. The response plot was made for all of the data, including the
outliers.
c) Copy and paste the commands for this problem into R. Include the DD
plot in Word. The outliers are in the upper right corner of the plot.
7.12. Consider the Gladstone (1905) data set that has 12 variables on
267 persons after death. There are 5 infants in the data set. The response
variable was brain weight. Head measurements were breadth, circumference,
head height, length, and size as well as cephalic index and brain weight. Age,
height, and three categorical variables cause, ageclass (0: under 20, 1: 20-45,
2: over 45) and sex were also given. The constant x1 was the first variable.
The variables cause and ageclass were not coded as factors. Coding as factors
might improve the fit.
a) Copy and paste the commands for this problem into R. Include the
lasso response plot in Word. The identity line passes right through the infants
which are obvious because of the large gap. Prediction interval (PI) bands
are also included in the plot.
b) Copy and paste the commands for this problem into R. Include the
lasso response plot in Word. This did lasso for the cases in the covmb2 set
B applied to the nontrivial predictors which are not categorical (omit the
constant, cause, ageclass and sex) which omitted 8 cases, including the 5
infants. The response plot was made for all of the data.
c) Copy and paste the commands for this problem into R. Include the DD
plot in Word. The infants are in the upper right corner of the plot.
7.13. The linmodpack function mldsim6 compares 7 estimators: FCH,
RFCH, CMVE, RCMVE, RMVN, covmb2, and MB described in Olive
(2017b, ch. 4). Most of these estimators need n > 2p, need a nonsingu-
lar dispersion matrix, and work best with n > 10p. The function generates
data sets and counts how many times the minimum Mahalanobis distance
Di (T, C) of the outliers is larger than the maximum distance of the clean
data. The value pm controls how far the outliers need to be from the bulk of
the data, and pm roughly needs to increase with √p.
For data sets with p > n possible, the function mldsim7 used the Eu-
clidean distances Di (T, I p ) and the Mahalanobis distances Di (T, C d ) where
C d is the diagonal matrix with the same diagonal entries as C where (T, C)
is the covmb2 estimator using j concentration type steps. Dispersion ma-
trices are affected more by outliers than good robust location estimators,
so when the outlier proportion is high, it is expected that the Euclidean
distances Di (T, I p ) will outperform the Mahalanobis distance Di (T, C d ) for
many outlier configurations. Again the function counts the number of times
the minimum outlier distance is larger than the maximum distance of the
clean data.
Both functions used several outlier types. The simulations generated 100
data sets. The clean data had xi ∼ Np (0, diag(1, ..., p)). Type 1 had outliers
in a tight cluster (near point mass) at the major axis (0, ..., 0, pm)T . Type 2
had outliers in a tight cluster at the minor axis (pm, 0, ..., 0)T . Type 3 had
mean shift outliers xi ∼ Np ((pm, ..., pm)T , diag(1, ..., p)). Type 4 changed
the pth coordinate of the outliers to pm. Type 5 changed the 1st coordinate
of the outliers to pm. (If the outlier xi = (x1i , ..., xpi)T , then xi1 = pm.)
Table 7.8 Number of Times All Outlier Distances > Clean Distances, otype=1
n p γ osteps pm FCH RFCH CMVE RCMVE RMVN covmb2 MB
100 10 0.25 0 20 85 85 85 85 86 67 89
a) Table 7.8 suggests with osteps = 0, covmb2 had the worst count. When
pm is increased to 25, all counts become 100. Copy and paste the commands
for this part into R and make a table similar to Table 7.8, but now osteps=9
and p = 45 is close to n/2 for the second line where pm = 60. Your table
should have 2 lines from output.
Table 7.9 Number of Times All Outlier Distances > Clean Distances, otype=1
n p γ osteps pm covmb2 diag
100 1000 0.4 0 1000 100 41
100 1000 0.4 9 600 100 42
b) Copy and paste the commands for this part into R and make a table
similar to Table 7.9, but type 2 outliers are used.
c) When you have two reasonable outlier detectors, there are outlier con-
figurations where one will beat the other. Simulations suggest that “covmb2”
using Di (T, I p ) outperforms “diag” using Di (T, C d ) for many outlier config-
urations, but there are some exceptions. Copy and paste the commands for
this part into R and make a table similar to Table 7.9, but type 3 outliers
are used.
7.14. a) In addition to the source(“G:/linmodpack.txt”) command, also
use the source(“G:/linmoddata.txt”) command, and type the library(MASS)
command.
7.15. This problem is like Problem 7.11, except elastic net is used instead
of lasso.
a) Copy and paste the commands for this problem into R. Include the
elastic net response plot in Word. The identity line passes right through the
outliers which are obvious because of the large gap. Prediction interval (PI)
bands are also included in the plot.
b) Copy and paste the commands for this problem into R. Include the
elastic net response plot in Word. This did elastic net for the cases in the
covmb2 set B applied to the predictors which included all of the clean cases
and omitted the 5 outliers. The response plot was made for all of the data,
including the outliers. (Problem 7.11 c) shows the DD plot for the data.)
Chapter 8
Multivariate Linear Regression
This chapter will show that multivariate linear regression with m ≥ 2 re-
sponse variables is nearly as easy to use, at least if m is small, as multiple
linear regression which has 1 response variable. For multivariate linear re-
gression, at least one predictor variable is quantitative. Plots for checking
the model, including outlier detection, are given. Prediction regions that are
robust to nonnormality are developed. For hypothesis testing, it is shown
that the Wilks’ lambda statistic, Hotelling Lawley trace statistic, and Pillai’s
trace statistic are robust to nonnormality.
8.1 Introduction
Definition 8.1. The response variables are the variables that you want
to predict. The predictor variables are the variables used to predict the
response variables.
y_i = B^T x_i + ε_i
where v 1 = 1.
The p × m matrix

B = [ β_{1,1} β_{1,2} ... β_{1,m} ]
    [ β_{2,1} β_{2,2} ... β_{2,m} ]
    [   ...     ...   ...   ...   ]
    [ β_{p,1} β_{p,2} ... β_{p,m} ]  =  [ β_1  β_2  ...  β_m ].
The n × m matrix

E = [ ε_{1,1} ε_{1,2} ... ε_{1,m} ]
    [ ε_{2,1} ε_{2,2} ... ε_{2,m} ]
    [   ...     ...   ...   ...   ]
    [ ε_{n,1} ε_{n,2} ... ε_{n,m} ]  =  [ e_1  e_2  ...  e_m ]

has ith row ε_i^T. It is assumed that E(e_j) = 0 and Cov(e_j) = σ_{jj} I_n. Hence the errors corre-
sponding to the jth response are uncorrelated with variance σ_j² = σ_{jj}. Notice
that the same design matrix X of predictors is used for each of the m
models, but the jth response variable vector Y j , coefficient vector βj , and
error vector ej change and thus depend on j.
Now consider the ith case (xTi , yTi ) which corresponds to the ith row of Z
and the ith row of X. Then
Y_{i1} = β_{11} x_{i1} + · · · + β_{p1} x_{ip} + ε_{i1} = x_i^T β_1 + ε_{i1}
Y_{i2} = β_{12} x_{i1} + · · · + β_{p2} x_{ip} + ε_{i2} = x_i^T β_2 + ε_{i2}
...
Y_{im} = β_{1m} x_{i1} + · · · + β_{pm} x_{ip} + ε_{im} = x_i^T β_m + ε_{im}.
The notation y_i|x_i and E(y_i|x_i) is more accurate, but usually the condi-
tioning is suppressed. Taking µ_{x_i} to be a constant (or conditioning on x_i if the
predictor variables are random variables), y_i and ε_i have the same covariance
matrix. In the multivariate regression model, this covariance matrix Σ_ε does
not depend on i. Observations from different cases are uncorrelated (often
independent), but the m errors for the m different response variables for the
same case are correlated. If X is a random matrix, then assume X and E
are independent and that expectations are conditional on X.
Definition 8.3. Least squares is the classical method for fitting multivari-
ate linear regression. The least squares estimators are
B̂ = (X^T X)^{-1} X^T Z = [β̂_1 β̂_2 ... β̂_m].
The residuals

Ê = Z − Ẑ = Z − X B̂ = [r_1 r_2 ... r_m] = (ε̂_{i,j}),

with ith row ε̂_i^T and jth column r_j, and

Ê = [I_n − X(X^T X)^{-1} X^T] Z.
The following two theorems show that the least squares estimators are
fairly good. Also see Theorem 8.7 in Section 8.4. Theorem 8.2 can also be
used for Σ̂_{ε,d} = [(n − 1)/(n − d)] S_r.
Theorem 8.1, Johnson and Wichern (1988, p. 304): Suppose X has
full rank p < n and the covariance structure of Definition 8.2 holds. Then
E(B̂) = B so E(β̂_j) = β_j, Cov(β̂_j, β̂_k) = σ_{jk} (X^T X)^{-1} for j, k = 1, ..., m.
Also Ê and B̂ are uncorrelated, E(Ê) = 0, and

E(Σ̂_ε) = E( Ê^T Ê / (n − p) ) = Σ_ε.
Theorem 8.2. S_r = Σ_ε + O_P(n^{-1/2}) and (1/n) Σ_{i=1}^n ε_i ε_i^T = Σ_ε + O_P(n^{-1/2})
if the following three conditions hold: B − B̂ = O_P(n^{-1/2}), (1/n) Σ_{i=1}^n ε_i x_i^T =
O_P(1), and (1/n) Σ_{i=1}^n x_i x_i^T = O_P(n^{1/2}).
Proof. Note that y_i = B^T x_i + ε_i = B̂^T x_i + ε̂_i. Hence ε̂_i = (B − B̂)^T x_i + ε_i.
Thus

Σ_{i=1}^n ε̂_i ε̂_i^T = Σ_{i=1}^n (ε_i − ε_i + ε̂_i)(ε_i − ε_i + ε̂_i)^T = Σ_{i=1}^n [ε_i ε_i^T + ε_i (ε̂_i − ε_i)^T + (ε̂_i − ε_i) ε̂_i^T]

= Σ_{i=1}^n ε_i ε_i^T + (Σ_{i=1}^n ε_i x_i^T)(B − B̂) + (B − B̂)^T (Σ_{i=1}^n x_i ε_i^T) + (B − B̂)^T (Σ_{i=1}^n x_i x_i^T)(B − B̂).

Thus (1/n) Σ_{i=1}^n ε̂_i ε̂_i^T = (1/n) Σ_{i=1}^n ε_i ε_i^T +
O_P(1) O_P(n^{-1/2}) + O_P(n^{-1/2}) O_P(1) + O_P(n^{-1/2}) O_P(n^{1/2}) O_P(n^{-1/2}),

and the result follows since (1/n) Σ_{i=1}^n ε_i ε_i^T = Σ_ε + O_P(n^{-1/2}) and

S_r = [n/(n − 1)] (1/n) Σ_{i=1}^n ε̂_i ε̂_i^T.

S_r and Σ̂_ε are also √n consistent estimators of Σ_ε by Su and Cook (2012,
p. 692). See Theorem 8.7.
8.2 Plots for the Multivariate Linear Regression Model
This section suggests using residual plots, response plots, and the DD plot to
examine the multivariate linear model. The DD plot is used to examine the
distribution of the iid error vectors. The residual plots are often used to check
for lack of fit of the multivariate linear model. The response plots are used
to check linearity and to detect influential cases for the linearity assumption.
The response and residual plots are used exactly as in the m = 1 case corre-
sponding to multiple linear regression and experimental design models. See
Olive (2010, 2017a), Olive et al. (2015), Olive and Hawkins (2005), and Cook
and Weisberg (1999, p. 432).
Definition 8.4. A response plot for the jth response variable is a plot
of the fitted values Ŷ_ij versus the response Y_ij. The identity line with slope
one and zero intercept is added to the plot as a visual aid. A residual plot
corresponding to the jth response variable is a plot of Ŷ_ij versus r_ij.
Remark 8.1. Make the m response and residual plots for any multivariate
linear regression. In a response plot, the vertical deviations from the identity
line are the residuals rij = Yij − Ŷij . Suppose the model is good, the jth error
distribution is unimodal and not highly skewed for j = 1, ..., m, and n ≥ 10p.
Then the plotted points should cluster about the identity line in each of the
m response plots. If outliers are present or if the plot is not linear, then the
current model or data need to be transformed or corrected. If the model is
good, then each of the m residual plots should be ellipsoidal with no trend
and should be centered about the r = 0 line. There should not be any pattern
in the residual plot: as a narrow vertical strip is moved from left to right, the
behavior of the residuals within the strip should show little change. Outliers
and patterns such as curvature or a fan shaped plot are bad.
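For readers who want to try this, the following base R sketch (our own code, not a linmodpack function) makes the m response plots and m residual plots from a least squares multivariate linear regression fit; Z is the n × m response matrix and X holds the predictors.

mreg_plots <- function(Z, X) {
  Z <- as.matrix(Z)
  fit <- lm(Z ~ ., data = as.data.frame(X))   # multivariate least squares fit
  FIT <- fitted(fit); RES <- residuals(fit)
  op <- par(mfrow = c(2, ncol(Z))); on.exit(par(op))
  for (j in 1:ncol(Z)) {                      # response plots with the identity line
    plot(FIT[, j], Z[, j], xlab = "FIT", ylab = "Y"); abline(0, 1)
  }
  for (j in 1:ncol(Z)) {                      # residual plots centered at r = 0
    plot(FIT[, j], RES[, j], xlab = "FIT", ylab = "RES"); abline(h = 0)
  }
}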
Rule of thumb 8.1. Use multivariate linear regression if
provided that the m response and residual plots all look good. Make the DD
plot of the ˆi . If a residual plot would look good after several points have
been deleted, and if these deleted points were not gross outliers (points far
from the point cloud formed by the bulk of the data), then the residual plot
is probably good. Beginners often find too many things wrong with a good
model. For practice, use the computer to generate several multivariate linear
regression data sets, and make the m response and residual plots for these
data sets. This exercise will help show that the plots can have considerable
variability even when the multivariate linear regression model is good. The
linmodpack function MLRsim simulates response and residual plots for various
distributions when m = 1.
Rule of thumb 8.2. If the plotted points in the residual plot look like
a left or right opening megaphone, the first model violation to check is the
assumption of nonconstant variance. (This is a rule of thumb because it is
possible that such a residual plot results from another model violation such
as nonlinearity, but nonconstant variance is much more common.)
Remark 8.2. Residual plots magnify departures from the model while the
response plots emphasize how well the multivariate linear regression model
fits the data.
Definition 8.5. An RR plot is a scatterplot matrix of the m sets of
residuals r1 , ..., rm .
Remark 8.3. Some applications for multivariate linear regression need the
m error vectors to be linearly related, and larger sample sizes may be needed
if the error vectors are not linearly related. For example, the asymptotic
optimality of the prediction regions of Section 8.3 needs the error vectors to
be iid from an elliptically contoured distribution. Make the RR plot and a
DD plot of the residual vectors ε̂_i to check that the error vectors are linearly
related. Make a DD plot of the continuous predictor variables to check for
x-outliers. Make a DD plot of Y_1, ..., Y_m to check for outliers, especially if
it is assumed that the response variables come from an elliptically contoured
distribution.
The RMVN DD plot of the residual vectors ε̂_i is used to check the error
vector distribution, to detect outliers, and to display the nonparametric pre-
diction region developed in Section 8.3. The DD plot suggests that the error
vector distribution is elliptically contoured if the plotted points cluster tightly
about a line through the origin as n → ∞. The plot suggests that the error
vector distribution is multivariate normal if the line is the identity line. If n
is large and the plotted points do not cluster tightly about a line through the
origin, then the error vector distribution may not be elliptically contoured.
These applications of the DD plot for iid multivariate data are discussed in
Olive (2002, 2008, 2013a, 2017b) and Chapter 7. The RMVN estimator has
not yet been proven to be a consistent estimator when computed from resid-
ual vectors, but simulations suggest that the RMVN DD plot of the residual
vectors is a useful diagnostic plot. The linmodpack function mregddsim can
be used to simulate the DD plots for various distributions.
Predictor transformations for the continuous predictors can be made ex-
actly as in Section 1.2.
Warning: The log rule and other transformations do not always work. For
example, the log rule may fail. If the relationships in the scatterplot matrix are
already linear or if taking the transformation does not increase the linearity,
then no transformation may be better than taking a transformation. For
the Cook and Weisberg (1999) data set evaporat.lsp with m = 1, the log
rule suggests transforming the response variable Evap, but no transformation
works better.
8.3 Asymptotically Optimal Prediction Regions
The classical large sample 100(1 − δ)% prediction region for a future value
x_f given iid data x_1, ..., x_n is {x : D_x²(x̄, S) ≤ χ²_{p,1−δ}}, while for multi-
variate linear regression, the classical large sample 100(1 − δ)% prediction
region for a future value y_f given x_f and past data (x_1, y_1), ..., (x_n, y_n) is
{y : D_y²(ŷ_f, Σ̂_ε) ≤ χ²_{m,1−δ}}. See Johnson and Wichern (1988, pp. 134, 151,
312). By Equation (1.36), these regions may work for multivariate normal x_i
or ε_i, but otherwise tend to have undercoverage. Section 4.4 and Olive (2013a)
replaced χ²_{p,1−δ} by the order statistic D²_{(U_n)} where U_n decreases to ⌈n(1 − δ)⌉.
This section will use a similar technique from Olive (2018) to develop possibly
the first practical large sample prediction region for the multivariate linear
model with unknown error distribution. The following technical theorem will
be needed to prove Theorem 8.4.
Theorem 8.3. Let a > 0 and assume that (µ̂_n, Σ̂_n) is a consistent esti-
mator of (µ, aΣ).
a) D_x²(µ̂_n, Σ̂_n) − (1/a) D_x²(µ, Σ) = o_P(1).
b) Let 0 < δ ≤ 0.5. If (µ̂_n, Σ̂_n) − (µ, aΣ) = O_P(n^{−δ}) and a Σ̂_n^{−1} − Σ^{−1} =
O_P(n^{−δ}), then

D_x²(µ̂_n, Σ̂_n) − (1/a) D_x²(µ, Σ) = O_P(n^{−δ}).

Proof. Let B_n denote the subset of the sample space on which Σ̂_n has an
inverse. Then P(B_n) → 1 as n → ∞. Now

D_x²(µ̂_n, Σ̂_n) = (x − µ̂_n)^T Σ̂_n^{−1} (x − µ̂_n) =

(x − µ̂_n)^T ( Σ^{−1}/a − Σ^{−1}/a + Σ̂_n^{−1} ) (x − µ̂_n) =

(x − µ̂_n)^T ( −Σ^{−1}/a + Σ̂_n^{−1} ) (x − µ̂_n) + (x − µ̂_n)^T ( Σ^{−1}/a ) (x − µ̂_n) =

(1/a)(x − µ̂_n)^T ( −Σ^{−1} + a Σ̂_n^{−1} )(x − µ̂_n) +

(x − µ + µ − µ̂_n)^T ( Σ^{−1}/a ) (x − µ + µ − µ̂_n)

= (1/a)(x − µ)^T Σ^{−1} (x − µ) + (2/a)(x − µ)^T Σ^{−1} (µ − µ̂_n) +

(1/a)(µ − µ̂_n)^T Σ^{−1} (µ − µ̂_n) + (1/a)(x − µ̂_n)^T [a Σ̂_n^{−1} − Σ^{−1}] (x − µ̂_n)

on B_n, and the last three terms are o_P(1) under a) and O_P(n^{−δ}) under b).
for i = 1, ..., n. Let q_n = min(1 − δ + 0.05, 1 − δ + m/n) for δ > 0.1 and

{z : D_z²(ŷ_f, Σ̂_ε) ≤ D²_{(U_n)}} = {z : D_z(ŷ_f, Σ̂_ε) ≤ D_{(U_n)}}.    (8.1)
a) Consider the n prediction regions for the data where (y_{f,i}, x_{f,i}) =
(y_i, x_i) for i = 1, ..., n. If the order statistic D_{(U_n)} is unique, then U_n of the
n prediction regions contain y_i where U_n/n → 1 − δ as n → ∞.
b) If (ŷ_f, Σ̂_ε) is a consistent estimator of (E(y_f), Σ_ε), then (8.1) is a
large sample 100(1 − δ)% prediction region for y_f.
c) If (ŷ_f, Σ̂_ε) is a consistent estimator of (E(y_f), Σ_ε), and the ε_i come
from an elliptically contoured distribution such that the unique highest den-
sity region is {z : D_z(0, Σ_ε) ≤ D_{1−δ}}, then the prediction region (8.1) is
asymptotically optimal.
Theorem 8.5 will show that this prediction region (8.2) can also be found
by applying the nonparametric prediction region (4.24) on the ẑ i . Recall that
S_r defined in Definition 8.3 is the sample covariance matrix of the residual
vectors ε̂_i. For the multivariate linear regression model, if D_{1−δ} is a continuity
point of the distribution of D, Assumption D1 above Theorem 8.7 holds, and
the i have a nonsingular covariance matrix, then (8.2) is a large sample
100(1 − δ)% prediction region for y f .
The RMVN DD plot of the residual vectors will be used to display the
prediction regions for multivariate linear regression. See Example 8.3. The
nonparametric prediction region for multivariate linear regression of Theorem
8.5 uses (T, C) = (ŷ f , S r ) in (8.1), and has simple geometry. Let Rr be the
nonparametric prediction region (8.2) applied to the residuals ˆi with ŷ f = 0.
Then Rr is a hyperellipsoid with center 0, and the nonparametric prediction
region is the hyperellipsoid Rr translated to have center ŷ f . Hence in a DD
plot, all points to the left of the line MD = D_{(U_n)} correspond to y_i that are
in their prediction region, while points to the right of the line are not in their
prediction region.
The nonparametric prediction region has some interesting properties. This
prediction region is asymptotically optimal if the ε_i are iid for a large class
of elliptically contoured EC_m(0, Σ, g) distributions. Also, if there are 100
different values (x_{jf}, y_{jf}) to be predicted, we only need to update ŷ_{jf} for
j = 1, ..., 100; we do not need to update the covariance matrix S_r.
It is common practice to examine how well the prediction regions work
on the training data. That is, for i = 1, ..., n, set xf = xi and see if y i is
in the region with probability near to 1 − δ with a simulation study. Note
that ŷ f = ŷ i if xf = xi . Simulation is not needed for the nonparametric
prediction region (8.2) for the data since the prediction region (8.2) centered
at ŷ_i contains y_i iff R_r, the prediction region centered at 0, contains ε̂_i since
ε̂_i = y_i − ŷ_i. Thus 100 q_n% of prediction regions corresponding to the data
(y i , xi ) contain y i , and 100qn% → 100(1 − δ)%. Hence the prediction regions
work well on the training data and should work well on (xf , y f ) similar to
the training data. Of course simulation should be done for test data (xf , y f )
that are not equal to training data cases. See Problem 8.11.
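A bare-bones R sketch of this check is given below (our own code, under the assumption that the δ > 0.1 branch of q_n quoted before (8.1) applies and taking U_n = ⌈n q_n⌉); Ehat is the n × m matrix of residual vectors, yhatf the prediction at x_f, and the function reports whether y_f falls in the nonparametric prediction region.

in_region <- function(yf, yhatf, Ehat, delta) {
  n <- nrow(Ehat); m <- ncol(Ehat)
  Sr <- cov(Ehat)                                  # sample covariance S_r of the residual vectors
  D2 <- mahalanobis(Ehat, center = rep(0, m), cov = Sr)
  qn <- min(1 - delta + 0.05, 1 - delta + m/n)     # q_n for delta > 0.1
  cutoff <- sort(D2)[ceiling(n * qn)]              # D^2_(Un), with U_n taken as ceiling(n*qn)
  mahalanobis(rbind(yf), center = yhatf, cov = Sr) <= cutoff
}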
This training data result holds provided that the multivariate linear regres-
sion using least squares is such that the sample covariance matrix S r of the
residual vectors is nonsingular, the multivariate regression model need
not be correct. Hence the coverage at the n training data cases (xi , yi )
is robust to model misspecification. Of course, the prediction regions may
be very large if the model is severely misspecified, but severity of misspec-
ification can be checked with the response and residual plots. Coverage for
a future value y f can also be arbitrarily bad if there is extrapolation or if
(xf , yf ) comes from a different population than that of the data.
8.4 Testing Hypotheses
Definition 8.8. Assume rank(X) = p. The total corrected (for the mean)
sum of squares and cross products matrix is

T = R + W_e = Z^T [I_n − (1/n) 1 1^T] Z.

Note that T/(n − 1) is the usual sample covariance matrix Σ̂_y if all n of the
y_i are iid, e.g. if B = 0. The regression sum of squares and cross products
matrix is

R = Z^T [X(X^T X)^{-1} X^T − (1/n) 1 1^T] Z = Z^T X B̂ − (1/n) Z^T 1 1^T Z.

Let H = B̂^T L^T [L(X^T X)^{-1} L^T]^{-1} L B̂. The error or residual sum of squares
and cross products matrix is W_e = Ê^T Ê = (n − p) Σ̂_ε.
Source                    matrix   df
Regression or Treatment   R        p − 1
Error or Residual         W_e      n − p
Total (corrected)         T        n − 1
Typically some function of one of the four above statistics is used to get
pval, the estimated pvalue. Output often gives the pvals for all four test
statistics. Be cautious about inference if the last three test statistics do not
lead to the same conclusions (Roy’s test may not be trustworthy for r > 1).
Theory and simulations developed below for the four statistics will provide
more information about the sample sizes needed to use the four test statistics.
See the paragraphs after the following theorem for the notation used in that
theorem.
Theorem 8.6. The Hotelling-Lawley trace statistic

U(L) = [1/(n − p)] [vec(L B̂)]^T [Σ̂_ε^{-1} ⊗ (L(X^T X)^{-1} L^T)^{-1}] [vec(L B̂)].    (8.3)
Some more details on the above results may be useful. Consider testing a
linear hypothesis H_0: LB = 0 versus H_1: LB ≠ 0 where L is a full rank
r × p matrix. For now assume the error distribution is multivariate normal
N_m(0, Σ_ε). Then

vec(B̂ − B) = [(β̂_1 − β_1)^T, (β̂_2 − β_2)^T, ..., (β̂_m − β_m)^T]^T ~ N_pm(0, Σ_ε ⊗ (X^T X)^{-1})
where
C = Σ_ε ⊗ (X^T X)^{-1} = [ σ_11 (X^T X)^{-1}  σ_12 (X^T X)^{-1}  ...  σ_1m (X^T X)^{-1} ]
                          [ σ_21 (X^T X)^{-1}  σ_22 (X^T X)^{-1}  ...  σ_2m (X^T X)^{-1} ]
                          [        ...                ...         ...         ...        ]
                          [ σ_m1 (X^T X)^{-1}  σ_m2 (X^T X)^{-1}  ...  σ_mm (X^T X)^{-1} ].
Now let A be an rm × pm block diagonal matrix: A = diag(L, ..., L). Then
A vec(B̂ − B) = vec(L(B̂ − B)) =

[(L(β̂_1 − β_1))^T, (L(β̂_2 − β_2))^T, ..., (L(β̂_m − β_m))^T]^T ~ N_rm(0, Σ_ε ⊗ L(X^T X)^{-1} L^T).

Hence under H_0,

[vec(L B̂)]^T [Σ_ε^{-1} ⊗ (L(X^T X)^{-1} L^T)^{-1}] [vec(L B̂)] ~ χ²_rm,

and

T = [vec(L B̂)]^T [Σ̂_ε^{-1} ⊗ (L(X^T X)^{-1} L^T)^{-1}] [vec(L B̂)] →D χ²_rm.    (8.4)
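The statistic in (8.4) is straightforward to compute with base R. The sketch below (our own code and function name, not a linmodpack routine) returns the statistic and its χ²_rm p-value for H_0: LB = 0, given the response matrix Z, design matrix X (including the column of ones), and a full rank r × p matrix L.

mlr_chisq_test <- function(Z, X, L) {
  n <- nrow(X); p <- ncol(X); m <- ncol(Z); r <- nrow(L)
  XtXinv <- solve(crossprod(X))
  Bhat <- XtXinv %*% crossprod(X, Z)                 # least squares B-hat
  Ehat <- Z - X %*% Bhat                             # residual matrix
  SigE <- crossprod(Ehat)/(n - p)                    # Sigma-hat_epsilon
  v <- as.vector(L %*% Bhat)                         # vec(L B-hat), columns stacked
  Wmat <- solve(SigE) %x% solve(L %*% XtXinv %*% t(L))
  Tstat <- drop(t(v) %*% Wmat %*% v)                 # statistic in (8.4)
  c(stat = Tstat, pval = pchisq(Tstat, df = r * m, lower.tail = FALSE))
}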
Since least squares estimators are asymptotically normal, if the ε_i are iid
for a large class of distributions,
√n vec(B̂ − B) = √n ((β̂_1 − β_1)^T, (β̂_2 − β_2)^T, ..., (β̂_m − β_m)^T)^T →D N_{pm}(0, Σ_ε ⊗ W)

where

(X^T X)/n →P W^{−1}.
Then under H0,

√n vec(LB̂) = √n ((Lβ̂_1)^T, (Lβ̂_2)^T, ..., (Lβ̂_m)^T)^T →D N_{rm}(0, Σ_ε ⊗ L W L^T),

and

n [vec(LB̂)]^T [Σ_ε^{−1} ⊗ (L W L^T)^{−1}] [vec(LB̂)] →D χ²_{rm}.
Hence (8.4) holds, and (8.5) gives a large sample level δ test if the least
squares estimators are asymptotically normal.
Kakizawa (2009) showed, under stronger assumptions than Theorem 8.8, that for a large class of iid error distributions, the following test statistics have the same χ²_{rm} limiting distribution when H0 is true, and the same noncentral χ²_{rm}(ω²) limiting distribution with noncentrality parameter ω² when H0 is false under a local alternative. Hence the three tests are robust to the assumption of normality. The limiting null distribution is well known when the zero mean errors are iid from a multivariate normal distribution. See Khattree and Naik (1999, p. 68): (n − p)U(L) →D χ²_{rm}, (n − p)V(L) →D χ²_{rm}, and −[n − p − 0.5(m − r + 3)] log(Λ(L)) →D χ²_{rm}. Results from Kshirsagar (1972, p. 301) suggest that the third chi-square approximation is very good if n ≥ 3(m + p)² for multivariate normal error vectors.
Theorems 8.6 and 8.8 are useful for relating multivariate tests with the partial F test for multiple linear regression that tests whether a reduced model that omits some of the predictors can be used instead of the full model that uses all p predictors. The partial F test statistic is

F_R = [ (SSE(R) − SSE(F)) / (df_R − df_F) ] / MSE(F)

where the residual sums of squares SSE(F) and SSE(R) and degrees of freedom df_F and df_R are for the full and reduced model while the mean square error MSE(F) is for the full model. Let the null hypothesis for the partial F test be H0: Lβ = 0 where L sets the coefficients of the predictors in the full model but not in the reduced model to 0. Seber and Lee (2003, p. 100) show that
Hence the Hotelling Lawley test will have the most power and Pillai’s test
will have the least power.
Following Khattree and Naik (1999, pp. 67-68), there are several approximations used by the SAS software. For Roy's largest root test, if h = max(r, m), use

[(n − p − h + r)/h] λ_max(L) ≈ F(h, n − p − h + r).

The simulations in Section 8.5 suggest that this approximation is good for r = 1 but poor for r > 1. Anderson (1984, p. 333) stated that Roy's largest root test has the greatest power if r = 1 but is an inferior test for r > 1. Let g = n − p − (m − r + 1)/2, u = (rm − 2)/4 and t = √(r²m² − 4)/√(m² + r² − 5) for m² + r² − 5 > 0 and t = 1, otherwise. Assume H0 is true. Thus U →P 0, V →P 0, and Λ →P 1 as n → ∞. Then

[(gt − 2u)/(rm)] (1 − Λ^{1/t})/Λ^{1/t} ≈ F(rm, gt − 2u)   or   (n − p) t (1 − Λ^{1/t})/Λ^{1/t} ≈ χ²_{rm}.
For large n and t > 0, −log(Λ) = −t log(Λ^{1/t}) = −t log(1 + Λ^{1/t} − 1) ≈ t(1 − Λ^{1/t}) ≈ t(1 − Λ^{1/t})/Λ^{1/t}. If it cannot be shown that

(n − p)[−log(Λ) − t(1 − Λ^{1/t})/Λ^{1/t}] →P 0 as n → ∞,
then it is possible that the approximate χ²_{rm} distribution may be the limiting distribution for only a small class of iid error distributions. When the ε_i are iid N_m(0, Σ_ε), there are some exact results. For r = 1,

[(n − p − m + 1)/m] (1 − Λ)/Λ ∼ F(m, n − p − m + 1).
For r = 2,

[2(n − p − m + 1)/(2m)] (1 − Λ^{1/2})/Λ^{1/2} ∼ F(2m, 2(n − p − m + 1)).
For m = 2,

[2(n − p)/(2r)] (1 − Λ^{1/2})/Λ^{1/2} ∼ F(2r, 2(n − p)).
Let s = min(r, m), m1 = (|r − m| − 1)/2 and m2 = (n − p − m − 1)/2. Note that s(|r − m| + s) = min(r, m) max(r, m) = rm.
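As a small numerical illustration of the exact result for r = 1 above, the following R sketch converts a Wilks' Λ value into the exact F statistic and pvalue; the values of Λ, n, p, and m are hypothetical.

n <- 100; p <- 4; m <- 2; Lambda <- 0.9
Fstat <- ((n - p - m + 1)/m) * (1 - Lambda)/Lambda   # exact F for r = 1
pval  <- 1 - pf(Fstat, m, n - p - m + 1)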
The analog of the ANOVA F test for multiple linear regression is the
MANOVA F test that uses L = [0 I p−1 ] to test whether the nontrivial
predictors are needed in the model. This test should reject H0 if the response
and residual plots look good, n is large enough, and at least one response
plot does not look like the corresponding residual plot. A response plot for
Yj will look like a residual plot if the identity line appears almost horizontal,
hence the range of Ŷj is small. Response and residual plots are often useful
for n ≥ 10p.
The 4 step MANOVA F test of hypotheses uses L = [0 I_{p−1}].
i) State the hypotheses H0: the nontrivial predictors are not needed in the mreg model versus H1: at least one of the nontrivial predictors is needed.
ii) Find the test statistic F0 from output.
iii) Find the pval from output.
iv) If pval ≤ δ, reject H0. If pval > δ, fail to reject H0. If H0 is rejected, conclude that there is a mreg relationship between the response variables Y1, ..., Ym and the predictors x2, ..., xp. If you fail to reject H0, conclude that there is not a mreg relationship between Y1, ..., Ym and the predictors x2, ..., xp. (Or there is not enough evidence to conclude that there is a mreg relationship between the response variables and the predictors. Get the variable names from the story problem.)
The Fj test of hypotheses uses Lj = [0, ..., 0, 1, 0, ..., 0], where the 1 is in
the jth position, to test whether the jth predictor xj is needed in the model
given that the other p − 1 predictors are in the model. This test is an analog
of the t tests for multiple linear regression. Note that xj is not needed in the
model corresponds to H0: B_j = 0 while xj needed in the model corresponds to H1: B_j ≠ 0 where B_j^T is the jth row of B.
The 4 step Fj test of hypotheses uses Lj = [0, ..., 0, 1, 0, ..., 0] where the 1
is in the jth position.
i) State the hypotheses H0 : xj is not needed in the model
H1 : xj is needed.
ii) Find the test statistic Fj from output.
iii) Find pval from output.
iv) If pval ≤ δ, reject H0 . If pval > δ, fail to reject H0 . Give a nontechnical
sentence restating your conclusion in terms of the story problem. If H0 is
rejected, then conclude that xj is needed in the mreg model for Y1 , ..., Ym
given that the other predictors are in the model. If you fail to reject H0 , then
conclude that xj is not needed in the mreg model for Y1 , ..., Ym given that
the other predictors are in the model. (Or there is not enough evidence to
conclude that xj is needed in the model. Get the variable names from the
story problem.)
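The linmodpack function mltreg, illustrated in the output below and in Section 8.9, produces these test statistics. A rough base R alternative is sketched here: fit a multivariate linear model with lm and compare nested fits with anova. The data frame and variable names are hypothetical, and these base R tests are analogs of, not necessarily identical to, the mltreg output.

full <- lm(cbind(y1, y2) ~ x2 + x3 + x4, data = dat)   # all nontrivial predictors
none <- lm(cbind(y1, y2) ~ 1, data = dat)              # intercept only
anova(full, none, test = "Hotelling-Lawley")           # analog of the MANOVA F test
red  <- lm(cbind(y1, y2) ~ x2 + x3, data = dat)        # drop x4
anova(full, red, test = "Hotelling-Lawley")            # analog of the F4 test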
$Ftable
Fj pvals
[1,] 6.30355375 0.01677169
[2,] 1.51013090 0.28449166
[3,] 5.61329324 0.02279833
[4,] 0.06482555 0.97701447
$MANOVA
MANOVAF pval
[1,] 3.150118 0.06038742
#Output for Example 8.2
y<-marry[,c(2,3)]; x<-marry[,-c(2,3)];
mltreg(x,y,indices=c(3,4))
$partial
partialF Pval
[1,] 0.2001622 0.9349877
$Ftable
Fj pvals
[1,] 4.35326807 0.02870083
[2,] 600.57002201 0.00000000
[3,] 0.08819810 0.91597268
[4,] 0.06531531 0.93699302
$MANOVA
MANOVAF pval
[1,] 295.071 1.110223e-16
Example 8.2. The above output is for the Hebbler (1847) data from
the 1843 Prussia census. Sometimes if the wife or husband was not at the
household, then s/he would not be counted. Y1 = number of married civilian
men in the district, Y2 = number of women married to civilians in the district,
x2 = population of the district in 1843, x3 = number of married military men
In the DD plot, cases to the left of the vertical line are in their nonparametric
prediction region. The long horizontal line corresponds to a similar cutoff
based on the RD. The shorter horizontal line that ends at the identity line
is the parametric MVN prediction region from Section 4.4 applied to the
ẑi . Points below these two lines are only conjectured to be large sample
prediction regions, but are added to the DD plot as visual aids. Note that
ẑi = ŷf + ε̂i, and adding a constant ŷf to all of the residual vectors does not change the Mahalanobis distances, so the DD plot of the residual vectors can be used to display the prediction regions.
Fig. 8.1 Response and Residual Plots for Y1 for the Mussels Data.
Example 8.3. Cook and Weisberg (1999, pp. 351, 433, 447) gave a data
set on 82 mussels sampled off the coast of New Zealand. Let Y1 = log(S)
and Y2 = log(M ) where S is the shell mass and M is the muscle mass.
The predictors are X2 = L, X3 = log(W ), and X4 = H: the shell length,
log(width), and height. To check linearity of the multivariate linear regression
model, Figures 8.1 and 8.2 give the response and residual plots for Y1 and
Y2 . The response plots show strong linear relationships. For Y1 , case 79 sticks
out while for Y2 , cases 8, 25, and 48 are not fit well. Highlighted cases had
Cook’s distance > min(0.5, 2p/n). See Cook (1977).
To check the error vector distribution, the DD plot should be used instead of univariate residual plots, which do not take into account the correlations of the random variables ε_1, ..., ε_m in the error vector ε. A residual vector ε̂ = (ε̂ − ε) + ε is a combination of ε and a discrepancy ε̂ − ε that tends to have an approximate multivariate normal distribution. The ε̂ − ε term can dominate for small to moderate n when ε is not multivariate normal,
Fig. 8.2 Response and Residual Plots for Y2 for the Mussels Data.
Fig. 8.3 DD Plot of the Residual Vectors for the Mussels Data.
Fig. 8.4 Response and Residual Plots for Y2 = M for the Mussels Data.
c) Now suppose the same model is used except Y2 = M . Then the response
and residual plots for Y1 remain the same, but the plots shown in Figure 8.4
show curvature about the identity and r = 0 lines. Hence the linearity condi-
tion is violated. Figure 8.5 shows that the plotted points in the DD plot have
correlation well less than one, suggesting that the error vector distribution
Fig. 8.5 DD Plot of the Residual Vectors When Y2 = M.
A small simulation was used to study the Wilks’ Λ test, the Pillai’s trace
test, the Hotelling Lawley trace test, and the Roy’s largest root test for the
Fj tests and the MANOVA F test for multivariate linear regression. The first
row of B was always 1^T and the last row of B was always 0^T. When the null hypothesis for the MANOVA F test is true, all but the first row corresponding to the constant are equal to 0^T. When p ≥ 3 and the null hypothesis for the MANOVA F test is false, then the second to last row of B is (1, 0, ..., 0), the third to last row is (1, 1, 0, ..., 0) et cetera as long as the first row is not changed from 1^T. First m × 1 error vectors wi were generated such that the m random variables in the vector wi are iid with variance σ². Let the m × m matrix A = (aij) with aii = 1 and aij = ψ where 0 ≤ ψ < 1 for i ≠ j. Then ε_i = A wi so that Σ_ε = σ² A A^T = (σij) where the diagonal entries σii = σ²[1 + (m − 1)ψ²] and the off diagonal entries σij = σ²[2ψ + (m − 2)ψ²] where ψ = 0.10. Hence the correlations are (2ψ + (m − 2)ψ²)/(1 + (m − 1)ψ²). As ψ gets close to 1, the error vectors cluster about the line in the direction of (1, ..., 1)^T. We used wi ∼ N_m(0, I), wi ∼ (1 − τ)N_m(0, I) + τ N_m(0, 25I) with 0 < τ < 1 and τ = 0.25 in the simulation, wi ∼ multivariate t_d with d = 7 degrees of freedom, or wi ∼ lognormal − E(lognormal): where the m components of wi were iid with distribution e^z − E(e^z) where z ∼ N(0, 1).
Only the lognormal distribution is not elliptically contoured.
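A minimal R sketch of this error generation scheme for the multivariate normal case follows; the values of n, m, σ, and ψ are hypothetical.

set.seed(1)
n <- 100; m <- 4; sigma <- 1; psi <- 0.10
A <- matrix(psi, m, m); diag(A) <- 1          # a_ii = 1, a_ij = psi
w <- matrix(rnorm(n*m, sd = sigma), n, m)     # rows w_i with iid components
eps <- w %*% t(A)                             # rows eps_i = A w_i, Cov = sigma^2 A A^T
cov(eps)                                      # compare with sigma^2 A A^T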
The simulation used 5000 runs, and H0 was rejected if the F statistic was greater than F_{d1,d2}(0.95) where P(F_{d1,d2} < F_{d1,d2}(0.95)) = 0.95 with d1 = rm and d2 = n − mp for the test statistics

(n − p − h + r) λ_max(L)/h.
Denote these statistics by W , P , HL, and R. Let the coverage be the propor-
tion of times that H0 is rejected. We want coverage near 0.05 when H0 is true
and coverage close to 1 for good power when H0 is false. With 5000 runs,
coverage outside of (0.04,0.06) suggests that the true coverage is not 0.05.
Coverages are tabled for the F1 , F2 , Fp−1 , and Fp test and for the MANOVA
F test denoted by FM . The null hypothesis H0 was always true for the Fp
test and always false for the F1 test. When the MANOVA F test was true,
H0 was true for the Fj tests with j 6= 1. When the MANOVA F test was
false, H0 was false for the Fj tests with j 6= p, but the Fp−1 test should be
hardest to reject for j 6= p by construction of B and the error vectors.
When the null hypothesis H0 was true, simulated values started to get close to nominal levels for n ≥ 0.8(m + p)², and were fairly good for n ≥ 1.5(m + p)². The exception was Roy's test which rejects H0 far too often if r > 1. See Table
8.1 where we want values for the F1 test to be close to 1 since H0 is false
for the F1 test, and we want values close to 0.05, otherwise. Roy’s test was
very good for the Fj tests but very poor for the MANOVA F test. Results
are shown for m = p = 10. As expected from Berndt and Savin (1977),
Pillai’s test rejected H0 less often than Wilks’ test which rejected H0 less
often than the Hotelling Lawley test. Based on a much larger simulation
study, using the four types of error vector distributions and m = p, the tests
had approximately correct level if n ≥ 0.83(m + p)² for the Hotelling Lawley test, if n ≥ 2.80(m + p)² for the Wilks' test (agreeing with Kshirsagar (1972) n ≥ 3(m + p)² for multivariate normal data), and if n ≥ 4.2(m + p)² for Pillai's test.
In Table 8.2, H0 is only true for the Fp test where p = m, and we want
values in the Fp column near 0.05. We want values near 1 for high power
otherwise. If H0 is false, often H0 will be rejected for small n. For example,
if n ≥ 10p, then the m residual plots should start to look good, and the
MANOVA F test should be rejected. For the simulated data, the test had
fair power for n not much larger than mp. Results are shown for the lognormal
distribution.
Some R output for reproducing the simulation is shown below. The linmod-
pack function is mregsim and etype = 1 uses data from a MVN distribution.
The fcov line computed the Hotelling Lawley statistic using Equation (8.3)
while the hotlawcov line used Definition 8.9. The mnull=T part of the com-
mand means we want the first value near 1 for high power and the next three
numbers near the nominal level 0.05 except for mancv where we want all
of the MANOVA F test statistics to be near the nominal level of 0.05. The
mnull=F part of the command means want all values near 1 for high power
except for the last column (for the terms other than mancv) corresponding to
the Fp test where H0 is true so we want values near the nominal level of 0.05.
The “coverage” is the proportion of times that H0 is rejected, so “coverage”
is short for “power” and “level”: we want the coverage near 1 for high power
when H0 is false and we want the coverage near the nominal level 0.05 when
H0 is true. Also see Problem 8.10.
mregsim(nruns=5000,etype=1,mnull=T)
$wilkcov
[1] 1.0000 0.0450 0.0462 0.0430
$pilcov
[1] 1.0000 0.0414 0.0432 0.0400
$hotlawcov
[1] 1.0000 0.0522 0.0516 0.0490
$roycov
[1] 1.0000 0.0512 0.0500 0.0480
$fcov
[1] 1.0000 0.0522 0.0516 0.0490
$mancv
wcv pcv hlcv rcv fcv
mregsim(nruns=5000,etype=2,mnull=F)
$wilkcov
[1] 0.9834 0.9814 0.9104 0.0408
$pilcov
[1] 0.9824 0.9804 0.9064 0.0372
$hotlawcov
[1] 0.9856 0.9838 0.9162 0.0480
$roycov
[1] 0.9848 0.9834 0.9156 0.0462
$fcov
[1] 0.9856 0.9838 0.9162 0.0480
$mancv
wcv pcv hlcv rcv fcv
[1,] 0.993 0.9918 0.9942 0.9978 0.9942
See Olive (2017b, Section 12.5.2) for simulations for the prediction region. Also see Problem 8.11.
8.7 Bootstrap
The parametric bootstrap for the multivariate linear regression model uses y*_i ∼ N_m(B̂^T x_i, Σ̂_ε) for i = 1, ..., n where we are not assuming that the ε_i ∼ N_m(0, Σ_ε). Let Z*_j have ith row y*_i^T and regress Z*_j on X to obtain B̂*_j for j = 1, ..., B. Let S ⊆ I, let B̂*_I = (X_I^T X_I)^{−1} X_I^T Z*, and assume n(X_I^T X_I)^{−1} →P W_I for any I such that S ⊆ I. Then with calculations similar to those for the multiple linear regression model parametric bootstrap of Section 4.6.1, E(B̂*_I) = B̂_I,

√n vec(B̂_I − B_I) →D N_{a_I m}(0, Σ_ε ⊗ W_I),

and √n vec(B̂*_I − B̂_I) ∼ N_{a_I m}(0, Σ̂_ε ⊗ n(X_I^T X_I)^{−1}) →D N_{a_I m}(0, Σ_ε ⊗ W_I) as n, B → ∞ if S ⊆ I. Let B̂*_{I,0} be formed from B̂*_I by adding rows of zeros corresponding to omitted variables.
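A minimal R sketch of this parametric bootstrap follows, assuming Z, X, B̂ (Bhat), and Σ̂_ε (Sigmahat) are available and using MASS::mvrnorm to generate the multivariate normal rows; the object names are hypothetical.

library(MASS)
n <- nrow(X); m <- ncol(Bhat)
fit <- X %*% Bhat                        # rows are Bhat^T x_i
B <- 1000
Bstar <- vector("list", B)
for (j in 1:B) {
  Zstar <- fit + mvrnorm(n, mu = rep(0, m), Sigma = Sigmahat)  # y*_i ~ N_m(Bhat^T x_i, Sigmahat)
  Bstar[[j]] <- solve(t(X) %*% X, t(X) %*% Zstar)              # Bhat*_j
}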
The theory for multivariate linear regression assumes that the model is known
before gathering data. If variable selection and response transformations are
performed to build a model, then the estimators are biased and results for
inference fail to hold in that pvalues and coverage of confidence and prediction
regions will be wrong.
Data splitting can be used in a manner similar to how data splitting is
used for MLR and other regression models. A pilot study is an alternative to
data splitting.
8.9 Summary
11) Under regularity conditions, −[n − p + 1 − 0.5(m − r + 3)] log(Λ(L)) →D χ²_{rm}, (n − p)V(L) →D χ²_{rm}, and (n − p)U(L) →D χ²_{rm}.
16) The 4 step MANOVA F test should reject H0 if the response and
residual plots look good, n is large enough, and at least one response plot
does not look like the corresponding residual plot. A response plot for Yj will
look like a residual plot if the identity line appears almost horizontal, hence
the range of Ŷj is small.
17) The linmodpack function mltreg produces the m response and residual plots, gives B̂, Σ̂_ε, the MANOVA partial F test statistic and pval corresponding to the reduced model that leaves out the variables given by indices (so x2 and x4 in the output below with F = 0.77 and pval = 0.614),
Fj and the pval for the Fj test for variables 1, 2, ..., p (where p = 4 in
the output below so F2 = 1.51 with pval = 0.284), and F0 and pval for
the MANOVA F test (in the output below F0 = 3.15 and pval= 0.06).
The command out <- mltreg(x,y,indices=c(2)) would produce a
MANOVA partial F test corresponding to the F2 test while the command
out <- mltreg(x,y,indices=c(2,3,4)) would produce a MANOVA
partial F test corresponding to the MANOVA F test for a data set with
p = 4 predictor variables. The Hotelling Lawley trace statistic is used in the
tests.
out <- mltreg(x,y,indices=c(2,4))
$Bhat [,1] [,2] [,3]
[1,] 47.96841291 623.2817463 179.8867890
[2,] 0.07884384 0.7276600 -0.5378649
[3,] -1.45584256 -17.3872206 0.2337900
[4,] -0.01895002 0.1393189 -0.3885967
$Covhat
[,1] [,2] [,3]
[1,] 21.91591 123.2557 132.339
[2,] 123.25566 2619.4996 2145.780
[3,] 132.33902 2145.7797 2954.082
$partial
partialF Pval
[1,] 0.7703294 0.6141573
$Ftable
Fj pvals
[1,] 6.30355375 0.01677169
[2,] 1.51013090 0.28449166
[3,] 5.61329324 0.02279833
[4,] 0.06482555 0.97701447
$MANOVA
MANOVAF pval
[1,] 3.150118 0.06038742
18) Given B̂ = [β̂_1 β̂_2 · · · β̂_m] and xf, find ŷf = (ŷ_1, ..., ŷ_m)^T where ŷ_i = β̂_i^T xf.
19) Σ̂_ε = Ê^T Ê/(n − p) = (1/(n − p)) Σ_{i=1}^n ε̂_i ε̂_i^T while the sample covariance matrix of the residuals is S_r = [(n − p)/(n − 1)] Σ̂_ε = Ê^T Ê/(n − 1). Both Σ̂_ε and S_r are √n consistent estimators of Σ_ε for a large class of distributions for the error vectors ε_i.
20) The 100(1 − δ)% nonparametric prediction region for yf given xf is the nonparametric prediction region from Section 4.4 applied to ẑ_i = ŷf + ε̂_i = B̂^T xf + ε̂_i for i = 1, ..., n. This takes the data cloud of the n residual vectors ε̂_i and centers the cloud at ŷf. Let D²_i = (ẑ_i − ŷf)^T S_r^{−1} (ẑ_i − ŷf) = ε̂_i^T S_r^{−1} ε̂_i for i = 1, ..., n, and let qn = min(1 − δ + 0.05, 1 − δ + m/n) for δ > 0.1. Then the prediction region is

{y : (y − ŷf)^T S_r^{−1} (y − ŷf) ≤ D²_{(Un)}} = {y : D_y(ŷf, S_r) ≤ D_{(Un)}}.

a) Consider the n prediction regions for the data where (y_{f,i}, x_{f,i}) = (y_i, x_i) for i = 1, ..., n. If the order statistic D_{(Un)} is unique, then Un of the n prediction regions contain y_i where Un/n → 1 − δ as n → ∞.
b) If (ŷf, S_r) is a consistent estimator of (E(yf), Σ_ε) then the nonparametric prediction region is a large sample 100(1 − δ)% prediction region for yf.
c) If (ŷf, S_r) is a consistent estimator of (E(yf), Σ_ε), and the ε_i come from an elliptically contoured distribution such that the unique highest density region is {y : D_y(0, Σ_ε) ≤ D_{1−δ}}, then the nonparametric prediction region is asymptotically optimal.
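A minimal R sketch of this region follows, assuming the n × m residual matrix Ehat, B̂ (Bhat), xf, and δ (delta) are available; taking Un = ceiling(n*qn) is an assumption made for illustration, and the object names are hypothetical.

n <- nrow(Ehat); m <- ncol(Ehat)
Sr <- t(Ehat) %*% Ehat/(n - 1)                  # sample covariance of the residual vectors
D2 <- mahalanobis(Ehat, center = rep(0, m), cov = Sr)
qn <- min(1 - delta + 0.05, 1 - delta + m/n)    # for delta > 0.1
cut <- sort(D2)[ceiling(n*qn)]                  # D^2_(Un), assuming Un = ceiling(n*qn)
yhatf <- as.vector(t(Bhat) %*% xf)
# a future yf is in the prediction region iff
# mahalanobis(yf, center = yhatf, cov = Sr) <= cut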
21) On the DD plot for the residual vectors, the cases to the left of the
vertical line correspond to cases that would have y f = y i in the nonpara-
metric prediction region if xf = xi , while the cases to the right of the line
would not have y f = y i in the nonparametric prediction region.
22) The DD plot for the residual vectors is interpreted almost exactly as
a DD plot for iid multivariate data is interpreted. Plotted points clustering
about the identity line suggests that the i may be iid from a multivariate
normal distribution, while plotted points that cluster about a line through
the origin with slope greater than 1 suggests that the i may be iid from an
elliptically contoured distribution that is not MVN. Points to the left of the vertical line correspond to the cases that are in their nonparametric prediction region. Robust distances have not been shown to be consistent estimators of
the population distances, but are useful for a graphical diagnostic.
      MLR                                  MREG
2)    Yi = x_i^T β + e_i                   y_i = B^T x_i + ε_i
3)    E(e) = 0                             E(E) = 0
5)    β̂ = (X^T X)^{−1} X^T Y               B̂ = (X^T X)^{−1} X^T Z
6)    Ŷ = P Y                              Ẑ = P Z
7)    r = ê = (I − P)Y                     Ê = (I − P)Z
8)    E[β̂] = β                             E[B̂] = B
9)    E(Ŷ) = E(Y) = Xβ                     E[Ẑ] = XB
10)   σ̂² = r^T r/(n − p)                   Σ̂_ε = Ê^T Ê/(n − p)
      H0: Lβ = 0                           H0: LB = 0
13)   rF_R →D χ²_r                         (n − p)U(L) →D χ²_{rm}
23) The table on the previous page compares MLR and MREG.
24) The robust multivariate linear regression method rmreg2 computes
the classical estimator on the RMVN set where RMVN is computed from the n cases v_i = (x_{i2}, ..., x_{ip}, Y_{i1}, ..., Y_{im})^T. This estimator has considerable
outlier resistance but theory currently needs very strong assumptions. The
response and residual plots and DD plot of the residuals from this estimator
are useful for outlier detection. The rmreg2 estimator is superior to the
rmreg estimator for outlier detection.
8.10 Complements
This chapter followed Olive (2017b, ch. 12) closely. Multivariate linear re-
gression is a semiparametric method that is nearly as easy to use as multiple
linear regression if m is small. Section 8.3 followed Olive (2018) closely. The
material on plots and testing followed Olive et al. (2015) closely. The m re-
sponse and residual plots should be made as well as the DD plot, and the
response and residual plots are very useful for the m = 1 case of multiple
linear regression and experimental design. These plots speed up the model
building process for multivariate linear models since the success of power
transformations achieving linearity can be quickly assessed, and influential
cases can be quickly detected. See Cook and Olive (2001).
Work is needed on variable selection and on determining the sample sizes
for when the tests and prediction regions start to work well. Response and
residual plots can look good for n ≥ 10p, but for testing and prediction regions, we may need n ≥ a(m + p)² where 0.8 ≤ a ≤ 5 even for well behaved
elliptically contoured error distributions. Variable selection for multivariate
linear regression is discussed in Fujikoshi et al. (2014). R programs are needed
to make variable selection easy. Forward selection would be especially useful.
Often observations (Y1 , ..., Ym, x2, ..., xp) are collected on the same person
or thing and hence are correlated. If transformations can be found such that
the DD plot and the m response plots and residual plots look good, and n is large (n ≥ max[(m + p)², mp + 30] starts to give good results), then multivariate linear regression can be used to efficiently analyze the data.
Examining m multiple linear regressions is an incorrect method for analyzing
the data.
In addition to robust estimators and seemingly unrelated regressions, en-
velope estimators and partial least squares (PLS) are competing methods for
multivariate linear regression. See recent work by Cook such as Cook (2018),
Cook and Su (2013), Cook et al. (2013), and Su and Cook (2012). Methods
like ridge regression and lasso can also be extended to multivariate linear re-
gression. See, for example, Obozinski et al. (2011). Relaxed lasso extensions
are likely useful. Prediction regions for alternative methods with n >> p
could be made following Section 8.3.
8.11 Problems
Let ((X^T X)/n)^{−1} = Ŵ. Show that

T(Ŵ) = [vec(LB̂)]^T [Σ̂_ε^{−1} ⊗ (L(X^T X)^{−1} L^T)^{−1}] [vec(LB̂)].

U = tr([(n − p)Σ̂_ε]^{−1} B̂^T L^T [L(X^T X)^{−1} L^T]^{−1} L B̂).

Hence if L = L_j, then U_j = [1/(d_j(n − p))] tr(Σ̂_ε^{−1} b̂_j b̂_j^T). Using tr(ABC) = tr(CAB) and tr(a) = a for scalar a, show that (n − p)U_j = T_j.
8.3. Consider the Hotelling Lawley test statistic. Using the Searle (1982,
p. 333) identity
$MANOVA
MANOVAF pval
[1,] 67.80145 0
8.4. The output above is for the R Seatbelts data set where Y1 = drivers =
number of drivers killed or seriously injured, Y2 = front = number of front
seat passengers killed or seriously injured, and Y3 = back = number of back
seat passengers killed or seriously injured. The predictors were x2 = kms =
distance driven, x3 = price = petrol price, x4 = van = number of van drivers killed, and x5 = law = 1 if the law was in effect that month and 0 otherwise.
The data consists of 192 monthly totals in Great Britain from January 1969 to
December 1984, and the compulsory wearing of seat belts law was introduced
in February 1983.
a) Do the MANOVA F test.
b) Do the F4 test.
8.5. a) Sketch a DD plot of the residual vectors ε̂_i for the multivariate linear regression model if the error vectors ε_i are iid from a multivariate normal distribution. b) Does the DD plot change if the one way MANOVA model is used instead of the multivariate linear regression model?
8.6. The output below is for the R judge ratings data set consisting of
lawyer ratings for n = 43 judges. Y1 = oral = sound oral rulings, Y2 = writ =
sound written rulings, and Y3 = rten = worthy of retention. The predictors
were x2 = cont = number of contacts of lawyer with judge, x3 = intg =
judicial integrity, x4 = dmnr = demeanor, x5 = dilg = diligence, x6 =
cfmg = case flow managing, x7 = deci = prompt decisions, x8 = prep =
preparation for trial, x9 = fami = familiarity with law, and x10 = phys =
physical ability.
a) Do the MANOVA F test.
b) Do the MANOVA partial F test for the reduced model that deletes
x2 , x5 , x6, x7 , and x8 .
x<-USJudgeRatings[,-c(9,10,12)]
mltreg(x,y,indices=c(2,5,6,7,8))
$partial
partialF Pval
[1,] 1.649415 0.1855314
$MANOVA
MANOVAF pval
[1,] 340.1018 1.121325e-14
8.7. Let β_i be p × 1 and suppose ((β̂_1 − β_1)^T, (β̂_2 − β_2)^T)^T ∼ N_{2p}(0, C) where C is the block matrix with (j, k) block σ_{jk}(X^T X)^{−1} for j, k = 1, 2.

Hence E(y|w) = µ_y + Σ_{yw} Σ_{ww}^{−1}(w − µ_w) = α + B_S^T w.
a) Show α = µ_y − B_S^T µ_w.
b) Show B_S = Σ_w^{−1} Σ_{wy} where Σ_w = Σ_{ww}.
(Hence B_S^T = Σ_{yw} Σ_w^{−1}.)
R Problems
rows are equal to 0^T. Hence the null hypothesis for the MANOVA F test is true. When mnull = F the null hypothesis is true for p = 2, but false for p > 2. Now the first row of B is 1^T and the last row of B is 0^T. If p > 2, then the second to last row of B is (1, 0, ..., 0), the third to last row is (1, 1, 0, ..., 0) et cetera as long as the first row is not changed from 1^T. First m iid errors z_i are generated such that the m errors are iid with variance σ². Then ε_i = A z_i so that Σ_ε = σ² A A^T = (σij) where the diagonal entries σii = σ²[1 + (m − 1)ψ²] and the off diagonal entries σij = σ²[2ψ + (m − 2)ψ²] where ψ = 0.10. Terms like Wilkcov give the percentage of times the Wilks'
test rejected the F1 , F2 , ..., Fp tests. The $mancv wcv pcv hlcv rcv fcv output
gives the percentage of times the 4 test statistics reject the MANOVA F test.
Here hlcov and fcov both correspond to the Hotelling Lawley test using the
formulas in Problem 8.3.
5000 runs will be used so the simulation may take several minutes. Sample sizes n = (m + p)², n = 3(m + p)², and n = 4(m + p)² were interesting. We
want coverage near 0.05 when H0 is true and coverage close to 1 for good
power when H0 is false. Multivariate normal errors were used in a) and b)
below.
a) Copy the coverage parts of the output produced by the R commands
for this part where n = 20, m = 2, and p = 4. Here H0 is true except for
the F1 test. Wilks’ and Pillai’s tests had low coverage < 0.05 when H0 was
false. Roy’s test was good for the Fj tests, but why was Roy’s test bad for
the MANOVA F test?
b) Copy the coverage parts of the output produced by the R commands
for this part where n = 20, m = 2, and p = 4. Here H0 is false except for the
F4 test. Which two tests seem to be the best for this part?
8.12. This problem uses the linmodpack function mpredsim to simulate
the prediction regions for y f given xf for multivariate regression. With 5000
runs this simulation may take several minutes. The R command for this
problem generates iid lognormal errors then subtracts the mean, producing z_i. Then the ε_i = A z_i are generated as in Problem 8.11 with n = 100, m = 2,
and p=4. The nominal coverage of the prediction region is 90%, and 92%
of the training data is covered. The ncvr output gives the coverage of the
nonparametric region. What was ncvr?
Chapter 9
One Way MANOVA Type Models
9.1 Introduction
Definition 9.1. The response variables are the variables that you want
to predict. The predictor variables are the variables used to predict the
response variables.
y_i = B^T x_i + ε_i
The notation y i |xi and E(y i |xi ) is more accurate, but usually the con-
ditioning is suppressed. Taking E(y_i|x_i) to be a constant, y_i and ε_i have the same covariance matrix. In the MANOVA model, this covariance matrix
Σ does not depend on i. Observations from different cases are uncorrelated
(often independent), but the m errors for the m different response variables
for the same case are correlated.
The residual matrix Ê = Z − Ẑ = Z − X B̂ = [r̂_1 r̂_2 · · · r̂_m] is the n × m matrix with rows ε̂_1^T, ..., ε̂_n^T and entries ε̂_{i,j} for i = 1, ..., n and j = 1, ..., m.
Definition 9.5. A response plot for the jth response variable is a plot of the fitted values Ŷ_{ij} versus the response Y_{ij}. The identity line with slope one and zero intercept is added to the plot as a visual aid. A residual plot corresponding to the jth response variable is a plot of Ŷ_{ij} versus r_{ij}.
Remark 9.1. Make the m response and residual plots for any MANOVA
model. In a response plot, the vertical deviations from the identity line are the
residuals rij = Yij − Ŷij . Suppose the model is good, the error distribution is
not highly skewed, and n ≥ 10p. Then the plotted points should cluster about
the identity line in each of the m response plots. If outliers are present or if
the plot is not linear, then the current model or data need to be transformed
or corrected. If the model is good, then each of the m residual plots
should be ellipsoidal with no trend and should be centered about the r = 0
line. There should not be any pattern in the residual plot: as a narrow vertical
strip is moved from left to right, the behavior of the residuals within the strip
should show little change. Outliers and patterns such as curvature or a fan
shaped plot are bad.
For some MANOVA models that do not use replication, the response and
residual plots look much like those for multivariate linear regression in Section
8.2. The response and residual plots for the one way MANOVA model need
some notation, and it is useful to use three subscripts. Suppose there are independent random samples of size n_i from p different populations (treatments), or n_i cases are randomly assigned to p treatment groups with n = Σ_{i=1}^p n_i.
Assume that m response variables y ij = (Yij1 , ..., Yijm)T are measured for the
ith treatment. Hence i = 1, ..., p and j = 1, ..., ni. The Yijk follow different one
way ANOVA models for k = 1, ..., m. Assume E(y ij ) = µi = (µi1 , ..., µim)T
and Cov(y ij ) = Σ . Hence the p treatments have possibly different mean
vectors µi , but common covariance matrix Σ .
Then for the kth response variable, the response plot is a plot of Ŷijk ≡ µ̂ik
versus Yijk and the residual plot is a plot of Ŷijk ≡ µ̂ik versus rijk where µ̂ik is
the sample mean of the ni responses Yijk corresponding to the ith treatment
for the kth response variable. Add the identity line to the response plot and
r = 0 line to the residual plot as visual aids. The points in the response
plot scatter about the identity line and the points in the residual plot scatter
about the r = 0 line, but the scatter need not be in an evenly populated band.
A dot plot of Z1 , ..., Zn consists of an axis and n points each corresponding to
the value of Zi . The response plot for the kth response variable consists of p
dot plots, one for each value of µ̂ik . The dot plot corresponding to µ̂ik is the
dot plot of Yi,1,k , ..., Yi,ni,k . Similarly, the residual plot for the kth response
variable consists of p dot plots, and the plot corresponding to µ̂ik is the dot
plot of ri,1,k , ..., ri,ni,k . Assuming the ni ≥ 10, the p dot plots for the kth
response variable should have roughly the same shape and spread in both
the response and residual plots. Note that µ̂_ik = Ȳ_{iok} = (1/n_i) Σ_{j=1}^{n_i} Y_{ijk}.
Assume that each ni ≥ 10. It is easier to check shape and spread in the
residual plot. If the response plot looks like the residual plot, then a horizontal
line fits the p dot plots about as well as the identity line, and there may not
be much difference in the µik . In the response plot, if the identity line fits
the plotted points better than any horizontal line, then conclude that at least
some of the means µik differ.
Definition 9.6. An outlier corresponds to a case that is far from the
bulk of the data. Look for a large vertical distance of the plotted point from
the identity line or the r = 0 line.
Rule of thumb 9.1. Mentally add 2 lines parallel to the identity line and
2 lines parallel to the r = 0 line that cover most of the cases. Then a case is
an outlier if it is well beyond these 2 lines.
This rule often fails for large outliers since often the identity line goes
through or near a large outlier so its residual is near zero. A response that is
far from the bulk of the data in the response plot is a “large outlier” (large
in magnitude). Look for a large gap between the bulk of the data and the
large outlier.
Suppose there is a dot plot of ni cases corresponding to treatment i with
mean µik that is far from the bulk of the data. This dot plot is probably not
a cluster of “bad outliers” if ni ≥ 4 and n ≥ 5p. If ni = 1, such a case may
be a large outlier.
Rule of thumb 9.2. Often an outlier is very good, but more often an
outlier is due to a measurement error and is very bad.
Remark 9.2. Rule of thumb 3.2 for the one way ANOVA F test may also
be useful for the one way MANOVA model tests of hypotheses.
Remark 9.3. The above rules are mainly for linearity and tend to use
marginal models. The marginal models are useful for checking linearity, but
are not very useful for checking other model violations such as outliers in the
error vector distribution. The RMVN DD plot of the residual vectors is a
global method (takes into account the correlations of Y1 , ..., Ym) for checking
the error vector distribution, but is not real effective for detecting outliers
since OLS is used to find the residual vectors. A DD plot of residual vectors
from a robust MANOVA method might be more effective for detecting out-
liers. This remark also applies to the plots used in Section 8.2 for multivariate
linear regression.
The RMVN DD plot of the residual vectors ˆi is used to check the er-
ror vector distribution, to detect outliers, and to display the nonparametric
prediction region developed in Section 8.3. The DD plot suggests that the
error vector distribution is elliptically contoured if the plotted points cluster
tightly about a line through the origin as n → ∞. The plot suggests that
the error vector distribution is multivariate normal if the line is the identity
line. If n is large and the plotted points do not cluster tightly about a line
through the origin, then the error vector distribution may not be elliptically
contoured. These applications of the DD plot for iid multivariate data are
discussed in Olive (2002, 2008, 2013a) and Chapter 7. The RMVN estimator
has not yet been proven to be a consistent estimator for residual vectors,
but simulations suggest that the RMVN DD plot of the residual vectors is a
useful diagnostic plot.
Response transformations can also be made as in Section 1.2, but also make
the response plot of Ŷ j versus Y j and use the rules of Section 1.2 on Yj to
linearize the response plot for each of the m response variables Y1 , ..., Ym.
Example 9.1. Consider the one way MANOVA model on the famous
iris data set with n = 150 and p = 3 species of iris: setosa, versicolor, and
virginica. The m = 4 variables are Y1 = sepal length, Y2 = sepal width, Y3 =
petal length, and Y4 = petal width. See Becker et al. (1988). The plots for the
m = 4 response variables look similar, and Figure 9.1 shows the response and
residual plots for Y4 . Note that the spread of the three dot plots is similar.
The dot plot intersects the identity line at the sample mean of the cases in
the dot plot. The setosa cases in lowest dot plot have a sample mean of 0.246
and the horizontal line Y4 = 0.246 is below the dot plots for versicolor and
virginica which have means of 1.326 and 2.026. Hence the mean petal widths
differ for the three species, and it is easier to see this difference in the response
plot than the residual plot. The plots for the other three variables are similar.
Figure 9.2 shows that the DD plot of the residual vectors suggests that the
error vector distribution is elliptically contoured but not multivariate normal.
The DD plot also shows the prediction regions of Section 8.3 computed using the residual vectors ε̂_i. From Section 8.3, if {ε̂ : D_ε̂(0, S_r) ≤ h} is a prediction region for the residual vectors, then {y : D_y(ŷ_f, S_r) ≤ h} is a prediction region for y_f. For the one way MANOVA model, a prediction
prediction region for y f . For the one way MANOVA model, a prediction
region for y f would only be valid for an xf which was observed, i.e., for
xf = xj , since only observed values of the categorical predictor variables
make sense. The 90% nonparametric prediction region corresponds to y with
distances to the left of the vertical line M D = 3.2.
Fig. 9.1 Response and Residual Plots for Y4 for the Iris Data.
R commands for these two figures are shown below, and will also show
the plots for Y1 , Y2 , and Y3 . The linmodpack function manova1w makes the
response and residual plots while ddplot4 makes the DD plot. The last
command shows that the pvalue = 0 for the one way MANOVA test discussed
in the following section.
library(MASS)
y <- iris[,1:4] #m = 4 = number of response variables
group <- iris[,5]
#p = number of groups = number of dot plots
out<- manova1w(y,p=3,group=group) #right click
#Stop 8 times
ddplot4(out$res) #right click Stop
summary(out$out) #default is Pillai’s test
Fig. 9.2 DD Plot of the Residual Vectors for the Iris Data.
Using double subscripts will be useful for describing the one way MANOVA
model. Suppose there are independent random samples of size n_i from p different populations (treatments), or n_i cases are randomly assigned to p treatment groups. Then n = Σ_{i=1}^p n_i and the group sample sizes are n_i for
i = 1, ..., p. Assume that m response variables y ij = (Yij1 , ..., Yijm)T are
measured for the ith treatment group and the jth case (often an individual
or thing) in the group. Hence i = 1, ..., p and j = 1, ..., ni. The Yijk follow
different one way ANOVA models for k = 1, ..., m. Assume E(y ij ) = µi and
Cov(y ij ) = Σ . Hence the p treatments have different mean vectors µi , but
common covariance matrix Σ . (The common covariance matrix assumption
can be relaxed for p = 2 with the appropriate 2 sample Hotelling’s T 2 test.)
The one way MANOVA is used to test H0 : µ1 = µ2 = · · · = µp . Often
µi = µ + τ i , so H0 becomes H0 : τ 1 = · · · = τ p . If m = 1, the one
way MANOVA model is the one way ANOVA model. MANOVA is useful
since it takes into account the correlations between the m response variables.
Performing m ANOVA tests fails to account for these correlations, but can
be a useful diagnostic. The Hotelling’s T 2 test that uses a common covariance
matrix is a special case of the one way MANOVA model with p = 2.
Let µ_i = µ + τ_i where Σ_{i=1}^p n_i τ_i = 0. The jth case from the ith population or treatment group is y_ij = µ + τ_i + ε_ij where ε_ij is an error vector, i = 1, ..., p
and j = 1, ..., n_i. Let ȳ = µ̂ = Σ_{i=1}^p Σ_{j=1}^{n_i} y_ij/n be the overall mean. Let ȳ_i = Σ_{j=1}^{n_i} y_ij/n_i so τ̂_i = ȳ_i − ȳ. Let the residual vector ε̂_ij = y_ij − ȳ_i = y_ij − µ̂ − τ̂_i. Then y_ij = ȳ + (ȳ_i − ȳ) + (y_ij − ȳ_i) = µ̂ + τ̂_i + ε̂_ij.
Several m × m matrices will be useful. Let S_i be the sample covariance matrix corresponding to the ith treatment group. Then the within sum of squares and cross products matrix is W = W_e = (n_1 − 1)S_1 + · · · + (n_p − 1)S_p = Σ_{i=1}^p Σ_{j=1}^{n_i} (y_ij − ȳ_i)(y_ij − ȳ_i)^T. Then Σ̂_ε = W/(n − p). The treatment or between sum of squares and cross products matrix is

B_T = Σ_{i=1}^p n_i (ȳ_i − ȳ)(ȳ_i − ȳ)^T.
Source matrix df
Treatment or Between BT p − 1
Residual or Error or Within W n − p
Total (corrected) T n−1
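A minimal R sketch computing W, B_T, and Σ̂_ε from the group data follows, assuming y is an n × m matrix of responses and group is a factor with p levels; the object names are hypothetical.

parts <- split(as.data.frame(y), group)
n <- nrow(y); p <- length(parts)
ybar <- colMeans(y)                                           # overall mean
W  <- Reduce(`+`, lapply(parts, function(g) (nrow(g) - 1)*cov(g)))
BT <- Reduce(`+`, lapply(parts, function(g) {
        d <- colMeans(g) - ybar
        nrow(g) * (d %o% d)                                   # n_i (ybar_i - ybar)(ybar_i - ybar)^T
      }))
Sigmahat <- W/(n - p)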
Another way to perform the one way MANOVA test is to get R output.
The default test is Pillai’s test, but other tests can be obtained with the R
output shown below.
summary(out$out) #default is Pillai’s test
summary(out$out, test = "Wilks")
summary(out$out, test = "Hotelling-Lawley")
summary(out$out, test = "Roy")
Example 9.1, continued. The R output for the iris data gives a Pillai’s
F statistic of 53.466 and pval = 0.
i) H0: µ1 = µ2 = µ3 H1: not H0
ii) F = 53.466
iii) pval = 0
iv) Reject H0 . The means for the three varieties of iris do differ.
Remark 9.4. Another method for one way MANOVA is to use the model
Z = XB + E, where Z is the n × m matrix of responses Y_{ijk} with the rows y_ij^T stacked so that the n_1 cases from group 1 come first and the n_p cases from group p come last, X is the n × p design matrix whose first column is 1 and whose ith column is an indicator for group i − 1 for i = 2, ..., p, B is the p × m matrix with entries β_{jk}, and E is the n × m matrix of error vectors. Then X is full rank, and β̂_{1k} = Ȳ_{pok} = µ̂_{pk} for k = 1, ..., m.
Equation (3.5) used the same X for one way ANOVA model with m = 1
as the X used in the above one way MANOVA model. Then the MLR F test
was the same as the one way ANOVA F test. Similarly, if L = (0 I p−1 ) then
the multivariate linear regression Hotelling Lawley test statistic for testing
H0 : LB = 0 versus H1 : LB 6= 0 is U = tr(W −1 H) while the Hotelling
Lawley test statistic for the one way MANOVA test with H0 : µ1 = µ2 =
· · · = µp is U = tr(W −1 B T ). Rupasinghe Arachchige Don (2018) showed
that these two test statistics are the same for the above X by showing that B_T = H. Here H is given in Section 8.4 and is not the hat matrix.
Large sample theory can also be used to derive a competing test. Let Σ_i be the nonsingular population covariance matrix of the ith treatment group or population. To simplify the large sample theory, assume n_i = π_i n where 0 < π_i < 1 and Σ_{i=1}^p π_i = 1. Let T_i be a multivariate location estimator such that √n_i (T_i − µ_i) →D N_m(0, Σ_i), and √n (T_i − µ_i) →D N_m(0, Σ_i/π_i). Let T = (T_1^T, T_2^T, ..., T_p^T)^T, ν = (µ_1^T, µ_2^T, ..., µ_p^T)^T, and let A be an r × mp matrix with full rank r. Then a large sample test of the form H0: Aν = θ_0 versus H1: Aν ≠ θ_0 uses

A √n (T − ν) →D u ∼ N_r(0, A diag(Σ_1/π_1, Σ_2/π_2, ..., Σ_p/π_p) A^T).   (9.2)
This test is due to Rupasinghe Arachchige Don and Olive (2019), and a
special case was used by Zhang and Liu (2013) and Konietschke et al. (2015)
with Ti = y i and Σ̂ i = S i . The p = 2 case gives analogs to the two sample
Hotelling’s T 2 test. See Rupasinghe Arachchige Don and Pelawa Watagoda
(2018). The m = 1 case gives analogs of the one way ANOVA test. If m = 1,
see competing tests in Brown and Forsythe (1974a,b), Olive (2017a, pp. 200-
202), and Welch (1947, 1951).
For the one way MANOVA type test, let A be the block matrix

A = [ I 0 0 · · · −I ]
    [ 0 I 0 · · · −I ]
    [ .  .  .  ...  . ]
    [ 0 0 · · · I  −I ].
Then Σ̂_w/n = A diag(Σ̂_1/n_1, ..., Σ̂_p/n_p) A^T is a block matrix where the off diagonal block entries equal Σ̂_p/n_p and the ith diagonal block entry is Σ̂_i/n_i + Σ̂_p/n_p for i = 1, ..., (p − 1).
Reject H0 if

t_0 > m(p − 1) F_{m(p−1), dn}(1 − δ)   (9.6)

where dn = min(n_1, ..., n_p). See Theorem 2.25. It may make sense to relabel the groups so that n_p is the largest n_i or Σ̂_p/n_p has the smallest generalized variance of the Σ̂_i/n_i. This test may start to outperform the one way MANOVA test if n ≥ (m + p)² and n_i ≥ 40m for i = 1, ..., p.
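A minimal R sketch of the statistic t_0 and the rejection rule (9.6), using the sample mean and sample covariance matrix of each group, follows; y is an n × m response matrix and group is a factor, and the object names are hypothetical.

parts <- split(as.data.frame(y), group)
p <- length(parts); m <- ncol(y)
Tbar <- lapply(parts, colMeans)                                  # T_i = sample mean of group i
Sh   <- lapply(parts, function(g) cov(g)/nrow(g))                # Sigmahat_i / n_i
w    <- unlist(lapply(Tbar[-p], function(Tj) Tj - Tbar[[p]]))    # A T
SW <- matrix(0, m*(p-1), m*(p-1))                                # A diag(Sigmahat_i/n_i) A^T
for (i in 1:(p-1)) for (j in 1:(p-1))
  SW[(i-1)*m + 1:m, (j-1)*m + 1:m] <- if (i == j) Sh[[i]] + Sh[[p]] else Sh[[p]]
t0 <- drop(t(w) %*% solve(SW, w))
dn <- min(sapply(parts, nrow)); delta <- 0.05
t0 > m*(p-1)*qf(1 - delta, m*(p-1), dn)                          # reject H0 if TRUE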
If Σ i ≡ Σ and Σ̂ i is replaced by Σ̂, we will show that for the one way
MANOVA test that t0 = (n − p)U where U is the Hotelling Lawley statistic.
For the proof, some results on the vec and Kronecker product will be useful.
Following Henderson and Searle (1979), vec(G) and vec(GT ) contain the
same elements in different sequences. Define the permutation matrix P r,m
such that
vec(G) = P r,m vec(GT ) (9.7)
where G is r × m. Then P Tr,m = P m,r , and P r,m P m,r = P m,r P r,m = I rm .
If C is s × m and D is p × r, then
Also
(C ⊗ D) vec(G) = vec(D G C^T) = P_{p,s} (D ⊗ C) vec(G^T).   (9.9)
If C is m × m and D is r × r, then C ⊗ D = P_{r,m} (D ⊗ C) P_{m,r}, and
Theorem 9.2. For the one way MANOVA test using A as defined below Theorem 9.1, let the Hotelling Lawley trace statistic U = tr(W^{−1} B_T). Then

(n − p)U = t_0 = [AT − θ_0]^T [ A diag(Σ̂/n_1, Σ̂/n_2, ..., Σ̂/n_p) A^T ]^{−1} [AT − θ_0].

Hence if the Σ_i ≡ Σ and H0: µ_1 = · · · = µ_p is true, then (n − p)U = t_0 →D χ²_{m(p−1)}.
Proof. Let B and X be as in Remark 9.4. Let L = [0 I_{p−1}] be an s × p matrix with s = p − 1. For this choice of X, U = tr(W^{−1} B_T) = tr(W^{−1} H) by Remark 9.4. Hence by Theorem 8.6,

(n − p)U = [vec(LB̂)]^T [Σ̂^{−1} ⊗ (L(X^T X)^{−1} L^T)^{−1}] [vec(LB̂)],   (9.11)

where Σ̂_w/n = L(X^T X)^{−1} L^T ⊗ Σ̂ is given by Equation (9.5) with each Σ̂_i replaced by Σ̂. Thus

t_0 = [vec([LB̂]^T)]^T [(L(X^T X)^{−1} L^T)^{−1} ⊗ Σ̂^{−1}] [vec([LB̂]^T)].   (9.12)
Hence the one way MANOVA test is a special case of Equation (9.3) where θ_0 = 0 and Σ̂_i ≡ Σ̂, but then Theorem 9.1 only holds if H0 is true and Σ_i ≡ Σ. Note that the large sample theory of Theorem 9.1 is trivial compared to the large sample theory of (n − p)U given in Theorem 9.2. Fujikoshi (2002) showed (n − m − p − 1)U →D χ²_{m(p−1)} while (n − p)U →D χ²_{m(p−1)} by Theorem 9.2 if H0 is true under the common covariance matrix assumption. There is no contradiction since (m + 1)U →P 0 as the n_i → ∞. Note that A is m(p − 1) × mp.
For tests corresponding to Theorem 9.1, we will use bootstrap with the
prediction region method of Chapter 4 to test H0 when Σ̂ w or the Σ̂ i are
unknown or difficult to estimate. To bootstrap the test H0 : Aν = θ 0 versus
H1 : Aν 6= θ 0 , use Zn = AT . Take a sample of size nj with replacement from
the nj cases for each group for j = 1, 2, ..., p to obtain Tj∗ and T ∗1 . Repeat B
times to obtain T ∗1 , ..., T ∗B . Then Zi∗ = AT ∗i for i = 1, ..., B. We will illustrate
this method with the analog for the one way MANOVA test for H0 : Aθ = 0
which is equivalent to H0 : µ1 = · · · = µp , where 0 is an r × 1 vector of
zeroes with r = m(p − 1). Then Zn = AT = w given by Equation (9.4).
Hence the m(p − 1) × 1 vector Z*_i = AT*_i = ((T*_1 − T*_p)^T, ..., (T*_{p−1} − T*_p)^T)^T
where Tj is a multivariate location estimator (such as the sample mean,
coordinatewise median, or trimmed mean), applied to the cases in the jth
treatment group. The prediction region method fails to reject H0 if 0 is in
the resulting confidence region.
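A minimal R sketch of this bootstrap, using the sample mean as T_j, follows; y is an n × m response matrix and group is a factor, the prediction region step of Chapter 4 is only indicated, and the object names are hypothetical.

parts <- split(as.data.frame(y), group)
p <- length(parts); B <- 1000
bootZ <- replicate(B, {
  Tstar <- lapply(parts, function(g)
             colMeans(g[sample(nrow(g), replace = TRUE), , drop = FALSE]))
  unlist(lapply(Tstar[-p], function(Tj) Tj - Tstar[[p]]))   # Z*_i = A T*_i
})
# bootZ is an m(p-1) x B matrix; apply the prediction region method of
# Chapter 4 to its columns and fail to reject H0 if 0 is in the region.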
We may need B ≥ 50m(p − 1), n ≥ (m + p)², and n_i ≥ 40m. If the n_i are not
large, the one way MANOVA test can be regarded as a regularized estimator,
and can perform better than the tests that do not assume equal population
covariance matrices. See the simulations in Rupasinghe Arachchige Don and
Olive (2019).
If H0: Aν = θ_0 is true and if the Σ_i ≡ Σ for i = 1, ..., p, then

t_0 = [AT − θ_0]^T [ A diag(Σ̂/n_1, Σ̂/n_2, ..., Σ̂/n_p) A^T ]^{−1} [AT − θ_0] →D χ²_r.

If H0 is true but the Σ_i are not equal, we may be able to get a bootstrap cutoff by using

t*_{0i} = [AT*_i − AT]^T [ A diag(Σ̂/n_1, Σ̂/n_2, ..., Σ̂/n_p) A^T ]^{−1} [AT*_i − AT] = D²_{AT*_i}(AT, A diag(Σ̂/n_1, Σ̂/n_2, ..., Σ̂/n_p) A^T).
9.5 Summary
The n × p matrix

X = (x_{i,j}) = [v_1 v_2 · · · v_p]

has rows x_1^T, ..., x_n^T, where often v_1 = 1.
The p × m matrix

B = (β_{i,j}) = [β_1 β_2 · · · β_m].

The n × m matrix

E = (ε_{i,j}) = [e_1 e_2 · · · e_m]

has rows ε_1^T, ..., ε_n^T.
9.6 Complements
9.7 Problems
9.1∗ . If X is of full rank and least squares is used to fit the MANOVA
model, then β̂ i = (X T X)−1 X T Y i , and Y i = Xβ i + ei . Treating Xβ i as a
constant, Cov(Y i , Y j ) = Cov(ei , ej ) = σij I n . Using this information, show
Cov(β̂ i , β̂j ) = σij (X T X)−1 .
Chapter 10
1D Regression Models Such as GLMs
... estimates of the linear regression coefficients are relevant to the linear
parameters of a broader class of models than might have been suspected.
Brillinger (1977, p. 509)
After computing β̂, one may go on to prepare a scatter plot of the points
(β̂xj , yj ), j = 1, ..., n and look for a functional form for g(·).
Brillinger (1983, p. 98)
This chapter considers 1D regression models including additive error re-
gression (AER), generalized linear models (GLMs), and generalized additive
models (GAMs). Multiple linear regression is a special case of these four
models.
See Definition 1.2 for the 1D regression model, sufficient predictor (SP =
h(x)), estimated sufficient predictor (ESP = ĥ(x)), generalized linear model
(GLM), and the generalized additive model (GAM). When using a GAM to
check a GLM, the notation ESP may be used for the GLM, and EAP (esti-
mated additive predictor) may be used for the ESP of the GAM. Definition
1.3 defines the response plot of ESP versus Y .
Suppose the sufficient predictor SP = h(x). Often SP = xT β. If u only
contains the nontrivial predictors, then SP = β1 + uT β2 = α + uT η is often
used where β = (β1 , βT2 )T = (α, ηT )T and x = (1, uT )T .
10.1 Introduction
selection. For the models below, the model estimated mean function and
often a nonparametric estimator of the mean function, such as lowess, will
be added to the response plot as a visual aid. For all of the models in the
following three definitions, Y1 , ..., Yn are independent, but often the subscripts
are suppressed. For example, Y = SP + e is used instead of Yi = Yi |xi =
Yi |SPi = SPi + ei = h(xi ) + ei for i = 1, ..., n.
θ = 1/(δ + ν). Let B(δ, ν) = Γ(δ)Γ(ν)/Γ(δ + ν). If Y has a beta–binomial distribution, Y ∼ BB(m, ρ, θ), then the probability mass function of Y is

P(Y = y) = C(m, y) B(δ + y, ν + m − y)/B(δ, ν)

for y = 0, 1, 2, ..., m where C(m, y) = m!/[y!(m − y)!], 0 < ρ < 1, and θ > 0. Hence δ > 0 and ν > 0. Then E(Y) = mδ/(δ + ν) = mρ and V(Y) = mρ(1 − ρ)[1 + (m − 1)θ/(1 + θ)]. If Y|π ∼ binomial(m, π) and π ∼ beta(δ, ν), then Y ∼ BB(m, ρ, θ). As θ → 0, it can be shown that V(π) → 0, and the beta–binomial distribution converges to the binomial distribution.
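The mixture representation above is easy to check by simulation. A minimal R sketch follows with hypothetical parameter values.

set.seed(1)
del <- 2; nu <- 3; mb <- 10
rho <- del/(del + nu); theta <- 1/(del + nu)
pi <- rbeta(100000, del, nu)
Y  <- rbinom(100000, mb, pi)                 # Y | pi ~ binomial(mb, pi), pi ~ beta(del, nu)
c(mean(Y), mb*rho)                           # compare with E(Y) = m rho
c(var(Y), mb*rho*(1 - rho)*(1 + (mb - 1)*theta/(1 + theta)))   # compare with V(Y)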
Definition 10.2. The BBR model states that Y1 , ..., Yn are independent
random variables where Yi|SPi ∼ BB(mi, ρ(SPi), θ). Hence E(Yi|SPi) = mi ρ(SPi) and V(Yi|SPi) = mi ρ(SPi)(1 − ρ(SPi))[1 + (mi − 1)θ/(1 + θ)].
The BBR model has the same mean function as the binomial regression
model, but allows for overdispersion. As θ → 0, it can be shown that the
BBR model converges to the binomial regression model.
for y = 0, 1, 2, ... where µ > 0 and κ > 0. Then E(Y ) = µ and V(Y ) =
µ+µ2 /κ. (This distribution is a generalization of the negative binomial (κ, ρ)
distribution where ρ = κ/(µ + κ) and κ > 0 is an unknown real parameter
rather than a known integer.)
The NBR model has the same mean function as the PR model but allows
for overdispersion. Following Agresti (2002, p. 560), as τ ≡ 1/κ → 0, it can
be shown that the NBR model converges to the PR model.
normal, the ei are normal. If the Yi are loglogistic, the ei are logistic. If the
Yi are Weibull, the ei are from a smallest extreme value distribution. The
Weibull regression model is a proportional hazards model using Yi and an
accelerated failure time model using log(Yi) with β_P = β_A/σ. Let Y have a Weibull W(γ, λ) distribution if the pdf of Y is
for y > 0. Prediction intervals for parametric survival regression models are
for survival times Y , not censored survival times. See Section 10.10.
where λ0 = exp(−α/σ).
where k(θ) ≥ 0 and h(y) ≥ 0. The functions h, k, t, and w are real valued
functions.
where S(y) = log(g(y)), d(θ) = log(k(θ)), and the support Y does not depend
on θ. Here the indicator function IY (y) = 1 if y ∈ Y and IY (y) = 0, otherwise.
t(yi) = yi and w(θ) = c(θ)/a(φ),
and notice that the value of the parameter θ(xi ) = η(xTi β) depends on the
value of xi . Since the model depends on x only through the linear predictor
xT β, a GLM is a 1D regression model. Thus the linear predictor is also a
sufficient predictor.
The following three sections illustrate three of the most important gen-
eralized linear models. Inference and variable selection for these GLMs are
discussed in Sections 10.5 and 10.6. Their generalized additive model analogs
are discussed in Section 10.7.
Note that the conditional mean function E(Yi |SPi ) = mi ρ(SPi ) and the
conditional variance function V (Yi |SPi ) = mi ρ(SPi )(1 − ρ(SPi )).
Thus the binary logistic regression model says that Y|SP ∼ binomial(1, ρ(SP)) where

ρ(SP) = exp(SP)/(1 + exp(SP))

for the LR model. Note that the conditional mean function E(Y|SP) = ρ(SP) and the conditional variance function V(Y|SP) = ρ(SP)(1 − ρ(SP)). For the LR model, the Y are independent and

Y|x ≈ binomial(1, exp(ESP)/(1 + exp(ESP))).
Although the logistic regression model is the most important model for
binary regression, several other models are also used. Notice that ρ(x) =
P (S|x) is the population probability of success S given x, while 1 − ρ(x) =
P (F |x) is the probability of failure F given x. In particular, for binary re-
gression, ρ(x) = P (Y = 1|x) = 1 −P (Y = 0|x). If this population proportion
ρ = ρ(h(x)), then the model is a 1D regression model. The model is a GLM if
the link function g is differentiable and monotone so that g(ρ(x T β)) = xT β
and g−1 (xT β) = ρ(xT β). Usually the inverse link function corresponds to
the cumulative distribution function of a location scale family. For example,
for logistic regression, g−1 (x) = exp(x)/(1 + exp(x)) which is the cdf of the
logistic L(0, 1) distribution. For probit regression, g−1 (x) = Φ(x) which is the
cdf of the normal N (0, 1) distribution. For the complementary log-log link,
g−1 (x) = 1 − exp[− exp(x)] which is the cdf for the smallest extreme value
distribution. For this model, g(ρ(x)) = log[− log(1 − ρ(x))] = xT β.
Another important binary regression model is the discriminant function
model. See Hosmer and Lemeshow (2000, pp. 43–44). Assume that πj =
P (Y = j) and that x|Y = j ∼ Nk (µj , Σ) for j = 0, 1. That is, the conditional
distribution of x given Y = j follows a multivariate normal distribution with
mean vector µj and covariance matrix Σ which does not depend on j. Notice
that Σ = Cov(x|Y) ≠ Cov(x). Then as for the binary logistic regression
model with x = (1, uT )T and β = (α, ηT )T ,
P(Y = 1|x) = ρ(x) = exp(α + u^T η)/(1 + exp(α + u^T η)) = exp(x^T β)/(1 + exp(x^T β)).
η = Σ^{−1}(µ_1 − µ_0)   (10.7)

and α = log(π_1/π_0) − 0.5(µ_1 − µ_0)^T Σ^{−1}(µ_1 + µ_0).
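A minimal R sketch of the discriminant function estimator based on (10.7), using the pooled sample covariance matrix and group sample means, follows; x is an n × k predictor matrix and Y is a 0/1 vector, and the object names are hypothetical.

x1 <- x[Y == 1, , drop = FALSE]; x0 <- x[Y == 0, , drop = FALSE]
Spool <- ((nrow(x1) - 1)*cov(x1) + (nrow(x0) - 1)*cov(x0))/(nrow(x) - 2)
m1 <- colMeans(x1); m0 <- colMeans(x0)
etahat   <- solve(Spool, m1 - m0)                              # estimates eta in (10.7)
alphahat <- log(nrow(x1)/nrow(x0)) -
  0.5 * sum((m1 - m0) * solve(Spool, m1 + m0))                 # estimates alpha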
The logistic regression (maximum likelihood) estimator also tends to per-
form well for this type of data. An exception is when the Y = 0 cases and
Y = 1 cases can be perfectly or nearly perfectly classified by the ESP. Let
the logistic regression ESP = xT β̂. Consider the response plot of the ESP
versus Y . If the Y = 0 values can be separated from the Y = 1 values by
the vertical line ESP = 0, then there is perfect classification. See Figure 10.1
b). In this case the maximum likelihood estimator for the logistic regression
parameters β does not exist because the logistic curve can not approximate
a step function perfectly. See Atkinson and Riani (2000, pp. 251-254). If only
a few cases need to be deleted in order for the data set to have perfect clas-
sification, then the amount of “overlap” is small and there is nearly “perfect
classification.”
Ordinary least squares (OLS) can also be useful for logistic regression. The
ANOVA F test, partial F test, and OLS t tests are often asymptotically valid
when the conditions in Definition 10.8 are met, and the OLS ESP and LR
ESP are often highly correlated. See Haggstrom (1983). For binary data the
Yi only take two values, 0 and 1, and the residuals do not behave very well.
Hence the response plot will be used both as a goodness of fit plot and as a
lack of fit plot.
Definition 10.9. For binary logistic regression, the response plot or esti-
mated sufficient summary plot is the plot of the ESP = ĥ(xi ) versus Yi with
the estimated mean function
ρ̂(ESP) = exp(ESP)/(1 + exp(ESP))
added as a visual aid. The shape of the estimated mean function depends on the range of the ESP: if the range of the ESP is narrow, then the mean function looks linear. If 0 < ESP < 5 then the mean function
first increases rapidly and then less and less rapidly. Finally, if −5 < ESP < 5
then the mean function has the characteristic “ESS” shape shown in Figure
10.1 c).
This plot is very useful as a goodness of fit diagnostic. Divide the ESP into
J “slices” each containing approximately n/J cases. Compute the sample
mean = sample proportion of the Y s in each slice and add the resulting step
function to the response plot. This is done in Figure 10.1 c) with J = 4
slices. This step function is a simple nonparametric estimator of the mean
function ρ(SP ). If the step function follows the estimated LR mean function
(the logistic curve) closely, then the LR model fits the data well. The plot
of these two curves is a graphical approximation of the goodness of fit tests
described in Hosmer and Lemeshow (2000, pp. 147–156).
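A rough R sketch of this response plot, assuming a binary response y and one predictor; the slicing below is a simple stand-in for the plotting functions in linmodpack.

set.seed(3)
x <- rnorm(200)
y <- rbinom(200, 1, exp(x)/(1 + exp(x)))
fit <- glm(y ~ x, family = binomial)
ESP <- fit$linear.predictors
plot(ESP, y, main = "Response Plot")
curve(exp(x)/(1 + exp(x)), add = TRUE)   # estimated LR mean function
J <- 4                                   # number of slices
slice <- cut(ESP, quantile(ESP, probs = seq(0, 1, length = J + 1)),
  include.lowest = TRUE)
prop <- tapply(y, slice, mean)           # sample proportion of Y in each slice
mids <- tapply(ESP, slice, mean)
lines(mids, prop, type = "s")            # step function of slice proportions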
The deviance test described in Section 10.5 is used to test whether β = 0,
and is the analog of the ANOVA F test for multiple linear regression. If the
binary LR model is a good approximation to the data but β = 0, then the
predictors x are not needed in the model and ρ̂(xi) ≡ ρ̂ = Ȳ (the usual
univariate estimator of the success proportion) should be used instead of the
LR estimator
ρ̂(xi) = exp(xTi β̂)/(1 + exp(xTi β̂)).
If the logistic curve clearly fits the step function better than the line Y = Ȳ, then H0 will be rejected, but if the line Y = Ȳ fits the step function about as well as the logistic curve (which should only happen if the logistic curve is linear with a small slope), then Y may be independent of the predictors.
See Figure 10.1 a).
ρ̂(ESP) = exp(ESP)/(1 + exp(ESP))
Both the lowess curve and step function are simple nonparametric estima-
tors of the mean function ρ(SP ). If the lowess curve or step function tracks
the logistic curve (the estimated mean) closely, then the LR mean function
is a reasonable approximation to the data.
Checking the LR model in the nonbinary case is more difficult because
the binomial distribution is not the only distribution appropriate for data
that takes on values 0, 1, ..., m if m ≥ 2. Hence both the mean and variance
functions need to be checked. Often the LR mean function is a good approx-
imation to the data, the LR MLE is a consistent estimator of β, but the
LR model is not appropriate. The problem is that for many data sets where
E(Yi |xi ) = mi ρ(SPi ), it turns out that V (Yi |xi ) > mi ρ(SPi )(1 − ρ(SPi )).
This phenomenon is called overdispersion. The BBR model of Definition 10.2
is a useful alternative to LR.
For both the LR and BBR models, the conditional distribution of Y |x can
still be visualized with a response plot of the ESP versus Zi = Yi /mi with the
estimated mean function Ê(Zi |xi ) = ρ̂(SP ) = ρ(ESP ) and a step function
or lowess curve added as visual aids.
Since the binomial regression model is simpler than the BBR model, graph-
ical diagnostics for the goodness of fit of the LR model would be useful. The
following plot was suggested by Olive (2013b) to check for overdispersion.
Definition 10.11. To check for overdispersion, use the OD plot of the
estimated model variance V̂M ≡ V̂ (Y |SP ) versus the squared residuals V̂ =
[Y − Ê(Y |SP )]2 . For the LR model, V̂ (Yi |SP ) = mi ρ(ESPi )(1 − ρ(ESPi ))
and Ê(Yi |SP ) = mi ρ(ESPi ).
Suppose the bulk of the plotted points in the OD plot fall in a wedge.
Then the identity line, slope 4 line, and OLS line will be added to the plot
as visual aids. It is easier to use the OD plot to check the variance function
than the response plot since judging the variance function with the straight
lines of the OD plot is simpler than judging the variability about the logistic
curve. Also outliers are often easier to spot with the OD plot.
The evidence of overdispersion increases from slight to high as the scale of the
vertical axis increases from 4 to 10 times that of the horizontal axis. There is
considerable evidence of overdispersion if the scale of the vertical axis is more
than 10 times that of the horizontal, or if the percentage of points above the
slope 4 line through the origin is much larger than 5%.
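A small R sketch of this OD plot for binomial data, assuming mi trials and counts Yi; the identity line, slope 4 line, and OLS line are added as described above.

set.seed(4)
n <- 100; mi <- rep(10, n); x <- rnorm(n)
yi <- rbinom(n, size = mi, prob = exp(x)/(1 + exp(x)))
fit <- glm(cbind(yi, mi - yi) ~ x, family = binomial)
rho <- fit$fitted.values            # estimated rho(ESP_i)
Ehat <- mi * rho                    # estimated E(Y_i|SP)
Vmodhat <- mi * rho * (1 - rho)     # estimated model variance V_M
Vhat <- (yi - Ehat)^2               # squared residuals
plot(Vmodhat, Vhat, main = "OD Plot")
abline(0, 1)                        # identity line
abline(0, 4, lty = 2)               # slope 4 line through the origin
abline(lm(Vhat ~ Vmodhat), lty = 3) # OLS line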
If the binomial LR OD plot is used but the data follows a beta–binomial re-
gression model, then V̂mod = V̂ (Yi |SP ) ≈ mi ρ(ESP )(1−ρ(ESP )) while V̂ =
[Yi − mi ρ(ESP )]2 ≈ (Yi − E(Yi ))2 . Hence E(V̂ ) ≈ V (Yi ) ≈ mi ρ(ESP )(1 −
ρ(ESP ))[1 + (mi − 1)θ/(1 + θ)], so the plotted points with mi = m should
scatter about a line with slope ≈ 1 + (m − 1)θ/(1 + θ) = (1 + mθ)/(1 + θ).
Fig. 10.1 a), b) Response plots of ESP versus Y; c) ESSP; d) OD plot of Vmodhat versus Vhat.
The first example is for binary data. For binary data, G2 is not approxi-
mately χ2 and some plots of residuals have a pattern whether the model is
correct or not. For binary data the OD plot is not needed, and the plotted
points follow a curve rather than falling in a wedge. The response plot is
very useful if the logistic curve and step function of observed proportions are
added as visual aids. The logistic curve gives the estimated LR probability of
success. For example, when ESP = 0, the estimated probability is 0.5. The
following three examples used SP = xT β.
Fig. 10.2 a) ESSP (ESP versus Z); b) OD plot (Vmodhat versus Vhat).
Example 10.2. Abraham and Ledolter (2006, pp. 360-364) describe death
penalty sentencing in Georgia. The predictors are aggravation level from 1 to
6 (treated as a continuous variable) and race of victim coded as 1 for white
and 0 for black. There were 362 jury decisions and 12 level race combinations.
The response variable was the number of death sentences in each combination.
The response plot (ESSP) in Figure 10.2a shows that the Yi /mi are close to
the estimated LR mean function (the logistic curve). The step function based
on 5 slices also tracks the logistic curve well. The OD plot is shown in Figure
10.2b with the identity, slope 4, and OLS lines added as visual aids. The
vertical scale is less than the horizontal scale, and there is no evidence of
overdispersion.
Fig. 10.3 a) ESSP (ESP versus Z); b) OD plot (Vmodhat versus Vhat).
Example 10.3. Collett (1999, pp. 216-219) describes a data set where
the response variable is the number of rotifers that remain in suspension in
a tube. A rotifer is a microscopic invertebrate. The two predictors were the
density of a stock solution of Ficolli and the species of rotifer coded as 1
for polyarthra major and 0 for keratella cochlearis. Figure 10.3a shows the
response plot (ESSP). Both the observed proportions and the step function
track the logistic curve well, suggesting that the LR mean function is a good
approximation to the data. The OD plot suggests that there is overdispersion
since the vertical scale is about 30 times the horizontal scale. The OLS line
has slope much larger than 4 and two outliers seem to be present.
10.4 Poisson Regression
In the response plot for Poisson regression, the shape of the estimated
mean function µ̂(ESP ) = exp(ESP ) depends strongly on the range of the
ESP. The variety of shapes occurs because the plotting software attempts
to fill the vertical axis. Hence if the range of the ESP is narrow, then the
exponential function will be rather flat. If the range of the ESP is wide, then
the exponential curve will look flat in the left of the plot but will increase
sharply in the right of the plot.
This plot is very useful as a goodness of fit diagnostic. The lowess curve
is a nonparametric estimator of the mean function and is represented as a
jagged curve to distinguish it from the estimated PR mean function (the
exponential curve). See Figure 10.4 a). If the number of nontrivial predictors
q < n/10, if there is no overdispersion, and if the lowess curve follows the
exponential curve closely (except possibly for the largest values of the ESP),
then the PR mean function may be a useful approximation for E(Y |x). A
useful lack of fit plot is a plot of the ESP versus the deviance residuals
that are often available from the software.
If the exponential curve clearly fits the lowess curve better than the line
Y = Y , then H0 should be rejected, but if the line Y = Y fits the lowess
curve about as well as the exponential curve (which should only happen if
the exponential curve is approximately linear with a small slope), then Y
may be independent of the predictors. See Figure 10.6 a).
Warning: For many count data sets where the PR mean function is
good, the PR model is not appropriate but the PR MLE is still a con-
sistent estimator of β. The problem is that for many data sets where
E(Y |x) = µ(x) = exp(SP ), it turns out that V (Y |x) > exp(SP ). This
phenomenon is called overdispersion. Adding parametric and nonparamet-
ric estimators of the standard deviation function to the response plot can
be useful. See Cook and Weisberg (1999, pp. 401-403). The NBR model of
Definition 10.3 is a useful alternative to PR.
Since the Poisson regression model is simpler than the NBR model, graph-
ical diagnostics for the goodness of fit of the PR model would be useful. The
following plot was suggested by Winkelmann (2000, p. 110).
Definition 10.14. To check for overdispersion, use the OD plot of the
estimated model variance V̂M ≡ V̂ (Y |SP ) versus the squared residuals V̂ =
[Y − Ê(Y |SP )]2 . For the PR model, V̂ (Y |SP ) = exp(ESP ) = Ê(Y |SP ) and
V̂ = [Y − exp(ESP )]2 .
Numerical summaries are also available. The deviance G2 , described in
Section 10.5, is a statistic used to assess the goodness of fit of the Poisson
regression model much as R2 is used for multiple linear regression. For Poisson
regression, G2 is approximately chi-square with n − p degrees of freedom. Since a χ2d random variable has mean d and standard deviation √(2d), the 98th percentile of the χ2d distribution is approximately d + 3√d ≈ d + 2.121√(2d). If the response and OD plots look good, and G2/(n − p) ≈ 1, then the PR model is likely useful. If G2 > (n − p) + 3√(n − p), then a more complicated count
model than PR may be needed. A good discussion of such count models is in
Simonoff (2003).
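These numerical summaries are easy to compute in R from a fitted Poisson regression; the sketch below uses simulated data rather than a data set from the text.

set.seed(5)
n <- 100; x <- rnorm(n)
y <- rpois(n, lambda = exp(1 + 0.5 * x))
outp <- glm(y ~ x, family = poisson)
G2 <- deviance(outp)            # deviance G^2
dfres <- df.residual(outp)      # n - p
G2 / dfres                      # roughly 1 if the PR model is ok
G2 > dfres + 3 * sqrt(dfres)    # TRUE suggests a more complicated count model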
For PR, Winkelmann (2000, p. 110) suggested that the plotted points in
the OD plot should scatter about the identity line through the origin with unit
slope and that the OLS line should be approximately equal to the identity
line if the PR model is appropriate. But in simulations, it was found that the
following two observations make the OD plot much easier to use for Poisson
regression.
First, recall that a normal approximation is good for both the Poisson and negative binomial distributions if the count Y is not too small. Notice that if Y = E(Y |SP) + 2√(V (Y |SP)), then [Y − E(Y |SP)]² = 4V (Y |SP). Hence if both the estimated mean and estimated variance functions are good approximations, the plotted points in the OD plot for Poisson regression will scatter about a wedge formed by the V̂ = 0 line and the slope 4 line through the origin.
Example 10.4. For the Ceriodaphnia data of Myers et al. (2002, pp.
136-139), the response variable Y is the number of Ceriodaphnia organisms
counted in a container. The sample size was n = 70, and the predictors were
a constant (x1 ), seven concentrations of jet fuel (x2 ), and an indicator for
two strains of organism (x3 ). The jet fuel was believed to impair reproduction
so high concentrations should have smaller counts. Figure 10.4 shows the 4
plots for this data. In the response plot of Figure 10.4a, the lowess curve
is represented as a jagged curve to distinguish it from the estimated PR
mean function (the exponential curve). The horizontal line corresponds to
the sample mean Ȳ. The OD plot in Figure 10.4b suggests that there is little
evidence of overdispersion. These two plots as well as Figures 10.4c and 10.4d
suggest that the Poisson regression model is a useful approximation to the
data.
Example 10.5. For the crab data, the response Y is the number of satellites (male crabs) near a female crab. The sample size n = 173 and the predictor variables were the color, spine condition, carapace width, and weight of the female crab. Agresti (2002, pp. 126-131) first uses Poisson regression, and then uses the NBR model with κ̂ = 0.98 ≈ 1. Figure 10.5a suggests that there is one case with an unusually large value of the ESP. The lowess curve does not track the exponential curve all that well. Figure 10.5b suggests that overdispersion is present since the vertical scale is about 10 times that of the horizontal scale and too many of the plotted points are large and greater than the slope 4 line. Figure 10.5c also suggests that the Poisson regression mean function is a rather poor fit since the plotted points fail to cover the identity line. Although the exponential mean function fits the lowess curve better than the line Y = Ȳ, an alternative model to the NBR model may fit
the data better. In later chapters, Agresti uses binomial regression models for this data.

Fig. 10.4 Plots for the Ceriodaphnia data: a) ESSP, b) OD plot, and plots of MWRES versus MWFIT.
Fig. 10.5 Plots for the crab data: a) ESSP, b) OD plot, and plots of MWRES versus MWFIT.
Fig. 10.6 Plots for the popcorn data: a) ESSP, b) OD plot, and plots of MWRES versus MWFIT.
Example 10.6. For the popcorn data of Myers et al. (2002, p. 154), the
response variable Y is the number of inedible popcorn kernels. The sample
size was n = 15 and the predictor variables were temperature (coded as 5,
6, or 7), amount of oil (coded as 2, 3, or 4), and popping time (75, 90, or
105). One batch of popcorn had more than twice as many inedible kernels
as any other batch and is an outlier. Ignoring the outlier in Figure 10.6a
suggests that the line Y = Ȳ will fit the data and lowess curve better than
the exponential curve. Hence Y seems to be independent of the predictors.
Notice that the outlier sticks out in Figure 10.6b and that the vertical scale is
well over 10 times that of the horizontal scale. If the outlier was not detected,
then the Poisson regression model would suggest that temperature and time
are important predictors, and overdispersion diagnostics such as the deviance
would be greatly inflated. However, we probably need to delete the high
temperature, low oil, and long popping time combination, to conclude that
the response is independent of the predictors.
10.5 GLM Inference, n/p Large
This section gives a very brief discussion of inference for the logistic regression
(LR) and Poisson regression (PR) models. Inference for these two models is
very similar to inference for the multiple linear regression (MLR) model. For
all three of these models, Y is independent of the p × 1 vector of predictors
x = (x1 , x2 , ..., xp)T given the sufficient predictor xT β where the constant
x1 ≡ 1.
To perform inference for LR and PR, computer output is needed. Shown
below is output using symbols and output from a real data set with p = 3
nontrivial predictors. This data set is the banknote data set described in Cook
and Weisberg (1999, p. 524). There were 200 Swiss bank notes of which 100
were genuine (Y = 0) and 100 counterfeit (Y = 1). The goal of the analysis
was to determine whether a selected bill was genuine or counterfeit from
physical measurements of the bill.
Coefficient Estimates
Label Estimate Std. Error Est/SE p-value
Constant -389.806 104.224 -3.740 0.0002
Bottom 2.26423 0.333233 6.795 0.0000
Left 2.83356 0.795601 3.562 0.0004
Scale factor: 1.
Number of cases: 200
Degrees of freedom: 197
Pearson X2: 179.809
Deviance: 99.169
Point estimators for the mean function are important. Given values of
x = (x1 , ..., xp)T , a major goal of binary logistic regression is to estimate the
success probability P (Y = 1|x) = ρ(x) with the estimator
ρ̂(x) = exp(xT β̂)/(1 + exp(xT β̂)).   (10.8)
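In R, the estimator (10.8) can be evaluated at new values of x with predict; the sketch below uses simulated data and a hypothetical new value.

set.seed(6)
x <- rnorm(50); y <- rbinom(50, 1, exp(x)/(1 + exp(x)))
out <- glm(y ~ x, family = binomial)
newx <- data.frame(x = 0.5)
ESP <- predict(out, newdata = newx)              # x^T beta-hat
exp(ESP)/(1 + exp(ESP))                          # rho-hat(x) from (10.8)
predict(out, newdata = newx, type = "response")  # the same value from R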
The Wald confidence interval (CI) for βj can also be obtained using the output: the large sample 100(1 − δ)% CI for βj is β̂j ± z1−δ/2 se(β̂j).
The Wald test and CI tend to give good results if the sample size n is large.
Here 1 − δ refers to the coverage of the CI. A 90% CI uses z1−δ/2 = 1.645, a
95% CI uses z1−δ/2 = 1.96, and a 99% CI uses z1−δ/2 = 2.576.
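A minimal R sketch of these Wald intervals from the coefficient table of a fitted binomial GLM; confint.default returns the same large sample intervals.

set.seed(7)
x <- rnorm(80); y <- rbinom(80, 1, exp(x)/(1 + exp(x)))
out <- glm(y ~ x, family = binomial)
est <- coef(summary(out))[, "Estimate"]
se  <- coef(summary(out))[, "Std. Error"]
cbind(est - 1.96 * se, est + 1.96 * se)  # large sample 95% Wald CIs for beta_j
confint.default(out, level = 0.95)       # same Wald intervals computed by R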
For a GLM, often 3 models are of interest: the full model that uses all
p of the predictors xT = (xTR , xTO ), the reduced model that uses the r
predictors xR , and the saturated model that uses n parameters θ1 , ..., θn
where n is the sample size. For the full model the p parameters β1 , ..., βp are
estimated while the reduced model has r + 1 parameters. Let lSAT (θ1 , ..., θn)
be the likelihood function for the saturated model and let lF U LL (β) be the
likelihood function for the full model. Let LSAT = log lSAT (θ̂1 , ..., θ̂n) be the
log likelihood function for the saturated model evaluated at the maximum
likelihood estimator (MLE) (θ̂1 , ..., θ̂n) and let LF U LL = log lF U LL (β̂) be the
log likelihood function for the full model evaluated at the MLE (β̂). Then
the deviance D = G2 = −2(LF U LL − LSAT ). The degrees of freedom for
the deviance = dfF U LL = n − p where n is the number of parameters for the
saturated model and p is the number of parameters for the full model.
The saturated model for logistic regression states that for i = 1, ..., n, the
Yi |xi are independent binomial(mi, ρi ) random variables where ρ̂i = Yi /mi .
The saturated model is usually not very good for binary data (all mi = 1)
or if the mi are small. The saturated model can be good if all of the mi are
large or if ρi is very close to 0 or 1 whenever mi is not large.
The saturated model for Poisson regression states that for i = 1, ..., n,
the Yi |xi are independent Poisson(µi ) random variables where µ̂i = Yi . The
saturated model is usually not very good for Poisson data, but the saturated
model may be good if n is fixed and all of the counts Yi are large.
If X ∼ χ2d then E(X) = d and VAR(X) = 2d. An observed value of X > d + 3√d is unusually large and an observed value of X < d − 3√d is unusually small.
When the saturated model is good, a rule of thumb is that the logistic or Poisson regression model is ok if G2 ≤ n − p (or if G2 ≤ n − p + 3√(n − p)). For binary LR, the χ2n−p approximation for G2 is rarely good even for large sample sizes n. For LR, the response plot is often a much better diagnostic for goodness of fit, especially when ESP = xTi β takes on many values and when p << n. For PR, both the response plot and G2 ≤ n − p + 3√(n − p) should be checked.
Response = Y
Terms = (x1, ..., xp)

             Total                Change
Predictor    df             Deviance | df  Deviance
Ones         n - 1 = dfo    G2o      |
x2           n - 2                   | 1
x3           n - 3                   | 1
...          ...                     | ...
xp           n - p = dfFULL G2FULL   | 1
-----------------------------------------
Data set = cbrain, Name of Fit = B1
Response = sex
Terms = (cephalic size log[size])
Sequential Analysis of Deviance
             Total              Change
Predictor    df    Deviance  |  df   Deviance
Ones        266    363.820   |
cephalic    265    363.605   |  1    0.214643
size        264    315.793   |  1    47.8121
log[size]   263    305.045   |  1    10.7484
The above output, shown in symbols and for a real data set, is used for the
deviance test described below. Assume that the response plot has been made
and that the logistic or Poisson regression model fits the data well in that the
nonparametric step or lowess estimated mean function follows the estimated
model mean function closely and there is no evidence of overdispersion. The
deviance test is used to test whether β2 = 0 where β = (β1, βT2)T = (α, ηT)T. If this is the case, then the nontrivial predictors are not needed in the GLM model. If H0 : β2 = 0 is not rejected, then for Poisson regression the estimator µ̂ = Ȳ should be used, while for logistic regression the estimator ρ̂ = Σ_{i=1}^n Yi / Σ_{i=1}^n mi should be used. Note that ρ̂ = Ȳ for binary logistic regression since mi ≡ 1 for i = 1, ..., n. This test is similar to the ANOVA F test for multiple linear regression.
The four step deviance test is i) H0 : β2 = 0 versus HA : β2 ≠ 0; ii) test statistic G2(0|F) = G2o − G2FULL; iii) pval = P(χ2 > G2(0|F)) where the χ2 distribution has p − 1 degrees of freedom; iv) reject H0 if pval ≤ δ and conclude that there is a GLM relationship between Y and the predictors X2, ..., Xp. (Or there is not enough evidence to conclude that there is a GLM relationship between Y and the predictors.)
This test can be performed in R by obtaining output from the full and
null model.
outf <- glm(Y ~ x2 + x3 + ... + xp, family = binomial)
outn <- glm(Y ~ 1, family = binomial)
anova(outn,outf,test="Chi")
Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1 *** ****
2 *** **** k Gˆ2(0|F) pvalue
The output below, shown both in symbols and for a real data set, can be
used to perform the change in deviance test. If the reduced model leaves out
a single variable xi , then the change in deviance test becomes H0 : βi = 0
versus HA : βi ≠ 0. This test is a competitor of the Wald test. This change in
deviance test is usually better than the Wald test if the sample size n is not
large, but the Wald test is often easier for software to produce. For large n
the test statistics from the two tests tend to be very similar (asymptotically
equivalent tests).
If the reduced model is good, then the plotted points in the EE plot of ESP(R) = xTRi β̂R versus ESP = xTi β̂ should cluster tightly about the identity line with unit slope and zero intercept, so that corr(ESP(R), ESP) is high.
SP = β1 + β2 x2 + · · · + βp xp = xT β = xTR β R + xTO β O
where the reduced model uses r of the predictors used by the full model and
xO denotes the vector of p − r predictors that are in the full model but not
the reduced model. For logistic regression, the reduced model is Yi |xRi ∼
independent Binomial(mi , ρ(xRi )) while for Poisson regression the reduced
model is Yi |xRi ∼ independent Poisson(µ(xRi )) for i = 1, ..., n.
Assume that the response plot looks good. Then we want to test H0 : the
reduced model is good (can be used instead of the full model) versus HA :
use the full model (the full model is significantly better than the reduced
model). Fit the full model and the reduced model to get the deviances G2F U LL
and G2RED . The next test is similar to the partial F test for multiple linear
regression.
This test can be performed in R by obtaining output from the full and
reduced model.
outf <- glm(Y ~ x2 + x3 + ... + xp, family = binomial)
outr <- glm(Y ~ x4 + x6 + x8, family = binomial)
anova(outr,outf,test="Chi")
Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1 *** ****
2 *** **** p-r Gˆ2(R|F) pvalue
Interpretation of coefficients: if x2 , ..., xi−1, xi+1 , ..., xp can be held fixed,
then increasing xi by 1 unit increases the sufficient predictor SP by βi units.
As a special case, consider logistic regression. Let ρ(x) = P (success|x) = 1 −
P(failure|x) where a “success” is what is counted and a “failure” is what is not
counted (so if the Yi are binary, ρ(x) = P (Yi = 1|x)). Then the estimated
odds of success is Ω̂(x) = ρ̂(x)/(1 − ρ̂(x)) = exp(xT β̂). In logistic regression,
increasing a predictor xi by 1 unit (while holding all other predictors fixed)
multiplies the estimated odds of success by a factor of exp(β̂i ).
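The estimated odds multipliers exp(β̂i) can be read off a fitted logistic regression in R, as in the sketch below with simulated data.

set.seed(8)
x1 <- rnorm(100); x2 <- rbinom(100, 1, 0.5)
sp <- 0.5 * x1 - x2
y <- rbinom(100, 1, exp(sp)/(1 + exp(sp)))
fit <- glm(y ~ x1 + x2, family = binomial)
exp(coef(fit))  # factor by which the estimated odds of success are multiplied
                # when a predictor increases by 1 unit, others held fixed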
Output for Full Model, Response = gender, Terms =
(age log[age] breadth circum headht
height length size log[size])
Number of cases: 267, Degrees of freedom: 257,
Deviance: 234.792
b) The full model uses the predictors listed above to the right of Terms.
Perform a 4 step change in deviance test to see if the reduced model can be
used. Both models contain a constant.
Solution: a) ESP = β̂1 + β̂2 x2 + β̂3 x3 = −6.26111 − 0.0536078(65) + 0.0028215(3500) = 0.1296. So
ρ̂(x) = exp(ESP)/(1 + exp(ESP)) = 1.1384/(1 + 1.1384) = 0.5324.
b) i) H0 : the reduced model is good HA: use the full model
ii) G2 (R|F ) = 313.457 − 234.792 = 78.665
iii) Now df = 264 − 257 = 7, and comparing 78.665 with χ27,0.999 = 24.32
shows that the pval = 0 < 1 − 0.999 = 0.001.
iv) Reject H0 , use the full model.
Coefficient Estimates
Label Estimate Std. Error Est/SE p-value
Constant -5.84211 1.74259 -3.353 0.0008
jaw ht 0.103606 0.0383650 ? ??
Example 10.9. A museum has 60 skulls, some of which are human and
some of which are from apes. Consider trying to estimate whether the skull
type is human or ape from the height of the lower jaw. Use the above logistic
regression output to answer the following problems. The museum data is
available from the text’s website as file museum.lsp, and is from Schaaffhausen
(1878). Here x = x2 .
a) Predict ρ̂(x) if x = 40.0.
b) Find a 95% CI for β2 .
c) Perform the 4 step Wald test for H0 : β2 = 0.
Solution: a) exp[ESP] = exp[β̂1 + β̂2(40)] = exp[−5.84211 + 0.103606(40)] = exp[−1.69787] = 0.1830731. So
ρ̂(x) = exp(ESP)/(1 + exp(ESP)) = 0.1830731/(1 + 0.1830731) = 0.1547.
This subsection gives some rules of thumb for variable selection for logistic
and Poisson regression when SP = xT β. Before performing variable selection,
a useful full model needs to be found. The process of finding a useful full
model is an iterative process. Given a predictor x, sometimes x is not used
by itself in the full model. Suppose that Y is binary. Then to decide what
functions of x should be in the model, look at the conditional distribution of
x|Y = i for i = 0, 1. The rules shown in Table 10.1 are used if x is an indicator
variable or if x is a continuous variable. Replace normality by “symmetric
with similar spreads” and “symmetric with different spreads” in the second
and third lines of the table. See Cook and Weisberg (1999, p. 501) and Kay
and Little (1987).
The full model will often contain factors and interactions. If w is a nominal
variable with K levels, make w into a factor by using K − 1 (indicator or)
dummy variables x1,w , ..., xK−1,w in the full model. For example, let xi,w = 1
if w is at its ith level, and let xi,w = 0, otherwise. An interaction is a product
of two or more predictor variables. Interactions are difficult to interpret.
Often interactions are included in the full model, and then the reduced model
without any interactions is tested. The investigator is often hoping that the
interactions are not needed.
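In R, factors and interactions of this kind can be written directly in the model formula, and the reduced model without the interaction can be tested with a change in deviance test; the sketch below uses simulated data and hypothetical variable names.

set.seed(9)
n <- 120
w <- factor(sample(c("a", "b", "c"), n, replace = TRUE))  # nominal, K = 3 levels
x <- rnorm(n)
y <- rbinom(n, 1, 0.5)
# factor w is coded with K - 1 = 2 dummy variables; x:w is the interaction
full <- glm(y ~ x * w, family = binomial)  # main effects plus interaction
red  <- glm(y ~ x + w, family = binomial)  # reduced model without the interaction
anova(red, full, test = "Chi")             # change in deviance test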
To make a full model, use the above discussion and then make a response
plot to check that the full model is good. The number of predictors in the
full model should be much smaller than the number of data cases n. Suppose that the Yi are binary for i = 1, ..., n. Let N1 = Σ Yi = the number of 1s and N0 = n − N1 = the number of 0s. A rough rule of thumb is that the full model
should use no more than min(N0 , N1 )/5 predictors and the final submodel
should have r predictor variables where r is small with r ≤ min(N0 , N1 )/10.
For Poisson regression, a rough rule of thumb is that the full model should
use no more than n/5 predictors and the final submodel should use no more
than n/10 predictors.
All subsets variable selection can be performed with the following pro-
cedure. Compute the ESP of the GLM and compute the OLS ESP found by
the OLS regression of Y on x. Check that |corr(ESP, OLS ESP)| ≥ 0.95. This
high correlation will exist for many data sets. Then perform multiple linear
regression and the corresponding all subsets OLS variable selection with the
Cp(I) criterion. If the sample size n is large and Cp (I) ≤ 2r where the subset
I has r variables including a constant, then corr(OLS ESP, OLS ESP(I))
will be high by Olive and Hawkins (2005), and hence corr(ESP, ESP(I))
will be high. In other words, if the OLS ESP and GLM ESP are highly
correlated, then performing multiple linear regression and the corresponding
MLR variable selection (e.g. forward selection, backward elimination, or all
subsets selection) based on the Cp (I) criterion may provide many interesting
submodels.
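A rough R sketch of this procedure, assuming a binary response y and a predictor matrix x; the leaps package (used elsewhere in this text) performs the all subsets OLS variable selection with the Cp(I) criterion.

set.seed(10)
n <- 200; q <- 5
x <- matrix(rnorm(n * q), ncol = q)
y <- rbinom(n, 1, exp(x[, 1] - x[, 2])/(1 + exp(x[, 1] - x[, 2])))
ESP <- glm(y ~ x, family = binomial)$linear.predictors  # GLM ESP
OESP <- lm(y ~ x)$fitted.values                         # OLS ESP
abs(cor(ESP, OESP))      # want >= 0.95 before using OLS variable selection
library(leaps)
out <- leaps(x, y, method = "Cp", nbest = 1)
cbind(out$size, out$Cp, out$which)   # candidate subsets I with Cp(I)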
Know how to find good models from output. The following rules of thumb
(roughly in order of decreasing importance) may be useful. It is often not
possible to have all 12 rules of thumb to hold simultaneously. Let submodel
I have rI predictors, including a constant. Do not use more predictors than
submodel II , which has no more predictors than the minimum AIC model.
It is possible that II = Imin = Ifull . Assume the response plot for the full
model is good. Then the submodel I is good if
i) the response plot for the submodel looks like the response plot for the full
model.
ii) corr(ESP,ESP(I)) ≥ 0.95.
iii) The plotted points in the EE plot cluster tightly about the identity line.
iv) Want the pval ≥ 0.01 for the change in deviance test that uses I as the
reduced model.
v) For binary LR want rI ≤ min(N1 , N0 )/10. For PR, want rI ≤ n/10.
vi) Fit OLS to the full and reduced models. The plotted points in the plot of
the OLS residuals from the submodel versus the OLS residuals from the full
model should cluster tightly about the identity line.
vii) Want the deviance G2 (I) ≥ G2 (full) but close. (G2 (I) ≥ G2 (full) since
adding predictors to I does not increase the deviance.)
viii) Want AIC(I) ≤ AIC(Imin ) + 7 where Imin is the minimum AIC model
found by the variable selection procedure.
ix) Want hardly any predictors with pvals > 0.05.
x) Want few predictors with pvals between 0.01 and 0.05.
xi) Want G2(I) ≤ n − rI + 3√(n − rI).
xii) The OD plot should look good.
Heuristically, forward selection tries to add the variable that will decrease
the deviance the most. A decrease in deviance less than 4 (if the predictor
has 1 degree of freedom) may be troubling in that a bad predictor may have
been added. In practice, the forward selection program may add the variable
such that the submodel I with j nontrivial predictors has a) the smallest
AIC(I), b) the smallest deviance G2 (I), or c) the smallest pval (preferably
from a change in deviance test but possibly from a Wald test) in the test
H0 : βi = 0 versus HA : βi ≠ 0 where the current model with j terms plus
the predictor xi is treated as the full model (for all variables xi not yet in
the model).
Suppose that the full model is good and is stored in M1. Let M2, M3,
M4, and M5 be candidate submodels found after forward selection, backward
elimination, etc. Make a scatterplot matrix of the ESPs for M2, M3, M4,
M5, and M1. Good candidates should have estimated sufficient predictors
that are highly correlated with the full model estimated sufficient predictor
(the correlation should be at least 0.9 and preferably greater than 0.95). For
binary logistic regression, mark the symbols (0 and +) using the response
variable Y .
The final submodel should have few predictors, few variables with large
Wald pvals (0.01 to 0.05 is borderline), a good response plot, and an EE plot
that clusters tightly about the identity line. If a factor has K − 1 dummy
variables, either keep all K − 1 dummy variables or delete all K − 1 dummy
variables, do not delete some of the dummy variables.
Example 10.11. The following output is for forward selection. All models
use a constant. For forward selection, the min AIC model uses {F}LOC, TYP,
AGE, CAN, SYS, PCO, and PH. Model II uses {F}LOC, TYP, AGE, CAN,
and SYS. Let model I use {F}LOC, TYP, AGE, and CAN. This model
may be good, so for forward selection, models II and I are the first models
to examine. {F}LOC is notation used for a factor with K − 1 = 3 dummy
variables, while k is the number of variables in I, including a constant. Output
is from the Cook and Weisberg (1999) Arc software.
Forward Selection comment
Example 10.12. The above table gives summary statistics for 4 models
considered as final submodels after performing variable selection. One pre-
dictor was a factor, and a factor was considered to have a bad Wald p-value
> 0.05 if all of the dummy variables corresponding to the factor had p-values
> 0.05. Similarly the factor was considered to have a borderline p-value with
0.01 ≤ p-value ≤ 0.05 if none of the dummy variables corresponding to the
factor had a p-value < 0.01 but at least one dummy variable had a p-value
between 0.01 and 0.05. The response was binary and logistic regression was
used. The response plot for the full model B1 was good. Model B2 was the
minimum AIC model found. There were 267 cases: for the response, 113 were
0’s and 154 were 1’s.
Which two models are the best candidates for the final submodel? Explain
briefly why each of the other 2 submodels should not be used.
Solution: B2 and B3 are best. B1 has too many predictors with rather
large p-values. For B4, the AIC is too high and the corr and p-value are too
low.
Fig. 10.7 Response plot of ESP versus Y for the ICU data of Example 10.13.
Example 10.13. The ICU data is available from the text’s website and
from STATLIB (http://lib.stat.cmu.edu/DASL/Datafiles/ICU.html). Also
see Hosmer and Lemeshow (2000, pp. 23-25). The survival of 200 patients
following admission to an intensive care unit was studied with logistic regres-
sion. The response variable was STA (0 = Lived, 1 = Died). Predictors were
AGE, SEX (0 = Male, 1 = Female), RACE (1 = White, 2 = Black, 3 =
Other), SER= Service at ICU admission (0 = Medical, 1 = Surgical), CAN=
Is cancer part of the present problem? (0 = No, 1 = Yes), CRN= History
of chronic renal failure (0 = No, 1 = Yes), INF= Infection probable at ICU
admission (0 = No, 1 = Yes), CPR= CPR prior to ICU admission (0 = No, 1
= Yes), SYS= Systolic blood pressure at ICU admission (in mm Hg), HRA=
Heart rate at ICU admission (beats/min), PRE= Previous admission to an
ICU within 6 months (0 = No, 1 = Yes), TYP= Type of admission (0 =
Elective, 1 = Emergency), FRA= Long bone, multiple, neck, single area, or
hip fracture (0 = No, 1 = Yes), PO2= PO2 from initial blood gases (0 if >60,
1 if ≤ 60), PH= PH from initial blood gases (0 if ≥ 7.25, 1 if <7.25), PCO=
PCO2 from initial blood gases (0 if ≤ 45, 1 if >45), Bic= Bicarbonate from
initial blood gases (0 if ≥ 18, 1 if <18), CRE= Creatinine from initial blood
gases (0 if ≤ 2.0, 1 if >2.0), and LOC= Level of consciousness at admission
(0 = no coma or stupor, 1= deep stupor, 2 = coma).
Factors LOC and RACE had two indicator variables to model the three
levels. The response plot in Figure 10.7 shows that the logistic regression
model using the 19 predictors is useful for predicting survival, although the
output has ρ̂(x) = 1 or ρ̂(x) = 0 exactly for some cases. Note that the
step function of slice proportions tracks the model logistic curve fairly well.
Variable selection, using forward selection and backward elimination with
the AIC criterion, suggested the submodel using AGE, CAN, SYS, TYP, and
LOC. The EE plot of ESP(sub) versus ESP(full) is shown in Figure 10.8.
The plotted points in the EE plot should cluster tightly about the identity
line if the full model and the submodel are good. Since this clustering did
not occur, the submodel seems to be poor. The lowest cluster of points and
the case on the right nearest to the identity line correspond to black patients.
The main cluster and upper right cluster correspond to patients who are not
black.
Figure 10.9 shows the EE plot when RACE is added to the submodel.
Then all of the points cluster about the identity line. Although numerical
variable selection did not suggest that RACE is important, perhaps since
output had ρ̂(x) = 1 or ρ̂(x) = 0 exactly for some cases, the two EE plots
suggest that RACE is important. Also the RACE variable could be replaced
by an indicator for black. This example illustrates how the plots can be
used to quickly improve and check the models obtained by following logistic
regression with variable selection even if the MLE β̂ LR does not exist.
P1 P2 P3 P4
df 144 147 148 149
# of predictors 6 3 2 1
# with 0.01 ≤ Wald p-value ≤ 0.05 1 0 0 0
# with Wald p-value > 0.05 3 0 1 0
G2 127.506 131.644 147.151 149.861
AIC 141.506 139.604 153.151 153.861
corr(ESP,ESP(I)) 1.0 0.954 0.810 0.792
p-value for change in deviance test 1.0 0.247 0.0006 0.0
Example 10.14. The above table gives summary statistics for 4 models
considered as final submodels after performing variable selection. Poisson
regression was used. The response plot for the full model P1 was good. Model
P2 was the minimum AIC model found.
Which model is the best candidate for the final submodel? Explain briefly
why each of the other 3 submodels should not be used.
Solution: P2 is best. P1 has too many predictors with large pvalues and
more predictors than the minimum AIC model. P3 and P4 have corr and
pvalue too low and AIC too high.
Warning. Variable selection for GLMs is very similar to that for multiple
linear regression. Finding a model II from variable selection and then using GLM
output for model II does not give valid tests and confidence intervals. If there
is a good full model that was found before examining the response, and if II
is the minimum AIC model, then Section 10.9 describes how to do inference
after variable selection. If the model needs to be built using the response, use
data splitting. A pilot study can also be useful.
Forward selection with EBIC, lasso, and/or elastic net can be used for the
Cox proportional hazards regression model and for some GLMs, including
binomial and Poisson regression. The relaxed lasso = VS-lasso and relaxed
elastic net = VS-elastic net estimators apply the GLM or Cox regression
model to the predictors with nonzero lasso or elastic net coefficients. As
with multiple linear regression, the population number of active nontrivial
predictors = kS , but for a GLM, model I with SP = xTI βI has k active
nontrivial predictors. See Section 4.1.
Remark 10.1. Most of the plots in this chapter use ESP = xT β̂, and can also be made using ESP(I) = xTI β̂I. Obtaining a good ESP becomes more difficult as n/p becomes smaller.
overfitting. This method should be best when the predictors are linearly
related: there should be no strong nonlinear relationships. See Olive and
Hawkins (2005) for this method when n > 10p.
Some R commands for GLM lasso and Remark 10.2 are shown below. Note
that the family command indicates whether a binomial regression (including
binary regression) or a Poisson regression is being fit. The default for GLM
lasso uses 10-fold CV with a deviance criterion.
set.seed(1976) #Binary regression
library(glmnet)
n<-100
m<-1 #binary regression
q <- 100 #100 nontrivial predictors, 95 inactive
k <- 5 #k_S = 5 population active predictors
y <- 1:n
mv <- m + 0 * y
vars <- 1:q
beta <- 0 * 1:q
beta[1:k] <- beta[1:k] + 1
beta
alpha <- 0
x <- matrix(rnorm(n * q), nrow = n, ncol = q)
SP <- alpha + x[,1:k] %*% beta[1:k]
pv <- exp(SP)/(1 + exp(SP))
y <- rbinom(n,size=m,prob=pv)
y
out<-cv.glmnet(x,y,family="binomial")
lam <- out$lambda.min
bhat <- as.vector(predict(out,type="coefficients",s=lam))
ahat <- bhat[1] #alphahat
bhat<-bhat[-1]
vin <- vars[bhat!=0] #want 1-5, overfit
[1] 1 2 3 4 5 6 16 59 61 74 75 76 96
ind <- as.data.frame(cbind(y,x[,vin])) #relaxed lasso GLM
tem <- glm(y~.,family="binomial",data=ind)
tem$coef
(Inter) V2 V3 V4 V5 V6
0.2103 1.0037 1.4304 0.6208 1.8805 0.3831
V7 V8 V9 V10 V11 V12
0.8971 0.4716 0.5196 0.8900 0.6673 -0.7611
V13 V14
-0.5918 0.6926
lrplot3(tem=tem,x=x[,vin]) #binary response plot
#now use MLR lasso
outm<-cv.glmnet(x,y)
There are many alternatives to the binomial and Poisson regression GLMs.
Alternatives to the binomial GLM of Definition 10.7 include the discriminant
function model of Definition 10.8, the quasi-binomial model, the binomial
generalized additive model (GAM), and the beta-binomial model of Definition
10.2.
Alternatives to the Poisson GLM of Definition 10.12 include the quasi-
Poisson model, the Poisson GAM, and the negative binomial regression model
of Definition 10.3. Other alternatives include the zero truncated Poisson
model, the zero truncated negative binomial model, the hurdle or zero in-
flated Poisson model, the hurdle or zero inflated negative binomial model,
the hurdle or zero inflated additive Poisson model, and the hurdle or zero
inflated additive negative binomial model. See Zuur et al. (2009), Simonoff
(2003), and Hilbe (2011).
The estimated sufficient predictor ESP = ĥ(x), and ESP = xT β̂ for a GLM. The estimated additive predictor EAP = α̂ + Σ_{j=2}^p Ŝj(xj). An ESP–response plot is a plot of ESP versus Y while an EAP–response plot is a plot of EAP versus Y.
Note that a GLM is a special case of the GAM using Sj (xj ) = βj xj for j =
2, ..., p with α = β1 . A GLM with SP = α + β2 x2 + β3 x3 + β4 x1 x2 is a special
case of a GAM with x4 ≡ x1 x2 . A GLM with SP = α + β2 x2 + β3 x22 + β4 x3
is a special case of a GAM with S2 (x2 ) = β2 x2 + β3 x22 and S3 (x3 ) = β4 x3 .
A GLM with p terms may be equivalent to a GAM with k terms w1 , ..., wk
where k < p.
The plotted points in the EE plot defined below should scatter tightly
about the identity line if the GLM is appropriate and if the sample size is
large enough so that the ESP is a good estimator of the SP and the EAP is a
good estimator of the AP. If the clustering is not tight but the GAM gives a
reasonable approximation to the data, as judged by the EAP–response plot,
then examine the Ŝj of the GAM to see if some simple terms such as x2i can
be added to the GLM so that the modified GLM has a good ESP–response
plot. (This technique is easiest if the GLM and GAM have the same p terms
x1 , ..., xp. The technique is more difficult, for example, if the GLM has terms
x1 , x2 , x22, and x3 while the GAM has terms x1 , x2 and x3 .)
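With the mgcv package used in this chapter, the Ŝj can be examined with plot.gam, and a simple term such as a squared predictor can then be added to the GLM; the sketch below uses simulated data where S2 is quadratic.

library(mgcv)
set.seed(11)
n <- 200
x2 <- rnorm(n); x3 <- rnorm(n)
sp <- x2 + x2^2 - x3
y <- rbinom(n, 1, exp(sp)/(1 + exp(sp)))
outgam <- gam(y ~ s(x2) + s(x3), family = binomial)
plot(outgam, pages = 1)    # examine the S_j-hat: the smooth for x2 looks quadratic
outglm <- glm(y ~ x2 + I(x2^2) + x3, family = binomial)   # modified GLM
EAP <- predict(outgam)     # estimated additive predictor
ESP <- outglm$linear.predictors
plot(EAP, ESP); abline(0, 1)   # EE plot of EAP versus ESP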
Note that this model and the Poisson GLM have the same conditional mean
function, and the conditional variance functions are the same if φ = 1.
For the quasi-binomial model, the conditional mean and variance functions
are similar to those of the binomial distribution, but it is not assumed that
Y |SP has a binomial distribution. Similarly, it is not assumed that Y |SP
has a Poisson distribution for the quasi-Poisson model.
Next, some notation is needed to derive the zero truncated Poisson regression model. Y has a zero truncated Poisson distribution, Y ∼ ZTP(µ), if the probability mass function (pmf) of Y is
f(y) = e−µ µy / [(1 − e−µ) y!]
for y = 1, 2, 3, ... where µ > 0. The ZTP pmf is obtained from a Poisson distribution where y = 0 values are truncated, so not allowed. If W ∼ Poisson(µ) with pmf fW(y), then P(W = 0) = e−µ, so Σ_{y=1}^∞ fW(y) = 1 − e−µ = Σ_{y=0}^∞ fW(y) − fW(0). So the ZTP pmf f(y) = fW(y)/(1 − e−µ) for y ≠ 0.
Now E(Y) = Σ_{y=1}^∞ y f(y) = Σ_{y=0}^∞ y fW(y)/(1 − e−µ) = E(W)/(1 − e−µ) = µ/(1 − e−µ).
Similarly, E(Y²) = Σ_{y=1}^∞ y² f(y) = Σ_{y=0}^∞ y² fW(y)/(1 − e−µ) = E(W²)/(1 − e−µ) = [µ² + µ]/(1 − e−µ). So
V(Y) = E(Y²) − (E(Y))² = (µ² + µ)/(1 − e−µ) − [µ/(1 − e−µ)]².
Hence for the zero truncated Poisson regression model, where Y|x ∼ ZTP(µ(x)) with µ(x) = exp(SP),
E(Y|x) = exp(SP)/(1 − exp(−exp(SP))) and
V(Y|SP) = ([exp(SP)]² + exp(SP))/(1 − exp(−exp(SP))) − [exp(SP)/(1 − exp(−exp(SP)))]².
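A small R sketch of these ZTP conditional mean and variance functions, coded directly from the formulas above.

ztpmean <- function(SP) exp(SP) / (1 - exp(-exp(SP)))
ztpvar <- function(SP) {
  mu <- exp(SP)
  (mu^2 + mu) / (1 - exp(-mu)) - (mu / (1 - exp(-mu)))^2
}
SP <- seq(-2, 3, by = 0.5)
cbind(SP, E = ztpmean(SP), V = ztpvar(SP))
# for large exp(SP) the truncation matters little, and the ZTP mean and
# variance are close to exp(SP), as for Poisson regression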
For a 1D regression model, there are several useful plots using the ESP. A
GAM is a 1D regression model with ESP = EAP . It is well known that the
residual plot of ESP or EAP versus the residuals (on the vertical axis) is
useful for checking the model. Similarly, the response plot of ESP or EAP
versus the response Y is useful. Assume that the ESP or EAP takes on many
values. For a GAM, substitute EAP for ESP for the plots in Definitions 10.9,
10.10, 10.11, 10.13, 10.14, and 10.16.
The response plot for the beta-binomial GAM is similar to that for the
binomial GAM. The plots for the negative binomial GAM are similar to those
of the Poisson regression GAM, including the plots in Definition 10.16. See
Examples 10.4, 10.5, and 10.6.
Variable selection is the search for a subset of variables that can be deleted
without important loss of information. Olive and Hawkins (2005) make an
EE plot of ESP (I) versus ESP where ESP (I) is for a submodel I and ESP
is for the full model. This plot can also be used to complement the hypothesis
test that the reduced model I (which is selected before gathering data) can
be used instead of the full model. The obvious extension to GAMs is to make
the EE plot of EAP (I) versus EAP . If the fitted full model and submodel
I are good, then the plotted points should follow the identity line with high
correlation (use correlation ≥ 0.95 as a benchmark).
To justify this claim, assume that there exists a subset S of predictor
variables such that if xS is in the model, then none of the other predictors
is needed in the model. Write E for these (‘extraneous’) variables not in S,
partitioning x = (xTS , xTE )T . Then
AP = α + Σ_{j=2}^p Sj(xj) = α + Σ_{j∈S} Sj(xj) + Σ_{k∈E} Sk(xk) = α + Σ_{j∈S} Sj(xj).   (10.10)
The extraneous terms that can be eliminated given that the subset S is in
the model have Sk (xk ) = 0 for k ∈ E.
Now suppose that I is a candidate subset of predictors and that S ⊆ I.
Then
AP = α + Σ_{j=2}^p Sj(xj) = α + Σ_{j∈S} Sj(xj) = α + Σ_{k∈I} Sk(xk) = AP(I),
(if I includes predictors from E, these will have Sk (xk ) = 0). For any subset
I that includes all relevant predictors, the correlation corr(AP, AP(I)) = 1.
Hence if the full model and submodel are reasonable and if EAP and EAP(I)
are good estimators of AP and AP(I), then the plotted points in the EE plot
of EAP(I) versus EAP will follow the identity line with high correlation.
10.7.4 Examples
For the binary logistic GAM, the EAP will not be a consistent estimator
of the AP if the estimated probability ρ̂(AP ) = ρ(EAP ) is exactly zero or
one. The following example will show that GAM output and plots can still
be used for exploratory data analysis. The example also illustrates that EE
plots are useful for detecting cases with high leverage and clusters of cases.
Fig. 10.10 Response plot of EAP versus Y for the ICU data binary GAM.
Fig. 10.11 Plot of EAP versus ESP for the ICU data.
Example 10.15. For the ICU data of Example 10.13, a binary general-
ized additive model was fit with unspecified functions for AGE, SYS, and
HRA, and linear functions for the remaining 16 variables. Output suggested
that functions for SYS and HRA are linear but the function for AGE may
be slightly curved. Several cases had ρ̂(AP ) equal to zero or one, but the
response plot in Figure 10.10 suggests that the full model is useful for pre-
dicting survival. Note that the ten slice step function closely tracks the logistic
curve. To visualize the model with the response plot, use Y |x ≈ binomial[1,
ρ(EAP ) = eEAP /(1+eEAP )]. When x is such that EAP < −5, ρ(EAP ) ≈ 0.
If EAP > 5, ρ(EAP ) ≈ 1, and if EAP = 0, then ρ(EAP ) = 0.5. The logistic
curve gives ρ(EAP ) ≈ P (Y = 1|x) = ρ(AP ). The different estimated bi-
nomial distributions have ρ̂(AP ) = ρ(EAP ) that increases according to the
logistic curve as EAP increases. If the step function tracks the logistic curve
closely, the binary GAM gives useful smoothed estimates of ρ(AP ) provided
that the number of 0s and 1s are both much larger than the model degrees
of freedom so that the GAM is not overfitting.
A binary logistic regression was also fit, and Figure 10.11 shows the plot of
EAP versus ESP. The plot shows that the near zero and near one probabilities
are handled differently by the GAM and GLM, but the estimated success
probabilities for the two models are similar: ρ̂(ESP) ≈ ρ̂(EAP). Hence we used the GLM and performed variable selection as in Example 10.13. Some R code is below.
#http://parker.ad.siu.edu/Olive/ICU.lsp
#delete header of ICU.lsp and delete last parentheses
#at the end of the file. Save the file on F drive as
#icu.txt.
library(mgcv)
outgam <- gam(STA ~ s(AGE)+SEX+RACE+SER+CAN+CRN+INF+
CPR+s(SYS)+s(HRA)+PRE+TYP+FRA+PO2+PH+PCO+Bic+CRE+LOC,
family=binomial,data=icu2)
EAP <- predict.gam(outgam)
plot(EAP,ESP)
abline(0,1)
#Figure 10.11
Y <- icu2[,1]
lrplot3(ESP=EAP,Y,slices=18)
#Figure 10.10
lrplot3(ESP,Y,slices=18)
#Figure 10.7
Example 10.16. For binary data, Kay and Little (1987) suggest exam-
ining the two distributions x|Y = 0 and x|Y = 1. Use predictor x if the two
distributions are roughly symmetric with similar spread. Use x and x2 if the
distributions are roughly symmetric with different spread. Use x and log(x)
if one or both of the distributions are skewed. The log rule says add log(x)
to the model if min(x) > 0 and max(x)/ min(x) > 10. The Gladstone (1905)
data is useful for illustrating these suggestions. The response was gender with
Y = 1 for male and Y = 0 for female. The predictors were age, height, and
the head measurements circumference, length, and size. When the GAM was
fit without log(age) or log(size), the Ŝj for age, height, and circumference
were nonlinear. The log rule suggested adding log(age), and log(size) was
added because size is skewed. The GAM for this model had plots of Ŝj (xj )
that were fairly linear. The response plot is not shown but was similar to
Figure 10.10, and the step function tracked the logistic curve closely. When
EAP = 0, the estimated probability of Y = 1 (male) is 0.5. When EAP > 5
the estimated probability is near 1, but near 0 for EAP < −5. The response
plot for the binomial GLM, not shown, is similar.
Fig. 10.12 EE plot for cubic GLM for Heart Attack Data
Example 10.17. Wood (2017, pp. 125-130) describes heart attack data
where the response Y is the number of heart attacks for mi patients suspected
of suffering a heart attack. The enzyme ck (creatine kinase) was measured for
the patients and it was determined whether the patient had a heart attack
or not. A binomial GLM with predictors x1 = ck, x2 = [ck]2 , and x3 = [ck]3
was fit and had AIC = 33.66. The binomial GAM with predictor x1 was fit in
R, and Figure 10.12 shows that the EE plot for the GLM was not too good.
The log rule suggests using ck and log(ck), but ck was not significant. Hence
a GLM with the single predictor log(ck) was fit. Figure 10.13 shows the EE
plot, and Figure 10.14 shows the response plot where the Zi = Yi /mi track
the logistic curve closely. There was no evidence of overdispersion and the
model had AIC = 33.45. The GAM using log(ck) had a linear Ŝ, and the
correlation of the plotted points in the EE plot, not shown, was one.
Fig. 10.13 EE plot of EAP versus ESPl for the heart attack data.
Fig. 10.14 Response plot of ESPl versus Z for the heart attack data.
10.8 Overdispersion
The OD plot has been used by Winkelmann (2000, p. 110) for the Poisson
regression model where V̂M (Y |SP ) = ÊM (Y |SP ) = exp(ESP ). For binomial
and Poisson regression, the OD plot can be used to complement tests and
diagnostics for overdispersion such as those given in Cameron and Trivedi
(2013), Collett (1999, ch. 6), and Winkelmann (2000). See discussion below
Definitions 10.11 and 10.14 for how to interpret the OD plot with the identity
line, OLS line, and slope 4 line added as visual aids, and for discussion of the
numerical summaries G2 and X 2 for GLMs.
Definition 10.1, with SP = AP, gives EM (Y |AP ) = m(AP ) and VM (Y |AP )
= v(AP ) for several models. Often m̂(AP ) = m(EAP ) and v̂(AP ) =
v(EAP ), but additional parameters sometimes need to be estimated. Hence
v̂(AP ) = mi ρ(EAPi )(1−ρ(EAPi ))[1+(mi −1)θ̂/(1+θ̂)], v̂(AP ) = exp(EAP )+
τ̂ exp(2 EAP ), and v̂(AP ) = [m(EAP )]2 /ν̂ for the beta-binomial, nega-
tive binomial, and gamma GAMs, respectively. The beta-binomial regres-
Example 10.18. The species data is from Cook and Weisberg (1999,
pp. 285-286) and Johnson and Raven (1973). The response variable is the
total number of species recorded on each of 29 islands in the Galápagos
Archipelago. Predictors include area of island, areanear = the area of the
closest island, the distance to the closest island, the elevation, and endem =
the number of endemic species (those that were not introduced from else-
where). A scatterplot matrix of the predictors suggested that log transfor-
mations should be taken. Poisson regression suggested that log(endem) and
log(areanear) were the important predictors, but the deviance and Pear-
son X 2 statistics suggested overdispersion was present since both statistics
were near 71.4 with 26 degrees of freedom. The residual plot also suggested
increasing variance with increasing fitted value. A negative binomial regres-
sion suggested that only log(endem) was needed in the model, and had a
deviance of 26.12 on 27 degrees of freedom. The residual plot for this model
was roughly ellipsoidal. The negative binomial GAM with log(endem) had
an Ŝ that was linear and the plotted points in the EE plot had correlation
near 1.
The response plot with the exponential and lowess curves added as visual
aids is shown in Figure 10.15. The interpretation is that Y |x ≈ negative
binomial with E(Y |x) ≈ exp(EAP ). Hence if EAP = 0, E(Y |x) ≈ 1. The
negative binomial and Poisson GAM have the same conditional mean func-
tion. If the plot was for a Poisson GAM, the interpretation would be that
Y |x ≈ Poisson(exp(EAP )). Hence if EAP = 0, Y |x ≈ Poisson(1).
Figure 10.16 shows the OD plot for the negative binomial GAM with the
identity line and slope 4 line through the origin added as visual aids. The
plotted points fall within the “slope 4 wedge,” suggesting that the negative
binomial regression model has successfully dealt with overdispersion. Here
Ê(Y |AP ) = exp(EAP ) and V̂ (Y |AP ) = exp(EAP ) + τ̂ exp(2EAP ) where
τ̂ = 1/37.
Fig. 10.15 Response plot of EAP versus Y for the species data.
Fig. 10.16 OD plot for the species data negative binomial GAM (horizontal axis ModVar).

10.9 Inference After Variable Selection
Inference after variable selection for GLMs is very similar to inference after
variable selection for multiple linear regression. AIC, BIC, EBIC, lasso, and
elastic net can be used for variable selection. Read Section 4.2 for the large
sample theory for β̂ Imin ,0 . We assume that n >> p. Theorem 4.4, the Vari-
able Selection CLT, still applies, as does Remark 4.4. Hence if lasso or elastic net is consistent, then relaxed lasso or relaxed elastic net is √n consistent.
The geometric argument of Theorem 4.5 also applies. We follow Rathnayake
and Olive (2019) closely. Read Sections 4.2, 4.5, and 4.6 before reading this
section. We will describe the parametric bootstrap, and then consider boot-
strapping variable selection.
as n → ∞.
Now suppose S ⊆ I. Without loss of generality, let β = (βTI, βTO)T and β̂ = (β̂(I)T, β̂(O)T)T. Then (Y, XI) follows the parametric regression model with parameters (βI, γ). Hence √n(β̂I − βI) →D NaI(0, V(βI)). Now (Y∗, XI)
zi = g(µi) + g′(µi)(Yi − µi) = ηi + (∂ηi/∂µi)(Yi − µi),  Z = (zi),
wi = (∂µi/∂ηi)² (1/Vi),  W = diag(wi),  Ŵ = W evaluated at β̂, and Ẑ = Z evaluated at β̂. Then
β̂ = (XT Ŵ X)−1 XT Ŵ Ẑ  and  β̂I = (XTI ŴI XI)−1 XTI ŴI ẐI,
while
β̂∗I = (XTI Ŵ∗I XI)−1 XTI Ŵ∗I Ẑ∗I   (10.12)
where β̂∗I is fit as if (Y∗, XI) follows the GLM with parameters (β̂(I), γ̂). If S ⊆ I, then this approximation is correct asymptotically since √n β̂(O) = OP(1). Hence η∗iI = xTiI β̂(I) = g(µ∗iI), and V∗iI = VM(Y∗i|xiI) where VM is the model variance from the GLM with parameters (β̂(I), γ̂). Also, the estimated asymptotic covariance matrices are
Ĉov(β̂) = (XT Ŵ X)−1  and  Ĉov(β̂I) = (XTI ŴI XI)−1.
See, for example, Agresti (2002, pp. 138, 147), Hillis and Davis (1994),
and McCullagh and Nelder (1989). From Sen and Singer (1994, p. 307),
P
n(X TI Ŵ I X I )−1 → I −1 (β I ) as n → ∞ if S ⊆ I.
Let β̃ = (X^T W X)^{-1} X^T W Z. Then E(β̃) = β since E(Z) = Xβ, and Cov(Y) = Cov(Y|X) = diag(V_i). Since

∂µ_i/∂η_i = 1/g′(µ_i) and ∂η_i/∂µ_i = g′(µ_i),

the Poisson regression model with the log link g(µ) = log(µ) has

∂µ*_iI/∂η*_iI = exp(η*_iI) = µ*_iI = V*_iI,

w*_iI = exp(x_iI^T β̂(I)), and ŵ*_iI = exp(x_iI^T β̂*_I). Similarly, η*_iI = log(µ*_iI),

z*_iI = η*_iI + (∂η*_iI/∂µ*_iI)(Y_i* − µ*_iI) = η*_iI + (1/µ*_iI)(Y_i* − µ*_iI), and

ẑ*_iI = x_iI^T β̂*_I + (1/exp(x_iI^T β̂*_I))(Y_i* − exp(x_iI^T β̂*_I)).

Note that for (Y, X_I), the formulas are the same with the asterisks removed and µ_iI = exp(x_iI^T β_I).
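A minimal sketch of this parametric bootstrap for a Poisson regression submodel follows; the data frame dat (holding the response y and the predictors of submodel I) and B are assumptions, and this is not the book's PRboot or vsPRboot function.

fitI <- glm(y ~ ., family = poisson, data = dat)  #fit (Y, X_I) with the log link
muI <- fitted(fitI)                   #muhat_iI = exp(x_iI^T betahat(I))
B <- 1000
betastar <- matrix(NA, B, length(coef(fitI)))
for(j in 1:B){
  datj <- dat
  datj$y <- rpois(nrow(dat), muI)     #Y* generated from the fitted model
  betastar[j,] <- coef(glm(y ~ ., family = poisson, data = datj))  #betahat*_I
}
#cov(betastar) roughly estimates (X_I^T What_I X_I)^{-1} if S is contained in I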
The nonparametric bootstrap samples cases (Y_i, x_i) with replacement to form (Y*_j, X*_j), and regresses Y*_j on X*_j to get β̂*_j for j = 1, ..., B. The nonparametric bootstrap can be useful even if heteroscedasticity or overdispersion is present, if the cases are an iid sample from some population, a very strong assumption.
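A sketch of the nonparametric (case resampling) bootstrap, under the same assumed object names, is below.

n <- nrow(dat)
betastar <- matrix(NA, B, length(coef(fitI)))
for(j in 1:B){
  idx <- sample(1:n, n, replace = TRUE)   #resample the cases with replacement
  betastar[j,] <- coef(glm(y ~ ., family = poisson, data = dat[idx,]))
}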
where the B_jn follow a multinomial distribution and B_jn/B →P ρ_jn as B → ∞. Denote T*_1j, ..., T*_Bjn,j as the jth bootstrap component of the bootstrap sample with sample mean T̄*_j and sample covariance matrix S*_T,j. Then
T̄* = (1/B) Σ_{i=1}^B T_i* = Σ_j (B_jn/B)(1/B_jn) Σ_{i=1}^{B_jn} T*_ij = Σ_j ρ̂_jn T̄*_j.
Similarly, we can define the jth component of the iid sample T_1, ..., T_B to have sample mean T̄_j and sample covariance matrix S_T,j.

Suppose the jth component of an iid sample T_1, ..., T_B and the jth component of the bootstrap sample T*_1, ..., T*_B have the same variability asymptotically. Since E(T_jn) ≈ θ, each component of the iid sample is approximately centered at θ. The bootstrap components are centered at E(T*_jn), and often E(T*_jn) = T_jn. Geometrically, separating the component clouds so that they are no longer centered at one value makes the overall data cloud larger. Thus the variability of T_n* is larger than that of T_n for a mixture distribution, asymptotically. Hence the prediction region applied to the bootstrap sample is slightly larger than the prediction region applied to the iid sample, asymptotically (we want n ≥ 20p). Hence the cutoff D̂²_1,1−δ = D²_(U_B) gives coverage close to or higher than the nominal coverage for confidence regions (4.32) and (4.34), using the geometric argument. The deviation T_i* − T_n tends to be larger in magnitude than the deviation T_i* − T̄*. Hence the cutoff D̂²_2,1−δ = D²_(U_B,T) tends to be larger than D²_(U_B), and region (4.33) tends to have higher coverage than region (4.34) for a mixture distribution.
The full model should be checked with the response plot before do-
ing variable selection inference. Assume p is fixed and n ≥ 20p. Assume
P (S ⊆ Imin ) → 1 as n → ∞, and that S ⊆ Ij . For multiple linear re-
gression with the residual bootstrap that uses residuals from the full OLS
model, Chapter 4 showed that the components of the iid sample and boot-
strap sample have the same variability asymptotically. The components of the
iid sample are centered at Aβ while the components of the bootstrap sample
are centered at A β̂_Ij,0. Now consider regression models with Y ⊥⊥ x | x^T β. Assume √n A(β̂_Ij,0 − β) →D N_aj(0, Σ_j) where Σ_j = A V_j,0 A^T. For the nonparametric bootstrap, assume √n(A β̂*_Ij,0 − A β̂_Ij,0) →D N_aj(0, Σ_j). Then the components of the iid sample and bootstrap sample have the same variability asymptotically. The components of the iid sample are centered at Aβ while the components of the bootstrap sample are centered at A β̂_Ij,0. For the nonparametric bootstrap, the above results tend to hold if √n(β̂ − β) →D N_p(0, V) and if √n(β̂* − β̂) →D N_p(0, V). Assumptions for the nonparametric boot-
strap tend to be rather strong: often one assumption is that the n cases
(Yi , xTi )T are iid from some population. See Shao and Tu (1995, pp. 335-349)
for the nonparametric bootstrap for GLMs, nonlinear regression, and Cox’s
proportional hazards regression. Also see Burr (1994), Efron and Tibshirani
(1993), Freedman (1981), and Tibshirani (1997).
For the parametric bootstrap, Section 10.9.1 showed that under regularity conditions, Cov(β̂*_I) − Cov(β̂_I) → 0 as n, B → ∞ if S ⊆ I. Hence
Cov(T*_jn) − Cov(T_jn) → 0 as n, B → ∞ if S ⊆ I. Here T_n = A β̂_Imin,0, T_jn = A β̂_Ij,0, T_n* = A β̂*_Imin,0, and T*_jn = A β̂*_Ij,0. Then E(T_jn) ≈ Aβ = θ while the E(T*_jn) are more variable than the E(T_jn) with E(T*_jn) ≈ A β̂(I_j, 0), roughly, where β̂(I_j, 0) is formed from β̂(I_j) by adding zeros corresponding to variables not in I_j. Hence the jth component of an iid sample T_1, ..., T_B and the jth component of the bootstrap sample T*_1, ..., T*_B have the same variability asymptotically.
In simulations for n ≥ 20p for H0: Aβ_S = θ_0, the coverage tended to get close to 1 − δ for B ≥ max(200, 50p) so that S*_T is a good estimator of Cov(T*). In the simulations where S is not the full model, inference with backward elimination with Imin using AIC was often more precise than inference with the full model if n ≥ 20p and B ≥ 50p. It is possible that S*_T is singular if a column of the bootstrap sample is equal to 0. If the regression model has a q × 1 vector of parameters γ, we may need to replace p by p + q.

Undercoverage can occur if the bootstrap sample data cloud is less variable than the iid data cloud, e.g., if (n − p)/n is not close to one. Coverage can be higher than the nominal coverage for two reasons: i) the bootstrap data cloud is more variable than the iid data cloud of T_1, ..., T_B, and ii) zero padding.
To see the effect of zero padding, consider H0: Aβ = β_O = 0 where β_O = (β_i1, ..., β_ig)^T and O ⊆ E in (4.1) so that H0 is true. Suppose a nominal 95% confidence region is used and U_B is the 96th percentile. Hence the confidence region (4.32) or (4.33) covers at least 96% of the bootstrap sample. If β̂*_O,j = 0 for more than 4% of the β̂*_O,1, ..., β̂*_O,B, then 0 is in the confidence region and the bootstrap test fails to reject H0. If this occurs for each run in the simulation, then the observed coverage will be 100%.

Now suppose β̂*_O,j = 0 for j = 1, ..., B. Then S*_T is singular, but the singleton set {0} is the large sample 100(1 − δ)% confidence region (4.32), (4.33), or (4.34) for β_O and δ ∈ (0, 1), and the pvalue for H0: β_O = 0 is one. (This result holds since {0} contains 100% of the β̂*_O,j in the bootstrap sample.) For large sample theory tests, the pvalue estimates the population pvalue. Let I denote the other predictors in the model so β = (β_I^T, β_O^T)^T. For the Imin model from variable selection, there may be strong evidence that x_O is not needed in the model given x_I is in the model if the “100%” confidence region is {0}, n ≥ 20p, and B ≥ 50p. (Since the pvalue is one, this technique may be useful for data snooping: applying MLE theory to submodel I may have negligible selection bias.)
Remark 10.3. As in Chapter 4, another way to look at the bootstrap con-
fidence region for variable selection estimators is to consider the estimator
T2,n that chooses Ij with probability equal to the observed bootstrap propor-
tion ρ̂jn . The bootstrap sample T1∗ , ..., TB∗ tends to be slightly more variable
than an iid sample T2,1 , ..., T2,B, and the geometric argument suggests that
the large sample coverage of the nominal 100(1 − δ)% confidence region will
be at least as large as the nominal coverage 100(1 − δ)%.
Pelawa Watagoda and Olive (2019a) have an example and simulations for
multiple linear regression using the residual bootstrap. See Chapter 4. We
will use Poisson and binomial regression.
Example 10.19. Lindenmayer et al. (1991) and Cook and Weisberg (1999,
p. 533) give a data set with 151 cases where Y is the number of possum
species found in a tract of land in Australia. The predictors are acacia=basal
area of acacia + 1, bark=bark index, habitat=habitat score, shrubs=number
of shrubs + 1, stags= number of hollow trees + 1, stumps=indicator for
presence of stumps, and a constant. Inference for the full Poisson regression
model is shown along with the shorth(c) nominal 95% confidence intervals for
βi computed using the parametric bootstrap with B = 1000. As expected, the
bootstrap intervals are close to the large sample GLM confidence intervals
≈ β̂i ± 2SE(β̂i ).
The minimum AIC model from backward elimination used a constant,
bark, habitat, and stags. The shorth(c) nominal 95% confidence intervals for
βi using the parametric bootstrap are shown. Note that most of the confidence
intervals contain 0 when closed intervals are used instead of open intervals.
The Poisson regression output is also shown, but should only be used for
inference if the model was selected before looking at the data.
large sample full model inference
Est. SE z Pr(>|z|) 95% shorth CI
int -1.0428 0.2480 -4.205 0.0000 [-1.562,-0.538]
acacia 0.0166 0.0103 1.612 0.1070 [-0.004, 0.035]
bark 0.0361 0.0140 2.579 0.0099 [ 0.007, 0.065]
habitat 0.0762 0.0375 2.032 0.0422 [-0.003, 0.144]
shrubs 0.0145 0.0205 0.707 0.4798 [-0.028, 0.056]
stags 0.0325 0.0103 3.161 0.0016 [ 0.013, 0.054]
stumps -0.3907 0.2866 -1.364 0.1727 [-1.010, 0.171]
output and shorth intervals for the min AIC submodel
Est. SE z Pr(>|z|) 95% shorth CI
int -0.8994 0.2135 -4.212 0.0000 [-1.438,-0.428]
acacia 0 [ 0.000, 0.037]
bark 0.0336 0.0121 2.773 0.0056 [ 0.000, 0.060]
habitat 0.1069 0.0297 3.603 0.0003 [ 0.000, 0.156]
shrubs 0 [ 0.000, 0.060]
stags 0.0302 0.0094 3.210 0.0013 [ 0.000, 0.054]
stumps 0 [-0.970, 0.000]
We tested H0 : β2 = β5 = β7 = 0 with the Imin model selected by
backward elimination. (Of course this test would be easy to do with the
full model using GLM theory.) Then H0 : Aβ = (β2 , β5 , β7 )T = 0. Using
the prediction region method with the full model had [0, D(UB ) ] = [0, 2.836]
with D_0 = 2.135. Note that √χ²_3,0.95 = 2.795. So fail to reject H0. Using
the prediction region method with the Imin backward elimination model had
[0, D(UB ) ] = [0, 2.804] while D0 = 1.269. So fail to reject H0 . The ratio of
the volumes of the bootstrap confidence regions for this test was 0.322. (Use
(3.35) with S ∗T and D from backward elimination for the numerator, and
from the full model for the denominator.) Hence the backward elimination
bootstrap test was more precise than the full model bootstrap test.
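An illustrative sketch of the prediction region method used in such tests is below. Here Tstar is an assumed B × g matrix of bootstrap values A β̂*, and the cutoff is a plain order statistic; regions (4.32)–(4.34) use a slightly corrected cutoff, so this is a sketch rather than the code behind the example.

prtest <- function(Tstar, theta0 = rep(0, ncol(Tstar)), delta = 0.05){
  Tbar <- colMeans(Tstar)            #center of the bootstrap sample
  S <- cov(Tstar)                    #S*_T
  d2 <- mahalanobis(Tstar, Tbar, S)  #squared distances D_i^2
  UB <- ceiling((1 - delta) * nrow(Tstar))
  cutoff <- sqrt(sort(d2)[UB])       #roughly D_(U_B)
  D0 <- sqrt(mahalanobis(theta0, Tbar, S))  #distance of theta0 from the center
  c(D0 = D0, cutoff = cutoff, reject = as.numeric(D0 > cutoff))
}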
Example 10.20. For binary logistic regression, the MLE tends to converge if max(|x_i^T β̂|) ≤ 7 and if the Y values of 0 and 1 are not nearly perfectly classified by the rule Ŷ = 1 if ρ̂(x_i) > 0.5 and Ŷ = 0, otherwise. If there is perfect classification, the MLE does not exist. Let ρ̂(x) = P̂(Y = 1|x) under the binary logistic regression. If |x_i^T β̂| ≥ 10, some of the ρ̂(x_i) tend to be estimated to be exactly equal to 0 or 1, which causes problems for the MLE. The Flury and Riedwyl (1988, pp. 5-6) banknote data consists of 100 counterfeit and 100 genuine Swiss banknotes. The response variable is
an indicator for whether the banknote is counterfeit. The six predictors are
measurements on the banknote: bottom, diagonal, left, length, right, and top.
When the logistic regression model is fit with these predictors and a constant,
there is almost perfect classification and backward elimination had problems.
We deleted diagonal, which is likely an important predictor, so backward
elimination would run. For this full model, classification is very good, but
the xTi β̂ run from −20 to 20. In a plot of xTi β̂ versus Y on the vertical axis
(not shown), the logistic regression mean function is tracked closely by the
lowess scatterplot smoother. The full model and backward elimination output
is below. Inference using the logistic regression normal approximation appears
to greatly underestimate the variability of β̂ compared to the parametric full
model bootstrap variability. We tested H0 : β2 = β3 = β4 = 0 with the Imin
model selected by backward elimination. Using the prediction region method
with the full model had [0, D_(U_B)] = [0, 1.763] with D_0 = 0.2046. Note that √χ²_3,0.95 = 2.795. So fail to reject H0. Using the prediction region method
with the Imin backward elimination model had [0, D(UB ) ] = [0, 1.511] while
D0 = 0.2297. So fail to reject H0 . The ratio of the volumes of the bootstrap
confidence regions for this test was 16.2747. Hence the full model bootstrap
inference was much more precise. Backward elimination produced many zeros,
but also produced many estimates that were very large in magnitude.
large sample full model inference
Est. SE z Pr(>|z|) 95% shorth CI
int -475.581 404.913 -1.175 0.240 [-83274.99,1939.72]
length 0.375 1.418 0.265 0.791 [ -98.902,137.589]
left -1.531 4.080 -0.375 0.708 [ -364.814,611.688]
right 3.628 3.285 1.104 0.270 [ -261.034,465.675]
bottom 5.239 1.872 2.798 0.005 [ 3.159,567.427]
top 6.996 2.181 3.207 0.001 [ 4.137,666.010]
and 0.96 would suggest coverage is close to the nominal value. The parametric
bootstrap was used with AIC.
In the tables, there are two rows for each model giving the observed confi-
dence interval coverages and average lengths of the confidence intervals. The
term “reg” is for the full model regression, and the term “vs” is for backward
elimination. The last six columns give results for the tests. The terms pr,
hyb, and br are for the prediction region method (4.32), hybrid region (4.34),
and Bickel and Ren region (4.33). The 0 indicates the test was H0 : β E = 0,
while the 1 indicates that the test was H0: β_S = (β_1, 1, ..., 1)^T. The length and coverage = P(fail to reject H0) are given for the interval [0, D_(U_B)] or [0, D_(U_B,T)] where D_(U_B) or D_(U_B,T) is the cutoff for the confidence region. The cutoff will often be near √χ²_g,0.95 if the statistic T is asymptotically normal. Note that √χ²_2,0.95 = 2.448 is close to 2.45 for the full model regression bootstrap tests for β_S if k = 1.
Volume ratios of the three confidence regions can be compared using (4.35),
but there is not enough information in the tables to compare the volume of
the confidence region for the full model regression versus that for the variable
selection regression since the two methods have different determinants |S*_T|.
The inference for backward elimination was often as precise or more precise
than the inference for the full model. The coverages tended to be near 0.95
for the parametric bootstrap on the full model. Variable selection coverage
tended to be near 0.95 unless the β̂i could equal 0. An exception was binary
logistic regression with m = 1 where variable selection and the full model
often had higher coverage than the nominal 0.95 for the hypothesis tests,
especially for n = 25p. Compare Tables 10.2 and 10.3. For binary regression,
the bootstrap confidence regions using smaller a and larger n resulted in
coverages closer to 0.95 for the full model, and convergence problems caused
the programs to fail for a > 4. The Bickel and Ren (4.33) average cutoffs
were at least as high as those of the hybrid region (4.34).
If βi was a component of βE , then the backward elimination confidence
intervals had higher coverage but were shorter than those of the full model
due to zero padding. The zeros in β̂E tend to result in higher than nominal
coverage for the variable selection estimator, but can greatly decrease the
volume of the confidence region compared to that of the full model.
For the simulated data, when ψ = 0, the asymptotic covariance matrix
I −1 (β) is diagonal. Hence β̂ S has the same multivariate normal limiting
distribution for Imin and the full model by Remark 4.4. For Tables 10.2-
10.5, β S = (β1 , β2 )T , and βp−1 and βp are components of β E . For Table
10.6, β S = (β1 , ..., β9)T . Hence β1 , β2 , and βp−1 are components of β S , while
βE = β10 . For the n in the tables and ψ = 0, the coverages and “lengths”
did tend to be close for the βi that are components of β S , and for pr1, hyb1,
and br1.
10.10 Prediction Intervals
We use two prediction intervals from Olive et al. (2019). The first predic-
tion interval for Yf applies the shorth prediction interval of Section 4.3 to
the parametric bootstrap sample Y1∗ , ..., YB∗ where the Yi∗ are iid from the
distribution D(ĥ(xf ), γ̂). If the regression method produces a consistent es-
timator (ĥ(x), γ̂) of (h(x), γ), then this new prediction interval is a large
sample 100(1 − δ)% PI that is a consistent estimator of the shortest popula-
tion interval [L, U ] that contains at least 1 − δ of the mass as B, n → ∞. The
new large sample 100(1 − δ)% PI using Y*_1, ..., Y*_B uses the shorth(c) PI with

c = min(B, ⌈B[1 − δ + 1.12√(δ/B)]⌉). (10.13)
Olive (2007, 2018) and Pelawa Watagoda and Olive (2019b) used similar
correction factors since the maximum simulated undercoverage was about
0.05 when n = 20d. If a q × 1 vector of parameters γ is also estimated, we
may need to replace d by dq = d + q.
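A minimal sketch of the shorth(c) PI applied to a bootstrap sample is below; ystar is an assumed vector holding Y*_1, ..., Y*_B, and this is not the book's shorth function from linmodpack.

shorthPI <- function(ystar, delta = 0.05){
  B <- length(ystar)
  cc <- min(B, ceiling(B * (1 - delta + 1.12 * sqrt(delta/B))))  #c from (10.13)
  ys <- sort(ystar)
  wid <- ys[cc:B] - ys[1:(B - cc + 1)]  #widths of all intervals holding cc points
  j <- which.min(wid)                   #shortest such interval
  c(ys[j], ys[j + cc - 1])
}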
If β̂ I is a×1, form the p×1 vector β̂ I,0 from β̂ I by adding 0s corresponding
to the omitted variables. For example, if p = 4 and β̂ Imin = (β̂1 , β̂3 )T is the
estimator that minimized the variable selection criterion, then β̂ Imin ,0 =
(β̂1 , 0, β̂3 , 0)T .
Hong et al. (2018) explain why classical PIs after AIC variable selection
may not work. Fix p and let Imin correspond to the predictors used after
variable selection, including AIC, BIC, and relaxed lasso. Suppose P (S ⊆
where V_j,0 adds columns and rows of zeros corresponding to the x_i not in I_j. Then β̂_Imin,0 is a √n consistent estimator of β under model (4.1) if the variable selection criterion is used with forward selection, backward elimination, or all subsets. Hence (10.13) and (10.14) are large sample PIs. Rathnayake and Olive (2019) gave the limiting distribution of √n(β̂_Imin,0 − β), generalizing the Pelawa Watagoda and Olive (2019a) result for multiple linear regression. Regularity conditions for (10.13) and (10.14) to be large sample PIs when p > n are much stronger.
Prediction intervals (10.13) and (10.14) often have higher than the nominal
coverage if n is large and Yf can only take on a few values. Consider binary
regression where Yf ∈ {0, 1} and the PIs (10.13) and (10.14) are [0,1] with
100% coverage, [0,0], or [1,1]. If [0,0] or [1,1] is the PI, coverage tends to be
higher than nominal coverage unless P (Yf = 1|xf ) is near δ or 1 − δ, e.g., if
P (Yf = 1|xf ) = 0.01, then [0,0] has coverage near 99% even if 1 − δ < 0.99.
Example 10.21. For the Ceriodaphnia data of Example 10.4, Figure 10.17
shows the response plot of ESP versus Y for this data. In this plot, the lowess
curve is represented as a jagged curve to distinguish it from the estimated
Poisson regression mean function (the exponential curve). The horizontal line
corresponds to the sample mean Y . The circles correspond to the Yi and the
×’s to the PIs (10.13) with d = p = 3. The n large sample 95% PIs contained
97% of the Yi . There was no evidence of overdispersion: see Example 10.4.
There were 5 replications for each of the 14 strain–species combinations,
which helps show the bootstrap PI variability when B = 1000. This example
illustrates a useful goodness of fit diagnostic: if the model D is a useful
approximation for the data and n is large enough, we expect the coverage on
the training data to be close to or higher than the nominal coverage 1 − δ.
For example, there may be undercoverage if a Poisson regression model is
used when a negative binomial regression model is needed.
Example 10.22. For the banknote data of Example 10.20, after variable
selection, we decided to use a constant, right, and bottom as predictors. The
response plot for this submodel is shown in the left plot of Figure 10.18 with
Z = Zi = Yi /mi = Yi and the large sample 95% PIs for Zi = Yi . The circles
correspond to the Yi and the ×’s to the PIs (10.13) with d = 3, and 199 of the
200 PIs contain Yi . The PI [0,0] that did not contain Yi corresponds to the
[Figure 10.17: response plot of ESP versus Y for the Ceriodaphnia data with the 95% PIs (10.13).]
circle in the upper left corner. The PIs were [0,0], [0,1], or [1,1] since the data
is binary. The mean function is the smooth curve and the step function gives
the sample proportion of ones in the interval. The step function approximates
the smooth curve closely, hence the binary logistic regression model seems
reasonable. The right plot of Figure 10.18 shows the GAM using right and
bottom with d = 3. The coverage was 100% and the GAM had many [1,1]
intervals.
Example 10.23. For the species data of Examples 10.18, we used a con-
stant and log(endem), log(area), log(distance), and log(areanear). The re-
sponse plot looks good, but the OD plot (not shown) suggests overdispersion.
When the response plot for the Poisson regression model was made, the n
large sample 95% PIs (10.13) contained 89.7% of the Yi .
[Figure 10.18: response plots of ESP versus Z for the banknote data: logistic regression submodel (left) and GAM (right), with the 95% PIs.]
The simulation used B = 1000; p = 4, 50, n, or 2n; ψ = 0, 1/√p, or 0.9; and
k = 1, 19, or p − 1. The simulated data sets are rather small since the R
estimators are rather slow. For binomial and Poisson regression, we only
computed the GAM for p = 4 with SP = AP = α + S2 (x2 ) + S2 (x3 ) + S4 (x4 )
and d = p = 4. We only computed the full model GLM if n ≥ 5p. Lasso and
relaxed lasso were computed for all cases. The regression model was computed
from the training data, and a prediction interval was made for the test case
Yf given xf . The “length” and “coverage” were the average length and the
proportion of the 5000 prediction intervals that contained Yf . Two rows per
table were used to display these quantities.
Tables 10.7 to 10.9 show some simulation results for Poisson regression.
Lasso minimized 10-fold cross validation and relaxed lasso was applied to the
selected lasso model. The full GLM, full GAM and backward elimination (BE
in the tables) used PI (10.13) while lasso, relaxed lasso (RL in the tables),
and forward selection using the Olive and Hawkins (2005) method (OHFS
in the tables) used PI (10.14). For n ≥ 10p, coverages tended to be near
or higher than the nominal value of 0.95, except for lasso and the Olive and
Hawkins (2005) method in Tables 10.8 and 10.9. In Table 10.7, coverages were
high because the Poisson counts were small and the Poisson distribution is
discrete. In Table 10.8, the Poisson counts were not small, so the discreteness
of the distribution did not affect the coverage much. For Table 10.9, p = 50,
and PI (10.13) has slight undercoverage for the full GLM since n = 10p.

Table 10.7 Simulated Large Sample 95% PI Coverages and Lengths for Poisson Regression, p = 4, β1 = 1 = a

Table 10.9 helps illustrate the importance of the correction factor: PI (10.14) would have higher coverage and longer average length. Lasso was good at choosing subsets that contain S since relaxed lasso had good coverage. The Olive and Hawkins (2005) method is partly graphical, and graphs were not used in the simulation.
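A hedged sketch of lasso Poisson regression chosen by 10-fold cross validation, with relaxed lasso obtained by refitting the GLM on the lasso-selected predictors, is below. The predictor matrix x and count response y are assumed objects, and this is not the code used to make the tables.

library(glmnet)
cvfit <- cv.glmnet(x, y, family = "poisson", nfolds = 10)  #lasso with 10-fold CV
b <- as.vector(coef(cvfit, s = "lambda.min"))  #lasso coefficients (intercept first)
sel <- which(b[-1] != 0)                       #predictors kept by lasso
rl <- glm(y ~ x[, sel, drop = FALSE], family = poisson)  #relaxed lasso refit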
Tables 10.10 and 10.11 are for binomial regression where only PI (10.13)
was used. For large n, coverage is likely to be higher than the nominal if the
binomial probability of success can get close to 0 or 1. For binomial regression,
neither lasso nor the Olive and Hawkins (2005) method had undercoverage
in any of the simulations with n ≥ 10p.
For n ≤ p, good performance needed stronger regularity conditions, and
Table 10.12 shows some results with n = 100 and p = 200. For k = 1,
relaxed lasso performed well as did lasso except in the second to last column
of Table 10.12. With k = 19 and ψ = 0, there was undercoverage since
n < 10(k + 1). For the dense models with k = 199 and ψ = 0, there was often
severe undercoverage, lasso sometimes picked 100 predictors including the
constant, and then relaxed lasso caused the program to fail with 5000 runs.
Coverage was usually good for ψ > 0 except for the second to last column
and sometimes the last column of Table 10.12. With ψ = 0.9, each predictor
was highly correlated with the one dominant principal component.
Table 10.8 Simulated Large Sample 95% PI Coverages and Lengths for Poisson
Regression, p = 4, β1 = 5, a = 2
Table 10.9 Simulated Large Sample 95% PI Coverages and Lengths for Poisson
Regression, p = 50, β1 = 5, a = 2
Table 10.10 Simulated Large Sample 95% PI Coverages and Lengths for Binomial
Regression, p = 4, m = 40
Table 10.11 Simulated Large Sample 95% PI Coverages and Lengths for Binomial
Regression, p = 50, m = 7
Table 10.12 Simulated Large Sample 95% PI Coverages and Lengths, n = 100,
p = 200
BR m=7 BR m=40 PR,a=1 β1 = 1 PR,a=2 β1 = 5
ψ,k lasso RL lasso RL lasso RL lasso RL
0 cov 0.9912 0.9654 0.9836 0.9602 0.9816 0.9612 0.7620 0.9662
1 len 4.2774 3.8356 11.3482 11.001 7.8350 7.5660 93.7318 91.4898
0.07 cov 0.9904 0.9698 0.9796 0.9644 0.9790 0.9696 0.7652 0.9706
1 len 4.2570 3.9256 11.4018 11.1318 7.8488 7.6680 92.0774 89.7966
0.9 cov 0.9844 0.9832 0.9820 0.9820 0.9880 0.9858 0.7850 0.9628
1 len 3.8242 3.7844 10.9600 10.8716 7.6380 7.5954 98.2158 95.9954
0 cov 0.9146 0.8216 0.8532 0.7874 0.8678 0.8038 0.1610 0.6754
19 len 4.7868 3.8632 12.0152 11.3966 7.8126 7.5188 88.0896 90.6916
0.07 cov 0.9814 0.9568 0.9424 0.9208 0.9620 0.9444 0.3790 0.5832
19 len 4.1992 3.8266 11.3818 11.0382 7.9010 7.7828 92.3918 92.1424
0.9 cov 0.9858 0.9840 0.9812 0.9802 0.9838 0.9848 0.7884 0.9594
19 len 3.8156 3.7810 10.9194 10.8166 7.6900 7.6454 97.744 95.2898
0.07 cov 0.9820 0.9640 0.9604 0.9390 0.9720 0.9548 0.3076 0.4394
199 len 4.1260 3.7730 11.2488 10.9248 8.0784 7.9956 90.4494 88.0354
0.9 cov 0.9886 0.9870 0.9822 0.9804 0.9834 0.9814 0.7888 0.9586
199 len 3.8558 3.8172 10.9714 10.8778 7.6728 7.6602 97.0954 94.7604
10.11 OLS and 1D Regression

The 1D regression model is

Y = g(α + u^T η, e) (10.16)
where g is a bivariate (inverse link) function and e is a zero mean error that
is independent of x. The constant term α may be absorbed by g if desired.
An important special case is the response transformation model where
t(Y ) = xT β + e.
The response plot for 1D regression is the plot of α̂ + u^T η̂ versus Y or x^T β̂ versus Y.
Remark 10.5. For OLS, call the plot of xT β̂ versus Y the OLS view.
The fact that the OLS view is frequently a useful response plot was perhaps
first noted by Brillinger (1977, 1983) and called the 1D Estimation Result by
Cook and Weisberg (1999, p. 432).
Olive (2002, 2004b, 2008: ch.12) showed that the trimmed views esti-
mator of Chapter 7 also gives useful response plots for 1D regression. If
Y = m(xT β) + e = m(α + uT η) + e, look for a plot with a smooth mean
function and the smallest variance function. The trimmed view with 0% trim-
ming is the OLS view.
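A minimal sketch of the OLS view for a 1D regression data set, assuming a predictor matrix x (not including the constant) and response vector y, with lowess added as a visual aid:

ols <- lsfit(x, y)                                  #OLS fit of Y on u
ESP <- as.vector(ols$coef[1] + x %*% ols$coef[-1])  #alphahat + u^T etahat
plot(ESP, y, xlab = "OLS ESP", ylab = "Y")
lines(lowess(ESP, y))                               #scatterplot smoother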
Recall from Definition 2.17 and Theorem 2.20 that if x = (1, u^T)^T and β = (α, η^T)^T, then η_OLS = Σ_u^{-1} Σ_u,Y. Let q = p − 1. The following notation will be useful for studying the OLS estimator. Let the sufficient predictor z = u^T η = η^T u and let w = u − E(u). Let r = w − (Σ_u η)η^T w. The proof of the next result is outlined in Problem 10.1 using an argument due to Aldrin et al. (1993). If the 1D regression model is appropriate, then typically Cov(u, Y) ≠ 0 unless u^T η follows a symmetric distribution and m is symmetric about the median of u^T η.
Theorem 10.1. Suppose that (Y_i, u_i^T)^T are iid observations and that the positive definite q × q matrix Cov(u) = Σ_u and the q × 1 vector Cov(u, Y) = Σ_u,Y exist. Assume that Y_i = m(u_i^T η) + e_i where the zero mean constant variance iid errors e_i are independent of the predictors u_i. Then

η_OLS = Σ_u^{-1} Σ_u,Y = c_m,u η + b_m,u (10.18)

where the constant

c_m,u = E[η^T (u − E(u)) m(u^T η)] (10.19)

and the bias vector

b_m,u = Σ_u^{-1} E[m(u^T η) r]. (10.20)

Moreover, b_m,u = 0 if u is from an elliptically contoured distribution with nonsingular Σ_u, and c_m,u ≠ 0 unless Cov(u, Y) = 0. If the multiple linear regression model holds, then c_m,u = 1, and b_m,u = 0.
Olive and Hawkins (2005) and Olive (2008, ch. 12) suggested using variable
selection methods with Cp , originally meant for multiple linear regression,
for 1D regression models with SP = xT β. In particular, Theorem 4.2 is still
useful.
This section follows Chang and Olive (2010) closely. Theorem 2.20 is useful.
Some notation is needed for the following results. Many 1D regression models
have an error e with
σ 2 = Var(e) = E(e2 ). (10.21)
Let ê be the error residual for e. Let the population OLS residual be

v = Y − α_OLS − u^T η_OLS (10.22)

with

τ² = E[(Y − α_OLS − u^T η_OLS)²] = E(v²), (10.23)

and let the OLS residual be

r_i = Y_i − α̂_OLS − u_i^T η̂_OLS. (10.24)

Typically the OLS residual r is not estimating the error e and τ² ≠ σ², but the following results show that the OLS residual is of great interest for 1D regression models.
Assume that a 1D model holds, Y ⊥⊥ u | (α + u^T η), which is equivalent to Y ⊥⊥ u | u^T η. Then under regularity conditions, results i) – iii) below hold, where

C_OLS = Σ_u^{-1} E[(Y − α_OLS − u^T η_OLS)² (u − E(u))(u − E(u))^T] Σ_u^{-1}. (10.26)
and

A C_OLS A^T = τ² A Σ_u^{-1} A^T. (10.27)

Notice that C_OLS = τ² Σ_u^{-1} if v = Y − α_OLS − u^T η_OLS ⊥⊥ u or if the MLR model holds. If the MLR model holds, τ² = σ².
To create test statistics, the estimator

τ̂² = MSE = (1/(n − p)) Σ_{i=1}^n r_i² = (1/(n − p)) Σ_{i=1}^n (Y_i − α̂_OLS − u_i^T η̂_OLS)²

can also be useful. Notice that for general 1D regression models, the OLS MSE estimates τ² rather than the error variance σ².
iv) Result iii) suggests that a test statistic for H0: Aη = 0 is

W_OLS = n η̂_OLS^T A^T [A Σ̂_u^{-1} A^T]^{-1} A η̂_OLS / τ̂² →D χ²_k, (10.29)
Before presenting the main theoretical result, some results from OLS MLR
theory are needed. Let the p×1 vector β = (α, ηT )T , the known k×p constant
matrix à = [a A] where a is a k×1 vector, and let c be a known k×1 constant
vector. Using Equation (2.6), the usual F statistic for testing H0: Ãβ = c is

F_0 = (Ãβ̂ − c)^T [Ã(X^T X)^{-1} Ã^T]^{-1} (Ãβ̂ − c)/(k τ̂²) (10.30)

where MSE = τ̂². Recall that if H0 is true, the MLR model holds and the errors e_i are iid N(0, σ²), then F_0 ∼ F_k,n−p, the F distribution with k and n − p degrees of freedom. By Theorem 2.25, if Z_n ∼ F_k,n−p, then

Z_n →D χ²_k/k (10.31)
as n → ∞.
The main theoretical result of this section is Theorem 10.2 below. This
theorem and (10.31) suggest that OLS output, originally meant for testing
with the MLR model, can also be used for testing with many 1D regression
data sets. Without loss of generality, let the 1D model Y ⊥⊥ x | (α + u^T η) be written as Y ⊥⊥ x | (α + u_R^T η_R + u_O^T η_O) where the reduced model is Y ⊥⊥ x | (α + u_R^T η_R) and u_O denotes the terms outside of the reduced model. Notice that the OLS ANOVA F test corresponds to H0: η = 0 and uses A = I_p−1. The tests for H0: β_i = 0 use A = (0, ..., 0, 1, 0, ..., 0) where the 1 is in the (i − 1)th position for i = 2, ..., p and are equivalent to the OLS t tests. The test H0: η_O = 0 uses A = [0 I_j] if η_O is a j × 1 vector, and the test statistic (10.30) can be computed with the OLS partial F test: run OLS on the full model to obtain SSE and on the reduced model to obtain SSE(R).
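A sketch of computing the OLS partial F test with R's lm and anova follows; the data frame dat and the variable names are hypothetical.

full <- lm(y ~ x1 + x2 + x3 + x4, data = dat)  #full model
red <- lm(y ~ x1 + x2, data = dat)             #reduced model for H0: eta_O = 0
anova(red, full)   #gives SSE(R), SSE, and the partial F statistic (10.30)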
In the theorem below, it is crucial that H0 : Aη = 0. Tests for H0 : Aη =
1, say, may not be valid even if the sample size n is large. Also, confidence
intervals corresponding to the t tests are for cβi , and are usually not very
useful when c is unknown.
See Chang and Olive (2010) and Olive (2008: ch. 12, 2010: ch. 15) for
simulations and more information.
10.12 Data Splitting
Data splitting is used for inference after model selection. Use a training set
to select a full model, and a validation set for inference with the selected full
model. Here p >> n is possible. See Hurvich and Tsai (1990, p. 216) and
Rinaldo et al. (2019). Typically when training and validation sets are used,
the training set is bigger than the validation set or half sets are used, often
causing large efficiency loss.
Let J be a positive integer and let bxc be the integer part of x, e.g.,
b7.7c = 7. Initially divide the data into two sets H1 with n1 = bn/(2J)c
cases and V1 with n − n1 cases. If the fitted model from H1 is not good
enough, randomly select n1 cases from V1 to add to H1 to form H2 . Let V2
have the remaining cases from V1 . Continue in this manner, possibly forming
sets (H1 , V1 ), (H2 , V2 ), ..., (HJ , VJ ) where Hi has ni = in1 cases. Stop when
Hd gives a reasonable model Id with ad predictors if d < J. Use d = J,
otherwise. Use the model Id as the full model for inference with the data in
Vd .
This procedure is simple for a fixed data set, but it would be good to
automate the procedure. For example, if n = 500000 and p = 90, using
n1 = 900 would result in a much smaller loss of efficiency than n1 = 250000.
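A minimal sketch of this sequential data splitting scheme is below; the "good enough" check goodfit is an assumed user-supplied function.

splitdata <- function(n, J, goodfit){
  n1 <- floor(n/(2*J))
  perm <- sample(1:n)               #random order of the cases
  for(i in 1:J){
    H <- perm[1:(i*n1)]             #H_i has n_i = i*n1 cases
    V <- perm[-(1:(i*n1))]          #V_i has the remaining cases
    if(i == J || goodfit(H)) break  #stop at the first reasonable model
  }
  list(H = H, V = V)
}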
10.13 Complements
This chapter used material from Chang and Olive (2010), Olive (2013b,
2017a: ch. 13), Olive et al. (2019), and Rathnayake and Olive (2019). GLMs
were introduced by Nelder and Wedderburn (1972). Useful references for
generalized additive models include Hastie and Tibshirani (1986, 1990), and
Wood (2017). Zhou (2001) is useful for simulating the Weibull regression
model. Also see McCullagh and Nelder (1989), Agresti (2013, 2015), and Cook
and Weisberg (1999, ch. 21-23). Collett (2003) and Hosmer and Lemeshow
(2000) are excellent texts on logistic regression while Cameron and Trivedi
(2013) and Winkelmann (2008) cover Poisson regression. Alternatives to Pois-
son regression mentioned in Section 10.7 are covered by Zuur et al. (2009),
Simonoff (2003), and Hilbe (2011). Cook and Zhang (2015) show that enve-
lope methods have the potential to significantly improve GLMs. Some GLM
large sample theory is given by Claeskens and Hjort (2008, p. 27), Cook and
Zhang (2015), and Sen and Singer (1993, p. 309).
An introduction to 1D regression and regression graphics is Cook and
Weisberg (1999a, ch. 18, 19, and 20), while Olive (2010) considers 1D regres-
sion. A more advanced treatment is Cook (1998). Important papers include
Brillinger (1977, 1983) and Li and Duan (1989). Li (1997) shows that OLS F
tests can be asymptotically valid for model (10.18) if u is multivariate normal and Σ_u^{-1} Σ_u,Y ≠ 0. The scatterplot smoother lowess is due to Cleveland
(1979, 1981).
Suppose n ≥ 10p. Results from Cameron and Trivedi (1998, p. 89) suggest that if a Poisson regression model is fit using OLS software for MLR, then a rough approximation is β̂_PR ≈ β̂_OLS/Ȳ. So a rough approximation is PR ESP ≈ (OLS ESP)/Ȳ. Results from Haggstrom (1983) suggest that if a binary regression model is fit using OLS software for MLR, then a rough approximation is β̂_LR ≈ β̂_OLS/MSE.
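A quick simulation sketch of the first approximation is below; the data generation is an assumption used only for illustration.

set.seed(1)
n <- 1000; x <- matrix(rnorm(n*3), n, 3)
y <- rpois(n, exp(as.vector(1 + x %*% c(0.2, -0.1, 0.3))))
bPR <- coef(glm(y ~ x, family = poisson))[-1]  #Poisson regression slopes
bOLS <- coef(lm(y ~ x))[-1]                    #OLS slopes from MLR software
cbind(bPR, bOLS/mean(y))                       #roughly comparable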
Haughton (1988, 1989) showed P (S ⊆ Imin ) → 1 as n → ∞ if BIC is
used. AIC has a smaller penalty than BIC, so often overfits. According to
Claeskens and Hjort (2008, p. xi), inference after variable selection has been
called “the quiet scandal of statistics.”
Plots were made in R and Splus, see R Core Team (2016). The Wood
(2017) library mgcv was used for fitting a GAM, and the Venables and Ripley
(2010) library MASS was used for the negative binomial family. The gam
library is also useful. The Lesnoff and Lancelot (2010) R package aod has
function betabin for beta binomial regression and is also useful for fitting
negative binomial regression. SAS has proc genmod, proc gam, and proc
countreg which are useful for fitting GLMs such as Poisson regression,
GAMs such as the Poisson GAM, and overdispersed count regression models.
In Section 10.9, the functions binregbootsim and pregbootsim are
useful for the full binomial regression and full Poisson regression models. The
functions vsbrbootsim and vsprbootsim were used to bootstrap back-
ward elimination for binomial and Poisson regression. The functions LRboot
and vsLRboot bootstrap the logistic regression full model and backward
elimination. The functions PRboot and vsPRboot bootstrap the Poisson
regression full model and backward elimination.
In Section 10.10, table entries for Poisson regression were made with
prpisim2 while entries for binomial regression were made with brpisim.
The functions prpiplot2 and lrpiplot were used to make Figures 10.17
and 10.18. The function prplot can be used to check the full Poisson regres-
sion model for overdispersion. The function prplot2 can be used to check
other Poisson regression models such as a GAM or lasso.
i) Resistant regression: Suppose the regression model has an m×1 response
vector y, and a p × 1 vector of predictors x. Assume that predictor trans-
formations have been performed to make x, and that w consists of k ≤ p
continuous predictor variables that are linearly related. Find the RMVN set
based on the w to obtain nu cases (y ci , xci ), and then run the regression
method on the cleaned data. Often the theory of the method applies to the
cleaned data set since y was not used to pick the subset of the data. Effi-
ciency can be much lower since nu cases are used where n/2 ≤ nu ≤ n, and
the trimmed cases tend to be the “farthest” from the center of w.
The method will have the most outlier resistance if k = p (or k = p − 1 if there is a trivial predictor X1 ≡ 1). If m = 1, make the response plot of Ŷc versus Yc with the identity line added as a visual aid, and make the residual plot of Ŷc versus rc = Yc − Ŷc.
In R, assume Y is the vector of response variables, x is the data matrix of
the predictors (often not including the trivial predictor), and w is the data
matrix of the wi . Then the following R commands can be used to get the
cleaned data set. We could use the covmb2 set B instead of the RMVN set
U computed from the w by replacing the command getu(w) by getB(w).
indx <- getu(w)$indx #often w = x; indices of the cleaned cases
Yc <- Y[indx]        #cleaned responses
Xc <- x[indx,]       #cleaned predictors
#example
indx <- getu(buxx)$indx
Yc <- buxy[indx]
Xc <- buxx[indx,]
outr <- lsfit(Xc,Yc) #OLS on the cleaned data
MLRplot(Xc,Yc) #right click Stop twice
a) Resistant additive error regression: An additive error regression model
has the form Y = h(x) + e where there is m = 1 response variable Y , and the
p × 1 vector of predictors x is assumed to be known and independent of the
additive error e. An enormous variety of regression models have this form,
including multiple linear regression, nonlinear regression, nonparametric re-
gression, partial least squares, lasso, ridge regression, etc. Find the RMVN
set (or covmb2 set) based on the w to obtain nU cases (Yci , xci ), and then
run the additive error regression method on the cleaned data.
b) Resistant Additive Error Multivariate Regression
Assume y = g(x) + ε = E(y|x) + ε where g : R^p → R^m, y = (Y_1, ..., Y_m)^T, and ε = (ε_1, ..., ε_m)^T. Many models have this form, including multivariate
linear regression, seemingly unrelated regressions, partial envelopes, partial
least squares, and the models in a) with m = 1 response variable. Clean the
data as in a) but let the cleaned data be stored in (Z c , X c ). Again, the theory
of the method tends to apply to the method applied to the cleaned data since
the response variables were not used to select the cases, but the efficiency is
often much lower. In the R code below, assume the y are stored in z.
indx <- getu(w)$indx #often w = x; indices of the cleaned cases
Zc <- z[indx,]       #cleaned multivariate responses (z is a matrix)
Xc <- x[indx,]       #cleaned predictors
#example
ht <- buxy
t <- cbind(buxx,ht);
z <- t[,c(2,5)];
x <- t[,c(1,3,4)]
indx <- getu(x)$indx
Zc <- z[indx,]
Xc <- x[indx,]
mltreg(Xc,Zc) #right click Stop four times
10.14 Problems
10.1. Consider the 1D regression model

Y = m(u^T η) + e (10.34)
where m is a possibly unknown function and the zero mean errors e are inde-
pendent of the predictors. Let z = uT η and let w = u − E(u). Let Σ u,Y =
Cov(u, Y ), and let Σ u = Cov(u) = Cov(w). Let r = w − (Σ u η)ηT w.
a) Recall that Cov(u, Y ) = E[(u − E(u))(Y − E(Y ))T ] and show that
Σ u,Y = E(wY ).
c) Using η_OLS = Σ_u^{-1} Σ_u,Y, show that η_OLS = c(u)η + b(u) where the constant

c(u) = E[η^T (u − E(u)) m(u^T η)]

and the bias vector b(u) = Σ_u^{-1} E[m(u^T η) r].
d) Show that E(wz) = Σ u η. (Hint: Use E(wz) = E[(u − E(u))uT η] =
E[(u − E(u))(uT − E(uT ) + E(uT ))η].)
11.1 R
can be used to download the R functions and data sets into R. Type ls().
Nearly 100 R functions from linmodpack.txt should appear. In R, enter the
command q(). A window asking “Save workspace image?” will appear. Click
on No to remove the functions from the computer (clicking on Yes saves the
functions in R, but the functions and data are easily obtained with the source
commands).
Citing packages
We will use R packages often in this book. The following R command is
useful for citing the Mevik et al. (2015) pls package.
citation("pls")
Other packages cited in this book include MASS and class: both from Ven-
ables and Ripley (2010), glmnet: Friedman et al. (2015), and leaps: Lumley
(2009).
This section gives tips on using R, but is no replacement for books such
as Becker et al. (1988), Crawley (2005, 2013), Fox and Weisberg (2010), or
Venables and Ripley (2010). Also see Mathsoft (1999ab) and use the website
(www.google.com) to search for useful websites. For example enter the search
words R documentation.
To put a graph in Word, hold down the Ctrl and c buttons simulta-
neously. Then select “Paste” from the Word menu, or hit Ctrl and v at the
same time.
To enter data, open a data set in Notepad or Word. You need to know
the number of rows and the number of columns. Assume that each case is
entered in a row. For example, assuming that the file cyp.lsp has been saved
on your flash drive from the webpage for this book, open cyp.lsp in Word. It
has 76 rows and 8 columns. In R , write the following command.
cyp <- matrix(scan(),nrow=76,ncol=8,byrow=T)
Then copy the data lines from Word and paste them in R. If a cursor does
not appear, hit enter. The command dim(cyp) will show if you have entered
the data correctly.
To save data or a function in R, when you exit, click on Yes when the
“Save worksheet image?” window appears. When you reenter R, type ls().
This will show you what is saved. You should rarely need to save anything
for this book. To remove unwanted items from the worksheet, e.g. x, type
rm(x),
pairs(x) makes a scatterplot matrix of the columns of x,
hist(y) makes a histogram of y,
boxplot(y) makes a boxplot of y,
stem(y) makes a stem and leaf plot of y,
scan(), source(), and sink() are useful on a Unix workstation.
To type a simple list, use y <− c(1,2,3.5).
The commands mean(y), median(y), var(y) are self explanatory.
The following commands are useful for a scatterplot created by the com-
mand plot(x,y).
lines(x,y), lines(lowess(x,y,f=.2))
identify(x,y)
abline(out$coef ), abline(0,1)
Warning: R is free but not fool proof. If you have an old version of R
and want to download a library, you may need to update your version of
R. The libraries for robust statistics may be useful for outlier detection, but
the methods have not been shown to be consistent or high breakdown. All
software has some bugs. For example, Version 1.1.1 (August 15, 2000) of R
had a random generator for the Poisson distribution that produced variates
with too small of a mean θ for θ ≥ 10. Hence simulated 95% confidence
intervals might contain θ 0% of the time. This bug seems to have been fixed
in Versions 2.4.1 and later. Also, some functions in lregpack may no longer
work in new versions of R.
11.2 Hints for Selected Problems
Chapter 1
1.1 a) Sort each column, then find the median of each column. Then
MED(W ) = (1430, 180, 120)T .
b) The sample mean of (X1 , X2 , X3 )T is found by finding the sample mean
of each column. Hence x = (1232.8571, 168.00, 112.00)T .
1.2 a) 7 + βX_i.
b) β̂ = Σ(Y_i − 7)X_i / Σ X_i².
1.3 See Section 1.3.5.
1.5 a) β̂ = Σ X_3i(Y_i − 10 − 2X_2i) / Σ X_3i². The second partial derivative = 2 Σ X_3i² > 0.
c) VAR(Y|X) = Σ_11 − Σ_12 Σ_22^{-1} Σ_21 = 16 − 10(1/25)10 = 16 − 4 = 12.
1.13 The proof is identical to that given in Example 3.2. (In addition, it
is fairly simple to show that M1 = M2 ≡ M . That is, M depends on Σ but
not on c or g.)
1.26 a) N_2( (3, 2)^T, [3 1; 1 2] ).
b) X_2 ⊥⊥ X_4 and X_3 ⊥⊥ X_4.
c) σ_12/√(σ_11 σ_33) = 1/(√2 √3) = 1/√6 = 0.4082.
Model II:
β̂_1 = Σ_{i=1}^n x_i Y_i / Σ_{j=1}^n x_j² = Σ_{i=1}^n k_i Y_i with k_i = x_i / Σ_{j=1}^n x_j².
b) Model I:
V(β̂_1) = Σ_{i=1}^n k_i² V(Y_i) = σ² Σ_{i=1}^n k_i² = σ² Σ_{i=1}^n (x_i − x̄)² / [Σ_{j=1}^n (x_j − x̄)²]² = σ² / Σ_{i=1}^n (x_i − x̄)².
Model II:
V(β̂_1) = Σ_{i=1}^n k_i² V(Y_i) = σ² Σ_{i=1}^n k_i² = σ² Σ_{i=1}^n x_i² / [Σ_{j=1}^n x_j²]² = σ² / Σ_{i=1}^n x_i².
E[(I−P )[Y −E(Y )][Y −E(Y )]T ] = (I−P )Cov(Y ) = (I−P )σ 2 I = σ 2 (I−P ).
c) Cov(r, Ŷ ) = E([r − E(r)][Ŷ − E(Ŷ )]T ) =
Chapter 2
2.1 See the proof of Theorem 2.18.
2.14 For fixed σ > 0, L(β, σ 2 ) is maximized by minimizing Q(β) ≥ 0. So
β̂Q maximizes L(β, σ 2 ) regardless of the value of σ 2 > 0. So β̂ Q is the MLE.
d² log L_p(τ)/dτ² = n/(2τ²) − 2Q/(2τ³), which at τ = τ̂ equals n/(2τ̂²) − 2nτ̂/(2τ̂³) = −n/(2τ̂²) < 0
dQ(β)/dβ = −2 Σ_{i=1}^n (Y_i − βx_i)x_i.
b) β̂ = Σ_{i=1}^n k_i Y_i where k_i = x_i / Σ_{j=1}^n x_j². Hence E(β̂) = Σ_{i=1}^n k_i E(Y_i) = Σ_{i=1}^n k_i βx_i = β Σ_{i=1}^n x_i² / Σ_{j=1}^n x_j² = β. V(β̂) = Σ_{i=1}^n k_i² V(Y_i) = σ² Σ_{i=1}^n k_i² using Y_i = Y_i|x_i has V(Y_i) = σ². Note that Σ_{i=1}^n k_i² = 1/Σ_{i=1}^n x_i².
c) E(Ŷ_i) = βx_i = E(Y_i) = E(Y_i|x_i), suppressing the conditioning. V(Ŷ_i) = V(β̂ x_i) = x_i² V(β̂) = σ² x_i² / Σ_{j=1}^n x_j² by b).
d) Under this normal model, the MLE of β is β̂ and the MLE of σ² is
σ̂² = (1/n) Σ_{i=1}^n r_i² = ((n − p)/n) MSE
with p = 1.
2.37 a) Use either proof of Theorem 2.5. Normality is not necessary.
b) i)
Source       df      SS                          MS      F
Regression   p − 1   SSR = Y^T(P − (1/n)11^T)Y   MSR     F_0 = MSR/MSE for H0:

Source    df        SS                        MS        E(MS)
Reduced   n − p_1   SSE(R) = Y^T(I − P_1)Y    MSE(R)    E(MSE(R))
Full      n − p     SSE = Y^T(I − P)Y         MSE       σ²

F_R = [(SSE(R) − SSE)/p_2] / MSE = [Y^T(P − P_1)Y/p_2] / [Y^T(I − P)Y/(n − p)]
where
E(MSE(R)) = (1/(n − p_1))[σ² tr(I − P_1) + β^T X^T(I − P_1)Xβ] = (1/(n − p_1))[σ²(n − p_1) + β^T X^T(I − P_1)Xβ].
dQ(β)/dβ = −2 Σ_{i=1}^n (y_i − βx_i)x_i.
L(β, σ) = c_n (1/σ^n) exp[−Q(β)/(2σ²)]

where Q(β) is the least squares criterion. For fixed σ > 0, maximizing L(β, σ) is equivalent to minimizing the least squares criterion Q(β). Thus β̂ from a) is the MLE of β. To find the MLE of σ², use the profile likelihood function

L_p(σ²) = L_p(τ) = c_n (1/σ^n) exp[−Q/(2σ²)] = c_n (1/τ^{n/2}) exp[−Q/(2τ)].

Then

log(L_p(τ)) = d_n − (n/2) log(τ) − Q/(2τ),

and

d log(L_p(τ))/dτ = −n/(2τ) + Q/(2τ²), set to 0.

Thus nτ = Q or τ̂ = σ̂² = Q/n = Σ_{i=1}^n r_i²/n, which is a unique solution. Now

d² log(L_p(τ))/dτ² = n/(2τ²) − 2Q/(2τ³), which at τ = τ̂ is n/(2τ̂²) − 2nτ̂/(2τ̂³) = −n/(2τ̂²) < 0.

Thus σ̂² is the MLE of σ².
2.44 Let Y_1 and Y_2 be independent random variables with means θ and 2θ respectively. Find the least squares estimate of θ and the residual sum of squares.
Solution:
Y = Xβ + e, where Y = (Y_1, Y_2)^T, X = (1, 2)^T, β = θ, and e = (e_1, e_2)^T.
Then
θ̂ = (X^T X)^{-1} X^T Y = [ (1 2)(1, 2)^T ]^{-1} (1 2)(Y_1, Y_2)^T = (Y_1 + 2Y_2)/5.
Now
Ŷ = X θ̂ = ( (Y_1 + 2Y_2)/5, (2Y_1 + 4Y_2)/5 )^T.
Thus
RSS = [Y_1 − (Y_1 + 2Y_2)/5]² + [Y_2 − (2Y_1 + 4Y_2)/5]².
2.45 a) √n A(β̂ − β) →D N_r(0, σ² A W A^T).
b) A(Z_n − µ) →D N_r(0, A A^T).
2.46 a)
L(β, σ²) = f(y_1, . . . , y_n | β, σ²) = (2π)^{−n/2} (σ²)^{−n/2} exp{ −(1/(2σ²)) Σ_{i=1}^n (y_i − βx_i)² }.
For fixed σ², maximizing L(β, σ²) is equivalent to minimizing Σ_{i=1}^n (y_i − βx_i)², which gives the least squares estimator. Taking the derivative with respect to β and setting it equal to 0, the solution is the MLE if the second derivative is positive. The solution is
β̂ = Σ_{i=1}^n x_i Y_i / Σ_{i=1}^n x_i².
E(β̂) = E[ Σ x_i Y_i / Σ x_i² ] = Σ x_i E[Y_i] / Σ x_i² = Σ x_i βx_i / Σ x_i² = β Σ x_i² / Σ x_i² = β.
Var(β̂) = Var( Σ x_i Y_i / Σ x_i² ) = (1/(Σ x_i²)²) Var( Σ x_i Y_i ) = (1/(Σ x_i²)²) Σ x_i² Var(Y_i) = σ² / Σ_{i=1}^n x_i².
Therefore, β̂ ∼ N(β, σ²/Σ_{i=1}^n x_i²).
d) For the expectation we have
E[U] = E[ Σ Y_i / Σ x_i ] = Σ E[Y_i] / Σ x_i = Σ βx_i / Σ x_i = β,
E[V] = E[ (1/n) Σ Y_i/x_i ] = (1/n) Σ E[Y_i]/x_i = (1/n) Σ βx_i/x_i = β.
Then
n / Σ_{i=1}^n x_i² ≤ (1/n) Σ_{i=1}^n 1/x_i²,
therefore Var(β̂) ≤ Var(V). Moreover, since Σ_{i=1}^n (x_i − x̄)² ≥ 0, therefore Σ_{i=1}^n x_i² ≥ n x̄², hence Var(β̂) ≤ Var(U).
Finally, since f(t) = 1/t² is convex, by Jensen's inequality
1/x̄² ≤ (1/n) Σ_{i=1}^n 1/x_i²,
thus Var(β̂) ≤ Var(U) ≤ Var(V).
2.47 For symmetry the solutions are obvious, since
(1) the transpose of a difference is the difference of the transpose, and
(2) we know H is symmetric, I is symmetric, and since the constant n−1
does not affect the transpose operation, n^{-1}J^T = n^{-1}J is a symmetric matrix.
For idempotent, we need to show that squaring each matrix returns the
original. Recall that H is idempotent, because
Now, we can write (a) (I − n−1 J)2 = I 2 − 2n−1 J + n−2 J 2 . But, since
J = 11T , we have J 2 = (11T )2 = 11T 11T = 1n1T = n11T . Thus,
Setting the derivative equal to 0 and calling the unique solution β̂ gives η Σ_{i=1}^n x_i² = Σ_{i=1}^n x_i(Y_i − a_i) or

β̂ = Σ_{i=1}^n x_i(Y_i − a_i) / Σ_{i=1}^n x_i².

Now

d²Q(η)/dη² = 2 Σ_{i=1}^n x_i² > 0.

Let τ = σ². Then

log(L_p(σ²)) = c − (n/2) log(σ²) − Q/(2σ²),

and

log(L_p(τ)) = c − (n/2) log(τ) − Q/(2τ).

Hence

d log(L_P(τ))/dτ = −n/(2τ) + Q/(2τ²), set to 0,

or −nτ + Q = 0 or nτ = Q or

τ̂ = Q/n = σ̂²,

which is a unique solution. Now

d² log(L_P(τ))/dτ² = n/(2τ²) − 2Q/(2τ³), which at τ = τ̂ is n/(2τ̂²) − 2nτ̂/(2τ̂³) = −n/(2τ̂²) < 0.
RSS(η_A) = ‖Z_A − G_A η_A‖²_2 = (Z_A − G_A η_A)^T (Z_A − G_A η_A) = ‖Z − Gη‖²_2 + λ*_2 ‖η‖²_2 + (λ*_1/√(1 + λ*_2)) ‖η_A‖_1 = RSS(η) + λ*_2 ‖η‖²_2 + λ*_1 ‖η‖_1 = Q(η).
3.12 a) SSE = Y^T(I − P)Y and SSR = Y^T(P − (1/n)11^T)Y = Y^T(P − P_1)Y where P_1 = (1/n)11^T = 1(1^T 1)^{-1}1^T is the projection matrix on C(1).
b) E(M SE) = σ 2 , so E(SSE) = (n − r)σ 2 . By a) and Theorem 2.5,
If
A = [a_11 a_12; a_21 a_22]
and d = a_11 a_22 − a_21 a_12 ≠ 0, then
A^{-1} = (1/d) [a_22 −a_12; −a_21 a_11].
Thus
B = (1/24) [5 −1; −1 5] [2 1 0; 0 1 2] = (1/24) [10 4 −2; −2 4 10].
c) Note that b^T Y is an unbiased estimator of b^T Xβ = a^T β with a^T = b^T X. If b = 1, then
a^T = 1^T X = (1 1 1) [3 6; 2 4; 1 2] = (6 12).
Hence,
ŷ ∼ N_p(Aβ, σ² P_A).
(b) e = y − ŷ = y − P_A y = (I − P_A)y. Therefore, we have (I − P_A)y ∼ N_p((I − P_A)Aβ, (I − P_A)σ²I_p(I − P_A)^T) where (I_p − P_A)Aβ = Aβ − Aβ = 0, and (I_p − P_A)σ²I_p(I_p − P_A)^T = σ²(I_p − P_A). Hence
e ∼ N_p(0, σ²(I_p − P_A)).
(c)
Cov(y, e) = Cov(y, (I − P_A)y) = Cov(y)(I_p − P_A)^T = σ²I_p(I_p − P_A) = σ²(I_p − P_A) ≠ 0.
Cov(ŷ, e) = Cov(P_A y, (I − P_A)y) = P_A Cov(y)(I_p − P_A)^T = P_A σ²I_p(I_p − P_A) = σ² P_A(I_p − P_A) = 0.
Z = (z_1, . . . , z_t) = X(b_1, . . . , b_t) = XB.
(b)
(c) (P_X − P_Z)² = P_X² − P_X P_Z − P_Z P_X + P_Z² = P_X − P_Z − (P_X P_Z)^T + P_Z = P_X − P_Z.
(d)
SSE/σ² ∼ χ²(df_1, ncp_1 = 0), df_1 = n − rank(X),
(SSE_2 − SSE)/σ² ∼ χ²(df_2, ncp_2), df_2 = rank(X) − rank(Z) > 0,
where df_2 > 0 because C(Z) is a proper subset of C(X).
ncp_2 = (1/(2σ²))(Xβ)^T(P_X − P_Z)Xβ = (1/(2σ²)) β^T(X^T P_X X − X^T P_Z X)β = (1/(2σ²)) β^T(X^T X − X^T P_Z X)β = (1/(2σ²))(Xβ)^T(I − P_Z)Xβ > 0.
The last inequality follows from the fact that C(Z) is a proper subset of C(X).
Under the null hypothesis H0: Xβ = Zγ, we have ncp_2 = 0. Therefore, F > c will be a test for H0: E(Y) ∈ C(Z), where
F = [(SSE_2 − SSE)/df_2] / [SSE/df_1].
3.17 (a)
X =
1 1 0 0 1 0 0 0
1 1 0 0 0 1 0 0
1 1 0 0 0 0 1 0
1 0 1 0 1 0 0 0
1 0 1 0 0 1 0 0
1 0 1 0 0 1 0 0
1 0 1 0 0 0 1 0
1 0 0 1 0 0 0 1
1 0 0 1 0 0 0 1
(b)
X^T Y = ( Σ_i Σ_j Σ_k Y_ijk, Σ_j Σ_k Y_1jk, Σ_j Σ_k Y_2jk, Σ_j Σ_k Y_3jk, Σ_i Σ_k Y_i1k, Σ_i Σ_k Y_i2k, Σ_i Σ_k Y_i3k, Σ_i Σ_k Y_i4k )^T
(c) First, note that
E(Ȳ_.j.) = Σ_{i=1}^3 Σ_{k=1}^{n_ij} (µ + α_i + β_j)/n_.j = (n_.j µ + Σ_{i=1}^3 n_ij α_i + n_.j β_j)/n_.j = µ + β_j + Σ_i n_ij α_i/n_.j.
Then
E(Ȳ_.1.) = µ + β_1 + (α_1 + α_2)/2 and E(Ȳ_.3.) = µ + β_3 + (α_1 + α_2)/2.
Hence E(Ȳ_.1.) − E(Ȳ_.3.) = β_1 − β_3, and Ȳ_.1. − Ȳ_.3. is a LUE for β_1 − β_3. More work is needed to show Ȳ_.1. − Ȳ_.3. is an OLS estimator of β_1 − β_3.
(d)
E(Ȳ_1..) = µ + α_1 + (β_1 + β_2 + β_3)/3 and E(Ȳ_3..) = µ + α_3 + (1/2)(2β_4) = µ + α_3 + β_4,
so E(Ȳ_1.. − Ȳ_3..) = α_1 − α_3 + (β_1 + β_2 + β_3 − 3β_4)/3 ≠ α_1 − α_3.
Therefore, Ȳ_1.. − Ȳ_3.. is not an unbiased estimator of α_1 − α_3, hence it cannot be the OLS estimator of α_1 − α_3.
3.18
a) X′ = [1 1 2; 2 2 4], so C(X′) = span{(1, 2)^T}.
For b), c), and d), if a is a 2 × 1 constant vector, then a′β is estimable iff a ∈ C(X′).
b) Yes, estimable since 5β_1 + 10β_2 = (5 10)β, and (5, 10)^T = 5(1, 2)^T ∈ C(X′).
11.3 Tables
Tabled values are F(k,d, 0.95) where P (F < F (k, d, 0.95)) = 0.95.
00 stands for ∞. Entries were produced with the qf(.95,k,d) command
in R. The numerator degrees of freedom are k while the denominator degrees
of freedom are d.
k 1 2 3 4 5 6 7 8 9 00
d
1 161 200 216 225 230 234 237 239 241 254
2 18.5 19.0 19.2 19.3 19.3 19.3 19.4 19.4 19.4 19.5
3 10.1 9.55 9.28 9.12 9.01 8.94 8.89 8.85 8.81 8.53
4 7.71 6.94 6.59 6.39 6.26 6.16 6.09 6.04 6.00 5.63
5 6.61 5.79 5.41 5.19 5.05 4.95 4.88 4.82 4.77 4.37
6 5.99 5.14 4.76 4.53 4.39 4.28 4.21 4.15 4.10 3.67
7 5.59 4.74 4.35 4.12 3.97 3.87 3.79 3.73 3.68 3.23
8 5.32 4.46 4.07 3.84 3.69 3.58 3.50 3.44 3.39 2.93
9 5.12 4.26 3.86 3.63 3.48 3.37 3.29 3.23 3.18 2.71
10 4.96 4.10 3.71 3.48 3.33 3.22 3.14 3.07 3.02 2.54
11 4.84 3.98 3.59 3.36 3.20 3.09 3.01 2.95 2.90 2.41
12 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.80 2.30
13 4.67 3.81 3.41 3.18 3.03 2.92 2.83 2.77 2.71 2.21
14 4.60 3.74 3.34 3.11 2.96 2.85 2.76 2.70 2.65 2.13
15 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 2.59 2.07
16 4.49 3.63 3.24 3.01 2.85 2.74 2.66 2.59 2.54 2.01
17 4.45 3.59 3.20 2.96 2.81 2.70 2.61 2.55 2.49 1.96
18 4.41 3.55 3.16 2.93 2.77 2.66 2.58 2.51 2.46 1.92
19 4.38 3.52 3.13 2.90 2.74 2.63 2.54 2.48 2.42 1.88
20 4.35 3.49 3.10 2.87 2.71 2.60 2.51 2.45 2.39 1.84
25 4.24 3.39 2.99 2.76 2.60 2.49 2.40 2.34 2.28 1.71
30 4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27 2.21 1.62
00 3.84 3.00 2.61 2.37 2.21 2.10 2.01 1.94 1.88 1.00
Tabled values are tα,d where P (t < tα,d ) = α where t has a t distribution
with d degrees of freedom. If d > 29 use the N (0, 1) cutoffs d = Z = ∞.
alpha                                                             pvalue
d  0.005  0.01   0.025  0.05   0.5  0.95   0.975  0.99   0.995    left tail
1  -63.66 -31.82 -12.71 -6.314  0   6.314  12.71  31.82  63.66
2  -9.925 -6.965 -4.303 -2.920  0   2.920  4.303  6.965  9.925
3  -5.841 -4.541 -3.182 -2.353  0   2.353  3.182  4.541  5.841
4  -4.604 -3.747 -2.776 -2.132  0   2.132  2.776  3.747  4.604
5  -4.032 -3.365 -2.571 -2.015  0   2.015  2.571  3.365  4.032
6  -3.707 -3.143 -2.447 -1.943  0   1.943  2.447  3.143  3.707
7  -3.499 -2.998 -2.365 -1.895  0   1.895  2.365  2.998  3.499
8  -3.355 -2.896 -2.306 -1.860  0   1.860  2.306  2.896  3.355
9  -3.250 -2.821 -2.262 -1.833  0   1.833  2.262  2.821  3.250
10 -3.169 -2.764 -2.228 -1.812  0   1.812  2.228  2.764  3.169
11 -3.106 -2.718 -2.201 -1.796  0   1.796  2.201  2.718  3.106
12 -3.055 -2.681 -2.179 -1.782  0   1.782  2.179  2.681  3.055
13 -3.012 -2.650 -2.160 -1.771  0   1.771  2.160  2.650  3.012
14 -2.977 -2.624 -2.145 -1.761  0   1.761  2.145  2.624  2.977
15 -2.947 -2.602 -2.131 -1.753  0   1.753  2.131  2.602  2.947
16 -2.921 -2.583 -2.120 -1.746  0   1.746  2.120  2.583  2.921
17 -2.898 -2.567 -2.110 -1.740  0   1.740  2.110  2.567  2.898
18 -2.878 -2.552 -2.101 -1.734  0   1.734  2.101  2.552  2.878
19 -2.861 -2.539 -2.093 -1.729  0   1.729  2.093  2.539  2.861
20 -2.845 -2.528 -2.086 -1.725  0   1.725  2.086  2.528  2.845
21 -2.831 -2.518 -2.080 -1.721  0   1.721  2.080  2.518  2.831
22 -2.819 -2.508 -2.074 -1.717  0   1.717  2.074  2.508  2.819
23 -2.807 -2.500 -2.069 -1.714  0   1.714  2.069  2.500  2.807
24 -2.797 -2.492 -2.064 -1.711  0   1.711  2.064  2.492  2.797
25 -2.787 -2.485 -2.060 -1.708  0   1.708  2.060  2.485  2.787
26 -2.779 -2.479 -2.056 -1.706  0   1.706  2.056  2.479  2.779
27 -2.771 -2.473 -2.052 -1.703  0   1.703  2.052  2.473  2.771
28 -2.763 -2.467 -2.048 -1.701  0   1.701  2.048  2.467  2.763
29 -2.756 -2.462 -2.045 -1.699  0   1.699  2.045  2.462  2.756
Z  -2.576 -2.326 -1.960 -1.645  0   1.645  1.960  2.326  2.576
CI 90% 95% 99%
0.995 0.99 0.975 0.95 0.5 0.05 0.025 0.01 0.005 right tail
0.01 0.02 0.05 0.10 1 0.10 0.05 0.02 0.01 two tail
References
Berk, R., Brown, L., Buja, A., Zhang, K., and Zhao, L. (2013), “Valid
Post-Selection Inference,” The Annals of Statistics, 41, 802-837.
Berndt, E.R., and Savin, N.E. (1977), “Conflict Among Criteria for Testing
Hypotheses in the Multivariate Linear Regression Model,” Econometrica, 45,
1263-1277.
Bernholt, T. (2005), “Computing the Least Median of Squares Estimator
in Time O(n^d),” Proceedings of ICCSA 2005, LNCS, 3480, 697-706.
Bernholt, T., and Fischer, P. (2004), “The Complexity of Computing the
MCD-Estimator,” Theoretical Computer Science, 326, 383-398.
Bertsimas, D., King, A., and Mazmunder, R. (2016), “Best Subset Selec-
tion Via a Modern Optimization Lens,” The Annals of Statistics, 44, 813-852.
Bhatia, R., Elsner, L., and Krause, G. (1990), “Bounds for the Variation of
the Roots of a Polynomial and the Eigenvalues of a Matrix,” Linear Algebra
and Its Applications, 142, 195-209.
Bickel, P.J., and Ren, J.–J. (2001), “The Bootstrap in Hypothesis Testing,”
in State of the Art in Probability and Statistics: Festschrift for William R. van
Zwet, eds. de Gunst, M., Klaassen, C., and van der Vaart, A., The Institute
of Mathematical Statistics, Hayward, CA, 91-112.
Bogdan, M., Ghosh, J., and Doerge, R. (2004), “Modifying the Schwarz
Bayesian Information Criterions to Locate Multiple Interacting Quantitative
Trait Loci,” Genetics, 167, 989-999.
Box, G.E.P., and Cox, D.R. (1964), “An Analysis of Transformations,”
Journal of the Royal Statistical Society, B, 26, 211-246.
Breiman, L. (1996), “Bagging Predictors,” Machine Learning, 24, 123-140.
Brillinger, D.R. (1977), “The Identification of a Particular Nonlinear Time
Series,” Biometrika, 64, 509-515.
Brillinger, D.R. (1983), “A Generalized Linear Model with “Gaussian”
Regressor Variables,” in A Festschrift for Erich L. Lehmann, eds. Bickel,
P.J., Doksum, K.A., and Hodges, J.L., Wadsworth, Pacific Grove, CA, 97-
114.
Brown, M.B., and Forsythe, A.B. (1974a), “The ANOVA and Multiple
Comparisons for Data with Heterogeneous Variances,” Biometrics, 30, 719-
724.
Brown, M.B., and Forsythe, A.B. (1974b), “The Small Sample Behavior of
Some Statistics Which Test the Equality of Several Means,” Technometrics,
16, 129-132.
Bühlmann, P., and Yu, B. (2002), “Analyzing Bagging,” The Annals of
Statistics, 30, 927-961.
Buckland, S.T., Burnham, K.P., and Augustin, N.H. (1997), “Model Se-
lection: an Integral Part of Inference,” Biometrics, 53, 603-618.
Budny, K. (2014), “A Generalization of Chebyshev’s Inequality for Hilbert-
Space-Valued Random Variables,” Statistics & Probability Letters, 88, 62-65.
Burnham, K.P., and Anderson, D.R. (2002), Model Selection and Mul-
timodel Inference: a Practical Information-Theoretic Approach, 2nd ed.,
Springer, New York, NY.
Dezeure, R., Bühlmann, P., Meier, L., and Meinshausen, N. (2015), “High-
Dimensional Inference: Confidence Intervals, p-Values and R-Software hdi,”
Statistical Science, 30, 533-558.
Draper, N.R., and Smith, H. (1966, 1981, 1998), Applied Regression Anal-
ysis, 1st, 2nd, and 3rd ed., Wiley, New York, NY.
Driscoll, M.F., and Krasnicka, B. (1995), “An Accessible Proof of Craig’s
Theorem in the General Case,” The American Statistician, 49, 59-62.
Eaton, M.L. (1986), “A Characterization of Spherical Distributions,” Jour-
nal of Multivariate Analysis, 20, 272-276.
Eck, D.J. (2018), “Bootstrapping for Multivariate Linear Regression Mod-
els,” Statistics & Probability Letters, 134, 141-149.
Efron, B. (1979), “Bootstrap Methods, Another Look at the Jackknife,”
The Annals of Statistics, 7, 1-26.
Efron, B. (1982), The Jackknife, the Bootstrap and Other Resampling
Plans, SIAM, Philadelphia, PA.
Efron, B. (2014), “Estimation and Accuracy After Model Selection,” (with
discussion), Journal of the American Statistical Association, 109, 991-1007.
Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004), “Least
Angle Regression,” (with discussion), The Annals of Statistics, 32, 407-451.
Efron, B., and Hastie, T. (2016), Computer Age Statistical Inference, Cam-
bridge University Press, New York, NY.
Efron, B., and Tibshirani, R.J. (1993), An Introduction to the Bootstrap,
Chapman & Hall/CRC, New York, NY.
Efroymson, M.A. (1960), “Multiple Regression Analysis,” in Mathematical
Methods for Digital Computers, eds. Ralston, A., and Wilf, H.S., Wiley, New
York, NY, 191-203.
Eicker, F. (1963), “Asymptotic Normality and Consistency of the Least
Squares Estimators for Families of Linear Regressions,” Annals of Mathe-
matical Statistics, 34, 447-456.
Eicker, F. (1967), “Limit Theorems for Regressions with Unequal and De-
pendent Errors,” in Proceedings of the Fifth Berkeley Symposium on Mathe-
matical Statistics and Probability, Vol. I: Statistics, eds. Le Cam, L.M., and
Neyman, J., University of California Press, Berkeley, CA, 59-82.
Ewald, K., and Schneider, U. (2018), “Uniformly Valid Confidence Sets
Based on the Lasso,” Electronic Journal of Statistics, 12, 1358-1387.
Fahrmeir, L., and Tutz, G. (2001), Multivariate Statistical Modelling Based
on Generalized Linear Models, 2nd ed., Springer, New York, NY.
Fan, J., and Li, R. (2001), “Variable Selection Via Nonconcave Penalized
Likelihood and Its Oracle Properties,” Journal of the American Statistical
Association, 96, 1348-1360.
Fan, J., and Li, R. (2002), “Variable Selection for Cox’s Proportional Haz-
ard Model and Frailty Model,” The Annals of Statistics, 30, 74-99.
Fan, J., and Lv, J. (2010), “A Selective Overview of Variable Selection in
High Dimensional Feature Space,” Statistica Sinica, 20, 101-148.
Lai, T.L., Robbins, H., and Wei, C.Z. (1979), “Strong Consistency of Least
Squares Estimates in Multiple Regression II,” Journal of Multivariate Anal-
ysis, 9, 343-361.
Larsen, R.J., and Marx, M.L. (2017), Introduction to Mathematical Statis-
tics and Its Applications, 6th ed., Pearson, Boston, MA.
Lee, J., Sun, D., Sun, Y., and Taylor, J. (2016), “Exact Post-Selection
Inference with Application to the Lasso,” The Annals of Statistics, 44, 907-
927.
Lee, J.D., and Taylor, J.E. (2014), “Exact Post Model Selection Infer-
ence for Marginal Screening,” in Advances in Neural Information Processing
Systems, 136-144.
Leeb, H., and Pötscher, B.M. (2005), “Model Selection and Inference: Facts
and Fiction,” Econometric Theory, 21, 21-59.
Leeb, H., and Pötscher, B.M. (2006), “Can One Estimate the Conditional
Distribution of Post–Model-Selection Estimators?” The Annals of Statistics,
34, 2554-2591.
Leeb, H., and Pötscher, B.M. (2008), “Can One Estimate the Unconditional
Distribution of Post-Model-Selection Estimators?” Econometric Theory, 24,
338-376.
Leeb, H., Pötscher, B.M., and Ewald, K. (2015), “On Various Confidence
Intervals Post-Model-Selection,” Statistical Science, 30, 216-227.
Lehmann, E.L. (1999), Elements of Large–Sample Theory, Springer, New
York, NY.
Lei, J., G’Sell, M., Rinaldo, A., Tibshirani, R.J., and Wasserman, L.
(2018), “Distribution-Free Predictive Inference for Regression,” Journal of
the American Statistical Association, 113, 1094-1111.
Leon, S.J. (1986), Linear Algebra with Applications, 2nd ed., Macmillan
Publishing Company, New York, NY.
Leon, S.J. (2015), Linear Algebra with Applications, 9th ed., Pearson,
Boston, MA.
Li, K.–C. (1987), “Asymptotic Optimality for Cp, CL, Cross-Validation
and Generalized Cross-Validation: Discrete Index Set,” The Annals of Statis-
tics, 15, 958-975.
Li, K.–C., and Duan, N. (1989), “Regression Analysis Under Link Viola-
tion,” The Annals of Statistics, 17, 1009-1052.
Lin, D., Foster, D.P., and Ungar, L.H. (2011), “VIF Regression, a Fast
Regression Algorithm for Large Data,” Journal of the American Statistical
Association, 106, 232-247.
Lindenmayer, D.B., Cunningham, R., Tanton, M.T., Nix, H.A., and Smith,
A.P. (1991), “The Conservation of Arboreal Marsupials in the Montane Ash
Forests of the Central Highlands of Victoria, South-East Australia: III. The
Habitat Requirements of Leadbeater’s Possum Gymnobelideus Leadbeateri and
Models of the Diversity and Abundance of Arboreal Marsupials,” Biological
Conservation, 56, 295-315.
Liu, X., and Zuo, Y. (2014), “Computing Projection Depth and Its Asso-
ciated Estimators,” Statistics and Computing, 24, 51-63.
Lockhart, R., Taylor, J., Tibshirani, R.J., and Tibshirani, R. (2014), “A
Significance Test for the Lasso,” (with discussion), The Annals of Statistics,
42, 413-468.
Lopuhaä, H.P. (1999), “Asymptotics of Reweighted Estimators of Multi-
variate Location and Scatter,” The Annals of Statistics, 27, 1638-1665.
Lu, S., Liu, Y., Yin, L., and Zhang, K. (2017), “Confidence Intervals and
Regions for the Lasso by Using Stochastic Variational Inequality Techniques
in Optimization,” Journal of the Royal Statistical Society, B, 79, 589-611.
Lumley, T. (using Fortran code by Alan Miller) (2009), leaps: Regression
Subset Selection, R package version 2.9, (https://CRAN.R-project.org/package=leaps).
Luo, S., and Chen, Z. (2013), “Extended BIC for Linear Regression Models
with Diverging Number of Relevant Features and High or Ultra-High Feature
Spaces,” Journal of Statistical Planning and Inference, 143, 494-504.
Machado, J.A.F., and Parente, P. (2005), “Bootstrap Estimation of Covari-
ance Matrices Via the Percentile Method,” Econometrics Journal, 8, 70-78.
MacKinnon, J.G., and White, H. (1985), “Some Heteroskedasticity-Consistent
Covariance Matrix Estimators with Improved Finite Sample Properties,”
Journal of Econometrics, 29, 305-325.
Mallows, C. (1973), “Some Comments on Cp ,” Technometrics, 15, 661-676.
Marden, J.I. (2017), Mathematical Statistics: Old School, available at
(www.stat.istics.net and www.amazon.com).
Mardia, K.V., Kent, J.T., and Bibby, J.M. (1979), Multivariate Analysis,
Academic Press, London, UK.
Maronna, R.A., Martin, R.D., and Yohai, V.J. (2006), Robust Statistics:
Theory and Methods, Wiley, Hoboken, NJ.
Maronna, R.A., and Morgenthaler, S. (1986), “Robust Regression Through
Robust Covariances,” Communications in Statistics: Theory and Methods, 15,
1347-1365.
Maronna, R.A., and Yohai, V.J. (2002), “Comment on ‘Inconsistency of
Resampling Algorithms for High Breakdown Regression and a New Algo-
rithm’ by D.M. Hawkins and D.J. Olive,” Journal of the American Statistical
Association, 97, 154-155.
Maronna, R.A., and Yohai, V.J. (2015), “High-Sample Efficiency and Ro-
bustness Based on Distance-Constrained Maximum Likelihood,” Computa-
tional Statistics & Data Analysis, 83, 262-274.
Maronna, R.A., and Zamar, R.H. (2002), “Robust Estimates of Location
and Dispersion for High-Dimensional Datasets,” Technometrics, 44, 307-317.
Marquardt, D.W., and Snee, R.D. (1975), “Ridge Regression in Practice,”
The American Statistician, 29, 3-20.
Mašíček, L. (2004), “Optimality of the Least Weighted Squares Estima-
tor,” Kybernetika, 40, 715-734.
MathSoft (1999a), S-Plus 2000 User’s Guide, Data Analysis Products Di-
vision, MathSoft, Seattle, WA.
MathSoft (1999b), S-Plus 2000 Guide to Statistics, Volume 2, Data Anal-
ysis Products Division, MathSoft, Seattle, WA.
McCullagh, P., and Nelder, J.A. (1989), Generalized Linear Models, 2nd
ed., Chapman & Hall, London, UK.
Meinshausen, N. (2007), “Relaxed Lasso,” Computational Statistics &
Data Analysis, 52, 374-393.
Mevik, B.–H., Wehrens, R., and Liland, K.H. (2015), pls: Partial Least
Squares and Principal Component Regression, R package version 2.5-0, (https://CRAN.R-project.org/package=pls).
Monahan, J.F. (2008), A Primer on Linear Models, Chapman & Hall/CRC,
Boca Raton, FL.
Montgomery, D.C., Peck, E.A., and Vining, G. (2001), Introduction to
Linear Regression Analysis, 3rd ed., Wiley, Hoboken, NJ.
Montgomery, D.C., Peck, E.A., and Vining, G. (2021), Introduction to
Linear Regression Analysis, 6th ed., Wiley, Hoboken, NJ.
Moore, D.S. (2007), The Basic Practice of Statistics, 4th ed., W.H. Free-
man, New York, NY.
Mosteller, F., and Tukey, J.W. (1977), Data Analysis and Regression,
Addison-Wesley, Reading, MA.
Mount, D.M., Netanyahu, N.S., Piatko, C.D., Silverman, R., and Wu, A.Y.
(2014), “On the Least Trimmed Squares Estimator,” Algorithmica, 69, 148-
183.
Muller, K.E., and Stewart, P.W. (2006), Linear Model Theory: Univariate,
Multivariate, and Mixed Models, Wiley, Hoboken, NJ.
Myers, R.H., and Milton, J.S. (1991), A First Course in the Theory of
Linear Statistical Models, Duxbury, Belmont, CA.
Myers, R.H., Montgomery, D.C., and Vining, G.G. (2002), Generalized
Linear Models with Applications in Engineering and the Sciences, Wiley, New
York, NY.
Navarro, J. (2014), “Can the Bounds in the Multivariate Chebyshev In-
equality be Attained?” Statistics & Probability Letters, 91, 1-5.
Navarro, J. (2016), “A Very Simple Proof of the Multivariate Chebyshev’s
Inequality,” Communications in Statistics: Theory and Methods, 45, 3458-
3463.
Nelder, J.A., and Wedderburn, R.W.M. (1972), “Generalized Linear Mod-
els,” Journal of the Royal Statistical Society, A, 135, 370-384.
Ning, Y., and Liu, H. (2017), “A General Theory of Hypothesis Tests and
Confidence Regions for Sparse High Dimensional Models,” The Annals of
Statistics, 45, 158-195.
Nishii, R. (1984), “Asymptotic Properties of Criteria for Selection of Vari-
ables in Multiple Regression,” The Annals of Statistics, 12, 758-765.
Nordhausen, K., and Tyler, D.E. (2015), “A Cautionary Note on Robust
Covariance Plug-In Methods,” Biometrika, 102, 573-588.
Olive, D.J., and Hawkins, D.M. (2010), “Robust Multivariate Location and
Dispersion,” preprint, see (http://parker.ad.siu.edu/Olive/pphbmld.pdf).
Olive, D.J., and Hawkins, D.M. (2011), “Practical High Breakdown Re-
gression,” preprint at (http://parker.ad.siu.edu/Olive/pphbreg.pdf).
Olive, D.J., Pelawa Watagoda, L.C.R., and Rupasinghe Arachchige Don,
H.S. (2015), “Visualizing and Testing the Multivariate Linear Regression
Model,” International Journal of Statistics and Probability, 4, 126-137.
Olive, D.J., Rathnayake, R.C., and Haile, M.G. (2022), “Prediction Inter-
vals for GLMs, GAMs, and Some Survival Regression Models,” Communica-
tions in Statistics: Theory and Methods, 51, 8012-8026.
Olive, D.J., and Zhang, L. (2025), “One Component Partial Least Squares,
High Dimensional Regression, Data Splitting, and the Multitude of Models,”
Communications in Statistics: Theory and Methods, 54, 130-145.
Park, Y., Kim, D., and Kim, S. (2012), “Robust Regression Using Data
Partitioning and M-Estimation,” Communications in Statistics: Simulation
and Computation, 8, 1282-1300.
Pati, Y.C., Rezaiifar, R., and Krishnaprasad, P.S. (1993), “Orthogonal
Matching Pursuit: Recursive Function Approximation with Applications to
Wavelet Decomposition,” in Conference Record of the Twenty-Seventh Asilo-
mar Conference on Signals, Systems and Computers, IEEE, 40-44.
Pelawa Watagoda, L.C.R. (2017), “Inference after Variable Selection,”
Ph.D. Thesis, Southern Illinois University. See (http://parker.ad.siu.edu/
Olive/slasanthiphd.pdf).
Pelawa Watagoda, L.C.R. (2019), “A Sub-Model Theorem for Ordinary
Least Squares,” International Journal of Statistics and Probability, 8, 40-43.
Pelawa Watagoda, L.C.R., and Olive, D.J. (2021a), “Bootstrapping Mul-
tiple Linear Regression after Variable Selection,” Statistical Papers, 62, 681-
700.
Pelawa Watagoda, L.C.R., and Olive, D.J. (2021b), “Comparing Six
Shrinkage Estimators with Large Sample Theory and Asymptotically Op-
timal Prediction Intervals,” Statistical Papers, 62, 2407-2431.
Peña, D. (2005), “A New Statistic for Influence in Regression,” Techno-
metrics, 47, 1-12.
Pesch, C. (1999), “Computation of the Minimum Covariance Determinant
Estimator,” in Classification in the Information Age, Proceedings of the 22nd
Annual GfKl Conference, Dresden 1998, eds. Gaul, W., and Locarek-Junge,
H., Springer, Berlin, 225–232.
Pratt, J.W. (1959), “On a General Concept of “in Probability”,” The
Annals of Mathematical Statistics, 30, 549-558.
Press, S.J. (2005), Applied Multivariate Analysis: Using Bayesian and Fre-
quentist Methods of Inference, 2nd ed., Dover, Mineola, NY.
Qi, X., Luo, R., Carroll, R.J., and Zhao, H. (2015), “Sparse Regression
by Projection and Sparse Discriminant Analysis,” Journal of Computational
and Graphical Statistics, 24, 416-438.
Rousseeuw, P.J., and Van Driessen, K. (1999), “A Fast Algorithm for the
Minimum Covariance Determinant Estimator,” Technometrics, 41, 212-223.
Rupasinghe Arachchige Don, H.S. (2018), “A Relationship Between the
One-Way MANOVA Test Statistic and the Hotelling Lawley Trace Test
Statistic,” International Journal of Statistics and Probability, 7, 124-131.
Rupasinghe Arachchige Don, H.S., and Olive, D.J. (2019), “Bootstrapping
Analogs of the One Way MANOVA Test,” Communications in Statistics:
Theory and Methods, 48, 5546-5558.
Rupasinghe Arachchige Don, H.S., and Pelawa Watagoda, L.C.R. (2018),
“Bootstrapping Analogs of the Two Sample Hotelling’s T^2 Test,” Communi-
cations in Statistics: Theory and Methods, 47, 2172-2182.
SAS Institute (1985), SAS User’s Guide: Statistics, Version 5, SAS Insti-
tute, Cary, NC.
Schaaffhausen, H. (1878), “Die Anthropologische Sammlung Des Anatomischen
Der Universitat Bonn,” Archiv fur Anthropologie, 10, 1-65, Appendix.
Scheffé, H. (1959), The Analysis of Variance, Wiley, New York, NY.
Schomaker, M., and Heumann, C. (2014), “Model Selection and Model Av-
eraging After Multiple Imputation,” Computational Statistics & Data Anal-
ysis, 71, 758-770.
Schwarz, G. (1978), “Estimating the Dimension of a Model,” The Annals
of Statistics, 6, 461-464.
Searle, S.R. (1971), Linear Models, Wiley, New York, NY.
Searle, S.R. (1982), Matrix Algebra Useful for Statistics, Wiley, New York,
NY.
Searle, S.R., and Gruber, M.H.J. (2017), Linear Models, 2nd ed., Wiley,
Hoboken, NJ.
Seber, G.A.F., and Lee, A.J. (2003), Linear Regression Analysis, 2nd ed.,
Wiley, New York, NY.
Sen, P.K., and Singer, J.M. (1993), Large Sample Methods in Statistics:
an Introduction with Applications, Chapman & Hall, New York, NY.
Sengupta, D., and Jammalamadaka, S.R. (2019), Linear Models and Re-
gression with R: an Integrated Approach, World Scientific, Singapore.
Serfling, R.J. (1980), Approximation Theorems of Mathematical Statistics,
Wiley, New York, NY.
Severini, T.A. (1998), “Some Properties of Inferences in Misspecified Lin-
ear Models,” Statistics & Probability Letters, 40, 149-153.
Severini, T.A. (2005), Elements of Distribution Theory, Cambridge Uni-
versity Press, New York, NY.
Shao, J. (1993), “Linear Model Selection by Cross-Validation,” Journal of
the American Statistical Association, 88, 486-494.
Shao, J., and Tu, D.S. (1995), The Jackknife and the Bootstrap, Springer,
New York, NY.
Shibata, R. (1984), “Approximate Efficiency of a Selection Procedure for
the Number of Regression Variables,” Biometrika, 71, 43-49.
Simon, N., Friedman, J., Hastie, T., and Tibshirani, R. (2011), “Regular-
ization Paths for Cox’s Proportional Hazards Model via Coordinate Descent,”
Journal of Statistical Software, 39, 1-13.
Simonoff, J.S. (2003), Analyzing Categorical Data, Springer, New York,
NY.
Slawski, M., zu Castell, W., and Tutz, G. (2010), “Feature Selection
Guided by Structural Information,” Annals of Applied Statistics, 4, 1056-
1080.
Srivastava, M.S., and Khatri, C.G. (1979), An Introduction to Multivariate
Statistics, North Holland, New York, NY.
Stapleton, J.H. (2009), Linear Statistical Models, 2nd ed., Wiley, Hoboken,
NJ.
Staudte, R.G., and Sheather, S.J. (1990), Robust Estimation and Testing,
Wiley, New York, NY.
Steinberger, L., and Leeb, H. (2023), “Conditional Predictive Inference for
Stable Algorithms,” The Annals of Statistics, 51, 290-311.
Stewart, G.W. (1969), “On the Continuity of the Generalized Inverse,”
SIAM Journal on Applied Mathematics, 17, 33-45.
Su, W., Bogdan, M., and Candès, E. (2017), “False Discoveries Occur
Early on the Lasso Path,” The Annals of Statistics, 45, 2133-2150.
Su, Z., and Cook, R.D. (2012), “Inner Envelopes: Efficient Estimation in
Multivariate Linear Regression,” Biometrika, 99, 687-702.
Su, Z., Zhu, G., and Yang, Y. (2016), “Sparse Envelope Model: Efficient
Estimation and Response Variable Selection in Multivariate Linear Regres-
sion,” Biometrika, 103, 579-593.
Sun, T., and Zhang, C.-H. (2012), “Scaled Sparse Linear Regression,”
Biometrika, 99, 879-898.
Tarr, G., Müller, S., and Weber, N.C. (2016), “Robust Estimation of Pre-
cision Matrices Under Cellwise Contamination,” Computational Statistics &
Data Analysis, 93, 404-420.
Tibshirani, R. (1996), “Regression Shrinkage and Selection via the Lasso,”
Journal of the Royal Statistical Society, B, 58, 267-288.
Tibshirani, R. (1997), “The Lasso Method for Variable Selection in the
Cox Model,” Statistics in Medicine, 16, 385-395.
Tibshirani, R., Bien, J., Friedman, J., Hastie, T., Simon, N., Taylor, J.,
and Tibshirani, R.J. (2012), “Strong Rules for Discarding Predictors in Lasso-
Type Problems,” Journal of the Royal Statistical Society, B, 74, 245–266.
Tibshirani, R.J. (2013), “The Lasso Problem and Uniqueness,” Electronic
Journal of Statistics, 7, 1456-1490.
Tibshirani, R.J. (2015), “Degrees of Freedom and Model Search,” Statistica
Sinica, 25, 1265-1296.
Tibshirani, R.J., Rinaldo, A., Tibshirani, R., and Wasserman, L. (2018),
“Uniform Asymptotic Inference and the Bootstrap after Model Selection,”
The Annals of Statistics, 46, 1255-1287.
Index

Chen, 150, 169, 171, 200, 203, 243, 262, 275, 488, 497
Cheng, 259
Chew, 167
Cho, 262
Chow, v
Christensen, v, 74, 82, 101
Chun, 225, 260
CI, 3
Claeskens, 151, 154, 155, 204, 489, 500, 501
Claeskins, 259
Clarke, 203
classical prediction region, 167
Cleveland, 501
CLT, 3
CLTS, 329
coefficient of multiple determination, 18
Collett, 438, 474, 500
column space, 72, 102
concentration, 288, 291
conditional distribution, 32
confidence region, 170, 201
consistent, 39
consistent estimator, 39
constant variance MLR model, 13
Continuity Theorem, 46
Continuous Mapping Theorem, 46
converges almost everywhere, 41, 42
converges in distribution, 37
converges in law, 37
converges in probability, 39
converges in quadratic mean, 40
Cook, v, 8, 19, 54, 55, 59, 64, 160, 188, 193, 224, 225, 260, 261, 322, 350, 365, 367, 371, 375, 384, 401, 405, 410, 440, 454, 458, 475, 482, 495, 500
coordinatewise median, 282
Cornish, 57
covariance matrix, 31
coverage, 167
covmb2, 318, 350
Cox, 12, 142, 275, 427
Craig’s Theorem, 78, 103
Cramér, 18
Crawley, 506, 508
Croux, 56
CV, 3
Daniel, 148
data splitting, 462
Datta, 259, 286
DD plot, 313
degrees of freedom, 19, 264
Delta Method, 36
Denham, 260
Det-MCD, 302, 306
Devlin, 291
Dey, v
Dezeure, 259
df, 19
DGK estimator, 291
discriminant function, 432
dispersion matrix, 282
DOE, 119
dot plot, 122, 279, 411
double bootstrap, 205
Driscoll, 78
Duan, 495–497, 500
EAP, 3
Eaton, 54
EC, 3
Eck, 395
EE plot, 145, 455
Efron, 150, 171, 177, 178, 189, 203, 229, 230, 234, 259, 260
Efroymson, 259
Eicker, 105
eigenvalue, 221
eigenvector, 221
elastic net, 240
elastic net variable selection, 243
elemental set, 283, 288, 290, 328, 331
ellipsoidal trimming, 325
elliptically contoured, 53, 57, 316
elliptically contoured distribution, 166
elliptically symmetric, 53
empirical cdf, 172
empirical distribution, 172
envelope estimators, 401
error sum of squares, 17, 30
ESP, 3
ESSP, 3
estimable, 118
estimated additive predictor, 5, 425
estimated sufficient predictor, 4, 425
estimated sufficient summary plot, 5, 495
Euclidean norm, 48, 338
Ewald, 204, 253, 259
experimental design, 119
exponential family, 428
extrapolation, 160, 244
Fahrmeir, 452
Fan, 154, 259, 260, 262, 277
Rathnayake, 151, 152, 203, 237, 243, 260, 477, 484, 489, 500
Raven, 475
Ravishanker, v
regression equivariance, 335
regression equivariant, 335
regression sum of squares, 17
regression through the origin, 29
Reid, 78
Rejchel, 260
relaxed elastic net, 252
relaxed lasso, 215, 252
Ren, 175, 178, 201, 203
Rencher, v
residual plot, 5, 14, 366, 410
residuals, 14, 213, 255
response plot, 5, 14, 100, 145, 366, 410, 425, 496
response transformation, 11
response transformation model, 426, 495
response variable, 1, 4
response variables, 361, 407
Reyen, 351
RFCH estimator, 297
Riani, 356, 432
ridge regression, 215, 262, 401
Riedwyl, 483
Rinaldo, 200, 277, 500
Ripley, vi, 501, 506
Ro, 319
Rocke, 288, 299
Rohatgi, 33, 46
Ronchetti, 97, 321, 352
Rothman, 263
Rousseeuw, 263, 283, 288, 291, 306, 313, 324, 326, 329, 340, 351, 352
row space, 72
RR plot, 23, 145, 366
Rupasinghe Arachchige Don, 134, 417, 418, 421, 424
S, 42
sample correlation matrix, 164
sample covariance matrix, 164, 349
sample mean, 16, 34, 164, 349
sandwich estimator, 100
SAS Institute, 405
Savin, 378, 390
scale equivariant, 336
Schaaffhausen, 354, 437, 453
Schaalje, v
Scheffé, v
Schneider, 204, 253, 259
Schomaker, 204
Schwarz, 142, 150, 488
score equations, 230
SE, 3, 34
Searle, v, 77, 81, 82, 108, 374, 402, 419
Seber, v, 26, 33, 83, 86, 94, 97, 119, 143, 377
selection bias, 151
Sen, 60, 92, 203, 477
Sengupta, v
Serfling, 60, 173
Severini, 32, 50, 60, 230
Shao, 152, 480
Sheather, 207
Shibata, 150
shrinkage estimator, 203
Simonoff, 426, 440, 465, 500
simple linear regression, 28
Singer, 60, 92, 477
singular value decomposition, 227
Slawski, 242
SLR, 28
Slutsky’s Theorem, 45, 50
smallest extreme value distribution, 432
smoothed bootstrap estimator, 178
Snee, 229
SP, 3
span, 71, 102
sparse model, 4
spectral decomposition, 221
Spectral Decomposition Theorem, 76
spectral norm, 339
spherical, 54
split conformal prediction interval, 161
square root matrix, 76, 99, 103, 222
Srivastava, 66
SSP, 3, 495
Stahel-Donoho estimator, 351
standard deviation, 280
standard error, 34
Stapleton, v
STATLIB, 459
Staudte, 207
Steinberger, 263
Stewart, v, 60, 230
Su, 19, 160, 188, 260, 262, 365, 371, 375, 401
submodel, 142
subspace, 71
sufficient predictor, 4, 142, 425
sufficient summary plot, 495
Sun, 262
supervised learning, 2
SVD, 227
Swamping, 322