Ch5 Resampling Methods

Cross-validation and the Bootstrap
Training Error versus Test Error
Training- versus Test-Set Performance
[Figure: prediction error for a training sample and a test sample as model complexity increases from low to high.]
More on prediction-error estimates
Validation-set approach
The Validation process
[Figure: a set of n observations is randomly divided into a training set and a validation set.]
Example: automobile data
• We want to compare linear versus higher-order polynomial terms in a linear regression.
• We randomly split the 392 observations into two sets: a training set containing 196 of the data points, and a validation set containing the remaining 196 observations.
[Figure: validation-set mean squared error versus degree of polynomial for the Auto data. Left panel shows a single split; right panel shows multiple splits.]
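A minimal Python sketch of this validation-set comparison, not taken from the slides: the file name Auto.csv, the predictor horsepower, and the response mpg are assumptions about the data layout, and scikit-learn is used for the split and the fits.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Missing values in the Auto data are often coded as "?".
auto = pd.read_csv("Auto.csv", na_values="?").dropna()
X = auto[["horsepower"]].values
y = auto["mpg"].values

# One random split: 196 training observations, 196 validation observations.
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=196, random_state=1)

for degree in range(1, 11):
    poly = PolynomialFeatures(degree)
    fit = LinearRegression().fit(poly.fit_transform(X_train), y_train)
    mse = mean_squared_error(y_val, fit.predict(poly.transform(X_val)))
    print(f"degree {degree}: validation MSE = {mse:.2f}")
```

Repeating this with different values of random_state mimics the multiple-split curves in the right panel.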
Drawbacks of validation set approach
K-fold Cross-validation
K-fold Cross-validation in detail
[Figure: schematic of K-fold cross-validation with the data split into 5 parts, each part serving in turn as the validation set.]
The details

• Divide the data into K roughly equal-sized parts C_1, C_2, ..., C_K, with n_k observations in part k.
• Compute
  $$\mathrm{CV}_{(K)} = \sum_{k=1}^{K} \frac{n_k}{n}\,\mathrm{MSE}_k,$$
  where $\mathrm{MSE}_k = \sum_{i \in C_k} (y_i - \hat{y}_i)^2 / n_k$, and $\hat{y}_i$ is the fit for observation i, obtained from the data with part k removed.
• Setting K = n yields n-fold or leave-one-out cross-validation (LOOCV).
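To make the recipe concrete, here is a minimal hand-rolled Python sketch, not from the slides: it mirrors the weighted-average formula above, with a one-predictor polynomial fit standing in for the general model and all names my own.

```python
import numpy as np

def kfold_cv_mse(x, y, degree, K=10, seed=0):
    """K-fold CV estimate of test MSE for a polynomial fit of the given degree."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), K)        # index sets C_1, ..., C_K
    cv = 0.0
    for idx in folds:
        train = np.setdiff1d(np.arange(n), idx)          # all data with part k removed
        coefs = np.polyfit(x[train], y[train], degree)   # fit on the K-1 training parts
        resid = y[idx] - np.polyval(coefs, x[idx])
        cv += (len(idx) / n) * np.mean(resid ** 2)       # (n_k / n) * MSE_k
    return cv
```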
A nice special case!

• With least-squares linear or polynomial regression, an amazing shortcut makes the cost of LOOCV the same as that of a single model fit! The following formula holds:
  $$\mathrm{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^2,$$
  where $\hat{y}_i$ is the ith fitted value from the original least squares fit, and $h_i$ is the leverage (the ith diagonal element of the "hat" matrix; see book for details). This is like the ordinary MSE, except the ith residual is divided by $1 - h_i$.
• LOOCV is sometimes useful, but it typically doesn't shake up the data enough. The estimates from each fold are highly correlated, and hence their average can have high variance.
• A better choice is K = 5 or 10.
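A small Python sketch of this shortcut, not from the slides: a single least-squares fit plus the leverages, with the design matrix X assumed to already include an intercept column.

```python
import numpy as np

def loocv_shortcut(X, y):
    """LOOCV error from one fit: mean of ((y_i - yhat_i) / (1 - h_i))^2."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)          # single least-squares fit
    fitted = X @ beta
    # Leverages h_i: diagonal of the hat matrix X (X'X)^{-1} X'.
    h = np.einsum("ij,ji->i", X, np.linalg.pinv(X.T @ X) @ X.T)
    return np.mean(((y - fitted) / (1.0 - h)) ** 2)
```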
Auto data revisited
[Figure: LOOCV (left) and 10-fold CV (right) estimates of test MSE versus degree of polynomial for the Auto data.]
True and estimated test MSE for the simulated data
[Figure: true and estimated test MSE plotted against flexibility (2 to 20) for three simulated data sets.]
Other issues with Cross-validation
Cross-Validation for Classification Problems

• We divide the data into K roughly equal-sized parts C_1, C_2, ..., C_K. C_k denotes the indices of the observations in part k. There are n_k observations in part k: if n is a multiple of K, then n_k = n/K.
• Compute
  $$\mathrm{CV}_K = \sum_{k=1}^{K} \frac{n_k}{n}\,\mathrm{Err}_k,$$
  where $\mathrm{Err}_k = \sum_{i \in C_k} I(y_i \neq \hat{y}_i)/n_k$.
• The estimated standard deviation of CV_K is
  $$\widehat{\mathrm{SE}}(\mathrm{CV}_K) = \sqrt{\frac{1}{K}\sum_{k=1}^{K}\frac{(\mathrm{Err}_k - \overline{\mathrm{Err}_k})^2}{K-1}}.$$
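A minimal Python sketch of these two formulas, not from the slides: the logistic-regression classifier and (nearly) equal-sized folds are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def classification_cv(X, y, K=10, seed=0):
    """Return CV_K (misclassification rate) and its estimated standard error."""
    errs = []
    for train, test in KFold(K, shuffle=True, random_state=seed).split(X):
        model = LogisticRegression(max_iter=1000).fit(X[train], y[train])
        errs.append(np.mean(model.predict(X[test]) != y[test]))   # Err_k
    errs = np.array(errs)
    cv_k = errs.mean()                     # folds are (nearly) equal-sized, so weights are ~1/K
    se = np.sqrt(np.mean((errs - errs.mean()) ** 2 / (K - 1)))
    return cv_k, se
```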
Cross-validation: right and wrong
NO!
The Wrong and Right Way
Wrong Way
[Figure: schematic of samples and CV folds illustrating the wrong way: predictors are screened using all of the samples, outside the cross-validation folds.]
Right Way
[Figure: schematic of samples and CV folds illustrating the right way: all steps, including the screening of predictors, are carried out separately within each fold.]
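A minimal scikit-learn sketch (an illustration, not from the slides) of the right way: the predictor screening lives inside a pipeline, so it is redone on each training fold and the held-out fold never influences the selection.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1000))      # 1000 pure-noise predictors
y = rng.integers(0, 2, size=50)      # labels unrelated to the predictors

pipe = make_pipeline(SelectKBest(f_classif, k=100),
                     LogisticRegression(max_iter=1000))
# Because screening is refit within each training fold, CV accuracy stays near chance.
print(cross_val_score(pipe, X, y, cv=5).mean())
```

Screening on all 50 samples first and then cross-validating only the classifier would, by contrast, report a misleadingly high accuracy.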
The Bootstrap
Where does the name come from?
A simple example

• Suppose that we wish to invest a fixed sum of money in two financial assets that yield returns of X and Y, respectively, where X and Y are random quantities.
• We will invest a fraction α of our money in X, and will invest the remaining 1 − α in Y.
• We wish to choose α to minimize the total risk, or variance, of our investment. In other words, we want to minimize Var(αX + (1 − α)Y).
• One can show that the value that minimizes the risk is given by
  $$\alpha = \frac{\sigma_Y^2 - \sigma_{XY}}{\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}},$$
  where $\sigma_X^2 = \mathrm{Var}(X)$, $\sigma_Y^2 = \mathrm{Var}(Y)$, and $\sigma_{XY} = \mathrm{Cov}(X, Y)$.
Example continued

• But the values of $\sigma_X^2$, $\sigma_Y^2$, and $\sigma_{XY}$ are unknown.
• We can compute estimates for these quantities, $\hat\sigma_X^2$, $\hat\sigma_Y^2$, and $\hat\sigma_{XY}$, using a data set that contains measurements for X and Y.
• We can then estimate the value of α that minimizes the variance of our investment using
  $$\hat\alpha = \frac{\hat\sigma_Y^2 - \hat\sigma_{XY}}{\hat\sigma_X^2 + \hat\sigma_Y^2 - 2\hat\sigma_{XY}}.$$
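As a small illustration (not from the slides), the plug-in estimate can be computed from observed returns x and y with a few lines of Python:

```python
import numpy as np

def alpha_hat(x, y):
    """Plug-in estimate of alpha from sample variances and covariance."""
    c = np.cov(x, y)                              # 2x2 sample covariance matrix
    var_x, var_y, cov_xy = c[0, 0], c[1, 1], c[0, 1]
    return (var_y - cov_xy) / (var_x + var_y - 2 * cov_xy)
```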
Example continued
[Figure: scatterplots of Y versus X for four simulated data sets of returns, each used to obtain an estimate of α.]
Example continued
• The mean over all 1,000 estimates for α is
  $$\bar\alpha = \frac{1}{1000}\sum_{r=1}^{1000}\hat\alpha_r = 0.5996.$$
[Figure: histograms of the estimates of α from the simulated data sets and from the bootstrap, together with boxplots comparing the two ("True" versus "Bootstrap").]
[Figure: bootstrap schematic: the original data Z (three observations with variables X and Y) is repeatedly sampled with replacement to give bootstrap data sets Z*1, Z*2, ..., Z*B, each yielding an estimate α̂*1, α̂*2, ..., α̂*B.]
[Figure: schematic comparing the real world, where a sample Z = (z_1, z_2, ..., z_n) from the population P yields the estimate f(Z), with the bootstrap world, where a bootstrap sample Z* = (z_1*, z_2*, ..., z_n*) from the estimated population P̂ yields f(Z*).]
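A minimal Python sketch of the procedure in the schematic, not from the slides: resample the observed pairs with replacement B times, recompute α̂ each time, and take the standard deviation of the B estimates as the bootstrap estimate of SE(α̂). The alpha_hat helper repeats the earlier sketch so the block is self-contained.

```python
import numpy as np

def alpha_hat(x, y):
    c = np.cov(x, y)
    return (c[1, 1] - c[0, 1]) / (c[0, 0] + c[1, 1] - 2 * c[0, 1])

def bootstrap_se_alpha(x, y, B=1000, seed=0):
    """Bootstrap estimate of the standard error of alpha_hat."""
    rng = np.random.default_rng(seed)
    n = len(x)
    alphas = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)          # sample n indices with replacement
        alphas[b] = alpha_hat(x[idx], y[idx])
    return alphas.std(ddof=1)
```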
The bootstrap in general
Other uses of the bootstrap
Can the bootstrap estimate prediction error?
Removing the overlap
Pre-validation
Motivating example
Results
Comparison of the microarray predictor with some clinical predictors, using logistic regression with outcome prognosis:

Model          Coef     Stand. Err.   Z score   p-value
Re-use
  microarray    4.096    1.092         3.753     0.000
  angio         1.208    0.816         1.482     0.069
  er           -0.554    1.044        -0.530     0.298
  grade        -0.697    1.003        -0.695     0.243
  pr            1.214    1.057         1.149     0.125
  age          -1.593    0.911        -1.748     0.040
  size          1.483    0.732         2.026     0.021
Pre-validated
  microarray    1.549    0.675         2.296     0.011
  angio         1.589    0.682         2.329     0.010
  er           -0.617    0.894        -0.690     0.245
  grade         0.719    0.720         0.999     0.159
  pr            0.537    0.863         0.622     0.267
  age          -1.471    0.701        -2.099     0.018
  size          0.998    0.594         1.681     0.046
Idea behind Pre-validation
Pre-validation process
[Figure: schematic of the pre-validation process: for each block of observations, a predictor is built from the omitted data and used to form a pre-validated predictor, which enters a logistic regression together with the fixed predictors and the response.]
Pre-validation in detail for this example
The Bootstrap versus Permutation tests