ML 2024 Part 2: Shrinkage Estimators
A. Colin Cameron
University of California - Davis
April 2024
1. Introduction
Consider a linear regression model with p potential regressors, where p is too large.
Methods that reduce the model complexity are
1. choose a subset of regressors (previous slides)
2. shrink regression coefficients towards zero (these slides)
3. reduce the dimension of the regressors
   - principal components analysis (later slides).
Overview
1. Introduction
2. Shrinkage: Variance-bias trade-off
3. Shrinkage methods
   1. Ridge regression
   2. LASSO
   3. Elastic net
   4. Asymptotic Properties of Lasso
   5. Clustered data
4. Generated data
5. Prediction using LASSO, ridge and elastic net
   1. Lasso command
   2. Lasso linear regression example
   3. Lasso postestimation commands example
   4. Adaptive lasso
   5. Elastic net and ridge regression
6. Shrinkage for logit, probit and Poisson
2. Shrinkage: Variance-Bias Trade-offs
The expected squared prediction error at $x_0$ decomposes as
$$E[(y_0 - \hat f(x_0))^2] = \mathrm{Var}[\hat f(x_0)] + \{\mathrm{Bias}(\hat f(x_0))\}^2 + \mathrm{Var}(u).$$
Variance-bias trade-off
Shrinkage is one method that is biased, but the bias may lead to lower squared error loss
- first show this for estimation of a parameter $\theta$
- then show this for prediction of $y$.
The mean squared error of a scalar estimator $\tilde\theta$ is
$$\mathrm{MSE}(\tilde\theta) = E[(\tilde\theta - \theta)^2]
= E[\{(\tilde\theta - E[\tilde\theta]) + (E[\tilde\theta] - \theta)\}^2]
= E[(\tilde\theta - E[\tilde\theta])^2] + (E[\tilde\theta] - \theta)^2 + 2\cdot 0
= \mathrm{Var}(\tilde\theta) + \mathrm{Bias}^2(\tilde\theta).$$
So we have
- unbiased $\hat\theta$: $\mathrm{MSE}(\hat\theta) = v$
- biased $\tilde\theta = a\hat\theta$: $\mathrm{MSE}(\tilde\theta) = a^2 v + (a-1)^2\theta^2$.
Then $\mathrm{MSE}(\tilde\theta) < \mathrm{MSE}(\hat\theta)$ if $\theta^2 < \frac{1+a}{1-a}\,v$.
- e.g. if $\tilde\theta = 0.9\,\hat\theta$ then $\tilde\theta$ has lower MSE for $\theta^2 < 19v$!
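To see where this threshold comes from, here is a short derivation of the inequality above (for $0 < a < 1$):
$$a^2 v + (a-1)^2\theta^2 < v \;\Longleftrightarrow\; (1-a)^2\theta^2 < (1-a^2)v \;\Longleftrightarrow\; \theta^2 < \frac{1+a}{1-a}\,v,$$
and with $a = 0.9$ the bound is $(1.9/0.1)\,v = 19v$, matching the example.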
We will consider
- the ridge estimator: shrinks towards zero
- the LASSO estimator: selects and shrinks towards zero.
James-Stein estimator
So bias in $\tilde\beta$ that reduces MSE($\tilde\beta$) also reduces MSE($\tilde y_0$).
3. Shrinkage Methods
3.1 Ridge regression
The ridge estimator $\hat\beta_\lambda$ minimizes
$$Q_\lambda(\beta) = \frac{1}{n}\sum_{i=1}^n (y_i - x_i'\beta)^2 + \lambda\sum_{j=1}^p \beta_j^2 = \mathrm{RSS} + \lambda(\|\beta\|_2)^2,$$
or equivalently minimizes $\frac{1}{n}\sum_{i=1}^n (y_i - x_i'\beta)^2$ subject to $\sum_{j=1}^p \beta_j^2 \le s$.
Ridge Derivation
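A minimal sketch of the algebra, assuming the 1/n-scaled objective above written in matrix form:
$$Q_\lambda(\beta) = \frac{1}{n}(y - X\beta)'(y - X\beta) + \lambda\,\beta'\beta, \qquad
\frac{\partial Q_\lambda(\beta)}{\partial\beta} = -\frac{2}{n}X'(y - X\beta) + 2\lambda\beta = 0
\;\Rightarrow\; \hat\beta_\lambda = (X'X + n\lambda I_p)^{-1}X'y.$$
Setting $\lambda = 0$ returns OLS, while letting $\lambda \to \infty$ drives $\hat\beta_\lambda$ to zero, consistent with the properties listed next.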
Ridge Properties
$\hat\beta_\lambda \to 0$ as $\lambda \to \infty$, and $\hat\beta_\lambda \to \hat\beta_{\mathrm{OLS}}$ as $\lambda \to 0$.
Ridge is best when many predictors are important, with coefficients of similar size.
Ridge is best when least squares has high variance
- meaning small changes in the training data can lead to large changes in the OLS coefficient estimates.
Algorithms exist to quickly compute $\hat\beta_\lambda$ for many values of $\lambda$
- then choose $\lambda$ by cross validation or AIC or BIC
- with search over a decreasing logarithmic grid in $\lambda$.
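For illustration, the ridge path and the CV choice of $\lambda$ can be computed with the same Stata commands the deck uses later; the data set and options here are assumptions that mirror those later examples, not output from this slide:

. * Ridge: coefficient path over a lambda grid, with lambda chosen by 5-fold CV
. elasticnet linear y x1 x2 x3, alpha(0) rseed(10101) folds(5)
. cvplot        // plot of CV mean prediction error against lambda
. lassoknots    // knots of the lambda grid and the CV-minimizing lambda (marked *)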
More on Ridge
3.2 LASSO
The LASSO estimator $\hat\beta_\lambda$ minimizes
$$Q_\lambda(\beta) = \frac{1}{n}\sum_{i=1}^n (y_i - x_i'\beta)^2 + \lambda\sum_{j=1}^p |\beta_j| = \mathrm{RSS} + \lambda\|\beta\|_1.$$
Features
- best when a few regressors have $\beta_j \neq 0$ and most $\beta_j = 0$
- leads to a more interpretable model than ridge.
More generally $Q_\lambda(\beta) = \frac{1}{n}\sum_{i=1}^n (y_i - x_i'\beta)^2 + \lambda\sum_{j=1}^p \kappa_j|\beta_j|$.
where $\hat\beta_{(M)}$ is the $M$th largest OLS coefficient.
More on LASSO
LASSO is due to Tibshirani (1996).
Can weight each $\beta_j$ differently
- implemented in Stata lasso commands.
Can specify some variables to be always included.
The group lasso allows regressors to be included as groups (e.g. race dummies as a group)
- with $L$ groups, minimize over $\beta$
$$\frac{1}{n}\sum_{i=1}^n \Big(y_i - \sum_{l=1}^L x_{il}'\beta_l\Big)^2 + \lambda \sum_{l=1}^L \rho_l \sqrt{\sum_{j=1}^{p_l} |\beta_{lj}|}\,.$$
3.3 Elastic net
The elastic net estimator minimizes
$$Q_{\lambda,\alpha}(\beta) = \frac{1}{n}\sum_{i=1}^n (y_i - x_i'\beta)^2 + \lambda\sum_{j=1}^p \big\{\alpha|\beta_j| + (1-\alpha)\beta_j^2\big\}.$$
Key papers
Peter Bickel, Ya'acov Ritov and Alexandre Tsybakov (2009), "Simultaneous Analysis of Lasso and Dantzig Selector", The Annals of Statistics, 1705-1732.
- In reading this just consider Lasso (ignore Dantzig Selector).
- lays out the typical assumptions well
  - including sparsity and $y_i = f(z_i) + u_i$, $u_i$ i.i.d. $N(0, \sigma^2)$
- lays out finite-sample bounds for prediction loss
  - key are assumptions on the eigenvalues of the Gram matrix $X'X$
  - similar to restrictions on the correlations among regressors.
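For concreteness, one common way to formalize the sparsity assumption referred to above (the notation here is illustrative, not necessarily the paper's):
$$y_i = x_i'\beta^0 + u_i, \qquad u_i \ \text{i.i.d.}\ N(0,\sigma^2), \qquad s \equiv \#\{j : \beta^0_j \neq 0\} \ll n,$$
so only $s$ of the $p$ potential coefficients are nonzero, even though $p$ may be large relative to $n$.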
Postestimation commands
. * Summarize data
. summarize
. correlate
(obs=40)
             x1      x2      x3       y
   x1    1.0000
   x2    0.5077  1.0000
   x3    0.4281  0.2786  1.0000
    y    0.4740  0.3370  0.2046  1.0000
. quietly summarize y
. summarize ydemeaned z*
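The postestimation output below (lassoknots, lassocoef, lassogof) presupposes a fitted lasso, but the estimation command itself does not appear in this extract. The following is a hedged sketch only, reusing the d.g.p. stated later (y = 2 + 1 x1 + 0 x2 + 0 x3 + u) and seed/fold settings that mirror the deck's elasticnet calls; independent draws will not reproduce the regressor correlations shown above:

. * Sketch only: simulate data consistent with the stated d.g.p., then lasso with 5-fold CV
. clear
. set obs 40
. set seed 10101
. generate x1 = rnormal()
. generate x2 = rnormal()
. generate x3 = rnormal()
. generate y = 2 + 1*x1 + 0*x2 + 0*x3 + rnormal()
. lasso linear y x1 x2 x3, selection(cv, folds(5)) rseed(10101)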
lassoknots command
                No. of    CV mean
                nonzero     pred.   Variables (A)dded, (R)emoved,
    ID   lambda   coef.     error        or left (U)nchanged
     2  1.450138      1   11.60145   A x1
    11  .6277301      2   9.739804   A x2
  * 21  .2475897      2   9.393523   U
    24  .1872925      2   9.434326   U
cvplot command
lassoinfo command
Estimate: active
Command:  lasso

                                                              No. of
                        Selection   Selection                 selected
 Depvar       Model       method    criterion      lambda     variables
lassocoef command
We have

            active
 x1       1.206056
 x2       .2715635
 _cons           0

Legend:
  b - base level
  e - empty cell
  o - omitted
Recall the d.g.p. y = 2 + 1 x1 + 0 x2 + 0 x3 + u.

            active
 x1        1.35914
 x2       .2918877
 _cons    2.617622

Post-selection (post-LASSO)

            active
 x1       1.544198
 x2       .4683922
 _cons    2.533663
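The three coefficient tables above are consistent with different lassocoef display options; a sketch of commands that produce output in these formats (assumed, since the exact commands are not shown in this extract):

. lassocoef, display(coef, standardized)     // standardized penalized coefficients
. lassocoef, display(coef, penalized)        // penalized coefficients in the original scale
. lassocoef, display(coef, postselection)    // post-selection (post-LASSO) OLS coefficients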
lassogof command
Goodness-of-fit
- as expected, post-lasso OLS fits better than lasso in the full sample
- since OLS minimizes MSE while lasso minimizes MSE plus a penalty.
. lassogof

Penalized coefficients
           MSE    R-squared        Obs
      8.679274       0.2300         40

. lassogof, postselection

Postselection coefficients
           MSE    R-squared        Obs
      8.597958       0.2372         40
Adaptive Lasso
Method that usually leads to fewer variables than the basic lasso.
- First do lasso as usual with $\kappa_j = 1$, since then $\lambda\sum_{j=1}^p \kappa_j|\beta_j| = \lambda\sum_{j=1}^p |\beta_j|$.
- Second, exclude the $x_j$ with $\hat\beta_j = 0$ and for the remainder set $\kappa_j = 1/|\hat\beta_j|^\delta$, with default $\delta = 1$.
Here only x1 is selected.
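A sketch of an adaptive-lasso command that would produce lassoknots output like the listing below; the options are assumptions chosen to mirror the deck's other calls, not the deck's actual code:

. * Sketch only: adaptive lasso (lasso step, then re-lasso with weights 1/|b|^delta)
. lasso linear y x1 x2 x3, selection(adaptive) rseed(10101)
. lassoknots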
. lassoknots
                No. of    CV mean
                nonzero     pred.   Variables (A)dded, (R)emoved,
    ID   lambda   coef.     error        or left (U)nchanged
    26  3.945214      1   11.60145   A x1
  * 52  .3512089      1   9.160539   U
    57  .2205694      2   9.210699   A x2
    95  .0064297      2   9.172378   U
Recall the elastic net objective
$$Q_{\lambda,\alpha}(\beta) = \frac{1}{n}\sum_{i=1}^n (y_i - x_i'\beta)^2 + \lambda\sum_{j=1}^p \big\{\alpha|\beta_j| + (1-\alpha)\beta_j^2\big\}.$$
Ridge Regression
Standardized ridge (OLS estimates were 1.555, 0.471 and -0.026).
. * Ridge estimation using the elasticnet command and selected results
. qui elasticnet linear y x1 x2 x3, alpha(0) rseed(10101) folds(5)
. lassoknots
                              No. of    CV mean
                              nonzero     pred.   Variables (A)dded, (R)emoved,
 alpha      ID      lambda      coef.     error        or left (U)nchanged
 0.000
             1    1591.525          3   11.9595    A x1 x2 x3
 *          93    .3052401          3   9.54017    U
           100    .1591525          3   9.566065   U
            active
 x1       1.139476
 x2       .4865453
 x3       .0958546
 _cons    2.659647
Elastic net
Default is a 100-point logarithmic grid in $\lambda$ with $\alpha$ = 0.5, 0.75, 1.0
- here the default grid selected $\alpha$ = 1.0 (lasso), so narrow the $\alpha$ grid to 0.90, 0.95, 1.00
- optimal $\alpha$ = 0.95, $\lambda$ = 0.2717, and x1 and x2 selected.
. * Elastic net estimation and selected results
. qui elasticnet linear y x1 x2 x3, alpha(0.9(0.05)1) rseed(10101) folds(5)
. lassoknots
                              No. of    CV mean
                              nonzero     pred.   Variables (A)dded, (R)emoved,
 alpha      ID      lambda      coef.     error        or left (U)nchanged
 1.000
             4    1.450138          1   11.60145   A x1
            13    .6277301          2   9.739804   A x2
            26    .1872925          2   9.434326   U
 0.950
            29    1.591525          1   11.73019   A x1
            38     .688933          2   9.81611    A x2
 *          48    .2717294          2   9.3884     U
            51    .2055533          2   9.425887   U
 0.900
            53    1.675289          1   11.74015   A x1
            62    .7561031          2   9.900317   A x2
            76    .2055533          2   9.431641   U
Penalized coefficients
For logit, probit and Poisson models the objective function becomes
$$Q_\lambda(\beta) = \sum_{i=1}^n q(y_i, x_i, \beta) + \lambda\sum_{j=1}^p |\beta_j|,$$
where $q(y_i, x_i, \beta)$ is the per-observation loss (e.g. the negative of the log density).
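A sketch of the corresponding Stata commands; the outcome names dy (binary) and cy (count) are hypothetical placeholders, since the command that produced the output below is not shown in this extract:

. * Sketch only: lasso for binary and count outcomes
. lasso logit   dy x1 x2 x3, selection(cv) rseed(10101)
. lasso probit  dy x1 x2 x3, selection(cv) rseed(10101)
. lasso poisson cy x1 x2 x3, selection(cv) rseed(10101)
. lassoknots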
. lassoknots
                No. of
                nonzero    CV mean   Variables (A)dded, (R)emoved,
    ID   lambda   coef.   deviance        or left (U)nchanged
     2  .2065674      1   1.407613   A x1
  * 24  .0266792      1   1.192646   U
    26  .0221495      2   1.192865   A x2
    30  .0152668      3   1.194545   A x3
    31  .0139106      3   1.195055   U
Details
Ensemble Weights
Ensemble weights are similar to portfolio diversification.
Example: $X_1 \sim (\mu, \sigma^2)$ independent of $X_2 \sim (\mu, \sigma^2)$; then
$$\mathrm{Var}[(X_1 + X_2)/2] = \tfrac{1}{4}\{\mathrm{Var}[X_1] + \mathrm{Var}[X_2]\} = \tfrac{\sigma^2}{2} < \mathrm{Var}[X_1] = \sigma^2.$$
- the benefit is less the more correlated are $X_1$ and $X_2$.
So consider a linear combination of predictions.
For each ML method create 10,000 predictions in the training sample as follows
- for each of the eight folds, estimate (using the optimal tuning parameter(s)) on the other seven folds and predict on the remaining fold
- this gives $(10{,}000 \times 1)$ vectors $\hat y_{OLS}$, $\hat y_{REGTREE}$, $\hat y_{LASSO}$, $\hat y_{RF}$.
The ensemble weights are the $\hat\alpha$'s from the OLS regression in the training sample
$$y_i = \alpha_0 + \alpha_1 \hat y_{OLS,i} + \alpha_2 \hat y_{REGTREE,i} + \alpha_3 \hat y_{LASSO,i} + \alpha_4 \hat y_{RF,i} + u_i.$$
These ensemble weights are also used in the holdout sample exercise.
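A do-file style sketch of the cross-fitted predictions for one learner (lasso); the variable names, fold assignment and lasso options are placeholders rather than the deck's actual code:

* Sketch only: out-of-fold (cross-fitted) lasso predictions over 8 folds
set seed 10101
generate fold = ceil(8*runiform())            // assign each observation to one of 8 folds
generate double yhat_lasso = .
forvalues k = 1/8 {
    quietly lasso linear y x1 x2 x3 if fold != `k', selection(cv) rseed(10101)
    quietly predict double tmp if fold == `k'  // predict only on the held-out fold
    quietly replace yhat_lasso = tmp if fold == `k'
    drop tmp
}
* Repeat for the OLS, regression tree and random forest predictions, then obtain the
* ensemble weights as the coefficients from
*   regress y yhat_ols yhat_regtree yhat_lasso yhat_rf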
5. Prediction for Economics 5.1 Predict Housing Prices
Further Details
6. Some R Commands
8. References
ISLR2: Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani (2021), An Introduction to Statistical Learning: with Applications in R, 2nd Ed., Springer.
ISLP: Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani and Jonathan Taylor (2023), An Introduction to Statistical Learning: with Applications in Python, Springer.
- Free PDF from https://www.statlearning.com/ and $40 softcover book via Springer MyCopy.
ESL: Trevor Hastie, Robert Tibshirani and Jerome Friedman (2009), The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer.
- PDF and $40 softcover book at https://link.springer.com/book/10.1007/978-0-387-84858-7.
Geron2: Aurélien Géron (2022), Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow, Third edition, O'Reilly.
A. Colin Cameron and Pravin K. Trivedi (2022), Microeconometrics using Stata,
Second edition, Chapters 28.3-28.4.
EH: Bradley Efron and Trevor Hastie (2016), Computer Age Statistical Inference: Algorithms, Evidence, and Data Science, Cambridge University Press.