
Machine Learning for Microeconometrics

Part 2: Shrinkage estimators

A. Colin Cameron
University of California - Davis

April 2024

Course Outline
1. Variable selection and cross validation
2. Shrinkage methods
  - ridge, lasso, elastic net
3. ML for causal inference using lasso
  - OLS with many controls, IV with many instruments
4. Other methods for prediction
  - nonparametric regression, principal components, splines
  - neural networks
  - regression trees, random forests, bagging, boosting
5. More ML for causal inference
  - ATE with heterogeneous effects and many controls
6. Classification and unsupervised learning
  - classification (categorical y) and unsupervised learning (no y).

1. Introduction
Consider the linear regression model with p potential regressors, where p is
too large.
Methods that reduce the model complexity are
  1. choose a subset of regressors (previous slides)
  2. shrink regression coefficients towards zero (these slides)
  3. reduce the dimension of the regressors
     - principal components analysis (later slides).
Linear regression may predict well if we include interactions and powers
as potential regressors.
And the methods can be adapted to alternative loss functions for
estimation.
Shrinkage is also called regularization
  - lasso, ridge, elastic net.


Overview
1. Introduction
2. Shrinkage: variance-bias trade-off
3. Shrinkage methods
  3.1 Ridge regression
  3.2 LASSO
  3.3 Elastic net
  3.4 Asymptotic properties of LASSO
  3.5 Clustered data
4. Prediction using LASSO, ridge and elastic net
  4.1 lasso command
  4.2 Lasso linear regression example
  4.3 Lasso postestimation commands example
  4.4 Adaptive lasso
  4.5 Elastic net and ridge regression
  4.6 Comparison of shrinkage estimators
  4.7 Shrinkage for logit, probit and Poisson
5. Prediction for economics
6. Some R commands
7. Some Python commands
8. References

2. Shrinkage: variance-bias trade-offs

Consider prediction in the regression model

   y = f(x) + u  with E[u] = 0 and u ⊥ x.

For an out-of-estimation-sample point (y_0, x_0) the true prediction error is

   E[(y_0 − f̂(x_0))^2] = Var[f̂(x_0)] + {Bias(f̂(x_0))}^2 + Var(u).

The last term Var(u) is called the irreducible error
  - we can do nothing about this.
So we need to minimize the sum of the variance and the bias squared!
  - more flexible models have less bias (good) and more variance (bad)
  - this trade-off is fundamental to machine learning.


Variance-bias trade-off

Shrinkage is a method that is biased, but the bias may lead to lower
squared error loss
  - first we show this for estimation of a parameter θ
  - then we show this for prediction of y.
The mean squared error of a scalar estimator θ̃ is

   MSE(θ̃) = E[(θ̃ − θ)^2]
           = E[{(θ̃ − E[θ̃]) + (E[θ̃] − θ)}^2]
           = E[(θ̃ − E[θ̃])^2] + (E[θ̃] − θ)^2 + 2 × 0
           = Var(θ̃) + Bias^2(θ̃)

  - as the cross-product term 2 E[(θ̃ − E[θ̃])(E[θ̃] − θ)] = 2 (E[θ̃] − θ) E[θ̃ − E[θ̃]] = 0,
    since E[θ̃] − θ is a constant and E[θ̃ − E[θ̃]] = 0.


Bias can reduce estimator MSE: a shrinkage example

Suppose the scalar estimator θ̂ is unbiased for θ with
  - E[θ̂] = θ and Var[θ̂] = v
  - so MSE(θ̂) = v.
Construct the shrinkage estimator θ̃ = a θ̂ where 0 ≤ a ≤ 1.
  - Bias: Bias(θ̃) = E[θ̃] − θ = aθ − θ = (a − 1)θ.
  - Variance: Var[θ̃] = Var[a θ̂] = a^2 Var(θ̂) = a^2 v.
  - So MSE(θ̃) = Var[θ̃] + Bias^2(θ̃) = a^2 v + (a − 1)^2 θ^2.


Shrinkage example continued

So we have

   Unbiased:  θ̂           MSE(θ̂) = v
   Biased:    θ̃ = a θ̂     MSE(θ̃) = a^2 v + (a − 1)^2 θ^2

Then MSE(θ̃) < MSE(θ̂) if θ^2 < [(1 + a)/(1 − a)] v
  - e.g. if θ̃ = 0.9 θ̂ then θ̃ has lower MSE for θ^2 < 19v!
    (a quick numerical check follows below).
We will consider
  - the ridge estimator, which shrinks towards zero
  - the LASSO estimator, which selects and shrinks towards zero.
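To make the trade-off concrete, here is a minimal Python simulation (an
illustrative sketch, not part of the original slides) that compares the two
MSEs for one choice of θ, v and a:

   # Minimal simulation checking that theta_tilde = a*theta_hat can have lower
   # MSE than the unbiased theta_hat when theta^2 < (1+a)/(1-a) * v.
   import numpy as np

   rng = np.random.default_rng(0)
   theta, v, a = 2.0, 1.0, 0.9          # here theta^2 = 4 < 19*v, so shrinkage should win
   theta_hat = theta + np.sqrt(v) * rng.standard_normal(100_000)   # unbiased, variance v
   theta_tilde = a * theta_hat                                     # shrinkage estimator

   mse_hat = np.mean((theta_hat - theta) ** 2)      # approx v = 1
   mse_tilde = np.mean((theta_tilde - theta) ** 2)  # approx a^2 v + (a-1)^2 theta^2 = 0.85
   print(mse_hat, mse_tilde)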


James-Stein estimator

Suppose y_i ~ N(µ_i, 1), i = 1, ..., n.
The MLE is µ̂_i = y_i with MSE(µ̂_i) = 1.
The James-Stein estimator is µ̃_i = (1 − c) y_i + c ȳ
  - where c = (n − 3) / ∑_{i=1}^n (y_i − ȳ)^2 and n ≥ 4
  - this shrinks towards the sample mean ȳ.
Then MSE(µ̃_i) < MSE(µ̂_i) for n ≥ 4!
This remarkable 1950s/1960s result was a big surprise
  - an estimator has lower MSE than the maximum likelihood estimator.
The estimator can be given an empirical Bayes interpretation
(a simulation sketch follows below).
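A minimal simulation sketch (not from the slides) of the shrink-to-the-mean
James-Stein estimator above; the total squared-error loss is compared with
that of the MLE, with the true means chosen arbitrarily for the demo:

   # Illustrative sketch: James-Stein (shrink toward the sample mean) vs the MLE.
   # Assumes y_i ~ N(mu_i, 1); compares total squared-error loss over many replications.
   import numpy as np

   rng = np.random.default_rng(0)
   n = 10
   mu = np.linspace(-1, 1, n)            # arbitrary true means (an assumption for the demo)

   loss_mle, loss_js = 0.0, 0.0
   for _ in range(10_000):
       y = mu + rng.standard_normal(n)   # MLE of mu_i is y_i
       c = (n - 3) / np.sum((y - y.mean()) ** 2)
       mu_js = (1 - c) * y + c * y.mean()
       loss_mle += np.sum((y - mu) ** 2)
       loss_js += np.sum((mu_js - mu) ** 2)

   print(loss_mle / 10_000, loss_js / 10_000)   # James-Stein total risk is smaller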


Bias can therefore reduce predictor MSE

Now consider prediction of y_0 = βx_0 + u where E[u] = 0
  - using ỹ_0 = β̃ x_0, where we treat the scalar x_0 as fixed.
Bias: Bias(ỹ_0) = E[x_0 β̃] − βx_0 = x_0 (E[β̃] − β) = x_0 Bias(β̃).
Variance: Var[ỹ_0] = Var[x_0 β̃] = x_0^2 Var(β̃).
The mean squared error in the scalar regressor case is

   MSE(ỹ_0) = Var(ỹ_0) + Bias^2(ỹ_0) + Var(u)
            = x_0^2 Var(β̃) + (x_0 Bias(β̃))^2 + Var(u)
            = x_0^2 {Var(β̃) + Bias^2(β̃)} + Var(u)
            = x_0^2 MSE(β̃) + Var(u).

So bias in β̃ that reduces MSE(β̃) also reduces MSE(ỹ_0).


3. Shrinkage Methods

Shrinkage estimators minimize the RSS (residual sum of squares) with a
penalty for overfitting the data at hand
  - this shrinks parameter estimates towards zero.
The extent of shrinkage is determined by a tuning parameter
  - this is determined by cross-validation or a penalty such as AIC or BIC.
Standardize the regressors, as ridge, LASSO and elastic net are not
invariant to rescaling of the regressors
  - so x_ij below is actually (x_ij − x̄_j)/s_j
  - and demean y_i, so below y_i is actually y_i − ȳ
  - x_i does not include an intercept, nor does the data matrix X
  - we can recover the intercept β_0 as β̂_0 = ȳ.
So we work with y = x'β + ε = β_1 x_1 + β_2 x_2 + ... + β_p x_p + ε.


3.1 Ridge Regression


The simplest form of the ridge estimator β̂_λ of β minimizes

   Q_λ(β) = (1/n) ∑_{i=1}^n (y_i − x_i'β)^2 + λ ∑_{j=1}^p β_j^2 = (1/n) RSS + λ (||β||_2)^2

  - where λ ≥ 0 is a tuning parameter to be determined
  - ||β||_2 = (∑_{j=1}^p β_j^2)^{1/2} is the L2 norm.
Equivalently the ridge estimator minimizes

   (1/n) ∑_{i=1}^n (y_i − x_i'β)^2  subject to  ∑_{j=1}^p β_j^2 ≤ s.

The ridge estimator has the closed form (see the sketch below)

   β̂_λ = (X'X + nλI)^{-1} X'y.

More generally we can weight each β_j

  - Q_λ(β) = (1/n) ∑_{i=1}^n (y_i − x_i'β)^2 + λ ∑_{j=1}^p κ_j β_j^2.
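A minimal numpy sketch (illustrative, with made-up data, not from the slides)
of the closed-form ridge estimator above; it assumes the regressors are
already standardized and y demeaned, as described on the previous slide:

   # Illustrative sketch of the closed-form ridge estimator (X'X + n*lambda*I)^{-1} X'y.
   # Assumes X has standardized columns (no intercept) and y is demeaned; data are made up.
   import numpy as np

   rng = np.random.default_rng(10101)
   n, p, lam = 40, 3, 0.5
   X = rng.standard_normal((n, p))
   y = 1.0 * X[:, 0] + 3 * rng.standard_normal(n)   # made-up d.g.p. where only x1 matters
   X = (X - X.mean(axis=0)) / X.std(axis=0)         # standardize regressors
   y = y - y.mean()                                 # demean y

   b_ridge = np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)
   b_ols = np.linalg.solve(X.T @ X, X.T @ y)
   print(b_ols, b_ridge)    # ridge coefficients are shrunk towards zero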


Ridge Derivation

1. The objective function includes the penalty
  - Q(β) = (1/n)(y − Xβ)'(y − Xβ) + λβ'β
  - ∂Q(β)/∂β = −(2/n) X'(y − Xβ) + 2λβ = 0
  - ⟹ X'Xβ + nλIβ = X'y
  - ⟹ β̂_λ = (X'X + nλI)^{-1} X'y.

2. Form the Lagrangian (multiplier λ) from the objective function and the
constraint
  - Q(β) = (1/n)(y − Xβ)'(y − Xβ) and constraint β'β ≤ s
  - L(β, λ) = (1/n)(y − Xβ)'(y − Xβ) + λ(β'β − s)
  - ∂L(β, λ)/∂β = −(2/n) X'(y − Xβ) + 2λβ = 0
  - ⟹ β̂_λ = (X'X + nλI)^{-1} X'y
  - here λ = −∂L_opt(β, λ, s)/∂s.


Ridge Properties

β̂_λ → 0 as λ → ∞ and β̂_λ → β̂_OLS as λ → 0.
Ridge is best when many predictors are important with coefficients of
similar size.
Ridge is best when LS has high variance
  - meaning small changes in the training data can lead to large changes in
    the OLS coefficient estimates.
Algorithms exist to quickly compute β̂_λ for many values of λ
  - then choose λ by cross validation or AIC or BIC
  - with a search over a decreasing logarithmic grid in λ.

More on Ridge

Also called Tikhonov regularization.
Hoerl and Kennard (1970) proposed ridge as a way to reduce the MSE of β̂.
We can write ridge as β̂_λ = (X'X + nλI)^{-1} X'X β̂_OLS
  - so ridge shrinks OLS toward zero.
For a scalar regressor and no intercept, β̂_λ = a β̂_OLS where
a = ∑_i x_i^2 / (∑_i x_i^2 + nλ)
  - like the earlier example of β̃ = a β̂.
Ridge is the posterior mean for y ~ N(Xβ, σ^2 I) with prior β ~ N(0, γ^2 I)
  - though γ is a specified prior parameter whereas λ is data-determined.
Ridge is the estimator in the model y ~ (Xβ, σ^2 I) with stochastic constraints
β ~ (0, γ^2 I).


3.2 LASSO (Least Absolute Shrinkage And Selection Operator)

The LASSO estimator β̂_λ of β minimizes

   Q_λ(β) = (1/n) ∑_{i=1}^n (y_i − x_i'β)^2 + λ ∑_{j=1}^p |β_j| = (1/n) RSS + λ ||β||_1

  - where λ ≥ 0 is a tuning parameter to be determined
  - ||β||_1 = ∑_{j=1}^p |β_j| is the L1 norm.
Equivalently the LASSO estimator minimizes

   (1/n) ∑_{i=1}^n (y_i − x_i'β)^2  subject to  ∑_{j=1}^p |β_j| ≤ s.

Features
  - best when a few regressors have β_j ≠ 0 and most β_j = 0
  - leads to a more interpretable model than ridge.
More generally Q_λ(β) = (1/n) ∑_{i=1}^n (y_i − x_i'β)^2 + λ ∑_{j=1}^p κ_j |β_j|.

LASSO versus Ridge (key figure from ISL)


LASSO is likely to set some coefficients to zero.


LASSO versus Ridge


Consider the simple case where n = p and X = I_n (the identity matrix).

OLS: β̂_OLS = (I'I)^{-1} I'y = y, so β̂_j^OLS = y_j.

Ridge shrinks all the β_j's towards zero:

   β̂^R = (I'I + λI)^{-1} I'y = y/(1 + λ), so β̂_j^R = y_j/(1 + λ).

LASSO shrinks some β_j's towards 0 and sets the others = 0:

   β̂_j^L = y_j − λ/2   if y_j > λ/2
         = y_j + λ/2   if y_j < −λ/2
         = 0           if |y_j| ≤ λ/2

   (a small sketch of this soft-thresholding rule follows below).

Aside: best subset of size M in this example is

   β̂_j^BS = β̂_j × 1[|β̂_j| ≥ |β̂_(M)|]

where β̂_(M) is the Mth largest (in absolute value) OLS coefficient.
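A minimal Python sketch (illustrative, not from the slides) of the ridge and
lasso rules in the orthonormal-design case above:

   # Illustrative sketch: ridge and lasso coefficients when X = I (orthonormal case),
   # using the soft-thresholding rule above with penalty parameter lam.
   import numpy as np

   def ridge_coef(y, lam):
       return y / (1.0 + lam)                 # shrink every coefficient

   def lasso_coef(y, lam):
       return np.sign(y) * np.maximum(np.abs(y) - lam / 2.0, 0.0)  # soft-threshold at lam/2

   y = np.array([3.0, 1.0, 0.4, -2.0])        # OLS coefficients equal y when X = I
   print(ridge_coef(y, 1.0))                  # [1.5, 0.5, 0.2, -1.0]
   print(lasso_coef(y, 1.0))                  # [2.5, 0.5, 0.0, -1.5]  (third one set to zero)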

Computation of LASSO estimator


Most common is a coordinate-wise descent algorithm
  - also called a shooting algorithm, due to Fu (1998)
  - it exploits the special structure in the nondifferentiable part of the
    LASSO objective function that makes convergence possible.
The algorithm for given λ (λ is later chosen by CV); a code sketch follows below.
  - denote β = (β_j, β_−j) and define S(β_j, β_−j) = ∂RSS/∂β_j
  - start with β̂ = β̂_OLS
  - at step m, for each j = 1, ..., p let S0 = S(0, β̂_−j) and set

       β̂_j = (λ − S0) / (2 x_j'x_j)     if S0 > λ
           = (−λ − S0) / (2 x_j'x_j)    if S0 < −λ
           = 0                          if −λ ≤ S0 ≤ λ

  - form the new β̂_m = [β̂_1 ... β̂_p] after updating all the β̂_j.
Alternatively, LASSO is a minor adaptation of least angle regression (LAR)
  - so estimate using the forward-stagewise algorithm for LAR.
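A minimal Python sketch (illustrative, made-up data) of the shooting algorithm
described above, written for the objective RSS + λ ∑ |β_j|:

   # Illustrative sketch of the shooting (coordinate descent) algorithm above.
   import numpy as np

   def lasso_shooting(X, y, lam, n_iter=200):
       n, p = X.shape
       beta = np.linalg.lstsq(X, y, rcond=None)[0]   # start at OLS
       for _ in range(n_iter):
           for j in range(p):
               r_j = y - X @ beta + X[:, j] * beta[j]    # partial residual excluding x_j
               S0 = -2.0 * X[:, j] @ r_j                 # S(0, beta_{-j}) = dRSS/dbeta_j at beta_j = 0
               denom = 2.0 * X[:, j] @ X[:, j]
               if S0 > lam:
                   beta[j] = (lam - S0) / denom
               elif S0 < -lam:
                   beta[j] = (-lam - S0) / denom
               else:
                   beta[j] = 0.0
       return beta

   rng = np.random.default_rng(10101)
   X = rng.standard_normal((40, 3))
   y = X[:, 0] + 3 * rng.standard_normal(40)
   print(lasso_shooting(X, y, lam=20.0))   # some coefficients are set exactly to zero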

More on LASSO
LASSO is due to Tibshirani (1996).
Can weight each β_j differently
  - implemented in the Stata lasso commands.
Can specify some variables to be always included.
The group lasso allows regressors to be included as groups (e.g. race
dummies as a group)
  - with L groups, minimize over β

     (1/n) ∑_{i=1}^n (y_i − ∑_{l=1}^L x_il'β_l)^2 + λ ∑_{l=1}^L ρ_l √(∑_{j=1}^{p_l} |β_lj|^2).

There are other extensions, such as the adaptive LASSO.
Giannone, Lenza and Primiceri (2021) find that sparse models (e.g.
LASSO) predict poorly in several standard economic applications;
shrinkage (e.g. ridge) predicts better.

3.3 Elastic net

Elastic net combines ridge regression and LASSO with objective function

   Q_λ,α(β) = (1/n) ∑_{i=1}^n (y_i − x_i'β)^2 + λ ∑_{j=1}^p {α|β_j| + (1 − α) β_j^2}.

  - the ridge penalty (the (1 − α) β_j^2 term) averages correlated variables
  - the LASSO penalty (the α|β_j| term) leads to sparsity.
For the elastic net
  - ridge is the special case α = 0
  - LASSO is the special case α = 1.
K-fold cross validation is used, with default K = 10
  - set the seed for replicability (a library-based sketch follows below).
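For reference, a scikit-learn sketch (an illustrative analogue; the deck itself
uses Stata's elasticnet command). Note that sklearn's parameterization differs
slightly from Q_λ,α above: alpha plays the role of λ and l1_ratio the role of
the mixing parameter α.

   # Illustrative sketch: elastic net, lasso and ridge with CV-chosen penalties in scikit-learn.
   import numpy as np
   from sklearn.linear_model import ElasticNetCV, LassoCV, RidgeCV

   rng = np.random.default_rng(10101)
   X = rng.standard_normal((40, 3))
   y = 2 + X[:, 0] + 3 * rng.standard_normal(40)

   enet = ElasticNetCV(l1_ratio=[0.9, 0.95, 1.0], cv=10, random_state=10101).fit(X, y)
   print(enet.l1_ratio_, enet.alpha_, enet.coef_)      # selected mixing weight, penalty, coefficients

   lasso = LassoCV(cv=10, random_state=10101).fit(X, y)        # pure L1 special case
   ridge = RidgeCV(alphas=np.logspace(-3, 3, 100)).fit(X, y)   # pure L2 special case
   print(lasso.coef_, ridge.coef_)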


3.4 Asymptotic Properties of Lasso


A model selection method is consistent if asymptotically it selects the
correct model from a set of candidate models
  - selecting on the basis of minimum BIC is consistent.
A model selection method is conservative if asymptotically it always
selects a model that nests the correct model
  - selecting a model on the basis of minimum AIC is conservative
  - Hannes Leeb and Benedikt M. Pötscher (2005), "Model Selection and
    Inference", Econometric Theory, 21-59.
A statistical model selection and estimation method is said to have the
oracle property if it leads to consistent model selection and a
subsequent estimator that is asymptotically equivalent to the estimator
that could be obtained if the true model were known, so that model
selection was unnecessary.


Asymptotic properties of Lasso

The LASSO is a consistent model selection procedure
  - but it does not have the oracle property, due to bias.
The oracle property is an asymptotic property
  - not that useful in the finite sample settings that economists encounter
    (our models do not fit perfectly)
  - and it gives rates for the penalty parameter but not its finite sample value.
Lasso estimates have a complicated finite sample distribution
  - we cannot perform standard inference on LASSO or post-LASSO estimates
  - instead add some model structure (e.g. the partially linear model).


3.5 Clustered Data

Now consider clustered data
  - by clustered data I mean "clustered errors"
  - data are grouped, with correlated observations within a group and
    uncorrelated observations across groups
  - examples are panel data and grouping by independent regions.
Notation: y_ig is the outcome for individual i in cluster g, i = 1, ..., N_g,
g = 1, ..., G.
Here we focus on the lasso as the machine learning method.


Lasso with Clustered Data


We want to generalize Q_λ(β) = (1/n) ∑_{i=1}^n (y_i − x_i'β)^2 + λ ∑_{j=1}^p |β_j|.

Method 1. With clustered data one can continue with this, in which case
equal weight is given to each observation:

   Q_λ(β) = (1/G) ∑_{g=1}^G ∑_{i=1}^{N_g} (y_ig − x_ig'β)^2 + λ ∑_{j=1}^p |β_j|.

Method 2. Alternatively one can give equal weight to each cluster:

   Q_λ(β) = (1/G) ∑_{g=1}^G (1/N_g) ∑_{i=1}^{N_g} (y_ig − x_ig'β)^2 + λ ∑_{j=1}^p |β_j|.

The Stata 17 option cluster(clustervar) of the lasso command does
method 2. A sketch of the two objectives is given below.
Which is best?
  - If data are independent within cluster, then 1?
  - If data are perfectly correlated within cluster, then 2?
  - And there is a big difference if cluster sizes are greatly unbalanced.
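A small Python sketch (illustrative, not from the slides) of the two clustered
objectives above, written as a plain function; cluster_id gives the cluster of
each observation and the data are made up:

   # Illustrative sketch of the two clustered lasso objectives (equal observation
   # weight vs equal cluster weight).
   import numpy as np

   def q_lambda(beta, X, y, cluster_id, lam, equal_cluster_weight=False):
       clusters = np.unique(cluster_id)
       total = 0.0
       for g in clusters:
           idx = cluster_id == g
           rss_g = np.sum((y[idx] - X[idx] @ beta) ** 2)
           total += rss_g / idx.sum() if equal_cluster_weight else rss_g  # method 2 vs method 1
       return total / len(clusters) + lam * np.sum(np.abs(beta))

   # toy check with unbalanced clusters
   rng = np.random.default_rng(0)
   cluster_id = np.array([0] * 30 + [1] * 5)
   X = rng.standard_normal((35, 2))
   y = X[:, 0] + rng.standard_normal(35)
   beta = np.array([1.0, 0.0])
   print(q_lambda(beta, X, y, cluster_id, lam=0.1),
         q_lambda(beta, X, y, cluster_id, lam=0.1, equal_cluster_weight=True))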

Key papers
Peter Bickel, Ya'acov Ritov and Alexandre Tsybakov (2009),
"Simultaneous Analysis of Lasso and Dantzig Selector", The Annals
of Statistics, 1705-1732.
  - in reading this just consider the Lasso (ignore the Dantzig Selector)
  - lays out the typical assumptions well
    * including sparsity and y_i = f(z_i) + u_i, u_i i.i.d. N(0, σ^2)
  - lays out finite sample bounds for prediction loss
    * key are assumptions on the eigenvalues of the Gram matrix X'X
    * similar to restrictions on the correlations among regressors.

Alex Belloni and Victor Chernozhukov (2013), "Least Squares after
Model Selection in High-Dimensional Sparse Models", Bernoulli,
521-547.
  - builds on the previous paper
  - harder to read, so read Bickel et al. first
  - OLS after LASSO predicts better than the LASSO estimates themselves.

4. Prediction using LASSO: Stata lasso command

lasso model depvar [(alwaysvars)] othervars, options

Model is one of
  - linear, logit, probit or poisson.
Option folds(#) sets the number of cross-validation folds.
The penalty parameter λ can be selected by
  - cross validation (selection(cv)), which sets all κ_j = 1
  - adaptive cross validation (selection(adaptive)), where the κ_j can vary
  - BIC (selection(bic))
  - the plug-in formula (selection(plugin)), for non-prediction applications.

Postestimation commands

The lasso command focuses on finding λ.
The following postestimation commands give more information:
  - lassoknots
  - lassoselect
  - cvplot
  - coefpath
  - lassoinfo
  - lassocoef
  - lassogof


4.2 LASSO linear regression example

Generated data example: n = 40, p = 3.
Three correlated regressors:

   (x_1i, x_2i, x_3i)' ~ N( (0, 0, 0)', Σ )  with  Σ = [1 0.5 0.5; 0.5 1 0.5; 0.5 0.5 1].

But only x1 determines y:
  - y_i = 2 + x_1i + u_i where u_i ~ N(0, 3^2).
Same generated data as in the part 1 slides.

. * Summarize data
. summarize

Variable Obs Mean Std. Dev. Min Max

x1 40 .3337951 .8986718 -1.099225 2.754746


x2 40 .1257017 .9422221 -2.081086 2.770161
x3 40 .0712341 1.034616 -1.676141 2.931045
y 40 3.107987 3.400129 -3.542646 10.60979

. correlate
(obs=40)

x1 x2 x3 y

x1 1.0000
x2 0.5077 1.0000
x3 0.4281 0.2786 1.0000
y 0.4740 0.3370 0.2046 1.0000


Aside: Demeaning data


Stata commands such as lasso do this automatically
  - but for completeness the following code standardizes the regressors and demeans y.

. * Standardize regressors and demean y


. foreach var of varlist x1 x2 x3 {
2. qui egen z`var' = std(`var')
3. }

. quietly summarize y

. quietly generate ydemeaned = y - r(mean)

. summarize ydemeaned z*

Variable Obs Mean Std. Dev. Min Max

ydemeaned 40 -1.71e-08 3.400129 -6.650633 7.501798


zx1 40 2.05e-09 1 -1.594598 2.693921
zx2 40 2.79e-10 1 -2.34211 2.80662
zx3 40 2.79e-09 1 -1.688912 2.764129

The original variables x1 to x3 had standard deviations 0.89867,
0.94222 and 1.03462
  - the means differ from zero due to single-precision rounding error.

Demeaning data better


Aside: Use double precision

. * Standardize regressors and demean y


. foreach var of varlist x1 x2 x3 {
2. qui egen double z`var' = std(`var')
3. }

. qui summarize y

. qui generate double ydemeaned = y - r(mean)

. summarize ydemeaned z*

Variable Obs Mean Std. Dev. Min Max

ydemeaned 40 -3.33e-17 3.400129 -6.650633 7.501798


zx1 40 2.63e-17 1 -1.594598 2.693921
zx2 40 2.62e-17 1 -2.34211 2.80662
zx3 40 -2.98e-17 1 -1.688912 2.764129

Stata does internal calculations in double precision but the default is to
save variables in single precision.


LASSO linear regression example: lasso command


Apply to the generated data example: n = 40, K = 5, p = 3.
  - set the seed!
  - the first regressor is selected when λ = 1.450138.

. * Lasso linear using 5-fold cross validation


. lasso linear y x1 x2 x3, selection(cv) folds(5) rseed(10101)

5-fold cross-validation with 100 lambdas ...


Grid value 1: lambda = 1.591525 no. of nonzero coef. = 0
Folds: 1...5 CVF = 11.85738
Grid value 2: lambda = 1.450138 no. of nonzero coef. = 1
Folds: 1...5 CVF = 11.60145
Grid value 3: lambda = 1.321312 no. of nonzero coef. = 1
Folds: 1...5 CVF = 11.2296
Grid value 4: lambda = 1.20393 no. of nonzero coef. = 1
Folds: 1...5 CVF = 10.87719
Grid value 5: lambda = 1.096976 no. of nonzero coef. = 1
Folds: 1...5 CVF = 10.60149
Grid value 6: lambda = .9995238 no. of nonzero coef. = 1
Folds: 1...5 CVF = 10.38463
Grid value 7: lambda = .9107289 no. of nonzero coef. = 1
Folds: 1...5 CVF = 10.20522
Grid value 8: lambda = .8298222 no. of nonzero coef. = 1
Folds: 1...5 CVF = 10.05685


lasso command (continued)


Second regressor included when λ = 0.6277301

Grid value 9: lambda = .7561031 no. of nonzero coef. = 1


Folds: 1...5 CVF = 9.934201
Grid value 10: lambda = .688933 no. of nonzero coef. = 1
Folds: 1...5 CVF = 9.829713
Grid value 11: lambda = .6277301 no. of nonzero coef. = 2
Folds: 1...5 CVF = 9.739804
Grid value 12: lambda = .5719643 no. of nonzero coef. = 2
Folds: 1...5 CVF = 9.666469
Grid value 13: lambda = .5211525 no. of nonzero coef. = 2
Folds: 1...5 CVF = 9.606777
Grid value 14: lambda = .4748548 no. of nonzero coef. = 2
Folds: 1...5 CVF = 9.562824
Grid value 15: lambda = .43267 no. of nonzero coef. = 2
Folds: 1...5 CVF = 9.525748
Grid value 16: lambda = .3942328 no. of nonzero coef. = 2
Folds: 1...5 CVF = 9.493472
Grid value 17: lambda = .3592102 no. of nonzero coef. = 2
Folds: 1...5 CVF = 9.460115
Grid value 18: lambda = .327299 no. of nonzero coef. = 2
Folds: 1...5 CVF = 9.43311
Grid value 19: lambda = .2982226 no. of nonzero coef. = 2
Folds: 1...5 CVF = 9.411316


lasso command (continued)

Minimum CV of 9.393523 with two regressors.

Grid value 20: lambda = .2717294 no. of nonzero coef. = 2


Folds: 1...5 CVF = 9.393794
Grid value 21: lambda = .2475897 no. of nonzero coef. = 2
Folds: 1...5 CVF = 9.393523
Grid value 22: lambda = .2255945 no. of nonzero coef. = 2
Folds: 1...5 CVF = 9.40661
Grid value 23: lambda = .2055533 no. of nonzero coef. = 2
Folds: 1...5 CVF = 9.420332
Grid value 24: lambda = .1872925 no. of nonzero coef. = 2
Folds: 1...5 CVF = 9.434326
... cross-validation complete ... minimum found

The default grid search is over a decreasing logarithmic grid of 100 values
  - λ_j = λ_1 × 10^{−4(j−1)/99}, j = 2, ..., 100
  - λ_1 = 1.591525 is the smallest value at which no regressors are selected.


lasso command (continued)

Final results on optimal λ.

Lasso linear model No. of obs = 40


No. of covariates = 3
Selection: Cross-validation No. of CV folds = 5

No. of Out-of- CV mean


nonzero sample prediction
ID Description lambda coef. R-squared error

1 first lambda 1.591525 0 -0.0519 11.85738


20 lambda before .2717294 2 0.1666 9.393794
* 21 selected lambda .2475897 2 0.1666 9.393523
22 lambda after .2255945 2 0.1655 9.40661
24 last lambda .1872925 2 0.1630 9.434326

* lambda selected by cross-validation.


lassoknots command

Lists the values of λ at which variables are added and removed
  - here first x1 and then x2 are added.

. * List the values of lambda at which variables are added or removed


. lassoknots

No. of CV mean
nonzero pred. Variables (A)dded, (R)emoved,
ID lambda coef. error or left (U)nchanged

2 1.450138 1 11.60145 A x1
11 .6277301 2 9.739804 A x2
* 21 .2475897 2 9.393523 U
24 .1872925 2 9.434326 U

* lambda selected by cross-validation.


cvplot command

Plot of the value of CV_5 against λ on a log scale
  - simply the command cvplot.
Plot of how the coefficients change with λ
  - the command coefpath.


lassoinfo command

Provides a summary of the LASSO.

. * Provide a summary of the lasso


. lassoinfo

Estimate: active
Command: lasso

No. of
Selection Selection selected
Depvar Model method criterion lambda variables

y linear cv CV min. .2475897 2


lassocoef command

Provides three different sets of coefficient estimates.

1. Standardized coefficients (default) are those directly from the lasso on
   the standardized regressors.
2. Penalized coefficients are the preceding ones rescaled so that the
   standardization of the variables is removed
  - so divide each coefficient by the standard deviation of the
    corresponding regressor
  - i.e. they can be interpreted in terms of the original data.
3. Post-selection coefficients are obtained by OLS of y on the
   selected regressors (here x1 and x2)
  - often called post-lasso estimates.


Standardized coefficients for standardized regressors

We have

. * Lasso coefficients for the standardized regressors


. lassocoef, display(coef, standardized)

active

x1 1.206056
x2 .2715635
_cons 0

Legend:
b - base level
e - empty cell
o - omitted


Unstandardized coefficients for original regressors

Recall d.g.p. y = 2 + 1 x1 + 0 x2 + 0 x3 + u.

. * Lasso coefficients for the unstandardized regressors


. lassocoef, display(coef, penalized) nolegend

active

x1 1.35914
x2 .2918877
_cons 2.617622


Post-selection (post-LASSO)

OLS on the selected and unstandardized regressors
  - same as regress y x1 x2.

. * Post-selection estimated coefficients for the unstandardized regressors


. lassocoef, display(coef, postselection) nolegend

active

x1 1.544198
x2 .4683922
_cons 2.533663


lassogof command
Goodness-of-fit
  - as expected, post-lasso OLS fits better than lasso in the full sample
  - since OLS minimizes the MSE while lasso minimizes the MSE plus a penalty.

. * Goodness-of-fit with penalized coefficients and postselection coefficients


. lassogof, penalized

Penalized coefficients

MSE R-squared Obs

8.679274 0.2300 40

. lassogof, postselection

Postselection coefficients

MSE R-squared Obs

8.597958 0.2372 40


Adaptive Lasso
A method that usually leads to fewer selected variables than the basic lasso.
  - First run the lasso as usual with κ_j = 1, since then
    λ ∑_{j=1}^p κ_j |β_j| = λ ∑_{j=1}^p |β_j|.
  - Second, exclude the x_j with β̂_j = 0 and for the remainder set
    κ_j = 1/|β̂_j|^δ with default δ = 1, then run the lasso again.
Here only x1 is selected.

. * Lasso linear using 5-fold adaptive cross validation


. qui lasso linear y x1 x2 x3, selection(adaptive) folds(5) rseed(10101)

. lassoknots

No. of CV mean
nonzero pred. Variables (A)dded, (R)emoved,
ID lambda coef. error or left (U)nchanged

26 3.945214 1 11.60145 A x1
* 52 .3512089 1 9.160539 U
57 .2205694 2 9.210699 A x2
95 .0064297 2 9.172378 U

* lambda selected by cross-validation in final adaptive step.


4.5 Elastic net and Ridge Regression

Elastic net combines ridge regression and LASSO with objective function

   Q_λ,α(β) = (1/n) ∑_{i=1}^n (y_i − x_i'β)^2 + λ ∑_{j=1}^p {α|β_j| + (1 − α) β_j^2}.

  - the ridge penalty (the (1 − α) β_j^2 term) averages correlated variables
  - the LASSO penalty (the α|β_j| term) leads to sparsity.
For the elastic net
  - ridge is the special case α = 0
  - LASSO is the special case α = 1.


Ridge Regression
Standardized ridge (OLS estimates were 1.555, 0.471 and -0.026.)
. * Ridge estimation using the elasticnet command and selected results
. qui elasticnet linear y x1 x2 x3, alpha(0) rseed(10101) folds(5)

. lassoknots

No. of CV mean
nonzero pred. Variables (A)dded, (R)emoved,
alpha ID lambda coef. error or left (U)nchanged

0.000
1 1591.525 3 11.9595 A x1 x2
x3
* 93 .3052401 3 9.54017 U
100 .1591525 3 9.566065 U

* alpha and lambda selected by cross-validation.

. lassocoef, display(coef, penalized) nolegend

active

x1 1.139476
x2 .4865453
x3 .0958546
_cons 2.659647


Elastic net
The default is a 100-point logarithmic grid for λ and α = 0.5, 0.7, 1.0
  - here the default selects α = 1.0 (lasso), so narrow the α grid to 0.90, 0.95, 1.0
  - the optimum is α = 0.95, λ = 0.2717, with x1 and x2 selected.
. * Elastic net estimation and selected results
. qui elasticnet linear y x1 x2 x3, alpha(0.9(0.05)1) rseed(10101) folds(5)

. lassoknots

No. of CV mean
nonzero pred. Variables (A)dded, (R)emoved,
alpha ID lambda coef. error or left (U)nchanged

1.000
4 1.450138 1 11.60145 A x1
13 .6277301 2 9.739804 A x2
26 .1872925 2 9.434326 U

0.950
29 1.591525 1 11.73019 A x1
38 .688933 2 9.81611 A x2
* 48 .2717294 2 9.3884 U
51 .2055533 2 9.425887 U

0.900
53 1.675289 1 11.74015 A x1
62 .7561031 2 9.900317 A x2
76 .2055533 2 9.431641 U

* alpha and lambda selected by cross-validation.


4.6 Comparison of Shrinkage Estimators


Compare OLS, Lasso, ridge, elastic net (in-sample n = 40)

. * Estimate various models and store results


. qui regress y x1 x2 x3

. estimates store OLS

. qui lasso linear y x1 x2 x3, selection(cv) folds(5) rseed(10101)

. estimates store LASCV

. qui lasso linear y x1 x2 x3, selection(adaptive) folds(5) rseed(10101)

. estimates store LASADAPT

. qui lasso linear y x1 x2 x3, selection(plugin) folds(5)

. estimates store LASPLUG

. qui elasticnet linear y x1 x2 x3, alpha(0) selection(cv) folds(5) rseed(10101)

. estimates store RIDGECV

. qui elasticnet linear y x1 x2 x3, alpha(0.9(0.05)1) rseed(10101) folds(5)

. estimates store ELASTIC


Comparison of Shrinkage Estimators


Most select x1 and x2, adaptive lasso only x1, and ridge keeps all three.
  - the lassogof default reports penalized coefficients and R^2.
. * Compare in-sample fit and selected coefficients of various models
. lassogof OLS LASCV LASADAPT LASPLUG RIDGECV ELASTIC

Penalized coefficients

Name MSE R-squared Obs

OLS 8.597403 0.2373 40


LASCV 8.679274 0.2300 40
LASADAPT 8.755573 0.2232 40
LASPLUG 10.27195 0.0887 40
RIDGECV 8.70562 0.2277 40
ELASTIC 8.693386 0.2288 40

. lassocoef OLS LASCV LASADAPT LASPLUG RIDGECV ELASTIC, display(coef) nolegend

OLS LASCV LASADAPT LASPLUG RIDGECV ELASTIC

x1 1.555582 1.206056 1.462431 .3533654 1.011134 1.179972


x2 .4707111 .2715635 .452667 .2705777
x3 -.0256025 .0979251
_cons 2.531396 0 0 0 0 0


4.7 Shrinkage for logit, probit and poisson

More generally we can apply shrinkage to other objective functions.
For logit, probit and poisson, replace the squared residual by the
squared deviance residual
  - the deviance residual is the residual used for generalized linear models.
Consider the lasso estimator β̂_λ of β that minimizes

   Q_λ(β) = ∑_{i=1}^n q(y_i, x_i, β) + λ ∑_{j=1}^p |β_j|

Logit:   q(y_i, x_i, β) = −2[y_i ln Λ(x_i'β) + (1 − y_i) ln(1 − Λ(x_i'β))]
Probit:  q(y_i, x_i, β) = −2[y_i ln Φ(x_i'β) + (1 − y_i) ln(1 − Φ(x_i'β))]
Poisson: q(y_i, x_i, β) = −2[y_i x_i'β − exp(x_i'β) − v_i]
  - where v_i = 0 if y_i = 0 and v_i = y_i ln y_i − y_i otherwise.
A library-based sketch for the logit case follows below.
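For reference, a scikit-learn sketch (an illustrative analogue; the slides use
Stata's lasso logit) of an L1-penalized logit; note sklearn parameterizes the
penalty through C, roughly the inverse of the penalty strength, rather than λ.

   # Illustrative sketch: L1-penalized (lasso-type) logit in scikit-learn.
   # LogisticRegressionCV uses C = inverse penalty strength, so large C means little shrinkage.
   import numpy as np
   from sklearn.linear_model import LogisticRegressionCV

   rng = np.random.default_rng(10101)
   X = rng.standard_normal((200, 3))
   p = 1 / (1 + np.exp(-(0.5 + 1.0 * X[:, 0])))      # only x1 enters the made-up true model
   dy = rng.binomial(1, p)

   logit_l1 = LogisticRegressionCV(penalty="l1", solver="liblinear", Cs=20, cv=5,
                                   random_state=10101).fit(X, dy)
   print(logit_l1.C_, logit_l1.coef_)                # some coefficients shrunk to exactly zero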


Lasso shrinkage for logit example


Create the binary variable dy = 1 if y > 3
  - only x1 is selected.
. * Lasso for logit example
. qui generate dy = y > 3

. qui lasso logit dy x1 x2 x3, rseed(10101) folds(5)

. lassoknots

No. of
nonzero CV mean Variables (A)dded, (R)emoved,
ID lambda coef. deviance or left (U)nchanged

2 .2065674 1 1.407613 A x1
* 24 .0266792 1 1.192646 U
26 .0221495 2 1.192865 A x2
30 .0152668 3 1.194545 A x3
31 .0139106 3 1.195055 U

* lambda selected by cross-validation.


Other user-written Stata commands for LASSO

The lassopack package of Ahrens, Hansen and Schaffer (2020) has
  - cvlasso for λ chosen by K-fold cross-validation and h-step-ahead
    rolling cross-validation for cross-section, panel and time-series data
  - rlasso for theory-driven ('rigorous') penalization for the lasso and
    square-root lasso for cross-section and panel data
  - lasso2 for information-criteria choice of λ.
These are now supplanted by Stata's own commands
  - but Ahrens, Hansen and Schaffer (2020) is a great article to read
    as it provides a lot of detail.

5. Prediction for Economics

Microeconometrics focuses on estimation of β or of partial effects (later).
But in some cases we are directly interested in predicting y
  - for old people, predict the probability of one-year survival
    * if low, then do not have hip replacement surgery
  - the probability of re-offending
    * if low, then grant parole to the prisoner.
Mullainathan and Spiess (2017, JEP)
  - consider prediction of housing prices
  - detail how to do this using machine learning methods
  - and then summarize many recent economics ML applications.


5.1 Predict housing prices

y is log house price in the U.S. in 2011
  - n = 51,808 is the sample size
  - p = 150 is the number of potential regressors.
Predict using
  - OLS (using all regressors)
  - regression tree
  - LASSO (and not post-LASSO OLS)
  - random forest
  - ensemble: an optimal weighted average of the above methods.
1. Train the model on 10,000 observations using 8-fold CV.
2. Fit the preferred model on these 10,000 observations.
3. Predict on the remaining 41,808 observations
  - and do 500 bootstraps to get a 95% CI for R^2.

Random forest (and the subsequent ensemble) does best out of sample
  - the training sample is n = 10,000 and the holdout sample is n = 41,808.


Details

The downloadable appendix to the paper gives more details and R code.

1. Divide into a training sample (n = 10,000) and a hold-out sample
   (n = 41,808).
2. On the training sample do 8-fold cross-validation to get the tuning
   parameter(s) such as λ.
  - If there are e.g. two tuning parameters, then do a two-dimensional grid search.
3. The prediction function f̂(x) is estimated using the entire training
   sample (n = 10,000) with the optimal λ.
4. Now apply this f̂(x) to the hold-out sample and

   compute MSE = (1/41808) ∑_i (y_i − f̂(x_i))^2,

   hence compute R^2 = 1 − ∑_i (y_i − f̂(x_i))^2 / ∑_i (y_i − ȳ)^2
                     = 1 − MSE / [(1/41808) ∑_i (y_i − ȳ)^2].

5. A 95% CI for R^2 is obtained by bootstrapping the hold-out sample.
   (A code sketch of steps 1-4 follows below.)
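A compact Python sketch (illustrative; the paper's appendix uses R) of the
train/hold-out workflow in steps 1-4, using a lasso as the single method and
made-up data:

   # Illustrative sketch of steps 1-4: CV on a training sample, then R^2 on a hold-out sample.
   import numpy as np
   from sklearn.linear_model import LassoCV
   from sklearn.model_selection import train_test_split

   rng = np.random.default_rng(0)
   n, p = 5000, 150
   X = rng.standard_normal((n, p))
   y = X[:, :5] @ np.array([1.0, 0.5, 0.5, 0.25, 0.25]) + rng.standard_normal(n)

   X_train, X_hold, y_train, y_hold = train_test_split(X, y, train_size=1000, random_state=0)

   fhat = LassoCV(cv=8, random_state=0).fit(X_train, y_train)   # 8-fold CV picks lambda, then refit
   yhat = fhat.predict(X_hold)                                  # apply to the hold-out sample
   mse = np.mean((y_hold - yhat) ** 2)
   r2 = 1 - mse / np.mean((y_hold - y_hold.mean()) ** 2)
   print(fhat.alpha_, r2)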


Ensemble Weights
Ensemble weights are similar to portfolio diversification.
Example: X1 ~ (µ, σ^2) independent of X2 ~ (µ, σ^2); then

   Var[(X1 + X2)/2] = (1/4){Var[X1] + Var[X2]} = σ^2/2 < Var[X1] = σ^2.

  - the benefit is smaller the more correlated are X1 and X2.
So consider a linear combination of predictions.
For each ML method create 10,000 predictions in the training sample
as follows
  - for each of the eight folds, estimate (using the optimal tuning
    parameter(s)) on the other seven folds and predict on the remaining fold
  - this gives (10,000 × 1) vectors ŷ_OLS, ŷ_REGTREE, ŷ_LASSO, ŷ_RF.
The ensemble weights are the α̂'s from the OLS regression in the
training sample

   y_i = α_0 + α_1 ŷ_OLS,i + α_2 ŷ_REGTREE,i + α_3 ŷ_LASSO,i + α_4 ŷ_RF,i + u_i.

These ensemble weights are also used in the holdout sample exercise
(a stacking sketch follows below).
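A Python sketch (illustrative, not the paper's code) of ensemble weights via
out-of-fold predictions: build cross-fitted predictions for two methods, then
regress y on them to get the weights.

   # Illustrative stacking sketch: out-of-fold predictions, then OLS of y on the predictions.
   import numpy as np
   from sklearn.ensemble import RandomForestRegressor
   from sklearn.linear_model import LassoCV, LinearRegression
   from sklearn.model_selection import KFold, cross_val_predict

   rng = np.random.default_rng(0)
   X = rng.standard_normal((1000, 20))
   y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.standard_normal(1000)

   folds = KFold(n_splits=8, shuffle=True, random_state=0)
   yhat_lasso = cross_val_predict(LassoCV(cv=5), X, y, cv=folds)          # out-of-fold predictions
   yhat_rf = cross_val_predict(RandomForestRegressor(random_state=0), X, y, cv=folds)

   Z = np.column_stack([yhat_lasso, yhat_rf])
   weights = LinearRegression().fit(Z, y)        # the alphas: OLS of y on the stacked predictions
   print(weights.intercept_, weights.coef_)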

Further Details

LASSO does not pick the "correct" regressors
  - it just gets the correct f̂(x), especially when the regressors are
    correlated with each other.
A diagram in the original slides (not reproduced here) shows which of the
150 variables are included in separate models for 10 subsamples
  - there are many variables that appear sometimes but not at other times
    (appearing sometimes in white and sometimes in black in the diagram).


6. Some R Commands

These are from An Introduction to Statistical Learning: with
Applications in R. There may be better commands.
Basic regression
  - OLS is lm.fit
  - cross-validation for OLS uses cv.glm()
  - the bootstrap uses the boot() function in the boot library.
Variable selection
  - best subset, forward stepwise and backward stepwise: regsubsets() in
    the leaps library.
Penalized regression
  - ridge regression: glmnet(,alpha=0) function in the glmnet library
  - lasso: glmnet(,alpha=1) function in the glmnet library
  - CV to get lambda for ridge/lasso: cv.glmnet() in the glmnet library.


7. Some Python Commands

Use the scikit-learn package.


More to come here.
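As a placeholder, here is a minimal scikit-learn sketch (my addition,
paralleling the R commands on the previous slide) of the main shrinkage
estimators; it is illustrative only, with made-up data:

   # Minimal scikit-learn analogues of the glmnet-style R commands (illustrative only).
   import numpy as np
   from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, LassoCV, RidgeCV

   X = np.random.default_rng(0).standard_normal((40, 3))
   y = 2 + X[:, 0] + 3 * np.random.default_rng(1).standard_normal(40)

   ols = LinearRegression().fit(X, y)                    # OLS
   ridge = Ridge(alpha=1.0).fit(X, y)                    # ridge for a given penalty
   lasso = Lasso(alpha=0.1).fit(X, y)                    # lasso for a given penalty
   enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # elastic net
   lasso_cv = LassoCV(cv=5).fit(X, y)                    # penalty chosen by cross-validation
   ridge_cv = RidgeCV(alphas=np.logspace(-3, 3, 50)).fit(X, y)
   print(lasso_cv.alpha_, lasso_cv.coef_)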


8. References
ISLR2: Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani
(2021), An Introduction to Statistical Learning: with Applications in R, 2nd Ed.,
Springer.
ISLP: Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani and
Jonathan Taylor (2023), An Introduction to Statistical Learning: with Applications
in Python, Springer.
  - Free PDF from https://www.statlearning.com/ and $40 softcover book via
    Springer MyCopy.
ESL: Trevor Hastie, Robert Tibshirani and Jerome Friedman (2009), The
Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer.
  - PDF and $40 softcover book at
    https://link.springer.com/book/10.1007/978-0-387-84858-7.
Geron2: Aurelien Geron (2022), Hands-On Machine Learning with Scikit-Learn,
Keras and TensorFlow, Third edition, O'Reilly.
A. Colin Cameron and Pravin K. Trivedi (2022), Microeconometrics Using Stata,
Second edition, Chapters 28.3-28.4.
EH: Bradley Efron and Trevor Hastie (2016), Computer Age Statistical Inference:
Algorithms, Evidence, and Data Science, Cambridge University Press.

References (continued)

Sendhil Mullainathan and Jann Spiess (2017), "Machine Learning: An Applied
Econometric Approach", Journal of Economic Perspectives, Spring, 87-106.
Hannes Leeb and Benedikt M. Pötscher (2005), "Model Selection and Inference",
Econometric Theory, 21-59.
Peter Bickel, Ya'acov Ritov and Alexandre Tsybakov (2009), "Simultaneous
Analysis of Lasso and Dantzig Selector", The Annals of Statistics, 1705-1732.
Alex Belloni and Victor Chernozhukov (2013), "Least Squares after Model
Selection in High-Dimensional Sparse Models", Bernoulli, 521-547.
Achim Ahrens, Christian B. Hansen and Mark E. Schaffer (2020), "lassopack:
Model selection and prediction with regularized regression in Stata", The Stata
Journal, 20, 176-235 (also arXiv:1901.05397).
Domenico Giannone, Michele Lenza and Giorgio E. Primiceri (2021), "Economic
Predictions with Big Data: The Illusion of Sparsity," Econometrica, 2409-2437.

