2.b Applied Machine Learning Secret Sauce - Slides

Machine Learning:

An Applied Econometric Approach


Jann Spiess

based on work with Sendhil Mullainathan

in collaboration with Susan Athey and Niall Keleher

2. The Secret Sauce of Machine Learning


Structure of first chapter of webinar
1. Introduction

Training data $(y, x)$  →  $\hat{f}$  →  Application data $(\hat{y} = \hat{f}(x), x)$

2. The Secret Sauce of Machine Learning

3. Prediction vs Estimation
Prediction problem set-up
Given:
• Training data set $(y_1, x_1), \ldots, (y_n, x_n)$ (assume iid)
• Usually called "regression" when $y$ is continuous, "classification" when $y$ is discrete
• Loss function $\ell(\hat{y}, y)$

       Econometrics        ML
  y    Outcome variable    Label
  x    Covariate           Feature

Goal:
• Prediction function $\hat{f}$ with low average loss ("risk")
  $L(\hat{f}) = E_{(y,x)}[\ell(\hat{f}(x), y)]$,
  where $(y, x)$ is distributed as in the training data
Squared-error loss for regression
"Regression": Continuous outcome, $y \in \mathbb{R}$

Squared-error loss: $\ell(\hat{y}, y) = (\hat{y} - y)^2$, so $L(\hat{f}) = E[(\hat{f}(x) - y)^2]$

• Predict log house price $y$ of a new home from its characteristics $x$ based on survey data from homes with the same distribution (Mullainathan and Spiess, 2017)

• Predict log consumption $y$ for a new household $x$ based on data on similar households (Adelman et al.)
Loss measures for classification
"Classification": Binary outcome, $y \in \{0,1\}$

If the prediction itself is binary, $\hat{y} \in \{0,1\}$:

                    y = 1                       y = 0
  $\hat{y} = 1$     True positive               False positive (Type I)
  $\hat{y} = 0$     False negative (Type II)    True negative

[ROC curve; source: Aiken et al. (2020)]
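A minimal R sketch of these loss measures (simulated data; all object names are illustrative): cross-tabulating a binary prediction against the outcome gives exactly the table above, and the 0–1 loss is the misclassification rate.

set.seed(1)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(2 * x))   # binary outcome
p_hat <- plogis(1.8 * x)                         # some predicted probability
y_hat <- as.integer(p_hat > 0.5)                 # threshold to a binary prediction

table(prediction = y_hat, outcome = y)           # TP, FP (Type I), FN (Type II), TN
mean(y_hat != y)                                 # misclassification rate (0-1 loss)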


Standard regression solution
Goal: small $E[(\hat{f}(x) - y)^2]$

E.g. use linear functions $\hat{f}(x) = \hat{\beta}' x = \hat{\beta}_0 + \sum_{j=1}^{k} \hat{\beta}_j x_j$

• From training data, pick the $\hat{\beta}$ that provides the best in-sample fit:
  $\min_{\beta} E[(y - \beta' x)^2]$  →  $\min_{\hat{\beta}} \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{\beta}' x_i)^2$

• Which optimality properties does OLS have?

• Is this optimal for prediction?
Bias–variance decomposition
• Loss at a new point $y = \beta' x + \epsilon$ (with $E[\epsilon \mid x] = 0$):
  $(\hat{y} - y)^2 = (\hat{\beta}' x - \beta' x - \epsilon)^2$
• Average over draws of the training sample $T$ (and $\epsilon$):
  $E_{T,\epsilon}[(\hat{y} - y)^2] = E_T[(\hat{\beta}' x - \beta' x)^2] + E_\epsilon[\epsilon^2]$
  $= (E_T[\hat{\beta}]' x - \beta' x)^2 + x' V_T(\hat{\beta})\, x + V_\epsilon(\epsilon \mid x)$
     bias (approximation)        variance (overfit)       irreducible noise
• Important framing within econometrics and statistics
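A minimal simulation sketch of this decomposition (simulated linear model, base R; the deliberate misspecification below is an illustrative assumption): average the prediction at a fixed new point over many training draws and compare bias² + variance + noise to the expected squared prediction error.

set.seed(1)
p <- 5; n <- 50; sigma <- 1
beta <- rep(1, p)
x_new <- rep(0.5, p)                              # fixed new point at which we predict
reps <- 5000
pred <- numeric(reps)
for (r in seq_len(reps)) {
  X <- matrix(rnorm(n * p), n, p)
  y <- as.vector(X %*% beta) + rnorm(n, sd = sigma)
  b_hat <- coef(lm(y ~ X[, 1:3] - 1))             # deliberately misspecified: only 3 of 5 covariates
  pred[r] <- sum(b_hat * x_new[1:3])              # prediction at x_new from this training draw
}
bias2    <- (mean(pred) - sum(beta * x_new))^2    # approximation error (squared bias)
variance <- var(pred)                             # variation across training draws (overfit)
c(bias2 = bias2, variance = variance, noise = sigma^2, total = bias2 + variance + sigma^2)
mean((sum(beta * x_new) + rnorm(reps, sd = sigma) - pred)^2)   # direct Monte Carlo check of the total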
Approximation–overfit trade-off

Source: Hastie et al. (2009)


Approximation–overfit trade-off
As the model becomes more complex:
1. It fits the true function better (approximation)
2. It fits the noise better (overfit) – see the sketch below

Hence:
1. Flexible functional forms
2. Limit expressiveness (regularization)
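A minimal sketch of the trade-off (simulated data, polynomial regressions of increasing degree; everything here is illustrative): the in-sample error keeps falling as the model becomes more complex, while the out-of-sample error eventually rises.

set.seed(1)
n <- 100
x <- runif(n, -2, 2); x_test <- runif(10000, -2, 2)
f_true <- function(z) sin(2 * z)
y <- f_true(x) + rnorm(n, sd = 0.5)
y_test <- f_true(x_test) + rnorm(length(x_test), sd = 0.5)

degrees <- 1:15
errs <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d))                                            # polynomial of degree d
  c(train = mean(residuals(fit)^2),                                    # in-sample MSE
    test  = mean((predict(fit, data.frame(x = x_test)) - y_test)^2))   # out-of-sample MSE
}))
cbind(degree = degrees, errs)   # training error falls monotonically; test error is U-shaped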
Regularization for linear regression
• Rather than OLS
  $\min_{\hat{\beta}} \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{\beta}' x_i)^2$
• Fit the constrained problem
  $\min_{\hat{\beta}} \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{\beta}' x_i)^2$  s.t.  $\|\hat{\beta}\| \le c$
  where, for example,
  $\|\hat{\beta}\|_0 = \sum_{j=1}^{k} 1\{\hat{\beta}_j \ne 0\}$,  $\|\hat{\beta}\|_1 = \sum_{j=1}^{k} |\hat{\beta}_j|$,  $\|\hat{\beta}\|_2^2 = \sum_{j=1}^{k} \hat{\beta}_j^2$
• Throughout, assume $\hat{\beta}' = (\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k)$, with the intercept $\hat{\beta}_0$ not penalized
• Normalize!
LASSO regression

$\min_{\hat{\beta}} \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{\beta}' x_i)^2 + \lambda \sum_{j=1}^{k} |\hat{\beta}_j|$
(equivalent to minimizing subject to $\sum_{j=1}^{k} |\hat{\beta}_j| \le c$)
/

• Selects and shrinks


• “Capitalist” – in doubt give all to one
• Produces sparse solutions

Illustration: Afshine Amidi and Shervine Amidi


Ridge regression

$\min_{\hat{\beta}} \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{\beta}' x_i)^2 + \lambda \sum_{j=1}^{k} \hat{\beta}_j^2$

• Shrink towards zero, but never quite


• “Socialist” – in doubt distribute to multiple
• Can be interpreted as a Bayesian posterior mean under a normal prior

Illustration: Afshine Amidi and Shervine Amidi


Regularization for linear regression

Source: Afshine Amidi and Shervine Amidi
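A minimal glmnet sketch contrasting the two penalties above (simulated data; object names and the penalty level are illustrative): alpha = 1 gives the LASSO, which sets most coefficients exactly to zero, while alpha = 0 gives ridge, which shrinks all coefficients but keeps them nonzero.

library(glmnet)
set.seed(1)
n <- 200; p <- 50
X <- matrix(rnorm(n * p), n, p)
y <- as.vector(X %*% c(rep(2, 5), rep(0, p - 5))) + rnorm(n)   # only 5 covariates matter

lasso_fit <- glmnet(X, y, alpha = 1)    # L1 penalty: selects and shrinks
ridge_fit <- glmnet(X, y, alpha = 0)    # L2 penalty: shrinks, never exactly zero

# compare coefficients at an (illustrative) penalty level lambda = 0.1
b_lasso <- as.vector(coef(lasso_fit, s = 0.1))[-1]
b_ridge <- as.vector(coef(ridge_fit, s = 0.1))[-1]
c(lasso_nonzero = sum(b_lasso != 0), ridge_nonzero = sum(b_ridge != 0))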


Structure of supervised learners
• A function class
• A regularizer
• An optimization algorithm that gets us there
Poverty targeting

Example source: Adelman et al.


Reference point: OLS

Example source: Adelman et al.


Fitted vs actual values in sample

Example source: Adelman et al.


Regression trees

Example source: Adelman et al.
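A minimal regression-tree sketch with rpart; hh_train and log_consumption are hypothetical stand-ins for the Adelman et al. poverty-targeting data, not objects from these slides.

library(rpart)
# hh_train / log_consumption: hypothetical household data with a log-consumption outcome
tree_fit <- rpart(log_consumption ~ .,
                  data = hh_train,
                  method = "anova",                       # squared-error (regression) splits
                  control = rpart.control(cp = 0.001))    # grow a fairly deep tree
tree_fit                                                  # prints the split rules
pred_tree <- predict(tree_fit, newdata = hh_train)        # piecewise-constant fitted values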


OLS vs tree

Example source: Adelman et al.


How to find optimal tree?

Example source: Adelman et al.
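One standard answer, continuing the hypothetical tree_fit above: rpart cross-validates the complexity parameter cp internally, so the grown tree can be pruned back at the cp value with the smallest cross-validated error.

printcp(tree_fit)                                          # cross-validated error for each cp
best_cp <- tree_fit$cptable[which.min(tree_fit$cptable[, "xerror"]), "CP"]
tree_pruned <- prune(tree_fit, cp = best_cp)               # prune to the cv-optimal subtree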


Structure of supervised learners
• A function class
• A regularizer
• An optimization algorithm that gets us there
Choosing regularization parameter
• Hold-out: create an out-of-sample set within the sample
• Cross-validation: create repeated hold-outs (see the sketch below)

Hence:
1. Flexible functional forms
2. Limit expressiveness (regularization)
3. Learn how much to regularize (tuning)

Illustration: Afshine Amidi and Shervine Amidi
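A minimal sketch of both ideas with glmnet (simulated data; object names are illustrative): tune the LASSO penalty on a single hold-out, then let cv.glmnet automate the repeated (cross-validated) version.

library(glmnet)
set.seed(1)
n <- 400; p <- 50
X <- matrix(rnorm(n * p), n, p)
y <- as.vector(X %*% c(rep(1, 5), rep(0, p - 5))) + rnorm(n)

holdout <- sample(n, n / 4)                              # 25% hold-out, rest for fitting
fit <- glmnet(X[-holdout, ], y[-holdout], alpha = 1)     # LASSO path on the fitting sample

pred <- predict(fit, newx = X[holdout, ])                # hold-out predictions, one column per lambda
rmse <- sqrt(colMeans((pred - y[holdout])^2))
best_lambda <- fit$lambda[which.min(rmse)]               # lambda tuned by hold-out RMSE

cv_fit <- cv.glmnet(X, y, alpha = 1, nfolds = 5)         # cross-validation: repeated hold-outs
cv_fit$lambda.min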



Structure of ML exercise

  fitting sample:   engineering with econometric guidance  →  obtain a function $\hat{f}$
  hold-out sample:  econometric guarantees  →  estimate $L(\hat{f})$

  "Firewall principle": keep the hold-out sample strictly separate from the fitting sample

Illustration: Afshine Amidi and Shervine Amidi


ML basics recap
1. Flexible functional forms
2. Limit expressiveness (regularization)
3. Learn how much to regularize (tuning)

• Important researcher choices:


• Loss function
• Data management/splitting
• Feature representation
• Function class and regularizer
From LASSO to neural nets
  Function class               Regularizer
  Linear                       LASSO, ridge, elastic net
  Decision/regression trees    Depth, leaves, leaf size, information gain
  Random forest                Trees, variables per tree, sample sizes, complexity
  Nearest neighbors            Number of neighbors
  Kernel regression            Bandwidth
  Splines                      Number of knots, order
  Neural nets                  Layers, sizes, connectivity, drop-out, early stopping
Regularizing neural nets

Image source: Nielsen (2015)


Model combination: ensembles
$\hat{f}(x) = w_1 \hat{f}_1(x) + \cdots + w_M \hat{f}_M(x)$

• Can combine across different model classes


• How to choose weights?
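One common option, sketched on simulated data (all names are illustrative): fit the candidate models on a fitting sample and choose the weights that minimize hold-out loss, for example by regressing the hold-out outcome on the candidate predictions (a simple stacking scheme).

library(rpart)
set.seed(1)
n <- 600
x1 <- runif(n); x2 <- runif(n)
y <- 2 * x1 + sin(6 * x2) + rnorm(n, sd = 0.3)
df <- data.frame(y, x1, x2)
fitting <- 1:400; holdout <- 401:600                     # simple split

ols_fit  <- lm(y ~ x1 + x2, data = df[fitting, ])
tree_fit <- rpart(y ~ x1 + x2, data = df[fitting, ],
                  control = rpart.control(cp = 0.001))

p_ols  <- predict(ols_fit,  newdata = df[holdout, ])     # candidate predictions on the hold-out
p_tree <- predict(tree_fit, newdata = df[holdout, ])

w <- coef(lm(df$y[holdout] ~ p_ols + p_tree))            # weights that minimize hold-out squared error
w                                                        # intercept plus one weight per candidate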
Model combination: bagging / random forest

Illustration: Databricks
Random forest

[Figure panels: OLS, Tree, Forest]
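A minimal sketch with the randomForest package (simulated data; this is one of several implementations): average many trees grown on bootstrap samples, with each split searching only a random subset of mtry variables.

library(randomForest)
set.seed(1)
n <- 500; p <- 10
X <- data.frame(matrix(runif(n * p), n, p))
y <- sin(3 * X[, 1]) + 2 * X[, 2] + rnorm(n, sd = 0.3)

rf_fit <- randomForest(x = X, y = y,
                       ntree = 500,      # number of bootstrap trees to average (bagging)
                       mtry = 3,         # variables considered at each split
                       nodesize = 5)     # minimum leaf size (regularization)
rf_fit$mse[500]                          # out-of-bag estimate of the MSE
predict(rf_fit, newdata = X[1:5, ])      # forest prediction = average over the trees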
Boosting / boosted trees
• Iteratively fit a simple tree to the residuals of the current model (see the sketch below)

Source: medium.com/mlreview
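A minimal hand-rolled sketch of the idea (simulated data, squared-error loss; not the implementation behind any particular package): repeatedly fit a shallow rpart tree to the current residuals and add a small multiple of it to the running prediction. Packages such as xgboost or gbm implement refined versions of this loop.

library(rpart)
set.seed(1)
n <- 500
x <- matrix(runif(n * 3), n, 3)
y <- sin(3 * x[, 1]) + x[, 2]^2 + rnorm(n, sd = 0.3)
df <- data.frame(y = y, x)

eta <- 0.1                                    # learning rate: take small steps
n_rounds <- 100
pred <- rep(mean(df$y), n)                    # start from the constant prediction
for (m in seq_len(n_rounds)) {
  df$res <- df$y - pred                       # residuals of the current ensemble
  stump <- rpart(res ~ X1 + X2 + X3, data = df,
                 control = rpart.control(maxdepth = 2, cp = 0))
  pred <- pred + eta * predict(stump, df)     # move a little toward the residual fit
}
mean((df$y - pred)^2)                         # in-sample MSE of the boosted ensemble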
Bayesian regularization
• Bayesian methods shrink towards a prior
• Powerful way of constructing regularized predictions,
e.g. ridge regression, Bayesian trees
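A minimal sketch of the ridge case (simulated data, no intercept for simplicity): with a normal prior β ~ N(0, τ²I) and normal noise of variance σ², the posterior mean coincides with the ridge estimator at λ = σ²/τ².

set.seed(1)
n <- 100; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- as.vector(X %*% rep(1, p)) + rnorm(n)

sigma2 <- 1; tau2 <- 0.5
lambda <- sigma2 / tau2

beta_ridge <- solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)                   # ridge estimator
beta_post  <- solve(t(X) %*% X / sigma2 + diag(p) / tau2, t(X) %*% y / sigma2)   # posterior mean
cbind(beta_ridge, beta_post)                  # identical columns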
ML basics recap
1. Flexible functional forms
2. Limit expressiveness (regularization)
3. Learn how much to regularize (tuning)

• Important researcher choices:


• Loss function
• Data management/splitting
• Feature representation
• Function class and regularizer
Implementation: R

# LASSO: cross-validate the penalty lambda with glmnet
cv_lasso_fit <- cv.glmnet(x = XVars,
                          y = house_train$Sale_Price)

[Plots: validation RMSE (roughly 0.09–0.18) against log lambda (reverse scale; marked values near −4.69, −6.46, −6.65) and against the number of non-zero covariates (0–300)]

# Random forest: tune mtry and min_n by 5-fold cross-validation (tidymodels)
cv_folds <- vfold_cv(house_train, v = 5)

rf_grid <- grid_regular(
  mtry(range = c(10, 100)),
  min_n(range = c(4, 20)),
  levels = 5
)

tune_rf_res <- tune_grid(
  tune_wf,
  resamples = cv_folds,
  grid = rf_grid
)

[Plot: validation RMSE (roughly 0.059–0.062) against mtry (20–80), by min_n (4, 8, 12, 16)]
So what is new?
Statistics and econometrics
• Dominance of regularization: James and Stein (1961)
• Random forests: Breiman (2001)
• Non- and semiparametrics, sieve estimation

But still, something has happened


• Data
• Computation
• Functional forms that work
• Prediction focus that turns it into engineering competition
• Some new theoretical insights and developments,
e.g. double descent, deep learning
ML basics recap
1. Flexible functional forms
2. Limit expressiveness (regularization)
3. Learn how much to regularize (tuning)

• What do these features imply for the properties of $\hat{f}$?
• And how can we therefore use $\hat{f}$ in applied work?
Structure of first chapter of webinar
1. Introduction

Training data $(y, x)$  →  $\hat{f}$  →  Application data $(\hat{y} = \hat{f}(x), x)$

2. The Secret Sauce of Machine Learning

3. Prediction vs Estimation
