Machine Learning and
Econometrics
Sendhil Mullainathan
(with Jann Spiess)
Outline
• How did I get interested in this?
• What is the secret sauce of machine learning?
• Where is machine learning useful in economics?
Magic?
• Hard not to be wowed
• But what makes them tick?
• Could that be used elsewhere? In my own work?
• Look at something simpler than vision
AI Approach
• We do it perfectly.
• How do we do it?
• Introspect
• Let’s program that up.
Programming
• For each review make a vector of words
• Figure out whether it has positive words and
negative words
• Count
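A minimal sketch of what "programming it up" looks like; the word lists (borrowed from the example words on the next slide) and the decision rule are illustrative stand-ins, not an actual lexicon:

```python
# A hand-programmed sentiment "classifier": count positive vs. negative words.
# The word lists below are illustrative stand-ins, not a real lexicon.
import string

POSITIVE = {"brilliant", "dazzling", "cool", "gripping", "moving"}
NEGATIVE = {"suck", "cliched", "slow", "awful", "bad"}

def classify_review(text: str) -> str:
    """Label a review 'good' or 'bad' by counting lexicon hits in its word vector."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    words = cleaned.split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "good" if score > 0 else "bad"

print(classify_review("A gripping, dazzling film"))     # -> good
print(classify_review("Slow, cliched and just awful"))  # -> bad
```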
Trying to Copy Humans
[Slide figure: a hand-picked lexicon of positive words (Brilliant, Dazzling, Cool, Gripping, Moving) and negative words (Suck, Cliched, Slow, Awful, Bad) classifies reviews with roughly 60% accuracy]
What is so hard?
• Decide what words make for a positive review
– What combination of words do you look for?
• “Some people say this talk was great”
• This difficulty was endemic to every problem
– Driving a car: What is a tree?
– Language: Which noun does this pronoun refer to?
This Approach Stalled
• “Trivial” problems proved impossible
– Marvin Minsky once assigned "the problem of
computer vision" as a summer project
• Forget about the more complicated problems
like language
What is the magic trick?
• Make this an empirical exercise
– Collect some data
• Example dataset:
– 2000 movie reviews
– 1000 good and 1000 bad reviews
• Now just ask what combination of words
predicts being a good review
Learning not Programming
[Slide figure: word weights learned from the data rather than hand-picked: Love, Superb, Great on the positive side, Bad, Stupid, Worst on the negative side, plus less obvious learned indicators such as "still", "?", and "!". Accuracy rises to roughly 95% (Pang, Lee and Vaithyanathan)]
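A rough sketch of the learning version under stated assumptions: a bag-of-words representation and an off-the-shelf linear classifier (scikit-learn here), with a tiny made-up corpus standing in for the 2000-review dataset:

```python
# Learning, not programming: the word weights are estimated from labeled reviews.
# Requires scikit-learn; the four reviews below are stand-ins for the real corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

reviews = ["superb and great acting", "I loved this great film",
           "stupid plot, the worst", "bad and boring"]
labels = [1, 1, 0, 0]                      # 1 = good review, 0 = bad review

vec = CountVectorizer()
X = vec.fit_transform(reviews)             # each review becomes a vector of word counts
clf = LogisticRegression().fit(X, labels)  # weights on words are learned, not hand-chosen

print(clf.predict(vec.transform(["a superb, great movie"])))  # typically [1], i.e. good
```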
Machine learning
• Turn any “intelligence” task into an empirical
learning task
– Specify what is to be predicted
– Specify what is used to predict it
ML drives many innovations…
• Every domain
– Post office uses machines to read addresses
– Voice recognition (Siri)
– Spam filters
– Recommender systems
– Driverless cars
• Not a coincidence that ML and big data arose
together
Wonderful
• Great that they discovered the 100+ year old
field of statistics!
• We’ve been estimating functions from data for a
long time
• KEY: This is in part definitely true
Principal Component Analysis Example
[Figure: data scattered along Original Variable A and Original Variable B, with the principal component axes PC 1 and PC 2 overlaid]
• Orthogonal directions of greatest variance in data
• Projections along PC1 discriminate the data most along any one axis
PCA
• The intuitions usually come from two
dimensions
• But in very high dimensions this can get
very interesting…
PCA applications: Eigenfaces
1. Large set of digitized images of human faces is taken under
the same lighting conditions.
2. The images are normalized to line up the eyes and mouths.
3. The eigenvectors of the covariance matrix of the statistical
distribution of face image vectors are then extracted.
4. These eigenvectors are called eigenfaces.
Vectorization is a key building block
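A sketch of steps 2 to 4 of that recipe in plain NumPy; the random array and the 24x24 image size are assumptions standing in for real normalized face images:

```python
# Eigenfaces from vectorized images: eigenvectors of the covariance matrix.
import numpy as np

n_images, h, w = 100, 24, 24
rng = np.random.default_rng(0)
faces = rng.random((n_images, h * w))           # each row: one normalized image as a vector

mean_face = faces.mean(axis=0)                  # the "average face"
cov = np.cov(faces - mean_face, rowvar=False)   # covariance of the face image vectors
eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvectors = eigenfaces

# Sort eigenfaces by decreasing variance explained and reshape back to images
order = np.argsort(eigvals)[::-1]
eigenfaces = eigvecs[:, order].T.reshape(-1, h, w)
print(eigenfaces.shape)                         # (576, 24, 24)
```

Flattening each image into one long row is exactly the vectorization step referred to above.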
PCA applications: Eigenfaces
• The principal eigenface looks like a bland
androgynous average human face
[Link]
Wonderful
• Great that they discovered the 100+ year old field
of statistics!
• We’ve been estimating functions from data for a
long time
• KEY: This is in part definitely true
• But in important ways not true
ASIDE: Vectorization (Pang, Lee and Vaithyanathan)
NOTE: Large sets of variables
Why high dimensional data analysis
should not really be possible
• Easiest to see in the linear case
• If you have n data points and k ≥ n variables, then
X’X is not invertible
– It is a (k+1) by (k+1) matrix (including the intercept)
– But its rank is at most n < k+1
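A quick numerical check of this rank argument, a minimal sketch with arbitrary dimensions (n = 50 observations, k = 100 regressors):

```python
# When there are at least as many regressors as observations, X'X is singular.
import numpy as np

n, k = 50, 100                                # fewer data points than variables
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(n), rng.standard_normal((n, k))])  # intercept + k regressors

XtX = X.T @ X                                 # (k+1) x (k+1) matrix
print(XtX.shape)                              # (101, 101)
print(np.linalg.matrix_rank(XtX))             # at most n = 50
print(np.linalg.matrix_rank(XtX) == XtX.shape[0])  # False: rank-deficient, not invertible
```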
Face Recognition
• Very simple problem
NOTE: vectorization again
$Y = \underbrace{\{0, 1\}}_{\text{Face?}}$
$X = \underbrace{\{0, 1, \ldots, g\}^{24 \times 24}}_{\text{gray scale}}$
$\hat{f} = \arg\min_f E[L(f(x), y)]$
where $L$ is some (possibly asymmetric) loss for correctly or incorrectly guessing
Face Recognition Dataset
• Sample size:
– 5000 face photos (+ many non-face photos)
• Number of variables:
– 24x24 pixel array
– So 576 variables (with values ranging up to g)
– Or 576*g binary variables if we dummy-code each gray-scale level
• A bit tight on sample size…
Functions non-linear in these dummies
• But that’s only if we use binary variables
• Obviously a face is not going to be well
approximated by a linear function of these
binary inputs….
Example of Interactions
“Rectangle filters”
Value = ∑ (pixels in white area) − ∑ (pixels in black area)
[Figure: an example rectangle filter overlaid on a source face image, with the resulting filter response]
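A minimal sketch of evaluating one such two-rectangle filter on a 24x24 gray-scale patch; the placement and the random patch are made up for illustration:

```python
# One "rectangle filter": sum of pixels in the white area minus sum in the black area.
import numpy as np

patch = np.random.default_rng(0).integers(0, 256, size=(24, 24))  # a 24x24 gray-scale patch

def two_rectangle_feature(img, top, left, height, width):
    """Left half treated as white, right half as black: sum(white) - sum(black)."""
    half = width // 2
    white = img[top:top + height, left:left + half].sum()
    black = img[top:top + height, left + half:left + width].sum()
    return int(white - black)

# One placement: a 12x12 region in the top-left corner, split into two 12x6 halves
print(two_rectangle_feature(patch, top=0, left=0, height=12, width=12))
```

Sliding such filters over all positions and sizes (and adding the other rectangle shapes) is what produces the huge feature count discussed next.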
How many variables do we have now?
• For a 24x24 detection region, the number of
possible rectangle features is ~160,000!
Something pretty interesting…
• High dimensional prediction
– What does high dimensional mean?
• Not (just) about more variables than data.
• Really about the “effective” number of variables, given
the function class F
• In the linear world, the dimension of the function class
equals the number of variables
• Able to search through many (MANY) possible
predictors
So…
Estimation:
• Fit Y with X
• Low dimensional

Machine Learning:
• Fit Y with X out of sample
• High dimensional

JUST BETTER?
Unbiased functions
[Diagram: Data → Estimation → Estimates. Here, face data with rectangle features go in, estimation searches a function class F, and a face predictor $\hat{f}$ comes out]
What we usually ask of the estimates:
• Unbiasedness, the “right” $\hat{f}$: $E_S[\hat{f}_{A,S}] = f^* = E[y|x]$, with $S_n = (y_i, x_i)$ i.i.d.
• Good confidence intervals: $f^* \in [\underline{f}, \overline{f}]$ with high probability
• Convergence to the truth: $\hat{f} \to f^*$
[Diagram: Data size vs. Estimates: information going in (thousands?) against information coming out (hundreds of thousands)]
How can we get more information out than we’re putting in?
Unbiased functions
[Same diagram repeated: unbiasedness ($E_S[\hat{f}_{A,S}] = f^* = E[y|x]$), good confidence intervals ($f^* \in [\underline{f}, \overline{f}]$ with high probability), and convergence to the truth ($\hat{f} \to f^*$)]
Do we need this?
Face Recognition
• Problem:
$Y = \underbrace{\{0, 1\}}_{\text{Face?}}$
$X = \underbrace{\{0, 1, \ldots, g\}^{24 \times 24}}_{\text{gray scale}}$
$\hat{f} = \arg\min_f E[L(f(x), y)]$
• Only need good predictions
• Gets more out? Put more in
Estimation vs Prediction
Estimation ($\hat{\beta}$):
• Strict assumptions about the data generating process
• Back out coefficients
• Low dimensional

Prediction ($\hat{y}$):
• Allow for flexible functional forms
• Get high quality predictions
• Give up on adjudicating between observably similar functions (variables)
But How?
• This tells us that there’s no free lunch
• But it does not tell us mechanically how machine
learning works…
Outline
• How did I get interested in this?
• What is the secret sauce of machine learning?
• Where is machine learning useful in economics?
Understand OLS
[Slides: OLS as averages; notation]
• The real problem here is minimizing the
“wrong” thing: In-sample fit vs out-of-sample fit
Overfit problem
• OLS looks good with the sample you have
– It’s the best you can do on this sample
• The problem is that OLS by construction overfits
– We overfit in estimation
– Where does overfit show up?
– But in low-dimensional settings this is not a major problem
This problem is exactly why wide data is
troubling
• Why are we worried about having so many
variables?
• We’ll fit very well (perfectly if k > n) in sample
• But arbitrarily badly out of sample
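A small simulation of that point, a sketch assuming pure-noise data (so there is nothing real to find) and a regressor count close to the sample size:

```python
# OLS with many regressors: near-perfect in-sample fit, badly negative out of sample.
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 90                                   # sample size barely above number of variables
X_train, X_test = rng.standard_normal((n, k)), rng.standard_normal((n, k))
y_train, y_test = rng.standard_normal(n), rng.standard_normal(n)  # pure noise

beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)          # OLS fit

def r2(y, yhat):
    return 1 - ((y - yhat) ** 2).sum() / ((y - y.mean()) ** 2).sum()

print(r2(y_train, X_train @ beta))   # close to 1: looks great on this sample
print(r2(y_test, X_test @ beta))     # typically well below 0: arbitrarily bad out of sample
```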
Understanding overfit
• Let’s consider a general class of algorithms
A General Class of Algorithms
• Consider algorithms of the form
$\hat{f}_A = \arg\min_{f \in F_A} \hat{E}[L(f(x), y)]$
– Like OLS, these are empirical loss minimizers
• So algorithms are equivalent to the function class
they choose from
• For estimation, what we typically do is…
– Show that empirical loss minimizers generate
unbiased estimates
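One way to make the "algorithm = function class" idea concrete: a toy empirical loss minimizer that does nothing but search its function class for the smallest average loss. The grid of linear predictors and the simulated data are illustrative assumptions, not anything from the slides:

```python
# An algorithm as an empirical loss minimizer over a function class F_A:
#   f_hat_A = argmin over f in F_A of (1/n) * sum_i L(f(x_i), y_i)
import numpy as np

def empirical_loss_minimizer(function_class, X, y, loss=lambda p, t: (p - t) ** 2):
    """Return the member of `function_class` with the smallest average loss on (X, y)."""
    return min(function_class, key=lambda f: np.mean([loss(f(x), t) for x, t in zip(X, y)]))

# A crude function class: linear predictors with slopes on a grid (its "dimension" is tiny)
F_A = [lambda x, b=b: b * x for b in np.linspace(-2, 2, 41)]

rng = np.random.default_rng(0)
X = np.linspace(0, 1, 50)
y = 0.7 * X + 0.1 * rng.standard_normal(50)   # simulated data with true slope 0.7

f_hat = empirical_loss_minimizer(F_A, X, y)
print(f_hat(1.0))                             # roughly 0.7: the slope the data pick out
```

OLS is the special case where the function class is all linear functions of the regressors and the loss is squared error.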