Machine Learning and
Econometrics
Sendhil Mullainathan
(with Jann Spiess)
Outline
• How did I get interested in this?
• What is the secret sauce of machine learning?
• Where is machine learning useful in economics?
Magic?
• Hard not to be wowed
• But what makes them tick?
• Could that be used elsewhere? In my own work?
• Look at something simpler than vision
AI Approach
• We do it perfectly.
• How do we do it?
• Introspect
• Let’s program that up.
Programming
• For each review make a vector of words
• Figure out whether it has positive words and
negative words
• Count
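A minimal sketch of what "programming it up" looks like; the word lists (borrowed from the example words on the next slide) and the decision rule are illustrative stand-ins, not an actual lexicon:

```python
# A hand-programmed sentiment "classifier": count positive vs. negative words.
# The word lists below are illustrative stand-ins, not a real lexicon.
import string

POSITIVE = {"brilliant", "dazzling", "cool", "gripping", "moving"}
NEGATIVE = {"suck", "cliched", "slow", "awful", "bad"}

def classify_review(text: str) -> str:
    """Label a review 'good' or 'bad' by counting lexicon hits in its word vector."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    words = cleaned.split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "good" if score > 0 else "bad"

print(classify_review("A gripping, dazzling film"))     # -> good
print(classify_review("Slow, cliched and just awful"))  # -> bad
```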
Trying to Copy Humans
[Slide figure: a hand-picked lexicon of positive words (Brilliant, Dazzling, Cool, Gripping, Moving) and negative words (Suck, Cliched, Slow, Awful, Bad) classifies reviews with roughly 60% accuracy]
What is so hard?
• Decide what words make for a positive review
– What combination of words do you look for?
• “Some people say this talk was great”
• This difficulty was endemic to every problem
– Driving a car: What is a tree?
– Language: Which noun does this pronoun refer to?
This Approach Stalled
• “Trivial” problems proved impossible
– Marvin Minsky once assigned "the problem of
computer vision" as a summer project
• Forget about the more complicated problems
like language
What is the magic trick?
• Make this an empirical exercise
– Collect some data
• Example dataset:
– 2000 movie reviews
– 1000 good and 1000 bad reviews
• Now just ask what combination of words
predicts being a good review
Learning not Programming
[Slide figure: word weights learned from the data rather than hand-picked: Love, Superb, Great on the positive side, Bad, Stupid, Worst on the negative side, plus less obvious learned indicators such as "still", "?", and "!". Accuracy rises to roughly 95% (Pang, Lee and Vaithyanathan)]
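A rough sketch of the learning version under stated assumptions: a bag-of-words representation and an off-the-shelf linear classifier (scikit-learn here), with a tiny made-up corpus standing in for the 2000-review dataset:

```python
# Learning, not programming: the word weights are estimated from labeled reviews.
# Requires scikit-learn; the four reviews below are stand-ins for the real corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

reviews = ["superb and great acting", "I loved this great film",
           "stupid plot, the worst", "bad and boring"]
labels = [1, 1, 0, 0]                      # 1 = good review, 0 = bad review

vec = CountVectorizer()
X = vec.fit_transform(reviews)             # each review becomes a vector of word counts
clf = LogisticRegression().fit(X, labels)  # weights on words are learned, not hand-chosen

print(clf.predict(vec.transform(["a superb, great movie"])))  # typically [1], i.e. good
```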
Machine learning
• Turn any “intelligence” task into an empirical
learning task
– Specify what is to be predicted
– Specify what is used to predict it
ML drives many innovations…
• Every domain
– Post office uses machines to read addresses
– Voice recognition (Siri)
– Spam filters
– Recommender systems
– Driverless cars
• Not a coincidence that ML and big data arose
together
Wonderful
• Great that they discovered the 100+ year old
field of statistics!
• We’ve been estimating functions from data for a
long time
• KEY: This is in part definitely true
Principal Component Analysis Example
[Figure: data scattered along Original Variable A and Original Variable B, with the principal component axes PC 1 and PC 2 overlaid]
• Orthogonal directions of greatest variance in data
• Projections along PC1 discriminate the data most along any one axis
PCA
• The intuitions usually come from two
dimensions
• But in very high dimensions this can get
very interesting…
PCA applications: Eigenfaces
1. Large set of digitized images of human faces is taken under
the same lighting conditions.
2. The images are normalized to line up the eyes and mouths.
3. The eigenvectors of the covariance matrix of the statistical
distribution of face image vectors are then extracted.
4. These eigenvectors are called eigenfaces.
Vectorization is a key building block
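A sketch of steps 2 to 4 of that recipe in plain NumPy; the random array and the 24x24 image size are assumptions standing in for real normalized face images:

```python
# Eigenfaces from vectorized images: eigenvectors of the covariance matrix.
import numpy as np

n_images, h, w = 100, 24, 24
rng = np.random.default_rng(0)
faces = rng.random((n_images, h * w))           # each row: one normalized image as a vector

mean_face = faces.mean(axis=0)                  # the "average face"
cov = np.cov(faces - mean_face, rowvar=False)   # covariance of the face image vectors
eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvectors = eigenfaces

# Sort eigenfaces by decreasing variance explained and reshape back to images
order = np.argsort(eigvals)[::-1]
eigenfaces = eigvecs[:, order].T.reshape(-1, h, w)
print(eigenfaces.shape)                         # (576, 24, 24)
```

Flattening each image into one long row is exactly the vectorization step referred to above.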
PCA applications: Eigenfaces
• The principal eigenface looks like a bland
androgynous average human face
[Link]
Wonderful
• Great that they discovered the 100+ year old field
of statistics!
• We’ve been estimating functions from data for a
long time
• KEY: This is in part definitely true
• But in important ways not true
ASIDE: Vectorization (Pang, Lee and Vaithyanathan)
NOTE: Large sets of variables
Why high dimensional data analysis
should not really be possible
• Easiest to see in the linear case
• If you have n data points and k ≥ n variables, then
X’X is not invertible
– It is a (k+1) by (k+1) matrix (including the intercept)
– But its rank is at most n < k+1
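A quick numerical check of this rank argument, a minimal sketch with arbitrary dimensions (n = 50 observations, k = 100 regressors):

```python
# When there are at least as many regressors as observations, X'X is singular.
import numpy as np

n, k = 50, 100                                # fewer data points than variables
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(n), rng.standard_normal((n, k))])  # intercept + k regressors

XtX = X.T @ X                                 # (k+1) x (k+1) matrix
print(XtX.shape)                              # (101, 101)
print(np.linalg.matrix_rank(XtX))             # at most n = 50
print(np.linalg.matrix_rank(XtX) == XtX.shape[0])  # False: rank-deficient, not invertible
```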
Face Recognition
• Very simple problem
NOTE: vectorization again
$Y = \underbrace{\{0, 1\}}_{\text{Face?}}$
$X = \underbrace{\{0, 1, \ldots, g\}^{24 \times 24}}_{\text{gray scale}}$
$\hat{f} = \arg\min_f E[L(f(x), y)]$
where $L$ is some (possibly asymmetric) loss for correctly or incorrectly guessing
Face Recognition Dataset
• Sample size:
– 5000 face photos (+ many non-face photos)
• Number of variables:
– 24x24 pixel array
– So 576 variables (with values ranging up to g)
– Or 576*g binary variables if we dummy-code each gray-scale level
• A bit tight on sample size…
Functions non-linear in these dummies
• But that’s only if we use binary variables
• Obviously a face is not going to be well
approximated by a linear function of these
binary inputs….
Example of Interactions
“Rectangle filters”
Value = ∑ (pixels in white area) − ∑ (pixels in black area)
[Figure: an example rectangle filter overlaid on a source face image, with the resulting filter response]
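A minimal sketch of evaluating one such two-rectangle filter on a 24x24 gray-scale patch; the placement and the random patch are made up for illustration:

```python
# One "rectangle filter": sum of pixels in the white area minus sum in the black area.
import numpy as np

patch = np.random.default_rng(0).integers(0, 256, size=(24, 24))  # a 24x24 gray-scale patch

def two_rectangle_feature(img, top, left, height, width):
    """Left half treated as white, right half as black: sum(white) - sum(black)."""
    half = width // 2
    white = img[top:top + height, left:left + half].sum()
    black = img[top:top + height, left + half:left + width].sum()
    return int(white - black)

# One placement: a 12x12 region in the top-left corner, split into two 12x6 halves
print(two_rectangle_feature(patch, top=0, left=0, height=12, width=12))
```

Sliding such filters over all positions and sizes (and adding the other rectangle shapes) is what produces the huge feature count discussed next.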
How many variables do we have now?
• For a 24x24 detection region, the number of
possible rectangle features is ~160,000!
Something pretty interesting…
• High dimensional prediction
– What does high dimensional mean?
• Not (just) about more variables than data.
• Really about the “effective” number of variables, given
the function class F
• In the linear world, the dimension of the function class
equals the number of variables
• Able to search through many (MANY) possible
predictors
So…
Estimation:
• Fit Y with X
• Low dimensional

Machine Learning:
• Fit Y with X out of sample
• High dimensional

JUST BETTER?
Unbiased functions
[Diagram: Data → Estimation → Estimates. Here, face data with rectangle features go in, estimation searches a function class F, and a face predictor $\hat{f}$ comes out]
What we usually ask of the estimates:
• Unbiasedness, the “right” $\hat{f}$: $E_S[\hat{f}_{A,S}] = f^* = E[y|x]$, with $S_n = (y_i, x_i)$ i.i.d.
• Good confidence intervals: $f^* \in [\underline{f}, \overline{f}]$ with high probability
• Convergence to the truth: $\hat{f} \to f^*$
[Diagram: Data size vs. Estimates: information going in (thousands?) against information coming out (hundreds of thousands)]
How can we get more information out than we’re putting in?
Unbiased functions
[Same diagram repeated: unbiasedness ($E_S[\hat{f}_{A,S}] = f^* = E[y|x]$), good confidence intervals ($f^* \in [\underline{f}, \overline{f}]$ with high probability), and convergence to the truth ($\hat{f} \to f^*$)]
Do we need this?
Face Recognition
• Problem:
$Y = \underbrace{\{0, 1\}}_{\text{Face?}}$
$X = \underbrace{\{0, 1, \ldots, g\}^{24 \times 24}}_{\text{gray scale}}$
$\hat{f} = \arg\min_f E[L(f(x), y)]$
• Only need good predictions
• Gets more out? Put more in
Estimation vs Prediction
Estimation ($\hat{\beta}$):
• Strict assumptions about the data generating process
• Back out coefficients
• Low dimensional

Prediction ($\hat{y}$):
• Allow for flexible functional forms
• Get high quality predictions
• Give up on adjudicating between observably similar functions (variables)
But How?
• This tells us that there’s no free lunch
• But it does not tell us mechanically how machine
learning works…
Outline
• How did I get interested in this?
• What is the secret sauce of machine learning?
• Where is machine learning useful in economics?
Understand OLS
[Slides: OLS as averages; notation]
• The real problem here is minimizing the
“wrong” thing: In-sample fit vs out-of-sample fit
Overfit problem
• OLS looks good with the sample you have
– It’s the best you can do on this sample
• The problem is that OLS by construction overfits
– We overfit in estimation
– Where does overfit show up?
– But in low-dimensional settings this is not a major problem
This problem is exactly why wide data is
troubling
• Why are we worried about having so many
variables?
• We’ll fit very well (perfectly if k > n) in sample
• But arbitrarily badly out of sample
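A small simulation of that point, a sketch assuming pure-noise data (so there is nothing real to find) and a regressor count close to the sample size:

```python
# OLS with many regressors: near-perfect in-sample fit, badly negative out of sample.
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 90                                   # sample size barely above number of variables
X_train, X_test = rng.standard_normal((n, k)), rng.standard_normal((n, k))
y_train, y_test = rng.standard_normal(n), rng.standard_normal(n)  # pure noise

beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)          # OLS fit

def r2(y, yhat):
    return 1 - ((y - yhat) ** 2).sum() / ((y - y.mean()) ** 2).sum()

print(r2(y_train, X_train @ beta))   # close to 1: looks great on this sample
print(r2(y_test, X_test @ beta))     # typically well below 0: arbitrarily bad out of sample
```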
Understanding overfit
• Let’s consider a general class of algorithms
A General Class of Algorithms
• Consider algorithms of the form
$\hat{f}_A = \arg\min_{f \in F_A} \hat{E}[L(f(x), y)]$
– Like OLS, these are empirical loss minimizers
• So algorithms are equivalent to the function class
they choose from
• For estimation, what we typically do is…
– Show that empirical loss minimizers generate
unbiased estimates
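One way to make the "algorithm = function class" idea concrete: a toy empirical loss minimizer that does nothing but search its function class for the smallest average loss. The grid of linear predictors and the simulated data are illustrative assumptions, not anything from the slides:

```python
# An algorithm as an empirical loss minimizer over a function class F_A:
#   f_hat_A = argmin over f in F_A of (1/n) * sum_i L(f(x_i), y_i)
import numpy as np

def empirical_loss_minimizer(function_class, X, y, loss=lambda p, t: (p - t) ** 2):
    """Return the member of `function_class` with the smallest average loss on (X, y)."""
    return min(function_class, key=lambda f: np.mean([loss(f(x), t) for x, t in zip(X, y)]))

# A crude function class: linear predictors with slopes on a grid (its "dimension" is tiny)
F_A = [lambda x, b=b: b * x for b in np.linspace(-2, 2, 41)]

rng = np.random.default_rng(0)
X = np.linspace(0, 1, 50)
y = 0.7 * X + 0.1 * rng.standard_normal(50)   # simulated data with true slope 0.7

f_hat = empirical_loss_minimizer(F_A, X, y)
print(f_hat(1.0))                             # roughly 0.7: the slope the data pick out
```

OLS is the special case where the function class is all linear functions of the regressors and the loss is squared error.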