02 Statistical Learning
[Figure: scatterplots of Sales versus TV, Radio, and Newspaper advertising budgets for the Advertising data.]
Notation
Here Sales is a response or target that we wish to predict. We
generically refer to the response as Y .
TV is a feature, or input, or predictor; we name it X1 .
Likewise name Radio as X2 , and so on.
We can refer to the input vector collectively as
X = (X1, X2, X3)^T.
Now we write our model as
Y = f(X) + ε,
where ε captures measurement errors and other discrepancies.
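This data-generating model is easy to simulate. In the sketch below the particular f and noise level are illustrative choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical true regression function, chosen only for illustration.
    return 2.0 + 3.0 * np.sin(x)

n = 200
X = rng.uniform(0.0, 6.0, size=n)
eps = rng.normal(0.0, 0.5, size=n)   # measurement errors and other discrepancies
Y = f(X) + eps                       # the model Y = f(X) + epsilon
```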
This ideal f(x) = E(Y | X = x) is called the regression function.
ε = Y − f(x) is the irreducible error: even if we knew f(x), we would still make errors in prediction, since at each X = x there is typically a distribution of possible Y values.
How to estimate f
Typically we have few if any data points with X = 4 exactly.
So we cannot compute E(Y | X = x)!
Relax the definition and let
f̂(x) = Ave(Y | X ∈ N(x)),
where N(x) is some neighborhood of x.
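A minimal sketch of this relaxed estimator, assuming a one-dimensional toy data set and a fixed-width neighborhood N(x) = [x − w, x + w]:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 10.0, size=500)
Y = np.sin(X) + rng.normal(0.0, 0.3, size=500)  # toy data; true f(x) = sin(x)

def f_hat(x, X, Y, width=0.5):
    """Nearest-neighbor averaging: average Y over N(x) = [x - width, x + width]."""
    in_neighborhood = np.abs(X - x) <= width
    return Y[in_neighborhood].mean()

print(f_hat(2.0, X, Y))  # should be near sin(2.0) ~ 0.91
```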
Nearest-neighbor averaging can work well for small p (say p ≤ 4) and large-ish N, but it breaks down when p is large: the curse of dimensionality. Nearest neighbors tend to be far away in high dimensions.
[Figure: a 10% neighborhood in two dimensions (x1, x2), and the radius needed to capture a given fraction of the volume for p = 1, 2, 3, 5, 10.]
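The fraction-of-volume effect can be computed directly: in a unit hypercube in p dimensions, a sub-cube capturing a fraction r of the volume must have edge length r^(1/p). A quick sketch:

```python
# Edge length of a sub-cube of the unit hypercube in p dimensions
# that captures 10% of the volume: 0.10 ** (1 / p).
for p in [1, 2, 3, 5, 10]:
    edge = 0.10 ** (1 / p)
    print(f"p = {p:2d}: edge length for a 10% neighborhood = {edge:.2f}")
```

For p = 10 the "neighborhood" spans about 79% of each axis, so it is hardly local — which is why nearest-neighbor averaging degrades in high dimensions.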
The linear model is an important example of a parametric model:
f_L(X) = β0 + β1 X1 + β2 X2 + . . . + βp Xp.
A linear model is specified in terms of p + 1 parameters β0, β1, . . . , βp. We estimate the parameters by fitting the model to training data.
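A sketch of fitting such a linear model by least squares on simulated training data (the coefficients and sample size here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated training data from a linear model with p = 2.
n = 100
X = rng.normal(size=(n, 2))
beta_true = np.array([1.0, 2.0, -1.0])                 # beta_0, beta_1, beta_2
Y = beta_true[0] + X @ beta_true[1:] + rng.normal(0.0, 0.1, size=n)

# Estimate the p + 1 parameters by least squares.
design = np.column_stack([np.ones(n), X])              # column of 1s for beta_0
beta_hat, *_ = np.linalg.lstsq(design, Y, rcond=None)
print(beta_hat)  # close to beta_true
```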
[Figure: simulated Income data — Income as a function of Years of Education and Seniority, with fitted surfaces of varying flexibility.]
Some trade-offs
- Prediction accuracy versus interpretability: linear models are easy to interpret; thin-plate splines are not.
- Good fit versus over-fit or under-fit: how do we know when the fit is just right?
- Parsimony versus black box: we often prefer a simpler model involving fewer variables over a black-box predictor involving them all.
[Figure: interpretability versus flexibility for several methods — Subset Selection and Lasso (high interpretability, low flexibility), then Least Squares, Generalized Additive Models and Trees, and Bagging and Boosting (low interpretability, high flexibility).]
Suppose we fit a model f̂(x) to some training data Tr = {xi, yi}_1^N, and we wish to see how well it performs. We could compute the average squared prediction error over Tr:
MSE_Tr = Ave_{i∈Tr}[yi − f̂(xi)]^2.
This may be biased toward more overfit models. Instead, if possible, we should compute it using fresh test data Te = {xi, yi}_1^M:
MSE_Te = Ave_{i∈Te}[yi − f̂(xi)]^2.
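Both quantities are straightforward to compute; the sketch below uses simulated data and polynomial fits of increasing flexibility (all choices illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

def true_f(x):
    return np.sin(1.5 * x)   # hypothetical true regression function

# Training data Tr and fresh test data Te from the same population.
x_tr = rng.uniform(0.0, 5.0, 100); y_tr = true_f(x_tr) + rng.normal(0.0, 0.3, 100)
x_te = rng.uniform(0.0, 5.0, 100); y_te = true_f(x_te) + rng.normal(0.0, 0.3, 100)

mse_tr, mse_te = {}, {}
for degree in [1, 3, 15]:
    coeffs = np.polyfit(x_tr, y_tr, degree)
    mse_tr[degree] = np.mean((y_tr - np.polyval(coeffs, x_tr)) ** 2)
    mse_te[degree] = np.mean((y_te - np.polyval(coeffs, x_te)) ** 2)
    print(f"degree {degree:2d}: MSE_Tr = {mse_tr[degree]:.3f}, "
          f"MSE_Te = {mse_te[degree]:.3f}")
```

Training MSE always decreases with flexibility; test MSE need not, which is the bias toward overfit models noted above.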
[Figure 2.9: left — simulated data (Y versus X) with fits of increasing flexibility; right — training and test MSE as a function of flexibility.]
Here the truth is smoother, so the smoother fit and linear model do really well.
FIGURE 2.10. Details are as in Figure 2.9, using a different true f that is much closer to linear. In this setting, linear regression provides a very good fit to the data.
Here the truth is wiggly and the noise is low, so the more flexible fits do the best.
FIGURE 2.11. Details are as in Figure 2.9, using a different f that is far from linear. In this setting, linear regression provides a very poor fit to the data.
Bias-Variance Trade-off
Suppose we have fit a model f̂(x) to some training data Tr, and let (x0, y0) be a test observation drawn from the population. If the true model is Y = f(X) + ε (with f(x) = E(Y | X = x)), then
E[(y0 − f̂(x0))^2] = Var(f̂(x0)) + [Bias(f̂(x0))]^2 + Var(ε).
The expectation averages over the variability of y0 as well as the variability in Tr. Note that Bias(f̂(x0)) = E[f̂(x0)] − f(x0).
Typically as the flexibility of f̂ increases, its variance increases and its bias decreases. So choosing the flexibility based on average test error amounts to a bias-variance trade-off.
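The decomposition can be checked by simulation: repeatedly draw training sets, refit, and measure the spread and the average offset of f̂(x0). All modeling choices below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
f = np.sin          # hypothetical true regression function
sigma = 0.3         # sd of the irreducible error
x0 = 2.0

def fit_and_predict(degree):
    """Draw a fresh training set, fit a degree-d polynomial, predict at x0."""
    x = rng.uniform(0.0, 5.0, 50)
    y = f(x) + rng.normal(0.0, sigma, 50)
    return np.polyval(np.polyfit(x, y, degree), x0)

bias2, var = {}, {}
for degree in [1, 10]:
    preds = np.array([fit_and_predict(degree) for _ in range(2000)])
    bias2[degree] = (preds.mean() - f(x0)) ** 2
    var[degree] = preds.var()
    print(f"degree {degree:2d}: bias^2 = {bias2[degree]:.4f}, "
          f"variance = {var[degree]:.4f}, "
          f"est. test error = {bias2[degree] + var[degree] + sigma**2:.4f}")
```

The rigid fit (degree 1) shows high bias and low variance; the flexible fit (degree 10) shows the reverse.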
FIGURE 2.12. Squared bias (blue curve), variance (orange curve), Var(ε) (dashed line), and test MSE (red curve) for the three data sets in Figures 2.9–2.11. The vertical dashed line indicates the flexibility level corresponding to the smallest test MSE.
Classification Problems
Here the feature vector is X = (X1, X2, . . . , Xp), and the response Y is qualitative. The goal is to build a classifier C(X) that assigns a class label to a future unlabeled observation.
Let pk(x) = Pr(Y = k | X = x), k = 1, . . . , K, denote the conditional class probabilities at x. The Bayes optimal classifier at x is
C(x) = j if pj(x) = max{p1(x), . . . , pK(x)}.
Nearest-neighbor averaging can be used to estimate the pk(x) as before, and it also breaks down as the dimension grows; however, the impact on Ĉ(x) is less than on p̂k(x), k = 1, . . . , K.
We typically measure the performance of Ĉ(x) using the misclassification error rate on test data Te:
Err_Te = Ave_{i∈Te} I[yi ≠ Ĉ(xi)].
The Bayes classifier (using the true pk(x)) has the smallest error in the population.
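A sketch on simulated data where the true pk(x) are known, so the Bayes rule is available exactly (the two-Gaussian setup is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)

# Two classes with equal priors: X | Y=0 ~ N(-1, 1) and X | Y=1 ~ N(+1, 1).
# Here p1(x) > p0(x) exactly when x > 0, so the Bayes classifier is C(x) = 1[x > 0].
n = 5000
y = rng.integers(0, 2, size=n)
x = rng.normal(loc=2.0 * y - 1.0, scale=1.0)

def err_rate(y_true, y_pred):
    """Misclassification error rate: Ave I[y != C(x)]."""
    return float(np.mean(y_true != y_pred))

bayes_err = err_rate(y, (x > 0.0).astype(int))   # the Bayes rule for this population
other_err = err_rate(y, (x > 0.8).astype(int))   # any other threshold does worse
print(bayes_err, other_err)
```

Any classifier other than the Bayes rule incurs a higher error rate here, up to sampling noise.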
[Figure 2.13: simulated two-class data in (X1, X2), with the Bayes decision boundary.]
KNN: K=10
[Figure: the two-class data in (X1, X2) with the K = 10 KNN decision boundary.]
FIGURE 2.15. The black curve indicates the KNN decision boundary on the
data from Figure 2.13, using K = 10. The Bayes decision boundary is shown as
a purple dashed line. The KNN and Bayes decision boundaries are very similar.
KNN: K=1
[Figure: the same data with the K = 1 KNN decision boundary.]
KNN: K=100
[Figure: the same data with the K = 100 KNN decision boundary.]
[Figure: KNN training and test error rates as a function of 1/K.]
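The qualitative behavior in this plot — training error driven to zero as K → 1 while test error follows a U-shape — can be reproduced with a from-scratch KNN on simulated data (all data choices hypothetical):

```python
import numpy as np

rng = np.random.default_rng(6)

def make_data(n):
    """Two Gaussian classes in two dimensions, equal priors."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, 2)) + np.where(y[:, None] == 1, 1.0, -1.0)
    return X, y

X_tr, y_tr = make_data(200)
X_te, y_te = make_data(1000)

def knn_predict(X_query, X_train, y_train, k):
    """Classify each query point by majority vote among its k nearest neighbors."""
    out = np.empty(len(X_query), dtype=int)
    for i, q in enumerate(X_query):
        nearest = np.argsort(((X_train - q) ** 2).sum(axis=1))[:k]
        out[i] = np.bincount(y_train[nearest]).argmax()
    return out

train_err, test_err = {}, {}
for k in [1, 10, 100]:
    train_err[k] = np.mean(knn_predict(X_tr, X_tr, y_tr, k) != y_tr)
    test_err[k] = np.mean(knn_predict(X_te, X_tr, y_tr, k) != y_te)
    print(f"K = {k:3d}: training error = {train_err[k]:.3f}, "
          f"test error = {test_err[k]:.3f}")
```

At K = 1 every training point is its own nearest neighbor, so the training error is exactly zero even though the test error is not.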