02 Statistical Learning
[Figure: scatterplots of Sales versus TV, Radio, and Newspaper advertising budgets for the Advertising data.]
Notation
Here Sales is a response or target that we wish to predict. We
generically refer to the response as Y .
TV is a feature, or input, or predictor; we name it X1 .
Likewise name Radio as X2 , and so on.
We can refer to the input vector collectively as
X = (X1, X2, X3)^T.
Now we write our model as
Y = f(X) + ε,
where ε captures measurement errors and other discrepancies.
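This data-generating model is easy to simulate. In the sketch below the particular f and noise level are illustrative choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical true regression function, chosen only for illustration.
    return 2.0 + 3.0 * np.sin(x)

n = 200
X = rng.uniform(0.0, 6.0, size=n)
eps = rng.normal(0.0, 0.5, size=n)   # measurement errors and other discrepancies
Y = f(X) + eps                       # the model Y = f(X) + epsilon
```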
This ideal f(x) = E(Y | X = x) is called the regression function.
ε = Y − f(x) is the irreducible error: even if we knew f(x), we would still make errors in prediction, since at each X = x there is typically a distribution of possible Y values.
How to estimate f
Typically we have few if any data points with X = 4 exactly.
So we cannot compute E(Y | X = x)!
Relax the definition and let
f̂(x) = Ave(Y | X ∈ N(x)),
where N(x) is some neighborhood of x.
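A minimal sketch of this relaxed estimator, assuming a one-dimensional toy data set and a fixed-width neighborhood N(x) = [x − w, x + w]:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 10.0, size=500)
Y = np.sin(X) + rng.normal(0.0, 0.3, size=500)  # toy data; true f(x) = sin(x)

def f_hat(x, X, Y, width=0.5):
    """Nearest-neighbor averaging: average Y over N(x) = [x - width, x + width]."""
    in_neighborhood = np.abs(X - x) <= width
    return Y[in_neighborhood].mean()

print(f_hat(2.0, X, Y))  # should be near sin(2.0) ~ 0.91
```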
Nearest-neighbor averaging can work well for small p (say p ≤ 4) and large-ish N, but it breaks down when p is large: the curse of dimensionality. Nearest neighbors tend to be far away in high dimensions.
[Figure: a 10% neighborhood in two dimensions (x1, x2), and the radius needed to capture a given fraction of the volume for p = 1, 2, 3, 5, 10.]
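The fraction-of-volume effect can be computed directly: in a unit hypercube in p dimensions, a sub-cube capturing a fraction r of the volume must have edge length r^(1/p). A quick sketch:

```python
# Edge length of a sub-cube of the unit hypercube in p dimensions
# that captures 10% of the volume: 0.10 ** (1 / p).
for p in [1, 2, 3, 5, 10]:
    edge = 0.10 ** (1 / p)
    print(f"p = {p:2d}: edge length for a 10% neighborhood = {edge:.2f}")
```

For p = 10 the "neighborhood" spans about 79% of each axis, so it is hardly local — which is why nearest-neighbor averaging degrades in high dimensions.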
The linear model is an important example of a parametric model:
f_L(X) = β0 + β1 X1 + β2 X2 + . . . + βp Xp.
A linear model is specified in terms of p + 1 parameters β0, β1, . . . , βp. We estimate the parameters by fitting the model to training data.
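A sketch of fitting such a linear model by least squares on simulated training data (the coefficients and sample size here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated training data from a linear model with p = 2.
n = 100
X = rng.normal(size=(n, 2))
beta_true = np.array([1.0, 2.0, -1.0])                 # beta_0, beta_1, beta_2
Y = beta_true[0] + X @ beta_true[1:] + rng.normal(0.0, 0.1, size=n)

# Estimate the p + 1 parameters by least squares.
design = np.column_stack([np.ones(n), X])              # column of 1s for beta_0
beta_hat, *_ = np.linalg.lstsq(design, Y, rcond=None)
print(beta_hat)  # close to beta_true
```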
[Figure: simulated Income data — Income as a function of Years of Education and Seniority, with fitted surfaces of varying flexibility.]
Some trade-offs
- Prediction accuracy versus interpretability: linear models are easy to interpret; thin-plate splines are not.
- Good fit versus over-fit or under-fit: how do we know when the fit is just right?
- Parsimony versus black box: we often prefer a simpler model involving fewer variables over a black-box predictor involving them all.
[Figure: interpretability versus flexibility for several methods — Subset Selection and Lasso (high interpretability, low flexibility), then Least Squares, Generalized Additive Models and Trees, and Bagging and Boosting (low interpretability, high flexibility).]
Suppose we fit a model f̂(x) to some training data Tr = {xi, yi}_1^N, and we wish to see how well it performs. We could compute the average squared prediction error over Tr:
MSE_Tr = Ave_{i∈Tr}[yi − f̂(xi)]^2.
This may be biased toward more overfit models. Instead, if possible, we should compute it using fresh test data Te = {xi, yi}_1^M:
MSE_Te = Ave_{i∈Te}[yi − f̂(xi)]^2.
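Both quantities are straightforward to compute; the sketch below uses simulated data and polynomial fits of increasing flexibility (all choices illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

def true_f(x):
    return np.sin(1.5 * x)   # hypothetical true regression function

# Training data Tr and fresh test data Te from the same population.
x_tr = rng.uniform(0.0, 5.0, 100); y_tr = true_f(x_tr) + rng.normal(0.0, 0.3, 100)
x_te = rng.uniform(0.0, 5.0, 100); y_te = true_f(x_te) + rng.normal(0.0, 0.3, 100)

mse_tr, mse_te = {}, {}
for degree in [1, 3, 15]:
    coeffs = np.polyfit(x_tr, y_tr, degree)
    mse_tr[degree] = np.mean((y_tr - np.polyval(coeffs, x_tr)) ** 2)
    mse_te[degree] = np.mean((y_te - np.polyval(coeffs, x_te)) ** 2)
    print(f"degree {degree:2d}: MSE_Tr = {mse_tr[degree]:.3f}, "
          f"MSE_Te = {mse_te[degree]:.3f}")
```

Training MSE always decreases with flexibility; test MSE need not, which is the bias toward overfit models noted above.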
[Figure 2.9: left — simulated data (Y versus X) with fits of increasing flexibility; right — training and test MSE as a function of flexibility.]
Here the truth is smoother, so the smoother fit and linear model do really well.
FIGURE 2.10. Details are as in Figure 2.9, using a different true f that is much closer to linear. In this setting, linear regression provides a very good fit to the data.
Here the truth is wiggly and the noise is low, so the more flexible fits do the best.
FIGURE 2.11. Details are as in Figure 2.9, using a different f that is far from linear. In this setting, linear regression provides a very poor fit to the data.
Bias-Variance Trade-off
Suppose we have fit a model f̂(x) to some training data Tr, and let (x0, y0) be a test observation drawn from the population. If the true model is Y = f(X) + ε (with f(x) = E(Y | X = x)), then
E[(y0 − f̂(x0))^2] = Var(f̂(x0)) + [Bias(f̂(x0))]^2 + Var(ε).
The expectation averages over the variability of y0 as well as the variability in Tr. Note that Bias(f̂(x0)) = E[f̂(x0)] − f(x0).
Typically as the flexibility of f̂ increases, its variance increases and its bias decreases. So choosing the flexibility based on average test error amounts to a bias-variance trade-off.
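The decomposition can be checked by simulation: repeatedly draw training sets, refit, and measure the spread and the average offset of f̂(x0). All modeling choices below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
f = np.sin          # hypothetical true regression function
sigma = 0.3         # sd of the irreducible error
x0 = 2.0

def fit_and_predict(degree):
    """Draw a fresh training set, fit a degree-d polynomial, predict at x0."""
    x = rng.uniform(0.0, 5.0, 50)
    y = f(x) + rng.normal(0.0, sigma, 50)
    return np.polyval(np.polyfit(x, y, degree), x0)

bias2, var = {}, {}
for degree in [1, 10]:
    preds = np.array([fit_and_predict(degree) for _ in range(2000)])
    bias2[degree] = (preds.mean() - f(x0)) ** 2
    var[degree] = preds.var()
    print(f"degree {degree:2d}: bias^2 = {bias2[degree]:.4f}, "
          f"variance = {var[degree]:.4f}, "
          f"est. test error = {bias2[degree] + var[degree] + sigma**2:.4f}")
```

The rigid fit (degree 1) shows high bias and low variance; the flexible fit (degree 10) shows the reverse.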
FIGURE 2.12. Squared bias (blue curve), variance (orange curve), Var(ε) (dashed line), and test MSE (red curve) for the three data sets in Figures 2.9–2.11. The vertical dashed line indicates the flexibility level corresponding to the smallest test MSE.
Classification Problems
Here the feature vector is X = (X1, X2, . . . , Xp), and the response Y is qualitative. The goal is to build a classifier C(X) that assigns a class label to a future unlabeled observation.
Let pk(x) = Pr(Y = k | X = x), k = 1, . . . , K, denote the conditional class probabilities at x. The Bayes optimal classifier at x is
C(x) = j if pj(x) = max{p1(x), . . . , pK(x)}.
Nearest-neighbor averaging can be used to estimate the pk(x) as before, and it also breaks down as the dimension grows; however, the impact on Ĉ(x) is less than on p̂k(x), k = 1, . . . , K.
We typically measure the performance of Ĉ(x) using the misclassification error rate on test data Te:
Err_Te = Ave_{i∈Te} I[yi ≠ Ĉ(xi)].
The Bayes classifier (using the true pk(x)) has the smallest error in the population.
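A sketch on simulated data where the true pk(x) are known, so the Bayes rule is available exactly (the two-Gaussian setup is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)

# Two classes with equal priors: X | Y=0 ~ N(-1, 1) and X | Y=1 ~ N(+1, 1).
# Here p1(x) > p0(x) exactly when x > 0, so the Bayes classifier is C(x) = 1[x > 0].
n = 5000
y = rng.integers(0, 2, size=n)
x = rng.normal(loc=2.0 * y - 1.0, scale=1.0)

def err_rate(y_true, y_pred):
    """Misclassification error rate: Ave I[y != C(x)]."""
    return float(np.mean(y_true != y_pred))

bayes_err = err_rate(y, (x > 0.0).astype(int))   # the Bayes rule for this population
other_err = err_rate(y, (x > 0.8).astype(int))   # any other threshold does worse
print(bayes_err, other_err)
```

Any classifier other than the Bayes rule incurs a higher error rate here, up to sampling noise.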
[Figure 2.13: simulated two-class data in (X1, X2), with the Bayes decision boundary.]
KNN: K=10
[Figure: the two-class data in (X1, X2) with the K = 10 KNN decision boundary.]
FIGURE 2.15. The black curve indicates the KNN decision boundary on the
data from Figure 2.13, using K = 10. The Bayes decision boundary is shown as
a purple dashed line. The KNN and Bayes decision boundaries are very similar.
KNN: K=1
[Figure: the same data with the K = 1 KNN decision boundary.]
KNN: K=100
[Figure: the same data with the K = 100 KNN decision boundary.]
[Figure: KNN training and test error rates as a function of 1/K.]
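The qualitative behavior in this plot — training error driven to zero as K → 1 while test error follows a U-shape — can be reproduced with a from-scratch KNN on simulated data (all data choices hypothetical):

```python
import numpy as np

rng = np.random.default_rng(6)

def make_data(n):
    """Two Gaussian classes in two dimensions, equal priors."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, 2)) + np.where(y[:, None] == 1, 1.0, -1.0)
    return X, y

X_tr, y_tr = make_data(200)
X_te, y_te = make_data(1000)

def knn_predict(X_query, X_train, y_train, k):
    """Classify each query point by majority vote among its k nearest neighbors."""
    out = np.empty(len(X_query), dtype=int)
    for i, q in enumerate(X_query):
        nearest = np.argsort(((X_train - q) ** 2).sum(axis=1))[:k]
        out[i] = np.bincount(y_train[nearest]).argmax()
    return out

train_err, test_err = {}, {}
for k in [1, 10, 100]:
    train_err[k] = np.mean(knn_predict(X_tr, X_tr, y_tr, k) != y_tr)
    test_err[k] = np.mean(knn_predict(X_te, X_tr, y_tr, k) != y_te)
    print(f"K = {k:3d}: training error = {train_err[k]:.3f}, "
          f"test error = {test_err[k]:.3f}")
```

At K = 1 every training point is its own nearest neighbor, so the training error is exactly zero even though the test error is not.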