
MODULE 2

Statistical machine learning

◻ Introduction to statistical machine learning


◻ parametric and non-parametric methods
◻ supervised vs. unsupervised learning
◻ regression and classification
◻ linear discriminant analysis
◻ decision trees, random forests, and bagging

Reference: An Introduction to Statistical Learning: with Applications in R, Springer
Introduction to statistical machine
learning
What Is Statistical Learning?
◻ The Advertising data set consists of the sales of a particular product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper.

◻ Our goal is to develop an accurate model that


can be used to predict sales on the basis of the
three media budgets
◻ The advertising budgets are input variables while
sales is an output variable.

◻ The input variables are typically denoted using the


symbol X, with a subscript to distinguish them.
● X1 might be the TV budget, X2 the radio budget, and X3
the newspaper budget.

◻ The inputs go by different names, such as predictors,


independent variables, features, or sometimes just variables.
◻ Suppose that we observe a quantitative response Y and p
different predictors, X1,X2, . . .,Xp.

◻ We assume that there is some relationship between Y and X =


(X1,X2, . . .,Xp), which can be written in the very general form
Y = f(X) + ε.

◻ Here f is some fixed but unknown function of X1, . . . , Xp, and ε


is a random error term, which is independent of X and has mean
zero.



◻ In this formulation, f represents the systematic
information that X provides about Y

◻ Some errors are positive (if an observation lies above


the curve) and some are negative (if an observation
lies below the curve).

◻ Overall, these errors have approximately mean zero

◻ A plot of income versus years of education for 30
individuals in the Income data set.

◻ The plot suggests that one might be able to predict


income using years of education.

◻ The function f that connects the input variable to the


output variable is in general unknown.

◻ One must estimate f based on the observed points


◻ The function f may involve more than one input
variable

◻ We plot income as a function of years of education


and seniority.

◻ Here f is a two-dimensional surface that must be


estimated based on the observed data.

Why Estimate f?
◻ Statistical learning refers to a set of approaches for
estimating f.

◻ There are two main reasons that we may wish to


estimate f:
Prediction
Inference

Prediction
◻ In many situations, a set of inputs X is readily available, but the output Y cannot be easily obtained.

◻ Since the error term averages to zero, we can predict Y using

Ŷ = f̂(X)

where f̂ represents our estimate for f, and Ŷ represents the resulting prediction for Y.

◻ In this setting, f̂ is often treated as a black box, in the sense that one is not typically concerned with the exact form of f̂, provided that it yields accurate predictions for Y.
◻ Suppose that X1, . . .,Xp are characteristics of a
patient’s blood sample that can be easily measured
in a lab, and Y is a variable encoding the patient’s
risk for a severe adverse reaction to a particular
drug.

◻ We would like to predict Y using X, since we can then avoid giving the drug in question to patients who are at high risk of an adverse reaction.

◻ The accuracy of Ŷ as a prediction for Y depends on two quantities:
reducible error
irreducible error
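These two quantities come from the expected-squared-error decomposition given in ISLR (stated here treating X and the estimate f̂ as fixed):

E(Y − Ŷ)² = E[f(X) + ε − f̂(X)]² = [f(X) − f̂(X)]² + Var(ε)

The first term on the right is the reducible error; Var(ε) is the irreducible error.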

● The reducible error is the element that we
can improve.
− It is the quantity that we reduce when the
model is learning on a training dataset and we
try to get this number as close to zero as
possible.
● The irreducible error is the error that we cannot remove with our model, or with any model.
Reducible error

f̂ will not be a perfect estimate for f, and this


inaccuracy will introduce some error.
This error is reducible because we can potentially
improve the accuracy of f̂ by using the most
appropriate statistical learning technique to estimate
f.
Reducible error is the error arising from the mismatch
between f̂ and f.

Irreducible error
Even if it were possible to form a perfect estimate for f,
so that our estimated response took the form Ŷ = f(X),
our prediction would still have some error in it!

◻ This is because Y is also a function of ε, which, by


definition, cannot be predicted using X.

◻ Therefore, variability associated with ε also affects the


accuracy of our predictions.
● Irreducible error arises from the fact that X
doesn’t completely determine Y.
● That is, there are variables outside of X
and independent of X that still have some
small effect on Y.
● The only way to improve prediction error
related to irreducible error is to identify
these outside influences and incorporate
them as predictors
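A minimal simulation sketch of this point (all numbers below are invented for illustration): even if we could use the true f as our estimate, the mean squared prediction error would still be about Var(ε).

```python
# Even a perfect f_hat cannot beat the irreducible error Var(epsilon).
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
X = rng.uniform(0, 10, size=n)

def f(x):                          # the true (normally unknown) relationship
    return 3.0 + 0.5 * x

sigma = 2.0                        # std. dev. of the noise term epsilon
Y = f(X) + rng.normal(0, sigma, size=n)

mse = np.mean((Y - f(X)) ** 2)     # pretend f_hat == f exactly
print(mse)                         # ~= sigma**2 = 4.0, the floor for any model
```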
Why is the irreducible error larger
than zero?
◻ The quantity ε may contain unmeasured variables that are useful in predicting Y: since we don't measure them, f cannot use them for its prediction.

◻ The quantity ε may also contain unmeasurable


variation.
For example, the risk of an adverse reaction might vary for a given patient on a given day, depending on manufacturing variation in the drug itself or the patient's general feeling of well-being on that day.
Inference
◻ Understand how Y changes as a function of X1, . . .,Xp.

◻ Which predictors are associated with the response?


Often, only a small fraction of the available predictors are substantially associated with Y.
Identifying the few important predictors among a large set
of possible variables can be extremely useful, depending
on the application

◻ What is the relationship between the response and
each predictor?
Some predictors may have a positive relationship with
Y , in the sense that increasing the predictor is
associated with increasing values of Y .
Other predictors may have the opposite relationship.

◻ Can the relationship between Y and each predictor be
adequately summarized using a linear equation, or is
the relationship more complicated?
Historically, most methods for estimating f have taken a linear form.
But often the true relationship is more complicated, in
which case a linear model may not provide an accurate
representation of the relationship between the input
and output variables.

Inference Problem
◻ Advertising data

◻ One may be interested in answering questions such


as:
Which media contribute to sales?
Which media generate the biggest boost in sales? Or
How much increase in sales is associated with a given
increase in TV advertising?

Inference Problem
◻ Modeling the brand of a product that a customer
might purchase based on variables such as
price, store location, discount levels, and competition price.

◻ How each of the individual variables affects the


probability of purchase.

◻ For instance, what effect will changing the price of a


product have on sales?
Prediction and Inference
◻ Some modeling can be conducted for both prediction and inference.
For example, in a real estate setting, one may seek to
relate values of homes to inputs such as
■ crime rate, zoning, distance from a river, air quality, schools,
income level of community, size of houses, and so forth.

One might be interested in how the individual input


variables affect the prices—that is, how much extra will
a house be worth if it has a view of the river? This is an
inference problem.
◻ Depending on whether our ultimate goal is
prediction, inference, or a combination of the two,
different methods for estimating f may be
appropriate.

◻ Linear models allow for relatively simple and


interpretable inference, but may not yield as
accurate predictions as some other approaches

◻ Some of the highly non-linear approaches can potentially provide quite accurate predictions for Y.
How Do We Estimate f?
◻ There are many linear and non-linear approaches for estimating f.

◻ For the Income data, we observed n = 30 data points.

◻ These observations are called the training data


because we will use these observations to train, or
teach, our method how to estimate f.

◻ Let xij represent the value of the jth predictor, or
input, for observation i, where i = 1, 2, . . ., n and
j = 1, 2, . . . , p.

◻ Let yi represent the response variable for the ith


observation.

◻ Then our training data consist of {(x1, y1), (x2, y2), . . . , (xn, yn)}, where xi = (xi1, xi2, . . . , xip)^T.
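In code, this training set is conventionally stored as an n × p matrix plus a length-n response vector; a tiny sketch (the array names are illustrative, not from the text):

```python
# X[i, j] holds x_ij; row X[i] is the observation x_i; y[i] is the response y_i.
import numpy as np

n, p = 30, 2
X = np.zeros((n, p))      # n observations, p predictors
y = np.zeros(n)           # n responses
print(X[0].shape)         # (p,): a single training observation x_1
```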
parametric and non-parametric
methods

◻ Apply a statistical learning method to the training
data in order to estimate the unknown function f.

◻ In other words, we want to find a function f̂ such that Y ≈ f̂(X) for any observation (X, Y).

◻ Most statistical learning methods can be characterized as


either parametric or non-parametric

Parametric Methods
◻ Parametric methods involve a two-step
model-based approach.

1. First, we make an assumption about the functional


form, or shape, of f.
For example, one very simple assumption is that f is
linear in X:
f(X) = β0 + β1X1 + β2X2 + . . . + βpXp.
This is a linear model
Once we have assumed that f is linear, the problem of
estimating f is greatly simplified.

Instead of having to estimate an entirely arbitrary


p-dimensional function f(X), one only needs to estimate
the p + 1 coefficients β0, β1, . . . , βp.

2. After a model has been selected, we need a
procedure that uses the training data to fit or train
the model.
In the case of the linear model fit, we need to estimate
the parameters β0, β1, . . . , βp.
We want to find values of these parameters such that
Y ≈ β0 + β1X1 + β2X2 + . . . + βpXp.

◻ The most common approach to fitting the model is (ordinary) least squares.
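A minimal ordinary-least-squares sketch on synthetic data (the sample size, budgets, and coefficients below are assumptions, not the book's Advertising estimates):

```python
# Fit beta_0, ..., beta_p by (ordinary) least squares.
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 3
X = rng.uniform(0, 100, size=(n, p))              # e.g. TV, radio, newspaper budgets
beta_true = np.array([2.0, 0.05, 0.1, 0.01])      # intercept + p slopes (invented)
y = beta_true[0] + X @ beta_true[1:] + rng.normal(0, 1, n)

X1 = np.column_stack([np.ones(n), X])             # prepend a column of 1s for beta_0
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None) # least-squares coefficient estimates
print(beta_hat)                                   # close to beta_true
```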
◻ A parametric method reduces the problem of
estimating f down to one of estimating a set of
parameters

◻ The potential disadvantage of a parametric


approach is that the model we choose will usually
not match the true unknown form of f.

◻ If the chosen model is too far from the true f, then our estimate will be poor.
Non-parametric Methods
◻ Non-parametric methods do not make explicit
assumptions about the functional form of f.

◻ Instead, non-parametric methods seek an estimate of f that gets as close to the data points as possible.

◻ No assumption about the form of f is made.

◻ Disadvantage of non-parametric approaches:
since they do not reduce the problem of estimating f to
a small number of parameters, a very large number of
observations (far more than is typically needed for a
parametric approach) is required in order to obtain an
accurate estimate for f.

◻ Non-parametric approach to fitting the Income data

A thin-plate spline is used to estimate f.

This approach does not impose any pre-specified model


on f.

It instead attempts to produce an estimate for f that is as close as possible to the observed data.
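The slides use a thin-plate spline here; as a simpler stand-in, the sketch below fits k-nearest-neighbors regression, another non-parametric method, to show the same idea: no functional form for f is assumed, and the fit simply tracks nearby observations (data and parameter choices are invented):

```python
# Non-parametric fit: predictions come from averaging nearby training points.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)
X = rng.uniform(10, 22, size=(300, 1))                          # e.g. years of education
y = 10 * np.sin(X[:, 0]) + 3 * X[:, 0] + rng.normal(0, 2, 300)  # non-linear target

knn = KNeighborsRegressor(n_neighbors=10).fit(X, y)
print(knn.predict([[16.0]]))   # estimate of f(16) from the 10 closest observations
```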

The Trade-Off Between Prediction
Accuracy and Model Interpretability
◻ Some methods are less flexible, or more restrictive,
in the sense that they can produce just a relatively
small range of shapes to estimate f.

◻ Linear regression is a relatively inflexible approach,


because it can only generate linear functions

◻ Thin plate splines are considerably more flexible


because they can generate a much wider range of possible shapes to estimate f.
Why would we ever choose to use a more restrictive
method instead of a very flexible approach?

◻ If we are mainly interested in inference, then


restrictive models are much more interpretable.

◻ When inference is the goal, the linear model may


be a good choice since it will be quite easy to
understand the relationship between Y and X1,X2, . .
. , Xp.

◻ In contrast, very flexible approaches, such as splines and boosting methods, can lead to such complicated estimates of f that it is difficult to understand how any individual predictor is associated with the response.
Trade-off between flexibility and
interpretability
◻ Least squares linear regression is relatively
inflexible but is quite interpretable.

◻ The lasso relies upon the linear model but uses
an alternative fitting procedure for estimating the
coefficients β0, β1, . . . , βp.

The new procedure is more restrictive in estimating the


coefficients, and sets a number of them to exactly zero.
Hence in this sense the lasso is a less flexible approach than linear regression.
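A minimal lasso sketch (synthetic data; the penalty alpha is an arbitrary choice) showing coefficient estimates being set exactly to zero:

```python
# With a large enough penalty, the lasso zeroes out most coefficients.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, p = 200, 10
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=n)  # only 2 of 10 predictors matter

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)   # most entries are exactly 0.0: a sparser, less flexible fit
```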
◻ Generalized additive models (GAMs) extend the
linear model to allow for certain non-linear
relationships.
GAMs are more flexible than linear regression.
They are also somewhat less interpretable than linear
regression, because the relationship between each
predictor and the response is now modeled using a
curve

◻ Fully non-linear methods such as bagging, boosting,
and support vector machines with non-linear kernels,
are highly flexible approaches that are harder to
interpret.

◻ When inference is the goal, there are clear


advantages to using simple and relatively inflexible
statistical learning methods.

◻ When we are only interested in prediction, it will be best to use the most flexible model available.
Supervised Learning
◻ For each observation of the predictor measurement(s)
xi,
i = 1, . . . , n there is an associated response
measurement yi.

◻ We wish to fit a model that relates the response to the


predictors, with the aim of accurately predicting the
response for future observations (prediction) or
better understanding the relationship between the
response and the predictors (inference).
linear regression
logistic regression
GAM
Boosting
support vector machines

◻ Operate in the supervised learning domain

Unsupervised Learning
◻ For every observation i = 1, . . . , n, we observe a vector of measurements xi but no associated response yi.

◻ It is not possible to fit a linear regression model,


since there is no response variable to predict.

◻ The situation is referred to as unsupervised because


we lack a response variable that can supervise our
analysis.
Cluster analysis, or Clustering
◻ The goal of cluster analysis is to ascertain, on the basis of x1, . . . , xn, whether the observations fall into relatively distinct groups.
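A minimal clustering sketch (synthetic data): k-means sees only the observations x1, . . . , xn, with no response to supervise it:

```python
# Unsupervised grouping: no y is observed or used.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(loc, 0.5, size=(50, 2)) for loc in (0, 3, 6)])  # 3 blobs

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])   # a cluster assignment for each observation
```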



Linear Regression

● Linear regression is a very simple approach
for supervised learning.
● Linear regression is a useful tool for
predicting a quantitative response.
● Advertising data

● One may be interested in answering questions such


as:
− Is there a relationship between advertising budget and
sales?
− How strong is the relationship between advertising budget
and sales?
− Which media contribute to sales?
− How accurately can we estimate the effect of each
medium on sales?
● Linear regression can be used to answer
each of these questions.
Simple Linear Regression

● Simple linear regression is a very
straightforward approach for predicting a
quantitative response Y on the basis of a
single predictor variable X.
● It assumes that there is approximately a
linear relationship between X and Y .
● Mathematically, we can write this linear
relationship as
● Y ≈ β0 + β1X.
● “≈” is read as “is approximately modeled
as”.
● This equation is described by saying that
we are regressing Y on X (or Y onto X).
● For example, X may represent TV
advertising and Y may represent sales .
● Then we can regress sales onto TV by
fitting the model
● sales ≈ β0 + β1 × TV.
● β0 and β1 are two unknown constants that
represent the intercept and slope terms in
the linear model.
● Together, β0 and β1 are known as the
model coefficients or parameters.
● Once we have used our training data to
produce estimates β̂0 and β̂1 for the
model coefficients, we can predict future
sales on the basis of a particular value of
TV advertising by computing
● ŷ = β̂0 + β̂1x
● Here ŷ indicates a prediction of Y on the
basis of X = x.
● Here we use a hat symbol, ˆ, to denote the estimated value of an unknown parameter or coefficient, or to denote the predicted value of the response.
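A minimal simple-linear-regression sketch with invented TV/sales numbers (not the book's Advertising data):

```python
# Estimate beta_0 and beta_1, then predict sales at a chosen TV budget.
import numpy as np

rng = np.random.default_rng(5)
tv = rng.uniform(0, 300, size=200)                        # TV budget
sales = 7.0 + 0.05 * tv + rng.normal(0, 1.5, size=200)    # assumed true relationship

b1, b0 = np.polyfit(tv, sales, deg=1)                     # slope, then intercept
y_hat = b0 + b1 * 100.0                                   # y_hat at TV = 100
print(b0, b1, y_hat)
```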
Multiple Linear Regression

● In practice we often have more than one
predictor.
● For example, in the Advertising data, we
have examined the relationship between
sales and TV advertising.
● We also have data for the amount of
money spent advertising on the radio and
in newspapers, and we may want to know
whether either of these two media is
associated with sales.
● One option is to run three separate simple
linear regressions, each of which uses a
different advertising medium as a predictor.
● The approach of fitting a separate simple
linear regression model for each predictor
is not entirely satisfactory.
● A better approach is to extend the simple
linear regression model so that it can
directly accommodate multiple predictors.
● We can do this by giving each predictor a
separate slope coefficient in a single
model.
● In general, suppose that we have p distinct
predictors.
● Then the multiple linear regression model
takes the form
● Y = β0 + β1X1 + β2X2 + · · · + βpXp + ε
● where Xj represents the jth predictor and βj
quantifies the association between that
variable and the response.
● We interpret βj as the average effect on Y
of a one unit increase in Xj , holding all
other predictors fixed.
● In the advertising example,
● sales = β0 + β1 × TV + β2 × radio + β3 × newspaper + ε.

● Given estimates β̂0 , β̂1 , . . . , β̂p , we can


make predictions using the formula
● ŷ = β̂0 + β̂1x1 + β̂2x2 + · · · + β̂pxp.
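A minimal multiple-regression sketch (again with invented numbers), fitting a single model with one slope per medium:

```python
# One model, p = 3 predictors, one coefficient per predictor.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)
n = 200
X = rng.uniform(0, 100, size=(n, 3))    # columns: TV, radio, newspaper
y = 3 + 0.046 * X[:, 0] + 0.19 * X[:, 1] + rng.normal(0, 1, n)  # newspaper: no effect

fit = LinearRegression().fit(X, y)
print(fit.intercept_, fit.coef_)        # beta_0_hat and (beta_1_hat, ..., beta_3_hat)
print(fit.predict([[100, 20, 10]]))     # y_hat for one budget combination
```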
The Basics of Decision Trees

● Decision trees can be applied to both
regression and classification problems.
Regression Trees
● The figure shows a regression tree fit to the Hitters data (log salaries of baseball players, with predictors Years and Hits).
● It consists of a series of splitting rules,
starting at the top of the tree.
● The top split assigns observations having
Years<4.5 to the left branch.
● The predicted salary for these players is
given by the mean response value for the
players in the data set with Years<4.5 .
● For such players, the mean log salary is 5.107, and so we make a prediction of e^5.107 thousands of dollars, i.e. $165,174, for these players.
● Players with Years>=4.5 are assigned to
the right branch, and then that group is
further subdivided by Hits.
● Overall, the tree stratifies or segments the players into three regions of predictor space: players who have played for four or fewer years; players who have played for five or more years and who made fewer than 118 hits last year; and players who have played for five or more years and who made at least 118 hits last year.
● In keeping with the tree analogy, the regions R1, R2, and R3 are known as terminal nodes or leaves of the tree.
● The points along the tree where the
predictor space is split are referred to as
internal nodes.
● We might interpret the regression tree
displayed in Figure as follows:
− Years is the most important factor in
determining Salary , and players with less
experience earn lower salaries than more
experienced players.
− Given that a player is less experienced, the
number of hits that he made in the previous
year seems to play little role in his salary.
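A minimal regression-tree sketch on synthetic Hitters-like data (the variable names and coefficients are assumptions); it also checks the e^5.107 arithmetic quoted above:

```python
# Fit a depth-2 regression tree on Years and Hits, then print its splits.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(7)
years = rng.integers(1, 20, size=300)
hits = rng.integers(40, 200, size=300)
log_salary = (4.5 + 0.6 * (years >= 5) + 0.004 * hits * (years >= 5)
              + rng.normal(0, 0.3, 300))

X = np.column_stack([years, hits])
tree = DecisionTreeRegressor(max_depth=2).fit(X, log_salary)
print(export_text(tree, feature_names=["Years", "Hits"]))

# Sanity check on the prediction quoted above: a mean log salary of 5.107
# corresponds to e**5.107 ~= 165.17 thousand, i.e. about $165,174.
print(np.exp(5.107) * 1000)
```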
Classification Trees
● A classification tree is very similar to a
regression tree, except that it is used to
predict a qualitative response rather than a
quantitative one.
● For a classification tree, we predict that
each observation belongs to the most
commonly occurring class of training
observations in the region to which it
belongs.
● In interpreting the results of a classification tree, we are often interested not only in the class prediction for a particular terminal node region, but also in the class proportions among the training observations that fall into that region.
Classification error rate.
● The classification error rate is simply the
fraction of the training observations in that
region that do not belong to the most
common class.
Gini index
Cross-entropy
● When building a classification tree, either
the Gini index or the cross-entropy are
typically used to evaluate the quality of a
particular split, since these two approaches
are more sensitive to node purity than is
the classification error rate.
● Any of these three approaches might be used when pruning the tree, but the classification error rate is preferable if prediction accuracy of the final pruned tree is the goal.
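A minimal sketch of the three measures named above, computed from the class proportions in a single region (formulas as given in ISLR):

```python
# Node impurity measures for class proportions p = (p_m1, ..., p_mK).
import numpy as np

def classification_error(p):
    return 1.0 - np.max(p)          # fraction not in the most common class

def gini(p):
    return np.sum(p * (1.0 - p))    # small when the node is pure

def cross_entropy(p):
    p = p[p > 0]                    # skip absent classes to avoid log(0)
    return -np.sum(p * np.log(p))

p = np.array([0.8, 0.2])            # e.g. 80% of training points in one class
print(classification_error(p), gini(p), cross_entropy(p))
```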
Example
● Figure shows an example on the Heart
data set.
● These data contain a binary outcome HD
for 303 patients who presented with chest
pain.
● An outcome value of Yes indicates the
presence of heart disease based on an
angiographic test, while No means no
heart disease.
