Week8_Lecture_1_ML_SPR25

The document discusses various subset selection methods in machine learning, focusing on improving prediction accuracy and model interpretability. It outlines techniques such as Best Subset Selection, Forward Selection, and Backward Selection, along with their computational limitations and effectiveness in model selection. Additionally, it highlights the importance of metrics like Cp, AIC, and BIC for optimal model selection and emphasizes the use of validation methods for estimating test error.

Machine Learning and Deep Learning with R
Instructor: Babu Adhimoolam
Learning objectives: Subset Selection Methods

Why additional linear methods?

• To improve prediction accuracy on the test dataset: when the number of observations (n) is not much larger than the number of predictors (p), the least squares estimates have high variance. By constraining or shrinking the coefficients associated with the predictors, we can substantially reduce the variance and hence the test error.

• To improve model interpretability: including variables that are not associated with the response only adds complexity to the model. By setting the coefficients of variables that do not contribute to the response to zero, we obtain more interpretable models.
Extensions of linear methods

Subset Selection Methods
• Best Subset Selection
• Forward Stepwise Selection
• Backward Stepwise Selection

Shrinkage Methods
• Ridge Regression
• Lasso Regression

Dimensionality Reduction Methods
• Principal Components Regression
• Partial Least Squares
The Best Subset Selection Method

• We fit a least squares regression for each possible combination of the p predictors.

• The total number of possible models with p predictors is 2^p.

• We start with the null model (M0) containing no predictors, and then compute Mk for each value of k:

for k = 1, 2, …, p:

- fit all models that contain exactly k predictors.

- choose the best of these models (lowest RSS or highest R²) and call it Mk.

• We finally choose the best model from the list of available models M0, …, Mp.
Application of Best Subset Selection to the Credit data set

Response – Balance
Predictors – Income, Limit, Rating, Cards, Age, Education, Own, Student, Married and Region

(Figure: results for the models M1 to Mp; red line – the best model within each subset size.)
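
As an illustration, a minimal R sketch of best subset selection on these data, assuming the leaps and ISLR2 packages are available (regsubsets and the Credit data set come from those packages):

# Best subset selection on the Credit data (sketch; assumes leaps and ISLR2 are installed)
library(leaps)
library(ISLR2)

# Fit the best subset of each size; Region is dummy-coded into two columns,
# so nvmax = 11 lets every coded predictor enter
best_fit <- regsubsets(Balance ~ ., data = Credit, nvmax = 11)
best_summary <- summary(best_fit)

# RSS and R^2 of the best model of each size (the red line in the slide's figure)
best_summary$rss
best_summary$rsq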


Limitations of best subset selection

Best subset selection suffers heavily from computational limitations as p grows. Recall that the total number of possible models with p predictors is 2^p:

If p is 10, 2^10 ≈ 1,000 models to evaluate.
If p is 20, 2^20 ≈ 1,000,000 models to evaluate.
p > 40 is computationally infeasible!

In addition, the large model space allows overfitting: models that fit the training data well may not generalize with high accuracy to the test data.
Forward Stepwise Selection

• Forward stepwise selection is a computationally feasible and efficient alternative to best subset selection, as it considers far fewer models than the 2^p of best subset selection.

• We begin with a model with no predictors (M0) and then add predictors to the model, one at a time, until all predictors are included.

for k = 0, 1, …, p-1:

- consider all (p − k) models that augment the predictors in Mk with one additional predictor.

- choose the best among these (p − k) models (lowest RSS or highest R²) and call it Mk+1.

• We finally choose the best model among the list of available models M0, …, Mp.
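
A corresponding R sketch, again assuming the leaps and ISLR2 packages; regsubsets switches to the stepwise algorithm via its method argument:

# Forward stepwise selection on the Credit data (sketch)
library(leaps)
library(ISLR2)

fwd_fit <- regsubsets(Balance ~ ., data = Credit, nvmax = 11, method = "forward")
summary(fwd_fit)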
Computational feasibility of forward stepwise selection

Unlike best subset selection, which involves fitting 2^p models (with p predictors), forward stepwise selection fits only

1 + p(p + 1)/2

models. So, if p = 20, best subset selection must fit approximately 1,048,576 models, whereas forward stepwise selection fits only 211 models.
Forward stepwise selection does not always find the best model

Note that the best four-variable models differ between best subset selection and forward stepwise selection.
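
Assuming the best_fit and fwd_fit objects from the earlier sketches, this discrepancy can be checked directly by comparing the coefficients of the best four-variable models:

# Best four-variable model from each method
coef(best_fit, id = 4)  # best subset selection
coef(fwd_fit, id = 4)   # forward stepwise selection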
Backward Stepwise Selection

• Unlike best subset selection and forward stepwise selection, here we start with the full model (Mp) containing all the predictors.

• We then iteratively remove the least useful predictor, one at a time.

for k = p, p-1, …, 1:

- consider all k models that contain all but one of the predictors in Mk, for a total of k − 1 predictors each.

- choose the best among these k models (lowest RSS or highest R²) and call it Mk−1.

• We then choose the single best model out of M0, …, Mp.

• Backward stepwise selection is computationally similar to forward stepwise selection.

• Unlike forward stepwise selection, it requires n > p (so that the full model can be fit by least squares).
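
A minimal R sketch of backward stepwise selection, under the same package assumptions; the Credit data satisfy n > p, so the full model can be fit:

# Backward stepwise selection on the Credit data (sketch)
library(leaps)
library(ISLR2)

bwd_fit <- regsubsets(Balance ~ ., data = Credit, nvmax = 11, method = "backward")
summary(bwd_fit)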


Choosing the optimal model among M0, …, Mp

R² and RSS are not good metrics for selecting among these models, because the model containing all the predictors will always have the highest R² and the lowest RSS.

Indirect methods for estimating the test error adjust the training error to account for the bias due to overfitting.

Direct methods estimate the test error directly using validation or cross-validation.
Indirect methods for adjusting training error rates

• Cp
• Akaike Information Criterion (AIC)
• Bayesian Information Criterion (BIC)
• Adjusted R²
Cp

For a fitted least squares model with d predictors, the Cp estimate of the test MSE is

Cp = (RSS + 2·d·σ̂²) / n

where σ̂² is an estimate of the variance of the error term.

Cp adds a penalty proportional to the number of predictors in the model: a model with more predictors incurs a larger penalty.

Cp tends to take small values for models with low test error, so we select the model with the lowest Cp (the best model!).
Akaike Information Criterion (AIC)

AIC is defined for a large class of models fit by maximum likelihood. For a least squares model with d predictors it is given, up to irrelevant constants, by

AIC = (RSS + 2·d·σ̂²) / n

AIC and Cp therefore measure the same thing for least squares models and are proportional to each other; as with Cp, smaller values indicate better models.
Bayesian Information Criterion (BIC)

Like Cp and AIC, BIC takes small values for models with low test error. Up to irrelevant constants, it is given by

BIC = (RSS + log(n)·d·σ̂²) / n

Because log(n) > 2 whenever n > 7, BIC places a heavier penalty than Cp or AIC on models with many variables.
Adjusted R²

For a model with d predictors, adjusted R² = 1 − [RSS/(n − d − 1)] / [TSS/(n − 1)], so maximizing adjusted R² is equivalent to minimizing RSS/(n − d − 1).

Adding noise variables decreases RSS only slightly while increasing d, so RSS/(n − d − 1) increases and adjusted R² decreases.

Unlike the regular R², adjusted R² therefore accounts for nuisance variables: the model with the largest adjusted R² is preferred.
Optimal model selection in the Credit data set

(Figure: model selection criteria plotted for the models M1, …, Mp fit to the Credit data.)

Low values of Cp, AIC and BIC and high values of adjusted R² reveal the models with the lowest estimated test error.
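
A short sketch of how these criteria can be read off a fitted regsubsets object in R, assuming the best_fit object from the earlier best-subset sketch (summary.regsubsets exposes cp, bic and adjr2; AIC is proportional to Cp for least squares, so it selects the same size):

# Model size preferred by each criterion (uses best_fit from the earlier sketch)
best_summary <- summary(best_fit)

which.min(best_summary$cp)     # lowest Cp (and, up to scaling, lowest AIC)
which.min(best_summary$bic)    # lowest BIC
which.max(best_summary$adjr2)  # highest adjusted R^2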
Choosing the optimal model with validation and cross-validation

Validation and cross-validation methods estimate the test error directly and are generally preferred over the indirect adjustment methods described above.
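
As an illustration, a minimal validation-set sketch in R, under the same leaps/ISLR2 assumptions (the split, seed and object names below are hypothetical choices, not from the lecture):

# Choose the model size by validation-set error (sketch)
library(leaps)
library(ISLR2)

set.seed(1)
train <- sample(c(TRUE, FALSE), nrow(Credit), replace = TRUE)

# Best subset selection on the training half only
fit_train <- regsubsets(Balance ~ ., data = Credit[train, ], nvmax = 11)
test_mat  <- model.matrix(Balance ~ ., data = Credit[!train, ])

# Validation MSE for the best model of each size
val_errors <- sapply(1:11, function(k) {
  coefs <- coef(fit_train, id = k)
  preds <- test_mat[, names(coefs)] %*% coefs
  mean((Credit$Balance[!train] - preds)^2)
})
which.min(val_errors)  # model size with the lowest validation error

The same idea extends to k-fold cross-validation by repeating this computation over the folds and averaging the errors.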
