Week8_Lecture_1_ML_SPR25
Subset Selection Methods with R
Instructor: Babu Adhimoolam
Learning Objectives: Subset Selection Methods
Subset selection methods can improve prediction accuracy on the test dataset: least squares estimates suffer from high variance when the number of predictors p is large relative to the number of observations n, and selecting a smaller subset of predictors reduces that variance.
Best Subset Selection
• The total number of possible least squares models with p predictors is 2^p, since each predictor can either be included or excluded.
• We start with the null model (M0) containing no predictors, and then compute Mk for each value of k:
for k = 1, 2, …, p:
- fit all models containing exactly k predictors, and choose the best of them, Mk, by the smallest RSS (or, equivalently, the largest R²).
• We finally choose the best model from the list of available models M0, …, Mp.
Application of Best Subset Selection to the Credit Data Set
Models M1 to Mp are fit with:
Response: Balance
Predictors: Income, Limit, Rating, Cards, Age, Education, Own, Student, Married, and Region
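A minimal R sketch of this application, assuming the leaps package (for regsubsets()) and the ISLR2 package (for the Credit data) are installed; the object names are illustrative:

library(ISLR2)   # Credit data set
library(leaps)   # regsubsets() for subset selection

# Best subset selection; Region expands to two dummy variables,
# so the full model contains 11 terms
best.fit <- regsubsets(Balance ~ ., data = Credit, nvmax = 11)
best.sum <- summary(best.fit)

best.sum$which   # predictors included in each best model M_k
best.sum$rss     # RSS of each M_k
best.sum$rsq     # R^2 of each M_k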
Forward Stepwise Selection
• We begin with a model containing no predictors (M0), and then add predictors to the model one at a time until all p predictors are included:
for k = 0, 1, …, p−1:
- consider all (p − k) models that augment the predictors in Mk with one additional predictor;
- choose the best among these (p − k) models by the smallest RSS (or largest R²), and call it Mk+1.
• We finally choose the best model from the list of available models M0, …, Mp.
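The same function can trace the forward stepwise path; a sketch under the same assumptions as above:

# Forward stepwise: grow the model one predictor at a time,
# keeping the single best addition at each step
fwd.fit <- regsubsets(Balance ~ ., data = Credit,
                      nvmax = 11, method = "forward")
summary(fwd.fit)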
Computational Feasibility of Forward Stepwise Selection
Unlike best subset selection, which involves fitting 2^p models (with p predictors), forward stepwise selection fits only 1 + p(p+1)/2 models. So if p = 20, best subset selection must fit 1,048,576 models, whereas forward stepwise selection fits only 211.
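The counts can be verified directly: forward stepwise fits the null model plus (p − k) candidate models at each step k, so for p = 20:

2^{20} = 1{,}048{,}576 \qquad \text{vs.} \qquad 1 + \sum_{k=0}^{p-1} (p - k) = 1 + \frac{p(p+1)}{2} = 1 + \frac{20 \cdot 21}{2} = 211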
Forward Stepwise Selection Does Not Always Find the Best Model
Note that the best four-variable models differ between best subset selection and forward stepwise selection. Forward stepwise is greedy: once a variable enters the model it is never removed, so the method can miss the best subset of a given size.
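This can be checked directly; a sketch, reusing the illustrative best.fit and fwd.fit objects from the regsubsets() calls above:

# Predictors in the best four-variable model from each method;
# the two rows need not agree, since forward stepwise never
# removes a variable once it has been added
summary(best.fit)$which[4, ]
summary(fwd.fit)$which[4, ]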
Backward Stepwise Selection
• Unlike best subset selection and forward stepwise selection, here we start with the full model (Mp) containing all p predictors.
• We then iteratively remove the least useful predictor, one at a time:
for k = p, p−1, …, 1:
- consider all k models that contain all but one of the predictors in Mk, for a total of k − 1 predictors;
- choose the best among these k models (as assessed by RSS or R²) and call it Mk−1.
• We finally choose the best model from the list of available models M0, …, Mp.
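A sketch of the backward path under the same assumptions; regsubsets() only needs the method argument changed:

# Backward stepwise: start from the full model and drop the
# least useful predictor at each step
bwd.fit <- regsubsets(Balance ~ ., data = Credit,
                      nvmax = 11, method = "backward")
summary(bwd.fit)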
Cp
For a fitted least squares model with d predictors, the Cp estimate of the test MSE is:

C_p = \frac{1}{n}\left(\mathrm{RSS} + 2\,d\,\hat{\sigma}^2\right)

where \hat{\sigma}^2 is an estimate of the variance of the error term. For least squares models, the AIC is proportional to Cp.

Bayesian Information Criterion (BIC)

\mathrm{BIC} = \frac{1}{n}\left(\mathrm{RSS} + \log(n)\,d\,\hat{\sigma}^2\right)

Since log(n) > 2 whenever the number of observations n > 7, the BIC places a heavier penalty than Cp or AIC on models with many variables.

Adjusted R²

\text{Adjusted } R^2 = 1 - \frac{\mathrm{RSS}/(n-d-1)}{\mathrm{TSS}/(n-1)}

Maximizing the adjusted R² is equivalent to minimizing RSS/(n − d − 1).
Choosing Among M0, …, Mp
Low values of Cp, AIC, and BIC, and high values of adjusted R², indicate models with low estimated test error.
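The leaps summary reports Cp, BIC, and adjusted R² for each model size; a sketch, continuing the hypothetical best.fit example:

best.sum <- summary(best.fit)

which.min(best.sum$cp)     # model size minimizing Cp
which.min(best.sum$bic)    # model size minimizing BIC
which.max(best.sum$adjr2)  # model size maximizing adjusted R^2

# Plot each criterion against model size to compare the choices
par(mfrow = c(1, 3))
plot(best.sum$cp,    xlab = "Number of predictors", ylab = "Cp",  type = "b")
plot(best.sum$bic,   xlab = "Number of predictors", ylab = "BIC", type = "b")
plot(best.sum$adjr2, xlab = "Number of predictors", ylab = "Adjusted R2", type = "b")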
Choosing the Optimal Model with Validation and Cross-Validation
Validation and cross-validation methods directly estimate the test error and are generally preferred to the indirect criteria above: they make fewer assumptions about the true underlying model and do not require an estimate of the error variance σ².
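A 10-fold cross-validation sketch for choosing the model size on the Credit data, under the same assumptions; regsubsets() has no built-in predict() method, so a small helper is defined here (following the common ISLR lab pattern; names are illustrative):

# Predict from a regsubsets fit for the model of a given size (id)
predict.regsubsets <- function(object, newdata, id) {
  form  <- as.formula(object$call[[2]])  # recover the model formula
  mat   <- model.matrix(form, newdata)   # design matrix for new data
  coefi <- coef(object, id = id)         # coefficients of the id-term model
  mat[, names(coefi)] %*% coefi          # predictions
}

set.seed(1)
k <- 10
n <- nrow(Credit)
folds <- sample(rep(1:k, length = n))    # assign each row to a fold
cv.errors <- matrix(NA, k, 11)

for (j in 1:k) {
  fit <- regsubsets(Balance ~ ., data = Credit[folds != j, ], nvmax = 11)
  for (i in 1:11) {
    pred <- predict.regsubsets(fit, Credit[folds == j, ], id = i)
    cv.errors[j, i] <- mean((Credit$Balance[folds == j] - pred)^2)
  }
}

mean.cv.errors <- colMeans(cv.errors)    # CV estimate of test MSE per size
which.min(mean.cv.errors)                # model size with lowest CV error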