DDMA05_ModelSelection
Marketing analytics lecture note

Model Selection

Inseong Song, SNU Business School


Predictive vs. Descriptive Modeling

• In common usage, prediction means to forecast a future event. In data science,
prediction more generally means to estimate an unknown value. The value could be
something in the future, in the present, or even in the past.

– Predictive models for credit scoring estimate the likelihood that a potential
customer will default.
– Predictive models for spam filtering estimate whether a given piece of email is
spam.
– Predictive models for fraud detection judge whether an account has been
defrauded.
• This is in contrast to descriptive modeling, where the primary purpose is to gain
insight into the underlying phenomenon or process.

– A descriptive model of churn would tell us what churning customers look like.



• A descriptive model must be judged in part on its intelligibility, and a less accurate
model may be preferred if it is easier to understand. Researchers’ preference for
model parsimony is therefore implemented by adding penalties for model complexity.
• A predictive model may be judged solely on its predictive performance, although
intelligibility is nonetheless important.
• However, the difference between these model types is not so strict. Sometimes
much of the value of a predictive model is in the understanding gained from looking
at it rather than in the predictions it makes.



Variable Selection as model selection
• Typical models
Y = f(X1, ..., Xk) + ε
Y: variable (value) being predicted (e.g., sales volume; customer responses;
customer lifetime value; ...)
X: potential predictor variables
k: number of variables, including the intercept

• Why not all available variables (say 300 variables)?


– Computation time (especially in nonlinear models)
– Feasibility (there would be no observations in most nodes of a decision tree model
with 300 predictors)
– Overfitting: the calibration can be done perfectly, but the result may not apply
well to other comparable data sets
– Interpretation: the effects of 300 variables are hard to interpret, difficult to
communicate, and meaningless for finding strategic insights



All-Possible Subset Regression

• Regressions with all possible subsets of predictor variables: e.g., if there are N
variables, then we need to run 2^N regressions
• Find the best regression model out of the 2^N regression outputs, based on evaluation
criteria such as adjusted R², AIC, or BIC.
"#$
Adjusted 𝑅 ! = 1 − (1 − 𝑅 ! ),
"#%
AIC = −2 log 𝐿 + 2𝑘,
BIC = −2 log 𝐿 + 𝑘 ∗ log 𝑇
T: the number of observations,
k: the number of predictor variables included,
log 𝐿: log likelihood
• This approach is impractical when N is large: e.g., for N = 50 there are 2^50 ≈ 1.1 × 10^15
possible models
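As an illustration (not from the lecture), a minimal base-R sketch of all-subsets regression with three hypothetical predictors x1, x2, x3 and response y in a data frame dat:

  # Enumerate every subset (including the intercept-only model) and record the criteria.
  subsets <- c(list(character(0)),
               unlist(lapply(1:3, function(m) combn(c("x1", "x2", "x3"), m,
                                                    simplify = FALSE)),
                      recursive = FALSE))
  results <- t(sapply(subsets, function(s) {
    rhs <- if (length(s) == 0) "1" else paste(s, collapse = " + ")
    fit <- lm(as.formula(paste("y ~", rhs)), data = dat)
    c(adjR2 = summary(fit)$adj.r.squared, AIC = AIC(fit), BIC = BIC(fit))
  }))
  rownames(results) <- sapply(subsets, function(s)
    if (length(s) == 0) "(intercept only)" else paste(s, collapse = " + "))
  results   # 2^3 = 8 candidate models, compared by adjusted R2, AIC, BIC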



Stepwise Selection

• Find the optimal set of predictor variables utilizing both forward selection with
backward elimination
• Forward Selection
– Estimate models with one predictor variable (N such models)
– The variable with the largest F-statistic becomes the candidate
– If the candidate’s F is larger than a predetermined level F0, then the variable
associated with this F is added to the predictor set (otherwise the process stops with
no variables in the model)
– If the process does not stop, run regressions with two predictor variables: the selected
variable (say X1) and one (say Xk) of the remaining N−1 variables (so N−1 regression models)
– Compute the partial F: a statistic testing bk = 0 when both X1 and Xk are in the model. If
the largest partial F exceeds a predetermined level F0, then the variable with this F is
added. Otherwise stop.
– Repeat this process
– Forward selection continues until no further predictor can be added



• Backward Elimination
– Estimate a model in which all N variables are included
– Compute the partial F for each of the variables in the predictor set – the variable
with the smallest partial F is the candidate for deletion
– Partial F for Xk:
F = [(SSE1 − SSE2)/(df1 − df2)] / (SSE2/df2)
SSE2: sum of squared errors of the larger model (Xk is included)
SSE1: sum of squared errors of the smaller model (Xk is not included)
df1, df2: degrees of freedom for the error terms of the smaller and larger models

– If the candidate’s partial-F is smaller than the predetermined value F1, then the
variable is removed. Otherwise stop.
– Continue backward elimination until no further variable can be eliminated.
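In R, the partial F for a single variable can be obtained by comparing the two nested fits. A sketch under assumed names (dat, y, x1, x2, x3; not from the lecture):

  larger  <- lm(y ~ x1 + x2 + x3, data = dat)    # Xk = x3 is included
  smaller <- lm(y ~ x1 + x2,      data = dat)    # Xk = x3 is dropped
  anova(smaller, larger)          # the F column is the partial F for x3
  drop1(larger, test = "F")       # partial F for each variable in one call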



• Combination
– Begin with forward selection.
– When 2 variables are included, apply backward elimination.
– Apply the forward-backward scheme until no further variable can be added
or removed. (Some variables could be included at an early stage and
dropped subsequently.)
• Stepwise selection is computationally efficient, but it can end up at a
suboptimal solution because of its sequential-search nature.
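A minimal sketch using base R's step(), which performs the same add/drop search but uses AIC rather than partial-F thresholds (an assumption; dat and its variables are illustrative):

  null <- lm(y ~ 1, data = dat)                   # intercept-only starting point
  full <- lm(y ~ ., data = dat)                   # all candidate predictors
  sel  <- step(null, scope = formula(full), direction = "both", trace = FALSE)
  summary(sel)                                    # the selected model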



Evaluation of Model
• Assess if the developed model can explain data well and also predict well
• The model should be good at explaining the data used in calibration. Moreover, the
model should also work well on other, comparable data.
• Partition of samples into subgroups
– Training Set (Calibration sample): observations used in model estimation
– Validation set: observations to which estimation results are applied, to select
the best performing model (model selection)
– Test set: observations to which estimation results are applied, to assess the
performance of the selected model (model assessment)
– A large calibration sample may lead to accurate estimation, but the
improvement in accuracy becomes smaller as the sample size continues to
increase.
– A large validation sample makes model comparison easier.
→ Since the size of the whole dataset is fixed, we have a trade-off between
the calibration and validation samples.
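A minimal sketch of such a random partition (the 60/20/20 proportions and the data frame dat are assumptions, not from the lecture):

  set.seed(1)
  grp <- sample(c("train", "valid", "test"), nrow(dat), replace = TRUE,
                prob = c(0.6, 0.2, 0.2))
  train <- dat[grp == "train", ]   # used to fit the models
  valid <- dat[grp == "valid", ]   # used to select the best model
  test  <- dat[grp == "test",  ]   # used to assess the chosen model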



Total Data
– Training (Calibration): used to fit the models
– Validation: used to estimate prediction error for model selection
– Test: used for assessment of the generalization error of the final chosen model


Loss Function

• Consider a target variable (Y), a vector of inputs (X), and a prediction model f̂(X)
that has been estimated from a training set.
• Loss function: L(Y, f̂(X)) = (Y − f̂(X))²  (squared error)
                 or |Y − f̂(X)|  (absolute error)

• Various types of loss functions exist (depending upon the nature of the data).
• Test error, also referred to as generalization error, is the prediction error over an
independent test sample
• Training error is the average loss over the training sample. Unfortunately, training
error is not a good estimate of the test error



Bias-Variance Decomposition

• Assume Y = f(X) + ε where E(ε) = 0 and Var(ε) = σ_ε².
• Expected prediction error of a regression fit f̂(X) at an input point X = x0 using
squared-error loss:
Err(x0) = E[(Y − f̂(x0))² | X = x0]
        = σ_ε² + [E f̂(x0) − f(x0)]² + E[f̂(x0) − E f̂(x0)]²
        = irreducible error + Bias² + Variance
– Irreducible error (σ_ε²): variance of the target around its true mean (cannot be avoided)
– Bias: difference between the average of my estimate and the true mean
– Variance: the expected squared deviation of f̂(x0) around its mean
• The more complex we make the model f̂, the lower the (squared) bias but the
higher the variance.



Training error, Test error, and Model Complexity
(Hastie, Tibshirani, Friedman 2013)

[Figure: test error and training error curves plotted against model complexity]



• Suppose your model does not perform well. (say, large validation/test errors).
What to do with your model? Simpler or more complex?
• Case 1: validation(test) error >> training error (training error is small)
– Then your model is overfitting. Make it simpler.
• Case 2: validation(test) error ≈ training error (training error is large)
– Then it is underfitting. A more complex model may help

[Figure: Case 1 (overfitting) and Case 2 (underfitting) regions marked on the
training/test error curves]



Simple Example
• n=20, training=14, test=6
• True model: y = β0 + β1·x + β2·x² + β3·x³ + ε with β′ = (6, 9, −6, 1)
– simulate x ~ U(0, 5) and ε ~ N(0, 1); then we have data (y, x)
• Estimate the regression model y = θ0 + θ1·x + ⋯ + θk·x^k + ε for k = 1, ⋯, 7 using the
training data
• Predict y in the test data based on the regression results
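A minimal R sketch of this experiment (the seed is an assumption, so the numbers will differ somewhat from the table below):

  set.seed(123)
  x <- runif(20, 0, 5)
  y <- 6 + 9 * x - 6 * x^2 + 1 * x^3 + rnorm(20)      # true cubic model
  train <- 1:14; test <- 15:20
  for (k in 1:7) {
    fit  <- lm(y ~ poly(x, k, raw = TRUE), subset = train)
    pred <- predict(fit, newdata = data.frame(x = x[test]))
    cat(sprintf("k = %d  training MSE = %6.2f  test MSE = %9.2f\n",
                k, mean((fitted(fit) - y[train])^2), mean((pred - y[test])^2)))
  }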



• Mean squared errors

  k   Training Error   Test Error
  1       2.41             9.02     underfit
  2       1.41             6.03     underfit
  3       0.93             1.44     true model
  4       0.62            10.70     overfit
  5       0.43           187.15     overfit
  6       0.43            82.80     overfit
  7       0.41          3514.03     overfit



Calibration vs. Validation Sample
① Holdout method
• Partition the data into two mutually exclusive subgroups: calibration vs. holdout
• How much to allocate on calibration?
– Typically, 2/3 for calibration. For small data (n<100), maybe 3/4.
• Assigning each observation to a group: through randomization
• When the data set is large, hold-out method should be okay.
• When we have a small data set, the inefficiency from reserving a large portion of
samples matters in the hold-out method.

② K-Fold Cross-validation
• Randomly partition the data into K equal-sized subsets
• We estimate/validate the model K times: in each turn, use K−1 subsets as the
calibration sample and the remaining subset as the validation sample.
• Models are assessed based on the average validation error across the K predictions
• Rule of thumb: K = 10 or 20
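A minimal K-fold cross-validation sketch for a linear model (K = 10; the data frame dat and the formula y ~ x1 + x2 are illustrative assumptions):

  set.seed(1)
  K    <- 10
  fold <- sample(rep(1:K, length.out = nrow(dat)))
  cv_err <- sapply(1:K, function(k) {
    fit  <- lm(y ~ x1 + x2, data = dat[fold != k, ])
    pred <- predict(fit, newdata = dat[fold == k, ])
    mean((dat$y[fold == k] - pred)^2)      # validation MSE for fold k
  })
  mean(cv_err)                             # average validation error across the K folds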



③ Leave-One-Out method
• Same idea as K-Fold cross-validation, but K=Number of Observations

④ Bootstrap
• When the size of the data is n, select a sample of size n “with replacement”
• The same observation can be included more than once.
• It is known that bootstrapping performs better than cross-validation when the size
of the data is small
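One common way (not spelled out in the lecture) to turn bootstrap resamples into an error estimate is to fit on each resample and evaluate on the observations it left out. A sketch under the same assumed data frame dat:

  set.seed(1)
  n <- nrow(dat)
  boot_err <- sapply(1:200, function(b) {          # 200 resamples is an assumption
    idx <- sample(1:n, n, replace = TRUE)          # bootstrap sample, with replacement
    out <- setdiff(1:n, idx)                       # observations left out of this resample
    fit <- lm(y ~ x1 + x2, data = dat[idx, ])
    mean((dat$y[out] - predict(fit, newdata = dat[out, ]))^2)
  })
  mean(boot_err)                                   # bootstrap estimate of prediction error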



Evaluation Criteria (Choice of Loss Function)

• Criterion: “goodness-of-fit” or “badness-of-fit” – the distance between what really
happened and what the model predicts to happen
• The way of quantifying “goodness-of-fit” depends on the purpose of the model
and on the nature of the dependent variable.

① Continuous Variable
• mean squared error (MSE):  Σ_i e_i² / n = Σ_i (y_i − ŷ_i)² / n
• mean absolute deviation (MAD):  Σ_i |e_i| / n = Σ_i |y_i − ŷ_i| / n
• mean absolute percentage deviation (MAPD):  (Σ_i |y_i − ŷ_i| / |y_i|) / n × 100%
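A minimal sketch computing the three criteria from actual values y and predictions yhat (the vector names are illustrative):

  mse  <- mean((y - yhat)^2)
  mad_ <- mean(abs(y - yhat))                 # named mad_ to avoid masking stats::mad
  mapd <- mean(abs(y - yhat) / abs(y)) * 100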



② Discrete Dependent Variable
• When Yi takes on discrete values, the prediction is typically expressed in probability
terms, Pr(Yi = 1)
• Prediction: first determine the cut-off value “c” beforehand
ŷ = 1 if Pr(y = 1) > c
ŷ = 0 if Pr(y = 1) ≤ c

                              Actual data
                               0        1
Model's     predicted as 0    t_n      f_n
prediction  predicted as 1    f_p      t_p

– true positive (t_p): predicted 1, actually 1
– true negative (t_n): predicted 0, actually 0
– false positive (f_p): predicted 1, actually 0
– false negative (f_n): predicted 0, actually 1



• Hit Ratio = Σ_i H_i / n, where H_i = 1 if the prediction is correct and 0 if not
– So, Hit Ratio = (t_p + t_n) / (t_p + t_n + f_p + f_n)

• F1 accuracy measure (harmonic mean of precision and recall)
– Precision = t_p / (t_p + f_p)
– Recall = t_p / (t_p + f_n)
– F1 = 2 · Precision × Recall / (Precision + Recall)

• Predictive Likelihood
∏_i P̂_i^(y_i) · (1 − P̂_i)^(1 − y_i)
• Predictive Log Likelihood
Σ_i [ y_i log P̂_i + (1 − y_i) log(1 − P̂_i) ]
• Note: We prefer larger Hit Ratio, F1, Log Likelihood. So when it comes to loss, for
example, you need to minimize the negative of log likelihood.
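A minimal sketch of these criteria given actual labels y (0/1) and predicted probabilities phat (the names and the 0.5 cut-off are assumptions):

  cutoff <- 0.5
  yhat   <- as.integer(phat > cutoff)
  tp <- sum(yhat == 1 & y == 1); tn <- sum(yhat == 0 & y == 0)
  fp <- sum(yhat == 1 & y == 0); fn <- sum(yhat == 0 & y == 1)
  hit_ratio <- (tp + tn) / (tp + tn + fp + fn)
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  f1        <- 2 * precision * recall / (precision + recall)
  pred_loglik <- sum(y * log(phat) + (1 - y) * log(1 - phat))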



Regularization: Avoiding Overfitting for Parameter Optimization

• Avoiding overfitting involves complexity control: the “right” balance between
the fit to the data and the complexity of the model
• The general strategy for complexity control is that instead of just
optimizing the fit to the data, we optimize some combination of fit and
simplicity: Models will be better if they fit the data better, but they also
will be better if they are simpler. This general methodology is called
regularization.
• Complexity control via regularization works by adding a penalty for
complexity
arg max_w [ fit(x, w) − λ · penalty(w) ]
where w refers to the model and λ is the importance weight on the penalty.



Shrinkage Methods

• The subset selection procedure is a discrete process – variables are either
retained or discarded – so it often exhibits high variance and therefore doesn’t
reduce the prediction error of the full model. Shrinkage methods are
more continuous and don’t suffer as much from high variability.
• Ridge regression
• Lasso
• Elastic net



Ridge regression

β̂_ridge = arg min_β { Σ_i (y_i − β0 − Σ_j x_ij β_j)² + λ Σ_j β_j² }

• The larger the value of 𝜆, the greater the amount of shrinkage


• Ridge regression can be a solution for obtaining an estimate when the design
matrix is not of full rank (i.e., under multicollinearity).

• The ridge solutions are not equivariant under scaling of the inputs and so one
normally standardizes the inputs before solving the minimization
• In addition, the intercept has been left out of the penalty. So use a
reparametrization with centered inputs (x_ij − x̄_j) and estimate β̂0 = ȳ.

• Then, after centering, the input matrix has k (rather than k+1) columns.
𝑅𝑆𝑆 𝜆 = 𝑦 − 𝑋𝛽 " 𝑦 − 𝑋𝛽 + 𝜆𝛽 " 𝛽

𝛽2 6.(78 = 𝑋 " 𝑋 + 𝜆𝐼 #$ 𝑋 " 𝑦
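A minimal base-R sketch of this closed form (X is an n × k input matrix and y the response; both are assumed to be prepared, with λ chosen separately):

  ridge_beta <- function(X, y, lambda) {
    Xc <- scale(X, center = TRUE, scale = FALSE)      # centered inputs: k columns, no intercept
    yc <- y - mean(y)                                 # so that beta0_hat = mean(y)
    solve(t(Xc) %*% Xc + lambda * diag(ncol(Xc)), t(Xc) %*% yc)
  }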



Lasso

β̂_lasso = arg min_β { (1/2) Σ_i (y_i − β0 − Σ_j x_ij β_j)² + λ Σ_j |β_j| }

• The 𝐿$ lasso penalty makes the solutions nonlinear and there is no closed form
expression (unlike ridge)
• Computing the lasso solution is a quadratic programming problem. (Practically, the
computational burden for the lasso is comparable to that of ridge.)

• R package ‘lars’ (check the package manual, it can do least angle regression,
lasso, and forward stagewise regression)



Elastic Net

β̂_elastic net = arg min_β { (1/2) Σ_i (y_i − β0 − Σ_j x_ij β_j)² + λ Σ_j [ α|β_j| + (1 − α) β_j² ] }

• Elastic net is a compromise between ridge and lasso.


• The parameter α determines the mix of the penalties and is often pre-chosen on
qualitative grounds.
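As a hedged illustration, the widely used R package 'glmnet' (not mentioned in the lecture, and using a slightly different penalty parameterization than the formula above) fits ridge, lasso, and elastic net in one interface; X is a numeric predictor matrix and y the response:

  library(glmnet)                            # assumes glmnet is installed
  fit_ridge <- glmnet(X, y, alpha = 0)       # alpha = 0: ridge penalty
  fit_lasso <- glmnet(X, y, alpha = 1)       # alpha = 1: lasso penalty
  fit_enet  <- glmnet(X, y, alpha = 0.5)     # 0 < alpha < 1: elastic net
  cv_enet   <- cv.glmnet(X, y, alpha = 0.5)  # cross-validated choice of lambda
  coef(cv_enet, s = "lambda.min")            # coefficients at the lambda with lowest CV error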



Implementing Ridge regression

Estimation (on the training observations):
β̂(λ) = arg min_β { Σ_i (y_i − β0 − Σ_j x_ij β_j)² + λ Σ_j β_j² }

Validation error for model evaluation (summed over the validation observations):
Σ_i (y_i − β̂0 − Σ_j x_ij β̂_j)²

• Consider multiple candidate values for λ, say {0, 0.01, 0.02, 0.04, 0.08, ..., 10}. Prepare the
training data and the validation/test data.
• For each candidate value of λ, estimate β̂_ridge(λ) and then compute the
validation error.
• Find the value of λ at which the validation error is minimized.
• Then compute the test error for that value of λ.

• R package ‘ridge’
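A sketch of this search using the ridge_beta() helper sketched earlier (training X, y and validation Xv, yv are assumed to be prepared; the λ grid is illustrative):

  lambdas <- c(0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10)
  val_err <- sapply(lambdas, function(lam) {
    b  <- ridge_beta(X, y, lam)                        # fit on the training data
    b0 <- as.numeric(mean(y) - colMeans(X) %*% b)      # recover the intercept
    mean((yv - (b0 + Xv %*% b))^2)                     # validation error (MSE)
  })
  best <- lambdas[which.min(val_err)]                  # lambda minimizing validation error
  # Finally, compute the test error once, at lambda = best, on the held-out test set.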



Bias-Variance in Shrinkage methods

Consider ridge: β̂(λ) = arg min_β { Σ_i (y_i − β0 − Σ_j x_ij β_j)² + λ Σ_j β_j² }

• If λ = 0, we are including all possible variables with no shrinkage, so the model overfits
(high variance).
• If λ is a very large value, the coefficients β_j are shrunk heavily toward 0 for many j, so the
model underfits (high bias).
• So there should be an intermediate value of λ that optimally trades off between them.

[Figure: error vs. λ – high variance at small λ, high bias at large λ]

