DDMA05_ModelSelection
– Predictive models for credit scoring estimate the likelihood that a potential
customer will default.
– Predictive models for spam filtering estimate whether a given piece of email is
spam.
– Predictive models for fraud detection judge whether an account has been used
fraudulently.
• This is in contrast to descriptive modeling, where the primary purpose is to gain
insight into the underlying phenomenon or process.
– A descriptive model of churn would tell us what churning customers look like.
• Regressions with all possible subsets of predictor variables: e.g., if there are N
variables, then we need to run $2^N$ regressions.
• Find the best regression model out of the $2^N$ regression outputs, based on
evaluation criteria such as adjusted $R^2$, AIC, or BIC.
"#$
Adjusted 𝑅 ! = 1 − (1 − 𝑅 ! ),
"#%
AIC = −2 log 𝐿 + 2𝑘,
BIC = −2 log 𝐿 + 𝑘 ∗ log 𝑇
T: the number of observations,
k: the number of predictor variables included,
log 𝐿: log likelihood
• This approach is impractical when N is large: e.g., N = 50 gives $2^{50} \approx 1.13 \times 10^{15}$ possible
models
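As a concrete illustration, the sketch below runs an exhaustive best-subset search with the R package 'leaps' (the package choice and the built-in mtcars data are assumptions; the slides do not name them). leaps reports adjusted R² and BIC directly; AIC can be computed by refitting the chosen model.

```r
# Exhaustive best-subset search (assumes the 'leaps' package and the
# built-in mtcars data purely for illustration).
library(leaps)

fit  <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10, method = "exhaustive")
summ <- summary(fit)

which.max(summ$adjr2)               # model size with the largest adjusted R^2
which.min(summ$bic)                 # model size with the smallest BIC
summ$which[which.min(summ$bic), ]   # variables in the BIC-best model
```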
• Find the optimal set of predictor variables by combining forward selection with
backward elimination (stepwise selection)
• Forward Selection
– Estimate models with one predictor variable each (N such models)
– The variable with the largest F-statistic becomes the candidate
– If the candidate’s F is larger than a predetermined level F0, then the variable
associated with this F is added to the predictor set (otherwise the process stops with
no variable in the model)
– If the process has not stopped, run regressions with two predictor variables: the
selected variable (say X1) and one (say Xk) of the remaining N-1 variables (so N-1
regression models)
– Compute the partial F: a statistic testing βk = 0 when both X1 and Xk are in the
model. If the largest partial F exceeds the predetermined level F0, then the variable
with this F is added. Otherwise stop.
– Repeat this process
– Forward selection continues until no further predictor can be added
• Backward Elimination
– After each new variable is added, compute the partial F for every variable already
in the model
– If the smallest partial F is below a predetermined value F1, the corresponding
variable is removed. Otherwise stop removing.
– Continue backward elimination until no further variable can be eliminated.
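A minimal sketch of this stepwise procedure using R's built-in step() function. Note that step() adds and drops variables by AIC rather than by the partial-F thresholds F0 and F1 described above, so it approximates the procedure rather than implementing it exactly; the mtcars data are an illustrative assumption.

```r
# Stepwise selection with step(): direction = "both" combines forward
# selection with backward elimination, using AIC as the criterion.
null <- lm(mpg ~ 1, data = mtcars)   # intercept-only starting model
full <- lm(mpg ~ ., data = mtcars)   # scope: all candidate predictors

both <- step(null, scope = list(lower = formula(null), upper = formula(full)),
             direction = "both", trace = FALSE)
summary(both)
```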
Consider a target variable ($Y$), a vector of inputs ($X$), and a prediction model ($\hat{f}(X)$)
that has been estimated from a training set.
• Loss function
$$L\big(Y, \hat{f}(X)\big) = \begin{cases} \big(Y - \hat{f}(X)\big)^2 & \text{squared error} \\ \big|Y - \hat{f}(X)\big| & \text{absolute error} \end{cases}$$
• Various types of loss functions exist (depending upon the nature of the data).
• Test error, also referred to as generalization error, is the prediction error over an
independent test sample.
• Training error is the average loss over the training sample. Unfortunately, training
error is not a good estimate of the test error.
[Figure: typical behavior of test error vs. training error as model complexity increases (Case 1, Case 2)]
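A small simulation (all names and data are illustrative assumptions) reproduces the qualitative picture in the figure: as polynomial degree grows, training error keeps falling while test error eventually rises.

```r
# Training vs. test error as model complexity (polynomial degree) grows.
set.seed(1)
n     <- 100
dat   <- data.frame(x = runif(n, -2, 2))
dat$y <- sin(dat$x) + rnorm(n, sd = 0.3)

idx   <- sample(n, n / 2)            # split into training and test halves
train <- dat[idx, ]
test  <- dat[-idx, ]

mse <- function(fit, d) mean((d$y - predict(fit, newdata = d))^2)
err <- t(sapply(1:8, function(deg) {
  fit <- lm(y ~ poly(x, deg), data = train)
  c(degree = deg, train = mse(fit, train), test = mse(fit, test))
}))
err   # training MSE decreases monotonically; test MSE is U-shaped
```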
② K-Fold Cross-Validation
• Randomly partition the data into K equal-sized subsets
• Estimate/validate the model K times: in each turn, use K-1 subsets as the
calibration sample and the one remaining subset as the validation sample.
• Models are assessed based on the average validation error across the K validation
folds
• Rule of thumb: K = 10 or 20
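A hand-rolled K-fold cross-validation sketch (the mtcars data and the two predictors are illustrative assumptions):

```r
# K-fold cross-validation for a linear model, K = 10.
set.seed(1)
K     <- 10
folds <- sample(rep(1:K, length.out = nrow(mtcars)))  # random fold labels

cv.err <- sapply(1:K, function(k) {
  fit  <- lm(mpg ~ wt + hp, data = mtcars[folds != k, ])  # K-1 calibration folds
  pred <- predict(fit, newdata = mtcars[folds == k, ])    # held-out fold
  mean((mtcars$mpg[folds == k] - pred)^2)
})
mean(cv.err)   # average validation error across the K folds
```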
④ Bootstrap
• When the data set has n observations, draw a sample of size n “with replacement”
• The same observation can be included more than once.
• Bootstrapping is known to perform better than cross-validation when the data set
is small
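A minimal bootstrap sketch (the regression and data are illustrative assumptions), showing a size-n resample drawn with replacement:

```r
# Bootstrap: resample n observations with replacement, here to estimate
# the sampling variability of a regression slope.
set.seed(1)
n <- nrow(mtcars)
B <- 1000
boot.slope <- replicate(B, {
  idx <- sample(n, n, replace = TRUE)   # the same row can appear twice
  coef(lm(mpg ~ wt, data = mtcars[idx, ]))["wt"]
})
sd(boot.slope)   # bootstrap standard error of the slope
```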
① Continuous Variable
• mean squared error (MSE)
$$\sum_{i=1}^{n} e_i^2 / n = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 / n$$
• mean absolute deviation (MAD)
$$\sum_{i=1}^{n} |e_i| / n = \sum_{i=1}^{n} |y_i - \hat{y}_i| / n$$
• mean absolute percentage deviation (MAPD)
$$\sum_{i=1}^{n} \frac{|y_i - \hat{y}_i|}{y_i} \Big/ n \times 100\%$$
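The three measures in a few lines of R (the toy vectors are assumptions):

```r
# MSE, MAD, and MAPD for a continuous target.
y    <- c(10, 12, 15, 20)   # actual values
yhat <- c(11, 11, 16, 18)   # predictions
e    <- y - yhat

mse  <- mean(e^2)               # mean squared error
mad  <- mean(abs(e))            # mean absolute deviation
mapd <- mean(abs(e) / y) * 100  # mean absolute percentage deviation (%)
```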
                            Actual data
                            0                     1
Model's      Predicted 0    t0 (true negative)    f0 (false negative)
prediction   Predicted 1    f1 (false positive)   t1 (true positive)
– t1: true positive, t0: true negative
– f1: false positive, f0: false negative
– Hit Ratio = (t0 + t1)/n: the proportion of correct predictions
– F1 $= 2 \cdot \dfrac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$, where Precision $= t_1/(t_1 + f_1)$ and Recall $= t_1/(t_1 + f_0)$
• Predictive Likelihood
$$\prod_{i=1}^{n} \hat{P}_i^{\,y_i}\, (1 - \hat{P}_i)^{1 - y_i}$$
• Predictive Log-Likelihood
$$\sum_{i=1}^{n} \Big[ y_i \log \hat{P}_i + (1 - y_i) \log\big(1 - \hat{P}_i\big) \Big]$$
• Note: We prefer a larger Hit Ratio, F1, or Log-Likelihood. So when one of these is
cast as a loss, you minimize its negative, e.g., the negative log-likelihood.
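The classification measures above, computed directly (the labels and probabilities are toy assumptions; note that lowercase t1/f1 are confusion-matrix cells while uppercase F1 is the score):

```r
# Confusion-matrix measures and predictive log-likelihood.
y    <- c(0, 0, 1, 1, 1, 0)               # actual labels
phat <- c(0.2, 0.6, 0.8, 0.4, 0.9, 0.1)   # predicted P(y = 1)
pred <- as.integer(phat > 0.5)            # 0/1 predictions

t1 <- sum(pred == 1 & y == 1)   # true positives
t0 <- sum(pred == 0 & y == 0)   # true negatives
f1 <- sum(pred == 1 & y == 0)   # false positives
f0 <- sum(pred == 0 & y == 1)   # false negatives

hit.ratio <- (t0 + t1) / length(y)
precision <- t1 / (t1 + f1)
recall    <- t1 / (t1 + f0)
F1        <- 2 * precision * recall / (precision + recall)
loglik    <- sum(y * log(phat) + (1 - y) * log(1 - phat))
```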
$$\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{k} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{k} \beta_j^2 \right\}$$
• The ridge solutions are not equivariant under scaling of the inputs, so one
normally standardizes the inputs before solving the minimization.
• In addition, the intercept has been left out of the penalty. So reparametrize
using centered inputs $(x_{ij} - \bar{x}_j)$ and estimate $\hat{\beta}_0 = \bar{y}$.
• Then, after centering, the input matrix has k (rather than k+1) columns.
$$\mathrm{RSS}(\lambda) = (y - X\beta)^{\mathsf{T}}(y - X\beta) + \lambda \beta^{\mathsf{T}}\beta,$$
whose minimizer has the closed form $\hat{\beta}^{\text{ridge}} = (X^{\mathsf{T}}X + \lambda I)^{-1} X^{\mathsf{T}} y$.
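The closed form in a few lines of R (the standardization step and the mtcars columns are illustrative assumptions):

```r
# Ridge closed form on standardized inputs: beta0 = mean(y),
# beta = (X'X + lambda I)^{-1} X'(y - beta0).
X <- scale(as.matrix(mtcars[, c("wt", "hp", "disp")]))  # centered & scaled
y <- mtcars$mpg
lambda <- 1

beta0 <- mean(y)                 # intercept, estimated separately
beta  <- solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% (y - beta0))
beta
```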
$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \left\{ \frac{1}{2}\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{k} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{k} |\beta_j| \right\}$$
• The $L_1$ lasso penalty makes the solutions nonlinear in the $y_i$, and there is no
closed-form expression (unlike ridge)
• Computing the lasso solution is a quadratic programming problem. (Practically, the
computational burden for the lasso is comparable to that of ridge.)
• R package ‘lars’ (check the package manual; it can do least angle regression,
lasso, and forward stagewise regression)
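A minimal sketch with the 'lars' package named above (the mtcars data are an illustrative assumption):

```r
# Lasso solution path with lars.
library(lars)
X <- as.matrix(mtcars[, -1])   # all predictors
y <- mtcars$mpg

fit <- lars(X, y, type = "lasso")       # entire lasso path
plot(fit)                               # coefficient paths vs. shrinkage
coef(fit, s = 0.5, mode = "fraction")   # coefficients halfway along the path
```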
$$\hat{\beta}^{\text{elastic net}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{k} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{k} \Big( \alpha|\beta_j| + (1 - \alpha)\beta_j^2 \Big) \right\}$$
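A sketch using the 'glmnet' package (an assumption; the slides do not name it). Note that glmnet parameterizes the penalty slightly differently, as $\lambda \sum_j \big(\alpha|\beta_j| + (1-\alpha)\beta_j^2/2\big)$:

```r
# Elastic net with glmnet: alpha = 1 gives lasso, alpha = 0 gives ridge.
library(glmnet)
X <- as.matrix(mtcars[, -1])
y <- mtcars$mpg

fit <- glmnet(X, y, alpha = 0.5)   # equal mix of L1 and L2 penalties
coef(fit, s = 0.1)                 # coefficients at lambda = 0.1
```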
Estimation:
$$\hat{\beta}(\lambda) = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{k} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{k} \beta_j^2 \right\}$$
Validation error for model evaluation:
$$\sum_{i=n+1}^{n+v} \Big( y_i - \hat{\beta}_0 - \sum_{j=1}^{k} x_{ij}\hat{\beta}_j \Big)^2,$$
where the validation sample consists of observations $i = n+1, \dots, n+v$.
• Consider multiple candidates for $\lambda$, say {0, 0.01, 0.02, 0.04, 0.08, ..., 10}. Prepare the
training data and validation/test data.
• For each candidate value of $\lambda$, estimate $\hat{\beta}^{\text{ridge}}(\lambda)$ and then compute the
validation error.
• R package ‘ridge’
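The slides point to the 'ridge' package; the sketch below instead reuses the closed form so it stays self-contained (the data, split, and grid are illustrative assumptions):

```r
# Grid search for lambda by validation error (ridge, closed form).
set.seed(1)
X <- scale(as.matrix(mtcars[, c("wt", "hp", "disp")]))
y <- mtcars$mpg
idx  <- sample(nrow(X), 22)                        # training rows
grid <- c(0, 0.01, 0.02, 0.04, 0.08, 0.16, 1, 10)  # candidate lambdas

val.err <- sapply(grid, function(lambda) {
  b0 <- mean(y[idx])
  b  <- solve(t(X[idx, ]) %*% X[idx, ] + lambda * diag(ncol(X)),
              t(X[idx, ]) %*% (y[idx] - b0))
  pred <- b0 + X[-idx, ] %*% b                     # validation predictions
  mean((y[-idx] - pred)^2)
})
grid[which.min(val.err)]   # lambda with the smallest validation error
```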
Consider ridge
$$\hat{\beta}(\lambda) = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{k} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{k} \beta_j^2 \right\}$$
• If $\lambda$ is very large, the coefficients $\beta_j$ are shrunk close to 0 for many j, so the
model underfits (high bias)
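A quick demonstration of this shrinkage (same assumed setup as the earlier ridge sketches): as $\lambda$ grows, the coefficients collapse toward zero, pushing the fit toward the intercept-only model.

```r
# Ridge coefficients shrink toward 0 as lambda increases.
X  <- scale(as.matrix(mtcars[, c("wt", "hp", "disp")]))
yc <- mtcars$mpg - mean(mtcars$mpg)

for (lambda in c(0, 1, 10, 100, 1000)) {
  b <- solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% yc)
  cat("lambda =", lambda, " coefficients:", round(b, 3), "\n")
}
```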