Data Mining
Model Overfitting
Introduction to Data Mining, 2nd Edition
by
Tan, Steinbach, Karpatne, Kumar
Classification Errors
Training errors (apparent errors)
– Errors committed on the training set
Test errors
– Errors committed on the test set
Generalization errors
– Expected error of a model over a random selection of records drawn from the same distribution
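To make the three notions concrete, here is a minimal sketch (assuming scikit-learn is available; the dataset and model are arbitrary stand-ins, not the book's example):

```python
# Training error is measured on the data used to fit the model; test error
# is measured on held-out data and serves as an estimate of the
# generalization error.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_error = 1 - tree.score(X_train, y_train)   # apparent error
test_error = 1 - tree.score(X_test, y_test)      # generalization estimate
print(f"training error = {train_error:.3f}, test error = {test_error:.3f}")
```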
Example Data Set
Two-class problem:
+ : 5200 instances
  • 5000 instances generated from a Gaussian centered at (10,10)
  • 200 noisy instances added
o : 5200 instances
  • Generated from a uniform distribution
10% of the data used for training and 90% of the data used for testing
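A sketch of this dataset in NumPy. The book does not specify the Gaussian's variance, the placement of the noisy instances, or the range of the uniform distribution, so unit variance and a [0, 20] × [0, 20] domain are assumptions here:

```python
import numpy as np

rng = np.random.default_rng(0)

# "+" class: 5000 Gaussian instances centered at (10, 10), plus 200 noisy ones
plus_gauss = rng.normal(loc=(10.0, 10.0), scale=1.0, size=(5000, 2))
plus_noise = rng.uniform(low=0, high=20, size=(200, 2))   # assumed noise range
X_plus = np.vstack([plus_gauss, plus_noise])

# "o" class: 5200 instances from a uniform distribution (assumed range)
X_minus = rng.uniform(low=0, high=20, size=(5200, 2))

X = np.vstack([X_plus, X_minus])
y = np.array([1] * 5200 + [0] * 5200)

# 10% of the data for training, 90% for testing
idx = rng.permutation(len(y))
n_train = int(0.1 * len(y))
X_train, y_train = X[idx[:n_train]], y[idx[:n_train]]
X_test, y_test = X[idx[n_train:]], y[idx[n_train:]]
```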
Increasing number of nodes in Decision Trees
[Figure: decision tree with 4 nodes and its decision boundaries on the training data]
[Figure: decision tree with 50 nodes and its decision boundaries on the training data]
Which tree is better?
[Figure: the 4-node tree vs. the 50-node tree]
Model Overfitting
Underfitting: when the model is too simple, both training and test errors are large
Overfitting: when the model is too complex, the training error is small but the test error is large
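A sketch of both regimes (scikit-learn assumed; the synthetic data stands in for the book's example). Small trees underfit, with both errors high; very large trees overfit, with training error still falling while test error rises:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y adds label noise so that a perfect fit of the training set
# necessarily memorizes noise.
X, y = make_classification(n_samples=3000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.9, random_state=0)

for max_leaves in (2, 4, 8, 16, 32, 64, 128, 256):
    t = DecisionTreeClassifier(max_leaf_nodes=max_leaves, random_state=0)
    t.fit(X_tr, y_tr)
    print(f"{max_leaves:4d} leaves: "
          f"train error = {1 - t.score(X_tr, y_tr):.3f}, "
          f"test error = {1 - t.score(X_te, y_te):.3f}")
```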
Model Overfitting
Using twice the number of data instances
• If the training data is under-representative, test errors increase and training errors decrease as the number of nodes increases
• Increasing the size of the training data reduces the difference between training and test errors at a given number of nodes
[Figure: two decision trees with 50 nodes, using twice the number of data instances]
Reasons for Model Overfitting
Limited Training Size
High Model Complexity
– Multiple Comparison Procedure
Effect of Multiple Comparison Procedure
Consider the task of predicting whether the stock market will rise or fall in the next 10 trading days:

Day 1: Up
Day 2: Down
Day 3: Down
Day 4: Up
Day 5: Down
Day 6: Down
Day 7: Up
Day 8: Up
Day 9: Up
Day 10: Down

Random guessing:
P(correct) = 0.5

Make 10 random guesses in a row:

$P(\#\,\text{correct} \geq 8) = \dfrac{\binom{10}{8} + \binom{10}{9} + \binom{10}{10}}{2^{10}} = 0.0547$
Effect of Multiple Comparison Procedure
Approach:
– Get 50 analysts
– Each analyst makes 10 random guesses
– Choose the analyst that makes the most correct predictions

Probability that at least one analyst makes at least 8 correct predictions:

$P(\#\,\text{correct} \geq 8) = 1 - (1 - 0.0547)^{50} = 0.9399$
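Both probabilities can be verified with the Python standard library:

```python
from math import comb

# One analyst: P(at least 8 of 10 random guesses correct)
p_one = sum(comb(10, k) for k in (8, 9, 10)) / 2**10
print(round(p_one, 4))        # 0.0547

# Fifty analysts: P(at least one of them gets >= 8 correct)
p_any = 1 - (1 - p_one) ** 50
print(round(p_any, 4))        # 0.9399
```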
Effect of Multiple Comparison Procedure
Many algorithms employ the following greedy strategy:
– Initial model: M
– Alternative model: M′ = M ∪ γ, where γ is a component to be added to the model (e.g., a test condition of a decision tree)
– Keep M′ if the improvement Δ(M, M′) > α

Often, γ is chosen from a set of alternative components, Γ = {γ₁, γ₂, …, γₖ}

If many alternatives are available, one may inadvertently add irrelevant components to the model, resulting in model overfitting
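A small illustration of this effect (NumPy assumed; the setup is invented for demonstration): when many pure-noise candidate components are available, the best-looking one fits the training data well by chance alone.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 100                     # 100 training records, 100 candidates
y = rng.integers(0, 2, size=n)      # purely random binary labels

best_acc = 0.0
for _ in range(k):
    noise_feature = rng.random(n)   # an irrelevant candidate component
    pred = (noise_feature > np.median(noise_feature)).astype(int)
    # Allow either polarity of the split, as a decision tree would.
    acc = max((pred == y).mean(), (pred != y).mean())
    best_acc = max(best_acc, acc)

print(f"best training accuracy among {k} noise features: {best_acc:.2f}")
# Typically well above 0.5, even though every candidate is irrelevant.
```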
Effect of Multiple Comparison - Example
– Attributes: X and Y, plus 100 additional noisy variables generated from a uniform distribution
– 30% of the data used for training and 70% of the data used for testing
[Figure: decision boundaries using only X and Y as attributes vs. using all 102 attributes]
Notes on Overfitting
Overfitting results in decision trees that are more
complex than necessary
Training error does not provide a good estimate
of how well the tree will perform on previously
unseen records
Need ways to estimate generalization error
Model Selection
Performed during model building
Purpose is to ensure that the model is not overly
complex (to avoid overfitting)
Need to estimate generalization error
– Using Validation Set
– Incorporating Model Complexity
– Estimating Statistical Bounds
Model Selection:
Using Validation Set
Divide training data into two parts:
– Training set:
use for model building
– Validation set:
use for estimating generalization error
Note: validation set is not the same as test set
Drawback:
– Less data available for training
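A sketch of this procedure (scikit-learn assumed; the dataset and candidate complexities are stand-ins). The training data is split again, and the complexity with the lowest validation error is selected; the test set is never touched during selection:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, flip_y=0.1, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3,
                                                  random_state=0)

best = None
for leaves in (2, 4, 8, 16, 32, 64):
    t = DecisionTreeClassifier(max_leaf_nodes=leaves, random_state=0)
    t.fit(X_train, y_train)
    val_error = 1 - t.score(X_val, y_val)   # generalization estimate
    if best is None or val_error < best[1]:
        best = (leaves, val_error)

print(f"selected {best[0]} leaf nodes (validation error {best[1]:.3f})")
```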
Model Selection:
Incorporating Model Complexity
Rationale: Occam’s Razor
– Given two models of similar generalization errors,
one should prefer the simpler model over the more
complex model
– A complex model has a greater chance of being fitted
accidentally by errors in data
– Therefore, one should include model complexity when
evaluating a model
Gen.Error(Model) = Train.Error(Model, Train.Data) + α × Complexity(Model)
Estimating the Complexity of Decision Trees
Pessimistic Error Estimate of decision tree T with k leaf nodes:

$e_{gen}(T) = err(T) + \Omega \times \dfrac{k}{N_{train}}$

– err(T): error rate on all training records
– Ω: trade-off hyper-parameter (similar to α), the relative cost of adding a leaf node
– k: number of leaf nodes
– N_train: total number of training records
Estimating the Complexity of Decision Trees: Example
[Figure: candidate subtrees T_L (7 leaf nodes) and T_R (4 leaf nodes)]
e(T_L) = 4/24
e(T_R) = 6/24
Ω = 1
e_gen(T_L) = 4/24 + 1 × 7/24 = 11/24 = 0.458
e_gen(T_R) = 6/24 + 1 × 4/24 = 10/24 = 0.417
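The same computation as a small helper; the leaf counts (7 for T_L, 4 for T_R) are read off the arithmetic above:

```python
def pessimistic_error(n_errors, n_leaves, n_train, omega=1.0):
    """Training error plus a complexity penalty of omega per leaf node."""
    return (n_errors + omega * n_leaves) / n_train

print(pessimistic_error(4, 7, 24))   # T_L: 11/24 = 0.458...
print(pessimistic_error(6, 4, 24))   # T_R: 10/24 = 0.417...  -> prefer T_R
```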
Estimating the Complexity of Decision Trees
Resubstitution Estimate:
– Using training error as an optimistic estimate of
generalization error
– Referred to as optimistic error estimate
e(TL) = 4/24
e(TR) = 6/24
Minimum Description Length (MDL)
[Figure: A holds a table of records X1…Xn with known labels y; B holds the same records with unknown labels. A can either transmit all the labels, or transmit a decision tree (with test nodes A?, B?, C?) plus the records it misclassifies.]
Cost(Model, Data) = Cost(Data|Model) + α × Cost(Model)
– Cost is the number of bits needed for encoding.
– Search for the least costly model.
Cost(Data|Model) encodes the misclassification errors.
Cost(Model) uses node encoding (number of children)
plus splitting condition encoding.
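A toy sketch of such a bit-cost comparison. The specific encoding costs below are assumptions for illustration (the slide does not fix them): log₂(n) bits to point at each misclassified record, log₂(m) bits to name each internal node's splitting attribute, and one bit per node for its type.

```python
from math import log2

def mdl_cost(n_records, n_attributes, n_errors,
             n_internal_nodes, n_leaves, alpha=1.0):
    # Cost(Data|Model): identify each misclassified record among n_records.
    cost_data_given_model = n_errors * log2(n_records)
    # Cost(Model): node type plus the splitting attribute per internal node.
    cost_model = n_internal_nodes * (1 + log2(n_attributes)) + n_leaves
    return cost_data_given_model + alpha * cost_model

# A small tree with more errors vs. a large tree with fewer errors:
print(mdl_cost(1000, 16, n_errors=60, n_internal_nodes=3, n_leaves=4))
print(mdl_cost(1000, 16, n_errors=40, n_internal_nodes=40, n_leaves=41))
# Under these assumed costs, the smaller tree is the less costly model.
```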
Estimating Statistical Bounds
$e'(N, e, \alpha) = \dfrac{e + \dfrac{z_{\alpha/2}^2}{2N} + z_{\alpha/2}\sqrt{\dfrac{e(1-e)}{N} + \dfrac{z_{\alpha/2}^2}{4N^2}}}{1 + \dfrac{z_{\alpha/2}^2}{N}}$
Before splitting: e = 2/7, e′(7, 2/7, 0.25) = 0.503
e′(T) = 7 × 0.503 = 3.521
After splitting:
e(T_L) = 1/4, e′(4, 1/4, 0.25) = 0.537
e(T_R) = 1/3, e′(3, 1/3, 0.25) = 0.650
e′(T) = 4 × 0.537 + 3 × 0.650 = 4.098
Therefore, do not split
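These numbers can be checked with the standard library; the bound treats each node's error rate as a binomial proportion and takes its upper confidence limit:

```python
from statistics import NormalDist

def upper_error_bound(n, e, alpha):
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}
    num = e + z**2 / (2 * n) + z * (e * (1 - e) / n
                                    + z**2 / (4 * n**2)) ** 0.5
    return num / (1 + z**2 / n)

print(round(upper_error_bound(7, 2 / 7, 0.25), 3))   # 0.503
print(round(upper_error_bound(4, 1 / 4, 0.25), 3))   # 0.537
print(round(upper_error_bound(3, 1 / 3, 0.25), 3))   # 0.650
```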
Model Selection for Decision Trees
Pre-Pruning (Early Stopping Rule)
– Stop the algorithm before it grows a fully-grown tree
– Typical stopping conditions for a node:
  • Stop if all instances belong to the same class
  • Stop if all the attribute values are the same
– More restrictive conditions (see the sketch below):
  • Stop if the number of instances is less than some user-specified threshold
  • Stop if the class distribution of the instances is independent of the available features (e.g., using a χ² test)
  • Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain)
  • Stop if the estimated generalization error falls below a certain threshold
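In scikit-learn terms (a sketch, not the book's algorithm), several of these conditions map onto constructor parameters; note that scikit-learn has no built-in χ²-test stopping rule.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

tree = DecisionTreeClassifier(
    min_samples_split=20,        # stop if a node has fewer instances than this
    min_impurity_decrease=0.01,  # stop if the best split barely helps impurity
    max_depth=8,                 # hard cap on depth as a crude complexity limit
    random_state=0,
).fit(X, y)

print("leaves in the pre-pruned tree:", tree.get_n_leaves())
```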
Model Selection for Decision Trees
Post-pruning
– Grow the decision tree in its entirety
– Subtree replacement (see the sketch below):
  • Trim the nodes of the decision tree in a bottom-up fashion
  • If the generalization error improves after trimming, replace the sub-tree by a leaf node
  • The class label of the leaf node is determined from the majority class of instances in the sub-tree
– Subtree raising:
  • Replace a subtree with its most frequently used branch
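A hedged sketch of post-pruning using scikit-learn's cost-complexity pruning, which collapses subtrees into leaves (subtree replacement; subtree raising is not available there). The pruning strength is chosen on a validation set:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, flip_y=0.1, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3,
                                            random_state=0)

# Candidate pruning strengths for the fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_tr, y_tr)

# Refit at each strength and keep the tree with the best validation score.
best = max(
    (DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_tr, y_tr)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_val, y_val),
)
print("pruned tree has", best.get_n_leaves(), "leaves")
```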
Example of Post-Pruning
Before splitting (root node):
  Class = Yes: 20, Class = No: 10
  Training error = 10/30
  Pessimistic error = (10 + 0.5)/30 = 10.5/30

After splitting on A into A1, A2, A3, A4:
  A1: Class = Yes: 8, Class = No: 4
  A2: Class = Yes: 3, Class = No: 4
  A3: Class = Yes: 4, Class = No: 1
  A4: Class = Yes: 5, Class = No: 1
  Training error = 9/30
  Pessimistic error = (9 + 4 × 0.5)/30 = 11/30

Since 11/30 > 10.5/30, the split does not pay off: PRUNE!
Examples of Post-pruning
Model Evaluation
Purpose:
– To estimate performance of classifier on previously
unseen data (test set)
Holdout
– Reserve k% for training and (100-k)% for testing
– Random subsampling: repeated holdout
Cross validation
– Partition data into k disjoint subsets
– k-fold: train on k-1 partitions, test on the remaining one
– Leave-one-out: k=n
Cross-validation Example
3-fold cross-validation
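A minimal 3-fold cross-validation run (scikit-learn assumed; the dataset is a stand-in). Each record is used for testing exactly once, and the three error estimates are averaged:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=900, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=3)
print("fold accuracies:", scores)
print("mean error estimate:", 1 - scores.mean())
```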