Data Mining: Classification
LECTURE 10
Classification
Basic Concepts
Decision Trees
Catching tax evasion
[Figure: a labeled training set of tax records (Tid, Refund, Marital Status, Taxable Income, Cheat) is used to learn a model; the model is then applied to a test set of unlabeled records to predict the missing Cheat values.]
Evaluation of classification models
• Counts of test records that are correctly (or
incorrectly) predicted by the classification model
• Confusion matrix:

                        Predicted Class = 1   Predicted Class = 0
  Actual Class = 1            f11                   f10
  Actual Class = 0            f01                   f00
Training Data and Model: Decision Tree

Training data (categorical attributes: Refund, Marital Status; continuous attribute: Taxable Income; class label: Cheat):

Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes

Model: a decision tree with Refund at the root (Yes → NO; No → split on MarSt: Married → NO; Single, Divorced → split on TaxInc: < 80K → NO, > 80K → YES).

Another Example of Decision Tree

A different tree that fits the same training data: MarSt at the root (Married → NO; Single, Divorced → split on Refund: Yes → NO; No → split on TaxInc: < 80K → NO, > 80K → YES).

There can be more than one tree that fits the same data!
Decision Tree Classification Task
[Figure: a labeled training set (Tid, Attrib1, Attrib2, Attrib3, Class) is fed to a tree induction algorithm, which learns a model (a decision tree); the model is then applied to a test set whose class labels are unknown.]
Apply Model to Test Data
Test Data

Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Start from the root of the tree and follow the branch that matches the record at each test node:
• Refund: the record has Refund = No, so take the "No" branch to the MarSt node.
• MarSt: the record has Marital Status = Married, so take the "Married" branch.
The "Married" branch leads directly to the leaf NO, so the model assigns Cheat = "No". (The TaxInc test on the Single/Divorced branch is never reached for this record.) A small code sketch of this traversal follows below.
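Not part of the slides: a minimal Python sketch of routing a test record down the decision tree used in this walkthrough. The nested-dict tree encoding, the predict() helper, and the "TaxInc>80K" key are illustrative assumptions, not the lecture's notation.

```python
def predict(node, record):
    """Route a record from the root to a leaf by following the branch matching each test."""
    if 'leaf' in node:
        return node['leaf']
    return predict(node['children'][record[node['attr']]], record)

# The tree from the slides, with an illustrative encoding of the income threshold test.
taxinc = {'attr': 'TaxInc>80K', 'children': {'Yes': {'leaf': 'Yes'}, 'No': {'leaf': 'No'}}}
tree = {'attr': 'Refund',
        'children': {'Yes': {'leaf': 'No'},
                     'No': {'attr': 'MarSt',
                            'children': {'Married': {'leaf': 'No'},
                                         'Single': taxinc,
                                         'Divorced': taxinc}}}}

# The walkthrough's record: Refund = No, Married; the income test is never reached.
record = {'Refund': 'No', 'MarSt': 'Married'}
print(predict(tree, record))   # -> 'No'
```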
Tree Induction
• Finding the best decision tree is NP-hard
• Greedy strategy: split the records based on an attribute test that optimizes a certain criterion (a minimal sketch of this greedy, recursive procedure follows after the list)
• Many algorithms:
  • Hunt's Algorithm (one of the earliest)
  • CART
  • ID3, C4.5
  • SLIQ, SPRINT
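Not from the slides: a minimal Python sketch of the greedy, recursive induction strategy these algorithms share (in the spirit of Hunt's algorithm). The record format, attribute names, and the Gini-based split chooser are illustrative assumptions.

```python
from collections import Counter

def gini(records):
    """Gini impurity of a set of records (each record is a dict with a 'label' key)."""
    counts = Counter(r['label'] for r in records)
    n = len(records)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(records, attributes):
    """Greedy step: pick the attribute whose multi-way split minimizes the weighted Gini."""
    n = len(records)
    best = None
    for attr in attributes:
        parts = {}
        for r in records:
            parts.setdefault(r[attr], []).append(r)
        score = sum(len(p) / n * gini(p) for p in parts.values())
        if best is None or score < best[1]:
            best = (attr, score, parts)
    return best

def grow_tree(records, attributes):
    """Recursion: stop on a pure node (or no tests left), otherwise split and recurse."""
    counts = Counter(r['label'] for r in records)
    if len(counts) == 1 or not attributes:
        return {'leaf': counts.most_common(1)[0][0]}   # leaf labeled with the majority class
    attr, _, parts = best_split(records, attributes)
    rest = [a for a in attributes if a != attr]
    return {'split_on': attr,
            'children': {v: grow_tree(p, rest) for v, p in parts.items()}}

# Toy data in the spirit of the tax-evasion example (attribute values are illustrative).
data = [
    {'Refund': 'Yes', 'MarSt': 'Single',   'label': 'No'},
    {'Refund': 'No',  'MarSt': 'Married',  'label': 'No'},
    {'Refund': 'No',  'MarSt': 'Single',   'label': 'Yes'},
    {'Refund': 'No',  'MarSt': 'Divorced', 'label': 'Yes'},
]
print(grow_tree(data, ['Refund', 'MarSt']))
```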
General Structure of Hunt's Algorithm
• Let Dt be the set of training records that reach a node t
• If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt
• If Dt contains records that belong to more than one class, use an attribute test to split the records into smaller subsets, and recursively apply the procedure to each subset

Hunt's Algorithm
[Figure: Hunt's algorithm applied to the tax-evasion training data, growing the tree in stages: first a single leaf "Don't Cheat"; then a split on Refund (Yes → Don't Cheat, No → Don't Cheat); then, under Refund = No, a split on Marital Status; finally, under Single/Divorced, a split on Taxable Income (< 80K → Don't Cheat, otherwise Cheat).]

Splitting Based on Nominal Attributes
• Multi-way split: use as many partitions as distinct values (e.g., CarType: Family, Sports, Luxury)
• Binary split: divide the values into two subsets, e.g. {Sports, Luxury} vs. {Family} OR {Family, Luxury} vs. {Sports}
Splitting Based on Ordinal Attributes
• Multi-way split: use as many partitions as distinct values (Size: Small, Medium, Large)
• What about the binary split {Small, Large} vs. {Medium}? It does not respect the order of the attribute values.
Splitting Based on Continuous Attributes
• Binary split: a single threshold test, e.g. Taxable Income > 80K? (Yes / No)
• Multi-way split: discretize into disjoint ranges, e.g. Taxable Income? < 10K, ..., > 80K

Which split is best? Compare the class distributions of the resulting nodes:
• C0: 5, C1: 5 (non-homogeneous, high degree of impurity)
• C0: 9, C1: 1 (homogeneous, low degree of impurity)
• We need a measure of node impurity. Ideas?
Measuring Node Impurity
• p(i|t): fraction of records associated with node t belonging to class i
• Entropy: Entropy(t) = -\sum_{i=1}^{c} p(i|t) \log p(i|t)
  • Used in ID3 and C4.5
• Gini index: Gini(t) = 1 - \sum_{i=1}^{c} p(i|t)^2
  • (both measures are sketched in code below)
[Table: Gini index of candidate split positions for a sorted continuous attribute; the class counts on each side of every candidate split are tabulated, and the position with the lowest Gini (0.300 in this example) is chosen.]
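A small illustrative sketch of the two impurity measures above, assuming the per-class record counts at a node are given as a list; the example numbers are the homogeneous/non-homogeneous nodes from the earlier slide.

```python
import math

def entropy(counts):
    """Entropy(t) from a list of per-class record counts at node t."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def gini(counts):
    """Gini(t) from a list of per-class record counts at node t."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(entropy([5, 5]), gini([5, 5]))   # maximally impure two-class node: 1.0, 0.5
print(entropy([9, 1]), gini([9, 1]))   # nearly pure node: ~0.469, 0.18
```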
Splitting based on impurity
• Impurity measures favor attributes with a large number of values
• The gain ratio corrects for this by normalizing the gain by the split information, penalizing splits into many small partitions (see the sketch below):

  GainRATIO_split = GAIN_split / SplitINFO,   where SplitINFO = -\sum_{i=1}^{k} (n_i / n) \log(n_i / n)

  (k partitions; n_i records in partition i; n records at the parent node)
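A sketch of the gain ratio under the assumption that the parent node's class counts and each child partition's class counts are known; the example counts are made up to show how SplitINFO penalizes a split into many tiny partitions.

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def gain_ratio(parent, children):
    """parent: class counts at the node; children: class counts of each partition."""
    n = sum(parent)
    gain = entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)
    split_info = -sum(sum(ch) / n * math.log2(sum(ch) / n) for ch in children)
    return gain / split_info if split_info > 0 else 0.0

# Two splits that both separate the classes perfectly: 2 large partitions vs. 10 tiny ones.
print(gain_ratio([10, 10], [[10, 0], [0, 10]]))           # gain 1.0, SplitINFO 1.0  -> 1.0
print(gain_ratio([10, 10], [[2, 0]] * 5 + [[0, 2]] * 5))  # gain 1.0, SplitINFO ~3.32 -> ~0.30
```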
Decision Boundary

[Figure: a two-dimensional dataset on the unit square, classified by a tree that first tests x < 0.43 and then tests y (e.g., y < 0.33) on each branch; the induced class regions are axis-parallel rectangles.]

• The border line between two neighboring regions of different classes is known as the decision boundary
• The decision boundary is parallel to the axes because each test condition involves a single attribute at a time
Expressiveness
• Decision trees provide an expressive representation for learning discrete-valued functions
• But they do not generalize well to certain types of Boolean functions
  • Example: the parity function (illustrated in the sketch below):
    • Class = 1 if there is an even number of Boolean attributes with truth value = True
    • Class = 0 if there is an odd number of Boolean attributes with truth value = True
  • For accurate modeling, the tree must be complete (one leaf per combination of attribute values)
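A tiny illustration of the parity labeling described above, assuming d Boolean attributes; because no single attribute value determines the class, an exact tree needs a leaf for each of the 2^d combinations.

```python
from itertools import product

# Enumerate all assignments of d Boolean attributes and label them with the parity rule.
d = 3
for bits in product([False, True], repeat=d):
    label = 1 if sum(bits) % 2 == 0 else 0   # Class = 1 iff an even number of attributes are True
    print(bits, '->', label)
```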
[Figure: a decision boundary of the form x + y < 1 separating Class = + from Class = −; such an oblique boundary cannot be captured by a single axis-parallel test.]
Underfitting and Overfitting (Example)
Circular points: 0.5 ≤ sqrt(x1² + x2²) ≤ 1
Triangular points: sqrt(x1² + x2²) > 1 or sqrt(x1² + x2²) < 0.5
Underfitting and Overfitting
[Figure: training and test error versus the number of tree nodes, with the underfitting and overfitting regimes marked.]
• Underfitting: when the model is too simple, both training and test errors are large
• Overfitting: when the model is too complex, it fits the details of the training set and fails on the test set
Overfitting due to Noise
Overfitting due to Insufficient Examples
• Lack of data points in the lower half of the diagram makes it difficult to predict correctly the class labels of that region
• The insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task
Notes on Overfitting
• Overfitting results in decision trees that are more complex than necessary
• Pessimistic estimate of generalization error (sketched in code below):
  • For each leaf node, add a penalty of 0.5 errors: e'(t) = e(t) + 0.5
  • Total errors: e'(T) = e(T) + N × 0.5 (N: number of leaf nodes)
  • This penalizes large trees
  • Example: for a tree with 30 leaf nodes and 10 errors on training data (out of 1000 instances):
    • Training error = 10/1000 = 1%
    • Generalization error = (10 + 30 × 0.5)/1000 = 2.5%
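A one-function sketch of the pessimistic estimate above, reproducing the slide's arithmetic.

```python
def pessimistic_error(train_errors, num_leaves, num_instances):
    """Pessimistic generalization-error estimate: charge each leaf an extra 0.5 errors."""
    return (train_errors + 0.5 * num_leaves) / num_instances

# The slide's example: 10 training errors, 30 leaves, 1000 instances.
print(pessimistic_error(10, 30, 1000))   # 0.025, i.e. 2.5%
```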
Metrics for Performance Evaluation

Confusion matrix:

                        PREDICTED Class=Yes    PREDICTED Class=No
ACTUAL  Class=Yes            a (TP)                 b (FN)
ACTUAL  Class=No             c (FP)                 d (TN)

a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)
Metrics for Performance Evaluation…
                        PREDICTED Class=Yes    PREDICTED Class=No
ACTUAL  Class=Yes            a (TP)                 b (FN)
ACTUAL  Class=No             c (FP)                 d (TN)

Weighted Accuracy = (w1·a + w4·d) / (w1·a + w2·b + w3·c + w4·d)

Two example confusion matrices (rows: actual class, columns: predicted class):

Matrix 1:                 Matrix 2:
        +      -                  +      -
  +    150     40           +    250     45
  -     60    250           -      5    200
                        PREDICTED Class=Yes    PREDICTED Class=No
ACTUAL  Class=Yes            a                      b
ACTUAL  Class=No             c                      d

Accuracy = (a + d) / N, where N = a + b + c + d
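A small sketch of accuracy and weighted accuracy from a 2x2 confusion matrix (a = TP, b = FN, c = FP, d = TN), applied to the two example matrices above; the weights in the last call are illustrative, not from the slides.

```python
def accuracy(a, b, c, d):
    """a = TP, b = FN, c = FP, d = TN."""
    return (a + d) / (a + b + c + d)

def weighted_accuracy(a, b, c, d, w1, w2, w3, w4):
    return (w1 * a + w4 * d) / (w1 * a + w2 * b + w3 * c + w4 * d)

# The two example confusion matrices above (rows: actual, columns: predicted).
print(accuracy(150, 40, 60, 250))   # matrix 1: 0.80
print(accuracy(250, 45, 5, 200))    # matrix 2: 0.90
print(weighted_accuracy(150, 40, 60, 250, w1=2, w2=1, w3=1, w4=1))   # illustrative weights
```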
Learning Curve
• Shows how accuracy changes with the size of the training sample
• Requires a sampling schedule for creating the learning curve
• Small samples increase the variance of the estimate
Model Evaluation
• Metrics for Performance Evaluation
  • How to evaluate the performance of a model?
At threshold t: TP = 0.5, FN = 0.5, FP = 0.12, TN = 0.88
ROC Curve
Characteristic points on the curve, given as (TP, FP):
• (0,0): declare everything to be the negative class
• (1,1): declare everything to be the positive class
• (1,0): the ideal point
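A minimal sketch of computing ROC points by sweeping a threshold over classifier scores; the score and label values are made-up examples, and the (score, label) input format is an assumption rather than the lecture's notation.

```python
def roc_points(scores, labels):
    """(FP rate, TP rate) at every threshold t, predicting positive when score >= t."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return [(0.0, 0.0)] + points   # (0,0): everything declared negative

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]   # made-up classifier scores
labels = [1,   1,   0,   1,   0,    0]     # made-up true classes
print(roc_points(scores, labels))          # ends at (1,1): everything declared positive
```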