Liaquat Majeed Sheikh: National University of Computer and Emerging Sciences
[Figure: the classification framework. A learning algorithm induces a Model from a labeled Training Set (columns Tid, Attrib1, Attrib2, Attrib3, Class); the Model is then applied to a Test Set of unlabeled records (e.g., Tid 11: No, Small, 55K, ? and Tid 15: No, Large, 67K, ?) to predict their class.]
Another Example of Decision Tree

Training data (categorical: Refund, Marital Status; continuous: Taxable Income; class: Cheat):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

A tree with MarSt as the first splitting attribute:

MarSt = Married            → NO
MarSt = Single, Divorced   → Refund
    Refund = Yes           → NO
    Refund = No            → TaxInc
        TaxInc < 80K       → NO
        TaxInc ≥ 80K       → YES

There could be more than one tree that fits the same data!
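To make the point concrete, here is a minimal plain-Python sketch: both trees, hand-coded from the figures (the tuple encoding and function names are my own, and nothing here is learned from data), classify all ten training records correctly.

# Training records from the table:
# (Tid, Refund, Marital Status, Taxable Income in K, Cheat).
RECORDS = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]

def tree_marst_first(refund, marst, income):
    """The tree above: MarSt at the root, then Refund, then TaxInc."""
    if marst == "Married":
        return "No"
    if refund == "Yes":
        return "No"
    return "No" if income < 80 else "Yes"

def tree_refund_first(refund, marst, income):
    """An alternative tree: Refund at the root, then MarSt, then TaxInc."""
    if refund == "Yes":
        return "No"
    if marst == "Married":
        return "No"
    return "No" if income < 80 else "Yes"

# Both trees fit the training data perfectly, so the training set
# alone cannot distinguish between them.
for _, refund, marst, income, cheat in RECORDS:
    assert tree_marst_first(refund, marst, income) == cheat
    assert tree_refund_first(refund, marst, income) == cheat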
Decision Tree Classification Task
[Figure: a Tree Induction algorithm learns a Decision Tree from the Training Set; the tree then serves as the Model that is applied to the Test Set (unlabeled records such as Tid 11: No, Small, 55K, ? and Tid 15: No, Large, 67K, ?).]
Apply Model to Test Data

Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Decision tree (the model):

Refund = Yes               → NO
Refund = No                → MarSt
    MarSt = Single, Divorced → TaxInc
        TaxInc < 80K       → NO
        TaxInc ≥ 80K       → YES
    MarSt = Married        → NO

Start from the root of the tree and route the record down the branch that matches each test: Refund = No sends it to the MarSt node, and MarSt = Married reaches the leaf NO. Assign Cheat to “No”.
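A sketch of the traversal the slides step through, assuming a nested-dict tree encoding (the name/test/branches fields and the branch-label strings are illustrative choices, not from the slides):

def classify(node, record, trace=True):
    """Walk from the root until a leaf (a plain string) is reached."""
    while isinstance(node, dict):
        branch = node["test"](record)
        if trace:
            print(node["name"], "->", branch)
        node = node["branches"][branch]
    return node

TREE = {
    "name": "Refund",
    "test": lambda r: r["Refund"],
    "branches": {
        "Yes": "NO",
        "No": {
            "name": "MarSt",
            "test": lambda r: "Married" if r["MarSt"] == "Married" else "Single, Divorced",
            "branches": {
                "Married": "NO",
                "Single, Divorced": {
                    "name": "TaxInc",
                    "test": lambda r: "< 80K" if r["TaxInc"] < 80 else ">= 80K",
                    "branches": {"< 80K": "NO", ">= 80K": "YES"},
                },
            },
        },
    },
}

# The slide's test record: Refund = No, Married, Taxable Income = 80K.
record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
print(classify(TREE, record))  # prints the route, then "NO": assign Cheat = "No"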
Decision Tree Induction
Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ, SPRINT
General Structure of Hunt's Algorithm

Let Dt be the set of training records that reach a node t (illustrated on the slide with the ten-record Cheat training set from above).

General Procedure:
– If Dt contains records that belong to the same class yt, then t is a leaf node labeled as yt
– If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset
[Figure: Hunt's algorithm applied step by step. The tree starts as a single leaf (Don't Cheat), is split on Refund (Yes → Don't Cheat, No → a node that is still mixed), and the Refund = No subtree is then refined further until every leaf is pure.]
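A compact sketch of Hunt's procedure, under simplifying assumptions not stated on the slide: categorical attributes only, multi-way splits, the split attribute is simply the next one available (a full implementation would pick the attribute that optimizes an impurity criterion, as discussed below), and a majority-class leaf when no attributes remain.

from collections import Counter

def hunt(records, labels, attributes):
    """records: list of dicts mapping attribute -> value; labels: class labels."""
    # If Dt contains records of a single class yt, t is a leaf labeled yt.
    if len(set(labels)) == 1:
        return labels[0]
    # No attributes left to test: fall back to a majority-class leaf.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Otherwise split on an attribute and recurse on each subset.
    attr, rest = attributes[0], attributes[1:]
    node = {"attr": attr, "branches": {}}
    for value in {r[attr] for r in records}:
        subset = [(r, y) for r, y in zip(records, labels) if r[attr] == value]
        node["branches"][value] = hunt([r for r, _ in subset],
                                       [y for _, y in subset], rest)
    return node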
Tree Induction

Greedy strategy:
– Split the records based on an attribute test that optimizes a certain criterion

Issues
– Determine how to split the records
    How to specify the attribute test condition?
    How to determine the best split?
Attribute Types

Nominal: The values of a nominal attribute are just different names; they provide only enough information to distinguish one object from another. Operations: =, ≠. Examples: zip codes, employee ID numbers, eye color, sex: {male, female}. Summary statistics: mode, entropy, contingency correlation, χ² test.

Ratio: For ratio attributes, both differences and ratios are meaningful. Operations: *, /. Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current. Summary statistics: geometric mean, harmonic mean, percent variation.
Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection
of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete attributes
Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and represented
using a finite number of digits.
– Continuous attributes are typically represented as floating-point
variables.
Splitting Based on Nominal Attributes

A binary split groups the attribute values into two subsets. What about this split? Size → {Small, Large} vs. {Medium}
Splitting Based on Continuous Attributes

Binary split:    Taxable Income > 80K?  →  Yes / No
Multi-way split: Taxable Income?  →  ranges such as < 10K, …, > 80K (discretize into intervals)
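A small sketch of both test conditions; the interior bin edges are assumptions, since only < 10K and > 80K survive on the slide.

import bisect

def binary_test(income_k):
    # Binary split: Taxable Income > 80K?
    return "Yes" if income_k > 80 else "No"

# Multi-way split via discretization; the interior edges are assumed.
EDGES = [10, 25, 50, 80]
LABELS = ["< 10K", "[10K, 25K)", "[25K, 50K)", "[50K, 80K)", ">= 80K"]

def multiway_test(income_k):
    # bisect_right maps a value to the bin whose range contains it.
    return LABELS[bisect.bisect_right(EDGES, income_k)]

print(binary_test(95))    # Yes
print(multiway_test(9))   # < 10K
print(multiway_test(95))  # >= 80K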
How to Determine the Best Split
Greedy approach:
– Nodes with homogeneous class distribution
are preferred
Need a measure of node impurity:
C0: 5, C1: 5 – non-homogeneous, high degree of impurity
C0: 9, C1: 1 – homogeneous, low degree of impurity
Measures of Node Impurity
Gini Index
Entropy
Misclassification error
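Sketches of the three measures listed above, written from their standard definitions; each function takes the per-class record counts at a node.

from math import log2

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c)

def classification_error(counts):
    n = sum(counts)
    return 1 - max(counts) / n

# The two nodes above: [5, 5] is maximally impure, [9, 1] nearly pure.
print(gini([5, 5]), gini([9, 1]))        # 0.5    0.18
print(entropy([5, 5]), entropy([9, 1]))  # 1.0    0.469
print(classification_error([5, 5]), classification_error([9, 1]))  # 0.5  0.1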
How to Find the Best Split
Before splitting, the node holds N00 records of class C0 and N01 of class C1; its impurity is M0. Candidate test A? produces two children with impurities M1 and M2, which combine (weighted by child size) into M12; candidate test B? likewise yields M3 and M4, with combined impurity M34.

Gain = M0 – M12 vs. M0 – M34: choose the test with the larger gain, i.e., the larger drop in impurity.
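The same computation as a sketch, using Gini as the impurity measure M; the class counts for the parent and for the children of tests A? and B? are made up for illustration.

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def weighted_impurity(children):
    # M12 / M34: child impurities weighted by child size.
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * gini(c) for c in children)

parent = [7, 3]                              # N00 = 7 of C0, N01 = 3 of C1
M0 = gini(parent)                            # 0.42
M12 = weighted_impurity([[3, 0], [4, 3]])    # children of test A?
M34 = weighted_impurity([[4, 1], [3, 2]])    # children of test B?
print(M0 - M12, M0 - M34)                    # ~0.078 vs ~0.02: A? wins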
Measure of Impurity: GINI
Gini index at a node t:

$GINI(t) = 1 - \sum_{j} [\, p(j \mid t) \,]^2$

where p(j | t) is the relative frequency of class j at node t.

For the continuous attribute Taxable Income, sort the values, place candidate split positions at the midpoints between adjacent values (plus one position below the minimum and one above the maximum), and compute the Gini of each binary split “≤ v / > v”:

Split v    55     65     72     80     87     92     97     110    122    172    230
Yes (≤,>)  0,3    0,3    0,3    0,3    1,2    2,1    3,0    3,0    3,0    3,0    3,0
No  (≤,>)  0,7    1,6    2,5    3,4    3,4    3,4    3,4    4,3    5,2    6,1    7,0
Gini       0.420  0.400  0.375  0.343  0.417  0.400  0.300  0.343  0.375  0.400  0.420

The best split is Taxable Income ≤ 97 (Gini = 0.300).
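A sketch that reproduces the table from the ten training records above (note that the slide rounds midpoints such as 72.5 and 97.5 down to 72 and 97):

def two_class_gini(labels):
    if not labels:
        return 0.0
    p = labels.count("Yes") / len(labels)
    return 1 - p ** 2 - (1 - p) ** 2

incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
pairs = sorted(zip(incomes, cheat))

# Candidate positions: midpoints between adjacent sorted values,
# plus one position below the minimum and one above the maximum.
vals = [v for v, _ in pairs]
candidates = ([vals[0] - 5]
              + [(a + b) / 2 for a, b in zip(vals, vals[1:])]
              + [vals[-1] + 10])

def split_gini(v):
    left = [y for x, y in pairs if x <= v]
    right = [y for x, y in pairs if x > v]
    n = len(pairs)
    return (len(left) / n) * two_class_gini(left) \
         + (len(right) / n) * two_class_gini(right)

for v in candidates:
    print(f"{v:>6}: {split_gini(v):.3f}")   # minimum 0.300 at v = 97.5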
Alternative Splitting Criteria based on INFO

Entropy at a node t:

$Entropy(t) = -\sum_{j} p(j \mid t)\, \log_2 p(j \mid t)$

Information Gain:

$GAIN_{split} = Entropy(p) - \sum_{i=1}^{k} \frac{n_i}{n}\, Entropy(i)$

Parent node p is split into k partitions; n_i is the number of records in partition i.

Gain Ratio:

$GainRATIO_{split} = \frac{GAIN_{split}}{SplitINFO}$, where $SplitINFO = -\sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{n}$
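The INFO-based criteria translated directly into code; parent and children hold per-class counts, and the example split reuses the counts from the earlier Gini sketch.

from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

def info_gain(parent, children):
    n = sum(parent)
    return entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)

def split_info(parent, children):
    n = sum(parent)
    return -sum(sum(c) / n * log2(sum(c) / n) for c in children if sum(c))

def gain_ratio(parent, children):
    return info_gain(parent, children) / split_info(parent, children)

parent = [7, 3]
children = [[3, 0], [4, 3]]      # same split as in the Gini example
print(info_gain(parent, children), gain_ratio(parent, children))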
A criterion for attribute selection

Subsets are more likely to be pure if there is a large number of attribute values. So:
– Information gain is biased towards choosing attributes with a large number of values
– This may result in overfitting (selection of an attribute that is non-optimal for prediction)
The gain ratio

“Outlook” still comes out on top, but “Humidity” is now a much closer contender because it splits the data into two subsets instead of three.

However, “ID code” still has a greater gain ratio; its advantage is just greatly reduced, as the sketch below illustrates.
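A numeric illustration, assuming the classic 14-record weather data (9 records of one class, 5 of the other) that the “Outlook” / “Humidity” / “ID code” discussion refers to: a unique-per-record ID code achieves the maximum possible information gain, but SplitINFO sharply reduces its gain ratio.

from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

parent = [9, 5]
singletons = [[1, 0]] * 9 + [[0, 1]] * 5     # one subset per ID value

n = sum(parent)
gain = entropy(parent) - sum(sum(c) / n * entropy(c) for c in singletons)
split_info = -sum(sum(c) / n * log2(sum(c) / n) for c in singletons)

print(gain)               # 0.940... = entropy(parent): the maximum possible
print(gain / split_info)  # ~0.247: the advantage is greatly reduced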
Bankruptcy Example

• Conveniently, all the points with L not greater than 1.5 are of class 0, so we can make a leaf there.
• Now the best split is at R > 0.9, and all the points for which that is true are positive, so we can make another leaf.
Splitting Criteria based on Classification Error

Classification error at a node t:

$Error(t) = 1 - \max_i P(i \mid t)$
Example: splitting on binary attribute A? improves the Gini index.

Parent node: C1 = 7, C2 = 3, Gini = 0.42

A? = Yes → Node N1: C1 = 3, C2 = 0
A? = No  → Node N2: C1 = 4, C2 = 3

Gini(N1) = 1 – (3/3)² – (0/3)² = 0
Gini(N2) = 1 – (4/7)² – (3/7)² = 0.489
Gini(Children) = 3/10 × 0 + 7/10 × 0.489 = 0.342

Gini improves!!
Decision Tree Based Classification
Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Accuracy is comparable to other classification
techniques for many simple data sets
Example: C4.5