Decision Tree Learning
Sunita Sarawagi
IIT Bombay
http://www.it.iitb.ac.in/~sunita
Setting
Given old data about customers and payments, predict a new applicant's loan eligibility.
(Figure: previous customer records with attributes Age, Salary, Profession, Location and the customer type feed a classifier, which produces decision rules such as "Salary > 5 L" and "Prof. = Exec" that label new applicants Good/Bad.)
Decision trees
A tree where internal nodes are simple decision rules on one or more attributes and leaf nodes are predicted class labels.
(Example tree with internal tests Salary < 1 M, Prof = teaching, and Age < 30, and leaves labeled Good/Bad.)
Training Dataset
This follows an example from Quinlan's ID3.
Only the age attribute and the class label buys_computer of the 14 training records are recoverable here (the other attributes of the example are not shown):

age      buys_computer
<=30     no
<=30     no
30..40   yes
>40      yes
>40      yes
>40      no
30..40   yes
<=30     no
<=30     yes
>40      yes
<=30     yes
30..40   yes
30..40   yes
>40      no
(Figure: the resulting decision tree splits on age at the root; the age 30..40 branch is labeled yes, and the age > 40 branch tests credit rating (excellent → no, fair → yes).)
Outlook    Temperature  Humidity  Windy  Play?
sunny      hot          high      false  No
sunny      hot          high      true   No
overcast   hot          high      false  Yes
rain       mild         high      false  Yes
rain       cool         normal    false  Yes
rain       cool         normal    true   No
overcast   cool         normal    true   Yes
sunny      mild         high      false  No
sunny      cool         normal    false  Yes
rain       mild         normal    false  Yes
sunny      mild         normal    true   Yes
overcast   mild         high      true   Yes
overcast   hot          normal    false  Yes
rain       mild         high      true   No
Note: Outlook is the forecast, no relation to the Microsoft email program.
(Resulting tree: Outlook = overcast → Yes; Outlook = sunny → Humidity (high → No, normal → Yes); Outlook = rain → Windy (true → No, false → Yes).)
Topics to be covered
Tree construction:
Basic tree learning algorithm
Measures of predictive ability
High-performance decision tree construction: SPRINT
Tree pruning:
Why prune
Methods of pruning
Other issues:
Handling missing data
Continuous class labels
Effect of training size
Basic tree learning algorithm: Gen_Tree(node, data)
Stop criterion satisfied? If yes: stop.
Selection: find the best attribute and the best split on that attribute (according to the split criteria).
Partition the data on the split condition.
For each child j of the node: Gen_Tree(node_j, data_j).
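A minimal sketch of this recursion in Python (not the original slide's code), assuming records are dicts of categorical attribute values with the class stored under a hypothetical "label" key, and using information gain (defined in the next section) as the selection criterion:

```python
import math
from collections import Counter

def entropy(labels):
    """Impurity of a set of class labels: -sum p_i log2 p_i."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(records, attr, label="label"):
    """Entropy(S) minus the weighted entropy of the partitions induced by attr."""
    labels = [r[label] for r in records]
    parts = {}
    for r in records:
        parts.setdefault(r[attr], []).append(r[label])
    remainder = sum(len(p) / len(records) * entropy(p) for p in parts.values())
    return entropy(labels) - remainder

def gen_tree(records, attrs, label="label"):
    """Recursive Gen_Tree: stop on pure data, else split on the best attribute."""
    if not records:
        return None
    labels = [r[label] for r in records]
    if len(set(labels)) <= 1 or not attrs:           # stopping criterion
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    best = max(attrs, key=lambda a: info_gain(records, a, label))
    children = {}
    for value in set(r[best] for r in records):      # partition on the split condition
        subset = [r for r in records if r[best] == value]
        children[value] = gen_tree(subset, [a for a in attrs if a != best], label)
    return (best, children)                          # internal node
```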
Split criteria
Select the attribute that is best for classification.
Intuitively pick one that best separates instances
of different classes.
Quantifying the intuitive: measuring separability:
First define impurity of an arbitrary set S
consisting of K classes
Smallest when S consists of only one class; highest when all classes are present in equal numbers.
Should allow computation in multiple stages, so that the impurity of a split can be combined from the impurities of its partitions.
Measures of impurity
Entropy:
  Entropy(S) = -\sum_{i=1}^{k} p_i \log p_i
Gini:
  Gini(S) = 1 - \sum_{i=1}^{k} p_i^2
(Figure: entropy and Gini as a function of the class probability p1 for a two-class set; both peak at p1 = 0.5 and vanish for a pure set.)
Information gain of splitting S into subsets S_1, ..., S_m:
  Gain(S) = Entropy(S) - \sum_{j} \frac{|S_j|}{|S|} Entropy(S_j)
witten&eibe
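A small sketch of these two impurity measures, assuming class frequencies are given as a list of counts (illustrative helper names, not from the slides):

```python
import math

def entropy(counts):
    """Entropy(S) = -sum p_i * log2(p_i), with 0*log(0) taken as 0."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def gini(counts):
    """Gini(S) = 1 - sum p_i^2."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# Both are smallest for a pure set and largest for a uniform class mix:
print(entropy([14, 0]), gini([14, 0]))   # 0.0 0.0
print(entropy([7, 7]), gini([7, 7]))     # 1.0 0.5
print(round(entropy([9, 5]), 3))         # 0.940 bits (class entropy of the weather data)
```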
Example: information gain of the weather attributes
Outlook = Overcast:
  info([4,0]) = entropy(1, 0) = -1 log(1) - 0 log(0) = 0 bits
  (Note: log(0) is not defined, but we evaluate 0 * log(0) as zero.)
Outlook = Rainy:
  info([3,2]) = entropy(3/5, 2/5) = -3/5 log(3/5) - 2/5 log(2/5) = 0.971 bits
Information gain for the attributes of the weather data:
  gain("Outlook") = 0.247 bits
  gain("Temperature") = 0.029 bits
  gain("Humidity") = 0.152 bits
  gain("Windy") = 0.048 bits
witten&eibe
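As a check on these numbers, a short self-contained sketch that recomputes the gains from the weather table above (attribute and value strings as in the table; "Play?" is the class):

```python
import math
from collections import Counter

rows = [  # (Outlook, Temperature, Humidity, Windy, Play?)
    ("sunny", "hot", "high", "false", "No"), ("sunny", "hot", "high", "true", "No"),
    ("overcast", "hot", "high", "false", "Yes"), ("rain", "mild", "high", "false", "Yes"),
    ("rain", "cool", "normal", "false", "Yes"), ("rain", "cool", "normal", "true", "No"),
    ("overcast", "cool", "normal", "true", "Yes"), ("sunny", "mild", "high", "false", "No"),
    ("sunny", "cool", "normal", "false", "Yes"), ("rain", "mild", "normal", "false", "Yes"),
    ("sunny", "mild", "normal", "true", "Yes"), ("overcast", "mild", "high", "true", "Yes"),
    ("overcast", "hot", "normal", "false", "Yes"), ("rain", "mild", "high", "true", "No"),
]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(col):
    """Information gain of splitting the weather data on column index col."""
    labels = [r[-1] for r in rows]
    groups = {}
    for r in rows:
        groups.setdefault(r[col], []).append(r[-1])
    return entropy(labels) - sum(len(g) / len(rows) * entropy(g) for g in groups.values())

for i, name in enumerate(["Outlook", "Temperature", "Humidity", "Windy"]):
    print(name, round(gain(i), 3))   # 0.247, 0.029, 0.152, 0.048
```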
Continuing to split
witten&eibe
Highly-branching attributes
Problematic: attributes with a large number
of values (extreme case: ID code)
Subsets are more likely to be pure if there is
a large number of values
Information gain is biased towards choosing
attributes with a large number of values
This may result in overfitting (selection of an
attribute that is non-optimal for prediction)
witten&eibe
Gain ratio
Gain ratio: a modification of the information gain that reduces its bias towards high-branching attributes.
It divides the gain by the intrinsic information of the split, which is
  large when the data is evenly spread over the branches,
  small when all data belong to one branch.
witten&eibe
Gain ratio (Quinlan 86) normalizes the information gain by the intrinsic information:
  IntrinsicInfo(S, A) = -\sum_i \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}
  GainRatio(S, A) = \frac{Gain(S, A)}{IntrinsicInfo(S, A)}
Example:
  gain_ratio("ID_code") = 0.940 bits / 3.807 bits = 0.246
witten&eibe
Gain ratios for the weather data attributes:

             Outlook                     Temperature
Info:        0.693                       0.911
Gain:        0.940 - 0.693 = 0.247       0.940 - 0.911 = 0.029
Split info:  info([5,4,5]) = 1.577       info([4,6,4]) = 1.557
Gain ratio:  0.247/1.577 = 0.156         0.029/1.557 = 0.019

             Humidity                    Windy
Info:        0.788                       0.892
Gain:        0.940 - 0.788 = 0.152       0.940 - 0.892 = 0.048
Split info:  info([7,7]) = 1.000         info([8,6]) = 0.985
Gain ratio:  0.152/1.000 = 0.152         0.048/0.985 = 0.049
witten&eibe
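A sketch of the split-info and gain-ratio computation under the same assumptions (class or branch sizes given as count lists; the ID-code attribute puts each of the 14 records in its own branch):

```python
import math

def info(counts):
    """Entropy of a count vector, in bits."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def gain_ratio(gain, branch_sizes):
    """Gain divided by the intrinsic information of the split."""
    return gain / info(branch_sizes)

print(round(info([5, 4, 5]), 3))                # 1.577 (split info for Outlook)
print(round(gain_ratio(0.247, [5, 4, 5]), 3))   # 0.157 (0.156 in the table above)
print(round(info([1] * 14), 3))                 # 3.807 (ID code: 14 one-record branches)
print(round(gain_ratio(0.940, [1] * 14), 3))    # 0.247 (0.246 on the slide, rounding)
```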
SPRINT
Example
Example data (class label Risk):
  Age: 42, 17, 57, 21, 28, 68
  Car Type: family, truck, sports, sports, family, truck
  Risk: High or Low for each record
Resulting tree: Age < 25 → High; otherwise CarType in {sports} → High, else Low.
Building tree
GrowTree(TrainingData D)
  Partition(D);

Partition(Data D)
  if (all points in D belong to the same class) then
    return;
  for each attribute A do
    evaluate splits on attribute A;
  use the best split found to partition D into D1 and D2;
  Partition(D1);
  Partition(D2);
Example list: the Age attribute list in sorted order.

Age   Risk   RID
17    High   1
20    High   5
23    High   0
32    Low    4
43    High   2
68    Low    3
Attribute lists for the two attributes, each entry carrying the class label and the record id (RID):

Age   Risk   RID        Car Type   Risk   RID
23    High   0          family     High   0
17    High   1          sports     High   1
43    High   2          sports     High   2
68    Low    3          family     Low    3
32    Low    4          truck      Low    4
20    High   5          family     High   5
The same attribute lists after sorting on the attribute value:

Age   Risk   RID        Car Type   Risk   RID
17    High   1          family     High   0
20    High   5          family     High   5
23    High   0          family     Low    3
32    Low    4          sports     High   2
43    High   2          sports     High   1
68    Low    3          truck      Low    4
Evaluating splits on the numeric attribute Age: scan the sorted Age attribute list, moving a cursor one record at a time and maintaining High/Low class histograms for the records below and above the cursor. Each cursor position defines a candidate split:

  Position 0: Age < 17   GINI = undef (one partition is empty)
  Position 1: Age < 20   GINI = 0.4
  Position 3: Age < 32   GINI = 0.222
  (similarly for the remaining positions; GINI = undef again when all records fall on one side)
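A sketch of this scan, assuming the sorted Age list above is held as (age, risk) pairs in memory; it prints the gini index of each candidate split "Age < v":

```python
from collections import Counter

# Sorted Age attribute list: (age, class label)
age_list = [(17, "High"), (20, "High"), (23, "High"),
            (32, "Low"), (43, "High"), (68, "Low")]

def gini(counts):
    n = sum(counts.values())
    return 1.0 - sum((c / n) ** 2 for c in counts.values()) if n else None

below, above = Counter(), Counter(risk for _, risk in age_list)
total = len(age_list)
for age, risk in age_list:
    # Candidate split "Age < age": everything scanned so far goes to the left child.
    nl, nr = sum(below.values()), sum(above.values())
    split = None if nl == 0 or nr == 0 else (nl / total) * gini(below) + (nr / total) * gini(above)
    print(f"Age < {age}: gini =", "undef" if split is None else round(split, 3))
    # Advance the cursor: move this record from the 'above' to the 'below' histogram.
    below[risk] += 1
    above[risk] -= 1

# Output matches the slide: Age < 17: undef, Age < 20: 0.4, Age < 32: 0.222, ...
```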
Evaluating splits on the categorical attribute Car Type: a single scan of the attribute list builds a count matrix (class counts per attribute value); the splitting index for the various candidate subsets is then evaluated from this matrix.

Attribute list (Car Type, Risk, RID): (family, High, 0), (family, High, 5), (family, Low, 3), (sports, High, 2), (sports, High, 1), (truck, Low, 4)

Count matrix:
            High   Low
  family    2      1
  sports    2      0
  truck     0      1

Candidate splits (left child = records satisfying the test, right child = the rest):
  CarType in {family}:  GINI = 0.444
  CarType in {sports}:  GINI = 0.333
  CarType in {truck}:   GINI = 0.267
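A sketch of the subset evaluation from the count matrix above (binary classes High/Low; the helper names are illustrative):

```python
# Count matrix built in one scan of the Car Type attribute list: value -> (High, Low)
counts = {"family": (2, 1), "sports": (2, 0), "truck": (0, 1)}

def gini(high, low):
    n = high + low
    return 1.0 - (high / n) ** 2 - (low / n) ** 2 if n else 0.0

def split_gini(left_values):
    """Weighted gini of the split 'CarType in left_values' vs. the rest."""
    lh = sum(counts[v][0] for v in left_values)
    ll = sum(counts[v][1] for v in left_values)
    rh = sum(h for v, (h, l) in counts.items() if v not in left_values)
    rl = sum(l for v, (h, l) in counts.items() if v not in left_values)
    n = lh + ll + rh + rl
    return (lh + ll) / n * gini(lh, ll) + (rh + rl) / n * gini(rh, rl)

for subset in ({"family"}, {"sports"}, {"truck"}):
    print(subset, round(split_gini(subset), 3))   # 0.444, 0.333, 0.267
```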
Performing the split (best split found: Age < 32)

Attribute lists before the split:

Age   Risk   RID        Car Type   Risk   RID
17    High   1          family     High   0
20    High   5          family     High   5
23    High   0          family     Low    3
32    Low    4          sports     High   2
43    High   2          sports     High   1
68    Low    3          truck      Low    4

Splitting the Age list is a simple scan; while scanning it, build a hash table that maps each RID to the child it goes to:
  0 → Left, 1 → Left, 2 → Right, 3 → Right, 4 → Right, 5 → Left

Left child (Age < 32):          Right child (Age >= 32):
Age   Risk   RID                Age   Risk   RID
17    High   1                  32    Low    4
20    High   5                  43    High   2
23    High   0                  68    Low    3

The Car Type list (and every other attribute list) is split by probing the hash table with each entry's RID:

Left child:                     Right child:
Car Type   Risk   RID           Car Type   Risk   RID
family     High   0             family     Low    3
family     High   5             sports     High   2
sports     High   1             truck      Low    4
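A sketch of this two-phase split, assuming the attribute lists are plain Python lists of (value, risk, rid) tuples (an illustrative in-memory structure, not SPRINT's actual disk-resident lists):

```python
# Attribute lists: (attribute value, class label, record id)
age_list = [(17, "High", 1), (20, "High", 5), (23, "High", 0),
            (32, "Low", 4), (43, "High", 2), (68, "Low", 3)]
car_list = [("family", "High", 0), ("family", "High", 5), ("family", "Low", 3),
            ("sports", "High", 2), ("sports", "High", 1), ("truck", "Low", 4)]

# Phase 1: split the list of the winning attribute (Age < 32) with a simple scan,
# recording in a hash table which child each RID goes to.
side = {}
age_left, age_right = [], []
for age, risk, rid in age_list:
    if age < 32:
        age_left.append((age, risk, rid)); side[rid] = "L"
    else:
        age_right.append((age, risk, rid)); side[rid] = "R"

# Phase 2: split every other attribute list by probing the hash table with the RID.
car_left = [entry for entry in car_list if side[entry[2]] == "L"]
car_right = [entry for entry in car_list if side[entry[2]] == "R"]

print(side)       # {1: 'L', 5: 'L', 0: 'L', 4: 'R', 2: 'R', 3: 'R'}
print(car_left)   # family/0, family/5, sports/1 go to the left child, as on the slide
```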
Sprint: summary
Each node of the decision tree classifier requires examining possible splits on each value of each attribute.
After choosing a split attribute, we need to partition all the data into its subsets.
Need to make this search efficient.
Evaluating splits on numeric attributes: sort on the attribute value and incrementally evaluate the gini index.
Preventing overfitting
A tree T overfits if there is another tree T' that gives higher error on the training data yet gives lower error on unseen data.
An overfitted tree does not generalize to unseen instances.
This happens when the data contains noise or irrelevant attributes and the training size is small.
Overfitting can reduce accuracy drastically: by 10-25%, as reported in Mingers (1989), Machine Learning.
(Error vs. tree size: R(T) is the training (resubstitution) error, Rts(T) the test-sample error; ** marks the size with the lowest test error.)

No. Terminal Nodes   R(T)   Rts(T)
71                   .00    .42
63                   .00    .40
58                   .03    .39
40                   .10    .32
34                   .12    .32
19                   .20    .31
**10                 .29    .30
9                    .32    .34
7                    .41    .47
6                    .46    .54
5                    .53    .61
2                    .75    .82
1                    .86    .91
Overfitting example
Consider the case where a single attribute x_j is adequate for classification, but with an error of 20%.
Consider lots of other noise attributes that enable zero error during training.
This detailed tree, during testing, will have an expected error of (0.8*0.2 + 0.2*0.8) = 32%, whereas the pruned tree with only a single split on x_j will have an error of only 20%.
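One way to spell out the 32%: the overfitted leaves memorize training labels; a memorized label deviates from the x_j-optimal prediction with probability 0.2, and the test instance's true label deviates from that prediction with probability 0.2, independently, so a test error occurs when exactly one of the two deviates:

```latex
P(\text{test error}) = \underbrace{0.8 \cdot 0.2}_{\text{label kept, test flipped}}
                     + \underbrace{0.2 \cdot 0.8}_{\text{label flipped, test kept}}
                     = 0.32
```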
Approaches to prevent overfitting
Two approaches:
1. Stop growing the tree beyond a certain point.
   Tricky, since even when the information gain is zero an attribute might be useful (XOR example).
2. Grow the full tree and then prune it back to the right size:
   Three criteria:
   - Cross validation with separate test data.
   - Statistical bounds: use all data for training but apply a statistical test to decide the right size (a cross-validation dataset may be used to set the threshold).
   - Use some criterion function to choose the best size. Example: Minimum Description Length (MDL) criterion.
Cross validation
Partition the dataset into two disjoint parts:
1. Training set: used for building the tree.
2. Validation set: used for pruning the tree: starting from the bottom, prune a node (replace it by a leaf) if the leaf's validation error is no higher than that of its children.
(Figure: candidate pruned trees with 24, 21, 20, and 18 terminal nodes.)
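A minimal sketch of this bottom-up pruning, assuming the tree representation from the earlier gen_tree sketch (leaves are class labels, internal nodes are (attribute, {value: child}) pairs) and a held-out validation set of record dicts:

```python
from collections import Counter

def majority(records, label="label"):
    """Majority class of a set of records."""
    return Counter(r[label] for r in records).most_common(1)[0][0]

def classify(tree, record, default):
    """Walk from the root to a leaf; fall back to default on unseen values."""
    while isinstance(tree, tuple):
        attr, children = tree
        if record[attr] not in children:
            return default
        tree = children[record[attr]]
    return tree

def val_errors(tree, records, label="label"):
    default = majority(records, label)
    return sum(classify(tree, r, default) != r[label] for r in records)

def prune(tree, records, label="label"):
    """Bottom-up reduced-error pruning driven by the validation records."""
    if not isinstance(tree, tuple) or not records:
        return tree
    attr, children = tree
    kept = (attr, {v: prune(sub, [r for r in records if r[attr] == v], label)
                   for v, sub in children.items()})
    leaf = majority(records, label)
    # Replace the node by a leaf if the leaf does no worse on the validation data.
    return leaf if val_errors(leaf, records, label) <= val_errors(kept, records, label) else kept
```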
(Example: records with a packets field, a protocol field (http/ftp), and a size-like field (values 20K-300K), and candidate trees built from tests such as Packets > 10, Protocol = ftp, and Protocol = http.)
Encoding data
Assume t records of training data D.
First send the tree m using L(m|M) bits.
Assume everything but the class labels of the training data is already known to the receiver.
Goal: transmit the class labels using L(D|m) bits.
If the tree correctly predicts an instance: 0 bits.
Otherwise: log k bits, where k is the number of classes.
Thus, with e errors on the training data, the total cost is e log k + L(m|M) bits.
A complex tree will have higher L(m|M) but lower e.
Question: how to encode the tree?
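A toy sketch of comparing trees by this cost, assuming some encoding length L(m|M) for each tree is available (the numbers below are hypothetical, just to illustrate the trade-off):

```python
import math

def mdl_cost(train_errors, n_classes, tree_bits):
    """Total description length: each misclassified label costs log2(k) bits,
    plus the bits needed to transmit the tree itself."""
    return train_errors * math.log2(n_classes) + tree_bits

# Complex tree: 0 training errors but a long encoding vs.
# small tree: a few errors but a short encoding (2 classes).
print(mdl_cost(train_errors=0, n_classes=2, tree_bits=60))   # 60.0 bits
print(mdl_cost(train_errors=5, n_classes=2, tree_bits=20))   # 25.0 bits -> preferred
```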
(Figure: rules of the form "IF age = ... THEN ..." read off the leaves of the tree.)
Rule-based pruning
Tree-based pruning limits the kind of pruning possible: if a node is pruned, all subtrees under it have to be pruned.
Rule-based: for each leaf of the tree, extract a rule using a conjunction of all tests on the path from the root to that leaf.
On the validation set, independently prune tests from each rule to get the highest accuracy for that rule.
Sort rules by decreasing accuracy.
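A sketch of this per-rule pruning, assuming a rule is a list of (attribute, value) tests with a predicted class, and accuracy is measured on validation records (dicts); the helper names are illustrative:

```python
def rule_accuracy(tests, prediction, records, label="label"):
    """Accuracy of 'IF all tests hold THEN prediction' on the records it covers."""
    covered = [r for r in records if all(r[a] == v for a, v in tests)]
    if not covered:
        return 0.0
    return sum(r[label] == prediction for r in covered) / len(covered)

def prune_rule(tests, prediction, records, label="label"):
    """Greedily drop tests as long as validation accuracy does not decrease."""
    best = list(tests)
    improved = True
    while improved and best:
        improved = False
        acc = rule_accuracy(best, prediction, records, label)
        for i in range(len(best)):
            shorter = best[:i] + best[i + 1:]
            if rule_accuracy(shorter, prediction, records, label) >= acc:
                best, improved = shorter, True
                break
    return best
```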
Regression trees
Decision trees with continuous class labels.
Regression trees approximate the function with piece-wise constant regions.
Split criteria for regression trees:
  Predicted value for a set S = average of all values in S.
  Error: sum of squared errors of each member of S from the predicted average.
  Pick the split with the smallest total squared error over the resulting partitions.
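A sketch of this split criterion for a single numeric attribute, assuming (x, y) pairs with a continuous target y:

```python
def sse(ys):
    """Sum of squared errors from the mean prediction (the region's constant value)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_split(points):
    """Try every threshold between consecutive x values; pick the one giving the
    smallest total squared error of the two resulting constant regions."""
    pts = sorted(points)
    best = (float("inf"), None)
    for i in range(1, len(pts)):
        threshold = (pts[i - 1][0] + pts[i][0]) / 2
        left = [y for x, y in pts if x < threshold]
        right = [y for x, y in pts if x >= threshold]
        best = min(best, (sse(left) + sse(right), threshold))
    return best   # (error, threshold)

# e.g. best_split([(1, 2.0), (2, 2.1), (3, 8.0), (4, 8.2)]) splits near x = 2.5
```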
Issues
Methods of handling missing values
assume majority value
take most probable path