
Decision tree learning

Sunita Sarawagi
IIT Bombay
http://www.it.iitb.ac.in/~sunita

Copyright 2001, Andrew W. Moore

Decision tree classifiers

Widely used learning method.
Easy to interpret: can be re-represented as if-then-else rules.
Approximates the target function by piecewise constant regions.
Does not require any prior knowledge of the data distribution; works well on noisy data.
Has been applied to:
  classifying medical patients based on their disease,
  diagnosing equipment malfunction by cause,
  rating loan applicants by likelihood of payment,
  lots and lots of other applications.

Setting

Given old data about customers and their payments, predict a new applicant's loan eligibility.

Previous customers (Age, Salary, Profession, Location) and their known customer type are fed to a classifier.
The classifier produces decision rules such as "Salary > 5 L" and "Prof. = Exec".
The rules are applied to a new applicant's data to label the applicant Good or Bad.

Decision trees

A tree where internal nodes are simple decision rules on one or more attributes and leaf nodes are predicted class labels.

Example: the root tests Salary < 1 M; one branch then tests Prof = teaching, the other tests Age < 30, and each of those tests leads to a Good or Bad leaf.

Training Dataset

(This follows an example from Quinlan's ID3.)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no

Output: A Decision Tree for buys_computer

age?
  <=30:   student?
            no  -> no
            yes -> yes
  31..40: yes
  >40:    credit_rating?
            excellent -> no
            fair      -> yes

Weather Data: Play or not Play?

Outlook   Temperature  Humidity  Windy  Play?
sunny     hot          high      false  No
sunny     hot          high      true   No
overcast  hot          high      false  Yes
rain      mild         high      false  Yes
rain      cool         normal    false  Yes
rain      cool         normal    true   No
overcast  cool         normal    true   Yes
sunny     mild         high      false  No
sunny     cool         normal    false  Yes
rain      mild         normal    false  Yes
sunny     mild         normal    true   Yes
overcast  mild         high      true   Yes
overcast  hot          normal    false  Yes
rain      mild         high      true   No

Note: Outlook is the weather forecast; no relation to the Microsoft email program.

Example Tree for Play?

Outlook
  sunny:    Humidity
              high   -> No
              normal -> Yes
  overcast: Yes
  rain:     Windy
              true  -> No
              false -> Yes

Topics to be covered
Tree construction:
Basic tree learning algorithm
Measures of predictive ability
High performance decision tree construction: Sprint

Tree pruning:
Why prune
Methods of pruning

Other issues:
Handling missing data
Continuous class labels
Effect of training size

Tree learning algorithms

ID3 (Quinlan 1986)


Successor C4.5 (Quinlan 1993)
CART
SLIQ (Mehta et al)
SPRINT (Shafer et al)

Basic algorithm for tree building

Greedy top-down construction:

Gen_Tree(node, data)
  if the stopping criterion holds, make the node a leaf and stop;
  otherwise
    find the best attribute and the best split on that attribute   (selection criteria)
    partition the data on the split condition
    for each child j of the node: Gen_Tree(node_j, data_j)
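A minimal Python sketch (not part of the original slides) of this greedy Gen_Tree procedure, using entropy-based information gain as the selection criterion; the names build_tree and best_split and the dict-of-rows data layout are illustrative assumptions.

import math
from collections import Counter

def entropy(labels):
    # impurity of a list of class labels
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(rows, attrs, target):
    # pick the attribute whose (multi-way) split gives the largest information gain
    base = entropy([r[target] for r in rows])
    best = None
    for a in attrs:
        parts = {}
        for r in rows:
            parts.setdefault(r[a], []).append(r[target])
        gain = base - sum(len(p) / len(rows) * entropy(p) for p in parts.values())
        if best is None or gain > best[0]:
            best = (gain, a)
    return best[1]

def build_tree(rows, attrs, target):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1 or not attrs:               # make node a leaf?
        return Counter(labels).most_common(1)[0][0]      # majority class label
    a = best_split(rows, attrs, target)                  # selection criterion
    children = {}
    for r in rows:                                       # partition data on the split
        children.setdefault(r[a], []).append(r)
    rest = [x for x in attrs if x != a]
    return (a, {v: build_tree(sub, rest, target) for v, sub in children.items()})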

Split criteria

Select the attribute that is best for classification.
Intuitively, pick the one that best separates instances of different classes.
Quantifying the intuition means measuring separability: first define the impurity of an arbitrary set S consisting of K classes.
Impurity should be smallest when S consists of only one class and highest when all classes are present in equal number,
and it should allow computation in multiple stages.

Measures of impurity

Entropy:
  Entropy(S) = - sum_{i=1..k} p_i log p_i

Gini:
  Gini(S) = 1 - sum_{i=1..k} p_i^2

(For two classes, both measures are 0 for a pure set and maximal, 0.5 for Gini, when p1 = 0.5; the slide plots both as a function of p1.)

Information gain on partitioning S into r subsets:
  Gain = Impurity(S) - sum of the weighted impurities of the subsets

  Gain(S, S_1..S_r) = Entropy(S) - sum_{j=1..r} (|S_j| / |S|) Entropy(S_j)
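A small Python sketch (added here, not from the deck) of these impurity measures and the gain formula; the proportion- and count-based inputs are an assumption of the sketch.

import math

def entropy(p):
    # p: list of class proportions
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def gini(p):
    return 1 - sum(pi * pi for pi in p)

def information_gain(parent_counts, subset_counts):
    # Gain(S, S1..Sr) = Entropy(S) - sum_j |Sj|/|S| * Entropy(Sj)
    n = sum(parent_counts)
    props = lambda counts: [c / sum(counts) for c in counts]
    before = entropy(props(parent_counts))
    after = sum(sum(c) / n * entropy(props(c)) for c in subset_counts)
    return before - after

# impurity is largest for an even mix and shrinks as the set becomes purer:
print(entropy([0.5, 0.5]), gini([0.5, 0.5]))   # 1.0  0.5
print(entropy([0.9, 0.1]), gini([0.9, 0.1]))   # ~0.469  ~0.18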

*Properties of the entropy

The multistage property:
  entropy(p, q, r) = entropy(p, q+r) + (q+r) * entropy(q/(q+r), r/(q+r))

Simplification of computation:
  info([2,3,4]) = -2/9 log(2/9) - 3/9 log(3/9) - 4/9 log(4/9)

witten&eibe

Information gain: example

K = 2, |S| = 100, p1 = 0.6, p2 = 0.4
  E(S) = -0.6 log(0.6) - 0.4 log(0.4) = 0.29

S is partitioned into S1 and S2:
|S1| = 70, p1 = 0.8, p2 = 0.2
  E(S1) = -0.8 log(0.8) - 0.2 log(0.2) = 0.21
|S2| = 30, p1 = 0.13, p2 = 0.87
  E(S2) = -0.13 log(0.13) - 0.87 log(0.87) = 0.16

Information gain: E(S) - (0.7 E(S1) + 0.3 E(S2)) = 0.1
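The numbers above can be reproduced with a few lines of Python; base-10 logarithms appear to have been used (an assumption, since base-2 would give 0.97 for E(S)), and the slide's 0.21, 0.16 and 0.1 are slightly rounder than the computed values.

import math

def H(ps):
    # entropy with base-10 logs, as the slide's numbers suggest
    return -sum(p * math.log10(p) for p in ps if p > 0)

E_S  = H([0.6, 0.4])     # ~0.292
E_S1 = H([0.8, 0.2])     # ~0.217
E_S2 = H([0.13, 0.87])   # ~0.168
gain = E_S - (0.7 * E_S1 + 0.3 * E_S2)
print(round(E_S, 2), round(E_S1, 2), round(E_S2, 2), round(gain, 2))   # 0.29 0.22 0.17 0.09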

Weather Data: Play or not Play?

(The same 14-row weather table shown earlier is repeated on this slide.)

Which attribute to select?

witten&eibe

Example: attribute Outlook

Outlook = Sunny:
  info([2,3]) = entropy(2/5, 3/5) = -2/5 log(2/5) - 3/5 log(3/5) = 0.971 bits

Outlook = Overcast:
  info([4,0]) = entropy(1, 0) = -1 log(1) - 0 log(0) = 0 bits
  (Note: log(0) is not defined, but we evaluate 0*log(0) as zero.)

Outlook = Rainy:
  info([3,2]) = entropy(3/5, 2/5) = -3/5 log(3/5) - 2/5 log(2/5) = 0.971 bits

Expected information for the attribute:
  info([2,3], [4,0], [3,2]) = (5/14) * 0.971 + (4/14) * 0 + (5/14) * 0.971 = 0.693 bits

witten&eibe

Computing the information gain

Information gain = (information before split) - (information after split):
  gain("Outlook") = info([9,5]) - info([2,3], [4,0], [3,2]) = 0.940 - 0.693 = 0.247 bits

Information gain for the attributes of the weather data:
  gain("Outlook")     = 0.247 bits
  gain("Temperature") = 0.029 bits
  gain("Humidity")    = 0.152 bits
  gain("Windy")       = 0.048 bits
witten&eibe
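These four gains can be recomputed directly from the 14-row weather table; the short Python sketch below (an addition, using base-2 logs) prints 0.247, 0.029, 0.152 and 0.048.

import math
from collections import Counter

rows = [  # (Outlook, Temperature, Humidity, Windy, Play)
    ("sunny", "hot", "high", False, "No"), ("sunny", "hot", "high", True, "No"),
    ("overcast", "hot", "high", False, "Yes"), ("rain", "mild", "high", False, "Yes"),
    ("rain", "cool", "normal", False, "Yes"), ("rain", "cool", "normal", True, "No"),
    ("overcast", "cool", "normal", True, "Yes"), ("sunny", "mild", "high", False, "No"),
    ("sunny", "cool", "normal", False, "Yes"), ("rain", "mild", "normal", False, "Yes"),
    ("sunny", "mild", "normal", True, "Yes"), ("overcast", "mild", "high", True, "Yes"),
    ("overcast", "hot", "normal", False, "Yes"), ("rain", "mild", "high", True, "No"),
]
attrs = ["Outlook", "Temperature", "Humidity", "Windy"]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, i):
    groups = {}
    for r in rows:
        groups.setdefault(r[i], []).append(r[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([r[-1] for r in rows]) - remainder

for i, a in enumerate(attrs):
    print(a, round(gain(rows, i), 3))   # Outlook 0.247, Temperature 0.029, Humidity 0.152, Windy 0.048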

Continuing to split

Within the Outlook = sunny branch:
  gain("Temperature") = 0.571 bits
  gain("Humidity")    = 0.971 bits
  gain("Windy")       = 0.020 bits


witten&eibe

The final decision tree

Note: not all leaves need to be pure; sometimes identical instances have different classes.
Splitting stops when the data can't be split any further.

witten&eibe

Highly-branching attributes

Problematic: attributes with a large number of values (extreme case: an ID code).
Subsets are more likely to be pure if there is a large number of values,
so information gain is biased towards choosing attributes with many values.
This may result in overfitting (selection of an attribute that is non-optimal for prediction).

witten&eibe

Weather Data with ID code

(The same 14-row weather table, extended with an ID code attribute that takes a distinct value for every row.)

Split for ID Code Attribute

The entropy of the split is 0, since each leaf node is pure (it contains only one case).
Information gain is therefore maximal for the ID code.

witten&eibe

Gain ratio

Gain ratio: a modification of the information gain that reduces its bias towards highly-branching attributes.
Gain ratio should be:
large when data is evenly spread across the branches,
small when all data belong to one branch.

Gain ratio takes the number and size of branches into account when choosing an attribute:
it corrects the information gain by taking the intrinsic information of a split into account
(i.e. how much information do we need to tell which branch an instance belongs to).

witten&eibe

Gain Ratio and Intrinsic Info.

Intrinsic information: the entropy of the distribution of instances into branches:
  IntrinsicInfo(S, A) = - sum_i (|S_i| / |S|) log2(|S_i| / |S|)

Gain ratio (Quinlan 86) normalizes the information gain by it:
  GainRatio(S, A) = Gain(S, A) / IntrinsicInfo(S, A)

Computing the gain ratio

Example: intrinsic information for ID code (14 branches with one instance each):
  info([1,1,...,1]) = 14 * (-1/14 log(1/14)) = 3.807 bits

The importance of an attribute decreases as its intrinsic information gets larger.

Gain ratio:
  gain_ratio("Attribute") = gain("Attribute") / intrinsic_info("Attribute")

Example:
  gain_ratio("ID_code") = 0.940 bits / 3.807 bits = 0.246

witten&eibe
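A few lines of Python (an added sketch, base-2 logs) reproduce the intrinsic-information and gain-ratio numbers used here and on the next slide.

import math

def intrinsic_info(branch_sizes):
    n = sum(branch_sizes)
    return -sum(s / n * math.log2(s / n) for s in branch_sizes if s)

def gain_ratio(gain, branch_sizes):
    return gain / intrinsic_info(branch_sizes)

print(round(intrinsic_info([1] * 14), 3))       # 3.807  (ID code: 14 singleton branches)
print(round(gain_ratio(0.940, [1] * 14), 3))    # ~0.247 (the slide rounds to 0.246)
print(round(intrinsic_info([5, 4, 5]), 3))      # 1.577  (Outlook: branches of size 5, 4, 5)
print(round(gain_ratio(0.247, [5, 4, 5]), 3))   # ~0.157 (0.156 when unrounded gains are used)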

Gain ratios for weather data

Attribute    Info   Gain                    Split info              Gain ratio
Outlook      0.693  0.940 - 0.693 = 0.247   info([5,4,5]) = 1.577   0.247 / 1.577 = 0.156
Temperature  0.911  0.940 - 0.911 = 0.029   info([4,6,4]) = 1.557   0.029 / 1.557 = 0.019
Humidity     0.788  0.940 - 0.788 = 0.152   info([7,7])   = 1.000   0.152 / 1.000 = 0.152
Windy        0.892  0.940 - 0.892 = 0.048   info([8,6])   = 0.985   0.048 / 0.985 = 0.049

witten&eibe

More on the gain ratio

"Outlook" still comes out top.
However, "ID code" has an even greater gain ratio.
Standard fix: an ad hoc test to prevent splitting on that type of attribute.

Problem with the gain ratio: it may overcompensate.
It may choose an attribute just because its intrinsic information is very low.
Standard fix:
First, only consider attributes with greater-than-average information gain.
Then, compare them on gain ratio.

witten&eibe

SPRINT (Serial PaRallelizable INduction of decision Trees)

A decision-tree classifier for data mining.

Design goals:
Able to handle large, disk-resident training sets
No restrictions on training-set size
Easily parallelizable

Example

Example data:

Age  Car Type  Risk
42   family    Low
17   truck     High
57   sports    High
21   sports    High
28   family    Low
68   truck     Low

Induced tree:
  Age < 25:  yes -> High
             no  -> CarType in {sports}:  yes -> High
                                          no  -> Low

Building tree

GrowTree(TrainingData D)
  Partition(D);

Partition(Data D)
  if (all points in D belong to the same class) then
    return;
  for each attribute A do
    evaluate splits on attribute A;
  use the best split found to partition D into D1 and D2;
  Partition(D1);
  Partition(D2);

Evaluating Split Points

Gini index: if data D contains examples from c classes,
  Gini(D) = 1 - sum_j pj^2
where pj is the relative frequency of class j in D.

If D is split into D1 and D2 with n1 and n2 tuples respectively:
  Gini_split(D) = (n1/n) * Gini(D1) + (n2/n) * Gini(D2)

Note: only class frequencies are needed to compute the index.
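A short Python sketch (added) of this Gini split index; the example numbers come from the Age attribute list used on the later slides.

def gini(class_counts):
    n = sum(class_counts)
    return 1.0 - sum((c / n) ** 2 for c in class_counts)

def gini_split(counts_d1, counts_d2):
    # weighted Gini of the two partitions D1 and D2
    n1, n2 = sum(counts_d1), sum(counts_d2)
    n = n1 + n2
    return (n1 / n) * gini(counts_d1) + (n2 / n) * gini(counts_d2)

# split "Age < 32" on the Age list below: left child [3 High, 0 Low], right child [1 High, 2 Low]
print(round(gini_split([3, 0], [1, 2]), 3))   # 0.222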

Finding Split Points


For each attribute A do
evaluate splits on attribute A using attribute list

Keep split with lowest GINI index

Split Points: Continuous Attrib.

Consider splits of the form: value(A) < x
Example: Age < 17
Evaluate this split form for every value in the attribute list.

To evaluate splits on attribute A for a given tree node:
  initialize the class histogram of the left child to zeroes;
  initialize the class histogram of the right child to the same counts as its parent;
  for each record in the attribute list do
    evaluate the splitting index for value(A) < record.value;
    using the class label of the record, update the class histograms;
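A Python sketch (added) of this incremental scan over one sorted numeric attribute list; the function and variable names are illustrative.

from collections import Counter

def gini(counts):
    n = sum(counts.values())
    return 1.0 - sum((c / n) ** 2 for c in counts.values()) if n else None

def best_numeric_split(attr_list, classes=("High", "Low")):
    # attr_list: list of (value, class_label, record_id), sorted on value
    left = Counter({c: 0 for c in classes})              # left histogram starts at zero
    right = Counter(lbl for _, lbl, _ in attr_list)      # right histogram starts as the parent
    n = len(attr_list)
    best = None
    for value, label, _ in attr_list:
        g_left, g_right = gini(left), gini(right)
        if g_left is not None and g_right is not None:   # skip splits with an empty side
            split_gini = (sum(left.values()) / n) * g_left + (sum(right.values()) / n) * g_right
            if best is None or split_gini < best[0]:
                best = (split_gini, value)               # candidate test: attr < value
        left[label] += 1                                 # move the record to the left side
        right[label] -= 1
    return best

age_list = [(17, "High", 1), (20, "High", 5), (23, "High", 0),
            (32, "Low", 4), (43, "High", 2), (68, "Low", 3)]
print(best_numeric_split(age_list))   # (0.222..., 32): the best test is Age < 32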

Data Setup: Attribute Lists

One list for each attribute.
Entries in an attribute list consist of:
  attribute value
  class value
  record id

Lists for continuous attributes are kept in sorted order.
Lists may be disk-resident.

Example list (Age, in sorted order):

Age  Risk  RID
17   High  1
20   High  5
23   High  0
32   Low   4
43   High  2
68   Low   3

Each leaf node has its own set of attribute lists representing the training examples belonging to that leaf.

Attribute Lists: Example

Training data (Age, Car Type, Risk) with record ids 0-5:

RID  Age  Car Type  Risk
0    23   family    High
1    17   sports    High
2    43   sports    High
3    68   family    Low
4    32   truck     Low
5    20   family    High

Initial attribute lists for the root node:

Age  Risk  RID        Car Type  Risk  RID
17   High  1          family    High  0
20   High  5          family    High  5
23   High  0          family    Low   3
32   Low   4          sports    High  2
43   High  2          sports    High  1
68   Low   3          truck     Low   4

Split Points: Continuous Attrib. (example)

The sorted Age attribute list is scanned with a cursor; at each position the class
histograms (High/Low counts) of the two candidate children are updated and the GINI
index of the corresponding split is evaluated:

Age  Risk  RID
17   High  1
20   High  5
23   High  0
32   Low   4
43   High  2
68   Low   3

Cursor position 0, Age < 17:  left = [0 High, 0 Low], right = [4 High, 2 Low]   GINI = undef
Cursor position 1, Age < 20:  left = [1 High, 0 Low], right = [3 High, 2 Low]   GINI = 0.4
Cursor position 3, Age < 32:  left = [3 High, 0 Low], right = [1 High, 2 Low]   GINI = 0.222
End of scan:                  left = [4 High, 2 Low], right = [0 High, 0 Low]   GINI = undef

Split Points: Categorical Attrib.

Consider splits of the form: value(A) in {x1, x2, ..., xn}
Example: CarType in {family, sports}
Evaluate this split form for subsets of domain(A).

To evaluate splits on attribute A for a given tree node:
  initialize the class/value matrix of the node to zeroes;
  for each record in the attribute list do
    increment the appropriate count in the matrix;
  evaluate the splitting index for various subsets using the constructed matrix;
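A Python sketch (added) of this procedure: one pass builds the class/value matrix, after which candidate subsets are scored without touching the data again. The exhaustive subset enumeration is only meant for small attribute domains.

from collections import defaultdict
from itertools import chain, combinations

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def best_categorical_split(attr_list, classes=("High", "Low")):
    # attr_list: list of (value, class_label, record_id)
    matrix = defaultdict(lambda: {c: 0 for c in classes})
    for value, label, _ in attr_list:                    # one scan builds the class/value matrix
        matrix[value][label] += 1
    values = list(matrix)
    n = len(attr_list)
    best = None
    # every non-empty proper subset of the attribute's domain
    subsets = chain.from_iterable(combinations(values, k) for k in range(1, len(values)))
    for subset in subsets:
        left = [sum(matrix[v][c] for v in subset) for c in classes]
        right = [sum(matrix[v][c] for v in values if v not in subset) for c in classes]
        score = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
        if best is None or score < best[0]:
            best = (score, set(subset))
    return best

car_list = [("family", "High", 0), ("family", "High", 5), ("family", "Low", 3),
            ("sports", "High", 2), ("sports", "High", 1), ("truck", "Low", 4)]
print(best_categorical_split(car_list))   # (0.266..., {'truck'}): matches the lowest GINI of 0.267 below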

Finding Split Points: Categorical Attrib.

Attribute list (Car Type):      Class/value matrix:

Car Type  Risk  RID                       High  Low
family    High  0               family    2     1
family    High  5               sports    2     0
family    Low   3               truck     0     1
sports    High  2
sports    High  1
truck     Low   4

Candidate splits and their GINI indices:
  CarType in {family}   GINI = 0.444
  CarType in {sports}   GINI = 0.333
  CarType in {truck}    GINI = 0.267

Performing the Splits

The attribute lists of every node must be divided among the two children.

To split the attribute lists of a given node:
  for the list of the attribute used to split this node do
    use the split test to divide the records;
    collect the record ids;
  build a hash table from the collected ids;
  for the remaining attribute lists do
    use the hash table to divide each list;
  build class histograms for each new leaf;
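A compact Python sketch (added) of this step, using an in-memory set as the hash table of record ids; the data is the Age / Car Type example from the next slide.

def perform_split(attr_lists, split_attr, test):
    # attr_lists: dict attr -> list of (value, class_label, record_id)
    left_ids = {rid for value, _, rid in attr_lists[split_attr] if test(value)}   # the "hash table"
    left_lists, right_lists = {}, {}
    for attr, entries in attr_lists.items():             # divide every list by probing the id set
        left_lists[attr] = [e for e in entries if e[2] in left_ids]
        right_lists[attr] = [e for e in entries if e[2] not in left_ids]
    return left_lists, right_lists

attr_lists = {
    "Age": [(17, "High", 1), (20, "High", 5), (23, "High", 0),
            (32, "Low", 4), (43, "High", 2), (68, "Low", 3)],
    "CarType": [("family", "High", 0), ("family", "High", 5), ("family", "Low", 3),
                ("sports", "High", 2), ("sports", "High", 1), ("truck", "Low", 4)],
}
left, right = perform_split(attr_lists, "Age", lambda age: age < 32)
print(sorted(rid for _, _, rid in left["CarType"]))   # [0, 1, 5], matching the hash table below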

Performing the Splits: Example

Attribute lists at the node being split:

Age  Risk  RID        Car Type  Risk  RID
17   High  1          family    High  0
20   High  5          family    High  5
23   High  0          family    Low   3
32   Low   4          sports    High  2
43   High  2          sports    High  1
68   Low   3          truck     Low   4

Split test: Age < 32

Hash table built from the Age list:
  RID 0 -> Left, 1 -> Left, 2 -> Right, 3 -> Right, 4 -> Right, 5 -> Left

Left child attribute lists:                 Right child attribute lists:

Age  Risk  RID    Car Type  Risk  RID       Age  Risk  RID    Car Type  Risk  RID
17   High  1      family    High  0         32   Low   4      family    Low   3
20   High  5      family    High  5         43   High  2      sports    High  2
23   High  0      sports    High  1         68   Low   3      truck     Low   4

Sprint: summary

Each node of the decision-tree classifier requires examining possible splits on each value of each attribute.
After choosing a split attribute, all data must be partitioned into the resulting subsets.
This search needs to be made efficient.
Evaluating splits on numeric attributes:
  sort on the attribute value, then incrementally evaluate the gini index.
Evaluating splits on categorical attributes:
  for each subset, compute the gini index and choose the best;
  for large value sets, use a greedy method.

Preventing overfitting

A tree T overfits if there is another tree T' that gives higher error on the training data yet lower error on unseen data.
An overfitted tree does not generalize to unseen instances.
This happens when the data contains noise or irrelevant attributes and the training set is small.
Overfitting can reduce accuracy drastically: by 10-25%, as reported in Mingers 1989 (Machine Learning).

Example of overfitting with binary data.

Training Data vs. Test Data Error Rates

Compare error rates measured on:
  the learning data: R(T)
  a large test set: Rts(T)

R(T) on the learning data always decreases as the tree grows (Q: why?).
Rts(T) on the test data first declines, then increases (Q: why?).
Overfitting is the result of too much reliance on the learning-data error R(T);
it can lead to disasters when the tree is applied to new data.

No. terminal nodes   R(T)   Rts(T)
71                   .00    .42
63                   .00    .40
58                   .03    .39
40                   .10    .32
34                   .12    .32
19                   .20    .31
**10                 .29    .30
9                    .32    .34
7                    .41    .47
6                    .46    .54
5                    .53    .61
2                    .75    .82
1                    .86    .91

Digit recognition dataset: CART book

Overfitting example

Consider the case where a single attribute xj is adequate for classification, but with an error of 20%.
Now add lots of other noise attributes that enable zero error during training.
During testing, this detailed tree will have an expected error of (0.8*0.2 + 0.2*0.8) = 32%,
whereas the pruned tree with only a single split on xj will have an error of only 20%.

Approaches to prevent overfitting

Two approaches:
1. Stop growing the tree beyond a certain point.
   Tricky, since even when the information gain is zero an attribute might still be useful (XOR example).
2. First over-fit, then post-prune. (More widely used.)
   Tree building is divided into two phases:
   Growth phase
   Prune phase

Criteria for choosing the final tree size:
Three criteria:
  Cross validation with separate test data.
  Statistical bounds: use all data for training but apply a statistical test to decide the right size
  (a cross-validation dataset may be used to set the threshold).
  Use some criterion function to choose the best size.
  Example: the Minimum Description Length (MDL) criterion.

Cross validation

Partition the dataset into two disjoint parts:
1. Training set, used for building the tree.
2. Validation set, used for pruning the tree.
Rule of thumb: 2/3rds training, 1/3rd validation.

Evaluate the tree on the validation set and, at each leaf and internal node, keep a count of correctly labeled data.
Starting bottom-up, prune nodes whose error is less than that of their children.

What if the training data set size is limited?

n-fold cross validation: partition the training data into n parts D1, D2, ..., Dn.
Train n classifiers, using D - Di for training and Di as the test set.
Average the results. (How?)
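A minimal Python sketch (added) of the n-fold partitioning described above; train_tree and error_rate are placeholders for whatever learner and evaluation function are being cross-validated.

def n_fold_cv(data, n, train_tree, error_rate):
    folds = [data[i::n] for i in range(n)]          # n disjoint parts D1..Dn
    errors = []
    for i in range(n):
        held_out = folds[i]                         # Di is the test part
        training = [row for j, fold in enumerate(folds) if j != i for row in fold]
        tree = train_tree(training)                 # train on D - Di
        errors.append(error_rate(tree, held_out))
    return sum(errors) / n                          # average error over the n runs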

That was a simplistic view..


A tree with minimum error on a single test
set may not be stable.
In what order do you prune?

Minimum Cost Complexity Pruning in CART

For each cross-validation run:
  Construct the full tree Tmax.
  Use error estimates to prune Tmax:
    delete subtrees in decreasing order of strength;
    all subtrees of the same strength go together.
  This gives several trees of various sizes.
  Use the validation partition to record the error against each tree size.

Choose the tree size with the smallest error over all CV partitions.
Run a search involving growing and shrinking phases to find the best tree of the chosen size using all the data.

Pruning: Which nodes come off next?

Order of pruning: the weakest link goes first.
Prune away the "weakest link": the nodes that add least to the overall accuracy of the tree.
The contribution to the overall tree is a function of both the increase in accuracy and the size of the node;
the accuracy gain is weighted by the node's share of the sample, so small nodes tend to get removed before large ones.

If several nodes have the same contribution they are all pruned away simultaneously,
so more than two terminal nodes can be cut off in one pruning step.

The sequence is determined all the way back to the root node:
we need to allow for the possibility that the entire tree is bad;
if the target variable is unpredictable we will want to prune back to the root . . . the "no model" solution.

Pruning Sequence Example

(Figure: successive trees in the pruning sequence, with 24, 21, 20, and 18 terminal nodes.)

Now we test every tree in the pruning sequence

Take a test data set, drop it down the largest tree in the sequence, and measure its predictive accuracy:
  how many cases are right and how many are wrong;
  measure accuracy overall and by class.

Do the same for the 2nd largest tree, the 3rd largest tree, and so on.
The performance of every tree in the sequence is measured, and the results are reported in table and graph formats.
Note that this critical stage is impossible to complete without test data:
the CART procedure requires test data to guide tree evaluation.

Pruning via significance tests

For each node, test on the training data whether the class label is independent of the splits of the attribute at this node; prune if it is independent.
A common statistical test for independence is the chi-squared test.
(Chi-squared test of independence: worked out on the board.)

A second test of independence is mutual information:
  sum over x, y of p(x,y) log( p(x,y) / (p(x) p(y)) )
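A small Python sketch (added) of the chi-squared statistic for a node: compare the observed branch-by-class counts with the counts expected under independence, then compare the statistic against a chi-squared threshold with (rows-1)*(cols-1) degrees of freedom.

def chi_squared(table):
    # table[i][j] = count of class j in branch i of the node's split
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n   # expected count under independence
            if expected:
                stat += (observed - expected) ** 2 / expected
    return stat

# Outlook split on the weather data: per-branch (Yes, No) counts
print(round(chi_squared([[2, 3], [4, 0], [3, 2]]), 2))   # ~3.55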

The Minimum Description Length principle (MDL)

MDL: a paradigm for statistical estimation, particularly model selection.
Given data D and a class of models M, choose a model m in M such that the data and the model can be encoded using the smallest total length:
  L(D) = L(D|m) + L(m)

How do we find the encoding length?
The answer is in information theory.
Consider the problem of transmitting n messages, where pi is the probability of seeing message i.
Shannon's theorem: the expected length is minimized by assigning -log pi bits to message i.

MDL Example: Compression with Classification Trees

bytes  packets  protocol
20K    -        http
24K    -        http
20K    -        http
40K    11       ftp
58K    18       http
100K   24       ftp
300K   35       ftp
80K    15       http

Packets > 10
  no  -> Protocol = http
  yes -> Bytes > 60K
           yes -> Protocol = ftp
           no  -> Protocol = http

Outliers (rows the tree misclassifies): row 4 (protocol = ftp), row 8 (protocol = http)

From: Anthony Tung, NUS

Encoding data

Assume t records of training data D.
First send the tree m using L(m|M) bits.
Assume everything except the class labels of the training data is already known to the receiver.
Goal: transmit the class labels using L(D|m) bits.
If the tree correctly predicts an instance: 0 bits.
Otherwise: log k bits, where k is the number of classes.
Thus, with e errors on the training data, the total cost is
  e log k + L(m|M) bits.
A complex tree has a higher L(m|M) but a lower e.
Question: how do we encode the tree?
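A tiny Python sketch (added) of this cost; tree_bits stands in for L(m|M), which depends on the chosen tree encoding.

import math

def mdl_cost(errors, num_classes, tree_bits):
    # total cost = errors * log2(k) bits for the misclassified labels + bits for the tree itself
    return errors * math.log2(num_classes) + tree_bits

# a small tree with more errors can still beat a large error-free tree:
print(mdl_cost(errors=0, num_classes=2, tree_bits=40.0))    # 40.0 bits
print(mdl_cost(errors=10, num_classes=2, tree_bits=12.0))   # 22.0 bits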

Extracting Classification Rules from Trees

Represent the knowledge in the form of IF-THEN rules:
  One rule is created for each path from the root to a leaf.
  Each attribute-value pair along a path forms a conjunction.
  The leaf node holds the class prediction.
Rules are easier for humans to understand.

Example (from the buys_computer tree):
IF age = <=30 AND student = no THEN buys_computer = no
IF age = <=30 AND student = yes THEN buys_computer = yes
IF age = 31..40 THEN buys_computer = yes
IF age = >40 AND credit_rating = excellent THEN buys_computer = no
IF age = >40 AND credit_rating = fair THEN buys_computer = yes
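A small Python sketch (added) that mechanically extracts one rule per root-to-leaf path from the buys_computer tree; the nested tuple/dict representation of the tree is chosen purely for illustration.

tree = ("age", {
    "<=30":   ("student", {"no": "no", "yes": "yes"}),
    "31..40": "yes",
    ">40":    ("credit_rating", {"excellent": "no", "fair": "yes"}),
})

def extract_rules(node, conditions=()):
    if not isinstance(node, tuple):                       # leaf: emit one rule for this path
        tests = " AND ".join(f"{a} = {v}" for a, v in conditions)
        return [f"IF {tests} THEN buys_computer = {node}"]
    attr, branches = node
    rules = []
    for value, child in branches.items():                 # extend the conjunction along each branch
        rules += extract_rules(child, conditions + ((attr, value),))
    return rules

for rule in extract_rules(tree):
    print(rule)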

Rule-based pruning

Tree-based pruning limits the kind of pruning possible: if a node is pruned, all subtrees under it have to be pruned.
Rule-based pruning: for each leaf of the tree, extract a rule using the conjunction of all tests up to the root.
On the validation set, independently prune tests from each rule to get the highest accuracy for that rule.
Sort the rules by decreasing accuracy.

Regression trees

Decision trees with continuous class labels: regression trees approximate the target function with piecewise constant regions.
Split criteria for regression trees:
  Predicted value for a set S = the average of all values in S.
  Error = the sum of squared deviations of each member of S from the predicted average.
  Pick the split with the smallest total error (a small sketch follows this slide).

Splits on categorical attributes:
  Can they be handled better than for discrete class labels?
  Homework.
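A minimal Python sketch (added) of the regression-tree split criterion above: predict the mean of each subset and score a split by the total squared error around those means.

def sse(values):
    # sum of squared deviations from the subset's mean (its predicted value)
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def split_error(left_values, right_values):
    return sse(left_values) + sse(right_values)

# choose the split with the smallest total squared error:
print(split_error([1.0, 1.2, 0.9], [4.8, 5.1, 5.0]))   # small: each side is homogeneous
print(split_error([1.0, 4.8, 5.1], [1.2, 0.9, 5.0]))   # much larger: both sides are mixed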

Other types of trees

Multi-way trees on low-cardinality categorical data.
Multiple splits on continuous attributes [Fayyad 93, Multi-interval discretization of continuous attributes].
Multi-attribute tests at nodes to handle correlated attributes:
  multivariate linear splits [Oblique trees, Murthy 94].

Issues
Methods of handling missing values
assume majority value
take most probable path

Allowing varying costs for different attributes

Pros and Cons of decision trees

Pros:
+ Reasonable training time
+ Fast application
+ Easy to interpret
+ Easy to implement
+ Intuitive

Cons:
- Not effective for very high-dimensional data where information about the class is spread in small ways over many correlated features
  (example: words in text classification)
- Not robust to dropping of important features, even when correlated substitutes exist in the data
