

B.6 CLASSIFICATION TREES

Concha Bielza, Pedro Larrañaga

Computational Intelligence Group


Departamento de Inteligencia Artificial
Universidad Politécnica de Madrid

Master Universitario en Inteligencia Artificial



Outline

1 Introduction

2 Basic algorithm: ID3

3 ID3 improvements

4 C4.5

5 Continuous class variable

6 Conclusions

Introduction

Classification tree
Widely used and popular
Approximates discrete-valued functions; the learned function is represented by a decision tree
Robust to noisy data
Represents a disjunction of conjunctions of constraints on the attribute values of the instances
The learned tree can be re-represented as a set of if-then rules (intuitive)
Successfully applied to classifying medical patients by their disease, diagnosing equipment malfunctions by their cause, assessing the credit risk of loan applicants...
Classification and regression trees

Introduction

Representation as a tree
Each internal (non-leaf) node specifies a test of some attribute of the instance
Each descending branch corresponds to one of the possible values of this attribute
Each leaf node provides the classification of the instance
Unseen instances are classified by sorting them down the tree from the root to some leaf node, applying the attribute test specified at each node; the leaf gives the classification

Introduction
Example: PlayTennis
Classify Saturday mornings according to whether they are suitable for playing tennis

The instance
Outlook=Sunny, Temperature=Hot, Humidity=High, Wind=Strong
is sorted down the leftmost branch and classified as PlayTennis=No

Introduction

Example: PlayTennis
The tree represents a disjunction of conjunctions
A path = a conjunction of attribute tests
The tree = a disjunction of these conjunctions
This tree corresponds to the expression:

(Outlook=Sunny ∧ Humidity=Normal)
∨ (Outlook=Overcast)
∨ (Outlook=Rain ∧ Wind=Weak)

Introduction

Rule generation
R1: If (Outlook=Sunny) AND (Humidity=High) then PlayTennis=No
R2: If (Outlook=Sunny) AND (Humidity=Normal) then PlayTennis=Yes
R3: If (Outlook=Overcast) then PlayTennis=Yes
R4: If (Outlook=Rain) AND (Wind=Strong) then PlayTennis=No
R5: If (Outlook=Rain) AND (Wind=Weak) then PlayTennis=Yes

Introduction

Appropriate problems for decision tree learning


Instances represented by attribute-value pairs (attributes
discrete or real)
Target function has discrete output values (classification)... but there are extensions to real-valued outputs: regression trees
Robust to errors in (training) data, both in the class variable
and attributes
Data may contain missing attribute values
Complex domains where there is no clear linear separation

Introduction

Example

The space is divided into regions labelled with one class; these regions are hyperrectangles

Introduction

Types of trees
Classification trees: discrete outputs
CLS, ID3, C4.5, ID4, ID5, C4.8, C5.0
Regression trees: continuous outputs
CART, M5, M5’

Basic algorithm: ID3

ID3 = Iterative Dichotomiser [Quinlan, 1986]


Based on the CLS (Concept Learning System) algorithm [Hunt et al., 1966], which only used binary attributes
Greedy search strategy through the space of all possible classification trees
Construct the tree top-down, asking: which attribute should be tested at the root?
Answer by evaluating each attribute to determine how well it alone classifies the training examples
The best attribute is selected, a descendant of the root is created for each value of this attribute, and the training examples are sorted to the appropriate descendant
Repeat this process using the examples associated with each descendant node to select the best attribute at that point in the tree (always moving forward, searching among the attributes not yet used on this path)
Stop when the tree correctly classifies the examples or when all attributes have been used
Label each leaf node with the class of its examples
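The procedure above can be written very compactly. Below is a minimal sketch in Python, not Quinlan's original implementation: the tree representation (a nested dict keyed by the tested attribute) and the helper names entropy, info_gain and build_tree are our own choices, and examples are assumed to be dicts mapping attribute names to values.

import math
from collections import Counter

def entropy(labels):
    # H(C) = -sum_c p(c) log2 p(c), over a list of class labels
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

def info_gain(examples, labels, attr):
    # I(C, attr) = H(C) - H(C | attr)
    n = len(labels)
    remainder = 0.0
    for value in set(ex[attr] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[attr] == value]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(labels) - remainder

def build_tree(examples, labels, attributes):
    # Recursive top-down construction: pick the best attribute, split, recurse
    if len(set(labels)) == 1:                  # all examples share one class -> leaf
        return labels[0]
    if not attributes:                         # no attributes left -> majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(examples, labels, a))
    node = {best: {}}
    for value in set(ex[best] for ex in examples):
        idx = [i for i, ex in enumerate(examples) if ex[best] == value]
        sub_examples = [examples[i] for i in idx]
        sub_labels = [labels[i] for i in idx]
        remaining = [a for a in attributes if a != best]
        node[best][value] = build_tree(sub_examples, sub_labels, remaining)
    return node

Run on the PlayTennis data of the following slides, this sketch should reproduce the tree encoded by rules R1-R5 above (Outlook at the root, Humidity below Sunny, Wind below Rain).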

Basic algorithm: ID3

ID3
Key step: how to select which attribute to test at each node in the tree?
We would like the attribute that is most useful for classifying examples, the one that best separates them
ID3 chooses mutual information as the measure of the worth of an attribute (to be maximized)
I(C, Xi) = H(C) − H(C|Xi)   (information gain)

H(C) = − Σ_c p(c) log2 p(c),    H(C|Xi) = − Σ_c Σ_x p(x, c) log2 p(c|x)

Expected reduction in entropy (uncertainty) caused by partitioning the examples according to this attribute
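Written as code, these two quantities are direct to compute. A small sketch (our own helper, assuming the joint counts n(x, c) are supplied as a dict keyed by (x, c) pairs):

import math
from collections import defaultdict

def information_gain(joint_counts):
    # I(C, X) = H(C) - H(C|X), computed from joint counts n(x, c)
    n = sum(joint_counts.values())
    p_c = defaultdict(float)                  # marginal p(c)
    p_x = defaultdict(float)                  # marginal p(x)
    for (x, c), k in joint_counts.items():
        p_c[c] += k / n
        p_x[x] += k / n
    h_c = -sum(p * math.log2(p) for p in p_c.values())
    # H(C|X) = -sum_{x,c} p(x,c) log2 p(c|x), with p(c|x) = p(x,c) / p(x)
    h_c_given_x = -sum((k / n) * math.log2((k / n) / p_x[x])
                       for (x, c), k in joint_counts.items() if k > 0)
    return h_c - h_c_given_x

For the Wind attribute of the PlayTennis data below, information_gain({('Strong', 'Yes'): 3, ('Strong', 'No'): 3, ('Weak', 'Yes'): 6, ('Weak', 'No'): 2}) gives approximately 0.048, matching the hand computation below.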

Basic algorithm: ID3

Example ID3: PlayTennis


Data:
Day Outlook Temperature Humidity Wind PlayTennis
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
3 Overcast Hot High Weak Yes
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
7 Overcast Cool Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
10 Rain Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
14 Rain Mild High Strong No

Basic algorithm: ID3

Example ID3: PlayTennis


Wind?
H(C) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940

H(C|Wind) = −p(Strong, Yes) log2 p(Yes|Strong) − p(Strong, No) log2 p(No|Strong)
            − p(Weak, Yes) log2 p(Yes|Weak) − p(Weak, No) log2 p(No|Weak)
          = −(3/14) log2(3/6) − (3/14) log2(3/6) − (6/14) log2(6/8) − (2/14) log2(2/8) = 0.892

⇒ I(C, Wind) = 0.940 − 0.892 = 0.048


Analogously,
I(C, Humidity) = 0.151
I(C, Outlook) = 0.246 ⇐ Choose Outlook as root node
I(C, Temperature) = 0.029
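These figures can be reproduced with a short, self-contained script (the lists transcribe the data table above; function names are ours):

import math
from collections import Counter

outlook = ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain', 'Rain', 'Overcast',
           'Sunny', 'Sunny', 'Rain', 'Sunny', 'Overcast', 'Overcast', 'Rain']
temperature = ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool',
               'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild']
humidity = ['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal',
            'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'High']
wind = ['Weak', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong',
        'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Strong']
play = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
        'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No']

def entropy(labels):
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

def gain(attr_values, labels):
    n = len(labels)
    remainder = sum((attr_values.count(v) / n) *
                    entropy([c for a, c in zip(attr_values, labels) if a == v])
                    for v in set(attr_values))
    return entropy(labels) - remainder

for name, values in [('Outlook', outlook), ('Temperature', temperature),
                     ('Humidity', humidity), ('Wind', wind)]:
    print(name, round(gain(values, play), 3))
# Agrees with the values above up to rounding:
# Outlook ~0.2467, Temperature ~0.0292, Humidity ~0.1518, Wind ~0.0481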


Basic algorithm: ID3

Example ID3: PlayTennis

All the examples with Outlook=Overcast are positive ⇒ It becomes a leaf node
The rest have nonzero entropy and the tree still grows

Basic algorithm: ID3


Example ID3: PlayTennis
The process of selecting (an attribute) and partitioning (the examples) is repeated for each nonterminal descendant node, this time using only the training examples associated with that node
Any attribute appears at most once along any path
E.g., down the Sunny branch (with 5 instances), we search for the next attribute:
H(C_sunny) = −(3/5) log2(3/5) − (2/5) log2(2/5) = 0.97
I(C_sunny, Temperature) = 0.97 − 0.40 = 0.57
I(C_sunny, Humidity) = 0.97
I(C_sunny, Wind) = 0.019
⇒ Humidity is chosen there; continuing in this way, the final tree is the one shown above

Basic algorithm: ID3

General observations
Because the search is greedy, it can converge to a locally optimal solution rather than the global one
Extension of ID3: add a form of backtracking (post-pruning the tree)
Because it uses statistical properties of all the examples (in the information gain), the search is less sensitive to errors in the data
Extension of ID3 to handle noisy data by modifying its stopping criterion: create a leaf without perfectly fitting the data (label it with the majority class)
Complexity increases linearly with the number of training instances and exponentially with the number of attributes
It has been shown that the selection of attributes with more values is favored

ID3 improvements

Practical issues
Determining how deeply to grow the tree
Choosing an appropriate attribute selection measure
Handling continuous attributes
Handling training data with missing attribute values
Handling attributes with differing costs

How deeply to grow the tree


Overfitting problem
Problems arise if we grow the tree just deeply enough to perfectly classify the training examples:
If there is noise in the data, we can learn the noise!
If only a few examples are associated with the leaves, it is hard to obtain a representative sample of the true target function
⇒ ID3 produces trees that overfit the training data (and do not work well on new, unseen examples). The model is unable to generalize

Overfitting

Avoiding overfitting
Two groups of approaches that try to simplify the tree:
Pre-pruning: stop growing the tree earlier, before it reaches the point where it perfectly classifies the data
Post-pruning: allow the tree to overfit the data, and then post-prune it, replacing subtrees by a leaf
⇒ More successful in practice (although more costly), since in pre-pruning it is hard to estimate precisely when to stop growing the tree

Overfitting

Pre-pruning
Apply a statistical test to estimate whether expanding a particular node is likely
to produce an improvement beyond the training set (e.g. chi-square test as in
Quinlan, 1986)

Post-pruning
In a bottom-up manner, pruning means removing the subtree rooted at a node, making it a leaf node, and assigning it the most common classification of the training examples affiliated with that node
Pruning is done only if the resulting tree performs no worse than the original over the test data
Prune iteratively, choosing the node whose removal most increases the accuracy over the test set
...until further pruning is harmful (accuracy decreases)
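A minimal sketch of this style of post-pruning, under simplifying assumptions: trees use the nested-dict representation of the ID3 sketch earlier, data sets are lists of (example, label) pairs, and a subtree is replaced by a leaf whenever the leaf does no worse on the held-out examples that reach that node (the slide's iterative "most-improving node first" loop reduces to repeated comparisons of this kind). Helper names are ours.

from collections import Counter

def classify(tree, example):
    # A non-dict node is a class label (leaf); otherwise follow the tested attribute.
    # Assumes every attribute value occurring in the data also appears in the tree.
    while isinstance(tree, dict):
        attr, branches = next(iter(tree.items()))
        tree = branches[example[attr]]
    return tree

def prune(tree, train, heldout):
    # Bottom-up pruning; train/heldout are the (example, label) pairs reaching this node
    if not isinstance(tree, dict):
        return tree
    attr, branches = next(iter(tree.items()))
    for value in list(branches):
        sub_train = [(ex, lab) for ex, lab in train if ex[attr] == value]
        sub_held = [(ex, lab) for ex, lab in heldout if ex[attr] == value]
        branches[value] = prune(branches[value], sub_train, sub_held)
    # Candidate leaf: the most common class of the training examples at this node
    majority = Counter(lab for _, lab in train).most_common(1)[0][0]
    subtree_errors = sum(classify(tree, ex) != lab for ex, lab in heldout)
    leaf_errors = sum(majority != lab for _, lab in heldout)
    return majority if leaf_errors <= subtree_errors else tree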

Other attribute selection measures

Attributes with many values are favored


Extreme example: “Date” in the PlayTennis problem would have the highest information gain (it would be chosen as the root node), giving a broad tree of depth one that perfectly classifies the training data (but performs poorly on subsequent examples)
What's wrong with “Date”? It has so many values that it forces the training examples to be separated into very small subsets
Select attributes based on other measures:
⇒ e.g. the gain ratio I(C, Xi)/H(Xi), which penalizes attributes with many values and with many uniformly distributed values (a short code sketch follows below)
Experimental studies suggest that these measures appear to have a smaller impact on final accuracy than post-pruning
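For reference, the gain ratio can be computed from the same joint counts n(x, c) used in the information-gain sketch earlier (again, the function name and the input format are our own choices):

import math
from collections import defaultdict

def gain_ratio(joint_counts):
    # Gain ratio = I(C, X) / H(X), computed from joint counts n(x, c)
    n = sum(joint_counts.values())
    p_x = defaultdict(float)
    p_c = defaultdict(float)
    for (x, c), k in joint_counts.items():
        p_x[x] += k / n
        p_c[c] += k / n
    h_x = -sum(p * math.log2(p) for p in p_x.values())    # split information H(X)
    h_c = -sum(p * math.log2(p) for p in p_c.values())
    h_c_given_x = -sum((k / n) * math.log2((k / n) / p_x[x])
                       for (x, c), k in joint_counts.items() if k > 0)
    return (h_c - h_c_given_x) / h_x   # assumes X takes more than one value, so H(X) > 0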

Continuous attributes

Discretize them
Partition into a discrete set of intervals, e.g. 2 intervals A < c and A ≥ c

How to select c? We would like the one that produces the greatest information gain

It can be shown that this value always lies where the class value changes, once
the examples are sorted according to the continuous attribute A

⇒ Sort the examples according to A, take 2 adjacent A-values with different C-labels, and average them

⇒ Evaluate each candidate to be c by computing the info gain associated with each

⇒ With this chosen c, this new attribute competes with the other discrete candidate
attributes

⇒ If A is not chosen, c may be different later

⇒ I.e., the attribute is created dynamically



Continuous attributes

Discretize them
Example (a code sketch of this selection procedure follows below): candidates are c1 = (48 + 60)/2 = 54 and c2 = (80 + 90)/2 = 85:
Temperature 40 48 60 72 80 90
PlayTennis No No Yes Yes Yes No
Other alternatives: more than 2 intervals, or linear combinations of several variables, αA + βB > c...
Discrete attributes with many values can also be handled similarly, by grouping their values
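A sketch of this dynamic threshold selection, applied to the Temperature example above (names are ours): candidate thresholds are the midpoints of adjacent values with different class labels, and the one with the greatest information gain is returned.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

def best_threshold(values, labels):
    # Sort by the continuous attribute, evaluate midpoints where the class changes
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy([lab for _, lab in pairs])
    best = None
    for i in range(n - 1):
        if pairs[i][1] != pairs[i + 1][1]:                 # class label changes here
            c = (pairs[i][0] + pairs[i + 1][0]) / 2        # candidate threshold
            left = [lab for v, lab in pairs if v < c]
            right = [lab for v, lab in pairs if v >= c]
            gain = base - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
            if best is None or gain > best[1]:
                best = (c, gain)
    return best

print(best_threshold([40, 48, 60, 72, 80, 90], ['No', 'No', 'Yes', 'Yes', 'Yes', 'No']))
# -> (54.0, 0.459...): both candidates 54 and 85 are generated, and 54 wins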

Data with missing attribute values

Missing
Eliminate incomplete instances
Estimate the missing values (imputation):
Assign them the mode: the most common value among the examples associated with the node where we are located, or among the examples at that node with the same class label as the instance being imputed
Assign a probability distribution over the possible attribute values, estimated from the observed frequencies at the node
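A small sketch of the mode-based imputation just described (the function name and argument conventions are ours; target_label, when given, restricts the pool to the examples at the node that share the class label of the instance being imputed):

from collections import Counter

def impute_mode(examples, labels, attr, target_label=None):
    # Most common value of `attr` among the examples at this node;
    # missing values are assumed to be stored as None
    pool = [ex[attr] for ex, lab in zip(examples, labels)
            if ex[attr] is not None and (target_label is None or lab == target_label)]
    return Counter(pool).most_common(1)[0][0]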

Attributes with differing costs

Include the cost into the attribute selection measure


Fever, BiopsyResult, Pulse, BloodTestResults... vary
significantly in their costs (monetary cost and cost to
patient comfort)
Try to favor lower-cost attributes for the starting nodes
(closer to the root)
For example:

I(C, Xi) / cost(Xi)

C4.5 algorithm [Quinlan, 1993]

C4.5
Choose attributes by using gain ratio (maximize it)

I(C, Xi )/H(Xi ) (gain ratio)

Incorporate rule post-pruning: generate rules (one per path) and eliminate preconditions (antecedents) whenever doing so does not worsen the error

Convert the tree into a set of rules R


Error = classification error with R
For each rule r from R:

New-error = error after eliminating antecedent j from r


If New-error ≤ Error, then set Error = New-error and eliminate this antecedent from r

If there are no more antecedents in r, delete r from R

Order the rules according to the estimated error (ascending order) and consider
this sequence when classifying instances
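A simplified sketch of this rule-simplification loop. Rules are represented here as (antecedents, predicted_class), with antecedents a list of (attribute, value) tests; the error estimate is the plain misclassification rate of each rule on a data set, whereas C4.5 actually uses the pessimistic estimate described on the next slide. Names are ours.

def rule_error(antecedents, predicted, data):
    # Fraction of the examples covered by the rule that it misclassifies
    covered = [(ex, lab) for ex, lab in data
               if all(ex[a] == v for a, v in antecedents)]
    if not covered:
        return 1.0                               # a rule covering nothing is useless
    return sum(lab != predicted for _, lab in covered) / len(covered)

def simplify_rules(rules, data):
    # For each rule, drop antecedents while the error does not worsen;
    # delete rules left with no antecedents; sort by estimated error (ascending)
    simplified = []
    for antecedents, predicted in rules:
        antecedents = list(antecedents)
        error = rule_error(antecedents, predicted, data)
        changed = True
        while changed and antecedents:
            changed = False
            for j in range(len(antecedents)):
                trial = antecedents[:j] + antecedents[j + 1:]
                new_error = rule_error(trial, predicted, data)
                if new_error <= error:
                    antecedents, error, changed = trial, new_error, True
                    break
        if antecedents:
            simplified.append((antecedents, predicted, error))
    return sorted(simplified, key=lambda r: r[2])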

C4.5 algorithm

C4.5
Pessimistic pruning: C4.5 estimates the error e by resubstitution and corrects it toward a pessimistic position: it uses the approximation e + 1.96·σ, with σ a standard deviation estimate
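The correction itself is one line of code; treating σ as the binomial standard error sqrt(e(1 − e)/N) is our assumption, since the slide only says it is a standard-deviation estimate:

import math

def pessimistic_error(e, n):
    # e: resubstitution error rate at a node or rule, n: number of covered examples
    sigma = math.sqrt(e * (1.0 - e) / n)     # assumed binomial standard error
    return e + 1.96 * sigma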
Advantages of using rules instead of the tree:

We can prune contexts (paths), rather than subtrees


Easy to understand
Latest version: C4.8, which is C4.5 revision 8, the last public version of this family of algorithms before the commercial implementation C5.0
⇒ J48 is its name in Weka

Regression trees and model trees

Continuous class variable (numeric prediction)


Generalize these ideas to continuous regression functions
Regression tree: the value at each leaf is the mean response of the instances that reach that leaf
Adopted by the well-known CART system [Breiman et al., 1984]
We approximate a non-linear function by a piecewise constant one
Valid for any kind of attributes
Model tree: each leaf holds a linear regression model that predicts the response of the instances that reach it
M5 algorithm [Quinlan, 1992] and M5' [Wang and Witten, 1997]
Any kind of attributes, especially continuous
Regression trees are a particular case of model trees

Model trees

M5 and M5’ algorithms


M5 generalizes ID3-like decision trees to numeric prediction
Splitting criterion: choose the attribute that minimizes the intra-subset variation of the class values down each branch
Leaves are linear models
The attributes used to define each regression are those not yet chosen (those tested further down, subordinate to the current node)
M5' introduces improvements:
Handles discrete attributes
Handles missing values
Reduces the tree size

Model trees
Example
209 different computer configurations:

The outcome (relative performance of computer processing power) and the attributes are continuous

Model trees
Example
Regression equation (a) and regression tree (b)

Model trees
Example
Model tree:

Model trees

Building the tree with M5


Let T be the set of examples that reach the node
Std dev of the class values that reach a node is treated as
a measure of the error at that node
Splitting criterion based on expected reduction in error
Compute the SD of the class in T (≈ error in T )
Compute expected reduction in error when an attribute is
used
RSD = SD(T) − Σ_i (|Ti|/|T|) SD(Ti)   (maximize it)

where Ti are the sets that result from splitting the node
according to the chosen attribute
– The SD (the error) of the response decreases down the tree
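A sketch of this splitting criterion for one candidate attribute (our own helper; SD is taken as the population standard deviation of the numeric class values):

from statistics import pstdev

def rsd(attr_values, targets):
    # RSD = SD(T) - sum_i (|Ti| / |T|) * SD(Ti)
    n = len(targets)
    reduction = pstdev(targets)
    for value in set(attr_values):
        subset = [t for a, t in zip(attr_values, targets) if a == value]
        reduction -= (len(subset) / n) * pstdev(subset)
    return reduction                         # M5 picks the attribute maximizing this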

Model trees

Building the tree with M5


Stop growing the tree when:
Only a few instances remain (≤ 4)
The class values of the instances that reach a node vary only slightly (their SD is a small fraction of the overall SD of the full data set, e.g. 5%)
Pruning is also applied

Model trees

Handle discrete attributes X
They are transformed into binary variables:
Before building the tree:
For each value of X, average the class values of the instances taking that value, and order the X-values according to these averages
If X takes k values, replace X with k − 1 binary variables: the i-th variable is 0 if the value is one of the first i in the order, and 1 otherwise
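A sketch of this transformation (names are ours; it returns, for each instance, the k − 1 binary indicators coded exactly as described above):

from collections import defaultdict

def binarize(values, targets):
    # Order the values of the discrete attribute by their average target value
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, t in zip(values, targets):
        sums[v] += t
        counts[v] += 1
    order = sorted(counts, key=lambda v: sums[v] / counts[v])
    rank = {v: i for i, v in enumerate(order)}          # position in that order
    k = len(order)
    # The i-th indicator (i = 1..k-1) is 0 if the value is among the first i, else 1
    return [[0 if rank[v] < i else 1 for i in range(1, k)] for v in values]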

Model trees
Example of M5: 2 continuous attributes and 2 discrete
Discrete: motor and screw (5 values). Class = rise time of the servo system
Order of their values was D, E, C, B, A for both

Extensions

MARS: Multivariate Adaptive Regression Splines


[Friedman, 1991]
Continuous generalization of CART
Basis functions are products of spline functions
The same variable can be chosen several times
Tests at the nodes can be multivariate (linear combinations
of variables)

Software

Trees with Weka


Classifier ⇒ Trees

Id3, J48, M5P



Conclusions

Classification trees
ID3 greedily selects the next attribute to be added to the tree
Avoiding overfitting is important to obtain a tree that generalizes well ⇒ prune the tree
Many extensions to ID3: handling continuous attributes and missing values, other attribute selection measures, attribute costs
Continuous class: model trees (M5) with linear regressions at the leaves

Bibliography

Texts
Alpaydin, E (2004) Introduction to Machine Learning, MIT Press [Chap. 9]
Duda, R., Hart, P.E., Stork, D.G. (2001) Pattern Classification, Wiley [Chap. 8]
Mitchell, T. (1997) Machine Learning, McGraw-Hill [Chap. 3]
Webb, A. (2002) Statistical Pattern Recognition, Wiley [Chap. 7]
Witten, I., Frank, E. (2005) Data Mining, Morgan Kaufmann, 2nd ed. [Sections
4.3, 6.1, 6.5, 10.4]

Bibliography

Papers
Quinlan, J.R. (1986) Induction of decision trees, Machine Learning, 1, 81-106. [ID3]
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J. (1984) Classification and
Regression Trees, Wadsworth. [CART]
Quinlan, J.R. (1993) C4.5: Programs for Machine Learning, Morgan Kaufmann.
[C4.5]
Quinlan, J.R. (1992) Learning with continuous classes, Proc. of the 5th
Australian Joint Conference on AI, 343-348. [M5]
Wang, Y., Witten, I. (1997) Induction of model trees for predicting continuous
classes, Proc. of the Poster Papers of the ECML, 128-137 [M5’]
Frank, E., Wang, Y., Inglis, S., Holmes, G., Witten, I. (1998) Using model trees for
classification, Machine Learning, 32, 63-76
Friedman, J.H. (1991) Multivariate adaptive regression splines, Annals of
Statistics, 19, 1-141 [MARS]