Classification Trees
Concha Bielza, Pedro Larrañaga
Outline
1 Introduction
2 ID3
3 ID3 improvements
4 C4.5
5 M5
6 Conclusions
Introduction
Classification tree
Widely used and popular
Approximate discrete-valued target functions; the learned function is represented by a decision tree
Robust to noisy data
Represent a disjunction of conjunctions of constraints on the attribute values of the instances
The learned tree can be re-represented as a set of if-then rules (intuitive)
Successfully applied to: classifying medical patients by their disease, equipment malfunctions by their cause, assessing the credit risk of loan applicants...
Both classification and regression trees exist
Introduction
Representation as a tree
Each internal (non-leaf) node specifies a test of some attribute of the instance
Each branch descending from a node corresponds to one of the possible values of that attribute
Each leaf node provides the classification of the instance
Unseen instances are classified by sorting them down the tree from the root to some leaf node, testing the attribute specified at each node; the leaf gives the classification
Introduction
Example: PlayTennis
Classify Saturday mornings according to whether they’re
suitable for playing tennis
The instance
Outlook=Sunny, Temperature=Hot, Humidity=High, Wind=Strong
is sorted down the leftmost branch and classified as PlayTennis=No
Introduction
Example: PlayTennis
The tree represents a disjunction of conjunctions
A path = a conjunction of attribute tests
The tree = a disjunction of these conjunctions
This tree corresponds to the expression (for PlayTennis=Yes):
(Outlook=Sunny ∧ Humidity=Normal)
∨ (Outlook=Overcast)
∨ (Outlook=Rain ∧ Wind=Weak)
Introduction
Rule generation
R1: If (Outlook=Sunny) AND (Humidity=High) then PlayTennis=No
R2: If (Outlook=Sunny) AND (Humidity=Normal) then PlayTennis=Yes
R3: If (Outlook=Overcast) then PlayTennis=Yes
R4: If (Outlook=Rain) AND (Wind=Strong) then PlayTennis=No
R5: If (Outlook=Rain) AND (Wind=Weak) then PlayTennis=Yes
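As a small illustration, the tree behind rules R1-R5 can be written down directly in code. A minimal Python sketch (the nested-dictionary representation is my own choice, not something from the slides): it reconstructs the tree from the rules and classifies the instance from the earlier slide.

```python
# The PlayTennis tree reconstructed from rules R1-R5 above.
# Internal nodes are dicts {"attr": ..., "branches": {...}}; leaves are class labels.
playtennis_tree = {
    "attr": "Outlook",
    "branches": {
        "Sunny":    {"attr": "Humidity",
                     "branches": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain":     {"attr": "Wind",
                     "branches": {"Strong": "No", "Weak": "Yes"}},
    },
}

def classify(tree, instance):
    """Sort an instance down the tree, testing one attribute per node."""
    while isinstance(tree, dict):
        tree = tree["branches"][instance[tree["attr"]]]
    return tree

instance = {"Outlook": "Sunny", "Temperature": "Hot",
            "Humidity": "High", "Wind": "Strong"}
print(classify(playtennis_tree, instance))   # -> No (rule R1)
```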
Introduction
Example
The input space is divided into regions, each labelled with one class; the regions are hyperrectangles
Introduction
Types of trees
Classification trees: discrete outputs
CLS, ID3, C4.5, ID4, ID5, C4.8, C5.0
Regression trees: continuous outputs
CART, M5, M5’
ID3
Key step: how to select which attribute to test at each node in the tree?
We would like the attribute most useful for classifying examples, i.e. the one that best separates them
ID3 chooses mutual information as a measure of the worth
of an attribute (maximize)
I(C, Xi) = H(C) − H(C|Xi)   (information gain)
H(C) = −Σ_c p(c) log2 p(c),   H(C|Xi) = −Σ_x Σ_c p(x, c) log2 p(c|x)
All the examples with Outlook=Overcast are positive ⇒ It becomes a leaf node
The rest have nonzero entropy and the tree still grows
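To make the formulas above concrete, here is a small Python sketch of the information gain; the six-value sample at the end is purely illustrative (it is not the full PlayTennis table).

```python
from math import log2
from collections import Counter

def entropy(labels):
    """H(C) = -sum_c p(c) log2 p(c)."""
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())

def information_gain(attr_values, labels):
    """I(C, X) = H(C) - H(C|X), with H(C|X) = sum_x p(x) H(C | X = x)."""
    n = len(labels)
    cond = 0.0
    for v in set(attr_values):
        subset = [c for x, c in zip(attr_values, labels) if x == v]
        cond += (len(subset) / n) * entropy(subset)
    return entropy(labels) - cond

# Hypothetical mini-sample of Outlook values and PlayTennis labels
outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Overcast"]
play    = ["No",    "No",    "Yes",      "Yes",  "No",   "Yes"]
print(information_gain(outlook, play))   # how much Outlook reduces class entropy
```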
General observations
Because it is greedy, ID3 can reach a locally optimal solution rather than a global one
Extension of ID3: add a form of backtracking (post-pruning the tree)
Because it uses statistical properties of all the examples (in the information gain), the search is less sensitive to errors in individual examples
Extension of ID3: handle noisy data by modifying the stopping criterion, creating a leaf even without perfectly fitting the data (labelled with the majority class)
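A minimal sketch of the greedy top-down construction these observations refer to, assuming the same nested-dictionary tree format as the earlier classification sketch (entropy and gain are repeated so the snippet stays self-contained); it includes no noise handling and no pruning.

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())

def info_gain(examples, attr, target):
    """I(C, attr) = H(C) - H(C | attr) over a list of example dicts."""
    groups = {}
    for e in examples:
        groups.setdefault(e[attr], []).append(e[target])
    n = len(examples)
    return entropy([e[target] for e in examples]) - \
           sum(len(g) / n * entropy(g) for g in groups.values())

def id3(examples, attrs, target):
    """Greedily pick the highest-gain attribute, split, and recurse
    until a node is pure (or no attributes remain: majority leaf)."""
    labels = [e[target] for e in examples]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attrs:
        return majority                          # leaf node
    best = max(attrs, key=lambda a: info_gain(examples, a, target))
    branches = {v: id3([e for e in examples if e[best] == v],
                       [a for a in attrs if a != best], target)
                for v in set(e[best] for e in examples)}
    return {"attr": best, "branches": branches, "majority": majority}
```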
ID3 improvements
Practical issues
Determining how deeply to grow the tree
Choosing an appropriate attribute selection measure
Handling continuous attributes
Handling training data with missing attribute values
Handling attributes with differing costs
Overfitting
Avoiding overfitting
Two groups of approaches that try to simplify the tree:
Pre-pruning: stop growing the tree earlier, before it reaches the point where it perfectly classifies the training data
Post-pruning: allow the tree to overfit the data and then post-prune it, replacing subtrees by a leaf
⇒ Post-pruning is more successful in practice (although more costly), since in pre-pruning it is hard to estimate precisely when to stop growing the tree
Overfitting
Pre-pruning
Apply a statistical test to estimate whether expanding a particular node is likely
to produce an improvement beyond the training set (e.g. chi-square test as in
Quinlan, 1986)
Post-pruning
In a bottom-up manner, pruning means removing the subtree rooted at a node, making it a leaf node and assigning it the most common classification of the training examples affiliated with that node
Pruning is done only if the resulting tree performs no worse than the original over the test data
Prune iteratively, choosing the node whose removal most increases the accuracy over the test set
...until further pruning is harmful (accuracy decreases)
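A sketch of this iterative post-pruning procedure, assuming the nested-dictionary trees of the earlier sketches and that every internal node stores the majority class of its training examples under a "majority" key (as the id3 sketch above does); held_out plays the role of the test set mentioned above.

```python
import copy

def classify(tree, x):
    while isinstance(tree, dict):
        tree = tree["branches"][x[tree["attr"]]]
    return tree

def accuracy(tree, data):                 # data: list of (instance, label) pairs
    return sum(classify(tree, x) == y for x, y in data) / len(data)

def internal_nodes(tree, path=()):
    """Yield the branch-value path to every internal node."""
    if isinstance(tree, dict):
        yield path
        for v, sub in tree["branches"].items():
            yield from internal_nodes(sub, path + (v,))

def pruned_at(tree, path):
    """Copy of the tree with the node at `path` replaced by its majority leaf."""
    new = copy.deepcopy(tree)
    parent, node = None, new
    for v in path:
        parent, node = node, node["branches"][v]
    if parent is None:
        return node["majority"]
    parent["branches"][path[-1]] = node["majority"]
    return new

def reduced_error_prune(tree, held_out):
    """Repeatedly prune the node whose removal most increases accuracy,
    stopping when any further pruning would decrease it."""
    best = accuracy(tree, held_out)
    while isinstance(tree, dict):
        acc, path = max((accuracy(pruned_at(tree, p), held_out), p)
                        for p in internal_nodes(tree))
        if acc < best:
            break
        tree, best = pruned_at(tree, path), acc
    return tree
```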
Continuous attributes
Discretize them
Partition into a discrete set of intervals, e.g. 2 intervals A < c and A ≥ c
How to select c? We would like the value that produces the greatest information gain
It can be shown that such a value always lies where the class value changes, once the examples are sorted according to the continuous attribute A
⇒ Evaluate each candidate c by computing the information gain associated with it
⇒ The new binary attribute defined by the chosen c then competes with the other (discrete) candidate attributes
Continuous attributes
Discretize them
Example: the examples sorted by Temperature
Temperature: 40 48 60 72 80 90
PlayTennis:  No No Yes Yes Yes No
Candidate thresholds, at the midpoints where the class changes: c1 = (48 + 60)/2 = 54 and c2 = (80 + 90)/2 = 85 (evaluated in the code sketch below)
Other alternatives: more than 2 intervals, or linear combinations of several variables, αA + βB > c...
The same idea applies to discrete attributes with many values: group the values
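A sketch of this threshold search on the Temperature example from the slide above; it picks c = 54, which gives the larger information gain.

```python
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def best_threshold(values, labels):
    """Candidate thresholds c are the midpoints between adjacent sorted values
    where the class changes; return the (gain, c) pair with the largest gain."""
    pairs = sorted(zip(values, labels))
    n, best = len(pairs), None
    for (v1, c1), (v2, c2) in zip(pairs, pairs[1:]):
        if c1 == c2:
            continue
        c = (v1 + v2) / 2
        left  = [l for v, l in pairs if v < c]
        right = [l for v, l in pairs if v >= c]
        gain = entropy(labels) - (len(left) / n) * entropy(left) \
                               - (len(right) / n) * entropy(right)
        if best is None or gain > best[0]:
            best = (gain, c)
    return best

# Temperature example from the slide above
temp = [40, 48, 60, 72, 80, 90]
play = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(best_threshold(temp, play))   # -> approximately (0.459, 54.0)
```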
Missing values
Eliminate incomplete instances, or
Estimate the missing values (imputation):
Assign the mode: the most common value among the examples associated with the node we are located at,
or among the examples at that node with the same class label as the instance to be imputed
Assign a probability distribution over the possible attribute values, estimated from the observed frequencies at the node
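A small sketch of the mode-based imputation option above (names are my own); the second function sketches the probability-distribution alternative.

```python
from collections import Counter

def impute_at_node(attr, node_examples, label=None):
    """Most common value of `attr` among the (instance, class) pairs reaching
    this node; if `label` is given, only examples with that class are used."""
    pool = [x[attr] for x, c in node_examples
            if x.get(attr) is not None and (label is None or c == label)]
    return Counter(pool).most_common(1)[0][0]

def value_distribution(attr, node_examples):
    """Alternative: a probability distribution over the attribute's values,
    estimated from the observed frequencies at the node."""
    pool = [x[attr] for x, c in node_examples if x.get(attr) is not None]
    counts = Counter(pool)
    total = sum(counts.values())
    return {v: k / total for v, k in counts.items()}
```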
Costs
Attributes with differing costs: bias the selection toward cheaper attributes, e.g. by maximizing I(C, Xi) / cost(Xi)
C4.5
Chooses attributes by maximizing the gain ratio: the information gain I(C, Xi) divided by the split information H(Xi), which penalizes attributes with many values (sketched in code below)
Orders the rules according to their estimated error (ascending order) and considers this sequence when classifying instances
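A compact, self-contained sketch of the gain ratio referenced above; the gain computation mirrors the earlier ID3 sketch.

```python
from math import log2
from collections import Counter

def entropy(xs):
    n = len(xs)
    return -sum((k / n) * log2(k / n) for k in Counter(xs).values())

def gain_ratio(attr_values, labels):
    """Information gain I(C, X) divided by the split information H(X);
    this penalizes attributes that split the data into many small subsets."""
    n = len(labels)
    cond = 0.0
    for v in set(attr_values):
        sub = [c for x, c in zip(attr_values, labels) if x == v]
        cond += len(sub) / n * entropy(sub)
    gain = entropy(labels) - cond        # I(C, X)
    return gain / entropy(attr_values)   # split information = H(X)
```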
C4.5 algorithm
Pessimistic pruning: C4.5 estimates the error e by resubstitution and corrects it toward a pessimistic position, giving the approximation e + 1.96 · σ, with σ a standard deviation estimate
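A tiny sketch of this pessimistic correction; the slides only call σ "a std deviation estimate", so the binomial standard error sqrt(e(1 − e)/N) used here is an assumption.

```python
from math import sqrt

def pessimistic_error(errors, n, z=1.96):
    """Resubstitution error rate corrected upwards: e + z * sigma.
    sigma is assumed here to be the binomial standard error sqrt(e(1-e)/n);
    the slides only describe it as a standard deviation estimate."""
    e = errors / n
    return e + z * sqrt(e * (1 - e) / n)

# e.g. 2 training errors among 20 examples reaching a node
print(pessimistic_error(2, 20))   # 0.10 + 1.96 * 0.067 ≈ 0.23
```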
Advantages of using rules instead of the tree:
Model trees
Example: 209 different computer configurations
[Figure: regression equation (a) and regression tree (b) for these data]
[Figure: the corresponding model tree, with linear models at the leaves]
Model trees
Splitting criterion: maximize the expected reduction of the standard deviation of the response, SDR = sd(T) − Σ_i (|Ti| / |T|) · sd(Ti), where the Ti are the sets that result from splitting the node according to the chosen attribute
The SD (error) of the response decreases down the tree
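A minimal sketch of the SDR criterion above (the formula is reconstructed from the cited M5 references, not printed in the slides); the values in the example call are illustrative.

```python
from statistics import pstdev

def sdr(parent, subsets):
    """Standard deviation reduction for a candidate split:
    sd(T) - sum_i |Ti|/|T| * sd(Ti), where the Ti partition the parent set T."""
    n = len(parent)
    return pstdev(parent) - sum(len(t) / n * pstdev(t) for t in subsets)

# Illustrative response values at a node, split into two subsets
T = [1.0, 1.2, 0.9, 4.8, 5.1, 5.3]
print(sdr(T, [T[:3], T[3:]]))   # large reduction: the split separates low and high responses
```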
Model trees
Example of M5: 2 continuous attributes and 2 discrete ones
Discrete attributes: motor and screw (5 values each). Class = rise time of the servo system
The order of their values was D, E, C, B, A for both
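The ordering of discrete values mentioned above suggests the usual model-tree treatment of enumerated attributes: sort a discrete attribute's values by the average response and then treat the attribute as ordinal. The sketch below shows that idea; taking it as M5's exact mechanism is an assumption on my part, and the data in the call are made up.

```python
from statistics import mean

def order_discrete_values(attr_values, responses):
    """Sort the values of a discrete attribute by the average response,
    so the attribute can afterwards be treated as ordinal and split by
    thresholds (a sketch of the ordering idea mentioned in the slide)."""
    by_value = {}
    for v, y in zip(attr_values, responses):
        by_value.setdefault(v, []).append(y)
    return sorted(by_value, key=lambda v: mean(by_value[v]))

# Made-up servo-like data: motor values and rise times
motor = ["A", "B", "C", "D", "E", "A", "D"]
rise  = [0.5, 0.8, 1.1, 4.0, 3.5, 0.6, 4.2]
print(order_discrete_values(motor, rise))   # -> ['A', 'B', 'C', 'E', 'D']
```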
Conclusions
Classification trees
ID3 greedily selects the next attribute to be added to the tree
Avoiding overfitting is important for a tree that generalizes well ⇒ prune the tree
Many extensions of ID3: handling continuous attributes, missing values, other attribute selection measures, attribute costs
Continuous class: model trees (M5) with linear regressions
at the leaves
Bibliography
Texts
Alpaydin, E (2004) Introduction to Machine Learning, MIT Press [Chap. 9]
Duda, R., Hart, P.E., Stork, D.G. (2001) Pattern Classification, Wiley [Chap. 8]
Mitchell, T. (1997) Machine Learning, McGraw-Hill [Chap. 3]
Webb, A. (2002) Statistical Pattern Recognition, Wiley [Chap. 7]
Witten, I., Frank, E. (2005) Data Mining, Morgan Kaufmann, 2nd ed. [Sections
4.3, 6.1, 6.5, 10.4]
Bibliography
Papers
Quinlan, J.R. (1986) Induction of decision trees, Machine Learning, 1, 81-106. [ID3]
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J. (1984) Classification and
Regression Trees, Wadsworth. [CART]
Quinlan, J.R. (1993) C4.5: Programs for Machine Learning, Morgan Kaufmann.
[C4.5]
Quinlan, J.R. (1992) Learning with continuous classes, Proc. of the 5th
Australian Joint Conference on AI, 343-348. [M5]
Wang, Y., Witten, I. (1997) Induction of model trees for predicting continuous
classes, Proc. of the Poster Papers of the ECML, 128-137 [M5’]
Frank, E., Wang, Y., Inglis, S., Holmes, G., Witten, I. (1998) Using model trees for
classification, Machine Learning, 32, 63-76
Friedman, J.H. (1991) Multivariate adaptive regression splines, Annals of
Statistics, 19, 1-141 [MARS]