7. Decision Tree & Random Forest
Machine Learning
Outline
• Decision tree representation
• ID3 learning algorithm
• Which attribute is best?
• C4.5: real-valued attributes
• Which hypothesis is best?
• Noise
• From Trees to Rules
• Miscellaneous
Decision Tree Representation
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
[Figure: decision tree for PlayTennis — Outlook at the root; the Sunny branch tests Humidity (High → No, Normal → Yes), Overcast → Yes, and the Rain branch tests Wind (Strong → No, Weak → Yes).]
Alternative Decision Tree for PlayTennis
[Figure: an alternative decision tree for PlayTennis with Temperature at the root; lower levels test Humidity, Wind (Weak/Strong), and Outlook (Sunny/Overcast) before reaching the YES/NO leaves.]
• What is different?
• The sequence of attributes influences the size and shape of the tree
Occam’s Principle
• Occam’s Principle: “If two theories explain the facts equally well, then the simpler theory is preferred.”
Decision Trees
Decision tree representation:
• Example XOR:
[Figure: decision tree for the XOR function — the root attribute splits on yes/no, each branch then tests B (yes/no), and the four leaves are NO, YES, YES, NO.]
When to Consider Decision Trees
• Instances describable by attribute–value pairs
• Target function is discrete valued
• Disjunctive hypothesis may be required
• Possibly noisy training data
• Interpretable result of learning is required
• Examples:
• Medical diagnosis
• Text classification
• Credit risk analysis
Top-Down Induction of Decision Trees, ID3
• ID3 (Iterative Dichotomiser; Quinlan, 1986) operates on the whole training set S
Algorithm:
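As a rough illustration of the ID3 procedure, here is a minimal recursive sketch in Python; the dict-based representation of examples and of the tree, and the helper names, are assumptions rather than the slide's own pseudocode (entropy and information gain are defined on the following slides):

```python
# Illustrative ID3-style sketch; examples are dicts such as
# {"Outlook": "Sunny", ..., "PlayTennis": "No"}.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(examples, attr, target):
    """Information gain of splitting `examples` (list of dicts) on `attr`."""
    subsets = {}
    for e in examples:
        subsets.setdefault(e[attr], []).append(e[target])
    before = entropy([e[target] for e in examples])
    after = sum(len(s) / len(examples) * entropy(s) for s in subsets.values())
    return before - after

def id3(examples, attributes, target="PlayTennis"):
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:                  # node is pure: stop
        return labels[0]
    if not attributes:                         # no attributes left: majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, a, target))
    tree = {best: {}}
    for value in {e[best] for e in examples}:  # one branch per observed value
        subset = [e for e in examples if e[best] == value]
        rest = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, rest, target)
    return tree
```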
Example ID3
[Figure: the tree grown by ID3 on the example data.]
Example – Resulting Tree
[Figure: the resulting decision tree — Outlook at the root; Sunny → Humidity (High → No, Normal → Yes), Overcast → Yes, Rain → Wind (Strong → No, Weak → Yes).]
ID3 – Intermediate Summary
• Recursive splitting of the training set
• Stop, if current training set is sufficiently pure
• ... What does "pure" mean?
• … Can we allow for errors?
Which attribute is best?
• Produced splits:
Value 1 Value 2
x1 {+, +, −, −, + } {−, +, +, −, −}
x2 {+} {+, −, −, +, −, +, +, −, − }
x3 {+, +, +, +, −} {−, −, −, −, + }
• No attribute is perfect
• Which one to choose?
Entropy
[Figure: Entropy(S) plotted against the proportion of positive examples; it is maximal at 0.5 and zero for pure sets.]
• Entropy measures the impurity of S
Entropy
• For a set S with a fraction p₊ of positive and p₋ of negative examples:
Entropy(S) = −p₊ log₂(p₊) − p₋ log₂(p₋)
Entropy
S = {9+, 5−}. Entropy(S) = ?
Entropy(S) = −9/14 log₂(9/14) − 5/14 log₂(5/14) ≈ 0.94

S = {8+, 6−}. Entropy(S) = ?
Entropy(S) = −8/14 log₂(8/14) − 6/14 log₂(6/14) ≈ 0.985

S = {14+}. Entropy(S) = ?
Entropy(S) = 0

S = {7+, 7−}. Entropy(S) = ?
Entropy(S) = 1
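A tiny Python check of these values (the function name and interface are illustrative):

```python
from math import log2

def entropy(pos, neg):
    """Entropy of a set with `pos` positive and `neg` negative examples."""
    total = pos + neg
    return sum(-(c / total) * log2(c / total) for c in (pos, neg) if c)

print(round(entropy(9, 5), 3))   # 0.94
print(round(entropy(8, 6), 3))   # 0.985
print(round(entropy(14, 0), 3))  # 0.0  (pure set)
print(round(entropy(7, 7), 3))   # 1.0  (maximally mixed)
```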
Entropy
Given: S = {+, +, −, −, +, −, +, +, −, −}
Value 1 Value 2
x1 {+, +, −, −, +} {−, +, +, −, −}
x2 {+} {+, −, −, +, −, +, +, −, −}
x3 {+, +, +, +, −} {−, −, −, −, +}
Information Gain
• Measuring attribute x creates subsets S1 and S2 with different entropies
Information Gain: Definition
Gain(S, x) = Entropy(S) − Σ_{v ∈ Values(x)} (|Sv| / |S|) · Entropy(Sv)
• Values(x): the set of all possible values for the attribute x
• Sv: the subset of S for which x has value v
Example - Training Set
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
Example
Gain(S, x) = Entropy(S) − Σ_{v ∈ Values(x)} (|Sv| / |S|) · Entropy(Sv)

ID   Wind    PlayTennis
D1   Weak    No
D8   Weak    No
D3   Weak    Yes
D4   Weak    Yes
D5   Weak    Yes
D9   Weak    Yes
D10  Weak    Yes
D13  Weak    Yes
D2   Strong  No
D6   Strong  No
D14  Strong  No
D7   Strong  Yes
D11  Strong  Yes
D12  Strong  Yes

For the top node: S = {9+, 5−}, Entropy(S) = 0.94
S_weak = {6+, 2−}, |S_weak| = 8
S_strong = {3+, 3−}, |S_strong| = 6
Entropy(S_weak) = −6/8 log₂(6/8) − 2/8 log₂(2/8) = 0.81
Entropy(S_strong) = 1
Expected entropy when splitting on the attribute 'Wind':
Entropy(S | Wind) = 8/14 · Entropy(S_weak) + 6/14 · Entropy(S_strong) ≈ 0.89
Gain(S, Wind) = 0.94 − 0.89 ≈ 0.05
Selecting the Next Attribute
• For whole training set:
• Gain(S, Outlook) = 0.246
• Gain(S, Humidity) = 0.151
• Gain(S, Wind) = 0.048
• Gain(S, Temperature) = 0.029
→ Outlook should be used to split the training set! (A quick numerical check follows below.)
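These gains can be reproduced directly from the class counts per attribute value (a standalone check; the helper names are illustrative, and the slide's figures appear to be truncated to three decimals):

```python
from math import log2

def H(counts):
    """Entropy from a tuple of class counts (Yes, No)."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

def gain(total, splits):
    """Information gain, given the counts of the whole set and of each subset."""
    n = sum(total)
    return H(total) - sum(sum(s) / n * H(s) for s in splits)

S = (9, 5)  # 9 Yes / 5 No over the whole training set
print(round(gain(S, [(2, 3), (4, 0), (3, 2)]), 3))  # Outlook (Sunny/Overcast/Rain) ≈ 0.247 (slide: 0.246)
print(round(gain(S, [(3, 4), (6, 1)]), 3))          # Humidity (High/Normal)        ≈ 0.152 (slide: 0.151)
print(round(gain(S, [(6, 2), (3, 3)]), 3))          # Wind (Weak/Strong)            ≈ 0.048
print(round(gain(S, [(2, 2), (4, 2), (3, 1)]), 3))  # Temperature (Hot/Mild/Cool)   ≈ 0.029
```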
Next step in growing the decision tree
The Resulting Decision Tree & Its Rules
Some issues: Real-Valued Attributes
• Temperature = 82.5
• Create discrete attributes to test continuous values:
• (Temperature > 54) = true or false
• Sort the attribute values that occur in the training set; candidate thresholds lie between neighbouring values whose labels differ (see the sketch below):
Temperature: 40 48 60 72 80 90
PlayTennis: No No Yes Yes Yes No
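A small sketch of this idea: candidate thresholds are placed midway between consecutive sorted values whose class labels differ, and the best one is chosen by information gain. The function names are illustrative, not C4.5's own code.

```python
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(labels.count(c) / n * log2(labels.count(c) / n) for c in set(labels))

def best_threshold(values, labels):
    """Pick the split 'value > t' with the highest information gain."""
    pairs = sorted(zip(values, labels))
    # Candidate thresholds: midpoints between neighbours with different labels.
    candidates = [(a[0] + b[0]) / 2 for a, b in zip(pairs, pairs[1:]) if a[1] != b[1]]
    def gain(t):
        left  = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        n = len(pairs)
        return entropy([l for _, l in pairs]) - (
            len(left) / n * entropy(left) + len(right) / n * entropy(right))
    return max(candidates, key=gain)

temps  = [40, 48, 60, 72, 80, 90]
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(best_threshold(temps, labels))   # 54.0 -> test (Temperature > 54)
```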
Some issues: Noise
• Consider adding noisy (=wrongly labeled) training example #15:
• ⟨Sunny, Mild, Normal, Weak⟩, PlayTennis = No, i.e. Outlook = Sunny, Humidity = Normal
• What effect on earlier tree?
[Figure: the decision tree learned earlier (Outlook at the root, leaves No/Yes/No/Yes).]
Some issues: Overfitting
[Figure: the decision tree learned earlier (Outlook at the root, leaves No/Yes/No/Yes).]
• The algorithm will introduce a new test
• Unnecessary, because the new example was erroneous due to the presence of noise
→ Overfitting corresponds to learning coincidental regularities
• Unfortunately, we generally don’t know which examples are noisy
• ... nor the amount (e.g. the percentage) of noisy examples
Some issues: Overfitting
• An example: continuing to grow the tree can improve accuracy on the training data, but the tree performs badly on the test data. [Mitchell, 1997]
Overfitting: solutions
• Some solutions:
• Stop learning early: stop growing the tree before it fits the training data perfectly.
• Prune the full tree: grow the tree to its full size, and then post-prune it.
• It is hard to decide when to stop learning.
• Post-pruning the tree empirically results in better performance, but:
• How to decide the right size of the tree?
• When to stop pruning?
• We can use a validation set to do the pruning, e.g. reduced-error pruning and rule post-pruning (a sketch of reduced-error pruning follows below).
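A minimal sketch of reduced-error pruning, assuming the nested-dict tree produced by the ID3 sketch earlier and dict-based examples; the helper names and interface are assumptions, not the slide's own procedure.

```python
from collections import Counter

def classify(tree, example):
    """Walk a nested-dict tree down to a leaf label (returns None for unseen values)."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr].get(example[attr])
    return tree

def reduced_error_prune(tree, training, validation, target="PlayTennis"):
    """Bottom-up: replace a subtree by its majority-class leaf whenever doing so
    does not hurt accuracy on the validation examples that reach that subtree."""
    if not isinstance(tree, dict):
        return tree
    attr = next(iter(tree))
    for value, subtree in list(tree[attr].items()):
        tr = [e for e in training if e[attr] == value]
        va = [e for e in validation if e[attr] == value]
        tree[attr][value] = reduced_error_prune(subtree, tr, va, target)
    majority = Counter(e[target] for e in training).most_common(1)[0][0]
    leaf_correct = sum(e[target] == majority for e in validation)
    tree_correct = sum(classify(tree, e) == e[target] for e in validation)
    return majority if leaf_correct >= tree_correct else tree
```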
Summary
Types of Decision Trees
• Hunt’s algorithm was developed in the 1960s to model human learning in psychology
• C4.5
• Handles continuous-valued attributes
• Measure: information gain or gain ratio
• Higher is better
• CART (see the Gini sketch below):
• Measure: Gini impurity (how often a randomly chosen example would be misclassified if it were labeled randomly according to the class distribution in the node)
• Lower is better.
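A tiny sketch of the Gini measure (the interface is illustrative):

```python
def gini(counts):
    """Gini impurity of a node, given the class counts in that node.
    0 means pure; larger values mean more mixed classes (lower is better)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(round(gini([9, 5]), 3))  # PlayTennis root node: 0.459
print(gini([4, 0]))            # pure node (e.g. the Overcast branch): 0.0
```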
Advantages & Disadvantages
• Advantages
• It is simple to understand, as it follows the same process a human follows when making a decision in real life.
• It can be very useful for solving decision-related problems.
• It helps to think about all the possible outcomes for a problem.
• It requires less data cleaning compared to other algorithms.
Advantages & Disadvantages
• Disadvantages
• A decision tree can contain many layers, which makes it complex.
• It may have an overfitting issue, which can be mitigated using the Random Forest algorithm.
• With more class labels, the computational complexity of the decision tree may increase.
Decision tree for Regression
• Reference:
https://saedsayad.com/decision_tree_reg.htm
Random forests
• Random forests (RF) is a method by Leo Breiman (2001)
for both classification and regression.
• Main idea: the prediction is based on a combination of many decision trees, taking the average of all individual predictions
• Each tree in RF is simple but random.
• Each tree is grown differently, depending on the choice of attributes and training data (a small usage sketch follows below)
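As a usage illustration only (scikit-learn is not mentioned in the slides and is assumed here as one common implementation of random forests):

```python
# Assumes scikit-learn is installed; the dataset is only a stand-in example.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 randomized trees: each split considers a random subset of the attributes,
# and each tree is grown on a bootstrap sample of the training data.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))  # accuracy on held-out data
```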
Random forests
• RF is currently one of the most popular and accurate methods [Fernández-Delgado et al., 2014]
• It is also very general.
• RF can be implemented easily and efficiently.
• It can work with problems of very high dimensions, without
overfitting
How Random Forests Work
RF: three basic ingredients
• Randomization and no pruning:
• For each tree and at each node, we randomly select a subset of the attributes.