
Ho Chi Minh University of Banking DAT 704

Department of Data Science in Business

Machine Learning
• Decision tree & Random Forest

12/4/2024 Vuong Trong Nhan


Content
• Introduction to Machine Learning
• Data Representation
• Supervised Learning
• Linear Regression
• Logistic Regression
• k-Nearest Neighbors
• Naïve Bayes Classifier
• Decision Tree and Random Forest
• Support Vector Machine
• Neural Network
• Unsupervised Learning
• Model Evaluation and Improvement

2
Outline
• Decision tree representation
• ID3 learning algorithm
• Which attribute is best?
• C4.5: real-valued attributes
• Which hypothesis is best?
• Noise
• From Trees to Rules
• Miscellaneous

3
Decision Tree Representation
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

• Outlook, Temperature, etc.: attributes


• PlayTennis: class
• Shall I play tennis today?
4
Decision Tree for PlayTennis

Outlook

Sunny Overcast Rain

Humidity Yes Wind

High Normal Strong Weak

No Yes No Yes

5
Alternative Decision Tree for PlayTennis
[Figure: an alternative tree that splits first on Temperature (hot: {1,2,3,13}, mild: {4,8,10,11,12,14}, cool: {5,6,7,9}). The hot branch then tests Humidity (Normal → {13}: YES; High → {1,2,3}), then Wind (Strong → {2}: NO; Weak → {1,3}), then Outlook (Sunny → {1}: NO; Overcast → {3}: YES); the other branches continue similarly.]

• What is different?
• The sequence of attributes influences the size and shape of the tree

6
Occam’s Principle

• Occam’s Principle:
  “If two theories explain the facts equally well, then the simpler theory is preferred”

→ Prefer the smallest tree that correctly classifies all training examples

7
Decision Trees
Decision tree representation:

• Each internal node tests an attribute


• Each branch corresponds to attribute value
• Each leaf node assigns a label (class)

How would we represent, for example, XOR(A, B)?

A
  yes → B:  yes → NO,  no → YES
  no  → B:  yes → YES, no → NO

8
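To make this representation concrete, here is a minimal Python sketch (not from the slides; the Node class and predict helper are my own illustrative names) that encodes the XOR tree above as nested nodes: internal nodes carry an attribute, branches carry its values, leaves carry a class.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Node:
    """A decision tree node: either an internal test on an attribute or a leaf with a class label."""
    attribute: Optional[str] = None                             # attribute tested at an internal node
    children: Dict[str, "Node"] = field(default_factory=dict)   # one branch per attribute value
    label: Optional[str] = None                                 # class label at a leaf

def predict(node: Node, example: Dict[str, str]) -> str:
    """Follow the branch matching the example's attribute value until a leaf is reached."""
    while node.label is None:
        node = node.children[example[node.attribute]]
    return node.label

# The XOR tree from the slide: the root tests A, each branch tests B, the leaves give the class.
xor_tree = Node(attribute="A", children={
    "yes": Node(attribute="B", children={"yes": Node(label="NO"),  "no": Node(label="YES")}),
    "no":  Node(attribute="B", children={"yes": Node(label="YES"), "no": Node(label="NO")}),
})

print(predict(xor_tree, {"A": "yes", "B": "no"}))   # YES
print(predict(xor_tree, {"A": "no",  "B": "no"}))   # NO
```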
When to Consider Decision Trees
• Instances describable by attribute–value pairs
• Target function is discrete valued
• Disjunctive hypothesis may be required
• Possibly noisy training data
• Interpretable result of learning is required

• Examples:
• Medical diagnosis
• Text classification
• Credit risk analysis

9
Top-Down Induction of Decision Trees, ID3
• ID3 (Iterative Dichotomiser; Quinlan, 1986) operates on the whole training set S

Algorithm:

1. Create a new node


2. If current training set is sufficiently pure:

• Label node with respective class


• We’re done
3. Else:

• x  the “best” decision attribute for current training set


• Assign x as decision attribute for node
• For each value of x, create new descendant of node
• Sort training examples to leaf nodes
• Iterate over new leaf nodes and apply algorithm recursively
10
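The steps above translate almost directly into code. Below is a minimal, self-contained sketch (not the original ID3 implementation; entropy and best_attribute are illustrative helper names, anticipating the information-gain criterion defined on later slides), assuming the training set is a list of dicts:

```python
import math
from collections import Counter

def entropy(labels):
    """Impurity of a list of class labels (defined formally a few slides later)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_attribute(examples, attributes, target):
    """Pick the attribute whose split gives the lowest expected (conditional) entropy."""
    def conditional_entropy(attr):
        expected = 0.0
        for value in {e[attr] for e in examples}:
            subset = [e[target] for e in examples if e[attr] == value]
            expected += (len(subset) / len(examples)) * entropy(subset)
        return expected
    return min(attributes, key=conditional_entropy)

def id3(examples, attributes, target="PlayTennis"):
    """Recursive ID3 sketch; examples is a list of dicts mapping attribute names to values."""
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1 or not attributes:           # sufficiently pure, or nothing left to split on
        return {"label": Counter(labels).most_common(1)[0][0]}
    best = best_attribute(examples, attributes, target)   # the "best" decision attribute
    node = {"attribute": best, "children": {}}
    for value in {e[best] for e in examples}:             # one descendant per attribute value
        subset = [e for e in examples if e[best] == value]
        node["children"][value] = id3(subset, [a for a in attributes if a != best], target)
    return node

# Usage (with the PlayTennis table encoded as a list of dicts):
#   tree = id3(rows, ["Outlook", "Temperature", "Humidity", "Wind"])
```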
Example ID3
• Look at current training set S

• Determine best attribute

• Split training set according to different values

11
Example ID3

• Tree

• Apply algorithm recursively

12
Example – Resulting Tree

Outlook

Sunny Overcast Rain

Humidity Yes Wind

High Normal Strong Weak

No Yes No Yes

13
ID3 – Intermediate Summary
• Recursive splitting of the training set
• Stop, if current training set is sufficiently pure
• ... What does “pure” mean?
• … Can we allow for errors?

• What is the best attribute?


• How can we tell that the tree is really good?
• How shall we deal with continuous values?

14
Which attribute is best?

• Assume a training set S = {+, +, −, −, +, −, +, +, −, −} (class labels only)

• Assume binary attributes x1, x2, and x3

• Produced splits:

Value 1 Value 2
x1 {+, +, −, −, + } {−, +, +, −, −}
x2 {+} {+, −, −, +, −, +, +, −, − }
x3 {+, +, +, +, −} {−, −, −, −, + }

• No attribute is perfect
• Which one to choose?
15
Entropy

• p⊕ is the proportion of positive examples
• p⊖ is the proportion of negative examples
• Entropy measures the impurity of S:

Entropy(S) ≡ −p⊕ log₂ p⊕ − p⊖ log₂ p⊖

[Plot: Entropy(S) as a function of p⊕, rising from 0 at p⊕ = 0 to 1 at p⊕ = 0.5 and falling back to 0 at p⊕ = 1]

• Information can be seen as the negative of entropy

16
Entropy

• If all members of S belong to the same class:
  Entropy(S) = 0 (the purest set)
• If the numbers of positive and negative examples are equal (p⊕ = p⊖ = 0.5):
  Entropy(S) = 1 (maximum impurity)
• If the numbers of positive and negative examples are unequal:
  Entropy(S) is between 0 and 1

[Plot: the same Entropy(S) vs. p⊕ curve as on the previous slide]

17
Entropy

S = {+ + + + + + + + +, − − − − −} = {9+, 5−}.  Entropy(S) = ?
Entropy(S) = −(9/14) log₂(9/14) − (5/14) log₂(5/14) = 0.94

S = {+ + + + + + + +, − − − − − −} = {8+, 6−}.  Entropy(S) = ?
Entropy(S) = −(8/14) log₂(8/14) − (6/14) log₂(6/14) ≈ 0.99

S = {+ + + + + + + + + + + + + +} = {14+}.  Entropy(S) = ?
Entropy(S) = 0

S = {+ + + + + + +, − − − − − − −} = {7+, 7−}.  Entropy(S) = ?
Entropy(S) = 1
18
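These values are easy to verify numerically. A small sketch (the helper name entropy is my own) reproducing the four examples:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -p_plus * log2(p_plus) - p_minus * log2(p_minus), for any class counts."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

print(round(entropy(["+"] * 9 + ["-"] * 5), 2))   # 0.94
print(round(entropy(["+"] * 8 + ["-"] * 6), 2))   # 0.99
print(round(entropy(["+"] * 14), 2))              # -0.0, i.e. a pure set has zero entropy
print(round(entropy(["+"] * 7 + ["-"] * 7), 2))   # 1.0
```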
Entropy

Given: S = {+, +, −, −, +, −, +, +, −, −}

Value 1 Value 2
x1 {+, +, −, −, + } {−, + , + , −, − }
x2 {+} {+, −, −, +, −, +, +, −, − }
x3 {+, +, +, +, −} {−, −, −, −, + }

Which attribute is better (x1, x2 or x3)?

19
Information Gain
• Measuring attribute x creates subsets S1 and S2 with different entropies
• Taking the weighted mean of Entropy(S1) and Entropy(S2) gives the conditional entropy Entropy(S|x), i.e. in general:

Entropy(S|x) = Σ_{v ∈ Values(x)} (|Sv| / |S|) · Entropy(Sv)

• → Choose the attribute that maximizes the difference:

Gain(S, x) = Entropy(S) − Entropy(S|x)

• Gain(S, x) = expected reduction in entropy due to partitioning on x

20
Information Gain: Definition

Gain(S, x) = Entropy(S) − Σ_{v ∈ Values(x)} (|Sv| / |S|) · Entropy(Sv)

• Values(x): the set of all possible values for the attribute x
• Sv: the subset of S for which x has value v

• Information Gain is a measure of the effectiveness of an


attribute in classifying data.
• It is the expected reduction in entropy caused by
partitioning the objects according to this attribute.

21
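A minimal sketch of this definition (the helper names are my own); it also answers the x1/x2/x3 question posed earlier, showing that x3 gives the largest gain:

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(parent_labels, subsets):
    """Gain(S, x) = Entropy(S) - sum over v of |S_v|/|S| * Entropy(S_v); subsets are the S_v."""
    total = len(parent_labels)
    conditional = sum((len(sv) / total) * entropy(sv) for sv in subsets)
    return entropy(parent_labels) - conditional

# The x1 / x2 / x3 splits from the "Which attribute is best?" slide (S has 5+ and 5-):
S = ["+", "+", "-", "-", "+", "-", "+", "+", "-", "-"]
print(round(information_gain(S, [["+", "+", "-", "-", "+"], ["-", "+", "+", "-", "-"]]), 3))   # x1: 0.029
print(round(information_gain(S, [["+"], ["+", "-", "-", "+", "-", "+", "+", "-", "-"]]), 3))   # x2: 0.108
print(round(information_gain(S, [["+", "+", "+", "+", "-"], ["-", "-", "-", "-", "+"]]), 3))   # x3: 0.278
```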
Example - Training Set
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

22
Example
Gain(S, x) = Entropy(S) − Σ_{v ∈ Values(x)} (|Sv| / |S|) · Entropy(Sv)

Wind and PlayTennis values, sorted by Wind:
Weak:   D1 No, D8 No, D3 Yes, D4 Yes, D5 Yes, D9 Yes, D10 Yes, D13 Yes
Strong: D2 No, D6 No, D14 No, D7 Yes, D11 Yes, D12 Yes

For the top node: S = {9+, 5−}, Entropy(S) = 0.94

Attribute Wind:
S_weak = {6+, 2−}, |S_weak| = 8
S_strong = {3+, 3−}, |S_strong| = 6

Entropy(S_weak) = −(6/8) log₂(6/8) − (2/8) log₂(2/8) = 0.81
Entropy(S_strong) = 1

Expected entropy when splitting on attribute ’Wind’:
Entropy(S|Wind) = (8/14) · Entropy(S_weak) + (6/14) · Entropy(S_strong) = 0.89

Gain(S, Wind) = 0.94 − 0.89 ≈ 0.05
23
Selecting the Next Attribute
• For whole training set:
• Gain(S, Outlook) = 0.246
• Gain(S, Humidity) = 0.151
• Gain(S, Wind) = 0.048
• Gain(S, Temperature) = 0.029
→ Outlook should be used to split the training set!

• Further down in the tree, Entropy(S) is computed locally


• Usually, the tree does not have to be minimized
• This is one reason for the good performance of ID3!

24
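These four gains can be reproduced directly from the PlayTennis table. A sketch (helper names are my own; the printed values match the slide up to rounding):

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gain(rows, attr, target="PlayTennis"):
    """Information gain of splitting the rows on one attribute."""
    labels = [r[target] for r in rows]
    conditional = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        conditional += (len(subset) / len(rows)) * entropy(subset)
    return entropy(labels) - conditional

# The 14 training examples from the PlayTennis table.
columns = ("Outlook", "Temperature", "Humidity", "Wind", "PlayTennis")
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),          ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),      ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),       ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),      ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),    ("Rain", "Mild", "High", "Strong", "No"),
]
rows = [dict(zip(columns, r)) for r in data]
for attr in ("Outlook", "Humidity", "Wind", "Temperature"):
    print(attr, round(gain(rows, attr), 3))
# Outlook 0.247, Humidity 0.152, Wind 0.048, Temperature 0.029 -> split on Outlook first
```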
Next step in growing the decision tree

25
The Resulting Decision Tree & Its Rules

26
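Since the figure is not reproduced here: reading the tree from slide 13, each root-to-leaf path becomes one IF-THEN rule. A small sketch (the rule encoding and classify helper are my own illustration):

```python
# Each root-to-leaf path of the PlayTennis tree becomes one rule (conditions, class).
RULES = [
    ({"Outlook": "Sunny", "Humidity": "High"},   "No"),
    ({"Outlook": "Sunny", "Humidity": "Normal"}, "Yes"),
    ({"Outlook": "Overcast"},                    "Yes"),
    ({"Outlook": "Rain", "Wind": "Strong"},      "No"),
    ({"Outlook": "Rain", "Wind": "Weak"},        "Yes"),
]

def classify(example):
    """Return the class of the first rule whose conditions all match the example."""
    for conditions, label in RULES:
        if all(example.get(attr) == value for attr, value in conditions.items()):
            return label
    return None

print(classify({"Outlook": "Sunny", "Humidity": "High", "Wind": "Weak"}))   # No (e.g. D1, D8)
```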
Some issues: Real-Valued Attributes
• Temperature = 82.5
• Create discrete attributes to test continuous ones:
  • (Temperature > 54) = true or false
• Sort attribute values that occur in training set:

Temperature: 40 48 60 72 80 90
PlayTennis: No No Yes Yes Yes No

• Determine points where the class changes


• Candidates are (48 + 60) / 2 and (80 + 90) / 2
• Select best one using info gain
• Implemented in the system C4.5 (successor of ID3)

27
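A minimal sketch of this threshold-candidate procedure (helper names are my own), applied to the Temperature example above; it finds the candidates 54 and 85 and prefers 54 by information gain:

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def candidate_thresholds(values, labels):
    """Midpoints between consecutive sorted values where the class changes (C4.5-style)."""
    pairs = sorted(zip(values, labels))
    return [(a + b) / 2 for (a, la), (b, lb) in zip(pairs, pairs[1:]) if la != lb]

def threshold_gain(values, labels, t):
    """Information gain of the binary test (value > t)."""
    left  = [l for v, l in zip(values, labels) if v <= t]
    right = [l for v, l in zip(values, labels) if v > t]
    conditional = (len(left) / len(labels)) * entropy(left) + (len(right) / len(labels)) * entropy(right)
    return entropy(labels) - conditional

temperature = [40, 48, 60, 72, 80, 90]
play        = ["No", "No", "Yes", "Yes", "Yes", "No"]
for t in candidate_thresholds(temperature, play):
    print(t, round(threshold_gain(temperature, play, t), 3))
# 54.0 has the higher gain, so (Temperature > 54) becomes the discrete test.
```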
Some issues: Noise
• Consider adding a noisy (= wrongly labeled) training example #15:
  Sunny, Mild, Normal, Weak, PlayTennis = No
  (i.e. Outlook = Sunny and Humidity = Normal, yet PlayTennis = No)
• What effect on the earlier tree?
Outlook

Sunny Overcast Rain

Humidity Yes Wind

High Normal Strong Weak

No Yes No Yes

28
Some issues: Overfitting
Outlook

Sunny Overcast Rain

Humidity Yes Wind

High Normal Strong Weak

No Yes No Yes
• The algorithm will introduce a new test
• This is unnecessary, because the new example was erroneous (noise)
→ Overfitting corresponds to learning coincidental regularities
• Unfortunately, we generally don’t know which examples are noisy
• ... and also not the amount (e.g. the percentage) of noisy examples
29
Some issues: Overfitting
• An example: continuing to grow the tree can improve accuracy on the training data but degrade accuracy on the test data.

[Mitchell, 1997] 30
Overfitting: solutions
• Some solutions:
  • Stop learning early: stop growing the tree before it fits the training data perfectly
  • Prune the full tree: grow the tree to its full size, and then post-prune it
• It is hard to decide when to stop learning early
• Post-pruning empirically results in better performance, but:
  • How do we decide on a good size for the tree?
  • When do we stop pruning?
• We can use a validation set for pruning, e.g. reduced-error pruning or rule post-pruning

31
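One way to experiment with both ideas is sketched below, assuming scikit-learn is available (this is not the specific pruning method in the slides): max_depth implements early stopping, while ccp_alpha post-prunes the fully grown tree via cost-complexity pruning, with a validation set choosing the amount of pruning.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Early stopping: limit the depth so the tree cannot fit the training data perfectly.
shallow = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X_train, y_train)

# Post-pruning: compute the cost-complexity path, then pick the alpha that does best on validation data.
path = DecisionTreeClassifier(criterion="entropy").cost_complexity_pruning_path(X_train, y_train)
pruned = max(
    (DecisionTreeClassifier(criterion="entropy", ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas if a >= 0.0),
    key=lambda tree: tree.score(X_val, y_val),
)
print("early stopping:", shallow.score(X_val, y_val), " post-pruning:", pruned.score(X_val, y_val))
```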
Summary

• A decision tree is a machine learning algorithm that can perform both classification and regression tasks
• A decision tree represents a function as a tree of attribute tests
• Each decision tree can be interpreted as a set of IF-THEN rules
• Decision trees have been used in many practical applications

32
Type of Decision tree
• Hunt’s algorithm was developed in the 1960s to model human learning in Psychology

• ID3 (Quinlan, 1986)
  • Measure: Information Gain
  • Higher is better

• C4.5
  • Handles continuous-valued attributes
  • Measure: Information Gain or Gain Ratio
  • Higher is better

• CART
  • Measure: Gini impurity (how often a randomly chosen example would be misclassified if it were labeled randomly according to the class distribution)
  • Lower is better

33
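For comparison with entropy, a minimal sketch of the Gini impurity used by CART (the helper name is my own):

```python
from collections import Counter

def gini(labels):
    """Gini(S) = 1 - sum_i p_i^2: how often a randomly drawn example would be misclassified
    if it were labeled at random according to the class distribution of S."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

print(round(gini(["+"] * 9 + ["-"] * 5), 3))   # 0.459 for the 9+/5- PlayTennis set
print(gini(["+"] * 7 + ["-"] * 7))             # 0.5 (maximum impurity for two classes)
print(gini(["+"] * 14))                        # 0.0 (pure set; lower Gini is better)
```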
Advantages & Disadvantages
• Advantages
• It is simple to understand, as it follows the same process a human follows when making a decision in real life
• It can be very useful for solving decision-related problems
• It helps to think through all the possible outcomes of a problem
• It requires less data cleaning than many other algorithms

34
Advantages & Disadvantages
• Disadvantages
• A decision tree can contain many layers, which makes it complex
• It may have an overfitting issue, which can be mitigated by using the Random Forest algorithm
• With many class labels, the computational complexity of the decision tree may increase

35
Decision tree for Regression
• Reference:
https://saedsayad.com/decision_tree_reg.htm

36
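As a quick illustration of a regression tree, here is a sketch assuming scikit-learn (the synthetic dataset is chosen only for illustration): each leaf predicts the mean target value of its training examples, giving a piecewise-constant function.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)    # one continuous feature
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)  # noisy continuous target

# A shallow tree keeps the piecewise-constant fit from chasing the noise.
reg = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(reg.predict([[1.5], [4.0]]))                      # each prediction is a leaf's mean target value
```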
Random forests
• Random forests (RF) is a method by Leo Breiman (2001) for both classification and regression
• Main idea: the prediction is a combination of many decision trees, obtained by aggregating the individual predictions (majority vote for classification, average for regression)
• Each tree in an RF is simple but random
• Each tree is grown differently, depending on the choices of attributes and training data

37
Random forests
• RF currently is one of the most popular and accurate
methods [Fernández-Delgado et al., 2014]
• It is also very general.
• RF can be implemented easily and efficiently.
• It can work with problems of very high dimensions, without
overfitting

38
How Random Forests Work

39
RF: three basic ingredients
• Randomization and no pruning:
• For each tree and at each node, we select randomly a subset
of attributes.

• Find the best split, and then grow appropriate subtrees.


• Every tree will be grown to its largest size without pruning.

• Combination: the final prediction is made by aggregating the predictions of all individual trees (majority vote or average)
• Bagging: the training set for each tree is generated by
sampling (with replacement) from the original data.
40
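A sketch assuming scikit-learn, showing how the three ingredients map onto RandomForestClassifier parameters (the dataset is chosen only for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,      # combination: many trees, predictions aggregated across the ensemble
    max_features="sqrt",   # randomization: a random subset of attributes is considered at each node
    max_depth=None,        # no pruning: each tree is grown to its largest size
    bootstrap=True,        # bagging: each tree is trained on a bootstrap sample of the training set
    random_state=0,
).fit(X_train, y_train)

print("test accuracy:", rf.score(X_test, y_test))
```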
Exercise

1. Build the decision tree

2. Predict the class of:
   Customer (15, youth, medium, no, fair)
   Customer (16, senior, low, yes, excellent)
41
