Wk. 5.2. Decision Trees (27.10.2020)
Course Teacher:
Md. Aynul Hasan Nahid
Lecturer
Department of Computer Science and Engineering
Daffodil International University
Recommended Reading
2
Classification: Definition
• Given a collection of records (the training set)
– Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model. Usually, the given data set is
divided into training and test sets, with the training set used to build the model and the
test set used to validate it (a code sketch of this workflow follows below).
3
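As a hedged, concrete illustration of this train/validate workflow (not part of the original slides), the Python sketch below uses scikit-learn and its bundled Iris data as stand-ins; the dataset and parameter values are assumptions for demonstration only.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Each record is a set of attribute values; y holds the class attribute.
X, y = load_iris(return_X_y=True)

# Divide the given data set into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Build the model on the training set ...
model = DecisionTreeClassifier(criterion="entropy", random_state=0)
model.fit(X_train, y_train)

# ... and use the test set to estimate how accurately unseen records are classified.
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))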
Illustrating Classification Task
4
Classification Techniques
• Decision Tree-based Methods
• Rule-based Methods
• Memory-based Reasoning
• Neural Networks
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines
5
Example of a Decision Tree
[Training data: two categorical attributes, one continuous attribute, and the class attribute]
Splitting Attributes
Refund?
  Yes → NO
  No  → MarSt?
          Single, Divorced → TaxInc?
                               < 80K → NO
                               > 80K → YES
          Married → NO
6
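Read as code, the tree above is just nested if/else tests; here is a minimal Python sketch (the dictionary record format and the function name classify are illustrative assumptions, while the attribute names and the 80K threshold come from the slide):

def classify(record):
    # e.g. record = {"Refund": "No", "MarSt": "Single", "TaxInc": 95}
    if record["Refund"] == "Yes":
        return "NO"
    if record["MarSt"] == "Married":        # Refund = No branch
        return "NO"
    # Single or Divorced: test taxable income against the 80K threshold
    return "YES" if record["TaxInc"] > 80 else "NO"

print(classify({"Refund": "No", "MarSt": "Single", "TaxInc": 95}))   # YES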
Another Example of Decision Tree
[Same training data: categorical, categorical, continuous, and class attributes]
MarSt?
  Married → NO
  Single, Divorced → Refund?
                       Yes → NO
                       No  → TaxInc?
                                < 80K → NO
                                > 80K → YES
7
Decision Tree Classification Task
[Figure: induction of a Decision Tree model from the training set, then application of the model to the test set]
8
Apply Model to Test Data
Test Data: a record whose class attribute (Cheat) is unknown.
Start from the root of the tree and follow the branch that matches the record's value at each test node.
Refund?
  Yes → NO
  No  → MarSt?
          Single, Divorced → TaxInc?
                               < 80K → NO
                               > 80K → YES
          Married → NO
9
Apply Model to Test Data
Test Data
Refund?
  Yes → NO
  No  → MarSt?
          Single, Divorced → TaxInc?
                               < 80K → NO
                               > 80K → YES
          Married → NO
The record reaches a leaf labelled NO, so assign Cheat to “No”.
14
Decision Tree Terminology
[Figure: parts of a decision tree: the root node, internal (decision) nodes with their branches, and leaf (terminal) nodes holding class labels]
15
Decision Tree Classification Task
[Figure: induction of a Decision Tree model from the training set, then application of the model to the test set]
16
Decision Tree Induction
• Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ, SPRINT
17
Decision Tree Classifier
[Figure (example due to Ross Quinlan): scatter plot of the insect training examples, Abdomen Length (x-axis, 1–10) vs. Antenna Length (y-axis, 1–10), with the decision tree read off it, e.g. the test “Antenna Length > 6.0?” separating Grasshopper (no) from Katydid (yes)]
18
[Figure: a larger insect-classification tree with tests such as “Antennae shorter than body?” and “3 Tarsi?”, and leaves including Grasshopper and Cricket]
20
Decision Tree Classification
• Decision tree generation consists of two phases
– Tree construction
• At start, all the training examples are at the root
• Partition examples recursively based on selected attributes
– Tree pruning
• Identify and remove branches that reflect noise or outliers
• Use of decision tree: Classifying an unknown sample
– Test the attribute values of the sample against the decision tree
21
Decision Tree Representation
• Each internal node tests an attribute
• Each branch corresponds to an attribute value
• Each leaf node assigns a classification
[Figure: example decision tree with root attribute “outlook”, a branch for each of its values, and yes/no class labels at the leaves]
22
How do we construct the
decision tree?
• Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-conquer manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they can be discretized in
advance)
– Examples are partitioned recursively based on selected attributes.
– Test attributes are selected on the basis of a heuristic or statistical measure
(e.g., information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning – majority voting is
employed for classifying the leaf
– There are no samples left
23
Top-Down Decision Tree Induction
• Main loop:
1. A 🡨 the “best” decision attribute for next node
2. Assign A as decision attribute for node
3. For each value of A, create new descendant of node
4. Sort training examples to leaf nodes
5. If training examples perfectly classified,
Then STOP, Else iterate over new leaf nodes
24
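A compact Python sketch of this top-down, greedy loop (ID3-style; the (attributes, label) record format is an assumption, and best_attribute stands for the heuristic, e.g. information gain, discussed on the following slides):

from collections import Counter

def majority_class(examples):
    # examples: list of (attribute_dict, class_label) pairs
    return Counter(label for _, label in examples).most_common(1)[0][0]

def build_tree(examples, attributes, best_attribute):
    labels = {label for _, label in examples}
    if len(labels) == 1:                      # all samples belong to the same class
        return labels.pop()
    if not attributes:                        # no attributes left: majority voting
        return majority_class(examples)
    A = best_attribute(examples, attributes)  # step 1: pick the "best" attribute
    tree = {"attribute": A, "children": {}}   # step 2: A becomes the decision attribute
    for value in {x[A] for x, _ in examples}: # step 3: one descendant per value of A
        subset = [(x, c) for x, c in examples if x[A] == value]   # step 4: sort examples
        remaining = [a for a in attributes if a != A]
        tree["children"][value] = build_tree(subset, remaining, best_attribute)  # step 5
    return tree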
Tree Induction
• Greedy strategy
– Split the records based on an attribute test that optimizes a certain criterion.
• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting
25
How To Split Records
• Random Split
– The tree can grow huge
– These trees are hard to understand.
– Larger trees are typically less accurate than smaller trees.
• Principled Criterion
– Selection of an attribute to test at each node: choosing the most useful attribute for classifying examples.
– How?
– Information gain
• measures how well a given attribute separates the training examples
according to their target classification
• This measure is used to select among the candidate attributes at each step
while growing the tree
26
Tree Induction
• Greedy strategy:
– Split the records based on an attribute test that optimizes a certain criterion.
– Hunt’s algorithm: recursively partition the training records into successively purer subsets.
• How do we measure purity/impurity?
• Entropy and information gain (covered in the lectures slides)
• Gini (covered in the textbook)
• Classification error
27
How to determine the Best Split
Before splitting: 10 records of class 0 and 10 records of class 1.
[Figure: candidate splits on different attributes (e.g. Gender), each producing child nodes with different class distributions]
28
How to determine the Best Split
• Greedy approach:
– Nodes with homogeneous class distribution are preferred
• Need a measure of node impurity:
Non-homogeneous: high degree of impurity.   Homogeneous: low degree of impurity.
29
Picking a Good Split Feature
• Goal is to have the resulting tree be as small as possible, per Occam’s
razor.
• Finding a minimal decision tree (nodes, leaves, or depth) is an NP-hard
optimization problem.
• The top-down divide-and-conquer method does a greedy search for a simple tree but is not guaranteed to find the smallest.
– General lesson in Machine Learning and Data Mining: “Greed is good.”
• Want to pick a feature that creates subsets of examples that are relatively
“pure” in a single class so they are “closer” to being leaf nodes.
• There are a variety of heuristics for picking a good test; a popular one is based on information gain, which originated with Quinlan’s ID3 system (1979).
R. Mooney, UT Austin
30
Information Theory
• Think of playing "20 questions": I am thinking of an integer between 1 and
1,000 -- what is it? What is the first question you would ask?
• What question will you ask?
• Why?
• Entropy measures how much more information you need before you can
identify the integer.
• Initially, there are 1000 possible values, which we assume are equally
likely.
• What is the maximum number of questions you need to ask?
31
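For the 1-to-1,000 game the answer can be checked directly: each yes/no question can at best halve the set of remaining possibilities, so about log2(1000) ≈ 10 questions are needed. A one-line check in Python:

import math
print(math.ceil(math.log2(1000)))   # 10 questions suffice in the worst case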
Entropy
• Entropy (disorder, impurity) of a set of examples S, relative to a binary classification, is
  Entropy(S) = - p+ log2(p+) - p- log2(p-)
  where p+ and p- are the proportions of positive and negative examples in S.
R. Mooney, UT Austin
32
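A small Python sketch of this definition (the helper name is made up; it reproduces the 4F/5M value used in the worked example later in these slides):

import math

def entropy(pos, neg):
    # Entropy of a set with `pos` positive and `neg` negative examples
    total = pos + neg
    return -sum((c / total) * math.log2(c / total) for c in (pos, neg) if c > 0)

print(entropy(4, 5))   # ~0.9911
print(entropy(5, 5))   # 1.0, maximum disorder
print(entropy(9, 0))   # 0.0, a pure set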
Entropy Plot for Binary
Classification
• The entropy is 0 if the outcome is certain.
• The entropy is maximum if we have no knowledge of the system
(or any outcome is equally possible).
[Figure: entropy of a 2-class problem plotted against the proportion of one of the two classes]
33
Information Gain
• Information gain is the expected reduction in entropy caused by partitioning the examples according to an attribute A:
  Gain(S, A) = Entropy(S) - Σ (|S_v| / |S|) · Entropy(S_v),  summed over the values v of A
• Equivalently, it is the number of bits saved when encoding the target value of an arbitrary member of S, given the value of attribute A.
34
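A self-contained Python sketch of Gain(S, A) for a discrete attribute (the list-of-(attributes, label) record format is an assumption):

import math
from collections import Counter

def label_entropy(examples):
    # Entropy of the class labels in a list of (attribute_dict, label) pairs
    counts = Counter(label for _, label in examples)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, attribute):
    # Gain(S, A) = Entropy(S) - sum over values v of A of |S_v|/|S| * Entropy(S_v)
    total = len(examples)
    gain = label_entropy(examples)
    for value in {x[attribute] for x, _ in examples}:
        subset = [(x, label) for x, label in examples if x[attribute] == value]
        gain -= (len(subset) / total) * label_entropy(subset)
    return gain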
Information Gain in Decision
Tree Induction
• Assume that, using attribute A, the current set will be partitioned into some number of child sets. The information gained by branching on A is the parent set's entropy minus the weighted average entropy of those child sets.
35
Examples for Computing Entropy
Entropy at a node t: Entropy(t) = - Σ_j p(j | t) log2 p(j | t)
NOTE: p(j | t) is computed as the relative frequency of class j at node t
37
Splitting Based on INFO...
• Information Gain: choose the split that yields the largest gain.
• Each non-leaf node is a test; its edges partition the attribute’s values into subsets (easy for a discrete attribute).
• For a continuous attribute:
– Partition the continuous values of attribute A into a discrete set of intervals
– Create a new Boolean attribute A_c by looking for a threshold c. How should c be chosen? (One way is sketched below.)
39
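One common way to choose c (a sketch, not prescribed by the slides): sort the observed values of A, evaluate the information gain of the Boolean test A <= c at each midpoint between consecutive values, and keep the best.

import math
from collections import Counter

def entropy_of(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    # Pick the threshold c for "A <= c" that maximizes information gain.
    base = entropy_of(labels)
    points = sorted(set(values))
    best_c, best_gain = None, -1.0
    for c in ((a + b) / 2 for a, b in zip(points, points[1:])):   # candidate midpoints
        left = [y for v, y in zip(values, labels) if v <= c]
        right = [y for v, y in zip(values, labels) if v > c]
        gain = base - (len(left) / len(labels)) * entropy_of(left) \
                    - (len(right) / len(labels)) * entropy_of(right)
        if gain > best_gain:
            best_c, best_gain = c, gain
    return best_c, best_gain

# Made-up values for illustration only:
print(best_threshold([60, 75, 85, 95, 110], ["No", "No", "Yes", "Yes", "No"]))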
Person    Hair Length    Weight    Age    Class
Homer 0” 250 36 M
Marge 10” 150 34 F
Bart 2” 90 10 M
Lisa 6” 78 8 F
Maggie 4” 20 1 F
Abe 1” 170 70 M
Selma 8” 160 41 F
Otto 10” 180 38 M
Krusty 6” 200 45 M
Comic 8” 290 38 ?
40
Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911

Candidate split: Hair Length <= 5?
  yes: Entropy(1F,3M) = -(1/4)log2(1/4) - (3/4)log2(3/4) = 0.8113
  no:  Entropy(3F,2M) = -(3/5)log2(3/5) - (2/5)log2(2/5) = 0.9710
  Gain = 0.9911 - (4/9)(0.8113) - (5/9)(0.9710) = 0.0911

Candidate split: Weight <= 160?
  yes: Entropy(4F,1M) = -(4/5)log2(4/5) - (1/5)log2(1/5) = 0.7219
  no:  Entropy(0F,4M) = -(0/4)log2(0/4) - (4/4)log2(4/4) = 0
  Gain = 0.9911 - (5/9)(0.7219) - (4/9)(0) = 0.5900

Candidate split: Age <= 40?
  yes: Entropy(3F,3M) = -(3/6)log2(3/6) - (3/6)log2(3/6) = 1
  no:  Entropy(1F,2M) = -(1/3)log2(1/3) - (2/3)log2(2/3) = 0.9183
  Gain = 0.9911 - (6/9)(1) - (3/9)(0.9183) = 0.0183

Weight <= 160? gives the largest information gain and is chosen as the root split; the procedure is then applied recursively to its impure yes branch (4F,1M), e.g. with the test Hair Length <= 2?
44
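These numbers can be reproduced with a few lines of Python (the (F, M) counts are read off the table above; the helper is the same two-class entropy used earlier):

import math

def entropy2(f, m):
    total = f + m
    return -sum((c / total) * math.log2(c / total) for c in (f, m) if c > 0)

root = entropy2(4, 5)                                    # 0.9911
splits = {
    "Hair Length <= 5": ((1, 3), (3, 2)),                # (F, M) counts for yes / no
    "Weight <= 160":    ((4, 1), (0, 4)),
    "Age <= 40":        ((3, 3), (1, 2)),
}
for name, ((fy, my), (fn, mn)) in splits.items():
    total = fy + my + fn + mn
    gain = root - ((fy + my) / total) * entropy2(fy, my) \
                - ((fn + mn) / total) * entropy2(fn, mn)
    print(f"{name}: gain = {gain:.4f}")                   # 0.0911, 0.5900, 0.0183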
We don’t need to keep the data around, just the test conditions:
Weight <= 160?
  yes → Female
  no  → Male
45
It is trivial to convert Decision Trees to rules…
Weight <= 160?
  yes → Female
  no  → Male
[Figure: decision tree for a typical shared-care setting, applying the system to the diagnosis of prostatic obstructions]
47
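In code, the rule form of this stump is just an IF/ELSE (a sketch; the function name is made up, while the 160 threshold and the Female/Male leaves come from the tree above):

def classify_by_weight(weight):
    # Rule read off the tree: Weight <= 160 -> Female, otherwise Male
    if weight <= 160:
        return "Female"
    return "Male"

print(classify_by_weight(290))   # Male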
The worked examples we have seen were performed on small datasets. However, with small datasets there is a great danger of overfitting the data…
[Figure: the same training examples, labelled Female / Male]
For example, the rule “Wears green?” perfectly classifies the data; so does “Mother’s name is Jacqueline?”; so does “Has blue shoes”…
48
How to Find the Best Split: GINI
Before splitting: impurity M0 at the parent node.
Split A?: Yes → Node N1 (impurity M1), No → Node N2 (impurity M2); weighted child impurity M12.
Split B?: Yes → Node N3 (impurity M3), No → Node N4 (impurity M4); weighted child impurity M34.
Gain = M0 – M12 vs. M0 – M34 → choose the split with the larger gain.
49
Measure of Impurity: GINI (at node t)
• Gini index for a given node t with classes j:
  GINI(t) = 1 - Σ_j [p(j | t)]²
  where p(j | t) is the relative frequency of class j at node t
50
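A minimal Python sketch of this index (the class-count list format is an assumption):

def gini(class_counts):
    # GINI(t) = 1 - sum_j p(j|t)^2, with p(j|t) the relative frequency of class j at t
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

print(gini([10, 0]))   # 0.0 -> pure node (best)
print(gini([5, 5]))    # 0.5 -> worst case for a 2-class problem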
More on Gini
• Worst Gini corresponds to probabilities of 1/nc, where nc is
the number of classes.
– For 2-class problems the worst Gini will be ½
• How do we get the best Gini? Come up with an example for
node t with 10 examples for classes C1 and C2
– 10 C1 and 0 C2
– Now what is the Gini?
• 1 – [(10/10)² + (0/10)²] = 1 – [1 + 0] = 0
– So 0 is the best Gini
• So for 2-class problems:
– Gini varies from 0 (best) to ½ (worst).
51
Some More Examples
• Below we see the Gini values for 4 nodes with
different distributions. They are ordered from best to
worst. See next slide for details
– Note that thus far we are only computing GINI for one
node. We need to compute it for a split and then compute
the change in Gini from the parent node.
52
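Computing GINI for a split, as noted above, means weighting each child node's Gini by its share of the records and comparing the result with the parent; a sketch with made-up counts:

def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(children):
    # children: per-child class-count lists, e.g. [[7, 3], [3, 7]]
    total = sum(sum(c) for c in children)
    return sum((sum(c) / total) * gini(c) for c in children)

parent = [10, 10]                    # before splitting: 10 of class 0, 10 of class 1
split_a = [[7, 3], [3, 7]]           # hypothetical split A
split_b = [[10, 2], [0, 8]]          # hypothetical split B
for name, split in [("A", split_a), ("B", split_b)]:
    print(name, "gain =", round(gini(parent) - gini_split(split), 4))   # B wins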
Examples for computing GINI
54
Examples for Computing Error
Classification error at a node t: Error(t) = 1 - max_j p(j | t)
55
Comparison among Splitting Criteria
For a 2-class problem:
[Figure: entropy, Gini index, and misclassification error plotted against the proportion p of records in one of the two classes]
56
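The comparison can also be made numerically for a 2-class node as a function of the fraction p of records in the first class (a small sketch; the grid of p values is arbitrary):

import math

def entropy(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def gini(p):
    return 2 * p * (1 - p)        # = 1 - p**2 - (1 - p)**2

def error(p):
    return min(p, 1 - p)          # = 1 - max(p, 1 - p)

for p in (0.0, 0.1, 0.25, 0.5):
    print(f"p={p:0.2f}  entropy={entropy(p):.3f}  gini={gini(p):.3f}  error={error(p):.3f}")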
57