
Summer 2022

Data Mining and Machine Learning


(CSE 321)
Topic – 5.2: Decision Trees

Course Teacher:
Md. Aynul Hasan Nahid
Lecturer
Department of Computer Science and Engineering
Daffodil International University
Recommended Reading

• “Introduction to Data Mining,” Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Addison Wesley, 2006.
  ☞ Chapter 4 (Classification: Basic Concepts, Decision Trees, and Model Evaluation)

2
Classification: Definition
• Given a collection of records (the training set)
  – Each record contains a set of attributes; one of the attributes is the class.
• Find a model that predicts the class attribute as a function of the
values of the other attributes.
• Goal: previously unseen records should be
assigned a class as accurately as possible.
  – A test set is used to determine the accuracy of the
model. Usually, the given data set is divided into training
and test sets, with the training set used to build the model
and the test set used to validate it.
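
A minimal sketch of this workflow in Python, assuming scikit-learn and pandas are available (the file name "records.csv" and the column name "class" are illustrative, not from the slides):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

data = pd.read_csv("records.csv")        # hypothetical collection of records
X = data.drop(columns=["class"])         # the other attributes
y = data["class"]                        # the class attribute
# (categorical attributes would need to be encoded numerically first)

# Divide the given data set into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier()         # build the model on the training set
model.fit(X_train, y_train)

# Use the test set to estimate accuracy on previously unseen records
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))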

3
Illustrating Classification Task

4
Classification Techniques
• Decision Tree-based Methods
• Rule-based Methods
• Memory-based Reasoning
• Neural Networks
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines

5
Example of a Decision Tree
[Training data table: two categorical attributes (Refund, Marital Status), one continuous attribute (Taxable Income), and the class (Cheat)]

Splitting Attributes:

Refund
  Yes → NO
  No  → MarSt
          Single, Divorced → TaxInc
                               < 80K → NO
                               > 80K → YES
          Married → NO

Training Data                      Model: Decision Tree

6
Another Example of Decision Tree
[Same training data as before]

MarSt
  Single, Divorced → Refund
                       Yes → NO
                       No  → TaxInc
                                < 80K → NO
                                > 80K → YES
  Married → NO

There could be more than one tree that fits the same data!

7
Decision Tree Classification Task

Decision Tree

8
Apply Model to Test Data
Test Data
Start from the root of the tree.

Refund
  Yes → NO
  No  → MarSt
          Single, Divorced → TaxInc
                               < 80K → NO
                               > 80K → YES
          Married → NO

9
Apply Model to Test Data
Test Data

Refund
  Yes → NO
  No  → MarSt
          Single, Divorced → TaxInc
                               < 80K → NO
                               > 80K → YES
          Married → NO  (assign Cheat to “No”)

14
Decision Tree Terminology

15
Decision Tree Classification Task

Decision Tree

16
Decision Tree Induction
• Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ,SPRINT

• John Ross Quinlan is a computer science researcher in data mining and
decision theory. He has contributed extensively to the development of
decision tree algorithms, including inventing the canonical C4.5 and ID3
algorithms.

17
Decision Tree Classifier

[Scatter plot of training insects: Abdomen Length (x-axis, 1–10) vs. Antenna Length (y-axis, 1–10); photo of Ross Quinlan]

Abdomen Length > 7.1?
  no  → Antenna Length > 6.0?
           no  → Grasshopper
           yes → Katydid
  yes → Katydid

18
Antennae shorter than body?
  Yes → Grasshopper
  No  → 3 Tarsi?
          Yes → Cricket
          No  → Foretibia has ears?
                   Yes → Katydid
                   No  → Camel Cricket

Decision trees predate computers.

19


Definition
● A decision tree is a classifier in the form of a tree structure
– Decision node: specifies a test on a single attribute
– Leaf node: indicates the value of the target attribute
– Arc/edge: one outcome of the split on an attribute
– Path: a conjunction of tests leading to the final decision

● Decision trees classify instances or examples by starting at the root of the
tree and moving through it until a leaf node is reached.

20
Decision Tree Classification
• Decision tree generation consists of two phases
– Tree construction
• At start, all the training examples are at the root
• Partition examples recursively based on selected attributes
– Tree pruning
• Identify and remove branches that reflect noise or outliers
• Use of decision tree: Classifying an unknown sample
– Test the attribute values of the sample against the decision tree

21
Decision Tree Representation
• Each internal node tests an attribute
• Each branch corresponds to attribute value
• Each leaf node assigns a classification

outlook
  sunny    → humidity
                high   → no
                normal → yes
  overcast → yes
  rain     → wind
                strong → no
                weak   → yes
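
One simple way to hold such a tree in code is nested dictionaries keyed by attribute name and then by attribute value; the sketch below (our own illustration, not from the slides) encodes the outlook tree above and classifies an example by walking from the root to a leaf:

weather_tree = {
    "outlook": {
        "sunny":    {"humidity": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rain":     {"wind": {"strong": "no", "weak": "yes"}},
    }
}

def classify(tree, example):
    # Walk down the tree until we reach a leaf (a plain class label)
    while isinstance(tree, dict):
        attribute = next(iter(tree))                 # attribute tested at this node
        tree = tree[attribute][example[attribute]]   # follow the matching branch
    return tree

print(classify(weather_tree, {"outlook": "sunny", "humidity": "normal"}))  # yes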

22
How do we construct the
decision tree?
• Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-conquer manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they can be discretized in
advance)
– Examples are partitioned recursively based on selected attributes.
– Test attributes are selected on the basis of a heuristic or statistical measure
(e.g., information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning – majority voting is
employed for classifying the leaf
– There are no samples left

23
Top-Down Decision Tree Induction

• Main loop:
1. A 🡨 the “best” decision attribute for next node
2. Assign A as decision attribute for node
3. For each value of A, create new descendant of node
4. Sort training examples to leaf nodes
5. If training examples perfectly classified,
Then STOP, Else iterate over new leaf nodes
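
A minimal sketch of this main loop in Python, assuming categorical attributes, examples stored as dictionaries, and information gain as the "best attribute" heuristic (function names such as build_tree are ours, not a reference implementation):

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_attribute(examples, labels, attributes):
    # Step 1: pick the attribute with the highest information gain
    def gain(a):
        remainder = 0.0
        for v in set(ex[a] for ex in examples):
            subset = [lab for ex, lab in zip(examples, labels) if ex[a] == v]
            remainder += len(subset) / len(labels) * entropy(subset)
        return entropy(labels) - remainder
    return max(attributes, key=gain)

def build_tree(examples, labels, attributes):
    if len(set(labels)) == 1:                       # perfectly classified: STOP
        return labels[0]
    if not attributes:                              # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(examples, labels, attributes)         # steps 1-2
    tree = {a: {}}
    for v in set(ex[a] for ex in examples):                  # step 3: one descendant per value
        idx = [i for i, ex in enumerate(examples) if ex[a] == v]   # step 4: sort examples
        tree[a][v] = build_tree([examples[i] for i in idx],
                                [labels[i] for i in idx],
                                [b for b in attributes if b != a])  # step 5: iterate
    return tree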

24
Tree Induction
• Greedy strategy.
– Split the records based on an attribute test that optimizes a certain criterion.

• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting

25
How To Split Records
• Random Split
– The tree can grow huge
– These trees are hard to understand.
– Larger trees are typically less accurate than smaller trees.

• Principled Criterion
– Selection of an attribute to test at each node - choosing the most useful attribute
for classifying examples.
– How?
– Information gain
• measures how well a given attribute separates the training examples
according to their target classification
• This measure is used to select among the candidate attributes at each step
while growing the tree

26
Tree Induction
• Greedy strategy:
– Split the records based on an attribute test that optimizes a certain criterion
– Hunt’s algorithm: recursively partition training records into
successively purer subsets. How do we measure purity/impurity?
• Entropy and information gain (covered in the lecture slides)
• Gini (covered in the textbook)
• Classification error

27
How to determine the Best Split
Before Splitting: 10 records of class 0,
10 records of class 1

[Figure: candidate test conditions, e.g., splitting on Gender vs. splitting on Student ID]

Which test condition is the best?
Why is Student ID a bad feature to use?

28
How to determine the Best Split
• Greedy approach:
– Nodes with homogeneous class distribution are preferred
• Need a measure of node impurity:

Non-homogeneous, Homogeneous,
High degree of impurity Low degree of impurity

29
Picking a Good Split Feature
• Goal is to have the resulting tree be as small as possible, per Occam’s
razor.
• Finding a minimal decision tree (nodes, leaves, or depth) is an NP-hard
optimization problem.
• The top-down divide-and-conquer method does a greedy search for a simple
tree but is not guaranteed to find the smallest.
– General lesson in Machine Learning and Data Mining: “Greed is good.”
• Want to pick a feature that creates subsets of examples that are relatively
“pure” in a single class so they are “closer” to being leaf nodes.
• There are a variety of heuristics for picking a good test; a popular one is
based on information gain, which originated with the ID3 system of Quinlan
(1979).

R. Mooney, UT Austin
30
Information Theory
• Think of playing "20 questions": I am thinking of an integer between 1 and
1,000 -- what is it? What is the first question you would ask?
• Why?

• Entropy measures how much more information you need before you can
identify the integer.
• Initially, there are 1000 possible values, which we assume are equally
likely.
• What is the maximum number of questions you need to ask? (At most ⌈log2 1000⌉ = 10, if each question halves the remaining range.)

31
Entropy
• Entropy (disorder, impurity) of a set of examples S, relative to a binary
classification, is:

Entropy(S) = – p1 log2(p1) – p0 log2(p0)

where p1 is the fraction of positive examples in S and p0 is the fraction of
negatives.
• If all examples are in one category, entropy is zero (we define 0⋅log(0)=0)
• If examples are equally mixed (p1=p0=0.5), entropy is a maximum of 1.
• Entropy can be viewed as the number of bits required on average to encode
the class of an example in S where data compression (e.g. Huffman coding) is
used to give shorter codes to more likely cases.
• For multi-class problems with c categories, entropy generalizes to:

Entropy(S) = – Σi pi log2(pi),  summed over the c categories, where pi is the fraction of examples in category i
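
As a quick sketch, the same definition in Python (the function name is ours; zero probabilities are skipped so that 0·log(0) counts as 0):

import math

def entropy(proportions):
    # proportions: the class fractions p1, ..., pc of the examples in S (summing to 1)
    return -sum(p * math.log2(p) for p in proportions if p > 0)

print(entropy([0.5, 0.5]))   # 1.0  -- equally mixed binary set
print(entropy([1.0, 0.0]))   # 0.0  -- all examples in one category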

R. Mooney, UT Austin
32
Entropy Plot for Binary
Classification
• The entropy is 0 if the outcome is certain.
• The entropy is maximum if we have no knowledge of the system
(or any outcome is equally possible).

[Figure: entropy of a 2-class problem as a function of the proportion of one of the two classes]

33
Information Gain
• Information gain is the expected reduction in entropy caused by partitioning the examples
according to a given attribute A:

Gain(S, A) = Entropy(S) – Σv (|Sv| / |S|) Entropy(Sv),  summed over the values v of A

• Equivalently, it is the number of bits saved when encoding the target value of an arbitrary
member of S, by knowing the value of attribute A.
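
A small sketch of Gain(S, A) in Python (our own helper names; `values` holds each example's value of attribute A and `labels` the corresponding classes):

import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(values, labels):
    n = len(labels)
    remainder = 0.0
    for v in set(values):                    # one subset S_v per value of A
        subset = [lab for val, lab in zip(values, labels) if val == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder       # Entropy(S) minus weighted subset entropy

print(information_gain(["a", "a", "b", "b"], ["+", "+", "-", "-"]))  # 1.0, a perfect split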

34
Information Gain in Decision
Tree Induction
• Assume that using attribute A, the current set S will be partitioned into some
number of child sets S1, ..., Sv

• The encoding information that would be gained by branching on A:

Gain(A) = Entropy(S) – Σi (|Si| / |S|) Entropy(Si)

Note: entropy is at its minimum (zero) when the collection of objects is completely pure, i.e., all objects belong to one class

35
Examples for Computing Entropy
NOTE: p( j | t) is computed as the relative frequency of class j at node t

P(C1) = 0/6 = 0 P(C2) = 6/6 = 1


Entropy = – 0 log2 0 – 1 log2 1 = – 0 – 0 = 0

P(C1) = 1/6 P(C2) = 5/6


Entropy = – (1/6) log2 (1/6) – (5/6) log2 (5/6) = 0.65

P(C1) = 2/6 P(C2) = 4/6


Entropy = – (2/6) log2 (2/6) – (4/6) log2 (4/6) = 0.92

P(C1) = 3/6=1/2 P(C2) = 3/6 = 1/2


Entropy = – (1/2) log2 (1/2) – (1/2) log2 (1/2) = –(1/2)(–1) – (1/2)(–1) = ½ + ½ = 1

36
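
The arithmetic above can be re-checked with a few lines of Python (a throwaway sketch):

import math

def entropy2(p1, p2):
    return -sum(p * math.log2(p) for p in (p1, p2) if p > 0)

print(entropy2(0/6, 6/6))   # 0.0
print(entropy2(1/6, 5/6))   # ~0.65
print(entropy2(2/6, 4/6))   # ~0.92
print(entropy2(3/6, 3/6))   # 1.0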
How to Calculate log2x
• Many calculators only have buttons for log10(x) and
loge(x) (note: log with no base typically means log10)
• You can calculate the log for any base b as follows:
– logb(x) = logk(x) / logk(b)
– Thus log2(x) = log10(x) / log10(2)
– Since log10(2) = .301, just calculate the log base 10 and
divide by .301 to get log base 2.
– You can use this for HW if needed
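
In Python, for example, the change of base is not even needed, though both routes give the same answer:

import math

x = 8
print(math.log10(x) / math.log10(2))   # change of base: 3.0
print(math.log2(x))                    # direct log base 2: 3.0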

37
Splitting Based on INFO...

• Information Gain:

GAINsplit = Entropy(p) – Σi (ni / n) Entropy(i),  i = 1, ..., k

where the parent node p is split into k partitions and
ni is the number of records in partition i
– Measures the reduction in entropy achieved because of the
split. Choose the split that achieves the most reduction
(maximizes GAIN)
– Used in ID3 and C4.5
– Disadvantage: tends to prefer splits that result in a large
number of partitions, each being small but pure.
Continuous Attribute?
(more on it later)

• Each non-leaf node is a test; its edges partition the attribute values into
subsets (easy for a discrete attribute).
• For a continuous attribute
– Partition the continuous values of attribute A into a discrete set of
intervals
– Create a new Boolean attribute Ac that is true if A < c and false otherwise,
for some threshold c

How do we choose c?
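
One common way to choose c (an assumption on our part, not spelled out on this slide) is to sort the examples by A, take candidate thresholds at midpoints between adjacent distinct values, and keep the candidate with the highest information gain. A sketch:

import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def best_threshold(values, labels):
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_c, best_gain = None, -1.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                  # no boundary between equal values
        c = (pairs[i - 1][0] + pairs[i][0]) / 2       # candidate threshold (midpoint)
        left = [lab for v, lab in pairs if v < c]     # Ac is true
        right = [lab for v, lab in pairs if v >= c]   # Ac is false
        g = entropy(labels) - len(left) / n * entropy(left) - len(right) / n * entropy(right)
        if g > best_gain:
            best_c, best_gain = c, g
    return best_c, best_gain

# Toy example: six income values (in K) and a binary class
print(best_threshold([60, 70, 75, 85, 90, 95], ["No", "No", "No", "Yes", "Yes", "Yes"]))
# -> (80.0, 1.0): the threshold 80K separates the classes perfectly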

39
Person   Hair Length   Weight   Age   Class
Homer    0"            250      36    M
Marge    10"           150      34    F
Bart     2"            90       10    M
Lisa     6"            78       8     F
Maggie   4"            20       1     F
Abe      1"            170      70    M
Selma    8"            160      41    F
Otto     10"           180      38    M
Krusty   6"            200      45    M

Comic    8"            290      38    ?

40
Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9)
= 0.9911
Let us try splitting on Hair Length:

Hair Length <= 5?
  yes (1F,3M): Entropy = -(1/4)log2(1/4) - (3/4)log2(3/4) = 0.8113
  no  (3F,2M): Entropy = -(3/5)log2(3/5) - (2/5)log2(2/5) = 0.9710

Gain(Hair Length <= 5) = 0.9911 – (4/9 * 0.8113 + 5/9 * 0.9710) = 0.0911


41
Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9)
= 0.9911
Let us try splitting on Weight:

Weight <= 160?
  yes (4F,1M): Entropy = -(4/5)log2(4/5) - (1/5)log2(1/5) = 0.7219
  no  (0F,4M): Entropy = -(0/4)log2(0/4) - (4/4)log2(4/4) = 0

Gain(Weight <= 160) = 0.9911 – (5/9 * 0.7219 + 4/9 * 0) = 0.5900

42


Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9)
= 0.9911
Let us try splitting on Age:

Age <= 40?
  yes (3F,3M): Entropy = -(3/6)log2(3/6) - (3/6)log2(3/6) = 1
  no  (1F,2M): Entropy = -(1/3)log2(1/3) - (2/3)log2(2/3) = 0.9183

Gain(Age <= 40) = 0.9911 – (6/9 * 1 + 3/9 * 0.9183) = 0.0183

43
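
The three gains above can be reproduced with a short script over the table from the earlier slide (a verification sketch; the helper names are ours):

import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def gain(rows, labels, test):
    yes = [lab for row, lab in zip(rows, labels) if test(row)]
    no = [lab for row, lab in zip(rows, labels) if not test(row)]
    n = len(labels)
    return (entropy(labels)
            - len(yes) / n * (entropy(yes) if yes else 0.0)
            - len(no) / n * (entropy(no) if no else 0.0))

# (hair length, weight, age, class) rows from the table on the earlier slide
people = [(0, 250, 36, "M"), (10, 150, 34, "F"), (2, 90, 10, "M"),
          (6, 78, 8, "F"), (4, 20, 1, "F"), (1, 170, 70, "M"),
          (8, 160, 41, "F"), (10, 180, 38, "M"), (6, 200, 45, "M")]
rows = [p[:3] for p in people]
labels = [p[3] for p in people]

print(gain(rows, labels, lambda r: r[0] <= 5))    # Hair Length <= 5  -> ~0.0911
print(gain(rows, labels, lambda r: r[1] <= 160))  # Weight <= 160     -> ~0.5900
print(gain(rows, labels, lambda r: r[2] <= 40))   # Age <= 40         -> ~0.0183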


Of the 3 features we had, Weight was best.
But while people who weigh over 160 are
perfectly classified (as males), the under 160
people are not perfectly classified… So we
simply recurse!
Weight <= 160?   (yes / no)

This time we find that we can split on Hair Length, and we are done!

Hair Length <= 2?   (yes / no)

44
We don’t need to keep the data around, just the test conditions:

Weight <= 160?
  yes → Hair Length <= 2?
           yes → Male
           no  → Female
  no  → Male

How would these people be classified?

45
It is trivial to convert Decision Trees to rules…

Weight <= 160?
  yes → Hair Length <= 2?
           yes → Male
           no  → Female
  no  → Male

Rules to Classify Males/Females:

If Weight greater than 160, classify as Male
Elseif Hair Length less than or equal to 2, classify as Male
Else classify as Female
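
Written as ordinary code, the three rules above are just an if/elif/else chain:

def classify(weight, hair_length):
    if weight > 160:
        return "Male"
    elif hair_length <= 2:
        return "Male"
    else:
        return "Female"

print(classify(weight=290, hair_length=8))   # the Comic row from the earlier table -> Male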
46
Once we have learned the decision tree, we don’t even need a computer!
This decision tree is attached to a medical machine, and is designed to help
nurses make decisions about what type of doctor to call.

Decision tree for a typical shared-care setting applying the system for the
diagnosis of prostatic obstructions.
47
The worked examples we have seen were
performed on small datasets. However, with
small datasets there is a great danger of
overfitting the data…

When you have few data points, there are
many possible splitting rules that perfectly
classify the data but will not generalize to
future datasets.

Wears green?
  Yes → Female
  No  → Male

For example, the rule “Wears green?” perfectly classifies the data; so does
“Mother’s name is Jacqueline?”; so does “Has blue shoes?”…
48
How to Find the Best Split: GINI

Before splitting, the parent node has impurity M0.

Split on A?  (Yes / No): child nodes N1 and N2, with impurities M1 and M2; weighted impurity M12
Split on B?  (Yes / No): child nodes N3 and N4, with impurities M3 and M4; weighted impurity M34

Compare Gain = M0 – M12 vs. M0 – M34 and choose the split with the larger gain.

49
Measure of Impurity: GINI (at node t)
• Gini Index for a given node t with classes j:

GINI(t) = 1 – Σj [p( j | t)]^2

NOTE: p( j | t) is computed as the relative frequency of class j at node t

• Example: Two classes C1 & C2 and node t has 5 C1
and 5 C2 examples. Compute Gini(t)
– Gini(t) = 1 – [p(C1|t)^2 + p(C2|t)^2] = 1 – [(5/10)^2 + (5/10)^2]
– = 1 – [¼ + ¼] = ½.
– Do you think this Gini value indicates a good split or bad
split? Is it an extreme value?
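
A one-function sketch of the Gini index from per-class counts (our own helper, matching the formula above):

def gini(counts):
    # counts: number of examples of each class at node t
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([5, 5]))    # 0.5 -- the maximally mixed 2-class node computed above
print(gini([10, 0]))   # 0.0 -- a pure node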

50
More on Gini
• Worst Gini corresponds to probabilities of 1/nc, where nc is
the number of classes.
– For 2-class problems the worst Gini will be ½
• How do we get the best Gini? Come up with an example for
node t with 10 examples for classes C1 and C2
– 10 C1 and 0 C2
– Now what is the Gini?
• 1 – [(10/10)^2 + (0/10)^2] = 1 – [1 + 0] = 0
– So 0 is the best Gini
• So for 2-class problems:
– Gini varies from 0 (best) to ½ (worst).

51
Some More Examples
• Below we see the Gini values for 4 nodes with
different distributions. They are ordered from best to
worst. See next slide for details
– Note that thus far we are only computing GINI for one
node. We need to compute it for a split and then compute
the change in Gini from the parent node.

52
Examples for computing GINI

P(C1) = 0/6 = 0 P(C2) = 6/6 = 1


Gini = 1 – P(C1)^2 – P(C2)^2 = 1 – 0 – 1 = 0

P(C1) = 1/6 P(C2) = 5/6


Gini = 1 – (1/6)^2 – (5/6)^2 = 0.278

P(C1) = 2/6 P(C2) = 4/6


Gini = 1 – (2/6)^2 – (4/6)^2 = 0.444
Splitting Criteria based on
Classification Error
• Classification error at a node t:

Error(t) = 1 – maxj p( j | t)
• Measures misclassification error made by a node.


• Maximum (1 - 1/nc) when records are equally distributed among all
classes, implying least interesting information
• Minimum (0.0) when all records belong to one class, implying most
interesting information

54
Examples for Computing Error

P(C1) = 0/6 = 0 P(C2) = 6/6 = 1


Error = 1 – max (0, 1) = 1 – 1 = 0

P(C1) = 1/6 P(C2) = 5/6


Error = 1 – max (1/6, 5/6) = 1 – 5/6 = 1/6

P(C1) = 2/6 P(C2) = 4/6


Error = 1 – max (2/6, 4/6) = 1 – 4/6 = 1/3

55
Comparison among Splitting Criteria
For a 2-class problem:
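
The comparison figure is not reproduced here, but the three criteria are easy to tabulate for a 2-class node as the fraction p of class 1 varies (all three are 0 for a pure node and peak at p = 0.5):

import math

def entropy(p):
    return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def gini(p):
    return 1.0 - p ** 2 - (1 - p) ** 2

def class_error(p):
    return 1.0 - max(p, 1 - p)

for p in [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]:
    print(f"p={p:.1f}  entropy={entropy(p):.3f}  gini={gini(p):.3f}  error={class_error(p):.3f}")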

56
57
