08 - Classification - Decision Trees
Decision Trees
1
Agenda
Classification: Definition
Hunt’s Algorithm
C4.5 Algorithm
Impurity Measures
Information Gain
2
Supervised vs. Unsupervised Learning
Supervised learning (classification)
The training data (observations, measurements, etc.) are accompanied
by labels indicating the class of the observations.
New data is classified based on the training set.
3
Prediction Problems: Classification vs. Numeric
Prediction
Classification
predicts categorical class labels (discrete or nominal)
classifies data (constructs a model) based on the training set and the
values (class labels) in a classifying attribute and uses it in classifying
new data
Numeric Prediction
models continuous-valued functions, i.e.,
predicts unknown or missing values
4
Classification—A Two-Step Process
Model construction: describing a set of predetermined classes.
5
Classification—A Two-Step Process
Model usage: for classifying future or unknown objects.
6
Model Construction
[Figure: a classification algorithm is applied to the training data to build a classifier; the classifier is then evaluated on testing data (Accuracy = ?) and applied to unseen data, e.g., the record (Jeff, Professor, 4).]
9
Classification: Definition
Given a set of data records D (training set).
10
Classification: Definition
Goal:
To learn a classification model from the training data that can be used to
predict the classes of new instances/cases.
Each new object (record) is then assigned to one of the classes using the
classification model as accurately as possible.
Usually, the given data set is divided into training and test sets.
The training set is used to build the model
11
An example: Data (Loan Application)
Approved (yes/no)
12
An example: the learning task
Goal:
Learn a classification model from the data.
Use the model to classify future loan applications into
Yes (approved) and
No (not approved)
13
Examples of Classification
Credit/Loan approval
yes/no
14
Motivating Application: Credit Approval
A bank wants to classify its customers based on whether they are
expected to pay back their approved loans.
Classification rule:
If age = “31...40” and income = high then credit_rating = excellent
Future customers
Paul: age = 35, income = high → excellent credit rating
John: age = 20, income = medium → fair credit rating
15
What do we mean by learning?
16
What do we mean by learning?
Given
a data set D
a task T
a performance measure M
the computer system is said to learn from D to perform the task T if, after learning, its performance on T improves as measured by M.
In other words, the learned model helps the system to perform T better
as compared to without learning.
17
The Traditional Programming Paradigm
[Figure: in the traditional programming paradigm, a programmer hand-writes the rules (e.g., rules to filter emails) and the program applies them to inputs such as emails.]
18
The Data Mining Paradigm
[Figure: in the data mining paradigm, a classification algorithm learns the rules (the model) automatically from example data.]
19
What is a Decision Tree?
20
Creating a Decision Tree
[Figure: a scatter plot of training points in the (x1, x2) plane containing two classes, red x and blue o.]
21
Creating a Decision Tree
[Figure: the first split, X2 < 2.5, divides the plane at x2 = 2.5; the Yes branch contains only blue circles, while the No branch is still mixed.]
22
Creating a Decision Tree
[Figure: the same split X2 < 2.5; the Yes branch (blue circles) is pure, while the No branch is mixed and needs further splitting.]
23
Creating a Decision Tree
[Figure: the mixed No branch is split again on X1 < 2, giving the tree X2 < 2.5 → (Yes: blue circle; No: X1 < 2 → (Yes: blue circle; No: red X)).]
24
Training Data with Objects
25
Building The Tree:
we choose “age” as a root
[Tree so far: root node “age” with branches <=30, 31…40 and >40; the 31…40 branch is pure and becomes a leaf with class = yes.]
27
Building The Tree:
we chose “student” on <=30 branch
[Tree so far: root “age”; the <=30 branch is split on “student” (no / yes); the 31…40 branch is a leaf with class = yes; the >40 branch is still unsplit.]
Records on the <=30 branch (income, credit, class): (high, fair, no), (low, fair, yes), (high, excellent, no), (medium, excellent, yes), (medium, fair, no).
Records on the >40 branch (income, student, credit, class): (medium, no, fair, yes), (low, yes, fair, yes), (low, yes, excellent, no), (medium, yes, fair, yes), (medium, no, excellent, no).
28
Building The Tree:
we chose “student” on <=30 branch
[Tree so far: root “age”; <=30 → “student” (no → class = no; yes → class = yes); 31…40 → class = yes; the >40 branch is still unsplit, holding five records (income, student, credit, class): (medium, no, fair, yes), (low, yes, fair, yes), (low, yes, excellent, no), (medium, yes, fair, yes), (medium, no, excellent, no).]
29
Building The Tree:
we chose “credit” on >40 branch
[Tree so far: root “age”; <=30 → “student” (no → class = no; yes → class = yes); 31…40 → class = yes; >40 → “credit” (excellent branch holds records (low, yes, no) and (medium, no, no); fair branch holds records (medium, no, yes), (low, yes, yes) and (medium, yes, yes)).]
30
Finished Tree for class=“buys”
[Final tree: root “age”; <=30 → “student” (no → buys = no; yes → buys = yes); 31…40 → buys = yes; >40 → “credit” (excellent → buys = no; fair → buys = yes).]
31
A Decision Tree
[The same tree drawn as a flow chart: age? → <=30: student? (no → no, yes → yes); 31..40: yes; >40: credit rating? (excellent → no, fair → yes).]
32
Discriminant Rules Extracted from our Tree
The rules are:
If age = “<=30” and student = no then buys = no
If age = “<=30” and student = yes then buys = yes
If age = “31…40” then buys = yes
If age = “>40” and credit = excellent then buys = no
If age = “>40” and credit = fair then buys = yes
33
The Loan Data
Approved or not
34
A decision tree from the loan data
Decision nodes and leaf nodes (classes)
35
Using the Decision Tree
No
36
Is the decision tree unique?
No. There are many possible trees.
Here is a simpler tree.
37
From a decision tree to a set of rules
A decision tree can be converted to a set of rules.
38
Example of a Decision Tree
[Figure: training data with columns Tid, Refund, Marital Status, Taxable Income, Cheat, e.g., (1, Yes, Single, 125K, No), (2, No, Married, 100K, No), (3, No, Single, 70K, No), together with a decision tree built from it; the splitting attributes are Refund, Marital Status (MarSt) and Taxable Income (TaxInc), and the leaves predict Cheat = NO or YES.]
40
What is a decision tree?
Decision tree
A flow-chart-like tree structure
Internal node denotes a test on an attribute
Branch represents an outcome of the test
Leaf nodes represent class labels
41
Classifying Using Decision Tree
42
Classifying Using Decision Tree
To classify an object, the appropriate attribute value is used at each node,
starting from the root, to determine the branch taken.
The path found by tests at each node leads to a leaf node which is the class
the model believes the object belongs to.
43
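To make the traversal concrete, here is a minimal Python sketch (not from the slides; the nested-dict tree representation and the classify helper are illustrative assumptions) that encodes the finished “buys” tree from the earlier slides and routes a record from the root to a leaf:

```python
def classify(node, record):
    """Follow attribute tests from the root until a leaf (a class label) is reached."""
    while isinstance(node, dict):          # internal node: {'attribute': ..., 'branches': {...}}
        value = record[node["attribute"]]  # value of the tested attribute for this record
        node = node["branches"][value]     # take the branch that matches the test outcome
    return node                            # leaf: the predicted class label

# The finished "buys" tree from the slides, written as nested dicts (structure assumed).
tree = {
    "attribute": "age",
    "branches": {
        "<=30":   {"attribute": "student",
                   "branches": {"no": "buys=no", "yes": "buys=yes"}},
        "31...40": "buys=yes",
        ">40":    {"attribute": "credit",
                   "branches": {"excellent": "buys=no", "fair": "buys=yes"}},
    },
}

print(classify(tree, {"age": "<=30", "student": "yes"}))     # -> buys=yes
print(classify(tree, {"age": ">40", "credit": "excellent"})) # -> buys=no
```

The same loop works for any tree written in this form: each internal node names the attribute to test and maps every outcome to a subtree or a class label.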
Classifying Using Decision Tree
[Attribute types: Home Owner and Marital Status are categorical, Annual Income is continuous, and Defaulted Borrower is the class.]
Training data (ID, Home Owner, Marital Status, Annual Income, Defaulted Borrower):
1, Yes, Single, 125K, No
2, No, Married, 100K, No
3, No, Single, 70K, No
4, Yes, Married, 120K, No
5, No, Divorced, 95K, Yes
6, No, Married, 60K, No
7, Yes, Divorced, 220K, No
8, No, Single, 85K, Yes
9, No, Married, 75K, No
10, No, Single, 90K, Yes
[Decision tree: Home Owner → Yes: NO; No: MarSt → Married: NO; Single, Divorced: Annual Income → < 80K: NO; >= 80K: YES.]
45
Classifying Using Decision Tree
Test Data
Home Owner = No, Marital Status = Married, Annual Income = 80K, Defaulted Borrower = ?
[The test record is routed down the tree shown above, one test at a time.]
46
Classifying Using Decision Tree
Test Data
Home Owner = No, Marital Status = Married, Annual Income = 80K, Defaulted Borrower = ?
[Following Home Owner = No and then MarSt = Married leads to a leaf labelled NO, so we assign Defaulted Borrower = “No”.]
50
Methods for Expressing Test Conditions
51
Methods for Expressing Test Conditions
Depends on attribute types
Binary
Nominal
Ordinal
Continuous
52
Test Condition for Nominal Attributes
Multi-way split:
Use as many partitions as distinct values, e.g., Marital Status → Single / Divorced / Married.
Binary split:
Divides values into two subsets.
Some decision tree algorithms, such as CART, produce only binary splits by considering all 2^(k-1) - 1 ways of creating a binary partition of k attribute values.
53
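As a small check on the 2^(k-1) - 1 count, here is a brief Python sketch (the helper name binary_partitions is made up for this illustration) that enumerates the distinct binary groupings of a nominal attribute's values:

```python
from itertools import combinations

def binary_partitions(values):
    """Yield all distinct two-subset partitions of a set of nominal values.
    For k values there are 2**(k-1) - 1 such partitions."""
    values = list(values)
    anchor = values[0]                      # fix one value to avoid counting mirror images twice
    rest = values[1:]
    for r in range(0, len(rest)):           # choose which of the remaining values join the anchor
        for extra in combinations(rest, r):
            left = {anchor, *extra}
            right = set(values) - left
            yield left, right

parts = list(binary_partitions(["Single", "Divorced", "Married"]))
print(parts)                             # e.g. ({'Single'}, {'Divorced', 'Married'}), ...
print(len(parts) == 2 ** (3 - 1) - 1)    # True: 3 groupings for k = 3 values
```

For Marital Status with k = 3 values it prints the 3 possible binary groupings.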
Test Condition for Ordinal Attributes
Multi-way split:
Use as many partitions as distinct values, e.g., Shirt Size → Small / Medium / Large / Extra Large.
Binary splits must respect the order of the values: a grouping such as {Small, Large} vs. {Medium, Extra Large} violates the order property.
54
Test Condition for Continuous Attributes
Binary split:
The attribute test condition can be expressed as a comparison test A < v or A ≥ v.
Any value v between the minimum and maximum attribute values in the training data is a possible split point.
Considering all possible splits to find the best cut can be compute intensive.
Multi-way split:
The attribute test condition can be expressed as a range query of the form v_i ≤ A < v_{i+1}, i = 1, 2, …, k.
Any collection of attribute value ranges can be used, as long as they are mutually exclusive and cover the entire range of attribute values between the minimum and maximum values observed in the training set.
55
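A minimal sketch of the binary-split case, assuming the common convention (not prescribed by the slide) of taking candidate cut points v at the midpoints between consecutive distinct attribute values; each candidate would then be scored with an impurity measure:

```python
def candidate_split_points(values):
    """Return candidate thresholds v for a binary test A < v: the midpoints between
    consecutive distinct attribute values observed in the training data."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]   # Annual Income (in K) from the loan table
print(candidate_split_points(incomes))
# [65.0, 72.5, 80.0, 87.5, 92.5, 97.5, 110.0, 122.5, 172.5]
```

Note that 80.0 appears among the candidates for the Annual Income column, which matches the 80K threshold used in the example tree.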
Splitting Based on Continuous Attributes
Different ways of handling
Discretization to form an ordinal categorical attribute
After discretization, a new ordinal value is assigned to each discretized
interval, and the attribute test condition is then defined using this newly
constructed ordinal attribute.
56
Building a Decision Tree
57
Decision Tree Algorithms: Short History
Late 1970s - ID3 (Iterative Dichotomiser) by J. Ross Quinlan
This work expanded on earlier work on concept learning systems, described by E. B. Hunt, J. Marin, and P. T. Stone.
ID3, C4.5, and CART were invented independently of one another, yet they follow a similar approach for learning decision trees from training tuples.
58
Building a Decision Tree
ID3, C4.5, and CART adopt a greedy (i.e., non-backtracking) approach.
Most algorithms for decision tree induction also follow such a top-down approach.
All of the algorithms start with a training set of tuples and their associated
class labels (classification data table).
The training set is recursively partitioned into smaller subsets as the tree is
being built.
59
Building a Decision Tree
The aim is to build a decision tree consisting of a root node, a
number of internal nodes, and a number of leaf nodes.
Building the tree starts with the root node; the data are then split into two or more child nodes, which are split in turn at lower levels, and so on until the process is complete.
60
An Example
First five attributes are symptoms and the last attribute is diagnosis.
All attributes are categorical.
Wish to predict the diagnosis class.
Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
Yes Yes Yes Yes Yes Strep throat
No No No Yes Yes Allergy
Yes Yes No Yes No Cold
Yes No Yes No No Strep throat
No Yes No Yes No Cold
No No No Yes No Allergy
No No Yes No No Strep throat
Yes No No Yes Yes Allergy
No Yes No Yes Yes Cold
Yes Yes No Yes Yes Cold
61
An Example
Consider each of the attributes in turn to see which would be a “good” one to start.
Sore
Throat Diagnosis
No Allergy
No Cold
No Allergy
No Strep throat
No Cold
Yes Strep throat
Yes Cold
Yes Strep throat
Yes Allergy
Yes Cold
62
An Example
Fever Diagnosis
No Allergy
No Strep throat
No Allergy
No Strep throat
No Allergy
Yes Strep throat
Yes Cold
Yes Cold
Yes Cold
Yes Cold
63
An Example
Try swollen glands
Swollen
Glands Diagnosis
No Allergy
No Cold
No Cold
No Allergy
No Allergy
No Cold
No Cold
Yes Strep throat
Yes Strep throat
Yes Strep throat
64
An Example
Try congestion
Congestion Diagnosis
No Strep throat
No Strep throat
Yes Allergy
Yes Cold
Yes Cold
Yes Allergy
Yes Allergy
Yes Cold
Yes Cold
Yes Strep throat
Not helpful.
65
An Example
Try the symptom headache
Headache Diagnosis
No Cold
No Cold
No Allergy
No Strep throat
No Strep throat
Yes Allergy
Yes Allergy
Yes Cold
Yes Cold
Yes Strep throat
Not helpful.
66
Brute Force Approach
This approach does not work if there are many attributes and a large training
set.
The tree continues to grow until finding better ways to split the objects is no
longer possible.
67
Basic Algorithm
1. Let the root node contain all training data D.
Discretise all continuous-valued attributes.
2. If all objects in the root node belong to the same class, then stop.
68
Decision Tree Algorithms
Many Algorithms:
Hunt’s Algorithm (one of the earliest)
CART
ID3, C4.5
SLIQ, SPRINT
69
Hunt’s Algorithm
70
General Structure of Hunt’s Algorithm
Let Dt be the set of training records that reach a node t.
General Procedure:
If Dt contains records that all belong to the same class yt, then t is a leaf node labelled yt.
Otherwise, an attribute test condition is selected to partition the records into smaller subsets, a child node is created for each outcome, and the procedure is applied recursively to each child.
[Running example: the loan data (ID, Home Owner, Marital Status, Annual Income, Defaulted Borrower) shown earlier.]
71
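A minimal Python sketch of this recursive structure (illustrative only: the naive attribute chooser below, the first attribute whose values actually split the records, is an assumption standing in for the impurity-based selection discussed later):

```python
from collections import Counter

def hunts(records, attributes, class_key):
    """Recursive skeleton of Hunt's algorithm (illustrative; the attribute chooser
    is a naive placeholder, not the impurity-based selection used by real learners)."""
    classes = [r[class_key] for r in records]
    majority = Counter(classes).most_common(1)[0][0]
    # Leaf cases: all records in one class, or no attribute left to test.
    if len(set(classes)) == 1 or not attributes:
        return majority
    # Naive chooser: the first attribute whose values actually partition the records.
    for attr in attributes:
        groups = {}
        for r in records:
            groups.setdefault(r[attr], []).append(r)
        if len(groups) > 1:
            remaining = [a for a in attributes if a != attr]
            return {"attribute": attr,
                    "branches": {v: hunts(subset, remaining, class_key)
                                 for v, subset in groups.items()}}
    return majority   # no attribute splits the records: label with the majority class

data = [
    {"Home Owner": "Yes", "Marital Status": "Single",  "Defaulted": "No"},
    {"Home Owner": "No",  "Marital Status": "Married", "Defaulted": "No"},
    {"Home Owner": "No",  "Marital Status": "Single",  "Defaulted": "Yes"},
]
print(hunts(data, ["Home Owner", "Marital Status"], "Defaulted"))
```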
Hunt’s Algorithm
[Hunt's algorithm applied to the loan data, one split at a time:
1. Start with a single leaf labelled Defaulted = No, class counts (7,3).
2. Split on Home Owner: Yes → leaf Defaulted = No (3,0); No → node Defaulted = No (4,3), still impure.
3. Split the Home Owner = No node on Marital Status: Married → leaf Defaulted = No (3,0); Single, Divorced → still impure.
4. Split the Single, Divorced node on Annual Income: < 80K → leaf Defaulted = No; >= 80K → leaf Defaulted = Yes.]
72
C4.5 Algorithm
76
Algorithm for decision tree learning
Basic algorithm (a greedy divide-and-conquer algorithm)
Tree is constructed in a top-down recursive manner
At start, all the training data are at the root.
Data are partitioned recursively based on selected attributes.
Attributes are selected on the basis of an impurity function
e.g., information gain
77
Decision tree learning algorithm C4.5
78
Decision tree learning algorithm C4.5
79
Choose an attribute to partition data
The key to building a decision tree is which attribute to choose in order to branch.
80
The Loan Data
Approved or not
81
Two possible roots, which is better?
82
Impurity Measures
83
Finding the Best Split
Before splitting:
10 records of class 0 (C0)
10 records of class 1 (C1)
[Two candidate child nodes: one with C0: 5, C1: 5 (high degree of impurity) and one with C0: 9, C1: 1 (low degree of impurity).]
85
Impurity
When deciding which question to ask at a node, we consider the
impurity in its child nodes after the question.
We want it to be as low as possible (low impurity or high purity).
86
Computing Impurity
There are many measures that can be used to determine the goodness of an attribute
test condition.
These measures try to give preference to attribute test conditions that partition the training instances into purer subsets in the child nodes, i.e., subsets whose instances mostly have the same class label.
87
Computing Impurity
88
Measure of Impurity: Entropy
Entropy for a given node that represents the dataset D:
E(D) = Entropy(D) = -\sum_{i=1}^{c} p_i \log_2(p_i)
89
Computing Entropy of a Dataset
Assume we have a dataset D with only two classes, positive and negative.
Entropy(D) = -\sum_{i=1}^{c} p_i \log_2(p_i)
The dataset D has 50% positive examples (P(positive) = 0.5) and 50% negative examples (P(negative) = 0.5).
E(D) = -0.5·log2(0.5) - 0.5·log2(0.5) = 1
90
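A small Python sketch of the entropy computation (an illustration, not code from the course), reproducing the 50/50 example above:

```python
from math import log2

def entropy(class_counts):
    """E(D) = -sum_i p_i * log2(p_i), computed from the class counts at a node."""
    n = sum(class_counts)
    probs = [c / n for c in class_counts if c > 0]
    return -sum(p * log2(p) for p in probs) if len(probs) > 1 else 0.0

print(entropy([5, 5]))    # 50% positive, 50% negative -> 1.0
print(entropy([10, 0]))   # pure node -> 0.0
```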
Computing Entropy of a Dataset
Entropy(D) = -\sum_{i=1}^{c} p_i \log_2(p_i)
91
Information Gain
92
Information Gain
Information gained by selecting attribute Ai to branch or to
partition the data D is
gain(D, A_i) = entropy(D) - entropy_{A_i}(D)
i.e., gain(D, A_i) = E(D) - E_{A_i}(D)
Disadvantage:
Attributes with many values are preferred.
93
Computing Information Gain
1. Given a dataset D, compute the entropy value of D before splitting:
E(D) = -\sum_{i=1}^{c} p_i \log_2(p_i)
2. If we make attribute Ai, with v values, the root of the current tree, this will partition D into v subsets D1, D2, …, Dv. The expected weighted entropy if Ai is used as the current root, after splitting, is
E_{A_i}(D) = \sum_{i=1}^{v} (n_i / n) E(D_i)
where n_i is the number of records in subset Di (the child node) and n is the number of records in D (the parent node).
[Figure: two candidate splits, A? and B?, each with Yes/No branches, compared by their weighted entropies E_A(D) and E_B(D).]
For example, for a dataset D with 6 records of one class and 9 of the other:
E(D) = -(6/15)·log2(6/15) - (9/15)·log2(9/15) = 0.971
96
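A short Python sketch (illustrative; the helper names are my own) of the weighted entropy E_{A_i}(D) and the resulting information gain. The subset class counts used below are assumptions chosen to be consistent with the subset entropies quoted on the next slide (0.971, 0.971, 0.722):

```python
from math import log2

def entropy(class_counts):
    n = sum(class_counts)
    probs = [c / n for c in class_counts if c > 0]
    return -sum(p * log2(p) for p in probs) if len(probs) > 1 else 0.0

def weighted_entropy(subset_counts):
    """E_A(D): entropy of each child D_i weighted by its share n_i / n of the records."""
    n = sum(sum(counts) for counts in subset_counts)
    return sum(sum(counts) / n * entropy(counts) for counts in subset_counts)

def information_gain(parent_counts, subset_counts):
    """gain(D, A) = E(D) - E_A(D)."""
    return entropy(parent_counts) - weighted_entropy(subset_counts)

# Parent node of the loan data: 9 records of one class, 6 of the other -> E(D) = 0.971.
print(round(entropy([9, 6]), 3))                         # 0.971
# Splitting on Age gives three subsets of 5 records; the class counts below are assumed
# values consistent with the quoted subset entropies, giving E_Age(D) = 0.888.
age_subsets = [[2, 3], [3, 2], [4, 1]]
print(round(weighted_entropy(age_subsets), 3))           # 0.888
print(round(information_gain([9, 6], age_subsets), 3))   # 0.083
```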
An example
E_{Age}(D) = (5/15) E(D_1) + (5/15) E(D_2) + (5/15) E(D_3)
           = (5/15)(0.971) + (5/15)(0.971) + (5/15)(0.722)
           = 0.888
(using E(D) = -\sum_{i=1}^{c} p_i \log_2(p_i) and E_{A_i}(D) = \sum_{i=1}^{v} (n_i / n) E(D_i))
98
An example
Own_house is the best attribute for the root node.
99
Another Example
First five attributes are symptoms and the last attribute is diagnosis.
All attributes are categorical.
Wish to predict the diagnosis class.
Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
Yes Yes Yes Yes Yes Strep throat
No No No Yes Yes Allergy
Yes Yes No Yes No Cold
Yes No Yes No No Strep throat
No Yes No Yes No Cold
No No No Yes No Allergy
No No Yes No No Strep throat
Yes No No Yes Yes Allergy
No Yes No Yes Yes Cold
Yes Yes No Yes Yes Cold
100
Another Example
D has n = 10 samples and c = 3 classes:
Strep throat: t = 3
Cold: d = 4
Allergy: a = 3
E(D) = -\sum_{i=1}^{c} p_i \log_2(p_i)
101
Another Example
Using E_{A_i}(D) = \sum_{i=1}^{v} (n_i / n) E(D_i):
Sore Throat
has 2 distinct values {Yes, No}
DYes has t = 2, d = 2, a = 1, total 5
DNo has t = 1, d = 2, a = 2, total 5
E(DYes) = -2·(2/5)·log(2/5) - (1/5)·log(1/5) = 0.46
E(DNo) = -2·(2/5)·log(2/5) - (1/5)·log(1/5) = 0.46
Fever
has 2 distinct values {Yes, No}
DYes has t = 1, d = 4, a = 0, total 5
DNo has t = 2, d = 0, a = 3, total 5
E(DYes) = -(1/5)·log(1/5) - (4/5)·log(4/5) = 0.22
E(DNo) = -(2/5)·log(2/5) - (3/5)·log(3/5) = 0.29
Congestion
has 2 distinct values {Yes, No}
DYes has t = 1, d = 4, a = 3, total 8
DNo has t = 2, d = 0, a = 0, total 2
E(DYes) = -(1/8)·log(1/8) - (4/8)·log(4/8) - (3/8)·log(3/8) = 0.42
E(DNo) = 0
ECongestion(D) = 0.8·0.42 + 0.2·0 = 0.34
(The numbers on this slide are computed with base-10 logarithms; using log2 instead rescales every value by the same constant and does not change which attribute is best.)
103
Another Example
Headache
has 2 distinct values {Yes, No}
DYes has t = 2, d = 2, a = 1, total 5
DNo has t = 1, d = 2, a = 2, total 5
E(DYes) = -2·(2/5)·log(2/5) - (1/5)·log(1/5) = 0.46
E(DNo) = -2·(2/5)·log(2/5) - (1/5)·log(1/5) = 0.46
105
Gini Index
Gini Index for a given node that represents the dataset D:
G(D) = Gini(D) = 1 - \sum_{i=1}^{c} p_i^2
where p_i is the relative frequency of class i in D, and c is the total number of classes.
Maximum of 1 - 1/c:
when records are equally distributed among all classes, implying the least beneficial situation for classification.
Minimum of 0:
when all records belong to one class, implying the most beneficial situation for classification.
The Gini index is used in decision tree algorithms such as CART, SLIQ, and SPRINT.
106
Gini Index of a Dataset
For a 2-class problem (p, 1 - p):
G(D) = 1 - p^2 - (1 - p)^2 = 2p(1 - p)
Parent node D: C1 = 7, C2 = 5, so G(D) = Gini = 0.486.
A candidate split A? (Yes / No) produces nodes N1 and N2:
N1: C1 = 5, C2 = 1, Gini(D1) = 1 - (5/6)^2 - (1/6)^2 = 0.278
N2: C1 = 2, C2 = 4, Gini(D2) = 1 - (2/6)^2 - (4/6)^2 = 0.444
Gini_A(D) = \sum_{i=1}^{v} (n_i / n) Gini(D_i), the weighted Gini of attribute A
          = 6/12 · 0.278 + 6/12 · 0.444 = 0.361
Gain_A = Gini(D) - Gini_A(D) = 0.486 - 0.361 = 0.125
108
Computing Information Gain Using Gini Index
Parent node D: C1 = 7, C2 = 5, so G(D) = Gini = 0.486.
A candidate split B? (Yes / No) produces nodes N1 and N2:
N1: C1 = 5, C2 = 2, Gini(D1) = 1 - (5/7)^2 - (2/7)^2 = 0.408
N2: C1 = 1, C2 = 4, Gini(D2) = 1 - (1/5)^2 - (4/5)^2 = 0.32
Gini_B(D) = \sum_{i=1}^{v} (n_i / n) Gini(D_i), the weighted Gini of attribute B
          = 7/12 · 0.408 + 5/12 · 0.32 = 0.371
Gain_B = Gini(D) - Gini_B(D) = 0.486 - 0.371 = 0.115
109
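A brief Python sketch (illustrative helpers, not course code) that reproduces the Gini calculations for the two candidate splits above:

```python
def gini(class_counts):
    """G(D) = 1 - sum_i p_i**2, computed from the class counts at a node."""
    n = sum(class_counts)
    return 1.0 - sum((c / n) ** 2 for c in class_counts)

def weighted_gini(subset_counts):
    """Gini_A(D): Gini of each child D_i weighted by its share n_i / n of the records."""
    n = sum(sum(counts) for counts in subset_counts)
    return sum(sum(counts) / n * gini(counts) for counts in subset_counts)

parent = [7, 5]                 # C1 = 7, C2 = 5  ->  Gini = 0.486
split_a = [[5, 1], [2, 4]]      # attribute A: nodes N1, N2
split_b = [[5, 2], [1, 4]]      # attribute B: nodes N1, N2

gain_a = gini(parent) - weighted_gini(split_a)
gain_b = gini(parent) - weighted_gini(split_b)
print(round(gain_a, 3), round(gain_b, 3))   # 0.125 0.115 -> A is the better split
```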
Selecting the Splitting Attribute
Since Gain_A is larger than Gain_B, attribute A will be selected for the next splitting:
max(Gain_A, Gain_B), or equivalently min(Gini_A(D), Gini_B(D)).
110
Information Gain Ratio
111
Problem with large number of partitions
Node impurity measures tend to prefer splits that result in a large number of partitions, each being small but pure.
Customer ID has the highest gain because the entropy of all its children is zero.
112
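A tiny Python illustration (an assumed 10-record example) of why an identifier-like attribute wins on plain information gain: every child node is pure, so the weighted child entropy is zero and the gain is maximal, even though the split is useless for prediction:

```python
from math import log2

def entropy(class_counts):
    n = sum(class_counts)
    probs = [c / n for c in class_counts if c > 0]
    return -sum(p * log2(p) for p in probs) if len(probs) > 1 else 0.0

def weighted_entropy(subset_counts):
    n = sum(sum(c) for c in subset_counts)
    return sum(sum(c) / n * entropy(c) for c in subset_counts)

parent = [5, 5]                            # 10 records, two classes
id_split = [[1, 0]] * 5 + [[0, 1]] * 5     # a Customer-ID-like attribute: one record per child
binary_split = [[4, 1], [1, 4]]            # a more sensible binary attribute

print(entropy(parent) - weighted_entropy(id_split))                # 1.0: maximal gain, useless split
print(round(entropy(parent) - weighted_entropy(binary_split), 3))  # 0.278
```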
Gain Ratio
The tree-building algorithm blindly picks the attribute that maximizes information gain.
113
Gain Ratio
Gain Ratio:
GainRatio(D, A) = gain(D, A) / SplitInfo_A(D), where SplitInfo_A(D) = -\sum_{i=1}^{v} (n_i / n) \log_2(n_i / n)
The split information term grows with the number of partitions, which penalises attributes (such as Customer ID) that fragment the data into many small subsets.
114
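A short Python sketch of the gain ratio as used in C4.5 (the split-information formula is standard C4.5, but this particular code and its toy counts are my own illustration); it shows how the Customer-ID-style split from the previous slide is penalised:

```python
from math import log2

def entropy(class_counts):
    n = sum(class_counts)
    probs = [c / n for c in class_counts if c > 0]
    return -sum(p * log2(p) for p in probs) if len(probs) > 1 else 0.0

def gain_ratio(parent_counts, subset_counts):
    """Gain ratio = information gain / split information (C4.5's correction
    for attributes that fragment the data into many small partitions)."""
    n = sum(parent_counts)
    weights = [sum(c) / n for c in subset_counts]
    gain = entropy(parent_counts) - sum(w * entropy(c) for w, c in zip(weights, subset_counts))
    split_info = -sum(w * log2(w) for w in weights if w > 0)
    return gain / split_info if split_info > 0 else 0.0

parent = [5, 5]
id_split = [[1, 0]] * 5 + [[0, 1]] * 5     # ID-like split: gain 1.0, but split info log2(10)
binary_split = [[4, 1], [1, 4]]
print(round(gain_ratio(parent, id_split), 3))      # 0.301: heavily penalised
print(round(gain_ratio(parent, binary_split), 3))  # 0.278: equal to its gain (split info = 1)
```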
Gain Ratio
For a continuous attribute A, the point with the least Gini index or entropy value for A is selected as the split point for A.
Split:
D1 is the set of tuples in D satisfying A ≤ split-point
D2 is the set of tuples in D satisfying A > split-point
116