08 - Classification - Decision Trees

The document discusses decision trees for classification. A decision tree is a flowchart-like structure that uses a tree-like model of decisions and their possible consequences. It breaks down a data set into smaller and smaller subsets while associating decisions with results. Decision trees are commonly used in operations research, specifically in decision analysis, to help identify a strategy most likely to reach a goal.

Data Mining

Classification: Basic Concepts

Decision Trees

1
Agenda
 Classification: Definition

 What do we mean by learning?

 What is a Decision Tree?

 Classifying Using Decision Tree

 Methods for Expressing Test Conditions

 Building a Decision Tree

 Hunt’s Algorithm

 C4.5 Algorithm

 Impurity Measures

 Information Gain

 Computing Impurity of Continuous Attributes

 Information Gain Ratio

2
Supervised vs. Unsupervised Learning
 Supervised learning (classification)
 The training data (observations, measurements, etc.) are accompanied
by labels indicating the class of the observations.
 New data is classified based on the training set.

 Unsupervised learning (clustering)


 The class labels of training data are unknown.
 Given a set of measurements, observations, etc., with the aim of
establishing the existence of classes or clusters in the data.

3
Prediction Problems: Classification vs. Numeric
Prediction
 Classification
 predicts categorical class labels (discrete or nominal)
 classifies data (constructs a model) based on the training set and the
values (class labels) in a classifying attribute and uses it in classifying
new data

 Numeric Prediction
 models continuous-valued functions, i.e.,
 predicts unknown or missing values

4
Classification—A Two-Step Process
 Model construction: describing a set of predetermined classes.

 Each tuple/sample is assumed to belong to a predefined class, as


determined by the class label attribute.

 The set of tuples used for model construction is a training set.

 The model is represented as


 classification rules, decision trees, or mathematical formulae

5
Classification—A Two-Step Process
 Model usage: for classifying future or unknown objects.

 Estimate the accuracy of the model.


 The known label of the test sample is compared with the classified result
from the model.
 The accuracy rate is the percentage of test set samples the model correctly
classifies.
 The test set is independent of the training set (otherwise overfitting)

 If the accuracy is acceptable, use the model to classify new data.

 If the test set is used to select models,


 it is called the validation (test) set.

6
Model Construction

Training Data  →  Classification Algorithm  →  Classifier (Model)

NAME   RANK            YEARS  TENURED
Mike   Assistant Prof  3      no
Mary   Assistant Prof  7      yes
Bill   Professor       2      yes
Jim    Associate Prof  7      yes
Dave   Assistant Prof  6      no
Anne   Associate Prof  3      no

Learned model (as a classification rule):
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
7
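As a small illustration (not part of the original slides), the learned model above can be applied in code. The sketch below encodes the rule from the figure as a Python function, with a name of our choosing, and checks it against the training table.

# Sketch: the learned classification rule applied to the training data.
training_data = [
    # (name, rank, years, tenured)
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor", 2, "yes"),
    ("Jim", "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]

def predict_tenured(rank, years):
    """Learned model: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"

for name, rank, years, actual in training_data:
    print(name, predict_tenured(rank, years), "actual:", actual)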
Using the Model in Prediction

Testing Data  →  Classifier  →  Accuracy = ?

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Mellisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data  →  Classifier  →  predicted class
(Jeff, Professor, 4)  →  Tenured?
8
Classification: Definition

9
Classification: Definition
 Given a set of data records D (training set).

 Each record in D is described by a tuple (x, y).

 x is the attribute set A1, A2, … Am.


 Ai: feature, attribute, predictor, independent variable, input

 y is the class label.


 y: class, label, response, dependent variable, output
 Its value is known for the training records but not for new ones.

10
Classification: Definition
 Goal:
 To learn a classification model from the training data that can be used to
predict the classes of new instances/cases.

 Each new object (record) is then assigned to one of the classes using the
classification model as accurately as possible.

 A test set is used to determine the accuracy of the model.

 Usually, the given data set is divided into training and test sets.
 The training set is used to build the model

 The test set is used to validate the model.

11
An example: Data (Loan Application)
Approved (yes/no)

12
An example: the learning task
 Goal:
 Learn a classification model from the data.
 Use the model to classify future loan applications into
 Yes (approved) and
 No (not approved)

 What is the class for the following applicant/case?

13
Examples of Classification
 Credit/Loan approval
 yes/no

 Categorizing email messages


 spam or non-spam

 Medical diagnosis: whether a tumor is


 cancerous or benign

 Classifying secondary structures of protein as


 alpha-helix, beta-sheet, or random coil.

 Classifying credit card transactions as


 legitimate or fraudulent.

 Categorizing news stories as


 finance, weather, entertainment, sports, etc …

14
Motivating Application: Credit Approval
 A bank wants to classify its customers based on whether they are
expected to pay back their approved loans.

 The history of past customers is used to train the classifier.

 The classifier provides rules that identify potentially reliable future


customers.

 Classification rule:
 If age = “31...40” and income = high then credit_rating = excellent

 Future customers
 Paul: age = 35, income = high  excellent credit rating
 John: age = 20, income = medium  fair credit rating

15
What do we mean by learning?

16
What do we mean by learning?
 Given
 a data set D
 a task T
 a performance measure M

 A computer system is said to learn from D to perform the task T,


 If after learning, the system’s performance on T improves as measured
by M.

 In other words, the learned model helps the system to perform T better
as compared to without learning.

17
The Traditional Programming Paradigm

Problem: email spam filter

Programmer writes the rules to filter emails  →  Spam filter program
Computer executes the program on the inputs (emails)  →  outputs: spam, not spam
18
The Data Mining Paradigm

Problem: email spam filter

Inputs (emails) + Outputs (spam, not spam)  →  Classification Algorithm
Computer learns the rules to filter emails  →  Spam filter program
19
What is a Decision Tree?

20
Creating a Decision Tree

(Figure: training points plotted in the x1–x2 plane; two classes, shown as red x’s and blue o’s.)
21
Creating a Decision Tree

X2 < 2.5
  Yes → Blue circle
  No  → Mixed

(Figure: the x1–x2 scatter plot with a horizontal cut at x2 = 2.5.)
22
Creating a Decision Tree

X2 < 2.5
  Yes → Blue circle (pure)
  No  → Mixed

(Figure: the region below x2 = 2.5 contains only blue circles, i.e., it is pure.)
23
Creating a Decision Tree

X2 < 2.5
  Yes → Blue circle
  No  → X1 < 2
          Yes → Blue circle
          No  → Red X

(Figure: the x1–x2 scatter plot with cuts at x2 = 2.5 and x1 = 2.)
24
Training Data with Objects

rec Age Income Student Credit_rating Buys_computer (CLASS)


r1 <=30 High No Fair No

r2 <=30 High No Excellent No

r3 31…40 High No Fair Yes

r4 >40 Medium No Fair Yes

r5 >40 Low Yes Fair Yes

r6 >40 Low Yes Excellent No

r7 31…40 Low Yes Excellent Yes

r8 <=30 Medium No Fair No

r9 <=30 Low Yes Fair Yes

r10 >40 Medium Yes Fair Yes


r11 <=30 Medium Yes Excellent Yes

r12 31…40 Medium No Excellent Yes

r13 31…40 High Yes Fair Yes

r14 >40 Medium No Excellent No

25
Building The Tree:
we choose “age” as a root
                         age
       <=30            31…40            >40

age <= 30 records:
  income  student  credit     class
  high    no       fair       no
  high    no       excellent  no
  medium  no       fair       no
  low     yes      fair       yes
  medium  yes      excellent  yes

age 31…40 records:
  income  student  credit     class
  high    no       fair       yes
  low     yes      excellent  yes
  medium  no       excellent  yes
  high    yes      fair       yes

age > 40 records:
  income  student  credit     class
  medium  no       fair       yes
  low     yes      fair       yes
  low     yes      excellent  no
  medium  yes      fair       yes
  medium  no       excellent  no
26
Building The Tree:
“age” as the root
age <= 30 → (records as on the previous slide, still to split)
age 31…40 → all four records have class = yes, so this branch becomes the leaf class = yes
age > 40 → (records as on the previous slide, still to split)

27
Building The Tree:
we chose “student” on <=30 branch
age <= 30 → split on student:
  student = no records:
    income  credit     class
    high    fair       no
    high    excellent  no
    medium  fair       no
  student = yes records:
    income  credit     class
    low     fair       yes
    medium  excellent  yes
age 31…40 → class = yes
age > 40 → (records as before, still to split)

28
Building The Tree:
we chose “student” on <=30 branch
age <= 30 → student:
  student = no  → class = no
  student = yes → class = yes
age 31…40 → class = yes
age > 40 → (still to split)
  income  student  credit     class
  medium  no       fair       yes
  low     yes      fair       yes
  low     yes      excellent  no
  medium  yes      fair       yes
  medium  no       excellent  no

29
Building The Tree:
we chose “credit” on >40 branch
age <= 30 → student:
  student = no  → class = no
  student = yes → class = yes
age 31…40 → class = yes
age > 40 → split on credit:
  credit = excellent records:
    income  student  class
    low     yes      no
    medium  no       no
  credit = fair records:
    income  student  class
    medium  no       yes
    low     yes      yes
    medium  yes      yes

30
Finished Tree for class=“buys”
age <= 30 → student:
  student = no  → buys = no
  student = yes → buys = yes
age 31…40 → buys = yes
age > 40 → credit:
  credit = excellent → buys = no
  credit = fair      → buys = yes

31
A Decision Tree

                 age?
    <=30        31…40          >40

  student?       yes      credit rating?
  no    yes              excellent   fair
  no    yes                 no        yes

32
Discriminant RULES extracted from our
TREE
 The rules are:
  IF age <= 30 AND student = no THEN buys = no
  IF age <= 30 AND student = yes THEN buys = yes
  IF age = 31…40 THEN buys = yes
  IF age > 40 AND credit = excellent THEN buys = no
  IF age > 40 AND credit = fair THEN buys = yes
33
The Loan Data
Approved or not

34
A decision tree from the loan data
 Decision nodes and leaf nodes (classes)

35
Using the Decision Tree

No

36
Is the decision tree unique?
 No. There are many possible trees.
 Here is a simpler tree.

 We want a smaller and still accurate tree.

  Easier to understand, and it tends to perform better.

 Finding the best tree is NP-hard.

 All existing tree building algorithms are heuristic algorithms

37
From a decision tree to a set of rules
 A decision tree can be converted to a set of rules.

 Each path from the root to a leaf is a rule.

38
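A possible sketch of this conversion, assuming the tree is stored as nested dictionaries (the representation and the helper name are ours, not from the slides); the tree used here is the buys_computer tree built earlier.

# Sketch: each root-to-leaf path of a decision tree becomes one IF ... THEN rule.
tree = {
    "age": {
        "<=30": {"student": {"no": "buys = no", "yes": "buys = yes"}},
        "31..40": "buys = yes",
        ">40": {"credit": {"excellent": "buys = no", "fair": "buys = yes"}},
    }
}

def extract_rules(node, conditions=()):
    if not isinstance(node, dict):                 # leaf reached: emit one rule
        return [f"IF {' AND '.join(conditions)} THEN {node}"]
    (attribute, branches), = node.items()          # one attribute test per internal node
    rules = []
    for value, subtree in branches.items():
        rules.extend(extract_rules(subtree, conditions + (f"{attribute} = {value}",)))
    return rules

for rule in extract_rules(tree):
    print(rule)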
Example of a Decision Tree

Training Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model (decision tree induced from the training data; Refund and MarSt are the
splitting attributes):

Refund?
  Yes → NO
  No  → MarSt?
          Married          → NO
          Single, Divorced → TaxInc?
                               < 80K → NO
                               > 80K → YES


39
Another Example of Decision Tree

Training data (same as before):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

An alternative model induced from the same data:

MarSt?
  Married          → NO
  Single, Divorced → Refund?
                       Yes → NO
                       No  → TaxInc?
                               < 80K → NO
                               > 80K → YES

There could be more than one tree that fits the same data!

40
What is a decision tree?
 Decision tree
 A flow-chart-like tree structure
 Internal node denotes a
 test on an attribute
 Branch represents an
 outcome of the test
 Leaf nodes represent
 class labels

 Decision tree generation consists of two phases


 Tree construction
 At start, all the training examples are at the root
 Partition examples recursively based on selected attributes
 Tree pruning
 Identify and remove branches that reflect noise or outliers

41
Classifying Using Decision Tree

42
Classifying Using Decision Tree
 To classify an object, the appropriate attribute value is used at each node,
starting from the root, to determine the branch taken.

 The path found by tests at each node leads to a leaf node which is the class
the model believes the object belongs to.

43
Classifying Using Decision Tree
Training Data (attribute types: categorical, categorical, continuous, class):

ID   Home Owner  Marital Status  Annual Income  Defaulted Borrower
1    Yes         Single          125K           No
2    No          Married         100K           No
3    No          Single          70K            No
4    Yes         Married         120K           No
5    No          Divorced        95K            Yes
6    No          Married         60K            No
7    Yes         Divorced        220K           No
8    No          Single          85K            Yes
9    No          Married         75K            No
10   No          Single          90K            Yes

Model: Decision Tree (Home Owner and MarSt are the splitting attributes)

Home Owner?
  Yes → NO
  No  → MarSt?
          Married          → NO
          Single, Divorced → Income?
                               < 80K → NO
                               > 80K → YES
44
Classifying Using Decision Tree
Test Data:
Home Owner = No, Marital Status = Married, Annual Income = 80K, Defaulted Borrower = ?

Start from the root of the tree:

Home Owner?
  Yes → NO
  No  → MarSt?
          Married          → NO
          Single, Divorced → Income?
                               < 80K → NO
                               > 80K → YES

45
Classifying Using Decision Tree
Test Data:
Home Owner = No, Marital Status = Married, Annual Income = 80K, Defaulted Borrower = ?

The traversal goes Home Owner = No → Marital Status = Married → leaf NO.
Assign Defaulted Borrower = “No”.

Home Owner?
  Yes → NO
  No  → MarSt?
          Married          → NO
          Single, Divorced → Income?
                               < 80K → NO
                               > 80K → YES

50
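The traversal above can also be written as a short function. The code below is a sketch (the dictionary-based record format is our assumption), hard-coding the tree shown on the slide.

# Sketch: walk the decision tree from the root for one unlabeled record.
def classify(record):
    if record["Home Owner"] == "Yes":
        return "NO"                                         # left branch: leaf NO
    if record["Marital Status"] == "Married":
        return "NO"                                         # MarSt = Married: leaf NO
    return "YES" if record["Annual Income"] > 80 else "NO"  # Single/Divorced: income test

test_record = {"Home Owner": "No", "Marital Status": "Married", "Annual Income": 80}
print(classify(test_record))   # -> NO  (Home Owner = No, then Marital Status = Married)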
Methods for Expressing Test Conditions

51
Methods for Expressing Test Conditions
 Depends on attribute types
 Binary

 Nominal

 Ordinal

 Continuous

52
Test Condition for Nominal Attributes
 Multi-way split:
  Use as many partitions as distinct values.
   Marital Status → {Single}, {Divorced}, {Married}

 Binary split:
  Divides the values into two subsets.
  Some decision tree algorithms, such as CART, produce only binary splits by
   considering all 2^(k−1) − 1 ways of creating a binary partition of k attribute values.
   Marital Status → {Married} vs. {Single, Divorced}
                 or {Single} vs. {Married, Divorced}
                 or {Single, Married} vs. {Divorced}

53
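As a quick check of the 2^(k−1) − 1 count, the sketch below (ours, not from the slides) enumerates all binary partitions of the three Marital Status values.

from itertools import combinations

def binary_partitions(values):
    """Yield every way to split a set of nominal values into two non-empty subsets."""
    first, rest = values[0], values[1:]       # fix one value to avoid mirror duplicates
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {first, *combo}
            right = set(values) - left
            if right:
                yield left, right

splits = list(binary_partitions(["Single", "Divorced", "Married"]))
print(len(splits))                            # 3 = 2**(3-1) - 1
for left, right in splits:
    print(sorted(left), "vs", sorted(right))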
Test Condition for Ordinal Attributes
 Multi-way split:
  Use as many partitions as distinct values.
   Shirt Size → {Small}, {Medium}, {Large}, {Extra Large}

 Binary split:
  Divides the values into two subsets.
  Must preserve the order property among attribute values.
   Valid:   {Small, Medium} vs. {Large, Extra Large}
   Valid:   {Small} vs. {Medium, Large, Extra Large}
   Invalid: {Small, Large} vs. {Medium, Extra Large} (this grouping violates
            the order property)

54
Test Condition for Continuous Attributes
 Binary split:
  The attribute test condition can be expressed as a comparison test
   A < v or A ≥ v
  Any value v between the minimum and maximum attribute values in the
   training data is a possible split point.
  Consider all possible splits and find the best cut.
   This can be more compute intensive.

 Multi-way split:
  The attribute test condition can be expressed as a range query of the form
   vi ≤ A < vi+1, for i = 1, 2, …, k
  Any collection of attribute value ranges can be used, as long as
   they are mutually exclusive and cover the entire range of attribute values
   between the minimum and maximum values observed in the training set.

55
Splitting Based on Continuous Attributes
 Different ways of handling
 Discretization to form an ordinal categorical attribute
 After discretization, a new ordinal value is assigned to each discretized
interval, and the attribute test condition is then defined using this newly
constructed ordinal attribute.

56
Building a Decision Tree

57
Decision Tree Algorithms: Short History
 Late 1970s - ID3 (Iterative Dichotomiser) by J. Ross Quinlan
  This work expanded on earlier work on concept learning systems, described by
   E. B. Hunt, J. Marin, and P. J. Stone.

 Quinlan later presented C4.5, a successor of ID3.
  C4.5 became a benchmark to which newer supervised learning algorithms
   are often compared.

 In 1984, a group of statisticians published the book "Classification and
   Regression Trees" (CART).
  The book described the generation of binary decision trees.

 ID3, C4.5, and CART were invented independently of one another, yet they
   follow a similar approach for learning decision trees from training tuples.

 These cornerstone algorithms spawned a flurry of work on decision tree
   induction.

58
Building a Decision Tree
 ID3, C4.5, and CART adopt a greedy (i.e., non-backtracking) approach.

 In this approach, decision trees are constructed in a top-down, recursive,
   divide-and-conquer manner.

 Most algorithms for decision tree induction follow such a top-down approach.

 All of these algorithms start with a training set of tuples and their associated
   class labels (a classification data table).

 The training set is recursively partitioned into smaller subsets as the tree is
   being built.

59
Building a Decision Tree
 The aim is to build a decision tree consisting of a root node, a
number of internal nodes, and a number of leaf nodes.

 Building the tree starts at the root node: the data are split into two or more
   child nodes, each of which is split again at lower levels, and so on until the
   process terminates.

 The method uses induction based on the training data.

 We illustrate the brute-force approach using a simple example.

60
An Example
 First five attributes are symptoms and the last attribute is diagnosis.
 All attributes are categorical.
 Wish to predict the diagnosis class.

Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
Yes          Yes    Yes             Yes         Yes       Strep throat
No           No     No              Yes         Yes       Allergy
Yes          Yes    No              Yes         No        Cold
Yes          No     Yes             No          No        Strep throat
No           Yes    No              Yes         No        Cold
No           No     No              Yes         No        Allergy
No           No     Yes             No          No        Strep throat
Yes          No     No              Yes         Yes       Allergy
No           Yes    No              Yes         Yes       Cold
Yes          Yes    No              Yes         Yes       Cold

61
An Example
Consider each of the attributes in turn to see which would be a “good” one to start with.

Sore Throat  Diagnosis
No Allergy
No Cold
No Allergy
No Strep throat
No Cold
Yes Strep throat
Yes Cold
Yes Strep throat
Yes Allergy
Yes Cold

Sore throat does not predict diagnosis.

62
An Example

Is the symptom Fever any better?

Fever Diagnosis
No Allergy
No Strep throat
No Allergy
No Strep throat
No Allergy
Yes Strep throat
Yes Cold
Yes Cold
Yes Cold
Yes Cold

Fever is better but not perfect.

63
An Example

Try Swollen Glands

Swollen Glands  Diagnosis
No Allergy
No Cold
No Cold
No Allergy
No Allergy
No Cold
No Cold
Yes Strep throat
Yes Strep throat
Yes Strep throat

Good. Swollen glands = yes means Strep Throat

64
An Example
Try Congestion

Congestion Diagnosis
No Strep throat
No Strep throat
Yes Allergy
Yes Cold
Yes Cold
Yes Allergy
Yes Allergy
Yes Cold
Yes Cold
Yes Strep throat

Not helpful.

65
An Example

Try the symptom Headache

Headache Diagnosis
No Cold
No Cold
No Allergy
No Strep throat
No Strep throat
Yes Allergy
Yes Allergy
Yes Cold
Yes Cold
Yes Strep throat

Not helpful.

66
Brute Force Approach
 This approach does not work if there are many attributes and a large training
set.

 Need an algorithm to select an attribute that best discriminates among the


target classes as the split attribute.

 How do we find the attribute that is most influential in determining the


dependent/target attribute?

 The tree continues to grow until finding better ways to split the objects is no
longer possible.

67
Basic Algorithm
 1. Let the root node contains all training data D.
 Discretise all continuous-valued attributes.

 2. If all objects D in the root node belong to the same class then
 stop.

 3. Select an attribute A from amongst the independent attributes that best
   divides or splits the objects in the node into subsets, and create a decision
   tree node for it.
  Split the node according to the values of A.

 4. Stop if any of the following conditions is met; otherwise continue with
   step 3.
  The data in each subset belong to a single class.
  There are no remaining attributes on which the sample may be further
   divided.
 (A code sketch of this procedure follows below.)

68
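A minimal Python sketch of this basic algorithm (ours, not from the slides). Records are dictionaries; step 3 picks the attribute with the lowest weighted entropy, i.e., the highest information gain, as detailed later in the deck; the majority-vote fallback for the "no remaining attributes" case is a common choice assumed here.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_attribute(records, attributes, target):
    """Step 3: pick the attribute whose split gives the lowest weighted entropy."""
    def weighted_entropy(attr):
        sizes = Counter(r[attr] for r in records)
        return sum(count / len(records) *
                   entropy([r[target] for r in records if r[attr] == value])
                   for value, count in sizes.items())
    return min(attributes, key=weighted_entropy)

def build_tree(records, attributes, target):
    classes = [r[target] for r in records]
    if len(set(classes)) == 1:                       # stop: node is pure
        return classes[0]
    if not attributes:                               # stop: no attributes left
        return Counter(classes).most_common(1)[0][0]
    best = best_attribute(records, attributes, target)
    return {best: {value: build_tree([r for r in records if r[best] == value],
                                     [a for a in attributes if a != best],
                                     target)
                   for value in set(r[best] for r in records)}}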
Decision Tree Algorithms
 Many Algorithms:
 Hunt’s Algorithm (one of the earliest)

 CART

 ID3, C4.5

 SLIQ, SPRINT

69
Hunt’s Algorithm

70
General Structure of Hunt’s Algorithm
 Let Dt be the set of training records that reach a node t.

 General Procedure:
  If Dt contains records that all belong to the same class yt, then t is a leaf
   node labeled as yt.
  If Dt contains records that belong to more than one class, use an attribute
   test to split the data into smaller subsets.
  Recursively apply the procedure to each subset.

Training data:

ID   Home Owner  Marital Status  Annual Income  Defaulted Borrower
1    Yes         Single          125K           No
2    No          Married         100K           No
3    No          Single          70K            No
4    Yes         Married         120K           No
5    No          Divorced        95K            Yes
6    No          Married         60K            No
7    Yes         Divorced        220K           No
8    No          Single          85K            Yes
9    No          Married         75K            No
10   No          Single          90K            Yes
71
Hunt’s Algorithm
Training data (as above). The tree grows in stages; class counts are shown as
(No, Yes):

(a) A single leaf: Defaulted = No (7,3)

(b) Home Owner?
      Yes → Defaulted = No (3,0)
      No  → Defaulted = No (4,3)

(c) Home Owner?
      Yes → Defaulted = No (3,0)
      No  → Marital Status?
              Single, Divorced → Defaulted = Yes (1,3)
              Married          → Defaulted = No (3,0)

(d) Home Owner?
      Yes → Defaulted = No (3,0)
      No  → Marital Status?
              Single, Divorced → Annual Income?
                                   < 80K  → Defaulted = No (1,0)
                                   >= 80K → Defaulted = Yes (0,3)
              Married          → Defaulted = No (3,0)
72
C4.5 Algorithm

76
Algorithm for decision tree learning
 Basic algorithm (a greedy divide-and-conquer algorithm)
 Tree is constructed in a top-down recursive manner
 At start, all the training data are at the root.
 Data are partitioned recursively based on selected attributes.
 Attributes are selected on the basis of an impurity function
 e.g., information gain

 Conditions for stopping partitioning


 All examples for a given node belong to the same class.
 There are no remaining attributes for further partitioning.
 There are no training examples left.

77
Decision tree learning algorithm C4.5

78
Decision tree learning algorithm C4.5

79
Choose an attribute to partition data
 The key to building a decision tree
 which attribute to choose in order to branch.

 Objective: reduce impurity in data as much as possible.


 A subset of data is pure if all instances belong to the same class.

 C4.5 chooses the attribute with the maximum Information Gain or


Gain Ratio based on information theory.

80
The Loan Data
Approved or not

81
Two possible roots, which is better?

Fig. (B) seems to be better.

82
Impurity Measures

83
Finding the Best Split
 Before splitting:
 10 records of class 0 (C0)
 10 records of class 1 (C1)

Three candidate test conditions:

Gender?             Car Type?                  Customer ID?
  Yes: C0: 6          Family: C0: 1, C1: 3       c1:  C0: 1, C1: 0
       C1: 4          Sports: C0: 8, C1: 0       ...
  No:  C0: 4          Luxury: C0: 1, C1: 7       c10: C0: 1, C1: 0
       C1: 6                                     c11: C0: 0, C1: 1
                                                 ...
                                                 c20: C0: 0, C1: 1

Which test condition is the best?
84
Finding the Best Split
 Greedy approach:
 Nodes with purer class distribution are preferred.

 Need a measure of node impurity (diversity).

 Before splitting:
  10 records of class 0 (C0)
  10 records of class 1 (C1)

  C0: 5, C1: 5 → high degree of impurity
  C0: 9, C1: 1 → low degree of impurity

85
Impurity
 When deciding which question to ask at a node, we consider the
impurity in its child nodes after the question.
 We want it to be as low as possible (low impurity or high purity).

 Let’s look at this example (assume a bucket below is simply a node in


decision tree):

86
Computing Impurity
 There are many measures that can be used to determine the goodness of an attribute
test condition.

 All three measures give a


 zero impurity value if a node contains instances from a single class
 maximum impurity if the node has equal proportion of instances from multiple classes

 These measures try to give preference to attribute test conditions that partition the
training instances into purer subsets in the child nodes,
 which mostly have the same class labels.

87
Computing Impurity

88
Measure of Impurity: Entropy
 Entropy for a given node that represents the dataset D:

   E(D) = Entropy(D) = − ∑_{i=1..c} p_i log2(p_i)

  where p_i is the relative frequency (proportion) of class i in D, and c is the
   total number of classes.
  Maximum of log2(c)
   when records are equally distributed among all classes, implying the least
   beneficial situation for classification.
  Minimum of 0
   when all records belong to one class, implying the most beneficial situation
   for classification.
  Entropy is used in decision tree algorithms such as
   ID3, C4.5
89
Computing Entropy of a Dataset
 Assume we have a dataset D with only two classes, positive and negative:

   E(D) = − ∑_{i=1..c} p_i log2(p_i)

 The dataset D has
  50% positive examples (P(positive) = 0.5) and 50% negative examples
   (P(negative) = 0.5):
   E(D) = −0.5·log2 0.5 − 0.5·log2 0.5 = 1

  20% positive examples (P(positive) = 0.2) and 80% negative examples
   (P(negative) = 0.8):
   E(D) = −0.2·log2 0.2 − 0.8·log2 0.8 = 0.722

  100% positive examples (P(positive) = 1) and no negative examples
   (P(negative) = 0):
   E(D) = −1·log2 1 − 0·log2 0 = 0
   By definition, 0·log2 0 = 0.
90
Computing Entropy of a Dataset
   E(D) = − ∑_{i=1..c} p_i log2(p_i)

Node with C1 = 0, C2 = 6:
  p1 = 0/6 = 0, p2 = 6/6 = 1
  E(D) = − 0·log2 0 − 1·log2 1 = 0

Node with C1 = 1, C2 = 5:
  p1 = 1/6, p2 = 5/6
  E(D) = − (1/6)·log2(1/6) − (5/6)·log2(5/6) = 0.65

Node with C1 = 2, C2 = 4:
  p1 = 2/6, p2 = 4/6
  E(D) = − (2/6)·log2(2/6) − (4/6)·log2(4/6) = 0.92
91
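The entropy values above can be reproduced with a few lines of Python (a sketch, using per-class record counts as input).

import math

def entropy(counts):
    """E(D) = -sum p_i * log2(p_i), with 0*log2(0) taken as 0."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

print(round(entropy([5, 5]), 3))   # 1.0    (50% / 50%)
print(round(entropy([2, 8]), 3))   # 0.722  (20% / 80%)
print(round(entropy([0, 6]), 3))   # 0.0
print(round(entropy([1, 5]), 3))   # 0.65
print(round(entropy([2, 4]), 3))   # 0.918  (0.92 after rounding on the slide)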
Information Gain

92
Information Gain
 Information gained by selecting attribute Ai to branch or to partition the
   data D is

   Gain(D, Ai) = Entropy(D) − EntropyAi(D) = E(D) − EAi(D)

 We evaluate every attribute:
  We choose the attribute with the highest gain to branch/split the current
   tree.

 Disadvantage:
  Attributes with many values are preferred.

93
Computing Information Gain
 1. Given a dataset D, compute the entropy of D before splitting:

   E(D) = − ∑_{i=1..c} p_i log2(p_i)

 2. If we make attribute Ai, with v values, the root of the current tree, this
   partitions D into v subsets D1, D2, …, Dv. The expected weighted entropy
   if Ai is used as the current root is

   EAi(D) = ∑_{j=1..v} (n_j / n) · E(Dj)

   where n_j is the number of records in Dj at child node j, and n is the number
   of records in D at the parent node.

 3. Choose the attribute Ai that produces the highest gain:

   Gain(D, Ai) = E(D) − EAi(D)
94
Computing Information Gain
Before splitting, D has class counts C0 = N00 and C1 = N01, and entropy E(D).

Candidate split A? (Yes / No):
  Node N1: C0 = N10, C1 = N11 → entropy EA1(D)
  Node N2: C0 = N20, C1 = N21 → entropy EA2(D)
  Weighted entropy after the split: EA(D)

Candidate split B? (Yes / No):
  Node N3: C0 = N30, C1 = N31 → entropy EB1(D)
  Node N4: C0 = N40, C1 = N41 → entropy EB2(D)
  Weighted entropy after the split: EB(D)

Choose the split with max(GainA = E(D) − EA(D), GainB = E(D) − EB(D)),
or equivalently min(EA(D), EB(D)).


95
An example

The loan data set D has 15 records: 9 of class Yes and 6 of class No.

   E(D) = − (6/15)·log2(6/15) − (9/15)·log2(9/15) = 0.971
96
An example
Splitting on Age:

Age     Yes  No  entropy(Di)
young   2    3   0.971
middle  3    2   0.971
old     4    1   0.722

   EAge(D) = (5/15)·E(D1) + (5/15)·E(D2) + (5/15)·E(D3)
           = (5/15)·0.971 + (5/15)·0.971 + (5/15)·0.722
           = 0.888
97
An example
Splitting on Own_house:

   EOwn_house(D) = (6/15)·E(D1) + (9/15)·E(D2)
                 = (6/15)·0 + (9/15)·0.918
                 = 0.551
98
An example
 Own_house is the best attribute for the root node:
  Gain(D, Own_house) = 0.971 − 0.551 = 0.420
  Gain(D, Age) = 0.971 − 0.888 = 0.083

99
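A short sketch reproducing these gains from the per-branch class counts read off the previous two slides (the helper functions are ours).

import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def weighted_entropy(branches):
    """branches: one (yes_count, no_count) pair per child node of the split."""
    total = sum(sum(b) for b in branches)
    return sum(sum(b) / total * entropy(b) for b in branches)

E_D = entropy([9, 6])                                 # 9 Yes, 6 No -> 0.971
E_age = weighted_entropy([(2, 3), (3, 2), (4, 1)])    # young, middle, old -> 0.888
E_own = weighted_entropy([(6, 0), (3, 6)])            # own_house yes, no  -> 0.551

print(round(E_D - E_age, 3))   # gain(D, Age)       ≈ 0.083
print(round(E_D - E_own, 3))   # gain(D, Own_house) ≈ 0.42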
Another Example
 First five attributes are symptoms and the last attribute is diagnosis.
 All attributes are categorical.
 Wish to predict the diagnosis class.

Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
Yes          Yes    Yes             Yes         Yes       Strep throat
No           No     No              Yes         Yes       Allergy
Yes          Yes    No              Yes         No        Cold
Yes          No     Yes             No          No        Strep throat
No           Yes    No              Yes         No        Cold
No           No     No              Yes         No        Allergy
No           No     Yes             No          No        Strep throat
Yes          No     No              Yes         Yes       Allergy
No           Yes    No              Yes         Yes       Cold
Yes          Yes    No              Yes         Yes       Cold

100
Another Example
 D has n = 10 samples and c = 3 classes:
  Strep throat: t = 3
  Cold: d = 4
  Allergy: a = 3

 This example uses base-10 logarithms; the choice of log base only rescales
   all entropy values and does not change the ranking of the attributes.

   E(D) = − (3/10)·log(3/10) − (4/10)·log(4/10) − (3/10)·log(3/10) ≈ 0.47

 Let us now consider using the various symptoms to split D.
101
Another Example
 Sore Throat
  has 2 distinct values {Yes, No}
   DYes has t = 2, d = 2, a = 1, total 5
   DNo  has t = 1, d = 2, a = 2, total 5
   E(DYes) = − 2·(2/5)·log(2/5) − (1/5)·log(1/5) = 0.46
   E(DNo)  = − 2·(2/5)·log(2/5) − (1/5)·log(1/5) = 0.46

   ESoreThroat(D) = 0.5·0.46 + 0.5·0.46 = 0.46

 Fever
  has 2 distinct values {Yes, No}
   DYes has t = 1, d = 4, a = 0, total 5
   DNo  has t = 2, d = 0, a = 3, total 5
   E(DYes) = − (1/5)·log(1/5) − (4/5)·log(4/5) = 0.22
   E(DNo)  = − (2/5)·log(2/5) − (3/5)·log(3/5) = 0.29

   EFever(D) = 0.5·0.22 + 0.5·0.29 ≈ 0.25
102
Another Example
 Swollen Glands
  has 2 distinct values {Yes, No}
   DYes has t = 3, d = 0, a = 0, total 3
   DNo  has t = 0, d = 4, a = 3, total 7
   E(DYes) = − (3/3)·log(3/3) = 0
   E(DNo)  = − (4/7)·log(4/7) − (3/7)·log(3/7) = 0.30

   ESwollenGlands(D) = 0.3·0 + 0.7·0.30 = 0.21

 Congestion
  has 2 distinct values {Yes, No}
   DYes has t = 1, d = 4, a = 3, total 8
   DNo  has t = 2, d = 0, a = 0, total 2
   E(DYes) = − (1/8)·log(1/8) − (4/8)·log(4/8) − (3/8)·log(3/8) = 0.42
   E(DNo)  = 0

   ECongestion(D) = 0.8·0.42 + 0.2·0 = 0.34
103
Another Example
 Headache
  has 2 distinct values {Yes, No}
   DYes has t = 2, d = 2, a = 1, total 5
   DNo  has t = 1, d = 2, a = 2, total 5
   E(DYes) = − 2·(2/5)·log(2/5) − (1/5)·log(1/5) = 0.46
   E(DNo)  = − 2·(2/5)·log(2/5) − (1/5)·log(1/5) = 0.46

   EHeadache(D) = 0.5·0.46 + 0.5·0.46 = 0.46

 So the weighted entropies of all the attributes are:

   Sore Throat      0.46
   Fever            0.25
   Swollen Glands   0.21
   Congestion       0.34
   Headache         0.46

 Swollen Glands has the lowest weighted entropy, and hence the highest gain,
   so it is selected as the split attribute; the process is then repeated on each
   resulting subset.
104
Gini Index

105
Gini Index
 Gini index for a given node that represents the dataset D:

   G(D) = Gini(D) = 1 − ∑_{i=1..c} p_i²

  where p_i is the relative frequency (proportion) of class i in D, and c is the
   total number of classes.
  Maximum of 1 − 1/c
   when records are equally distributed among all classes, implying the least
   beneficial situation for classification.
  Minimum of 0
   when all records belong to one class, implying the most beneficial situation
   for classification.
  The Gini index is used in decision tree algorithms such as
   CART, SLIQ, SPRINT
106
Gini Index of a Dataset
 For a 2-class problem with class proportions (p, 1 − p):
  G(D) = 1 − p² − (1 − p)² = 2p(1 − p)

Node with C1 = 0, C2 = 6:
  p1 = 0/6 = 0, p2 = 6/6 = 1
  Gini = 1 − p1² − p2² = 1 − 0 − 1 = 0

Node with C1 = 1, C2 = 5:
  p1 = 1/6, p2 = 5/6
  Gini = 1 − (1/6)² − (5/6)² = 0.278

Node with C1 = 2, C2 = 4:
  p1 = 2/6, p2 = 4/6
  Gini = 1 − (2/6)² − (4/6)² = 0.444

Node with C1 = 3, C2 = 3:
  p1 = 3/6, p2 = 3/6
  Gini = 1 − (3/6)² − (3/6)² = 0.5


107
Computing Information Gain Using Gini Index

Parent node D: C1 = 7, C2 = 5
  G(D) = Gini = 0.486

Candidate split A? (Yes / No):
        N1   N2
  C1    5    2
  C2    1    4

  Gini(D1) = 1 − (5/6)² − (1/6)² = 0.278
  Gini(D2) = 1 − (2/6)² − (4/6)² = 0.444

  GiniA(D) = ∑_{j=1..v} (n_j / n) · Gini(Dj)   (weighted Gini of attribute A)
           = (6/12)·0.278 + (6/12)·0.444 = 0.361

  GainA = Gini(D) − GiniA(D) = 0.486 − 0.361 = 0.125

108
Computing Information Gain Using Gini Index

Parent node D: C1 = 7, C2 = 5
  G(D) = Gini = 0.486

Candidate split B? (Yes / No):
        N1   N2
  C1    5    1
  C2    2    4

  Gini(D1) = 1 − (5/7)² − (2/7)² = 0.408
  Gini(D2) = 1 − (1/5)² − (4/5)² = 0.32

  GiniB(D) = weighted Gini of attribute B
           = (7/12)·0.408 + (5/12)·0.32 = 0.371

  GainB = Gini(D) − GiniB(D) = 0.486 − 0.371 = 0.115

109
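The same comparison in code (a sketch; the class counts are taken from the two slides above).

def gini(counts):
    """Gini(D) = 1 - sum p_i^2, from per-class record counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def weighted_gini(branches):
    total = sum(sum(b) for b in branches)
    return sum(sum(b) / total * gini(b) for b in branches)

parent = [7, 5]                    # C1 = 7, C2 = 5  -> Gini = 0.486
split_A = [(5, 1), (2, 4)]         # nodes N1, N2 under test A
split_B = [(5, 2), (1, 4)]         # nodes N1, N2 under test B

print(round(gini(parent) - weighted_gini(split_A), 3))   # GainA = 0.125
print(round(gini(parent) - weighted_gini(split_B), 3))   # GainB = 0.115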
Selecting the Splitting Attribute
 Since GainA is larger than GainB, attribute A will be selected for the next
   split.
  max(GainA, GainB) or, equivalently, min(GiniA(D), GiniB(D))

GainA = Gini(D) – GiniA (D) = 0.486 – 0.361 = 0.125

GainB = Gini(D) – GiniB (D) = 0.486 – 0.371 = 0.115

110
Information Gain Ratio

111
Problem with large number of partitions
 Node impurity measures tend to prefer splits that result in
 large number of partitions, each being small but pure.

 Customer ID has highest gain because entropy for all the children is
zero.

Gender?             Car Type?                  Customer ID?
  Yes: C0: 6          Family: C0: 1, C1: 3       c1:  C0: 1, C1: 0
       C1: 4          Sports: C0: 8, C1: 0       ...
  No:  C0: 4          Luxury: C0: 1, C1: 7       c10: C0: 1, C1: 0
       C1: 6                                     c11: C0: 0, C1: 1
                                                 ...
                                                 c20: C0: 0, C1: 1
112
Gain Ratio
 Tree building algorithm blindly picks attribute that
 maximizes information gain

 Need a correction that penalizes attributes whose values scatter the data
   into a large number of small partitions.

 Gain ratio overcomes the disadvantage of Gain by normalizing the


gain using the entropy of the data with respect to the values of the
attribute.
 Our previous entropy computations are done with respect to the class
attribute.
 Adjusts Gain by the entropy of the partitioning.
 Higher entropy partitioning (large number of small partitions) is penalized!
 Used in C4.5 algorithm

113
Gain Ratio
 Gain Ratio:

   GainRatioA(D) = Gain(D, A) / SplitInfoA(D)

   SplitInfoA(D) = − ∑_{i=1..v} (n_i / n) · log2(n_i / n)

  D is split into partitions {D1, D2, …, Dv} (children nodes)
  v is the number of possible values of A
  n_i is the number of records in Di (child node i), and n is the number of
   records in D

114
Gain Ratio
 For the buys_computer data below (14 records; income takes the values
   high 4 times, medium 6 times, and low 4 times):

   SplitInfoincome(D) = − (4/14)·log2(4/14) − (6/14)·log2(6/14) − (4/14)·log2(4/14)
                      = 1.557

   Gain(D, income) = 0.029

   gain_ratioincome(D) = 0.029 / 1.557 = 0.019

 The attribute with the maximum gain ratio is selected as the splitting attribute.

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no
115
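A sketch of the gain-ratio computation for income (the 4/6/4 partition sizes follow from the table above; Gain(D, income) = 0.029 is taken from the slide).

import math

def split_info(partition_sizes):
    """SplitInfo_A(D) = -sum (n_i / n) * log2(n_i / n) over the children of the split."""
    n = sum(partition_sizes)
    return -sum((ni / n) * math.log2(ni / n) for ni in partition_sizes)

info = split_info([4, 6, 4])            # income = high, medium, low partition sizes
gain_income = 0.029                     # Gain(D, income), from the slide
print(round(info, 3))                   # 1.557
print(round(gain_income / info, 3))     # gain ratio = 0.019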
Gain for Continuous-Valued Attributes
 Let attribute A be a continuous-valued attribute.

 Must determine the best split point for A.

 Sort the value A in increasing order.

 The midpoint between each pair of adjacent values is considered a possible


split point.
 (ai + ai+1)/2 is the midpoint between the values of ai and ai+1.

 The point with the least Gini index or entropy value for A is selected as the
   split point for A.

 Split:
  D1 is the set of tuples in D satisfying A ≤ split-point
  D2 is the set of tuples in D satisfying A > split-point.

116
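A sketch of the midpoint search (ours, not from the slides), using entropy as the impurity measure and the Annual Income / Cheat columns from the earlier training table.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Sort A, evaluate the midpoint of each adjacent pair, return (cut, weighted entropy)."""
    pairs = sorted(zip(values, labels))
    best = None
    for (a, _), (b, _) in zip(pairs, pairs[1:]):
        if a == b:
            continue
        mid = (a + b) / 2
        left = [y for x, y in pairs if x <= mid]
        right = [y for x, y in pairs if x > mid]
        w = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if best is None or w < best[1]:
            best = (mid, w)
    return best

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]   # Annual Income (in K)
cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_split_point(income, cheat))   # (97.5, 0.6): the best cut for this data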
