DM Lect 9_Classification - Decision Trees

Chapter 8 of 'Data Mining: Concepts and Techniques' discusses classification, focusing on supervised learning methods such as decision trees, Bayes classification, and rule-based classification. It outlines the two-step process of model construction and usage for predicting categorical class labels, along with techniques to improve classification accuracy. The chapter also covers model evaluation and selection, emphasizing the importance of information gain and entropy in decision tree induction.

Data Mining

Supervised Learning

Dr. Wedad Hussein


[email protected]

Dr. Mahmoud Mounir


[email protected]
Data Mining:
Concepts and Techniques
(3rd ed.)

— Chapter 8 —

Jiawei Han, Micheline Kamber, and Jian Pei


University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
Chapter 8. Classification: Basic Concepts

◼ Classification: Basic Concepts


◼ Decision Tree Induction
◼ Bayes Classification Methods
◼ Rule-Based Classification
◼ Model Evaluation and Selection
◼ Techniques to Improve Classification Accuracy:
Ensemble Methods
◼ Summary
INTRODUCTION
• Given the following dataset of objects:

Objects   X (Attribute 1)   Y (Attribute 2)   Z (Attribute 3)
OB-1      1                 4                 1
OB-2      1                 2                 2
OB-3      1                 4                 2
OB-4      2                 1                 2
OB-5      1                 1                 1
OB-6      2                 4                 2
OB-7      1                 1                 2
OB-8      2                 1                 1
INTRODUCTION
• The same dataset of objects, now with a class label for each object:

Objects   X (Attribute 1)   Y (Attribute 2)   Z (Attribute 3)   Class
OB-1      1                 4                 1                 A
OB-2      1                 2                 2                 B
OB-3      1                 4                 2                 B
OB-4      2                 1                 2                 A
OB-5      1                 1                 1                 A
OB-6      2                 4                 2                 B
OB-7      1                 1                 2                 A
OB-8      2                 1                 1                 A
Supervised vs. Unsupervised Learning

◼ Supervised learning (classification)
  ◼ Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  ◼ New data is classified based on the training set
◼ Unsupervised learning (clustering)
  ◼ The class labels of the training data are unknown
  ◼ Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Prediction Problems: Classification vs. Numeric Prediction
◼ Classification
  ◼ predicts categorical class labels (discrete or nominal)
  ◼ classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data
◼ Numeric Prediction
  ◼ models continuous-valued functions, i.e., predicts unknown or missing values
◼ Typical applications
  ◼ Credit/loan approval
  ◼ Medical diagnosis: if a tumor is cancerous or benign
  ◼ Fraud detection: if a transaction is fraudulent
  ◼ Web page categorization: which category it is
Classification—A Two-Step Process
◼ Model construction: describing a set of predetermined classes
  ◼ Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  ◼ The set of tuples used for model construction is the training set
  ◼ The model is represented as classification rules, decision trees, or mathematical formulae
◼ Model usage: for classifying future or unknown objects
  ◼ Estimate the accuracy of the model
    ◼ The known label of each test sample is compared with the classified result from the model
    ◼ The accuracy rate is the percentage of test set samples that are correctly classified by the model
    ◼ The test set is independent of the training set (otherwise overfitting occurs)
  ◼ If the accuracy is acceptable, use the model to classify new data
◼ Note: If the test set is used to select models, it is called a validation (test) set
(A short code sketch of the two steps follows.)
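The two-step process can be sketched in a few lines of Python. This is a minimal, illustrative example assuming scikit-learn and its bundled Iris data (neither is mentioned on the slides); the point is only the split into model construction on a training set and accuracy estimation on an independent test set.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Keep the test set independent of the training set (otherwise overfitting
# goes undetected).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 1: model construction on the training set.
model = DecisionTreeClassifier(criterion="entropy")  # entropy-based splits, as in ID3/C4.5
model.fit(X_train, y_train)

# Step 2: model usage -- estimate accuracy on the test set, then classify new data.
y_pred = model.predict(X_test)
print("Test-set accuracy:", accuracy_score(y_test, y_pred))
```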
Process (1): Model Construction

Training Data → Classification Algorithms → Classifier (Model)

NAME   RANK            YEARS   Qualified
Mike   Assistant Prof  3       no
Mary   Assistant Prof  7       yes
Bill   Professor       2       yes
Jim    Associate Prof  7       yes
Dave   Assistant Prof  6       no
Anne   Associate Prof  3       no

Learned classifier (model):
IF rank = ‘professor’ OR years > 6
THEN qualified = ‘yes’
Process (2): Using the Model in Prediction

Testing Data → Classifier → prediction for Unseen Data

NAME      RANK            YEARS   Qualified
Tom       Assistant Prof  2       no
Merlisa   Associate Prof  7       no
George    Professor       5       yes
Joseph    Assistant Prof  7       yes

Unseen data: (Jeff, Professor, 4) → Qualified?
(A small code sketch of this rule in use follows.)
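The rule learned in Process (1) can be written directly as a small Python function and applied to the testing data and the unseen tuple from this slide. This is only an illustration of "model usage"; the function name and data layout are choices of this sketch.

```python
def qualified(rank: str, years: int) -> str:
    # IF rank = 'professor' OR years > 6 THEN qualified = 'yes'
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

testing_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Estimate accuracy on the testing data (the model misclassifies Merlisa,
# whose known label is 'no' but who satisfies years > 6).
correct = sum(qualified(rank, years) == label for _, rank, years, label in testing_data)
print(f"Accuracy on the test set: {correct}/{len(testing_data)}")

# Classify the unseen tuple (Jeff, Professor, 4).
print("Jeff qualified?", qualified("Professor", 4))  # -> yes
```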
Chapter 8. Classification: Basic Concepts

◼ Classification: Basic Concepts


◼ Decision Tree Induction
◼ Bayes Classification Methods
◼ Rule-Based Classification
◼ Model Evaluation and Selection
◼ Techniques to Improve Classification Accuracy:
Ensemble Methods
◼ Summary
Decision Tree Induction: An Example

❑ Training data set: Buys_computer
❑ The data set follows an example of Quinlan’s ID3 (Playing Tennis)

age     income  student credit_rating buys_computer
<=30    high    no      fair          no
<=30    high    no      excellent     no
31…40   high    no      fair          yes
>40     medium  no      fair          yes
>40     low     yes     fair          yes
>40     low     yes     excellent     no
31…40   low     yes     excellent     yes
<=30    medium  no      fair          no
<=30    low     yes     fair          yes
>40     medium  yes     fair          yes
<=30    medium  yes     excellent     yes
31…40   medium  no      excellent     yes
31…40   high    yes     fair          yes
>40     medium  no      excellent     no

❑ Resulting tree:

age?
  <=30   → student?        (no → no, yes → yes)
  31..40 → yes
  >40    → credit rating?  (excellent → no, fair → yes)
Algorithm for Decision Tree Induction
◼ Basic algorithm (a greedy algorithm)
  ◼ Tree is constructed in a top-down recursive divide-and-conquer manner
  ◼ At start, all the training examples are at the root
  ◼ Attributes are categorical (if continuous-valued, they are discretized in advance)
  ◼ Examples are partitioned recursively based on selected attributes
  ◼ Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
◼ Conditions for stopping partitioning
  ◼ All samples for a given node belong to the same class
  ◼ There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
  ◼ There are no samples left
(A minimal sketch of this recursive procedure follows.)
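Below is a minimal Python sketch of this greedy, top-down, divide-and-conquer procedure. It is an illustration, not the book's pseudocode: information gain (introduced on the following slides) is used as the selection measure, and the function and variable names are choices of this sketch.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, target):
    before = entropy([r[target] for r in rows])
    after = 0.0
    for value in set(r[attr] for r in rows):
        subset = [r[target] for r in rows if r[attr] == value]
        after += (len(subset) / len(rows)) * entropy(subset)
    return before - after

def build_tree(rows, attributes, target):
    labels = [r[target] for r in rows]
    # Stopping condition: all samples at this node belong to the same class.
    if len(set(labels)) == 1:
        return labels[0]
    # Stopping condition: no remaining attributes -> majority voting at the leaf.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: pick the test attribute with the highest information gain.
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    # Divide and conquer: partition on the chosen attribute and recurse.
    # (Empty partitions cannot occur, since we branch only on values present here.)
    tree = {best: {}}
    for value in set(r[best] for r in rows):
        subset = [r for r in rows if r[best] == value]
        tree[best][value] = build_tree(subset, [a for a in attributes if a != best], target)
    return tree

# `rows` is a list of dicts, e.g. {"age": "<=30", ..., "buys_computer": "no"};
# build_tree(rows, ["age", "income", "student", "credit_rating"], "buys_computer")
# reproduces the Buys_computer tree from the previous slide.
```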
Brief Review of Entropy
◼ Entropy: H(D) = - Σi pi · log2(pi)

Attribute 1:  value 0 → 4 of Class A;  value 1 → 4 of Class B
◼ H(Attribute 1) = -(4/8)·log2(4/8) - (4/8)·log2(4/8) = 1

Attribute 2:  value X → 0 of Class A;  value Y → 8 of Class B
◼ H(Attribute 2) = -(0/8)·log2(0/8) - (8/8)·log2(8/8) = 0   (taking 0·log2(0) = 0)
Brief Review of Entropy

Attribute 3:  value M → 5 of Class A;  value N → 3 of Class B
◼ H(Attribute 3) = -(5/8)·log2(5/8) - (3/8)·log2(3/8) = 0.424 + 0.531 = 0.955

Attribute 4:  value K → 2 of Class A;  value L → 6 of Class B
◼ H(Attribute 4) = -(2/8)·log2(2/8) - (6/8)·log2(6/8) = 0.811
(These values are verified by the short sketch below.)
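The four entropies above can be checked with a few lines of Python (an illustration added here, not part of the slides); the function takes the raw class counts of a distribution.

```python
import math

def entropy(counts):
    """H = -sum p * log2(p), where `counts` gives the class distribution as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(entropy([4, 4]))  # Attribute 1 -> 1.0
print(entropy([0, 8]))  # Attribute 2 -> 0.0 (the zero-count term contributes 0)
print(entropy([5, 3]))  # Attribute 3 -> 0.954 (0.955 on the slide after rounding each term)
print(entropy([2, 6]))  # Attribute 4 -> 0.811
```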
Attribute Selection Measure: Information Gain (ID3/C4.5)
◼ Select the attribute with the highest information gain
◼ Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D| / |D|
◼ Expected information (entropy) needed to classify a tuple in D:
      Info(D) = - Σ(i=1..m) pi · log2(pi)
◼ Information needed (after using A to split D into v partitions) to classify D:
      InfoA(D) = Σ(j=1..v) (|Dj| / |D|) · Info(Dj)
◼ Information gained by branching on attribute A:
      Gain(A) = Info(D) - InfoA(D)
(A direct Python transcription of these formulas follows.)
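The three formulas translate directly into Python helpers. In this sketch (not from the slides) class distributions are passed as lists of raw counts, one list per partition for InfoA.

```python
import math

def info(counts):
    """Info(D): entropy of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def info_a(partitions):
    """Info_A(D): weighted entropy after splitting D into partitions D_1..D_v.
    `partitions` is a list of per-partition class-count lists."""
    total = sum(sum(p) for p in partitions)
    return sum((sum(p) / total) * info(p) for p in partitions)

def gain(counts, partitions):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info(counts) - info_a(partitions)

# Example with the Buys_computer data: splitting on age.
print(gain([9, 5], [[2, 3], [4, 0], [3, 2]]))  # -> ~0.247 (0.246 with the slide's rounding)
```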


Decision Trees Using ID3 Algorithm

a. What is the entropy of buys_computer?

Buys Computer: 5 No, 9 Yes
Entropy(Buys Computer) = H(Buys Computer)
  = -(5/14)·log2(5/14) - (9/14)·log2(9/14) = 0.531 + 0.41 = 0.941

(Training data: the 14-tuple Buys_computer table shown earlier.)
Decision Trees Using ID3 Algorithm
b. Which attribute should you choose as the root of a decision tree?

Age (14 samples), H(Buys Computer) = 0.941
  <=30: 5 [3 No, 2 Yes]    31..40: 4 [0 No, 4 Yes]    >40: 5 [2 No, 3 Yes]

H(Age) = (5/14)·H[3,2] + (4/14)·H[0,4] + (5/14)·H[2,3]
       = (5/14)·[-(3/5)log2(3/5) - (2/5)log2(2/5)]
         + (4/14)·[-(0/4)log2(0/4) - (4/4)log2(4/4)]
         + (5/14)·[-(2/5)log2(2/5) - (3/5)log2(3/5)]
       = 0.347 + 0 + 0.347 = 0.694
IG(Buys Computer/Age) = H(Buys Computer) - H(Age) = 0.941 - 0.694 = 0.247
Decision Trees Using ID3 Algorithm
b. Which attribute should you choose as the root of a decision tree?

Income (14 samples), H(Buys Computer) = 0.941
  High: 4 [2 No, 2 Yes]    Medium: 6 [2 No, 4 Yes]    Low: 4 [1 No, 3 Yes]

H(Income) = (4/14)·H[2,2] + (6/14)·H[2,4] + (4/14)·H[1,3]
          = (4/14)·[-(2/4)log2(2/4) - (2/4)log2(2/4)]
            + (6/14)·[-(2/6)log2(2/6) - (4/6)log2(4/6)]
            + (4/14)·[-(1/4)log2(1/4) - (3/4)log2(3/4)]
          = 0.286 + 0.394 + 0.232 = 0.912
IG(Buys Computer/Income) = H(Buys Computer) - H(Income) = 0.941 - 0.912 = 0.029
Decision Trees Using ID3 Algorithm
b. Which attribute should you choose as the root of a decision tree?

Student (14 samples), H(Buys Computer) = 0.941
  Yes: 7 [1 No, 6 Yes]    No: 7 [4 No, 3 Yes]

H(Student) = (7/14)·H[1,6] + (7/14)·H[4,3]
           = (7/14)·[-(1/7)log2(1/7) - (6/7)log2(6/7)]
             + (7/14)·[-(4/7)log2(4/7) - (3/7)log2(3/7)]
           = 0.296 + 0.493 = 0.789
IG(Buys Computer/Student) = H(Buys Computer) - H(Student) = 0.941 - 0.789 = 0.152
(≈ 0.151 when Info(D) is rounded to 0.940)
Decision Trees Using ID3 Algorithm
b. Which attribute should you choose as the root of a decision tree?

Credit Rating (14 samples), H(Buys Computer) = 0.941
  Fair: 8 [2 No, 6 Yes]    Excellent: 6 [3 No, 3 Yes]

H(Credit Rating) = (8/14)·H[2,6] + (6/14)·H[3,3]
                 = (8/14)·[-(2/8)log2(2/8) - (6/8)log2(6/8)]
                   + (6/14)·[-(3/6)log2(3/6) - (3/6)log2(3/6)]
                 = 0.464 + 0.429 = 0.893
IG(Buys Computer/Credit Rating) = H(Buys Computer) - H(Credit Rating) = 0.941 - 0.893 = 0.048

So, Age is the root of the tree, because IG(Buys Computer/Age) is the greatest information gain value.
Decision Trees Using ID3 Algorithm
b. Which attribute should you choose as the root of a decision tree?

Age (14 samples), H(Buys Computer) = 0.941
  <=30: 5 [3 No, 2 Yes]    31..40: 4 [0 No, 4 Yes]    >40: 5 [2 No, 3 Yes]

Root node:
age?
  <=30 | 31..40 | >40
Attribute Selection: Information Gain
◼ Class P: buys_computer = “yes”
◼ Class N: buys_computer = “no”

Info(D) = I(9,5) = -(9/14)·log2(9/14) - (5/14)·log2(5/14) = 0.940

age      pi   ni   I(pi, ni)
<=30     2    3    0.971
31…40    4    0    0
>40      3    2    0.971

Infoage(D) = (5/14)·I(2,3) + (4/14)·I(4,0) + (5/14)·I(3,2) = 0.694

(5/14)·I(2,3) means “age <=30” has 5 out of 14 samples, with 2 yes’es and 3 no’s.

Gain(age) = Info(D) - Infoage(D) = 0.246

Similarly,
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
(The sketch below recomputes these gains from the training table.)
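As a check on the numbers above, this short sketch recomputes all four gains from the 14-tuple table (plain Python; small differences from the slide values come only from rounding).

```python
import math
from collections import Counter

# The 14-tuple Buys_computer training set from the slides.
data = [
    ("<=30",   "high",   "no",  "fair",      "no"),
    ("<=30",   "high",   "no",  "excellent", "no"),
    ("31..40", "high",   "no",  "fair",      "yes"),
    (">40",    "medium", "no",  "fair",      "yes"),
    (">40",    "low",    "yes", "fair",      "yes"),
    (">40",    "low",    "yes", "excellent", "no"),
    ("31..40", "low",    "yes", "excellent", "yes"),
    ("<=30",   "medium", "no",  "fair",      "no"),
    ("<=30",   "low",    "yes", "fair",      "yes"),
    (">40",    "medium", "yes", "fair",      "yes"),
    ("<=30",   "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no",  "excellent", "yes"),
    ("31..40", "high",   "yes", "fair",      "yes"),
    (">40",    "medium", "no",  "excellent", "no"),
]
ATTRS = {"age": 0, "income": 1, "student": 2, "credit_rating": 3}
CLASS = 4

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(idx):
    labels = [row[CLASS] for row in data]
    after = 0.0
    for value in set(row[idx] for row in data):
        subset = [row[CLASS] for row in data if row[idx] == value]
        after += (len(subset) / len(data)) * entropy(subset)
    return entropy(labels) - after

for name, idx in ATTRS.items():
    print(f"Gain({name}) = {gain(idx):.3f}")
# Gain(age) = 0.247, Gain(income) = 0.029, Gain(student) = 0.152 (0.151 with the
# slide's rounding of Info(D) to 0.940), Gain(credit_rating) = 0.048
# -> age has the largest gain and becomes the root.
```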
Decision Trees Using ID3 Algorithm

◼ You are stranded on a deserted island. Mushrooms of


various types grow widely all over the island, but no
other food is anywhere to be found. Some of the
mushrooms have been determined as poisonous and
others as not (determined by your former companions’
trial and error). You are the only one remaining on the
island. You have the following data to consider:
Decision Trees Using ID3 Algorithm

Mushroom data (attributes: Not Heavy, Smelly, Spotted, Smooth; class: Edible)
Edible: Yes, Yes, Yes, No, No, No, No, No   (3 edible, 5 not edible)
Decision Trees Using ID3 Algorithm

a. What is the entropy of Edible?


b. Which attribute should you choose as the root of a
decision tree?
Decision Trees Using ID3 Algorithm

a. What is the entropy of Edible?

Edible: 5 No, 3 Yes
Entropy(Edible) = H(Edible)
  = -(5/8)·log2(5/8) - (3/8)·log2(3/8) = 0.4238 + 0.5306 = 0.9544
Decision Trees Using ID3 Algorithm
b. Which attribute should you choose as the root of a decision tree?

Not Heavy (8 samples), H(Edible) = 0.9544
  value 0: 3 [2 No, 1 Yes]    value 1: 5 [3 No, 2 Yes]

H(Not Heavy) = (3/8)·H[2,1] + (5/8)·H[3,2]
             = (3/8)·[-(2/3)log2(2/3) - (1/3)log2(1/3)] + (5/8)·[-(3/5)log2(3/5) - (2/5)log2(2/5)]
             = 0.3444 + 0.6068 = 0.9512
IG(Edible/Not Heavy) = H(Edible) - H(Not Heavy) = 0.9544 - 0.9512 = 0.0032
Decision Trees Using ID3 Algorithm
b. Which attribute should you choose as the root of a decision tree?

Smelly (8 samples), H(Edible) = 0.9544
  value 0: 5 [3 No, 2 Yes]    value 1: 3 [2 No, 1 Yes]

H(Smelly) = (5/8)·H[3,2] + (3/8)·H[2,1]
          = (5/8)·[-(3/5)log2(3/5) - (2/5)log2(2/5)] + (3/8)·[-(2/3)log2(2/3) - (1/3)log2(1/3)]
          = 0.6068 + 0.3444 = 0.9512
IG(Edible/Smelly) = H(Edible) - H(Smelly) = 0.9544 - 0.9512 = 0.0032
Decision Trees Using ID3 Algorithm
b. Which attribute should you choose as the root of a decision tree?

Spotted (8 samples), H(Edible) = 0.9544
  value 0: 5 [3 No, 2 Yes]    value 1: 3 [2 No, 1 Yes]

H(Spotted) = (5/8)·H[3,2] + (3/8)·H[2,1]
           = (5/8)·[-(3/5)log2(3/5) - (2/5)log2(2/5)] + (3/8)·[-(2/3)log2(2/3) - (1/3)log2(1/3)]
           = 0.6068 + 0.3444 = 0.9512
IG(Edible/Spotted) = H(Edible) - H(Spotted) = 0.9544 - 0.9512 = 0.0032
Decision Trees Using ID3 Algorithm
b. Which attribute should you choose as the root of a decision tree?

Smooth (8 samples), H(Edible) = 0.9544
  value 0: 4 [2 No, 2 Yes]    value 1: 4 [3 No, 1 Yes]

H(Smooth) = (4/8)·H[2,2] + (4/8)·H[3,1]
          = (4/8)·[-(2/4)log2(2/4) - (2/4)log2(2/4)] + (4/8)·[-(3/4)log2(3/4) - (1/4)log2(1/4)]
          = 0.5 + 0.4056 = 0.9056
IG(Edible/Smooth) = H(Edible) - H(Smooth) = 0.9544 - 0.9056 = 0.0488

So, Smooth is the root of the tree, because IG(Edible/Smooth) has the greatest information gain value.
(These four gains are recomputed in the sketch below.)
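The individual mushroom rows are not reproduced here, but the gains can be re-derived from the per-value class counts listed on the slides above; a short Python check, added for illustration:

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def gain(class_counts, partitions):
    n = sum(class_counts)
    return entropy(class_counts) - sum((sum(p) / n) * entropy(p) for p in partitions)

edible = [5, 3]  # 5 No, 3 Yes

# (No, Yes) counts under each attribute value, as listed on the slides.
splits = {
    "Not Heavy": [[2, 1], [3, 2]],
    "Smelly":    [[3, 2], [2, 1]],
    "Spotted":   [[3, 2], [2, 1]],
    "Smooth":    [[2, 2], [3, 1]],
}
for attr, parts in splits.items():
    print(f"IG(Edible/{attr}) = {gain(edible, parts):.4f}")
# -> 0.0032 for Not Heavy, Smelly, and Spotted; 0.0488 for Smooth, the chosen root.
```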
Computing Information-Gain for Continuous-Valued Attributes
◼ Let attribute A be a continuous-valued attribute
◼ Must determine the best split point for A
  ◼ Sort the values of A in increasing order
  ◼ Typically, the midpoint between each pair of adjacent values is considered as a possible split point
    ◼ (ai + ai+1)/2 is the midpoint between the values ai and ai+1
  ◼ The point with the minimum expected information requirement for A is selected as the split-point for A
◼ Split:
  ◼ D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point
(A small sketch of this midpoint search follows.)
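A small Python sketch of this midpoint search (illustrative; the example reuses the 'years' values and 'qualified' labels from the earlier training-data slide):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Choose the midpoint between adjacent sorted values that minimizes the
    expected information requirement (weighted entropy of the two halves)."""
    pairs = sorted(zip(values, labels))
    best_point, best_info = None, float("inf")
    for (a, _), (b, _) in zip(pairs, pairs[1:]):
        if a == b:
            continue                 # identical adjacent values: no split between them
        point = (a + b) / 2          # candidate split: the midpoint
        left = [lab for v, lab in pairs if v <= point]
        right = [lab for v, lab in pairs if v > point]
        info = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if info < best_info:
            best_point, best_info = point, info
    return best_point, best_info

# Example: the 'years' attribute vs. the 'qualified' labels from the earlier slide.
years = [3, 7, 2, 7, 6, 3]
qualified = ["no", "yes", "yes", "yes", "no", "no"]
print(best_split_point(years, qualified))  # -> (6.5, ...), consistent with the years > 6 rule
```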
Overfitting and Tree Pruning
◼ Overfitting: An induced tree may overfit the training data
  ◼ Too many branches, some may reflect anomalies due to noise or outliers
  ◼ Poor accuracy for unseen samples
◼ Two approaches to avoid overfitting
  ◼ Prepruning: Halt tree construction early: do not split a node if this would result in the goodness measure falling below a threshold
    ◼ Difficult to choose an appropriate threshold
  ◼ Postpruning: Remove branches from a “fully grown” tree, producing a sequence of progressively pruned trees
    ◼ Use a set of data different from the training data to decide which is the “best pruned tree”
(Both approaches are sketched in the code below.)
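Both strategies can be sketched with scikit-learn (the library choice and dataset are assumptions of this example, not part of the slides): thresholds such as max_depth or min_impurity_decrease act as prepruning, while cost-complexity pruning grows a full tree and then selects among progressively pruned trees using held-out data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Prepruning: stop splitting early via thresholds on depth, node size, and impurity gain.
pre = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5, min_impurity_decrease=0.01)
pre.fit(X_train, y_train)

# Postpruning: grow the full tree, then pick the best member of a sequence of
# progressively pruned trees using data held out from training (here X_val).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda tree: tree.score(X_val, y_val),
)
print("prepruned depth:", pre.get_depth(), "| postpruned depth:", best.get_depth())
```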


Enhancements to Basic Decision Tree Induction

◼ Allow for continuous-valued attributes


◼ Dynamically define new discrete-valued attributes that
partition the continuous attribute value into a discrete set of
intervals
◼ Handle missing attribute values
◼ Assign the most common value of the attribute
◼ Assign probability to each of the possible values
◼ Attribute construction
◼ Create new attributes based on existing ones that are
sparsely represented
◼ This reduces fragmentation, repetition, and replication
Example
◼ Apply the ID3 algorithm. Suppose we want to train a
decision tree using the following instances:

◼ Which attribute from the previously mentioned table


will be the first node in the tree?
◼ Use your selected root node to partition the
previously mentioned instances. Sketch your answer
