
CLASSIFICATION
 Task of assigning objects to one of several predefined categories.
 Examples
 Predicting tumor cells as benign or malignant
 Classifying credit card transactions as legitimate or fraudulent
 Detecting spam email messages based on message header and content
 Categorizing news stories as finance, weather, entertainment, sports, etc.
 Classifying galaxies based on their shape
CLASSIFICATION

 Input data is a collection of records.
 Each record is characterized by a tuple (x, y), where x is the attribute set and y is a special attribute designated as the class label.
 The attribute set may contain continuous or discrete attributes, but the class label must be discrete.
CLASSIFICATION
 Classification is the task of learning a target function f that maps each
attribute set x to one of the predefined class labels y.
 The target function is also informally known as classification model.
 A classification model is useful for the following purposes:
 Descriptive Modeling : A classification model can serve as an
explanatory tool to distinguish between objects of different classes.
 A descriptive model which can summarize the features that define a
vertebrate as a mammal, reptile, bird, fish or amphibian.
 Predictive Modeling: A classification model can be used to predict class
labels of unknown records.
CLASSIFICATION

 Classification techniques are best suited for predicting or describing data sets with binary or nominal class labels.
 They are less suitable for ordinal categories because they do not consider the implicit ordering among the categories.
 Other forms of relationships, such as the subclass-superclass relationships among categories, are also ignored.
GENERAL APPROACH

 A classification approach (or classifier) is a systematic approach to build classification


models from an input data set.
 Examples: Decision Tree Classifiers, Neural Networks, Support Vector Machines,
Naïve Bayes Classifier.
 Each technique employs a learning algorithm to identify a model that best fits the
relationship between the attribute set and class label of the input data.
 The model generated by the learning algorithm should both fit the input data well
and correctly predict the class label of records it has never seen before.
 The key objective of the learning algorithm is to build models with good
generalization capability.
GENERAL APPROACH

 Given a collection of records (training set )


 Each record contains a set of attributes, one of the attributes is the class.
 Find a model for class attribute as a function of the values of other attributes.
 Goal: previously unseen records should be assigned a class as accurately as possible.
 A test set is used to determine the accuracy of the model. Usually, the given data
set is divided into training and test sets, with training set used to build the model
and test set used to validate it.
ILLUSTRATION OF CLASSIFICATION TASK
 A training set (records with known class labels) is fed to a learning algorithm, which induces a model (induction).
 The model is then applied to a test set (records with unknown class labels) to predict their class labels (deduction).

Training Set
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
CONFUSION MATRIX

 Confusion Matrix: Tabulates the counts of


test records correctly and incorrectly
predicted by the model.
 It is used to evaluate the performance of a
classification model.
 Total Number of Correct Predictions made
by the model
 TP +TN
 Total Number of Incorrect Predictions
 FP + FN
 Performance Metric: Accuracy
   Accuracy = Number of Correct Predictions / Total number of predictions = (TP + TN) / (TP + FP + FN + TN)
   Error Rate = Number of Incorrect Predictions / Total number of predictions = (FP + FN) / (TP + FP + FN + TN)

 Classification algorithms seek models that attain the highest accuracy or, equivalently, the lowest error rate.
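As a minimal illustration of these metrics (the counts below are made up, not taken from any dataset in these slides), accuracy and error rate can be computed directly from the four confusion-matrix cells:

# Minimal sketch: accuracy and error rate from confusion-matrix counts.
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def error_rate(tp, tn, fp, fn):
    return (fp + fn) / (tp + tn + fp + fn)

tp, tn, fp, fn = 50, 35, 5, 10            # hypothetical test-set counts
print(accuracy(tp, tn, fp, fn))            # 0.85
print(error_rate(tp, tn, fp, fn))          # 0.15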
HOW A DECISION TREE WORKS
 Vertebrate Classification
 Attributes:
 Name (Whale, Frog, Cat, Leopard)
 Body Temperature (Warm Blooded/ Cold Blooded)
 Skin Cover (Hair/ Scales/ Fur/ Feather)
 Gives Birth (Yes/No)
 Aquatic Creature (Yes/No)
 Aerial Creature (Yes/No)
 Has Legs (Yes/No)
 Hibernates (Yes/No)
 Class Label: Mammal/ Non-mammal
HOW A DECISION TREE WORKS
 Suppose a new species is discovered by scientists. How can we tell whether it is a
mammal or a non-mammal?
 An approach is to pose a series of questions about the characteristics of species.
 Q1: Is the species warm blooded or cold blooded?
 If it is cold blooded, it is definitely not a mammal; otherwise, it is either a bird or a mammal.
 Q2: Do the females of the species give birth to their young?
 Those that give birth are definitely mammals and those that do not are likely to be
non-mammals.
 Each time we receive an answer, a follow-up question is asked until we reach a conclusion about the class label of the record.
HOW A DECISION TREE WORKS?

 The series of questions and their possible answers can be organized in the form of a decision tree.
 It is a hierarchical structure that consists of nodes and directed edges.
A tree has three types of nodes
• Root Node: One root node with no incoming
edges and zero or more outgoing edges.
• Internal Nodes: Exactly one incoming edge and
two or more outgoing edges.
• Leaf or Terminal Nodes: Exactly one incoming
edge and no outgoing edges.
HOW A DECISION TREE WORKS?

 Each leaf node is assigned a class label.


 The non-terminal nodes, which
include the root and other internal
nodes, contain attribute test
conditions to separate records that
have different characteristics.
HOW A DECISION TREE WORKS?

 To classify a test record, apply the test condition to the record and follow the appropriate branch based on the outcome of the test.
 This will lead to either an internal node, for which a new
test condition is applied or to a leaf node.
 The class label associated with the leaf node is then
assigned to the record.
EXAMPLE OF A DECISION TREE

Training Data
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)
Refund?
  Yes -> leaf NO
  No  -> MarSt? (records 2,3,5,6,8,9,10)
           Single, Divorced -> TaxInc? (records 3,5,8,10)
                                 < 80K -> leaf NO
                                 > 80K -> leaf YES
           Married          -> leaf NO (records 2,6,9)
ANOTHER EXAMPLE OF DECISION TREE

For the same training data (Tid 1-10 above), a different tree also fits:
MarSt?
  Married          -> leaf NO
  Single, Divorced -> Refund?
                        Yes -> leaf NO
                        No  -> TaxInc?
                                 < 80K -> leaf NO
                                 > 80K -> leaf YES

There could be more than one tree that fits the same data!
DECISION TREE CLASSIFICATION TASK

 The training set (Tid 1-10 above) is given to a tree induction algorithm, which learns a decision tree model (induction).
 The model is then applied to the test set (Tid 11-15, class unknown) to deduce class labels for the new records (deduction).
APPLY MODEL TO TEST DATA

Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

 Start from the root of the tree.
 Refund = No, so follow the "No" branch to the MarSt node.
 Marital Status = Married, so follow the "Married" branch, which leads to a leaf labelled NO.
 Assign Cheat = "No" to the test record.
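A minimal Python sketch of this traversal, hard-coding the example tree (the dictionary keys are illustrative names, not part of the slides):

def classify(record):
    # Root test: Refund?
    if record["Refund"] == "Yes":
        return "NO"
    # Refund = No -> test Marital Status
    if record["MarSt"] == "Married":
        return "NO"
    # Single or Divorced -> test Taxable Income against the 80K threshold
    return "NO" if record["TaxInc"] < 80 else "YES"

# The test record from the walkthrough above (income in thousands).
print(classify({"Refund": "No", "MarSt": "Married", "TaxInc": 80}))   # NO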
DECISION TREE CLASSIFICATION TASK

 To summarize: the tree induction algorithm learns a decision tree from the training set, and the learned tree is then applied to the test set to deduce the unknown class labels.
DECISION TREE INDUCTION

 Many Algorithms:
 Hunt’s Algorithm (one of the earliest)
 CART
 ID3, C4.5
 SLIQ,SPRINT
GENERAL STRUCTURE OF HUNT'S ALGORITHM

 Let Dt be the set of training records that reach a node t (the loan records, Tid 1-10, shown earlier are the running example).
 General Procedure:
 If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
 If Dt is an empty set, then t is a leaf node labeled by the default class, yd.
 If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset (a sketch of this procedure follows below).
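The general procedure can be sketched in Python roughly as follows. This is an illustrative sketch (the records, attribute names and the naive choose_first_splittable helper are made up), and the attribute-selection step is deliberately left as a plug-in because the impurity measures used for it are introduced later.

from collections import Counter

def hunts(records, default_class, choose_test):
    # records: list of (attribute_dict, class_label) pairs reaching this node (Dt)
    labels = [y for _, y in records]
    if not records:                              # Dt empty -> leaf labeled with default class yd
        return {"leaf": default_class}
    if len(set(labels)) == 1:                    # all records in the same class yt -> leaf yt
        return {"leaf": labels[0]}
    attr = choose_test(records)
    if attr is None:                             # nothing left to split on -> majority-class leaf
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    majority = Counter(labels).most_common(1)[0][0]
    children = {}
    for value in sorted({x[attr] for x, _ in records}):
        subset = [(x, y) for x, y in records if x[attr] == value]
        children[value] = hunts(subset, majority, choose_test)   # recurse on each subset
    return {"test": attr, "children": children}

def choose_first_splittable(records):
    # Naive placeholder: pick any attribute that still takes more than one value.
    for attr in records[0][0]:
        if len({x[attr] for x, _ in records}) > 1:
            return attr
    return None

data = [({"Refund": "Yes", "MarSt": "Single"}, "No"),
        ({"Refund": "No", "MarSt": "Married"}, "No"),
        ({"Refund": "No", "MarSt": "Single"}, "Yes")]
print(hunts(data, "No", choose_first_splittable))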
HUNT'S ALGORITHM

Applied to the loan data (Tid 1-10), Hunt's algorithm grows the tree in steps:
 Step 1: A single node labeled Don't Cheat (the majority class).
 Step 2: Split on Refund. Refund = Yes -> Don't Cheat; Refund = No -> Don't Cheat.
 Step 3: The Refund = No branch is split on Marital Status. Married -> Don't Cheat; Single, Divorced -> Cheat (this node still contains records of both classes, so it is refined further).
 Step 4: The Single, Divorced node is split on Taxable Income. < 80K -> Don't Cheat; >= 80K -> Cheat.
HUNT'S ALGORITHM
Task: predicting whether a loan applicant will repay her loan obligations or become delinquent, subsequently defaulting on her loan.

• The initial tree contains a single node with class label Defaulted = "NO", which means most borrowers successfully repaid their loans.
• However, the tree needs refinement.
HUNT’S ALGORITHM

 Hunt’s algorithm will work if every combination of attribute values is present in the
training data and each combination has a unique class label.
 These assumptions are too stringent for use in most practical scenarios.
 Additional conditions are needed to handle the following cases:
 It is possible for some of the child nodes created to be empty, i.e., there are no
records associated with these nodes. This can happen if none of the training
records have the combination of attribute values associated with such nodes.
 In this case the node is declared a leaf node with the same class label as the
majority class of training records associated with its parent node.
HUNT’S ALGORITHM

 If all the records associated with Dt have identical attribute values (except for the
class label), then it is not possible to split the records any further.
 In this case, the node is declared a leaf node with the same class label as the
majority class of training records associated with this node.
TREE INDUCTION

 Greedy strategy.
 Split the records based on an attribute test that optimizes a certain criterion.
 Issues
 Determine how to split the records
 How to specify the attribute test condition?
 How to determine the best split?
 Determine when to stop splitting
 Continue expanding a node until either all records belong to the same class or all
records have identical attribute values
METHODS FOR EXPRESSING ATTRIBUTE TEST CONDITIONS

 Depends on attribute types


 Nominal
 Ordinal
 Continuous
 Depends on number of ways to split
 2-way split
 Multi-way split
SPLITTING BASED ON NOMINAL ATTRIBUTES

 Multi-way split: use as many partitions as there are distinct values.
   CarType -> {Family}, {Sports}, {Luxury}

 Binary split: divides the values into two subsets; need to find the optimal partitioning.
   CarType -> {Sports, Luxury} vs. {Family}   OR   CarType -> {Family, Luxury} vs. {Sports}
SPLITTING BASED ON ORDINAL ATTRIBUTES

 Can produce binary or multiway


splits.
 They can be grouped as long as
the grouping does not violate the
order property of the attribute
values.
SPLITTING BASED ON ORDINAL ATTRIBUTES
 Multi-way split: use as many partitions as there are distinct values.
   Size -> {Small}, {Medium}, {Large}

 Binary split: divides the values into two subsets; need to find the optimal partitioning.
   Size -> {Small, Medium} vs. {Large}   OR   Size -> {Small} vs. {Medium, Large}

 What about the split Size -> {Small, Large} vs. {Medium}? It violates the order property of the attribute values.
SPLITTING BASED ON CONTINUOUS ATTRIBUTES

 Different ways of handling


 Discretization to form an ordinal categorical attribute
 Static – discretize once at the beginning
 Dynamic – ranges can be found by equal interval bucketing, equal frequency bucketing
(percentiles), or clustering.

 Binary Decision: (A < v) or (A >= v)
 Consider all possible splits and find the best cut.
 This can be more computationally intensive.
SPLITTING BASED ON CONTINUOUS ATTRIBUTES

 (i) Binary split: Taxable Income > 80K? -> Yes / No
 (ii) Multi-way split: Taxable Income? -> < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K


HOW TO DETERMINE THE BEST SPLIT?
 Various measures can be used to determine the best way to split the records.
 These measures are defined in terms of class distribution of records before and after
splitting.
 Let p(i|t) denote the fraction of records belonging to class i at a given node t. We
sometimes omit the reference to node t and express the fraction as pi.
 Greedy approach:
 Nodes with homogeneous class distribution are preferred
 Need a measure of node impurity. For example:
   C0: 5, C1: 5  -> non-homogeneous, high degree of impurity
   C0: 9, C1: 1  -> homogeneous, low degree of impurity
HOW TO DETERMINE THE BEST SPLIT?

Before splitting: 10 records of class C0 and 10 records of class C1.

Candidate test conditions and the resulting class distributions:
 Own Car?     Yes: C0 6, C1 4        No: C0 4, C1 6
 Car Type?    Family: C0 1, C1 3     Sports: C0 8, C1 0     Luxury: C0 1, C1 7
 Student ID?  c1 ... c20: each child node contains a single record (C0 1, C1 0 or C0 0, C1 1)

Which test condition is the best?


HOW TO DETERMINE THE BEST SPLIT?

 The measure developed for selecting the best split are often based on the degree of
impurity of the child nodes.
 The smaller the degree of impurity, the more skewed the class distribution.
 A node with class distribution (0,1) has zero impurity and a node with uniform class
distribution (0.5,0.5) has the highest impurity.
MEASURES OF NODE IMPURITY

 Gini Index

 Entropy

 Misclassification error
ALGORITHM

 The best attribute is selected and used as the test at the root node of the tree.
 A descendant of the root node is then created for each possible value of this
attribute, and the training examples are sorted to the appropriate descendant
node (i.e., down the branch corresponding to the example's value for this
attribute).
 The entire process is then repeated using the training examples associated with
each descendant node to select the best attribute to test at that point in the
tree.
 This forms a greedy search for an acceptable decision tree, in which the
algorithm never backtracks to reconsider earlier choices.
ID3 ALGORITHM
Examples: Training Examples
Target-attribute: Attribute to be predicted
Attributes: List of predictors

ID3 (Examples ,Target-attribute,Attributes)


 Create a root node for the tree.
 If all examples are positive, return the single-node tree Root, with label = +
 If all examples are negative, return the single-node tree Root, with label = -
 If Attributes is empty, return the single-node tree Root with label = most common
value of Target-attribute in Examples.
ID3 Otherwise
ALGORITHM Begin ID3 ALGORITHM
 A  the attribute from Attributes that best classifies Examples
 The decision attribute for Root  A
 For each possible value vi of A
 Add a new tree branch below Root, corresponding to the test A= vi
 Let Examples vi be the subset of Examples that have vi for A
 If Examples viis empty or below a certain threshold (Eg. 5%)
 Then below this new branch add a leaf node with label = most common value of
Target-attribute in Examples
 Else below this new branch add the subtree
ID3(Examples vi,Target-attribute,Attributes –{A})
 End
 Return Root
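A compact Python sketch of the pseudocode above, using information gain (defined on the following slides) to pick the best attribute; the empty-subset threshold test is omitted, and the example rows at the end are a small subset of the play-tennis data used later.

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_gain(examples, target, attr):
    labels = [e[target] for e in examples]
    gain = entropy(labels)
    for value in {e[attr] for e in examples}:
        subset = [e[target] for e in examples if e[attr] == value]
        gain -= len(subset) / len(examples) * entropy(subset)
    return gain

def id3(examples, target, attributes):
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:                          # all positive or all negative
        return labels[0]
    if not attributes:                                 # no attributes left -> most common label
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(examples, target, a))
    tree = {best: {}}
    for value in sorted({e[best] for e in examples}):
        subset = [e for e in examples if e[best] == value]
        tree[best][value] = id3(subset, target, [a for a in attributes if a != best])
    return tree

examples = [{"Outlook": "Sunny", "Wind": "Weak", "Play": "No"},
            {"Outlook": "Sunny", "Wind": "Strong", "Play": "No"},
            {"Outlook": "Overcast", "Wind": "Weak", "Play": "Yes"},
            {"Outlook": "Rain", "Wind": "Weak", "Play": "Yes"}]
print(id3(examples, "Play", ["Outlook", "Wind"]))
# {'Outlook': {'Overcast': 'Yes', 'Rain': 'Yes', 'Sunny': 'No'}}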
WHICH ATTRIBUTE IS THE BEST CLASSIFIER?

 We would like to select the attribute that is most useful for classifying examples.
 We will define a statistical property, called information gain, that
measures how well a given attribute separates the training examples
according to their target classification
ENTROPY

 A measure used from Information Theory in the ID3 algorithm and popularly used in
decision tree construction is that of Entropy.
 Entropy of a dataset measures the impurity of the dataset.
 Informally, Entropy can be considered to find out how disordered the dataset is.
ENTROPY

 It has been shown that there is a relationship between entropy and information.
 That is, the higher the uncertainty or entropy of some data, the more information is required to completely describe that data.
 In building a decision tree, the aim is to decrease the entropy of the dataset until we
reach leaf nodes at which point the subset that we are left with is pure, or has zero
entropy and represents instances all of one class.
ENTROPY
 Suppose C is the class distribution. The entropy H of the class variable is defined as

   H(C) = - p(+) log2 p(+) - p(-) log2 p(-)

 where p(+) and p(-) are the proportions of positive and negative examples in C.
ENTROPY
 Conditional entropy H(C|A) is the entropy of one random variable conditional upon knowledge of another:

   H(C|A) = - Σ_{a ∈ A} p(a) Σ_{c ∈ C} p(c|a) log2 p(c|a)
INFOGAIN
 In terms of decision trees, the information gain (Infogain) is based on the decrease
in entropy after a dataset is split on an attribute i.e. the mutual information given a
specific attribute.
 Constructing a decision tree is all about finding the attribute that returns the highest information gain (mutual information).
ALTERNATIVE SPLITTING CRITERIA BASED ON INFO

 Entropy at a given node t:

   Entropy(t) = - Σ_j p(j|t) log2 p(j|t)

 (NOTE: p(j|t) is the relative frequency of class j at node t.)


 Measures homogeneity of a node.
 Maximum (log nc) when records are equally distributed among all
classes implying least information
 Minimum (0.0) when all records belong to one class, implying most
interesting information
EXAMPLES FOR COMPUTING ENTROPY

   Entropy(t) = - Σ_j p(j|t) log2 p(j|t)

 Node with C1 = 0, C2 = 6:  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
   Entropy = - 0 log2 0 - 1 log2 1 = - 0 - 0 = 0

 Node with C1 = 1, C2 = 5:  P(C1) = 1/6, P(C2) = 5/6
   Entropy = - (1/6) log2 (1/6) - (5/6) log2 (5/6) = 0.65

 Node with C1 = 2, C2 = 4:  P(C1) = 2/6, P(C2) = 4/6
   Entropy = - (2/6) log2 (2/6) - (4/6) log2 (4/6) = 0.92
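The same three node entropies can be reproduced with a short helper (a sketch, not part of the slides):

import math

def node_entropy(counts):
    # Entropy of a node from its class counts; 0 * log 0 is treated as 0.
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(round(node_entropy([0, 6]), 2))   # 0.0
print(round(node_entropy([1, 5]), 2))   # 0.65
print(round(node_entropy([2, 4]), 2))   # 0.92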
GAIN

 Decision tree induction algorithms choose a test condition that maximizes the gain
(∆).
 As, I (Parent) is same for all test conditions, maximizing the gain is equivalent to
minimizing the weighted average impurity measures of the child nodes.
 When entropy is used as the impurity measure, the difference in entropy is known as the information gain, Δ_info.
HOW TO FIND THE BEST SPLIT
 Before splitting, the parent node has class counts C0: N00 and C1: N01, with impurity M0.
 Splitting on attribute A (Yes/No) produces nodes N1 (C0: N10, C1: N11) and N2 (C0: N20, C1: N21), with impurities M1 and M2 and weighted child impurity M12.
 Splitting on attribute B (Yes/No) produces nodes N3 (C0: N30, C1: N31) and N4 (C0: N40, C1: N41), with impurities M3 and M4 and weighted child impurity M34.
 Compare Gain = M0 - M12 vs. M0 - M34 and choose the split with the larger gain.
EXAMPLE
WHICH ATTRIBUTE TO SELECT?
 The play-tennis training data (14 examples):
Day  Outlook   Temperature  Humidity  Wind    Play Tennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cold         Normal    Weak    Yes
D6   Rain      Cold         Normal    Strong  No
D7   Overcast  Cold         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cold         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No
EXAMPLE
 There are two classes, play tennis, “yes” and “no” in the data. Therefore the entropy can be
calculated as below:
Entropy(S) = -pyes log2(pyes) -pno log2(pno)

= -(9/14) * log2(9/14) -(5/14) * log2(5/14)

= -0.643 log2 0.643 -0.357 log2 0.357

= 0.410 + 0.530 = 0.94


EXAMPLE

 Now, the next step is to select highly significant input variable among all the four
input variables (Outlook, Temperature, Humidity, Wind) that will split the data more
purely than the other attributes.
 For this, we will calculate the information gain that would result in over the entire
dataset after splitting the attributes (Outlook,Temperature, Humidity,Wind).
EXAMPLE
Infogain (S|Outlook) = Entropy(S) -5/14 Entropy (S|Outlook =Sunny) – 4/14
Entropy(S|Outlook =Overcast) – 5/14 Entropy(S|Outlook =Rain)

= 0.94 –(5/14) (-pyes log2(pyes) -pno log2(pno) ) –(4/14) (-pyes log2(pyes) -pno log2(pno) ) –
(5/14) (-pyes log2(pyes) -pno log2(pno) )

= 0.94 -(5/14) (-2/5 log2 2/5 -3/5 log2 3/5) -(4/14) (-4/4 log2 4/4) -(5/14) (-3/5 log2 3/5 -
2/5 log2 2/5)
= 0.94 – 0.347 – 0 -0.347
= 0.246 bits
EXAMPLE
Infogain (S|Temperature) = Entropy(S) -4/14 Entropy (S| Temperature =Hot) – 6/14
Entropy(S| Temperature =Mild) – 4/14 Entropy(S|Temperature =Cold)

= 0.94 –(4/14) (-pyes log2(pyes) -pno log2(pno) ) –(6/14) (-pyes log2(pyes) -pno log2(pno) ) –
(4/14) (-pyes log2(pyes) -pno log2(pno) )

= 0.94 -(4/14) (-2/4 log2 2/4 -2/4 log2 2/4) -(6/14) (-4/6 log2 4/6 -2/6 log2 2/6) -(4/14) (-
3/4 log2 3/4 -1/4 log2 1/4)
= 0.94 – 0.286 – 0.392 -0.233
= 0.029 bits
EXAMPLE
Infogain (S|Humidity) = Entropy(S) -7/14 Entropy (S| Humidity =High) – 7/14 Entropy(S|
Humidity =Normal)

= 0.94 –(7/14) (-pyes log2(pyes) -pno log2(pno) ) –(7/14) (-pyes log2(pyes) -pno log2(pno) )

= 0.94 -(7/14) (-3/7 log2 3/7 -4/7 log2 4/7) -(7/14) (-6/7 log2 6/7 -1/7 log2 1/7)
= 0.94 – 0.493 – 0.296 = 0.151 bits
EXAMPLE
Infogain (S|Wind) = Entropy(S) -6/14 Entropy (S| Wind =Strong) – 8/14 Entropy(S|
Wind = Weak)

= 0.94 –(6/14) (-pyes log2(pyes) -pno log2(pno) ) –(8/14) (-pyes log2(pyes) -pno log2(pno) )

= 0.94 -(6/14) (-1/2 log2 1/2 -1/2 log2 1/2) -(8/14) (-6/8 log2 6/8 -2/8 log2 2/8)
= 0.94 – 0.428 – 0.465 = 0.047 bits
EXAMPLE
 Now, we select the attribute for the root node which will result in the highest
reduction in entropy.
Infogain (S| Outlook) = 0.246 bits
Infogain (S| Temperature) = 0.029 bits
Infogain (S| Humidity) = 0.151 bits
Infogain (S|Wind) = 0.047 bits
 We can clearly see that the attribute Outlook results in the highest reduction in
entropy or the highest information gain.
 We would therefore select Outlook at the root node, splitting the data up into subsets
corresponding to all the different values for the Outlook attribute.
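The four information gains can be checked with a short sketch that works from the per-value (yes, no) counts of the play-tennis data; small differences from the slide figures (e.g. 0.247 vs. 0.246) come from the slides rounding intermediate entropies.

import math

def H(yes, no):
    total = yes + no
    return -sum(p * math.log2(p) for p in (yes / total, no / total) if p > 0)

def info_gain(parent, partitions):
    n = sum(parent)
    return H(*parent) - sum((y + no) / n * H(y, no) for y, no in partitions)

parent = (9, 5)                                               # PlayTennis: 9 yes, 5 no
print(round(info_gain(parent, [(2, 3), (4, 0), (3, 2)]), 3))  # Outlook     ~0.247
print(round(info_gain(parent, [(2, 2), (4, 2), (3, 1)]), 3))  # Temperature ~0.029
print(round(info_gain(parent, [(3, 4), (6, 1)]), 3))          # Humidity    ~0.152
print(round(info_gain(parent, [(3, 3), (6, 2)]), 3))          # Wind        ~0.048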
EXAMPLE

 Now, for Outlook=Overcast, we have the following instances.


Day  Outlook  Temperature  Humidity  Wind  Play Tennis
D3 Overcast Hot High Weak Yes
D7 Overcast Cold Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
As all instances have output=“yes”, we do not need to split any further.
EXAMPLE

Which attribute should be tested here?
EXAMPLE
 Now, for Outlook=Sunny, we have the following instances.
Day  Outlook  Temperature  Humidity  Wind  Play Tennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D8 Sunny Mild High Weak No
D9 Sunny Cold Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
Entropy(Sunny) = -pyes log2(pyes) -pno log2(pno)
= (-2/5 log2 2/5 -3/5 log2 3/5) = 0.97
EXAMPLE

Infogain (Sunny|Humidity) = Entropy(Sunny) -3/5 Entropy (Sunny| Humidity =High) –


2/5 Entropy(Sunny| Humidity = Normal)

= 0.97 -(3/5) (-3/3 log2 3/3 - 0) -(2/5) (-2/2 log2 2/2 -0)
= 0.97 – 0 – 0 = 0.97 bits
Infogain (Sunny|Temperature) = Entropy(Sunny) -2/5 Entropy (Sunny| Temp. =Hot) – 2/5
Entropy(Sunny| Temp.= Mild) - 1/5 Entropy(Sunny| Temp. = Cold)

= 0.97 -(2/5) (-2/2 log2 2/2 - 0) -(2/5) (-1/2 log2 1/2 - 1/2 log2 1/2) -(1/5) (-1/1 log2 1/1 -
0)
= 0.97 – 0 – 0.4-0 = 0.57 bits
EXAMPLE

Infogain (Sunny|Wind) = Entropy(Sunny) -3/5 Entropy (Sunny| Wind =Weak) –


2/5 Entropy(Sunny| Wind = Strong)

= 0.97 -(3/5) (-1/3 log2 1/3 - 2/3 log2 2/3) -(2/5) (-1/2 log2 1/2 - 1/2 log2 1/2)
= 0.97 – 0.551 – 0.4 = 0.019 bits
Now,
Infogain (Sunny| Humidity) = 0.97 bits
Infogain (Sunny| Temperature) = 0.57 bits
Infogain (Sunny| Wind) = 0.019 bits
Thus, we select Humidity attribute.
EXAMPLE

Tree so far: the root {D1, D2, ..., D14} (9 Yes, 5 No) is split on Outlook.
 Sunny:    {D1, D2, D8, D9, D11}, 2 Yes, 3 No  -> split on Humidity
 Overcast: {D3, D7, D12, D13}, 4 Yes, 0 No     -> leaf Yes
 Rain:     {D4, D5, D6, D10, D14}, 3 Yes, 2 No -> ? Which attribute should be tested here?
EXAMPLE
 Now, for Outlook=Rain, we have the following instances.

Day  Outlook  Temperature  Humidity  Wind  Play Tennis
D4 Rain Mild High Weak Yes
D5 Rain Cold Normal Weak Yes
D6 Rain Cold Normal Strong No
D10 Rain Mild Normal Weak Yes
D14 Rain Mild High Strong No
Entropy(Rain) = -pyes log2(pyes) -pno log2(pno)
= (-3/5 log2 3/5 -2/5 log2 2/5) = 0.97
EXAMPLE
Infogain (Rain|Humidity) = Entropy(Rain) -2/5 Entropy (Rain| Humidity =High) – 3/5
Entropy(Rain| Humidity = Normal)

= 0.97 -(2/5) (-1/2 log2 1/2 - 1/2 log2 1/2 ) -(3/5) (-2/3 log2 2/3 - 1/3 log2 1/3)
= 0.97 – 0.4 – 0.551 = 0.019 bits
Infogain (Rain|Temperature) = Entropy(Rain) -2/5 Entropy (Rain| Temp. =Cold) – 3/5
Entropy(Rain| Temp.= Mild)

= 0.97 -(2/5) (-1/2 log2 1/2 - 1/2 log2 1/2) -(3/5) (-2/3 log2 2/3 - 1/3 log2 1/3)
= 0.97 – 0.4 – 0.551 = 0.019 bits
EXAMPLE
Infogain (Rain|Wind) = Entropy(Rain) - 2/5 Entropy(Rain| Wind = Strong) - 3/5
Entropy(Rain| Wind = Weak)

= 0.97 - (2/5) (-2/2 log2 2/2 - 0) - (3/5) (-3/3 log2 3/3 - 0)
= 0.97 - 0 - 0 = 0.97 bits
Now,
Infogain (Rain| Humidity) = 0.019 bits
Infogain (Rain| Temperature) = 0.019 bits
Infogain (Rain| Wind) = 0.97 bits
Thus, we select wind attribute.
DECISION TREES

The final tree:
Outlook?
  Sunny    -> Humidity?  High -> No;   Normal -> Yes
  Overcast -> Yes
  Rain     -> Wind?      Strong -> No; Weak -> Yes
DECISION TREES

The tree classifies Play Tennis = Yes when:
(Outlook = Sunny) ∧ (Humidity = Normal)
∨ (Outlook = Overcast)
∨ (Outlook = Rain) ∧ (Wind = Weak)
PROBLEM WITH INFORMATION GAIN APPROACH

 Biased towards tests with many outcomes (attributes having a large number of
values)
 E.g: An attribute acting as unique identifier will produce a large number of partitions
(1 tuple per partition).
 Each resulting partition D is pure, so Info(D) = 0
 The information gain is maximized
 Eg: In a data with customer attributes (Customer ID, Gender and Car Type), a split on
“Customer ID” is preferred as compared to other attributes such as “Gender” and
“Car Type”. But “Customer ID” is not a predictive attribute.
EXTENSION TO INFORMATION GAIN

 C4.5 a successor of ID3 uses an extension to information gain known as gain ratio.
 It overcomes the bias of Information gain as it applies a kind of normalization to
information gain using a split information value.
 The split information value represents the potential information generated by splitting the training data set S into c partitions, corresponding to the c outcomes of a test on attribute A:

   SplitInformation(S, A) = - Σ_{i=1}^{c} (|Si| / |S|) log2 (|Si| / |S|)

 High split info: partitions have more or less the same size
 Low split info: few partitions hold most of the tuples
GAIN
 To determine how well a test condition performs, we need to compare the degree of
impurity of the parent node (before splitting) with the degree of impurity of the child
node (after splitting).
 The larger the difference, the better is the test condition.
 The Gain (Δ) is a criterion that can be used to determine the goodness of a split:

   Δ = I(Parent) - Σ_{j=1}^{k} (N(vj) / N) · I(vj)

 where I(·) is the impurity measure of a given node, N is the total number of records at the parent node, k is the number of attribute values, and N(vj) is the number of records associated with the child node vj.
SPLITTING BASED ON INFO...
 Information Gain:

   GAIN_split = Entropy(p) - Σ_{i=1}^{k} (n_i / n) Entropy(i)

 where parent node p is split into k partitions and n_i is the number of records in partition i.
 Measures Reduction in Entropy achieved because of the split. Choose the split that
achieves most reduction (maximizes GAIN)
 Used in ID3
 Disadvantage: Tends to prefer splits that result in large number of partitions, each
being small but pure. Eg: A split on “Customer ID” is preferred as compared to
“Gender” and “Car Type”. But “Customer ID” is not a predictive attribute.
SPLITTING BASED ON INFO...
 Gain Ratio:

   GainRATIO_split = GAIN_split / SplitINFO,   where   SplitINFO = - Σ_{i=1}^{k} (n_i / n) log2 (n_i / n)

 where parent node p is split into k partitions and n_i is the number of records in partition i.
 Adjusts Information Gain by the entropy of the partitioning (SplitINFO). Higher
entropy partitioning (large number of small partitions) is penalized!
 Used in C4.5
 Designed to overcome the disadvantage of Information Gain.
EXTENSION TO INFORMATION GAIN

 Gain ratio is defined as
   Gain Ratio(A) = Gain(A) / SplitInformation(S, A)
The attribute with the maximum gain ratio is selected as the splitting attribute.
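A small sketch of the gain ratio computation from subset sizes, assuming the information gain has already been computed (the numbers reproduce the Outlook example worked out on the following slides):

import math

def split_info(sizes):
    n = sum(sizes)
    return -sum(s / n * math.log2(s / n) for s in sizes if s > 0)

def gain_ratio(gain, sizes):
    return gain / split_info(sizes)

print(round(split_info([5, 4, 5]), 3))         # Outlook split info  ~1.577
print(round(gain_ratio(0.246, [5, 4, 5]), 3))  # Outlook gain ratio  ~0.156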
EXAMPLE
 For Outlook
 Gain: 0.246
 Split Info:
 Sunny: 5 examples, Overcast: 4 examples, Rain: 5
examples
 -5/14 log2 5/14- 4/14 log2 4/14 -5/14 log2 5/14 =
1.577
 Gain Ratio:
 =(0.246/1.577) = 0.156
EXAMPLE

 For Temperature
 Gain: 0.029
 Split Info:
 Hot: 4 examples, Mild: 6 examples, Cold:4 examples
 -4/14 log2 4/14- 6/14 log2 6/14- 4/14 log2 4/14 =
1.56
 Gain Ratio:
 =(0.029/1.56) = 0.018
EXAMPLE

 For Humidity
 Gain: 0.151
 Split Info:
 High: 7 examples, Normal: 7 examples
 -7/14 log2 7/14- 7/14 log2 7/14 = -1/2 log2 1/2- 1/2
log2 1/2 = 1
 Gain Ratio:
 =(0.151/1) = 0.151
EXAMPLE

 For Wind
 Gain: 0.047
 Split Info:
 Strong: 6 examples, Weak: 8 examples
 -6/14 log2 6/14- 8/14 log2 8/14 =0.985
 Gain Ratio:
 =(0.047/0.985) = 0.048
EXAMPLE

 For Outlook:      Gain Ratio = 0.156
 For Humidity:     Gain Ratio = 0.151
 For Temperature:  Gain Ratio = 0.019
 For Wind:         Gain Ratio = 0.048
 Again, Outlook is selected as it has the highest gain ratio.
EXAMPLE

Which attribute should be tested here?
EXAMPLE
 Now, for Outlook=Sunny, we have the following
instances.
Day  Outlook  Temperature  Humidity  Wind  Play Tennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D8 Sunny Mild High Weak No
D9 Sunny Cold Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
EXAMPLE

 For Outlook=Sunny|Humidity
 Gain: 0.97
 Split Info:
 High: 3 examples, Normal: 2 examples
 -3/5 log2 3/5- 2/5 log2 2/5 =0.971
 Gain Ratio:
 =(0.97/0.971) = 0.999
EXAMPLE

 For Outlook=Sunny|Temperature
 Gain: 0.57
 Split Info:
 Hot: 2 examples, Mild: 2 examples, Cold: 1 example
 -2/5 log2 2/5- 2/5 log2 2/5- 1/5 log2 1/5 =1.52
 Gain Ratio:
 =(0.57/1.52) = 0.375
EXAMPLE

 For Outlook=Sunny|Wind
 Gain: 0.019
 Split Info:
 Strong: 2 examples, Weak: 3 examples
 -2/5 log2 2/5- 3/5 log2 3/5 =0.971
 Gain Ratio:
 =(0.019/0.971) = 0.0195
EXAMPLE

 For Outlook=Sunny | Humidity:     Gain Ratio = 0.999
 For Outlook=Sunny | Temperature:  Gain Ratio = 0.375
 For Outlook=Sunny | Wind:         Gain Ratio = 0.0195
 Humidity is selected as it has the highest gain ratio.
EXAMPLE

Tree so far: the root {D1, D2, ..., D14} (9 Yes, 5 No) is split on Outlook.
 Sunny:    {D1, D2, D8, D9, D11}, 2 Yes, 3 No  -> split on Humidity
 Overcast: {D3, D7, D12, D13}, 4 Yes, 0 No     -> leaf Yes
 Rain:     {D4, D5, D6, D10, D14}, 3 Yes, 2 No -> ? Which attribute should be tested here?
EXAMPLE
 Now, for Outlook=Rain, we have the following instances.
Day  Outlook  Temperature  Humidity  Wind    Play Tennis
D4   Rain     Mild         High      Weak    Yes
D5   Rain     Cold         Normal    Weak    Yes
D6   Rain     Cold         Normal    Strong  No
D10  Rain     Mild         Normal    Weak    Yes
D14  Rain     Mild         High      Strong  No
EXAMPLE

 For Outlook=Rain|Humidity
 Gain: 0.019
 Split Info:
 High: 2 examples, Normal: 3 examples
 - 2/5 log2 2/5 -3/5 log2 3/5 =0.971
 Gain Ratio:
 =(0.019/0.971) = 0.0195
EXAMPLE

 For Outlook=Rain|Temperature
 Gain: 0.019
 Split Info:
 Mild: 3 examples, Cold: 2 example
 -3/5 log2 3/5- 2/5 log2 2/5 =0.971
 Gain Ratio:
 =(0.019/0.971) = 0.0195
EXAMPLE

 For Outlook=Rain|Wind
 Gain: 0.97
 Split Info:
 Strong: 2 examples, Weak: 3 examples
 -2/5 log2 2/5- 3/5 log2 3/5 =0.971
 Gain Ratio:
 =(0.97/0.971) = 0.999
EXAMPLE

 For Outlook=Rain | Humidity:     Gain Ratio = 0.0195
 For Outlook=Rain | Temperature:  Gain Ratio = 0.0195
 For Outlook=Rain | Wind:         Gain Ratio = 0.999
 Wind is selected as it has the highest gain ratio.
DECISION TREES

The final tree (the same tree as obtained with information gain):
Outlook?
  Sunny    -> Humidity?  High -> No;   Normal -> Yes
  Overcast -> Yes
  Rain     -> Wind?      Strong -> No; Weak -> Yes
MEASURE OF IMPURITY: GINI
 Gini Index for a given node t:

   GINI(t) = 1 - Σ_j [p(j|t)]^2

 (NOTE: p(j|t) is the relative frequency of class j at node t.)
 Maximum (1 - 1/nc) when records are equally distributed among all classes, implying least interesting information, where nc is the number of classes.
 Minimum (0.0) when all records belong to one class, implying most interesting information.

 C1: 0, C2: 6 -> Gini = 0.000
 C1: 1, C2: 5 -> Gini = 0.278
 C1: 2, C2: 4 -> Gini = 0.444
 C1: 3, C2: 3 -> Gini = 0.500
EXAMPLES FOR COMPUTING GINI

   GINI(t) = 1 - Σ_j [p(j|t)]^2

 Node with C1 = 0, C2 = 6:  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
   Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0

 Node with C1 = 1, C2 = 5:  P(C1) = 1/6, P(C2) = 5/6
   Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278

 Node with C1 = 2, C2 = 4:  P(C1) = 2/6, P(C2) = 4/6
   Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444
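The same node Gini values can be reproduced with a short helper (a sketch, not part of the slides):

def gini(counts):
    # Gini index of a node from its class counts.
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(round(gini([0, 6]), 3))   # 0.0
print(round(gini([1, 5]), 3))   # 0.278
print(round(gini([2, 4]), 3))   # 0.444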
SPLITTING BASED ON GINI

 Used in CART, SLIQ, SPRINT.


 When a node p is split into k partitions (children), the quality of the split is computed as

   GINI_split = Σ_{i=1}^{k} (n_i / n) GINI(i)

 where n_i is the number of records at child i and n is the number of records at node p.
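A sketch of the weighted split Gini from per-child class counts; the call reproduces the binary example on the next slide (children with counts (5, 2) and (1, 4)), giving about 0.371, which the slide reports as 0.372 after rounding the child Gini values.

def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_split(children):
    # children: list of class-count tuples, one per child node.
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)

print(round(gini_split([(5, 2), (1, 4)]), 3))   # ~0.371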
BINARY ATTRIBUTES: COMPUTING GINI INDEX

 Splits into two partitions.
 Effect of weighing partitions: larger and purer partitions are sought.

 Parent: C1 = 6, C2 = 6, Gini = 0.500
 Attribute B (Yes/No) splits the records into Node N1 (C1 = 5, C2 = 2) and Node N2 (C1 = 1, C2 = 4):
   Gini(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.41
   Gini(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.32
   Gini(Children) = 7/12 * 0.41 + 5/12 * 0.32 = 0.372
EXAMPLE

 Computation of Gini Index for Outlook Attribute


 It has three possible values of Sunny (5 examples), Overcast (4 examples) and Rain
(5 examples).
 For Outlook = Sunny, there are 3 examples with “no” and 2 examples with “yes”.
 Gini(S) = 1 – [(3/5)2 + (2/5)2] = 0.48

 For Outlook = Overcast, there are 4 examples with “yes”


 Gini(S) = 1 – [(4/4)2 + (0/4)2] = 0
EXAMPLE

 Computation of Gini Index for Outlook Attribute


 For Outlook = Rain, there are 3 examples with “yes” and 2 with “no”
 Gini(S) = 1 – [(3/5)2 + (2/5)2] = 0.48
 Weighted Average:
[Sunny] 0.48 * (5/14) + [Overcast] 0 * (4/14) + [Rain] 0.48 * (5/14) = 0.342
EXAMPLE

 Computation of Gini Index for Temperature Attribute


 It has three possible values of Hot (4 examples), Mild (6 examples) and Cold (4 examples).
 For Temperature = Hot, there are 2 examples with “no” and 2 examples with “yes”.
Gini(S) = 1 – [(2/4)2 + (2/4)2] = 0.5
 For Temperature = Mild, there are 4 examples with “yes” and 2 examples with “no”
Gini(S) = 1 – [(4/6)2 + (2/6)2] = 0.444
 For Temperature = Cold, there are 3 examples with “yes” and 1 example with “no”
Gini(S) = 1 – [(3/4)2 + (1/4)2] = 0.375
EXAMPLE

 Computation of Gini Index for Temperature Attribute


 Weighted Average:
[Hot] 0.5 * (4/14) + [Mild] 0.444 * (6/14) + [Cold] 0.375 * (4/14) = 0.440
EXAMPLE
 Computation of Gini Index for Humidity Attribute
 It has two possible values of High (7 examples) and Normal (7 examples).
 For Humidity = High, there are 3 examples with “yes” and 4 examples with “no”.
Gini(S) = 1 – [(3/7)2 + (4/7)2] = 0.49
 For Humidity = Normal, there are 6 examples with “yes” and 1 example with “no”
Gini(S) = 1 – [(6/7)2 + (1/7)2] = 0.245
 Weighted Average:
[High] 0.49 * (7/14) + [Normal] 0.245 * (7/14) = 0.368
EXAMPLE

 Computation of Gini Index for Wind Attribute


 It has two possible values of Strong (6 examples) and Weak (8 examples).
 For Wind = Strong, there are 3 examples with “yes” and 3 examples with “no”.
Gini(S) = 1 – [(3/6)2 + (3/6)2] = 0.5
 For Wind = Weak, there are 6 examples with “yes” and 2 examples with “no”
Gini(S) = 1 – [(6/8)2 + (2/8)2] = 0.375
 Weighted Average:
[Strong] 0.5 * (6/14) + [Weak] 0.375 * (8/14) = 0.428
EXAMPLE

 For Outlook:      Gini Index = 0.342
 For Temperature:  Gini Index = 0.440
 For Humidity:     Gini Index = 0.368
 For Wind:         Gini Index = 0.428
 Outlook is selected as it has the smallest Gini index.
EXAMPLE

Which attribute should be tested here?
EXAMPLE
 Now, for Outlook=Sunny, we have the following
instances.
Day  Outlook  Temperature  Humidity  Wind  Play Tennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D8 Sunny Mild High Weak No
D9 Sunny Cold Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
EXAMPLE

 Computation of Gini Index for Outlook=Sunny |Humidity


 It has two possible values of High (3 examples) and Normal (2 examples).
 For Humidity = High, there are 3 examples with “no”.
Gini(S) = 1 – [(3/3)2] = 0
 For Humidity = Normal, there are 2 examples with “yes”
Gini(S) = 1 – [(2/2)2] = 0
 Weighted Average:
[High] 0 * (3/5) + [Normal] 0 * (2/5) = 0
EXAMPLE

 Computation of Gini Index for Outlook=Sunny| Temperature Attribute


 It has three possible values of Hot (2 examples), Mild (2 examples) and Cold (1
example).
 For Temperature = Hot, there are 2 examples with “no” .
Gini(S) = 1 – [(2/2)2] = 0
 For Temperature = Mild, there is 1 example with “yes” and 1 example with “no”
Gini(S) = 1 – [(1/2)2 + (1/2)2] = 0.5
 For Temperature = Cold, there is 1 example with “yes”
Gini(S) = 1 – [(1/1)2] = 0
EXAMPLE

 Computation of Gini Index for Outlook=Sunny|Temperature Attribute


 Weighted Average:
[Hot] 0 * (2/5) + [Mild] 0.5 * (2/5) + [Cold] 0 * (1/5) = 0.2
EXAMPLE
 Computation of Gini Index for Outlook=Sunny|Wind Attribute
 It has two possible values of Strong (2 examples) and Weak (3 examples).
 For Wind = Strong, there are 1 example with “yes” and 1 example with “no”.
Gini(S) = 1 – [(1/2)2 + (1/2)2] = 0.5
 For Wind = Weak, there are 1 example with “yes” and 2 examples with “no”
Gini(S) = 1 – [(1/3)2 + (2/3)2] = 0.444
 Weighted Average:
[Strong] 0.5 * (2/5) + [Weak] 0.444 * (3/5) = 0.466
EXAMPLE
 For Outlook=Sunny | Temperature:  Gini Index = 0.2
 For Outlook=Sunny | Humidity:     Gini Index = 0
 For Outlook=Sunny | Wind:         Gini Index = 0.466
 Humidity is selected as it has the smallest Gini index.
EXAMPLE

Tree so far: the root {D1, D2, ..., D14} (9 Yes, 5 No) is split on Outlook.
 Sunny:    {D1, D2, D8, D9, D11}, 2 Yes, 3 No  -> split on Humidity
 Overcast: {D3, D7, D12, D13}, 4 Yes, 0 No     -> leaf Yes
 Rain:     {D4, D5, D6, D10, D14}, 3 Yes, 2 No -> ? Which attribute should be tested here?
EXAMPLE
 Now, for Outlook=Rain, we have the following instances.
Day  Outlook  Temperature  Humidity  Wind  Play Tennis
D4 Rain Mild High Weak Yes
D5 Rain Cold Normal Weak Yes
D6 Rain Cold Normal Strong No
D10 Rain Mild Normal Weak Yes
D14 Rain Mild High Strong No
EXAMPLE

 Computation of Gini Index for Outlook=Rain |Humidity


 It has two possible values of High (2 examples) and Normal (3 examples).
 For Humidity = High, there is 1 example with “no” and 1 with “yes”.
Gini(S) = 1 – [(1/2)2 + (1/2)2 ] = 0.5
 For Humidity = Normal, there are 2 examples with “yes” and 1 example with “no”
Gini(S) = 1 – [(2/3)2 + (1/3)2] = 0.444
 Weighted Average:
[High] 0.5 * (2/5) + [Normal] 0.444 * (3/5) = 0.466
EXAMPLE
 Computation of Gini Index for Outlook=Rain| Temperature Attribute
 It has two values, Mild (3 examples) and Cold (2 examples).
 For Temperature = Mild, there are 2 examples with “yes” and 1 example with “no”
Gini(S) = 1 – [(2/3)2 + (1/3)2] = 0.444
 For Temperature = Cold, there is 1 example with “yes” and 1 with “no”
Gini(S) = 1 – [(1/2)2 + (1/2)2 ] = 0.5
 Weighted Average:
[Mild] 0.444 * (3/5) + [Cold] 0.5 * (2/5) = 0.466
EXAMPLE

 Computation of Gini Index for Outlook=Rain|Wind Attribute


 It has two possible values of Strong (2 examples) and Weak (3 examples).
 For Wind = Strong, there are 2 examples with “no”.
Gini(S) = 1 – [(2/2)2 ] = 0
 For Wind = Weak, there are 3 examples with “yes”
Gini(S) = 1 – [(3/3)2] = 0
 Weighted Average:
[Strong] 0 * (2/5) + [Weak] 0 * (3/5) = 0
EXAMPLE
 For Outlook=Rain | Temperature:  Gini Index = 0.466
 For Outlook=Rain | Humidity:     Gini Index = 0.466
 For Outlook=Rain | Wind:         Gini Index = 0
 Wind is selected as it has the smallest Gini index.
DECISION TREES

The final tree (the same tree is obtained with the Gini index):
Outlook?
  Sunny    -> Humidity?  High -> No;   Normal -> Yes
  Overcast -> Yes
  Rain     -> Wind?      Strong -> No; Weak -> Yes
CATEGORICAL ATTRIBUTES: COMPUTING GINI INDEX

 For each distinct value, gather the counts for each class in the dataset.
 Use the count matrix to make decisions.

 Multi-way split on CarType:
   Family: C1 1, C2 4    Sports: C1 2, C2 1    Luxury: C1 1, C2 1     -> Gini = 0.393
 Two-way splits (find the best partition of values):
   {Sports, Luxury}: C1 3, C2 2   vs. {Family}: C1 1, C2 4            -> Gini = 0.400
   {Sports}: C1 2, C2 1           vs. {Family, Luxury}: C1 2, C2 5    -> Gini = 0.419
SPLITTING CRITERIA BASED ON CLASSIFICATION ERROR

 Classification error at a node t:

   Error(t) = 1 - max_i P(i|t)

 Measures misclassification error made by a node.


 Maximum (1 - 1/nc) when records are equally distributed among all classes,
implying least interesting information
 Minimum (0.0) when all records belong to one class, implying most interesting
information
EXAMPLES FOR COMPUTING ERROR

   Error(t) = 1 - max_i P(i|t)

 Node with C1 = 0, C2 = 6:  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
   Error = 1 - max(0, 1) = 1 - 1 = 0

 Node with C1 = 1, C2 = 5:  P(C1) = 1/6, P(C2) = 5/6
   Error = 1 - max(1/6, 5/6) = 1 - 5/6 = 1/6

 Node with C1 = 2, C2 = 4:  P(C1) = 2/6, P(C2) = 4/6
   Error = 1 - max(2/6, 4/6) = 1 - 4/6 = 1/3
COMPARISON AMONG SPLITTING CRITERIA

For a 2-class problem (p = fraction of records in one class):
• All measures attain their maximum values when the class distribution is uniform, i.e. p = 0.5.
• The minimum values are attained when all the records belong to the same class (i.e. when p equals 0 or 1).
CONTINUOUS ATTRIBUTES: COMPUTING GINI INDEX

 Use binary decisions based on one value, e.g. Taxable Income > 80K? (Yes/No), applied to the loan data (Tid 1-10).
 Several choices for the splitting value:
 Number of possible splitting values = number of distinct values.
 Each splitting value v has a count matrix associated with it: class counts in each of the partitions, A < v and A >= v.
 Simple method to choose the best v:
 For each v, scan the database to gather the count matrix and compute its Gini index.
 Computationally inefficient! Repetition of work.
CONTINUOUS ATTRIBUTES: COMPUTING GINI INDEX...
 For efficient computation: for each attribute,
 Sort the attribute on values.
 Linearly scan these values, each time updating the count matrix and computing the Gini index.
 Choose the split position that has the least Gini index.

 For Taxable Income, the sorted values (with class labels Cheat = No, No, No, Yes, Yes, Yes, No, No, No, No) are:
   60  70  75  85  90  95  100  120  125  220
 Candidate split positions: 55, 65, 72, 80, 87, 92, 97, 110, 122, 172, 230
 Gini at each split position: 0.420, 0.400, 0.375, 0.343, 0.417, 0.400, 0.300, 0.343, 0.375, 0.400, 0.420
 The best split is Taxable Income <= 97, with Gini = 0.300.
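A sketch of this search over the Taxable Income values from the loan data; for clarity it recomputes the count matrix at every candidate cut rather than updating it incrementally during the scan, which is the optimization described above.

def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts) if total else 0.0

def best_split(values, labels):
    pairs = sorted(zip(values, labels))                     # sort the attribute values
    classes = sorted(set(labels))
    best_v, best_gini = None, float("inf")
    for i in range(len(pairs) - 1):
        v = (pairs[i][0] + pairs[i + 1][0]) / 2             # candidate cut between adjacent values
        left = [y for x, y in pairs if x <= v]
        right = [y for x, y in pairs if x > v]
        weighted = (len(left) * gini([left.count(c) for c in classes]) +
                    len(right) * gini([right.count(c) for c in classes])) / len(pairs)
        if weighted < best_gini:
            best_v, best_gini = v, weighted
    return best_v, best_gini

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]       # Taxable Income (in K)
cheat = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_split(income, cheat))   # (97.5, 0.3): cut near 97 with Gini = 0.300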
ALGORITHM FOR DECISION TREE INDUCTION

 Input consists of Training records (E) and attribute set (F).


 The algorithm works by recursively selecting the best attribute to split the data (Step
7) and expanding the leaf nodes of the tree (Steps 11 and 12) until the stopping
criterion is met (Step1).
 The createNode() function extends the decision tree by creating a new node. A
node in the decision tree has either a test condition, denoted as node.test_cond, or
a class label, denoted as node.label.
 The find_best_split () function determines which attribute should be selected as
the test condition for splitting the training records. The choice of test condition
depends on the impurity measure.
ALGORITHM FOR DECISION TREE INDUCTION

 The Classify() function determines the class label to be assigned to a leaf node. For
each leaf node t, let p(i|t) denote the fraction of training records from class i
associated with the node t. In most cases, the leaf node is assigned to the class that
has the majority of the training records:
leaf.label = argmax p(i|t)
Where argmax operator returns the argument i that maximizes the expression
p(i|t).
 The stopping_cond() function is used to terminate the tree growing process by
testing whether all the records have either the same class label or the same
attribute values. Another way to terminate the recursive function is to test
whether the number of records have fallen below some minimum threshold.
ALGORITHM FOR DECISION TREE INDUCTION

 After building the decision tree, a tree-pruning step can be performed to reduce
the size of the decision tree.
 Pruning helps by trimming the branches of the initial tree in a way that improves the
generalization capability of the decision tree.
ALGORITHM FOR DECISION TREE INDUCTION
 TreeGrowth (E, F)
 1: if stopping_cond(E, F) = true then
 2:   leaf = createNode()
 3:   leaf.label = Classify(E)
 4:   return leaf
 5: else
 6:   root = createNode()
 7:   root.test_cond = find_best_split(E, F)
 8:   let V = {v | v is a possible outcome of root.test_cond}
 9:   for each v ∈ V do
 10:    Ev = {e | root.test_cond(e) = v and e ∈ E}
 11:    child = TreeGrowth(Ev, F)
 12:    add child as descendant of root and label the edge (root → child) as v
 13:  end for
 14: end if
 15: return root
HOLDOUT METHOD

 The original data with labelled examples is partitioned into two disjoint sets, called
the training set and test set.
 A classification model is then induced from the training set and its performance is
evaluated on the test set.
 The proportion of data reserved for training and for testing is typically at the
discretion of the analysts (Eg. 50-50 or two-thirds for training and one-third for
testing).
 The accuracy of the classifier can be estimated based on the accuracy of the induced
model on the test set.
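A holdout sketch assuming scikit-learn is available, with the Iris data standing in for a labelled dataset and a two-thirds / one-third split:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Two-thirds of the labelled records for training, one-third held out for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(model.score(X_test, y_test))   # accuracy estimated on the held-out test set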
HOLDOUT METHOD
 Limitations
 Fewer labelled examples are available for training because some of the records are
withheld for testing. The induced model may not be as good as when all the labelled
examples are used for training.
 The model may be highly dependent on the composition of the training and tests
sets.
 The smaller the training set size, the larger is the variance of the model.
 On the other hand, if the training set is too large, then the estimated accuracy computed
from the smaller test set is less reliable. Such an estimate is said to have a wide
confidence interval.
HOLDOUT METHOD
 Limitations
 The training and test sets are
no longer independent of each
other. Because the training
and test sets are subsets of
the original data, a class that is
overrepresented in one subset
will be underrepresented in
the other, and vice versa.
RANDOM SUBSAMPLING

 The holdout method can be repeated


several times to improve the estimation
of a classifier’s performance. Known as
random subsampling.
 Let acc_i be the model accuracy during the i-th iteration. The overall accuracy is given by

   acc_sub = (1/k) Σ_{i=1}^{k} acc_i
RANDOM SUBSAMPLING

 Limitations
 It does not utilize as much as data as possible for training.
 It has no control over the number of times each record is used for testing and
training. Consequently, some records might be used for training more often than
others.
CROSS-VALIDATION
 In this approach, each record is used the same
number of times for training and exactly once for
testing.
 Example:
 Partition the data into three equal-sized subsets.
 Choose two of the subsets (A, B) for training and
the remaining one for testing.
 Choose another combination of two subsets (A, C) for training and the remaining one for
testing.
 Choose the remaining combination (B, C) for training and the remaining subset for
testing.
 This approach is called three-fold cross validation.
CROSS-VALIDATION

 The k-fold cross validation method generalizes this approach by segmenting the data
into k-equal sized partitions.
 During each run, one of the partitions is used for testing, while the rest of them are
used for training.
 This procedure is repeated k times so that each partition is used for testing exactly
once.
 The total error is found by summing up the errors for all the k runs.
CROSS-VALIDATION

 A special case of k-fold cross-validation method sets k=N, the size of the dataset.
 Called leave-one-out approach, each test set contains only one record.
 This approach has the advantage of utilizing as much data as possible for training.
 In addition, the test sets are mutually exclusive and they effectively cover the
entire data set.
 The drawback of this approach is that it is computationally expensive to repeat the
procedure N times.
 Furthermore, since each test contains only one record, the variance of the
estimated performance metric tends to be high.
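A k-fold cross-validation sketch, again assuming scikit-learn and using Iris as a stand-in dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# 10-fold cross-validation: each record is used exactly once for testing.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(scores.mean())   # average accuracy over the 10 folds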
BOOTSTRAP
 The methods so far assume that the training records are sampled without
replacement. As a result, there are no duplicate records in the training and test sets.
 Bootstrap Approach: training records are sampled with replacement.
 If the original data has N records, it can be shown that, on average, a bootstrap
sample of size N contains about 63.2% of the records of the original data.
 Records that are not included in bootstrap sample, become a part of the test set.
 The model induced from the training set is then applied to the test set to obtain an
estimate of the accuracy of the bootstrap sample, εi.
 The sampling procedure is then repeated b times to generate b bootstrap samples.
BOOTSTRAP

 .632 Bootstrap
 It computes the overall accuracy by combining the accuracies of each bootstrap
sample (εi) with the accuracy computed from the training set that contains all
labelled examples in the original data (accs) i.e. no repeated labelled examples.

   Accuracy: acc_boot = (1/b) Σ_{i=1}^{b} (0.632 × ε_i + 0.368 × acc_s)
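A sketch of the .632 bootstrap estimate, assuming NumPy and scikit-learn are available and again using Iris as a stand-in dataset (b, the number of bootstrap samples, is chosen arbitrarily):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
b, n = 20, len(y)

full_model = DecisionTreeClassifier(random_state=0).fit(X, y)
acc_s = full_model.score(X, y)                     # accuracy on all labelled examples

estimates = []
for _ in range(b):
    idx = rng.integers(0, n, size=n)               # bootstrap sample drawn with replacement
    oob = np.setdiff1d(np.arange(n), idx)          # out-of-sample records form the test set
    model = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    eps_i = model.score(X[oob], y[oob])            # epsilon_i for this bootstrap sample
    estimates.append(0.632 * eps_i + 0.368 * acc_s)

print(sum(estimates) / b)                          # acc_boot, the .632 bootstrap estimate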
