Classification Slides
CLASSIFICATION
Task of assigning objects to one of several predefined categories.
Examples
Predicting tumor cells as benign or malignant
Classifying credit card transactions as legitimate or fraudulent
Detecting spam email messages based on message header and content
Categorizing news stories as finance, weather, entertainment, sports, etc.
Classifying galaxies based on their shape
CLASSIFICATION
[Figure: a learning algorithm induces a model from the training set (records with attributes Attrib1–Attrib3 and a known class label); the model is then applied to a test set whose class labels are unknown (e.g. Tid 11, 15).]
CONFUSION MATRIX
[Figure: confusion matrix tabulating the counts of predicted vs. actual class labels.]
DECISION TREES
A series of questions and their possible answers can be organized in the form of a decision tree.
It is a hierarchical structure that consists of nodes and directed edges.
A tree has three types of nodes:
• Root Node: one root node with no incoming edges and zero or more outgoing edges.
• Internal Nodes: exactly one incoming edge and two or more outgoing edges.
• Leaf or Terminal Nodes: exactly one incoming edge and no outgoing edges.
HOW A DECISION TREE WORKS
[Figure: a training data table and the decision tree induced from it.]
ANOTHER EXAMPLE OF DECISION TREE
[Figure: another decision tree for the same training data, followed by the diagram of inducing a decision tree model from the training set and applying it to the test set.]
APPLY MODEL TO TEST DATA
Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
The tree: Refund (Yes → NO; No → MarSt); MarSt (Single, Divorced → TaxInc; Married → NO); TaxInc (< 80K → NO; >= 80K → YES).
Start from the root of the tree and follow the branch that matches the test record at each internal node:
• Refund = No, so take the "No" branch to the MarSt node.
• Marital Status = Married, so take the "Married" branch, which leads to a leaf labelled NO.
Assign Cheat to "No".
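The traversal can be made concrete with a small sketch. This is illustrative only: the nested-dict tree encoding, the classify() helper and the income value in whole units are assumptions, while the attribute names, branch values and leaf labels are taken from the figure.

```python
def classify(record, node):
    """Follow branches from the root until a leaf (a plain string label) is reached."""
    while isinstance(node, dict):
        attribute = node["attribute"]
        if attribute == "Taxable Income":            # numeric test: < 80K vs >= 80K
            branch = "< 80K" if record[attribute] < 80_000 else ">= 80K"
        else:                                         # categorical test
            branch = record[attribute]
        node = node["branches"][branch]
    return node                                       # leaf label: "NO" or "YES"

taxinc_subtree = {"attribute": "Taxable Income",
                  "branches": {"< 80K": "NO", ">= 80K": "YES"}}
tree = {"attribute": "Refund",
        "branches": {"Yes": "NO",
                     "No": {"attribute": "Marital Status",
                            "branches": {"Married": "NO",
                                         "Single": taxinc_subtree,
                                         "Divorced": taxinc_subtree}}}}

record = {"Refund": "No", "Marital Status": "Married", "Taxable Income": 80_000}
print(classify(record, tree))   # -> "NO", i.e. Cheat is assigned "No"
```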
DECISION TREE CLASSIFICATION TASK
[Figure: a tree induction algorithm learns a decision tree from the training set (records with attributes Attrib1–Attrib3 and known class labels); the resulting model is then applied to the test set, whose class labels are unknown.]
DECISION TREE INDUCTION
Many Algorithms:
Hunt’s Algorithm (one of the earliest)
CART
ID3, C4.5
SLIQ, SPRINT
GENERAL STRUCTURE OF HUNT’S ALGORITHM
Let Dt be the set of training records that reach a node t.
General procedure:
• If all the records in Dt belong to the same class yt, then t is a leaf node labelled yt.
• If Dt contains records from more than one class, select an attribute test condition to partition the data into smaller subsets. Recursively apply the procedure to each subset.
[Figure: node t with its record set Dt, alongside the loan data table (Tid, Refund, Marital Status, Taxable Income, Cheat).]
HUNT’S ALGORITHM
[Figure: Hunt's algorithm applied step by step to the loan training data (Tid 1–6 shown). The tree grows from a single leaf labelled Don't Cheat, to a split on Refund (Yes → Don't Cheat), then a split on Marital Status (Married → Don't Cheat), and finally a split on Taxable Income (< 80K → Don't Cheat, >= 80K → Cheat) for the Single/Divorced branch.]
HUNT'S ALGORITHM
Example: predicting whether a loan applicant will repay her loan obligations or become delinquent, subsequently defaulting on her loan.
• The initial tree contains a single node with class label Defaulted = "No", which means most borrowers successfully repaid their loans.
• However, the tree needs refinement.
HUNT’S ALGORITHM
Hunt’s algorithm will work if every combination of attribute values is present in the
training data and each combination has a unique class label.
These assumptions are too stringent for use in most practical scenarios.
Additional conditions are needed to handle the following cases:
It is possible for some of the child nodes created to be empty, i.e., there are no records associated with these nodes. This can happen if none of the training records have the combination of attribute values associated with such nodes.
In this case, the node is declared a leaf node with the same class label as the majority class of training records associated with its parent node.
HUNT’S ALGORITHM
If all the records associated with Dt have identical attribute values (except for the
class label), then it is not possible to split the records any further.
In this case, the node is declared a leaf node with the same class label as the
majority class of training records associated with this node.
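A minimal Python sketch of Hunt's recursion, including the two special cases just described (an empty node gets the parent's majority class; records with identical attribute values get the node's majority class). The record format (dicts with a "class" key) and the placeholder best_split are assumptions; a real implementation would choose the split with one of the impurity measures discussed next.

```python
from collections import Counter

def majority_class(records):
    """Most frequent class label among the records."""
    return Counter(r["class"] for r in records).most_common(1)[0][0]

def best_split(records, attributes):
    # Placeholder: a real implementation would choose the attribute whose test
    # condition gives the best impurity reduction (information gain, Gini, ...).
    return attributes[0]

def hunt(records, attributes, parent_majority=None):
    # Empty node: declare a leaf with the majority class of the parent node.
    if not records:
        return {"leaf": parent_majority}
    classes = {r["class"] for r in records}
    # All records belong to the same class: pure leaf.
    if len(classes) == 1:
        return {"leaf": classes.pop()}
    # No attribute left, or all records have identical attribute values:
    # declare a leaf with the majority class of this node.
    if not attributes or all(
        all(r[a] == records[0][a] for r in records) for a in attributes
    ):
        return {"leaf": majority_class(records)}
    # Otherwise split on an attribute and recurse on each subset.
    attr = best_split(records, attributes)
    remaining = [a for a in attributes if a != attr]
    node = {"attribute": attr, "branches": {}}
    for value in {r[attr] for r in records}:
        subset = [r for r in records if r[attr] == value]
        node["branches"][value] = hunt(subset, remaining, majority_class(records))
    return node
```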
TREE INDUCTION
Greedy strategy.
Split the records based on an attribute test that optimizes a certain criterion.
Issues
Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Determine when to stop splitting
Continue expanding a node until either all records belong to the same class or all
records have identical attribute values
METHODS FOR EXPRESSING ATTRIBUTE TEST CONDITIONS
Binary split: divides the attribute values into two subsets; need to find the optimal partitioning.
[Figure: two possible binary splits on CarType: {Sports, Luxury} vs {Family}, or {Family, Luxury} vs {Sports}.]
SPLITTING BASED ON ORDINAL ATTRIBUTES
Multi-way split on Size: Small / Medium / Large.
Binary splits that preserve the order: {Small, Medium} vs {Large}, or {Small} vs {Medium, Large}.
[Figure: the grouping {Small, Large} vs {Medium} is also shown; it does not preserve the order of the attribute values.]
SPLITTING BASED ON CONTINUOUS ATTRIBUTES
[Figure: a binary split, Taxable Income > 80K? (Yes / No), and a multi-way split of Taxable Income into ranges such as < 10K, ..., > 80K.]
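For a continuous attribute, the binary test conditions are usually generated by sorting the observed values and taking candidate thresholds between adjacent distinct values. A small illustrative sketch (the income values are made-up example numbers, in thousands):

```python
def candidate_thresholds(values):
    """Midpoints between consecutive distinct sorted values of a continuous attribute."""
    distinct = sorted(set(values))
    return [(low + high) / 2 for low, high in zip(distinct, distinct[1:])]

taxable_income = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]   # example values (K)
for v in candidate_thresholds(taxable_income):
    # Each threshold v defines one candidate binary split: Taxable Income <= v vs > v.
    print(f"Taxable Income <= {v}K ?")
```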
The measures developed for selecting the best split are often based on the degree of impurity of the child nodes.
The smaller the degree of impurity, the more skewed the class distribution.
A node with class distribution (0,1) has zero impurity and a node with uniform class
distribution (0.5,0.5) has the highest impurity.
MEASURES OF NODE IMPURITY
Gini Index
Entropy
Misclassification error
ALGORITHM
The best attribute is selected and used as the test at the root node of the tree.
A descendant of the root node is then created for each possible value of this
attribute, and the training examples are sorted to the appropriate descendant
node (i.e., down the branch corresponding to the example's value for this
attribute).
The entire process is then repeated using the training examples associated with
each descendant node to select the best attribute to test at that point in the
tree.
This forms a greedy search for an acceptable decision tree, in which the
algorithm never backtracks to reconsider earlier choices.
ID3 ALGORITHM
Examples: Training Examples
Target-attribute: Attribute to be predicted
Attributes: List of predictors
We would like to select the attribute that is most useful for classifying examples.
We will define a statistical property, called information gain, that
measures how well a given attribute separates the training examples
according to their target classification
ENTROPY
A measure used from Information Theory in the ID3 algorithm and popularly used in
decision tree construction is that of Entropy.
The entropy of a dataset measures the impurity of the dataset.
Informally, entropy can be thought of as a measure of how disordered the dataset is.
ENTROPY
It has been shown that there is a relationship between entropy and information.
That is, the higher the uncertainty or entropy of some data, the more information is required to completely describe that data.
In building a decision tree, the aim is to decrease the entropy of the dataset until we reach leaf nodes, at which point the subset we are left with is pure (zero entropy), i.e., all of its instances belong to one class.
ENTROPY
For a node t, entropy is defined as

Entropy(t) = − Σⱼ p(j | t) log₂ p(j | t)

where p(j | t) is the fraction of records at node t belonging to class j.
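The definition translates directly into a few lines of Python. This is an illustrative sketch; by convention 0 · log₂ 0 is treated as 0, which the Counter handles by never producing zero counts.

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Entropy(t) = -sum_j p(j|t) * log2 p(j|t) for a list of class labels."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())

print(entropy(["yes"] * 9 + ["no"] * 5))   # ~0.940 for a 9-to-5 class split
print(entropy(["yes", "no"]))              # 1.0, maximum impurity for two classes
print(entropy(["yes"] * 5))                # 0.0, a pure node
```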
Decision tree induction algorithms choose a test condition that maximizes the gain (∆).
Since I(parent) is the same for all test conditions, maximizing the gain is equivalent to minimizing the weighted average impurity of the child nodes.
When entropy is used as the impurity measure, the difference in entropy is known as the information gain, ∆_gain.
HOW TO FIND THE BEST SPLIT
[Figure: before splitting, the parent node has class counts C0: N00 and C1: N01, with impurity M0. A candidate split on attribute A produces child nodes (Yes/No) with impurities M1 and M2, whose weighted average is M12; a split on attribute B produces children with impurities M3 and M4, whose weighted average is M34.]
Gain of split A = M0 − M12; gain of split B = M0 − M34. Choose the split with the larger gain.
EXAMPLE
WHICH ATTRIBUTE TO SELECT?
EXAMPLE
There are two classes in the data, play tennis = "yes" and play tennis = "no". Therefore the entropy of the dataset S (9 yes, 5 no) can be calculated as below:
Entropy(S) = −p_yes log₂(p_yes) − p_no log₂(p_no) = −(9/14) log₂(9/14) − (5/14) log₂(5/14) ≈ 0.94 bits
Now, the next step is to select the most significant of the four input variables (Outlook, Temperature, Humidity, Wind), i.e., the one that splits the data more purely than the other attributes.
For this, we calculate the information gain that would result over the entire dataset after splitting on each attribute (Outlook, Temperature, Humidity, Wind).
EXAMPLE
Infogain(S | Outlook) = Entropy(S) − 5/14 Entropy(S | Outlook = Sunny) − 4/14 Entropy(S | Outlook = Overcast) − 5/14 Entropy(S | Outlook = Rain)
= 0.94 − (5/14)(−(2/5) log₂(2/5) − (3/5) log₂(3/5)) − (4/14)(−(4/4) log₂(4/4)) − (5/14)(−(3/5) log₂(3/5) − (2/5) log₂(2/5))
= 0.94 − 0.347 − 0 − 0.347
= 0.246 bits
EXAMPLE
Infogain(S | Temperature) = Entropy(S) − 4/14 Entropy(S | Temperature = Hot) − 6/14 Entropy(S | Temperature = Mild) − 4/14 Entropy(S | Temperature = Cold)
= 0.94 − (4/14)(−(2/4) log₂(2/4) − (2/4) log₂(2/4)) − (6/14)(−(4/6) log₂(4/6) − (2/6) log₂(2/6)) − (4/14)(−(3/4) log₂(3/4) − (1/4) log₂(1/4))
= 0.94 − 0.286 − 0.392 − 0.233
= 0.029 bits
EXAMPLE
Infogain(S | Humidity) = Entropy(S) − 7/14 Entropy(S | Humidity = High) − 7/14 Entropy(S | Humidity = Normal)
= 0.94 − (7/14)(−(3/7) log₂(3/7) − (4/7) log₂(4/7)) − (7/14)(−(6/7) log₂(6/7) − (1/7) log₂(1/7))
= 0.94 − 0.493 − 0.296 = 0.151 bits
EXAMPLE
Infogain(S | Wind) = Entropy(S) − 6/14 Entropy(S | Wind = Strong) − 8/14 Entropy(S | Wind = Weak)
= 0.94 − (6/14)(−(1/2) log₂(1/2) − (1/2) log₂(1/2)) − (8/14)(−(6/8) log₂(6/8) − (2/8) log₂(2/8))
= 0.94 − 0.428 − 0.465 = 0.047 bits
EXAMPLE
Now, we select the attribute for the root node which will result in the highest
reduction in entropy.
Infogain (S| Outlook) = 0.246 bits
Infogain (S| Temperature) = 0.029 bits
Infogain (S| Humidity) = 0.151 bits
Infogain (S|Wind) = 0.047 bits
We can clearly see that the attribute Outlook results in the highest reduction in
entropy or the highest information gain.
We would therefore select Outlook at the root node, splitting the data up into subsets
corresponding to all the different values for the Outlook attribute.
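The root-node selection above can be reproduced with a short script. The Sunny and Rain rows appear later in these slides; the four Overcast rows are filled in from the standard play-tennis dataset and are consistent with the class counts (4 Yes, 0 No) quoted here, so treat them as an assumption.

```python
from math import log2
from collections import Counter

# Play-tennis data (Day order D1..D14). The Sunny and Rain rows are listed in
# these slides; the Overcast rows follow the standard play-tennis dataset and
# match the counts quoted here (4 Yes, 0 No).
data = [
    ("Sunny",    "Hot",  "High",   "Weak",   "No"),   # D1
    ("Sunny",    "Hot",  "High",   "Strong", "No"),   # D2
    ("Overcast", "Hot",  "High",   "Weak",   "Yes"),  # D3
    ("Rain",     "Mild", "High",   "Weak",   "Yes"),  # D4
    ("Rain",     "Cold", "Normal", "Weak",   "Yes"),  # D5
    ("Rain",     "Cold", "Normal", "Strong", "No"),   # D6
    ("Overcast", "Cold", "Normal", "Strong", "Yes"),  # D7
    ("Sunny",    "Mild", "High",   "Weak",   "No"),   # D8
    ("Sunny",    "Cold", "Normal", "Weak",   "Yes"),  # D9
    ("Rain",     "Mild", "Normal", "Weak",   "Yes"),  # D10
    ("Sunny",    "Mild", "Normal", "Strong", "Yes"),  # D11
    ("Overcast", "Mild", "High",   "Strong", "Yes"),  # D12
    ("Overcast", "Hot",  "Normal", "Weak",   "Yes"),  # D13
    ("Rain",     "Mild", "High",   "Strong", "No"),   # D14
]
columns = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Wind": 3}

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, col):
    """Entropy(S) minus the weighted entropy of the subsets induced by the attribute."""
    gain = entropy([r[-1] for r in rows])
    for value in {r[col] for r in rows}:
        subset = [r[-1] for r in rows if r[col] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

for name, col in columns.items():
    print(f"Infogain(S | {name}) = {info_gain(data, col):.3f}")
# Outlook ~0.246, Temperature ~0.029, Humidity ~0.151, Wind ~0.048 -> Outlook wins.
```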
EXAMPLE
Which attribute should be tested here?
EXAMPLE
Now, for Outlook=Sunny, we have the following instances.
Day Outlook Temperature Humidity Wind Play Tennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D8 Sunny Mild High Weak No
D9 Sunny Cold Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
Entropy(Sunny) = −p_yes log₂(p_yes) − p_no log₂(p_no) = −(2/5) log₂(2/5) − (3/5) log₂(3/5) = 0.97
EXAMPLE
Infogain(Sunny | Humidity) = Entropy(Sunny) − 3/5 Entropy(Sunny | Humidity = High) − 2/5 Entropy(Sunny | Humidity = Normal)
= 0.97 − (3/5)(−(3/3) log₂(3/3) − 0) − (2/5)(−(2/2) log₂(2/2) − 0)
= 0.97 − 0 − 0 = 0.97 bits
Infogain(Sunny | Temperature) = Entropy(Sunny) − 2/5 Entropy(Sunny | Temperature = Hot) − 2/5 Entropy(Sunny | Temperature = Mild) − 1/5 Entropy(Sunny | Temperature = Cold)
= 0.97 − (2/5)(−(2/2) log₂(2/2) − 0) − (2/5)(−(1/2) log₂(1/2) − (1/2) log₂(1/2)) − (1/5)(−(1/1) log₂(1/1) − 0)
= 0.97 − 0 − 0.4 − 0 = 0.57 bits
EXAMPLE
Infogain(Sunny | Wind) = Entropy(Sunny) − 3/5 Entropy(Sunny | Wind = Weak) − 2/5 Entropy(Sunny | Wind = Strong)
= 0.97 − (3/5)(−(1/3) log₂(1/3) − (2/3) log₂(2/3)) − (2/5)(−(1/2) log₂(1/2) − (1/2) log₂(1/2))
= 0.97 − 0.551 − 0.4 = 0.019 bits
Now,
Infogain(Sunny | Humidity) = 0.97 bits
Infogain(Sunny | Temperature) = 0.57 bits
Infogain(Sunny | Wind) = 0.019 bits
Thus, we select the Humidity attribute.
EXAMPLE
[Figure: partial tree. Root: Outlook over {D1, D2, ..., D14} (9 Yes, 5 No). Sunny branch: {D1, D2, D8, D9, D11} (2 Yes, 3 No) → split on Humidity. Overcast branch: {D3, D7, D12, D13} (4 Yes, 0 No) → leaf Yes. Rain branch: {D4, D5, D6, D10, D14} (3 Yes, 2 No) → which attribute should be tested here?]
EXAMPLE
Now, for Outlook = Rain, we have the instances {D4, D5, D6, D10, D14} (3 Yes, 2 No), so Entropy(Rain) = −(3/5) log₂(3/5) − (2/5) log₂(2/5) = 0.97.
Infogain(Rain | Humidity) = Entropy(Rain) − 2/5 Entropy(Rain | Humidity = High) − 3/5 Entropy(Rain | Humidity = Normal)
= 0.97 − (2/5)(−(1/2) log₂(1/2) − (1/2) log₂(1/2)) − (3/5)(−(2/3) log₂(2/3) − (1/3) log₂(1/3))
= 0.97 − 0.4 − 0.551 = 0.019 bits
Infogain(Rain | Temperature) = Entropy(Rain) − 2/5 Entropy(Rain | Temperature = Cold) − 3/5 Entropy(Rain | Temperature = Mild)
= 0.97 − (2/5)(−(1/2) log₂(1/2) − (1/2) log₂(1/2)) − (3/5)(−(2/3) log₂(2/3) − (1/3) log₂(1/3))
= 0.97 − 0.4 − 0.551 = 0.019 bits
EXAMPLE
Infogain(Rain | Wind) = Entropy(Rain) − 2/5 Entropy(Rain | Wind = Strong) − 3/5 Entropy(Rain | Wind = Weak)
= 0.97 − (2/5)(−0 − (2/2) log₂(2/2)) − (3/5)(−(3/3) log₂(3/3) − 0)
= 0.97 − 0 − 0 = 0.97 bits
Now,
Infogain(Rain | Humidity) = 0.019 bits
Infogain(Rain | Temperature) = 0.019 bits
Infogain(Rain | Wind) = 0.97 bits
Thus, we select the Wind attribute.
DECISION TREES
[Figure: the final tree. Outlook at the root; Sunny → Humidity (High → No, Normal → Yes); Overcast → Yes; Rain → Wind (Strong → No, Weak → Yes).]
DECISION TREES
Biased towards tests with many outcomes (attributes having a large number of
values)
E.g., an attribute acting as a unique identifier will produce a large number of partitions (one tuple per partition).
Each resulting partition D is pure, so Info(D) = 0.
The information gain is therefore maximized.
E.g., in data with customer attributes (Customer ID, Gender and Car Type), a split on "Customer ID" is preferred over other attributes such as "Gender" and "Car Type", but "Customer ID" is not a predictive attribute.
EXTENSION TO INFORMATION GAIN
C4.5 a successor of ID3 uses an extension to information gain known as gain ratio.
It overcomes the bias of Information gain as it applies a kind of normalization to
information gain using a split information value.
The split information value represents the potential information generated by splitting the training data set S into c partitions, corresponding to the c outcomes of a test on attribute A:

SplitInformation(S, A) = − Σᵢ₌₁ᶜ (|Sᵢ| / |S|) log₂(|Sᵢ| / |S|)
High split info: partitions have more or less the same size
Low split info: few partitions hold most of the tuples
GAIN
To determine how well a test condition performs, we need to compare the degree of
impurity of the parent node (before splitting) with the degree of impurity of the child
node (after splitting).
The larger the difference, the better is the test condition.
The gain (∆) is a criterion that can be used to determine the goodness of a split:

∆ = I(parent) − Σⱼ₌₁ᵏ (N(vⱼ) / N) · I(vⱼ)

where I(·) is the impurity measure of a given node, N is the total number of records at the parent node, k is the number of attribute values, and N(vⱼ) is the number of records associated with child node vⱼ.
SPLITTING BASED ON INFO...
Information Gain:

GAIN_split = Entropy(p) − Σᵢ₌₁ᵏ (nᵢ / n) Entropy(i)

Gain Ratio:

GainRATIO_split = GAIN_split / SplitINFO,  where  SplitINFO = − Σᵢ₌₁ᵏ (nᵢ / n) log₂(nᵢ / n)

Parent node p is split into k partitions; nᵢ is the number of records in partition i.
The gain ratio adjusts the information gain by the entropy of the partitioning (SplitINFO). Higher-entropy partitioning (a large number of small partitions) is penalized!
Used in C4.5.
Designed to overcome the disadvantage of information gain.
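The gain ratio only adds the split-information normalization to the gain computation. A sketch along the same lines as the earlier snippets (illustrative, not C4.5 itself):

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def split_info(rows, col):
    """SplitINFO = -sum_i (n_i/n) log2(n_i/n) over the partitions induced by the attribute."""
    n = len(rows)
    return -sum((size / n) * log2(size / n)
                for size in Counter(r[col] for r in rows).values())

def gain_ratio(rows, col):
    gain = entropy([r[-1] for r in rows])
    for value in {r[col] for r in rows}:
        subset = [r[-1] for r in rows if r[col] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    si = split_info(rows, col)
    return gain / si if si > 0 else 0.0

# With the play-tennis `data` and `columns` from the information-gain sketch:
#   gain_ratio(data, columns["Outlook"]) ~= 0.156, matching the worked example below.
```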
EXTENSION TO INFORMATION GAIN
The attribute with the maximum gain ratio is selected as the splitting attribute.
EXAMPLE
For Outlook
Gain: 0.246
Split Info: Sunny: 5 examples, Overcast: 4 examples, Rain: 5 examples
−(5/14) log₂(5/14) − (4/14) log₂(4/14) − (5/14) log₂(5/14) = 1.577
Gain Ratio: 0.246 / 1.577 = 0.156
EXAMPLE
For Temperature
Gain: 0.029
Split Info: Hot: 4 examples, Mild: 6 examples, Cold: 4 examples
−(4/14) log₂(4/14) − (6/14) log₂(6/14) − (4/14) log₂(4/14) = 1.56
Gain Ratio: 0.029 / 1.56 = 0.018
EXAMPLE
For Humidity
Gain: 0.151
Split Info: High: 7 examples, Normal: 7 examples
−(7/14) log₂(7/14) − (7/14) log₂(7/14) = −(1/2) log₂(1/2) − (1/2) log₂(1/2) = 1
Gain Ratio: 0.151 / 1 = 0.151
EXAMPLE
For Wind
Gain: 0.047
Split Info: Strong: 6 examples, Weak: 8 examples
−(6/14) log₂(6/14) − (8/14) log₂(8/14) = 0.985
Gain Ratio: 0.047 / 0.985 = 0.048
EXAMPLE
For Outlook: Gain Ratio = 0.156
For Humidity: Gain Ratio = 0.151
For Temperature: Gain Ratio = 0.019
For Wind: Gain Ratio = 0.048
Again, Outlook is selected as it has the highest gain ratio.
EXAMPLE
Which attribute should be tested here?
EXAMPLE
Now, for Outlook = Sunny, we have the following instances.
Day Outlook Temperature Humidity Wind Play Tennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D8 Sunny Mild High Weak No
D9 Sunny Cold Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
EXAMPLE
For Outlook = Sunny | Humidity
Gain: 0.97
Split Info: High: 3 examples, Normal: 2 examples
−(3/5) log₂(3/5) − (2/5) log₂(2/5) = 0.971
Gain Ratio: 0.97 / 0.971 = 0.999
EXAMPLE
For Outlook = Sunny | Temperature
Gain: 0.57
Split Info: Hot: 2 examples, Mild: 2 examples, Cold: 1 example
−(2/5) log₂(2/5) − (2/5) log₂(2/5) − (1/5) log₂(1/5) = 1.52
Gain Ratio: 0.57 / 1.52 = 0.375
EXAMPLE
For Outlook = Sunny | Wind
Gain: 0.019
Split Info: Strong: 2 examples, Weak: 3 examples
−(2/5) log₂(2/5) − (3/5) log₂(3/5) = 0.971
Gain Ratio: 0.019 / 0.971 = 0.0195
EXAMPLE
For Outlook = Sunny | Humidity: Gain Ratio = 0.999
For Outlook = Sunny | Temperature: Gain Ratio = 0.375
For Outlook = Sunny | Wind: Gain Ratio = 0.0195
Humidity is selected as it has the highest gain ratio.
EXAMPLE
[Figure: partial tree. Root: Outlook over {D1, D2, ..., D14} (9 Yes, 5 No). Sunny branch: {D1, D2, D8, D9, D11} (2 Yes, 3 No) → split on Humidity. Overcast branch: {D3, D7, D12, D13} (4 Yes, 0 No) → leaf Yes. Rain branch: {D4, D5, D6, D10, D14} (3 Yes, 2 No) → which attribute should be tested here?]
EXAMPLE
Now, for Outlook = Rain, we have the following instances.
Day Outlook Temperature Humidity Wind Play Tennis
D4 Rain Mild High Weak Yes
D5 Rain Cold Normal Weak Yes
D6 Rain Cold Normal Strong No
D10 Rain Mild Normal Weak Yes
D14 Rain Mild High Strong No
EXAMPLE
For Outlook = Rain | Humidity
Gain: 0.019
Split Info: High: 2 examples, Normal: 3 examples
−(2/5) log₂(2/5) − (3/5) log₂(3/5) = 0.971
Gain Ratio: 0.019 / 0.971 = 0.0195
EXAMPLE
For Outlook = Rain | Temperature
Gain: 0.019
Split Info: Mild: 3 examples, Cold: 2 examples
−(3/5) log₂(3/5) − (2/5) log₂(2/5) = 0.971
Gain Ratio: 0.019 / 0.971 = 0.0195
EXAMPLE
For Outlook = Rain | Wind
Gain: 0.97
Split Info: Strong: 2 examples, Weak: 3 examples
−(2/5) log₂(2/5) − (3/5) log₂(3/5) = 0.971
Gain Ratio: 0.97 / 0.971 = 0.999
EXAMPLE
For Outlook = Rain | Humidity: Gain Ratio = 0.0195
For Outlook = Rain | Temperature: Gain Ratio = 0.0195
For Outlook = Rain | Wind: Gain Ratio = 0.999
Wind is selected as it has the highest gain ratio.
DECISION TREES
[Figure: the final tree. Outlook at the root; Sunny → Humidity (High → No, Normal → Yes); Overcast → Yes; Rain → Wind (Strong → No, Weak → Yes).]
MEASURE OF IMPURITY: GINI
Gini index for a given node t:

GINI(t) = 1 − Σⱼ [p(j | t)]²

(Note: p(j | t) is the relative frequency of class j at node t.)
Maximum (1 − 1/n_c) when records are equally distributed among all classes, implying the least interesting information, where n_c is the number of classes.
Minimum (0.0) when all records belong to one class, implying the most interesting information.
C1 = 0, C2 = 6: Gini = 0.000;  C1 = 1, C2 = 5: Gini = 0.278;  C1 = 2, C2 = 4: Gini = 0.444;  C1 = 3, C2 = 3: Gini = 0.500
EXAMPLES FOR COMPUTING GINI
GINI(t) = 1 − Σⱼ [p(j | t)]²

When a parent node is split into k partitions (children), the quality of the split is computed as

GINI_split = Σᵢ₌₁ᵏ (nᵢ / n) GINI(i)

where nᵢ is the number of records at child i and n is the number of records at the parent node.
For Outlook: Gini Index = 0.342
For Temperature: Gini Index = 0.440
For Humidity: Gini Index = 0.368
For Wind: Gini Index = 0.428
Outlook is selected as it has the smallest Gini index.
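The same pattern works for the Gini index. This sketch is illustrative and, with the play-tennis data from the information-gain snippet, reproduces roughly the weighted Gini values listed above.

```python
from collections import Counter

def gini(labels):
    """GINI(t) = 1 - sum_j p(j|t)^2 for a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(rows, col):
    """Weighted Gini of the child nodes obtained by splitting on the given column."""
    n = len(rows)
    return sum(len(subset) / n * gini(subset)
               for subset in ([r[-1] for r in rows if r[col] == value]
                              for value in {r[col] for r in rows}))

print(gini(["C1"] * 3 + ["C2"] * 3))   # 0.500, maximum impurity for two classes
print(gini(["C2"] * 6))                # 0.000, a pure node
# With the play-tennis data: gini_split(data, columns["Outlook"]) ~= 0.343
# (0.342 in the slides after rounding), the smallest of the four attributes.
```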
EXAMPLE
Which attribute should be tested here?
EXAMPLE
Now, for Outlook = Sunny, we have the following instances.
Day Outlook Temperature Humidity Wind Play Tennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D8 Sunny Mild High Weak No
D9 Sunny Cold Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
EXAMPLE
[Figure: partial tree. Root: Outlook over {D1, D2, ..., D14} (9 Yes, 5 No). Sunny branch: {D1, D2, D8, D9, D11} (2 Yes, 3 No) → split on Humidity. Overcast branch: {D3, D7, D12, D13} (4 Yes, 0 No) → leaf Yes. Rain branch: {D4, D5, D6, D10, D14} (3 Yes, 2 No) → which attribute should be tested here?]
EXAMPLE
Now, for Outlook=Rain, we have the following instances.
Day Outlook Temperature Humidity Wind Play Tennis
D4 Rain Mild High Weak Yes
D5 Rain Cold Normal Weak Yes
D6 Rain Cold Normal Strong No
D10 Rain Mild Normal Weak Yes
D14 Rain Mild High Strong No
EXAMPLE
[Figure: the resulting decision tree (leaves: No, Yes, No, Yes).]
CATEGORICAL ATTRIBUTES: COMPUTING GINI INDEX
For each distinct value, gather counts for each class in the dataset
Use the count matrix to make decisions
Multi-way split on CarType:
        Family  Sports  Luxury
  C1      1       2       1
  C2      4       1       1
  Gini = 0.393

Binary split {Sports, Luxury} vs {Family}:
        {Sports, Luxury}  {Family}
  C1           3              1
  C2           2              4
  Gini = 0.400

Binary split {Sports} vs {Family, Luxury}:
        {Sports}  {Family, Luxury}
  C1        2            2
  C2        1            5
  Gini = 0.419
SPLITTING CRITERIA BASED ON CLASSIFICATION ERROR
[Table: for each candidate split position of a continuous attribute, the class counts and the resulting Gini index; the position with the lowest Gini (0.300) gives the best split.]
ALGORITHM FOR DECISION TREE INDUCTION
The Classify() function determines the class label to be assigned to a leaf node. For each leaf node t, let p(i|t) denote the fraction of training records from class i associated with the node t. In most cases, the leaf node is assigned to the class that has the majority of training records:
leaf.label = argmaxᵢ p(i|t)
where the argmax operator returns the argument i that maximizes p(i|t).
The stopping_cond() function is used to terminate the tree-growing process by testing whether all the records have either the same class label or the same attribute values. Another way to terminate the recursive function is to test whether the number of records has fallen below some minimum threshold.
ALGORITHM FOR DECISION TREE INDUCTION
After building the decision tree, a tree-pruning step can be performed to reduce
the size of the decision tree.
Pruning helps by trimming the branches of the initial tree in a way that improves the
generalization capability of the decision tree.
ALGORITHM FOR DECISION TREE INDUCTION
TreeGrowth(E, F)
1:  if stopping_cond(E, F) = true then
2:      leaf = createNode()
3:      leaf.label = Classify(E)
4:      return leaf
5:  else
6:      root = createNode()
7:      root.test_cond = find_best_split(E, F)
8:      let V = {v | v is a possible outcome of root.test_cond}
9:      for each v ∈ V do
10:         E_v = {e | root.test_cond(e) = v and e ∈ E}
11:         child = TreeGrowth(E_v, F)
12:         add child as descendant of root and label the edge (root → child) as v
13:     end for
14: end if
15: return root
HOLDOUT METHOD
The original data with labelled examples is partitioned into two disjoint sets, called
the training set and test set.
A classification model is then induced from the training set and its performance is
evaluated on the test set.
The proportion of data reserved for training and for testing is typically at the discretion of the analyst (e.g., 50–50, or two-thirds for training and one-third for testing).
The accuracy of the classifier can be estimated based on the accuracy of the induced
model on the test set.
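As a concrete illustration (the slides do not prescribe any library), scikit-learn's train_test_split can implement the holdout method; the Iris data here is only a stand-in labelled dataset, and the two-thirds / one-third split follows the example above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)        # stand-in labelled dataset

# Hold out one-third of the labelled examples for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=0
)

model = DecisionTreeClassifier(criterion="entropy").fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))
```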
HOLDOUT METHOD
Limitations
Fewer labelled examples are available for training because some of the records are
withheld for testing. The induced model may not be as good as when all the labelled
examples are used for training.
The model may be highly dependent on the composition of the training and test sets.
The smaller the training set, the larger the variance of the model.
On the other hand, if the training set is too large, then the estimated accuracy computed from the smaller test set is less reliable. Such an estimate is said to have a wide confidence interval.
HOLDOUT METHOD
Limitations
The training and test sets are
no longer independent of each
other. Because the training
and test sets are subsets of
the original data, a class that is
overrepresented in one subset
will be underrepresented in
the other, and vice versa.
RANDOM SUBSAMPLING
Random subsampling repeats the holdout method several times, with a different random split each time, and averages the accuracies over the runs.
Limitations
It does not utilize as much data as possible for training.
It has no control over the number of times each record is used for testing and
training. Consequently, some records might be used for training more often than
others.
CROSS-VALIDATION
In this approach, each record is used the same
number of times for training and exactly once for
testing.
Example:
Partition the data into three equal-sized subsets.
Choose two of the subsets (A, B) for training and the remaining one for testing.
Then choose another combination of two subsets (A, C) for training and the remaining one for testing.
Finally, choose the remaining combination (B, C) for training and the remaining one for testing.
This approach is called three-fold cross-validation.
CROSS-VALIDATION
The k-fold cross validation method generalizes this approach by segmenting the data
into k equal-sized partitions.
During each run, one of the partitions is used for testing, while the rest are used for training.
This procedure is repeated k times so that each partition is used for testing exactly
once.
The total error is found by summing up the errors for all the k runs.
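The bookkeeping of k-fold cross-validation can be sketched in plain Python; train_and_evaluate is a hypothetical callback that induces a classifier on the training records and returns its error on the test records.

```python
import random

def k_fold_indices(n_records, k, seed=0):
    """Shuffle the record indices and deal them into k (nearly) equal-sized folds."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def k_fold_cross_validation(records, k, train_and_evaluate):
    folds = k_fold_indices(len(records), k)
    total_error = 0.0
    for i in range(k):                      # each fold is the test set exactly once
        test = [records[j] for j in folds[i]]
        train = [records[j] for f in (folds[:i] + folds[i + 1:]) for j in f]
        total_error += train_and_evaluate(train, test)
    return total_error                      # summed over the k runs, as described above
```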
CROSS-VALIDATION
A special case of k-fold cross-validation method sets k=N, the size of the dataset.
Called the leave-one-out approach; each test set contains only one record.
This approach has the advantage of utilizing as much data as possible for training.
In addition, the test sets are mutually exclusive and they effectively cover the
entire data set.
The drawback of this approach is that it is computationally expensive to repeat the
procedure N times.
Furthermore, since each test set contains only one record, the variance of the estimated performance metric tends to be high.
BOOTSTRAP
The methods so far assume that the training records are sampled without
replacement. As a result, there are no duplicate records in the training and test sets.
Bootstrap Approach:Training records sampled with replacement.
If the original data has N records, it can be shown that, on average, a bootstrap
sample of size N contains about 63.2% of the records of the original data.
Records that are not included in bootstrap sample, become a part of the test set.
The model induced from the training set is then applied to the test set to obtain an estimate of the accuracy of the bootstrap sample, εᵢ.
The sampling procedure is then repeated b times to generate b bootstrap samples.
BOOTSTRAP
.632 Bootstrap
It computes the overall accuracy by combining the accuracy of each bootstrap sample (εᵢ) with the accuracy computed from a training set that contains all the labelled examples in the original data (acc_s), i.e., no repeated labelled examples:

acc_boot = (1/b) Σᵢ₌₁ᵇ (0.632 · εᵢ + 0.368 · acc_s)
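A sketch of the .632 bootstrap accuracy; train_and_score is a hypothetical callback that induces a model on its first argument and returns its accuracy on the second, and sampling with replacement uses random.Random.choices.

```python
import random

def dot632_bootstrap_accuracy(records, b, train_and_score, seed=0):
    """acc_boot = (1/b) * sum_i (0.632 * eps_i + 0.368 * acc_s)."""
    rng = random.Random(seed)
    n = len(records)
    # acc_s: accuracy of a model trained on all labelled examples (no resampling).
    acc_s = train_and_score(records, records)
    total = 0.0
    for _ in range(b):
        sample = rng.choices(records, k=n)             # bootstrap sample: with replacement
        chosen = {id(r) for r in sample}
        out_of_bag = [r for r in records if id(r) not in chosen]   # ~36.8% on average
        eps_i = train_and_score(sample, out_of_bag)    # accuracy on the left-out records
        total += 0.632 * eps_i + 0.368 * acc_s
    return total / b
```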