CS-13410
Introduction to Machine Learning
Lecture # 07
(Decision Trees – Ch # 3 by Tom Mitchell)
by
Mudasser Naseer
Assignment – 1
Assignment 1 is uploaded on Moodle. It is due on 12-03-2021 by 4:00 pm.
Measuring Node Impurity
p(i|t): the fraction of records at node t belonging to class i.

Entropy(t) = – Σi p(i|t) log2 p(i|t)
• Used in ID3 and C4.5
Gini(t) = 1 – Σi [p(i|t)]²
• Used in CART, SLIQ, SPRINT.
Classification error(t) = 1 – maxi p(i|t)
Example

P(C1) = 0/6 = 0    P(C2) = 6/6 = 1
Gini = 1 – (P(C1))² – (P(C2))² = 1 – 0 – 1 = 0
Entropy = – 0 log 0 – 1 log 1 = – 0 – 0 = 0
Error = 1 – max(0, 1) = 1 – 1 = 0

P(C1) = 1/6    P(C2) = 5/6
Gini = 1 – (1/6)² – (5/6)² = 0.278
Entropy = – (1/6) log2(1/6) – (5/6) log2(5/6) = 0.65
Error = 1 – max(1/6, 5/6) = 1 – 5/6 = 1/6

P(C1) = 2/6    P(C2) = 4/6
Gini = 1 – (2/6)² – (4/6)² = 0.444
Entropy = – (2/6) log2(2/6) – (4/6) log2(4/6) = 0.92
Error = 1 – max(2/6, 4/6) = 1 – 4/6 = 1/3
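The three impurity measures above can be reproduced with a short script. A minimal sketch; the function names are mine, not from the lecture:

```python
import math

def gini(probs):
    """Gini index: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in probs)

def entropy(probs):
    """Entropy in bits; 0 * log2(0) is taken as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def class_error(probs):
    """Classification error: 1 minus the largest class probability."""
    return 1.0 - max(probs)

# Node with class counts C1 = 1, C2 = 5 (the second case above)
probs = [1/6, 5/6]
print(round(gini(probs), 3))         # → 0.278
print(round(entropy(probs), 2))      # → 0.65
print(round(class_error(probs), 3))  # → 0.167
```

All three give 0 for a pure node such as `[0, 1]`, matching the first case on the slide.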
Impurity measures
All of the impurity measures take their minimum value, zero, at a pure node, where a single class has probability 1.
All of the impurity measures take their maximum value when the class distribution in a node is uniform.
Splitting Based on GINI
When a node p is split into k partitions (children), the quality of the split is computed as

GINI_split = Σ(i=1..k) (ni/n) · GINI(i)

where ni = number of records at child i, and n = number of records at node p.
Binary Attributes: Computing the GINI Index
• Splits into two partitions
• Effect of weighting partitions:
  – Larger and purer partitions are sought.

B?
Yes → Node N1 (C1 = 5, C2 = 2)    No → Node N2 (C1 = 1, C2 = 4)

Gini(N1) = 1 – (5/7)² – (2/7)² = 0.408
Gini(N2) = 1 – (1/5)² – (4/5)² = 0.320
Gini(children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371
This is the quality of the split on attribute B.
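The weighted computation can be checked with a small sketch; the child counts are read off the fractions above, and the function names are illustrative:

```python
def gini(counts):
    """Gini index of a node given its raw class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """Quality of a split: Gini of each child weighted by its share of records."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

# Split on B: N1 has (C1=5, C2=2), N2 has (C1=1, C2=4)
n1, n2 = [5, 2], [1, 4]
print(round(gini(n1), 3))              # → 0.408
print(round(gini_split([n1, n2]), 3))  # → 0.371
```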
Categorical Attributes
For binary attributes, split into the two values.
For multivalued attributes: for each distinct value, gather the counts for each class in the dataset.
Use the count matrix to choose between a multi-way split and a two-way split (find the best binary partition of values).
Continuous Attributes
Use binary decisions based on one value v, giving partitions A ≤ v and A > v.
Choices for the splitting value:
  Number of possible splitting values = number of distinct values.
Each splitting value has an associated count matrix: the class counts in each of the partitions A ≤ v and A > v.
Exhaustive method to choose the best v:
  For each v, scan the database to gather the count matrix and compute the impurity index.
  Computationally inefficient! Repetition of work.
Continuous Attributes
For efficient computation, for each attribute:
  Sort the records on the attribute's values.
  Linearly scan these values, each time updating the count matrix and computing the impurity.
  Choose the split position that has the least impurity.
(Figure: sorted attribute values with the candidate split positions between them)
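The sort-then-scan procedure can be sketched as follows, assuming a Gini criterion and a simple list-of-values dataset; the function and variable names are my own:

```python
def best_split(values, labels):
    """Sort once, then scan candidate thresholds between distinct values,
    updating the class counts incrementally instead of rescanning the data."""
    pairs = sorted(zip(values, labels))
    classes = sorted(set(labels))
    left = {c: 0 for c in classes}   # counts for A <= v
    right = {c: 0 for c in classes}  # counts for A > v
    for _, y in pairs:
        right[y] += 1
    n = len(pairs)

    def gini(counts, total):
        if total == 0:
            return 0.0
        return 1.0 - sum((c / total) ** 2 for c in counts.values())

    best_v, best_impurity = None, float("inf")
    for i in range(n - 1):
        v, y = pairs[i]
        left[y] += 1    # move one record from the right partition to the left
        right[y] -= 1
        if v == pairs[i + 1][0]:
            continue    # only split between distinct values
        nl = i + 1
        w = nl / n * gini(left, nl) + (n - nl) / n * gini(right, n - nl)
        if w < best_impurity:
            best_v = (v + pairs[i + 1][0]) / 2  # midpoint threshold
            best_impurity = w
    return best_v, best_impurity

values = [60, 70, 75, 85, 90, 95]
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]
v, imp = best_split(values, labels)
print(v)  # → 72.5
```

One sort plus one linear scan replaces the exhaustive rescan of the database for every candidate v.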
Splitting based on impurity
Impurity measures favor attributes with a large number of categories.
A test condition with a large number of outcomes may not be desirable: the number of records in each partition may be too small to make reliable predictions.
Gain Ratio
The information gain measure tends to prefer attributes with large numbers of possible categories.
Gain ratio: a modification of the information gain that reduces its bias towards high-branch attributes.
The split information (intrinsic information) should be
  large when the data is evenly spread across the branches,
  small when all the data belong to one branch.
Gain ratio takes the number and size of branches into account when choosing an attribute:
it corrects the information gain by taking the intrinsic information of the split into account.

GainRatio(A) = Gain(A) / SplitInfo_A(D)
SplitInfo_A(D) = – Σ(j=1..v) (|Dj|/|D|) log2(|Dj|/|D|)

(or the same formulas with S in place of D).
Gain Ratio
Adjusts the information gain by the entropy of the partitioning (SplitINFO). A higher-entropy partitioning (a large number of small partitions) is penalized!
Used in C4.5.
Designed to overcome the information gain's bias towards multivalued attributes.
Example (play tennis):
More on the gain ratio
"Outlook" still comes out top.
However, "ID code" has an even greater gain ratio.
  Standard fix: in particular applications we can use an ad hoc test to prevent splitting on that type of attribute.
Problem with gain ratio: it may overcompensate.
  It may choose an attribute just because its intrinsic information is very low.
  Standard fix:
  • First, only consider attributes with greater-than-average information gain.
  • Then, compare them on gain ratio.
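Gain ratio and the average-gain fix can be sketched together. The dataset layout (rows of attribute tuples) and all function names here are my own assumptions, not from the lecture:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Entropy of the parent minus the weighted entropy of the children."""
    n = len(labels)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def split_info(rows, attr):
    """Intrinsic information: entropy of the partition sizes themselves."""
    n = len(rows)
    counts = Counter(row[attr] for row in rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def gain_ratio(rows, labels, attr):
    si = split_info(rows, attr)
    return info_gain(rows, labels, attr) / si if si > 0 else 0.0

def choose_attribute(rows, labels, attrs):
    """Standard fix: keep only attributes with at least average gain,
    then pick the survivor with the highest gain ratio."""
    gains = {a: info_gain(rows, labels, a) for a in attrs}
    avg = sum(gains.values()) / len(gains)
    candidates = [a for a in attrs if gains[a] >= avg]
    return max(candidates, key=lambda a: gain_ratio(rows, labels, a))
```

For example, on rows `[("sunny","hot"), ("sunny","mild"), ("rain","mild"), ("rain","hot")]` with labels `["No","No","Yes","Yes"]`, attribute 0 separates the classes perfectly and is chosen.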
Comparing Attribute Selection Measures
The three measures, in general, return good results, but:
Information gain
  Biased towards multivalued attributes.
Gain ratio
  Tends to prefer unbalanced splits in which one partition is much smaller than the other.
Gini index
  Biased towards multivalued attributes.
  Has difficulties when the number of classes is large.
  Tends to favor tests that result in equal-sized partitions with purity in both partitions.
Stopping Criteria for Tree Induction
Stop expanding a node when all the
records belong to the same class
Stop expanding a node when all the
records have similar attribute values
Decision Tree Based Classification
Advantages:
Inexpensive to construct
Extremely fast at classifying unknown records
Easy to interpret for small-sized trees
Accuracy is comparable to other classification
techniques for many simple data sets
Example: C4.5
Simple depth-first construction.
Uses Information Gain
Sorts Continuous Attributes at each
node.
Needs the entire dataset to fit in memory.
Unsuitable for large datasets: would need out-of-core sorting.
You can download the software from:
[Link]
Practical Issues of Classification
Underfitting and Overfitting
Evaluation
Underfitting and Overfitting
(Figure: training and test error versus model complexity, showing the underfitting and overfitting regions)
Underfitting: when the model is too simple, both the training and test errors are large.
Overfitting: when the model is too complex, it models the details of the training set and fails on the test set.
Overfitting due to Noise
The decision boundary is distorted by noise points.
Notes on Overfitting
Overfitting results in decision trees that
are more complex than necessary
Training error no longer provides a
good estimate of how well the tree will
perform on previously unseen records
The model does not generalize well
Need new ways for estimating errors
How to Address Overfitting:
Tree Pruning
Pre-Pruning (Early Stopping Rule)
Stop the algorithm before it becomes a fully-grown tree
Typical stopping conditions for a node:
• Stop if all instances belong to the same class
• Stop if all the attribute values are the same
More restrictive conditions:
• Stop if number of instances is less than some user-specified threshold
• Stop if the class distribution of instances is independent of the available features (e.g., using the χ² test)
• Stop if expanding the current node does not improve the impurity measures (e.g., Gini or information gain), or if the improvement falls below a threshold value.
Upon halting, the node becomes a leaf
The leaf may hold the most frequent class among the subset
tuples.
Problem:
• Difficult to choose an appropriate threshold.
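The pre-pruning conditions above can be collected into a single check. A sketch only; the threshold names and default values are illustrative, not from the lecture:

```python
def should_stop(labels, rows, min_instances=5, min_gain=1e-3, best_gain=None):
    """Return True if the node should become a leaf instead of being split."""
    if len(set(labels)) <= 1:
        return True  # all instances belong to the same class
    if all(r == rows[0] for r in rows):
        return True  # all attribute values are the same
    if len(labels) < min_instances:
        return True  # too few records to split reliably
    if best_gain is not None and best_gain < min_gain:
        return True  # best candidate split barely improves impurity
    return False
```

The hard part, as the slide notes, is picking `min_instances` and `min_gain` well: too strict and the tree underfits, too loose and pre-pruning does nothing.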
How to Address Overfitting…
Post-pruning
Grow the decision tree to its full size.
Trim the nodes of the decision tree in a bottom-up fashion.
If the generalization error improves after trimming, replace the sub-tree by a leaf node.
The class label of the leaf node is determined from the majority class of instances in the sub-tree.
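A minimal sketch of bottom-up post-pruning against a held-out validation set. The tuple-based tree representation (a leaf is a label; an internal node is `(attr_index, branches, majority_label)`) is an assumption for illustration:

```python
def classify(tree, row):
    """Follow branches until a leaf label is reached; unseen values
    fall back to the node's majority class."""
    while isinstance(tree, tuple):
        attr, branches, majority = tree
        tree = branches.get(row[attr], majority)
    return tree

def error(tree, rows, labels):
    """Number of misclassified validation records."""
    return sum(classify(tree, r) != y for r, y in zip(rows, labels))

def prune(tree, rows, labels):
    """Bottom-up: prune the children first, then replace the sub-tree with its
    majority-class leaf if that does not increase the validation error."""
    if not isinstance(tree, tuple):
        return tree
    attr, branches, majority = tree
    for v in branches:
        idx = [i for i, r in enumerate(rows) if r[attr] == v]
        branches[v] = prune(branches[v],
                            [rows[i] for i in idx],
                            [labels[i] for i in idx])
    pruned = (attr, branches, majority)
    if error(majority, rows, labels) <= error(pruned, rows, labels):
        return majority  # the leaf is at least as good on validation data
    return pruned
```

For example, a split whose "No" branch only misclassifies validation records collapses into the majority leaf "Yes".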
Prune the Tree OR Prune the Rules
To reduce the complexity of the decision procedure we have two options: (i) prune the tree first and then derive the rules, or (ii) derive the rules and then prune the rules.
Which is better?
(ii) is better. Why? Each rule can be pruned independently: a condition can be dropped from one rule without affecting any other, whereas removing a node from the tree changes every rule whose path passes through it.