
Chap 3: Decision Tree Learning

Machine Learning MTAI-301

Instructor
Dr. Sanjay Chatterji
Decision Tree Learning
● Decision tree learning is a method for approximating
discrete-valued target functions.
● It classifies instances by sorting them down the tree
from root to leaf node, which provides the classification
of the instance.
● It is robust to noisy data and capable of learning
disjunctive expressions.
● Decision tree algorithms such as ID3 and C4.5 are popular
inductive inference algorithms that have been successfully
applied to many learning tasks.
Decision Tree

● The learned function is represented by a decision tree.


● Decision Tree
− Node: specifies a test of some attribute
− Edge: descendant from the node corresponds to one of
the possible values for the attribute
− Leaf node: assigns a classification
● A learned decision tree can also be represented as a set of
if-then rules.
● An instance is classified by traversing the tree from the root
down to a leaf.
Decision Tree for PlayTennis
(Outlook = Sunny ^ Humidity = Normal)
V (Outlook = Overcast)
V (Outlook = Rain ^ Wind = Weak)

(Outlook=Sunny, Temperature=Hot, Humidity=High, Wind=Strong) is a negative
instance.
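As an illustration (not part of the original slides), this tree can be sketched
in Python as nested dictionaries, with classification done by sorting an
instance from the root down to a leaf:

# A minimal sketch of the PlayTennis tree above as nested dictionaries:
# an inner node maps an attribute name to {attribute value: subtree};
# a leaf is simply the class label.
play_tennis_tree = {
    "Outlook": {
        "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def classify(tree, instance):
    # Sort the instance down the tree from root to leaf.
    while isinstance(tree, dict):
        attribute = next(iter(tree))                  # attribute tested at this node
        tree = tree[attribute][instance[attribute]]   # follow the branch for its value
    return tree                                       # the leaf gives the classification

# The negative instance from the slide:
example = {"Outlook": "Sunny", "Temperature": "Hot", "Humidity": "High", "Wind": "Strong"}
print(classify(play_tennis_tree, example))            # prints: No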
When to Consider Decision Trees
● Instances are represented by attribute-value pairs.
● The target function has discrete output values.
● Disjunctive descriptions may be required.
● The training data may contain errors.
● The training data may contain missing attribute
values.
● Decision tree learning has been applied to practical problems
such as classifying medical patients by their disease and loan
applicants by their credit risk.
Greedy Decision Tree Learning Algorithm
• Top-down, greedy search through the space of
possible decision trees.
• This approach is exemplified by the ID3 algorithm
and its successor C4.5
• ID3 learns decision trees by constructing them
top down, beginning with the question: which
attribute should be tested at the root of the tree?
• The best attribute (according to a statistical test such as
information gain) is selected.
• A descendant of the root node is then created for
each possible value of this attribute.
• The entire process is then repeated.
Which Attribute is “best”?
● We would like to select the attribute that is most useful
for classifying examples.
● Information gain measures how well an attribute
separates the training examples according to the target
classification.
● ID3 uses this information gain measure to select among
the candidate attributes at each step while growing the
tree.
● In order to define information gain precisely, we use
entropy commonly used in information theory.
Entropy
● Entropy characterises the (im)purity of an arbitrary
collection of examples.
● Entropy(S)= expected number of bits needed to encode
class (+ or -) of randomly drawn members of S.
− if p+ is 1, the receiver knows the drawn example will be
positive, so no message need be sent: Entropy=zero
− if p+ is 0.5, one bit is required to indicate whether the drawn
example is positive or negative: Entropy=1.
− if p+ is 0.8, then a collection of messages can be encoded
using, on average, less than 1 bit per message by assigning
shorter codes to the more likely positive examples and longer
codes to the less likely negative examples.
Entropy
● Given a collection S, containing positive and
negative examples of some target concept, the
entropy of S relative to this Boolean classification
is:
− Entropy(S) = -p+ log2(p+) - p- log2(p-)
● S is a sample of training examples
● p+ is the proportion of positive examples
● p- is the proportion of negative examples

● Information theory: an optimal-length code assigns -log2(p) bits
to a message having probability p.
Entropy

● It is assumed that 0 · log2(0) = 0


● Entropy([9+,5-]) = –(9/14) log2(9/14) – (5/14) log2(5/14) = 0.940
● Entropy([12+,4-]) = –(12/16) log2(12/16) – (4/16) log2(4/16) = 0.811
● Entropy([12+,5-]) = –(12/17) log2(12/17) – (5/17) log2(5/17) = 0.874
● Entropy([8+,8-]) = –(8/16) log2(8/16) – (8/16) log2(8/16) = 1.0
● Entropy([8+,0-]) = –(8/8) log2(8/8) – (0/8) log2(0/8) = 0.0
● Entropy([0+,8-]) = –(0/8) log2(0/8) – (8/8) log2(8/8) = 0.0
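A minimal Python sketch (not in the original slides) that computes entropy from
per-class example counts and reproduces the values listed above:

import math

def entropy(counts):
    # Entropy of a collection described by per-class example counts,
    # using the convention 0 * log2(0) = 0.
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

print(f"{entropy([9, 5]):.3f}")    # 0.940
print(f"{entropy([12, 4]):.3f}")   # 0.811
print(f"{entropy([12, 5]):.3f}")   # 0.874
print(f"{entropy([8, 8]):.3f}")    # 1.000
print(f"{entropy([8, 0]):.3f}")    # 0.000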
Entropy

● If the target attribute can take on c different values,
then the entropy of S relative to this c-wise
classification is defined as
− Entropy(S) = ∑i=1..c –pi log2(pi)
● pi is the proportion of S belonging to class i.


● The logarithm is still base 2 because entropy is a
measure of the expected encoding length measured in
bits.
Information Gain
● Information gain is a measure of the effectiveness of
an attribute in classifying the training data.
● Information gain measures the expected reduction in
entropy caused by partitioning the examples according to an
attribute A:
− Gain(S,A) = Entropy(S) – ∑v∈Values(A) ( |Sv| / |S| ) Entropy(Sv)

● S: a collection of examples
● A: an attribute
● Values(A): possible values of attribute A
● Sv: the subset of S for which attribute A has value v
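This formula can be sketched directly in Python (not from the slides). The
usage example assumes the standard 14-example PlayTennis data, in which
Wind=Weak covers [6+,2-] and Wind=Strong covers [3+,3-]; this matches the
Gain(S,Wind) = 0.048 computation shown on a later slide:

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(examples, attribute, target):
    # Gain(S,A) = Entropy(S) - sum over values v of (|Sv|/|S|) * Entropy(Sv)
    gain = entropy([e[target] for e in examples])
    for value in {e[attribute] for e in examples}:
        subset = [e[target] for e in examples if e[attribute] == value]
        gain -= (len(subset) / len(examples)) * entropy(subset)
    return gain

# Assumed class distribution of Wind over the 14 PlayTennis examples:
S = ([{"Wind": "Weak", "PlayTennis": "Yes"}] * 6 + [{"Wind": "Weak", "PlayTennis": "No"}] * 2 +
     [{"Wind": "Strong", "PlayTennis": "Yes"}] * 3 + [{"Wind": "Strong", "PlayTennis": "No"}] * 3)
print(f"{information_gain(S, 'Wind', 'PlayTennis'):.3f}")   # prints: 0.048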
A Training Example
(The 14-example PlayTennis training set with attributes Outlook, Temperature,
Humidity and Wind, target PlayTennis, and 9 positive and 5 negative examples.)
Information Gain
Which attribute is the best classifier?

Gain(S,A) = 0.27 Gain(S,B)= 0.12


● A provides greater information gain than B.

● A is a better classifier than B.


ID3 –Selecting Next Attribute
● Entropy([9+,5-]) = –(9/14) log2(9/14) – (5/14) log2(5/14) = 0.940
● Gain(S,Humidity) = 0.940 - (7/14)*0.985 - (7/14)*0.592 = 0.151
● Gain(S,Wind) = 0.940 - (8/14)*0.811 - (6/14)*1.0 = 0.048
● Gain(S,Outlook) = 0.940 - (5/14)*0.971 - (4/14)*0.0 - (5/14)*0.971 = 0.247
● Gain(S,Temp) = 0.940 - (4/14)*1.0 - (6/14)*0.918 - (4/14)*0.811 = 0.029
ID3 -Ssunny
● Gain(Ssunny, Humidity) = 0.970 - (3/5)*0.0 - (2/5)*0.0 = 0.970
● Gain(Ssunny, Temp) = 0.970 - (2/5)*0.0 - (2/5)*1.0 - (1/5)*0.0 = 0.570
● Gain(Ssunny, Wind) = 0.970 - (2/5)*1.0 - (3/5)*0.918 = 0.019
ID3(Examples, TargetAttribute, Attributes)
● Create a Root node for the tree
● If all Examples are positive, Return the single-node tree Root, with label = +
● If all Examples are negative, Return the single-node tree Root, with label = -
● If Attributes is empty, Return the single-node tree Root, with label = most
common value of TargetAttribute in Examples
● Otherwise Begin
− A ← the attribute from Attributes that best classifies Examples
− The decision attribute for Root ← A
− For each possible value vi of A,
● Add a new tree branch below Root, corresponding to the test A = vi
● Let Examples_vi be the subset of Examples that have value vi for A
● If Examples_vi is empty
− Then add a leaf node with label = most common value of
TargetAttribute in Examples
● Else add the subtree ID3(Examples_vi, TargetAttribute, Attributes – {A})
● End
● Return Root
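A compact Python sketch of this pseudocode (not part of the slides). Unlike the
pseudocode, it only creates branches for attribute values that actually occur
in Examples, so the empty-subset case never arises:

import math
from collections import Counter

def _entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def _gain(examples, attribute, target):
    # Gain(S,A) = Entropy(S) - sum_v (|Sv|/|S|) Entropy(Sv)
    base = _entropy([e[target] for e in examples])
    for v in {e[attribute] for e in examples}:
        subset = [e[target] for e in examples if e[attribute] == v]
        base -= (len(subset) / len(examples)) * _entropy(subset)
    return base

def id3(examples, target, attributes):
    # Returns a class label (leaf) or a nested dict {attribute: {value: subtree}}.
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:                 # all examples positive or all negative
        return labels[0]
    if not attributes:                        # no attributes left to test
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: _gain(examples, a, target))
    tree = {best: {}}
    for value in {e[best] for e in examples}:              # one branch per observed value
        subset = [e for e in examples if e[best] == value]
        tree[best][value] = id3(subset, target, [a for a in attributes if a != best])
    return tree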
Hypothesis Space Search in ID3
ID3 can be characterized as searching a space of
hypotheses for one that fits the training examples.
The hypothesis space searched by ID3 is the set
of possible decision trees.
ID3 performs a simple-to-complex, hill-climbing
search through this hypothesis space, beginning
with the empty tree, then considering
progressively more elaborate hypotheses in
search of a decision tree that correctly classifies
the training data.
Hypothesis Space Search in ID3
ID3 -Capabilities and Limitations
● ID3’s hypothesis space of all decision trees is a
complete space of finite discrete-valued functions.
❖ Because any finite discrete-valued function can be
represented by some decision tree, ID3 avoids the major
risk that the hypothesis space might not contain
the target function.
● ID3 maintains only a single current hypothesis, and
outputs only a single hypothesis (in contrast to the
Candidate Elimination Algorithm).
❖ It can’t determine how many alternative decision
trees are consistent with the training data.
ID3 -Capabilities and Limitations
● No backtracking on selected attributes (greedy
search)
❖ It is susceptible to the usual risks of
converging to locally optimal solutions.
● Statistically-based search choices
❖ The resulting search is much less sensitive
to errors in individual training examples.
❖ It can be easily extended to handle noisy
training data by modifying its termination
criterion to accept hypotheses that
imperfectly fit the training data.
Inductive Bias in Decision Tree Learning
● The policy by which ID3 generalizes from observed
training examples to classify unseen instances.
● Selects in favour of shorter trees over longer ones.
● Selects trees that place the attributes with highest
information gain closest to the root.
● OCCAM'S RAZOR: Prefer the simplest hypothesis that
fits the data.
● ID3 can be viewed as an efficient approximation to
BFS-ID3, using a greedy heuristic search to attempt to
find the shortest tree without conducting the entire
breadth-first search through the hypothesis space.
Difference b/w ID3 and CE algorithms

• ID3 searches a complete hypothesis space (i.e., one
capable of expressing any finite discrete-valued function),
but it searches this space incompletely, from simple to
complex hypotheses, until its termination condition is met.
• The version space CANDIDATE-ELIMINATION algorithm
searches an incomplete hypothesis space (i.e., one that
can express only a subset of the potentially teachable
concepts) completely, finding every hypothesis consistent
with the training data.
Overfitting
● Given a hypothesis space H, a hypothesis h∈H is said
to OVERFIT the training data if there exists some
alternative hypothesis h'∈H, such that h has smaller
error than h' over the training examples, but h' has a
smaller error than h over the entire distribution of
instances.
● Reasons for overfitting:
− Errors and noise in the training examples
− Coincidental regularities (especially when a small number of
examples is associated with leaf nodes)
Overfitting – Reduced Error Pruning
Effect of adding a positive training example incorrectly labeled as negative:

(Outlook=Sunny, Temperature=Hot, Humidity=Normal, Wind=Strong, PlayTennis=No)

Overfitting is possible even when the training data is noise-free,
especially when a small number of examples is associated with leaf
nodes.
Avoid Overfitting
● How can we avoid overfitting?
● Stop growing when data split not statistically significant
● Grow full tree then post-prune
● Criteria to be used to determine final tree size
● Use separate examples to evaluate the utility of post-pruning.
● Use all available data for training, but apply a statistical test to
estimate whether expanding a node is likely to improve performance
beyond the training set.
● Minimum Description Length principle.
● Reduced-Error Pruning
− Split data into training and validation set
− Pruning of nodes continues until further pruning is harmful for
validation set
Pruning
• Pruning a decision node consists of
removing the subtree rooted at that node,
making it a leaf node, and
assigning it the most common classification of the training
examples affiliated with that node.
• Nodes are removed only if the pruned tree performs no
worse than the original over the validation set.
− A leaf node added due to coincidental regularities in the
training set is likely to be pruned, because the same
coincidences are unlikely to appear in the validation set.
• As pruning proceeds, the number of nodes is
reduced and accuracy over the test set increases.
Reduced-Error Pruning
• Nodes are pruned iteratively, always choosing
the node whose removal most increases the
decision tree accuracy over the validation set.
• Pruning of nodes continues until further pruning
is harmful (i.e., decreases accuracy of the tree
over the validation set).
• When data is limited, withholding part of it for
the validation set reduces even further the
number of examples available for training.
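The following Python sketch (not from the slides) implements a simplified,
bottom-up variant of reduced-error pruning over the nested-dict trees used in
the earlier sketches; the slides describe a best-first variant that always
removes the single most beneficial node first. It assumes the tree was grown
from the supplied training set and that attribute values appearing in the
validation set also appear in the training set:

from collections import Counter

def classify(tree, instance):
    # Follow branches until a leaf label is reached (as sketched earlier).
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][instance[attr]]
    return tree

def reduced_error_prune(tree, train, validation, target):
    # Prunes in place and returns the (possibly replaced) subtree.
    if not isinstance(tree, dict):
        return tree                                   # already a leaf
    attr = next(iter(tree))
    for value in tree[attr]:                          # prune each branch bottom-up,
        tree[attr][value] = reduced_error_prune(      # passing the examples reaching it
            tree[attr][value],
            [e for e in train if e[attr] == value],
            [e for e in validation if e[attr] == value],
            target)
    # Candidate leaf: most common training classification at this node.
    leaf = Counter(e[target] for e in train).most_common(1)[0][0]
    if not validation:
        return tree                                   # no validation evidence; keep subtree
    subtree_correct = sum(classify(tree, e) == e[target] for e in validation)
    leaf_correct = sum(e[target] == leaf for e in validation)
    # Replace the subtree with the leaf if that is no worse on the validation set.
    return leaf if leaf_correct >= subtree_correct else tree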
Rule Post-Pruning
● A variant of this method is used by the C4.5 learning algorithm.
● Steps of rule post-pruning:
− Infer the decision tree from the training set.
− Convert the learned tree into an equivalent set of rules by
creating one rule for each path from the root node to a leaf
node.
− Prune (generalize) each rule by removing any preconditions
whose removal improves its estimated accuracy.
− Sort the pruned rules by their estimated accuracy.
− Consider them in this sequence when classifying subsequent
instances.
Converting a Decision Tree to Rules

● R1: If (Outlook=Sunny)^(Humidity=High) Then PlayTennis=No


● R2: If (Outlook=Sunny)^(Humidity=Normal) Then PlayTennis=Yes
● R3: If (Outlook=Overcast) Then PlayTennis=Yes
● R4: If (Outlook=Rain)^(Wind=Strong) Then PlayTennis=No
● R5: If (Outlook=Rain)^(Wind=Weak) Then PlayTennis=Yes
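A small Python sketch (not from the slides) of this conversion for the
nested-dict representation used earlier, producing one rule per root-to-leaf
path; applied to the PlayTennis tree it yields rules equivalent to R1-R5:

def tree_to_rules(tree, preconditions=()):
    # One if-then rule per root-to-leaf path: (list of (attribute, value) tests, class label).
    if not isinstance(tree, dict):
        return [(list(preconditions), tree)]          # leaf: the rule is complete
    attr = next(iter(tree))
    rules = []
    for value, subtree in tree[attr].items():
        rules += tree_to_rules(subtree, preconditions + ((attr, value),))
    return rules

play_tennis_tree = {"Outlook": {
    "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}}}}

for conditions, label in tree_to_rules(play_tennis_tree):
    print(" ^ ".join(f"({a}={v})" for a, v in conditions), "->", f"PlayTennis={label}")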
Method to estimate rule accuracy
• One method is to use a validation set of examples
disjoint from the training set.
• Another method (used by C4.5) is to evaluate performance based on the
training set itself, using a pessimistic estimate:
− calculate the rule accuracy over the training examples to which it applies;
− calculate the standard deviation in this estimated accuracy, assuming a
binomial distribution;
− for a given confidence level, take the lower-bound estimate as the
measure of rule performance.

• Effect: for large data sets, the pessimistic
estimate is very close to the observed accuracy.
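C4.5's actual pessimistic estimate is more elaborate, but the idea on this
slide can be illustrated with a simple Python sketch (not from the slides):
the observed rule accuracy minus z standard deviations of a binomial
proportion. The numbers in the usage line are hypothetical:

import math

def pessimistic_accuracy(correct, covered, z=1.96):
    # Lower-bound estimate: observed accuracy minus z standard deviations
    # of a binomial proportion (z = 1.96 for roughly 95% confidence).
    acc = correct / covered
    return acc - z * math.sqrt(acc * (1 - acc) / covered)

# A rule that covers 40 training examples and classifies 36 correctly (hypothetical):
print(round(pessimistic_accuracy(36, 40), 3))   # observed accuracy 0.9, estimate 0.807

As the number of covered examples grows, the standard deviation shrinks, so the
pessimistic estimate approaches the observed accuracy, matching the effect
noted above.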
Continuous-Valued Attributes – Example

● Temperature: 40 48 60 72 80 90
● PlayTennis:  No No Yes Yes Yes No

● Two candidate thresholds: (48+60)/2 = 54 and (80+90)/2 = 85
● The new Boolean attributes: Temperature>54 and Temperature>85
● Use these new Boolean attributes in the same way as other
discrete-valued attributes.
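A short Python sketch (not from the slides) that generates these candidate
thresholds as midpoints between adjacent sorted values where the class label
changes:

def candidate_thresholds(values, labels):
    # Midpoints between adjacent sorted values whose class labels differ.
    pairs = sorted(zip(values, labels))
    return [(pairs[i][0] + pairs[i + 1][0]) / 2
            for i in range(len(pairs) - 1)
            if pairs[i][1] != pairs[i + 1][1]]

temperature = [40, 48, 60, 72, 80, 90]
play_tennis = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(candidate_thresholds(temperature, play_tennis))   # prints: [54.0, 85.0]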
Alternative Selection Measures

● The information gain measure favours attributes with many
values.
− For example, information gain would be very high for a Date attribute.
● The GainRatio measure is based on SplitInformation:
− SplitInformation(S,A) = -∑i=1..c ( |Si| / |S| ) log2( |Si| / |S| )
− GainRatio(S,A) = Gain(S,A) / SplitInformation(S,A)
● When |Si| ≈ |S| for one value (nearly all examples share that value),
SplitInformation is 0 or very small, and GainRatio becomes undefined
or very large.
− Heuristic: compute Gain first, and compute GainRatio only for
attributes whose Gain is large enough (above the average Gain).
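A Python sketch of these two formulas (not from the slides); the usage line
reuses the 8/6 value split and the Gain(S,Wind) = 0.048 figure from the earlier
attribute-selection slide:

import math

def split_information(value_counts):
    # SplitInformation(S,A) = -sum_i (|Si|/|S|) log2(|Si|/|S|),
    # where |Si| is the number of examples taking the i-th value of A.
    total = sum(value_counts)
    return sum(-(n / total) * math.log2(n / total) for n in value_counts if n > 0)

def gain_ratio(gain, value_counts):
    # GainRatio(S,A) = Gain(S,A) / SplitInformation(S,A); guard the near-zero case.
    si = split_information(value_counts)
    return gain / si if si > 0 else float("inf")

# Wind splits the 14 examples 8/6 with Gain(S,Wind) = 0.048 (figures from the slide above):
print(round(gain_ratio(0.048, [8, 6]), 3))   # prints: 0.049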
Other Types of Attributes

● Assume that an example (with classification c) in S has
a missing value for attribute A.
− Assign it the most common value of A among the examples in S,
or among the examples in S that have classification c.
− Or, assign a probability to each possible value of A.
● Measuring attributes by costs:
− prefer cheap attributes if possible;
− use costly ones only if they give good gain;
− no guarantee of finding the optimum, but the search is biased
towards cheaper attributes.
− Example cost-sensitive measures:
Gain(S,A)^2 / Cost(A)   or   (2^Gain(S,A) - 1) / (Cost(A)+1)^w,
where w ∈ [0,1] weights the importance of cost.
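A minimal Python sketch of the two cost-sensitive measures above (not from the
slides); the function names, gains, costs and weight w in the usage lines are
illustrative assumptions:

def gain_squared_over_cost(gain, cost):
    # Gain(S,A)^2 / Cost(A)
    return gain ** 2 / cost

def discounted_gain(gain, cost, w=0.5):
    # (2^Gain(S,A) - 1) / (Cost(A) + 1)^w, with w in [0, 1] weighting the cost.
    return (2 ** gain - 1) / (cost + 1) ** w

# Hypothetical comparison: a cheap attribute with modest gain vs. a costly one with higher gain.
print(gain_squared_over_cost(0.25, cost=1.0), gain_squared_over_cost(0.40, cost=10.0))
print(discounted_gain(0.25, cost=1.0), discounted_gain(0.40, cost=10.0))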
Main Points with Decision Tree
Learning
● DT learning provides a practical method for concept learning
and for learning other discrete-valued functions.
● DT are inferred by growing them from the root downward,
greedily selecting the next best attribute.
● Overfitting training data is an important issue in DT learning.
● Inductive bias in ID3 includes a preference for smaller trees.
● A large variety of extensions to the basic ID3 algorithm have
been developed.
● These extensions include methods for:
− post-pruning trees
− handling real-valued attributes
− accommodating examples with missing attribute values
− attribute selection using other than information gain
− considering costs associated with instance attributes
Thank You
