ML UNIT 2: DECISION TREES AND THE ID3 ALGORITHM
OVERVIEW
• Introduction
• Decision Tree Representation
• Appropriate Problems for Decision Tree Learning
• Basic Decision Tree Learning Algorithm (ID3)
• Hypothesis Space Search in decision Tree Learning
• Inductive Bias in Decision Tree Learning
• Issues in Decision Tree Learning
INTRODUCTION
• Decision trees are one of the simplest and yet most useful Machine Learning methods.
• Decision trees, as the name implies, are trees of decisions.
• A decision tree can be used to visually and explicitly represent decisions
and decision making.
• Take, for example, the decision about what activity you should do this weekend.
It might depend on whether or not you feel like going out with your friends or spending the weekend alone; in both cases, your decision also depends on the weather.
If it's sunny and your friends are available, you may want to play soccer.
If it ends up raining, you'll go to a movie.
And if your friends don't show up at all, then you like playing video games no matter what the weather is like!
These decisions can be represented by a tree, as shown in the figure.
INTRODUCTION
• Decision tree learning is a method for approximating discrete-valued target functions, in which the learned function is represented by a decision tree.
• Learned trees can also be re-represented as sets of if-then rules to improve human
readability.
• These learning methods are among the most popular of inductive inference algorithms
and have been successfully applied to a broad range of tasks from learning to diagnose
medical cases to learning to assess credit risk of loan applicants.
• These decision tree learning methods search a completely expressive hypothesis space
and thus avoid the difficulties of restricted hypothesis spaces.
• Their inductive bias is a preference for small trees over large trees.
DECISION TREE REPRESENTATION
• A decision tree is drawn upside down with its root at the top.
• In the below figure, the bold text in black represents a condition/internal node, based on which the tree
splits into branches/ edges.
• The end of the branch that doesn’t split anymore is the decision/leaf, in this case, whether the passenger
died or survived, represented as red and green text respectively.
• Decision trees classify instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance.
• An instance is classified by starting at the root node, testing the attribute specified by that node, and then moving down the branch corresponding to the attribute's value in the instance. This process is then repeated for the subtree rooted at the new node.
DECISION TREE REPRESENTATION
• Take one more example. The below figure is a learned decision tree for the concept PlayTennis.
• An example is classified by sorting it through the tree to the appropriate leaf node, then returning the
classification associated with this leaf.
What is the classification for the below instance ? and
whether the day represented by the instance suits for playing
tennis or not ?
(Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong)
Based on the learned decision tree, the instance is classified as a negative instance (i.e., the tree predicts PlayTennis = No).
The tree corresponds to the following disjunction of conjunctions, i.e., the conditions under which PlayTennis = Yes:
(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)
APPROPRIATE PROBLEMS FOR DECISION TREE LEARNING
Decision tree learning is generally best suited to problems with the following
characteristics:
• Instances are represented by attribute-value pairs.
– Instances are described by a fixed set of attributes (e.g., Temperature) and their values
(e.g., Hot).
– The easiest situation for decision tree learning is when each attribute takes on a small
number of disjoint possible values (e.g., Hot, Mild, Cold).
• The target function has discrete output values.
– The decision tree assigns a Boolean classification (e.g., yes or no) to each example.
– Decision tree methods easily extend to learning functions with more than two possible
output values.
• Disjunctive descriptions may be required.
– As noted above, decision trees naturally represent disjunctive expressions.
BASIC DECISION TREE LEARNING ALGORITHM (ID3)
• We will define a statistical property, called information gain, that measures how well a given
attribute separates the training examples according to their target classification.
• ID3 uses this information gain measure to select among the candidate attributes at each step
while growing the tree.
• The ID3 algorithm, which stands for Iterative Dichotomiser 3, is a classification algorithm that follows a greedy approach to building a decision tree: at each step it selects the attribute that yields the maximum Information Gain (IG), or equivalently the minimum resulting entropy (H).
• It is named Iterative Dichotomiser because the algorithm iteratively (repeatedly) dichotomises (divides) the examples into two or more groups at each step.
DEFINITION: ENTROPY
• Entropy is a measure of the homogeneity (purity) of a collection of examples.
• Given a collection S containing positive and negative examples of some target concept, the entropy of S is given by
  Entropy(S) = -p+ log2(p+) - p- log2(p-)
• where p+ is the proportion of positive examples in S and p- is the proportion of negative examples in S.
• In all calculations involving entropy we define 0 · log2(0) = 0.
• In general, for a c-class classification,
  Entropy(S) = Σ(i = 1 to c) -pi log2(pi)
• where pi is the proportion of S belonging to class i.
ENTROPY - ILLUSTRATION
• Let S is a collection of 14 examples of some boolean concept
• Let 9 positive and 5 negative examples [9+, 5-]
• Then the entropy of S relative to this Boolean classification is
Entropy(9+,5-) = -(9/14)log2(9/14)-(5/14)log2(5/14)
= 0.940
• Figure: the entropy function relative to a Boolean classification, as the proportion p+ of positive examples varies between 0 and 1. Entropy is 0 when all examples belong to the same class and reaches its maximum of 1 when p+ = 0.5.
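These values are easy to check numerically. The short Python sketch below (not part of the original slides) computes the entropy of a collection from its class counts and reproduces Entropy([9+, 5-]) ≈ 0.940:

    import math

    def entropy(counts):
        """Entropy of a collection, given the number of examples in each class.
        Uses the convention 0 * log2(0) = 0."""
        total = sum(counts)
        result = 0.0
        for count in counts:
            if count > 0:
                p = count / total
                result -= p * math.log2(p)
        return result

    print(entropy([9, 5]))    # ~0.940  (the [9+, 5-] collection above)
    print(entropy([7, 7]))    # 1.0     (maximally impure)
    print(entropy([14, 0]))   # 0.0     (perfectly pure)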
DEFINITION: INFORMATION GAIN
• Information gain is the expected reduction in entropy caused by partitioning the examples according to some attribute A:
  Gain(S, A) = Entropy(S) - Σ(v ∈ Values(A)) (|Sv| / |S|) · Entropy(Sv)
• where Values(A) is the set of all possible values of attribute A and Sv is the subset of S for which A has value v.
• At each node, split on the attribute with the highest gain.
INFORMATION GAIN - ILLUSTRATION
• Consider the attribute Wind for the collection S = [9+, 5-] (Entropy(S) = 0.940):
  Values(Wind) = {Weak, Strong}
  S_Weak = [6+, 2-], Entropy(S_Weak) = 0.811
  S_Strong = [3+, 3-], Entropy(S_Strong) = 1.000
• Expected entropy after the split: (8/14)(0.811) + (6/14)(1.000) = 0.892
• Gain(S, Wind) = Entropy(S) - 0.892 = 0.940 - 0.892 = 0.048
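Continuing the same sketch, the gain can be computed from the class counts of the parent collection and of each partition; the helper below (again illustrative, reusing entropy() from the previous sketch) reproduces Gain(S, Wind) ≈ 0.048:

    def information_gain(parent_counts, partition_counts):
        """Gain(S, A) = Entropy(S) - sum over v of (|Sv| / |S|) * Entropy(Sv).
        parent_counts:    class counts of S, e.g. [9, 5]
        partition_counts: class counts of each subset Sv, e.g. [[6, 2], [3, 3]]"""
        total = sum(parent_counts)
        expected = sum(sum(sub) / total * entropy(sub) for sub in partition_counts)
        return entropy(parent_counts) - expected

    # Wind splits S = [9+, 5-] into S_Weak = [6+, 2-] and S_Strong = [3+, 3-]
    print(information_gain([9, 5], [[6, 2], [3, 3]]))   # ~0.048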
ID3 USING INFORMATION GAIN
SO FAR..
(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)
SO FAR..
Appropriate Problems For Decision Tree Learning
Instances are represented by attribute-value pairs.
The target function has discrete output values.
Disjunctive descriptions may be required.
The training data may contain errors.
The training data may contain missing attribute values.
SO FAR..
ID3 Algorithm
John Ross Quinlan
ID3 stands for Iterative Dichotomiser 3 and is named such because the
algorithm iteratively (repeatedly) dichotomises (divides) features into two or
more groups at each step.
SO FAR..
ID3 Algorithm
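The ID3 pseudocode figure referred to on this slide is not reproduced here. As a rough substitute, below is a minimal Python sketch of the recursion, assuming the training examples are dictionaries mapping attribute names (plus the target name) to values, and reusing the entropy()/information_gain() helpers from the earlier sketches; it illustrates the idea rather than the deck's exact pseudocode:

    from collections import Counter

    def id3(examples, target, attributes):
        """ID3 recursion sketch.
        examples:   list of dicts, e.g. {"Outlook": "Sunny", ..., "PlayTennis": "No"}
        target:     name of the target attribute, e.g. "PlayTennis"
        attributes: attribute names still available for splitting
        Returns a class label (leaf) or {"attribute": A, "branches": {value: subtree}}."""
        labels = [ex[target] for ex in examples]

        # Base case 1: every example has the same classification.
        if len(set(labels)) == 1:
            return labels[0]
        # Base case 2: no attributes left -> return the majority class.
        if not attributes:
            return Counter(labels).most_common(1)[0][0]

        parent_counts = list(Counter(labels).values())

        def gain(attr):
            # Group the target labels by the value of this attribute, then score the split.
            groups = {}
            for ex in examples:
                groups.setdefault(ex[attr], []).append(ex[target])
            return information_gain(
                parent_counts,
                [list(Counter(vals).values()) for vals in groups.values()],
            )

        best = max(attributes, key=gain)   # attribute with the highest information gain

        node = {"attribute": best, "branches": {}}
        remaining = [a for a in attributes if a != best]
        for value in set(ex[best] for ex in examples):
            subset = [ex for ex in examples if ex[best] == value]
            node["branches"][value] = id3(subset, target, remaining)
        return node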
EXAMPLE
S = (9+, 5-)
Attribute: Outlook
Values(Outlook) = {Sunny, Overcast, Rain}
Day  Outlook  PlayTennis
D1   Sunny    No
D2   Sunny    No
D8   Sunny    No
D9   Sunny    Yes
HYPOTHESIS SPACE SEARCH IN ID3
• ID3 can be characterized as searching a space of hypotheses for one
that fits the training examples.
• The hypothesis space searched by ID3 is the set of possible decision
trees.
• ID3 performs a simple-to-complex, hill-climbing search through this
hypothesis space.
• Beginning with the empty tree, then considering progressively more
elaborate hypotheses in search of a decision tree that correctly
classifies the training data.
• The evaluation function that guides this hill-climbing search is the
information gain measure.
HYPOTHESIS SPACE SEARCH IN ID3
• Hypothesis space:
– The hypothesis space searched by ID3 is the set of possible
decision trees.
– It is a complete space of finite discrete-valued functions,
relative to the available attributes.
• Search Method:
– ID3 performs a simple-to-complex, hill-climbing search.
• Evaluation Function: Information Gain
CAPABILITIES AND LIMITATIONS OF ID3
• ID3's hypothesis space of all decision trees is a complete space of finite discrete-valued functions, relative to the available attributes.
– Because every finite discrete-valued function can be represented by some decision
tree, ID3 avoids one of the major risks of methods that search incomplete hypothesis
spaces (such as version space methods that consider only conjunctive hypotheses):
that the hypothesis space might not contain the target function.
• ID3 maintains only a single current hypothesis as it searches through
the space of decision trees.
– This contrasts, for example, with the earlier version space candidate-elimination
method, which maintains the set of all hypotheses consistent with the available
training examples.
– However, by determining only a single hypothesis, ID3 loses the capabilities that
follow from explicitly representing all consistent hypotheses.
CAPABILITIES AND LIMITATIONS OF ID3
• ID3, in its pure form, performs no backtracking in its search (greedy
algorithm).
– Once it selects an attribute to test at a particular level in the tree, it never backtracks to
reconsider this choice; it is susceptible to the usual risks of hill-climbing search without
backtracking: converging to locally optimal solutions that are not globally optimal.
• ID3 uses all training examples at each step in the search to make
statistically based decisions regarding how to refine its current
hypothesis.
– This contrasts with methods that make decisions incrementally, based on individual training examples (e.g., the version space candidate-elimination method).
– One advantage of using statistical properties of all the examples is that the resulting
search is much less sensitive to errors in individual training examples. ID3 can be easily
extended to handle noisy training data by modifying its termination criterion to accept
hypotheses that imperfectly fit the training data.
INDUCTIVE BIAS
• Inductive bias is the set of assumptions that, together with the training data, deductively justify
the classifications assigned by the learner to future instances.
• Given a collection of training examples, there are typically many decision trees consistent with
these examples.
• Describing the inductive bias of ID3 therefore consists of describing the basis by which it
chooses one of these consistent hypotheses over the others.
– It chooses the first acceptable tree it encounters in its simple-to-complex, hill-climbing
search through the space of possible trees.
• The ID3 search strategy
(a) selects in favor of shorter trees over longer ones, and
(b) selects trees that place the attributes with highest information gain closest to the root.
INDUCTIVE BIAS
• Approximate inductive bias of ID3: Shorter trees are preferred over larger
trees.
– ID3 can be viewed as an efficient approximation to BFS-ID3, using a greedy heuristic search to attempt
to find the shortest tree without conducting the entire breadth-first search through the hypothesis
space.
– Because ID3 uses the information gain heuristic and a hill climbing strategy, it exhibits a more
complex bias than BFS-ID3.
– It is biased to favor trees that place attributes with high information gain closest to the root.
(Source: wikipedia)
UNDERFITTING
• Underfitting occurs when a statistical model cannot adequately
capture the underlying structure of the data.
Source: http://blog.algotrading101.com/design-theories/what-is-curve-fitting-overfitting-in-trading/
OVERFITTING
• The ID3 algorithm grows each branch of the tree just deeply enough to perfectly classify
the training examples but it can lead to difficulties when there is noise in the data, or when
the number of training examples is too small to produce a representative sample of the
true target function. This algorithm can produce trees that overfit the training examples.
Definition: Overfitting
• Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data, if
there exists some alternative hypothesis h' ∈ H, such that h has smaller error than h' over
the training examples, but h' has a smaller error than h over the entire distribution of
instances.
OVERFITTING
• The below figure illustrates the impact of overfitting in a typical application of decision tree
learning.
• The horizontal axis of this plot indicates the total number of nodes in the decision tree, as the tree
is being constructed. The vertical axis indicates the accuracy of predictions made by the tree.
• The solid line shows the accuracy of the decision tree over the training examples. The broken line shows accuracy measured over an independent set of test examples.
• The accuracy of the tree over the training examples increases monotonically as the tree is
grown.
• The accuracy measured over the independent test examples first increases, then decreases.
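The behaviour in this plot can be reproduced experimentally. The sketch below is only an illustration: it uses scikit-learn and a synthetic noisy dataset (neither appears in the original slides), grows trees of increasing size, and prints training vs. test accuracy; beyond some size the training accuracy keeps rising while the test accuracy typically stops improving or falls.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic binary classification data with label noise (flip_y), so that a
    # fully grown tree is forced to memorise noise in the training set.
    X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                               flip_y=0.15, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for max_nodes in (2, 4, 8, 16, 32, 64, 128):
        tree = DecisionTreeClassifier(max_leaf_nodes=max_nodes, random_state=0)
        tree.fit(X_train, y_train)
        print(max_nodes,
              round(tree.score(X_train, y_train), 3),   # training accuracy: rises monotonically
              round(tree.score(X_test, y_test), 3))     # test accuracy: typically peaks at a moderate size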
OVERFITTING
How can it be possible for tree h to fit the training examples better than h', but for it to perform more
poorly over subsequent examples?
Reasons for Overfitting:
1. Overfitting can occur when the training examples contain random errors or noise
2. When small numbers of examples are associated with leaf nodes.
• The additional line in the figure shows accuracy over the test examples as the tree is pruned. When
pruning begins, the tree is at its maximum size and lowest accuracy over the test set. As pruning
proceeds, the number of nodes is reduced and accuracy over the test set increases.
• The available data has been split into three subsets: the training examples, the validation
examples used for pruning the tree, and a set of test examples used to provide an unbiased estimate
of accuracy over future unseen examples.
• The plot shows accuracy over the training and test sets.
REDUCED-ERROR PRUNING
• In reduced-error pruning, each decision node is considered as a candidate for pruning. Pruning a node consists of removing the subtree rooted at that node, making it a leaf node, and assigning it the most common classification of the training examples affiliated with that node.
• Nodes are removed only if the resulting pruned tree performs no worse than the original over a separate validation set.
PROS AND CONS
Pros: Produces the smallest version of the most accurate subtree of T.
Cons:
Uses less data to construct T.
Can we afford to hold out Dvalidation? If not (i.e., the data is too limited), pruning may make the error worse (insufficient Dtrain).
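Below is a minimal sketch of the pruning loop, assuming the nested-dict tree representation from the earlier ID3 sketch. For brevity it uses a simplified, bottom-up variant in which each subtree is compared against a majority-class leaf on the validation examples that reach it (standard reduced-error pruning instead greedily removes, at each pass, the single node that most improves accuracy over the whole validation set):

    from collections import Counter

    def classify(tree, example, default=None):
        """Follow a nested-dict tree (as built by the id3 sketch) down to a leaf label."""
        while isinstance(tree, dict):
            value = example.get(tree["attribute"])
            if value not in tree["branches"]:
                return default
            tree = tree["branches"][value]
        return tree

    def prune(tree, validation, target):
        """Simplified bottom-up reduced-error pruning (modifies the tree in place).
        A subtree is replaced by a majority-class leaf whenever the leaf makes no more
        mistakes than the subtree on the validation examples that reach it."""
        if not isinstance(tree, dict) or not validation:
            return tree

        # 1. Prune the children first, each on the validation examples that reach them.
        attr = tree["attribute"]
        for value in list(tree["branches"]):
            reaching = [ex for ex in validation if ex.get(attr) == value]
            tree["branches"][value] = prune(tree["branches"][value], reaching, target)

        # 2. Compare this subtree against a single majority-class leaf.
        labels = [ex[target] for ex in validation]
        majority = Counter(labels).most_common(1)[0][0]
        leaf_errors = sum(label != majority for label in labels)
        tree_errors = sum(classify(tree, ex, default=majority) != ex[target]
                          for ex in validation)
        return majority if leaf_errors <= tree_errors else tree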
RULE POST-PRUNING
• Rule post-pruning is a successful method for finding high-accuracy hypotheses: the learned tree is converted into an equivalent set of rules (one rule per path from the root to a leaf), each rule is pruned by removing any precondition whose removal does not worsen its estimated accuracy, and the pruned rules are sorted by estimated accuracy and considered in that order when classifying new instances.
Advantages of converting the tree to rules before pruning:
• Converting to rules allows distinguishing among the different contexts in which a decision
node is used. Because each distinct path through the decision tree node produces a distinct
rule, the pruning decision regarding that attribute test can be made differently for each path.
• Converting to rules removes the distinction between attribute tests that occur near the root of the tree and those that occur near the leaves. This avoids messy bookkeeping issues such as how to reorganize the tree if the root node is pruned while retaining part of the subtree below this test.
• Converting to rules improves readability. Rules are often easier for people to understand.
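As an illustration of what the extracted rules look like, the PlayTennis tree described in the representation slides corresponds to one if-then rule per root-to-leaf path; the function below is only a sketch of that translation:

    def play_tennis(outlook, humidity, wind):
        """One if-then rule per root-to-leaf path of the PlayTennis tree (sketch)."""
        if outlook == "Sunny" and humidity == "High":
            return "No"
        if outlook == "Sunny" and humidity == "Normal":
            return "Yes"
        if outlook == "Overcast":
            return "Yes"
        if outlook == "Rain" and wind == "Strong":
            return "No"
        if outlook == "Rain" and wind == "Weak":
            return "Yes"
        return None   # attribute value not covered by the learned tree

    # The negative instance from the representation slides:
    print(play_tennis("Sunny", "High", "Strong"))   # -> "No"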
ISSUES IN DECISION TREE LEARNING
INCORPORATING CONTINUOUS-VALUED ATTRIBUTES
• Continuous-valued decision attributes can be incorporated into the learned tree.
• There are two methods for Handling Continuous Attributes
1. Define new discrete valued attributes that partition the continuous attribute value into a discrete
set of intervals.
E.g., {high ≡ Temp > 35º C, med ≡ 10º C < Temp ≤ 35º C, low ≡ Temp ≤ 10º C}
2. Using thresholds for splitting nodes
E.g., A ≤ a produces subsets A ≤ a and A > a
• What threshold-based Boolean attribute should be defined based on Temperature?
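The Temperature question above refers to a small table that is not reproduced here. The sketch below therefore uses hypothetical textbook-style values (an assumption) and shows the usual procedure: candidate thresholds are placed midway between adjacent sorted values whose labels differ, and the one with the highest information gain is chosen (reusing information_gain() from the earlier sketch):

    from collections import Counter

    def best_threshold(values, labels):
        """Choose a threshold for a continuous attribute by information gain.
        Candidate thresholds lie midway between adjacent sorted values whose labels differ."""
        pairs = sorted(zip(values, labels))
        parent_counts = list(Counter(labels).values())
        best = None
        for (v1, l1), (v2, l2) in zip(pairs, pairs[1:]):
            if l1 == l2:
                continue
            threshold = (v1 + v2) / 2
            below = [l for v, l in pairs if v <= threshold]
            above = [l for v, l in pairs if v > threshold]
            gain = information_gain(parent_counts,
                                    [list(Counter(below).values()),
                                     list(Counter(above).values())])
            if best is None or gain > best[1]:
                best = (threshold, gain)
        return best

    # Hypothetical Temperature column (the slide's original table is not shown here).
    temps  = [40, 48, 60, 72, 80, 90]
    tennis = ["No", "No", "Yes", "Yes", "Yes", "No"]
    print(best_threshold(temps, tennis))   # -> (54.0, ~0.459): split on Temperature > 54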
• A related issue is that information gain favours attributes with many values; the gain ratio measure penalizes such attributes by incorporating a term called split information, which is sensitive to how broadly and uniformly the attribute splits the data.
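The standard definitions (as in Quinlan's work and Mitchell's presentation) are SplitInformation(S, A) = -Σ(i = 1 to c) (|Si| / |S|) log2(|Si| / |S|), where S1 … Sc are the subsets produced by splitting S on the c values of A, and GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A). A small sketch, reusing the earlier helpers:

    def split_information(partition_sizes):
        """SplitInformation(S, A): the entropy of S with respect to the split itself
        (how broadly and uniformly A divides the data). Reuses entropy() from above."""
        return entropy(partition_sizes)

    def gain_ratio(parent_counts, partition_counts):
        """GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)."""
        sizes = [sum(sub) for sub in partition_counts]
        return information_gain(parent_counts, partition_counts) / split_information(sizes)

    # Wind example from the illustration: S = [9+, 5-] split into [6+, 2-] and [3+, 3-]
    print(gain_ratio([9, 5], [[6, 2], [3, 3]]))   # ~0.049 (gain 0.048 / split info ~0.985)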
Instance   Classification   a1   a2
1          +                T    T
2          +                T    T
3          –                T    F
4          +                F    F
5          –                F    T
6          –                F    T
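For this small training set the attribute-selection numbers follow directly from the table; the lines below compute them with the helpers sketched earlier (illustrative only):

    # Entropy of the full collection: 3 positive and 3 negative examples.
    print(entropy([3, 3]))                              # 1.0

    # a1 = T -> instances 1, 2, 3 = [2+, 1-];  a1 = F -> instances 4, 5, 6 = [1+, 2-]
    print(information_gain([3, 3], [[2, 1], [1, 2]]))   # ~0.082

    # a2 = T -> instances 1, 2, 5, 6 = [2+, 2-];  a2 = F -> instances 3, 4 = [1+, 1-]
    print(information_gain([3, 3], [[2, 2], [1, 1]]))   # 0.0, so a1 would be chosen first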
QUESTION ON DECISION TREES
9. Identify the entropy, information gain and draw the decision tree for the following
set of training examples
Gender | Car ownership | Travel cost | Income Level | Transportation (Class)
Instance | a1 | a2 | a3 | Classification