Decision Trees
Module-2
DECISION TREE LEARNING
2.1 INTRODUCTION
Decision tree learning is a method for approximating discrete-valued target functions, in which the
learned function is represented by a decision tree. Learned trees can also be re-represented as sets
of if-then rules to improve human readability. These learning methods are among the most popular
of inductive inference algorithms and have been successfully applied to a broad range of tasks from
learning to diagnose medical cases to learning to assess credit risk of loan applicants.
Figure 2.1 illustrates a typical learned decision tree. This decision tree classifies Saturday mornings
according to whether they are suitable for playing tennis.
Decision tree learning is generally best suited to problems with the following characteristics:
• Instances are represented by attribute-value pairs - Instances are described by a fixed set of
attributes (e.g., Temperature) and their values (e.g., Hot). The easiest situation for decision tree
learning is when each attribute takes on a small number of disjoint possible values (e.g., Hot, Mild,
Cold).
• The target function has discrete output values - The decision tree in Figure 2.1 assigns a boolean
classification (e.g., yes or no) to each example. Decision tree methods easily extend to learning
functions with more than two possible output values.
• Disjunctive descriptions may be required - Decision trees naturally represent disjunctive
expressions.
• The training data may contain errors - Decision tree learning methods are robust to errors, both
errors in classifications of the training examples and errors in the attribute values that describe these
examples.
• The training data may contain missing attribute values - Decision tree methods can be used
even when some training examples have unknown values (e.g., if the Humidity of the day is
known for only some of the training examples).
Decision tree learning has therefore been applied to problems such as learning to classify medical
patients by their disease, equipment malfunctions by their cause, and loan applicants by their
likelihood of defaulting on payments. Such problems, in which the task is to classify examples
into one of a discrete set of possible categories, are often referred to as classification problems.
The basic algorithm, ID3, learns decision trees by constructing them top-down, beginning with the
question "which attribute should be tested at the root of the tree?" To answer this question, each
instance attribute is evaluated using a statistical test to determine how well it alone classifies the
training examples. The best attribute is selected and used as the test at the root node of the tree. A
descendant of the root node is then created for each possible value of this attribute, and the training
examples are sorted to the appropriate descendant node (i.e., down the branch corresponding to the
example's value for this attribute). The entire process is then repeated using the training examples
associated with each descendant node to select the best attribute to test at that point in the tree. This
forms a greedy search for an acceptable decision tree, in which the algorithm never backtracks to
reconsider earlier choices.
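A minimal sketch of this top-down, greedy procedure in Python is shown below. The helper names
(entropy, information_gain, id3) and the dictionary-of-dictionaries tree representation are
illustrative choices, not part of any standard library; information gain, defined later in this module,
is used here as the statistical test.

import math
from collections import Counter

def entropy(examples, target):
    # Entropy of the target-value distribution over a list of examples (dicts).
    counts = Counter(ex[target] for ex in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, attribute, target):
    # Expected reduction in entropy from partitioning the examples on this attribute.
    total = len(examples)
    remainder = 0.0
    for value in {ex[attribute] for ex in examples}:
        subset = [ex for ex in examples if ex[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, attributes, target):
    # Grow the tree top-down; greedy choice of the best attribute, no backtracking.
    labels = {ex[target] for ex in examples}
    if len(labels) == 1:                 # all examples share one label: leaf node
        return labels.pop()
    if not attributes:                   # no attributes left: majority-label leaf
        return Counter(ex[target] for ex in examples).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}
    for value in {ex[best] for ex in examples}:
        subset = [ex for ex in examples if ex[best] == value]
        tree[best][value] = id3(subset, [a for a in attributes if a != best], target)
    return tree

For example, a call such as id3(training_examples, ["Outlook", "Temperature", "Humidity", "Wind"],
"PlayTennis") would return a nested dictionary whose top-level key is the attribute with the highest
information gain on the full training set.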
Given a collection S containing positive and negative examples of some target concept, the entropy
of S relative to this boolean classification is Entropy(S) = -p⊕ log₂(p⊕) - p⊖ log₂(p⊖), where p⊕
and p⊖ are the proportions of positive and negative examples in S. One interpretation of entropy
from information theory is that it specifies the minimum number of bits of information needed to
encode the classification of an arbitrary member of S (i.e., a member of S drawn at random with
uniform probability). For example, if p⊕ is 1, the receiver knows the drawn example will be positive,
so no message need be sent, and the entropy is zero.
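A small numeric illustration of this interpretation follows; the boolean_entropy helper is defined
only for this sketch and is not from the text.

import math

def boolean_entropy(p_pos):
    # Entropy of a boolean collection with proportion p_pos of positive examples.
    if p_pos in (0.0, 1.0):
        return 0.0                      # pure collection: zero bits needed
    p_neg = 1.0 - p_pos
    return -p_pos * math.log2(p_pos) - p_neg * math.log2(p_neg)

print(boolean_entropy(1.0))    # 0.0   -> all positive, no message needs to be sent
print(boolean_entropy(0.5))    # 1.0   -> evenly split, one full bit per example
print(boolean_entropy(9/14))   # ~0.940 -> e.g., a collection with 9 positive and 5 negative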
INFORMATION GAIN MEASURES THE EXPECTED REDUCTION IN ENTROPY
Given entropy as a measure of the impurity in a collection of training examples, we can
now define a measure of the effectiveness of an attribute in classifying the training data. The
measure we will use, called information gain, is simply the expected reduction in entropy caused
by partitioning the examples according to this attribute. More precisely, the information gain,
Gain(S, A), of an attribute A relative to a collection of examples S, is defined as
Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)
where Values(A) is the set of all possible values for attribute A, and S_v is the subset of S for which
attribute A has value v (i.e., S_v = {s ∈ S | A(s) = v}).
To illustrate the operation of ID3, consider the learning task represented by the training examples
of Table 3.2. Here the target attribute PlayTennis, which can have values yes or no for different
Saturday mornings, is to be predicted based on other attributes of the morning in question. Consider
the first step through the algorithm, in which the topmost node of the decision tree is created. Which
attribute should be tested first in the tree? ID3 determines the information gain for each candidate
attribute (i.e., Outlook, Temperature, Humidity, and Wind); the gain for two of these attributes
is shown in Figure 3.3.
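As a rough worked example of this first step, assume the standard 14-example PlayTennis data of
Table 3.2 (9 positive and 5 negative examples overall, with Wind = Weak covering 6 positive and 2
negative examples, and Wind = Strong covering 3 positive and 3 negative). The gain for Wind would
then be computed along these lines:
Entropy(S) = −(9/14) log₂(9/14) − (5/14) log₂(5/14) ≈ 0.940
Entropy(S_Weak) ≈ 0.811,  Entropy(S_Strong) = 1.000
Gain(S, Wind) = 0.940 − (8/14)(0.811) − (6/14)(1.000) ≈ 0.048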
• ID3's hypothesis space of all decision trees is a complete space of finite discrete-valued
functions, relative to the available attributes. Because every such function can be represented by
some decision tree, ID3 avoids one of the major risks of methods that search incomplete hypothesis
spaces: that the target function might not be contained in the hypothesis space at all.
• ID3 maintains only a single current hypothesis as it searches through the space of decision
trees.
• ID3 uses all training examples at each step in the search to make statistically based
decisions regarding how to refine its current hypothesis.
• ID3 can be easily extended to handle noisy training data by modifying its termination
criterion to accept hypotheses that imperfectly fit the training data.
Converting the learned tree to a set of if-then rules improves readability, since rules are often easier
for people to understand.
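For instance, assuming the standard PlayTennis tree of Figure 2.1 (Outlook at the root, with Humidity
tested below the Sunny branch), one path through the tree reads directly as the rule:
IF (Outlook = Sunny) AND (Humidity = Normal) THEN (PlayTennis = Yes)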