Decision Tree Learning

Decision trees
• A method for approximating discrete-valued functions that is robust to noisy data and capable of learning disjunctive expressions
• Well-known algorithms and implementations: ID3 (Iterative Dichotomiser 3), ASSISTANT, C4.5, J48 (Weka)
• Decision trees learn a completely expressive hypothesis space
• The inductive bias is a preference for smaller trees over larger ones

Decision trees
• They classify instances by sorting them down the tree from the root to some leaf node
• The leaf node provides the classification of the instance
• Each node in the tree specifies a test of some attribute of the instance, and each branch descending from that node corresponds to one of the possible values for this attribute (see the code sketch after the algorithm slides below)

Decision trees
• Decision trees represent disjunctions of conjunctions of constraints on the attribute values of instances
• Each path from the root to a leaf corresponds to a conjunction of attribute tests
• The tree itself corresponds to a disjunction of these conjunctions

Decision trees
[Figure: a more complex decision tree]

Decision trees – appropriate problems
• Instances are described by attribute-value pairs
• A fixed set of attributes and their values
• The easiest case is when each attribute takes a small number of disjoint values
• However, real-valued attributes can be handled as well
• The target function has discrete output values
• Most simply, a boolean classification
• However, the method can be extended to handle multiple output values

Decision trees – appropriate problems
• Disjunctive descriptions may be required
• Decision trees naturally represent disjunctive expressions
• The training data may be noisy
• Decision tree learning is robust to errors
• Errors in the classifications of the training examples
• Errors in the attribute values
• The training data may contain missing attribute values
• i.e., some of the attribute values are missing for some examples

Decision trees – example problems
• Learning to classify medical patients by their disease
• Equipment malfunctions by their cause
• Loan applicants by their likelihood of defaulting on payments

Decision trees – the algorithm
• A top-down greedy search through the space of possible decision trees
• The algorithm constructs the tree by asking, “Which attribute should be tested at the root of the tree?”
• Each instance attribute is evaluated using a statistical test to determine how well it alone classifies the training examples

Decision trees – the algorithm
• A descendant of the root node is then created for each possible value of the chosen attribute
• The training examples are sorted to the appropriate descendant node (i.e., down the branch corresponding to the example’s value for this attribute)
• The entire process is then repeated using the training examples associated with each descendant node to select the best attribute to test at that point in the tree
• This is a greedy approach with no backtracking
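To make the representation and the sorting-down-the-tree procedure concrete, here is a minimal Python sketch. The nested-dictionary layout, the name classify, and the hard-coded example tree (the PlayTennis tree that this deck derives from the training data later on) are illustrative choices, not part of the slides.

# An internal node is {"attribute": ..., "branches": {value: subtree}}; a leaf is a class label.
play_tennis_tree = {
    "attribute": "Outlook",
    "branches": {
        "Sunny":    {"attribute": "Humidity", "branches": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain":     {"attribute": "Wind", "branches": {"Strong": "No", "Weak": "Yes"}},
    },
}

def classify(tree, instance):
    """Sort an instance down the tree from the root to a leaf and return the leaf's label."""
    while isinstance(tree, dict):               # internal node: test one attribute
        value = instance[tree["attribute"]]     # the instance's value for that attribute
        tree = tree["branches"][value]          # follow the matching branch
    return tree                                 # leaf: the classification

# A (Sunny, High humidity, Weak wind) day is sorted Outlook -> Humidity -> "No".
print(classify(play_tennis_tree, {"Outlook": "Sunny", "Humidity": "High", "Wind": "Weak"}))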
Which attribute is the best classifier?
• Selection of the best attribute at each node is the key problem to be solved while building the tree
• How do we measure the “goodness” of an attribute?
• Information gain: it measures how well a given attribute separates the training examples according to their target classification
• The algorithm uses it to select among the candidate attributes at each step while building the tree

Entropy
• Entropy measures the homogeneity of a set of examples
• Equivalently, the (im)purity of an arbitrary collection of examples
• Given a collection $S$, containing positive and negative examples of some target concept, the entropy of $S$ relative to this boolean classification is
  $Entropy(S) = -p_{\oplus} \log_2 p_{\oplus} - p_{\ominus} \log_2 p_{\ominus}$
• where $p_{\oplus}$ is the proportion of positive examples in $S$ and $p_{\ominus}$ is the proportion of negative examples in $S$
• $0 \log_2 0$ is defined as $0$

Entropy
• Suppose $S$ is a collection of 14 examples of some boolean concept, with 9 positive and 5 negative examples (denoted $[9+, 5-]$)
• The entropy of $S$ relative to this boolean classification is
  $Entropy([9+, 5-]) = -\tfrac{9}{14} \log_2 \tfrac{9}{14} - \tfrac{5}{14} \log_2 \tfrac{5}{14} = 0.940$

Entropy
• Entropy is $0$ if all members of $S$ belong to the same class, either positive or negative
• If all members are positive ($p_{\oplus} = 1$), then $p_{\ominus} = 0$ and $Entropy(S) = -1 \cdot \log_2 1 - 0 \cdot \log_2 0 = 0$
• Entropy is $1$ when the collection contains an equal number of positive and negative examples
• For an unequal number of positive and negative examples, the entropy lies between $0$ and $1$

Entropy
[Figure: the entropy function relative to a boolean classification (Tom Mitchell, p. 57, Figure 3.2)]
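A quick numeric check of the boolean entropy values above, as a sketch; the function name boolean_entropy is an illustrative choice.

import math

def boolean_entropy(num_pos, num_neg):
    """Entropy of a collection given its numbers of positive and negative examples."""
    total = num_pos + num_neg
    entropy = 0.0
    for count in (num_pos, num_neg):
        p = count / total
        if p > 0:                     # 0 * log2(0) is defined as 0
            entropy -= p * math.log2(p)
    return entropy

print(round(boolean_entropy(9, 5), 3))   # 0.94 for the [9+, 5-] collection
print(boolean_entropy(14, 0))            # 0.0  when all members belong to one class
print(boolean_entropy(7, 7))             # 1.0  for an equal split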
Entropy
• More generally, if the target attribute can take on $c$ different values,
  $Entropy(S) = \sum_{i=1}^{c} -p_i \log_2 p_i$
• where $p_i$ is the proportion of $S$ belonging to class $i$

Entropy
• Entropy specifies the minimum number of bits of information needed to encode the classification of an arbitrary member of $S$
• (i.e., a member of $S$ drawn at random with uniform probability)
• If $p_{\oplus}$ is $1$, the receiver knows the drawn example will be positive
• So no message need be sent
• The entropy is 0

Entropy
• If $p_{\oplus}$ is $0.5$, one bit is required to indicate whether the drawn example is positive or negative
• If $p_{\oplus}$ is $0.8$, then a collection of messages can be encoded using, on average, less than 1 bit per message
• By assigning shorter codes to the more likely positive examples
• And longer codes to the less likely negative examples

Information gain
• The expected reduction in entropy from choosing a particular attribute over the others at a particular node in the tree
• It measures how much impurity is removed from a set of examples if a particular attribute is tested at that node
• The aim is to reduce that impurity so that, for each value of the chosen attribute, the examples are (as far as possible) either all positive or all negative

Information gain
• The information gain, $Gain(S, A)$, of an attribute $A$ relative to a collection of examples $S$ is defined as
  $Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)$
• where $Values(A)$ is the set of all possible values for attribute $A$
• and $S_v$ is the subset of $S$ for which attribute $A$ has value $v$ (i.e., $S_v = \{ s \in S \mid A(s) = v \}$)

Information gain
• The first term is the entropy of the original collection $S$
• The second term is the expected value of the entropy after $S$ is partitioned using attribute $A$
• $Gain(S, A)$ is therefore the expected reduction in entropy caused by knowing the value of attribute $A$

Information gain
• The value of $Gain(S, A)$ is the number of bits saved when encoding the target value of an arbitrary member of $S$, by knowing the value of attribute $A$

Information gain (the PlayTennis training examples)

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No
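The definitions above translate directly into a short Python sketch. The tuple encoding of the table and the names EXAMPLES, ATTRIBUTES, entropy and information_gain are illustrative choices; entropy here implements the general multi-class formula.

import math
from collections import Counter

# The PlayTennis examples from the table above: (Outlook, Temperature, Humidity, Wind, PlayTennis)
EXAMPLES = [
    ("Sunny", "Hot", "High", "Weak", "No"),            # D1
    ("Sunny", "Hot", "High", "Strong", "No"),          # D2
    ("Overcast", "Hot", "High", "Weak", "Yes"),        # D3
    ("Rain", "Mild", "High", "Weak", "Yes"),           # D4
    ("Rain", "Cool", "Normal", "Weak", "Yes"),         # D5
    ("Rain", "Cool", "Normal", "Strong", "No"),        # D6
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),   # D7
    ("Sunny", "Mild", "High", "Weak", "No"),           # D8
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),        # D9
    ("Rain", "Mild", "Normal", "Weak", "Yes"),         # D10
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),      # D11
    ("Overcast", "Mild", "High", "Strong", "Yes"),     # D12
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),      # D13
    ("Rain", "Mild", "High", "Strong", "No"),          # D14
]
ATTRIBUTES = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Wind": 3}   # column indices

def entropy(examples):
    """Entropy(S) = sum over classes i of -p_i * log2(p_i), using the last field as the target."""
    total = len(examples)
    counts = Counter(example[-1] for example in examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, attribute):
    """Gain(S, A) = Entropy(S) - sum over values v of |S_v|/|S| * Entropy(S_v)."""
    index = ATTRIBUTES[attribute]
    total = len(examples)
    remainder = 0.0
    for value in {example[index] for example in examples}:
        subset = [example for example in examples if example[index] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(examples) - remainder

print(round(entropy(EXAMPLES), 3))   # 0.94, the [9+, 5-] entropy computed earlier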
Information gain
• There is not much reduction in entropy from using Wind as the test attribute (worked out below)

Information gain
• Information gain is the measure used by the ID3 algorithm to identify the best attribute at any particular node of the tree
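Concretely, for the Wind attribute and the 14 training examples in the table above (a worked instance of the gain formula, using the usual rounded entropy values):

$$Values(Wind) = \{Weak, Strong\}, \qquad S = [9+, 5-], \qquad S_{Weak} = [6+, 2-], \qquad S_{Strong} = [3+, 3-]$$

$$Gain(S, Wind) = Entropy(S) - \tfrac{8}{14}\,Entropy(S_{Weak}) - \tfrac{6}{14}\,Entropy(S_{Strong}) = 0.940 - \tfrac{8}{14}(0.811) - \tfrac{6}{14}(1.00) = 0.048$$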
[Figure: comparing candidate attributes by information gain (Tom Mitchell, p. 59, Figure 3.3)]
An illustrative example
• Given the PlayTennis training data above, which attribute should be tested at the root node of the decision tree?
• Outlook provides the best prediction of the target concept
• It reduces the entropy (the ambiguity about the classification) of the original set of examples by the largest amount
• Equivalently, it gives the greatest increase in certainty that a given subset of examples belongs to a particular class
• (The information-gain values behind this choice are sketched after the figure below)

An illustrative example
[Figure: the partially learned decision tree after the first step (Tom Mitchell, p. 61, Figure 3.4)]
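The root-node choice can be checked with the hypothetical information_gain sketch from earlier (reusing its EXAMPLES and ATTRIBUTES names):

for attribute in ATTRIBUTES:
    print(f"Gain(S, {attribute}) = {information_gain(EXAMPLES, attribute):.2f}")
# Gain(S, Outlook) = 0.25, Gain(S, Temperature) = 0.03,
# Gain(S, Humidity) = 0.15, Gain(S, Wind) = 0.05
# Outlook has the largest gain and is selected for the root node, consistent with
# Mitchell's reported values of 0.246, 0.029, 0.151 and 0.048.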
An illustrative example
• The process is repeated for each new descendant node, until either
• Every attribute has already been included along this path through the tree, or
• The training examples associated with this leaf node all have the same target attribute value (i.e., their entropy is zero)

ID3 – the algorithm
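The two stopping conditions above fit naturally into a recursive sketch. This is only a minimal illustration, not the full ID3 pseudocode; it reuses the hypothetical EXAMPLES, ATTRIBUTES and information_gain names from the earlier sketches and builds trees in the nested-dictionary form used by the classify sketch at the start of the section.

from collections import Counter

def id3(examples, attributes):
    """Grow a decision tree top-down, greedily choosing the highest-gain attribute at each node."""
    labels = [example[-1] for example in examples]
    if len(set(labels)) == 1:            # all examples share one target value (entropy is zero)
        return labels[0]
    if not attributes:                   # every attribute already used along this path
        return Counter(labels).most_common(1)[0][0]   # fall back to the most common value
    best = max(attributes, key=lambda a: information_gain(examples, a))
    branches = {}
    for value in {example[ATTRIBUTES[best]] for example in examples}:
        subset = [e for e in examples if e[ATTRIBUTES[best]] == value]
        branches[value] = id3(subset, [a for a in attributes if a != best])
    return {"attribute": best, "branches": branches}

print(id3(EXAMPLES, list(ATTRIBUTES)))   # reproduces the Outlook / Humidity / Wind tree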