Decision Trees Notes
Decision Trees
We can describe this data set with the following decision tree:
[Figure: decision tree for the data set]
Decision Trees
All observations in the data set are perfectly described by the tree.
Question: How do we build such trees?
Entropy
The key to decision tree induction is the notion of entropy.
[Plot: entropy as a function of the proportion p+ of positive examples]
Observation: Entropy is at its maximum if we have a 50%-50% split among the positive
and negative examples.
Observation: Entropy is zero if we have all positive or all negative examples.
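A quick check of these two observations, using the entropy formula introduced on the next slide (with the usual convention 0 · log2(0) = 0):
Entropy at a 50%-50% split: −(1/2) log2(1/2) − (1/2) log2(1/2) = 1/2 + 1/2 = 1
Entropy with only positive examples: −1 · log2(1) − 0 · log2(0) = 0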
Entropy
We can apply entropy to measure the “randomness” of our data set.
Let
D = {(x1, y1), . . . , (xl, yl)} ⊆ A^n × {+1, −1}
and let l+ and l− denote the number of positive and negative examples in D, respectively, with l = l+ + l−. Then
Entropy(D) = −(l+/l) log2(l+/l) − (l−/l) log2(l−/l)
Now let p+ = l+/l and p− = l−/l; then
Entropy(D) = −p+ log2(p+) − p− log2(p−)
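As an illustration (not part of the original slides), here is a minimal Python sketch of this entropy computation; the function and variable names are my own:

import math

def entropy(l_pos, l_neg):
    # Entropy of a data set with l_pos positive and l_neg negative examples.
    l = l_pos + l_neg
    result = 0.0
    for count in (l_pos, l_neg):
        p = count / l
        if p > 0:                 # convention: 0 * log2(0) = 0
            result -= p * math.log2(p)
    return result

print(entropy(7, 7))   # 50%-50% split -> 1.0 (maximum)
print(entropy(14, 0))  # all positive  -> 0.0
print(entropy(9, 5))   # -> ~0.94, used for the Wind example below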
Information Gain
Def: We say that an attribute is informative if, when the training set is split according to its
attribute values, the overall entropy in the training data is reduced.
Example: Consider the attribute Ak = {v1, v2, v3}; then the split Dvi of D contains only those instances that have value vi for attribute Ak,
Dvi = {(x, y) ∈ D | xk = vi}
We can now split the data set D according to the values of attribute Ak, obtaining D = Dv1 ∪ Dv2 ∪ Dv3.
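A minimal Python sketch of this split, assuming each instance x is a tuple of attribute values and k indexes attribute Ak (names are my own):

def split(D, k):
    # Partition D = [(x, y), ...] by the value x[k] of attribute A_k.
    partitions = {}
    for x, y in D:
        partitions.setdefault(x[k], []).append((x, y))
    return partitions   # maps each value v_i to the subset D_{v_i}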
Information Gain
Rather than using the arithmetic mean, we use the weighted mean,
Entropy(Ak) = Σ_{vi ∈ Ak} (|Dvi|/|D|) · Entropy(Dvi)
or
Gain(D, Ak) = Entropy(D) − Σ_{vi ∈ Ak} (|Dvi|/|D|) · Entropy(Dvi)
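A small Python sketch of the gain computation, building on the hypothetical entropy(...) and split(...) helpers above and assuming labels are +1/−1:

def gain(D, k):
    # Information gain of splitting data set D on attribute A_k.
    def H(subset):
        pos = sum(1 for _, y in subset if y == +1)
        return entropy(pos, len(subset) - pos)
    weighted = sum(len(Dv) / len(D) * H(Dv) for Dv in split(D, k).values())
    return H(D) - weighted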
Information Gain
We can now use the gain to build a decision tree top-down (greedy heuristic).
Example: Consider our tennis data set with the attribute Wind = {Weak, Strong}.
Then
D = [9+, 5−]
DWeak = [6+, 2−]
DStrong = [3+, 3−]
Finally,
Gain(D, Wind) = Entropy(D) − Σ_{vi ∈ Wind} (|Dvi|/|D|) · Entropy(Dvi)
             = .94 − (8/14) · .811 − (6/14) · 1
             = .048
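These numbers can be checked with the entropy(...) helper sketched earlier (an illustration, not from the slides):

H_D      = entropy(9, 5)   # ~0.940
H_weak   = entropy(6, 2)   # ~0.811
H_strong = entropy(3, 3)   #  1.000
print(round(H_D - (8/14) * H_weak - (6/14) * H_strong, 3))   # 0.048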
Information Gain
Similarly, we compute the gains for Outlook, Humidity, and Temp; the attribute with the largest gain is the most informative and is chosen as the root of the tree, after which the procedure is repeated on each split (see the sketch below).
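To tie the pieces together, here is a minimal ID3-style sketch of the greedy top-down procedure; it relies on the hypothetical gain(...) and split(...) helpers above and is not taken from the original slides:

def build_tree(D, attributes):
    # Greedy top-down decision tree induction (ID3-style sketch).
    labels = {y for _, y in D}
    if len(labels) == 1:            # pure node: all examples share one label
        return labels.pop()
    if not attributes:              # no attributes left: return the majority label
        return max(labels, key=lambda c: sum(1 for _, y in D if y == c))
    best = max(attributes, key=lambda k: gain(D, k))   # most informative attribute
    node = {}
    for value, Dv in split(D, best).items():
        node[(best, value)] = build_tree(Dv, [k for k in attributes if k != best])
    return node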