Decision Trees Notes

This document discusses how decision trees are built using information gain and entropy. It explains that entropy measures the randomness or purity in a dataset, reaching its maximum when the class is an even 50-50 split and its minimum when all examples are of one class. Information gain is used to select the most informative attribute to split on at each node - the attribute that reduces entropy the most when the dataset is split on its values. The attribute with highest information gain becomes the root node, and the process continues recursively on its subsets to build the tree top-down using a greedy heuristic.

Decision Trees

Consider this binary classification data set:

Decision Trees
We can describe this data set with the following decision tree:

Decision Trees

All observations in the data set are perfectly described by the tree.
Question: How do we build such trees?

Entropy
The key to decision tree induction is the notion of entropy,

Entropy ≡ measure of randomness

[Figure: entropy plotted as a function of p+]

Observation: Entropy is at its maximum if we have a 50%-50% split among the positive
and negative examples.
Observation: Entropy is zero if we have all positive or all negative examples.

Entropy
We can apply entropy to measure the “randomness” of our data set.
Let

D = {(x1, y1), . . . , (xl, yl)} ⊆ A^n × {+1, −1}

and

l+ = |{(x, y) ∈ D | y = +1}|
l− = |{(x, y) ∈ D | y = −1}|

then

Entropy(D) = −(l+/l) log2(l+/l) − (l−/l) log2(l−/l)

Now let p+ = l+/l and p− = l−/l; then

Entropy(D) = −p+ log2(p+) − p− log2(p−)
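
As a quick, informal check of this formula (not part of the original slides), here is a minimal Python sketch of binary entropy; the function name entropy and the sample fractions are my own choices.

import math

def entropy(p_pos):
    """Binary entropy of a data set whose positive-class fraction is p_pos."""
    h = 0.0
    for p in (p_pos, 1.0 - p_pos):
        if p > 0:                      # convention: 0 * log2(0) = 0
            h -= p * math.log2(p)
    return h

print(entropy(0.5))    # 1.0   -> maximum at a 50%-50% split
print(entropy(1.0))    # 0.0   -> all positive (or all negative) examples
print(entropy(9/14))   # ~0.94 -> the tennis data set D = [9+, 5-] used below

The first two calls mirror the observations on the previous slide: a 50%-50% split maximizes entropy, and a pure set has zero entropy.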

Information Gain
Def: We say that an attribute is informative if, when the training set is split according to its
attribute values, the overall entropy in the training data is reduced.

Example: Consider the attribute Ak = {v1, v2, v3}; then the split Dvi of D contains only the instances that have value vi for attribute Ak,

Dvi = {(x, y) ∈ D | xk = vi}

We can now split the data set D according to the values of attribute Ak.

If Entropy(Ak) < Entropy(D), then attribute Ak is informative.
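
As an illustration (not from the slides), a small Python sketch of this split; it assumes each example is a pair (x, y) with x a tuple of attribute values indexed by k:

from collections import defaultdict

def split_by_attribute(D, k):
    """Partition D into the subsets D_vi, one per value v_i of attribute A_k."""
    splits = defaultdict(list)
    for x, y in D:
        splits[x[k]].append((x, y))    # D_vi = {(x, y) in D | x_k = v_i}
    return splits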

Information Gain
Rather than the plain arithmetic mean of the split entropies, we use their weighted mean,

Entropy(Ak) = Σ_{vi ∈ Ak} (|Dvi| / |D|) · Entropy(Dvi)

Formally, we define information gain as

Gain(D, Ak) = Entropy(D) − Entropy(Ak)

or

Gain(D, Ak) = Entropy(D) − Σ_{vi ∈ Ak} (|Dvi| / |D|) · Entropy(Dvi)

⇒ The larger the difference, the more informative the attribute!
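
Putting the two formulas together, here is a sketch of the gain computation in Python (my own code, reusing the entropy and split_by_attribute sketches above; labels are assumed to be +1/−1 as in the definition of D):

def subset_entropy(S):
    """Entropy of a labelled subset S, computed from its class counts."""
    if not S:
        return 0.0
    l_pos = sum(1 for _, y in S if y == +1)
    return entropy(l_pos / len(S))

def gain(D, k):
    """Information gain of attribute A_k (attribute index k) on data set D."""
    weighted = sum(len(Dv) / len(D) * subset_entropy(Dv)
                   for Dv in split_by_attribute(D, k).values())
    return subset_entropy(D) - weighted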

Information Gain
We can now use the gain to build a decision tree top-down (greedy heuristic).
Example: Consider our tennis data set with

Wind = {Weak, Strong}

Then

D = [9+, 5−]
DWeak = [6+, 2−]
DStrong = [3+, 3−]

Finally,
Gain(D, Wind) = Entropy(D) − Σ_{vi ∈ Wind} (|Dvi| / |D|) · Entropy(Dvi)

              = .94 − (8/14) · .811 − (6/14) · 1
              = .048
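
This number can be reproduced directly from the class counts, using the entropy sketch from above (a standalone check, not part of the slides):

# D = [9+, 5-], D_Weak = [6+, 2-], D_Strong = [3+, 3-]
gain_wind = entropy(9/14) - (8/14) * entropy(6/8) - (6/14) * entropy(3/6)
print(round(gain_wind, 3))    # 0.048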

Information Gain
Similarly, for Outlook, Humidity, and Temp,

Gain(D, Outlook) = .246
Gain(D, Humidity) = .151
Gain(D, Temp) = .029

⇒ This means that Outlook will become our root node.
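
In code, the greedy root choice is simply an argmax over the gains computed above (a sketch; the attribute names and numbers are taken from the slides):

gains = {"Outlook": 0.246, "Humidity": 0.151, "Wind": 0.048, "Temp": 0.029}
root = max(gains, key=gains.get)    # greedy choice: attribute with highest gain
print(root)                         # Outlook

The same gain criterion is then applied recursively to the subsets of D induced by the values of Outlook, which is the top-down greedy heuristic described above.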

