
Decision Trees:

Information Gain

These slides were assembled by Byron Boots, with grateful acknowledgement to Eric Eaton and the many
others who made their course materials freely available online. Feel free to reuse or adapt these slides for
your own academic purposes, provided that you include proper attribution.
Robot Image Credit: Viktoriya Sukhanova © 123RF.com
Last Time: Basic Algorithm for
Top-Down Learning of Decision Trees
[ID3, C4.5 by Quinlan]

node = root of decision tree


Main loop (each step is sketched in code below):
1. A ← the "best" decision attribute for the next node.
2. Assign A as decision attribute for node.
3. For each value of A, create a new descendant of node.
4. Sort training examples to leaf nodes.
5. If training examples are perfectly classified, stop. Else,
recurse over new leaf nodes.
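
A minimal Python sketch of this loop, assuming training examples are stored as (feature_dict, label) pairs; best_attribute is a placeholder for the attribute-scoring criterion developed on the following slides (names and structure are illustrative, not Quinlan's code):

from collections import Counter

def build_tree(examples, attributes, best_attribute):
    """Top-down decision-tree learning (ID3-style skeleton).

    examples:       list of (feature_dict, label) pairs
    attributes:     attribute names still available for splitting
    best_attribute: scoring function, e.g. information gain (defined later)
    """
    labels = [y for _, y in examples]
    # Stop when the node is pure or no attributes remain: return a leaf
    # labeled with the majority class.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]

    a = best_attribute(examples, attributes)           # 1. pick "best" attribute
    tree = {a: {}}                                     # 2. assign A to this node
    for v in {x[a] for x, _ in examples}:              # 3. one branch per value of A
        subset = [(x, y) for x, y in examples if x[a] == v]    # 4. sort examples
        rest = [b for b in attributes if b != a]
        tree[a][v] = build_tree(subset, rest, best_attribute)  # 5. recurse
    return tree

The recursion mirrors step 5: it stops once a node's examples are perfectly classified or no attributes remain.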

How do we choose which attribute is best?


Entropy

Entropy H(X) of a random variable X, where n is the number of possible values of X:

    H(X) = - Σ_{i=1}^{n} P(X = i) log2 P(X = i)

H(X) is the expected number of bits needed to encode a randomly drawn value of X (under the most efficient code).

Why? Information theory:
• The most efficient code assigns -log2 P(X = i) bits to encode the message X = i.
• So, the expected number of bits to encode one random X is:

    Σ_{i=1}^{n} P(X = i) (- log2 P(X = i)) = H(X)

Slide by Tom Mitchell
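
A direct translation of this definition into Python, for concreteness (a minimal sketch; it takes the distribution of X as a list of probabilities):

import math

def entropy(probs):
    """H(X) = -sum_i P(X=i) * log2 P(X=i), measured in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin takes 1 bit to encode on average; a fair 8-sided die takes 3 bits.
print(entropy([0.5, 0.5]))    # 1.0
print(entropy([1/8] * 8))     # 3.0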



2-Class Cases:

Entropy: H(X) = - Σ_{i=1}^{n} P(x = i) log2 P(x = i)

• What is the entropy of a group in which all examples belong to the same class?
  – entropy = - 1 log2 1 = 0   (minimum impurity)

• What is the entropy of a group with 50% in either class?
  – entropy = - 0.5 log2 0.5 - 0.5 log2 0.5 = 1   (maximum impurity)

Based on slide by Pedro Domingos
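
To see how impurity varies between these two extremes, here is a minimal sketch of the 2-class entropy as a function of the class-1 fraction p (the helper name binary_entropy is illustrative):

import math

def binary_entropy(p):
    """Entropy of a 2-class group in which a fraction p of examples is in class 1."""
    if p in (0.0, 1.0):
        return 0.0          # pure group: minimum impurity
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.3, 0.5):
    print(f"p={p:.1f}  H={binary_entropy(p):.3f}")
# rises from 0 bits for a pure group to 1 bit for a 50/50 split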
Sample Entropy

S is a sample of training examples; p+ is the proportion of positive examples in S and p- the proportion of negative examples. The entropy of S measures its impurity:

    H(S) = - p+ log2 p+ - p- log2 p-

Slide by Tom Mitchell


Information Gain
• We want to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned.

• Information gain tells us how important a given attribute of the feature vectors is.

• We will use it to decide the ordering of attributes in the nodes of a decision tree.

Based on slide by Pedro Domingos
From Entropy to Information Gain

Entropy H(X) of a random variable X:
    H(X) = - Σ_{i=1}^{n} P(X = i) log2 P(X = i)

Specific conditional entropy H(X|Y=v) of X given Y = v:
    H(X|Y=v) = - Σ_{i=1}^{n} P(X = i | Y = v) log2 P(X = i | Y = v)

Conditional entropy H(X|Y) of X given Y:
    H(X|Y) = Σ_{v ∈ values(Y)} P(Y = v) H(X|Y=v)

Mutual information (aka Information Gain) of X and Y:
    I(X, Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)

Slide by Tom Mitchell
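
These definitions translate directly into code; the sketch below estimates each quantity from paired observations of X and Y (function and variable names are illustrative, not from the slides):

import math
from collections import Counter

def entropy(values):
    """Empirical entropy H(X) of a list of observed values, in bits."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def conditional_entropy(xs, ys):
    """Empirical H(X|Y) = sum_v P(Y=v) * H(X|Y=v)."""
    n = len(ys)
    h = 0.0
    for v, c in Counter(ys).items():
        x_given_v = [x for x, y in zip(xs, ys) if y == v]   # specific H(X|Y=v)
        h += (c / n) * entropy(x_given_v)
    return h

def information_gain(xs, ys):
    """Mutual information I(X;Y) = H(X) - H(X|Y)."""
    return entropy(xs) - conditional_entropy(xs, ys)

# X is perfectly predicted by Y here, so I(X;Y) = H(X) = 1 bit.
xs = ["a", "a", "b", "b"]
ys = [0, 0, 1, 1]
print(information_gain(xs, ys))   # 1.0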



Information Gain

Information Gain is the mutual information between input attribute A and target variable Y.

Information Gain is the expected reduction in entropy of target variable Y for data sample S, due to sorting on variable A:

    Gain(S, A) = H_S(Y) - H_S(Y | A)

Slide by Tom Mitchell


Calculating Information Gain

Information Gain = entropy(parent) – [average entropy(children)]

Entire population (30 instances), split into a 17-instance child and a 13-instance child:

parent entropy = - (14/30) log2 (14/30) - (16/30) log2 (16/30) = 0.996

child 1 impurity (17 instances):
    entropy = - (13/17) log2 (13/17) - (4/17) log2 (4/17) = 0.787

child 2 impurity (13 instances):
    entropy = - (1/13) log2 (1/13) - (12/13) log2 (12/13) = 0.391

(Weighted) Average Entropy of Children = (17/30) × 0.787 + (13/30) × 0.391 = 0.615

Information Gain = 0.996 - 0.615 = 0.38
Based on slide by Pedro Domingos
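
The arithmetic above is easy to check in a few lines of Python (a minimal sketch; the class counts are taken from the slide):

import math

def entropy(counts):
    """Entropy (in bits) of a node, given its per-class example counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

parent = [14, 16]              # full population: 30 instances
children = [[13, 4], [1, 12]]  # the two child nodes: 17 and 13 instances

parent_h = entropy(parent)
avg_child_h = sum(sum(c) / sum(parent) * entropy(c) for c in children)
gain = parent_h - avg_child_h

print(f"parent={parent_h:.3f}  children={avg_child_h:.3f}  gain={gain:.2f}")
# parent=0.997  children=0.616  gain=0.38
# (the slide rounds the intermediate values to 0.996 and 0.615)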
Entropy-Based Automatic Decision Tree Construction

Training set X = {x1, x2, …, xn}, where each xi = (fi1, fi2, …, fim).
At Node 1 (the root of the tree): what feature should be used? What values?

Quinlan suggested information gain in his ID3 system.

Based on slide by Pedro Domingos
Using Information Gain to Construct a Decision Tree

• Choose the attribute A with the highest information gain for the full training set X at the root of the tree.
• Construct child nodes for each value v1, v2, …, vk of A. Each child has an associated subset of vectors X′ = {x ∈ X | value(A) = v} in which A has that particular value.
• Repeat recursively (see the sketch below). Till when?

Based on slide by Pedro Domingos
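
The attribute-selection step follows directly from this recipe; a minimal sketch (illustrative names, assuming the same (feature_dict, label) representation as the earlier skeleton, where best_attribute would be passed in as the scoring function):

import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute):
    """Reduction in label entropy from splitting `examples` on `attribute`."""
    labels = [y for _, y in examples]
    gain = entropy(labels)
    for v in {x[attribute] for x, _ in examples}:
        # X' = {x in X | value(A) = v}: the subset sent down the branch for value v
        branch = [y for x, y in examples if x[attribute] == v]
        gain -= (len(branch) / len(examples)) * entropy(branch)
    return gain

def best_attribute(examples, attributes):
    """Choose the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(examples, a))

As for "till when?": in the earlier skeleton the recursion stops when a node's examples all share one label, or when no attributes remain, at which point the node becomes a leaf with the majority label.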
Sample Dataset (was Tennis Played?)
• Columns denote features Xi
• Rows denote labeled instances ⟨xi, yi⟩
• Class label denotes whether a tennis game was played

Slide by Tom Mitchell
Which Tree Should We Output?
• ID3 performs a heuristic search through the space of decision trees.
• It stops at the smallest acceptable tree. Why?

Occam's razor: prefer the simplest hypothesis that fits the data.

Slide by Tom Mitchell
