Decision Trees: Classifier
[Figure: a classifier takes an input X1 = x1, ..., XM = xM and outputs a class prediction Y = y; the classifier is learned from training data.]
Three variables: Hair (B or D), Height (T or S), and the output class (P or G).
Training data: (B,T,P), (B,T,P), (B,S,G), (D,S,G), (D,T,G), (B,S,G)
[Figure: the decision tree built from this data. Root node: P:2 G:4. Hair = D? → P:0 G:2 (pure: G is the output for this node). Hair = B? → P:2 G:2, which is split again: Height = T? → P:2 G:0; Height = S? → P:0 G:2.]
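As an illustration, the class counts in this figure can be reproduced with a few lines of Python (the tuple representation of the data is our own):

```python
from collections import Counter

# (Hair, Height, class) triples from the training data above
data = [("B", "T", "P"), ("B", "T", "P"), ("B", "S", "G"),
        ("D", "S", "G"), ("D", "T", "G"), ("B", "S", "G")]

# class counts after splitting on Hair
print(Counter(c for h, _, c in data if h == "D"))   # Counter({'G': 2})          -> pure node
print(Counter(c for h, _, c in data if h == "B"))   # Counter({'P': 2, 'G': 2})  -> mixed node

# splitting the mixed node further on Height
print(Counter(c for h, ht, c in data if h == "B" and ht == "T"))  # Counter({'P': 2})
print(Counter(c for h, ht, c in data if h == "B" and ht == "S"))  # Counter({'G': 2})
```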
[Figure: the training data is a table of R examples (Data 1, ..., Data R), each with attribute values x1, ..., xM and a class label Y. A new input datapoint x1, ..., xM with unknown class is passed down the tree: internal nodes test discrete attributes ("X1 = nth possible value for X1?", "Xj = ith possible value for Xj?") and leaves output a class (Y = y1, ..., Y = yc).]
[Figure: a two-class dataset in the (X1, X2) plane with decision boundaries at X1 = 0.5 and X2 = 0.5, and the corresponding tree: the root tests "X1 < 0.5?", a child tests "X2 < 0.5?", and each node is annotated with its class counts.]
[Figure: the same structure for continuous attributes. The training data is again R examples with values x1, ..., xM and class Y; internal nodes now test thresholds ("Xj < tj?") and leaves output a class (Y = y1, ..., Y = yc).]
Basic Questions
How to choose the attribute/value to split on at each level of the tree?
When to stop splitting? When should a node be declared a leaf?
If a leaf node is impure, how should the class label be assigned?
If the tree is too large, how can it be pruned?
[Figure: example splits labeled Good and Bad. A good split yields a node that is pure (only one class left, so no ambiguity in the class label) or almost pure (little ambiguity in the class label); a bad split leaves the classes mixed.]
Suppose that we are dealing with data which can come from four possible values (A, B, C, D). Each class may appear with some probability.
Suppose P(A) = P(B) = P(C) = P(D) = 1/4.
What is the average number of bits necessary to encode each class?
With the code A = 00, B = 01, C = 10, D = 11:
average = 2×P(A) + 2×P(B) + 2×P(C) + 2×P(D) = 2 bits
The distribution is not very informative (impure).
[Figure: histogram of frequency of occurrence per class, uniform over A, B, C, D.]
Information Content
Suppose now P(A) = 1/2, P(B) = 1/4, P(C) = 1/8, P(D) = 1/8.
What is the average number of bits necessary to encode each class?
With the code A = 0, B = 10, C = 110, D = 111, the classes can be encoded using 1.75 bits on average:
average = 1×P(A) + 2×P(B) + 3×P(C) + 3×P(D) = 1.75
The distribution is more informative (higher purity).
[Figure: histogram of frequency of occurrence per class, skewed toward A.]
Entropy
In general, the average number of bits necessary to encode n values is the entropy:
H = -Σ_{i=1..n} Pi log2 Pi
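A minimal sketch of this formula in Python (assuming NumPy is available; the helper name `entropy` is our own), which also reproduces the 2-bit and 1.75-bit examples above:

```python
import numpy as np

def entropy(probs):
    """Entropy H = -sum_i p_i * log2(p_i), treating 0 * log2(0) as 0."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

print(entropy([0.25, 0.25, 0.25, 0.25]))     # 2.0 bits  (uniform distribution)
print(entropy([0.5, 0.25, 0.125, 0.125]))    # 1.75 bits (skewed distribution)
```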
Entropy
The entropy captures the degree of purity of the distribution.
[Figure: two histograms of frequency of occurrence per class: a flat distribution has high entropy; a peaked distribution has low entropy.]
Node (1): NA = 1, NB = 6, where NA and NB are the frequencies of occurrence of classes A and B in node (1).
pA = NA/(NA+NB) = 1/7, pB = NB/(NA+NB) = 6/7
Entropy of node (1): H1 = -pA log2 pA - pB log2 pB = 0.59

Node (2): NA = 3, NB = 2
pA = NA/(NA+NB) = 3/5, pB = NB/(NA+NB) = 2/5
Entropy of node (2): H2 = -pA log2 pA - pB log2 pB = 0.97
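A quick check of these numbers, reusing the `entropy` sketch from above:

```python
print(round(entropy([1/7, 6/7]), 2))   # 0.59 -> node (1) is fairly pure
print(round(entropy([3/5, 2/5]), 2))   # 0.97 -> node (2) is close to maximally impure
```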
Conditional Entropy
Entropy before splitting: H
After splitting, a fraction PL of the data goes to the left child, which has entropy HL, and a fraction PR goes to the right child, which has entropy HR.
Entropy after splitting (conditional entropy): HL × PL + HR × PR
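A one-line helper matching this definition (a sketch; the function name is our own):

```python
def conditional_entropy(p_left, h_left, p_right, h_right):
    """Entropy after a binary split: the weighted average of the child entropies."""
    return p_left * h_left + p_right * h_right
```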
Information Gain
[Figure: a split sending a fraction PL of the datapoints to a left child with entropy HL and a fraction PR to a right child with entropy HR.]
Notations
Entropy: H(Y) = entropy of the distribution of classes at a node
Conditional entropy:
Discrete: H(Y|Xj) = entropy after splitting with respect to variable j
Continuous: H(Y|Xj,t) = entropy after splitting with respect to variable j with threshold t
Information gain:
Discrete: IG(Y|Xj) = H(Y) - H(Y|Xj) = reduction in entropy after splitting with respect to variable j
Continuous: IG(Y|Xj,t) = H(Y) - H(Y|Xj,t) = reduction in entropy after splitting with respect to variable j with threshold t
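A minimal sketch of IG(Y|Xj,t) for a continuous split, building on the `entropy` and `conditional_entropy` helpers above (the column index j, threshold t, and the empirical class-count logic are our own illustrative choices):

```python
import numpy as np

def class_probs(y):
    """Empirical class probabilities at a node."""
    _, counts = np.unique(y, return_counts=True)
    return counts / counts.sum()

def information_gain(X, y, j, t):
    """IG(Y|Xj,t) = H(Y) - [PL * H(YL) + PR * H(YR)] for the split Xj < t."""
    left = X[:, j] < t
    if left.all() or not left.any():
        return 0.0                                # degenerate split: nothing changes
    p_left = left.mean()
    h_after = conditional_entropy(p_left, entropy(class_probs(y[left])),
                                  1 - p_left, entropy(class_probs(y[~left])))
    return entropy(class_probs(y)) - h_after
```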
Information Gain
[Figure: two candidate splits of the same node, which contains 11 datapoints and has entropy H = 0.99. Each child is annotated with its fraction of the data (PL or PR) and its entropy (HL or HR).]
First split (4 datapoints left, 7 right): IG = H - (HL × 4/11 + HR × 7/11); here HL = 0 because the left child is pure.
Second split (5 datapoints left, 6 right): IG = H - (HL × 5/11 + HR × 6/11)
[Figure: the same two candidate splits with the resulting numbers. First split: H = 0.99, HL = 0, IG = 0.62. Second split: H = 0.99, IG = 0.052. The first split yields the larger information gain.]
[Figure: IG as a function of the X1 split value.]
Best split value (max Information Gain) for the X1 attribute: 0.24, with IG = 0.138
[Figure: IG as a function of the X2 split value, and the data in the (X1, X2) plane.]
Best split value (max Information Gain) for the X2 attribute: 0.234, with IG = 0.202
[Figure: IG as a function of the X1 split value for the current node.]
Best split value (max Information Gain) for the X1 attribute: 0.22, with IG ≈ 0.182
[Figure: IG as a function of the X2 split value, and the data in the (X1, X2) plane.]
Best split value (max Information Gain) for the X2 attribute: 0.75, with IG ≈ 0.353
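The curves above come from sweeping the candidate threshold and evaluating IG at each value. A sketch of that sweep for a single attribute, reusing `information_gain` (and NumPy) from above; the helper name `best_threshold` and the choice of candidate thresholds are our own:

```python
def best_threshold(X, y, j):
    """Sweep candidate thresholds for attribute j and keep the one with maximum IG."""
    candidates = np.unique(X[:, j])               # thresholds taken from the observed values
    gains = [information_gain(X, y, j, t) for t in candidates]
    k = int(np.argmax(gains))
    return candidates[k], gains[k]
```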
Best X1 split: 0.22, IG = 0.182; best X2 split: 0.75, IG = 0.353.
There is no point in splitting this node further since it contains only data from a single class: return it as a leaf node with output A.
[Figure: the (X1, X2) plane with the regions produced by the splits; the pure regions are labeled A.]
Basic Questions
How to choose the attribute/value to split on at each level of the tree?
When to stop splitting? When should a node be declared a leaf?
If a leaf node is impure, how should the class label be assigned?
If the tree is too large, how can it be pruned?
Try all the possible attributes Xj and thresholds t and choose the pair (j*, t*) for which IG(Y|Xj,t) is maximum.
XL, YL = the set of datapoints for which xj* < t*, and their corresponding classes
XH, YH = the set of datapoints for which xj* >= t*, and their corresponding classes
Left Child = LearnTree(XL, YL)
Right Child = LearnTree(XH, YH)
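A compact recursive sketch of this procedure in Python, reusing `best_threshold` and the other helpers above. The node representation, the function name `learn_tree`, and the stopping rule (stop when the node is pure or no split reduces the entropy) are our own simplifications:

```python
import numpy as np

def learn_tree(X, y):
    """Recursively grow the tree; a node becomes a leaf when it is pure or no split helps."""
    if len(np.unique(y)) == 1:                        # single class left: leaf node
        return {"leaf": True, "output": y[0]}
    # try every attribute and keep (j*, t*) with maximum information gain
    splits = [(j, *best_threshold(X, y, j)) for j in range(X.shape[1])]
    j_star, t_star, ig = max(splits, key=lambda s: s[2])
    if ig <= 0:                                       # no split reduces the entropy: leaf
        vals, counts = np.unique(y, return_counts=True)
        return {"leaf": True, "output": vals[np.argmax(counts)]}
    lo = X[:, j_star] < t_star                        # XL, YL / XH, YH
    return {"leaf": False, "attribute": j_star, "threshold": t_star,
            "left": learn_tree(X[lo], y[lo]),
            "right": learn_tree(X[~lo], y[~lo])}
```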