Decision Tree Learning - Classification

Outline:
◦ Classification: Decision Tree
◦ Distance Measures and Cosine Similarity
◦ Model Evaluation
Decision Tree Induction: An Example
Training data set: buys_computer

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no

Resulting tree:

                      age?
           /           |           \
        <=30         31..40        >40
          |            |             |
      student?        yes     credit_rating?
       /     \                  /        \
      no     yes           excellent    fair
      |       |                |          |
      no     yes               no        yes
Algorithm for Decision Tree Induction
Basic strategy (greedy, top-down, divide-and-conquer):
◦ At the start, all training samples are at the root.
◦ Select the attribute that best classifies the samples and partition them on its values.
◦ Recurse on each partition; stop when one of the conditions listed later is met.
(Here the target is two-valued, i.e. m = 2 classes: buys_computer = yes / no.)
Question?
How do you determine which attribute best classifies the data?
Answer: Entropy!
Information gain:
◦ A statistical quantity that measures how well an attribute classifies the data.
◦ Calculate the information gain for each attribute.
◦ Choose attribute with greatest information gain.
But how do you measure information?
◦ Claude Shannon, working at Bell Labs, established the field of information theory in 1948.
Boolean functions with the same number of ones and zeros have the largest entropy.
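To make this concrete, here is a minimal Python sketch (the function name and test values are mine, not the slides') of two-class entropy; as stated above, it is maximal when the counts of ones and zeros are equal:

```python
import math

def entropy(pos: int, neg: int) -> float:
    """Shannon entropy H = -p*log2(p) - q*log2(q) for a two-class sample."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:                 # treat 0 * log2(0) as 0
            p = count / total
            h -= p * math.log2(p)
    return h

print(entropy(7, 7))   # 1.0   -> equal ones and zeros: maximum entropy
print(entropy(14, 0))  # 0.0   -> pure sample: no uncertainty
print(entropy(9, 5))   # 0.940 -> the Info(D) used in the example below
```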
(Back to the story of ID3)
Information gain is our metric for how well one attribute A_i classifies the training data.
The information gain of an attribute is the information gained about the target function when the value of that attribute is known: the entropy of the sample minus the conditional entropy given the attribute.

$$\mathrm{Gain}(S, A_i) = H(S) \;-\; \sum_{v \in \mathrm{Values}(A_i)} P(A_i = v)\, H(S_v)$$

where $H(S)$ is the entropy of the whole sample $S$ and $H(S_v)$ is the entropy of the subset of $S$ for which attribute $A_i$ takes value $v$.
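A minimal Python sketch of this definition (the helpers entropy_of and information_gain are my names; attributes are addressed by column index), usable for any number of classes:

```python
from collections import Counter
import math

def entropy_of(labels):
    """H(S) from a list of class labels (any number of classes)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Gain(S, A_i) = H(S) - sum over v of P(A_i = v) * H(S_v)."""
    n = len(rows)
    partitions = {}                         # value v -> labels of subset S_v
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(p) / n * entropy_of(p) for p in partitions.values())
    return entropy_of(labels) - remainder
```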
ID3 algorithm (boolean-valued function)
$$\mathrm{Info}(D) = I(9,5) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} = 0.940$$

$$\mathrm{Info}_{\mathit{age}}(D) = \tfrac{5}{14}\,I(2,3) + \tfrac{4}{14}\,I(4,0) + \tfrac{5}{14}\,I(3,2) = 0.694$$

Here $\tfrac{5}{14}\,I(2,3)$ means the branch "age <= 30" holds 5 of the 14 samples, with 2 yes's and 3 no's. (On a calculator: $\log_2 x = \log_{10} x / \log_{10} 2$.)

$$\mathrm{Gain}(\mathit{age}) = \mathrm{Info}(D) - \mathrm{Info}_{\mathit{age}}(D) = 0.246$$
Similarly,

$$\mathrm{Gain}(\mathit{income}) = 0.029 \qquad \mathrm{Gain}(\mathit{student}) = 0.151 \qquad \mathrm{Gain}(\mathit{credit\_rating}) = 0.048$$

Since age yields the greatest information gain, it becomes the splitting attribute at the root.
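Assuming the information_gain sketch above, these numbers can be reproduced directly from the 14-row training table:

```python
# (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30",   "high",   "no",  "fair",      "no"),
    ("<=30",   "high",   "no",  "excellent", "no"),
    ("31..40", "high",   "no",  "fair",      "yes"),
    (">40",    "medium", "no",  "fair",      "yes"),
    (">40",    "low",    "yes", "fair",      "yes"),
    (">40",    "low",    "yes", "excellent", "no"),
    ("31..40", "low",    "yes", "excellent", "yes"),
    ("<=30",   "medium", "no",  "fair",      "no"),
    ("<=30",   "low",    "yes", "fair",      "yes"),
    (">40",    "medium", "yes", "fair",      "yes"),
    ("<=30",   "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no",  "excellent", "yes"),
    ("31..40", "high",   "yes", "fair",      "yes"),
    (">40",    "medium", "no",  "excellent", "no"),
]
rows, labels = [r[:4] for r in data], [r[4] for r in data]
for i, name in enumerate(["age", "income", "student", "credit_rating"]):
    print(name, round(information_gain(rows, labels, i), 3))
# age 0.246, income 0.029, student 0.151, credit_rating 0.048
```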
Data to be classified:
X = (age = <=30, income = medium, student = yes, credit_rating = fair)

Walking the tree: the root test age? sends X down the <=30 branch to the student? node; since student = yes, X is classified as buys_computer = yes.
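A minimal sketch of walking the learned tree in Python (the nested-dict encoding is mine, not the slides'):

```python
# Inner nodes map an attribute name to its branches; leaves are class labels.
tree = {"age": {
    "<=30":   {"student": {"no": "no", "yes": "yes"}},
    "31..40": "yes",
    ">40":    {"credit_rating": {"excellent": "no", "fair": "yes"}},
}}

def classify(node, sample):
    """Follow the sample's attribute values down until a leaf label is reached."""
    while isinstance(node, dict):
        attr, branches = next(iter(node.items()))
        node = branches[sample[attr]]
    return node

X = {"age": "<=30", "income": "medium", "student": "yes", "credit_rating": "fair"}
print(classify(tree, X))  # -> "yes"
```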
Conditions for stopping partitioning:
◦ All samples for a given node belong to the same class.
◦ There are no samples left.
◦ There are no remaining attributes for further partitioning (in this case, majority voting labels the leaf).
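A recursive ID3 sketch (reusing information_gain from above; helper names are mine) showing where each stopping condition fires:

```python
from collections import Counter

def id3(rows, labels, attributes, parent_majority=None):
    """attributes maps attribute name -> column index in each row."""
    if not rows:                       # (2) no samples left: parent's majority class
        return parent_majority
    if len(set(labels)) == 1:          # (1) all samples in one class: pure leaf
        return labels[0]
    majority = Counter(labels).most_common(1)[0][0]
    if not attributes:                 # (3) no attributes left: majority vote
        return majority
    # Otherwise split on the attribute with the greatest information gain.
    best = max(attributes, key=lambda a: information_gain(rows, labels, attributes[a]))
    idx = attributes[best]
    remaining = {a: i for a, i in attributes.items() if a != best}
    node = {best: {}}
    for value in {r[idx] for r in rows}:
        subset = [(r, l) for r, l in zip(rows, labels) if r[idx] == value]
        node[best][value] = id3([r for r, _ in subset],
                                [l for _, l in subset],
                                remaining, majority)
    return node
```

Calling id3(rows, labels, {"age": 0, "income": 1, "student": 2, "credit_rating": 3}) on the 14-row table reproduces the tree shown earlier.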
A second training set, with three classes (A, B, C):

BP   Angiogram  TMT  Echo  Class
Yes  No         Yes  Yes   A
No   No         No   No    B
Yes  Yes        No   No    C
Yes  No         No   Yes   A
No   Yes        No   No    C
No   Yes        No   No    C
Yes  No         Yes  Yes   A
Yes  No         Yes  Yes   A
Yes  No         No   Yes   A
Yes  No         No   No    B
No   Yes        No   No    C
No   Yes        No   No    C
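The same machinery applies unchanged to this three-class table; a sketch reusing information_gain from above:

```python
heart = [  # (BP, Angiogram, TMT, Echo, Class); transcribed from the table above
    ("Yes", "No",  "Yes", "Yes", "A"), ("No",  "No",  "No",  "No",  "B"),
    ("Yes", "Yes", "No",  "No",  "C"), ("Yes", "No",  "No",  "Yes", "A"),
    ("No",  "Yes", "No",  "No",  "C"), ("No",  "Yes", "No",  "No",  "C"),
    ("Yes", "No",  "Yes", "Yes", "A"), ("Yes", "No",  "Yes", "Yes", "A"),
    ("Yes", "No",  "No",  "Yes", "A"), ("Yes", "No",  "No",  "No",  "B"),
    ("No",  "Yes", "No",  "No",  "C"), ("No",  "Yes", "No",  "No",  "C"),
]
rows, labels = [r[:4] for r in heart], [r[4] for r in heart]
for i, name in enumerate(["BP", "Angiogram", "TMT", "Echo"]):
    print(name, round(information_gain(rows, labels, i), 3))
# Angiogram = Yes isolates class C and Echo = Yes isolates class A,
# so those two attributes tie for the greatest information gain.
```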