
Supervised Learning - Classification
Classification: Decision Tree
Distance Measures and Cosine Similarity

K-nearest Neighbour Classification

Bayes Classification Methods

Classification: Model Construction

Decision Tree Induction

Model Evaluation

Decision Tree Induction: An Example

Training data set: Buys_computer (the class attribute is buys_computer)

  age     income  student  credit_rating  buys_computer
  <=30    high    no       fair           no
  <=30    high    no       excellent      no
  31..40  high    no       fair           yes
  >40     medium  no       fair           yes
  >40     low     yes      fair           yes
  >40     low     yes      excellent      no
  31..40  low     yes      excellent      yes
  <=30    medium  no       fair           no
  <=30    low     yes      fair           yes
  >40     medium  yes      fair           yes
  <=30    medium  yes      excellent      yes
  31..40  medium  no       excellent      yes
  31..40  high    yes      fair           yes
  >40     medium  no       excellent      no

Resulting tree:

  age?
  ├─ <=30   -> student?
  │            ├─ no  -> no
  │            └─ yes -> yes
  ├─ 31..40 -> yes
  └─ >40    -> credit_rating?
               ├─ excellent -> no
               └─ fair      -> yes
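For the code sketches later in these notes, the 14 training tuples can be written down directly in Python. The record layout below (a list of plain dicts, with these particular field names) is an illustrative choice, not part of the original slides.

```python
# The 14 Buys_computer training tuples from the slide, encoded as plain dicts.
TRAINING_DATA = [
    {"age": "<=30",   "income": "high",   "student": "no",  "credit_rating": "fair",      "buys_computer": "no"},
    {"age": "<=30",   "income": "high",   "student": "no",  "credit_rating": "excellent", "buys_computer": "no"},
    {"age": "31..40", "income": "high",   "student": "no",  "credit_rating": "fair",      "buys_computer": "yes"},
    {"age": ">40",    "income": "medium", "student": "no",  "credit_rating": "fair",      "buys_computer": "yes"},
    {"age": ">40",    "income": "low",    "student": "yes", "credit_rating": "fair",      "buys_computer": "yes"},
    {"age": ">40",    "income": "low",    "student": "yes", "credit_rating": "excellent", "buys_computer": "no"},
    {"age": "31..40", "income": "low",    "student": "yes", "credit_rating": "excellent", "buys_computer": "yes"},
    {"age": "<=30",   "income": "medium", "student": "no",  "credit_rating": "fair",      "buys_computer": "no"},
    {"age": "<=30",   "income": "low",    "student": "yes", "credit_rating": "fair",      "buys_computer": "yes"},
    {"age": ">40",    "income": "medium", "student": "yes", "credit_rating": "fair",      "buys_computer": "yes"},
    {"age": "<=30",   "income": "medium", "student": "yes", "credit_rating": "excellent", "buys_computer": "yes"},
    {"age": "31..40", "income": "medium", "student": "no",  "credit_rating": "excellent", "buys_computer": "yes"},
    {"age": "31..40", "income": "high",   "student": "yes", "credit_rating": "fair",      "buys_computer": "yes"},
    {"age": ">40",    "income": "medium", "student": "no",  "credit_rating": "excellent", "buys_computer": "no"},
]
```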
Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm)

During the late 1970s and early 1980s, J. Ross Quinlan, a researcher in machine
learning, developed a decision tree algorithm known as ID3 (Iterative Dichotomiser).
Quinlan later presented C4.5 (a successor of ID3), a benchmark against which newer
supervised learning algorithms are often compared.

◦ The tree is constructed in a top-down recursive divide-and-conquer manner
◦ At the start, all the training examples are at the root
◦ Attributes are categorical (if continuous-valued, they are discretized in advance)
◦ Examples are partitioned recursively based on selected attributes
◦ Attributes are selected on the basis of a heuristic or statistical measure
  (e.g., information gain); a scikit-learn sketch follows below
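As a practical aside, here is a minimal sketch of growing an entropy-criterion tree with scikit-learn on the TRAINING_DATA records encoded above. This is an assumption-laden illustration, not the ID3/C4.5 algorithm itself: scikit-learn's DecisionTreeClassifier implements an optimized CART with binary splits, so the learned tree may differ in shape from the multiway tree on the example slide.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.DataFrame(TRAINING_DATA)                       # records encoded earlier
X = pd.get_dummies(df.drop(columns="buys_computer"))   # one-hot encode the categorical attributes
y = df["buys_computer"]

# criterion="entropy" uses the same information-based measure discussed in these slides
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, y)

print(export_text(clf, feature_names=list(X.columns)))  # text rendering of the learned tree
```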
Brief Review of Entropy

(the two-class case, m = 2)
Question: How do you determine which attribute best classifies the data?
Answer: Entropy!

Information gain:
◦ A statistical quantity measuring how well an attribute classifies the data.
◦ Calculate the information gain for each attribute.
◦ Choose the attribute with the greatest information gain.

But how do you measure information?
◦ Claude Shannon, working at Bell Labs, established the field of information theory in 1948.
◦ A mathematical function, entropy, measures the information content of a random process:
  ◦ It takes on its largest value when all events are equiprobable.
  ◦ It takes on its smallest value when only one event has non-zero probability.
◦ For two states, the entropy of a set S is denoted H(S).
◦ With positive and negative examples from set S:

  H(S) = -p+ log2(p+) - p- log2(p-)
(Entropy curve: Boolean functions with the same number of ones and zeros have the largest entropy.)
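A quick numeric check of these properties, as a sketch (the helper name entropy2 and the sample calls are illustrative, not from the slides):

```python
import math

def entropy2(p_pos, p_neg):
    """Binary entropy H(S) = -p+ log2(p+) - p- log2(p-), treating 0*log2(0) as 0."""
    return -sum(p * math.log2(p) for p in (p_pos, p_neg) if p > 0)

print(entropy2(0.5, 0.5))     # equiprobable events -> largest value, 1.0
print(entropy2(1.0, 0.0))     # only one event has non-zero probability -> 0.0
print(entropy2(9/14, 5/14))   # the Buys_computer training set -> about 0.940
```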
(Back to the story of ID3)
Information gain is our metric for how well one attribute Ai classifies the training data.

Information gain for a particular attribute = the reduction in entropy of the target function
once the value of that attribute is known (i.e., entropy minus conditional entropy).

Mathematical expression for information gain:

  Gain(S, Ai) = H(S) - Σ_{v ∈ Values(Ai)} P(Ai = v) · H(Sv)

where H(S) is the entropy of the whole set S, and H(Sv) is the entropy of the subset Sv
of examples for which attribute Ai takes the value v.
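A direct translation of this formula into Python, reusing only the standard library (the helper names entropy_of and info_gain are illustrative; the target column defaults to the Buys_computer label from the earlier example):

```python
import math
from collections import Counter

def entropy_of(examples, target="buys_computer"):
    """H(S) over any number of class labels; 0*log2(0) is treated as 0."""
    counts = Counter(ex[target] for ex in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(examples, attribute, target="buys_computer"):
    """Gain(S, Ai) = H(S) - sum over v in Values(Ai) of P(Ai = v) * H(Sv)."""
    total = len(examples)
    remainder = 0.0
    for v in {ex[attribute] for ex in examples}:
        subset = [ex for ex in examples if ex[attribute] == v]
        remainder += (len(subset) / total) * entropy_of(subset, target)
    return entropy_of(examples, target) - remainder
```

On the TRAINING_DATA records, info_gain(TRAINING_DATA, "age") should come out to about 0.246, matching the worked example later in these slides.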
ID3 algorithm (boolean-valued function)

Calculate the entropy for all training examples:
◦ positive and negative cases
◦ p+ = #pos/Total, p- = #neg/Total
◦ H(S) = -p+ log2(p+) - p- log2(p-)

Determine which single attribute best classifies the training examples using information gain:
◦ For each attribute, find

  Gain(S, Ai) = H(S) - Σ_{v ∈ Values(Ai)} P(Ai = v) · H(Sv)

◦ Use the attribute with the greatest information gain as the root, and recurse to build
  each internal node (a recursive sketch follows below).
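Putting the helpers together, a compact recursive ID3 sketch for this procedure (the nested-dict tree representation and the majority_label fallback are implementation choices, not prescribed by the slides; info_gain and TRAINING_DATA are defined above):

```python
from collections import Counter

def majority_label(examples, target="buys_computer"):
    """Most common class label in a set of examples (used when no attributes remain)."""
    return Counter(ex[target] for ex in examples).most_common(1)[0][0]

def id3(examples, attributes, target="buys_computer"):
    labels = {ex[target] for ex in examples}
    if len(labels) == 1:                 # stop: all samples belong to the same class
        return labels.pop()
    if not attributes:                   # stop: no remaining attributes -> majority vote
        return majority_label(examples, target)
    best = max(attributes, key=lambda a: info_gain(examples, a, target))
    tree = {best: {}}
    for v in {ex[best] for ex in examples}:       # one branch per observed value of best
        subset = [ex for ex in examples if ex[best] == v]
        rest = [a for a in attributes if a != best]
        tree[best][v] = id3(subset, rest, target)
    return tree

tree = id3(TRAINING_DATA, ["age", "income", "student", "credit_rating"])
print(tree)   # the root should split on "age", as in the example slide
```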


Attribute Selection Measure:
Information Gain (ID3/C4.5)

◦ Select the attribute with the highest information gain
◦ Let pi be the probability that an arbitrary tuple in D belongs to class Ci,
  estimated by |Ci,D| / |D|
◦ Expected information (entropy) needed to classify a tuple in D:

  Info(D) = - Σ_{i=1..m} pi log2(pi)

◦ Information needed (after using A to split D into v partitions) to classify D:

  Info_A(D) = Σ_{j=1..v} (|Dj| / |D|) × Info(Dj)

◦ Information gained by branching on attribute A:

  Gain(A) = Info(D) - Info_A(D)
Attribute Selection: Information Gain
◦ Class P: buys_computer = "yes" (9 tuples)
◦ Class N: buys_computer = "no" (5 tuples)

  Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

  Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

Here (5/14) I(2,3) means that "age <= 30" covers 5 of the 14 samples, with 2 yes's and
3 no's. (Reminder: log2(a/b) = log10(a/b) / log10(2).)

  Gain(age) = Info(D) - Info_age(D) = 0.940 - 0.694 = 0.246

Similarly,

  Gain(income) = 0.029
  Gain(student) = 0.151
  Gain(credit_rating) = 0.048
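These numbers can be checked with the info_gain sketch defined earlier (the loop below is illustrative; values are rounded to three decimals):

```python
# Verify the slide's gains on the Buys_computer records encoded earlier.
for attr in ("age", "income", "student", "credit_rating"):
    print(attr, round(info_gain(TRAINING_DATA, attr), 3))
# Expected, per the slide: age 0.246, income 0.029, student 0.151, credit_rating 0.048
```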
Data to be classified:
  X = (age <= 30, income = medium, student = yes, credit_rating = fair)

  age?
  ├─ <=30   -> student?
  │            ├─ no  -> no
  │            └─ yes -> yes
  ├─ 31..40 -> yes
  └─ >40    -> credit_rating?
               ├─ excellent -> no
               └─ fair      -> yes

Following X down the tree (age <= 30, then student = yes) classifies X as buys_computer = "yes".
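A small sketch of this classification step, hard-coding the slide's resulting tree in the same nested-dict shape that the id3 sketch above produces (the classify helper is an assumption):

```python
# The slide's resulting tree, hard-coded as nested dicts.
RESULT_TREE = {
    "age": {
        "<=30":   {"student": {"no": "no", "yes": "yes"}},
        "31..40": "yes",
        ">40":    {"credit_rating": {"excellent": "no", "fair": "yes"}},
    }
}

def classify(tree, example):
    """Walk the nested dicts until a leaf label (a plain string) is reached."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))          # the attribute tested at this node
        tree = tree[attribute][example[attribute]]
    return tree

X = {"age": "<=30", "income": "medium", "student": "yes", "credit_rating": "fair"}
print(classify(RESULT_TREE, X))   # -> "yes"
```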
Conditions for stopping partitioning
◦ All samples for a given node belong to the same class
◦ There are no samples left
◦ There are no remaining attributes for further partitioning
BP Angiogram TMT Echo Class
Yes No Yes Yes A
No No No No B
Yes Yes No No C
Yes No No Yes A
No Yes No No C
No Yes No No C
Yes No Yes Yes A
Yes No Yes Yes A
Yes No No Yes A
Yes No No No B
No Yes No No C
No Yes No No C
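The same helpers can be reused on this table to see which attribute gives the highest information gain at the root. The record encoding below (with "Class" as the target attribute) is an assumed reading of the table; no answer is asserted here.

```python
# The table above encoded as records; "Class" is treated as the target attribute.
EXERCISE_DATA = [
    {"BP": "Yes", "Angiogram": "No",  "TMT": "Yes", "Echo": "Yes", "Class": "A"},
    {"BP": "No",  "Angiogram": "No",  "TMT": "No",  "Echo": "No",  "Class": "B"},
    {"BP": "Yes", "Angiogram": "Yes", "TMT": "No",  "Echo": "No",  "Class": "C"},
    {"BP": "Yes", "Angiogram": "No",  "TMT": "No",  "Echo": "Yes", "Class": "A"},
    {"BP": "No",  "Angiogram": "Yes", "TMT": "No",  "Echo": "No",  "Class": "C"},
    {"BP": "No",  "Angiogram": "Yes", "TMT": "No",  "Echo": "No",  "Class": "C"},
    {"BP": "Yes", "Angiogram": "No",  "TMT": "Yes", "Echo": "Yes", "Class": "A"},
    {"BP": "Yes", "Angiogram": "No",  "TMT": "Yes", "Echo": "Yes", "Class": "A"},
    {"BP": "Yes", "Angiogram": "No",  "TMT": "No",  "Echo": "Yes", "Class": "A"},
    {"BP": "Yes", "Angiogram": "No",  "TMT": "No",  "Echo": "No",  "Class": "B"},
    {"BP": "No",  "Angiogram": "Yes", "TMT": "No",  "Echo": "No",  "Class": "C"},
    {"BP": "No",  "Angiogram": "Yes", "TMT": "No",  "Echo": "No",  "Class": "C"},
]

# Information gain of each attribute, using the info_gain sketch defined earlier.
for attr in ("BP", "Angiogram", "TMT", "Echo"):
    print(attr, round(info_gain(EXERCISE_DATA, attr, target="Class"), 3))
```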
