Machine Learning
Lecture 13-14

Course Instructor: Mirza Adnan Baig
Email ID: [email protected]

Decision Tree Learning
Decision trees
• A method for approximating discrete-valued
functions that is robust to noisy data and
capable of learning disjunctive expressions
• ID3 (Iterative Dichotomiser 3), ASSISTANT,
C4.5, J48 (Weka)
• Decision tree learners search a complete hypothesis
space (able to represent any finite discrete-valued function)
• Inductive bias is a preference for smaller trees
over larger ones
Decision trees
• They classify instances by sorting them down
the tree from the root to some leaf node
• The leaf node provides the classification of the
instance
• Each node in the tree specifies a test of some
attribute of the instance, and each branch
descending from that node corresponds to
one of the possible values for this attribute
(see the traversal sketch below)
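
As a concrete illustration (not from the original slides), a tree can be stored as nested dictionaries, with each internal node mapping an attribute name to its branches; the tree, attribute names, and `classify` helper below are assumptions for illustration, encoding the familiar PlayTennis example:

```python
# A decision tree as nested dicts: an internal node maps an attribute name
# to {attribute value -> subtree}; a leaf is simply a class label.
# (Hypothetical encoding of the PlayTennis tree, for illustration only.)
play_tennis_tree = {
    "Outlook": {
        "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def classify(node, instance):
    """Sort an instance down the tree from the root to a leaf."""
    while isinstance(node, dict):
        attribute = next(iter(node))                 # attribute tested at this node
        node = node[attribute][instance[attribute]]  # follow the matching branch
    return node                                      # leaf = classification

print(classify(play_tennis_tree, {"Outlook": "Sunny", "Humidity": "Normal"}))  # Yes
```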
Decision trees
[Figure: an example decision tree]
Decision trees
• Decision trees represent disjunctions of conjunctions
of constraints on the attribute values of instances
• Each path from the root to a leaf corresponds
to a conjunction of attribute tests
• The tree itself corresponds to a disjunction of these
conjunctions
• e.g., for PlayTennis: (Outlook = Sunny ∧ Humidity = Normal)
∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)
Decision trees
[Figure: a more complex decision tree]
Decision trees – appropriate problems
• Attribute-value pairs
• Fixed set of attributes and their values
• The easiest case is when each attribute takes a small
number of disjoint values
• However, real-valued attributes can be handled as well
• Target function
• Discrete output values
• Boolean classification
• However, can be extended to consider multiple outputs
Decision trees – appropriate problems
• Disjunctive descriptions
• Naturally represent disjunctive expressions
• Noisy training data
• Robust to errors
• Errors in classifications of the training examples
• Errors in attribute values
• Missing attribute values
• Some of the attribute values are missing
Decision trees – example problems
• Learning to classify medical patients by their
disease
• Equipment malfunctions by their cause
• Loan applicants by their likelihood of
defaulting on payments
Decision trees – the algorithm
• Top-down greedy search through space of
possible decision trees
• The algorithm constructs the tree
• “Which attribute should be tested at the root of
the tree?”
• Each instance attribute is evaluated using a statistical
test to determine how well it alone classifies the
training examples
Decision trees – the algorithm
• A descendant of the root node is then created for
each possible value of this attribute
• Training examples are sorted to the appropriate
descendant node (i.e., down the branch corresponding
to the example’s value for this attribute)
• The entire process is then repeated using the
training examples associated with each
descendant node to select the best attribute to
test at that point in the tree
• Greedy approach with no backtracking
Which attribute is the best classifier?
• Selection of the best attribute at each node is
the key problem to be solved while creating
the tree
• How to measure the “goodness” of an attribute?
• information gain
• Measures how well a given attribute separates the training
examples according to their target classification
• The algorithm uses it to select among the candidate
attributes at each step while building the tree
Entropy
• Entropy measures homogeneity of examples
• (im)purity of an arbitrary collection of examples
• Given a collection $S$, containing positive and
negative examples of some target concept, the
entropy of $S$ relative to this boolean classification is
$Entropy(S) = -p_\oplus \log_2 p_\oplus - p_\ominus \log_2 p_\ominus$
• where $p_\oplus$ is the proportion of positive examples in $S$ and
$p_\ominus$ is the proportion of negative examples in $S$
• $0 \log_2 0$ is defined as $0$
Entropy
• Suppose $S$ is a collection of 14 examples of
some boolean concept
• 9 positive and 5 negative examples: $S = [9+, 5-]$
• Entropy of $S$ relative to this boolean classification is
$Entropy([9+, 5-]) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$
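
The same number can be checked directly from the definition (a minimal sketch; variable names are illustrative):

```python
from math import log2

# Entropy of S = [9+, 5-]: plug the two class proportions into the definition.
p_pos, p_neg = 9 / 14, 5 / 14
entropy_S = -p_pos * log2(p_pos) - p_neg * log2(p_neg)
print(f"{entropy_S:.3f}")  # 0.940
```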
Entropy
• Entropy is $0$ if all members of $S$ belong to the
same class, either positive or negative
• If all members are positive ($p_\oplus = 1$), then $p_\ominus = 0$ and
$Entropy(S) = -1 \cdot \log_2 1 - 0 \cdot \log_2 0 = 0$
• Entropy is $1$ when the collection contains an
equal number of positive and negative
examples
• For an unequal number of positive and negative
examples, its value lies between $0$ and $1$
Entropy
[Figure: entropy relative to a boolean classification, as a function of the proportion $p_\oplus$ of positive examples — Tom Mitchell, p. 57, Figure 3.2]
Entropy
• In general, for a target attribute with $c$ classes,
$Entropy(S) = \sum_{i=1}^{c} -p_i \log_2 p_i$
• where $p_i$ is the proportion of $S$ belonging to class $i$
Entropy
• Entropy specifies the minimum number of bits of
information needed to encode the
classification of an arbitrary member of $S$
• (i.e., a member of $S$ drawn at random with uniform
probability)
• If $p_\oplus$ is $1$:
• The receiver knows the drawn example will be positive
• So no message needs to be sent
• Entropy is $0$
Entropy
• If $p_\oplus$ is $0.5$, one bit is required to indicate whether the
drawn example is positive or negative
• If $p_\oplus$ is $0.8$, then a collection of messages can be
encoded using on average less than $1$ bit per
message
• Assign shorter codes to the more likely positive
examples
• Assign longer codes to the less likely negative examples
Information gain
• The expected reduction in entropy by
choosing a particular attribute over others for
a particular node in the tree
• It identifies how much impurity is removed in a set
of examples if a particular attribute is chosen at a
particular node
• The aim is to reduce that impurity so that, after
splitting on the attribute, the examples in each branch
are as close as possible to all-positive or all-negative
Information gain
• Information gain, $Gain(S, A)$, of an attribute $A$ relative to a
collection of examples $S$, is defined as
$Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)$
• where $Values(A)$ is the set of all possible values for attribute $A$
• $S_v$ is the subset of $S$ for which attribute $A$ has value $v$
(i.e., $S_v = \{ s \in S \mid A(s) = v \}$)
Information gain

• The first term is the entropy of the original collection $S$
• The second term is the expected value of the entropy after
$S$ is partitioned using attribute $A$
• $Gain(S, A)$ is therefore the expected reduction in
entropy caused by knowing the value of
attribute $A$
Information gain
• The value of $Gain(S, A)$ is the number of bits saved when
encoding the target value of an arbitrary
member of $S$, by knowing the value of
attribute $A$
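
Both formulas translate almost line for line into Python (a sketch, assuming each example is a dict of attribute values that includes the target attribute; function names are illustrative):

```python
from math import log2
from collections import Counter

def entropy(examples, target):
    """Entropy(S) = sum over classes i of -p_i * log2(p_i)."""
    counts = Counter(e[target] for e in examples)
    total = len(examples)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(examples, attribute, target):
    """Gain(S, A) = Entropy(S) - sum over v in Values(A) of (|S_v|/|S|) * Entropy(S_v)."""
    total = len(examples)
    expected = 0.0
    for value in {e[attribute] for e in examples}:               # Values(A) seen in S
        subset = [e for e in examples if e[attribute] == value]  # S_v
        expected += (len(subset) / total) * entropy(subset, target)
    return entropy(examples, target) - expected
```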
Information gain
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
Information gain
• For the training data above: $Values(Wind) = \{Weak, Strong\}$
• $S = [9+, 5-]$, $S_{Weak} = [6+, 2-]$, $S_{Strong} = [3+, 3-]$
• $Gain(S, Wind) = Entropy(S) - \frac{8}{14} Entropy(S_{Weak}) - \frac{6}{14} Entropy(S_{Strong})$
$= 0.940 - \frac{8}{14}(0.811) - \frac{6}{14}(1.00) = 0.048$
• Not much reduction in entropy by using Wind as an
attribute
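
The 0.048 figure can be verified from the class counts alone (a minimal sketch; `entropy2` is a hypothetical helper working on positive/negative counts):

```python
from math import log2

def entropy2(pos, neg):
    """Entropy of a collection with `pos` positive and `neg` negative examples."""
    total = pos + neg
    return sum(-(c / total) * log2(c / total) for c in (pos, neg) if c > 0)

# S = [9+, 5-]; splitting on Wind gives S_Weak = [6+, 2-] and S_Strong = [3+, 3-].
gain_wind = entropy2(9, 5) - (8 / 14) * entropy2(6, 2) - (6 / 14) * entropy2(3, 3)
print(f"{gain_wind:.3f}")  # 0.048
```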
Information gain
• Information gain is the measure used by the
ID3 algorithm to identify the best attribute at
any particular node of the tree

[Figure: which attribute is the best classifier? — Tom Mitchell, p. 59, Figure 3.3]
An illustrative example
• Given the training data (slide 21), what should
be the root node of the decision tree?
• $Gain(S, Outlook) = 0.246$
• $Gain(S, Humidity) = 0.151$
• $Gain(S, Wind) = 0.048$
• $Gain(S, Temperature) = 0.029$
• Outlook provides the best prediction of the target concept
• It reduces the entropy (ambiguity about the classification) of the
original set of examples by the maximum
• Or, equivalently, it increases the certainty that a given set of
examples belongs to a particular class
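
These four gains can be recomputed from the table on slide 21 (a self-contained sketch; the dict representation and helper names are illustrative):

```python
from math import log2
from collections import Counter

# The 14 training examples from the PlayTennis table.
attrs = ["Outlook", "Temperature", "Humidity", "Wind"]
rows = [
    ("Sunny", "Hot", "High", "Weak", "No"),          ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),      ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),       ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),      ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),    ("Rain", "Mild", "High", "Strong", "No"),
]
examples = [dict(zip(attrs + ["PlayTennis"], row)) for row in rows]

def entropy(exs):
    counts = Counter(e["PlayTennis"] for e in exs)
    return -sum((c / len(exs)) * log2(c / len(exs)) for c in counts.values())

def gain(exs, attribute):
    expected = 0.0
    for value in {e[attribute] for e in exs}:
        subset = [e for e in exs if e[attribute] == value]
        expected += (len(subset) / len(exs)) * entropy(subset)
    return entropy(exs) - expected

for a in attrs:
    print(f"Gain(S, {a}) = {gain(examples, a):.3f}")
# Outlook = 0.247, Temperature = 0.029, Humidity = 0.152, Wind = 0.048
# (the slide's 0.246 / 0.151 come from rounding the intermediate entropies)
```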
An illustrative example

[Figure: the partially learned decision tree, with Outlook at the root — Tom Mitchell, p. 61, Figure 3.4]
An illustrative example
• The process is repeated until
• Either every attribute has already been included
along this path through the tree
• Or the training examples associated with this leaf
node all have the same target attribute value
(their entropy is zero)
ID3 – the algorithm
[Table: ID3 pseudocode — Tom Mitchell, Table 3.1]
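
Putting the pieces together, a minimal recursive sketch of ID3 in Python (following the top-down greedy description in these slides; the dict-based tree encoding and helper names are assumptions, and `entropy`/`gain` match the earlier sketches):

```python
from math import log2
from collections import Counter

def entropy(examples, target):
    counts = Counter(e[target] for e in examples)
    n = len(examples)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def gain(examples, attribute, target):
    n = len(examples)
    expected = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e for e in examples if e[attribute] == value]
        expected += (len(subset) / n) * entropy(subset, target)
    return entropy(examples, target) - expected

def id3(examples, attributes, target):
    labels = [e[target] for e in examples]
    # Leaf: all examples share the same target value (entropy is zero).
    if len(set(labels)) == 1:
        return labels[0]
    # Leaf: every attribute already appears on this path; use the majority label.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: test the attribute with the highest information gain here,
    # then recurse on the examples sorted down each branch (no backtracking).
    best = max(attributes, key=lambda a: gain(examples, a, target))
    remaining = [a for a in attributes if a != best]
    return {best: {
        value: id3([e for e in examples if e[best] == value], remaining, target)
        for value in {e[best] for e in examples}
    }}
```

Run on the PlayTennis examples above, `id3(examples, attrs, "PlayTennis")` should place Outlook at the root, matching the gain calculation.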