Class Adv Classification II
• K-NN Classifier
• Rule Based Classifier
Nearest-Neighbor Classifiers
• Requires three things
– The set of stored records
– Distance metric to compute distance between records
– The value of k, the number of nearest neighbors to retrieve
• To classify an unknown record:
– Compute distance to other training records
– Identify k nearest neighbors
– Use class labels of nearest neighbors to determine the class label of unknown record (e.g., by taking majority vote)
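The three steps above can be sketched in a few lines. This is a minimal illustration, not a tuned implementation: the function name `knn_classify` and the toy records are assumptions made for the example, and Euclidean distance is assumed as the metric.

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    """Classify `query` from `train`, a list of (features, label) pairs."""
    # Step 1: compute the distance from the query to every stored record.
    by_distance = sorted(train, key=lambda rec: math.dist(rec[0], query))
    # Step 2: keep the k nearest neighbors.
    nearest = by_distance[:k]
    # Step 3: majority vote over the neighbors' class labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two numeric attributes per record.
train = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.9), "B"),
]
print(knn_classify(train, (1.1, 1.0), k=3))
```

Ties in the vote are broken here by whichever label was counted first; a real classifier would need an explicit tie-breaking policy (for example, an odd k for two classes).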
Definition of Nearest Neighbor
[Figure: nearest neighbors of a record X]
Nearest Neighbor Classification
• Compute distance between two points:
– Euclidean distance
$d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$
• Scaling issues
– Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
– Example:
• height of a person may vary from 1.5m to 1.8m
• weight of a person may vary from 90lb to 300lb
• income of a person may vary from $10K to $1M
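To see why scaling matters, the sketch below compares distances before and after rescaling. Min-max scaling is assumed here as one common choice (the slide does not name a method), and the three people are made-up values inside the ranges listed above.

```python
import math

def min_max_scale(rows):
    """Rescale every attribute (column) to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]

# Hypothetical records: (height in m, weight in lb, income in $).
people = [
    (1.5, 90.0, 10_000.0),
    (1.8, 300.0, 12_000.0),     # very different height and weight, similar income
    (1.6, 100.0, 1_000_000.0),  # similar height and weight, very different income
]
scaled = min_max_scale(people)

raw_01 = math.dist(people[0], people[1])
raw_02 = math.dist(people[0], people[2])
scaled_01 = math.dist(scaled[0], scaled[1])
scaled_02 = math.dist(scaled[0], scaled[2])
print(raw_01, raw_02)        # income dominates the raw distances
print(scaled_01, scaled_02)  # all three attributes contribute after scaling
```

On the raw values, the second person looks closest to the first only because their incomes are similar; after scaling, the ordering of the two distances flips.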
Nearest Neighbor Classification…
111111111110 vs 011111111111: d = 1.4142
100000000000 vs 000000000001: d = 1.4142
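The two distances can be checked directly. The short sketch below (helper names `bits` and `euclidean` are assumptions for the example) recomputes both values and also counts the 1s each pair has in common, which is the difference the Euclidean distance fails to reflect.

```python
import math

def bits(s):
    """Turn a string such as '1010' into a list of 0/1 integers."""
    return [int(ch) for ch in s]

def euclidean(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

pair_1 = (bits("111111111110"), bits("011111111111"))
pair_2 = (bits("100000000000"), bits("000000000001"))

d1 = euclidean(*pair_1)
d2 = euclidean(*pair_2)
shared_1 = sum(a & b for a, b in zip(*pair_1))  # positions where both are 1
shared_2 = sum(a & b for a, b in zip(*pair_2))

print(round(d1, 4), round(d2, 4))  # same distance for both pairs
print(shared_1, shared_2)          # very different overlap
```

Both pairs differ in exactly two positions, so both distances equal the square root of 2, even though the first pair shares ten 1s and the second pair shares none.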
Rule-based Classifier (Example)
Name            Blood Type  Give Birth  Can Fly  Live in Water  Class
human           warm        yes         no       no             mammals
python          cold        no          no       no             reptiles
salmon          cold        no          no       yes            fishes
whale           warm        yes         no       yes            mammals
frog            cold        no          no       sometimes      amphibians
komodo          cold        no          no       no             reptiles
bat             warm        yes         yes      no             mammals
pigeon          warm        no          yes      no             birds
cat             warm        yes         no       no             mammals
leopard shark   cold        yes         no       yes            fishes
turtle          cold        no          no       sometimes      reptiles
penguin         warm        no          no       sometimes      birds
porcupine       warm        yes         no       no             mammals
eel             cold        no          no       yes            fishes
salamander      cold        no          no       sometimes      amphibians
gila monster    cold        no          no       no             reptiles
platypus        warm        no          no       no             mammals
owl             warm        no          yes      no             birds
dolphin         warm        yes         no       yes            mammals
eagle           warm        no          yes      no             birds
Name            Blood Type  Give Birth  Can Fly  Live in Water  Class
hawk            warm        no          yes      no             ?
grizzly bear    warm        yes         no       no             ?
Name            Blood Type  Give Birth  Can Fly  Live in Water  Class
lemur           warm        yes         no       no             ?
turtle          cold        no          no       sometimes      ?
dogfish shark   cold        yes         no       yes            ?
Rule Coverage and Accuracy
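• Coverage of a rule: fraction of records that satisfy the antecedent of the rule
• Accuracy of a rule: fraction of the records covered by the rule that also satisfy the consequent of the rule
Tid  Refund  Marital Status  Taxable Income  Class
10   No      Single          90K             Yes
(Status=Single) ==> No
Coverage = 40%, Accuracy = 50%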
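The quoted figures can be reproduced with a small computation. Only row 10 of the table is legible here, so the ten records below are an assumed stand-in: rows 7 to 9 in particular are illustrative values chosen so that the result matches the 40% and 50% stated above.

```python
# (tid, refund, marital_status, taxable_income_k, class_label)
# Assumed data set; rows 7-9 are illustrative, not taken from the slide.
records = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]

def coverage_and_accuracy(records, antecedent, consequent):
    """Coverage: share of records matching the antecedent.
    Accuracy: share of covered records whose class equals the consequent."""
    covered = [r for r in records if antecedent(r)]
    coverage = len(covered) / len(records)
    correct = sum(1 for r in covered if r[4] == consequent)
    accuracy = correct / len(covered) if covered else 0.0
    return coverage, accuracy

# Rule: (Status=Single) ==> No
coverage, accuracy = coverage_and_accuracy(
    records, antecedent=lambda r: r[2] == "Single", consequent="No")
print(coverage, accuracy)
```

Four of the ten records are Single (coverage 4/10), and two of those four have class No (accuracy 2/4).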
Characteristics of Rule-Based Classifier
• Mutually exclusive rules
– Classifier contains mutually exclusive rules if the rules are independent of each other
– Every record is covered by at most one rule
• Exhaustive rules
– Classifier has exhaustive coverage if it accounts for every possible combination of attribute values
– Each record is covered by at least one rule
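Both properties can be checked mechanically when the attributes are categorical: enumerate every combination of attribute values and count how many rules cover it. The sketch below does this for a made-up rule set; the attribute domains and the rules are hypothetical and are not learned from any table.

```python
from itertools import product

def coverage_counts(rules, domains):
    """Map each combination of attribute values to the number of rules covering it."""
    names = list(domains)
    counts = {}
    for values in product(*(domains[n] for n in names)):
        record = dict(zip(names, values))
        counts[values] = sum(1 for condition, _ in rules if condition(record))
    return counts

def is_mutually_exclusive(rules, domains):
    # At most one rule covers any record.
    return all(c <= 1 for c in coverage_counts(rules, domains).values())

def is_exhaustive(rules, domains):
    # At least one rule covers every record.
    return all(c >= 1 for c in coverage_counts(rules, domains).values())

# Hypothetical attributes and rules, for illustration only.
domains = {"blood": ["warm", "cold"], "can_fly": ["yes", "no"]}
rules = [
    (lambda r: r["blood"] == "warm" and r["can_fly"] == "yes", "class_1"),
    (lambda r: r["blood"] == "warm" and r["can_fly"] == "no", "class_2"),
    (lambda r: r["blood"] == "cold", "class_3"),
]
overlapping = rules + [(lambda r: r["can_fly"] == "yes", "class_1")]

print(is_mutually_exclusive(rules, domains), is_exhaustive(rules, domains))
print(is_mutually_exclusive(overlapping, domains))
print(is_exhaustive(rules[:2], domains))
```

Enumeration only works for small categorical domains; continuous attributes such as income would need interval reasoning instead.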
From Decision Trees To Rules
[Figure: decision tree. Refund = Yes: NO. Refund = No: test Marital Status. {Married}: NO. {Single, Divorced}: test Taxable Income. < 80K: NO. > 80K: YES.]
Classification Rules
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No
• Rules are mutually exclusive and exhaustive
• Rule set contains as much information as the tree
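The equivalence between the tree and its rule set can be illustrated by writing both down and comparing their predictions. In this sketch the record field names are assumptions, and an income of exactly 80K, which the slide leaves unspecified, is sent to the "> 80K" branch as a further assumption.

```python
def tree_predict(record):
    """Walk the decision tree from the root to a leaf."""
    if record["refund"] == "Yes":
        return "No"
    if record["marital"] == "Married":
        return "No"
    # Assumption: exactly 80K is treated like the "> 80K" branch.
    return "No" if record["income_k"] < 80 else "Yes"

# One rule per root-to-leaf path of the tree.
RULES = [
    (lambda r: r["refund"] == "Yes", "No"),
    (lambda r: r["refund"] == "No" and r["marital"] in {"Single", "Divorced"}
               and r["income_k"] < 80, "No"),
    (lambda r: r["refund"] == "No" and r["marital"] in {"Single", "Divorced"}
               and r["income_k"] >= 80, "Yes"),
    (lambda r: r["refund"] == "No" and r["marital"] == "Married", "No"),
]

def rules_predict(record):
    """Return the label of the single rule that fires for this record."""
    fired = [label for condition, label in RULES if condition(record)]
    if len(fired) != 1:
        raise ValueError("rules are not mutually exclusive and exhaustive")
    return fired[0]

# Hypothetical sample records.
samples = [
    {"refund": "Yes", "marital": "Single", "income_k": 125},
    {"refund": "No", "marital": "Married", "income_k": 100},
    {"refund": "No", "marital": "Single", "income_k": 70},
    {"refund": "No", "marital": "Divorced", "income_k": 95},
]
for s in samples:
    print(tree_predict(s), rules_predict(s))
```

Because each rule corresponds to one path, exactly one rule fires per record and the two predictors agree.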
Rules Can Be Simplified
[Figure: decision tree. Refund = Yes: NO. Refund = No: test Marital Status. {Married}: NO. {Single, Divorced}: test Taxable Income.]
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
Ordered Rule Set
• Rules are rank ordered according to their priority
– An ordered rule set is known as a decision list
• When a test record is presented to the classifier
– It is assigned to the class label of the highest ranked rule it has triggered
– If none of the rules fired, it is assigned to the default class
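A decision list can be sketched as a loop that returns the label of the first rule that fires. The five rules below are hypothetical (they are not given in this text), the test records come from the animal tables above, and the default class is assumed to be the most frequent class in the training table.

```python
def classify(decision_list, record, default_class):
    """Return the label of the highest-ranked rule triggered by the record."""
    for condition, label in decision_list:  # rules are in priority order
        if condition(record):
            return label
    return default_class  # no rule fired

# Hypothetical ordered rule set, highest priority first.
decision_list = [
    (lambda r: r["give_birth"] == "no" and r["can_fly"] == "yes", "birds"),
    (lambda r: r["give_birth"] == "no" and r["water"] == "yes", "fishes"),
    (lambda r: r["give_birth"] == "yes" and r["blood"] == "warm", "mammals"),
    (lambda r: r["give_birth"] == "no" and r["can_fly"] == "no", "reptiles"),
    (lambda r: r["water"] == "sometimes", "amphibians"),
]

hawk = {"blood": "warm", "give_birth": "no", "can_fly": "yes", "water": "no"}
turtle = {"blood": "cold", "give_birth": "no", "can_fly": "no", "water": "sometimes"}
dogfish_shark = {"blood": "cold", "give_birth": "yes", "can_fly": "no", "water": "yes"}

for animal in (hawk, turtle, dogfish_shark):
    print(classify(decision_list, animal, default_class="mammals"))
```

The turtle triggers both the fourth and the fifth rule, and the ordering resolves the conflict in favor of the fourth. The dogfish shark triggers no rule and falls through to the default class.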
Building Classification Rules
• Direct Method:
– Extract rules directly from data
– e.g.: RIPPER, CN2, Holte’s 1R
• Indirect Method:
– Extract rules from other classification models (e.g., decision trees, neural networks)
– e.g.: C4.5rules
Direct Method: Sequential Covering
Example of Sequential Covering
Example of Sequential Covering…
[Figure: regions covered by rules R1 and R2]
When to Stop Building a Rule
• When the rule is perfect, i.e. accuracy = 1
• When the increase in accuracy gets below a given threshold
• When the training set cannot be split any further
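The three stopping conditions can be seen in a greedy rule-growing loop. This is a simplified sketch under stated assumptions: plain accuracy is used as the quality measure (practical rule learners typically use other measures), attributes are categorical, and the function names, the `min_gain` threshold, and the toy records are all hypothetical.

```python
def rule_accuracy(conjuncts, records, target):
    """Accuracy of the rule (conjuncts ==> target) and the records it covers."""
    covered = [r for r in records if all(r[a] == v for a, v in conjuncts)]
    if not covered:
        return 0.0, covered
    correct = sum(1 for r in covered if r["class"] == target)
    return correct / len(covered), covered

def grow_rule(records, target, attributes, min_gain=0.01):
    """Greedily add attribute tests until one of the stopping conditions holds."""
    conjuncts = []
    accuracy, covered = rule_accuracy(conjuncts, records, target)
    while True:
        if accuracy == 1.0:
            break  # stop: the rule is perfect
        used = {a for a, _ in conjuncts}
        candidates = sorted({(a, r[a]) for r in covered
                             for a in attributes if a not in used})
        if not candidates:
            break  # stop: the covered records cannot be split any further
        best = max(candidates,
                   key=lambda c: rule_accuracy(conjuncts + [c], records, target)[0])
        best_acc, best_cov = rule_accuracy(conjuncts + [best], records, target)
        if best_acc - accuracy < min_gain:
            break  # stop: the increase in accuracy is below the threshold
        conjuncts.append(best)
        accuracy, covered = best_acc, best_cov
    return conjuncts, accuracy

# Hypothetical toy records.
records = [
    {"blood": "warm", "give_birth": "yes", "class": "mammals"},
    {"blood": "cold", "give_birth": "no", "class": "reptiles"},
    {"blood": "cold", "give_birth": "yes", "class": "fishes"},
    {"blood": "warm", "give_birth": "no", "class": "birds"},
    {"blood": "warm", "give_birth": "yes", "class": "mammals"},
]
conjuncts, accuracy = grow_rule(records, "mammals", ["blood", "give_birth"])
print(conjuncts, accuracy)
```

On this toy data the loop adds two tests and then stops because the rule has become perfect; candidates are sorted only to make tie-breaking deterministic.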