Class Adv Classification II

The document discusses classification techniques, focusing on K-nearest neighbors (K-NN) and rule-based classifiers. K-NN requires stored records, a distance metric, and a value for k to classify unknown records based on the majority vote of nearest neighbors, while rule-based classifiers utilize "if…then…" rules for classification. It also addresses challenges such as scaling issues, the curse of dimensionality, and methods for building and simplifying classification rules.

Classification

• K-NN Classifier
• Rule-Based Classifier
Nearest-Neighbor Classifiers
• Requires three things:
  – The set of stored records
  – A distance metric to compute the distance between records
  – The value of k, the number of nearest neighbors to retrieve
• To classify an unknown record:
  – Compute its distance to the training records
  – Identify the k nearest neighbors
  – Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)
Definition of Nearest Neighbor

[Figure: a query point x with its (a) 1-nearest neighbor, (b) 2-nearest neighbors, (c) 3-nearest neighbors]

• The k-nearest neighbors of a record x are the data points that have the k smallest distances to x
Nearest Neighbor Classification…
• Choosing the value of k:
  – If k is too small, the classifier is sensitive to noise points
  – If k is too large, the neighborhood may include points from other classes
Nearest Neighbor Classification
• Compute the distance between two points:
  – Euclidean distance: d(p, q) = √( Σᵢ (pᵢ − qᵢ)² )
• Determine the class from the nearest-neighbor list:
  – Take the majority vote of class labels among the k nearest neighbors
  – Optionally weigh each vote according to distance
    • weight factor: w = 1/d²
Distance-Weighted Nearest Neighbor Algorithm
• Assign weights to the neighbors based on their distance from the query point
  – The weight may be the inverse square of the distance, w = 1/d²
  – With such weights, all training points may influence the classification of a particular query instance
• Exercise:
  (a) Classify the data point x = 5.0 according to its 1-, 3-, 5-, and 9-nearest neighbors (using majority vote).
  (b) Repeat the previous analysis using the distance-weighted voting approach.
(A sketch of both voting schemes is given below.)
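To make the two voting schemes concrete, here is a minimal 1-D k-NN sketch in plain Python with both majority and distance-weighted (w = 1/d²) voting. The training set is hypothetical, invented purely for illustration, since the exercise's data table is not reproduced above.

```python
from collections import Counter

# Hypothetical 1-D training set: (attribute value, class label) pairs.
train = [(0.5, '-'), (3.0, '-'), (4.5, '+'), (4.6, '+'),
         (4.9, '+'), (5.2, '-'), (5.3, '-'), (7.0, '+'), (9.5, '-')]

def knn(x, k, weighted=False):
    # Take the k training points closest to the query point x.
    neighbors = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = Counter()
    for value, label in neighbors:
        d = abs(value - x)
        if weighted:
            # Distance-weighted voting: w = 1/d^2 (an exact match wins outright).
            votes[label] += 1.0 / (d * d) if d > 0 else float('inf')
        else:
            votes[label] += 1.0  # plain majority vote
    return votes.most_common(1)[0][0]

for k in (1, 3, 5, 9):
    print(k, knn(5.0, k), knn(5.0, k, weighted=True))
```

Note how the two schemes can disagree: with large k, distant points dilute the majority vote, while the 1/d² weights keep the decision dominated by the closest neighbors.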
Nearest Neighbor Classification…
• Scaling issues
  – Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
  – Example:
    • the height of a person may vary from 1.5 m to 1.8 m
    • the weight of a person may vary from 90 lb to 300 lb
    • the income of a person may vary from $10K to $1M
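A common remedy is min-max scaling. The sketch below is a minimal illustration; the three records (height in metres, weight in pounds, income in dollars) are invented and only mirror the ranges quoted above.

```python
# Min-max scaling: rescale every attribute to [0, 1] so that no single
# attribute (here, income) dominates the Euclidean distance.
records = [(1.5, 90, 10_000), (1.8, 300, 1_000_000), (1.7, 160, 45_000)]

def min_max_scale(rows):
    cols = list(zip(*rows))                      # one tuple per attribute
    lo, hi = [min(c) for c in cols], [max(c) for c in cols]
    return [tuple((v - l) / (h - l) for v, l, h in zip(row, lo, hi))
            for row in rows]

print(min_max_scale(records))
# Without scaling, distances between these records are driven almost
# entirely by the income attribute.
```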
Nearest Neighbor Classification…
• Problem with the Euclidean measure:
  – High-dimensional data: the curse of dimensionality
  – It can produce counter-intuitive results. Compare two pairs of binary vectors:

      111111111110 vs 011111111111   d = 1.4142
      100000000000 vs 000000000001   d = 1.4142

    Each pair differs in exactly two positions, so both distances are identical, even though the first pair shares eleven 1s while the second pair shares none.
  – Solution: normalize the vectors to unit length
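The effect is easy to verify in a few lines of Python; the vectors are taken directly from the example above, and only the euclidean and unit helper functions are introduced here.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def unit(v):
    # Scale a vector to unit length.
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

a = [int(c) for c in "111111111110"]
b = [int(c) for c in "011111111111"]
c = [int(c) for c in "100000000000"]
d = [int(c) for c in "000000000001"]

print(euclidean(a, b), euclidean(c, d))   # both 1.4142...
print(euclidean(unit(a), unit(b)))        # ~0.4264: nearly parallel vectors
print(euclidean(unit(c), unit(d)))        # 1.4142: orthogonal vectors
```

After normalization, the first pair is correctly recognized as far more similar than the second.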
Nearest Neighbor Classification…
• k-NN classifiers are lazy learners
  – They do not build models explicitly
  – Unlike eager learners such as decision tree induction and rule-based systems
• Classifying unknown records is relatively expensive
  – Naïve algorithm: O(n)
  – Structures are needed to retrieve the nearest neighbors quickly
    • This is the Nearest Neighbor Search problem
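As a point of reference, a minimal sketch of the naïve scan using only the standard library; k_nearest and its arguments are illustrative names, not a fixed API.

```python
import heapq

# Naive nearest-neighbor retrieval: one O(n log k) pass over all stored
# records, keeping the k closest. records is a list of (point, label) pairs.
def k_nearest(records, query, k, dist):
    return heapq.nsmallest(k, records, key=lambda r: dist(r[0], query))
```

Index structures such as k-d trees exist precisely to avoid this full scan on every query.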
Rule-Based Classifier
• Classify records by using a collection of "if…then…" rules
• Rule: (Condition) → y
  – where
    • Condition is a conjunction of attribute tests
    • y is the class label
  – LHS: rule antecedent or condition
  – RHS: rule consequent
  – Examples of classification rules:
    • (Blood Type = Warm) ∧ (Lay Eggs = Yes) → Birds
    • (Taxable Income < 50K) ∧ (Refund = Yes) → Evade = No
Rule-based Classifier (Example)
Name           Blood Type  Give Birth  Can Fly  Live in Water  Class
human          warm        yes         no       no             mammals
python         cold        no          no       no             reptiles
salmon         cold        no          no       yes            fishes
whale          warm        yes         no       yes            mammals
frog           cold        no          no       sometimes      amphibians
komodo         cold        no          no       no             reptiles
bat            warm        yes         yes      no             mammals
pigeon         warm        no          yes      no             birds
cat            warm        yes         no       no             mammals
leopard shark  cold        yes         no       yes            fishes
turtle         cold        no          no       sometimes      reptiles
penguin        warm        no          no       sometimes      birds
porcupine      warm        yes         no       no             mammals
eel            cold        no          no       yes            fishes
salamander     cold        no          no       sometimes      amphibians
gila monster   cold        no          no       no             reptiles
platypus       warm        no          no       no             mammals
owl            warm        no          yes      no             birds
dolphin        warm        yes         no       yes            mammals
eagle          warm        no          yes      no             birds

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Application of Rule-Based Classifier
• A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name          Blood Type  Give Birth  Can Fly  Live in Water  Class
hawk          warm        no          yes      no             ?
grizzly bear  warm        yes         no       no             ?

Rule R1 covers the hawk => Birds
Rule R3 covers the grizzly bear => Mammals
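A minimal sketch of the covering test in Python, assuming a rule is represented as a (condition, label) pair where the condition is a dict of attribute tests; this representation is an illustrative choice, not a prescribed one.

```python
# Two of the rules above, as (condition, class label) pairs.
R1 = ({"Give Birth": "no", "Can Fly": "yes"}, "Birds")
R3 = ({"Give Birth": "yes", "Blood Type": "warm"}, "Mammals")

def covers(rule, instance):
    # A rule covers an instance when every attribute test is satisfied.
    condition, _ = rule
    return all(instance.get(attr) == value for attr, value in condition.items())

hawk = {"Blood Type": "warm", "Give Birth": "no",
        "Can Fly": "yes", "Live in Water": "no"}
grizzly = {"Blood Type": "warm", "Give Birth": "yes",
           "Can Fly": "no", "Live in Water": "no"}

print(covers(R1, hawk))      # True -> the hawk is classified as a bird
print(covers(R3, grizzly))   # True -> the grizzly bear is classified as a mammal
```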
How Does a Rule-Based Classifier Work?
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name           Blood Type  Give Birth  Can Fly  Live in Water  Class
lemur          warm        yes         no       no             ?
turtle         cold        no          no       sometimes      ?
dogfish shark  cold        yes         no       yes            ?

A lemur triggers only rule R3, so it is classified as a mammal.
A turtle triggers both R4 and R5, which assign conflicting classes.
A dogfish shark triggers none of the rules.
Rule Coverage and Accuracy
• Coverage of a rule:
  – The fraction of records that satisfy the antecedent of the rule
• Accuracy of a rule:
  – Of the records that satisfy the antecedent, the fraction that also satisfy the consequent

Tid  Refund  Marital Status  Taxable Income  Class
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

(Status = Single) → No
Coverage = 40%, Accuracy = 50%
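The two measures can be checked against the table directly. The sketch below is an illustration, encoding each record as a tuple in the column order shown; it reproduces the 40% / 50% figures.

```python
# (Refund, Marital Status, Taxable Income in K, Class) for the ten records.
data = [
    ("Yes", "Single",   125, "No"),  ("No",  "Married", 100, "No"),
    ("No",  "Single",    70, "No"),  ("Yes", "Married", 120, "No"),
    ("No",  "Divorced",  95, "Yes"), ("No",  "Married",  60, "No"),
    ("Yes", "Divorced", 220, "No"),  ("No",  "Single",   85, "Yes"),
    ("No",  "Married",   75, "No"),  ("No",  "Single",   90, "Yes"),
]

covered = [r for r in data if r[1] == "Single"]   # antecedent holds
correct = [r for r in covered if r[3] == "No"]    # consequent also holds

print(len(covered) / len(data))     # coverage: 4/10 = 0.4
print(len(correct) / len(covered))  # accuracy: 2/4  = 0.5
```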
Characteristics of Rule-Based Classifiers
• Mutually exclusive rules
  – A classifier contains mutually exclusive rules if the rules are independent of each other
  – Every record is covered by at most one rule

• Exhaustive rules
  – A classifier has exhaustive coverage if it accounts for every possible combination of attribute values
  – Every record is covered by at least one rule
From Decision Trees To Rules

[Figure: decision tree. Root: Refund? Yes → NO. No → Marital Status? {Married} → NO; {Single, Divorced} → Taxable Income? < 80K → NO, > 80K → YES]

Classification Rules:
(Refund = Yes) ==> No
(Refund = No, Marital Status = {Single, Divorced}, Taxable Income < 80K) ==> No
(Refund = No, Marital Status = {Single, Divorced}, Taxable Income > 80K) ==> Yes
(Refund = No, Marital Status = {Married}) ==> No

• The rules are mutually exclusive and exhaustive
• The rule set contains as much information as the tree
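A minimal sketch of the extraction itself: every root-to-leaf path in the tree becomes one rule, with the test at each internal node contributing one conjunct. The nested-dict tree encoding is a hypothetical format chosen for brevity, not the output of any particular library.

```python
def tree_to_rules(node, conditions=()):
    if not isinstance(node, dict):                 # leaf: emit one rule
        return [(list(conditions), node)]
    rules = []
    for test, child in node["branches"].items():
        # Add this node's test to the path and recurse into the subtree.
        rules += tree_to_rules(child, conditions + ((node["attr"], test),))
    return rules

tree = {"attr": "Refund", "branches": {
    "Yes": "No",
    "No": {"attr": "Marital Status", "branches": {
        "Married": "No",
        "Single,Divorced": {"attr": "Taxable Income", "branches": {
            "<80K": "No", ">80K": "Yes"}}}}}}

for cond, label in tree_to_rules(tree):
    print(cond, "==>", label)   # the four rules listed above
```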
Rules Can Be Simplified

[Figure: the same decision tree and training data table as on the previous slides]

Initial Rule: (Refund = No) ∧ (Status = Married) → No
Simplified Rule: (Status = Married) → No

The Refund condition can be dropped because, in the training data, every married record has class No regardless of the value of Refund.
Effect of Rule Simplification
• Rules are no longer mutually exclusive
  – A record may trigger more than one rule
  – Solution?
    • Ordered rule set
    • Unordered rule set with a voting scheme

• Rules are no longer exhaustive
  – A record may not trigger any rule
  – Solution?
    • Use a default class
Ordered Rule Set
• Rules are rank-ordered according to their priority
  – An ordered rule set is known as a decision list
• When a test record is presented to the classifier:
  – It is assigned to the class label of the highest-ranked rule it triggers
  – If none of the rules fire, it is assigned to the default class

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name    Blood Type  Give Birth  Can Fly  Live in Water  Class
turtle  cold        no          no       sometimes      ?

The turtle triggers both R4 and R5; since R4 is ranked higher, the turtle is classified as a reptile.
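A minimal decision-list sketch, reusing the illustrative (condition, label) rule representation from earlier. With the rules in the order given above, the turtle's conflict between R4 and R5 is resolved in favour of R4.

```python
rules = [
    ({"Give Birth": "no",  "Can Fly": "yes"},       "Birds"),       # R1
    ({"Give Birth": "no",  "Live in Water": "yes"}, "Fishes"),      # R2
    ({"Give Birth": "yes", "Blood Type": "warm"},   "Mammals"),     # R3
    ({"Give Birth": "no",  "Can Fly": "no"},        "Reptiles"),    # R4
    ({"Live in Water": "sometimes"},                "Amphibians"),  # R5
]

def classify(instance, rules, default="unknown"):
    # Try the rules in rank order; the first rule that fires decides.
    for condition, label in rules:
        if all(instance.get(a) == v for a, v in condition.items()):
            return label
    return default   # no rule fired: fall back to the default class

turtle = {"Blood Type": "cold", "Give Birth": "no",
          "Can Fly": "no", "Live in Water": "sometimes"}
print(classify(turtle, rules))   # R4 fires before R5 -> Reptiles
```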
Rule Ordering Schemes
• Rule-based ordering
  – Individual rules are ranked based on their quality
• Class-based ordering
  – Rules that belong to the same class appear together

Rule-based Ordering:
1. (Refund = Yes) ==> No
2. (Refund = No, Marital Status = {Single, Divorced}, Taxable Income < 80K) ==> No
3. (Refund = No, Marital Status = {Single, Divorced}, Taxable Income > 80K) ==> Yes
4. (Refund = No, Marital Status = {Married}) ==> No

Class-based Ordering:
1. (Refund = Yes) ==> No
2. (Refund = No, Marital Status = {Single, Divorced}, Taxable Income < 80K) ==> No
3. (Refund = No, Marital Status = {Married}) ==> No
4. (Refund = No, Marital Status = {Single, Divorced}, Taxable Income > 80K) ==> Yes
Building Classification Rules
• Direct method:
  – Extract rules directly from data
  – e.g., RIPPER, CN2, Holte's 1R
• Indirect method:
  – Extract rules from other classification models (e.g., decision trees, neural networks)
  – e.g., C4.5rules
Direct Method: Sequential Covering
1. Start from an empty rule
2. Grow a rule using the Learn-One-Rule function
3. Remove the training records covered by the rule
4. Repeat steps (2) and (3) until a stopping criterion is met
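A minimal sketch of the covering loop, assuming categorical attributes and records represented as (attribute dict, class) pairs. The learn_one_rule shown here is a deliberately crude stand-in that greedily picks the single best attribute test; real systems such as RIPPER grow and prune multi-conjunct rules.

```python
def learn_one_rule(records, target):
    # Greedily pick the single attribute test with the highest accuracy
    # for the target class (a crude stand-in for real rule growing).
    best, best_acc = None, 0.0
    for attrs, _ in records:
        for attr, value in attrs.items():
            covered = [(a, c) for a, c in records if a.get(attr) == value]
            acc = sum(c == target for _, c in covered) / len(covered)
            if acc > best_acc:
                best, best_acc = {attr: value}, acc
    return best, best_acc

def sequential_covering(records, target, min_acc=0.8):
    rules = []
    while any(c == target for _, c in records):   # target examples remain
        condition, acc = learn_one_rule(records, target)
        if condition is None or acc < min_acc:    # stopping criterion
            break
        rules.append((condition, target))
        # Step 3: remove the training records covered by the new rule.
        records = [(a, c) for a, c in records
                   if not all(a.get(k) == v for k, v in condition.items())]
    return rules
```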
Example of Sequential Covering

[Figure: (i) the original data; (ii) Step 1: a first rule R1 is grown; (iii) Step 2: the records covered by R1 are removed; (iv) Step 3: a second rule R2 is grown on the remaining data]
When to Stop Building a Rule
• When the rule is perfect, i.e., its accuracy = 1
• When the increase in accuracy falls below a given threshold
• When the training set cannot be split any further
