Classification (K-Nearest Neighbor)

The document provides an overview of the Nearest Neighbor Classifier, detailing its purpose in classification tasks where unseen records are assigned class labels based on training data. It outlines the steps for implementing the k-nearest neighbors (k-NN) algorithm, including calculating distances, selecting neighbors, and determining class labels through majority voting. Additionally, it discusses considerations such as choosing the value of k, scaling attributes, and alternative distance measures.


NEAREST NEIGHBOR CLASSIFIER
CLASSIFICATION: DEFINITION
Given a collection of records (training set)
 Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
 A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
ILLUSTRATING CLASSIFICATION TASK
Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

(Figure: the training set is fed to a learning algorithm, which learns a model by induction; the model is then applied to the test set by deduction to predict the unknown class labels.)
EXAMPLES OF CLASSIFICATION TASK
Predicting tumor cells as benign or malignant
Classifying credit card transactions as legitimate or fraudulent
Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
Categorizing news stories as finance, weather, entertainment, sports, etc.
NEAREST NEIGHBOR CLASSIFIER
Basic idea:
 If it walks like a duck and quacks like a duck, then it's probably a duck.

(Figure: to classify a test record, compute its distance to the training records, then choose the k "nearest" training records.)
NEAREST-NEIGHBOR CLASSIFIER
Requires three things
 – The set of stored records
 – A distance metric to compute the distance between records
 – The value of k, the number of nearest neighbors to retrieve

To classify an unknown record:
 – Compute its distance to the other training records
 – Identify the k nearest neighbors
 – Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)
DEFINITION OF NEAREST NEIGHBOR

(Figure: a test record x and its (a) 1-nearest neighbor, (b) 2-nearest neighbors, (c) 3-nearest neighbors.)

The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.
NEAREST NEIGHBOR CLASSIFICATION
Compute the distance between two points:
 Euclidean distance

    d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}

Determine the class from the nearest-neighbor list
 Take the majority vote of class labels among the k nearest neighbors
 Weight the vote according to distance
  weight factor, w = 1/d^2
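
As a concrete illustration of the distance-weighted vote, here is a minimal Python sketch (not code from the slides; the function name weighted_knn_predict and the small epsilon used to avoid division by zero are my own choices):

```python
import math
from collections import defaultdict

def weighted_knn_predict(training_points, training_labels, x, k=3):
    """Predict the class of x by a distance-weighted vote over its k nearest neighbors."""
    # compute the Euclidean distance from x to every training point
    distances = sorted(
        (math.dist(p, x), label) for p, label in zip(training_points, training_labels)
    )
    votes = defaultdict(float)
    for d, label in distances[:k]:
        votes[label] += 1.0 / (d ** 2 + 1e-12)   # weight factor w = 1/d^2
    return max(votes, key=votes.get)
```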
K-NEAREST NEIGHBOR (K-NN) ALGORITHM

For every point in the dataset:
    calculate the distance between X and the current point
Sort the distances in increasing order
Take the k items with the lowest distances to X
Find the majority class among these k items
Return the majority class as the prediction for the class of X
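
The pseudocode above maps almost line for line onto Python. A minimal sketch, assuming the training data is a NumPy array of feature rows with a parallel list of class labels (the name classify and the NumPy representation are assumptions, not part of the slides):

```python
import numpy as np
from collections import Counter

def classify(X, dataset, labels, k):
    """Predict the class of point X by majority vote among its k nearest neighbors."""
    # calculate the distance between X and every point in the dataset
    distances = np.sqrt(((dataset - X) ** 2).sum(axis=1))
    # sort the distances in increasing order and take the k items with the lowest distances
    nearest_indices = distances.argsort()[:k]
    # find the majority class among these items and return it as the prediction
    nearest_labels = [labels[i] for i in nearest_indices]
    return Counter(nearest_labels).most_common(1)[0][0]
```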
HOW TO IMPLEMENT K-NN IN PYTHON
1. Handle Data: Open the dataset from CSV and split it into training and test datasets.
2. Similarity: Calculate the distance between two data instances.
3. Neighbors: Locate the k most similar data instances.
4. Response: Get the majority-voted response from a number of neighbors.
5. Accuracy: Summarize the accuracy of predictions.
6. Main: Tie it all together.

https://machinelearningmastery.com/tutorial-to-implement-k-nearest-neighbors-in-python-from-scratch/
HOW TO IMPLEMENT K-NN IN PYTHON
1. Handle Data: Open the dataset from CSV and split it into training and test datasets.
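
The slide refers to code from the linked tutorial; as a stand-in, here is a minimal sketch of loading a CSV file and randomly splitting it into training and test sets (the function name load_dataset and the assumption that the class label is in the last column are mine):

```python
import csv
import random

def load_dataset(filename, split_ratio=0.67):
    """Read a CSV file and randomly split its rows into training and test sets."""
    training_set, test_set = [], []
    with open(filename) as f:
        for row in csv.reader(f):
            if not row:
                continue
            # convert the numeric attributes to float; keep the class label (last column) as a string
            instance = [float(value) for value in row[:-1]] + [row[-1]]
            if random.random() < split_ratio:
                training_set.append(instance)
            else:
                test_set.append(instance)
    return training_set, test_set
```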
HOW TO IMPLEMENT K-NN IN PYTHON
2. Similarity: Calculate the distance between two data instances.
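
A minimal sketch of the distance step (the name euclidean_distance and the length parameter, which lets the trailing class label be ignored, are assumptions):

```python
import math

def euclidean_distance(instance1, instance2, length):
    """Euclidean distance over the first `length` (numeric) attributes of two instances."""
    return math.sqrt(sum((instance1[i] - instance2[i]) ** 2 for i in range(length)))
```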
HOW TO IMPLEMENT K-NN IN PYTHON
3. Neighbors: Locate the k most similar data instances.
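
A minimal sketch of collecting the k nearest training instances, reusing the euclidean_distance helper sketched above (the name get_neighbors is an assumption):

```python
def get_neighbors(training_set, test_instance, k):
    """Return the k training instances closest to test_instance."""
    length = len(test_instance) - 1   # ignore the class label in the last column
    distances = [
        (train_instance, euclidean_distance(test_instance, train_instance, length))
        for train_instance in training_set
    ]
    distances.sort(key=lambda pair: pair[1])
    return [pair[0] for pair in distances[:k]]
```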
HOW TO IMPLEMENT K-NN IN PYTHON
4. Response: Get the majority-voted response from a number of neighbors.
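
A minimal sketch of the majority-vote step (the name get_response is an assumption):

```python
from collections import Counter

def get_response(neighbors):
    """Return the class label that receives the most votes among the neighbors."""
    votes = Counter(neighbor[-1] for neighbor in neighbors)   # class label is the last column
    return votes.most_common(1)[0][0]
```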
HOW TO IMPLEMENT K-NN IN PYTHON
5. Accuracy: Summarize the accuracy of predictions.
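
A minimal sketch of the accuracy step (the name get_accuracy is an assumption):

```python
def get_accuracy(test_set, predictions):
    """Percentage of test instances whose predicted label matches the actual label."""
    correct = sum(
        1 for instance, predicted in zip(test_set, predictions) if instance[-1] == predicted
    )
    return correct / len(test_set) * 100.0
```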
HOW TO IMPLEMENT K-NN IN PYTHON

6. Main: Tie it all together.
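
A minimal sketch tying the previous sketches together; the file name 'iris.data', the split ratio, and k = 3 are assumptions, not values prescribed by the slides:

```python
def main():
    training_set, test_set = load_dataset('iris.data', split_ratio=0.67)   # assumed file name
    k = 3
    predictions = []
    for test_instance in test_set:
        neighbors = get_neighbors(training_set, test_instance, k)
        predictions.append(get_response(neighbors))
    print('Accuracy: {:.2f}%'.format(get_accuracy(test_set, predictions)))

if __name__ == '__main__':
    main()
```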
NEAREST NEIGHBOR CLASSIFICATION…
Choosing the value of k:
 If k is too small, the classifier is sensitive to noise points
 If k is too large, the neighborhood may include points from other classes
NEAREST NEIGHBOR CLASSIFICATION…
Scaling issues
Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes.
Example:
 the height of a person may vary from 1.5 m to 1.8 m
 the weight of a person may vary from 90 lb to 300 lb
 the income of a person may vary from $10K to $1M
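
One common way to put attributes on a comparable scale is min-max normalization to the range [0, 1]. A minimal sketch (the function name and the per-column min/max approach are my own; the slides do not prescribe a particular scaling method):

```python
def min_max_scale(dataset):
    """Rescale each column of a list-of-lists numeric dataset to the range [0, 1]."""
    n_cols = len(dataset[0])
    mins = [min(row[i] for row in dataset) for i in range(n_cols)]
    maxs = [max(row[i] for row in dataset) for i in range(n_cols)]
    return [
        [(row[i] - mins[i]) / (maxs[i] - mins[i]) if maxs[i] > mins[i] else 0.0
         for i in range(n_cols)]
        for row in dataset
    ]

# e.g. rows of [height (m), weight (lb), income ($)]
print(min_max_scale([[1.5, 90, 10_000], [1.65, 200, 500_000], [1.8, 300, 1_000_000]]))
```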
NEAREST NEIGHBOR CLASSIFICATION…

Problem with the Euclidean measure:
 High-dimensional data
  curse of dimensionality
 Can produce counter-intuitive results, e.g.:

    111111111110  vs  011111111111   d = 1.4142
    100000000000  vs  000000000001   d = 1.4142

◆ Solution: Normalize the vectors to unit length
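
A quick check of the example above: before normalization both pairs are the same Euclidean distance apart, but after normalizing to unit length the nearly identical pair comes out much closer than the pair that shares no 1s (a minimal NumPy sketch; the helper name is mine):

```python
import numpy as np

def unit_normalize(v):
    """Scale a vector to unit length (L2 norm 1)."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

a = np.array([1] * 11 + [0])    # 111111111110
b = np.array([0] + [1] * 11)    # 011111111111
c = np.array([1] + [0] * 11)    # 100000000000
d = np.array([0] * 11 + [1])    # 000000000001

print(np.linalg.norm(a - b), np.linalg.norm(c - d))            # both ~1.4142 before normalization
print(np.linalg.norm(unit_normalize(a) - unit_normalize(b)),   # ~0.43 after normalization
      np.linalg.norm(unit_normalize(c) - unit_normalize(d)))   # still ~1.4142
```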
OTHER DISTANCE MEASURES

Hamming Distance: the distance between binary vectors (the number of positions at which they differ)
Manhattan Distance: the distance between real vectors, computed as the sum of their absolute differences; also called City Block Distance
Minkowski Distance: a generalization of the Euclidean and Manhattan distances
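
Minimal sketches of these measures (the function names are mine, not from the slides):

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def manhattan_distance(p, q):
    """Sum of absolute differences between real-valued vectors (city block distance)."""
    return sum(abs(x - y) for x, y in zip(p, q))

def minkowski_distance(p, q, r=2):
    """Generalization of Manhattan (r = 1) and Euclidean (r = 2) distances."""
    return sum(abs(x - y) ** r for x, y in zip(p, q)) ** (1.0 / r)
```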
NEAREST NEIGHBOR CLASSIFICATION…

k-NN classifiers are lazy learners
 They do not build a model explicitly
 This is unlike eager learners such as decision tree induction and rule-based systems
 Classifying unknown records is relatively expensive
REFERENCES
Tan, Steinbach, Kumar, Introduction to Data Mining, Addison-Wesley, 2006.
https://machinelearningmastery.com/k-nearest-neighbors-for-machine-learning/
