Classification (K-Nearest Neighbor)
CLASSIFIER
CLASSIFICATION: DEFINITION
Given a collection of records (the training set).
Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately
as possible.
A test set is used to determine the accuracy of the model.
Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
ILLUSTRATING CLASSIFICATION TASK
A learning algorithm builds a model from the training set; the model is then applied to the test set.

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
6    No       Medium   60K      No

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
15   No       Large    67K      ?
EXAMPLES OF CLASSIFICATION TASK
Predicting tumor cells as benign or malignant
NEAREST NEIGHBOR CLASSIFIERS
Compute the distance from the test record to the training records, then choose the k "nearest" training records.
NEAREST-NEIGHBOR CLASSIFIER
A nearest-neighbor classifier requires three things:
– The set of stored records
– A distance metric to compute the distance between records
– The value of k, the number of nearest neighbors to retrieve
NEAREST NEIGHBOR CLASSIFICATION
Compute the distance between two points, e.g. using the Euclidean distance:
$d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$
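A quick worked example with illustrative values $p = (1, 2)$ and $q = (4, 6)$:
$d(p, q) = \sqrt{(1 - 4)^2 + (2 - 6)^2} = \sqrt{9 + 16} = 5$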
K-NEAREST NEIGHBOR (K-NN) ALGORITHM
For every point in the dataset:
    calculate the distance between X and the current point
Sort the distances in increasing order
Take the k items with the lowest distances to X
Find the majority class among these k items
Return the majority class as our prediction for the class of X
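A minimal, self-contained Python sketch of this procedure (pure standard library; knn_predict and the sample data are illustrative, not taken from the tutorial linked below):

import math
from collections import Counter

def knn_predict(training_set, x, k):
    """Predict the class of x from labelled records of the form (features, label)."""
    # Calculate the distance between x and every training point.
    distances = []
    for features, label in training_set:
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(features, x)))
        distances.append((d, label))
    # Sort the distances in increasing order and keep the k closest labels.
    distances.sort(key=lambda pair: pair[0])
    k_labels = [label for _, label in distances[:k]]
    # Return the majority class among these k items.
    return Counter(k_labels).most_common(1)[0][0]

# Example usage with made-up 2-D points:
train = [((1.0, 1.0), 'A'), ((1.2, 0.8), 'A'), ((6.0, 6.0), 'B'), ((5.5, 6.5), 'B')]
print(knn_predict(train, (1.1, 1.0), k=3))   # prints 'A'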
HOW TO IMPLEMENT K-NN IN PYTHON
1. Handle Data: Open the dataset from CSV and split into test/train
datasets.
2. Similarity: Calculate the distance between two data instances.
3. Neighbors: Locate k most similar data instances.
4. Response: Get the majority-voted response from a set of neighbors.
5. Accuracy: Summarize the accuracy of predictions.
6. Main: Tie it all together.
https://machinelearningmastery.com/tutorial-to-implement-k-nearest-neighbors-in-python-from-scratch/
HOW TO IMPLEMENT K-NN IN PYTHON
1. Handle Data: Open the dataset from CSV and split into
test/train datasets.
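A possible sketch of this step, assuming a CSV file whose rows hold numeric attributes followed by a class label in the last column (the function name and split ratio are illustrative):

import csv
import random

def load_dataset(filename, split_ratio=0.67):
    """Read a CSV file and randomly split its rows into training and test sets."""
    training_set, test_set = [], []
    with open(filename, newline='') as f:
        for row in csv.reader(f):
            if not row:
                continue  # skip blank lines
            # Assumed layout: numeric attributes first, class label in the last column.
            features = [float(value) for value in row[:-1]]
            label = row[-1]
            # Roughly split_ratio of the rows go to training, the rest to test.
            if random.random() < split_ratio:
                training_set.append((features, label))
            else:
                test_set.append((features, label))
    return training_set, test_set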
HOW TO IMPLEMENT K-NN IN PYTHON
2. Similarity: Calculate the distance between two
data instances.
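This step might look as follows, directly mirroring the Euclidean distance formula shown earlier (the function name is an assumption):

import math

def euclidean_distance(p, q):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(euclidean_distance([2, 2, 2], [4, 4, 4]))  # 3.464...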
HOW TO IMPLEMENT K-NN IN PYTHON
3. Neighbors: Locate k most similar data instances.
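A sketch of this step, reusing euclidean_distance and the (features, label) record layout from the previous sketches:

def get_neighbors(training_set, test_features, k):
    """Return the k training records closest to the test instance."""
    # Pair each training record with its distance to the test instance.
    scored = [(record, euclidean_distance(record[0], test_features))
              for record in training_set]
    # Sort by distance and keep the k nearest records.
    scored.sort(key=lambda pair: pair[1])
    return [record for record, _ in scored[:k]]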
HOW TO IMPLEMENT K-NN IN PYTHON
4. Response: A function that returns the majority-voted response from a set of neighbors.
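One way to write the majority vote, again assuming (features, label) records:

from collections import Counter

def get_response(neighbors):
    """Return the most common class label among the neighbor records."""
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]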
HOW TO IMPLEMENT K-NN IN PYTHON
5. Accuracy: Summarize the accuracy of predictions
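A sketch of the accuracy summary, expressed as a percentage of correctly classified test records:

def get_accuracy(test_set, predictions):
    """Percentage of test records whose predicted label matches the actual label."""
    correct = sum(1 for (_, actual), predicted in zip(test_set, predictions)
                  if actual == predicted)
    return correct / len(test_set) * 100.0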
HOW TO IMPLEMENT K-NN IN PYTHON
6. Main: Tie it all together.
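A sketch tying the previous pieces together; the filename 'iris.data', the split ratio, and k = 3 are assumptions, not requirements:

def main():
    # Assumed CSV of numeric attributes with the class label in the last column.
    training_set, test_set = load_dataset('iris.data', split_ratio=0.67)
    k = 3
    predictions = []
    for features, _ in test_set:
        neighbors = get_neighbors(training_set, features, k)
        predictions.append(get_response(neighbors))
    print('Accuracy: %.2f%%' % get_accuracy(test_set, predictions))

if __name__ == '__main__':
    main()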
NEAREST NEIGHBOR CLASSIFICATION…
Choosing the value of k:
If k is too small, the classifier is sensitive to noise points.
If k is too large, the neighborhood may include points from other classes.
NEAREST NEIGHBOR CLASSIFICATION…
Scaling issues
Attributes may have to be scaled to prevent
distance measures from being dominated by one
of the attributes
Example:
height of a person may vary from 1.5m to 1.8m
weight of a person may vary from 90lb to 300lb
income of a person may vary from $10K to $1M
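A minimal sketch of one common remedy, min-max scaling of each attribute to [0, 1] (the helper name is illustrative):

def min_max_scale(rows):
    """Rescale each attribute column of a list of numeric feature vectors to [0, 1]."""
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    maxs = [max(col) for col in columns]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, mins, maxs)]
            for row in rows]

# Height (m), weight (lb), income ($) live on very different scales:
print(min_max_scale([[1.5, 90, 10000], [1.8, 300, 1000000]]))  # [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]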
NEAREST NEIGHBOR CLASSIFICATION…
Euclidean distance can give counter-intuitive results for high-dimensional binary data:
111111111110 vs 011111111111: d = 1.4142
100000000000 vs 000000000001: d = 1.4142
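These values can be checked directly; each pair differs in exactly two bit positions, so both distances equal sqrt(2) (a small verification snippet using Python's math.dist):

import math

for s, t in [('111111111110', '011111111111'),
             ('100000000000', '000000000001')]:
    p = [int(c) for c in s]
    q = [int(c) for c in t]
    print(s, 'vs', t, '->', round(math.dist(p, q), 4))  # both print 1.4142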
REFERENCES
Tan, Steinbach, Kumar, Introduction to Data Mining, Addison-Wesley, 2005
https://machinelearningmastery.com/k-nearest-neighbors-for-machine-learning/