Lecture7 KNN

K-Nearest Neighbor (KNN) is a supervised learning algorithm that can be used for both classification and regression tasks. It works by finding the K closest training examples in the feature space and predicting the label based on a majority vote of the labels of the K nearest neighbors. The document provides examples of using a KNN classifier to classify fruits into different categories based on their features like width, height and color score. It describes splitting the dataset into training and test sets, extracting features, finding the nearest neighbors and making predictions on new samples.


K-Nearest Neighbor (KNN) Classifier and Regressor

Liang Liang
Categories of Machine Learning
• Unsupervised Learning
  Clustering: k-means, GMM
  Dimensionality reduction (representation learning): PCA, isomap, etc., to learn a meaningful representation in a lower-dimensional space
  Probability density estimation: GMM, KDE
• Supervised Learning
  to model the relationship between measured features of data and some labels associated with the data
• Reinforcement Learning
  the goal is to develop a model (agent) that improves its performance based on interactions with the environment
Supervised Learning: classification and regression
• A classifier maps an input x (e.g., an image of a handwritten digit) to an output y: a class label.
• A regressor maps an input x (e.g., the feature vector of a house) to an output y: a target value (e.g., the sale price).

A machine learning model maps input x to output y.
Dataset: input-output pairs (x1, y1), (x2, y2), (x3, y3), ..., (xN, yN)
Binary Classification
• Data points are from two classes. A data point only belongs to one class.
• The classifier outputs y, the label of the data point x. For example:
  y = 0: male, y = 1: female
  or
  y = -1: male, y = 1: female
Multiclass Classification
• Data points are from many classes.
• A data point only belongs to one class.
• Example: handwritten digit classification with 10 possible labels: (0, 1, 2, 3, 4, 5, 6, 7, 8, 9). The classifier takes the image of a digit and outputs its label, e.g., y = 0.
• The classifier is a function y = g(x), where y is the class label of the data point x.
Multiclass Classification: one-hot encoding
• Data points are from many classes. A data point only belongs to one class.
• one-hot encoding: the output from a classifier is a vector, length = # of classes. The element at the index of the class label is 1, and all other elements are 0. With 10 classes, y = (y0, y1, ..., y9):
  label = 0 → y = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
  label = 1 → y = (0, 1, 0, 0, 0, 0, 0, 0, 0, 0)
  label = 2 → y = (0, 0, 1, 0, 0, 0, 0, 0, 0, 0)
  label = 9 → y = (0, 0, 0, 0, 0, 0, 0, 0, 0, 1)
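For concreteness, a minimal numpy sketch of one-hot encoding (the example labels below are made up):

```python
import numpy as np

# One-hot encode integer class labels for a 10-class problem
labels = np.array([0, 1, 2, 9])   # example labels (made up)
one_hot = np.eye(10)[labels]      # each row is a length-10 one-hot vector
print(one_hot)
```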
Output of a classifier could be real numbers
• Example: for the 10 possible labels (0, 1, 2, 3, 4, 5, 6, 7, 8, 9), the classifier may output the soft label (0.8, 0, 0, 0, 0, 0, 0, 0, 0.1, 0.1), which converts to the binary hard label (1, 0, 0, 0, 0, 0, 0, 0, 0, 0), i.e., class label = 0.
• "0.8" is usually interpreted as the "confidence" of the classifier that y0 = 1, i.e., that the sample belongs to class 0.


Multi-Label Classification
• The data points are from many classes.
• A data point may belong to more than one class.
• Example with two labels: y0 = 1 means it is a cat, y0 = 0 means it is not a cat; y1 = 1 means it is cute, y1 = 0 means it is not cute. The classifier outputs y = (y0, y1) = (1, 1): it is a cute cat.
Classifiers
• Many types of classifiers:
KNN classifier (K-Nearest Neighbor)
Naïve Bayes classifier
Decision Tree classifier
Random Forest classifier
SVM classifier (Support Vector Machine)
Neural Network classifier

• Each type of classifier is associated with specific learning algorithms.


A Classification Task

• A simple task: classify a fruit into four classes/categories: {1:apple, 2:mandarin, 3:orange, 4:lemon}
  (note: class 3 contains oranges that are not mandarin oranges)

A sample/instance (a 2D image) is given; we perform classification to obtain its class/fruit label: 1:apple, 2:mandarin, 3:orange, 4:lemon.

A human brain extracts features from the image and recognizes the fruit: "It is an apple!" A machine can do the same in two stages: Model-1 extracts features from the sample, and Model-2 classifies the sample based on those features: "It is an apple!"

It is not easy to develop Model-1 for feature extraction.
It is relatively easy to develop Model-2 for classification, given the features of the sample.

Now, let's develop Model-2 for classification.


Model-2 maps a feature vector to a class label: {1:apple, 2:mandarin, 3:orange, 4:lemon}

The feature vector of a fruit sample: [width, height, color_score]
color_score is a number (0~1) that describes the color.
The Fruit Dataset
The fruit dataset was created by Dr. Iain Murray at the University of Edinburgh. He bought a few dozen oranges, lemons and apples (a bucket of fruits), and recorded their features in a table.

4 classes: {1:apple, 2:mandarin, 3:orange, 4:lemon}

The fruit dataset (a table)

Each row contains the information of a fruit sample/instance:

fruit_label  fruit_name  subtype         mass (g)  width (cm)  height (cm)  color_score
1            apple       granny_smith    192       8.4         7.3          0.55
4            lemon       spanish_belsan  194       7.2         10.3         0.70

In this table: what is input x? what is output y?

http://usapple.org/the-industry/apple-varieties/
In total, there are 59 fruit samples (i.e., 59 rows) in the table.

4 classes: {1:apple, 2:mandarin, 3:orange, 4:lemon}

Select 3 features: width, height, color_score
Split the data (59 samples) into a training set (80%, 47 samples) and a testing set (20%, 12 samples):
• X_train contains the features of the 47 training samples; each row of X_train is the feature vector of a training sample.
• Y_train contains the class/fruit labels of the 47 training samples; each element of Y_train is the class label of a training sample.
• X_test contains the features of the 12 testing samples; each row of X_test is the feature vector of a testing sample.
• Y_test contains the class/fruit labels of the 12 testing samples; each element of Y_test is the class label of a testing sample.
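As a minimal sketch of this step (the file name fruit_data_with_colors.txt is an assumption; adjust it to your copy of the dataset):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the fruit table (file name is an assumption)
fruits = pd.read_table('fruit_data_with_colors.txt')

# Select 3 features: width, height, color_score
X = fruits[['width', 'height', 'color_score']].values
Y = fruits['fruit_label'].values

# Split the 59 samples into a training set (80%, 47) and a testing set (20%, 12)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0)
```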
The flowchart of a classification study
• Classification is a subcategory of supervised learning where the goal is to predict the class labels of new samples.
• Model training: split the data set (with class labels) into a training set (80%) and a testing set (20%); a learning algorithm fits the classifier to the training set.
• Model testing: the trained classifier predicts the class labels of the testing set; comparing the predicted class labels with the true class labels gives the classification accuracy.
KNN classifier (K-Nearest Neighbor)
• To use a KNN classifier, the user needs to:
  (1) choose the value of K, and (2) choose a distance measure
[Figure: 7 samples in the training set plotted in feature space, plus a new sample whose class label is unknown.]
Let’s set K=1 and use the L2-based (Euclidean) distance measure.
Task: find the nearest neighbor in the training set (by comparing distances).
The nearest neighbor in the training set is an apple; therefore, the KNN classifier will classify the input as an apple.
[Figure: among the 7 training samples, the new sample's nearest neighbor is an apple, so it is classified as an apple.]


Let’s set K=5 and use the L2-based distance measure.
Task: find the 5 nearest neighbors in the training set.
Among the 5 nearest neighbors in the training set, there are 3 lemons and 2 apples; therefore, based on majority vote, the KNN classifier will classify the input as a lemon.
[Figure: among the 7 training samples, 3 of the new sample's 5 nearest neighbors are lemons, so it is classified as a lemon.]
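To make the majority-vote mechanics concrete, here is a minimal from-scratch sketch (not the lecture's code), assuming X_train and Y_train are numpy arrays:

```python
import numpy as np
from collections import Counter

def knn_predict(x_new, X_train, Y_train, K=5):
    # L2 (Euclidean) distance from x_new to every training sample
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the K nearest neighbors
    nearest = np.argsort(distances)[:K]
    # Majority vote among the neighbors' class labels
    return Counter(Y_train[nearest]).most_common(1)[0][0]
```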
Let's build and train a KNN classifier using sk-learn
• Build a KNN classifier, name it knn, with K=5 and the L2-based distance.
• Train the KNN classifier (fit the model to the data).
Model training is to let knn memorize all of the training samples (features and labels) and build a tree for K-nearest-neighbor search.
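The slide's code itself is not preserved in this text; a sketch using scikit-learn would be:

```python
from sklearn.neighbors import KNeighborsClassifier

# Build a KNN classifier, name it knn: K=5, L2-based (Euclidean) distance
knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)

# Train the KNN classifier (fit the model to the training data)
knn.fit(X_train, Y_train)
```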
Use the trained KNN classifier to classify a sample in the testing set:
• Select a sample in the testing set (we know the true label of this sample).
• Use knn to predict the label of this sample.

Use the trained KNN classifier to classify a new, previously unseen sample that is not in the training set nor in the testing set, given the feature vector of the new sample.
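A sketch of both steps (the new sample's feature values below are made up for illustration):

```python
# Classify a sample from the testing set (index 0 is an arbitrary choice)
x = X_test[0].reshape(1, -1)          # sklearn expects a 2D array
print('true label:     ', Y_test[0])
print('predicted label:', knn.predict(x)[0])

# Classify a new, previously unseen sample: [width, height, color_score]
x_new = [[6.0, 7.0, 0.7]]             # made-up feature values
print('new sample label:', knn.predict(x_new)[0])
```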
Evaluate the Performance of the KNN Classifier (K=5)
• Classification Accuracy = (number of correctly classified samples) / (total number of samples)
• Training Accuracy: accuracy on the training set (80% of the data)
• Testing Accuracy: accuracy on the testing set (20% of the data)
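With scikit-learn this is one call per set (for a classifier, score returns the mean accuracy):

```python
# Accuracy = fraction of correctly classified samples
print('training accuracy:', knn.score(X_train, Y_train))
print('testing accuracy: ', knn.score(X_test, Y_test))
```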


Training Accuracy of a KNN classifier is 100% when K=1.
Use the KNN classifier to predict the label of a sample x that is in the training set: the classifier memorized all data points in the training set, so the nearest neighbor of x is x itself; x and its label are in KNN's memory, and x is always classified correctly.
[Figure: the 7 training samples (apples and lemons) plotted by width and height; the KNN classifier predicts each training sample's label as its own stored label.]
Use a confusion matrix to visualize the classification result on the testing set:
• Example: 2 apples are classified as apples; 2 apples are classified as oranges.
• The diagonal numbers show the correct classifications.
• The off-diagonal numbers show the wrong classifications.
• The sum of the numbers in the matrix is the total number of testing samples.
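A sketch with scikit-learn (rows are true classes, columns are predicted classes):

```python
from sklearn.metrics import confusion_matrix

Y_pred = knn.predict(X_test)
print(confusion_matrix(Y_test, Y_pred))
```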
KNN_classification.ipynb
• Use hand-engineered features to improve classification accuracy
• Feature normalization may improve classification accuracy
Plot the Decision Boundary to Visualize the Classification Result
• A point on the plot represents a sample (a feature vector), which may be in the training set, in the testing set, or unobserved yet.
• Roughly speaking, to get the decision boundary plot (e.g., the boundary between lemon and orange), we use the KNN classifier to predict the class label of every point on the plot.
• In fact, we do not need to check every point: we only need to predict the class labels of the points on a dense grid and interpolate the result.
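A sketch of the grid approach for a classifier trained on two features, e.g., width and height (the grid ranges below are assumptions chosen for fruit sizes in cm):

```python
import numpy as np
import matplotlib.pyplot as plt

# Dense grid over the 2D feature space (assumes knn was trained on 2 features)
xx, yy = np.meshgrid(np.linspace(4, 10, 200), np.linspace(4, 11, 200))
grid = np.c_[xx.ravel(), yy.ravel()]

# Predict the class label of every grid point and draw filled regions
labels = knn.predict(grid).reshape(xx.shape)
plt.contourf(xx, yy, labels, alpha=0.3)
plt.xlabel('width (cm)')
plt.ylabel('height (cm)')
plt.show()
```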
Question: how do we choose the value of K?

Question: what is the weakness of the KNN classifier?


What's the difference between clustering and classification?
• Clustering: the data is unlabeled, x1, x2, x3, ..., xN. A clustering algorithm takes an input x and outputs y, the predicted cluster label.
• Classification: the data consists of input-output pairs (x1, y1), (x2, y2), (x3, y3), ..., (xN, yN). A classifier takes an input x and outputs y, the predicted class label.

KNN can be used for classification and regression
• For classification, the output from a KNN classifier is a discrete value (class label), obtained by majority vote.
• For regression, the output from a KNN regressor is a continuous value (target value): the average target value of the K nearest neighbors is the predicted target value of the input x.
• Example: assume K=3 and the training samples x1, x2, x3 are the (K=3) nearest neighbors of x, with target values y1, y2, y3. Then the predicted target value ŷ of x is (y1 + y2 + y3)/3.
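A sketch using scikit-learn's KNeighborsRegressor (assuming X_train, Y_train hold features and target values for a regression task, such as the housing data below):

```python
from sklearn.neighbors import KNeighborsRegressor

# K=3: the prediction is the average target value of the 3 nearest neighbors
knn_reg = KNeighborsRegressor(n_neighbors=3)
knn_reg.fit(X_train, Y_train)
Y_pred = knn_reg.predict(X_test)
```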
Boston Housing Dataset
The Housing dataset contains information about houses in the different districts of Boston, collected by D. Harrison and D.L. Rubinfeld in 1978. The dataset is a large table that has 506 samples (rows) and 14 columns. Each row contains information/attributes of a region in Boston.
• x: input (13 attributes)
• y: target, MEDV: median value of owner-occupied homes in $1000s


Regression: y = f(x)
• Model training: split the data set (with target values) into a training set (80%) and a testing set (20%); a learning algorithm fits the ML model to the training set.
• Model testing: the trained model predicts the target values of the testing set; comparing the predicted target values with the true target values gives the regression accuracy.
KNN_Regression.ipynb
