Lecture 7: KNN
Liang Liang
Categories of Machine Learning
• Unsupervised Learning
Clustering: k-means, GMM
Dimensionality reduction (representation learning): PCA, Isomap, etc.,
to learn a meaningful representation in a lower-dimensional space
Probability density estimation: GMM, KDE
• Supervised Learning
to model the relationship between measured features of data and
some labels associated with the data
• Reinforcement Learning
the goal is to develop a model (agent) that improves its performance
based on interactions with the environment
Supervised Learning: classification and regression
Input x → Output y
Example: feature vector of a house → Regressor → sale price (the target value)
Supervised Learning: classification and regression
Dataset: input-output pairs (x1, y1), (x2, y2), (x3, y3), ..., (xN, yN)
Binary Classification
• Data points are from two classes; a data point belongs to exactly one class.
Example: y = -1 (male) or y = 1 (female)
Multiclass Classification
• Data points are from many classes.
• A data point only belongs to one class.
Multiclass Classification: one-hot-encoding
• Data points are from many classes. A data point only belongs to one class.
one-hot encoding: the output from a classifier is a vector whose length equals the number of classes
           label = 0   label = 1   label = 2   ...   label = 9
     y0        1           0           0                 0
     y1        0           1           0                 0
     y2        0           0           1                 0
     y3        0           0           0                 0
     y4        0           0           0                 0
y =  y5        0           0           0                 0
     y6        0           0           0                 0
     y7        0           0           0                 0
     y8        0           0           0                 0
     y9        0           0           0                 1
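The table above can be sketched in NumPy (the array shapes here are illustrative assumptions, using 10 classes as in the digit example):

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Convert integer class labels to one-hot vectors of length num_classes."""
    labels = np.asarray(labels)
    y = np.zeros((len(labels), num_classes))
    y[np.arange(len(labels)), labels] = 1  # place a single 1 at each label's position
    return y

# label = 2 maps to a vector with a single 1 in position 2
print(one_hot([2], num_classes=10)[0])
```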
The output of a classifier could be real numbers (a soft label), which can be converted to a binary hard label.

Example: 10 possible labels (0, 1, 2, 3, 4, 5, 6, 7, 8, 9). The classifier outputs a soft label, which is converted to binary by keeping only the largest score:

soft label: (0.8, 0, 0, 0, 0, 0, 0, 0, 0.1, 0.1)
hard label: (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)  →  class label = 0
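The soft-to-hard conversion above is an argmax; a minimal sketch using the scores from the example:

```python
import numpy as np

# soft label: classifier scores for the 10 digit classes (values from the example)
soft = np.array([0.8, 0, 0, 0, 0, 0, 0, 0, 0.1, 0.1])

# hard label: put a single 1 at the position of the largest score
hard = np.zeros_like(soft)
hard[np.argmax(soft)] = 1

class_label = int(np.argmax(soft))
print(class_label)  # prints 0
```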
A classifier may also output more than one positive label.
Example: an image of a cute cat → Classifier → y = (y0, y1) = (1, 1), where y0 = 1 means "cat" and y1 = 1 means "cute": it is a cute cat.
Classifiers
• Many types of classifiers:
KNN classifier (K-Nearest Neighbor)
Naïve Bayes classifier
Decision Tree classifier
Random Forest classifier
SVM classifier (Support Vector Machine)
Neural Network classifier
A sample/instance
It is relatively easy to develop Model-2 for classification, given the features of the sample.
[Figure: learning pipeline — the learning algorithm trains a model; 20% of the data is held out for model testing.]
A test sample is classified as an apple because its nearest neighbor is an apple.
K = 5: the class is decided by a majority vote among the 5 nearest neighbors.
Distances are measured with the L2 (Euclidean) norm.
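The decision rule (majority vote among the K nearest training samples under the L2 distance) can be sketched as follows; the fruit feature values are made-up numbers for illustration:

```python
import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, k=5):
    """Classify x by majority vote among its k nearest training samples (L2 distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)  # L2 distance to every training sample
    nearest = np.argsort(dists)[:k]              # indices of the k closest samples
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]            # most frequent class among neighbors

# toy fruit features (width, height) -- hypothetical numbers
X_train = np.array([[7.0, 7.2], [7.1, 6.9], [6.8, 7.0],
                    [8.0, 9.5], [7.9, 9.8], [8.1, 9.6]])
y_train = np.array(['apple', 'apple', 'apple', 'orange', 'orange', 'orange'])

print(knn_predict(np.array([7.0, 7.1]), X_train, y_train, k=5))  # prints "apple"
```

With k = 5 the 5 nearest neighbors of the query are 3 apples and 2 oranges, so the majority vote returns "apple".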
Model training simply lets KNN memorize all of the training samples (features and labels) and
build a tree (e.g., a KD-tree) for fast K-nearest-neighbor search.
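As a sketch of the search tree, SciPy's cKDTree (an assumed choice; any K-nearest-neighbor search structure works) can index the memorized training features:

```python
import numpy as np
from scipy.spatial import cKDTree

# memorized training features (toy 2-D points for illustration)
X_train = np.array([[1.0, 1.0], [1.1, 0.9], [5.0, 5.0]])
tree = cKDTree(X_train)  # build the search tree once, at "training" time

# query the 2 nearest training samples of a new point
dist, idx = tree.query(np.array([1.05, 0.95]), k=2)
# idx holds the indices of the nearest neighbors in X_train,
# sorted from closest to farthest
```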
Use the trained KNN classifier to classify a sample in the testing set.
Note: if a training sample is used as the query, its nearest neighbor is itself, since x and its label are already in KNN's memory.
Use a confusion matrix to visualize the classification results on the testing set.
Example: 2 apples are classified as apples, and 2 apples are classified as oranges.
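A minimal confusion-matrix sketch; the test-set labels below are hypothetical, chosen to match the apple counts in the example (the orange results are assumed):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, classes):
    """Rows: true class; columns: predicted class."""
    cm = np.zeros((len(classes), len(classes)), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[classes.index(t), classes.index(p)] += 1
    return cm

# 2 apples classified as apples, 2 apples as oranges; oranges all correct (assumed)
y_true = ['apple', 'apple', 'apple', 'apple', 'orange', 'orange']
y_pred = ['apple', 'apple', 'orange', 'orange', 'orange', 'orange']
print(confusion_matrix(y_true, y_pred, ['apple', 'orange']))
```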
Clustering
Input 𝑥 Output 𝑦: predicted cluster label
[Figure: supervised learning pipeline — the data set (with target values) is split into an 80% training set, used by the learning algorithm for model training, and a 20% testing set, used for model testing.]