
K-Nearest Neighbor and ML Basics

Applied Machine Learning

Derek Hoiem

(Title image: Dall-E)
Today’s Lecture
• Overview of machine learning process

• K-Nearest Neighbor Algorithm

• Measuring and understanding error

• Example applications
– HW 1 overview
– Deep Face
Machine learning model maps from features to prediction

$$f(x) \rightarrow y$$

($x$: features; $y$: prediction)

Examples
• Classification
• Is this a dog or a cat?
• Is this email spam or not?

• Regression
• What will the stock price be tomorrow?
• What will be the high temperature tomorrow?

• Structured prediction
• What is the category label of every pixel in an image?
Learning has three stages

• Training: optimize model parameters

• Validation: intermediate evaluations to design/select model

• Test: final performance evaluation


Training: the model is fit to data to minimize a loss or maximize an
objective function


$$\theta^* = \arg\min_{\theta} \, \mathrm{Loss}(f(X;\theta), Y)$$

where $\theta^*$ is the model parameters that minimize the loss, $X$ is the features of all training examples, and $Y$ is the "ground truth" predictions of all training examples.

Example: learn to predict the next day's temperature given the preceding days' temperatures.

Data: $X$ has one row per example and one column per feature (the preceding days' temperatures); $Y$ holds the next day's temperature for each example:

    X (preceding days)        Y (next day)
    37.5  41.2  51.0  48.3    50.5
    47.0  46.5  48.9  50.5    47.6
    ...                       ...
    67.0  64.7  63.0  61.4    60.2

Loss: sum squared error

$$\mathrm{Loss}(f(X;\theta), Y) = \sum_i (f(X_i;\theta) - y_i)^2$$

Model: linear

$$f(X_i;\theta) = A X_i + b$$

Optimization via ordinary least squares regression.
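As an illustration (not from the slides), here is a minimal Python sketch of this fit; the rows reuse the toy temperatures above, and `np.linalg.lstsq` performs the ordinary least squares step:

```python
import numpy as np

# Toy data from the table: each row of X is 4 preceding days' temperatures,
# and y is the next day's temperature.
X = np.array([[37.5, 41.2, 51.0, 48.3],
              [47.0, 46.5, 48.9, 50.5],
              [67.0, 64.7, 63.0, 61.4]])
y = np.array([50.5, 47.6, 60.2])

# Append a column of ones so the bias b is learned along with A.
X1 = np.hstack([X, np.ones((X.shape[0], 1))])

# Ordinary least squares: minimizes sum_i (f(X_i; theta) - y_i)^2
theta, *_ = np.linalg.lstsq(X1, y, rcond=None)
A, b = theta[:-1], theta[-1]

pred = X1 @ theta  # f(X_i; theta) = A X_i + b
print("predictions:", pred, "loss:", np.sum((pred - y) ** 2))
```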
Model design and "hyperparameter" tuning is performed using a validation set

• Select model (e.g., linear regression vs. a neural network)

• Set training parameters
– Feature selection
– Learning rate, regularization parameters, …

• Sometimes, there are clear “train”, “val”, and “test” sets. Other times, you need to split “train” into a
train set for learning parameters and a val set for checking model performance
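For that second case, a minimal sketch (assuming scikit-learn; the data here is a placeholder) of carving a validation set out of the training data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder training set: 100 examples with 5 features each.
X_train = np.random.rand(100, 5)
y_train = np.random.randint(0, 2, size=100)

# Hold out 20% of "train" as a validation set for model selection;
# the real test set stays untouched until the final evaluation.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=0)
```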
Testing: The effectiveness of the model is evaluated on a held out test set

• “Held out”: not used in training; ideally not viewed by developers, e.g. in a private test server

• Common performance measures

– Classification error: $\frac{1}{N}\sum_i \mathbf{1}[f(x_i) \neq y_i]$ (for a classification model; $y_i$ is the target/true label)
– Cross-entropy: $-\frac{1}{N}\sum_i \log P(y_i \mid x_i)$ (for a probabilistic model)
– RMSE: $\sqrt{\frac{1}{N}\sum_i (f(x_i) - y_i)^2}$ (regression measure)
– $R^2 = 1 - \frac{\sum_i (y_i - f(x_i))^2}{\sum_i (y_i - \bar{y})^2}$ (unitless regression measure; $\bar{y}$ is the expectation/mean/avg of $y$)

• In machine learning research, usually data is collected once and then randomly sampled into
train and test partitions
– Train and test samples are “i.i.d.”, independent and identically distributed
– In many real-world applications, the input to the model in deployment comes from a different
distribution than training
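A quick NumPy sketch of the first two measures (toy labels and probabilities are mine):

```python
import numpy as np

y_true = np.array([0, 1, 1, 0])        # true class indices
y_pred = np.array([0, 1, 0, 0])        # hard predictions
probs = np.array([[0.8, 0.2],          # predicted P(class | x) per example
                  [0.3, 0.7],
                  [0.6, 0.4],
                  [0.9, 0.1]])

# Fraction of wrongly predicted examples.
classification_error = np.mean(y_pred != y_true)

# Mean negative log-probability assigned to the true class.
cross_entropy = -np.mean(np.log(probs[np.arange(len(y_true)), y_true]))

print(classification_error, cross_entropy)
```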
Recap of training and evaluation

[Diagram: Training Data feeds "Train Model"; Validation Data feeds "Validate Model", which evaluates the model design; initial results loop back to modify the model/training design; Test Data feeds "Test Model" for the final evaluation.]
What class do you think the '+' belongs to?

[Figure: scatter of 'x' and 'o' training points in a 2-D feature space with axes x1 and x2, with an unlabeled '+' query point among them.]
Key principle of machine learning

Given feature/target pairs $(x_i, y_i)$: if $x$ is similar to $x_i$, then $y$ is probably similar to $y_i$.

With variations on how you define similarity and make predictions based on multiple similar examples, this principle underlies virtually all ML algorithms.
Nearest neighbor algorithm

For given test features, assign the label / target value of the most similar training features:

1. Find the training example most similar to the test features: $i^* = \arg\min_i \mathrm{dist}(x, X_i)$
2. Predict its label / target value: $y = y_{i^*}$

Distance function is up to the designer. Simplest is L2 distance.
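A minimal NumPy sketch of the algorithm (function name and toy data are mine, assuming L2 distance):

```python
import numpy as np

def nearest_neighbor_predict(X_train, y_train, x):
    """Return the label of the training example closest to x (L2 distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)  # distance to every training row
    return y_train[np.argmin(dists)]             # label of the closest one

# Toy usage: two 2-D classes.
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
y_train = np.array(['o', 'o', 'x', 'x'])
print(nearest_neighbor_predict(X_train, y_train, np.array([5.5, 4.8])))  # -> 'x'
```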


K-nearest neighbor: predict based on K closest training samples

[Figures: the same scatter of 'x' and 'o' training points in (x1, x2) feature space with two '+' query points, shown for 1-nearest, 3-nearest, and 5-nearest neighbor predictions.]
KNN Distance Function
• Euclidean or L2 norm: $\|x - x_i\|_2 = \sqrt{\sum_d (x_d - x_{i,d})^2}$
– Assumes all dimensions are equally scaled
– Dominated by the biggest differences
• City Block or L1 norm: $\|x - x_i\|_1 = \sum_d |x_d - x_{i,d}|$
– Assumes all dimensions are equally scaled
– Less sensitive to very large differences along one dimension
• Mahalanobis distance: $\sqrt{(x - x_i)^\top \Sigma^{-1} (x - x_i)}$
– Normalized by inverse feature covariance matrix: "whitening"
– When diagonal covariance is assumed, this is equivalent to scaling each dimension by $1/\sigma_d$

($x_i$ and $x$ are training and test sample feature vectors)
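A NumPy sketch of the three distances (helper names and toy values are mine; Sigma stands for the feature covariance estimated from training data):

```python
import numpy as np

def l2(x, xi):
    return np.sqrt(np.sum((x - xi) ** 2))

def l1(x, xi):
    return np.sum(np.abs(x - xi))

def mahalanobis(x, xi, Sigma):
    d = x - xi
    return np.sqrt(d @ np.linalg.inv(Sigma) @ d)  # whitening by inverse covariance

x  = np.array([1.0, 2.0])
xi = np.array([3.0, 0.0])
Sigma = np.array([[4.0, 0.0],    # diagonal covariance: equivalent to scaling
                  [0.0, 1.0]])   # each dimension by 1 / sigma_d
print(l2(x, xi), l1(x, xi), mahalanobis(x, xi, Sigma))
```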


KNN Classification vs Regression

• For classification, the prediction is usually the mode (most common class) of the returned labels

• For regression, the prediction is usually the arithmetic mean (average, informally) of the returned values

[Figures: scatter plots of sitting height vs. standing height for males and females, illustrating "predict sex from standing/sitting heights" (classification) and "predict sitting height from standing height" (regression). Source: https://www.researchgate.net/figure/Scatter-plot-of-Sitting-Height-over-Height-for-males-and-females_fig3_301724988]
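A sketch contrasting the two combination rules for K > 1 (helper names are mine: mode for classification, mean for regression):

```python
import numpy as np
from collections import Counter

def knn_indices(X_train, x, k):
    dists = np.linalg.norm(X_train - x, axis=1)
    return np.argsort(dists)[:k]              # indices of the k closest examples

def knn_classify(X_train, y_train, x, k=3):
    idx = knn_indices(X_train, x, k)
    return Counter(y_train[idx]).most_common(1)[0][0]  # mode of the labels

def knn_regress(X_train, y_train, x, k=3):
    idx = knn_indices(X_train, x, k)
    return np.mean(y_train[idx])                       # mean of the values

# Toy usage.
X_train = np.array([[0., 0.], [1., 0.], [0., 1.], [5., 5.], [6., 5.]])
labels = np.array(['o', 'o', 'o', 'x', 'x'])
values = np.array([0.0, 0.1, 0.2, 5.0, 5.5])
print(knn_classify(X_train, labels, np.array([0.5, 0.5])))  # -> 'o'
print(knn_regress(X_train, values, np.array([5.5, 5.0])))   # mean of 3 closest
```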
KNN Classification Demos
• https://lecture-demo.ira.uka.de/knn-demo/#
• http://vision.stanford.edu/teaching/cs231n-demos/knn/
Comments on K-NN

• Simple: an excellent baseline, and sometimes hard to beat
– Naturally scales with data: it may be the only choice when you have one example per class, and it still often achieves good performance when you have many
– Higher K gives smoother functions

• Slow… but there are tricks to speed it up, e.g.
– Parts of the distance computation can be precomputed
– Can use approximate nearest neighbor methods like FLANN (will come to those later)

• No training time (unless you learn a distance function)

• With infinite examples, 1-NN provably has error that is at most twice the Bayes optimal error (but we never have infinite examples)
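In practice, a library implementation with a spatial index is an easy speed-up; a sketch (toy data is mine) using scikit-learn's KNeighborsClassifier with a KD-tree:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D dataset: class is 1 when the first feature exceeds the second.
X_train = np.random.rand(1000, 2)
y_train = (X_train[:, 0] > X_train[:, 1]).astype(int)

# algorithm='kd_tree' builds a spatial index instead of brute-force search.
knn = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')
knn.fit(X_train, y_train)
print(knn.predict([[0.7, 0.2]]))  # -> [1]
```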
How do we measure and analyze classification error?

• Classification error:
– Percent of examples that are wrongly predicted

• Confusion matrix: joint/conditional distribution of predicted and true labels
– Can be a count or a probability
– Practice varies as to whether "Predicted" or "True" is on the y-axis; need to label it
(example plot: https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html)
Measuring error example

Data (one row per example):

    True    Predicted
    Y       Y
    N       Y
    Y       Y
    Y       N
    N       N
    N       Y
    N       N
    N       Y

Confusion Matrix (count), rows = True, columns = Predicted:

              Pred Y    Pred N
    True Y       2         1
    True N       3         2

Classification Error: fraction wrongly predicted = (1 + 3) / 8 = 50%

P(predicted | true): normalize each row of the count matrix, e.g. P(pred = Y | true = Y) = 2/3
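A sketch computing the same quantities with scikit-learn, using the labels as reconstructed above:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array(['Y', 'N', 'Y', 'Y', 'N', 'N', 'N', 'N'])
y_pred = np.array(['Y', 'Y', 'Y', 'N', 'N', 'Y', 'N', 'Y'])

# Rows = true label, columns = predicted label (scikit-learn's convention).
counts = confusion_matrix(y_true, y_pred, labels=['Y', 'N'])
print(counts)

# P(predicted | true): normalize each row to sum to 1.
print(counts / counts.sum(axis=1, keepdims=True))

print("classification error:", np.mean(y_true != y_pred))
```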
How do we measure and analyze regression error?

• Root mean squared error: $\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_i (f(x_i) - y_i)^2}$

• Mean absolute error: $\mathrm{MAE} = \frac{1}{N}\sum_i |f(x_i) - y_i|$

• $R^2 = 1 - \frac{\text{(unexplained variance)}}{\text{(total variance)}} = 1 - \frac{\sum_i (y_i - f(x_i))^2}{\sum_i (y_i - \bar{y})^2}$

• RMSE/MAE are unit-dependent measures of accuracy, while $R^2$ is a unitless measure of the fraction of variance explained

Fig: https://medium.com/analytics-vidhya/mae-mse-rmse-coefficient-of-determination-adjusted-r-squared-which-metric-is-better-cd0326a5697e
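The same three measures as a NumPy sketch (toy values are mine):

```python
import numpy as np

y = np.array([50.5, 47.6, 60.2, 55.0])      # true values
pred = np.array([49.0, 48.1, 58.7, 56.2])   # model outputs

rmse = np.sqrt(np.mean((pred - y) ** 2))                       # unit-dependent
mae = np.mean(np.abs(pred - y))                                # unit-dependent
r2 = 1 - np.sum((y - pred) ** 2) / np.sum((y - np.mean(y)) ** 2)  # unitless
print(rmse, mae, r2)
```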
Sources of test error for a trained model
• Intrinsic: sometimes it is not possible to achieve zero error given available features (e.g. handwriting,
weather prediction)
– Bayes optimal error: The error if the true function P(y|x) is known

• Model Bias: the model is limited so that Bayes optimal error cannot be achieved for an infinite training
set

• Model Variance: given finite training data, different parameters and predictions would result from
different samplings of data

• Distribution Shift: some examples are more likely in test than training, i.e. true P(x) is different for train
and test
– E.g., datasets are collected at different times and frequency of content has changed

• Function Shift: P(y|x) is changed between train and test


– E.g., the predicted answer to “What is your favorite TV show?” changes over time

Others: imperfect optimization; a final performance measure that differs from the training loss
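Model variance in particular is easy to observe empirically. A sketch (setup is mine) that refits the same linear model on different random samplings of the data and measures how much its prediction at one query point moves:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_line(x, y):
    """Least-squares line fit; returns (slope, intercept)."""
    return np.polyfit(x, y, 1)

x_query = 0.5
preds = []
for trial in range(100):
    # A fresh finite training sample from the same underlying distribution.
    x = rng.uniform(0, 1, size=20)
    y = 2 * x + 1 + rng.normal(0, 0.5, size=20)   # true relation plus noise
    m, b = fit_line(x, y)
    preds.append(m * x_query + b)

# Spread of predictions across samplings = model variance at x_query.
print("mean prediction:", np.mean(preds), "std:", np.std(preds))
```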
Effect of Training Size

[Figure: for a fixed model, error vs. number of training examples. Training error rises and testing error falls as training size grows; the gap between the two curves is the generalization error.]
Sources of error and training size

[Figure: for a fixed model, train and test error vs. number of training examples, annotated as follows:]
– The gap between the test error and the test error with infinite training examples is due to limited training data (model variance) and distribution shift.
– The gap between the test and train errors with infinite training examples is due to differences in P(y|x) between training and test (function shift).
– The train error with infinite training examples is due to the limited power of the model (model bias) and unavoidable intrinsic error (Bayes optimal error).
Something to think about…

Why is it important to have a validation set? Why not simply


evaluate all your trained models on the test set and then choose
the best?
HW 1 Preview
KNN Usage Example: Deep Face

CVPR 2014

1. Detect facial features
2. Align faces to be frontal
3. Extract features using a deep network trained to classify images by person (dataset based on employee faces)
4. In testing, extract features from the deep network and use a nearest neighbor classifier to assign identity

• Performs similarly to humans on the LFW (Labeled Faces in the Wild) dataset
• Can be used to organize photo albums, identify celebrities, or alert users when someone posts an image of them
• If this is used in a commercial deployment, what might be some unintended consequences?
• This algorithm is used by Facebook (though with expanded training data)
KNN Summary
• Key Assumptions
– Samples with similar input features will have similar output predictions
– Depending on distance measure, may assume all dimensions are equally important
• Model Parameters
– Features and predictions of the training set
• Designs
– K (number of nearest neighbors to use for prediction)
– How to combine multiple predictions if K > 1
– Feature design (selection, transformations)
– Distance function (e.g. L2, L1, Mahalanobis)
• When to Use
– Few examples per class, many classes
– Features are all roughly equally important
– Training data available for prediction changes frequently
– Can be applied to classification or regression, with discrete or continuous features
– Most powerful when combined with feature learning
• When Not to Use
– Many examples are available per class (feature learning with linear classifier may be better)
– Limited storage (cannot store many training examples)
– Limited computation (linear model may be faster to evaluate)
Things to remember
• Supervised machine learning involves:
1. Fitting parameters to a model using training data
2. Refining the model based on validation
performance
3. Evaluating the final model on a held out test set

• KNN is a simple but effective classifier/regressor


that predicts the label of the most similar training
example(s)

• With more samples, fitting the training data


becomes harder, but test error is expected to
decrease

• Test errors have many sources


– intrinsic to problem
– model bias / limited power
– model variance / limited training data
– differences in training and test distributions

• Model design and fitting is just one part of a larger process of collecting data, developing, and deploying an application
Next week
• Probabilistic models and Naïve Bayes
• Linear and Logistic Regression
