Lecture 02 - KNN and ML Basics
Today’s Lecture
• Overview of machine learning process
• Example applications
– HW 1 overview
– DeepFace
A machine learning model maps from features to a prediction:
$f(x) \rightarrow y$
($x$: features, $y$: prediction)
Examples
• Classification
• Is this a dog or a cat?
• Is this email spam or not?
• Regression
• What will the stock price be tomorrow?
• What will be the high temperature tomorrow?
• Structured prediction
• E.g., predicting a label for every pixel in an image
Learning has three stages
$\theta^* = \arg\min_{\theta} \, Loss(f(X; \theta), Y)$
$\theta^*$: model parameters that minimize the loss; $X$: features of all training examples; $Y$: "ground truth" predictions of all training examples
Example
Loss: sum of squared errors
$Loss(f(X; \theta), Y) = \sum_i (f(x_i; \theta) - y_i)^2$
[Table of example feature values $X$ and target values $Y$; the values shown include 37.5, 41.2, 51.0, 48.3, 50.5]
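As a concrete illustration of the training objective above, here is a minimal sketch (assuming NumPy; the targets reuse the numeric values shown on the slide, and the constant model is my own simplification) that fits a single parameter by minimizing the sum of squared errors:

```python
import numpy as np

# Targets Y (illustrative values from the slide example)
Y = np.array([37.5, 41.2, 51.0, 48.3, 50.5])

def loss(theta, Y):
    # Sum of squared errors between predictions and ground truth
    return np.sum((theta - Y) ** 2)

# Model: f(x; theta) = theta (a constant prediction, ignoring features)
# Minimize the loss by simple gradient descent on theta
theta = 0.0
lr = 0.01
for _ in range(1000):
    grad = np.sum(2 * (theta - Y))   # derivative of the sum of squared errors
    theta -= lr * grad

print(theta)        # converges to the argmin of the loss
print(np.mean(Y))   # closed-form minimizer (the mean), for comparison
```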
• Sometimes, there are clear “train”, “val”, and “test” sets. Other times, you need to split “train” into a
train set for learning parameters and a val set for checking model performance
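A minimal sketch of such a split (assuming NumPy; the 80/20 ratio and function name are illustrative choices):

```python
import numpy as np

def train_val_split(X, Y, val_fraction=0.2, seed=0):
    """Randomly split a labeled dataset into train and val partitions."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))          # shuffle example indices
    n_val = int(len(X) * val_fraction)     # size of the validation set
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    return X[train_idx], Y[train_idx], X[val_idx], Y[val_idx]
```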
Testing: The effectiveness of the model is evaluated on a held out test set
• “Held out”: not used in training; ideally not viewed by developers, e.g. in a private test server
• In machine learning research, usually data is collected once and then randomly sampled into
train and test partitions
– Train and test samples are “i.i.d.”, independent and identically distributed
– In many real-world applications, the input to the model in deployment comes from a different
distribution than training
Recap of training and evaluation
[Scatter plot: training examples from two classes (x and o) in a 2-D feature space with axes x1 and x2, plus a new query point (+) to classify]
Key principle of machine learning
[Scatter plot: two query points (+) in the (x1, x2) feature space, one near the x cluster and one near the o cluster]
1-nearest neighbor
[Scatter plot: the query points (+) labeled according to their single nearest training example]
3-nearest neighbor
[Scatter plot: the query points (+) labeled by majority vote of their 3 nearest training examples]
5-nearest neighbor
[Scatter plot: the query points (+) labeled by majority vote of their 5 nearest training examples]
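A minimal sketch of the k-nearest-neighbor classification illustrated above (assuming NumPy, Euclidean distance, and simple majority vote; function and variable names are my own):

```python
import numpy as np
from collections import Counter

def knn_classify(x_query, X_train, y_train, k=3):
    """Predict a label for x_query by majority vote of its k nearest training examples."""
    # Euclidean (L2) distance from the query to every training example
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training examples
    nearest = np.argsort(dists)[:k]
    # Majority vote over their labels
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]
```

With k = 1, 3, or 5 this reproduces the behavior sketched in the figures above; ties can be broken arbitrarily or by distance.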
KNN Distance Function
• Euclidean or L2 norm:
– Assumes all dimensions are equally scaled
– Dominated by biggest differences
• City block or L1 norm:
– Assumes all dimensions are equally scaled
– Less sensitive to very large differences along one dimension
• Mahalanobis distance:
– Normalized by inverse feature covariance matrix: "whitening"
– When a diagonal covariance is assumed, this is equivalent to scaling
each dimension by the inverse of its standard deviation
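A minimal sketch of these three distance functions (assuming NumPy; the Mahalanobis version estimates the covariance from the training features):

```python
import numpy as np

def l2_distance(a, b):
    # Euclidean distance: square root of the sum of squared differences
    return np.sqrt(np.sum((a - b) ** 2))

def l1_distance(a, b):
    # City block distance: sum of absolute differences
    return np.sum(np.abs(a - b))

def mahalanobis_distance(a, b, X_train):
    # Normalize by the inverse covariance of the training features ("whitening")
    cov = np.cov(X_train, rowvar=False)
    cov_inv = np.linalg.inv(cov)
    d = a - b
    return np.sqrt(d @ cov_inv @ d)
```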
[Scatter plot of sitting height vs. standing height for males and females: https://www.researchgate.net/figure/Scatter-plot-of-Sitting-Height-over-Height-for-males-and-females_fig3_301724988]
• Predict sex from standing/sitting heights
• Predict sitting height from standing height
KNN Classification Demos
• https://lecture-demo.ira.uka.de/knn-demo/#
• http://vision.stanford.edu/teaching/cs231n-demos/knn/
Comments on K-NN
• Slow… but there are tricks to speed it up, e.g.
– Parts of the computation (e.g., search structures over the training set) can be precomputed
– Can use approximate nearest neighbor methods like FLANN (will come to those later)
• With infinite examples, 1-NN provably has error that is at most twice the Bayes optimal error (but we never have infinite examples)
How do we measure and analyze classification error?
• Classification error:
– Percent of examples that are wrongly predicted
[Worked example: a table of true vs. predicted labels (Y/N) for a set of test examples; a count confusion matrix with true labels as rows and predicted labels as columns; the classification error computed from those counts; and a normalized confusion matrix giving P(predicted | true)]
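A minimal sketch of computing these quantities (assuming NumPy; the Y/N labels below are made up for illustration):

```python
import numpy as np

# Hypothetical true and predicted labels (illustrative only)
y_true = np.array(["Y", "N", "Y", "Y", "N", "N", "N", "N"])
y_pred = np.array(["Y", "Y", "Y", "N", "N", "Y", "N", "Y"])

# Classification error: fraction of examples that are wrongly predicted
error = np.mean(y_true != y_pred)

# Confusion matrix of counts: rows = true label, columns = predicted label
labels = ["Y", "N"]
counts = np.array([[np.sum((y_true == t) & (y_pred == p)) for p in labels]
                   for t in labels])

# Normalized confusion matrix: P(predicted | true), each row sums to 1
p_pred_given_true = counts / counts.sum(axis=1, keepdims=True)

print(error)
print(counts)
print(p_pred_given_true)
```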
How do we measure and analyze regression error?
• Model Bias: the model is limited so that Bayes optimal error cannot be achieved for an infinite training
set
• Model Variance: given finite training data, different parameters and predictions would result from
different samplings of data
• Distribution Shift: some examples are more likely in test than training, i.e. true P(x) is different for train
and test
– E.g., datasets are collected at different times and frequency of content has changed
• Others: imperfect optimization; the final performance measure differs from the training loss
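One standard way to formalize the bias and variance terms for squared-error regression (a textbook decomposition, not taken from the slide) is:

```latex
\mathbb{E}_{D,\varepsilon}\big[(y - \hat{f}_D(x))^2\big]
  = \underbrace{\big(f(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\big[(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Here $\hat{f}_D$ is the model fit on a sampled training set $D$, $f$ is the true underlying function, and $\sigma^2$ is the label noise; distribution shift and optimization error are not captured by this decomposition.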
Effect of Training Size
[Plot (fixed model): training error and test error vs. number of training examples; the gap between the curves is the generalization error]
Sources of error and training size
[Plot (fixed model): how the sources of error relate to training set size]
DeepFace (CVPR 2014)
• Performs similarly to humans on the LFW dataset (Labeled Faces in the Wild)
• Can be used to organize photo albums, identify celebrities, or alert users when someone posts an image of them
• If this is used in a commercial deployment, what might be some unintended consequences?
• This algorithm is used by Facebook (though with expanded training data)
KNN Summary
• Key Assumptions
– Samples with similar input features will have similar output predictions
– Depending on distance measure, may assume all dimensions are equally important
• Model Parameters
– Features and predictions of the training set
• Designs
– K (number of nearest neighbors to use for prediction)
– How to combine multiple predictions if K > 1
– Feature design (selection, transformations)
– Distance function (e.g. L2, L1, Mahalanobis)
• When to Use
– Few examples per class, many classes
– Features are all roughly equally important
– Training data available for prediction changes frequently
– Can be applied to classification or regression, with discrete or continuous features (see the regression sketch after this list)
– Most powerful when combined with feature learning
• When Not to Use
– Many examples are available per class (feature learning with linear classifier may be better)
– Limited storage (cannot store many training examples)
– Limited computation (linear model may be faster to evaluate)
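As noted above, the same idea works for regression by averaging neighbor targets instead of voting; a minimal sketch (assuming NumPy; names are my own):

```python
import numpy as np

def knn_regress(x_query, X_train, y_train, k=3):
    """Predict a continuous value for x_query as the mean target of its k nearest neighbors."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # L2 distance to each training example
    nearest = np.argsort(dists)[:k]                    # k closest training examples
    return np.mean(y_train[nearest])                   # average their target values
```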
Things to remember
• Supervised machine learning involves:
1. Fitting parameters to a model using training data
2. Refining the model based on validation performance
3. Evaluating the final model on a held out test set