CS550 Lec7-ClassificationIntro
Lecture 7: Classification
A simple idea: Predict the label of a test image using its distance from each of the 2 training images
Rule: if d(test image, cat image) < d(test image, dog image), predict cat; else predict dog

Wait. Is it ML? It seems like just a simple "rule". Where is the "learning" part in this?

Excellent question! Glad you asked! Even this simple model can be learned, for example, for the feature extraction/selection part and/or for the distance computation part. Some possibilities: use a feature learning/selection algorithm to extract features, and use a Mahalanobis distance where you learn the W matrix (instead of using a predefined W), using "distance metric learning" techniques.
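A minimal sketch of this rule (hypothetical example; assumes the images are given as flattened pixel/feature vectors and uses plain Euclidean distance):

import numpy as np

def predict_cat_or_dog(test_img, cat_img, dog_img):
    """Predict by comparing distances to the two training images."""
    d_cat = np.linalg.norm(test_img - cat_img)   # Euclidean distance to the cat image
    d_dog = np.linalg.norm(test_img - dog_img)   # Euclidean distance to the dog image
    return "cat" if d_cat < d_dog else "dog"

# Toy usage with tiny made-up "images" (feature vectors)
cat_img = np.array([1.0, 0.2, 0.1])
dog_img = np.array([0.1, 0.9, 1.0])
test_img = np.array([0.8, 0.3, 0.2])
print(predict_cat_or_dog(test_img, cat_img, dog_img))   # -> "cat"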
Capturing Variations by using more Training Data
Just one input per class may not sufficiently capture variations in a class
Both LwP and KNN will use multiple inputs per class but in different ways
Learning with Prototypes (LwP)
Basic idea: Represent each class by a “prototype” vector
Predict label of each test input based on its distances from the class prototypes
Predicted label will be the class that is the closest to the test input
How we compute distances can have an effect on the accuracy of this model
(may need to try Euclidean, weighted Euclidean, Mahalanobis, or something else)
Pic from: https://www.reddit.com/r/dataisbeautiful/comments/3wgbv9/average_handwritten_digit_oc/
Learning with Prototypes (LwP): An Illustration
Suppose the task is binary classification (two classes assumed pos and neg)
Prototype of the positive class: $\mu_{+} = \frac{1}{N_{+}} \sum_{n:\, y_n = +1} \mathbf{x}_n$, and of the negative class: $\mu_{-} = \frac{1}{N_{-}} \sum_{n:\, y_n = -1} \mathbf{x}_n$

(Figure: the two class prototypes and a test example to be classified.)
In general, if classes are not equisized and spherical, LwP with Euclidean
distance will usually not work well (but improvements possible; will discuss
later)
LwP: Some Key Aspects
Very simple, interpretable, and lightweight model
Just requires computing and storing the class prototype vectors
Works with any number of classes (thus for multi-class classification as well)
With a learned distance function, can work very well even with very few
examples from each class (used in some “few-shot learning” models nowadays
– if interested, please refer to “Prototypical Networks for Few-shot Learning”)
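A minimal LwP sketch in Python (an illustration, assuming inputs as a numpy array X of shape (N, D), integer labels y, and plain Euclidean distance):

import numpy as np

def fit_prototypes(X, y):
    """Compute one prototype (mean vector) per class."""
    classes = np.unique(y)
    prototypes = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, prototypes

def predict_lwp(X_test, classes, prototypes):
    """Assign each test input to the class with the nearest prototype."""
    # Pairwise Euclidean distances, shape (num_test, num_classes)
    dists = np.linalg.norm(X_test[:, None, :] - prototypes[None, :, :], axis=-1)
    return classes[np.argmin(dists, axis=1)]

# Toy usage
X = np.array([[0., 0.], [1., 0.], [5., 5.], [6., 5.]])
y = np.array([0, 0, 1, 1])
classes, protos = fit_prototypes(X, y)
print(predict_lwp(np.array([[0.5, 0.2], [5.5, 4.9]]), classes, protos))   # -> [0 1]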
Confusion Matrix
• Accuracy is not enough to evaluate classification algorithms
  • Class imbalance problem
  • One class may be more important than the other!
• Precision = TP / (TP + FP)
• Recall = TP / (TP + FN) {also called True +ve Rate, Sensitivity}
• Discuss:
  • Safe Video for Kids
  • Surveillance
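A small illustration with scikit-learn (toy binary labels, made up for the example):

from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows = actual class, columns = predicted class
print(confusion_matrix(y_true, y_pred))
# Precision = TP / (TP + FP): of the inputs predicted positive, how many really are
print(precision_score(y_true, y_pred))
# Recall = TP / (TP + FN): of the truly positive inputs, how many were caught
print(recall_score(y_true, y_pred))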
Precision/Recall Tradeoff
We want to choose the classifier that maximizes the area under the curve
The Precision-Recall curve is preferred when the +ve class is rare,
or when you want to reduce the false +ves.
Which classifier is better?
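One possible way to compare two classifiers, sketched with scikit-learn (synthetic data with a rare positive class and two arbitrary classifiers, all made up for illustration):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic data where the +ve class is rare
X, y = make_classification(n_samples=3000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for name, clf in [("LogReg", LogisticRegression(max_iter=1000)),
                  ("RandomForest", RandomForestClassifier(random_state=0))]:
    scores = clf.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]   # score for the +ve class
    # Average precision summarizes the precision-recall curve (its area)
    print(name, average_precision_score(y_te, scores))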
Multiclass Classification (MNIST digits)
Looking at this plot, it seems that your efforts should be spent on reducing the false 8s:
Gather more training data for digits that look like 8s (but are not) so the classifier can learn to distinguish them from real 8s.
Engineer new features to help the classifier, e.g., count the number of closed loops (8 has two, 6 has one, 5 has none).
Preprocess the images (e.g., using Scikit-Image, Pillow, or OpenCV) to make some patterns stand out more, such as closed loops.
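A sketch of this kind of error analysis (using scikit-learn's small built-in digits dataset as a stand-in for MNIST; the classifier choice is arbitrary):

import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

X, y = load_digits(return_X_y=True)
y_pred = cross_val_predict(SGDClassifier(random_state=0), X, y, cv=3)

cm = confusion_matrix(y, y_pred).astype(float)
cm /= cm.sum(axis=1, keepdims=True)   # normalize each row (true class)
np.fill_diagonal(cm, 0)               # hide correct predictions to highlight the errors
print(np.round(cm, 2))                # large off-diagonal entries show the confusing digit pairs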
Learning with Prototypes (LwP)
Prototypes: $\mu_{+} = \frac{1}{N_{+}} \sum_{n:\, y_n = +1} \mathbf{x}_n$ and $\mu_{-} = \frac{1}{N_{-}} \sum_{n:\, y_n = -1} \mathbf{x}_n$

Weighted distance: $d_{\mathbf{W}}(\mathbf{a}, \mathbf{b}) = \sqrt{(\mathbf{a} - \mathbf{b})^\top \mathbf{W} (\mathbf{a} - \mathbf{b})}$

W will be a 2x2 symmetric matrix in this case (chosen by us or learned). A good W will help bring points from the same class closer and move different classes apart.

Note: Mahalanobis distance also has the effect of rotating the axes, which helps
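A small sketch of this distance function (W assumed symmetric positive semi-definite; Euclidean distance is the special case W = I):

import numpy as np

def mahalanobis_dist(a, b, W):
    """d_W(a, b) = sqrt((a - b)^T W (a - b)); W must be symmetric PSD."""
    diff = a - b
    return np.sqrt(diff @ W @ diff)

a, b = np.array([1.0, 2.0]), np.array([3.0, 0.0])
W = np.array([[2.0, 0.5],
              [0.5, 1.0]])            # chosen by us, or learned via distance metric learning
print(mahalanobis_dist(a, b, W))      # with W = np.eye(2) this reduces to the Euclidean distance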
Improving LwP when classes are complex-shaped
Even with weighted Euclidean or Mahalanobis distance, LwP is still a linear classifier
Exercise: Prove the above fact. You may use the following hint
The squared Mahalanobis distance can be written as $(\mathbf{a} - \mathbf{b})^\top \mathbf{W} (\mathbf{a} - \mathbf{b})$, where $\mathbf{W}$ is a symmetric (PSD) matrix and thus can be written as $\mathbf{W} = \mathbf{L}^\top \mathbf{L}$ for some matrix $\mathbf{L}$
Showing it for Mahalanobis is enough; weighted Euclidean is a special case with a diagonal $\mathbf{W}$
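For reference, a sketch of the argument (squared distances suffice since the square root is monotonic):

\begin{align*}
d_{\mathbf{W}}(\mathbf{x}, \boldsymbol{\mu}_-)^2 - d_{\mathbf{W}}(\mathbf{x}, \boldsymbol{\mu}_+)^2
&= (\mathbf{x}-\boldsymbol{\mu}_-)^\top \mathbf{W} (\mathbf{x}-\boldsymbol{\mu}_-)
 - (\mathbf{x}-\boldsymbol{\mu}_+)^\top \mathbf{W} (\mathbf{x}-\boldsymbol{\mu}_+) \\
&= 2(\boldsymbol{\mu}_+ - \boldsymbol{\mu}_-)^\top \mathbf{W}\,\mathbf{x}
 + \boldsymbol{\mu}_-^\top \mathbf{W} \boldsymbol{\mu}_-
 - \boldsymbol{\mu}_+^\top \mathbf{W} \boldsymbol{\mu}_+
 = \mathbf{w}^\top \mathbf{x} + b,
\end{align*}

which is linear in $\mathbf{x}$, with $\mathbf{w} = 2\mathbf{W}(\boldsymbol{\mu}_+ - \boldsymbol{\mu}_-)$ and $b = \boldsymbol{\mu}_-^\top \mathbf{W} \boldsymbol{\mu}_- - \boldsymbol{\mu}_+^\top \mathbf{W} \boldsymbol{\mu}_+$.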
Supervised Learning using Nearest Neighbors
Nearest Neighbors
Another supervised learning technique based on computing distances
Very simple idea. Simply do the following at test time:
Compute the distance of the test point from all the training points
Sort the distances to find the "nearest" input(s) in the training data
Predict the label using the majority or average label of these inputs

Wait. Did you say distance from ALL the training points? That's gonna be sooooo expensive! Yes, but let's not worry about that at the moment. There are ways to speed up this step.

Can use Euclidean or other distances (e.g., Mahalanobis). The choice matters, just like in LwP
Applicable to both classification as well as regression (LwP only works for classification)
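A minimal K-nearest-neighbors sketch at test time (numpy, Euclidean distance, majority vote; illustration only):

import numpy as np

def knn_predict(x_test, X_train, y_train, K=3):
    """Predict the label of one test point by majority vote of its K nearest training points."""
    dists = np.linalg.norm(X_train - x_test, axis=1)    # distance to every training point
    nearest = np.argsort(dists)[:K]                     # indices of the K closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                    # majority label (use the mean for regression)

X_train = np.array([[0., 0.], [1., 1.], [5., 5.], [6., 6.]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(np.array([0.8, 0.6]), X_train, y_train, K=3))   # -> 0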
(Figure: Test input = 31.)

How to pick the "right" K value? K is this model's "hyperparameter". One way to choose it is using "cross-validation" (will see shortly). Also, K should ideally be an odd number to avoid ties.

Essentially, taking more votes helps!
Also leads to smoother decision boundaries (less chance of overfitting on training data)
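A sketch of picking K via cross-validation with scikit-learn (toy digits data assumed for illustration):

from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
for K in [1, 3, 5, 7, 9]:                        # odd values to avoid ties
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=K), X, y, cv=5)
    print(K, scores.mean())                      # pick the K with the best mean accuracy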
ε-Ball Nearest Neighbors (ε-NN)
Rather than looking at a fixed number of neighbors, can look inside a ball of a given radius ε around the test input
So changing ε may change the prediction. How to pick the "right" ε value?
(Figure: a test input and the training points falling inside its ε-ball.)
For multi-class, simply use the same majority rule as in the binary classification case
Just a simple difference that now we have more than 2 classes
For regression, simply compute the average of the outputs of the nearest neighbors
This form makes direct sense for regression and for cases where each output is a vector (e.g., multi-class classification where each output is a discrete value which can be represented as a one-hot vector, or tagging/multi-label classification where each output is a binary vector)
For binary classification, assuming the labels are +1/-1, we predict $\hat{y} = \mathrm{sign}\big(\sum_{n \in N_\epsilon(\mathbf{x})} y_n\big)$, where $N_\epsilon(\mathbf{x})$ is the set of training points inside the ε-ball around $\mathbf{x}$
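A minimal ε-ball sketch for the binary +1/-1 case (numpy; falling back to the single nearest neighbor when the ball is empty is one possible convention, assumed here):

import numpy as np

def eps_ball_predict(x_test, X_train, y_train, eps=1.5):
    """Predict using all training points within distance eps of the test point."""
    dists = np.linalg.norm(X_train - x_test, axis=1)
    inside = dists <= eps                        # neighbors inside the epsilon-ball
    if not np.any(inside):                       # empty ball: fall back to the single nearest point
        inside = dists == dists.min()
    return np.sign(y_train[inside].mean())       # for regression, return the mean itself

X_train = np.array([[0., 0.], [1., 0.], [4., 4.]])
y_train = np.array([-1, -1, +1])
print(eps_ball_predict(np.array([0.5, 0.5]), X_train, y_train, eps=1.5))   # -> -1.0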
Nearest Neighbors: Some Comments
An old, classic but still very widely used algorithm
Can sometimes give deep neural networks a run for their money
Can work very well in practice with the right distance function
Comes with very nice theoretical guarantees
Also called a memory-based or instance-based or non-parametric method
No “model” is learned here (unlike LwP). Prediction step uses all the training data
Requires lots of storage (need to keep all the training data at test time)
Prediction step can be slow at test time
For each test point, need to compute its distance from all the training points
Clever data-structures or data-summarization techniques can provide speed-ups
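For instance, a sketch of a tree-based speed-up using scikit-learn's KDTree (the neighbors found are exact; only the search is faster, typically in low-to-moderate dimensions):

import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10000, 5))            # 10k training points in 5 dimensions

tree = KDTree(X_train)                           # build the tree once, query it many times
dist, idx = tree.query(rng.normal(size=(1, 5)), k=3)   # 3 nearest neighbors of one test point
print(idx, dist)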
Decision Trees
A Decision Tree (DT) defines a hierarchy of rules to make a prediction
(Example tree: the root node tests Body temp.; if Cold, predict Non-mammal; if Warm, an internal node tests Gives birth?: Yes predicts Mammal, No predicts Non-mammal.)
Root and internal nodes test rules. Leaf nodes make predictions
Decision Tree (DT) learning is about learning such a tree from labeled data
Decision Trees for Supervised Learning
The basic idea is very simple
Recursively partition the training inputs into groups using simple rules (feature tests)
Within each group, fit a simple supervised learner (e.g., predict the majority label)
Decision Trees for Classification
(Figure: 48 training inputs plotted by Feature 1 ($x_1$) and Feature 2 ($x_2$). The DT tests $x_1 > 3.5$? at the root; the NO branch then tests $x_2 > 2$? and the YES branch tests $x_2 > 3$?, with the four leaf nodes predicting Red, Green, Green, and Red. A test input is routed down this tree.)

Remember: The root node contains all training inputs; each leaf node receives a subset of the training inputs

DT is very efficient at test time: to predict the label of a test point, nearest neighbors would require computing distances from all 48 training inputs, whereas the DT predicts the label by doing just 2 feature-value comparisons. Way faster!
Decision Trees for Classification: Another Example
Deciding whether to play or not to play Tennis on a Saturday
Each input (a Saturday) has 4 categorical features: Outlook, Temp., Humidity, Wind
A binary classification problem (play vs no-play)
Below Left: Training data; Below Right: A decision tree constructed using this data
Question: Why does it make more sense to test the feature “outlook” first?
Answer: Of all the 4 features, it’s the most informative
It has the highest information gain as the root node
Constructing Decision Trees
Given some training data, what's the "optimal" DT? How to decide which rules to test for, and in what order? How to assess the informativeness of a rule?

(Figure: the same 2D example as before, with the root testing $x_1 > 3.5$? and the branches testing $x_2 > 2$? and $x_2 > 3$?, leaves predicting Red/Green.)

In general, constructing a DT is an intractable problem (NP-hard)
Often we can use some "greedy" heuristics to construct a "good" DT
To do so, we use the training data to figure out which rules should be tested at each node. The same rules will be applied on the test inputs to route them along the tree until they reach some leaf node where the prediction is made
The rules are organized in the DT such that the most informative rules are tested first (Hmm... so DTs are like the "20 questions" game: ask the most useful questions first)
Informativeness of a rule is related to the extent of the purity of the split arising due to that rule. More informative rules yield more pure splits
Decision Tree Construction: An Example
Let’s consider the playing Tennis example
Assume each internal node will test the value of one of the features
Question: Why does it make more sense to test the feature “outlook” first?
Answer: Of all the 4 features, it’s the most informative
It has the highest information gain as the root node
Entropy and Information Gain
Assume a set of labelled inputs from C classes, as fraction of class c inputs
Uniform sets (all classes
Entropy of the set is defined as roughly equally present)
have high entropy; skewed
Suppose a rule splits into two smaller disjoint sets and sets low
Reduction in entropy after the split is called information gain
Likewise, at root: IG(S, outlook) = 0.246, IG(S, humidity) = 0.151, IG(S,temp) = 0.029
Thus we choose “outlook” feature to be tested at the root node
Now how to grow the DT, i.e., what to do at the next level? Which feature to test next?
Rule: Iterate - for each child node, select the feature with the highest IG
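A small sketch of these two quantities in Python (labels given as an array of class ids; log base 2 assumed, matching the IG numbers above):

import numpy as np

def entropy(labels):
    """H(S) = -sum_c p_c log2 p_c, where p_c is the fraction of class-c inputs in S."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """Reduction in entropy when a rule splits `parent` into disjoint sets `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

S = np.array([1, 1, 1, 0, 0, 0, 0, 0])
print(entropy(S))                                 # ~0.954
print(information_gain(S, S[:3], S[3:]))          # a perfectly pure split: gain = entropy(S)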
Growing the tree
More sophisticated decision rules at the internal nodes can also be used
Basically, need some rule that splits inputs at an internal node into homogeneous groups
The rule can even be a machine learning classification algo (e.g., LwP or a deep learner)
However, in DTs, we want the tests to be fast so single feature based rules are preferred
Need to take care when handling training or test inputs that have some features missing
[1] Breiman, Leo; Friedman, J. H.; Olshen, R. A.; Stone, C. J. (1984). Classification and Regression Trees.
An Illustration: DT with Real-Valued Features
(Figure: a test example in a 2D feature space (Feature 1, Feature 2); at each node the DT chooses the "best" (purest possible) horizontal or vertical split of the training inputs, and the leaf nodes predict Red or Green.)

Works with different types of features (real, categorical, etc.)
Very fast at test time
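A hedged sketch of learning such a tree with scikit-learn on toy 2D real-valued data (not the figure's actual inputs; the class locations and values here are made up for illustration):

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X_red = rng.normal(loc=[2.0, 1.5], scale=0.5, size=(24, 2))
X_green = rng.normal(loc=[4.5, 3.5], scale=0.5, size=(24, 2))
X = np.vstack([X_red, X_green])                        # 48 training inputs in 2D
y = np.array(["Red"] * 24 + ["Green"] * 24)

dt = DecisionTreeClassifier(max_depth=2).fit(X, y)     # greedy, purity-based splits
print(export_text(dt, feature_names=["Feature 1", "Feature 2"]))
print(dt.predict([[5.0, 4.0]]))                        # route a test input down the tree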