
06-07-08 - Supervised Learning by Computing Distances, Multi-Class Classification, Decision Boundary

The document discusses Learning with Prototypes (LwP) and Nearest Neighbors (NN) as supervised learning techniques, focusing on their mathematical foundations, prediction rules, and applications in classification tasks. It explains the use of Euclidean and Mahalanobis distances, decision boundaries, and the concept of one-vs-rest for multi-class classification. Additionally, it covers the K-Nearest Neighbors algorithm, including its variations and hyperparameters, and how these methods can be applied to various supervised learning problems.


Learning by Computing Distances (2): Wrapping-up LwP, Nearest Neighbors
CSN-382
2
LwP: The Prediction Rule, Mathematically
 What does the prediction rule for LwP look like mathematically?

 Assume we are using Euclidean distances here

$\|\boldsymbol{\mu}_- - \mathbf{x}\|^2 = \|\boldsymbol{\mu}_-\|^2 + \|\mathbf{x}\|^2 - 2\langle \boldsymbol{\mu}_-, \mathbf{x} \rangle$
$\|\boldsymbol{\mu}_+ - \mathbf{x}\|^2 = \|\boldsymbol{\mu}_+\|^2 + \|\mathbf{x}\|^2 - 2\langle \boldsymbol{\mu}_+, \mathbf{x} \rangle$

(Figure: the two class prototypes $\boldsymbol{\mu}_-$, $\boldsymbol{\mu}_+$ and a test example)

 Prediction Rule: Predict label as +1 if $\|\boldsymbol{\mu}_- - \mathbf{x}\|^2 > \|\boldsymbol{\mu}_+ - \mathbf{x}\|^2$, otherwise -1


3
LwP: The Prediction Rule, Mathematically
 Let's expand the prediction rule expression a bit more:

$\|\boldsymbol{\mu}_- - \mathbf{x}\|^2 - \|\boldsymbol{\mu}_+ - \mathbf{x}\|^2 = 2\langle \boldsymbol{\mu}_+ - \boldsymbol{\mu}_-, \mathbf{x} \rangle + \|\boldsymbol{\mu}_-\|^2 - \|\boldsymbol{\mu}_+\|^2 = \mathbf{w}^\top \mathbf{x} + b$

 Thus LwP with Euclidean distance is equivalent to a linear model with
 Weight vector $\mathbf{w} = 2(\boldsymbol{\mu}_+ - \boldsymbol{\mu}_-)$
 Bias term $b = \|\boldsymbol{\mu}_-\|^2 - \|\boldsymbol{\mu}_+\|^2$
(Will look at linear models more formally and in more detail later)

 Prediction rule therefore is: Predict +1 if $\mathbf{w}^\top \mathbf{x} + b > 0$, else predict -1


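A minimal Python sketch (illustrative values, not from the slides) that checks this equivalence numerically:

import numpy as np

rng = np.random.default_rng(0)
D = 5
mu_plus = rng.normal(size=D)    # prototype of class +1
mu_minus = rng.normal(size=D)   # prototype of class -1
x = rng.normal(size=D)          # a test input

# Distance-based score: positive iff x is closer to mu_plus than to mu_minus
dist_score = np.sum((mu_minus - x) ** 2) - np.sum((mu_plus - x) ** 2)

# Equivalent linear model from the expansion above
w = 2 * (mu_plus - mu_minus)
b = np.sum(mu_minus ** 2) - np.sum(mu_plus ** 2)
linear_score = w @ x + b

print(np.isclose(dist_score, linear_score))  # True
print(+1 if linear_score > 0 else -1)        # predicted label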
4
Learning with Prototypes (LwP)
 Prototypes: $\boldsymbol{\mu}_+ = \frac{1}{N_+}\sum_{n:\,y_n=+1} \mathbf{x}_n$ and $\boldsymbol{\mu}_- = \frac{1}{N_-}\sum_{n:\,y_n=-1} \mathbf{x}_n$

 Prediction rule for LwP (for binary classification with Euclidean distance): predict +1 if the score $\mathbf{w}^\top \mathbf{x} + b > 0$, otherwise -1, with $\mathbf{w} = \boldsymbol{\mu}_+ - \boldsymbol{\mu}_-$ if Euclidean distance is used (dropping the constant factor 2 from the previous slide only rescales $\mathbf{w}$ and $b$ and does not change the predictions)

 Decision boundary: the perpendicular bisector of the line joining the class prototype vectors

For LwP, the prototype vectors $\boldsymbol{\mu}_+$ and $\boldsymbol{\mu}_-$ (or just their difference $\mathbf{w}$ in the Euclidean distance case) define the "model"; they are the model parameters. Can throw away the training data after computing the prototypes and just keep the model parameters at test time in such "parametric" models.

 Exercise: Show that for the binary classification case, the LwP score can also be written as
$f(\mathbf{x}) = \sum_{n=1}^{N} \alpha_n \langle \mathbf{x}_n, \mathbf{x} \rangle + b$

So the "score" of a test point is a weighted sum of its similarities with each of the N training inputs. Many supervised learning models have this form, as we will see later.

Note: Even though $f$ can be expressed in this form, if N > D this form may be more expensive to compute (O(N) time) as compared to $\mathbf{w}^\top \mathbf{x} + b$ (O(D) time). However, the form is still very useful, as we will see later when we discuss kernel methods.
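A short Python sketch of LwP for binary labels in {-1, +1} (illustrative code and data, not from the slides): compute the two prototypes from the training data, then classify a test point by its nearer prototype.

import numpy as np

def lwp_fit(X, y):
    """Compute the class prototypes from training data X (N x D) and labels y in {-1, +1}."""
    mu_plus = X[y == +1].mean(axis=0)
    mu_minus = X[y == -1].mean(axis=0)
    return mu_plus, mu_minus

def lwp_predict(mu_plus, mu_minus, X_test):
    """Predict +1 if a point is closer (in Euclidean distance) to mu_plus, else -1."""
    d_plus = np.linalg.norm(X_test - mu_plus, axis=1)
    d_minus = np.linalg.norm(X_test - mu_minus, axis=1)
    return np.where(d_plus < d_minus, +1, -1)

# Tiny example with made-up data
X = np.array([[0.0, 0.0], [1.0, 0.5], [5.0, 5.0], [6.0, 4.5]])
y = np.array([-1, -1, +1, +1])
mu_p, mu_m = lwp_fit(X, y)
print(lwp_predict(mu_p, mu_m, np.array([[0.5, 0.2], [5.5, 5.0]])))  # [-1 +1]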
5
Linear Models
 A linear model can also be used in classification problems

 For binary classification, can treat $\mathbf{w}^\top \mathbf{x}$ as the "score" of input $\mathbf{x}$ and threshold it to get the binary label

(Recall that the LwP model can also be seen as a linear model, although it wasn't formulated like this)

 Wait – when discussing LwP, wasn't the linear model of the form $\mathbf{w}^\top \mathbf{x} + b$? Where did the "bias" term go?

 Don't worry, the bias term can easily be folded in: append a constant feature "1" to each input and rewrite the model as $\mathbf{w}^\top \mathbf{x}$, where now both $\mathbf{w}$ and $\mathbf{x}$ are (D+1)-dimensional. We will assume the same and omit the explicit bias for simplicity.
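A quick Python illustration (made-up numbers) of folding the bias into the weight vector by appending a constant feature:

import numpy as np

w = np.array([2.0, -1.0])
b = 0.5
x = np.array([1.5, 3.0])

# Fold the bias in: append 1 to the input and b to the weight vector
x_aug = np.append(x, 1.0)        # now (D+1)-dimensional
w_aug = np.append(w, b)

print(w @ x + b, w_aug @ x_aug)  # both give the same score (0.5)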
6
Multi-class Classification
 Multi-class Classification: A classification task with more than two classes;
e.g., classify a set of images of fruits which may be oranges, apples, or pears.
Multi-class classification makes the assumption that each sample is assigned
to one and only one label: a fruit can be either an apple or a pear but not both
at the same time.
• Binary Classification: Classification tasks with two classes.
• Multi-class Classification: Classification tasks with more than two classes.
• Some algorithms are designed for binary classification problems.
• Examples include:
• Logistic Regression
• Perceptron
• Support Vector Machines
7
8
One-Vs-Rest for Multi-Class Classification
 One-vs-rest (OvR for short, also referred to as One-vs-All or OvA) is a heuristic method for
using binary classification algorithms for multi-class classification.
• It involves splitting the multi-class dataset into multiple binary classification problems. A
binary classifier is then trained on each binary classification problem and predictions are
made using the model that is the most confident.
• For example, given a multi-class classification problem with examples for each of the classes ‘red,’ ‘blue,’ and ‘green,’ this could be divided into three binary classification datasets as follows:
• Binary Classification Problem 1: red vs [blue, green]
• Binary Classification Problem 2: blue vs [red, green]
• Binary Classification Problem 3: green vs [red, blue]
• A possible downside of this approach is that it requires one model to be created for each class. For example, three classes require three models.
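A minimal one-vs-rest sketch in Python (illustrative; the class names and the use of scikit-learn's LogisticRegression as the underlying binary classifier are assumptions, not something the slides prescribe): train one binary scorer per class and predict with the most confident one.

import numpy as np
from sklearn.linear_model import LogisticRegression  # any binary classifier with scores would do

def ovr_fit(X, y, classes):
    """Train one binary classifier per class (that class vs the rest)."""
    models = {}
    for c in classes:
        y_bin = (y == c).astype(int)            # 1 for "class c", 0 for "rest"
        models[c] = LogisticRegression().fit(X, y_bin)
    return models

def ovr_predict(models, X_test):
    """Predict the class whose binary classifier is the most confident."""
    scores = np.column_stack([m.decision_function(X_test) for m in models.values()])
    classes = list(models.keys())
    return [classes[i] for i in scores.argmax(axis=1)]

# Toy data: three well-separated clusters labeled 'red', 'blue', 'green'
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.3, size=(20, 2)) for loc in ([0, 0], [3, 0], [0, 3])])
y = np.array(['red'] * 20 + ['blue'] * 20 + ['green'] * 20)
models = ovr_fit(X, y, ['red', 'blue', 'green'])
print(ovr_predict(models, np.array([[0.1, 0.1], [2.9, 0.2], [0.0, 3.1]])))  # ['red', 'blue', 'green']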
9
One-Vs-Rest for Multi-Class Classification
10
One-Vs-Rest for Multi-Class Classification
11
12
Linear Models
 Linear models are also used in multi-class classification problems

 Assuming $K$ classes, we can use the following model:

$\hat{y} = \arg\max_{k \in \{1, 2, \dots, K\}} \mathbf{w}_k^\top \mathbf{x}$

 Can think of $\mathbf{w}_k^\top \mathbf{x}$ as the score of the input $\mathbf{x}$ for the $k$-th class

 Let's understand this with an example
 We passed a test input to the classifier models and got a positive rating from the Green class classifier, with a score of 0.9.
 We also got a positive rating from the Blue class classifier, with a score of 0.4, along with a negative score from the remaining Red classifier.
 Hence, based on the positive responses and the decisive (highest) score, we can say that our test input belongs to the Green class.
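A small Python sketch of this scoring rule (the weight vectors and input below are made up for illustration):

import numpy as np

# One weight vector per class (rows); scores are w_k^T x, prediction is the argmax
W = np.array([[ 1.0, -0.5],   # class 0 ("red")
              [-0.2,  0.8],   # class 1 ("blue")
              [ 0.6,  0.9]])  # class 2 ("green")
x = np.array([0.3, 1.2])

scores = W @ x                  # one score per class
print(scores)                   # [-0.3, 0.9, 1.26]
print(int(np.argmax(scores)))   # predicted class index -> 2 ("green")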
13
Decision Boundary
 The Decision Boundary separates the data points into regions, which correspond to the classes to which they belong.
 In a statistical-classification problem with two classes, a decision boundary or
decision surface is a hypersurface that partitions the underlying vector space
into two sets, one for each class. The classifier will classify all the points on
one side of the decision boundary as belonging to one class and all those on
the other side as belonging to the other class.
 A decision boundary is the region of a
problem space in which the output label
of a classifier is ambiguous.
14
Decision Boundary
 If the decision surface is a line/hyperplane, then the classification problem is
linear, and the classes are linearly separable.
 When the training examples are linearly separable, we can set the parameters
of a linear classifier so that all the training examples are classified correctly
 Many decision boundaries!
15
Decision Boundary
• The earlier definition of a decision boundary (points where the classifier gets confused) is simple but not general enough
• A more robust definition of decision boundary: locations where the classifier's decision abruptly changes from one class to another
• All classifiers have such a decision boundary
• Easy to detect whether a test point is at the decision boundary for linear classifiers – difficult to do so for most other classifiers, e.g., deep nets
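For a linear classifier the check is simple, since a point is on the decision boundary exactly when its score $\mathbf{w}^\top\mathbf{x} + b$ is zero. A tiny Python sketch (illustrative values):

import numpy as np

w, b = np.array([1.0, -2.0]), 0.5

def on_boundary(x, tol=1e-9):
    """A point is on the linear decision boundary when w^T x + b == 0 (up to a tolerance)."""
    return abs(w @ x + b) < tol

print(on_boundary(np.array([1.5, 1.0])))   # 1.5 - 2.0 + 0.5 = 0   -> True
print(on_boundary(np.array([2.0, 1.0])))   # 2.0 - 2.0 + 0.5 = 0.5 -> False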
16
Linear Classifiers
• LwP with 2 classes and the Euclidean metric always gives a linear classifier
• Even if the Mahalanobis metric is used, LwP still gives a linear classifier
• Extremely popular in ML
• Very small model size – just one vector (and one bias value)
• Very fast prediction time

Before going forward, recall that linear classifiers are those that have a line or a plane as the decision boundary. A linear classifier is given by a model that looks like $\mathbf{w}^\top \mathbf{x} + b$, and it makes predictions by looking at whether $\mathbf{w}^\top \mathbf{x} + b > 0$ or not.
17
The “best” Linear Classifier
 It seems infinitely many classifiers perfectly classify the data. Which one should I choose?
 All these brittle dotted classifiers misclassify the two new test points. However, the bold classifier, whose decision boundary is far from all training points, is not affected.
 Indeed! Such models would be very brittle and might misclassify test data (i.e., predict the wrong class), even test data that look very similar to the training data.
 It is better to not select a model whose decision boundary passes very close to the training points.
18
Large Margin Classifiers

 Geometric Margin

 A good linear classifier would be one where all the data points are correctly classified, as well as far from the decision boundary
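A small Python sketch of the geometric margin (illustrative values; the formula $y\,(\mathbf{w}^\top\mathbf{x} + b)/\|\mathbf{w}\|$ is the standard definition of the signed distance to the hyperplane, stated here as an assumption since the slide only names the concept):

import numpy as np

def geometric_margin(w, b, x, y):
    """Signed distance of (x, y) to the hyperplane w^T x + b = 0; positive iff correctly classified."""
    return y * (w @ x + b) / np.linalg.norm(w)

w, b = np.array([3.0, 4.0]), -5.0
print(geometric_margin(w, b, np.array([3.0, 4.0]), +1))  # (9 + 16 - 5)/5 = 4.0
print(geometric_margin(w, b, np.array([0.0, 0.0]), +1))  # -5/5 = -1.0 (misclassified)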
19
Non-linear decision boundaries

Non-linear Boundary in Deep Neural Network


20
Improving LwP when classes are complex-shaped
 Using a weighted Euclidean or Mahalanobis distance can sometimes help

Weighted Euclidean distance: $d_w(\mathbf{a}, \mathbf{b}) = \sum_{i=1}^{D} w_i (a_i - b_i)^2$

(In the example figure, use a smaller $w_i$ for the horizontal-axis feature)

Mahalanobis distance: $d_W(\mathbf{a}, \mathbf{b}) = \sqrt{(\mathbf{a} - \mathbf{b})^\top \mathbf{W} (\mathbf{a} - \mathbf{b})}$

 A good $\mathbf{W}$ will help bring points from the same class closer and move different classes apart. $\mathbf{W}$ will be a 2x2 symmetric matrix in this example (chosen by us or learned).

 Note: The Mahalanobis distance also has the effect of rotating the axes, which helps
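A small Python sketch of both distances (made-up values; the weighted Euclidean version is written exactly as on the slide, i.e., without a square root):

import numpy as np

def weighted_euclidean(a, b, w):
    """Weighted (squared) Euclidean distance: sum_i w_i * (a_i - b_i)^2."""
    return np.sum(w * (a - b) ** 2)

def mahalanobis(a, b, W):
    """Mahalanobis distance: sqrt((a - b)^T W (a - b)) for a symmetric PSD W."""
    d = a - b
    return np.sqrt(d @ W @ d)

a, b = np.array([1.0, 2.0]), np.array([3.0, 0.0])
print(weighted_euclidean(a, b, w=np.array([0.1, 1.0])))   # 0.1*4 + 1*4 = 4.4
W = np.array([[2.0, 0.5],
              [0.5, 1.0]])                                # symmetric 2x2 matrix
print(mahalanobis(a, b, W))                               # sqrt(8) ~ 2.83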
21
Improving LwP when classes are complex-shaped
 Even with weighted Euclidean or Mahalanobis distance, LwP is still a linear classifier 
 Exercise: Prove the above fact. You may use the following hints:
 Mahalanobis distance can be written as $d_W(\mathbf{a}, \mathbf{b}) = \sqrt{(\mathbf{a} - \mathbf{b})^\top \mathbf{W} (\mathbf{a} - \mathbf{b})}$
 $\mathbf{W}$ is a symmetric (positive semi-definite) matrix and thus can be written as $\mathbf{W} = \mathbf{L}^\top \mathbf{L}$ for some matrix $\mathbf{L}$
 Showing it for the Mahalanobis case is enough; weighted Euclidean is a special case with diagonal $\mathbf{W}$

 LwP can be extended to learn nonlinear decision boundaries if we use nonlinear distances/similarities (more on this when we talk about kernels)

 Note: Modeling each class by not just a mean but by a probability distribution can also help in learning nonlinear decision boundaries. More on this when we discuss probabilistic models for classification.
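A quick numeric illustration of the hint (illustrative values, not a full proof): if $\mathbf{W} = \mathbf{L}^\top\mathbf{L}$, the Mahalanobis distance between two points equals the Euclidean distance between their linearly transformed versions $\mathbf{L}\mathbf{a}$ and $\mathbf{L}\mathbf{b}$, which is why LwP with such a distance still ends up being a linear classifier.

import numpy as np

rng = np.random.default_rng(1)
L = rng.normal(size=(2, 2))
W = L.T @ L                       # a symmetric PSD matrix written as L^T L

a, b = rng.normal(size=2), rng.normal(size=2)
d = a - b
mahalanobis = np.sqrt(d @ W @ d)
euclid_transformed = np.linalg.norm(L @ a - L @ b)   # Euclidean distance after x -> L x

print(np.isclose(mahalanobis, euclid_transformed))   # True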
22
LwP as a subroutine in other ML models
 For data clustering (unsupervised learning), K-means clustering is a popular algorithm

 K-means also computes means/centres/prototypes of groups of unlabeled points (will see K-means in detail later)

 Harder than LwP since the labels are unknown. But we can do the following (a sketch follows below):
 Guess the label of each point and compute the means using the guessed labels
 Refine the labels using these means (assign each point to its currently closest mean)
 Repeat until the means don't change anymore
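A bare-bones K-means sketch in Python (illustrative; not the lecture's exact pseudocode), following the guess / refine / repeat loop described above:

import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Alternate between assigning points to their closest mean and recomputing each mean."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=K, replace=False)]   # initial guess: K random training points
    for _ in range(n_iters):
        # Assignment step: label each point with the index of its closest mean
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each mean from its assigned points (keep the old mean if a cluster is empty)
        new_means = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else means[k]
                              for k in range(K)])
        if np.allclose(new_means, means):   # stop when the means don't change anymore
            break
        means = new_means
    return means, labels

X = np.vstack([np.random.default_rng(1).normal(c, 0.2, size=(30, 2)) for c in ([0, 0], [4, 4])])
means, labels = kmeans(X, K=2)
print(np.round(means, 2))   # roughly [0, 0] and [4, 4]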
23

Supervised Learning using Nearest Neighbors
24
Nearest Neighbors
 Another supervised learning technique based on computing distances

 Wait. Did you say distances from ALL the training points? That's gonna be sooooo expensive! 
 Yes, but let's not worry about that at the moment. There are ways to speed up this step.

 Very simple idea. Simply do the following at test time:
 Compute the distance of the test point from all the training points
 Sort the distances to find the "nearest" input(s) in the training data
 Predict the label using the majority or average label of these inputs

 Can use Euclidean or other distances (e.g., Mahalanobis). The choice is important, just like for LwP

 Unlike LwP, which does prototype-based comparison, the nearest neighbors method looks at the labels of individual training inputs to make a prediction
25

Nearest Neighbors for Classification
26
Nearest Neighbor (or “One” Nearest Neighbor)
(Figure: 1-NN decision boundary and the Voronoi partition around two test points)

 Interesting. Even with Euclidean distances, it can learn nonlinear decision boundaries?
 Indeed. And that's possible since it is a "local" method (it looks at a local neighborhood of the test point to make the prediction)

 The nearest neighbour approach induces a Voronoi tessellation/partition of the input space (all test points falling in a cell will get the label of the training input in that cell)
27
K Nearest Neighbors (KNN)
 In many cases, it helps to look at not one but > 1 nearest neighbors

(Figure: a test input and its K nearest neighbors)

 How to pick the "right" K value?
 K is this model's "hyperparameter". One way to choose it is using "cross-validation" (will see shortly). Also, K should ideally be an odd number to avoid ties.
(A hyperparameter is a parameter whose value is used to control the learning process.)

 Essentially, taking more votes helps!
 Also leads to smoother decision boundaries (less chance of overfitting on the training data)
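A compact Python sketch of KNN classification with a majority vote (made-up data; K and the choice of distance are the hyperparameters discussed above):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, K=3):
    """Classify x_test by a majority vote among its K nearest training points (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_test, axis=1)   # distance to every training point
    nearest = np.argsort(dists)[:K]                    # indices of the K closest points
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9], [0.9, 1.2]])
y_train = np.array([-1, -1, +1, +1, +1])
print(knn_predict(X_train, y_train, np.array([0.95, 1.0]), K=3))   # +1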
28
ε-Ball Nearest Neighbors (ε-NN)
 Rather than looking at a fixed number of neighbors, can look inside a ball of a given radius ε around the test input

(Figure: a test input and the training points falling inside its ε-ball)

 So changing ε may change the prediction. How to pick the "right" ε value?
 Just like K, ε is also a "hyperparameter". One way to choose it is using "cross-validation" (will see shortly).
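A matching Python sketch for ε-NN (illustrative; note that, unlike KNN, the number of voters varies with ε and the ball can even be empty, and how to handle that case is a design choice, not something specified on the slide):

import numpy as np
from collections import Counter

def eps_nn_predict(X_train, y_train, x_test, eps=0.5):
    """Classify x_test by a majority vote among all training points within distance eps of it."""
    dists = np.linalg.norm(X_train - x_test, axis=1)
    inside = np.where(dists <= eps)[0]                 # neighbors inside the epsilon-ball
    if len(inside) == 0:
        return None                                    # empty ball; handling this is a design choice
    return Counter(y_train[inside]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9]])
y_train = np.array([-1, -1, +1, +1])
print(eps_nn_predict(X_train, y_train, np.array([1.05, 1.0]), eps=0.3))   # +1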
29
Distance-weighted KNN and ε-NN
 The standard KNN and ε-NN treat all nearest neighbors equally (all vote equally)

(Figure: K = 3; the test input has one red and two green nearest neighbors)

 An improvement: when voting, give more importance to closer training inputs
 In the weighted approach, the single red training input is given 3 times more importance than the other two green inputs, since it is (roughly) three times closer to the test input than the two green inputs

 Unweighted KNN prediction: (1/3)·red + (1/3)·green + (1/3)·green – the green class wins
 Weighted KNN prediction: (3/5)·red + (1/5)·green + (1/5)·green – now the red class wins

 ε-NN can also be made weighted
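A Python sketch of distance-weighted KNN (illustrative; the 1/distance weighting used here is one common choice, not necessarily the slide's exact scheme). With the data below, unweighted 3-NN would predict 'green', but the weighted vote flips to 'red' because the red point is much closer:

import numpy as np
from collections import defaultdict

def weighted_knn_predict(X_train, y_train, x_test, K=3):
    """KNN where each of the K nearest neighbors votes with weight 1/distance,
    so closer training inputs get more importance."""
    dists = np.linalg.norm(X_train - x_test, axis=1)
    nearest = np.argsort(dists)[:K]
    votes = defaultdict(float)
    for i in nearest:
        votes[y_train[i]] += 1.0 / (dists[i] + 1e-12)   # small constant avoids division by zero
    return max(votes, key=votes.get)

X_train = np.array([[1.0, 1.0], [0.0, 0.3], [0.0, -0.3]])
y_train = np.array(['red', 'green', 'green'])
print(weighted_knn_predict(X_train, y_train, np.array([0.7, 0.7]), K=3))   # 'red'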
30
KNN/ε-NN for Other Supervised Learning Problems
 Can apply KNN/ε-NN to other supervised learning problems as well, such as
 Multi-class classification
 Regression
 Tagging/multi-label learning
(We can also try the weighted versions for such problems, just like we did in the binary classification case)

 For multi-class, simply use the same majority rule as in the binary classification case
 Just a simple difference that now we have more than 2 classes

 For regression, simply compute the average of the outputs of the nearest neighbors
31
KNN Prediction Rule: The Mathematical Form
 Let's denote the set of K nearest neighbors of an input $\mathbf{x}$ by $N_K(\mathbf{x})$

 The unweighted KNN prediction for a test input $\mathbf{x}$ can be written as

$\mathbf{y} = \frac{1}{K} \sum_{i \in N_K(\mathbf{x})} \mathbf{y}_i$

(Assuming discrete labels with 5 possible values, the one-hot representation will be an all-zeros vector of size 5, except for a single 1 denoting the value of the discrete label, e.g., if label = 3 then the one-hot vector = [0,0,1,0,0])

 This form makes direct sense for regression and for cases where each output is a vector (e.g., multi-class classification, where each output is a discrete value which can be represented as a one-hot vector, or tagging/multi-label classification, where each output is a binary vector)
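A Python sketch of this averaged form (made-up data; the final argmax step for turning the averaged one-hot vector back into a class label is a natural choice, not stated explicitly on the slide):

import numpy as np

def knn_average_predict(X_train, Y_train, x_test, K=3):
    """Unweighted KNN in its general form: average the output vectors of the K nearest neighbors.
    Works for regression (real outputs), one-hot labels, and multi-label binary vectors."""
    dists = np.linalg.norm(X_train - x_test, axis=1)
    nearest = np.argsort(dists)[:K]
    return Y_train[nearest].mean(axis=0)

# Multi-class example with one-hot outputs (3 classes)
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [1.2, 0.9]])
Y_train = np.array([[1, 0, 0], [1, 0, 0], [0, 0, 1], [0, 0, 1]], dtype=float)
y_avg = knn_average_predict(X_train, Y_train, np.array([0.9, 1.0]), K=3)
print(y_avg)                  # [0.33, 0., 0.67] -> averaged one-hot vectors
print(int(np.argmax(y_avg)))  # final class = index of the largest entry (class 2)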
32
Nearest Neighbors: Some Comments
 An old, classic but still very widely used algorithm
 Can sometimes give deep neural networks a run for their money 
 Can work very well in practice with the right distance function
 Comes with very nice theoretical guarantees
 Also called a memory-based or instance-based or non-parametric method
 No “model” is learned here (unlike LwP). Prediction step uses all the
training data
 Requires lots of storage (need to keep all the training data at
test time)
 Prediction step can be slow at test time
 For each test point, need to compute its distance from all the training
points
