CIS5200: Machine Learning Spring 2024

Lecture 9: Support Vector Machines


Date: February 15, 2024

In this lecture, we'll cover our next machine learning algorithm: the Support Vector Machine (SVM). SVMs are one of the most successful linear models of all time, and there's a particular notion of "quality" under which they are in fact the optimal linear classifier.

Figure 1: Left: Intuitively, the purple hyperplane parameterized by w2 is a "better" classifier than the hyperplane parameterized by w1, because it seems to leave more room for error. Right: the margin, a concept we've briefly touched on before, seems to formalize this intuition nicely. The green hyperplane has a much smaller margin.

To begin, consider the dataset in the figure above. Intuitively, it seems like the purple hyperplane
parameterized by w2 is "better" than the green one parameterized by w1. The reason for this is
that there are orange circles very close to the green hyperplane – it’s not implausible to think that
there could be orange circle test points just to the left of the given training data that we would
classify incorrectly. Put another way, the purple hyperplane is better because it seems to “leave
more room for error.” On the right, we see that we can formalize this intuition using a quantity
we’ve already seen before: the margin. The purple hyperplane is “better” because it has a larger
margin. SVMs are linear models with the maximum possible margin given a dataset.
Note: Recall that the Perceptron algorithm is guaranteed to find a hyperplane that separates the data as long as one exists. In fact, since we assumed separation with margin, there are infinitely many hyperplanes that separate the data, and the Perceptron may find any one of them. In particular, the Perceptron algorithm did not provide any guarantees on the margin of the hyperplane it found, even though we assumed that the training dataset had margin γ with respect to the true separator. However, in Homework 1, we saw a simple modification of the Perceptron, the Margin Perceptron, which guaranteed an approximate margin of γ/3. The SVM will guarantee the maximum possible margin, which is at least γ.

1 The Margin Revisited

Recall from the perceptron that we defined the margin γ(w, S) of a hyperplane w and dataset S as
the smallest distance between any point in S and the hyperplane. Since the margin is going to be
such a critical quantity to our discussion of SVMs, let’s talk about how to derive it. This time, we
won’t assume that the hyperplane is defined by a normalized vector w – instead we’re given some
arbitrary vector w that defines a hyperplane. As we’ll see, we get the “right thing” for normalized
vectors anyways.
To derive the margin, we make the following observations about the distance from a point x to
some hyperplane parameterized by w:

1. For any point x, there is a point xp that is the perpendicular projection of x onto the hyperplane.

2. Because the vector pointing from xp to x, d = x − xp, is perpendicular to the hyperplane, it is parallel to w. As a result, it must be the case that d = αw for some scalar α.

3. Since xp is on the hyperplane, we know that w⊤xp = 0 by definition of the hyperplane parameterized by w.

These facts are enough to compute the vector pointing from the hyperplane to the point x:

    w⊤xp = 0                      (xp is on the hyperplane)
    w⊤(x − d) = 0                 (d = x − xp  =⇒  xp = x − d)
    w⊤(x − αw) = 0                (w ∥ d, so d = αw)
    w⊤x − α w⊤w = 0               (expand)
    α = (w⊤x) / (w⊤w)             (solve for α)
    d = αw = ((w⊤x) / (w⊤w)) w    (substitute α back into d = αw)
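As a quick sanity check on this derivation, here is a minimal numpy sketch (the vectors below are made up purely for illustration, not taken from the lecture) that computes α, d, and xp for a concrete w and x and verifies that xp indeed lies on the hyperplane:

    import numpy as np

    # Illustrative values (not from the lecture): a hyperplane through the
    # origin with normal vector w, and an arbitrary point x.
    w = np.array([2.0, 1.0])
    x = np.array([3.0, 4.0])

    alpha = (w @ x) / (w @ w)   # alpha = (w^T x) / (w^T w)
    d = alpha * w               # the vector pointing from x_p to x
    x_p = x - d                 # the perpendicular projection of x onto the hyperplane

    print(np.isclose(w @ x_p, 0.0))   # True: x_p lies on the hyperplane
    print(np.allclose(x, x_p + d))    # True: x decomposes as x_p + alpha * w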


Given this vector d, we simply need to compute its length √(d⊤d) to get the distance from the hyperplane to x:

    √(d⊤d) = √( ((w⊤x)² / (w⊤w)²) · w⊤w ) = √( (w⊤x)² / (w⊤w) ) = |w⊤x| / √(w⊤w)     (1)

Note that if w is a unit vector, i.e. ∥w∥₂² = w⊤w = 1, then this expression indeed just becomes |w⊤x|. Either way, we can now write down a general expression for the margin γ(w, S) of a hyperplane parameterized by w for a dataset S:

    γ(w, S) = min_i |w⊤xi| / √(w⊤w)     (2)
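Here is a minimal numpy sketch of equation (2); the toy dataset below is made up purely for illustration. Note that scaling w does not change the value, which foreshadows the scale invariance we exploit later.

    import numpy as np

    def margin(w, X):
        """Margin gamma(w, S) of the hyperplane w^T x = 0 on a dataset X
        whose rows are the points x_i, as in equation (2)."""
        return np.min(np.abs(X @ w)) / np.sqrt(w @ w)

    # Toy dataset (illustrative only).
    X = np.array([[1.0, 2.0],
                  [2.0, -1.0],
                  [-3.0, 0.5]])
    w = np.array([0.6, 0.8])   # happens to be a unit vector

    print(margin(w, X))        # smallest distance from any x_i to the hyperplane
    print(margin(3 * w, X))    # same value: the margin does not depend on the scale of w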

2 (Hard Margin) Support Vector Machines: Maximizing the Margin

Now that we've recapped our expression for the margin, along with the intuition that we'd like a linear classifier with a large margin, the first thing we might try with our nice numerical optimization tools is to simply maximize the margin:

DANGER: Bad Idea Zone

    w∗ = max_w γ(w, S)

The problem with this is, of course, that it says nothing about actually fitting the data! Indeed,
it is trivial to simply maximize the margin: just place the hyperplane as far away from all of your
data as you can get away with. What we need to do is not just find a maximum margin hyperplane,
but a maximum margin separating hyperplane (which we’ll assume exists for the moment).
Recall that a point (xi , yi ) is on the “right” side of the hyperplane if yi w⊤ xi > 0. What we can
do is modify the above optimization problem to be subject to the constraint that all training data
points are on the “right” side of the hyperplane:

A Better Idea

    w∗ = max_w γ(w, S)   s.t.   ∀i  yi w⊤xi > 0     (w needs to be a separating hyperplane)

Pros of this optimization problem: it probably gives us a really good linear model! Cons of this
optimization problem: it has constraints, so it doesn’t immediately look like we can apply gradient
descent.
The rest of what we need to do in this section is to fix that con. First, let’s plug in the definition
of the margin above to make things even worse for a moment:
    w∗ = max_w min_i |w⊤xi| / √(w⊤w)   s.t.   ∀i  yi w⊤xi > 0
Let's play a few tricks to clean up this optimization problem. First of all, remember that the hyperplane is scale invariant: for any hyperplane parameterized by w, cw for c ≠ 0 parameterizes the same hyperplane. This is because (cw)⊤x = 0 ⇐⇒ w⊤x = 0. So let's pick a really convenient scale c. Given the optimal solution w∗ to the above optimization problem, let's find c such that min_i |(cw∗)⊤xi| = 1. Since w∗ and cw∗ define the same hyperplane, why not just search for this scaled version directly by adding this as a constraint, so that we find cw∗ instead of w∗ in the first place?
    w∗ = max_w min_i |w⊤xi| / √(w⊤w)
         s.t.   ∀i  yi w⊤xi > 0
                min_i |w⊤xi| = 1

The claim is that the above optimization problem, which just has a new constraint, is equivalent
to the first one. Why does this help us? Well, let's first pull the factor 1/√(w⊤w) out of the minimization over i, since it doesn't depend on the data:
    w∗ = max_w (1 / √(w⊤w)) · min_i |w⊤xi|
         s.t.   ∀i  yi w⊤xi > 0
                min_i |w⊤xi| = 1

But at the minimum, we know by the constraint that min_i |w⊤xi| = 1, so we can drop it!

    w∗ = max_w 1 / √(w⊤w)
         s.t.   ∀i  yi w⊤xi > 0
                min_i |w⊤xi| = 1

Now, we'll make two observations. The first is simple; the second is more involved, and we'll prove it. First, the maximization problem can be converted to a simpler minimization problem:

    max_w 1 / √(w⊤w)   ⇐⇒   min_w w⊤w

This is because maximizing a fraction with a fixed numerator is the same as making the denominator as small as possible, and since the square root √(·) is monotonic we can just drop it.
Next, the big part. I claim that the following optimization problems are equivalent:
    min_w w⊤w   s.t.   ∀i yi w⊤xi > 0  and  min_i |w⊤xi| = 1
        ⇐⇒
    min_w w⊤w   s.t.   ∀i yi w⊤xi ≥ 1

Let’s be clear about what I mean by equivalent. I am claiming that the optimal solution w∗ is
the same for both optimization problems. Equivalently, w∗ satisfies the left two constraints if and
only if it satisfies the right constraint. Let’s prove this.

Left implies right. Let’s start by showing that if the left two constraints are satisfied by w∗ , then
the right one is. The first constraint implies that yi w⊤ xi is positive for every i. Since yi ∈ {+1, −1},
the multiplication by yi only changes the sign of w⊤xi; it cannot change the magnitude. Therefore,

yi w⊤ xi = |w⊤ xi |

But by the second constraint on the left, we know that the smallest value of |w⊤ xi | is 1. Therefore:

yi w⊤ xi = |w⊤ xi | ≥ 1

Thus, if the left two constraints are satisfied, the right constraint is also satisfied.

Right implies left. To finish the proof that these constraint sets are equivalent, we just need
to show that the right constraint implies both of the left two constraints. First of all, it is obvious
that the first constraint is implied:
yi w⊤ xi ≥ 1 =⇒ yi w⊤ xi > 0
Clearly, if for every i the quantity is at least 1, it's also greater than 0. Now we consider the
second constraint. Suppose for contradiction that we indeed solved the right optimization problem:
    w∗ = min_w w⊤w   s.t.   ∀i  yi w⊤xi ≥ 1
and found that the second left constraint was violated. In other words, we found that min_i |w∗⊤xi| = c for some c ≠ 1. Clearly, c cannot be less than 1, otherwise we would not have yi w∗⊤xi ≥ 1 for all i. Therefore, c > 1.
Set w∗∗ = w∗/c. I claim that w∗∗ is at least a feasible solution to the optimization problem above.
To see this, observe that, by definition,
    yi w∗∗⊤xi = (1/c) · yi w∗⊤xi
Since yi w∗⊤ xi and c are both positive, we certainly haven’t changed the sign – the above is still
positive. Therefore:
    (1/c) · yi w∗⊤xi = (1/c) · |w∗⊤xi|

But by definition of c, we know that min_i |w∗⊤xi| = c. Therefore,
    min_i (1/c) · |w∗⊤xi| = c/c = 1
Since the smallest value of yi w∗∗⊤xi is 1, certainly for all i this quantity is greater than or equal to 1. Therefore, w∗∗ is a feasible solution.
But what about the objective value?
    w∗∗⊤w∗∗ = ((1/c) w∗)⊤ ((1/c) w∗) = (1/c²) · w∗⊤w∗
Since c > 1, this is less than the objective value we achieved with w∗ . This is a contradiction,
because w∗ was supposed to be the optimal solution!
Therefore, reiterating things, we do indeed have that:
    min_w w⊤w   s.t.   ∀i yi w⊤xi > 0  and  min_i |w⊤xi| = 1
        ⇐⇒
    min_w w⊤w   s.t.   ∀i yi w⊤xi ≥ 1

This right optimization problem is exactly what we’ll call the hard margin SVM:

The Hard Margin SVM

We call the linear model specified by h(x) = w∗⊤ x parameterized by w∗ a “hard margin”
SVM if it is a solution to the optimization problem:

    w∗ = min_w w⊤w   s.t.   ∀i  yi w⊤xi ≥ 1

What happens if the data isn’t linearly separable? Then there is no feasible solution! We call this
a “hard margin” because it requires linear separability. What we’ll do next is relax this.
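The hard margin SVM is a constrained quadratic optimization problem (a quadratic program, as noted in the next section), so one way to solve it off the shelf is with a convex optimization library. The sketch below is only an illustration, not part of the lecture: it assumes the cvxpy library and a made-up, linearly separable toy dataset. If the data were not separable, the solver would report the problem as infeasible.

    import numpy as np
    import cvxpy as cp   # cvxpy is an assumption here; any QP solver would do

    # Made-up, linearly separable toy data: rows of X are points, y in {+1, -1}.
    X = np.array([[2.0, 2.0],
                  [3.0, 1.0],
                  [-2.0, -1.0],
                  [-1.0, -3.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])

    w = cp.Variable(X.shape[1])
    objective = cp.Minimize(cp.sum_squares(w))        # min_w w^T w
    constraints = [cp.multiply(y, X @ w) >= 1]        # y_i w^T x_i >= 1 for all i
    cp.Problem(objective, constraints).solve()

    w_star = w.value
    print(w_star)
    print(np.min(np.abs(X @ w_star)) / np.linalg.norm(w_star))   # the achieved margin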

3 Soft margin Support Vector Machines

The formulation above has two problems:

1. It’s still a constrained optimization problem. It’s a really nice kind of constrained optimization
problem called a quadratic program, so it’s actually solvable, but we still don’t like constraints.

2. The hard margin SVM only works if the data is linearly separable – otherwise, there is no
feasible solution to the optimization problem.

In this section, we’ll solve both of these problems with the same trick. To do this, we’ll introduce
the notion of slack variables to the constraint. In particular, we make the following observation:
for any w, if w is not a feasible solution above, it's because there's some i for which yi w⊤xi < 1.
The idea of slack variables is to explicitly allow this kind of violation, but make the optimizer pay
a penalty for it. The resulting optimization problem might look like this:
    w∗ = min_{w,ξ} w⊤w + C Σi ξi
         s.t.   ∀i  yi w⊤xi ≥ 1 − ξi     (ξi is the constraint violation for xi)
                ∀i  ξi ≥ 0

Here, ξi measures the amount by which the constraint for (xi, yi) is violated, i.e., the amount by which yi w⊤xi falls below 1. In English, what this new optimization problem does is allow each xi to violate the constraint by some ξi, but we pay a penalty of C·ξi (where C is a constant chosen by you) in the objective. Our goal is to minimize the total penalty.
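To make the role of the slack variables concrete, here is a hedged sketch of this slack formulation, again assuming cvxpy and made-up data (neither is part of the lecture). The last point is deliberately mislabeled so that no separating hyperplane through the origin exists, which forces at least one ξi to be strictly positive.

    import numpy as np
    import cvxpy as cp   # cvxpy is an assumption here

    # Toy data (illustrative only); the last point is labeled -1 on purpose even
    # though it lies on the +1 side, so the data is not linearly separable.
    X = np.array([[2.0, 2.0],
                  [3.0, 1.0],
                  [-2.0, -1.0],
                  [1.0, 1.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    C = 1.0                                   # penalty constant chosen by you

    w = cp.Variable(X.shape[1])
    xi = cp.Variable(X.shape[0])              # one slack variable per training point
    objective = cp.Minimize(cp.sum_squares(w) + C * cp.sum(xi))
    constraints = [cp.multiply(y, X @ w) >= 1 - xi,   # violations are allowed...
                   xi >= 0]                           # ...but only non-negative ones
    cp.Problem(objective, constraints).solve()

    print(w.value)
    print(xi.value)   # at least one slack is strictly positive: the data is not separable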
So far, we seem to have solved the “hard margin” problem, but not the constrained optimization
problem. To achieve this final goal, first rearrange the above optimization problem:
    w∗ = min_{w,ξ} w⊤w + C Σi ξi
         s.t.   ∀i  ξi ≥ 1 − yi w⊤xi
                ∀i  ξi ≥ 0

Observe that at the optimum:

    ξi = 1 − yi w⊤xi   if yi w⊤xi < 1
    ξi = 0             if yi w⊤xi ≥ 1

Why is this true? Well, if yi w⊤ xi ≥ 1, clearly we can set ξi = 0. This satisfies both constraints, but
adds the minimum possible additional loss to the objective. On the other hand, if yi w⊤xi < 1, the minimum value of ξi we can choose to satisfy the constraint that yi w⊤xi ≥ 1 − ξi is just 1 − yi w⊤xi,
which we can see by rearranging the constraint itself.
Thus, for any choice of w, instead of treating ξi as a variable to be optimized over, we can just
compute it as above. A compact form for this is just:

ξi = max(0, 1 − yi w⊤ xi )

To recap, we know that (1) at the optimum, ξi takes exactly the value above, and (2) this definition
of ξi always satisfies both constraints. Thus, we can plug this definition of ξi into the loss and drop
both constraints!

    w∗ = min_w w⊤w + C Σi max(0, 1 − yi w⊤xi)
We’re essentially done – the above is a perfectly fine optimization problem, and we could stop here.
However, to make it look more like the optimization problems we’ve seen in class, let’s do three
“clean up” steps:

1. First, write w⊤w as ∥w∥₂².

2. Second, since C is a free parameter that you will have to specify anyway, let's instead define λ = 1/(Cm) and multiply the whole objective by 1/(Cm); this turns w⊤w + C Σi max(0, 1 − yi w⊤xi) into (1/m) Σi max(0, 1 − yi w⊤xi) + λ∥w∥₂².

3. Finally, let’s swap the order of the terms.

The result is the following:

The Soft Margin SVM

We call the linear model specified by h(x) = w∗⊤ x parameterized by w∗ a “soft margin”
SVM if w∗ is a solution to the optimization problem:
    w∗ = min_w (1/m) Σi max(0, 1 − yi w⊤xi) + λ∥w∥₂²
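Since this final objective is unconstrained, we can minimize it with the gradient-based tools mentioned earlier. Below is a minimal numpy sketch of (sub)gradient descent on the soft margin objective; the toy data, step size, iteration budget, and λ are arbitrary illustrative choices, not from the lecture. The hinge term is not differentiable where yi w⊤xi = 1, so a subgradient is used there.

    import numpy as np

    def soft_margin_objective(w, X, y, lam):
        """(1/m) * sum_i max(0, 1 - y_i w^T x_i) + lam * ||w||_2^2"""
        hinge = np.maximum(0.0, 1.0 - y * (X @ w))
        return hinge.mean() + lam * (w @ w)

    def soft_margin_subgradient(w, X, y, lam):
        """A subgradient of the objective above with respect to w."""
        active = (y * (X @ w) < 1).astype(float)        # points violating the margin of 1
        grad_hinge = -(active * y) @ X / X.shape[0]     # subgradient of the averaged hinge term
        return grad_hinge + 2.0 * lam * w               # plus the gradient of lam * w^T w

    # Toy data (illustrative only).
    X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])

    w = np.zeros(X.shape[1])
    lam, step = 0.1, 0.1
    for _ in range(500):                                # fixed step size and iteration budget
        w = w - step * soft_margin_subgradient(w, X, y, lam)

    print(w, soft_margin_objective(w, X, y, lam))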

4 The Empirical Risk Minimization View

Part of the reason we did those “clean up” steps is that now the above minimization problem looks
pretty familiar. We’re minimizing the following function of w:
    R̂(w) = (1/m) Σi max(0, 1 − yi w⊤xi) + λ∥w∥₂²

where the first term is the loss function and the second term is the ℓ2-regularizer.

Indeed, the soft margin SVM is just a plain old empirical risk minimization problem like linear or
logistic regression, just with a particular loss function and an L2 regularizer!
We call this specific loss the hinge loss:

    ℓhinge(h(x), y) = max(0, 1 − y · h(x))

A soft margin SVM therefore just uses the hinge loss plus L2 regularization. Here’s a plot comparing
the 0 − 1 loss, the hinge loss, and the logistic loss:

Figure 2: Three classification losses we’ve learned about so far (plus the exponential loss). These
are plotted as a function of yi h(xi ), because that quantity measures how “bad” the fit is – negative
values mean we are classifying incorrectly.

By looking at these loss functions, we can start to get a better sense of the behavior of these algo-
rithms. The hinge loss penalizes misclassification, but also penalizes the case where 0 < yi h(xi ) < 1.
The hinge loss has no penalty if points are classified correctly by at least a “margin” of 1. The ex-
ponential loss penalizes points the model is getting very wrong more heavily than the hinge/logistic
loss. What do you think this will mean if there are outliers?
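Since the plot itself may not survive in this copy of the notes, the following sketch tabulates the four losses from the figure as functions of z = y·h(x). The exact forms of the logistic and exponential losses here are common conventions and an assumption on my part, not definitions taken from the lecture.

    import numpy as np

    # Each loss as a function of z = y * h(x); see the assumption note above.
    losses = {
        "0-1":         lambda z: (z <= 0).astype(float),
        "hinge":       lambda z: np.maximum(0.0, 1.0 - z),
        "logistic":    lambda z: np.log(1.0 + np.exp(-z)),
        "exponential": lambda z: np.exp(-z),
    }

    z = np.array([-2.0, -0.5, 0.0, 0.5, 1.0, 2.0])
    for name, loss in losses.items():
        print(f"{name:>12}: {np.round(loss(z), 3)}")   # losses shrink as y * h(x) grows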
