Tutorial 4: SVM
GEOMETRY OF SUPPORT VECTOR MACHINES AND FEATURE SELECTION
Abstract. In class, SVMs were introduced as a variation of the Perceptron algorithm: SVM uses the Hinge loss instead of the Perceptron loss. In the first part of this handout, we give a geometric interpretation of what SVM does, and recover the Hinge loss from geometry. For interested readers, we also introduce the concept of support vectors, as well as the dual of the SVM problem, which can be considered a prelude to kernels. In the second part, we give some examples for the problem of feature selection in machine learning. Sections marked with a * are optional.
Notations. We do not use bold font to distinguish vectors from scalars, as this is easily understood from the context. If x is a vector, its ith component is denoted by xi. We index different vectors by lower indices, e.g., x1 and xj. We denote by ⟨u, v⟩ the inner product of u and v, defined by ⟨u, v⟩ = Σ_i ui vi. The Euclidean norm of a vector x is denoted by ‖x‖.
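As a quick illustration (a minimal sketch with arbitrary example vectors, not part of the handout), these two quantities correspond to the following NumPy operations.

import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

inner = np.dot(u, v)          # <u, v> = sum_i u_i v_i  -> 32.0
norm = np.linalg.norm(u)      # ||u|| = sqrt(<u, u>)    -> ~3.742
print(inner, norm)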
1. Linear Classification
Let d ∈ N and D be an (unknown, but fixed) distribution over Rd × {−1, 1}. Assume that
we are given n i.i.d. samples (x1 , y1 ), . . . , (xn , yn ) ∼ D, where xi ∈ Rd are the features, and
yi ∈ {−1, 1} are the labels. For now, we make the following assumption (we will relax this
assumption in later sections):
Assumption 2 (Linear Separability). There exists a hyperplane that separates the data points
with y = 1 from those with y = −1. That is, there exists w ∈ Rd and b ∈ R such that
yi (⟨w, xi⟩ − b) ≥ 0, ∀i ∈ {1, . . . , n}.
Any such hyperplane is called a separating hyperplane. Note that w is pointing towards the
datapoints with y = 1.
Our goal is to find (among all possible separating hyperplanes) the hyperplane H with maxi-
mum margin, where the margin is the distance of the closest point among x1 , . . . , xn to H, see
Figure 1. The reason behind this choice of a hyperplane is two-fold: firstly, the intuitive picture
suggests that this hyperplane is the correct separating hyperplane. Secondly, it is related to
generalisation power and stability, which we do not cover in this handout.
Figure 1. Different separating hyperplanes for a dataset. The green line de-
notes the largest margin separator.
Dividing w by a gives rise to a separator w̃ with the same margin, but with the new constraints

yi · ⟨w̃, xi⟩ ≥ 1.

Note that in this case, γD(w̃) = 1/‖w̃‖. Thus, maximising the margin is equivalent to minimising ‖w̃‖, or equivalently, minimising ‖w̃‖². The final optimisation problem can be stated as follows:
Definition 5 (Hard-SVM). Let D be a linearly separable dataset, where Assumption 4 holds. The following optimisation problem (called Hard-SVM) yields the maximum margin separator for D:
(2.2)    minimise_w   (1/2)‖w‖²
         s.t.   yi · ⟨w, xi⟩ ≥ 1   for all i ∈ {1, . . . , n}.
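As a sanity check, the Hard-SVM problem (2.2) can be written down almost verbatim in a convex-optimisation toolbox. The following is only a minimal sketch using CVXPY; the toy data and variable names are illustrative assumptions, not part of the handout.

import numpy as np
import cvxpy as cp

# Hypothetical toy data: two well-separated clusters in R^2 (illustrative only).
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -2.0], [-1.5, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
# Hard-SVM (2.2): minimise (1/2)||w||^2 subject to y_i <w, x_i> >= 1 for all i.
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                     [cp.multiply(y, X @ w) >= 1])
problem.solve()

print("w* =", w.value)                              # maximum margin separator
print("margin =", 1.0 / np.linalg.norm(w.value))    # gamma = 1 / ||w*||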
3. Soft-SVM
In the last section, we had two assumptions (Assumptions 2 and 4) about the dataset D,
which may not hold in a general dataset. Note that when these assumptions are violated, the
optimisation problem (2.2) has no solutions. In this section, we drop these assumptions, and
derive a soft version of SVM.
The key idea of this derivation is to let some of the constraints in (2.2) be violated. One way to achieve this is to introduce a slack variable for each constraint. Let ξ1, . . . , ξn ≥ 0, and define a new optimisation problem

(3.1)    minimise_{w, ξ1,...,ξn}   (1/2)‖w‖²
         s.t.   yi · ⟨w, xi⟩ ≥ 1 − ξi,   for all i ∈ {1, . . . , n},
                ξi ≥ 0,   for all i ∈ {1, . . . , n}.
The variables ξi measure the amount by which the ith constraint is violated: ξi = 0 means no violation, and a larger ξi indicates a larger violation; see Figure 2.
Although this optimisation problem has solutions for every dataset D (even the ones that are
not linearly separable), it is easy to see that its solution is always w = 0. This is not surprising,
as we have not defined any cost for violating the constraints. The simplest way to define such a cost is as follows:
(3.2)    minimise_{w, ξ1,...,ξn}   (1/2)‖w‖² + C Σ_{i=1}^n ξi
         s.t.   yi · ⟨w, xi⟩ ≥ 1 − ξi,   for all i ∈ {1, . . . , n},
                ξi ≥ 0,   for all i ∈ {1, . . . , n},
where C > 0 controls the softness of the problem: a larger C imposes a larger penalty on violations. The optimisation problem (3.2) is called Soft-SVM. One can observe that as C → ∞, the Soft-SVM problem approaches the Hard-SVM problem.
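The CVXPY sketch above extends directly to Soft-SVM by adding the slack variables of (3.2); again, the toy data and the choice C = 1.0 are purely illustrative assumptions.

import numpy as np
import cvxpy as cp

# Hypothetical toy data that is NOT linearly separable (one mislabelled point).
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -2.0], [-1.5, -2.5], [1.8, 1.9]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])
C = 1.0  # softness parameter; larger C penalises violations more heavily

w = cp.Variable(2)
xi = cp.Variable(len(y), nonneg=True)   # slack variables xi_i >= 0
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w) >= 1 - xi]
cp.Problem(objective, constraints).solve()

print("w* =", w.value)
print("slacks =", xi.value)   # nonzero entries mark the violated margin constraints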
Recall that Soft-SVM can equivalently be written as the regularised Hinge loss minimisation (4.2), minimise_w (1/2)‖w‖² + C Σ_{i=1}^n ℓhinge(yi · ⟨w, xi⟩), where ℓhinge(z) = max{0, 1 − z}. The following lemma states that this formulation has the same solution as (3.2).
Lemma 6. The optimisation problems (4.2) and (3.2) have the same solution.
Proof. Let w⋆ and ξ1⋆, . . . , ξn⋆ be the optimal solutions to (3.2). Feasibility implies that for all i ∈ {1, . . . , n},

ξi⋆ ≥ ℓhinge(yi · ⟨w⋆, xi⟩).

Optimality implies that the inequality above is in fact an equality. Thus, the optimisation problem reduces to (4.2). The reduction of (4.2) to (3.2) is similar.
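To see the equivalence numerically, one can minimise the regularised Hinge loss directly, for instance with plain subgradient descent, and compare the result with the Soft-SVM sketch above. This is only a sketch on the same illustrative toy data; the step size and iteration count are arbitrary assumptions.

import numpy as np

# Same illustrative toy data as in the Soft-SVM sketch above.
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -2.0], [-1.5, -2.5], [1.8, 1.9]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])
C = 1.0

w = np.zeros(2)
lr = 0.01
for t in range(5000):
    margins = y * (X @ w)
    violated = margins < 1
    # Subgradient of (1/2)||w||^2 + C * sum_i max(0, 1 - y_i <w, x_i>).
    grad = w - C * (y[violated][:, None] * X[violated]).sum(axis=0)
    w -= lr * grad

# Should approximately match the Soft-SVM solution obtained from (3.2).
print("w from Hinge-loss subgradient descent:", w)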
This completes our discussion about the equivalence of regularized Hinge loss minimisation
and the geometric approach to SVM. We conclude with the following remark.
Remark 7. There is a hidden issue in the transformation described at the beginning of Section 2. As seen in the optimisation problems (2.2) and (4.2), the last coordinate of the vector w is the intercept of the separating hyperplane. Since we are penalising the norm of w, we are also penalising the intercept. This means (roughly) that we are looking for hyperplanes that are closer to the origin.

The “correct” way to write the optimisation problem for SVM is to include b (the intercept) explicitly and write the problem in terms of b and w, keeping in mind that b should not be penalised. Although this is possible, the clarity of the current derivation and the small difference it makes in experiments made us choose the former approach. The reader is encouraged to derive the latter formulation and observe the difference.
5. Support Vectors*
Note. For this section, we need material discussed in the recap slides related to
KKT Optimality Conditions. This section is not related to the exam and is only
here for the interested reader.
We start with the Hard-SVM problem (2.2). Notice that the objective function is convex, and all the constraints are linear inequalities. Thus, the KKT conditions apply. First, we have the stationarity condition:
(5.1)    w − Σ_{i=1}^n λi yi xi = 0.

Moreover, complementary slackness gives λi = 0 for every constraint that is not tight. Letting T denote the set of indices i for which yi · ⟨w, xi⟩ = 1, we can therefore write

w = Σ_{i∈T} λi yi xi.
We have thus proved the following lemma:
Lemma 8. The solution to Hard-SVM can be expressed (or represented) as a linear combination
of the support vectors, i.e., those points of the dataset which are exactly on the margin.
The appealing property of support vectors is that they can be far fewer than the points in the whole dataset, so predictions (in the dual SVM) become less expensive.
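For illustration, off-the-shelf solvers expose these quantities; e.g., scikit-learn's SVC stores the support vectors and the corresponding dual coefficients after fitting. The toy data below is again a hypothetical example, and C is set very large so that the fit approximates Hard-SVM.

import numpy as np
from sklearn.svm import SVC

# Hypothetical linearly separable toy data.
X = np.array([[2.0, 2.0], [1.5, 2.5], [3.0, 3.0],
              [-2.0, -2.0], [-1.5, -2.5], [-3.0, -3.0]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)   # very large C ~ Hard-SVM
clf.fit(X, y)

print("support vectors:\n", clf.support_vectors_)   # only the points on the margin
print("dual coefficients:", clf.dual_coef_)         # equal to lambda_i * y_i for those points
print("w =", clf.coef_)                              # equals the sum of lambda_i y_i x_i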
We note that in this situation we can exchange the min and the max. Thus, we arrive at the following equivalent problem:
(6.1)    max_{α ∈ Rⁿ, α ≥ 0}   min_w   (1/2)‖w‖² + Σ_{i=1}^n αi (1 − yi · ⟨w, xi⟩).
The inner minimisation problem can be solved analytically, and one gets
w = Σ_{i=1}^n αi yi xi,
which is again the result of Lemma 8. Note that for constraints which are not tight, αi = 0.
Replacing the value of w in (6.1) gives
(6.2)    max_{α ∈ Rⁿ, α ≥ 0}   Σ_{i=1}^n αi − (1/2) Σ_{i=1}^n Σ_{j=1}^n αi αj yi yj ⟨xi, xj⟩.
Note that solving (6.2) only requires the inner products of the xi's. This is helpful in situations where the dimension of the problem is high (or even infinite!). You will learn in the course that the inner product ⟨xi, xj⟩ can be replaced with a so-called kernel k(xi, xj), which enables one to solve nonlinear separation problems.
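As a preview, and purely as an illustrative sketch (the dataset and hyperparameters are arbitrary assumptions), the following fits the dual SVM with an RBF kernel on data that no hyperplane can separate.

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: not separable by any hyperplane in R^2.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
kernel_svm = SVC(kernel="rbf", C=1.0).fit(X, y)   # replaces <x_i, x_j> with k(x_i, x_j)

print("linear kernel accuracy:", linear_svm.score(X, y))   # typically near chance
print("RBF kernel accuracy:   ", kernel_svm.score(X, y))   # close to 1.0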
7. Linear Models
Assume a regression problem, where features (x) are in R2 and responses (y) are real values.
Define the following distribution over the features and responses:
x1 ∼ Uniform(−1, 1)
y = (x1 )2
x2 = y + z, where z ∼ Uniform(−0.01, 0.01).
The question is, “which feature to choose? x1 or x2 ?”
Since the response is a function of x1, one may say that x1 is the best feature to select. However, in a linear model, x2 performs much better than x1 (why?).
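A quick simulation (only a sketch, with an arbitrary sample size) makes the comparison concrete: a linear model on x2 explains almost all of the variance of y, while a linear model on x1 explains almost none.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 10_000
x1 = rng.uniform(-1, 1, size=n)
y = x1 ** 2
x2 = y + rng.uniform(-0.01, 0.01, size=n)

r2_x1 = LinearRegression().fit(x1[:, None], y).score(x1[:, None], y)
r2_x2 = LinearRegression().fit(x2[:, None], y).score(x2[:, None], y)
print("R^2 using x1 only:", r2_x1)   # close to 0: y is an even function of x1
print("R^2 using x2 only:", r2_x2)   # close to 1: x2 is essentially y plus small noise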
The following example shows how selecting features by their individual predictive power can fail. Consider a linear regression problem on R². Define the distribution over features (x) and response (y) as follows:
y = x1 + 2x2,
x1 ∼ Uniform({±1}),
x2 = −(1/2) x1 + (1/2) z,   where z ∼ Uniform({±1}).
Notice that E[y · x1] = 0, meaning that x1 alone is not a good predictor! But one can see that any linear model achieving zero error must use both features, since y = x1 + 2x2 exactly; in particular, x1 cannot be discarded even though it is uncorrelated with y.
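A short simulation (again just a sketch with an arbitrary sample size) illustrates this: x1 alone has essentially zero explanatory power, x2 alone is only partially predictive, while the two features together recover y exactly.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 10_000
x1 = rng.choice([-1.0, 1.0], size=n)
z = rng.choice([-1.0, 1.0], size=n)
x2 = -0.5 * x1 + 0.5 * z
y = x1 + 2 * x2   # equivalently, y = z

X12 = np.c_[x1, x2]
print("R^2, x1 only:  ", LinearRegression().fit(x1[:, None], y).score(x1[:, None], y))  # ~0
print("R^2, x2 only:  ", LinearRegression().fit(x2[:, None], y).score(x2[:, None], y))  # ~0.5
print("R^2, x1 and x2:", LinearRegression().fit(X12, y).score(X12, y))                  # 1.0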
9. LASSO Regularisation
Consider a one dimensional linear regression problem with Lasso regularisation. The data
points are x1 , . . . , xn ∈ R and the responses are y1 , . . . , yn ∈ R. The problem is equivalent to
solving the following optimisation problem
(9.1)    minimise_{w ∈ R}   (1/(2n)) Σ_{i=1}^n (xi w − yi)² + λ|w|,
where λ is the regularisation parameter. Rewriting the objective function, we get the following
equivalent problem
minimise_{w ∈ R}   (1/2) w² · ( (1/n) Σ_{i=1}^n xi² ) − w · ( (1/n) Σ_{i=1}^n xi yi ) + λ|w|.
By normalising the data, we can assume that (1/n) Σ_{i=1}^n xi² = 1, and get

minimise_{w ∈ R}   (1/2) w² − (1/n) w · ⟨x, y⟩ + λ|w|.
The following lemma then gives us the optimal solution

w⋆ = sign(⟨x, y⟩) [ |⟨x, y⟩| / n − λ ]₊ ,
which basically means that if the correlation between x and y is below the threshold λ, the
solution is w? = 0. The reader is encouraged to derive the same solution for Ridge regression
and see that there is no thresholding effect in that case.
Lemma 9. The solution to the optimisation problem

minimise_{w ∈ R}   (1/2) w² − xw + λ|w|

for λ ≥ 0 is

w⋆ = sign(x) [ |x| − λ ]₊ ,

where [a]₊ = max{a, 0}.
Proof. A simple proof proceeds by a case-by-case analysis and is left to the reader.
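The closed form above is exactly the soft-thresholding operator. The following sketch (with hypothetical one-dimensional data; the helper name soft_threshold is just illustrative) implements it and checks it against scikit-learn's Lasso, whose objective matches (9.1) with alpha = λ when no intercept is fitted.

import numpy as np
from sklearn.linear_model import Lasso

def soft_threshold(x, lam):
    # w* = sign(x) [|x| - lam]_+ from Lemma 9.
    return np.sign(x) * max(abs(x) - lam, 0.0)

rng = np.random.default_rng(0)
n, lam = 100, 0.1
x = rng.normal(size=n)
x = x / np.sqrt(np.mean(x ** 2))       # normalise so that (1/n) sum_i x_i^2 = 1
y = 0.5 * x + rng.normal(scale=0.3, size=n)

w_closed_form = soft_threshold(np.dot(x, y) / n, lam)
w_lasso = Lasso(alpha=lam, fit_intercept=False).fit(x[:, None], y).coef_[0]
print(w_closed_form, w_lasso)          # the two values should agree up to solver tolerance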