Support Vector Machines and Kernel methods

Seth Flaxman¹

Imperial College London

3 July 2019

¹Based on slides from Simon Rogers, Glasgow
1 / 31
The margin

- We have seen that our algorithms so far focus on minimising loss.
- The Support Vector Machine (SVM) has a different objective and underlying principle.
- It finds the decision boundary that maximises the margin.

2 / 31
Some data

- We'll 'think' in 2 dimensions.

[Figure: toy two-class data plotted in the (x1, x2) plane]

- The SVM is a binary classifier: N data points, each with attributes x = [x1, x2]^T and target y = ±1.
- A linear decision boundary can be represented as a straight line:

  w^T x + b = 0

- Our task is to find w and b.
- Once we have these, classification is easy:

  w^T x* + b > 0 : y* = 1
  w^T x* + b < 0 : y* = −1

  i.e. y* = sign(w^T x* + b)

3 / 31
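
As a small illustration (an editor's addition, not from the slides), the classification rule y* = sign(w^T x* + b) can be written directly in numpy; the weights w and bias b below are made-up values standing in for a fitted boundary.

```python
import numpy as np

# Assumed, purely illustrative parameters of an already-fitted boundary w^T x + b = 0
w = np.array([1.0, -2.0])
b = 0.5

# Two new points x* to classify
X_new = np.array([[3.0, 1.0],
                  [-1.0, 2.0]])

# y* = sign(w^T x* + b)
y_pred = np.sign(X_new @ w + b)
print(y_pred)  # [ 1. -1.]
```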
The margin

[Figure: a decision boundary with the margin γ marked on either side]

- How do we choose w and b?
- We need a quantity to optimise!
- Use the margin, γ.
- Maximise it!

Definition. The margin is the perpendicular distance from the decision boundary to the closest points on each side.

4 / 31
Why maximise the margin?

[Figure: two candidate decision boundaries, each with its margin γ marked]

- The maximum-margin decision boundary (left) seems to reflect the data characteristics better than the other boundary (right).
- Note how the margin is much smaller on the right and the closest points have changed.
- A larger margin will be more robust to noise, as it is less likely to change if the dataset had been slightly different.

5 / 31
Computing the margin

[Figure: the two closest points x1 and x2 on either side of the boundary, a distance 2γ apart along the direction of w]

  2γ = (1/||w||) w^T (x1 − x2)

Fix the scale such that:

  w^T x1 + b = 1
  w^T x2 + b = −1

Therefore:

  (w^T x1 + b) − (w^T x2 + b) = 2
  w^T (x1 − x2) = 2
  γ = 1/||w||

6 / 31
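
A quick numerical check of γ = 1/||w|| (an editor's sketch with made-up numbers, assuming the scale has been fixed as above so that the closest points satisfy w^T x + b = ±1):

```python
import numpy as np

# Assumed boundary with ||w|| = 5, so the margin should be 1/5 = 0.2
w = np.array([3.0, 4.0])
b = -2.0

gamma = 1.0 / np.linalg.norm(w)

# The perpendicular distance from a point x0 to the plane w^T x + b = 0 is |w^T x0 + b| / ||w||.
# For a closest point, where w^T x0 + b = 1, that distance equals the margin gamma:
x0 = np.array([1.0, 0.0])          # satisfies w^T x0 + b = 1
dist = abs(w @ x0 + b) / np.linalg.norm(w)
print(gamma, dist)                 # 0.2 0.2
```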
Maximising the margin

- We want to maximise γ = 1/||w||.
- Equivalent to minimising ||w||.
- Equivalent to minimising (1/2)||w||² = (1/2) w^T w.
- There are some constraints:
  - For x_n with y_n = 1: w^T x_n + b ≥ 1
  - For x_n with y_n = −1: w^T x_n + b ≤ −1
- Which can be expressed more neatly as:

  y_n (w^T x_n + b) ≥ 1

- (This is why we use y_n = ±1 and not y_n ∈ {0, 1}.)

7 / 31
Maximising the margin

- We have the following optimisation problem:

  argmin_w (1/2) w^T w
  subject to: y_n (w^T x_n + b) ≥ 1

- We can put the constraints into the minimisation using Lagrange multipliers:

  argmin_w (1/2) w^T w − Σ_{n=1}^{N} α_n (y_n (w^T x_n + b) − 1)
  subject to: α_n ≥ 0

- Non-support vectors have α_n = 0: the solution is sparse in terms of α_n!

8 / 31
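
In practice this quadratic program is handed to a solver. As a sketch (an editor's addition, not part of the slides), scikit-learn's SVC solves the dual directly; a hard margin can be approximated by making C very large.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters standing in for the lecture's toy data (an assumption)
X, y = make_blobs(n_samples=40, centers=2, cluster_std=0.6, random_state=0)
y = 2 * y - 1                      # relabel {0, 1} -> {-1, +1} to match y = ±1

# A very large C approximates the hard-margin problem
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

print("w =", clf.coef_[0], "b =", clf.intercept_[0])
print("number of support vectors:", clf.n_support_.sum())
```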
Optimal boundary

[Figure: the toy data with the optimal (maximum-margin) boundary drawn in the (x1, x2) plane]

- The optimisation has a global minimum and gives us α_1, ..., α_N.
- Compute w = Σ_n α_n y_n x_n.
- Compute b = y_n − w^T x_n (for one of the closest points).
- Recall that we defined w^T x_n + b = ±1 = y_n for the closest points.
- Plot w^T x + b = 0.

9 / 31
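
Recovering w and b from the solved dual can be checked directly. A sketch continuing the fit above (so `clf`, `X` and `y` are assumed from that example); scikit-learn's `dual_coef_` stores the products α_n y_n for the support vectors.

```python
import numpy as np

# dual_coef_[0, j] holds alpha_j * y_j for the j-th support vector
alpha_times_y = clf.dual_coef_[0]
sv = clf.support_vectors_

# w = sum_n alpha_n y_n x_n (only support vectors contribute; all other alpha_n are 0)
w = alpha_times_y @ sv
print(np.allclose(w, clf.coef_[0]))      # True

# b = y_n - w^T x_n for one of the closest points
n = clf.support_[0]
b = y[n] - w @ X[n]
print(b, clf.intercept_[0])              # approximately equal
```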
Support Vectors

- At the optimum, only 3 non-zero α values (the squares).

[Figure: the toy data with the three support vectors highlighted as squares]

- y* = sign(Σ_n α_n y_n x_n^T x* + b)
- Predictions only depend on these data points!
- We knew that: the margin is only a function of the closest points.
- These are called Support Vectors.
- Normally a small proportion of the data:
  - The solution is sparse.

10 / 31
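
Sparsity is easy to inspect. Continuing the sketch above (again assuming `clf`, `X` and `y` from the earlier example), only the support vectors enter the prediction.

```python
import numpy as np

# Only a handful of the N training points end up with alpha_n > 0
print("N =", len(X), " support vectors =", len(clf.support_))

# Predictions use only those points: y* = sign( sum_n alpha_n y_n x_n^T x* + b )
x_star = X[0]
score = clf.dual_coef_[0] @ (clf.support_vectors_ @ x_star) + clf.intercept_[0]
print(np.sign(score) == clf.predict([x_star])[0])   # True
```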
Is sparseness good?

- Not always:

[Figure: data with one awkwardly placed point that drags the maximum-margin boundary into an unnatural position]

- Why does this happen?

  y_n (w^T x_n + b) ≥ 1

- All points must be on the correct side of the boundary.
- This is a hard margin.

11 / 31
Soft margin

- We can relax the constraints:

  y_n (w^T x_n + b) ≥ 1 − ξ_n,   ξ_n ≥ 0

- Our optimisation becomes:

  argmin_w (1/2) w^T w + C Σ_{n=1}^{N} ξ_n
  subject to: y_n (w^T x_n + b) ≥ 1 − ξ_n

- And when we add the Lagrange multipliers etc.:

  argmax_α Σ_{n=1}^{N} α_n − (1/2) Σ_{n,m=1}^{N} α_n α_m y_n y_m x_n^T x_m
  subject to: Σ_{n=1}^{N} α_n y_n = 0,   0 ≤ α_n ≤ C

- The only change is an upper bound on α_n!

12 / 31
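
The soft margin is what scikit-learn's SVC implements. A sketch (an editor's addition) showing the effect of the upper bound C on the α_n, on synthetic overlapping clusters standing in for noisy data:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping clusters, so a hard margin would be dragged around by awkward points (an assumption)
X, y = make_blobs(n_samples=40, centers=2, cluster_std=1.5, random_state=1)
y = 2 * y - 1

for C in [0.1, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    alphas = np.abs(clf.dual_coef_[0])           # |alpha_n y_n| = alpha_n
    print(f"C={C:>6}: {len(clf.support_)} support vectors, "
          f"max alpha = {alphas.max():.3f} (bounded above by C)")
```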
Soft margins

- Here's our problematic data again:

[Figure: the problematic data, with the 'bad' square point and the skewed hard-margin boundary]

- α_n for the 'bad' square is 3.5.
- So, if we set C < 3.5, we should see this point having less influence and the boundary moving somewhere more sensible...

13 / 31
Soft margins

- Try C = 1:

[Figure: the same data refit with C = 1; the boundary is no longer dragged towards the 'bad' point]

- We have an extra support vector.
- And a better decision boundary.

14 / 31
Soft margins

- The choice of C is very important:
  - Too high and we over-fit to noise.
  - Too low and we under-fit...
  - ...and lose any sparsity.
- Choose it using cross-validation.

15 / 31
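
A minimal cross-validation sketch for choosing C (an editor's addition), reusing the X and y from the previous sketch:

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Score a few candidate values of C by 5-fold cross-validation
for C in [0.01, 0.1, 1.0, 10.0, 100.0]:
    scores = cross_val_score(SVC(kernel="linear", C=C), X, y, cv=5)
    print(f"C={C:>6}: mean accuracy = {scores.mean():.3f}")
```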
SVMs – some observations

- In our example, we started with 3 parameters: w = [w1, w2]^T and b.
- In general: D + 1.
- We now have N: α_1, ..., α_N.
- Sounds harder?
- It depends on the data dimensionality:
  - Recall the mouse genome dataset: N = 1522, D = 10346.
  - Typical in many settings: N ≪ D.

16 / 31
Inner products

- Here's the optimisation problem:

  argmax_α Σ_n α_n − (1/2) Σ_{n,m} α_n α_m y_n y_m x_n^T x_m

- Here's the decision function:

  y* = sign(Σ_n α_n y_n x_n^T x* + b)

- The data (x_n, x_m, x*, etc.) only appear as inner (dot) products:

  x_n^T x_m,  x_n^T x*,  etc.

17 / 31
Projections

- Our SVM can find linear decision boundaries.
- What if the data requires something nonlinear?

[Figure: left, ring-shaped two-class data in the (x1, x2) plane; right, the same data after the projection φ(x), where a single threshold separates the classes]

- We can transform the data, e.g.:

  φ(x) = x1² + x2²

- So that it can be separated with a straight line.
- And use φ(x) instead of x in our optimisation.

18 / 31
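
A sketch of this particular projection on ring-shaped data (an editor's addition using `make_circles` as a stand-in for the slide's example; the threshold value is an assumption):

```python
import numpy as np
from sklearn.datasets import make_circles

# Two concentric rings: not linearly separable in (x1, x2)
X, y = make_circles(n_samples=200, factor=0.4, noise=0.05, random_state=0)
y = 2 * y - 1                        # inner ring +1, outer ring -1

# One-dimensional projection phi(x) = x1^2 + x2^2 (squared distance from the origin)
phi = X[:, 0] ** 2 + X[:, 1] ** 2

# In the projected space a single threshold separates the two classes
threshold = 0.5                      # assumed cut between the two rings
y_hat = np.where(phi < threshold, 1, -1)
print("accuracy of the threshold rule:", np.mean(y_hat == y))   # close to 1.0
```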
Projections

- Our optimisation is now:

  argmax_α Σ_n α_n − (1/2) Σ_{n,m} α_n α_m y_n y_m φ(x_n)^T φ(x_m)

- And predictions:

  y* = sign(Σ_n α_n y_n φ(x_n)^T φ(x*) + b)

- In this case:

  φ(x_n)^T φ(x_m) = (x_n1² + x_n2²)(x_m1² + x_m2²) = k(x_n, x_m)

- We can think of the dot product in the projected space as a function of the original data.

19 / 31
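
This identity is easy to check numerically. Since φ here maps to a single number, the 'dot product' in the projected space is an ordinary product (a small sketch; the helper names are the editor's own):

```python
import numpy as np

def phi(x):
    # The projection from the previous slide: phi(x) = x1^2 + x2^2 (a scalar)
    return x[0] ** 2 + x[1] ** 2

def k(xn, xm):
    # The same quantity computed directly from the original data, without projecting
    return (xn[0] ** 2 + xn[1] ** 2) * (xm[0] ** 2 + xm[1] ** 2)

xn = np.array([1.0, 2.0])
xm = np.array([-0.5, 3.0])

print(phi(xn) * phi(xm))   # dot product in the (here one-dimensional) projected space
print(k(xn, xm))           # identical value: 46.25 in both cases
```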
Projections

- We needn't directly think of projections at all.
- We can just think of functions k(x_n, x_m) that are dot products in some space.
- These are called kernel functions.
- We don't ever need to actually project the data – just use the kernel function to compute what the dot product would be if we did project.
- Optimisation task:

  argmax_α Σ_n α_n − (1/2) Σ_{n,m} α_n α_m y_n y_m k(x_n, x_m)

- Predictions:

  y* = sign(Σ_n α_n y_n k(x_n, x*) + b)

20 / 31
Kernels

- Plenty of off-the-shelf kernels that we can use:
  - Linear: k(x_n, x_m) = x_n^T x_m
  - Gaussian: k(x_n, x_m) = exp{ −β (x_n − x_m)^T (x_n − x_m) }
  - Polynomial: k(x_n, x_m) = (1 + x_n^T x_m)^β
- These all correspond to φ(x_n)^T φ(x_m) for some transformation φ(x_n).
- We don't know what the projections φ(x_n) are – and we don't need to know!

21 / 31
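
Each of these kernels is a one-liner; a sketch matching the formulas on the slide (the function names are the editor's own). In scikit-learn the same choices are available as kernel="linear", kernel="rbf" (where gamma plays the role of β) and kernel="poly" (where degree plays the role of β).

```python
import numpy as np

def linear_kernel(xn, xm):
    return xn @ xm

def gaussian_kernel(xn, xm, beta=1.0):
    diff = xn - xm
    return np.exp(-beta * (diff @ diff))

def polynomial_kernel(xn, xm, beta=2):
    return (1.0 + xn @ xm) ** beta

xn, xm = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(xn, xm))               # -1.5
print(gaussian_kernel(xn, xm, beta=0.5))   # exp(-0.5 * 9.25)
print(polynomial_kernel(xn, xm, beta=2))   # 0.25
```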
Kernels

- Our algorithm is still only finding linear boundaries...
- ...but we're finding linear boundaries in some other space.
- The optimisation is just as simple, regardless of the kernel choice:
  - Still a quadratic program.
  - Still a single, global optimum.
- We can find very complex decision boundaries with a linear algorithm!

22 / 31
A technical point

- Our decision boundary was defined as w^T x + b = 0.
- Now, w is defined as:

  w = Σ_{n=1}^{N} α_n y_n φ(x_n)

- We don't know φ(x_n).
- We only know φ(x_n)^T φ(x_m) = k(x_n, x_m).
- So, we can't compute w or the boundary explicitly!
- But we can evaluate the predictions on a grid of x* and draw a contour using R:

  Σ_{n=1}^{N} α_n y_n k(x_n, x*) + b

23 / 31
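
The slides do this in R; an equivalent sketch in Python/matplotlib (an editor's addition, not the lecture's code): fit a kernel SVM on synthetic ring data, evaluate Σ_n α_n y_n k(x_n, x*) + b on a grid via `decision_function`, and draw its zero contour.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.4, noise=0.05, random_state=0)
clf = SVC(kernel="rbf", gamma=1.0, C=10.0).fit(X, y)

# Grid of test points x*
xx, yy = np.meshgrid(np.linspace(-1.5, 1.5, 200), np.linspace(-1.5, 1.5, 200))
grid = np.c_[xx.ravel(), yy.ravel()]

# sum_n alpha_n y_n k(x_n, x*) + b, evaluated at every grid point
Z = clf.decision_function(grid).reshape(xx.shape)

plt.scatter(X[:, 0], X[:, 1], c=y)
plt.contour(xx, yy, Z, levels=[0.0])   # the decision boundary
plt.show()
```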
Aside: kernelising other algorithms

- Many algorithms can be kernelised:
  - Any that can be written with the data only appearing as inner products.
- Simple algorithms can be used to solve very complex problems!

24 / 31
Example – nonlinear data

[Figure: nonlinearly separable two-class data in the (x1, x2) plane]

- We'll use a Gaussian kernel:

  k(x_n, x_m) = exp{ −β (x_n − x_m)^T (x_n − x_m) }

- And vary β (with C = 10).

25 / 31
Examples

[Figure: decision boundary obtained with β = 1]

- β = 1.

  k(x_n, x_m) = exp{ −β (x_n − x_m)^T (x_n − x_m) }

26 / 31
Examples

[Figure: decision boundary obtained with β = 0.01]

- β = 0.01.

  k(x_n, x_m) = exp{ −β (x_n − x_m)^T (x_n − x_m) }

27 / 31
Examples

[Figure: decision boundary obtained with β = 50]

- β = 50.

  k(x_n, x_m) = exp{ −β (x_n − x_m)^T (x_n − x_m) }

28 / 31
The Gaussian kernel

- β controls the complexity of the decision boundaries.
- β = 0.01 was too simple:
  - Not flexible enough to surround just the square class.
- β = 50 was too complex:
  - It memorises the data.
- β = 1 was about right.
- Neither β = 50 nor β = 0.01 will generalise well.
- Both are also non-sparse (lots of support vectors).

29 / 31
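
A sketch in the spirit of these examples (an editor's addition; scikit-learn's `gamma` plays the role of β): fit the same kind of ring data with β = 0.01, 1 and 50, and compare the number of support vectors and the cross-validated accuracy.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.4, noise=0.15, random_state=0)

for beta in [0.01, 1.0, 50.0]:
    clf = SVC(kernel="rbf", gamma=beta, C=10.0).fit(X, y)
    cv = cross_val_score(SVC(kernel="rbf", gamma=beta, C=10.0), X, y, cv=5).mean()
    print(f"beta={beta:>5}: {len(clf.support_)} support vectors, CV accuracy {cv:.2f}")
```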
Choosing kernel function, parameters and C

- The choice of kernel function and its parameters is data dependent.
- It is easy to overfit.
- We need to set C too:
  - C and β are linked.
  - C too high – overfitting.
  - C too low – underfitting.
- Cross-validation!
  - Search over β and C.
  - The SVM scales as N³ (naive implementation).
  - For large N, cross-validation over many C and β values is infeasible.

30 / 31
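
A standard grid search over C and β (gamma) with cross-validation (an editor's sketch; the grid below is illustrative only, and since each fit can cost O(N³) with a naive solver, large N calls for a coarser grid or a subsample):

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.4, noise=0.15, random_state=0)

# Illustrative grid; sensible ranges are data dependent
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10, 50]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```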
Summary - SVMs

- Described a classifier that is optimised by maximising the margin.
- Did some rearranging to turn it into a quadratic programming problem.
- Saw that the data only appear as inner products.
- Introduced the idea of kernels:
  - We can fit a linear boundary in some other space without explicitly projecting.
- Loosened the SVM constraints to allow points on the wrong side of the boundary.
- Other algorithms can be kernelised... we'll see a clustering one in the future.

31 / 31
