QSRI Lecture 5: Support Vector Machines

Seth Flaxman

3 July 2019

Based on slides from Simon Rogers, Glasgow
The margin
Some data
- We'll 'think' in 2 dimensions.
- The SVM is a binary classifier: N data points, each with attributes x = [x1, x2]^T and target y = ±1.
[Figure: the two classes scattered in the (x1, x2) plane.]
- A linear decision boundary can be represented as a straight line:
\[
w^T x + b = 0
\]
- Our task is to find w and b.
- Once we have these, classification is easy:
\[
w^T x^* + b > 0 : y^* = 1, \qquad w^T x^* + b < 0 : y^* = -1
\]
- i.e. y* = sign(w^T x* + b) – see the sketch below.
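A minimal sketch of this decision rule in code, assuming NumPy (the lecture does not specify a language); the weights w and bias b below are placeholder values for illustration only.

```python
import numpy as np

def svm_predict(X, w, b):
    """Classify each row of X using the linear rule y* = sign(w^T x* + b)."""
    return np.sign(X @ w + b)

# Hypothetical w and b for a 2-dimensional example
w = np.array([1.0, -0.5])
b = 0.25
X_new = np.array([[2.0, 3.0], [-1.0, 0.0]])
print(svm_predict(X_new, w, b))   # entries are +1 or -1
```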
The margin
[Figure: a linear decision boundary with the margin γ, the distance from the boundary to the closest point on either side.]
Why maximise the margin?
[Figure: two boundaries that both separate the training data, one with a small margin γ and one with a large margin γ.]
- The larger-margin boundary sits as far as possible from the closest points of both classes, so it leaves the most room for error on unseen data.
Computing the margin
[Figure: two points x1 and x2 on opposite edges of the margin, separated by the margin width 2γ along the direction of w.]
- Projecting x1 − x2 onto the unit normal w/||w|| gives:
\[
2\gamma = \frac{1}{\|w\|} w^T (x_1 - x_2)
\]
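To connect this to the γ = 1/||w|| used on the next slide, here is the standard derivation, assuming (as stated later in the slides) that the boundary is scaled so that w^T x + b = ±1 at the closest points x1 and x2:

\begin{align*}
w^T x_1 + b = +1, \quad w^T x_2 + b = -1
  &\;\Rightarrow\; w^T (x_1 - x_2) = 2 \\
2\gamma = \frac{1}{\|w\|} w^T (x_1 - x_2) = \frac{2}{\|w\|}
  &\;\Rightarrow\; \gamma = \frac{1}{\|w\|}
\end{align*}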
Maximising the margin
- We want to maximise γ = 1/||w||.
- Equivalent to minimising ||w||.
- Equivalent to minimising (1/2)||w||^2 = (1/2) w^T w.
- We also need every point to be on the correct side of the margin:
\[
y_n (w^T x_n + b) \ge 1
\]
Maximising the margin
- We have the following optimisation problem:
\[
\operatorname*{argmin}_{w} \; \frac{1}{2} w^T w
\quad \text{subject to } y_n (w^T x_n + b) \ge 1 \text{ for all } n
\]
- Introducing a Lagrange multiplier αn for each constraint leads to the dual problem (the soft-margin dual shown later differs only in its constraint on αn):
\[
\operatorname*{argmax}_{\alpha} \; \sum_{n=1}^{N} \alpha_n - \frac{1}{2} \sum_{n,m=1}^{N} \alpha_n \alpha_m y_n y_m x_n^T x_m
\quad \text{subject to } \sum_{n=1}^{N} \alpha_n y_n = 0, \; \alpha_n \ge 0
\]
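A minimal sketch of solving this in practice, assuming scikit-learn (an assumption; the lecture does not prescribe a library): a very large C in sklearn.svm.SVC approximates the hard-margin problem above, since the soft-margin formulation introduced later reduces to it as C grows.

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated clusters as a stand-in for the slides' 2-d example
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.5, size=(20, 2)),
               rng.normal([3, 3], 0.5, size=(20, 2))])
y = np.array([-1] * 20 + [1] * 20)

# Very large C: margin violations are so heavily penalised that this
# behaves like the hard-margin problem on separable data
clf = SVC(kernel="linear", C=1e6).fit(X, y)
print(clf.predict([[0.5, 0.5], [2.5, 3.0]]))   # expected: [-1  1]
```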
Optimal boundary
[Figure: the training data with the optimal decision boundary drawn in the (x1, x2) plane.]
- The optimisation has a global minimum and gives us α1, . . . , αN.
- Compute w = Σ_n αn yn xn.
- Compute b = yn − w^T xn for one of the closest points.
- (Recall that we defined w^T xn + b = ±1 = yn for the closest points.)
- Plot w^T x + b = 0. (A sketch of recovering w and b in code follows below.)
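Continuing the scikit-learn sketch above (still an assumed setup): the fitted model stores the products αn yn for the support vectors in dual_coef_, so w and b can be recovered exactly as on this slide.

```python
import numpy as np

# clf is the fitted SVC(kernel="linear") from the previous sketch
w = (clf.dual_coef_ @ clf.support_vectors_).ravel()   # w = sum_n alpha_n y_n x_n
b = clf.intercept_[0]

# Sanity check: matches the library's own decision function
x_star = np.array([1.0, 2.0])
print(w @ x_star + b)
print(clf.decision_function([x_star])[0])   # should print the same value
```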
Support Vectors
- At the optimum, only 3 of the αn are non-zero (the squares in the figure).
[Figure: the training data and decision boundary, with the three points that have non-zero αn marked as squares.]
- Predictions are
\[
y^* = \operatorname{sign}\left( \sum_n \alpha_n y_n x_n^T x^* + b \right)
\]
- Predictions only depend on these data points!
- We knew that – the margin is only a function of the closest points.
- These points are called Support Vectors.
- Normally a small proportion of the data: the solution is sparse.
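A quick check of this sparsity, reusing the scikit-learn model and data from the earlier sketches (again an assumed setup): the decision values rebuilt from the support vectors alone agree with the library's predictions.

```python
import numpy as np

# Only the support vectors carry non-zero alpha_n
print(f"{clf.support_vectors_.shape[0]} support vectors out of {X.shape[0]} points")

# Rebuild sign(sum_n alpha_n y_n x_n^T x* + b) from the support vectors only
X_test = np.array([[1.0, 1.0], [2.0, 2.5]])
scores = X_test @ (clf.dual_coef_ @ clf.support_vectors_).ravel() + clf.intercept_[0]
print(np.sign(scores))        # manual predictions
print(clf.predict(X_test))    # library predictions, should agree
```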
Is sparseness good?
- Not always:
[Figure: a dataset in which one point of one class lies close to the other class.]
- The constraint yn(w^T xn + b) ≥ 1 must hold for every point, so this single awkward point drags the boundary to a poor position.
Soft margin
- We can relax the constraints with slack variables ξn:
\[
y_n (w^T x_n + b) \ge 1 - \xi_n, \qquad \xi_n \ge 0
\]
- Our optimisation becomes:
\[
\operatorname*{argmin}_{w} \; \frac{1}{2} w^T w + C \sum_{n=1}^{N} \xi_n
\quad \text{subject to } y_n (w^T x_n + b) \ge 1 - \xi_n, \; \xi_n \ge 0
\]
- And when we add Lagrange multipliers etc., the dual becomes:
\[
\operatorname*{argmax}_{\alpha} \; \sum_{n=1}^{N} \alpha_n - \frac{1}{2} \sum_{n,m=1}^{N} \alpha_n \alpha_m y_n y_m x_n^T x_m
\quad \text{subject to } \sum_{n=1}^{N} \alpha_n y_n = 0, \; 0 \le \alpha_n \le C
\]
- The only change is an upper bound on αn! (See the check below.)
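To see the box constraint 0 ≤ αn ≤ C numerically, here is a hedged scikit-learn sketch (an assumed setup, not the lecture's own code): the absolute values of dual_coef_ are the αn of the support vectors, so they should never exceed C.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Overlapping classes, so some slack is unavoidable
X2 = np.vstack([rng.normal([0, 0], 1.0, size=(30, 2)),
                rng.normal([2, 2], 1.0, size=(30, 2))])
y2 = np.array([-1] * 30 + [1] * 30)

for C in [0.1, 1.0, 10.0]:
    clf = SVC(kernel="linear", C=C).fit(X2, y2)
    alphas = np.abs(clf.dual_coef_).ravel()   # alpha_n for each support vector
    print(f"C={C}: {alphas.size} support vectors, max alpha = {alphas.max():.3f}")
```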
Soft margins
- Here's our problematic data again:
[Figure: the dataset with the badly-placed square point.]
- αn for the 'bad' square is 3.5.
- So, if we set C < 3.5, we should see this point having less influence and the boundary moving to somewhere more sensible...
Soft margins
- Try C = 1:
[Figure: the same data with the boundary obtained at C = 1.]
- We have an extra support vector.
- And a better decision boundary.
SVMs – some observations
- The primal has parameters w = [w1, w2]^T and b; in general that is D + 1 parameters.
- The dual instead has N parameters: α1, . . . , αN.
- Sounds harder?
- It depends on the data dimensionality.
- Recall the mouse genome dataset: N = 1522, D = 10346.
- Typical in many settings: N ≪ D. (See the shape check below.)
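A small NumPy illustration of why N parameters can be the cheaper option when N ≪ D (the dimensions echo the mouse genome example, but the data itself is random placeholder values): the dual only ever needs the N × N matrix of inner products, however large D is.

```python
import numpy as np

N, D = 1522, 10346                   # sizes from the mouse genome example
rng = np.random.default_rng(0)
Xg = rng.normal(size=(N, D))         # placeholder data, only the shapes matter here

K = Xg @ Xg.T                        # all inner products x_n^T x_m
print(K.shape)                       # (1522, 1522): N x N, independent of D
```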
Inner products
- Notice that the dual optimisation and the predictions involve the data only through the inner products x_n^T x_m – this observation is what the next slides build on.
Projections
- Our SVM can find linear decision boundaries.
- What if the data requires something nonlinear?
[Figure: left, the two classes in the (x1, x2) plane; right, the same points plotted against φ(x), where a single threshold separates them.]
- We can transform the data, e.g.:
\[
\phi(x) = x_1^2 + x_2^2
\]
- So that it can be separated with a straight line.
- And use φ(x) instead of x in our optimisation. (A sketch follows below.)
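A hedged sketch of this projection with NumPy and scikit-learn (the ring-shaped data is my own stand-in for the slides' example): apply φ(x) = x1² + x2² explicitly and fit a linear SVM in the transformed space.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Stand-in data: an inner cluster (class -1) surrounded by a ring (class +1)
r = np.concatenate([rng.uniform(0, 1, 50), rng.uniform(2, 3, 50)])
theta = rng.uniform(0, 2 * np.pi, 100)
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
y = np.array([-1] * 50 + [1] * 50)

# Explicit projection phi(x) = x1^2 + x2^2: a single transformed feature
phi = (X ** 2).sum(axis=1, keepdims=True)

clf = SVC(kernel="linear").fit(phi, y)   # a linear boundary in phi-space
print(clf.score(phi, y))                 # 1.0: separable after the projection
```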
Projections
- Our optimisation is now:
\[
\operatorname*{argmax}_{\alpha} \; \sum_{n} \alpha_n - \frac{1}{2} \sum_{n,m} \alpha_n \alpha_m y_n y_m \phi(x_n)^T \phi(x_m)
\]
- And predictions:
\[
y^* = \operatorname{sign}\left( \sum_n \alpha_n y_n \phi(x_n)^T \phi(x^*) + b \right)
\]
- In this case, the data only ever enter through the inner products φ(xn)^T φ(xm).
Projections
- We needn't directly think of projections at all.
- We can just think of functions k(xn, xm) that are dot products in some space.
- These are called kernel functions.
- We don't ever need to actually project the data – we just use the kernel function to compute what the dot product would be if we did project.
- Optimisation task:
\[
\operatorname*{argmax}_{\alpha} \; \sum_{n} \alpha_n - \frac{1}{2} \sum_{n,m} \alpha_n \alpha_m y_n y_m k(x_n, x_m)
\]
- Predictions (see the sketch below):
\[
y^* = \operatorname{sign}\left( \sum_n \alpha_n y_n k(x_n, x^*) + b \right)
\]
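A hedged sketch of supplying a kernel directly, assuming scikit-learn: SVC(kernel="precomputed") accepts the N × N matrix of k(xn, xm) values in place of the raw data, which is exactly what the optimisation above needs. The Gaussian kernel used here is defined on the next slide, and X, y are the ring-shaped stand-in data from the earlier sketch.

```python
import numpy as np
from sklearn.svm import SVC

def gaussian_kernel_matrix(A, B, beta=1.0):
    """K[i, j] = exp(-beta * ||a_i - b_j||^2)."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-beta * sq_dists)

K_train = gaussian_kernel_matrix(X, X)        # N x N Gram matrix
clf = SVC(kernel="precomputed").fit(K_train, y)

# Prediction needs k(x*, x_n) between the test points and the training points
X_test = np.array([[0.0, 0.0], [2.5, 0.0]])
K_test = gaussian_kernel_matrix(X_test, X)    # shape (n_test, N)
print(clf.predict(K_test))                    # expected: [-1  1]
```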
Kernels
- Gaussian:
\[
k(x_n, x_m) = \exp\left\{ -\beta (x_n - x_m)^T (x_n - x_m) \right\}
\]
- Polynomial:
\[
k(x_n, x_m) = (1 + x_n^T x_m)^{\beta}
\]
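A minimal NumPy sketch of these two kernel functions, with parameter names matching the slides (roughly, scikit-learn's gamma plays the role of β for the Gaussian kernel and degree plays it for the polynomial one).

```python
import numpy as np

def gaussian_kernel(xn, xm, beta=1.0):
    # k(xn, xm) = exp{-beta * (xn - xm)^T (xn - xm)}
    diff = xn - xm
    return np.exp(-beta * diff @ diff)

def polynomial_kernel(xn, xm, beta=2):
    # k(xn, xm) = (1 + xn^T xm)^beta
    return (1.0 + xn @ xm) ** beta

xn, xm = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(gaussian_kernel(xn, xm, beta=0.1))
print(polynomial_kernel(xn, xm, beta=3))
```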
A technical point
- Our decision boundary was defined as w^T x + b = 0.
- Now, w is defined as:
\[
w = \sum_{n=1}^{N} \alpha_n y_n \phi(x_n)
\]
- We may not be able to write w down explicitly (φ may map into a very high-dimensional space), but we never need to: predictions only require w^T φ(x*) = Σ_n αn yn k(xn, x*).
Aside: kernelising other algorithms
- The same trick applies to any algorithm that can be written purely in terms of inner products between data points.
Example – nonlinear data
[Figure: a two-class dataset in the (x1, x2) plane that is not linearly separable.]
- We'll use a Gaussian kernel:
\[
k(x_n, x_m) = \exp\left\{ -\beta (x_n - x_m)^T (x_n - x_m) \right\}
\]
[Figure: the decision boundary obtained with β = 1.]
Examples
[Figure: the decision boundary obtained with the Gaussian kernel and β = 0.01.]
Examples
[Figure: the decision boundary obtained with the Gaussian kernel and β = 50.]
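A hedged sketch of reproducing this β comparison with scikit-learn, reusing the ring-shaped stand-in data X, y from earlier (in SVC's RBF kernel, gamma plays the role of the slides' β).

```python
from sklearn.svm import SVC

# exp(-gamma * ||xn - xm||^2) matches the slides' Gaussian kernel with gamma = beta
for beta in [0.01, 1.0, 50.0]:
    clf = SVC(kernel="rbf", gamma=beta).fit(X, y)
    print(f"beta={beta}: training accuracy = {clf.score(X, y):.2f}, "
          f"{clf.support_vectors_.shape[0]} support vectors")
```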
The Gaussian kernel
- The parameter β controls how quickly the kernel decays with distance, and hence how flexible the resulting boundary is – compare the β = 0.01, 1 and 50 examples above.
Choosing kernel function, parameters and C
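The slides leave open how to pick the kernel, its parameters and C. One common approach (not necessarily the one taken in the lecture) is cross-validation; a minimal scikit-learn sketch, reusing X, y from the earlier examples:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Candidate values for C and for the Gaussian kernel's beta (gamma in sklearn)
param_grid = {"kernel": ["rbf"], "C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}

search = GridSearchCV(SVC(), param_grid, cv=5)   # 5-fold cross-validation
search.fit(X, y)
print(search.best_params_, search.best_score_)
```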
Summary - SVMs
- The SVM finds the linear boundary w^T x + b = 0 that maximises the margin γ = 1/||w||.
- The soft-margin version tolerates constraint violations, with the trade-off controlled by C.
- The dual problem touches the data only through inner products, so kernel functions give nonlinear boundaries.
- Predictions depend only on the support vectors; the kernel parameters (e.g. β) and C control how flexible the fit is.