SVM Notes
Hilary 2015
A. Zisserman

Wide margin
Cost function
Slack variables
Loss functions revisited
Optimization
Binary Classification

Given training data $(x_i, y_i)$ for $i = 1 \ldots N$, with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$, learn a classifier $f(x)$ such that

$$f(x_i) \begin{cases} \ge 0 & \text{if } y_i = +1 \\ < 0 & \text{if } y_i = -1 \end{cases}$$

i.e. $y_i f(x_i) > 0$ for a correct classification.
Linear separability

[Figure: example point sets that are linearly separable and not linearly separable]
Linear classifiers

A linear classifier has the form
$$f(x) = w^\top x + b$$

[Figure: in 2D the decision boundary $f(x) = 0$ is a line in the $(x_1, x_2)$ plane, with $f(x) > 0$ on one side and $f(x) < 0$ on the other]
Given linearly separable training data, the goal is a weight vector $w$ and bias $b$ such that
$$f(x_i) = w^\top x_i + b$$
separates the categories for $i = 1, \ldots, N$. How can we find this separating hyperplane?
The Perceptron algorithm

Initialize $w = 0$.
Cycle through the data points $\{x_i, y_i\}$:
    if $x_i$ is misclassified, then $w \leftarrow w + \alpha\, y_i\, x_i$
    (equivalently $w \leftarrow w - \alpha\, \mathrm{sign}(f(x_i))\, x_i$, since a misclassified point has $\mathrm{sign}(f(x_i)) = -y_i$)
Repeat until all the data points are correctly classified.

[Figure: an update in 2D, showing $w$ before and after the update on a misclassified point $x_i$]

NB: after convergence the weight vector is a linear combination of the training points,
$$w = \sum_i^N \alpha_i y_i x_i$$
where $\alpha_i \ge 0$ is proportional to the number of updates made using $x_i$.
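A minimal NumPy sketch of this loop (a sketch only; the array names are illustrative, and the bias $b$ is handled by appending a constant 1 to each point):

```python
import numpy as np

def perceptron(X, y, alpha=1.0, max_epochs=100):
    """Perceptron training on linearly separable data.

    X : (N, d) array of points, y : (N,) array of labels in {-1, +1}.
    The bias b is learned by augmenting each point with a constant 1.
    """
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])  # append 1 -> learns (w, b) jointly
    w = np.zeros(Xa.shape[1])                      # initialize w = 0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(Xa, y):                  # cycle through the data points
            if yi * (w @ xi) <= 0:                 # xi is misclassified
                w += alpha * yi * xi               # push f(xi) towards the sign of yi
                errors += 1
        if errors == 0:                            # converged: all points correct
            break
    return w[:-1], w[-1]                           # split back into (w, b)
```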
[Figure: Perceptron example: the decision boundary learned on a 2D point set]
Support vectors and the margin

The classifier can be written in terms of the support vectors:
$$f(x) = \sum_i \alpha_i y_i (x_i^\top x) + b$$

[Figure: maximum-margin decision boundary; the support vectors are the points lying on the margin planes]

Width of the margin: let $x_+$ and $x_-$ be points on the planes $w^\top x + b = +1$ and $w^\top x + b = -1$ respectively. Subtracting the two plane equations gives $w^\top (x_+ - x_-) = 2$, so the distance between the planes measured along the unit direction $w / \|w\|$ is
$$\frac{w}{\|w\|} \cdot (x_+ - x_-) = \frac{w^\top (x_+ - x_-)}{\|w\|} = \frac{2}{\|w\|}$$

[Figure: planes $w^\top x + b = 1$, $w^\top x + b = 0$, $w^\top x + b = -1$; margin $= 2/\|w\|$]
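Once the coefficients $\alpha_i$ are known, evaluating $f(x)$ only requires the support vectors (the points with $\alpha_i > 0$). A small NumPy sketch of this sum (the array names are illustrative, not from the notes):

```python
import numpy as np

def decision_value(x, support_X, support_y, alpha, b):
    """f(x) = sum_i alpha_i * y_i * (x_i . x) + b, summed over the support vectors.

    support_X : (m, d) support vectors, support_y : (m,) labels, alpha : (m,) coefficients.
    The predicted class is the sign of the returned value.
    """
    return np.sum(alpha * support_y * (support_X @ x)) + b
```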
SVM Optimization

Learning the SVM can be formulated as an optimization:
$$\max_{w} \frac{2}{\|w\|} \quad \text{subject to} \quad w^\top x_i + b \begin{cases} \ge 1 & \text{if } y_i = +1 \\ \le -1 & \text{if } y_i = -1 \end{cases} \quad \text{for } i = 1 \ldots N$$

Or equivalently
$$\min_{w} \|w\|^2 \quad \text{subject to} \quad y_i \left( w^\top x_i + b \right) \ge 1 \quad \text{for } i = 1 \ldots N$$
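The second form is a quadratic program, so a generic convex solver can recover $w$ and $b$ directly. A minimal sketch using CVXPY (the library choice is an assumption, not the notes' tool), for separable data stored as arrays X of shape (N, d) and y of shape (N,):

```python
import cvxpy as cp
import numpy as np

def hard_margin_svm(X, y):
    """Solve min ||w||^2  s.t.  y_i (w^T x_i + b) >= 1 for all i."""
    n, d = X.shape
    w = cp.Variable(d)
    b = cp.Variable()
    constraints = [cp.multiply(y, X @ w + b) >= 1]          # one margin constraint per point
    prob = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
    prob.solve()                                             # infeasible (w.value is None) if the data are not separable
    return w.value, b.value
```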
Slack variables

In general there is a trade-off between the margin and the number of mistakes on the training data. Introduce a slack variable $\xi_i \ge 0$ for each point: points satisfying the margin constraint have $\xi_i = 0$, while a point violating its margin plane lies a distance $\xi_i / \|w\|$ beyond it.

[Figure: margin $= 2/\|w\|$ with planes $w^\top x + b = 1, 0, -1$; support vectors on the margin planes have $\xi = 0$; a misclassified point lies at distance $\xi_i / \|w\|$ beyond its margin plane]

Soft-margin optimization:
$$\min_{w \in \mathbb{R}^d,\; \xi_i \in \mathbb{R}^+} \|w\|^2 + C \sum_i^N \xi_i \quad \text{subject to} \quad y_i \left( w^\top x_i + b \right) \ge 1 - \xi_i \quad \text{for } i = 1 \ldots N$$
[Figure: decision boundaries and margins plotted against feature x and feature y: C = Infinity gives the hard margin, C = 10 gives a soft margin]
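The effect of C can be reproduced with an off-the-shelf package. A sketch using scikit-learn's linear-kernel SVC on synthetic 2D data (both the library choice and the data are assumptions for illustration; a very large C stands in for C = Infinity):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 0.6, (50, 2)),   # class -1 cluster
               rng.normal(+1.0, 0.6, (50, 2))])  # class +1 cluster
y = np.hstack([-np.ones(50), np.ones(50)])

for C in (1e6, 10.0):                            # ~hard margin vs. soft margin
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w, b = clf.coef_[0], clf.intercept_[0]
    margin = 2.0 / np.linalg.norm(w)             # margin width 2 / ||w||
    print(f"C={C:g}: margin={margin:.3f}, support vectors={len(clf.support_)}")
```

Smaller C allows more margin violations and typically gives a wider margin with more support vectors.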
Application: binary classification with HOG features

[Figure: HOG descriptor: each cell stores a histogram of gradient orientations (frequency vs. orientation), capturing the dominant direction of the local gradients]

Algorithm

Training (Learning): represent each example window by a HOG feature vector $x_i \in \mathbb{R}^d$, with $d = 1024$, and train the linear classifier on these vectors.

Testing (Detection): sliding window classifier $f(x) = w^\top x + b$.

[Figure: learned model $f(x) = w^\top x + b$]
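A sketch of the test-time sliding-window loop; here `hog_vector` is a hypothetical helper standing in for whatever extractor produced the d = 1024 training descriptors, and `w`, `b` are the learned parameters:

```python
import numpy as np

def sliding_window_detect(image, w, b, hog_vector, window=(128, 64), stride=8):
    """Score every window position with the linear classifier f(x) = w.x + b.

    hog_vector(patch) must return the same d-dimensional HOG descriptor
    used at training time (hypothetical helper, not defined here).
    """
    H, W = window
    detections = []
    for r in range(0, image.shape[0] - H + 1, stride):
        for c in range(0, image.shape[1] - W + 1, stride):
            x = hog_vector(image[r:r + H, c:c + W])   # feature for this window
            score = w @ x + b                          # linear SVM score
            if score > 0:                              # positive side of the hyperplane
                detections.append((r, c, score))
    return detections
```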
Optimization

Learning an SVM has been formulated as a constrained optimization problem over $w$ and $\xi_i$:
$$\min_{w \in \mathbb{R}^d,\; \xi_i \in \mathbb{R}^+} \|w\|^2 + C \sum_i^N \xi_i \quad \text{subject to} \quad y_i \left( w^\top x_i + b \right) \ge 1 - \xi_i \quad \text{for } i = 1 \ldots N$$

The constraint $y_i \left( w^\top x_i + b \right) \ge 1 - \xi_i$ can be written more concisely as
$$y_i f(x_i) \ge 1 - \xi_i$$
which, together with $\xi_i \ge 0$, means that at the optimum the smallest feasible slack is
$$\xi_i = \max\left(0,\, 1 - y_i f(x_i)\right)$$
Hence the learning problem is equivalent to the unconstrained optimization over $w$:
$$\min_{w \in \mathbb{R}^d} \underbrace{\|w\|^2}_{\text{regularization}} + C \sum_i^N \underbrace{\max\left(0,\, 1 - y_i f(x_i)\right)}_{\text{loss function}}$$
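The unconstrained objective is straightforward to evaluate directly; a small NumPy sketch (array names are illustrative):

```python
import numpy as np

def svm_objective(w, b, X, y, C):
    """||w||^2 + C * sum_i max(0, 1 - y_i * (w.x_i + b))."""
    scores = X @ w + b                          # f(x_i) for every training point
    hinge = np.maximum(0.0, 1.0 - y * scores)   # per-point hinge loss
    return w @ w + C * hinge.sum()              # regularization + weighted loss
```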
Loss function

[Figure: the planes $w^\top x + b = 0, \pm 1$, with the loss incurred by each point]

Points are in three categories:
$y_i f(x_i) > 1$: the point is outside the margin and contributes nothing to the loss.
$y_i f(x_i) = 1$: the point is on the margin and contributes nothing to the loss, as in the hard-margin case.
$y_i f(x_i) < 1$: the point violates the margin constraint and contributes $1 - y_i f(x_i)$ to the loss.

Loss functions

[Figure: losses plotted against $y_i f(x_i)$]

The SVM uses the hinge loss $\max\left(0,\, 1 - y_i f(x_i)\right)$, a convex approximation to the 0-1 loss.
Optimization continued

$$\min_{w \in \mathbb{R}^d}\; C \sum_i^N \max\left(0,\, 1 - y_i f(x_i)\right) + \|w\|^2$$

[Figure: a convex cost surface has a single global minimum]

If the cost function is convex, then a locally optimal point is globally optimal (provided the optimization is over a convex set, which it is in our case).
Convex functions

[Figure: examples of convex and non-convex functions]

SVM: the cost
$$\min_{w \in \mathbb{R}^d}\; C \sum_i^N \max\left(0,\, 1 - y_i f(x_i)\right) + \|w\|^2$$
is convex, being a sum of convex hinge losses and the convex quadratic $\|w\|^2$.
Gradient descent: to minimize the cost $C(w)$, iterate
$$w_{t+1} \leftarrow w_t - \eta_t \nabla_w C(w_t)$$
where $\eta_t$ is the learning rate.

First, rewrite the optimization problem as an average:
$$\min_w C(w) = \frac{\lambda}{2}\|w\|^2 + \frac{1}{N} \sum_i^N \max\left(0,\, 1 - y_i f(x_i)\right) = \frac{1}{N} \sum_i^N \left( \frac{\lambda}{2}\|w\|^2 + \max\left(0,\, 1 - y_i f(x_i)\right) \right)$$
(with $\lambda = 2/(NC)$ this is the previous objective scaled by $1/(NC)$, so the minimizer is unchanged).
For $f(x_i) = w^\top x_i + b$, the sub-gradient of the per-point loss $L(x_i, y_i; w) = \max\left(0,\, 1 - y_i f(x_i)\right)$ is
$$\frac{\partial L}{\partial w} = -y_i x_i \quad \text{if } y_i f(x_i) < 1, \qquad \frac{\partial L}{\partial w} = 0 \quad \text{otherwise.}$$
The batch gradient descent update is
$$w_{t+1} \leftarrow w_t - \eta\, \nabla_w C(w_t) = w_t - \eta\, \frac{1}{N} \sum_i^N \left( \lambda w_t + \nabla_w L(x_i, y_i; w_t) \right)$$
where $\eta$ is the learning rate.

Then each iteration $t$ involves cycling through the training data, one example at a time, with the updates:
$$w_{t+1} \leftarrow w_t - \eta_t \left( \lambda w_t + \nabla_w L(x_i, y_i; w_t) \right)$$

In the Pegasos algorithm the learning rate is set at $\eta_t = \frac{1}{\lambda t}$.
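A minimal NumPy sketch of these stochastic updates with $\eta_t = 1/(\lambda t)$ (the bias $b$ is omitted for simplicity, and the function and array names are illustrative):

```python
import numpy as np

def pegasos(X, y, lam=0.01, epochs=20, seed=0):
    """Stochastic sub-gradient descent on (lam/2)||w||^2 + (1/N) sum_i hinge_i."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):          # cycle through the data in random order
            t += 1
            eta = 1.0 / (lam * t)             # Pegasos learning rate eta_t = 1/(lam * t)
            if y[i] * (w @ X[i]) < 1:         # margin violated: hinge term is active
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
            else:                             # only the regularizer contributes
                w = (1 - eta * lam) * w
    return w
```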
[Figure: energy (log scale) plotted against iteration number during training]
The resulting classifier can again be written in terms of its support vectors:
$$f(x) = \sum_i \alpha_i y_i (x_i^\top x) + b$$
On the web page:
http://www.robots.ox.ac.uk/~az/lectures/ml
links to SVM tutorials and video lectures
MATLAB SVM demo