Support Vector Machine
[Figure: examples of linearly separable and not linearly separable data]
Input Space to Feature Space
f(x,w,b) = sign(w·x + b)
w·x + b > 0 denotes the +1 class; w·x + b < 0 denotes the -1 class
Linear Classifiers
f(x,w,b) = sign(w·x + b)
Any of these separating lines would be fine... but which is best?
Linear Classifiers
f(x,w,b) = sign(w·x + b)
A poor choice of boundary: a point gets misclassified to the +1 class.
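Once w and b are known, evaluating such a linear classifier is a one-liner. A minimal NumPy sketch (the weights and points below are made-up placeholders, not values from the slides):

    import numpy as np

    def linear_classify(X, w, b):
        """Label each row of X with sign(w.x + b): +1 or -1."""
        return np.sign(X @ w + b)

    # Placeholder parameters and points, purely for illustration
    w, b = np.array([2.0, -1.0]), 0.5
    X = np.array([[1.0, 1.0], [-2.0, 0.5]])
    print(linear_classify(X, w, b))   # [ 1. -1.]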
The Perceptron Classifier
Given linearly separable data xi labelled into two categories yi ∈ {-1, 1}, find a weight vector w such that the discriminant function
f(xi) = wTxi + b
separates the categories for i = 1, ..., N.
• How can we find this separating hyperplane?
[Figure: perceptron example, showing the weight vector w normal to the separating hyperplane and a data point xi in the (X1, X2) plane]
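One answer to the question above is the classic perceptron learning rule: cycle through the data and, whenever a point lies on the wrong side of the current hyperplane, move w (and b) towards it. A minimal NumPy sketch of that standard algorithm (not code from the slides; the learning rate and epoch count are arbitrary choices):

    import numpy as np

    def perceptron(X, y, epochs=100, lr=1.0):
        """Perceptron learning: returns (w, b) separating linearly separable data."""
        n, d = X.shape
        w, b = np.zeros(d), 0.0
        for _ in range(epochs):
            errors = 0
            for xi, yi in zip(X, y):
                # Misclassified (or exactly on the boundary): update towards the point
                if yi * (xi @ w + b) <= 0:
                    w += lr * yi * xi
                    b += lr * yi
                    errors += 1
            if errors == 0:          # converged: every point correctly classified
                break
        return w, b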
■ w . xi + b ≥ +1 when yi = +1
■ w . xi + b ≤ -1 when yi = -1
Both conditions combine into yi (w . xi + b) ≥ 1 ∀i
• To obtain the geometric distance from the hyperplane to a data point, we normalize by the magnitude of w:
d((w, b), xi) = yi (w . xi + b) / ||w|| ≥ 1 / ||w||
• We want the hyperplane that maximizes the geometric distance to the closest data points.
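As a small illustration of the formula, the geometric distance of every training point can be computed in one line (NumPy sketch; w, b, X, y are assumed to be defined, e.g. as in the perceptron sketch above):

    import numpy as np

    def geometric_margins(X, y, w, b):
        """yi (w.xi + b) / ||w|| for every point; the classifier's margin is the minimum."""
        return y * (X @ w + b) / np.linalg.norm(w)

    # margin = geometric_margins(X, y, w, b).min()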
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
Maximum Margin
f(x,w,b) = sign(w·x + b)
The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM (called an LSVM).
1. Maximizing the margin is the objective.
2. Only the support vectors (the datapoints that the margin pushes up against) are important for the solution.
Linear SVM
Linear SVM Mathematically
x+ and x- are the closest points on either side; M = margin width.
The two hyperplanes are parallel (they have the same normal w) and no training points fall between them.
What we know:
■ w . x+ + b = +1
■ w . x- + b = -1
■ w . (x+ - x-) = 2
Hence the margin width is M = w . (x+ - x-) / ||w|| = 2 / ||w||.
Linear SVM Mathematically
■ Goal: 1) Correctly classify all training data:
w . xi + b ≥ +1 if yi = +1
w . xi + b ≤ -1 if yi = -1
i.e. yi (w . xi + b) ≥ 1 for all i
2) Maximize the margin M = 2/||w||, which is the same as minimizing ½ wTw.
■ Minimize Φ(w) = ½ wTw
subject to yi (w . xi + b) ≥ 1 for all i
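This primal problem is a small convex quadratic program, so a generic solver can handle it directly. A hedged sketch using the cvxpy package (an assumption; the slides do not name any solver), valid only for linearly separable data:

    import cvxpy as cp
    import numpy as np

    def hard_margin_svm(X, y):
        """Minimize 1/2 ||w||^2 subject to yi (w.xi + b) >= 1 for all i."""
        n, d = X.shape
        w, b = cp.Variable(d), cp.Variable()
        objective = cp.Minimize(0.5 * cp.sum_squares(w))
        constraints = [cp.multiply(y, X @ w + b) >= 1]
        cp.Problem(objective, constraints).solve()
        return w.value, b.value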
Lagrange Multipliers
• Consider a problem: minimize f(x) over x subject to h(x) = 0. Introduce a multiplier λ and work with the Lagrangian L(x, λ) = f(x) + λ h(x).
Original Problem:
Find w and b such that
Φ(w) =½ wTw is minimized;
and for all i {(xi ,yi)}: yi (wTxi + b) ≥ 1
Construct the Lagrangian function for this optimization:
L(w, b, α) = ½ wTw - Σi αi [yi (wTxi + b) - 1]
s.t. αi ≥ 0 ∀ i
Setting ∂L/∂w = 0 and ∂L/∂b = 0 gives w = Σi αi yi xi and Σi αi yi = 0. Substituting these back, we get the dual problem:
maxα: Σi αi - ½ Σi Σj αi αj yi yj xiTxj
subject to αi ≥ 0 ∀ i and Σi αi yi = 0
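Library solvers work with exactly this dual. Assuming scikit-learn is available (an illustration, not something the slides rely on), the dual variables and the recovered w, b can be read off a fitted linear SVC:

    import numpy as np
    from sklearn.svm import SVC

    clf = SVC(kernel='linear', C=1e6)   # very large C approximates the hard margin
    clf.fit(X, y)                       # X, y: the training data from the problem above

    # dual_coef_ stores alpha_i * yi for the support vectors only
    w = clf.dual_coef_ @ clf.support_vectors_   # w = sum_i alpha_i yi xi
    b = clf.intercept_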
Dataset with noise
OVERFITTING!
Soft Margin Classification
Slack variables ξi can be added to allow
misclassification of difficult or noisy examples.
Hard Margin v.s. Soft Margin
■ The old formulation:
Find w and b such that
Φ(w) = ½ wTw is minimized, and for all {(xi, yi)}: yi (wTxi + b) ≥ 1
■ The new formulation, incorporating slack variables:
Find w and b such that
Φ(w) = ½ wTw + C Σi ξi is minimized, and for all {(xi, yi)}: yi (wTxi + b) ≥ 1 - ξi with ξi ≥ 0
The resulting classifier still has the form f(x) = Σ αi yi xiTx + b
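The parameter C weights the slack term and thus controls how soft the margin is. A short hedged sketch (again assuming scikit-learn; the two C values are arbitrary illustrations, and X, y are training data as before):

    from sklearn.svm import SVC

    # Small C: soft margin, tolerates noisy or misclassified points, wider margin
    soft = SVC(kernel='linear', C=0.1).fit(X, y)

    # Very large C: approaches the hard margin and may overfit the noise
    hard = SVC(kernel='linear', C=1e6).fit(X, y)

    # A softer margin typically keeps more support vectors
    print(len(soft.support_), len(hard.support_))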
Application: Pedestrian detection in Computer
Vision
Objective: detect (localize) standing humans in an
image
• cf. face detection with a sliding-window classifier
• Does an image window contain a person or not?
Training (Learning)
• Represent each example window by a HOG feature vector xi ∈ Rd, with d = 1024
Testing (Detection)
• Sliding-window classifier f(x) = wTx + b
Dalal and Triggs, CVPR 2005
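A hedged sketch of that detection loop, assuming scikit-image's hog function for the feature vector; the window size, stride, and threshold below are made-up illustration values, and w must match the dimensionality of the HOG descriptor:

    import numpy as np
    from skimage.feature import hog

    def detect(image, w, b, win=(128, 64), stride=16, thresh=0.0):
        """Score each window with the linear SVM f(x) = w.x + b and keep positives."""
        H, W = image.shape                     # grayscale image
        detections = []
        for r in range(0, H - win[0], stride):
            for c in range(0, W - win[1], stride):
                x = hog(image[r:r + win[0], c:c + win[1]])   # HOG feature vector
                if w @ x + b > thresh:
                    detections.append((r, c))
        return detections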
Learned model
f(x) = wTx + b
x1 x2 y λ
0.38 0.47 + 65.52
0.49 0.61 - 65.52
0.92 0.41 - 0
0.74 0.89 - 0
0.18 0.58 + 0
0.41 0.35 + 0
0.93 0.81 - 0
0.21 0.10 + 0
Only the two points with λ > 0 are support vectors; they alone determine the maximum margin hyperplane (MMH). For the test point X considered here, the test data falls on or below the MMH, so the SVM classifies X as belonging to class label -.
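To make the arithmetic explicit, w and b can be recovered from the two support vectors alone, using w = Σ λi yi xi and w . x+ + b = +1. The sketch below reproduces that computation with the table values:

    import numpy as np

    # Support vectors (the rows with lambda > 0) from the table
    x_pos, x_neg = np.array([0.38, 0.47]), np.array([0.49, 0.61])
    lam = 65.52

    w = lam * (+1) * x_pos + lam * (-1) * x_neg   # w = sum_i lambda_i yi xi
    b = 1 - w @ x_pos                             # from w.x+ + b = +1
    print(w, b)                                   # approximately w = [-7.21, -9.17], b = 8.05

    # A test point X would then be classified by sign(w @ X + b)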
Non-linear SVMs
■ Datasets that are linearly separable with some noise work out great.
■ But what if the dataset is not linearly separable in the input space? One remedy is to map the data to a higher-dimensional space where it becomes separable.
[Figure: 1-D data on the x axis that is not linearly separable becomes separable after mapping each point x to (x, x²)]
Non-Linear SVM
To understand this, note that a linear hyperplane is expressed as a linear equation in the n components of the input, whereas a non-linear hypersurface is a non-linear expression in those components.
Φ: x → φ(x)
Mapping the Inputs to other dimensions - the
use of Kernels
• Finding the optimal curve to fit the data directly is difficult.
• There is a way to "pre-process" the data so that the problem is transformed into one of finding a simple hyperplane.
• We define a mapping z = φ(x) that transforms the d-dimensional input vector x into a (usually higher-dimensional) d*-dimensional vector z.
• We hope to choose φ() so that the new training data {φ(xi), yi} is separable by a hyperplane.
• How do we go about choosing φ()? (See the sketch below.)
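Before answering that, the 1-D example from the earlier figure can be made concrete: points that are not separable on the line become separable after the explicit map φ(x) = (x, x²). The data values below are invented for illustration:

    import numpy as np

    x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])   # 1-D inputs
    y = np.array([  1,    1,   -1,  -1,  -1,   1,   1])    # not separable on the line

    # Explicit feature map phi(x) = (x, x^2): in the (z1, z2) plane the classes
    # are separated by the horizontal line z2 = 2.5
    Z = np.column_stack([x, x**2])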
The “Kernel Trick”
■ The linear classifier relies on a dot product between vectors, K(xi, xj) = xiTxj.
■ If every data point is mapped into high-dimensional space via some
transformation Φ: x → φ(x), the dot product becomes:
K(xi,xj)= φ(xi) Tφ(xj)
■ A kernel function is some function that corresponds to an inner product in
some expanded feature space.
■ Example:
2-dimensional vectors x = [x1 x2]; let K(xi, xj) = (1 + xiTxj)²
Need to show that K(xi, xj) = φ(xi)Tφ(xj):
K(xi, xj) = (1 + xiTxj)²
= 1 + xi1²xj1² + 2 xi1xj1xi2xj2 + xi2²xj2² + 2 xi1xj1 + 2 xi2xj2
= [1  xi1²  √2 xi1xi2  xi2²  √2 xi1  √2 xi2]T [1  xj1²  √2 xj1xj2  xj2²  √2 xj1  √2 xj2]
= φ(xi)Tφ(xj),  where φ(x) = [1  x1²  √2 x1x2  x2²  √2 x1  √2 x2]
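A quick numerical check of this identity (NumPy sketch; the two example vectors are arbitrary):

    import numpy as np

    def phi(x):
        """Explicit feature map for the kernel (1 + xi.xj)^2 in two dimensions."""
        x1, x2 = x
        return np.array([1, x1**2, np.sqrt(2)*x1*x2, x2**2,
                         np.sqrt(2)*x1, np.sqrt(2)*x2])

    xi, xj = np.array([0.3, -1.2]), np.array([2.0, 0.5])
    k_direct = (1 + xi @ xj) ** 2          # kernel computed in the input space
    k_mapped = phi(xi) @ phi(xj)           # dot product in the expanded feature space
    print(np.isclose(k_direct, k_mapped))  # True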
Non-linear SVMs Mathematically
■ Dual problem formulation:
Find α1…αN such that
Q(α) =Σαi - ½ΣΣαiαjyiyjK(xi, xj) is maximized and
(1) Σαiyi = 0
(2) αi ≥ 0 for all αi
■ Sigmoid kernel: K(xi, xj) = tanh(β0 xiTxj + β1)
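In practice the dual is rarely coded by hand; assuming scikit-learn (an illustration, not part of the slides), a kernel is chosen by name and its parameters map onto the formulas above:

    from sklearn.svm import SVC

    # Each model solves the dual above with a different K(xi, xj)
    linear_svm  = SVC(kernel='linear')                                # K = xiTxj
    poly_svm    = SVC(kernel='poly', degree=2, gamma=1.0, coef0=1)    # K = (1 + xiTxj)^2
    sigmoid_svm = SVC(kernel='sigmoid', gamma=1.0, coef0=0.0)         # K = tanh(xiTxj)
    rbf_svm     = SVC(kernel='rbf')                                   # Gaussian kernel

    # e.g. poly_svm.fit(X, y); poly_svm.predict(X_new)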
Nonlinear SVM - Overview
■ SVM locates a separating hyperplane in the feature space and classifies points in that space.
■ It does not need to represent the feature space explicitly; it only needs a kernel function.
■ The kernel function plays the role of the dot
product in the feature space.
Properties of SVM
■ Flexibility in choosing a similarity function
■ Sparseness of solution when dealing with large data sets
- only support vectors are used to specify the separating hyperplane
■ Ability to handle large feature spaces
- complexity does not depend on the dimensionality of the feature space
■ Overfitting can be controlled by soft margin
approach
■ Nice math property: a simple convex optimization problem
which is guaranteed to converge to a single global solution
■ Feature Selection
Weakness of SVM
■ It is sensitive to noise
- A relatively small number of mislabeled examples can dramatically decrease the performance
http://www.kernel-machines.org/