An Idiot's Guide to SVM
1980s: Decision trees and neural networks allowed efficient learning of non-linear decision surfaces, but had little theoretical basis and all suffer from local minima.
1990s: Efficient learning algorithms for non-linear functions, based on computational learning theory, were developed, with nice theoretical properties.
Key Ideas
Two independent developments within the last decade:
Computational learning theory
New, efficient separability of non-linear functions that use kernel functions
The resulting learning algorithm is an optimization algorithm rather than a greedy search.
Organization
Basic idea of support vector machines: the optimal hyperplane for linearly separable patterns
Extend to patterns that are not linearly separable by transformations of the original data that map it into a new space: the kernel function
SVM algorithm for pattern recognition
Support Vectors
Support vectors are the data points that lie closest to the decision surface. They are the most difficult to classify, and they have a direct bearing on the optimum location of the decision surface. We can show that the optimal hyperplane stems from the function class with the lowest capacity (VC dimension).
In general there are lots of possible solutions for a, b, c. A Support Vector Machine (SVM) finds an optimal solution (with respect to what cost?).
Maximize margin
Separation by Hyperplanes
Assume linear separability for now: in 2 dimensions we can separate by a line; in higher dimensions we need hyperplanes. We can find a separating hyperplane by linear programming (or iteratively, e.g. with the perceptron); the separator can be expressed as ax + by = c.
Find a, b, c such that ax + by ≥ c for red points and ax + by ≤ c for green points.
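As a concrete illustration of the previous two slides, here is a minimal perceptron sketch (the toy data and code are illustrative, not from the original slides); note that it finds some separating line ax + by = c, not necessarily the optimal one:

```python
# A minimal perceptron sketch on made-up 2-D data (any separating line will do;
# the perceptron does not maximize the margin).
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])  # 2-D points
y = np.array([+1, +1, -1, -1])                                      # red/green labels

w, b = np.zeros(2), 0.0
for _ in range(100):                       # repeat until an epoch has no mistakes
    mistakes = 0
    for xi, yi in zip(X, y):
        if yi * (np.dot(w, xi) + b) <= 0:  # misclassified (or on the boundary)
            w += yi * xi                   # nudge the separator toward xi
            b += yi
            mistakes += 1
    if mistakes == 0:
        break

print("separator: %.1f*x + %.1f*y = %.1f" % (w[0], w[1], -b))  # i.e. ax + by = c
```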
Which Hyperplane?
Lots of possible solutions for a, b, c. Some methods find a separating hyperplane, but not the optimal one (e.g., the perceptron); other methods find an optimal separating hyperplane. Which points should influence optimality? All points: linear regression, Naïve Bayes. Only the difficult points close to the decision boundary: support vector machines.
Define the hyperplane H such that:
xi·w + b ≥ +1 when yi = +1
xi·w + b ≤ −1 when yi = −1
H1 and H2 are the planes:
H1: xi·w + b = +1
H2: xi·w + b = −1
The points on the planes H1 and H2 are the support vectors.
Definitions
[Figure: separating hyperplane H with parallel planes H1 and H2, and distances d+ and d− on either side.]
d+ = the shortest distance to the closest positive point
d− = the shortest distance to the closest negative point
The margin of a separating hyperplane is d+ + d−.
The algorithm to generate the weights proceeds in such a way that only the support vectors determine the weights and thus the boundary
Recall that the distance from a point (x0, y0) to a line Ax + By + c = 0 is |A·x0 + B·y0 + c| / sqrt(A² + B²).
The distance between H and H1 is |w·x + b| / ||w|| = 1 / ||w||.
The distance between H1 and H2 is therefore 2 / ||w||.
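As a quick sanity check of the point-to-line distance formula, here is a small worked example (numbers chosen only for illustration):

```latex
% Distance from the point (3, 4) to the line 3x + 4y - 10 = 0:
\[
  d \;=\; \frac{|3\cdot 3 + 4\cdot 4 - 10|}{\sqrt{3^2 + 4^2}}
    \;=\; \frac{|9 + 16 - 10|}{5}
    \;=\; \frac{15}{5} \;=\; 3 .
\]
```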
In order to maximize the margin, we need to minimize ||w||, with the condition that there are no data points between H1 and H2:
xi·w + b ≥ +1 when yi = +1
xi·w + b ≤ −1 when yi = −1
These can be combined into yi(xi·w + b) ≥ 1.
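To make this primal problem concrete, here is a minimal sketch that solves minimize ½||w||² subject to yi(xi·w + b) ≥ 1 directly with a general-purpose solver (SciPy's SLSQP; the toy data are made up for illustration):

```python
# Hard-margin primal problem: minimize (1/2)||w||^2 s.t. y_i (w·x_i + b) >= 1.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([+1, +1, -1, -1])

def objective(params):
    w = params[:2]
    return 0.5 * np.dot(w, w)           # (1/2)||w||^2; b is not penalized

def margin_constraints(params):
    w, b = params[:2], params[2]
    return y * (X @ w + b) - 1.0        # each entry must be >= 0

res = minimize(objective, x0=np.zeros(3), method="SLSQP",
               constraints=[{"type": "ineq", "fun": margin_constraints}])
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b, "margin 2/||w|| =", 2 / np.linalg.norm(w))
```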
Maximize when the constraint line g is tangent to the inner ellipse contour line of f
A flattened paraboloid f(x, y) = 2 − x² − 2y² with the superimposed constraint g: x + y = 1. At the tangent solution p, the gradient vectors of f and g are parallel (there is no possible move that increases f while also staying on the constraint g).
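A short worked version of this picture (same f and g as above):

```latex
% Maximize f(x,y) = 2 - x^2 - 2y^2 subject to g(x,y) = x + y - 1 = 0.
\begin{align*}
  \nabla f &= (-2x,\,-4y), \qquad \nabla g = (1,\,1),\\
  \nabla f = \lambda \nabla g
    &\;\Rightarrow\; -2x = \lambda,\;\; -4y = \lambda
     \;\Rightarrow\; x = 2y,\\
  x + y = 1
    &\;\Rightarrow\; y = \tfrac{1}{3},\; x = \tfrac{2}{3},\quad
      f\!\left(\tfrac{2}{3},\tfrac{1}{3}\right) = \tfrac{4}{3}.
\end{align*}
```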
Two constraints
1. Parallel normal constraint (the gradient constraint on f and g; the solution is a maximum)
2. g(x) = 0 (the solution is on the constraint line)
We now recast these by combining f and g as the Lagrangian.
∇f(p) = λ ∇g(p),   g(x) = 0
Or, combining these two as the Lagrangian L and requiring the derivative of L to be zero:
L(x, λ) = f(x) − λ g(x)
∇L(x, λ) = 0
In general
Gradient condition for the maximum of f, plus the constraint condition g(x) = 0:
L(x, λ) = f(x) + Σi λi gi(x) is a function of n + m variables: n for the x's and m for the λ's. Differentiating gives n + m equations, each set to 0. The n equations differentiated with respect to each xi give the gradient conditions; the m equations differentiated with respect to each λi recover the constraints gi.
In our case, f(x) = ½||w||² and gi(x) = yi(w·xi + b) − 1 = 0, so the Lagrangian is L = ½||w||² − Σi αi [yi(w·xi + b) − 1], with αi ≥ 0.
Lagrangian Formulation
In the SVM problem the Lagrangian is
LP = ½||w||² − Σi αi yi (xi·w + b) + Σi αi,   with αi ≥ 0 for all i.
From the derivatives of LP set to 0 we get
w = Σi αi yi xi   and   Σi αi yi = 0.
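A small sketch (assuming scikit-learn and NumPy; the toy data are made up) that checks these two conditions numerically: the learned w equals Σi αi yi xi, where SVC stores the products αi yi for the support vectors in dual_coef_:

```python
# Verify w = sum_i alpha_i y_i x_i on a toy 2-D problem with a linear-kernel SVM.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(+2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C approximates the hard margin

# dual_coef_ holds alpha_i * y_i for the support vectors only.
w_from_dual = clf.dual_coef_ @ clf.support_vectors_   # = sum_i alpha_i y_i x_i
print(w_from_dual, clf.coef_)                         # the two should agree
print("margin 2/||w|| =", 2 / np.linalg.norm(clf.coef_))
```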
At a solution p
The constraint line g and the contour lines of f must be tangent. If they are tangent, their gradient vectors (perpendiculars) are parallel: the gradient of g is perpendicular to the constraint line (the direction of steepest ascent), and the gradient of f must point in the same direction as the gradient of g.
Inner products
The task: maximize L = Σi αi − ½ Σi Σj αi αj yi yj (xi·xj), subject to w = Σi αi yi xi and Σi αi yi = 0. Note that the data appear only through the inner product xi·xj.
Why should inner product kernels be involved in pattern recognition? The intuition is that they provide a measure of similarity: the inner product of two unit-length vectors in 2-D returns the cosine of the angle between them. E.g. with x = [1, 0]ᵀ and y = [0, 1]ᵀ: if two vectors are parallel the inner product is 1 (xᵀx = x·x = 1); if they are perpendicular it is 0 (xᵀy = x·y = 0).
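A tiny numerical illustration of that intuition (NumPy assumed; the vectors are the ones above plus one extra):

```python
# Inner products of unit vectors equal the cosine of the angle between them.
import numpy as np

x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])
z = np.array([1.0, 1.0]) / np.sqrt(2.0)   # unit vector at 45 degrees to x

print(np.dot(x, x))   # 1.0   -> same direction, cos 0
print(np.dot(x, y))   # 0.0   -> perpendicular, cos 90
print(np.dot(x, z))   # ~0.707 -> cos 45, "partially similar"
```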
But are we done?
Transformation to separate
[Figure: interleaved o's and x's on a line that cannot be separated by a single threshold, and a transformation after which the o's and x's are grouped and separable.]
(x − a)(x − b) = x² − (a + b)x + ab
What if the decision function is not linear? What transform would separate these?
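A sketch of the transformation idea with made-up 1-D data: points that no single threshold on x can separate become linearly separable after the map x ↦ (x, x²), which is exactly what the quadratic (x − a)(x − b) exploits:

```python
# 1-D points: the +1 class sits between the -1 points, so no threshold on x
# alone separates them; after the map x -> (x, x^2) a straight line does.
import numpy as np

x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([-1,   -1,   +1,  +1,  +1,  -1,  -1])

phi = np.column_stack([x, x ** 2])      # feature map phi(x) = (x, x^2)

# In the mapped space the classes are separated by the line x2 = 2.25,
# i.e. w = (0, -1), b = 2.25 puts the +1 points on one side, -1 on the other.
w, b = np.array([0.0, -1.0]), 2.25
print(np.sign(phi @ w + b))             # matches y
```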
In the linear case the data points xi and xj appear only through a dot product. In the non-linear case we will have Φ(xi)·Φ(xj) instead. If there is a kernel function K such that K(xi, xj) = Φ(xi)·Φ(xj), we do not need to know Φ explicitly. Examples:
K(x, y) = (x·y + 1)^p
K(x, y) = exp(−||x − y||² / (2σ²))
K(x, y) = tanh(β₀ x·y + β₁)
The 1st is a polynomial kernel (which includes the plain dot product x·x as a special case); the 2nd is a radial basis function (a Gaussian); the 3rd is a sigmoid (the neural-net activation function).
The power p is specified a priori by the user; the width σ² is specified a priori; Mercer's theorem is satisfied only for some values of β₀ and β₁.
Non-Linear SVMs (2)
The function we end up optimizing is:
Max LD = Σi αi − ½ Σi Σj αi αj yi yj K(xi, xj), subject to w = Σi αi yi xi and Σi αi yi = 0.
Another kernel example: the polynomial kernel K(xi, xj) = (xi·xj + 1)^p, where p is a tunable parameter. Evaluating K only requires one addition and one exponentiation more than the original dot product.
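A quick check of the kernel identity for p = 2 (the explicit feature map phi below is one standard choice for 2-D inputs, written out only for illustration):

```python
# Verify that K(x, y) = (x·y + 1)^2 equals phi(x)·phi(y) for an explicit
# quadratic feature map phi: R^2 -> R^6.
import numpy as np

def phi(v):
    # phi(v) = (1, sqrt(2) v1, sqrt(2) v2, v1^2, v2^2, sqrt(2) v1 v2)
    v1, v2 = v
    s = np.sqrt(2.0)
    return np.array([1.0, s * v1, s * v2, v1 ** 2, v2 ** 2, s * v1 * v2])

x = np.array([1.0, 3.0])
y = np.array([2.0, 1.0])

k_direct = (np.dot(x, y) + 1.0) ** 2     # one addition + one exponentiation
k_mapped = np.dot(phi(x), phi(y))        # dot product in the 6-D feature space
print(k_direct, k_mapped)                # both print 36.0
```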
Gaussian
Overfitting by SVM
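A sketch (scikit-learn's SVC with an assumed noisy toy dataset) of how the Gaussian kernel width relates to overfitting: a very large gamma (i.e., a very small σ) lets the boundary wrap around individual noisy points:

```python
# RBF kernel K(x, y) = exp(-gamma ||x - y||^2); large gamma (small sigma) gives
# a very flexible boundary that can overfit noisy labels.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
y[rng.choice(200, size=20, replace=False)] *= -1   # flip 10% of labels (noise)

for gamma in (0.1, 100.0):
    clf = SVC(kernel="rbf", gamma=gamma, C=10.0).fit(X, y)
    print(f"gamma={gamma}: train accuracy = {clf.score(X, y):.2f}, "
          f"support vectors = {len(clf.support_vectors_)}")

# Typically the large-gamma model memorizes the flipped labels (near-perfect
# training accuracy, many support vectors) -- a symptom of overfitting.
```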