An Idiot's Guide to SVM
1980s: Decision trees and neural networks allowed efficient learning of non-linear decision surfaces, but had little theoretical basis and all suffer from local minima.
1990s: Efficient learning algorithms for non-linear functions, based on computational learning theory, were developed, with nice theoretical properties.
Key Ideas
Two independent developments within the last decade:
Computational learning theory
New, efficient separability of non-linear functions that use kernel functions
The resulting learning algorithm is an optimization algorithm rather than a greedy search.
Organization
Basic idea of support vector machines: the optimal hyperplane for linearly separable patterns
Extend to patterns that are not linearly separable by transformations of the original data that map it into a new space: the kernel function
SVM algorithm for pattern recognition
Support Vectors
Support vectors are the data points that lie closest to the decision surface. They are the most difficult to classify, and they have a direct bearing on the optimum location of the decision surface. We can show that the optimal hyperplane stems from the function class with the lowest capacity (VC dimension).
In general there are lots of possible solutions for a, b, c. A Support Vector Machine (SVM) finds an optimal solution (with respect to what cost?).
Maximize margin
Separation by Hyperplanes
Assume linear separability for now: in 2 dimensions we can separate by a line; in higher dimensions we need hyperplanes. We can find a separating hyperplane by linear programming (or iteratively, e.g. with the perceptron); the separator can be expressed as ax + by = c.
Find a, b, c such that ax + by ≥ c for red points and ax + by ≤ c for green points.
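As a concrete illustration of the previous two slides, here is a minimal perceptron sketch (the toy data and code are illustrative, not from the original slides); note that it finds some separating line ax + by = c, not necessarily the optimal one:

```python
# A minimal perceptron sketch on made-up 2-D data (any separating line will do;
# the perceptron does not maximize the margin).
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])  # 2-D points
y = np.array([+1, +1, -1, -1])                                      # red/green labels

w, b = np.zeros(2), 0.0
for _ in range(100):                       # repeat until an epoch has no mistakes
    mistakes = 0
    for xi, yi in zip(X, y):
        if yi * (np.dot(w, xi) + b) <= 0:  # misclassified (or on the boundary)
            w += yi * xi                   # nudge the separator toward xi
            b += yi
            mistakes += 1
    if mistakes == 0:
        break

print("separator: %.1f*x + %.1f*y = %.1f" % (w[0], w[1], -b))  # i.e. ax + by = c
```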
Which Hyperplane?
Lots of possible solutions for a, b, c. Some methods find a separating hyperplane, but not the optimal one (e.g., the perceptron); other methods find an optimal separating hyperplane. Which points should influence optimality? All points: linear regression, Naïve Bayes. Only the difficult points close to the decision boundary: support vector machines.
Define the hyperplane H such that:
xi·w + b ≥ +1 when yi = +1
xi·w + b ≤ −1 when yi = −1
H1 and H2 are the planes:
H1: xi·w + b = +1
H2: xi·w + b = −1
The points on the planes H1 and H2 are the support vectors.
Definitions
[Figure: separating hyperplane H with parallel planes H1 and H2, and distances d+ and d− on either side.]
d+ = the shortest distance to the closest positive point
d− = the shortest distance to the closest negative point
The margin of a separating hyperplane is d+ + d−.
The algorithm to generate the weights proceeds in such a way that only the support vectors determine the weights and thus the boundary
Recall that the distance from a point (x0, y0) to a line Ax + By + c = 0 is |A·x0 + B·y0 + c| / sqrt(A² + B²).
The distance between H and H1 is |w·x + b| / ||w|| = 1 / ||w||.
The distance between H1 and H2 is therefore 2 / ||w||.
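As a quick sanity check of the point-to-line distance formula, here is a small worked example (numbers chosen only for illustration):

```latex
% Distance from the point (3, 4) to the line 3x + 4y - 10 = 0:
\[
  d \;=\; \frac{|3\cdot 3 + 4\cdot 4 - 10|}{\sqrt{3^2 + 4^2}}
    \;=\; \frac{|9 + 16 - 10|}{5}
    \;=\; \frac{15}{5} \;=\; 3 .
\]
```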
In order to maximize the margin, we need to minimize ||w||, with the condition that there are no data points between H1 and H2:
xi·w + b ≥ +1 when yi = +1
xi·w + b ≤ −1 when yi = −1
These can be combined into yi(xi·w + b) ≥ 1.
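To make this primal problem concrete, here is a minimal sketch that solves minimize ½||w||² subject to yi(xi·w + b) ≥ 1 directly with a general-purpose solver (SciPy's SLSQP; the toy data are made up for illustration):

```python
# Hard-margin primal problem: minimize (1/2)||w||^2 s.t. y_i (w·x_i + b) >= 1.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([+1, +1, -1, -1])

def objective(params):
    w = params[:2]
    return 0.5 * np.dot(w, w)           # (1/2)||w||^2; b is not penalized

def margin_constraints(params):
    w, b = params[:2], params[2]
    return y * (X @ w + b) - 1.0        # each entry must be >= 0

res = minimize(objective, x0=np.zeros(3), method="SLSQP",
               constraints=[{"type": "ineq", "fun": margin_constraints}])
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b, "margin 2/||w|| =", 2 / np.linalg.norm(w))
```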
Maximize when the constraint line g is tangent to the inner ellipse contour line of f
A flattened paraboloid f(x, y) = 2 − x² − 2y² with the superimposed constraint g: x + y = 1. At the tangent solution p, the gradient vectors of f and g are parallel (there is no possible move that increases f while also staying on the constraint g).
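A short worked version of this picture (same f and g as above):

```latex
% Maximize f(x,y) = 2 - x^2 - 2y^2 subject to g(x,y) = x + y - 1 = 0.
\begin{align*}
  \nabla f &= (-2x,\,-4y), \qquad \nabla g = (1,\,1),\\
  \nabla f = \lambda \nabla g
    &\;\Rightarrow\; -2x = \lambda,\;\; -4y = \lambda
     \;\Rightarrow\; x = 2y,\\
  x + y = 1
    &\;\Rightarrow\; y = \tfrac{1}{3},\; x = \tfrac{2}{3},\quad
      f\!\left(\tfrac{2}{3},\tfrac{1}{3}\right) = \tfrac{4}{3}.
\end{align*}
```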
Two constraints
1. Parallel normal constraint (the gradient constraint on f and g; the solution is a maximum)
2. g(x) = 0 (the solution is on the constraint line)
We now recast these by combining f and g as the Lagrangian.
∇f(p) = λ ∇g(p),   g(x) = 0
Or, combining these two as the Lagrangian L and requiring the derivative of L to be zero:
L(x, λ) = f(x) − λ g(x)
∇L(x, λ) = 0
In general
Gradient condition for the maximum of f, plus the constraint condition g(x) = 0:
L(x, λ) = f(x) + Σi λi gi(x) is a function of n + m variables: n for the x's and m for the λ's. Differentiating gives n + m equations, each set to 0. The n equations differentiated with respect to each xi give the gradient conditions; the m equations differentiated with respect to each λi recover the constraints gi.
In our case, f(x) = ½||w||² and gi(x) = yi(w·xi + b) − 1 = 0, so the Lagrangian is L = ½||w||² − Σi αi [yi(w·xi + b) − 1], with αi ≥ 0.
Lagrangian Formulation
In the SVM problem the Lagrangian is
LP = ½||w||² − Σi αi yi (xi·w + b) + Σi αi,   with αi ≥ 0 for all i.
From the derivatives of LP set to 0 we get
w = Σi αi yi xi   and   Σi αi yi = 0.
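A small sketch (assuming scikit-learn and NumPy; the toy data are made up) that checks these two conditions numerically: the learned w equals Σi αi yi xi, where SVC stores the products αi yi for the support vectors in dual_coef_:

```python
# Verify w = sum_i alpha_i y_i x_i on a toy 2-D problem with a linear-kernel SVM.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(+2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C approximates the hard margin

# dual_coef_ holds alpha_i * y_i for the support vectors only.
w_from_dual = clf.dual_coef_ @ clf.support_vectors_   # = sum_i alpha_i y_i x_i
print(w_from_dual, clf.coef_)                         # the two should agree
print("margin 2/||w|| =", 2 / np.linalg.norm(clf.coef_))
```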
At a solution p
The constraint line g and the contour lines of f must be tangent. If they are tangent, their gradient vectors (perpendiculars) are parallel: the gradient of g is perpendicular to the constraint line (the direction of steepest ascent), and the gradient of f must point in the same direction as the gradient of g.
Inner products
The task: maximize L = Σi αi − ½ Σi Σj αi αj yi yj (xi·xj), subject to w = Σi αi yi xi and Σi αi yi = 0. Note that the data appear only through the inner product xi·xj.
Why should inner product kernels be involved in pattern recognition? The intuition is that they provide a measure of similarity: the inner product of two unit-length vectors in 2-D returns the cosine of the angle between them. E.g. with x = [1, 0]ᵀ and y = [0, 1]ᵀ: if two vectors are parallel the inner product is 1 (xᵀx = x·x = 1); if they are perpendicular it is 0 (xᵀy = x·y = 0).
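A tiny numerical illustration of that intuition (NumPy assumed; the vectors are the ones above plus one extra):

```python
# Inner products of unit vectors equal the cosine of the angle between them.
import numpy as np

x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])
z = np.array([1.0, 1.0]) / np.sqrt(2.0)   # unit vector at 45 degrees to x

print(np.dot(x, x))   # 1.0   -> same direction, cos 0
print(np.dot(x, y))   # 0.0   -> perpendicular, cos 90
print(np.dot(x, z))   # ~0.707 -> cos 45, "partially similar"
```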
But are we done?
Transformation to separate
[Figure: interleaved o's and x's on a line that cannot be separated by a single threshold, and a transformation after which the o's and x's are grouped and separable.]
(x − a)(x − b) = x² − (a + b)x + ab
What if the decision function is not linear? What transform would separate these?
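A sketch of the transformation idea with made-up 1-D data: points that no single threshold on x can separate become linearly separable after the map x ↦ (x, x²), which is exactly what the quadratic (x − a)(x − b) exploits:

```python
# 1-D points: the +1 class sits between the -1 points, so no threshold on x
# alone separates them; after the map x -> (x, x^2) a straight line does.
import numpy as np

x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([-1,   -1,   +1,  +1,  +1,  -1,  -1])

phi = np.column_stack([x, x ** 2])      # feature map phi(x) = (x, x^2)

# In the mapped space the classes are separated by the line x2 = 2.25,
# i.e. w = (0, -1), b = 2.25 puts the +1 points on one side, -1 on the other.
w, b = np.array([0.0, -1.0]), 2.25
print(np.sign(phi @ w + b))             # matches y
```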
In the linear case the data points xi and xj appear only through a dot product. In the non-linear case we will have Φ(xi)·Φ(xj) instead. If there is a kernel function K such that K(xi, xj) = Φ(xi)·Φ(xj), we do not need to know Φ explicitly. Examples:
K(x, y) = (x·y + 1)^p
K(x, y) = exp(−||x − y||² / (2σ²))
K(x, y) = tanh(β₀ x·y + β₁)
The 1st is a polynomial kernel (which includes the plain dot product x·x as a special case); the 2nd is a radial basis function (a Gaussian); the 3rd is a sigmoid (the neural-net activation function).
The power p is specified a priori by the user; the width σ² is specified a priori; Mercer's theorem is satisfied only for some values of β₀ and β₁.
Non-Linear SVMs (2)
The function we end up optimizing is:
Max LD = Σi αi − ½ Σi Σj αi αj yi yj K(xi, xj), subject to w = Σi αi yi xi and Σi αi yi = 0.
Another kernel example: the polynomial kernel K(xi, xj) = (xi·xj + 1)^p, where p is a tunable parameter. Evaluating K only requires one addition and one exponentiation more than the original dot product.
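A quick check of the kernel identity for p = 2 (the explicit feature map phi below is one standard choice for 2-D inputs, written out only for illustration):

```python
# Verify that K(x, y) = (x·y + 1)^2 equals phi(x)·phi(y) for an explicit
# quadratic feature map phi: R^2 -> R^6.
import numpy as np

def phi(v):
    # phi(v) = (1, sqrt(2) v1, sqrt(2) v2, v1^2, v2^2, sqrt(2) v1 v2)
    v1, v2 = v
    s = np.sqrt(2.0)
    return np.array([1.0, s * v1, s * v2, v1 ** 2, v2 ** 2, s * v1 * v2])

x = np.array([1.0, 3.0])
y = np.array([2.0, 1.0])

k_direct = (np.dot(x, y) + 1.0) ** 2     # one addition + one exponentiation
k_mapped = np.dot(phi(x), phi(y))        # dot product in the 6-D feature space
print(k_direct, k_mapped)                # both print 36.0
```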
Gaussian
Overfitting by SVM
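A sketch (scikit-learn's SVC with an assumed noisy toy dataset) of how the Gaussian kernel width relates to overfitting: a very large gamma (i.e., a very small σ) lets the boundary wrap around individual noisy points:

```python
# RBF kernel K(x, y) = exp(-gamma ||x - y||^2); large gamma (small sigma) gives
# a very flexible boundary that can overfit noisy labels.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
y[rng.choice(200, size=20, replace=False)] *= -1   # flip 10% of labels (noise)

for gamma in (0.1, 100.0):
    clf = SVC(kernel="rbf", gamma=gamma, C=10.0).fit(X, y)
    print(f"gamma={gamma}: train accuracy = {clf.score(X, y):.2f}, "
          f"support vectors = {len(clf.support_vectors_)}")

# Typically the large-gamma model memorizes the flipped labels (near-perfect
# training accuracy, many support vectors) -- a symptom of overfitting.
```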