
An Introduction to Support Vector Machines
Main Ideas
• Max-Margin Classifier
  – Formalizes the notion of the best linear separator
• Lagrangian Multipliers
  – A way to convert a constrained optimization problem into one that is easier to solve
• Kernels
  – Projecting data into a higher-dimensional space can make it linearly separable
• Complexity
  – Depends only on the number of training examples, not on the dimensionality of the kernel space!
Tennis example
[Figure: scatter plot of Humidity vs. Temperature; points labelled "play tennis" and "do not play tennis"]
Linear Support Vector Machines
Data: <xi, yi>, i = 1,…,l
  xi ∈ Rd
  yi ∈ {-1, +1}
[Figure: the two classes (=+1 and =-1) plotted in the (x1, x2) plane]
Linear SVM 2
Data: <xi, yi>, i = 1,…,l
  xi ∈ Rd
  yi ∈ {-1, +1}
[Figure: the two classes (=+1 and =-1) separated by the line f(x) = 0]
All hyperplanes in Rd are parameterized by a vector w and a constant b,
and can be expressed as w•x + b = 0 (remember the equation of a hyperplane
from algebra!).
Our aim is to find a hyperplane f(x) = sign(w•x + b) that
correctly classifies our data.
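
A minimal sketch of this decision function in Python (NumPy), with a made-up w and b chosen purely for illustration:

import numpy as np

def f(x, w, b):
    # Linear decision function: sign(w . x + b)
    return np.sign(w @ x + b)

# Hypothetical hyperplane in R^2: w = (1, 1), b = -3
w, b = np.array([1.0, 1.0]), -3.0
print(f(np.array([2.5, 3.0]), w, b))   # prints  1.0 -> classified as +1
print(f(np.array([0.5, 1.0]), w, b))   # prints -1.0 -> classified as -1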
Definitions
Define the hyperplane H such that:
  xi•w + b ≥ +1 when yi = +1
  xi•w + b ≤ -1 when yi = -1
H1 and H2 are the planes:
  H1: xi•w + b = +1
  H2: xi•w + b = -1
[Figure: H with the parallel planes H1 and H2; d+ and d- are the distances from H to the nearest point of each class]
The points on the planes H1 and H2 are the support vectors.
d+ = the shortest distance to the closest positive point
d- = the shortest distance to the closest negative point
The margin of a separating hyperplane is d+ + d-.
Maximizing the margin
We want a classifier with as big a margin as possible.
[Figure: H between H1 and H2, with distances d+ and d-]
Recall that the distance from a point (x0, y0) to the line Ax + By + c = 0 is |A·x0 + B·y0 + c| / sqrt(A² + B²).
The distance between H and H1 is: |w•x + b| / ||w|| = 1/||w||
The distance between H1 and H2 is: 2/||w||
In order to maximize the margin, we need to minimize ||w||, with the
condition that there are no data points between H1 and H2:
  xi•w + b ≥ +1 when yi = +1
  xi•w + b ≤ -1 when yi = -1
These can be combined into: yi(xi•w + b) ≥ 1
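
A small Python sketch of these two facts, reusing the same hypothetical w and b as above: the margin is 2/||w||, and points lying on or outside H1 and H2 satisfy yi(xi•w + b) ≥ 1.

import numpy as np

w, b = np.array([1.0, 1.0]), -3.0        # hypothetical hyperplane
print(2.0 / np.linalg.norm(w))           # margin between H1 and H2, about 1.414

# Two made-up points, one per class, both on or outside H1/H2:
X = np.array([[2.0, 2.0], [0.5, 1.0]])
y = np.array([1.0, -1.0])
print(y * (X @ w + b) >= 1)              # [ True  True ]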
Constrained Optimization Problem
Minimize ||w||² = w•w subject to yi(xi•w + b) ≥ 1 for all i
Lagrangian method: maximize inf_w L(w, b, α), where
  L(w, b, α) = ½||w||² - Σi αi [yi(xi•w + b) - 1]
At the extremum, the partial derivatives of L with respect to
both w and b must be 0. Taking the derivatives, setting them
to 0, substituting back into L, and simplifying yields:
  Maximize Σi αi - ½ Σi,j yi yj αi αj (xi•xj)
  subject to Σi yi αi = 0 and αi ≥ 0
Quadratic Programming
• Why is this reformulation a good thing?
• The problem
    Maximize Σi αi - ½ Σi,j yi yj αi αj (xi•xj)
    subject to Σi yi αi = 0 and αi ≥ 0
  is an instance of a positive semi-definite quadratic programming problem (see the sketch below)
• For a fixed real-number accuracy, it can be solved in
  O(n log n) time = O(|D|² log |D|²), where n = |D|² is the number of pairwise products xi•xj
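
A sketch of solving this dual problem numerically on a tiny made-up dataset. It uses scipy.optimize.minimize (SLSQP) as a generic constrained solver rather than a dedicated QP or SVM library; the data, tolerance, and variable names are illustrative assumptions, not part of the slides.

import numpy as np
from scipy.optimize import minimize

# Hypothetical, linearly separable toy data in R^2
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],    # y = +1
              [0.0, 0.0], [0.5, 1.0], [1.0, 0.5]])   # y = -1
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

Q = (y[:, None] * y[None, :]) * (X @ X.T)            # Q_ij = yi yj (xi . xj)

def neg_dual(alpha):                                  # scipy minimizes, so negate the dual
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

res = minimize(neg_dual, np.zeros(len(y)), method='SLSQP',
               bounds=[(0.0, None)] * len(y),                        # alpha_i >= 0
               constraints={'type': 'eq', 'fun': lambda a: a @ y})   # sum_i yi alpha_i = 0
alpha = res.x

w = (alpha * y) @ X                     # w = sum_i alpha_i yi xi
sv = alpha > 1e-6                       # support vectors have alpha_i > 0
b = np.mean(y[sv] - X[sv] @ w)          # from yi (xi . w + b) = 1 on the support vectors
print("support vectors:", np.where(sv)[0], " w =", w, " b =", b)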
Problems with linear SVM
[Figure: the two classes (=+1 and =-1) arranged so that no straight line separates them]
What if the decision function is not linear?
Kernel Trick
The data points are linearly separable in the space (x1², x2², √2·x1·x2).
We want to maximize:
  Σi αi - ½ Σi,j yi yj αi αj F(xi)•F(xj)
Define K(xi, xj) = F(xi)•F(xj)
Cool thing: K is often easy to compute directly! Here,
  K(xi, xj) = (xi•xj)²
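
A quick numerical check of this identity, with F the explicit feature map (x1², x2², √2·x1·x2) and two made-up points:

import numpy as np

def F(x):
    # Explicit feature map into (x1^2, x2^2, sqrt(2)*x1*x2)
    return np.array([x[0]**2, x[1]**2, np.sqrt(2.0) * x[0] * x[1]])

def K(xi, xj):
    # The same quantity computed directly in the original 2-D space
    return (xi @ xj) ** 2

xi, xj = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(F(xi) @ F(xj))   # 1.0
print(K(xi, xj))       # 1.0 -- identical, without ever forming F explicitly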
Overtraining/overfitting
A well-known problem with machine learning methods is overtraining.
This means that we have learned the training data very well, but
we cannot classify unseen examples correctly.
An example: a botanist who really knows trees. Every time he sees a new tree,
he claims it is not a tree, because it differs from the trees he has memorized.
[Figure: training points of the two classes (=+1 and =-1)]
Overtraining/overfitting 2
A measure of the risk of overtraining with SVM (there are also other measures):
It can be shown that the portion, n, of unseen data that will be
misclassified is bounded by:
  n ≤ number of support vectors / number of training examples
Ockham's razor principle: simpler systems are better than more complex ones.
In the SVM case: fewer support vectors mean a simpler representation of the hyperplane.
Example: understanding a certain cancer is easier if it can be described by one gene
than if we have to describe it with 5000.
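
A sketch of reading this ratio off a trained model with scikit-learn's SVC; the concentric-circles dataset, the RBF kernel, and C=1.0 are illustrative choices, not prescribed by the slides.

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Made-up data: two concentric circles, not linearly separable
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

clf = SVC(kernel='rbf', C=1.0).fit(X, y)
ratio = len(clf.support_) / len(X)        # support vectors / training examples
print(f"support vectors: {len(clf.support_)}/{len(X)}, bound on error ~ {ratio:.2f}")
print("training accuracy:", clf.score(X, y))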
