
Support Vector Machines

MEDINFO 2004, T02: Machine Learning Methods for Decision Support and Discovery
Constantin F. Aliferis & Ioannis Tsamardinos
Discovery Systems Laboratory
Department of Biomedical Informatics
Vanderbilt University

1
Support Vector Machines
• The decision surface is a hyperplane (a line in 2D) in feature space (similar to the Perceptron)
• Arguably the most important recent discovery in machine learning
• In a nutshell:
  • Map the data to a predetermined, very high-dimensional space via a kernel function
  • Find the hyperplane that maximizes the margin between the two classes
  • If the data are not separable, find the hyperplane that maximizes the margin and minimizes (a weighted average of) the misclassifications

2
Support Vector Machines
• Three main ideas:
1. Define what an optimal hyperplane is (in a way that can be identified in a computationally efficient way): maximize the margin
2. Extend the above definition to non-linearly separable problems: add a penalty term for misclassifications
3. Map data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that the data are mapped implicitly to this space
3
Which Separating Hyperplane to Use?
[Figure: several candidate separating lines between two classes of points; axes Var1 and Var2]
5
Maximizing the Margin
IDEA 1: Select the separating hyperplane that maximizes the margin!
[Figure: two candidate hyperplanes with their margin widths; axes Var1 and Var2]
6
Support Vectors
[Figure: the maximum-margin hyperplane and its margin width; the points lying on the margin boundaries are the support vectors; axes Var1 and Var2]
7
Setting Up the Optimization Problem
[Figure: hyperplane w·x + b = 0 with parallel margin boundaries w·x + b = k and w·x + b = -k; axes Var1 and Var2]
The width of the margin is $\frac{2k}{\|w\|}$, so the problem is:

$$\max_{w,b} \ \frac{2k}{\|w\|}$$
$$\text{s.t. } w \cdot x_i + b \ge k, \ \forall x_i \text{ of class 1}$$
$$\quad\ \ w \cdot x_i + b \le -k, \ \forall x_i \text{ of class 2}$$

8
Setting Up the Optimization Problem
[Figure: hyperplane w·x + b = 0 with margin boundaries w·x + b = 1 and w·x + b = -1; axes Var1 and Var2]
There is a scale and unit for the data so that k = 1. The problem then becomes:

$$\max_{w,b} \ \frac{2}{\|w\|}$$
$$\text{s.t. } w \cdot x_i + b \ge 1, \ \forall x_i \text{ of class 1}$$
$$\quad\ \ w \cdot x_i + b \le -1, \ \forall x_i \text{ of class 2}$$

9
Setting Up the Optimization Problem
• If class 1 corresponds to y = 1 and class 2 corresponds to y = -1, we can rewrite
$$w \cdot x_i + b \ge 1, \ \forall x_i \text{ with } y_i = 1$$
$$w \cdot x_i + b \le -1, \ \forall x_i \text{ with } y_i = -1$$
• as
$$y_i (w \cdot x_i + b) \ge 1, \ \forall x_i$$
• So the problem becomes:
$$\max_{w,b} \ \frac{2}{\|w\|} \ \ \text{s.t. } y_i (w \cdot x_i + b) \ge 1, \ \forall x_i
\qquad \text{or} \qquad
\min_{w,b} \ \frac{1}{2}\|w\|^2 \ \ \text{s.t. } y_i (w \cdot x_i + b) \ge 1, \ \forall x_i$$

10
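As a quick numerical illustration (not part of the slides), the sketch below checks the rescaled constraint y_i(w·x_i + b) ≥ 1 on a made-up 2-D dataset for a hypothetical weight vector and bias, and computes the corresponding margin width 2/||w||; NumPy is assumed.

```python
import numpy as np

# Made-up 2-D points: first two rows are class +1, last two are class -1.
X = np.array([[2.0, 2.0], [3.0, 3.0],
              [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

w = np.array([0.5, 0.5])   # hypothetical weight vector
b = -1.0                   # hypothetical bias

# y_i (w·x_i + b) for every point; all values >= 1 means the margin constraints hold.
margins = y * (X @ w + b)
print("constraints satisfied:", bool(np.all(margins >= 1)))
print("margin width 2/||w|| :", 2 / np.linalg.norm(w))
```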
Linear, Hard-Margin SVM Formulation
• Find w, b that solve
$$\min_{w,b} \ \frac{1}{2}\|w\|^2 \quad \text{s.t. } y_i (w \cdot x_i + b) \ge 1, \ \forall x_i$$
• The problem is convex, so there is a unique global minimum value (when feasible)
• There is also a unique minimizer, i.e. the w and b values that provide the minimum
• Not solvable if the data are not linearly separable
• Quadratic programming: very efficient computationally with modern constrained-optimization engines (handles thousands of constraints and training instances)
11
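A minimal sketch of solving this QP in practice, assuming scikit-learn is available: the hard margin is approximated here with a linear soft-margin SVM and a very large C on a small made-up, linearly separable dataset; the data and parameter values are illustrative only.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable 2-D data.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.0, 3.0],
              [-1.0, -1.0], [-2.0, -1.0], [0.0, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C makes the soft-margin solution approach the hard-margin one.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, " b =", b)
print("margin width 2/||w|| :", 2 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)
```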
Support Vector Machines
• Three main ideas:
1. Define what an optimal hyperplane is (in a way that can be identified in a computationally efficient way): maximize the margin
2. Extend the above definition to non-linearly separable problems: add a penalty term for misclassifications
3. Map data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that the data are mapped implicitly to this space
12
Non-Linearly Separable Data
• Introduce slack variables ξ_i
• Allow some instances to fall within the margin, but penalize them
[Figure: hyperplane w·x + b = 0 with margin boundaries w·x + b = 1 and w·x + b = -1; the slack ξ_i measures how far a violating point falls inside the margin; axes Var1 and Var2]
14
Formulating the Optimization Problem
[Figure: the soft margin, with slack ξ_i for points on the wrong side of their margin boundary; axes Var1 and Var2]
• The constraint becomes:
$$y_i (w \cdot x_i + b) \ge 1 - \xi_i, \ \forall x_i, \qquad \xi_i \ge 0$$
• The objective function penalizes misclassified instances and those within the margin:
$$\min_{w,b,\xi} \ \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$$
• C trades off margin width and misclassifications
15
Linear, Soft-Margin SVMs
$$\min_{w,b,\xi} \ \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t. } y_i (w \cdot x_i + b) \ge 1 - \xi_i, \ \ \xi_i \ge 0, \ \forall x_i$$
• The algorithm tries to keep the ξ_i at zero while maximizing the margin
• Notice: the algorithm does not minimize the number of misclassifications (an NP-complete problem) but the sum of distances from the margin hyperplanes
• Other formulations use ξ_i² instead
• As C → ∞, we get closer to the hard-margin solution

16
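To see the role of C, here is a small sketch (assuming scikit-learn; the dataset and C values are made up) that fits linear soft-margin SVMs with a small and a large C and compares margin width, total slack Σξ_i, and the number of support vectors.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian blobs (made up), so some slack is unavoidable.
X = np.vstack([rng.normal([2, 2], 1.5, (30, 2)),
               rng.normal([-2, -2], 1.5, (30, 2))])
y = np.array([1] * 30 + [-1] * 30)

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w, b = clf.coef_[0], clf.intercept_[0]
    slack = np.maximum(0.0, 1.0 - y * (X @ w + b))   # xi_i = max(0, 1 - y_i(w·x_i + b))
    print(f"C={C:<6} margin width={2 / np.linalg.norm(w):.2f} "
          f"total slack={slack.sum():.2f} #support vectors={len(clf.support_vectors_)}")
```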
Robustness of Soft vs Hard Margin SVMs
[Figure: two panels, "Soft Margin SVM" and "Hard Margin SVM", each showing the hyperplane w·x + b = 0 for the same data with one outlier; the soft margin absorbs the outlier with a slack ξ_i, while the hard margin must satisfy it exactly, yielding a much narrower margin; axes Var1 and Var2]
17
Soft vs Hard Margin SVM
• Soft-margin always has a solution
• Soft-margin is more robust to outliers
• Smoother decision surfaces (in the non-linear case)
• Hard-margin does not require guessing the cost parameter (it requires no parameters at all)

18
Support Vector Machines
• Three main ideas:
1. Define what an optimal hyperplane is (in a way that can be identified in a computationally efficient way): maximize the margin
2. Extend the above definition to non-linearly separable problems: add a penalty term for misclassifications
3. Map data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that the data are mapped implicitly to this space
19
Disadvantages of Linear Decision Surfaces
[Figure: a two-class dataset that a linear decision surface cannot separate well; axes Var1 and Var2]
21
Advantages of Non-Linear Surfaces
[Figure: a two-class dataset separated cleanly by a non-linear decision surface; axes Var1 and Var2]
22
Linear Classifiers in High-Dimensional Spaces
[Figure: data that are not linearly separable in the original space (axes Var1, Var2) become linearly separable in a new space (axes Constructed Feature 1, Constructed Feature 2)]
Find a function Φ(x) to map to a different space
23
Mapping Data to a High-Dimensional Space
• Find a function Φ(x) to map to a different space; the SVM formulation then becomes:
$$\min_{w,b,\xi} \ \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t. } y_i (w \cdot \Phi(x_i) + b) \ge 1 - \xi_i, \ \ \xi_i \ge 0, \ \forall x_i$$
• The data appear as Φ(x); the weights w are now weights in the new space
• The explicit mapping is expensive if Φ(x) is very high-dimensional
• Solving the problem without explicitly mapping the data is desirable
24
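A minimal sketch of the explicit-mapping route, assuming scikit-learn: the 2-D data are mapped with a degree-2 polynomial feature map and an ordinary linear SVM is trained in the mapped space. The dataset and parameters are made up; the kernel trick on the next slides avoids constructing Φ(x) at all.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)   # circular class boundary

# Explicit degree-2 map: [1, x1, x2, x1^2, x1*x2, x2^2].
Phi = PolynomialFeatures(degree=2).fit_transform(X)

# An ordinary *linear* SVM in the mapped space can now separate the circle.
clf = SVC(kernel="linear", C=1.0).fit(Phi, y)
print("training accuracy in the mapped space:", clf.score(Phi, y))
```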
The Dual of the SVM Formulation
• Original SVM formulation:
  • n inequality constraints, n positivity constraints, n slack variables ξ_i
$$\min_{w,b,\xi} \ \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$$
$$\text{s.t. } y_i (w \cdot \Phi(x_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \ \forall x_i$$
• The (Wolfe) dual of this problem:
  • one equality constraint, n positivity constraints, n variables α_i (Lagrange multipliers); the objective function is more complicated
$$\min_{\alpha} \ \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \left( \Phi(x_i) \cdot \Phi(x_j) \right) - \sum_i \alpha_i$$
$$\text{s.t. } C \ge \alpha_i \ge 0, \ \forall x_i, \qquad \sum_i \alpha_i y_i = 0$$
• NOTICE: the data only appear as Φ(x_i) · Φ(x_j)

25
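One way to see the dual in practice (assuming scikit-learn; the data are made up): after fitting a linear SVM, the fitted dual coefficients (α_i y_i for the support vectors) let us rebuild the primal weight vector as w = Σ_i α_i y_i x_i.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([2, 2], 1.0, (25, 2)),
               rng.normal([-2, -2], 1.0, (25, 2))])
y = np.array([1] * 25 + [-1] * 25)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_[0, k] holds alpha_k * y_k for the k-th support vector,
# so w = sum_k (alpha_k y_k) x_k is a dot product with the support vectors.
w_from_dual = clf.dual_coef_[0] @ clf.support_vectors_
print("w from the dual coefficients:", w_from_dual)
print("w reported by the solver    :", clf.coef_[0])   # should match
```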
The Kernel Trick
• Φ(x_i) · Φ(x_j) means: map the data into the new space, then take the inner product of the new vectors
• We can find a function such that K(x_i, x_j) = Φ(x_i) · Φ(x_j), i.e., the image of the inner product of the data is the inner product of the images of the data
• Then we do not need to explicitly map the data into the high-dimensional space to solve the optimization problem (for training)
• How do we classify without explicitly mapping the new instances? It turns out that
$$\operatorname{sgn}(w \cdot \Phi(x) + b) = \operatorname{sgn}\Big(\sum_i \alpha_i y_i K(x_i, x) + b\Big),$$
$$\text{where } b \text{ solves } \ \alpha_j \Big( y_j \Big( \sum_i \alpha_i y_i K(x_i, x_j) + b \Big) - 1 \Big) = 0 \ \text{ for any } j \text{ with } \alpha_j \ne 0$$

26
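A small sketch of kernel-based classification without explicit mapping, assuming scikit-learn: the decision value Σ_i α_i y_i K(x_i, x) + b is rebuilt by hand from the fitted attributes of an RBF-kernel SVM and compared with the library's decision_function. The dataset, gamma, and C are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)   # XOR-like labels, not linearly separable

gamma = 1.0
clf = SVC(kernel="rbf", gamma=gamma, C=10.0).fit(X, y)

x_new = np.array([[0.5, -0.8]])              # a new instance to classify
# Gaussian kernel values K(x_i, x_new) for every support vector x_i.
K = np.exp(-gamma * ((clf.support_vectors_ - x_new) ** 2).sum(axis=1))
# dual_coef_ holds alpha_i * y_i, so this is sum_i alpha_i y_i K(x_i, x) + b.
manual = clf.dual_coef_[0] @ K + clf.intercept_[0]
print("decision value by hand        :", manual)
print("decision_function from sklearn:", clf.decision_function(x_new)[0])
```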
Examples of Kernels
• Assume we measure two quantities, e.g. the expression levels of the genes TrkC and SonicHedgehog (SH), and we use the mapping:
$$\Phi: \langle x_{TrkC}, x_{SH} \rangle \mapsto \left( x_{TrkC}^2, \ x_{SH}^2, \ \sqrt{2}\, x_{TrkC}\, x_{SH}, \ \sqrt{2}\, x_{TrkC}, \ \sqrt{2}\, x_{SH}, \ 1 \right)$$
• Consider the function:
$$K(x, z) = (x \cdot z + 1)^2$$
• We can verify that:
$$\Phi(x) \cdot \Phi(z) = x_{TrkC}^2 z_{TrkC}^2 + x_{SH}^2 z_{SH}^2 + 2\, x_{TrkC} x_{SH} z_{TrkC} z_{SH} + 2\, x_{TrkC} z_{TrkC} + 2\, x_{SH} z_{SH} + 1 = (x_{TrkC} z_{TrkC} + x_{SH} z_{SH} + 1)^2 = (x \cdot z + 1)^2 = K(x, z)$$

27
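The identity on this slide is easy to check numerically; the sketch below (plain NumPy, made-up expression values) evaluates Φ(x)·Φ(z) and (x·z + 1)² and confirms they agree.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map of a 2-D vector v = (v1, v2)."""
    v1, v2 = v
    return np.array([v1 ** 2, v2 ** 2,
                     np.sqrt(2) * v1 * v2,
                     np.sqrt(2) * v1,
                     np.sqrt(2) * v2,
                     1.0])

def k_poly2(x, z):
    """Polynomial kernel of degree 2."""
    return (np.dot(x, z) + 1.0) ** 2

x = np.array([0.7, -1.3])   # made-up (TrkC, SH) expression levels
z = np.array([2.1, 0.4])

print("Phi(x)·Phi(z) =", phi(x) @ phi(z))
print("K(x, z)       =", k_poly2(x, z))    # the two numbers agree
```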
Polynomial and Gaussian Kernels
$$K(x, z) = (x \cdot z + 1)^p$$
• is called the polynomial kernel of degree p
• For p = 2, if we measure 7,000 genes, using the kernel once means calculating a summation product with 7,000 terms and then taking the square of this number
• Mapping explicitly to the high-dimensional space means calculating approximately 50,000,000 new features for both training instances, then taking the inner product (another 50,000,000 terms to sum)
• In general, the kernel trick provides huge computational savings over explicit mapping!
• Another commonly used kernel is the Gaussian (it maps to a space with as many dimensions as there are training cases):
$$K(x, z) = \exp\!\left( -\|x - z\|^2 / 2\sigma^2 \right)$$

28
The Mercer Condition
• Is there a mapping Φ(x) for any symmetric function K(x, z)? No
• The SVM dual formulation requires calculating K(x_i, x_j) for each pair of training instances. The matrix G_ij = K(x_i, x_j) is called the Gram matrix
• There is a feature space Φ(x) when the kernel is such that G is always positive semi-definite (the Mercer condition)

29
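A minimal numerical illustration of the Mercer condition (plain NumPy, made-up data): the Gram matrix of the Gaussian kernel has non-negative eigenvalues, while a symmetric but non-Mercer function such as -||x - z|| does not.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(20, 5))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # ||x_i - x_j||^2

G_gauss = np.exp(-sq_dists / 2.0)     # Gaussian kernel Gram matrix (sigma = 1)
G_bad = -np.sqrt(sq_dists)            # symmetric, but not a Mercer kernel

# Eigenvalues of a positive semi-definite matrix are >= 0 (up to floating-point error).
print("min eigenvalue, Gaussian Gram matrix:", np.linalg.eigvalsh(G_gauss).min())
print("min eigenvalue, -distance 'kernel'  :", np.linalg.eigvalsh(G_bad).min())
```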
Support Vector Machines
 Three main ideas:
1. Define what an optimal hyperplane is (in way
that can be identified in a computationally
efficient way): maximize margin
2. Extend the above definition for non-linearly
separable problems: have a penalty term for
misclassifications
3. Map data to high dimensional space where it
is easier to classify with linear decision
surfaces: reformulate problem so that data is
mapped implicitly to this space
30
Other Types of Kernel Methods
• SVMs that perform regression
• SVMs that perform clustering
• ν-Support Vector Machines: maximize the margin while bounding the number of margin errors
• Leave-One-Out Machines: minimize a bound on the leave-one-out error
• SVM formulations that take into consideration differences in the cost of misclassification for the different classes
• Kernels suitable for sequences or strings, or other specialized kernels
31
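For orientation, here are a few of these variants as exposed by one common library (scikit-learn is assumed; the data and parameters are made up): ε-SVR for regression, ν-SVC, and class-weighted misclassification costs.

```python
import numpy as np
from sklearn.svm import SVR, NuSVC, SVC

rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, size=(100, 1))
y_reg = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)   # regression target
y_clf = np.where(X[:, 0] > 0, 1, -1)                   # classification target

reg = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y_reg)              # SVM regression
nu_clf = NuSVC(nu=0.1, kernel="rbf").fit(X, y_clf)                     # nu-SVM classifier
costed = SVC(kernel="rbf", class_weight={1: 10, -1: 1}).fit(X, y_clf)  # unequal costs

print("SVR R^2:", reg.score(X, y_reg))
print("nu-SVC accuracy:", nu_clf.score(X, y_clf))
print("cost-weighted SVC accuracy:", costed.score(X, y_clf))
```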
Variable Selection with SVMs
• Recursive Feature Elimination
  • Train a linear SVM
  • Remove the variables with the lowest weights (those variables affect classification the least), e.g., remove the lowest 50% of variables
  • Retrain the SVM with the remaining variables and repeat until classification performance is reduced
  • Very successful
• Other formulations exist where minimizing the number of variables is folded into the optimization problem
• Similar algorithms exist for non-linear SVMs
• These are some of the best and most efficient variable selection methods

32
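A minimal sketch of SVM-RFE as described above, assuming scikit-learn: a linear SVM is trained, the 50% of variables with the smallest weights are dropped, and the process repeats until the requested number of variables remains. The dataset and the target of 10 variables are made up.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Made-up data: 100 variables, only 10 of which are informative.
X, y = make_classification(n_samples=200, n_features=100,
                           n_informative=10, random_state=0)

svm = SVC(kernel="linear", C=1.0)
# step=0.5 removes the lowest-weighted 50% of the remaining variables each round.
rfe = RFE(estimator=svm, n_features_to_select=10, step=0.5).fit(X, y)

print("selected variable indices:", np.where(rfe.support_)[0])
```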
Comparison with Neural Networks
Neural Networks:
• Hidden layers map to lower-dimensional spaces
• Search space has multiple local minima
• Training is expensive
• Classification is extremely efficient
• Requires choosing the number of hidden units and layers
• Very good accuracy in typical domains
SVMs:
• Kernel maps to a very high-dimensional space
• Search space has a unique minimum
• Training is extremely efficient
• Classification is extremely efficient
• Kernel and cost are the two parameters to select
• Very good accuracy in typical domains
• Extremely robust

33
Why do SVMs Generalize?
• Even though they map to a very high-dimensional space
• They have a very strong bias in that space
• The solution has to be a linear combination of the training instances
• There is a large theory on Structural Risk Minimization providing bounds on the error of an SVM
• Typically the error bounds are too loose to be of practical use

34
MultiClass SVMs
• One-versus-all
  • Train n binary classifiers, one for each class against all other classes
  • The predicted class is the class of the most confident classifier
• One-versus-one
  • Train n(n-1)/2 classifiers, each discriminating between a pair of classes
  • Several strategies exist for selecting the final classification based on the output of the binary SVMs
• Truly multiclass SVMs
  • Generalize the SVM formulation to multiple categories
  • More on that in the paper nominated for the student paper award: "Methods for Multi-Category Cancer Diagnosis from Gene Expression Data: A Comprehensive Evaluation to Inform Decision Support System Development", Alexander Statnikov, Constantin F. Aliferis, Ioannis Tsamardinos

35
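A small sketch of the first two schemes, assuming scikit-learn (the three-class dataset is made up): the one-versus-all and one-versus-one wrappers train n and n(n-1)/2 binary SVMs, respectively.

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Made-up 3-class problem.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

ova = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)   # n binary SVMs
ovo = OneVsOneClassifier(SVC(kernel="rbf")).fit(X, y)    # n(n-1)/2 binary SVMs
print("one-versus-all training accuracy:", ova.score(X, y))
print("one-versus-one training accuracy:", ovo.score(X, y))
```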
Conclusions
• SVMs express learning as a mathematical program, taking advantage of the rich theory in optimization
• SVMs use the kernel trick to map indirectly to extremely high-dimensional spaces
• SVMs are extremely successful, robust, efficient, and versatile, and there are good theoretical indications as to why they generalize well

36