
Support Vector Machines

MEDINFO 2004, T02: Machine Learning Methods for Decision Support and Discovery
Constantin F. Aliferis & Ioannis Tsamardinos
Discovery Systems Laboratory
Department of Biomedical Informatics
Vanderbilt University

1
Support Vector Machines
• The decision surface is a hyperplane (a line in 2D) in feature space (similar to the Perceptron)
• Arguably the most important recent discovery in machine learning
• In a nutshell:
  • Map the data to a predetermined, very high-dimensional space via a kernel function
  • Find the hyperplane that maximizes the margin between the two classes
  • If the data are not separable, find the hyperplane that maximizes the margin and minimizes (a weighted average of) the misclassifications

2
Support Vector Machines
• Three main ideas:
1. Define what an optimal hyperplane is (in a way that can be identified in a computationally efficient way): maximize the margin
2. Extend the above definition to non-linearly separable problems: add a penalty term for misclassifications
3. Map data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that the data are mapped implicitly to this space
3
Which Separating Hyperplane to Use?
[Figure: several candidate separating lines between two classes of points; axes Var1 and Var2]
5
Maximizing the Margin
IDEA 1: Select the separating hyperplane that maximizes the margin!
[Figure: two candidate hyperplanes with their margin widths; axes Var1 and Var2]
6
Support Vectors
[Figure: the maximum-margin hyperplane and its margin width; the points lying on the margin boundaries are the support vectors; axes Var1 and Var2]
7
Setting Up the Optimization Problem
[Figure: hyperplane w·x + b = 0 with parallel margin boundaries w·x + b = k and w·x + b = -k; axes Var1 and Var2]
The width of the margin is $\frac{2k}{\|w\|}$, so the problem is:

$$\max_{w,b} \ \frac{2k}{\|w\|}$$
$$\text{s.t. } w \cdot x_i + b \ge k, \ \forall x_i \text{ of class 1}$$
$$\quad\ \ w \cdot x_i + b \le -k, \ \forall x_i \text{ of class 2}$$

8
Setting Up the Optimization Problem
[Figure: hyperplane w·x + b = 0 with margin boundaries w·x + b = 1 and w·x + b = -1; axes Var1 and Var2]
There is a scale and unit for the data so that k = 1. The problem then becomes:

$$\max_{w,b} \ \frac{2}{\|w\|}$$
$$\text{s.t. } w \cdot x_i + b \ge 1, \ \forall x_i \text{ of class 1}$$
$$\quad\ \ w \cdot x_i + b \le -1, \ \forall x_i \text{ of class 2}$$

9
Setting Up the Optimization Problem
• If class 1 corresponds to y = 1 and class 2 corresponds to y = -1, we can rewrite
$$w \cdot x_i + b \ge 1, \ \forall x_i \text{ with } y_i = 1$$
$$w \cdot x_i + b \le -1, \ \forall x_i \text{ with } y_i = -1$$
• as
$$y_i (w \cdot x_i + b) \ge 1, \ \forall x_i$$
• So the problem becomes:
$$\max_{w,b} \ \frac{2}{\|w\|} \ \ \text{s.t. } y_i (w \cdot x_i + b) \ge 1, \ \forall x_i
\qquad \text{or} \qquad
\min_{w,b} \ \frac{1}{2}\|w\|^2 \ \ \text{s.t. } y_i (w \cdot x_i + b) \ge 1, \ \forall x_i$$

10
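As a quick numerical illustration (not part of the slides), the sketch below checks the rescaled constraint y_i(w·x_i + b) ≥ 1 on a made-up 2-D dataset for a hypothetical weight vector and bias, and computes the corresponding margin width 2/||w||; NumPy is assumed.

```python
import numpy as np

# Made-up 2-D points: first two rows are class +1, last two are class -1.
X = np.array([[2.0, 2.0], [3.0, 3.0],
              [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

w = np.array([0.5, 0.5])   # hypothetical weight vector
b = -1.0                   # hypothetical bias

# y_i (w·x_i + b) for every point; all values >= 1 means the margin constraints hold.
margins = y * (X @ w + b)
print("constraints satisfied:", bool(np.all(margins >= 1)))
print("margin width 2/||w|| :", 2 / np.linalg.norm(w))
```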
Linear, Hard-Margin SVM Formulation
• Find w, b that solve
$$\min_{w,b} \ \frac{1}{2}\|w\|^2 \quad \text{s.t. } y_i (w \cdot x_i + b) \ge 1, \ \forall x_i$$
• The problem is convex, so there is a unique global minimum value (when feasible)
• There is also a unique minimizer, i.e. the w and b values that provide the minimum
• Not solvable if the data are not linearly separable
• Quadratic programming: very efficient computationally with modern constrained-optimization engines (handles thousands of constraints and training instances)
11
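A minimal sketch of solving this QP in practice, assuming scikit-learn is available: the hard margin is approximated here with a linear soft-margin SVM and a very large C on a small made-up, linearly separable dataset; the data and parameter values are illustrative only.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable 2-D data.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.0, 3.0],
              [-1.0, -1.0], [-2.0, -1.0], [0.0, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C makes the soft-margin solution approach the hard-margin one.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, " b =", b)
print("margin width 2/||w|| :", 2 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)
```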
Support Vector Machines
• Three main ideas:
1. Define what an optimal hyperplane is (in a way that can be identified in a computationally efficient way): maximize the margin
2. Extend the above definition to non-linearly separable problems: add a penalty term for misclassifications
3. Map data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that the data are mapped implicitly to this space
12
Non-Linearly Separable Data
• Introduce slack variables ξ_i
• Allow some instances to fall within the margin, but penalize them
[Figure: hyperplane w·x + b = 0 with margin boundaries w·x + b = 1 and w·x + b = -1; the slack ξ_i measures how far a violating point falls inside the margin; axes Var1 and Var2]
14
Formulating the Optimization Problem
[Figure: the soft margin, with slack ξ_i for points on the wrong side of their margin boundary; axes Var1 and Var2]
• The constraint becomes:
$$y_i (w \cdot x_i + b) \ge 1 - \xi_i, \ \forall x_i, \qquad \xi_i \ge 0$$
• The objective function penalizes misclassified instances and those within the margin:
$$\min_{w,b,\xi} \ \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$$
• C trades off margin width and misclassifications
15
Linear, Soft-Margin SVMs
$$\min_{w,b,\xi} \ \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t. } y_i (w \cdot x_i + b) \ge 1 - \xi_i, \ \ \xi_i \ge 0, \ \forall x_i$$
• The algorithm tries to keep the ξ_i at zero while maximizing the margin
• Notice: the algorithm does not minimize the number of misclassifications (an NP-complete problem) but the sum of distances from the margin hyperplanes
• Other formulations use ξ_i² instead
• As C → ∞, we get closer to the hard-margin solution

16
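To see the role of C, here is a small sketch (assuming scikit-learn; the dataset and C values are made up) that fits linear soft-margin SVMs with a small and a large C and compares margin width, total slack Σξ_i, and the number of support vectors.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian blobs (made up), so some slack is unavoidable.
X = np.vstack([rng.normal([2, 2], 1.5, (30, 2)),
               rng.normal([-2, -2], 1.5, (30, 2))])
y = np.array([1] * 30 + [-1] * 30)

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w, b = clf.coef_[0], clf.intercept_[0]
    slack = np.maximum(0.0, 1.0 - y * (X @ w + b))   # xi_i = max(0, 1 - y_i(w·x_i + b))
    print(f"C={C:<6} margin width={2 / np.linalg.norm(w):.2f} "
          f"total slack={slack.sum():.2f} #support vectors={len(clf.support_vectors_)}")
```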
Robustness of Soft vs Hard Margin SVMs
[Figure: two panels, "Soft Margin SVM" and "Hard Margin SVM", each showing the hyperplane w·x + b = 0 for the same data with one outlier; the soft margin absorbs the outlier with a slack ξ_i, while the hard margin must satisfy it exactly, yielding a much narrower margin; axes Var1 and Var2]
17
Soft vs Hard Margin SVM
• Soft-margin always has a solution
• Soft-margin is more robust to outliers
• Smoother decision surfaces (in the non-linear case)
• Hard-margin does not require guessing the cost parameter (it requires no parameters at all)

18
Support Vector Machines
• Three main ideas:
1. Define what an optimal hyperplane is (in a way that can be identified in a computationally efficient way): maximize the margin
2. Extend the above definition to non-linearly separable problems: add a penalty term for misclassifications
3. Map data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that the data are mapped implicitly to this space
19
Disadvantages of Linear Decision Surfaces
[Figure: a two-class dataset that a linear decision surface cannot separate well; axes Var1 and Var2]
21
Advantages of Non-Linear Surfaces
[Figure: a two-class dataset separated cleanly by a non-linear decision surface; axes Var1 and Var2]
22
Linear Classifiers in High-Dimensional Spaces
[Figure: data that are not linearly separable in the original space (axes Var1, Var2) become linearly separable in a new space (axes Constructed Feature 1, Constructed Feature 2)]
Find a function Φ(x) to map to a different space
23
Mapping Data to a High-Dimensional Space
• Find a function Φ(x) to map to a different space; the SVM formulation then becomes:
$$\min_{w,b,\xi} \ \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t. } y_i (w \cdot \Phi(x_i) + b) \ge 1 - \xi_i, \ \ \xi_i \ge 0, \ \forall x_i$$
• The data appear as Φ(x); the weights w are now weights in the new space
• The explicit mapping is expensive if Φ(x) is very high-dimensional
• Solving the problem without explicitly mapping the data is desirable
24
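A minimal sketch of the explicit-mapping route, assuming scikit-learn: the 2-D data are mapped with a degree-2 polynomial feature map and an ordinary linear SVM is trained in the mapped space. The dataset and parameters are made up; the kernel trick on the next slides avoids constructing Φ(x) at all.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)   # circular class boundary

# Explicit degree-2 map: [1, x1, x2, x1^2, x1*x2, x2^2].
Phi = PolynomialFeatures(degree=2).fit_transform(X)

# An ordinary *linear* SVM in the mapped space can now separate the circle.
clf = SVC(kernel="linear", C=1.0).fit(Phi, y)
print("training accuracy in the mapped space:", clf.score(Phi, y))
```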
The Dual of the SVM Formulation
• Original SVM formulation:
  • n inequality constraints, n positivity constraints, n slack variables ξ_i
$$\min_{w,b,\xi} \ \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$$
$$\text{s.t. } y_i (w \cdot \Phi(x_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \ \forall x_i$$
• The (Wolfe) dual of this problem:
  • one equality constraint, n positivity constraints, n variables α_i (Lagrange multipliers); the objective function is more complicated
$$\min_{\alpha} \ \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \left( \Phi(x_i) \cdot \Phi(x_j) \right) - \sum_i \alpha_i$$
$$\text{s.t. } C \ge \alpha_i \ge 0, \ \forall x_i, \qquad \sum_i \alpha_i y_i = 0$$
• NOTICE: the data only appear as Φ(x_i) · Φ(x_j)

25
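One way to see the dual in practice (assuming scikit-learn; the data are made up): after fitting a linear SVM, the fitted dual coefficients (α_i y_i for the support vectors) let us rebuild the primal weight vector as w = Σ_i α_i y_i x_i.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([2, 2], 1.0, (25, 2)),
               rng.normal([-2, -2], 1.0, (25, 2))])
y = np.array([1] * 25 + [-1] * 25)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_[0, k] holds alpha_k * y_k for the k-th support vector,
# so w = sum_k (alpha_k y_k) x_k is a dot product with the support vectors.
w_from_dual = clf.dual_coef_[0] @ clf.support_vectors_
print("w from the dual coefficients:", w_from_dual)
print("w reported by the solver    :", clf.coef_[0])   # should match
```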
The Kernel Trick
• Φ(x_i) · Φ(x_j) means: map the data into the new space, then take the inner product of the new vectors
• We can find a function such that K(x_i, x_j) = Φ(x_i) · Φ(x_j), i.e., the image of the inner product of the data is the inner product of the images of the data
• Then we do not need to explicitly map the data into the high-dimensional space to solve the optimization problem (for training)
• How do we classify without explicitly mapping the new instances? It turns out that
$$\operatorname{sgn}(w \cdot \Phi(x) + b) = \operatorname{sgn}\Big(\sum_i \alpha_i y_i K(x_i, x) + b\Big),$$
$$\text{where } b \text{ solves } \ \alpha_j \Big( y_j \Big( \sum_i \alpha_i y_i K(x_i, x_j) + b \Big) - 1 \Big) = 0 \ \text{ for any } j \text{ with } \alpha_j \ne 0$$

26
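A small sketch of kernel-based classification without explicit mapping, assuming scikit-learn: the decision value Σ_i α_i y_i K(x_i, x) + b is rebuilt by hand from the fitted attributes of an RBF-kernel SVM and compared with the library's decision_function. The dataset, gamma, and C are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)   # XOR-like labels, not linearly separable

gamma = 1.0
clf = SVC(kernel="rbf", gamma=gamma, C=10.0).fit(X, y)

x_new = np.array([[0.5, -0.8]])              # a new instance to classify
# Gaussian kernel values K(x_i, x_new) for every support vector x_i.
K = np.exp(-gamma * ((clf.support_vectors_ - x_new) ** 2).sum(axis=1))
# dual_coef_ holds alpha_i * y_i, so this is sum_i alpha_i y_i K(x_i, x) + b.
manual = clf.dual_coef_[0] @ K + clf.intercept_[0]
print("decision value by hand        :", manual)
print("decision_function from sklearn:", clf.decision_function(x_new)[0])
```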
Examples of Kernels
• Assume we measure two quantities, e.g. the expression levels of the genes TrkC and SonicHedgehog (SH), and we use the mapping:
$$\Phi: \langle x_{TrkC}, x_{SH} \rangle \mapsto \left( x_{TrkC}^2, \ x_{SH}^2, \ \sqrt{2}\, x_{TrkC}\, x_{SH}, \ \sqrt{2}\, x_{TrkC}, \ \sqrt{2}\, x_{SH}, \ 1 \right)$$
• Consider the function:
$$K(x, z) = (x \cdot z + 1)^2$$
• We can verify that:
$$\Phi(x) \cdot \Phi(z) = x_{TrkC}^2 z_{TrkC}^2 + x_{SH}^2 z_{SH}^2 + 2\, x_{TrkC} x_{SH} z_{TrkC} z_{SH} + 2\, x_{TrkC} z_{TrkC} + 2\, x_{SH} z_{SH} + 1 = (x_{TrkC} z_{TrkC} + x_{SH} z_{SH} + 1)^2 = (x \cdot z + 1)^2 = K(x, z)$$

27
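The identity on this slide is easy to check numerically; the sketch below (plain NumPy, made-up expression values) evaluates Φ(x)·Φ(z) and (x·z + 1)² and confirms they agree.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map of a 2-D vector v = (v1, v2)."""
    v1, v2 = v
    return np.array([v1 ** 2, v2 ** 2,
                     np.sqrt(2) * v1 * v2,
                     np.sqrt(2) * v1,
                     np.sqrt(2) * v2,
                     1.0])

def k_poly2(x, z):
    """Polynomial kernel of degree 2."""
    return (np.dot(x, z) + 1.0) ** 2

x = np.array([0.7, -1.3])   # made-up (TrkC, SH) expression levels
z = np.array([2.1, 0.4])

print("Phi(x)·Phi(z) =", phi(x) @ phi(z))
print("K(x, z)       =", k_poly2(x, z))    # the two numbers agree
```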
Polynomial and Gaussian Kernels
$$K(x, z) = (x \cdot z + 1)^p$$
• is called the polynomial kernel of degree p
• For p = 2, if we measure 7,000 genes, using the kernel once means calculating a summation product with 7,000 terms and then taking the square of this number
• Mapping explicitly to the high-dimensional space means calculating approximately 50,000,000 new features for both training instances, then taking the inner product (another 50,000,000 terms to sum)
• In general, the kernel trick provides huge computational savings over explicit mapping!
• Another commonly used kernel is the Gaussian (it maps to a space with as many dimensions as there are training cases):
$$K(x, z) = \exp\!\left( -\|x - z\|^2 / 2\sigma^2 \right)$$

28
The Mercer Condition
• Is there a mapping Φ(x) for any symmetric function K(x, z)? No
• The SVM dual formulation requires calculating K(x_i, x_j) for each pair of training instances. The matrix G_ij = K(x_i, x_j) is called the Gram matrix
• There is a feature space Φ(x) when the kernel is such that G is always positive semi-definite (the Mercer condition)

29
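A minimal numerical illustration of the Mercer condition (plain NumPy, made-up data): the Gram matrix of the Gaussian kernel has non-negative eigenvalues, while a symmetric but non-Mercer function such as -||x - z|| does not.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(20, 5))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # ||x_i - x_j||^2

G_gauss = np.exp(-sq_dists / 2.0)     # Gaussian kernel Gram matrix (sigma = 1)
G_bad = -np.sqrt(sq_dists)            # symmetric, but not a Mercer kernel

# Eigenvalues of a positive semi-definite matrix are >= 0 (up to floating-point error).
print("min eigenvalue, Gaussian Gram matrix:", np.linalg.eigvalsh(G_gauss).min())
print("min eigenvalue, -distance 'kernel'  :", np.linalg.eigvalsh(G_bad).min())
```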
Support Vector Machines
 Three main ideas:
1. Define what an optimal hyperplane is (in way
that can be identified in a computationally
efficient way): maximize margin
2. Extend the above definition for non-linearly
separable problems: have a penalty term for
misclassifications
3. Map data to high dimensional space where it
is easier to classify with linear decision
surfaces: reformulate problem so that data is
mapped implicitly to this space
30
Other Types of Kernel Methods
• SVMs that perform regression
• SVMs that perform clustering
• ν-Support Vector Machines: maximize the margin while bounding the number of margin errors
• Leave-One-Out Machines: minimize a bound on the leave-one-out error
• SVM formulations that take into consideration differences in the cost of misclassification for the different classes
• Kernels suitable for sequences or strings, or other specialized kernels
31
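For orientation, here are a few of these variants as exposed by one common library (scikit-learn is assumed; the data and parameters are made up): ε-SVR for regression, ν-SVC, and class-weighted misclassification costs.

```python
import numpy as np
from sklearn.svm import SVR, NuSVC, SVC

rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, size=(100, 1))
y_reg = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)   # regression target
y_clf = np.where(X[:, 0] > 0, 1, -1)                   # classification target

reg = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y_reg)              # SVM regression
nu_clf = NuSVC(nu=0.1, kernel="rbf").fit(X, y_clf)                     # nu-SVM classifier
costed = SVC(kernel="rbf", class_weight={1: 10, -1: 1}).fit(X, y_clf)  # unequal costs

print("SVR R^2:", reg.score(X, y_reg))
print("nu-SVC accuracy:", nu_clf.score(X, y_clf))
print("cost-weighted SVC accuracy:", costed.score(X, y_clf))
```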
Variable Selection with SVMs
• Recursive Feature Elimination
  • Train a linear SVM
  • Remove the variables with the lowest weights (those variables affect classification the least), e.g., remove the lowest 50% of variables
  • Retrain the SVM with the remaining variables and repeat until classification performance is reduced
  • Very successful
• Other formulations exist where minimizing the number of variables is folded into the optimization problem
• Similar algorithms exist for non-linear SVMs
• These are some of the best and most efficient variable selection methods

32
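A minimal sketch of SVM-RFE as described above, assuming scikit-learn: a linear SVM is trained, the 50% of variables with the smallest weights are dropped, and the process repeats until the requested number of variables remains. The dataset and the target of 10 variables are made up.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Made-up data: 100 variables, only 10 of which are informative.
X, y = make_classification(n_samples=200, n_features=100,
                           n_informative=10, random_state=0)

svm = SVC(kernel="linear", C=1.0)
# step=0.5 removes the lowest-weighted 50% of the remaining variables each round.
rfe = RFE(estimator=svm, n_features_to_select=10, step=0.5).fit(X, y)

print("selected variable indices:", np.where(rfe.support_)[0])
```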
Comparison with Neural Networks
Neural Networks:
• Hidden layers map to lower-dimensional spaces
• Search space has multiple local minima
• Training is expensive
• Classification is extremely efficient
• Requires choosing the number of hidden units and layers
• Very good accuracy in typical domains
SVMs:
• Kernel maps to a very high-dimensional space
• Search space has a unique minimum
• Training is extremely efficient
• Classification is extremely efficient
• Kernel and cost are the two parameters to select
• Very good accuracy in typical domains
• Extremely robust

33
Why do SVMs Generalize?
• Even though they map to a very high-dimensional space
• They have a very strong bias in that space
• The solution has to be a linear combination of the training instances
• There is a large theory on Structural Risk Minimization providing bounds on the error of an SVM
• Typically the error bounds are too loose to be of practical use

34
MultiClass SVMs
• One-versus-all
  • Train n binary classifiers, one for each class against all other classes
  • The predicted class is the class of the most confident classifier
• One-versus-one
  • Train n(n-1)/2 classifiers, each discriminating between a pair of classes
  • Several strategies exist for selecting the final classification based on the output of the binary SVMs
• Truly multiclass SVMs
  • Generalize the SVM formulation to multiple categories
  • More on that in the paper nominated for the student paper award: "Methods for Multi-Category Cancer Diagnosis from Gene Expression Data: A Comprehensive Evaluation to Inform Decision Support System Development", Alexander Statnikov, Constantin F. Aliferis, Ioannis Tsamardinos

35
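A small sketch of the first two schemes, assuming scikit-learn (the three-class dataset is made up): the one-versus-all and one-versus-one wrappers train n and n(n-1)/2 binary SVMs, respectively.

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Made-up 3-class problem.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

ova = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)   # n binary SVMs
ovo = OneVsOneClassifier(SVC(kernel="rbf")).fit(X, y)    # n(n-1)/2 binary SVMs
print("one-versus-all training accuracy:", ova.score(X, y))
print("one-versus-one training accuracy:", ovo.score(X, y))
```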
Conclusions
• SVMs express learning as a mathematical program, taking advantage of the rich theory in optimization
• SVMs use the kernel trick to map indirectly to extremely high-dimensional spaces
• SVMs are extremely successful, robust, efficient, and versatile, and there are good theoretical indications as to why they generalize well

36