
Support Vector Machines

Machine Learning for Scientists 02-620

Reading:
Friedman, Tibshirani, Hastie, Ch 12.1-12.3
Murphy Ch 4.5
Predicting Cancer Types Based on Gene Expression: SVM in Cancer Genomics

http://cgp.iiarjournals.org/content/15/1/41.full.pdf
Types of Classifiers

• Training classifiers involves estimating f: X → Y, or P(Y|X)

• Generative classifiers (e.g., Naïve Bayes)
  – Model and learn P(Y, X) = P(X|Y)P(Y)
  – Derive P(Y|X) from P(Y, X) using Bayes rule

• Discriminative classifiers (e.g., logistic regression)
  – Model P(Y|X) directly

• SVM is
  – another discriminative classifier
  – non-probabilistic
Overview

• Max-margin principle

• Problem formulation with max-margin principle


– Linearly separable case
– Non-linearly separable case

• Dual formulation and support vectors

• Kernelized SVM
Linear Classifier So Far

• Recall our regression classifiers
  – classify x = (x1, x2) into y = 0 or 1

[Figure: a linear decision boundary wᵀx + b = 0 in the (x1, x2) plane; points with wᵀx + b ≥ 0 are labeled y = 1 and points with wᵀx + b < 0 are labeled y = 0]

Notes on the figure:
  – Regression here uses the logistic function; logistic regression MLE considers all data points
  – The fitted line sits closer to the blue points, since many of them are far away from the boundary
  – Many more classifiers are possible
Max Margin Classifier

• Instead of fitting all points, focus on boundary points
• Learn a boundary that leads to the largest margin from both sets of points

[Figure: the boundary wᵀx + b = 0 separating y = 1 (wᵀx + b ≥ 0) from y = 0 (wᵀx + b < 0); of all the possible boundary lines, this one gives the largest margin M on both sides]
Support Vector Machine

• Instead of fitting all points, focus on boundary points
• Learn a boundary that leads to the largest margin from both sets of points

[Figure: the max-margin boundary wᵀx + b = 0 with margin M; the points lying on the margin are the support vectors that support the decision boundary]
Overview

• Max-margin principle

• Problem formulation with max-margin principle


– Linearly separable case
– Non-linearly separable case

• Dual formulation and support vectors

• Kernelized SVM
Learning as Function Optimization

• In probabilistic models for supervised learning, we find parameter estimates that maximize the log likelihood of the data
  – The data log likelihood serves as the loss function

What is the loss function in a support vector machine?


Support Vector Machine

[Figure: three parallel lines wᵀx + b = 1, wᵀx + b = 0, and wᵀx + b = −1 in the (x1, x2) plane]

Classification:
  y = 1 if wᵀx + b ≥ 1
  y = 0 if wᵀx + b ≤ −1
  Undefined if −1 < wᵀx + b < 1
Maximizing the Margin

• Let's define the width of the margin as M
• How can we encode our goal of maximizing M in terms of our parameters (w and b) and the training data?

[Figure: the boundary wᵀx + b = 0 with margin M; the region wᵀx + b ≥ 0 is labeled y = 1]

Classification:
  y = 1 if wᵀx + b ≥ 1
  y = 0 if wᵀx + b ≤ −1
  Undefined if −1 < wᵀx + b < 1
Support Vector Machine

Maximizing the margin M means minimizing the inverse of M:

  Minimize wᵀw/2

[Figure: the lines wᵀx + b = 1, wᵀx + b = 0, and wᵀx + b = −1, with margin M = 2/‖w‖ between the +1 and −1 lines]
Support Vector Machine

Maximizing the margin M means minimizing the inverse of M:

  Minimize wᵀw/2

Subject to the constraints:
  For all x in class +1:  wᵀx + b ≥ 1
  For all x in class −1:  wᵀx + b ≤ −1

A total of n constraints if we have n input samples.

[Figure: the lines wᵀx + b = 1, 0, −1, with margin M = 2/‖w‖]
Support Vector Machine as Quadratic Programming (QP)

Maximizing the margin M means minimizing the inverse of M:

  Minimize wᵀw/2                        ← quadratic function of w

Subject to the constraints:
  For all x in class +1:  wᵀx + b ≥ 1   ← inequality constraints linear in w
  For all x in class −1:  wᵀx + b ≤ −1

If you can formulate your problem in this form, a standard QP solver can be used.
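As a concrete illustration of "a standard QP solver", here is a minimal sketch of the hard-margin QP written with the generic convex-optimization package cvxpy on a tiny hand-made 2-d dataset (both the library choice and the data are assumptions, not part of the slides):

```python
# Minimal sketch: hard-margin SVM as a QP, solved with cvxpy (assumed library).
import numpy as np
import cvxpy as cp

# Tiny linearly separable toy data with labels in {+1, -1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# Minimize w'w / 2 subject to y_i (w'x_i + b) >= 1 for every sample
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, " b =", float(b.value))
print("margin M = 2/||w|| =", 2 / np.linalg.norm(w.value))
```

Any solver that accepts a quadratic objective with linear inequality constraints would do the same job; cvxpy is used here only because it lets the QP be written almost exactly as it appears on the slide.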
Support Vector Machine as Quadratic Programming

Maximizing the margin M means minimizing the inverse of M:

  Minimize wᵀw/2                        ← quadratic function of w

Subject to the constraints:
  For all x in class +1:  wᵀx + b ≥ 1   ← inequality constraints linear in w
  For all x in class −1:  wᵀx + b ≤ −1

By solving this problem, we find the optimal parameters that:
  1. Correctly classify all points
  2. Maximize the margin (or equivalently minimize wᵀw)
Support Vector Machine

[Figure: the lines wᵀx + b = 1, 0, −1 in the (x1, x2) plane, with margin M = 2/‖w‖]

Expression for the margin: M = 2/‖w‖. Where did this come from?
Expression for Margin

Assume two data points that are at the closest distance across the margin:
  x⁺ on the plane wᵀx + b = 1
  x⁻ on the plane wᵀx + b = −1

The margin M is the distance you have to travel to get from x⁻ to x⁺.
  In what direction?
  Over what distance?
Margin: Direction

Fact 1: the vector orthogonal to the plane is given by w.
This means w and the plane wᵀx + b = 1 are at a right angle.

Why? Consider two points u, v on the plane wᵀx + b = 1, i.e.,
  wᵀu + b = 1
  wᵀv + b = 1
Subtracting the two equations, we have
  wᵀ(u − v) = 0
so w is at a right angle to any direction lying in the plane; the same argument applies to all three planes.
Margin: Distance

Fact 2: if x⁺ is a point on the +1 plane and x⁻ is the closest point to x⁺ on the −1 plane, then

  x⁺ = λw + x⁻

λ determines the distance you have to travel (along w) to get from x⁻ to x⁺.

[Figure: x⁺ on the +1 plane and x⁻ on the −1 plane, separated by the margin M along the direction of w]
Expression for Margin

[Figure: the planes wᵀx + b = 1, 0, −1, with margin M = 2/‖w‖]

What we have so far:
  • wᵀx⁺ + b = +1
  • wᵀx⁻ + b = −1
  • x⁺ = λw + x⁻
  • |x⁺ − x⁻| = M
Expression for Margin

Derivation of the margin (solving for λ):
  wᵀx⁺ + b = +1
  ⇒ wᵀ(λw + x⁻) + b = +1
  ⇒ wᵀx⁻ + b + λwᵀw = +1
  ⇒ −1 + λwᵀw = +1
  ⇒ λ = 2/(wᵀw)

What we have so far:
  • wᵀx⁺ + b = +1
  • wᵀx⁻ + b = −1
  • x⁺ = λw + x⁻
  • |x⁺ − x⁻| = M
Expression for Margin

Derivation of the margin:
  M = |x⁺ − x⁻|
  ⇒ M = |λw|
  ⇒ M = λ|w|
  ⇒ M = λ√(wᵀw)
  ⇒ M = (2/wᵀw)·√(wᵀw) = 2/√(wᵀw) = 2/‖w‖

What we have so far:
  • wᵀx⁺ + b = +1
  • wᵀx⁻ + b = −1
  • x⁺ = λw + x⁻
  • |x⁺ − x⁻| = M
  • λ = 2/(wᵀw)
Support Vector Machine

Maximizing the margin M = 2/‖w‖ means minimizing the inverse of M:

  Minimize wᵀw/2

Subject to the constraints:
  For all x in class +1:  wᵀx + b ≥ 1
  For all x in class −1:  wᵀx + b ≤ −1

A total of n constraints if we have n input samples.
Overview

• Max-margin principle

• Problem formulation with max-margin principle


– Linearly separable case
– Non-linearly separable case

• Dual formulation and support vectors

• Kernelized SVM
Support Vector Machine

[Figure: the planes wᵀx + b = 1, 0, −1 in the (x1, x2) plane, with margin M = 2/‖w‖]
Non-linearly Separable Case

• When the training data are not linearly separable (noise, outliers)

[Figure: 2-d data in the (x1, x2) plane where no line separates the two classes]
Support Vector Machine, Separable Case Revisited

Maximizing the margin M means minimizing the inverse of M:

  Minimize wᵀw/2

Subject to the constraints:
  For all x in class +1:  wᵀx + b ≥ 1
  For all x in class −1:  wᵀx + b ≤ −1

Perhaps, in addition to maximizing the margin, we should minimize the training error as well?
Non-linearly Separable Case

• Instead of minimizing the number of misclassified points, we can minimize the distance between these points and the plane they should be on

[Figure: the +1 and −1 planes; one point that should have been on the other side of its margin plane sits at distance εk from it, and another sits at distance εj from the other plane]
Non-linearly Separable Case

• Instead of minimizing the number of misclassified points, we can minimize the distance between these points and the plane they should be on

The new optimization problem is:

  Minimize over w:  wᵀw/2 + C Σᵢ₌₁ⁿ εᵢ

subject to the following inequality constraints:
  For all x in class +1:  wᵀx + b ≥ 1 − εᵢ
  For all x in class −1:  wᵀx + b ≤ −1 + εᵢ
  For all i:  εᵢ ≥ 0

A total of n class constraints if we have n input samples, plus an additional n constraints εᵢ ≥ 0.

[Figure: the +1 and −1 planes with the slack distances εk and εj marked for points on the wrong side of their plane]
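A minimal sketch of this soft-margin QP with explicit slack variables εᵢ, again using cvxpy on made-up overlapping data (both the library and the data are assumptions; the slides only state the optimization problem):

```python
# Minimal sketch: soft-margin SVM QP with slack variables eps_i, solved with cvxpy.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
# Two overlapping Gaussian blobs -> not linearly separable
X = np.vstack([rng.normal(-1.0, 1.0, size=(30, 2)),
               rng.normal(+1.0, 1.0, size=(30, 2))])
y = np.hstack([-np.ones(30), np.ones(30)])
n, C = len(y), 1.0

w = cp.Variable(2)
b = cp.Variable()
eps = cp.Variable(n)                                  # one slack per sample

objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(eps))
constraints = [cp.multiply(y, X @ w + b) >= 1 - eps,  # y_i (w'x_i + b) >= 1 - eps_i
               eps >= 0]                              # the additional n constraints
cp.Problem(objective, constraints).solve()

print("w =", w.value, " b =", float(b.value))
print("total slack sum(eps_i) =", float(eps.value.sum()))
```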
Separable vs Non-separable Cases

Separable case:
  Minimize wᵀw/2
  Subject to the constraints:
    For all x in class +1:  wᵀx + b ≥ 1
    For all x in class −1:  wᵀx + b ≤ −1

Non-separable case:
  Minimize over w:  wᵀw/2 + C Σᵢ₌₁ⁿ εᵢ
  Subject to the constraints:
    For all x in class +1:  wᵀx + b ≥ 1 − εᵢ
    For all x in class −1:  wᵀx + b ≤ −1 + εᵢ
    For all i:  εᵢ ≥ 0
Separable vs Non-separable Cases

(The same two optimization problems as on the previous slide.)

• Instead of solving these QPs directly, we will solve a dual formulation of the SVM optimization problem
• The main reason for switching to this representation is that it allows us to identify the support vectors and to use a neat trick that will make our lives easier (and the run time faster)
Overview

• Max-margin principle

• Problem formulation with max-margin principle


– Linearly separable case
– Non-linearly separable case

• Dual formulation and support vectors

• Kernelized SVM
Support Vector Machine

• Instead of fitting all points, focus on boundary points
• Learn a boundary that leads to the largest margin from both sets of points

[Figure: the max-margin boundary wᵀx + b = 0 with margin M and the support vectors that support the decision boundary]

How can you find the support vectors?
Support Vector Machine

Maximizing the margin M = 2/‖w‖ means minimizing the inverse of M:

  Minimize wᵀw/2

Subject to the constraints:
  For all x in class +1:  wᵀx + b ≥ 1
  For all x in class −1:  wᵀx + b ≤ −1

A total of n constraints if we have n input samples.
Dual Formulation of QP

• In general, Lagrange multipliers can be applied to turn the original primal problem into a dual problem

  Primal problem                 Dual problem
    min_x  x²                      max_a  −a²/4 + ab
    s.t.   x ≥ b                   s.t.   a ≥ 0
  Primal parameter: x            Dual parameter: a

  Lagrangian:
    min_x max_a  x² − a(x − b)
    s.t.  a ≥ 0

• a is a Lagrange multiplier, one associated with each constraint
Dual Representation of SVM QP

• Re-write our QP for the linearly separable case, using labels yᵢ ∈ {+1, −1} to merge the two class-wise constraints into one form:

  Before:  Minimize wᵀw/2
           Subject to:  wᵀx + b ≥ 1 for all x in class +1
                        wᵀx + b ≤ −1 for all x in class −1

  After:   Minimize wᵀw/2
           Subject to:  (wᵀxᵢ + b)yᵢ ≥ 1 for all n samples i = 1, …, n
Applying to SVM

Primal formulation:
  min_w  wᵀw/2
  Subject to the constraints:
    (wᵀxᵢ + b)yᵢ ≥ 1   for all n samples i = 1, …, n

Dual formulation:
  max_α  Σᵢ₌₁ⁿ αᵢ − ½ Σᵢ,ⱼ₌₁ⁿ αᵢαⱼ yᵢyⱼ xᵢᵀxⱼ
  Subject to the constraints:
    Σᵢ₌₁ⁿ αᵢyᵢ = 0
    αᵢ ≥ 0   for i = 1, …, n

Lagrangian:
  min_{w,b}  wᵀw/2 − Σᵢ αᵢ[(wᵀxᵢ + b)yᵢ − 1],   with αᵢ ≥ 0 for all i

Differentiating with respect to w and setting the derivative to zero gives

  w = Σᵢ αᵢ xᵢ yᵢ

which tells us about the support vectors.
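A minimal sketch of the dual QP in code, once more with cvxpy on the small toy dataset from the earlier sketch (both assumed, not from the slides): solve for α, recover w = Σᵢ αᵢyᵢxᵢ, and read b off any support vector, for which (wᵀxₛ + b)yₛ = 1.

```python
# Minimal sketch: the SVM dual QP, solved with cvxpy (assumed library).
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

alpha = cp.Variable(n)
# 1/2 * sum_ij alpha_i alpha_j y_i y_j x_i'x_j  ==  1/2 * || sum_i alpha_i y_i x_i ||^2
objective = cp.Maximize(cp.sum(alpha)
                        - 0.5 * cp.sum_squares(X.T @ cp.multiply(alpha, y)))
constraints = [alpha >= 0, cp.sum(cp.multiply(alpha, y)) == 0]
cp.Problem(objective, constraints).solve()

a = alpha.value
support = np.where(a > 1e-6)[0]          # support vectors: non-zero alphas
w = (a * y) @ X                          # w = sum_i alpha_i y_i x_i
b = y[support[0]] - X[support[0]] @ w    # from (w'x_s + b) y_s = 1 at a support vector
print("alphas =", np.round(a, 4), " support vectors:", support)
print("w =", w, " b =", b)
```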
Support Vectors in SVM Dual

  w = Σᵢ αᵢ xᵢ yᵢ

• w is defined only by the αᵢ's that are not 0
• Each data point xᵢ is associated with its own αᵢ
• The data points with non-zero αᵢ's are called support vectors

[Figure: the max-margin boundary wᵀx + b = 0 with the planes wᵀx + b = ±1; only the points on the margin planes have non-zero αᵢ]
Support Vectors in SVM Dual

  w = Σᵢ αᵢ xᵢ yᵢ

To evaluate a new sample xⱼ we need to compute

  wᵀxⱼ + b = Σᵢ₌₁ⁿ αᵢ yᵢ xᵢᵀxⱼ + b

i.e., a dot product with the support vectors in the training data.
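A minimal sketch of this prediction rule, assuming scikit-learn (not part of the slides): SVC stores αᵢyᵢ for the support vectors in dual_coef_, so the decision value for a new sample can be recomputed from dot products with the support vectors alone and checked against the library's decision_function.

```python
# Minimal sketch: evaluating a new sample via dot products with the support vectors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 1.0, size=(40, 5)),
               rng.normal(+2.0, 1.0, size=(40, 5))])
y = np.array([-1] * 40 + [+1] * 40)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

x_new = rng.normal(0.0, 1.0, size=5)
# dual_coef_ holds alpha_i * y_i for the support vectors only
d_support = (clf.dual_coef_ @ (clf.support_vectors_ @ x_new) + clf.intercept_)[0]
d_library = clf.decision_function(x_new[None, :])[0]
print("from support vectors:", d_support)
print("library decision    :", d_library)
```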
Applying to SVM

Dual formulation (as before):
  max_α  Σᵢ₌₁ⁿ αᵢ − ½ Σᵢ,ⱼ₌₁ⁿ αᵢαⱼ yᵢyⱼ xᵢᵀxⱼ      ← dot products for all pairs of training samples
  Subject to:  Σᵢ₌₁ⁿ αᵢyᵢ = 0,  αᵢ ≥ 0 for i = 1, …, n

From the Lagrangian of the primal, differentiating with respect to w gives

  w = Σᵢ αᵢ xᵢ yᵢ

which tells us about the support vectors.
Applying to SVM: Non-Separable Case

Dual formulation:
  max_α  Σᵢ₌₁ⁿ αᵢ − ½ Σᵢ,ⱼ₌₁ⁿ αᵢαⱼ yᵢyⱼ xᵢᵀxⱼ
  Subject to the constraints:
    Σᵢ₌₁ⁿ αᵢyᵢ = 0
    C ≥ αᵢ ≥ 0   for i = 1, …, n

The only difference is that the αᵢ's are now bounded above by C.
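A minimal sketch of the effect of this bound, assuming scikit-learn and made-up overlapping data (neither is part of the slides): whatever C is used, the stored dual coefficients αᵢyᵢ never exceed C in magnitude.

```python
# Minimal sketch: the soft-margin dual keeps 0 <= alpha_i <= C.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])   # overlapping classes
y = np.array([-1] * 50 + [+1] * 50)

for C in (0.1, 1.0, 10.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    alphas = np.abs(clf.dual_coef_).ravel()            # = alpha_i for the support vectors
    print(f"C={C:5}: {len(alphas):3d} support vectors, max alpha = {alphas.max():.3f} (<= C)")
```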
Midterm

• 10:30–11:30am, Wednesday next week

• In-class exam

• Open book, open notes

• No calculators

• The exam will be posted on Canvas. Submit a photo of your solution to Gradescope.

• The exam will cover material up to SVM.

• I'll make practice problems for SVM available this week. The TA will review SVM in recitation.
Overview

• Max-margin principle

• Problem formulation with max-margin principle


– Linearly separable case
– Non-linearly separable case

• Dual formulation and support vectors

• Kernelized SVM
Support Vector Machine

• Instead of fitting all points, focus on boundary points
• Learn a boundary that leads to the largest margin from both sets of points

[Figure: the max-margin boundary wᵀx + b = 0 with margin M and the support vectors that support the decision boundary]

How can you find the support vectors?
Dual formulation

Primal formulation:
  min_w  wᵀw/2
  Subject to:  (wᵀxᵢ + b)yᵢ ≥ 1  for all n samples i = 1, …, n

Dual formulation:
  max_α  Σᵢ₌₁ⁿ αᵢ − ½ Σᵢ,ⱼ₌₁ⁿ αᵢαⱼ yᵢyⱼ xᵢᵀxⱼ
  Subject to:  Σᵢ₌₁ⁿ αᵢyᵢ = 0,  αᵢ ≥ 0 for i = 1, …, n

Questions to compare the two formulations:
  What are the parameters?
  How many parameters are there?
  What do the parameters mean?
  How are the parameters related?  w = Σᵢ αᵢ xᵢ yᵢ, which tells us about the support vectors.
Computational Cost for Testing

• During testing, the computational costs using the primal vs. dual representations are:

  Using primal variables:  y_new = sign(wᵀx_new + b)
    → m operations for m features

  Using dual variables:  y_new = sign(Σᵢ₌₁ⁿ αᵢ yᵢ xᵢᵀx_new + b)
    → a dot product with all training samples? In fact mr operations, where r is the number of support vectors (points with αᵢ > 0)

• If one uses the dual parameters to make predictions, the prediction depends only on the support vectors, but this is not explicitly represented in the primal.
Dual SVM – Interpretation for the Non-linearly Separable Case

Support vectors: data points on the wrong side of the margin

  w = Σᵢ αᵢ xᵢ yᵢ   summed over the αᵢ's that are not 0

[Figure: the +1 and −1 margin planes, with the support vectors marked]
Overview

• Max-margin principle

• Problem formulation with max-margin principle


– Linearly separable case
– Non-linearly separable case

• Dual formulation and support vectors

• Kernelized SVM
Classifying based on 1-d Input

Can an SVM correctly classify this data?        What about this?

[Figure: two 1-d datasets plotted along the x axis]
Classifying based on 1-d Input

Can an SVM correctly classify this data?        And now?

[Figure: the same 1-d data, now re-plotted in the (x, x²) plane]
Classifying based on 1-d Input

• By transforming the input space from x to the higher-dimensional space φ(x) = (x, x²), the training set becomes linearly separable

[Figure: the data mapped by φ(x) = (x, x²), now separable by a line]
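A minimal sketch of this 1-d example (the toy points, labels, and the use of scikit-learn are assumptions): a linear SVM cannot separate the raw 1-d input, but separates it perfectly after the map φ(x) = (x, x²).

```python
# Minimal sketch: 1-d data made linearly separable by phi(x) = (x, x^2).
import numpy as np
from sklearn.svm import SVC

x = np.array([-3.0, -2.5, -0.5, 0.0, 0.6, 2.4, 3.1])
y = np.array([1, 1, -1, -1, -1, 1, 1])      # outer points are +1, inner points are -1

svm_1d = SVC(kernel="linear").fit(x[:, None], y)
phi = np.column_stack([x, x**2])            # phi(x) = (x, x^2)
svm_2d = SVC(kernel="linear").fit(phi, y)

print("accuracy on raw x  :", svm_1d.score(x[:, None], y))   # < 1: no single threshold works
print("accuracy on phi(x) :", svm_2d.score(phi, y))          # 1.0: separable in (x, x^2)
```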
Classifying based on 2-d Input

• The original input space x can be mapped to some higher-dimensional feature space φ(x) where the training set is separable

[Figure: 2-d data x = (x1, x2) that is not linearly separable]
Classifying based on 2-d Input

• The original input space x can be mapped to some higher-dimensional feature space φ(x) where the training set is separable:

  Φ: x → φ(x),   x = (x1, x2),   φ(x) = (x1², x2², √2·x1·x2)

[Figure: the data mapped into the (x1², x2², √2·x1·x2) space, where it becomes linearly separable]
SVM After Applying Input Transformation

• The original problem:
    max_α  Σᵢ αᵢ − ½ Σᵢ,ⱼ αᵢαⱼ yᵢyⱼ xᵢᵀxⱼ
    Subject to:  Σᵢ αᵢyᵢ = 0,  αᵢ ≥ 0 for all i

• After the input transformation:
    max_α  Σᵢ αᵢ − ½ Σᵢ,ⱼ αᵢαⱼ yᵢyⱼ φ(xᵢ)ᵀφ(xⱼ)
    Subject to:  Σᵢ αᵢyᵢ = 0,  αᵢ ≥ 0 for all i

  The dot product φ(xᵢ)ᵀφ(xⱼ) is the kernel K.

  Φ: x → φ(x),   x = (x1, x2),   φ(x) = (x1², x2², √2·x1·x2)
Transformation of Inputs

• Possible problems
  – High computational burden due to high dimensionality
  – Many more parameters

• SVM solves these two issues simultaneously
  – "Kernel tricks" for efficient computation
  – The dual formulation only assigns parameters to samples, not features

[Figure: points in the input space mapped by φ(·) into the feature space]
Kernel Trick

• Consider a feature vector x = [x1, x2]
• Map it to a higher-dimensional space, e.g. φ(x) = (x1², x2², √2·x1·x2) as on the previous slides
• Compute the feature-space dot product φ(x)ᵀφ(x′) explicitly: what is the time complexity?
• Kernel: K(x, x′) = φ(x)ᵀφ(x′) = (xᵀx′)²
  What is the time complexity now?
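A minimal numeric check of this identity (the specific numbers are made up): computing φ(x)ᵀφ(z) explicitly and computing (xᵀz)² give the same value, but the kernel form only ever touches the original 2-d vectors.

```python
# Minimal sketch: the kernel trick for phi(x) = (x1^2, x2^2, sqrt(2) x1 x2).
import numpy as np

def phi(v):
    # explicit map into the higher-dimensional feature space
    return np.array([v[0]**2, v[1]**2, np.sqrt(2.0) * v[0] * v[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print("phi(x) . phi(z) =", phi(x) @ phi(z))   # explicit feature-space dot product
print("(x . z)^2       =", (x @ z) ** 2)      # kernel trick: same value, O(d) work
```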


Why do SVMs work?

• If we are using huge feature spaces (with kernels), how come we are not overfitting the data?
  – The number of parameters remains the same (and most are set to 0)
  – While we have a lot of input values, in the end we only care about the support vectors, and these are usually a small group of samples
  – The minimization (or, equivalently, the maximization of the margin) acts as a sort of regularization term, leading to reduced overfitting
Multi-class classification with SVMs

• What if we have data from more than two classes?
• Most common solution: one vs. all
  – Create a classifier for each class against all other data
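A minimal sketch of one vs. all, assuming scikit-learn and a made-up 3-class dataset (neither appears in the slides): one binary linear SVM is trained per class against all remaining data, and the class with the largest decision value wins.

```python
# Minimal sketch: one-vs-all (one-vs-rest) multi-class SVM.
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

rng = np.random.default_rng(3)
# Three well-separated 2-d blobs, classes 0, 1, 2
X = np.vstack([rng.normal(c, 0.7, size=(30, 2)) for c in ([-3, 0], [0, 3], [3, 0])])
y = np.repeat([0, 1, 2], 30)

ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)
print(len(ovr.estimators_), "binary classifiers, training accuracy:", ovr.score(X, y))
```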
Separable vs Non-separable Cases

Separable case:
  Minimize wᵀw/2
  Subject to the constraints:
    For all x in class +1:  wᵀx + b ≥ 1
    For all x in class −1:  wᵀx + b ≤ −1

Non-separable case:
  Minimize over w:  wᵀw/2 + C Σᵢ₌₁ⁿ εᵢ
  Subject to the constraints:
    For all x in class +1:  wᵀx + b ≥ 1 − εᵢ
    For all x in class −1:  wᵀx + b ≤ −1 + εᵢ
    For all i:  εᵢ ≥ 0
Error Function for SVM

Let t = (wᵀxᵢ + b)yᵢ. Then t > 0 for both positive and negative training samples if they are classified correctly.

Ideal classifier (0/1 error):
  Error(t) = 0 if t > 0
  Error(t) = 1 if t < 0

SVM (hinge loss):
  Error(t) = [1 − t]₊
  where [·]₊ denotes the positive part

[Figure: Error(t) versus t; the ideal 0/1 error is a step of height 1 at t = 0, while the hinge loss is 0 for t ≥ 1 and increases linearly as t decreases]
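A minimal sketch of these two error functions on a few made-up values of t = (wᵀxᵢ + b)yᵢ: the ideal 0/1 error only counts misclassifications, while the hinge loss [1 − t]₊ also penalizes correctly classified points that fall inside the margin.

```python
# Minimal sketch: 0/1 error vs. hinge loss as a function of t = (w'x_i + b) y_i.
import numpy as np

def hinge(t):
    # [1 - t]_+ : zero beyond the margin (t >= 1), linear penalty otherwise
    return np.maximum(0.0, 1.0 - t)

def zero_one(t):
    # ideal classifier: 1 if misclassified (t < 0), else 0
    return (t < 0).astype(float)

t = np.array([2.0, 1.0, 0.3, -0.5, -2.0])
print("hinge loss:", hinge(t))      # [0.   0.   0.7  1.5  3. ]
print("0/1 error :", zero_one(t))   # [0.   0.   0.   1.   1. ]
```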
Non-linearly Separable Case

• Instead of minimizing the number of misclassified points, we can minimize the distance between these points and the plane they should be on

[Figure: the +1 and −1 planes; a point that should have been on the other side of its margin plane sits at distance εk from it]
Separable vs Non-separable Cases

Non-separable case:
  Minimize over w:  wᵀw/2 + C Σᵢ₌₁ⁿ εᵢ
  Subject to the constraints:
    For all x in class +1:  wᵀx + b ≥ 1 − εᵢ
    For all x in class −1:  wᵀx + b ≤ −1 + εᵢ
    For all i:  εᵢ ≥ 0
SVM on Simulated Data

Non-separable case:
  Minimize over w:  wᵀw/2 + C Σᵢ₌₁ⁿ εᵢ
  Subject to:
    For all x in class +1:  wᵀx + b ≥ 1 − εᵢ
    For all x in class −1:  wᵀx + b ≤ −1 + εᵢ
    For all i:  εᵢ ≥ 0

[Figure: a soft-margin SVM fit to simulated 2-d data, with the support vectors and the margin marked]
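A minimal sketch along the lines of this simulated-data example, assuming scikit-learn and made-up Gaussian data: fit a linear soft-margin SVM, report the margin width 2/‖w‖, and list the support vectors, which all satisfy yᵢ(wᵀxᵢ + b) ≤ 1.

```python
# Minimal sketch: soft-margin SVM on simulated 2-d data; margin and support vectors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-1.5, 1.0, size=(40, 2)),
               rng.normal(+1.5, 1.0, size=(40, 2))])
y = np.array([-1] * 40 + [+1] * 40)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

print("margin width 2/||w|| =", 2 / np.linalg.norm(w))
print("number of support vectors:", len(clf.support_))
# Support vectors have y_i (w'x_i + b) <= 1 (on or inside the margin, up to tolerance)
print("y_i (w'x_i + b) at the SVs:",
      np.round(y[clf.support_] * (X[clf.support_] @ w + b), 2))
```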
Summary, SVM

• Maximum margin principle


• Target function for SVMs
• Linearly separable and non-separable cases
• Dual formulation of SVMs
• Support vectors of SVMs
• Kernel trick and computational complexity
