7 SVM For Scientists Annotated
Reading:
Hastie, Tibshirani, Friedman, Ch. 12.1–12.3
Murphy, Ch. 4.5
Predicting Cancer Types Based on Gene Expression
SVM in Cancer Genomics
http://cgp.iiarjournals.org/content/15/1/41.full.pdf
Types of Classifiers
• SVM is another discriminative classifier
• It is non-probabilistic
Overview
• Max-margin principle
• Kernelized SVM
Linear Classifier So Far
[Figure: 2-D feature space (x1, x2) split by the line wᵀx + b = 0; points with wᵀx + b ≥ 0 are labeled y = 1, points with wᵀx + b < 0 are labeled y = 0.]
Linear Classifier So Far
[Figure: the same data with one particular separating line wᵀx + b = 0 drawn.]
From all the possible boundary lines, this one leads to the largest margin on both sides.
Max Margin Classifier
Margin: find the decision boundary wᵀx + b = 0 that maximizes the margin M.
[Figure: the separating line with margin M shown between the two classes.]
Support Vector Machine
The max-margin classifier above is the (linear) support vector machine: find the decision boundary wᵀx + b = 0 that maximizes the margin M.
[Figure: the same max-margin boundary as on the previous slide.]
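To make this concrete, here is a minimal sketch (not part of the original slides) of fitting a linear max-margin classifier with scikit-learn; the library, the toy data, and the choice C = 1e6 (to approximate a hard margin) are all assumptions for illustration.

    import numpy as np
    from sklearn.svm import SVC

    # Two linearly separable clusters in (x1, x2); made-up data.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc=[-2, -2], size=(20, 2)),
                   rng.normal(loc=[+2, +2], size=(20, 2))])
    y = np.array([0] * 20 + [1] * 20)

    # A very large C approximates the hard-margin SVM described above.
    clf = SVC(kernel="linear", C=1e6).fit(X, y)

    w, b = clf.coef_[0], clf.intercept_[0]        # the boundary is w'x + b = 0
    print("margin M = 2/||w|| =", 2 / np.linalg.norm(w))
    print("number of support vectors:", clf.support_vectors_.shape[0])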
Overview
• Max-margin principle
• Kernelized SVM
Learning as Function Optimization
[Figure: three parallel lines in the (x1, x2) plane: wᵀx + b = +1, wᵀx + b = 0, and wᵀx + b = −1.]

Classification
    y = 1  if wᵀx + b ≥ 1
    y = 0  if wᵀx + b ≤ −1
    Undefined  if −1 < wᵀx + b < 1
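A few lines of NumPy (added here as an illustration, not part of the slides) make the rule above explicit; w_example and b_example are made-up values.

    import numpy as np

    def classify(x, w, b):
        # Returns 1 if w'x + b >= 1, 0 if w'x + b <= -1, None inside the margin.
        score = w @ x + b
        if score >= 1:
            return 1
        if score <= -1:
            return 0
        return None   # -1 < w'x + b < 1: undefined by this rule

    w_example = np.array([1.0, 1.0])   # hypothetical weights
    b_example = -1.0                   # hypothetical bias
    print(classify(np.array([2.0, 2.0]), w_example, b_example))    # -> 1
    print(classify(np.array([-1.0, -1.0]), w_example, b_example))  # -> 0
    print(classify(np.array([1.0, 0.5]), w_example, b_example))    # -> None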
Maximizing the Margin
[Figure: the margin M between the planes wᵀx + b = +1 and wᵀx + b = −1.]

Support Vector Machine
Maximizing the margin M means minimizing the inverse of M:
    Minimize  ‖w‖ / 2  (equivalently, wᵀw / 2)
The margin is M = 2 / ‖w‖.
[Figure: the three planes wᵀx + b = +1, 0, −1 with margin M.]
Support Vector Machine
Maximizing the margin M means minimizing the inverse of M:
    Minimize  ‖w‖ / 2  (equivalently, wᵀw / 2)
    Subject to the constraints:
        For all x in class +1:  wᵀx + b ≥ 1
        For all x in class −1:  wᵀx + b ≤ −1
The margin is M = 2 / ‖w‖.

Expression for Margin: where did M = 2 / ‖w‖ come from?
Expression for Margin
Assume two data points:
    x⁺ on the plane wᵀx + b = +1
    x⁻ on the plane wᵀx + b = −1
that are at the closest distance to each other.

The margin M is the distance you have to travel to get from x⁻ to x⁺.
    In what direction?
    By what distance?
[Figure: x⁺ and x⁻ on the +1 and −1 planes, a distance M apart.]
Margin: Direction
Fact 1: The vector orthogonal to the plane is w; that is, w is at a right angle to the plane wᵀx + b = 1.

Why? Consider two points u, v on the plane wᵀx + b = 1, i.e.
    wᵀu + b = 1
    wᵀv + b = 1
Subtracting, wᵀ(u − v) = 0, so w is orthogonal to any vector u − v lying in the plane.
Fact 2: If x⁺ is a point on the +1 plane and x⁻ is the closest point to x⁺ on the −1 plane, then
    x⁺ = λw + x⁻
for some scalar λ: starting at x⁻, you travel along the direction w, scaled by λ, to reach x⁺ (so the distance traveled is λ‖w‖).
[Figure: x⁻ connected to x⁺ by the vector λw.]
Expression for Margin
What we know:
• wᵀx⁺ + b = +1
• wᵀx⁻ + b = −1
• x⁺ = λw + x⁻
• |x⁺ − x⁻| = M
[Figure: the three planes with x⁺, x⁻ and margin M = 2/‖w‖.]
Expression for Margin
Derivation of the margin: solve for λ.
    wᵀx⁺ + b = +1
    ⇒ wᵀ(λw + x⁻) + b = +1          (substituting x⁺ = λw + x⁻)
    ⇒ wᵀx⁻ + b + λwᵀw = +1
    ⇒ −1 + λwᵀw = +1                 (since wᵀx⁻ + b = −1)
    ⇒ λ = 2 / (wᵀw)
Expression for Margin
What we have so far:
• wᵀx⁺ + b = +1
• wᵀx⁻ + b = −1
• x⁺ = λw + x⁻
• |x⁺ − x⁻| = M
• λ = 2 / (wᵀw)

Derivation of the margin:
    M = |x⁺ − x⁻|
    ⇒ M = |λw|
    ⇒ M = λ |w|
    ⇒ M = λ √(wᵀw)
    ⇒ M = 2 √(wᵀw) / (wᵀw) = 2 / √(wᵀw) = 2 / ‖w‖
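The algebra above can be sanity-checked numerically; the sketch below (an added illustration) uses arbitrary made-up values for w, b, and x⁻.

    import numpy as np

    w = np.array([3.0, 4.0])           # arbitrary weight vector, ||w|| = 5
    b = 2.0

    # Any point x- on the -1 plane satisfies w'x- + b = -1.
    x_minus = np.array([-1.0, 0.0])    # 3*(-1) + 4*0 + 2 = -1

    lam = 2.0 / (w @ w)                # lambda = 2 / (w'w)
    x_plus = lam * w + x_minus         # x+ = lambda*w + x-

    print(w @ x_plus + b)                     # +1.0: x+ lies on the +1 plane
    print(np.linalg.norm(x_plus - x_minus))   # 0.4 = M = |x+ - x-|
    print(2.0 / np.linalg.norm(w))            # 0.4 = 2/||w||, matching the derivation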
Support Vector Machine
Maximizing the margin M = 2/‖w‖ means minimizing the inverse of M:
    Minimize  wᵀw / 2
    Subject to the constraints:
        For all x in class +1:  wᵀx + b ≥ 1
        For all x in class −1:  wᵀx + b ≤ −1
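Since the primal is a quadratic program, it can be handed to a generic solver. Below is a sketch using CVXPY (an assumed dependency, not mentioned in the slides), with the labels coded as y ∈ {+1, −1} and a tiny made-up dataset.

    import cvxpy as cp
    import numpy as np

    # Toy separable data; labels in {+1, -1} so y_i*(w'x_i + b) >= 1 covers both classes.
    X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
    y = np.array([+1, +1, -1, -1])

    w = cp.Variable(2)
    b = cp.Variable()

    objective = cp.Minimize(0.5 * cp.sum_squares(w))      # w'w / 2
    constraints = [cp.multiply(y, X @ w + b) >= 1]        # (w'x_i + b) y_i >= 1 for all i
    cp.Problem(objective, constraints).solve()

    print("w =", w.value, " b =", b.value)
    print("margin M = 2/||w|| =", 2 / np.linalg.norm(w.value))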
Non-linearly Separable Case
[Figure: 2-D data in (x1, x2) that no single line can separate without error.]
Support Vector Machine, Separable Case Revisited
Maximizing the margin M means minimizing the inverse of M:
    Minimize  wᵀw / 2
    Subject to the constraints (as before).
Perhaps, in addition to maximizing the margin, we should minimize the training-data error as well?
Non-linearly Separable Case
[Figure: the +1 and −1 planes with a misclassified point; its slack εⱼ measures how far it is from where it should have been, on the other side of the red plane.]
Support Vector Machine
Find the decision boundary wᵀx + b = 0 that maximizes the margin M = 2/‖w‖:
    Minimize  wᵀw / 2
    Subject to the constraints:
        For all x in class +1:  wᵀx + b ≥ 1
        For all x in class −1:  wᵀx + b ≤ −1
[Figure: the max-margin boundary with margin M between the +1 and −1 planes.]
Lagrangian
A toy constrained problem written with a Lagrange multiplier α:
    min_x max_α  x² − α(x − b)    s.t.  α ≥ 0

The SVM constraints can be written compactly. With the labels coded as yᵢ ∈ {+1, −1}, the two formulations below are equivalent:

    Minimize  wᵀw / 2
    Subject to the constraints:
        For all x in class +1:  wᵀx + b ≥ 1
        For all x in class −1:  wᵀx + b ≤ −1

    Minimize  wᵀw / 2
    Subject to the constraints:
        (wᵀxᵢ + b) yᵢ ≥ 1  for all n samples i = 1, …, n
Applying to SVM
Primal formulation:
    Min_w  wᵀw / 2
    Subject to the constraints:
        (wᵀxᵢ + b) yᵢ ≥ 1  for all n samples i = 1, …, n

Dual formulation:
    Max_α  Σᵢ αᵢ − ½ Σᵢ,ⱼ αᵢ αⱼ yᵢ yⱼ xᵢᵀxⱼ
    Subject to the constraints:
        Σᵢ αᵢ yᵢ = 0
        αᵢ ≥ 0  for i = 1, …, n
Lagrangian
    min_{w,b}  wᵀw/2 − Σᵢ αᵢ [ (wᵀxᵢ + b) yᵢ − 1 ],    αᵢ ≥ 0  ∀i

Differentiate w.r.t. w and set the derivative to zero:
    w = Σᵢ αᵢ yᵢ xᵢ
This tells us about the support vectors.
Support Vectors in SVM Dual
    w = Σᵢ αᵢ yᵢ xᵢ

Prediction for a sample xⱼ therefore needs only dot products with the training samples:
    wᵀxⱼ + b = Σᵢ αᵢ yᵢ xᵢᵀxⱼ + b

[Figure: the max-margin boundary; the training points on the +1 and −1 planes are the support vectors.]
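To connect this to software: scikit-learn (an assumed tool, not part of the slides) exposes the fitted quantities αᵢyᵢ for the support vectors as dual_coef_, so w = Σᵢ αᵢ yᵢ xᵢ can be reconstructed from the dual solution and compared with the primal weights.

    import numpy as np
    from sklearn.svm import SVC

    # Made-up, well-separated data with labels in {-1, +1}.
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(loc=[-2, 0], size=(25, 2)),
                   rng.normal(loc=[+2, 0], size=(25, 2))])
    y = np.array([-1] * 25 + [+1] * 25)

    clf = SVC(kernel="linear", C=1e6).fit(X, y)

    # dual_coef_ stores alpha_i * y_i for the support vectors only.
    w_from_dual = clf.dual_coef_[0] @ clf.support_vectors_   # sum_i alpha_i y_i x_i
    print(np.allclose(w_from_dual, clf.coef_[0]))            # True: same w as the primal
    print("number of support vectors:", len(clf.support_))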
Applying to SVM
In the dual formulation, the data enter only through dot products xᵢᵀxⱼ between training samples:
    Max_α  Σᵢ αᵢ − ½ Σᵢ,ⱼ αᵢ αⱼ yᵢ yⱼ xᵢᵀxⱼ
    Subject to the constraints:
        Σᵢ αᵢ yᵢ = 0
        αᵢ ≥ 0  for i = 1, …, n
Applying to SVM: Non-Separable Case
Dual formulation:
    Max_α  Σᵢ αᵢ − ½ Σᵢ,ⱼ αᵢ αⱼ yᵢ yⱼ xᵢᵀxⱼ
    Subject to the constraints:
        Σᵢ αᵢ yᵢ = 0
        C ≥ αᵢ ≥ 0  for i = 1, …, n
The only difference from the separable case is that the αᵢ's are now bounded above by C.
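A quick way to see the bound in practice (again assuming scikit-learn as the tool): the stored dual coefficients αᵢyᵢ never exceed C in absolute value.

    import numpy as np
    from sklearn.svm import SVC

    # Overlapping (non-separable) made-up classes.
    rng = np.random.default_rng(2)
    X = np.vstack([rng.normal(loc=[-1, 0], size=(40, 2)),
                   rng.normal(loc=[+1, 0], size=(40, 2))])
    y = np.array([-1] * 40 + [+1] * 40)

    C = 0.5
    clf = SVC(kernel="linear", C=C).fit(X, y)

    alphas = np.abs(clf.dual_coef_[0])     # |alpha_i * y_i| = alpha_i
    print(bool(alphas.max() <= C + 1e-8))  # True: every alpha_i is capped at C
    print("support vectors at the bound:", int(np.sum(np.isclose(alphas, C))))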
Midterm
• In-class exam
• No calculators
• I'll make practice problems for SVM available this week. The TA will review SVM in recitation.
Overview
• Max-margin principle
• Kernelized SVM
Support Vector Machine
Look again at the primal and dual formulations above.
For each formulation: What are the parameters? How many parameters are there? What do they mean?
(Primal: w and b, one weight per input feature plus a bias. Dual: one αᵢ per training sample.)
Classifying based on 1-D Input
[Figure: data points on a 1-D axis x that a single threshold cannot separate; plotting each point as (x, x²) makes the classes linearly separable.]
    φ(x) = (x, x²)
Classifying based on 2-D Input
    Φ: x → φ(x),    φ(x) = (x₁², x₂², √2·x₁x₂)
[Figure: 2-D input x = (x₁, x₂) mapped into the 3-D feature space with axes x₁², x₂², √2·x₁x₂.]

SVM After Applying Input Transformation
[Figure: the SVM decision boundary in the transformed feature space; the dual constraints αᵢ ≥ 0 ∀i are unchanged.]
Transformation of Inputs
• Possible problems
  – High computational burden due to the high dimensionality
  – Many more parameters
• SVM solves these two issues simultaneously
  – "Kernel tricks" for efficient computation
  – The dual formulation only assigns parameters to samples, not features
[Figure: the feature map φ(·) applied to each data point, mapping the input space into a higher-dimensional feature space.]
• Kernel:  K(xᵢ, xⱼ) = φ(xᵢ)ᵀφ(xⱼ)
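As an added illustration (not from the slides), the explicit map φ(x) = (x₁², x₂², √2·x₁x₂) from the earlier slide corresponds to the simple kernel K(x, z) = (xᵀz)², so the feature-space dot product can be computed without ever forming φ(x); a quick numeric check:

    import numpy as np

    def phi(v):
        # Explicit feature map phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2).
        return np.array([v[0] ** 2, v[1] ** 2, np.sqrt(2) * v[0] * v[1]])

    def kernel(u, v):
        # Kernel evaluated in the original 2-D space: K(u, v) = (u'v)^2.
        return (u @ v) ** 2

    x = np.array([1.0, 2.0])
    z = np.array([3.0, -1.0])

    print(phi(x) @ phi(z))   # dot product in the 3-D feature space
    print(kernel(x, z))      # same value, computed in the original space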
Hinge Loss
Ideal classifier (0–1 loss):
    Error(t) = 0  if t > 0
    Error(t) = 1  if t < 0
SVM (hinge loss):
    Error(t) = [1 − t]₊
where [·]₊ denotes the positive part and t = y·(wᵀx + b) is the signed margin of a sample.
[Figure: the 0–1 loss and the hinge loss plotted against t.]
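A small NumPy sketch of the two losses above (added for illustration), with t standing for the signed margin of a sample:

    import numpy as np

    def zero_one_loss(t):
        # Ideal classifier error: 1 when the sample is misclassified (t < 0), else 0.
        return (t < 0).astype(float)

    def hinge_loss(t):
        # SVM surrogate: [1 - t]_+, the positive part of 1 - t.
        return np.maximum(0.0, 1.0 - t)

    t = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    print(zero_one_loss(t))   # [1. 1. 0. 0. 0.]
    print(hinge_loss(t))      # [3.  1.5 1.  0.5 0. ]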
Separable vs. Non-separable Cases
Non-separable case:
    Minimize_w  wᵀw/2 + C Σᵢ εᵢ
    Subject to the constraints:
        For all xᵢ in class +1:  wᵀxᵢ + b ≥ 1 − εᵢ
        For all xᵢ in class −1:  wᵀxᵢ + b ≤ −1 + εᵢ
        εᵢ ≥ 0  for all i
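A sketch (assuming scikit-learn, which is not part of the slides) of how the penalty C trades margin width against slack: a small C tolerates more violations and typically yields a wider margin with more support vectors.

    import numpy as np
    from sklearn.svm import SVC

    # Overlapping made-up classes, so some slack eps_i must be nonzero.
    rng = np.random.default_rng(3)
    X = np.vstack([rng.normal(loc=[-1, 0], size=(50, 2)),
                   rng.normal(loc=[+1, 0], size=(50, 2))])
    y = np.array([-1] * 50 + [+1] * 50)

    for C in (0.01, 1.0, 100.0):
        clf = SVC(kernel="linear", C=C).fit(X, y)
        margin = 2 / np.linalg.norm(clf.coef_[0])
        print(f"C={C:>6}: margin={margin:.2f}, support vectors={len(clf.support_)}")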
SVM on Simulated Data
Non-separable case:  Minimize_w  wᵀw/2 + C Σᵢ εᵢ
[Figure: SVM fit to simulated data, with the margin and the support vectors highlighted.]
Summary, SVM