21 Support Vector Machines 03-10-2024

The document discusses Support Vector Machines (SVMs), focusing on the concept of maximum margin classifiers and the role of support vectors in defining decision boundaries. It explains the mathematical formulation of SVMs, including the conditions for optimal separating hyperplanes and the computation of margin width. Additionally, the document addresses challenges in real-world applications, such as the presence of outliers, and introduces the concept of soft-margin classification to improve robustness.


[Figure: repeated scatter plots of two classes of points (marked X and O); most panels show linearly separable data, while the final panel shows the two classes overlapping.]
Generalization (VC) bound:

    R[f] ≤ R_emp[f] + sqrt( (1/m) · ( h (ln(2m/h) + 1) + ln(4/δ) ) )
[Figure: two-class scatter plots (X and O), including one in which a few points of one class fall among the points of the other. Image from http://www.atrandomresearch.com/iclass/]

[Figure: data arranged along x1 as X X O O O O X X cannot be split by a single threshold, but plotting the feature x1^2 against x1 makes the classes linearly separable (axes: x2 vs. x1, and x1^2 vs. x1). Image from http://web.engr.oregonstate.edu/~afern/classes/cs534/]
Copyright © 2001, 2003, Andrew W. Moore

"Given an algorithm which is formulated in terms of a positive definite kernel K1, one can construct an alternative algorithm by replacing K1 with another positive definite kernel K2."

• SVMs can use the kernel trick
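To make this substitution concrete, here is a minimal sketch (not part of the slides) that trains the same SVM twice, changing only the kernel; the toy data, the use of scikit-learn, and the parameter values are assumptions made for illustration.

# Minimal sketch: swapping one positive-definite kernel for another (K1 -> K2).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)   # toy ring-shaped labels

for kernel in ("linear", "rbf"):                    # only the kernel argument changes
    clf = SVC(kernel=kernel, C=1.0).fit(X, y)
    print(kernel, "training accuracy:", clf.score(X, y))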


Maximum Margin

f(x, w, b) = sign(w · x + b)        (input x → classifier f → estimated label y_est)

[Figure: two-class data (one marker denotes +1, the other denotes -1) with the maximum-margin separating line.]

The maximum margin linear classifier is the linear classifier with the, um, maximum
margin. This is the simplest kind of SVM (called an LSVM): the Linear SVM.

Support Vectors are those datapoints that the margin pushes up against.
Why Maximum Margin?

f(x, w, b) = sign(w · x + b)

[Same figure: the maximum-margin linear classifier and its support vectors.]

1. Intuitively this feels safest.
2. If we've made a small error in the location of the boundary (it's been jolted in
   its perpendicular direction), this gives us the least chance of causing a
   misclassification.
3. CV is easy, since the model is immune to removal of any non-support-vector
   datapoints.
4. There's some theory that this is a good thing.
5. Empirically it works very very well.
Specifying a line and margin

[Figure: the "Predict Class = +1" zone above the Plus-Plane, the Classifier
Boundary between the planes, and the "Predict Class = -1" zone below the
Minus-Plane.]

• How do we represent this mathematically?
• …in m input dimensions?


Specifying a line and margin

[Figure: the same zones, now labelled with the plane equations
w · x + b = +1 (Plus-Plane), w · x + b = 0 (Classifier Boundary), and
w · x + b = -1 (Minus-Plane).]

Conditions for optimal separating hyperplane for data points
(x1, y1), …, (xl, yl), where yi = ±1:

1. w · xi + b ≥ +1 if yi = +1 (points in plus class)
2. w · xi + b ≤ -1 if yi = -1 (points in minus class)
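To make these two conditions concrete, the following sketch (made-up weight vector, bias, and data, not taken from the slides) checks them in the equivalent combined form yi (w · xi + b) ≥ 1.

# Sketch: verify the separating-hyperplane conditions on made-up data.
import numpy as np

w = np.array([1.0, 1.0])      # assumed weight vector
b = -3.0                      # assumed bias
X = np.array([[3.0, 2.0],     # plus-class points (y = +1)
              [4.0, 3.0],
              [1.0, 0.5],     # minus-class points (y = -1)
              [0.0, 1.0]])
y = np.array([+1, +1, -1, -1])

margins = y * (X @ w + b)     # combines conditions 1 and 2: yi (w . xi + b)
print(margins)                # each entry should be >= 1 for a valid separating hyperplane
print("all constraints satisfied:", np.all(margins >= 1))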


Computing the margin width

[Figure: the planes w · x + b = +1 and w · x + b = -1, separated by margin width M.]

M = Margin Width. How do we compute M in terms of w and b?

• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }

Claim: The vector w is perpendicular to the Plus-Plane. Why?


Computing the margin width

• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }

Claim: The vector w is perpendicular to the Plus-Plane. Why?
Let u and v be two vectors on the Plus-Plane. What is w · (u − v)?
(It is zero: w · u = w · v = 1 − b, so w is orthogonal to every direction lying
within the plane.)
And so of course the vector w is also perpendicular to the Minus-Plane.
Computing the margin width

• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }
• The vector w is perpendicular to the Plus-Plane.
• Let x- be any point on the minus-plane.
  (Any location in R^m: not necessarily a datapoint.)
• Let x+ be the closest plus-plane point to x-.


Computing the margin width

• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }
• The vector w is perpendicular to the Plus-Plane.
• Let x- be any point on the minus-plane.
• Let x+ be the closest plus-plane point to x-.
• Claim: x+ = x- + λ w for some value of λ. Why?


Computing the margin width

• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }
• The vector w is perpendicular to the Plus-Plane.
• Let x- be any point on the minus-plane.
• Let x+ be the closest plus-plane point to x-.
• Claim: x+ = x- + λ w for some value of λ. Why?

The line from x- to x+ is perpendicular to the planes.
So to get from x- to x+, travel some distance in direction w.


Computing the margin width

What we know:
• w · x+ + b = +1
• w · x- + b = -1
• x+ = x- + λ w
• | x+ - x- | = M

It's now easy to get M in terms of w and b.
Computing the margin width

What we know:
• w · x+ + b = +1
• w · x- + b = -1
• x+ = x- + λ w
• | x+ - x- | = M

Substituting x+ = x- + λ w into the first equation:

    w · (x- + λ w) + b = 1
 => w · x- + b + λ (w · w) = 1
 => -1 + λ (w · w) = 1
 => λ = 2 / (w · w)
Computing the margin width

M = Margin Width = 2 / sqrt(w · w)

M = | x+ - x- | = | λ w | = λ | w | = λ sqrt(w · w)
  = (2 / (w · w)) · sqrt(w · w)
  = 2 / sqrt(w · w)
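A quick numeric sanity check of this result, as a sketch with an arbitrarily chosen w and b: pick a point on the minus-plane, step to the plus-plane along w using λ = 2 / (w · w), and confirm the measured distance equals 2 / sqrt(w · w).

# Sketch: numerically confirm M = 2 / sqrt(w . w) for an arbitrary w, b.
import numpy as np

w = np.array([3.0, 4.0])               # assumed weight vector (|w| = 5)
b = 1.0                                # assumed bias

x_minus = np.array([0.0, -0.5])        # satisfies w . x + b = -1
lam = 2.0 / (w @ w)                    # lambda = 2 / (w . w)
x_plus = x_minus + lam * w             # step from the minus-plane to the plus-plane

print("on plus-plane:", np.isclose(w @ x_plus + b, 1.0))
print("measured margin:", np.linalg.norm(x_plus - x_minus))
print("2 / sqrt(w . w):", 2.0 / np.sqrt(w @ w))   # both distances should be 0.4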
Learning the Maximum Margin Classifier

M = Margin Width = 2 / sqrt(w · w)

Given a guess of w and b we can:
• Compute whether all data points are in the correct half-planes
• Compute the width of the margin

So now we just need to write a program to search the space of w's and b's to find
the widest margin that matches all the datapoints. How?
Gradient descent? Simulated annealing? Matrix inversion? EM? Newton's method?
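In practice this search is posed as a quadratic program rather than a generic search over w and b. As a hedged sketch (the toy data and the large C value used to approximate a hard margin are assumptions), scikit-learn's linear-kernel SVC recovers w and b, from which the margin width 2 / ||w|| can be read off.

# Sketch: fit a (near) hard-margin linear SVM and read off w, b, and the margin.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [0.5, 2.0],     # toy plus-class points
              [4.0, 4.0], [5.0, 5.5], [4.5, 3.5]])    # toy minus-class points
y = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)           # very large C ~ hard margin
w = clf.coef_[0]
b = clf.intercept_[0]

print("w =", w, " b =", b)
print("margin width M = 2 / ||w|| =", 2.0 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)

Specialized decomposition solvers (SMO-style, as in libsvm) are what carry out this optimization at scale.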
ECE 8443: Lecture 16

Large-Margin Classification

[Figure: hyperplanes C0, C1, C2, all separating class 1 from class 2; the optimal
classifier C0 is shown with margin hyperplanes H1 and H2, the normal vector w,
and the origin.]

• Hyperplanes C0–C2 achieve perfect classification (zero empirical risk):

  § C0 is optimal in terms of generalization.

  § The data points that define the boundary are called support vectors.

  § A hyperplane can be defined by:  x · w + b.

  § We will impose the constraints:  yi (xi · w + b) - 1 ≥ 0.
    The data points that satisfy the equality are called support vectors.

• Support vectors are found using a constrained optimization:

    Lp = (1/2) ||w||^2 - Σ_{i=1}^{N} αi yi (xi · w + b) + Σ_{i=1}^{N} αi

• The final classifier is computed using the support vectors and the weights:

    f(x) = Σ_{i=1}^{N} αi yi (xi · x) + b
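To connect this final form to something runnable, the sketch below (toy data; attribute names taken from scikit-learn's documented SVC interface) recomputes f(x) = Σ αi yi (xi · x) + b from the fitted support vectors and compares it with the library's own decision function.

# Sketch: rebuild f(x) = sum_i alpha_i y_i (x_i . x) + b from the support vectors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)            # toy linearly separable labels

clf = SVC(kernel="linear", C=10.0).fit(X, y)

x_test = np.array([0.3, -0.8])
# dual_coef_ holds alpha_i * y_i for each support vector (one row for binary problems)
f_manual = clf.dual_coef_[0] @ (clf.support_vectors_ @ x_test) + clf.intercept_[0]
f_library = clf.decision_function(x_test.reshape(1, -1))[0]

print(f_manual, f_library)                            # the two values should agree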


Soft-Margin Classification

• In practice, the number of support vectors will grow unacceptably large for
  real problems with large amounts of data.

• Also, the system will be very sensitive to mislabeled training data or outliers.

• Solution: introduce "slack variables" ξi, or a soft margin:

    yi (xi · w + b) - (1 - ξi) ≥ 0

  [Figure: Class 1 and Class 2 with a few points allowed inside, or on the wrong
  side of, the margin.]

  This gives the system the ability to ignore data points near the boundary,
  and effectively pushes the margin towards the centroid of the training data.

• This is now a constrained optimization with an additional constraint: ξi ≥ 0.

• The solution to this problem can still be found using Lagrange multipliers.
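As an illustration of the role of the soft margin (the toy data with one mislabeled point and the specific C values are assumptions, not from the lecture), a small penalty C lets the optimizer tolerate margin violations instead of contorting the boundary around the outlier.

# Sketch: effect of the soft-margin penalty C on an outlier-contaminated toy set.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [0.5, 1.0], [1.0, 0.5],     # class -1
              [3.0, 3.0], [3.5, 2.5], [2.5, 3.5],     # class +1
              [0.8, 0.8]])                            # outlier labeled +1 near class -1
y = np.array([-1, -1, -1, +1, +1, +1, +1])

for C in (0.1, 1000.0):                               # soft vs. nearly hard margin
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C:g}: margin width = {2.0 / np.linalg.norm(w):.3f}, "
          f"#support vectors = {len(clf.support_vectors_)}")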
Nonlinear Decision Surfaces

• Thus far we have only considered linear decision surfaces. How do we
  generalize this to a nonlinear surface?

  [Figure: a mapping φ(·) carrying points from the input space to a feature
  space in which the two classes become linearly separable.]

• Our approach will be to transform the data to a higher dimensional space
  where the data can be separated by a linear surface.

• Define a kernel function:

    K(xi, xj) = φ(xi) · φ(xj)

  Examples of kernel functions include the polynomial kernel:

    K(xi, xj) = (xi^T xj + 1)^d
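The kernel identity can be checked directly for the degree-2 polynomial case. The sketch below uses one standard explicit feature map φ for d = 2 in two dimensions (an added worked example, not from the lecture) and confirms that φ(xi) · φ(xj) equals (xi · xj + 1)^2.

# Sketch: verify K(x, z) = (x . z + 1)^2 equals an explicit feature-map dot product.
import numpy as np

def phi(x):
    # One explicit feature map for the degree-2 polynomial kernel in 2-D.
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

def poly_kernel(x, z, d=2):
    return (x @ z + 1.0) ** d

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(poly_kernel(x, z))          # 4.0, since (1*3 + 2*(-1) + 1)^2 = 2^2
print(phi(x) @ phi(z))            # same value, computed in the 6-D feature space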


Kernel Functions

Other popular kernels are a radial basis function (popular in neural networks):

    K(xi, xj) = exp( -||xi - xj||^2 / (2σ^2) )

and a sigmoid function:

    K(xi, xj) = tanh( κ xi^T xj + θ )

• Our optimization does not change significantly:

    max W(α) = Σ_{i=1}^{n} αi - (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} αi αj yi yj K(xi, xj)

    subject to  C ≥ αi ≥ 0  and  Σ_{i=1}^{n} αi yi = 0

• The final classifier has a similar form:

    f(x) = Σ_{i=1}^{N} αi yi K(xi, x) + b

• Let's work some examples.
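Because the optimization touches the data only through K(xi, xj), an SVM can be trained directly from a precomputed Gram matrix. A hedged sketch (toy data; σ and C chosen arbitrarily) using scikit-learn's precomputed-kernel mode:

# Sketch: train on a precomputed RBF Gram matrix K(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2)).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)   # toy nonlinear labels
sigma = 1.0                                              # assumed kernel width

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-sq_dists / (2.0 * sigma ** 2))               # n x n Gram matrix

clf = SVC(kernel="precomputed", C=1.0).fit(K, y)
print("training accuracy:", clf.score(K, y))             # scoring also takes the Gram matrix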


SVM Limitations

• Uses a binary (yes/no) decision rule.

• Generates a distance from the hyperplane, but this distance is often not a
  good measure of our "confidence" in the classification.

• Can produce a "probability" as a function of the distance (e.g., using
  sigmoid fits), but these are inadequate.

• The number of support vectors grows linearly with the size of the data set.

• Requires the estimation of the trade-off parameter, C, via held-out sets.

  [Figure: error vs. model complexity; the training-set error keeps decreasing
  while the open-loop (held-out) error reaches a minimum at the optimum and
  then rises.]


Summary

• Support Vector Machines are one example of a kernel-based learning machine
  that is trained in a discriminative fashion.
• Integrates notions of risk minimization, large-margin and soft-margin
  classification.
• Two fundamental innovations:
  § maximize the margin between the classes using actual data points,
  § map the data into a higher-dimensional space in which the data is linearly
    separable.
• Training can be computationally expensive but classification is very fast.
• Note that SVMs are inherently non-probabilistic (e.g., non-Bayesian).
• SVMs can be used to estimate posteriors by mapping the SVM output to a
  likelihood-like quantity using a nonlinear function (e.g., sigmoid).
• SVMs are not inherently suited to an N-way classification problem. Typical
  approaches include a pairwise comparison or "one vs. world" approach.


Summary

• Many alternate forms include Transductive SVMs, Sequential SVMs, Support
  Vector Regression, Relevance Vector Machines, and data-driven kernels.

• Key lesson learned: a linear algorithm in the feature space is equivalent to a
  nonlinear algorithm in the input space. Standard linear algorithms can be
  generalized (e.g., kernel principal component analysis, kernel independent
  component analysis, kernel canonical correlation analysis, kernel k-means).

• What we didn't discuss:
  § How do you train SVMs?
  § Computational complexity?
  § How to deal with large amounts of data?


Support Vector Machines

Introduction to Data Mining, 2nd Edition
by Tan, Steinbach, Karpatne, Kumar


Support Vector Machines

• Find a linear hyperplane (decision boundary) that will separate the data
Support Vector Machines

• One Possible Solution


Support Vector Machines

• Another possible solution


Support Vector Machines

• Other possible solutions


Support Vector Machines

• Which one is better? B1 or B2?


• How do you define better?
Support Vector Machines

• Find the hyperplane that maximizes the margin => B1 is better than B2
Support Vector Machines

The decision boundary is the hyperplane w · x + b = 0, with margin hyperplanes
w · x + b = +1 and w · x + b = -1.

    f(x) = +1 if w · x + b ≥ +1
           -1 if w · x + b ≤ -1

    Margin = 2 / ||w||
Linear SVM

• Linear model:

      f(x) = +1 if w · x + b ≥ +1
             -1 if w · x + b ≤ -1

• Learning the model is equivalent to determining the values of w and b
  – How to find w and b from the training data?


Learning Linear SVM

• Objective is to maximize:  Margin = 2 / ||w||

  – This is equivalent to minimizing:  L(w) = ||w||^2 / 2

  – Subject to the following constraints:

        yi = +1 if w · xi + b ≥ 1
             -1 if w · xi + b ≤ -1

    or, equivalently,

        yi (w · xi + b) ≥ 1,   i = 1, 2, ..., N

  ▪ This is a constrained optimization problem
    – Solve it using the Lagrange multiplier method
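The Lagrange-multiplier route can be sketched end to end on a tiny made-up dataset (everything below is illustrative; production solvers use specialized QP/SMO algorithms rather than a general-purpose optimizer): form the dual objective, maximize it subject to αi ≥ 0 and Σ αi yi = 0, then recover w and b from the multipliers.

# Sketch: solve the hard-margin SVM dual with a generic constrained optimizer.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [0.0, 1.0],      # class -1
              [3.0, 3.0], [4.0, 2.5]])     # class +1
y = np.array([-1.0, -1.0, 1.0, 1.0])
G = (y[:, None] * y[None, :]) * (X @ X.T)  # G_ij = y_i y_j (x_i . x_j)

def neg_dual(a):                           # minimize the negative dual objective
    return 0.5 * a @ G @ a - a.sum()

n = len(y)
res = minimize(neg_dual, x0=np.zeros(n), method="SLSQP",
               bounds=[(0.0, None)] * n,                              # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum alpha_i y_i = 0

alpha = res.x
w = (alpha * y) @ X                        # w = sum_i alpha_i y_i x_i
sv = np.argmax(alpha)                      # a point with alpha_i > 0 is a support vector
b = y[sv] - w @ X[sv]                      # recover b from that support vector
print("alpha =", np.round(alpha, 3), " w =", w, " b =", b)
print("margins y_i (w.x_i + b):", y * (X @ w + b))   # should all be >= 1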


Example of Linear SVM

[Figure: a linearly separable two-class dataset with the maximum-margin decision
boundary; the circled points are the support vectors.]


Learning Linear SVM

• Decision boundary depends only on the support vectors
  – If you have a data set with the same support vectors, the decision
    boundary will not change
  – How to classify using SVM once w and b are found? Given a test record xi:

        f(xi) = +1 if w · xi + b ≥ 1
                -1 if w · xi + b ≤ -1

    (i.e., classify xi by the sign of w · xi + b)


Support Vector Machines

• What if the problem is not linearly separable?



Support Vector Machines

• What if the problem is not linearly separable?


– Introduce slack variables ξi
  ▪ Need to minimize:

        L(w) = ||w||^2 / 2 + C Σ_{i=1}^{N} (ξi)^k

  ▪ Subject to:

        yi = +1 if w · xi + b ≥ 1 - ξi
             -1 if w · xi + b ≤ -1 + ξi

  ▪ If k is 1 or 2, this leads to a similar objective function as the linear
    SVM but with different constraints (see textbook)
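As a concrete reading of this objective (a sketch with made-up w, b, C, and data, taking k = 1), each slack value is ξi = max(0, 1 - yi (w · xi + b)), which is zero for points on the correct side of their margin hyperplane.

# Sketch: evaluate the soft-margin objective ||w||^2 / 2 + C * sum(xi_i) with k = 1.
import numpy as np

w = np.array([1.0, -1.0])                 # assumed weight vector
b = 0.0                                   # assumed bias
C = 10.0                                  # assumed trade-off parameter
X = np.array([[2.0, 0.0], [0.5, 0.0], [0.0, 2.0], [0.2, 0.1]])
y = np.array([+1, +1, -1, -1])

xi = np.maximum(0.0, 1.0 - y * (X @ w + b))   # slack: 0 when y_i (w.x_i + b) >= 1
objective = 0.5 * (w @ w) + C * xi.sum()

print("slacks:", xi)
print("objective L(w) =", objective)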


Support Vector Machines

• Find the hyperplane that optimizes both factors


Nonlinear Support Vector Machines

• What if the decision boundary is not linear?


Nonlinear Support Vector Machines

• Transform data into higher dimensional space

  Decision boundary:  w · Φ(x) + b = 0
Learning Nonlinear SVM

• Optimization problem: minimize L(w) = ||w||^2 / 2 + C Σ ξi, subject to
  yi (w · Φ(xi) + b) ≥ 1 - ξi

• This leads to the same set of equations as before, but they involve Φ(x)
  instead of x


Learning Nonlinear SVM

• Issues:
  – What type of mapping function Φ should be used?
  – How to do the computation in high-dimensional space?
    ▪ Most computations involve the dot product Φ(xi) · Φ(xj)
    ▪ Curse of dimensionality?


Learning Nonlinear SVM

• Kernel Trick:
  – Φ(xi) · Φ(xj) = K(xi, xj)
  – K(xi, xj) is a kernel function (expressed in terms of the coordinates in
    the original space)
    ▪ Examples: the polynomial, radial basis function, and sigmoid kernels
      listed earlier


Example of Nonlinear SVM

[Figure: decision boundary of an SVM with a polynomial kernel of degree 2 on data
that is not linearly separable.]
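A comparable example can be sketched with assumed data: concentric rings generated with scikit-learn's make_circles, fit with a linear kernel and with a degree-2 polynomial kernel for contrast.

# Sketch: a degree-2 polynomial-kernel SVM on ring-shaped (non-linearly-separable) data.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear", C=1.0).fit(X, y)
poly2 = SVC(kernel="poly", degree=2, coef0=1.0, C=1.0).fit(X, y)

print("linear kernel accuracy: ", linear.score(X, y))   # poor: no separating line exists
print("degree-2 poly accuracy: ", poly2.score(X, y))    # near 1.0: separable after mapping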


Learning Nonlinear SVM

• Advantages of using a kernel:
  – Don't have to know the mapping function Φ
  – Computing the dot product Φ(xi) · Φ(xj) in the original space avoids the
    curse of dimensionality

• Not all functions can be kernels
  – Must make sure there is a corresponding Φ in some high-dimensional space
  – Mercer's theorem (see textbook)


Characteristics of SVM

• The learning problem is formulated as a convex optimization problem
  – Efficient algorithms are available to find the global minimum
  – Many of the other methods use greedy approaches and find locally
    optimal solutions
  – High computational complexity for building the model

• Robust to noise
• Overfitting is handled by maximizing the margin of the decision boundary
• SVM can handle irrelevant and redundant attributes better than many
  other techniques
• The user needs to provide the type of kernel function and cost function
• Difficult to handle missing values
• What about categorical variables?
