
Support Vector Machines

Overview

1 SVM introduction

2 Basic principle of classification

3 Basics of SVM

4 Necessary mathematical concepts for SVM


About SVM


In 1992, Vladimir Vapnik, together with his colleagues Bernhard Boser and Isabelle Guyon, introduced the Support Vector Machine as a new approach to classification problems

The original SVM was a linear classifier


Widespread Adoption and Applications (1990s-2000s)


Challenges and Decline in Popularity (2000s-2010s)

– Scalability issues
– Competition from deep learning

SVMs are still relevant for small- to medium-sized datasets

SVMs continue to be used in fields like bioinformatics, document classification, and anomaly detection
Basic principles of classification


Want to classify objects as boats and houses
Basic principles of classification


All the objects before the coast line are boats, and all the objects after the coast line are houses

The coast line serves as a decision surface that separates the two classes
Basic principles of classification

These boats will be misclassified as houses


Basic principles of classification


Classification algorithms operate similarly

First, all objects are represented geometrically; then the algorithm seeks to find a decision surface that separates the classes of objects
Basic principles of classification


Unseen (new) objects are classified as “boats” if they fall below the
decision surface and as “houses” if they fall above it
SVM basics
Which line is the best line? The one that maximizes the margin to the nearest points of both classes
Support Vectors


The points that lie closest to the hyperplane are
called support vectors


These points play a crucial role in determining the
position and orientation of the hyperplane
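As an illustration (not from the slides), here is a minimal scikit-learn sketch that fits a linear SVM on synthetic two-class data and reads back the support vectors; the make_blobs data, the C value, and the random seed are arbitrary choices for the example.

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic, well-separated two-class data (illustrative only)
X, y = make_blobs(n_samples=40, centers=2, random_state=0)

# Linear SVM; support_vectors_ holds the points closest to the hyperplane
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print("support vectors:\n", clf.support_vectors_)
print("w =", clf.coef_[0], " b =", clf.intercept_[0])   # hyperplane parameters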
Necessary mathematical concepts
An example
Data Point   X1   X2   Class
A            2    3    -1
B            1    1    -1
C            5    4    +1
D            6    6    +1

We want to classify these points using SVM. The main goal is to find the hyperplane (a line in 2D) that maximizes the margin between the two classes.

Choose the Support Vectors
– Point A (2, 3) for Class -1
– Point D (6, 6) for Class +1

Formulate Equations for the Support Vectors
– 2w1 + 3w2 + b = -1 (Equation 1, since A is in Class -1)
– 6w1 + 6w2 + b = +1 (Equation 2, since D is in Class +1)

Solving these equations together with the margin-maximization condition gives the weights and bias. For simplicity, let's assume the SVM algorithm gives us the following hyperplane:
w1 = 1, w2 = 1, b = -5

This results in the decision boundary:
x1 + x2 - 5 = 0

Classify a new point, say P(3, 3), using the decision function:
f(3, 3) = (1)(3) + (1)(3) - 5 = 6 - 5 = 1

Since f(3, 3) = 1 > 0, the point P(3, 3) belongs to Class +1.
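A minimal sketch of the decision function used above, hard-coding the hyperplane the slide assumes (w1 = 1, w2 = 1, b = -5):

import numpy as np

w = np.array([1.0, 1.0])   # assumed weights from the example
b = -5.0                   # assumed bias from the example

def decision(x):
    # f(x) = w . x + b; the sign gives the predicted class
    return np.dot(w, x) + b

score = decision(np.array([3.0, 3.0]))   # (1)(3) + (1)(3) - 5 = 1
print(score, "-> Class +1" if score > 0 else "-> Class -1")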
Petal Length   Petal Width   Class
5.0            3.3           Iris Setosa
4.9            3.0           Iris Setosa
7.0            3.2           Iris Virginica
6.4            3.2           Iris Virginica
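The table above shows a few Iris flower measurements as a second classification example. A minimal sketch, assuming the standard Iris dataset bundled with scikit-learn, that separates Iris Setosa from Iris Virginica using petal length and petal width:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
keep = np.isin(iris.target, [0, 2])        # 0 = setosa, 2 = virginica
X = iris.data[keep][:, [2, 3]]             # petal length, petal width
y = iris.target[keep]

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(iris.target_names[clf.predict([[5.0, 1.8]])])   # classify a new flower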
Dealing with non-linearly separable data
Non-linearly separable data with SVM

SVM uses a few techniques to still find an optimal decision boundary:
– Kernel trick
– Soft margin
Soft margin SVM


For data that is not perfectly separable, SVM allows some points to be misclassified by introducing a
soft margin

The soft margin approach allows for a trade-off between maximizing the margin and minimizing
classification errors

This is done by introducing a slack variable for each data point, which measures the degree of
misclassification.
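For reference, the standard soft-margin objective (written here in the usual notation, not copied from the slides) combines the margin term with the slack variables ξᵢ and the regularization parameter C discussed next:

\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i}\xi_i
\quad \text{subject to} \quad y_i\,(w^\top x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0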
C is a regularization parameter that controls the trade-off between maximizing the margin and minimizing the classification error.

– If C is large, SVM will try harder to correctly classify all points (at the risk of overfitting).

– If C is small, SVM will allow more points to be misclassified, which can be useful to avoid overfitting.
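A minimal sketch of the effect of C, fitting the same overlapping data with a large and a small value (both values are arbitrary, chosen only to show the trade-off):

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping two-class data, so some misclassification is unavoidable
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)

strict = SVC(kernel="linear", C=100.0).fit(X, y)   # large C: tries hard to classify every point
relaxed = SVC(kernel="linear", C=0.01).fit(X, y)   # small C: tolerates more margin violations

print("support vectors with large C:", strict.n_support_.sum())
print("support vectors with small C:", relaxed.n_support_.sum())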
Kernel trick


The kernel trick is a way for SVM to handle data that is not linearly separable by implicitly mapping it
into a higher-dimensional space where it might be linearly separable

Instead of finding a linear decision boundary in the original feature space, we can transform the data
to a higher-dimensional space where a hyperplane can separate the data
Kernel trick
Common kernels


Linear Kernel: K(x, y) = xᵀy

Polynomial Kernel: K(x, y) = (xᵀy + c)^d
– x and y are input vectors.
– c is a constant.
– d is the degree of the polynomial.

Radial Basis Function (RBF) Kernel: K(x, y) = exp(−σ ‖x − y‖²)
– σ is a hyperparameter that controls the width of the kernel.

Sigmoid Kernel: K(x, y) = tanh(α xᵀy + β)
– α and β are hyperparameters.
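The four kernels above can be written directly in NumPy; this is only an illustrative sketch, and the hyperparameter values (c, d, σ, α, β) are arbitrary defaults:

import numpy as np

def linear_kernel(x, y):
    return np.dot(x, y)                              # K(x, y) = x^T y

def polynomial_kernel(x, y, c=1.0, d=3):
    return (np.dot(x, y) + c) ** d                   # K(x, y) = (x^T y + c)^d

def rbf_kernel(x, y, sigma=0.5):
    return np.exp(-sigma * np.sum((x - y) ** 2))     # K(x, y) = exp(-sigma * ||x - y||^2)

def sigmoid_kernel(x, y, alpha=0.01, beta=0.0):
    return np.tanh(alpha * np.dot(x, y) + beta)      # K(x, y) = tanh(alpha * x^T y + beta)

x, y = np.array([1.0, 2.0]), np.array([2.0, 1.0])
print(linear_kernel(x, y), polynomial_kernel(x, y), rbf_kernel(x, y), sigmoid_kernel(x, y))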


Choosing the right kernel


Linear Kernel
– Use for linearly separable data or when computational efficiency is a concern.

Polynomial Kernel
– Use for data that is not linearly separable but can be separated by polynomial boundaries.
Experiment with different degrees to find the optimal setting

RBF Kernel
– A versatile kernel that can handle complex, non-linear relationships. It's a good default choice for
many problems

Sigmoid Kernel
– Behaves like a neural network activation (the tanh form resembles a perceptron)

– Sometimes used for text classification
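In practice the kernel choice can also be treated as a hyperparameter and selected by cross-validation. A minimal sketch with scikit-learn's GridSearchCV (the parameter grid is only an example; note that scikit-learn names the RBF width parameter gamma rather than σ):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"],    "C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
    {"kernel": ["poly"],   "C": [0.1, 1, 10], "degree": [2, 3]},
]

# 5-fold cross-validated search over kernels and their hyperparameters
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)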
An example


Consider the following data in 2D:

– Class -1: (1,1), (0,1), (−1,−1), (1,−1)


– Class +1: (3,3), (0,3), (−3,−3), (3,−3)

Mapping to Higher Dimensions with an Explicit Transformation

ϕ(x1, x2) = (x1, x2, x1² + x2²)


Applying the Transformation

– Class -1: (1,1,2), (0,1,1), (−1,−1,2), (1,−1,2)


– Class +1: (3,3,18), (0,3,9), (−3,−3,18), (3,−3,18)
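A minimal sketch of this explicit mapping: transform each 2-D point with ϕ and fit a linear SVM in the resulting 3-D space (the test point (2, 2) is an arbitrary example):

import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [0, 1], [-1, -1], [1, -1],    # Class -1
              [3, 3], [0, 3], [-3, -3], [3, -3]])   # Class +1
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])

def phi(points):
    # phi(x1, x2) = (x1, x2, x1^2 + x2^2)
    return np.column_stack([points[:, 0], points[:, 1],
                            points[:, 0] ** 2 + points[:, 1] ** 2])

# In the transformed space the two classes are linearly separable
clf = SVC(kernel="linear").fit(phi(X), y)
print(clf.predict(phi(np.array([[2, 2]]))))         # classify a new 2-D point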
Thank you for
your attention
