Support Vector Machines
Overview
SVM introduction
Basics of SVM
Vladimir Vapnik and colleagues: In 1992, Vapnik, together with his colleagues Bernhard Boser and Isabelle Guyon, introduced the Support Vector Machine as a new approach to classification problems.
The original SVM was a linear classifier.
Widespread Adoption and Applications (1990s-2000s)
Challenges and Decline in Popularity (2000s-2010s)
Scalability issues
Competition from deep learning
SVMs are still relevant for small- to medium-sized datasets
SVMs continue to be used in fields like bioinformatics, document classification, and anomaly detection
Basic principles of classification
Want to classify objects as boats and houses
All the objects before the coastline are boats and all the objects after the coastline are houses.
The coastline serves as a decision surface that separates the two classes.
Classification algorithms operate similarly.
First, all objects are represented geometrically.
Then the algorithm seeks to find a decision surface that separates the classes of objects.
Unseen (new) objects are classified as “boats” if they fall below the
decision surface and as “houses” if they fall above it
SVM basics
Which line is the best line? The one that maximizes the margin to both classes.
Support Vectors
The points that lie closest to the hyperplane are called support vectors.
These points play a crucial role in determining the position and orientation of the hyperplane.
Necessary mathematical concepts
An example
Data Point   X1   X2   Class
A            2    3    -1
B            1    1    -1
C            5    4    +1
D            6    6    +1

To classify these points using an SVM, the main goal is to find the hyperplane (a line in 2D) that maximizes the margin between the two classes.
Choose the support vectors
● Point A (2, 3) for Class -1
● Point D (6, 6) for Class +1
Formulate equations for the support vectors
● 2w1 + 3w2 + b = -1 (Equation 1)
● 6w1 + 6w2 + b = +1 (Equation 2)
Solve the equations for w1, w2 and b.
For simplicity, let's assume the SVM algorithm gives us the following hyperplane:
w1 = 1, w2 = 1, b = -5
This results in the decision boundary: x1 + x2 - 5 = 0
Classify a new point, say P(3, 3), using the decision function:
f(3, 3) = (1)(3) + (1)(3) - 5 = 6 - 5 = 1
Since f(3, 3) = 1 > 0, the point P(3, 3) belongs to Class +1.
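To make the classification step concrete, here is a minimal Python sketch that hard-codes the hyperplane assumed above (w1 = 1, w2 = 1, b = -5) and evaluates the decision function for a new point; the weights are the assumed values from this example, not coefficients produced by an actual solver.

```python
import numpy as np

# Hyperplane assumed in the example above (w1 = 1, w2 = 1, b = -5); these are
# illustrative values, not coefficients fitted by an actual SVM solver.
w = np.array([1.0, 1.0])
b = -5.0

def classify(x):
    """Linear SVM decision rule: sign(w . x + b)."""
    score = np.dot(w, x) + b
    return +1 if score > 0 else -1

# Classify the new point P(3, 3): f(3, 3) = 3 + 3 - 5 = 1 > 0  ->  Class +1
print(classify(np.array([3.0, 3.0])))
```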
Petal Length   Petal Width   Class
5              3.3           Iris Setosa
4.9            3             Iris Setosa
7              3.2           Iris Virginica
6.4            3.2           Iris Virginica
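As a sketch of how this small Iris table could be handled in practice, the snippet below fits a linear SVM with scikit-learn (assumed to be available) on the four rows and predicts the class of a hypothetical new flower; the new measurement values are made up for illustration only.

```python
import numpy as np
from sklearn.svm import SVC

# The four Iris samples from the table above: (petal length, petal width).
X = np.array([[5.0, 3.3], [4.9, 3.0], [7.0, 3.2], [6.4, 3.2]])
y = np.array(["Iris Setosa", "Iris Setosa", "Iris Virginica", "Iris Virginica"])

clf = SVC(kernel="linear")
clf.fit(X, y)

# Hypothetical new flower (values chosen only for illustration).
print(clf.predict([[6.0, 3.1]]))
```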
Dealing with non-linearly separable data
Non-linearly separable data with SVM
SVM uses a few techniques to still find an optimal decision boundary:
– Kernel trick
– Soft margin
Soft margin SVM
For data that is not perfectly separable, SVM allows some points to be misclassified by introducing a
soft margin
The soft margin approach allows for a trade-off between maximizing the margin and minimizing
classification errors
This is done by introducing a slack variable for each data point, which measures the degree of
misclassification.
C is a regularization parameter that controls the
trade-off between maximizing the margin and
minimizing the classification error.
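As a rough illustration of the role of C, the sketch below (assuming scikit-learn and its make_blobs helper) fits a soft-margin linear SVM on overlapping synthetic data for a few values of C: small C tolerates more slack and keeps the margin wide, while large C penalizes misclassification more heavily.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters: not perfectly separable, so slack variables are needed.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

# Small C -> wider margin, more misclassified training points tolerated.
# Large C -> narrower margin, fewer training errors allowed.
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: training accuracy={clf.score(X, y):.2f}, "
          f"support vectors={len(clf.support_)}")
```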
The kernel trick is a way for SVM to handle data that is not linearly separable by implicitly mapping it
into a higher-dimensional space where it might be linearly separable
Instead of finding a linear decision boundary in the original feature space, we can transform the data
to a higher-dimensional space where a hyperplane can separate the data
Kernel trick
Common kernels
Linear Kernel: K(x, y) = xᵀ·y
Polynomial Kernel: K(x, y) = (xᵀ·y + c)^d
– x and y are input vectors.
– c is a constant and d is the degree of the polynomial.
Radial Basis Function (RBF) Kernel: K(x, y) = exp(−σ‖x − y‖²)
Sigmoid Kernel: K(x, y) = tanh(α·xᵀ·y + β)
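For concreteness, here are direct NumPy implementations of the kernels listed above; the parameter defaults (c, d, σ, α, β) are placeholder values chosen only for illustration.

```python
import numpy as np

def linear_kernel(x, y):
    # K(x, y) = x^T y
    return np.dot(x, y)

def polynomial_kernel(x, y, c=1.0, d=3):
    # K(x, y) = (x^T y + c)^d
    return (np.dot(x, y) + c) ** d

def rbf_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-sigma * ||x - y||^2)
    return np.exp(-sigma * np.linalg.norm(x - y) ** 2)

def sigmoid_kernel(x, y, alpha=0.01, beta=0.0):
    # K(x, y) = tanh(alpha * x^T y + beta)
    return np.tanh(alpha * np.dot(x, y) + beta)

x, y = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(linear_kernel(x, y), polynomial_kernel(x, y), rbf_kernel(x, y), sigmoid_kernel(x, y))
```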
Linear Kernel
– Use for linearly separable data or when computational efficiency is a concern.
Polynomial Kernel
– Use for data that is not linearly separable but can be separated by polynomial boundaries.
Experiment with different degrees to find the optimal setting
RBF Kernel
– A versatile kernel that can handle complex, non-linear relationships. It's a good default choice for
many problems
Sigmoid Kernel
– Neural Network-like Behavior
– Text Classification
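A quick way to compare these choices in practice is to cross-validate an SVM with each kernel on the same data; the sketch below assumes scikit-learn and uses its make_moons toy dataset as a stand-in for non-linearly separable data.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Two interleaving half-moons: a classic non-linearly separable toy dataset.
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

# Compare the kernels discussed above, using scikit-learn's default parameters.
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    scores = cross_val_score(SVC(kernel=kernel), X, y, cv=5)
    print(f"{kernel:8s} mean CV accuracy: {scores.mean():.2f}")
```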
An example
Consider the following data in 2D:
ϕ(x1, x2) = (x1, x2, x1² + x2²)
Applying the Transformation
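The original 2D data for this slide is not reproduced here, so the sketch below applies the transformation ϕ to a few hypothetical points: an inner circle (class -1) and an outer ring (class +1) that are not linearly separable in the plane but become separable by the plane x1² + x2² = 2 after the mapping.

```python
import numpy as np

def phi(x1, x2):
    # Feature map from the slide: (x1, x2) -> (x1, x2, x1^2 + x2^2)
    return np.array([x1, x2, x1 ** 2 + x2 ** 2])

# Hypothetical points (not the slide's original data): an inner circle and an outer ring.
inner = [(0.5, 0.0), (0.0, 0.5), (-0.5, 0.0), (0.0, -0.5)]   # class -1
outer = [(2.0, 0.0), (0.0, 2.0), (-2.0, 0.0), (0.0, -2.0)]   # class +1

# In 3D the third coordinate (the squared radius) separates the classes:
# it is 0.25 for every inner point and 4.0 for every outer point.
for p in inner + outer:
    print(p, "->", phi(*p))
```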