
Ferhat Abbas University, Setif 1

Computer Science Department


Machine Learning, IDTW, M2

Kernel Theory and Advanced SVM

November 5, 2024


[email protected]
Contents

1 Introduction
2 Theory of SVM and Kernel Trick
3 Objective Function of SVM
4 Kernel Trick
5 SVM Dual Formulation with Kernel
6 Common Kernel Functions
  6.1 Linear Kernel
  6.2 Polynomial Kernel (degree p)
  6.3 Radial Basis Function (RBF) Kernel
  6.4 Sigmoid Kernel
7 Visualizing the Effect of Kernels
8 Plotting Different SVM Classifiers in the Iris Dataset
  8.1 SVM Classifiers and Kernels
9 Example: Plot classification boundaries with different SVM Kernels
  9.1 Creating a dataset
  9.2 Training SVC model and plotting decision boundaries
  9.3 Polynomial kernel
  9.4 RBF kernel
  9.5 Sigmoid kernel
10 Conclusion
1 Introduction
I'll explain advanced Support Vector Machines (SVM) and kernel theory in machine learning, particularly in the context of non-linearly separable data. This involves explaining the kernel trick, various kernel functions, and how SVMs use kernels for classification in high-dimensional spaces. I'll provide the formulas and graphs where possible.

2 Theory of SVM and Kernel Trick


Support Vector Machines (SVM) are supervised learning algorithms primarily used for classification and
regression tasks. The goal of an SVM classifier is to find the hyperplane that best separates data points
of different classes with the maximum margin.
In cases where the data is linearly inseparable in the input space, SVMs can use the kernel trick to
map data into a higher-dimensional space where it becomes linearly separable.

3 Objective Function of SVM


Given a set of labeled data points (x_i, y_i), where i = 1, ..., N, x_i ∈ R^d represents the feature vectors,
and y_i ∈ {−1, 1} the labels, the SVM seeks to solve the following optimization problem:

\[
\min_{w,b}\ \frac{1}{2}\,\|w\|^2
\]

subject to:

\[
y_i\,(w \cdot x_i + b) \ge 1, \qquad \forall i
\]

where w is the normal vector to the hyperplane and b is the bias term.

To handle non-linearly separable cases, we introduce the kernel function K(x, x′ ), which implicitly
maps data to a higher-dimensional feature space without explicitly computing the transformation.

4 Kernel Trick
The kernel trick allows us to compute the inner product of transformed features in high-dimensional
space efficiently. Instead of mapping each point explicitly, we use a function K(xi , xj ) = ϕ(xi ) · ϕ(xj ),
where ϕ is the mapping function.

This kernel function replaces the dot product in the SVM’s decision function, yielding a more flexible
classifier that can work in high-dimensional spaces.
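
As a small illustration (this sketch is not part of the original notes; the feature map and the toy vectors are chosen for demonstration), the following Python snippet verifies the identity K(x_i, x_j) = ϕ(x_i) · ϕ(x_j) for the homogeneous degree-2 polynomial kernel on 2-D inputs, where ϕ(x) = (x_1², √2 x_1 x_2, x_2²):

import numpy as np

def phi(x):
    # Explicit degree-2 feature map for a 2-D input (illustrative choice):
    # phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def poly2_kernel(x, z):
    # Kernel trick: returns the same value as phi(x) . phi(z),
    # without ever constructing the mapped vectors.
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.5])

print(phi(x) @ phi(z))     # explicit mapping, then dot product: 6.25
print(poly2_kernel(x, z))  # kernel trick: identical result: 6.25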

5 SVM Dual Formulation with Kernel


The dual form of the SVM objective function using kernels K(x, x′ ) is:
\[
\max_{\alpha}\ \sum_{i=1}^{N} \alpha_i \;-\; \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j\, y_i y_j\, K(x_i, x_j)
\]

subject to:

\[
0 \le \alpha_i \le C, \qquad \sum_{i=1}^{N} \alpha_i y_i = 0
\]

where the α_i are the Lagrange multipliers, and C is the regularization parameter. The decision function
for a new data point x is:

\[
f(x) = \operatorname{sign}\!\left( \sum_{i=1}^{N} \alpha_i y_i\, K(x_i, x) + b \right)
\]

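As a sanity check (a sketch of my own, not from the notes; the toy dataset and the value γ = 0.5 are arbitrary), the decision function above can be recomputed from a fitted scikit-learn SVC: its dual_coef_ attribute stores the products α_i y_i for the support vectors, and intercept_ stores b.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

# Toy binary dataset (illustrative only)
X, y = make_blobs(n_samples=40, centers=2, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)

# Recompute f(x) = sum_i alpha_i * y_i * K(x_i, x) + b over the support vectors.
K = rbf_kernel(clf.support_vectors_, X, gamma=0.5)   # K(x_i, x) for all x
f_manual = clf.dual_coef_ @ K + clf.intercept_       # dual_coef_ holds alpha_i * y_i

print(np.allclose(f_manual.ravel(), clf.decision_function(X)))  # True
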
Figure 1: Linear SVC and SVC with kernels

6 Common Kernel Functions

6.1 Linear Kernel

\[
K(x_i, x_j) = x_i \cdot x_j
\]
6.2 Polynomial Kernel (degree p)

\[
K(x_i, x_j) = (x_i \cdot x_j + 1)^{p}
\]

6.3 Radial Basis Function (RBF) Kernel


\[
K(x_i, x_j) = \exp\!\left( -\frac{\|x_i - x_j\|^2}{2\sigma^2} \right)
\]

where σ is a free parameter that determines the spread of the kernel.

6.4 Sigmoid Kernel


\[
K(x_i, x_j) = \tanh(\kappa\, x_i \cdot x_j + \theta)
\]
where κ and θ are parameters.
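
For reference, the sketch below implements the four kernels directly in NumPy and cross-checks them against scikit-learn's pairwise kernel functions; the toy matrix and the parameter values (p = 2, σ = 1.5, κ = 0.5, θ = −1) are arbitrary choices for illustration, not values from these notes.

import numpy as np
from sklearn.metrics.pairwise import (
    linear_kernel, polynomial_kernel, rbf_kernel, sigmoid_kernel)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                 # 5 toy points in R^3

# Linear kernel: K = X X^T
K_lin = X @ X.T

# Polynomial kernel of degree p = 2: K = (X X^T + 1)^p
K_poly = (X @ X.T + 1) ** 2

# RBF kernel: K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)); scikit-learn's gamma = 1/(2 sigma^2)
sigma = 1.5
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_rbf = np.exp(-sq_dists / (2 * sigma ** 2))

# Sigmoid kernel: K_ij = tanh(kappa * x_i . x_j + theta)
kappa, theta = 0.5, -1.0
K_sig = np.tanh(kappa * (X @ X.T) + theta)

# Cross-check against scikit-learn
print(np.allclose(K_lin,  linear_kernel(X)))
print(np.allclose(K_poly, polynomial_kernel(X, degree=2, gamma=1.0, coef0=1.0)))
print(np.allclose(K_rbf,  rbf_kernel(X, gamma=1.0 / (2 * sigma ** 2))))
print(np.allclose(K_sig,  sigmoid_kernel(X, gamma=kappa, coef0=theta)))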

7 Visualizing the Effect of Kernels


The following diagrams illustrate the effect of different kernels in transforming non-linearly separable
data:

• Linear Kernel: A simple hyperplane separation (useful for linearly separable data).
• Polynomial Kernel: Creates complex decision boundaries that can capture curved relationships.
• RBF Kernel: Suitable for highly non-linear relationships, producing circular boundaries.

8 Plotting Different SVM Classifiers in the Iris Dataset
We provide a comparison of different linear SVM classifiers on a 2D projection of the Iris dataset. For
simplicity, we consider only the first two features of the dataset:
• Sepal length

• Sepal width
This example demonstrates how to plot the decision surface for four different SVM classifiers, each
with a distinct kernel type.

8.1 SVM Classifiers and Kernels


The linear models LinearSVC() and SVC(kernel='linear') produce slightly different decision boundaries. These differences may arise due to:
• Loss Function: LinearSVC minimizes the squared hinge loss, while SVC minimizes the regular
hinge loss.

• Multiclass Strategy: LinearSVC employs a One-vs-All (or One-vs-Rest) approach for multiclass
classification, whereas SVC uses a One-vs-One approach.
Both linear models create linear decision boundaries (intersecting hyperplanes). In contrast, non-linear kernel models, such as the polynomial or Gaussian RBF kernels, produce flexible, non-linear boundaries shaped by the kernel type and its parameters.

Note
Visualizing decision functions of classifiers on 2D datasets can provide insights into their expressive power. However, these intuitions may not generalize to high-dimensional problems.
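
A minimal sketch of the comparison described in this section is given below. It follows the usual scikit-learn layout; the specific γ values, the max_iter setting and the plotting details are assumptions of this sketch rather than values taken from these notes.

import matplotlib.pyplot as plt
from sklearn import datasets, svm
from sklearn.inspection import DecisionBoundaryDisplay

# First two features of the Iris dataset: sepal length and sepal width
iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target

C = 1.0  # regularization parameter
models = {
    "LinearSVC": svm.LinearSVC(C=C, max_iter=10000),
    "SVC with linear kernel": svm.SVC(kernel="linear", C=C),
    "SVC with RBF kernel": svm.SVC(kernel="rbf", gamma=0.7, C=C),
    "SVC with polynomial (degree 3) kernel": svm.SVC(kernel="poly", degree=3, gamma="auto", C=C),
}

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, (title, clf) in zip(axes.ravel(), models.items()):
    clf.fit(X, y)
    # Color the plane by predicted class for each fitted model
    DecisionBoundaryDisplay.from_estimator(
        clf, X, response_method="predict", alpha=0.6, ax=ax,
        xlabel="Sepal length", ylabel="Sepal width")
    ax.scatter(X[:, 0], X[:, 1], c=y, edgecolors="k")
    ax.set_title(title)
plt.tight_layout()
plt.show()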

9 Example: Plot classification boundaries with different SVM Kernels
This example shows how different kernels in an SVC (Support Vector Classifier) influence the classification boundaries in a binary, two-dimensional classification problem.
SVCs aim to find a hyperplane that effectively separates the classes in their training data by maximizing the margin between the outermost data points of each class. This is achieved by finding the best weight vector that defines the decision boundary hyperplane and minimizes the sum of hinge losses for misclassified samples, as measured by the hinge loss function. By default, regularization is applied with the parameter C=1, which allows for a certain degree of misclassification tolerance.

If the data is not linearly separable in the original feature space, a non-linear kernel parameter can be set. Depending on the kernel, the process involves adding new features or transforming existing features to enrich and potentially add meaning to the data. When a kernel other than "linear" is set, the SVC applies the kernel trick, which computes the similarity between pairs of data points using the kernel function without explicitly transforming the entire dataset. The kernel trick thus avoids the otherwise necessary explicit transformation of the whole dataset by considering only the relations between pairs of data points. The kernel function maps two vectors (each pair of observations) to their similarity using their dot product.

The hyperplane can then be calculated using the kernel function as if the dataset were represented in a higher-dimensional space. Using a kernel function instead of an explicit matrix transformation improves performance, as the kernel function has a time complexity of O(n²), whereas the matrix transformation scales according to the specific transformation being applied.
In this example, we compare the most common kernel types of Support Vector Machines: the linear kernel ("linear"), the polynomial kernel ("poly"), the radial basis function kernel ("rbf") and the sigmoid kernel ("sigmoid").

9.1 Creating a dataset
We create a two-dimensional classification dataset with 16 samples and two classes. We plot the samples
with the colors matching their respective targets.


Figure 2: script of SVC
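
Since the script itself only appears as an image above, here is a minimal sketch of the dataset-creation step under the assumptions stated in the text (16 samples, two classes, two features). The exact coordinates used in the original script are not recoverable from the figure, so comparable data is generated instead.

import matplotlib.pyplot as plt
import numpy as np

# A small two-class, two-dimensional dataset with 16 samples (8 per class).
rng = np.random.default_rng(42)
X = np.concatenate([
    rng.normal(loc=[-1.0, 0.5], scale=0.8, size=(8, 2)),
    rng.normal(loc=[1.0, -0.5], scale=0.8, size=(8, 2)),
])
y = np.array([0] * 8 + [1] * 8)

# Plot the samples with colors matching their respective targets
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired, edgecolors="k")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("Samples in two-dimensional feature space")
plt.show()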

The SVM script for the different kernels can be downloaded from the linked website.

Figure 3: Samples in the two-dimensional feature space

We can see that the samples are not clearly separable by a straight line.

9.2 Training SVC model and plotting decision boundaries
We define a function that fits an SVC classifier, taking the kernel parameter as an input, and then plots the decision boundaries learned by the model using DecisionBoundaryDisplay.
Notice that for the sake of simplicity, the C parameter is set to its default value (C = 1) in this
example and the γ parameter is set to γ = 2 across all kernels, although it is automatically ignored for
the linear kernel. In a real classification task, where performance matters, parameter tuning (by using
GridSearchCV for instance) is highly recommended to capture different structures within the data.

Setting response_method="predict" in DecisionBoundaryDisplay colors the areas based on their predicted class. Using response_method="decision_function" allows us to also plot the decision boundary and the margins on both sides of it. Finally, the support vectors used during training (which always lie on the margins) are identified by means of the support_vectors_ attribute of the trained SVCs, and plotted as well.
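
A possible sketch of such a plotting function is shown below; it assumes the X and y arrays from the dataset created above, and the figure size, colors and marker styling are arbitrary choices of this sketch.

import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.inspection import DecisionBoundaryDisplay

def plot_training_data_with_decision_boundary(kernel, X, y):
    # Fit an SVC with the requested kernel (C = 1 by default, gamma = 2 as in the text)
    clf = svm.SVC(kernel=kernel, gamma=2).fit(X, y)

    fig, ax = plt.subplots(figsize=(5, 4))

    # Shade the areas according to the predicted class
    DecisionBoundaryDisplay.from_estimator(
        clf, X, response_method="predict",
        plot_method="pcolormesh", alpha=0.3, ax=ax)

    # Draw the decision boundary (level 0) and the margins (levels -1 and +1)
    DecisionBoundaryDisplay.from_estimator(
        clf, X, response_method="decision_function",
        plot_method="contour", levels=[-1, 0, 1],
        colors=["k", "k", "k"], linestyles=["--", "-", "--"], ax=ax)

    # Highlight the support vectors, which lie on the margins
    ax.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
               s=150, facecolors="none", edgecolors="k")
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired, edgecolors="k")
    ax.set_title(f"Decision boundaries of {kernel} kernel in SVC")
    plt.show()

# Usage, e.g. for the linear kernel:
# plot_training_data_with_decision_boundary("linear", X, y)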


Figure 4: Decision boundaries of linear kernel in SVC



Training an SVC with a linear kernel works in the untransformed feature space, where the hyperplane and the margins are straight lines. Due to the lack of expressivity of the linear kernel, the trained classifier does not perfectly capture the training data.

9.3 Polynomial kernel


The polynomial kernel changes the notion of similarity. The variable d is the degree of the polynomial (the degree parameter), γ controls the influence of each individual training sample on the decision boundary, and r is the bias term (coef0) that shifts the data up or down. Here, we use the default value for the degree of the polynomial in the kernel function (degree = 3). When coef0 = 0 (the default), the data is only transformed, but no additional dimension is added. Using a polynomial kernel is equivalent to creating PolynomialFeatures and then fitting an SVC with a linear kernel on the transformed data, although this alternative approach would be computationally expensive for most datasets.
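
The sketch below contrasts the two routes just mentioned; note that it is only an approximation of the equivalence, since the explicitly generated monomials are not weighted exactly as the kernel weights them, so the learned boundaries need not coincide perfectly. The parameter values are illustrative.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVC

# Kernelized version: degree-3 polynomial kernel, as used in this section
poly_kernel_svc = SVC(kernel="poly", degree=3, gamma=2, coef0=0)

# Explicit-feature version: expand the features first, then use a linear kernel.
# Both models work with degree-3 polynomial decision functions, but the explicit
# expansion is much more expensive for large datasets.
explicit_poly_svc = make_pipeline(
    PolynomialFeatures(degree=3, include_bias=False),
    SVC(kernel="linear"),
)

# Usage, assuming X and y from the dataset created earlier:
# poly_kernel_svc.fit(X, y)
# explicit_poly_svc.fit(X, y)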

Figure 5: Decision boundaries of poly kernel in SVC

The polynomial kernel with γ = 2 adapts well to the training data, causing the margins on both sides
of the hyperplane to bend accordingly.

9.4 RBF kernel


The radial basis function (RBF) kernel, also known as the Gaussian kernel, is the default kernel for Support Vector Machines in scikit-learn. It measures similarity between two data points in infinite dimensions and then approaches classification by majority vote.

Figure 6: Decision boundaries of RBF kernel in SVC

The variable γ controls the influence of each individual training sample on the decision boundary.
The larger the Euclidean distance between two points, the closer the kernel function is to zero. This means that two points that are far apart are more likely to be considered dissimilar.
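
A quick numeric sketch of this decay, with γ = 2 as in the text (the sample distances are arbitrary):

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Kernel value between the origin and points at increasing Euclidean distance.
# Larger distances push K towards zero, i.e. distant points count as dissimilar.
origin = np.zeros((1, 2))
distances = np.array([0.0, 0.5, 1.0, 2.0, 4.0])
points = np.column_stack([distances, np.zeros_like(distances)])

for d, k in zip(distances, rbf_kernel(origin, points, gamma=2).ravel()):
    print(f"distance = {d:3.1f}  ->  K = {k:.6f}")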

9.5 Sigmoid kernel
The kernel coefficient γ controls the influence of each individual training sample on the decision boundary, and r is the bias term (coef0) that shifts the data up or down.
In the sigmoid kernel, the similarity between two data points is computed using the hyperbolic tangent function (tanh). The kernel function scales and possibly shifts the dot product of the two points (x_1 and x_2).

Figure 7: Decision boundaries of sigmoid kernel in SVC

We can see that the decision boundaries obtained with the sigmoid kernel appear curved and irregular. The decision boundary tries to separate the classes by fitting a sigmoid-shaped curve, resulting in a complex boundary that may not generalize well to unseen data.
From this example it becomes clear that the sigmoid kernel has very specific use cases, namely data that exhibits a sigmoidal shape. In this example, careful fine-tuning might find more generalizable decision boundaries. Because of its specificity, the sigmoid kernel is less commonly used in practice compared to other kernels.

10 Conclusion
In this example, we have visualized the decision boundaries trained with the provided dataset. The plots
serve as an intuitive demonstration of how different kernels utilize the training data to determine the
classification boundaries.
The hyperplanes and margins, although computed indirectly, can be imagined as planes in the transformed feature space. However, in the plots, they are represented relative to the original feature space, resulting in curved decision boundaries for the polynomial, RBF, and sigmoid kernels.
Note that the plots do not evaluate the individual kernels' accuracy or quality. They are intended to provide a visual understanding of how the different kernels use the training data.
For a comprehensive evaluation, fine-tuning of the SVC parameters using techniques such as GridSearchCV is recommended to capture the underlying structures within the data.
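
A minimal sketch of such a search over the four kernels is given below; the parameter ranges are assumptions made for illustration, not values recommended in these notes.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Illustrative parameter grid covering the four kernels compared above
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["poly"], "C": [0.1, 1, 10], "degree": [2, 3], "coef0": [0, 1]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": [0.1, 1, 2]},
    {"kernel": ["sigmoid"], "C": [0.1, 1, 10], "gamma": [0.1, 1, 2], "coef0": [-1, 0, 1]},
]

search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")

# Usage, assuming X and y from the dataset created earlier:
# search.fit(X, y)
# print(search.best_params_, search.best_score_)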
