SVM Notes
Support Vector Machine (SVM) is one of the most popular Supervised Learning algorithms. It
can be used for both Classification and Regression problems, but in Machine Learning it is
primarily used for Classification.
The goal of the SVM algorithm is to find the best line or decision boundary that segregates
n-dimensional space into classes, so that new data points can easily be placed in the
correct category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme
cases are called support vectors, and hence the algorithm is termed a Support Vector
Machine. Picture two different categories of points separated by a decision boundary, or
hyperplane: the points of each category that lie closest to the boundary are the support vectors.
Types of SVM
o Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be
classified into two classes by a single straight line, it is termed linearly separable
data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a dataset
cannot be classified by a straight line, it is termed non-linear data, and the
classifier used is called a Non-linear SVM classifier.
Hyperplane and Support Vectors in the SVM algorithm:
The dimension of the hyperplane depends on the number of features in the dataset: if there
are 2 features, the hyperplane is a straight line, and if there are 3 features, the
hyperplane is a 2-dimensional plane.
We always create the hyperplane with the maximum margin, which means the maximum distance
between the hyperplane and the nearest data points of either class.
Support Vectors:
The data points or vectors that lie closest to the hyperplane and affect its position are
termed support vectors. Since these vectors support the hyperplane, they are called
support vectors.
Linear SVM:
The working of the SVM algorithm can be understood with an example. Suppose we have a
dataset with two tags (green and blue) and two features, x1 and x2. We want a classifier
that can classify a pair (x1, x2) of coordinates as either green or blue.
Since this is 2-D space, we can separate the two classes with just a straight line. But
there can be multiple lines that separate these classes, so which one should be chosen?
The SVM algorithm helps to find the best line or decision boundary; this best boundary or
region is called a hyperplane. The SVM algorithm finds the closest points of the lines from
both classes; these points are called support vectors. The distance between the vectors and
the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The
hyperplane with the maximum margin is called the optimal hyperplane, as sketched below.
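Below is a minimal sketch of a linear SVM on 2-feature data. It assumes scikit-learn,
which the notes do not name, and the toy dataset is made up purely for illustration.

import numpy as np
from sklearn.svm import SVC

# Two features (x1, x2); label 0 = "blue", 1 = "green".
X = np.array([[1.0, 2.0], [2.0, 3.0], [2.5, 1.5],   # class 0
              [6.0, 5.0], [7.0, 7.5], [8.0, 6.0]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

# kernel='linear' fits the maximum-margin straight line in 2-D space.
clf = SVC(kernel='linear', C=1.0)
clf.fit(X, y)

# The support vectors are the points closest to the hyperplane;
# they alone determine its position.
print("Support vectors:\n", clf.support_vectors_)

# Classify a new (x1, x2) pair.
print("Prediction for (3, 3):", clf.predict([[3.0, 3.0]]))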
Non-Linear SVM:
Suppose our data cannot be separated by a straight line, for example 1-D points where one
class lies between the points of the other. SVM solves this by creating a new variable
using a kernel. For a point x_i on the line, we create a new variable y_i as a function of
its distance from the origin o; if we plot the points (x_i, y_i), the two classes become
separable.
In this case, the new variable y is created as a function of distance from the origin. A
non-linear function that creates a new variable in this way is referred to as a kernel. A
sketch of this idea follows.
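A minimal sketch of the mapping above, again assuming scikit-learn; the specific function
y_i = x_i**2 is an illustrative choice of "function of distance from the origin", not one
prescribed by the notes.

import numpy as np
from sklearn.svm import SVC

# 1-D data: the 0-class points sit between the 1-class points,
# so no single threshold on x can separate them.
x = np.array([-4.0, -3.0, -0.5, 0.0, 0.5, 3.0, 4.0])
labels = np.array([1, 1, 0, 0, 0, 1, 1])

# New variable as a function of distance from the origin.
y_new = x ** 2

# In the (x, y_new) plane a straight line separates the classes.
X_mapped = np.column_stack([x, y_new])
clf = SVC(kernel='linear').fit(X_mapped, labels)

x_test = np.array([0.2, 5.0])
print(clf.predict(np.column_stack([x_test, x_test ** 2])))  # expected: [0 1]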
SVM Kernel:
The SVM kernel is a function that takes a low-dimensional input space and transforms it
into a higher-dimensional space, i.e., it converts a non-separable problem into a
separable problem. It is mostly useful in non-linear separation problems. Simply put, the
kernel performs some extremely complex data transformations and then finds the process to
separate the data based on the labels or outputs defined.
Polynomial Kernel: It represents the similarity of vectors in the training set in a
feature space over polynomials of the original variables, as sketched below.
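A hedged sketch of a polynomial-kernel SVM, assuming scikit-learn; the dataset (a circular
class boundary) and the parameter values are illustrative. SVC's 'poly' kernel computes
(gamma * <x, x'> + coef0) ** degree internally, so the high-dimensional mapping is never
built explicitly.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)  # circular boundary

# degree=2 matches the quadratic shape of the true boundary.
clf = SVC(kernel='poly', degree=2, coef0=1.0, C=1.0)
clf.fit(X, y)
print("Training accuracy:", clf.score(X, y))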
Cost parameter: The C parameter adds a penalty for each misclassified data point. If C is
small, the penalty for misclassified points is low, so a decision boundary with a large
margin is chosen at the expense of a greater number of misclassifications. If C is large,
SVM tries to minimize the number of misclassified examples due to the high penalty, which
results in a decision boundary with a smaller margin. The penalty is not the same for all
misclassified examples; it is directly proportional to the distance to the decision
boundary. The sketch below illustrates the trade-off.
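A minimal sketch of the effect of C, assuming scikit-learn; the dataset and the C values
are illustrative, not recommendations.

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification

# Noisy 2-feature data (flip_y mislabels 10% of the points).
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, flip_y=0.1, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='rbf', C=C).fit(X, y)
    # Small C: wide margin, more training errors tolerated.
    # Large C: narrow margin, fewer training errors, risk of overfitting.
    print(f"C={C:>6}: train accuracy={clf.score(X, y):.2f}, "
          f"support vectors={len(clf.support_vectors_)}")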
Gamma: One of the most commonly used kernel functions is the radial basis function (RBF).
The gamma parameter of the RBF kernel controls the distance of influence of a single
training point. Low values of gamma indicate a large similarity radius, which results in
more points being grouped together. For high values of gamma, the points need to be very
close to each other to be considered in the same group (or class). Therefore, models with
very large gamma values tend to overfit, as the sketch below shows.
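A minimal sketch of how gamma changes an RBF SVM's behaviour, assuming scikit-learn; the
gamma values are illustrative. A very large gamma should fit the training set almost
perfectly while generalizing worse.

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for gamma in (0.01, 1.0, 100.0):
    clf = SVC(kernel='rbf', gamma=gamma, C=1.0).fit(X_tr, y_tr)
    # Low gamma: large similarity radius, smooth boundary (may underfit).
    # High gamma: influence limited to close neighbours (may overfit).
    print(f"gamma={gamma:>6}: train={clf.score(X_tr, y_tr):.2f}, "
          f"test={clf.score(X_te, y_te):.2f}")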
In the paper "A Practical Guide to Support Vector Classification", the authors recommend
searching for C and gamma with cross-validation over exponentially growing sequences (for
example, C = 2^-5, 2^-3, ..., 2^15 and gamma = 2^-15, 2^-13, ..., 2^3).
Advantages of SVM: