Chapter 07 SVM
Introduction
• A Support Vector Machine (SVM) is a supervised learning model widely used for
classification and regression tasks.
• Although it can be applied to regression tasks, it is best suited to classification.
• The objective of SVM is to identify the best decision boundary that separates the
data points in an n-dimensional feature space.
Introduction
• There can be multiple decision boundaries that separate the data points, but we
need to find the one that classifies them most reliably.
• This best boundary is known as the hyperplane of the SVM.
• Decision boundary:
– 2D feature space: straight line
– 3D feature space: plane
– N-D feature space: hyperplane
Hyperplane of SVM
How does it work?
Mathematical Intuition
• The equation of a straight line separating two classes in a 2D feature space is
y = mx + c.
• In SVM, this line is written in a more general form that works in any number of
dimensions:
f(x) = w·x + b = 0
– where w is the weight vector (the direction perpendicular to the hyperplane),
– x is the input vector (a data point),
– b is the bias term (intercept).
Classification Rule
• The SVM decides which side of the boundary each data point lies on from the sign
of f(x):
– if f(x) = w·x + b > 0, the point is assigned to class +1,
– if f(x) < 0, the point is assigned to class −1,
– points with f(x) = 0 lie exactly on the decision boundary.
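A minimal numeric sketch of this rule in Python; the weight vector w and bias b below are hypothetical values, not learned from data:

import numpy as np

# hypothetical weight vector and bias of a 2D linear boundary
w = np.array([2.0, -1.0])
b = -3.0

def classify(x):
    f = np.dot(w, x) + b        # f(x) = w·x + b
    return 1 if f > 0 else -1   # the sign of f(x) decides the class

print(classify(np.array([4.0, 1.0])))   # f = 8 - 1 - 3 = 4 > 0  -> +1
print(classify(np.array([0.0, 2.0])))   # f = 0 - 2 - 3 = -5 < 0 -> -1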
The Margin: Making Space Between the Classes
• The key to SVM is maximizing the margin — the distance between the
boundary line and the closest data points (called support vectors) in each class.
• The margin width M is defined as M = 2 / ||w||, the distance between the two
margin hyperplanes w·x + b = +1 and w·x + b = −1.
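A quick numeric illustration of the margin formula; the weight vector is a made-up example, not a trained one:

import numpy as np

w = np.array([3.0, 4.0])        # hypothetical weight vector, ||w|| = 5
M = 2.0 / np.linalg.norm(w)     # margin width M = 2 / ||w||
print(M)                        # 0.4: distance between w·x + b = +1 and w·x + b = -1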
Constraints
• For each data point (xi, yi), where yi is the label (+1 or −1), we want the point
to be correctly classified and to lie at least the margin distance away from the
boundary. This leads to the following constraint:
yi (w·xi + b) ≥ 1 for every data point i.
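A small Python check of this constraint for two toy points; the hyperplane and the points are hypothetical:

import numpy as np

w, b = np.array([1.0, 1.0]), -3.0
points = [(np.array([3.0, 2.0]), +1), (np.array([0.5, 0.5]), -1)]

for x_i, y_i in points:
    # y_i * (w·x_i + b) >= 1 means the point is on its correct side, outside the margin
    print(y_i * (np.dot(w, x_i) + b) >= 1)   # True for both points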
The Primal Optimization Problem
Now, we have two goals:
1. Maximize the margin (equivalently, minimize ||w||),
2. Satisfy the constraint for each data point.
• The primal form of the SVM optimization problem can be written as:
minimize (1/2) ||w||² over w and b, subject to yi (w·xi + b) ≥ 1 for all i.
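As an illustration, the primal problem can be handed to a generic convex solver. The sketch below uses cvxpy on a tiny hand-made linearly separable dataset; it only shows the structure of the problem, not a practical SVM implementation:

import cvxpy as cp
import numpy as np

# tiny hand-made linearly separable dataset (hypothetical)
X = np.array([[2.0, 2.0], [2.5, 3.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# minimize (1/2)||w||^2  subject to  y_i (w·x_i + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)   # maximum-margin separating hyperplane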
Soft Margin
• Achieved by introducing slack variables ξi, one for each data point.
• Each ξi measures how much that data point violates the margin.
• If ξi = 0, the point is correctly classified and outside the margin.
• If ξi > 0, the point is either inside the margin or misclassified.
• Optimization problem for soft margin SVM:
– The goal now becomes to maximize the margin while minimizing the total error from the
points that violate the margin.
– The optimization problem is formulated as:
minimize (1/2) ||w||² + C Σi ξi, subject to yi (w·xi + b) ≥ 1 − ξi and ξi ≥ 0 for all i.
– C is a parameter that controls the trade-off between maximizing the margin and allowing
some violations (large C: little tolerance of violations; small C: high tolerance of violations).
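A short scikit-learn sketch of the role of C; the dataset is synthetic and the exact support-vector counts will vary, but the trend is the point:

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# synthetic two-class data with some overlap between the classes
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # small C tolerates many margin violations (many support vectors);
    # large C penalizes violations heavily (typically fewer support vectors)
    print(C, clf.n_support_.sum())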
The Primal Optimization Problem
• The primal form is straightforward (it directly enforces that each data point ends
up on the correct side) but has limitations:
– It becomes complicated when there are many data points.
– It cannot be applied to data that is not linearly separable.
Linearly and Non-Linearly Separable Data
• Linearly separable: the data points can be classified using a single line/plane.
• Non-linearly separable: the data points cannot be classified using a single hyperplane.
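A small sketch contrasting the two cases with scikit-learn's synthetic datasets (blob data for the separable case, ring-shaped data for the non-separable one):

from sklearn.datasets import make_blobs, make_circles
from sklearn.svm import LinearSVC

X_lin, y_lin = make_blobs(n_samples=200, centers=2, random_state=0)
X_circ, y_circ = make_circles(n_samples=200, noise=0.05, factor=0.4, random_state=0)

# a single line separates the blobs almost perfectly ...
print(LinearSVC().fit(X_lin, y_lin).score(X_lin, y_lin))      # close to 1.0
# ... but no single line can separate the two rings
print(LinearSVC().fit(X_circ, y_circ).score(X_circ, y_circ))  # close to 0.5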
The Dual Formulation: Alternative Approach
• The dual approach focuses on relationships between data points rather than
computing the boundary directly through w and b.
• The relationship is established by assigning a Lagrange multiplier αi ≥ 0 to each
data point.
• The Lagrangian of the original problem (primal form) is:
L(w, b, α) = (1/2) ||w||² − Σi αi [ yi (w·xi + b) − 1 ]
The Dual Formulation: Alternative Approach
• Setting the derivatives of the Lagrangian with respect to w and b to zero (which
gives w = Σi αi yi xi and Σi αi yi = 0) and substituting back leads to the dual
problem, solved by maximizing over the αi:
maximize  Σi αi − (1/2) Σi Σj αi αj yi yj (xi·xj)
subject to  αi ≥ 0 and Σi αi yi = 0
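One way to see the dual in practice: scikit-learn's SVC solves this dual problem, and its dual_coef_ attribute stores αi·yi for the support vectors, so the constraint Σi αi yi = 0 can be checked numerically on synthetic toy data:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear", C=10.0).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors, so the dual
# constraint sum_i alpha_i y_i = 0 should hold up to numerical tolerance
print(np.isclose(clf.dual_coef_.sum(), 0.0, atol=1e-6))   # True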
The Dual Formulation: Alternative Approach
• If we keep only the support vectors, i.e. the points whose αi are non-zero, then
the decision function is:
f(x) = sign( Σi αi yi (xi·x) + b ), where the sum runs over the support vectors only.
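A sketch that rebuilds this decision function from a fitted scikit-learn SVC, using only its support vectors and dual coefficients (dual_coef_ stores αi·yi):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear").fit(X, y)

x_new = X[:1]
# f(x) = sum_i alpha_i y_i (x_i · x) + b, summing over support vectors only
f_manual = (clf.dual_coef_ @ (clf.support_vectors_ @ x_new.T)).item() + clf.intercept_[0]
print(np.isclose(f_manual, clf.decision_function(x_new)[0]))   # True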
Non-Linear SVM: Kernel Trick
• If the data isn’t linearly separable (like in a circular pattern), SVM uses the
kernel trick to map data into a higher-dimensional space where it can be
linearly separated. (Used for solving regression tasks too)
• The kernel trick uses a function K(xi, xj) that computes the dot product in this
high-dimensional space without constructing the space explicitly.
• The common kernels are:
– Linear: K(xi, xj) = xi·xj
– Polynomial: K(xi, xj) = (xi·xj + c)^d
– RBF (Gaussian): K(xi, xj) = exp(−γ ||xi − xj||²)
– Sigmoid: K(xi, xj) = tanh(κ xi·xj + c)
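A quick check that a kernel is just a function of two input points, using the RBF kernel as an example; the points and gamma are arbitrary:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x_i = np.array([[1.0, 2.0]])
x_j = np.array([[2.0, 0.5]])
gamma = 0.5

# RBF kernel: K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)
manual = np.exp(-gamma * np.sum((x_i - x_j) ** 2))
print(manual, rbf_kernel(x_i, x_j, gamma=gamma)[0, 0])   # the two values match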
Logistic Regression
• A statistical method used for binary classification.
• Logistic regression uses a logistic function (also called the sigmoid function) to
map the output to a range between 0 and 1.
• The sigmoid function that transforms the linear output z = w·x + b into a probability is:
σ(z) = 1 / (1 + e^(−z))
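A minimal sketch of the sigmoid in Python:

import numpy as np

def sigmoid(z):
    # maps the linear score z = w·x + b to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))    # 0.5: on the decision boundary
print(sigmoid(3.0))    # ~0.95: confidently class 1
print(sigmoid(-3.0))   # ~0.05: confidently class 0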