
Support Vector Machines

Dr. Sayak Roychowdhury


Department of Industrial and Systems Engineering
IIT Kharagpur
Reference Books
• James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning. New York: Springer.
• Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). New York: Springer.
Topics
• Maximal margin classifier (MMC)
• Support vector classifier (SVC), an extension of the MMC
• Support vector machine (SVM), which handles non-linear class boundaries by extending the SVC
Support Vector Machines
• The support vector machine (SVM) is a popular supervised learning algorithm, primarily used for classification problems.
• Developed at AT&T Bell Labs in the 1990s by Vladimir Vapnik and colleagues.
• SVM is a generalization of a simple and intuitive classifier called the maximal margin classifier.
• The goal of the SVM algorithm is to find the best line or decision boundary that segregates the n-dimensional feature space into classes, so that new data points can easily be placed in the correct category in the future.
• The SVM algorithm can be used for face detection, image classification, text categorization, etc. A minimal usage sketch is given below.
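As a quick illustration of the ideas developed in the following slides, here is a minimal sketch of fitting a linear SVM classifier with scikit-learn; the toy dataset and parameter values are assumptions for illustration only, not part of the slides.

```python
# Minimal sketch: fit a linear SVM classifier on a toy two-class dataset.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters of points (illustrative data).
X, y = make_blobs(n_samples=60, centers=2, random_state=0)

# Support vector classifier with a linear decision boundary.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# Classify a new observation and inspect the observations that define the boundary.
print(clf.predict(np.array([[0.0, 2.0]])))
print(clf.support_vectors_)
```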
Hyperplane
• A hyperplane splits the input variable space into two halves (e.g. class 0 or class 1).
• In a 𝑝-dimensional space, a hyperplane is a flat affine subspace of dimension 𝑝 − 1; e.g. in 2D, a hyperplane is a flat 1D space, i.e. a line.
• The mathematical definition in the 𝑝-dimensional setting:
𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + ⋯ + 𝛽𝑝 𝑋𝑝 = 0
• The best hyperplane is selected by the maximal margin classifier, i.e. the hyperplane with the maximum margin, meaning the maximum distance to the nearest training data points.
Separating Hyperplane
• Suppose there is an 𝑛 × 𝑝 data matrix of 𝑛 observations on 𝑝 features, with observations
𝑥1 = (𝑥11 , … , 𝑥1𝑝 )ᵀ , … , 𝑥𝑛 = (𝑥𝑛1 , … , 𝑥𝑛𝑝 )ᵀ
• Suppose there are two classes, 𝑦𝑖 ∈ {−1, 1}
• The separating hyperplane has the property that
𝛽0 + 𝛽1 𝑥𝑖1 + 𝛽2 𝑥𝑖2 + ⋯ + 𝛽𝑝 𝑥𝑖𝑝 > 0 𝑖𝑓 𝑦𝑖 = 1
𝛽0 + 𝛽1 𝑥𝑖1 + 𝛽2 𝑥𝑖2 + ⋯ + 𝛽𝑝 𝑥𝑖𝑝 < 0 𝑖𝑓 𝑦𝑖 = −1
Together, these two conditions can be written as
𝑦𝑖 (𝛽0 + 𝛽1 𝑥𝑖1 + 𝛽2 𝑥𝑖2 + ⋯ + 𝛽𝑝 𝑥𝑖𝑝 ) > 0
Many separating hyperplanes may exist; which one should we choose? (A numerical check of this condition is sketched below.)
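A small sketch of checking the separating-hyperplane condition numerically; the toy observations and the candidate coefficients 𝛽0, 𝛽1, 𝛽2 below are assumed values for illustration.

```python
# Check whether a candidate hyperplane separates a toy two-class dataset.
import numpy as np

# Toy observations (rows) with labels in {-1, +1} (illustrative values).
X = np.array([[2.0, 2.5], [1.0, 3.0], [3.0, 4.0],
              [-1.0, -2.0], [-2.0, -1.5], [-0.5, -3.0]])
y = np.array([1, 1, 1, -1, -1, -1])

# Candidate hyperplane beta0 + beta1*x1 + beta2*x2 = 0 (assumed coefficients).
beta0 = 0.0
beta = np.array([1.0, 1.0])

# y_i * (beta0 + beta . x_i) > 0 for every i means all points are on the correct side.
signed_values = y * (beta0 + X @ beta)
print(signed_values)
print("separating hyperplane:", bool(np.all(signed_values > 0)))
```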
Maximal Margin Classifier
• Maximal margin hyperplane is also known as optimal separating hyperplane, which is
the separating hyperplane that is farthest from the training observations (margin is
largest).
• Classes have to be linearly separable
• The smallest perpendicular distance (minimal distance) from the observations to the
hyperplane is known as the margin.
• The maximal margin hyperplane is the separating hyperplane for which the margin is
largest— it is the hyperplane that has the farthest minimum distance to the training
observations.
• The points that are closest to the hyperplane, and hence relevant in defining the boundary and constructing the classifier, are called support vectors.
• The maximal margin hyperplane depends directly on the support vectors, but not on the other observations.
Maximal Margin Classifier
Figure: two classes of observations, with the maximal margin hyperplane shown as a solid line. The margin is the distance from the solid line to either of the dashed lines. The two blue points and the purple point that lie on the dashed lines are the support vectors, and the distance from those points to the hyperplane is indicated by arrows. The purple and blue grid indicates the decision rule made by a classifier based on this separating hyperplane.
Construction of the Maximal Margin Classifier
• Maximal margin hyperplane is learned from training data using an optimization
procedure that maximizes the margin:
maximize 𝑀 over 𝛽0 , 𝛽1 , … , 𝛽𝑝
subject to Σ𝑗=1..𝑝 𝛽𝑗² = 1,
𝑦𝑖 (𝛽0 + 𝛽1 𝑥𝑖1 + 𝛽2 𝑥𝑖2 + ⋯ + 𝛽𝑝 𝑥𝑖𝑝 ) ≥ 𝑀  ∀ 𝑖 = 1, … , 𝑛
where 𝑛 = number of training observations and 𝑀 = margin.
• The last constraint ensures that each observation lies on the correct side of the hyperplane, at distance at least 𝑀 from it, provided that 𝑀 is positive.
• The maximal margin classifier is a natural way to perform classification if a separating hyperplane exists (an approximate hard-margin fit is sketched below).
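scikit-learn does not expose a pure hard-margin fit, but the maximal margin classifier can be approximated by a linear SVC with a very large violation penalty; the data and the penalty value below are illustrative assumptions.

```python
# Approximate maximal margin classifier: a linear SVC with a huge violation penalty,
# so that (almost) no margin violations are tolerated. Toy linearly separable data.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=40, centers=2, cluster_std=0.6, random_state=1)

mmc = SVC(kernel="linear", C=1e6)   # very large penalty ~ hard margin
mmc.fit(X, y)

beta = mmc.coef_[0]        # (beta_1, ..., beta_p)
beta0 = mmc.intercept_[0]  # beta_0
# Under scikit-learn's scaling, the distance from the hyperplane to the margin is 1/||beta||.
print("margin width:", 1.0 / np.linalg.norm(beta))
print("support vectors:\n", mmc.support_vectors_)
```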
Problems with MMC
• Requires linearly separable classes.
• Too much dependence on the points closest to the margin, so the classifier is sensitive to individual observations.

Figures: a linearly non-separable case, and a soft margin in which some of the points are allowed to be on the wrong side of the margin or hyperplane.
Support Vector Classifier
• The support vector classifier (SVC) is also known as the soft margin classifier.
• It allows some observations to be on the incorrect side of the margin, which gives:
1) greater robustness to individual observations, and
2) better classification of most of the training observations.
• The SVC is the solution to the following optimization problem (a usage sketch follows):
maximize 𝑀 over 𝛽0 , 𝛽1 , … , 𝛽𝑝 , 𝜖1 , … , 𝜖𝑛
subject to Σ𝑗=1..𝑝 𝛽𝑗² = 1,
𝑦𝑖 (𝛽0 + 𝛽1 𝑥𝑖1 + 𝛽2 𝑥𝑖2 + ⋯ + 𝛽𝑝 𝑥𝑖𝑝 ) ≥ 𝑀(1 − 𝜖𝑖 ),
𝜖𝑖 ≥ 0,  Σ𝑖=1..𝑛 𝜖𝑖 ≤ 𝐶
where 𝐶 = nonnegative tuning parameter, 𝑀 = width of the margin, and 𝜖1 , … , 𝜖𝑛 = slack variables.
• 𝐶 is a budget for the number and severity of violations to the margin that will be tolerated.
• Observations that lie directly on the margin, or on the wrong side of the margin for their class, are known as support vectors.
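A hedged sketch of fitting a soft-margin classifier with scikit-learn on overlapping classes. Note that scikit-learn's C parameter is a penalty on margin violations, so it plays roughly the inverse role of the budget 𝐶 in the formulation above (a small scikit-learn C corresponds to a large budget). Data and penalty values are illustrative.

```python
# Soft-margin support vector classifier on overlapping toy classes.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=2)

for penalty in (0.01, 1.0, 100.0):
    svc = SVC(kernel="linear", C=penalty).fit(X, y)
    # More tolerance of violations (smaller penalty) generally means more support vectors.
    print(f"penalty C={penalty:>6}: {len(svc.support_vectors_)} support vectors")
```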
Support Vector Classifier with Linear Kernel
Support Vector Classifier
• The constraint of maximizing the margin of the line that separates the classes must be relaxed. This is often called the soft margin classifier.
• An additional set of coefficients is introduced that gives the margin the flexibility to be violated by individual observations. These coefficients are sometimes called slack variables (𝜖1 , … , 𝜖𝑛 ).
• The tuning parameter 𝐶 controls the amount of violation of the margin that is allowed.
Value of 𝐶    | Physical significance                                                                                                    | Variance | Bias
Smaller value | Fewer support vectors; the margin is rarely violated                                                                     | High     | Low
Higher value  | Margin is wide, many observations violate the margin, hence many support vectors; many observations are involved in determining the hyperplane | Low      | High
Physical Significance of Parameters
• Role of the slack variable 𝜖𝑖 :
1) If 𝜖𝑖 = 0, the 𝑖th observation is on the correct side of the margin.
2) If 𝜖𝑖 > 0, the 𝑖th observation is on the wrong side of the margin, i.e. it has violated the margin.
3) If 𝜖𝑖 > 1, the observation is on the wrong side of the hyperplane.
• Role of the tuning parameter 𝐶:
1) If 𝐶 = 0, there is no budget for violations of the margin (the maximal margin classifier).
2) If 𝐶 > 0, no more than 𝐶 observations can be on the wrong side of the hyperplane, since each such observation has 𝜖𝑖 > 1 and Σ𝑖 𝜖𝑖 ≤ 𝐶.
• 𝐶 is generally chosen via cross-validation (a sketch is given below).
• Because the support vector classifier’s decision rule is based only on a potentially small subset of the training observations (the support vectors), it is quite robust to the behavior of observations that are far away from the hyperplane.
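A hedged sketch of choosing the penalty by cross-validation with scikit-learn; the grid of candidate values and the toy data are assumptions (and, as noted above, scikit-learn's C is a violation penalty rather than the budget 𝐶).

```python
# Choose the SVC penalty parameter by 5-fold cross-validation over an illustrative grid.
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_blobs(n_samples=120, centers=2, cluster_std=2.0, random_state=3)

grid = GridSearchCV(
    estimator=SVC(kernel="linear"),
    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
    cv=5,
)
grid.fit(X, y)
print("best C:", grid.best_params_["C"])
print("cross-validated accuracy:", round(grid.best_score_, 3))
```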
Non-linear Boundaries
Support Vector Machines
• In practice, the relationship between the predictors and the outcome is often non-linear; this can be addressed by enlarging the feature space using functions of the predictors.
• The support vector machine (SVM) is an extension of the support vector classifier that results from enlarging the feature space in a specific way, using kernels.
• The solution to the SVM problem involves only the inner products of the observations, rather than the observations themselves.
• The inner product of two observations 𝑥𝑖 , 𝑥𝑖′ is given by
⟨𝑥𝑖 , 𝑥𝑖′ ⟩ = Σ𝑗=1..𝑝 𝑥𝑖𝑗 𝑥𝑖′𝑗
• The linear support vector classifier can be represented as
𝑓(𝑥) = 𝛽0 + Σ𝑖∈𝑆 𝛼𝑖 ⟨𝑥, 𝑥𝑖 ⟩
where 𝑆 is the set of indices of the support vectors and the 𝛼𝑖 are parameters, one per training observation 𝑖 = 1, … , 𝑛.
Types of Kernel
• The inner product in the support vector classifier can be generalized to 𝐾(𝑥𝑖 , 𝑥𝑖′ ), where 𝐾 is a kernel function that quantifies the similarity of two observations.
• The linear kernel is written as
𝐾(𝑥𝑖 , 𝑥𝑖′ ) = Σ𝑗=1..𝑝 𝑥𝑖𝑗 𝑥𝑖′𝑗
• The polynomial kernel of degree 𝑑:
𝐾(𝑥𝑖 , 𝑥𝑖′ ) = (1 + Σ𝑗=1..𝑝 𝑥𝑖𝑗 𝑥𝑖′𝑗 )^𝑑
• When the support vector classifier is combined with a non-linear kernel, the resulting classifier is known as a support vector machine:
𝑓(𝑥) = 𝛽0 + Σ𝑖∈𝑆 𝛼𝑖 𝐾(𝑥, 𝑥𝑖 )
• The radial kernel:
𝐾(𝑥𝑖 , 𝑥𝑖′ ) = exp(−𝛾 Σ𝑗=1..𝑝 (𝑥𝑖𝑗 − 𝑥𝑖′𝑗 )²)
(Kernel computations are sketched below.)
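A hedged sketch that evaluates the polynomial and radial kernel formulas above with scikit-learn's pairwise kernel helpers, and fits an SVM with a radial kernel on a toy dataset with a non-linear class boundary; the values of 𝑑, 𝛾, and the data are illustrative assumptions.

```python
# Kernel functions, and an SVM with a radial kernel on non-linearly separable data.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.metrics.pairwise import polynomial_kernel, rbf_kernel
from sklearn.svm import SVC

xi = np.array([[1.0, 2.0]])
xj = np.array([[0.5, -1.0]])

# Polynomial kernel of degree d: (1 + <xi, xj>)^d  (gamma=1, coef0=1 match the formula above).
print(polynomial_kernel(xi, xj, degree=3, gamma=1.0, coef0=1.0))
# Radial kernel: exp(-gamma * ||xi - xj||^2).
print(rbf_kernel(xi, xj, gamma=0.5))

# Concentric classes, where no linear boundary works but a radial kernel does.
X, y = make_circles(n_samples=200, factor=0.4, noise=0.1, random_state=4)
svm = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)
print("training accuracy:", svm.score(X, y))
```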
SVM
Figures: SVM decision boundaries with a polynomial kernel and with a Gaussian (radial) kernel.
SVM with Radial Kernel
Figure: decision boundary obtained using an SVM with a radial kernel, compared with the Bayes decision boundary.
SVM with More than 2 Classes
• SVMs can be extended to the more general case with an arbitrary number of classes.
• Two common approaches are one-versus-one and one-versus-all.
• One-Versus-One Classification
• For 𝐾 classes, construct 𝐾(𝐾 − 1)/2 SVMs, one for each pair of classes.
• Classify a test observation using each of the 𝐾(𝐾 − 1)/2 classifiers, and tally the number of times that the test observation is assigned to each of the 𝐾 classes.
• The final classification is performed by assigning the test observation to the class to which it was most frequently assigned in these pairwise classifications (see the sketch below).
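A hedged one-versus-one sketch with scikit-learn; the iris dataset is used purely as a convenient three-class example and is not from the slides.

```python
# One-versus-one SVMs on a three-class example: 3*(3-1)/2 = 3 pairwise classifiers.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

ovo = OneVsOneClassifier(SVC(kernel="linear", C=1.0)).fit(X, y)
print("number of pairwise classifiers:", len(ovo.estimators_))
# The predicted class is the one that wins the most pairwise votes.
print("predicted class for the first observation:", ovo.predict(X[:1]))
```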
SVM with More than 2 Classes
• One-Versus-All Classification
• For 𝐾 classes, fit 𝐾 SVMs, each time comparing one of the 𝐾 classes to the remaining 𝐾 − 1 classes.
• Let 𝛽0𝑘 , 𝛽1𝑘 , … , 𝛽𝑝𝑘 denote the parameters that result from fitting an SVM comparing the 𝑘th class (coded as +1) to the others (coded as −1).
• A test observation 𝑥∗ is assigned to the class 𝑘 for which
𝛽0𝑘 + 𝛽1𝑘 𝑥1∗ + ⋯ + 𝛽𝑝𝑘 𝑥𝑝∗ is largest (see the sketch below).
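A hedged one-versus-all sketch with scikit-learn on the same illustrative three-class example.

```python
# One-versus-all (one-versus-rest) SVMs on a three-class example.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

ova = OneVsRestClassifier(SVC(kernel="linear", C=1.0)).fit(X, y)
# One fitted SVM per class; each compares that class (+1) to the rest (-1).
print("number of per-class classifiers:", len(ova.estimators_))
# The test observation is assigned to the class whose decision-function value is largest.
print("decision values:", ova.decision_function(X[:1]))
print("predicted class:", ova.predict(X[:1]))
```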
