Machine Learning
(Unit 2 - Part 2)
In statistical modeling, a common problem is how to estimate the joint probability distribution for a dataset.
What is EM Algorithm?
• The EM algorithm was proposed in 1977 by Arthur Dempster, Nan Laird, and Donald Rubin.
• It is used to find (local) maximum-likelihood parameters of a statistical model
when latent variables are present or the data is missing or incomplete.
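For concreteness, here is a minimal sketch of EM in action, assuming scikit-learn is available: GaussianMixture fits a mixture model by EM, where the unobserved cluster assignment plays the role of the latent variable. The synthetic data and the choice of two components are illustrative assumptions, not part of the original slides.

```python
# Minimal sketch: EM via scikit-learn's GaussianMixture (illustrative data and settings).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two latent clusters; which cluster produced each point is the hidden (latent) variable.
data = np.concatenate([rng.normal(-2.0, 0.5, 300),
                       rng.normal(3.0, 1.0, 300)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0)  # parameters estimated by EM
gmm.fit(data)

print("estimated means:", gmm.means_.ravel())
print("estimated mixing weights:", gmm.weights_)
```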
Applications of EM Algorithm
Data Clustering in Machine Learning and Computer Vision
Used in Natural Language Processing
Used in Parameter Estimation in Mixture Models and
Quantitative Genetics
Used in Psychometrics
Used in Medical Image Reconstruction, Structural Engineering
Support Vector Machine
Support Vector Machine (SVM)
❑ SVM is based on statistical learning theory.
❑ Support-vector machines are supervised learning models with associated learning
algorithms that analyze data for classification and regression analysis.
❑ SVM involves finding hyperplanes which segregate data into classes.
❑ Support vectors are the data points that lie closest to the decision surface (or
hyperplane).
❑ SVMs are very versatile and are also capable of performing linear or nonlinear
classification, regression, and outlier detection.
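As a quick, hedged illustration of the bullets above, the sketch below trains a linear SVM classifier with scikit-learn on a toy two-class dataset; the dataset and parameter values are arbitrary choices for demonstration only.

```python
# Sketch: a linear SVM classifier on a toy two-class dataset (illustrative only).
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=6)  # toy, separable data
clf = SVC(kernel="linear", C=1.0)  # hyperplane-based classifier; C controls margin softness
clf.fit(X, y)

print("training accuracy:", clf.score(X, y))
print("support vectors per class:", clf.n_support_)
```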
Two-Class Problem: Linearly Separable Case
❑ Linearly separable binary sets (figure: Class 1 points denoted +1, Class 2 points denoted −1).
❑ Many decision boundaries can separate these two classes.
Which one should we choose?
Classifier Margin
Define the margin of a
linear classifier as the width
that the boundary could be
increased by before hitting
a data point.
Good Decision Boundary: Margin Should Be Large
f(x, w, b) = sign(w · x − b)
The maximum margin linear classifier is the linear classifier
with the maximum margin.
This is the simplest kind of SVM (called a Linear SVM).
Support vectors are those data points that the margin pushes up against.
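A small sketch (assuming scikit-learn and made-up points) of the idea above: after fitting a linear SVM, only the support vectors, the points the margin pushes up against, are reported by the model.

```python
# Sketch: the support vectors of a linear SVM are the points the margin pushes up against.
import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable points.
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],   # class -1
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])  # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e3).fit(X, y)  # large C approximates the maximum-margin classifier
print("support vectors:\n", clf.support_vectors_)
print("w =", clf.coef_[0], " b =", clf.intercept_[0])  # scikit-learn's decision rule: sign(w.x + b)
```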
How Does it Work?
Identify the right hyper-plane (Scenario-1):
Rule of thumb to identify the right hyper-plane: “Select the hyper-plane which segregates the two classes better.”
In this scenario, hyper-plane “B” performs this job best.
How Does it Work?
Identify the right hyper-plane (Scenario-2):
Maximizing the distance between the nearest data point (of either class) and the hyper-plane helps us decide the right hyper-plane.
This distance is called the Margin.
How Does it Work?
Identify the right hyper-plane (Scenario-2):
The margin for hyper-plane C is higher than for both A and B.
Hence, we name the right hyper-plane C.
Another compelling reason for selecting the hyper-plane with the higher margin is robustness.
How Does it Work?
Identify the right hyper-plane (Scenario-3):
SVM selects the hyper-plane which classifies the classes accurately before maximizing the margin.
Here, hyper-plane B has a classification error, while A has classified all points correctly.
Therefore, the right hyper-plane is A.
How Does it Work?
Can we classify two classes (Scenario-4)?
The SVM algorithm has a feature to ignore outliers and find the hyper-plane that has the maximum margin.
Hence, we can say that SVM classification is robust to outliers.
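To make the robustness claim above concrete, the sketch below compares a small and a large C on made-up data containing one outlier; in scikit-learn a smaller C gives a softer margin that can effectively ignore the outlier. All values are illustrative assumptions.

```python
# Sketch: effect of the C parameter when one outlier is present (made-up data).
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [1, 2], [2, 1],       # class -1
              [5, 5], [5, 6], [6, 5],       # class +1
              [1.5, 1.5]])                  # outlier labelled +1
y = np.array([-1, -1, -1, 1, 1, 1, 1])

for C in (0.1, 1000.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: {len(clf.support_vectors_)} support vectors, "
          f"training accuracy = {clf.score(X, y):.2f}")
```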
How Does it Work?
Find the hyper-plane to segregate two classes (Scenario-5):
SVM solves this problem by introducing an additional feature. Here, we add a new feature z = x^2 + y^2 and plot the data points on the x and z axes.
All values of z are always positive, because z is the squared sum of x and y.
In the original plot, the red circles appear close to the origin of the x and y axes, leading to lower values of z, while the stars lie relatively far from the origin, resulting in higher values of z.
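The sketch below reproduces this idea on circular toy data (an assumption for illustration): adding the feature z = x^2 + y^2 lets a plain linear SVM separate classes that are not linearly separable in the original (x, y) plane.

```python
# Sketch: making circular data separable by adding the feature z = x^2 + y^2.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import LinearSVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

z = (X[:, 0] ** 2 + X[:, 1] ** 2).reshape(-1, 1)  # new feature z = x^2 + y^2
X_new = np.hstack([X, z])                         # original (x, y) plus z

print("linear SVM on (x, y):   ", LinearSVC(dual=False).fit(X, y).score(X, y))
print("linear SVM on (x, y, z):", LinearSVC(dual=False).fit(X_new, y).score(X_new, y))
```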
What is SVM?
Support vector machine, often called SVM, is a supervised learning algorithm.
It can be used for classification and regression problems, as support vector classification (SVC) and support vector regression (SVR).
It is typically used for smaller datasets, as training takes too long on large ones.
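Since the slide distinguishes SVC from SVR, here is a hedged sketch of support vector regression on a made-up 1-D problem (data and hyperparameters are illustrative assumptions).

```python
# Sketch: support vector regression (SVR) on a made-up 1-D problem.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.linspace(0, 5, 100).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 100)  # noisy sine curve

reg = SVR(kernel="rbf", C=10.0, epsilon=0.1)     # epsilon-insensitive regression tube
reg.fit(X, y)
print("R^2 on the training data:", reg.score(X, y))
```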
The idea behind SVM
SVM is based on the idea of finding a hyperplane
that best separates the features into different domains
Intuition development
Suppose a stalker is sending you emails, and you want to design a function (hyperplane) that clearly differentiates the two cases, so that whenever you receive an email from the stalker it is classified as spam. The figures below show two cases in which a hyperplane is drawn; which one would you pick, and why?
Terminologies used in SVM
The points closest to the hyperplane are called the support vector points, and the distances of these vectors from the hyperplane are called the margins.
The support vector points are critical in determining the hyperplane, because if their positions change, the hyperplane's position is altered.
Technically, this hyperplane can also be called the margin-maximizing hyperplane.
Hyperplane (Decision surface)
The hyperplane is a function used to differentiate between features.
In 2-D, the function used to classify between features is a line, whereas in 3-D it is a plane.
Similarly, the function which classifies points in higher dimensions is called a hyperplane.
Hyperplane (Decision surface)
Let’s say there are “m” dimensions;
thus the equation of the hyperplane in m dimensions can be given as:
wᵀx + b = w1·x1 + w2·x2 + … + wm·xm + b = 0
where,
wi = weight vector (w1, w2, w3, …, wm)
b = bias term (w0)
x = input variables.
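A tiny numeric sketch of the equation above, with a made-up weight vector, bias, and point: a point is classified by the sign of wᵀx + b.

```python
# Sketch: evaluating the hyperplane equation w^T x + b for made-up w, b, and x.
import numpy as np

w = np.array([2.0, -1.0, 0.5])   # assumed weight vector (w1, w2, w3)
b = -1.0                         # assumed bias term (w0)
x = np.array([1.0, 2.0, 4.0])    # an example point

score = np.dot(w, x) + b         # w^T x + b
label = 1 if score >= 0 else -1  # which side of the hyperplane the point falls on
print(score, label)
```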
Hard margin SVM
Assume three hyperplanes, namely (π, π+, π−), such that ‘π+’ is parallel to ‘π’ and passes through the support vectors on the positive side, and ‘π−’ is parallel to ‘π’ and passes through the support vectors on the negative side.
The equations of the hyperplanes can then be written (with the usual scaling) as:
π : wᵀx + b = 0
π+ : wᵀx + b = +1
π− : wᵀx + b = −1
Hard margin SVM
For each of the labelled points X1, X3, X4, and X6 (shown in the slide figure), we check the constraint yi(wᵀxi + b) ≥ 1, i.e. that the point lies on or beyond its margin hyperplane.
Hard margin SVM
Let’s look at the constraints when the points are not correctly classified:
We can see that the hyperplane is able to distinguish the points only if they are linearly separable; if any outlier is introduced, it is no longer able to separate them.
This type of SVM is called a hard margin SVM.
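Scikit-learn has no strict hard-margin SVM, but a very large C approximates one on separable data, which is enough to check the constraints above; the toy points below are made up for illustration.

```python
# Sketch: approximating a hard-margin SVM with a very large C on separable toy data.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [1, 0],     # negative class
              [4, 4], [4, 5], [5, 4]])    # positive class
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e9).fit(X, y)   # huge C -> (nearly) hard margin
w, b = clf.coef_[0], clf.intercept_[0]
margins = y * (X @ w + b)                     # y_i (w^T x_i + b); all should be >= 1
print("per-point margins:", np.round(margins, 3))
```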
Support Vector Kernels
❑ The linear classifier relies on an inner product between vectors
K(xi, xj) = xiᵀxj
❑ If every data point is mapped into high-dimensional space via some
transformation Φ: x → φ(x), the inner product becomes:
K(xi, xj) = φ(xi)ᵀφ(xj)
❑ A kernel function is some function that corresponds to an inner product in some
expanded feature space.
Why use kernels?
Make a non-separable problem separable.
Map data into a better representational space.
SVM Kernel Functions
❑ SVM algorithms use a set of mathematical functions that are defined as the kernel. The function of the kernel is to take data as input and transform it into the required form.
❑ Different SVM algorithms use different types of kernel functions, for example linear, nonlinear, polynomial, radial basis function (RBF), and sigmoid kernels.
❑ The kernel functions return the inner product between two points in a suitable feature space, thus defining a notion of similarity with little computational cost, even in very high-dimensional spaces.
Support Vector Kernel
Types of kernels:
Linear Kernel
Polynomial Kernel
Radial Basis Function Kernel (RBF) / Gaussian Kernel
Kernel Functions
❑ Linear Kernel: K(X, Y) = XᵀY + c
❑ Polynomial kernel: K(X, Y) = (γ·XᵀY + r)^d, γ > 0
❑ Radial basis function (RBF) Kernel: K(X, Y) = exp(−‖X − Y‖² / (2σ²)), which in simple form can be written as exp(−γ·‖X − Y‖²), γ > 0
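The formulas above can be evaluated directly; the sketch below computes each kernel with NumPy for two made-up vectors, using arbitrary example values for γ, r, d, and c.

```python
# Sketch: evaluating the linear, polynomial, and RBF kernels for two example vectors.
import numpy as np

X = np.array([1.0, 2.0, 3.0])
Y = np.array([0.5, -1.0, 2.0])
gamma, r, d, c = 0.5, 1.0, 3, 0.0                # arbitrary example hyperparameters

linear = X @ Y + c                               # K(X, Y) = X^T Y + c
poly = (gamma * (X @ Y) + r) ** d                # K(X, Y) = (gamma * X^T Y + r)^d
rbf = np.exp(-gamma * np.sum((X - Y) ** 2))      # K(X, Y) = exp(-gamma * ||X - Y||^2)

print(linear, poly, rbf)
```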
Data representation using kernels
Pros of SVM
It is really effective in higher dimensions.
Effective when the number of features is greater than the number of training examples.
One of the best algorithms when the classes are separable.
The hyperplane is affected only by the support vectors, so outliers have less impact.
SVM is suited to extreme-case binary classification.
Cons of SVM
For larger datasets, it requires a large amount of time to train.
Does not perform well when classes overlap.
Selecting appropriate hyperparameters of the SVM that allow sufficient generalization performance is difficult.
Selecting the appropriate kernel function can be tricky.
Applications
Definition
❑ Margin of Separation (d): the separation between the hyperplane and the closest
data point for a given weight vector w and bias b.
❑ Optimal Hyperplane (maximal margin): the particular hyperplane for which the
margin of separation d is maximized.
❑ Thus, this can be written as:
wᵀxi + b ≥ 0 for di = +1
wᵀxi + b < 0 for di = −1
(Figure: separating hyperplane H with margin hyperplanes H1 and H2 at distances d+ and d−.)
Contents
❑ Non-Linear SVM
▪ Non-Linear SVM: Feature Space
▪ Transformation to Feature Space
❑ SVM kernel functions
❑ Applications
Non-Linear SVM
❑ The idea is to gain linear separability by mapping the data to a higher-dimensional space.
❑ Datasets that are linearly separable (with some noise) work out great.
❑ But what are we going to do if the dataset is just too hard?
❑ How about … mapping the data to a higher-dimensional space?
(Figure: 1-D data on the x axis that is not separable becomes separable when mapped to the (x, x²) plane.)
Non-Linear SVM : Feature Space
❑ General idea: the original input space (x) can be mapped to some higher-dimensional feature space (φ(x)) where the training set is separable:
Φ: x = (x1, x2) → φ(x) = (x1², √2·x1x2, x2²)
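This particular mapping is exactly why the kernel trick works: with φ(x) = (x1², √2·x1x2, x2²), the inner product φ(x)ᵀφ(y) equals the degree-2 polynomial kernel (xᵀy)², so the mapping never has to be computed explicitly. The sketch below verifies this numerically for two made-up points.

```python
# Sketch: phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2) gives phi(x).phi(y) == (x.y)^2.
import numpy as np

def phi(v):
    """Explicit feature map for the degree-2 polynomial kernel (2-D input)."""
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

inner_feature_space = phi(x) @ phi(y)     # inner product after the mapping
kernel_value = (x @ y) ** 2               # polynomial kernel computed in the input space
print(inner_feature_space, kernel_value)  # both equal 1.0 here: (1*3 + 2*(-1))^2 = 1
```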
Transformation to Feature Space
❑ Possible problems of the transformation
❑ High computational burden due to high dimensionality, and it is hard to get a good estimate
❑ SVM solves these two issues simultaneously
❑ “Kernel tricks” for efficient computation
❑ Minimizing ||w||² can lead to a “good” classifier
(Figure: the mapping φ(·) takes points from the input space to the feature space.)
Key idea: transform xi to a higher-dimensional space.
How to calculate the distance from a point to a line?
The equation defining the decision surface separating the classes is a hyperplane of the form:
wᵀx + b = 0
where x is the input vector, w is the normal vector, and b is the bias.
What is the distance expression for a point x to the line wᵀx + b = 0? It is |wᵀx + b| / ‖w‖.
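A small numeric sketch of the distance question above, with made-up values for w, b, and x: the distance from a point to the hyperplane wᵀx + b = 0 is |wᵀx + b| / ‖w‖.

```python
# Sketch: distance from a point x to the hyperplane w^T x + b = 0 is |w^T x + b| / ||w||.
import numpy as np

w = np.array([3.0, 4.0])   # assumed normal vector
b = -5.0                   # assumed bias
x = np.array([2.0, 1.0])   # an example point

distance = abs(w @ x + b) / np.linalg.norm(w)
print(distance)            # |3*2 + 4*1 - 5| / 5 = 5/5 = 1.0
```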
Thank You