Supervised Learning – Classification
Support Vector Machines
Background
There are three approaches to building a classifier:
a) Model a classification rule directly
Examples: k-NN, decision trees, perceptron, SVM
b) Model the probability of class membership given the input data
Example: feedforward ANN (multi-layer perceptron)
c) Build a probabilistic model of the data within each class
Examples: naive Bayes, model-based classifiers
a) and b) are examples of discriminative classification
c) is an example of generative classification
b) and c) are both examples of probabilistic classification
Support Vector Machines - Overview
• Proposed by Vapnik and his colleagues
- Started in 1963, taking shape in the late 70s as part of his statistical learning theory (with Chervonenkis)
- Current form established in the early 90s (with Cortes)
• Became popular in the last decade
- Classification, regression (function approximation), optimization
• Basic ideas
- Maximize the margin of the decision boundary
- Overcome the linear separability problem by transforming the problem into a higher-dimensional space using kernel functions
Linear Classifiers
[Figure: input x → classifier f → estimated label y; data points labeled +1 and −1]
f(x, w, b) = sign(w · x + b)
How would you classify this data?
Linear Classifiers
f(x, w, b) = sign(w · x + b)
Any of these boundaries would be fine... but which is best?
Classifier Margin
f(x, w, b) = sign(w · x + b)
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
Maximum Margin
f(x, w, b) = sign(w · x + b)
The maximum margin linear classifier is the linear classifier with the maximum margin.
Support vectors are those datapoints that the margin pushes up against.
This is the simplest kind of SVM (called a linear SVM, or LSVM).
Why Maximum Margin?
Intuitively, the maximum margin boundary is the safest choice: it is determined only by the support vectors (the datapoints the margin pushes up against), so it is the least sensitive to small changes in the rest of the data.
Specifying a line and margin
[Figure: plus-plane, classifier boundary, minus-plane]
• How do we represent this mathematically?
• ... in m input dimensions?
Specifying a line and margin
Conditions for an optimal separating hyperplane for data points (x1, y1), ..., (xl, yl), where yi ∈ {+1, −1}:
1. w · xi + b ≥ +1 if yi = +1 (points in the plus class)
2. w · xi + b ≤ −1 if yi = −1 (points in the minus class)
Estimate the Margin
[Figure: separating hyperplane w · x + b = 0 with points labeled +1 and −1]
x – input vector
w – normal vector to the hyperplane
b – offset (bias) value
The distance from a point x to the hyperplane is d(x) = |w · x + b| / ||w||, where ||w|| = √(Σj wj²).
Maximize Margin
The margin is determined by the distance from the hyperplane w · x + b = 0 to the closest datapoints. Maximizing it gives

  argmax_{w,b}  min_{xi ∈ D}  |w · xi + b| / ||w||
  subject to  ∀ xi ∈ D:  yi (w · xi + b) ≥ 0

Since w · xi + b ≥ 1 iff yi = +1 and w · xi + b ≤ −1 iff yi = −1, i.e. yi (w · xi + b) ≥ 1 for every training point, this is equivalent to

  argmin_{w,b}  Σ_{j=1..d} wj²
  subject to  ∀ xi ∈ D:  yi (w · xi + b) ≥ 1
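As a concrete illustration, here is a minimal sketch of solving this constrained problem directly with SciPy's SLSQP solver on a made-up 2-D dataset (the data, variable names, and the choice of solver are illustrative assumptions, not from the slides):

```python
# Hard-margin linear SVM: minimize sum_j w_j^2 subject to y_i (w . x_i + b) >= 1.
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (illustrative, not from the slides)
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([+1, +1, -1, -1])

def objective(params):
    w = params[:-1]                      # last entry of params is the bias b
    return np.dot(w, w)                  # ||w||^2 (the 1/2 factor does not change the argmin)

constraints = [
    {"type": "ineq", "fun": lambda p, i=i: y[i] * (np.dot(p[:-1], X[i]) + p[-1]) - 1.0}
    for i in range(len(X))               # y_i (w . x_i + b) - 1 >= 0 for every training point
]

x0 = np.array([1.0, 1.0, -2.0])          # a feasible starting guess for (w1, w2, b)
res = minimize(objective, x0, method="SLSQP", constraints=constraints)
w, b = res.x[:-1], res.x[-1]
print("w =", w, "b =", b)                # margin width is 2 / ||w||
```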
Linear SVM
• Linear model:
    f(x) = +1 if w · x + b ≥ 1
           −1 if w · x + b ≤ −1
• Learning the model is equivalent to determining the values of w and b
• How do we find w and b from the training data?
  - Constrained optimization
  - Lagrangian method
SVM – Lagrangian Formulation
Minimize ||w||² / 2 subject to yi (w · xi + b) ≥ 1 for all training points. Introducing a Lagrange multiplier λi ≥ 0 for each constraint gives

  L = ||w||² / 2 − Σi λi [ yi (w · xi + b) − 1 ]

Setting the derivatives with respect to w and b to zero yields w = Σi λi yi xi and Σi λi yi = 0. Substituting back gives the dual problem: maximize Σi λi − ½ Σi Σj λi λj yi yj (xi · xj) subject to λi ≥ 0 and Σi λi yi = 0. Training points with λi > 0 are the support vectors.
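A hedged sketch of solving this dual numerically with SciPy (this is not the solver used to produce the slides; the data are the eight tuples from the example on the next slide, and the recovered multipliers should come out close to the λ column shown there):

```python
# SVM dual: maximize sum(l) - 1/2 * sum_ij l_i l_j y_i y_j (x_i . x_j)
#           subject to l_i >= 0 and sum_i l_i y_i = 0.
import numpy as np
from scipy.optimize import minimize

X = np.array([[0.3858, 0.4687], [0.4871, 0.6110], [0.9218, 0.4103], [0.7382, 0.8936],
              [0.1763, 0.0579], [0.4057, 0.3529], [0.9355, 0.8132], [0.2146, 0.0099]])
y = np.array([1, -1, -1, -1, 1, 1, -1, 1], dtype=float)

K = (X @ X.T) * np.outer(y, y)            # y_i y_j (x_i . x_j)

def neg_dual(lam):                        # minimize the negative dual objective
    return 0.5 * lam @ K @ lam - lam.sum()

res = minimize(neg_dual, x0=np.zeros(len(X)), method="SLSQP",
               bounds=[(0, None)] * len(X),
               constraints=[{"type": "eq", "fun": lambda lam: lam @ y}])
print(np.round(res.x, 4))                 # two multipliers near 65.5 (the support vectors), the rest near 0
```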
Example of Linear SVM
x1       x2       y    λ
0.3858   0.4687   1    65.5261
0.4871   0.611    -1   65.5261
0.9218   0.4103   -1   0
0.7382   0.8936   -1   0
0.1763   0.0579   1    0
0.4057   0.3529   1    0
0.9355   0.8132   -1   0
0.2146   0.0099   1    0
• Only the first two tuples (those with λ > 0) are support vectors in this case
Learning Linear SVM
• Let w = (w1, w2) and b denote the parameters to be determined. We can solve for w and b from the support vectors and their multipliers: w = Σi λi yi xi, then b = yk − w · xk for any support vector xk (see the sketch below).
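A worked sketch of that computation in NumPy, using the two support vectors and λ values from the table above (the test record at the end is made up for illustration):

```python
# Recovering w and b from the support vectors and their Lagrange multipliers.
import numpy as np

# Support vectors from the example table (the tuples with lambda > 0)
sv_x   = np.array([[0.3858, 0.4687], [0.4871, 0.6110]])
sv_y   = np.array([1.0, -1.0])
sv_lam = np.array([65.5261, 65.5261])

w = (sv_lam * sv_y) @ sv_x               # w = sum_i lambda_i y_i x_i
b = sv_y[0] - w @ sv_x[0]                # from y_k (w . x_k + b) = 1 at a support vector
print("w =", np.round(w, 2), "b =", round(b, 2))   # roughly w = [-6.64, -9.32], b = 7.93

# Classify a test record with f(x) = sign(w . x + b)
x_test = np.array([0.2, 0.1])            # illustrative test point
print("prediction:", int(np.sign(w @ x_test + b)))  # +1 for this point
```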
Learning Linear SVM
• The decision boundary depends only on the support vectors
• If you have a data set with the same support vectors, the decision boundary will not change
• How to classify using an SVM once w and b are found? Given a test record xi:
    f(xi) = sign(w · xi + b), i.e. +1 if w · xi + b ≥ 0 and −1 otherwise
Support Vector Machines
• What if the problem is not linearly separable?
• Introduce slack variables ξi ≥ 0
• Need to minimize:
    L(w) = ||w||² / 2 + C Σ_{i=1..N} ξi^k
• Subject to:
    yi = +1 if w · xi + b ≥ 1 − ξi
    yi = −1 if w · xi + b ≤ −1 + ξi
• If k is 1 or 2, this leads to a similar objective function as the linear SVM, but with different constraints
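As an illustration of the role of C, here is a small sketch using scikit-learn's SVC, which solves essentially this soft-margin formulation with k = 1; the toy data and the particular C values are made-up assumptions:

```python
# Soft-margin linear SVM: a small C tolerates more slack (wider margin, more violations),
# a large C penalizes violations heavily (narrower margin, fewer violations).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[0, 0], scale=0.8, size=(20, 2)),
               rng.normal(loc=[2, 2], scale=0.8, size=(20, 2))])   # slightly overlapping classes
y = np.array([-1] * 20 + [+1] * 20)

for C in (0.1, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: #support vectors = {len(clf.support_)}, "
          f"w = {np.round(clf.coef_[0], 2)}, b = {round(float(clf.intercept_[0]), 2)}")
```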
Support Vector Machines
[Figure: two candidate hyperplanes B1 and B2 with their margins (b11–b12 and b21–b22)]
• Find the hyperplane that optimizes both factors: a wide margin and few margin violations
Nonlinear Support Vector Machines
• What if the decision boundary is not linear?
Nonlinear Support Vector Machines
• Transform the data into a higher-dimensional space via a mapping Φ
• Decision boundary: w · Φ(x) + b = 0
Example of Nonlinear SVM
[Figure: decision boundary of an SVM with a polynomial degree-2 kernel]
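One way to see what a polynomial degree-2 kernel does: the kernel value equals a dot product under an explicit quadratic feature map Φ, so the SVM never has to compute Φ. A small sketch (the feature map shown is one standard choice for 2-D inputs, used here purely for illustration):

```python
# The degree-2 polynomial kernel K(x, z) = (x . z + 1)^2 equals the dot product
# of an explicit quadratic feature map Phi(x), so the mapping never has to be computed.
import numpy as np

def phi(x):
    # Explicit feature map matching (x . z + 1)^2 for 2-D inputs
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     1.0])

def poly_kernel(x, z):
    return (np.dot(x, z) + 1.0) ** 2

x = np.array([0.3, 0.7])
z = np.array([0.9, 0.4])
print(np.dot(phi(x), phi(z)))   # same value...
print(poly_kernel(x, z))        # ...computed without leaving the original 2-D space
```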
Learning Nonlinear SVM
• Advantage of using a kernel:
  - Computing the dot product Φ(xi) · Φ(xj) via the kernel in the original space avoids the curse of dimensionality
• Not all functions can be kernels:
  - Must make sure there is a corresponding Φ in some high-dimensional space
  - Mercer's theorem gives the condition (see the sketch below)
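A small numeric illustration of Mercer's condition (assuming NumPy): a valid kernel's Gram matrix on any finite sample must be positive semidefinite, whereas a deliberately invalid "kernel" fails the check. Both functions below are illustrative choices, not from the slides:

```python
# Mercer's condition in practice: a valid kernel's Gram matrix on any finite sample
# must be symmetric positive semidefinite (all eigenvalues >= 0, up to rounding).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2))

def gram(kernel, X):
    return np.array([[kernel(a, b) for b in X] for a in X])

rbf     = lambda a, b: np.exp(-np.sum((a - b) ** 2))   # a valid kernel
not_psd = lambda a, b: -np.sum((a - b) ** 2)           # not a valid kernel in general

for name, k in [("rbf", rbf), ("negative squared distance", not_psd)]:
    eigvals = np.linalg.eigvalsh(gram(k, X))
    print(f"{name}: min eigenvalue = {eigvals.min():.6f}")  # negative => fails Mercer's condition
```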
Characteristics of SVM
• The learning problem is formulated as a convex optimization problem
• Efficient algorithms are available to find the global minimum
• Many of the other methods use greedy approaches and find locally
optimal solutions
• High computational complexity for building the model
• Robust to noise
• Overfitting is handled by maximizing the margin of the decision boundary
• SVM can handle irrelevant and redundant attributes better than many
other techniques
• The user needs to provide the type of kernel function and cost function
• Difficult to handle missing values
References
• An excellent tutorial on VC-dimension and support vector machines:
  C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998.
  http://citeseer.nj.nec.com/burges98tutorial.html
• The VC/SRM/SVM bible:
  Vladimir Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.
• Download SVM-light:
  http://svmlight.joachims.org/
Some other issues in SVM
• SVM works only in a real-valued space. For a categorical attribute, we need to convert its categorical values to numeric values.
• SVM natively performs only two-class classification. For multi-class problems, strategies such as one-against-rest or error-correcting output coding can be applied (see the sketch below).
• The hyperplane produced by SVM is hard for human users to interpret, and kernels make this worse. SVM is therefore commonly used in applications that do not require human understanding of the model.
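A hedged sketch touching the first two points above: converting a categorical attribute to numeric one-hot columns, then training a one-against-rest multi-class SVM with scikit-learn. The tiny dataset, attribute names, and class labels are made up for illustration:

```python
# One-against-rest multi-class SVM on data with a categorical attribute:
# the categorical values are first converted to numeric (one-hot) columns.
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

colors = ["red", "green", "blue", "red", "blue", "green"]   # categorical attribute
sizes  = [1.0, 2.5, 0.5, 1.2, 0.4, 2.8]                     # numeric attribute
labels = np.array([0, 1, 2, 0, 2, 1])                       # three classes

categories = sorted(set(colors))
onehot = np.array([[1.0 if c == cat else 0.0 for cat in categories] for c in colors])
X = np.column_stack([onehot, sizes])

clf = OneVsRestClassifier(SVC(kernel="linear")).fit(X, labels)   # one binary SVM per class
print(clf.predict(X))
```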