Support Vector Machines
Martin Law
Lecture for CSE 802
Department of Computer Science and Engineering
Michigan State University
Outline
A brief history of SVM
Large-margin linear classifier
Linearly separable case
Non-linearly separable case
History of SVM
SVM is related to statistical learning theory [3]
SVM was first introduced in 1992 [1]
SVM became popular because of its success in handwritten digit recognition
1.1% test error rate for SVM, the same as the error rate of a carefully constructed neural network, LeNet 4 [2]
[1] B. E. Boser et al. A Training Algorithm for Optimal Margin Classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 144-152, Pittsburgh, 1992.
[2] L. Bottou et al. Comparison of Classifier Methods: A Case Study in Handwritten Digit Recognition. Proceedings of the 12th IAPR International Conference on Pattern Recognition, vol. 2, pp. 77-82.
[3] V. Vapnik. The Nature of Statistical Learning Theory. 2nd edition, Springer, 1999.
[Figure: two-class data sets (Class 1 vs. Class 2) illustrating candidate linear decision boundaries]
The function $L(\mathbf{x}, \boldsymbol{\alpha}) = f(\mathbf{x}) + \sum_i \alpha_i g_i(\mathbf{x})$ is also known as the Lagrangian; we want to set its gradient to 0
The Lagrangian is $L = \frac{1}{2}\mathbf{w}^\top\mathbf{w} + \sum_{i} \alpha_i \left(1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b)\right)$ with $\alpha_i \ge 0$; setting the gradient of $L$ w.r.t. $\mathbf{w}$ and $b$ to zero gives $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$ and $\sum_i \alpha_i y_i = 0$
If we substitute $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$ into the Lagrangian, we have $W(\boldsymbol{\alpha}) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top \mathbf{x}_j$
Note that $\sum_i \alpha_i y_i = 0$; this is a function of the $\alpha_i$ only
$\mathbf{w}$ can be recovered by $\mathbf{w} = \sum_{j \in SV} \alpha_j y_j \mathbf{x}_j$, where $SV$ is the set of support vectors (the points with $\alpha_j > 0$)
For a new test point $\mathbf{z}$, compute $\mathbf{w}^\top \mathbf{z} + b = \sum_{j \in SV} \alpha_j y_j (\mathbf{x}_j^\top \mathbf{z}) + b$ and
classify $\mathbf{z}$ as class 1 if the sum is positive, and class 2 otherwise
Note: $\mathbf{w}$ need not be formed explicitly
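A minimal sketch of this decision rule (my own illustration, not from the slides); alphas, ys and xs are assumed to hold the nonzero multipliers, labels and training points of the fitted SVM:

```python
import numpy as np

def classify(z, alphas, ys, xs, b):
    """Classify a new point z using only the support vectors.

    alphas, ys, xs hold the nonzero multipliers, the labels (+1 / -1) and the
    corresponding training points; w itself is never formed explicitly.
    """
    s = sum(a * y * np.dot(x, z) for a, y, x in zip(alphas, ys, xs)) + b
    return 1 if s > 0 else 2   # class 1 if the sum is positive, class 2 otherwise
```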
In practice, we can just regard the QP solver as a black box without worrying about how it works
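For illustration only, here is one way the soft-margin dual could be handed to an off-the-shelf QP solver. This sketch assumes the cvxopt package; the function name svm_dual_qp and the tiny ridge term are my own choices, not part of the slides:

```python
import numpy as np
from cvxopt import matrix, solvers   # cvxopt is assumed; any QP package would do

def svm_dual_qp(X, y, C=100.0):
    """Set up and solve the soft-margin SVM dual with a generic QP solver.

    maximize   sum_i alpha_i - 0.5 * alpha^T Q alpha,  Q_ij = y_i y_j <x_i, x_j>
    subject to 0 <= alpha_i <= C  and  sum_i alpha_i y_i = 0
    X is an (n, d) array of training points, y an array of +1/-1 labels.
    """
    n = X.shape[0]
    Z = y[:, None] * X                                   # row i is y_i * x_i
    Q = Z @ Z.T                                          # Q_ij = y_i y_j <x_i, x_j>
    P = matrix(Q + 1e-10 * np.eye(n))                    # tiny ridge for numerical stability
    q = matrix(-np.ones((n, 1)))                         # QP minimizes, so negate sum(alpha)
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))       # encodes 0 <= alpha <= C
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]).reshape(-1, 1))
    A = matrix(y.reshape(1, -1).astype(float))           # sum_i y_i alpha_i = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)                   # the "black box"
    return np.array(sol["x"]).ravel()                    # the alpha_i
```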
A Geometrical Interpretation
[Figure: two-class data with the maximum-margin boundary; the support vectors have nonzero multipliers (α1=0.8, α6=1.4, α8=0.6) while all other points have αi=0]
[Figure: two-class data (Class 1 vs. Class 2)]
We want to minimize $\frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i} \xi_i$ subject to $y_i(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$
$\mathbf{w}$ is recovered as $\mathbf{w} = \sum_{j \in SV} \alpha_j y_j \mathbf{x}_j$
Why transform?
A linear operation in the feature space is equivalent to a nonlinear operation in the input space
Classification can become easier with a proper transformation. In the XOR problem, for example, adding a new feature x1x2 makes the problem linearly separable (see the sketch below)
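An illustrative sketch (not from the slides): the four XOR points become linearly separable once the product feature x1*x2 is appended. The separating hyperplane coefficients below are just one valid choice:

```python
import numpy as np

# XOR problem: no line in (x1, x2) separates the two classes
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([1, -1, -1, 1])          # class 1 vs. class 2

# add the new feature x1*x2
X_new = np.hstack([X, (X[:, 0] * X[:, 1])[:, None]])

# in the 3-D feature space the hyperplane -2*x1 - 2*x2 + 4*x1x2 + 1 = 0
# separates the classes: the sign of the score matches y for every point
w, b = np.array([-2.0, -2.0, 4.0]), 1.0
print(np.sign(X_new @ w + b))          # -> [ 1. -1. -1.  1.]
```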
[Figure: a transformation φ(·) maps data points from the input space to the feature space]
Note: the feature space is of higher dimension than the input space in practice
Kernel Functions
In practical use of SVM, the user specifies the kernel function; the transformation φ(·) is not explicitly stated
Given a kernel function K(xi, xj), the transformation φ(·) is given by its eigenfunctions (a concept in functional analysis)
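As a quick numerical check (a sketch of mine, not from the slides), the polynomial kernel K(x, y) = (xy + 1)² used in the later example corresponds, for 1-D inputs, to the explicit map φ(x) = (x², √2·x, 1):

```python
import numpy as np

def K(x, y):
    """Polynomial kernel of degree 2 on scalars."""
    return (x * y + 1) ** 2

def phi(x):
    """An explicit feature map satisfying phi(x).phi(y) == K(x, y)."""
    return np.array([x ** 2, np.sqrt(2) * x, 1.0])

x, y = 2.0, 5.0
print(K(x, y), phi(x) @ phi(y))   # both print 121.0
```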
With a kernel function, every inner product is replaced by the kernel:
For training, the original dual objective $W(\boldsymbol{\alpha}) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j \mathbf{x}_i^\top\mathbf{x}_j$ becomes, with kernel function, $W(\boldsymbol{\alpha}) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$
For testing, the original discriminant $\sum_{j \in SV}\alpha_j y_j \mathbf{x}_j^\top\mathbf{z} + b$ becomes, with kernel function, $\sum_{j \in SV}\alpha_j y_j K(\mathbf{x}_j, \mathbf{z}) + b$
Example
Suppose we have 5 1-D data points
A polynomial kernel K(x, y) = (xy + 1)² is used
C is set to 100
Example
By using a QP solver, we get the values of the multipliers αi; the points with nonzero αi are the support vectors (a sketch of this step follows)
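A minimal sketch of this step with scikit-learn as the black-box solver. The five points and labels below are made up for illustration, since the slide's actual values are not reproduced here; kernel='poly', degree=2, gamma=1, coef0=1 gives K(x, y) = (xy + 1)², and C=100 matches the slide:

```python
import numpy as np
from sklearn.svm import SVC

# hypothetical 1-D training set (the slide's own five points are not shown here)
X = np.array([[1.0], [2.0], [4.0], [5.0], [6.0]])
y = np.array([1, 1, -1, -1, 1])

# (xy + 1)^2 kernel with C = 100, solved by the library's QP machinery
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=100.0)
clf.fit(X, y)

print(clf.support_)      # indices of the support vectors
print(clf.dual_coef_)    # y_i * alpha_i for each support vector
print(clf.intercept_)    # the bias b
```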
Example
[Figure: value of the discriminant function along the 1-D input; the axis is partitioned into regions classified as class 1, class 2, and class 1]
VC-dimension
A linear classifier in the plane can shatter any three points not on a line: whatever the labeling, a separating line exists
However, if we have four points, we can find a labeling such that the linear classifier fails to be perfect, so the VC-dimension of a linear classifier in 2-D is 3
VC-dimension
The VC-dimension of the nearest-neighbor classifier is infinity, because no matter how many points you have, you get perfect classification on the training data
The higher the VC-dimension, the more flexible a classifier is
VC-dimension, however, is a theoretical concept; the VC-dimension of most classifiers is difficult to compute exactly in practice
[Figure: increasing error rate; training error and the confidence interval (CI) of the test error shown for classifier 1 and classifier 2]
Justification of SVM
Large-margin classifier
Structural risk minimization (SRM)
Ridge regression: the term ||w||² shrinks the parameters towards zero to avoid overfitting
The term ||w||² can also be viewed as imposing a weight-decay prior on the weight vector, and we find the MAP estimate (a worked version of this view is sketched below)
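A short sketch of that MAP view (my own rendering, not a slide from the deck): with a zero-mean Gaussian prior on w, maximizing the posterior is the same as minimizing a data-fit loss plus a ||w||² penalty.

```latex
% MAP estimate with a Gaussian (weight-decay) prior
\begin{aligned}
\hat{\mathbf{w}}_{\mathrm{MAP}}
  &= \arg\max_{\mathbf{w}} \; p(\mathbf{w}\mid \mathcal{D})
   = \arg\max_{\mathbf{w}} \; p(\mathcal{D}\mid \mathbf{w})\, p(\mathbf{w}),
   \qquad p(\mathbf{w}) = \mathcal{N}(\mathbf{w};\, \mathbf{0},\, \sigma^{2} I) \\
  &= \arg\min_{\mathbf{w}} \; \underbrace{-\log p(\mathcal{D}\mid \mathbf{w})}_{\text{data-fit (loss)}}
     \;+\; \frac{1}{2\sigma^{2}}\,\|\mathbf{w}\|^{2}
\end{aligned}
```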
Software
A list of SVM implementations can be found at http://www.kernel-machines.org/software.html
Some implementations (such as LIBSVM) can handle multi-class classification
SVMlight is among the earliest implementations of SVM
Several Matlab toolboxes for SVM are also available
Weaknesses
Conclusion
SVM is a useful alternative to neural networks
Two key concepts of SVM: maximizing the margin and the kernel trick
Many SVM implementations are available on the web for you to try on your data set!
Resources
http://www.kernel-machines.org/
http://www.support-vector.net/
http://www.support-vector.net/icml-tutorial.pdf
http://www.kernel-machines.org/papers/tutorialnips.ps.gz
http://www.clopinet.com/isabelle/Projects/SVM/applist.html
Demonstration
Iris data set
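The original demo is not reproduced here; a minimal stand-in using scikit-learn's bundled copy of the Iris data set might look like this:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# load the Iris data set (150 samples, 4 features, 3 classes)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# an RBF-kernel SVM; multi-class handling is done internally (one-vs-one)
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```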
Multi-class Classification
SVM is basically a two-class classifier
One can change the QP formulation to allow multi-class classification
More commonly, the data set is divided into two parts intelligently in different ways and a separate SVM is trained for each way of division (a sketch of this strategy follows the list below)
Multi-class classification is done by combining the outputs of all the SVM classifiers
Majority rule
Error correcting code
Directed acyclic graph
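An illustrative sketch of the "divide and combine" strategy, using one-vs-one pairwise SVMs combined by majority rule; the helper below is my own, not from the slides:

```python
from itertools import combinations
import numpy as np
from sklearn.svm import SVC

def one_vs_one_fit_predict(X, y, X_test, **svm_params):
    """Train one binary SVM per pair of classes, then combine by majority rule."""
    classes = np.unique(y)
    votes = np.zeros((len(X_test), len(classes)), dtype=int)
    for i, j in combinations(range(len(classes)), 2):
        mask = np.isin(y, [classes[i], classes[j]])        # the two-class subset
        clf = SVC(kernel="rbf", **svm_params).fit(X[mask], y[mask])
        for k, c in enumerate(clf.predict(X_test)):        # each pairwise SVM casts one vote
            votes[k, i if c == classes[i] else j] += 1
    return classes[np.argmax(votes, axis=1)]               # class with the most votes wins
```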
[Figure: two plots of the penalty as a function of the value off target]