
Support Vector Machines: Kernels

CS4780/5780 – Machine Learning
Fall 2011

Thorsten Joachims
Cornell University

Reading: Schoelkopf/Smola, Chapters 7.4, 7.6, 7.8;
Cristianini/Shawe-Taylor, Sections 3.1, 3.2, 3.3.2, 3.4
Outline

• Transform a linear learner into a non-linear learner
• Kernels can make high-dimensional spaces tractable
• Kernels can make non-vectorial data tractable
Non-Linear Problems

Problem:
• some tasks have non-linear structure
• no hyperplane is sufficiently accurate
How can SVMs learn non-linear classification rules?
Extending the Hypothesis Space
Idea: add more features
⇒ Learn a linear rule in the feature space.
Example:
⇒ The separating hyperplane in feature space is a degree-two polynomial in input space.
Example

• Input Space: x = (x1, x2) (2 attributes)
• Feature Space: Φ(x) (6 attributes)
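The explicit feature map is not reproduced in this text; one standard degree-two map with six attributes, assumed here for illustration, is the following, and its inner product equals the polynomial kernel (x · z + 1)²:

\[
\Phi(x) = \big(x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2,\; \sqrt{2}\,x_1,\; \sqrt{2}\,x_2,\; 1\big),
\qquad
\Phi(x)\cdot\Phi(z) = (x \cdot z + 1)^2 .
\]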
Dual SVM Optimization Problem
• Primal Optimization Problem
• Dual Optimization Problem
• Theorem: If w* is the solution of the Primal and α* is the solution of the Dual, then w* = Σi αi* yi xi.
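The optimization problems themselves are not reproduced in this text; the standard soft-margin formulation (assumed here) is:

\[
\text{Primal:}\quad \min_{w,\,b,\,\xi}\;\; \tfrac{1}{2}\,w^{\top}w + C\sum_{i=1}^{n}\xi_i
\quad \text{s.t.}\quad y_i\,(w^{\top}x_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0
\]
\[
\text{Dual:}\quad \max_{\alpha}\;\; \sum_{i=1}^{n}\alpha_i - \tfrac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j\,y_i y_j\,(x_i^{\top}x_j)
\quad \text{s.t.}\quad \sum_{i=1}^{n}\alpha_i y_i = 0,\;\; 0 \le \alpha_i \le C .
\]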
Kernels

Problem: Very many parameters! Polynomials of degree p over N attributes in input space lead to O(N^p) attributes in feature space!
Solution [Boser et al.]: The dual OP depends only on inner products ⇒ Kernel Functions
Example: A kernel function K(x, z) = Φ(x) · Φ(z) that can be evaluated directly in input space computes the inner product in feature space
⇒ no need to represent the feature space explicitly.
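As a concrete numeric check (an illustration, not from the slides), using the degree-two feature map assumed above: evaluating the kernel in input space gives the same value as the explicit feature-space inner product.

    import numpy as np

    def phi(x):
        # Assumed degree-two feature map: 2 input attributes -> 6 feature attributes.
        x1, x2 = x
        return np.array([x1**2, x2**2,
                         np.sqrt(2) * x1 * x2,
                         np.sqrt(2) * x1, np.sqrt(2) * x2, 1.0])

    def poly_kernel(x, z):
        # Polynomial kernel of degree 2, evaluated directly in input space.
        return (np.dot(x, z) + 1.0) ** 2

    x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
    print(np.dot(phi(x), phi(z)))   # 4.0
    print(poly_kernel(x, z))        # 4.0 -- same value, no explicit Phi needed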
SVM with Kernel
Training:

Classification:

New hypothesis spaces through new Kernels:
• Linear:
• Polynomial:
• Radial Basis Function:
• Sigmoid:
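The formulas on this slide are not reproduced in the text. The standard kernelized forms, assumed here, are: training solves the dual with the inner product xi · xj replaced by K(xi, xj), and classification expands w through the kernel. Commonly used parameterizations of the listed kernels follow.

\[
\text{Training:}\quad \max_{\alpha}\; \sum_{i=1}^{n}\alpha_i - \tfrac{1}{2}\sum_{i,j}\alpha_i\alpha_j\,y_i y_j\,K(x_i, x_j)
\quad \text{s.t.}\quad \sum_i \alpha_i y_i = 0,\; 0 \le \alpha_i \le C
\]
\[
\text{Classification:}\quad h(x) = \mathrm{sign}\Big(\sum_{i=1}^{n}\alpha_i^{*}\,y_i\,K(x_i, x) + b^{*}\Big)
\]
\[
K_{\text{lin}}(x,z) = x \cdot z,\quad
K_{\text{poly}}(x,z) = (x \cdot z + 1)^d,\quad
K_{\text{rbf}}(x,z) = \exp\!\big(-\gamma \|x - z\|^2\big),\quad
K_{\text{sig}}(x,z) = \tanh(\gamma\, x \cdot z + c)
\]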
Examples of Kernels

[Figures: decision boundaries learned with a Polynomial kernel and with a Radial Basis Function kernel]
What is a Valid Kernel?

Definition: Let X be a nonempty set. A function K: X × X → ℝ is a valid kernel on X if for all n and all x1, …, xn ∈ X it produces a Gram matrix
Gij = K(xi, xj)
that is symmetric,
G = G^T,
and positive semi-definite,
c^T G c ≥ 0 for all c ∈ ℝ^n.
How to Construct Valid Kernels
Theorem: Let K1 and K2 be valid kernels over X × X, X ⊆ ℝ^N, α ≥ 0, 0 ≤ λ ≤ 1, f a real-valued function on X, Φ: X → ℝ^m with a kernel K3 over ℝ^m × ℝ^m, and K a symmetric positive semi-definite matrix. Then the following functions are valid kernels:
K(x,z) = λ K1(x,z) + (1-λ) K2(x,z)
K(x,z) = α K1(x,z)
K(x,z) = K1(x,z) K2(x,z)
K(x,z) = f(x) f(z)
K(x,z) = K3(Φ(x), Φ(z))
K(x,z) = x^T K z
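A quick empirical sanity check for a candidate kernel (an illustration, not from the slides): build the Gram matrix on a sample of points and confirm it is symmetric with non-negative eigenvalues.

    import numpy as np

    def rbf_kernel(x, z, gamma=1.0):
        # Radial basis function kernel, a known valid kernel.
        return np.exp(-gamma * np.sum((x - z) ** 2))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))                    # 20 sample points in R^3
    G = np.array([[rbf_kernel(a, b) for b in X] for a in X])

    print(np.allclose(G, G.T))                      # symmetric
    print(np.linalg.eigvalsh(G).min() >= -1e-10)    # positive semi-definite (up to roundoff)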
Kernels for Discrete and Structured Data
Kernels for Sequences: Two sequences are similar if they have many common and consecutive subsequences.
Example [Lodhi et al., 2000]: For 0 ≤ λ ≤ 1 consider the following feature space:

        c-a  c-t  a-t  b-a  b-t  c-r  a-r  b-r
Φ(cat)  λ²   λ³   λ²   0    0    0    0    0
Φ(car)  λ²   0    0    0    0    λ³   λ²   0
Φ(bat)  0    0    λ²   λ²   λ³   0    0    0
Φ(bar)  0    0    0    λ²   0    0    λ²   λ³

⇒ K(car, cat) = λ⁴, efficient computation via dynamic programming
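A brute-force sketch of this length-2 subsequence feature map (an illustration, not part of the slides; the dynamic program of Lodhi et al. computes the same kernel value without enumerating features):

    from collections import defaultdict

    def subseq_features(s, lam):
        # Explicit feature map: each ordered character pair (s[i], s[j]) with i < j
        # contributes lam**(j - i + 1), the weight of the spanned subsequence.
        phi = defaultdict(float)
        for i in range(len(s)):
            for j in range(i + 1, len(s)):
                phi[(s[i], s[j])] += lam ** (j - i + 1)
        return phi

    def string_kernel(s, t, lam=0.5):
        # K(s, t) = inner product of the explicit feature vectors.
        ps, pt = subseq_features(s, lam), subseq_features(t, lam)
        return sum(v * pt[k] for k, v in ps.items() if k in pt)

    print(string_kernel("car", "cat"))  # lam**4 = 0.0625, matching K(car, cat) = λ^4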
Kernels for Non-Vectorial Data
• Applications with Non-Vectorial Input Data
⇒ classify non-vectorial objects
– Protein classification (x is a string of amino acids)
– Drug activity prediction (x is a molecule structure)
– Information extraction (x is a sentence of words)
– Etc.
• Applications with Non-Vectorial Output Data
⇒ predict non-vectorial objects
– Natural Language Parsing (y is a parse tree)
– Noun-Phrase Co-reference Resolution (y is a clustering)
– Search engines (y is a ranking)
⇒ Kernels can compute inner products efficiently!
Properties of SVMs with Kernels
• Expressiveness
– SVMs with kernel can represent any boolean function (for appropriate choice of kernel)
– SVMs with kernel can represent any sufficiently "smooth" function to arbitrary accuracy (for appropriate choice of kernel)
• Computational
– Objective function has no local optima (only one global optimum)
– Training is independent of the dimensionality of the feature space
• Design decisions
– Kernel type and parameters
– Value of C
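A practical illustration of these design decisions (not from the slides), as a minimal scikit-learn sketch on a small non-linear toy problem:

    from sklearn.datasets import make_moons
    from sklearn.svm import SVC

    # Toy non-linear problem: two interleaving half-circles.
    X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

    # Design decisions: kernel type and its parameters (here RBF with gamma),
    # and the regularization constant C from the soft-margin objective.
    clf = SVC(kernel="rbf", gamma=0.5, C=1.0)
    clf.fit(X, y)
    print(clf.score(X, y))  # training accuracy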
SVMs for other Problems

• Multi-class Classification
– [Schoelkopf/Smola Book, Section 7.6]
• Regression
– [Schoelkopf/Smola Book, Section 1.6]
• Outlier Detection
– D.M.J. Tax and R.P.W. Duin, "Support vector domain description", Pattern Recognition Letters, vol. 20, pp. 1191-1199, 1999.
• Structural Prediction
– B. Taskar, C. Guestrin, D. Koller - Advances in Neural
Information Processing Systems, 2003.
– I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun,
Support Vector Machine Learning for Interdependent and
Structured Output Spaces, Proceedings of the International
Conference on Machine Learning (ICML), 2004.
