Chapter 5: Kernel Methods

Mohammed J. Zaki (1) and Wagner Meira Jr. (2)

(1) Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
(2) Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
Sequence-based Features
Consider a dataset of DNA sequences over the alphabet Σ = {A, C, G, T}.
One simple feature space is to represent each sequence in terms of the probability distribution over symbols in Σ. That is, given a sequence x with length |x| = m, the mapping into feature space is given as

φ(x) = (P(A), P(C), P(G), P(T))

where P(s) = ns/m is the probability of observing symbol s ∈ Σ, and ns is the number of times s appears in sequence x.

For example, if x = ACAGCAGTA, with m = |x| = 9, since A occurs four times, C and G occur twice each, and T occurs once, we have

φ(x) = (4/9, 2/9, 2/9, 1/9) = (0.44, 0.22, 0.22, 0.11)

We can consider larger feature spaces by taking, for example, the probability distribution over all substrings or words of length up to k over the alphabet Σ.
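As an illustrative sketch (not part of the original slides), this symbol-probability feature map takes only a few lines of Python; the function name is ours:

    from collections import Counter

    def symbol_probabilities(x, alphabet="ACGT"):
        """Map a sequence x to its symbol-probability vector (P(A), P(C), P(G), P(T))."""
        counts = Counter(x)
        m = len(x)
        return [counts[s] / m for s in alphabet]

    print(symbol_probabilities("ACAGCAGTA"))   # [0.444..., 0.222..., 0.222..., 0.111...]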
Nonlinear Features
Kernel Method
Let I denote the input space, which can comprise any arbitrary set of objects, and let D = {xi}_{i=1}^{n} ⊆ I be a dataset comprising n objects in the input space. Let φ : I → F be a mapping from the input space I to the feature space F.
Kernel methods avoid explicitly transforming each point x in the input space into the mapped point φ(x) in the feature space. Instead, the input objects are represented via their pairwise similarity values comprising the n × n kernel matrix, defined as
K = [ K(x1, x1)  K(x1, x2)  ...  K(x1, xn)
      K(x2, x1)  K(x2, x2)  ...  K(x2, xn)
         ...        ...     ...     ...
      K(xn, x1)  K(xn, x2)  ...  K(xn, xn) ]
Here K : I × I → R is a kernel function on any two points in input space, which should satisfy the condition

K(xi, xj) = φ(xi)^T φ(xj)

Intuitively, we need to be able to compute the value of the dot product using the original input representation x, without having recourse to the mapping φ(x).
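As a minimal sketch (names and toy data are ours), the kernel matrix can be filled in directly from any kernel function:

    import numpy as np

    def kernel_matrix(X, kernel):
        """Build the n x n matrix K with K[i, j] = kernel(x_i, x_j)."""
        n = len(X)
        return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

    # Example with the dot-product (linear) kernel on random points:
    X = np.random.default_rng(0).normal(size=(4, 2))
    K = kernel_matrix(X, lambda x, y: float(x @ y))
    print(K.shape)   # (4, 4); K is symmetric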
Linear Kernel
Let φ(x) = x be the identity mapping. This leads to the linear kernel, which is simply the dot product between the two input vectors:

φ(x)^T φ(y) = x^T y = K(x, y)
For example, if x1 = (5.9, 3)^T and x2 = (6.9, 3.1)^T, then

K(x1, x2) = x1^T x2 = 5.9 · 6.9 + 3 · 3.1 = 50.01

Figure: five two-dimensional points x1, ..., x5 in input space, plotted over the attributes X1 and X2.

The full linear kernel matrix over the five points is
K       x1      x2      x3      x4      x5
x1     43.81   50.01   47.64   36.74   42.00
x2     50.01   57.22   54.53   41.66   48.22
x3     47.64   54.53   51.97   39.64   45.98
x4     36.74   41.66   39.64   31.40   34.64
x5     42.00   48.22   45.98   34.64   40.84
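The entry K(x1, x2) = 50.01 in the table can be checked directly (a sketch; only x1 and x2 are given explicitly on this slide):

    import numpy as np

    x1 = np.array([5.9, 3.0])
    x2 = np.array([6.9, 3.1])
    print(x1 @ x2)   # 50.01 = K(x1, x2)
    print(x1 @ x1)   # 43.81 = K(x1, x1)

    # With all points stacked as rows of a matrix D, the whole linear kernel matrix is D @ D.T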
Kernel Trick
Many data mining methods can be kernelized, that is, instead of mapping the input points into feature space, the data can be represented via the n × n kernel matrix K, and all relevant analysis can be performed over K.

This is done via the kernel trick: show that the analysis task requires only dot products φ(xi)^T φ(xj) in feature space, which can be replaced by the corresponding kernel values K(xi, xj) = φ(xi)^T φ(xj) that can be computed efficiently in input space.

Once the kernel matrix has been computed, we no longer even need the input points xi, as all operations involving only dot products in feature space can be performed over the n × n kernel matrix K.
Kernel Matrix
For any vector a = (a1, a2, ..., an)^T ∈ R^n, we have

Σ_{i=1}^n Σ_{j=1}^n ai aj K(xi, xj) = Σ_{i=1}^n Σ_{j=1}^n ai aj φ(xi)^T φ(xj)
                                    = ( Σ_{i=1}^n ai φ(xi) )^T ( Σ_{j=1}^n aj φ(xj) )
                                    = || Σ_{i=1}^n ai φ(xi) ||^2 >= 0

so the kernel matrix K is positive semidefinite.
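A quick numerical check of this property (an illustrative sketch on toy data): any quadratic form a^T K a is non-negative, and so are all eigenvalues of K.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 2))
    K = X @ X.T                                   # a linear kernel matrix on toy data

    a = rng.normal(size=5)
    print(a @ K @ a >= -1e-10)                    # quadratic form is >= 0 (up to round-off)
    print(np.linalg.eigvalsh(K).min() >= -1e-10)  # smallest eigenvalue is >= 0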
Empirical kernel map: let Ki = (K(x1, xi), K(x2, xi), ..., K(xn, xi))^T denote the ith column of K, and define φ(xi) = K^{-1/2} Ki. The dot product in this feature space is then

φ(xi)^T φ(xj) = (K^{-1/2} Ki)^T (K^{-1/2} Kj)
              = Ki^T K^{-1/2} K^{-1/2} Kj
              = Ki^T K^{-1} Kj

which equals exactly the kernel value K(xi, xj).
Mercer kernel map: let K = UΛU^T be the eigendecomposition of the kernel matrix, with U the matrix of eigenvectors and Λ the diagonal matrix of eigenvalues. Define

φ(xi) = √Λ Ui

where Ui is the ith row of U (treated as a column vector). The kernel value is then simply the dot product between scaled rows of U:

φ(xi)^T φ(xj) = (√Λ Ui)^T (√Λ Uj) = Ui^T Λ Uj
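A sketch of this construction (names and toy data are ours): take the eigendecomposition of K, scale the rows of U by √Λ, and check that the resulting vectors reproduce the kernel values.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(4, 3))
    K = X @ X.T                          # any valid kernel matrix

    lam, U = np.linalg.eigh(K)           # K = U diag(lam) U^T
    lam = np.clip(lam, 0.0, None)        # guard against tiny negative round-off
    Phi = U * np.sqrt(lam)               # row i is phi(x_i) = sqrt(Lambda) U_i

    print(np.allclose(Phi @ Phi.T, K))   # True: phi(x_i)^T phi(x_j) = K(x_i, x_j)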
Polynomial Kernel
Polynomial kernels are of two types: homogeneous or inhomogeneous.
Let x, y ∈ R^d. The (inhomogeneous) polynomial kernel is defined as

Kq(x, y) = φ(x)^T φ(y) = (c + x^T y)^q

where q is the degree of the polynomial, and c >= 0 is some constant. When c = 0 we obtain the homogeneous kernel, comprising only degree-q terms. When c > 0, the feature space is spanned by all products of at most q attributes.
This can be seen from the binomial expansion

Kq(x, y) = (c + x^T y)^q = Σ_{k=0}^{q} C(q, k) c^{q-k} (x^T y)^k

where C(q, k) denotes the binomial coefficient.
The most typical cases are the linear (with q = 1) and quadratic (with q = 2) kernels, given as

K1(x, y) = c + x^T y
K2(x, y) = (c + x^T y)^2
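For instance (a sketch, not from the slides), with d = 2, q = 2, and c = 0 the homogeneous quadratic kernel equals the dot product in the explicit feature space (x1^2, x2^2, √2 x1 x2):

    import numpy as np

    def phi_quadratic(x):
        """Explicit feature map for the homogeneous quadratic kernel, d = 2, c = 0."""
        return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

    x = np.array([5.9, 3.0])
    y = np.array([6.9, 3.1])
    print((x @ y) ** 2)                          # kernel computed in input space
    print(phi_quadratic(x) @ phi_quadratic(y))   # same value via the explicit feature map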
Gaussian Kernel
The Gaussian kernel, also called the Gaussian radial basis function (RBF) kernel, is defined as

K(x, y) = exp{ -||x - y||^2 / (2 σ^2) }

where σ > 0 is the spread parameter that plays the same role as the standard deviation in a normal density function.
Note that K (x, x) = 1, and further that the kernel value is inversely related to
the distance between the two points x and y.
A feature space for the Gaussian kernel has infinite dimensionality.
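A direct sketch of the Gaussian kernel (function name and test points are ours):

    import numpy as np

    def gaussian_kernel(x, y, sigma=1.0):
        """Gaussian (RBF) kernel: exp(-||x - y||^2 / (2 sigma^2))."""
        return float(np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2)))

    x = np.array([5.9, 3.0])
    y = np.array([6.9, 3.1])
    print(gaussian_kernel(x, x))             # 1.0, since the distance is zero
    print(gaussian_kernel(x, y, sigma=1.0))  # smaller for points that are farther apart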
Norm in feature space: the norm of a mapped point can be computed as ||φ(x)|| = √K(x, x).

Distance in feature space: the distance between two mapped points can be computed as

||φ(xi) - φ(xj)|| = √( K(xi, xi) + K(xj, xj) - 2 K(xi, xj) )
The larger the distance ||φ(xi) - φ(xj)|| between the two points in feature space, the smaller the kernel value, that is, the smaller the similarity.
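Both quantities can be computed from kernel evaluations alone, as in this sketch (the quadratic kernel and test points here are just an illustration):

    import numpy as np

    K = lambda x, y: float((1 + x @ y) ** 2)     # any kernel; here an inhomogeneous quadratic

    x = np.array([1.0, 2.0])
    y = np.array([0.5, -1.0])

    norm_x = np.sqrt(K(x, x))                            # ||phi(x)||
    dist_xy = np.sqrt(K(x, x) + K(y, y) - 2 * K(x, y))   # ||phi(x) - phi(y)||
    print(norm_x, dist_xy)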
Mean in Feature Space: The mean of the points in feature space is given as μ_φ = (1/n) Σ_{i=1}^n φ(xi). Since the mapped points φ(xi) are not available explicitly, we cannot compute the mean directly. However, the squared norm of the mean is

||μ_φ||^2 = μ_φ^T μ_φ = (1/n^2) Σ_{i=1}^n Σ_{j=1}^n K(xi, xj)        (1)

That is, the squared norm of the mean in feature space is simply the average of the values in the kernel matrix K.
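A sketch verifying this identity for the linear kernel, where the mean in feature space can also be computed explicitly:

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(6, 2))
    K = X @ X.T                           # linear kernel, so phi(x) = x

    mu = X.mean(axis=0)                   # mean in feature space (explicit here)
    print(np.isclose(mu @ mu, K.mean()))  # True: ||mu||^2 equals the average kernel value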
Total Variance in Feature Space: The total variance is the average squared distance of the points from the mean in feature space, and it too can be computed using only the kernel:

σ^2_φ = (1/n) Σ_{i=1}^n ||φ(xi) - μ_φ||^2 = (1/n) Σ_{i=1}^n K(xi, xi) - (1/n^2) Σ_{i=1}^n Σ_{j=1}^n K(xi, xj)
Centering in Feature Space: We can center each point in feature space by subtracting the mean from it, as follows:

φ̄(xi) = φ(xi) - μ_φ

The corresponding centered kernel can be computed using only the kernel function K:

K̄(xi, xj) = φ̄(xi)^T φ̄(xj)
           = K(xi, xj) - (1/n) Σ_{k=1}^n K(xi, xk) - (1/n) Σ_{k=1}^n K(xj, xk) + (1/n^2) Σ_{a=1}^n Σ_{b=1}^n K(xa, xb)

In matrix form, the centered kernel matrix is

K̄ = ( I - (1/n) 1_{n×n} ) K ( I - (1/n) 1_{n×n} )

where 1_{n×n} is the n × n matrix of all ones.
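A sketch of kernel centering, again verified against explicit centering under the linear kernel (toy data, names ours):

    import numpy as np

    def center_kernel(K):
        """Centered kernel matrix: (I - (1/n) 1 1^T) K (I - (1/n) 1 1^T)."""
        n = K.shape[0]
        C = np.eye(n) - np.ones((n, n)) / n
        return C @ K @ C

    rng = np.random.default_rng(3)
    X = rng.normal(size=(5, 3))
    K = X @ X.T
    Xc = X - X.mean(axis=0)                          # explicit centering (possible for the linear kernel)
    print(np.allclose(center_kernel(K), Xc @ Xc.T))  # True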
Normalizing in Feature Space: The dot product between normalized points in feature space corresponds to the cosine of the angle between them:

φ(xi)^T φ(xj) / ( ||φ(xi)|| ||φ(xj)|| ) = cos θ

If the mapped points are both centered and normalized, then a dot product corresponds to the correlation between the two points in feature space.

The normalized kernel matrix, Kn, can be computed using only the kernel function K, as

Kn(xi, xj) = φ(xi)^T φ(xj) / ( ||φ(xi)|| ||φ(xj)|| ) = K(xi, xj) / √( K(xi, xi) K(xj, xj) )
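A sketch of kernel normalization, applied here to the 2 × 2 linear kernel matrix of x1 and x2 from the earlier slide:

    import numpy as np

    def normalize_kernel(K):
        """Normalized kernel: K_n[i, j] = K[i, j] / sqrt(K[i, i] * K[j, j])."""
        d = np.sqrt(np.diag(K))
        return K / np.outer(d, d)

    K = np.array([[43.81, 50.01],
                  [50.01, 57.22]])
    print(normalize_kernel(K))   # ones on the diagonal; off-diagonal entries are cosines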
Spectrum Kernel: Given alphabet Σ, the l-spectrum feature map is the mapping φ : Σ* → R^{|Σ|^l} from the set of strings over Σ to the |Σ|^l-dimensional space representing the number of occurrences of all possible substrings of length l, defined as

φ(x) = ( ..., #(α), ... )^T   for all α ∈ Σ^l

where #(α) is the number of occurrences of the substring α in x.

The (full) spectrum map considers all lengths from l = 0 to l = ∞, leading to the infinite-dimensional feature map φ : Σ* → R^∞:

φ(x) = ( ..., #(α), ... )^T   for all α ∈ Σ*

The (l-)spectrum kernel between two strings xi and xj is simply the dot product between their (l-)spectrum maps:

K(xi, xj) = φ(xi)^T φ(xj)

The (full) spectrum kernel can be computed efficiently via suffix trees in O(n + m) time for two strings of length n and m.
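A brute-force sketch of the l-spectrum kernel (quadratic time, not the O(n + m) suffix-tree algorithm mentioned above); the example strings are ours:

    from collections import Counter

    def spectrum_kernel(x, y, l):
        """l-spectrum kernel: dot product of length-l substring counts."""
        cx = Counter(x[i:i + l] for i in range(len(x) - l + 1))
        cy = Counter(y[i:i + l] for i in range(len(y) - l + 1))
        return sum(cx[s] * cy[s] for s in cx)

    print(spectrum_kernel("ACAGCAGTA", "ACAGCAGTA", 3))  # a string compared with itself
    print(spectrum_kernel("ACAGCAGTA", "GTACAGCAG", 3))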
Given a similarity matrix S between the nodes of a graph, with eigendecomposition S = UΛU^T, its lth power is

S^l = U Λ^l U^T
The exponential diffusion kernel is defined as

K = Σ_{l=0}^∞ (1/l!) β^l S^l = I + β S + (β^2/2!) S^2 + (β^3/3!) S^3 + ... = exp{β S}

where β is a damping factor, and exp{βS} is the matrix exponential. The series on the right hand side above converges for all β >= 0.
Equivalently, in terms of the eigendecomposition of S,

K = exp{βS} = U [ exp{βλ1}     0       ...      0
                     0      exp{βλ2}   ...      0
                    ...        ...     ...     ...
                     0         0       ...  exp{βλn} ] U^T

where λi is an eigenvalue of S.
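A sketch of this computation for a symmetric similarity matrix S (for symmetric S it is equivalent to scipy.linalg.expm(beta * S)); the function name is ours:

    import numpy as np

    def exponential_diffusion_kernel(S, beta):
        """K = exp(beta * S), via the eigendecomposition of the symmetric matrix S."""
        lam, U = np.linalg.eigh(S)
        return U @ np.diag(np.exp(beta * lam)) @ U.T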
The von Neumann diffusion kernel is defined as

K = Σ_{l=0}^∞ β^l S^l

where β >= 0 is a damping factor. In terms of the eigendecomposition S = UΛU^T, this series converges to

K = U ( I - β Λ )^{-1} U^T

provided β |λi| < 1 for every eigenvalue λi of S.
Figure: an undirected graph on the vertices v1, ..., v5.

Its adjacency matrix A and degree matrix Δ are

A = [ 0 0 1 1 0
      0 0 1 0 1
      1 1 0 1 0
      1 0 1 0 1
      0 1 0 1 0 ]          Δ = diag(2, 2, 3, 3, 2)
Taking the similarity matrix between nodes to be the negated Laplacian, we have

S = -L = A - Δ = [ -2   0   1   1   0
                    0  -2   1   0   1
                    1   1  -3   1   0
                    1   0   1  -3   1
                    0   1   0   1  -2 ]

The eigenvalues of S are

λ1 = 0,   λ2 = -1.38,   λ3 = -2.38,   λ4 = -3.62,   λ5 = -4.62

with U denoting the matrix whose columns are the corresponding eigenvectors u1, ..., u5.
Taking β = 0.2, the exponential diffusion kernel is

K = exp{0.2 S} = U [ exp{0.2 λ1}      0        ...       0
                         0       exp{0.2 λ2}   ...       0
                        ...          ...       ...      ...
                         0            0        ...  exp{0.2 λ5} ] U^T

whose rows for the vertices v3, v4, and v5 are

( 0.14  0.13  0.59  0.13  0.03 )
( 0.14  0.03  0.13  0.59  0.13 )
( 0.01  0.14  0.03  0.13  0.70 )
Assuming β = 0.2, the von Neumann kernel is given as

K = U ( I - 0.2 Λ )^{-1} U^T

whose rows for the vertices v3, v4, and v5 are

( 0.11  0.10  0.66  0.10  0.03 )
( 0.11  0.03  0.10  0.66  0.10 )
( 0.02  0.11  0.03  0.10  0.74 )
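The two kernels above can be reproduced (up to rounding) with a short sketch that builds S = A - Δ from the adjacency matrix and applies the spectral formulas with β = 0.2:

    import numpy as np

    A = np.array([[0, 0, 1, 1, 0],
                  [0, 0, 1, 0, 1],
                  [1, 1, 0, 1, 0],
                  [1, 0, 1, 0, 1],
                  [0, 1, 0, 1, 0]])
    S = A - np.diag(A.sum(axis=1))        # negated Laplacian S = A - Delta

    lam, U = np.linalg.eigh(S)            # eigenvalues approx. 0, -1.38, -2.38, -3.62, -4.62
    beta = 0.2

    K_exp = U @ np.diag(np.exp(beta * lam)) @ U.T        # exponential diffusion kernel
    K_von = U @ np.diag(1.0 / (1.0 - beta * lam)) @ U.T  # von Neumann kernel
    print(np.round(K_exp, 2))
    print(np.round(K_von, 2))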