Lecture 3 — October 16th 2013/2014
When talking about the estimation of "hidden" parameters, French and English speakers use different terms, which can lead to some confusion. Within a supervised framework, English speakers use the term classification whereas the French use discrimination. Within an unsupervised context, English speakers use the term clustering, whereas French speakers say classification or classification non-supervisée. In the following we will only use the English terms.
3.1 K-means
K-means clustering is a method of vector quantization. It is an alternating minimization algorithm that aims at partitioning n observations into K clusters in which each observation belongs to the cluster with the nearest mean, this mean serving as a prototype of the cluster (see Figure 3.1).
• µk ∈ Rp, k ∈ {1, ..., K}, are the means, where µk is the center of cluster k. We will denote by µ the associated matrix.
• zik are indicator variables associated with xi, such that zik = 1 if xi belongs to cluster k and zik = 0 otherwise. z is the matrix whose components are the zik.
3.1.2 Algorithm
The aim of the algorithm is to minimize the distortion
\[
J(\mu, z) = \sum_{i=1}^{n} \sum_{k=1}^{K} z_{ik} \, \| x_i - \mu_k \|^2 .
\]
To do so we proceed by alternating minimization: we minimize J with respect to z for µ fixed, then with respect to µ for z fixed, and we iterate these two steps until convergence.
Remark 3.1.1 The minimization step with respect to z is equivalent to allocating each xi to the Voronoi cell whose center is the nearest µk.
Remark 3.1.2 During the minimization step with respect to µ, µk is obtained by setting to zero the k-th block of the gradient of J with respect to µ. Indeed, we easily see that
\[
\nabla_{\mu_k} J = -2 \sum_{i} z_{ik} (x_i - \mu_k),
\]
so that µk is the mean of the points currently assigned to cluster k.
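To make the two alternating steps concrete, here is a minimal NumPy sketch of the procedure; the data matrix X (one observation per row), the number of clusters K, and the number of iterations are assumed inputs, and empty clusters are handled naively.

```python
import numpy as np

def kmeans(X, K, n_iter=100, rng=None):
    """Alternating minimization of the distortion J(mu, z)."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    # Initialization: K centers drawn at random among the data points.
    mu = X[rng.choice(n, size=K, replace=False)].astype(float).copy()
    z = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        # Minimization over z: assign each x_i to the nearest center,
        # i.e., to the Voronoi cell of the corresponding mu_k (Remark 3.1.1).
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        z = dists.argmin(axis=1)
        # Minimization over mu: mu_k is the mean of the points assigned
        # to cluster k (Remark 3.1.2); empty clusters are left unchanged.
        for k in range(K):
            if np.any(z == k):
                mu[k] = X[z == k].mean(axis=0)
    J = ((X - mu[z]) ** 2).sum()  # final distortion
    return mu, z, J
```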
3.1.3 K-means++
The intuition behind this approach is that it helps to spread out the K initial cluster centers. At each iteration of the algorithm we build a new center, and we repeat until we have K centers. Here are the steps of the algorithm:
• Step 0: Initialize the algorithm by choosing the first center uniformly at random among the data points.
• Step 1: For each data point xi of the data set, compute the distance between xi and the nearest center already chosen. We denote this distance Dµt(xi), where the subscript µt recalls that the minimum is taken over the currently chosen centers.
• Step 2: Choose one new data point at random as a new center, using a weighted probability distribution in which a point xi is chosen with probability proportional to Dµt(xi)².
• Step 3: Repeat Steps 1 and 2 until K centers have been chosen.
We have now built K centers in accordance with our initial intuition of spreading them out (thanks to the well-chosen weighted probability). These centers can be used as the initialization of the standard K-means algorithm; a sketch is given below.
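Here is a minimal NumPy sketch of this seeding procedure; the data matrix X and the number of clusters K are assumed inputs, and the returned centers can be fed to the kmeans sketch above.

```python
import numpy as np

def kmeans_pp_init(X, K, rng=None):
    """K-means++ seeding: choose K initial centers spread out over the data."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    # Step 0: first center chosen uniformly at random among the data points.
    centers = [X[rng.integers(n)]]
    while len(centers) < K:
        # Step 1: squared distance from each point to its nearest chosen center.
        C = np.array(centers)
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Step 2: draw a new center with probability proportional to D_{mu_t}(x_i)^2.
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centers)
```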
More details on the K-means++ algorithm can be found in [A].
[A] Arthur, D. and Vassilvitskii, S. (2007). k-means++: the advantages of careful seeding.
Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms.
3.1.4 Choice of K
It is important to point out that the choice of K is not universal. Indeed, if we increase K, the distortion J decreases, until it reaches 0 when K = n, that is to say when each data point is the center of its own cluster. To address this issue, one solution is to add to J a penalty on K. Usually it takes the following form:
\[
J(\mu, z, K) = \sum_{i=1}^{n} \sum_{k=1}^{K} z_{ik} \, \| x_i - \mu_k \|^2 + \lambda K .
\]
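A possible way to use this penalized criterion in practice is sketched below, assuming the kmeans function from the earlier sketch is available and that the penalty weight λ is chosen by the user.

```python
import numpy as np

def choose_K(X, K_max, lam, rng=0):
    """Select K by minimizing the penalized distortion J(mu, z, K) = J(mu, z) + lambda * K."""
    penalized = []
    for K in range(1, K_max + 1):
        _, _, J = kmeans(X, K, rng=rng)  # distortion of the K-means solution
        penalized.append(J + lam * K)
    return int(np.argmin(penalized)) + 1  # K values start at 1
```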
Figure 3.2. Example where K-means does not provide a satisfactory clustering result.
Using Gaussian mixtures provides a way to avoid this problem (see next section).
Assumption: (x, z) are random variables where x is observed (our data) and z is not observed (for instance, the unknown cluster assignment).
We can already sense that, because of the sum in the marginal likelihood pθ(x) = Σz pθ(x, z), the problem will be somewhat more difficult than before. Indeed, taking the log of this probability does not lead to a simple convex problem. In the following we will see that EM is a method for solving this kind of problem.
3.2.1 Example
Let us present a simple example to illustrate what we just said. The probability density represented in Figure 3.3 is akin to an average of two Gaussians. Thus, it is natural to use a mixture model and to introduce a hidden variable z, following a Bernoulli-type distribution, that indicates which Gaussian the point is sampled from.
Figure 3.3. Average of two Gaussian probability densities, for which it is natural to introduce a mixture model.
Thus we have z ∈ {1, 2} and x|z = i ∼ N(µi, Σi). The density p(x) is a convex combination of normal densities:
\[
p(x) = \pi_1 \, \mathcal{N}(x \mid \mu_1, \Sigma_1) + \pi_2 \, \mathcal{N}(x \mid \mu_2, \Sigma_2), \qquad \pi_i = p(z = i).
\]
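To make the example concrete, here is a small sketch that samples from such a two-component mixture and evaluates its density; the particular weights, means and covariances below are made-up illustration values, not from the lecture.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative parameters (assumed values).
pi = np.array([0.4, 0.6])                                  # p(z = 1), p(z = 2)
mu = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigma = [np.eye(2), 0.5 * np.eye(2)]

rng = np.random.default_rng(0)

# Sampling: draw the hidden component z, then x | z = i ~ N(mu_i, Sigma_i).
z = rng.choice(2, size=500, p=pi)
x = np.array([rng.multivariate_normal(mu[k], Sigma[k]) for k in z])

def density(x):
    """p(x): convex combination of the two Gaussian densities."""
    return sum(pi[k] * multivariate_normal.pdf(x, mean=mu[k], cov=Sigma[k]) for k in range(2))
```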
3.2.4 EM algorithm
We introduce a function q(z) such that q(z) ≥ 0 and Σz q(z) = 1 in the expression of the likelihood. Thus we have:
\begin{align*}
\log p_\theta(x) &= \log \sum_{z} p_\theta(x, z) \\
&= \log \sum_{z} q(z) \, \frac{p_\theta(x, z)}{q(z)} \\
&\geq \sum_{z} q(z) \log \frac{p_\theta(x, z)}{q(z)} \quad \text{by Jensen's inequality, since } \log \text{ is concave} \\
&= \sum_{z} q(z) \log p_\theta(x, z) - \sum_{z} q(z) \log q(z) \\
&= L(q, \theta),
\end{align*}
with equality if and only if q(z) = pθ(x, z) / Σz′ pθ(x, z′) = pθ(z|x) (by strict concavity of the logarithm).
Proposition 3.1 ∀θ, ∀q, log pθ(x) ≥ L(q, θ), with equality if and only if q(z) = pθ(z|x).
Remark 3.2.1 We have introduced an auxiliary function L(q, θ) that always lies below the function log pθ(x).
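The inequality of Proposition 3.1 can be checked numerically on a toy model; the sketch below uses a made-up one-dimensional two-component mixture and compares L(q, θ) for an arbitrary q with its value at the posterior q(z) = pθ(z|x).

```python
import numpy as np
from scipy.stats import norm

# Toy model (assumed values): z in {0, 1}, p(z) = pi_z, x | z ~ N(m_z, 1).
pi, m = np.array([0.3, 0.7]), np.array([-1.0, 2.0])
x = 0.5                                      # a single observed point

p_xz = pi * norm.pdf(x, loc=m, scale=1.0)    # joint p_theta(x, z) for z = 0, 1
log_px = np.log(p_xz.sum())                  # log p_theta(x)

def lower_bound(q):
    """L(q, theta) = sum_z q(z) [log p_theta(x, z) - log q(z)]."""
    return np.sum(q * (np.log(p_xz) - np.log(q)))

q_arbitrary = np.array([0.5, 0.5])
q_posterior = p_xz / p_xz.sum()              # p_theta(z | x)

print(lower_bound(q_arbitrary) <= log_px)             # True: L(q, theta) <= log p_theta(x)
print(np.isclose(lower_bound(q_posterior), log_px))   # True: equality at the posterior
```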
Algorithm properties
• The likelihood increases at each iteration: ∀t, log pθt(x) ≥ log pθt−1(x).
• The algorithm does not converge to a global maximum but rather to a local maximum, because we are dealing with a non-convex problem. An illustration is given in Figure 3.4.
• As was already the case for K-means, we rerun the algorithm several times in order to be more confident, and we keep the solution with the highest likelihood.
The EM recipe Let us recall the initial goal of the algorithm: to maximize the incomplete likelihood log pθ(x). To do so, we maximize the following function, which is always below log pθ(x):
\[
L(q, \theta) = \sum_{z} q(z) \log p_\theta(x, z) - \sum_{z} q(z) \log q(z).
\]
1. Compute the probability of Z given X, pθt(z|x) (corresponding to qt+1 = arg maxq L(q, θt)).
3. E-Step: compute the expected value of the complete log-likelihood with respect to the conditional distribution of Z given X under the current estimate θt of the parameter: EZ|X(lc).
For the Gaussian mixture model, the marginal distribution of each observation is
\[
p_\theta(x_i) = \sum_{z_i} p_\theta(x_i, z_i) = \sum_{z_i} p_\theta(x_i \mid z_i)\, p_\theta(z_i)
= \sum_{j=1}^{k} p_\theta(x_i \mid z_i = j)\, p_\theta(z_i = j),
\]
and the posterior probability of component j (the responsibility) is
\[
\tau_{ij}(\theta) = p_\theta(z_i = j \mid x_i)
= \frac{p_\theta(x_i \mid z_i = j)\, p_\theta(z_i = j)}{\sum_{l=1}^{k} p_\theta(x_i \mid z_i = l)\, p_\theta(z_i = l)} .
\]
We recall that
\[
\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right).
\]
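In code, the responsibilities τij(θ) can be computed directly from this density; here is a short sketch, assuming a data matrix X, mixing weights pi, and lists of means mus and covariances Sigmas.

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pi, mus, Sigmas):
    """tau[i, j] = p_theta(z_i = j | x_i), proportional to pi_j * N(x_i | mu_j, Sigma_j)."""
    weighted = np.column_stack([
        pi[j] * multivariate_normal.pdf(X, mean=mus[j], cov=Sigmas[j])
        for j in range(len(pi))
    ])
    return weighted / weighted.sum(axis=1, keepdims=True)
```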
The complete log-likelihood is
\begin{align*}
l_{c,t} = \log p_{\theta_t}(x, z) &= \sum_{i=1}^{n} \log p_{\theta_t}(x_i, z_i) \\
&= \sum_{i=1}^{n} \log\big( p_{\theta_t}(z_i)\, p_{\theta_t}(x_i \mid z_i) \big) \\
&= \sum_{i=1}^{n} \big( \log p_{\theta_t}(z_i) + \log p_{\theta_t}(x_i \mid z_i) \big) \\
&= \sum_{i=1}^{n} \sum_{j=1}^{k} z_{ij} \log \pi_{j,t} + \sum_{i=1}^{n} \sum_{j=1}^{k} z_{ij} \log \mathcal{N}(x_i \mid \mu_{j,t}, \Sigma_{j,t}).
\end{align*}
E-Step We can now write the expectation of the previous quantity with respect to the conditional distribution of Z given X. This amounts to replacing zij by EZ|X(zij) = pθt(zi = j|xi) = τij(θt); indeed, the other terms of the sum are constant from the point of view of the conditional distribution of Z given X. We thus obtain EZ|X(lc,t). Since the value of θt is fixed during the M-step, we drop the dependence on θt and write τij.
\[
\sum_{i=1}^{n} \sum_{j=1}^{k} \tau_{ij} \log \pi_{j,t}
+ \sum_{i=1}^{n} \sum_{j=1}^{k} \tau_{ij} \left[ \log\frac{1}{(2\pi)^{d/2}} + \log\frac{1}{|\Sigma_{j,t}|^{1/2}} - \frac{1}{2} (x_i - \mu_{j,t})^T \Sigma_{j,t}^{-1} (x_i - \mu_{j,t}) \right]
\]
M-Step We can now maximize with respect to µ and Σ. By setting the gradients with respect to µj,t and Σj,t to zero, we obtain:
\[
\mu_{j,t+1} = \frac{\sum_{i} \tau_{ij}\, x_i}{\sum_{i} \tau_{ij}}
\]
\[
\Sigma_{j,t+1} = \frac{\sum_{i} \tau_{ij}\, (x_i - \mu_{j,t+1})(x_i - \mu_{j,t+1})^T}{\sum_{i} \tau_{ij}}
\]
The M-step in the EM algorithm corresponds to the mean-estimation step in K-means. Note that the values of τij in the expressions above are taken at the parameter values of the previous iterate, i.e., τij = τij(θt).
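Putting the two steps together, here is a minimal NumPy sketch of one possible EM loop for the Gaussian mixture, reusing the responsibilities function sketched above; the mixing-weight update πj = (1/n) Σi τij is included for completeness even though only the µ and Σ updates are derived above, and the small ridge added to Σ is a numerical safeguard, not part of the lecture.

```python
import numpy as np

def m_step(X, tau):
    """M-step: closed-form updates for pi, mu and Sigma given the responsibilities."""
    n, d = X.shape
    Nj = tau.sum(axis=0)                         # sum_i tau_ij, one value per component
    mus = (tau.T @ X) / Nj[:, None]              # mu_j = sum_i tau_ij x_i / sum_i tau_ij
    Sigmas = []
    for j in range(tau.shape[1]):
        diff = X - mus[j]
        Sigmas.append((tau[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(d))
    pi = Nj / n                                  # standard mixing-weight update (see lead-in)
    return pi, mus, np.array(Sigmas)

def em_gmm(X, K, n_iter=50, seed=0):
    """Alternate E-steps and M-steps from a random initialization."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(K, 1.0 / K)
    mus = X[rng.choice(n, size=K, replace=False)].astype(float)
    cov0 = (X - X.mean(axis=0)).T @ (X - X.mean(axis=0)) / n
    Sigmas = np.array([cov0 + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        tau = responsibilities(X, pi, mus, Sigmas)   # E-step (sketched earlier)
        pi, mus, Sigmas = m_step(X, tau)             # M-step
    return pi, mus, Sigmas
```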
Definition 3.6 (maximal clique) A maximal clique C is a clique which is maximal for the inclusion order:
\[
\nexists\, v \in V : v \notin C \text{ and } C \cup \{v\} \text{ is a clique}
\]
(Figure 3.8).
Definition 3.7 (path) A path is a sequence of vertices in which consecutive vertices are connected by an edge and all vertices are distinct (Figure 3.9).
Definition 3.8 (cycle) A cycle is a sequence of vertices (v0, . . . , vk) in which consecutive vertices are connected by an edge and such that:
• v0 = vk
• ∀i, j, vi ≠ vj if {i, j} ≠ {0, k}
Definition 3.9 Let A, B, C be distinct subsets of V . C separates A and B if all paths from
A to B go through C (Figure 3.10).
Figure 3.10. C separates A and B.
In this course we will assume that the graph has a single connected component; otherwise, each connected component is dealt with independently.
Definition 3.15 (cycle) A cycle is a sequence of vertices (v0, . . . , vk) in which consecutive vertices are connected by an edge (Figure 3.12) and such that:
• v0 = vk
• ∀i, j, vi ≠ vj if {i, j} ≠ {0, k}
Definition 3.16 (DAG) A directed acyclic graph (DAG) is a directed graph without any
cycle.
• joint distribution: p(x1, . . . , xn) = P(X1 = x1, . . . , Xn = xn)
• conditional distribution: p(x | y) = P(X = x | Y = y) = p(x, y)/p(y)
Review
\begin{align*}
X \perp\!\!\!\perp Y &\iff p(x, y) = p(x)\, p(y) \quad \forall x, y \\
&\iff p_{XY}(x, y) = p_X(x)\, p_Y(y) \\
X \perp\!\!\!\perp Y \mid Z &\iff p(x, y \mid z) = p(x \mid z)\, p(y \mid z) \quad \forall x, y, z \\
&\iff p(x \mid y, z) = p(x \mid z)
\end{align*}
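The first of these equivalences is easy to check numerically on finite distributions; the sketch below uses two made-up 2×2 joint tables, one that factorizes into its marginals and one that does not.

```python
import numpy as np

def is_independent(pxy, tol=1e-12):
    """X independent of Y iff p(x, y) = p(x) p(y) for all x, y."""
    px = pxy.sum(axis=1)          # marginal p(x)
    py = pxy.sum(axis=0)          # marginal p(y)
    return np.allclose(pxy, np.outer(px, py), atol=tol)

# Joint table built as an outer product of marginals: independent.
p_indep = np.outer([0.3, 0.7], [0.6, 0.4])
# A joint table that does not factorize: dependent.
p_dep = np.array([[0.4, 0.1],
                  [0.1, 0.4]])

print(is_independent(p_indep))   # True
print(is_independent(p_dep))     # False
```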
A distribution p is said to factorize according to the DAG if it can be written as
\[
p(x_1, \ldots, x_n) = \prod_{i=1}^{n} f_i(x_i, x_{\pi_i}),
\]
with
• πi the set of parents of node i
• ∀i, fi > 0
• ∀i, Σxi fi(xi, xπi) = 1
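As an illustration, the sketch below builds such a factorization for a small made-up chain DAG X1 → X2 → X3 with binary variables: each factor fi is non-negative and sums to one over xi for every value of its parents, and the resulting product defines a valid joint distribution.

```python
import numpy as np

# Local factors for the chain X1 -> X2 -> X3 (illustrative numbers).
f1 = np.array([0.6, 0.4])                    # f1(x1), no parents
f2 = np.array([[0.9, 0.1],                   # f2(x2, x1): row = parent value x1, column = child value x2
               [0.2, 0.8]])
f3 = np.array([[0.7, 0.3],                   # f3(x3, x2): row = parent value x2, column = child value x3
               [0.1, 0.9]])

# Each factor sums to 1 over its own variable for every value of the parents.
assert np.allclose(f2.sum(axis=1), 1) and np.allclose(f3.sum(axis=1), 1)

# Joint distribution p(x1, x2, x3) = f1(x1) f2(x2, x1) f3(x3, x2).
p = np.einsum('a,ab,bc->abc', f1, f2, f3)

print(p.sum())                               # 1.0: the product is a probability distribution
```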