Lecture 3 — October 16th 2013/2014
When talking about the estimation of "hidden" parameters, French and English speakers use different terms, which can lead to some confusion. Within a supervised framework, English speakers use the term classification whereas the French use discrimination. Within an unsupervised context, English speakers use the term clustering, whereas French speakers say classification or classification non-supervisée. In the following we will only use the English terms.
3.1 K-means
K-means clustering is a method of vector quantization. It is an alternating minimization algorithm that aims at partitioning n observations into K clusters in which each observation belongs to the cluster with the nearest mean, this mean serving as a prototype of the cluster (see Figure 3.1).
• µk ∈ Rp, k ∈ {1, ..., K}, are the means, where µk is the center of cluster k. We will denote by µ the associated matrix.
• zik are indicator variables associated with xi, such that zik = 1 if xi belongs to cluster k and zik = 0 otherwise. z is the matrix whose components are the zik.
3.1.2 Algorithm
The aim of the algorithm is to minimize the distortion
\[
J(\mu, z) = \sum_{i=1}^{n} \sum_{k=1}^{K} z_{ik} \, \| x_i - \mu_k \|^2 .
\]
To do so we proceed by alternating minimization: we minimize J with respect to z for µ fixed, then with respect to µ for z fixed, and we iterate these two steps until convergence.
Remark 3.1.1 The minimization step with respect to z is equivalent to allocating each xi to the Voronoi cell whose center is the nearest µk.
Remark 3.1.2 During the minimization step with respect to µ, µk is obtained by setting to zero the k-th block of the gradient of J with respect to µ. Indeed, we easily see that
\[
\nabla_{\mu_k} J = -2 \sum_{i} z_{ik} (x_i - \mu_k),
\]
so that µk is the mean of the points currently assigned to cluster k.
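To make the two alternating steps concrete, here is a minimal NumPy sketch of the procedure; the data matrix X (one observation per row), the number of clusters K, and the number of iterations are assumed inputs, and empty clusters are handled naively.

```python
import numpy as np

def kmeans(X, K, n_iter=100, rng=None):
    """Alternating minimization of the distortion J(mu, z)."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    # Initialization: K centers drawn at random among the data points.
    mu = X[rng.choice(n, size=K, replace=False)].astype(float).copy()
    z = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        # Minimization over z: assign each x_i to the nearest center,
        # i.e., to the Voronoi cell of the corresponding mu_k (Remark 3.1.1).
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        z = dists.argmin(axis=1)
        # Minimization over mu: mu_k is the mean of the points assigned
        # to cluster k (Remark 3.1.2); empty clusters are left unchanged.
        for k in range(K):
            if np.any(z == k):
                mu[k] = X[z == k].mean(axis=0)
    J = ((X - mu[z]) ** 2).sum()  # final distortion
    return mu, z, J
```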
3.1.3 K-means++
The intuition behind this approach is that it helps to spread out the K initial cluster centers. At each iteration of the algorithm we build a new center, and we repeat until we have K centers. Here are the steps of the algorithm:
• Step 0: Initialize the algorithm by choosing the first center uniformly at random among the data points.
• Step 1: For each data point xi of the data set, compute the distance between xi and the nearest center already chosen. We denote this distance Dµt(xi), where the subscript µt recalls that the minimum is taken over the currently chosen centers.
• Step 2: Choose one new data point at random as a new center, using a weighted probability distribution in which a point xi is chosen with probability proportional to Dµt(xi)².
• Step 3: Repeat Steps 1 and 2 until K centers have been chosen.
We have now built K centers in accordance with our initial intuition of spreading them out (thanks to the well-chosen weighted probability). These centers can be used as the initialization of the standard K-means algorithm; a sketch is given below.
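Here is a minimal NumPy sketch of this seeding procedure; the data matrix X and the number of clusters K are assumed inputs, and the returned centers can be fed to the kmeans sketch above.

```python
import numpy as np

def kmeans_pp_init(X, K, rng=None):
    """K-means++ seeding: choose K initial centers spread out over the data."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    # Step 0: first center chosen uniformly at random among the data points.
    centers = [X[rng.integers(n)]]
    while len(centers) < K:
        # Step 1: squared distance from each point to its nearest chosen center.
        C = np.array(centers)
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Step 2: draw a new center with probability proportional to D_{mu_t}(x_i)^2.
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centers)
```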
More details on the K-means++ algorithm can be found in [A].
[A] Arthur, D. and Vassilvitskii, S. (2007). k-means++: the advantages of careful seeding.
Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms.
3.1.4 Choice of K
It is important to point out that the choice of K is not universal. Indeed, if we increase K, the distortion J decreases, until it reaches 0 when K = n, that is to say when each data point is the center of its own cluster. To address this issue, one solution is to add to J a penalty on K. Usually it takes the following form:
\[
J(\mu, z, K) = \sum_{i=1}^{n} \sum_{k=1}^{K} z_{ik} \, \| x_i - \mu_k \|^2 + \lambda K .
\]
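A possible way to use this penalized criterion in practice is sketched below, assuming the kmeans function from the earlier sketch is available and that the penalty weight λ is chosen by the user.

```python
import numpy as np

def choose_K(X, K_max, lam, rng=0):
    """Select K by minimizing the penalized distortion J(mu, z, K) = J(mu, z) + lambda * K."""
    penalized = []
    for K in range(1, K_max + 1):
        _, _, J = kmeans(X, K, rng=rng)  # distortion of the K-means solution
        penalized.append(J + lam * K)
    return int(np.argmin(penalized)) + 1  # K values start at 1
```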
Figure 3.2. Example where K-means does not provide a satisfactory clustering result.
Using Gaussian mixtures provides a way to avoid this problem (see next section).
Assumption: (x, z) are random variables where x is observed (our data) and z is not observed (for instance, the unknown cluster assignment).
We can already sense that, because of the sum in the marginal likelihood pθ(x) = Σz pθ(x, z), the problem will be somewhat more difficult than before. Indeed, taking the log of this probability does not lead to a simple convex problem. In the following we will see that EM is a method for solving this kind of problem.
3.2.1 Example
Let us present a simple example to illustrate what we just said. The probability density represented in Figure 3.3 is akin to an average of two Gaussians. Thus, it is natural to use a mixture model and to introduce a hidden variable z, following a Bernoulli-type distribution, that indicates which Gaussian the point is sampled from.
Figure 3.3. Average of two Gaussian probability densities, for which it is natural to introduce a mixture model.
Thus we have z ∈ {1, 2} and x|z = i ∼ N(µi, Σi). The density p(x) is a convex combination of normal densities:
\[
p(x) = \pi_1 \, \mathcal{N}(x \mid \mu_1, \Sigma_1) + \pi_2 \, \mathcal{N}(x \mid \mu_2, \Sigma_2), \qquad \pi_i = p(z = i).
\]
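To make the example concrete, here is a small sketch that samples from such a two-component mixture and evaluates its density; the particular weights, means and covariances below are made-up illustration values, not from the lecture.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative parameters (assumed values).
pi = np.array([0.4, 0.6])                                  # p(z = 1), p(z = 2)
mu = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigma = [np.eye(2), 0.5 * np.eye(2)]

rng = np.random.default_rng(0)

# Sampling: draw the hidden component z, then x | z = i ~ N(mu_i, Sigma_i).
z = rng.choice(2, size=500, p=pi)
x = np.array([rng.multivariate_normal(mu[k], Sigma[k]) for k in z])

def density(x):
    """p(x): convex combination of the two Gaussian densities."""
    return sum(pi[k] * multivariate_normal.pdf(x, mean=mu[k], cov=Sigma[k]) for k in range(2))
```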
3.2.4 EM algorithm
We introduce a function q(z) such that q(z) ≥ 0 and Σz q(z) = 1 in the expression of the likelihood. Thus we have:
\begin{align*}
\log p_\theta(x) &= \log \sum_{z} p_\theta(x, z) \\
&= \log \sum_{z} q(z) \, \frac{p_\theta(x, z)}{q(z)} \\
&\geq \sum_{z} q(z) \log \frac{p_\theta(x, z)}{q(z)} \quad \text{by Jensen's inequality, since } \log \text{ is concave} \\
&= \sum_{z} q(z) \log p_\theta(x, z) - \sum_{z} q(z) \log q(z) \\
&= L(q, \theta),
\end{align*}
with equality if and only if q(z) = pθ(x, z) / Σz′ pθ(x, z′) = pθ(z|x) (by strict concavity of the logarithm).
Proposition 3.1 ∀θ, ∀q, log pθ(x) ≥ L(q, θ), with equality if and only if q(z) = pθ(z|x).
Remark 3.2.1 We have introduced an auxiliary function L(q, θ) that always lies below the function log pθ(x).
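The inequality of Proposition 3.1 can be checked numerically on a toy model; the sketch below uses a made-up one-dimensional two-component mixture and compares L(q, θ) for an arbitrary q with its value at the posterior q(z) = pθ(z|x).

```python
import numpy as np
from scipy.stats import norm

# Toy model (assumed values): z in {0, 1}, p(z) = pi_z, x | z ~ N(m_z, 1).
pi, m = np.array([0.3, 0.7]), np.array([-1.0, 2.0])
x = 0.5                                      # a single observed point

p_xz = pi * norm.pdf(x, loc=m, scale=1.0)    # joint p_theta(x, z) for z = 0, 1
log_px = np.log(p_xz.sum())                  # log p_theta(x)

def lower_bound(q):
    """L(q, theta) = sum_z q(z) [log p_theta(x, z) - log q(z)]."""
    return np.sum(q * (np.log(p_xz) - np.log(q)))

q_arbitrary = np.array([0.5, 0.5])
q_posterior = p_xz / p_xz.sum()              # p_theta(z | x)

print(lower_bound(q_arbitrary) <= log_px)             # True: L(q, theta) <= log p_theta(x)
print(np.isclose(lower_bound(q_posterior), log_px))   # True: equality at the posterior
```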
Algorithm properties
• The likelihood increases at each iteration: ∀t, log pθt(x) ≥ log pθt−1(x).
• The algorithm does not converge to a global maximum but rather to a local maximum, because we are dealing with a non-convex problem. An illustration is given in Figure 3.4.
• As was already the case for K-means, we rerun the algorithm several times in order to be more confident, and we keep the solution with the highest likelihood.
The EM recipe Let us recall the initial goal of the algorithm: to maximize the incomplete likelihood log pθ(x). To do so, we maximize the following function, which is always below log pθ(x):
\[
L(q, \theta) = \sum_{z} q(z) \log p_\theta(x, z) - \sum_{z} q(z) \log q(z).
\]
1. Compute the probability of Z given X, pθt(z|x) (corresponding to qt+1 = arg maxq L(q, θt)).
3. E-Step: compute the expected value of the complete log-likelihood with respect to the conditional distribution of Z given X under the current estimate θt of the parameter: EZ|X(lc).
For the Gaussian mixture model, the marginal distribution of each observation is
\[
p_\theta(x_i) = \sum_{z_i} p_\theta(x_i, z_i) = \sum_{z_i} p_\theta(x_i \mid z_i)\, p_\theta(z_i)
= \sum_{j=1}^{k} p_\theta(x_i \mid z_i = j)\, p_\theta(z_i = j),
\]
and the posterior probability of component j (the responsibility) is
\[
\tau_{ij}(\theta) = p_\theta(z_i = j \mid x_i)
= \frac{p_\theta(x_i \mid z_i = j)\, p_\theta(z_i = j)}{\sum_{l=1}^{k} p_\theta(x_i \mid z_i = l)\, p_\theta(z_i = l)} .
\]
We recall that
\[
\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right).
\]
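In code, the responsibilities τij(θ) can be computed directly from this density; here is a short sketch, assuming a data matrix X, mixing weights pi, and lists of means mus and covariances Sigmas.

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pi, mus, Sigmas):
    """tau[i, j] = p_theta(z_i = j | x_i), proportional to pi_j * N(x_i | mu_j, Sigma_j)."""
    weighted = np.column_stack([
        pi[j] * multivariate_normal.pdf(X, mean=mus[j], cov=Sigmas[j])
        for j in range(len(pi))
    ])
    return weighted / weighted.sum(axis=1, keepdims=True)
```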
The complete log-likelihood is
\begin{align*}
l_{c,t} = \log p_{\theta_t}(x, z) &= \sum_{i=1}^{n} \log p_{\theta_t}(x_i, z_i) \\
&= \sum_{i=1}^{n} \log\big( p_{\theta_t}(z_i)\, p_{\theta_t}(x_i \mid z_i) \big) \\
&= \sum_{i=1}^{n} \big( \log p_{\theta_t}(z_i) + \log p_{\theta_t}(x_i \mid z_i) \big) \\
&= \sum_{i=1}^{n} \sum_{j=1}^{k} z_{ij} \log \pi_{j,t} + \sum_{i=1}^{n} \sum_{j=1}^{k} z_{ij} \log \mathcal{N}(x_i \mid \mu_{j,t}, \Sigma_{j,t}).
\end{align*}
E-Step We can now write the expectation of the previous quantity with respect to the conditional distribution of Z given X. This amounts to replacing zij by EZ|X(zij) = pθt(zi = j|xi) = τij(θt); indeed, the other terms of the sum are constant from the point of view of the conditional distribution of Z given X. We thus obtain EZ|X(lc,t). Since the value of θt is fixed during the M-step, we drop the dependence on θt and write τij.
\[
\sum_{i=1}^{n} \sum_{j=1}^{k} \tau_{ij} \log \pi_{j,t}
+ \sum_{i=1}^{n} \sum_{j=1}^{k} \tau_{ij} \left[ \log\frac{1}{(2\pi)^{d/2}} + \log\frac{1}{|\Sigma_{j,t}|^{1/2}} - \frac{1}{2} (x_i - \mu_{j,t})^T \Sigma_{j,t}^{-1} (x_i - \mu_{j,t}) \right]
\]
M-Step We can now maximize with respect to µ and Σ. By setting the gradients with respect to µj,t and Σj,t to zero, we obtain:
\[
\mu_{j,t+1} = \frac{\sum_{i} \tau_{ij}\, x_i}{\sum_{i} \tau_{ij}}
\]
\[
\Sigma_{j,t+1} = \frac{\sum_{i} \tau_{ij}\, (x_i - \mu_{j,t+1})(x_i - \mu_{j,t+1})^T}{\sum_{i} \tau_{ij}}
\]
The M-step in the EM algorithm corresponds to the mean-estimation step in K-means. Note that the values of τij in the expressions above are taken at the parameter values of the previous iterate, i.e., τij = τij(θt).
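Putting the two steps together, here is a minimal NumPy sketch of one possible EM loop for the Gaussian mixture, reusing the responsibilities function sketched above; the mixing-weight update πj = (1/n) Σi τij is included for completeness even though only the µ and Σ updates are derived above, and the small ridge added to Σ is a numerical safeguard, not part of the lecture.

```python
import numpy as np

def m_step(X, tau):
    """M-step: closed-form updates for pi, mu and Sigma given the responsibilities."""
    n, d = X.shape
    Nj = tau.sum(axis=0)                         # sum_i tau_ij, one value per component
    mus = (tau.T @ X) / Nj[:, None]              # mu_j = sum_i tau_ij x_i / sum_i tau_ij
    Sigmas = []
    for j in range(tau.shape[1]):
        diff = X - mus[j]
        Sigmas.append((tau[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(d))
    pi = Nj / n                                  # standard mixing-weight update (see lead-in)
    return pi, mus, np.array(Sigmas)

def em_gmm(X, K, n_iter=50, seed=0):
    """Alternate E-steps and M-steps from a random initialization."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(K, 1.0 / K)
    mus = X[rng.choice(n, size=K, replace=False)].astype(float)
    cov0 = (X - X.mean(axis=0)).T @ (X - X.mean(axis=0)) / n
    Sigmas = np.array([cov0 + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        tau = responsibilities(X, pi, mus, Sigmas)   # E-step (sketched earlier)
        pi, mus, Sigmas = m_step(X, tau)             # M-step
    return pi, mus, Sigmas
```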
Definition 3.6 (maximal clique) A maximal clique C is a clique which is maximal for the inclusion order:
\[
\nexists\, v \in V : v \notin C \text{ and } C \cup \{v\} \text{ is a clique}
\]
(Figure 3.8).
Definition 3.7 (path) A path is a sequence of vertices in which consecutive vertices are connected by an edge and all vertices are distinct (Figure 3.9).
Definition 3.8 (cycle) A cycle is a sequence of vertices (v0, . . . , vk) in which consecutive vertices are connected by an edge and such that:
• v0 = vk
• ∀i, j, vi ≠ vj if {i, j} ≠ {0, k}
Definition 3.9 Let A, B, C be distinct subsets of V . C separates A and B if all paths from
A to B go through C (Figure 3.10).
Figure 3.10. C separates A and B.
In this course we will assume that the graph has a single connected component; otherwise, each connected component is dealt with independently.
Definition 3.15 (cycle) A cycle is a sequence of vertices (v0, . . . , vk) in which consecutive vertices are connected by an edge (Figure 3.12) and such that:
• v0 = vk
• ∀i, j, vi ≠ vj if {i, j} ≠ {0, k}
Definition 3.16 (DAG) A directed acyclic graph (DAG) is a directed graph without any
cycle.
• joint distribution: p(x1, . . . , xn) = P(X1 = x1, . . . , Xn = xn)
• conditional distribution: p(x | y) = P(X = x | Y = y) = p(x, y)/p(y)
Review
\begin{align*}
X \perp\!\!\!\perp Y &\iff p(x, y) = p(x)\, p(y) \quad \forall x, y \\
&\iff p_{XY}(x, y) = p_X(x)\, p_Y(y) \\
X \perp\!\!\!\perp Y \mid Z &\iff p(x, y \mid z) = p(x \mid z)\, p(y \mid z) \quad \forall x, y, z \\
&\iff p(x \mid y, z) = p(x \mid z)
\end{align*}
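The first of these equivalences is easy to check numerically on finite distributions; the sketch below uses two made-up 2×2 joint tables, one that factorizes into its marginals and one that does not.

```python
import numpy as np

def is_independent(pxy, tol=1e-12):
    """X independent of Y iff p(x, y) = p(x) p(y) for all x, y."""
    px = pxy.sum(axis=1)          # marginal p(x)
    py = pxy.sum(axis=0)          # marginal p(y)
    return np.allclose(pxy, np.outer(px, py), atol=tol)

# Joint table built as an outer product of marginals: independent.
p_indep = np.outer([0.3, 0.7], [0.6, 0.4])
# A joint table that does not factorize: dependent.
p_dep = np.array([[0.4, 0.1],
                  [0.1, 0.4]])

print(is_independent(p_indep))   # True
print(is_independent(p_dep))     # False
```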
A distribution p is said to factorize according to the DAG if it can be written as
\[
p(x_1, \ldots, x_n) = \prod_{i=1}^{n} f_i(x_i, x_{\pi_i}),
\]
with
• πi the set of parents of node i
• ∀i, fi > 0
• ∀i, Σxi fi(xi, xπi) = 1
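As an illustration, the sketch below builds such a factorization for a small made-up chain DAG X1 → X2 → X3 with binary variables: each factor fi is non-negative and sums to one over xi for every value of its parents, and the resulting product defines a valid joint distribution.

```python
import numpy as np

# Local factors for the chain X1 -> X2 -> X3 (illustrative numbers).
f1 = np.array([0.6, 0.4])                    # f1(x1), no parents
f2 = np.array([[0.9, 0.1],                   # f2(x2, x1): row = parent value x1, column = child value x2
               [0.2, 0.8]])
f3 = np.array([[0.7, 0.3],                   # f3(x3, x2): row = parent value x2, column = child value x3
               [0.1, 0.9]])

# Each factor sums to 1 over its own variable for every value of the parents.
assert np.allclose(f2.sum(axis=1), 1) and np.allclose(f3.sum(axis=1), 1)

# Joint distribution p(x1, x2, x3) = f1(x1) f2(x2, x1) f3(x3, x2).
p = np.einsum('a,ab,bc->abc', f1, f2, f3)

print(p.sum())                               # 1.0: the product is a probability distribution
```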