EM Convergence Properties
Yu Chen
Department of Mathematics
Technical University of Munich
Germany
17.12.2020
Abstract
This thesis reviews Jeff Wu’s paper ON THE CONVERGENCE PROPERTIES OF THE
EM ALGORITHM [4], which studied two questions: (1) does the EM algorithm find a local
maximum or only a stationary value of the target likelihood over incomplete data? (2) does the
parameter sequence generated by the EM iteration converge? Wu was dedicated to correcting
the error that appeared in the paper MAXIMUM LIKELIHOOD FROM INCOMPLETE
DATA VIA THE EM ALGORITHM (WITH DISCUSSION) [2], on which his paper is based,
by presenting seven theorems and one corollary. In this thesis, we focus on studying the
relationships among these theorems.
1 Introduction
Expectation Maximization (EM), consisting of an E-step and an M-step, is an iterative algorithm
that tries to maximize the likelihood over incomplete data. The algorithm is popular not only in
statistics but also in optimization, machine learning and computer vision. Due to its wide range of
applications, it has many forms describing the E-step and M-step. Dempster, Laird and Rubin
(abbreviated DLR) [2] introduced a general form of the EM algorithm and analysed some of its
properties; however, their proof of convergence of an EM sequence is not entirely correct. Jeff Wu
corrected the error in his paper [4]. In this thesis, we adopt the general EM form from the DLR
paper and review Wu's corrected proof of the EM convergence properties.
Specifically, this thesis answers two questions regarding the convergence of EM. (1) Does
EM finally reach a global maximum, a local maximum, or merely a stationary value of the likelihood?
(2) Does the sequence generated by the EM iteration converge to a limit? The key to answering
the first question is the Global Convergence Theorem [5]. Based on this theorem, Jeff Wu derived
three theorems answering the first question [4]. In addition, four further theorems are introduced to
answer the second question. The EM sequence converges to the unique maximizer
when the likelihood is unimodal and a differentiability condition is satisfied [4].
Following DLR [2], we denote the density of the observed data by g(y|φ), where y = y(x) is the
observed incomplete data in Y and φ ∈ Ω, the parameter space. The relationship between the two
sample spaces X and Y is a many-to-one mapping from X to Y, which we discuss in detail in
section 3. One way to express this relationship is

g(y|φ) = ∫ f(x, y|φ) dx
In order to present the generalized EM, we first write down the likelihood of the incomplete obser-
vation y. To this end, we introduce the conditional density of x given y and φ (note that f(x|φ) =
f(x, y|φ), since y is determined by x):

k(x|y, φ) = f(x|φ) / g(y|φ),

so that

g(y|φ) = f(x|φ) / k(x|y, φ).
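As a concrete illustration (our own toy example, not taken from DLR or Wu), suppose the complete data is x = (y, z), where y is observed and z ∈ {1, 2} is an unobserved component label of a two-component Gaussian mixture with weights π1, π2 (π1 + π2 = 1) and unit variances. Writing N(y; μ, 1) for the normal density with mean μ and unit variance, we then have

f(x|φ) = πz N(y; μz, 1),    g(y|φ) = π1 N(y; μ1, 1) + π2 N(y; μ2, 1),

and k(x|y, φ) = πz N(y; μz, 1)/g(y|φ) is simply the posterior probability (the "responsibility") of label z given y under the parameter φ = (π1, μ1, μ2).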
Then the log likelihood L(φ′) is

L(φ′) = log g(y|φ′)
      = E_{x∼k(x|y,φ)} [log g(y|φ′)]
      = E_{x∼k(x|y,φ)} [log( f(x|φ′) / k(x|y, φ′) )]
      = E_{x∼k(x|y,φ)} [log f(x|φ′) − log k(x|y, φ′)]
      = E{log f(x|φ′) | y, φ} − E{log k(x|y, φ′) | y, φ}
      = Q(φ′|φ) − H(φ′|φ),                                (1)

where we assume that Q(φ′|φ) and H(φ′|φ) exist for all pairs (φ′, φ).
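Continuing the toy mixture example above, the expectation over x ∼ k(x|y, φ) reduces to a sum over the two possible labels, so for a single observation y

Q(φ′|φ) = Σ_{z=1}^{2} k(z|y, φ) log f(y, z|φ′),    H(φ′|φ) = Σ_{z=1}^{2} k(z|y, φ) log k(z|y, φ′),

and one can check directly that Q(φ′|φ) − H(φ′|φ) = log g(y|φ′) = L(φ′) for every fixed φ, in agreement with (1).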
Now we are interested in maximizing the log likelihood L(φ′) = Q(φ′|φ) − H(φ′|φ) over φ′. The
EM algorithm solves this via an iterative process:

φp → φp+1 ∈ M(φp)
However, Q(φ|φp) can be very complex, and it may not be numerically feasible to maximize it, so
we need a more general way to describe the EM algorithm [4]. Since we are only interested in the
convergence properties of the EM algorithm, we do not care about the specific method used for the
M-step; we only need the properties that the EM must satisfy, which are collected in the Generalized EM (GEM).
Dempster, Laird and Rubin (1977) defined the GEM algorithm in their DLR paper [2] as an
iterative scheme:
φp+1 ∈ M (φp )
where φ → M(φ) is a point-to-set map such that

Q(φ′|φ) ≥ Q(φ|φ)  ∀ φ′ ∈ M(φ)                      (2)
So we see that EM is a special case of the GEM. Moreover, two properties of the GEM have
been summarized in DLR (Theorem 1 and Lemma 1): a GEM iteration never decreases the likelihood,
i.e. L(φ′) ≥ L(φ) for every φ′ ∈ M(φ), and

H(φ|φ) ≥ H(φ′|φ)  ∀ φ′ ∈ Ω                          (3)
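To make the scheme concrete, the following sketch implements an EM (hence GEM) iteration for the toy two-component mixture above, with equal weights and unit variances so that only the two means are updated; the model, the function names and the data are our own choices for illustration and are not part of DLR or Wu.

    import numpy as np

    def e_step(y, mu, sigma=1.0):
        # E-step: responsibilities k(z|y, phi) for the toy equal-weight,
        # unit-variance two-component Gaussian mixture
        dens = np.exp(-0.5 * (y[:, None] - mu[None, :]) ** 2 / sigma**2)
        return dens / dens.sum(axis=1, keepdims=True)

    def m_step(y, resp):
        # M-step: the global maximizer of Q(.|phi_p) has a closed form here
        # (responsibility-weighted means), so phi_{p+1} satisfies (2)
        return resp.T @ y / resp.sum(axis=0)

    def log_likelihood(y, mu, sigma=1.0):
        # incomplete-data log likelihood L(phi) = sum_i log g(y_i | phi)
        dens = np.exp(-0.5 * (y[:, None] - mu[None, :]) ** 2 / sigma**2)
        return np.sum(np.log(0.5 * dens.sum(axis=1) / np.sqrt(2 * np.pi * sigma**2)))

    def em(y, mu0, n_iter=200):
        mu = np.asarray(mu0, dtype=float)
        for _ in range(n_iter):
            mu = m_step(y, e_step(y, mu))      # phi_{p+1} in M(phi_p)
        return mu

    rng = np.random.default_rng(0)
    y = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])
    mu_hat = em(y, mu0=[-1.0, 1.0])
    print(mu_hat, log_likelihood(y, mu_hat))

Along the iterates, L(φp) can be checked to increase monotonically, which is exactly the GEM property L(φ′) ≥ L(φ) noted above.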
To analyze the convergence of a GEM sequence, the following assumptions on Ω and L are imposed [4]:
• 1) Ω is a subset of the q-dimensional Euclidean space Rq
• 2) Ωφ0 = {φ ∈ Ω : L(φ) ≥ L(φ0 )} is compact for any L(φ0 ) > −∞
• 3) L(φ) is continuous in Ω and differentiable in the interior of Ω
• 4) {L(φp )}p≥0 is bounded above for any φ0 ∈ Ω
• 5) φp is in the interior of Ω, int(Ω)
• 6) φp converges to some φ∗ ∈ int(Ω) such that the Hessian matrices ∇²Q(φ∗|φ∗) and
∇²H(φ∗|φ∗) (taken with respect to the first argument) exist, and ∇²Q(φ′|φ) is continuous in (φ′, φ)
where assumption 4) is a consequence of the previous three assumptions. Assumption 6) gives us
the tools to analyze whether L∗ is a global maximum, a local maximum, or just a stationary value.
In the M-step of EM, we globally maximize Q(φ′|φ) for the current φ, so based on assumption 6),
∇²Q(φ∗|φ∗) is non-positive definite (n.p.d.). According to Lemma 2 of DLR,
−∇²H(φ∗|φ∗) is non-negative definite (n.n.d.). Therefore, the Hessian matrix of the log-likelihood,
∇²L(φ∗) = ∇²Q(φ∗|φ∗) − ∇²H(φ∗|φ∗), may not be n.p.d., i.e. φ∗ may not be a local maximum of L.
Murray [3] gave an example in which the EM converges to a stationary value rather than a
local maximum.
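A quick numerical check (our own toy numbers, not from either paper) illustrates why the difference of two n.p.d. matrices need not be n.p.d.:

    import numpy as np

    Q_hess = np.array([[-1.0, 0.0], [0.0, -5.0]])   # stands in for grad^2 Q(phi*|phi*), n.p.d.
    H_hess = np.array([[-4.0, 0.0], [0.0, -1.0]])   # stands in for grad^2 H(phi*|phi*), n.p.d.
    L_hess = Q_hess - H_hess                        # grad^2 L(phi*) = grad^2 Q - grad^2 H
    print(np.linalg.eigvalsh(L_hess))               # [-4.  3.]: indefinite, so not n.p.d.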
Now, to answer the global/local/stationary question, we introduce the notion of a point-to-set map M
on a set X, i.e. M maps points of X to subsets of X. M is called closed at x∗ if xk ∈ X,
xk → x∗, yk ∈ M(xk) and yk → y∗ together imply y∗ ∈ M(x∗). With this concept, we introduce
the following theorem.
Global Convergence Theorem.
• {xk}∞k=0 is generated by xk+1 ∈ M(xk), where M is a p2s map on a set X
• Solution set Γ ⊂ X
Suppose:
- 1) xk ∈ S for all k, where S ⊂ X is compact
- 2) M is closed over ΓC
- 3) there is a continuous function α on X such that
a) if x ∈ ΓC , then α(y) > α(x) ∀ y ∈ M(x);
b) if x ∈ Γ, then α(y) ≥ α(x) ∀ y ∈ M(x).
Then all limit points of {xk} are in Γ, and α(xk) converges monotonically to α(x∗) for some x∗ ∈ Γ.
PROOF see Appendix. Figure 1 helps to understand the relationships among the concepts in the
theorem and illustrates the first consequence, namely that the limit points of the sequence {xk} fall into
the solution set Γ.
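As a toy illustration of the theorem (our own example, not from Zangwill or Wu), take gradient ascent on a smooth function as the map M — a continuous point-to-point map, which is automatically closed — and the objective itself as the ascent function α:

    import numpy as np

    def alpha(x):
        # ascent function: its stationary points are x = -1, 0, 1
        return -0.25 * x**4 + 0.5 * x**2

    def M(x, step=0.1):
        # one gradient-ascent step; a continuous point-to-point map is closed
        return x + step * (x - x**3)

    x = 0.3                                  # initial point x_0
    for k in range(200):
        x_new = M(x)
        assert alpha(x_new) >= alpha(x)      # the ascent property, condition 3)
        x = x_new
    print(x, alpha(x))                       # approaches the stationary point x* = 1

Here α(xk) increases monotonically and the iterates approach the solution set Γ of stationary points, mirroring the two conclusions of the theorem.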
Now, we set M as the point-to-set map in a GEM iteration and set the function α as the
log-likelihood function L. Additionally, let the solution set Γ be one of the following:
• M: the set of local maxima in the interior of Ω
• ϕ: the set of stationary points in the interior of Ω
Then we get Theorem 1 as a special case of the Global Convergence Theorem.
Theorem 1
φp is a GEM sequence generated by φp+1 ∈ M (φp ), and suppose:
1) M is a closed p2s map over ϕC (or MC)
2) L(φp+1) > L(φp) ∀ φp ∉ ϕ (or M)
Then all limit points of {φp} are stationary points (or local maxima) of L, and L(φp) converges
monotonically to L∗ = L(φ∗) for some φ∗ ∈ ϕ (or M).
Note that if Q(ψ|φ) is continuous in both ψ and φ, then condition 1) in Theorem 1 is satisfied [4].
In fact, for an EM sequence this continuity also implies condition 2) of Theorem 1, which gives Theorem 2.
Theorem 2
If Q(ψ|φ) is continuous in ψ and φ, then all limit points of {φp } in an EM are stationary points of
L, and L(φp ) converges monotonically to L∗ = L(φ∗ ) for some point φ∗ .
PROOF
Condition 1) of Theorem 1 already holds, so we only need to verify condition 2).
∵ H(φ|φ) ≥ H(φ′|φ) ∀ φ′ ∈ Ω (equation (3)), i.e. H(φ′|φp) is maximized at φ′ = φp
∴ ∇H(φp|φp) = 0
∴ ∇L(φp) = ∇Q(φp|φp) ≠ 0 ∀ φp ∉ ϕ
∵ the M-step of EM globally maximizes Q(·|φp), so Q(φp+1|φp) > Q(φp|φp) whenever ∇Q(φp|φp) ≠ 0, and H(φp|φp) ≥ H(φp+1|φp)
∴ L(φp+1) > L(φp)
Theorem 2 can easily be applied because the continuity condition is not too strong. For example,
if the complete-data density f(x|φ) belongs to the curved exponential family, then the continuity
condition holds.
Curved Exponential Family
X is a random vector with p.d.f. f(x|φ) on a sample space X, of the form

f(x|φ) = A(φ) exp( Σ_{i=1}^{k} Ti(x) ηi(φ) ) h(x),

where each Ti(x) is a real-valued statistic, each ηi(φ) is a real-valued function on the parameter space
Ω ⊆ Rq, and q < k ∈ N.
If covφ(T⃗) (with T⃗ = [T1, T2, . . . , Tk]) is positive definite, then X belongs to the curved exponential family.
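A standard instance (our own illustration) is the normal family N(μ, μ²) with a single parameter φ = μ > 0, so that q = 1 < k = 2:

f(x|φ) = (2πμ²)^(−1/2) e^(−1/2) exp( x·(1/μ) + x²·(−1/(2μ²)) ),

i.e. T1(x) = x, T2(x) = x², η1(φ) = 1/μ, η2(φ) = −1/(2μ²), A(φ) = (2πμ²)^(−1/2) e^(−1/2) and h(x) ≡ 1. Since x and x² are not affinely related under a normal distribution, covφ(T⃗) is positive definite, so this family is a curved exponential family.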
Note that Theorem 2 does NOT apply to M (consider some φp ∈ ϕ with φp ∉ M). So how can we
ensure that the sequence L(φp) converges to a local maximum? We need another condition, which
leads to the following theorem.
Theorem 3
If Q(ψ|φ) is continuous in ψ and φ, and sup_{φ′∈Ω} Q(φ′|φ) > Q(φ|φ) ∀ φ ∈ ϕ\M,
then all limit points of {φp} in an EM are local maxima of L, and L(φp) converges monotonically
to L∗ = L(φ∗) for some local maximum φ∗.
However, the new condition in Theorem 3 is hard to verify in real applications. Therefore,
Theorem 1 is the most general answer to our first question, and Theorem 2 provides a practical
basis for applications.
We have now answered the first question: it is not easy to guarantee that L(φp)
converges to a local maximum. However, we still do not know whether the sequence {φp} generated
by the EM process converges to a specific point. Even though we know that L(φp) converges
to L∗, this does not imply the convergence of the GEM (EM) sequence {φp}, because
this sequence is generated by a point-to-set map M. We study this question in the next section.
To study the limit points of {φp}, define, for a real number a,

ϕ(a) = {φ ∈ ϕ : L(φ) ≡ a},    M(a) = {φ ∈ M : L(φ) ≡ a}
According to these definitions, if L(φp) → L∗, then the limit points of φp are in ϕ(L∗) (or
M(L∗)). Notice that if ϕ(L∗) (or M(L∗)) consists of a single point φ∗, then lim_{p→∞} φp = φ∗, so we
have the following theorem.
Theorem 4
φp is a GEM sequence generated by φp+1 ∈ M (φp ), and suppose:
1) M is a closed p2s map over ϕC (or MC)
2) L(φp+1) > L(φp) ∀ φp ∉ ϕ (or M)
If ϕ(L∗) = {φ∗} (or M(L∗) = {φ∗}), where L∗ = lim_{p→∞} L(φp), then lim_{p→∞} φp = φ∗.
Note that conditions 1) and 2) in Theorem 4 are the two conditions from Theorem 1; we only
introduce the extra condition ϕ(L∗) = {φ∗} (or M(L∗) = {φ∗}). This new condition can be
relaxed to lim_{p→∞} ||φp+1 − φp|| = 0, which gives the next theorem, Theorem 5.
Theorem 5
φp is a GEM sequence generated by φp+1 ∈ M (φp ), and suppose:
1) M is a closed p2s map over ϕC (or MC)
2) L(φp+1) > L(φp) ∀ φp ∉ ϕ (or M)
If lim_{p→∞} ||φp+1 − φp|| = 0, then all limit points of {φp} are in a connected and compact subset of
ϕ(L∗) (or M(L∗)), where L∗ = lim_{p→∞} L(φp).
(Here, a connected set is one that cannot be written as a union of two disjoint nonempty relatively open subsets.)
In particular, if ϕ(L∗) (or M(L∗)) is discrete, then φp converges to some φ∗ in ϕ(L∗) (or M(L∗)).
PROOF
see Theorem 28.1 of Ostrowski (1967) [1] and Theorem 1.
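In practice, the relaxed condition lim_{p→∞} ||φp+1 − φp|| = 0 can only be checked empirically. A minimal sketch (assuming a hypothetical user-supplied function em_step performing one E-step plus M-step, which is not defined in Wu's paper) might simply track the successive differences:

    import numpy as np

    def run_em(phi0, em_step, n_iter=500):
        # iterate phi_{p+1} = em_step(phi_p) and record ||phi_{p+1} - phi_p||
        phi = np.asarray(phi0, dtype=float)
        diffs = []
        for _ in range(n_iter):
            phi_next = em_step(phi)            # em_step is hypothetical here
            diffs.append(np.linalg.norm(phi_next - phi))
            phi = phi_next
        return phi, diffs                      # diffs -> 0 is consistent with Theorem 5

    # example with a contraction standing in for an EM update
    phi_star, diffs = run_em([0.0, 0.0], em_step=lambda p: 0.5 * p + 1.0)
    print(phi_star, diffs[-1])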
Both Theorem 4 and Theorem 5 inherit conditions 1) and 2) from Theorem 1, but we can impose a
condition that implies both of them and at the same time strengthen the conditions after "if" in
Theorems 4 and 5, which yields a new theorem, Theorem 6. To do this, we define a new set:

ψ(L) = {φ ∈ Ω : L(φ) ≡ L}
Theorem 6
φp is a GEM sequence generated by φp+1 ∈ M(φp) with ∇Q(φp+1|φp) = 0, and suppose ∇Q(φ′|φ)
is continuous in φ′ and φ.
If either (a) ψ(L∗) = {φ∗}, or (b) lim_{p→∞} ||φp+1 − φp|| = 0 and ψ(L∗) is discrete, then φp
converges to a stationary point φ∗ with L(φ∗) = L∗, the limit of L(φp).
PROOF
The continuity of ∇Q(φ′|φ) implies conditions 1) and 2) of Theorem 1.
Condition (a) or (b) in Theorem 6 is stronger than the condition after "if" in Theorem 4 or Theorem 5, respectively.
The continuity of ∇Q(φ′|φ) together with ∇Q(φp+1|φp) = 0 implies ∇L(φ∗) = ∇Q(φ∗|φ∗) = 0, so φ∗ is a stationary point.
Condition (a) in Theorem 6 can be replaced by the assumption that L(φ) is unimodal in Ω with φ∗
being the only stationary point, which gives a corollary of Theorem 6 (Theorem 7).
Theorem 7
Suppose that L(φ) is unimodal in Ω with φ∗ being the only stationary point, and that ∇Q(φ′|φ)
is continuous in φ′ and φ. Then for any EM sequence {φp}, φp converges to the unique maximizer
φ∗ of L(φ).
5 Summary
• (1) For an EM sequence {φp } that increases the likelihood L(φp ), if L(φp ) is bounded above,
it converges to some L∗ .
• (2) If Q(ψ|φ) is continuous in ψ and φ, then all limit points of {φp } in an EM are stationary
points of L, and L(φp ) converges monotonically to L∗ = L(φ∗ ) for some point φ∗ . The curved
exponential family satisfies the continuity condition of Q. Additionally, if {φp} converges to
some limit φ∗, then φ∗ is a stationary point under the condition that ∇Q(φ′|φ) is continuous
in φ′ and φ.
• (3) To ensure that the limit of L(φp) is not merely a stationary value but a local maximum under
the conditions of the previous item, we need the additional condition sup_{φ′∈Ω} Q(φ′|φ) > Q(φ|φ) ∀ φ ∈
ϕ\M. However, this condition is hard to verify in practice. To deal with this issue, Jeff Wu
suggests launching the EM algorithm from several representative initial points in the parameter
space, because whether EM gets trapped at stationary points that are not local maxima depends
strongly on the initializers (see the sketch following this summary).
• (4) In addition to item (2), if either (a) ψ(L∗) = {φ∗}, or (b) lim_{p→∞} ||φp+1 − φp|| = 0 and
ψ(L∗) is discrete, then φp converges to a stationary point φ∗ with L(φ∗) = L∗, the
limit of L(φp).
• (5) If L(φ) is unimodal in Ω with φ∗ being the only stationary point and ∇Q(φ′|φ) is
continuous in φ′ and φ, then for any EM sequence {φp}, φp converges to the unique maximizer
φ∗ of L(φ).
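The multiple-initialization strategy mentioned in item (3) can be sketched as follows for the toy two-component mixture used earlier; the model and the choice of starting points are our own assumptions for illustration:

    import numpy as np

    def em_two_means(y, mu0, n_iter=200, sigma=1.0):
        # EM for a toy equal-weight, unit-variance two-component Gaussian mixture;
        # only the two component means are estimated
        mu = np.asarray(mu0, dtype=float)
        for _ in range(n_iter):
            dens = np.exp(-0.5 * (y[:, None] - mu[None, :]) ** 2 / sigma**2)
            resp = dens / dens.sum(axis=1, keepdims=True)           # E-step
            mu = resp.T @ y / resp.sum(axis=0)                       # M-step
        dens = np.exp(-0.5 * (y[:, None] - mu[None, :]) ** 2 / sigma**2)
        loglik = np.sum(np.log(0.5 * dens.sum(axis=1) / np.sqrt(2 * np.pi * sigma**2)))
        return mu, loglik

    rng = np.random.default_rng(1)
    y = np.concatenate([rng.normal(-2, 1, 300), rng.normal(2, 1, 300)])

    # launch EM from several representative initial points and keep the best run
    starts = [[-3.0, 3.0], [0.1, 0.2], [5.0, 6.0]]
    runs = [em_two_means(y, mu0) for mu0 in starts]
    best_mu, best_loglik = max(runs, key=lambda r: r[1])
    print(best_mu, best_loglik)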
6 Appendix
6.1 Proof of Global Convergence Theorem
Firstly, we prove the second consequence, that α(xk) converges monotonically to α(x∗) for some
limit point x∗ of {xk}∞k=0. Since all xk lie in the compact set S, such a limit point x∗ exists, and
there is a subsequence {xkj}∞j=0 such that lim_{j→∞} xkj = x∗. Since the ascent function α(·) is
continuous, we have lim_{j→∞} α(xkj) = α(x∗). Additionally, by condition 3) of the theorem, α is
monotonically increasing along {xk}∞k=0, so α(x∗) ≥ α(xk) ∀ k. Since lim_{j→∞} xkj = x∗, for any
ε > 0 there exists a j0 such that α(x∗) − α(xkj) < ε for all j ≥ j0. Hence, for all k ≥ kj0,

α(x∗) − α(xk) = α(x∗) − α(xkj0) + α(xkj0) − α(xk) < ε,

because α(xkj0) − α(xk) ≤ 0 by monotonicity. This implies that α(xk) converges monotonically to α(x∗).
Secondly, we prove x∗ ∈ Γ by contradiction. Suppose x∗ ∉ Γ, and consider the sequence
{xkj+1}∞j=0, which satisfies xkj+1 ∈ M(xkj). Since every xkj+1 lies in the compact set S, this sequence
has a convergent subsequence with limit x∗∗; passing to the corresponding subsequence of {xkj}, we
still have xkj → x∗. Moreover, M is closed on X\Γ and x∗ ∉ Γ, so x∗∗ ∈ M(x∗). By the previous part,
lim_{k→∞} α(xk) = α(x∗), so α(x∗∗) = α(x∗), which contradicts a) of condition 3) in the theorem, since
x∗ ∉ Γ and x∗∗ ∈ M(x∗) would require α(x∗∗) > α(x∗).
References
[1] A. M. Ostrowski. Solution of Equations and Systems of Equations. Academic Press, 1967.
[2] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete
data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological),
39(1):1–22, 1977.
[3] Gordon D. Murray. Contribution to the discussion of the paper by A. P. Dempster, N. M. Laird and D. B. Rubin.
J. Roy. Statist. Soc. Ser. B, 39:27–28, 1977.
[4] C. F. Jeff Wu. On the convergence properties of the EM algorithm. The Annals of Statistics,
pages 95–103, 1983.
[5] Willard I. Zangwill. Nonlinear Programming: A Unified Approach, volume 52. Prentice-Hall,
Englewood Cliffs, NJ, 1969.