Week 7: Gaussian Mixture Models (GMM)
Semester 1
Mixture models

• Recall the types of clustering methods
  – hard clustering: clusters do not overlap
    • an element either belongs to a cluster or it does not
  – soft clustering: clusters may overlap
    • strength of association between clusters and instances
• Mixture models
  – a probabilistically-grounded way of doing soft clustering
  – each cluster: a generative model (Gaussian or multinomial)
  – parameters (e.g. mean/covariance) are unknown
• Expectation Maximization (EM) algorithm
  – automatically discovers all parameters for the K “sources”
Copyright © 2014 Victor Lavrenko
Mixture models in 1-d

• Observations x_1 … x_n
  – K = 2 Gaussians with unknown μ, σ²
  – estimation is trivial if we know the source of each observation; for source b
    (with n_b observations, and similarly for source a):

      μ_b = (x_1 + x_2 + … + x_n) / n_b
      σ_b² = ((x_1 − μ_b)² + … + (x_n − μ_b)²) / n_b

      (sums taken over the observations that came from source b)

• If the sources are unknown, soft-assign each observation given the current parameters:

      b_i = P(b | x_i) = P(x_i | b) P(b) / ( P(x_i | b) P(b) + P(x_i | a) P(a) )
      a_i = P(a | x_i) = 1 − b_i

  and re-estimate the parameters using these weights:

      μ_b = (b_1 x_1 + b_2 x_2 + … + b_n x_n) / (b_1 + b_2 + … + b_n)
      σ_b² = (b_1 (x_1 − μ_b)² + … + b_n (x_n − μ_b)²) / (b_1 + b_2 + … + b_n)
      μ_a = (a_1 x_1 + a_2 x_2 + … + a_n x_n) / (a_1 + a_2 + … + a_n)
      σ_a² = (a_1 (x_1 − μ_a)² + … + a_n (x_n − μ_a)²) / (a_1 + a_2 + … + a_n)

  – could also estimate the priors: P(b) = (b_1 + b_2 + … + b_n) / n, P(a) = 1 − P(b)
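A minimal sketch (not part of the original slides) of these two alternating steps for the 1-d, two-Gaussian case, assuming NumPy; the function and variable names (em_two_gaussians, prior_b, …) are illustrative:

```python
import numpy as np

def em_two_gaussians(x, n_iter=100):
    """EM for a mixture of two 1-d Gaussians a and b. x: 1-d array of observations."""
    # crude initialisation: spread the two means apart, share the overall variance
    mu_a, mu_b = np.percentile(x, 25), np.percentile(x, 75)
    sigma2_a = sigma2_b = np.var(x)
    prior_b = 0.5

    def gauss(x, mu, sigma2):
        return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

    for _ in range(n_iter):
        # E-step: responsibilities b_i = P(b | x_i), a_i = 1 - b_i
        pb = gauss(x, mu_b, sigma2_b) * prior_b
        pa = gauss(x, mu_a, sigma2_a) * (1 - prior_b)
        b = pb / (pb + pa)
        a = 1.0 - b

        # M-step: weighted means, variances, and prior, as on the slide
        mu_b = np.sum(b * x) / np.sum(b)
        mu_a = np.sum(a * x) / np.sum(a)
        sigma2_b = np.sum(b * (x - mu_b) ** 2) / np.sum(b)
        sigma2_a = np.sum(a * (x - mu_a) ** 2) / np.sum(a)
        prior_b = np.mean(b)

    return (mu_a, sigma2_a), (mu_b, sigma2_b), prior_b
```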
Gaussian Mixture Model

• Data with D attributes, from Gaussian sources c_1 … c_k
  – how typical is x_i under source c:

      P(x_i | c) = 1 / √( (2π)^D |Σ_c| ) · exp{ −½ (x_i − μ_c)^T Σ_c^{-1} (x_i − μ_c) }

      where the quadratic form expands over attributes as
      (x_i − μ_c)^T Σ_c^{-1} (x_i − μ_c) = Σ_a Σ_b (x_ia − μ_ca) [Σ_c^{-1}]_ab (x_ib − μ_cb)

  – how likely is it that x_i came from c:

      P(c | x_i) = P(x_i | c) P(c) / Σ_{c′=1..k} P(x_i | c′) P(c′)

  – how important is x_i for source c:

      w_{i,c} = P(c | x_i) / ( P(c | x_1) + … + P(c | x_n) )

  – mean of attribute a in items assigned to c:

      μ_ca = w_{1,c} x_1a + … + w_{n,c} x_na

  – covariance of attributes a and b in items from c:

      Σ_cab = Σ_{i=1..n} w_{i,c} (x_ia − μ_ca)(x_ib − μ_cb)

  – prior: how many items are assigned to c:

      P(c) = (1/n) ( P(c | x_1) + … + P(c | x_n) )
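A sketch (not from the slides) of one full EM pass over this D-dimensional model with full covariances, assuming NumPy and SciPy; the function name em_step and the array layout are assumptions:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, means, covs, priors):
    """One EM pass. X: (n, D) data; means: (K, D); covs: (K, D, D); priors: (K,)."""
    K = len(priors)

    # E-step: P(c | x_i) for every item i and source c
    dens = np.column_stack([
        priors[c] * multivariate_normal.pdf(X, means[c], covs[c]) for c in range(K)
    ])                                            # shape (n, K)
    resp = dens / dens.sum(axis=1, keepdims=True)

    # importance weights w_{i,c} = P(c | x_i) / (P(c | x_1) + ... + P(c | x_n))
    w = resp / resp.sum(axis=0, keepdims=True)

    # M-step: weighted mean and covariance of each source, plus its prior
    new_means = w.T @ X                           # element [c, a] = sum_i w_{i,c} x_{i,a}
    new_covs = np.stack([
        (w[:, c:c+1] * (X - new_means[c])).T @ (X - new_means[c]) for c in range(K)
    ])
    new_priors = resp.mean(axis=0)                # P(c) = (1/n) sum_i P(c | x_i)
    return new_means, new_covs, new_priors
```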
How to pick K?

• Probabilistic model – tries to “fit” the data (maximize likelihood)
• Pick the K that makes L as large as possible?
  – K = n: each data point has its own “source”
  – may not work well for new data points
• Split the points into a training set T and a validation set V
  – for each K: fit the parameters on T, measure the likelihood of V
  – sometimes still best when K = n
• Occam’s razor: pick the “simplest” of all models that fit
  – Bayesian Information Criterion (BIC): max_p { L − ½ p log n }
  – Akaike Information Criterion (AIC): min_p { 2p − L }
  – L … likelihood, how well our model fits the data
  – p … number of parameters, how “simple” the model is
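A small sketch (not in the slides) of this model-selection recipe, assuming NumPy; fit_gmm_loglik is a hypothetical placeholder for any routine that fits a K-component mixture and returns its maximised log-likelihood L:

```python
import numpy as np

def num_params(K, D):
    # K means (D values each), K full covariances (D*(D+1)/2 each), K-1 free priors
    return K * D + K * D * (D + 1) // 2 + (K - 1)

def pick_K(X, fit_gmm_loglik, K_range=range(1, 11)):
    """Score each K with BIC (maximise) and AIC (minimise) as defined above."""
    n, D = X.shape
    scores = {}
    for K in K_range:
        L = fit_gmm_loglik(X, K)                 # log-likelihood of the fitted K-component model
        p = num_params(K, D)
        scores[K] = (L - 0.5 * p * np.log(n),    # BIC: larger is better
                     2 * p - L)                  # AIC: smaller is better
    best_bic = max(scores, key=lambda K: scores[K][0])
    best_aic = min(scores, key=lambda K: scores[K][1])
    return best_bic, best_aic, scores
```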
Summary

• Walked through the 1-d version
  – works for higher dimensions
    • d-dimensional Gaussians, can be non-spherical
  – works for discrete data (text)
    • d-dimensional multinomial distributions (pLSI)
• Maximizes the likelihood of the data (equations below)
• Similar to K-means
  – sensitive to the starting point, converges to a local maximum
  – convergence: when the change in P(x_1 … x_n) is sufficiently small
  – cannot discover K (likelihood keeps growing with K)
• Different from K-means
  – soft clustering: an instance can come from multiple “clusters”
  – co-variance: the notion of “distance” changes over time
• How can you make GMM = K-means?
      L = log Π_{i=1..N} P(x_i) = Σ_{i=1..N} log Σ_{k=1..K} P(k) · 1/√(2πσ_k²) · exp{ −(x_i − μ_k)² / (2σ_k²) }

      ∂L/∂μ_j = Σ_{i=1..N} [ p_j N(x_i; μ_j, σ_j²) / Σ_{k=1..K} p_k N(x_i; μ_k, σ_k²) ] · (x_i − μ_j) / σ_j²
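As a brief added derivation (standard EM algebra, not spelled out on the slide): setting this gradient to zero recovers the weighted-mean M-step update, writing γ_ij for the responsibility of source j for point x_i:

```latex
\frac{\partial L}{\partial \mu_j} = 0
\;\Longrightarrow\;
\sum_{i=1}^{N} \gamma_{ij}\,\frac{x_i - \mu_j}{\sigma_j^2} = 0
\;\Longrightarrow\;
\mu_j = \frac{\sum_{i=1}^{N} \gamma_{ij}\, x_i}{\sum_{i=1}^{N} \gamma_{ij}},
\qquad
\gamma_{ij} = \frac{p_j\, N(x_i;\mu_j,\sigma_j^2)}{\sum_{k=1}^{K} p_k\, N(x_i;\mu_k,\sigma_k^2)} .
```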