Week 7: GMM
Introduction to Applied Machine Learning

IAML: Mixture models and EM

Victor Lavrenko and Nigel Goddard
School of Informatics
Semester 1
Mixture models
• Recall the types of clustering methods
  – hard clustering: clusters do not overlap
    • an element either belongs to a cluster or it does not
  – soft clustering: clusters may overlap
    • strength of association between clusters and instances
• Mixture models
  – a probabilistically grounded way of doing soft clustering
  – each cluster: a generative model (Gaussian or multinomial)
  – parameters (e.g. mean/covariance) are unknown
• Expectation Maximization (EM) algorithm
  – automatically discovers all parameters for the K "sources"
Mixture models in 1-d
• Observations x1 … xn
  – K = 2 Gaussians with unknown μ, σ²
  – estimation is trivial if we know the source of each observation:

    $\mu_b = \frac{x_1 + x_2 + \dots + x_{n_b}}{n_b}$

    $\sigma_b^2 = \frac{(x_1 - \mu_b)^2 + \dots + (x_{n_b} - \mu_b)^2}{n_b}$

• What if we don't know the source?
• If we knew the parameters of the Gaussians (μ, σ²)
  – we can guess whether a point is more likely to have come from a or b:

    $P(b \mid x_i) = \frac{P(x_i \mid b)\,P(b)}{P(x_i \mid b)\,P(b) + P(x_i \mid a)\,P(a)}$

    $P(x_i \mid b) = \frac{1}{\sqrt{2\pi\sigma_b^2}} \exp\!\left\{ -\frac{(x_i - \mu_b)^2}{2\sigma_b^2} \right\}$
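Both halves of this setup are easy to write down in code. Below is a minimal NumPy sketch (the names fit_gaussian, gaussian_pdf and posterior_b are illustrative, not from the slides): the first function estimates μb and σb² from points whose source is known; the other two compute P(b | xi) by Bayes' rule when the parameters are known.

```python
import numpy as np

def fit_gaussian(points_b):
    """MLE of mean and variance for the points known to come from source b."""
    mu_b = points_b.mean()
    var_b = ((points_b - mu_b) ** 2).mean()   # divide by n_b, as on the slide
    return mu_b, var_b

def gaussian_pdf(x, mu, var):
    """1-d Gaussian density P(x | source)."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def posterior_b(x, mu_a, var_a, mu_b, var_b, prior_b=0.5):
    """P(b | x) via Bayes' rule, with P(b) = prior_b and P(a) = 1 - prior_b."""
    pb = gaussian_pdf(x, mu_b, var_b) * prior_b
    pa = gaussian_pdf(x, mu_a, var_a) * (1 - prior_b)
    return pb / (pb + pa)
```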
Expectation Maximization (EM)
• Chicken-and-egg problem
  – need (μa, σa²) and (μb, σb²) to guess the source of points
  – need to know the source to estimate (μa, σa²) and (μb, σb²)
• EM algorithm (sketched in code below)
  – start with two randomly placed Gaussians (μa, σa²), (μb, σb²)
  – E-step: for each point, compute P(b | xi) = "does it look like it came from b?"
  – M-step: adjust (μa, σa²) and (μb, σb²) to fit the points assigned to them
  – iterate until convergence
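A minimal NumPy sketch of this loop for two 1-d Gaussians follows. The function name em_two_gaussians, the random initialization scheme, and the fixed iteration count are assumptions made for illustration; the updates it performs are exactly the equations spelled out on the next slide.

```python
import numpy as np

def em_two_gaussians(x, n_iters=50, rng=None):
    """EM for a mixture of two 1-d Gaussians a and b (minimal sketch)."""
    def gaussian_pdf(x, mu, var):          # same density as in the sketch above
        return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

    rng = np.random.default_rng(rng)
    mu_a, mu_b = rng.choice(x, size=2, replace=False)   # random starting means
    var_a = var_b = x.var()                              # shared starting variance
    p_b = 0.5                                            # equal starting priors

    for _ in range(n_iters):
        # E-step: responsibility of each source for each point
        pb = gaussian_pdf(x, mu_b, var_b) * p_b
        pa = gaussian_pdf(x, mu_a, var_a) * (1 - p_b)
        b = pb / (pb + pa)        # b_i = P(b | x_i)
        a = 1.0 - b               # a_i = P(a | x_i)

        # M-step: responsibility-weighted means, variances and prior
        mu_b = (b * x).sum() / b.sum()
        var_b = (b * (x - mu_b) ** 2).sum() / b.sum()
        mu_a = (a * x).sum() / a.sum()
        var_a = (a * (x - mu_a) ** 2).sum() / a.sum()
        p_b = b.mean()            # P(b) = (b_1 + ... + b_n) / n

    return mu_a, var_a, mu_b, var_b, p_b
```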



EM: 1-d example
• E-step: compute responsibilities for each point

  $P(x_i \mid b) = \frac{1}{\sqrt{2\pi\sigma_b^2}} \exp\!\left\{ -\frac{(x_i - \mu_b)^2}{2\sigma_b^2} \right\}$

  $b_i = P(b \mid x_i) = \frac{P(x_i \mid b)\,P(b)}{P(x_i \mid b)\,P(b) + P(x_i \mid a)\,P(a)}$

  $a_i = P(a \mid x_i) = 1 - b_i$

• M-step: re-estimate parameters with responsibility-weighted averages

  $\mu_b = \frac{b_1 x_1 + b_2 x_2 + \dots + b_n x_n}{b_1 + b_2 + \dots + b_n}$

  $\sigma_b^2 = \frac{b_1 (x_1 - \mu_b)^2 + \dots + b_n (x_n - \mu_b)^2}{b_1 + b_2 + \dots + b_n}$

  $\mu_a = \frac{a_1 x_1 + a_2 x_2 + \dots + a_n x_n}{a_1 + a_2 + \dots + a_n}$

  $\sigma_a^2 = \frac{a_1 (x_1 - \mu_a)^2 + \dots + a_n (x_n - \mu_a)^2}{a_1 + a_2 + \dots + a_n}$

• Could also estimate the priors:

  $P(b) = \frac{b_1 + b_2 + \dots + b_n}{n}, \qquad P(a) = 1 - P(b)$
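As a quick usage illustration of the em_two_gaussians sketch above, on synthetic data with known parameters (the numbers are indicative only, and the recovered components a/b may be swapped depending on initialization):

```python
import numpy as np

# synthetic 1-d data: 300 points from N(-2, 1) and 200 points with std 0.5 (variance 0.25)
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 300),
                    rng.normal( 3.0, 0.5, 200)])

mu_a, var_a, mu_b, var_b, p_b = em_two_gaussians(x, n_iters=100, rng=0)
print(mu_a, var_a)   # roughly -2 and 1 (or the other component, if labels swapped)
print(mu_b, var_b)   # roughly  3 and 0.25
print(p_b)           # roughly 200 / 500 = 0.4
```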
Gaussian Mixture Model
• Data with D attributes, from Gaussian sources c1 … ck
  – how typical is xi under source c:

    $P(\vec{x}_i \mid c) = \frac{1}{\sqrt{(2\pi)^D \,\lvert\Sigma_c\rvert}} \exp\!\left\{ -\tfrac{1}{2} (\vec{x}_i - \vec{\mu}_c)^{T} \Sigma_c^{-1} (\vec{x}_i - \vec{\mu}_c) \right\}$

    where $(\vec{x}_i - \vec{\mu}_c)^{T} \Sigma_c^{-1} (\vec{x}_i - \vec{\mu}_c) = \sum_a \sum_b (x_{ia} - \mu_{ca}) \, [\Sigma_c^{-1}]_{ab} \, (x_{ib} - \mu_{cb})$

  – how likely it is that xi came from c:

    $P(c \mid \vec{x}_i) = \frac{P(\vec{x}_i \mid c)\,P(c)}{\sum_{c'=1}^{k} P(\vec{x}_i \mid c')\,P(c')}$

  – how important xi is for source c:

    $w_{ci} = P(c \mid \vec{x}_i) \,\big/\, \left( P(c \mid \vec{x}_1) + \dots + P(c \mid \vec{x}_n) \right)$

  – mean of attribute a in items assigned to c:

    $\mu_{ca} = w_{c1} x_{1a} + \dots + w_{cn} x_{na}$

  – covariance of attributes a and b in items from c:

    $\Sigma_{c,ab} = \sum_{i=1}^{n} w_{ci} (x_{ia} - \mu_{ca})(x_{ib} - \mu_{cb})$

  – prior: how many items are assigned to c:

    $P(c) = \tfrac{1}{n} \left( P(c \mid \vec{x}_1) + \dots + P(c \mid \vec{x}_n) \right)$
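In practice these multivariate updates are usually left to a library. Here is a brief sketch using scikit-learn's GaussianMixture; the data matrix X and the choice of 3 components are placeholders, and fit() runs exactly this kind of EM loop.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(500, 3))   # placeholder (n, D) data

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(X)                      # EM: alternate the E-step and M-step above
resp = gmm.predict_proba(X)     # soft assignments P(c | x_i), shape (n, K)
print(gmm.means_)               # one mean vector per source c
print(gmm.covariances_)         # one D x D covariance matrix per source c
print(gmm.weights_)             # the priors P(c)
```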
How to pick K?
• Probabilistic model
  – tries to "fit" the data (maximize the likelihood L)
• Pick the K that makes L as large as possible?
  – K = n: each data point has its own "source"
  – may not work well for new data points
• Split points into a training set T and a validation set V
  – for each K: fit parameters on T, measure the likelihood of V
  – sometimes still best when K = n
• Occam's razor: pick the "simplest" of all models that fit
  – Bayesian Information Criterion (BIC): max_p { L - ½ p log n }
  – Akaike Information Criterion (AIC): min_p { 2p - 2L }
  – L … likelihood: how well the model fits the data
  – p … number of parameters: how "simple" the model is
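A brief sketch of this selection procedure with scikit-learn; note that sklearn's bic() and aic() follow the "lower is better" convention (they return -2L + p log n and 2p - 2L), so we minimize them rather than maximize the slide's expressions. X and the range of K are placeholders.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(500, 3))   # placeholder (n, D) data

scores = {}
for k in range(1, 11):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    # sklearn convention: lower is better (bic = -2L + p log n, aic = 2p - 2L)
    scores[k] = (gmm.bic(X), gmm.aic(X))

best_k = min(scores, key=lambda k: scores[k][0])   # pick K by BIC
print(best_k, scores[best_k])
```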
Summary
• Walked through the 1-d version
  – works for higher dimensions
    • d-dimensional Gaussians, can be non-spherical
  – works for discrete data (text)
    • d-dimensional multinomial distributions (pLSI)
• Maximizes the likelihood of the data (see the equations below)
• Similar to K-means
  – sensitive to the starting point, converges to a local maximum
  – convergence: when the change in P(x1 … xn) is sufficiently small
  – cannot discover K (likelihood keeps growing with K)
• Different from K-means
  – soft clustering: an instance can come from multiple "clusters"
  – covariance: the notion of "distance" changes over time
• How can you make GMM = K-means?
Log-likelihood maximized by EM (1-d mixture of K Gaussians), and its gradient with respect to μj:

$L = \log \prod_{i=1}^{N} P(x_i) = \sum_{i=1}^{N} \log \sum_{k=1}^{K} P(k)\, \frac{1}{\sqrt{2\pi\sigma_k^2}} \exp\!\left\{ -\frac{(x_i - \mu_k)^2}{2\sigma_k^2} \right\}$

$\frac{\partial L}{\partial \mu_j} = \sum_{i=1}^{N} \frac{p_j\, N(x_i;\, \mu_j, \sigma_j^2)}{\sum_{k=1}^{K} p_k\, N(x_i;\, \mu_k, \sigma_k^2)} \cdot \frac{x_i - \mu_j}{\sigma_j^2}$
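For the convergence check mentioned in the summary (stop when the change in P(x1 … xn) is sufficiently small), the log-likelihood above is cheap to evaluate. A minimal sketch for the two-component 1-d case, reusing the parameter names of the earlier sketches:

```python
import numpy as np

def log_likelihood(x, mu_a, var_a, mu_b, var_b, p_b):
    """L = sum_i log( P(a) N(x_i; mu_a, var_a) + P(b) N(x_i; mu_b, var_b) )."""
    def norm_pdf(x, mu, var):
        return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    mix = (1 - p_b) * norm_pdf(x, mu_a, var_a) + p_b * norm_pdf(x, mu_b, var_b)
    return np.log(mix).sum()

# e.g. stop EM once the increase in log_likelihood between successive
# iterations falls below a small tolerance such as 1e-6
```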
