Week 7: GMM
Introduction to Applied Machine Learning

IAML: Mixture models and EM

Victor Lavrenko and Nigel Goddard
School of Informatics

Semester 1
Mixture models
• Recall types of clustering methods
– hard clustering: clusters do not overlap
• an element either belongs to a cluster or it does not
– soft clustering: clusters may overlap
• strength of association between clusters and instances
• Mixture models
– a probabilistically grounded way of doing soft clustering
– each cluster: a generative model (Gaussian or multinomial), as sketched below
– parameters (e.g. mean/covariance) are unknown
• Expectation Maximization (EM) algorithm
– automatically discovers all parameters for the K “sources”
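A minimal sketch of the generative view described above, assuming a 1-d mixture of two Gaussians; the priors, means and variances below are made-up illustrative values, not from the slides:

```python
# A minimal sketch of the generative view of a mixture model: pick a
# "source" with probability P(c), then draw the point from that source's
# Gaussian.  All parameter values below are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)

priors = np.array([0.6, 0.4])   # P(a), P(b)
means = np.array([-2.0, 3.0])   # mu_a, mu_b
sigmas = np.array([1.0, 0.5])   # sigma_a, sigma_b

def sample_mixture(n):
    """Draw n points from the 1-d two-Gaussian mixture."""
    sources = rng.choice(len(priors), size=n, p=priors)   # hidden cluster labels
    return rng.normal(means[sources], sigmas[sources]), sources

x, z = sample_mixture(1000)
# In clustering we observe only x; the sources z are exactly the unknowns
# that EM tries to recover (softly), together with the parameters.
```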
Mixture models in 1-d
• Observations x1 … xn
– K=2 Gaussians with unknown μ, σ²
– estimation is trivial if we know the source of each observation:
$$\mu_b = \frac{x_1 + x_2 + \dots + x_{n_b}}{n_b} \qquad \sigma_b^2 = \frac{(x_1-\mu_b)^2 + \dots + (x_{n_b}-\mu_b)^2}{n_b}$$
• What if we don’t know the source?
• If we knew the parameters of the Gaussians (μ, σ²)
– we can guess whether a point is more likely to be from a or b (see the sketch after this slide):
$$P(b \mid x_i) = \frac{P(x_i \mid b)\,P(b)}{P(x_i \mid b)\,P(b) + P(x_i \mid a)\,P(a)} \qquad P(x_i \mid b) = \frac{1}{\sqrt{2\pi\sigma_b^2}}\exp\left\{-\frac{(x_i-\mu_b)^2}{2\sigma_b^2}\right\}$$
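As referenced above, a minimal numpy/scipy sketch of the posterior P(b | xi) when the two Gaussians' parameters are known: it is just Bayes' rule with Gaussian likelihoods. The parameter values are illustrative assumptions.

```python
# Posterior responsibility P(b | x_i) when the two Gaussians' parameters are
# known: Bayes' rule with Gaussian likelihoods.  Parameter values are
# illustrative assumptions, not from the slides.
import numpy as np
from scipy.stats import norm

mu_a, sigma_a, p_a = -2.0, 1.0, 0.6
mu_b, sigma_b, p_b = 3.0, 0.5, 0.4

def responsibility_b(x):
    """P(b | x) for scalar or array x."""
    lik_a = norm.pdf(x, loc=mu_a, scale=sigma_a) * p_a   # P(x|a) P(a)
    lik_b = norm.pdf(x, loc=mu_b, scale=sigma_b) * p_b   # P(x|b) P(b)
    return lik_b / (lik_a + lik_b)

print(responsibility_b(np.array([-2.0, 0.5, 3.0])))  # near 0, in between, near 1
```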
Expectation Maximization (EM)
• Chicken-and-egg problem
– need (μa, σa²) and (μb, σb²) to guess the source of points
– need to know the source to estimate (μa, σa²) and (μb, σb²)
• EM algorithm
– start with two randomly placed Gaussians (μa, σa²), (μb, σb²)
– E-step: for each point, P(b|xi) = does it look like it came from b?
– M-step: adjust (μa, σa²) and (μb, σb²) to fit the points assigned to them
– iterate until convergence



EM: 1-d example
$$P(x_i \mid b) = \frac{1}{\sqrt{2\pi\sigma_b^2}}\exp\left\{-\frac{(x_i-\mu_b)^2}{2\sigma_b^2}\right\}$$

$$b_i = P(b \mid x_i) = \frac{P(x_i \mid b)\,P(b)}{P(x_i \mid b)\,P(b) + P(x_i \mid a)\,P(a)} \qquad a_i = P(a \mid x_i) = 1 - b_i$$

$$\mu_b = \frac{b_1 x_1 + b_2 x_2 + \dots + b_n x_n}{b_1 + b_2 + \dots + b_n} \qquad \sigma_b^2 = \frac{b_1 (x_1-\mu_b)^2 + \dots + b_n (x_n-\mu_b)^2}{b_1 + b_2 + \dots + b_n}$$

$$\mu_a = \frac{a_1 x_1 + a_2 x_2 + \dots + a_n x_n}{a_1 + a_2 + \dots + a_n} \qquad \sigma_a^2 = \frac{a_1 (x_1-\mu_a)^2 + \dots + a_n (x_n-\mu_a)^2}{a_1 + a_2 + \dots + a_n}$$

We could also estimate the priors:
$$P(b) = \frac{b_1 + b_2 + \dots + b_n}{n} \qquad P(a) = 1 - P(b)$$
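The update equations above translate almost line for line into code. A minimal numpy sketch of the full 1-d, two-component EM loop; the initialisation scheme and the convergence threshold are arbitrary choices of mine, not prescribed by the slides:

```python
# EM for a 1-d mixture of two Gaussians, following the update equations above.
# Initialisation and the convergence threshold are arbitrary assumptions.
import numpy as np
from scipy.stats import norm

def em_two_gaussians(x, n_iter=200, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=2, replace=False).astype(float)   # random initial means
    var = np.array([x.var(), x.var()])                        # initial variances
    prior = np.array([0.5, 0.5])                              # P(a), P(b)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities a_i, b_i  (shape: 2 x n)
        lik = np.vstack([prior[c] * norm.pdf(x, mu[c], np.sqrt(var[c]))
                         for c in range(2)])
        resp = lik / lik.sum(axis=0, keepdims=True)
        # M-step: responsibility-weighted means, variances and priors
        weight = resp.sum(axis=1)                 # sum_i of responsibilities per component
        mu = (resp @ x) / weight
        var = (resp * (x - mu[:, None]) ** 2).sum(axis=1) / weight
        prior = weight / len(x)
        # convergence check on the log-likelihood of the data
        ll = np.log(lik.sum(axis=0)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return mu, var, prior
```

Run on well-separated data (e.g. the sample from the earlier generative sketch), this should recover means, variances and priors close to the generating values, up to a local maximum and label swapping.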
Gaussian Mixture Model
• Data with D attributes, from Gaussian sources c1 … ck
– how typical is x⃗i under source c:
$$P(\vec{x}_i \mid c) = \frac{1}{\sqrt{(2\pi)^D\,|\Sigma_c|}}\exp\left\{-\tfrac{1}{2}(\vec{x}_i-\vec{\mu}_c)^T \Sigma_c^{-1} (\vec{x}_i-\vec{\mu}_c)\right\}, \quad (\vec{x}_i-\vec{\mu}_c)^T \Sigma_c^{-1} (\vec{x}_i-\vec{\mu}_c) = \sum_a \sum_b (x_{ia}-\mu_{ca})\,[\Sigma_c^{-1}]_{ab}\,(x_{ib}-\mu_{cb})$$
– how likely it is that x⃗i came from c:
$$P(c \mid \vec{x}_i) = \frac{P(\vec{x}_i \mid c)\,P(c)}{\sum_{c'=1}^{k} P(\vec{x}_i \mid c')\,P(c')}$$
– how important x⃗i is for source c:
$$w_{ic} = P(c \mid \vec{x}_i) \,/\, \left( P(c \mid \vec{x}_1) + \dots + P(c \mid \vec{x}_n) \right)$$
– mean of attribute a in items assigned to c:
$$\mu_{ca} = w_{c1} x_{1a} + \dots + w_{cn} x_{na}$$
– covariance of attributes a and b in items from c:
$$\Sigma_{cab} = \sum_{i=1}^{n} w_{ci} (x_{ia}-\mu_{ca})(x_{ib}-\mu_{cb})$$
– prior: how many items are assigned to c:
$$P(c) = \tfrac{1}{n}\left( P(c \mid \vec{x}_1) + \dots + P(c \mid \vec{x}_n) \right)$$
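A compact numpy/scipy sketch of one E-step and M-step with these multivariate updates; the variable names and array shapes (X is n × D, means is K × D, covs is K × D × D, priors has length K) are my own conventions:

```python
# One EM iteration for a D-dimensional GMM with full covariances,
# following the update rules on this slide.
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, means, covs, priors):
    n, D = X.shape
    K = len(priors)
    # E-step: P(c | x_i), shape (n, K)
    lik = np.column_stack([priors[c] * multivariate_normal.pdf(X, means[c], covs[c])
                           for c in range(K)])
    post = lik / lik.sum(axis=1, keepdims=True)
    # normalised importance weights w_ic = P(c|x_i) / sum_j P(c|x_j)
    w = post / post.sum(axis=0, keepdims=True)       # shape (n, K)
    # M-step: means, covariances and priors
    new_means = w.T @ X                              # (K, D)
    new_covs = np.empty_like(covs)
    for c in range(K):
        diff = X - new_means[c]                      # (n, D)
        new_covs[c] = (w[:, c, None] * diff).T @ diff
    new_priors = post.mean(axis=0)                   # (K,)
    return new_means, new_covs, new_priors
```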
How to pick K?
• Probabilistic model
– tries to “fit” the data (maximize likelihood)
• Pick the K that makes L as large as possible?
– K = n: each data point has its own “source”
– may not work well for new data points
• Split points into a training set T and a validation set V
– for each K: fit parameters on T, measure the likelihood of V
– sometimes still best when K = n
• Occam’s razor: pick the “simplest” of all models that fit (see the sketch after this slide)
– Bayesian Information Criterion (BIC): max_p { L − ½ p log n }
– Akaike Information Criterion (AIC): min_p { 2p − L }
– L … likelihood, how well our model fits the data
– p … number of parameters, how “simple” the model is
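In practice this comparison is often done with an off-the-shelf implementation. A sketch using scikit-learn's GaussianMixture, whose bic()/aic() methods use the conventional lower-is-better definitions, which play the same role as the penalised-likelihood forms on the slide:

```python
# Picking K by an information criterion with scikit-learn's GaussianMixture.
# X is assumed to be an (n, D) numpy array of data points.
import numpy as np
from sklearn.mixture import GaussianMixture

def pick_k_by_bic(X, k_max=10, seed=0):
    scores = {}
    for k in range(1, k_max + 1):
        gmm = GaussianMixture(n_components=k, covariance_type="full",
                              random_state=seed).fit(X)
        scores[k] = gmm.bic(X)          # lower BIC = better penalised fit
    return min(scores, key=scores.get), scores
```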
Summary
• Walked through the 1-d version
– works for higher dimensions
• d-dimensional Gaussians, which can be non-spherical
– works for discrete data (text)
• d-dimensional multinomial distributions (pLSI)
• Maximizes the likelihood of the data (see the equations below)
• Similar to K-means
– sensitive to the starting point, converges to a local maximum
– convergence: when the change in P(x1…xn) is sufficiently small
– cannot discover K (likelihood keeps growing with K)
• Different from K-means
– soft clustering: an instance can come from multiple “clusters”
– covariance: the notion of “distance” changes over time, as the covariances are re-estimated
• How can you make GMM = K-means?
$$L = \log \prod_{i=1}^{N} P(x_i) = \sum_{i=1}^{N} \log \sum_{k=1}^{K} P(k)\,\frac{1}{\sqrt{2\pi\sigma_k^2}}\exp\left\{-\frac{(x_i-\mu_k)^2}{2\sigma_k^2}\right\}$$

$$\frac{\partial L}{\partial \mu_j} = \sum_{i=1}^{N} \frac{p_j\,N(x_i;\mu_j,\sigma_j^2)}{\sum_{k=1}^{K} p_k\,N(x_i;\mu_k,\sigma_k^2)} \cdot \frac{x_i-\mu_j}{\sigma_j^2}$$

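A small numpy/scipy sketch of this log-likelihood L, the quantity whose change EM monitors for convergence; using logsumexp for numerical stability is my own choice, a direct log-of-sums would match the equation exactly:

```python
# Log-likelihood L of the data under a 1-d K-component GMM, per the
# equation above.
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def log_likelihood(x, priors, means, sigmas):
    """x: (n,) data; priors/means/sigmas: (K,) mixture parameters."""
    # log [ P(k) * N(x_i; mu_k, sigma_k^2) ] for every point and component
    log_terms = (np.log(priors)[None, :]
                 + norm.logpdf(x[:, None], means[None, :], sigmas[None, :]))
    return logsumexp(log_terms, axis=1).sum()     # sum_i log sum_k ...

# EM's convergence test: stop when the increase in L between iterations
# is sufficiently small, e.g. abs(L_new - L_old) < 1e-6.
```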
