lecture_06
2. a. Calculate K cluster centroids;
$p(\text{data} \mid \text{parameters}) = p(y \mid \theta)$
$p(\text{height} \mid \theta) = \Pr(\text{man})\,\text{Normal}(\mu_{\text{man}}, \sigma_{\text{man}}) + \Pr(\text{woman})\,\text{Normal}(\mu_{\text{woman}}, \sigma_{\text{woman}})$
$p(\text{height} \mid \theta) = \pi_1^X\,\text{Normal}(\mu_1, \sigma_1) + (1 - \pi_1^X)\,\text{Normal}(\mu_2, \sigma_2)$
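This mixture density can be evaluated directly in R; a minimal sketch, where all parameter values (heights in cm) are assumptions chosen for illustration:

pi_man   <- 0.5                  # Pr(man), assumed
mu_man   <- 180; sd_man   <- 7   # man component (cm), assumed
mu_woman <- 167; sd_woman <- 6   # woman component (cm), assumed
height   <- 175
# mixture density p(height | theta): weighted sum of two Normal densities
pi_man * dnorm(height, mu_man, sd_man) +
  (1 - pi_man) * dnorm(height, mu_woman, sd_woman)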
Model-based clustering
Gaussian mixture parameters (see the simulation sketch below):
• $\pi_1^X$ determines the relative cluster sizes
  • The proportion of observations expected in each cluster
• $\mu_1$ and $\mu_2$ determine the locations of the clusters
  • Like the centroids in k-means clustering
• $\sigma_1$ and $\sigma_2$ determine the volume of the clusters
  • How large / spread out the clusters are in data space
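A minimal simulation sketch of these roles, with all values assumed: $\pi_1^X$ sets the cluster sizes, $\mu_1, \mu_2$ the locations, and $\sigma_1, \sigma_2$ the spread.

set.seed(1)
n   <- 1000
pi1 <- 0.3                                    # cluster 1 gets ~30% of observations
z   <- rbinom(n, 1, pi1)                      # latent cluster indicator
y   <- ifelse(z == 1,
              rnorm(n, mean = 160, sd = 5),   # cluster 1: location 160, tight
              rnorm(n, mean = 180, sd = 10))  # cluster 2: location 180, spread out
hist(y, breaks = 40)                          # two bumps of unequal size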
$\pi_{\text{man}}^X = \frac{2.20}{2.86} \approx 0.77$
Estimation: the EM algorithm
• Now we have some class assignments (posterior probabilities)
• So we can go back to the parameters and update them using our easy rule (M-step)
• Then we can compute new posterior probabilities (E-step), and iterate until convergence (see the sketch below)
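A minimal EM sketch for the two-component univariate mixture, assuming simple starting values (an illustration, not the exact updates shown in the lecture):

em_gmm <- function(y, n_iter = 100) {
  # assumed starting values
  pi1 <- 0.5; mu <- c(min(y), max(y)); sigma <- c(sd(y), sd(y))
  for (i in 1:n_iter) {
    # E-step: posterior probability of cluster 1 for each observation
    d1 <- pi1 * dnorm(y, mu[1], sigma[1])
    d2 <- (1 - pi1) * dnorm(y, mu[2], sigma[2])
    post <- d1 / (d1 + d2)
    # M-step: posterior-weighted parameter updates
    pi1      <- mean(post)
    mu[1]    <- sum(post * y) / sum(post)
    mu[2]    <- sum((1 - post) * y) / sum(1 - post)
    sigma[1] <- sqrt(sum(post * (y - mu[1])^2) / sum(post))
    sigma[2] <- sqrt(sum((1 - post) * (y - mu[2])^2) / sum(1 - post))
  }
  list(pi1 = pi1, mu = mu, sigma = sigma, post = post)
}
em_gmm(faithful$waiting)  # example: bimodal waiting times in built-in 'faithful'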
$\text{Normal}(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
$\text{MVN}(\boldsymbol{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = (2\pi)^{-d/2}\,|\boldsymbol{\Sigma}|^{-1/2}\exp\left(-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{\mu})\right)$
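Both densities are available in R; a sketch (the mvtnorm package is an assumption here, not part of the lecture code):

library(mvtnorm)
dnorm(175, mean = 180, sd = 7)                           # Normal(x; mu, sigma)
dmvnorm(c(175, 70), mean = c(180, 75), sigma = diag(2))  # MVN(x; mu, Sigma)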
Multivariate model-based clustering
$p(\boldsymbol{y} \mid \theta) = \pi_1^X\,\text{MVN}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1) + (1 - \pi_1^X)\,\text{MVN}(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)$
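A sketch of fitting this two-cluster multivariate model with mclust, using the built-in faithful data as an assumed example:

library(mclust)
fit <- Mclust(faithful, G = 2)  # two-component multivariate Gaussian mixture
fit$parameters$mean             # estimated cluster means (mu_1, mu_2)
fit$parameters$variance$sigma   # estimated covariance matrices (Sigma_1, Sigma_2)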
Multivariate model-based clustering
• Cluster shape parameters (the variance-covariance matrices) can be constrained to be equal across clusters
  • Same as in k-means
• They can also differ across clusters (see the sketch below)
  • Not possible in k-means
  • A more flexible, but more complex, model
  • Think about the bias-variance tradeoff!
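In mclust this constraint is chosen via the modelNames argument; a sketch, with the scaled faithful data standing in for your own:

library(mclust)
x <- scale(faithful)                               # assumed example data
fit_equal <- Mclust(x, G = 2, modelNames = "EEE")  # one covariance matrix shared by both clusters
fit_free  <- Mclust(x, G = 2, modelNames = "VVV")  # a separate covariance matrix per cluster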
TOP SECRET SLIDE
• K-means clustering is a GMM with the following model (see the comparison sketch below):
• All prior class proportions are 1/K
• EII model: equal volume, only circles
• All posterior probabilities are either 0 or 1
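A comparison sketch under these assumptions (built-in faithful data, scaled; cluster labels may be permuted): the hard k-means partition and the EII mixture partition typically agree closely, though not exactly, because Mclust still uses soft assignments and estimated proportions.

library(mclust)
x   <- scale(faithful)
km  <- kmeans(x, centers = 2)                         # k-means partition
gmm <- Mclust(x, G = 2, modelNames = "EII")           # spherical, equal-volume GMM
table(kmeans = km$cluster, gmm = gmm$classification)  # cross-tabulate the two partitions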
TOP SECRET SLIDE 2
• GMM has trouble with clusters that are not ellipses
• Secret weapon: merging
Powerful idea:
• Start with Gaussian mixture solution
• Merge “similar” components to create non-Gaussian clusters
library(mclust)
out <- Mclust(x)        # fit a Gaussian mixture to the data matrix x
com <- clustCombi(out)  # hierarchically merge mixture components
plot(com)               # inspect the combined solutions
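Here clustCombi() keeps the fitted Gaussian mixture but merges its components step by step (using an entropy-based criterion), so several Gaussian components can jointly form one non-elliptical cluster; plot(com) shows the combined solutions.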
Assessing clustering results
Methods to assess whether the obtained clusters are “good”:
• Stability (previous lecture)
• External validity (previous lecture)
• Model fit
Model fit
How well does the model fit the data?
Log-likelihood:
$\log L(\theta) = \sum_{i=1}^{N} \log p(y_i \mid \theta)$
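In mclust, the maximized log-likelihood is stored on the fitted object; a sketch, continuing the assumed faithful example:

library(mclust)
fit <- Mclust(faithful, G = 2)
fit$loglik  # maximized log-likelihood of the fitted mixture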