What is Gaussian mixture model clustering using R
Last Updated: 24 Apr, 2025
Gaussian mixture model (GMM) clustering is a widely used technique in unsupervised machine learning that groups data points based on their probability distributions. Its versatility lies in its ability to model clusters of different shapes and sizes, which makes it applicable to many scenarios, and R offers several packages for fitting it. The approach assumes that the data consists of a mixture of Gaussian distributions, each representing a distinct cluster. By estimating the parameters of these components, GMM clustering identifies and separates data points belonging to different clusters.
Mathematical Concept
In a GMM, the data are represented as a mixture of several Gaussian distributions. Each Gaussian is characterized by its mean vector (center) and covariance matrix (spread and shape), and contributes to the mixture with a mixing weight. The probability density of a data point under a specific cluster is given by the corresponding Gaussian distribution.
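Written out, the mixture described above takes the following standard form. The notation is introduced here for illustration and is not taken from the article: K is the number of components, pi_k the mixing weight of component k, mu_k and Sigma_k its mean vector and covariance matrix, and gamma_ik the probability that point x_i belongs to component k.

```latex
p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1

\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}
                   {\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}
```

The first line is the density of the whole mixture; the second is the membership probability of a single point, obtained with Bayes' rule.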
GMM Clustering Algorithm
- Initialization: choose the number of clusters (k) and initialize the parameters: mixing weights, mean vectors, and covariance matrices.
- Expectation (E) step: using the current parameters, compute for each data point the probability that it belongs to each cluster.
- Maximization (M) step: update the parameters (mixing weights, means, and covariances), using these membership probabilities as weights.
- Repeat steps 2 and 3: iterate the E and M steps until convergence (minimal change in the parameters or the log-likelihood).
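The steps above can be sketched in base R for the simplest case, a one-dimensional mixture. This is a minimal illustration under stated assumptions, not how mclust or any other package implements EM: the function name em_gmm_1d, the quantile-based initialization, and the synthetic data are all made up for this example.

```r
# Minimal EM for a one-dimensional Gaussian mixture (base R only).
# em_gmm_1d and all variable names are illustrative, not from any package.
em_gmm_1d <- function(x, k = 2, max_iter = 200, tol = 1e-8) {
  n <- length(x)
  # Initialization: spread the starting means over the data quantiles
  mu <- as.numeric(quantile(x, probs = (seq_len(k) - 0.5) / k))
  sigma <- rep(sd(x), k)
  weight <- rep(1 / k, k)
  loglik_old <- -Inf

  for (iter in seq_len(max_iter)) {
    # E-step: weighted density of each point under each component (n x k),
    # normalized per row to give membership probabilities
    dens <- sapply(seq_len(k), function(j) weight[j] * dnorm(x, mu[j], sigma[j]))
    resp <- dens / rowSums(dens)

    # M-step: update weights, means and standard deviations,
    # using the membership probabilities as weights
    nk <- colSums(resp)
    weight <- nk / n
    mu <- colSums(resp * x) / nk
    sigma <- sqrt(colSums(resp * outer(x, mu, "-")^2) / nk)

    # Convergence check on the log-likelihood at the E-step parameters
    loglik <- sum(log(rowSums(dens)))
    if (abs(loglik - loglik_old) < tol) break
    loglik_old <- loglik
  }
  list(weight = weight, mean = mu, sd = sigma, resp = resp,
       loglik = loglik, iterations = iter)
}

# Synthetic data: two well-separated groups (assumed for the example)
set.seed(42)
x <- c(rnorm(200, mean = 0, sd = 1), rnorm(200, mean = 6, sd = 1))
fit <- em_gmm_1d(x, k = 2)
fit$mean    # estimated component means
fit$weight  # estimated mixing weights
```

The multivariate case follows the same pattern, with covariance matrices in place of standard deviations; in practice a package such as mclust handles that for you.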
Understanding the GMM Architecture
Picture a data set as a landscape filled with points. Instead of being scattered at random, these points are organized into clusters, each having its own "center of gravity" and "shape." The center of a cluster is represented by a mean vector that holds one value for each feature. The covariance matrix describes how the data points vary around that mean and thus determines the shape of the cluster.
In a Gaussian Mixture Model, each data point is assumed to be generated by one of these clusters and is assigned a membership probability for each cluster accordingly. Compared with clustering algorithms such as k-means, this probabilistic approach offers several advantages.
- GMMs allow a data point to belong partially to several clusters, acknowledging that the boundaries between clusters are not always well defined.
- The number of clusters can be chosen by statistical inference: several candidate values are fitted and compared with a criterion such as BIC, whereas k-means requires the number to be fixed beforehand and has no likelihood to compare. Each individual GMM fit still needs a candidate number.
- Thanks to their use of covariance matrices, GMMs are flexible enough to capture clusters with different shapes and orientations.
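The first point, partial membership, can be illustrated with a few lines of base R. The two components and their parameters below are made-up values for a one-dimensional mixture, and membership() is a hypothetical helper written for this sketch.

```r
# Two assumed components of a 1-D mixture (illustrative values)
weights <- c(0.5, 0.5)
means   <- c(0, 4)
sds     <- c(1, 1)

# Membership probabilities of a single point x:
# weighted density under each component, normalized to sum to 1
membership <- function(x) {
  d <- weights * dnorm(x, mean = means, sd = sds)
  d / sum(d)
}

membership(0)  # almost entirely component 1
membership(2)  # halfway between the means: 0.5 and 0.5
```

A hard clustering method would have to place the point at 2 in exactly one cluster; the mixture model reports that it is equally compatible with both.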
R provides several packages for fitting GMMs. ClusterR is one of them, and its GMM-related functions include:
- GMM(): the main function, used to fit a GMM to your dataset.
- predict_GMM(): predicts cluster membership for new data points.
- Optimal_Clusters_GMM(): helps select the number of clusters by comparing AIC or BIC across candidate values.
mclust:
- The Mclust() function covers model selection, including automatic choice of the number of components over a range of candidate values.
- The best model can be selected by BIC or ICL, using mclustBIC() and mclustICL().
mixtools:
- normalmixEM() fits univariate normal mixtures by EM; mvnormalmixEM() is the multivariate counterpart.
- The probability of cluster membership for each observation is available in the posterior component of the fitted object.
- Different GMM fits can be compared with information criteria such as AIC and BIC, computed from the fitted log-likelihood (the loglik component).
Steps Involved in GMM Clustering using R
- Load and preprocess the data so that it is properly encoded for clustering (for example, numeric and scaled features).
- Use methods such as elbow analysis, silhouette analysis, or BIC to select the number of clusters.
- Using the 'mclust' package in R, apply GMM to your data.
- Assign each data point to the cluster for which it has the highest membership probability.
- Use metrics such as the silhouette score or the Calinski-Harabasz index to evaluate the clustering results.
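The steps above can be sketched end to end as follows. Several choices here are assumptions made for illustration: the built-in iris measurements stand in for your data, BIC (mclust's default) replaces elbow or silhouette analysis for choosing the number of clusters, and the silhouette score is computed with the cluster package that ships with R.

```r
library(mclust)

# Step 1: keep the numeric features and standardize them
X <- scale(iris[, 1:4])

# Steps 2-3: fit GMMs with 1 to 6 components; Mclust() keeps the best one by BIC
fit <- Mclust(X, G = 1:6, verbose = FALSE)
fit$G          # selected number of components
fit$modelName  # selected covariance structure

# Step 4: hard labels come from the highest membership probability
labels <- as.integer(fit$classification)
head(round(fit$z, 3))  # soft membership probabilities, one column per cluster
table(labels)

# Step 5: average silhouette width (only defined for two or more clusters)
if (fit$G > 1) {
  sil <- cluster::silhouette(labels, dist(X))
  mean(sil[, "sil_width"])
}
```

Passing a range to G lets the package compare candidate models in one call; passing a single value, as in the examples later in this article, fixes the number of clusters instead.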
Important Packages for this model
The R package mclust is commonly used for model-based clustering, density estimation, and discriminant analysis using Gaussian mixture models. It estimates the model parameters with the Expectation-Maximization algorithm. The package supports a range of covariance structures and selects among candidate models by criteria such as BIC or ICL. It is particularly useful for grouping multivariate data that is approximately Gaussian within each cluster.
Customer Segmentation
R
# Load the 'mclust' package (install once with install.packages("mclust"))
library(mclust)
# Generate a synthetic dataset with three clusters of 50 points each
set.seed(123)
data <- rbind(matrix(rnorm(100, mean = 0, sd = 1), ncol = 2),
              matrix(rnorm(100, mean = 5, sd = 1), ncol = 2),
              matrix(rnorm(100, mean = 10, sd = 1), ncol = 2))
# Perform GMM clustering
# G represents the number of clusters
gmm_model <- Mclust(data, G = 3)
# Get the cluster assignments
cluster_assignments <- predict(gmm_model)$classification
# Visualize the results
plot(data, col = cluster_assignments, main = "GMM Clustering Results")
# parameters$mean stores one component per column, so transpose before plotting
points(t(gmm_model$parameters$mean), col = 1:3, pch = 8, cex = 2)
Output:
Gaussian mixture model clustering using R
Make sure the mclust package is installed before running the code. The package also offers model-selection options that you can explore based on your data.
- Create a dataset with three clusters using the rnorm() function.
- Use the Mclust() function to fit a Gaussian Mixture Model to the dataset, specifying the desired number of clusters (G parameter).
- Determine the cluster assignments with predict(gmm_model)$classification.
- Present the clustering results visually in a scatter plot, with the estimated cluster centers marked.
Anomaly Detection
R
# Load the 'mclust' package (install once with install.packages("mclust"))
library(mclust)
# Generate a synthetic dataset with normal and anomalous data
set.seed(123)
normal_data <- matrix(rnorm(1000, mean = 0, sd = 1), ncol = 2)
anomalous_data <- matrix(rnorm(50, mean = 10, sd = 5), ncol = 2)
data <- rbind(normal_data, anomalous_data)
# Fit a Gaussian Mixture Model to the data
# Assuming there are 2 components (normal and anomalous)
gmm_model <- Mclust(data, G = 2)
summary(gmm_model)
Output:
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------

Mclust VII (spherical, varying volume) model with 2 components:

 log-likelihood   n df       BIC       ICL
      -1667.953 525  7 -3379.749 -3380.132

Clustering table:
  1   2
500  25
The Gaussian finite mixture model, specifically the Mclust VII model (spherical, varying volume) with 2 components, was fitted to a dataset of 525 points using the Expectation-Maximization (EM) algorithm.
- The log-likelihood, a measure of model fit, is -1667.953.
- The BIC (Bayesian Information Criterion) and ICL (Integrated Completed Likelihood) are -3379.749 and -3380.132, respectively. mclust reports these criteria on a scale where larger (less negative) values indicate a better model, so they are meaningful mainly when compared across candidate models.
- The model has 7 estimated parameters (the df column), and the clustering table shows two clusters with 500 and 25 data points, respectively.
Overall, the cluster sizes match the 500 normal and 25 anomalous points that were generated, and the ICL being close to the BIC indicates that the two components overlap very little.
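The BIC in the summary can be reproduced from the other reported numbers. The sketch below assumes mclust's convention, BIC = 2 * log-likelihood - df * log(n), under which larger values are better; the variable names are chosen for this example.

```r
loglik <- -1667.953  # log-likelihood from the summary
n      <- 525        # number of observations
df     <- 7          # number of estimated parameters

# Parameter count for 2 spherical components in 2 dimensions:
# 2 x 2 mean coordinates + 2 variances + 1 free mixing weight
df_check <- 2 * 2 + 2 + (2 - 1)

bic <- 2 * loglik - df * log(n)
round(bic, 2)  # close to the reported -3379.749
```

Many other tools define BIC with the opposite sign, where smaller is better, so it is worth checking the convention before comparing values across packages.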
Visualize the result of Anomaly Detection
R
# Log-density of each data point under the fitted mixture
log_density <- dens(modelName = gmm_model$modelName, data = data,
                    parameters = gmm_model$parameters, logarithm = TRUE)
# Low density means the point is poorly explained by the model,
# so flag the lowest 5% as anomalies
threshold <- quantile(log_density, 0.05)
is_anomaly <- log_density < threshold
anomalies <- data[is_anomaly, , drop = FALSE]
# Visualize the results
plot(data, pch = 19, col = ifelse(is_anomaly, "red", "blue"),
     main = "Anomaly Detection using Gaussian Mixture Model")
points(anomalies, pch = 3, col = "red")
Output:
Gaussian mixture model clustering using R
We generate a dataset containing both normal and anomalous data points.
- We then use the Mclust() function to fit a Gaussian Mixture Model to this dataset.
- Using dens() with the fitted model's parameters, we obtain the log-density of each data point under the mixture. Note that predict(gmm_model)$z holds posterior membership probabilities rather than densities, so it does not measure how unusual a point is.
- The log-density serves as the anomaly score: the lower the value, the further the point deviates from what the model considers normal.
- Anomalies are identified by applying a threshold, here the 5th percentile of the scores.
- Finally, we visualize the dataset with anomalies highlighted in red.
Advantages/Disadvantages of Gaussian Mixture Model (GMM) Clustering in R:
Advantages
- Models varied cluster shapes: GMM can capture elliptical or elongated clusters, unlike clustering algorithms that assume spherical clusters.
- Handles multivariate data: GMM models several features jointly and can be used to analyse datasets with many variables, although the number of covariance parameters grows quickly with the dimension.
- Provides soft clustering: GMM assigns each data point a probability for every cluster, so a point can belong to multiple clusters with different weights, unlike hard clustering algorithms, which assign each data point to exactly one cluster.
- Estimates cluster densities: GMM estimates the probability density function of each cluster, making it possible to understand how data are distributed within clusters.
- Leverages the Expectation-Maximization (EM) algorithm: GMM uses EM for parameter estimation, a well-established method for models with latent variables or incomplete data whose likelihood never decreases from one iteration to the next.
Disadvantages
- The initial values of the cluster means and covariances affect the performance of GMM. Poor initialization can lead EM to a poor local optimum and unsatisfactory results.
- Training GMMs can be computationally intensive, particularly for very large datasets and high-dimensional data.
- The number of clusters must be specified in advance or chosen with a model-selection criterion such as BIC. Selecting the wrong number can significantly affect the result.
- GMM gives cluster membership probabilities, which can make the resulting clusters harder to interpret than the hard assignments produced by other clustering algorithms.
- GMM may overfit the data, especially with many components or unconstrained covariance matrices. This can lead to poor generalization to unseen data.