
Clustering in R Programming

Last Updated : 19 Apr, 2025

Clustering is an unsupervised learning technique in which a dataset is divided into groups, or clusters, based on similarities among the data points. It identifies natural groupings within the data without any prior labeling. Each cluster contains data points that are closer to one another than to points in other clusters. Clustering is commonly used in data mining and pattern recognition to reveal latent structure, group behavior, or trends in the data. It is particularly helpful when we need to explore data, reduce dimensionality, or pre-process data for downstream supervised learning tasks.

Types of Clustering in R Programming

In R, there are several clustering techniques, each suited for different data types and clustering challenges. Each method has its own advantages and is designed to handle specific data characteristics such as the number of clusters, their shapes, and whether or not noise is present in the data.

1. K-means clustering

The most common method is K-means, a data-partitioning technique that divides the data into k clusters, where the number of clusters (k) is set beforehand, and assigns each observation to the cluster with the nearest mean. It is computationally efficient for large datasets but can struggle with irregularly shaped clusters or varying densities.

2. Hierarchical Clustering

Hierarchical Clustering, on the other hand, creates a hierarchy of clusters by either merging smaller clusters (agglomerative) or splitting larger ones (divisive). This approach builds a tree-like structure known as a dendrogram, which allows you to see the relationships between clusters at various levels of similarity. Hierarchical clustering is great for small to medium datasets where understanding the relationships between clusters is important, but it can be computationally expensive for larger datasets.
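As a quick illustration, the sketch below runs agglomerative clustering on the built-in mtcars dataset using base R's hclust(). The choice of Ward linkage and of cutting the tree into 3 groups is arbitrary here, purely for demonstration.

R
# Scale the data so no variable dominates the distance calculation
df <- scale(mtcars)

# Compute pairwise Euclidean distances between observations
d <- dist(df, method = "euclidean")

# Agglomerative hierarchical clustering with Ward's linkage
hc <- hclust(d, method = "ward.D2")

# Plot the dendrogram to inspect the cluster hierarchy
plot(hc, cex = 0.6)

# Cut the tree into 3 clusters (an arbitrary choice for illustration)
groups <- cutree(hc, k = 3)
table(groups)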

3. Spectral Clustering

Spectral Clustering transforms the clustering problem into a graph partitioning problem. By constructing a similarity graph from the data and clustering based on the eigenvalues of the graph's Laplacian matrix, it is able to capture complex, non-convex clusters. This method works particularly well for datasets where the clusters are not linearly separable, but it can be computationally intensive and requires careful tuning of the similarity matrix.
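One way to try this in R is the specc() function from the kernlab package (assumed installed). The sketch below uses kernlab's bundled spirals dataset, whose two intertwined spirals are exactly the kind of non-convex clusters that K-means cannot separate.

R
# install.packages("kernlab")   # assumed installed
library(kernlab)

# Two intertwined, non-convex clusters
data(spirals)

# Spectral clustering with the default RBF kernel, 2 clusters
sc <- specc(spirals, centers = 2)

# Plot points colored by their cluster assignment
plot(spirals, col = sc)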

4. Fuzzy Clustering

Fuzzy Clustering (or Fuzzy C-Means) is a soft clustering technique where data points are assigned membership scores for each cluster, rather than being definitively assigned to one cluster. This means a data point can belong to multiple clusters with varying degrees of membership. Fuzzy clustering is useful when the boundaries between clusters are not well defined, and it allows for more nuanced grouping, but interpreting the membership scores can be more complex.
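A minimal sketch of fuzzy c-means using the cmeans() function from the e1071 package (assumed installed) is shown below; the choice of 3 clusters and the fuzziness parameter m = 2 are illustrative, not recommendations.

R
# install.packages("e1071")   # assumed installed
library(e1071)

df <- scale(mtcars)

# 3 clusters; m controls fuzziness (m = 2 is a common default)
fcm <- cmeans(df, centers = 3, m = 2)

# Membership matrix: one row per observation, one column per cluster;
# each row sums to 1
head(round(fcm$membership, 3))

# Hard assignment: the cluster with the highest membership
fcm$cluster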

5. Density-Based Clustering

Density-Based Clustering is a broader category that includes methods like DBSCAN. These methods find clusters in regions of high data density, rather than assigning points to the nearest cluster center. Density-based methods are robust to noise and can find clusters of arbitrary shapes. However, they can be sensitive to their parameters, such as the neighborhood radius and the minimum number of points needed to form a cluster.
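The sketch below runs DBSCAN via the dbscan package (assumed installed) on the multishapes dataset bundled with factoextra, which is also used in the K-means example later in this article. The eps and minPts values are illustrative starting points, not universal settings.

R
# install.packages(c("dbscan", "factoextra"))   # assumed installed
library(dbscan)
data("multishapes", package = "factoextra")
df <- multishapes[, 1:2]

# eps (neighborhood radius) and minPts are the key tuning parameters
db <- dbscan(df, eps = 0.15, minPts = 5)

# Cluster 0 marks noise points
table(db$cluster)
plot(df, col = db$cluster + 1)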

6. Ensemble Clustering

Ensemble Clustering takes a different approach by combining the results of multiple clustering algorithms or multiple runs of the same algorithm to create a more reliable clustering solution. By aggregating the results of different methods, ensemble clustering aims to improve performance and reduce the risk of overfitting. This method is particularly useful when there is uncertainty about which clustering technique is the most appropriate, and it can provide more robust and stable results.
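One common flavor of this idea is evidence accumulation: run a base algorithm many times, count how often each pair of points lands in the same cluster, and then cluster that co-association matrix. The base-R sketch below is one such illustration; the number of runs, k, and the linkage method are all arbitrary choices here.

R
# Evidence-accumulation ensemble: aggregate many k-means runs
df <- scale(mtcars)
n <- nrow(df)
runs <- 50
coassoc <- matrix(0, n, n)

set.seed(42)
for (i in seq_len(runs)) {
  cl <- kmeans(df, centers = 4, nstart = 1)$cluster
  # Mark pairs of points assigned to the same cluster in this run
  coassoc <- coassoc + outer(cl, cl, "==")
}
coassoc <- coassoc / runs

# Pairs that often co-occur have small "distance"; cluster the
# co-association matrix and cut it into a final consensus partition
hc <- hclust(as.dist(1 - coassoc), method = "average")
consensus <- cutree(hc, k = 4)
table(consensus)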

Each of these clustering techniques has its strengths and weaknesses, making it important to choose the right one based on the specific characteristics of your data and the goals of your analysis. Whether you’re working with large datasets, noisy data, or data that requires soft assignments, there’s a clustering method in R that can be tailored to your needs.

Implementation of K-Means Clustering in R Programming

We will implement the K-Means clustering algorithm here, since it is simple and easy to understand.

K-Means is an iterative hard-clustering technique based on an unsupervised learning algorithm. The total number of clusters is predefined by the user, and each data point is assigned to a cluster based on its similarity to the others. The algorithm also computes the centroid of each cluster.

Algorithm

  1. Specify the number of clusters (K): for example, k = 2 with 5 data points.
  2. Randomly assign each data point to one of the clusters.
  3. Calculate the cluster centroids, i.e. the mean of the points currently in each cluster.
  4. Reassign each data point to the cluster whose centroid is nearest.
  5. Recompute the cluster centroids, and repeat steps 4-5 until the assignments no longer change.
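To make the loop concrete, here is a minimal from-scratch sketch of the algorithm in base R. It is an illustration only; in practice you would use the built-in kmeans() function shown next.

R
# Minimal k-means from scratch (illustration only)
set.seed(1)
x <- scale(mtcars)   # data matrix
k <- 2               # number of clusters

# Steps 1-2: start from k randomly chosen points as initial centroids
centroids <- x[sample(nrow(x), k), , drop = FALSE]

repeat {
  # Step 4: distance from every point to every centroid,
  # then assign each point to its nearest centroid
  d <- as.matrix(dist(rbind(centroids, x)))[-(1:k), 1:k]
  assign <- max.col(-d)   # index of the smallest distance per row

  # Steps 3/5: recompute centroids as the mean of each cluster
  # (assumes no cluster becomes empty, which holds for this seed)
  new_centroids <- apply(x, 2, function(col) tapply(col, assign, mean))

  # Stop when the centroids no longer move
  if (all(abs(new_centroids - centroids) < 1e-8)) break
  centroids <- new_centroids
}
table(assign)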

Syntax:

 kmeans(x, centers, nstart)

where,

  • x: a numeric matrix or data frame.
  • centers: the number of clusters K, or a set of initial (distinct) cluster centers.
  • nstart: the number of random initial configurations to try; the best-performing one is kept.

Example

R
install.packages("factoextra")
library(factoextra)

df <- mtcars
df <- na.omit(df)
df <- scale(df)

km <- kmeans(df, centers = 4, nstart = 25)
fviz_cluster(km, data = df)

km <- kmeans(df, centers = 5, nstart = 25)
fviz_cluster(km, data = df)

Output:

When k = 4:

[Figure: cluster plot of the scaled mtcars data partitioned into 4 clusters]

When k = 5:

[Figure: cluster plot of the scaled mtcars data partitioned into 5 clusters]


