
Clustering

Introduction - Types of Clustering - Correlations and Distances - Clustering by Partitioning Methods: Hierarchical Clustering, Overlapping Clustering, K-Means Clustering - Profiling and Interpreting Clusters.

Clustering is a machine learning technique that involves grouping a set of objects in such a way that
objects in the same group (called a cluster) are more similar to each other than to those in other groups
(clusters). It is an essential unsupervised learning method, meaning that clustering algorithms do not rely on
labeled data. Instead, they find natural patterns in the dataset.
Key Concepts:
1. Unsupervised Learning: In clustering, the data is not labeled, so the algorithm identifies patterns
based on the inherent structure of the data.
2. Similarity/Dissimilarity: Clustering is based on a measure of similarity (or dissimilarity), often
calculated using distance metrics like Euclidean distance. Objects that are close to each other in terms
of this metric are considered more similar.
3. Clusters: A group of data points that are similar in some way. The goal of clustering is to find a way
to define and group these clusters effectively.
Types of Clustering Techniques:
1. Partitioning Methods (e.g., K-means Clustering): These methods divide the data into k distinct
clusters. K-means is a popular algorithm where each data point belongs to the cluster with the nearest
mean value.
2. Hierarchical Clustering: This method builds a tree (dendrogram) of clusters, either by merging small
clusters into bigger ones (agglomerative) or by splitting large clusters into smaller ones (divisive).
3. Density-Based Clustering (e.g., DBSCAN): Clusters are formed based on the density of data points.
DBSCAN can discover clusters of arbitrary shapes, making it effective for noisy datasets.
4. Model-Based Clustering: These algorithms assume that the data is generated from a mixture of
underlying probability distributions, and they attempt to identify the distribution for each cluster.
Applications of Clustering:

● Market Segmentation: Identifying different customer groups based on buying behavior.


● Image Segmentation: Grouping pixels with similar colors or patterns in an image.
● Anomaly Detection: Identifying unusual data points in a dataset, such as fraud detection.
● Document Clustering: Grouping similar text documents for easier retrieval.

Classifications of clustering

Clustering is classified into several types based on the approach used to group the data. These methods
differ in how they define clusters and the assumptions they make about the underlying data.

1. Partitioning Clustering

Partitioning methods divide the dataset into a predefined number of clusters (typically k clusters) such that each data point belongs to exactly one cluster.

Key Algorithms:

● K-Means Clustering:
o The most popular partitioning method.
o It tries to minimize the sum of squared distances between the data points and the centroid of each cluster.
o The user specifies the number of clusters (k), and the algorithm iteratively assigns points to the nearest cluster centroid.
● K-Medoids (PAM - Partitioning Around Medoids):
o Similar to K-means but uses medoids (actual data points) instead of centroids.
o It is more robust to outliers because it uses representative points from the dataset rather than
calculating a centroid.

Pros:

● Simple to implement and computationally efficient.


● Works well with spherical-shaped clusters.

Cons:

● The number of clusters must be pre-specified.


● Sensitive to the choice of initial centroids and outliers.

2. Hierarchical Clustering

Hierarchical clustering builds a multilevel hierarchy of clusters either by merging smaller clusters
(agglomerative approach) or by splitting larger clusters (divisive approach). The result is a tree-like
structure called a dendrogram.

Types of Hierarchical Clustering:

● Agglomerative (Bottom-Up Approach):


o Initially, each data point is considered as its own cluster.
o The algorithm merges the closest pairs of clusters iteratively until all points are in a single
cluster or a predefined number of clusters is reached.
● Divisive (Top-Down Approach):
o Starts with one large cluster containing all data points.
o The algorithm splits this large cluster into smaller clusters recursively until every point is in
its own cluster or the required number of clusters is reached.

Distance Metrics Used:

● Single-Linkage: Minimum distance between points in two clusters.


● Complete-Linkage: Maximum distance between points in two clusters.
● Average-Linkage: Average distance between all points in the two clusters.

Pros:

● Does not require a pre-specified number of clusters.


● Can capture nested clusters.

Cons:

● Computationally expensive for large datasets.


● Sensitive to noise and outliers.

3. Density-Based Clustering

In density-based clustering, clusters are defined as regions of high density separated by regions of low
density. This method is particularly useful for discovering clusters of arbitrary shape and for handling
noise in the data.

Key Algorithm:

● DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
o It groups points that are closely packed together, requiring a minimum number of points within a specified distance (epsilon) to define a cluster.
o Points that lie in low-density regions are treated as noise or outliers.
● OPTICS (Ordering Points to Identify the Clustering Structure):
o Similar to DBSCAN but works better for datasets with varying density. It provides an ordering of the points and helps visualize the clustering structure.

Pros:

● Can find arbitrarily shaped clusters.


● Does not require a predefined number of clusters.
● Handles noise and outliers well.

Cons:

● Sensitive to the choice of parameters (epsilon and minPts in DBSCAN).


● Struggles with data that has large differences in density.
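
As a concrete illustration of the density-based approach described above, here is a minimal DBSCAN sketch using scikit-learn (assumed to be available); the toy data and the eps/min_samples settings are illustrative choices, not prescribed values:

```python
# A minimal DBSCAN sketch using scikit-learn (assumed available).
# The toy data and the eps / min_samples values are illustrative choices.
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus one isolated point that should be flagged as noise.
X = np.array([
    [1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.2],
    [8.0, 8.0], [8.2, 8.1], [7.9, 7.8], [8.1, 8.3],
    [4.5, 0.5],                         # isolated point, expected to be noise
])

model = DBSCAN(eps=0.5, min_samples=3)  # epsilon radius and minimum points
labels = model.fit_predict(X)           # label -1 marks noise/outliers
print(labels)
```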

4. Grid-Based Clustering

Grid-based clustering divides the data space into a finite number of cells that form a grid structure, and the
clustering operations are performed on these grid cells.

Key Algorithm:

● STING (Statistical Information Grid):


o Divides the spatial area into hierarchical grid cells.
o Each cell stores statistical information like the mean and variance of the points inside the cell.
o The algorithm uses this information to decide whether to merge or split the grid cells.

Pros:

● Efficient for large datasets.


● Handles both numeric and categorical data.

Cons:
● Grid size is fixed, which might affect cluster quality.
● Does not perform well for high-dimensional data.

5. Model-Based Clustering

Model-based clustering assumes that the data is generated by a mixture of underlying probability
distributions, typically Gaussian distributions. Each cluster corresponds to a different probability
distribution, and the algorithm estimates the parameters of these distributions.

Key Algorithms:

● Gaussian Mixture Models (GMM):


o Assumes that the data is a mixture of several Gaussian distributions.
o Each cluster is represented by a Gaussian distribution with its own mean and covariance
matrix.
o Uses the Expectation-Maximization (EM) algorithm to estimate the parameters of the
Gaussian distributions.
● Bayesian Clustering:
o Extends GMM by incorporating Bayesian principles, which help to avoid overfitting by using
prior distributions on the model parameters.

Pros:

● Can model more complex cluster shapes (elliptical or Gaussian).


● Provides probabilistic cluster memberships (i.e., each point has a probability of belonging to each
cluster).

Cons:

● Computationally expensive for large datasets.


● Assumes that the data follows a particular distribution (e.g., Gaussian).
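
For illustration, here is a minimal model-based clustering sketch using scikit-learn's GaussianMixture (assumed available); the synthetic data and the number of components are illustrative assumptions:

```python
# A minimal Gaussian Mixture Model sketch using scikit-learn (assumed available).
# The synthetic data and the number of components are illustrative assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two elliptical clusters drawn from different Gaussians.
X = np.vstack([
    rng.normal(loc=[0, 0], scale=[1.0, 0.3], size=(100, 2)),
    rng.normal(loc=[5, 5], scale=[0.5, 1.2], size=(100, 2)),
])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(X)                          # parameters are estimated with EM

hard_labels = gmm.predict(X)        # most likely cluster for each point
soft_probs = gmm.predict_proba(X)   # probabilistic membership in each cluster
print(gmm.means_)                   # estimated cluster means
```

The predict_proba output makes the probabilistic memberships mentioned above explicit: each row gives the probability of that point belonging to each of the clusters.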

6. Fuzzy Clustering

Fuzzy clustering allows each data point to belong to multiple clusters with varying degrees of
membership. Unlike hard clustering, where each point belongs to one cluster, fuzzy clustering allows for
uncertainty in the cluster assignments.

Key Algorithm:

● Fuzzy C-Means (FCM):


o A soft version of K-means clustering.
o Each point has a membership score that indicates its degree of belonging to each cluster.
o The algorithm minimizes an objective function that considers these membership scores.
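
For reference, the objective function that FCM minimizes can be written in its standard form as

J_m = Σi Σj (uij)^m · ||xi − cj||², subject to Σj uij = 1 for every point xi,

where uij is the membership degree of point xi in cluster j, cj is the centroid of cluster j, and m > 1 is the fuzziness parameter mentioned below.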

Pros:

● Useful in situations where data points can naturally belong to more than one cluster.
● Provides more nuanced clustering results.

Cons:

● Requires tuning of the fuzziness parameter.


● Computationally more complex than K-means.

7. Subspace and High-Dimensional Clustering

In high-dimensional datasets, clusters may only exist in a subset of dimensions, making traditional
clustering methods less effective. Subspace clustering addresses this issue by identifying clusters in
lower-dimensional subspaces of the data.

Key Algorithms:

● CLIQUE: Combines grid-based and density-based approaches to find clusters in subspaces.


● PROCLUS: A k-medoids-based method that finds clusters in subspaces by identifying relevant
dimensions for each cluster.

Pros:

● Effective for high-dimensional datasets.


● Can discover clusters in specific subspaces, ignoring irrelevant dimensions.

Cons:

● More complex and computationally expensive.


● Requires careful parameter tuning.

Correlations and distances

1. Correlation in Clustering

Correlation measures the relationship between variables, reflecting whether changes in one variable lead to
predictable changes in another.

In clustering, correlation is used when the goal is to group data based on the pattern or trend of variables. If
two variables are highly correlated, they likely belong to the same cluster.

● Pearson Correlation: Commonly used to measure linear relationships. Clustering algorithms like
hierarchical clustering can use correlation-based distances, where clusters are formed by grouping
variables with high correlation.
● Cosine Similarity: This measures the angle between two vectors (data points), often used when
clustering high-dimensional data like text or documents. If the angle is small (close to 0), the vectors
are considered highly correlated and can be clustered together.

Application: Correlation is useful in cases like time series or gene expression data clustering, where you care
about how data points rise or fall together.
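
A minimal sketch of computing these two measures in Python (NumPy assumed available; the two sample series follow the pattern used in Problem 1 below):

```python
# Pearson correlation and cosine similarity with NumPy (assumed available).
import numpy as np

a = np.array([3.0, 5.0, 7.0, 9.0])
b = np.array([2.0, 4.0, 6.0, 8.0])

# Pearson correlation: strength of the linear relationship (-1 to +1).
pearson_r = np.corrcoef(a, b)[0, 1]

# Cosine similarity: cosine of the angle between the two vectors.
cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# A correlation-based distance often used in hierarchical clustering: 1 - r.
corr_distance = 1.0 - pearson_r

print(pearson_r, cosine_sim, corr_distance)   # r = 1.0 for these perfectly linear series
```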

2. Distance in Clustering
Distance measures how far apart two data points are, often using spatial geometry. The smaller the distance,
the more similar the data points are, and they are likely to belong to the same cluster.

● Euclidean Distance: The most common distance measure, calculating the straight line between two
points in space. It is used in many clustering algorithms like K-means, where clusters form based on
proximity in space.
● Manhattan Distance: Measures the distance between two points along axes at right angles (like
walking in a city grid). It is useful when features are independent and vary greatly in magnitude.
● Mahalanobis Distance: Adjusts the distance based on the variance and covariance of the data, making it ideal when the features are correlated or have very different spreads.

Uses: Distance-based clustering (e.g., K-means or hierarchical clustering) is more effective when you're
interested in grouping data points that are spatially close together.
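
A minimal sketch of the three distance measures described above (SciPy assumed available; the example points and the covariance matrix are illustrative assumptions):

```python
# Euclidean, Manhattan and Mahalanobis distances with SciPy (assumed available).
# The example points and the covariance matrix are illustrative assumptions.
import numpy as np
from scipy.spatial import distance

p = np.array([0.0, 0.0])
q = np.array([3.0, 4.0])

euclidean = distance.euclidean(p, q)   # straight-line distance: 5.0 here
manhattan = distance.cityblock(p, q)   # city-grid distance: 7.0 here

# Mahalanobis distance needs the inverse covariance matrix of the features.
cov = np.array([[2.0, 0.5],
                [0.5, 1.0]])
mahalanobis = distance.mahalanobis(p, q, np.linalg.inv(cov))

print(euclidean, manhattan, mahalanobis)
```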

Combining Correlations and Distances

In practice, clustering can involve both distance and correlation:

● Clustering in high-dimensional data may use distance metrics like cosine similarity that are based
on correlations.
● Covariance-aware distances like the Mahalanobis distance adjust for relationships between variables, leading to more meaningful clusters in data with complex interactions.

Problem 1: Clustering Time Series Data Using Correlation

Consider the following time series data for two variables:

Time        1   2   3   4
Variable A  3   5   7   9
Variable B  2   4   6   8

Use Pearson correlation to determine whether Variables A and B are highly correlated and should belong to the same cluster. Solution: r = 1.
Problem 2: Clustering with Euclidean Distance

Point A: (1, 2), Point B: (4, 6), Point C: (7, 1). Determine whether Points A and B or Points A and C should be clustered together using Euclidean distance.
Problem 3: Clustering with Manhattan Distance

You are given three data points in a city grid: Point X: (1, 2), Point Y: (5, 5), Point Z: (3, 8). Cluster the points using Manhattan distance.

Problem 4: Mahalanobis Distance for Correlated Variables

Consider two data points A and B with two correlated features, Point A: (2, 3) and Point B: (6, 7), and covariance matrix Σ = [[4, 2], [2, 3]]. Compute the Mahalanobis distance between A and B.
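
The following small sketch can be used to check Problems 1-4 numerically (NumPy and SciPy assumed available; the values are taken directly from the problem statements above):

```python
# Quick numerical checks for Problems 1-4 (NumPy and SciPy assumed available).
import numpy as np
from scipy.spatial import distance

# Problem 1: Pearson correlation of the two time series (expected r = 1).
A = np.array([3, 5, 7, 9]); B = np.array([2, 4, 6, 8])
print("r =", np.corrcoef(A, B)[0, 1])

# Problem 2: Euclidean distances A-B and A-C.
pA, pB, pC = np.array([1, 2]), np.array([4, 6]), np.array([7, 1])
print(distance.euclidean(pA, pB), distance.euclidean(pA, pC))     # 5.0 vs about 6.08

# Problem 3: pairwise Manhattan distances between X, Y and Z.
pX, pY, pZ = np.array([1, 2]), np.array([5, 5]), np.array([3, 8])
print(distance.cityblock(pX, pY), distance.cityblock(pX, pZ),
      distance.cityblock(pY, pZ))                                 # 7, 8, 5

# Problem 4: Mahalanobis distance with the given covariance matrix.
cov = np.array([[4.0, 2.0], [2.0, 3.0]])
print(distance.mahalanobis([2, 3], [6, 7], np.linalg.inv(cov)))
```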

K-means clustering

The K-means clustering algorithm is a popular unsupervised machine learning algorithm used to partition a
dataset into K clusters.

Algorithm:

STEP 1. Initialization: Choose the number of clusters K.

STEP 2. Randomly initialize K centroids (either chosen randomly from the data points or by using other methods such as k-means++).

STEP 3. Repeat until convergence:

a. Assign each data point to the nearest centroid.
   i. For each data point in the dataset, compute the Euclidean distance (or another distance metric) between the point and each of the K centroids.
   ii. Assign each data point to the cluster whose centroid is closest to the point (i.e., the minimum distance).
b. Update the centroids.
   Once all data points are assigned to clusters, compute the new centroid of each cluster as the mean (average) of all the data points assigned to that cluster.

STEP 4. Check the stopping criteria: the algorithm converges when the centroids no longer change (or change very little), when the change in cluster assignments becomes minimal, or when a specified number of iterations is reached. Otherwise, repeat Step 3.

STEP 5. Return the final cluster assignments and centroids.

STEP 6. Visualize the data points grouped into the K clusters (for example, K = 3), with each colour representing a different cluster and crosses marking the cluster centroids.
Remark:

⮚ The algorithm minimizes the within-cluster variance (sum of squared distances from each
point to its cluster centroid).
⮚ K-means is sensitive to the initial placement of centroids, so using methods like k-means++
can help improve performance by initializing centroids more effectively.
⮚ The number of clusters K must be specified in advance, which can sometimes be a challenge.
Methods like the elbow method or silhouette score are used to choose the optimal K.
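
A minimal from-scratch sketch of the algorithm above (NumPy assumed available; the toy data, the value of K and the iteration limit are illustrative assumptions, and empty clusters are not handled):

```python
# A minimal from-scratch K-means sketch following the steps above (NumPy assumed).
# The toy data, K and the iteration limit are illustrative assumptions;
# empty clusters are not handled in this sketch.
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # STEP 2: initialize centroids as k randomly chosen data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iters):
        # STEP 3a: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # STEP 3b: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # STEP 4: stop when the centroids no longer change.
        converged = np.allclose(new_centroids, centroids)
        centroids = new_centroids
        if converged:
            break
    # STEP 5: return the final cluster assignments and centroids.
    return labels, centroids

# Example usage on a small 2-D dataset with K = 2.
X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0],
              [5.0, 7.0], [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)
```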

Problems:

1. Suppose we measure two variables X1 and X2 for four items A, B, C and D. The observations are as follows:

   Item   X1   X2
   A       5    3
   B      -1    1
   C       1   -2
   D      -3   -2

   Use the K-means clustering technique to divide the items into k = 2 clusters. Start with the initial groups (AB) and (CD).
2. Use K-means clustering to cluster the following data into two groups, where the initial cluster centroids are m1 = 4 and m2 = 11 and the distance function is Euclidean distance: {2, 4, 10, 12, 3, 20, 30, 11, 25}.
3. Use the K-means clustering algorithm to divide the following points into two clusters:

   X1   1   2   2   3   4   5
   X2   1   1   3   2   3   5

4. Cluster the following points into 3 clusters using K-means clustering: {(2,10), (2,5), (8,4), (5,8), (7,5), (6,4), (1,2), (4,9)}. (Or) Suppose that the data mining task is to cluster points into three clusters, where the points are A1(2, 10), A2(2, 5), A3(8, 4), B1(5, 8), B2(7, 5), B3(6, 4), C1(1, 2), C2(4, 9). The distance function is Euclidean distance. Suppose initially we assign A1, B1, and C1 as the centers of the three clusters, respectively. Using the K-means clustering algorithm, determine the new cluster centers.
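
As a quick check of Problem 2, scikit-learn's KMeans (assumed available) can be seeded with the given initial centroids m1 = 4 and m2 = 11:

```python
# Checking Problem 2 with scikit-learn's KMeans (assumed available),
# seeded with the initial centroids m1 = 4 and m2 = 11 from the problem.
import numpy as np
from sklearn.cluster import KMeans

data = np.array([2, 4, 10, 12, 3, 20, 30, 11, 25], dtype=float).reshape(-1, 1)
init_centroids = np.array([[4.0], [11.0]])

km = KMeans(n_clusters=2, init=init_centroids, n_init=1).fit(data)
print(km.labels_)            # which of the two groups each value falls into
print(km.cluster_centers_)   # the final centroids of the two groups
```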

Hierarchical Clustering

Algorithm:

Step1. Initialization

● Start with each data point as a separate cluster. If you have N data points, initialize with N
clusters (each cluster containing one point).

Step 2. Calculate Distance Matrix

● Compute the distance (or similarity) between every pair of clusters. Use a distance metric like
Euclidean distance, Manhattan distance, or others depending on your data.
● Store the distances in a distance matrix.

Step 3. Merge Closest Clusters

● Find the pair of clusters that are closest (have the smallest distance) and merge them into a
single cluster.
● This reduces the number of clusters by 1.

Step 4. Update Distance Matrix

● After merging, update the distance matrix to reflect the new cluster distances.
● The distance between the new cluster and the remaining clusters is calculated using a linkage
criterion such as:
o Single Linkage (Minimum): Distance between two clusters is the minimum distance
between any pair of points in the two clusters.
o Complete Linkage (Maximum): Distance is the maximum distance between any pair
of points in the clusters.
o Average Linkage: Distance is the average of all pairwise distances between points in
the clusters.
o Centroid Linkage: Distance between the centroids (mean points) of the clusters.
Step 5. Repeat

● Repeat steps 3 and 4 until all data points are in a single cluster, or a predefined number of
clusters is reached.

Step 6. Build a Dendrogram

● During the merging process, keep track of the order in which clusters are merged.
● Construct a dendrogram (a tree-like diagram) that shows the hierarchical relationship
between clusters at different levels of similarity.

Step 7. Cut the Dendrogram

● To get a final clustering solution, you can "cut" the dendrogram at a specific height. This will
result in a specified number of clusters, depending on where the cut is made.
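
A minimal sketch of agglomerative clustering with SciPy's hierarchical-clustering utilities (assumed available); the sample points are those of Problem 2 below, and the complete-linkage choice and the cut into two clusters are illustrative assumptions:

```python
# A minimal agglomerative clustering sketch with SciPy (assumed available).
# The points are those of Problem 2 below; complete linkage and the cut into
# two clusters are illustrative assumptions.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 1], [1.5, 1.5], [5, 5], [3, 4], [4, 4], [3, 3.5]])

# Steps 2-5: build the merge hierarchy using complete linkage on Euclidean distances.
Z = linkage(X, method="complete", metric="euclidean")

# Step 7: "cut" the hierarchy to obtain a fixed number of clusters (here 2).
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

# Step 6: the dendrogram can be drawn if matplotlib is installed:
# from scipy.cluster.hierarchy import dendrogram
# import matplotlib.pyplot as plt
# dendrogram(Z); plt.show()
```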

Problems.
1. Consider the hypothetical distances between pairs of five objects given by the matrix

   D =
         1    2    3    4    5
   1  [  0    9    3    6   11 ]
   2  [  9    0    7    5   10 ]
   3  [  3    7    0    9    2 ]
   4  [  6    5    9    0    8 ]
   5  [ 11   10    2    8    0 ]

   Cluster the items using each of the following procedures: (a) the single-linkage hierarchical procedure; (b) the complete-linkage hierarchical procedure. Draw the dendrograms and compare the results in (a) and (b).
2. Using the complete-linkage technique, perform agglomerative hierarchical clustering on the points {(1,1), (1.5,1.5), (5,5), (3,4), (4,4), (3,3.5)}.

Overlapping Clustering

This approach generally follows soft clustering methods like Fuzzy C-Means or Probabilistic Latent
Semantic Analysis (PLSA), where data points can belong to multiple clusters with varying degrees of
membership.

Overlapping Clustering Algorithm:

Step1. Data Preparation

● Input: A dataset D with n data points.


● Features: Each data point is represented by a feature vector.
● Distance Metric: Define a similarity/distance metric (Euclidean, Cosine, etc.).

Step 2. Initialization

● Set Parameters: Choose the number of clusters k (may not be fixed if it’s adaptive), a
threshold for cluster assignment, and maximum iterations.
● Centroids: Randomly initialize k cluster centroids or select initial cluster centers based on a
heuristic (like k-means++ initialization).

Step 3. Cluster Membership Assignment

● For each data point, assign soft membership to each cluster based on similarity:
o Calculate the distance/similarity of the data point to each centroid.
o Convert the distance into a degree of membership to each cluster (e.g., using a
Gaussian function or a normalized similarity measure).
o Ensure that the membership for each data point across all clusters sums up to 1.

Step 4. Membership Update

● Iterate over all points to update memberships:


o For a given data point xi, calculate the probability of belonging to each cluster based
on distances or similarities.
o If the similarity to multiple clusters exceeds a given threshold, the data point is
considered to belong to those clusters, allowing for overlap.

Step 5. Centroid Update

● Update the centroid of each cluster by calculating the weighted average of all points based

on their membership values:


o For each cluster Cj, compute the new centroid as cj = (Σi uij · xi) / (Σi uij), where uij is the membership degree of point xi in cluster Cj.

Step 6. Convergence Check

● Stopping Criteria: Check if the centroids change less than a defined threshold or if the
maximum number of iterations is reached. If not, go back to step 4.

Step 7. Cluster Assignment

● Assign each point to one or more clusters where its membership is above a certain threshold.
If the point has significant membership in multiple clusters, it belongs to those clusters
(allowing overlap).

Step 8. Post-processing (Optional)

● Refine memberships: If needed, refine memberships based on additional criteria (such as


reducing overlap by pruning low-membership assignments).
● Outlier Detection: Identify points with very low membership in all clusters and treat them as
outliers.

Step 9. Output

● Final overlapping clusters, where each point may belong to multiple clusters based on its
degree of membership.
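
A minimal sketch of this soft/overlapping assignment procedure, in the spirit of Fuzzy C-Means (NumPy assumed available). The toy data, the number of clusters c, the fuzziness parameter m and the overlap threshold are illustrative assumptions, and the sketch initializes the membership matrix randomly rather than the centroids; the iteration itself follows Steps 3-6 above:

```python
# A minimal soft/overlapping clustering sketch in the spirit of Fuzzy C-Means.
# NumPy is assumed available; the data, c, m and the overlap threshold are
# illustrative assumptions.
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, max_iters=100, tol=1e-5, seed=0):
    rng = np.random.default_rng(seed)
    # Random initial memberships; each row (one point) sums to 1.
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iters):
        # Step 5: centroids as membership-weighted averages of the points.
        W = U ** m
        centroids = (W.T @ X) / W.sum(axis=0)[:, None]
        # Steps 3-4: update soft memberships from distances to the centroids;
        # each row still sums to 1, so a point can belong to several clusters.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) + 1e-10
        U_new = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2 / (m - 1)), axis=2)
        # Step 6: stop when the memberships barely change.
        if np.max(np.abs(U_new - U)) < tol:
            U = U_new
            break
        U = U_new
    return U, centroids

# Step 7: a point belongs to every cluster whose membership exceeds a threshold.
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8], [3.0, 3.0]])
U, centroids = fuzzy_c_means(X, c=2)
threshold = 0.3                                       # assumed overlap threshold
overlapping_assignments = [np.where(row > threshold)[0] for row in U]
print(np.round(U, 2))
print(overlapping_assignments)
```

The point (3, 3), lying midway between the two groups, should receive a substantial membership in both clusters and therefore be assigned to both of them, which is exactly the overlap allowed for in Step 7.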
