
UNIT-V

CLUSTERING

Dr. Suresh Chimkode


CLUSTERING
What is Clustering?

 Clustering is a type of unsupervised learning method used in data mining to group a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups (clusters).

 The similarity between objects is typically measured using a distance metric (e.g., Euclidean distance).

 Clustering is used to uncover the natural structure of the data and discover patterns. It helps in identifying homogeneous groups in datasets, which can then be analyzed for insights.



What are the uses of Clustering? Discuss briefly.
 Clustering has a wide range of applications across various fields due to its
ability to find natural groupings in data.
 Few key uses of clustering are:
1. Customer Segmentation
2. Image Segmentation
3. Document Clustering
4. Anomaly Detection
5. Social Network Analysis,…..
1. Customer Segmentation
 In applications such as retail and marketing, companies segment their customers based on purchasing behavior, demographics, or other attributes.
 This helps companies tailor marketing strategies, personalize customer experiences, and improve customer retention.

2. Image Segmentation
 In computer vision applications, images are partitioned into meaningful segments (e.g., objects, background).
 It is used in medical imaging, object recognition, and photo editing applications.
A few key uses of clustering are……
3. Document Clustering
 In information retrieval and natural language processing, similar documents are grouped together for topic extraction, information retrieval, and document organization.
 This is useful for search engines, recommendation systems, and organizing large collections of texts.

4. Anomaly Detection
 In cybersecurity and fraud detection, clustering identifies unusual patterns that do not conform to expected behavior.
 Clustering can help detect fraudulent transactions, network intrusions, and other anomalies by grouping normal data points and identifying outliers.

5. Social Network Analysis
 In social media and communication networks, clustering identifies communities within social networks to understand social structures, influence patterns, and communication dynamics.



Problem Definition
Problem Definition: Customer Segmentation in Retail
A retail company aims to segment its customers into distinct groups based on
their purchasing behavior. The goal is to tailor marketing strategies, improve
customer service, and enhance product offerings.
Dataset:
Customer ID
Age
Annual Income
Spending Score: A metric assigned by the company based on customer
behavior and purchasing data.
Dataset Sample



Steps to Perform Clustering:
1. Data Collection: Collect the customer data, which includes Customer ID,
Age, Annual Income, and Spending Score.
2. Data Preprocessing:
 Handle any missing values.
 Normalize the data to ensure all features contribute equally to the
distance calculation.
3. Choose a Clustering Algorithm: We can use K-Means clustering for this example.
4. Determine the Number of Clusters (𝑘): Use the Elbow Method to decide the optimal number of clusters.
5. Apply the Clustering Algorithm: Run the K-Means algorithm with the chosen number of clusters (𝑘).

Expected Results: Imagine we have chosen 𝑘 = 3 based on the Elbow Method and obtained the following clusters:
Cluster 1: Young customers with low annual income and moderate spending scores.
Cluster 2: Middle-aged customers with moderate annual income and high spending scores.
Cluster 3: Older customers with high annual income and low spending scores.
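To make these steps concrete, here is a minimal sketch of the workflow in Python with pandas and scikit-learn. It is illustrative only: the file name customers.csv and the exact column names are assumptions based on the dataset description above; the slides themselves do not include code.

# Hedged sketch: K-Means customer segmentation with the Elbow Method.
# "customers.csv" and the column names below are assumed, not given in the slides.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

df = pd.read_csv("customers.csv")                    # Customer ID, Age, Annual Income, Spending Score
X = df[["Age", "Annual Income", "Spending Score"]]   # drop the ID column
X = X.fillna(X.mean())                               # step 2: handle missing values
X_scaled = StandardScaler().fit_transform(X)         # step 2: normalize the features

# Step 4 (Elbow Method): compute WCSS (inertia) for k = 1..10 and look for the "elbow".
wcss = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled).inertia_
        for k in range(1, 11)]

# Step 5: run K-Means with the chosen k (k = 3 in the expected results above).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
df["Cluster"] = kmeans.fit_predict(X_scaled)
print(df.groupby("Cluster")[["Age", "Annual Income", "Spending Score"]].mean())

Each cluster's mean Age, Annual Income, and Spending Score can then be inspected to label the segments, as in the three clusters described above.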
Requirements of clustering in data mining.
1. Scalability:
 Many clustering algorithms work well on small data sets containing
hundreds of data objects; however, a large database may contain millions
or even billions of objects, particularly in Web search scenarios.
 Clustering on only a sample of a given large data set may lead to biased
results. Therefore, highly scalable clustering algorithms are needed.

2. Ability to deal with different types of attributes:


 Many algorithms are designed to cluster numeric (interval-based) data.
However, applications may require clustering other data types, such as
binary, nominal (categorical), and ordinal data, or mixtures of these data
types.

3. Discovery of clusters with arbitrary shape:


 Many clustering algorithms determine clusters based on Euclidean or
Manhattan distance measures. Algorithms based on such distance
measures tend to find spherical clusters with similar size and density.
 However, it is important to develop algorithms that can detect clusters of arbitrary shape.
Requirements of clustering in data mining….
4. Requirements for domain knowledge to determine input parameters:
Many clustering algorithms require users to provide domain knowledge in the
form of input parameters such as the desired number of clusters.

5. Ability to deal with noisy data:


 Most real-world data sets contain outliers and/or missing, unknown, or
erroneous data.
 Clustering algorithms can be sensitive to such noise and may produce poor-
quality clusters. Therefore, we need clustering methods that are robust to
noise.
6. Incremental clustering and insensitivity to input order:
 In many applications, incremental updates (representing newer data) may arrive at any time. Some clustering algorithms cannot incorporate incremental updates into existing clustering structures and, instead, have to recompute a new clustering from scratch.
 Clustering algorithms may also be sensitive to the input data order. That is, given a set of data objects, clustering algorithms may return dramatically different clusterings depending on the order in which the objects are presented.
So, incremental clustering algorithms, and algorithms that are insensitive to the input order, are needed.
Overview of Basic Clustering Methods



Partitioning clustering
 Partitioning methods are a fundamental category of clustering algorithms
that divide a dataset into distinct non-overlapping clusters.
 The goal is to organize the data into clusters where each data point belongs
to exactly one cluster, and the points within a cluster are more similar to
each other than to those in other clusters.

K-Means Clustering

 K-Means clustering is a popular partitioning method used in unsupervised learning to group a dataset into 𝑘 distinct, non-overlapping clusters.

 It divides the data such that the points within each cluster are as similar as possible, while points in different clusters are as dissimilar as possible.



K-Means Clustering Steps:

1. Initialization
 Choose the Number of Clusters 𝑘: The user specifies the number of clusters they want to create.
 Initialize Centroids: Randomly select 𝑘 points from the dataset as the initial centroids. These centroids act as the initial center points for the clusters.

2. Assignment Step
Assign Points to Nearest Centroid:
 For each data point in the dataset, calculate its distance to each centroid.
Assign the data point to the cluster whose centroid is closest.

3. Update Step
Recalculate Centroids:
 After all points are assigned to clusters, recalculate the centroids.
 The new centroid of each cluster is the mean of all the points assigned to
that cluster.
4. Repeat
Iterate: Repeat the Assignment and Update steps until the centroids no longer change significantly or until a predefined number of iterations is reached.
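These four steps map directly onto a short program. The following is a minimal from-scratch sketch in Python/NumPy (illustrative only, not code from the slides); in practice a library routine such as sklearn.cluster.KMeans would normally be used.

# Minimal from-scratch K-Means sketch (illustrative). X is an (n_samples, n_features)
# NumPy array and k is the user-chosen number of clusters.
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: randomly pick k data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # 2. Assignment: label each point with its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: the new centroid is the mean of the points assigned to the cluster
        #    (keep the old centroid if a cluster happens to become empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Repeat until the centroids no longer change significantly.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids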
Objective Function (Cost Function)
 The goal of the K-Means algorithm is to minimize the sum of squared
distances between the data points and their assigned centroids. This is
known as the within-cluster sum of squared errors (WCSS).
 The objective function to minimize is:
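The equation image from the slide is not reproduced in the text; the standard WCSS objective being described is

J = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2

where C_i is the i-th cluster and \mu_i is its centroid (the mean of the points in C_i).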

Example Dataset



We will perform K-Means clustering with K = 2 (i.e., we want to divide the data
into 2 clusters).
Step 1: Initialization
Let’s assume we randomly choose the initial centroids for K = 2.

Step 2: Assignment Step


We will now assign each data point to the nearest centroid based on the Euclidean distance. The Euclidean distance between two points (x1, y1) and (x2, y2) is d = sqrt((x2 - x1)^2 + (y2 - y1)^2).

Calculating the distances between each point and the centroids:



Calculating the distances between each point and the centroids.. …



Calculating the distances between each point and the centroids.. …

Cluster Assignments:

Step 3: Update Step


Now, we compute the new centroids by calculating the mean of the points
assigned to each cluster.



Calculating the distances between each point and the centroids.. …
Step 4: Reassign Points
 We now reassign the points to the new centroids.
 Reassigning Points:

 Since the assignments and centroids haven’t changed after the update, the
algorithm converges.
Final Clusters:

The K-Means algorithm has successfully grouped the data points into the two final clusters.



PAM (Partitioning Around Medoids) Algorithm
 The Partitioning Around Medoids (PAM) algorithm is a clustering
technique used in data mining. It partitions a dataset into k clusters.
 It is similar to the K-means algorithm (which uses centroids), but
instead of using centroids as cluster representatives, PAM uses actual
data points (medoids) as cluster centers.
 Medoids are representative objects that minimize the dissimilarity
within the cluster.
Steps of the PAM Algorithm
1. Initialize: Select k medoids randomly from the dataset.
2. Assign Points to Clusters: Assign each data point to the nearest
medoid.
3. Compute New Medoids: For each cluster, calculate the new medoid
by finding the data point that minimizes the sum of dissimilarities
within the cluster.
4. Check for Convergence: If the medoids have changed, repeat from
Step 2. Otherwise, stop.
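The steps above can be sketched in a few lines of Python/NumPy. This is an illustrative sketch, not code from the slides; it assumes a precomputed dissimilarity matrix D (discussed in the next section), and the classical PAM swap step described later refines this basic assign/recompute loop.

# Minimal k-medoids sketch following the four steps above (illustrative only).
# D is a precomputed (n, n) dissimilarity matrix; k is the number of clusters.
import numpy as np

def k_medoids(D, k, seed=0):
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=k, replace=False)     # 1. initialize k medoids randomly
    while True:
        labels = D[:, medoids].argmin(axis=1)               # 2. assign each point to its nearest medoid
        new_medoids = medoids.copy()
        for j in range(k):                                   # 3. new medoid = member that minimizes
            members = np.where(labels == j)[0]               #    the sum of within-cluster dissimilarities
            if len(members) == 0:
                continue
            costs = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[costs.argmin()]
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
            return medoids, labels                           # 4. converged: medoids unchanged
        medoids = new_medoids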
Mathematical Formulation
1. Dissimilarity Matrix (D):
 The dissimilarity matrix D represents the pairwise distances between each
pair of points in the dataset.
 The distance function used is typically Euclidean distance, but other
distance measures like Manhattan distance or cosine similarity can also be
used.
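As a small illustration (not from the slides), the dissimilarity matrix can be computed with SciPy. The three points used here (x1, x2, x5) are the only coordinates stated explicitly in the example that follows; the full dataset and a different metric can be swapped in as needed.

# Sketch: pairwise dissimilarity matrix with SciPy (Euclidean by default;
# "cityblock" or "cosine" could be used instead).
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1, 2], [2, 3], [7, 8]])        # x1, x2, x5 from the PAM example below
D = squareform(pdist(X, metric="euclidean"))  # (n, n) symmetric distance matrix
print(np.round(D, 2))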

2. Objective Function (Cost Function):


 The objective function in PAM is to minimize the total dissimilarity within all
clusters.
 This is done by minimizing the sum of the dissimilarities between the data
points and their corresponding medoids.
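The slide's cost-function image is not included in the extracted text; the objective it describes can be written as

\mathrm{Cost} = \sum_{i=1}^{k} \sum_{x \in C_i} d(x, m_i)

where m_i is the medoid of cluster C_i and d is the chosen dissimilarity measure.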



3. Swap Operations:
 PAM improves the medoids by considering swapping one of the current
medoids with a non-medoid point.
 The new medoid is selected to minimize the total cost (total dissimilarity) of
the clusters.
 This is done by iterating through all possible swaps and keeping the one that leads to the greatest improvement in minimizing the cost function.
 For a given current medoid 𝑚𝑎 and a candidate non-medoid point 𝑚𝑏 , the new cost after swapping them can be calculated as:
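The slide's equation image is not included in the extracted text; one standard way to express the post-swap cost (a reconstruction, not necessarily the slide's exact notation) is

\mathrm{Cost}(M') = \sum_{i=1}^{n} \min_{m \in M'} d(x_i, m), \qquad M' = (M \setminus \{m_a\}) \cup \{m_b\}

that is, the sum of each point's dissimilarity to its nearest medoid after the swap; the swap is kept only if this cost is lower than the current cost.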

4. Recompute Medoids:
After swapping, the medoids are updated based on the new clusters formed by
the closest points to the new medoids.



Example: Consider a simple example dataset to illustrate how PAM works.
Assume we have the following 2D dataset:

We want to cluster the dataset into k=2 clusters.

Step 1: Initialize the Medoids
Let’s randomly select 2 medoids. Suppose we select 𝑥1 (1, 2) and 𝑥5 (7, 8) as the initial medoids.

Step 2: Assign Points to Clusters
We calculate the distance from each point to each of the two medoids. Using
Euclidean distance:

For each point 𝑥𝑖 , we assign it to the medoid that is closest.


Distances:
From x1 (1, 2):
x1 to x1: sqrt((1-1)^2 + (2-2)^2) = 0
x2 to x1: sqrt((2-1)^2 + (3-2)^2) ≈ 1.41
Similarly, we can calculate x3 to x1, x4 to x1, x5 to x1, and x6 to x1.
Similarly, calculate the distances from x5 (7, 8): x1 to x5, x2 to x5, x3 to x5, x4 to x5, x5 to x5, and x6 to x5.



So, we have two clusters:

Step 3: Recompute the Medoids


Now, for each cluster, we calculate the sum of dissimilarities within the cluster
and determine which point within the cluster minimizes the sum.

Step 4: Check for Convergence


If the medoids have not changed, the algorithm converges. If the medoids have
changed, repeat Step 2.
In this case, since the medoids haven't changed, the algorithm stops, and we
have our final clusters:



For Cluster 1 (𝑥1, 𝑥2, 𝑥3): how to calculate the total dissimilarity for each point as the potential new medoid.



We now need to calculate the total dissimilarity for each point ( 𝑥1, 𝑥2 , and
𝑥3) as the potential new medoid.



Now that we have the total dissimilarities for each point as the potential new medoid:
 The point with the minimum total dissimilarity will be chosen as the new medoid. In this case, 𝑥2 (2, 3) has the minimum dissimilarity of 2.82 and will be selected as the new medoid for Cluster 1.
i.e., 𝑥2 is the optimal medoid for Cluster 1 based on the total dissimilarity calculation.



Hierarchical clustering

What is Hierarchical clustering? Discuss the methods of Hierarchical clustering.

 Hierarchical clustering is a method of cluster analysis that seeks to build a hierarchy of clusters.
 Unlike k-means clustering, which requires specifying the number of clusters
in advance, hierarchical clustering does not need such a pre-specified
number.
 The hierarchy can be represented as a tree structure known as a
dendrogram, where each node represents a cluster, and the branches show
the hierarchy of clusters.

Hierarchical clustering can be broadly categorized into two types:


1. Agglomerative Hierarchical Clustering
2. Divisive Hierarchical Clustering



Hierarchical clustering Types

1. Agglomerative Hierarchical Clustering:

It is also known as bottom-up clustering. It starts with each data point as an individual cluster and merges the most similar clusters iteratively until all points belong to one cluster or a desired number of clusters is reached.

2. Divisive Hierarchical Clustering:

It is also known as top-down clustering. It starts with all data points in a single cluster and splits the least cohesive clusters iteratively until each data point is its own cluster or a desired number of clusters is reached.



Working of Agglomerative Hierarchical Clustering:
Consider a simple dataset with points (height and weight):
Point Height (cm) Weight (kg)
A 160 55
B 165 60
C 170 65
D 175 70
E 180 75
Step 1: Start with Individual Clusters.
Each data point starts as its own cluster.
 Initial clusters: {A}, {B}, {C}, {D}, {E}
Step 2: Calculate Distance Matrix
Calculate the Euclidean distance between each pair of data points.
For points A (160, 55) and B (165, 60): d(A, B) = sqrt((165 - 160)^2 + (60 - 55)^2) = sqrt(50) ≈ 7.07

Repeat the same type of calculation for all pairs:



Working of Agglomerative Hierarchical Clustering:
Step 3: Merge Closest Clusters
Find the pair of clusters with the smallest distance and merge them.
Closest clusters: {A} and {B} (distance ≈ 7.07)
New clusters: {AB}, {C}, {D}, {E}

Step 4: Update Distance Matrix


Recalculate the distances between the new cluster and the remaining clusters.
Use complete linkage (maximum distance between points in clusters).
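For comparison (this is not part of the slides), the same height/weight points can be clustered with SciPy, which carries out these merge-and-update steps automatically using complete linkage.

# Sketch: agglomerative clustering of points A-E with complete linkage (illustrative).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[160, 55], [165, 60], [170, 65], [175, 70], [180, 75]])  # A..E
Z = linkage(X, method="complete", metric="euclidean")  # merge history (dendrogram data)
print(Z)                                   # each row: cluster i, cluster j, distance, new size
labels = fcluster(Z, t=2, criterion="maxclust")        # cut the tree into 2 clusters
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree if matplotlib is available.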



THANK YOU
You might also like