UNIT-IV
UNSUPERVISED LEARNING
Introduction
Unsupervised learning is a branch of machine learning that deals with
unlabeled data. Unlike supervised learning, where the data is labeled with
a specific category or outcome, unsupervised learning algorithms are
tasked with finding patterns and relationships within the data
without any prior knowledge of the data’s meaning.
Unsupervised machine learning algorithms find hidden patterns in data without any human intervention, i.e., we do not provide target outputs to the model. The training data contains only input parameter values, and the model discovers the groups or patterns on its own.
The image shows a set of animals (elephants, camels, and cows) that represents the raw data the unsupervised learning algorithm will process.
• The “Interpretation” stage signifies that the algorithm has no predefined labels or categories for the data. It must figure out how to group or organize the data based on inherent patterns.
• The Algorithm stage represents the core of the unsupervised learning process, using techniques like clustering, dimensionality reduction, or anomaly detection to identify patterns and structures in the data.
• The Processing stage shows the algorithm working on the data.

Unsupervised Learning Algorithms
There are three main types of algorithms used on unsupervised datasets:
• Clustering

• Association Rule Learning

• Dimensionality Reduction
1. Clustering Algorithms

Clustering in unsupervised machine learning is the process of grouping unlabeled data into clusters based on their similarities. The goal of clustering is to identify patterns and relationships in the data without any prior knowledge of the data’s meaning.
• Broadly, this technique is applied to group data based on different patterns, such as similarities or differences, that the model finds. These algorithms process raw, unclassified data objects into groups.
Some common clustering algorithms:
• K-means Clustering: Groups data into K clusters based on how close the points are to each other.
• Hierarchical Clustering: Creates clusters by building a tree step by step, either merging or splitting groups.
• Density-Based Clustering (DBSCAN): Finds clusters in dense areas and treats scattered points as noise.
• Mean-Shift Clustering: Discovers clusters by moving points toward the most crowded areas.
• Spectral Clustering: Groups data by analyzing connections between points using graphs.
2. Association Rule Learning

Association rule learning, also known as association rule mining, is a common technique used to discover associations in unsupervised machine learning. It is a rule-based ML technique that finds useful relations between the parameters of a large dataset. It is most often used for market basket analysis, which helps to better understand the relationships between different products.

For example, shopping stores use algorithms based on this technique to find relationships between the sales of different products based on customer behavior: if a customer buys milk, they may also buy bread, eggs, or butter. Once trained well, such models can be used to increase sales by planning targeted offers.
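A minimal market-basket sketch of this idea, assuming the mlxtend library and a tiny hand-made one-hot transaction table (the slides do not name a specific library or dataset, so both are illustrative assumptions):

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical one-hot encoded transactions: each row is a basket,
# each column marks whether that item was bought
transactions = pd.DataFrame({
    'milk':   [1, 1, 0, 1, 1],
    'bread':  [1, 1, 1, 1, 0],
    'eggs':   [0, 1, 0, 1, 1],
    'butter': [1, 0, 0, 1, 0],
}).astype(bool)

# Itemsets that appear in at least 40% of baskets
frequent_itemsets = apriori(transactions, min_support=0.4, use_colnames=True)

# Rules such as "milk -> bread" with at least 70% confidence
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
print(rules[['antecedents', 'consequents', 'support', 'confidence']])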
3. Dimensionality Reduction
Dimensionality reduction is the process of reducing the number of features in a dataset while preserving as much information as possible. This technique is useful for improving the performance of machine learning algorithms and for data visualization.
• Imagine a dataset of 100 features about students (height, weight, grades, etc.). To focus on key traits, you reduce it to just 2 features, such as height and grades, making it easier to visualize or analyze the data.
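A minimal sketch of this idea with scikit-learn's PCA; the small student matrix below is made up purely for illustration:

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical student data: 5 students x 4 features
# (height cm, weight kg, grade %, attendance %)
X = np.array([
    [170, 65, 82, 90],
    [160, 55, 74, 85],
    [180, 80, 91, 95],
    [165, 60, 68, 70],
    [175, 72, 88, 92],
], dtype=float)

# Project the 4 features down to 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (5, 2)
print(pca.explained_variance_ratio_)  # share of variance kept by each component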
Here are some popular Dimensionality Reduction algorithms:
• Principal Component Analysis (PCA): Reduces dimensions by transforming data into uncorrelated principal components.
• Linear Discriminant Analysis (LDA): Reduces dimensions while maximizing class separability for classification tasks.
• Non-negative Matrix Factorization (NMF): Breaks data into non-negative parts to simplify representation.
• Locally Linear Embedding (LLE): Reduces dimensions while preserving the relationships between nearby points.
• Isomap: Captures global data structure by preserving distances along a manifold.
Applications of Unsupervised learning
Unsupervised learning has diverse applications across industries and domains. Key
applications include:

• Customer Segmentation: Algorithms cluster customers based on purchasing behavior or demographics, enabling targeted marketing strategies.
• Anomaly Detection: Identifies unusual patterns in data, aiding fraud detection, cybersecurity, and equipment failure prevention.
• Recommendation Systems: Suggests products, movies, or music by analyzing user behavior and preferences.
• Image and Text Clustering: Groups similar images or documents for tasks like organization, classification, or content recommendation.
• Social Network Analysis: Detects communities or trends in user interactions on social media platforms.
• Astronomy and Climate Science: Classifies galaxies or groups weather patterns to support scientific research.
Clustering in Machine Learning

Clustering or cluster analysis is a machine learning technique that groups an unlabelled dataset. It can be defined as "A way of grouping the data points into different clusters, consisting of similar data points. The objects with the possible similarities remain in a group that has less or no similarities with another group."
It does this by finding similar patterns in the unlabelled dataset, such as shape, size, color, or behavior, and divides the data according to the presence or absence of those patterns.
• It is an unsupervised learning method, hence no supervision is provided
to the algorithm, and it deals with the unlabeled dataset.
The clustering technique is commonly used for statistical data analysis.
Example: Let's understand the clustering technique with the real-world example of a shopping mall. When we visit a mall, we can observe that items with similar usage are grouped together: t-shirts are grouped in one section, trousers in another, and in the produce section apples, bananas, mangoes, etc., are kept separately, so that we can easily find things. The clustering technique works in the same way. Another example of clustering is grouping documents by topic.

Types of Clustering Methods

Clustering methods are broadly divided into hard clustering (each data point belongs to only one group) and soft clustering (a data point can belong to more than one group). Several other approaches to clustering also exist.
Below are the main clustering methods used in Machine learning:

1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering
1. Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also known as the centroid-based method. The most common example of partitioning clustering is the K-Means Clustering algorithm.

In this type, the dataset is divided into a set of K groups, where K is the number of pre-defined groups. The cluster centers are chosen so that the distance between the data points and their own cluster centroid is smaller than the distance to any other cluster centroid.

2. Density-Based Clustering
The density-based clustering method connects highly dense areas into clusters, and arbitrarily shaped clusters are formed as long as the dense regions can be connected. The algorithm identifies regions of high density in the data space and joins them into clusters, with dense areas separated from each other by sparser areas.
These algorithms can have difficulty clustering data points if the dataset has varying densities or high dimensionality.

3. Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is divided based on the probability that a data point belongs to a particular distribution. The grouping is done by assuming the data follows certain distributions, most commonly the Gaussian distribution.
• An example of this type is the Expectation-Maximization clustering algorithm, which uses Gaussian Mixture Models (GMM).
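A minimal sketch with scikit-learn's GaussianMixture on synthetic data (the data and parameter values are illustrative assumptions):

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic data with 3 underlying groups
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

# Fit a mixture of 3 Gaussians and assign each point to its most likely component
gmm = GaussianMixture(n_components=3, random_state=0)
labels = gmm.fit_predict(X)

# Soft assignments: probability of each point belonging to each component
probs = gmm.predict_proba(X)
print(probs[:5].round(3))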

4. Hierarchical Clustering
• Hierarchical clustering can be used as an alternative to partitioning clustering because there is no requirement to pre-specify the number of clusters. In this technique, the dataset is organized into clusters that form a tree-like structure called a dendrogram. Any number of clusters can then be selected by cutting the tree at the appropriate level. The most common example of this method is the Agglomerative Hierarchical Clustering algorithm.
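A minimal sketch of agglomerative clustering and its dendrogram using SciPy, assuming a small synthetic dataset:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from sklearn.datasets import make_blobs

# Synthetic data with 3 groups
X, _ = make_blobs(n_samples=30, centers=3, cluster_std=0.8, random_state=0)

# Build the merge tree bottom-up using Ward linkage
Z = linkage(X, method='ward')

# Plot the dendrogram; "cutting" it at a chosen height selects the clusters
dendrogram(Z)
plt.title("Agglomerative Hierarchical Clustering Dendrogram")
plt.show()

# Example cut: keep 3 clusters
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)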

5. Fuzzy Clustering
• Fuzzy clustering is a type of soft clustering in which a data object may belong to more than one group or cluster. Each data point has a set of membership coefficients that indicate its degree of membership in each cluster. The Fuzzy C-means algorithm is an example of this type of clustering; it is sometimes also known as the Fuzzy k-means algorithm.

K-Means Clustering Algorithm
• K-Means Clustering is an unsupervised learning algorithm that is
used to solve the clustering problems in machine learning or data
science.
What is K-Means Algorithm?
• K-Means Clustering is an unsupervised learning algorithm that groups an unlabeled dataset into different clusters. Here K defines the number of pre-defined clusters to be created in the process: if K=2, there will be two clusters; for K=3, there will be three clusters; and so on.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.
The k-means clustering algorithm mainly performs two tasks:
• Determines the best positions for the K center points or centroids by an iterative process.
• Assigns each data point to its closest k-center. The data points near a particular k-center form a cluster.
Hence each cluster contains data points with some commonalities and is distinct from the other clusters.
• The below diagram explains the working of the K-means Clustering
Algorithm:
K-means clustering is a popular unsupervised machine learning algorithm used to partition a
dataset into K distinct, non-overlapping subsets or clusters. Here's a concise explanation of
how K-means clustering works:

1. Initialization: Choose the number of clusters (K) and randomly initialize K centroids (cluster centers).
2. Assignment: Assign each data point to the nearest centroid based on Euclidean distance. This forms K clusters.
3. Update: Calculate the new centroids by taking the mean of all data points assigned to each cluster.
4. Repeat: Repeat the assignment and update steps until the centroids no longer change significantly or a maximum number of iterations is reached.
5. Output: The algorithm outputs the final cluster centroids and the assignment of each data point to a cluster.
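A minimal sketch of these steps using scikit-learn's KMeans on synthetic data (the dataset and parameter values are illustrative):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Synthetic data with 4 underlying groups
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.7, random_state=42)

# KMeans repeats the assignment/update steps internally until convergence
kmeans = KMeans(n_clusters=4, n_init=10, max_iter=300, random_state=42)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)   # final centroids
print(labels[:10])               # cluster assignments of the first 10 points
print(kmeans.inertia_)           # sum of squared distances to assigned centroids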
Applications of K Means Clustering
1. Customer Segmentation (Marketing)
• Group customers based on purchasing behavior, demographics, or browsing patterns.
• Example: A retail store can create different marketing strategies for different customer
clusters.
2. Image Compression
• Reduces the number of colors in an image by clustering similar colors.
• Each pixel's color is replaced by the centroid of its cluster, significantly reducing file size.
3. Document or Text Clustering
• Groups similar documents or articles (e.g., news categorization).
• Can be used in search engines or recommendation systems to suggest related content.
4. Product Recommendation
• Clusters products based on user ratings or features.
• Users can be recommended items from the same cluster they prefer.
5. Anomaly Detection
• Clusters normal data, so outliers (data points far from any cluster) can be flagged as
anomalies.
• Useful in fraud detection or network security.
6. Geographical Data Analysis
• Used for grouping locations in geo-marketing or urban
planning.
• Example: Clustering delivery addresses to optimize routes.
7. Medical Imaging
• Clusters pixels in scans like MRI or CT images for tumor
detection or tissue segmentation.
8. Educational Data Mining
• Groups students based on performance, learning styles, or
interaction patterns for personalized education strategies.
What is Image Segmentation?
Image segmentation is one of the key computer vision tasks. It separates objects, boundaries, or structures within an image for more meaningful analysis. Image segmentation plays an important role in extracting meaningful information from images, enabling computers to perceive and understand visual data much as humans do.
Image Segmentation Use cases
1. Medical Imaging
Tumor detection: Identify and isolate tumors in MRI, CT, or
PET scans.
Organ segmentation: Separate organs (e.g., lungs, heart) for
diagnosis or surgery planning.
Retinal analysis: Detect abnormalities in eye scans for
conditions like diabetic retinopathy.
2. Autonomous Vehicles
Scene understanding: Segment roads, pedestrians, vehicles,
traffic signs, etc.
Obstacle detection: Help the car identify and avoid obstacles in
real-time.
3. Augmented Reality (AR)
Background removal: Segment people from the background
(e.g., in virtual backgrounds).
Object tracking: Identify and follow real-world objects to
overlay virtual content accurately.
Image Segmentation using K Means Clustering
Image segmentation is a technique in computer vision that divides an image into different segments. This can help identify specific objects, boundaries, or patterns in the image. An image is basically a set of pixels, and in image segmentation, pixels with similar intensity are grouped together. Image segmentation creates a pixel-wise mask for objects in an image, which gives us a better understanding of those objects.

Step 1: Import Required Libraries
• In the first step we import the required libraries: NumPy, Matplotlib, and OpenCV, then load and display the image.
import numpy as np
import matplotlib.pyplot as plt
import cv2

image = cv2.imread('images/sample.jpg')   # placeholder path; the original filename was elided
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
plt.imshow(image)

Output-
Step 2: Reshape the Image for K-Means Clustering
• K-Means works on 2D data, but an image is a 3D array (height, width, color channels), so we need to reshape the image into a 2D array of pixels.

pixel_vals = image.reshape((-1, 3))
pixel_vals = np.float32(pixel_vals)

Step 3: Apply K-Means Clustering and Segment the Image
Now let's apply the K-Means clustering algorithm to segment the image into distinct regions based on color.
• First, set the criteria for when the algorithm should stop. We'll use a maximum of 100 iterations or an epsilon (accuracy) threshold of 0.85.
• We choose k = 3, which means the algorithm will identify 3 clusters in the image.
• K-Means will group pixels with similar colors into the specified number of clusters.
• Finally, we reshape the segmented data to match the original dimensions of the image so it can be visualized properly.
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 100, 0.85)
k = 3
retval, labels, centers = cv2.kmeans(pixel_vals, k, None, criteria, 10, cv2.KMEANS_RANDOM_CENTERS)
centers = np.uint8(centers)
segmented_data = centers[labels.flatten()]
segmented_image = segmented_data.reshape((image.shape))
plt.imshow(segmented_image)
Output-
Now, if we change the value of k to 6, we get the image below.

As you can see, with an increase in the value of k the segmentation becomes finer and more distinct, because the K-means algorithm can separate more clusters of colors. K-means can segment objects in images and gives good results on smaller datasets, but it becomes time-consuming when applied to large datasets.
Using Clustering for Preprocessing
Data preprocessing is an important step in data science, transforming raw data into a clean, structured format for analysis. It involves tasks like handling missing values, normalizing data, and encoding variables. Mastering preprocessing in Python ensures reliable insights, accurate predictions, and effective decision-making. Preprocessing refers to the transformations applied to data before feeding it to the algorithm.

Stages of Data preprocessing for K-means Clustering


1. Data Cleaning
• Removing duplicates
• Removing irrelevant observations and errors
• Removing unnecessary columns
• Handling inconsistent data
• Handling outliers and noise
2. Handling missing data
3. Data Integration
4. Data Transformation
• Feature Construction
• Handling skewness
• Data Scaling
5. Data Reduction
• Removing dependent (highly correlated) variables
• Feature selection
• PCA
A Python code to Handle Outliers Using Clustering
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Sample 2D data with a few outliers
data = np.array([
    [1, 2], [1, 4], [1, 0],
    [10, 2], [10, 4], [10, 0],
    [5, 5], [6, 5], [5, 6],
    [100, 100], [120, 120]  # outliers
])

# Apply KMeans clustering
kmeans = KMeans(n_clusters=3, random_state=0)
labels = kmeans.fit_predict(data)

# Calculate distances to cluster centers
distances = np.linalg.norm(data - kmeans.cluster_centers_[labels], axis=1)

# Define a threshold: anything far from its center is an outlier (top 10%)
threshold = np.percentile(distances, 90)
outliers = data[distances > threshold]
inliers = data[distances <= threshold]

# Plot
plt.scatter(inliers[:, 0], inliers[:, 1], c='blue', label='Inliers')
plt.scatter(outliers[:, 0], outliers[:, 1], c='red', label='Outliers')
plt.title("Outlier Detection using KMeans")
plt.legend()
plt.show()
Output-

A Python code to detect Anomalies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# 1. Load the Iris dataset
iris = load_iris()
X = iris.data  # we will not use the labels here

# 2. Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3. Apply KMeans clustering
kmeans = KMeans(n_clusters=3, random_state=0)
labels = kmeans.fit_predict(X_scaled)

# 4. Calculate distance to assigned cluster center
distances = np.linalg.norm(X_scaled - kmeans.cluster_centers_[labels], axis=1)

# 5. Define anomaly threshold (e.g., top 10% furthest points)
threshold = np.percentile(distances, 90)
anomalies = X[distances > threshold]
normal = X[distances <= threshold]

# 6. Show anomaly counts
print(f"Total points: {len(X)}")
print(f"Detected anomalies: {len(anomalies)}")

# 7. (Optional) Plot first two features
plt.scatter(normal[:, 0], normal[:, 1], c='blue', label='Normal')
plt.scatter(anomalies[:, 0], anomalies[:, 1], c='red', label='Anomaly')
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("Anomaly Detection on Iris Dataset")
plt.legend()
plt.show()
What is Semi-Supervised Learning?
Semi-supervised learning is a type of machine learning that falls in
between supervised and unsupervised learning. It is a method that uses a
small amount of labeled data and a large amount of unlabeled data to train
a model. The goal of semi-supervised learning is to learn a function that
can accurately predict the output variable based on the input variables,
similar to supervised learning. However, unlike supervised learning, the
algorithm is trained on a dataset that contains both labeled and unlabeled
data.

Process of using clustering in semi supervised
learning
Clustering plays a crucial role in semi-supervised learning by
leveraging the structure of unlabeled data to enhance learning from
a limited amount of labeled examples. Here’s the general process:
1. Preprocessing Data – Raw data is prepared by normalizing
features and handling missing values to improve clustering
effectiveness.
2. Applying Clustering Algorithm – Unlabeled data is grouped into clusters using algorithms like K-Means, DBSCAN, or Gaussian Mixture Models. The idea is to find naturally occurring data patterns.
3. Assigning Pseudo-labels – Once clusters are formed, some
data points are assigned labels based on their proximity to
known labeled instances or based on assumptions about data
distribution.
4. Training Model – The labeled data (original and pseudo-
labeled) is used to train a supervised model, such as a
neural network or decision tree.
5. Refining and Iterating – The pseudo-labels are re-evaluated, clustering is refined, and the training process is repeated to improve accuracy. A minimal sketch of this workflow follows the list.
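A minimal sketch of cluster-based pseudo-labeling, assuming the Iris dataset with only a handful of labeled points and K-Means for cluster discovery (the labeled fraction, the classifier, and the fallback label are illustrative choices):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Load data and pretend only 10 points are labeled
X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
labeled_idx = rng.choice(len(X), size=10, replace=False)

# Steps 1-2: cluster all points (labeled + unlabeled)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
clusters = kmeans.fit_predict(X)

# Step 3: give each cluster the majority label among its known labeled points
cluster_to_label = {}
for c in range(3):
    known = [y[i] for i in labeled_idx if clusters[i] == c]
    cluster_to_label[c] = max(set(known), key=known.count) if known else 0  # 0 is an arbitrary fallback

pseudo_labels = np.array([cluster_to_label[c] for c in clusters])

# Step 4: train a supervised model on the pseudo-labeled data
clf = LogisticRegression(max_iter=1000).fit(X, pseudo_labels)
print("Agreement with the true labels (illustration only):", (clf.predict(X) == y).mean())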
DBSCAN

DBSCAN is a density-based clustering algorithm that groups data points that are closely packed together and marks outliers as noise based on their density in the feature space. It identifies clusters as dense regions in the data space, separated by areas of lower density.
Unlike K-Means or hierarchical clustering, which assume clusters are compact and spherical, DBSCAN excels at handling real-world data irregularities such as:
• Arbitrary-Shaped Clusters: Clusters can take any shape, not just circular or convex.
• Noise and Outliers: It effectively identifies and handles noise points without assigning them to any cluster.
The figure below shows a data set with clustering algorithms:
K-Means and Hierarchical handling compact, spherical
clusters with varying noise tolerance, while DBSCAN
manages arbitrary-shaped clusters and excels in noise
handling.

Importance of DBSCAN

1. Density-Based: DBSCAN works on the ideas of density connectivity and density reachability. It groups together points that are closely packed and marks points that lie alone in low-density regions as outliers.
2. Robust to Noise: DBSCAN is robust to noise and can identify outliers as noise points, making it suitable for datasets with noise or outliers.
3. Handles Clusters of Varying Shapes and Densities: DBSCAN can identify clusters of arbitrary shapes and sizes, unlike K-means, which assumes spherical clusters.
4. No Need to Specify Number of Clusters: Unlike K-means, DBSCAN does not require specifying the number of clusters beforehand, making it more flexible.
5. Efficient: DBSCAN is computationally efficient and can scale well to large datasets.
How DBSCAN Works?

DBSCAN, or Density-Based Spatial Clustering of Applications with Noise, is a clustering algorithm that groups data points based on their density.
Let us understand how DBSCAN works:
1. Core Points, Border Points, and Noise Points:
Core Points: Imagine core points as central hubs in a cluster. A point is considered a core point if it has a minimum number of neighboring points within a specified distance.
Border Points: Border points are on the outskirts of a cluster. They are reachable from core points but do not have enough neighbors to be core points themselves.
Noise Points: Noise points are outliers that do not belong to any cluster.
2. Parameters:
Epsilon (eps): Epsilon is defined as the radius of each data
point around which the density is considered. This defines the
maximum distance between two points for them to be
considered as neighbors.
Minimum Samples (min_samples): It is the number of
points required within the radius so that the data point
becomes a core point.
3. Algorithm Steps:
Initialization: The algorithm begins by randomly selecting a point
from the dataset that has not been visited. This initial point serves as
the starting point for forming a cluster.
Expand:
• For each core point or border point (reachable from a core point),
the algorithm expands the cluster by adding neighboring points
recursively.
• It checks the neighboring points of the current point to determine if they should be included in the cluster.
• If a neighboring point meets the criteria to be a core point or a
border point, it is added to the cluster.
• This process continues iteratively, expanding the cluster by
including points that are within the specified distance (epsilon) and
have the minimum number of neighbors (min_samples).
Termination: The algorithm stops when all points have been visited.
4. Output:
Clusters: Points that belong to the same cluster based on
density.
Noise: Outliers or points that do not fit into any cluster
Understanding Core Points, Border Points and Noise Points
In the DBSCAN algorithm, a circle of radius epsilon is drawn around each data point, and the point is classified as a Core Point, Border Point, or Noise Point. A data point is classified as a core point if it has at least min_samples data points within its epsilon radius. If it has fewer than min_samples points it is known as a Border Point, and if there are no points inside its epsilon radius it is considered a Noise Point. Let us understand the working through an example.

In the above figure, we can see that point A has no points inside its epsilon (ε) radius, hence it is a Noise Point. Point B has min_samples (= 4) points within its epsilon radius, thus it is a Core Point. Point C has only 1 point (fewer than min_samples) within its radius, hence it is a Border Point.
Working of DBSCAN Algorithm
Suppose we have a dataset of points representing customers in a shopping mall, described by their spending score and annual income. We want to group these customers into clusters using DBSCAN.
1. Core Points, Border Points, and Noise Points:
Core Points: A core point could be a customer who has at least
5 other customers within a distance of 10 units. These core
points act as central hubs in a cluster.
Border Points: Border points are customers who are reachable
from core points but do not have enough neighbors to be core
points themselves.
Noise Points: Noise points are customers who do not belong to
any cluster, perhaps because they are outliers in terms of
spending score and income.
2. Parameters:
Epsilon (eps): Let's set epsilon to 10 units, meaning points
within a distance of 10 units are considered neighbors.
Minimum Samples (min_samples): We require at least 5
points within the epsilon radius for a point to be considered
a core point.

3. Algorithm Steps:
Initialization: Start by randomly selecting a customer who has not
been visited as the initial point for forming a cluster.
Expand:
• For each core point or border point, expand the cluster by
adding neighboring customers recursively based on the epsilon
and min_samples criteria.
• Check if the neighboring customers meet the criteria to be core
points or border points and add them to the cluster.
Termination: The algorithm stops when all customers have been visited and clustered.
4. Output:
Clusters: Customers grouped together based on their spending
score and income density
Noise: Outliers or customers who do not fit well into any cluster.
Python code to demonstrate DBSCAN
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt

# Step 1: Generate sample data
X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.6, random_state=0)

# Step 2: Apply DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X)

# Step 3: Plot the clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.title("DBSCAN Clustering")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()
Applications of DBSCAN
1. Geospatial and Mapping Applications
Use Case: Grouping GPS coordinates to find locations of interest (e.g.,
tourist hotspots).
Example: Clustering users based on check-in locations or delivery drop-
offs.
2. Anomaly/Outlier Detection
Use Case: Detecting unusual data points that don’t fit into any group.
Example: Fraud detection, network intrusion, or unusual sensor readings.
3. Image Segmentation
Use Case: Separating different regions of an image based on pixel
density.
Example: Detecting and segmenting tumors in medical imaging.
4. Noise-Robust Clustering in Machine Learning
Use Case: When datasets contain noise and varying densities.
Example: Clustering of customer behavior data or social media activity.
Other Clustering Algorithms
1. Agglomerative Clustering
• Type: Hierarchical (bottom-up)
• How it works: Starts with each data point as a single cluster
and merges the closest pairs iteratively.
• Output: A dendrogram; you choose the number of clusters
by "cutting" the tree.
• Distance metrics: Euclidean, Manhattan, etc.
• Strengths:
• No need to pre-specify number of clusters (if dendrogram used)
• Handles non-spherical clusters
• Limitations:
• Computationally expensive for large datasets
2. BIRCH (Balanced Iterative Reducing and Clustering
using Hierarchies)
• Type: Hierarchical + Centroid-based
• How it works: Builds a CF (Clustering Feature) Tree to summarize large datasets, then clusters them.
• Designed for: Large-scale datasets
• Strengths:
• Memory-efficient, scalable
• Can handle streaming data
• Limitations:
• Assumes spherical clusters
• Sensitive to input parameters like threshold
3. Mean Shift Clustering
• Type: Centroid-based (mode-seeking)
• How it works: Points shift toward areas of highest data density (modes); convergence forms clusters (a minimal sketch follows this list).
• No need to specify K.
• Strengths:
• Automatically finds number of clusters
• Works well with irregular cluster shapes
• Limitations:
• Slow on large datasets
• Sensitive to bandwidth parameter
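A minimal sketch with scikit-learn's MeanShift on synthetic data (the data and the bandwidth quantile are illustrative assumptions):

import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Synthetic data; the bandwidth controls the size of the density window
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.7, random_state=0)
bandwidth = estimate_bandwidth(X, quantile=0.2)

# Mean shift discovers the number of clusters on its own
ms = MeanShift(bandwidth=bandwidth)
labels = ms.fit_predict(X)

print("Clusters found:", len(np.unique(labels)))
print(ms.cluster_centers_)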
4. Affinity Propagation
• Type: Graph-based / Message-passing
• How it works: Sends messages between data points to find exemplars (cluster centers); a minimal sketch follows this list.
• Strengths:
• Doesn’t need to specify number of clusters
• Can identify representative examples
• Limitations:
• Computationally heavy
• Needs proper tuning of preference parameter
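A minimal sketch with scikit-learn's AffinityPropagation on synthetic data (the data and the preference value are illustrative assumptions):

from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

# Synthetic data; the preference parameter influences how many exemplars emerge
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=0)

ap = AffinityPropagation(preference=-50, random_state=0)
labels = ap.fit_predict(X)

# The exemplars are actual data points chosen as cluster centers
print("Number of clusters:", len(ap.cluster_centers_indices_))
print("Exemplar indices:", ap.cluster_centers_indices_)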
5. Spectral Clustering
• Type: Graph-based
• How it works: Uses the eigenvalues of a similarity matrix to reduce dimensions and then applies clustering (e.g., KMeans) in the lower-dimensional space (a minimal sketch follows this list).
• Great for non-convex or complex-shaped clusters.
• Strengths:
• Captures complex structure
• Good for image segmentation, graph partitioning
• Limitations:
• Not scalable for large datasets
• Requires similarity graph construction
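A minimal sketch with scikit-learn's SpectralClustering on the two-moons dataset (the data and parameter values are illustrative assumptions):

from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

# Two interleaving half-moons: non-convex clusters where K-Means struggles
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# Build a nearest-neighbor similarity graph and cluster its spectral embedding
sc = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                        n_neighbors=10, assign_labels='kmeans', random_state=0)
labels = sc.fit_predict(X)

print(labels[:20])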
THANK YOU
