Clustering In
Machine Learning
BY
BODA SANTOSH NAIK(EC21B020)
BANOTH ROHITH(EC21B015)
DESAVATH SIVA NAIK(EC21B024)
Introduction to Clustering
Clustering is an unsupervised learning technique used to group
similar data points.
It helps in discovering inherent patterns within datasets without
prior labels.
Clustering is widely used in various applications such as image
segmentation and customer segmentation.
PAGE-2
Importance of Clustering
Clustering simplifies complex datasets by reducing dimensionality.
It facilitates
. better data analysis by grouping similar items together.
Clustering can improve decision-making processes in business and research.
PAGE-3
Types of Clustering
Clustering can be categorized into several types,
including centroid-based, density-based, and
hierarchical clustering.
Each type has its own methodology and use cases
suited for different data distributions.
Understanding the types of clustering is crucial for
selecting the appropriate algorithm.
Centroid-Based Clustering
K-Means is a widely used centroid-based clustering
algorithm.
It partitions the data into K clusters by minimizing
the variance within each cluster.
The algorithm iteratively updates cluster centroids
until convergence is reached.
Density-Based Clustering
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) identifies clusters based
on the density of data points.
DBSCAN requires two parameters: epsilon (neighborhood radius) and minPts (minimum points
to form a cluster).
Density Based consists of 3 types of data points
Core point : It should satisfy the condition of min. pts
Boundary point : Neighbour of Core.
Noise point : Not core nor boundary
PAGE-6
Hierarchical Clustering
Hierarchical clustering creates a tree-like
structure to represent data relationships.
It can be agglomerative (bottom-up) or
divisive (top-down) in its approach.
Dendrograms are commonly used to
visualize the results of hierarchical
clustering.
PAGE-7
Evaluation Metrics
Evaluation matrices are crucial tools in machine learning for assessing the performance of a model. They
provide quantitative measures to understand how well a model is making predictions. Here are some
commonly used evaluation matrices.
.
Classification of matrices:
Accuracy : Accuracy is a matrices that measures how often a machine learning model correctly
predicts the outcomes.
Precision : Precision performance the quality of a positive prediction made by the model.
Recall : Recall is a machine learning metric that measures how well a model can identify positive
instances in a dataset.
PAGE-8
Challenges in Clustering
Clustering is sensitive to outliers, which
can distort the results significantly.
The choice of the number of clusters (K)
in algorithms like K-Means can be
subjective.
High-dimensional data often leads to the
“curse of dimensionality,” complicating
clustering.
Practical Applications
Clustering is used in customer segmentation
to tailor marketing strategies effectively.
It plays a critical role in image and video
processing for object recognition.
In bioinformatics, clustering helps in gene
expression analysis and protein
classification.
PAGE-10
Tools and Libraries
Popular libraries for clustering in Python include Scikit-
learn, Scipy, and HDBSCAN.
R also offers robust clustering packages such as 'cluster'
and 'factoextra’.
These tools provide easy-to-use implementations of
various clustering algorithms.
Pandas is useful for data manipulation and preprocessing
before clustering.
Numpy is useful for numerical operations, it’s often used
for implementing clustering algorithms from scratch.
Case Study: Customer Segmentation
A retail company used K-Means clustering to segment its
customer base into distinct groups.
This segmentation enabled targeted marketing campaigns
and improved customer engagement.
The results showed a significant increase in sales and
customer satisfaction.
PAGE-12
Case Study: Image Segmentation
Researchers applied DBSCAN for segmenting complex
images in a computer vision project.
The algorithm effectively identified regions of interest
while ignoring background noise.
This segmentation improved the accuracy of subsequent
image classification tasks.
The segmentation approach was applied to real-world
data, such as satellite images and medical scans, Where
DBSCAN successfully identified key region like urban
areas or tumor boundaries, further validating its
effectiveness.
PAGE-13
Future Directions
The integration of clustering with deep learning
techniques is an emerging trend.
Research is focusing on developing algorithms that
can handle dynamic and streaming data.
Further advancements in clustering will enhance its
applicability across various domains.
PAG-14
Best Practices
Always preprocess your data to remove noise and
handle missing values before clustering.
Experiment with multiple algorithms and
parameters to find the most suitable method for
your data.
Visualize the clusters formed to gain insights and
validate the clustering results.
PAGE-15
Conclusion
Clustering is a powerful tool for data analysis that
uncovers hidden structures in data.
Understanding different clustering algorithms and
their applications is essential for practitioners.
As data continues to grow, the importance and
relevance of clustering in machine learning will
only increase
PAG-16
PAG-17