Unit V Machine Learning

Clustering is an unsupervised machine learning technique that groups similar data points into clusters, widely used in areas such as customer segmentation and anomaly detection. It can be categorized into types like hard vs. soft clustering, hierarchical vs. partition-based, and density-based vs. model-based clustering, with various algorithms like K-Means, DBSCAN, and Gaussian Mixture Models. Evaluation metrics for clustering include the Silhouette Score, Davies-Bouldin Index, and Dunn Index, with applications in customer segmentation, image segmentation, and medical diagnosis.

UNIT V

1. Introduction to Clustering

Clustering is an unsupervised machine learning technique used to group similar data points
into clusters. The goal is to ensure that:

 Data points within a cluster are similar to each other.

 Data points in different clusters are dissimilar.

Clustering is widely used in:

 Customer segmentation

 Pattern recognition

 Image segmentation

 Anomaly detection

 Genomics and Bioinformatics

2. Types of Clustering

Clustering can be categorized into different types based on how clusters are formed.

a) Hard Clustering vs. Soft Clustering

 Hard Clustering: Each data point belongs to only one cluster.

o Example: K-Means Clustering.

 Soft Clustering (Fuzzy Clustering): A data point can belong to multiple clusters with
probabilities.

o Example: Fuzzy C-Means Clustering.

b) Hierarchical vs. Partition-Based Clustering

1. Hierarchical Clustering:

o Creates a tree-like structure of clusters.

o Can be agglomerative (bottom-up) or divisive (top-down).

o Example: Agglomerative Clustering, Divisive Clustering.

2. Partition-Based Clustering:

o Divides the dataset into k predefined clusters.


o Example: K-Means Clustering.

c) Density-Based vs. Model-Based Clustering

1. Density-Based Clustering:

o Clusters are formed based on dense regions in the data.

o Example: DBSCAN (Density-Based Spatial Clustering of Applications with Noise).

2. Model-Based Clustering:

o Assumes data is generated from a mixture of statistical distributions.

o Example: Gaussian Mixture Models (GMM).

3. Clustering Algorithms

a) K-Means Clustering

 One of the most popular clustering algorithms.

 Steps (see the code sketch at the end of this subsection):

1. Choose K (number of clusters).

2. Select K random points as centroids.

3. Assign each point to the nearest centroid.

4. Recompute each centroid as the mean of the points currently assigned to it.

5. Repeat steps 3 and 4 until the cluster assignments no longer change (convergence).

 Pros:

o Simple and efficient for large datasets.

o Works well when clusters are spherical.

 Cons:

o Requires choosing K in advance.

o Sensitive to outliers.
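A minimal sketch of these steps using scikit-learn's KMeans; the data here is a synthetic toy set from make_blobs, purely for illustration:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with 3 underlying groups (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# K must be fixed in advance (a limitation noted above)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)  # runs steps 2-5 internally until convergence

print(kmeans.cluster_centers_)  # final centroids
print(kmeans.inertia_)          # SSE, reused later by the Elbow Method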

b) Hierarchical Clustering

 Forms a tree structure of clusters (dendrogram).

 Two main types:


o Agglomerative (Bottom-Up): Merges small clusters into bigger ones.

o Divisive (Top-Down): Splits a large cluster into smaller ones.

 Pros:

o No need to specify K.

o Works well for small datasets.

 Cons:

o Computationally expensive for large datasets.
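A short agglomerative (bottom-up) sketch with scikit-learn; linkage="ward" merges, at every step, the pair of clusters whose merge least increases within-cluster variance:

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Bottom-up merging; n_clusters=3 cuts the resulting tree into 3 clusters
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)

To draw the dendrogram itself, scipy.cluster.hierarchy (its linkage and dendrogram functions) is the usual choice.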

c) DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

 Groups points based on density.

 Can detect arbitrary-shaped clusters.

 Steps (see the code sketch at the end of this subsection):

1. Select a point.

2. Find all nearby points within a given radius (ε).

3. Expand the cluster if there are enough points (minPts threshold).

4. Label points that are not reachable from any dense region as noise.

 Pros:

o Can detect outliers.

o Works well with non-spherical clusters.

 Cons:

o Requires choosing ε (radius) carefully.

o Not effective for varying-density clusters.
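A brief DBSCAN sketch on two interleaving half-moons, a non-spherical shape where K-Means typically fails; eps corresponds to the radius ε and min_samples to the minPts threshold in the steps above:

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

db = DBSCAN(eps=0.3, min_samples=5)
labels = db.fit_predict(X)

# DBSCAN labels outliers as -1 rather than forcing them into a cluster
print((labels == -1).sum(), "noise points")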

d) Gaussian Mixture Model (GMM)

 Uses probabilistic models to form clusters.

 Each cluster follows a Gaussian distribution.

 Works well for overlapping clusters.

 Pros:
o Can model elliptical (non-spherical) clusters through each component's covariance structure.

o Provides probability of belonging to a cluster.

 Cons:

o Requires choosing the number of clusters (K).
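A minimal GMM sketch with scikit-learn; unlike K-Means, predict_proba returns each point's probability of belonging to every cluster:

from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=7)

# Each component is a Gaussian; covariance_type="full" allows elliptical clusters
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=7)
gmm.fit(X)

hard_labels = gmm.predict(X)       # most likely cluster per point
soft_probs = gmm.predict_proba(X)  # probability of each cluster per point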

e) Fuzzy C-Means (Soft Clustering)

 Each data point has a degree of membership in multiple clusters.

 Instead of assigning a point to only one cluster, it belongs to every cluster with a
different degree of membership (the memberships for each point sum to 1).

 Pros:

o More flexible than hard clustering.

 Cons:

o More computationally expensive.
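scikit-learn has no built-in Fuzzy C-Means, so below is a minimal NumPy sketch of its two alternating updates (centroids, then memberships); m is the fuzziness exponent, and the function name fuzzy_c_means is just an illustrative choice:

import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Random initial membership matrix U; each row sums to 1
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        Um = U ** m
        # Centroids: membership-weighted means of the points
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Distance of every point to every centroid
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-10)  # guard against division by zero
        # Membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
    return centers, U

Calling centers, U = fuzzy_c_means(X, c=3) returns, in each row of U, that point's degree of membership in every cluster.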

f) Rough Clustering & Rough K-Means

 Allows uncertain data points to belong to multiple clusters.

 Rough K-Means extends K-Means using rough set theory: each cluster has a lower
approximation (certain members) and an upper approximation (possible members), so
boundary points can belong to more than one cluster.

4. Evaluation Metrics for Clustering

Since clustering is unsupervised, there are no ground-truth labels to measure accuracy against. Instead, internal measures are used (a code sketch follows this list):

1. Silhouette Score:

o Measures how similar a point is to its cluster vs. other clusters.

o Range: -1 (bad) to +1 (good).

2. Davies-Bouldin Index:

o Average ratio of within-cluster scatter to between-cluster separation; lower values indicate better clustering.

3. Dunn Index:

o Higher values mean better separation between clusters.

4. Elbow Method (For K-Means):

o Finds a suitable K by plotting inertia (the within-cluster sum of squared errors, SSE) against K and looking for the "elbow" where the curve flattens.
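A short sketch computing these measures with scikit-learn (the Dunn Index has no built-in scikit-learn function, so it is omitted here), including the Elbow Method loop:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=1)
labels = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(X)

print("Silhouette:", silhouette_score(X, labels))          # higher is better
print("Davies-Bouldin:", davies_bouldin_score(X, labels))  # lower is better

# Elbow Method: compute inertia (SSE) for each K and look for the bend
inertias = [KMeans(n_clusters=k, n_init=10, random_state=1).fit(X).inertia_
            for k in range(1, 10)]
print(inertias)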


5. Applications of Clustering

1. Customer Segmentation:

o Grouping customers based on purchasing behavior.

2. Anomaly Detection:

o Detecting fraud, cyber threats, and network intrusions.

3. Image Segmentation:

o Identifying objects in images.

4. Document Clustering:

o Grouping similar news articles, research papers, or emails.

5. Medical Diagnosis:

o Identifying different disease patterns.

Metrics

1. Silhouette Score: Measures how well-separated clusters are.

o Range: -1 to 1 (Higher is better)

o Good Clustering: > 0.5

2. Davies-Bouldin Index: Measures the ratio of within-cluster scatter to between-cluster separation.

o Lower values are better

3. Calinski-Harabasz Index: Measures compactness and separation.

o Higher values indicate better clustering
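The Calinski-Harabasz Index is also available in scikit-learn; a self-contained sketch on the same kind of toy data as above:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=2)
labels = KMeans(n_clusters=3, n_init=10, random_state=2).fit_predict(X)

# Ratio of between-cluster to within-cluster dispersion; higher is better
print(calinski_harabasz_score(X, labels))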
