UNIT-IV
UNSUPERVISED LEARNING
Introduction
Unsupervised learning is a branch of machine learning that deals with
unlabeled data. Unlike supervised learning, where the data is labeled with
a specific category or outcome, unsupervised learning algorithms are
tasked with finding patterns and relationships within the data
without any prior knowledge of the data’s meaning.
Unsupervised machine learning algorithms find hidden patterns in data
without any human intervention, i.e., we don't give outputs to
the model. The training data contains only input values, and the model
discovers the groups or patterns on its own.
The image shows a set of animals (elephants, camels, and
cows) that represents the raw data the unsupervised learning
algorithm will process.
• The “Interpretation” stage signifies that the algorithm
doesn’t have predefined labels or categories for the data. It
needs to figure out how to group or organize the data based
on inherent patterns.
1. Clustering Algorithms
2. Association Rule Learning
For example, shopping stores use algorithms based on this technique to find
relationships between the sales of one product and the sales of another
based on customer behavior: if a customer buys milk, they may also buy
bread, eggs, or butter. Once trained well, such models can be used to
increase sales by planning targeted offers.
3. Dimensionality Reduction
Applications of these unsupervised techniques include:
• Image and Text Clustering: Groups similar images or documents for tasks like organization,
classification, or content recommendation.
• Social Network Analysis: Detects communities or trends in user interactions on social media
platforms.
• Astronomy and Climate Science: Classifies galaxies or groups weather patterns to support
scientific research.
Clustering in Machine Learning
Types of Clustering Methods
Clustering methods are broadly divided into Hard clustering (each data point
belongs to only one group) and Soft clustering (a data point can belong to
more than one group). Various other approaches to clustering also exist.
Below are the main clustering methods used in machine learning:
1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering
1. Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also
known as the centroid-based method. The most common example of partitioning
clustering is the K-Means clustering algorithm.
In this type, the dataset is divided into a set of k groups, where K defines the
number of pre-defined groups. Cluster centers are created in such a way that each
data point is closer to its own cluster centroid than to the centroid of any other
cluster.
2. Density-Based Clustering
The density-based clustering method connects highly dense areas into
clusters, so arbitrarily shaped clusters can be formed as long as
the dense regions can be connected. The algorithm does this by
identifying different clusters in the dataset and connecting the areas of
high density; the dense areas in data space are separated
from each other by sparser areas.
These algorithms can struggle to cluster the data points when the
dataset has varying densities or high dimensionality.
3. Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is divided
based on the probability that each point belongs to a particular
distribution. The grouping is done by assuming the data follows certain
distributions, most commonly the Gaussian distribution.
• An example of this type is the Expectation-Maximization (EM)
Clustering algorithm, which uses Gaussian Mixture Models (GMM).
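As an illustrative sketch (the synthetic data and parameter values below are assumptions, not from the slides), scikit-learn's GaussianMixture, which is fitted with the EM algorithm, can cluster points drawn from two Gaussians and also report soft membership probabilities:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data: two separated Gaussian blobs (illustrative)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, 2)),
    rng.normal(loc=5.0, scale=1.5, size=(100, 2)),
])

# Fit a 2-component Gaussian Mixture Model via EM
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

labels = gmm.predict(X)       # hard cluster assignments
probs = gmm.predict_proba(X)  # soft membership probabilities per component

print(labels[:5], probs[0].round(3))
```

Unlike K-Means, each point also receives a probability of belonging to every component, which is what makes this a distribution-based (soft) method.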
4. Hierarchical Clustering
• Hierarchical clustering can be used as an alternative to
partitioning clustering, as there is no requirement to pre-specify the
number of clusters to be created. In this technique, the dataset is
organized into a tree-like structure, which is called
a dendrogram. Any number of clusters can be
obtained by cutting the tree at the appropriate level. The most common
example of this method is the Agglomerative Hierarchical Clustering
algorithm.
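A minimal sketch of this idea using SciPy (the toy dataset is an assumption for illustration): linkage builds the dendrogram bottom-up, and fcluster "cuts the tree" at a chosen number of clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Small illustrative dataset: two well-separated groups of points
X = np.array([[1, 1], [1.5, 1], [1, 1.5],
              [8, 8], [8.5, 8], [8, 8.5]])

# Build the dendrogram bottom-up (agglomerative) with Ward linkage
Z = linkage(X, method="ward")

# "Cut the tree" to obtain a chosen number of clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # two groups, e.g. [1 1 1 2 2 2]
```

Changing t here corresponds to cutting the dendrogram at a different level, without re-running the clustering.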
5. Fuzzy Clustering
• Fuzzy clustering is a soft clustering method in which a data object
may belong to more than one group or cluster. Each data point has a
set of membership coefficients giving its degree of
membership in each cluster. The Fuzzy C-Means algorithm is the
standard example of this type of clustering; it is sometimes also known as
the Fuzzy K-Means algorithm.
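A minimal NumPy sketch of the fuzzy c-means update rules (the function, toy data, and parameter values are illustrative assumptions, not a production implementation):

```python
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, n_iter=100, seed=0):
    """Sketch of fuzzy c-means: each point gets a membership degree
    in every cluster, and each row of U sums to 1."""
    rng = np.random.default_rng(seed)
    # Random initial membership matrix, rows normalized to sum to 1
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        Um = U ** m
        # Update centers as membership-weighted means of the points
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Distances from every point to every center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-10)  # avoid division by zero
        # Update memberships: u_ik = 1 / sum_j (d_ik / d_ij)^(2/(m-1))
        inv = d ** (-2.0 / (m - 1))
        U = inv / inv.sum(axis=1, keepdims=True)
    return centers, U

# Two obvious groups; memberships end up near 1 for the "own" cluster
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
centers, U = fuzzy_c_means(X, c=2)
print(U.round(2))
```

The fuzzifier m controls how soft the memberships are: values close to 1 approach hard K-Means assignments, larger values spread membership across clusters.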
K-Means Clustering Algorithm
• K-Means Clustering is an unsupervised learning algorithm that is
used to solve the clustering problems in machine learning or data
science.
What is K-Means Algorithm?
• K-Means Clustering is an Unsupervised Learning algorithm which
groups the unlabeled dataset into different clusters. Here K defines
the number of pre-defined clusters that need to be created in the
process: if K=2, there will be two clusters; for K=3, there will
be three clusters; and so on.
It is a centroid-based algorithm, where each cluster is associated with a
centroid. The main aim of this algorithm is to minimize the sum of
distances between each data point and the centroid of its corresponding cluster.
The k-means clustering algorithm mainly performs two tasks:
• Determines the best value for K center points or centroids by an
iterative process.
• Assigns each data point to its closest k-center. Those data points
which are near to the particular k-center, create a cluster.
Hence each cluster contains data points with some commonalities and is
far away from the other clusters.
• The below diagram explains the working of the K-means Clustering
Algorithm:
K-means clustering is a popular unsupervised machine learning algorithm used to partition a
dataset into K distinct, non-overlapping subsets or clusters. Here's a concise explanation of
how K-means clustering works:
Initialization: Choose K initial centroids, for example at random.
Assignment: Assign each data point to the nearest centroid based on the
Euclidean distance. This forms K clusters.
Update: Calculate the new centroids by taking the mean of all data points
assigned to each cluster.
Repeat: Repeat the assignment and update steps until the centroids no
longer change significantly or a maximum number of iterations is reached.
Output: The algorithm outputs the final cluster centroids and the
assignment of each data point to a cluster.
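The steps above can be sketched with scikit-learn's KMeans; the small 2-D dataset below is an illustrative assumption:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated groups of 2-D points (illustrative data)
X = np.array([[1, 2], [1.5, 1.8], [1.2, 2.1],
              [8, 8], [8.5, 8.2], [7.9, 8.4]])

# n_init restarts the algorithm with different initial centroids
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print("Labels:   ", labels)
print("Centroids:", kmeans.cluster_centers_.round(2))
```

fit_predict runs the initialization/assignment/update loop internally and returns the final cluster id of each point; cluster_centers_ holds the final centroids.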
Applications of K Means Clustering
1. Customer Segmentation (Marketing)
• Group customers based on purchasing behavior, demographics, or browsing patterns.
• Example: A retail store can create different marketing strategies for different customer
clusters.
2. Image Compression
• Reduces the number of colors in an image by clustering similar colors.
• Each pixel's color is replaced by the centroid of its cluster, significantly reducing file size.
3. Document or Text Clustering
• Groups similar documents or articles (e.g., news categorization).
• Can be used in search engines or recommendation systems to suggest related content.
4. Product Recommendation
• Clusters products based on user ratings or features.
• Users can be recommended items from the same cluster they prefer.
5. Anomaly Detection
• Clusters normal data, so outliers (data points far from any cluster) can be flagged as
anomalies.
• Useful in fraud detection or network security.
6. Geographical Data Analysis
• Used for grouping locations in geo-marketing or urban
planning.
• Example: Clustering delivery addresses to optimize routes.
7. Medical Imaging
• Clusters pixels in scans like MRI or CT images for tumor
detection or tissue segmentation.
8. Educational Data Mining
• Groups students based on performance, learning styles, or
interaction patterns for personalized education strategies.
What is Image Segmentation?
Image segmentation is one of the key computer vision tasks. It
separates objects, boundaries, or structures within an image for
more meaningful analysis. Image segmentation plays an
important role in extracting meaningful information from
images, enabling computers to perceive and understand visual
data in a manner similar to how humans view and interpret it.
Image Segmentation Use cases
1. Medical Imaging
Tumor detection: Identify and isolate tumors in MRI, CT, or
PET scans.
Organ segmentation: Separate organs (e.g., lungs, heart) for
diagnosis or surgery planning.
Retinal analysis: Detect abnormalities in eye scans for
conditions like diabetic retinopathy.
2. Autonomous Vehicles
Scene understanding: Segment roads, pedestrians, vehicles,
traffic signs, etc.
Obstacle detection: Help the car identify and avoid obstacles in
real-time.
3. Augmented Reality (AR)
Background removal: Segment people from the background
(e.g., in virtual backgrounds).
Object tracking: Identify and follow real-world objects to
overlay virtual content accurately.
Image Segmentation using K Means Clustering
Image segmentation is a technique in computer vision that
divides an image into different segments. This can help identify
specific objects, boundaries, or patterns in the image. An image is
basically a set of pixels, and in image segmentation, pixels
with similar intensity are grouped together. Image segmentation
creates a pixel-wise mask for objects in an image, which gives
us a better understanding of the objects.
Step 1: Import Required Libraries
• In the first step we import the required libraries (NumPy, Matplotlib
and OpenCV), load the image, and convert it from OpenCV's default BGR
color order to RGB.

import numpy as np
import matplotlib.pyplot as plt
import cv2

image = cv2.imread('images/[Link]')  # path to the input image
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
plt.imshow(image)
Output-
Step 2: Reshape the Image for K-Means Clustering
• K-Means works on 2D data but images are 3D (height, width, color channels), so we
need to reshape the image into a 2D array of pixels.

pixel_vals = image.reshape((-1, 3))
pixel_vals = np.float32(pixel_vals)
First, set the criteria for when the algorithm should stop:
a maximum of 100 iterations or an epsilon (accuracy) value of 0.85.
We choose k = 3, which means the algorithm will identify 3 clusters in the image.
K-Means will group pixels with similar colors into the specified number of clusters.
Finally, we reshape the segmented data to match the original dimensions of the image so it can be
visualized properly.

criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 100, 0.85)
k = 3
retval, labels, centers = cv2.kmeans(pixel_vals, k, None,
                                     criteria, 10, cv2.KMEANS_RANDOM_CENTERS)
centers = np.uint8(centers)
segmented_data = centers[labels.flatten()]
segmented_image = segmented_data.reshape(image.shape)
plt.imshow(segmented_image)
Output-
Now if we change the value of k to 6 we get the below image
As you can see, with an increase in the value of k the segmentation becomes
finer and more distinct, because the K-Means algorithm can separate more
clusters of colors. K-Means can segment objects in images and gives good
results on smaller datasets, but when applied to large datasets it
becomes time-consuming.
Using Clustering for Preprocessing
Data preprocessing is an important step in data science, transforming
raw data into a clean, structured format for analysis. It involves tasks
like handling missing values, normalizing data, and encoding variables.
Mastering preprocessing in Python ensures reliable insights for accurate
predictions and effective decision-making. Pre-processing refers to
the transformations applied to data before feeding it to the algorithm.
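One common way to use clustering during preprocessing is to transform each sample into its distances to the learned cluster centroids, producing a new feature representation for a downstream model. A minimal sketch with scikit-learn (the choice of the Iris dataset and k=3 are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X = load_iris().data

# Normalize features first (a typical preprocessing step)
X_scaled = StandardScaler().fit_transform(X)

# Use K-Means as a feature transformer: each sample becomes its
# vector of distances to the 3 cluster centroids (new 3-D features)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
X_new = kmeans.fit_transform(X_scaled)

print(X.shape, "->", X_new.shape)  # (150, 4) -> (150, 3)
```

The transformed features can then be fed to a supervised model, sometimes alongside the original columns.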
A Python code to detect anomalies
(The slide laid this snippet out in two columns; it is reconstructed
here as one script, and the missing intermediate steps 2-4 are filled
in so it runs end to end.)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# 1. Load the Iris dataset
iris = load_iris()
X = iris.data  # we will not use the labels here

# 2. Standardize the features (reconstructed step)
X = StandardScaler().fit_transform(X)

# 3. Fit K-Means (reconstructed step)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# 4. Distance of each point to its nearest centroid (reconstructed step)
distances = np.min(kmeans.transform(X), axis=1)

# 5. Define anomaly threshold (e.g., top 10% furthest points)
threshold = np.percentile(distances, 90)
anomalies = X[distances > threshold]
normal = X[distances <= threshold]

# 6. Show anomaly counts
print(f"Total points: {len(X)}")
print(f"Detected anomalies: {len(anomalies)}")
Process of using clustering in semi supervised
learning
Clustering plays a crucial role in semi-supervised learning by
leveraging the structure of unlabeled data to enhance learning from
a limited amount of labeled examples. Here’s the general process:
1. Preprocessing Data – Raw data is prepared by normalizing
features and handling missing values to improve clustering
effectiveness.
2. Applying Clustering Algorithm – Unlabeled data is grouped
into clusters using algorithms like K-Means, DBSCAN, or
Gaussian Mixture Models. The idea is to find naturally
occurring data patterns.
3. Assigning Pseudo-labels – Once clusters are formed, some
data points are assigned labels based on their proximity to
known labeled instances or based on assumptions about data
distribution.
4. Training Model – The labeled data (original and pseudo-
labeled) is used to train a supervised model, such as a
neural network or decision tree.
5. Refining and Iterating – The pseudo-labels are re-
evaluated, clustering is refined, and the training process is
repeated to improve accuracy.
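The five steps above can be sketched as follows. The specifics (Iris data, 10 known labels, majority-vote pseudo-labeling) are simplifying assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# 1. Pretend only 10 labels are known; the rest are "unlabeled"
rng = np.random.default_rng(0)
labeled_idx = rng.choice(len(X), size=10, replace=False)

# 2. Cluster ALL the data (true labels are not used here)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# 3. Assign pseudo-labels: each cluster takes the majority true label
#    among the few labeled points that fall inside it
pseudo = np.empty(len(X), dtype=int)
for c in range(3):
    members = labeled_idx[kmeans.labels_[labeled_idx] == c]
    majority = np.bincount(y[members]).argmax() if len(members) else 0
    pseudo[kmeans.labels_ == c] = majority

# 4. Train a supervised model on the pseudo-labeled data
clf = LogisticRegression(max_iter=1000).fit(X, pseudo)
print("Fit to pseudo-labels:", round(clf.score(X, pseudo), 2))
```

Step 5 (refining) would repeat clustering and pseudo-labeling with the updated model's predictions; it is omitted here for brevity.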
DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
groups together points that lie in dense regions and marks points in
low-density regions as noise. It is controlled by two parameters:
epsilon (the neighborhood radius) and min_samples (the minimum number
of points needed to form a dense region).
Importance of DBSCAN
• There is no need to specify the number of clusters in advance, unlike K-Means.
• It can find arbitrarily shaped clusters, not just spherical ones.
• It is robust to outliers, which are labeled as noise instead of being
forced into a cluster.
3. Algorithm Steps:
Initialization: Start by randomly selecting a customer who has not
been visited as the initial point for forming a cluster.
Expand:
• For each core point, expand the cluster by recursively
adding its neighboring customers based on the epsilon
and min_samples criteria.
• Check whether the neighboring customers qualify as core
points or border points and add them to the cluster.
• Termination: The algorithm stops when all customers have been
visited and clustered.
4. Output:
Clusters: Customers grouped together based on their spending
score and income density
Noise: Outliers or customers who do not fit well into any cluster.
Python code to demonstrate DBSCAN
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
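A complete version of this demo might look like the following; the make_blobs settings and the eps/min_samples values are illustrative assumptions:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt

# Synthetic dataset with 3 dense blobs
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# eps = neighborhood radius, min_samples = density threshold
db = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = db.labels_  # -1 marks noise points

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Clusters found:", n_clusters)

# Color each point by its cluster id (noise appears as its own color)
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis", s=15)
plt.title(f"DBSCAN: {n_clusters} clusters (noise = -1)")
plt.show()
```

Note that the number of clusters is an output of DBSCAN, not an input: it emerges from the eps and min_samples density settings.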