ML Unit 3
Unsupervised Learning
3.1 Clustering
Clustering is the task of dividing unlabeled data points into groups such that points within the same cluster are more similar to one another than to points in other clusters. In simple words, the aim of the clustering process is to segregate groups with similar traits and assign them into clusters.
Let's understand this with an example. Suppose you are the head of a rental store and wish to understand the preferences of your customers to scale up your business. Is it possible for you to look at the details of each customer and devise a unique business strategy for each one of them? Definitely not. But what you can do is cluster all of your customers into, say, 10 groups based on their purchasing habits and use a separate strategy for the customers in each of these 10 groups. This is what we call clustering. Now that we understand what clustering is, let's take a look at its different types.
Broadly, clustering can be divided into two subgroups:
• Hard Clustering: Each input data point either fully belongs to a cluster or it does not. For instance, in the example above, every customer is assigned to exactly one of the ten groups.
• Soft Clustering: Instead of assigning each data point to exactly one cluster, a probability or likelihood of the point belonging to each cluster is assigned. For instance, each customer would receive a probability of belonging to each of the ten groups.
Clustering models also differ in how they define what a cluster is:
• Connectivity Models: These models are based on the notion that data points closer to each other in the data space are more similar than those farther away. They can follow two approaches: either all data points start in separate clusters and are aggregated as the distance decreases, or all data points are classified as a single cluster and then partitioned as the distance increases. Also, the choice of distance function is subjective. These models are very easy to interpret but lack scalability for handling big datasets. Examples of these models are the hierarchical clustering algorithms and their variants.
• Centroid Models: These clustering algorithms derive similarity from the proximity of a data point to the centroid, or cluster center. The K-Means clustering algorithm, a popular example, falls into this category. These models require specifying the number of clusters beforehand, which in turn requires prior knowledge of the dataset. They run iteratively to find local optima.
• Distribution Models: These clustering models are based on the notion of how probable it is that all data points in a cluster belong to the same probability distribution (for example, the normal/Gaussian distribution). These models often suffer from overfitting. A popular example is the Expectation-Maximization algorithm, which uses multivariate normal distributions; see the sketch after this list.
• Density Models: These models search the data space for regions of varying density of data points. They isolate dense regions and assign the data points within each such region to the same cluster. Popular examples of density models are DBSCAN and OPTICS. These models are particularly useful for identifying clusters of arbitrary shape and detecting outliers, since they can separate points located in sparse regions of the data space from points that belong to dense regions.
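The Expectation-Maximization idea behind distribution models can be sketched with scikit-learn's GaussianMixture, which fits a mixture of multivariate normal distributions. This is a minimal sketch; the two synthetic Gaussian blobs below are illustrative assumptions, not data from the text.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
# Two synthetic Gaussian blobs, purely for illustration.
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, 2)),
    rng.normal(loc=5.0, scale=1.5, size=(100, 2)),
])

# Fit a mixture of two multivariate normal distributions via EM.
gmm = GaussianMixture(n_components=2, random_state=0)
hard_labels = gmm.fit_predict(X)     # hard cluster assignments
soft_probs = gmm.predict_proba(X)    # per-cluster membership probabilities
print(hard_labels[:5], soft_probs[:5].round(2))

Note that predict_proba returns the soft (probabilistic) assignments discussed above, while fit_predict collapses them into hard labels.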
Figure 3.1: Difference between types of Machine Learning
Hierarchical clustering builds a hierarchy of clusters and comes in two flavors:
• Agglomerative: Initially, each object is considered to be its own cluster. According to a particular procedure, the clusters are then merged step by step until a single cluster containing all the elements remains.
• Divisive: Initially, all objects belong to a single cluster, which is then split step by step into smaller clusters.
Agglomerative Approach
This algorithm is also referred to as the bottom-up approach. It treats every data point as a single cluster and then merges the most similar (closest) pairs of clusters until a single large cluster is obtained or some stopping condition is satisfied.
Algorithm
1. Initialize all n data points as n individual clusters.
2. Find the cluster pair with the least (closest) distance and combine it into one single cluster.
3. Re-compute the distances between the newly formed cluster and each of the remaining clusters.
4. Repeat steps 2 and 3 until all data samples are merged into a single large cluster of size n.
A minimal code sketch of this procedure is given below.
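The following is a minimal sketch of the bottom-up procedure using scikit-learn's AgglomerativeClustering; the six sample points and the choice to stop merging at 2 clusters (rather than running on to a single cluster) are illustrative assumptions.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Merging stops when 2 clusters remain instead of running on to a
# single all-inclusive cluster.
model = AgglomerativeClustering(n_clusters=2, linkage="average")
labels = model.fit_predict(X)
print(labels)   # e.g. [0 0 0 1 1 1] (cluster index per data point)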
Advantages
• Easy to identify nested clusters.
• Reduces the computing time and space complexity in some settings.
Disadvantages
• Once two clusters have been merged, the merge can never be undone.
• The time complexity is high (at least O(n² log n) for common implementations), which limits scalability to large datasets.
Divisive Approach
This approach is also referred to as the top-down approach. Here, we consider the entire data set as one cluster and continuously split it into smaller clusters, iteratively, until every object is in its own cluster or a termination condition holds. This method is rigid: once a merging or splitting is done, it can never be undone.
Algorithm
1. Initiate the process with a single cluster containing all the samples.
2. Select the largest cluster, i.e., the cluster with the widest diameter.
3. Detect the data point in the cluster found in step 2 with the minimum average similarity to the other elements in that cluster.
4. The data sample found in step 3 becomes the first element of a new fragment group.
5. Detect the element in the original group that has the highest average similarity with the fragment group.
6. If the average similarity of the element found in step 5 with the fragment group is greater than its average similarity with the original group, move it to the fragment group; otherwise, the split of this cluster is complete.
7. Repeat steps 2 to 6 until each data point is separated into an individual cluster.
A practical code sketch is given below.
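Divisive clustering has no single canonical library routine; bisecting K-Means is a common practical approximation that likewise splits the largest cluster top-down. This sketch assumes scikit-learn >= 1.1 (which provides BisectingKMeans) and uses illustrative data.

import numpy as np
from sklearn.cluster import BisectingKMeans  # requires scikit-learn >= 1.1

X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 0.0],
              [8.0, 8.0], [9.0, 8.0], [8.0, 9.0]])

# Start from one all-inclusive cluster and keep splitting the largest
# cluster until 2 clusters remain.
model = BisectingKMeans(n_clusters=2, random_state=0)
labels = model.fit_predict(X)
print(labels)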
Advantages
• It produces more accurate hierarchies than bottom-up algorithms in some circumstances.
Disadvantages
• It is computationally more expensive than the bottom-up approach, and a split, once made, cannot be undone.
Divisive is the opposite of agglomerative: it starts with all the points in one cluster and divides them to create more clusters. These algorithms build a distance matrix of the existing clusters and merge or split clusters based on a metric for the similarity between clusters, i.e., the linkage criterion. The resulting clustering of the data points is represented using a dendrogram. There are different types of linkages (compared in the SciPy sketch after this list):
• Single Linkage: The distance between two clusters is the shortest distance between any two points in those clusters.
• Complete Linkage: The distance between two clusters is the longest distance between any two points in those clusters.
• Average Linkage: The distance between two clusters is the average of the pairwise distances between all points in the two clusters.
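The effect of the linkage choice can be visualized with SciPy, whose linkage function builds the merge hierarchy and dendrogram renders it. This is a minimal sketch; the five 2-D points are illustrative.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6]])

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)   # pairwise merges and their distances
    plt.figure()
    plt.title(f"{method} linkage")
    dendrogram(Z)
plt.show()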
3.6 Centroid-based clustering algorithms /
Partitioning clustering algorithms
In centroid/partitioning clustering, clusters are represented by a central vector, which may not necessarily be a member of the dataset. Even in this particular clustering type, the value of K needs to be chosen. This is an optimization problem: finding the number of centroids, or the value of K, and assigning the objects to nearby cluster centers. These steps need to be performed in such a way that the squared distance of each object from its cluster center is minimized.
3.6.1 K-Means
One of the most widely used centroid-based clustering algorithms is K-Means, and one of its drawbacks is that you need to choose a value of K in advance.
The K-Means algorithm splits the given dataset into a predefined number (K) of clusters using a particular distance metric. The center of each cluster/group is called the centroid. The algorithm proceeds as follows:
1. Choosing the number of clusters: The first step is to define the number K of clusters into which we will group the data. Let's select K = 3.
2. Initializing centroids: The centroid is the center of a cluster, but initially the exact centers of the data points are unknown, so we select random data points and define them as the centroids for each cluster. We will initialize 3 centroids in the dataset.
3. Assigning data points to the nearest cluster: Now that the centroids are initialized, the next step is to assign each data point Xn to its closest cluster centroid Ck. In this step, we first calculate the distance between data point X and centroid C using the Euclidean distance metric, and then choose the cluster whose centroid is at the minimum distance from the data point.
4. Re-initializing centroids: Next, each centroid is re-computed as the mean of all the data points assigned to its cluster.
5. Repeating: Steps 3 and 4 are repeated until the centroids, and hence the cluster assignments, no longer change. A from-scratch sketch of these steps is given below.
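The five steps above can be sketched from scratch with NumPy. This is a minimal sketch: K = 3, the convergence test, and the synthetic three-blob data are illustrative choices, not part of the algorithm itself.

import numpy as np

def kmeans(X, k=3, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick k random data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: Euclidean distance from every point to every centroid;
        # each point is assigned to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        # (keeping the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Three illustrative blobs around (0, 0), (6, 0) and (3, 6).
rng = np.random.default_rng(1)
X = rng.normal(size=(90, 2)) + np.repeat([[0, 0], [6, 0], [3, 6]], 30, axis=0)
labels, centroids = kmeans(X, k=3)
print(centroids.round(2))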
3.6.2 Advantages and Disadvantages
Advantages The following are some advantages of K-Means clustering algorithms:
• It is simple to implement and easy to interpret.
• It scales well to large datasets and usually converges quickly.
Disadvantages
• The number of clusters K must be chosen in advance.
• It is sensitive to the initial placement of centroids and to outliers, and it assumes roughly spherical clusters of similar size.
3.6.3 Applications of K-Means Clustering Algorithm
One of the main goals of cluster analysis is to get a meaningful intuition from the data we are working with. Common applications of K-Means include:
1. Document clustering
2. Image segmentation
3. Image compression
4. Customer segmentation
An illustrative image-compression sketch follows this list.
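As an illustration of the image-compression application, the sketch below performs K-Means color quantization: every pixel is replaced by the nearest of 16 centroid colors. The file name photo.jpg is a placeholder assumption; any RGB image works.

import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

# "photo.jpg" is a placeholder; substitute any RGB image.
img = np.asarray(Image.open("photo.jpg").convert("RGB"), dtype=np.float64) / 255.0
pixels = img.reshape(-1, 3)              # one row per pixel: (R, G, B)

# Cluster the pixel colors into 16 representative colors.
kmeans = KMeans(n_clusters=16, n_init=4, random_state=0).fit(pixels)

# Rebuild the image using only the 16 centroid colors.
compressed = kmeans.cluster_centers_[kmeans.labels_].reshape(img.shape)
Image.fromarray((compressed * 255).astype(np.uint8)).save("compressed.png")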
3.7 DBSCAN
DBSCAN is the abbreviation for Density-Based Spatial Clustering of Applications with Noise. It is an unsupervised clustering algorithm. DBSCAN can find clusters of any size in huge amounts of data and can work with datasets containing a significant amount of noise. It is based on the criterion of a minimum number of points within a region: the algorithm efficiently groups densely packed points into one cluster by identifying regions of high local density among the data points, and it handles outliers very effectively. An advantage of DBSCAN over the K-Means algorithm is that the number of clusters need not be known beforehand.
The DBSCAN algorithm depends upon two parameters, epsilon and minPoints:
• Epsilon is defined as the radius around each data point within which the density is considered.
• minPoints is the minimum number of data points required within the epsilon radius for a point to be considered a core point.
3.7.1 DBSCAN Algorithm
In the DBSCAN algorithm, a circle with radius epsilon is drawn around each data point, and the data point is classified as a Core Point, Border Point, or Noise Point. A data point is classified as a core point if at least minPoints data points lie within its epsilon radius. If it has fewer than minPoints points within its radius but lies in the neighborhood of a core point, it is known as a border point, and if there are no other points inside its epsilon radius, it is considered a noise point.
Let us understand the working of DBSCAN through an example.
In the figure above, point A has no points inside its epsilon (ε) radius, hence it is a noise point. Point B has minPoints (= 4) points within its epsilon radius, thus it is a core point, while the remaining marked point has only 1 point (fewer than minPoints) within its radius, hence it is a border point. The figure below shows a second example with minPoints = 3.
All the data points with at least 3 points in the circle, including the point itself, are considered core points, represented by the red color. All the data points with fewer than 3 but more than 1 point in the circle, including the point itself, are considered border points, represented by the yellow color. Finally, data points with no point other than themselves inside the circle are considered noise, represented by the purple color. For locating data points in space, DBSCAN uses Euclidean distance by default, although other metrics can also be used (such as great-circle distance for geographical data). DBSCAN also needs to scan through the dataset only once, whereas some other algorithms require multiple passes.
Steps Involved in the DBSCAN Algorithm
1. First, all the points within an epsilon radius of each point are found, and the core points, i.e., those with a number of neighbors greater than or equal to minPoints, are identified.
2. Next, for each core point that has not yet been assigned to a particular cluster, a new cluster is created.
3. All the densely connected points related to the core point are found and assigned to the same cluster. Two points are called densely connected if there is a neighbor point that has both points within its epsilon distance.
4. Finally, all the points in the data are iterated over, and the points that do not belong to any cluster are marked as noise.
A minimal code sketch of these steps is given below.
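The steps above can be sketched with scikit-learn's DBSCAN, where the eps parameter corresponds to epsilon and min_samples to minPoints. The data and parameter values below are illustrative and would normally require tuning.

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],   # dense region 1
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.2],   # dense region 2
              [25.0, 30.0]])                         # isolated point

# eps plays the role of epsilon, min_samples the role of minPoints
# (the point itself is counted among its neighbors).
db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)   # noise points are labelled -1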
3.7.4 Applications
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a
popular clustering algorithm in data mining and machine learning, particularly
useful for tasks involving spatial data analysis. Here are some applications
where DBSCAN is commonly used:
For example, in fraud detection, DBSCAN can identify groups of transactions that deviate significantly from normal behavior, indicating potential fraudulent activities.
5. What are the steps involved in EDA for clustering? Explain with an example.