Unsupervised Learning
Prepared By
Archana
AP/PE
IIT(ISM), Dhanbad
Clustering in Machine Learning
• In the real world, not every dataset we work with has a target variable. This kind of data cannot be analyzed using supervised learning algorithms.
• We need the help of unsupervised algorithms. One of the most popular types of analysis under unsupervised learning is cluster analysis.
• When the goal is to group similar data points in a dataset, we use cluster analysis.
What is Clustering?
• The task of grouping data points based on their similarity with each other is called Clustering or Cluster Analysis. This method falls under the branch of Unsupervised Learning, which aims at gaining insights from unlabelled data points; that is, unlike supervised learning, we don’t have a target variable.
• Clustering aims at forming groups of homogeneous data points from a heterogeneous dataset. It evaluates similarity based on a metric like Euclidean distance, Cosine similarity, Manhattan distance, etc., and then groups the points with the highest similarity scores together.
• For example, in the graph given below, we can clearly see that there are 3 circular clusters forming on the basis of distance.
• It is not necessary that the clusters formed be circular in shape. The shape of clusters can be arbitrary, and there are many algorithms that work well at detecting arbitrarily shaped clusters.
• For example, in the graph given below, we can see that the clusters formed are not circular in shape.
Types of Clustering
• Broadly speaking, there are 2 types of clustering that can be
performed to group similar data points:
• Hard Clustering: In this type of clustering, each data point either belongs to a cluster completely or not at all. For example, say there are 4 data points and we have to cluster them into 2 clusters. Each data point will then belong to either cluster 1 or cluster 2.
• Soft Clustering: In this type of clustering, instead of assigning each data point to exactly one cluster, a probability or likelihood of that point belonging to each cluster is evaluated.
• For example, say there are 4 data points and we have to cluster them into 2 clusters. We then evaluate, for every data point, the probability of it belonging to each of the two clusters. This probability is calculated for all data points.
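• The contrast can be sketched in code. The snippet below is a minimal illustration only; the toy 2-D points and the choice of scikit-learn's KMeans for hard assignments and GaussianMixture for soft assignments are assumptions, not part of the example above.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Four toy data points to be grouped into two clusters (illustrative values)
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])

# Hard clustering: each point is assigned to exactly one cluster
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(hard_labels)              # e.g. [0 0 1 1]

# Soft clustering: each point gets a probability of belonging to each cluster
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.predict_proba(X))     # one row of membership probabilities per point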
Types of Clustering Algorithms
• At the surface level, clustering helps in the analysis of unstructured data.
Graphing, the shortest distance, and the density of the data points are a
few of the elements that influence cluster formation.
• Clustering is the process of determining how related the objects are based
on a metric called the similarity measure.
• Similarity measures are easier to define for smaller sets of features; it gets harder to create them as the number of features increases.
• Depending on the type of clustering algorithm being utilized in data
mining, several techniques are employed to group the data from the
datasets.
Various types of clustering algorithms are:
• Centroid-based Clustering (Partitioning methods)
• Density-based Clustering (Model-based methods)
• Connectivity-based Clustering (Hierarchical clustering)
• Distribution-based Clustering
1. Centroid-based Clustering (Partitioning methods)
• Partitioning methods are the simplest clustering algorithms. They group data points on the basis of their closeness.
• Generally, the similarity measures chosen for these algorithms are Euclidean distance, Manhattan distance, or Minkowski distance.
• Euclidean distance, for example, is a simple straight-line measurement
between points and is commonly used in many applications.
• Manhattan distance, however, follows a grid-like path, much like how you'd
navigate city streets.
• Squared Euclidean distance makes calculations easier by squaring the
values, while cosine distance is handy when working with text data
because it measures the angle between data vectors.
• Picking the right distance measure really depends on what kind of problem
you’re solving and the nature of your data.
• The datasets are separated into a predetermined number of clusters,
and each cluster is referenced by a vector of values.
• Each input data point is compared to these reference vectors and joins the cluster whose vector it is closest to.
• The primary drawback for these algorithms is the requirement that
we establish the number of clusters, “k,” either intuitively or
scientifically (using the Elbow Method) before any clustering machine
learning system starts allocating the data points.
• Despite this, it is still the most popular type of clustering. K-means and K-medoids clustering are some examples of this type of clustering.
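• As a minimal sketch of centroid-based clustering (the synthetic blob data and k = 3 are assumptions chosen purely for illustration), scikit-learn's KMeans can be used as follows:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with three groups (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(km.cluster_centers_)   # one centroid (vector of values) per cluster
print(km.labels_[:10])       # cluster assignment for the first 10 data points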
2. Density-based Clustering (Model-based methods)
• Density-based clustering, a model-based method, finds groups based on the density of
data points.
• Contrary to centroid-based clustering, which requires that the number of clusters be predefined and is sensitive to initialization, density-based clustering determines the number of clusters automatically and is less sensitive to the initial positions.
• They are great at handling clusters of different sizes and forms, making them ideally
suited for datasets with irregularly shaped or overlapping clusters.
• These methods manage both dense and sparse data regions by focusing on local density
and can distinguish clusters with a variety of morphologies.
• In contrast, centroid-based grouping, like k-means, has trouble finding arbitrarily shaped clusters.
• Due to its preset number of cluster requirements and extreme sensitivity to the initial
positioning of centroids, the outcomes can vary.
• Furthermore, the tendency of centroid-based approaches to produce spherical or convex
clusters restricts their capacity to handle complicated or irregularly shaped clusters.
• In conclusion, density-based clustering overcomes the drawbacks of centroid-based techniques by autonomously choosing the number of clusters, being resilient to initialization, and successfully capturing clusters of various sizes and forms. The most popular density-based clustering algorithm is DBSCAN.
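• A minimal DBSCAN sketch is shown below; the two-moons data and the eps/min_samples values are illustrative assumptions and would normally be tuned for the data at hand.

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two crescent-shaped (non-convex) clusters that k-means would struggle with
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
print(set(db.labels_))   # cluster ids found automatically; -1 marks noise points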
3. Connectivity-based Clustering (Hierarchical clustering)
• Hierarchical clustering is a method for assembling related data points into a hierarchy of clusters.
• Each data point is initially treated as its own cluster; the most similar clusters are then merged repeatedly until one large cluster containing all of the data points is formed.
• Think about how you may arrange a collection of items based on how similar they are.
• Each object begins as its own cluster at the base of the tree when using hierarchical
clustering, which creates a dendrogram, a tree-like structure.
• The closest pairings of clusters are then combined into larger clusters after the algorithm
examines how similar the objects are to one another.
• When every object is in one cluster at the top of the tree, the merging process has
finished. Exploring various granularity levels is one of the fun things about hierarchical
clustering.
• To obtain a given number of clusters, you can select to cut the dendrogram at a particular
height. The more similar two objects are within a cluster, the closer they are. It’s
comparable to classifying items according to their family trees, where the nearest
relatives are clustered together and the wider branches signify more general
connections.
• There are 2 approaches for Hierarchical clustering:
• Divisive Clustering: It follows a top-down approach; here we consider all data points to be part of one big cluster, and then this cluster is divided into smaller groups.
• Agglomerative Clustering: It follows a bottom-up approach; here we consider each data point to be its own cluster, and then these clusters are merged together until one big cluster containing all data points is formed (a short code sketch follows below).
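• A minimal sketch of agglomerative (bottom-up) clustering with a dendrogram is given below; the toy data and the decision to cut the tree into 2 clusters are assumptions made only for illustration.

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.datasets import make_blobs

# Small synthetic data set so the dendrogram stays readable
X, _ = make_blobs(n_samples=50, centers=2, random_state=1)

Z = linkage(X, method='ward')                      # build the merge tree bottom-up
labels = fcluster(Z, t=2, criterion='maxclust')    # cut the dendrogram into 2 clusters
dendrogram(Z)                                      # tree-like plot of the merges
plt.show()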
4. Distribution-based Clustering
• In distribution-based clustering, data points are grouped according to their likelihood of belonging to the same probability distribution (such as a Gaussian, binomial, or other distribution) within the data.
• The data elements are grouped based on statistical probability distributions; data objects with a higher likelihood of belonging to a cluster are included in it.
• Every cluster has a central point, and the further a data point lies from that centre, the less likely it is to be included in the cluster.
• A notable drawback of density and boundary-based approaches is the need to specify
the clusters a priori for some algorithms, and primarily the definition of the cluster form
for the bulk of algorithms.
• There must be at least one tuning or hyper-parameter selected, and while doing so
should be simple, getting it wrong could have unanticipated repercussions. Distribution-
based clustering has a definite advantage over proximity and centroid-based clustering
approaches in terms of flexibility, accuracy, and cluster structure.
• The key issue is that, in order to avoid overfitting, many of these clustering methods work well only with simulated or manufactured data, or when the bulk of the data points certainly belong to a preset distribution. The most popular distribution-based clustering algorithm is the Gaussian Mixture Model.
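• A minimal Gaussian Mixture Model sketch is shown below; the blob data and the choice of 3 components are assumptions for illustration.

from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=7)

gmm = GaussianMixture(n_components=3, random_state=7).fit(X)
print(gmm.means_)                # centre of each fitted Gaussian distribution
print(gmm.predict_proba(X[:5]))  # membership probabilities for the first 5 points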
Advantages of K-means
1. Simple and easy to implement: The k-means algorithm is easy to understand and implement, making it a popular choice for clustering tasks.
2. Fast and efficient: K-means is computationally efficient and can handle large datasets with high dimensionality.
3. Scalability: K-means can handle large datasets with many data points and can be easily scaled to handle even larger datasets.
4. Flexibility: K-means can be easily adapted to different applications and can be used with varying distance metrics and initialization methods.
Disadvantages of K-Means
1. Sensitivity to initial centroids: K-means is sensitive to the initial selection of centroids and can converge to a suboptimal solution.
2. Requires specifying the number of clusters: The number of clusters k needs to be specified before running the algorithm, which can be challenging in some applications.
3. Sensitive to outliers: K-means is sensitive to outliers, which can have a significant impact on the resulting clusters.
Different Evaluation Metrics for Clustering
• When it comes to evaluating how well your clustering algorithm is
working, there are a few key metrics that can help you get a clearer
picture of your results. Here’s a rundown of the most useful ones:
Silhouette Analysis
• Silhouette analysis is like a report card for your clusters. It measures
how well each data point fits into its own cluster compared to other
clusters.
• A high silhouette score means that your points are snugly fitting into
their clusters and are quite distinct from points in other clusters.
• Imagine a score close to 1 as a sign that your clusters are well-defined
and separated.
• Conversely, a score close to 0 indicates some overlap, and a negative
score suggests that the clustering might need some work.
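• A minimal sketch of computing the silhouette score with scikit-learn is shown below; the blob data and k = 3 are illustrative assumptions.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))   # close to 1 = well-defined, well-separated clusters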
Inertia
• Inertia is a bit like a gauge of how tightly packed your data points are within each
cluster.
• It calculates the sum of squared distances from each point to the cluster's center
(or centroid).
• Think of it as measuring how snugly the points are huddled together. Lower inertia
means that points are closer to the centroid and to each other, which generally
indicates that your clusters are well-formed.
• For most numeric data, you'll use Euclidean distance, but if your data includes
categorical features, Manhattan distance might be better.
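• In scikit-learn the inertia of a fitted k-means model is exposed as the inertia_ attribute; a minimal sketch (with illustrative blob data) follows.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.inertia_)   # sum of squared distances of points to their closest centroid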
Dunn Index
• The Dunn Index takes a broader view by considering both the distance within and
between clusters. It’s calculated as the ratio of the smallest distance between any
two clusters (inter-cluster distance) to the largest distance within a cluster (intra-
cluster distance).
• A higher Dunn Index means that clusters are not only tight and cohesive internally
but also well-separated from each other.
• In other words, you want your clusters to be as far apart as possible while being as
compact as possible.
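• scikit-learn does not provide the Dunn Index directly, so the sketch below hand-rolls it from the definition above (smallest inter-cluster distance divided by largest intra-cluster distance); the helper name dunn_index is our own.

import numpy as np
from scipy.spatial.distance import cdist

def dunn_index(X, labels):
    clusters = [X[labels == c] for c in np.unique(labels)]
    # smallest distance between points belonging to two different clusters
    inter = min(cdist(a, b).min()
                for i, a in enumerate(clusters)
                for b in clusters[i + 1:])
    # largest distance between two points of the same cluster (cluster diameter)
    intra = max(cdist(c, c).max() for c in clusters)
    return inter / intra   # higher = compact and well-separated clusters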
How Does K-Means Clustering Work?
• The flowchart below shows how k-means clustering works:
• The goal of the K-Means algorithm is to find clusters in the given input
data. There are a couple of ways to accomplish this.
• We can use the trial and error method by specifying the value of K
(e.g., 3,4, 5). As we progress, we keep changing the value until we get
the best clusters.
• Another method is to use the Elbow technique to determine the value
of K.
• Once we get the K's value, the system will assign that many centroids
randomly and measure the distance of each of the data points from
these centroids.
• Accordingly, it assigns those points to the corresponding centroid from
which the distance is minimum.
• So each data point will be assigned to the centroid, which is closest to
it. Thereby we have a K number of initial clusters.
• It calculates the new centroid position for the newly formed clusters.
The centroid's position moves compared to the randomly allocated
one.
• Once again, the distance of each point is measured from this new
centroid point. If required, the data points are relocated to the new
centroids, and the mean position or the new centroid is calculated
once again.
• If the centroids move, the iteration continues, indicating that the algorithm has not yet converged. Once the centroids stop moving (which means that the clustering process has converged), the final result is reported.
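• The loop described above can be sketched from scratch as follows; this is written for clarity rather than speed, and it assumes every cluster keeps at least one point in each iteration.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random initial centroids
    for _ in range(n_iter):
        # measure the distance of each point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)            # assign each point to its closest centroid
        # recompute each centroid as the mean position of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):   # centroids stopped moving
            break                                   # the clustering has converged
        centroids = new_centroids
    return labels, centroids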
Visualization example to understand this better:
• We have a data set for a grocery shop, and we want to find out how
many clusters this has to be spread across. To find the optimum
number of clusters, we break it down into the following steps:
• Step 1:
• The Elbow method is the best way to find the number of clusters. The elbow method consists of running K-Means clustering on the dataset for a range of values of K.
• Next, we use the within-cluster sum of squares as a measure to find the optimum number of clusters that can be formed for a given data set. The within-cluster sum of squares (WSS) is defined as the sum of the squared distances between each member of a cluster and its centroid.
• The WSS is measured for each value of K. Since WSS keeps decreasing as K grows, the value of K at which the decrease levels off (the "elbow" of the curve) is taken as the optimum value.
• Now, we draw a curve between WSS and the number of clusters.
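• A minimal sketch of drawing this curve is given below; since the grocery-shop data itself is not reproduced here, synthetic blob data stands in for it.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)   # stand-in data set

# WSS (inertia_) for K = 1..10
wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(1, 11)]

plt.plot(range(1, 11), wss, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('Within-cluster sum of squares (WSS)')
plt.show()   # look for the "elbow" where the curve flattens out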
• Next, let’s convert "df_scaled" from an array to a data frame and add
the labeled clusters per well to that data frame as follows:
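• The original listing is not reproduced here; the sketch below shows the general idea. The column names and the fitted kmeans object are assumptions based on the surrounding text, and df_scaled is assumed to be the scaled NumPy array from the earlier (not shown) steps.

import pandas as pd

# Assumed feature names for the well data (hypothetical, for illustration only)
feature_names = ['GR', 'Bulk Density', 'Resistivity',
                 'Water Saturation, fraction', 'Phi*H', 'TVD']

df_scaled = pd.DataFrame(df_scaled, columns=feature_names)   # array -> data frame
df_scaled['Cluster'] = kmeans.labels_                        # labelled cluster per well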
• Next, let’s return the data to its original (unstandardized) form by multiplying each variable by the standard deviation of that variable and adding the mean of that variable, as illustrated below.
• Note that "scaler.inverse_transform()" in scikit-learn could have also
been used to transform the data back to its original form.
• Please ensure that the code listed below is continuous when you replicate it in Jupyter Notebook.
• For example, "df_scaled['Water Saturation, fraction']" is split into two lines in the code shown below due to space limitations. Therefore, make sure the code lines are continuous to avoid getting an error.
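• A sketch of the unstandardization step is shown below; it assumes df holds the original (unscaled) well data and scaler is the fitted StandardScaler from the earlier (not shown) steps.

# Multiply each scaled feature by the standard deviation of the original variable
# and add back its mean to return the data to its original scale
df_unscaled = df_scaled[feature_names] * df[feature_names].std() + df[feature_names].mean()

# Alternatively, scikit-learn can undo the scaling directly:
# df_unscaled = pd.DataFrame(scaler.inverse_transform(df_scaled[feature_names]),
#                            columns=feature_names)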
• As illustrated in Fig. 4.12, each cluster centroid represents the average value of each feature for the wells in that cluster.
• For example, cluster 3 (since indexing starts with 0) has an average GR of 154.422 API, a bulk density of 2.238 g/cc, a resistivity of 15.845 Ω-m, a water saturation of 18.3627%, a Phi*H of 20.907 ft, and a TVD of 9672.233 ft.
• The next step is to understand the number of wells (counts) in each cluster. Let’s use the following lines of code to obtain it:
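• The original listing is not reproduced here; a sketch of this step, assuming the 'Cluster' column added earlier, is simply:

# Number of wells assigned to each cluster
print(df_scaled['Cluster'].value_counts())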
• The last step in type curve clustering is to plot these wells based on
their latitude and longitude on a map to evaluate the clustering
outcome.
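• A sketch of such a map is shown below; the 'Latitude' and 'Longitude' column names and the data frame df holding them are hypothetical, since the location data is not part of the listings above.

import matplotlib.pyplot as plt

# Colour each well by its assigned cluster to visually check the type curve regions
plt.scatter(df['Longitude'], df['Latitude'], c=df_scaled['Cluster'], cmap='viridis')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('Wells coloured by k-means cluster')
plt.show()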
• In addition, domain expertise plays a key role in determining the optimum number of clusters to successfully define the type curve regions/boundaries.
• For example, if there are currently 10 type curve regions within your company’s acreage position, 10 clusters can be used as a starting point to evaluate k-means clustering’s outcome.
• For this synthetic data set, the last step of plotting and evaluating the clustering outcome is skipped. However, please make sure to always visualize the clustering outcome and adjust the selected number of clusters accordingly.