CSE3008 Module4
Dr. Mohana S D,
Assistant Professor,
(Course In-charge - CSE3008)
School of Computer Science and Engineering & Information Science,
Presidency University Bengaluru.
Unsupervised Learning
Definition
Unsupervised learning is a type of machine learning where the model
is trained on a dataset without any labeled output.
The goal of unsupervised learning is to find patterns, structures, or
relationships in the data that can be used to gain insights, make
predictions, or classify new data.
Dimensionality Reduction:
Dimensionality reduction is a technique used to reduce the number of
features or variables in a dataset while preserving the most important
information. Principal Component Analysis (PCA) and t-Distributed
Stochastic Neighbor Embedding (t-SNE) are common dimensionality
reduction techniques.
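As an illustration, here is a minimal sketch of projecting a dataset onto its two leading principal components with scikit-learn's PCA (the library choice and the synthetic data are assumptions, not part of the slides):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((100, 10))          # synthetic data: 100 samples, 10 features

pca = PCA(n_components=2)          # keep the 2 directions of highest variance
X_reduced = pca.fit_transform(X)   # shape (100, 2)

# Fraction of the total variance each retained component explains
print(pca.explained_variance_ratio_)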
Anomaly Detection:
Anomaly detection is a technique used to identify data points that are
significantly different from the majority of the data. Isolation Forest and
Local Outlier Factor (LOF) are common anomaly detection algorithms.
Generative Models:
Generative models are algorithms that attempt to model the underlying
probability distribution of the data. Gaussian Mixture Models (GMM) and
Variational Autoencoders (VAE) are common generative models.
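A minimal generative-model sketch with scikit-learn's GaussianMixture (library and synthetic data are assumptions): the model fits a mixture of Gaussians and can then sample new points from the learned distribution:

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # synthetic data

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

# Draw 5 new points from the learned distribution, plus the component
# each sample came from
new_points, components = gmm.sample(5)
print(new_points)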
Applications:
Unsupervised learning has a wide range of applications, including customer
segmentation, image and speech recognition, anomaly detection, and
natural language processing. By finding patterns and relationships in large
and complex datasets, unsupervised learning can provide valuable insights
and help solve real-world problems.
K-Means
In simple K-Means, the algorithm starts by randomly selecting K data
points as the initial centroids. Then, it iteratively assigns each data point
to its nearest centroid and updates the centroids based on the mean of the
assigned data points. The process continues until the centroids no longer
move significantly or a maximum number of iterations is reached.
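The loop described above can be written directly in NumPy; this is a minimal sketch (it assumes no cluster ever becomes empty, which a production implementation would have to handle):

import numpy as np
from sklearn.datasets import make_blobs

def kmeans(X, k, max_iters=100, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    # Randomly pick K data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop when the centroids no longer move significantly
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return centroids, labels

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # synthetic data
centroids, labels = kmeans(X, k=3)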
Mini-batch K-Means:
Mini-batch K-Means is a variant of K-Means that is designed to handle
large datasets more efficiently. Instead of using all the data points in each
iteration, mini-batch K-Means randomly selects a small subset (batch) of
the data points to update the centroids. This reduces the computational
complexity of the algorithm and allows it to handle large datasets in a
reasonable amount of time.
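With scikit-learn (assumed here), mini-batch K-Means is a drop-in replacement for the full algorithm:

from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10_000, centers=5, random_state=0)  # synthetic data

# Each iteration updates the centroids from a random batch of 256 points
mbk = MiniBatchKMeans(n_clusters=5, batch_size=256, n_init=3,
                      random_state=0).fit(X)
print(mbk.cluster_centers_)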
To update the centroids incrementally, we can use the following steps:
1. Initialize the centroids: randomly select K data points as the initial
centroids.
2. Assign data points to centroids: for each data point, compute its distance
to each centroid and assign it to the nearest centroid.
3. Update the centroids: for each centroid, compute the mean of the assigned
data points and use it as the new centroid.
4. Incremental update: when a new data point is added to the dataset, update
its nearest centroid using only the new point and the current centroid (a
running mean), rather than recomputing the mean of all the data points
assigned to that centroid (see the sketch below).
5. Repeat steps 2-4 until the centroids no longer move significantly or a
maximum number of iterations is reached.
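The incremental update in step 4 is just a running mean; here is a sketch (the function name and bookkeeping are illustrative assumptions):

import numpy as np

def incremental_update(centroid, count, x):
    """Fold one new point x into a centroid that currently averages
    `count` points, without revisiting those points."""
    count += 1
    centroid = centroid + (x - centroid) / count   # running-mean update
    return centroid, count

# Example: the centroid of [0, 0] and [2, 0] updated with a new point [4, 3]
c, n = incremental_update(np.array([1.0, 0.0]), 2, np.array([4.0, 3.0]))
print(c, n)   # [2. 1.] 3 -- exactly the mean of all three points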
Overall, K-Means clustering is a powerful technique for partitioning data
into clusters based on their similarity. The algorithm can be further
improved by using mini-batch K-Means to handle large datasets and
updating the centroids incrementally to handle streaming data.
Choosing the Number of Clusters:
Finding the optimal number of clusters in K-Means clustering is an
important task, as it can greatly impact the accuracy of the clustering.
Two commonly used methods for determining the optimal number of
clusters are the Elbow method and the Silhouette coefficient.
Elbow method:
The Elbow method is a heuristic technique that involves plotting the
Within-Cluster Sum of Squares (WCSS) against the number of clusters, K. WCSS
is the sum of the squared distances between each data point and its assigned
centroid. The idea is to choose the number of clusters, K, at the “elbow” of the
curve, which is the point where the reduction in WCSS starts to diminish
significantly.
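A sketch of the Elbow method with scikit-learn and matplotlib (both assumed available); KMeans exposes the WCSS as its inertia_ attribute:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)  # synthetic data

ks = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        for k in ks]

plt.plot(ks, wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.show()   # look for the K where the curve bends (the "elbow")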
Silhouette coefficient:
The Silhouette coefficient is a metric that measures the quality of a clustering
solution. It takes into account both the cohesion (how close the data points are
to their assigned centroid) and the separation (how far the data points are from
other centroids). The Silhouette coefficient ranges from -1 to 1, with higher
values indicating better clustering solutions. To find the optimal number of
clusters using the Silhouette coefficient, we can calculate the coefficient for each
value of K and choose the one with the highest average coefficient.
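And a sketch of the Silhouette approach (silhouette_score averages the coefficient over all points; note that at least two clusters are required):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)  # synthetic data

for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Pick the K with the highest average silhouette coefficient
    print(k, silhouette_score(X, labels))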
Drawbacks of K-Means:
Although K-Means is a popular and effective clustering algorithm, it has
several limitations, including:
Sensitivity to the initial centroids: The quality of the clustering
solution can be highly dependent on the initial randomly selected
centroids.
Inability to handle non-linearly separable data: K-Means assumes that
the clusters are convex and separable, which may not be true for all
datasets.
Difficulty in determining the optimal number of clusters: Choosing
the number of clusters can be a challenging task, as it requires
manual intervention or the use of heuristic techniques.
K-Means++:
K-Means++ is an improvement over the original K-Means algorithm
that addresses the sensitivity to the initial centroids problem.
Instead of randomly selecting the initial centroids, K-Means++ uses a
smarter initialization strategy that selects the initial centroids with a
higher probability of being far from each other.
This helps to improve the quality of the clustering solution and reduce
the number of iterations needed to converge.
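In scikit-learn (assumed here), k-means++ seeding is the default initialization and can be requested explicitly:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)  # synthetic data

# init="k-means++" spreads the initial centroids apart (this is the default)
km = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)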
Divisive Hierarchical Clustering:
Divisive hierarchical clustering is a type of hierarchical clustering algorithm
that works by recursively dividing a dataset into smaller and smaller
subsets until each subset contains only one data point. Two commonly
used methods for performing divisive hierarchical clustering are bisecting
K-Means and clustering using Minimum Spanning Tree (MST).
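MST-based clustering can be sketched with SciPy (an assumption; the slides name the idea but no implementation): build the minimum spanning tree of the distance graph, then cut its heaviest edges so the tree falls apart into the desired number of components:

import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform
from sklearn.datasets import make_blobs

def mst_clustering(X, n_clusters):
    # Minimum spanning tree over the complete pairwise-distance graph
    # (assumes no duplicate points: zero entries are read as "no edge")
    dists = squareform(pdist(X))
    mst = minimum_spanning_tree(dists).toarray()
    # Cutting the (n_clusters - 1) heaviest MST edges splits the tree
    # into n_clusters connected components, which become the clusters
    edges = np.argwhere(mst > 0)
    weights = mst[mst > 0]
    for i in np.argsort(weights)[-(n_clusters - 1):]:
        mst[tuple(edges[i])] = 0
    _, labels = connected_components(mst, directed=False)
    return labels

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)  # synthetic data
print(mst_clustering(X, 3)[:10])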
Bisecting K-Means:
Bisecting K-Means is a type of divisive hierarchical clustering that
works by recursively dividing the dataset into two subsets using
K-Means clustering.
The algorithm starts by treating the entire dataset as one cluster and
splits it into two with K-Means (K = 2).
At each subsequent step, one of the existing clusters (typically the largest,
or the one with the highest within-cluster sum of squares) is selected and
bisected again with K-Means.
The process continues until the desired number of clusters is reached.
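scikit-learn ships a BisectingKMeans estimator (in version 1.1 and later; using it here is an assumption):

from sklearn.cluster import BisectingKMeans   # requires scikit-learn >= 1.1
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)  # synthetic data

# Repeatedly splits one cluster in two with K-Means until 4 clusters remain
bkm = BisectingKMeans(n_clusters=4, random_state=0).fit(X)
print(bkm.labels_[:10])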
Competitive Learning:
Competitive learning is a type of unsupervised learning that involves
training a neural network to learn a set of features or patterns from the
input data.
One popular competitive learning algorithm is Kohonen’s Self-Organizing
Maps (SOM), which is a type of neural network that can be used for
clustering.
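A minimal NumPy sketch of Kohonen's SOM update rule (the grid size, decay schedules, and all parameter values below are illustrative assumptions): each input pulls the best-matching node and its grid neighbors toward itself.

import numpy as np

def train_som(X, grid=(5, 5), epochs=20, lr0=0.5, sigma0=1.5, seed=0):
    # Assumes X is scaled to [0, 1] to match the random weight init
    rng = np.random.default_rng(seed)
    rows, cols = grid
    W = rng.random((rows, cols, X.shape[1]))        # one weight vector per node
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                  indexing="ij"), axis=-1)
    step, n_steps = 0, epochs * len(X)
    for _ in range(epochs):
        for x in X[rng.permutation(len(X))]:
            frac = step / n_steps
            lr = lr0 * (1 - frac)                   # decaying learning rate
            sigma = sigma0 * (1 - frac) + 1e-3      # shrinking neighborhood
            # Best-matching unit: the node whose weights are closest to x
            bmu = np.unravel_index(np.argmin(((W - x) ** 2).sum(-1)),
                                   (rows, cols))
            # Gaussian neighborhood: nodes near the BMU on the grid move most
            d2 = ((coords - np.array(bmu)) ** 2).sum(-1)
            h = np.exp(-d2 / (2 * sigma ** 2))
            W += lr * h[..., None] * (x - W)
            step += 1
    return W

X = np.random.default_rng(1).random((200, 3))       # synthetic data in [0, 1]
W = train_som(X)
print(W.shape)                                      # (5, 5, 3): 5x5 grid of prototypes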
Isolation Forest:
Isolation Forest is an outlier detection algorithm that recursively and
randomly partitions the data until each point is isolated on its own. The
algorithm records, for each data point, the path length (the number of
splits) needed to isolate it; points that become isolated after only a few
splits are easier to separate from the rest of the data and therefore
receive higher anomaly scores.
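A usage sketch with scikit-learn's IsolationForest (the contamination rate and the synthetic data are assumptions):

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)),      # dense normal cluster
               rng.uniform(-6, 6, (10, 2))])    # a few scattered anomalies

iso = IsolationForest(contamination=0.05, random_state=0)
pred = iso.fit_predict(X)          # -1 = outlier, 1 = inlier
scores = iso.score_samples(X)      # lower scores = more anomalous
print(np.sum(pred == -1), "points flagged as outliers")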
Local Outlier Factor (LOF):
Local Outlier Factor (LOF) is an outlier detection algorithm that measures
the local density around a data point relative to the local densities around
its neighbors. The algorithm estimates each point's density from the
distances to its k-nearest neighbors and compares it with the densities of
those neighbors; points whose density is substantially lower than that of
their neighbors receive a high LOF score and are considered outliers.
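And the corresponding sketch with scikit-learn's LocalOutlierFactor (the n_neighbors value and the synthetic data are assumptions):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)),      # dense normal cluster
               rng.uniform(-6, 6, (10, 2))])    # sparse outliers

lof = LocalOutlierFactor(n_neighbors=20)
pred = lof.fit_predict(X)                       # -1 = outlier, 1 = inlier
lof_scores = -lof.negative_outlier_factor_      # larger = more outlying
print(np.sum(pred == -1), "points flagged as outliers")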