Unit 3: Unsupervised Learning Algorithms
Unsupervised learning is a type of machine learning in which models are trained using
an unlabeled dataset and are allowed to act on that data without any supervision.
Here, we take unlabeled input data, meaning it is not categorized and no
corresponding outputs are given. This unlabeled input data is fed to the machine
learning model in order to train it. The model first interprets the raw data to find
hidden patterns in the data and then applies a suitable algorithm such as k-means
clustering or hierarchical clustering. Some popular unsupervised learning algorithms
are listed below:
o K-means clustering
o KNN (k-nearest neighbors)
o Hierarchical clustering
o Anomaly detection
o Neural Networks
o Principal Component Analysis
o Independent Component Analysis
What is Clustering?
The task of grouping data points based on their similarity with each other is called
Clustering or Cluster Analysis. This method falls under the branch of
Unsupervised Learning, which aims at gaining insights from unlabeled data points; that
is, unlike supervised learning, we do not have a target variable.
Clustering aims at forming groups of homogeneous data points from a heterogeneous
dataset. It evaluates similarity based on a metric like Euclidean distance, Cosine
similarity, Manhattan distance, etc., and then groups the points with the highest
similarity scores together.
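As a quick illustration of these metrics, the sketch below computes Euclidean, Manhattan, and cosine distances between two sample vectors with SciPy; the vectors a and b are arbitrary values chosen only for demonstration.

```python
import numpy as np
from scipy.spatial import distance

# Two arbitrary feature vectors used only for demonstration
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

print(distance.euclidean(a, b))   # straight-line (L2) distance
print(distance.cityblock(a, b))   # Manhattan (L1) distance
print(distance.cosine(a, b))      # cosine distance = 1 - cosine similarity
```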
For example, in the graph given below, we can clearly see that there are 3 circular
clusters forming on the basis of distance.
Now it is not necessary that the clusters formed must be circular in shape. The shape
of clusters can be arbitrary. There are many algorithms that work well at detecting
arbitrarily shaped clusters.
For example, in the graph given below, we can see that the clusters formed are not
circular in shape.
Hierarchical Clustering
With clustering, data points are put into groups, known as clusters, based on
similarities like color, shape, or other features. In hierarchical clustering, each
cluster is placed within a nested tree-like hierarchy, where clusters are grouped
together and broken down further into smaller clusters depending on their
similarities. The closer clusters are to each other in the hierarchy, the more similar
they are to each other.
Hierarchical clustering is based on the core idea that similar objects lie
nearby to each other in a data space while others lie far away. It uses
distance functions to find nearby data points and group the data points
together as clusters.
1. Agglomerative Clustering
Agglomerative clustering is a bottom-up approach. It starts by treating each
individual data point as a single cluster, and then clusters are merged continuously
based on similarity until one big cluster containing all objects is formed. It is good
at identifying small clusters. The steps are as follows:
o Step-1: Treat each data point as a single cluster, so at the start there will be N
clusters for N data points.
o Step-2: Take the two closest data points or clusters and merge them to form one
cluster. So, there will now be N-1 clusters.
o Step-3: Again, take the two closest clusters and merge them together to form
one cluster. There will be N-2 clusters.
o Step-4: Repeat Step-3 until only one cluster is left.
o Step-5: Once all the clusters are combined into one big cluster, develop the
dendrogram to divide the clusters as per the problem.
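The steps above can be reproduced with SciPy's hierarchical-clustering utilities. The sketch below is a minimal example on a small made-up 2-D dataset; Ward linkage and the cut into 3 clusters are illustrative choices, not part of the method itself.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Small made-up 2-D dataset; in practice X would be the unlabeled data
X = np.array([[1.0, 1.0], [1.5, 1.0], [5.0, 5.0],
              [5.5, 5.2], [9.0, 1.0], [9.2, 1.1]])

# Steps 1-4: start with each point as its own cluster and repeatedly
# merge the two closest clusters (Ward linkage used here)
Z = linkage(X, method="ward")

# Step-5: cut the dendrogram to obtain a chosen number of clusters
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)

# Plot the dendrogram that records the order of the merges
dendrogram(Z)
plt.show()
```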
2. Divisive Clustering
Divisive clustering works just the opposite of agglomerative clustering.
It starts by considering all the data points as one big cluster and then
splits it into smaller clusters repeatedly until every data point is in its
own cluster. Thus, it is good at identifying large clusters. It follows a
top-down approach and is more efficient than agglomerative clustering.
However, due to the complexity of its implementation, it does not have
a predefined implementation in any of the major machine learning
frameworks.
Advantages of Hierarchical clustering
o It is simple to implement and gives the best output in some cases.
o It is easy and results in a hierarchy, a structure that contains more information.
o It does not need us to pre-specify the number of clusters.
Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also
known as the centroid-based method. The most common example of partitioning
clustering is the K-Means Clustering algorithm.
In this type, the dataset is divided into a set of K groups, where K defines the
number of pre-defined groups. The cluster centers are created in such a way that
each data point is closer to its own cluster centroid than to any other cluster
centroid.
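A minimal sketch of partitioning clustering with scikit-learn's KMeans is given below; the toy data and the choice of K = 2 are assumptions made only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups of points, chosen only for illustration
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# K (the number of groups) is fixed in advance in partitioning clustering
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # cluster index assigned to each point
print(kmeans.cluster_centers_)  # learned centroids
```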
Density-Based Clustering
The density-based clustering method connects highly dense areas into clusters,
and arbitrarily shaped distributions can be formed as long as the dense regions can
be connected. The algorithm does this by identifying different clusters in the dataset
and connecting the areas of high density into clusters. The dense areas in data space
are separated from each other by sparser areas, where data points are not assigned
to a cluster but are instead identified as noise.
These algorithms can face difficulty in clustering the data points if the dataset has
varying densities and high dimensionality.
Introduction to DBSCAN
The DBSCAN algorithm can efficiently cluster densely grouped points into one cluster.
It identifies clusters based on local density in the data points, even in large datasets.
DBSCAN handles outliers very effectively. An advantage of DBSCAN over the K-means
algorithm is that the number of clusters need not be known beforehand in the case of
DBSCAN.
o eps: It defines the neighborhood around a data point, i.e., if the distance
between two points is lower than or equal to 'eps', then they are considered
neighbors. If the eps value is chosen too small, then a large part of the data
will be considered as outliers. If it is chosen very large, then the clusters will
merge and the majority of the data points will end up in the same cluster. One
way to find the eps value is based on the k-distance graph.
o MinPts: The minimum number of neighbors (data points) within the eps radius. The
larger the dataset, the larger the value of MinPts that should be chosen. As a general
rule, MinPts can be derived from the number of dimensions D in the dataset as
MinPts >= D+1. The value of MinPts must be chosen to be at least 3.
o Core Point: A point is a core point if it has at least MinPts points within eps.
o Border Point: A point which has fewer than MinPts points within eps but lies in the
neighborhood of a core point.
o Noise or outlier: A point which is neither a core point nor a border point.
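Putting the parameters and point types together, the sketch below runs scikit-learn's DBSCAN on a small made-up dataset; the eps and min_samples (MinPts) values are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.2],   # dense region A
              [5.0, 5.0], [5.1, 5.2], [4.9, 5.1],   # dense region B
              [20.0, 20.0]])                         # isolated point

# eps = neighborhood radius, min_samples = MinPts
db = DBSCAN(eps=0.5, min_samples=3).fit(X)

# Core and border points receive a cluster id; noise points are labeled -1
print(db.labels_)
```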
Applications of DBSCAN
o It is used in satellite imagery.
o It is used in X-ray crystallography.
o Anomaly detection in temperature data.
Association Rule Learning
Association rule learning is one of the very important concepts of machine
learning, and it is employed in market basket analysis, web usage mining,
continuous production, etc. Here, market basket analysis is a technique used by
various big retailers to discover associations between items. We can understand
it with the example of a supermarket: in a supermarket, all products that are
purchased together are placed together.
For example, if a customer buys bread, he will most likely also buy butter, eggs, or
milk, so these products are stored on the same shelf or close by.
Association rule learning can be divided into three types of algorithms:
1. Apriori
2. Eclat
3. F-P Growth Algorithm
To measure the strength of an association rule, the following three metrics are used:
o Support
o Confidence
o Lift
Let's understand each of them:
Support
Support tells how frequently an itemset appears in the dataset. It is defined as the
fraction of the transactions T that contain the itemset X. It can be written as:
Support(X) = Freq(X) / T
where Freq(X) is the number of transactions containing X and T is the total number
of transactions.
Confidence
Confidence indicates how often the rule has been found to be true, or how often the
items X and Y occur together in the dataset when the occurrence of X is already
given. It is the ratio of the number of transactions that contain both X and Y to the
number of transactions that contain X:
Confidence(X → Y) = Freq(X, Y) / Freq(X)
Lift
It measures the strength of a rule and can be defined by the below formula:
Lift(X → Y) = Support(X, Y) / (Support(X) × Support(Y))
It is the ratio of the observed support to the expected support if X and Y were
independent of each other. It has three possible cases:
o Lift = 1: X and Y are independent of each other.
o Lift > 1: X and Y are positively correlated, i.e., Y is likely to be bought when X is bought.
o Lift < 1: X and Y are negatively correlated, i.e., one item tends to replace the other.
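To make the three metrics concrete, the sketch below computes support, confidence, and lift by hand over a small made-up list of supermarket transactions; the item names and the rule bread → butter are assumptions used only for illustration.

```python
# Small made-up transaction list (each transaction is a set of items)
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]
n = len(transactions)

def support(itemset):
    # Fraction of transactions that contain every item in the itemset
    return sum(itemset <= t for t in transactions) / n

X, Y = {"bread"}, {"butter"}
sup_xy = support(X | Y)                       # Support(X, Y)
confidence = sup_xy / support(X)              # Freq(X, Y) / Freq(X)
lift = sup_xy / (support(X) * support(Y))     # > 1: positive association

print(sup_xy, confidence, lift)               # 0.6 0.75 1.25
```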
Apriori Algorithm
This algorithm uses frequent itemsets to generate association rules. It is designed to
work on databases that contain transactions. This algorithm uses a breadth-first
search and a Hash Tree to calculate the itemsets efficiently.
It is mainly used for market basket analysis and helps to understand the products
that can be bought together. It can also be used in the healthcare field to find drug
reactions for patients.
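As an illustrative sketch (not the algorithm's internal breadth-first search itself), frequent itemsets and association rules can be mined with the apriori implementation in the third-party mlxtend package, assuming it is installed; the toy transactions and thresholds below are arbitrary.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [["bread", "butter", "milk"],
                ["bread", "butter"],
                ["bread", "eggs"],
                ["milk", "eggs"],
                ["bread", "butter", "eggs"]]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
df = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# Frequent itemsets with support >= 0.4, then rules ranked by lift
frequent = apriori(df, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="lift", min_threshold=1.0)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```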
Eclat Algorithm
The Eclat algorithm stands for Equivalence Class Transformation. This algorithm uses
a depth-first search technique to find frequent itemsets in a transaction database. It
executes faster than the Apriori algorithm.