Machine Learning
Agglomerative and Divisive Clustering
Agglomerative Clustering
Agglomerative clustering is the "bottom-up" approach: each data point starts as its own cluster, and the most similar pairs of clusters are merged iteratively until the desired number of clusters remains.
Divisive Clustering
Divisive clustering is the technique that starts with all data points in a single cluster and recursively splits the clusters into smaller sub-clusters based on their dissimilarity. It is also known as "top-down" clustering. Unlike agglomerative clustering, which starts with each data point as its own cluster and iteratively merges the most similar pairs of clusters, divisive clustering is a "divide and conquer" approach that breaks a large cluster into smaller sub-clusters.
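As a concrete illustration of the agglomerative (bottom-up) side, here is a minimal Python sketch using scikit-learn's AgglomerativeClustering; scikit-learn does not ship a divisive variant, and the small two-dimensional dataset is made up purely for illustration.

# Minimal sketch: agglomerative (bottom-up) hierarchical clustering.
# Note: this only illustrates the agglomerative variant described above;
# the data below is synthetic.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1.0, 2.0], [1.2, 1.8], [5.0, 8.0],
              [5.2, 7.9], [9.0, 1.0], [9.1, 0.8]])

# Merge the most similar pairs of clusters until 3 clusters remain.
model = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = model.fit_predict(X)
print(labels)  # e.g. [0 0 1 1 2 2] (cluster ids may differ)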
K-Means Algorithm
K-Means allows us to cluster the data into different groups and provides a convenient way to discover the categories of groups in an unlabelled dataset on its own, without the need for any training. The algorithm takes the unlabelled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters. The value of k should be predetermined in this algorithm.
The k-means algorithm mainly performs two tasks:
Determines the best positions for the K center points or centroids by an iterative process.
Assigns each data point to its closest k-center. The data points that are near a particular k-center form a cluster.
Hence each cluster has data points with some commonalities and is kept away from the other clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
Working of K-Means Algorithm
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids. (They can be points other than those in the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid.
Step-6: If any reassignment occurs, go back to Step-4; otherwise, the clusters are final.
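The steps above can be sketched in a few lines of Python with scikit-learn's KMeans; the two-feature data standing in for M1 and M2 is synthetic, and K=2 is chosen only to mirror the walkthrough that follows.

# Minimal sketch of the K-means steps above using scikit-learn.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
              [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])

# n_clusters is the predetermined K; the algorithm picks initial centroids,
# assigns points to the closest centroid, recomputes centroids, and repeats.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(labels)                   # cluster index of each data point
print(kmeans.cluster_centers_)  # final centroids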
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables
is given below:
Let's take the number of clusters K=2, i.e., we will try to group the dataset into two different clusters.
We need to choose some random k points or centroids to form the clusters. These points can be either points from the dataset or any other points. So, here we are selecting the below two points as k points, which are not part of our dataset. Consider the below image:
Now we will assign each data point of the scatter plot to its closest K-point or centroid. We will compute this by calculating the distance (for example, the Euclidean distance) between each point and the two centroids. Then we will draw a median line (the perpendicular bisector) between the two centroids. Consider the below image:
From the above image, it is clear that the points on the left side of the line are nearer to the K1 or blue centroid, and the points to the right of the line are closer to the yellow centroid. Let's color them blue and yellow for clear visualization.
As we need to find the closest cluster, we will repeat the process by choosing new centroids. To choose the new centroids, we will compute the center of gravity (mean) of the points in each cluster and place the new centroids there, as shown below:
Next, we will reassign each data point to the new centroid. For this, we will repeat the same process of finding a median line. The median will look like the below image:
From the above image, we can see that one yellow point is on the left side of the line, and two blue points are to the right of the line. So, these three points will be assigned to the new centroids.
As reassignment has taken place, we will again go to Step-4, which is finding new centroids or K-points.
We will repeat the process by finding the center of gravity of the points in each cluster, so the new centroids will be as shown in the below image:
As we have got the new centroids, we will again draw the median line and reassign the data points. So, the image will be:
It only identifies spherical-shaped clusters, i.e., it cannot identify clusters that are non-spherical or of varying sizes and densities.
It suffers from local minima and has problems when the data contains outliers.
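To see this limitation concretely, the following sketch runs K-means on scikit-learn's make_moons data (two interleaving, non-spherical clusters); the parameter values are arbitrary, and a low Adjusted Rand Index indicates that the true moons are not recovered.

# Illustrative sketch of the limitation above: K-means assumes roughly
# spherical clusters, so it fails to separate two interleaving half-moons.
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)
y_pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# A low score means the recovered clusters do not match the true moons.
print("Adjusted Rand Index:", adjusted_rand_score(y_true, y_pred))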
Bisecting K-Means
When K is large, bisecting k-means is more effective. Every data point in the data set and all k centroids are used for computation in the standard K-means method. In contrast, each bisecting step of bisecting k-means uses only the data points of one cluster and two centroids. As a result, computation time is shortened.
While k-means is known to yield clusters of varied sizes, bisecting k-means results in
clusters of comparable sizes.
Initialize the list of clusters to contain the cluster consisting of all points.
repeat
    Remove a cluster from the list of clusters.
    for i = 1 to number of trials do
        Bisect the selected cluster using basic K-means.
    end for
    Select the two clusters from the bisection with the lowest total SSE and add them to the list of clusters.
until the list of clusters contains K clusters.
Firstly, let us assume the number of clusters required at the final stage is 'K' = 3 (any value can be assumed if not mentioned).
Step 01: All the data points are placed into a single cluster, 'GFG'.
Step 02: Apply K-Means (K=2). The cluster 'GFG' is split into two clusters, 'GFG1' and 'GFG2'. The required number of clusters hasn't been obtained yet, so 'GFG1' is split further into two (since it has the higher SSE; the formula for SSE is explained below).
In the above diagram, as we split the cluster 'GFG' into 'GFG1' and 'GFG2', we calculate the SSE of the two clusters separately. The SSE (sum of squared errors) of a cluster is the sum of the squared distances between each point in the cluster and the cluster's centroid, i.e., SSE = Σ ||x − c||² over all points x in the cluster with centroid c. The cluster with the higher SSE will be split further; the cluster with the lower SSE contains comparatively less error and hence won't be split further.
Here, if the calculation shows that the cluster 'GFG1' has the higher SSE, we split it into two sub-clusters, (GFG1)′ and (GFG1)″. The number of clusters required at the final stage was given as 3, and we have now obtained 3 clusters.
If the required number of clusters is not obtained, we should continue splitting until they
are produced.
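Assuming scikit-learn version 1.1 or later, which provides sklearn.cluster.BisectingKMeans, the procedure above can be sketched as follows; the toy data is invented for illustration.

# Minimal sketch of bisecting K-means (requires scikit-learn >= 1.1).
import numpy as np
from sklearn.cluster import BisectingKMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0],
              [20, 20], [21, 19], [19, 21]])

# Repeatedly bisects the cluster with the largest SSE until K clusters remain.
model = BisectingKMeans(n_clusters=3, random_state=0)
print(model.fit_predict(X))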
K-Medoids clustering
K-Medoids and K-Means are two types of clustering mechanisms in partition clustering. Clustering is the process of breaking down an abstract group of data points/objects into classes of similar objects, such that all the objects in one cluster have similar traits. In partitioning, a group of n objects is broken down into k clusters based on their similarities. Two statisticians, Leonard Kaufman and Peter J. Rousseeuw, came up with this method.
K-Medoids:
Medoid: A Medoid is a point in the cluster from which the sum of distances to other data
points is minimal.
(or)
A Medoid is a point in the cluster whose dissimilarity to all the other points in the cluster is minimal.
PAM is the most powerful of the three algorithms but has the disadvantage of high time complexity. The K-Medoids clustering below is performed using PAM. In the further parts, we'll see what CLARA and CLARANS are.
Algorithm:
Step-1: Select k random points out of the n data points as the initial medoids.
Step-2: Assign each remaining data point to its closest medoid, and compute the total cost (the sum of dissimilarities of the points to their nearest medoid).
Step-3: For each medoid m and each non-medoid point o, compute the total cost that would result from swapping m with o.
Step-4: If the best swap reduces the total cost, perform it and go back to Step-2; otherwise, stop. The final medoids define the clusters.
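The following is a compact, illustrative PAM-style sketch in NumPy rather than an optimized implementation; the function name pam, the use of Manhattan distance, and the sample data are all choices made here for illustration.

# Illustrative PAM-style K-Medoids sketch (not optimized).
import numpy as np

def pam(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    # Pairwise Manhattan distances between all points.
    D = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)
    medoids = rng.choice(n, size=k, replace=False)
    cost = D[:, medoids].min(axis=1).sum()

    for _ in range(max_iter):
        best_cost, best_medoids = cost, medoids
        # Try swapping each current medoid with each non-medoid point.
        for i in range(k):
            for h in range(n):
                if h in medoids:
                    continue
                cand = medoids.copy()
                cand[i] = h
                c = D[:, cand].min(axis=1).sum()
                if c < best_cost:
                    best_cost, best_medoids = c, cand
        if best_cost >= cost:  # no improving swap left: converged
            break
        cost, medoids = best_cost, best_medoids

    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels, cost

X = np.array([[2.0, 6.0], [3.0, 4.0], [3.0, 8.0], [4.0, 7.0], [6.0, 2.0],
              [6.0, 4.0], [7.0, 3.0], [7.0, 4.0], [8.0, 5.0], [7.0, 6.0]])
print(pam(X, k=2))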
CLARA:
It is an extension to PAM to support medoid clustering for large data sets. This algorithm selects data samples from the data set, applies PAM to each sample, and outputs the best clustering out of these samples. This is more efficient than PAM for large data sets. We should ensure that the selected samples aren't biased, as they affect the clustering of the whole data.
CLARANS:
This algorithm selects a sample of neighbors to examine instead of selecting samples from the data set. In every step, it examines the neighbors of the current node. The time complexity of this algorithm is O(n²), and it is considered the most efficient of the medoid algorithms.
Advantages of using K-Medoids:
It is more robust to noise and outliers than K-Means, because it uses actual data points (medoids) rather than means as cluster centers.
Disadvantages:
It is not suitable for very large data sets, since the swap step of PAM is computationally expensive, and the results can depend on the initial choice of medoids.
Association Rule Learning
Association rule learning is a type of unsupervised learning technique that checks for the dependency of one data item on another data item and maps them accordingly so that it can be more profitable. It tries to find interesting relations or associations among the variables of a dataset. It is based on different rules to discover the interesting relations between variables in the database.
Association rule learning is one of the very important concepts of machine learning, and it is employed in market basket analysis, web usage mining, continuous production, etc. Market basket analysis is a technique used by various big retailers to discover the associations between items. We can understand it by taking the example of a supermarket, where all products that are frequently purchased together are placed together.
For example, if a customer buys bread, he is likely to also buy butter, eggs, or milk, so these products are stored on the same shelf or mostly nearby. Consider the below diagram:
Working of Association Rule Learning
Association rule learning works on the concept of an If-Then statement, such as if A then B.
Here the If element is called the antecedent, and the Then element is called the consequent.
These types of relationships, where we can find some association or relation between two items, are known as single cardinality. Association rule learning is all about creating rules, and as the number of items increases, the cardinality also increases accordingly. So, to measure the associations between thousands of data items, there are several metrics. These metrics are given below:
Support
Confidence
Lift
Support
Support is the frequency of A, or how frequently an item appears in the dataset. It is defined as the fraction of the transactions T that contain the itemset X. For an itemset X and a total of T transactions, it can be written as:
Support(X) = Frequency(X) / T
Confidence
Confidence indicates how often the rule has been found to be true, i.e., how often the items X and Y occur together in the dataset given that X already occurs. It is the ratio of the number of transactions that contain X and Y to the number of transactions that contain X:
Confidence(X → Y) = Support(X ∧ Y) / Support(X)
Lift
Lift measures the strength of a rule: it is the ratio of the observed support of X and Y occurring together to the support expected if X and Y were independent:
Lift(X → Y) = Support(X ∧ Y) / (Support(X) × Support(Y))
Lift = 1: The occurrences of the antecedent and the consequent are independent of each other.
Lift > 1: It determines the degree to which the two itemsets are dependent on each other.
Lift < 1: It tells us that one item is a substitute for the other item, which means one item has a negative effect on the other.
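A small Python sketch of these three metrics, computed over a handful of made-up transactions (the item names and transactions are invented for illustration):

# Support, confidence and lift computed directly from toy transactions.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "eggs"},
    {"butter", "milk"},
    {"bread", "butter", "eggs"},
]

def support(itemset):
    # Fraction of transactions that contain the whole itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    return confidence(antecedent, consequent) / support(consequent)

A, B = {"bread"}, {"butter"}
print("support   :", support(A | B))
print("confidence:", confidence(A, B))
print("lift      :", lift(A, B))  # >1 dependent, <1 substitutes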
Apriori Algorithm
The Apriori algorithm uses frequent itemsets to generate association rules, and it is designed to work on databases that contain transactions. With the help of these association rules, it determines how strongly or how weakly two objects are connected. The algorithm uses a breadth-first search and a Hash Tree to calculate the itemset associations efficiently. It is an iterative process for finding the frequent itemsets in a large dataset.
This algorithm was given by R. Agrawal and Srikant in the year 1994. It is mainly used for market basket analysis and helps to find products that can be bought together. It can also be used in the healthcare field to find drug reactions for patients.
Frequent Itemset
Frequent itemsets are those itemsets whose support is greater than the threshold value, or user-specified minimum support. It means that if {A, B} is a frequent itemset, then A and B individually should also be frequent itemsets.
Suppose there are two transactions: A = {1, 2, 3, 4, 5} and B = {2, 3, 7}. In these two transactions, 2 and 3 are the frequent itemsets.
Note: To better understand the Apriori algorithm and related terms such as support and confidence, it is recommended to first understand association rule learning.
Steps for Apriori Algorithm
Step-1: Determine the support of the itemsets in the transactional database, and select the minimum support and confidence.
Step-2: Take all the itemsets in the transactions that have a support value higher than the minimum (selected) support value.
Step-3: Find all the rules from these subsets that have a confidence value higher than the threshold (minimum) confidence.
We will understand the apriori algorithm using an example and mathematical calculation:
Example: Suppose we have the following dataset that has various transactions, and from
this dataset, we need to find the frequent itemsets and generate the association rules using
the Apriori algorithm:
Solution:
In the first step, we will create a table that contains support count (The frequency of each
itemset individually in the dataset) of each itemset in the given dataset. This table is called
the Candidate set or C1.
Now, we will take out all the itemsets that have a support count greater than or equal to the Minimum Support (2). This gives us the table for the frequent itemset L1.
Since all the itemsets except E have a support count greater than or equal to the minimum support, the E itemset will be removed.
Step-2: Candidate Generation C2, and L2:
In this step, we will generate C2 with the help of L1. In C2, we will create the pair of the
itemsets of L1 in the form of subsets.
After creating the subsets, we will again find the support count from the main transaction
table of datasets, i.e., how many times these pairs have occurred together in the given
dataset. So, we will get the below table for C2:
Again, we need to compare the C2 Support count with the minimum support count, and
after comparing, the itemset with less support count will be eliminated from the table C2.
It will give us the below table for L2
For C3, we will repeat the same two processes, but now we will form the C3 table with
subsets of three itemsets together, and will calculate the support count from the dataset.
It will give the below table:
Now we will create the L3 table. As we can see from the above C3 table, there is only one
combination of itemset that has support count equal to the minimum support count. So,
the L3 will have only one combination, i.e., {A, B, C}.
To generate the association rules, we will first create a new table with the possible rules from the obtained combination {A, B, C}. For all the rules, we will calculate the confidence using the formula Confidence = sup(A ∧ B) / sup(A). After calculating the confidence value for all rules, we will exclude the rules that have a confidence lower than the minimum threshold (50%).
As the given threshold or minimum confidence is 50%, the first three rules, A ∧ B → C, B ∧ C → A, and A ∧ C → B, can be considered strong association rules for the given problem.
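As a hedged sketch of the same workflow in code, the third-party mlxtend library (assuming it and pandas are installed) provides apriori and association_rules helpers; the transactions below are invented and are not the transaction table used in the walkthrough above.

# Apriori with mlxtend: frequent itemsets, then rules above min confidence.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [["A", "B", "C"], ["A", "B"], ["A", "C"],
                ["B", "C"], ["A", "B", "C", "D"]]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions),
                  columns=te.columns_)

frequent = apriori(df, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.5)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])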
FP-Growth
The FP-Growth Algorithm was proposed by Han et al. It is an efficient and scalable method for mining the complete set of frequent patterns by pattern fragment growth, using an extended prefix-tree structure, called the frequent-pattern tree (FP-tree), for storing compressed and crucial information about frequent patterns. In his study, Han showed that his method outperforms other popular methods for mining frequent patterns, e.g., the Apriori algorithm and Tree Projection. In later works it was shown that FP-Growth also performs better than other methods, including Eclat and Relim. The popularity and efficiency of the FP-Growth algorithm have led to many studies that propose variations to improve its performance.
FP Growth Algorithm
The FP-Growth algorithm is an alternative way to find frequent itemsets without using candidate generation, thus improving performance. To do so, it uses a divide-and-conquer strategy. The core of this method is the usage of a special data structure named the frequent-pattern tree (FP-tree), which retains the itemset association information.
FP-Tree
The frequent-pattern tree (FP-tree) is a compact data structure that stores quantitative
information about frequent patterns in a database. Each transaction is read and then
mapped onto a path in the FP-tree. This is done until all transactions have been read.
Different transactions with common subsets allow the tree to remain compact because
their paths overlap.
A frequent Pattern Tree is made with the initial item sets of the database. The purpose of
the FP tree is to mine the most frequent pattern. Each node of the FP tree represents an
item of the item set.
The root node represents null, while the lower nodes represent the item sets. The
associations of the nodes with the lower nodes, that is, the item sets with the other item
sets, are maintained while forming the tree.
The tree has one root labelled as "null", a set of item-prefix subtrees as its children, and a frequent-item-header table.
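Assuming the same third-party mlxtend library is installed, FP-Growth can be sketched as follows; it builds and mines the FP-tree internally, and the transactions are again invented for illustration.

# FP-Growth with mlxtend: mines frequent itemsets without candidate generation.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [["milk", "bread", "butter"], ["bread", "butter"],
                ["milk", "bread"], ["bread", "eggs"], ["milk", "butter"]]

te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions),
                  columns=te.columns_)

print(fpgrowth(df, min_support=0.4, use_colnames=True))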
Dimensionality Reduction
The number of input features, variables, or columns present in a given dataset is known as its dimensionality, and the process of reducing these features is called dimensionality reduction.
In many cases a dataset contains a huge number of input features, which makes the predictive modelling task more complicated. Because it is very difficult to visualize or make predictions for a training dataset with a high number of features, dimensionality reduction techniques are required in such cases.
Dimensionality reduction technique can be defined as, "It is a way of converting the higher
dimensions dataset into lesser dimensions dataset ensuring that it provides similar
information." These techniques are widely used in machine learning for obtaining a better
fit predictive model while solving the classification and regression problems.
It is commonly used in the fields that deal with high-dimensional data, such as speech
recognition, signal processing, bioinformatics, etc. It can also be used for data
visualization, noise reduction, cluster analysis, etc.
Principal Component Analysis (PCA)
PCA works by considering the variance of each attribute, because an attribute with high variance shows a good split between the classes, and hence it reduces the dimensionality. Some real-world applications of PCA are image processing, movie recommendation systems, and optimizing the power allocation in various communication channels.
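A minimal PCA sketch with scikit-learn follows; the three-feature data is synthetic, with one column deliberately correlated with another, and standardization is included because PCA is sensitive to feature scale.

# Reduce 3 synthetic features to the 2 directions of highest variance.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = 0.9 * X[:, 0] + 0.1 * rng.normal(size=100)  # correlated third column

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # share of variance kept per component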
Singular Value Decomposition (SVD)
SVD, or Singular Value Decomposition, is one of several techniques that can be used to
reduce the dimensionality, i.e., the number of columns, of a data set. Why would we want
to reduce the number of dimensions? In predictive analytics, more columns normally
means more time required to build models and score data. If some columns have no
predictive value, this means wasted time, or worse, those columns contribute noise to the
model and reduce model quality or predictive accuracy.
Dimensionality reduction can be achieved by simply dropping columns, for example, those
that may show up as collinear with others or identified as not being particularly predictive
of the target as determined by an attribute importance ranking technique. But it can also
be achieved by deriving new columns based on linear combinations of the original
columns. In both cases, the resulting transformed data set can be provided to machine
learning algorithms to yield faster model build times, faster scoring times, and more
accurate models.
While SVD can be used for dimensionality reduction, it is often used in digital signal
processing for noise reduction, image compression, and other areas.
SVD is an algorithm that factors an m x n matrix, M, of real or complex values into three
component matrices, where the factorization has the form USV*. U is an m x p matrix. S is
a p x p diagonal matrix. V is an n x p matrix, with V* being the transpose of V, a p x
n matrix, or the conjugate transpose if M contains complex values. The value p is called
the rank. The diagonal entries of S are referred to as the singular values of M. The columns
of U are typically called the left-singular vectors of M, and the columns of V are called the
right-singular vectors of M.
One of the features of SVD is that given the decomposition of M into U, S, and V, one can
reconstruct the original matrix M, or an approximation of it. The singular values in the
diagonal matrix S can be used to understand the amount of variance explained by each of
the singular vectors.
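The decomposition and a low-rank reconstruction can be sketched with NumPy's built-in SVD routine; the matrix below is random and the chosen rank k = 2 is arbitrary.

# Truncated SVD reconstruction: factor M into U, S, V* and rebuild a
# low-rank approximation from the largest singular values.
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 4))

U, s, Vt = np.linalg.svd(M, full_matrices=False)  # M = U @ diag(s) @ Vt

k = 2  # keep the k largest singular values
M_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Fraction of total variance captured by the first k singular values,
# and the reconstruction error of the rank-k approximation.
print((s[:k] ** 2).sum() / (s ** 2).sum())
print(np.linalg.norm(M - M_approx))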