IAI&ML UNIT-IV
UNIT– IV: Basic Methods in Supervised Learning: Distance-based methods, Nearest-Neighbors,
Decision Trees, Support Vector Machines, Nonlinearity and Kernel Methods. Unsupervised Learning:
Clustering, K-means, Dimensionality Reduction, PCA and kernel.
Distance-based Methods
Most clustering approaches use distance measures to assess the similarities or differences between a pair of objects. The most popular distance measures used are:
1. Euclidean Distance
2. Manhattan Distance
3. Jaccard Index
4. Minkowski distance
5. Cosine Index
1. Euclidean Distance:
Euclidean distance is considered the traditional metric for geometric problems. It can be simply explained as the ordinary straight-line distance between two points and is one of the most commonly used distance measures in cluster analysis; K-means, for example, uses it. Mathematically, it is the square root of the sum of the squared differences between the coordinates of the two objects: for points (x1, y1) and (x2, y2),
d = sqrt((x1 − x2)² + (y1 − y2)²)
Figure – Euclidean Distance
2. Manhattan Distance:
This determines the absolute difference between the pairs of coordinates.
Suppose we have two points P and Q. To determine the distance between these points, we simply add up the absolute differences of their coordinates along the X-axis and the Y-axis. In a plane with P at coordinates (x1, y1) and Q at (x2, y2):
Manhattan distance between P and Q = |x1 – x2| + |y1 – y2|
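To make this concrete, a minimal Python sketch (assuming NumPy is available; the points P and Q are hypothetical 2-D points) that computes both the Euclidean and the Manhattan distance:

import numpy as np

P = np.array([1.0, 2.0])   # hypothetical point P at (x1, y1)
Q = np.array([4.0, 6.0])   # hypothetical point Q at (x2, y2)

# Euclidean distance: square root of the sum of squared coordinate differences
euclidean = np.sqrt(np.sum((P - Q) ** 2))   # 5.0 for these points

# Manhattan distance: sum of absolute coordinate differences
manhattan = np.sum(np.abs(P - Q))           # |1-4| + |2-6| = 7

print(euclidean, manhattan)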
3. Jaccard Index:
The Jaccard index measures the similarity of two sets of data items as the size of their intersection divided by the size of their union (the Jaccard distance is 1 minus this value):
J(A, B) = |A ∩ B| / |A ∪ B|
Example: Note that we need to transform the data into binary form before applying the Jaccard index. Let's consider that Store 1 and Store 2 sell the items below, where each item is considered as an element.
We can observe that bread, jam, coke and cake are sold by both stores, hence each of these items is assigned a 1 for both stores.
The Jaccard index value ranges from 0 to 1; the higher the Jaccard index, the higher the similarity.
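As a hedged illustration of the store example, a short Python sketch (the item lists are hypothetical, chosen only so that bread, jam, coke and cake are common to both stores):

store1 = {"bread", "jam", "coke", "cake", "milk"}      # items sold by Store 1 (hypothetical)
store2 = {"bread", "jam", "coke", "cake", "butter"}    # items sold by Store 2 (hypothetical)

# Jaccard index = size of intersection / size of union
jaccard_index = len(store1 & store2) / len(store1 | store2)   # 4 / 6 ≈ 0.67
jaccard_distance = 1 - jaccard_index                          # dissimilarity of the two stores

print(jaccard_index, jaccard_distance)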
4. Minkowski distance:
It is the generalized form of the Euclidean and Manhattan distance measures. In an N-dimensional space, a point is represented as (x1, x2, ..., xN).
Consider two points P1 and P2:
P1: (X1, X2, ..., XN)
P2: (Y1, Y2, ..., YN)
The Minkowski distance of order p between them is
D(P1, P2) = (Σ |Xi − Yi|^p)^(1/p), with the sum taken over i = 1 to N,
where p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance.
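A small pure-Python sketch of the generalized formula (the points and the order p are hypothetical):

def minkowski(p1, p2, p=2):
    # (sum over i of |Xi - Yi|^p) ** (1/p); p=1 gives Manhattan, p=2 gives Euclidean
    return sum(abs(x - y) ** p for x, y in zip(p1, p2)) ** (1.0 / p)

P1 = (1, 2, 3)
P2 = (4, 6, 8)
print(minkowski(P1, P2, p=1))   # Manhattan: 3 + 4 + 5 = 12
print(minkowski(P1, P2, p=2))   # Euclidean: sqrt(9 + 16 + 25) ≈ 7.07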
5. Cosine Index:
The cosine measure for clustering determines the cosine of the angle between two vectors, given by the following formula:
cos(θ) = (A · B) / (||A|| ||B||)
Here θ gives the angle between the two vectors, and A, B are n-dimensional vectors.
Cosine similarity values range between -1 and 1; the lower the cosine similarity, the lower the similarity between the two observations.
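A minimal sketch of the cosine similarity between two hypothetical customer vectors (assuming NumPy; the purchase counts are made up for illustration):

import numpy as np

A = np.array([3, 2, 0, 5])   # hypothetical purchase counts for Customer 1
B = np.array([1, 0, 0, 4])   # hypothetical purchase counts for Customer 2

# cos(theta) = (A . B) / (||A|| * ||B||)
cosine_similarity = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
print(cosine_similarity)     # a value close to 1 means the two customers are very similar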
Decision Tree Classification Algorithm
o Decision Tree is a Supervised learning technique that can be used for both classification and Regression
problems, but mostly it is preferred for solving Classification problems. It is a tree-structured classifier,
where internal nodes represent the features of a dataset, branches represent the decision
rules and each leaf node represents the outcome.
o In a decision tree, there are two types of nodes: the Decision Node and the Leaf Node. Decision nodes are used to make decisions and have multiple branches, whereas leaf nodes are the outputs of those decisions and do not contain any further branches.
o The decisions or tests are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a problem/decision based on given
conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which expands on further
branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification and Regression Tree
algorithm.
o A decision tree simply asks a question and, based on the answer (Yes/No), further splits the tree into subtrees.
o Below diagram explains the general structure of a decision tree:
Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.
Why use Decision Trees?
There are various algorithms in machine learning, so choosing the best algorithm for the given dataset and problem is the main point to remember while creating a machine learning model. Below are the two reasons for using the Decision tree:
o Decision Trees usually mimic human thinking ability while making a decision, so it is easy to understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like structure.
Decision Tree Terminologies
Root Node: The root node is where the decision tree starts. It represents the entire dataset, which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output nodes, and the tree cannot be segregated further after reaching a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to the
given conditions.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes are called the
child nodes.
How does the Decision Tree algorithm work?
In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root node of the tree. The algorithm compares the value of the root attribute with the corresponding attribute of the record (real dataset) and, based on the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further. It continues the process until it reaches a leaf node of the tree. The complete process can be better understood using the below algorithm:
o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain possible values for the best attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where you cannot further classify the nodes; the final node is called a leaf node.
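As a hedged illustration of these steps, a minimal scikit-learn sketch (assuming scikit-learn is installed; the toy dataset and feature names are hypothetical) that builds a CART-style tree using information gain:

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical records: [salary in lakhs, distance to office in km]; label 1 = accept offer, 0 = decline
X = [[10, 5], [12, 30], [4, 5], [15, 10], [5, 25], [11, 8]]
y = [1, 0, 0, 1, 0, 1]

# criterion="entropy" uses information gain; criterion="gini" would use the Gini index
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)

# Inspect the learned decision rules and classify a new candidate's offer
print(export_text(tree, feature_names=["salary", "distance"]))
print(tree.predict([[13, 12]]))   # predicted class for a new record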
Example: Suppose there is a candidate who has a job offer and wants to decide whether he should accept the offer or not. To solve this problem, the decision tree starts with the root node (the Salary attribute, chosen by ASM). The root node splits further into the next decision node (distance from the office) and one leaf node based on the corresponding labels. The next decision node further splits into one decision node (cab facility) and one leaf node. Finally, the decision node splits into two leaf nodes (Accepted offer and Declined offer). Consider the below diagram:
Attribute Selection Measures
While implementing a decision tree, the main issue that arises is how to select the best attribute for the root node and for the sub-nodes. To solve such problems, there is a technique called the Attribute Selection Measure, or ASM. With this measurement, we can easily select the best attribute for the nodes of the tree. There are two popular techniques for ASM, which are:
Information Gain
Gini Index
1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation of a dataset based on an
attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and a node/attribute
having the highest information gain is split first. It can be calculated using the below formula:
Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]
Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies the randomness in the data. Entropy can be calculated as:
Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)
where S is the total number of samples, P(yes) is the probability of yes and P(no) is the probability of no.
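A small sketch, under the two-class assumption above, that computes the entropy of a node and the information gain of a hypothetical binary split (the sample counts are made up):

from math import log2

def entropy(p_yes):
    # Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no); a pure node has entropy 0
    p_no = 1 - p_yes
    return sum(-p * log2(p) for p in (p_yes, p_no) if p > 0)

# Hypothetical parent node: 9 "yes" and 5 "no" samples
parent = entropy(9 / 14)

# Hypothetical split into two child nodes of sizes 8 and 6
left, right = entropy(6 / 8), entropy(3 / 6)
weighted_avg = (8 / 14) * left + (6 / 14) * right

information_gain = parent - weighted_avg
print(round(information_gain, 3))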
2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the CART (Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create binary splits.
o Gini index can be calculated using the below formula:
Gini Index = 1 − Σj (Pj)²
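A matching sketch for the Gini index of a node using the formula above (the class counts are hypothetical):

def gini(counts):
    # Gini Index = 1 - sum over j of (Pj)^2, where Pj is the proportion of class j in the node
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini([9, 5]))    # mixed node -> about 0.459
print(gini([14, 0]))   # pure node  -> 0.0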
Pruning: Getting an Optimal Decision Tree
Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal decision tree. A too-large tree increases the risk of overfitting, and a small tree may not capture all the important features of the dataset. Therefore, a technique that decreases the size of the learning tree without reducing accuracy is known as pruning. There are mainly two types of tree pruning techniques used:
o Cost Complexity Pruning
o Reduced Error Pruning
Disadvantages of the Decision Tree:
o It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
o For more class labels, the computational complexity of the decision tree may increase.
Support Vector Machine (SVM) Algorithm
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes, so that we can easily put a new data point in the correct category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine. Consider the below diagram, in which two different categories are classified using a decision boundary or hyperplane:
Example: SVM can be understood with the example that we have used in the KNN classifier. Suppose we see a strange cat that also has some features of dogs. If we want a model that can accurately identify whether it is a cat or a dog, such a model can be created by using the SVM algorithm. We first train our model with lots of images of cats and dogs so that it can learn about the different features of cats and dogs, and then we test it with this strange creature. As the support vector machine creates a decision boundary between these two classes (cat and dog) and chooses extreme cases (support vectors), it will see the extreme cases of cat and dog. On the basis of the support vectors, it will classify it as a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text categorization, etc.
Types of SVM
o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-dimensional
space, but we need to find out the best decision boundary that helps to classify the data points. This best
boundary is known as the hyperplane of SVM.
The dimensions of the hyperplane depend on the number of features present in the dataset: if there are 2 features (as shown in the image), then the hyperplane will be a straight line, and if there are 3 features, then the hyperplane will be a 2-dimensional plane.
We always create the hyperplane that has the maximum margin, which means the maximum distance between the hyperplane and the nearest data points of either class.
Support Vectors: The data points or vectors that are closest to the hyperplane and affect its position are termed support vectors. Since these vectors support the hyperplane, they are called support vectors.
Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we have a dataset that
has two tags (green and blue), and the dataset has two features x1 and x2. We want a classifier that can classify the pair (x1, x2) of coordinates as either green or blue. Consider the below image:
Since it is a 2-D space, we can easily separate these two classes just by using a straight line. But there can be multiple lines that can separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the closest points of the lines from both classes. These points are called support vectors. The distance between the vectors and the hyperplane is called the margin.
And the goal of SVM is to maximize this margin. The hyperplane with maximum margin is called
the optimal hyperplane.
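A minimal hedged sketch of a linear SVM on such two-feature data, using scikit-learn's SVC (the toy points are hypothetical):

import numpy as np
from sklearn.svm import SVC

# Hypothetical (x1, x2) points with two tags: 0 = blue, 1 = green
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel searches for the maximum-margin straight line (hyperplane)
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.support_vectors_)           # the extreme points that define the margin
print(clf.predict([[3, 2], [7, 6]]))  # classify two new points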
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear data, we
cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data, we have used two
dimensions x and y, so for non-linear data, we will add a third dimension z. It can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the below image:
Since we are in 3-D space, it looks like a plane parallel to the x-axis. If we convert it into 2-D space with z = 1, it will become:
Here we can see a hyperplane separating the green dots from the blue ones. A hyperplane has one dimension less than the ambient space. For example, in the above figure we have two dimensions representing the ambient space, but the line that divides or classifies the space has one dimension less than the ambient space and is called a hyperplane.
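A hedged sketch of this idea: hypothetical ring-shaped data is made separable either by manually adding the third feature z = x² + y² and using a linear SVM, or by letting a non-linear (RBF) kernel do the lifting implicitly:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Hypothetical 2-D data: class 1 lies inside a circle of radius 1, class 0 outside it
X = rng.uniform(-2, 2, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)

# Option 1: explicit third dimension z = x^2 + y^2, then a linear SVM in 3-D
X3 = np.column_stack([X, X[:, 0] ** 2 + X[:, 1] ** 2])
linear_3d = SVC(kernel="linear").fit(X3, y)

# Option 2: keep the 2-D data and use a non-linear (RBF/Gaussian) kernel directly
rbf_2d = SVC(kernel="rbf", gamma=1.0).fit(X, y)

print(linear_3d.score(X3, y), rbf_2d.score(X, y))   # both should fit this data well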
Nonlinearity and Kernel Methods
It is very difficult to solve such a classification problem with a linear classifier, since no single straight line can separate the red and the green dots when the points are distributed like this. Here comes the use of the kernel function, which takes the points to a higher dimension, solves the problem there, and returns the output. Think of it this way: we can see that the green dots are enclosed within some perimeter while the red ones lie outside it; likewise, there could be other scenarios where the green dots are distributed within a trapezoid-shaped area.
So what we do is convert the two-dimensional plane, which was first classified by a one-dimensional hyperplane ("or a straight line"), into a three-dimensional space, and here our classifier, i.e., the hyperplane, will not be a straight line but a two-dimensional plane which will cut the area.
In order to get a mathematical understanding of kernels, let us look at Lili Jiang's equation of a kernel, which is:
K(x, y) = <f(x), f(y)>
where K is the kernel function, x and y are n-dimensional inputs, f is a map from n-dimensional space to m-dimensional space, and <a, b> denotes the dot product.
Example:
Let us say that we have two points, x = (2, 3, 4) and y = (3, 4, 5), and f maps a point (a, b, c) to the nine pairwise products (a·a, a·b, a·c, b·a, b·b, b·c, c·a, c·b, c·c).
so,
f(2, 3, 4) = (4, 6, 8, 6, 9, 12, 8, 12, 16) and
f(3, 4, 5) = (9, 12, 15, 12, 16, 20, 15, 20, 25)
so the dot product,
f(x) · f(y) = f(2, 3, 4) · f(3, 4, 5)
= (36 + 72 + 120 + 72 + 144 + 240 + 120 + 240 + 400)
= 1444
And,
K(x, y) = (2×3 + 3×4 + 4×5)² = (6 + 12 + 20)² = 38 × 38 = 1444.
As we find out, f(x) · f(y) and K(x, y) give us the same result, but the former method required a lot of calculation (because of projecting 3 dimensions into 9 dimensions), while using the kernel it was much easier.
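The same arithmetic can be checked with a few lines of NumPy (f is the explicit 9-dimensional map described above, and K is the squared dot product):

import numpy as np

def f(v):
    # Explicit map to 9 dimensions: all pairwise products v_i * v_j
    return np.outer(v, v).ravel()

def K(a, b):
    # Kernel: squared dot product, computed directly in 3 dimensions
    return np.dot(a, b) ** 2

x = np.array([2, 3, 4])
y = np.array([3, 4, 5])

print(np.dot(f(x), f(y)))   # 1444, via the 9-dimensional projection
print(K(x, y))              # 1444, with far less work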
Major Kernel Functions
1. Linear Kernel
Let us say that we have two vectors named x1 and x2; then the linear kernel is defined by the dot product of these two vectors:
K(x1, x2) = x1 · x2
2. Polynomial Kernel
A polynomial kernel is defined by the following equation:
K(x1, x2) = (x1 · x2 + 1)^d
where d is the degree of the polynomial.
3. Gaussian Kernel
This kernel is an example of a radial basis function (RBF) kernel. Below is the equation for it:
K(x1, x2) = exp(−||x1 − x2||² / (2σ²))
The given sigma plays a very important role in the performance of the Gaussian kernel and should neither be overestimated nor underestimated; it should be carefully tuned according to the problem.
4. Exponential Kernel
This is in close relation with the previous kernel, i.e., the Gaussian kernel, with the only difference being that the square of the norm is removed:
K(x1, x2) = exp(−||x1 − x2|| / (2σ²))
5. Laplacian Kernel
This type of kernel is less prone to variations and is essentially equivalent to the previously discussed exponential kernel; the equation of the Laplacian kernel is given as:
K(x1, x2) = exp(−||x1 − x2|| / σ)
This kernel is very widely used and popular with support vector machines.
There are many more types of kernel functions; we have discussed the most commonly used ones. The type of problem purely decides which kernel function should be used.
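For reference, a small sketch defining these kernels as Python functions (sigma and the degree d are hypothetical tuning parameters, to be adjusted per problem):

import numpy as np

def linear_kernel(x1, x2):
    return np.dot(x1, x2)

def polynomial_kernel(x1, x2, d=2):
    return (np.dot(x1, x2) + 1) ** d

def gaussian_kernel(x1, x2, sigma=1.0):
    return np.exp(-np.sum((x1 - x2) ** 2) / (2 * sigma ** 2))

def exponential_kernel(x1, x2, sigma=1.0):
    return np.exp(-np.linalg.norm(x1 - x2) / (2 * sigma ** 2))

def laplacian_kernel(x1, x2, sigma=1.0):
    return np.exp(-np.linalg.norm(x1 - x2) / sigma)

a, b = np.array([2.0, 3.0, 4.0]), np.array([3.0, 4.0, 5.0])
print(linear_kernel(a, b), polynomial_kernel(a, b), gaussian_kernel(a, b))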
Unsupervised Learning:
Clustering
Clustering or cluster analysis is a machine learning technique, which groups the unlabelled dataset. It can be
defined as "A way of grouping the data points into different clusters, consisting of similar data
points. The objects with the possible similarities remain in a group that has less or no similarities with
another group."
It does it by finding some similar patterns in the unlabelled dataset such as shape, size, color, behavior, etc.,
and divides them as per the presence and absence of those similar patterns.
It is an unsupervised learning method, hence no supervision is provided to the algorithm, and it deals with
the unlabeled dataset.
After applying this clustering technique, each cluster or group is provided with a cluster-ID. An ML system can use this ID to simplify the processing of large and complex datasets.
Note: Clustering is somewhere similar to the classification algorithm, but the difference is the type of
dataset that we are using. In classification, we work with the labeled data set, whereas in clustering,
we work with the unlabelled dataset.
Example: Let's understand the clustering technique with the real-world example of a shopping mall. When we visit any shopping mall, we can observe that things with similar usage are grouped together: the t-shirts are grouped in one section and the trousers in another, and similarly, in the vegetable and fruit section, apples, bananas, mangoes, etc., are grouped separately so that we can easily find what we are looking for. The clustering technique works in the same way. Another example of clustering is grouping documents according to their topic.
The clustering technique can be widely used in various tasks. Some most common uses of this technique
are:
o Market Segmentation
o Statistical data analysis
o Social network analysis
o Image segmentation
o Anomaly detection, etc.
Apart from these general usages, it is used by Amazon in its recommendation system to provide recommendations based on the past search of products. Netflix also uses this technique to recommend movies and web series to its users based on their watch history.
The below diagram explains the working of the clustering algorithm. We can see the different fruits are
divided into several groups with similar properties.
Types of Clustering Methods
The clustering methods are broadly divided into Hard clustering (a data point belongs to only one group) and Soft clustering (a data point can also belong to another group). But various other approaches to clustering also exist. Below are the main clustering methods used in machine learning:
1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering
1. Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also known as the centroid-
based method. The most common example of partitioning clustering is the K-Means Clustering
algorithm.
In this type, the dataset is divided into a set of k groups, where K is used to define the number of pre-defined
groups. The cluster center is created in such a way that the distance between the data points of one cluster is
minimum as compared to another cluster centroid.
2. Density-Based Clustering
The density-based clustering method connects the highly-dense areas into clusters, and the arbitrarily shaped
distributions are formed as long as the dense region can be connected. This algorithm does it by identifying
different clusters in the dataset and connects the areas of high densities into clusters. The dense areas in data
space are divided from each other by sparser areas.
These algorithms can face difficulty in clustering the data points if the dataset has varying densities and high
dimensions.
3. Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is divided based on the probability of how a dataset belongs to a particular distribution, assuming some distribution, most commonly the Gaussian distribution. The example of this type is the Expectation-Maximization Clustering algorithm that uses Gaussian Mixture Models (GMM).
4. Hierarchical Clustering
Hierarchical clustering can be used as an alternative for the partitioned clustering as there is no requirement
of pre-specifying the number of clusters to be created. In this technique, the dataset is divided into clusters
to create a tree-like structure, which is also called a dendrogram. The observations or any number of
clusters can be selected by cutting the tree at the correct level. The most common example of this method is
the Agglomerative Hierarchical algorithm.
5. Fuzzy Clustering
Fuzzy clustering is a type of soft method in which a data object may belong to more than one group or cluster. Each data point has a set of membership coefficients, which depend on its degree of membership in each cluster. The Fuzzy C-means algorithm is the example of this type of clustering; it is sometimes also known as the Fuzzy k-means algorithm.
Clustering Algorithms
The clustering algorithms can be divided based on the models explained above. There are different types of clustering algorithms published, but only a few are commonly used. The choice of clustering algorithm depends on the kind of data we are using: some algorithms need to guess the number of clusters in the given dataset, whereas others need to find the minimum distance between the observations of the dataset.
Here we are discussing mainly popular Clustering algorithms that are widely used in machine learning:
1. K-Means algorithm: The k-means algorithm is one of the most popular clustering algorithms. It classifies the dataset
by dividing the samples into different clusters of equal variances. The number of clusters must be specified in this
algorithm. It is fast with fewer computations required, with the linear complexity of O(n).
2. Mean-shift algorithm: Mean-shift algorithm tries to find the dense areas in the smooth density of data points. It is an
example of a centroid-based model, that works on updating the candidates for centroid to be the center of the points
within a given region.
3. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering of Applications with Noise. It is an example
of a density-based model similar to the mean-shift, but with some remarkable advantages. In this algorithm, the areas
of high density are separated by the areas of low density. Because of this, the clusters can be found in any arbitrary
shape.
4. Expectation-Maximization Clustering using GMM: This algorithm can be used as an alternative to the k-means algorithm or for those cases where k-means may fail. In GMM, it is assumed that the data points are Gaussian distributed.
5. Agglomerative Hierarchical algorithm: The Agglomerative hierarchical algorithm performs the bottom-up
hierarchical clustering. In this, each data point is treated as a single cluster at the outset and then successively merged.
The cluster hierarchy can be represented as a tree-structure.
6. Affinity Propagation: It is different from other clustering algorithms as it does not require specifying the number of clusters. In this algorithm, each pair of data points exchanges messages until convergence. It has O(N²T) time complexity, which is the main drawback of this algorithm.
Applications of Clustering
Below are some commonly known applications of clustering technique in Machine Learning:
o In Identification of Cancer Cells: The clustering algorithms are widely used for the identification of cancerous cells.
It divides the cancerous and non-cancerous data sets into different groups.
o In Search Engines: Search engines also work on the clustering technique. The search result appears based on the
closest object to the search query. It does it by grouping similar data objects in one group that is far from the other
dissimilar objects. The accurate result of a query depends on the quality of the clustering algorithm used.
o Customer Segmentation: It is used in market research to segment the customers based on their choice and
preferences.
o In Biology: It is used in the biology stream to classify different species of plants and animals using the image
recognition technique.
o In Land Use: The clustering technique is used to identify areas of similar land use in a GIS database. This can be very useful for determining the purpose for which a particular area of land is most suitable.
K-means Algorithm
K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset into
different clusters. Here K defines the number of pre-defined clusters that need to be created in the process,
as if K=2, there will be two clusters, and for K=3, there will be three clusters, and so on.
It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group with similar properties.
It allows us to cluster the data into different groups and provides a convenient way to discover the categories of groups in the unlabeled dataset on its own, without the need for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.
The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters. The value of k should be predetermined in this algorithm.
The k-means algorithm mainly performs two tasks:
o Determines the best value for the K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. The data points which are near to a particular k-center create a cluster.
Hence each cluster has datapoints with some commonalities, and it is away from other clusters.
The below diagram explains the working of the K-means Clustering Algorithm. The algorithm works in the following steps:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids. (They can be points other than those of the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of each cluster.
Step-6: If any reassignment occurs, go to Step-4; otherwise, go to FINISH.
Step-7: The model is ready.
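A minimal hedged sketch of these steps using scikit-learn's KMeans (assuming scikit-learn; the two-variable data M1, M2 is hypothetical):

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical dataset with two variables M1 and M2
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.0], [8.5, 7.5], [9.0, 8.2]])

# K is predetermined; the assignment and centroid-update steps are iterated internally
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels)                   # cluster-ID assigned to each data point
print(kmeans.cluster_centers_)  # final centroids of the two clusters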
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is given below:
Let's take number k of clusters, i.e., K=2, to identify the dataset and to put them into different clusters. It
means here we will try to group these datasets into two different clusters.
We need to choose some random k points or centroids to form the clusters. These points can be either points from the dataset or any other points. Here we are selecting the below two points as k points, which are not part of our dataset. Consider the below image:
Now we will assign each data point of the scatter plot to its closest K-point or centroid. We will compute
it by applying some mathematics that we have studied to calculate the distance between two points. So, we
will draw a median between both the centroids. Consider the below image:
From the above image, it is clear that the points on the left side of the line are near the K1 or blue centroid, and the points to the right of the line are close to the yellow centroid. Let's color them blue and yellow for clear visualization.
As we need to find the closest cluster, we will repeat the process by choosing new centroids. To choose the new centroids, we will compute the center of gravity of the points in each cluster, and will find the new centroids as below:
Next, we will reassign each data point to the new centroids. For this, we will repeat the same process of finding a median line. The median will be like the below image:
From the above image, we can see that one yellow point is on the left side of the line, and two blue points are to the right of the line. So, these three points will be assigned to new centroids.
As reassignment has taken place, we will again go to Step-4, which is finding new centroids or K-points. We will repeat the process by finding the center of gravity of the clusters, so the new centroids will be as shown in the below image:
o As we have the new centroids, we will again draw the median line and reassign the data points. So, the image will be:
o We can see in the above image that there are no dissimilar data points on either side of the line, which means our model is formed. Consider the below image:
As our model is ready, so we can now remove the assumed centroids, and the two final clusters will be as
shown in the below image:
How to choose the value of "K number of clusters" in K-means Clustering?
The performance of the K-means clustering algorithm depends upon highly efficient clusters that it forms.
But choosing the optimal number of clusters is a big task. There are some different ways to find the optimal
number of clusters, but here we are discussing the most appropriate method to find the number of clusters or
value of K. The method is given below:
Elbow Method
The Elbow method is one of the most popular ways to find the optimal number of clusters. This method uses
the concept of WCSS value. WCSS stands for Within Cluster Sum of Squares, which defines the total
variations within a cluster. The formula to calculate the value of WCSS (for 3 clusters) is given below:
WCSS = ∑Pi in Cluster1 distance(Pi, C1)² + ∑Pi in Cluster2 distance(Pi, C2)² + ∑Pi in Cluster3 distance(Pi, C3)²
∑Pi in Cluster1 distance(Pi, C1)² is the sum of the squares of the distances between each data point and its centroid within cluster 1, and the same applies for the other two terms.
To measure the distance between data points and centroid, we can use any method such as Euclidean
distance or Manhattan distance.
To find the optimal value of clusters, the elbow method follows the below steps:
o It executes the K-means clustering on a given dataset for different K values (ranges from 1-10).
o For each value of K, calculates the WCSS value.
o Plots a curve between calculated WCSS values and the number of clusters K.
o The sharp point of bend, at which the plot looks like an arm, is considered the best value of K.
Since the graph shows the sharp bend, which looks like an elbow, hence it is known as the elbow method.
The graph for the elbow method looks like the below image:
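Since the referenced graph is not reproduced here, a hedged sketch that computes the WCSS (scikit-learn exposes it as inertia_) for K = 1 to 10 and plots the elbow curve (assuming scikit-learn and matplotlib; the blob data is synthetic):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with a few natural groups
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)    # within-cluster sum of squares for this K

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.show()   # the sharp bend (elbow) in this curve suggests the best K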
Dimensionality Reduction
The number of input features, variables, or columns present in a given dataset is known as dimensionality,
and the process to reduce these features is called dimensionality reduction.
In many cases, a dataset contains a huge number of input features, which makes the predictive modeling task more complicated. Because it is very difficult to visualize or make predictions for a training dataset with a high number of features, dimensionality reduction techniques are required in such cases.
Dimensionality reduction technique can be defined as, "It is a way of converting the higher dimensions
dataset into lesser dimensions dataset ensuring that it provides similar information." These techniques
are widely used in machine learning for obtaining a better fit predictive model while solving the
classification and regression problems.
It is commonly used in the fields that deal with high-dimensional data, such as speech recognition, signal
processing, bioinformatics, etc. It can also be used for data visualization, noise reduction, cluster
analysis, etc.
Benefits of applying Dimensionality Reduction
Some benefits of applying dimensionality reduction technique to the given dataset are given below:
o By reducing the dimensions of the features, the space required to store the dataset also gets reduced.
o Less computation/training time is required for reduced dimensions of features.
o Reduced dimensions of features of the dataset help in visualizing the data quickly.
o It removes the redundant features (if present) by taking care of multicollinearity.
There are also some disadvantages of applying dimensionality reduction, which are given below:
o Some information may be lost when the dimensions are reduced.
o In the PCA technique, the number of principal components to keep is sometimes not known in advance.
Feature Selection
Feature selection is the process of selecting the subset of the relevant features and leaving out the
irrelevant features present in a dataset to build a model of high accuracy. In other words, it is a way of
selecting the optimal features from the input dataset.
1. Filters Methods
In this method, the dataset is filtered, and a subset that contains only the relevant features is taken. Some
common techniques of filters method are:
o Correlation
o Chi-Square Test
o ANOVA
o Information Gain, etc.
2. Wrappers Methods
The wrapper method has the same goal as the filter method, but it uses a machine learning model for its evaluation. In this method, some features are fed to the ML model and its performance is evaluated. The performance decides whether to add or remove those features to increase the accuracy of the model. This method is more accurate than the filter method but more complex to work with. Some common techniques of wrapper methods are:
o Forward Selection
o Backward Selection
o Bi-directional Elimination
3. Embedded Methods: Embedded methods check the different training iterations of the machine
learning model and evaluate the importance of each feature. Some common techniques of Embedded
methods are:
o LASSO
o Elastic Net
o Ridge Regression, etc.
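As a hedged illustration of an embedded method, LASSO shrinks the coefficients of unimportant features toward zero, so the non-zero coefficients indicate the selected features (assuming scikit-learn; the regression data is synthetic):

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.datasets import make_regression

# Synthetic regression data in which only 3 of the 10 features are informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)

# Features whose coefficients were not shrunk to zero are kept
selected = np.flatnonzero(lasso.coef_)
print(selected)       # indices of the selected features
print(lasso.coef_)    # near-zero or zero coefficients mark the dropped features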
Feature Extraction:
Feature extraction is the process of transforming the space containing many dimensions into space with
fewer dimensions. This approach is useful when we want to keep the whole information but use fewer
resources while processing the information.
PCA and kernel
Principal Component Analysis is an unsupervised learning algorithm that is used for the dimensionality
reduction in machine learning. It is a statistical process that converts the observations of correlated
features into a set of linearly uncorrelated features with the help of orthogonal transformation. These
new transformed features are called the Principal Components. It is one of the popular tools that is
used for exploratory data analysis and predictive modeling. It is a technique to draw strong patterns from
the given dataset by reducing the variances.
PCA generally tries to find the lower-dimensional surface onto which to project the high-dimensional data.
PCA works by considering the variance of each attribute, because a high variance shows a good split between the classes, and thereby it reduces the dimensionality. Some real-world applications of PCA are image processing, movie recommendation systems, and optimizing power allocation in various communication channels. It is a feature extraction technique, so it retains the important variables and drops the least important ones.
o Dimensionality: It is the number of features or variables present in the given dataset. More easily, it is the
number of columns present in the dataset.
o Correlation: It signifies that how strongly two variables are related to each other. Such as if one changes, the
other variable also gets changed. The correlation value ranges from -1 to +1. Here, -1 occurs if variables are
inversely proportional to each other, and +1 indicates that variables are directly proportional to each other.
o Orthogonal: It defines that variables are not correlated to each other, and hence the correlation between the pair
of variables is zero.
o Eigenvectors: If there is a square matrix M and a non-zero vector v, then v is an eigenvector of M if Mv is a scalar multiple of v.
o Covariance Matrix: A matrix containing the covariance between the pair of variables is called the Covariance
Matrix.
Some properties of the principal components are:
o The principal component must be the linear combination of the original features.
o These components are orthogonal, i.e., the correlation between a pair of variables is zero.
o The importance of each component decreases when going from 1 to n: the 1st principal component has the most importance, and the nth has the least importance.
Steps for the PCA Algorithm
Firstly, we need to take the input dataset and divide it into two subparts X and Y, where X is the training set and Y is the validation set.
Now we will represent our dataset into a structure. Such as we will represent the two-dimensional matrix of
independent variable X. Here each row corresponds to the data items, and the column corresponds to the Features.
The number of columns is the dimensions of the dataset.
In this step, we will standardize our dataset. Such as in a particular column, the features with high variance are
more important compared to the features with lower variance. If the importance of features is independent of the
variance of the feature, then we will divide each data item in a column with the standard deviation of the column.
Here we will name the matrix as Z.
To calculate the covariance of Z, we will take the matrix Z, and will transpose it. After transpose, we will
multiply it by Z. The output matrix will be the Covariance matrix of Z.
Now we need to calculate the eigenvalues and eigenvectors for the resultant covariance matrix Z. Eigenvectors of the covariance matrix are the directions of the axes with high information, and the coefficients of these eigenvectors are defined as the eigenvalues.
In this step, we will take all the eigenvalues and sort them in decreasing order, i.e., from largest to smallest, and simultaneously sort the eigenvectors accordingly into a matrix P. The resultant sorted matrix of eigenvectors will be named P*.
Here we will calculate the new features. To do this, we will multiply the P* matrix to the Z. In the resultant matrix
Z*, each observation is the linear combination of original features. Each column of the Z* matrix is independent
of each other.
Now that the new feature set has been obtained, we decide what to keep and what to remove: we will only keep the relevant or important features in the new dataset, and the unimportant features will be removed.
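A hedged sketch of these steps with plain NumPy (the data matrix X is hypothetical, and k is the number of principal components kept):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # hypothetical dataset: 100 rows, 5 features

# Standardize each column to get Z, then form the covariance matrix of Z
Z = (X - X.mean(axis=0)) / X.std(axis=0)
cov = np.cov(Z, rowvar=False)

# Eigenvalues/eigenvectors of the covariance matrix, sorted by decreasing eigenvalue
eig_vals, eig_vecs = np.linalg.eigh(cov)
order = np.argsort(eig_vals)[::-1]
P_star = eig_vecs[:, order]

# Project the standardized data onto the sorted eigenvectors and keep the top k components
k = 2
Z_star = Z @ P_star[:, :k]
print(Z_star.shape)                    # (100, 2): the reduced feature set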