
4. UNSUPERVISED LEARNING TECHNIQUES

CLUSTERING
Clustering (or cluster analysis) is a machine learning technique that groups an unlabelled dataset. It can be defined as "a way of grouping the data points into different clusters consisting of similar data points; the objects with possible similarities remain in a group that has little or no similarity with another group."
It is an unsupervised learning method, so no supervision is provided to the algorithm and it works with unlabeled data. The clustering technique is widely used in various tasks. Some of the most common uses are:
 Market Segmentation
 Statistical data analysis
 Social network analysis
 Image segmentation
 Anomaly detection, etc.
Apart from these general uses, Amazon uses clustering in its recommendation system to suggest products based on a user's past searches, and Netflix uses it to recommend movies and web series based on watch history.

Types of Clustering Methods:


The clustering methods are broadly divided into Hard clustering (each data point belongs to only one group) and Soft clustering (a data point can belong to more than one group). Below are the main clustering methods used in machine learning:
1. Partitioning Clustering: It is a type of clustering that divides the data into non-
hierarchical groups. It is also known as the centroid-based method. The most common
examples of partitioning clustering are the K-Means Clustering algorithm, CLARANS
(Clustering Large Applications based upon Randomized Search), etc.
2. Density-Based Clustering: These methods treat clusters as dense regions of the data space that are separated from regions of lower density. They have good accuracy and the ability to connect dense regions into a single cluster. Examples
are DBSCAN (Density-Based Spatial Clustering of Applications with Noise),
OPTICS (Ordering Points to Identify Clustering Structure), etc.

3. Distribution Model-Based Clustering: In this method, the data is divided based on the probability that a data point belongs to a particular distribution. The grouping is done by assuming certain distributions, most commonly the Gaussian distribution. The example of
this type is the Expectation-Maximization Clustering algorithm that uses Gaussian
Mixture Models (GMM).
4. Hierarchical Clustering: Hierarchical clustering can be used as an alternative to partitioning clustering, as there is no requirement to pre-specify the number of clusters to be created. In this technique, the dataset is divided into clusters to create a tree-like structure, which is also called a dendrogram. Examples are Agglomerative (bottom-up approach) & Divisive (top-down approach).
5. Fuzzy Clustering: Fuzzy clustering is a type of soft method in which a data object may belong to more than one group or cluster. Each data point has a set of membership coefficients, which depend on its degree of membership in each cluster. The Fuzzy C-means algorithm is an example of this type of clustering; it is sometimes also known as the Fuzzy k-means algorithm.

Applications of Clustering:
 Identification of Cancer Cells: The clustering algorithms are widely used for the
identification of cancerous cells. It divides the cancerous and non-cancerous data sets
into different groups.
 Search Engines: Search engines also work on the clustering technique. The search results appear based on the objects closest to the search query; similar data objects are grouped together, far from dissimilar objects. The accuracy of a query's results depends on the quality of the clustering algorithm used.
 Customer Segmentation: It is used in market research to segment the customers based
on their choice and preferences.
 Biology: It is used in the biology stream to classify different species of plants and
animals using the image recognition technique.
 Land Use: The clustering technique is used to identify areas of similar land use in a GIS database. This can be very useful for determining the purpose for which a particular plot of land is most suitable.

K-MEANS CLUSTERING ALGORITHM
K-Means Clustering is an unsupervised learning algorithm that groups an unlabeled dataset into different clusters. "It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group, and each group contains data points with similar properties."
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to minimize the sum of distances between each data point and its corresponding cluster centroid. The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until the best clusters are found. The value of k should be predetermined in this algorithm. The k-means clustering algorithm mainly performs two tasks:
1. Determines the best value for the K center points or centroids by an iterative process.
2. Assigns each data point to its closest k-center. The data points nearest to a particular k-center form a cluster.

Working of K-Means Algorithm:


1. Select the number K to decide the number of clusters.
2. Select K random points as centroids (they may be points other than those in the input dataset).
3. Assign each data point to its closest centroid, which will form the predefined K clusters.
4. Calculate the mean of each cluster and place its new centroid there.
5. Repeat the third step: reassign each data point to the new closest centroid of its cluster.
6. If any reassignment occurs, go to step 4; otherwise go to FINISH.
7. The model is ready (a minimal code sketch of this workflow follows).
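Below is a minimal sketch of this workflow using scikit-learn; the synthetic make_blobs data and the parameter values (4 clusters, fixed random_state) are illustrative assumptions, not part of the original example.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy dataset: 300 points drawn around 4 blob centers.
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)  # step 1: choose K
labels = kmeans.fit_predict(X)     # steps 2-6: iterate until assignments stop changing
print(kmeans.cluster_centers_)     # final centroids
print(kmeans.inertia_)             # sum of squared distances to the nearest centroid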

Example of K-Means Algorithm:

Data Points X Y
A1 2 10
A2 2 5
A3 8 4
B1 5 8
B2 7 5
B3 6 4
C1 1 2
C2 4 9

In the given example, there are 8 data points. Consider the initial centroids as A1, B1 & C1, so the value of K = 3. If initial centroids are given, we must use them; if not, we can select any of the data points as centroids. The distance between a point (x1, y1) and a centroid (x2, y2) is the Euclidean distance:

d = √((x2 - x1)² + (y2 - y1)²)

Initial centroids are A1: (2, 10), B1: (5, 8), C1: (1, 2). First we calculate the distance from every data point to every centroid using the above formula. Once the distances have been calculated, we assign each data point to the cluster whose centroid is at the smallest distance.

After the 1st iteration, A1 is assigned to the 1st cluster; A3, B1, B2, B3 and C2 are assigned to the 2nd cluster; and A2 and C1 are assigned to the 3rd cluster. Now we need to calculate the new centroids. The new centroids are (2, 10), (6, 6) and (1.5, 3.5) for the 1st, 2nd and 3rd clusters respectively. After recomputing the distances with the new centroids:

C2, which was previously assigned to the 2nd cluster, is now assigned to the 1st cluster. A data point has moved from one cluster to another, so we need to calculate the new centroids again. The new centroids are (3, 9.5), (6.5, 5.25) and (1.5, 3.5). The new assignment becomes the current assignment. After recomputing the distances with these centroids:

B1, which was previously assigned to the 2nd cluster, is now assigned to the 1st cluster, so we need to calculate the new centroids again. The new centroids are (3.67, 9), (7, 4.33) and (1.5, 3.5). After recomputing the distances with these centroids:

Finally, the new assignment is exactly the same as the current assignment, so the algorithm stops. A1, B1 and C2 belong to the 1st cluster; A3, B2 and B3 belong to the 2nd cluster; and A2 and C1 belong to the 3rd cluster.
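The following sketch reproduces this worked example with scikit-learn by seeding K-Means with the stated initial centroids A1, B1 and C1 (passing an explicit init array with n_init=1 is an assumption made to keep the run deterministic).

import numpy as np
from sklearn.cluster import KMeans

points = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
                   [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)   # A1, A2, A3, B1, B2, B3, C1, C2
init_centroids = np.array([[2, 10], [5, 8], [1, 2]], dtype=float)  # A1, B1, C1

km = KMeans(n_clusters=3, init=init_centroids, n_init=1)
km.fit(points)
print(km.labels_)           # grouping: {A1, B1, C2}, {A3, B2, B3}, {A2, C1}
print(km.cluster_centers_)  # final centroids: (3.67, 9), (7, 4.33), (1.5, 3.5)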

LIMITATIONS OF K-MEANS
 Choosing the right number of clusters: K-means requires the user to specify the
number of clusters to be generated, which can be difficult to determine. A poor choice
of K can lead to suboptimal clustering results.
 Outliers can skew results: K-means assumes that all data points are equally important
in the clustering process, which can lead to outliers skewing the results. Outliers may
be assigned to a cluster, leading to inaccurate cluster assignments for the other data
points.
 Sensitive to outliers: K-means is sensitive to outliers or noisy data, which can distort the resulting clusters.
 Assumes spherical clusters: K-means assumes that the clusters are spherical and have
equal variances, which is not always true in real-world scenarios.
 Limited applicability to non-numerical data: K-means is designed for numerical data
and does not handle categorical or textual data well.
 Lack of robustness: K-means can be sensitive to the distribution of the data, and the
resulting clusters may not be robust to small changes in the input data.

 Difficulty in handling high-dimensional data: K-means is less effective in high-
dimensional data, where the "curse of dimensionality" makes it harder to identify
meaningful clusters.

USING CLUSTERING FOR IMAGE SEGMENTATION


Clustering can be used to perform pixel-wise image segmentation: we try to cluster together the pixels that belong together. There are two approaches for performing segmentation by clustering:
 Clustering by merging (agglomerative)
 Clustering by division (divisive)

Clustering by merging or Agglomerative Clustering:


This is a bottom-up approach: each pixel starts as its own cluster, and the closest clusters are repeatedly merged. The algorithm for agglomerative clustering is as follows:
 Take each point as a separate cluster.
 For a given number of epochs, or until the clustering is satisfactory:
 Merge the two clusters with the smallest inter-cluster distance.
 Repeat the above step.

Agglomerative clustering is represented by a dendrogram. The merge step can be performed in three ways: by selecting the closest pair for merging, by selecting the farthest pair for merging, or by selecting the pair which is at an average distance (neither closest nor farthest).

Clustering by division or Divisive splitting:


This is a top-down approach: we start with all pixels in a single cluster and repeatedly split it. The algorithm for divisive clustering is as follows:
 Construct a single cluster containing all points.
 For a given number of epochs or until clustering is satisfactory.
 Split the cluster into two clusters with the largest inter-cluster distance.
 Repeat the above steps.

K-Means Clustering:
K-means clustering is a very popular clustering algorithm applied when we have a dataset whose labels are unknown. The goal is to find groups in the data based on some kind of similarity, with the number of groups represented by K. This algorithm is generally used in areas like market segmentation, customer segmentation, etc., but it can also be used to segment different objects in an image on the basis of pixel values.

Algorithm for Image Segmentation:


1. First, select the value of K for K-means clustering.
2. Select a feature vector for every pixel (colour values such as RGB, texture, etc.).
3. Define a similarity measure between feature vectors, such as the Euclidean distance, to measure the similarity between any two pixels.
4. Apply the K-means algorithm to cluster the pixel feature vectors.
5. Apply the connected-components algorithm.
6. Combine any component of size less than a threshold with an adjacent component that is similar to it, until no more components can be combined.
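A minimal sketch of steps 1-4 on RGB pixel values is shown below; the image file name and the choice of K = 4 are illustrative assumptions, and the connected-components post-processing (steps 5-6) is omitted.

import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

img = np.asarray(Image.open("photo.jpg"), dtype=float) / 255.0  # H x W x 3 array
pixels = img.reshape(-1, 3)               # one RGB feature vector per pixel

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(pixels)
segmented = kmeans.cluster_centers_[kmeans.labels_].reshape(img.shape)
# 'segmented' holds the image with every pixel replaced by its cluster's mean colour.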

USING CLUSTERING FOR PREPROCESSING


Clustering is a useful technique for preprocessing data in machine learning. It can
help in identifying patterns and structures within the data that may be useful for subsequent
modeling steps.
One common use case for clustering in preprocessing is to identify groups of similar
instances in the data. This can be helpful in situations where there are a large number of
instances and it is difficult to manually label each one for classification. By clustering the
instances, we can group similar instances together and then assign a label to each cluster
based on the majority class of the instances within the cluster. This can save time and effort
in labeling the data and can also improve the quality of the labels by leveraging the
similarities between instances.
Another use case for clustering in preprocessing is feature selection or feature
engineering. Clustering can be used to identify groups of highly correlated features, which
can then be combined or reduced to a smaller set of features that better capture the underlying
structure of the data. This can help to reduce the dimensionality of the data and improve the
performance of subsequent modeling steps.

Finally, clustering can be a useful preprocessing technique in machine learning for identifying patterns and structures in the data, grouping similar instances together, and selecting or engineering features.
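As an illustration, the sketch below uses K-Means as a preprocessing transformer inside a scikit-learn pipeline: each instance is replaced by its distances to the cluster centroids before a classifier is trained. The digits dataset and the choice of 10 clusters are illustrative assumptions.

from sklearn.datasets import load_digits
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Inside the pipeline, KMeans.transform() maps each instance to its distances
# to the 10 centroids; the classifier then works on these derived features.
pipeline = make_pipeline(KMeans(n_clusters=10, n_init=10, random_state=42),
                         LogisticRegression(max_iter=5000))
pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))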

USING CLUSTERING FOR SEMI-SUPERVISED LEARNING


Clustering can be a useful technique for semi-supervised learning, which is a type of
machine learning where some of the data is labeled and some of it is not. In semi-supervised
learning, the goal is to use the labeled data to help guide the learning process for the
unlabeled data.
One way to use clustering for semi-supervised learning is to first cluster the unlabeled
data using an unsupervised clustering algorithm such as k-means or hierarchical clustering or
spectral clustering. Once the data has been clustered, the labels of the clustered data can be
propagated to the individual data points within each cluster. This means that each data point
within a cluster will be assigned the same label as the centroid of the cluster.
Once the labels have been propagated, the labeled and unlabeled data can be used
together to train a supervised learning model. The labeled data provides the model with
examples of what the correct output should be, while the unlabeled data helps the model learn
the underlying patterns and structure of the data.
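A hedged sketch of this idea is shown below: only the instance closest to each centroid is assumed to be labeled, its label is propagated to every point in its cluster, and a classifier is trained on the propagated labels. The digits dataset, 50 clusters and logistic regression are illustrative assumptions.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)

k = 50
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
dist = kmeans.fit_transform(X)                  # distance of every instance to each centroid
representative_idx = np.argmin(dist, axis=0)    # instance closest to each centroid
y_representative = y[representative_idx]        # pretend only these k instances are labeled

y_propagated = y_representative[kmeans.labels_] # every point inherits its cluster's label
clf = LogisticRegression(max_iter=5000).fit(X, y_propagated)
print(clf.score(X, y))                          # accuracy against the true labels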
One important consideration when using clustering for semi-supervised learning is the
choice of clustering algorithm and the number of clusters to use. The number of clusters can
have a significant impact on the quality of the labels that are propagated to the individual data
points, and choosing the optimal number of clusters can be challenging. There are several
methods for semi-supervised clustering that can be divided into two classes which are as
follows:
 Constraint-based semi-supervised clustering: User-provided labels or constraints are used to guide the algorithm toward a more appropriate partitioning of the data. This includes modifying the objective function to respect the constraints, or initializing and constraining the clustering process based on the labeled objects.
 Distance-based semi-supervised clustering: An adaptive distance measure is trained to satisfy the labels or constraints in the supervised data. Several adaptive distance measures have been used, including a string-edit distance trained using Expectation-Maximization (EM) and a Euclidean distance modified by a shortest-path algorithm.

Overall, clustering can be a useful technique for semi-supervised learning, but it should be used in conjunction with other methods and with careful consideration given to the choice of clustering algorithm and the number of clusters.

DBSCAN (Density-based spatial clustering of applications with noise)


Partitioning and hierarchical methods are designed to find spherical-shaped clusters. They have difficulty finding clusters of arbitrary shape, such as "S"-shaped or oval clusters. Density-based clustering methods can be used to find clusters of arbitrary (non-spherical) shape.

DBSCAN is one of the most popular unsupervised learning algorithms. The DBSCAN algorithm is based on the intuitive notion of "clusters" and "noise". The key idea is that, for each point of a cluster, the neighbourhood of a given radius has to contain at least a minimum number of points.

DBSCAN Algorithm requires two parameters:


 eps: It defines the neighbourhood around a data point, i.e., if the distance between two points is lower than or equal to 'eps', they are considered neighbours. If the eps value is chosen too small, a large part of the data will be considered outliers.
 MinPts: The minimum number of neighbours (data points) within the eps radius. The larger the dataset, the larger the value of MinPts that should be chosen.

DBSCAN Algorithm has 3 types of data points:


 Core Point: A point is a core point if it has at least MinPts points within eps.
 Border Point: A point that has fewer than MinPts points within eps but lies in the neighbourhood of a core point.
 Noise or outlier: A point that is neither a core point nor a border point.

Other Definitions:
 A point X is directly density-reachable from a point Y w.r.t. eps and MinPts if:
 X belongs to the neighbourhood of Y, i.e., dist(X, Y) <= eps
 Y is a core point
 A point X is density-reachable from Y if there is a chain of points such that X is directly density-reachable from P2, P2 from P3, and P3 from Y. The inverse of this is not necessarily valid.
 A point X is density-connected to a point Y w.r.t. eps and MinPts if there exists a point O such that both X and Y are density-reachable from O w.r.t. eps and MinPts.

DBSCAN Algorithm:
1. Find all the neighbouring points within eps of every point and identify the core points, i.e., the points with at least MinPts neighbours.
2. For each core point, if it is not already assigned to a cluster, create a new cluster.
3. Recursively find all of its density-connected points and assign them to the same cluster as the core point.
Points a and b are said to be density-connected if there exists a point c that has a sufficient number of points in its neighbourhood and both a and b are within eps distance of it. This is a chaining process: if b is a neighbour of c, c is a neighbour of d, and d is a neighbour of e, which in turn is a neighbour of a, then b is density-connected to a.
4. Iterate through the remaining unvisited points in the dataset. The points that do not belong to any cluster are noise.
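A minimal sketch of DBSCAN with scikit-learn is given below; the make_moons dataset and the values eps=0.2, min_samples=5 are illustrative assumptions chosen to show non-spherical clusters.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                     # cluster index for every point; -1 means noise
print("clusters:", len(set(labels) - {-1}))
print("noise points:", int(np.sum(labels == -1)))
print("core points:", len(db.core_sample_indices_))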

Example:

Point X Y
S1 5 7
S2 8 4
S3 3 3
S4 4 4
S5 3 7
S6 6 7
S7 6 1
S8 5 5

Form the clusters with ε = 3.5 and MinPts = 3

Calculate the Euclidean distance between every pair of points, then identify the neighbours of each point within the boundary of radius ε = 3.5:

S1 : S4, S5, S6, S8
S2 : S8
S3 : S4, S8
S4 : S1, S3, S5, S8
S5 : S1, S4, S6, S8
S6 : S1, S5, S8
S7 : none
S8 : S1, S2, S3, S4, S5, S6

S1, S3, S4, S5, S6 & S8 become the core points.

Identify the core points and noise points. For each noise point, check the direct density-reachability condition: if it lies within ε of a core point, convert it from noise to a border point.

Point Core/Noise
S1 Core
S2 Noise → Border (within ε of the core point S8)
S3 Core
S4 Core
S5 Core
S6 Core
S7 Noise
S8 Core

The core points are all density-connected through S8, so S1, S2, S3, S4, S5, S6 and S8 form a single cluster, while S7 remains an outlier.
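The sketch below checks this result with scikit-learn's DBSCAN; note that its min_samples parameter counts the point itself, so min_samples=3 matches the labelling used in this example.

import numpy as np
from sklearn.cluster import DBSCAN

S = np.array([[5, 7], [8, 4], [3, 3], [4, 4],
              [3, 7], [6, 7], [6, 1], [5, 5]], dtype=float)  # S1 .. S8

db = DBSCAN(eps=3.5, min_samples=3).fit(S)
print(db.labels_)               # S7 is labelled -1 (noise); all other points share one cluster
print(db.core_sample_indices_)  # indices of S1, S3, S4, S5, S6, S8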

GAUSSIAN MIXTURES
A Gaussian mixture model (GMM) is a probabilistic model that assumes that the
instances were generated from a mixture of several Gaussian distributions whose parameters
are unknown. All the instances generated from a single Gaussian distribution form a cluster
that typically looks like an ellipsoid. Each cluster can have a different ellipsoidal shape, size,
density and orientation.
This generative process can be represented as a graphical model. This is a graph
which represents the structure of the conditional dependencies between random variables.

Gaussian Mixture Model

 The circles represent random variables.
 The squares represent fixed values.
 The large rectangles are called plates: they indicate that their content is repeated several times.
 The number indicated at the bottom right-hand side of each plate indicates how many times its content is repeated.
 Each variable z(i) is drawn from the categorical distribution with weights ϕ. Each variable x(i) is drawn from the normal distribution with the mean and covariance matrix defined by its cluster z(i).
 The solid arrows represent conditional dependencies.
 The squiggly arrow from z(i) to x(i) represents a switch: depending on the value of z(i), the instance x(i) will be sampled from a different Gaussian distribution.
 Shaded nodes indicate that the value is known, so in this case only the random variables x(i) have known values: they are called observed variables. The unknown random variables z(i) are called latent variables.

Scikit-Learn's GaussianMixture class can be used as follows (X is the training set):

from sklearn.mixture import GaussianMixture
gm = GaussianMixture(n_components=3, n_init=10)
gm.fit(X)
>>> gm.weights_       # mixing weights ϕ of the three clusters
>>> gm.means_         # estimated cluster means
>>> gm.covariances_   # estimated covariance matrices
We can check whether or not the algorithm converged and how many iterations it took:
>>> gm.converged_
True
>>> gm.n_iter_
3
Now that you have an estimate of the location, size, shape, orientation and relative
weight of each cluster, the model can easily assign each instance to the most likely cluster
(hard clustering) or estimate the probability that it belongs to a particular cluster (soft
clustering). For this, just use the predict() method for hard clustering, or the predict_proba()
method for soft clustering:

>>> gm.predict(X)
>>> gm.predict_proba(X)
It is a generative model, meaning we can actually sample new instances from it
>>> X_new, y_new = gm.sample(6)
>>> X_new
>>> y_new
It is also possible to estimate the density of the model at any given location. This is
achieved using the score_samples() method: for each instance it is given, this method
estimates the log of the probability density function (PDF) at that location. The greater the
score, the higher the density
>>> gm.score_samples(X)
If we compute the exponential of these scores, we get the value of the PDF at the
location of the given instances. These are not probabilities, but probability densities: they can
take on any positive value, not just between 0 and 1. To estimate the probability that an
instance will fall within a particular region, we would have to integrate the PDF over that
region.
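The snippets above assume an existing dataset X; the following self-contained sketch generates synthetic blob data (an illustrative assumption) so the whole section can be run end to end.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=1000, centers=3, cluster_std=[1.0, 2.0, 0.5],
                  random_state=42)

gm = GaussianMixture(n_components=3, n_init=10, random_state=42).fit(X)
print(gm.weights_)                # relative weight of each cluster
print(gm.predict(X)[:10])         # hard cluster assignments
print(gm.predict_proba(X)[:3])    # soft assignments (per-cluster probabilities)
print(gm.score_samples(X)[:3])    # log of the estimated PDF at each instance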

Bayesian Gaussian Mixture Models:


Rather than manually searching for the optimal number of clusters, it is possible to use the BayesianGaussianMixture class instead, which is capable of giving weights equal (or close) to zero to unnecessary clusters. Just set the number of clusters n_components to a value that we have good reason to believe is greater than the optimal number of clusters, and the algorithm will eliminate the unnecessary clusters automatically:
>>> import numpy as np
>>> from sklearn.mixture import BayesianGaussianMixture
>>> bgm = BayesianGaussianMixture(n_components=10, n_init=10, random_state=42)
>>> bgm.fit(X)
>>> np.round(bgm.weights_, 2)

DIMENSIONALITY REDUCTION

The number of input features, variables, or columns present in a given dataset is known as its dimensionality, and the process of reducing these features is called dimensionality reduction.
In many cases a dataset contains a huge number of input features, which makes the predictive modeling task more complicated. Because it is very difficult to visualize or make predictions for a training dataset with a high number of features, dimensionality reduction techniques are required in such cases.

THE CURSE OF DIMENSIONALITY


Handling high-dimensional data is very difficult in practice, a problem commonly known as the curse of dimensionality. As the dimensionality of the input dataset increases, any machine learning algorithm and model becomes more complex. As the number of features increases, the number of samples required also increases proportionally, and the chance of overfitting increases. If a machine learning model is trained on high-dimensional data, it becomes overfitted and results in poor performance.
Hence, it is often required to reduce the number of features, which can be done with
dimensionality reduction.

Benefits of applying Dimensionality Reduction:


 By reducing the dimensions of the features, the space required to store the dataset also
gets reduced
 Less computation and training time are required with reduced feature dimensions
 Reduced dimensions of features of the dataset help in visualizing the data quickly
 It removes the redundant features (if present) by taking care of multicollinearity

Disadvantages of dimensionality Reduction:
 Some data may be lost due to dimensionality reduction
 In the PCA dimensionality reduction technique, the number of principal components to retain is sometimes unknown

MAIN APPROACHES OF DIMENSION REDUCTION


There are two ways to apply the dimension reduction technique, which are given below:
1. Feature Selection: Feature selection is the process of selecting the subset of the
relevant features and leaving out the irrelevant features present in a dataset to build a
model of high accuracy. In other words, it is a way of selecting the optimal features
from the input dataset. Three methods are used for the feature selection:
 Filters Methods: In this method, the dataset is filtered, and a subset that
contains only the relevant features is taken. Some common techniques of
filters method are:
 Correlation
 Chi-Square Test
 ANOVA
 Information Gain, etc.
 Wrappers Methods: The wrapper method has the same goal as the filter method, but it uses a machine learning model for its evaluation. In this method, some features are fed to the ML model and its performance is evaluated. The performance decides whether to add or remove those features to increase the accuracy of the model. This method is more accurate than the filter method but more complex to apply (a short sketch of filter and wrapper selection follows this list). Some common techniques of wrapper methods are:
 Forward Selection
 Backward Selection
 Bi-directional Elimination
 Embedded Methods: Embedded methods check the different training iterations
of the machine learning model and evaluate the importance of each feature.
Some common techniques of Embedded methods are:
 LASSO
 Elastic Net

 Ridge Regression, etc.
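The sketch below illustrates a filter method (ANOVA-based SelectKBest) and a wrapper method (recursive feature elimination around a model); the breast-cancer dataset and the choice of keeping 5 features are illustrative assumptions.

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Filter method: score features with an ANOVA F-test and keep the best 5.
X_filtered = SelectKBest(score_func=f_classif, k=5).fit_transform(X, y)

# Wrapper method: recursively eliminate features based on a model's coefficients.
rfe = RFE(estimator=LogisticRegression(max_iter=5000), n_features_to_select=5)
X_wrapped = rfe.fit_transform(X, y)

print(X_filtered.shape, X_wrapped.shape)   # both reduced to 5 columns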

2. Feature Extraction: Feature extraction is the process of transforming the space


containing many dimensions into space with fewer dimensions. This approach is
useful when we want to keep the whole information but use fewer resources while
processing the information. Some common feature extraction techniques are:
 Principal Component Analysis
 Linear Discriminant Analysis
 Kernel PCA
 Quadratic Discriminant Analysis

COMMON TECHNIQUES OF DIMENSIONALITY REDUCTION


1. Principal Component Analysis: Principal Component Analysis is a statistical process
that converts the observations of correlated features into a set of linearly uncorrelated
features with the help of orthogonal transformation.
2. Backward Elimination: The backward feature elimination technique is mainly used
while developing Linear Regression or Logistic Regression model.
3. Forward Feature Selection: Forward feature selection follows the inverse process of
the backward elimination process. It means, in this technique, we don't eliminate the
feature; instead, we will find the best features that can produce the highest increase in
the performance of the model.
4. Missing Value Ratio: If a dataset has too many missing values, then we drop those
variables as they do not carry much useful information. To perform this, we can set a
threshold level, and if a variable has missing values more than that threshold, we will
drop that variable.
5. Low Variance Filter: Similar to the missing value ratio technique, data columns with very little variation in their values carry little information. Therefore, we calculate the variance of each variable, and all data columns with variance lower than a given threshold are dropped, because low-variance features will not affect the target variable (a short sketch of these two filters follows this list).
6. High Correlation Filter: High correlation refers to the case when two variables carry approximately the same information, which can degrade the performance of the model. The correlation between independent numerical variables is measured by the correlation coefficient; if this value is higher than a threshold, we can remove one of the variables from the dataset, keeping the variable that shows the higher correlation with the target variable.
7. Random Forest: Random Forest is a popular and very useful feature selection
algorithm in machine learning. In this technique, we need to generate a large set of
trees against the target variable, and with the help of usage statistics of each attribute,
we need to find the subset of features.
8. Factor Analysis: Factor analysis is a technique in which each variable is placed in a group according to its correlation with other variables; variables within a group can have a high correlation with one another but a low correlation with variables of other groups.
9. Auto-Encoder: One of the popular methods of dimensionality reduction is the auto-encoder. In this, the input is compressed into a latent-space representation, and the output is reconstructed from this representation. It has two main parts:
 Encoder: The function of the encoder is to compress the input to form the
latent-space representation.
 Decoder: The function of the decoder is to recreate the output from the latent-
space representation.
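The following sketch illustrates two of the simple techniques above, the missing value ratio and the low variance filter; the toy DataFrame and both thresholds are illustrative assumptions.

import pandas as pd
from sklearn.feature_selection import VarianceThreshold

df = pd.DataFrame({"a": [1.0, 2.0, None, 4.0, None],   # 40% missing values
                   "b": [5.0, 5.0, 5.0, 5.0, 5.1],     # almost constant
                   "c": [1.0, 7.0, 3.0, 9.0, 2.0]})

# Missing value ratio: drop columns with more than 30% missing values.
missing_ratio = df.isna().mean()
df = df.loc[:, missing_ratio <= 0.30]

# Low variance filter: drop remaining columns whose variance is below the threshold.
selector = VarianceThreshold(threshold=0.05)
reduced = selector.fit_transform(df)
print(reduced.shape)   # only column 'c' survives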

PRINCIPAL COMPONENT ANALYSIS (PCA)


Principal Component Analysis is an unsupervised learning algorithm that is used for
the dimensionality reduction in machine learning. It is a statistical process that converts the
observations of correlated features into a set of linearly uncorrelated features with the help of
orthogonal transformation. These new transformed features are called the Principal
Components. It is one of the popular tools that is used for exploratory data analysis and
predictive modeling. It is a technique for extracting strong patterns from a given dataset by reducing the number of dimensions.
PCA generally tries to find the lower-dimensional surface to project the high-
dimensional data.
PCA works by considering the variance of each attribute, because a high variance indicates a good separation between classes, and it reduces the dimensionality accordingly. Some real-world applications of PCA are image processing, movie recommendation systems and optimizing power allocation in communication channels. It is a feature extraction technique, so it retains the important variables and drops the least important ones.

PCA Algorithm:
Step 1: Take the dataset with n features and N samples.
Step 2: Calculate the mean of every feature.
Step 3: Calculate the covariance matrix of the features.
Step 4: Calculate the eigenvalues, eigenvectors and normalized eigenvectors of the covariance matrix.
Step 5: Derive the new dataset by projecting the mean-centred data onto the normalized eigenvector(s) corresponding to the largest eigenvalue(s); each projection is a principal component.
Step 6: Form the new dataset from the selected principal components.
These steps are illustrated in the worked example below.
Example:
Given the data in the table below, reduce the dimension from 2 to 1 using the Principal Component Analysis (PCA) algorithm.

Feature Example 1 Example 2 Example 3 Example 4
X1 4 8 13 7
X2 11 4 5 14

Step 1:
Number of features, n = 2
Number of samples, N = 4
Scatter plot of the given data points.
Step 2: Calculate the mean of X1 and X2
Mean of X1 = (4 + 8 + 13 + 7) / 4 = 8
Mean of X2 = (11 + 4 + 5 + 14) / 4 = 8.5

Step 3: Calculation of the Covariance Matrix
The covariances are calculated as follows:
cov(X1, X1) = Σ(X1 - mean(X1))² / (N - 1) = 42 / 3 = 14
cov(X2, X2) = Σ(X2 - mean(X2))² / (N - 1) = 69 / 3 = 23
cov(X1, X2) = cov(X2, X1) = Σ(X1 - mean(X1))(X2 - mean(X2)) / (N - 1) = -33 / 3 = -11
The covariance matrix is
S = [ 14  -11 ]
    [ -11  23 ]

Step 4: Calculate the Eigen Values, Eigen Vectors & Normalized Eigen Vector of the Covariance Matrix
Solving det(S - λI) = 0 gives λ² - 37λ + 201 = 0, so λ1 ≈ 30.3849 and λ2 ≈ 6.6151.
For the larger eigenvalue λ1, (S - λ1 I)e = 0 gives an eigenvector proportional to (11, -16.3849); normalizing it gives e1 ≈ (0.5574, -0.8303).
Step 5: Deriving the New Dataset
The first principal component of each example is obtained by projecting the mean-centred example onto e1, i.e. P1j = e1 · (Xj - mean):

First PCA (PC1) Example 1 Example 2 Example 3 Example 4
P1j -4.3052 3.7361 5.6928 -5.1238

Step 6: The new dataset is

Feature Example 1 Example 2 Example 3 Example 4
X1 4 8 13 7
X2 11 4 5 14
First PCA (PC1) -4.3052 3.7361 5.6928 -5.1238

Step 7: Scatter plot of the given data points.
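The worked example can be checked numerically with the short NumPy sketch below (eigen-decomposition of the covariance matrix followed by projection onto the first principal component).

import numpy as np

X = np.array([[4, 11], [8, 4], [13, 5], [7, 14]], dtype=float)  # rows = examples

X_centred = X - X.mean(axis=0)           # means are (8, 8.5)
S = np.cov(X_centred, rowvar=False)      # covariance matrix [[14, -11], [-11, 23]]

eigvals, eigvecs = np.linalg.eigh(S)     # eigenvalues in ascending order
e1 = eigvecs[:, np.argmax(eigvals)]      # eigenvector of the largest eigenvalue
pc1 = X_centred @ e1                     # first principal component scores
print(np.round(pc1, 4))                  # ~ [-4.3052, 3.7361, 5.6928, -5.1238] (up to sign)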

USING SCIKIT-LEARN
Scikit-Learn is a free software machine learning library for the Python programming
language. It features various classification, regression and clustering algorithms including
support-vector machines, random forests, gradient boosting, k-means and DBSCAN, and is
designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
Scikit-Learn’s PCA class implements PCA using SVD decomposition just like we did
before. The following code applies PCA to reduce the dimensionality of the dataset down to
two dimensions:
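A minimal sketch of that code (X is assumed to be a NumPy array holding the training data):

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X2D = pca.fit_transform(X)              # project the data onto the first two PCs
print(pca.explained_variance_ratio_)    # proportion of variance carried by each PC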

RANDOMIZED PCA
If we set the svd_solver hyperparameter to "randomized", Scikit-Learn uses a stochastic algorithm called Randomized PCA that quickly finds an approximation of the first d principal components. Its computational complexity is O(m × d²) + O(d³), instead of O(m × n²) + O(n³) for the full SVD approach, so it is dramatically faster than full SVD when d is much smaller than n:
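A hedged sketch of that setting (the target dimensionality of 2 and the training data X are illustrative assumptions):

from sklearn.decomposition import PCA

rnd_pca = PCA(n_components=2, svd_solver="randomized", random_state=42)
X_reduced = rnd_pca.fit_transform(X)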

By default, svd_solver is actually set to "auto": Scikit-Learn automatically uses the


randomized PCA algorithm if m or n is greater than 500 and d is less than 80% of m or n, or
else it uses the full SVD approach. If we want to force Scikit-Learn to use full SVD, we can
set the svd_solver hyperparameter to "full"

KERNEL PCA
The kernel trick is a mathematical technique that implicitly maps instances into a very high-dimensional space (called the feature space), enabling nonlinear classification and regression with Support Vector Machines.
A linear decision boundary in the high-dimensional feature space corresponds to a
complex nonlinear decision boundary in the original space. It turns out that the same trick can
be applied to PCA, making it possible to perform complex nonlinear projections for
dimensionality reduction. This is called Kernel PCA. It is often good at preserving clusters of
instances after projection, or sometimes even unrolling datasets that lie close to a twisted
manifold.
For example, the following code uses Scikit-Learn's KernelPCA class to perform kPCA with an RBF kernel:
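A sketch of that usage (the gamma value and the training data X are illustrative assumptions):

from sklearn.decomposition import KernelPCA

rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04)
X_reduced = rbf_pca.fit_transform(X)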
