
Dimensionality Reduction

Dimensionality reduction refers to the process of reducing the number of attributes (features) in a dataset while keeping as much of the variation in the original dataset as possible. It is a data preprocessing step, meaning that we perform dimensionality reduction before training the model. In this article, we will discuss 11 such dimensionality reduction techniques and implement them with real-world datasets using Python and the Scikit-learn library.

The importance of dimensionality reduction

When we reduce the dimensionality of a dataset, we lose some percentage (usually 1%–15%, depending on the number of components or features that we keep) of the variability in the original data. But don't worry too much about that loss, because dimensionality reduction brings the following advantages.

 A lower number of dimensions in data means less training time, fewer computational resources and better overall performance of machine learning algorithms — Machine learning problems that involve many features make training extremely slow. Because there is so much room in high-dimensional space, most data points end up very close to the border of that space and far away from one another, so the algorithms cannot effectively and efficiently train on the high-dimensional data. In machine learning, that kind of problem is referred to as the curse of dimensionality — this is just a technical term that you do not need to worry about!

 Dimensionality reduction avoids the problem of overfitting — When there are many features in the data, the models become more complex and tend to overfit on the training data. To see this in action, read my “How to Mitigate Overfitting with Dimensionality Reduction” article.

 Dimensionality reduction is extremely useful for data visualization — When we reduce higher-dimensional data to two or three components, the data can easily be plotted on a 2D or 3D plot (a minimal sketch of this is shown after this list). To see this in action, read my “Principal Component Analysis (PCA) with Scikit-learn” article.

 Dimensionality reduction takes care of multicollinearity — In regression, multicollinearity occurs when an independent variable is highly correlated with one or more of the other independent variables. Dimensionality reduction takes advantage of this and combines those highly correlated variables into a set of uncorrelated variables, which addresses the problem of multicollinearity. To see this in action, read my “How do you apply PCA to Logistic Regression to remove Multicollinearity?” article.

 Dimensionality reduction is very useful for factor analysis — This is a useful approach for finding latent variables, which are not directly measured by a single variable but are instead inferred from other variables in the dataset. These latent variables are called factors. To see this in action, read my “Factor Analysis on Women Track Records Data with R and Python” article.
 Dimensionality reduction removes noise in the data — By keeping
only the most important features and removing the redundant features,
dimensionality reduction removes noise in the data. This will improve
the model accuracy.

 Dimensionality reduction can be used for image compression — Image compression is a technique that minimizes the size in bytes of an image while keeping as much of the quality of the image as possible. The pixels which make up the image can be considered as dimensions (columns/variables) of the image data. We perform PCA to keep an optimum number of components that balances the explained variability in the image data against the image quality. To see this in action, read my “Image Compression Using Principal Component Analysis (PCA)” article.

 Dimensionality reduction can be used to transform non-linear data into a linearly-separable form — Read the Kernel PCA section of this article to see this in action!
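To make the visualization point above concrete, here is a minimal sketch, assuming the 4-dimensional Iris dataset bundled with Scikit-learn (used purely as a stand-in example, not one of the article's own datasets): the data is reduced to two principal components and drawn on a 2D scatter plot.

```python
# A minimal sketch: reduce 4-dimensional data to 2 components for plotting.
# The Iris dataset is an example choice, not the article's dataset.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Standardize first: PCA is sensitive to the scale of the features.
X_scaled = StandardScaler().fit_transform(X)

# Keep only 2 components so the data can be drawn on a 2D plot.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print("Explained variance ratio:", pca.explained_variance_ratio_)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="viridis", s=20)
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.title("Data projected onto 2 principal components")
plt.show()
```

The printed explained variance ratio tells you how much of the original variability the two plotted components preserve.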

Principal Component Analysis (PCA)

PCA is one of my favorite machine learning algorithms. PCA is a linear dimensionality reduction technique (algorithm) that transforms a set of p correlated variables into a smaller number k (k < p) of uncorrelated variables called principal components, while retaining as much of the variation in the original dataset as possible. In the context of Machine Learning (ML), PCA is an unsupervised machine learning algorithm that is used for dimensionality reduction.
As this is one of my favorite algorithms, I have previously written several articles on PCA. If you're interested in learning more about the theory behind PCA and its Scikit-learn implementation, you may read the following articles of mine.

 Principal Component Analysis (PCA) with Scikit-learn
 Statistical and Mathematical Concepts behind PCA
 Principal Component Analysis for Breast Cancer Data with R and Python
 Image Compression Using Principal Component Analysis (PCA)