Dimensionality Reduction
Dimensionality reduction refers to the process of reducing the number of attributes in a dataset while keeping as much of the variation in the original dataset as possible. It is a data preprocessing step, meaning that we perform dimensionality reduction before training the model. In this article, we will discuss 11 such dimensionality reduction techniques and implement them on real-world datasets using Python and the Scikit-learn library.
The importance of dimensionality reduction
When we reduce the dimensionality of a dataset, we lose some percentage (usually 1%–15%, depending on the number of components or features that we keep) of the variability in the original data. But don’t worry about that loss, because dimensionality reduction brings the following advantages.
A lower number of dimensions in the data means less training time, fewer computational resources, and better overall performance of machine learning algorithms — Machine learning problems that involve many features make training extremely slow. Because there is so much room in a high-dimensional space, most data points lie close to the border of that space and far away from each other, so algorithms cannot train effectively and efficiently on high-dimensional data. In machine learning, that kind of problem is referred to as the curse of dimensionality — this is just a technical term that you do not need to worry about!
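For instance, here is a minimal sketch of the speed-up, using a synthetic dataset from make_classification and a plain LogisticRegression as stand-ins; both are my choices for illustration, not part of any particular workflow.

# A minimal sketch: train a classifier on a synthetic high-dimensional
# dataset, then again after PCA, and compare the training times.
import time
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=5000, n_features=500, random_state=0)

start = time.time()
LogisticRegression(max_iter=1000).fit(X, y)
print("Training time on 500 features:", round(time.time() - start, 2), "s")

X_reduced = PCA(n_components=50).fit_transform(X)  # 500 features -> 50 components

start = time.time()
LogisticRegression(max_iter=1000).fit(X_reduced, y)
print("Training time on 50 components:", round(time.time() - start, 2), "s")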
Dimensionality reduction avoids the problem of overfitting — When there are many features in the data, models become more complex and tend to overfit the training data. To see this in action, read my “How to Mitigate Overfitting with Dimensionality Reduction” article.
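As a rough illustration (not the exact experiment from that article), the following sketch fits a flexible decision tree on a small, wide synthetic dataset before and after PCA; the gap between training and test accuracy usually narrows once the redundant dimensions are removed.

# A rough sketch of overfitting mitigation with PCA on synthetic data.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=100, n_informative=10,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Flexible model on all 100 features: large train/test accuracy gap.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("All features  - train:", tree.score(X_train, y_train),
      "test:", tree.score(X_test, y_test))

# Same model on 10 principal components: usually a smaller gap.
pca = PCA(n_components=10).fit(X_train)
tree_pca = DecisionTreeClassifier(random_state=0).fit(pca.transform(X_train), y_train)
print("10 components - train:", tree_pca.score(pca.transform(X_train), y_train),
      "test:", tree_pca.score(pca.transform(X_test), y_test))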
Dimensionality reduction is extremely useful for data visualization — When we reduce the dimensionality of high-dimensional data to two or three components, the data can easily be plotted on a 2D or 3D plot. To see this in action, read my “Principal Component Analysis (PCA) with Scikit-learn” article.
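Here is a minimal sketch of that idea, using the Iris dataset (my choice for illustration) reduced to two principal components and plotted with matplotlib.

# A minimal sketch of 2-D visualization with PCA.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)            # 4 original features
X_2d = PCA(n_components=2).fit_transform(X)  # reduce to 2 components

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y)
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.show()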
Dimensionality reduction takes care of multicollinearity — In regression, multicollinearity occurs when an independent variable is highly correlated with one or more of the other independent variables. Dimensionality reduction takes advantage of this and combines those highly correlated variables into a set of uncorrelated variables, which addresses the problem of multicollinearity. To see this in action, read my “How do you apply PCA to Logistic Regression to remove Multicollinearity?” article.
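The following sketch illustrates this: after PCA, the correlation matrix of the components is (numerically) the identity matrix. The Breast Cancer dataset is used here only as a convenient example of data with highly correlated columns.

# A quick sketch showing that principal components are uncorrelated.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

components = PCA(n_components=5).fit_transform(X_scaled)
# Correlation matrix of the components is (numerically) the identity matrix.
print(np.round(np.corrcoef(components, rowvar=False), 3))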
Dimensionality reduction is very useful for factor analysis — This is a useful approach to find latent variables that are not directly measured in a single variable but are instead inferred from other variables in the dataset. These latent variables are called factors. To see this in action, read my “Factor Analysis on Women Track Records Data with R and Python” article.

Dimensionality reduction removes noise in the data — By keeping only the most important features and removing the redundant ones, dimensionality reduction removes noise in the data. This improves the model accuracy.
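As a quick illustration of the factor analysis use case, here is a minimal Scikit-learn sketch; the Iris data and the choice of two factors are arbitrary and purely for demonstration.

# A minimal factor analysis sketch with Scikit-learn.
from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

fa = FactorAnalysis(n_components=2, random_state=0)
scores = fa.fit_transform(X_scaled)  # factor scores for each sample
print(fa.components_)                # loadings of each feature on each factor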
Dimensionality reduction can be used for image compression — Image compression is a technique that minimizes the size in bytes of an image while keeping as much of the image quality as possible. The pixels that make up the image can be considered dimensions (columns/variables) of the image data. We perform PCA to keep an optimal number of components that balances the explained variability in the image data against the image quality. To see this in action, read my “Image Compression Using Principal Component Analysis (PCA)” article.
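Here is a minimal sketch of the idea on the 8x8 digit images that ship with Scikit-learn; the choice of 16 components is arbitrary and would normally be tuned against the explained variance.

# A minimal sketch of PCA-based image compression.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)       # 1797 images, 64 pixels each

pca = PCA(n_components=16).fit(X)
X_compressed = pca.transform(X)                    # 64 pixels -> 16 components
X_restored = pca.inverse_transform(X_compressed)   # approximate reconstruction

print("Explained variance kept:", pca.explained_variance_ratio_.sum())
print("Original shape:", X.shape, "-> compressed shape:", X_compressed.shape)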
Dimensionality reduction can be used to transform non-linear data into a linearly separable form — Read the Kernel PCA section of this article to see this in action!
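As a preview, here is a minimal Kernel PCA sketch on the classic concentric-circles toy data; the RBF kernel and the gamma value are assumptions chosen for this toy example.

# A minimal Kernel PCA sketch: circles are not linearly separable in the
# original 2-D space, but a linear classifier works after the kernel mapping.
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

print("Accuracy on original data:", LogisticRegression().fit(X, y).score(X, y))
print("Accuracy after Kernel PCA:", LogisticRegression().fit(X_kpca, y).score(X_kpca, y))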
Principal Component Analysis (PCA)
PCA is one of my favorite machine learning algorithms. PCA is a linear dimensionality reduction technique (algorithm) that transforms a set of p correlated variables into a smaller number k (k < p) of uncorrelated variables called principal components, while retaining as much of the variation in the original dataset as possible. In the context of machine learning (ML), PCA is an unsupervised machine learning algorithm used for dimensionality reduction. As this is one of my favorite algorithms, I have previously written several articles on PCA. If you’re interested in learning more about the theory behind PCA and its Scikit-learn implementation, you may read the following articles written by me.
Principal Component Analysis (PCA) with Scikit-learn
Statistical and Mathematical Concepts behind PCA
Principal Component Analysis for Breast Cancer Data with R and Python
Image Compression Using Principal Component Analysis (PCA)
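To make the above concrete, here is a minimal PCA sketch with Scikit-learn on the Breast Cancer dataset (30 correlated features); standardizing first and keeping two components are choices made for illustration only.

# A minimal PCA sketch: standardize, fit PCA, and inspect how much of the
# original variation each principal component retains.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scales

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)           # 30 features -> 2 components

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variation kept:", pca.explained_variance_ratio_.sum())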