ML Unit 2 Part -2
Dimensionality reduction can be defined as "a way of converting a higher-dimensional dataset into a
lower-dimensional dataset while ensuring that it provides similar information."
Handling high-dimensional data is very difficult in practice, a problem commonly known as the curse of
dimensionality. As the dimensionality of the input dataset increases, machine learning algorithms and
models become more complex. As the number of features increases, the number of samples needed to
cover the feature space also increases, and so does the chance of overfitting. A machine learning model
trained on high-dimensional data therefore tends to become overfitted and perform poorly.
Hence, it is often required to reduce the number of features, which can be done with
dimensionality reduction.
There are two ways to apply dimensionality reduction, which are given below:
Feature Selection:
Feature selection is the process of selecting a subset of the relevant features and leaving out the
irrelevant features present in a dataset, in order to build a model of high accuracy. In other words, it is a
way of selecting the optimal features from the input dataset.
1. Filter Methods
In this method, the dataset is filtered, and only a subset containing the relevant features is kept; no
learning model is involved in the scoring. Some common filter techniques are listed below (a small
correlation-filter sketch follows the list):
o Correlation
o Chi-Square Test
o ANOVA
o Information Gain, etc.
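As an illustration of a simple filter, the sketch below scores each feature by its absolute Pearson correlation with the target and keeps the top k. The synthetic data, the value of k, and the helper name correlation_filter are illustrative assumptions, not part of these notes.

```python
import numpy as np

def correlation_filter(X, y, k):
    """Keep the k features with the largest absolute Pearson correlation to the target."""
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    kept = np.sort(np.argsort(scores)[::-1][:k])   # indices of the k best-scoring features
    return kept, scores

# toy data: five features, but only the first two actually drive the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

kept, scores = correlation_filter(X, y, k=2)
print("kept feature indices:", kept)
print("correlation scores:", scores.round(2))
```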
2. Wrapper Methods
The wrapper method has the same goal as the filter method, but it uses a machine learning model for its
evaluation. Subsets of features are fed to the ML model and its performance is evaluated; that
performance decides whether features are added or removed to increase the accuracy of the model.
This approach is usually more accurate than filtering but more expensive to run. Some common wrapper
techniques are listed below (a forward-selection sketch follows the list):
o Forward Selection
o Backward Selection
o Bi-directional Elimination
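The sketch below illustrates forward selection, one of the wrapper techniques above: starting from an empty set, the feature that most improves cross-validated accuracy is added at each step, and the loop stops when no candidate helps. The breast-cancer dataset, the logistic-regression model, and the stopping rule are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

selected, remaining, best_score = [], list(range(X.shape[1])), 0.0
while remaining:
    # evaluate every candidate feature when added to the current subset
    trials = [(cross_val_score(model, X[:, selected + [f]], y, cv=5).mean(), f)
              for f in remaining]
    score, best_f = max(trials)
    if score <= best_score:          # stop once no candidate improves accuracy
        break
    best_score, selected = score, selected + [best_f]
    remaining.remove(best_f)

print("selected feature indices:", selected)
print("cross-validated accuracy:", round(best_score, 3))
```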
3. Embedded Methods: Embedded methods perform feature selection as part of the model's own
training process, evaluating the importance of each feature while the model is being fitted. Some
common embedded techniques are listed below (a LASSO sketch follows the list):
o LASSO
o Elastic Net
o Ridge Regression, etc.
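As a sketch of an embedded method, the example below fits a LASSO model; the L1 penalty drives the coefficients of unhelpful features to zero, so selection happens during training itself. The synthetic data and the alpha value are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
# only features 0 and 3 matter; LASSO should shrink the other coefficients to zero
y = 4 * X[:, 0] - 3 * X[:, 3] + rng.normal(scale=0.1, size=200)

X_std = StandardScaler().fit_transform(X)
lasso = Lasso(alpha=0.1).fit(X_std, y)

print("coefficients:", lasso.coef_.round(2))
print("selected features:", np.flatnonzero(lasso.coef_ != 0))
```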
Feature Extraction:
Feature extraction is the process of transforming a space containing many dimensions into a space with
fewer dimensions. This approach is useful when we want to keep the information as a whole but use
fewer resources while processing it.
A widely used feature extraction technique is Principal Component Analysis (PCA):
Karl Pearson was the first to propose this idea. It rests on the principle that when data from a
higher-dimensional space is projected into a lower-dimensional space, the lower-dimensional space
should retain the maximum variance. In simple terms, principal component analysis (PCA) is a way to
obtain important variables (in the form of components) from a large set of variables in a data set. It
finds the directions in which the data is most spread out. PCA is most useful when the data has three or
more dimensions.
When applying the PCA method, the following are the primary steps to follow (a NumPy sketch after
the list mirrors these steps):
1. Obtain the dataset you need.
2. Calculate the mean vector of the data.
3. Subtract the mean from every data point, so that the data is centred.
4. Compute the covariance matrix of the centred data.
5. Determine the eigenvectors and eigenvalues of the covariance matrix.
6. Form a feature vector by choosing the eigenvectors with the largest eigenvalues; these are the
principal components.
7. Create the new data set by projecting the centred data onto the chosen eigenvectors. Because fewer
eigenvectors are kept, some information is lost, but the retained components preserve the most
significant variance.
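A minimal NumPy sketch of these steps is given below; the function name pca and its return values are illustrative, not a fixed interface from the notes.

```python
import numpy as np

def pca(X, n_components):
    """Step-by-step PCA following the procedure above."""
    mean = X.mean(axis=0)                     # step 2: mean vector
    X_centred = X - mean                      # step 3: centre the data
    cov = np.cov(X_centred, rowvar=False)     # step 4: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # step 5: eigen-decomposition (eigh suits symmetric matrices)
    order = np.argsort(eigvals)[::-1]         # step 6: sort components by decreasing eigenvalue
    components = eigvecs[:, order[:n_components]]
    projected = X_centred @ components        # step 7: project the centred data onto the components
    return projected, eigvals[order], components
```

Calling pca(X, 1) on two-dimensional data, for example, returns the data projected onto its first principal component.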
OR
Consider the two-dimensional patterns (2, 1), (3, 5), (4, 3), (5, 6), (6, 7), (8, 8) and (9, 10), and compute
the principal components using the PCA algorithm.
X: 2, 3, 4, 5, 6, 8, 9
Y: 1, 5, 3, 6, 7, 8, 10
Answer:
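Before the hand-worked answer, the short NumPy check below runs the same steps on the patterns given in the question; it is a cross-check sketch, not the official solution.

```python
import numpy as np

# the seven patterns stated in the question
X = np.array([[2, 1], [3, 5], [4, 3], [5, 6], [6, 7], [8, 8], [9, 10]], dtype=float)

mean = X.mean(axis=0)                     # step 2: mean vector
X_centred = X - mean                      # step 3: centre the data
cov = np.cov(X_centred, rowvar=False)     # step 4: covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)    # step 5: eigen-decomposition

print("mean vector:", mean)
print("covariance matrix:\n", cov.round(3))
print("eigenvalues (descending):", eigvals[::-1].round(3))
print("first principal component:", eigvecs[:, -1].round(3))   # step 6: largest-eigenvalue direction
```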