CSC604 (Machine Learning)
Module VI
➢ BY:
DR. ARUNDHATI DAS
Module VI: Dimensionality Reduction
• 6.1 Curse of Dimensionality.
• 6.2 Feature Selection and Feature Extraction.
• 6.3 Dimensionality Reduction Techniques, Principal Component Analysis.
Curse of Dimensionality
• In machine learning classification problems, the higher the number of features, the harder it gets to visualize the training set and then work on it. Sometimes, most of these features are correlated and hence redundant. This is where dimensionality reduction algorithms come into play. Dimensionality reduction is the process of reducing the number of features.
• Due to the high dimensionality of the data, a phenomenon called the curse of dimensionality arises.
• The curse of dimensionality causes overfitting, resulting in high test error (low test accuracy) in machine learning. (A small illustrative sketch follows below.)
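Below is a minimal sketch (not from the slides; the dataset, classifier, and feature counts are illustrative assumptions) showing how the test accuracy of a simple classifier can degrade as more and more uninformative features are added, which is one face of the curse of dimensionality.

```python
# Illustrative sketch: keep the number of informative features fixed and add more
# and more noisy features, then watch a k-NN classifier's test accuracy degrade.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

for n_features in (10, 100, 1000):
    X, y = make_classification(
        n_samples=500,
        n_features=n_features,   # total number of features
        n_informative=5,         # only 5 of them carry class information
        n_redundant=0,
        random_state=0,
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0
    )
    clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    print(n_features, "features -> test accuracy:", clf.score(X_test, y_test))
```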
Introduction to Dimensionality Reduction
• There are two main dimensionality reduction approaches:
• Feature Selection
• Feature Extraction
• Feature Selection: A feature selection method selects a subset of relevant features from the original feature set.
• Feature Extraction: A feature extraction method creates new features based on combinations or transformations of the original feature set.
Introduction to Dimensionality Reduction
Feature Selection:
• Original features are maintained; feature selection algorithms keep a subset of the original features.
• Reduces the dimensionality of the feature space by selecting a subset from the original set.
• Requires domain knowledge and feature engineering to write an algorithm for selecting features.
• May lose some information and introduce bias (error due to incorrect assumptions in an algorithm) if the wrong features are selected.
Feature Extraction:
• Feature extraction algorithms transform the data onto a new feature space.
• Represents the original features in a lower-dimensional transformed feature space by capturing only the essential information from the original set.
• Can be applied to raw data without explicitly writing selection algorithms that require domain knowledge.
• May introduce some noise and redundancy if the extracted features are not informative.
Feature Selection
• Feature selection, also known as variable selection, attribute selection, or variable subset selection, is used in machine learning or statistics to select a subset of features from the original set of features to construct models for describing the data.
• Example datasets on which feature selection is commonly applied: Diabetes dataset, Iris dataset, Ionosphere dataset.
Feature selection
• Why is feature selection needed?
• Feature selection is used to choose a subset of relevant features for effective classification
of data.
• In high-dimensional data classification, the performance of a classifier often depends on the feature subset used for classification.
• People use feature selection to
• Minimize redundancy
• Reduce dimensionality (to reduce number of features)
• Improve predictive accuracy
• The main objective of feature selection is to identify the m most informative features out of the d original features, where m < d (for example, m = 10, d = 100).
Feature selection
• Feature selection approaches are classified mainly into three categories.
• 1. Filter approach
• 2. Wrapper approach
• 3. Embedded approach
Figure: Classification of feature selection approaches (Image Courtesy: Liping Xei)
Feature selection
1. Filter approach (Guyon & Elisseeff, 2003): selects a subset of features based on some measure calculated on the features, without using a learning algorithm; filter methods are faster than wrapper-based methods and are used on many datasets where the number of features is high.
2. Wrapper approach (Blum & Langley, 1997): uses a learning algorithm to evaluate the accuracy produced by the selected features in classification; wrapper methods can give high classification accuracy for particular classifiers but generally have high computational complexity.
3. Embedded approach (Guyon & Elisseeff, 2003): performs feature selection during the training process and is specific to the applied learning algorithm.
Hybrid approach (Hsu, Hsieh, & Lu, 2011): a combination of the filter and wrapper approaches; the filter selects a candidate feature set from the original feature set, which is then refined by the wrapper, exploiting the advantages of both.
Each approach is described in more detail on the following slides.
Feature selection
1. Filter approach (Guyon & Elisseeff, 2003):
a. It is commonly used on datasets where the number of features is high.
b. This approach selects a subset of features depending on some measure calculated on the features, without using a learning algorithm. Some of the criteria or measures used to select features are:
i. Information Gain – the amount of information a feature provides for identifying the target value. The information gain of each attribute is calculated with respect to the target values for feature selection.
ii. Variance Threshold – removes all features whose variance does not meet a specified threshold. By default, this method removes features having zero variance. The assumption is that higher-variance features are likely to contain more information.
iii. Pearson Correlation – measures the linear relationship between a feature and the target variable; it indicates how one variable changes in response to another.
c. Filter-based feature selection methods are faster than wrapper-based methods since they do not use a learning algorithm. (A small sketch of these measures follows below.)
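Below is a minimal sketch of the filter measures above, assuming scikit-learn and the Iris dataset; the threshold value and k are illustrative choices, and mutual information is used here as a stand-in for information gain.

```python
# Filter-based selection: variance thresholding, a univariate information score,
# and Pearson correlation of each feature with the target.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold, SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# 1. Variance Threshold: drop features whose variance is below 0.2 (threshold is our choice).
vt = VarianceThreshold(threshold=0.2)
X_vt = vt.fit_transform(X)
print("Kept after variance threshold:", vt.get_support())

# 2. Information-based score: keep the k = 2 features with the highest mutual information.
skb = SelectKBest(score_func=mutual_info_classif, k=2)
X_best = skb.fit_transform(X, y)
print("Mutual information scores:", skb.scores_)

# 3. Pearson correlation of each feature with the (numeric-coded) target.
corr = [np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])]
print("Pearson correlation with target:", np.round(corr, 3))
```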
Feature selection
2. Wrapper approach (Blum & Langley, 1997):
1. This approach uses a learning algorithm to evaluate the accuracy produced by the use of the selected features in classification.
2. Wrapper methods can give high classification accuracy for particular classifiers, but generally they have high computational complexity.
3. Some of the techniques used are (a sketch follows below):
1. Forward selection – an iterative approach where we start with an empty set of features and, at each iteration, add the feature that best improves the model. The procedure stops when adding a new feature no longer improves the performance of the model.
2. Backward elimination – also an iterative approach where we start with all features and, at each iteration, remove the least significant feature. The procedure stops when removing a feature no longer improves the performance of the model.
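Below is a minimal sketch of forward selection and backward elimination, assuming scikit-learn's SequentialFeatureSelector wrapped around a logistic regression model on the Iris dataset; the number of features to keep is an illustrative choice.

```python
# Wrapper-based selection: the selector repeatedly retrains the wrapped classifier,
# adding (forward) or removing (backward) one feature at a time.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
estimator = LogisticRegression(max_iter=1000)

# Forward selection: start from the empty set, add the feature that helps most.
forward = SequentialFeatureSelector(estimator, n_features_to_select=2, direction="forward")
forward.fit(X, y)
print("Forward selection kept:", forward.get_support())

# Backward elimination: start from all features, drop the least useful one each step.
backward = SequentialFeatureSelector(estimator, n_features_to_select=2, direction="backward")
backward.fit(X, y)
print("Backward elimination kept:", backward.get_support())
```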
Feature selection
3. Embedded approach (Guyon & Elisseeff, 2003):
1. This approach performs feature selection during the process of training and is specific to the applied learning algorithm.
2. The feature selection algorithm is blended into the learning algorithm, which thus has its own built-in feature selection method.
3. Some techniques used are (a sketch follows below):
Tree-based methods – methods such as Random Forest and Gradient Boosting provide feature importance scores that can also be used to select features. Feature importance tells us how much each feature contributes to predicting the target.
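Below is a minimal sketch of embedded, tree-based selection, assuming scikit-learn's RandomForestClassifier and SelectFromModel on the Iris dataset; the number of trees and the "mean importance" threshold are illustrative choices.

```python
# Embedded selection: the ensemble is trained once, and the feature importances
# produced during training drive the selection.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)

# Importances come "for free" from training the forest.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print("Feature importances:", forest.feature_importances_)

# Keep only the features whose importance is above the mean importance.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=200, random_state=0), threshold="mean"
)
X_selected = selector.fit_transform(X, y)
print("Kept features:", selector.get_support(), "new shape:", X_selected.shape)
```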
Feature selection
Hybrid approach (Hsu, Hsieh, & Lu, 2011):
1. This approach is a combination of both filter and wrapper-based methods.
2. The filter approach selects a candidate feature set from the original feature set, and the candidate feature set is then refined by the wrapper approach.
3. It exploits the advantages of these two approaches. (A sketch follows below.)
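Below is a minimal sketch of a hybrid pipeline, assuming scikit-learn and the Breast Cancer dataset: a cheap filter stage (SelectKBest) produces a candidate set that a wrapper stage (SequentialFeatureSelector) then refines. The values of k = 10 candidates and 5 final features are illustrative.

```python
# Hybrid selection: filter first, then wrapper on the surviving candidates.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, SequentialFeatureSelector, f_classif
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)          # 30 original features

# Filter stage: keep the 10 features with the highest ANOVA F-score.
filt = SelectKBest(score_func=f_classif, k=10)
X_candidate = filt.fit_transform(X, y)

# Wrapper stage: forward selection on the candidate set down to 5 features.
wrapper = SequentialFeatureSelector(
    LogisticRegression(max_iter=5000), n_features_to_select=5, direction="forward"
)
X_final = wrapper.fit_transform(X_candidate, y)
print("Shapes:", X.shape, "->", X_candidate.shape, "->", X_final.shape)
```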
Feature Extraction:
• A feature extraction method creates new features based on combinations or transformations of the original feature set.
• A natural question is whether the new transformed features are better than the original features.
• E.g.: PCA (Principal Component Analysis), LDA (Linear Discriminant Analysis).
Principal Component Analysis (PCA).
• PCA is one of the most popular and widely used feature extraction
techniques.
• It was introduced by Karl Pearson.
• Why it is so popular:
• Simple, based on applied linear algebra.
• Non-parametric method of extracting relevant information from confusing data sets.
• PCA makes one stringent but powerful assumption: linearity.
• PCA is unsupervised in nature.
Principal Component Analysis (PCA)
➢ Consider the example shown in Figure 1. A spring-like structure is stretched and released, and the system exhibits oscillatory motion in one direction.
➢ The underlying dynamics can be expressed as a function of a single variable x.
➢ Unfortunately, because of our ignorance, we do not even know what the real “x”, “y” and “z” axes are, so we choose three camera axes {a, b, c} at some arbitrary angles with respect to the system.
➢ If we were smart experimenters, we would have just measured the position along the x-axis with one camera. But this is not what happens in the real world.
➢ We often do not know which measurements best reflect the dynamics of the system in question.
➢ Furthermore, we sometimes record more dimensions than we actually need!
➢ Any data point can be represented as a linear combination of basis vectors. For example, in the standard basis, (x, y) = (3, −2) => 3·(1, 0) + (−2)·(0, 1).
o What is a basis?
o We could choose different basis vectors!
➢ Principal component analysis computes the most meaningful basis to re-express a noisy data set.
➢ The hope is that this new basis will filter out the noise and reveal the hidden dynamics.
➢ The basis vectors will always be orthogonal (e.g., the x and y coordinate axes are basis vectors). (A small change-of-basis sketch follows below.)
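A tiny NumPy sketch (not from the slides) of the change-of-basis idea: the same point (3, −2) is expressed first in the standard basis and then in a rotated orthonormal basis.

```python
# Re-expressing the same point in two different orthonormal bases.
import numpy as np

p = np.array([3.0, -2.0])

# Standard basis: the coordinates are just the components themselves.
e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(p @ e1, p @ e2)                    # 3.0 -2.0

# A different orthonormal basis (axes rotated by 45 degrees).
u1 = np.array([1.0, 1.0]) / np.sqrt(2)
u2 = np.array([-1.0, 1.0]) / np.sqrt(2)
coords = np.array([p @ u1, p @ u2])      # coordinates of p in the new basis
print(coords)
print(coords[0] * u1 + coords[1] * u2)   # reconstructs the original point (3, -2)
```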
Principal Component Analysis (PCA)
• For the shown example, in other words, the goal of PCA is to determine x′, the unit basis vector along the x-axis, which is the important dimension.
• Determining this fact allows an experimenter to discern which dynamics are important and which are merely redundant.
• PCA asks: Is there another basis, which is a linear combination of the original basis, that best re-expresses our data set?
• PCA transforms or maps data from a higher-dimensional space into a lower-dimensional space such that the variance of the data in the transformed space is maximized.
• PCA involves the following steps (a from-scratch sketch follows below):
• Construct the covariance matrix of the data.
• Compute the eigenvectors of this matrix.
• Eigenvectors corresponding to the largest eigenvalues are used to reconstruct a large fraction of the variance of the original data.
• Hence, we are left with a smaller number of eigenvectors, and there may have been some data loss in the process. However, the most important variance should be retained by the remaining eigenvectors.
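Below is a minimal from-scratch sketch (an assumed example, not from the slides) of the steps listed above, using NumPy: center the data, build the covariance matrix, eigendecompose it, and project onto the eigenvectors with the largest eigenvalues. The synthetic data and the choice k = 2 are illustrative.

```python
# PCA via the covariance matrix and its eigendecomposition.
import numpy as np

def pca(X, k):
    """Project the (n_samples, n_features) matrix X onto its top-k principal components."""
    X_centered = X - X.mean(axis=0)                  # center each feature
    cov = np.cov(X_centered, rowvar=False)           # covariance matrix (features x features)
    eigvals, eigvecs = np.linalg.eigh(cov)           # eigh: the covariance matrix is symmetric
    order = np.argsort(eigvals)[::-1]                # sort eigenvalues in decreasing order
    top_vecs = eigvecs[:, order[:k]]                 # eigenvectors of the k largest eigenvalues
    explained = eigvals[order[:k]] / eigvals.sum()   # fraction of variance each PC retains
    return X_centered @ top_vecs, explained

# Example on random 3-D data where one feature is nearly redundant, reduced to 2 PCs.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = 0.9 * X[:, 0] + 0.1 * rng.normal(size=100)
Z, explained = pca(X, k=2)
print(Z.shape, explained)
```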
Basic concepts for PCA
• Since PCA computes the covariance matrix, eigenvectors, eigenvalues, etc., let us brush up our concepts on all of these (sample versions shown, dividing by n − 1).
• Equation to find eigenvalues (characteristic equation): det(A − λI) = 0
• Eigenvector equation: A v = λ v
• Variance: Var(X) = (1/(n − 1)) Σ (xᵢ − x̄)²
• Covariance: Cov(X, Y) = (1/(n − 1)) Σ (xᵢ − x̄)(yᵢ − ȳ)
• Covariance matrix for two columns X, Y:
  C = [ Var(X)     Cov(X, Y) ]
      [ Cov(Y, X)  Var(Y)    ]
• Covariance matrix for three columns X, Y, Z: the 3×3 matrix whose (i, j)-th entry is the covariance between the i-th and j-th columns (variances on the diagonal).
• Numericals discussed on board (a small computational sketch follows below).
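Below is a small computational sketch (an assumed example; the actual numericals were worked on the board) of the quantities named above for a tiny two-column dataset, using NumPy.

```python
# Variance, covariance matrix, and eigenvalues/eigenvectors for a 2-column dataset.
import numpy as np

data = np.array([[2.0, 1.0],
                 [3.0, 5.0],
                 [4.0, 3.0],
                 [5.0, 6.0],
                 [6.0, 7.0]])        # columns: X, Y

print("Sample variance of X:", np.var(data[:, 0], ddof=1))   # divide by n - 1
C = np.cov(data, rowvar=False)                                # 2x2 covariance matrix
print("Covariance matrix:\n", C)

eigvals, eigvecs = np.linalg.eig(C)   # solves det(C - lambda*I) = 0 and C v = lambda v
print("Eigenvalues:", eigvals)
print("Eigenvectors (columns):\n", eigvecs)
```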
Basic concepts for PCA
• The covariance matrix is also known as the dispersion matrix or variance-covariance matrix.
• Variance measures how much a single random variable varies.
• Covariance measures how much two random variables vary together.
• The covariance matrix collects the covariance between every pair of columns (variables) of a dataset; its diagonal entries are the variances of the individual columns.
Principal Component Analysis (PCA).
• The idea of principal component analysis (PCA) is to reduce the dimensionality of a dataset consisting of a large number of related variables while retaining as much of the variance in the data as possible.
• PCA finds a set of new variables that are linear combinations of the original variables.
• The new variables are called Principal Components (PCs).
• These principal components are orthogonal: in a 3-D case, the principal components are perpendicular to each other.
• Figure 2 shows the intuition of PCA: it “rotates” the axes to line up better with the data.
• The first principal component captures most of the variance in the data, followed by the second, third, and so on. As a result, the new data will have fewer dimensions. (See the sketch below.)
Figure 2: Visualization of PCs
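Below is a minimal sketch (an assumed example, not from the slides) using scikit-learn's PCA on the standardized Iris data; explained_variance_ratio_ shows the first component capturing most of the variance, the second less, and so on.

```python
# PCA with scikit-learn: project onto 2 PCs and inspect the variance each retains.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)      # PCA is sensitive to feature scales

pca = PCA(n_components=2)
Z = pca.fit_transform(X_std)                   # data re-expressed in the first 2 PCs
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Cumulative:", np.cumsum(pca.explained_variance_ratio_))
print("New shape:", Z.shape)
```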
Dimensionality Reduction
• Advantages of Dimensionality Reduction
• It helps in data compression and hence reduces storage space.
• It reduces computation time.
• It also helps remove redundant features, if any.
• It overcomes the curse of dimensionality problem.
• Disadvantages of Dimensionality Reduction
• It may lead to some amount of data loss.
• PCA tends to find only linear correlations between variables, which is sometimes undesirable.
• PCA fails in cases where the mean and covariance are not enough to define a dataset.
• We may not know how many principal components to keep; in practice, some rules of thumb are applied (e.g., keeping enough components to explain most of the variance).
Numerical Example on PCA