Extensive Class Notes:
Feature Selection & Extraction
Advanced Machine Learning / Data Mining :: Samik Chakraborty
1 Introduction: The Curse of Dimensionality
Concept 1.1 (Curse of Dimensionality). As the number of features (dimensions) d in
a dataset grows, the amount of data needed to generalize accurately grows exponentially.
This leads to:
• Sparsity: Data points become very far apart in high-dimensional space.
• Overfitting: Models learn noise instead of the underlying signal.
• Computational Cost: Training and prediction times increase significantly.
Solution: Dimensionality Reduction. This is achieved through two main families
of techniques:
1. Feature Selection: Choose a subset of the original features.
2. Feature Extraction: Create a new, smaller set of features from the original ones.
2 Feature Selection
Goal: To select a subset of k features from the original d features (k < d) that is most
relevant to the target variable.
Let the original feature set be F = {f1 , f2 , ..., fd }. The goal is to find a subset S ⊂ F
such that |S| = k and an evaluation criterion J(S) is maximized.
2.1 Filter Methods
Concept 2.1 (Filter Methods). Select features based on their intrinsic, statistical prop-
erties, independent of any machine learning model. They are fast and scalable.
a) Variance Threshold
• Concept: Remove features with low variance. Assumes features with no variance
provide no information.
• Math: Calculate the variance σ_j^2 for each feature j:

      σ_j^2 = (1/n) Σ_{i=1}^{n} (x_{ij} − µ_j)^2

  Features where σ_j^2 < τ (a threshold) are removed.
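A minimal sketch of this filter using scikit-learn's VarianceThreshold; the toy matrix and the threshold τ = 0.1 are illustrative choices, not part of the notes above:

    # Sketch: drop low-variance features; threshold tau = 0.1 is illustrative.
    import numpy as np
    from sklearn.feature_selection import VarianceThreshold

    X = np.array([[0.0, 2.1, 1.0],
                  [0.1, 0.3, 1.0],
                  [0.0, 1.7, 1.0],
                  [0.1, 4.2, 1.0]])          # 3rd column is constant (zero variance)

    selector = VarianceThreshold(threshold=0.1)
    X_reduced = selector.fit_transform(X)    # keeps features with variance above 0.1
    print(selector.variances_)               # per-feature variances sigma_j^2
    print(selector.get_support())            # boolean mask of retained features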
b) Correlation-based (e.g., Pearson Correlation)
• Concept: Remove features highly correlated with each other (redundancy). Keep
features highly correlated with the target.
• Math (Pearson’s r): For two variables X and Y :
      r_XY = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / [ √(Σ_{i=1}^{n} (x_i − x̄)^2) · √(Σ_{i=1}^{n} (y_i − ȳ)^2) ]

  – |r| ≈ 1: Strong linear relationship.
  – r ≈ 0: No linear relationship.
  – Remove feature f_i if |r_{f_i f_j}| > 0.9 for another feature f_j.
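A pandas sketch of this redundancy filter; the column names, synthetic data, and the 0.9 cutoff are illustrative:

    # Sketch: drop one feature from each pair with |r| > 0.9.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    f1 = rng.normal(size=200)
    df = pd.DataFrame({"f1": f1,
                       "f2": 2 * f1 + rng.normal(scale=0.05, size=200),  # near-duplicate of f1
                       "f3": rng.normal(size=200)})

    corr = df.corr().abs()                                              # |Pearson r| matrix
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))   # upper triangle only
    to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
    df_reduced = df.drop(columns=to_drop)
    print(to_drop)                                                      # f2 is flagged as redundant with f1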
c) Mutual Information (MI)
• Concept: A measure from information theory that captures any kind of rela-
tionship (linear or non-linear). Measures how much knowing one variable reduces
uncertainty about the other.
• Math: For two discrete variables X and Y :
      I(X; Y) = Σ_{y∈Y} Σ_{x∈X} p(x, y) log [ p(x, y) / (p(x) p(y)) ]
where p(x, y) is the joint probability mass function and p(x), p(y) are the marginal
PMFs.
– I(X; Y ) = 0 if X and Y are independent.
– Higher values indicate stronger dependency.
– Features are ranked by their MI score with the target variable.
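A sketch of MI-based ranking with scikit-learn; mutual_info_classif estimates I(f_j; y) per feature, and k = 2 is an illustrative choice:

    # Sketch: rank features by mutual information with the class label, keep top k.
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, mutual_info_classif

    X, y = load_iris(return_X_y=True)
    print(mutual_info_classif(X, y, random_state=0))   # one MI estimate per feature

    selector = SelectKBest(score_func=mutual_info_classif, k=2)
    X_top2 = selector.fit_transform(X, y)              # keeps the 2 highest-MI features
    print(selector.get_support())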
2.2 Wrapper Methods
Concept 2.2 (Wrapper Methods). Use the performance of a specific predictive model to
evaluate the quality of a feature subset. More computationally expensive but often more
accurate than filter methods.
a) Forward Selection
1. Start with an empty set S = {}.
2. For each feature not in S, tentatively add it to S and train a model. Evaluate the
model to obtain J(S) (e.g., cross-validation accuracy).
3. Keep the feature that provides the highest improvement in J(S).
4. Repeat steps 2-3 until a stopping criterion is met (e.g., |S| = k, or performance
gain is negligible).
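A sketch of this greedy loop with cross-validated accuracy as J(S); the logistic-regression model and the target size k = 2 are illustrative choices:

    # Sketch: greedy forward selection with cross-validation accuracy as J(S).
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)
    model = LogisticRegression(max_iter=1000)
    selected, remaining, k = [], list(range(X.shape[1])), 2

    while len(selected) < k:
        # evaluate J(S ∪ {f}) for every feature f not yet in S
        scores = {f: cross_val_score(model, X[:, selected + [f]], y, cv=5).mean()
                  for f in remaining}
        best = max(scores, key=scores.get)   # feature giving the largest J(S)
        selected.append(best)
        remaining.remove(best)

    print("selected feature indices:", selected)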
b) Backward Elimination
1. Start with the full set S = F .
2. For each feature in S, remove it and train a model. Evaluate the model.
3. Remove the feature whose absence causes the smallest decrease (or largest increase)
in J(S).
4. Repeat steps 2-3 until a stopping criterion is met.
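Both procedures are available off the shelf in scikit-learn's SequentialFeatureSelector; a minimal sketch (the estimator and n_features_to_select = 2 are illustrative):

    # Sketch: backward elimination via SequentialFeatureSelector;
    # use direction="forward" for forward selection instead.
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)
    sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                    n_features_to_select=2,
                                    direction="backward", cv=5)
    sfs.fit(X, y)
    print(sfs.get_support())                 # boolean mask of the retained features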
c) Recursive Feature Elimination (RFE)
A popular variant often used with models that provide feature importance (e.g., linear
models, SVMs, tree-based models).
1. Train a model on the current feature set.
2. Rank features by their importance (e.g., absolute coefficient weight |wi |).
3. Discard the least important feature(s).
4. Repeat steps 1-3 on the reduced set until the desired number of features k is reached.
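A minimal sketch with scikit-learn's RFE wrapped around a linear SVM; n_features_to_select = 2 and step = 1 (drop one feature per round) are illustrative:

    # Sketch: Recursive Feature Elimination using |coefficient| as the importance.
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import RFE
    from sklearn.svm import LinearSVC

    X, y = load_iris(return_X_y=True)
    rfe = RFE(estimator=LinearSVC(max_iter=5000), n_features_to_select=2, step=1)
    rfe.fit(X, y)
    print(rfe.support_)                      # True for the k retained features
    print(rfe.ranking_)                      # 1 = retained; higher = eliminated earlier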
2.3 Embedded Methods
Concept 2.3 (Embedded Methods). Feature selection is built directly into the model
training process. They combine the qualities of filter and wrapper methods: they are
model-specific but more efficient than wrappers.
a) Lasso Regression (L1 Regularization)
• Concept: Adds a penalty equal to the absolute value of the magnitude of coeffi-
cients. This tends to produce sparse models where some coefficients become exactly
zero.
• Math: The objective function to minimize is:
      min_w  (1/(2n)) ||y − Xw||_2^2 + α ||w||_1

  where:
  – ||y − Xw||_2^2 is the least-squares loss (MSE).
  – ||w||_1 = Σ_{i=1}^{d} |w_i| is the L1-norm penalty.
– α is the regularization parameter. A larger α forces more coefficients to zero.
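A minimal sketch with scikit-learn's Lasso, whose objective matches the formula above; α = 1.0 is an illustrative value (larger α gives a sparser model):

    # Sketch: coefficients driven exactly to zero mark the discarded features.
    # The diabetes features ship pre-scaled; in general, standardize before L1.
    import numpy as np
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import Lasso

    X, y = load_diabetes(return_X_y=True)
    lasso = Lasso(alpha=1.0).fit(X, y)
    print(lasso.coef_)                                # some entries are typically exactly 0
    print("selected:", np.flatnonzero(lasso.coef_))   # indices of surviving features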
b) Tree-Based Methods (e.g., Random Forest, XGBoost)
• Concept: These algorithms have built-in mechanisms to calculate feature impor-
tance.
• Math (Gini Importance): For a tree, the importance of a feature is the total
reduction in impurity achieved by splits on that feature, averaged over all trees.
      Importance(f_j) = (1/N_trees) Σ_T Σ_{t ∈ T : split on f_j} ∆I(t)
where ∆I(t) is the impurity reduction at node t. Features are ranked by this score.
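A sketch reading the built-in Gini importances from a random forest; n_estimators = 200 is an illustrative setting:

    # Sketch: impurity-based feature importances from a random forest.
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    data = load_iris()
    forest = RandomForestClassifier(n_estimators=200, random_state=0)
    forest.fit(data.data, data.target)
    for name, imp in zip(data.feature_names, forest.feature_importances_):
        print(f"{name}: {imp:.3f}")          # scores sum to 1; rank features by them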
3 Feature Extraction
Goal: To project the original d-dimensional data onto a new k-dimensional subspace
(k < d), creating new features that are combinations of the original ones, preserving as
much relevant information as possible.
3.1 Principal Component Analysis (PCA)
Concept 3.1 (PCA). Find a set of orthogonal axes (principal components) that capture
the directions of maximum variance in the data. The first PC captures the most variance,
the second captures the next most while being orthogonal to the first, and so on. PCA is
an unsupervised method.
Mathematical Derivation:
1. Center the Data: Shift the data to have mean zero: X_centered = X − µ. (If features
are on very different scales, standardize them as well; assume X is mean-centered
from here on.)
2. Covariance Matrix: Compute the covariance matrix C:
      C = (1/(n − 1)) X^T X
C is a d × d symmetric matrix.
3. Eigendecomposition: Factorize the covariance matrix:
      C v_i = λ_i v_i,   or equivalently   C V = V Λ

• The columns of V are the eigenvectors (the principal components).
• Λ is a diagonal matrix of eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_d (the amount of
variance captured).
4. Projection: To reduce dimensionality to k, choose the top-k eigenvectors. The
transformed data is:
      Z = X V_k

where V_k is a d × k matrix containing the top-k eigenvectors.
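The four steps translate almost line for line into NumPy; a sketch with k = 2 (an illustrative choice):

    # Sketch: PCA from scratch, mirroring the derivation above.
    import numpy as np
    from sklearn.datasets import load_iris

    X, _ = load_iris(return_X_y=True)
    k = 2

    Xc = X - X.mean(axis=0)                   # 1. center the data
    C = (Xc.T @ Xc) / (Xc.shape[0] - 1)       # 2. d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)      # 3. eigendecomposition (ascending order)
    order = np.argsort(eigvals)[::-1]         #    re-sort eigenvalues in descending order
    V_k = eigvecs[:, order[:k]]               #    top-k eigenvectors (d x k)
    Z = Xc @ V_k                              # 4. project: Z = X V_k, now n x k

    print("variance captured:", eigvals[order[:k]].sum() / eigvals.sum())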
3.2 Linear Discriminant Analysis (LDA)
Concept 3.2 (LDA). Find a projection that maximizes the separation between
classes while minimizing the variance within each class. It is a supervised
method.
Mathematical Derivation:
1. Compute Scatter Matrices:
• Within-class scatter matrix (S_W): Measures spread within each class.

      S_W = Σ_{c=1}^{C} S_c,   where S_c = Σ_{i∈c} (x_i − µ_c)(x_i − µ_c)^T

  (C is the number of classes, µ_c is the mean of class c).
• Between-class scatter matrix (S_B): Measures spread between class means.

      S_B = Σ_{c=1}^{C} n_c (µ_c − µ)(µ_c − µ)^T

  (n_c is the number of points in class c, µ is the overall mean).
2. The Objective: Maximize the ratio of S_B to S_W in the projected space (the Rayleigh
quotient):

      J(w) = (w^T S_B w) / (w^T S_W w)
3. Solution: The optimal projection W is found by solving the generalized eigenvalue
problem:
      S_B w = λ S_W w
The columns of W are the eigenvectors corresponding to the largest eigenvalues.
The number of useful components is at most C − 1, since S_B has rank at most C − 1.
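A sketch that builds S_W and S_B and solves the generalized eigenproblem with SciPy; the Iris data and the choice of 2 components (= C − 1 here) are illustrative, and scipy.linalg.eigh assumes S_W is positive definite:

    # Sketch: LDA projection via S_B w = lambda S_W w.
    import numpy as np
    from scipy.linalg import eigh
    from sklearn.datasets import load_iris

    X, y = load_iris(return_X_y=True)
    d, mu = X.shape[1], X.mean(axis=0)
    S_W, S_B = np.zeros((d, d)), np.zeros((d, d))

    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        S_W += (Xc - mu_c).T @ (Xc - mu_c)                # within-class scatter
        S_B += len(Xc) * np.outer(mu_c - mu, mu_c - mu)   # between-class scatter

    eigvals, eigvecs = eigh(S_B, S_W)                     # generalized eigenproblem (ascending)
    W = eigvecs[:, np.argsort(eigvals)[::-1][:2]]         # top C - 1 = 2 directions
    Z = X @ W                                             # projected data, n x 2
    print(Z.shape)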
3.3 Non-Negative Matrix Factorization (NMF)
Concept 3.3 (NMF). Factorize a non-negative data matrix X (n × d) into two lower-
rank, non-negative matrices W (n × k) and H (k × d).
X ≈ WH
The rows of H act as non-negative "basis features" (parts), and each row of W holds the
coefficients that combine those parts to approximate a sample. The non-negativity
constraint often leads to more interpretable parts-based representations.
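A minimal sketch with scikit-learn's NMF; k = 2 components and the init/random_state settings are illustrative:

    # Sketch: factor a non-negative X (n x d) into W (n x k) and H (k x d).
    from sklearn.datasets import load_iris
    from sklearn.decomposition import NMF

    X, _ = load_iris(return_X_y=True)         # all entries are non-negative
    nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
    W = nmf.fit_transform(X)                  # n x k factor
    H = nmf.components_                       # k x d factor (rows are the "parts")
    print(W.shape, H.shape)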
4 Summary & Comparison
Table 1: Feature Selection vs. Feature Extraction
Aspect             | Feature Selection                                  | Feature Extraction
-------------------|----------------------------------------------------|----------------------------------------------------
Output             | Subset of original features (e.g., f_1, f_5, f_7)  | New transformed features (e.g., PC_1, PC_2)
Interpretability   | High. Original feature meaning is preserved.       | Low. New features are combinations and can be hard to interpret (except for LDA).
Model specificity  | Filters: No. Wrappers/Embedded: Yes.               | PCA: No (unsupervised). LDA: Yes (supervised).
Primary goal       | Remove irrelevant/redundant features.              | Create a compact, informative representation.
Methods            | Variance, Correlation, MI, L1, RFE                 | PCA, LDA, NMF, Autoencoders
When to Use What?
• Use Feature Selection when interpretability is crucial (e.g., “Which gene is a
biomarker?”).
• Use Feature Extraction (like PCA) when features are highly correlated, you need
to reduce dimensionality drastically, or interpretability is less important (e.g., image
preprocessing).
• Use LDA specifically for a supervised classification task to maximize class separa-
tion.