UNIT04
FEATURE SUBSET SELECTION
• Feature selection is arguably the most critical
pre-processing activity in any machine
learning project.
• It aims to select the subset of attributes or features that makes the most meaningful contribution to the machine learning activity.
• Issues in high-dimensional data
– ‘High-dimensional’ refers to the large number of variables, attributes, or features present in certain data sets.
– Two common examples are biomedical research, which involves gene selection from microarray data, and text categorization, which deals with huge volumes of text data from sources such as social networking sites and emails.
• The objective of feature selection is three-fold:
– Having a faster and more cost-effective (i.e. less demanding of computational resources) learning model
– Improving the performance of the learning model
– Having a better understanding of the underlying
model that generated the data
• Key drivers of feature selection
• Feature relevance
– Feature relevance is indicated by the information gain from a feature, measured in terms of relative entropy (see the sketch after this list).
• Feature redundancy
– Feature redundancy arises when multiple features contribute similar information, measured by feature-to-feature similarity (also illustrated in the sketch below):
• Correlation
• Distance (Minkowski distances, e.g. Manhattan, Euclidean, etc., are the most popular measures)
• Other coefficient-based measures (Jaccard, SMC, Cosine similarity, etc.)
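To make these two drivers concrete, here is a minimal Python sketch on a small hypothetical toy dataset (the arrays and function names are illustrative, not from the original slides): it scores relevance via information gain and redundancy via feature-to-feature correlation.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a discrete label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    """Relevance: class entropy minus the conditional entropy
    after splitting on the (discrete) feature values."""
    cond = 0.0
    for v in np.unique(feature):
        mask = feature == v
        cond += mask.mean() * entropy(labels[mask])
    return entropy(labels) - cond

# Hypothetical toy data: f1 separates the class perfectly; f2 is a copy of f1.
f1 = np.array([0, 0, 1, 1, 0, 1])
f2 = np.array([0, 0, 1, 1, 0, 1])  # redundant duplicate of f1
y  = np.array([0, 0, 1, 1, 0, 1])

print(information_gain(f1, y))      # 1.0 bit of gain -> highly relevant
print(np.corrcoef(f1, f2)[0, 1])    # correlation 1.0 -> f2 is redundant
```

A feature with high information gain is relevant; a pair of features with correlation close to 1 carries near-identical information, so one of them can be dropped.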
• Distance-based similarity measure
– The most common distance measure is the Euclidean distance which, between two features $F_1$ and $F_2$ measured over $n$ instances, is calculated as $d(F_1, F_2) = \sqrt{\sum_{i=1}^{n} (F_{1i} - F_{2i})^2}$.
• Coefficient-based similarity measure
– The Jaccard coefficient between two binary features is $J = \frac{n_{11}}{n_{11} + n_{01} + n_{10}}$, where:
• n11 = number of cases where both features have value 1
• n01 = number of cases where feature 1 has value 0 and feature 2 has value 1
• n10 = number of cases where feature 1 has value 1 and feature 2 has value 0
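A minimal sketch of both measures, assuming two hypothetical binary feature vectors observed over the same six instances (the arrays are made up for illustration):

```python
import numpy as np

F1 = np.array([1, 0, 1, 1, 0, 1])
F2 = np.array([1, 1, 1, 0, 0, 1])

# Euclidean distance: d(F1, F2) = sqrt(sum_i (F1_i - F2_i)^2)
d_euclid = np.sqrt(np.sum((F1 - F2) ** 2))

# Counts defined above, for the coefficient-based measures
n11 = np.sum((F1 == 1) & (F2 == 1))   # both 1
n01 = np.sum((F1 == 0) & (F2 == 1))   # feature 1 is 0, feature 2 is 1
n10 = np.sum((F1 == 1) & (F2 == 0))   # feature 1 is 1, feature 2 is 0

# Jaccard coefficient: J = n11 / (n11 + n01 + n10)
jaccard = n11 / (n11 + n01 + n10)

print(d_euclid)   # sqrt(2) ~= 1.414
print(jaccard)    # 3 / 5 = 0.6
```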