
UNIT04

Basics of Feature Engineering


OUTLINE

• Feature and Feature Engineering


• Feature transformation:
– Construction
– Extraction
• Feature subset selection
– Issues in high-dimensional data
– Key drivers
– Measures and overall process
Feature and Feature Engineering
• What is a feature?
– A feature is an attribute of a data set that is used
in a machine learning process.
– The features in a data set are also called its
dimensions. So a data set having ‘n’ features is
called an n-dimensional data set.
• What is feature engineering?
– Feature engineering is an important pre-
processing step for machine learning.
– Feature engineering is a critical preparatory
process in machine learning. It is responsible for
taking raw input data and converting it into well-
aligned features which are ready to be used by the
machine learning models.
– Feature engineering refers to the process of
translating a data set into features such that these
features are able to represent the data set more
effectively and result in better learning
performance.
• Feature engineering has two major elements:
– feature transformation
• transforms the data, structured or unstructured, into a new
set of features which can represent the underlying problem
that machine learning is trying to solve.
• There are two variants of feature transformation:
– feature construction
» a process which discovers missing information about the relationships
between features and augments the feature space by creating
additional features. Hence, if there are ‘n’ features or dimensions
in a data set, after feature construction ‘m’ more features or
dimensions may get added, so the data set becomes ‘n + m’ dimensional.
– feature extraction
» a process which generates a new set of features by combining the
original features.
Both variants are sometimes known as feature discovery.
– feature subset selection
• The objective of feature subset selection is to derive a subset of
features from the full feature set which is most meaningful in the
context of a specific machine learning problem.
FEATURE TRANSFORMATION

• Engineering a good feature space is a crucial
prerequisite for the success of any machine
learning model.
• Feature transformation is used as an effective
tool for dimensionality reduction and hence
for boosting learning model performance.
Broadly, there are two distinct goals of
feature transformation:
– Achieving the best reconstruction of the original
features in the data set
– Achieving the highest efficiency in the learning task
Feature construction
• Feature construction involves transforming a
given set of input features to generate a new
set of more powerful features.
• There are certain situations where feature
construction is an essential activity before we
can start with the machine learning task.
These situations are
– when features have categorical values and the
machine learning algorithm needs numeric inputs
– when features have numeric (continuous) values
and need to be converted to ordinal values
– when text-specific feature construction needs to
be done
• Encoding categorical (nominal) variables
• Encoding categorical (ordinal) variables
• Transforming numeric (continuous) features to
categorical features (a sketch of these first three
techniques is shown below)
• Text-specific feature construction
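A minimal sketch of the first three construction techniques listed above, using pandas; the data frame, column names, and category mappings are purely illustrative, not taken from the text.

```python
# Hypothetical data frame illustrating the three non-text construction steps.
import pandas as pd

df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune"],      # nominal categorical
    "size": ["small", "large", "medium"],   # ordinal categorical
    "age":  [23, 41, 35],                   # numeric (continuous)
})

# 1. Nominal -> numeric: one-hot (dummy) encoding
df = pd.get_dummies(df, columns=["city"])

# 2. Ordinal -> numeric: map categories to an ordered integer scale
df["size"] = df["size"].map({"small": 0, "medium": 1, "large": 2})

# 3. Continuous -> categorical: bin ages into ordinal groups
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 40, 100],
                         labels=["young", "mid", "senior"])
print(df)
```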
• There are three major steps that are followed:
1. tokenize 2. count 3. normalize
• In order to tokenize a corpus, the blank spaces
and punctuations are used as delimiters to
separate out the words, or tokens. Then the
number of occurrences of each token is
counted, for each document. Lastly, tokens are
weighted with reducing importance when
they occur in the majority of the documents.
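As a sketch, the tokenize–count–normalize pipeline corresponds to what scikit-learn's TfidfVectorizer does: it tokenizes on whitespace and punctuation, counts token occurrences per document, and down-weights tokens that appear in most documents. The corpus below is illustrative.

```python
# Tokenize, count, and normalize a small illustrative corpus.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "machine learning needs good features",
    "feature engineering builds good features",
    "text data needs tokenization",
]

vectorizer = TfidfVectorizer()          # tokenize step
X = vectorizer.fit_transform(corpus)    # count + tf-idf normalization

print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))             # one weighted row per document
```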
Feature extraction

• In feature extraction, new features are created from
a combination of the original features. Some of the
commonly used operators for combining the original
features include the following (a brief sketch for the
numeric operators is given after the list):
– For Boolean features: Conjunctions, Disjunctions,
Negation, etc.
– For nominal features: Cartesian product, M of N, etc.
– For numerical features: Min, Max, Addition, Subtraction,
Multiplication, Division, Average, Equivalence, Inequality,
etc.
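A brief sketch of extracting new features by applying some of the numeric operators above to a pair of original features; the column names are made up for illustration.

```python
# Combine two illustrative numeric features with simple operators.
import pandas as pd

df = pd.DataFrame({"f1": [2.0, 5.0, 3.0], "f2": [4.0, 1.0, 3.0]})

df["f_sum"]  = df["f1"] + df["f2"]                 # addition
df["f_prod"] = df["f1"] * df["f2"]                 # multiplication
df["f_max"]  = df[["f1", "f2"]].max(axis=1)        # max
df["f_avg"]  = df[["f1", "f2"]].mean(axis=1)       # average
df["f_eq"]   = (df["f1"] == df["f2"]).astype(int)  # equivalence
print(df)
```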
• Principal Component Analysis
– The key to the success of machine learning lies in
having features that are few in number and that have
very little similarity (correlation) between each other.
This is the main guiding philosophy of principal
component analysis (PCA).
– In PCA, a new set of features is extracted from
the original features; the new features are quite
dissimilar from each other. So an n-dimensional
feature space gets transformed to an m-dimensional
feature space, where the dimensions are orthogonal
to each other, i.e. uncorrelated with each other.
• The objective of PCA is to make the transformation in such a way that
– The new features are distinct, i.e. the covariance between the new features,
i.e. the principal components, is 0.
– The principal components are generated in order of the variability in the data
that they capture. Hence, the first principal component should capture the
maximum variability, the second principal component should capture the next
highest variability, and so on.
– The sum of the variance of the new features, or the principal components, should
be equal to the sum of the variance of the original features.
• PCA works based on a process called eigenvalue decomposition of the covariance
matrix of a data set. Below are the steps to be followed (a NumPy sketch is given
after the list):
– First, calculate the covariance matrix of the data set.
– Then, calculate the eigenvalues and eigenvectors of the covariance matrix.
– The eigenvector having the highest eigenvalue represents the direction in which
there is the highest variance. This helps in identifying the first principal
component.
– The eigenvector having the next highest eigenvalue represents the direction in
which the data has the highest remaining variance and which is also orthogonal to
the first direction. This helps in identifying the second principal component.
– In this way, identify the top ‘k’ eigenvectors having the top ‘k’ eigenvalues so as to
get the ‘k’ principal components.
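A minimal sketch of these steps using NumPy only; the data matrix X and the choice k = 2 are illustrative.

```python
# PCA via eigenvalue decomposition of the covariance matrix.
import numpy as np

def pca(X, k):
    """Project data X (rows = samples, columns = features) onto its
    top-k principal components."""
    X_centered = X - X.mean(axis=0)          # centre each feature
    cov = np.cov(X_centered, rowvar=False)   # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: symmetric matrix
    order = np.argsort(eigvals)[::-1]        # sort by decreasing eigenvalue
    components = eigvecs[:, order[:k]]       # top-k eigenvectors
    return X_centered @ components           # transformed (n x k) data

X = np.random.rand(100, 5)
X_reduced = pca(X, k=2)
print(X_reduced.shape)                       # (100, 2)
```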
Singular value decomposition
(https://www.youtube.com/watch?v=P5mlg91as1c)
Linear Discriminant Analysis
(https://www.youtube.com/watch?v=mtTVXZq-9gE)
FEATURE SUBSET SELECTION
• Feature selection is arguably the most critical
pre-processing activity in any machine
learning project.
• It intends to select a subset of system
attributes or features which makes a most
meaningful contribution in a machine learning
activity.
• Issues in high-dimensional data
– ‘High-dimensional’ refers to the high number of
variables or attributes or features present in certain
data sets.
– Two typical areas where high-dimensional data arises
are biomedical research, which includes gene selection
from microarray data, and text categorization, which
deals with huge volumes of text data from social
networking sites, emails, etc.
• The objective of feature selection is three-fold:
– Having faster and more cost-effective (i.e. less need
for computational resources) learning model
– Improving the efficiency of the learning model
– Having a better understanding of the underlying
model that generated the data
• Key drivers of feature selection
• Feature relevance
– Feature relevance is indicated by the information gain
from a feature measured in terms of relative entropy.
• Feature redundancy
– Feature redundancy arises when multiple features
contribute similar information, measured feature-to-feature
by (a short correlation check is sketched below):
• Correlation
• Distance (Minkowski distances, e.g. Manhattan, Euclidean,
etc., are the most popular measures)
• Other coefficient-based measures (Jaccard, SMC, cosine
similarity, etc.)
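A minimal sketch of checking feature-to-feature redundancy with a pairwise correlation matrix; the data and the 0.9 threshold are arbitrary illustrative choices.

```python
# Flag potentially redundant feature pairs via pairwise correlation.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
f1 = rng.normal(size=100)
df = pd.DataFrame({
    "f1": f1,
    "f2": f1 * 2 + rng.normal(scale=0.05, size=100),  # nearly redundant with f1
    "f3": rng.normal(size=100),                        # unrelated
})

corr = df.corr().abs()
for a in corr.columns:
    for b in corr.columns:
        if a < b and corr.loc[a, b] > 0.9:             # illustrative threshold
            print(f"{a} and {b} look redundant (|r| = {corr.loc[a, b]:.2f})")
```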
• Distance-based similarity measure
– The most common distance measure is the Euclidean distance which,
between two features F1 and F2 (each observed over n data instances),
is calculated as
d(F1, F2) = sqrt[ (F1_1 − F2_1)^2 + (F1_2 − F2_2)^2 + … + (F1_n − F2_n)^2 ]
– A more generalized form of the Euclidean distance is the Minkowski
distance, measured as
d(F1, F2) = [ Σ_i |F1_i − F2_i|^r ]^(1/r)
which reduces to the Manhattan distance for r = 1 and the Euclidean
distance for r = 2.
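A small sketch of these two distance measures between two feature vectors, using NumPy; the vectors F1 and F2 are illustrative.

```python
# Euclidean and Minkowski distances between two illustrative feature vectors.
import numpy as np

F1 = np.array([1.0, 2.0, 3.0])
F2 = np.array([2.0, 4.0, 6.0])

def minkowski(a, b, r):
    """Minkowski distance of order r; r=1 is Manhattan, r=2 is Euclidean."""
    return np.sum(np.abs(a - b) ** r) ** (1.0 / r)

print(minkowski(F1, F2, r=2))               # Euclidean
print(np.sqrt(np.sum((F1 - F2) ** 2)))      # same value, written directly
print(minkowski(F1, F2, r=1))               # Manhattan
```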
• Other similarity measures
• Jaccard index/coefficient is used as a measure of similarity between
two features.
• For two features having binary values, the Jaccard index is measured as
J = n11 / (n11 + n01 + n10)
where
n11 = number of cases where both features have value 1
n01 = number of cases where feature 1 has value 0 and feature 2 has value 1
n10 = number of cases where feature 1 has value 1 and feature 2 has value 0
• The Jaccard distance, a measure of dissimilarity between two
features, is complementary to the Jaccard index:
Jaccard distance, dJ = 1 − J
• Simple matching coefficient (SMC) is almost the same as the Jaccard
coefficient except that it also includes the number of cases (n00) where
both features have a value of 0:
SMC = (n11 + n00) / (n00 + n01 + n10 + n11)
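A short sketch computing the Jaccard index, Jaccard distance, and SMC for two illustrative binary features, following the n11/n01/n10/n00 counts defined above.

```python
# Jaccard index and simple matching coefficient for two binary features.
import numpy as np

F1 = np.array([1, 1, 0, 0, 1, 0])
F2 = np.array([1, 0, 0, 1, 1, 0])

n11 = np.sum((F1 == 1) & (F2 == 1))
n10 = np.sum((F1 == 1) & (F2 == 0))
n01 = np.sum((F1 == 0) & (F2 == 1))
n00 = np.sum((F1 == 0) & (F2 == 0))

J = n11 / (n11 + n01 + n10)                  # Jaccard index
dJ = 1 - J                                   # Jaccard distance
SMC = (n11 + n00) / (n00 + n01 + n10 + n11)  # simple matching coefficient
print(J, dJ, SMC)
```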
• Cosine similarity
– Cosine similarity measures the angle between the
x and y vectors, and is calculated as
cos(x, y) = (x · y) / (||x|| ||y||)
– Hence, if the cosine similarity has a value of 1, the
angle between x and y is 0°, which means x and y are
the same except for the magnitude. If the cosine
similarity is 0, the angle between x and y is 90°;
hence, they do not share any similarity (in the case
of text data, no term/word is common).
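A minimal sketch of computing cosine similarity between two vectors, e.g. term-count vectors of two documents; the vectors are illustrative.

```python
# Cosine similarity between two illustrative count vectors x and y.
import numpy as np

x = np.array([3, 2, 0, 5])
y = np.array([1, 0, 0, 2])

cos_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(cos_sim)   # 1.0 means same direction, 0.0 means no shared terms
```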
• Overall feature selection process
– A typical feature selection process consists of four
steps:
1. generation of possible subsets
2. subset evaluation
3. stop searching based on some stopping criterion
4. validation of the result
• Subset Generation
– The first step of any feature selection algorithm is a search
procedure which ideally should produce all possible
candidate subsets. However, for an n-dimensional data
set, 2^n subsets can be generated, so an exhaustive
search is usually impractical.
– The search may start with an empty set and keep adding
features. This search strategy is termed sequential
forward selection.
– A search may start with a full set and successively remove
features. This strategy is termed sequential backward
elimination.
– In certain cases, the search starts from both ends and adds
and removes features simultaneously. This strategy is termed
bi-directional selection.
– Each candidate subset is then evaluated and compared
with the previous best-performing subset based on a certain
evaluation criterion.
• This cycle of subset generation and evaluation
continues till a pre-defined stopping criterion
is fulfilled. Some commonly used stopping
criteria are
1. the search completes
2. some given bound (e.g. a specified number of
iterations) is reached
3. subsequent addition (or deletion) of a feature is
not producing a better subset
4. a sufficiently good subset (e.g. a subset having
better classification accuracy than the existing
benchmark) is selected
• The selected best subset is then validated.
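A rough sketch of this generate–evaluate–stop cycle using sequential forward selection; the evaluation criterion (cross-validated accuracy of a k-NN classifier on the iris data) and the stopping rule (stop when no addition improves the score) are illustrative choices, not prescribed by the text.

```python
# Sequential forward selection: start empty, greedily add the feature that
# most improves cross-validated accuracy, stop when nothing improves.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
n_features = X.shape[1]

def evaluate(subset):
    """Subset evaluation criterion: mean cross-validated accuracy."""
    model = KNeighborsClassifier(n_neighbors=5)
    return cross_val_score(model, X[:, subset], y, cv=5).mean()

selected, best_score = [], 0.0
while len(selected) < n_features:
    candidates = [f for f in range(n_features) if f not in selected]
    scores = {f: evaluate(selected + [f]) for f in candidates}
    f_best = max(scores, key=scores.get)
    if scores[f_best] <= best_score:     # stopping criterion: no improvement
        break
    selected.append(f_best)
    best_score = scores[f_best]

print("Selected features:", selected, "score:", round(best_score, 3))
```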
• Feature selection approaches
– There are four types of approaches for feature
selection:
1. Filter approach
2. Wrapper approach
3. Hybrid approach
4. Embedded approach
– In the filter approach, the feature subset is selected
based on statistical measures that assess the merits of
the features from the data itself, without involving any
learning algorithm.
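As an illustrative sketch of the filter approach, scikit-learn's SelectKBest scores each feature with a statistical measure (here mutual information) independently of any learning model and keeps the top-scoring ones; the iris data and k = 2 are arbitrary choices.

```python
# Filter-approach sketch: rank features by a statistical score, keep the top k.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_selected = selector.fit_transform(X, y)

print("Scores per feature:", selector.scores_.round(3))
print("Kept feature indices:", selector.get_support(indices=True))
```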
