CSE315: Introduction to Data Science
Feature Engineering
Week 10
Feature Engineering
Feature engineering is the process of extracting meaningful features from
raw data. We can experiment with different features based on our domain knowledge
or understanding of the data.
•There are mainly four different ways to do feature engineering:
1. Feature Transformation (FT)
2. Feature Construction
3. Feature Selection
4. Feature Extraction

1. Feature Transformation
Feature transformation is the process of modifying features to make them more suitable
for machine learning algorithms. It includes:
1. Handling missing values,
2. Handling categorical values (converting categorical features to numerical values),
3. Detecting outliers, and
4. Scaling features to a standard or common range.

Handling Missing Values

•Missing values can corrupt our data and degrade the model if they are overlooked.
•There are two main approaches to handling missing values:
 Imputation: This is like filling in the blanks with estimates. We can use the mean, median, or mode of the
surrounding values, or we can use some other logic to fill in the blanks.

•Python Code:
#file: cse315_1.py
#data file impu.csv
import pandas as pd
data1 = pd.read_csv("C:/Users/HP/Desktop/ddd/impu.csv")
impudat = data1.fillna(0)   #replace every missing value with 0
print("Before imputation", data1, sep='\n')
print("After imputation", impudat, sep='\n')
Deletion
 Deletion: We can remove the rows or columns with missing values.

User   Device   OS        Transactions
A      Mobile   NA        5
B      Mobile   Android   3
C      NA       IOS       2
D      Tablet   Android   1
E      Mobile   IOS       4

#file: cse315_del.py
#data file deletion.csv
import pandas as pd
dele = pd.read_csv("C:/Users/HP/Desktop/ddd/deletion.csv")
deldat = dele.dropna(inplace=False)   #drop every row that contains a missing value
print("Before deletion", dele, sep='\n')
print("After deletion", deldat, sep='\n')
Encoding: Handling Categorical Variable
Data can be divided into numerical (quantitative) and categorical
(qualitative). Categorical data can be divided into nominal and
ordinal data. Depending on the data type, there are different ways
to convert categorical data to numerical data. This process is
called encoding.
Encoding refers to the process of converting categorical data
into a numerical format.

One-hot encoding
•By far the most common way to represent categorical variables is one-hot
encoding, or one-out-of-N encoding, also known as dummy variables.
 One-hot encoding converts a categorical variable into separate columns, one per category, and the
presence of each category is expressed in its column by Boolean True/False or 0/1.
 This can be done with the OneHotEncoder class, but in the following example we use dummy
variables via the get_dummies function instead.
 For example, the OWN_OCCUPIED column is split into separate columns, and the value of each
new column is expressed by 0/1.
#One Hot Encoding for nominal data
import pandas as pd
df = ...   #load the dataset (elided on the slide)
df1 = pd.get_dummies(df, columns=['OWN_OCCUPIED'])
df1
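Because the DataFrame definition is elided on the slide, here is a minimal self-contained sketch; the OWN_OCCUPIED values ('Y'/'N') and the PRICE column are assumed for illustration only:

#self-contained one-hot encoding sketch (sample data is assumed)
import pandas as pd
df = pd.DataFrame({"OWN_OCCUPIED": ["Y", "N", "Y", "N"],
                   "PRICE": [105000, 98000, 150000, 87000]})
#get_dummies creates one column per category; dtype=int gives 0/1 instead of True/False
df1 = pd.get_dummies(df, columns=["OWN_OCCUPIED"], dtype=int)
print(df1)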
Example of one-hot encoding
• Example 1

• Example 2: Consider the data where fruits, their corresponding categorical values, and prices are given.

Fruit    Categorical value of fruit   Price   apple   mango   orange   price
apple    1                            5       1       0       0        5
mango    2                            10      0       1       0        10
apple    1                            15      1       0       0        15
orange   3                            20      0       0       1        20
Ordinal Encoding
Ordinal data can be converted to numbers with ordinal encoding, which preserves the order of the categories.
Example:
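The example figure is not reproduced here; as a minimal sketch, scikit-learn's OrdinalEncoder can be used, where the Size column and its category order are assumptions for illustration:

#ordinal encoding sketch (the Size categories and their order are assumed)
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
df = pd.DataFrame({"Size": ["Small", "Medium", "Large", "Medium"]})
#pass the categories in their natural order so the codes respect it
enc = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
df["Size_encoded"] = enc.fit_transform(df[["Size"]]).ravel()
print(df)   #Small -> 0.0, Medium -> 1.0, Large -> 2.0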
Data Transformation
• Data transformation is an important issue for machine learning.
• Suppose you have a dataset that contains people's age and income data.
The age range of human beings is usually from 0 to 100, and in practice
those who earn are mostly between 25 and 60 years of age.
• On the other hand, income can range from a few thousand to a few lakhs.
So it is clear that there is a big difference between the age range and the
income range. Sometimes such differences can cause the model to be biased.
• Also, scaling the data improves the performance of the model, and in
many cases the model takes less time to run.
• Many people also refer to data transformation as feature engineering.
Label Encoding
One-hot and ordinal encoders can be used for explanatory/independent
variables (x). For prediction/target variables (y), we use label encoding,
which is specially designed for output or target variables.
Label encoding is a method of processing categorical data. In this method
we express all the unique values of a categorical variable with different
numbers.
• Suppose you have a dataset where human gender is recorded as male or
female. We can express the males by 1 and the females by 2 by label
encoding.
• It is also called a dummy variable. The code below shows that the
ST_NAME variable is a categorical variable, with different road names.
• Through label encoding we have expressed each unique road name by a
different number. When the dataset is shown again, we see that all the
(unique) roads are expressed by different numbers.

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['ST_NAME'] = le.fit_transform(df['ST_NAME'])
df
Example of label encoding
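A minimal self-contained version of the ST_NAME example is sketched below; the street names are assumed for illustration:

#label encoding sketch (the street names are assumed)
import pandas as pd
from sklearn.preprocessing import LabelEncoder
df = pd.DataFrame({"ST_NAME": ["PUTNAM", "LEXINGTON", "BERKELEY", "PUTNAM"]})
le = LabelEncoder()
df["ST_NAME"] = le.fit_transform(df["ST_NAME"])
print(df)           #each unique road name becomes a different integer
print(le.classes_)  #the original names, in the order of their codes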
Handling Outliers

• Outliers are data points significantly different from the rest of the data set. They can
affect the accuracy of our model.

There are two main ways to treat outliers:


 Trimming: We can remove the outliers from the data set.
 Capping: We can replace the outliers with values within the range of the rest of the data.
If the outliers are few in number, we can trim them. But if there are many outliers, we
might cap them instead.
Several methods can be used to detect and remove outliers, including the z-score, IQR,
percentile, and Winsorization methods.
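As a minimal sketch (with assumed sample values), the IQR rule can drive both trimming and capping:

#IQR-based outlier handling sketch (sample values are assumed)
import pandas as pd
s = pd.Series([12, 14, 15, 13, 16, 14, 120])   #120 is an obvious outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
trimmed = s[(s >= lower) & (s <= upper)]        #trimming: drop the outliers
capped = s.clip(lower=lower, upper=upper)       #capping: pull them back into range
print(trimmed, capped, sep='\n')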
Mapping Function
• We can also convert data from numerical to
categorical or categorical to numerical through the
mapping function. In the following example we have
mapped Y to 1 and N to 2 of the OWN_OCCUPIED
column.
#use of mapping function
mapping = {'Y': 1, 'N': 2}
df['OWN_OCCUPIED'] = df['OWN_OCCUPIED'].map(mapping)
df
Feature scaling
Feature scaling is a process of transforming the features in a dataset
to have a common scale. This helps to prevent certain features from
dominating the model.
•There are two main types of feature scaling:
i. Standardization
ii. Normalization
Feature Scaling

Feature scaling is an essential step in the machine-learning process. By scaling the
features, we can help improve the model's performance and ensure that all features
are given a fair chance.
Standardization

Standardization subtracts each feature's mean and divides by the standard deviation.
This ensures that each feature has a mean of 0 and a standard deviation of 1.
Standardization is often used for data that follows a Gaussian distribution and with
algorithms such as linear regression.

A value is standardized as follows:

y_i = (x_i – mean) / standard deviation = (x_i – x̄) / s,

where x̄ = (Σ x_i) / n and s = sqrt( Σ (x_i – x̄)² / (n – 1) ), with the sums running over i = 1, …, n.
We can guesstimate a mean of 10.0 and a standard deviation of about 5.0. Using
these values, we can standardize the first value of 20.7 as follows:
y = (x – mean) / standard_deviation
y = (20.7 – 10) / 5 = 10.7 / 5 = 2.14
The mean and standard deviation estimates of a dataset can be more robust to
new data than the minimum and maximum.
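A minimal sketch of standardization with scikit-learn is shown below; the sample values are assumed, and note that StandardScaler divides by the population standard deviation (n) rather than the sample standard deviation (n – 1):

#standardization sketch (sample values are assumed)
import numpy as np
from sklearn.preprocessing import StandardScaler
x = np.array([[20.7], [5.0], [11.4], [8.2], [4.7]])
scaler = StandardScaler()
x_std = scaler.fit_transform(x)   #(x - mean) / standard deviation
print(scaler.mean_, x_std.ravel())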
Normalization

• Data normalization (also referred to as data pre-processing) is a basic element of
data mining. It means transforming the data, namely converting the source data into
another format that allows the data to be processed effectively. The main purpose of
data normalization is to minimize or even exclude duplicated data.
Normalization Technique
• Min-Max Normalization
• Robust Scaling Normalization
• Z-score Normalization
• Decimal Scale Normalization
• Log Scale Normalization
Min-Max Normalization
Min-max scaling is very often simply called 'normalization.' It
transforms features to a specified range, typically between 0 and 1.
The formula for min-max scaling is:
Xnormalized = (X – Xmin) / (Xmax – Xmin),
where X is a feature value to be normalized, Xmin is the minimum
feature value in the dataset, and Xmax is the maximum feature value.

Salary
64000
55000
19000
100000
75000

• Apply min-max normalization to the salary values 64000 and 55000.
Maximum salary = 100000, minimum salary = 19000
Min-Max Normalization
Applying the min-max normalization formula:

For 64000: (64000 – 19000) / (100000 – 19000) = 45000 / 81000 ≈ 0.56

For 55000: (55000 – 19000) / (100000 – 19000) = 36000 / 81000 ≈ 0.44

Min-max scaling is a good choice when:
 The approximate upper and lower bounds of the dataset are known, and the
dataset has few or no outliers
 The data distribution is unknown or non-Gaussian, and the data is
approximately uniformly distributed across the range
 Maintaining the distribution's original shape is essential
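A minimal sketch using the salary values from the slide:

#min-max normalization of the salary column
import numpy as np
from sklearn.preprocessing import MinMaxScaler
salary = np.array([[64000], [55000], [19000], [100000], [75000]])
scaler = MinMaxScaler()                        #scales to [0, 1] by default
print(scaler.fit_transform(salary).ravel())    #64000 -> ~0.56, 55000 -> ~0.44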
Robust Scaling Normalization

Both standard and robust scalers transform inputs to comparable scales. The
difference lies in how they scale raw input values. Robust scaling answers a simple
question. How far is each data point from the input’s median? More precisely, it
measures this distance in terms of the IQR using the below formula:

Scaled value = (Original value – Input's median) / Input's IQR

The scaled values will have their median and IQR set to 0 and 1, respectively. The
fact that robust scaling uses the median and IQR makes it resistant to outliers. Since
robust scaling is resilient to the influence of outliers, it is suitable for datasets with
skewed or anomalous values or with outliers.
Example: Original value = 30, input's median = 25, IQR = 20
Robust scaled value = (30 – 25) / 20 = 0.25
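A minimal sketch of robust scaling with scikit-learn; the sample values are assumed:

#robust scaling sketch (sample values are assumed)
import numpy as np
from sklearn.preprocessing import RobustScaler
x = np.array([[10], [20], [25], [30], [500]])   #500 is an outlier
scaler = RobustScaler()                          #(x - median) / IQR
print(scaler.fit_transform(x).ravel())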
Z-Score Normalization
• Apply z-score normalization to the following data: 71, 67, 87

z = (x – mean) / standard deviation = (x – x̄) / s,

where s = sqrt( Σ (x – x̄)² / (n – 1) ), x̄ = mean, x = a particular value, and n = the number of values.

Mean = (71 + 67 + 87) / 3 = 75
sd = sqrt( ((71 – 75)² + (67 – 75)² + (87 – 75)²) / (3 – 1) ) = sqrt(224 / 2) ≈ 10.58
Z-Score Normalization
After applying the formula we get:
For 71: z = (71 – 75) / 10.58 ≈ –0.3780
For 67: z = (67 – 75) / 10.58 ≈ –0.7559
For 87: z = (87 – 75) / 10.58 ≈ 1.1339
Decimal Scale Normalization
 Decimal scaling normalization aims to scale the feature values by a power of 10,
ensuring that the largest absolute value in each feature becomes less than 1. It is
useful when the range of values in a dataset is known, but the range varies
across features. The formula for decimal scaling normalization is:
Xdecimal = X / 10^d
 X is the original feature value, and d is the smallest integer such that the largest
absolute value in the feature becomes less than 1.
 For example, if the largest absolute value in a feature is 350, then d would be 3,
and the feature would be scaled by 10^3 (i.e., divided by 1000).
 Decimal scaling normalization is advantageous when dealing with datasets
where the absolute magnitude of values matters more than their specific scale.
Decimal Scale Normalization

CGPA   Formula   After Decimal Normalization
2      2/10      0.2
3      3/10      0.3

Money   Formula    After Decimal Normalization
500     500/1000   0.5
320     320/1000   0.32

Salary   Formula        After Decimal Normalization
32000    32000/100000   0.32
28000    28000/100000   0.28
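A minimal sketch of decimal scaling, reusing the salary values from the table above:

#decimal scaling sketch (reuses the salary values above)
import numpy as np
salary = np.array([32000, 28000])
#d = smallest integer such that max(|x|) / 10**d < 1
#(if max(|x|) is an exact power of 10, add 1 to d)
d = int(np.ceil(np.log10(np.max(np.abs(salary)))))
print(salary / 10 ** d)   #d = 5 -> [0.32 0.28]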
Log scaling normalization
•Log scaling normalization converts data into a logarithmic scale by taking the log of each
data point.
•It is particularly useful when dealing with data that spans several orders of magnitude. The formula
for log scaling normalization is:
• Xlog = log(X)
•This normalization comes in handy with data that follows an exponential growth or decay
pattern. It compresses the scale of the dataset, making it easier for models to capture
patterns and relationships in the data.
•Population size over the years is a good example of a dataset where some features exhibit
exponential growth. Log scaling normalization can make these features more amenable to
modelling.
•Example: X =200, Xlog =2.30103
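The example above uses a base-10 logarithm; a minimal sketch with assumed sample values:

#log scaling sketch (base-10 log, matching the X = 200 example; values are assumed)
import numpy as np
x = np.array([200, 1000, 50000, 2000000])
print(np.log10(x))   #200 -> 2.30103; large values are compressed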
Comparison of normalization (min-max scaling) and standardization

Normalization:
• Objective: bring the values of a feature within a specific range, often between 0 and 1
• Sensitive to outliers and the range of the data
• Useful when maintaining the original range is essential
• Makes no assumption about the distribution of the data
• Suitable for algorithms where the absolute values and their relations are important (e.g., k-nearest neighbors, neural networks)
• Maintains the interpretability of the original values within the specified range
• Can lead to faster convergence, especially in algorithms that rely on gradient descent
• Use cases: image processing, neural networks, algorithms sensitive to feature scales

Standardization:
• Objective: transform the values of a feature to have a mean of 0 and a standard deviation of 1
• Less sensitive to outliers due to the use of the mean and standard deviation
• Effective when algorithms assume a standard normal distribution
• Assumes a normal distribution or a close approximation
• Particularly useful for algorithms that assume normally distributed data, such as linear regression and support vector machines
• Alters the original values, making interpretation more challenging due to the shift in scale and units
• Also contributes to faster convergence, particularly in algorithms sensitive to the scale of input features
• Use cases: linear regression, support vector machines, algorithms assuming a normal distribution
Which procedure is appropriate when?
• It is difficult to give a general rule for which transformation to use; it depends on
the type of problem.
• Data scaling is very important for distance-based algorithms such as SVM, KNN,
and clustering.
• On the other hand, scaling is not very important for non-distance-based algorithms
such as Naive Bayes and various tree-based algorithms.
• Normalization brings the data onto a scale from 0 to 1. Standardization, on the
other hand, brings the data to a mean of 0 and a standard deviation of 1.
• Normalization can be used if there is a large difference in the ranges of values of
the dataset's features.
• Standardization works well if there are outliers in the data.
• However, in most cases, standardization works well overall.
Feature Construction

The process of developing new features from existing features or from our domain knowledge is known as
feature construction.
•Making the features more informative and relevant to the task helps machine learning models perform better.
Feature Construction
•There are numerous ways to build features, but some typical techniques include:
 Repurposing existing features: This is like remixing old songs. We can combine existing features in new
ways to create something new and exciting. Combine, alter, or create new features from existing ones. For
example, you could combine the features "sibsp" and "parch" in the Titanic dataset to create a new feature
called "family" (see the sketch after this list).
 Using domain expertise: This is like consulting a chef. You can use your domain understanding to create
new features essential to the task. Create new features that are important to the task based on our domain
knowledge. For example, if you are developing a model to predict customer churn, you could add a new
feature called "number of months since last purchase" if you know that customers who haven't purchased in
a while are more likely to churn.
 Using feature selection algorithms: This is like hiring a personal shopper. You can use these algorithms to
determine the most important features from a data set and then build new features based on those
features. It's like curating the best attributes to create powerful new ones!
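A minimal sketch of the Titanic "family" example from the first technique above; the sample rows and the "+ 1 for the passenger themselves" convention are assumptions for illustration:

#feature construction sketch: combine sibsp and parch into a family feature
#(sample rows are assumed; in practice these columns come from the Titanic dataset)
import pandas as pd
df = pd.DataFrame({"sibsp": [1, 0, 3], "parch": [0, 2, 1]})
df["family"] = df["sibsp"] + df["parch"] + 1   #+1 counts the passenger themselves
print(df)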
Feature Extraction

Feature extraction is a process of reducing the dimensionality of data by identifying
the most essential features.
•The more features we have, the harder it is to find the important ones. "The
curse of dimensionality occurs when the dataset contains an excessive number of
features, making it difficult for Machine Learning algorithms to identify essential
features." Dimensionality Reduction can be helpful in such a situation.
•"Dimensionality reduction is a process of simplifying a dataset by reducing the
number of features or dimensions. This can be done to improve the efficiency
and accuracy of machine learning models."
•Dimensionality reduction can be done in two ways:
 Feature Extraction
 Feature Selection
Feature Extraction
The most frequently used approaches for dimensionality reduction are:
 Principal Component Analysis(PCA)
 Linear Discriminant Analysis(LDA)
 T-distributed Stochastic Neighbor Embedding (T-SNE)
In feature selection, we choose a subset of features from the dataset to train
machine learning algorithms while maintaining the original feature distribution.
However, when using dimensionality reduction techniques such as PCA, the
original representation of the variables is altered.
Feature Extraction: Principal Component Analysis

Principal Component Analysis (PCA)
PCA is an unsupervised technique that reduces the dimensionality of data by finding the directions (called
principal components) that capture the most variance in the data.
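A minimal PCA sketch with scikit-learn; the random data is assumed for illustration:

#PCA sketch (random data is assumed)
import numpy as np
from sklearn.decomposition import PCA
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      #100 samples, 5 features
pca = PCA(n_components=2)          #keep the 2 directions with the most variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_)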
Feature Extraction: Linear Discriminant Analysis(LDA)

LDA is a supervised machine learning algorithm that seeks to find the directions in
the data that best separate the different known categories.
Feature Extraction: T-distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE is a non-linear dimensionality reduction algorithm that can be used to
separate data that a line cannot separate.
• It does this by projecting the data into a lower dimension while preserving the
clustering of the high-dimensional space.
Feature Selection
Feature selection is the process of identifying and selecting the most important features in a dataset relevant to
predicting the target variable.
There are mainly three techniques used for feature selection:
 Filter methods
 Wrapper methods
 Embedded methods.
Feature Selection: Filter methods
• Filter methods select features based on a statistical measure, and they do this by focusing on a single feature
at a time and comparing it to the other features. The selection of features is not based on a learning
algorithm.
Feature Selection: Filter Method
Filter methods can be done in the following ways:
Correlation: This method is like a couple on a blind date. They're a match made in heaven
if they have a high correlation. But if they have a low correlation, they're better off
going their separate ways.
This method calculates the correlation(using corr() ) between each feature and the target
variable. If the correlation is below a certain value, we remove that feature from
consideration, as it doesn't seem to impact the target variable significantly.
Variance threshold: If a feature has low variance, it's not very interesting, so it gets kicked
out (see the sketch after this list).
Chi-squared test: This method is like a detective. It looks for associations between two
categorical variables and determines if they're guilty of being important features.

ANOVA: This method is like a judge in a courtroom. It looks at the means of multiple
groups and decides if there's enough evidence to convict them of being essential
features.
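A minimal sketch of two of the filter methods above, using the iris dataset bundled with scikit-learn (the threshold and k values are arbitrary choices for illustration):

#filter method sketch: variance threshold and ANOVA F-test on the iris data
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif
X, y = load_iris(return_X_y=True)
#variance threshold: drop features whose variance falls below a cutoff
X_var = VarianceThreshold(threshold=0.2).fit_transform(X)
#ANOVA F-test: keep the k features most related to the target
X_best = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)
print(X_var.shape, X_best.shape)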
Feature Selection: Wrapper methods

•Filter methods are a simple way to select features, but they do not consider the relationship
between features. This can lead to the selection of features that are not relevant to the target
variable.
•The wrapper method is like a dating app for features. It takes many features out on dates
with a machine learning algorithm and then sees which ones the algorithm likes the best.
The feature that gets the most dates is the one that gets selected.
The wrapper method considers the relationship between features by training a machine learning
algorithm on a subset of features and then evaluating the algorithm's performance.
•This process is repeated for different subsets of features, and the subset that results in the
best performance is selected.
Feature Selection: Wrapper Method
Feature Selection: Wrapper methods cont.

There are several common techniques of wrapper methods, including:


 Exhaustive Feature Selection/ Best Feature Selection
 Sequential Forward Feature Selection
 Sequential Backward Feature Selection
Exhaustive Feature Selection/ Best Feature Selection
•"Exhaustive feature selection evaluates the performance of a machine learning algorithm on every
possible subset of features, and then selects the subset that results in the best performance."
•When we have n features, there are 2^n possible subsets of features, so evaluating all of them quickly
becomes computationally expensive.
•As a result of this computational expense, other feature selection methods have been developed.
Wrapper methods

Forward Feature Selection
•Forward feature selection starts with an empty set of features. It then adds the feature that most improves
the accuracy of a machine learning algorithm.
•The algorithm continues adding features until the accuracy no longer improves or until a
maximum number of features is reached.
Backward Feature Selection
•Backward feature selection is like a chef who starts with a full pantry and then throws out the
ingredients that make the dish taste the worst.
•Backward feature selection starts with all features included in the model. It then removes the feature
whose removal decreases the accuracy of the machine learning algorithm the least.
•The algorithm continues removing features until the accuracy starts to decrease or until a
minimum number of features is reached.
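A minimal sketch of forward selection with scikit-learn's SequentialFeatureSelector; the iris data and the k-NN estimator are assumed choices, and direction="backward" gives backward selection:

#sequential (forward) feature selection sketch (estimator and data are assumed)
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier
X, y = load_iris(return_X_y=True)
sfs = SequentialFeatureSelector(KNeighborsClassifier(n_neighbors=3),
                                n_features_to_select=2,
                                direction="forward")   #or "backward"
sfs.fit(X, y)
print(sfs.get_support())   #boolean mask of the selected features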
Feature Selection: Embedded methods

•Embedded methods integrate the feature selection process into the machine learning algorithm itself.
The most common types of embedded methods are:
 Regularization
 Tree-based algorithms
Regularization
Regularization in embedded methods is done by adding a regularization term to the loss function of the machine learning algorithm.
The penalty discourages large coefficients; features whose coefficients are shrunk to zero are effectively excluded from the model.
The following regularization techniques are used (see the sketch after this list):
 Lasso Regularization (L1)
 Ridge Regularization (L2)
 Elastic-Net Regularization (L1+L2)
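A minimal sketch of L1 (Lasso) regularization acting as an embedded selector; the diabetes dataset and the alpha value are assumed choices for illustration:

#Lasso (L1) regularization sketch (dataset and alpha are assumed)
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)
lasso = Lasso(alpha=1.0).fit(X, y)
#features whose coefficients are shrunk to exactly 0 are effectively dropped
print(lasso.coef_)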
Tree-Based Algorithms
Tree-based algorithms build a tree-like structure of decisions, where each decision is based on the
importance of a feature.
The model will likely include the most essential features for splitting the data (see the sketch after this
list). Tree-based algorithms that can be used in embedded methods are:
 Decision trees
 Random forests
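A minimal sketch of tree-based feature importances with a random forest; the iris data is an assumed example:

#tree-based feature importance sketch (iris data is assumed)
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(rf.feature_importances_)   #higher values = more important for splitting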
Test for feature selection
• In the case of filter-based methods, statistical tests are used to determine the strength of
correlation of the feature with the target variable. The choice of the test depends on the
data type of both input and output variable (i.e. whether they are categorical or
numerical.). You can see the most popular tests in the table below.
Test (cont.)

Input         Output        Feature Selection Model
Numerical     Numerical     Pearson's correlation coefficient
                            Spearman's rank coefficient
                            Trace ratio criterion
                            Mutual information
Numerical     Categorical   ANOVA correlation coefficient
                            Kendall's rank coefficient
Categorical   Numerical     Kendall's rank coefficient
                            ANOVA correlation coefficient
Categorical   Categorical   Chi-squared test (contingency tables)
                            Mutual information
