

Pre-processing
Feature Engineering

Data Pre-processing
• Data come from several sources.
• Raw data cannot be fed directly into a model.
• Raw, real-world data in the form of text, images, video, etc., is messy: not only may it contain errors and inconsistencies, it is often incomplete and lacks a regular, uniform structure.
• Data preprocessing is the step in the data mining and data analysis process that takes raw data and transforms it into a format that can be understood and analysed by computers and machine learning models.
• "Garbage in, garbage out."
• Types of data features: numerical features and categorical features.


Data Pre-processing Steps


• Data Quality Assessment
▪ Mismatched data types
▪ Mixed data values
▪ Data outliers
▪ Missing data
• Data Cleaning: the process of adding missing data and correcting, repairing, or removing incorrect or irrelevant data from a data set (a pandas sketch follows this list).
▪ Correcting missing data: ignore the affected tuples or fill in the missing values
▪ Fixing noisy data: noisy data includes unnecessary data points, irrelevant data, and data that are difficult to group together
• Data Transformation (Scaling):
▪ Aggregation: combine data together in a uniform format
▪ Normalisation or scaling: more on the next slides
▪ Discretisation: pools data into smaller intervals, e.g. daily, weekly, or monthly
• Data Reduction: larger amounts of data are harder to analyse
▪ Dimensionality reduction, feature elimination
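
A minimal sketch of the cleaning step, assuming a pandas DataFrame; the column names ("age", "city") and the fill strategies are illustrative assumptions, not prescribed by the slides:

import numpy as np
import pandas as pd

# Hypothetical raw data with missing values and an outlier
df = pd.DataFrame({
    "age": [25, np.nan, 32, 200],              # NaN = missing, 200 = outlier
    "city": ["Pune", "Mumbai", None, "Pune"],  # None = missing
})

# Option 1: ignore the tuples (drop rows with missing data)
df_dropped = df.dropna()

# Option 2: fill the missing data (mean for numeric, mode for categorical)
df_filled = df.copy()
df_filled["age"] = df_filled["age"].fillna(df_filled["age"].mean())
df_filled["city"] = df_filled["city"].fillna(df_filled["city"].mode()[0])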

Data Scaling
• Data scaling is an important step in pre-processing.
• Some machine learning algorithms and most deep learning architectures are sensitive to the scale of the input data.
• Before feeding data into the model, the features must be transformed so that they are unitless or share the same unit.
• This is also called data normalization.
• Some important techniques are StandardScaler, MinMaxScaler, and RobustScaler.


Data Scaling: StandardScaler


• It uses the standard normal distribution (Gaussian) to normalize the data.
• Scaled data have zero mean and unit standard deviation.
• The scaled feature is
$x_{\text{scaled}} = \dfrac{x - \mu}{\sigma}$
where $\sigma$ is the standard deviation of the sample data and $\mu$ is the mean of the sample data.
• The scaled values are unbounded: they can range from $-\infty$ to $+\infty$.
• Location: sklearn.preprocessing.StandardScaler
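
A minimal usage sketch of StandardScaler; the four-point toy feature is an illustrative assumption:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # toy single-feature data

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # applies (x - mean) / std per feature

print(X_scaled.mean())  # ~0.0
print(X_scaled.std())   # ~1.0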

Data Scaling: MinMaxScaler


• It transforms the data to lie within two defined boundary values: the lower bound (LB) and the upper bound (UB).
• LB and UB are chosen so that the model works optimally.
• The formula is
$x_{\text{scaled}} = \Delta x\,(\mathrm{UB} - \mathrm{LB}) + \mathrm{LB}$
where $\Delta x$ is computed as
$\Delta x = \dfrac{x - x_{\min}}{x_{\max} - x_{\min}}$
• For example, to scale data from 0 to 1, LB and UB should be 0 and 1, respectively.
• Location: sklearn.preprocessing.MinMaxScaler
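
A minimal usage sketch of MinMaxScaler; feature_range sets (LB, UB) and defaults to (0, 1); the toy values are illustrative:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0], [20.0], [30.0], [40.0]])  # toy single-feature data

scaler = MinMaxScaler(feature_range=(0, 1))  # (LB, UB)
X_scaled = scaler.fit_transform(X)           # maps x_min -> 0, x_max -> 1

print(X_scaled.ravel())  # [0.  0.3333...  0.6666...  1.]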


Data Scaling: RobustScaler


• Outliers significantly impact our standardization by affecting the
feature’s mean and variance.
• Scale features using statistics that are robust to outliers.
• The scaled feature is
$x_{\text{scaled}} = \dfrac{x_i - \mathrm{median}(x)}{Q_3(x) - Q_1(x)}$
where $x$ is the feature vector, $x_i$ is an individual element of $x$, and $Q_1(x)$ and $Q_3(x)$ are the 1st and 3rd quartiles. By default, sklearn's RobustScaler centres on the median and scales by the IQR.
• The IQR (interquartile range) is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).
• Location: sklearn.preprocessing.RobustScaler
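
A minimal usage sketch of RobustScaler on data containing one large outlier; the toy values are illustrative:

import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy feature with one extreme outlier (1000.0)
X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

scaler = RobustScaler()              # centres on the median, scales by the IQR
X_scaled = scaler.fit_transform(X)

# The bulk of the data lands near zero; the outlier stays extreme
# but no longer distorts the scale of the other points.
print(X_scaled.ravel())  # [-1.  -0.5  0.  0.5  498.5]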

Data Scaling: Normaliser


• Normalizer rescales the feature values of each observation to have unit norm (a total length of one). This is different from MinMaxScaler, which is also commonly referred to as normalization: the key difference is that Normalizer works on the rows, not the columns, meaning we rescale across individual observations.
• By default, L2 normalization is applied to each observation so that the values in a row have unit norm. Unit norm with L2 means that if each element were squared and summed, the total would equal one. Alternatively, L1 normalization can be applied instead of L2.


Data Scaling: Normaliser


• L1: Manhattan norm, also called the taxicab norm
$x_{\text{normed}} = \dfrac{x_i}{\lVert x \rVert_1}$, where $\lVert x \rVert_1 = \sum_{i=1}^{n} \lvert x_i \rvert$
• L2: Euclidean norm, which is the default
$x_{\text{normed}} = \dfrac{x_i}{\lVert x \rVert_2}$, where $\lVert x \rVert_2 = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$
• Complete list: https://scikit-learn.org/stable/modules/preprocessing.html
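
A minimal usage sketch of Normalizer; note that it rescales each row (observation), not each column; the toy matrix is illustrative:

import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3.0, 4.0],
              [1.0, 1.0]])  # two observations (rows)

normalizer = Normalizer(norm="l2")      # "l1" is also available
X_normed = normalizer.fit_transform(X)  # each row rescaled to unit norm

print(X_normed)                          # [[0.6 0.8] [0.707... 0.707...]]
print(np.linalg.norm(X_normed, axis=1))  # [1. 1.]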

Data Scaling: How to apply?


Should we apply scaling to
1. the entire dataset,

scaled_dataset = (dataset - mean_dataset) / stdev_dataset
train, test = split(scaled_dataset)

2. the train and test datasets separately, or

train, test = split(dataset)
scaled_train = (train - mean_train) / stdev_train
scaled_test = (test - mean_test) / stdev_test

3. the training dataset, applying the same transformation to the test dataset?

train, test = split(dataset)
scaled_train = (train - mean_train) / stdev_train
scaled_test = (test - mean_train) / stdev_train

• The idea is to keep the train and test datasets separate to avoid data leakage, so the 3rd option is the right one (a sketch follows below). The 1st option risks data leakage, and the 2nd option applies two different transformations, which is not advisable because the model would learn on one scale and predict on another.
• Data leakage occurs when information from outside the training dataset is used to create the model. This additional information can allow the model to learn or know something that it otherwise would not, and in turn it invalidates the estimated performance of the model being constructed.
• scikit-learn's Pipeline handles this automatically, but only in ML workflows; in deep learning we must apply the same transformation manually and be careful, since deep learning is highly sensitive to scaling.
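
A minimal sketch of the 3rd option with scikit-learn; the toy feature matrix, test fraction, and random seed are illustrative assumptions:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(-1, 1)  # toy single-feature matrix

# Split first, then fit the scaler on the training split only
X_train, X_test = train_test_split(X, test_size=0.25, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from train
X_test_scaled = scaler.transform(X_test)        # reuses the train statistics

Because the scaler never sees the test split during fitting, no information can leak from test to train.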
