Preprocessing
Feature Engineering
Data Pre-processing
• Data come from several sources
• Raw data can’t be fed into the model
• Raw, real-world data in the form of text, images, video, etc., is messy.
• Not only may it contain errors and inconsistencies, but it is often
incomplete, and doesn’t have a regular, uniform design.
• Data preprocessing is a step in the data mining and data analysis
process that takes raw data and transforms it into a format that can
be understood and analyzed by computers and machine learning models.
• “Garbage-in, garbage-out”
• Types of data features: numerical features and categorical features
1
11-08-2024
Data Scaling
• Data scaling is an important step in pre-processing.
• Some machine learning algorithms and most deep learning
architectures are sensitive to the scale of the input data.
• Before the data are fed into the model, the features must be transformed
so that they are unitless or all expressed in the same unit.
• This is also called data normalization.
• Some important techniques are StandardScaler and MinMaxScaler
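As a minimal sketch of what these two techniques compute, the plain-Python functions below standardize a feature column (zero mean, unit standard deviation) and rescale it to the range [0, 1]. The names `standard_scale` and `min_max_scale` and the sample values are illustrative assumptions, not the scikit-learn API; StandardScaler and MinMaxScaler apply the same formulas to each feature column.

```python
from statistics import fmean, pstdev


def standard_scale(values):
    """Standardize: subtract the mean, divide by the population std."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # Constant feature: every value equals the mean, so return zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def min_max_scale(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant feature: no range to stretch, so return zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical feature column (for example, heights in cm).
heights = [150.0, 160.0, 170.0, 180.0, 190.0]

standardized = standard_scale(heights)  # mean 0, standard deviation 1
rescaled = min_max_scale(heights)       # values between 0 and 1
```

After either transformation the values no longer carry the original unit, which is what lets features measured on different scales be compared.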
$$\|x\|_1 = \sum_{i=1}^{n} |x_i|$$
L2: Euclidean norm, which is the default
$$x_{\text{normed}} = \frac{x_i}{\|x\|_2}$$
Where:
$$\|x\|_2 = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$$
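The two norms and the normalization step can be sketched in plain Python as follows. The function names and the sample vector are assumptions made for illustration; this is not a library implementation.

```python
import math


def l1_norm(x):
    """L1 norm: sum of the absolute values of the components."""
    return sum(abs(v) for v in x)


def l2_norm(x):
    """L2 (Euclidean) norm: square root of the sum of squares."""
    return math.sqrt(sum(v * v for v in x))


def normalize(x, norm="l2"):
    """Divide every component of one sample by the chosen norm."""
    if norm == "l2":
        length = l2_norm(x)
    elif norm == "l1":
        length = l1_norm(x)
    else:
        raise ValueError("norm must be 'l1' or 'l2'")
    if length == 0:
        # A zero vector has no direction; leave it unchanged.
        return list(x)
    return [v / length for v in x]


# Hypothetical sample with two features.
sample = [3.0, -4.0]

unit_l2 = normalize(sample)             # L2 is the default here as well
unit_l1 = normalize(sample, norm="l1")
```

Unlike feature scaling, this kind of normalization works on each sample (row) separately, so the result for one sample does not depend on any other sample.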
• Complete List: https://scikit-learn.org/stable/modules/preprocessing.html
3. Transform the training dataset and apply the same transformation on the test dataset?
scaled_train = (train - mean_train) / stdev_train
scaled_test = (test - mean_train) / stdev_train
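A minimal runnable sketch of this third option, in plain Python, is given here. The helper names `fit_scaler` and `transform` and the sample values are hypothetical; the point is only that the mean and standard deviation are computed from the training data and then reused unchanged for the test data.

```python
from statistics import fmean, pstdev


def fit_scaler(train_values):
    """Learn the scaling statistics from the training data only."""
    return fmean(train_values), pstdev(train_values)


def transform(values, mean, stdev):
    """Apply previously learned statistics to any dataset.

    Assumes stdev > 0, i.e. the training feature is not constant.
    """
    return [(v - mean) / stdev for v in values]


# Hypothetical single-feature train/test split.
train = [10.0, 20.0, 30.0, 40.0]
test = [25.0, 50.0]

mean_train, stdev_train = fit_scaler(train)

scaled_train = transform(train, mean_train, stdev_train)
# The test set is scaled with the TRAINING statistics, never its own.
scaled_test = transform(test, mean_train, stdev_train)
```

The scaled test set will generally not have zero mean or unit standard deviation. That is expected, because no statistic of the test set was used to build the transformation.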
• The idea is to keep the train and test datasets separate to avoid data leakage, so the 3rd
option is the one to use. The 1st option risks data leakage, and the 2nd option applies two
different transformations, which is not advisable because the model would learn on one scale
and predict on another.
• Data Leakage: It occurs when information from outside the training dataset is used to create
the model. This additional information can allow the model to learn or know something that
it otherwise would not know and in turn invalidate the estimated performance of the model
being constructed.
• Pipeline works only in ML. Take care with scaling in DL, because DL is highly sensitive to the scale of the input data.