Session 2 - CO3 - Introduction to Data Preprocessing
CO - 3 Session - 15
AIM
To familiarize the students with the concept of data preprocessing and various data preprocessing
techniques.
INSTRUCTIONAL OBJECTIVES
LEARNING OUTCOMES
WHY PREPROCESS DATA?
WHY IS DATA DIRTY?
WHY IS DATA PREPROCESSING IMPORTANT IN THE MACHINE LEARNING CONTEXT?
No quality data, no quality results!
Quality decisions must be based on quality data
e.g., duplicate or missing data may cause incorrect or even
misleading statistics.
In machine learning, preprocessing is a crucial step to
ensure that the data is in a format that the algorithm can
understand and that it is free of errors or outliers that can
negatively impact the model's performance.
MAJOR TASKS IN DATA PREPROCESSING
MAJOR TASKS IN DATA PREPROCESSING
Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains reduced representation in volume but produces the same
or similar analytical results
ILLUSTRATION OF MAJOR TASKS IN DATA
PREPROCESSING
DATA CLEANING
MISSING DATA
HOW TO HANDLE MISSING DATA?
Ignore the tuple: usually done when the class label is missing (assuming a classification
task); not effective when the percentage of missing values per attribute varies considerably.
Fill in the missing value manually: tedious and often infeasible.
Fill it in automatically with (see the sketch below):
a global constant, e.g., "unknown" (which may effectively create a new class)
the attribute mean
the attribute mean for all samples belonging to the same class (smarter)
the most probable value
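A minimal sketch of these strategies, assuming pandas and made-up column names ("income", "class"), not taken from the slides:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [50_000, np.nan, 72_000, np.nan, 61_000, 58_000],
    "class":  ["A", "A", "B", None, "B", "A"],
})

# 1. Ignore the tuple: drop rows whose class label is missing.
df_labeled = df.dropna(subset=["class"])

# 2. Fill with a global constant such as "unknown".
df_const = df.fillna({"income": -1, "class": "unknown"})

# 3. Fill with the attribute mean.
df_mean = df.assign(income=df["income"].fillna(df["income"].mean()))

# 4. Fill with the attribute mean of samples in the same class (smarter).
df_class_mean = df_labeled.assign(
    income=df_labeled.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))
)
print(df_class_mean)
```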
HOW TO HANDLE NOISY DATA?
Binning: This method smooths noisy data. First, the data is sorted, and then the sorted
values are distributed into bins. There are three ways to smooth the values in a bin (see the
sketch after this list).
Smoothing by bin means: each value in a bin is replaced by the mean value of the bin.
Smoothing by bin medians: each value in a bin is replaced by the median value of the bin.
Smoothing by bin boundaries: the minimum and maximum values of the bin are taken as
the bin boundaries, and each value is replaced by its closest boundary value.
Regression: Data can be smoothed by fitting it to a regression function; this also helps to
identify which variables are relevant for the analysis.
Clustering: This is used for detecting outliers and for grouping the data; values that fall
outside all clusters can be treated as noise or outliers.
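A small sketch of equal-frequency binning with the three smoothing rules above; the price values are assumed example data:

```python
import numpy as np

prices = np.sort(np.array([15, 4, 8, 21, 21, 24, 25, 28, 34]))  # step 1: sort
bins = np.array_split(prices, 3)                                 # step 2: 3 equal-frequency bins

by_mean     = [np.full(len(b), b.mean())     for b in bins]
by_median   = [np.full(len(b), np.median(b)) for b in bins]
by_boundary = [
    # snap each value to whichever bin boundary (min or max) is closer
    np.where(b - b.min() <= b.max() - b, b.min(), b.max())
    for b in bins
]

print([b.tolist() for b in by_boundary])  # first bin [4, 8, 15] -> [4, 4, 15]
```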
DATA TRANSFORMATION: BINNING
DATA INTEGRATION
Data integration:
Combines data from multiple sources into a coherent store
Entity identification problem: identify the same real-world entity across sources, e.g., Bill Clinton = William Clinton
Detecting and resolving data value conflicts: attribute values for the same entity may differ across sources
Possible reasons: different representations, different scales, e.g., metric vs. British units
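A hedged sketch of integrating two sources into one coherent store with pandas; the table names, column names (cust_id vs. customer_no) and the unit mismatch are illustrative assumptions:

```python
import pandas as pd

source_a = pd.DataFrame({"cust_id": [1, 2], "height_cm": [170.0, 182.0]})
source_b = pd.DataFrame({"customer_no": [1, 3], "height_in": [66.9, 69.3]})

# Schema integration: map equivalent attributes onto a single name.
source_b = source_b.rename(columns={"customer_no": "cust_id"})

# Resolve value conflicts: convert British units (inches) to metric (cm).
source_b["height_cm"] = source_b.pop("height_in") * 2.54

# Combine both sources into one store, keeping customers from either side.
merged = pd.concat([source_a, source_b]).drop_duplicates(subset="cust_id")
print(merged)
```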
DATA TRANSFORMATION
DATA TRANSFORMATION
Smoothing:
With the help of algorithms, we can remove noise from the dataset, which helps in
identifying its important features. Smoothing makes even small but meaningful changes
easier to detect, which helps in prediction.
Aggregation:
In this method, the data is stored and presented in the form of a summary (see the sketch
at the end of this slide).
Discretization:
The continuous data is split into intervals, which reduces the data size. For example, rather
than specifying the exact class time, we can use an interval like 3 pm-5 pm or 6 pm-8 pm.
Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For
example, the attribute "street" can be generalized to "country".
Normalization:
It is the method of scaling the data so that it can be represented in a smaller range, for
example from -1.0 to 1.0.
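An illustrative sketch (assumed data) of aggregation and discretization with pandas:

```python
import pandas as pd

records = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb"],
    "sales": [120, 80, 150, 95],
    "class_hour": [15, 18, 16, 19],   # class start time on a 24-hour clock
})

# Aggregation: present the data as a per-month summary instead of raw rows.
monthly_sales = records.groupby("month")["sales"].sum()

# Discretization: replace the continuous hour with intervals such as 3-5 pm and 6-8 pm.
records["time_slot"] = pd.cut(
    records["class_hour"], bins=[15, 17, 20],
    labels=["3pm-5pm", "6pm-8pm"], include_lowest=True,
)
print(monthly_sales, records, sep="\n\n")
```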
DATA TRANSFORMATION: NORMALIZATION
Z-score normalization (μ_A: mean, σ_A: standard deviation of attribute A):

v' = (v − μ_A) / σ_A

Example: let μ_A = 54,000 and σ_A = 16,000. Then v = 73,600 is normalized to
v' = (73,600 − 54,000) / 16,000 = 1.225
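A short NumPy sketch of z-score (and, for comparison, min-max) normalization; the salary values other than 54,000 and 73,600 are assumed:

```python
import numpy as np

salaries = np.array([12_000.0, 54_000.0, 73_600.0, 98_000.0])  # assumed example values

# Min-max normalization to the range [0, 1].
min_max = (salaries - salaries.min()) / (salaries.max() - salaries.min())

# Z-score normalization using the slide's given parameters.
mu, sigma = 54_000.0, 16_000.0
z_scores = (salaries - mu) / sigma
print(min_max)
print(z_scores)   # 73,600 -> (73,600 - 54,000) / 16,000 = 1.225
```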
DATA REDUCTION
Dimensionality reduction:
In this method, irrelevant features are removed
Avoid the curse of dimensionality
Reduce time and space required during the learning process
Allow easier visualization
Techniques for dimensionality reduction are principal component analysis, feature selection, and wavelet
transforms
Numerosity reduction:
In this method, the data representation is made smaller by reducing the volume while
losing as little information as possible
Techniques for numerosity reduction are parametric methods (regression, log-linear models)
and non-parametric methods (histograms, clustering, sampling)
Data compression:
Data compression encodes the data in a reduced, compressed form
Compression can be lossless or lossy: when no information is lost during compression, it is
called lossless compression, whereas lossy compression discards some information, ideally
only information that is not needed for the analysis
DATA REDUCTION
DATA REDUCTION: FEATURE SELECTION
PCA – SCREE PLOT
PCA – APPLICATIONS, CONSIDERATIONS, AND
LIMITATIONS
Applications of PCA
Dimensionality reduction and feature extraction: PCA can simplify complex datasets by
reducing the number of variables while retaining meaningful information.
Data preprocessing: PCA can be used to preprocess data before applying other machine
learning algorithms to improve their performance.
Noise reduction: PCA can remove noise and extract signals from data.
Visualization: PCA can help visualize high-dimensional data in a lower-dimensional space.
Considerations and Limitations
PCA assumes linearity and normality in the data.
Outliers can affect the results of PCA.
Scaling and standardization of variables are important to avoid the dominance of certain
features.
PCA is a linear technique and may not capture nonlinear relationships in the data.
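A hedged sketch of PCA for dimensionality reduction with scikit-learn; the dataset (iris) and the choice of two components are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Standardize first so that no single feature dominates the components.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

# The explained-variance ratios are what a scree plot visualizes.
print(pca.explained_variance_ratio_)
print(X_reduced.shape)   # (150, 2): 4 original features reduced to 2 components
```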
DATA SPLITTING
DATA SPLITTING
Cross-Validation
Purpose: To ensure the model performs well on different subsets of the data.
Method:
The dataset is split into k equally sized folds.
The model is trained on k-1 folds and tested on the remaining fold.
This process is repeated k times with each fold used exactly once as the test data.
Common values for k are 5 or 10.
Stratified Splitting
Purpose: To maintain the same proportion of classes in each subset as in the
original dataset, particularly useful for imbalanced datasets.
Method:
Similar to train-test split but ensures the distribution of classes is consistent across
training and testing sets.
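A minimal sketch of these splitting schemes with scikit-learn; the dataset (iris) and the model (logistic regression) are assumptions for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)

# Stratified train-test split: class proportions are preserved in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# k-fold cross-validation (k = 5): train on 4 folds, test on the held-out fold,
# repeated so that every fold is used exactly once as the test set.
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=StratifiedKFold(n_splits=5))
print(scores, scores.mean())
```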
DATA BATCHING
DATA SHUFFLING
OVERFITTING AND UNDERFITTING
Overfitting occurs when a model learns the training data too well, capturing
noise and details that are not relevant to the general problem. This leads to a
model that performs exceptionally well on training data but poorly on new,
unseen data.
Causes of Overfitting:
Complex Models: Models with too many parameters relative to the number of
training samples.
Insufficient Data: Too little training data can cause the model to learn the noise and
random fluctuations.
Too Many Features: Using too many irrelevant features can lead to overfitting.
Indicators of Overfitting:
High accuracy on training data.
Low accuracy on validation/testing data.
Large gap between training and validation/testing performance.
OVERFITTING AND UNDERFITTING
Solutions to Overfitting:
Simplify the Model: Use fewer parameters or features.
Regularization: Techniques like L1 (Lasso) or L2 (Ridge) regularization add a
penalty for large coefficients.
Cross-Validation: Use techniques like k-fold cross-validation to ensure the model
generalizes well.
Pruning (for Decision Trees): Reduce the size of the tree by removing sections
that provide little predictive power.
Early Stopping (for Neural Networks): Stop training when the performance on a
validation set starts to degrade.
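An illustrative sketch (synthetic data, assumed settings) of two of these remedies: L2 regularization with Ridge and early stopping for a small neural network:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=200)   # only the first feature matters

# L2 regularization: alpha penalizes large coefficients, discouraging overfitting.
ridge = Ridge(alpha=1.0).fit(X, y)

# Early stopping: hold out 10% of the training data and stop when its score degrades.
mlp = MLPRegressor(hidden_layer_sizes=(32,), early_stopping=True,
                   validation_fraction=0.1, max_iter=2000, random_state=0).fit(X, y)
print(ridge.coef_[:3])
print(mlp.n_iter_)   # fewer iterations than max_iter when training stops early
```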
OVERFITTING AND UNDERFITTING
Causes of Underfitting:
Too Simple Models: Models that are not complex enough to capture the data's
underlying structure.
Insufficient Training: Not training the model for enough epochs or iterations.
Poor Feature Selection: Using features that do not capture the underlying
patterns.
Indicators of Underfitting:
Low accuracy on both training and validation/testing data.
Minimal gap between training and validation/testing performance.
OVERFITTING AND UNDERFITTING
Solutions to Underfitting:
Increase Model Complexity: Use a more complex model with more parameters.
Train Longer: Train the model for more epochs or iterations.
Feature Engineering: Create better features or use more relevant data.
Reduce Regularization: If using regularization techniques, reduce the penalty to
allow the model to better fit the data.
SUMMARY
Data preprocessing techniques are responsible for converting raw data into
a format understandable by the machine learning algorithms.
BOOKS
Textbooks:
• Best Practices in Data Cleaning: A Complete Guide to Everything You Need to Do Before and
After Collecting Your Data, 1st Edition, Jason W. Osborne, 2012
• Data Cleaning, 1st Edition, Ihab F Ilyas and Xu Chu, 2019
• Data mining concepts and techniques, 3rd Edition, Jiawei Han, Micheline Kamber, and Jian Pei
Reference Books :
• Bad Data Handbook, 1st Edition, Q. E McCallum, 2012
• Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists,
1st Edition, Alice Zheng and Amanda Casari, 2018
• Introduction to data mining, Pang-Ning Tan and Michael Steinbach
WEBLINKS
1. https://www.scaler.com/topics/data-science/data-preprocessing/
2. https://neptune.ai/blog/data-preprocessing-guide
3. https://www.analyticsvidhya.com/blog/2021/08/data-preprocessing-in-data-mining-a-hands-on-guide/#:~:text=Data%20preprocessing%20is%20the%20process,learning%20or%20data%20mining%20algorithms
4. https://wikidocs.net/185539
5. https://www.turing.com/kb/guide-to-principal-component-analysis
Self-Assessment Questions
TERMINAL QUESTIONS
1. You are given a dataset with several missing values scattered across different features.
How would you decide on the best strategy to handle these missing values? Discuss the
considerations and techniques you would use.
2. What techniques would you use to reduce noise in your data? Explain how these
techniques can improve model performance with an example.
3. How would you preprocess data that needs to be aggregated over certain periods or
categories? Provide an example where data aggregation is necessary before modeling.
4. Why is it important to normalize your data before applying machine learning algorithms?
Describe a scenario where normalization significantly improves the model performance.
5. Explain the difference between standard scaling and min-max scaling. Provide an example
where one might be preferred over the other.
6. Discuss the importance of dimensionality reduction in data preprocessing. How would you
apply PCA (Principal Component Analysis) to a high-dimensional dataset? What are the
potential pitfalls of using PCA?
7. What factors would you consider when splitting your data into training and testing sets?
THANK YOU