
DATA PREP & CLEANING FOR MACHINE LEARNING

AN OVERVIEW

Why?
Data Preparation & Cleaning is an extremely important part of the overall Machine Learning process, one that must be considered before ever looking to build or train a model.

A common phrase in Machine Learning is...

"Garbage in... Garbage out!"

If the data isn't clean or isn't prepared in an appropriate way, even the fanciest of algorithms or models will struggle to learn.

On top of this, ensuring the data is clean can actually be one of the biggest boosters of model performance and accuracy!
8-Step Checklist
This 8-Step Data Preparation & Cleaning checklist will ensure you're always giving your ML model the best chance to learn & perform!

1. Missing Values
2. Duplicate & Low Variation Data
3. Incorrect & Irrelevant Data
4. Categorical Data
5. Outliers
6. Feature Scaling
7. Feature Engineering/Selection
8. Validation Split

Note: Which steps are applicable can depend on the data you're using, the problem you're solving, and on the type of model you're applying! However, in the vast majority of cases you can't go wrong at least considering these 8 steps!
1. Missing Values

In many cases, your chosen model or algorithm simply won't know how to process missing values - and you will be returned an error.

Even if the missing values don't lead to an error, you always want to ensure that you are passing the model the most useful information to learn from - and you should consider whether missing values meet that criterion.

The two most common approaches for dealing with missing values are:

Removal: Often you will simply remove any observations (rows) where one or more missing values are present. You can also remove entire columns if little or no information is present.

Imputation: This is where you input or "impute" replacement values where they were originally missing. This can be based upon the column mean, median, or most common value - or upon more advanced approaches that take other present data points into account to estimate what the missing value might be!
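As an illustrative sketch of both approaches in pandas (the DataFrame and its columns here are hypothetical, and the median is just one of the imputation options mentioned above):

```python
import numpy as np
import pandas as pd

# Hypothetical house price data containing missing values
df = pd.DataFrame({"price": [200000, np.nan, 350000, 410000],
                   "bedrooms": [3, 2, np.nan, 4]})

# Removal: drop any rows with one or more missing values
df_removed = df.dropna()

# Imputation: fill each column's missing values with that column's median
df_imputed = df.fillna(df.median(numeric_only=True))
```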
2. Duplicate & Low Variation Data

When looking through your data, don't just look for missing values - keep an eye out for duplicate data, or data that has low variation.

Duplicate data is most commonly rows of data that are exactly the same across all columns. These duplicate rows do not add anything to the learning process of the model or algorithm, but do add storage & processing overhead.

In the vast majority of cases you can remove duplicate rows prior to training your model.

Low variation data is where a column in your dataset contains only one (or few) unique value(s).

An example: There is a column in a house price dataset called "property_type", and every row in this column has the value "house". This column won't add any value to the learning process, so it can be removed.
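A minimal pandas sketch of both checks, using made-up data that mirrors the "property_type" example:

```python
import pandas as pd

# Hypothetical house price data with a duplicate row and a single-valued column
df = pd.DataFrame({"property_type": ["house", "house", "house", "house"],
                   "price": [200000, 200000, 350000, 410000],
                   "bedrooms": [3, 3, 2, 4]})

# Remove rows that are exactly the same across all columns
df = df.drop_duplicates()

# Find and drop low variation columns (here: only one unique value)
single_valued = [col for col in df.columns if df[col].nunique() <= 1]
df = df.drop(columns=single_valued)
```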
3. Incorrect & Irrelevant Data

Irrelevant data is anything that isn't related specifically to the problem you're looking to solve. For example, if you're predicting house prices, but your dataset contains commercial properties as well - these would need to be removed.

Incorrect data can be hard to spot! An example could be looking for values that shouldn't be possible, such as negative house prices.

For categorical or text variables you should spend time analysing the unique values within a column.

For example, suppose you were predicting car prices and had a column in the dataset called "car_colour". Upon inspection you find values of "Orange", "orange", and "orang" are all present. These need to be rectified prior to training, otherwise you will limit the potential learning that can take place!

Always explore your data thoroughly!
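A short pandas sketch of this kind of inspection and clean-up (the data and the "orang" fix mirror the car colour example above):

```python
import pandas as pd

# Hypothetical car price data with inconsistent colours and an impossible price
df = pd.DataFrame({"car_colour": ["Orange", "orange", "orang", "Blue"],
                   "price": [5000, 7000, -100, 9000]})

# Explore the unique values within the categorical column
print(df["car_colour"].value_counts())

# Rectify inconsistent casing and misspellings
df["car_colour"] = df["car_colour"].str.lower().replace({"orang": "orange"})

# Remove values that shouldn't be possible, e.g. negative prices
df = df[df["price"] > 0]
```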


4. Categorical Data

Generally speaking, ML models like being fed numerical data. They are less fond of categorical data!

Categorical data is anything that is listed as groups, classes, or text. A simple example would be a column for gender which contains values of either "Male" or "Female".

Your model or algorithm won't know how to assign numerical importance to these values, so you often want to turn these groups or classes into numerical values.

A common approach is called One Hot Encoding, where you create new columns, one for each unique class in your categorical column. You fill these new columns with values of 1 or 0 depending on which is true for each observation.

Other encoding techniques you can consider are: Label Encoding, Binary Encoding, Target Encoding, Ordinal Encoding, & Feature Hashing.
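As a minimal sketch, One Hot Encoding the gender example with pandas (pd.get_dummies is one common way to do this; scikit-learn's OneHotEncoder is another):

```python
import pandas as pd

# Hypothetical data with a categorical gender column
df = pd.DataFrame({"gender": ["Male", "Female", "Female", "Male"],
                   "age": [34, 28, 45, 52]})

# One Hot Encoding: one new 1/0 column per unique class
df_encoded = pd.get_dummies(df, columns=["gender"], dtype=int)
print(df_encoded)
```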
5. Outliers

There is no formal definition of an outlier. You can think of them as any data point that is very different to the majority.

How you deal with outliers is dependent on the problem you are solving, and the model you are applying. For example, if your data contained one value that was 1000x any other, this could badly affect a Linear Regression model, which tries to generalise a rule across all observations. A Decision Tree would be unaffected however, as it deals with each observation independently.

In practice, outliers are commonly isolated using the number of standard deviations from the mean, or a rule based upon the interquartile range.

In cases where you want to mitigate the effects of outliers, you may look to simply remove any observations (rows) that contain outlier values in one or more of the columns, or you may look to replace their values to reduce their effect.

Always remember - just because a value is very high, or very low, that does not mean it is wrong to include it.
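A minimal sketch of the interquartile range rule mentioned above (the 1.5 multiplier is the conventional choice, and the data is made up):

```python
import pandas as pd

# Hypothetical prices containing one extreme value
df = pd.DataFrame({"price": [200000, 210000, 195000, 205000, 5000000]})

# Interquartile range rule: flag anything outside q1 - 1.5*IQR and q3 + 1.5*IQR
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Removal approach: keep only the observations within the bounds
df_no_outliers = df[df["price"].between(lower, upper)]
```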
6. Feature Scaling

Feature Scaling is where you force all the values from a column in your data to exist on the same scale. In certain scenarios it will help the model assess the relationships between variables more fairly, and more accurately.

The two most common scaling techniques are:

Standardisation: rescales all values to have a mean of 0 and a standard deviation of 1. In other words, almost all of your values end up between -4 and +4.

Normalisation: rescales data so that it exists in a range between 0 and 1.

Feature Scaling is essential for distance-based models such as k-means or k-nearest-neighbours.

Feature Scaling is recommended for any algorithms that utilise Gradient Descent, such as Linear Regression, Logistic Regression, and Neural Networks.

Feature Scaling is not necessary for tree-based algorithms such as Decision Trees & Random Forests.
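A minimal sketch of both techniques using scikit-learn (the column names and values are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on very different scales
df = pd.DataFrame({"area": [50, 80, 120, 200],
                   "price": [150000, 220000, 310000, 500000]})

# Standardisation: rescale to a mean of 0 and a standard deviation of 1
standardised = StandardScaler().fit_transform(df)

# Normalisation: rescale into the range 0 to 1
normalised = MinMaxScaler().fit_transform(df)
```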
7. Feature Engineering & Selection

Feature Engineering is the process of using domain knowledge to supplement or transform the original feature set.

The key to good Feature Engineering is to create or refine features that the algorithm or model can understand better, or that it will find more useful than the raw features for solving the particular problem at hand.

Feature Selection is where you keep only a subset of the most informative variables. This can be done using human intuition, or dynamically based upon statistical analysis.

A smaller feature set can lead to improved model accuracy through reduced noise. It can mean a lower computational cost, and improved processing speed. It can also make your models easier to understand & explain to stakeholders & customers!
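As an illustrative sketch, engineering one feature by hand and then selecting features statistically with scikit-learn's SelectKBest (the dataset, the derived "property_age" feature, the reference year, and the choice of k are all hypothetical):

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Hypothetical house price data
df = pd.DataFrame({"area": [50, 80, 120, 200, 90],
                   "bedrooms": [1, 2, 3, 5, 2],
                   "year_built": [1990, 1985, 2005, 2018, 2000],
                   "price": [150000, 220000, 310000, 500000, 240000]})

# Feature Engineering: derive a feature the model may find more useful
# (2024 is an assumed reference year for this example)
df["property_age"] = 2024 - df["year_built"]

# Feature Selection: keep the k features most related to the target
X = df.drop(columns=["price", "year_built"])
y = df["price"]
selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)
print(list(X.columns[selector.get_support()]))
```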
8. Validation Split

The Validation Split is where you partition your data into a training set and a validation set (and sometimes a test set as well).

You train the model with the training set only. The validation and/or test sets are held back from training and are used to assess model performance. They provide a true understanding of how accurate predictions are on new or unseen data.

An approach called k-fold cross validation can provide you with an even more robust understanding of model performance.

Here, the entire dataset is again partitioned into training & validation sets, and the model is trained and assessed like before. However, this process is done multiple times, with the training and validation sets being rotated to encompass different sets of observations within the data. You do this k times - with your final predictive accuracy assessment being based upon the average of each of the iterations.
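A minimal sketch of both approaches with scikit-learn (the synthetic data, the model choice, and the 80/20 split are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic regression data standing in for a real dataset
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)

# Validation split: train on 80% of the data, assess on the held-back 20%
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print("Validation R^2:", model.score(X_valid, y_valid))

# k-fold cross validation: rotate the validation set k times, then average
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print("Mean R^2 across 5 folds:", scores.mean())
```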
Want to land an incredible role in the exciting, future-proof, and lucrative field of Data Science?

LEARN THE RIGHT SKILLS

A curriculum based on input from hundreds of leaders, hiring managers, and recruiters

https://data-science-infinity.teachable.com
BUILD YOUR PORTFOLIO

Create your professionally made portfolio site that includes 10 pre-built projects

https://data-science-infinity.teachable.com
EARN THE CERTIFICATION

Prove your skills with the DSI Data Science Professional Certification

https://data-science-infinity.teachable.com
LAND AN AMAZING ROLE

Get guidance & support based upon hundreds of interviews at top tech companies

https://data-science-infinity.teachable.com

Taught by former Amazon & Sony PlayStation Data Scientist Andrew Jones

What do DSI students say?

"I had over 40 interviews without an offer. After DSI I quickly got 7 offers including one at KPMG and my amazing new role at Deloitte!"
- Ritesh

"The best program I've been a part of, hands down"
- Christian

"DSI is incredible - everything is taught in such a clear and simple way, even the more complex concepts!"
- Arianna

"I got it! Thank you so much for all your


advice & help with preparation - it truly
gave me the confidence to go in and
land the job!"
- Marta
"I've taken a number of Data Science
courses, and without doubt, DSI is the
best"
- William

"One of the best purchases towards


learning I have ever made"
- Scott

"I learned more than on any other


course, or reading entire books!"
- Erick

"I started a bootcamp last summer


through a well respected University, but I
didn't learn half as much from them!"
- GA
"100% worth it, it is amazing. I have
never seen such a good course and I
have done plenty of them!"
- Khatuna

"This is a world-class Data Science


experience. I would recommend this
course to every aspiring or professional
Data Scientist"
- David

"Andrew's guidance with my Resume &


throughout the interview process helped
me land my amazing new role (and at a
much higher salary than I expected!)"
- Barun

"DSI is a fantastic community & Andrew


is one of the best instructors!"

- Keith
"I'm now at University, and my Data
Science related subjects are a piece of
cake after completing this course!

I'm so glad I enrolled!" - Jose

"In addition to the great content,


Andrew's dedication to the growing DSI
community is amazing"
- Sophie

"The course has such high quality


content - you get your ROI even from the
first module"
- Donabel

"The Statistics 101 section was awesome!


I have now started to get confidence in
Statistics!"
- Shrikant
"I can't emphasise how good this
programme is...well worth the
investment!"
- Dejan

Come and join the hundreds & hundreds of other students getting the results they want!

https://data-science-infinity.teachable.com
