
DATA PREP & CLEANING FOR MACHINE LEARNING

AN OVERVIEW

Why?
Data Preparation & Cleaning is an extremely important part of the overall Machine Learning process, one that must be considered before ever looking to build or train a model.

A common phrase in Machine Learning is...

"Garbage in... Garbage out!"

If the data isn't clean or isn't prepared in an appropriate way, even the fanciest of algorithms or models will struggle to learn.

On top of this, ensuring the data is clean can actually be one of the biggest boosters of model performance and accuracy!
8-Step Checklist
This 8-Step Data Preparation & Cleaning checklist will ensure you're always giving your ML model the best chance to learn & perform!

1. Missing Values
2. Duplicate & Low Variation Data
3. Incorrect & Irrelevant Data
4. Categorical Data
5. Outliers
6. Feature Scaling
7. Feature Engineering/Selection
8. Validation Split

Note: Which steps are applicable can depend on the data you're using, the problem you're solving, and on the type of model you're applying! However, in the vast majority of cases you can't go wrong at least considering these 8 steps!
1. Missing Values

In many cases, your chosen model or algorithm simply won't know how to process missing values - and you will be returned an error.

Even if the missing values don't lead to an error, you always want to ensure that you are passing the model the most useful information to learn from - and you should consider whether missing values meet that criterion.

The two most common approaches for dealing with missing values are:

Removal: Often you will simply remove any observations (rows) where one or more missing values are present. You can also remove entire columns if little or no information is present.

Imputation: This is where you input or "impute" replacement values where they were originally missing. This can be based upon the column mean, median, or most common value - or upon more advanced approaches that take other present data points into account to estimate what the missing value might be!
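As an illustrative sketch of both approaches in pandas (the DataFrame and its columns here are hypothetical, and the median is just one of the imputation options mentioned above):

```python
import numpy as np
import pandas as pd

# Hypothetical house price data containing missing values
df = pd.DataFrame({"price": [200000, np.nan, 350000, 410000],
                   "bedrooms": [3, 2, np.nan, 4]})

# Removal: drop any rows with one or more missing values
df_removed = df.dropna()

# Imputation: fill each column's missing values with that column's median
df_imputed = df.fillna(df.median(numeric_only=True))
```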
2. Duplicate & Low Variation Data

When looking through your data, don't just look for missing values - keep an eye out for duplicate data, or data that has low variation.

Duplicate data is most commonly rows of data that are exactly the same across all columns. These duplicate rows do not add anything to the learning process of the model or algorithm, but do add storage & processing overhead.

In the vast majority of cases you can remove duplicate rows prior to training your model.

Low variation data is where a column in your dataset contains only one (or few) unique value(s).

An example: There is a column in a house price dataset called "property_type", and every row in this column has the value "house". This column won't add any value to the learning process, so it can be removed.
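A minimal pandas sketch of both checks, using made-up data that mirrors the "property_type" example:

```python
import pandas as pd

# Hypothetical house price data with a duplicate row and a single-valued column
df = pd.DataFrame({"property_type": ["house", "house", "house", "house"],
                   "price": [200000, 200000, 350000, 410000],
                   "bedrooms": [3, 3, 2, 4]})

# Remove rows that are exactly the same across all columns
df = df.drop_duplicates()

# Find and drop low variation columns (here: only one unique value)
single_valued = [col for col in df.columns if df[col].nunique() <= 1]
df = df.drop(columns=single_valued)
```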
3. Incorrect & Irrelevant Data

Irrelevant data is anything that isn't related specifically to the problem you're looking to solve. For example, if you're predicting house prices, but your dataset contains commercial properties as well - these would need to be removed.

Incorrect data can be hard to spot! An example could be looking for values that shouldn't be possible, such as negative house prices.

For categorical or text variables you should spend time analysing the unique values within a column.

For example, suppose you were predicting car prices and had a column in the dataset called "car_colour". Upon inspection you find values of "Orange", "orange", and "orang" are all present. These need to be rectified prior to training, otherwise you will limit the potential learning that can take place!

Always explore your data thoroughly!
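A short pandas sketch of this kind of inspection and clean-up (the data and the "orang" fix mirror the car colour example above):

```python
import pandas as pd

# Hypothetical car price data with inconsistent colours and an impossible price
df = pd.DataFrame({"car_colour": ["Orange", "orange", "orang", "Blue"],
                   "price": [5000, 7000, -100, 9000]})

# Explore the unique values within the categorical column
print(df["car_colour"].value_counts())

# Rectify inconsistent casing and misspellings
df["car_colour"] = df["car_colour"].str.lower().replace({"orang": "orange"})

# Remove values that shouldn't be possible, e.g. negative prices
df = df[df["price"] > 0]
```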


4. Categorical Data

Generally speaking, ML models like being fed numerical data. They are less fond of categorical data!

Categorical data is anything that is listed as groups, classes, or text. A simple example would be a column for gender which contains values of either "Male" or "Female".

Your model or algorithm won't know how to assign numerical importance to these values, so you often want to turn these groups or classes into numerical values.

A common approach is called One Hot Encoding, where you create new columns, one for each unique class in your categorical column. You fill these new columns with values of 1 or 0 depending on which is true for each observation.

Other encoding techniques you can consider are: Label Encoding, Binary Encoding, Target Encoding, Ordinal Encoding, & Feature Hashing.
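As a minimal sketch, One Hot Encoding the gender example with pandas (pd.get_dummies is one common way to do this; scikit-learn's OneHotEncoder is another):

```python
import pandas as pd

# Hypothetical data with a categorical gender column
df = pd.DataFrame({"gender": ["Male", "Female", "Female", "Male"],
                   "age": [34, 28, 45, 52]})

# One Hot Encoding: one new 1/0 column per unique class
df_encoded = pd.get_dummies(df, columns=["gender"], dtype=int)
print(df_encoded)
```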
5. Outliers

There is no formal definition of an outlier. You can think of them as any data point that is very different to the majority.

How you deal with outliers is dependent on the problem you are solving, and the model you are applying. For example, if your data contained one value that was 1000x any other, this could badly affect a Linear Regression model, which tries to generalise a rule across all observations. A Decision Tree would be unaffected however, as it deals with each observation independently.

In practice, outliers are commonly isolated using the number of standard deviations from the mean, or a rule based upon the interquartile range.

In cases where you want to mitigate the effects of outliers, you may look to simply remove any observations (rows) that contain outlier values in one or more of the columns, or you may look to replace their values to reduce their effect.

Always remember - just because a value is very high, or very low, that does not mean it is wrong to include it.
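A minimal sketch of the interquartile range rule mentioned above (the 1.5 multiplier is the conventional choice, and the data is made up):

```python
import pandas as pd

# Hypothetical prices containing one extreme value
df = pd.DataFrame({"price": [200000, 210000, 195000, 205000, 5000000]})

# Interquartile range rule: flag anything outside q1 - 1.5*IQR and q3 + 1.5*IQR
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Removal approach: keep only the observations within the bounds
df_no_outliers = df[df["price"].between(lower, upper)]
```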
6. Feature Scaling

Feature Scaling is where you force all the values from a column in your data to exist on the same scale. In certain scenarios it will help the model assess the relationships between variables more fairly, and more accurately.

The two most common scaling techniques are:

Standardisation: rescales all values to have a mean of 0 and a standard deviation of 1. In other words, almost all of your values end up between -4 and +4.

Normalisation: rescales data so that it exists in a range between 0 and 1.

Feature Scaling is essential for distance-based models such as k-means or k-nearest-neighbours.

Feature Scaling is recommended for any algorithms that utilise Gradient Descent, such as Linear Regression, Logistic Regression, and Neural Networks.

Feature Scaling is not necessary for tree-based algorithms such as Decision Trees & Random Forests.
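A minimal sketch of both techniques using scikit-learn (the column names and values are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on very different scales
df = pd.DataFrame({"area": [50, 80, 120, 200],
                   "price": [150000, 220000, 310000, 500000]})

# Standardisation: rescale to a mean of 0 and a standard deviation of 1
standardised = StandardScaler().fit_transform(df)

# Normalisation: rescale into the range 0 to 1
normalised = MinMaxScaler().fit_transform(df)
```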
7. Feature Engineering & Selection

Feature Engineering is the process of using domain knowledge to supplement or transform the original feature set.

The key to good Feature Engineering is to create or refine features that the algorithm or model can understand better, or that it will find more useful than the raw features for solving the particular problem at hand.

Feature Selection is where you keep only a subset of the most informative variables. This can be done using human intuition, or dynamically based upon statistical analysis.

A smaller feature set can lead to improved model accuracy through reduced noise. It can mean a lower computational cost, and improved processing speed. It can also make your models easier to understand & explain to stakeholders & customers!
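As an illustrative sketch, engineering one feature by hand and then selecting features statistically with scikit-learn's SelectKBest (the dataset, the derived "property_age" feature, the reference year, and the choice of k are all hypothetical):

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Hypothetical house price data
df = pd.DataFrame({"area": [50, 80, 120, 200, 90],
                   "bedrooms": [1, 2, 3, 5, 2],
                   "year_built": [1990, 1985, 2005, 2018, 2000],
                   "price": [150000, 220000, 310000, 500000, 240000]})

# Feature Engineering: derive a feature the model may find more useful
# (2024 is an assumed reference year for this example)
df["property_age"] = 2024 - df["year_built"]

# Feature Selection: keep the k features most related to the target
X = df.drop(columns=["price", "year_built"])
y = df["price"]
selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)
print(list(X.columns[selector.get_support()]))
```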
8. Validation Split

The Validation Split is where you partition your data into a training set and a validation set (and sometimes a test set as well).

You train the model with the training set only. The validation and/or test sets are held back from training and are used to assess model performance. They provide a true understanding of how accurate predictions are on new or unseen data.

An approach called k-fold cross validation can provide you with an even more robust understanding of model performance.

Here, the entire dataset is again partitioned into training & validation sets, and the model is trained and assessed like before. However, this process is done multiple times, with the training and validation sets being rotated to encompass different sets of observations within the data. You do this k times - with your final predictive accuracy assessment being based upon the average of each of the iterations.
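A minimal sketch of both approaches with scikit-learn (the synthetic data, the model choice, and the 80/20 split are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic regression data standing in for a real dataset
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)

# Validation split: train on 80% of the data, assess on the held-back 20%
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print("Validation R^2:", model.score(X_valid, y_valid))

# k-fold cross validation: rotate the validation set k times, then average
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print("Mean R^2 across 5 folds:", scores.mean())
```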
Want to land an incredible role in the exciting, future-proof, and lucrative field of Data Science?

LEARN THE RIGHT SKILLS

A curriculum based on input from hundreds of leaders, hiring managers, and recruiters

https://data-science-infinity.teachable.com
BUILD YOUR PORTFOLIO

Create your professionally made portfolio site that includes 10 pre-built projects

https://data-science-infinity.teachable.com
EARN THE CERTIFICATION

Prove your skills with the DSI Data Science Professional Certification

https://data-science-infinity.teachable.com
LAND AN AMAZING ROLE

Get guidance & support based upon hundreds of interviews at top tech companies

https://data-science-infinity.teachable.com

Taught by former Amazon & Sony PlayStation Data Scientist Andrew Jones

What do DSI students say?

"I had over 40 interviews without an offer. After DSI I quickly got 7 offers including one at KPMG and my amazing new role at Deloitte!"
- Ritesh

"The best program I've been a part of, hands down"
- Christian

"DSI is incredible - everything is taught in such a clear and simple way, even the more complex concepts!"
- Arianna

"I got it! Thank you so much for all your


advice & help with preparation - it truly
gave me the confidence to go in and
land the job!"
- Marta
"I've taken a number of Data Science
courses, and without doubt, DSI is the
best"
- William

"One of the best purchases towards


learning I have ever made"
- Scott

"I learned more than on any other


course, or reading entire books!"
- Erick

"I started a bootcamp last summer


through a well respected University, but I
didn't learn half as much from them!"
- GA
"100% worth it, it is amazing. I have
never seen such a good course and I
have done plenty of them!"
- Khatuna

"This is a world-class Data Science


experience. I would recommend this
course to every aspiring or professional
Data Scientist"
- David

"Andrew's guidance with my Resume &


throughout the interview process helped
me land my amazing new role (and at a
much higher salary than I expected!)"
- Barun

"DSI is a fantastic community & Andrew


is one of the best instructors!"

- Keith
"I'm now at University, and my Data
Science related subjects are a piece of
cake after completing this course!

I'm so glad I enrolled!" - Jose

"In addition to the great content,


Andrew's dedication to the growing DSI
community is amazing"
- Sophie

"The course has such high quality


content - you get your ROI even from the
first module"
- Donabel

"The Statistics 101 section was awesome!


I have now started to get confidence in
Statistics!"
- Shrikant
"I can't emphasise how good this
programme is...well worth the
investment!"
- Dejan

Come and join the hundreds & hundreds of other students getting the results they want!

https://data-science-infinity.teachable.com
