
An Extensive Step by Step Guide to Exploratory Data Analysis

My personal guide to performing EDA for any dataset

Terence Shin, MSc, MBA


Be sure to subscribe here or to my exclusive newsletter to never miss another article on data science guides, tricks and tips, life lessons, and more!

What is Exploratory Data Analysis?

Exploratory Data Analysis (EDA), also known as Data Exploration, is a step in the Data Analysis Process, where a number of techniques are used to better understand the dataset being used.

‘Understanding the dataset’ can refer to a number of things, including but not limited to…

● Extracting important variables and leaving behind useless variables
● Identifying outliers, missing values, or human error
● Understanding the relationship(s), or lack thereof, between variables
● Ultimately, maximizing your insights of a dataset and minimizing potential error that may occur later in the process

Here’s why this is important.

Have you heard of the phrase, “garbage in, garbage out”?

With EDA, it’s more like, “garbage in, perform EDA, possibly garbage out.”

By conducting EDA, you can turn an almost useable dataset into a completely useable dataset. I’m not saying that EDA can magically make any dataset clean — that is not true. However, many EDA techniques can remedy some common problems that are present in every dataset.

Exploratory Data Analysis does two main things:

1. It helps clean up a dataset.

2. It gives you a better understanding of the variables and the relationships between them.

Components of EDA

To me, there are three main components of exploring data:

1. Understanding your variables

2. Cleaning your dataset

3. Analyzing relationships between variables

In this article, we’ll take a look at each of these components.

Be sure to subscribe here or to my exclusive newsletter to never miss another article on data science guides, tricks and tips, life lessons, and more!

1. Understanding Your Variables

You don’t know what you don’t know. And if you don’t know what you don’t know, then how are you supposed to know whether your insights make sense or not? You won’t.

To give an example, I was exploring data provided by the NFL (data here) to see if I could discover any insights regarding variables that increase the likelihood of injury. One insight that I got was that Linebackers accumulated more than eight times as many injuries as Tight Ends. However, I had no idea what the difference between a Linebacker and a Tight End was, and because of this, I didn’t know if my insights made sense or not. Sure, I can Google what the differences between the two are, but I won’t always be able to rely on Google! Now you can see why understanding your data is so important. Let’s see how we can do this in practice.

As an example, I used the same dataset that I used to create my first Random Forest model, the Used Car Dataset here. First, I imported all of the libraries that I knew I’d need for my analysis and conducted some preliminary analyses.

# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Understanding my variables
df.shape
df.head()
df.columns
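Note that the snippets above assume the dataframe df has already been loaded. If you’re following along, something like the line below will do it; the file name is only a placeholder for wherever you saved the Used Car Dataset.

df = pd.read_csv('vehicles.csv')  # placeholder path; adjust to your copy of the dataset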
.shape returns the number of rows by the number of columns for my dataset. My output was (525839, 22), meaning the dataset has 525839 rows and 22 columns.

.head() returns the first 5 rows of my dataset. This is useful if you want to see some example values for each variable.

.columns returns the names of all of the columns in the dataset.

df.columns output

Once I knew all of the variables in the dataset, I wanted to get a better understanding of the different values for each variable.

df.nunique(axis=0)
df.describe().apply(lambda s: s.apply(lambda x: format(x, 'f')))

.nunique(axis=0) returns the number of unique values for each variable.

.describe() summarizes the count, mean, standard deviation, min, and max for numeric variables. The code that follows this simply formats each row to the regular format and suppresses scientific notation (see here).
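As an aside of my own, a simpler way to get the same effect is to set pandas’ display format globally instead of using the apply trick:

# Show floats in plain decimal notation instead of scientific notation
pd.set_option('display.float_format', '{:.2f}'.format)
df.describe()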

df.nunique(axis=0) output
df.describe().apply(lambda s: s.apply(lambda x: format(x, ‘f’))) output

Immediately, I noticed an issue with price, year, and odometer. For example, the minimum and maximum price are $0.00 and $3,048,344,231.00 respectively. You’ll see how I dealt with this in the next section. I still wanted to get a better understanding of my discrete variables.

df.condition.unique()

Using .unique(), I took a look at my discrete variables, including ‘condition’.
df.condition.unique()

You can see that many of the values are synonyms of each other, like ‘excellent’ and ‘like new’. While this isn’t the greatest example, there will be some instances where it’s ideal to clump together different words. For example, if you were analyzing weather patterns, you may want to reclassify ‘cloudy’, ‘grey’, ‘cloudy with a chance of rain’, and ‘mostly cloudy’ simply as ‘cloudy’.

Later you’ll see that I end up omitting this column due to having too many null values, but if you wanted to re-classify the condition values, you could use the code below:

# Reclassify condition column
def clean_condition(row):
    good = ['good', 'fair']
    excellent = ['excellent', 'like new']
    if row.condition in good:
        return 'good'
    if row.condition in excellent:
        return 'excellent'
    return row.condition

# Clean dataframe
def clean_df(df):
    df_cleaned = df.copy()
    df_cleaned['condition'] = df_cleaned.apply(lambda row: clean_condition(row), axis=1)
    return df_cleaned

# Get df with reclassified 'condition' column
df_cleaned = clean_df(df)
print(df_cleaned.condition.unique())

And you can see that the values have been re-classified below.

print(df_cleaned.condition.unique()) output
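As a side note that isn’t in the original article, the same reclassification can be written more concisely with a dictionary and pandas’ replace method, avoiding the row-wise apply; this is just a sketch that assumes the same df:

# Vectorized alternative to clean_condition/clean_df (a sketch, not the author's code)
mapping = {'fair': 'good', 'like new': 'excellent'}
df_cleaned = df.copy()
df_cleaned['condition'] = df_cleaned['condition'].replace(mapping)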

2. Cleaning your dataset

You now know how to reclassify discrete data if needed, but there are a number of things that still need to be looked at.

a. Removing Redundant variables

First, I got rid of variables that I thought were redundant. This includes url, image_url, and city_url.


df_cleaned = df_cleaned.copy().drop(['url', 'image_url', 'city_url'], axis=1)

b. Variable Selection

Next, I wanted to get rid of any columns that had too many null values. Thanks to my friend, Richie, I used the following code to remove any columns that had 40% or more of their data as null values. Depending on the situation, I may want to increase or decrease the threshold. The remaining columns are shown below.

NA_val = df_cleaned.isna().sum()

def na_filter(na, threshold=.4):  # only select variables that pass the threshold
    col_pass = []
    for i in na.keys():
        if na[i] / df_cleaned.shape[0] < threshold:
            col_pass.append(i)
    return col_pass

df_cleaned = df_cleaned[na_filter(NA_val)]
df_cleaned.columns
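For reference, and as my own aside rather than part of the original walkthrough, the same column filtering can be expressed in a single line using the fraction of missing values per column:

# Keep only columns where fewer than 40% of the values are missing (equivalent sketch)
df_cleaned = df_cleaned.loc[:, df_cleaned.isna().mean() < 0.4]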

c. Removing Outliers

Revisiting the issue previously addressed, I set parameters for price, year, and odometer to remove any values outside of the set boundaries. In this case, I used my intuition to determine parameters — I’m sure there are methods to determine the optimal boundaries, but I haven’t looked into it yet!

df_cleaned = df_cleaned[df_cleaned['price'].between(999.99, 99999.00)]
df_cleaned = df_cleaned[df_cleaned['year'] > 1990]
df_cleaned = df_cleaned[df_cleaned['odometer'] < 899999.00]

df_cleaned.describe().apply(lambda s: s.apply(lambda x: format(x, 'f')))

You can see that the minimum and maximum values have changed in the results below.
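If you’d rather not rely on intuition, one common option (this is my own sketch, not something the article covers) is the interquartile range rule, which flags values far outside the middle 50% of the data:

# IQR-based boundaries for 'price'; 1.5 is the conventional multiplier
q1, q3 = df_cleaned['price'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df_cleaned = df_cleaned[df_cleaned['price'].between(lower, upper)]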

d. Removing Rows with Null Values

Lastly, I used .dropna(axis=0) to remove any rows with null values. After the code below, I went from 371982 to 208765 rows.
df_cleaned = df_cleaned.dropna(axis=0)

df_cleaned.shape

3. Analyzing relationships between variables

Correlation Matrix

The first thing I like to do when analyzing my variables is visualizing them through a correlation matrix, because it’s the fastest way to develop a general understanding of all of my variables. To review, correlation is a measurement that describes the relationship between two variables — if you want to learn more about it, you can check out my statistics cheat sheet here. Thus, a correlation matrix is a table that shows the correlation coefficients between many variables. I used sns.heatmap() to plot a correlation matrix of all of the variables in the used car dataset.

# Calculate correlation matrix
corr = df_cleaned.corr()

# Plot the heatmap
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, annot=True,
            cmap=sns.diverging_palette(220, 20, as_cmap=True))
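One caveat worth adding (my note, not the author’s): on recent pandas versions, .corr() raises an error if non-numeric columns are still present in the dataframe, so you may need to restrict the calculation to numeric columns:

# On newer pandas, limit the correlation to numeric columns
corr = df_cleaned.corr(numeric_only=True)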

We can see that there is a positive correlation between price and year and a negative correlation between price and odometer. This makes sense as newer cars are generally more expensive, and cars with more mileage are relatively cheaper. We can also see that there is a negative correlation between year and odometer — the newer a car is, the fewer miles it has on it.

Scatterplot

It’s pretty hard to beat correlation heatmaps when it comes to data visualizations, but scatterplots are arguably one of the most useful visualizations when it comes to data.

A scatterplot is a type of graph which ‘plots’ the values of two variables along two axes, like age and height. Scatterplots are useful for many reasons: like correlation matrices, they allow you to quickly understand a relationship between two variables, they’re useful for identifying outliers, and they’re instrumental when building polynomial and multiple regression models (which we’ll get to in the next article). I used .plot() and set the ‘kind’ of graph as scatter. I also set the x-axis to ‘odometer’ and the y-axis to ‘price’, since we want to see how different levels of mileage affect price.


df_cleaned.plot(kind='scatter', x='odometer', y='price')

This narrates the same story as the correlation matrix — there’s a negative correlation between odometer and price. What’s neat about scatterplots is that they communicate more information than just that. Another insight that you can draw is that mileage has a diminishing effect on price. In other words, the amount of mileage that a car accumulates early in its life impacts price much more than it does later on, when the car is older. You can see this as the plot shows a steep drop at first that becomes less steep as more mileage is added. This is why people say that it’s not a good investment to buy a brand new car!


df_cleaned.plot(kind='scatter', x='year', y='price')

To give another example, the scatterplot above shows the relationship between year and price — the newer the car is, the more expensive it’s likely to be.

As a bonus, sns.pairplot() is a great way to create scatterplots between all of your variables.

sns.pairplot(df_cleaned)
Histogram

Correlation matrices and scatterplots are useful for exploring the relationship between two variables. But what if you only wanted to explore a single variable by itself? This is when histograms come into play. Histograms look like bar graphs, but they show the distribution of a variable’s set of values.

df_cleaned['odometer'].plot(kind='hist', bins=50, figsize=(12,6), facecolor='grey', edgecolor='black')

df_cleaned['year'].plot(kind='hist', bins=20, figsize=(12,6), facecolor='grey', edgecolor='black')
We can quickly notice that most cars have an odometer reading between 0 and just over 200,000 km and a model year of around 2000 to 2020. The difference between the two graphs is that the distribution of ‘odometer’ is positively skewed while the distribution of ‘year’ is negatively skewed. Skewness is important, especially in areas like finance, because a lot of models assume that all variables are normally distributed, which typically isn’t the case.
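To put a number on the skew, pandas has a built-in skew method; this is a small sketch of my own rather than part of the article:

# Positive result = right (positive) skew, negative result = left (negative) skew
print(df_cleaned['odometer'].skew())
print(df_cleaned['year'].skew())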

Boxplot

Another way to visualize the distribution of a variable is a boxplot. We’re going to look at ‘price’ this time as an example.


df_cleaned.boxplot('price')

Boxplots are not as intuitive as the other graphs shown above, but they communicate a lot of information in their own way. The image below explains how to read a boxplot. Immediately, you can see that there are a number of outliers for price in the upper range and that most of the prices fall between $0 and $40,000.
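To confirm that reading numerically (my own quick check, not something the article includes), the quartiles behind the box can be pulled straight from the data:

# Median and quartiles of price; the box on the boxplot spans the 25th to 75th percentiles
print(df_cleaned['price'].quantile([0.25, 0.5, 0.75]))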
There are several other types of visualizations that weren’t covered here that you can use depending on the dataset, like stacked bar graphs, area plots, violin plots, and even geospatial visuals.

By going through the three steps of exploratory data analysis, you’ll have a much better understanding of your data, which will make it easier to choose your model and your attributes, and to refine them overall.
