An Extensive Step by Step Guide To Exploratory Data Analysis
Exploratory data analysis (EDA) is a step in the data analysis process where a number of techniques are used to better understand the dataset being used. Some of these techniques include:
- Extracting important variables and leaving behind useless variables
- Identifying outliers, missing values, and human error
- Understanding the relationship(s), or lack thereof, between variables
- Maximizing your insight into the dataset and minimizing potential error later in the process

You've probably heard the phrase "garbage in, garbage out." With EDA, it's more like "garbage in, perform EDA, possibly garbage out." By conducting EDA, you can turn an almost usable dataset into a completely usable one. Ultimately, EDA does two main things:
1. It helps clean up the dataset.
2. It gives you a better understanding of the variables and the relationships between them.
Components of EDA
To me, there are three main components of exploring data:
1. Understanding your variables
2. Cleaning your dataset
3. Analyzing relationships between variables
First, understanding your variables: if you don't know what you don't know, then how are you supposed to know whether your insights make sense? Sure, I can Google what the difference between two similar-looking variables is, but I won't always be able to rely on Google! Now you can see why understanding your variables is so important; let's see how to do this in practice.
For this walkthrough, I used the same dataset that I used to build my first Random Forest model, the Used Car Dataset here. First, I imported all of the libraries that I knew I'd need for my analysis and loaded the data:
# Import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset (the file name here is an assumption; point it at your copy of the Used Car Dataset)
df = pd.read_csv('vehicles.csv')
df.shape
df.head()
df.columns
.shape returns the number of rows by the number of columns in the dataset, .head() returns the first five rows, and .columns returns the names of all of the columns.
df.columns output
df.nunique(axis=0)
df.describe().apply(lambda s: s.apply(lambda x: format(x, 'f')))
.nunique(axis=0) returns the number of unique values for each variable. .describe() summarizes the count, mean, standard deviation, min, and max for numeric variables. The code chained after .describe() simply formats each number as a fixed-point float so that the summary isn't shown in scientific notation.
df.nunique(axis=0) output
df.describe().apply(lambda s: s.apply(lambda x: format(x, 'f'))) output
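As an aside, the same effect can be achieved globally with a pandas display option instead of formatting each value; this is an alternative of mine, not the article's code:

# Show floats in fixed-point notation in all pandas output
pd.set_option('display.float_format', lambda x: '%.3f' % x)
df.describe()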
For example, the minimum and maximum price are $0.00 and an implausibly large number, respectively; you'll see how I deal with this when removing outliers. Next, I used .unique() to look at the unique values of my discrete variables, including 'condition'.

df.condition.unique()

df.condition.unique() output
You can see that several values are synonyms of each other, like 'excellent' and 'like new'. While this isn't the greatest example, there are cases where it's ideal to combine values like these. Later you'll see that I end up omitting this column due to having too many null values, but the code below shows how such values can be re-classified.
# Consolidate synonymous condition values
def clean_condition(row):
    good = ['good', 'fair']
    excellent = ['excellent', 'like new']
    if row.condition in good:
        return 'good'
    if row.condition in excellent:
        return 'excellent'
    return row.condition

# Apply the re-classification to a copy of the dataframe
def clean_df(df):
    df_cleaned = df.copy()
    df_cleaned['condition'] = df_cleaned.apply(lambda row: clean_condition(row), axis=1)
    return df_cleaned

df_cleaned = clean_df(df)
print(df_cleaned.condition.unique())
And you can see that the values have been re-classified below.
print(df_cleaned.condition.unique()) output
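As a side note, the same re-classification can be done without a row-wise apply; here is a minimal vectorized sketch using pandas' replace, my alternative rather than the article's code:

# Vectorized alternative to clean_df (assumes the same mapping as above)
df_cleaned = df.copy()
df_cleaned['condition'] = df_cleaned['condition'].replace({'fair': 'good', 'like new': 'excellent'})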
b. Variable Selection
Next, I wanted to get rid of any columns that had too many null values. I chose to remove any columns that had 40% or more of their data as null values; that threshold is a judgment call, and you may want a different cutoff depending on the dataset. The code is below.
# Count the null values in each column
NA_val = df_cleaned.isna().sum()

# Keep only the columns whose share of nulls is below the threshold
def na_filter(na, threshold=0.4):
    col_pass = []
    for i in na.keys():
        if na[i] / df_cleaned.shape[0] < threshold:
            col_pass.append(i)
    return col_pass

df_cleaned = df_cleaned[na_filter(NA_val)]
df_cleaned.columns
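For what it's worth, pandas can express this filter in one line; a sketch of an equivalent, my shorthand rather than the article's code:

# Keep columns where under 40% of values are null
df_cleaned = df_cleaned.loc[:, df_cleaned.isna().mean() < 0.4]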
c. Removing Outliers
Next, I set boundaries for price, year, and odometer to remove any values outside of the set ranges. For example, the line below keeps only the rows whose price falls between $999.99 and $99,999.00.
df_cleaned = df_cleaned[df_cleaned['price'].between(999.99, 99999.00)]
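The year and odometer boundaries work the same way. The cutoffs below are illustrative assumptions, not necessarily the article's exact values:

# Illustrative boundaries; choose cutoffs that make sense for your data
df_cleaned = df_cleaned[df_cleaned['year'] > 1990]
df_cleaned = df_cleaned[df_cleaned['odometer'] < 899999.00]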
You can see that the minimum and maximum values have changed in the summary statistics as a result.

d. Removing Rows with Null Values
Lastly, I used .dropna(axis=0) to remove any remaining rows with null values and checked the new shape of the dataset.
df_cleaned = df_cleaned.dropna(axis=0)
df_cleaned.shape
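As a quick sanity check (my addition, not part of the original walkthrough), you can confirm that no null values remain:

# Total count of nulls across the whole dataframe; should print 0
print(df_cleaned.isna().sum().sum())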
Correlation Matrix
A correlation matrix is a table that shows the correlation coefficients between pairs of variables. (If you want a refresher on how correlation works, you can check out my statistics cheat sheet here.) Thus, it's a great way to get a high-level view of the relationships within the dataset.
# Calculate the correlation matrix (on newer pandas, pass numeric_only=True)
corr = df_cleaned.corr()

# Plot the heatmap
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns,
            annot=True, cmap=sns.diverging_palette(220, 20, as_cmap=True))
Here we can see the negative correlation between price and odometer: cars with more mileage are relatively cheaper. We can also see that there is a negative correlation between year and odometer, meaning the newer a car is, the fewer miles it tends to have on it.
Scatterplot
A scatterplot uses dots to represent the values of two numeric variables along two axes, like age and height. I used sns.pairplot() to draw scatterplots for every pair of variables in one shot.

sns.pairplot(df_cleaned)

The plot of price against odometer shows more than just the negative correlation from the heatmap. Another insight you can draw is that mileage lowers a car's price much more at first than later on when a car is older. You can see this as the plots show a steep drop at first that becomes less steep as more mileage is added. This is why people say that it's not a great investment to buy a brand-new car! The plots also show the relationship between year and price: the newer the car is, the more expensive it tends to be.
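Since the discussion centers on price against odometer, it can also help to plot just that pair; here is a minimal seaborn sketch of mine, not the article's code:

# Single scatterplot of the pair discussed above
sns.scatterplot(data=df_cleaned, x='odometer', y='price')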
Histogram
In case you need a refresher, a histogram shows the frequency distribution of a single numeric variable. I plotted histograms of odometer and year below.

df_cleaned['odometer'].plot(kind='hist', bins=50, figsize=(12,6), facecolor='grey', edgecolor='black')

df_cleaned['year'].plot(kind='hist', bins=20, figsize=(12,6), facecolor='grey', edgecolor='black')
From the histograms, we can quickly notice the ranges where most cars' odometer readings and model years fall.
Boxplot
A boxplot is another way to visualize the distribution of a numeric variable and its outliers. From the boxplot of price, you can see that there are a number of outliers for price in the upper range and that most of the prices fall between $0 and $40,000.
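A boxplot like the one described can be drawn with a one-liner; this sketch is mine rather than the article's exact call:

# Horizontal boxplot of price, showing the median, quartiles, and outliers
sns.boxplot(x=df_cleaned['price'])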
There are several other types of visualizations that weren't covered here that you can use depending on the dataset, like stacked bar graphs, area plots, violin plots, and even geospatial visuals; a violin plot example follows below.
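As one example of these, here is a minimal violin plot sketch; it's my illustration, not code from the original walkthrough:

# A violin plot is like a boxplot that also shows the density of the distribution
sns.violinplot(x=df_cleaned['price'])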
Whichever visualizations you choose, the goal is the same: building a thorough understanding of your data overall.