Exploratory Data Analysis (EDA): Python
Learning the basics of Exploratory Data Analysis using Python with NumPy, Matplotlib, and Pandas.
1. Data Sourcing
2. Data Cleaning
3. Univariate analysis
4. Bivariate analysis
5. Multivariate analysis
1. Data Sourcing
Data Sourcing is the process of finding and loading data into our system. Broadly, there are two kinds of data we can work with:
1. Private Data
2. Public Data
Private Data
As the name suggests, private data is owned by private organizations, and there are security and privacy concerns attached to it. This type of data is used mainly for an organization's internal analysis.
Public Data
Public data is made freely available for everyone to use. Some popular sources of public data are:
https://data.gov
https://data.gov.uk
https://data.gov.in
https://www.kaggle.com/
https://archive.ics.uci.edu/ml/index.php
https://github.com/awesomedata/awesome-public-datasets
Data Sourcing is the very first step of EDA. We have seen how to access data and load it into our system; the next step is to clean it.
2. Data Cleaning
After completing Data Sourcing, the next step in the EDA process is Data Cleaning. Once the data is in our system, it is very important to clean it and get rid of irregularities such as:
Missing Values
Incorrect Format
Incorrect Headers
Anomalies/Outliers
First, let’s import the necessary libraries and store the data in
our system for analysis.
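A minimal sketch of this step, assuming the data lives in a CSV file (the file name bank_marketing.csv and the column layout below are placeholders; adjust them to your own dataset):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the raw file; 'bank_marketing.csv' is an assumed file name
df = pd.read_csv('bank_marketing.csv')

# Fix rows and columns: drop completely empty rows and strip stray
# whitespace from the header names
df = df.dropna(how='all')
df.columns = df.columns.str.strip()

print(df.head())
```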
Now, the dataset looks like this, and it makes more sense.
Dataset after fixing the rows and columns
Missing Values
Let's check whether the missing values in the dataset have been handled or not.
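One way to do that, continuing with the df assumed in the earlier sketch:

```python
# Count the missing values in every column
print(df.isnull().sum())
```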
All the missing values have been handled
We can also fill the missing values with NaN so that they do not affect the outcome of any statistical analysis (pandas statistics skip NaN by default).
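For example, if the dataset marks unknown entries with a placeholder string (the value 'unknown' below is an assumption about this dataset), those placeholders can be converted to NaN:

```python
# Turn placeholder strings into proper NaN values; pandas statistics
# (mean, median, corr, ...) skip NaN by default
df = df.replace('unknown', np.nan)
print(df.isnull().sum())
```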
Handling Outliers
We have seen how to fix missing values; now let's see how to handle outliers in the dataset.
Outliers are the values that are far beyond the next
nearest data points.
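One common way to spot and treat them, assuming the numeric 'Balance' column from this dataset, is a box plot followed by capping at 1.5 times the interquartile range (IQR); this is only a sketch, and the right treatment depends on the data:

```python
# Visualize the spread of 'Balance' to spot extreme values
df['Balance'].plot.box()
plt.show()

# Cap values that fall outside 1.5 * IQR on either side
q1, q3 = df['Balance'].quantile([0.25, 0.75])
iqr = q3 - q1
df['Balance'] = df['Balance'].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
```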
Standardizing Values
Standardizing values means bringing all the values of a variable onto a common scale or unit, for example making sure every amount is in the same currency or every duration is in the same unit.
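A tiny illustration, assuming a hypothetical 'Duration' column recorded in seconds:

```python
# Convert every duration to minutes so the whole column uses one unit
df['Duration_min'] = df['Duration'] / 60.0
```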
Now that we are clear on how to source and clean the data, let's see how we can analyze it.
3. Univariate Analysis
If we analyze data over a single variable/column from a dataset,
it is known as Univariate Analysis.
Now, let's analyze the Job category using plots. Since Job is a categorical column, we will use a bar plot.
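A quick sketch, assuming the cleaned df has a 'Job' column:

```python
# Frequency of each job category, shown as a horizontal bar chart
df['Job'].value_counts().plot.barh()
plt.xlabel('Count')
plt.ylabel('Job category')
plt.show()
```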
4. Bivariate Analysis
If we analyze data by taking two variables/columns into consideration, it is known as Bivariate Analysis.
a) Numeric-Numeric Analysis
When both variables are numeric, we can analyze them in three ways:
Scatter Plot
Pair Plot
Correlation Matrix
Scatter Plot
Let's take three columns, 'Balance', 'Age' and 'Salary', from our dataset and see what we can infer by plotting a scatter plot of salary against balance and of age against balance.
Scatter Plots
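A rough sketch of those two plots, using the column names assumed earlier:

```python
# Scatter plots: Salary vs Balance and Age vs Balance
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(df['Salary'], df['Balance'], alpha=0.3)
axes[0].set_xlabel('Salary')
axes[0].set_ylabel('Balance')
axes[1].scatter(df['Age'], df['Balance'], alpha=0.3)
axes[1].set_xlabel('Age')
axes[1].set_ylabel('Balance')
plt.show()
```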
Pair Plot
Now, let's plot pair plots for the three columns we used for the scatter plots. We'll use the seaborn library for this.
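For example, under the same column-name assumptions:

```python
# Pair plot of the three numeric columns; the diagonal shows each
# column's distribution, the other cells show pairwise scatter plots
sns.pairplot(df[['Salary', 'Balance', 'Age']])
plt.show()
```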
Correlation Matrix
Scatter and pair plots become hard to read once we want to compare more than two numeric variables at once; a correlation matrix summarizes all the pairwise correlations in a single table that we can draw as a heatmap.
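A sketch, again assuming the 'Salary', 'Balance' and 'Age' columns:

```python
# Pairwise correlations between the three numeric columns, drawn as a heatmap
corr = df[['Salary', 'Balance', 'Age']].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()
```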
5. Multivariate Analysis
If we analyze data by taking more than two variables/columns into consideration, it is known as Multivariate Analysis.
Let's see how the response rate varies across different categories of marital status. First, we'll create a pivot table with the relevant columns and, after that, we'll draw a heatmap.
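A sketch under the assumption that the dataset has 'Marital' and 'Education' columns plus a 0/1 'Response' column (these names are placeholders for whichever categorical and outcome columns your data actually has):

```python
# Mean response rate for each combination of education level and marital status
pivot = pd.pivot_table(df, values='Response', index='Education',
                       columns='Marital', aggfunc='mean')
sns.heatmap(pivot, annot=True, cmap='RdYlGn')
plt.show()
```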
Conclusion
This is how we do Exploratory Data Analysis. EDA helps us look beyond the raw data: the more we explore the data, the more insights we draw from it. As data analysts, we will spend almost 80% of our time understanding data and solving various business problems through EDA.
References
Exploratory data analysis: https://en.wikipedia.org/wiki/Exploratory_data_analysis