EDA - Zep
Exploratory Data
Analysis

[Process diagram: Raw Data Collected → Data is Processed → Clean Dataset → Models & Algorithms]
By: Tushar Bhadouria
What is Exploratory Data Analysis
• Exploratory Data Analysis (EDA) is an approach to analyzing datasets to summarize their main
characteristics, often with visual methods.
• EDA is a data exploration technique used to understand the various aspects of the data.
• The main aim of EDA is to gain enough confidence in the data that we are ready to
engage a machine learning model.
• EDA is the first step in the data analysis process, which makes it important to analyze the data here.
• EDA gives a basic understanding of the data and helps you make sense of it, so you can figure out the
questions you need to ask and find the best way to manipulate the dataset to get the
answers to those questions.
• Exploratory data analysis helps us find errors, discover the data, map out the data
structure, and find anomalies.
• Exploratory data analysis is important for business processes because it prepares the
dataset for the deep, thorough analysis that will detect your business problem.
• EDA helps to build a quick-and-dirty model, or a baseline model, which can serve as a
comparison against the models that you build later.
Visualization
Visualization is the presentation of data in graphical or visual form to understand it more
clearly. Visualization makes the data easier to work with:
• Easily understand the features of the data.
• Easily analyze the data and summarize it.
• Helps to get meaningful insights from the data.
• Helps to find the trend or pattern of the data.
Steps involved in EDA
• Data Cleaning
• Numerical Analysis
Data Sourcing
• Data sourcing is the process of gathering data from multiple sources, as external or internal data collection.
• There are two major kinds of data, classified according to the source:
1. Public data
2. Private data
• Public data: data that is easy to access without taking any permission from the agencies.
The agencies make the data public for the purpose of research.
Example: the government and other public-sector or e-commerce sites make their data public.
• Private data: data that is not available on a public platform; to access it, we have to
take the permission of the organisation.
Example: the banking, telecom, and retail sectors do not make their data publicly available.
Sanity checks on the data:
• Does what you are reading make sense?
• Does the data follow the rules for this field?
• After computing the summary statistics for the numerical data, do they make sense?
• Deletion: the method most commonly used to handle missing values. Rows can be deleted if they have
an insignificant number of missing values; columns can be deleted if more than 75% of their values are missing.
• Mean/median imputation: can be used on independent variables when they are numerical. On
categorical features, we apply the mode to fill the missing values.
• Algorithms that support missing values: some machine learning algorithms, such as KNN, Naïve Bayes,
and random forest, can handle missing values in the dataset.
• Prediction model: one of the advanced methods to handle missing values. In this method, the part of the
dataset with no missing values becomes the training set, the part with missing values becomes the test
set, and the missing value is treated as the target variable.
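The imputation strategies above can be sketched in a few lines of Python. The `ages` and `cities` columns are invented toy data; in pandas, the equivalents are `df[col].fillna(df[col].mean())` and `df[col].fillna(df[col].mode()[0])`:

```python
from statistics import mean, mode

# Invented toy columns; None marks a missing entry.
ages = [25, 30, None, 22, None, 28]          # numerical feature
cities = ["Delhi", None, "Mumbai", "Delhi"]  # categorical feature

# Mean imputation for the numerical variable.
age_mean = mean(v for v in ages if v is not None)
ages_filled = [age_mean if v is None else v for v in ages]

# Mode imputation for the categorical variable.
city_mode = mode(v for v in cities if v is not None)
cities_filled = [city_mode if v is None else v for v in cities]

print(ages_filled)    # the mean of the observed ages fills the gaps
print(cities_filled)  # the most frequent city fills the gaps
```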
Standard Scaler
The standard scaler ensures that, for each
feature, the mean is zero and the
standard deviation is 1, bringing all
features to the same magnitude. In
simple words, standardization helps you
scale your features based on the
standard normal distribution.

Min-Max Scaler
Normalization helps you scale your
features to a range between 0 and 1.
15
By: Tushar Bhadouria
Example
Normalization (min-max): for age 28 with income 30000, given minimum income = 12000 and
maximum income = 30000: (30000 − 12000)/18000 = 1, so the scaled values lie between 0
(the minimum) and 1 (the maximum).
Standardization (z-score): for age 24 with income 15000, given mean = 19000 and standard
deviation = 9643.65: (15000 − 19000)/9643.65 = −0.4147. Hence, we have converted the income
values to lower values using the z-score method.
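The arithmetic of the worked example can be checked directly in Python, using the minimum, maximum, mean, and standard deviation quoted above:

```python
# Min-max normalization: x' = (x - min) / (max - min), result in [0, 1].
income, lo, hi = 30000, 12000, 30000
normalized = (income - lo) / (hi - lo)
print(normalized)  # 1.0

# Z-score standardization: z = (x - mean) / std.
income, mu, sigma = 15000, 19000, 9643.65
z = (income - mu) / sigma
print(round(z, 4))  # about -0.4148; the slide truncates it to -0.4147
```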
Detect outliers using the following method:
• Inter-quartile range (values outside 1.5 times the IQR)

Handle outliers using the following methods:
• Inter-quartile range
• Use ML models that are not sensitive to outliers, such as KNN, decision trees, SVM, Naïve Bayes,
and ensemble methods.
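A minimal IQR-based detector using the standard library's `statistics.quantiles`; the `data` list is invented sample data with one obvious outlier:

```python
from statistics import quantiles

data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12, 14, 13]

# quantiles(..., n=4) returns the three quartiles [Q1, Q2, Q3].
q1, _, q3 = quantiles(data, n=4)
iqr = q3 - q1

# Anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is flagged as an outlier.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lower or x > upper]
print(outliers)  # only the extreme value 102 is flagged
```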
• Encoding issues: in case the data is being read as junk characters, try changing the
encoding, e.g. CP1252 instead of UTF-8.
• Incorrect data types: correct them to the right data types for ease of analysis. E.g. if numeric
values are stored as strings, it would not be possible to calculate metrics such as mean,
median, etc. Some of the common data-type corrections are: string to number, "12,300" to
"12300"; string to date, "2013-Aug" to "2013/08"; number to string, "PIN Code 110001" to
"110001"; etc.
• Out-of-range values: if some of the values are beyond a logical range, e.g. a temperature
less than -273° C (0° K), you would need to correct them as required. A close look would help
you check if there is scope for correction, or if the value needs to be removed.
• Invalid structure: values that don't follow a defined structure can be removed. E.g. in a
data set containing pin codes of Indian cities, a pin code of 12 digits would be an invalid
value and needs to be removed. Similarly, a phone number of 12 digits would be an invalid
value.
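The data-type corrections listed above can be sketched with the standard library alone:

```python
from datetime import datetime

# String to number: strip the thousands separator before casting.
amount = int("12,300".replace(",", ""))
print(amount)  # 12300

# String to date: parse "2013-Aug", then reformat it as "2013/08".
month = datetime.strptime("2013-Aug", "%Y-%b").strftime("%Y/%m")
print(month)  # 2013/08

# Mixed text to a clean code string: keep only the digits of "PIN Code 110001".
pin = "".join(ch for ch in "PIN Code 110001" if ch.isdigit())
print(pin)  # 110001
```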
Types of Data
• Qualitative: a variable which describes a quality of the population (categorical values).
• Quantitative: a variable which quantifies the population (numerical values).
• Discrete: has discrete values, meaning it takes only counted values, not decimal values.
E.g. the count of students in a class.
• Nominal: represents qualitative information without order; each value represents a discrete
unit. E.g. gender (male/female), eye colour.
• Continuous: a number within a range of values, usually measured, such as height.
• Ordinal: represents qualitative information with order; it indicates that the measurement
classes are different and can be ranked. E.g. economic status (high/medium/low), which can be
ordered as low, medium, high.
We also perform various analyses on numerical data.

For example, when dealing with a single numerical variable, we might be interested in its
statistical information, such as the mean, median, 25th percentile, 75th percentile,
min, max, etc.

Similarly, while analyzing multiple features, we might be interested in knowing their
correlation with each other.
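Both analyses can be sketched with the standard library; the height and weight columns are invented sample data (pandas users would call `df.describe()` and `df.corr()` instead):

```python
from statistics import mean, median, quantiles, stdev

heights = [150, 160, 165, 170, 180]  # invented sample feature
weights = [50, 58, 63, 70, 80]       # invented sample feature

# Univariate summary of a single numerical variable.
q1, _, q3 = quantiles(heights, n=4)  # 25th and 75th percentiles
print(mean(heights), median(heights), min(heights), max(heights), q1, q3)

# Bivariate: Pearson correlation between two numerical features.
mh, mw = mean(heights), mean(weights)
cov = sum((h - mh) * (w - mw) for h, w in zip(heights, weights)) / (len(heights) - 1)
r = cov / (stdev(heights) * stdev(weights))
print(round(r, 3))  # close to 1: taller people in this sample weigh more
```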
Feature Binning
Feature Encoding
Type of Binning
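One common type is equal-width binning: the value range is split into k intervals of the same size, and each value is replaced by the index of its interval. A minimal sketch (the `ages` list is invented sample data; pandas offers the same via `pd.cut`):

```python
def equal_width_bins(values, k):
    """Map each value to the index (0..k-1) of its equal-width bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # min(..., k - 1) keeps the maximum value inside the last bin.
    return [min(int((v - lo) // width), k - 1) for v in values]

ages = [18, 22, 25, 31, 40, 55, 62, 70]
print(equal_width_bins(ages, 3))  # [0, 0, 0, 0, 1, 2, 2, 2]
```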
Label encoding: a technique to transform categorical variables into numerical variables by
assigning a numerical value to each of the categories.

One-hot encoding: this technique is used when the independent variables are nominal. It creates
k different columns, one for each category, and sets the matching column to 1 while the rest of
the columns are 0. Here, 0 represents the absence, and 1 represents the presence, of that
category.
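Both encodings can be sketched without any library; in practice one would use sklearn's `LabelEncoder` and pandas' `get_dummies`. The colour column is invented:

```python
colors = ["red", "green", "blue", "green", "red"]

# Label encoding: assign each category an integer (sorted for determinism).
mapping = {c: i for i, c in enumerate(sorted(set(colors)))}
labels = [mapping[c] for c in colors]
print(labels)  # blue=0, green=1, red=2 -> [2, 1, 0, 1, 2]

# One-hot encoding: one 0/1 column per category, a single 1 per row.
categories = sorted(set(colors))
one_hot = [[1 if c == cat else 0 for cat in categories] for c in colors]
print(one_hot[0])  # "red" -> [0, 0, 1] over (blue, green, red)
```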
Use Cases
• Cancer data set: predict who is suffering from cancer and who is not.
• Fraud data analysis in e-commerce transactions.