
EDA - Zep

Exploratory Data Analysis (EDA) is a crucial approach for analyzing datasets to summarize their main characteristics using visual methods, helping to build confidence in data for machine learning models. The EDA process involves data sourcing, cleaning, univariate and bivariate analysis, and deriving metrics, with visualization playing a key role in understanding data trends. Use cases for EDA include cancer data analysis and fraud detection in e-commerce transactions.


Exploratory Data Analysis

By: Tushar Bhadouria


Agenda
1. What is Exploratory Data Analysis?
2. Why is EDA important?
3. Visualization
• Important charts for visualization
4. Steps involved in EDA:
• Data Sourcing
• Data Cleaning
• Univariate analysis with visualization
• Bivariate analysis with visualization
• Derived Metrics
5. Use Cases



Data Analytics/Science Process

Raw Data Collected → Data is Processed → Clean Dataset → Exploratory Data Analysis → Models & Algorithms → Data Product

The results are then visualized and reported, and decisions are made that feed back into reality.
What is Exploratory Data Analysis
• Exploratory Data Analysis is an approach to analyzing datasets to summarize their main characteristics, often with visual methods.

• EDA is a data exploration technique for understanding various aspects of the data.

• The main aim of EDA is to gain enough confidence in the data that we are ready to engage a machine learning model.

• EDA is the first step in the data analysis process, which makes it essential for analyzing the data.

• EDA gives a basic understanding of the data, helping you figure out which questions to ask and the best way to manipulate the dataset to answer them.

• Exploratory data analysis helps us find errors, discover data, map out the data structure, and spot anomalies.

• EDA is important for business processes because it prepares the dataset for the deep, thorough analysis that will detect your business problem.

• EDA helps to build a quick-and-dirty (baseline) model, which can serve as a comparison against later models that you build.
Visualization
Visualization is the presentation of data in graphical or visual form so that it can be understood more clearly. Visualization helps to:

• Easily understand the features of the data
• Easily analyze the data and summarize it
• Get meaningful insights from the data
• Find trends or patterns in the data
Steps involved in EDA

Data Sourcing → Data Cleaning → Numerical Analysis / Categorical Analysis → Derived Metrics
Data Sourcing
• Data Sourcing is the process of gathering data from multiple sources, as external or internal data collection.
• There are two major kinds of data, classified according to the source:
1. Public data
2. Private data

Public Data: Data that is easy to access without taking any permission from the agencies is called public data. The agencies make the data public for the purpose of research.
• Example: Government and other public-sector organizations, as well as e-commerce sites, make their data public.

Private Data: Data that is not available on a public platform, and which requires an organisation's permission to access, is called private data.
• Example: The banking, telecom, and retail sectors do not make their data publicly available.



Data Cleaning

After collecting the data, the next step is data cleaning. Data cleaning means getting rid of any information that doesn't need to be there and cleaning up mistakes.

Data Cleaning is the process of cleaning the data to improve its quality for further data analysis and for building a machine learning model. The benefit of data cleaning is that all the incorrect and irrelevant data is gone, and we get good-quality data, which helps improve the accuracy of our machine learning model.

The following are some steps involved in Data Cleaning:
• Handle Missing Values
• Standardization of the data / Feature Scaling
• Outlier Treatment
• Handle Invalid Values


Some questions you need to ask yourself about the data before cleaning it:

• Does the data you're looking at match the column labels?
• Does this data make sense?
• Does what you are reading make sense?
• Does the data follow the rules for this field?
• After computing summary statistics for the numerical data, do they make sense?


Handle Missing Values

• Delete rows/columns: The method most commonly used to handle missing values. Rows can be deleted if they have an insignificant number of missing values; columns can be deleted if more than 75% of their values are missing.

• Replacing with mean/median/mode: This method can be used on an independent variable when it has numerical values. On a categorical feature, we apply the mode to fill the missing values.

• Algorithm imputation: Some machine learning algorithms support handling missing values in the dataset, such as KNN, Naïve Bayes, and Random Forest.

• Predicting the missing values: A prediction model is one of the advanced methods of handling missing values. The rows with no missing values become the training set, the rows with missing values become the test set, and the missing value is treated as the target variable.


Example

For numerical data: suppose we have airline ticket-price data with a missing value.

Airline      Ticket Price
Indigo       3887
Air Asia     7662
Jet Airways  -
Air India    5221
SpiceJet     4321

Steps to fill the numeric missing value:
1. Compute the mean/median of the data: (3887 + 7662 + 5221 + 4321)/4 = 5272.75
2. Substitute the mean in place of the missing value.

For categorical data: suppose we have a missing value in the categorical column.

Airline      Ticket Price
Indigo       3887
Indigo       7675
Air Asia     4236
-            6524
Jet Airways  4321

We take the mode of the column to fill the missing value. Here, Mode = Indigo, so we substitute Indigo in place of the missing value in the Airline column.
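The mean and mode imputation from the example above can be sketched in pandas. This is a minimal sketch: the frame mirrors the slide's airline data, and the column names are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative frame mirroring the slide's airline example.
df = pd.DataFrame({
    "Airline": ["Indigo", "Air Asia", np.nan, "Air India", "Indigo"],
    "Price":   [3887, 7662, np.nan, 5221, 4321],
})

# Numeric column: fill the gap with the mean of the observed values.
df["Price"] = df["Price"].fillna(df["Price"].mean())

# Categorical column: fill the gap with the mode (most frequent value).
df["Airline"] = df["Airline"].fillna(df["Airline"].mode()[0])
```

The observed prices average to (3887 + 7662 + 5221 + 4321)/4 = 5272.75, which is substituted in the missing row, just as in the worked example.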


Standardization/Feature Scaling



Importance of Feature Scaling

When we are dealing with independent variables or features that differ from each other in their range of values or units, we have to normalize/standardize the data so that the differences in range do not affect the outcome of the analysis.

Feature scaling is the method of rescaling the values present in the features. In feature scaling we convert different measurement scales into a single scale, standardizing the whole dataset into one range.


Feature Scaling Methods

• Standard Scaler: ensures that for each feature the mean is 0 and the standard deviation is 1, bringing all features to the same magnitude. In simple words, standardization scales your features based on the standard normal distribution.

• Min-Max Scaler: normalization scales your features down to a range between 0 and 1.
Example: Normalization

Age   Income (£)   New value
24    15000        (15000 – 12000)/18000 = 0.16667
30    12000        (12000 – 12000)/18000 = 0
28    30000        (30000 – 12000)/18000 = 1

Income minimum = 12000, income maximum = 30000, so (max – min) = 18000.

Hence, we have converted the income values to lie between 0 and 1. Please note that the new values have minimum = 0 and maximum = 1.
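The same min-max arithmetic can be sketched in a few lines of Python, using the income values from the table above:

```python
# Min-max normalization of the income column from the example.
incomes = [15000, 12000, 30000]

lo, hi = min(incomes), max(incomes)             # 12000 and 30000
normalized = [(x - lo) / (hi - lo) for x in incomes]
# Each value now lies in [0, 1]: the minimum maps to 0, the maximum to 1.
```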


Example: Standardization

Age   Income (£)   New value
24    15000        (15000 – 19000)/9643.65 = -0.4147
30    12000        (12000 – 19000)/9643.65 = -0.7258
28    30000        (30000 – 19000)/9643.65 = 1.1406

Average = (15000 + 12000 + 30000)/3 = 19000; standard deviation = 9643.65.

Hence, we have converted the income values to smaller values using the z-score method. For x = (-0.4147, -0.7258, 1.1406): mean(x) ≈ 0 and var(x) ≈ 1.
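The z-score computation above can be sketched with the standard library; `statistics.stdev` is the sample standard deviation, which matches the slide's value of 9643.65:

```python
import statistics

# Z-score standardization of the income column from the example.
incomes = [15000, 12000, 30000]

mu = statistics.mean(incomes)    # 19000
sd = statistics.stdev(incomes)   # sample standard deviation, ~9643.65
z = [(x - mu) / sd for x in incomes]
# The standardized values have mean ~0 and variance ~1.
```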


Outlier Treatment
Outliers are the most extreme values in the data: abnormal observations that deviate from the norm and do not fit the normal behavior of the data.

Detect outliers using the following methods:
1. Boxplot
2. Histogram
3. Scatter plot
4. Z-score
5. Interquartile range (values beyond 1.5 times the IQR)

Handle outliers using the following methods:
1. Remove the outliers.
2. Replace outliers with suitable values using the quantile method or the interquartile range.
3. Use ML models that are not sensitive to outliers, such as KNN, Decision Tree, SVM, Naïve Bayes, and ensemble methods.
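A minimal sketch of the 1.5 × IQR rule from the detection list above; the sample data are made up for illustration:

```python
import statistics

# Small made-up sample with one extreme value.
data = [10, 12, 11, 13, 12, 14, 11, 95]

q1, _, q3 = statistics.quantiles(data, n=4)   # first and third quartiles
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the 1.5 * IQR fences are flagged as outliers.
outliers = [x for x in data if x < lower or x > upper]
```

Once flagged, such values can be removed or replaced as described above.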


Handle Invalid Values

• Encode Unicode properly: If the data is being read as junk characters, try changing the encoding, e.g. CP1252 instead of UTF-8.

• Convert incorrect data types: Correct the data types for ease of analysis. E.g. if numeric values are stored as strings, it would not be possible to calculate metrics such as mean, median, etc. Some common data-type corrections are: string to number ("12,300" to "12300"); string to date ("2013-Aug" to "2013/08"); number to string ("PIN Code 110001" to "110001"); etc.

• Correct values that go beyond the range: If some values lie beyond the logical range, e.g. a temperature less than -273°C (0 K), you would need to correct them as required. A close look will help you check whether there is scope for correction, or whether the value needs to be removed.

• Correct wrong structure: Values that don't follow a defined structure can be removed. E.g. in a dataset containing PIN codes of Indian cities, a PIN code of 12 digits would be an invalid value and needs to be removed. Similarly, a phone number of 12 digits would be an invalid value.
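The type and structure corrections above can be sketched as follows; the values come from the slide, and the 6-digit PIN-code check is an illustrative rule:

```python
# String -> number: strip the thousands separator before converting.
price = int("12,300".replace(",", ""))

# Structure check: Indian PIN codes have exactly 6 digits,
# so a 12-digit value is invalid and gets dropped.
pins = ["110001", "123456789012"]
valid_pins = [p for p in pins if p.isdigit() and len(p) == 6]
```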


Types of Data

• Qualitative: a variable that describes a quality of the population (categorical values). Subtypes: Nominal, Ordinal.

• Quantitative: a variable that quantifies the population (numerical values). Subtypes: Discrete, Continuous.


• Discrete: takes only counted values, not decimal values, e.g. the count of students in a class.

• Nominal: represents qualitative information without order; each value represents a discrete unit, e.g. gender (male/female) or eye colour.

• Continuous: a number within a range, usually measured, such as height.

• Ordinal: represents qualitative information with order; the categories are distinct and can be ranked, e.g. economic status, which can be ordered as low, medium, high.


Types of Analysis



Univariate Analysis
Univariate analysis is the simplest form of analyzing data. "Uni" means "one": the data has only one variable.


Bivariate Analysis
When data has two variables, you often want to measure the relationship that exists between them. Bivariate analysis can be performed on two categorical variables, two numerical variables, or a combination of numerical and categorical variables.


Multivariate Analysis
When data has more than two variables, you often want to measure the relationships that exist among these features.


Numerical Analysis

We also perform various analyses over numerical data. For example, when dealing with a single numerical variable, we might be interested in its statistical summary: mean, median, 25th percentile, 75th percentile, min, max, etc. Similarly, while analyzing multiple features, we might be interested in their correlation with each other.
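Both the single-variable summary and the pairwise correlation described above are one-liners in pandas. A minimal sketch, with a made-up frame and illustrative column names:

```python
import pandas as pd

# Illustrative frame with two numeric features.
df = pd.DataFrame({"height_cm": [150, 160, 170, 180],
                   "weight_kg": [50, 60, 70, 80]})

# Single variable: count, mean, std, min, 25%/50%/75% percentiles, max.
summary = df["height_cm"].describe()

# Multiple features: pairwise Pearson correlation matrix.
corr = df.corr()
```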


Derived Metrics
Derived metrics create a new variable from existing variables to get insightful information out of the data.

• Feature Binning
• Feature Encoding
• From domain knowledge
• Calculated from data
Feature Binning
Feature binning converts or transforms a continuous/numeric variable into a categorical variable. It can also be used to identify missing values or outliers.

Types of binning:
a. Equal-width binning
b. Equal-frequency binning


Binning transforms a continuous or numeric variable into a categorical value without taking the dependent variable into consideration.

• Equal width: separates the continuous variable into several categories, each having the same range (width).

• Equal frequency: separates the continuous variable into several categories, each having approximately the same number of values.
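Both binning types above map directly onto pandas helpers: `pd.cut` for equal width and `pd.qcut` for equal frequency. A minimal sketch with made-up ages:

```python
import pandas as pd

# Illustrative ages to be bucketed into 4 categories.
ages = pd.Series([22, 25, 31, 38, 45, 52, 60, 67])

# Equal-width binning: 4 bins spanning identical ranges.
width_bins = pd.cut(ages, bins=4)

# Equal-frequency binning: 4 bins with (roughly) equal counts.
freq_bins = pd.qcut(ages, q=4)
```

Note that equal-width bins can hold very different numbers of values, while equal-frequency bins can have very different widths.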
Feature Encoding
Feature encoding helps us transform categorical data into numeric data.

• Label encoding: a technique to transform categorical variables into numerical variables by assigning a numerical value to each category.

• One-hot encoding: used when the independent variables are nominal. It creates k different columns, one per category, and places a 1 in the column for the present category and 0 in the rest. Here, 0 represents the absence, and 1 the presence, of that category.
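Both encodings above can be sketched in pandas (the frame and city names are illustrative; scikit-learn's encoders would work equally well):

```python
import pandas as pd

# Illustrative nominal feature.
df = pd.DataFrame({"city": ["Delhi", "Mumbai", "Delhi", "Chennai"]})

# Label encoding: one integer code per category
# (codes follow sorted category order: Chennai=0, Delhi=1, Mumbai=2).
df["city_code"] = df["city"].astype("category").cat.codes

# One-hot encoding: one 0/1 indicator column per category.
onehot = pd.get_dummies(df["city"], prefix="city")
```

One-hot encoding avoids imposing a spurious order on nominal categories, at the cost of k extra columns.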


Use Cases
EDA is important in every business problem; it is the first crucial step in the data analysis process. Some of the use cases where we use EDA are:

• Cancer data analysis: predict who is suffering from cancer and who is not.

• Fraud data analysis in e-commerce transactions: detect fraud in e-commerce transactions.


Thank you
