EDA - Zep
Exploratory Data
Analysis

[Process diagram: Raw Data Collected → Data is Processed → Clean Dataset → Models & Algorithms]
By: Tushar Bhadouria
What is Exploratory Data Analysis
• Exploratory Data Analysis (EDA) is an approach to analyzing datasets to summarize their main
characteristics, often with visual methods.
• EDA is a data exploration technique used to understand the various aspects of the data.
• The main aim of EDA is to gain enough confidence in the data that we are ready to
engage a machine learning model.
• EDA is the first step in the data analysis process, which makes it important to analyze the data here.
• EDA gives a basic understanding of the data and helps you make sense of it, so you can figure out the
questions you need to ask and find the best way to manipulate the dataset to get the
answers to those questions.
• Exploratory data analysis helps us find errors, discover the data, map out the data
structure, and find anomalies.
• Exploratory data analysis is important for business processes because it prepares the
dataset for the deep, thorough analysis that will detect your business problem.
• EDA helps to build a quick-and-dirty model, or a baseline model, which can serve as a
comparison against the models that you build later.
Visualization
Visualization is the presentation of data in graphical or visual form to understand it more
clearly. Visualization makes the data easier to work with:
• Easily understand the features of the data.
• Easily analyze the data and summarize it.
• Helps to get meaningful insights from the data.
• Helps to find the trend or pattern of the data.
Steps involved in EDA
• Data Cleaning
• Numerical Analysis
Data Sourcing
• Data sourcing is the process of gathering data from multiple sources, as external or internal data collection.
• There are two major kinds of data, classified according to the source:
1. Public data
2. Private data
• Public data: data that is easy to access without taking any permission from the agencies.
The agencies make the data public for the purpose of research.
Example: the government and other public-sector or e-commerce sites make their data public.
• Private data: data that is not available on a public platform; to access it, we have to
take the permission of the organisation.
Example: the banking, telecom, and retail sectors do not make their data publicly available.
Sanity checks on the data:
• Does what you are reading make sense?
• Does the data follow the rules for this field?
• After computing the summary statistics for the numerical data, do they make sense?
• Deletion: the method most commonly used to handle missing values. Rows can be deleted if they have
an insignificant number of missing values; columns can be deleted if more than 75% of their values are missing.
• Mean/median imputation: can be used on independent variables when they are numerical. On
categorical features, we apply the mode to fill the missing values.
• Algorithms that support missing values: some machine learning algorithms, such as KNN, Naïve Bayes,
and random forest, can handle missing values in the dataset.
• Prediction model: one of the advanced methods to handle missing values. In this method, the part of the
dataset with no missing values becomes the training set, the part with missing values becomes the test
set, and the missing value is treated as the target variable.
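The imputation strategies above can be sketched in a few lines of Python. The `ages` and `cities` columns are invented toy data; in pandas, the equivalents are `df[col].fillna(df[col].mean())` and `df[col].fillna(df[col].mode()[0])`:

```python
from statistics import mean, mode

# Invented toy columns; None marks a missing entry.
ages = [25, 30, None, 22, None, 28]          # numerical feature
cities = ["Delhi", None, "Mumbai", "Delhi"]  # categorical feature

# Mean imputation for the numerical variable.
age_mean = mean(v for v in ages if v is not None)
ages_filled = [age_mean if v is None else v for v in ages]

# Mode imputation for the categorical variable.
city_mode = mode(v for v in cities if v is not None)
cities_filled = [city_mode if v is None else v for v in cities]

print(ages_filled)    # the mean of the observed ages fills the gaps
print(cities_filled)  # the most frequent city fills the gaps
```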
Standard Scaler
The standard scaler ensures that, for each
feature, the mean is zero and the
standard deviation is 1, bringing all
features to the same magnitude. In
simple words, standardization helps you
scale your features based on the
standard normal distribution.

Min-Max Scaler
Normalization helps you scale your
features to a range between 0 and 1.
15
By: Tushar Bhadouria
Example
Normalization (min-max): for age 28 with income 30000, given minimum income = 12000 and
maximum income = 30000: (30000 − 12000)/18000 = 1, so the scaled values lie between 0
(the minimum) and 1 (the maximum).
Standardization (z-score): for age 24 with income 15000, given mean = 19000 and standard
deviation = 9643.65: (15000 − 19000)/9643.65 = −0.4147. Hence, we have converted the income
values to lower values using the z-score method.
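The arithmetic of the worked example can be checked directly in Python, using the minimum, maximum, mean, and standard deviation quoted above:

```python
# Min-max normalization: x' = (x - min) / (max - min), result in [0, 1].
income, lo, hi = 30000, 12000, 30000
normalized = (income - lo) / (hi - lo)
print(normalized)  # 1.0

# Z-score standardization: z = (x - mean) / std.
income, mu, sigma = 15000, 19000, 9643.65
z = (income - mu) / sigma
print(round(z, 4))  # about -0.4148; the slide truncates it to -0.4147
```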
Detect outliers using the following method:
• Inter-quartile range (values outside 1.5 times the IQR)

Handle outliers using the following methods:
• Inter-quartile range
• Use ML models that are not sensitive to outliers, such as KNN, decision trees, SVM, Naïve Bayes,
and ensemble methods.
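A minimal IQR-based detector using the standard library's `statistics.quantiles`; the `data` list is invented sample data with one obvious outlier:

```python
from statistics import quantiles

data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12, 14, 13]

# quantiles(..., n=4) returns the three quartiles [Q1, Q2, Q3].
q1, _, q3 = quantiles(data, n=4)
iqr = q3 - q1

# Anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is flagged as an outlier.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lower or x > upper]
print(outliers)  # only the extreme value 102 is flagged
```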
• Encoding issues: in case the data is being read as junk characters, try changing the
encoding, e.g. CP1252 instead of UTF-8.
• Incorrect data types: correct them to the right data types for ease of analysis. E.g. if numeric
values are stored as strings, it would not be possible to calculate metrics such as mean,
median, etc. Some of the common data-type corrections are: string to number, "12,300" to
"12300"; string to date, "2013-Aug" to "2013/08"; number to string, "PIN Code 110001" to
"110001"; etc.
• Out-of-range values: if some of the values are beyond a logical range, e.g. a temperature
less than -273° C (0° K), you would need to correct them as required. A close look would help
you check if there is scope for correction, or if the value needs to be removed.
• Invalid structure: values that don't follow a defined structure can be removed. E.g. in a
data set containing pin codes of Indian cities, a pin code of 12 digits would be an invalid
value and needs to be removed. Similarly, a phone number of 12 digits would be an invalid
value.
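The data-type corrections listed above can be sketched with the standard library alone:

```python
from datetime import datetime

# String to number: strip the thousands separator before casting.
amount = int("12,300".replace(",", ""))
print(amount)  # 12300

# String to date: parse "2013-Aug", then reformat it as "2013/08".
month = datetime.strptime("2013-Aug", "%Y-%b").strftime("%Y/%m")
print(month)  # 2013/08

# Mixed text to a clean code string: keep only the digits of "PIN Code 110001".
pin = "".join(ch for ch in "PIN Code 110001" if ch.isdigit())
print(pin)  # 110001
```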
Types of Data
• Qualitative: a variable which describes a quality of the population (categorical values).
• Quantitative: a variable which quantifies the population (numerical values).
• Discrete: has discrete values, meaning it takes only counted values, not decimal values.
E.g. the count of students in a class.
• Nominal: represents qualitative information without order; each value represents a discrete
unit. E.g. gender (male/female), eye colour.
• Continuous: a number within a range of values, usually measured, such as height.
• Ordinal: represents qualitative information with order; it indicates that the measurement
classes are different and can be ranked. E.g. economic status (high/medium/low), which can be
ordered as low, medium, high.
We also perform various analyses on numerical data.

For example, when dealing with a single numerical variable, we might be interested in its
statistical information, such as the mean, median, 25th percentile, 75th percentile,
min, max, etc.

Similarly, while analyzing multiple features, we might be interested in knowing their
correlation with each other.
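Both analyses can be sketched with the standard library; the height and weight columns are invented sample data (pandas users would call `df.describe()` and `df.corr()` instead):

```python
from statistics import mean, median, quantiles, stdev

heights = [150, 160, 165, 170, 180]  # invented sample feature
weights = [50, 58, 63, 70, 80]       # invented sample feature

# Univariate summary of a single numerical variable.
q1, _, q3 = quantiles(heights, n=4)  # 25th and 75th percentiles
print(mean(heights), median(heights), min(heights), max(heights), q1, q3)

# Bivariate: Pearson correlation between two numerical features.
mh, mw = mean(heights), mean(weights)
cov = sum((h - mh) * (w - mw) for h, w in zip(heights, weights)) / (len(heights) - 1)
r = cov / (stdev(heights) * stdev(weights))
print(round(r, 3))  # close to 1: taller people in this sample weigh more
```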
Feature Binning
Feature Encoding
Type of Binning
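One common type is equal-width binning: the value range is split into k intervals of the same size, and each value is replaced by the index of its interval. A minimal sketch (the `ages` list is invented sample data; pandas offers the same via `pd.cut`):

```python
def equal_width_bins(values, k):
    """Map each value to the index (0..k-1) of its equal-width bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # min(..., k - 1) keeps the maximum value inside the last bin.
    return [min(int((v - lo) // width), k - 1) for v in values]

ages = [18, 22, 25, 31, 40, 55, 62, 70]
print(equal_width_bins(ages, 3))  # [0, 0, 0, 0, 1, 2, 2, 2]
```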
Label encoding: a technique to transform categorical variables into numerical variables by
assigning a numerical value to each of the categories.

One-hot encoding: this technique is used when the independent variables are nominal. It creates
k different columns, one for each category, and sets the matching column to 1 while the rest of
the columns are 0. Here, 0 represents the absence, and 1 represents the presence, of that
category.
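Both encodings can be sketched without any library; in practice one would use sklearn's `LabelEncoder` and pandas' `get_dummies`. The colour column is invented:

```python
colors = ["red", "green", "blue", "green", "red"]

# Label encoding: assign each category an integer (sorted for determinism).
mapping = {c: i for i, c in enumerate(sorted(set(colors)))}
labels = [mapping[c] for c in colors]
print(labels)  # blue=0, green=1, red=2 -> [2, 1, 0, 1, 2]

# One-hot encoding: one 0/1 column per category, a single 1 per row.
categories = sorted(set(colors))
one_hot = [[1 if c == cat else 0 for cat in categories] for c in colors]
print(one_hot[0])  # "red" -> [0, 0, 1] over (blue, green, red)
```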
Use Cases
• Cancer data set: predict who is suffering from cancer and who is not.
• Fraud data analysis in e-commerce transactions.