
Data Exploration

Data exploration
Data exploration is the first step of data analysis, used to explore and visualize data to uncover insights from the start, or to identify areas or patterns to dig into further.
Why is Data Exploration Important?
• Data exploration helps users make better decisions about where to dig deeper into the data, and gives them a broad understanding of the business before they ask more detailed questions.
• With a user-friendly interface, anyone across an organization can familiarize themselves with the data, discover patterns, and generate thoughtful questions that may spur deeper, valuable analysis.

https://www.tibco.com/reference-center/what-is-data-exploration
• Data exploration and visual analytics tools build understanding, empowering users to explore data in any visualization.
• This approach speeds up time to answers and deepens users’ understanding by covering more ground in less time.
• Data exploration is important because it democratizes access to data and provides governed self-service analytics.
• Furthermore, businesses can accelerate data exploration by provisioning and delivering data through visual data marts that are easy to explore and use.
Main Use Cases for Data Exploration
• Data exploration can help businesses explore large amounts of data quickly to better understand the next steps for further analysis.
• This gives the business a more manageable starting point
and a way to target areas of interest.
• In most cases, data exploration involves using data
visualizations to examine the data at a high level.
• By taking this high-level approach, businesses can
determine which data is most important and which may
distort the analysis and therefore should be removed.
• Data exploration can also be helpful in decreasing time
spent on less valuable analysis by selecting the right path
forward from the start.

https://www.tibco.com/reference-center/what-is-data-exploration
Steps of Data Exploration and
Preparation
• Variable Identification
• Univariate Analysis
• Bi-variate Analysis
• Missing values treatment
• Outlier treatment
• Variable transformation
• Variable creation
Variable Identification
• First, identify the Predictor (input) and Target (output) variables. Next, identify the data type and category of each variable.
• Example: suppose we want to predict whether students will play cricket or not, given a data set of student attributes. Here you need to identify the predictor variables, the target variable, the data type of each variable, and the category of each variable.
Univariate Analysis
• At this stage, we explore variables one by one.
– The method used to perform univariate analysis depends on whether the variable is categorical or continuous.
• Continuous Variables: In the case of continuous variables, we need to understand the central tendency and spread of the variable. These are measured using statistical metrics (mean, median, mode, quartiles, IQR, variance, standard deviation) and visualization methods such as histograms and box plots, as in the sketch below:
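A minimal sketch in Python of these univariate summaries, assuming a hypothetical pandas DataFrame with an "age" column (the values are made up for illustration):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical continuous variable; values are made up for illustration.
df = pd.DataFrame({"age": [23, 25, 27, 29, 31, 35, 40, 46, 52, 61]})

# Central tendency and spread.
print(df["age"].describe())   # count, mean, std, min, quartiles, max
print("median:", df["age"].median())
print("IQR:", df["age"].quantile(0.75) - df["age"].quantile(0.25))
print("skewness:", df["age"].skew())

# Visualization: histogram and box plot.
df["age"].plot(kind="hist")
plt.show()
df["age"].plot(kind="box")
plt.show()
```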
• Categorical Variables: For categorical variables, we use a frequency table to understand the distribution of each category.
• We can also read it as the percentage of values under each category.
• It can be measured using two metrics, Count and Count%, against each category.
• A bar chart can be used as a visualization, as sketched below.
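A short sketch of a frequency table with Count and Count%, assuming a hypothetical "gender" column:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical categorical variable.
df = pd.DataFrame({"gender": ["Male", "Female", "Male", "Male", "Female"]})

# Frequency table: Count and Count% against each category.
count = df["gender"].value_counts()
count_pct = df["gender"].value_counts(normalize=True) * 100
print(pd.DataFrame({"Count": count, "Count%": count_pct}))

# Bar chart as the visualization.
count.plot(kind="bar")
plt.show()
```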
Bi-variate Analysis
• Bi-variate Analysis finds out the relationship between
two variables.
• Here, we look for association and disassociation
between variables at a pre-defined significance level.
• We can perform bi-variate analysis for any
combination of categorical and continuous variables.
• The combination can be: Categorical & Categorical,
Categorical & Continuous and Continuous &
Continuous.
• Different methods are used to tackle these combinations during the analysis process.
• Continuous & Continuous: While doing bi-variate analysis between two continuous variables, we should look at a scatter plot.
• It is a nifty way to find out the relationship between two variables.
• The pattern of the scatter plot indicates the relationship between the variables, which can be linear or non-linear (see the sketch below).
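A minimal sketch of a scatter plot between two continuous variables, with hypothetical "height" and "weight" columns; the correlation coefficient quantifies the strength of a linear relationship:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical pair of continuous variables.
df = pd.DataFrame({
    "height": [150, 155, 160, 165, 170, 175, 180],
    "weight": [50, 54, 59, 65, 70, 76, 82],
})

# Scatter plot to inspect the pattern (linear or non-linear).
df.plot(kind="scatter", x="height", y="weight")
plt.show()

# Pearson correlation as a numeric summary of a linear relationship.
print("correlation:", df["height"].corr(df["weight"]))
```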
• Categorical & Categorical: To find the relationship
between two categorical variables, we can use
following methods:
– Two-way table: We can start analyzing the
relationship by creating a two-way table of count and
count%.
• The rows represent the categories of one variable and the columns represent the categories of the other variable.
• We show count or count% of observations available in each
combination of row and column categories.
– Stacked Column Chart: This method is more of a
visual form of Two-way table.
• Chi-Square Test: This test is used to derive the statistical significance of the relationship between the variables.
– It also tests whether the evidence in the sample is strong enough to generalize the relationship to a larger population.
– Chi-square is based on the difference between the expected and observed frequencies in one or more categories of the two-way table.
– It returns a probability for the computed chi-square statistic with the appropriate degrees of freedom:
• Probability of 0: indicates that both categorical variables are dependent.
• Probability of 1: shows that both variables are independent.
• Probability less than 0.05: indicates that the relationship between the variables is significant at 95% confidence.
• The chi-square test statistic for a test of independence of two categorical variables is found by:

χ² = Σ (O − E)² / E

• where O represents the observed frequency and E is the expected frequency under the null hypothesis, computed for each cell by:

E = (row total × column total) / sample size
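A sketch of the chi-square test of independence using scipy's chi2_contingency on a made-up two-way table of observed counts:

```python
from scipy.stats import chi2_contingency

# Made-up two-way table: rows are categories of one variable,
# columns are categories of the other (observed frequencies O).
observed = [[20, 30],
            [25, 25]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print("chi-square:", chi2)
print("degrees of freedom:", dof)
print("p-value:", p_value)        # p < 0.05 => significant at 95% confidence
print("expected (E):", expected)  # row total x column total / sample size
```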
• Categorical & Continuous: While exploring the relationship between categorical and continuous variables, we can draw box plots for each level of the categorical variable.
– If the number of levels is small, the plot alone will not show the statistical significance.
– To look at the statistical significance, we can perform a Z-test, T-test or ANOVA.
• Z-Test / T-Test: Either test assesses whether the means of two groups are statistically different from each other or not.

• The Z statistic compares the two group means:

z = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂)

where x̄₁ and x̄₂ are the group means, s₁² and s₂² the group variances, and n₁ and n₂ the group sizes.
• If the probability of Z is small, then the difference between the two averages is more significant.
• The T-test is very similar to the Z-test, but it is used when the number of observations for both categories is less than 30 (see the sketch below).
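A minimal sketch of a two-sample t-test with scipy, using made-up values for the two groups:

```python
from scipy.stats import ttest_ind

# Made-up observations for two groups (fewer than 30 each, so a T-test).
group_a = [62, 65, 68, 70, 72]
group_b = [58, 60, 61, 63, 66]

t_stat, p_value = ttest_ind(group_a, group_b)
print("t-statistic:", t_stat)
print("p-value:", p_value)  # small p => the two means differ significantly
```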
• ANOVA: It assesses whether the averages of more than two groups are statistically different.
– Example: Suppose, we want to test the effect of
five different exercises. For this, we recruit 20
men and assign one type of exercise to 4 men (5
groups). Their weights are recorded after a few
weeks. We need to find out whether the effect of
these exercises on them is significantly different
or not. This can be done by comparing the weights
of the 5 groups of 4 men each.
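A sketch of one-way ANOVA for the five-exercise example above, with made-up weights for the five groups of four men each:

```python
from scipy.stats import f_oneway

# Made-up weights (kg) for five exercise groups of four men each.
g1 = [72, 75, 71, 74]
g2 = [68, 70, 69, 71]
g3 = [80, 78, 79, 81]
g4 = [73, 72, 74, 75]
g5 = [65, 67, 66, 68]

f_stat, p_value = f_oneway(g1, g2, g3, g4, g5)
print("F-statistic:", f_stat)
print("p-value:", p_value)  # small p => at least one group mean differs
```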
Missing Value Treatment
• Why is missing value treatment required?
– Missing data in the training data set can reduce the power / fit of a model or can lead to a biased model.
– It can lead to wrong predictions or classifications.
Why does my data have missing values?
• We have seen the importance of treating missing values in a dataset. They may occur at two stages:
• Data Extraction:
– It is possible that there are problems with the extraction process.
– In such cases, we should double-check for correct data with the data guardians.
– Some hashing procedures can also be used to make sure the data extraction is correct.
– Errors at the data extraction stage are typically easy to find and can be corrected easily as well.
• Data collection: These errors occur at the time of data collection and are harder to correct. They can be categorized into four types:
– Missing completely at random
– Missing at random
– Missing that depends on unobserved
predictors
– Missing that depends on the missing
value itself
Methods to treat missing values

• Deletion
• Mean/ Mode/ Median Imputation
• Prediction Model
• KNN Imputation
Deletion
• It is of two types: listwise deletion and pairwise deletion.
• In listwise deletion, we delete observations where any of the variables is missing. Simplicity is one of the major advantages of this method, but it reduces the power of the model because it reduces the sample size.
• In pairwise deletion, we perform the analysis with all cases in which the variables of interest are present. The advantage of this method is that it keeps as many cases as possible available for analysis. One disadvantage is that it uses a different sample size for different variables.
• Deletion methods are used when the nature of the missing data is “missing completely at random”; otherwise, non-random missing values can bias the model output. Both variants are sketched below.
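A sketch of both deletion variants in pandas, with a hypothetical toy table:

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values.
df = pd.DataFrame({"age":    [25, np.nan, 32, 41],
                   "income": [50, 62, np.nan, 80]})

# Listwise deletion: drop every row that has any missing value.
listwise = df.dropna()
print(listwise)

# Pairwise deletion: each statistic uses all cases available for it.
# pandas does this implicitly, skipping NaN column by column.
print(df["age"].mean())     # uses 3 observations
print(df["income"].mean())  # uses a different 3 observations
```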
Mean/ Mode/ Median Imputation
• Imputation is a method to fill in the missing values
with estimated ones.
• The objective is to employ known relationships
that can be identified in the valid values of the
data set to assist in estimating the missing values.
• It is one of the most frequently used methods.
• It consists of replacing the missing data for a given
attribute by the mean or median (quantitative
attribute) or mode (qualitative attribute) of all
known values of that variable.
• It can be of two types:
– Generalized Imputation: In this case, we calculate the mean or median of all non-missing values of that variable, then replace the missing values with it.
• For example, if the variable “Manpower” has missing values, we take the average of all non-missing values of “Manpower” (28.33 in the example data) and replace each missing value with it.
– Similar case Imputation: In this case, we calculate the average of the non-missing values separately for gender “Male” (29.75) and “Female” (25), then replace each missing value based on gender.
• For “Male”, we replace missing values of Manpower with 29.75, and for “Female” with 25. Both variants are sketched below.
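A sketch of both imputation variants in pandas, assuming hypothetical "Gender" and "Manpower" columns; the toy values are made up and will not exactly reproduce the figures quoted above:

```python
import numpy as np
import pandas as pd

# Hypothetical data; values are made up.
df = pd.DataFrame({"Gender":   ["Male", "Male", "Female", "Female", "Male"],
                   "Manpower": [30.0, 29.5, 25.0, np.nan, np.nan]})

# Generalized imputation: overall mean of the non-missing values.
df["Manpower_gen"] = df["Manpower"].fillna(df["Manpower"].mean())

# Similar case imputation: mean within each gender group.
df["Manpower_sim"] = df.groupby("Gender")["Manpower"].transform(
    lambda s: s.fillna(s.mean()))
print(df)
```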
Prediction Model
• A prediction model is one of the more sophisticated methods for handling missing data.
• Here, we create a predictive model to estimate values that will substitute for the missing data.
• In this case, we divide our data set into two sets: one set with no missing values for the variable and another with missing values.
• The first data set becomes the training data set of the model, while the second data set with missing values is the test data set, and the variable with missing values is treated as the target variable.
• Next, we create a model to predict the target variable based on the other attributes of the training data set, and populate the missing values of the test data set (see the sketch below).
• We can use regression, ANOVA, logistic regression and various other modeling techniques to perform this.
• There are two drawbacks to this approach:
– The model-estimated values are usually more well-behaved than the true values.
– If there are no relationships between the other attributes in the data set and the attribute with missing values, then the model will not be precise for estimating missing values.
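A minimal sketch of model-based imputation with scikit-learn's LinearRegression, assuming hypothetical predictors "x1", "x2" and a target variable "y" with gaps:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: "y" has missing values to be estimated.
df = pd.DataFrame({"x1": [1, 2, 3, 4, 5],
                   "x2": [2, 1, 4, 3, 5],
                   "y":  [10, 12, np.nan, 16, np.nan]})

train = df[df["y"].notna()]  # training set: no missing target values
test = df[df["y"].isna()]    # test set: rows whose target is missing

# Fit on the complete rows, then populate the missing targets.
model = LinearRegression().fit(train[["x1", "x2"]], train["y"])
df.loc[df["y"].isna(), "y"] = model.predict(test[["x1", "x2"]])
print(df)
```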
KNN Imputation
– In this method of imputation, the missing values of an attribute are imputed using a given number of attributes that are most similar to the attribute whose values are missing.
– The similarity of two attributes is determined using a distance function. The method has certain advantages & disadvantages.
– Advantages:
• k-nearest neighbour can predict both qualitative & quantitative attributes.
• Creating a predictive model for each attribute with missing data is not required.
• Attributes with multiple missing values can be easily treated.
• The correlation structure of the data is taken into consideration.
– Disadvantages:
• The KNN algorithm is very time-consuming when analyzing large databases. It searches through the entire dataset looking for the most similar instances.
• The choice of the k-value is very critical. A higher value of k would include attributes that are significantly different from what we need, whereas a lower value of k implies missing out on significant attributes.
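A sketch of KNN imputation with scikit-learn's KNNImputer on a small made-up numeric table:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up numeric data with one missing value.
X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 6.0],
              [4.0, 8.0]])

# Each missing value is filled from the k most similar rows,
# where similarity is measured with a distance function.
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
```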
Outlier Detection and Treatment
What is an Outlier?
• An outlier is an observation that appears far away from, and diverges from, the overall pattern in a sample.
Types of Outliers
• Outliers can be of two types: univariate and multivariate.
What causes Outliers?
Causes of outliers can be classified into two broad categories:
• Artificial (Error) / Non-natural
• Natural
• Let’s understand various types of outliers in more detail:
• Data Entry Errors:- Human errors such as errors caused during data
collection, recording, or entry can cause outliers in data.
• Measurement Error: It is the most common source of outliers. This is
caused when the measurement instrument used turns out to be faulty.
• Experimental Error: Another cause of outliers is experimental error.
• Intentional Outlier: This is commonly found in self-reported measures that involve sensitive data.
• Data Processing Error: Whenever we perform data mining, we extract
data from multiple sources. It is possible that some manipulation or
extraction errors may lead to outliers in the dataset.
• Sampling error: For instance, we have to measure the height of athletes.
By mistake, we include a few basketball players in the sample. This
inclusion is likely to cause outliers in the dataset.
• Natural Outlier: When an outlier is not artificial (due to error), it is a
natural outlier.
Impact of Outliers on a dataset

• Outliers can drastically change the results of data analysis and statistical modeling. There are numerous unfavourable impacts of outliers in a data set:
– They increase the error variance and reduce the power of statistical tests.
– If the outliers are non-randomly distributed, they can decrease normality.
– They can bias or influence estimates that may be of substantive interest.
– They can also violate the basic assumptions of regression, ANOVA and other statistical models.
How to detect Outliers?
• The most commonly used method to detect outliers is visualization. We use various visualization methods, like box plots, histograms and scatter plots.
• Some analysts also use various rules of thumb to detect outliers, such as:
– Any value beyond the range of Q1 − 1.5 × IQR to Q3 + 1.5 × IQR (see the sketch below)
– Capping methods: any value out of the range of the 5th and 95th percentiles can be considered an outlier
– Data points three or more standard deviations away from the mean are considered outliers
• Bivariate and multivariate outliers are typically measured using either an index of influence or leverage, or a distance measure.
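A sketch of the IQR rule of thumb in pandas, on a made-up series with one suspect value:

```python
import pandas as pd

# Made-up values; 95 is the suspect point.
s = pd.Series([10, 12, 13, 12, 11, 14, 13, 95])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
print(s[(s < lower) | (s > upper)])
```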
How to remove Outliers?
• Most of the ways to deal with outliers are similar to the methods for missing values: deleting observations, transforming them, binning them, treating them as a separate group, imputing values, and other statistical methods (a capping sketch follows the list below).
• Common techniques used to deal with outliers:
– Deleting observations
– Transforming and binning values
– Imputing
– Treating them separately
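A minimal sketch of the capping approach, pulling values back to the 5th and 95th percentiles of a made-up series:

```python
import pandas as pd

s = pd.Series([10, 12, 13, 12, 11, 14, 13, 95])

# Cap (winsorize) at the 5th and 95th percentiles.
capped = s.clip(lower=s.quantile(0.05), upper=s.quantile(0.95))
print(capped)  # the extreme value is pulled back to the upper bound
```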
What is Feature Engineering?
• Feature engineering is the science (and art) of
extracting more information from existing
data.
Process of Feature Engineering
Feature engineering can be divided into two steps:
• Variable transformation.
• Variable / Feature creation.
Variable Transformation
• In data modelling, transformation refers to the replacement of a variable by a function of it.
– For instance, replacing a variable x by its square / cube root or by the logarithm of x is a transformation.
– In other words, transformation is a process that changes the distribution or relationship of a variable with others.
• Below are the situations where variable transformation is
a requisite:
• When we want to change the scale of a variable or
standardize the values of a variable for better
understanding.
• When we can transform complex non-linear relationships
into linear relationships.
• A symmetric distribution is preferred over a skewed one, as it is easier to interpret and to draw inferences from.
• Variable Transformation is also done from
an implementation point of view (Human involvement).
Methods of Variable Transformation
There are various methods used to transform variables, including square root, cube root, logarithm, binning, reciprocal and many others. The most common are sketched below:

• Logarithm
• Square / Cube root
• Binning
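A sketch of these transformations in pandas/numpy, assuming a hypothetical right-skewed "income" column:

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed variable.
df = pd.DataFrame({"income": [20, 25, 30, 45, 60, 120, 400]})

df["log_income"] = np.log(df["income"])    # logarithm
df["sqrt_income"] = np.sqrt(df["income"])  # square root
df["cbrt_income"] = np.cbrt(df["income"])  # cube root
df["income_bin"] = pd.cut(df["income"], bins=3,
                          labels=["low", "mid", "high"])  # binning
print(df)
```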
Feature / Variable Creation

• Feature / variable creation is a process to generate new variables / features based on existing variable(s).
• There are various techniques to create new features.
• Creating derived variables: This refers to creating new variables from existing variable(s) using a set of functions or different methods.
• Creating dummy variables:
– One of the most common applications of dummy variables is to convert a categorical variable into numerical variables.
– Dummy variables are also called indicator variables.
– They are useful when taking a categorical variable as a predictor in statistical models.
– A dummy variable can take the values 0 and 1 (see the sketch below).
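A short sketch of dummy variable creation with pandas, assuming a hypothetical "city" column:

```python
import pandas as pd

# Hypothetical categorical predictor.
df = pd.DataFrame({"city": ["Delhi", "Mumbai", "Delhi", "Chennai"]})

# One 0/1 indicator column per category.
dummies = pd.get_dummies(df["city"], prefix="city")
print(pd.concat([df, dummies], axis=1))
```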
