L18&19 Data Exploration
Data exploration
Data exploration is the first step of data
analysis: exploring and visualizing data to
uncover insights early or to identify areas or
patterns worth digging into further.
Why is Data Exploration Important?
• Data exploration helps users make better
decisions about where to dig deeper into the data
and build a broad understanding of the business
to draw on when asking more detailed questions later.
• With a user-friendly interface, anyone across an
organization can familiarize themselves with the
data, discover patterns, and generate thoughtful
questions that may spur on deeper, valuable
analysis.
https://www.tibco.com/reference-center/what-is-data-exploration
• Data exploration and visual analytics tools build
understanding, empowering users to explore data in
any visualization.
• This approach speeds up time to answers and deepens
users’ understanding by covering more ground in less
time.
• Data exploration is also important because
it democratizes access to data and provides governed
self-service analytics.
• Furthermore, businesses can accelerate data
exploration by provisioning and delivering data
through visual data marts that are easy to explore and
use.
Main Use Cases for Data Exploration
• Data exploration can help businesses explore large amounts
of data quickly to better understand next steps in terms of
further analysis.
• This gives the business a more manageable starting point
and a way to target areas of interest.
• In most cases, data exploration involves using data
visualizations to examine the data at a high level.
• By taking this high-level approach, businesses can
determine which data is most important and which may
distort the analysis and therefore should be removed.
• Data exploration can also be helpful in decreasing time
spent on less valuable analysis by selecting the right path
forward from the start.
https://www.tibco.com/reference-center/what-is-data-exploration
Steps of Data Exploration and Preparation
• Variable Identification
• Univariate Analysis
• Bi-variate Analysis
• Missing values treatment
• Outlier treatment
• Variable transformation
• Variable creation
Variable Identification
• First, identify the predictor (input)
and target (output) variables. Next, identify
the data type and category of each variable.
• Example: Suppose we want to predict
whether students will play cricket or not
(refer to the data set below). Here you need to
identify the predictor variables, the target variable,
and the data type and category of each
variable. Below, the variables have been
grouped into these categories:
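A minimal sketch of variable identification in Python, for illustration only. The records, the column names, and the `identify_variables` helper are hypothetical and are not the slide's data set. The sketch also assumes that every numeric column is continuous, which is a simplification.

```python
# Hypothetical records; the slide's actual data set is not reproduced here.
students = [
    {"gender": "M", "prev_exam_marks": 65, "height_cm": 178.0, "play_cricket": "Y"},
    {"gender": "F", "prev_exam_marks": 75, "height_cm": 174.0, "play_cricket": "N"},
    {"gender": "M", "prev_exam_marks": 45, "height_cm": 163.0, "play_cricket": "N"},
    {"gender": "F", "prev_exam_marks": 57, "height_cm": 175.0, "play_cricket": "Y"},
]

TARGET = "play_cricket"  # assumed target: the outcome we want to predict


def identify_variables(rows, target):
    """Label each column by role, data type, and category."""
    summary = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        is_numeric = all(isinstance(v, (int, float)) for v in values)
        summary[name] = {
            "role": "target" if name == target else "predictor",
            "data_type": "numeric" if is_numeric else "character",
            # Simplification: treat every numeric column as continuous.
            "category": "continuous" if is_numeric else "categorical",
        }
    return summary


variable_summary = identify_variables(students, TARGET)
for name, info in variable_summary.items():
    print(name, info)
```

In practice a numeric column can still be categorical (for example, a coded class label), so the inferred category should be checked by hand.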
Univariate Analysis
• At this stage, we explore variables one by one.
– The method used for univariate analysis
depends on whether the variable is
categorical or continuous.
• Continuous Variables: For continuous
variables, we need to understand the central
tendency and spread of the variable. These are
measured using various statistical metrics
and visualization methods, as shown below:
• Categorical Variables: For categorical
variables, we use a frequency table to
understand the distribution of each category.
• We can also read it as the percentage of
values under each category.
• It can be measured using two
metrics, Count and Count%, against each
category.
• A bar chart can be used as the visualization.
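The two kinds of univariate summary described above can be sketched with Python's standard library. The sample values and the helper names `describe_continuous` and `frequency_table` are assumptions made for illustration. The metrics shown are a common selection, not necessarily the exact set on the original slide.

```python
import statistics
from collections import Counter

# Hypothetical sample values, for illustration only.
marks = [45, 55, 57, 62, 65, 75, 81]
gender = ["M", "F", "M", "F", "F", "M", "F", "F"]


def describe_continuous(values):
    """Central tendency and spread of one continuous variable."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "std_dev": statistics.stdev(values),  # sample standard deviation
    }


def frequency_table(values):
    """Count and Count% for each category of one categorical variable."""
    counts = Counter(values)
    total = len(values)
    return {cat: (n, 100 * n / total) for cat, n in counts.items()}


print(describe_continuous(marks))
for category, (count, pct) in frequency_table(gender).items():
    print(f"{category}: count={count}, count%={pct:.1f}")
```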
Bi-variate Analysis
• Bi-variate Analysis finds out the relationship between
two variables.
• Here, we look for association and disassociation
between variables at a pre-defined significance level.
• We can perform bi-variate analysis for any
combination of categorical and continuous variables.
• The combination can be: Categorical & Categorical,
Categorical & Continuous and Continuous &
Continuous.
• Different methods are used to handle these
combinations during the analysis process.
• Continuous & Continuous: When doing
bi-variate analysis between two continuous
variables, we should look at a scatter plot.
• It is a quick way to find out the relationship
between two variables.
• The pattern of the scatter plot indicates the
relationship between the variables. The
relationship can be linear or non-linear.
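The slide recommends a scatter plot. As a numeric companion that needs no plotting library, the sketch below computes the Pearson correlation coefficient. This is a swapped-in technique, not one named on the slide, and it measures only the strength of a linear relationship. The data and the `pearson_r` helper are hypothetical.

```python
import math

# Hypothetical paired observations, for illustration only.
height_cm = [150, 160, 165, 170, 180]
weight_kg = [50, 56, 61, 64, 72]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


r = pearson_r(height_cm, weight_kg)
print(f"r = {r:.3f}")
```

A value near +1 or -1 suggests a strong linear pattern. A value near 0 does not rule out a non-linear relationship, which is why the scatter plot remains the first check.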
• Categorical & Categorical: To find the relationship
between two categorical variables, we can use the
following methods:
– Two-way table: We can start analyzing the
relationship by creating a two-way table of count and
count%.
• The rows represent the categories of one variable and the
columns represent the categories of the other variable.
• We show count or count% of observations available in each
combination of row and column categories.
– Stacked Column Chart: This method is more of a
visual form of Two-way table.
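A two-way table of count and count% can be sketched as follows. The observations and the `two_way_table` helper are hypothetical. Count% is computed here as a percentage of each row total, which is an assumption: the slide does not say whether row, column, or overall percentages are meant.

```python
from collections import Counter

# Hypothetical (gender, play_cricket) observations, for illustration only.
observations = [
    ("M", "Y"), ("M", "Y"), ("M", "N"), ("M", "Y"),
    ("F", "N"), ("F", "Y"), ("F", "N"), ("F", "N"),
]


def two_way_table(pairs):
    """Return {row_category: {col_category: (count, row_pct)}}."""
    counts = Counter(pairs)
    row_labels = sorted({r for r, _ in pairs})
    col_labels = sorted({c for _, c in pairs})
    table = {}
    for r in row_labels:
        row_total = sum(counts[(r, c)] for c in col_labels)
        table[r] = {
            c: (counts[(r, c)], 100 * counts[(r, c)] / row_total)
            for c in col_labels
        }
    return table


table = two_way_table(observations)
for row, cells in table.items():
    print(row, {col: f"{n} ({pct:.0f}%)" for col, (n, pct) in cells.items()})
```

The same counts are what a stacked column chart would display, with one column per row category.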
• Chi-Square Test: This test is used to derive the
statistical significance of the relationship between the
variables.
– It also tests whether the evidence in the sample is
strong enough to generalize the relationship to a
larger population.
– Chi-square is based on the difference between the
expected and observed frequencies in one or more
categories of the two-way table.
– It returns the probability (p-value) of the computed
chi-square value under the chi-square distribution with
the given degrees of freedom.
• Probability of 0: It indicates that the two categorical variables are
dependent.
• Probability of 1: It shows that the two variables are independent.
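• Probability less than 0.05: It indicates that the relationship
between the variables is significant at 95% confidence. The
chi-square test statistic for a test of independence of two
categorical variables is found by:
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
where O is the observed frequency in a cell of the two-way table and
E is the frequency expected under independence:
$$E = \frac{\text{row total} \times \text{column total}}{\text{sample size}}$$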
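A minimal sketch of the chi-square test of independence, using only the standard library. The observed counts and both helper names are hypothetical. The p-value helper assumes a 2x2 table (1 degree of freedom) and applies no continuity correction. Larger tables would need a general chi-square distribution function, for example from a statistics package.

```python
import math

# Hypothetical 2x2 table of observed counts
# (rows: gender, columns: plays cricket yes / no).
observed = [
    [30, 10],  # M
    [15, 25],  # F
]


def chi_square_statistic(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (obs - expected) ** 2 / expected
    return chi2


def p_value_df1(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom.

    Valid only for 2x2 tables, where df = (2 - 1) * (2 - 1) = 1.
    """
    return math.erfc(math.sqrt(chi2 / 2))


chi2 = chi_square_statistic(observed)
p = p_value_df1(chi2)
print(f"chi-square = {chi2:.3f}, p = {p:.5f}")
print("significant at the 0.05 level" if p < 0.05 else "not significant")
```

A table whose observed counts equal the expected counts gives a statistic of 0 and a probability of 1, which matches the "Probability of 1" case above.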
Missing values treatment
• Deletion
• Mean / Mode / Median Imputation
• Prediction Model
• KNN Imputation
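Of the methods listed above, mean / mode / median imputation is the simplest to sketch. The column values and the `impute` helper below are hypothetical, and missing entries are assumed to be marked with `None`.

```python
import statistics

# Hypothetical column with missing entries marked as None.
marks = [65, None, 45, 57, None, 75, 57]


def impute(values, strategy="mean"):
    """Replace None with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }[strategy](observed)
    return [fill if v is None else v for v in values]


print(impute(marks, "mean"))
print(impute(marks, "median"))
print(impute(marks, "mode"))
```

Mean and median suit continuous variables, while mode is the usual choice for categorical ones.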
Deletion
• It is of two types: list-wise deletion and pair-wise deletion.
• In list-wise deletion, we delete observations in which any
variable is missing. Simplicity is one of the major advantages of this
method, but it reduces the power of the model because it
reduces the sample size.
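The effect of deletion on sample size can be sketched as follows. The records and helper names are hypothetical. The pair-wise variant reflects the usual meaning of the term (each analysis keeps every observation that is complete for the variables it uses), which the slide itself does not spell out.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"gender": "M", "marks": 65, "height_cm": 178.0},
    {"gender": "F", "marks": None, "height_cm": 174.0},
    {"gender": "M", "marks": 45, "height_cm": None},
    {"gender": None, "marks": 57, "height_cm": 175.0},
    {"gender": "F", "marks": 75, "height_cm": 169.0},
]


def listwise_delete(rows):
    """Keep only rows in which no variable is missing."""
    return [row for row in rows if all(v is not None for v in row.values())]


def pairwise_select(rows, variables):
    """Keep rows that are complete for the given variables only."""
    return [row for row in rows if all(row[v] is not None for v in variables)]


complete = listwise_delete(records)
marks_height = pairwise_select(records, ("marks", "height_cm"))
print(f"original: {len(records)} rows")
print(f"list-wise deletion: {len(complete)} rows")
print(f"pair-wise (marks, height_cm): {len(marks_height)} rows")
```

With this data, list-wise deletion keeps fewer rows than the pair-wise selection, which illustrates the loss of sample size noted above.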
Variable transformation
• Logarithm
• Square / Cube root
• Binning
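The transformations listed above can be sketched with the standard library. The sample values, the bin cut-offs, the labels, and the `bin_value` helper are all hypothetical choices for illustration.

```python
import math

# Hypothetical right-skewed, strictly positive values.
income = [12, 15, 18, 25, 40, 90, 400]

# Logarithm: defined only for strictly positive values.
log_income = [math.log(x) for x in income]

# Square root and cube root: applied here to non-negative values only.
sqrt_income = [math.sqrt(x) for x in income]
cbrt_income = [x ** (1 / 3) for x in income]

EDGES = [20, 50]  # hypothetical upper cut-offs for the first two bins
LABELS = ["low", "medium", "high"]


def bin_value(x, edges, labels):
    """Return the label of the first bin whose upper edge is >= x."""
    for upper, label in zip(edges, labels):
        if x <= upper:
            return label
    return labels[-1]  # above every cut-off


binned = [bin_value(x, EDGES, LABELS) for x in income]
print([round(v, 2) for v in log_income])
print(binned)
```

The log and root transforms compress the large values of a right-skewed variable, while binning turns a continuous variable into a categorical one.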
Feature / Variable Creation