
Data Exploration

Data exploration
Data exploration is the first step of data analysis, used to explore and visualize data to uncover insights from the start, or to identify areas or patterns to dig into further.
Why is Data Exploration Important?
• Data exploration helps users make better decisions about where to dig deeper into the data, and gives them a broad understanding of the business before they ask more detailed questions.
• With a user-friendly interface, anyone across an organization can familiarize themselves with the data, discover patterns, and generate thoughtful questions that may spur deeper, valuable analysis.

https://www.tibco.com/reference-center/what-is-data-exploration
• Data exploration and visual analytics tools build understanding, empowering users to explore data in any visualization.
• This approach speeds up time to answers and deepens users’ understanding by covering more ground in less time.
• Data exploration is important because it democratizes access to data and provides governed self-service analytics.
• Furthermore, businesses can accelerate data exploration by provisioning and delivering data through visual data marts that are easy to explore and use.
Main Use Cases for Data Exploration
• Data exploration can help businesses explore large amounts of data quickly to better understand the next steps for further analysis.
• This gives the business a more manageable starting point
and a way to target areas of interest.
• In most cases, data exploration involves using data
visualizations to examine the data at a high level.
• By taking this high-level approach, businesses can
determine which data is most important and which may
distort the analysis and therefore should be removed.
• Data exploration can also be helpful in decreasing time
spent on less valuable analysis by selecting the right path
forward from the start.

https://www.tibco.com/reference-center/what-is-data-exploration
Steps of Data Exploration and
Preparation
• Variable Identification
• Univariate Analysis
• Bi-variate Analysis
• Missing values treatment
• Outlier treatment
• Variable transformation
• Variable creation
Variable Identification
• First, identify the Predictor (input) and Target (output) variables. Next, identify the data type and category of each variable.
• Example: suppose we want to predict whether students will play cricket or not, given a data set of student attributes. Here you need to identify the predictor variables, the target variable, the data type of each variable, and the category of each variable.
Univariate Analysis
• At this stage, we explore variables one by one.
– The method used to perform univariate analysis depends on whether the variable is categorical or continuous.
• Continuous Variables: In the case of continuous variables, we need to understand the central tendency and spread of the variable. These are measured using statistical metrics (mean, median, mode, quartiles, IQR, variance, standard deviation) and visualization methods such as histograms and box plots, as in the sketch below:
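A minimal sketch in Python of these univariate summaries, assuming a hypothetical pandas DataFrame with an "age" column (the values are made up for illustration):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical continuous variable; values are made up for illustration.
df = pd.DataFrame({"age": [23, 25, 27, 29, 31, 35, 40, 46, 52, 61]})

# Central tendency and spread.
print(df["age"].describe())   # count, mean, std, min, quartiles, max
print("median:", df["age"].median())
print("IQR:", df["age"].quantile(0.75) - df["age"].quantile(0.25))
print("skewness:", df["age"].skew())

# Visualization: histogram and box plot.
df["age"].plot(kind="hist")
plt.show()
df["age"].plot(kind="box")
plt.show()
```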
• Categorical Variables: For categorical variables, we use a frequency table to understand the distribution of each category.
• We can also read it as the percentage of values under each category.
• It can be measured using two metrics, Count and Count%, against each category.
• A bar chart can be used as a visualization, as sketched below.
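A short sketch of a frequency table with Count and Count%, assuming a hypothetical "gender" column:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical categorical variable.
df = pd.DataFrame({"gender": ["Male", "Female", "Male", "Male", "Female"]})

# Frequency table: Count and Count% against each category.
count = df["gender"].value_counts()
count_pct = df["gender"].value_counts(normalize=True) * 100
print(pd.DataFrame({"Count": count, "Count%": count_pct}))

# Bar chart as the visualization.
count.plot(kind="bar")
plt.show()
```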
Bi-variate Analysis
• Bi-variate Analysis finds out the relationship between
two variables.
• Here, we look for association and disassociation
between variables at a pre-defined significance level.
• We can perform bi-variate analysis for any
combination of categorical and continuous variables.
• The combination can be: Categorical & Categorical,
Categorical & Continuous and Continuous &
Continuous.
• Different methods are used to tackle these combinations during the analysis process.
• Continuous & Continuous: While doing bi-variate analysis between two continuous variables, we should look at a scatter plot.
• It is a nifty way to find out the relationship between two variables.
• The pattern of the scatter plot indicates the relationship between the variables, which can be linear or non-linear (see the sketch below).
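A minimal sketch of a scatter plot between two continuous variables, with hypothetical "height" and "weight" columns; the correlation coefficient quantifies the strength of a linear relationship:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical pair of continuous variables.
df = pd.DataFrame({
    "height": [150, 155, 160, 165, 170, 175, 180],
    "weight": [50, 54, 59, 65, 70, 76, 82],
})

# Scatter plot to inspect the pattern (linear or non-linear).
df.plot(kind="scatter", x="height", y="weight")
plt.show()

# Pearson correlation as a numeric summary of a linear relationship.
print("correlation:", df["height"].corr(df["weight"]))
```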
• Categorical & Categorical: To find the relationship
between two categorical variables, we can use
following methods:
– Two-way table: We can start analyzing the
relationship by creating a two-way table of count and
count%.
• The rows represent the categories of one variable and the columns represent the categories of the other variable.
• We show count or count% of observations available in each
combination of row and column categories.
– Stacked Column Chart: This method is more of a
visual form of Two-way table.
• Chi-Square Test: This test is used to derive the statistical significance of the relationship between the variables.
– It also tests whether the evidence in the sample is strong enough to generalize the relationship to a larger population.
– Chi-square is based on the difference between the expected and observed frequencies in one or more categories of the two-way table.
– It returns a probability for the computed chi-square statistic with the appropriate degrees of freedom:
• Probability of 0: indicates that both categorical variables are dependent.
• Probability of 1: shows that both variables are independent.
• Probability less than 0.05: indicates that the relationship between the variables is significant at 95% confidence.
• The chi-square test statistic for a test of independence of two categorical variables is found by:

χ² = Σ (O − E)² / E

• where O represents the observed frequency and E is the expected frequency under the null hypothesis, computed for each cell by:

E = (row total × column total) / sample size
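A sketch of the chi-square test of independence using scipy's chi2_contingency on a made-up two-way table of observed counts:

```python
from scipy.stats import chi2_contingency

# Made-up two-way table: rows are categories of one variable,
# columns are categories of the other (observed frequencies O).
observed = [[20, 30],
            [25, 25]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print("chi-square:", chi2)
print("degrees of freedom:", dof)
print("p-value:", p_value)        # p < 0.05 => significant at 95% confidence
print("expected (E):", expected)  # row total x column total / sample size
```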
• Categorical & Continuous: While exploring the relationship between categorical and continuous variables, we can draw box plots for each level of the categorical variable.
– If the number of levels is small, the plot alone will not show the statistical significance.
– To look at the statistical significance, we can perform a Z-test, T-test or ANOVA.
• Z-Test / T-Test: Either test assesses whether the means of two groups are statistically different from each other or not.

• The Z statistic compares the two group means:

z = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂)

where x̄₁ and x̄₂ are the group means, s₁² and s₂² the group variances, and n₁ and n₂ the group sizes.
• If the probability of Z is small, then the difference between the two averages is more significant.
• The T-test is very similar to the Z-test, but it is used when the number of observations for both categories is less than 30 (see the sketch below).
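A minimal sketch of a two-sample t-test with scipy, using made-up values for the two groups:

```python
from scipy.stats import ttest_ind

# Made-up observations for two groups (fewer than 30 each, so a T-test).
group_a = [62, 65, 68, 70, 72]
group_b = [58, 60, 61, 63, 66]

t_stat, p_value = ttest_ind(group_a, group_b)
print("t-statistic:", t_stat)
print("p-value:", p_value)  # small p => the two means differ significantly
```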
• ANOVA: It assesses whether the averages of more than two groups are statistically different.
– Example: Suppose, we want to test the effect of
five different exercises. For this, we recruit 20
men and assign one type of exercise to 4 men (5
groups). Their weights are recorded after a few
weeks. We need to find out whether the effect of
these exercises on them is significantly different
or not. This can be done by comparing the weights
of the 5 groups of 4 men each.
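A sketch of one-way ANOVA for the five-exercise example above, with made-up weights for the five groups of four men each:

```python
from scipy.stats import f_oneway

# Made-up weights (kg) for five exercise groups of four men each.
g1 = [72, 75, 71, 74]
g2 = [68, 70, 69, 71]
g3 = [80, 78, 79, 81]
g4 = [73, 72, 74, 75]
g5 = [65, 67, 66, 68]

f_stat, p_value = f_oneway(g1, g2, g3, g4, g5)
print("F-statistic:", f_stat)
print("p-value:", p_value)  # small p => at least one group mean differs
```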
Missing Value Treatment
• Why is missing value treatment required?
– Missing data in the training data set can reduce the power / fit of a model or can lead to a biased model.
– It can lead to wrong predictions or classifications.
Why does my data have missing values?
• We have seen the importance of treating missing values in a dataset. They may occur at two stages:
• Data Extraction:
– It is possible that there are problems with the extraction process.
– In such cases, we should double-check for correct data with the data guardians.
– Some hashing procedures can also be used to make sure the data extraction is correct.
– Errors at the data extraction stage are typically easy to find and can be corrected easily as well.
• Data collection: These errors occur at the time of data collection and are harder to correct. They can be categorized into four types:
– Missing completely at random
– Missing at random
– Missing that depends on unobserved
predictors
– Missing that depends on the missing
value itself
Methods to treat missing values

• Deletion
• Mean/ Mode/ Median Imputation
• Prediction Model
• KNN Imputation
Deletion
• It is of two types: listwise deletion and pairwise deletion.
• In listwise deletion, we delete observations where any of the variables is missing. Simplicity is one of the major advantages of this method, but it reduces the power of the model because it reduces the sample size.
• In pairwise deletion, we perform the analysis with all cases in which the variables of interest are present. The advantage of this method is that it keeps as many cases as possible available for analysis. One disadvantage is that it uses a different sample size for different variables.
• Deletion methods are used when the nature of the missing data is “missing completely at random”; otherwise, non-random missing values can bias the model output. Both variants are sketched below.
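A sketch of both deletion variants in pandas, with a hypothetical toy table:

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values.
df = pd.DataFrame({"age":    [25, np.nan, 32, 41],
                   "income": [50, 62, np.nan, 80]})

# Listwise deletion: drop every row that has any missing value.
listwise = df.dropna()
print(listwise)

# Pairwise deletion: each statistic uses all cases available for it.
# pandas does this implicitly, skipping NaN column by column.
print(df["age"].mean())     # uses 3 observations
print(df["income"].mean())  # uses a different 3 observations
```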
Mean/ Mode/ Median Imputation
• Imputation is a method to fill in the missing values
with estimated ones.
• The objective is to employ known relationships
that can be identified in the valid values of the
data set to assist in estimating the missing values.
• It is one of the most frequently used methods.
• It consists of replacing the missing data for a given
attribute by the mean or median (quantitative
attribute) or mode (qualitative attribute) of all
known values of that variable.
• It can be of two types:
– Generalized Imputation: In this case, we calculate the mean or median of all non-missing values of that variable, then replace the missing values with it.
• For example, if the variable “Manpower” has missing values, we take the average of all non-missing values of “Manpower” (28.33 in the example data) and replace each missing value with it.
– Similar case Imputation: In this case, we calculate the average of the non-missing values separately for gender “Male” (29.75) and “Female” (25), then replace each missing value based on gender.
• For “Male”, we replace missing values of Manpower with 29.75, and for “Female” with 25. Both variants are sketched below.
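A sketch of both imputation variants in pandas, assuming hypothetical "Gender" and "Manpower" columns; the toy values are made up and will not exactly reproduce the figures quoted above:

```python
import numpy as np
import pandas as pd

# Hypothetical data; values are made up.
df = pd.DataFrame({"Gender":   ["Male", "Male", "Female", "Female", "Male"],
                   "Manpower": [30.0, 29.5, 25.0, np.nan, np.nan]})

# Generalized imputation: overall mean of the non-missing values.
df["Manpower_gen"] = df["Manpower"].fillna(df["Manpower"].mean())

# Similar case imputation: mean within each gender group.
df["Manpower_sim"] = df.groupby("Gender")["Manpower"].transform(
    lambda s: s.fillna(s.mean()))
print(df)
```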
Prediction Model
• A prediction model is one of the more sophisticated methods for handling missing data.
• Here, we create a predictive model to estimate values that will substitute for the missing data.
• In this case, we divide our data set into two sets: one set with no missing values for the variable and another with missing values.
• The first data set becomes the training data set of the model, while the second data set with missing values is the test data set, and the variable with missing values is treated as the target variable.
• Next, we create a model to predict the target variable based on the other attributes of the training data set, and populate the missing values of the test data set (see the sketch below).
• We can use regression, ANOVA, logistic regression and various other modeling techniques to perform this.
• There are two drawbacks to this approach:
– The model-estimated values are usually more well-behaved than the true values.
– If there are no relationships between the other attributes in the data set and the attribute with missing values, then the model will not be precise for estimating missing values.
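A minimal sketch of model-based imputation with scikit-learn's LinearRegression, assuming hypothetical predictors "x1", "x2" and a target variable "y" with gaps:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: "y" has missing values to be estimated.
df = pd.DataFrame({"x1": [1, 2, 3, 4, 5],
                   "x2": [2, 1, 4, 3, 5],
                   "y":  [10, 12, np.nan, 16, np.nan]})

train = df[df["y"].notna()]  # training set: no missing target values
test = df[df["y"].isna()]    # test set: rows whose target is missing

# Fit on the complete rows, then populate the missing targets.
model = LinearRegression().fit(train[["x1", "x2"]], train["y"])
df.loc[df["y"].isna(), "y"] = model.predict(test[["x1", "x2"]])
print(df)
```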
KNN Imputation
– In this method of imputation, the missing values of an attribute are imputed using a given number of attributes that are most similar to the attribute whose values are missing.
– The similarity of two attributes is determined using a distance function. The method has certain advantages & disadvantages.
– Advantages:
• k-nearest neighbour can predict both qualitative & quantitative attributes.
• Creating a predictive model for each attribute with missing data is not required.
• Attributes with multiple missing values can be easily treated.
• The correlation structure of the data is taken into consideration.
– Disadvantages:
• The KNN algorithm is very time-consuming when analyzing large databases. It searches through the entire dataset looking for the most similar instances.
• The choice of the k-value is very critical. A higher value of k would include attributes that are significantly different from what we need, whereas a lower value of k implies missing out on significant attributes.
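A sketch of KNN imputation with scikit-learn's KNNImputer on a small made-up numeric table:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up numeric data with one missing value.
X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 6.0],
              [4.0, 8.0]])

# Each missing value is filled from the k most similar rows,
# where similarity is measured with a distance function.
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
```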
Outlier Detection and Treatment
What is an Outlier?
• An outlier is an observation that appears far away from, and diverges from, the overall pattern in a sample.
Types of Outliers
• Outliers can be of two types: univariate and multivariate.
What causes Outliers?
Causes of outliers can be classified into two broad categories:
• Artificial (Error) / Non-natural
• Natural
• Let’s understand various types of outliers in more detail:
• Data Entry Errors:- Human errors such as errors caused during data
collection, recording, or entry can cause outliers in data.
• Measurement Error: It is the most common source of outliers. This is
caused when the measurement instrument used turns out to be faulty.
• Experimental Error: Another cause of outliers is experimental error.
• Intentional Outlier: This is commonly found in self-reported measures that involve sensitive data.
• Data Processing Error: Whenever we perform data mining, we extract
data from multiple sources. It is possible that some manipulation or
extraction errors may lead to outliers in the dataset.
• Sampling error: For instance, we have to measure the height of athletes.
By mistake, we include a few basketball players in the sample. This
inclusion is likely to cause outliers in the dataset.
• Natural Outlier: When an outlier is not artificial (due to error), it is a
natural outlier.
Impact of Outliers on a dataset

• Outliers can drastically change the results of data analysis and statistical modeling. There are numerous unfavourable impacts of outliers in a data set:
– They increase the error variance and reduce the power of statistical tests.
– If the outliers are non-randomly distributed, they can decrease normality.
– They can bias or influence estimates that may be of substantive interest.
– They can also violate the basic assumptions of regression, ANOVA and other statistical models.
How to detect Outliers?
• The most commonly used method to detect outliers is visualization. We use various visualization methods, like box plots, histograms and scatter plots.
• Some analysts also use various rules of thumb to detect outliers, such as:
– Any value beyond the range of Q1 − 1.5 × IQR to Q3 + 1.5 × IQR (see the sketch below)
– Capping methods: any value out of the range of the 5th and 95th percentiles can be considered an outlier
– Data points three or more standard deviations away from the mean are considered outliers
• Bivariate and multivariate outliers are typically measured using either an index of influence or leverage, or a distance measure.
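A sketch of the IQR rule of thumb in pandas, on a made-up series with one suspect value:

```python
import pandas as pd

# Made-up values; 95 is the suspect point.
s = pd.Series([10, 12, 13, 12, 11, 14, 13, 95])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
print(s[(s < lower) | (s > upper)])
```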
How to remove Outliers?
• Most of the ways to deal with outliers are similar to the methods for missing values: deleting observations, transforming them, binning them, treating them as a separate group, imputing values, and other statistical methods (a capping sketch follows the list below).
• Common techniques used to deal with outliers:
– Deleting observations
– Transforming and binning values
– Imputing
– Treating them separately
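A minimal sketch of the capping approach, pulling values back to the 5th and 95th percentiles of a made-up series:

```python
import pandas as pd

s = pd.Series([10, 12, 13, 12, 11, 14, 13, 95])

# Cap (winsorize) at the 5th and 95th percentiles.
capped = s.clip(lower=s.quantile(0.05), upper=s.quantile(0.95))
print(capped)  # the extreme value is pulled back to the upper bound
```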
What is Feature Engineering?
• Feature engineering is the science (and art) of
extracting more information from existing
data.
Process of Feature Engineering
Feature engineering can be divided into two steps:
• Variable transformation.
• Variable / Feature creation.
Variable Transformation
• In data modelling, transformation refers to the replacement of a variable by a function of it.
– For instance, replacing a variable x by its square / cube root or by the logarithm of x is a transformation.
– In other words, transformation is a process that changes the distribution or relationship of a variable with others.
• Below are the situations where variable transformation is
a requisite:
• When we want to change the scale of a variable or
standardize the values of a variable for better
understanding.
• When we can transform complex non-linear relationships
into linear relationships.
• A symmetric distribution is preferred over a skewed one, as it is easier to interpret and to draw inferences from.
• Variable Transformation is also done from
an implementation point of view (Human involvement).
Methods of Variable Transformation
There are various methods used to transform variables, including square root, cube root, logarithm, binning, reciprocal and many others. The most common are sketched below:

• Logarithm
• Square / Cube root
• Binning
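A sketch of these transformations in pandas/numpy, assuming a hypothetical right-skewed "income" column:

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed variable.
df = pd.DataFrame({"income": [20, 25, 30, 45, 60, 120, 400]})

df["log_income"] = np.log(df["income"])    # logarithm
df["sqrt_income"] = np.sqrt(df["income"])  # square root
df["cbrt_income"] = np.cbrt(df["income"])  # cube root
df["income_bin"] = pd.cut(df["income"], bins=3,
                          labels=["low", "mid", "high"])  # binning
print(df)
```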
Feature / Variable Creation

• Feature / variable creation is a process to generate new variables / features based on existing variable(s).
• There are various techniques to create new features.
• Creating derived variables: This refers to creating new variables from existing variable(s) using a set of functions or different methods.
• Creating dummy variables:
– One of the most common applications of dummy variables is to convert a categorical variable into numerical variables.
– Dummy variables are also called indicator variables.
– They are useful when taking a categorical variable as a predictor in statistical models.
– A dummy variable can take the values 0 and 1 (see the sketch below).
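A short sketch of dummy variable creation with pandas, assuming a hypothetical "city" column:

```python
import pandas as pd

# Hypothetical categorical predictor.
df = pd.DataFrame({"city": ["Delhi", "Mumbai", "Delhi", "Chennai"]})

# One 0/1 indicator column per category.
dummies = pd.get_dummies(df["city"], prefix="city")
print(pd.concat([df, dummies], axis=1))
```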
