CH05 Business Analytics Process and Data Exploration


Business Intelligence and Business Analytics
05 Business Analytics Process and Data Exploration
Course Overview
• This chapter covers data exploration, validation, and cleaning required
for data analysis. You’ll learn the purpose of data cleaning, why you
need data preparation, how to go about handling missing values, and
some of the data-cleaning techniques used in the industry.
Course Contents
• Business Analytics Life Cycle
• Understanding the Business Problem
• Collecting and Integrating the Data
• Preprocessing the Data
• Exploring and Visualizing the Data
• Using Modeling Techniques and Algorithms
• Evaluating the Model
• Presenting a Management Report and Review
• Deploying the Model
Business Analytics Life Cycle
• The purpose is to derive information from data in order to make
appropriate business decisions.
• Consists of eight phases:
a. Understand the Business Problem
b. Collect and Integrate the Data
c. Preprocess the Data
d. Explore and Visualize the Data
e. Choose Modeling Techniques and Algorithms
f. Evaluate the Model
g. Report to Management and Review
h. Deploy the Model
Business Analytics Life Cycle
Phase 1 → Understand the Business Problem
• the focus is to understand the problem, objectives, and requirements
from the perspective of the business.
• The business problem is then converted into a data analytics problem, with
the aim of solving it by using appropriate methods to achieve the objective.

Phase 2 → Collect and Integrate the Data


• data is collected from various sources in this important phase.
• If the data is not available in a current database or data warehouse, a
survey may be required.
Business Analytics Life Cycle
Phase 3 → Preprocess the Data
• data is cleaned and normalized
• This process may be repeated several times, depending on the quality
of data you get and the accuracy of the model obtained.

Phase 4 → Explore and Visualize the Data


• to understand the characteristics of the data, such as its distribution,
trends, and relationships among variables
Business Analytics Life Cycle
Phase 5 → Choose Modeling Techniques and Algorithms
• decide whether to use unsupervised or supervised machine-learning
techniques
• These choices depend on both the business requirements and the data
you have.

Phase 6 → Evaluate the Model


• evaluate the model by using standard methods that measure the
accuracy of the model and its performance in the field.
• important to evaluate the model and to be certain that the model
achieves the business objectives specified by business leaders
Business Analytics Life Cycle
Phase 7 → Report to Management and Review
• present the mathematical model to the business leaders
Phase 8 → Deploy the Model
• a challenging phase of the project.
• The model is now deployed for end users and is in a production
environment, analyzing the live data
Understanding the Business Problem
• Key purpose → to solve a business problem
• Need to thoroughly understand the problem from a business perspective
before solving it
Collecting and Integrating the Data
• The most important factor determining the accuracy of the results is
getting quality data.
• Can be either from a primary source or secondary source
• Most organizations have data spread across various databases
Collecting and Integrating the Data
• Sampling → a smaller collection of units from a population used to
determine truths about that population (Field, 2005)
• several variations on this type of sampling:
• Random sampling: A sample is picked randomly, and every member
has an equal opportunity to be selected.
• Stratified sampling: The population is divided into groups, and data is
selected randomly from a group, or strata.
• Systematic sampling: You select members systematically (say, every
tenth member) at a particular time or event.
• n = ((z · σ) / E)² → if the standard deviation σ is known
• n = p(1 − p) · (z / E)² → if the standard deviation is unknown (p is an
estimated proportion)
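The two sample-size formulas above take only a few lines to compute. The chapter's hands-on examples use R, but the arithmetic is language-agnostic; this is a minimal Python sketch, and the function names are illustrative, not from the text:

```python
import math

def sample_size_known_sigma(z, sigma, E):
    """n = ((z * sigma) / E)^2, rounded up: population standard deviation is known."""
    return math.ceil((z * sigma / E) ** 2)

def sample_size_proportion(z, p, E):
    """n = p(1 - p)(z / E)^2, rounded up: sigma unknown, p is an estimated proportion."""
    return math.ceil(p * (1 - p) * (z / E) ** 2)

# 95% confidence (z = 1.96), margin of error E = 0.05, worst-case p = 0.5
print(sample_size_proportion(1.96, 0.5, 0.05))  # 385
```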
Collecting and Integrating the Data
• Variable Selection
• The more predictor variables you include, the more records you need
• The more records you have, the better the prediction results
Preprocessing the Data
• Data type: Qualitative or Quantitative
• Qualitative data → not numerical (e.g. type of car, favorite color)
• Quantitative data → numeric. Can be divided into discrete data or
continuous data
• Discrete Data → A variable can take a certain value that is separate and
distinct
• Continuous Data → A variable that can take numeric values within a
specific range or interval
Preprocessing the Data
• Handling Missing Values
• Methods used to resolve missing values:
a. Ignore the records with missing values (not a very effective method)
b. Fill in the missing values with the attribute's mean or mode (the simplest
method)
c. Fill in the missing values with the mean of the attribute values belonging
to the same bin
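Method (b) above, filling with the mean for numeric data and the mode for categorical data, can be sketched in plain Python. The data values here are made up for illustration:

```python
from statistics import mean, mode

ages = [25, 30, None, 40, 35]
colors = ["red", None, "blue", "red", "red"]

# Fill numeric missing values with the mean of the observed values
age_mean = mean(a for a in ages if a is not None)
ages_filled = [age_mean if a is None else a for a in ages]

# Fill categorical missing values with the mode of the observed values
color_mode = mode(c for c in colors if c is not None)
colors_filled = [color_mode if c is None else c for c in colors]

print(ages_filled)    # [25, 30, 32.5, 40, 35]
print(colors_filled)  # ['red', 'red', 'blue', 'red', 'red']
```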
Preprocessing the Data
• Handling Duplicates, Junk, and Null Values
• These should be cleaned from the database before the analytics process
begins.
• The same process as for handling missing values applies
• Identifying the junk characters is the main challenge
Preprocessing the Data
• Data preprocessing with R → methods discussed:
a. Understanding the variable types
b. Changing the variable types
c. Finding missing and null values
d. Cleaning missing values with appropriate methods
• The following are the basic data types in R:
a. Numeric: Real numbers.
b. Integer: Whole numbers.
c. Factor: Categorical data defining various categories.
d. Character: Strings of characters defining text data.
Exploring and Visualizing the Data
• Tables → View() command
Exploring and Visualizing the Data
• Summary Tables
Exploring and Visualizing the Data
• Box plots
Exploring and Visualizing the Data
• Scatter plots
Exploring and Visualizing the Data
• Scatter plot matrices: use the pairs() function →
> hou <- read.table("housing.data", header=TRUE, sep="\t")
Exploring and Visualizing the Data
• Scatter plots matrices → Trellis Plot
Exploring and Visualizing the Data
• Scatter plots matrices → Correlation Plot
Exploring and Visualizing the Data
• Scatter plots matrices → Density by Class
Exploring and Visualizing the Data
• Normalization → some techniques:
a. Z-score normalization: The new value is created based on the mean and
standard deviation → A′ = (A − meanA) / SDA
b. Min-max normalization: Values are transformed into a specified target
range → A′ = ((A − MinA) / (MaxA − MinA)) · (range of A′) + MinA′, where
range of A′ = MaxA′ − MinA′
c. Data aggregation: Sometimes a new variable may be required to better
understand the data → A′ = (A^λ − 1) / λ, λ > 1
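The z-score and min-max formulas are easy to implement directly. A minimal Python sketch (the chapter's worked examples use R, but the arithmetic is identical; population standard deviation is assumed here):

```python
def z_score(values):
    """Z-score normalization: A' = (A - mean) / sd (population sd)."""
    m = sum(values) / len(values)
    sd = (sum((v - m) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - m) / sd for v in values]

def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

data = [10, 20, 30, 40, 50]
print(min_max(data))  # [0.0, 0.25, 0.5, 0.75, 1.0]
```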
Using Modelling Techniques And Algorithms
• Descriptive Analytics → explains the patterns hidden in data.
• Any patterns, such as the number of market segments or sales numbers
by region, are based purely on historical data.

• Predictive Analytics → consists of two methods:


a. Classification → a basic form of data analysis in which data is classified
into classes
b. Regression → predicting the value of a numerical variable
Using Modelling Techniques And Algorithms
• Machine learning → making computers learn and perform tasks better
based on historical data
• Two type of machine learning:
a. Supervised machine learning → a machine builds a predictive model
under supervision—that is, with the help of a training data set.
b. Unsupervised machine learning → there is no training data to learn from,
and no target variable to predict.
Using Modelling Techniques And Algorithms
• Supervised Machine Learning vs. Unsupervised Machine Learning
(comparison diagram)
Evaluating the Model
• Evaluating model performance is a key aspect to understanding how
good your prediction is when you apply new data.
• The data set is divided into three partitions:
a. Training Data Partition → used to train the model
b. Test Data Partition → a subset of the data set that is not in the training
set
c. Validation Data Partition → used to fine-tune the model performance and
reduce overfitting problems.
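The three-partition split above can be sketched in a few lines of Python. The 60/20/20 proportions are a common convention, not specified in the text:

```python
import random

def three_way_split(records, train=0.6, validation=0.2, seed=42):
    """Shuffle and split records into training, validation, and test partitions."""
    rows = list(records)
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train)
    n_val = int(len(rows) * validation)
    return (rows[:n_train],                      # training partition
            rows[n_train:n_train + n_val],       # validation partition
            rows[n_train + n_val:])              # test partition

train_set, val_set, test_set = three_way_split(range(100))
print(len(train_set), len(val_set), len(test_set))  # 60 20 20
```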
Evaluating the Model
• Cross-validation (k-fold cross-validation) → divide the data into k folds,
build the model using k − 1 folds, and use the remaining fold for testing;
repeat k times so that each fold serves as the test set exactly once.
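Generating the k-fold train/test index pairs is mechanical; here is a minimal Python sketch (libraries such as scikit-learn provide this, but the logic fits in a few lines):

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation over n records."""
    indices = list(range(n))
    fold_size = n // k
    for i in range(k):
        start = i * fold_size
        # the last fold absorbs any remainder when n is not divisible by k
        end = (i + 1) * fold_size if i < k - 1 else n
        test_idx = indices[start:end]
        train_idx = indices[:start] + indices[end:]
        yield train_idx, test_idx

for train_idx, test_idx in k_fold_indices(10, 5):
    print(len(train_idx), len(test_idx))  # 8 2 on every iteration
```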
Classification Model Evaluation
• Confusion Matrix
The simplest way of measuring the performance of a classifier is by counting
the number of mistakes it makes → the misclassification error.
• Classification matrix → referred to as a confusion matrix → gives an
estimate of the true classification and misclassification rates
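For a binary classifier, the confusion matrix counts and the misclassification error can be computed directly. A minimal Python sketch with made-up labels:

```python
def confusion_matrix(actual, predicted):
    """Return (TP, FP, FN, TN) counts for a binary classifier (1 = positive)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    return tp, fp, fn, tn

actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
tp, fp, fn, tn = confusion_matrix(actual, predicted)
error_rate = (fp + fn) / len(actual)  # misclassification error
print(tp, fp, fn, tn, error_rate)     # 3 1 1 3 0.25
```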
Classification Model Evaluation
• Lift Chart → commonly used for marketing problems
• The lift curve helps determine how effectively an advertising campaign
can be run by selecting a relatively small group and capturing the
maximum number of responders. Lift is a measure of the effectiveness of
the model, calculated as the ratio of results obtained with and without the
model.
• A confusion matrix evaluates the effectiveness of the model as a whole
population, whereas the lift chart evaluates a portion of the population.
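The "with vs. without the model" ratio above can be illustrated as the lift of the top-scored fraction of cases over the base response rate. A hedged Python sketch with invented scores and responses:

```python
def lift_at(actual, scores, fraction):
    """Lift of the top `fraction` of cases ranked by model score,
    relative to the overall response rate (i.e., selecting without a model)."""
    ranked = [a for _, a in sorted(zip(scores, actual), reverse=True)]
    n_top = max(1, int(len(ranked) * fraction))
    top_rate = sum(ranked[:n_top]) / n_top     # response rate in the targeted group
    base_rate = sum(actual) / len(actual)      # response rate with no model
    return top_rate / base_rate

actual = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]       # 1 = responder
scores = [0.9, 0.8, 0.7, 0.65, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
print(lift_at(actual, scores, 0.2))           # 2.5: top 20% responds 2.5x the base rate
```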
Classification Model Evaluation
• ROC (receiver operating characteristic) Chart → similar to a lift chart.
• It is a plot of the true-positive rate (sensitivity) on the y axis against the
false-positive rate (1 − specificity) on the x axis.
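Each point on the ROC curve comes from applying one score cutoff and computing the two rates. A minimal Python sketch with invented scores:

```python
def roc_point(actual, scores, threshold):
    """One ROC point: (false-positive rate, true-positive rate) at a score cutoff."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    tpr = tp / (tp + fn)   # sensitivity
    fpr = fp / (fp + tn)   # 1 - specificity
    return fpr, tpr

actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]
print(roc_point(actual, scores, 0.5))
```

Sweeping the threshold from 1 down to 0 traces the full curve from (0, 0) to (1, 1).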
Regression Model Evaluation
• Regression models have many criteria for measuring performance.
• Root-Mean-Square Error → A regression line predicts the y value for a
given x value; RMSE measures how far the predictions deviate from the
actual values on average.

• RMSE = √( Σₖ₌₁ⁿ (ŷₖ − yₖ)² / n )
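The RMSE formula translates directly into code. A minimal Python sketch with made-up values:

```python
def rmse(actual, predicted):
    """Root-mean-square error: sqrt of the mean squared prediction error."""
    n = len(actual)
    return (sum((p - a) ** 2 for a, p in zip(actual, predicted)) / n) ** 0.5

actual    = [3.0, 5.0, 2.0, 7.0]
predicted = [2.5, 5.0, 3.0, 8.0]
print(rmse(actual, predicted))  # sqrt((0.25 + 0 + 1 + 1) / 4) = 0.75
```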
Presenting a Management Report and Review
• Problem Description
• Data Set Used
• Data Cleaning Carried Out
• Method Used to Create the Model
• Model Deployment Prerequisites
• Model Deployment and Usage
• Issues Handling
Deploying the Model
• A challenging phase of the project.
• The model is now deployed for end users and is in a production
environment analyzing the live data
• Success of the deployment depends on the following:
a. Proper sizing of the hardware, ensuring required performance
b. Proper programming to handle the capabilities of the hardware
c. Proper data integration and cleaning
d. Effective reports, dashboards, views, decisions, and interventions to be
used by end users or end-user systems
e. Effective training for the users of the model
