Data analytics and its processes - models - methods

Uploaded by

liliatran7704

What is data analytics?

Data analytics overview

• Data analytics is the science of analyzing raw data to make decisions, take actions, and measure their impact on growth, profitability, risk, and so on.

• Data analytics helps individuals and organizations make sense of data. It is, by its nature, an interdisciplinary field.

• Data analysts typically analyze raw data for insights and trends, using various methods, tools, and techniques to help organizations make decisions and succeed.
Knowledge vs Wisdom
What is knowledge?

Source: https://www.ontotext.com/knowledgehub/fundamentals/dikw-pyramid/
What is wisdom?

Source: https://www.ontotext.com/knowledgehub/fundamentals/dikw-pyramid/
Data or Information?
Invoice Date: 2/22/14    Invoice #: 123
Customer: ABC company

Item #   Qty   Price
99       3     $20

Total Invoice Amount: $60
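
One way to read the invoice: the individual fields are data, and the total is information derived from them. The sketch below is only an illustration. The values are those on the invoice, while the dictionary layout and the function name `invoice_total` are assumptions.

```python
# Raw data: the individual fields captured on the invoice (values from the slide).
invoice = {
    "invoice_no": 123,
    "date": "2/22/14",
    "customer": "ABC company",
    "lines": [{"item_no": 99, "qty": 3, "price": 20}],
}

def invoice_total(inv):
    """Derive information (the amount owed) from the raw line items."""
    return sum(line["qty"] * line["price"] for line in inv["lines"])

total = invoice_total(invoice)
print(f"{invoice['customer']} owes ${total} on invoice #{invoice['invoice_no']}")
```

The numbers 99, 3, and 20 mean little in isolation. Combined with their context (item, quantity, unit price), they yield the amount the customer owes.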


How do we get meaning from data?
A Data to Knowledge Continuum

Source: Ramesh Sharda, Dursun Delen, and Efraim Turban (2020), Analytics, Data Science, & Artificial Intelligence: Systems for Decision Support, 11e, Global Edition, Pearson Education
The data analysis process
Step 1: Define the question – Identify business problems
Ask a targeted question before searching the data for an answer.

• What problems are we trying to solve?
• Which parts of our business do we want more information about?
• Are we trying to solve an existing problem or predict how our company will perform based on determined factors?
• Etc.

→ Clearly defining goals will help guide the rest of the analysis process.
Organization’s goals and business problems
What are the critical needs at every touch point of this Retail Value Chain?

Example of Analytics Applications in a Retail Value Chain

Source: Ramesh Sharda, Dursun Delen, and Efraim Turban (2020), Analytics, Data Science, & Artificial Intelligence: Systems for
Decision Support, 11e, Global Edition, Pearson Education
DATA ANALYTICS AND DATA PRIVACY TRAINING COURSE

[Table columns: Analytics Applications | Business Questions]

What are business values?

[Table columns: Analytics Applications | Business Questions | Business Values]

Source: Ramesh Sharda, Dursun Delen, and Efraim Turban (2020), Analytics, Data Science, & Artificial Intelligence: Systems for Decision Support, 11e, Global Edition, Pearson Education
Step 2: Understand and Collect data
• Before we can start analyzing, data must be available for use (data source)
• Data can include sales records, customer demographics, lead tracking, net promoter scores, and more
• The volume of data required will depend on the question we wish to answer
• Not having enough data can skew the results of our analysis
Step 3: Clean data (Data preparation and EDA)
• Clean the data before beginning the analysis portion of this process
• A large part of the cleansing process is making sure that the data is in a usable format
• This includes searching for outliers, dealing with null values, and looking for data that may have been entered incorrectly
• Data cleansing is crucial to optimizing the accuracy of our analysis

Data preparation stages: Querying, Profiling, Cleaning, Transforming, Loading
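
The cleaning tasks listed for this step (usable formats, null values, incorrectly entered data) can be sketched in a few lines of standard-library Python. Everything here is illustrative: the records, the field names, and the choice to fill missing amounts with the median are assumptions, not a prescribed method.

```python
from statistics import median

# Hypothetical raw records; the field names and values are illustrative only.
raw = [
    {"customer": " Alice ", "amount": "120.50"},
    {"customer": "BOB", "amount": None},      # null value
    {"customer": "carol", "amount": "abc"},   # incorrectly entered
    {"customer": "Dave", "amount": "95"},
]

def parse_amount(value):
    """Return a float, or None when the value is missing or not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

def clean(records):
    parsed = [parse_amount(r["amount"]) for r in records]
    valid = [a for a in parsed if a is not None]
    fill = median(valid)  # one simple (assumed) choice for imputing missing amounts
    cleaned = []
    for record, amount in zip(records, parsed):
        cleaned.append({
            "customer": record["customer"].strip().title(),  # consistent, usable format
            "amount": amount if amount is not None else fill,
            "imputed": amount is None,  # keep track of what was changed
        })
    return cleaned

cleaned = clean(raw)
```

Flagging imputed rows rather than silently overwriting them keeps the cleaning step auditable. Whether to impute, drop, or correct a bad value depends on the question defined in Step 1.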
Data profiling and cleaning

Source: Farah Kim, August 21, 2020


Data transformation

[Figure panels: A sales dataset with heterogeneous values; Transformation for names]

Source: Yeye He, et al., Transform-Data-by-Example (TDE): An Extensible Search Engine for Data Transformations. PVLDB, 11(10): 1165-1177, 2018.
EDA with Scatter Plot
What happened?

[Scatter plot; axes: Units and Amount]
EDA with Box Plots

Source: Ramesh Sharda, Dursun Delen, and Efraim Turban (2020), Analytics, Data Science, & Artificial Intelligence: Systems for Decision Support, 11e, Global Edition, Pearson Education
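
A box plot is drawn from a handful of summary statistics, and those same statistics give a common rule of thumb for spotting outliers during EDA. The sketch below computes them with the standard library. The data are made up, and the 1.5 × IQR fence is one widely used convention, assumed here rather than taken from the slide.

```python
from statistics import quantiles

def box_plot_summary(values):
    """Return quartiles, the 1.5 * IQR fences, and the values outside the fences."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return {"q1": q1, "median": q2, "q3": q3,
            "lower": lower, "upper": upper, "outliers": outliers}

# Hypothetical order amounts; 250 sits far from the rest.
summary = box_plot_summary([10, 12, 13, 15, 16, 18, 20, 250])
print(summary["median"], summary["outliers"])
```

A flagged value is a candidate for investigation, not automatically an error. It may be a data entry mistake or a genuinely unusual customer.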
Step 4: Build model and Analyze data
• Choosing a method/model of analysis will heavily depend on the
question or goals defined earlier and the type of analysis needed.

• The four types: descriptive analytics, diagnostic analytics, predictive analytics, prescriptive analytics

To answer the question: Which factors are negatively impacting the customer experience?
Which one? → Diagnostic analytics

To answer the questions: How many customer segments did the company have last year? How much profit did the company make last month?
Which one? → Descriptive analytics

The retail industry often uses transaction data to predict where future trends lie, or to determine seasonal buying habits, in order to develop strategies.
Which one? → Predictive analytics

Algorithms that guide Google’s self-driving cars. Every second, these algorithms make countless decisions based on past and present data, ensuring a smooth, safe ride.
Which one? → Prescriptive analytics
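
To make the four types concrete, the sketch below asks all four kinds of question of one tiny dataset. All figures are made up, and the straight-line model, the variable names, and the "net return" objective are assumptions chosen for brevity. Real diagnostic or prescriptive work would be considerably more careful.

```python
# Hypothetical monthly figures (in thousands); all numbers are made up.
ad_spend = [10, 12, 14, 16, 18, 20]
sales = [100, 108, 122, 130, 141, 150]

def mean(xs):
    return sum(xs) / len(xs)

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

# Descriptive: what happened?
total_sales = sum(sales)

# Diagnostic (crude): how do sales move with ad spend?
a, b = fit_line(ad_spend, sales)

# Predictive: what is likely to happen at a spend of 22?
forecast = a + b * 22

# Prescriptive: which candidate spend gives the best predicted sales minus spend?
candidates = [18, 20, 22, 24]
best_spend = max(candidates, key=lambda s: (a + b * s) - s)
```

Each step builds on the previous one: the description supplies the facts, the fitted relationship offers a possible explanation, the forecast extends it, and the recommendation turns it into an action.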


Step 5: Visualize and share the findings
• What are we learning from the results of the analysis?
• One way to interpret the results is by creating data visualizations
• Gain insights from the data and apply these insights to our business
• Leaders can use these insights to take action, make changes, or refocus efforts

Ways to share the findings: making reports and visualizations, presentations, data storytelling, recommendations
[Figures: Customer segmentation distribution; Clustering result by RFM level; Retention rates in cohort analysis]

Source: Ho Trung Thanh, et al. Customer segmentation analysis and customer lifetime value prediction using Pareto/NBD and Gamma-Gamma model.
Machine Learning and Data Mining methods
Methods and Process overview

Machine Learning and Data Mining methods

• A manifestation of best practices
• A systematic way to conduct ML or DM projects
• Moving from art to science for DM projects
• Everybody has a different version
• Most common standard processes:
▫ CRISP-DM (Cross-Industry Standard Process for Data Mining)
▫ SEMMA (Sample, Explore, Modify, Model, and Assess)
▫ KDD (Knowledge Discovery in Databases)
CRISP-DM (1 of 2)
• Cross-Industry Standard Process for Data Mining
• Proposed in the 1990s by a European consortium
• Composed of six consecutive phases
▫ Step 1: Business Understanding
▫ Step 2: Data Understanding
▫ Step 3: Data Preparation
▫ Step 4: Model Building
▫ Step 5: Testing and Evaluation
▫ Step 6: Deployment
• Steps 1 to 3 account for ~85% of total project time

CRISP-DM (2 of 2)
• The Six-Step CRISP-DM Process
• The process is highly repetitive and experimental

[Diagram: the six steps arranged in a cycle around the data: 1 Business Understanding, 2 Data Understanding, 3 Data Preparation, 4 Model Building, 5 Testing and Evaluation, 6 Deployment]
SEMMA
• SEMMA (Sample, Explore, Modify, Model, and Assess) Data Mining Process
• Developed by SAS Institute
• The five steps form a cycle, with feedback between them:
▫ Sample: generate a representative sample of the data
▫ Explore: visualization and basic description of the data
▫ Modify: select variables, transform variable representations
▫ Model: use a variety of statistical and machine learning models
▫ Assess: evaluate the accuracy and usefulness of the models
KDD
• KDD (Knowledge Discovery in Databases) Process

[Diagram: Sources for Raw Data → 1 Data Selection → Target Data → 2 Data Cleaning → Preprocessed Data → 3 Data Transformation → Transformed Data → 4 Data Mining → Extracted Patterns → 5 Internalization → Knowledge ("Actionable Insight"), with a feedback loop across the steps]
Which Process is the Best?

• Ranking of Data Mining Methodologies/Processes

[Bar chart, horizontal axis 0 to 70. Categories: CRISP-DM; My own; SEMMA; KDD Process; My organization's; Domain-specific methodology; None; Other methodology (not domain specific)]

Source: Used with permission from KDnuggets.com.


Classification and Clustering
(Supervised and Unsupervised machine learning)

A Taxonomy for Machine Learning and Data Mining

Data Mining Tasks & Methods | Data Mining Algorithms | Learning Type

Prediction
  Classification | Decision Trees, Neural Networks, Support Vector Machines, kNN, Naïve Bayes, GA | Supervised
  Regression | Linear/Nonlinear Regression, ANN, Regression Trees, SVM, kNN, GA | Supervised
  Time Series | Autoregressive Methods, Averaging Methods, Exponential Smoothing, ARIMA | Supervised

Association
  Market-basket | Apriori, OneR, ZeroR, Eclat, GA | Unsupervised
  Link analysis | Expectation Maximization, Apriori Algorithm, Graph-based Matching | Unsupervised
  Sequence analysis | Apriori Algorithm, FP-Growth, Graph-based Matching | Unsupervised

Segmentation
  Clustering | K-means, Expectation Maximization (EM) | Unsupervised
  Outlier analysis | K-means, Expectation Maximization (EM) | Unsupervised


Classification
(Supervised machine learning)
Classification Techniques for Predictive Analytics

• Decision tree
• Random forest
• Statistical analysis
• Neural networks
• Support vector machines
• Bayesian classifiers
• …………..
Random Forest

• Random forest is a commonly used machine learning algorithm, trademarked by Leo Breiman and Adele Cutler, which combines the output of multiple decision trees to reach a single result.

Source: Aurélien Géron. Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow. June 2019.
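
The "combine the output of multiple decision trees" step can be sketched as a majority vote. This is only the aggregation part. The three functions below are hand-written stand-ins for trained trees (hypothetical rules and feature names), whereas an actual random forest learns each tree from a bootstrap sample of the data with random feature subsets.

```python
from collections import Counter

# Hand-written stand-ins for trained decision trees (hypothetical rules).
def tree_a(x):
    return "buy" if x["visits"] > 3 else "skip"

def tree_b(x):
    return "buy" if x["cart_value"] > 50 else "skip"

def tree_c(x):
    return "buy" if x["visits"] > 1 and x["cart_value"] > 20 else "skip"

def forest_predict(trees, x):
    """Combine the trees' individual outputs into one result by majority vote."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

customer = {"visits": 2, "cart_value": 60}
prediction = forest_predict([tree_a, tree_b, tree_c], customer)
print(prediction)
```

Here one tree votes "skip" and two vote "buy", so the combined result is "buy". For regression tasks the usual counterpart is averaging the trees' numeric outputs.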

Logistic regression
Logistic regression estimates the probability of an event occurring, such
as voted or didn't vote, based on a given dataset of independent variables.

Source: Aurélien Géron. Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow. June 2019.
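
The probability estimate comes from passing a weighted sum of the independent variables through the logistic (sigmoid) function. The sketch below shows that calculation only. The coefficients and the two features are hypothetical, and in practice the coefficients would be fitted from data.

```python
import math

def predict_probability(features, weights, bias):
    """Logistic model: p = 1 / (1 + exp(-(bias + w . x)))."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for [age in decades, contacted by a campaign (0 or 1)].
weights, bias = [0.4, 1.2], -2.0

p_vote = predict_probability([6, 1], weights, bias)
label = "voted" if p_vote >= 0.5 else "didn't vote"
print(round(p_vote, 3), label)
```

The 0.5 cut-off used to turn the probability into a class label is itself a choice. A different threshold trades false positives against false negatives.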

Support Vector Machine (SVM)

• Hyperplanes are decision boundaries that help classify the data points

• To separate the two classes of data points, there are many possible hyperplanes that could be chosen

• The objective is to find the hyperplane with the maximum margin, that is, the greatest distance to the nearest data points of both classes

Source: Aurélien Géron. Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow. June 2019.
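
The idea of preferring the hyperplane with the larger margin can be illustrated by comparing two candidate lines on a toy dataset. This sketch does not train an SVM. The points and both candidate lines are picked by hand (assumptions), and the code only measures each line's margin.

```python
import math

# Hypothetical 2-D points with class labels +1 / -1.
points = [((1, 1), -1), ((2, 1), -1), ((4, 4), 1), ((5, 4), 1)]

def margin(w, b, data):
    """Smallest signed distance from any point to the line w.x + b = 0.

    The value is negative if some point lies on the wrong side.
    """
    norm = math.hypot(*w)
    return min(y * (w[0] * x[0] + w[1] * x[1] + b) / norm for x, y in data)

# Two candidate separating lines, chosen by hand rather than learned.
candidates = {
    "x1 = 3": ((1, 0), -3),
    "x1 + x2 = 5.5": ((1, 1), -5.5),
}

best = max(candidates, key=lambda name: margin(*candidates[name], points))
print(best)
```

Both lines separate the classes, but the diagonal one keeps every point farther away, so it is the better choice under the maximum-margin criterion. An SVM solver searches over all hyperplanes rather than a fixed shortlist.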

Artificial Neural Network

Artificial neural networks (ANNs) consist of input, hidden, and output layers with connected neurons (nodes) to simulate the human brain.
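
The layered structure can be made concrete with a forward pass through a very small network. This is a minimal sketch: the layer sizes, the sigmoid activation, and all weights are assumptions, and the weights are untrained, so the output carries no meaning beyond showing how values flow from the input layer to the output layer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """One fully connected layer: each neuron takes a weighted sum of all inputs."""
    return [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

# Hypothetical, untrained weights: 2 inputs -> 3 hidden neurons -> 1 output.
hidden_w = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
hidden_b = [0.0, 0.1, -0.1]
output_w = [[0.7, -0.5, 0.2]]
output_b = [0.05]

def forward(x):
    hidden = layer(x, hidden_w, hidden_b)      # input layer -> hidden layer
    return layer(hidden, output_w, output_b)   # hidden layer -> output layer

output = forward([1.0, 2.0])
print(output)
```

Training a network means adjusting these weights and biases so that the outputs match known examples, which this sketch leaves out.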

Ensemble Models for Predictive Analytics

• Produces more robust and reliable prediction models
• Graphical Illustration of a Heterogeneous Ensemble

Source: Aurélien Géron. Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow. June 2019.
Clustering methods
(Unsupervised machine learning)

Cluster Analysis
• A Graphical Illustration of the Steps in the k-Means Algorithm

[Figure panels: Step 1, Step 2, Step 3]
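
The content of the three panels is not recoverable here, so the sketch below follows the textbook form of k-means instead: start from initial centroids, assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat until nothing changes. The data points and starting centroids are hypothetical.

```python
def kmeans(points, centroids, max_iter=100):
    """Plain k-means on 2-D points, starting from the given centroids."""
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Move each centroid to the mean of its cluster (unchanged if the cluster is empty).
        new_centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
        # Stop when no centroid moves.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
print(centroids)
```

With well-separated groups like these the algorithm settles after two passes. On real data the result depends on the starting centroids and on the chosen number of clusters, so several runs are commonly compared.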


THANK YOU
FOR ATTENDING
