Lecture02 Frameworks Platforms-Part1
Task Categories
Regression: Estimate or predict the numerical value of a
variable for a given individual
Ex: How much will a given customer spend next week?
Classification: Predict which of a set of classes an individual
belongs to
Ex: Will a given cell phone customer leave when his/her contract expires?
Similarity: Identify similar individuals based on what we
know about them
Ex: Which customers are similar to a given customer?
Clustering: Group individuals in a given set by their similarity
Ex: How to group our customers into market segments?
Co-occurrence Grouping: Find associations between entities
based on transactions involving them
Ex: What items are purchased together?
Profiling: Characterize the typical behavior of an individual,
group, or population
Ex: What is the typical cell phone usage of this customer segment?
Link Prediction: Predict connections between items
Ex: Which customers are friends?
Data Reduction: Transform a large dataset into a smaller dataset
Ex: How can we reduce the product preferences of our customers to a much smaller dataset so that we can handle the data easily?
…
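Two of these task categories can be sketched on synthetic data. This is a toy illustration, not part of the lecture; scikit-learn is assumed to be available, and the features and thresholds are invented.

```python
# Toy sketch of two task categories (hypothetical data and thresholds).
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)

# Regression: estimate next week's spend from past weekly spend.
past_spend = rng.uniform(10, 100, size=(200, 1))
next_spend = 0.9 * past_spend[:, 0] + rng.normal(0, 5, size=200)
reg = LinearRegression().fit(past_spend, next_spend)
print(round(float(reg.coef_[0]), 2))  # close to the true slope of 0.9

# Classification: predict churn (1 = leaves) from monthly usage.
usage = rng.uniform(0, 10, size=(200, 1))
churn = (usage[:, 0] < 3).astype(int)  # invented rule: low-usage customers churn
clf = LogisticRegression().fit(usage, churn)
low_user, heavy_user = clf.predict([[1.0], [8.0]])
```

Similarity and clustering tasks would instead replace the target column with a distance measure over the same features.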
Stages of Analytics
What is the data telling you?
Descriptive: What’s happening in my business?
Comprehensive, accurate and live data
Effective visualization
Diagnostic: Why is it happening?
Ability to drill down to the root cause
Ability to isolate all confounding information
[Figure: pyramid of analytics stages from Descriptive to Diagnostic to Predictive to Prescriptive, rising in both value and complexity]
Ref: https://www.kdnuggets.com/2017/07/4-types-data-analytics.html
www.principa.co.za
Predictive: What’s likely to happen?
Business strategies have remained fairly consistent over time
Historical patterns are used to predict specific outcomes using algorithms
Decisions are automated using algorithms and technology
Prescriptive: What do I need to do?
Recommended actions and strategies based on champion/challenger testing strategy outcomes
Applying advanced analytical techniques to make specific recommendations
Example:
Descriptive:
The admission of patients to a hospital over a year.
Complaints, gender, the number of people admitted
Diagnostic:
Specific complaints are highly correlated with the number of
admissions (multicollinearity issues?)
People forget to enter specific entries due to time constraints.
Predictive:
You try to predict the number of admissions over the next week.
Prescriptive:
As the number of admissions will be high, you know the potential
implications. You increase the number of beds and staff.
Machine Learning Categories
Supervised
Unsupervised
Semi-supervised
Reinforcement
Supervised vs Unsupervised Methods
Supervised: A specific target (or desired output) is
known for each sample in the input dataset
classification (categorical target) and regression (numeric
target) are generally solved with supervised methods
Unsupervised: No specific target can be provided for the
samples in the input dataset
clustering, co-occurrence grouping, and profiling are
generally solved with unsupervised methods
Similarity matching, link prediction, and data reduction
could be solved with supervised or unsupervised
methods
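A minimal contrast between the two settings: the clustering sketch below receives only the inputs X and never sees a target (k-means is one illustrative choice; the customer segments are synthetic).

```python
# Unsupervised sketch: no target is provided, only the samples themselves.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Two synthetic customer "segments": low spenders and high spenders.
low = rng.normal([10, 10], 1.0, size=(50, 2))
high = rng.normal([50, 50], 1.0, size=(50, 2))
X = np.vstack([low, high])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # fit(X), not fit(X, y)
labels = km.labels_
```

A supervised method would instead call fit(X, y), with a known target y for every sample.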
Two Major Phases
First, we use available data to find patterns and build
models
Ex: Use historical data to produce a model for the
prediction of customer churn
This phase is typically composed of many sub-phases
Then, we use these patterns and models in decision
making
Ex: Use the generated model to predict whether a given
customer will leave
If the customer is likely to leave, the company may offer
special deals prior to the expiration of her/his contract
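The two phases can be sketched as follows (a hypothetical churn example with one invented feature; any classifier would do in place of the decision tree):

```python
# Phase 1: use historical data to build a churn model.
# Phase 2: use the model to decide on an action for a given customer.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
months_inactive = rng.uniform(0, 12, size=(300, 1))       # invented feature
left = (months_inactive[:, 0] > 6).astype(int)            # invented ground truth

model = DecisionTreeClassifier(random_state=0).fit(months_inactive, left)  # phase 1

will_leave = model.predict([[11.0]])[0]                   # phase 2: score a customer
if will_leave:
    print("offer a special deal before the contract expires")
```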
Semi-Supervised Learning
Algorithms are trained on both labeled and unlabeled data.
Labeled data: few in number
Unlabeled data: many in number
Assumptions:
Continuity assumption
Cluster assumption
Manifold assumption
Reinforcement Learning
Unlike supervised learning, an agent learns from its own experience and makes decisions at run time.
Main points:
Input: The input is an initial state from which the model starts
Output: There are many possible outputs, as there is a variety of solutions to a particular problem
Training: The training is based on the input. The model returns a state, and the user decides whether to reward or punish the model based on its output.
The model continues to learn.
The best solution is decided based on the maximum reward.
Ref: https://www.geeksforgeeks.org/what-is-reinforcement-learning/
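These points can be made concrete with a minimal tabular Q-learning loop (a standard algorithm; the 1-D corridor environment here is invented for illustration). The agent starts from an initial state, receives a reward only at the goal, and the best behavior emerges from maximizing reward.

```python
# Minimal Q-learning sketch: a 1-D corridor where the agent must move
# right from state 0 to reach a reward at state 4.
import random

random.seed(0)
N_STATES, ACTIONS = 5, (0, 1)                 # action 0 = left, action 1 = right
Q = [[0.0, 0.0] for _ in range(N_STATES)]
alpha, gamma, eps = 0.5, 0.9, 0.1             # learning rate, discount, exploration

def greedy(s):
    best = max(Q[s])
    return random.choice([a for a in ACTIONS if Q[s][a] == best])

for _ in range(500):                          # episodes
    s = 0                                     # input: the initial state
    while s != N_STATES - 1:
        a = random.choice(ACTIONS) if random.random() < eps else greedy(s)
        s2 = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s2 == N_STATES - 1 else 0.0         # reward only at the goal
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])  # reward/punish update
        s = s2

# The learned policy prefers "right" in every non-terminal state.
policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
print(policy)
```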
How to Execute and Manage Data Science Projects?
A well-defined framework is needed to carry out data science projects to
guide the implementation (i.e., translate business problems
into data-driven solutions),
effectively manage the projects,
get the highest return on investment,
repeat previous successes
Ad hoc processes and the lack of a robust methodology lead to several issues, such as poor team coordination, slow information sharing, scope creep, random and false discoveries, lack of reproducibility, and management inefficiencies
Data Science Frameworks
Definition: A framework, or software framework, is a platform for developing
software applications. It provides a foundation on which software developers
can build programs for a specific platform.
A good framework enables
knowing where and how to start,
establishing a shared understanding across business and data science teams,
setting reasonable expectations,
reusing knowledge from previous projects,
knowing where and how to use the results,
knowing how to evaluate the business impact
There are several frameworks or process models that can be applied in data
science projects
Ex: OSEMN, SEMMA, KDD, CRISP-DM, TDSP
Nevertheless, many organizations are not following a well-defined
methodology
OSEMN Framework
OSEMN Pipeline
Obtaining data
Scrubbing / Cleaning data
Exploring / Visualizing data: allows us to find patterns and trends
Modeling data: gives us our predictive power (like a wizard)
iNterpreting models and data
OSEMN
Obtain: Get data from sources such as SQL/NoSQL databases, data warehouses, Big Data platforms (e.g., Hadoop), Web APIs, web scraping, repositories (e.g., Kaggle)
Scrub: Organize and tidy up the data by filtering, cleaning, imputing, merging, transforming (Python, R, MapReduce, Spark, …)
Explore: Inspect available data and its properties, visualize data, compute descriptive statistics, correlation, etc. (visualization, inferential statistics; R, Python with Numpy, Pandas, Matplotlib, etc., Spark, Flink, …)
Model: Create models such as regression, classification, and clustering; select features, tune parameters, evaluate models (Python with scikit-learn, R with caret, Spark ML, Flink ML)
iNterpret: Interpret models and data, present findings in such a way that business problems can be solved (data storytelling, visualization tools)
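The whole pipeline can be walked through in miniature (toy in-memory data standing in for a real source; pandas and scikit-learn assumed):

```python
# OSEMN in miniature on invented data.
import pandas as pd
from sklearn.linear_model import LinearRegression

# Obtain: an in-memory table stands in for a database, API, or repository.
df = pd.DataFrame({"ad_spend": [1.0, 2.0, 3.0, None, 5.0],
                   "sales":    [2.1, 4.0, 6.2, 8.1, 9.9]})

# Scrub: impute the missing value.
df["ad_spend"] = df["ad_spend"].fillna(df["ad_spend"].mean())

# Explore: descriptive statistics and correlation.
corr = df["ad_spend"].corr(df["sales"])

# Model: a simple regression.
model = LinearRegression().fit(df[["ad_spend"]], df["sales"])

# iNterpret: one extra unit of ad spend is worth about coef units of sales.
print(round(float(model.coef_[0]), 2))
```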
Knowledge Discovery (KDD) Process
This is a view from typical database systems and data warehousing communities
Data mining plays an essential role in the knowledge discovery process
[Figure: KDD process: Databases → Data Cleaning → Data Integration → Task-relevant Data → Data Mining → Pattern Evaluation]
Ref: Han, Jiawei
SEMMA
SEMMA: Sample, Explore,
Modify, Model and Assess
It refers to the core process of conducting data mining
Not all SEMMA steps need to be included in an analysis
It may be necessary to repeat one or more of the steps several times
SAS Enterprise Miner is an
integrated product that
provides a front end to the
SEMMA mining process
Sample: Sample data by extracting a portion of a large dataset big enough to contain the significant information, yet small enough to manipulate
Explore: Search for unanticipated trends and anomalies in
order to gain understanding and ideas
Modify: Create, select, and transform variables to prepare the data for analysis; identify outliers, replace missing values, etc.
Model: Fit a predictive model to a target variable
Assess: Evaluate the competing models
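The five steps can be sketched end to end (synthetic data; scikit-learn assumed):

```python
# SEMMA in miniature on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X_full = rng.normal(size=(10_000, 1))
y_full = (X_full[:, 0] > 0).astype(int)

# Sample: extract a manageable portion of the large dataset.
idx = rng.choice(len(X_full), size=1_000, replace=False)
X, y = X_full[idx], y_full[idx]

# Explore: look for surprises (here, just the class balance).
balance = y.mean()

# Modify: clip extreme values as a simple transformation.
X = np.clip(X, -3, 3)

# Model: fit a predictive model to the target variable.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)

# Assess: evaluate on held-out data.
acc = accuracy_score(y_te, model.predict(X_te))
print(round(acc, 3))
```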
CRISP-DM
CRISP-DM: CRoss Industry
Standard Process for Data
Mining
Proposed in 1996
In 2015, IBM released a new methodology called ASUM-DM, which refines and extends CRISP-DM
It is the most widely-used
analytic methodology
according to many opinion
polls
It makes explicit the fact that
iteration is the rule rather than the
exception
There are six phases
The sequence of the phases is not
strict
The arrows indicate only the most
important and frequent
dependencies
Outcome of the current phase may
determine the next phase
Backtracks and repetitions are
common
CRISP-DM: Business Understanding
Initially, it is vital to understand the problem to be solved
Business projects seldom come pre-packaged as clear and
unambiguous problems
Iterations may be necessary for an acceptable solution to appear
It is also necessary to think carefully about the use scenarios
Involves understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives
What does the client really want to accomplish?
Uncover important factors (constraints, competing objectives)
CRISP-DM: Data Understanding
Data comprise the available raw material from
which the solution will be built
It is important to understand the strengths and
limitations of the data because rarely is there an
exact match with the problem
We start with an initial data collection and proceed with activities to become familiar with the data, identify data quality problems, discover initial insights into the data, or detect interesting subsets to form hypotheses about hidden information
CRISP-DM: Data Preparation
It covers all activities to construct the final dataset
from the initial raw data
The quality of the cleaned data will directly impact model performance
Data preparation tasks are likely to be performed
multiple times and not in any prescribed order
Tasks include table, record and attribute selection,
and integration as well as transformation and
cleaning of data for modeling tools
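Because these tasks repeat, it helps to capture them as a reusable pipeline (a sketch with scikit-learn; imputation and scaling are just two representative tasks):

```python
# Data preparation as a reusable, repeatable pipeline (illustrative tasks only).
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_raw = np.array([[1.0], [2.0], [np.nan], [4.0]])   # raw data with a missing value

prep = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),     # cleaning: fill missing values
    ("scale", StandardScaler()),                    # transformation for modeling tools
])
X_clean = prep.fit_transform(X_raw)
print(bool(np.isnan(X_clean).any()))                # no missing values remain
```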
CRISP-DM: Modeling
In this phase, various modeling techniques are
selected and applied and their parameters are
calibrated to optimal values
There are several techniques that can be applied
for the same data mining problem type
Some techniques have specific requirements on the
form of data.
Therefore, stepping back to the data preparation
phase is often necessary.
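Parameter calibration is commonly automated with a cross-validated search (a sketch; k-nearest neighbors and the candidate grid are arbitrary choices on synthetic data):

```python
# Calibrating a model parameter to a good value via grid search.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic target

search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 5, 15]}, cv=5)
search.fit(X, y)
print(search.best_params_)                # the calibrated parameter value
```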
CRISP-DM: Evaluation
Thoroughly evaluate the model and review the
steps executed to construct the model to be certain
it properly achieves the business objectives
A key objective is to determine if there is some
important business issue that has not been
sufficiently considered
At the end of this phase, a decision on the use of
the data mining results should be reached
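One simple evaluation habit is to compare the candidate model against a naive business-as-usual baseline before deciding on the use of the results (a sketch; the baseline and model choices are illustrative, the data synthetic):

```python
# Evaluate a candidate model against a trivial baseline before deployment.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 2))
y = (X[:, 0] > 0).astype(int)   # synthetic target

baseline = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=5).mean()
candidate = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()
deploy = candidate > baseline   # part of the decision on using the results
print(deploy)
```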
CRISP-DM: Deployment
In this phase, we determine how the results are to be utilized and plan for deployment, monitoring, and maintenance
The knowledge gained will need to be organized
and presented in a way that the customer can use
it
Depending on the requirements, the deployment
phase can be as simple as generating a report or as
complex as implementing a repeatable data mining
process across the enterprise
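At the simple end of that spectrum, deployment may just mean persisting the trained model so that a report generator or scoring service can reload it (a sketch using Python's pickle; a real system would add versioning and monitoring):

```python
# Deployment sketch: persist a trained model, then reload it elsewhere.
import pickle
from sklearn.linear_model import LogisticRegression

# Training job: fit and serialize the model (toy data).
model = LogisticRegression().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])
blob = pickle.dumps(model)

# Scoring service or report generator: reload and use the model.
restored = pickle.loads(blob)
print(restored.predict([[2.5]])[0])
```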
CRISP-DM: Criticisms
It is widely adopted
It has a good focus on the business understanding
It also covers deployment
The model is not actively maintained
The official site, crisp-dm.org, is no longer being
maintained
The framework itself has not been updated to address issues in working with new technologies, such as big data
Big data means that additional effort may be needed in the data understanding phase
KDD, SEMMA and CRISP-DM Comparison
Summary of the correspondences between KDD,
SEMMA and CRISP-DM
Azevedo, Ana Isabel Rojão Lourenço, and Manuel Filipe Santos. "KDD, SEMMA and CRISP-DM: a parallel overview." IADS-DM (2008).
KDD, SEMMA and CRISP-DM: Based on the Waterfall Model
*Grady, N. W., Payne, J. A., & Parker, H. (2017, December). Agile big data analytics: AnalyticsOps for data science. In 2017 IEEE International Conference on Big Data (Big Data) (pp. 2331-2339). IEEE.
Emerging Approaches
The purpose of using agile analytics is to reach a point of
optimality between generating value from data and the
time spent getting there.
One hybrid approach can be to use Scrum with CRISP-
DM
Scrum is a lightweight, iterative and incremental framework that divides the project into sprints (mini projects) undertaken by self-organizing, cross-functional teams
Team Data Science Process (TDSP) is an agile, iterative,
data science process for executing and delivering
advanced analytics solutions
Agile Software Development
Agile software development is a set of principles and
practices used by self-organizing teams to rapidly and
frequently deliver customer-valued software.
*Agile Alliance
Agile is a philosophy, not a methodology
It is an adaptive approach to doing work based on empirical evidence
Agile embraces continuous improvement and expects ongoing evolution
There is no single «Agile» method
From the Agile Manifesto: customer collaboration over contract negotiation; responding to change over following a plan
*www.agilemanifesto.org
Microsoft Team Data Science Process (TDSP)
It is a team-oriented solution that emphasizes teamwork and collaboration throughout the project
The lifecycle is similar to
CRISP-DM
It takes several elements
from Scrum such as
backlog, sprints, and clear
team roles
Microsoft Team Data Science Process (TDSP): Work Item Types
Feature: A Feature corresponds to a project engagement. Different
engagements with a client are different Features, and it's best to consider
different phases of a project as different Features.
User Story: User Stories are work items needed to complete a Feature end-to-
end. Examples of User Stories include:
Get data
Explore data
Generate features
Build models
Operationalize models
Retrain models
Task: Tasks are assignable work items that need to be done to complete a
specific User Story. For example, Tasks in the User Story Get data could be:
Get SQL Server credentials
Bug: Bugs are issues in existing code or documents that must be fixed to
complete a Task. If Bugs are caused by missing work items, they can escalate to
be User Stories or Tasks.
Ref: https://docs.microsoft.com/en-us/azure/architecture/data-science-process/agile-development
Microsoft Team Data Science Process (TDSP): Work Item Types: Example
Feature 1: Showing recommendations to a user
User Story 1.1: Get user X’s previous purchase history
User Story 1.2: Get user X’s browsing patterns
User Story 1.3: Get all users’ data for training dataset
User Story 1.4: Apply collaborative filtering algorithm for recommendation
User Story 1.5: Give a survey to a user who has registered with the system, for an initial recommendation
Task 1.1.1: Query database for retrieving user X data
Bug 1.1: SQL error when retrieving user X data
TDSP borrows the concepts of Features, User Stories, Tasks, and Bugs
from software code management (SCM). The TDSP concepts might
differ slightly from their conventional SCM definitions.
Sprint planning can be done via Azure Boards.
Ref: https://docs.microsoft.com/en-us/azure/architecture/data-science-process/agile-development
The following figure outlines the TDSP workflow for project execution. [Figure: TDSP project execution workflow]
Ref: https://docs.microsoft.com/en-us/azure/architecture/data-science-process/agile-development