CRISP Data Mining SIBM Pune

The document describes the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology for conducting data mining projects. It outlines the six main phases of CRISP-DM: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. For each phase, it briefly explains the typical tasks and goals, using an example of classifying texts by region to illustrate how the phases may be applied. The overall goal of CRISP-DM is to provide a standard, repeatable process for data mining to help ensure successful project outcomes.

Uploaded by

AYUSH AGRAWAL


CRISP-DM
(required for cw, useful for any project…)

Based on Intro to Data Mining: CRISP-DM
Prof Chris Clifton, Purdue Univ
Thanks also to Laura Squier, SPSS, for some of the material
Data Mining Process
• Cross-Industry Standard Process for Data Mining (CRISP-DM): a methodology for data-analysis work, not for software engineering
• European Community funded effort to develop a framework for data mining and text mining tasks
• Goals:
– Encourage interoperable tools across the entire data mining process, by defining subtasks
– Take the mystery and high-priced expertise out of simple data mining tasks: anyone can do it (even students)
CS490D 2
Why Should There be a Standard Process?
The data mining process must be reliable and repeatable by people with little data mining background.
• Framework for recording experience
– Allows projects to be replicated, “real science”
• Aid to project planning and management
• “Comfort factor” for new adopters
– Demonstrates maturity of Data Mining
– Reduces dependency on “stars”

Process Standardization
• CRoss Industry Standard Process for Data Mining
• Initiative launched Sept. 1996
• http://www.crisp-dm.org/
• SPSS/ISL, NCR, Daimler-Benz, OHRA
• Funding from the European Commission
• Over 200 members of the CRISP-DM SIG worldwide
– DM Vendors - SPSS, NCR, IBM, SAS, SGI, Data Distilleries, Syllogic, Magnify, ...
– System Suppliers / consultants - Cap Gemini, ICL Retail, Deloitte & Touche, …
– End Users - BT, ABB, Lloyds Bank, AirTouch, Experian, ...
– LinkedIn.com group discussion

CRISP-DM
• Non-proprietary
• Application/Industry neutral
• Tool neutral
• Focus on business issues and practical problems
– As well as technical analysis
• Framework for guidance
• Experience base
– Templates and case studies for guidance and analysis

CRISP-DM: Overview

CRISP-DM: Phases
• Business Understanding
– Understanding project objectives and requirements
– Data mining problem definition
• Data Understanding
– Initial data collection and familiarization
– Identify data quality issues
– Initial, obvious results
• Data Preparation
– Record and attribute selection
– Data cleansing
• Modeling
– Run the data analysis and data mining tools
• Evaluation
– Determine if results meet business objectives
– Identify business issues that should have been addressed earlier
• Deployment
– Put the resulting models into practice
– Set up for repeated/continuous mining of the data
Phases and Tasks/Reports (task: outputs)
• Business Understanding
– Determine Business Objectives: Background; Business Objectives; Business Success Criteria
– Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
– Determine Data Mining Goal: Data Mining Goals; Data Mining Success Criteria
– Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques
• Data Understanding
– Collect Initial Data: Initial Data Collection Report
– Describe Data: Data Description Report
– Explore Data: Data Exploration Report
– Verify Data Quality: Data Quality Report
• Data Preparation
– Phase outputs: Data Set; Data Set Description
– Select Data: Rationale for Inclusion / Exclusion
– Clean Data: Data Cleaning Report
– Construct Data: Derived Attributes; Generated Records
– Integrate Data: Merged Data
– Format Data: Reformatted Data
• Modeling
– Select Modeling Technique: Modeling Technique; Modeling Assumptions
– Generate Test Design: Test Design
– Build Model: Parameter Settings; Models; Model Description
– Assess Model: Model Assessment; Revised Parameter Settings
• Evaluation
– Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria; Approved Models
– Review Process: Review of Process
– Determine Next Steps: List of Possible Actions; Decision
• Deployment
– Plan Deployment: Deployment Plan
– Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
– Produce Final Report: Final Report; Final Presentation
– Review Project: Experience Documentation

Phases in the DM Process (1)
• Business Understanding:
– Statement of Business Objective
– Statement of Data Mining objective
– Statement of Success Criteria

Phases in cw DM Process (1)
• Business Understanding:
– Business Objective: attract Language academics to DM (to be our “customers”?)
– Data Mining objective: is domain English classed as UK or US English? (classify by salient features)
– Success Criteria: specific evidence: set of features which classify UK and US training data correctly, used to classify domain data-sets
Phases in the DM Process (2)
• Data Understanding
– Collect data
– Describe data
– Explore the data
– Verify the quality and identify outliers

Phases in cw DM Process (2)
• Data Understanding
– Select domain corpora to fit region covered by journal
– Describe texts: size, sources, markup, …
– Explore the texts – can you see any obvious indications they are UK/US?
– Verify the quality (are texts really from your domain? Errors? Repetitions?) and identify outliers (texts which don’t “belong”)
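
The "describe" and "verify" steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `describe_corpus` and `find_repetitions` and the sample texts are made up for the example, and "repetition" is assumed to mean an exact repeat once case and whitespace are normalised.

```python
def describe_corpus(texts):
    """Return simple size statistics for a list of text strings."""
    sizes = [len(t.split()) for t in texts]
    return {
        "n_texts": len(texts),
        "total_words": sum(sizes),
        "min_words": min(sizes) if sizes else 0,
        "max_words": max(sizes) if sizes else 0,
    }


def find_repetitions(texts):
    """Return indices of texts that repeat an earlier text.

    Assumption: two texts count as the same if they match after
    lower-casing and collapsing runs of whitespace.
    """
    seen = set()
    repeats = []
    for i, text in enumerate(texts):
        key = " ".join(text.lower().split())
        if key in seen:
            repeats.append(i)
        else:
            seen.add(key)
    return repeats


# Hypothetical sample corpus, for illustration only.
sample = [
    "The colour of the harbour",
    "The color of the harbor",
    "the colour  of the harbour",
]
stats = describe_corpus(sample)
repeats = find_repetitions(sample)
```

Texts flagged by `find_repetitions` are candidates for removal in the data preparation phase; spotting texts that do not "belong" to the domain still needs a human look at the data.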

Phases in the DM Process (3)
Data preparation:
• Can take over 90% of the time
– Consolidation and Cleaning
• table links, aggregation level, missing values, etc.
– Data selection
• Remove “noisy” data, repetitions, etc.
• Remove outliers?
• Select samples
• visualization tools
– Transformations - create new variables, formats

Phases in cw DM Process (3)
Data preparation:
• May take up to 90% of the time
• Select Data
– Rationale for Inclusion / Exclusion: if it isn't really from your domain, remove it
• Clean Data
– Remove repetitions
– Remove headers, footers, tables, pictures etc. (BootCaT does this automatically)
• Transform Data
– Convert to plain text (BootCaT does this too)
– Reduce to a word-frequency list; keyword frequencies can be features in machine learning
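
The last transformation step, reducing plain text to a word-frequency list and turning keyword frequencies into features, might look like the sketch below. The tokenisation rule (lower-case runs of letters and apostrophes) and the helper names `word_frequencies` and `keyword_features` are assumptions made for this example, not part of the coursework specification.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each word occurs in a plain-text string."""
    # Assumption: a "word" is a run of letters/apostrophes, case-folded.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


def keyword_features(freqs, keywords):
    """Relative frequency of each chosen keyword, usable as a feature vector."""
    total = sum(freqs.values())
    if total == 0:
        return {k: 0.0 for k in keywords}
    return {k: freqs[k] / total for k in keywords}


# Hypothetical sample text and keyword list.
text = "The colour of the harbour; the color of the sky."
freqs = word_frequencies(text)
features = keyword_features(freqs, ["colour", "color", "harbour"])
```

Relative rather than raw frequencies are used here so that texts of different lengths give comparable feature values.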
Phases in the DM Process (4)
• Model building
– Selection of the modeling techniques is based upon the data mining objective
– Modeling can be an iterative process; may model for either description or prediction

Phases in cw DM Process (4)
• Model building
– Data Mining objective: is domain English classed as UK or US English? (classify by salient features)
– “model” can be Decision Tree (or NN, or other classifier) based on freqs of UK-only terms and US-only terms (and sources used to derive these)
– Data Visualization or On-Line Analytical Processing (OLAP) as well as Data Mining
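
As a rough illustration of a model based on frequencies of UK-only and US-only terms, the sketch below uses a single hand-written decision rule rather than a learned decision tree; a real project would train a classifier on the training corpora. The term lists `UK_ONLY` and `US_ONLY` are tiny hypothetical examples, not derived from any source.

```python
import re

# Hypothetical, deliberately tiny term lists for illustration only.
UK_ONLY = {"colour", "harbour", "centre", "organise"}
US_ONLY = {"color", "harbor", "center", "organize"}


def term_counts(text):
    """Count occurrences of UK-only and US-only terms in a text."""
    words = re.findall(r"[a-z]+", text.lower())
    uk = sum(w in UK_ONLY for w in words)
    us = sum(w in US_ONLY for w in words)
    return uk, us


def classify(text):
    """One-split rule: whichever term list is matched more often wins."""
    uk, us = term_counts(text)
    if uk > us:
        return "UK"
    if us > uk:
        return "US"
    return "unknown"  # tie, or no distinguishing terms found


label_uk = classify("The colour of the harbour")
label_us = classify("The color of the center")
label_none = classify("The sky is blue")
```

The rule is easy to interpret, which helps in the evaluation phase: the matched terms themselves are the evidence for each classification.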
Phases in the DM Process (5)
• Model Evaluation
– Evaluation of model: how well it performed, how well it met business needs
– Methods and criteria depend on model type:
• e.g., confusion matrix with classification models, mean error rate with regression models
– Interpretation of model: important or not, easy or hard depends on algorithm
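
The two evaluation criteria mentioned above can be computed with plain Python. This is a sketch under stated assumptions: the labels and numbers are invented sample data, and mean absolute error is used as one possible reading of "mean error rate" for regression models.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Map each (actual, predicted) label pair to how often it occurs."""
    return Counter(zip(actual, predicted))


def accuracy(actual, predicted):
    """Fraction of predictions that match the actual label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical classification results.
actual = ["UK", "UK", "US", "US", "US"]
predicted = ["UK", "US", "US", "US", "UK"]
cm = confusion_matrix(actual, predicted)
acc = accuracy(actual, predicted)

# Hypothetical regression results.
mae = mean_absolute_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
```

The off-diagonal cells of the confusion matrix, such as `cm[("UK", "US")]`, show which kind of mistake the model makes, which a single accuracy figure hides.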

Phases in cw DM Process (5)
• Model Evaluation
– Evaluation of model: have you found and quantified key differences between UK, US English, to classify domain data?
– Interpretation: don’t just present the results, try to explain possible reasons

Phases in the DM Process (6)
• Deployment
– Determine how the results need to be utilized
– Who needs to use them?
– How often do they need to be used?
• Deploy Data Mining results by:
– Utilizing results as business rules
– Publishing report for users, with recommendations to improve their business
Phases in cw DM Process (6)
• Deployment
– Write a scientific report: Intro, Methods, Results, Conclusion; 3-4 pages (plus Appendices?)
– Utilizing results as business rules: attract Language researchers to use text mining (as “customers” or collaborators for SoC researchers)

Why CRISP-DM?
• The data mining process must be reliable and repeatable by people with few data mining skills (e.g. IT consultants, students?...)
• CRISP-DM provides a uniform framework for
– guidelines
– experience documentation
• CRISP-DM is flexible to account for differences
– Different business/agency problems
– Different data

Why DM?: Concept Description
• Descriptive vs. predictive data mining
– Descriptive mining: describes concepts or task-relevant data sets in concise, summarative, informative, discriminative forms
– Predictive mining: Based on data and analysis, constructs models from the data-set, and predicts the trend and properties of unknown data
• Concept description:
– Characterization: provides a concise and succinct summarization of the given collection of data
– Comparison: provides descriptions comparing two or more collections of data
DM vs. OLAP
• Data Mining:
– can handle complex data types of the attributes and their aggregations
– a more automated process
• Online Analytic Processing (visualization):
– restricted to a small number of dimension and measure types
– user-controlled process

CRISP-DM: Summary
• Business Understanding
– Understanding project objectives and requirements
– Data mining problem definition
• Data Understanding
– Initial data collection and familiarization
– Identify data quality issues
– Initial, obvious results
• Data Preparation
– Record and attribute selection
– Data cleansing
• Modeling
– Run the data mining tools
• Evaluation
– Determine if results meet business objectives
– Identify business issues that should have been addressed earlier
• Deployment
– Put the resulting models into practice
– Set up for repeated/continuous mining of the data