Data Mining
DAMA-NCR
Data Collection (1960s)
– Business question: "What was my total revenue in the last five years?"
– Enabling technologies: computers, tapes, disks
– Product providers: IBM, CDC
– Characteristics: retrospective, static data delivery

Data Warehousing & Decision Support (1990s)
– Business question: "What were unit sales in New England last March? Drill down to Boston."
– Enabling technologies: on-line analytic processing (OLAP), multidimensional databases, data warehouses
– Product providers: SPSS, Comshare, Arbor, Cognos, Microstrategy, NCR
– Characteristics: retrospective, dynamic data delivery at multiple levels
Tasks and their outputs (task: output; output)

• Determine Business Objectives: Background; Business Objectives; Business Success Criteria
• Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
• Determine Data Mining Goal: Data Mining Goals; Data Mining Success Criteria

• Collect Initial Data: Initial Data Collection Report
• Describe Data: Data Description Report
• Explore Data: Data Exploration Report
• Verify Data Quality: Data Quality Report

• Data Set: Data Set Description
• Select Data: Rationale for Inclusion / Exclusion
• Clean Data: Data Cleaning Report
• Construct Data: Derived Attributes; Generated Records
• Integrate Data: Merged Data
• Format Data: Reformatted Data

• Select Modeling Technique: Modeling Technique; Modeling Assumptions
• Generate Test Design: Test Design
• Build Model: Parameter Settings; Models; Model Description
• Assess Model: Model Assessment; Revised Parameter Settings

• Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria; Approved Models
• Review Process: Review of Process
• Determine Next Steps: List of Possible Actions; Decision

• Plan Deployment: Deployment Plan
• Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
• Produce Final Report: Final Report; Final Presentation
• Review Project: Experience Documentation
Neural Networks
• Description
– Difficult interpretation
– Tends to ‘overfit’ the data
– Extensive amount of training time
– A lot of data preparation
– Works with all data types
Rule Induction
• Description
– Produces decision trees:
• income < $40K
– job > 5 yrs then good risk
– job < 5 yrs then bad risk
• income > $40K
– Or Rule Sets:
– if income > $40K
– if low debt

[Figure: decision tree for credit ranking (1 = default). Each node lists category, %, and n.]
Weekly pay: Bad 86.67% (143)
  Split on Age Categorical (P-value = 0.0000, Chi-square = 30.1113, df = 1)
  Young (< 25); Middle (25-35): Bad 90.51% (143), Good 9.49% (15), Total (48.92%) 158
  Old (> 35): Bad 0.00% (0), Good 100.00% (7), Total (2.17%) 7
Monthly salary: Bad 15.82% (25)
  Split on Age Categorical (P-value = 0.0000, Chi-square = 58.7255, df = 1)
  Young (< 25): Bad 48.98% (24), Good 51.02% (25), Total (15.17%) 49
  Middle (25-35); Old (> 35): Bad 0.92% (1), Good 99.08% (108), Total (33.75%) 109
Next level (partially shown): Professional; Bad 0.00% (0); Bad 58.54% (24)
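The decision-tree rules listed under Rule Induction (income below $40K, then job tenure above or below 5 years) can be written as nested conditions. The following is a minimal sketch only: the function name credit_risk, the parameter names, and the two constants are hypothetical, and the slide does not say what happens for income of $40K or more or for exactly 5 years in a job, so those cases return None rather than a guessed label.

```python
from typing import Optional

# Thresholds taken from the slide's example rules.
INCOME_THRESHOLD = 40_000  # dollars
TENURE_THRESHOLD = 5       # years in current job


def credit_risk(income: float, years_in_job: float) -> Optional[str]:
    """Apply the example rules; return None where the slide gives no outcome."""
    if income < INCOME_THRESHOLD:
        if years_in_job > TENURE_THRESHOLD:
            return "good risk"
        if years_in_job < TENURE_THRESHOLD:
            return "bad risk"
        # Exactly 5 years is not covered by either rule (assumption: undecided).
        return None
    # The income >= $40K branch is not spelled out on the slide (assumption: undecided).
    return None


if __name__ == "__main__":
    for income, years in [(30_000, 7), (30_000, 2), (50_000, 7)]:
        print(income, years, credit_risk(income, years))
```

Each path from the root to a leaf of the tree corresponds to one if-then rule, which is why the same model can be presented either as a decision tree or as a rule set.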