Data Mining: Introduction: Lecture Notes For Chapter 1
Much of the data is never analyzed at all
[Figure: total new disk (TB) since 1995 compared with the number of analysts, 1995 to 1999; vertical axis from 0 to 3,500,000]
From: R. Grossman, C. Kamath, V. Kumar, Data Mining for Scientific and Engineering Applications
What is Data Mining?
Many Definitions
- Non-trivial extraction of implicit, previously unknown and
potentially useful information from data
- Exploration & analysis, by automatic or semi-automatic means, of
large quantities of data in order to discover meaningful patterns
Description Methods
- Find human-interpretable patterns that
describe the data.
Prediction Methods
- Use some variables to predict unknown or
future values of other variables.
From [Fayyad, et al.] Advances in Knowledge Discovery and Data Mining, 1996
Data Mining Tasks...
- Classification
Classification: Definition
Given a collection of records (training set)
- Each record contains a set of attributes; one of the
attributes is the class.
Attribute types: Refund (categorical), Marital Status (categorical), Taxable Income (continuous), Cheat (class)

Training Set
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Flow: Training Set -> Learn Classifier -> Model <- Test Set
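The train-then-apply workflow in the table can be sketched in a few lines of Python. The slides do not say which classifier is learned, so this sketch swaps in a k-nearest-neighbour vote purely as an illustration. The names `distance` and `predict`, the choice of k = 3, and the way the income difference is scaled are all assumptions, not part of the original notes.

```python
from collections import Counter

# Training set from the slide: (Refund, Marital Status, Taxable Income in K, Cheat)
TRAIN = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# Test set from the slide: the class (Cheat) is unknown.
TEST = [
    ("No", "Single", 75),
    ("Yes", "Married", 50),
    ("No", "Married", 150),
    ("Yes", "Divorced", 90),
    ("No", "Single", 40),
    ("No", "Married", 80),
]


def distance(a, b, income_range):
    """Assumed mixed distance: 1 per categorical mismatch plus scaled income gap."""
    d = (a[0] != b[0]) + (a[1] != b[1])
    d += abs(a[2] - b[2]) / income_range
    return d


def predict(record, train, k=3):
    """Predict the class of `record` by majority vote among its k nearest training records."""
    incomes = [r[2] for r in train]
    income_range = max(incomes) - min(incomes)
    neighbours = sorted(train, key=lambda r: distance(record, r, income_range))[:k]
    votes = Counter(r[3] for r in neighbours)
    return votes.most_common(1)[0][0]


# Apply the "model" (here, the stored training set plus the voting rule) to the test set.
predictions = [predict(r, TRAIN) for r in TEST]
for record, label in zip(TEST, predictions):
    print(record, "->", label)
```

Any other learner (a decision tree, for example) would fit the same two-step shape: build something from the training set, then use it to fill in the "?" column of the test set.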
Classification: Application 1
Direct Marketing
Approach:
- Segment the image.
- Measure image attributes (features): 40 of them per object.
- Model the class based on these features.
- Success story: found 16 new high red-shift quasars, some
of the farthest objects, which are difficult to find.
From [Fayyad, et al.] Advances in Knowledge Discovery and Data Mining, 1996
Clustering: Definition
Proximity Measures:
- Euclidean Distance if attributes are continuous.
- Other Problem-specific Measures.
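As a small illustration of the Euclidean proximity measure for continuous attributes, the sketch below computes the distance between two points and assigns each point to its nearest cluster centre. The sample points, the centres, and the helper name `assign_to_nearest` are made up for the example; the notes do not prescribe a particular clustering algorithm.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def assign_to_nearest(points, centroids):
    """Return, for each point, the index of the closest centroid."""
    return [
        min(range(len(centroids)), key=lambda i: euclidean(p, centroids[i]))
        for p in points
    ]


# Illustrative data only (not from the notes).
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centroids = [(1.0, 1.5), (8.5, 9.0)]

labels = assign_to_nearest(points, centroids)
print(euclidean((0, 0), (3, 4)))  # 5.0
print(labels)                     # [0, 0, 1, 1]
```

For categorical or otherwise non-continuous attributes, `euclidean` would be replaced by a problem-specific measure.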
Illustrating Clustering
Market Segmentation
Document Clustering
Association Rule Discovery: Definition
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
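One way to check the two rules against the five transactions is to count how often the items occur together. The sketch below uses the standard support and confidence measures, which the slide itself does not name, so treat them (and the function names) as assumptions added for illustration.

```python
# Transactions from the slide, one set of items per TID.
TRANSACTIONS = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions)


def support(lhs, rhs, transactions):
    """Fraction of all transactions that contain both sides of the rule."""
    return count(lhs | rhs, transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """Of the transactions containing `lhs`, the fraction that also contain `rhs`."""
    return count(lhs | rhs, transactions) / count(lhs, transactions)


rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
for lhs, rhs in rules:
    s = support(lhs, rhs, TRANSACTIONS)
    c = confidence(lhs, rhs, TRANSACTIONS)
    print(sorted(lhs), "-->", sorted(rhs), f"support={s:.2f} confidence={c:.2f}")
```

On this data, {Milk} --> {Coke} holds in 3 of the 4 transactions that contain Milk, and {Diaper, Milk} --> {Beer} holds in 2 of the 3 transactions that contain both Diaper and Milk.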
Association Rule Discovery: Application 1
Inventory Management
Given a set of objects, each associated with its own
timeline of events, find rules that predict strong sequential
dependencies among different events.
(A B) (C) (D E)
[Figure: timing constraints on the pattern, labeled <= xg, > ng, <= ws, <= ms]
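To make the idea of a sequential pattern concrete, the sketch below tests whether one object's timeline contains the pattern (A B) (C) (D E) in order. It is a simplification: the timing constraints from the figure (xg, ng, ws, ms) are ignored, and the representation of a timeline as a list of event sets, like the name `contains_pattern`, is an assumption made for the example.

```python
def contains_pattern(timeline, pattern):
    """Return True if the elements of `pattern` occur in order in `timeline`.

    `timeline` is a list of event sets, one per time step.
    `pattern` is a list of event sets; each must be a subset of a distinct,
    strictly later time step than the previous element's match.
    """
    pos = 0
    for element in pattern:
        # Greedily advance to the earliest time step that contains this element.
        while pos < len(timeline) and not element <= timeline[pos]:
            pos += 1
        if pos == len(timeline):
            return False
        pos += 1  # the next element must match a later time step
    return True


# Illustrative timeline for one object (not from the notes).
timeline = [{"A", "B"}, {"X"}, {"C"}, {"D", "E", "F"}]
pattern = [{"A", "B"}, {"C"}, {"D", "E"}]

print(contains_pattern(timeline, pattern))        # True
print(contains_pattern(timeline[::-1], pattern))  # False
```

Counting how many objects' timelines contain a pattern is the usual starting point for deciding whether the dependency is strong enough to report.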
Sequential Pattern Discovery: Examples
Deviation/Anomaly Detection
Detect significant deviations from normal behavior
Applications:
- Credit Card Fraud Detection
- Network Intrusion Detection
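A very simple way to detect a "significant deviation from normal behavior" is to summarize normal behavior with a mean and standard deviation and flag new observations that lie far from it. The sketch below does this with a z-score. The transaction amounts, the threshold of 3, and the function names are illustrative assumptions; real fraud or intrusion detection systems use far richer models.

```python
import statistics


def fit_baseline(values):
    """Summarize normal behavior as (mean, sample standard deviation)."""
    return statistics.mean(values), statistics.stdev(values)


def is_anomaly(x, mean, stdev, threshold=3.0):
    """Flag x if it lies more than `threshold` standard deviations from the mean."""
    return abs(x - mean) / stdev > threshold


# Illustrative data only: past card transactions considered normal, then new ones.
baseline = [20, 25, 22, 19, 24, 21, 23, 26]
new_amounts = [22, 27, 500]

mean, stdev = fit_baseline(baseline)
flagged = [x for x in new_amounts if is_anomaly(x, mean, stdev)]
print(flagged)  # [500]
```

Estimating the baseline from known-normal data, rather than from a sample that already includes the outlier, keeps a single extreme value from inflating the standard deviation and hiding itself.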
Scalability
Dimensionality
Complex and Heterogeneous Data
Data Quality
Data Ownership and Distribution
Privacy Preservation
Streaming Data