Why Data Mining?: March 3, 2015
Alternative names
Data mining: core of the knowledge discovery process
Databases → Data Cleaning, Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation
Decision Making
Data Presentation: Visualization Techniques
Data Mining: Information Discovery
Data Exploration: Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses
Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA
Data Mining (center), surrounded by: Machine Learning, Pattern Recognition, Statistics, Algorithm, Visualization, Other Disciplines
High-dimensionality of data
Data to be mined
Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW
Knowledge to be mined
Techniques utilized
Applications adapted
General functionality
Object-relational databases
Multimedia database
Text databases
Cluster analysis
Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns
Maximizing intra-class similarity & minimizing inter-class similarity
Outlier analysis
Outlier: Data object that does not comply with the general behavior of the data
Noise or exception? Useful in fraud detection, rare events analysis
Trend and evolution analysis
Trend and deviation: e.g., regression analysis
Sequential pattern mining: e.g., digital camera → large SD memory
Periodicity analysis
Similarity-based analysis
Other pattern-directed or statistical analyses
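As a rough illustration of outlier analysis, the sketch below flags values that lie far from the median, measured in units of the median absolute deviation (MAD). This is only one of many possible techniques. The function name `find_outliers`, the threshold of 3.5, and the sample amounts are illustrative assumptions and do not come from the slides.

```python
from statistics import median

def find_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold.

    The score is based on the median absolute deviation (MAD), which is
    less distorted by the outliers themselves than mean and standard
    deviation would be.
    """
    values = list(values)
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half of the values equal the median; the score is undefined.
        return []
    # 0.6745 rescales the MAD so the score is comparable to a z-score.
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical transaction amounts; 950 does not follow the general behavior.
amounts = [20, 22, 19, 21, 23, 20, 950, 18]
print(find_outliers(amounts))  # [950]
```

Whether a flagged value is noise or a meaningful exception (for example, fraud) still has to be decided by the analyst.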
Mining methodology
Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
User interaction
2. Association Rules: Used to find associations between sets of attributes
3. Sequential patterns: Used to find temporal associations in time series
4. Hierarchical clustering: Used to group customers, web users, etc.
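To make the idea of an association between sets of attributes concrete, the sketch below computes the two measures commonly attached to an association rule: support and confidence. The function names and the sample baskets are hypothetical and chosen only for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) as a ratio of supports."""
    both = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return both / base if base else 0.0

# Hypothetical market baskets, one set of items per transaction.
baskets = [
    {"chair", "desk"},
    {"chair", "desk", "table"},
    {"chair"},
    {"desk", "table"},
]
print(support(baskets, {"chair", "desk"}))       # 0.5
print(confidence(baskets, {"chair"}, {"desk"}))  # about 0.667
```

A rule such as chair → desk is usually reported only when both measures exceed user-chosen minimum thresholds.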
Heterogeneous Databases → data selection, data cleaning, data integration, data summarization → Data Warehouse
did  dname
...  ...

Transaction
tid  type  date
1    sale  4/11/1999
2    sale  5/2/1999
3    buy   5/17/1999
...

Details
tid  pid  qty
1    21   2
2    13   1
3    41   3
...
HK-Database
Supplier
Country
sid name birthdate
... ...
...
cid ...
cname ...

Sales
sid  pid  date       time  qty
1    11   15:4:1999  8:30  2
2    11   15:4:1999  9:30  2
3    56   ???              3
4    22   19:5:1999        4
...
Data Warehouse

FACT table
timeid  pid  sales
1       1    2
2       1    4
2       2    1
3       3    2
...

dimension 1: time
timeid  day  month  year
1       11   4      1999
2       15   4      1999
3       2    5      1999
...

dimension 2: product
pid  name   type
1    chair  office
2    table  office
3    desk   office
...
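The fact table and its two dimensions can be queried by joining on `timeid` and `pid`. The sketch below loads the rows shown above into an in-memory SQLite database and totals sales per product and month. The table names `fact`, `time_dim`, and `product_dim` are assumptions made for this example; the slides only label them "FACT table", "dimension 1: time", and "dimension 2: product".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (timeid INTEGER PRIMARY KEY, day INT, month INT, year INT);
CREATE TABLE product_dim (pid INTEGER PRIMARY KEY, name TEXT, type TEXT);
CREATE TABLE fact (timeid INT, pid INT, sales INT);
""")

# Rows copied from the example warehouse tables.
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?)",
                 [(1, 11, 4, 1999), (2, 15, 4, 1999), (3, 2, 5, 1999)])
conn.executemany("INSERT INTO product_dim VALUES (?, ?, ?)",
                 [(1, "chair", "office"), (2, "table", "office"),
                  (3, "desk", "office")])
conn.executemany("INSERT INTO fact VALUES (?, ?, ?)",
                 [(1, 1, 2), (2, 1, 4), (2, 2, 1), (3, 3, 2)])

# Join the fact table with both dimensions and aggregate per product and month.
rows = conn.execute("""
    SELECT p.name, t.month, SUM(f.sales)
    FROM fact AS f
    JOIN time_dim AS t ON t.timeid = f.timeid
    JOIN product_dim AS p ON p.pid = f.pid
    GROUP BY p.name, t.month
    ORDER BY p.name, t.month
""").fetchall()
print(rows)  # [('chair', 4, 6), ('desk', 5, 2), ('table', 4, 1)]
```

Keeping descriptive attributes in the dimension tables means the fact table stores only keys and measures, so the same facts can be grouped by day, month, year, product name, or product type without being restructured.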
Data Selection
Only data that are important for analysis are selected (e.g., information about employees, departments, etc. is not stored in the warehouse)
Therefore the data warehouse is subject-oriented
Data Integration
Consistency of attribute names
Consistency of attribute data types (e.g., dates are converted to a consistent format)
Consistency of values (e.g., product-ids are converted to correspond to the same products from both sources)
Integration of data (e.g., data from both sources are integrated into the warehouse)
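The integration steps can be sketched with the two date notations that appear in the example sources: month/day/year with slashes (e.g., 4/11/1999) and day:month:year with colons (e.g., 15:4:1999). The function names and both product-id lookup tables below are hypothetical. The slides do not state which source ids correspond to which warehouse ids.

```python
from datetime import date

def parse_mdy(text):
    """Parse a month/day/year string such as '4/11/1999'."""
    month, day, year = (int(part) for part in text.split("/"))
    return date(year, month, day)

def parse_dmy(text):
    """Parse a day:month:year string such as '15:4:1999'."""
    day, month, year = (int(part) for part in text.split(":"))
    return date(year, month, day)

# Hypothetical mappings from source-specific product ids to warehouse ids.
FIRST_SOURCE_PID = {21: 1, 13: 2, 41: 3}
HK_SOURCE_PID = {11: 1, 56: 2, 22: 3}

def integrate(first_rows, hk_rows):
    """Merge (date_text, pid, qty) rows from both sources into one format."""
    merged = []
    for text, pid, qty in first_rows:
        merged.append((parse_mdy(text), FIRST_SOURCE_PID[pid], qty))
    for text, pid, qty in hk_rows:
        merged.append((parse_dmy(text), HK_SOURCE_PID[pid], qty))
    return sorted(merged)

rows = integrate([("4/11/1999", 21, 2)], [("15:4:1999", 11, 2)])
print(rows)
```

After this step every row uses the same date type and the same product key, so rows from both sources can be loaded into a single fact table. A row with an unreadable date would raise an error here and would need to be handled during data cleaning first.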
Data Cleaning
Data Summarization
Example of a Data Warehouse (4)
Example of an OLAP query (collects counts)
product  1999  2000  2001  2002  ALL
chairs     25    37    89    21  172
tables     10    30          45   85
desks      56    84          35  184
shelves    19    20          71  110
boards           16    11    15   47
ALL       115   187   109   187  598
Data cube
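A cross-tabulation with ALL totals can be produced by adding each count to four cells: its own (product, year) cell, the product's ALL cell, the year's ALL cell, and the grand total. The sketch below shows this under stated assumptions: the function name `crosstab_with_totals` is hypothetical, the chairs counts are taken from the example, and the assignment of the three tables counts to 1999, 2000, and 2002 is an assumption.

```python
from collections import defaultdict

def crosstab_with_totals(records):
    """Aggregate (product, year, count) records into a dict keyed by
    (product, year), including 'ALL' entries for each product, each year,
    and the grand total."""
    cube = defaultdict(int)
    for product, year, count in records:
        for p in (product, "ALL"):
            for y in (year, "ALL"):
                cube[(p, y)] += count
    return dict(cube)

records = [
    ("chairs", 1999, 25), ("chairs", 2000, 37),
    ("chairs", 2001, 89), ("chairs", 2002, 21),
    # Assumed year assignment for the three counts in the tables row.
    ("tables", 1999, 10), ("tables", 2000, 30), ("tables", 2002, 45),
]
cube = crosstab_with_totals(records)
print(cube[("chairs", "ALL")])  # 172
print(cube[("tables", "ALL")])  # 85
print(cube[("ALL", 1999)])      # 35
print(cube[("ALL", "ALL")])     # 257
```

With more dimensions the same idea generalizes: each record contributes to every combination of "its own value" and "ALL" across the dimensions, which is what a data cube precomputes.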
What is Data Warehouse?
Data Warehouse: Subject-Oriented
Data Warehouse: Integrated
Data Warehouse: Time-Variant
Data Warehouse: Non-Volatile
Example: