Data Mining - 3

Data Warehousing and Data Mining

[Figure: pyramid of decision-support levels; at the top, EXECUTIVE SUPPORT SYSTEMS serve few, sophisticated users]
DATA WAREHOUSE
Database with the following distinctive characteristics:
Separate from operational databases
Subject oriented: provides a simple, concise view of one or
more selected areas, in support of the decision process
Constructed by integrating multiple, heterogeneous data
sources
Contains historical data: spans a much longer time horizon
than operational databases
(Mostly) Read-Only access: periodic, infrequent updates
Multi-Tier Architecture

[Figure: data sources (DBs) feed, via cleaning and extraction, a Data Warehouse Server (data storage plus OLAP engine), which in turn serves front-end tools for analysis, reporting, and data mining]
[Figure: star schema — the FACTs entity at the center, linked by (1,1)-(0,N) relationships to the dimension entities PERIOD, PRODUCT, STORE, and SUPPLIER]
MOLAP Approach

[Figure: DATA CUBE representing a Fact Table — axes Product (TV, PC, DVD, CD), Period (1Qtr-4Qtr), Region (Europe, North America, Middle East, Far East); the highlighted cell holds the total sales of PCs in Europe in the 4th quarter of the year]
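The highlighted cell can be computed directly from the fact table. A minimal sketch, with hypothetical sales figures (the slides show no numbers):

```python
# Fact table as (product, region, quarter, sales) tuples.
# The sales figures are invented for illustration.
facts = [
    ("PC", "Europe",        "4Qtr", 120),
    ("PC", "Europe",        "3Qtr",  90),
    ("PC", "North America", "4Qtr", 200),
    ("TV", "Europe",        "4Qtr",  50),
]

def cube_cell(facts, product, region, quarter):
    """SUM the measure over a single cell of the data cube."""
    return sum(s for p, r, q, s in facts
               if p == product and r == region and q == quarter)

total = cube_cell(facts, "PC", "Europe", "4Qtr")
```

A MOLAP engine precomputes and stores such cells in the multidimensional array, rather than scanning the fact table on every query.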
A.A. 04-05 Datawarehousing & Datamining 17
Data Warehousing
[Figure: Roll-Up on the cube — e.g., group quarters into the whole year; the Product axis (PC, DVD, CD) and Region axis (Europe, North America, Middle East, Far East) are unchanged]

Drill-Down = (Roll-Up)⁻¹
[Figure: two further Roll-Ups (SUM) — yearly sales of PCs in the Middle East, then yearly sales of electronics in the Middle East]
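The Roll-Up above can be sketched as a grouping operation along the Period dimension; the quarterly figures below are hypothetical:

```python
from collections import defaultdict

# Hypothetical quarterly sales, keyed by (product, quarter).
quarterly = {("PC", "1Qtr"): 80, ("PC", "2Qtr"): 70,
             ("PC", "3Qtr"): 90, ("PC", "4Qtr"): 120}

def roll_up_quarters(quarterly):
    """Roll-Up along Period: collapse the four quarters into the year."""
    yearly = defaultdict(int)
    for (product, _quarter), sales in quarterly.items():
        yearly[product] += sales
    return dict(yearly)

# Drill-Down would be the inverse move, back to quarterly detail.
```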
FACT TABLE
Data Mining
[Figure: the knowledge-discovery process — DBs and information repositories undergo data cleaning & integration into a Data Warehouse; selection & transformation yields the training data; Data Mining, guided by domain knowledge, produces patterns for evaluation & presentation]
DATA MINING
Interestingness measures
Purpose: filter out irrelevant patterns to convey concise and
useful knowledge. Certain data mining tasks can produce
thousands or millions of patterns, most of which are
redundant, trivial, or irrelevant.
Objective measures: based on statistics and the structure of
patterns (e.g., frequency counts)
Subjective measures: based on the user's beliefs about the
data. Patterns may become interesting if they confirm or
contradict a user's hypothesis, depending on the context.
Interestingness measures can be employed both after and during
pattern discovery. In the latter case, they improve the
search efficiency
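As a concrete objective measure, lift compares the observed co-occurrence of two itemsets with what independence would predict. A minimal sketch with assumed support values:

```python
def lift(supp_xy, supp_x, supp_y):
    """Lift of X => Y: supp(X ∪ Y) / (supp(X) * supp(Y)).
    Values near 1 suggest independence, hence an uninteresting pattern."""
    return supp_xy / (supp_x * supp_y)

# Assumed supports: X and Y each occur in half the transactions and
# always together, so the pattern is twice as frequent as independence
# would predict.
value = lift(0.5, 0.5, 0.5)
```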
[Figure: Data Mining at the confluence of several disciplines — Statistics, Algorithms, High-Performance Computing, Machine Learning, DB Technology, Visualization]
Applications:
Catalog design
Store layout (e.g., diapers and beer close to each other)
Financial forecast
Medical diagnosis
Association Rules
Observation:
Min_sup and min_confidence are objective measures of
interestingness. Their proper setting, however, requires the
user's domain knowledge: low values may yield exponentially
(in |I|) many rules, while high values may cut off interesting rules
Example

Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F

Frequent Itemsets   Support
{A}                 75%
{B}                 50%
{C}                 50%
{A, C}              50%
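The table of frequent itemsets can be reproduced by brute-force support counting; a min_sup of 50% is assumed, since exactly the itemsets with support ≥ 50% are listed:

```python
from itertools import combinations

# The four transactions from the example.
transactions = {10: {"A", "B", "C"}, 20: {"A", "C"},
                30: {"A", "D"}, 40: {"B", "E", "F"}}

def frequent_itemsets(transactions, min_sup):
    """Enumerate every itemset with support >= min_sup.
    Brute force is fine for toy data; real miners prune the
    search space (Apriori, FP-growth)."""
    items = sorted(set().union(*transactions.values()))
    n = len(transactions)
    result = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            sup = sum(set(cand) <= t for t in transactions.values()) / n
            if sup >= min_sup:
                result[frozenset(cand)] = sup
    return result

freq = frequent_itemsets(transactions, 0.5)
```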
Notion of Closedness
Given I (items), D (transactions), and min_sup (support threshold),
an itemset X is closed if no proper superset of X has the same
support as X.
Notion of Maximality
Maximal Frequent Itemsets =
{X ⊆ I : supp(X) ≥ min_sup and supp(Y) < min_sup for every Y ⊆ I with Y ⊃ X}
They represent all frequent itemsets compactly, at the price of
information loss: the supports of their frequent subsets are not recoverable
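The information loss of maximality can be made concrete: keeping only the maximal frequent itemsets of the earlier example discards the supports of {A} and {C}. A minimal sketch:

```python
# Frequent itemsets of the running example (min_sup = 50% assumed).
frequent = {frozenset({"A"}): 0.75, frozenset({"B"}): 0.5,
            frozenset({"C"}): 0.5, frozenset({"A", "C"}): 0.5}

def maximal(frequent):
    """Keep only itemsets with no frequent proper superset.
    Compact, but the subsets' supports become unrecoverable."""
    return {x: s for x, s in frequent.items()
            if not any(x < y for y in frequent)}

max_sets = maximal(frequent)
```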
Association Rules
Tid   Items
10    B, A, D, F, G, H

A sequence: <(ef)(ab)(df)(c)(b)>

A sequence database:

SID   sequence
10    <(a)(abc)(ac)(d)(cf)>
20    <(ad)(c)(bc)(ae)>
30    <(ef)(ab)(df)(c)(b)>
40    <(e)(g)(af)(c)(b)(c)>

An element may contain a set of items. Items within an element are
unordered and are listed alphabetically.

<(a)(bc)(d)(c)> is a subsequence of <(a)(abc)(ac)(d)(cf)>
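The subsequence relation can be checked greedily: each element of the candidate must be contained, in order, in a distinct element of the longer sequence. A sketch:

```python
def is_subsequence(sub, seq):
    """True iff `sub` is a subsequence of `seq`, where both are lists
    of item sets and containment must respect the element order."""
    i = 0
    for element in seq:          # greedy left-to-right matching
        if i < len(sub) and sub[i] <= element:
            i += 1
    return i == len(sub)

# The slide's example: <(a)(bc)(d)(c)> vs <(a)(abc)(ac)(d)(cf)>
sub = [{"a"}, {"b", "c"}, {"d"}, {"c"}]
seq = [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}]
```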
Applications:
Marketing
Natural disaster forecast
Analysis of web log data
DNA analysis
GOAL
From the training data, find a model that describes the classes
accurately and concisely using the data's features. The model is
then used to assign class labels to unknown (previously unseen)
records
Classification
Applications:
Classification process

[Figure: BUILD a model from the Training Set, VALIDATE it on the
Test Data, then use it to PREDICT the class labels of Unseen Data]
Observations:
Features can be either categorical, if belonging to
unordered domains (e.g., Car Type), or continuous, if
belonging to ordered domains (e.g., Age)
The class can be regarded as an additional attribute of
the examples, which we want to predict
Classification vs Regression:
Classification: builds models for categorical classes
Regression: builds models for continuous classes
Several types of models exist: decision trees, neural
networks, Bayesian (statistical) classifiers.
Example:
examples = car insurance applicants, class = insurance risk

[Figure: decision tree over the features; the root node tests
Age < 25 to determine the insurance-risk class]
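The Age < 25 split can be read as a one-node decision tree (a stump). A minimal sketch on hypothetical applicants (the slides give no records):

```python
# Hypothetical training examples: (age, car type, risk class).
training = [
    (23, "sports", "high"), (30, "family", "low"),
    (19, "family", "high"), (45, "sports", "low"),
]

def predict_risk(age):
    """Decision stump assumed from the slide: applicants younger
    than 25 are labelled high risk, the rest low risk."""
    return "high" if age < 25 else "low"

accuracy = sum(predict_risk(age) == risk
               for age, _car, risk in training) / len(training)
```

A real decision-tree learner would also consider splits on Car Type and choose the test that best separates the classes.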
Example

[Figure: objects plotted by attributes X and Y; most objects group
into clusters, while an outlier lies far from the rest]
Applications:
Challenges: