Datawarehousing and Data Mining
Clustering
Approaches to Other Data Mining Problems
Applications of Data Mining
Data Mining
Example: A transaction database maintained by a specialty
Many possibilities exist for discovering new knowledge
Classification
Optimization
shoppers in a rush
loyal regular shoppers
shoppers attached to name brands
infrequent shoppers.
Knowledge Classification
Knowledge discovered during data mining is classified as follows:
Association rules
Classification hierarchies
Sequential patterns
Patterns within time series
Clustering
For most applications, the desired knowledge is a combination of
the above types.
Association Rules
These rules correlate the presence of a set of items with
another range of values for another set of variables.
Examples:
(1) When a female retail shopper buys a handbag, she is likely to
buy shoes.
(2) An X-ray image containing characteristics a and b is likely to
also exhibit characteristic c.
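A rule such as "handbag implies shoes" is usually scored with support and confidence. The slide does not define these measures, so the sketch below assumes the standard definitions, and the transactions are made-up toy data, not results from any real store.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, lhs, rhs):
    """Support of (lhs and rhs together) divided by support of lhs."""
    lhs_support = support(transactions, lhs)
    if lhs_support == 0:
        return 0.0  # the rule never fires, so report zero confidence
    return support(transactions, set(lhs) | set(rhs)) / lhs_support


# Hypothetical market baskets, one frozenset per transaction.
transactions = [
    frozenset({"handbag", "shoes"}),
    frozenset({"handbag", "shoes", "scarf"}),
    frozenset({"handbag"}),
    frozenset({"shoes"}),
]

# Rule: handbag => shoes
rule_support = support(transactions, {"handbag", "shoes"})
rule_confidence = confidence(transactions, {"handbag"}, {"shoes"})
print(rule_support, rule_confidence)
```

Here two of the four baskets contain both items, and two of the three handbag baskets also contain shoes, which is what the two measures capture.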
Multidimensional Associations
Classification Hierarchies
The goal is to work from an existing set of events or transactions to
create a hierarchy of classes.
Examples:
(1) A population may be divided into five ranges of credit
worthiness based on a history of previous credit card
transactions.
(2) A model may be developed for the factors that determine the
desirability of a store location on a 1-10 scale.
Sequential Patterns
A sequence of actions or events is sought.
Example:
If a patient underwent cardiac bypass surgery for blocked arteries
and an aneurysm and later developed high blood urea within a year
of surgery, he or she is likely to suffer from kidney failure within the
next 18 months.
Detection of sequential patterns is equivalent to detecting
associations among events with certain temporal relationships.
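The idea of an association with a temporal relationship can be sketched as a check that one event follows another within a time window. The event names, dates, patient ids, and the helper `follows_within` below are all hypothetical and only illustrate the shape of such a test.

```python
from datetime import date, timedelta


def follows_within(events, first, then, max_gap_days):
    """True if some `then` event occurs after a `first` event,
    no more than `max_gap_days` days later."""
    first_dates = [d for name, d in events if name == first]
    then_dates = [d for name, d in events if name == then]
    window = timedelta(days=max_gap_days)
    return any(
        timedelta(0) < (t - f) <= window
        for f in first_dates
        for t in then_dates
    )


# Hypothetical event histories: (event name, date) pairs per patient.
patients = {
    "p1": [("bypass_surgery", date(2020, 1, 10)),
           ("high_blood_urea", date(2020, 9, 1))],
    "p2": [("bypass_surgery", date(2020, 3, 5)),
           ("high_blood_urea", date(2022, 3, 5))],   # too late
    "p3": [("high_blood_urea", date(2019, 5, 1)),
           ("bypass_surgery", date(2020, 2, 1))],    # wrong order
}

matching = sorted(
    pid for pid, events in patients.items()
    if follows_within(events, "bypass_surgery", "high_blood_urea", 365)
)
print(matching)
```

Only the order and the gap matter here; a real sequential-pattern miner would also count how often each candidate sequence occurs across the population.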
Clustering
A given population of events or items can be partitioned
(segmented) into sets of similar elements.
Examples:
(1) An entire population of treatment data on a disease may be
divided into groups based on the similarity of side effects
produced.
(2) The adult population in the United States may be categorized
into five groups from most likely to buy to least likely to buy a
new product.
(3) The Web accesses made by a collection of users against a set
of documents (say, in a digital library) may be analyzed in terms
of the keywords of documents to reveal clusters or categories of
users.
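The slide does not name a clustering algorithm. As one common choice, the sketch below uses k-means on one-dimensional values, with fixed starting centroids so the run is repeatable. The values stand in for a hypothetical per-item score and are invented for illustration.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Plain k-means on numbers: assign each value to the nearest
    centroid, then move each centroid to the mean of its cluster."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical scores that visibly fall into two groups.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = kmeans_1d(values, centroids=[0.0, 5.0])
print(centroids, clusters)
```

The number of clusters is chosen up front by the length of the starting centroid list; partitioning a population into five groups, as in example (2), would start from five centroids.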
mutual funds
Evaluation of financing options
Fraud detection.
requirements.
Health Care
Radiological images
Analysis of microarray (gene-chip) experimental data to cluster genes
Building a Datawarehouse
Typical Functionality
Datawarehouses versus Views
network, or hierarchical).
Data warehouses have the distinguishing characteristic that they are
mainly intended for decision-support applications. They are optimized
for data retrieval, not routine transaction processing.
Data warehouses are quite distinct from traditional databases in their
structure, functioning, performance, and purpose.
A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions.
Data warehouses provide access to data for complex analysis,
knowledge discovery, and decision making. They support high-performance demands on an organization's data and information.
Traditional DB vs Datawarehouse
Traditional Database
Datawarehouse
Characteristics of Datawarehouse
Multidimensional conceptual view
Generic dimensionality
Unlimited dimensions and aggregation levels
Multiuser support
Accessibility
Transparency
Intuitive data manipulation
Consistent reporting performance
Flexible reporting
Types of Datawarehouses
Enterprise-wide data warehouses are huge projects
Data Modeling
Multidimensional data modeling
cubes
Called hypercubes if they have more than three dimensions
Query performance in multidimensional matrices can be
much better than in the relational data model.
Examples of dimensions in a corporate data warehouse:
fiscal periods
products
regions.
Data Cube
Example: Adding a time dimension, such as an organization's
Example: Figure shows a three-dimensional data cube that organizes
product sales data by fiscal quarters and sales regions. Each cell could
contain data for a specific product, specific fiscal quarter, and specific
region
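A three-dimensional cube of this kind can be sketched as a mapping from (product, fiscal quarter, region) to a sales total. The product ids, region names, and amounts below are placeholders, and a dictionary is only one possible in-memory representation.

```python
from collections import defaultdict

# Hypothetical raw sales rows: (product, fiscal quarter, region, amount).
sales_rows = [
    ("P123", "Qtr1", "Reg1", 100),
    ("P123", "Qtr2", "Reg1", 150),
    ("P124", "Qtr1", "Reg2", 80),
    ("P123", "Qtr1", "Reg2", 40),
    ("P123", "Qtr1", "Reg1", 20),  # lands in an already populated cell
]

# Each key is one cell of the cube; the value is the summed sales.
cube = defaultdict(int)
for product, quarter, region, amount in sales_rows:
    cube[(product, quarter, region)] += amount

# Read one cell: a specific product, quarter, and region.
cell = cube[("P123", "Qtr1", "Reg1")]

# Sum along the region dimension for one product and quarter.
p123_qtr1_all_regions = sum(
    total for (product, quarter, _region), total in cube.items()
    if product == "P123" and quarter == "Qtr1"
)
print(cell, p123_qtr1_all_regions)
```

Adding a fourth element to the key tuple would turn the same structure into a hypercube.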
By including additional dimensions, a data hypercube could be
grained view,
Example: Disaggregating country sales by region and then regional
sales by sub-region and also breaking up products by styles.
Roll Up Example
Figure shows a roll-up display that moves from individual products to a
dimension.
A fact table can be thought of as having tuples, one per
recorded fact.
This fact contains some measured or observed variable(s)
and identifies it (them) with pointers to dimension tables.
The fact table contains the data, and the dimensions identify
each tuple in that data.
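The relationship between a fact table and its dimension tables can be sketched with plain dictionaries. All table names, keys, and values here are hypothetical; the point is only that each fact tuple holds measured values plus pointers (keys) into the dimension tables.

```python
# Dimension tables: key -> descriptive attributes (hypothetical rows).
product_dim = {
    1: {"name": "handbag", "style": "leather"},
    2: {"name": "shoes", "style": "canvas"},
}
region_dim = {
    10: {"name": "East"},
    20: {"name": "West"},
}

# Fact table: one tuple per recorded fact. `sales` is the measured
# variable; the *_id fields point into the dimension tables.
fact_table = [
    {"product_id": 1, "region_id": 10, "sales": 500},
    {"product_id": 2, "region_id": 10, "sales": 300},
    {"product_id": 1, "region_id": 20, "sales": 200},
]


def describe(fact):
    """Follow the pointers to label a fact with its dimension values."""
    product = product_dim[fact["product_id"]]["name"]
    region = region_dim[fact["region_id"]]["name"]
    return (product, region, fact["sales"])


described = [describe(f) for f in fact_table]
print(described)
```

In a relational implementation the same pointers would be foreign keys from the fact table to each dimension table.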
Fact Constellation
A fact constellation is a set of fact tables that share some dimension
tables.
Fact constellations limit the possible queries for the warehouse.
Figure shows a fact constellation with two fact tables, business results
and business forecast.
These share the dimension table called product.
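As a rough illustration, two fact tables that share a product dimension can be lined up through the shared keys. The attribute names (`revenue`, `forecast_revenue`, `quarter`) and every row below are assumptions made for the sketch, not the contents of the figure.

```python
# Shared dimension table (hypothetical rows).
product_dim = {
    "P1": {"name": "handbag"},
    "P2": {"name": "shoes"},
}

# Two fact tables that both point into product_dim.
business_results = [
    {"product_id": "P1", "quarter": "Q1", "revenue": 120},
    {"product_id": "P2", "quarter": "Q1", "revenue": 90},
]
business_forecast = [
    {"product_id": "P1", "quarter": "Q1", "forecast_revenue": 100},
    {"product_id": "P2", "quarter": "Q1", "forecast_revenue": 110},
]


def revenue_vs_forecast(results, forecast, products):
    """Match facts from both tables on (product, quarter) and
    report actual revenue minus forecast revenue."""
    forecast_by_key = {
        (f["product_id"], f["quarter"]): f["forecast_revenue"]
        for f in forecast
    }
    report = {}
    for r in results:
        key = (r["product_id"], r["quarter"])
        if key in forecast_by_key:
            name = products[r["product_id"]]["name"]
            report[(name, r["quarter"])] = r["revenue"] - forecast_by_key[key]
    return report


variance = revenue_vs_forecast(business_results, business_forecast, product_dim)
print(variance)
```

Only the dimensions common to both fact tables can be used to relate them, which is one way to read the remark about limited queries.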
Typical Functionality
Facilitate complex, data-intensive, and frequent ad hoc queries.
Data warehouses must provide far greater and more efficient query
ad hoc queries
data mining
materialized views.
Pre-programmed Functionality
Roll-up: Data is summarized with increasing generalization.
Example: weekly to quarterly to annually.
Drill-down: Increasing levels of detail are revealed (the
complement of roll-up).
Pivot: Cross tabulation (also referred to as rotation) is
performed.
Slice and dice: Projection operations are performed on the
dimensions.
Sorting: Data is sorted by ordinal value.
Selection: Data is available by value or range.
Derived (computed) attributes: Attributes are computed by
operations on stored and derived values.
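Several of these operations can be sketched on a small list of rows. The field layout, the helper `aggregate`, and the figures are hypothetical; roll-up and drill-down are shown as the same aggregation run at coarser or finer levels.

```python
from collections import defaultdict

FIELDS = ("year", "quarter", "week", "region")

# Hypothetical rows: (year, quarter, week, region, sales).
rows = [
    (2023, "Q1", 1, "East", 10),
    (2023, "Q1", 2, "East", 20),
    (2023, "Q1", 1, "West", 5),
    (2023, "Q2", 14, "East", 30),
    (2023, "Q2", 15, "West", 15),
]


def aggregate(rows, keep):
    """Sum sales over every field not named in `keep`."""
    idx = [FIELDS.index(name) for name in keep]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)


# Roll-up: weekly -> quarterly -> annual. Reading the same three
# results in the opposite order is a drill-down.
weekly = aggregate(rows, ("year", "quarter", "week"))
quarterly = aggregate(rows, ("year", "quarter"))
annual = aggregate(rows, ("year",))

# Selection: rows whose week falls in a range.
selected = [r for r in rows if 1 <= r[2] <= 2]

# Sorting: order rows by the sales value, largest first.
by_sales = sorted(rows, key=lambda r: r[-1], reverse=True)

# Derived attribute: each quarter's share of its year's total.
share = {key: total / annual[(key[0],)] for key, total in quarterly.items()}

print(quarterly, annual, share)
```

A warehouse would normally precompute or index such summaries rather than rescan the detail rows on every request.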