Data Warehousing and
Data Mining
Data Warehouse Design
Logical Design
Requirement analysis
Requirement specification
Conceptual design
Logical design
Physical design
Conceptual design is not a design as such; it is used only to describe
the business
should be a business model, not a
data design model
should identify real world business
objects (e.g. Customer, Order, Sale,
Policy, etc)
the relationships between these objects
Translating the agreed business
requirements into system deliverables
within the scope defined by the
conceptual design results in Logical and
then Physical design
Logical Design
Is more conceptual and abstract than physical design
Looks at the logical relationships among the
objects
Defines the types of information that are
needed
Entity-Relationship model
Identify the things of importance (Entities)
The properties of these things (Attributes)
Show how two or more entities are related
(Relationship)
The entities, attributes and
relationships existing in an ER model are
translated into a star model.
arrange data into a series of logical
relationships called entities and attributes
An entity
represents a chunk of information
often maps to a table
An attribute
a component of an entity that helps define the uniqueness of the
entity
often maps to a column.
A unique identifier is
what is added to tables so that it is possible to differentiate
between the same item when it appears in different places
usually a primary key.
record the associations between objects and
facts
In dimensional modeling, instead of
seeking to discover atomic units of
information (such as entities and
attributes) and all of the relationships
between them, we
identify which information belongs to a central
fact table and which information belongs to its
associated dimension tables
identify business subjects or fields of data,
define relationships between business subjects,
and name the attributes for each subject.
often starts with a conceptual schema
and then generates relational structures
should result in
a set of entities and attributes
corresponding to fact tables and dimension
tables
a model of operational data from your
source into subject-oriented information in
the target data warehouse schema.
mapping the conceptual model
structures to the logical model ones
taking into account implementation
issues, which are not considered in
the conceptual schema
Identify all applicable entities (Conceptual
Model doesn't express all the details)
Attributize (either fully or mostly) data
entities (with business nomenclature)
Assign datatype domains (e.g. text, date,
numeric) vs. datatypes (varchar, integer)
Resolve M:M relationships (e.g. with an
associative entity, record versioning, etc)
Formalize keys (primary, alternate,
foreign)
carry out resolution of subtypes
Perform abstraction (e.g. abstracting
conceptual entities such as Customer,
Prospect, Supplier, etc. into a
generalized entity such as Party) as
part of the normalization process (so
that data can be stored once)
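One of the steps above, resolving an M:M relationship with an associative entity, can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module; the table names (customer, policy, customer_policy), the columns, and the sample rows are assumptions made for the example, not part of the original design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE policy   (policy_id   INTEGER PRIMARY KEY, title TEXT NOT NULL);
-- Associative entity (hypothetical name): one row per customer/policy pairing,
-- so the M:M relationship becomes two 1:M relationships.
CREATE TABLE customer_policy (
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    policy_id   INTEGER NOT NULL REFERENCES policy(policy_id),
    PRIMARY KEY (customer_id, policy_id)
);
INSERT INTO customer VALUES (1, 'Asha'), (2, 'Ravi');
INSERT INTO policy   VALUES (10, 'Home'), (20, 'Motor');
INSERT INTO customer_policy VALUES (1, 10), (1, 20), (2, 20);
""")

# Count the holders of each policy by going through the associative entity.
holders = conn.execute("""
    SELECT p.title, COUNT(*)
    FROM customer_policy cp
    JOIN policy p ON p.policy_id = cp.policy_id
    GROUP BY p.title
    ORDER BY p.title
""").fetchall()
print(holders)
```

The composite primary key on the associative table is one possible choice of formal key; a surrogate key with a unique constraint would work as well.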
A multidimensional database is a type of
database that is optimized
for data warehouse and online
analytical processing (OLAP)
applications.
is designed to make the best use of
storing and utilizing data
is created using input from existing
relational databases.
implies the ability to rapidly process
the data in the database so that
answers can be generated quickly
can receive data from a variety of
relational databases and structure the
information into categories and sections
that can be accessed in a number of
different ways
can obtain data more easily, more quickly
and more succinctly
uses the idea of a data cube to represent
the dimensions of data available to a user
Store data in dimensions
Multiple dimensions, aka cubes (also
called hypercube), allow users to
analyze any view of data
Can consolidate data much faster than
a relational database
"sales" could be viewed in the
dimensions of product model,
geography, time, or some additional
dimension. In this case, "sales" is known
as the measure attribute of the data
cube and the other dimensions are seen
as feature attributes. More hierarchies
and levels can be defined within a
dimension (for example, state and city
levels within a regional hierarchy).
Product Dimension: ProductKey
Time Dimension: TimeKey
Store Dimension: StoreKey
Sales Fact Table: TimeKey (FK), ProductKey (FK), StoreKey (FK), Sale Amount
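The star schema listed above can be sketched in SQL, run here through Python's built-in sqlite3 module. The key columns follow the slide; the descriptive columns (ProductName, Year, City), the table names, and all sample rows are assumptions added only so that the join has something to show.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one row per product, time period, and store.
CREATE TABLE product_dim (ProductKey INTEGER PRIMARY KEY, ProductName TEXT);
CREATE TABLE time_dim    (TimeKey    INTEGER PRIMARY KEY, Year INTEGER);
CREATE TABLE store_dim   (StoreKey   INTEGER PRIMARY KEY, City TEXT);
-- Fact table: foreign keys to every dimension plus the measure.
CREATE TABLE sales_fact (
    TimeKey    INTEGER REFERENCES time_dim(TimeKey),
    ProductKey INTEGER REFERENCES product_dim(ProductKey),
    StoreKey   INTEGER REFERENCES store_dim(StoreKey),
    SaleAmount INTEGER
);
-- Illustrative rows only (assumed, not taken from the slides).
INSERT INTO product_dim VALUES (1, 'Pen'), (2, 'Book');
INSERT INTO time_dim    VALUES (1, 2023), (2, 2024);
INSERT INTO store_dim   VALUES (1, 'Pune'), (2, 'Delhi');
INSERT INTO sales_fact  VALUES (1, 1, 1, 1000), (1, 2, 1, 1200), (2, 1, 2, 1500);
""")

# Analyse the measure along one dimension by joining fact to dimension.
sales_by_city = conn.execute("""
    SELECT s.City, SUM(f.SaleAmount)
    FROM sales_fact f
    JOIN store_dim s ON s.StoreKey = f.StoreKey
    GROUP BY s.City
    ORDER BY s.City
""").fetchall()
print(sales_by_city)
```

Swapping store_dim for product_dim or time_dim in the join gives the same measure viewed along a different dimension.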
[Figure: the SALES fact table (Time Key, Product Key, Store Key, Sales Amount; example amounts Rs.1000, Rs.1200, Rs.1500) shown alongside a data cube with Product, Store, and Time axes, in which each fact row, such as Rs.1000 or Rs.1200, occupies one cell.]
Relational Database
structured for keyword
searches and building a
query by specifying fields
and parameters, using SQL
Multi-Dimensional Database
a user simply poses the
question in everyday
language. The user is
helped by the several
online help tools associated
with software programs
such as word processing
and spreadsheet
applications, as well as
several of the more popular
search engines currently in
use
SQL has several aggregate operators:
SUM(), MIN(), MAX(), COUNT(), AVG()
Some systems extend this with many others:
Stat functions, financial functions
e.g. RANK(), N_TILE(), RATIO_TO_TOTAL()
The basic idea is:
Combine all values in a column
into a single scalar value
Syntax
SELECT AVG(Temp)
FROM Weather;
GROUP BY allows aggregates over table sub-groups
Result is a new table
Syntax
SELECT Time, Altitude, AVG(Temp)
FROM Weather
GROUP BY Time, Altitude;
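Weather table (columns: Time, Latitude, Longitude, Altitude (m), Temp; Latitude and Longitude values omitted):
Time         Altitude (m)  Temp
07/9/5:1500  20            24
07/9/5:1500  20            22
07/9/5:1500  100           17
07/9/9:1500  50            19
07/9/9:1500  50            21
Result of the GROUP BY query:
Time         Altitude (m)  AVG(Temp)
07/9/5:1500  20            23
07/9/5:1500  100           17
07/9/9:1500  50            20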
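The two queries above, the plain aggregate and the GROUP BY version, can be tried with Python's built-in sqlite3 module. This is a small sketch: the Weather rows are assumed sample readings in the spirit of the example table, and the Latitude and Longitude columns are left out because the queries do not use them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Weather (Time TEXT, Altitude INTEGER, Temp REAL)")
# Assumed sample readings: (Time, Altitude in metres, Temp).
conn.executemany("INSERT INTO Weather VALUES (?, ?, ?)", [
    ("07/9/5:1500", 20, 24),
    ("07/9/5:1500", 20, 22),
    ("07/9/5:1500", 100, 17),
    ("07/9/9:1500", 50, 19),
    ("07/9/9:1500", 50, 21),
])

# Plain aggregate: the whole column collapses to one scalar value.
overall = conn.execute("SELECT AVG(Temp) FROM Weather").fetchone()[0]

# GROUP BY: one aggregate per (Time, Altitude) sub-group; the result is a table.
grouped = conn.execute("""
    SELECT Time, Altitude, AVG(Temp)
    FROM Weather
    GROUP BY Time, Altitude
    ORDER BY Time, Altitude
""").fetchall()

print(overall)
for row in grouped:
    print(row)
```

The ORDER BY is added only to make the output order deterministic; GROUP BY alone does not guarantee one.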
Users want Histograms
Suppose:
Day(): maps time to day
Nation(): maps latitude & longitude to the name of the country
SELECT day, nation, MAX(Temp)
FROM Weather
GROUP BY Day(Time) AS day,
         Nation(Latitude, Longitude) AS nation;
The query above is not a STANDARD SQL query!!
In standard SQL:
SELECT day, nation, MAX(Temp)
FROM (SELECT Day(Time) AS day,
             Nation(Latitude, Longitude) AS nation,
             Temp
      FROM Weather) AS foo
GROUP BY day, nation;
A Nested Query
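The nested-query form can be run with Python's built-in sqlite3 module by registering Day() and Nation() as user-defined functions. This is only a sketch: both function bodies are hypothetical stand-ins (Nation() here is a crude bounding-box check, not a real geographic lookup), and the Weather rows are assumed sample data.

```python
import sqlite3

def day(ts):
    # Hypothetical Day(): "07/9/5:1500" -> "07/9/5"
    return ts.split(":")[0]

def nation(lat, lon):
    # Hypothetical Nation(): rough bounding box, for illustration only.
    return "Korea" if 33 <= lat <= 39 and 124 <= lon <= 132 else "Other"

conn = sqlite3.connect(":memory:")
conn.create_function("Day", 1, day)
conn.create_function("Nation", 2, nation)
conn.execute(
    "CREATE TABLE Weather (Time TEXT, Latitude REAL, Longitude REAL, Temp REAL)"
)
conn.executemany("INSERT INTO Weather VALUES (?, ?, ?, ?)", [
    ("07/9/5:1500", 37.5, 127.0, 24),
    ("07/9/5:1800", 35.1, 129.0, 22),
    ("07/9/5:1500", 48.9, 2.3, 17),
    ("07/9/9:1500", 37.5, 127.0, 21),
])

# The inner query computes the histogram categories; the outer one groups on them.
histogram = conn.execute("""
    SELECT day, nation, MAX(Temp)
    FROM (SELECT Day(Time) AS day,
                 Nation(Latitude, Longitude) AS nation,
                 Temp
          FROM Weather) AS foo
    GROUP BY day, nation
    ORDER BY day, nation
""").fetchall()
for row in histogram:
    print(row)
```

Note that the inner SELECT has to carry Temp through, otherwise the outer MAX(Temp) has nothing to aggregate.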
Users want Roll-Up Reports
Attributes: Model, Year, Color, and Sales
Chevy Sales Roll Up by Model by Year by Color:
Keyword ALL stands for the set of all values of an attribute, e.g.
{Black, White}
{1994, 1995}
Problems with GROUP BY - Roll-Up
Reports
To build the Chevy Sales Roll Up
Unioned GROUP BYs
Too many
GROUP BYs and UNIONs!!
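To make the cost concrete, here is a sketch of a roll-up built from unioned GROUP BYs, run through Python's built-in sqlite3 module. The Sales table layout follows the attributes named above (Model, Year, Color, Sales); the sample figures are assumptions for illustration. Each level of the roll-up needs its own SELECT, with the literal 'ALL' standing in for the rolled-up columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (Model TEXT, Year INTEGER, Color TEXT, Sales INTEGER)")
# Assumed sample figures.
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?)", [
    ("Chevy", 1994, "Black", 50),
    ("Chevy", 1994, "White", 40),
    ("Chevy", 1995, "Black", 85),
    ("Chevy", 1995, "White", 115),
])

# One GROUP BY per roll-up level, glued together with UNION ALL.
rollup = conn.execute("""
    SELECT Model, Year, Color, SUM(Sales) FROM Sales GROUP BY Model, Year, Color
    UNION ALL
    SELECT Model, Year, 'ALL', SUM(Sales) FROM Sales GROUP BY Model, Year
    UNION ALL
    SELECT Model, 'ALL', 'ALL', SUM(Sales) FROM Sales GROUP BY Model
    UNION ALL
    SELECT 'ALL', 'ALL', 'ALL', SUM(Sales) FROM Sales
""").fetchall()
for row in rollup:
    print(row)
```

Three attributes already need four SELECTs; every extra attribute adds another, which is the problem the slide points at.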
Users want Cross-Tabulations
Chevy Sales Cross-Tab
By adding the following clause
Problems with
GROUP BY
GROUP BY cannot directly
construct:
Histograms
Roll-Up Reports
Cross-Tabs
CUBE Operator
Generalize GROUP BY and RollUp and Cross-Tabs!!
CUBE
Think of ALL as a token representing the set
{red, white, blue}
{1990, 1991, 1992}
{Chevy, Ford}
Sample syntax:
SELECT Model, Make, Year, SUM(Sales)
FROM Sales
WHERE Model IN {Chevy, Ford}
AND Year BETWEEN 1990 AND 1994
GROUP BY CUBE Model, Make, Year
HAVING SUM(Sales) > 0;
Note: GROUP BY operator repeats aggregate list
in select list
in group by list
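What CUBE computes can be sketched by emulation: one GROUP BY for every subset of the grouping columns, with 'ALL' filling the columns that were aggregated away. The Python sketch below uses the built-in sqlite3 module, which has no CUBE operator of its own. The helper name cube(), the Make/Year/Color table layout, and the sample rows are assumptions for illustration; the helper builds SQL by string interpolation and so is suitable only for trusted column names.

```python
import sqlite3
from itertools import combinations

def cube(conn, table, dims, measure):
    """Emulate GROUP BY CUBE: run one GROUP BY per subset of dims."""
    cells = []
    for k in range(len(dims), -1, -1):
        for kept in combinations(dims, k):
            # Columns not in this subset are reported as the token 'ALL'.
            cols = ", ".join(d if d in kept else "'ALL'" for d in dims)
            sql = f"SELECT {cols}, SUM({measure}) FROM {table}"
            if kept:
                sql += " GROUP BY " + ", ".join(kept)
            cells.extend(conn.execute(sql).fetchall())
    return cells

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (Make TEXT, Year INTEGER, Color TEXT, Sales INTEGER)")
# Assumed sample rows.
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?)", [
    ("Chevy", 1990, "red", 5),
    ("Chevy", 1990, "white", 87),
    ("Ford", 1990, "red", 64),
    ("Ford", 1991, "blue", 8),
])

cells = cube(conn, "Sales", ["Make", "Year", "Color"], "Sales")
for cell in cells:
    print(cell)
```

With N grouping columns the emulation issues 2^N queries, which is why a single CUBE clause is attractive.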
Allows functional aggregations
(e.g., Sales by quarter):
SELECT Store, quarter, SUM(Sales)
FROM Sales
WHERE nation = 'Korea' AND Year = 1994
GROUP BY ROLLUP Store, Quarter(Date) AS quarter;
ROLLUP Operator
A Subset of CUBE Operator
Return Sales Roll Up by Store by Quarter in 1994.
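ROLLUP can be emulated in the same spirit: instead of every subset of the grouping columns, only the prefixes of the column list are grouped, which is why it yields a subset of what CUBE yields. The sketch below uses Python's built-in sqlite3 module; the helper name rollup(), the precomputed Quarter column (standing in for Quarter(Date)), and the sample rows are assumptions for illustration, and the string-built SQL is meant for trusted column names only.

```python
import sqlite3

def rollup(conn, table, dims, measure):
    """Emulate GROUP BY ROLLUP: group by each prefix of dims, longest first."""
    rows = []
    for k in range(len(dims), -1, -1):
        kept = dims[:k]
        # Columns beyond the current prefix are reported as 'ALL'.
        cols = ", ".join(d if d in kept else "'ALL'" for d in dims)
        sql = f"SELECT {cols}, SUM({measure}) FROM {table}"
        if kept:
            sql += " GROUP BY " + ", ".join(kept)
        rows.extend(conn.execute(sql).fetchall())
    return rows

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (Store TEXT, Quarter TEXT, Sales INTEGER)")
# Assumed sample rows.
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?)", [
    ("Seoul", "Q1", 10),
    ("Seoul", "Q2", 20),
    ("Busan", "Q1", 5),
])

report = rollup(conn, "Sales", ["Store", "Quarter"], "Sales")
for row in report:
    print(row)
```

The report contains per-store-and-quarter rows, per-store subtotals, and a grand total, but no quarter-only subtotals; those would appear only under CUBE.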
An Example of ..
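[Figure: 3D Data Cube with dimensions Make (Chevy, Ford), Year (1990, 1991, 1992, 1993), and Color (Red, White, Blue). The faces and edges of the cube hold the aggregates: Sum, By Make, By Year, By Color, and the pairwise combinations such as By Make & Color.]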
A multi-dimensional structure containing
data points that represent unique
combinations of several classifications
A flexible way of storing and
disseminating data
is a data structure that allows fast
analysis of data
overcomes a limitation of relational
databases
can be thought of as extensions to the
two-dimensional array of a spreadsheet
consists of numeric facts (also called
measures) which are categorized by
dimensions.
The cube metadata is typically created
from a star schema or snowflake
schema of tables in a relational
database. Measures are derived from
the records in the fact table and
dimensions are derived from the
dimension tables.
Country  Year
         2000     2001     2002     2003
AAA      123 456  124 567  125 678  126 789
BBB      987 654  988 654  989 654  999 654
CCC       35 789   36 789   37 789   38 789
Many recent statistical data management
models and systems are based on cubes
Users can select just those data that are
of interest
Cubes can easily be expanded, e.g. for
extra years, countries, or other
categories
At least in theory, cubes can have an
infinite number of dimensions
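As a rough illustration of selecting only the data of interest, the Country by Year table above can be held as a small cube keyed by its coordinates. This Python sketch reads the spaced figures as thousands-separated integers (an assumption), and the helper name slice_cube() is made up for the example.

```python
# Each cell is addressed by one value per dimension: (country, year).
cube = {
    ("AAA", 2000): 123456, ("AAA", 2001): 124567,
    ("AAA", 2002): 125678, ("AAA", 2003): 126789,
    ("BBB", 2000): 987654, ("BBB", 2001): 988654,
    ("BBB", 2002): 989654, ("BBB", 2003): 999654,
    ("CCC", 2000): 35789, ("CCC", 2001): 36789,
    ("CCC", 2002): 37789, ("CCC", 2003): 38789,
}

def slice_cube(cube, country=None, year=None):
    """Keep only the cells matching the given coordinates (None = any)."""
    return {
        (c, y): value
        for (c, y), value in cube.items()
        if (country is None or c == country) and (year is None or y == year)
    }

# Select one year across all countries.
year_2001 = slice_cube(cube, year=2001)
# Aggregate one country across all years.
total_bbb = sum(slice_cube(cube, country="BBB").values())

print(year_2001)
print(total_bbb)
```

Extending the cube is a matter of adding cells with new coordinates, and a further dimension would simply lengthen the key tuple.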