
DMDW_Operations

The document provides an overview of data warehousing and OLAP technology, defining a data warehouse as a repository for integrated, subject-oriented, time-variant, and non-volatile data. It discusses the architecture of data warehouses, including multi-dimensional data models and schemas like star and snowflake, as well as the processes involved in data warehouse design. Additionally, it outlines typical OLAP operations and the importance of separating data warehouses from operational databases for performance optimization.

Unit-2: Data Warehousing and OLAP Technology

 What is a data warehouse?

 A multi-dimensional data model

 Data warehouse architecture

 Data warehouse implementation

 Definition 1: A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site.

What is a Data Warehouse?
 Definition 2: “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon)
 Data warehousing:
     The process of constructing and using data warehouses

Data Warehouse—Subject-Oriented

 Organized around major subjects, such as customer, product, sales
 Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing

Data Warehouse—Integrated
 Constructed by integrating multiple, heterogeneous data sources
     relational databases, flat files, on-line transaction records
 Data cleaning and data integration techniques are applied.
     Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
         E.g., Hotel price: currency, tax, breakfast covered, etc.
     When data is moved to the warehouse, it is converted.

Data Warehouse—Time Variant

 The time horizon or scope for the data warehouse is significantly longer than that of operational systems
     Operational database: current value data
     Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)

Data Warehouse—Non volatile

 Data warehouses are also nonvolatile: when new data is entered, the previous data is not erased. Data is read-only and refreshed on a regular basis.

Data Warehouse vs. Operational DBMS
 OLTP (on-line transaction processing)
 Major task of traditional relational DBMS
 Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.
 OLAP (on-line analytical processing)
 Major task of data warehouse system
 Data analysis and decision making
 Distinct features (OLTP vs. OLAP):
 User and system orientation
 Data contents
 Database design
 View
 Access patterns

OLTP vs. OLAP

Why Separate Data Warehouse?
 A major reason for the separation is to promote high performance in both systems
     DBMS, tuned for OLTP: access methods (read/write), indexing on the primary key, concurrency control, recovery mechanisms
     Warehouse, tuned for OLAP: complex OLAP queries, multidimensional view, consolidation
 Different functions and different data:
     Missing data: decision support (DS) requires historical data, which operational DBs do not typically maintain
     Data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources
     Data quality: different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled

A multi-dimensional data model

 Data warehouses and OLAP tools are based on a multidimensional data model.
 This model views data in the form of a data cube.
 What is a data cube? A data cube allows data to be modeled and viewed in multiple dimensions (e.g., 2-D, 3-D).
 It is defined by dimensions and facts.

From Tables and Spreadsheets to Data Cubes

 A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
     Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year)
     Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
 In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.

Cube: A Lattice of Cuboids

Each cuboid represents a different degree of summarization.


0-D (apex) cuboid: all
1-D cuboids: (time), (item), (location), (supplier)
2-D cuboids: (time, item), (time, location), (time, supplier), (item, location), (item, supplier), (location, supplier)
3-D cuboids: (time, item, location), (time, item, supplier), (time, location, supplier), (item, location, supplier)
4-D (base) cuboid: (time, item, location, supplier)

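The lattice for the four dimensions time, item, location, and supplier can be enumerated mechanically, since every subset of the dimensions is one cuboid. The following is a minimal sketch; the function and variable names are illustrative and do not come from the slides.

```python
from itertools import combinations

DIMENSIONS = ("time", "item", "location", "supplier")

def cuboid_lattice(dimensions):
    """Map k -> list of k-D cuboids; each cuboid is a tuple of dimension names."""
    return {k: list(combinations(dimensions, k))
            for k in range(len(dimensions) + 1)}

lattice = cuboid_lattice(DIMENSIONS)
total_cuboids = sum(len(cuboids) for cuboids in lattice.values())

for k, cuboids in lattice.items():
    print(f"{k}-D cuboids: {cuboids}")
print("total:", total_cuboids)  # 2**4 = 16 subsets of four dimensions
```

The empty tuple at level 0 plays the role of the apex cuboid (all), and the single 4-tuple at level 4 is the base cuboid.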
Conceptual Modeling of Data Warehouses

 Modeling data warehouses: dimensions & measures
     Star schema: A fact table in the middle connected to a set of dimension tables
     Snowflake schema: A refinement of the star schema in which some dimensional hierarchies are normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
     Fact constellation: Multiple fact tables share dimension tables; viewed as a collection of stars, it is also called a galaxy schema

Example of Star Schema
Sales Fact Table
    Keys: time_key, item_key, branch_key, location_key
    Measures: units_sold, dollars_sold, avg_sales
Dimension tables
    time (time_key, day, day_of_the_week, month, quarter, year)
    item (item_key, item_name, brand, type, supplier_type)
    branch (branch_key, branch_name, branch_type)
    location (location_key, street, city, state_or_province, country)

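To make the star schema concrete, the sketch below builds a reduced version of it in an in-memory SQLite database and answers one aggregate query by joining the fact table to its dimension tables. It is only an illustration: the table names `time_dim` and `item_dim`, the reduced column set, and all rows are hypothetical, not taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
-- The fact table holds one foreign key per dimension plus the measures.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    units_sold INTEGER,
    dollars_sold REAL
);
""")

# Hypothetical sample rows.
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?)",
                 [(1, "Q1", 2004), (2, "Q2", 2004)])
conn.executemany("INSERT INTO item_dim VALUES (?, ?, ?)",
                 [(1, "tv-a", "BrandA"), (2, "pc-b", "BrandB")])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
                 [(1, 1, 2, 800.0), (1, 2, 1, 1200.0),
                  (2, 1, 3, 1200.0), (2, 1, 1, 400.0)])

# Total dollars_sold per quarter and brand: join facts to dimensions, then group.
rows = conn.execute("""
SELECT t.quarter, i.brand, SUM(f.dollars_sold)
FROM sales_fact AS f
JOIN time_dim AS t ON f.time_key = t.time_key
JOIN item_dim AS i ON f.item_key = i.item_key
GROUP BY t.quarter, i.brand
ORDER BY t.quarter, i.brand
""").fetchall()

for row in rows:
    print(row)
```

Every analytical query follows the same shape: the fact table is joined to only the dimensions the query mentions, which is what keeps a star schema simple to query.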
Example of Snowflake Schema
Sales Fact Table
    Keys: time_key, item_key, branch_key, location_key
    Measures: units_sold, dollars_sold, avg_sales
Dimension tables
    time (time_key, day, day_of_the_week, month, quarter, year)
    item (item_key, item_name, brand, type, supplier_key)
        supplier (supplier_key, supplier_type)
    branch (branch_key, branch_name, branch_type)
    location (location_key, street, city_key)
        city (city_key, city, state_or_province, country)

Example of Fact Constellation
Sales Fact Table
    Keys: time_key, item_key, branch_key, location_key
    Measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table
    Keys: time_key, item_key, shipper_key, from_location, to_location
    Measures: dollars_cost, units_shipped
Dimension tables
    time (time_key, day, day_of_the_week, month, quarter, year)
    item (item_key, item_name, brand, type, supplier_type)
    branch (branch_key, branch_name, branch_type)
    location (location_key, street, city, province_or_state, country)
    shipper (shipper_key, shipper_name, location_key, shipper_type)

Cube Definition Syntax in DMQL

 Data warehouses and data marts can be defined using two language primitives, one for cube definition and one for dimension definition
 Cube Definition (Fact Table)
    define cube <cube_name> [<dimension_list>]: <measure_list>
 Dimension Definition (Dimension Table)
    define dimension <dimension_name> as (<attribute_or_subdimension_list>)

Defining Star Schema in DMQL

define cube sales_star [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)

Defining Snowflake Schema in DMQL

define cube sales_snowflake [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city(city_key, province_or_state, country))

Defining Fact Constellation in DMQL
define cube sales [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
define cube shipping [time, item, shipper, from_location, to_location]:
    dollars_cost = sum(cost_in_dollars), units_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales

A Concept Hierarchy: Dimension (location)

Level: example values
    all: all
    region: Europe, ..., North_America
    country: Germany, ..., Spain, Canada, ..., Mexico
    city: Frankfurt, ..., Vancouver, ..., Toronto
    office: L. Chan, ..., M. Wind

A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts.
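A concept hierarchy can be represented as a chain of lookup tables, one per step from a level to the next more general level. The sketch below uses only the city, country, and region values shown for the location dimension; the sales figures and function names are hypothetical.

```python
# One mapping per step of the hierarchy: city -> country -> region -> all.
CITY_TO_COUNTRY = {"Frankfurt": "Germany", "Vancouver": "Canada", "Toronto": "Canada"}
COUNTRY_TO_REGION = {"Germany": "Europe", "Spain": "Europe",
                     "Canada": "North_America", "Mexico": "North_America"}

def generalize(city, level):
    """Map a city to its ancestor at the requested level of the hierarchy."""
    if level == "city":
        return city
    country = CITY_TO_COUNTRY[city]
    if level == "country":
        return country
    if level == "region":
        return COUNTRY_TO_REGION[country]
    if level == "all":
        return "all"
    raise ValueError(f"unknown level: {level}")

def roll_up(sales_by_city, level):
    """Sum city-level values into the cells of a more general level."""
    totals = {}
    for city, amount in sales_by_city.items():
        key = generalize(city, level)
        totals[key] = totals.get(key, 0) + amount
    return totals

sales_by_city = {"Frankfurt": 100, "Vancouver": 50, "Toronto": 70}  # hypothetical
print(roll_up(sales_by_city, "country"))
print(roll_up(sales_by_city, "region"))
print(roll_up(sales_by_city, "all"))
```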
Hierarchical and lattice structures of attributes in warehouse dimensions

Multidimensional Data

 Sales volume as a function of product, month, and region
Dimensions: Product, Location, Time
Hierarchical summarization paths (most general level first):
    Product: Industry, Category, Product
    Location: Region, Country, City, Office
    Time: Year, Quarter, Month or Week, Day

A Sample Data Cube
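Fig. A sample 3-D data cube with axes Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), and Country (U.S.A., Canada, Mexico, sum). One highlighted cell holds the total annual sales of TV in U.S.A.; the corner cell is (All, All, All).
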
Cuboids Corresponding to the Cube

0-D (apex) cuboid: all
1-D cuboids: (product), (date), (country)
2-D cuboids: (product, date), (product, country), (date, country)
3-D (base) cuboid: (product, date, country)

Typical OLAP Operations

 Roll up (drill-up)
 Drill down (roll down)
 Dice
 Slice
 Pivot (rotate)

1. Roll up (drill-up):
Summarizes data by climbing up a concept hierarchy for a dimension or by dimension reduction.

2. Drill down (roll down):
•Reverse of roll-up
•Navigates from a higher-level summary to a lower-level summary or detailed data, either by stepping down a concept hierarchy or by introducing new dimensions

3. Dice:
The dice operation defines a subcube by performing a selection on two or more dimensions.

4. Slice:
The slice operation performs a selection on one dimension of the given cube, resulting in a subcube.

5. Pivot (rotate):
A visualization operation that rotates the data axes in view in order to provide an alternative presentation of the data.

Fig. Typical OLAP Operations

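The five operations can be sketched on a tiny in-memory fact table. This is an illustration under stated assumptions, not how an OLAP server implements them: the rows, the column layout, and the helper names `aggregate` and `select` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows: (quarter, city, country, item, dollars_sold)
COLS = ("quarter", "city", "country", "item")
FACTS = [
    ("Q1", "Toronto", "Canada", "TV", 100),
    ("Q1", "Vancouver", "Canada", "PC", 200),
    ("Q2", "Toronto", "Canada", "TV", 150),
    ("Q2", "Frankfurt", "Germany", "PC", 300),
]

def aggregate(rows, dims):
    """Group rows by the given dimensions and sum dollars_sold."""
    idx = [COLS.index(d) for d in dims]
    out = defaultdict(int)
    for row in rows:
        out[tuple(row[i] for i in idx)] += row[-1]
    return dict(out)

def select(rows, **allowed):
    """Keep rows whose value on each named dimension is in the allowed set."""
    return [r for r in rows
            if all(r[COLS.index(d)] in values for d, values in allowed.items())]

# Drill-down view: sales per quarter and city (the finer location level).
by_city = aggregate(FACTS, ["quarter", "city"])
# Roll-up: climb the location hierarchy from city to country.
by_country = aggregate(FACTS, ["quarter", "country"])
# Slice: selection on one dimension.
sliced = select(FACTS, quarter={"Q1"})
# Dice: selection on two dimensions.
diced = select(FACTS, country={"Canada"}, item={"TV"})
# Pivot: same cells, axes swapped from (quarter, item) to (item, quarter).
pivoted = {(item, quarter): total
           for (quarter, item), total in aggregate(FACTS, ["quarter", "item"]).items()}

print(by_country)
```

Roll-up and drill-down differ only in which level of the location hierarchy is used for grouping, while slice and dice differ only in how many dimensions carry a selection.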
A Starnet Query Model for Querying
Multidimensional Databases


 The querying of multidimensional databases can be based on a starnet model.
 A starnet model consists of radial lines emanating from a central point, where each line represents a concept hierarchy for a dimension.
 Each abstraction level in the hierarchy is called a footprint.
 These represent the granularities available for use by OLAP operations such as drill-down and roll-up.

Design of Data Warehouse: A Business
Analysis Framework
 To design an effective data warehouse, we need to understand and analyze business needs and construct a business analysis framework.
 Four views regarding the design of a data warehouse
     Top-down view
         allows selection of the relevant information necessary for the data warehouse; this information matches current and future business needs
     Data source view
         exposes the information being captured, stored, and managed by operational systems; this information may be documented at various levels of detail and accuracy, from individual data source tables to integrated data source tables
     Data warehouse view
         consists of fact tables and dimension tables
     Business query view
         sees the perspectives of data in the warehouse from the viewpoint of the end user

The Process of Data Warehouse Design

 Top-down, bottom-up approaches or a combination of both
     Top-down: starts with overall design and planning (suitable when the business problems are clear and well understood)
     Bottom-up: starts with experiments and prototypes
 From the software engineering point of view
     Waterfall
     Spiral
 A typical data warehouse design process consists of the following steps:
     Choose a business process to model, e.g., orders, invoices, sales
     Choose the grain (the fundamental, atomic level of data to be represented in the fact table) of the business process
     Choose the dimensions that will apply to each fact table record
     Choose the measures that will populate each fact table record

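Data Warehouse: A Multi-Tier (3-Tier) Architecture

Fig. Multi-tier data warehouse architecture, from left to right:
    Data Sources: operational DBs and other sources
    Data Storage: data is extracted, transformed, loaded, and refreshed (with a monitor & integrator and a metadata repository) into the data warehouse and data marts
    OLAP Engine: OLAP server, served by the data storage tier
    Front-End Tools: analysis, query/report, and data mining tools
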
Three Data Warehouse Models
 Enterprise warehouse
     collects all of the information about subjects spanning the entire organization
 Data mart
     a subset of corporate-wide data that is of value to a specific group of users; its scope is confined to specific, selected groups (e.g., a marketing data mart)
 Virtual warehouse
     a set of views over operational databases
     for efficient query processing, only some of the possible summary views may be materialized

Data Warehouse Development:
A Recommended Approach
Fig. Recommended development path: define a high-level corporate data model; through model refinement, build the data marts and the enterprise data warehouse; then distributed data marts; and finally a multi-tier data warehouse.

Data Warehouse Back-End Tools and Utilities

 Data extraction
 Data cleaning
 Data transformation
 Load
 Refresh

OLAP Server Architectures
 Relational OLAP (ROLAP)
     An extended relational DBMS that maps operations on multidimensional data to standard relational operations
 Multidimensional OLAP (MOLAP)
     A special-purpose server that directly implements multidimensional data and operations
 Hybrid OLAP (HOLAP)
     Flexible: combines ROLAP and MOLAP technology

Data warehouse implementation

Efficient Data Cube Computation


 A data cube can be viewed as a lattice of cuboids
     The bottom-most cuboid is the base cuboid
     The top-most cuboid (apex) contains only one cell
 How many cuboids are there in an n-dimensional data cube?
     If there are no hierarchies associated with the dimensions, the total number of cuboids is $2^n$
     If dimension $i$ has $L_i$ levels (e.g., the time dimension with “day < month < quarter < year” has 4 levels), the total number of cuboids is

$$T = \prod_{i=1}^{n} (L_i + 1)$$

    where 1 is added to each $L_i$ to include the virtual top level, all

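The cuboid count is the product of (L_i + 1) over all n dimensions, where L_i is the number of hierarchy levels of dimension i and the extra 1 accounts for the top level "all". A minimal sketch follows, using as a worked case the three hierarchies given later in the deck for time (4 levels), item (3 levels), and location (4 levels); the function name is illustrative.

```python
from math import prod

def total_cuboids(levels_per_dimension):
    """T = product over all dimensions of (L_i + 1)."""
    return prod(levels + 1 for levels in levels_per_dimension)

# No hierarchies: every dimension has a single level, so T = 2**n.
print(total_cuboids([1, 1, 1, 1]))  # 2 * 2 * 2 * 2 = 16

# time: day < month < quarter < year            -> 4 levels
# item: item_name < brand < type                -> 3 levels
# location: street < city < province_or_state < country -> 4 levels
print(total_cuboids([4, 3, 4]))     # 5 * 4 * 5 = 100
```

Adding hierarchy levels multiplies the count rather than adding to it, which is why full precomputation becomes impractical so quickly.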
Cube Operation

 Cube definition and computation in DMQL
    define cube sales[item, city, year]: sum(sales_in_dollars)
    compute cube sales
 Each cuboid represents a different group-by, for example:
     “Compute the sum of sales, grouping by city and item.”
     “Compute the sum of sales, grouping by city.”
 OLAP may need to access different cuboids for different queries. The full cube consists of the following group-bys:
    (city, item, year)
    (city, item), (city, year), (item, year)
    (city), (item), (year)
    ()

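What "compute cube sales" asks for can be sketched as one group-by per subset of the dimensions. The version below is a naive illustration that rescans the rows for every cuboid; the sample rows and the function name are hypothetical, and production systems share work between cuboids instead.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("item", "city", "year")
# Hypothetical rows: (item, city, year, sales_in_dollars)
ROWS = [
    ("TV", "Toronto", 2004, 100),
    ("TV", "Vancouver", 2004, 200),
    ("PC", "Toronto", 2005, 300),
]

def compute_cube(rows, dims):
    """Return {group-by dimensions: {cell: sum of sales}} for all 2**n group-bys."""
    cube = {}
    for k in range(len(dims) + 1):
        for positions in combinations(range(len(dims)), k):
            cells = defaultdict(int)
            for row in rows:
                cells[tuple(row[i] for i in positions)] += row[-1]
            cube[tuple(dims[i] for i in positions)] = dict(cells)
    return cube

cube = compute_cube(ROWS, DIMS)
print(len(cube))          # 8 cuboids for 3 dimensions
print(cube[("city",)])    # sum of sales, grouping by city
print(cube[()])           # apex cuboid: one cell with the grand total
```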
Efficient Data Cube Computation

 Precomputation leads to fast response time and avoids some redundant computation.
 A major challenge related to this precomputation is that the required storage space may explode if all the cuboids in a data cube are precomputed, especially when the cube has many dimensions.
 The storage requirements are even more excessive when many of the dimensions have associated concept hierarchies, each with multiple levels.
 This problem is referred to as the curse of dimensionality.
 Materialization of the data cube: three choices
     No materialization: do not precompute any of the cuboids
     Full materialization: precompute all of the cuboids
     Partial materialization: selectively compute a proper subset of the whole set of possible cuboids

Iceberg Cube
 Iceberg cubes are a common form of partially materialized cube
     Compute only the cuboid cells whose count or other aggregate satisfies a condition such as
      HAVING COUNT(*) >= min_sup
 Motivation
     Only a small portion of cube cells may be “above the water” in a sparse cube (a cuboid is sparse when the measure, e.g., count or sum, is zero for many of its cells)
     Only calculate “interesting” cells: data above a certain threshold, e.g., count >= 10 or a sales threshold of $100
     Avoid explosive growth of the cube

 E.g., an iceberg cube:
compute cube sales_iceberg as
select month, city, customer_group, count(*)
from salesInfo
cube by month, city, customer_group
having count(*) >= min_sup
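The effect of the having clause can be sketched as follows: count every cell of every group-by, then keep only the cells whose count reaches min_sup. This naive version counts first and filters afterwards, so it illustrates the result rather than the saving; practical iceberg algorithms prune during computation. The rows and the function name are hypothetical.

```python
from collections import Counter
from itertools import combinations

DIMS = ("month", "city", "customer_group")
# Hypothetical salesInfo rows.
ROWS = [
    ("Jan", "Toronto", "business"),
    ("Jan", "Toronto", "business"),
    ("Jan", "Vancouver", "home"),
    ("Feb", "Toronto", "home"),
]

def iceberg_cube(rows, dims, min_sup):
    """Keep only the cells with count(*) >= min_sup, for every group-by."""
    result = {}
    for k in range(len(dims) + 1):
        for positions in combinations(range(len(dims)), k):
            counts = Counter(tuple(row[i] for i in positions) for row in rows)
            kept = {cell: n for cell, n in counts.items() if n >= min_sup}
            if kept:  # drop cuboids that have no qualifying cell at all
                result[tuple(dims[i] for i in positions)] = kept
    return result

iceberg = iceberg_cube(ROWS, DIMS, min_sup=2)
for group_by, cells in iceberg.items():
    print(group_by, cells)
```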

Indexing OLAP Data: Bitmap Index and Join Index

 To facilitate efficient data accessing, most data warehouse systems support index structures and materialized views (using cuboids).
 1. The bitmap indexing method is popular in OLAP products because it allows quick searching in data cubes.
     Index on a particular column
     Each value in the column has a bit vector

Base table
    Cust  Region   Type
    C1    Asia     Retail
    C2    Europe   Dealer
    C3    Asia     Dealer
    C4    America  Retail
    C5    Europe   Dealer

Index on Region
    RecID  Asia  Europe  America
    1      1     0       0
    2      0     1       0
    3      1     0       0
    4      0     0       1
    5      0     1       0

Index on Type
    RecID  Retail  Dealer
    1      1       0
    2      0       1
    3      0       1
    4      1       0
    5      0       1

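The two bitmap indexes can be rebuilt from the base table in a few lines, and a selection on both columns then becomes a bitwise AND of two bit vectors. The sketch below uses the five customer rows from the example; the helper names are illustrative.

```python
BASE_TABLE = [
    ("C1", "Asia", "Retail"),
    ("C2", "Europe", "Dealer"),
    ("C3", "Asia", "Dealer"),
    ("C4", "America", "Retail"),
    ("C5", "Europe", "Dealer"),
]

def build_bitmap_index(rows, column):
    """One bit vector per distinct value; bit i is 1 if row i holds that value."""
    index = {}
    for position, row in enumerate(rows):
        index.setdefault(row[column], [0] * len(rows))[position] = 1
    return index

region_index = build_bitmap_index(BASE_TABLE, 1)
type_index = build_bitmap_index(BASE_TABLE, 2)

def bitwise_and(a, b):
    return [x & y for x, y in zip(a, b)]

# "Region = Europe AND Type = Dealer" without scanning the base table columns.
match = bitwise_and(region_index["Europe"], type_index["Dealer"])
customers = [BASE_TABLE[i][0] for i, bit in enumerate(match) if bit]

print(region_index["Asia"])  # [1, 0, 1, 0, 0]
print(match)                 # [0, 1, 0, 0, 1]
print(customers)             # ['C2', 'C5']
```

Bitmap indexes work best on columns with few distinct values, since each value costs one bit per row.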
Indexing OLAP Data: Join indexing
 2. The join indexing method gained popularity from its use in relational database query processing.
     Join indexing registers the joinable rows of two relations from a relational database.
     The linkage between a fact table and its corresponding dimension tables comprises the fact table’s foreign key and the dimension table’s primary key.
 Example: the “Main Street” value in the location dimension table joins with tuples T57, T238, and T884 of the sales fact table. Similarly, the “Sony-TV” value in the item dimension table joins with tuples T57 and T459 of the sales fact table.

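A join index for the example above can be sketched as a precomputed map from each dimension value to the fact tuples it joins with. The tuple identifiers and the values "Main Street" and "Sony-TV" come from the example; the placeholder values "other-street" and "other-item" are hypothetical fillers for attributes the example does not give.

```python
from collections import defaultdict

# Sales fact tuples: tuple id -> (location value, item value)
SALES_FACT = {
    "T57": ("Main Street", "Sony-TV"),
    "T238": ("Main Street", "other-item"),    # item value is a placeholder
    "T459": ("other-street", "Sony-TV"),      # location value is a placeholder
    "T884": ("Main Street", "other-item"),    # item value is a placeholder
}

def build_join_index(fact, position):
    """Map each dimension value to the list of fact tuple ids that join with it."""
    index = defaultdict(list)
    for tuple_id, row in fact.items():
        index[row[position]].append(tuple_id)
    return dict(index)

location_index = build_join_index(SALES_FACT, 0)
item_index = build_join_index(SALES_FACT, 1)

print(location_index["Main Street"])  # ['T57', 'T238', 'T884']
print(item_index["Sony-TV"])          # ['T57', 'T459']

# Combining both indexes finds the facts for Sony-TV sold on Main Street.
both = [t for t in location_index["Main Street"] if t in item_index["Sony-TV"]]
print(both)                           # ['T57']
```

Because the joinable pairs are stored ahead of time, a query can go from a dimension value straight to the matching fact tuples without performing the join at query time.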
Efficient Processing of OLAP Queries
 Determine which operations should be performed on the available cuboids
     Transform the selection, projection, roll-up (group-by), and drill-down operations specified in the query into corresponding SQL and/or OLAP operations
 Determine which materialized cuboid(s) should be selected for the OLAP operation
     Identify all of the materialized cuboids that may potentially be used to answer the query, and select the cuboid with the least cost

 OLAP query processing. Suppose that we define a data cube for AllElectronics of the form “sales_cube [time, item, location]: sum(sales_in_dollars).”
 The dimension hierarchies used are “day < month < quarter < year” for time; “item_name < brand < type” for item; and “street < city < province_or_state < country” for location.
 Let the query to be processed be on {brand, province_or_state} with the condition “year = 2004”, and suppose there are 4 materialized cuboids available:
1) {year, item_name, city}
2) {year, brand, country}
3) {year, brand, province_or_state}
4) {item_name, province_or_state} where year = 2004
Which should be selected to process the query?
 Cuboid 2 cannot be used, because country is a more general concept than province_or_state. Cuboids 1, 3, and 4 can be used; cuboid 1 is likely the most costly, since item_name and city are both at a finer level than the brand and province_or_state requested. Cuboid 3 matches the granularity of the query directly, but if efficient indices are available for cuboid 4, then cuboid 4 may be a better choice.

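The first half of this decision, finding which cuboids can answer the query at all, can be sketched as a level comparison: a cuboid is usable if, for every attribute the query needs, it stores the same dimension at the same or a finer level. The sketch below encodes the three hierarchies and four cuboids from the example; the function names and the way cuboid 4's built-in condition is represented are assumptions for illustration, and cost estimation is not modeled.

```python
HIERARCHIES = {  # levels listed from finest to most general
    "time": ["day", "month", "quarter", "year"],
    "item": ["item_name", "brand", "type"],
    "location": ["street", "city", "province_or_state", "country"],
}

def level_of(attribute):
    """Return (dimension, level index) for an attribute name."""
    for dimension, levels in HIERARCHIES.items():
        if attribute in levels:
            return dimension, levels.index(attribute)
    raise KeyError(attribute)

def can_answer(attributes, prefilter, group_by, condition):
    """True if the cuboid is fine enough for the group-by and the condition."""
    available = dict(level_of(a) for a in attributes)

    def derivable(attribute):
        dimension, level = level_of(attribute)
        return dimension in available and available[dimension] <= level

    if not all(derivable(a) for a in group_by):
        return False
    # A condition is satisfiable if already applied to the cuboid or derivable.
    return all(prefilter.get(a) == v or derivable(a) for a, v in condition.items())

CUBOIDS = {  # number -> (attributes, selection already applied to the cuboid)
    1: (["year", "item_name", "city"], {}),
    2: (["year", "brand", "country"], {}),
    3: (["year", "brand", "province_or_state"], {}),
    4: (["item_name", "province_or_state"], {"year": 2004}),
}

usable = [number for number, (attributes, prefilter) in CUBOIDS.items()
          if can_answer(attributes, prefilter,
                        ["brand", "province_or_state"], {"year": 2004})]
print(usable)  # [1, 3, 4]
```

Choosing among the usable cuboids then depends on their sizes and on the indices available, which is where the cost comparison in the example comes in.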
From Data Warehousing to Data Mining

 Data Warehouse Usage
     Data warehouses and data marts are used in a wide range of applications.
 Three kinds of data warehouse applications
     Information processing
         supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts, and graphs
     Analytical processing
         multidimensional analysis of data warehouse data
         supports basic OLAP operations: slice-dice, drilling, pivoting
     Data mining
         knowledge discovery from hidden patterns
         supports associations, classification and prediction, and presenting the mining results using visualization tools

From On-Line Analytical Processing (OLAP)
to On-Line Analytical Mining (OLAM)
 OLAM (also called OLAP mining) integrates OLAP with data mining technology.
 Why online analytical mining?
     High quality of data in data warehouses
         Most data mining tools need to work on integrated, consistent, and cleaned data, which requires costly preprocessing steps; a data warehouse already provides such data.
     Available information processing infrastructure surrounding data warehouses
         Accessing, integration, consolidation, and transformation of multiple heterogeneous databases
     OLAP-based exploratory data analysis
         Mining with drilling, dicing, pivoting, etc.
     On-line selection of data mining functions

An OLAM System Architecture
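Fig. OLAM system architecture, in four layers (top to bottom):
    Layer 4, User Interface: the user submits a mining query and receives a mining result through the User GUI API
    Layer 3, OLAP/OLAM: OLAM engine and OLAP engine, built on the Data Cube API
    Layer 2, MDDB: multidimensional database with metadata, built by filtering and integration through the Database API
    Layer 1, Data Repository: databases and data warehouse, prepared by data cleaning, data integration, and filtering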
