Data Warehousing

This document discusses data warehousing and online analytical processing (OLAP). It describes how decision support systems use data collected by transaction systems to make business decisions. It also explains how OLAP allows interactive analysis of multidimensional data to be summarized and viewed in different ways. Finally, it discusses implementations of OLAP using relational or multidimensional databases and techniques like pivoting, slicing, rollup and drill down.

Data Warehousing and OLAP

Decision Support Systems


• Decision-support systems are used to make
business decisions, often based on data collected
by online transaction-processing systems
• Examples of business decisions:
– What items to stock?
– What insurance premium to charge?
– To whom to send advertisements?
• Examples of data used for making decisions
– Retail sales transaction details
– Customer profiles (income, age, gender, etc.)

CMPT 354: Database I -- Data Warehousing and OLAP 2


Data and Statistical Analysis
• Data analysis tasks are simplified by
specialized tools and SQL extensions
– Example tasks
• For each product category and each region, what
were the total sales in the last quarter and how do
they compare with the same quarter last year?
• As above, for each product category and each
customer category
• Statistical analysis packages (e.g., SAS)
can be interfaced with databases
Data Analysis and OLAP
• Online Analytical Processing (OLAP)
– Interactive analysis of data, allowing data to be
summarized and viewed in different ways in an online
fashion (with negligible delay)
• Multidimensional data: data modeled as dimension
attributes and measure attributes
– Dimension attributes: define the dimensions on which
measure attributes (or aggregates thereof) are viewed,
e.g. the attributes item_name, color, and size of the
sales relation
– Measure attributes: can be aggregated upon, e.g., the
attribute number of the sales relation

Pivot Table
• Values for one of the dimension attributes form the row headers
• Values for another dimension attribute form the column headers
• Other dimension attributes are listed on top
• Values in individual cells are (aggregates of) the values of the
measure attribute for the dimension values that specify the cell
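A minimal sketch of how such a pivot table could be built. The sample rows and the helper name cross_tab are made up for illustration; only the attribute names item_name, color, size and number come from the sales relation mentioned in these slides. Here item_name forms the row headers, color forms the column headers, and size is aggregated over (the "listed on top" dimension, fixed to all sizes).

```python
from collections import defaultdict

# Hypothetical sample rows of the sales relation:
# (item_name, color, size, number)
sales = [
    ("skirt", "dark", "small", 2),
    ("skirt", "dark", "large", 6),
    ("skirt", "pastel", "small", 11),
    ("dress", "dark", "small", 4),
    ("dress", "pastel", "large", 3),
]

def cross_tab(rows):
    """Sum number per (item_name, color), aggregating over all sizes."""
    cells = defaultdict(int)
    for item_name, color, _size, number in rows:
        cells[(item_name, color)] += number
    return dict(cells)

table = cross_tab(sales)
items = sorted({r[0] for r in sales})    # row headers
colors = sorted({r[1] for r in sales})   # column headers

# Print the pivot table: one row per item, one column per color.
print("item".ljust(8) + "".join(c.ljust(8) for c in colors))
for item in items:
    cells = (str(table.get((item, c), 0)).ljust(8) for c in colors)
    print(item.ljust(8) + "".join(cells))
```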

Relational Representation
• Cross-tabs can be
represented as
relations
– The value all is used to
represent aggregates
– All represents a set
– The SQL:1999
standard uses null
values in place of all
despite confusion with
regular null values
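The relational form of a cross-tab can be sketched as follows, assuming made-up (item_name, color, number) rows. The marker string "all" and the function name cross_tab_relation are illustrative choices, not part of any standard.

```python
from collections import defaultdict

# Hypothetical (item_name, color, number) rows, already summed over size.
sales = [
    ("skirt", "dark", 8),
    ("skirt", "pastel", 11),
    ("dress", "dark", 4),
    ("dress", "pastel", 3),
]

ALL = "all"  # marker for "aggregated over every value of this attribute"

def cross_tab_relation(rows):
    """Return the cross-tab as a list of tuples, including the 'all' totals."""
    totals = defaultdict(int)
    for item_name, color, number in rows:
        totals[(item_name, color)] += number   # individual cell
        totals[(item_name, ALL)] += number     # row total
        totals[(ALL, color)] += number         # column total
        totals[(ALL, ALL)] += number           # grand total
    return sorted((i, c, n) for (i, c), n in totals.items())

relation = cross_tab_relation(sales)
for row in relation:
    print(row)
```

In SQL:1999 output the positions holding "all" here would contain null instead, which is why a genuine null in the data can be confused with a subtotal row.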
Data Cubes
• A data cube is a multidimensional
generalization of a cross-tab
• Can have n dimensions
• Cross-tabs can be used as views on a data
cube

Online Analytical Processing
• Pivoting: changing the dimensions used in a
cross-tab
• Slicing: creating a cross-tab for fixed values only
– Sometimes called dicing, particularly when values for
multiple dimensions are fixed
• Rollup: moving from finer-granularity data to a
coarser granularity
• Drill down: The opposite operation - that of
moving from coarser-granularity data to finer-
granularity data
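The four operations above can be sketched on a small in-memory relation. The rows and the helper names aggregate and slice_rows are hypothetical; a real OLAP system would run these against a data cube rather than a Python list.

```python
from collections import defaultdict

# Hypothetical rows: (item_name, color, size, number)
sales = [
    ("skirt", "dark", "small", 2),
    ("skirt", "dark", "large", 6),
    ("skirt", "pastel", "small", 11),
    ("dress", "dark", "small", 4),
    ("dress", "pastel", "large", 3),
]
DIMS = ("item_name", "color", "size")

def aggregate(rows, dims):
    """Sum number grouped by the given dimension names."""
    idx = [DIMS.index(d) for d in dims]
    out = defaultdict(int)
    for row in rows:
        out[tuple(row[i] for i in idx)] += row[3]
    return dict(out)

def slice_rows(rows, **fixed):
    """Keep only rows whose dimensions match the fixed values."""
    return [r for r in rows
            if all(r[DIMS.index(d)] == v for d, v in fixed.items())]

# A cross-tab on (item_name, color).
by_item_color = aggregate(sales, ("item_name", "color"))
# Pivoting: use a different pair of dimensions.
by_item_size = aggregate(sales, ("item_name", "size"))
# Slicing: fix size = "small" and build the cross-tab for that value only.
small_only = aggregate(slice_rows(sales, size="small"), ("item_name", "color"))
# Rollup: move to the coarser granularity item_name.
by_item = aggregate(sales, ("item_name",))
# Drill down goes from by_item back to by_item_color; it needs the finer
# data, since it cannot be derived from the coarser totals alone.
```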
Hierarchies on Dimensions
• Enable dimensions to be viewed at different levels of
detail
– Dimension DateTime can be used to aggregate by hour
of day, date, day of week, month, quarter or year
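As a sketch of the DateTime hierarchy, each level can be treated as a function that maps a timestamp to a coarser grouping key. The facts, the LEVELS table and the function aggregate_by are illustrative assumptions.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical (timestamp, amount) sales facts.
facts = [
    (datetime(2023, 1, 15, 9, 30), 100),
    (datetime(2023, 2, 3, 14, 0), 50),
    (datetime(2023, 5, 20, 9, 45), 70),
    (datetime(2024, 1, 2, 16, 10), 30),
]

# Each level maps a timestamp to the key used for grouping at that level.
LEVELS = {
    "hour_of_day": lambda t: t.hour,
    "date": lambda t: t.date(),
    "day_of_week": lambda t: t.weekday(),   # 0 = Monday ... 6 = Sunday
    "month": lambda t: (t.year, t.month),
    "quarter": lambda t: (t.year, (t.month - 1) // 3 + 1),
    "year": lambda t: t.year,
}

def aggregate_by(level, rows):
    """Sum amount per key of the chosen hierarchy level."""
    key = LEVELS[level]
    out = defaultdict(int)
    for ts, amount in rows:
        out[key(ts)] += amount
    return dict(out)

print(aggregate_by("quarter", facts))
print(aggregate_by("year", facts))
```

Moving from "month" to "quarter" to "year" is a rollup along the hierarchy; going the other way is a drill down.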

Cross Tabulation With Hierarchy

OLAP Implementation
• Multidimensional OLAP (MOLAP) systems
– Multidimensional arrays in memory to store data
cubes
• Relational OLAP (ROLAP) systems
– Relational tables to store data cubes
• Hybrid OLAP (HOLAP) systems
– Store some summaries in memory and store the
base data and other summaries in a relational
database
Extended Aggregation in SQL:1999
• The cube operation computes union of group by’s
on every subset of the specified attributes
select item_name, color, size, sum(number)
from sales
group by cube(item_name, color, size)
• This computes the union of eight different groupings of
the sales relation: { (item_name, color, size), (item_name,
color), (item_name, size), (color, size),
(item_name), (color), (size), ( ) }
• For each grouping, the result contains the null
value for attributes not present in the grouping
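What the cube operation computes can be illustrated outside SQL. The sketch below assumes a few made-up sales rows and uses None to play the role of the null value; the function name cube is a hypothetical helper, not a library call.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical rows: (item_name, color, size, number)
sales = [
    ("skirt", "dark", "small", 2),
    ("skirt", "dark", "large", 6),
    ("skirt", "pastel", "small", 11),
    ("dress", "dark", "small", 4),
    ("dress", "pastel", "large", 3),
]
ATTRS = ("item_name", "color", "size")

def cube(rows, attrs=ATTRS):
    """Union of group-bys over every subset of attrs.

    None marks an attribute that is not part of the grouping,
    standing in for the null value in the SQL result.
    """
    n = len(attrs)
    result = defaultdict(int)
    for k in range(n, -1, -1):
        for kept in combinations(range(n), k):
            for row in rows:
                key = tuple(row[i] if i in kept else None for i in range(n))
                result[key] += row[n]
    return dict(result)

# Every subset of the three attributes is one grouping.
groupings = [c for k in range(len(ATTRS) + 1) for c in combinations(ATTRS, k)]
print(len(groupings))  # 2**3 = 8 groupings

result = cube(sales)
print(result[(None, None, None)])  # grand total, the ( ) grouping
```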
OLTP Versus OLAP
                    OLTP                                    OLAP
users               clerk, IT professional                  knowledge worker
function            day-to-day operations                   decision support
DB design           application-oriented                    subject-oriented
data                current, up-to-date, detailed,          historical, summarized,
                    flat relational, isolated               multidimensional, integrated,
                                                            consolidated
usage               repetitive                              ad hoc
access              read/write, index/hash on primary key   lots of scans
unit of work        short, simple transaction               complex query
# records accessed  tens                                    millions
# users             thousands                               hundreds
DB size             100MB-GB                                100GB-TB
metric              transaction throughput                  query throughput, response

What Is a Data Warehouse?
• “A data warehouse is a subject-oriented,
integrated, time-variant, and nonvolatile
collection of data in support of
management’s decision-making process.”
– W. H. Inmon
• Data warehousing: the process of
constructing and using data warehouses

Subject-Oriented
• Organized around major subjects, such as
customer, product, sales
• Focusing on the modeling and analysis of
data for decision makers, not on daily
operations or transaction processing
• Providing a simple and concise view around
particular subject issues by excluding data
that are not useful in the decision support
process
Integrated
• Integrating multiple, heterogeneous data sources
– Relational databases, flat files, on-line transaction
records
• Data cleaning and data integration
– Ensure consistency in naming conventions, encoding
structures, attribute measures, etc. among different data
sources
• E.g., Hotel price: currency, tax, breakfast covered, etc.
– When data is moved to the warehouse, it is converted

Time Variant
• The time horizon for the data warehouse is
significantly longer than that of operational
systems
– Operational database: current value data
– Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)
• Every key structure in the data warehouse
– Contains an element of time, explicitly or implicitly
– But the key of operational data may or may not contain
“time element”

Nonvolatile
• A physically separate store of data
transformed from the operational
environment
• Operational update of data does not occur in
the data warehouse environment
– Does not require transaction processing,
recovery, and concurrency control mechanisms
– Requires only two operations in data accessing:
• initial loading of data and access of data

Data Warehousing

Collecting Data
• Source driven architecture: data sources transmit
new information to a warehouse, either
continuously or periodically (e.g. at night)
• Destination driven architecture: a warehouse
periodically requests new information from data
sources
• Keeping warehouse exactly synchronized with
data sources (e.g. using two-phase commit) is too
expensive
– Usually OK to have slightly out-of-date data at
warehouse
– Data/updates are periodically downloaded from online
transaction processing (OLTP) systems
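A destination-driven refresh can be sketched as a warehouse that periodically asks a source for everything newer than what it has already loaded. The classes Source and Warehouse, the sequence-number scheme and the sample payloads are all assumptions made for this illustration.

```python
class Source:
    """Stand-in for an OLTP system; rows are (sequence_no, payload)."""

    def __init__(self):
        self.rows = []

    def insert(self, payload):
        self.rows.append((len(self.rows) + 1, payload))

    def changes_since(self, seq):
        """Return all rows with a sequence number greater than seq."""
        return [r for r in self.rows if r[0] > seq]


class Warehouse:
    def __init__(self, source):
        self.source = source
        self.rows = []
        self.last_seq = 0   # highest sequence number already loaded

    def pull(self):
        """Destination-driven refresh: request only what is new."""
        new = self.source.changes_since(self.last_seq)
        self.rows.extend(new)
        if new:
            self.last_seq = new[-1][0]
        return len(new)


src = Source()
wh = Warehouse(src)
src.insert("sale A")
src.insert("sale B")
wh.pull()             # warehouse catches up
src.insert("sale C")  # warehouse is now slightly out of date
stale = len(src.rows) - len(wh.rows)
wh.pull()             # the next periodic refresh closes the gap
```

Between two calls to pull the warehouse lags behind the source, which is the slightly out-of-date state the slide describes as usually acceptable. In a source-driven design the source would call into the warehouse instead.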
Design Issues
• Data cleansing
– Correct mistakes in addresses (misspellings, zip code
errors), and merge address lists from different sources
and purge duplicates
• Update propagation
– Warehouse schema may be a (materialized) view of
schema from data sources
• Summarizing data
– Raw data may be too large to store on-line
– Aggregate values (totals/subtotals) often suffice
– Queries on raw data can often be transformed by query
optimizer to use aggregate values
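The last point can be illustrated with a small sketch: a query for totals per item is answered once from the raw facts and once from stored per-month subtotals. The data and function names are hypothetical, and the rewriting that a query optimizer would do automatically is done by hand here.

```python
from collections import defaultdict

# Hypothetical raw facts: (item_name, month, amount)
raw = [
    ("skirt", "2023-01", 10),
    ("skirt", "2023-01", 5),
    ("skirt", "2023-02", 7),
    ("dress", "2023-01", 4),
    ("dress", "2023-03", 6),
]

def summarize(rows):
    """Materialize subtotals per (item_name, month)."""
    out = defaultdict(int)
    for item, month, amount in rows:
        out[(item, month)] += amount
    return dict(out)

def total_per_item_from_raw(rows):
    """Answer the query by scanning every raw fact."""
    out = defaultdict(int)
    for item, _month, amount in rows:
        out[item] += amount
    return dict(out)

def total_per_item_from_summary(subtotals):
    """Same answer, computed from the stored subtotals instead."""
    out = defaultdict(int)
    for (item, _month), subtotal in subtotals.items():
        out[item] += subtotal
    return dict(out)

summary = summarize(raw)   # kept online; the raw data could be archived
print(total_per_item_from_summary(summary))
```

This works directly for sum because subtotals can be added together. An average could not be rebuilt from stored averages alone; the summary would need to keep the sum and the count.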
Warehouse Schemas
• Dimension values are usually encoded
using small integers and mapped to full
values via dimension tables
• Resultant schema is called a star schema
– More complicated schema structures
• Snowflake schema: multiple levels of dimension
tables
• Constellation: multiple fact tables
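A small star schema might look like the sketch below, which uses Python's built-in sqlite3 module with an in-memory database. The table names, columns and rows are invented for the example: one fact table (sales) holds integer keys plus the measure, and two dimension tables (item, store) map those keys to full values.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Dimension tables map small integer ids to full values.
    CREATE TABLE item  (item_id  INTEGER PRIMARY KEY, item_name TEXT, category TEXT);
    CREATE TABLE store (store_id INTEGER PRIMARY KEY, city TEXT);
    -- Fact table: foreign keys to the dimensions plus the measure.
    CREATE TABLE sales (
        item_id  INTEGER REFERENCES item(item_id),
        store_id INTEGER REFERENCES store(store_id),
        number   INTEGER
    );
    INSERT INTO item  VALUES (1, 'skirt', 'womenswear'), (2, 'tie', 'menswear');
    INSERT INTO store VALUES (1, 'Burnaby'), (2, 'Vancouver');
    INSERT INTO sales VALUES (1, 1, 5), (1, 2, 3), (2, 1, 2), (2, 2, 7);
""")

# Join the fact table to its dimension tables and aggregate the measure.
rows = con.execute("""
    SELECT i.category, s.city, SUM(f.number)
    FROM sales AS f
    JOIN item  AS i ON i.item_id  = f.item_id
    JOIN store AS s ON s.store_id = f.store_id
    GROUP BY i.category, s.city
    ORDER BY i.category, s.city
""").fetchall()
print(rows)
```

Splitting category out of item into its own table, referenced from item by a key, would turn this into a snowflake schema; adding a second fact table that shares the dimensions would give a constellation.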

Data Warehouse Schema

Picture from publib.boulder.ibm.com

Snowflake Schema
A star schema is a
snowflake schema
in which each
dimension has a
single
dimension table

Why Data Mining?
• Evolution of database technology
– To collect a large amount of data → primitive
file processing
– To store and query data efficiently → DBMS
• New challenges: huge amount of data, how
to analyze and understand?
– Data mining

What Is Data Mining?
• Mining data – mining knowledge
• Data mining is the non-trivial process of
identifying valid, novel, potentially useful,
and ultimately understandable patterns in
data

The KDD Process
[Figure: the KDD process. Data → selection → target data → preprocessing → preprocessed data → transformation → transformed data → data mining → patterns → interpretation/evaluation → knowledge]

KDD Process Steps
• Preprocessing
– Data cleaning
– Data integration
• Data selection
• Data transformation
• Data mining
• Pattern evaluation
• Knowledge presentation
Summary
• OLAP
• Data warehousing
– Star schema
– Snowflake schema
• Data mining and KDD process
