Lecture19 257


Data Warehousing

University of California, Berkeley


School of Information
IS 257: Database Management

IS 257 – Fall 2015 2015.11.03 - SLIDE 1


Lecture Outline
• Data Warehouses
• Introduction to Data Warehouses
• Data Warehousing
– (Based on lecture notes from Modern
Database Management Text (Hoffer, Ramesh,
Topi); Joachim Hammer, University of Florida,
and Joe Hellerstein and Mike Stonebraker of
UCB)



Overview
• Data Warehouses and Merging
Information Resources
• What is a Data Warehouse?
• History of Data Warehousing
• Types of Data and Their Uses
• Data Warehouse Architectures
• Data Warehousing Problems and Issues



Problem: Heterogeneous Information Sources

“Heterogeneities are everywhere”
[Figure: heterogeneous sources: Personal Databases, Scientific Databases, World Wide Web, Digital Libraries]
• Different interfaces
• Different data representations
• Duplicate and inconsistent information
Slide credit: J. Hammer
Problem: Data Management in Large Enterprises

• Vertical fragmentation of informational systems (vertical stove pipes)
• Result of application (user)-driven development of operational systems
[Figure: stove-pipe applications grouped by department: Sales Administration (Sales Planning, Stock Mngmt, ...), Finance (Suppliers, Debt Mngmt, ...), Manufacturing (Num. Control, Inventory, ...), ...]


Slide credit: J. Hammer
Goal: Unified Access to Data

[Figure: an Integration System sits above the World Wide Web, Digital Libraries, Scientific Databases, and Personal Databases]
• Collects and combines information
• Provides integrated view, uniform user interface
• Supports sharing
Slide credit: J. Hammer
The Traditional Research Approach

• Query-driven (lazy, on-demand)
[Figure: Clients query the Integration System (with Metadata), which reaches each Source through its own Wrapper]
Slide credit: J. Hammer
Disadvantages of Query-Driven Approach

• Delay in query processing
– Slow or unavailable information sources
– Complex filtering and integration
• Inefficient and potentially expensive for frequent queries
• Competes with local processing at sources
• Hasn’t caught on in industry

Slide credit: J. Hammer


The Warehousing Approach
• Information integrated in advance
• Stored in WH for direct querying and analysis
[Figure: Clients query the Data Warehouse; the Integration System (with Metadata) loads it from an Extractor/Monitor attached to each Source]
Slide credit: J. Hammer
Advantages of Warehousing Approach

• High query performance
– But not necessarily most current information
• Doesn’t interfere with local processing at sources
– Complex queries at warehouse
– OLTP at information sources
• Information copied at warehouse
– Can modify, annotate, summarize, restructure, etc.
– Can store historical information
– Security, no auditing
• Has caught on in industry

Slide credit: J. Hammer
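
The trade-off between the two approaches can be sketched in a few lines of Python. This is only an illustration: the two in-memory "sources", their record layouts, and the function names are assumptions made for the example, not part of the lecture.

```python
# Hypothetical sources: each maps a customer id to a record in its own format.
SOURCE_A = {1: {"name": "Ann", "city": "Oakland"}}
SOURCE_B = {1: ("ann@example.com",), 2: ("bob@example.com",)}

def query_driven_lookup(cust_id):
    """Lazy approach: contact every source at query time, then integrate."""
    result = {}
    if cust_id in SOURCE_A:
        result.update(SOURCE_A[cust_id])
    if cust_id in SOURCE_B:
        result["email"] = SOURCE_B[cust_id][0]
    return result

def build_warehouse():
    """Eager approach: integrate everything in advance into one store."""
    warehouse = {}
    for cust_id in set(SOURCE_A) | set(SOURCE_B):
        warehouse[cust_id] = query_driven_lookup(cust_id)
    return warehouse

warehouse = build_warehouse()

# A source changes after the load; the warehouse copy is now stale.
SOURCE_A[1]["city"] = "Berkeley"

lazy = query_driven_lookup(1)   # pays the integration cost, sees the change
eager = warehouse[1]            # a plain lookup, but reflects load time
```

Here the lazy lookup reports the new city while the warehouse still holds the value from load time, which is the "not necessarily most current information" caveat in miniature.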


Not Either-Or Decision
• Query-driven approach still better for
– Rapidly changing information
– Rapidly changing information sources
– Truly vast amounts of data from large
numbers of sources
– Clients with unpredictable needs

Slide credit: J. Hammer


Data Warehouse Evolution
[Figure: timeline with axis labeled TIME, ticks at 1960, 1975, 1980, 1985, 1990, 1995, 2000]
Eras: “Prehistoric Times”; “Middle Ages”; Information-Based Management; Data Revolution
Above the axis: Relational Databases; Company DWs; “Building the DW”, Inmon (1992); Data Replication Tools
Below the axis: PC’s and Spreadsheets; End-user Interfaces; 1st DW Article; DW Confs.; Vendor DW Frameworks
Slide credit: J. Hammer
What is a Data Warehouse?

“A Data Warehouse is a
– subject-oriented,
– integrated,
– time-variant,
– non-volatile
collection of data used in support of
management decision making
processes.”
-- Inmon & Hackathorn, 1994: viz. Hoffer, Chap 11



DW Definition…
• Subject-Oriented:
– The data warehouse is organized around the
key subjects (or high-level entities) of the
enterprise. Major subjects include
• Customers
• Patients
• Students
• Products
• Etc.



DW Definition…
• Integrated
– The data housed in the data warehouse are
defined using consistent
• Naming conventions
• Formats
• Encoding Structures
• Related Characteristics



DW Definition…
• Time-variant
– The data in the warehouse contain a time
dimension so that they may be used as a
historical record of the business



DW Definition…
• Non-volatile
– Data in the data warehouse are loaded and
refreshed from operational systems, but
cannot be updated by end-users



What is a Data Warehouse?
A Practitioner’s Viewpoint
• “A data warehouse is simply a single, complete, and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use in a business context.”
-- Barry Devlin, IBM Consultant

Slide credit: J. Hammer
A Data Warehouse is...
• Stored collection of diverse data
– A solution to data integration problem
– Single repository of information
• Subject-oriented
– Organized by subject, not by application
– Used for analysis, data mining, etc.
• Optimized differently from transaction-oriented db
• User interface aimed at executive decision makers and analysts



… Cont’d
• Large volume of data (Gb, Tb)
• Non-volatile
– Historical
– Time attributes are important
• Updates infrequent
• May be append-only
• Examples
– All transactions ever at WalMart
– Complete client histories at insurance firm
– Stockbroker financial information and portfolios

Slide credit: J. Hammer


Need for Data Warehousing
• Integrated, company-wide view of high-quality
information (from disparate databases)
• Separation of operational and informational systems
and data (for improved performance)



Warehouse is a Specialized DB

Standard (Operational) DB
• Mostly updates
• Many small transactions
• Mb - Gb of data
• Current snapshot
• Index/hash on p.k.
• Raw data
• Thousands of users (e.g., clerical users)

Warehouse (Informational)
• Mostly reads
• Queries are long and complex
• Gb - Tb of data
• History
• Lots of scans
• Summarized, reconciled data
• Hundreds of users (e.g., decision-makers, analysts)

Slide credit: J. Hammer


Warehouse vs. Data Mart



Data Warehouse Architectures
• Generic Two-Level Architecture
• Independent Data Mart
• Dependent Data Mart and Operational
Data Store
• Logical Data Mart and @ctive Warehouse
• Three-Layer architecture

All involve some form of extraction, transformation and loading (ETL)



Generic two-level data warehousing architecture
[Figure: two-level architecture; surviving labels: T, L; One, company-wide warehouse]
Periodic extraction ⇒ data is not completely current in warehouse



Independent data mart data warehousing architecture
Data marts: mini-warehouses, limited in scope
[Figure: independent data mart architecture; surviving labels: E, T]
Separate ETL for each independent data mart
Data access complexity due to multiple data marts

Dependent data mart with operational data store: a three-level architecture
ODS provides option for obtaining current data
[Figure: dependent data mart architecture; surviving labels: E, T]
Single ETL for enterprise data warehouse (EDW)
Dependent data marts loaded from EDW
Simpler data access

Logical data mart and real time warehouse architecture
ODS and data warehouse are one and the same
[Figure: logical data mart architecture; surviving labels: E, T]
Near real-time ETL for Data Warehouse
Data marts are NOT separate databases, but logical views of the data warehouse ⇒ Easier to create new data marts
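
As a rough illustration of a data mart that is a logical view rather than a separate database, the sketch below uses Python's built-in sqlite3 module. The `sales` table, its columns, and the `west_mart` view are hypothetical names chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("West", "Laptop", 1200.0), ("East", "Laptop", 1100.0), ("West", "Phone", 600.0)],
)

# The "data mart" is only a view definition; no rows are copied.
conn.execute(
    "CREATE VIEW west_mart AS "
    "SELECT product, amount FROM sales WHERE region = 'West'"
)

# A row loaded into the warehouse afterwards is visible through the mart at once.
conn.execute("INSERT INTO sales VALUES ('West', 'Tablet', 300.0)")

west_rows = conn.execute(
    "SELECT product, amount FROM west_mart ORDER BY product"
).fetchall()
```

Creating another mart under this design is one more `CREATE VIEW` statement, with no additional ETL job.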



Data Characteristics
Status vs. Event Data

[Figure: Status → Event → Status]
Event = a database action (create/update/delete) that results from a transaction



Data Characteristics
Transient vs. Periodic Data

With transient data, changes to existing records are written over previous records, thus destroying the previous data content



Data Characteristics
Transient vs. Periodic Data

Periodic data are never physically altered or deleted once they have been added to the store
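
The difference between transient and periodic data can be sketched as follows. The account key, the balances, and the helper names are illustrative assumptions, not taken from the slides.

```python
from datetime import date

# Transient store: one slot per key, so an update overwrites the old value.
transient = {}
# Periodic store: append-only list of (account, balance, effective_date) rows.
periodic = []

def record_balance(account, balance, as_of):
    transient[account] = balance                # previous balance is lost
    periodic.append((account, balance, as_of))  # previous rows are kept

def balance_as_of(account, when):
    """Return the latest periodic balance on or before `when`, else None."""
    rows = [(d, b) for (a, b, d) in periodic if a == account and d <= when]
    return max(rows)[1] if rows else None

record_balance("A-1", 100, date(2015, 11, 1))
record_balance("A-1", 250, date(2015, 11, 3))
```

Only the periodic store can answer a question such as "what was the balance on November 2?", which is why the warehouse keeps periodic data.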



Other Data Warehouse Changes
• New descriptive attributes
• New business activity attributes
• New classes of descriptive attributes
• Descriptive attributes become more
refined
• Descriptive data are related to one another
• New source of data



The Reconciled Data Layer
• Typical operational data is:
– Transient–not historical
– Not normalized (perhaps due to denormalization for
performance)
– Restricted in scope–not comprehensive
– Sometimes poor quality–inconsistencies and errors
• After ETL, data should be:
– Detailed–not summarized yet
– Historical–periodic
– Normalized–3rd normal form or higher
– Comprehensive–enterprise-wide perspective
– Timely–data should be current enough to assist decision-making
– Quality controlled–accurate with full integrity



Types of Data
• Business Data - represents meaning
– Real-time data (ultimate source of all business data)
– Reconciled data
– Derived data
• Metadata - describes meaning
– Build-time metadata
– Control metadata
– Usage metadata
• Data as a product* - intrinsic meaning
– Produced and stored for its own intrinsic value
– e.g., the contents of a text-book

Slide credit: J. Hammer


Data Warehousing: Two Distinct Issues

• (1) How to get information into warehouse
– “Data warehousing”
• (2) What to do with data once it’s in warehouse
– “Warehouse DBMS”
• Both rich research areas
• Industry has focused on (2)

Slide credit: J. Hammer


The ETL Process
• Capture/Extract
• Scrub or data cleansing
• Transform
• Load and Index

ETL = Extract, transform, and load
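
The four steps listed above can be chained as a small pipeline. This is a minimal sketch: the row layout, the cleansing rule, and the function names are assumptions for illustration, and a real ETL tool does far more at each stage.

```python
def extract(source_rows):
    """Capture a snapshot of the chosen subset of source data."""
    return [dict(r) for r in source_rows]

def scrub(rows):
    """Cleanse: trim whitespace and drop rows missing a customer name."""
    cleaned = []
    for r in rows:
        name = (r.get("name") or "").strip()
        if name:
            cleaned.append({**r, "name": name})
    return cleaned

def transform(rows):
    """Convert to the warehouse format (here: amounts as integer cents)."""
    return [{"name": r["name"], "amount_cents": round(r["amount"] * 100)} for r in rows]

def load_and_index(rows, warehouse, index):
    """Append rows to the warehouse and index their positions by name."""
    for r in rows:
        index.setdefault(r["name"], []).append(len(warehouse))
        warehouse.append(r)

source = [
    {"name": " Ann ", "amount": 12.5},
    {"name": "", "amount": 3.0},
    {"name": "Bob", "amount": 7.25},
]
warehouse, index = [], {}
load_and_index(transform(scrub(extract(source))), warehouse, index)
```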



Capture/Extract…obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse
Static extract = capturing a snapshot of the source data at a point in time
Incremental extract = capturing changes that have occurred since the last static extract
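
One common way to implement an incremental extract is a "high-water mark" on a last-modified column. The sketch below assumes each source row carries such a column (`modified`); that column and the function names are assumptions, not something the slides prescribe.

```python
# Hypothetical source table: each row carries a last-modified counter.
source = [
    {"id": 1, "value": "a", "modified": 10},
    {"id": 2, "value": "b", "modified": 20},
]

def static_extract(rows):
    """Snapshot of the whole source at one point in time."""
    return [dict(r) for r in rows]

def incremental_extract(rows, since):
    """Only rows changed after the previous extract (high-water mark `since`)."""
    return [dict(r) for r in rows if r["modified"] > since]

snapshot = static_extract(source)
high_water = max(r["modified"] for r in snapshot)

# The source keeps changing after the snapshot was taken.
source[0].update(value="a2", modified=30)
source.append({"id": 3, "value": "c", "modified": 31})

delta = incremental_extract(source, high_water)
```

Note that this simple scheme sees inserts and updates but not deletions at the source; detecting those needs one of the monitoring techniques such as log reading or snapshot comparison.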
Data Extraction
• Source types
– Relational, flat file, WWW, etc.
• How to get data out?
– Replication tool
– Dump file
– Create report
– ODBC or third-party “wrappers”

Slide credit: J. Hammer


Wrapper
• Converts data and queries from one data model to another
[Figure: Data Model A and Data Model B exchanging Queries and Data]
• Extends query capabilities for sources with limited capabilities
[Figure: Queries → Wrapper → Source]
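
A hand-coded wrapper for a flat-file source might look like the sketch below. The `CsvWrapper` class and its `select` method are hypothetical; the point is only that the wrapper answers an equality query that the raw file cannot evaluate on its own.

```python
import csv
import io

class CsvWrapper:
    """Hypothetical wrapper: exposes a flat CSV source as queryable records."""

    def __init__(self, text):
        self._text = text

    def select(self, **equals):
        """Return the rows whose fields equal the given values."""
        reader = csv.DictReader(io.StringIO(self._text))
        return [
            row for row in reader
            if all(row.get(k) == str(v) for k, v in equals.items())
        ]

raw = "item,clerk\nComputer,Mary\nPrinter,John\nMonitor,Mary\n"
wrapper = CsvWrapper(raw)
mary_sales = wrapper.select(clerk="Mary")
```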

Slide credit: J. Hammer


Wrapper Generation
• Solution 1: Hard code for each source
• Solution 2: Automatic wrapper generation

[Figure: Definition → Wrapper Generator → Wrapper]

Slide credit: J. Hammer


Monitors
• Goal: Detect changes of interest and
propagate to integrator
• How?
– Triggers
– Replication server
– Log sniffer
– Compare query results
– Compare snapshots/dumps
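
Of the techniques listed, comparing snapshots is the easiest to sketch. The example below assumes each snapshot is a mapping from primary key to row; the function name and the sample data are illustrative.

```python
def diff_snapshots(old, new):
    """Compare two snapshots keyed by primary key; return detected changes."""
    inserted = {k: new[k] for k in new.keys() - old.keys()}
    deleted = {k: old[k] for k in old.keys() - new.keys()}
    updated = {
        k: (old[k], new[k])
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return inserted, deleted, updated

yesterday = {1: ("Mary", 25), 2: ("John", 31)}
today = {1: ("Mary", 26), 3: ("Ana", 40)}
inserted, deleted, updated = diff_snapshots(yesterday, today)
```

Snapshot comparison needs no cooperation from the source, but it touches every row on each run, so triggers or log sniffing are usually cheaper when the source allows them.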

Slide credit: J. Hammer


Scrub/Cleanse…uses pattern recognition and AI techniques to upgrade data quality
Figure 11-10: Steps in data reconciliation (cont.)
Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data

New approaches for Data Cleansing

• It has generally been found that 70-90 percent of the time and effort in large data management and analysis tasks is taken up with data cleansing
• New tool “Data Wrangler” from Stanford and Berkeley CS folks
• http://vis.stanford.edu/wrangler/



Data Cleansing
• Find (& remove) duplicate tuples
– e.g., Jane Doe vs. Jane Q. Doe
• Detect inconsistent, wrong data
– Attribute values that don’t match
• Patch missing, unreadable data
• Notify sources of errors found
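
Duplicate detection usually starts from a normalized matching key. The sketch below uses a deliberately crude rule (lowercase, strip punctuation, drop one-letter initials) that is an assumption for illustration; production cleansing tools use much richer similarity measures.

```python
import re

def match_key(name):
    """Crude matching key: lowercase, strip punctuation, drop one-letter initials."""
    tokens = re.sub(r"[^\w\s]", "", name.lower()).split()
    return " ".join(t for t in tokens if len(t) > 1)

def find_duplicates(names):
    """Group names that share a matching key; return groups with 2+ members."""
    groups = {}
    for n in names:
        groups.setdefault(match_key(n), []).append(n)
    return [g for g in groups.values() if len(g) > 1]

dups = find_duplicates(["Jane Doe", "Jane Q. Doe", "John Smith"])
```

A rule this loose also merges distinct people who differ only by middle initial, so candidate groups would normally be reviewed or checked against other attributes before tuples are removed.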

Slide credit: J. Hammer


Transform = convert data from format of operational system to format of data warehouse
Figure 11-10: Steps in data reconciliation (cont.)
Record-level:
– Selection–data partitioning
– Joining–data combining
– Aggregation–data summarization
Field-level:
– single-field–from one field to one field
– multi-field–from many fields to one, or one field to many
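
The record-level and field-level functions can be illustrated on a tiny in-memory data set. The tables, column names, and values below are made up for the example.

```python
from collections import defaultdict

sales = [
    {"item": "Computer", "clerk": "Mary", "amount": 900},
    {"item": "Printer", "clerk": "John", "amount": 150},
    {"item": "Monitor", "clerk": "Mary", "amount": 200},
]
clerks = {"Mary": {"store": "Oakland"}, "John": {"store": "Berkeley"}}

# Selection: keep only the rows that satisfy a condition.
big_sales = [s for s in sales if s["amount"] >= 200]

# Joining: combine each sale with its clerk's record.
joined = [{**s, **clerks[s["clerk"]]} for s in sales]

# Aggregation: summarize amounts per store.
totals = defaultdict(int)
for row in joined:
    totals[row["store"]] += row["amount"]

# Multi-field (many fields to one): build one field from several.
labels = [f'{r["clerk"]}@{r["store"]}' for r in joined]
```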



Data Transformations
• Convert data to uniform format
– Byte ordering, string termination
– Internal layout
• Remove, add & reorder attributes
– Add key
– Add data to get history
• Sort tuples

Slide credit: J. Hammer


Load/Index = place transformed data into the warehouse and create indexes
Figure 11-10: Steps in data reconciliation (cont.)
Refresh mode: bulk rewriting of target data at periodic intervals
Update mode: only changes in source data are written to data warehouse
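
The two load modes differ mainly in how much of the target they rewrite. A minimal sketch, assuming the target is keyed by `id` and that a list of changed rows is already available (both assumptions for the example):

```python
def refresh_load(source_rows):
    """Refresh mode: rebuild the target from scratch from the full source."""
    return {r["id"]: r["value"] for r in source_rows}

def update_load(target, changes):
    """Update mode: write only the changed source rows into the target."""
    for r in changes:
        target[r["id"]] = r["value"]
    return target

source = [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]
target = refresh_load(source)

# Later, only the rows that changed at the source are applied.
target = update_load(target, [{"id": 2, "value": "b2"}, {"id": 3, "value": "c"}])
```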



Data Integration
• Receive data (changes) from multiple
wrappers/monitors and integrate into warehouse
• Rule-based
• Actions
– Resolve inconsistencies
– Eliminate duplicates
– Integrate into warehouse (may not be empty)
– Summarize data
– Fetch more data from sources (wh updates)
– etc.

Slide credit: J. Hammer


Warehouse Maintenance
• Warehouse data ≈ materialized view
– Initial loading
– View maintenance
• View maintenance

Slide credit: J. Hammer


Differs from Conventional View Maintenance...

• Warehouses may be highly aggregated and summarized
• Warehouse views may be over history of base data
• Process large batch updates
• Schema may evolve

Slide credit: J. Hammer


Differs from Conventional View Maintenance...

• Base data doesn’t participate in view maintenance
– Simply reports changes
– Loosely coupled
– Absence of locking, global transactions
– May not be queriable

Slide credit: J. Hammer


Warehouse Maintenance Anomalies

• Materialized view maintenance in loosely coupled, non-transactional environment
• Simple example
[Figure: source Sales holds Sale(item,clerk) and source Comp. holds Emp(clerk,age); both feed the Integrator, which maintains Sold(item,clerk,age) in the Data Warehouse]
Sold = Sale ⋈ Emp
Slide credit: J. Hammer
Warehouse Maintenance Anomalies

[Figure: source Sales holds Sale(item,clerk) and source Comp. holds Emp(clerk,age); both feed the Integrator, which maintains Sold(item,clerk,age) in the Data Warehouse]
1. Insert into Emp(Mary,25), notify integrator
2. Insert into Sale (Computer,Mary), notify integrator
3. (1) → integrator adds Sale ⋈ (Mary,25)
4. (2) → integrator adds (Computer,Mary) ⋈ Emp
5. View incorrect (duplicate tuple)
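
The five steps above can be replayed in a short simulation. This is a sketch under the assumption that the integrator handles each notification by joining the reported tuple against the current state of the other source relation; the variable and function names are invented for the example.

```python
sale, emp = [], []   # source relations Sale(item, clerk) and Emp(clerk, age)
sold = []            # warehouse view Sold(item, clerk, age)
pending = []         # notifications the integrator has not processed yet

def insert_emp(clerk, age):
    emp.append((clerk, age))
    pending.append(("emp", (clerk, age)))

def insert_sale(item, clerk):
    sale.append((item, clerk))
    pending.append(("sale", (item, clerk)))

def process(notification):
    """Integrator: join the reported tuple with the CURRENT source state."""
    kind, row = notification
    if kind == "emp":
        clerk, age = row
        sold.extend((i, c, age) for (i, c) in sale if c == clerk)
    else:
        item, clerk = row
        sold.extend((item, clerk, a) for (c, a) in emp if c == clerk)

insert_emp("Mary", 25)            # step 1
insert_sale("Computer", "Mary")   # step 2, before notification 1 is handled
for n in pending:                 # steps 3 and 4
    process(n)
```

Because both source inserts happen before either notification is processed, each join finds the other new tuple, and `sold` ends up with (Computer, Mary, 25) twice, although the correct view contains it once.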
Slide credit: J. Hammer
Warehouse Specification (ideally)
[Figure: warehouse specification; labels: View Definitions; Warehouse Configuration Module; Integration rules; Change Detection Requirements; Integrator; Warehouse; Metadata; Extractor/Monitor (one per source)]
Slide credit: J. Hammer
Additional Research Issues
• Historical views of non-historical data
• Expiring outdated information
• Crash recovery
• Addition and removal of information
sources
– Schema evolution

Slide credit: J. Hammer


Warehousing and Industry
• Data Warehousing is big business
– $2 billion in 1995
– $3.5 billion in early 1997
– Predicted: $8 billion in 1998 [Metagroup]
• Wal-Mart said to have the largest warehouse
– 1000-CPU, 583 Terabyte, Teradata system
(InformationWeek, Jan 9, 2006)
– “Half a Petabyte” in warehouse (Ziff Davis Internet,
October 13, 2004)
– 1 billion rows of data or more are updated every day
(InformationWeek, Jan 9, 2006)
– Reported to be 2.5 Petabytes in 2008
• http://gigaom.com/2013/03/27/why-apple-ebay-and-walmart-have-some-of-the-biggest-data-warehouses-youve-ever-seen



Other Large Data Warehouses

(InformationWeek, Jan 9, 2006)


Those are small change today…
• Some databases are larger, however…
– eBay: has two Teradata systems. Its primary data warehouse is 9.2 petabytes; its “singularity system” that stores web clicks and other “big” data is more than 40 petabytes. It includes a single table that’s 1 trillion rows. (2013)
• http://gigaom.com/2013/03/27/why-apple-ebay-and-walmart-have-some-of-the-biggest-data-warehouses-youve-ever-seen
– Apple: “Multiple Petabytes” in 2013
– Yahoo! for web user behavioral analysis, storing two
petabytes and claimed to be the largest data
warehouse using a heavily modified version of
PostgreSQL (Wikipedia 2012)



More Information on DW
• Agosta, Lou, The Essential Guide to Data
Warehousing. Prentice Hall PTR, 1999.
• Devlin, Barry, Data Warehouse, from
Architecture to Implementation. Addison-Wesley,
1997.
• Inmon, W.H., Building the Data Warehouse.
John Wiley, 1992.
• Widom, J., “Research Problems in Data
Warehousing.” Proc. of the 4th Intl. CIKM Conf.,
1995.
• Chaudhuri, S., Dayal, U., “An Overview of Data
Warehousing and OLAP Technology.” ACM
SIGMOD Record, March 1997.
