Lecture19 257
“Heterogeneities are everywhere”
[Figure: heterogeneous information sources: personal databases, the World Wide Web, scientific databases, digital libraries]
• Different interfaces
• Different data representations
• Duplicate and inconsistent information
Slide credit: J. Hammer
IS 257 – Fall 2015 2015.11.03 - SLIDE 4
Problem: Data Management in Large Enterprises
[Figure: an Integration System over the World Wide Web, personal databases, digital libraries, and scientific databases]
[Figure: Wrapper, Wrapper, ..., Wrapper, each over a Source]
Slide credit: J. Hammer
Disadvantages of Query-Driven Approach
[Figure: Source, Source, ..., Source]
Slide credit: J. Hammer
Advantages of Warehousing Approach
[Figure: timeline along a TIME axis, from “Prehistoric Times” through the “Middle Ages” and the Data Revolution to Information-Based Management]
“A Data Warehouse is a
– subject-oriented,
– integrated,
– time-variant,
– non-volatile
collection of data used in support of
management decision making
processes.”
-- Inmon & Hackathorn, 1994: viz. Hoffer, Chap 11
Standard (Operational) DB                       Warehouse (Informational)
• Mostly updates                                • Mostly reads
• Many small transactions                       • Queries are long and complex
• MB - GB of data                               • GB - TB of data
• Current snapshot                              • History
• Index/hash on p.k.                            • Lots of scans
• Raw data                                      • Summarized, reconciled data
• Thousands of users (e.g., clerical users)     • Hundreds of users (e.g., decision-makers, analysts)
[Figure labels: E, T, L (extract, transform, load); one, company-wide warehouse]
[Figure callouts:]
• Separate ETL for each independent data mart
• Data access complexity due to multiple data marts
Dependent data mart with operational data store: a three-level architecture
• ODS provides option for obtaining current data
• Single ETL for enterprise data warehouse (EDW)
• Dependent data marts loaded from EDW
• Simpler data access
Logical data mart and real time warehouse architecture
• ODS and data warehouse are one and the same
• Near real-time ETL for Data Warehouse
• Data marts are NOT separate databases, but logical views of the data warehouse
• Easier to create new data marts
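The idea that a logical data mart is a view over the warehouse rather than a separate database can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table `sales_fact`, the view `west_sales_mart`, and the rows are hypothetical and do not come from the lecture.

```python
import sqlite3

# Hypothetical warehouse table; schema and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?)",
    [("West", "laptop", 1200.0), ("West", "phone", 600.0), ("East", "laptop", 1100.0)],
)

# A "logical data mart": a view over the warehouse, not a separate database.
conn.execute(
    "CREATE VIEW west_sales_mart AS "
    "SELECT product, amount FROM sales_fact WHERE region = 'West'"
)


def mart_rows(connection):
    """Return the current contents of the logical mart, ordered by product."""
    return connection.execute(
        "SELECT product, amount FROM west_sales_mart ORDER BY product"
    ).fetchall()


before = mart_rows(conn)

# New warehouse data shows up in the mart without a separate load step.
conn.execute("INSERT INTO sales_fact VALUES ('West', 'tablet', 400.0)")
after = mart_rows(conn)
```

Because the mart is just a query definition, creating another mart here would mean one more `CREATE VIEW` statement, which is one way to read "easier to create new data marts".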
Event = a database action (create/update/delete) that results from a transaction
[Figure: a Status before and a Status after the Event]
With transient data, changes to existing records are written over previous records, thus destroying the previous data content
Periodic data are never physically altered or deleted once they have been added to the store
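The contrast between transient and periodic data can be sketched in a few lines of Python. The two classes, the key `cust-1`, and the dates below are hypothetical names chosen for illustration; they are not from the lecture.

```python
from datetime import date


class TransientStore:
    """Keeps only the latest value per key; an update overwrites the old record."""

    def __init__(self):
        self.rows = {}

    def write(self, key, value, as_of):
        self.rows[key] = (value, as_of)  # previous content is destroyed


class PeriodicStore:
    """Append-only: every change is added as a new dated record."""

    def __init__(self):
        self.rows = []

    def write(self, key, value, as_of):
        self.rows.append((key, value, as_of))  # nothing is altered or deleted

    def history(self, key):
        return [(value, as_of) for k, value, as_of in self.rows if k == key]


transient, periodic = TransientStore(), PeriodicStore()
for store in (transient, periodic):
    store.write("cust-1", "Berkeley", date(2015, 1, 5))
    store.write("cust-1", "Oakland", date(2015, 6, 1))

latest_only = transient.rows["cust-1"]      # only the Oakland record survives
full_history = periodic.history("cust-1")   # both records are retained
```

The periodic store is what makes the "time-variant" and "non-volatile" properties of a warehouse possible: a question such as "where was this customer in March?" can only be answered from `full_history`.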
[Figure labels: Wrapper, Wrapper Generator, Definition]
Record-level:
• Selection: data partitioning
• Joining: data combining
• Aggregation: data summarization
Field-level:
• single-field: from one field to one field
• multi-field: from many fields to one, or one field to many
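A minimal Python sketch of the record-level and field-level transformation functions listed above. The records, field names, and helper functions are all hypothetical and only meant to show what each category does.

```python
from collections import defaultdict

# Hypothetical source records, for illustration only.
orders = [
    {"order_id": 1, "cust_id": 10, "region": "West", "amount": 120.0},
    {"order_id": 2, "cust_id": 11, "region": "East", "amount": 80.0},
    {"order_id": 3, "cust_id": 10, "region": "West", "amount": 30.5},
]
customers = {10: {"first": "Ann", "last": "Lee"}, 11: {"first": "Bo", "last": "Diaz"}}


# Record-level functions
def select(rows, predicate):
    """Selection: partition the data by keeping rows that satisfy a condition."""
    return [r for r in rows if predicate(r)]


def join(rows, lookup, key):
    """Joining: combine each row with the matching record from another source."""
    return [{**r, **lookup[r[key]]} for r in rows if r[key] in lookup]


def aggregate(rows, group_key, value_key):
    """Aggregation: summarize detail rows into one total per group."""
    totals = defaultdict(float)
    for r in rows:
        totals[r[group_key]] += r[value_key]
    return dict(totals)


# Field-level functions
def to_cents(amount):
    """Single-field: one source field to one target field."""
    return round(amount * 100)


def full_name(first, last):
    """Multi-field, many to one."""
    return f"{first} {last}"


def split_name(full):
    """Multi-field, one to many."""
    first, last = full.split(" ", 1)
    return first, last


west = select(orders, lambda r: r["region"] == "West")
joined = join(orders, customers, "cust_id")
totals = aggregate(orders, "region", "amount")
```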
[Figure: source Sales with relation Sale(item,clerk); source Comp. with relation Emp(clerk,age)]
Slide credit: J. Hammer
Warehouse Maintenance Anomalies
[Figure: an Integrator above source Sales with relation Sale(item,clerk) and source Comp. with relation Emp(clerk,age)]
1. Insert into Emp(Mary,25), notify integrator
2. Insert into Sale (Computer,Mary), notify integrator
3. (1) → integrator adds Sale ⋈ (Mary,25)
4. (2) → integrator adds (Computer,Mary) ⋈ Emp
5. View incorrect (duplicate tuple)
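The five steps can be replayed in a small Python simulation. It assumes that the warehouse view is the join of Sale and Emp on clerk, and that the integrator handles each notification by joining the new tuple against the current state of the other source; both are assumptions made for this sketch, and every function name is hypothetical.

```python
sale = []     # (item, clerk) tuples at the Sales source
emp = []      # (clerk, age) tuples at the Comp. source
view = []     # (item, clerk, age) tuples materialized at the warehouse
pending = []  # notifications the integrator has not processed yet


def insert_emp(clerk, age):
    emp.append((clerk, age))
    pending.append(("emp", (clerk, age)))    # notify integrator


def insert_sale(item, clerk):
    sale.append((item, clerk))
    pending.append(("sale", (item, clerk)))  # notify integrator


def process_next():
    """Handle one notification by joining the new tuple with the CURRENT source state."""
    kind, row = pending.pop(0)
    if kind == "emp":
        clerk, age = row
        view.extend((item, c, age) for item, c in sale if c == clerk)
    else:
        item, clerk = row
        view.extend((item, clerk, age) for c, age in emp if c == clerk)


insert_emp("Mary", 25)           # step 1
insert_sale("Computer", "Mary")  # step 2, before the integrator reacts to step 1
process_next()                   # step 3: Sale already contains (Computer, Mary)
process_next()                   # step 4: Emp already contains (Mary, 25)
```

Both notifications see the effect of both inserts, so each one contributes the tuple (Computer, Mary, 25) and the view ends up with a duplicate (step 5). Processing notification (1) before the second insert happens would avoid the problem in this toy run, but sources generally do not coordinate their updates with the integrator.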
Slide credit: J. Hammer
Warehouse Specification (ideally)
[Figure labels: View Definitions; Warehouse Configuration Module; Integration rules; Change Detection Requirements; Metadata; Warehouse; Integrator; ...]
Slide credit: J. Hammer
Additional Research Issues
• Historical views of non-historical data
• Expiring outdated information
• Crash recovery
• Addition and removal of information sources
– Schema evolution