Chapter 10: Data Warehousing & Caching
DATA INTEGRATION
ANHAI DOAN ALON HALEVY ZACHARY IVES
Outline
The data warehouse
Motivation: Master data management
Physical design
Extract/transform/load
Data exchange
Caching & partial materialization
Operating on external data
Data Warehouse
[Figure: data warehouse architecture. The sources RDBMS1, RDBMS2, HTML1, and XML1 feed ETL pipelines, and the ETL pipeline outputs are loaded into the warehouse.]
ETL Tools
ETL tools are the equivalent of schema mappings in virtual integration, but are more powerful
Arbitrary pieces of code that take data from a source and convert it into data for the warehouse:
import filters: read and convert from data sources
data transformations: join, aggregate, filter, convert data
de-duplication: finds multiple records referring to the same entity and merges them
profiling: builds tables, histograms, etc. to summarize data
quality management: tests against master values, known business rules, constraints, etc.
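The stages listed above can be sketched as a small pipeline. This is a minimal illustration, not how any particular ETL tool works: the record layouts, field names, and sample rows are all hypothetical, loosely modeled on the invoice and customer labels in the workflow figure, and de-duplication is reduced to exact matching on a key.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical source data; every field name and value is an assumption.
RAW_INVOICE_LINES = [
    {"customer_id": "c1", "timestamp": "2011-03-01 09:30", "amount": "20.00"},
    {"customer_id": "c1", "timestamp": "2011-03-02 14:00", "amount": "5.50"},
    {"customer_id": "c2", "timestamp": "not a date", "amount": "7.00"},
    {"customer_id": "c9", "timestamp": "2011-03-03 08:15", "amount": "3.00"},
]
CUSTOMERS = [
    {"customer_id": "c1", "name": "Ann"},
    {"customer_id": "c1", "name": "Ann"},  # duplicate record for the same entity
    {"customer_id": "c2", "name": "Bob"},
]


def import_filter(rows):
    """Import filter: parse source strings into typed values, set aside bad rows."""
    valid, invalid = [], []
    for row in rows:
        try:
            ts = datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M")
            amount = float(row["amount"])
        except ValueError:
            invalid.append(row)
            continue
        valid.append({
            "customer_id": row["customer_id"],
            "date": ts.date(),   # split the timestamp into date and time
            "time": ts.time(),
            "amount": amount,
        })
    return valid, invalid


def deduplicate(customers):
    """De-duplication (simplified): keep the first record seen per customer_id."""
    merged = {}
    for record in customers:
        merged.setdefault(record["customer_id"], record)
    return merged


def join_and_aggregate(lines, customers):
    """Join lines with customers, drop non-matches, sum amounts per customer."""
    balances = defaultdict(float)
    unmatched = []
    for line in lines:
        customer = customers.get(line["customer_id"])
        if customer is None:
            unmatched.append(line)
        else:
            balances[customer["name"]] += line["amount"]
    return dict(balances), unmatched


valid, invalid = import_filter(RAW_INVOICE_LINES)
customers = deduplicate(CUSTOMERS)
balances, unmatched = join_and_aggregate(valid, customers)
print(balances, len(invalid), len(unmatched))
```

Rejected rows are returned rather than dropped silently, which mirrors the side outputs for invalid records in the workflow figure.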
[Figure: example ETL workflow. Inputs: invoice line items, customer records. Operators: split date/time, filter invalid (twice), join, filter non-match, group by customer. Outputs: customer balance, plus rejected invalid dates/times, invalid items, and invalid customers.]
Data Exchange
Intuitively, a declarative setup for data warehousing
Declarative schema mappings, as in Ch. 2-3, relating a source schema S and a target schema T
Materialized database, as in the previous section
An Example
Source S has: Teaches(prof, student), Adviser(adviser, student)
Target T has: Advise(adviser, student), TeachesCourse(prof, course), Takes(course, student)
First solution:

Teaches              Advise
prof    student      adviser   student
Ann     Bob          Ellen     Bob
Chloe   David        Felicia   David

TeachesCourse        Adviser
prof    course       adviser   student
Ann     C1           Ellen     Bob
Chloe   C2           Felicia   David

Takes
course  student
C1      Bob
C2      David
Second solution:

Teaches              Advise
prof    student      adviser   student
Ann     Bob          Ellen     Bob
Chloe   David        Felicia   David

TeachesCourse        Adviser
prof    course       adviser   student
Ann     C1           Ellen     Bob
Chloe   C1           Felicia   David

Takes
course  student
C1      Bob
C1      David
Universal Solutions
Intuitively, the first solution should be better than the second
The second solution uses the same variable (C1) for the course taught by Ann and by Chloe, so they are the same course
But this was not specified in the original schema!
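The intuition can be checked mechanically. The sketch below assumes two mappings, which are illustrative and not stated on these slides: every Adviser(a, s) tuple is copied to Advise(a, s), and every Teaches(p, s) tuple requires some course c with TeachesCourse(p, c) and Takes(c, s). It invents a fresh variable per Teaches tuple, then tests whether one solution can be mapped onto another by renaming variables while keeping constants fixed. A universal solution is usually characterized as one that maps this way into every other solution. The function names `chase` and `has_homomorphism` are hypothetical.

```python
from itertools import count, product

TEACHES = [("Ann", "Bob"), ("Chloe", "David")]
ADVISER = [("Ellen", "Bob"), ("Felicia", "David")]


def chase(teaches, adviser):
    """Build a target instance under the assumed mappings, one fresh variable per tuple."""
    target = {"Advise": set(adviser), "TeachesCourse": set(), "Takes": set()}
    nulls = set()
    fresh = count(1)
    for prof, student in teaches:
        course = f"C{next(fresh)}"  # invented value standing for an unknown course
        nulls.add(course)
        target["TeachesCourse"].add((prof, course))
        target["Takes"].add((course, student))
    return target, nulls


def facts(instance):
    """Flatten an instance into a set of (relation, tuple) pairs."""
    return {(rel, tup) for rel, tuples in instance.items() for tup in tuples}


def has_homomorphism(src, src_nulls, dst):
    """True if renaming src's variables (constants fixed) embeds every src fact in dst."""
    dst_facts = facts(dst)
    dst_values = sorted({value for _, tup in dst_facts for value in tup})
    nulls = sorted(src_nulls)
    # Brute force over all assignments; fine for an example of this size.
    for image in product(dst_values, repeat=len(nulls)):
        h = dict(zip(nulls, image))
        if all((rel, tuple(h.get(v, v) for v in tup)) in dst_facts
               for rel, tup in facts(src)):
            return True
    return False


solution1, nulls1 = chase(TEACHES, ADVISER)
solution2 = {
    "Advise": set(ADVISER),
    "TeachesCourse": {("Ann", "C1"), ("Chloe", "C1")},
    "Takes": {("C1", "Bob"), ("C1", "David")},
}
nulls2 = {"C1"}

one_to_two = has_homomorphism(solution1, nulls1, solution2)
two_to_one = has_homomorphism(solution2, nulls2, solution1)
print(one_to_two, two_to_one)
```

The first solution maps into the second (send both C1 and C2 to C1), but not the other way around: no single course in the first solution is taught by both Ann and Chloe. The second solution therefore asserts something the source never said.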
[Figure: the "sources materialized" case corresponds to data exchange / data warehouse.]
Administrator-selected views
Someone manually specifies views to compute and maintain, as with a relational DBMS
The system then maintains them automatically
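As a rough illustration of what automatic maintenance can mean, the sketch below keeps a grouped-sum view up to date by applying each insert or delete to the stored result, instead of recomputing it from the base data. The view definition, the class name `CustomerBalanceView`, and the sample rows are hypothetical; amounts are integers (for example, cents) to keep the arithmetic exact.

```python
from collections import defaultdict


class CustomerBalanceView:
    """Stored result of: SELECT customer, SUM(amount) FROM invoices GROUP BY customer."""

    def __init__(self):
        self.totals = defaultdict(int)  # customer -> SUM(amount)
        self.counts = defaultdict(int)  # customer -> COUNT(*), used to drop empty groups

    def on_insert(self, customer, amount):
        """Apply an inserted base row to the view."""
        self.totals[customer] += amount
        self.counts[customer] += 1

    def on_delete(self, customer, amount):
        """Apply a deleted base row; remove the group when its last row goes."""
        self.totals[customer] -= amount
        self.counts[customer] -= 1
        if self.counts[customer] == 0:
            del self.totals[customer]
            del self.counts[customer]

    def contents(self):
        return dict(self.totals)


def recompute(rows):
    """Reference answer: evaluate the view from scratch over the base rows."""
    totals = defaultdict(int)
    for customer, amount in rows:
        totals[customer] += amount
    return dict(totals)


base = []
view = CustomerBalanceView()
for row in [("ann", 2000), ("bob", 500), ("ann", 700)]:
    base.append(row)
    view.on_insert(*row)

base.remove(("bob", 500))
view.on_delete("bob", 500)
print(view.contents(), recompute(base))
```

The row count per group is kept alongside the sum so that a group disappears from the view when its last base row is deleted, which is what recomputation from scratch would produce.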
MapReduce Basics
MapReduce is essentially a template for writing
distributed programs corresponding to a single SQL
SELECT..FROM..WHERE..GROUP BY..HAVING block
with user-defined functions
The MapReduce runtime calls a set of functions:
map is given a tuple, outputs 0 or more tuples in response
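A single-process sketch of this template follows. It uses the standard MapReduce model, in which a reduce function receives one key together with all values emitted for it; that part is assumed from general knowledge of MapReduce rather than taken from the text above. The input records, the threshold, and the helper `run_mapreduce` are hypothetical. The pair of functions corresponds roughly to SELECT customer, SUM(amount) FROM records WHERE amount > 0 GROUP BY customer HAVING SUM(amount) >= 10.

```python
from collections import defaultdict


def map_fn(record):
    """Given one tuple, emit zero or more (key, value) pairs. Plays the WHERE role."""
    customer, amount = record
    if amount > 0:
        yield customer, amount


def reduce_fn(key, values):
    """Given a key and all its values, emit aggregate results. Plays the HAVING role."""
    total = sum(values)
    if total >= 10:
        yield key, total


def run_mapreduce(records, map_fn, reduce_fn):
    """Sequential stand-in for the runtime: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)  # the "shuffle": route values to their key
    results = []
    for key in sorted(groups):
        results.extend(reduce_fn(key, groups[key]))
    return results


records = [("ann", 7), ("bob", 3), ("ann", 5), ("bob", -2), ("cy", 12)]
result = run_mapreduce(records, map_fn, reduce_fn)
print(result)
```

In a real deployment the grouping step is distributed: map workers partition their output by key, and each reduce worker handles a subset of the keys.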
[Figure: map workers pass their output to reduce workers, which emit aggregate results.]
MapReduce as ETL
Some people use MapReduce to take data, transform it, and load it into a warehouse
which is basically what ETL tools do!
The dividing line between DBMSs, EII, and MapReduce was already blurring when this book was written:
SQL MapReduce
MapReduce over SQL engines
Shared-nothing DBMSs
NoSQL