Introduction to Data Integration
Principles of Data Integration
AnHai Doan, Alon Halevy, Zachary Ives
Outline
Introduction: data integration as a new abstraction
Examples of data integration applications
Schema heterogeneity
Goal of data integration, why it's a hard problem
Data integration architectures
Data Integration
Databases are great: they let us manage huge amounts of data,
assuming you've put it all into your schema.
Students:
SSN          Name     Category
123-45-6789  Charles  undergrad
234-56-7890  Dan      grad

Takes:
SSN          CID
123-45-6789  CSE444
123-45-6789  CSE444
234-56-7890  CSE142

Courses:
CID     Name               Quarter
CSE444  Databases          fall
CSE541  Operating systems  winter

    SELECT C.name
    FROM Students S, Takes T, Courses C
    WHERE S.name = 'Mary' and S.ssn = T.ssn and T.cid = C.cid
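To see the three-way join run end to end, here is a minimal sketch using Python's built-in sqlite3 module. The function names (build_example_db, courses_taken_by), the lowercase column names, and the TEXT column types are assumptions made for illustration; the rows are the ones listed on the slide.

```python
import sqlite3


def build_example_db():
    """Load the slide's three tables into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Students (ssn TEXT, name TEXT, category TEXT);
        CREATE TABLE Takes    (ssn TEXT, cid TEXT);
        CREATE TABLE Courses  (cid TEXT, name TEXT, quarter TEXT);
    """)
    conn.executemany("INSERT INTO Students VALUES (?, ?, ?)", [
        ("123-45-6789", "Charles", "undergrad"),
        ("234-56-7890", "Dan", "grad"),
    ])
    conn.executemany("INSERT INTO Takes VALUES (?, ?)", [
        ("123-45-6789", "CSE444"),
        ("123-45-6789", "CSE444"),
        ("234-56-7890", "CSE142"),
    ])
    conn.executemany("INSERT INTO Courses VALUES (?, ?, ?)", [
        ("CSE444", "Databases", "fall"),
        ("CSE541", "Operating systems", "winter"),
    ])
    return conn


def courses_taken_by(conn, student_name):
    """Run the slide's join, with the student name passed as a parameter."""
    rows = conn.execute(
        """
        SELECT C.name
        FROM Students S, Takes T, Courses C
        WHERE S.name = ? AND S.ssn = T.ssn AND T.cid = C.cid
        """,
        (student_name,),
    ).fetchall()
    return [name for (name,) in rows]


conn = build_example_db()
mary_courses = courses_taken_by(conn, "Mary")
```

With the rows as listed, the query for Mary returns nothing because no student named Mary appears in the sample data. The query works only because all three tables live in one database under one schema, which is the assumption the rest of the chapter relaxes.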
Data Integration: A Higher-level Abstraction
Query -> Mediated Schema -> Semantic Mappings -> sources S1, S2, S3
Independence of: data model, syntax, semantic variations
[Figure: the sources hold the Students, Courses, and Takes tables from the previous slide.]
Applications of Data Integration
Business
Science
Government
The Web
Pretty much everywhere
[Figure: business example. Enterprise information integration (EII) apps such as CRM, ERP, and portals run over legacy databases, services, and applications.]
[Figure: science example. Entities Gene, Sequenceable Entity, Protein, Nucleotide Sequence, Experiment, Microarray Experiment, and Structured Vocabulary, over the sources OMIM, SwissProt, HUGO, GeneClinics, LocusLink, GO, Entrez, and GEO.]
Hundreds of millions of high-quality tables on the Web
FullServe Corporation
Employees database: FullTimeEmp, Hire, TempEmployees
Training database: Courses, Enrollments
Sales database: Products, Sales
Resumes database: Interview, CV
Services database: Services, Customers, Contracts
HelpLine database: Calls

EuroCard Corporation
Employees database: Employees, Hire
Credit Cards database: Customer, CustDetail
Resumes database: Interview
HelpLine database: Calls
Examples of Heterogeneity

FullServe                                                 EuroCard
FullTimeEmp(ssn, empId, firstName, middleName, lastName)  Employees(ID, firstNameMiddleInitial, lastName)
Hire(empId, hireDate, recruiter)                          Hire(ID, hireDate, recruiter)
TempEmployees(ssn, hireStart, hireEnd)
Sales: Products, Sales
Services: Services, Customers, Contracts                  Credit Cards: Customer, CustDetail
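The FullTimeEmp and Employees tables describe the same kind of entity with different attributes, so combining them needs an explicit conversion. The sketch below shows one possible conversion in Python. It rests on assumptions that are not in the slides: that FullServe's empId plays the role of EuroCard's ID, and that firstNameMiddleInitial is written as the first name followed by the middle initial and a period. The sample record is invented.

```python
def fullserve_to_eurocard(emp):
    """Convert a FullServe FullTimeEmp record to the EuroCard Employees layout.

    Assumptions (not from the slides): empId corresponds to ID, and
    firstNameMiddleInitial has the form "<firstName> <initial>.".
    """
    middle = emp.get("middleName") or ""
    first_mi = emp["firstName"] + (f" {middle[0]}." if middle else "")
    return {
        "ID": emp["empId"],
        "firstNameMiddleInitial": first_mi,
        "lastName": emp["lastName"],
    }


# Hypothetical record, invented purely for illustration.
record = {
    "ssn": "000-00-0000",
    "empId": 17,
    "firstName": "Ada",
    "middleName": "Marie",
    "lastName": "Smith",
}
converted = fullserve_to_eurocard(record)
```

The conversion loses information: ssn and the full middle name have no counterpart in the Employees layout, which is typical of what makes schema heterogeneity hard to resolve automatically.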
Business intelligence: What's really wrong with our products?
Why is it Hard?
Systems-level reasons:
Managing different platforms
SQL across multiple systems is not so simple
Distributed query processing
Logical reasons:
Schema (and data) heterogeneity
Social reasons:
Locating and capturing relevant data in the
enterprise.
Convincing people to share (data fiefdoms)
Setting Expectations
Data integration is AI-Complete.
Completely automated solutions unlikely.
Goal 1:
Reduce the effort needed to set up an
integration application.
Goal 2:
Enable the system to perform gracefully with
uncertainty (e.g., on the web)
Data Integration
Smorgasbord
Something for everyone:
[Figure: data integration architecture. Top layer: query reformulation / query over materialized data. Middle layer: source descriptions / transforms. Bottom layer: one wrapper / extractor per source, over the sources RDBMS 1, RDBMS 2, HTML 1, and XML 1.]
Example
Movie(title, director, year, genre)
Actors(title, actor)
Plays(movie, location, startTime)
Reviews(title, rating, description)
Sources: S1, S2, S3, S4, S5
Wrappers
    <cd>
      <title>The best of</title>
      <artist>Abiteboul</artist>
      <artist>Pavarotti</artist>
      <artist>Domingo</artist>
      <price>19.95</price>
    </cd>
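A wrapper's job is to turn a source's native format into tuples the integration system can process. As a rough sketch of that idea, the Python below parses the cd element from the slide with the standard library's xml.etree.ElementTree and emits one (title, artist, price) tuple per artist. The function name wrap_cd and the choice of output tuple layout are assumptions for illustration, not something the slides specify.

```python
import xml.etree.ElementTree as ET

# The XML fragment from the slide.
CD_XML = """
<cd>
  <title>The best of</title>
  <artist>Abiteboul</artist>
  <artist>Pavarotti</artist>
  <artist>Domingo</artist>
  <price>19.95</price>
</cd>
"""


def wrap_cd(xml_text):
    """Return one (title, artist, price) tuple per <artist> element."""
    cd = ET.fromstring(xml_text.strip())
    title = cd.findtext("title").strip()
    price = float(cd.findtext("price"))
    return [(title, artist.text.strip(), price) for artist in cd.findall("artist")]


rows = wrap_cd(CD_XML)
for row in rows:
    print(row)
```

Flattening the nested artist list into repeated tuples is one design choice among several; a wrapper could equally emit a separate artist relation keyed by title.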
Mediation Languages
Mediated Schema
Describe relationships between the mediated schema and the data sources (Chapter 3).
Relations in the example:
Books(Title, ISBN, Price, DiscountPrice, Edition)
Authors(ISBN, FirstName, LastName)
BookCategories(ISBN, Category)
Album(ASIN, Price, DiscountPrice, Studio)
Artists(ASIN, ArtistName, GroupName)
CDCategories(ASIN, Category)
S2  Cinemas: place, movie, start
S3  Cinemas in NYC: cinema, title, startTime
S4  Cinemas in SF: location, movie, startingTime
S5  Reviews: title, date, grade, review
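Sources S2, S3, and S4 all carry showtime data but name their attributes differently from the mediated relation Plays(movie, location, startTime). A very simplified way to picture a source description is as an attribute renaming, sketched below in Python. The specific correspondences (for example, that place and cinema both mean location) are assumptions read off the attribute names, the helper to_plays is hypothetical, and the sample row is invented. Real source descriptions also capture scope, such as S3 covering only NYC, which this sketch ignores.

```python
# Mediated relation from the example: Plays(movie, location, startTime).
# Each entry maps one source's attribute names onto the mediated attributes.
# The correspondences are assumptions inferred from the attribute names.
SOURCE_TO_PLAYS = {
    "S2": {"place": "location", "movie": "movie", "start": "startTime"},
    "S3": {"cinema": "location", "title": "movie", "startTime": "startTime"},
    "S4": {"location": "location", "movie": "movie", "startingTime": "startTime"},
}


def to_plays(source, row):
    """Rename a source row's attributes to those of the mediated Plays relation."""
    mapping = SOURCE_TO_PLAYS[source]
    return {mapping[attr]: value for attr, value in row.items()}


# Hypothetical source row, invented for illustration.
s3_row = {"cinema": "Cinema A", "title": "Movie X", "startTime": "19:30"}
plays_row = to_plays("S3", s3_row)
```

Once every source row is expressed in the mediated vocabulary, a query over Plays no longer needs to know which source a tuple came from.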
Query Processing (Chapter 8)
Query -> query reformulation -> logical query plan -> query optimizer -> physical query plan -> execution engine -> wrappers -> sources
Replanning request: execution engine -> query optimizer
Query results are returned from the execution engine.
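The stages in the pipeline can be pictured as three small functions chained together. The Python below is a toy sketch under heavy assumptions: source descriptions are reduced to "which mediated relation a source serves", the optimizer only orders sources by a made-up cost estimate, wrappers are plain callables returning tuples, and replanning is left out entirely. All names and data here are hypothetical.

```python
def reformulate(relation, source_descriptions):
    """Query reformulation: select the sources that serve the mediated relation."""
    return [src for src, rel in source_descriptions.items() if rel == relation]


def optimize(logical_plan, estimated_cost):
    """Query optimization: order source accesses by an assumed cost estimate."""
    return sorted(logical_plan, key=lambda src: estimated_cost[src])


def execute(physical_plan, wrappers):
    """Execution: call each wrapper in plan order and union the tuples."""
    results = []
    for source in physical_plan:
        for row in wrappers[source]():
            if row not in results:  # set-style union that keeps first-seen order
                results.append(row)
    return results


# Hypothetical descriptions, costs, and wrapper outputs.
source_descriptions = {"S2": "Plays", "S3": "Plays", "S5": "Reviews"}
estimated_cost = {"S2": 5, "S3": 1, "S5": 2}
wrappers = {
    "S2": lambda: [("Movie X", "Cinema A", "19:30")],
    "S3": lambda: [("Movie X", "Cinema A", "19:30"), ("Movie Y", "Cinema B", "21:00")],
    "S5": lambda: [("Movie X", "good")],
}

logical_plan = reformulate("Plays", source_descriptions)
physical_plan = optimize(logical_plan, estimated_cost)
results = execute(physical_plan, wrappers)
```

A real engine would also handle sources that are slow or unavailable, which is where the replanning request in the diagram comes in.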
Data Warehouse
Summary of Chapter 1
Data integration: abstract away the fact that data comes from multiple sources with varying schemata.
The problem occurs everywhere: it's key to business, science, the Web, and government.
Goal: reduce the effort involved in integrating data.
Regardless of the architecture, heterogeneity is a key issue.
Architectures range from warehousing to virtual integration.