Data Integration

Data integration provides a unified view of data across multiple, heterogeneous sources. It allows querying across these sources while abstracting away differences in schemas, structures, and semantics. Data integration is challenging due to schema heterogeneity between sources, autonomy of data sources, and the need to handle large numbers of sources. Common architectures for data integration include virtual and warehousing approaches. Virtual integration leaves data at its sources while warehousing replicates data into a common repository.

Uploaded by

Mohammad Afwan

PRINCIPLES OF DATA INTEGRATION
ANHAI DOAN, ALON HALEVY, ZACHARY IVES

CHAPTER 1: INTRODUCTION TO DATA INTEGRATION

Outline
Introduction: data integration as a new abstraction
Examples of data integration applications
Schema heterogeneity
Goal of data integration, why it's a hard problem
Data integration architectures

Data Integration
Databases are great: they let us manage huge amounts of data
Assuming you've put it all into your schema.

In reality, data sets are often created independently
Only to discover later that they need to combine their data!
At that point, they're using different systems, different schemata and have limited interfaces to their data.

The goal of data integration: tie together different sources, controlled by many people, under a common schema.

DBMS: it's all about abstraction
Logical vs. Physical; What vs. How.

Students:
SSN          Name     Category
123-45-6789  Charles  undergrad
234-56-7890  Dan      grad

Courses:
CID     Name               Quarter
CSE444  Databases          fall
CSE541  Operating systems  winter

Takes:
SSN          CID
123-45-6789  CSE444
123-45-6789  CSE444
234-56-7890  CSE142

SELECT C.name
FROM Students S, Takes T, Courses C
WHERE S.name='Mary' and S.ssn = T.ssn and T.cid = C.cid

Data Integration: A Higher-level Abstraction
Independence of:
source & location
data model, syntax
semantic variations

[Figure: a query is posed over the Mediated Schema, which is connected by Semantic Mappings to sources S1, S2 and S3. The sources shown are the Students, Courses and Takes tables from the previous slide and the XML document below.]

<cd> <title> The best of </title>
<artist> Carreras </artist>
<artist> Pavarotti </artist>
<artist> Domingo </artist>
<price> 19.95 </price>
</cd>


Applications of Data Integration

Business
Science
Government
The Web
Pretty much everywhere

Application Area 1: Business

[Figure: Enterprise Databases, Legacy Databases, and Services and Applications sit behind a Single Mediated View used by EII apps (CRM, ERP, portals).]

50% of all IT $$$ spent here!

Application Area 2: Science

[Figure: biomedical concepts (Phenotype, Gene, Sequenceable Entity, Protein, Experiment, Nucleotide Sequence, Microarray Experiment, Structured Vocabulary) shown alongside data sources (OMIM, SwissProt, HUGO, GeneClinics, LocusLink, GO, Entrez, GEO).]

Hundreds of biomedical data sources available; growing rapidly!

Application Area 3: The Web

Hundreds of millions of
high-quality tables on the

The Deep Web
Millions of high-quality HTML forms out there
Each form has its own special interface
Hard to explore data across sites.

Goal (for some domains):
A single interface into a multitude of deep-web sources.

Create a single site to search for jobs/renta

Easily traverse between the site by clicking


Enterprise Data Integration

FullServe Corporation (database: tables)
Employees: FullTimeEmp, Hire, TempEmployees
Training: Courses, Enrollments
Sales: Products, Sales
Resumes: Interview, CV
Services: Services, Customers, Contracts
HelpLine: Calls

EuroCard Corporation (database: tables)
Employees: Employees, Hire
Credit Cards: Customer, CustDetail
Resumes: Interview
HelpLine: Calls

Examples of Heterogeneity

FullServe
FullTimeEmp: ssn, empId, firstName, middleName, lastName
Hire: empId, hireDate, recruiter
TempEmployees: ssn, hireStart, hireEnd

EuroCard
Employees: ID, firstNameMiddleInitial, lastName
Hire: ID, hireDate, recruiter

Find all employees (making over $100K)
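
To make the mismatch concrete, here is a minimal Python sketch of answering "find all employees" across both companies. The `Employee` record, the two converter functions and the sample rows are all hypothetical, and the sketch assumes that EuroCard's firstNameMiddleInitial holds a first name and an initial separated by a space. TempEmployees is left out because it carries no name attributes, and the salary condition is ignored because neither schema on the slide shows a salary attribute.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Employee:
    """Hypothetical mediated record that both companies' rows are mapped into."""
    source: str
    emp_id: str
    first_name: str
    middle_initial: str
    last_name: str

def from_fullserve(row):
    """row follows FullTimeEmp(ssn, empId, firstName, middleName, lastName)."""
    _ssn, emp_id, first, middle, last = row
    # FullServe stores a full middle name; keep only its initial.
    return Employee("FullServe", emp_id, first, middle[:1], last)

def from_eurocard(row):
    """row follows Employees(ID, firstNameMiddleInitial, lastName)."""
    emp_id, first_and_initial, last = row
    # Assumption: the combined field looks like "Bob J".
    first, _, initial = first_and_initial.partition(" ")
    return Employee("EuroCard", emp_id, first, initial, last)

# Made-up sample rows, one per company.
fullserve_rows = [("123-45-6789", "F1", "Ann", "Marie", "Lee")]
eurocard_rows = [("E7", "Bob J", "Stone")]

all_employees = ([from_fullserve(r) for r in fullserve_rows] +
                 [from_eurocard(r) for r in eurocard_rows])
```

Even this toy case needs a decision per attribute (which identifier to keep, how to split a combined name field), which is the kind of work schema heterogeneity creates.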

Customer Call Center
Agents should have a full view of customer when they call in.

[Figure: the databases involved are Sales (Products, Sales), Services (Services, Customers, Contracts) and Credit Cards (Customer, CustDetail).]

Other Reasons to Integrate Data
Create a (useful) web site for tracking services
Collaborate with third parties
E.g., create branded services
Comply with government regulations
Find risky employees
Business intelligence
What's really wrong with our products?


Goal of Data Integration
Uniform query access to a set of data sources
Handle:
Scale of sources: from tens to millions
Heterogeneity
Autonomy
Semi-structure

Why is it Hard?
Systems-level reasons:
Managing different platforms
SQL across multiple systems is not so simple
Distributed query processing

Logical reasons:
Schema (and data) heterogeneity

Social reasons:
Locating and capturing relevant data in the enterprise.
Convincing people to share (data fiefdoms)

Security, privacy and performance implications.

Setting Expectations
Data integration is AI-Complete.
Completely automated solutions unlikely.

Goal 1:
Reduce the effort needed to set up an integration application.

Goal 2:
Enable the system to perform gracefully with uncertainty (e.g., on the web)

Data Integration Smorgasbord
Something for everyone:
Theory of modeling data sources
Systems aspects of data integration
Architectural issues: e.g., P2P data sharing
AI @ work: automated schema matching
Web: latest on data integration & web
Commercial products: BEA, IBM
Semantic Web: what does it have to offer?
New trends in DBMS: uncertainty, dataspaces


Virtual, Warehousing and in Between
Data warehousing: integrate by bringing the data into a single physical warehouse
Virtual data integration: leave the data at the sources and access it at query time.
Some differences, but semantic heterogeneity arises in both cases.
Numerous intermediate architectures.
The course illustrates data integration technology mostly through the virtual architecture.

Virtual Data Integration Architecture

[Figure: layered architecture. Top: Mediated Schema or Warehouse, reached through query reformulation or through a query over materialized data. Middle: Source descriptions / Transforms, then one Wrapper / Extractor per source. Bottom: the sources RDBMS 1, RDBMS 2, HTML1 and XML 1.]

Example
Mediated schema:
Movie(title, director, year, genre)
Actors(title, actor)
Plays(movie, location, startTime)
Reviews(title, rating, description)

Sources:
S1: Movies(name, actors, director, genre)
S2: Cinemas(place, movie, start)
S3: CinemasInNYC(cinema, title, startTime)
S4: CinemasInSF(location, movie, startingTime)
S5: Reviews(title, date, grade, review)

Wrappers
Send queries to data sources and transform answers into tuples (or other internal data model). (Chapter 9)

<cd> <title> The best of </title>
<artist> Abiteboul </artist>
<artist> Pavarotti </artist>
<artist> Domingo </artist>
<price> 19.95 </price>
</cd>
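
As a rough illustration of the "transform answers into tuples" half of a wrapper, the sketch below parses the slide's XML with Python's standard xml.etree.ElementTree module. The function name `wrap_cd` and the choice of one (title, artist, price) tuple per artist are assumptions made for this example; sending the query to the source is not shown.

```python
import xml.etree.ElementTree as ET

# The XML from the slide, standing in for an answer returned by a source.
SOURCE_ANSWER = """
<cd> <title> The best of </title>
<artist> Abiteboul </artist>
<artist> Pavarotti </artist>
<artist> Domingo </artist>
<price> 19.95 </price>
</cd>
"""

def wrap_cd(xml_text):
    """Turn one <cd> element into (title, artist, price) tuples, one per artist."""
    cd = ET.fromstring(xml_text.strip())
    title = cd.findtext("title").strip()
    price = float(cd.findtext("price"))  # float() tolerates surrounding spaces
    return [(title, artist.text.strip(), price) for artist in cd.findall("artist")]

tuples = wrap_cd(SOURCE_ANSWER)
```

Flattening the nested artist list into repeated tuples is only one possible target shape; a wrapper for a system with a nested internal data model could keep the artists grouped.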

Mediation Languages
Describe relationships between mediated schema and data sources (Chapter 3).

Mediated Schema
CD: ASIN, Title, Genre,
Artist: ASIN, name,

[Figure: the mediated schema is connected to the source tables below through an element labeled "logic".]

Source tables
CDs: Album, ASIN, Price, DiscountPrice, Studio
Books: Title, ISBN, Price, DiscountPrice, Edition
Authors: ISBN, FirstName, LastName
Artists: ASIN, ArtistName, GroupName
CDCategories: ASIN, Category
BookCategories: ISBN, Category
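
One simple way to write such relationships down is to define each mediated relation as a view over the source tables. The sketch below does this with Python's built-in sqlite3 module. The particular mappings (Album to Title, Category to Genre, ArtistName to name) are assumptions for illustration and are not stated on the slide, only the mediated attributes visible on the slide are used, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Source tables, with the attributes listed on the slide
CREATE TABLE CDs (Album TEXT, ASIN TEXT, Price REAL, DiscountPrice REAL, Studio TEXT);
CREATE TABLE Artists (ASIN TEXT, ArtistName TEXT, GroupName TEXT);
CREATE TABLE CDCategories (ASIN TEXT, Category TEXT);

-- Made-up sample rows
INSERT INTO CDs VALUES ('The best of', 'A1', 19.95, 17.95, 'StudioX');
INSERT INTO Artists VALUES ('A1', 'Pavarotti', NULL);
INSERT INTO CDCategories VALUES ('A1', 'Opera');

-- Assumed mappings: each mediated relation is a view over the sources
CREATE VIEW CD AS
  SELECT c.ASIN AS ASIN, c.Album AS Title, k.Category AS Genre
  FROM CDs c JOIN CDCategories k ON c.ASIN = k.ASIN;
CREATE VIEW Artist AS
  SELECT ASIN, ArtistName AS name FROM Artists;
""")

# A query written purely against the mediated schema
rows = conn.execute(
    "SELECT CD.Title, Artist.name "
    "FROM CD JOIN Artist ON CD.ASIN = Artist.ASIN "
    "WHERE CD.Genre = 'Opera'"
).fetchall()
```

The query never mentions CDs, Artists or CDCategories; the view definitions carry the knowledge of where each mediated attribute comes from.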

Woody Allen Comedies in NY
Mediated schema:
Movie: Title, director, year, genre
Actors: title, actor
Plays: movie, location, startTime
Reviews: title, rating, description

select title, startTime
from Movie, Plays
where Movie.title=Plays.movie AND
location='New York' AND
director='Woody Allen'

Sources S1 and S3 are relevant, sources S4 and S5 are irrelevant, and source S2 is relevant but possibly redundant.

S1 Movies: name, actors, director, genre
S2 Cinemas: place, movie, start
S3 Cinemas in NYC: cinema, title, startTime
S4 Cinemas in SF: location, movie, startingTime
S5 Reviews: title, date, grade, review
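
The relevance reasoning above can be sketched in a few lines of Python. The `SourceDescription` record is a deliberately simplified, hypothetical stand-in for real source descriptions: it records only which mediated relation a source feeds and, optionally, the single city it covers. Deciding whether S2 is actually redundant given S3 is beyond this sketch.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SourceDescription:
    name: str
    relation: str                    # mediated relation the source contributes to
    location: Optional[str] = None   # set only if the source covers a single city

SOURCES = [
    SourceDescription("S1", "Movie"),
    SourceDescription("S2", "Plays"),                   # cinemas in any city
    SourceDescription("S3", "Plays", "New York"),
    SourceDescription("S4", "Plays", "San Francisco"),
    SourceDescription("S5", "Reviews"),
]

def relevant_sources(query_relations, query_location):
    """Return names of sources that could contribute answers to the query."""
    relevant = []
    for s in SOURCES:
        if s.relation not in query_relations:
            continue  # e.g. S5: the query does not mention Reviews
        if s.location is not None and s.location != query_location:
            continue  # e.g. S4: covers only a different city
        relevant.append(s.name)
    return relevant

# The slide's query uses Movie and Plays with location='New York'.
picked = relevant_sources({"Movie", "Plays"}, "New York")
```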

Query Processing (Chapter 8)

[Figure: Query -> Query reformulation -> Logical query plan -> Query optimizer -> Physical query plan -> Execution engine -> wrappers -> sources (one wrapper per source, five shown). The execution engine can send a Replanning request back to the query optimizer.]

Data Warehouses: Offline Replication
Determine physical schema
Define a database with this schema
Define procedural mappings in an ETL tool to import the data and clean it.
Periodically copy all of the data from the data sources
Note that the sources and the warehouse are basically independent at this point

[Figure: a Query is posed to the Data Warehouse, which returns Results.]
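
The steps above can be sketched with two in-memory SQLite databases, one playing a source and one playing the warehouse. This is a toy under stated assumptions rather than how an ETL tool works internally: the function `refresh_warehouse`, the choice of the CinemasInNYC source and the Plays target from the earlier movie example, and the sample rows are all hypothetical.

```python
import sqlite3

def refresh_warehouse(source, warehouse):
    """One offline refresh: extract from the source, clean, reload the warehouse."""
    rows = source.execute(
        "SELECT cinema, title, startTime FROM CinemasInNYC").fetchall()
    # Transform and clean: trim stray whitespace, attach the fixed location.
    cleaned = [(title.strip(), "New York", start.strip())
               for _cinema, title, start in rows]
    with warehouse:  # commit the reload as a single transaction
        warehouse.execute("DELETE FROM Plays")
        warehouse.executemany("INSERT INTO Plays VALUES (?, ?, ?)", cleaned)

source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE CinemasInNYC (cinema TEXT, title TEXT, startTime TEXT)")
source.execute("INSERT INTO CinemasInNYC VALUES ('Cinema A', ' Movie X ', '19:30')")

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE Plays (movie TEXT, location TEXT, startTime TEXT)")

refresh_warehouse(source, warehouse)

# A later change at the source is invisible to the warehouse until the next refresh.
source.execute("INSERT INTO CinemasInNYC VALUES ('Cinema B', 'Movie Y', '21:00')")
stale = warehouse.execute("SELECT movie, location, startTime FROM Plays").fetchall()

refresh_warehouse(source, warehouse)
fresh = warehouse.execute("SELECT movie FROM Plays ORDER BY movie").fetchall()
```

Between the two refreshes the warehouse answers from its own copy, which is why queries against it do not load the sources and also why its data can lag behind them.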

Pros and Cons of Data Warehouses
Need to spend time to design the physical database layout, as well as logical
This actually takes a lot of effort!
Data is generally not up-to-date (lazy or offline refresh)
Queries over the warehouse don't disrupt the data sources
Can run very heavy-duty computations, including data mining and cleaning

Summary of Chapter 1
Data integration: abstract away the fact that data comes from multiple sources in varying schemata.
Problem occurs everywhere: it's key to business, science, Web and government.
Goal: reduce the effort involved in integrating.
Regardless of the architecture, heterogeneity is a key issue.
Architectures range from warehousing to virtual integration.
