3-Distribution Design
• Introduction
• Background
• Distributed Database Design
➡ Fragmentation
➡ Data distribution
• Database Integration
• Semantic Data Control
• Distributed Query Processing
• Multidatabase Query Processing
• Distributed Transaction Management
• Data Replication
• Parallel Database Systems
• Distributed Object DBMS
• Peer-to-Peer Data Management
• Web Data Management
• Current Issues
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.3/1
Design Problem
• In the general setting :
Making decisions about the placement of data and programs across the sites of
a computer network as well as possibly designing the network itself.
Level of sharing
• Bottom-up
➡ when the databases already exist at a number of sites
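[Figure: design process. Objectives and user input feed Conceptual Design and View Design (with view integration), yielding the GCS, access information, and the ES's; Distribution Design, with further user input, yields the LCS's; Physical Design yields the LIS's.]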
Distribution Design Issues
Why fragment at all?
How to fragment?
How to allocate?
Information requirements?
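[Figure: PROJ divided into fragments PROJ1 and PROJ2; in the vertical case PROJ1 has attributes (PNO, BUDGET) and PROJ2 has attributes (PNO, PNAME, LOC).]
[Figure: degree of fragmentation, ranging from tuples or attributes to relations.]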
CONCURRENCY CONTROL: Moderate | Difficult | Easy
REALITY: Possible application | Realistic | Possible application
Information Requirements
• Four categories:
➡ Database information
➡ Application information
➡ Communication network information
➡ Computer system information
[Figure: relations PAY(TITLE, SAL), EMP(ENO, ENAME, TITLE), PROJ(PNO, PNAME, BUDGET, LOC), and ASG(ENO, PNO, RESP, DUR); link L1 connects PAY to EMP.]
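➡ simple predicates : a simple predicate pj on attribute Ai of relation R has the form
$$p_j : A_i \;\theta\; \mathit{Value}$$
where $\theta \in \{=, <, \le, >, \ge, \ne\}$, $\mathit{Value} \in D_i$, and $D_i$ is the domain of $A_i$.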
For relation R we define Pr = {p1, p2, …,pm} as the set of simple predicates of R
Example :
PNAME = "Maintenance"
BUDGET ≤ 200000
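➡ minterm predicates : Given R and Pr = {p1, p2, …,pm}
define M = {m1,m2,…,mr} as
$$M = \{\, m_i \mid m_i = \bigwedge_{p_j \in Pr} p_j^{*} \,\}, \quad 1 \le j \le m,\ 1 \le i \le r$$
where $p_j^{*} = p_j$ or $p_j^{*} = \neg p_j$.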
Given a set of minterm predicates M, there are as many horizontal fragments
of relation R as there are minterm predicates.
The set of horizontal fragments is also referred to as the set of minterm fragments.
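The construction of minterm predicates can be sketched in a few lines of Python. This is only an illustration, assuming the two example predicates above and a dictionary-based tuple representation; the helper name `minterms` and the sample tuple are made up for the sketch.

```python
from itertools import product

# Simple predicates from the example above, as (label, test) pairs.
simple_predicates = [
    ('PNAME = "Maintenance"', lambda t: t["PNAME"] == "Maintenance"),
    ("BUDGET <= 200000", lambda t: t["BUDGET"] <= 200000),
]

def minterms(predicates):
    """Yield every conjunction taking each predicate in natural or negated form."""
    for signs in product([True, False], repeat=len(predicates)):
        label = " AND ".join(
            name if keep else f"NOT({name})"
            for (name, _), keep in zip(predicates, signs)
        )
        # Bind the current sign vector through a default argument.
        def test(t, signs=signs):
            return all(p(t) == keep for (_, p), keep in zip(predicates, signs))
        yield label, test

all_minterms = list(minterms(simple_predicates))

# Hypothetical tuple, used only to show that exactly one minterm holds for it.
sample = {"PNAME": "Maintenance", "BUDGET": 310000}
matching = [label for label, test in all_minterms if test(sample)]
```

With m simple predicates the sketch produces 2^m candidate minterms; any given tuple satisfies exactly one of them, which is why each minterm can define one horizontal fragment.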
Preliminaries :
➡ Pr should be complete
➡ Pr should be minimal
• Example :
➡ Assume PROJ[PNO,PNAME,BUDGET,LOC] has two applications defined
on it.
➡ Find the budgets of projects at each location. (1)
➡ Find projects with budgets of at most $200000. (2)
PAY1                          PAY2
TITLE         SAL             TITLE         SAL
Mech. Eng.    27000           Elect. Eng.   40000
Programmer    24000           Syst. Anal.   34000
➡ Simple predicates
➡ For application (1)
p1 : LOC = “Montreal”
p2 : LOC = “New York”
p3 : LOC = “Paris”
➡ For application (2)
p4 : BUDGET ≤ 200000
p5 : BUDGET > 200000
➡ Pr = Pr' = {p1, p2, p3, p4, p5}
PHF – Example
• Fragmentation of relation PROJ continued
➡ Minterm fragments left after elimination
m1 : (LOC = “Montreal”) ∧ (BUDGET ≤ 200000)
m2 : (LOC = “Montreal”) ∧ (BUDGET > 200000)
m3 : (LOC = “New York”) ∧ (BUDGET ≤ 200000)
m4 : (LOC = “New York”) ∧ (BUDGET > 200000)
m5 : (LOC = “Paris”) ∧ (BUDGET ≤ 200000)
m6 : (LOC = “Paris”) ∧ (BUDGET > 200000)
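The elimination step can be illustrated as follows. The sketch enumerates all 2^5 sign combinations of p1 to p5 and keeps those that some tuple could satisfy. It assumes that the three cities make up the whole domain of LOC; the names `worlds` and `satisfiable` are hypothetical helpers, not part of the original algorithm description.

```python
from itertools import product

CITIES = ["Montreal", "New York", "Paris"]  # assumed to be the whole domain of LOC

# p1..p5 evaluated on an abstract "world": a location plus whether BUDGET <= 200000.
predicates = [
    lambda loc, low: loc == "Montreal",  # p1
    lambda loc, low: loc == "New York",  # p2
    lambda loc, low: loc == "Paris",     # p3
    lambda loc, low: low,                # p4: BUDGET <= 200000
    lambda loc, low: not low,            # p5: BUDGET > 200000
]

# Every combination of location and budget class a tuple could have.
worlds = list(product(CITIES, [True, False]))

def satisfiable(signs):
    """A minterm survives if some world gives every predicate its required truth value."""
    return any(
        all(p(loc, low) == s for p, s in zip(predicates, signs))
        for loc, low in worlds
    )

candidates = list(product([True, False], repeat=len(predicates)))  # 2^5 sign vectors
surviving = [signs for signs in candidates if satisfiable(signs)]
```

Under these assumptions only six of the 32 candidates survive, one for each combination of location and budget range, matching m1 to m6.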
PROJ1
PNO   PNAME               BUDGET   LOC
P1    Instrumentation     150000   Montreal

PROJ3
PNO   PNAME               BUDGET   LOC
P2    Database Develop.   135000   New York

PROJ4, PROJ6: [tuples not shown]
• Reconstruction
➡ If relation R is fragmented into FR = {R1,R2,…,Rr}
$$R = \bigcup_{\forall R_i \in F_R} R_i$$
• Disjointness
➡ Minterm predicates that form the basis of fragmentation should be mutually
exclusive.
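A small check of reconstruction and disjointness, written as a sketch. The tuples P1 and P2 come from the example; P3 and P4 are illustrative tuples added here as assumptions so that more than two fragments are populated.

```python
PROJ = [
    ("P1", "Instrumentation", 150000, "Montreal"),
    ("P2", "Database Develop.", 135000, "New York"),
    ("P3", "CAD/CAM", 250000, "New York"),   # illustrative tuple (assumed)
    ("P4", "Maintenance", 310000, "Paris"),  # illustrative tuple (assumed)
]

# One test per minterm: a location combined with a budget range.
minterm_tests = {
    f"{loc}/{'low' if low else 'high'}": (
        lambda t, loc=loc, low=low: t[3] == loc and (t[2] <= 200000) == low
    )
    for loc in ("Montreal", "New York", "Paris")
    for low in (True, False)
}

fragments = {
    name: {t for t in PROJ if test(t)} for name, test in minterm_tests.items()
}

# Reconstruction: the union of all fragments gives back the relation.
reconstructed = set().union(*fragments.values())

# Disjointness: no tuple appears twice, so the sizes add up to the union's size.
disjoint = sum(len(f) for f in fragments.values()) == len(reconstructed)
```

Because the minterm predicates are mutually exclusive and cover every tuple, both checks hold regardless of which tuples the relation contains.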
[Figure: relations PAY(TITLE, SAL), EMP(ENO, ENAME, TITLE), PROJ(PNO, PNAME, BUDGET, LOC), and ASG(ENO, PNO, RESP, DUR), with links L1 (PAY to EMP), L2 (EMP to ASG), and L3 (PROJ to ASG).]
$$R_i = R \ltimes_F S_i, \quad 1 \le i \le w$$
where w is the maximum number of fragments that will be defined on R and
$$S_i = \sigma_{F_i}(S)$$
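The semijoin definition can be illustrated with PAY as owner and EMP as member. This is a sketch only: the PAY fragments reuse the PAY1 and PAY2 tuples shown earlier, the EMP tuples are invented for illustration, and the join attribute TITLE is assumed to be the one behind link L1.

```python
# Owner fragments (TITLE, SAL), taken from the PAY1 / PAY2 example.
PAY1 = [("Mech. Eng.", 27000), ("Programmer", 24000)]
PAY2 = [("Elect. Eng.", 40000), ("Syst. Anal.", 34000)]

# Hypothetical EMP(ENO, ENAME, TITLE) tuples, made up for illustration.
EMP = [
    ("E1", "Alice", "Elect. Eng."),
    ("E2", "Bob", "Syst. Anal."),
    ("E3", "Carol", "Mech. Eng."),
    ("E4", "Dave", "Programmer"),
]

def semijoin_on_title(member, owner_fragment):
    """Keep the member tuples whose TITLE appears in the owner fragment."""
    titles = {title for title, _sal in owner_fragment}
    return [t for t in member if t[2] in titles]

EMP1 = semijoin_on_title(EMP, PAY1)
EMP2 = semijoin_on_title(EMP, PAY2)
```

Each member tuple follows the owner fragment that holds its matching TITLE, so the member relation inherits the owner's fragmentation.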
[Table: derived fragments EMP1 and EMP2, each with columns ENO, ENAME, TITLE; rows not shown.]
• Disjointness
➡ Simple join graphs between the owner and the member fragments.
$use(q_i, A_j) = 1$ if attribute $A_j$ is referenced by query $q_i$, and $0$ otherwise
Assume each query in the previous example accesses the attributes once
during each execution.
Also assume the access frequencies

      S1   S2   S3
q1    15   20   10
q2     5    0    0
q3    25   25   25
q4     3    0    0

Then
aff(A1, A3) = 15*1 + 20*1 + 10*1 = 45
and the attribute affinity matrix AA is

      A1   A2   A3   A4
A1    45    0   45    0
A2     0   80    5   75
A3    45    5   53    3
A4     0   75    3   78
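The affinity values can be recomputed with a short sketch. The access frequencies are the ones listed above; the attribute usage of each query is not shown in this excerpt, so the `USE` mapping below is an assumption chosen to be consistent with the AA matrix.

```python
# Assumed attribute usage per query (not shown in this excerpt).
USE = {
    "q1": {"A1", "A3"},
    "q2": {"A2", "A3"},
    "q3": {"A2", "A4"},
    "q4": {"A3", "A4"},
}

# Access frequencies of each query at sites S1, S2, S3.
ACC = {"q1": [15, 20, 10], "q2": [5, 0, 0], "q3": [25, 25, 25], "q4": [3, 0, 0]}

ATTRS = ["A1", "A2", "A3", "A4"]

def aff(ai, aj):
    """Sum the access frequencies of every query that uses both attributes.

    Each query is assumed to access the attributes once per execution.
    """
    return sum(sum(ACC[q]) for q, used in USE.items() if ai in used and aj in used)

AA = [[aff(ai, aj) for aj in ATTRS] for ai in ATTRS]
```

The diagonal entry aff(Ai, Ai) is simply the total access frequency of all queries that touch Ai.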
The bond between two attributes Ax and Ay is defined as
$$bond(A_x, A_y) = \sum_{z=1}^{n} aff(A_z, A_x)\, aff(A_z, A_y)$$
Ordering (0-3-1) :
cont(A0,A3,A1) = 2bond(A0 , A3)+2bond(A3 , A1)–2bond(A0 , A1)
= 2* 0 + 2* 4410 – 2*0 = 8820
Ordering (1-3-2) :
cont(A1,A3,A2) = 2bond(A1 , A3)+2bond(A3 , A2)–2bond(A1,A2)
= 2* 4410 + 2* 890 – 2*225 = 10150
Ordering (2-3-4) :
cont (A2,A3,A4) = 1780
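The three contributions can be verified numerically. The sketch below hard-codes the AA matrix from the example and treats a column that is not placed (A0, or A4 at this stage) as an empty boundary with bond 0; using `None` for such a column is a convention of this sketch.

```python
AA = [
    [45, 0, 45, 0],
    [0, 80, 5, 75],
    [45, 5, 53, 3],
    [0, 75, 3, 78],
]

def bond(x, y):
    """bond(Ax, Ay) with 1-based indices; None stands for a column that is not placed."""
    if x is None or y is None:
        return 0
    return sum(row[x - 1] * row[y - 1] for row in AA)

def cont(left, mid, right):
    """Net contribution of placing column `mid` between `left` and `right`."""
    return 2 * bond(left, mid) + 2 * bond(mid, right) - 2 * bond(left, right)

c_031 = cont(None, 3, 1)  # ordering (0-3-1)
c_132 = cont(1, 3, 2)     # ordering (1-3-2)
c_234 = cont(2, 3, None)  # ordering (2-3-4): A4 is not placed yet
```

The largest value belongs to ordering (1-3-2), so A3 is placed between A1 and A2.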
BEA – Example
• Therefore, the CA matrix has the form

      A1   A3   A2
      45   45    0
       0    5   80
      45   53    5
       0    3   75
• When A4 is placed, the final form of the CA matrix (after row organization)
is

      A1   A3   A2   A4
A1    45   45    0    0
A3    45   53    5    3
A2     0    5   80   75
A4     0    3   75   78
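The whole column placement can be sketched as a greedy loop. This is a minimal illustration rather than a full statement of the algorithm: it assumes the first two columns are placed as they are, tries every position for each remaining column, and keeps the position with the largest contribution. The function name `bea_order` and the 0-based indices are choices of this sketch.

```python
AA = [
    [45, 0, 45, 0],
    [0, 80, 5, 75],
    [45, 5, 53, 3],
    [0, 75, 3, 78],
]

def bond(aa, x, y):
    """Bond of columns x and y (0-based); None is an empty boundary column."""
    if x is None or y is None:
        return 0
    return sum(row[x] * row[y] for row in aa)

def cont(aa, left, mid, right):
    return 2 * bond(aa, left, mid) + 2 * bond(aa, mid, right) - 2 * bond(aa, left, right)

def bea_order(aa):
    """Return a column order found by greedy insertion at the best position."""
    order = [0, 1]  # the first two columns are placed as they are
    for new in range(2, len(aa)):
        best_pos, best_gain = None, None
        for pos in range(len(order) + 1):
            left = order[pos - 1] if pos > 0 else None
            right = order[pos] if pos < len(order) else None
            gain = cont(aa, left, new, right)
            if best_gain is None or gain > best_gain:
                best_pos, best_gain = pos, gain
        order.insert(best_pos, new)
    return order

order = bea_order(AA)
# Row organization: rows are reordered the same way as the columns.
CA = [[AA[i][j] for j in order] for i in order]
```

On the example matrix the order comes out as A1, A3, A2, A4, and `CA` reproduces the final clustered matrix.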
VF – Algorithm
How can you divide a set of clustered attributes {A1, A2, …, An}
into two (or more) sets {A1, A2, …, Ai} and {Ai+1, …, An} such that
there are no (or minimal) applications that access both (or more
than one) of the sets?
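[Figure: clustered affinity matrix over A1 … Am, split at a point on the diagonal into an upper-left block TA (attributes A1 … Ai) and a lower-right block BA (attributes Ai+1 … Am).]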
$$z = CTQ \times CBQ - COQ^2$$
The point along the diagonal that maximizes z defines two fragments such that
the values of CTQ and CBQ are as nearly equal as possible.
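The search for the split point can be sketched by evaluating z = CTQ × CBQ − COQ² at every position along the diagonal. The sketch assumes that CTQ, CBQ, and COQ are the total access frequencies of the queries that use only top attributes, only bottom attributes, and attributes from both sets, respectively. The attribute usage in `USE` is an assumption (it is not shown in this excerpt), and the totals are the row sums of the access frequencies from the BEA example.

```python
ORDER = ["A1", "A3", "A2", "A4"]  # column order of the clustered affinity matrix

# Assumed attribute usage per query (not shown in this excerpt).
USE = {
    "q1": {"A1", "A3"},
    "q2": {"A2", "A3"},
    "q3": {"A2", "A4"},
    "q4": {"A3", "A4"},
}

# Total access frequency of each query over all sites.
TOTAL_ACC = {"q1": 45, "q2": 5, "q3": 75, "q4": 3}

def z_value(split):
    """z for the split that puts the first `split` attributes in the top set."""
    top, bottom = set(ORDER[:split]), set(ORDER[split:])
    ctq = sum(TOTAL_ACC[q] for q, used in USE.items() if used <= top)
    cbq = sum(TOTAL_ACC[q] for q, used in USE.items() if used <= bottom)
    coq = sum(
        TOTAL_ACC[q]
        for q, used in USE.items()
        if not used <= top and not used <= bottom
    )
    return ctq * cbq - coq ** 2

scores = {split: z_value(split) for split in range(1, len(ORDER))}
best_split = max(scores, key=scores.get)
```

Under these assumptions the best split falls after the second column, giving the attribute sets {A1, A3} and {A2, A4}.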
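• Completeness
➡ The attributes of R are covered by the attributes of its fragments: $A = \bigcup_i A_{R_i}$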
• Reconstruction
➡ Reconstruction can be achieved by joining the fragments on the key K: $R = \bowtie_K R_i, \ \forall R_i \in F_R$
• Disjointness
➡ TID's are not considered to be overlapping since they are maintained by the
system
➡ Duplicated keys are not considered to be overlapping
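The three correctness conditions for vertical fragments can be checked with a small sketch. It assumes the fragments PROJ1(PNO, BUDGET) and PROJ2(PNO, PNAME, LOC) with PNO as key, and it uses only the two PROJ tuples shown earlier; `project` and `join_on_key` are hypothetical helpers.

```python
PROJ = [
    dict(PNO="P1", PNAME="Instrumentation", BUDGET=150000, LOC="Montreal"),
    dict(PNO="P2", PNAME="Database Develop.", BUDGET=135000, LOC="New York"),
]

def project(rel, attrs):
    """Keep only the listed attributes of every tuple."""
    return [{a: t[a] for a in attrs} for t in rel]

def join_on_key(left, right, key):
    """Equi-join two fragments on their shared key attribute."""
    index = {t[key]: t for t in right}
    return [{**t, **index[t[key]]} for t in left if t[key] in index]

PROJ1 = project(PROJ, ["PNO", "BUDGET"])
PROJ2 = project(PROJ, ["PNO", "PNAME", "LOC"])

# Completeness: every attribute of PROJ appears in some fragment.
complete = set(PROJ1[0]) | set(PROJ2[0]) == set(PROJ[0])

# Disjointness: only the duplicated key is shared between the fragments.
shared = set(PROJ1[0]) & set(PROJ2[0])

# Reconstruction: joining on the key gives back the original tuples.
rebuilt = join_on_key(PROJ1, PROJ2, "PNO")
```

The key is repeated in both fragments on purpose; without it the join could not match the two halves of each tuple.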
[Figure: hybrid fragmentation. Fragments R1 and R2 of R are each divided further by vertical fragmentation (VF).]
Decision Variable
• Heuristics based on
➡ single commodity warehouse location (for FAP)
➡ knapsack problem
➡ network flow
➡ Retrieval Cost