ADMT chp3
Distributed Databases
Slides by: Ms. Shree Jaswal
Topics to be covered
Introduction: Distributed Data Processing,
What is a Distributed Database System? Design Issues.
Distributed DBMS Architecture.
Distributed Database Design: Top-Down Design Process,
Distribution Design Issues, Fragmentation, Allocation.
Topic Beyond Syllabus:
Overview of Query Processing : Query Processing Problem,
Objectives of Query Processing,
Layers of Query Processing,
Concurrency Control in Distributed Database system
Recovery algorithms in Distributed Database system
Self-learning Topics: Query Optimization in Distributed Databases
ADMT Chp3 Slides by: Ms. Shree J.
The material in this presentation belongs to St. Francis Institute of Technology and is solely for educational purposes. Distribution and modifications of the content is prohibited.
Distributed Database Concepts
Distributed Database System
Advantages (transparency, contd.)
The EMPLOYEE, PROJECT, and WORKS_ON
tables may be fragmented horizontally and
stored with possible replication as shown
below.
Data Fragmentation
Horizontal fragmentation
It is a horizontal subset of a relation that
contains the tuples satisfying a selection
condition.
Consider the Employee relation with the selection
condition (DNO = 5). All tuples that satisfy this
condition form a subset that is a
horizontal fragment of the Employee relation.
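As a minimal sketch of the (DNO = 5) example above, a horizontal fragment can be modelled as a filter that keeps whole tuples. The sample rows and the helper name `horizontal_fragment` are assumptions made for illustration only.

```python
# Hypothetical sample rows; only the Dno attribute matters for the selection.
EMPLOYEE = [
    {"Ssn": "111", "Name": "Asha", "Dno": 5},
    {"Ssn": "222", "Name": "Ravi", "Dno": 4},
    {"Ssn": "333", "Name": "Meera", "Dno": 5},
]


def horizontal_fragment(relation, condition):
    """Keep whole tuples that satisfy the selection condition."""
    return [row for row in relation if condition(row)]


# Fragment for the selection condition (DNO = 5).
emp_dno5 = horizontal_fragment(EMPLOYEE, lambda row: row["Dno"] == 5)
print([row["Ssn"] for row in emp_dno5])
```

Every tuple in the fragment keeps all of its attributes; only the set of rows shrinks.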
Data Fragmentation
A selection condition may be
composed of several conditions
connected by AND or OR.
Derived horizontal fragmentation: It applies
the partitioning of a primary relation to
other secondary relations that are
related to the primary relation through foreign keys.
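Derived horizontal fragmentation can be sketched as follows, assuming DEPARTMENT is the primary relation and EMPLOYEE refers to it through a foreign key. The relation contents, the attribute names `Dnumber` and `Dno`, and the helper `derived_fragment` are illustrative assumptions.

```python
# Hypothetical rows: EMPLOYEE.Dno is a foreign key to DEPARTMENT.Dnumber.
DEPARTMENT = [
    {"Dnumber": 5, "Dname": "Research"},
    {"Dnumber": 4, "Dname": "Administration"},
]
EMPLOYEE = [
    {"Ssn": "111", "Dno": 5},
    {"Ssn": "222", "Dno": 4},
    {"Ssn": "333", "Dno": 5},
]


def derived_fragment(secondary, foreign_key, primary_fragment, primary_key):
    """Keep secondary tuples whose foreign key points into the primary fragment."""
    referenced = {row[primary_key] for row in primary_fragment}
    return [row for row in secondary if row[foreign_key] in referenced]


# Primary fragment chosen by a selection; the secondary fragment follows it.
dept5 = [row for row in DEPARTMENT if row["Dnumber"] == 5]
emp_of_dept5 = derived_fragment(EMPLOYEE, "Dno", dept5, "Dnumber")
print([row["Ssn"] for row in emp_of_dept5])
```

The secondary relation has no selection condition of its own here; its fragment is determined entirely by the primary fragment.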
Data Fragmentation
Vertical fragmentation
It is a subset of a relation which is
created by a subset of columns.
Thus a vertical fragment of a
relation will contain values of
selected columns.
There is no selection condition
used in vertical fragmentation.
Data Fragmentation
Consider the Employee relation. A vertical
fragment of Employee can be created by keeping the
values of Name, Bdate, Sex, and Address.
Because there is no condition for creating a
vertical fragment, each fragment must
include the primary key attribute of the parent
relation Employee. In this way all vertical
fragments of a relation are connected.
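A small sketch of the idea above: a vertical fragment is a projection that always carries the primary key along. The sample rows, the choice of `Ssn` as primary key, and the helper `vertical_fragment` are assumptions for illustration.

```python
# Hypothetical rows; Ssn is assumed to be the primary key of EMPLOYEE.
EMPLOYEE = [
    {"Ssn": "111", "Name": "Asha", "Bdate": "1990-01-05", "Sex": "F",
     "Address": "Mumbai", "Salary": 50000},
    {"Ssn": "222", "Name": "Ravi", "Bdate": "1988-07-19", "Sex": "M",
     "Address": "Pune", "Salary": 42000},
]


def vertical_fragment(relation, attributes, primary_key="Ssn"):
    """Project onto the chosen columns, always including the primary key."""
    columns = [primary_key] + [a for a in attributes if a != primary_key]
    return [{a: row[a] for a in columns} for row in relation]


personal = vertical_fragment(EMPLOYEE, ["Name", "Bdate", "Sex", "Address"])
pay = vertical_fragment(EMPLOYEE, ["Salary"])
print(list(personal[0]), list(pay[0]))
```

Because both fragments contain `Ssn`, rows from one can later be matched with rows from the other.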
Data Fragmentation
Representation: Horizontal
fragmentation
Each horizontal fragment of a relation R
can be specified by a σ_Ci(R)
operation in the relational algebra.
Complete horizontal fragmentation: A
set of horizontal fragments whose
conditions C1, C2, …, Cn include all
the tuples in R; that is, every tuple in R
satisfies (C1 OR C2 OR … OR Cn).
Data Fragmentation
Disjoint complete horizontal
fragmentation: No tuple in R satisfies (Ci
AND Cj) where i ≠ j.
To reconstruct R from horizontal
fragments a UNION is applied.
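The completeness and disjointness conditions, and reconstruction by UNION, can be checked mechanically. The following is a sketch under assumed data: tuples are hypothetical (Ssn, Dno) pairs and the three conditions select on Dno.

```python
# Hypothetical relation R with tuples of the form (Ssn, Dno).
R = {("111", 5), ("222", 4), ("333", 5), ("444", 1)}

# Selection conditions C1, C2, C3 (assumed for illustration).
conditions = [
    lambda t: t[1] == 5,
    lambda t: t[1] == 4,
    lambda t: t[1] == 1,
]


def is_complete(relation, conds):
    """Every tuple satisfies (C1 OR C2 OR ... OR Cn)."""
    return all(any(c(t) for c in conds) for t in relation)


def is_disjoint(relation, conds):
    """No tuple satisfies (Ci AND Cj) for i != j."""
    return all(sum(1 for c in conds if c(t)) <= 1 for t in relation)


fragments = [{t for t in R if c(t)} for c in conditions]
reconstructed = set().union(*fragments)  # UNION of the horizontal fragments

print(is_complete(R, conditions), is_disjoint(R, conditions), reconstructed == R)
```

Dropping one of the conditions makes the fragmentation incomplete, and the UNION then no longer returns all of R.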
Data Fragmentation
Representation: Vertical
fragmentation
A vertical fragment of a relation R can
be specified by a π_Li(R) operation in the
relational algebra.
Complete vertical fragmentation: A set
of vertical fragments whose projection
lists L1, L2, …, Ln include all the
attributes in R but share only the primary
key of R. In this case the projection lists
satisfy the following two conditions:
Data Fragmentation
1. L1 ∪ L2 ∪ ... ∪ Ln = ATTRS(R)
2. Li ∩ Lj = PK(R) for any i ≠ j, where
ATTRS(R) is the set of attributes of R and
PK(R) is the primary key of R.
To reconstruct R from complete
vertical fragments a FULL OUTER
JOIN is applied.
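The two conditions on the projection lists, and reconstruction by joining on the primary key, can be sketched as below. The attribute sets, sample rows, and helper names are assumptions; the merge on `Ssn` behaves like a full outer join on the key, since a key present in only one fragment is still kept.

```python
from itertools import combinations

# Assumed attribute set and primary key of R.
ATTRS = {"Ssn", "Name", "Salary"}
PK = {"Ssn"}
L1 = {"Ssn", "Name"}
L2 = {"Ssn", "Salary"}


def is_complete_vertical(lists, attrs, pk):
    """Condition 1: the lists cover ATTRS(R). Condition 2: they share only PK(R)."""
    covers_all = set().union(*lists) == attrs
    share_only_pk = all(a & b == pk for a, b in combinations(lists, 2))
    return covers_all and share_only_pk


def join_on_key(frag_a, frag_b, key):
    """Merge rows with equal key values; unmatched keys are kept as well."""
    merged = {}
    for row in frag_a + frag_b:
        merged.setdefault(row[key], {}).update(row)
    return list(merged.values())


frag1 = [{"Ssn": "111", "Name": "Asha"}, {"Ssn": "222", "Name": "Ravi"}]
frag2 = [{"Ssn": "111", "Salary": 50000}, {"Ssn": "222", "Salary": 42000}]
reconstructed = join_on_key(frag1, frag2, "Ssn")
print(is_complete_vertical([L1, L2], ATTRS, PK), reconstructed)
```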
Data Fragmentation
Representation: Mixed (Hybrid)
fragmentation
A combination of vertical
fragmentation and horizontal
fragmentation.
This is achieved by a SELECT-PROJECT
operation, which is
represented by π_Li(σ_Ci(R)).
Data Fragmentation
If C = True (select all tuples) and
L ≠ ATTRS(R), we get a vertical
fragment.
If C ≠ True and L = ATTRS(R),
we get a horizontal fragment.
If C ≠ True and L ≠ ATTRS(R),
we get a mixed fragment.
If C = True and L = ATTRS(R), then
R itself can be considered a fragment.
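The four cases above amount to a two-way decision on C and L. A minimal sketch, with the function name and argument shapes chosen here purely for illustration:

```python
def classify_fragment(condition_is_true, attrs, all_attrs):
    """Classify a SELECT-PROJECT fragment from its condition C and list L.

    condition_is_true: True when C selects all tuples (C = True).
    attrs: the projection list L; all_attrs: ATTRS(R).
    """
    keeps_all_attrs = set(attrs) == set(all_attrs)
    if condition_is_true and keeps_all_attrs:
        return "whole relation"
    if condition_is_true:
        return "vertical"
    if keeps_all_attrs:
        return "horizontal"
    return "mixed"


ATTRS_R = ["Ssn", "Name", "Dno"]  # assumed attribute list
print(classify_fragment(True, ["Ssn", "Name"], ATTRS_R))
print(classify_fragment(False, ATTRS_R, ATTRS_R))
print(classify_fragment(False, ["Ssn", "Name"], ATTRS_R))
```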
Data Fragmentation
Fragmentation schema
A definition of a set of fragments
(horizontal, vertical, or mixed)
that includes all
attributes and tuples in the
database and satisfies the condition
that the whole database can be
reconstructed from the fragments
by applying some sequence of
OUTER UNION (or OUTER JOIN) and UNION
operations.
Data Fragmentation
Allocation schema
It describes the distribution of
fragments to sites of distributed
databases. It can be fully or partially
replicated or can be partitioned.
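An allocation schema can be pictured as a mapping from fragments to the sites that store them. The sketch below classifies such a mapping; the fragment names, site names, and the helper `replication_kind` are hypothetical, and the mapping is assumed to be non-empty.

```python
SITES = {"site1", "site2", "site3"}

# Hypothetical allocation schema: fragment name -> sites holding a copy.
allocation = {
    "EMP_D5": ["site1", "site2"],
    "EMP_D4": ["site1", "site3"],
}


def replication_kind(alloc, sites):
    """Classify an allocation as fully replicated, partitioned, or partial."""
    placements = [set(p) for p in alloc.values()]
    if all(p == set(sites) for p in placements):
        return "fully replicated"      # every fragment at every site
    if all(len(p) == 1 for p in placements):
        return "partitioned"           # no fragment stored more than once
    return "partially replicated"


print(replication_kind(allocation, SITES))
```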
Data Replication and allocation
Example
Suppose that the company has three computer sites, one for
each current department. Sites 2 and 3 are for departments 5 and
4, respectively.
At each of these sites, we expect frequent access to the EMPLOYEE
and PROJECT information for the employees who work in that
department and the projects controlled by that department.
Further, we assume that these sites mainly access the Name, Ssn,
Salary, and Super_ssn attributes of EMPLOYEE.
Site 1 is used by company headquarters and accesses all
employee and project information regularly, in addition to keeping
track of DEPENDENT information for insurance purposes.
Example contd…
Disjointness
If relation R is decomposed into fragments R1, R2, ..., Rn, and data
item di is in Rj, then di should not be in any other fragment Rk (k ≠ j ).
[Figure: sites connected by a network. Labels: Site 2 and Site 3 running Oracle on Linux; Site 2 and Site 3 on Linux with Object Oriented, Network DBMS, and Relational systems.]
3.
Transfer Department relation to site 1,
execute the join at site 1, and send the
result to site 3.
Total bytes transferred = 400,000 + 3500
= 403,500 bytes.
Optimization criteria: minimizing data
transfer.
Preferred approach: strategy 3.
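The comparison behind "strategy 3" can be sketched as simple arithmetic. Only the result size (400,000 bytes), the Department size (3,500 bytes), and the description of strategy 3 come from the text above. The Employee size of 1,000,000 bytes and the descriptions of strategies 1 and 2 are assumptions added to make the comparison concrete.

```python
RESULT_BYTES = 400_000       # from the text: size of the join result
DEPARTMENT_BYTES = 3_500     # from the text: size of the Department relation
EMPLOYEE_BYTES = 1_000_000   # ASSUMPTION: not stated in this section

# Bytes transferred by each strategy.
transfer = {
    # ASSUMED strategy 1: ship both relations to the result site (site 3).
    1: EMPLOYEE_BYTES + DEPARTMENT_BYTES,
    # ASSUMED strategy 2: ship Employee to Department's site, then ship the result.
    2: EMPLOYEE_BYTES + RESULT_BYTES,
    # Strategy 3 (from the text): ship Department to site 1, then ship the result.
    3: DEPARTMENT_BYTES + RESULT_BYTES,
}

best = min(transfer, key=transfer.get)
print(transfer[3], best)
```

Under these assumptions, strategy 3 moves the least data because the small relation travels instead of the large one.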
Distributed DBMS architectures
3 types:
Client server
Collaborating server
Middleware
Client-Server Database
Architecture
It consists of clients running client software, a set of
servers which provide all database functionalities and
a reliable communication infrastructure.
[Figure: Clients 1, 2, and 3 connected to Servers 1 and 2]
Client-Server Database
Architecture
Clients reach server for desired service, but
server does not reach clients.
The server software is responsible for local data
management at a site, much like centralized
DBMS software.
The client software is responsible for most of the
distribution function.
The communication software manages
communication among clients and servers.
Client-Server Database
Architecture
The processing of an SQL query goes as follows:
The client parses a user query and decomposes it
into a number of independent subqueries. Each
subquery is sent to the appropriate site for execution.
Each server processes its query and sends the
result to the client.
The client combines the results of subqueries and
produces the final result.
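The three steps above (decompose, execute at each server, combine) can be sketched with in-memory stand-ins for the servers. The server names, the sample rows, and the functions `run_on_server` and `client_query` are assumptions for illustration, not a real client-server API.

```python
# Hypothetical servers, each holding a horizontal fragment of EMPLOYEE.
SERVERS = {
    "server1": [{"Name": "Asha", "Dno": 5, "Salary": 50000}],
    "server2": [
        {"Name": "Ravi", "Dno": 4, "Salary": 42000},
        {"Name": "Meera", "Dno": 4, "Salary": 61000},
    ],
}


def run_on_server(server, predicate):
    """Server side: evaluate one subquery against local data only."""
    return [row for row in SERVERS[server] if predicate(row)]


def client_query(predicate):
    """Client side: send one subquery per site, then combine the partial results."""
    subqueries = {server: predicate for server in SERVERS}
    partial = [run_on_server(s, p) for s, p in subqueries.items()]
    return [row for rows in partial for row in rows]


high_paid = client_query(lambda row: row["Salary"] > 45000)
print(sorted(row["Name"] for row in high_paid))
```

The servers never talk to each other in this sketch; all distribution logic sits in `client_query`.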
Collaborating server systems
In the client-server architecture, the client
would have to break a query that
spans multiple servers into appropriate
subqueries to be executed at different
sites and then piece together the answers
to the subqueries.
The client process would therefore become
quite complex and would begin to overlap
with the server in terms of capabilities.
To eliminate the problem of
distinguishing between client and server, an
alternative is the collaborating server system.
Collaborating server systems
In this system we have a collection
of database servers each capable
of running transactions against local
data which co-operatively execute
transactions spanning multiple
servers.
When a server receives a query that
requires access to data at other
servers, it generates appropriate sub
queries to be executed by other
servers and puts the results together
to compute answers to the original
query.
Middleware systems
This architecture is designed to allow a single query
to span multiple servers, without requiring all
database servers to be capable of managing such
multisite execution strategies.
The idea is that we need just one database server
capable of managing queries and transactions
spanning multiple servers; the remaining servers
need to handle only local queries and transactions.
This special server acts as a layer of software, often
called middleware.
The middleware layer is capable of executing joins
and other relational operations on data obtained
from other servers but typically does not itself
maintain any data.
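A middleware layer of this kind can be sketched as an object that stores no rows of its own and joins what it fetches from local servers. The classes `LocalServer` and `Middleware`, the relation contents, and the column names are illustrative assumptions.

```python
class LocalServer:
    """A server that only answers queries over its own local data."""

    def __init__(self, rows):
        self._rows = rows

    def scan(self, predicate=lambda row: True):
        return [row for row in self._rows if predicate(row)]


class Middleware:
    """Holds no data itself; joins rows obtained from the local servers."""

    def __init__(self, servers):
        self.servers = servers

    def join(self, left, right, key):
        right_by_key = {}
        for row in self.servers[right].scan():
            right_by_key.setdefault(row[key], []).append(row)
        return [
            {**l, **r}
            for l in self.servers[left].scan()
            for r in right_by_key.get(l[key], [])
        ]


# Hypothetical data placed on two different servers.
mw = Middleware({
    "emp_server": LocalServer([{"ENO": 1, "ENAME": "Asha"}, {"ENO": 2, "ENAME": "Ravi"}]),
    "asg_server": LocalServer([{"ENO": 1, "RESP": "Manager"}]),
})
joined = mw.join("emp_server", "asg_server", "ENO")
print(joined)
```

Only `Middleware` knows that the query spans two servers; each `LocalServer` sees an ordinary local scan.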
Middleware systems
Consider the relations EMP and ASG and the following simple user query: “Find the names of employees who are
managing a project”
Selecting Alternatives
SELECT ENAME
FROM EMP, ASG
WHERE EMP.ENO = ASG.ENO
AND RESP = 'Manager'
Notation: π = Project, σ = Select, ⋈ = Join
Strategy 1
π_ENAME(σ_RESP="Manager" ∧ EMP.ENO=ASG.ENO (EMP × ASG))
Strategy 2
π_ENAME(EMP ⋈_EMP.ENO=ASG.ENO (σ_RESP="Manager" (ASG)))
Strategy 2 avoids Cartesian product, so is “better”
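To see why avoiding the Cartesian product matters, the two strategies can be run on a tiny example and their intermediate results compared. The rows below are hypothetical; EMP tuples are (ENO, ENAME) and ASG tuples are (ENO, RESP).

```python
EMP = [(1, "Asha"), (2, "Ravi"), (3, "Meera")]                          # (ENO, ENAME)
ASG = [(1, "Manager"), (2, "Analyst"), (3, "Manager"), (3, "Analyst")]  # (ENO, RESP)

# Strategy 1: Cartesian product first, then select, then project ENAME.
product = [(e, a) for e in EMP for a in ASG]
names_1 = sorted({e[1] for e, a in product if a[1] == "Manager" and e[0] == a[0]})

# Strategy 2: select managers first, then join on ENO, then project ENAME.
managers = [a for a in ASG if a[1] == "Manager"]
by_eno = {}
for a in managers:
    by_eno.setdefault(a[0], []).append(a)
joined = [(e, a) for e in EMP for a in by_eno.get(e[0], [])]
names_2 = sorted({e[1] for e, a in joined})

print(names_1 == names_2, len(product), len(joined))
```

Both strategies return the same names, but strategy 1 materialises every EMP-ASG pair while strategy 2 only materialises the matching pairs.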
[Figure: execution across sites. Site 1: ASG1' = σ_RESP="Manager"(ASG1). Site 2: ASG2' = σ_RESP="Manager"(ASG2). ASG1' and ASG2' are shown below Site 5.]
Cost of Alternatives
Assume:
size(EMP) = 400, size(ASG) = 1000
tuple access cost = 1 unit; tuple transfer cost = 10 units
there are 20 managers in relation ASG, and data is uniformly distributed among sites.
Strategy 1
produce ASG': (10+10) × tuple access cost = 20
transfer ASG' to the sites of EMP: (10+10) × tuple transfer cost = 200
produce EMP': (10+10) × tuple access cost × 2 = 40
transfer EMP' to result site: (10+10) × tuple transfer cost = 200
Total cost = 460
Strategy 2
transfer EMP to site 5: 400 × tuple transfer cost = 4,000
transfer ASG to site 5: 1000 × tuple transfer cost = 10,000
produce ASG': 1000 × tuple access cost = 1,000
join EMP and ASG': 400 × 20 × tuple access cost = 8,000
Total cost = 23,000
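The two totals above can be reproduced with a few lines of arithmetic. The variable names are chosen here for illustration; all numbers are taken from the assumptions listed on the slide.

```python
ACCESS = 1      # cost of one tuple access
TRANSFER = 10   # cost of one tuple transfer
EMP_SIZE, ASG_SIZE, MANAGERS = 400, 1000, 20

# Strategy 1: work is done where the data lives; only small results move.
strategy_1 = {
    "produce ASG'": (10 + 10) * ACCESS,
    "transfer ASG' to the sites of EMP": (10 + 10) * TRANSFER,
    "produce EMP'": (10 + 10) * ACCESS * 2,
    "transfer EMP' to result site": (10 + 10) * TRANSFER,
}

# Strategy 2: ship both relations to site 5 and do all the work there.
strategy_2 = {
    "transfer EMP to site 5": EMP_SIZE * TRANSFER,
    "transfer ASG to site 5": ASG_SIZE * TRANSFER,
    "produce ASG'": ASG_SIZE * ACCESS,
    "join EMP and ASG'": EMP_SIZE * MANAGERS * ACCESS,
}

total_1 = sum(strategy_1.values())
total_2 = sum(strategy_2.values())
print(total_1, total_2, total_2 / total_1)
```

With these figures, the second strategy is dominated by the cost of transferring whole relations.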
Concurrency and Recovery
Concurrency
Interleaved processing:
Concurrent execution of processes
is interleaved on a single CPU.
Parallel processing:
Processes are executed concurrently
on multiple CPUs.
Recovery
Recovery from transaction failures
usually means that the database is
restored to the most recent consistent
state just before the time of failure.
Distributed deadlock:
Since transactions are processed
at multiple sites, two or more sites
may get involved in deadlock. This
must be resolved in a distributed
manner.
Primary site
[Figure: the primary site together with Sites 1, 2, 3, and 5]
Transaction management:
Concurrency control and commit are
managed by the primary site.
The primary site manages locking and releasing
data items.
Advantages:
Data items are locked only at one site but
they can be accessed at any site.
Disadvantages:
All transaction management activities go to
primary site which is likely to overload the site.
If the primary site fails, the entire system is
inaccessible.
To aid recovery a backup site is designated
which behaves as a shadow of primary site. In
case of primary site failure, backup site can act
as primary site.
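The primary site technique with a backup can be sketched as a single lock table that is mirrored to a shadow site. This is a simplified illustration only: it assumes exclusive locks, a synchronous mirror, and hypothetical class names `LockSite` and `PrimarySiteLocking`.

```python
class LockSite:
    def __init__(self, name):
        self.name = name
        self.locks = {}    # data item -> transaction holding the lock
        self.alive = True


class PrimarySiteLocking:
    """All lock requests go to the primary; the backup shadows its lock table."""

    def __init__(self, primary, backup):
        self.primary, self.backup = primary, backup

    def _coordinator(self):
        if not self.primary.alive:
            # Failover: the backup site takes over as primary site.
            self.primary, self.backup = self.backup, self.primary
        return self.primary

    def lock(self, item, txn):
        site = self._coordinator()
        if site.locks.get(item, txn) != txn:
            return False               # held by another transaction
        site.locks[item] = txn
        if self.backup.alive:
            self.backup.locks[item] = txn
        return True

    def unlock(self, item, txn):
        site = self._coordinator()
        if site.locks.get(item) == txn:
            del site.locks[item]
            if self.backup.alive:
                self.backup.locks.pop(item, None)


manager = PrimarySiteLocking(LockSite("site1"), LockSite("site2"))
first = manager.lock("X", "T1")            # granted
second = manager.lock("X", "T2")           # refused: T1 holds X
manager.primary.alive = False              # simulate primary site failure
after_failover = manager.lock("X", "T2")   # still refused: backup mirrored the lock
manager.unlock("X", "T1")
retry = manager.lock("X", "T2")
print(first, second, after_failover, retry, manager.primary.name)
```

Because every request passes through one site, that site is both the bottleneck and the single point of failure that the backup is meant to cover.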
Advantages (primary copy technique):
Since primary copies are distributed at
various sites, a single site is not
overloaded with locking and unlocking
requests.
Disadvantages:
Identification of a primary copy is
complex. A distributed directory must
be maintained, possibly at all sites.
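With primary copies, each lock request is routed to the site that holds the primary copy of the requested item, which requires a directory lookup. A minimal sketch, with a hypothetical directory and item names:

```python
from collections import Counter

# Hypothetical distributed directory: data item -> site of its primary copy.
PRIMARY_COPY_DIRECTORY = {"X": "site1", "Y": "site2", "Z": "site3"}


def route_lock_requests(items):
    """Count how many lock requests each site receives."""
    return Counter(PRIMARY_COPY_DIRECTORY[item] for item in items)


load = route_lock_requests(["X", "Y", "Z", "X"])
print(dict(load))
```

The requests spread across sites instead of landing on one primary site, at the price of keeping the directory available wherever requests originate.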
Top-down
Bottom-up
Note: Chapter number and page numbers are from the book, Elmasri and Navathe,
“Fundamentals of Database Systems”, 6th Edition, PEARSON Education.