Distributed Databases
Unit 2: Parallel and Distributed Database
Parallel and Distributed Database
Networking of computers allows some tasks to be executed on a server
system and some tasks to be executed on client systems. This division of
work has led to client–server database systems.
Parallel processing within a computer system allows database-system
activities to be speeded up, allowing faster response to transactions, as well
as more transactions per second. The need for parallel query processing
has led to parallel database systems.
Keeping multiple copies of the database across different sites also allows
large organizations to continue their database operations even when one
site is affected by a natural disaster, such as flood, fire, or earthquake.
Distributed database systems handle geographically or administratively
distributed data spread across multiple database systems.
Database System Architectures
Centralized and Client-Server Systems
Server System Architectures
Parallel Systems
Distributed Systems
Centralized and Client–Server Architectures
Centralized database systems are those that run on a
single computer system and do not interact with other
computer systems.
Client–server systems, on the other hand, have
functionality split between a server system and multiple
client systems.
Centralized Systems
Run on a single computer system and do not interact with other
computer systems.
General-purpose computer system: one to a few CPUs and a number of
device controllers that are connected through a common bus that
provides access to shared memory.
Single-user system (e.g., personal computer or workstation): desk-top
unit, single user, usually has only one CPU and one or two hard disks;
the OS may support only one user.
Multi-user system: more disks, more memory, multiple CPUs, and a
multi-user OS. Serves a large number of users who are connected to the
system via terminals. Often called server systems.
A Centralized Computer System
Client-Server Systems
Client–server systems have functionality split between a server system
and multiple client systems.
Server systems satisfy requests generated at m client systems, whose
general structure is shown below:
Client-Server Systems (Cont.)
Database functionality can be divided into:
Back-end: manages access structures, query evaluation and optimization, concurrency
control and recovery.
Front-end: consists of tools such as forms, report-writers, and graphical user interface
facilities.
The interface between the front-end and the back-end is through SQL or
through an application program interface.
Standards such as ODBC and JDBC were developed to interface clients with
servers. Any client that uses the ODBC or JDBC interface can connect to any
server that provides the interface
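As an illustration (not part of the original slides), a minimal Python sketch of such a client using the pyodbc ODBC binding; the data source name, credentials, and table are hypothetical:
```python
# Minimal sketch of a client talking to any ODBC-compliant server.
# The DSN "university_dsn", the credentials, and the table are hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=university_dsn;UID=student;PWD=secret")
cur = conn.cursor()
cur.execute("SELECT dept_name, building FROM department")
for dept_name, building in cur.fetchall():
    print(dept_name, building)
conn.close()
```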
Client-Server Systems (Cont.)
Advantages of replacing mainframes with networks of workstations
or personal computers connected to back-end server machines:
better functionality for the cost
flexibility in locating resources and expanding facilities
better user interfaces
easier maintenance
Server System Architecture
Server systems can be broadly categorized into two kinds:
Transaction servers which are widely used in relational database systems, and
Data servers, used in object-oriented database systems
Transaction Servers
Also called query server systems or SQL server systems
Clients send requests to the server
Transactions are executed at the server
Results are shipped back to the client.
Requests are specified in SQL, and communicated to the server
through a remote procedure call (RPC) mechanism.
Transactional RPC allows many RPC calls to form a transaction.
Open Database Connectivity (ODBC) is a C language application
program interface standard from Microsoft for connecting to a server,
sending SQL requests, and receiving results.
JDBC standard is similar to ODBC, for Java.
Transaction Server Process Structure
A typical transaction server consists of multiple processes accessing data in
shared memory.
Server processes
These receive user queries (transactions), execute them and send results back
Processes may be multithreaded, allowing a single process to execute several
user queries concurrently
Typically multiple multithreaded server processes (hybrid architecture).
Lock manager process
This process implements lock manager functionality, which includes lock grant,
lock release, and deadlock detection.
Database writer process
There are one or more processes that output modified buffer blocks back to
disk on a continuous basis.
Transaction Server Processes (Cont.)
Log writer process
Server processes simply add log records to log record buffer
Log writer process outputs log records to stable storage.
Checkpoint process
Performs periodic checkpoints
Process monitor process
Monitors other processes, and takes recovery actions if any of the
other processes fail
E.g., aborting any transactions being executed by a server process
and restarting it
Transaction Server Processes (Cont.)
Shared memory contains shared data
Buffer pool
Lock table
Log buffer
Cached query plans (reused if same query submitted again)
All database processes can access shared memory
To ensure that no two processes access the same data structure
at the same time, database systems implement mutual exclusion
using either
Operating system semaphores
Atomic instructions such as test-and-set
To avoid overhead of inter process communication for lock
request/grant, each database process operates directly on the lock table
instead of sending requests to lock manager process
Lock manager process still used for deadlock detection
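A minimal sketch (illustrative only) of why mutual exclusion is needed on the shared lock table; Python threads stand in for database processes, a threading.Lock stands in for an OS semaphore or test-and-set latch, and lock modes/conflicts are ignored:
```python
# Sketch: processes updating a shared lock table must serialize access.
# This ignores lock modes and conflict checking; it only shows the latch.
import threading

lock_table = {}                      # data item -> set of transaction ids
lock_table_latch = threading.Lock()  # stands in for a semaphore / test-and-set latch

def record_lock(txn_id, item):
    with lock_table_latch:           # mutual exclusion on the shared structure
        lock_table.setdefault(item, set()).add(txn_id)

threads = [threading.Thread(target=record_lock, args=(t, "X")) for t in range(4)]
for th in threads: th.start()
for th in threads: th.join()
print(lock_table)                    # e.g. {'X': {0, 1, 2, 3}}
```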
Data Servers
Used in high-speed LANs, in cases where
there is a high-speed connection between the clients and the server, the
clients are comparable in processing power to the server
The tasks to be executed are compute intensive.
Data are shipped to clients, where processing is performed, and results are
then shipped back to the server.
This architecture requires full back-end functionality at the clients.
Used in many object-oriented database systems
Issues:
Page-Shipping versus Item-Shipping
Locking
Data Caching
Lock Caching
Data Servers (Cont.)
Page-shipping versus item-shipping
Smaller unit of shipping ⇒ more messages
Worth prefetching related items along with requested item
Page shipping can be thought of as a form of prefetching
Locking
Overhead of requesting and getting locks from server is high due to
message delays
Can grant locks on requested and prefetched items; with page
shipping, transaction is granted lock on whole page.
Locks on a prefetched item can be called back by the server, and
returned by client transaction if the prefetched item has not been
used.
Locks on the page can be deescalated to locks on items in the page
when there are lock conflicts. Locks on unused items can then be
returned to server.
Data Servers (Cont.)
Data Caching
Data can be cached at client even in between transactions
But check that data is up-to-date before it is used (cache
coherency)
Check can be done when requesting lock on data item
Lock Caching
Locks can be retained by client system even in between transactions
Transactions can acquire cached locks locally, without contacting
server
Server calls back locks from clients when it receives conflicting lock
request. Client returns lock once no local transaction is using it.
Similar to deescalation, but across transactions.
Parallel Systems
Parallel database systems consist of multiple processors and multiple
disks connected by a fast interconnection network.
A coarse-grain parallel machine consists of a small number of
powerful processors
A massively parallel or fine grain parallel machine utilizes
thousands of smaller processors.
Two main performance measures:
throughput --- the number of tasks that can be completed in a
given time interval
response time --- the amount of time it takes to complete a
single task from the time it is submitted
Speed-Up and Scale-Up
Speedup: a fixed-sized problem executing on a small system is given
to a system which is N-times larger.
Measured by:
speedup = (small system elapsed time) / (large system elapsed time)
Speedup is linear if equation equals N.
Scaleup: increase the size of both the problem and the system
N-times larger system used to perform N-times larger job
Measured by:
scaleup = (small system, small problem elapsed time) / (big system, big problem elapsed time)
Scale up is linear if equation equals 1.
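For concreteness, a tiny sketch (with made-up numbers) of the two measures:
```python
# Speedup and scaleup as defined above; the timings are illustrative.
def speedup(small_system_time, large_system_time):
    return small_system_time / large_system_time

def scaleup(small_problem_small_system_time, large_problem_large_system_time):
    return small_problem_small_system_time / large_problem_large_system_time

# A 100 s job drops to 30 s on a system 4x larger: sublinear speedup (< 4).
print(speedup(100, 30))    # 3.33...
# A 4x larger job on a 4x larger system takes 120 s instead of 100 s: sublinear scaleup (< 1).
print(scaleup(100, 120))   # 0.83...
```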
Speedup
Scaleup
Batch and Transaction Scaleup
Batch scaleup:
A single large job; typical of most decision support queries and
scientific simulation.
Use an N-times larger computer on N-times larger problem.
Transaction scaleup:
Numerous small queries submitted by independent users to a
shared database; typical transaction processing and timesharing
systems.
N-times as many users submitting requests (hence, N-times as many
requests) to an N-times larger database, on an N-times larger
computer.
Well-suited to parallel execution.
Factors Limiting Speedup and Scaleup
Speedup and scaleup are often sublinear due to:
Startup costs: Cost of starting up multiple processes may dominate
computation time, if the degree of parallelism is high.
Interference: Processes accessing shared resources (e.g., system
bus, disks, or locks) compete with each other, thus spending time
waiting on other processes, rather than performing useful work.
Skew: Increasing the degree of parallelism increases the variance in
service times of tasks executing in parallel. Overall execution time is
determined by the slowest of the tasks executing in parallel.
Interconnection Network Architectures
Bus. System components send data on and receive data
from a single communication bus;
Does not scale well with increasing parallelism.
Mesh. Components are arranged as nodes in a grid, and
each component is connected to all adjacent
components
Communication links grow with growing number of
components, and so scales better.
But may require 2√n hops to send a message to a node
(or √n with wraparound connections at edge of grid).
Hypercube. Components are numbered in binary;
components are connected to one another if their
binary representations differ in exactly one bit.
n components are connected to log(n) other
components and can reach each other via at most
log(n) links; reduces communication delays.
Parallel Database Architectures
Shared memory – All processors
share a common memory
Shared disk – All processors share
a common disk (cluster)
Shared nothing -- processors
share neither a common memory
nor common disk
Hierarchical -- hybrid of the above
3 architectures
Shared Memory
Processors and disks have access to a common memory, typically via a
bus or through an interconnection network.
Extremely efficient communication between processors — data in
shared memory can be accessed by any processor without having to
move it using software.
Downside – architecture is not scalable beyond 32 or 64 processors
since the bus or the interconnection network becomes a bottleneck
Widely used for lower degrees of parallelism (4 to 8).
Shared Disk
All processors can directly access all disks via an interconnection
network, but the processors have private memories.
The memory bus is not a bottleneck
Architecture provides a degree of fault-tolerance — if a
processor fails, the other processors can take over its tasks since
the database is resident on disks that are accessible from all
processors.
Examples: IBM Sysplex and DEC clusters (now part of Compaq)
running Rdb (now Oracle Rdb) were early commercial users
Downside: bottleneck now occurs at interconnection
to the disk subsystem.
Shared-disk systems can scale to a somewhat larger
number of processors, but communication between
processors is slower.
Shared Nothing
Node consists of a processor, memory and one or more disks.
Processors at one node communicate with another processor at another
node using an interconnection network.
A node functions as the server for the data on the disk or disks the node
owns.
Examples: Teradata, Tandem, Oracle on nCUBE
Data accessed from local disks (and local memory accesses) do not pass
through interconnection network, thereby minimizing the interference of
resource sharing.
Shared-nothing multiprocessors can be scaled up
to thousands of processors without interference.
Main drawback: cost of communication and non-local
disk access; sending data involves software interaction
at both ends.
Hierarchical
Combines characteristics of shared-memory, shared-disk, and shared-
nothing architectures.
Top level is a shared-nothing architecture – nodes connected by an
interconnection network, and do not share disks or memory with
each other.
Each node of the system could be a shared-memory system with a
few processors.
Alternatively, each node could be a shared-disk system, and each of
the systems sharing a set of disks could be a shared-memory system.
The complexity of programming such systems can be reduced by
distributed virtual-memory architectures
Also called non-uniform memory access (NUMA)
architecture
Distributed Systems
Data spread over multiple machines (also referred to as sites or nodes).
Network interconnects the machines
Data shared by users on multiple machines
Distributed Databases
Homogeneous distributed databases
Same software/schema on all sites, data may be partitioned among sites
Goal: provide a view of a single database, hiding details of distribution
Heterogeneous distributed databases
Different software/schema on different sites
Goal: integrate (combine) existing databases to provide useful
functionality
Differentiate between local and global transactions
A local transaction accesses data in the single site at which the
transaction was initiated.
A global transaction either accesses data in a site different from the
one at which the transaction was initiated or accesses data in several
different sites.
Trade-offs in Distributed Systems
Sharing data – users at one site able to access the data residing at
some other sites.
Autonomy – each site is able to retain a degree of control over data
stored locally.
Higher system availability through redundancy — data can be
replicated at remote sites, and system can function even if a site fails.
Disadvantage: added complexity required to ensure proper
coordination among sites.
Software development cost.
Greater potential for bugs.
Increased processing overhead.
Parallel Databases
Introduction
I/O Parallelism
Interquery Parallelism
Intraquery Parallelism
Intraoperation Parallelism
Interoperation Parallelism
Design of Parallel Systems
Introduction
Parallel machines are becoming quite common and affordable
Prices of microprocessors, memory and disks have dropped sharply
Recent desktop computers feature multiple processors and this
trend is projected to accelerate
Databases are growing increasingly large
large volumes of transaction data are collected and stored for later
analysis.
multimedia objects like images are increasingly stored in databases
Large-scale parallel database systems increasingly used for:
storing large volumes of data
processing time-consuming decision-support queries
providing high throughput for transaction processing
Parallelism in Databases
1. Data can be partitioned across multiple disks for parallel I/O.
2. Individual relational operations (e.g., sort, join, aggregation) can be
executed in parallel
data can be partitioned and each processor can work independently
on its own partition.
3. Queries are expressed in high level language (SQL, translated to relational
algebra)
makes parallelization easier.
4. Different queries can be run in parallel with each other. Concurrency
control takes care of conflicts.
Thus, databases naturally lend themselves to parallelism.
I/O Parallelism
Reduce the time required to retrieve relations from disk by partitioning
the relations across multiple disks.
Horizontal partitioning – tuples of a relation are divided among many
disks such that each tuple resides on one disk.
Partitioning techniques (number of disks = n):
Round-robin:
Send the ith tuple inserted in the relation to disk i mod n.
Hash partitioning:
Choose one or more attributes as the partitioning attributes.
Choose hash function h with range 0…n - 1
Let i denote result of hash function h applied to the partitioning
attribute value of a tuple. Send tuple to disk i.
Range partitioning:
Choose a partitioning attribute and a partitioning vector [v0, v1, ..., vn-2].
A tuple whose partitioning-attribute value v satisfies vi-1 ≤ v < vi is sent to
disk i; tuples with v < v0 go to disk 0 and tuples with v ≥ vn-2 go to disk n-1.
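A compact sketch (illustrative tuples and parameters) of the three partitioning strategies for n = 4 disks:
```python
# Round-robin, hash, and range partitioning for n = 4 disks; data is illustrative.
n = 4
tuples = [("Music", 77), ("History", 45), ("Physics", 12), ("Biology", 93)]

def round_robin(i):                      # i-th tuple inserted -> disk i mod n
    return i % n

def hash_part(t, attr=0):                # hash of the partitioning attribute (a string here)
    return sum(ord(c) for c in t[attr]) % n

partition_vector = [25, 50, 75]          # range partitioning on the 2nd attribute
def range_part(t, attr=1):
    v = t[attr]
    for disk, bound in enumerate(partition_vector):
        if v < bound:
            return disk
    return len(partition_vector)         # last disk for values >= final bound

for i, t in enumerate(tuples):
    print(t, round_robin(i), hash_part(t), range_part(t))
```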
Example of Hash Partitioning
Each character is represented by its binary (ASCII) code.
There are 8 buckets (0 to 7).
The hash function returns the sum of the binary representations of the characters
modulo 8.
E.g. h(Music) = 1 h(History) = 2
h(Physics) = 3 h(Elec. Eng.) = 3
For Music:
M = 01001101 (77), u = 01110101 (117), s = 01110011 (115),
i = 01101001 (105), c = 01100011 (99)
Sum = 77 + 117 + 115 + 105 + 99 = 513 (binary 1000000001)
513 mod 8 = 1
Music is assigned to bucket 1.
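The same hash function in a few lines of Python (only the values for Music, History, and Physics are checked here; the result for Elec. Eng. depends on exactly which characters are summed):
```python
# Hash of the example above: sum of the character codes modulo 8.
def h(s, n_buckets=8):
    return sum(ord(c) for c in s) % n_buckets

for dept in ["Music", "History", "Physics"]:
    print(dept, h(dept))   # Music -> 1, History -> 2, Physics -> 3
```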
Comparison of Partitioning Techniques
Evaluate how well partitioning techniques support the following types of
data access:
1. Scanning the entire relation.
2. Locating a tuple associatively – point queries.
E.g., r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a
specified range – range queries.
E.g., 10 ≤ r.A < 25.
Comparison of Partitioning Techniques (Cont.)
Round robin:
Advantages
Best suited for sequential scan of entire relation on each query.
All disks have almost an equal number of tuples; retrieval work is
thus well balanced between disks.
Disadvantages
Range queries are difficult to process
No clustering -- tuples are scattered across all disks
Comparison of Partitioning Techniques (Cont.)
Hash partitioning:
Good for sequential access
Assuming hash function is good, and partitioning attributes form a
key, tuples will be equally distributed between disks
Retrieval work is then well balanced between disks.
Good for point queries on partitioning attribute
Can lookup single disk, leaving others available for answering other
queries.
Index on partitioning attribute can be local to disk, making lookup
and update more efficient
No clustering, so difficult to answer range queries
Comparison of Partitioning Techniques (Cont.)
Range partitioning:
Provides data clustering by partitioning attribute value.
Good for sequential access
Good for point queries on partitioning attribute: only one disk needs to
be accessed.
For range queries on partitioning attribute, one to a few disks may need
to be accessed
Remaining disks are available for other queries.
Good if result tuples are from one to a few blocks.
If many blocks are to be fetched, they are still fetched from one to a
few disks, and potential parallelism in disk access is wasted
Example of execution skew.
Partitioning a Relation across Disks
If a relation contains only a few tuples which will fit into a single disk
block, then assign the relation to a single disk.
Large relations are preferably partitioned across all the available disks.
If a relation consists of m disk blocks and there are n disks available in
the system, then the relation should be allocated min(m,n) disks.
Handling of Skew
The distribution of tuples to disks may be skewed — that is, some disks
have many tuples, while others may have fewer tuples.
Types of skew:
Attribute-value skew.
Some values appear in the partitioning attributes of many tuples; all
the tuples with the same value for the partitioning attribute end up
in the same partition.
Can occur with range-partitioning and hash-partitioning.
Partition skew.
With range-partitioning, badly chosen partition vector may assign
too many tuples to some partitions and too few to others. Ex:
Partition on Income level
Less likely with hash-partitioning if a good hash-function is chosen.
Ex: Mod function on name attribute (ASCII Addition of characters)
Handling Skew in Range-Partitioning
To create a balanced partitioning vector (assuming partitioning
attribute forms a key of the relation):
Sort the relation on the partitioning attribute.
Construct the partition vector by scanning the relation in sorted
order as follows.
After every 1/nth of the relation has been read, the value of the
partitioning attribute of the next tuple is added to the partition
vector.
n denotes the number of partitions to be constructed.
Duplicate entries or imbalances can result if duplicates are present in
partitioning attributes.
Alternative technique based on histograms used in practice
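A minimal sketch (illustrative values) of building a balanced partition vector by sorting and sampling every 1/n-th tuple, as described above:
```python
# Build a balanced range-partition vector (assumes the attribute is a key).
def build_partition_vector(values, n_partitions):
    values = sorted(values)                        # sort on the partitioning attribute
    step = len(values) // n_partitions
    # record a boundary value after every 1/n-th of the relation: n-1 entries
    return [values[(i + 1) * step] for i in range(n_partitions - 1)]

salaries = [12, 31, 7, 45, 88, 23, 56, 90, 41, 66, 19, 73]
print(build_partition_vector(salaries, 4))         # [23, 45, 73]
```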
Interquery Parallelism
Queries/transactions execute in parallel with one another.
Increases transaction throughput; used primarily to scale up a
transaction processing system to support a larger number of
transactions per second.
Easiest form of parallelism to support, particularly in a shared-memory
parallel database.
More complicated to implement on shared-disk or shared-nothing
architectures
Locking and logging must be coordinated by passing messages
between processors.
Data in a local buffer may have been updated at another processor.
Cache-coherency has to be maintained — reads and writes of data
in buffer must find latest version of data.
Cache Coherency Protocol
Example of a cache coherency protocol for shared disk systems:
Before reading/writing to a page, the page must be locked in
shared/exclusive mode.
On locking a page, the page must be read from disk
Before unlocking a page, the page must be written to disk if it was
modified.
More complex protocols with fewer disk reads/writes exist.
Cache coherency protocols for shared-nothing systems are similar. Each
database page is assigned a home processor. Requests to fetch the page
or write it to disk are sent to the home processor.
Intraquery Parallelism
Execution of a single query in parallel on multiple processors/disks;
important for speeding up long-running queries.
Two complementary forms of intraquery parallelism:
Intraoperation Parallelism – parallelize the execution of each
individual operation in the query.
Interoperation Parallelism – execute the different operations in a
query expression in parallel.
The first form scales better with increasing parallelism because
the number of tuples processed by each operation is typically more than
the number of operations in a query.
Parallel Sort
Range-Partitioning Sort
Choose processors P0, ..., Pm, where m ≤ n - 1, to do sorting.
Create range-partition vector with m entries, on the sorting attributes
Redistribute the relation using range partitioning
all tuples that lie in the ith range are sent to processor Pi
Pi stores the tuples it received temporarily on disk Di.
This step requires I/O and communication overhead.
Each processor Pi sorts its partition of the relation locally.
Each processors executes same operation (sort) in parallel with other
processors, without any interaction with the others (data parallelism).
Final merge operation is trivial: range-partitioning ensures that, for 1 ≤ i <
j ≤ m, the key values in processor Pi are all less than the key values in Pj.
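A sequential sketch of the range-partitioning sort; the partition vector and data are illustrative, and the "processors" are simulated one after another:
```python
# Range-partitioning sort: redistribute by range, sort each partition locally,
# then concatenate the partitions (the trivial final merge).
from bisect import bisect_right

def range_partition_sort(tuples, partition_vector):
    n = len(partition_vector) + 1
    partitions = [[] for _ in range(n)]
    for t in tuples:                               # redistribution step
        partitions[bisect_right(partition_vector, t)].append(t)
    for p in partitions:                           # each "processor" sorts its partition
        p.sort()
    result = []
    for p in partitions:                           # concatenation = final merge
        result.extend(p)
    return result

print(range_partition_sort([42, 7, 19, 88, 3, 56, 61], [20, 60]))
```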
Parallel Sort (Cont.)
Parallel External Sort-Merge
Assume the relation has already been partitioned among disks D0, ..., Dn-1
(in whatever manner).
Each processor Pi locally sorts the data on disk Di.
The sorted runs on each processor are then merged to get the final sorted
output.
Parallelize the merging of sorted runs as follows:
The sorted partitions at each processor Pi are range-partitioned across
the processors P0, ..., Pm-1.
Each processor Pi performs a merge on the streams as they are received,
to get a single sorted run.
The sorted runs on processors P0,..., Pm-1 are concatenated to get the
final result.
Parallel Join
The join operation requires pairs of tuples to be tested to see if they satisfy
the join condition, and if they do, the pair is added to the join output.
Parallel join algorithms attempt to split the pairs to be tested over several
processors. Each processor then computes part of the join locally.
In a final step, the results from each processor can be collected together to
produce the final result.
Partitioned Join
For equi-joins and natural joins, it is possible to partition the two input
relations across the processors and compute the join locally at each
processor.
Let r and s be the input relations, and we want to compute r ⋈r.A=s.B s.
Suppose that we are using n processors and that the relations to be joined
are r and s.
r and s each are partitioned into n partitions, denoted r0, r1, ..., rn-1 and s0,
s1, ..., sn-1.
The system sends partitions ri and si to processor Pi, where their join is
computed locally.
Partitioned Join (Cont …)
Can use either range partitioning or hash partitioning.
In either case, the same partitioning function must be used for both
relations.
For range partitioning, the same partition vector must be used for both
relations.
For hash partitioning, the same hash function must be used on both
relations.
Once the relations are partitioned, we can use any join technique locally at
each processor Pi to compute the join of ri and si .
For example, hash join, merge join, or nested-loop join could be used.
Partitioned Join (Cont.)
Fragment-and-Replicate Join
Partitioning is not possible for some join conditions,
e.g., non-equijoin conditions such as r.A > s.B.
For joins where partitioning is not applicable, parallelization can be
accomplished by fragment and replicate technique
Special case – asymmetric fragment-and-replicate:
One of the relations, say r, is partitioned; any
partitioning technique can be used.
The other relation, s, is replicated across all the
processors.
Processor Pi then locally computes the join of ri
with all of s using any join technique.
Fragment-and-Replicate Join (Cont.)
General case: reduces the sizes of the relations at each processor.
r is partitioned into n partitions, r0, r1, ..., r n-1;
s is partitioned into m partitions, s0, s1, ..., sm-1.
Any partitioning technique may be used.
There must be at least m * n processors.
Label the processors as: P0,0, P0,1, ..., P0,m-1, P1,0, ..., Pn-1,m-1.
Pi,j computes the join of ri with sj. In order to do so, ri is replicated to Pi,0,
Pi,1, ..., Pi,m-1, while sj is replicated to P0,j, P1,j, ..., Pn-1,j
Any join technique can be used at each processor Pi,j.
Fragment-and-Replicate Join (Cont.)
Fragment-and-Replicate Join (Cont.)
Both versions of fragment-and-replicate work with any join condition,
since every tuple in r can be tested with every tuple in s.
Usually has a higher cost than partitioning, since one of the relations (for
asymmetric fragment-and-replicate) or both relations (for general
fragment-and-replicate) have to be replicated.
Sometimes asymmetric fragment-and-replicate is preferable even though
partitioning could be used.
E.g., say s is small and r is large, and already partitioned. It may be
cheaper to replicate s across all processors, rather than repartition r
and s on the join attributes.
General Fragment and replicate reduces the sizes of the relations at each
processor, compared to asymmetric fragment and replicate.
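A minimal sketch (illustrative relations) of asymmetric fragment-and-replicate for the non-equijoin condition r.A > s.B: r is partitioned, the small relation s is replicated, and each processor joins its fragment of r with all of s:
```python
# Asymmetric fragment-and-replicate join for r.A > s.B, simulated sequentially.
r = [(1,), (5,), (9,), (12,), (20,)]           # tuples of r, attribute A
s = [(4,), (10,)]                              # small relation s, attribute B
n = 3                                          # number of processors

fragments = [r[i::n] for i in range(n)]        # any partitioning of r will do

def local_join(r_frag, s_all):                 # nested-loop join at one processor
    return [(a, b) for (a,) in r_frag for (b,) in s_all if a > b]

result = []
for i in range(n):                             # processor Pi gets ri and all of s
    result.extend(local_join(fragments[i], s))
print(sorted(result))
```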
Partitioned Parallel Hash-Join
Suppose that we have n processors, P0, P1, . . . , Pn−1, and two relations r and s,
such that the relations r and s are partitioned across multiple disks.
Assume s is smaller than r and therefore s is chosen as the build relation.
The parallel hash-join algorithm proceeds this way:
1. Choose a hash function—say, h1—that takes the join attribute value of
each tuple in r and s and maps the tuple to one of the n processors. Let ri
denote the tuples of relation r that are mapped to processor Pi ; similarly, let si
denote the tuples of relation s that are mapped to processor Pi. Each processor
Pi reads the tuples of s that are on its disk Di and sends each tuple to the
appropriate processor on the basis of hash function h1.
Partitioned Parallel Hash-Join
2. As the destination processor Pi receives the tuples of si , it further
partitions them by another hash function, h2, which the processor uses to
compute the hash join locally. The partitioning at this stage is exactly the
same as in the partitioning phase of the sequential hash-join algorithm. Each
processor Pi executes this step independently from the other processors.
3. Once the tuples of s have been distributed, the system redistributes
the larger relation r across the n processors by the hash function h1, in the
same way as before. As it receives each tuple, the destination processor
repartitions it by the function h2, just as the probe relation is partitioned in the
sequential hash-join algorithm.
4. Each processor Pi executes the build and probe phases of the hash-
join algorithm on the local partitions ri and si of r and s to produce a partition
of the final result of the hash join.
Partitioned Parallel Hash-Join (Cont.)
Once the tuples of s have been distributed, the larger relation r is
redistributed across the n processors using the hash function h1
Let ri denote the tuples of relation r that are sent to processor Pi.
As the r tuples are received at the destination processors, they are
repartitioned using the function h2
(just as the probe relation is partitioned in the sequential hash-join
algorithm).
Each processor Pi executes the build and probe phases of the hash-join
algorithm on the local partitions ri and si of r and s to produce a partition
of the final result of the hash-join.
Note: Hash-join optimizations can be applied to the parallel case
e.g., the hybrid hash-join algorithm can be used to cache some of the
incoming tuples in memory and avoid the cost of writing them and
reading them back in.
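A sequential sketch of the partitioned parallel hash join; h1 and the relations are illustrative, and Python's dict plays the role of the local hash table built with h2:
```python
# Partitioned parallel hash join, simulated sequentially on one machine.
n = 4
def h1(v): return v % n                        # maps a join-attribute value to a processor

r = [(k, "r%d" % k) for k in range(20)]        # (join attribute, payload), probe relation
s = [(k, "s%d" % k) for k in range(0, 20, 3)]  # smaller build relation

r_parts = [[] for _ in range(n)]
s_parts = [[] for _ in range(n)]
for t in s: s_parts[h1(t[0])].append(t)        # steps 1-2: distribute s by h1
for t in r: r_parts[h1(t[0])].append(t)        # step 3: redistribute r by h1

result = []
for i in range(n):                             # step 4: local build and probe at Pi
    build = {}
    for k, payload in s_parts[i]:              # build on si (dict hashing stands in for h2)
        build.setdefault(k, []).append(payload)
    for k, payload in r_parts[i]:              # probe with ri
        for s_payload in build.get(k, []):
            result.append((k, payload, s_payload))
print(len(result), sorted(result)[:3])
```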
Parallel Nested-Loop Join
Assume that
relation s is much smaller than relation r and that r is stored by
partitioning.
there is an index on a join attribute of relation r at each of the partitions
of relation r.
Use asymmetric fragment-and-replicate, with relation s being replicated, and
using the existing partitioning of relation r.
Each processor Pj where a partition of relation s is stored reads the tuples
of relation s stored in Dj, and replicates the tuples to every other processor
Pi.
At the end of this phase, relation s is replicated at all sites that store
tuples of relation r.
Each processor Pi performs an indexed nested-loop join of relation s with
the ith partition of relation r.
Other Relational Operations
Selection σθ(r)
If θ is of the form ai = v, where ai is an attribute and v a value:
If r is partitioned on ai, the selection is performed at a single processor.
If θ is of the form l ≤ ai ≤ u (i.e., θ is a range selection) and the relation
has been range-partitioned on ai:
Selection is performed at each processor whose partition overlaps with
the specified range of values.
In all other cases: the selection is performed in parallel at all the
processors.
Other Relational Operations (Cont.)
Duplicate elimination
Perform by using either of the parallel sort techniques
eliminate duplicates as soon as they are found during sorting.
Can also partition the tuples (using either range- or hash- partitioning)
and perform duplicate elimination locally at each processor.
Projection
Projection without duplicate elimination can be performed as tuples are
read in from disk in parallel.
If duplicate elimination is required, any of the above duplicate elimination
techniques can be used.
Grouping/Aggregation
Partition the relation on the grouping attributes and then compute the
aggregate values locally at each processor.
Can reduce cost of transferring tuples during partitioning by partly
computing aggregate values before partitioning.
Consider the sum aggregation operation:
Perform aggregation operation at each processor Pi on those tuples
stored on disk Di
results in tuples with partial sums at each processor.
Result of the local aggregation is partitioned on the grouping attributes,
and the aggregation performed again at each processor Pi to get the final
result.
Fewer tuples need to be sent to other processors during partitioning.
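A sketch (illustrative data) of partial aggregation: each processor first computes local per-group sums, and only those partial sums are repartitioned on the grouping attribute and summed again:
```python
# Parallel sum with partial aggregation, simulated sequentially.
from collections import defaultdict

local_data = [                                  # (dept_name, salary) tuples on each disk
    [("Music", 40), ("Physics", 90), ("Music", 35)],
    [("Physics", 80), ("History", 60), ("Music", 25)],
]

def local_partial_sums(tuples):
    sums = defaultdict(int)
    for dept, salary in tuples:
        sums[dept] += salary
    return sums                                 # far fewer tuples to ship than raw input

final = defaultdict(int)
for partial in map(local_partial_sums, local_data):
    for dept, subtotal in partial.items():      # repartition partial sums by dept_name
        final[dept] += subtotal                 # second aggregation at the target processor
print(dict(final))                              # {'Music': 100, 'Physics': 170, 'History': 60}
```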
Cost of Parallel Evaluation of Operations
If there is no skew in the partitioning, and there is no overhead due to the
parallel evaluation, the time taken is expected to be 1/n of the sequential
time (i.e., a speed-up of n)
If skew and overheads are also to be taken into account, the time taken by
a parallel operation can be estimated as
Tpart + Tasm + max (T0, T1, …, Tn-1)
Tpart is the time for partitioning the relations
Tasm is the time for assembling the results
Ti is the time taken for the operation at processor Pi
this needs to be estimated taking into account the skew, and the time
wasted in contentions.
Interoperator Parallelism
Pipelined parallelism
Consider a join of four relations
r1 ⋈ r2 ⋈ r3 ⋈ r4
Set up a pipeline that computes the three joins in parallel
Let P1 be assigned the computation of
temp1 = r1 ⋈ r2
And P2 be assigned the computation of temp2 = temp1 ⋈ r3
And P3 be assigned the computation of temp2 ⋈ r4
Each of these operations can execute in parallel, sending result tuples it
computes to the next operation even as it is computing further results
Provided a pipelineable join evaluation algorithm (e.g., indexed nested
loops join) is used
Factors Limiting Utility of Pipeline Parallelism
Pipeline parallelism is useful since it avoids writing intermediate results to
disk
Useful with small number of processors but does not scale up well with
more processors. One reason is that pipeline chains do not attain
sufficient length.
Little speedup is obtained for the frequent cases of skew in which
one operator's execution cost is much higher than the others.
Independent Parallelism
Independent parallelism
Consider a join of four relations
r1 ⋈ r2 ⋈ r3 ⋈ r4
Let P1 be assigned the computation of
temp1 = r1 ⋈ r2
And P2 be assigned the computation of temp2 = r3 ⋈ r4
And P3 be assigned the computation of temp1 ⋈ temp2
P1 and P2 can work independently in parallel
P3 has to wait for input from P1 and P2
Can pipeline output of P1 and P2 to P3, combining independent
parallelism and pipelined parallelism
Does not provide a high degree of parallelism
useful with a lower degree of parallelism.
less useful in a highly parallel system.
Distributed Databases
Heterogeneous and Homogeneous Databases
Distributed Data Storage
Distributed Transactions
Commit Protocols
Concurrency Control in Distributed Databases
Availability
Distributed Database System
A distributed database system consists of loosely coupled sites
that share no physical component
Database systems that run on each site are independent of each
other
Transactions may access data at one or more sites
Homogeneous and Heterogeneous Databases
In a homogeneous distributed database
All sites have identical software
Are aware of each other and agree to cooperate in processing user
requests.
Each site surrenders part of its autonomy in terms of right to change
schemas or software.
Appears to user as a single system
Distributed Data Storage
Assume relational data model
Consider a relation r that is to be stored in the database. There are two
approaches to storing this relation in the distributed database:
Replication
The system maintains several identical replicas (copies) of the relation,
and stores each replica at a different site.
Fragmentation
The relation is partitioned into several fragments, and the system stores
each fragment at a different site.
Replication and fragmentation can be combined
Relation is partitioned into several fragments: system maintains several
identical replicas of each such fragment.
Data Replication
If relation r is replicated, a copy of relation r is stored in two or more
sites.
Full replication of a relation is the case where the relation is stored at
all sites.
Fully redundant databases are those in which every site contains a copy
of the entire database.
Data Replication (Cont.)
Advantages of Replication
Availability: failure of a site containing relation r does not result in
unavailability of r if replicas exist.
Parallelism: queries on r may be processed by several nodes in
parallel.
Reduced data transfer: relation r is available locally at each site
containing a replica of r.
Disadvantages of Replication
Increased cost of updates: each replica of relation r must be updated.
Increased complexity of concurrency control: concurrent updates to
distinct replicas may lead to inconsistent data unless special
concurrency control mechanisms are implemented.
One solution: choose one copy as primary copy and apply
concurrency control operations on primary copy
Data Fragmentation
Division of relation r into fragments r1, r2, …, rn which contain sufficient
information to reconstruct relation r.
Horizontal fragmentation: each tuple of r is assigned to one or
more fragments
Vertical fragmentation: the schema for relation r is split into several
smaller schemas
All schemas must contain a common candidate key (or superkey) to
ensure lossless join property.
A special attribute, the tuple-id attribute may be added to each
schema to serve as a candidate key.
Vertical Fragmentation of employee_info Relation
deposit1 = Π branch_name, customer_name, tuple_id (employee_info)
branch_name   customer_name   tuple_id
Hillside      Lowman          1
Hillside      Camp            2
Valleyview    Camp            3
Valleyview    Kahn            4
Hillside      Kahn            5
Valleyview    Kahn            6
Valleyview    Green           7
deposit2 = Π account_number, balance, tuple_id (employee_info)
account_number   balance   tuple_id
A-305            500       1
A-226            336       2
A-177            205       3
A-402            10000     4
A-155            62        5
A-408            1123      6
A-639            750       7
Advantages of Fragmentation
Horizontal:
allows parallel processing on fragments of a relation
allows a relation to be split so that tuples are located where they are
most frequently accessed
Vertical:
allows tuples to be split so that each part of the tuple is stored
where it is most frequently accessed
tuple-id attribute allows efficient joining of vertical fragments
allows parallel processing on a relation
Vertical and horizontal fragmentation can be mixed.
Fragments may be successively fragmented to an arbitrary depth.
Data Transparency
Data transparency: Degree to which system user may remain unaware
of the details of how and where the data items are stored in a
distributed system
Consider transparency issues in relation to:
Fragmentation transparency
Replication transparency
Location transparency
Naming of Data Items - Criteria
1. Every data item must have a system-wide unique name.
2. It should be possible to find the location of data items efficiently.
3. It should be possible to change the location of data items transparently.
4. Each site should be able to create new data items autonomously.
Centralized Scheme - Name Server
Structure:
name server assigns all names
each site maintains a record of local data items
sites ask name server to locate non-local data items
Advantages:
satisfies naming criteria 1-3
Disadvantages:
does not satisfy naming criterion 4
name server is a potential performance bottleneck
name server is a single point of failure
Use of Aliases
Alternative to centralized scheme: each site prefixes its own site
identifier to any name that it generates, e.g., site17.account.
Fulfills having a unique identifier, and avoids problems associated with
central control.
However, fails to achieve network transparency.
Solution: Create a set of aliases for data items; Store the mapping of
aliases to the real names at each site.
The user can be unaware of the physical location of a data item, and is
unaffected if the data item is moved from one site to another.
Distributed Transactions
Transaction may access data at several sites.
Each site has a local transaction manager responsible for:
Maintaining a log for recovery purposes
Participating in coordinating the concurrent execution of the
transactions executing at that site.
Each site has a transaction coordinator, which is responsible for:
Starting the execution of transactions that originate at the site.
Distributing sub-transactions at appropriate sites for execution.
Coordinating the termination of each transaction that originates at
the site, which may result in the transaction being committed at all
sites or aborted at all sites.
Transaction System Architecture
System Failure Modes
Failures unique to distributed systems:
Failure of a site.
Loss of messages
Handled by network transmission control protocols such as TCP/IP
Failure of a communication link
Handled by network protocols, by routing messages via alternative
links
Network partition
A network is said to be partitioned when it has been split into two
or more subsystems that lack any connection between them
Note: a subsystem may consist of a single node
Commit Protocols
Commit protocols are used to ensure atomicity across sites
a transaction which executes at multiple sites must either be committed
at all the sites, or aborted at all the sites.
not acceptable to have a transaction committed at one site and aborted
at another
The two-phase commit (2PC) protocol is widely used
The three-phase commit (3PC) protocol is more complicated and more
expensive, but avoids some drawbacks of two-phase commit protocol. This
protocol is not used in practice.
Two Phase Commit Protocol (2PC)
Assumes fail-stop model – failed sites simply stop working, and do not
cause any other harm, such as sending incorrect messages to other sites.
Execution of the protocol is initiated by the coordinator after the last step
of the transaction has been reached.
The protocol involves all the local sites at which the transaction executed
Let T be a transaction initiated at site Si, and let the transaction
coordinator at Si be Ci
Phase 1: Obtaining a Decision
Coordinator asks all participants to prepare to commit transaction Ti.
Ci adds the records <prepare T> to the log and forces log to stable
storage
sends prepare T messages to all sites at which T executed
Upon receiving message, transaction manager at site determines if it can
commit the transaction
if not, add a record <no T> to the log and send abort T message to Ci
if the transaction can be committed, then:
add the record <ready T> to the log
force all records for T to stable storage
send ready T message to Ci
Phase 2: Recording the Decision
T can be committed if Ci received a ready T message from all the
participating sites: otherwise T must be aborted.
Coordinator adds a decision record, <commit T> or <abort T>, to the
log and forces the record onto stable storage. Once the record reaches
stable storage it is irrevocable (even if failures occur)
Coordinator sends a message to each participant informing it of the
decision (commit or abort)
Participants take appropriate action locally.
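A minimal sketch of the coordinator's decision rule; messaging and forcing records to stable storage are reduced to print statements, and the vote values are illustrative:
```python
# Two-phase commit, phase 2: commit only if every participant voted ready.
def two_phase_commit(votes):
    # Phase 1 already done: 'votes' maps each site to "ready" or "no".
    decision = "commit" if all(v == "ready" for v in votes.values()) else "abort"
    print("<%s T> forced to stable storage" % decision)   # point of no return
    for site in votes:
        print("send", decision, "T to", site)             # phase 2 messages
    return decision

print(two_phase_commit({"S1": "ready", "S2": "ready", "S3": "ready"}))  # commit
print(two_phase_commit({"S1": "ready", "S2": "no"}))                    # abort
```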
Handling of Failures - Site Failure
When a participating site Sk recovers, it examines its log to determine the fate
of transactions that were active at the time of the failure:
Log contains <commit T> record: site executes redo (T)
Log contains <abort T> record: site executes undo (T)
Log contains <ready T> record: site must consult Ci to determine the
fate of T.
If T committed, redo (T)
If T aborted, undo (T)
The log contains no control records concerning T: this implies that Sk failed
before responding to the prepare T message from Ci
since the failure of Sk precludes the sending of such a
response, Ci must abort T
Sk must execute undo (T)
Handling of Failures-Coordinator Failure
If coordinator fails while the commit protocol for T is executing then
participating sites must decide on T’s fate:
1. If an active site contains a <commit T> record in its log, then T
must be committed.
2. If an active site contains an <abort T> record in its log, then T must
be aborted.
3. If some active participating site does not contain a <ready T>
record in its log, then the failed coordinator Ci cannot have decided
to commit T. Can therefore abort T.
4. If none of the above cases holds, then all active sites must have a
<ready T> record in their logs, but no additional control records
(such as <abort T> or <commit T>). In this case active sites must
wait for Ci to recover, to find the decision.
Blocking problem: active sites may have to wait for failed coordinator
to recover.
Handling of Failures - Network Partition
If the coordinator and all its participants remain in one partition, the failure
has no effect on the commit protocol.
If the coordinator and its participants belong to several partitions:
Sites that are not in the partition containing the coordinator think the
coordinator has failed, and execute the protocol to deal with failure of
the coordinator.
No harm results, but sites may still have to wait for decision from
coordinator.
The coordinator and the sites that are in the same partition as the coordinator
think that the sites in the other partition have failed, and follow the usual
commit protocol.
Again, no harm results
Recovery and Concurrency Control
In-doubt transactions have a <ready T>, but neither a
<commit T>, nor an <abort T> log record.
The recovering site must determine the commit-abort status of such
transactions by contacting other sites; this can slow and potentially block
recovery.
Recovery algorithms can note lock information in the log.
Instead of <ready T>, write out <ready T, L>, where L is the list of locks
held by T when the log record is written (read locks can be omitted).
For every in-doubt transaction T, all the locks noted in the
<ready T, L> log record are reacquired.
After lock reacquisition, transaction processing can resume; the commit or
rollback of in-doubt transactions is performed concurrently with the
execution of new transactions.
Alternative Models of Transaction Processing
Notion of a single transaction spanning multiple sites is inappropriate for
many applications
E.g., transaction crossing an organizational boundary
No organization would like to permit an externally initiated transaction
to block local transactions for an indeterminate period
Alternative models carry out transactions by sending messages
Code to handle messages must be carefully designed to ensure
atomicity and durability properties for updates
Isolation cannot be guaranteed, in that intermediate stages are visible,
but code must ensure no inconsistent states result due to
concurrency
Persistent messaging systems are systems that provide transactional
properties to messages
Messages are guaranteed to be delivered exactly once
Alternative Models (Cont.)
Motivating example: funds transfer between two banks
Two phase commit would have the potential to block updates on the
accounts involved in funds transfer
Alternative solution:
Debit money from source account and send a message to other site
Site receives message and credits destination account
Messaging has long been used for distributed transactions (even before
computers were invented!)
Atomicity issue
Once the transaction sending a message is committed, the message must be
guaranteed to be delivered
The guarantee holds as long as the destination site is up and reachable; code
to handle undeliverable messages must also be available
e.g., credit money back to source account.
If sending transaction aborts, message must not be sent
Concurrency Control
Modify concurrency control schemes for use in distributed environment.
We assume that each site participates in the execution of a commit
protocol to ensure global transaction atomicity.
We assume all replicas of any item are updated
Will see how to relax this in case of site failures later
Locking Protocols
The various locking protocols can be used in a distributed environment.
The only change that needs to be incorporated is in the way the lock manager deals
with replicated data.
We present several possible schemes that are applicable to an environment where
data can be replicated in several sites.
We shall assume the existence of the shared and exclusive lock modes.
Two approaches:
Single lock manager
Distributed lock manager
Single-Lock-Manager Approach
System maintains a single lock manager that resides in a single chosen site,
say Si
When a transaction needs to lock a data item, it sends a lock request to
Si and lock manager determines whether the lock can be granted
immediately
If yes, lock manager sends a message to the site which initiated the
request
If no, request is delayed until it can be granted, at which time a message
is sent to the initiating site
Single-Lock-Manager Approach (Cont.)
The transaction can read the data item from any one of the sites at which a
replica of the data item resides.
Writes must be performed on all replicas of a data item
Advantages of scheme:
Simple implementation
Simple deadlock handling
Disadvantages of scheme are:
Bottleneck: lock manager site becomes a bottleneck
Vulnerability: the system is vulnerable to lock manager site failure.
Distributed Lock Manager
In this approach, functionality of locking is implemented by lock managers
at each site
Lock managers control access to local data items
But special protocols may be used for replicas
Advantage: work is distributed and can be made robust to failures
Disadvantage: deadlock detection is more complicated
Lock managers cooperate for deadlock detection
Several variants of this approach
Primary copy
Majority protocol
Biased protocol
Quorum consensus
Primary Copy
Choose one replica of data item to be the primary copy.
Site containing the replica is called the primary site for that data item
Different data items can have different primary sites
When a transaction needs to lock a data item Q, it requests a lock at the
primary site of Q.
Implicitly gets lock on all replicas of the data item
Benefit
Concurrency control for replicated data handled similarly to
unreplicated data - simple implementation.
Drawback
If the primary site of Q fails, Q is inaccessible even though other sites
containing a replica may be accessible.
Majority Protocol
When a transaction wishes to lock an unreplicated data item Q residing at
site Si, a message is sent to Si ‘s lock manager.
If Q is locked in an incompatible mode, then the request is delayed until
it can be granted.
When the lock request can be granted, the lock manager sends a
message back to the initiator indicating that the lock request has been
granted.
Majority Protocol (Cont.)
In case of replicated data
If Q is replicated at n sites, then a lock request message must be sent to
more than half of the n sites in which Q is stored.
The transaction does not operate on Q until it has obtained a lock on a
majority of the replicas of Q.
When writing the data item, transaction performs writes on all replicas.
Benefit
Can be used even when some sites are unavailable
Drawback
Requires 2(n/2 + 1) messages for handling lock requests, and (n/2 + 1)
messages for handling unlock requests.
Potential for deadlock even with a single item - e.g., each of 3 transactions
may have locks on 1/3rd of the replicas of a data item.
Biased Protocol
Local lock manager at each site as in majority protocol, however, requests
for shared locks are handled differently than requests for exclusive locks.
Shared locks. When a transaction needs to lock data item Q, it simply
requests a lock on Q from the lock manager at one site containing a replica
of Q.
Exclusive locks. When transaction needs to lock data item Q, it requests a
lock on Q from the lock manager at all sites containing a replica of Q.
Advantage - imposes less overhead on read operations.
Disadvantage - additional overhead on writes
Quorum Consensus Protocol
A generalization of both majority and biased protocols
Each site is assigned a weight.
Let S be the total of all site weights
Choose two values read quorum Qr and write quorum Qw
Such that Qr + Qw > S and 2 * Qw > S
Quorums can be chosen (and S computed) separately for each item
Each read must lock enough replicas that the sum of the site weights is
>= Qr
Each write must lock enough replicas that the sum of the site weights is
>= Qw
For now we assume all replicas are written
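A small sketch (illustrative weights and quorums) of the quorum-consensus conditions and the check a read or write must pass:
```python
# Quorum consensus: the two inequalities guarantee that read/write and
# write/write operations always overlap in at least one site.
weights = {"S1": 1, "S2": 1, "S3": 1, "S4": 2}   # site weights (illustrative)
S = sum(weights.values())                        # S = 5
Qr, Qw = 2, 4                                    # read and write quorums

assert Qr + Qw > S and 2 * Qw > S                # required conditions

def quorum_reached(sites, quorum):
    return sum(weights[s] for s in sites) >= quorum

print(quorum_reached({"S1", "S2"}, Qr))          # True: enough weight to read
print(quorum_reached({"S4", "S1", "S2"}, Qw))    # True: enough weight to write
```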
Timestamping
Timestamp based concurrency-control protocols can be used in
distributed systems
Each transaction must be given a unique timestamp
Main problem: how to generate a timestamp in a distributed fashion
Each site generates a unique local timestamp using either a logical
counter or the local clock.
Global unique timestamp is obtained by concatenating the unique local
timestamp with the site's unique identifier.
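A sketch (illustrative site ids) of generating globally unique timestamps from a local logical counter and the site identifier:
```python
# Globally unique timestamps: local logical counter concatenated with the
# site id (site id in the low-order position so no site always dominates).
import itertools

N_SITES = 8                                    # number of sites (illustrative)

def make_timestamp_generator(site_id):
    counter = itertools.count(1)               # local logical clock
    def next_timestamp():
        return next(counter) * N_SITES + site_id   # encodes (counter, site id) in one integer
    return next_timestamp

ts_site2 = make_timestamp_generator(2)
ts_site5 = make_timestamp_generator(5)
print(ts_site2(), ts_site5(), ts_site2())      # 10 13 18 -- unique across sites
```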
Replication with Weak Consistency
Many commercial databases support replication of data with weak degrees
of consistency (i.e., without a guarantee of serializability)
E.g., master-slave replication: updates are performed at a single “master”
site, and propagated to “slave” sites.
Propagation is not part of the update transaction: it is decoupled
May be immediately after transaction commits
May be periodic
Data may only be read at slave sites, not updated
No need to obtain locks at any remote site
Particularly useful for distributing information
E.g., from central office to branch-office
Also useful for running read-only queries offline from the main database
Replication with Weak Consistency (Cont.)
Replicas should see a transaction-consistent snapshot of the database
That is, a state of the database reflecting all effects of all transactions up
to some point in the serialization order, and no effects of any later
transactions.
E.g., Oracle provides a create snapshot statement to create a snapshot of
a relation or a set of relations at a remote site
snapshot refresh either by recomputation or by incremental update
Automatic refresh (continuous or periodic) or manual refresh
Multimaster and Lazy Replication
With multimaster replication (also called update-anywhere replication)
updates are permitted at any replica, and are automatically propagated to all
replicas
Basic model in distributed databases, where transactions are unaware of
the details of replication, and database system propagates updates as
part of the same transaction
Coupled with 2 phase commit
Many systems support lazy propagation where updates are transmitted
after transaction commits
Allows updates to occur even if some sites are disconnected from the
network, but at the cost of consistency
Deadlock Handling
Consider the following two transactions and history, with item X and
transaction T1 at site 1, and item Y and transaction T2 at site 2:
T1: write (X)               T2: write (Y)
    write (Y)                   write (X)
T1                          T2
X-lock on X                 X-lock on Y
write (X)                   write (Y)
wait for X-lock on Y        wait for X-lock on X
Each transaction waits for a lock held by the other, yet neither local
wait-for graph alone contains a cycle.
Centralized Approach
A global wait-for graph is constructed and maintained in a single site; the
deadlock-detection coordinator
Real graph: Real, but unknown, state of the system.
Constructed graph: Approximation generated by the controller during the
execution of its algorithm.
the global wait-for graph can be constructed when:
a new edge is inserted in or removed from one of the local wait-for
graphs.
a number of changes have occurred in a local wait-for graph.
the coordinator needs to invoke cycle-detection.
If the coordinator finds a cycle, it selects a victim and notifies all sites. The
sites roll back the victim transaction.
Local and Global Wait-For Graphs
Local
Global
Unnecessary Rollbacks
Unnecessary rollbacks may result when deadlock has indeed occurred and
a victim has been picked, and meanwhile one of the transactions was
aborted for reasons unrelated to the deadlock.
Unnecessary rollbacks can result from false cycles in the global wait-for
graph; however, likelihood of false cycles is low.
End of Chapter
Three Phase Commit (3PC)
Assumptions:
No network partitioning
At any point, at least one site must be up.
At most K sites (participants as well as coordinator) can fail
Phase 1: Obtaining Preliminary Decision: Identical to 2PC Phase 1.
Every site is ready to commit if instructed to do so
Phase 2 of 2PC is split into 2 phases, Phase 2 and Phase 3 of 3PC
In phase 2 coordinator makes a decision as in 2PC (called the pre-commit
decision) and records it in multiple (at least K) sites
In phase 3, coordinator sends commit/abort message to all participating sites.
Under 3PC, knowledge of pre-commit decision can be used to commit despite
coordinator failure
Avoids blocking problem as long as < K sites fail
Drawbacks:
higher overheads
assumptions may not be satisfied in practice