
Unit 2: Parallel and Distributed Databases

 Database System Architectures:


➢ Centralized and Client Server Architectures
➢ Server System Architectures
➢ Parallel Systems
➢ Distributed Systems
 Parallel Databases:
➢ I/O parallelism
➢ Inter and Intra Query Parallelism
➢ Inter and Intra operation Parallelism
➢ Design of Parallel Systems
➢ Distributed Database Concepts
➢ Distributed Data Storage
➢ Distributed Transactions
➢ Commit Protocols
➢ Concurrency Control

Parallel and Distributed Database
 Networking of computers allows some tasks to be executed on a server
system and some tasks to be executed on client systems. This division of
work has led to client–server database systems.
 Parallel processing within a computer system allows database-system
activities to be sped up, allowing faster response to transactions as well
as more transactions per second. The need for parallel query processing
has led to parallel database systems.
 Keeping multiple copies of the database across different sites also allows
large organizations to continue their database operations even when one
site is affected by a natural disaster, such as flood, fire, or earthquake.
Distributed database systems handle geographically or administratively
distributed data spread across multiple database systems.

Database System Architectures
 Centralized and Client-Server Systems
 Server System Architectures
 Parallel Systems
 Distributed Systems

Centralized and Client–Server Architectures
 Centralized database systems are those that run on a
single computer system and do not interact with other
computer systems.
 Client–server systems, on the other hand, have
functionality split between a server system and multiple
client systems.

Centralized Systems
 Run on a single computer system and do not interact with other
computer systems.
 General-purpose computer system: one to a few CPUs and a number of
device controllers that are connected through a common bus that
provides access to shared memory.
 Single-user system (e.g., personal computer or workstation): desktop
unit, single user, usually has only one CPU and one or two hard disks;
the OS may support only one user.
 Multi-user system: more disks, more memory, multiple CPUs, and a
multi-user OS. Serves a large number of users who are connected to the
system via terminals. Often called server systems.

A Centralized Computer System

Client-Server Systems
 Client–server systems have functionality split between a server system
and multiple client systems.
 Server systems satisfy requests generated at m client systems.

Client-Server Systems (Cont.)
 Database functionality can be divided into:
 Back-end: manages access structures, query evaluation and optimization, concurrency
control and recovery.
 Front-end: consists of tools such as forms, report-writers, and graphical user interface
facilities.
 The interface between the front-end and the back-end is through SQL or
through an application program interface.
 Standards such as ODBC and JDBC were developed to interface clients with
servers. Any client that uses the ODBC or JDBC interface can connect to any
server that provides the interface, as in the sketch below.
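
 As an illustration, a minimal JDBC sketch of a client sending an SQL request to a back-end server and receiving the results. The connection URL, table, and credentials are hypothetical, and a suitable JDBC driver is assumed to be on the classpath:

```java
import java.sql.*;

public class JdbcClientExample {
    public static void main(String[] args) throws SQLException {
        // Hypothetical connection URL and credentials; any server that exposes
        // a JDBC driver can be reached through the same standard interface.
        String url = "jdbc:postgresql://dbserver.example.com:5432/university";
        try (Connection conn = DriverManager.getConnection(url, "dbuser", "dbpass");
             PreparedStatement stmt = conn.prepareStatement(
                     "SELECT dept_name, budget FROM department WHERE budget > ?")) {
            stmt.setInt(1, 100000);                    // request is specified in SQL
            try (ResultSet rs = stmt.executeQuery()) { // executed at the server
                while (rs.next()) {                    // results shipped back to client
                    System.out.println(rs.getString(1) + " " + rs.getInt(2));
                }
            }
        }
    }
}
```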

Client-Server Systems (Cont.)
 Advantages of replacing mainframes with networks of workstations
or personal computers connected to back-end server machines:
 better functionality for the cost
 flexibility in locating resources and expanding facilities
 better user interfaces
 easier maintenance

Server System Architecture
 Server systems can be broadly categorized into two kinds:
 Transaction servers which are widely used in relational database systems, and
 Data servers, used in object-oriented database systems

Transaction Servers
 Also called query server systems or SQL server systems
 Clients send requests to the server
 Transactions are executed at the server
 Results are shipped back to the client.
 Requests are specified in SQL, and communicated to the server
through a remote procedure call (RPC) mechanism.
 Transactional RPC allows many RPC calls to form a transaction.
 Open Database Connectivity (ODBC) is a C language application
program interface standard from Microsoft for connecting to a server,
sending SQL requests, and receiving results.
 JDBC standard is similar to ODBC, for Java.

Transaction Server Process Structure
 A typical transaction server consists of multiple processes accessing data in
shared memory.
 Server processes
 These receive user queries (transactions), execute them and send results back
 Processes may be multithreaded, allowing a single process to execute several
user queries concurrently
 Typically multiple multithreaded server processes (hybrid architecture).
 Lock manager process
 This process implements lock manager functionality, which includes lock grant,
lock release, and deadlock detection.
 Database writer process
 There are one or more processes that output modified buffer blocks back to
disk on a continuous basis.

Transaction Server Processes (Cont.)
 Log writer process
 Server processes simply add log records to log record buffer
 Log writer process outputs log records to stable storage.
 Checkpoint process
 Performs periodic checkpoints
 Process monitor process
 Monitors other processes, and takes recovery actions if any of the
other processes fail
 E.g., aborting any transactions being executed by a server process
and restarting it

Transaction Server Processes (Cont.)
 Shared memory contains shared data
 Buffer pool
 Lock table
 Log buffer
 Cached query plans (reused if same query submitted again)
 All database processes can access shared memory
 To ensure that no two processes access the same data structure
at the same time, database systems implement mutual exclusion
using either
 Operating system semaphores
 Atomic instructions such as test-and-set
 To avoid the overhead of interprocess communication for lock
request/grant, each database process operates directly on the lock table
 instead of sending requests to the lock manager process
 Lock manager process still used for deadlock detection

Data Servers
 Used in high-speed LANs, in cases where
 there is a high-speed connection between the clients and the server, the
clients are comparable in processing power to the server, and
 the tasks to be executed are compute intensive.
 Data are shipped to clients, where processing is performed, and the results
are then shipped back to the server.
 This architecture requires full back-end functionality at the clients.
 Used in many object-oriented database systems
 Issues:
 Page-Shipping versus Item-Shipping
 Locking
 Data Caching
 Lock Caching

Data Servers (Cont.)
 Page-shipping versus item-shipping
 Smaller unit of shipping ⇒ more messages
 Worth prefetching related items along with requested item
 Page shipping can be thought of as a form of prefetching
 Locking
 Overhead of requesting and getting locks from server is high due to
message delays
 Can grant locks on requested and prefetched items; with page
shipping, transaction is granted lock on whole page.
 Locks on prefetched items can be called back by the server, and
returned by the client transaction if the prefetched item has not been
used.
 Locks on the page can be deescalated to locks on items in the page
when there are lock conflicts. Locks on unused items can then be
returned to server.

Data Servers (Cont.)
 Data Caching
 Data can be cached at client even in between transactions
 But check that data is up-to-date before it is used (cache
coherency)
 Check can be done when requesting lock on data item
 Lock Caching
 Locks can be retained by client system even in between transactions
 Transactions can acquire cached locks locally, without contacting
server
 Server calls back locks from clients when it receives conflicting lock
request. Client returns lock once no local transaction is using it.
 Similar to deescalation, but across transactions.

Parallel Systems
 Parallel database systems consist of multiple processors and multiple
disks connected by a fast interconnection network.
 A coarse-grain parallel machine consists of a small number of
powerful processors
 A massively parallel or fine grain parallel machine utilizes
thousands of smaller processors.
 Two main performance measures:
 throughput --- the number of tasks that can be completed in a
given time interval
 response time --- the amount of time it takes to complete a
single task from the time it is submitted

Speed-Up and Scale-Up
 Speedup: a fixed-sized problem executing on a small system is given
to a system which is N times larger.
 Measured by:
speedup = (small system elapsed time) / (large system elapsed time)
 Speedup is linear if the ratio equals N.
 Scaleup: increase the size of both the problem and the system
 An N-times larger system is used to perform an N-times larger job
 Measured by:
scaleup = (small system, small problem elapsed time) / (big system, big problem elapsed time)
 Scaleup is linear if the ratio equals 1.
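 For example, if a job takes 100 seconds on the small system and 25 seconds
on a system four times as large, speedup = 100/25 = 4, which is linear speedup.
 If a job four times as large takes 100 seconds on the four-times-larger system,
while the original job took 100 seconds on the small system,
scaleup = 100/100 = 1, which is linear scaleup.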

Speedup

Scaleup

Batch and Transaction Scaleup
 Batch scaleup:
 A single large job; typical of most decision support queries and
scientific simulation.
 Use an N-times larger computer on N-times larger problem.
 Transaction scaleup:
 Numerous small queries submitted by independent users to a
shared database; typical of transaction-processing and timesharing
systems.
 N-times as many users submitting requests (hence, N-times as many
requests) to an N-times larger database, on an N-times larger
computer.
 Well-suited to parallel execution.

Factors Limiting Speedup and Scaleup
Speedup and scaleup are often sublinear due to:
 Startup costs: Cost of starting up multiple processes may dominate
computation time, if the degree of parallelism is high.
 Interference: Processes accessing shared resources (e.g., system
bus, disks, or locks) compete with each other, thus spending time
waiting on other processes, rather than performing useful work.
 Skew: Increasing the degree of parallelism increases the variance in
the service times of tasks executing in parallel. Overall execution time is
determined by the slowest of the tasks executing in parallel.

Interconnection Network Architectures
 Bus. System components send data on and receive data
from a single communication bus;
 Does not scale well with increasing parallelism.
 Mesh. Components are arranged as nodes in a grid, and
each component is connected to all adjacent
components
 The number of communication links grows with the number of
components, and so the mesh scales better.
 But it may require 2√n hops to send a message to a node
(or √n with wraparound connections at the edge of the grid).
 Hypercube. Components are numbered in binary;
components are connected to one another if their
binary representations differ in exactly one bit.
 Each of the n components is connected to log(n) other
components, and any two components can reach each other via at most
log(n) links; this reduces communication delays.

Parallel Database Architectures
 Shared memory – All processors
share a common memory
 Shared disk – All processors share
a common disk (cluster)
 Shared nothing -- processors
share neither a common memory
nor common disk
 Hierarchical -- hybrid of the above
3 architectures

Shared Memory
 Processors and disks have access to a common memory, typically via a
bus or through an interconnection network.
 Extremely efficient communication between processors — data in
shared memory can be accessed by any processor without having to
move it using software.
 Downside – architecture is not scalable beyond 32 or 64 processors
since the bus or the interconnection network becomes a bottleneck
 Widely used for lower degrees of parallelism (4 to 8).

Shared Disk
 All processors can directly access all disks via an interconnection
network, but the processors have private memories.
 The memory bus is not a bottleneck
 Architecture provides a degree of fault-tolerance — if a
processor fails, the other processors can take over its tasks since
the database is resident on disks that are accessible from all
processors.
 Examples: IBM Sysplex and DEC clusters (now part of Compaq)
running Rdb (now Oracle Rdb) were early commercial users
 Downside: bottleneck now occurs at interconnection
to the disk subsystem.
 Shared-disk systems can scale to a somewhat larger
number of processors, but communication between
processors is slower.
Shared Nothing
 Node consists of a processor, memory and one or more disks.
 Processors at one node communicate with another processor at another
node using an interconnection network.
 A node functions as the server for the data on the disk or disks the node
owns.
 Examples: Teradata, Tandem, Oracle nCUBE
 Data accessed from local disks (and local memory accesses) do not pass
through interconnection network, thereby minimizing the interference of
resource sharing.
 Shared-nothing multiprocessors can be scaled up
to thousands of processors without interference.
 Main drawback: cost of communication and non-local
disk access; sending data involves software interaction
at both ends.
Hierarchical
 Combines characteristics of shared-memory, shared-disk, and shared-
nothing architectures.
 Top level is a shared-nothing architecture – nodes connected by an
interconnection network, and do not share disks or memory with
each other.
 Each node of the system could be a shared-memory system with a
few processors.
 Alternatively, each node could be a shared-disk system, and each of
the systems sharing a set of disks could be a shared-memory system.
 The complexity of programming such systems can be reduced by
distributed virtual-memory architectures
 Also called non-uniform memory
architecture (NUMA)

Distributed Systems
 Data spread over multiple machines (also referred to as sites or nodes).
 Network interconnects the machines
 Data shared by users on multiple machines

Distributed Databases
 Homogeneous distributed databases
 Same software/schema on all sites, data may be partitioned among sites
 Goal: provide a view of a single database, hiding details of distribution
 Heterogeneous distributed databases
 Different software/schema on different sites
 Goal: integrate (combine) existing databases to provide useful
functionality
 Differentiate between local and global transactions
 A local transaction accesses data in the single site at which the
transaction was initiated.
 A global transaction either accesses data in a site different from the
one at which the transaction was initiated or accesses data in several
different sites.

Trade-offs in Distributed Systems
 Sharing data – users at one site able to access the data residing at
some other sites.
 Autonomy – each site is able to retain a degree of control over data
stored locally.
 Higher system availability through redundancy — data can be
replicated at remote sites, and system can function even if a site fails.
 Disadvantage: added complexity required to ensure proper
coordination among sites.
 Software development cost.
 Greater potential for bugs.
 Increased processing overhead.

Parallel Databases
 Introduction
 I/O Parallelism
 Interquery Parallelism
 Intraquery Parallelism
 Intraoperation Parallelism
 Interoperation Parallelism
 Design of Parallel Systems

Introduction
 Parallel machines are becoming quite common and affordable
 Prices of microprocessors, memory and disks have dropped sharply
 Recent desktop computers feature multiple processors and this
trend is projected to accelerate
 Databases are growing increasingly large
 large volumes of transaction data are collected and stored for later
analysis.
 multimedia objects like images are increasingly stored in databases
 Large-scale parallel database systems increasingly used for:
 storing large volumes of data
 processing time-consuming decision-support queries
 providing high throughput for transaction processing

Parallelism in Databases
1. Data can be partitioned across multiple disks for parallel I/O.
2. Individual relational operations (e.g., sort, join, aggregation) can be
executed in parallel
 data can be partitioned and each processor can work independently
on its own partition.
3. Queries are expressed in high level language (SQL, translated to relational
algebra)
 makes parallelization easier.
4. Different queries can be run in parallel with each other. Concurrency
control takes care of conflicts.
 Thus, databases naturally lend themselves to parallelism.

I/O Parallelism
 Reduce the time required to retrieve relations from disk by partitioning
the relations across multiple disks.
 Horizontal partitioning – tuples of a relation are divided among many
disks such that each tuple resides on one disk.
 Partitioning techniques (number of disks = n):
Round-robin:
Send the i-th tuple inserted in the relation to disk i mod n.
Hash partitioning:
 Choose one or more attributes as the partitioning attributes.
 Choose hash function h with range 0…n - 1
 Let i denote result of hash function h applied to the partitioning
attribute value of a tuple. Send tuple to disk i.

Example of Hash Partitioning
 Assume each character is represented by its character (ASCII) code, e.g., 'M' = 77.
 There are 8 buckets (0 to 7).
 The hash function returns the sum of the character codes of the letters in the value,
modulo 8.
 E.g., h(Music) = 1, h(History) = 2,
h(Physics) = 3, h(Elec. Eng.) = 3
 M = 01001101 (77)
 u = 01110101 (117)
 s = 01110011 (115)
 i = 01101001 (105)
 c = 01100011 (99)
 Sum = 513 (binary 1000000001)
 513 mod 8 = 1
 Music is therefore assigned to bucket 1 (see the sketch below).
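
 A small sketch of this hash-partitioning scheme (the class and method names are illustrative, not from the text). It reproduces h(Music) = 1 under the assumption that each character contributes its ASCII code:

```java
public class HashPartitionExample {
    // Hash partitioning: sum of character codes modulo the number of disks/buckets.
    static int hashPartition(String key, int n) {
        int sum = 0;
        for (char c : key.toCharArray()) {
            sum += c;                 // ASCII/Unicode code of the character
        }
        return sum % n;               // disk (bucket) number in 0 .. n-1
    }

    public static void main(String[] args) {
        int n = 8;                    // 8 buckets, numbered 0 to 7
        System.out.println(hashPartition("Music", n));    // 513 % 8 = 1
        System.out.println(hashPartition("History", n));  // 2
        System.out.println(hashPartition("Physics", n));  // 3
    }
}
```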



I/O Parallelism (Cont.)
 Partitioning techniques (cont.):
 Range partitioning:
 Choose an attribute as the partitioning attribute.
 A partitioning vector [v0, v1, ..., vn-2] is chosen.
 Let v be the partitioning attribute value of a tuple.
 Tuples with v < v0 go to disk 0 and tuples with v ≥ vn-2 go to disk n-1.
 Tuples with vi ≤ v < vi+1 go to disk i + 1.
E.g., with partitioning vector [5,11] and 3 disks, a tuple with partitioning
attribute value 2 will go to disk 0, a tuple with value 8 will go to
disk 1, while a tuple with value 20 will go to disk 2.
E.g., with partitioning vector [f, l, r] and 4 disks, a tuple whose partitioning
attribute value (a name) starts with 'd' will go to disk 0, and one whose value
starts with 'm' will go to disk 2 (see the sketch below).
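
 A corresponding sketch for range partitioning with the vector [5, 11] and 3 disks from the numeric example above (illustrative code, not from the text):

```java
public class RangePartitionExample {
    // Range partitioning: tuples with v < vec[0] go to disk 0,
    // tuples with vec[i] <= v < vec[i+1] go to disk i+1,
    // and tuples with v >= vec[n-2] go to the last disk.
    static int rangePartition(int v, int[] vec) {
        int disk = 0;
        while (disk < vec.length && v >= vec[disk]) {
            disk++;
        }
        return disk;
    }

    public static void main(String[] args) {
        int[] vec = {5, 11};                          // partitioning vector for 3 disks
        System.out.println(rangePartition(2, vec));   // disk 0
        System.out.println(rangePartition(8, vec));   // disk 1
        System.out.println(rangePartition(20, vec));  // disk 2
    }
}
```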

Comparison of Partitioning Techniques
 Evaluate how well partitioning techniques support the following types of
data access:
1. Scanning the entire relation.
2. Locating a tuple associatively – point queries.
 E.g., r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a
specified range – range queries.
 E.g., 10 ≤ r.A < 25.

Comparison of Partitioning Techniques (Cont.)

Round robin:
 Advantages
 Best suited for sequential scan of entire relation on each query.
 All disks have almost an equal number of tuples; retrieval work is
thus well balanced between disks.
 Range queries are difficult to process
 No clustering -- tuples are scattered across all disks

Comparison of Partitioning Techniques (Cont.)
Hash partitioning:
 Good for sequential access
 Assuming hash function is good, and partitioning attributes form a
key, tuples will be equally distributed between disks
 Retrieval work is then well balanced between disks.
 Good for point queries on partitioning attribute
 Can lookup single disk, leaving others available for answering other
queries.
 Index on partitioning attribute can be local to disk, making lookup
and update more efficient
 No clustering, so difficult to answer range queries

Comparison of Partitioning Techniques (Cont.)
Range partitioning:
 Provides data clustering by partitioning attribute value.
 Good for sequential access
 Good for point queries on partitioning attribute: only one disk needs to
be accessed.
 For range queries on partitioning attribute, one to a few disks may need
to be accessed
 Remaining disks are available for other queries.
 Good if result tuples are from one to a few blocks.
 If many blocks are to be fetched, they are still fetched from one to a
few disks, and potential parallelism in disk access is wasted
 Example of execution skew.

Partitioning a Relation across Disks
 If a relation contains only a few tuples which will fit into a single disk
block, then assign the relation to a single disk.
 Large relations are preferably partitioned across all the available disks.
 If a relation consists of m disk blocks and there are n disks available in
the system, then the relation should be allocated min(m,n) disks.

Handling of Skew
 The distribution of tuples to disks may be skewed — that is, some disks
have many tuples, while others may have fewer tuples.
 Types of skew:
 Attribute-value skew.
 Some values appear in the partitioning attributes of many tuples; all
the tuples with the same value for the partitioning attribute end up
in the same partition.
 Can occur with range-partitioning and hash-partitioning.
 Partition skew.
 With range-partitioning, badly chosen partition vector may assign
too many tuples to some partitions and too few to others. Ex:
Partition on Income level
 Less likely with hash-partitioning if a good hash function is chosen,
e.g., one that sums the character codes of the name attribute modulo the number of partitions.

Handling Skew in Range-Partitioning
 To create a balanced partitioning vector (assuming partitioning
attribute forms a key of the relation):
 Sort the relation on the partitioning attribute.
 Construct the partition vector by scanning the relation in sorted
order as follows.
 After every 1/nth of the relation has been read, the value of the
partitioning attribute of the next tuple is added to the partition
vector.
 n denotes the number of partitions to be constructed.
 Duplicate entries or imbalances can result if duplicates are present in the
partitioning attribute.
 An alternative technique based on histograms is used in practice.
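
 A minimal sketch of the basic construction above, assuming the partitioning-attribute values have already been collected and the attribute is a key (the names are illustrative):

```java
import java.util.Arrays;

public class BalancedPartitionVector {
    // Build a range-partition vector with n partitions from the
    // partitioning-attribute values of the relation.
    static int[] buildVector(int[] values, int n) {
        int[] sorted = values.clone();
        Arrays.sort(sorted);                      // sort on the partitioning attribute
        int[] vector = new int[n - 1];
        for (int i = 1; i < n; i++) {
            // after every 1/n-th of the relation, record the next attribute value
            vector[i - 1] = sorted[i * sorted.length / n];
        }
        return vector;
    }

    public static void main(String[] args) {
        int[] values = {3, 9, 1, 12, 7, 20, 15, 5};   // toy relation, n = 4 partitions
        System.out.println(Arrays.toString(buildVector(values, 4)));  // [5, 9, 15]
    }
}
```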

Interquery Parallelism
 Queries/transactions execute in parallel with one another.
 Increases transaction throughput; used primarily to scale up a
transaction processing system to support a larger number of
transactions per second.
 Easiest form of parallelism to support, particularly in a shared-memory
parallel database.
 More complicated to implement on shared-disk or shared-nothing
architectures
 Locking and logging must be coordinated by passing messages
between processors.
 Data in a local buffer may have been updated at another processor.
 Cache-coherency has to be maintained — reads and writes of data
in buffer must find latest version of data.

Cache Coherency Protocol
 Example of a cache coherency protocol for shared disk systems:
 Before reading/writing to a page, the page must be locked in
shared/exclusive mode.
 On locking a page, the page must be read from disk
 Before unlocking a page, the page must be written to disk if it was
modified.
 More complex protocols with fewer disk reads/writes exist.
 Cache coherency protocols for shared-nothing systems are similar. Each
database page is assigned a home processor. Requests to fetch the page
or write it to disk are sent to the home processor.
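
 An illustrative sketch of the read/write rules of this shared-disk protocol. LockManager, Disk, and the method names are hypothetical stand-ins, not a real DBMS API:

```java
// Minimal sketch of the shared-disk cache-coherency rules described above.
interface LockManager {
    void lockShared(int pageId);
    void lockExclusive(int pageId);
    void unlock(int pageId);
}
interface Disk {
    byte[] read(int pageId);
    void write(int pageId, byte[] data);
}

class CoherentBuffer {
    private final LockManager locks;
    private final Disk disk;
    CoherentBuffer(LockManager locks, Disk disk) { this.locks = locks; this.disk = disk; }

    // Before reading a page, lock it in shared mode; on locking, read it from disk.
    byte[] readPage(int pageId) {
        locks.lockShared(pageId);
        return disk.read(pageId);
    }

    // Before writing, lock the page in exclusive mode; before unlocking, write the
    // modified page back to disk so other processors see the latest version.
    void updatePage(int pageId, byte[] newContents) {
        locks.lockExclusive(pageId);
        disk.write(pageId, newContents);
        locks.unlock(pageId);
    }
}
```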

Intraquery Parallelism
 Execution of a single query in parallel on multiple processors/disks;
important for speeding up long-running queries.
 Two complementary forms of intraquery parallelism:
 Intraoperation Parallelism – parallelize the execution of each
individual operation in the query.
 Interoperation Parallelism – execute the different operations in a
query expression in parallel.
The first form scales better with increasing parallelism because
the number of tuples processed by each operation is typically more than
the number of operations in a query.

Parallel Sort
Range-Partitioning Sort
 Choose processors P0, ..., Pm, where m ≤ n − 1, to do the sorting.
 Create a range-partition vector with m entries, on the sorting attributes
 Redistribute the relation using range partitioning
 all tuples that lie in the ith range are sent to processor Pi
 Pi stores the tuples it received temporarily on disk Di.
 This step requires I/O and communication overhead.
 Each processor Pi sorts its partition of the relation locally.
 Each processor executes the same operation (sort) in parallel with the other
processors, without any interaction with the others (data parallelism).
 Final merge operation is trivial: range partitioning ensures that, for 1 ≤ i <
j ≤ m, the key values in processor Pi are all less than the key values in Pj.
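
 A compact, illustrative sketch of range-partitioning sort, using Java threads as stand-in "processors"; the partition vector and data are made up for the example:

```java
import java.util.*;
import java.util.concurrent.*;

public class RangePartitioningSortExample {
    public static void main(String[] args) throws Exception {
        int[] vector = {10, 20};   // range-partition vector -> 3 "processors"
        List<Integer> relation = Arrays.asList(25, 3, 17, 9, 28, 14, 1, 22);

        // Step 1: redistribute tuples so that tuples in the i-th range go to processor i.
        List<List<Integer>> partitions = new ArrayList<>();
        for (int i = 0; i <= vector.length; i++) partitions.add(new ArrayList<>());
        for (int v : relation) {
            int p = 0;
            while (p < vector.length && v >= vector[p]) p++;
            partitions.get(p).add(v);
        }

        // Step 2: each processor sorts its own partition in parallel (data parallelism).
        ExecutorService processors = Executors.newFixedThreadPool(partitions.size());
        List<Future<?>> pending = new ArrayList<>();
        for (List<Integer> part : partitions) {
            pending.add(processors.submit(() -> Collections.sort(part)));
        }
        for (Future<?> f : pending) f.get();
        processors.shutdown();

        // Step 3: the final "merge" is trivial -- concatenate the partitions in range order.
        List<Integer> sorted = new ArrayList<>();
        partitions.forEach(sorted::addAll);
        System.out.println(sorted);   // [1, 3, 9, 14, 17, 22, 25, 28]
    }
}
```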

Parallel Sort (Cont.)
Parallel External Sort-Merge
 Assume the relation has already been partitioned among disks D0, ..., Dn-1
(in whatever manner).
 Each processor Pi locally sorts the data on disk Di.
 The sorted runs on each processor are then merged to get the final sorted
output.
 Parallelize the merging of sorted runs as follows:
 The sorted partitions at each processor Pi are range-partitioned across
the processors P0, ..., Pm-1.
 Each processor Pi performs a merge on the streams as they are received,
to get a single sorted run.
 The sorted runs on processors P0,..., Pm-1 are concatenated to get the
final result.

Parallel Join
 The join operation requires pairs of tuples to be tested to see if they satisfy
the join condition, and if they do, the pair is added to the join output.
 Parallel join algorithms attempt to split the pairs to be tested over several
processors. Each processor then computes part of the join locally.
 In a final step, the results from each processor can be collected together to
produce the final result.

Partitioned Join
 For equi-joins and natural joins, it is possible to partition the two input
relations across the processors and compute the join locally at each
processor.
 Let r and s be the input relations, and we want to compute r ⋈r.A=s.B s.
 Suppose that we are using n processors and that the relations to be joined
are r and s.
 r and s each are partitioned into n partitions, denoted r0, r1, ..., rn-1 and s0,
s1, ..., sn-1.
 The system sends partitions ri and si to processor Pi, where their join is
computed locally.

Partitioned Join (Cont …)
 Can use either range partitioning or hash partitioning.
 In either case, the same partitioning function must be used for both
relations.
 For range partitioning, the same partition vector must be used for both
relations.
 For hash partitioning, the same hash function must be used on both
relations.
 Once the relations are partitioned, we can use any join technique locally at
each processor Pi to compute the join of ri and si .
 For example, hash join, merge join, or nested-loop join could be used.

Partitioned Join (Cont.)

Fragment-and-Replicate Join
 Partitioning is not applicable for some join conditions, e.g., if the join condition is
an inequality
 E.g., non-equijoin conditions, such as r.A > s.B.
 For joins where partitioning is not applicable, parallelization can be
accomplished by fragment and replicate technique
 Special case – asymmetric fragment-and-replicate:
 One of the relations, say r, is partitioned; any
partitioning technique can be used.
 The other relation, s, is replicated across all the
processors.
 Processor Pi then locally computes the join of ri
with all of s using any join technique.

Fragment-and-Replicate Join (Cont.)
 General case: reduces the sizes of the relations at each processor.
 r is partitioned into n partitions, r0, r1, ..., r n-1;
 s is partitioned into m partitions, s0, s1, ..., sm-1.
 Any partitioning technique may be used.
 There must be at least m * n processors.
 Label the processors as P0,0, P0,1, ..., P0,m-1, P1,0, ..., Pn-1,m-1.
 Pi,j computes the join of ri with sj. In order to do so, ri is replicated to Pi,0,
Pi,1, ..., Pi,m-1, while sj is replicated to P0,j, P1,j, ..., Pn-1,j
 Any join technique can be used at each processor Pi,j.

Fragment-and-Replicate Join (Cont.)
 Both versions of fragment-and-replicate work with any join condition,
since every tuple in r can be tested with every tuple in s.
 Usually has a higher cost than partitioning, since one of the relations (for
asymmetric fragment-and-replicate) or both relations (for general
fragment-and-replicate) have to be replicated.
 Sometimes asymmetric fragment-and-replicate is preferable even though
partitioning could be used.
 E.g., say s is small and r is large, and already partitioned. It may be
cheaper to replicate s across all processors, rather than repartition r
and s on the join attributes.
 General Fragment and replicate reduces the sizes of the relations at each
processor, compared to asymmetric fragment and replicate.

Partitioned Parallel Hash-Join
 Suppose that we have n processors, P0, P1, . . . , Pn−1, and two relations r and s,
such that the relations r and s are partitioned across multiple disks.
 Assume s is smaller than r and therefore s is chosen as the build relation.
 The parallel hash-join algorithm proceeds this way:
1. Choose a hash function—say, h1—that takes the join attribute value of
each tuple in r and s and maps the tuple to one of the n processors. Let ri
denote the tuples of relation r that are mapped to processor Pi ; similarly, let si
denote the tuples of relation s that are mapped to processor Pi. Each processor
Pi reads the tuples of s that are on its disk Di and sends each tuple to the
appropriate processor on the basis of hash function h1.

Partitioned Parallel Hash-Join
2. As the destination processor Pi receives the tuples of si , it further
partitions them by another hash function, h2, which the processor uses to
compute the hash join locally. The partitioning at this stage is exactly the
same as in the partitioning phase of the sequential hash-join algorithm. Each
processor Pi executes this step independently from the other processors.
3. Once the tuples of s have been distributed, the system redistributes
the larger relation r across the n processors by the hash function h1, in the
same way as before. As it receives each tuple, the destination processor
repartitions it by the function h2, just as the probe relation is partitioned in the
sequential hash-join algorithm.
4. Each processor Pi executes the build and probe phases of the hash-
join algorithm on the local partitions ri and si of r and s to produce a partition
of the final result of the hash join.

Partitioned Parallel Hash-Join (Cont.)
 Once the tuples of s have been distributed, the larger relation r is
redistributed across the n processors using the hash function h1
 Let ri denote the tuples of relation r that are sent to processor Pi.
 As the r tuples are received at the destination processors, they are
repartitioned using the function h2
 (just as the probe relation is partitioned in the sequential hash-join
algorithm).
 Each processor Pi executes the build and probe phases of the hash-join
algorithm on the local partitions ri and si of r and s to produce a partition
of the final result of the hash-join.
 Note: Hash-join optimizations can be applied to the parallel case
 e.g., the hybrid hash-join algorithm can be used to cache some of the
incoming tuples in memory and avoid the cost of writing them and
reading them back in.

Parallel Nested-Loop Join
 Assume that
 relation s is much smaller than relation r and that r is stored by
partitioning.
 there is an index on a join attribute of relation r at each of the partitions
of relation r.
 Use asymmetric fragment-and-replicate, with relation s being replicated, and
using the existing partitioning of relation r.
 Each processor Pj where a partition of relation s is stored reads the tuples
of relation s stored on Dj, and replicates the tuples to every other processor
Pi.
 At the end of this phase, relation s is replicated at all sites that store
tuples of relation r.
 Each processor Pi performs an indexed nested-loop join of relation s with
the ith partition of relation r.

Other Relational Operations
Selection σθ(r)
 If θ is of the form ai = v, where ai is an attribute and v a value:
 If r is partitioned on ai, the selection is performed at a single processor.
 If θ is of the form l ≤ ai ≤ u (i.e., θ is a range selection) and the relation
has been range-partitioned on ai:
 Selection is performed at each processor whose partition overlaps with
the specified range of values.
 In all other cases: the selection is performed in parallel at all the
processors.

Other Relational Operations (Cont.)
 Duplicate elimination
 Perform by using either of the parallel sort techniques
 eliminate duplicates as soon as they are found during sorting.
 Can also partition the tuples (using either range- or hash- partitioning)
and perform duplicate elimination locally at each processor.

 Projection
 Projection without duplicate elimination can be performed as tuples are
read in from disk in parallel.
 If duplicate elimination is required, any of the above duplicate elimination
techniques can be used.

Grouping/Aggregation
 Partition the relation on the grouping attributes and then compute the
aggregate values locally at each processor.
 Can reduce the cost of transferring tuples during partitioning by partly
computing aggregate values before partitioning, as in the sketch below.
 Consider the sum aggregation operation:
 Perform aggregation operation at each processor Pi on those tuples
stored on disk Di
 results in tuples with partial sums at each processor.
 Result of the local aggregation is partitioned on the grouping attributes,
and the aggregation performed again at each processor Pi to get the final
result.
 Fewer tuples need to be sent to other processors during partitioning.
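
 A toy sketch of this optimization for a sum aggregate: each "processor" first computes partial sums on its local tuples, and only the partial results are combined (the data and names are illustrative):

```java
import java.util.*;

public class PartialAggregationExample {
    public static void main(String[] args) {
        // Tuples (groupKey, value) stored on two "disks"/processors.
        List<int[]> disk0 = Arrays.asList(new int[]{1, 10}, new int[]{2, 5}, new int[]{1, 7});
        List<int[]> disk1 = Arrays.asList(new int[]{2, 3}, new int[]{1, 4});

        // Step 1: each processor computes partial sums on its local tuples.
        Map<Integer, Integer> partial0 = localSums(disk0);   // {1=17, 2=5}
        Map<Integer, Integer> partial1 = localSums(disk1);   // {1=4, 2=3}

        // Step 2: the partial results are partitioned on the grouping attribute and
        // summed again to get the final result; far fewer tuples are shipped.
        Map<Integer, Integer> result = new HashMap<>(partial0);
        partial1.forEach((k, v) -> result.merge(k, v, Integer::sum));
        System.out.println(result);   // group 1 -> 21, group 2 -> 8
    }

    static Map<Integer, Integer> localSums(List<int[]> tuples) {
        Map<Integer, Integer> sums = new HashMap<>();
        for (int[] t : tuples) sums.merge(t[0], t[1], Integer::sum);
        return sums;
    }
}
```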

Cost of Parallel Evaluation of Operations
 If there is no skew in the partitioning, and there is no overhead due to the
parallel evaluation, a parallel operation on n processors is expected to take 1/n of
the time of the sequential operation (i.e., a speedup of n).
 If skew and overheads are also to be taken into account, the time taken by
a parallel operation can be estimated as
Tpart + Tasm + max (T0, T1, …, Tn-1)
 Tpart is the time for partitioning the relations
 Tasm is the time for assembling the results
 Ti is the time taken for the operation at processor Pi
 this needs to be estimated taking into account the skew, and the time
wasted in contentions.

Interoperator Parallelism
 Pipelined parallelism
 Consider a join of four relations
 r1 ⋈ r2 ⋈ r3 ⋈ r4
 Set up a pipeline that computes the three joins in parallel
 Let P1 be assigned the computation of
temp1 = r1 ⋈ r2
 And P2 be assigned the computation of temp2 = temp1 ⋈ r3
 And P3 be assigned the computation of temp2 ⋈ r4
 Each of these operations can execute in parallel, sending result tuples it
computes to the next operation even as it is computing further results
 Provided a pipelineable join evaluation algorithm (e.g., indexed nested
loops join) is used

Factors Limiting Utility of Pipeline Parallelism
 Pipeline parallelism is useful since it avoids writing intermediate results to
disk
 Useful with small number of processors but does not scale up well with
more processors. One reason is that pipeline chains do not attain
sufficient length.
 Little speedup is obtained for the frequent cases of skew in which
one operator's execution cost is much higher than the others.

Independent Parallelism
 Independent parallelism
 Consider a join of four relations
r1 ⋈ r2 ⋈ r3 ⋈ r4
 Let P1 be assigned the computation of
temp1 = r1 ⋈ r2
 And P2 be assigned the computation of temp2 = r3 ⋈ r4
 And P3 be assigned the computation of temp1 ⋈ temp2
 P1 and P2 can work independently in parallel
 P3 has to wait for input from P1 and P2
 Can pipeline output of P1 and P2 to P3, combining independent
parallelism and pipelined parallelism
 Does not provide a high degree of parallelism
 useful with a lower degree of parallelism.
 less useful in a highly parallel system.

Distributed Databases
 Heterogeneous and Homogeneous Databases
 Distributed Data Storage
 Distributed Transactions
 Commit Protocols
 Concurrency Control in Distributed Databases
 Availability

Distributed Database System
 A distributed database system consists of loosely coupled sites
that share no physical component
 Database systems that run on each site are independent of each
other
 Transactions may access data at one or more sites

Homogeneous and Heterogeneous Databases
 In a homogeneous distributed database
 All sites have identical software
 Are aware of each other and agree to cooperate in processing user
requests.
 Each site surrenders part of its autonomy in terms of right to change
schemas or software.
 Appears to user as a single system

 In a heterogeneous distributed database
 Different sites may use different schemas and software
 Difference in schema is a major problem for query processing
 Difference in software is a major problem for transaction processing
 Sites may not be aware of each other and may provide only
limited facilities for cooperation in transaction processing

Distributed Data Storage
 Assume relational data model
 Consider a relation r that is to be stored in the database. There are two
approaches to storing this relation in the distributed database:
 Replication
 The system maintains several identical replicas (copies) of the relation,
and stores each replica at a different site.
 Fragmentation
 The relation is partitioned into several fragments, and each fragment is stored
at a different site
 Replication and fragmentation can be combined
 Relation is partitioned into several fragments: system maintains several
identical replicas of each such fragment.

Data Replication
 If relation r is replicated, a copy of relation r is stored in two or more
sites.
 Full replication of a relation is the case where the relation is stored at
all sites.
 Fully redundant databases are those in which every site contains a copy
of the entire database.

Data Replication (Cont.)
 Advantages of Replication
 Availability: failure of a site containing relation r does not result in
unavailability of r if replicas exist.
 Parallelism: queries on r may be processed by several nodes in
parallel.
 Reduced data transfer: relation r is available locally at each site
containing a replica of r.
 Disadvantages of Replication
 Increased cost of updates: each replica of relation r must be updated.
 Increased complexity of concurrency control: concurrent updates to
distinct replicas may lead to inconsistent data unless special
concurrency control mechanisms are implemented.
 One solution: choose one copy as primary copy and apply
concurrency control operations on primary copy

Data Fragmentation
 Division of relation r into fragments r1, r2, …, rn which contain sufficient
information to reconstruct relation r.
 Horizontal fragmentation: each tuple of r is assigned to one or
more fragments
 Vertical fragmentation: the schema for relation r is split into several
smaller schemas
 All schemas must contain a common candidate key (or superkey) to
ensure lossless join property.
 A special attribute, the tuple-id attribute may be added to each
schema to serve as a candidate key.

Horizontal Fragmentation of account Relation

branch_name    account_number    balance
Hillside       A-305             500
Hillside       A-226             336
Hillside       A-155             62

account1 = σ branch_name=“Hillside” (account)

branch_name    account_number    balance
Valleyview     A-177             205
Valleyview     A-402             10000
Valleyview     A-408             1123
Valleyview     A-639             750

account2 = σ branch_name=“Valleyview” (account)


Vertical Fragmentation of employee_info Relation
branch_name    customer_name    tuple_id
Hillside       Lowman           1
Hillside       Camp             2
Valleyview     Camp             3
Valleyview     Kahn             4
Hillside       Kahn             5
Valleyview     Kahn             6
Valleyview     Green            7

deposit1 = Π branch_name, customer_name, tuple_id (employee_info)

account_number    balance    tuple_id
A-305             500        1
A-226             336        2
A-177             205        3
A-402             10000      4
A-155             62         5
A-408             1123       6
A-639             750        7

deposit2 = Π account_number, balance, tuple_id (employee_info)
Advantages of Fragmentation
 Horizontal:
 allows parallel processing on fragments of a relation
 allows a relation to be split so that tuples are located where they are
most frequently accessed
 Vertical:
 allows tuples to be split so that each part of the tuple is stored
where it is most frequently accessed
 tuple-id attribute allows efficient joining of vertical fragments
 allows parallel processing on a relation
 Vertical and horizontal fragmentation can be mixed.
 Fragments may be successively fragmented to an arbitrary depth.

Data Transparency
 Data transparency: Degree to which system user may remain unaware
of the details of how and where the data items are stored in a
distributed system
 Consider transparency issues in relation to:
 Fragmentation transparency
 Replication transparency
 Location transparency

Naming of Data Items - Criteria
1. Every data item must have a system-wide unique name.
2. It should be possible to find the location of data items efficiently.
3. It should be possible to change the location of data items transparently.
4. Each site should be able to create new data items autonomously.

Centralized Scheme - Name Server
 Structure:
 name server assigns all names
 each site maintains a record of local data items
 sites ask name server to locate non-local data items
 Advantages:
 satisfies naming criteria 1-3
 Disadvantages:
 does not satisfy naming criterion 4
 name server is a potential performance bottleneck
 name server is a single point of failure

Use of Aliases
 Alternative to centralized scheme: each site prefixes its own site
identifier to any name that it generates, e.g., site17.account.
 Fulfills having a unique identifier, and avoids problems associated with
central control.
 However, fails to achieve network transparency.
 Solution: Create a set of aliases for data items; Store the mapping of
aliases to the real names at each site.
 The user can be unaware of the physical location of a data item, and is
unaffected if the data item is moved from one site to another.

Distributed Transactions
 Transaction may access data at several sites.
 Each site has a local transaction manager responsible for:
 Maintaining a log for recovery purposes
 Participating in coordinating the concurrent execution of the
transactions executing at that site.
 Each site has a transaction coordinator, which is responsible for:
 Starting the execution of transactions that originate at the site.
 Distributing sub-transactions at appropriate sites for execution.
 Coordinating the termination of each transaction that originates at
the site, which may result in the transaction being committed at all
sites or aborted at all sites.

Transaction System Architecture

System Failure Modes
 Failures unique to distributed systems:
 Failure of a site.
 Loss of messages
 Handled by network transmission control protocols such as TCP-IP
 Failure of a communication link
 Handled by network protocols, by routing messages via alternative
links
 Network partition
 A network is said to be partitioned when it has been split into two
or more subsystems that lack any connection between them
 Note: a subsystem may consist of a single node

 Network partitioning and site failures are generally indistinguishable.

Commit Protocols
 Commit protocols are used to ensure atomicity across sites
 a transaction which executes at multiple sites must either be committed
at all the sites, or aborted at all the sites.
 not acceptable to have a transaction committed at one site and aborted
at another
 The two-phase commit (2PC) protocol is widely used
 The three-phase commit (3PC) protocol is more complicated and more
expensive, but avoids some drawbacks of two-phase commit protocol. This
protocol is not used in practice.

Two Phase Commit Protocol (2PC)
 Assumes fail-stop model – failed sites simply stop working, and do not
cause any other harm, such as sending incorrect messages to other sites.
 Execution of the protocol is initiated by the coordinator after the last step
of the transaction has been reached.
 The protocol involves all the local sites at which the transaction executed
 Let T be a transaction initiated at site Si, and let the transaction
coordinator at Si be Ci

Phase 1: Obtaining a Decision
 Coordinator asks all participants to prepare to commit transaction T.
 Ci adds the records <prepare T> to the log and forces log to stable
storage
 sends prepare T messages to all sites at which T executed
 Upon receiving message, transaction manager at site determines if it can
commit the transaction
 if not, add a record <no T> to the log and send abort T message to Ci
 if the transaction can be committed, then:
 add the record <ready T> to the log
 force all records for T to stable storage
 send ready T message to Ci

Phase 2: Recording the Decision
 T can be committed if Ci received a ready T message from all the
participating sites: otherwise T must be aborted.
 Coordinator adds a decision record, <commit T> or <abort T>, to the
log and forces the record onto stable storage. Once the record reaches stable
storage, the decision is irrevocable (even if failures occur)
 Coordinator sends a message to each participant informing it of the
decision (commit or abort)
 Participants take appropriate action locally.
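
 A highly simplified sketch of the coordinator side of 2PC. Participant and Log are hypothetical abstractions, and a real implementation must also handle timeouts, failures, and recovery:

```java
import java.util.List;

// Hypothetical interfaces standing in for participant sites and the stable log.
interface Participant {
    boolean prepare(String txnId);   // returns true for "ready T", false for "abort T"
    void commit(String txnId);
    void abort(String txnId);
}
interface Log {
    void forceWrite(String record);  // force the record to stable storage
}

class TwoPhaseCommitCoordinator {
    private final List<Participant> participants;
    private final Log log;
    TwoPhaseCommitCoordinator(List<Participant> participants, Log log) {
        this.participants = participants;
        this.log = log;
    }

    void execute(String txnId) {
        // Phase 1: ask every participating site to prepare and collect the votes.
        log.forceWrite("<prepare " + txnId + ">");
        boolean allReady = true;
        for (Participant p : participants) {
            allReady &= p.prepare(txnId);          // ready T / abort T votes
        }

        // Phase 2: record the decision on stable storage, then inform participants.
        if (allReady) {
            log.forceWrite("<commit " + txnId + ">");   // decision is now irrevocable
            for (Participant p : participants) p.commit(txnId);
        } else {
            log.forceWrite("<abort " + txnId + ">");
            for (Participant p : participants) p.abort(txnId);
        }
    }
}
```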

Handling of Failures - Site Failure
When a participating site Sk recovers from a failure, it examines its log to determine
the fate of the transactions that were active when it failed:
 Log contain <commit T> record: site executes redo (T)
 Log contains <abort T> record: site executes undo (T)
 Log contains <ready T> record: site must consult Ci to determine the
fate of T.
 If T committed, redo (T)
 If T aborted, undo (T)
 The log contains no control records concerning T: this implies that Sk failed
before responding to the prepare T message from Ci
 since the failure of Sk precludes the sending of such a
response, Ci must abort T
 Sk must execute undo (T)

Handling of Failures-Coordinator Failure
 If coordinator fails while the commit protocol for T is executing then
participating sites must decide on T’s fate:
1. If an active site contains a <commit T> record in its log, then T
must be committed.
2. If an active site contains an <abort T> record in its log, then T must
be aborted.
3. If some active participating site does not contain a <ready T>
record in its log, then the failed coordinator Ci cannot have decided
to commit T. Can therefore abort T.
4. If none of the above cases holds, then all active sites must have a
<ready T> record in their logs, but no additional control records
(such as <abort T> or <commit T>). In this case active sites must
wait for Ci to recover, to find the decision.
 Blocking problem: active sites may have to wait for failed coordinator
to recover.
Handling of Failures - Network Partition
 If the coordinator and all its participants remain in one partition, the failure
has no effect on the commit protocol.
 If the coordinator and its participants belong to several partitions:
 Sites that are not in the partition containing the coordinator think the
coordinator has failed, and execute the protocol to deal with failure of
the coordinator.
 No harm results, but sites may still have to wait for decision from
coordinator.
 The coordinator and the sites that are in the same partition as the coordinator
think that the sites in the other partition have failed, and follow the usual
commit protocol.
 Again, no harm results

Recovery and Concurrency Control
 In-doubt transactions have a <ready T>, but neither a
<commit T>, nor an <abort T> log record.
 The recovering site must determine the commit–abort status of such
transactions by contacting other sites; this can slow down and potentially block
recovery.
 Recovery algorithms can note lock information in the log.
 Instead of <ready T>, write out <ready T, L> L = list of locks held by T
when the log is written (read locks can be omitted).
 For every in-doubt transaction T, all the locks noted in the
<ready T, L> log record are reacquired.
 After lock reacquisition, transaction processing can resume; the commit or
rollback of in-doubt transactions is performed concurrently with the
execution of new transactions.

Alternative Models of Transaction Processing
 Notion of a single transaction spanning multiple sites is inappropriate for
many applications
 E.g., transaction crossing an organizational boundary
 No organization would like to permit an externally initiated transaction
to block local transactions for an indeterminate period
 Alternative models carry out transactions by sending messages
 Code to handle messages must be carefully designed to ensure
atomicity and durability properties for updates
 Isolation cannot be guaranteed, in that intermediate stages are visible,
but code must ensure no inconsistent states result due to
concurrency
 Persistent messaging systems are systems that provide transactional
properties to messages
 Messages are guaranteed to be delivered exactly once

Alternative Models (Cont.)
 Motivating example: funds transfer between two banks
 Two phase commit would have the potential to block updates on the
accounts involved in funds transfer
 Alternative solution:
 Debit money from source account and send a message to other site
 Site receives message and credits destination account
 Messaging has long been used for distributed transactions (even before
computers were invented!)
 Atomicity issue
 Once the transaction sending a message has committed, the message must be
guaranteed to be delivered
 Delivery is guaranteed as long as the destination site is up and reachable; code to
handle undeliverable messages must also be available
 e.g., credit money back to the source account.
 If the sending transaction aborts, the message must not be sent

Concurrency Control
Modify concurrency control schemes for use in distributed environment.
 We assume that each site participates in the execution of a commit
protocol to ensure global transaction atomicity.
 We assume all replicas of any item are updated
 Will see how to relax this in case of site failures later
Locking Protocols
 The various locking protocols can be used in a distributed environment.
 The only change that needs to be incorporated is in the way the lock manager deals
with replicated data.
 We present several possible schemes that are applicable to an environment where
data can be replicated at several sites.
 We shall assume the existence of the shared and exclusive lock modes.
 Two approaches:
 Single lock manager
 Distributed lock manager

Single-Lock-Manager Approach
 System maintains a single lock manager that resides in a single chosen site,
say Si
 When a transaction needs to lock a data item, it sends a lock request to
Si and lock manager determines whether the lock can be granted
immediately
 If yes, lock manager sends a message to the site which initiated the
request
 If no, request is delayed until it can be granted, at which time a message
is sent to the initiating site

Single-Lock-Manager Approach (Cont.)
 The transaction can read the data item from any one of the sites at which a
replica of the data item resides.
 Writes must be performed on all replicas of a data item
 Advantages of scheme:
 Simple implementation
 Simple deadlock handling
 Disadvantages of scheme are:
 Bottleneck: lock manager site becomes a bottleneck
 Vulnerability: the system is vulnerable to failure of the lock-manager site.

Distributed Lock Manager
 In this approach, functionality of locking is implemented by lock managers
at each site
 Lock managers control access to local data items
 But special protocols may be used for replicas
 Advantage: work is distributed and can be made robust to failures
 Disadvantage: deadlock detection is more complicated
 Lock managers cooperate for deadlock detection
 Several variants of this approach
 Primary copy
 Majority protocol
 Biased protocol
 Quorum consensus

Primary Copy
 Choose one replica of data item to be the primary copy.
 Site containing the replica is called the primary site for that data item
 Different data items can have different primary sites
 When a transaction needs to lock a data item Q, it requests a lock at the
primary site of Q.
 Implicitly gets lock on all replicas of the data item
 Benefit
 Concurrency control for replicated data handled similarly to
unreplicated data - simple implementation.
 Drawback
 If the primary site of Q fails, Q is inaccessible even though other sites
containing a replica may be accessible.
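
A tiny illustrative sketch (the mapping and the request_lock call are assumptions): one lock
request at the primary site stands for all replicas of the item.

    PRIMARY_SITE = {"Q": "site3", "R": "site1"}    # assumed item -> primary-site mapping

    def primary_copy_lock(request_lock, txn, item, mode):
        # The single request at the primary site implicitly covers every replica of item.
        return request_lock(PRIMARY_SITE[item], txn, item, mode)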

114
Database System Concepts - 6th Edition
Majority Protocol
 When a transaction wishes to lock an unreplicated data item Q residing at
site Si, a message is sent to Si's lock manager.
 If Q is locked in an incompatible mode, then the request is delayed until
it can be granted.
 When the lock request can be granted, the lock manager sends a
message back to the initiator indicating that the lock request has been
granted.

115
Database System Concepts - 6th Edition
Majority Protocol (Cont.)
 In case of replicated data
 If Q is replicated at n sites, then a lock request message must be sent to
more than half of the n sites in which Q is stored.
 The transaction does not operate on Q until it has obtained a lock on a
majority of the replicas of Q.
 When writing the data item, transaction performs writes on all replicas.
 Benefit
 Can be used even when some sites are unavailable
 Drawback
 Requires 2(n/2 + 1) messages for handling lock requests, and (n/2 + 1)
messages for handling unlock requests.
 Potential for deadlock even with a single item - e.g., each of 3 transactions
may have locks on 1/3rd of the replicas of a data item.
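
An illustrative sketch of locking and writing a replicated item under the majority protocol;
request_lock and write_at are hypothetical remote calls:

    def majority_lock(request_lock, txn, item, mode, replica_sites):
        # request_lock(site, txn, item, mode) -> True/False is a hypothetical remote call.
        n = len(replica_sites)
        granted = 0
        for site in replica_sites:
            if request_lock(site, txn, item, mode):
                granted += 1
            if granted > n // 2:          # lock now held on a majority of the replicas
                return True
        return False

    def majority_write(request_lock, write_at, txn, item, value, replica_sites):
        # Even though only a majority is locked, the write is applied to all replicas.
        if majority_lock(request_lock, txn, item, "X", replica_sites):
            for site in replica_sites:
                write_at(site, item, value)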

116
Database System Concepts - 6th Edition
Biased Protocol
 Local lock manager at each site as in majority protocol, however, requests
for shared locks are handled differently than requests for exclusive locks.
 Shared locks. When a transaction needs to lock data item Q, it simply
requests a lock on Q from the lock manager at one site containing a replica
of Q.
 Exclusive locks. When transaction needs to lock data item Q, it requests a
lock on Q from the lock manager at all sites containing a replica of Q.
 Advantage - imposes less overhead on read operations.
 Disadvantage - additional overhead on writes
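
A minimal sketch contrasting the two cases; request_lock is the same hypothetical remote
call as in the earlier sketches:

    def biased_lock(request_lock, txn, item, mode, replica_sites):
        if mode == "S":
            # Shared lock: one replica site is enough, so reads are cheap.
            return request_lock(replica_sites[0], txn, item, "S")
        # Exclusive lock: every replica site must grant the lock, so writes cost more.
        return all(request_lock(site, txn, item, "X") for site in replica_sites)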

117
Database System Concepts - 6th Edition
Quorum Consensus Protocol
 A generalization of both majority and biased protocols
 Each site is assigned a weight.
 Let S be the total of all site weights
 Choose two values read quorum Qr and write quorum Qw
 Such that Qr + Qw > S and 2 * Qw > S
 Quorums can be chosen (and S computed) separately for each item
 Each read must lock enough replicas that the sum of the site weights is
>= Qr
 Each write must lock enough replicas that the sum of the site weights is
>= Qw
 For now we assume all replicas are written
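
A small numeric sketch with assumed site weights, showing the quorum conditions:

    weights = {"site1": 1, "site2": 1, "site3": 2}    # assumed per-site weights
    S = sum(weights.values())                          # S = 4
    Qr, Qw = 2, 3                                      # chosen so Qr + Qw > S and 2*Qw > S
    assert Qr + Qw > S and 2 * Qw > S

    def read_quorum_met(granting_sites):
        # A read may proceed once the locked replicas' weights sum to at least Qr.
        return sum(weights[s] for s in granting_sites) >= Qr

    def write_quorum_met(granting_sites):
        # A write may proceed once the locked replicas' weights sum to at least Qw.
        return sum(weights[s] for s in granting_sites) >= Qw

    # Example: {"site1", "site2"} meets the read quorum (1+1 >= 2) but not the
    # write quorum (1+1 < 3); {"site2", "site3"} meets both (1+2 >= 3).

Setting every weight to 1 with Qr = Qw = a majority gives the majority protocol; setting
Qr = 1 and Qw = S gives the biased protocol.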

118
Database System Concepts - 6th Edition
Timestamping
 Timestamp based concurrency-control protocols can be used in
distributed systems
 Each transaction must be given a unique timestamp
 Main problem: how to generate a timestamp in a distributed fashion
 Each site generates a unique local timestamp using either a logical
counter or the local clock.
 A globally unique timestamp is obtained by concatenating the unique local
timestamp with the site's unique identifier.
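
A sketch of one way to generate such timestamps; the site-id value is an assumption:

    import itertools

    SITE_ID = 7                            # assumed unique identifier of this site
    _local_counter = itertools.count(1)    # unique local timestamps from a logical counter

    def new_global_timestamp():
        # The local timestamp is the high-order part and the site id the low-order part,
        # so the pair is globally unique and tuples compare in (local, site) order.
        return (next(_local_counter), SITE_ID)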

119
Database System Concepts - 6th Edition
Replication with Weak Consistency
 Many commercial databases support replication of data with weak degrees
of consistency (i.e., without a guarantee of serializability)
 E.g., master-slave replication: updates are performed at a single “master”
site, and propagated to “slave” sites.
 Propagation is not part of the update transaction: it is decoupled
 May be immediately after transaction commits
 May be periodic
 Data may only be read at slave sites, not updated
 No need to obtain locks at any remote site
 Particularly useful for distributing information
 E.g., from central office to branch-office
 Also useful for running read-only queries offline from the main database
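
A rough sketch of the decoupling (all names hypothetical): the update transaction only
appends to a change log; a separate job later pushes the log to the slave sites.

    def update_at_master(db, change_log, stmt, params):
        with db.transaction():
            db.execute(stmt, params)
            change_log.append((stmt, params))     # recorded inside the same transaction

    def propagate_to_slaves(change_log, slave_dbs):
        # Runs outside (after) the update transaction, e.g. periodically or on commit.
        for stmt, params in change_log.drain():
            for slave in slave_dbs:
                slave.execute(stmt, params)       # slaves serve read-only queries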

121
Database System Concepts - 6th Edition
Replication with Weak Consistency (Cont.)
 Replicas should see a transaction-consistent snapshot of the database
 That is, a state of the database reflecting all effects of all transactions up
to some point in the serialization order, and no effects of any later
transactions.
 E.g., Oracle provides a create snapshot statement to create a snapshot of
a relation or a set of relations at a remote site
 snapshot refresh either by recomputation or by incremental update
 Automatic refresh (continuous or periodic) or manual refresh

122
Database System Concepts - 6th Edition
Multimaster and Lazy Replication
 With multimaster replication (also called update-anywhere replication)
updates are permitted at any replica, and are automatically propagated to all
replicas
 Basic model in distributed databases, where transactions are unaware of
the details of replication, and database system propagates updates as
part of the same transaction
 Coupled with 2 phase commit
 Many systems support lazy propagation where updates are transmitted
after transaction commits
 Allows updates to occur even if some sites are disconnected from the
network, but at the cost of consistency

123
Database System Concepts - 6th Edition
Deadlock Handling
Consider the following two transactions and history, with item X and
transaction T1 at site 1, and item Y and transaction T2 at site 2:
T1: write(X)        T2: write(Y)
    write(Y)            write(X)

Schedule at site 1 (item X)         Schedule at site 2 (item Y)
T1: X-lock on X                     T2: X-lock on Y
T1: write(X)                        T2: write(Y)
T2: wait for X-lock on X            T1: wait for X-lock on Y

Result: deadlock which cannot be detected locally at either site

124
Database System Concepts - 6th Edition
Centralized Approach
 A global wait-for graph is constructed and maintained in a single site; the
deadlock-detection coordinator
 Real graph: the real, but unknown, state of the system.
 Constructed graph: an approximation generated by the coordinator during the
execution of its algorithm.
 The global wait-for graph can be constructed when:
 a new edge is inserted in or removed from one of the local wait-for
graphs.
 a number of changes have occurred in a local wait-for graph.
 the coordinator needs to invoke cycle-detection.
 If the coordinator finds a cycle, it selects a victim and notifies all sites. The
sites roll back the victim transaction.
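
A sketch of the coordinator's job; networkx is used only for brevity, and the edge format
is an assumption (edges point from the waiting transaction to the holding transaction):

    import networkx as nx

    def detect_global_deadlock(local_wait_for_graphs):
        # Each local graph is assumed to be an iterable of (waiting_txn, holding_txn) edges.
        global_graph = nx.DiGraph()
        for edges in local_wait_for_graphs:
            global_graph.add_edges_from(edges)
        try:
            cycle = nx.find_cycle(global_graph)   # raises NetworkXNoCycle if there is none
        except nx.NetworkXNoCycle:
            return None                           # no global deadlock
        return cycle[0][0]                        # pick a victim transaction from the cycle

    # In the earlier example, site 1 contributes the edge (T2, T1) and site 2 contributes
    # (T1, T2); neither local graph has a cycle, but the combined graph does.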

125
Database System Concepts - 6th Edition
Local and Global Wait-For Graphs
[Figure: local wait-for graphs at each site, and the global wait-for graph obtained by
combining them at the coordinator]
126
Database System Concepts - 6th Edition
Unnecessary Rollbacks
 Unnecessary rollbacks may result when deadlock has indeed occurred and
a victim has been picked, and meanwhile one of the transactions was
aborted for reasons unrelated to the deadlock.
 Unnecessary rollbacks can result from false cycles in the global wait-for
graph; however, likelihood of false cycles is low.

129
Database System Concepts - 6th Edition
End of Chapter
Three Phase Commit (3PC)
 Assumptions:
 No network partitioning
 At any point, at least one site must be up.
 At most K sites (participants as well as coordinator) can fail
 Phase 1: Obtaining Preliminary Decision: Identical to 2PC Phase 1.
 Every site is ready to commit if instructed to do so
 Phase 2 of 2PC is split into 2 phases, Phase 2 and Phase 3 of 3PC
 In phase 2 coordinator makes a decision as in 2PC (called the pre-commit
decision) and records it in multiple (at least K) sites
 In phase 3, coordinator sends commit/abort message to all participating sites,
 Under 3PC, knowledge of pre-commit decision can be used to commit despite
coordinator failure
 Avoids blocking problem as long as < K sites fail
 Drawbacks:
 higher overheads
 assumptions may not be satisfied in practice
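
A very rough outline of the coordinator side, ignoring timeouts and recovery; all names are
hypothetical and this is not the complete protocol:

    def three_phase_commit(send_and_wait, broadcast, participants, K):
        # Phase 1: identical to 2PC phase 1 -- ask every participant to prepare.
        votes = [send_and_wait(p, "PREPARE") for p in participants]
        if not all(v == "READY" for v in votes):
            broadcast(participants, "ABORT")
            return "aborted"
        # Phase 2: record the pre-commit decision at at least K sites.
        recorded = 0
        for p in participants:
            if send_and_wait(p, "PRECOMMIT") == "ACK":
                recorded += 1
            if recorded >= K:
                break
        # Phase 3: the actual commit; because >= K sites know the pre-commit decision,
        # surviving sites can finish the commit even if the coordinator now fails.
        broadcast(participants, "COMMIT")
        return "committed"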

131
Database System Concepts - 6th Edition