Aseem Kumar Patel - Distributed Database System
In recent years, the availability of databases and of computer networks has given rise to a new
field: the distributed database. A distributed database is an integrated database that is built on
top of a computer network rather than on a single computer. The data that constitute the
database are stored at the different sites of the computer network, and the application programs
run by those computers access data at different sites. The databases may involve different
database management systems, running on different architectures, that distribute the execution
of transactions.
A Distributed Database Management System (DDBMS) is defined as the software that handles
the management of the DDB (Distributed Database) and makes the operation of such a system
appear to the user as if it were a centralized database.
A distributed database is a collection of data that belongs logically to the same system but is
spread over the sites of a computer network:
(1) Data is stored at several sites, each managed by a DBMS that can run independently.
(2) The data from the different sites are tied together by a computer network.
§ Distributed Data Independence: Users should not have to know where data is located
(extends Physical and Logical Data Independence principles).
§ Distributed Transaction Atomicity: Users should be able to write Xacts accessing multiple
sites just like local Xacts.
In practice, users often have to be aware of where data is located; that is, Distributed Data
Independence and Distributed Transaction Atomicity are not always supported, because these
properties are hard to support efficiently. For globally distributed sites, these properties may not
even be desirable, due to the administrative overhead of making the location of data transparent.
Data is located near the site of greatest demand, so access is faster; processing is faster because
several sites spread out the workload; new sites can be added quickly and easily; communication
is improved; operating costs are reduced; the system is user friendly; there is less danger of a
single point of failure; and there is process independence.
Several reasons why businesses and organizations move to distributed databases include
organizational and economic reasons, reliable and flexible interconnection of existing databases,
and future incremental growth. Companies believe that a decentralized, distributed database
approach adapts more naturally to the structure of the organization. A distributed database is the
more suitable solution when several databases already exist in an organization. In addition,
global applications can be performed easily with a distributed database. If an organization grows
by adding new, relatively independent organizational units, then the distributed database
approach supports smooth incremental growth.
Data can physically reside nearest to where it is most often accessed, thus providing users with
local control of the data they interact with. This results in local autonomy, allowing users to
enforce local policies regarding access to their data.
Another reason to consider a distributed architecture is to improve the reliability and availability
of the data in a scalable system. In a distributed system, with some care, it is possible to access
some or possibly all of the data in a failure mode if there is sufficient data replication.
Managing and controlling such a system is complex, and there is less security because data
resides at so many different sites.
A distributed database provides more flexible access, which increases the chance of security
violations, since the database can be accessed from every site in the network. For many
applications it is important to provide security. Present distributed database systems do not
provide adequate mechanisms to meet these objectives; the solution requires a DDBMS capable
of handling multilevel data. Such a system is called a multilevel secure distributed database
management system (MLS-DDBMS). An MLS-DDBMS provides a verification service for
users who wish to share data in the database at different security levels. In an MLS-DDBMS,
every data item in the database is associated with one of several classifications or sensitivities.
The ability to ensure the integrity of the database in the presence of unpredictable failures of
both hardware and software components is also an important feature of any distributed database
management system. The integrity of a database is concerned with its consistency, correctness,
validity, and accuracy. Integrity controls must be built into the structure of the software, the
databases, and the involved personnel.
If there are multiple copies of the same data, then this duplication introduces additional
complexity in ensuring that all copies are updated on each update. The notions of concurrency
control and recoverability consume much of the research effort in the area of distributed
database theory. The goal is an increase in reliability and performance, not merely the status quo.
Example
Consider a bank having three branches connected by a computer network, each branch
containing its teller terminals and its account database. All the branches can process the
database. Each branch, with its local database, constitutes one site of the distributed database.
The applications required from the terminals of a branch need only access the database of that
branch. Local applications are applications processed by the computer of the branch from which
they are issued.
A global application involves the generation of a number of sub-transactions, each of which may
be executed at a different site.
For instance, suppose a customer wants to transfer money from an account at one branch to an
account at another branch. This application requires updating the databases at two different
branches, i.e., performing two local updates. Thus if A is a data item with two copies A1 and A2
at sites S1 and S2, then updating the value of A involves generating two sub-transactions TS1
and TS2, each of which performs the required operation at site S1 or S2 respectively.
If the branches are located in different geographical areas, the banking system is called a
distributed database on a geographically dispersed network. [See the figure]
If instead the branches are situated in one building and connected by a high-bandwidth local
network, with the teller terminals connected to their respective computers by telephone lines,
then each processor and its database constitute a site of the local computer network. Here the
distributed database is implemented on a local network instead of a geographical network and is
called a distributed database on a local network. [See the figure, replacing the communication
network with a local network]
[Figure: three sites, each with a computer (COM) and teller terminals (T), connected by a
communication network]
Let us now consider the same example as above, but where the data of the different branches are
distributed over three backend computers that perform the database management functions. The
application programs are executed by different computers, which request database access
services from the backends when necessary. This type of system is not a distributed database:
although the data is physically distributed over different processors, the distribution is not
relevant from the application viewpoint, and local applications cannot be executed at the
individual computers.
From the above examples, one can summarize and obtain the following working definition:
A distributed database is a collection of data, which are distributed over different computers of
a computer network. Each site of the network has processing capacity and can perform local
applications. Each site also participates in the execution of at least one global application.
Thus, in short:
Functions of DDBMS
(1) extended communication services, to allow the transfer of queries and data among sites
(2) an extended system catalog, to store data distribution details
(3) distributed query processing, including query optimization
(4) extended concurrency control, to maintain the consistency of replicated data
(5) extended recovery services, to take account of failures of individual sites and of
communication links
A user at one site makes a request as if it were addressed to the local database. Users see only
one logical database and do not need to know the names of the different fragments; in fact, a
user need not know that the database is partitioned, nor where the data is located.
Component Architecture
The components, which are necessary for building distributed databases, are as follows:
User Request Interface: typically a client program that acts as an interface to the Distributed
Transaction Manager and accepts users' requests for transactions.
Together, the distributed transaction managers and the database managers make up the DDBMS.
A node is a piece of hardware that can host a transaction manager, a database manager, or both.
There are two types of distributed database systems: homogeneous and heterogeneous.
Homogeneous: if data is distributed but every site runs the same type of DBMS, the distributed
database is called homogeneous.
Heterogeneous: if data is distributed and different sites run different DBMSs (different relational
DBMSs, or even non-relational DBMSs), the distributed database is called heterogeneous. The
following figure illustrates this. Such a system is built using gateway protocols, i.e., APIs that
expose DBMS functionality to external applications; examples are ODBC and JDBC.
There are two alternative approaches to separating functionality across the processes of a
distributed database: client-server and collaborating server.
Client-server architecture:
This architecture is similar to a restaurant, where many customers give their orders to one waiter.
In this architecture there are three components: clients, servers, and the communication middleware.
§ The client ships the request to a single site, which processes the query.
§ The server processes all queries sent by the client.
[Figure: client-server architecture, in which a client ships its query to a single server]
The client-server architecture does not allow a single query to span multiple servers, because the
client process would have to be capable of breaking such a query into appropriate subqueries to
be executed at different sites and then piecing together the answers to the subqueries. This
difficulty is eliminated by the collaborating server architecture.
Collaborating server architecture:
In this architecture we have a collection of database servers, each capable of running transactions
against local data, which cooperatively execute transactions spanning multiple servers.
[Figure: collaborating server architecture, in which a client queries one server and that server
cooperates with other servers]
A client makes a request for data to a server using a query. That server generates appropriate
subqueries to be executed by the other servers and puts the results together to compute the
answer to the original query.
Data allocation
How should a database be distributed among the sites of a network? We discuss this important
issue using the following strategies:
§ Centralized
§ Replication
§ Fragmentation
We will explain and illustrate each of these approaches using relational databases.
Suppose that a bank has numerous branches located throughout a state. One of the base relations
in the bank's database is the customer relation given below. For simplicity, the sample data in the
relation apply to only two of the branches (Mumbai and Pune). The primary key of the relation is
the account number (Ac_no); B_name is the name of the branch where the customer opened the
account.
Centralized
All data is stored at one site, and processing can be done only at that site; other sites have to
approach this site for data. If a problem occurs at this site, the whole system fails. This method
of storing data has high communication costs.
Fragmentation
Fragmentation consists of breaking a relation into smaller relations, or fragments, and storing the
fragments at different sites: a global relation R is partitioned into fragments R1, R2, . . . , Rn that
contain sufficient information to reconstruct the original relation.
To fragment a relation, the prerequisite is that the access patterns (typical queries) and
application behavior are known; that is, which applications access which (portions of the) data.
There are two types of fragmentation: horizontal and vertical.
Horizontal fragmentation
With this fragmentation, some of the rows of a relation are put into a base relation at one site,
and other rows are put into a base relation at another site. More generally, the rows of a relation
are distributed over many sites. Fragments are required to be disjoint.
When each fragment is assigned to one site, locality of reference can be high and storage costs
are low (no replication), but reliability and availability are low.
For a given relation R(A1, . . . , An), a set PR = {p1, . . . , pn} of simple predicates is determined.
Each pi has the pattern Ai θ <value>, with θ ∈ {=, ≠, <, >}.
For example, assume a relation EMP with attributes Job and Sal, and two applications: one
typically retrieves information about employees who earn more than Rs. 5000, while the other
typically manages information about clerks.
The simple predicates will be PEMP = {Job = 'Clerk', Sal > 5000}, and the set of minterm
predicates is MR = { m | m = [¬]p1 ∧ . . . ∧ [¬]pn }, i.e., every conjunction in which each simple
predicate appears either positively or negated.
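To make the minterm construction concrete, here is a minimal Python sketch (the sample EMP
tuples and attribute layout are hypothetical) that splits EMP into the four minterm fragments
induced by PEMP and checks that the fragments are disjoint and complete:

# Hypothetical sample tuples for EMP(ENo, Job, Sal)
emp = [
    (1, 'Clerk', 3000),
    (2, 'Clerk', 7000),
    (3, 'Manager', 9000),
    (4, 'Typist', 2500),
]

p1 = lambda t: t[1] == 'Clerk'   # simple predicate: Job = 'Clerk'
p2 = lambda t: t[2] > 5000       # simple predicate: Sal > 5000

# One horizontal fragment per minterm [not]p1 AND [not]p2
fragments = {
    (b1, b2): [t for t in emp if p1(t) == b1 and p2(t) == b2]
    for b1 in (True, False)
    for b2 in (True, False)
}

# Completeness and disjointness: every tuple lies in exactly one fragment
assert sorted(t for f in fragments.values() for t in f) == sorted(emp)
for minterm, frag in fragments.items():
    print(minterm, frag)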
Vertical fragmentation
Here the idea is to split a schema R into smaller schemas Ri. The splitting is based on which
attributes are typically accessed together by (different) applications. Each fragment Ri must
contain the primary key in order to be able to reconstruct the original relation R.
With this fragmentation, some of the columns of a relation are projected into a base relation at
one site, and other columns are projected into a base relation at another site. More generally, the
columns may be projected to several sites. For example, one fragment of the customer relation
can be written as shown below.
This fragmentation is more difficult than horizontal fragmentation, because more alternatives
exist. It can be achieved by two approaches: (1) grouping attributes into fragments, or (2)
splitting the relation into fragments.
Ac_no Balance
200 1000
324 250
153 38
426 1500
500 168
683 1398
252 303
When a relation is fragmented, we must be able to recover the original relation from the
fragments.
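A minimal Python sketch of this recovery for vertical fragments, assuming hypothetical customer
tuples of the form (Ac_no, B_name, Balance): each fragment keeps the primary key Ac_no, so a
natural join on it rebuilds the original relation.

# Hypothetical tuples for customer(Ac_no, B_name, Balance)
customer = {
    (200, 'Mumbai', 1000),
    (324, 'Pune', 250),
    (153, 'Mumbai', 38),
}

# Vertical fragments: both projections keep the primary key Ac_no
frag1 = {(ac, bn) for (ac, bn, _) in customer}    # (Ac_no, B_name)
frag2 = {(ac, bal) for (ac, _, bal) in customer}  # (Ac_no, Balance)

# Reconstruction: natural join of the fragments on Ac_no
rebuilt = {(ac, bn, bal)
           for (ac, bn) in frag1
           for (ac2, bal) in frag2
           if ac == ac2}
assert rebuilt == customer   # the original relation is recovered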
Correctness of Fragmentation
§ Reconstruction: if relation R is decomposed into fragments R1, R2, . . . , Rn, then there should
be some relational operator ⊕ such that R = ⊕_{1 ≤ i ≤ n} Ri.
§ Disjointness: if a relation R is partitioned and a tuple t is in fragment Ri, then t should be in no
other fragment Rj, j ≠ i.
Replication
System maintains multiple copies of data (fragments) stored at different sites, for faster retrieval
and fault tolerance.
• Full replication of a relation is the case where the relation is stored at all sites.
• Fully redundant databases are those in which every site contains a copy of the entire database.
• Rule:
Advantages
• Reduced data transfer: Relation R is available locally at each site containing a replica of R
Disadvantages
• Increased complexity of concurrency control: concurrent updates to two distinct replicas may
lead to inconsistent data unless special concurrency control mechanisms are implemented.
Strategies
The allocation schema describes the mapping of relations or fragments to the servers that store
them. This mapping can be non-redundant, when each fragment or relation is allocated to exactly
one node, or redundant, when some fragments or relations are allocated to more than one node.
Transparencies in DDBMS
A distributed database system has certain functional characteristics, which are grouped together
and called the set of transparencies. The DDBMS transparency features are:
Distributed Transparency
This allows the user to see the database as a single, logical entity. If this transparency is
exhibited, then the user does not need to know that the data is partitioned, that it may be
replicated at several sites, or where the data is located.
The level of transparency supported varies from DDBMS to DDBMS. Three levels of distributed
transparency are recognized: fragmentation transparency, allocation transparency, and language
transparency.
In the absence of transparency, each DBMS accepts its own SQL 'dialect': the system is
heterogeneous and the DBMSs do not support a common interoperability standard.
Let us assume that the SUPPLIER (SNum, Name, City) relation has two horizontal fragments,
SUPPLIER1 and SUPPLIER2, allocated at two different network nodes (for example,
SUPPLIER2@company.hydrabad2.in).
Fragmentation transparency
At this level, the programmer need not worry about whether or not the database is distributed or
fragmented. For example, consider the following query:
procedure Query1(:snum, :name);
SELECT Name INTO :name
FROM Supplier
WHERE SNum = :snum;
end procedure;
Here the name is retrieved from the fragments SUPPLIER1 and SUPPLIER2, but the user does
not know that the data is stored on different nodes.
Allocation transparency
At this level, the programmer must know the structure of the fragments, but does not have to
indicate their allocation. With replication, the programmer also does not have to indicate which
copy is chosen for access (replication transparency). For example, consider the following query:
procedure Query2(:snum, :name);
SELECT Name INTO :name
FROM Supplier1
WHERE SNum = :snum;
if :empty then
SELECT Name INTO :name
FROM Supplier2
WHERE SNum = :snum;
end procedure;
Language transparency
At this level, the programmer must indicate in the query both the structure of the fragments and
their allocation. Queries expressed at a higher level of transparency are transformed to this level
by the distributed query optimizer, which is aware of the data fragmentation and allocation.
Optimizations
§ By using parallelism: instead of submitting the two requests in sequence, they can be
processed in parallel, thus saving on the global response time.
§ By using knowledge of the logical properties of fragments (but then the programs are not
flexible).
Classification Of Transaction
Transaction transparency allows a transaction to update data at several sites. It also ensures that
the transaction is either completed or aborted, so that database integrity is maintained.
In order to understand how transactions are managed, one should know the classification of
transactions: remote requests, remote transactions, distributed transactions, and distributed
requests.
Remote request
These are read-only transactions, made up of an arbitrary number of SQL queries, addressed to a
single remote DBMS; the remote DBMS can only be queried. The following figure illustrates this.
[Figure: a TP at site 1 communicates over the network with a DP at site 2 holding the Employee
table]
Remote transactions
A remote transaction is made up of any number of SQL commands (select, insert, delete, update)
directed to a single remote DBMS, and each transaction writes to only one DP. Consider
Customer and Employee tables situated at site 2. The transaction should be able to update both
the Customer and Employee tables, but it can refer to only one DP at a time.
[Figure: a TP at site 1 communicates over the network with a DP at site 2 holding the Employee
and Customer tables]
begin:
UPDATE Employee
SET salary = salary + 200
WHERE eno > 123;
INSERT
INTO Customer (cno, cname, age)
VALUES (123, 'Seema', 29);
COMMIT;
Distributed transactions
A distributed transaction is made up of any number of SQL commands (select, insert, delete,
update) directed to an arbitrary number of remote DP sites, but each SQL command refers to a
single DBMS.
Consider a transaction that points towards two remote sites, say sites 2 and 3. The first request
(the SELECT statement) is processed by the DP at remote site 2, and the next requests (the
UPDATE and INSERT statements) are processed by the DP at remote site 3. Each request can
access only one DP at a time.
[Figure: a TP at site 1 communicates over the network with a DP at site 2 (Employee) and a DP
at site 3 (Customer, Product)]
begin:
SELECT *
FROM Employee
WHERE eno > 104;
UPDATE Customer
SET cbalance = cbalance - 100
WHERE cno = 124;
INSERT
INTO Product (pno, price)
VALUES (321, 50);
COMMIT;
Distributed requests
A distributed request is made up of arbitrary transactions in which each single SQL command
can refer to any number of DPs, possibly over fragmented data; such a request therefore requires
a distributed optimizer.
Example 1: Let shop (sno, sname) be at site 2, and customer (cno, cname, bill, sno) and
employee (eno, ename, sno) be at site 3. Fetch the tuples to find sname and cname where sno
= 123. Figure 1 illustrates this.
Example 2: The distributed request feature allows a single request to reference a physically
partitioned table. Divide the employee table into two fragments, say E1 and E2, located at sites 2
and 3. Suppose the end user wants to obtain all tuples whose salary exceeds Rs. 15,000. The
request is illustrated in figure 2.
[Figure 1: a TP at site 1 communicates over the network with a DP at site 2 (Shop) and a DP at
site 3 (Customer, Employee)]
begin:
SELECT sname, cname
FROM shop, customer
WHERE shop.sno = 123 AND
customer.sno = shop.sno;
UPDATE Customer
SET bill = bill + 5100
WHERE cno = 124;
INSERT
INTO Product (pno, price)
VALUES (321, 50);
COMMIT;
[Figure 2: a TP at site 1 communicates over the network with DPs at sites 2 and 3 holding the
fragments E1 and E2 of the employee table]
begin:
SELECT *
FROM employee
WHERE salary > 15000;
COMMIT;
The catalog manager keeps track of all the data distributed across the different sites by keeping
the name of each replica of each fragment. To preserve local autonomy, this information is
stored in the form <local_name, birth_site>. Every site also has a catalog that describes all
objects (fragments and replicas) at that site, and keeps track of all replicas of relations created at
that site. To find a relation, it is best to consult the birth-site catalog, which never changes even
if the relation is moved to another site.
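As an illustration of the naming scheme, here is a tiny Python sketch (the relation names, sites,
and fragment names are hypothetical): the birth-site entry stays authoritative even when replicas
move.

# Toy catalog keyed by <local_name, birth_site>
birth_site_catalog = {
    ('Supplier', 1): ['Supplier1@site2', 'Supplier2@site3'],  # fragments/replicas
}

def locate(local_name, birth_site):
    # Look up the birth-site catalog; this entry never changes,
    # even if the relation is later moved to another site.
    return birth_site_catalog[(local_name, birth_site)]

print(locate('Supplier', 1))   # ['Supplier1@site2', 'Supplier2@site3']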
Consider, for example, a transaction that transfers 100,000 from one account to another, where
the two accounts reside at different sites:
begin transaction
UPDATE Account1
SET Total = Total - 100000
WHERE AccNum = 3154;
UPDATE Account2
SET Total = Total + 100000
WHERE AccNum = 14878;
commit;
end transaction
The above code is not acceptable, because it may violate atomicity: one of the two modifications
might be executed while the other is not.
§ Durability is not a problem that depends on data distribution, because each system guarantees
local durability by using local recovery mechanisms (logs, checkpoints, and dumps). The aspects
that are genuinely affected by distribution are:
§ Query optimization
§ Concurrency control
§ Reliability control
Besides the factors used in centralized databases, additional factors must be considered for
optimization. Query optimization is required when a DP receives a distributed request from the
site where the TP is located; the DP that is queried is responsible for the 'global optimization'.
This DP decides on the breakdown of the query into sub-queries, each addressed to a specific
DP, and builds a strategy (plan) of distributed execution, which consists of coordinating the
various execution programs of the DPs and the exchange of data among them.
The following examples give an idea of how the 'best' execution plan is determined. Consider
the relations Student, stored at site 1, and Course, stored at site 2. The relations are not
fragmented, and have the following characteristics:
§ The Student relation contains 10,000 records, each 100 bytes long.
§ The Course relation contains 100 records, each 35 bytes long.
From the above, the size of the Student relation is 100 * 10,000 = 10^6 bytes, and the size of the
Course relation is 100 * 35 = 3,500 bytes.
Query: Find the names of the students and the titles of the courses taken by the students.
In relational algebra the query can be stated as:
π name, cname (Student ⋈ cno Course)
The result of this query will contain 10,000 records, assuming that every student has opted for a
course. Assume further that each record of the result is 40 bytes long.
[A] Suppose that the query is submitted at site 2; site 2 is then called the result site. We have the
following strategies:
(1) Transfer the Student relation to site 2, execute the query there, and display the result at site
2. Then 10^6 bytes are transferred from site 1 to site 2.
(2) Transfer the Course relation to site 1, execute the query at site 1, and send the result to site
2. Then 400,000 + 3,500 = 403,500 bytes are transferred.
If minimizing the amount of data transferred is our optimization criterion, then we choose
strategy 2.
[B] Suppose that the query is submitted at site 3, which contains neither of the two relations.
Site 3 is then the result site, where the result of the query will be stored. We have the following
strategies:
[Figure: Student at site 1, Course at site 2, result at site 3]
(1) Transfer both the Student and Course relations to site 3 and perform the join at site 3. Here a
total of 10^6 + 3,500 = 1,003,500 bytes must be transferred.
(2) Transfer the Student relation to site 2, execute the join at site 2, and send the result to site 3.
The size of the query result is 40 * 10,000 = 400,000 bytes, so 400,000 + 10^6 = 1,400,000
bytes are transferred.
(3) Transfer the Course relation to site 1, execute the join at site 1, and send the result to site 3.
The size of the query result is again 400,000 bytes, so 400,000 + 3,500 = 403,500 bytes are
transferred.
If minimizing the amount of data transferred is our optimization criterion, then we choose
strategy 3.
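These comparisons are easy to recompute; the following minimal Python sketch tallies the bytes
transferred by the three strategies for result site 3 and picks the cheapest (the sizes are the ones
given above):

STUDENT_BYTES = 10_000 * 100   # Student at site 1: 10^6 bytes
COURSE_BYTES = 100 * 35        # Course at site 2: 3,500 bytes
RESULT_BYTES = 10_000 * 40     # join result: 400,000 bytes

strategies = {
    'ship both relations to site 3': STUDENT_BYTES + COURSE_BYTES,
    'join at site 2, ship result': STUDENT_BYTES + RESULT_BYTES,
    'join at site 1, ship result': COURSE_BYTES + RESULT_BYTES,
}
for name, cost in strategies.items():
    print(f'{name}: {cost:,} bytes')
print('best:', min(strategies, key=strategies.get))   # join at site 1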
The following is an SQL query to find ENAME from the relations EMP (ENO, ENAME, AGE)
and ASG (ENO, DUR):
SELECT ENAME
FROM EMP, ASG
WHERE EMP.ENO = ASG.ENO AND DUR > 37
Let us assume that size(EMP) = 400, size(ASG) = 1000, the tuple access cost is 1 unit, and the
tuple transfer cost is 10 units.
For strategy 1
For Strategy 2
Concurrency Control
There are various techniques for determining which objects are to be locked in a distributed
database, by obtaining and releasing locks. The locking scheme is determined by the
concurrency control protocol:
Centralized A single site is in charge of handling lock and unlock requests for all objects.
Primary copy One copy of each object is called the primary copy. All requests to lock or unlock
a copy of this object are handled by the lock manager at the site where the primary copy is stored.
Fully distributed A request to lock or unlock a copy of an object is handled by the lock manager
of the site where that copy is stored.
The centralized scheme is neither robust nor popular, because a failure of the site that holds the
locks disturbs the functionality of all the sites.
The primary copy scheme avoids this problem, but in general reading an object requires
communication with two sites: the site where the primary copy resides and the site where the
copy to be read resides. This problem is avoided in the fully distributed scheme, because locking
is done at the site where the copy to be read resides. However, when writing, locks must be set at
all sites where copies are modified in the fully distributed scheme, whereas locks need to be set
at only one site in the other two schemes.
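The difference between the three schemes is essentially where a lock request is routed; here is a
minimal Python sketch, with hypothetical objects, sites, and copy placements:

PRIMARY = {'A': 1, 'B': 2}            # primary-copy site of each object
COPIES = {'A': {1, 3}, 'B': {2, 3}}   # sites holding a copy

def lock_sites(obj, mode, scheme, local_site):
    """Return the set of sites whose lock manager must be contacted."""
    if scheme == 'centralized':
        return {0}                     # the single lock site for all objects
    if scheme == 'primary-copy':
        return {PRIMARY[obj]}          # always the primary copy's site
    # fully distributed: read locally, but write-lock every copy
    return {local_site} if mode == 'read' else COPIES[obj]

print(lock_sites('A', 'read', 'fully-distributed', 3))    # {3}
print(lock_sites('A', 'write', 'fully-distributed', 3))   # {1, 3}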
In a distributed system, a transaction ti can carry out various sub-transactions tij, where the
second subscript denotes the node of the system on which the sub-transaction works. For
example, with two nodes, node 1 and node 2, two transactions can be performed as follows:
t1: at node 1: Read(x), Write(x); at node 2: Read(y), Write(y)
t2: at node 2: Read(y), Write(y); at node 1: Read(x), Write(x)
Local serializability within each scheduler is not a sufficient guarantee of serializability.
Consider the two schedules at nodes 1 and 2 above: they are locally serializable, but their global
conflict graph has a cycle, since at node 1 t1 precedes and conflicts with t2, while at node 2 t2
precedes and conflicts with t1. Thus we define:
Global serializability of distributed transactions over the nodes of a distributed database requires
the existence of a unique serial schedule S equivalent to all the local schedules Si produced at
each node.
(1) If each scheduler of a distributed database uses the two-phase locking method on each node,
and carries out the commit action when all the sub-transactions have acquired all the
resources, then the resulting schedules are globally conflict-serializable.
(2) If each distributed transaction acquires a single timestamp and uses it in all requests to all the
schedulers, which use concurrency control based on timestamps, then the resulting schedules
are globally serial, following the order imposed by the timestamps.
The Lamport method for assigning timestamps reflects the precedence among events in a
distributed system. The method is as follows:
(1) The least significant digits identify the node at which the event occurs.
(2) The most significant digits identify the events that happen at that node; they can be obtained
from a local counter, which is incremented at each event.
§ Each time two nodes exchange a message, the timestamps become synchronized: the receiving
event must have a timestamp greater than the timestamp of the sending event, which may
require increasing the local counter on the receiving node.
The following example shows how timestamps can be assigned using the Lamport method.
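The worked example (a figure in the original) is not reproduced here; instead, here is a minimal
Python sketch of the method as described above, encoding a timestamp as the pair (counter,
node), with the node id in the least significant position:

class LamportClock:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0               # local event counter (most significant part)

    def local_event(self):
        self.counter += 1
        return (self.counter, self.node_id)

    def send(self):
        return self.local_event()      # sending a message is itself an event

    def receive(self, sent_ts):
        # The receiving event must get a timestamp greater than the
        # sending event's, which may force the local counter upwards.
        self.counter = max(self.counter, sent_ts[0]) + 1
        return (self.counter, self.node_id)

n1, n2 = LamportClock(1), LamportClock(2)
ts = n1.send()          # (1, 1)
print(n2.receive(ts))   # (2, 2): synchronized past the sending event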
Distributed deadlocks
Distributed deadlocks are due to circular waiting between two or more nodes. A simple
detection scheme uses timeouts: each node is given a certain time to reply, and if it does not
respond within that time, a deadlock is assumed to be possible.
Deadlock resolution can also be done with an asynchronous and distributed protocol
(implemented in a distributed version of DB2 by IBM).
Assume that sub-transactions are activated using remote procedure calls, that is, synchronous
calls to procedures that are executed remotely. This model allows for two distinct types of
waiting:
§ Two sub-transactions of the same transaction can wait in distinct DPs, as one waits for the
termination of the other. For instance, if t11 activates t12, it waits for the termination of t12.
§ Two different sub-transactions on the same DP can wait, as one locks a data item to which the
other requires access. For instance, if t11 locks an object requested by t21, then t21 waits for
the termination of t11.
The waiting conditions at each DP can be characterized using precedence conditions. Let ti and
tj be two transactions such that tj has to wait until ti completes; the general format of a waiting
condition is given by a wait sequence.
A deadlock detection algorithm based on these wait sequences can be developed; it is
periodically activated on the various DPs of the system (a toy sketch of the local detection step
follows the list). When it is activated, it:
§ integrates the new wait sequences with the local wait conditions described by the lock
manager;
§ analyzes the wait conditions on its DP and detects local deadlocks;
§ communicates the wait sequences to other instances of the same algorithm. To avoid the
situation in which the same deadlock is discovered more than once, the algorithm sends wait
sequences:
– 'ahead', towards the DP which has received the remote procedure call;
– only if, for example, i > j, where i and j are the identifiers of the transactions.
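This sketch assumes each wait is given as an edge (i, j), meaning transaction ti waits for tj; the
exchange of wait sequences between DPs is not modelled, only the local search for a cycle:

def find_deadlock(waits):
    """Return a cycle in the wait-for graph, or None if there is none."""
    graph = {}
    for i, j in waits:
        graph.setdefault(i, []).append(j)

    def dfs(node, path, done):
        if node in path:                       # back to a waiting transaction:
            return path[path.index(node):]     # the cycle is a deadlock
        if node in done:
            return None
        done.add(node)
        for nxt in graph.get(node, []):
            cycle = dfs(nxt, path + [node], done)
            if cycle:
                return cycle
        return None

    done = set()
    for start in list(graph):
        cycle = dfs(start, [], done)
        if cycle:
            return cycle
    return None

print(find_deadlock([(1, 2), (2, 3), (3, 1)]))   # [1, 2, 3]: circular wait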
Reliability control must deal with several kinds of failure:
• Node failures may occur on any node of the system and may be soft or hard, as discussed
before.
• Message losses leave the execution of a protocol in an uncertain situation.
• Network partitioning: a failure of the communication links of the computer network which
divides it into two sub-networks that have no communication with each other. A transaction
can then be simultaneously active in more than one sub-network.
Commit protocols allow a transaction to reach the correct commit or abort decision at all the
nodes that participate in the transaction.
The two-phase commit protocol is similar in essence to a marriage: the decision of two parties is
received and registered by a third party, who ratifies the marriage.
The servers, which represent the participants in the marriage, are called resource managers
(RMs), and the role of the celebrant (or coordinator) is allocated to a process called the
transaction manager (TM).
The protocol takes place by means of a rapid exchange of messages between the TM and the
RMs, and the writing of records into their logs. The TM can use broadcast mechanisms
(transmission of the same message to many nodes, collecting the responses arriving from the
various nodes) or serial communication with each of the RMs in turn.
Records of TM
§ The prepare record contains the identity of all the RM processes (that is, their node and
process identifiers).
§ The global commit or global abort record describes the global decision. When the TM writes
the global commit or global abort record in its log, it reaches the final decision.
§ The complete record is written at the end of the two-phase commit protocol.
Records of RM
§ The ready record indicates the irrevocable availability to participate in the two-phase commit
protocol, thereby contributing to a decision to commit. It can be written only when the RM is
'recoverable', i.e., it possesses locks on all the resources that need to be written. The identifier
(process identifier and node identifier) of the TM is also written in this record.
§ In addition, begin, insert, delete, and update records are written as in centralized servers.
§ At any time, an RM can autonomously abort a sub-transaction by undoing its effects, without
participating in the two-phase commit protocol.
§ The TM writes the prepare record in its log and sends a prepare message to all the RMs; it sets
a timeout indicating the maximum time allocated to the completion of the first phase.
§ The recoverable RMs write the ready record in their own logs and transmit to the TM a ready
message, which indicates the positive choice of commit participation.
§ The non-recoverable RMs send a not-ready message and terminate the protocol.
§ The TM transmits its global decision to the RMs and then sets a second timeout.
§ The RMs that are ready receive the decision message, write the commit or abort record in
their own logs, and send an acknowledgement to the TM. They then implement the commit or
abort by writing the pages to the database, as discussed before.
§ The TM collects all the ack messages from the RMs involved in the second phase. If the
timeout expires, it sets another timeout and repeats the transmission to all the RMs from
which it has not received an ack.
§ When all the acks have arrived, the TM writes the complete record in its log. A rough sketch
of this exchange follows.
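This in-process Python sketch simulates the exchange under heavy simplification (logs are plain
lists, messages are function calls, and timeouts are omitted); it only illustrates the order of log
writes and messages, not a real implementation:

class RM:
    def __init__(self, name, recoverable=True):
        self.name, self.recoverable, self.log = name, recoverable, []

    def on_prepare(self):
        if self.recoverable:               # holds locks on all needed resources
            self.log.append('ready')       # forced log write
            return 'ready'
        return 'not-ready'                 # ends its part of the protocol

    def on_decision(self, decision):
        self.log.append(decision)          # commit or abort record
        return 'ack'

def two_phase_commit(tm_log, rms):
    tm_log.append('prepare')                              # phase 1 begins
    votes = [rm.on_prepare() for rm in rms]
    decision = 'commit' if all(v == 'ready' for v in votes) else 'abort'
    tm_log.append('global ' + decision)                   # the global decision
    ready_rms = [rm for rm in rms if rm.log and rm.log[-1] == 'ready']
    acks = [rm.on_decision(decision) for rm in ready_rms] # phase 2
    assert acks.count('ack') == len(ready_rms)
    tm_log.append('complete')                             # protocol ends
    return decision

tm_log = []
print(two_phase_commit(tm_log, [RM('rm1'), RM('rm2', recoverable=False)]))  # abort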
Two-phase commit protocol in the context of a transaction is shown in the following figure
• An RM in the ready state loses its autonomy and awaits the decision of the TM. A failure of
the TM leaves the RM in an uncertain state, and the resources acquired by means of locks
remain blocked.
• The interval between the writing of the ready record in the RM's log and the writing of the
commit or abort record is called the window of uncertainty. The protocol is designed to keep
this interval to a minimum.
• Recovery protocols are performed by the TM or the RMs after failures; they recover a final
state which depends on the global decision of the TM.
Recovery of participants
• Recovery of a participant is performed by the warm restart protocol and depends on the last
record written in its log:
– When it is an action or an abort record, the actions are undone; when it is a commit, the
actions are redone. In both cases the failure occurred before the start of the commit
protocol.
– When the last record written in the log is a ready, the failure occurred during the two-phase
commit, and the participant is in doubt about the result of the transaction.
§ During the warm restart protocol, the identifiers of the transactions in doubt are collected in a
ready set. For each of them, the final outcome of the transaction must be requested from the
TM.
§ This can happen as a result of a direct (remote recovery) request from the RM, or as a
repetition of the second phase of the protocol.
Recovery of the coordinator
§ When the last record in the log is a prepare, the failure of the TM might have placed some
RMs in a blocked situation. There are two recovery options:
– write global abort in the log, and then carry out the second phase of the protocol;
– repeat the first phase, trying to arrive at a global commit.
§ When the last record in the log is a global decision, some RMs may have been correctly
informed of the decision while others may have been left in a blocked state. The TM must
repeat the second phase.
§ The loss of a prepare or a ready message is not distinguishable by the TM: in both cases, the
timeout of the first phase expires and a global abort decision is made.
§ The loss of a decision or an ack message is likewise indistinguishable: in both cases, the
timeout of the second phase expires and the second phase is repeated.
§ A network partitioning does not cause further problems, in that the transaction will be
successful only if the TM and all the RMs belong to the same partition.
– Presumed abort: when a TM receives a remote recovery request from an in-doubt RM and it
does not know the outcome of that transaction, the TM returns a global abort decision as the
default.
§ As a consequence, forcing the prepare and global abort records can be avoided, because in
case of loss of these records the default behavior gives an identical recovery.
§ Furthermore, the complete record is not critical for the algorithm, so it need not be forced; in
some systems it is omitted. In conclusion, the records to be forced are ready, global commit,
and commit.
Read-only optimization
§ When a participant is found to have carried out only read operations (no write operations), it
responds read-only to the prepare message and suspends its execution of the protocol.
§ The coordinator then ignores read-only participants in the second phase of the protocol.
Four-phase commit protocol
§ The TM process is replicated by a backup process located on a different node. At each phase
of the protocol, the TM first informs the backup of its decisions and then communicates with
the RMs.
§ When the backup becomes the TM, it first activates another backup, to which it communicates
the information about its state, and then continues the execution of the transaction.
Three-phase commit protocol
§ The basic idea is to introduce a third, pre-commit phase in the standard protocol. If the TM
fails, a participant can be elected as the new TM and can decide the result of the transaction by
looking at its log:
– If the new TM finds ready as the last record, no other participant in the protocol can have
gone beyond the pre-commit condition, and thus it can make the decision to abort.
– If the new TM finds pre-commit as the last record, it knows that the other participants are at
least in the ready state, and thus it can make the decision to commit.
§ The three-phase commit protocol has serious inconveniences and has not been successfully
implemented:
– It lengthens the window of uncertainty.
– It is not resilient to network partitioning, unless additional quorum mechanisms are used.
Interoperability