Consensus on Transaction Commit
JIM GRAY and LESLIE LAMPORT
Microsoft Research
The distributed transaction commit problem requires reaching agreement on whether a transaction
is committed or aborted. The classic Two-Phase Commit protocol blocks if the coordinator fails.
Fault-tolerant consensus algorithms also reach agreement, but do not block whenever any majority
of the processes are working. The Paxos Commit algorithm runs a Paxos consensus algorithm on the
commit/abort decision of each participant to obtain a transaction commit protocol that uses 2F + 1
coordinators and makes progress if at least F + 1 of them are working properly. Paxos Commit
has the same stable-storage write delay, and can be implemented to have the same message delay
in the fault-free case as Two-Phase Commit, but it uses more messages. The classic Two-Phase
Commit algorithm is obtained as the special F = 0 case of the Paxos Commit algorithm.
Categories and Subject Descriptors: D.4.1 [Operating Systems]: Process Management—Concurrency; D.4.5 [Operating Systems]: Reliability—Fault-tolerance; D.4.7 [Operating Systems]: Organization and Design—Distributed systems
General Terms: Algorithms, Reliability
Additional Key Words and Phrases: Consensus, Paxos, two-phase commit
1. INTRODUCTION
A distributed transaction consists of a number of operations, performed at multiple sites, terminated by a request to commit or abort the transaction. The sites then use a transaction commit protocol to decide whether the transaction is committed or aborted. The transaction can be committed only if all sites are willing to commit it. Achieving this all-or-nothing atomicity property in a distributed system is not trivial. The requirements for transaction commit are stated precisely in Section 2.
The classic transaction commit protocol is Two-Phase Commit [Gray 1978], described in Section 3. It uses a single coordinator to reach agreement. The failure of that coordinator can cause the protocol to block, with no process knowing the outcome, until the coordinator is repaired. In Section 4, we use the Paxos consensus algorithm [Lamport 1998] to obtain a transaction commit protocol that uses multiple coordinators and makes progress whenever a majority of them are working.
2. TRANSACTION COMMIT
In a distributed system, a transaction is performed by a collection of processes called resource managers (RMs), each executing on a different node. The transaction ends when one of the resource managers issues a request either to commit or to abort the transaction. For the transaction to be committed, each participating RM must be willing to commit it; otherwise, the transaction must be aborted.
Fig. 1. The state-transition diagram for a resource manager. It begins in the working state, in
which it may decide that it wants to abort or commit. It aborts by simply entering the aborted
state. If it decides to commit, it enters the prepared state. From this state, it can commit only if all
other resource managers also decided to commit.
The protocol's two safety requirements are stability (once an RM has entered the committed or aborted state, it remains in that state forever) and consistency (it is impossible for one RM to be in the committed state while another is in the aborted state). A classic result of Fischer et al. [1985] implies that a purely asynchronous algorithm cannot satisfy the stability and consistency conditions and still guarantee progress in the presence of even a single fault. We therefore require progress only if timeliness hypotheses are satisfied. Our two liveness requirements for a transaction commit protocol are as follows:
— Nontriviality If the entire network is nonfaulty throughout the execution
of the protocol, then (a) if all RMs reach the prepared state, then all RMs
eventually reach the committed state, and (b) if some RM reaches the aborted
state, then all RMs eventually reach the aborted state.
— Nonblocking If, at any time, a sufficiently large network of nodes is nonfaulty
for long enough, then every RM executed on those nodes will eventually reach
either the committed or aborted state.
A precise statement of these two conditions would require a precise definition of what it means for a network of nodes to be nonfaulty. The meaning of "long enough" in the nonblocking condition depends on the response times of nonfaulty processes and communication networks. The nontriviality and nonblocking conditions can be stated precisely, but we will not do so here.
We can more precisely specify a transaction commit protocol by specifying
its set of legal behaviors, where a behavior is a sequence of system states. We
specify the safety properties with an initial predicate and a next-state relation
that describes all possible steps (state transitions). The initial predicate asserts
that all RMs are in the working state. To define the next-state relation, we first
define two state predicates:
— canCommit. True iff all RMs are in the prepared or committed state.
— notCommitted. True iff no RM is in the committed state.
The next-state relation asserts that each step consists of one of the following
two actions performed by a single RM:
— Prepare. The RM can change from the working state to the prepared state.
— Decide. If the RM is in the prepared state and canCommit is true, then it can transition to the committed state; and if the RM is in either the working or prepared state and notCommitted is true, then it can transition to the aborted state.
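To make this specification concrete, here is a small executable sketch in Python (our illustration; the authoritative formal specification is the TLA+ module TCommit in the Appendix). The class and method names are ours.

WORKING, PREPARED, COMMITTED, ABORTED = "working", "prepared", "committed", "aborted"

class TransactionCommit:
    def __init__(self, rm_names):
        # Initial predicate: every RM starts in the working state.
        self.state = {rm: WORKING for rm in rm_names}

    def can_commit(self):
        # True iff all RMs are in the prepared or committed state.
        return all(s in (PREPARED, COMMITTED) for s in self.state.values())

    def not_committed(self):
        # True iff no RM is in the committed state.
        return all(s != COMMITTED for s in self.state.values())

    def prepare(self, rm):
        # Prepare action: working -> prepared.
        if self.state[rm] == WORKING:
            self.state[rm] = PREPARED

    def decide(self, rm, commit):
        # Decide action: commit only while canCommit holds; abort only
        # while notCommitted holds.
        if commit and self.state[rm] == PREPARED and self.can_commit():
            self.state[rm] = COMMITTED
        elif not commit and self.state[rm] in (WORKING, PREPARED) and self.not_committed():
            self.state[rm] = ABORTED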
3. TWO-PHASE COMMIT

Fig. 2. The message flow for Two-Phase Commit in the normal failure-free case, where RM1 is the first RM to enter the prepared state.

In the normal (failure-free) case, shown in Figure 2, the N resource managers and the single coordinator, called the transaction manager (TM), exchange the following messages:
— The initiating RM enters the prepared state and sends a Prepared message to the TM. (1 message)
— The TM sends a Prepare message to every other RM. (N − 1 messages)
— Each other RM sends a Prepared message to the TM. (N − 1 messages)
— The TM sends a Commit message to every RM. (N messages)
Thus, in the normal case, the RMs learn that the transaction has been committed after four message delays. A total of 3N − 1 messages are sent. It is typical for the TM to be on the same node as the initiating RM. In that case, two of the messages are intranode and can be discounted, leaving 3N − 3 messages and three message delays.
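The following sketch (ours) shows the TM's side of this exchange in the failure-free case; send and recv are hypothetical network primitives, and a real TM must also write its commit decision to stable storage before announcing it.

def two_phase_commit_tm(rms, initiating_rm, send, recv):
    # The initiating RM has entered the prepared state and sent us
    # its Prepared message (1 message).
    assert recv(initiating_rm) == "Prepared"
    # Ask every other RM to prepare (N - 1 messages).
    others = [rm for rm in rms if rm != initiating_rm]
    for rm in others:
        send(rm, "Prepare")
    # Each other RM replies Prepared (N - 1 messages); a refusal
    # (or, in practice, a timeout) leads instead to an Abort decision.
    if all(recv(rm) == "Prepared" for rm in others):
        decision = "Commit"
    else:
        decision = "Abort"
    # Announce the decision to every RM (N messages).
    for rm in rms:
        send(rm, decision)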
As discussed in Section 3.1, we can eliminate the TM's Prepare messages, reducing the message complexity to 2N. But in practice, this requires either extra message delays or some real-time assumptions.
In addition to the message delays, the two-phase commit protocol incurs the
delays associated with writes to stable storage: the write by the first RM to
prepare, the writes by the remaining RMs when they prepare, and the write by
the TM when it makes the commit decision. This can be reduced to two write
delays by having all RMs prepare concurrently.
(In practice, an RM may notify the TM when it spontaneously aborts; we ignore this optimization.)
4. PAXOS COMMIT

4.1 The Paxos Consensus Algorithm

In the Paxos consensus algorithm, a set of acceptor processes cooperates to choose a value, coordinated by a leader process. In the normal case, the initial leader proposes a value by sending a phase 2a message containing it to every acceptor, with ballot number 0. (The missing phase 1 is explained below.) Each acceptor receives this message and replies with a phase 2b message for ballot 0. When the leader receives these phase 2b messages from a majority of acceptors, it sends a phase 3 message announcing that the value is chosen.
The initial leader may fail, causing ballot 0 not to choose a value. In that case,
some algorithm is executed to select a new leader—for example, the algorithm
of Aguilera et al. [2001]. Selecting a unique leader is equivalent to solving
the consensus problem. However, Paxos maintains consistency, never allowing
two different values to be chosen, even if multiple processes think they are
the leader. (This is unlike traditional Three-Phase Commit protocols, in which
multiple coordinators can lead to inconsistency.) A unique nonfaulty leader is
needed only to ensure liveness.
A process that believes itself to be a newly elected leader initiates a ballot,
which proceeds in the following phases. (Since there can be multiple leaders,
actions from several phases may be performed concurrently.)
— Phase 1a. The leader chooses a ballot number bal for which it is the leader
and that it believes to be larger than any ballot number for which phase 1
has been performed. The leader sends a phase 1a message for ballot number
bal to every acceptor.
— Phase 1b. When an acceptor receives the phase 1a message for ballot number
bal, if it has not already performed any action for a ballot numbered bal or
higher, it responds with a phase 1b message containing its current state,
which consists of
—the largest ballot number for which it received a phase 1a message, and
—the phase 2b message with the highest ballot number it has sent, if
any.
The acceptor ignores the phase 1a message if it has performed an action for
a ballot numbered bal or greater.
— Phase 2a. When the leader has received a phase 1b message for ballot number
bal from a majority of the acceptors, it can learn one of two possibilities:
— Free. None of the majority of acceptors reports having sent a phase 2b message, so the algorithm has not yet chosen a value.
— Forced. Some acceptor in the majority reports having sent a phase 2b message. Let μ be the maximum ballot number of all the reported phase 2b messages, and let Mμ be the set of all those phase 2b messages that have ballot number μ. All the messages in Mμ have the same value v, which might already have been chosen.
In the free case, the leader can try to get any value accepted; it usually picks
the first value proposed by a client. In the forced case, it tries to get the value
v chosen by sending a phase 2a message with value v and ballot number bal
to every acceptor.
— Phase 2b. When an acceptor receives a phase 2a message for a value v and
ballot number bal, if it has not already received a phase 1a or 2a message for
a larger ballot number, it accepts that message and sends a phase 2b message
for v and bal to the leader. The acceptor ignores the message if it has already
participated in a higher-numbered ballot.
— Phase 3. When the leader has received phase 2b messages for value v and
ballot bal from a majority of the acceptors, it knows that the value v has been
chosen and communicates that fact to all interested processes with a phase 3
message.
Ballot 0 has no phase 1 because there are no lower-numbered ballots, so there
is nothing for acceptors to report in phase 1b messages.
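The following sketch (ours, with invented names) illustrates an acceptor's state and its handling of phase 1a and 2a messages, together with the leader's free/forced choice of a value in phase 2a. A real acceptor must record its state in stable storage before replying.

class Acceptor:
    def __init__(self):
        self.mbal = -1                 # largest ballot for which we have acted
        self.bal, self.val = -1, None  # highest-numbered phase 2b sent, if any

    def on_phase1a(self, b):
        # Respond only if we have not acted for a ballot numbered b or higher.
        if b > self.mbal:
            self.mbal = b
            return ("phase1b", b, self.bal, self.val)
        return None  # ignore the message

    def on_phase2a(self, b, v):
        # Accept unless we have already participated in a higher ballot.
        if b >= self.mbal:
            self.mbal, self.bal, self.val = b, b, v
            return ("phase2b", b, v)
        return None  # ignore the message

def choose_value(phase1b_replies, client_value):
    # Phase 2a value selection from a majority of phase 1b replies:
    # forced if any acceptor reports a previous phase 2b message.
    reported = [(bal, val) for (bal, val) in phase1b_replies if bal >= 0]
    if not reported:
        return client_value  # free case: any value may be proposed
    return max(reported)[1]  # forced case: value of the highest ballot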
An explanation of why the Paxos algorithm is correct can be found in the
literature [De Prisco et al. 1997; Lamport 1998, 2001; Lampson 1996]. As with
any asynchronous algorithm, process failure and restart is handled by having
each process record the necessary state information in stable storage.
The algorithm can be optimized in two independent ways. We can reduce the number of messages in the normal fault-free case by having the leader send phase 2a messages only to a majority of the acceptors. The leader will know that value v is chosen if it receives phase 2b messages from that majority of acceptors. It can send phase 2a messages to additional acceptors if it does not receive enough phase 2b messages. The second optimization is to eliminate the message delay of phase 3, at the cost of extra messages, by having acceptors send their phase 2b messages directly to all processes that need to know the chosen value. Like the leader, those processes learn the chosen value when they receive phase 2b messages from a majority of the acceptors.
The Paxos algorithm guarantees that at most one value is chosen despite any nonmalicious failure of any part of the system—that is, as long as processes do not make errors in executing the algorithm and the communication network does not undetectably corrupt messages. It guarantees progress if a unique leader is selected and if the network of nodes executing both that leader and some majority of acceptors is nonfaulty for a long enough period of time. A precise statement and proof of this progress condition has been given by De Prisco et al. [1997].
In practice, it is not difficult to construct an algorithm that, except during rare
periods of network instability, selects a suitable unique leader among a majority
of nonfaulty acceptors. Transient failure of the leader-selection algorithm is
harmless, violating neither safety nor eventual progress. One algorithm for
leader selection was presented by Aguilera et al. [2001].
4.2 The Paxos Commit Algorithm

Paxos Commit runs a separate instance of the Paxos consensus algorithm to choose a value, Prepared or Aborted, for each RM; the transaction is committed iff every instance chooses Prepared. If an RM's ballot 0 phase 2a message contains the value Aborted, the leader can short-circuit the protocol by immediately informing all processes that the transaction has aborted. (Once a process knows that the transaction has been aborted, it can ignore all other protocol messages.) This
short-circuiting is possible only for phase 2a messages with ballot number 0. It
is possible for an instance of the Paxos algorithm to choose the value Prepared
even though a leader has sent a phase 2a message (for a ballot number greater
than 0) with value Aborted.
We briefly sketch an intuitive proof of correctness of Paxos Commit. Recall
that, in Section 2, we stated that a nonblocking algorithm should satisfy four
properties: stability, consistency, nontriviality, and nonblocking. The algorithm
satisfies stability because, once an RM receives a decision from a leader, it never
changes its view of what value has been chosen. Consistency holds because each
instance of the Paxos algorithm chooses a unique value, so different leaders
cannot send different decisions. Nontriviality holds if the leader waits long
enough before performing phase 1a for a new ballot number so that, if there
are no failures, then each Paxos instance will finish performing phase 2 for
ballot 0. The nonblocking property follows from the Paxos progress property,
which implies that each instance of Paxos eventually chooses either Prepared
or Aborted if a large enough network of acceptors is nonfaulty. More precisely,
the nonblocking property holds if Paxos satisfies the liveness requirement for
consensus, which is the case if the leader-selection algorithm ensures that a
unique nonfaulty leader is chosen whenever a large enough subnetwork of the
acceptors’ nodes is nonfaulty for a long enough time.
The safety part of the algorithm—that is, the algorithm with no progress
requirements—is specified formally in Section A.3 of the Appendix, along with
a theorem asserting that it implements transaction commit. The correctness
of this theorem has been checked by the TLC model checker on configurations
that are too small to detect subtle errors, but are probably large enough to find
simple “coding” errors. Rigorous proofs of the Paxos algorithm convince us that
it harbors no subtle errors, and correctness of the Paxos Commit algorithm is
a simple corollary of the correctness of Paxos.
4.3 The Cost of Paxos Commit
We now consider the cost of Paxos Commit in the normal case, when the transaction is committed. The sequence of message exchanges is shown in Figure 3.

Fig. 3. The message flow for Paxos Commit in the normal failure-free case, where RM1 is the first RM to enter the prepared state, and 2a Prepared and 2b Prepared are the phase 2a and 2b messages of the Paxos consensus algorithm.

We again assume that there are N RMs. We consider a system that can tolerate F faults, so there are 2F + 1 acceptors. However, we assume the optimization in which the leader sends phase 2a messages to F + 1 acceptors, and only if one or more of them fail are other acceptors used. In the normal case, the Paxos Commit algorithm uses the following potentially internode messages:
— The first RM to prepare sends a BeginCommit message to the leader. (1
message)
— The leader sends a Prepare message to every other RM. (N − 1 messages)
— Each RM sends a ballot 0 phase 2a Prepared message for its instance of Paxos
to the F + 1 acceptors. (N(F + 1) messages)
—For each RM’s instance of Paxos, an acceptor responds to a phase 2a message
by sending a phase 2b Prepared message to the leader. However, an acceptor
can bundle the messages for all those instances into a single message. (F + 1
messages)
—The leader sends a single Commit message to each RM containing a phase 3
Prepared message for every instance of Paxos. (N messages)
The RMs therefore learn after five message delays that the transaction has
been committed. A total of (N + 1)(F + 3) − 2 messages are sent. If the initial
leader is on the same node as one of the acceptors, then that acceptor’s phase 2b
Prepared message is intranode and can be discounted. Moreover, the first RM’s
BeginCommit message can combine with its phase 2a Prepared message to that
acceptor, reducing the total number of messages to (N + 1)(F + 3) − 4. If N ≥ F
and each acceptor is on the same node as an RM, with the first RM being on
the same node as the leader, then the messages between the first RM and the
leader and an additional F of the phase 2a messages are intranode, leaving N(F + 3) − 3 internode messages.
As observed above, we can eliminate phase 3 of Paxos by having each acceptor
send its phase 2b messages directly to all the RMs. This allows the RMs to learn
the outcome in only four message delays, but a total of N(2F + 3) messages are
required. Letting the leader be on the same node as an acceptor eliminates
one of those messages. If each acceptor is on the same node as an RM, and
the leader is on the same node as the first RM, then the initial BeginCommit
message, F + 1 of the phase 2a messages, and F + 1 of the phase 2b messages
can be discounted, leaving (N − 1)(2F + 3) messages.
We have seen so far that Paxos Commit requires five message delays, which can be reduced to four by eliminating phase 3 and having acceptors send extra phase 2b messages. Two of those message delays result from the sending of Prepare messages to the RMs. As observed in Section 3.1, these delays can be eliminated by allowing the RMs to prepare spontaneously, leaving just two message delays. This is optimal because implementing transaction commit requires reaching consensus on an RM's decision, and it can be shown that any fault-tolerant consensus algorithm requires at least two message delays to choose a value.
5. PAXOS VERSUS TWO-PHASE COMMIT

Figure 4 compares the number of messages used by Two-Phase Commit, Paxos Commit, and Faster Paxos Commit. In the colocated case, each acceptor is on the same node as an RM, and the first RM is on the same node as the initial leader. In Paxos Commit without colocation, we assume that the initial leader is an acceptor.
For the near future, system designers are likely to be satisfied with a commit algorithm that is nonblocking despite at most one failure—the F = 1 case. In this case, for a transaction with 5 RMs, Two-Phase Commit uses 12 messages, regular Paxos Commit uses 17, and Faster Paxos Commit uses 20 (with colocation). For larger values of N, the three algorithms use about 3N, 4N, and 5N messages, respectively (with or without colocation).
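These counts follow directly from the formulas derived above; the following script (ours) restates them and checks the numbers just quoted.

# Message counts in the normal (commit) case, from the formulas in the text.
# 'colocated' means acceptors share nodes with RMs and the initial leader
# shares a node with the first RM (for 2PC: the TM with the initiating RM).

def two_phase_commit(n, colocated=True):
    return 3 * n - 3 if colocated else 3 * n - 1

def paxos_commit(n, f, colocated=True):
    return n * (f + 3) - 3 if colocated else (n + 1) * (f + 3) - 2

def faster_paxos_commit(n, f, colocated=True):
    return (n - 1) * (2 * f + 3) if colocated else n * (2 * f + 3)

# With N = 5 RMs and F = 1 (three acceptors), colocation assumed:
assert two_phase_commit(5) == 12
assert paxos_commit(5, 1) == 17
assert faster_paxos_commit(5, 1) == 20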
Consider now the trivial case of Paxos Commit with F = 0, so there is just a
single acceptor and a single possible leader, and the algorithm does not tolerate
any acceptor faults. (The algorithm can still tolerate RM faults.) Let the single
acceptor and the leader be on the same node. The single phase 2b message of the
Paxos consensus algorithm then serves as a phase 3 message, making phase 3
unnecessary. Paxos Commit therefore becomes the same as Faster Paxos Com-
mit. Figure 4 shows that, when F = 0, Two-Phase Commit and Paxos Commit
use the same number of messages, 3N − 1 or 3N − 3, depending on whether or
not colocation is assumed. In fact, Two-Phase Commit and Paxos Commit are
essentially the same when F = 0. The two algorithms are isomorphic under the
following correspondence:
Two-Phase Commit Paxos Commit
TM ↔ acceptor/leader
Prepare message ↔ Prepare message
Prepared message ↔ phase 2a Prepared message
Commit message ↔ Commit message
Aborted message ↔ phase 2a Aborted message
Abort message ↔ Abort message
The phase 2b/phase 3 Aborted message that corresponds to a TM abort message
is one generated by any instance of the Paxos algorithm, indicating that the
transaction is aborted because not all instances chose Prepared. The phase 1
and 2 messages that precede it are all sent between the leader and the acceptor,
which are on the same node.
The Two-Phase Commit protocol is thus the degenerate case of the Paxos
Commit algorithm with a single acceptor.
6. TRANSACTION CREATION AND REGISTRATION

In a dynamic system, the set of RMs participating in a transaction is not known in advance; RMs join the transaction as it executes. A process called the registrar keeps track of the participants, and its instance of the consensus algorithm chooses the set of RMs that have joined, rather than the value Prepared or Aborted. As with an RM, Paxos Commit runs a separate instance of the Paxos consensus algorithm to decide upon the registrar's input, using the same set of acceptors. The transaction is committed iff the consensus algorithm for the registrar chooses a set of RMs and the instance of the consensus algorithm for each of those RMs chooses Prepared.
The registrar is generally on the same node as the initial leader, which is
typically on the same node as the RM that creates the transaction. In Two-
Phase Commit, the registrar’s function is usually performed by the TM rather
than by a separate process. (Recall that, for the case of Two-Phase Commit, the
Paxos consensus algorithm is the trivial one in which the TM simply chooses
the value and writes it to stable storage.)
We now describe how the dynamic Paxos algorithm works. When the commit protocol begins, the registrar proposes, as the value for its instance of the consensus algorithm, the set J of RMs that have joined the transaction. It does this by sending a ballot 0 phase 2a message containing J to the acceptors. (The transaction descriptor lists the acceptors.)
This instance of the consensus algorithm is executed in the same way as the
instances for the RMs. Failure of the registrar could cause a leader to begin a
higher-numbered ballot and get Aborted chosen as the registrar’s value.
The registrar must never send a ballot 0 phase 2a message with an incorrect
value of J , even if it fails and restarts. The registrar can record in stable storage
when it begins executing the transaction and simply not send the phase 2a
message if it subsequently fails and restarts. It is possible to eliminate even this
one write per transaction by having the registrar write once to stable storage
whenever it restarts from a failure.
Meanwhile, as described in Section 4.2, each RM initiates its instance of
the consensus algorithm by sending a ballot 0 phase 2a message with the value
Prepared or Aborted to the acceptors. The transaction is defined to be committed
if the registrar’s instance of the consensus algorithm chooses a set J and the
instance for each RM in J chooses Prepared. The transaction is defined to be
aborted if the instance for any RM in J chooses Aborted, or if the registrar’s
instance chooses Aborted instead of a set J .
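The commit rule just stated can be expressed as a small predicate over the values chosen by the consensus instances; the following sketch (ours, with invented names) is one way to write it.

def transaction_committed(chosen):
    # chosen["registrar"] is either the string "aborted" or a set J of RM
    # names; chosen[rm] is "prepared" or "aborted" for each decided instance.
    j = chosen.get("registrar")
    if j == "aborted" or j is None:
        return False  # registrar aborted, or its instance is undecided
    return all(chosen.get(rm) == "prepared" for rm in j)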
Having a dynamically chosen set of RMs requires one change to the execution of the multiple instances of the Paxos consensus algorithm. Recall that an acceptor combines into a single message its phase 2b messages for all instances. The acceptor waits until it knows what phase 2b message to send for all instances before sending this one message. However, "all instances" includes an instance for each participating RM, and the set of participating RMs is chosen by the registrar's instance. To break this circularity, we observe that, if the registrar's instance chooses the value Aborted, then it doesn't matter what values are chosen by the RMs' instances. Therefore, the acceptor waits until it is ready to send a phase 2b message for the registrar's instance. If that message contains a set J of RMs as a value, then the acceptor waits until it can send the phase 2b message for each RM in J. If the phase 2b message for the registrar's instance contains the value Aborted, then the acceptor sends only that phase 2b message.
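This bundling rule can be sketched as follows (our illustration; the names are invented). Here votes maps each instance for which this acceptor is ready to send a ballot 0 phase 2b message to the corresponding value.

def bundle_phase2b(votes):
    reg = votes.get("registrar")
    if reg is None:
        return None                      # wait for the registrar's instance
    if reg == "aborted":
        return {"registrar": "aborted"}  # send only the registrar's message
    # reg is the set J: wait until every RM in J has a vote, then send all.
    if all(rm in votes for rm in reg):
        return {"registrar": reg, **{rm: votes[rm] for rm in reg}}
    return None                          # still waiting for some RM in J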
As explained in Section 4.2, the protocol can be short-circuited and abort messages sent to all processes if any participating RM chooses the value Aborted. Instead of sending a phase 2a message, the RM can simply send an abort message to the coordinator processes. The registrar can relay the abort message to all other RMs that have joined the transaction.
Failure of the registrar before it sends its ballot 0 phase 2a message causes
the transaction to abort. However, failure of a single RM can also cause the
transaction to abort. Fault-tolerance means only that failure of an individual
process does not prevent a commit/abort decision from being made.
One process that needs to learn the outcome could be an RM that sent a phase 2a Prepared message and timed out without learning the outcome. Having such a process learn the transaction's outcome forces the transaction to commit or abort if it had not already done so.
7. CONCLUSION
Two-Phase Commit is the classical transaction commit protocol. Indeed, it is sometimes thought to be synonymous with transaction commit [Newcomer 2002]. Two-Phase Commit is not fault-tolerant because it uses a single coordinator whose failure can cause the protocol to block. We have introduced Paxos Commit, a new transaction commit protocol that uses multiple coordinators and makes progress if a majority of them are working. Hence, 2F + 1 coordinators can make progress even if F of them are faulty. Two-Phase Commit is isomorphic to Paxos Commit with a single coordinator.
In the normal, failure-free case, Paxos Commit requires one more message delay than Two-Phase Commit. This extra message delay is eliminated by Faster Paxos Commit, which has the theoretically minimal message delay for a nonblocking protocol.
Nonblocking transaction commit protocols were first proposed in the early
1980s [Bernstein et al. 1987; Borr 1981; Skeen 1981]. The initial algorithms
had two message delays more than Two-Phase Commit in the failure-free case;
later algorithms reduced this to one extra message delay [Bernstein et al.
1987]. All of these algorithms used a coordinator process and assumed that
two different processes could never both believe they were the coordinator—
an assumption that cannot be implemented in a purely asynchronous system.
Transient network failures could cause them to violate the consistency require-
ment of transaction commit. It is easy to implement nonblocking commit using
a consensus algorithm—an observation also made in the 1980s [Mohan et al.
1983]. However, the obvious way of doing this leads to one message delay more
than that of Paxos Commit. The only algorithm that achieved the low message delay of Faster Paxos Commit was that of Guerraoui et al. [1996]. It is
essentially the same as Faster Paxos Commit in the absence of failures. (It can
be modified with an optimization analogous to the sending of phase 2a mes-
sages only to a majority of acceptors to give it the same message complexity
as Faster Paxos Commit.) This similarity to Paxos Commit is not surprising,
since most asynchronous consensus algorithms (and most incomplete attempts
at algorithms) are the same as Paxos in the failure-free case. However, their
algorithm is more complicated than Paxos Commit. It uses a special procedure for the failure-free case and calls upon a modified version of an ordinary consensus algorithm, which adds an extra message delay in the event of
failure.
With 2F + 1 coordinators and N resource managers, Paxos Commit requires about 2FN more messages than Two-Phase Commit in the normal case. Both algorithms incur the same delay for writing to stable storage. In modern local area networks, messages are cheap, and the cost of writing to stable storage can be much larger than the cost of sending messages. So in many systems, the extra messages of Paxos Commit may be a small price to pay for making the commit protocol nonblocking.
APPENDIX
A.1 The TLA+ Specification of Transaction Commit

TCTypeOK ≜
  The type-correctness invariant.
  rmState ∈ [RM → {“working”, “prepared”, “committed”, “aborted”}]

TCInit ≜
  The initial predicate.
  rmState = [rm ∈ RM ↦ “working”]

canCommit ≜
  True iff all RMs are in the “prepared” or “committed” state.
  ∀ rm ∈ RM : rmState[rm] ∈ {“prepared”, “committed”}

notCommitted ≜
  True iff no resource manager has decided to commit.
  ∀ rm ∈ RM : rmState[rm] ≠ “committed”

We now define the actions that may be performed by the RMs, and then define the complete next-state action of the specification to be the disjunction of the possible RM actions.

Prepare(rm) ≜ ∧ rmState[rm] = “working”
              ∧ rmState′ = [rmState EXCEPT ![rm] = “prepared”]

Decide(rm) ≜ ∨ ∧ rmState[rm] = “prepared”
               ∧ canCommit
               ∧ rmState′ = [rmState EXCEPT ![rm] = “committed”]
             ∨ ∧ rmState[rm] ∈ {“working”, “prepared”}
               ∧ notCommitted
               ∧ rmState′ = [rmState EXCEPT ![rm] = “aborted”]

TCNext ≜ ∃ rm ∈ RM : Prepare(rm) ∨ Decide(rm)
  The next-state action.

TCSpec ≜ TCInit ∧ □[TCNext]_rmState
  The complete specification of the protocol.
A.2 The TLA+ Specification of Two-Phase Commit

VARIABLES
  rmState,      rmState[rm] is the state of resource manager rm.
  tmState,      The state of the transaction manager.
  tmPrepared,   The set of RMs from which the TM has received “Prepared” messages.
  msgs
In the protocol, processes communicate with one another by sending messages. Since we are specifying only safety, a process is not required to receive a message, so there is no need to model message loss. (There’s no difference between a process not being able to receive a message because the message was lost and a process simply ignoring the message.) We therefore represent message passing with a variable msgs whose value is the set of all messages that have been sent. Messages are never removed from msgs. An action that, in an implementation, would be enabled by the receipt of a certain message is here enabled by the existence of that message in msgs. (Receipt of the same message twice is therefore allowed; but in this particular protocol, receiving a message for the second time has no effect.)
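This modeling trick, in which sending adds a message to a set that is never pruned and receipt is mere membership, can be sketched in a few lines of Python (our illustration, not part of the specification):

msgs = set()

def send(msg):
    # Sending only adds to the set; messages are never removed.
    msgs.add(frozenset(msg.items()))

def can_receive(msg):
    # "Receiving" is just testing membership, so message loss and
    # duplication need no special treatment in a safety-only model.
    return frozenset(msg.items()) in msgs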
Message ≜
  The set of all possible messages. Messages of type “Prepared” are sent from the RM indicated by the message’s rm field to the TM. Messages of type “Commit” and “Abort” are broadcast by the TM, to be received by all RMs. The set msgs contains just a single copy of such a message.
  [type : {“Prepared”}, rm : RM] ∪ [type : {“Commit”, “Abort”}]
TPTypeOK ≜
  The type-correctness invariant.
  ∧ rmState ∈ [RM → {“working”, “prepared”, “committed”, “aborted”}]
  ∧ tmState ∈ {“init”, “committed”, “aborted”}
  ∧ tmPrepared ⊆ RM
  ∧ msgs ⊆ Message

TPInit ≜
  The initial predicate.
  ∧ rmState = [rm ∈ RM ↦ “working”]
  ∧ tmState = “init”
  ∧ tmPrepared = {}
  ∧ msgs = {}
We now define the actions that may be performed by the processes, first the TM’s actions, then
the RMs’ actions.
TMRcvPrepared(rm) ≜
  The TM receives a “Prepared” message from resource manager rm.
  ∧ tmState = “init”
  ∧ [type ↦ “Prepared”, rm ↦ rm] ∈ msgs
  ∧ tmPrepared′ = tmPrepared ∪ {rm}
  ∧ UNCHANGED ⟨rmState, tmState, msgs⟩

TMCommit ≜
  The TM commits the transaction; enabled iff the TM is in its initial state and every RM has sent a “Prepared” message.
  ∧ tmState = “init”
  ∧ tmPrepared = RM
  ∧ tmState′ = “committed”
  ∧ msgs′ = msgs ∪ {[type ↦ “Commit”]}
  ∧ UNCHANGED ⟨rmState, tmPrepared⟩

TMAbort ≜
  The TM spontaneously aborts the transaction.
  ∧ tmState = “init”
  ∧ tmState′ = “aborted”
  ∧ msgs′ = msgs ∪ {[type ↦ “Abort”]}
  ∧ UNCHANGED ⟨rmState, tmPrepared⟩

RMPrepare(rm) ≜
  Resource manager rm prepares. (The last two conjuncts, truncated in this copy, are reconstructed from the Message definition: rm announces its preparation to the TM.)
  ∧ rmState[rm] = “working”
  ∧ rmState′ = [rmState EXCEPT ![rm] = “prepared”]
  ∧ msgs′ = msgs ∪ {[type ↦ “Prepared”, rm ↦ rm]}
  ∧ UNCHANGED ⟨tmState, tmPrepared⟩
We now assert that the Two-Phase Commit protocol implements the Transaction Commit protocol of module TCommit. The following statement defines TC!TCSpec to be formula TCSpec of module TCommit. (The TLA+ INSTANCE statement is used to rename the operators defined in module TCommit to avoid any name conflicts that might exist with operators in the current module.)

TC ≜ INSTANCE TCommit

THEOREM TPSpec ⇒ TC!TCSpec
  This theorem asserts that the specification TPSpec of the Two-Phase Commit protocol implements the specification TCSpec of the Transaction Commit protocol.
The two theorems in this module have been checked with TLC for six RMs, a configuration with
50816 reachable states, in a little over a minute on a 1-GHz PC.
A.3 The TLA+ Specification of Paxos Commit

VARIABLES
  rmState,   rmState[rm] is the state of resource manager rm.
  aState,    aState[ins][ac] is the state of acceptor ac for instance ins of the Paxos algorithm.
  msgs       The set of all messages ever sent.

PCTypeOK ≜
  The type-correctness invariant. Each acceptor maintains the values mbal, bal, and val for each instance of the Paxos consensus algorithm.
  ∧ rmState ∈ [RM → {“working”, “prepared”, “committed”, “aborted”}]
  ∧ aState ∈ [RM → [Acceptor → [mbal : Ballot,
                                bal : Ballot ∪ {−1},
                                val : {“prepared”, “aborted”, “none”}]]]
  ∧ msgs ∈ SUBSET Message

PCInit ≜
  The initial predicate.
  ∧ rmState = [rm ∈ RM ↦ “working”]
  ∧ aState = [ins ∈ RM ↦
                [ac ∈ Acceptor ↦ [mbal ↦ 0, bal ↦ −1, val ↦ “none”]]]
  ∧ msgs = {}
The Actions

Send(m) ≜ msgs′ = msgs ∪ {m}
  An action expression that describes the sending of message m.
RM Actions

RMPrepare(rm) ≜
  Resource manager rm prepares by sending a phase 2a message for ballot number 0 with value “prepared”.
  ∧ rmState[rm] = “working”
  ∧ rmState′ = [rmState EXCEPT ![rm] = “prepared”]
  ∧ Send([type ↦ “phase2a”, ins ↦ rm, bal ↦ 0, val ↦ “prepared”])
  ∧ UNCHANGED aState

RMChooseToAbort(rm) ≜
  Resource manager rm spontaneously decides to abort. It may (but need not) send a phase 2a message for ballot number 0 with value “aborted”.
  ∧ rmState[rm] = “working”
  ∧ rmState′ = [rmState EXCEPT ![rm] = “aborted”]
  ∧ Send([type ↦ “phase2a”, ins ↦ rm, bal ↦ 0, val ↦ “aborted”])
  ∧ UNCHANGED aState

RMRcvCommitMsg(rm) ≜
  Resource manager rm is told by the leader to commit. When this action is enabled, rmState[rm] must equal either “prepared” or “committed”. In the latter case, the action leaves the state unchanged (it is a “stuttering step”).
  ∧ [type ↦ “Commit”] ∈ msgs
  ∧ rmState′ = [rmState EXCEPT ![rm] = “committed”]
  ∧ UNCHANGED ⟨aState, msgs⟩

RMRcvAbortMsg(rm) ≜
  Resource manager rm is told by the leader to abort. It could be in any state except “committed”.
  ∧ [type ↦ “Abort”] ∈ msgs
  ∧ rmState′ = [rmState EXCEPT ![rm] = “aborted”]
  ∧ UNCHANGED ⟨aState, msgs⟩
Leader Actions

The following actions are performed by any process that believes itself to be the current leader. Since leader selection is not assumed to be reliable, multiple processes could simultaneously consider themselves to be the leader.

Phase1a(bal, rm) ≜
  If the leader times out without learning that a decision has been reached on resource manager rm's prepare/abort decision, it can perform this action to initiate a new ballot bal. (Sending duplicate phase 1a messages is harmless.)
  ∧ Send([type ↦ “phase1a”, ins ↦ rm, bal ↦ bal])
  ∧ UNCHANGED ⟨rmState, aState⟩
Phase2a(bal, rm) ≜
  The action in which a leader sends a phase 2a message with ballot bal > 0 in instance rm, if it has received phase 1b messages for ballot number bal from a majority of acceptors. If the leader received a phase 1b message from some acceptor that had sent a phase 2b message for this instance, then mu ≥ 0 and the value v the leader sends is determined by the phase 1b messages. (If v = “prepared”, then rm must have prepared.) Otherwise, mu = −1 and the leader sends the value “aborted”.
  The first conjunct asserts that the action is disabled if any leader has already sent a phase 2a message with ballot number bal. In practice, this is implemented by having ballot numbers partitioned among potential leaders, and having a leader record in stable storage the largest ballot number for which it sent a phase 2a message.
  ∧ ¬∃ m ∈ msgs : ∧ m.type = “phase2a”
                  ∧ m.bal = bal
                  ∧ m.ins = rm
  ∧ ∃ MS ∈ Majority :
      LET mset ≜ {m ∈ msgs : ∧ m.type = “phase1b”
                             ∧ m.ins = rm
                             ∧ m.mbal = bal
                             ∧ m.acc ∈ MS}
          mu ≜ Maximum({m.bal : m ∈ mset})
          v ≜ IF mu = −1 THEN “aborted”
                         ELSE (CHOOSE m ∈ mset : m.bal = mu).val
      IN ∧ ∀ ac ∈ MS : ∃ m ∈ mset : m.acc = ac
         ∧ Send([type ↦ “phase2a”, ins ↦ rm, bal ↦ bal, val ↦ v])
  ∧ UNCHANGED ⟨rmState, aState⟩
Decide ≜
  A leader can decide that Paxos Commit has reached a result and send a message announcing the result if it has received the necessary phase 2b messages.
  ∧ LET Decided(rm, v) ≜
          True iff instance rm of the Paxos consensus algorithm has chosen the value v.
          ∃ b ∈ Ballot, MS ∈ Majority :
            ∀ ac ∈ MS : [type ↦ “phase2b”, ins ↦ rm,
                         bal ↦ b, val ↦ v, acc ↦ ac] ∈ msgs
     IN ∨ ∧ ∀ rm ∈ RM : Decided(rm, “prepared”)
          ∧ Send([type ↦ “Commit”])
        ∨ ∧ ∃ rm ∈ RM : Decided(rm, “aborted”)
          ∧ Send([type ↦ “Abort”])
  ∧ UNCHANGED ⟨rmState, aState⟩
Acceptor Actions

Phase1b(acc) ≜
  ∃ m ∈ msgs :
    ∧ m.type = “phase1a”
    ∧ aState[m.ins][acc].mbal < m.bal
    ∧ aState′ = [aState EXCEPT ![m.ins][acc].mbal = m.bal]
    ∧ Send([type ↦ “phase1b”,
            ins ↦ m.ins,
            mbal ↦ m.bal,
            bal ↦ aState[m.ins][acc].bal,
            val ↦ aState[m.ins][acc].val,
            acc ↦ acc])
    ∧ UNCHANGED rmState

Phase2b(acc) ≜
  ∃ m ∈ msgs :
    ∧ m.type = “phase2a”
    ∧ aState[m.ins][acc].mbal ≤ m.bal
    ∧ aState′ = [aState EXCEPT ![m.ins][acc].mbal = m.bal,
                               ![m.ins][acc].bal = m.bal,
                               ![m.ins][acc].val = m.val]
    ∧ Send([type ↦ “phase2b”, ins ↦ m.ins, bal ↦ m.bal,
            val ↦ m.val, acc ↦ acc])
    ∧ UNCHANGED rmState
PCNext ≜
  The next-state action.
  ∨ ∃ rm ∈ RM : ∨ RMPrepare(rm)
                ∨ RMChooseToAbort(rm)
                ∨ RMRcvCommitMsg(rm)
                ∨ RMRcvAbortMsg(rm)
  ∨ ∃ bal ∈ Ballot \ {0}, rm ∈ RM : Phase1a(bal, rm) ∨ Phase2a(bal, rm)
  ∨ Decide
  ∨ ∃ acc ∈ Acceptor : Phase1b(acc) ∨ Phase2b(acc)

PCSpec ≜ PCInit ∧ □[PCNext]_⟨rmState, aState, msgs⟩
  The complete specification of the Paxos Commit protocol.
We now assert that the Paxos Commit protocol implements the transaction commit protocol of module TCommit. The following statement defines TC!TCSpec to be the formula TCSpec of module TCommit. (The TLA+ INSTANCE statement is used to rename the operators defined in module TCommit to avoid possible name conflicts with operators in the current module having the same name.)

TC ≜ INSTANCE TCommit

THEOREM PCSpec ⇒ TC!TCSpec
REFERENCES
AGUILERA, M. K., DELPORTE-GALLET, C., FAUCONNIER, H., AND TOUEG, S. 2001. Stable leader election.
In DISC ’01: Proceedings of the 15th International Conference on Distributed Computing, J. L.
Welch, Ed. Lecture Notes in Computer Science, vol. 2180. Springer-Verlag, Berlin, Germany,
108–122.
ALPERN, B. AND SCHNEIDER, F. B. 1985. Defining liveness. Inf. Process. Lett. 21, 4 (Oct.), 181–185.
BERNSTEIN, P. A., HADZILACOS, V., AND GOODMAN, N. 1987. Concurrency Control and Recovery in
Database Systems. Addison-Wesley, Reading, MA.
BORR, A. J. 1981. Transaction monitoring in Encompass: Reliable distributed transaction processing. In Proceedings of the 1981 ACM SIGMOD International Conference on Management of Data (Ann Arbor, MI, April 29-May 1), Y. E. Lien, Ed. ACM Press, New York, NY, 155–165.
CHARRON-BOST, B. AND SCHIPER, A. 2000. Uniform consensus is harder than consensus (extended
abstract). Tech. rep. DSC/2000/028. École Polytechnique Fédérale de Lausanne, Switzerland.
DE PRISCO, R., LAMPSON, B., AND LYNCH, N. 1997. Revisiting the Paxos algorithm. In Proceedings of the 11th International Workshop on Distributed Algorithms (WDAG 97), M. Mavronicolas and P. Tsigas, Eds. Lecture Notes in Computer Science, vol. 1320. Springer-Verlag, Saarbrücken, Germany, 111–125.
DWORK, C., LYNCH, N., AND STOCKMEYER, L. 1988. Consensus in the presence of partial synchrony.
J. Assoc. Comput. Mach. 35, 2 (Apr.), 288–323.
FISCHER, M. J., LYNCH, N., AND PATERSON, M. S. 1985. Impossibility of distributed consensus with
one faulty process. J. Assoc. Comput. Mach. 32, 2 (Apr.), 374–382.
GRAY, J. 1978. Notes on data base operating systems. In Operating Systems: An Advanced Course, R. Bayer, R. M. Graham, and G. Seegmüller, Eds. Lecture Notes in Computer Science, vol. 60. Springer-Verlag, Berlin, Heidelberg, Germany/New York, NY, 393–481.
GUERRAOUI, R. 1995. Revisiting the relationship between nonblocking atomic commitment
and consensus. In Proceedings of the 9th International Workshop on Distributed Algorithms
(WDAG95), J.-M. Hélary and M. Raynal, Eds. Lecture Notes in Computer Science, vol. 972.
Springer-Verlag, Le Mont-Saint-Michel, France, 87–100.
GUERRAOUI, R., LARREA, M., AND SCHIPER, A. 1996. Reducing the cost for nonblocking in atomic commitment. In Proceedings of the 16th International Conference on Distributed Computing Systems (ICDCS). IEEE Computer Society Press, Los Alamitos, CA, 692–697.
LAMPORT, L. 1998. The part-time parliament. ACM Trans. Comput. Syst. 16, 2 (May), 133–169.
LAMPORT, L. 2001. Paxos made simple. ACM SIGACT News (Distributed Computing Column)
32, 4 (Dec.), 18–25.
LAMPORT, L. 2003. Specifying Systems. Addison-Wesley, Boston, MA. A link to an electronic copy
can be found online at http://lamport.org.
LAMPSON, B. W. 1996. How to build a highly available system using consensus. In Distributed
Algorithms, O. Babaoglu and K. Marzullo, Eds. Lecture Notes in Computer Science, vol. 1151.
Springer-Verlag, Berlin, Germany, 1–17.
MOHAN, C., STRONG, R., AND FINKELSTEIN, S. 1983. Method for distributed transaction commit and
recovery using Byzantine agreement within clusters of processors. In Proceedings of the Second
Annual ACM Symposium on Principles of Distributed Computing. The ACM Press, New York,
NY, 29–43.
NEWCOMER, E. 2002. Understanding Web Services. Addison-Wesley, Boston, MA.
PEASE, M., SHOSTAK, R., AND LAMPORT, L. 1980. Reaching agreement in the presence of faults. J.
Assoc. Comput. Mach. 27, 2 (Apr.), 228–234.
SKEEN, D. 1981. Nonblocking commit protocols. In SIGMOD ’81: Proceedings of the 1981 ACM
SIGMOD International Conference on Management of Data. ACM Press, New York, NY, 133–142.
Received August 2004; revised February 2005, August 2005; accepted September 2005