CS3551 - Unit IV
UNIT IV
CONSENSUS AND RECOVERY
Consensus and agreement algorithms: Problem definition – Overview of results – Agreement in
a failure-free system (synchronous and asynchronous) – Agreement in synchronous systems
with failures. Checkpointing and rollback recovery: Introduction – Background and definitions
– Issues in failure recovery – Checkpoint-based recovery – Coordinated checkpointing algorithm
– Algorithm for asynchronous checkpointing and recovery.
Agreement: All non-faulty processes must agree on the same (single) value.
Validity: If all the non-faulty processes have the same initial value, then the agreed upon value
by all the non-faulty processes must be that same value.
The overhead bounds are for the given algorithms, and not necessarily tight bounds for the
problem.
Validity: If the source process is non-faulty, then the agreed upon value by all the non-faulty
processes must be the same as the initial value of the source.
Each phase has a unique "phase king" derived, say, from the PID. Each phase has two rounds:
1. In the first round, each process sends its estimate to all other processes.
2. In the second round, the "phase king" process arrives at an estimate based on the values it received in the first round, and broadcasts its new estimate to all others.
The algorithm uses (f + 1) phases and (f + 1)[(n − 1)(n + 1)] messages, and can tolerate up to f < ⌈n/4⌉ malicious processes.
Correctness Argument
In a phase in which the phase king Pk is non-malicious, all non-malicious processes Pi and Pj will have the same estimate of the consensus value as Pk does. There are three cases:
1. Pi and Pj both use their own majority values. (Pi's mult > n/2 + f and Pj's mult > n/2 + f.)
2. Pi uses its majority value; Pj uses the phase king's tie-breaker value. (Pi's mult > n/2 + f, Pj's mult > n/2 for the same value.)
3. Pi and Pj both use the phase king's tie-breaker value. (In the phase in which Pk is non-malicious, it sends the same value to Pi and Pj.)
In all three cases, Pi and Pj end up with the same value as their estimate.
Further, if all non-malicious processes have the value x at the start of a phase, they will continue to have x as the consensus value at the end of the phase.
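To make the two-round structure concrete, the following is a minimal Python sketch of the phase-king algorithm described above. The names are illustrative, Byzantine behaviour is modelled simply by sending random values, and the rounds are simulated locally rather than over a real network.

```python
import random
from collections import Counter

def phase_king(n, f, initial, byzantine=frozenset()):
    """Sketch of the phase-king algorithm: f+1 phases, two rounds per phase.
    initial[i] is process i's initial value; processes in `byzantine` send
    arbitrary (random) values. Assumes n > 4f."""
    value = list(initial)                      # current estimate of each process
    for phase in range(f + 1):                 # the king of phase p is process p
        # Round 1: every process sends its estimate to all processes.
        received = [[None] * n for _ in range(n)]
        for src in range(n):
            for dst in range(n):
                received[dst][src] = random.randint(0, 1) if src in byzantine else value[src]
        majority, mult = [0] * n, [0] * n
        for i in range(n):
            majority[i], mult[i] = Counter(received[i]).most_common(1)[0]
        # Round 2: the phase king broadcasts its majority value as the tie-breaker.
        king = phase
        for i in range(n):
            if i in byzantine:
                continue                       # faulty processes behave arbitrarily anyway
            tiebreak = random.randint(0, 1) if king in byzantine else majority[king]
            # Keep own majority only if its multiplicity is large enough; else adopt the king's value.
            value[i] = majority[i] if mult[i] > n / 2 + f else tiebreak
    return value                               # the entries of non-faulty processes agree

# Example: n = 9, f = 2 (n > 4f); processes 0 and 5 are Byzantine.
print(phase_king(9, 2, [1, 0, 1, 1, 0, 1, 1, 1, 0], byzantine={0, 5}))
```

Since at least one of the f + 1 phases has a non-malicious king, agreement is reached in that phase and, by the persistence argument above, is preserved in all later phases.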
Agreement: All non-faulty processes must make a decision, and the values decided upon by any
two non-faulty processes must be within ε of each other.
Validity: If a non-faulty process Pi decides on some value vi , then that value must be within the
range of values initially proposed by the processes.
Termination: Each non-faulty process must eventually decide on a value. The algorithm for the
message-passing model assumes n ≥ 5f + 1, although the problem is solvable for n > 3f + 1.
It is not possible to go from a bivalent to a univalent state if even a single failure is allowed; the
difficulty is that a variable cannot be read and written atomically in a single step.
The impossibility result can be circumvented by:
Weakening the consensus problem, e.g., k-set consensus, approximate consensus, and
renaming using atomic registers.
Using memory that is stronger than atomic read/write memory to design wait-free
consensus algorithms. Such a memory would need corresponding access primitives.
Are there objects (with supporting operations), using which there is a wait-free (i.e., (n − 1)-crash
resilient) algorithm for reaching consensus in an n-process system? Yes, e.g., Test&Set, Swap,
Compare&Swap. The crash failure model requires the solutions to be wait-free.
An object is defined to be universal if that object along with read/write registers can simulate
any other object in a wait-free manner. In any system containing up to k processes, an object X
such that CN(X) = k is universal.
For any system with up to k processes, the universality of objects X with consensus number k is
shown by giving a universal algorithm to wait-free simulate any object using objects of type X
and read/write registers.
Then, the arbitrary k-process consensus objects are simulated with objects of type X,
having consensus number k. This trivially follows after the first step, in which the arbitrary
object is simulated using arbitrary consensus objects and read/write registers.
A nonblocking operation, in the context of shared memory operations, is an operation that may
not complete itself but is guaranteed to complete at least one of the pending operations in a
finite number of steps.
The linked list stores the linearized sequence of operations and states following each operation.
Operations to the arbitrary object Z are simulated in a nonblocking way using an arbitrary
consensus object (the field op.next in each record) which is accessed via the Decide call.
Each process attempts to thread its own operation next into the linked list.
A single pointer/counter cannot be used instead of the array Head, because reading and
updating the pointer cannot be done atomically in a wait-free manner.
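A minimal, nonblocking sketch of this construction is given below, assuming a one-shot consensus object whose atomicity is simulated with a lock. The Head array, the per-record consensus object next, and the helping by losing processes follow the description above; the simulated object and all names are illustrative.

```python
import threading

class ConsensusObject:
    """One-shot consensus object: the first proposal wins. (The lock only
    simulates atomicity; a real construction would use a wait-free object.)"""
    def __init__(self):
        self._lock, self._winner = threading.Lock(), None

    def decide(self, proposal):
        with self._lock:
            if self._winner is None:
                self._winner = proposal
            return self._winner

class Record:
    """One node of the linked list: an operation, its position in the
    linearized sequence, the object state after applying it, and a consensus
    object (`next`) that decides which operation is threaded after it."""
    def __init__(self, op, seq=0, state=()):
        self.op, self.seq, self.state = op, seq, state
        self.next = ConsensusObject()

def invoke(head, pid, op, apply_fn):
    """Nonblocking simulation of an object Z: repeatedly try to thread `op`
    after the record with the largest sequence number announced in Head."""
    my_rec = Record(op)
    while True:
        anchor = max(head, key=lambda r: r.seq)   # latest threaded operation seen
        winner = anchor.next.decide(my_rec)       # consensus on the successor record
        winner.seq = anchor.seq + 1               # losers help fill in seq and state
        winner.state = apply_fn(anchor.state, winner.op)
        head[pid] = winner                        # announce progress in Head[pid]
        if winner is my_rec:
            return winner.state                   # my operation was threaded

# Example: Z is a sequence object; each operation appends an element.
append = lambda state, op: state + (op,)
initial = Record(op=None)                          # anchor record holding the initial state
head = [initial, initial, initial]                 # Head array, one entry per process
print(invoke(head, 0, "a", append))                # ('a',)
print(invoke(head, 1, "b", append))                # ('a', 'b')
```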
Some rollback-recovery protocols assume that inter-process communication uses first-in-first-out
(FIFO) order, while other protocols assume that the communication subsystem can lose,
duplicate, or reorder messages.
Rollback-recovery protocols therefore must maintain information about the internal
interactions among processes and also the external interactions with the outside world.
A local checkpoint
All processes save their local states at certain instants of time.
A local checkpoint is a snapshot of the state of the process at a given instant.
Assumptions
– A process stores all local checkpoints on stable storage
– A process is able to roll back to any of its existing local checkpoints
1. In-transit messages
messages that have been sent but not yet received
2. Lost messages
messages whose "send" is done but "receive" is undone due to rollback
3. Delayed messages
messages whose "receive" is not recorded because the receiving process was
either down or the message arrived after rollback
4. Orphan messages
messages with "receive" recorded but "send" not recorded; these do not arise if
processes roll back to a consistent global state
5. Duplicate messages
messages that arise due to message logging and replaying during process recovery
In-transit messages
In the figure, the global state {C1,8, C2,9, C3,8, C4,8} shows that message m1 has been sent but
not yet received. We call such a message an in-transit message. Message m2 is also an in-transit
message.
Delayed messages
Messages whose receive is not recorded because the receiving process was either down or the
message arrived after the rollback of the receiving process are called delayed messages. For
example, messages m2 and m5 in the figure are delayed messages.
Lost messages
Messages whose send is not undone but whose receive is undone due to rollback are called lost
messages. This type of message occurs when the process rolls back to a checkpoint prior to the
reception of the message while the sender does not roll back beyond the send operation of the
message. In the figure, message m1 is a lost message.
Duplicate messages
Duplicate messages arise due to message logging and replaying during process
recovery. For example, in the figure, message m4 was sent and received before the
rollback. However, due to the rollback of process P4 to C4,8 and process P3 to C3,8,
both the send and the receipt of message m4 are undone.
The computation comprises three processes Pi, Pj, and Pk, connected through a communication
network. The processes communicate solely by exchanging messages over fault-free, FIFO
communication channels.
Checkpoint-based recovery
Checkpoint-based rollback-recovery techniques can be classified into three categories:
1. Uncoordinated checkpointing
2. Coordinated checkpointing
3. Communication-induced checkpointing
1. Uncoordinated Checkpointing
Each process has autonomy in deciding when to take checkpoints
Advantages
Lower runtime overhead during normal execution
Disadvantages
1. Domino effect during a recovery
2. Recovery from a failure is slow because processes need to iterate to find a
consistent set of checkpoints
3. Each process maintains multiple checkpoints and periodically invokes a
garbage collection algorithm
4. Not suitable for applications with frequent output commits
The processes record the dependencies among their checkpoints caused by message
exchange during failure-free operation
The following direct dependency tracking technique is commonly used in uncoordinated
checkpointing.
Direct dependency tracking technique
Assume each process Pi starts its execution with an initial checkpoint Ci,0.
Ii,x denotes the checkpoint interval between Ci,x−1 and Ci,x.
When Pi sends a message m during Ii,x, it piggybacks the pair (i, x) on m.
When Pj receives m during Ij,y, it records the dependency from Ii,x to Ij,y,
which is later saved onto stable storage when Pj takes checkpoint Cj,y.
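A minimal sketch of this tracking, with illustrative names; the piggybacked pair, the per-interval dependency set, and the save at checkpoint time follow the description above.

```python
class Process:
    """Sketch of direct dependency tracking in uncoordinated checkpointing.
    Interval I[i,x] is the execution between checkpoints C[i,x-1] and C[i,x]."""
    def __init__(self, pid):
        self.pid = pid
        self.interval = 1            # currently executing in I[i,1], after the initial checkpoint C[i,0]
        self.deps = set()            # dependencies recorded during the current interval
        self.stable_storage = []     # saved checkpoints, each with its dependency set

    def send(self, payload):
        # the sender piggybacks (i, x): its id and current interval index
        return {"payload": payload, "piggyback": (self.pid, self.interval)}

    def receive(self, msg):
        i, x = msg["piggyback"]
        self.deps.add((i, x))        # dependency from I[i,x] to the current interval I[j,y]
        return msg["payload"]

    def take_checkpoint(self):
        # the dependencies of the closing interval are saved with checkpoint C[j,y]
        self.stable_storage.append({"ckpt": self.interval, "deps": set(self.deps)})
        self.interval += 1
        self.deps.clear()

# Example: Pj receives a message sent by Pi during I[i,1]; the dependency is saved with C[j,1].
pi, pj = Process("i"), Process("j")
pj.receive(pi.send("m"))
pj.take_checkpoint()
print(pj.stable_storage)             # [{'ckpt': 1, 'deps': {('i', 1)}}]
```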
When a failure occurs, the recovering process initiates rollback by broadcasting a
dependency-request message to collect the dependency information maintained by every process.
When a process receives this message, it stops its execution and replies with the dependency
information saved on stable storage as well as with the dependency information, if any, which is
associated with its current state.
The initiator then calculates the recovery line based on the global dependency information
and broadcasts a rollback request message containing the recovery line.
Upon receiving this message, a process whose current state belongs to the recovery line
simply resumes execution; otherwise, it rolls back to an earlier checkpoint as indicated by
the recovery line.
2. Coordinated Checkpointing
In coordinated checkpointing, processes orchestrate their checkpointing activities so that all
local checkpoints form a consistent global state
Types
1. Blocking Checkpointing: After a process takes a local checkpoint, to prevent orphan
messages, it remains blocked until the entire checkpointing activity is complete
Disadvantage: the computation is blocked during the checkpointing activity
2. Non-blocking Checkpointing: The processes need not stop their execution while taking
checkpoints. A fundamental problem in coordinated checkpointing is to prevent a process
from receiving application messages that could make the checkpoint inconsistent.
Algorithm
The algorithm consists of two phases. During the first phase, the checkpoint initiator
identifies all processes with which it has communicated since the last checkpoint and sends
them a request.
Upon receiving the request, each process in turn identifies all processes it has
communicated with since the last checkpoint and sends them a request, and so on, until
no more processes can be identified.
During the second phase, all processes identified in the first phase take a checkpoint. The
result is a consistent checkpoint that involves only the participating processes.
In this protocol, after a process takes a checkpoint, it cannot send any message until the
second phase terminates successfully, although receiving a message after the checkpoint
has been taken is allowable.
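The first phase amounts to computing the transitive closure of the "has communicated with since the last checkpoint" relation. A minimal sketch under that reading follows; the request/reply messages, and the rule that no message is sent until the second phase ends, are not modelled, and the bookkeeping structure is assumed.

```python
def minimal_coordinated_checkpoint(initiator, talked_to):
    """Sketch of the two-phase protocol above. talked_to[p] is the set of
    processes p has communicated with since its last checkpoint (assumed
    bookkeeping, not defined in the source). Phase 1 floods requests along
    communication dependencies; phase 2 is the set of processes that must
    take a checkpoint together."""
    identified, frontier = {initiator}, [initiator]
    while frontier:                              # phase 1: transitive closure of requests
        p = frontier.pop()
        for q in talked_to.get(p, set()):
            if q not in identified:
                identified.add(q)
                frontier.append(q)
    return identified                            # phase 2: these processes checkpoint

# Example: P0 initiates; P0 talked to P1, P1 talked to P2, P3 is not involved.
deps = {"P0": {"P1"}, "P1": {"P2"}, "P2": set(), "P3": set()}
print(minimal_coordinated_checkpoint("P0", deps))   # {'P0', 'P1', 'P2'}
```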
3. Communication-induced Checkpointing
Communication-induced checkpointing is another way to avoid the domino effect, while allowing
processes to take some of their checkpoints independently. Processes may be forced to take
additional checkpoints
Two types of checkpoints
1. Autonomous checkpoints
2. Forced checkpoints
The checkpoints that a process takes independently are called autonomous (local) checkpoints, while
those that a process is forced to take are called forced checkpoints.
Communication-induced checkpointing piggybacks protocol-related information on
each application message
The receiver of each application message uses the piggybacked information to determine
if it has to take a forced checkpoint to advance the global recovery line
The forced checkpoint must be taken before the application may process the contents of
the message
In contrast with coordinated checkpointing, no special coordination messages are
exchanged
Two types of communication-induced checkpointing
1. Model-based checkpointing
2. Index-based checkpointing.
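As an illustration of the forced-checkpoint rule, here is a minimal sketch of the index-based flavour. The piggybacked field, the comparison, and the "checkpoint before delivery" order follow the description above; the class and field names are illustrative.

```python
class CicProcess:
    """Sketch of index-based communication-induced checkpointing: every
    application message piggybacks the sender's checkpoint index, and a
    receiver whose index lags behind takes a forced checkpoint before
    delivering the message, keeping the recovery line advancing."""
    def __init__(self, pid):
        self.pid, self.index = pid, 0              # index of the latest local checkpoint

    def autonomous_checkpoint(self):
        self.index += 1                            # checkpoint taken independently

    def send(self, payload):
        return {"payload": payload, "index": self.index}   # piggybacked protocol information

    def deliver(self, msg):
        if msg["index"] > self.index:              # delivering first could create an inconsistency
            self.index = msg["index"]              # forced checkpoint, aligned with the sender
            print(f"{self.pid} takes a forced checkpoint #{self.index}")
        return msg["payload"]                      # the application sees the message only now

# Example: P1 checkpoints on its own; its next message forces P2 to checkpoint.
p1, p2 = CicProcess("P1"), CicProcess("P2")
p1.autonomous_checkpoint()
p2.deliver(p1.send("m"))
```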
Model-based checkpointing
Model-based checkpointing prevents patterns of communications and checkpoints
that could result in inconsistent states among the existing checkpoints.
No control messages are exchanged among the processes during normal operation. All
information necessary to execute the protocol is piggybacked on application messages
Log-based rollback recovery
Suppose a set of processes crashes. A process p becomes an orphan when p itself does
not fail and p's state depends on the execution of a nondeterministic event e whose determinant
cannot be recovered from the stable storage or from the volatile memory of a surviving process.
Formally, the always-no-orphans condition can be stated as:
∀e: ¬Stable(e) ⇒ Depend(e) ⊆ Log(e)
Types
1. Pessimistic Logging
Pessimistic logging protocols assume that a failure can occur after any non-deterministic
event in the computation. However, in reality failures are rare
Pessimistic protocols implement the following property, often referred to as synchronous logging,
which is stronger than the always-no-orphans condition:
Synchronous logging
– ∀e: ¬Stable(e) ⇒ |Depend(e)| = 0
That is, if an event has not been logged on stable storage, then no process can depend
on it.
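A minimal sketch of synchronous logging, with an illustrative determinant format (sender, send sequence number, delivery order); the key point is that the determinant is forced to stable storage before the message is delivered.

```python
class PessimisticLogger:
    """Sketch of synchronous (pessimistic) logging: the determinant of every
    nondeterministic delivery event is written to stable storage before the
    message is delivered, so ¬Stable(e) ⇒ |Depend(e)| = 0 holds."""
    def __init__(self):
        self.stable_log = []                        # stands in for stable storage

    def deliver(self, state, msg, receive_seq):
        determinant = (msg["src"], msg["ssn"], receive_seq)   # sender, send seq. no., delivery order
        self.stable_log.append(determinant)         # synchronous logging: log BEFORE delivering
        state.append(msg["payload"])                # only now may the application process the message
        return state

# Example: the delivery of a message is logged before it can affect the process state.
log = PessimisticLogger()
state = log.deliver([], {"src": "P1", "ssn": 4, "payload": "m5"}, receive_seq=9)
print(log.stable_log, state)
```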
Example:
Suppose processes P1 and P2 fail as shown, restart from checkpoints B and C, and roll
forward using their determinant logs to deliver again the same sequence of messages as in
the pre-failure execution.
Once the recovery is complete, both processes will be consistent with the state of P0
that includes the receipt of message m7 from P1.
2. Optimistic Logging
• Consider the example shown in the figure. Suppose process P2 fails before the determinant for
m5 is logged to the stable storage. Process P1 then becomes an orphan process and must
roll back to undo the effects of receiving the orphan message m6. The rollback of P1
further forces P0 to roll back to undo the effects of receiving message m7.
• Advantage: better performance in failure-free execution
• Disadvantages:
• coordination required on output commit
• more complex garbage collection
• Since determinants are logged asynchronously, output commit in optimistic logging
protocols requires a guarantee that no failure scenario can revoke the output. For example,
if process P0 needs to commit output at state X, it must log messages m4 and m7 to the
stable storage and ask P2 to log m2 and m5. In this case, if any process fails, the
computation can be reconstructed up to state X.
3. Causal Logging
• Combines the advantages of both pessimistic and optimistic logging at the expense of a more
complex recovery protocol
• Like optimistic logging, it does not require synchronous access to the stable storage except
during output commit
• Like pessimistic logging, it allows each process to commit output independently and never
creates orphans, thus isolating processes from the effects of failures at other processes
• Make sure that the always-no-orphans property holds
• Each process maintains information about all the events that have causally affected its state
• Consider the example in the figure. Messages m5 and m6 are likely to be lost on the failures
of P1 and P2 at the indicated instants.
• Process P0 at state X will have logged the determinants of the nondeterministic events that
causally precede its state according to Lamport’s happened-before relation.
• These events consist of the delivery of messages m0, m1, m2, m3, and m4.
• The determinant of each of these non-deterministic events is either logged on the stable
storage or is available in the volatile log of process P0.
• The determinant of each of these events contains the order in which its original receiver
delivered the corresponding message.
• The message sender, as in sender-based message logging, logs the message content. Thus,
process P0 will be able to “guide” the recovery of P1 and P2 since it knows the order in
which P1 should replay messages m1 and m3 to reach the state from which P1 sent message
m4.
• Similarly, P0 has the order in which P2 should replay message m2 to be consistent with
both P0 and P1.
• The content of these messages is obtained from the sender log of P0 or regenerated
deterministically during the recovery of P1 and P2.
• Note that information about messages m5 and m6 is lost due to the failures. These messages
may be resent after recovery, possibly in a different order.
• However, since they did not causally affect the surviving process or the outside world, the
resulting state is equivalent to one in which these messages were never sent.
Coordinated checkpointing algorithm
First Phase
1. An initiating process Pi takes a tentative checkpoint and requests all other processes to take
tentative checkpoints. Each process informs Pi whether it succeeded in taking a tentative
checkpoint.
2. A process says “no” to a request if it fails to take a tentative checkpoint
3. If Pi learns that all the processes have successfully taken tentative checkpoints, Pi decides
that all tentative checkpoints should be made permanent; otherwise, Pi decides that all the
tentative checkpoints should be thrown-away.
Second Phase
1. Pi informs all the processes of the decision it reached at the end of the first phase.
2. A process, on receiving the message from Pi, will act accordingly.
3. Either all or none of the processes advance the checkpoint by taking permanent
checkpoints.
4. The algorithm requires that after a process has taken a tentative checkpoint, it cannot
send messages related to the basic computation until it is informed of Pi’s decision.
Correctness: the algorithm is correct for two reasons:
i. Either all or none of the processes take a permanent checkpoint
ii. No process sends a message after taking a permanent checkpoint
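A minimal sketch of this two-phase (tentative/permanent) protocol; the initiator's request and decision messages are collapsed into direct calls, the blocking of basic-computation messages between the phases is not modelled, and all names are illustrative.

```python
class Node:
    """Participant in the two-phase checkpointing protocol (illustrative)."""
    def __init__(self, pid, can_checkpoint=True):
        self.pid, self.can_checkpoint, self.status = pid, can_checkpoint, None

    def take_tentative(self):
        self.status = "tentative" if self.can_checkpoint else None
        return self.can_checkpoint               # reply "yes"/"no" to the initiator

    def make_permanent(self):
        self.status = "permanent"

    def discard_tentative(self):
        self.status = None

def coordinated_checkpoint(nodes):
    """First phase: all nodes attempt tentative checkpoints and reply.
    Second phase: the decision is broadcast; all-or-nothing commit."""
    replies = [n.take_tentative() for n in nodes]     # first phase
    decision = all(replies)                           # permanent only if everyone succeeded
    for n in nodes:                                   # second phase
        n.make_permanent() if decision else n.discard_tentative()
    return decision

# Example: one process cannot checkpoint, so all tentative checkpoints are discarded.
nodes = [Node(0), Node(1), Node(2, can_checkpoint=False)]
print(coordinated_checkpoint(nodes), [n.status for n in nodes])   # False [None, None, None]
```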
An Optimization
The above protocol may cause a process to take a checkpoint even when it is not necessary for
consistency. Since taking a checkpoint is an expensive operation, such unnecessary checkpoints
should be avoided.
For example, in the event of failure of process X, the above protocol will require processes X, Y,
and Z to restart from checkpoints x2, y2, and z2, respectively.
Process Z need not roll back because there has been no interaction between process Z and the
other two processes since the last checkpoint at Z.
Algorithm for asynchronous checkpointing and recovery
Basic idea
Since the algorithm is based on asynchronous checkpointing, the main issue in the
recovery is to find a consistent set of checkpoints to which the system can be restored.
The recovery algorithm achieves this by making each processor keep track of both the
number of messages it has sent to other processors and the number of messages it
has received from other processors.
Whenever a processor rolls back, it is necessary for all other processors to find out if any
message has become an orphan message. Orphan messages are discovered by comparing
the number of messages sent to and received from neighboring processors.
For example, if RCVDi←j(CkPti) > SENTj→i(CkPtj) (that is, the number of messages received
by processor pi from processor pj is greater than the number of messages sent by processor pj to
processor pi, according to the current states of the processors), then one or more messages at
processor pj are orphan messages.
The Algorithm
When a processor restarts after a failure, it broadcasts a ROLLBACK message announcing that it has failed.
Procedure RollBack_Recovery
processor pi executes the following:
STEP (a)
if processor pi is recovering after a failure then
CkPti := latest event logged in the stable storage
else
CkPti := latest event that took place in pi {The latest event at pi can be either in stable or in
volatile storage.}
end if
STEP (b)
for k = 1 to N {N is the number of processors in the system} do
for each neighboring processor pj do
compute SENTi→j(CkPti)
send a ROLLBACK(i, SENTi→j(CkPti)) message to pj
end for
for every ROLLBACK(j, c) message received from a neighbor j do
if RCVDi←j(CkPti) > c {Implies the presence of orphan messages} then
find the latest event e such that RCVDi←j(e) = c {Such an event e may be in the volatile storage
or stable storage.}
CkPti := e
end if
end for
end for{for k}
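A minimal Python sketch of the procedure above; the exchange of ROLLBACK(j, c) messages is modelled by reading the neighbours' SENT counters directly, and the cumulative counters are assumed to be stored with every logged event.

```python
def rollback_recovery(events):
    """Sketch of the asynchronous recovery procedure. events[i] is processor
    pi's logged events, oldest first; each event e carries the cumulative
    counters e["sent"][j] = SENTi→j(e) and e["rcvd"][j] = RCVDi←j(e)."""
    N = len(events)
    ckpt = [len(ev) - 1 for ev in events]        # step (a): restart from the latest logged event
    for _ in range(N):                           # step (b): iterate N times
        for i in range(N):
            for j in range(N):
                if i == j:
                    continue
                c = events[j][ckpt[j]]["sent"][i]       # SENTj→i(CkPtj), as carried by ROLLBACK(j, c)
                while events[i][ckpt[i]]["rcvd"][j] > c:
                    ckpt[i] -= 1                        # orphan messages: roll back to the latest
                                                        # event e with RCVDi←j(e) = c
    return ckpt                                  # index of the restored event of each processor

# Example with two processors: p1 has received a message whose send was lost at p0.
p0 = [{"sent": [0, 0], "rcvd": [0, 0]}, {"sent": [0, 1], "rcvd": [0, 0]}]
p1 = [{"sent": [0, 0], "rcvd": [0, 0]}, {"sent": [0, 0], "rcvd": [1, 0]},
      {"sent": [0, 0], "rcvd": [2, 0]}]
print(rollback_recovery([p0, p1]))               # [1, 1]: p1 rolls back past the orphan receive
```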
An Example
Consider the example shown in Figure 2, consisting of three processors. Suppose processor Y
fails and restarts. If event ey2 is the latest checkpointed event at Y, then Y will restart from the
state corresponding to ey2.
Since RCVDZ←Y(CkPtZ) = 2 > 1, Z will set CkPtZ to ez1, which satisfies RCVDZ←Y(ez1) = 1 ≤ 1.
At Y, RCVDY←X(CkPtY) = 1 < 2 and RCVDY←Z(CkPtY) = 1 = SENTZ→Y(CkPtZ), so Y need not
roll back further.