Distributed UNIT IV (1)
I. Problem definition
Consensus
Distributed computing
Failure Modes
Many forms of coordination require the processes to exchange information to negotiate with
one another and eventually reach a common understanding or agreement, before taking
application-specific actions.
Some assumptions underlying study of agreement algorithms:
1. Failure models.
3. Network connectivity.
4. Sender identification.
5. Channel reliability.
7. Agreement variable.
The Byzantine agreement and other problems
Validity: If the source process is non-faulty, then the agreed upon value by all the non-faulty
processes must be the same as the initial value of the source.
Agreement: All non-faulty processes must agree on the same (single) value.
Validity: If all the non-faulty processes have the same initial value, then the agreed upon value by
all the non-faulty processes must be that same value.
Agreement: All non-faulty processes must agree on the same array of values A[v1, ..., vn].
Validity: If process i is non-faulty and its initial value is vi , then all non-faulty processes agree on
vi as the ith element of the array A. If process j is faulty, then the non-faulty processes can agree
on any value for A[j ].
In a failure-free system, consensus can be reached by collecting information from the different
processes, arriving at a “decision,” and distributing this decision in the system.
A distributed mechanism would have each process broadcast its values to others, and each process
computes the same function on the values received.
The decision can be reached by using an application specific function – some simple examples being
the majority, max, and min functions.
Algorithms to collect the initial values and then distribute the decision may be based on the token
circulation on a logical ring, or the three-phase tree-based broadcast–convergecast–broadcast, or direct
communication with all nodes.
In a synchronous system, this can be done simply in a constant number of rounds (depending on the
specific logical topology and algorithm used).
In an asynchronous system, consensus can similarly be reached in a constant number of message hops.
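The failure-free case above can be sketched in a few lines. This is a minimal illustration, assuming direct all-to-all communication; the names `failure_free_consensus` and `majority` are hypothetical, and the tie-breaking rule (smallest value wins) is an assumption added so that every process computes the same result.

```python
from collections import Counter

def majority(values):
    """Most frequent value; ties go to the smallest value (an assumed rule)."""
    counts = Counter(values)
    best = max(counts.values())
    return min(v for v, c in counts.items() if c == best)

def failure_free_consensus(initial_values, decide=majority):
    """One all-to-all broadcast, then every process applies the same function."""
    n = len(initial_values)
    # With no failures, every process receives the same vector of values.
    received = [list(initial_values) for _ in range(n)]
    return [decide(view) for view in received]

decisions = failure_free_consensus([3, 1, 3, 2])
```

Passing `max` or `min` as `decide` gives the other simple decision functions mentioned above.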
IV. Agreement in (message-passing) synchronous systems with failures
Consensus algorithms for Byzantine failures (synchronous system)
With n = 3 processes, the Byzantine agreement problem cannot be solved if the number of
Byzantine processes f = 1.
The argument uses the illustration in Figure 14.3, which shows a commander Pc and two
lieutenant processes Pa and Pb.
The malicious process is the lieutenant Pb in the first scenario (Figure 14.3(a)) and hence Pa
should agree on the value of the loyal commander Pc, which is 0.
But note the second scenario (Figure 14.3(b)) in which Pa receives identical values from Pb and
Pc, but now Pc is the disloyal commander whereas Pb is a loyal lieutenant.
In this case, Pa needs to agree with Pb. However, Pa cannot distinguish between the two
scenarios and any further message exchange does not help because each process has already
conveyed what it knows from the third process.
In both scenarios, Pa gets different values from the other two processes.
In the first scenario, it needs to agree on a 0, and if that is the default value, the decision is
correct, but then if it is in the second indistinguishable scenario, it agrees on an incorrect value.
A similar argument shows that if 1 is the default value, then in the first scenario, Pa makes an
incorrect decision. This shows the impossibility of agreement when n=3 and f =1.
Recursive formulation
We begin with an informal description of how agreement can be achieved with n = 4 and f = 1
processes [20, 25], as depicted in Figure 14.4.
In the first round, the commander Pc sends its value to the other three lieutenants, as shown by dotted
arrows.
In the second round, each lieutenant relays to the other two lieutenants, the value it received from the
commander in the first round.
At the end of the second round, a lieutenant takes the majority of the values it received (i) directly from
the commander in the first round, and (ii) from the other two lieutenants in the second round.
The majority gives a correct estimate of the “commander’s” value. Consider Figure 14.4(a) where the
commander is a traitor.
The values that get transmitted in the two rounds are as shown. All three lieutenants take the majority of
(1, 0, 0) which is “0,” the agreement value.
In Figure 14.4(b), lieutenant Pd is malicious. Despite its behavior as shown, lieutenants Pa and Pb agree
on “0,” the value of the commander.
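The two rounds described above can be simulated as follows. This is a sketch under stated assumptions: the function name `om1`, the `relay` hook that models what a lieutenant forwards, and the specific values a traitor sends are all hypothetical and are not taken from Figure 14.4.

```python
from collections import Counter

def majority(values):
    """Most frequent value; ties go to the smallest value (an assumed rule)."""
    counts = Counter(values)
    best = max(counts.values())
    return min(v for v, c in counts.items() if c == best)

def om1(commander_sends, relay):
    """Two-round agreement for one commander and three lieutenants.

    commander_sends: lieutenant -> value the commander sent it in round 1
    relay: (sender, receiver, value_sender_received) -> value relayed in round 2
    """
    lieutenants = sorted(commander_sends)
    decisions = {}
    for p in lieutenants:
        votes = [commander_sends[p]]            # (i) directly from the commander
        for q in lieutenants:
            if q != p:                          # (ii) relayed by the other two
                votes.append(relay(q, p, commander_sends[q]))
        decisions[p] = majority(votes)
    return decisions

def honest(sender, receiver, value):
    return value

def pd_lies(sender, receiver, value):
    # Hypothetical faulty behavior: Pd flips the bit it relays.
    return 1 - value if sender == "Pd" else value

# Traitorous commander sends inconsistent values; all lieutenants are loyal.
traitor_commander = om1({"Pa": 1, "Pb": 0, "Pd": 0}, honest)
# Loyal commander sends 0; lieutenant Pd is malicious.
traitor_lieutenant = om1({"Pa": 0, "Pb": 0, "Pd": 0}, pd_lies)
```

In both runs the loyal lieutenants end up with the same value, because a single faulty process can corrupt at most one of the three votes each lieutenant collects.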
Byzantine Generals (iterative formulation), Sync, Msg-passing
Phase-king algorithm for consensus: polynomial (synchronous system)
The phase-king algorithm proposed by Berman and Garay solves the consensus problem under the
same model, requiring f +1 phases, and a polynomial number of messages.
Operation
Each phase has a unique “phase king” derived, say, from PID.
1. In 1st round, each process sends its estimate to all other processes.
2. In 2nd round, the “phase king” process arrives at an estimate based on the values it received in
1st round, and broadcasts its new estimate to all others.
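The two rounds per phase can be sketched as a small simulation for binary values. Several details here are assumptions not stated in these notes: the requirement n > 4f, the rule that a process keeps its own majority only when it was seen more than n/2 + f times (otherwise it adopts the king's value), and the `adversary` hook that models faulty senders. These follow the commonly presented form of the algorithm.

```python
def phase_king(initial, byzantine, adversary, f):
    """Binary phase-king consensus (sketch); returns non-faulty decisions.

    initial:   list of 0/1 initial values, indexed by process id
    byzantine: set of faulty process ids (at most f of them)
    adversary: hypothetical hook (sender, receiver, phase, rnd) -> 0/1 giving
               whatever a faulty sender chooses to send
    """
    n = len(initial)
    assert n > 4 * f and len(byzantine) <= f
    v = list(initial)
    for phase in range(f + 1):
        king = phase  # phase king derived from the process id
        # Round 1: everyone broadcasts its estimate; each receiver tallies.
        majority, mult = [0] * n, [0] * n
        for i in range(n):
            ones = sum(adversary(j, i, phase, 1) if j in byzantine else v[j]
                       for j in range(n))
            majority[i] = 1 if ones > n / 2 else 0
            mult[i] = ones if majority[i] == 1 else n - ones
        # Round 2: the king broadcasts its majority as a tie-breaker.
        for i in range(n):
            if i in byzantine:
                continue
            tiebreak = (adversary(king, i, phase, 2) if king in byzantine
                        else majority[king])
            v[i] = majority[i] if mult[i] > n / 2 + f else tiebreak
    return {i: v[i] for i in range(n) if i not in byzantine}

# Process 0 is faulty and is also the first king; it sends 1 to odd-numbered
# receivers and 0 to even-numbered ones.
decisions = phase_king([0, 1, 0, 1, 1], {0}, lambda s, r, p, rnd: r % 2, f=1)
```

With f + 1 phases, at least one phase has a non-faulty king, and after that phase all non-faulty processes hold the same estimate.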
Introduction
• Rollback propagation
– the dependencies may force some of the processes that did not fail to roll back
– This phenomenon is called the “domino effect”
If each process takes its checkpoints independently, then the system cannot avoid the domino
effect. This scheme is called independent or uncoordinated checkpointing.
A local checkpoint
A local checkpoint is a snapshot of the state of the process at a given instant
Assumption
𝐶𝑖,𝑘 : the kth local checkpoint at process 𝑃𝑖
𝐶𝑖,0 : a process 𝑃𝑖 takes a checkpoint 𝐶𝑖,0 before it starts execution
Consistent states
– Global state: a collection of the individual states of all participating processes and the
states of the communication channels
– Consistent global state: a global state that may occur during a failure-free execution of a
distributed computation
– In a consistent global state, if a process’s state reflects a message receipt, then the state
of the corresponding sender must reflect the sending of the message
A global checkpoint
– a consistent global checkpoint is a global checkpoint such that no message sent by a process
after taking its local checkpoint is received by another process before that process takes its
checkpoint.
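The consistency condition can be checked mechanically. The sketch below assumes each process numbers its local events, a checkpoint is identified by the index of the last event it covers, and each message carries its send and receive event indices; the `Message` type and the function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Message:
    sender: str
    send_event: int   # local event index at the sender
    receiver: str
    recv_event: int   # local event index at the receiver

def orphan_messages(cut, messages):
    """Messages whose receipt is inside the cut but whose send is outside it.

    cut: process -> index of the last local event covered by its checkpoint
    """
    return [m for m in messages
            if m.recv_event <= cut[m.receiver] and m.send_event > cut[m.sender]]

def is_consistent(cut, messages):
    """A global checkpoint is consistent when it contains no orphan message."""
    return not orphan_messages(cut, messages)

msgs = [Message("P0", 5, "P1", 3)]
bad_cut = {"P0": 4, "P1": 4}    # P1 recorded the receipt, P0 did not record the send
good_cut = {"P0": 5, "P1": 4}   # both send and receipt are recorded
```

A message that is sent inside the cut but received outside it does not violate consistency; it is simply in transit.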
A distributed system often interacts with the outside world to receive input data or deliver
the outcome of a computation
Outside world process (OWP): a special process that interacts with the rest of the system through
message passing
A common approach
save each input message on the stable storage before allowing the application program to
process it
Symbol “||”
Messages
In-transit messages - messages that have been sent but not yet received
Lost messages - messages whose ‘send’ is done but ‘receive’ is undone due to rollback
Delayed messages - messages whose ‘receive’ is not recorded because the receiving process was
either down or the message arrived after rollback
Orphan messages - messages whose ‘receive’ is recorded but whose ‘send’ is not recorded
Duplicate messages - messages that arise due to message logging and replaying during process
recovery
• In-transit
– 𝑚1 , 𝑚2
• Lost
– 𝑚1
• Delayed
– 𝑚1 , 𝑚5
• Orphan
– none
• Duplicated
– 𝑚4 , 𝑚5
VII. Issues in failure recovery
Checkpoints : {𝐶𝑖,0 , 𝐶𝑖,1 }, {𝐶𝑗,0 , 𝐶𝑗,1 , 𝐶𝑗,2 }, and {𝐶𝑘,0 , 𝐶𝑘,1 , 𝐶𝑘,2 }
Messages : A - J
• Orphan message I is created due to the roll back of process 𝑃𝑗 to checkpoint 𝐶𝑗,1
– Message D: a lost message since the send event for D is recorded in the restored state for
𝑃𝑗 , but the receive event has been undone at process 𝑃𝑖 .
– Lost messages can be handled by having processes keep a message log of all the sent
messages
– Messages E, F: delayed orphan messages. After resuming execution from their checkpoints,
processes will generate both of these messages
VIII. Checkpoint-based recovery
In the checkpoint-based recovery approach, the state of each process and the communication
channel is checkpointed frequently so that, upon a failure, the system can be restored to a
globally consistent set of checkpoints.
It does not rely on the piecewise deterministic (PWD) assumption, and so does not need to detect,
log, or replay nondeterministic events.
Checkpoint-based protocols are therefore less restrictive and simpler to implement than
log-based rollback recovery.
However, checkpoint-based rollback recovery does not guarantee that prefailure execution can
be deterministically regenerated after a rollback.
Therefore, checkpoint-based rollback recovery may not be suitable for applications that require
frequent interactions with the outside world.
Uncoordinated Checkpointing
• Advantages
– Each process takes its checkpoints independently, so no coordination is needed during
failure-free operation
• Disadvantages
– Recovery from a failure is slow because processes need to iterate to find a consistent set of
checkpoints
– Each process maintains multiple checkpoints and must periodically invoke a garbage collection
algorithm
• The processes record the dependencies among their checkpoints caused by message exchange
during failure-free operation
Direct dependency tracking technique
• Assume each process 𝑃𝑖 starts its execution with an initial checkpoint 𝐶𝑖,0
• Let 𝐼𝑖,𝑥 denote the checkpoint interval between 𝐶𝑖,𝑥−1 and 𝐶𝑖,𝑥
• When 𝑃𝑗 receives a message m during 𝐼𝑗,𝑦 that 𝑃𝑖 sent during 𝐼𝑖,𝑥 , it records the dependency
from 𝐼𝑖,𝑥 to 𝐼𝑗,𝑦 , which is later saved onto stable storage when 𝑃𝑗 takes 𝐶𝑗,𝑦
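Once dependencies between checkpoint intervals are recorded, a recovery line can be found by repeatedly rolling back any process that still reflects a message whose send has been undone. The sketch below is one simple fixed-point formulation, not necessarily the procedure these notes have in mind; it assumes interval x at a process is the interval ending at checkpoint x, and the name `recovery_line` and the input encoding are hypothetical.

```python
def recovery_line(latest, failed, deps):
    """Compute the checkpoint index each process must restore.

    latest: process -> index of its latest checkpoint on stable storage
    failed: set of processes that crashed and lost their volatile state
    deps:   list of ((i, x), (j, y)): a message sent by i in interval x
            was received by j in interval y
    A result of latest[p] + 1 means p keeps its current (volatile) state.
    """
    # Failed processes restart from their latest checkpoint; the others
    # tentatively keep their current state, treated as a virtual checkpoint.
    line = {p: latest[p] if p in failed else latest[p] + 1 for p in latest}
    changed = True
    while changed:
        changed = False
        for (i, x), (j, y) in deps:
            # The send was undone (x > line[i]) but the receipt is still
            # reflected (y <= line[j]): j must roll back before interval y.
            if x > line[i] and y <= line[j]:
                line[j] = y - 1
                changed = True
    return line

# Hypothetical domino effect: P fails, which forces Q back to its initial
# checkpoint, which in turn forces P back to its initial checkpoint.
deps = [(("P", 2), ("Q", 1)), (("Q", 1), ("P", 1))]
line = recovery_line({"P": 1, "Q": 1}, {"P"}, deps)
```

Each pass only ever lowers a checkpoint index, and indices cannot go below the initial checkpoint 0, so the loop terminates.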
Coordinated Checkpointing
• Blocking Checkpointing
– After a process takes a local checkpoint, to prevent orphan messages, it remains blocked
until the entire checkpointing activity is complete
– Disadvantage: the computation is blocked during the checkpointing
• Non-blocking Checkpointing
– The processes need not stop their execution while taking checkpoints
– This situation results in an inconsistent checkpoint since checkpoint 𝐶1,𝑥 shows the
receipt of message m from 𝑃0 , while checkpoint 𝐶0,𝑥 does not show m being sent from 𝑃0
– If channels are FIFO, this problem can be avoided by preceding the first post-checkpoint
message on each channel by a checkpoint request, forcing each process to take a
checkpoint before receiving the first post-checkpoint message.
Communication-induced Checkpointing
• Two types of checkpoints: autonomous and forced checkpoints
• The receiver of each application message uses the piggybacked information to determine if it
has to take a forced checkpoint to advance the global recovery line
• The forced checkpoint must be taken before the application may process the contents of the
message
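One well-known way to realize forced checkpoints is an index-based scheme: each process keeps a checkpoint sequence number, piggybacks it on every application message, and takes a forced checkpoint when it receives a larger number than its own. These notes do not say which information is piggybacked, so the scheme below is an assumption chosen for illustration; the class and method names are hypothetical.

```python
class Process:
    """Sketch of index-based communication-induced checkpointing."""

    def __init__(self, name):
        self.name = name
        self.sn = 0                              # sequence number of latest checkpoint
        self.checkpoints = [(0, "initial")]
        self.delivered = []

    def autonomous_checkpoint(self):
        """Checkpoint taken at the process's own convenience."""
        self.sn += 1
        self.checkpoints.append((self.sn, "autonomous"))

    def send(self, payload):
        """Piggyback the current sequence number on the application message."""
        return {"payload": payload, "sn": self.sn}

    def receive(self, msg):
        # The forced checkpoint is taken BEFORE the message is processed.
        if msg["sn"] > self.sn:
            self.sn = msg["sn"]
            self.checkpoints.append((self.sn, "forced"))
        self.delivered.append(msg["payload"])

p, q = Process("P"), Process("Q")
p.autonomous_checkpoint()        # P now has sn = 1
q.receive(p.send("hello"))       # Q is behind, so it takes a forced checkpoint
q.receive(p.send("again"))       # Q has caught up: no new checkpoint
```

Under this scheme, checkpoints that carry the same sequence number at different processes are intended to form a consistent global checkpoint.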
Checkpointing algorithm
• Assumptions: FIFO channel, end-to-end protocols, communication failures do not partition the
network, single process initiation, no process fails during the execution of the algorithm.
• Tentative checkpoint: a temporary checkpoint that becomes a permanent checkpoint when the
algorithm terminates successfully.
First Phase
An initiating process Pi takes a tentative checkpoint and requests all other processes to take
tentative checkpoints.
A process says “no” to a request if it fails to take a tentative checkpoint.
If Pi learns that all the processes have successfully taken tentative checkpoints, Pi decides that
all tentative checkpoints should be made permanent; otherwise, Pi decides that all the tentative
checkpoints should be discarded.
The above protocol may cause a process to take a checkpoint even when it is not necessary for
consistency. Since taking a checkpoint is an expensive operation, unnecessary checkpoints should
be avoided.
Consider the example shown in Fig.1. The set {x1, y1, z1} is a consistent set of checkpoints.
Suppose process X takes a tentative checkpoint x2 and sends “take tentative checkpoint” messages
to processes Y and Z.
There is no need for process Z to take checkpoint z2 because Z has not sent any message since
its last checkpoint.
Process Y must take a checkpoint since it has sent messages since its last checkpoint.
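The two-phase structure and the optimization can be sketched as follows. This is a simplified illustration: messages are replaced by direct method calls, no process fails during the run, and the test "has sent a message since its last checkpoint" is taken literally from the example above rather than from the full algorithm. All class, function, and attribute names are hypothetical.

```python
class Proc:
    def __init__(self, name, can_checkpoint=True, sent_since_checkpoint=True):
        self.name = name
        self.can_checkpoint = can_checkpoint
        self.sent_since_checkpoint = sent_since_checkpoint
        self.permanent = 1        # index of the last permanent checkpoint (x1, y1, z1)
        self.tentative = None

    def take_tentative(self):
        """Phase 1: reply yes (True) or no (False) to the initiator."""
        if not self.can_checkpoint:
            return False
        self.tentative = self.permanent + 1
        return True

    def finish(self, commit):
        """Phase 2: make the tentative checkpoint permanent or discard it."""
        if commit and self.tentative is not None:
            self.permanent = self.tentative
            self.sent_since_checkpoint = False
        self.tentative = None

def coordinated_checkpoint(initiator, others, skip_idle=True):
    # Optimization: a process that has sent nothing since its last
    # checkpoint does not need to take a new one.
    participants = [initiator] + [
        p for p in others if p.sent_since_checkpoint or not skip_idle]
    votes = [p.take_tentative() for p in participants]   # phase 1
    commit = all(votes)
    for p in participants:                               # phase 2
        p.finish(commit)
    return commit

x, y = Proc("X"), Proc("Y")
z = Proc("Z", sent_since_checkpoint=False)
committed = coordinated_checkpoint(x, [y, z])
```

After this run X and Y hold checkpoint 2 while Z keeps checkpoint 1, which corresponds to the set {x2, y2, z1} in the example.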
B. The Rollback Recovery Algorithm
The rollback recovery algorithm restores the system state to a consistent state after a failure.
The rollback recovery algorithm assumes that a single process invokes the algorithm.
It assumes that the checkpoint and the rollback recovery algorithms are not invoked
concurrently. The rollback recovery algorithm has two phases.
First Phase
An initiating process Pi sends a message to all other processes to check if they all are willing to
restart from their previous checkpoints.
A process may reply “no” to a restart request due to any reason (e.g., it is already participating
in a checkpointing or a recovery process initiated by some other process).
If Pi learns that all processes are willing to restart from their previous checkpoints,
Pi decides that all processes should roll back to their previous checkpoints. Otherwise, Pi aborts
the roll back attempt and it may attempt a recovery at a later time.
Second Phase
During the execution of the recovery algorithm, a process cannot send messages related to the
underlying computation while it is waiting for Pi’s decision.
Correctness
All processes restart from an appropriate state because if processes decide to restart, then they
resume execution from a consistent state (the checkpointing algorithm takes a consistent set of
checkpoints).
An Optimization
The above recovery protocol causes all processes to roll back irrespective of whether a process
needs to roll back or not. Consider the example shown in the figure below.
In the event of failure of process X, the above protocol will require processes
X, Y, and Z to restart from checkpoints x2, y2, and z2, respectively.
Process Z need not roll back because there has been no interaction between process Z and the
other two processes since the last checkpoint at Z.
X. Juang-Venkatesan algorithm for asynchronous checkpointing and recovery
The algorithm makes the following assumptions about the underlying system:
The communication channels are reliable, deliver the messages in FIFO order and have infinite
buffers.
The processors directly connected to a processor via communication channels are called its
neighbors.
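The underlying computation is event-driven: a processor in state s receives a message m,
processes it, moves to a new state s′, and sends messages to its neighbors.
The new state s′ and the contents of messages sent to its neighbors depend on state s and the
contents of message m.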
The events at a processor are identified by unique monotonically increasing numbers, ex0, ex1,
ex2,...
To help recovery after a process failure and restore the system to a consistent state, two types of
log storage are maintained, volatile log and stable log.
Accessing the volatile log takes less time than accessing the stable log, but the contents of the
volatile log are lost if the corresponding processor fails.
The contents of the volatile log are periodically flushed to the stable storage.
A. Asynchronous Checkpointing
After executing an event, a processor records a triplet {s, m, msgs_sent} in its volatile storage,
where s is the state of the processor before the event, m is the message whose arrival caused
the event, and msgs_sent is the set of messages that were sent by the processor during the event.
A local checkpoint at a processor consists of the record of an event occurring at the processor
and it is taken without any synchronization with other processors.
Periodically, a processor independently saves the contents of the volatile log in the stable
storage and clears the volatile log. This operation amounts to taking a local checkpoint.
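The logging scheme described above can be sketched as follows. The `Processor` class, the `transition` callback, and the in-memory lists standing in for volatile and stable storage are all assumptions made for illustration.

```python
class Processor:
    def __init__(self, state):
        self.state = state
        self.volatile_log = []   # fast to access, lost on a crash
        self.stable_log = []     # survives crashes

    def handle(self, m, transition):
        """Execute one event. transition: (s, m) -> (new_state, msgs_sent)."""
        s = self.state
        new_state, msgs_sent = transition(s, m)
        # Record the triplet {s, m, msgs_sent} for this event.
        self.volatile_log.append((s, m, tuple(msgs_sent)))
        self.state = new_state
        return msgs_sent

    def checkpoint(self):
        """Flush the volatile log to stable storage and clear it."""
        self.stable_log.extend(self.volatile_log)
        self.volatile_log.clear()

    def crash(self):
        """A failure loses everything that was not yet flushed."""
        self.volatile_log.clear()

def count_and_echo(s, m):
    # Hypothetical application logic: count messages and echo each one back.
    return s + 1, [("echo", m)]

p = Processor(state=0)
p.handle("a", count_and_echo)
p.checkpoint()
p.handle("b", count_and_echo)
p.crash()
```

After the crash only the first event remains recoverable, which is why the choice of flush period trades logging cost against the amount of work lost.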
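The following notation and data structure are used by the algorithm:
RCVDi←j(CkPti) is the number of messages received by processor pi from processor pj up to the
checkpoint CkPti.
SENTi→j(CkPti) is the number of messages sent by processor pi to processor pj up to the
checkpoint CkPti.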
Basic idea
Since the algorithm is based on asynchronous checkpointing, the main issue in the recovery is
to find a consistent set of checkpoints to which the system can be restored.
The recovery algorithm achieves this by making each processor keep track of both the number
of messages it has sent to other processors as well as the number of messages it has received
from other processors.
Whenever a processor rolls back, it is necessary for all other processors to find out if any
message sent by the rolled back processor has become an orphan message.
Orphan messages are discovered by comparing the number of messages sent to and received
from neighboring processors.
For example, if RCVDi←j(CkPti) > SENTj→i(CkPtj) (that is, the number of messages received
by processor pi from processor pj is greater than the number of messages sent by processor pj to
processor pi, according to the current states of the processors), then one or more messages
received at processor pi are orphan messages.
In this case, the receiving processor pi must roll back to a state where the number of messages
received agrees with the number of messages sent.
Consider an example shown in Fig.13.3. Suppose processor Y crashes at the point indicated
and rolls back to a state corresponding to checkpoint ey1.
According to this state, Y has sent one message to X; according to X’s current state (ex2), X
has received two messages from Y.
X must roll back to a state preceding ex2 to be consistent with Y’s state.
If X rolls back to checkpoint ex1, then it will be consistent with Y’s state, ey1. Processor Z
must roll back to checkpoint ez2 to be consistent with Y’s state, ey1.
When a processor restarts after a failure, it broadcasts a ROLLBACK message announcing that it
had failed.
The recovery algorithm at a processor is initiated when it restarts after a failure or when it
learns of a failure at another processor.
Because of the broadcast of ROLLBACK messages, the recovery algorithm is initiated at all
processors.
The rollback starts at the failed processor and slowly diffuses into the entire system through
ROLLBACK messages.
The procedure has |N| iterations. During the kth iteration (k ≠ 1), a processor pi does the
following:
(i) based on the state CkPti it was rolled back in the (k - 1)th iteration, it computes
SENTi→j(CkPti) for each neighbor pj and sends this value in a ROLLBACK message to that
neighbor.
(ii) pi waits for and processes ROLLBACK messages that it receives from its neighbors in kth
iteration and determines a new recovery point CkPti for pi based on information in these
messages.
At the end of each iteration, at least one processor will roll back to its final recovery point,
unless the current recovery points are already consistent.
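The iteration can be sketched as a centralized simulation. It assumes every processor is a neighbor of every other, that event 0 at each processor has all counters at zero, and that the per-event SENT and RCVD counters are available from the logs. The function name `jv_recover`, the input encoding, and the counts in the example are hypothetical and are not read from the figure.

```python
def jv_recover(events, start):
    """Find a consistent set of recovery points by comparing message counts.

    events[p][k] = (sent, rcvd) after event k at processor p, where
        sent[q] = messages p has sent to q so far,
        rcvd[q] = messages p has received from q so far.
    start[p] = event index p initially restarts from.
    """
    ckpt = dict(start)
    procs = list(events)
    for _ in range(len(procs)):                   # |N| iterations
        # (i) every processor tells each neighbor how many messages it
        #     has sent to it, according to its current recovery point.
        rollback = {(i, j): events[i][ckpt[i]][0].get(j, 0)
                    for i in procs for j in procs if i != j}
        # (ii) every processor rolls back until it has received no more
        #      than each neighbor claims to have sent.
        for j in procs:
            for i in procs:
                if i == j:
                    continue
                while events[j][ckpt[j]][1].get(i, 0) > rollback[(i, j)]:
                    ckpt[j] -= 1
    return ckpt

# Hypothetical counts: Y sends to X and Z and receives nothing.
events = {
    "X": [({}, {}), ({}, {"Y": 1}), ({}, {"Y": 2}), ({}, {"Y": 3})],
    "Y": [({}, {}), ({"X": 1}, {}), ({"X": 2, "Z": 1}, {}), ({"X": 3, "Z": 2}, {})],
    "Z": [({}, {}), ({}, {"Y": 1}), ({}, {"Y": 2})],
}
# Y failed and restarts from its latest logged event (index 2).
recovery = jv_recover(events, {"X": 3, "Y": 2, "Z": 2})
```

Here X and Z each undo one receipt that Y no longer remembers sending, and later iterations change nothing because the counts already agree.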
An Example
If event ey2 is the latest checkpointed event at Y, then Y will restart from the state
corresponding to ey2.
Because of the broadcast nature of ROLLBACK messages, the recovery algorithm is initiated
at processors X and Z.
Initially, X, Y, and Z set CkPtX ← ex3, CkPtY ← ey2 and CkPtZ ← ez2, respectively, and X, Y,
and Z send the following messages during the first iteration:
Since RCVDX←Y (CkPtX) = 3 > 2 (2 is the value received in the ROLLBACK(Y,2) message
from Y), X will set CkPtX to ex2 satisfying RCVDX←Y (ex2) = 1 ≤ 2.
Since RCVDZ←Y (CkPtZ) = 2 > 1, Z will set CkPtZ to ez1 satisfying RCVDZ←Y (ez1) = 1 ≤ 1.
At Y, RCVDY←X(CkPtY ) = 1 < 2 and RCVDY←Z(CkPtY ) = 1 = SENTZ→Y (CkPtZ).
If Y rolls back beyond ey3 and loses the message from X that caused ey3, X can resend this
message to Y because ex2 is logged at X and this message is available in the log.
The second and third iterations will progress in the same manner.
The set of recovery points chosen at the end of the first iteration, {ex2, ey2, ez1}, is consistent,
and no further rollback occurs.