
Distributed Systems

(3rd Edition)

Chapter 08: Fault Tolerance


Version: March 20, 2022
Fault tolerance: Introduction to fault tolerance Basic concepts

Dependability
Basics
A component provides services to clients. To provide services, the
component may require the services from other components ⇒ a
component may depend on some other component.
Specifically
A component C depends on C∗ if the correctness of C’s behavior
depends on the correctness of C∗’s behavior. (Components are
processes or channels.)
Requirements related to dependability
Requirement Description
Availability Readiness for usage
Reliability Continuity of service delivery
Safety Very low probability of catastrophes
Maintainability How easily a failed system can be repaired

2 / 68
Fault tolerance: Introduction to fault tolerance Basic concepts

Reliability versus availability

Reliability R(t ) of component C


Conditional probability that C has been functioning correctly during
[0, t ) given C was functioning correctly at time T = 0.

Traditional metrics
► Mean Time To Failure (MTTF): The average time until a
component fails.
► Mean Time To Repair (MTTR): The average time needed to
repair a component.
► Mean Time Between Failures (MTBF): Simply MTTF +
MTTR.

3 / 68
Fault tolerance: Introduction to fault tolerance Basic concepts

Reliability versus availability

Availability A(t ) of component C


Average fraction of time that C has been up-and-running in interval
[0, t ).
► Long-term availability A: A(∞)
► Note: A = MTTF/MTBF = MTTF/(MTTF + MTTR)
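Worked example (illustrative numbers, not from the slides): with MTTF = 999 hours and MTTR = 1 hour, MTBF = MTTF + MTTR = 1000 hours, so A = 999/1000 = 0.999, i.e., roughly 8.8 hours of expected downtime per year.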

Observation
Reliability and availability make sense only if we have an accurate
notion of what a failure actually is.

4 / 68
Fault tolerance: Introduction to fault tolerance Basic concepts

Terminology

Failure, error, fault


Term      Description                                          Example
Failure   A component is not living up to its specifications   Crashed program
Error     Part of a component that can lead to a failure       Programming bug
Fault     Cause of an error                                     Sloppy programmer

5 / 68
Fault tolerance: Introduction to fault tolerance Basic concepts

Terminology
Handling faults
Term               Description                                                               Example
Fault prevention   Prevent the occurrence of a fault                                         Don't hire sloppy programmers
Fault tolerance    Build a component such that it can mask the occurrence of a fault         Build each component by two independent programmers
Fault removal      Reduce the presence, number, or seriousness of a fault                    Get rid of sloppy programmers
Fault forecasting  Estimate current presence, future incidence, and consequences of faults   Estimate how a recruiter is doing when it comes to hiring sloppy programmers
6 / 68
Fault tolerance: Introduction to fault tolerance Failure models

Failure models
Types of failures

Type                       Description of server’s behavior
Crash failure              Halts, but is working correctly until it halts
Omission failure           Fails to respond to incoming requests
  Receive omission         Fails to receive incoming messages
  Send omission            Fails to send messages
Timing failure             Response lies outside a specified time interval
Response failure           Response is incorrect
  Value failure            The value of the response is wrong
  State-transition failure Deviates from the correct flow of control
Arbitrary failure          May produce arbitrary responses at arbitrary times

7 / 68
Fault tolerance: Introduction to fault tolerance Failure models

Dependability versus security

Omission versus commission


Arbitrary failures are sometimes qualified as malicious. It is better to
make the following distinction:
► Omission failures: a component fails to take an action that it
should have taken
► Commission failure: a component takes an action that it should
not have taken

Observation
Note that deliberate failures, be they omission or commission failures, are typically security problems. Distinguishing between deliberate failures and unintentional ones is, in general, impossible.

8 / 68
Fault tolerance: Introduction to fault tolerance Failure models

Halting failures
Scenario
C no longer perceives any activity from C∗ — a halting failure?
Distinguishing between a crash or omission/timing failure may be
impossible.

Asynchronous versus synchronous systems


► Asynchronous system: no assumptions about process execution speeds or message delivery times → cannot reliably detect crash failures.
► Synchronous system: process execution speeds and message delivery times are bounded → we can reliably detect omission and timing failures.
► In practice we have partially synchronous systems: most of the time, we can assume the system to be synchronous, yet there is no bound on the time that a system is asynchronous → can normally reliably detect crash failures.

9 / 68
Fault tolerance: Introduction to fault tolerance Failure models

Halting failures

Assumptions we can make


Halting type     Description
Fail-stop        Crash failures, but reliably detectable
Fail-noisy       Crash failures, eventually reliably detectable
Fail-silent      Omission or crash failures: clients cannot tell what went wrong
Fail-safe        Arbitrary, yet benign failures (i.e., they cannot do any harm)
Fail-arbitrary   Arbitrary, with malicious failures

10 / 68
Fault tolerance: Introduction to fault tolerance Failure masking by redundancy

Redundancy for failure masking

Types of redundancy
► Information redundancy: Add extra bits to data units so that errors can be recovered when bits are garbled.
► Time redundancy: Design a system such that an action can be
performed again if anything went wrong. Typically used when
faults are transient or intermittent.
► Physical redundancy: add equipment or processes in order to
allow one or more components to fail. This type is extensively
used in distributed systems.
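As an illustration of time redundancy, a minimal Python sketch (the helper names are illustrative, not from the slides): retry an action a bounded number of times, which masks transient or intermittent faults.

import time

def with_retries(action, attempts=3, delay=0.1):
    # Perform the action again if anything went wrong (time redundancy).
    for i in range(attempts):
        try:
            return action()
        except Exception:
            if i == attempts - 1:
                raise                 # the fault was apparently not transient
            time.sleep(delay)         # wait briefly before trying again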

11 / 68
Fault tolerance: Process resilience Resilience by process groups

Process resilience

Basic idea
Protect against malfunctioning processes through process replication,
organizing multiple processes into a process group. Distinguish
between flat groups and hierarchical groups.
[Figure: (a) a flat group; (b) a hierarchical group with a coordinator and workers]
Group organization 12 / 68
Fault tolerance: Process resilience Failure masking and replication

Groups and failure masking


k -fault tolerant group
When a group can mask any k concurrent member failures (k is called
degree of fault tolerance).
How large does a k -fault tolerant group need to be?
► With halting failures (crash/omission/timing failures): we need a
total of k + 1 members as no member will produce an incorrect
result, so the result of one member is good enough.
► With arbitrary failures: we need 2k + 1 members so that the
correct result can be obtained through a majority vote.

Important assumptions
► All members are identical
► All members process commands in the same order
Result: We can now be sure that all processes do exactly the same
thing.
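A small sketch of the arbitrary-failure case (assumed inputs, not code from the slides): a client that collects 2k + 1 replies takes a majority vote to mask up to k wrong answers.

from collections import Counter

def majority_value(replies, k):
    # With 2k+1 replies and at most k arbitrary failures, at least k+1
    # identical correct answers must exist.
    assert len(replies) >= 2 * k + 1, "need at least 2k+1 members"
    value, count = Counter(replies).most_common(1)[0]
    return value if count >= k + 1 else None   # None: no safe majority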
13 / 68
Fault tolerance: Process resilience Consensus in faulty systems with crash failures

Consensus

Prerequisite
In a fault-tolerant process group, each nonfaulty process executes the
same commands, and in the same order, as every other nonfaulty
process.

Reformulation
Nonfaulty group members need to reach consensus on which
command to execute next.

14 / 68
Fault tolerance: Process resilience Consensus in faulty systems with crash failures

Flooding-based consensus
System model
► A process group P = {P1 , . . . , Pn }
► Fail-stop failure semantics, i.e., with reliable failure detection
► A client contacts a Pi requesting it to execute a command
► Every Pi maintains a list of proposed commands

Basic algorithm (based on rounds)


1. In round r, Pi multicasts its known set of commands Ci^r to all others
2. At the end of round r, each Pi merges all received commands into a new Ci^(r+1)
3. The next command cmdi is selected through a globally shared, deterministic function: cmdi ← select(Ci^(r+1))
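A per-process sketch of one round (the hooks multicast, receive_all, and select are hypothetical stand-ins for the group-communication layer; select defaults to min only as an example of a globally shared, deterministic function):

def flooding_round(my_commands, multicast, receive_all, select=min):
    multicast(frozenset(my_commands))     # step 1: send C_i^r to all others
    merged = set(my_commands)
    for commands in receive_all():        # step 2: merge into C_i^(r+1)
        merged |= commands
    return select(merged), merged         # step 3: deterministic choice of cmd_i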

15 / 68
Fault tolerance: Process resilience Consensus in faulty systems with crash failures

Flooding-based consensus: Example


[Figure: flooding-based consensus with processes P1–P4; P1 crashes while multicasting its commands in round 1]

Observations
► P2 received all proposed commands from all other processes ⇒
makes decision.
► P3 may have detected that P1 crashed, but does not know if P2
received anything, i.e., P3 cannot know if it has the same
information as P2 ⇒ cannot make decision (same for P4 ).

16 / 68
Fault tolerance: Process resilience Example: Paxos

Realistic consensus: Paxos


Assumptions (rather weak ones, and realistic)
► A partially synchronous system (in fact, it may even be
asynchronous).
► Communication between processes may be unreliable: messages
may be lost, duplicated, or reordered.
► Corrupted message can be detected (and thus subsequently
ignored).
► All operations are deterministic: once an execution is started, it is
known exactly what it will do.
► Processes may exhibit crash failures, but not arbitrary failures.
► Processes do not collude.

Understanding Paxos
We will build up Paxos from scratch to understand where many
consensus algorithms actually come from.
Essential Paxos 17 / 68
Fault tolerance: Process resilience Example: Paxos

Paxos essentials

Starting point
► We assume a client-server configuration, with initially one primary
server.
► To make the server more robust, we start with adding a backup
server.
► To ensure that all commands are executed in the same order at
both servers, the primary assigns unique sequence numbers to
all commands. In Paxos, the primary is called the leader.
► Assume that actual commands can always be restored (either
from clients or servers) ⇒ we consider only control messages.

Understanding Paxos 18 / 68
Fault tolerance: Process resilience Example: Paxos

Two-server situation

[Figure: two-server situation: leader S1 receives o1 from client C1 and o2 from client C2, assigns sequence numbers via (Seq, o2, 1) and (Seq, o1, 2), and both servers execute o2 before o1]

Understanding Paxos 19 / 68
Fault tolerance: Process resilience Example: Paxos

Handling lost messages

Some Paxos terminology


► The leader sends an accept message ACCEPT (o, t ) to
backups when assigning a timestamp t to command o.
► A backup responds by sending a learn message: LEARN (o, t)
► When the leader notices that operation o has not yet been
learned, it retransmits ACCEPT (o, t ) with the original
timestamp.

Understanding Paxos 20 / 68
Fault tolerance: Process resilience Example: Paxos

Two servers and one crash: problem

[Figure: leader S1 executes o1 but crashes before its accept message (Acc, o1, 1) reaches S2; S2 takes over as leader and executes o2, so the servers end up inconsistent]

Problem
Primary crashes after executing an operation, but the backup never
received the accept message.

Understanding Paxos 21 / 68
Fault tolerance: Process resilience Example: Paxos

Two servers and one crash: solution

[Figure: the solution: S1 waits for the learn message (Lrn, o1) from S2 before executing o1, so both servers execute operations in the same order]

Solution
Never execute an operation before it is clear that it has been learned.

Understanding Paxos 22 / 68
Fault tolerance: Process resilience Example: Paxos

Three servers and two crashes: still a problem?

[Figure: three servers: leader S1 sends (Acc, o1, 1); S2 replies with (Lrn, o1); S1 and S2 crash; S3, which never received the accept message, becomes leader and accepts o2]

Scenario
What happens when LEARN(o1) as sent by S2 to S1 is lost?

Solution
S2 will also have to wait until it knows that S3 has learned o1 .

Understanding Paxos 23 / 68
Fault tolerance: Process resilience Example: Paxos

Paxos: fundamental rule

General rule
In Paxos, a server S cannot execute an operation o until it has
received a LEARN (o) from all other nonfaulty servers.

Understanding Paxos 24 / 68
Fault tolerance: Process resilience Example: Paxos

Failure detection

Practice
Reliable failure detection is practically impossible. A solution is to set
timeouts, but take into account that a detected failure may be false.
[Figure: S2 falsely concludes that leader S1 has crashed and takes over as leader, accepting o2, while S1 is in fact still operational]

Understanding Paxos 25 / 68
Fault tolerance: Process resilience Example: Paxos

Required number of servers

Observation
Paxos needs at least three servers

Adapted fundamental rule


In Paxos with three servers, a server S cannot execute an operation o
until it has received at least one (other) LEARN(o) message, so that it
knows that a majority of servers will execute o.

Understanding Paxos 26 / 68
Fault tolerance: Process resilience Example: Paxos

Required number of servers

Assumptions before taking the next steps


► Initially, S1 is the leader.
► A server can reliably detect it has missed a message, and recover
from that miss.
► When a new leader needs to be elected, the remaining servers
follow a strictly deterministic algorithm, such as S 1 → S 2 → S 3.
► A client cannot be asked to help the servers to resolve a situation.

Observation
If either one of the backups (S2 or S3 ) crashes, Paxos will behave
correctly: operations at nonfaulty servers are executed in the same
order.

Understanding Paxos 27 / 68
Example: Paxos
Fault tolerance: Process resilience

Leader crashes after executing o1


S3 is completely ignorant of any activity by S1
► S2 received ACCEPT(o1, 1), detects the crash, and becomes leader.
► S3 never even received ACCEPT(o1, 1).
► If S2 sends ACCEPT(o2, 2) ⇒ S3 sees an unexpected timestamp and tells S2 that it missed o1.
► S2 retransmits ACCEPT(o1, 1), allowing S3 to catch up.

S2 missed ACCEPT(o1, 1)
► S2 did detect the crash and became the new leader
► If S2 sends ACCEPT(o1, 1) ⇒ S3 retransmits LEARN(o1).
► If S2 sends ACCEPT(o2, 1) ⇒ S3 tells S2 that it apparently missed ACCEPT(o1, 1) from S1, so that S2 can catch up.

Understanding Paxos 28 / 68
Example: Paxos
Fault tolerance: Process resilience

Leader crashes after sending ACCEPT(o 1 , 1)

S3 is completely ignorant of any activity by S1


As soon as S2 announces that o2 is to be accepted, S3 will notice that
it missed an operation and can ask S2 to help recover.

S2 had missed ACCEPT(o1 , 1)


As soon as S2 proposes an operation, it will be using a stale
timestamp, allowing S3 to tell S2 that it missed operation o1 .

Observation
Paxos (with three servers) behaves correctly when a single server crashes, regardless of when that crash took place.

Understanding Paxos 29 / 68
Fault tolerance: Process resilience Example: Paxos

False crash detections


[Figure: S2 falsely detects a crash of S1 and becomes leader; S3 receives (Acc, o2, 1) from S2 before the delayed (Acc, o1, 1) from S1 and is confused about which operation comes first]

Problem and solution


S3 receives ACCEPT(o1, 1), but much later than ACCEPT(o2, 1). If it knew who the current leader was, it could safely reject the delayed accept message ⇒ leaders should include their ID in messages.

Understanding Paxos 30 / 68
Fault tolerance: Process resilience Example: Paxos

But what about progress?


[Figure: leaders include their ID in accept messages, (Acc, S1, o1, 1) and (Acc, S2, o2, 1); S3 learns and executes o1 before o2]

Essence of solution
When S2 takes over, it needs to make sure that any outstanding operations initiated by S1 have been properly flushed, i.e., executed by enough servers. This requires an explicit leadership takeover by which other servers are informed before sending out new accept messages.

Understanding Paxos 31 / 68
Fault tolerance: Process resilience Consensus in faulty systems with arbitrary failures

Consensus under arbitrary failure semantics

Essence
We consider process groups in which communication between processes is inconsistent: (a) improper forwarding of messages, or (b) telling different things to different processes.

[Figure: (a) a faulty process P1 improperly forwards a received value; (b) a faulty process P1 tells different things (a and b) to P2 and P3]

32 / 68
Fault tolerance: Process resilience Consensus in faulty systems with arbitrary failures

Consensus under arbitrary failure semantics


System model
► We consider a primary P and n − 1 backups B1, . . . , Bn−1.
► A client sends v ∈ {T , F } to P
► Messages may be lost, but this can be detected.
► Messages cannot be corrupted beyond detection.
► A receiver of a message can reliably detect its sender.

Byzantine agreement: requirements


BA1: Every nonfaulty backup process stores the same value.
BA2: If the primary is nonfaulty then every nonfaulty backup process
stores exactly what the primary had sent.

Observation
► Primary faulty ⇒ BA1 says that the backups may all store the same value, but one that differs from (and is thus wrong with respect to) the value originally sent by the client.
► Primary not faulty ⇒ satisfying BA2 implies that BA1 is satisfied.
33 / 68
Fault tolerance: Process resilience Consensus in faulty systems with arbitrary failures

Why having 3k processes is not enough

[Figure: with three processes of which one is faulty, two message rounds leave the nonfaulty processes with the value set {T, F}, from which they cannot determine what to store]
Why having 3k processes is not enough 34 / 68


Fault tolerance: Process resilience Consensus in faulty systems with arbitrary failures

Why having 3k + 1 processes is enough

[Figure: with one faulty process out of four, two message rounds give each nonfaulty backup enough collected values to determine, by majority, what the primary sent]

Why having 3k + 1 processes is enough 35 / 68


Fault tolerance: Process resilience Some limitations on realizing fault tolerance

Realizing fault tolerance

Observation
Considering that the members in a fault-tolerant process group are so
tightly coupled, we may bump into considerable performance problems,
but perhaps even situations in which realizing fault tolerance is
impossible.

Question
Are there limitations to what can be readily achieved?
► What is needed to enable reaching consensus?
► What happens when groups are partitioned?

36 / 68
Fault tolerance: Process resilience Some limitations on realizing fault tolerance

Distributed consensus: when can it be reached


[Figure: circumstances under which distributed consensus can be reached, depending on process behavior (synchronous or asynchronous), communication delay (bounded or unbounded), message ordering (unordered or ordered), and message transmission (unicast or multicast)]

Formal requirements for consensus


► Processes produce the same output value
► Every output value must be valid
► Every process must eventually provide output

On reaching consensus 37 / 68
Fault tolerance: Process resilience Some limitations on realizing fault tolerance

Consistency, availability, and partitioning

CAP theorem
Any networked system providing shared data can provide only two of
the following three properties:
C: consistency, by which a shared and replicated data item appears
as a single, up-to-date copy
A: availability, by which updates will always be eventually executed
P: partition tolerance, by which the system tolerates the partitioning of the process group.

Conclusion
In a network subject to communication failures, it is impossible to
realize an atomic read/write shared memory that guarantees a
response to every request.

Consistency, availability, and partitioning 38 / 68


Fault tolerance: Process resilience Some limitations on realizing fault tolerance

CAP theorem intuition

Simple situation: two interacting processes


► P and Q can no longer communicate:
► Allow P and Q to go ahead ⇒ no consistency
► Allow only one of P, Q to go ahead ⇒ no availability
► P and Q have to be assumed to continue communication ⇒ no
partitioning allowed.

Fundamental question
What are the practical ramifications of the CAP theorem?

Consistency, availability, and partitioning 39 / 68


Fault tolerance: Process resilience Failure detection

Failure detection

Issue
How can we reliably detect that a process has actually crashed?

General model
► Each process is equipped with a failure detection module
► A process P probes another process Q for a reaction
► If Q reacts: Q is considered to be alive (by P)
► If Q does not react within t time units: Q is suspected to have crashed

Observation for a synchronous system


a suspected crash ≡ a known crash

40 / 68
Fault tolerance: Process resilience Failure detection

Practical failure detection

Implementation
► If P did not receive a heartbeat from Q within time t: P suspects Q.
► If Q later sends a message (which is received by P):
► P stops suspecting Q
► P increases the timeout value t
► Note: if Q did crash, P will keep suspecting Q.
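A sketch of this heartbeat logic (illustrative class, not a real library API): suspect Q after a timeout, and increase the timeout when a suspicion turns out to be false.

import time

class FailureDetector:
    def __init__(self, timeout=1.0):
        self.timeout = timeout            # current timeout value t
        self.last_heard = time.time()
        self.suspected = False

    def on_message(self):
        # A message from Q arrived after all: stop suspecting Q and
        # increase t so the next suspicion is less likely to be false.
        if self.suspected:
            self.timeout *= 2
            self.suspected = False
        self.last_heard = time.time()

    def check(self):
        # No heartbeat within t time units: suspect Q (possibly falsely).
        if time.time() - self.last_heard > self.timeout:
            self.suspected = True
        return self.suspected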

41 / 68
Fault tolerance: Reliable client-server communication RPC semantics in the presence of failures

Reliable remote procedure calls

What can go wrong?


1. The client is unable to locate the server.
2. The request message from the client to the server is lost.
3. The server crashes after receiving a request.
4. The reply message from the server to the client is lost.
5. The client crashes after sending a request.

Two “easy” solutions


1: (cannot locate server): just report back to client
2: (request was lost): just resend message

42 / 68
Fault tolerance: Reliable client-server communication RPC semantics in the presence of failures

Reliable RPC: server crash


[Figure: (a) the normal case: the server receives the request, executes it, and replies; (b) the server crashes after executing the request but before replying; (c) the server crashes before executing the request]

Problem
Where (a) is the normal case, situations (b) and (c) require different
solutions. However, we don’t know what happened. Two approaches:
► At-least-once-semantics: The server guarantees it will carry out
an operation at least once, no matter what.
► At-most-once-semantics: The server guarantees it will carry out
an operation at most once.

Server crashes 43 / 68
Fault tolerance: Reliable client-server communication RPC semantics in the presence of failures

Why fully transparent server recovery is impossible


Three types of events at the server
(Assume the server is requested to update a document.)
M: send the completion message
P: complete the processing of the document
C: crash

Six possible orderings


(Actions between brackets never take place)
1. M → P → C: Crash after reporting completion.
2. M → C(→ P): Crash after reporting completion, but before the update.
3. P → M → C: Crash after reporting completion, and after the update.
4. P → C(→ M): Update took place, and then a crash.
5. C(→ P → M): Crash before doing anything
6. C(→ M → P): Crash before doing anything
Server crashes 44 / 68
Fault tolerance: Reliable client-server communication RPC semantics in the presence of failures

Why fully transparent server recovery is impossible

                        Strategy M → P            Strategy P → M
Reissue strategy        MPC    MC(P)  C(MP)       PMC    PC(M)  C(PM)
Always                  DUP    OK     OK          DUP    DUP    OK
Never                   OK     ZERO   ZERO        OK     OK     ZERO
Only when ACKed         DUP    OK     ZERO        DUP    OK     ZERO
Only when not ACKed     OK     ZERO   OK          OK     DUP    OK

OK = Document updated once
DUP = Document updated twice
ZERO = Document not updated at all
Server crashes 45 / 68
Fault tolerance: Reliable client-server communication RPC semantics in the presence of failures

Reliable RPC: lost reply messages

The real issue


What the client notices is that it is not getting an answer. However, it cannot decide whether this is caused by a lost request, a crashed server, or a lost response.

Partial solution
Design the server such that its operations are idempotent: repeating
the same operation is the same as carrying it out exactly once:
► pure read operations
► strict overwrite operations
Many operations are inherently nonidempotent, such as many banking
transactions.
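A tiny illustration of the difference (hypothetical operations, not from the slides): a strict overwrite can be repeated safely, a deposit cannot.

accounts = {"alice": 100}

def set_balance(account, amount):
    # Idempotent: executing this twice has the same effect as executing it once.
    accounts[account] = amount

def deposit(account, amount):
    # Nonidempotent: a retried deposit is applied twice.
    accounts[account] += amount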

Lost reply messages 46 / 68


Fault tolerance: Reliable client-server communication RPC semantics in the presence of failures

Reliable RPC: client crash

Problem
The server is doing work and holding resources for nothing (called
doing an orphan computation).

Solution
► Orphan is killed (or rolled back) by the client when it recovers
► Client broadcasts new epoch number when recovering ⇒ server
kills client’s orphans
► Require computations to complete in T time units. Old ones are simply removed.

Client crashes 47 / 68
Fault tolerance: Reliable group communication

Simple reliable group communication


Intuition
A message sent to a process group G should be delivered to each
member of G. Important: make distinction between receiving and
delivering messages.

[Figure: a sender and recipients, each with group-membership functionality on top of a message-handling component on top of the local OS; message delivery to the application happens above the message-handling component, while message reception from the network happens below it]

48 / 68
Fault tolerance: Reliable group communication

Less simple reliable group communication

Reliable communication in the presence of faulty processes
Group communication is reliable when it can be guaranteed that a
message is received and subsequently delivered by all nonfaulty group
members.

Tricky part
Agreement is needed on what the group actually looks like before a
received message can be delivered.

49 / 68
Fault tolerance: Reliable group communication

Simple reliable group communication


Reliable communication, but assume nonfaulty processes
Reliable group communication now boils down to reliable multicasting: ensuring that a message is received and delivered to each recipient, as intended by the sender.
[Figure: simple reliable multicasting with feedback: the sender keeps message #25 in a history buffer; receivers track the last sequence number received, three of them acknowledge #25, while one reports that it missed #24]
50 / 68
Fault tolerance: Distributed commit

Distributed commit protocols

Problem
Have an operation being performed by each member of a process
group, or none at all.
► Reliable multicasting: a message is to be delivered to all
recipients.
► Distributed transaction: each local transaction must
succeed.

51 / 68
Fault tolerance: Distributed commit

Two-phase commit protocol (2PC)

Essence
The client who initiated the computation acts as coordinator; processes
required to commit are the participants.
► Phase 1a: Coordinator sends VOTE-REQUEST to participants (also called a pre-write)
► Phase 1b: When a participant receives VOTE-REQUEST it returns either VOTE-COMMIT or VOTE-ABORT to the coordinator. If it sends VOTE-ABORT, it aborts its local computation
► Phase 2a: Coordinator collects all votes; if all are VOTE-COMMIT, it sends GLOBAL-COMMIT to all participants, otherwise it sends GLOBAL-ABORT
► Phase 2b: Each participant waits for GLOBAL-COMMIT or GLOBAL-ABORT and handles accordingly.

52 / 68
Fault tolerance: Distributed commit

2PC - Finite state machines

[Figure: finite state machines for 2PC: the coordinator moves INIT → WAIT on Commit/Vote-request and then to ABORT or COMMIT depending on the votes; a participant moves INIT → READY on Vote-request and then to ABORT or COMMIT on the coordinator's global decision, acknowledging it]

53 / 68
Fault tolerance: Distributed commit

2PC – Failing participant


Analysis: participant crashes in state S, and recovers to S
► INIT : No problem: participant was unaware of protocol
► READY : Participant is waiting to either commit or abort. After
recovery, participant needs to know which state transition it
should make ⇒ log the coordinator’s decision
► ABORT : Merely make entry into abort state idempotent, e.g.,
removing the workspace of results
► COMMIT : Also make entry into commit state idempotent, e.g.,
copying workspace to storage.

Observation
When distributed commit is required, having participants use
temporary workspaces to keep their results allows for simple recovery
in the presence of failures.

54 / 68
Fault tolerance: Distributed commit

2PC – Failing participant


Alternative
When a recovery is needed to READY state, check state of other
participants ⇒ no need to log coordinator’s decision.

Recovering participant P contacts another participant Q

State of Q    Action by P
COMMIT        Make transition to COMMIT
ABORT         Make transition to ABORT
INIT          Make transition to ABORT
READY         Contact another participant

Result
If all participants are in the READY state, the protocol blocks.
Apparently, the coordinator is failing. Note: The protocol prescribes
that we need the decision from the coordinator.

55 / 68
Fault tolerance: Distributed commit

2PC – Failing coordinator

Observation
The real problem lies in the fact that the coordinator’s final decision may not be available for some time (or is actually lost).

Alternative
Let a participant P in the READY state timeout when it hasn’t received
the coordinator’s decision; P tries to find out what other participants
know (as discussed).

Observation
Essence of the problem is that a recovering participant cannot make a
local decision: it is dependent on other (possibly failed) processes

56 / 68
Fault tolerance: Distributed commit

Coordinator in Python

class Coordinator:

    def run(self):
        yetToReceive = list(participants)
        self.log.info('WAIT')
        self.chan.sendTo(participants, VOTE_REQUEST)
        while len(yetToReceive) > 0:
            msg = self.chan.recvFrom(participants, TIMEOUT)
            if (not msg) or (msg[1] == VOTE_ABORT):
                self.log.info('ABORT')
                self.chan.sendTo(participants, GLOBAL_ABORT)
                return
            else:  # msg[1] == VOTE_COMMIT
                yetToReceive.remove(msg[0])
        self.log.info('COMMIT')
        self.chan.sendTo(participants, GLOBAL_COMMIT)
57 / 68
Fault tolerance: Distributed commit

Participant in Python
class Participant:
    def run(self):
        msg = self.chan.recvFrom(coordinator, TIMEOUT)
        if (not msg):  # Crashed coordinator - give up entirely
            decision = LOCAL_ABORT
        else:  # Coordinator will have sent VOTE_REQUEST
            decision = self.do_work()
            if decision == LOCAL_ABORT:
                self.chan.sendTo(coordinator, VOTE_ABORT)
            else:  # Ready to commit, enter READY state
                self.chan.sendTo(coordinator, VOTE_COMMIT)
                msg = self.chan.recvFrom(coordinator, TIMEOUT)
                if (not msg):  # Crashed coordinator - check the others
                    self.chan.sendTo(all_participants, NEED_DECISION)
                    while True:
                        msg = self.chan.recvFromAny()
                        if msg[1] in [GLOBAL_COMMIT, GLOBAL_ABORT, LOCAL_ABORT]:
                            decision = msg[1]
                            break
                else:  # Coordinator came to a decision
                    decision = msg[1]

        while True:  # Help any other participant when coordinator crashed
            msg = self.chan.recvFrom(all_participants)
            if msg[1] == NEED_DECISION:
                self.chan.sendTo([msg[0]], decision)
58 / 68
Fault tolerance: Recovery Introduction

Recovery: Background
Essence
When a failure occurs, we need to bring the system into an error-free
state:
► Forward error recovery: Find a new state from which the system
can continue operation
► Backward error recovery: Bring the system back into a previous
error-free state

Practice
Use backward error recovery, requiring that we establish recovery
points

Observation
Recovery in distributed systems is complicated by the fact that
processes need to cooperate in identifying a consistent state from
where to recover

59 / 68
Fault tolerance: Recovery Checkpointing

Consistent recovery state


Requirement
Every message that has been received is also shown to have been
sent in the state of the sender.

Recovery line
Assuming processes regularly checkpoint their state, the most recent
consistent global checkpoint.

[Figure: checkpoints of processes P1 and P2 over time; the recovery line is the most recent consistent collection of checkpoints before the failure, while a message sent from P2 to P1 makes a later collection of checkpoints inconsistent]
60 / 68
Fault tolerance: Recovery Checkpointing

Coordinated checkpointing
Essence
Each process takes a checkpoint after a globally coordinated action.

Simple solution
Use a two-phase blocking protocol:
► A coordinator multicasts a checkpoint request message
► When a participant receives such a message, it takes a checkpoint, stops sending (application) messages, and reports back that it has taken a checkpoint
► When all checkpoints have been confirmed at the coordinator, the latter broadcasts a checkpoint done message to allow all processes to continue

Observation
It is possible to consider only those processes that depend on the
recovery of the coordinator, and ignore the rest
Coordinated checkpointing 61 / 68
Fault tolerance: Recovery Checkpointing

Cascaded rollback

Observation
If checkpointing is done at the “wrong” instants, the recovery line may
lie at system startup time. We have a so-called cascaded rollback.
[Figure: checkpoints of P1 and P2 over time; because of messages m and m*, recovery cascades back to the initial states of both processes]

Independent checkpointing 62 / 68
Fault tolerance: Recovery Checkpointing

Independent checkpointing
Essence
Each process independently takes checkpoints, with the risk of a
cascaded rollback to system startup.
► Let CPi (m) denote mth checkpoint of process Pi and INTi (m) the
interval between CPi (m − 1) and CPi (m).
► When process Pi sends a message in interval INTi (m), it
piggybacks (i, m)
► When process Pj receives a message in interval INTj (n), it
records the dependency INTi (m) → INTj (n).
► The dependency INTi (m) → INTj (n) is saved to storage when
taking checkpoint CPj (n).

Observation
If process Pi rolls back to CPi (m − 1), Pj must roll back to CPj (n − 1).
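A sketch of this bookkeeping (illustrative names, not code from the slides): piggyback (i, m) on outgoing messages and save the recorded dependencies with the next checkpoint.

class Process:
    def __init__(self, pid):
        self.pid = pid
        self.interval = 0       # index m of the current interval INT_i(m)
        self.deps = set()       # dependencies recorded during this interval

    def send(self, payload):
        # Piggyback (i, m) on every message sent in interval INT_i(m).
        return (self.pid, self.interval, payload)

    def receive(self, msg):
        sender, sender_interval, payload = msg
        # Record the dependency INT_sender(m) -> INT_self(n).
        self.deps.add((sender, sender_interval))
        return payload

    def checkpoint(self, stable_storage):
        # Save the dependencies together with checkpoint CP_i(m).
        stable_storage.append((self.pid, self.interval, frozenset(self.deps)))
        self.interval += 1
        self.deps = set()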
Independent checkpointing 63 / 68
Fault tolerance: Recovery Message logging

Message logging
Alternative
Instead of taking an (expensive) checkpoint, try to replay your
(communication) behavior from the most recent checkpoint ⇒ store
messages in a log.

Assumption
We assume a piecewise deterministic execution model:
► The execution of each process can be considered as a sequence
of state intervals
► Each state interval starts with a nondeterministic event (e.g.,
message receipt)
► Execution in a state interval is deterministic

Conclusion
If we record nondeterministic events (to replay them later), we obtain a
deterministic execution model that will allow us to do a complete replay.

64 / 68
Fault tolerance: Recovery Message logging

Message logging and consistency


When should we actually log messages?
Avoid orphan processes:
► Process Q has just received and delivered messages m1 and m2
► Assume that m2 is never logged.
► After delivering m1 and m2 , Q sends message m3 to process R
► Process R receives and subsequently delivers m3 : it is an
orphan.

[Figure: Q crashes and recovers; unlogged message m2 is never replayed, so message m3, which depends on it, is not resent and process R becomes an orphan]

65 / 68
Fault tolerance: Recovery Message logging

Message-logging schemes

Notations
► DEP(m): processes to which m has been delivered. If message
m∗ is causally dependent on the delivery of m, and m∗ has been
delivered to Q, then Q ∈ DEP(m).
► COPY(m): processes that have a copy of m, but have not (yet)
reliably stored it.
► FAIL: the collection of crashed processes.

Characterization
Q is orphaned ⇔ ∃m : Q ∈ DEP(m) and COPY(m) ⊆ FAIL
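The characterization translates directly into a check over sets (DEP, COPY, and FAIL given as plain Python sets purely for illustration):

def is_orphaned(q, messages, DEP, COPY, FAIL):
    # Q is orphaned iff some delivered message m can no longer be replayed:
    # Q depends on m while every process holding an unlogged copy has crashed.
    return any(q in DEP[m] and COPY[m] <= FAIL for m in messages)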

66 / 68
Fault tolerance: Recovery Message logging

Message-logging schemes

Pessimistic protocol
For each nonstable message m, there is at most one process
dependent on m, that is |DEP(m)| ≤ 1.

Consequence
An unstable message in a pessimistic protocol must be made stable
before sending a next message.

67 / 68
Fault tolerance: Recovery Message logging

Message-logging schemes

Optimistic protocol
For each unstable message m, we ensure that if COPY(m) ⊆ FAIL, then eventually also DEP(m) ⊆ FAIL.

Consequence
To guarantee that DEP(m) ⊆ FAIL, we generally roll back each orphan process Q until Q ∉ DEP(m).

68 / 68
