Chapter 8 Fault Tolerance
Fault Tolerance
• An important goal in distributed systems
design is to construct the system in such a
way that it can automatically recover from
partial failure without seriously affecting the
overall performance.
• A distributed system should tolerate faults
and continue to operate to some extent even
in their presence.
Basic Concepts
• Being fault tolerant is closely tied to being a
dependable system. Dependability includes:
– Availability: a system is ready to be used
immediately.
– Reliability: a system can run continuously
without failure.
– Safety: when a system temporarily fails to
operate, nothing catastrophic happens.
– Maintainability: how easily a failed system can
be repaired.
Basic Concepts
• A system is said to fail when it cannot meet its promises.
• An error is a part of a system’s state that may lead to a
failure.
• The cause of an error is a fault.
• Fault tolerance means that a system can provide its
services even in the presence of faults.
• A transient fault occurs once and then disappears.
• An intermittent fault occurs, then vanishes of its own
accord, then reappears, and so on.
• A permanent fault is one that continues to exist until
the faulty component is repaired.
Failure Models
The first topic we discuss is protection against process failures, which is achieved by
replicating processes into groups.
Resilience by process groups
The key approach to tolerating a faulty process is to organize several identical processes
into a group. The key property that all groups have is that when a message is sent to the
group itself, all members of the group receive it. In this way, if one process in a group
fails, hopefully some other process can take over for it.
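The group idea above can be sketched in a few lines of Python. This is only an illustrative sketch: the classes Process and ProcessGroup, the alive flag, and the try-members-in-order takeover policy are all assumptions made for the example, not part of any particular group communication system.

```python
class Process:
    """Hypothetical group member that may crash (alive = False)."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.inbox = []

    def deliver(self, message):
        # A crashed process silently misses the message.
        if self.alive:
            self.inbox.append(message)

    def handle(self, request):
        if not self.alive:
            raise RuntimeError(f"{self.name} has crashed")
        return f"{self.name} handled {request}"


class ProcessGroup:
    """Hypothetical group of identical processes."""

    def __init__(self, members):
        self.members = list(members)

    def send(self, message):
        # A message sent to the group is delivered to every member.
        for member in self.members:
            member.deliver(message)

    def request(self, request):
        # Assumed takeover policy: try members in order until one answers.
        for member in self.members:
            try:
                return member.handle(request)
            except RuntimeError:
                continue
        raise RuntimeError("all group members have failed")


group = ProcessGroup([Process("p1"), Process("p2"), Process("p3")])
group.send("update x=1")          # all three members receive the update
group.members[0].alive = False    # p1 fails
result = group.request("read x")  # p2 takes over
print(result)
```

Because every member received the update before p1 failed, any surviving member can answer the later request.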
Flat Groups versus Hierarchical Groups
Reliable client-server communication
• Point-to-point communication
– Omission failures occur in the form of lost messages, and
can be masked by using acknowledgements and
retransmissions.
– Connection crash failures are often not masked. The client
can be informed of the channel crash by raising an
exception.
• RPC semantics in the presence of failures
1. The client is unable to locate the server.
2. The request message from the client to the server is lost.
3. The server crashes after receiving a request.
4. The reply message from the server to the client is lost.
5. The client crashes after sending a request.
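Masking a lost request or a lost reply by retransmission can be sketched as follows. The sketch is hypothetical throughout: LossyChannel, Server, call, the request identifiers, and the reply cache are assumptions introduced for illustration. The reply cache is one common way to keep a retransmitted request from being executed twice; the slides do not prescribe it.

```python
class LossyChannel:
    """Hypothetical channel that drops the transmissions listed in `drop`."""

    def __init__(self, drop):
        self.drop = set(drop)
        self.count = 0

    def transmit(self, message):
        index = self.count
        self.count += 1
        return None if index in self.drop else message


class Server:
    def __init__(self):
        self.executed = 0
        self.replies = {}  # request id -> cached reply

    def handle(self, request_id, payload):
        # A retransmitted request is answered from the cache, not re-executed.
        if request_id in self.replies:
            return self.replies[request_id]
        self.executed += 1
        reply = f"done:{payload}"
        self.replies[request_id] = reply
        return reply


def call(server, request_id, payload, to_server, to_client, max_tries=5):
    """Retransmit until a reply arrives; a missing reply stands for a timeout."""
    for attempt in range(1, max_tries + 1):
        request = to_server.transmit((request_id, payload))
        if request is None:
            continue  # request lost: time out and retransmit
        reply = to_client.transmit(server.handle(*request))
        if reply is not None:
            return reply, attempt
    # Failure that cannot be masked: inform the client with an exception.
    raise ConnectionError("no reply after retransmissions")


server = Server()
to_server = LossyChannel(drop={0})  # the first request is lost
to_client = LossyChannel(drop={0})  # the first reply is lost
reply, attempts = call(server, 1, "debit 10", to_server, to_client)
print(reply, attempts, server.executed)
```

In this run the client needs three attempts, yet the server executes the operation only once, because the second retransmission hits the reply cache.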
Reliable group communication
Considering how important process resilience by replication is, it is not surprising
that reliable multicast services are important as well. Such services guarantee that
messages are delivered to all members in a process group. Unfortunately, reliable
multicasting turns out to be surprisingly tricky. In this section, we take a closer look
at the issues involved in reliably delivering messages to a process group.
Distributed commit
The distributed commit problem requires that an operation be performed by
every member of a process group, or by none at all. In the case of reliable
multicasting, the operation is the delivery of a message.
Distributed commit is often established by means of a coordinator. In a simple
scheme, this coordinator tells all other processes that are also involved, called
participants, whether or not to (locally) perform the operation in question. This
scheme is referred to as a one-phase commit protocol.
Two-Phase Commit
Phase I (Voting):
1. The coordinator sends a VOTE_REQUEST message to all participants.
2. When a participant receives a VOTE_REQUEST message, it returns
either a VOTE_COMMIT message to the coordinator telling the
coordinator that it is prepared to locally commit its part of the
transaction, or otherwise a VOTE_ABORT message.
Phase II (Decision):
1. The coordinator collects all votes from the participants. If all have
voted to commit, then so will the coordinator. In that case, it sends a
GLOBAL_COMMIT message to all participants. However, if at least one
participant has voted to abort the transaction, the coordinator will also
decide to abort and multicast a GLOBAL_ABORT message.
2. Each participant that voted for a commit waits for the final reaction
from the coordinator. If a participant receives a GLOBAL_COMMIT,
it locally commits the transaction. Otherwise, when receiving a
GLOBAL_ABORT, it locally aborts the transaction as well.
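The two phases can be sketched as a small failure-free simulation. The message names are taken from the steps above; the Participant class, the can_commit flag, the state names (INIT, READY, COMMIT, ABORT), and the function two_phase_commit are assumptions made for illustration. The sketch deliberately ignores crashes and timeouts.

```python
VOTE_COMMIT, VOTE_ABORT = "VOTE_COMMIT", "VOTE_ABORT"
GLOBAL_COMMIT, GLOBAL_ABORT = "GLOBAL_COMMIT", "GLOBAL_ABORT"


class Participant:
    """Hypothetical participant; can_commit says how it will vote."""

    def __init__(self, name, can_commit):
        self.name = name
        self.can_commit = can_commit
        self.state = "INIT"

    def on_vote_request(self):
        # Phase I, step 2: answer the coordinator's VOTE_REQUEST.
        if self.can_commit:
            self.state = "READY"
            return VOTE_COMMIT
        self.state = "ABORT"
        return VOTE_ABORT

    def on_decision(self, decision):
        # Phase II, step 2: follow the coordinator's global decision.
        self.state = "COMMIT" if decision == GLOBAL_COMMIT else "ABORT"


def two_phase_commit(participants):
    # Phase I, step 1: send VOTE_REQUEST to all participants, collect votes.
    votes = [p.on_vote_request() for p in participants]
    # Phase II, step 1: commit only if every participant voted to commit.
    if all(vote == VOTE_COMMIT for vote in votes):
        decision = GLOBAL_COMMIT
    else:
        decision = GLOBAL_ABORT
    for p in participants:
        p.on_decision(decision)
    return decision


all_ready = [Participant("a", True), Participant("b", True)]
one_refuses = [Participant("a", True), Participant("b", False)]
print(two_phase_commit(all_ready))    # GLOBAL_COMMIT
print(two_phase_commit(one_refuses))  # GLOBAL_ABORT
```

A single VOTE_ABORT is enough to make every participant abort, including those that voted to commit.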
Two-Phase Commit
[Table not recovered. Column headers: State of Q, Action by P]
[Figure not recovered. Caption: A recovery line.]
Independent Checkpointing
Now consider the case in which each process simply records
its local state from time to time in an uncoordinated fashion.
Discovering a recovery line requires that each process be
rolled back to its most recently saved state. If these local
states jointly do not form a distributed snapshot, further
rolling back is necessary.
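The rollback search described above can be sketched as follows. The data layout is an assumption made for the example: each checkpoint records the sets of message identifiers sent and received so far, and checkpoint 0 of every process is its initial state with both sets empty. A set of checkpoints is treated as consistent when no checkpoint records the receipt of a message that no chosen checkpoint records as sent. The function name find_recovery_line is hypothetical.

```python
def find_recovery_line(checkpoints):
    """Return {process: checkpoint index} for the latest consistent line.

    Assumes checkpoints[p][0] is the initial state with empty sets.
    """
    # Start from each process's most recently saved state.
    line = {p: len(cps) - 1 for p, cps in checkpoints.items()}
    while True:
        sent = set()
        for p, i in line.items():
            sent |= checkpoints[p][i]["sent"]
        # A process is inconsistent if it received a message that,
        # according to the chosen checkpoints, was never sent.
        orphaned = [p for p, i in line.items()
                    if checkpoints[p][i]["received"] - sent]
        if not orphaned:
            return line
        for p in orphaned:
            line[p] -= 1  # further rolling back is necessary


# P checkpoints, sends m1; Q receives m1 and checkpoints, then sends m2;
# P receives m2 and checkpoints again. Q never saved the sending of m2.
checkpoints = {
    "P": [
        {"sent": set(), "received": set()},    # initial state
        {"sent": set(), "received": set()},    # before sending m1
        {"sent": {"m1"}, "received": {"m2"}},  # after receiving m2
    ],
    "Q": [
        {"sent": set(), "received": set()},    # initial state
        {"sent": set(), "received": {"m1"}},   # after receiving m1
    ],
}
line = find_recovery_line(checkpoints)
print(line)
```

In this example P must give up its latest checkpoint because Q never recorded sending m2, and that in turn forces Q back to its initial state. Such cascading rollback is the main risk of uncoordinated checkpointing.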
Message logging
One technique for reducing the number of checkpoints while still
enabling recovery is logging messages. The basic idea
underlying message logging is that if the transmission of
messages can be replayed, we can still reach a globally consistent
state but without having to restore that state from local storage.
Instead, a checkpointed state is taken as a starting point, and all
messages that have been sent since are simply retransmitted and
handled accordingly.
Considering that message logs are necessary to recover from a
process crash so that a globally consistent state is restored, it
becomes important to know precisely when messages are to be
logged.
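Recovery by replaying logged messages can be sketched for a single process. The sketch rests on several assumptions that the slides leave open: messages are logged at the receiver before they are handled, the log survives a crash, and handling a message is deterministic, so replaying the same messages in the same order reproduces the same state. The class LoggedProcess and its balance field are hypothetical.

```python
class LoggedProcess:
    """Hypothetical process whose state is a single balance."""

    def __init__(self):
        self.balance = 0     # volatile state, lost in a crash
        self.checkpoint = 0  # last saved state (assumed to survive a crash)
        self.log = []        # messages handled since the last checkpoint

    def handle(self, amount):
        self.log.append(amount)  # log the message before handling it
        self.balance += amount

    def take_checkpoint(self):
        self.checkpoint = self.balance
        self.log.clear()         # older messages are no longer needed

    def crash(self):
        self.balance = None      # volatile state is gone

    def recover(self):
        # Start from the checkpointed state and replay the logged messages.
        self.balance = self.checkpoint
        for amount in self.log:
            self.balance += amount


p = LoggedProcess()
p.handle(10)
p.take_checkpoint()
p.handle(5)
p.handle(-3)
before_crash = p.balance
p.crash()
p.recover()
print(before_crash, p.balance)
```

Only one checkpoint is taken, yet the two messages handled after it are not lost, because they are replayed from the log during recovery.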