Lect8 FaultTolerance
Fault Tolerance
2023
Outline
FAULT TOLERANCE
Faults
Fault types according to their output behaviour:
1. Fail-stop fault (omission faults):
Either the processor is executing and produces correct values,
or it failed and will never respond to any request.
Working processors can detect the failed processor
by a time-out mechanism.
2. Byzantine fault (arbitrary faults):
A processor can fail and stop, execute slowly, or execute at a
normal speed but produce erroneous values and actively try to
make the computation fail.
Any message can be corrupted, and correctness has to be
decided upon by a group of processors.
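The time-out mechanism for detecting fail-stop faults can be sketched as follows. This is a minimal illustration, not a prescribed design: the class name `TimeoutDetector`, the heartbeat messages, and the injectable `clock` parameter are all assumptions introduced here.

```python
import time


class TimeoutDetector:
    """Suspects a processor as failed once no heartbeat arrived within `timeout` seconds."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock      # injectable, so the sketch can be exercised without sleeping
        self.last_seen = {}     # processor name -> time of its latest heartbeat

    def heartbeat(self, processor):
        # Record the latest sign of life from this processor.
        self.last_seen[processor] = self.clock()

    def suspected(self):
        # All processors whose last heartbeat is older than the time-out.
        now = self.clock()
        return sorted(p for p, t in self.last_seen.items() if now - t > self.timeout)
```

Such a detector only works under the fail-stop assumption: a Byzantine processor could keep sending heartbeats while producing erroneous values.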
Essential aspects:
Backward recovery assumes time redundancy!
The system periodically saves globally consistent states of the
distributed system, which can serve as recovery points.
When a fault is detected, the system is recovered from the most
recent recovery point.
Corrective action:
Carry on with the same processor and software
(a transient fault is assumed).
Carry on with a new processor
(a permanent hardware fault is assumed).
Carry on with the same processor and another software version
(a permanent software fault is assumed).
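Backward recovery for the transient-fault case (carry on with the same processor and software) might look like the sketch below. It is deliberately simplified to a single process with an in-memory recovery point; the names `CheckpointedTask` and `run_with_recovery`, the use of `RuntimeError` to signal a detected fault, and the retry limit are assumptions for illustration only.

```python
import copy


class CheckpointedTask:
    """Holds the working state plus the most recent recovery point."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def save_checkpoint(self):
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Recover from the most recent recovery point.
        self.state = copy.deepcopy(self._checkpoint)


def run_with_recovery(task, steps, max_retries=3):
    for step in steps:
        for _ in range(max_retries):
            try:
                step(task.state)        # may modify the state and then fail
                task.save_checkpoint()  # step succeeded: new recovery point
                break
            except RuntimeError:
                task.rollback()         # fault detected: redo from the checkpoint
        else:
            # Retrying did not help, so the fault is probably not transient.
            raise RuntimeError("fault persists after retries")
    return task.state
```

The repeated execution of a step is the time redundancy that backward recovery relies on.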
Forward Recovery
Backward recovery is based on time redundancy and on the
availability of back-up files and saved checkpoints;
this is expensive in terms of time.
Forward recovery:
the error is masked without redoing any computations.
Forward recovery is based on hardware and, possibly,
software redundancy.
Hardware Redundancy
Hardware redundancy: use of additional hardware to compensate for failures:
Fault detection, correction, and masking:
Multiple hardware units are assigned to the same task in parallel
and their results are compared.
Detection: if one or more (but not all) units are faulty, this shows up
as a disagreement in the results.
Correction and masking: if only a minority of the units are faulty, and
sufficient units produce the same output, this output can be used to
correct and mask the failure.
Replacement of malfunctioning units:
Correction and masking are short-term measures.
In order to restore the initial performance and degree of fault-tolerance,
the faulty unit has to be replaced.
Voters
Several approaches for voting are possible.
The goal is to "filter out" the correct value from the set of candidates.
The most common one: majority voter
The voter constructs a set of classes of values:
P1, P2, ..., Pn:
x, y ∈ Pi if and only if x = y
If Pi is the largest class and N is the number of outputs (N is odd):
if card(Pi) > N/2, then any x ∈ Pi is the correct output;
the error can be masked.
if card(Pi) < N/2, the error cannot be masked
(it can only be detected).
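A majority voter along these lines can be sketched in a few lines. The function name `majority_vote` and the `(value, masked)` return convention are assumptions made here; the sketch also assumes a non-empty list of hashable outputs.

```python
from collections import Counter


def majority_vote(outputs):
    """Return (value, True) if a strict majority agrees, else (None, False)."""
    # Counter groups equal outputs into classes; most_common(1) yields the largest class.
    value, count = Counter(outputs).most_common(1)[0]
    if count > len(outputs) / 2:
        return value, True      # the error can be masked
    return None, False          # disagreement detected, but no value can be trusted
```

For example, `majority_vote([3, 3, 7])` masks the single deviating output, whereas three pairwise different outputs only allow detection.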
Voters
Sometimes we cannot use strict equality:
sensors can provide slightly different values;
the same application can be run on different processors,
and outputs can differ only because of the internal
representations used (e.g., floating point).
If |x - y| < ε, then we consider x = y.
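Voting with a tolerance ε could be sketched as below. Note that "approximately equal" is not transitive, so the outputs no longer fall into clean classes; this sketch sidesteps that by counting, for each candidate, how many outputs lie within ε of it. The function name `approx_majority_vote` and this counting strategy are assumptions, one of several reasonable choices.

```python
def approx_majority_vote(outputs, eps):
    """Return an output supported by a strict majority within eps, else None."""
    best, best_count = None, 0
    for x in outputs:
        # Supporters of x: all outputs y with |x - y| < eps (x itself included).
        count = sum(1 for y in outputs if abs(x - y) < eps)
        if count > best_count:
            best, best_count = x, count
    if best_count > len(outputs) / 2:
        return best
    return None
```

With `eps = 0.05`, the readings `1.00` and `1.01` support each other and outvote a third reading of `5.0`.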
Voters
Other voting schemes:
k-plurality voter
Similar to majority voting:
the largest set need not contain more than N/2 elements;
it is sufficient that card(Pi) ≥ k, with k selected by the designer.
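A k-plurality voter differs from the majority voter only in its acceptance threshold. The sketch below reads the condition as "the largest class has at least k members", which is an assumption about the intended meaning; the name `k_plurality_vote` is likewise illustrative. Ties between equally large classes are resolved in favour of the value seen first, which a real design would have to decide explicitly.

```python
from collections import Counter


def k_plurality_vote(outputs, k):
    """Return the most frequent output if it occurs at least k times, else None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count >= k else None
```

With five outputs and `k = 2`, the value `4` in `[4, 4, 1, 2, 3]` is accepted although it has no majority.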
Voters
Other voting schemes:
Median voter
The median value is selected.
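A median voter needs only an ordering on the outputs. The sketch below uses `statistics.median_low` so that the selected value is always one of the actual outputs, even for an even number of units; that choice and the name `median_vote` are assumptions.

```python
import statistics


def median_vote(outputs):
    # median_low returns an element of `outputs`, never an interpolated value.
    return statistics.median_low(outputs)
```

A single wildly wrong reading, such as `99.0` among `10.0` and `10.1`, ends up at one end of the ordering and is therefore never selected.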
k-Fault-Tolerant Systems
A system is k-fault-tolerant if it can survive faults in k components
and still meet its specifications.
How many components do we need in order to achieve
k-fault-tolerance with voting?
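The text leaves this question open here. The commonly cited answer, stated as an assumption rather than taken from this source, is k + 1 components when faulty units simply stop (any surviving unit gives the correct result), and 2k + 1 components when faulty units may produce wrong values and a majority vote is required. The sketch below checks the worst case, in which all k faulty units output the same wrong value; the function names are hypothetical.

```python
from collections import Counter


def replicas_needed(k, fail_stop=False):
    # k + 1 suffices if faulty units only fall silent; majority voting needs 2k + 1.
    return k + 1 if fail_stop else 2 * k + 1


def worst_case_masked(n, k, correct="ok", wrong="bad"):
    """True if n units with k colluding faulty ones still yield a correct strict majority."""
    outputs = [wrong] * k + [correct] * (n - k)
    value, count = Counter(outputs).most_common(1)[0]
    return value == correct and count > n / 2
```

With k = 2, five units mask the faults, while four units do not.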
Processor and Memory Level Redundancy
Processor and memory are handled as a unit;
voting is on processor outputs:
Processor and Memory Level Redundancy
Processors and memories can be handled as separate modules.
Software Redundancy
Software is very different from hardware
in the context of redundancy:
A software fault is always caused by a mistake in specification
or by a bug (a design error).
Distributed Agreement with Byzantine Faults
Example
P1 receives a value from the sensor,
and the processors have to continue operation with that value;
in order to achieve fault tolerance,
they have to agree on the value to continue with:
this should be the value received by P1 from the sensor,
if P1 is not faulty;
if P1 is faulty, all non-faulty processors should use the
same value to continue with.
Distributed Agreement with Byzantine Faults
Example
Could P2 and P3 get out of trouble by communicating with each other?
If P1 is faulty:
P2 does not know whether P1 or P3 is the faulty one,
thus it cannot resolve the contradicting inputs.
The same holds for P3.
No agreement.
If P3 is faulty:
P2 again does not know whether P1 or P3 is the faulty one,
thus it cannot resolve the contradicting inputs.
No agreement.
The Byzantine Generals Problem
The Byzantine army is preparing for a battle.
A number of generals must
coordinate among themselves
through (reliable) messengers
on whether to attack or retreat.
A commanding general (C) will make
the decision whether or not to attack.
Any of the generals, including the
commander, may be traitorous:
they might send messages to attack
to some generals and messages
to retreat to others.
The Byzantine Generals Problem
The problem in the story:
The loyal generals must all agree to attack, or all agree to retreat.
If the commanding general is loyal, all loyal generals must
agree with the decision that he made.
The Byzantine Generals Problem
The case with three generals:
No agreement is possible
if one of three generals is traitorous
The Byzantine Generals Problem
The case with four generals:
Gen. left: attack, ???, retreat.
Gen. middle: ???, attack, retreat.
Gen. right: retreat, ???, attack.
The Byzantine Generals Problem
Let us come back to our real-life example,
this time with four processors:
P2, P3, and P4 will reach agreement on value 3,
despite the faulty input unit P1.
P2, P3, and P4 will reach agreement on the default value, e.g. 0 (used
when no majority exists), despite the faulty input unit P1.
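The four-processor case can be simulated with one round of relaying followed by a majority vote. This is a simplified sketch under stated assumptions: the function `one_round_agreement`, the dictionary-based message model, and the restriction that a faulty receiver relays the same fake value to everyone are all introduced here for illustration. The default value 0 follows the example.

```python
from collections import Counter

DEFAULT = 0  # used when no majority exists


def majority(values):
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else DEFAULT


def one_round_agreement(sent_by_source, faulty_relay=None):
    """sent_by_source: receiver -> value that P1 sent to it (a faulty P1 may vary these).
    faulty_relay: optional (receiver, fake_value); that receiver relays fake_value.
    Returns receiver -> decided value for the non-faulty receivers."""
    decisions = {}
    for r in sent_by_source:
        if faulty_relay and r == faulty_relay[0]:
            continue                        # a faulty receiver's decision does not matter
        votes = [sent_by_source[r]]         # the value received directly from P1
        for other in sent_by_source:
            if other == r:
                continue
            if faulty_relay and other == faulty_relay[0]:
                votes.append(faulty_relay[1])       # the faulty receiver lies when relaying
            else:
                votes.append(sent_by_source[other])  # an honest relay of P1's value
        decisions[r] = majority(votes)
    return decisions
```

If a faulty P1 sends 3, 3, and 5, the receivers agree on 3; if it sends three different values, no majority exists and all of them fall back to the default 0. In both cases P2, P3, and P4 end up with the same value.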
The Byzantine Generals Problem
Let us come back to our real-life example,
this time with four processors: