Lect8 FaultTolerance

The document discusses fault tolerance in distributed systems. It covers topics like fault tolerant systems, faults and fault models, redundancy, backward and forward recovery, hardware redundancy using techniques like N-modular redundancy, voters, and fault tolerance at different system levels.

Uploaded by

Viola Ngige

Distributed Systems

Fault Tolerance
2023
Outline

FAULT TOLERANCE

1. Fault Tolerant Systems
2. Faults and Fault Models
3. Redundancy
4. Time Redundancy and Backward Recovery
5. Hardware Redundancy
6. Software Redundancy
7. Distributed Agreement with Byzantine Faults
8. The Byzantine Generals Problem

Fault Tolerant Systems
• A system fails if it behaves in a way that is not consistent with its specification. Such a failure is the result of a fault in a system component.
• Systems are fault-tolerant if they behave in a predictable manner, according to their specification, in the presence of faults
  → there are no failures in a fault-tolerant system.
• Several application areas need systems that maintain correct (predictable) functionality in the presence of faults:
  - banking systems
  - avionics, medical, automotive
  - manufacturing systems

What does correct functionality in the presence of faults mean?
• The answer depends on the application (on the specification of the system):
  - The system stops and does not produce any erroneous (dangerous) result or behaviour.
  - The system stops and restarts after a while without loss of information.
  - The system keeps functioning without any interruption and (possibly) with unchanged performance.

Faults
A fault can be:
• Hardware fault: malfunction of a hardware component (processor, communication line, switch, etc.).
• Software fault: malfunction due to a software bug.

A fault can be the result of:
1. Mistakes in specification or design: such mistakes are at the origin of all software faults and of some hardware faults.
2. Defects in components: hardware faults can be produced by manufacturing defects or by defects that result from deterioration over time.
3. Operating environment: hardware faults can be the result of stress produced by an adverse environment: temperature, radiation, vibration, etc.

Faults
Fault types according to their temporal behaviour:
1. Permanent fault:
the fault remains until it is repaired
or the affected unit is replaced.
2. Intermittent fault:
the fault vanishes and reappears
(e.g. caused by a loose wire).
3. Transient fault:
the fault dies away after some time
(caused by environmental effects).

Faults
Fault types according to their output behaviour:
1. Fail-stop fault (omission fault):
   • Either the processor is executing and produces correct values, or it has failed and will never respond to any request.
   • Working processors can detect the failed processor by a time-out mechanism.
2. Byzantine fault (arbitrary fault):
   • A process can fail and stop, execute slowly, or execute at normal speed but produce erroneous values and actively try to make the computation fail.
   • Any message can be corrupted, and correctness has to be decided upon by a group of processors.

• The fail-stop model is the easiest to handle; unfortunately, it is sometimes too simple to cover real situations.
• The Byzantine model is the most general; it is very expensive, in terms of complexity, to implement fault-tolerant algorithms based on this model.

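The time-out mechanism mentioned for fail-stop faults can be sketched as a small heartbeat-based detector. This is only an illustration under stated assumptions: the class name `TimeoutFailureDetector`, the heartbeat idea, and the explicit `now` timestamps are hypothetical choices, not something the slides prescribe.

```python
from dataclasses import dataclass, field


@dataclass
class TimeoutFailureDetector:
    """Suspects a processor once no heartbeat arrived within `timeout` seconds."""
    timeout: float
    last_seen: dict = field(default_factory=dict)

    def heartbeat(self, processor: str, now: float) -> None:
        # Record the time of the latest sign of life from this processor.
        self.last_seen[processor] = now

    def suspected(self, now: float) -> list:
        # A processor is suspected if its last heartbeat is older than the timeout.
        return sorted(p for p, t in self.last_seen.items() if now - t > self.timeout)


detector = TimeoutFailureDetector(timeout=2.0)
detector.heartbeat("P1", now=0.0)
detector.heartbeat("P2", now=0.0)
detector.heartbeat("P1", now=3.0)   # P1 keeps responding, P2 stays silent
print(detector.suspected(now=4.0))  # ['P2']
```

Under the fail-stop model a silent processor really has stopped, so a time-out is a sound detector. Under the Byzantine model a faulty processor may keep sending heartbeats, so this check alone would not be sufficient.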
Redundancy
If a system has to be fault-tolerant, it has to be provided with spare capacity → redundancy:
1. Time redundancy: the timing of the system is such that if certain tasks have to be rerun and recovery operations have to be performed, system requirements are still fulfilled.
2. Hardware redundancy: the system is provided with far more hardware than needed for basic functionality.
3. Software redundancy: the system is provided with different software versions:
   - results produced by different versions are compared;
   - when one version fails, another one can take over.
4. Information redundancy: data is coded in such a way that a certain number of bit errors can be detected and, possibly, corrected (using parity coding, checksum codes, cyclic codes).

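As a small illustration of information redundancy, the sketch below shows an even-parity bit detecting a single flipped bit, and a CRC-32 checksum (a cyclic code, computed with Python's standard `zlib.crc32`) detecting a corrupted message. The helper names `add_parity` and `parity_ok` and the sample data are made up for this example.

```python
import zlib


def add_parity(bits: list) -> list:
    """Append an even-parity bit so the total number of 1s is even."""
    return bits + [sum(bits) % 2]


def parity_ok(word: list) -> bool:
    """True if the word (data + parity bit) still has an even number of 1s."""
    return sum(word) % 2 == 0


word = add_parity([1, 0, 1, 1])      # -> [1, 0, 1, 1, 1]
corrupted = word.copy()
corrupted[2] ^= 1                    # flip one bit "in transit"

print(parity_ok(word))       # True
print(parity_ok(corrupted))  # False

# A cyclic code (CRC-32) used as a checksum over a byte string.
message = b"transfer 100 to account 42"
checksum = zlib.crc32(message)
damaged = b"transfer 900 to account 42"
print(zlib.crc32(damaged) == checksum)  # False
```

A single parity bit only detects an odd number of bit errors and cannot correct any; correcting errors needs more redundant bits than this sketch uses.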
Backward Recovery
Basic idea: roll back the computation to a previous checkpoint and resume from there.

Essential aspects:
• Backward recovery assumes time redundancy!
• The system periodically saves globally consistent states of the distributed system, which can serve as recovery points.
• When a fault is detected, the system is recovered from the most recent recovery point.

Corrective action:
• Carry on with the same processor and software (a transient fault is assumed).
• Carry on with a new processor (a permanent hardware fault is assumed).
• Carry on with the same processor and another software version (a permanent software fault is assumed).

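The checkpoint-and-rollback idea can be sketched for a single process as follows. This is a deliberately simplified illustration: it ignores the harder problem of saving a globally consistent state across several processes, and the names `CheckpointedTask`, `run_step`, and `flaky_increment` are hypothetical. A `RuntimeError` stands in for "a fault is detected", and the retry on the same processor corresponds to assuming a transient fault.

```python
import copy


class CheckpointedTask:
    """Keeps the most recent recovery point of a (single-process) state."""

    def __init__(self, state: dict):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self) -> None:
        # Save a recovery point.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self) -> None:
        # Restore the most recent recovery point.
        self.state = copy.deepcopy(self._checkpoint)


def run_step(task, step, max_retries=3):
    """Run `step` on the task state; on a detected fault, roll back and retry."""
    for _ in range(max_retries):
        try:
            step(task.state)
            task.checkpoint()
            return True
        except RuntimeError:       # stands in for "a fault is detected"
            task.rollback()
    return False                   # give up: e.g. switch processor or software version


attempts = {"n": 0}


def flaky_increment(state):
    state["total"] += 10           # partial update happens before the fault
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise RuntimeError("transient fault")


task = CheckpointedTask({"total": 0})
print(run_step(task, flaky_increment), task.state)  # True {'total': 10}
```

The first attempt leaves a partial update behind; the rollback discards it, so the retried step does not count the increment twice.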
Forward Recovery
• Backward recovery is based on time redundancy and on the availability of back-up files and saved checkpoints
  → this is expensive in terms of time.
• Control applications and, in general, real-time systems have very strict timing requirements
  → recovery has to be very fast and should preferably continue from the current state.

Forward recovery: the error is masked without redoing any computations.
• Forward recovery is based on hardware and, possibly, software redundancy.

Hardware Redundancy
Hardware redundancy: use of additional hardware to compensate for failures:
• Fault detection, correction, and masking: multiple hardware units are assigned to the same task in parallel and their results are compared.
  - Detection: if one or more (but not all) units are faulty, this shows up as a disagreement in the results.
  - Correction and masking: if only a minority of the units are faulty, and sufficient units produce the same output, this output can be used to correct and mask the failure.
• Replacement of malfunctioning units: correction and masking are short-term measures. In order to restore the initial performance and degree of fault tolerance, the faulty unit has to be replaced.

Hardware redundancy is a fundamental technique to provide fault tolerance in safety-critical distributed systems: aerospace applications, automotive applications, medical equipment, some parts of telecommunications equipment, nuclear facilities, military equipment, etc.

N-Modular Redundancy
N-modular redundancy (N-MR) is a scheme for forward error recovery. N units are used instead of one, and a voting scheme is applied to their outputs.
• The same inputs are provided to all participating processors, which are supposed to work synchronously
  → a new set of inputs is provided to all processors simultaneously, and the corresponding set of outputs is compared.
• 3-modular redundancy is the most commonly used.

N-Modular Redundancy
• The voter itself can fail → structure with redundant voters.
  [Figure: N-MR structure with redundant voters]
• Voting on inputs from sensors.
  [Figure: voting on inputs from sensors]

Voters
Several approaches for voting are possible.
The goal is to "filter out" the correct value from the set of candidates.
• The most common one: majority voter
  - The voter partitions the outputs into classes of equal values P1, P2, ..., Pn:
    x, y ∈ Pi if and only if x = y
  - If Pi is the largest class and N is the number of outputs (N is odd):
    if card(Pi) > N/2 ⇒ any x ∈ Pi is the correct output; the error can be masked.
    if card(Pi) < N/2 ⇒ the error cannot be masked (it can only be detected).

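A majority voter along these lines might look as follows in Python. This is a minimal sketch that assumes the outputs are hashable values compared with strict equality; the function name `majority_vote` and the `(value, masked)` return convention are choices made for this example.

```python
from collections import Counter


def majority_vote(outputs):
    """Return (value, True) if a strict majority agrees, else (None, False)."""
    value, count = Counter(outputs).most_common(1)[0]
    if count > len(outputs) / 2:
        return value, True     # error masked
    return None, False         # error detected but not masked


print(majority_vote([7, 7, 9]))   # (7, True)
print(majority_vote([7, 8, 9]))   # (None, False)
```

With three units (3-modular redundancy), one deviating output is outvoted by the other two; with three different outputs the voter can only report a disagreement.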
Voters
• Sometimes we cannot use strict equality:
  - sensors can provide slightly different values;
  - the same application can be run on different processors, and outputs can differ only because of the internal representations used (e.g., floating point).
  → if |x - y| < ε then we consider x = y.

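One possible way to vote with a tolerance ε is sketched below. Because "|x - y| < ε" is not transitive, the grouping rule has to be chosen; this sketch assumes a simple greedy rule (sort the values and compare each one with the first value of the current class) and reports the class mean, both of which are illustrative choices rather than anything the slides specify.

```python
def approx_majority_vote(outputs, eps):
    """Majority vote where values closer than eps to a class representative count as equal."""
    classes = []                          # each class is a list; its first value is the representative
    for x in sorted(outputs):
        if classes and abs(x - classes[-1][0]) < eps:
            classes[-1].append(x)
        else:
            classes.append([x])
    largest = max(classes, key=len)
    if len(largest) > len(outputs) / 2:
        return sum(largest) / len(largest), True   # report the class mean
    return None, False


value, masked = approx_majority_vote([20.01, 19.99, 35.2], eps=0.05)
print(round(value, 2), masked)   # 20.0 True
```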
Voters
Other voting schemes:
• k-plurality voter
  - Similar to majority voting: the largest class need not contain more than N/2 elements; it is sufficient that card(Pi) ≥ k, with k selected by the designer.
• Median voter
  - The median value is selected.

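The two alternative voters can be sketched in a few lines. The function names are hypothetical, and the k-plurality sketch assumes the threshold means "at least k agreeing outputs"; the median uses Python's standard `statistics.median`.

```python
from collections import Counter
from statistics import median


def k_plurality_vote(outputs, k):
    """Accept the most frequent value if at least k outputs agree on it."""
    value, count = Counter(outputs).most_common(1)[0]
    return (value, True) if count >= k else (None, False)


def median_vote(outputs):
    """Select the median of the outputs."""
    return median(outputs)


print(k_plurality_vote([4, 4, 6, 7, 9], k=2))  # (4, True)
print(median_vote([10.1, 10.0, 57.3]))          # 10.1
```

The median voter needs no equality test at all, which makes it convenient for sensor readings: a single wildly wrong value (57.3 above) can never be selected as long as the majority of readings is close to the true value.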
k-Fault-Tolerant Systems
A system is k-fault-tolerant if it can survive faults in k components and still meet its specifications.
• How many components do we need in order to achieve k-fault-tolerance with voting?
• With fail-stop faults, having k+1 components is enough to provide k-fault-tolerance:
  → if k stop, the answer from the one left can be used.
• With Byzantine faults, components continue to work and send out erroneous or random replies: 2k+1 components are needed to achieve k-fault-tolerance
  → a majority of k+1 correct components can outvote k components producing faulty results.

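The two counting rules can be checked with a few lines of Python. The function name `components_needed` and the string labels for the fault models are invented for this sketch; the worst case assumed here is that all k faulty components reply with the same wrong value.

```python
from collections import Counter


def components_needed(k, fault_model):
    """Number of components for k-fault-tolerance with voting."""
    if fault_model == "fail-stop":
        return k + 1
    if fault_model == "byzantine":
        return 2 * k + 1
    raise ValueError(f"unknown fault model: {fault_model}")


k = 2
n = components_needed(k, "byzantine")            # 5
replies = ["ok"] * (n - k) + ["bad"] * k         # k faulty replicas agree on a wrong value
winner, votes = Counter(replies).most_common(1)[0]
print(n, winner, votes)                          # 5 ok 3
```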
Processor and Memory Level Redundancy
• N-modular redundancy can be applied at any level: gates, sensors, registers, ALUs, processors, memories, boards.
• If applied at a lower level, time and cost overhead can be high:
  - voting takes time;
  - the number of additional components (voters, connections) becomes high.

Processor and Memory Level Redundancy
• Processor and memory are handled as a unit; voting is on processor outputs.
• Processors and memories can be handled as separate modules:
  (a) voting at read from memory
  (b) voting at write to memory
  (c) voting at read and write

Software Redundancy
Software is very different from hardware in the context of redundancy:
• A software fault is always caused by a mistake in specification or by a bug (a design error).
• Software faults are not produced by manufacturing, aging, stress, or environment.
• Different copies of identical software always produce the same behaviour for identical inputs.
  → Replicating the same software N times, and letting it run on N processors, does not provide any software redundancy: if there is a software bug, it will show up in all N copies.

Software Redundancy
• N different versions of software are needed in order to provide redundancy.
• Two possible approaches:
  1. All N versions are running in parallel; voting is done on the output.
  2. One version is running; if it fails, another one takes over after recovery.
• The N versions of the software must be diverse
  → the probability that they all fail on the same input has to be sufficiently small.
• It is difficult to produce sufficiently diverse versions of the same software. Possible approaches:
  - Let independent teams, with no contact between them, generate software for the same application.
  - Use different programming languages.
  - Use different tools, for example compilers.
  - Use different (numerical) algorithms.
  - Start from differently formulated specifications.
  → Expensive and not always possible.

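The second approach (one version runs; if it fails, another takes over) can be sketched as below. Everything here is an assumption made for illustration: the two integer-square-root versions, the acceptance test `acceptable`, and the driver `recovery_block` are hypothetical, and the "diversity" consists only of using a different algorithm.

```python
def isqrt_v1(n):
    # Primary version: floating-point based; loses precision for very large n.
    return int(n ** 0.5)


def isqrt_v2(n):
    # Alternate version: integer-only binary search, written independently.
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid * mid <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


def acceptable(n, r):
    # Acceptance test: r must be the integer square root of n.
    return r * r <= n < (r + 1) * (r + 1)


def recovery_block(n, versions=(isqrt_v1, isqrt_v2)):
    """Try each version in turn; return the first result that passes the acceptance test."""
    for version in versions:
        try:
            r = version(n)
        except Exception:
            continue              # a crash counts as a failed version
        if acceptable(n, r):
            return r, version.__name__
    raise RuntimeError("all versions failed the acceptance test")


print(recovery_block(49))                    # (7, 'isqrt_v1')
print(recovery_block((10**17 + 1) ** 2))     # (100000000000000001, 'isqrt_v2')
```

For the large input the primary version cannot return the right answer, because 10^17 + 1 is not representable as a double; the acceptance test catches this and the alternate version takes over. Running both copies of `isqrt_v1` side by side would not have helped, since they would fail identically.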
Distributed Agreement with Byzantine Faults
Very often, distributed processes have to come to an agreement. For example, they have to agree on a certain value, with which each of them has to continue operation.
• What if some of the processors are faulty and exhibit Byzantine faults?
• How many correct processors are needed in order to achieve k-fault-tolerance?

Remember:
• With a simple voting scheme, 2k+1 components are needed to achieve k-fault-tolerance in the case of Byzantine faults
  → 3 processors are sufficient to mask the fault of one of them.

However, this is not the case for agreement!

Distributed Agreement with Byzantine Faults
Example
• P1 receives a value from the sensor, and the processors have to continue operation with that value; in order to achieve fault tolerance, they have to agree on the value to continue with:
  - this should be the value received by P1 from the sensor, if P1 is not faulty;
  - if P1 is faulty, all non-faulty processors should use the same value to continue with.

Distributed Agreement with Byzantine Faults
Example
• Maybe, by letting P2 and P3 communicate, they could get out of trouble?
• If P1 is faulty:
  - P2 does not know if P1 or P3 is the faulty one, thus it cannot handle the contradicting inputs.
  - The same holds for P3.
  → No agreement
• The same if P3 is faulty:
  - P2 does not know if P1 or P3 is the faulty one, thus it cannot handle the contradicting inputs.
  → No agreement

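Why P2 cannot tell the two cases apart can be made concrete with a tiny simulation. This is only a sketch: the function `run_round`, the sensor value 3, and the lie 5 are made-up illustration values. Each view is a pair (value received from P1, value relayed by the other processor).

```python
def run_round(sensor_value, faulty, lie):
    """Return the views of P2 and P3 as (value from P1, value relayed by the peer)."""
    # P1 forwards the sensor value; a faulty P1 sends `lie` to P3 instead.
    p1_to_p2 = sensor_value
    p1_to_p3 = lie if faulty == "P1" else sensor_value
    # P2 and P3 relay what they got from P1; a faulty P3 relays `lie` instead.
    p2_relays = p1_to_p2
    p3_relays = lie if faulty == "P3" else p1_to_p3
    return {"P2": (p1_to_p2, p3_relays), "P3": (p1_to_p3, p2_relays)}


a = run_round(3, faulty="P1", lie=5)   # P1 is the faulty one
b = run_round(3, faulty="P3", lie=5)   # P3 is the faulty one
print(a["P2"], b["P2"])                # (3, 5) (3, 5)
```

P2 sees exactly the same messages in both runs. In run `b` it must adopt 3, the value of the correct P1, so it must also adopt 3 in run `a`. By the symmetric argument, P3 in run `a` (whose view is (5, 3)) must adopt 5, so the two correct processors end up with different values.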
Distributed Agreement with Byzantine Faults
• With three processors we cannot achieve agreement if one of them is faulty (with Byzantine behaviour)!
• The Byzantine Generals Problem is used as a model to study agreement with Byzantine faults.

The Byzantine Generals Problem
The Byzantine army is preparing for a battle. A number of generals must coordinate among themselves through (reliable) messengers on whether to attack or retreat.
A commanding general (C) will make the decision whether or not to attack.
Any of the generals, including the commander, may be traitorous: they might send messages to attack to some generals and messages to retreat to others.

The Byzantine Generals Problem
The problem in the story:
• The loyal generals must all agree to attack, or all agree to retreat.
• If the commanding general is loyal, all loyal generals must agree with the decision that he made.

The problem in real life:
• All non-faulty processors must use the same input value.
• If the input unit (P1) is not faulty, all non-faulty processors must use the value it provides.

The Byzantine Generals Problem
The case with three generals:

No agreement is possible
if one of three generals is traitorous

The Byzantine Generals Problem
The case with four generals:
• Values received by each loyal general (one directly from the commander, two relayed by the other generals):
  - Gen. left: attack, ???, retreat.
  - Gen. middle: ???, attack, retreat.
  - Gen. right: retreat, ???, attack.
• The generals decide by majority voting on their input; if no majority exists, a default value is used (retreat, for example):
  - If ??? = attack → all three decide on attack.
  - If ??? = retreat → all three decide on retreat.
  - If ??? = dummy → all three decide on retreat.

The three loyal generals have reached agreement, despite the traitorous commander.

The Byzantine Generals Problem
The case with four generals (cont.):
• Gen. left: attack, attack, anything.
• Gen. middle: attack, attack, anything.

By majority vote on the input messages, the two loyal generals have agreed on the message proposed by the loyal commander (attack).

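The four-general cases can be replayed with a short simulation of one exchange round followed by majority voting. This is a sketch under simplifying assumptions: the function names are hypothetical, a traitorous lieutenant relays the same false value to everyone (a real traitor could send different values to different generals), and "retreat" is the default when no majority exists.

```python
from collections import Counter

DEFAULT = "retreat"


def majority(values):
    """Majority of the received values; the default is used when no majority exists."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else DEFAULT


def one_round_agreement(commander_orders, traitor=None, traitor_relay=None):
    """commander_orders maps each lieutenant to the order the commander sent it.
    A traitorous lieutenant relays traitor_relay instead of the order it received."""
    lieutenants = list(commander_orders)
    decisions = {}
    for me in lieutenants:
        if me == traitor:
            continue                           # a traitor's decision does not matter
        received = [commander_orders[me]]      # directly from the commander
        for other in lieutenants:
            if other != me:
                relayed = traitor_relay if other == traitor else commander_orders[other]
                received.append(relayed)
        decisions[me] = majority(received)
    return decisions


# Traitorous commander: a different order to each of the three loyal lieutenants.
print(one_round_agreement({"left": "attack", "middle": "dummy", "right": "retreat"}))
# {'left': 'retreat', 'middle': 'retreat', 'right': 'retreat'}

# Loyal commander, traitorous right lieutenant.
print(one_round_agreement({"left": "attack", "middle": "attack", "right": "attack"},
                          traitor="right", traitor_relay="retreat"))
# {'left': 'attack', 'middle': 'attack'}
```

In both runs the loyal generals end up with the same decision, and when the commander is loyal that decision is the commander's order.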
The Byzantine Generals Problem
The conclusion in general:
• To reach agreement with k traitorous generals requires a total of at least 3k + 1 generals.
• We need 3k + 1 processors to achieve k-fault-tolerance for agreement with Byzantine faults.
  - To mask one faulty processor: total of 4 processors;
  - To mask two faulty processors: total of 7 processors;
  - To mask three faulty processors: total of 10 processors;
  - ...

The Byzantine Generals Problem
Let us come back to our real-life example, this time with four processors:

• First case: P2, P3, and P4 will reach agreement on value 3, despite the faulty input unit P1.
• Second case: P2, P3, and P4 will reach agreement on the default value, e.g. 0 (used when no majority exists), despite the faulty input unit P1.

The Byzantine Generals Problem
Let us come back to our real-life example, this time with four processors:

The two non-faulty processors P2 and P3 agree on value 3, which is the value produced by the non-faulty input unit P1.
