Week 07a

This document discusses faults and distributed consensus in parallel computing systems. It begins by covering dependability, failures, errors, and faults. It then distinguishes between total and partial failures in distributed systems. The main types of failures are classified as crash, omission, transient, Byzantine, software, temporal, and security failures. Examples of each type are provided. The history of fault-tolerant systems is briefly discussed. Finally, fault-tolerant systems are defined as systems that can tolerate failures without violating safety and liveness properties. The main approaches to fault tolerance are masking tolerance, non-masking tolerance, fail-safe tolerance, and graceful degradation.

Uploaded by

ngokfong yu

Information Technology

FIT3143 - LECTURE WEEK 7a

FAULTS & DISTRIBUTED CONSENSUS
Overview
▪ Faults and Fault-Tolerant Systems
▪ Distributed Consensus

Learning outcome(s) related to this topic

• Compare and contrast different parallel computing architectures,
algorithms and communication schemes using research-based
knowledge and methods (LO2)

FIT3143 Parallel Computing 2

Dependability in a distributed memory parallel
computing system
▪ Availability: System is ready to be used immediately

▪ Reliability: System can run continuously without failure

▪ Safety: When a system (temporarily) fails to operate correctly, nothing
catastrophic happens

▪ Maintainability: How easily a failed system can be repaired

– Building a dependable system comes down to controlling failure and faults.


Failure
▪ Failure: a system fails when it fails to meet its promises or cannot provide its
services in the specified manner

▪ Error: part of the system state that leads to failure (i.e., it differs from its
intended value)

▪ Fault: the cause of an error (results from design errors, manufacturing faults,
deterioration, or external disturbance)

▪ Recursive:
– Failure may be initiated by a mechanical fault
– Manufacturing fault leads to disk failure
– Disk failure is a fault that leads to database failure
– Database failure is a fault that leads to email service failure


Total vs Partial Failure
Total Failure:
▪ All components in a system fail
▪ Typical in non-distributed systems

Partial Failure:
▪ One or more (but not all) components in a distributed system fail
▪ Some components are affected
▪ Other components are completely unaffected
▪ Considered a fault for the whole system


Classification of failures
▪ Our view of a distributed system is a process-level view, so we begin with the
description of certain types of failures that are visible at the process level.

▪ The major classes of failures are as follows:

– Crash Failure
– Omission Failure
– Transient Failure
– Byzantine Failure
– Software Failure
– Temporal Failure
– Security Failure


Classification of failures
Crash Failure

▪ A process undergoes a crash failure when it permanently ceases to execute its
actions. This is an irreversible change.

▪ In an asynchronous model, crash failures cannot be detected with total
certainty, since there is no lower bound on the speed at which a process can
execute its actions.

▪ In a synchronous system, where processor speed and channel delays are
bounded, crash failures can be detected using timeouts.
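
The timeout idea can be sketched as a heartbeat-based failure detector. This is only an illustration under the synchronous assumption stated above: the class name, the heartbeat mechanism and the 3-second timeout are hypothetical choices, not part of the lecture material.

```python
from dataclasses import dataclass, field


@dataclass
class TimeoutFailureDetector:
    """Suspects a process as crashed if no heartbeat arrives within `timeout`.

    This is only sound under the synchronous assumption: processor speed and
    channel delay are bounded, and `timeout` is chosen above that bound.
    """
    timeout: float                                  # hypothetical bound, in seconds
    last_seen: dict = field(default_factory=dict)   # process id -> last heartbeat time

    def heartbeat(self, pid: str, now: float) -> None:
        """Record that a heartbeat from `pid` was received at time `now`."""
        self.last_seen[pid] = now

    def suspected(self, now: float) -> set:
        """Return the processes whose last heartbeat is older than `timeout`."""
        return {pid for pid, t in self.last_seen.items() if now - t > self.timeout}


detector = TimeoutFailureDetector(timeout=3.0)
detector.heartbeat("p0", now=0.0)
detector.heartbeat("p1", now=0.0)
detector.heartbeat("p0", now=2.0)    # p0 keeps sending; p1 goes silent
print(detector.suspected(now=4.0))   # only p1 is suspected
```

In an asynchronous system the same code would still run, but a suspected process might merely be slow, which is why detection there cannot be certain.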


Classification of failures
▪ Omission Failure

If the receiver does not receive one or more of the messages sent by the
transmitter, an omission failure occurs. In wireless networks, this can happen
when a collision occurs at the MAC layer or when the receiving node moves out
of range.

▪ Transient Failure

A transient failure can disturb the state of processes in an arbitrary way. The
agent inducing the problem may be only momentarily active, but it can have a
lasting effect on the global state. Examples include a power surge, a
mechanical shock, or a lightning strike.


Classification of failures
▪ Byzantine Failure
Byzantine failures represent the weakest of all failure models: the model
allows every conceivable form of erroneous behaviour. The term alludes to
uncertainty and was first proposed by Pease et al.

▪ Assume that process i forwards the value x of a local variable to each of
its neighbours. The following inconsistencies may occur:

– two distinct neighbours j and k receive values x and y, where x ≠ y
– one or more neighbours do not receive any data from i
– every neighbour receives a value z where z ≠ x
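
The three inconsistencies listed above can be made concrete with a small checker. This is a sketch under assumed conventions: the function name, the use of `None` for "nothing received" and the label strings are hypothetical, and the checker takes a global view that no single process in a real system has.

```python
def classify_delivery(sent, received):
    """Label the inconsistencies visible in what i's neighbours received.

    `sent` is the value x that process i intended to forward. `received` maps
    each neighbour to the value it got, or None if nothing arrived.
    """
    problems = set()
    values = [v for v in received.values() if v is not None]
    if len(set(values)) > 1:
        # Case 1: two neighbours hold different values.
        problems.add("neighbours disagree")
    if any(v is None for v in received.values()):
        # Case 2: at least one neighbour received nothing.
        problems.add("missing message")
    if values and len(values) == len(received) and len(set(values)) == 1 \
            and values[0] != sent:
        # Case 3: everyone received the same value z, but z != x.
        problems.add("consistent but wrong value")
    return problems


print(classify_delivery(5, {"j": 5, "k": 7}))     # neighbours disagree
print(classify_delivery(5, {"j": 5, "k": None}))  # missing message
print(classify_delivery(5, {"j": 9, "k": 9}))     # consistent but wrong value
```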


Classification of failures
▪ Some possible causes of Byzantine failures are:

– total or partial breakdown of a link joining i with one of its neighbours

– software problems in process i

– hardware synchronization problems: assume that every neighbour is
connected to the same bus and reads the same copy sent out by i. Since the
clocks are not perfectly synchronized, the neighbours may not read the
value of x at the same time. If the value of x varies with time, then different
neighbours of i may read different values of x from process i.


Classification of failures
Software Failure
▪ Primary causes of software failure:
▪ Coding errors or human errors: the program fails to use the appropriate physical
parameters. On September 23, 1999, NASA lost the $125 million Mars Climate Orbiter
spacecraft because one engineering team used metric units while another
used English units. The resulting navigation error caused the spacecraft to
burn up in the atmosphere.

▪ Software design error: the Mars Pathfinder mission landed flawlessly on the
Martian surface on July 4, 1997. However, its communication later failed due to
a design flaw in the real-time embedded software built on the VxWorks kernel. The
problem was later diagnosed as priority inversion:
– Low-priority task LP locks file F
– High-priority task HP is scheduled next; it also needs to lock file F, so it blocks
– A medium-priority task MP (with a high CPU requirement) becomes ready to run
– MP is the highest-priority unblocked task, so it is allowed to run and consumes all the CPU
– LP gets no CPU time, so it cannot release F, and HP stays blocked. In effect, HP’s priority < MP’s priority (priority inversion)
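
The LP/MP/HP scenario can be replayed with a small tick-based simulation of a fixed-priority scheduler. This is a sketch, not the Pathfinder software: the arrival times and job lengths are made up, and the `inherit` switch models priority inheritance, a standard remedy that the slide itself does not describe.

```python
def simulate(inherit: bool) -> list:
    """Tick-based fixed-priority scheduler; returns which task ran on each tick.

    Hypothetical workload: LP holds lock F for its whole 3-tick job, HP needs
    F for a 1-tick job, and MP needs 4 ticks of CPU and no lock.
    """
    prio = {"LP": 1, "MP": 2, "HP": 3}
    arrival = {"LP": 0, "HP": 1, "MP": 2}
    remaining = {"LP": 3, "HP": 1, "MP": 4}
    needs_lock = {"LP", "HP"}
    holder = None                      # task currently holding lock F
    timeline = []
    tick = 0
    while any(remaining.values()):
        ready = [t for t in remaining if remaining[t] > 0 and arrival[t] <= tick]
        # A task that needs F is blocked while another task holds it.
        blocked = {t for t in ready if t in needs_lock and holder not in (None, t)}
        effective = dict(prio)
        if inherit and holder is not None and blocked:
            # Priority inheritance: the holder runs at its highest waiter's priority.
            effective[holder] = max(effective[holder], max(prio[t] for t in blocked))
        runnable = [t for t in ready if t not in blocked]
        task = max(runnable, key=lambda t: effective[t])
        if task in needs_lock:
            holder = task              # acquire F (or keep holding it)
        remaining[task] -= 1
        if remaining[task] == 0 and holder == task:
            holder = None              # release F when the job completes
        timeline.append(task)
        tick += 1
    return timeline


print(simulate(inherit=False))  # HP runs last, after MP has used all its CPU time
print(simulate(inherit=True))   # LP finishes early, so HP runs before MP
```

Without inheritance, HP completes on the last tick even though it has the highest priority. With inheritance, LP is boosted while HP waits, releases F sooner, and HP runs before MP.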


Classification of failures
▪ Memory Leaks
– Processes fail to fully free the physical memory that has been
allocated to them. This effectively reduces the amount of available physical
memory over time. When the available memory falls below the minimum
required by the system, a crash becomes inevitable.

▪ Inadequate specification, e.g. the Y2K bug

Note that many of the failures described earlier, such as crash, omission,
transient and Byzantine failures, can be caused by software bugs. For example,
a poorly designed loop that does not terminate can mimic a crash failure in the
sender process. An inadequate policy in the router software can cause packets
to be dropped and trigger an omission failure.


Classification of failures
▪ Temporal Failure

Real-time systems require actions to be completed within a specific time
frame. When this time limit is not met, a temporal failure occurs.

▪ Security Failure

Viruses and other malicious software may lead to unexpected behaviour that
manifests itself as a system fault.


Classification of failures
Finally, human errors can also play a role in system failure.

– In November 1988, much of the long-distance service along the East
Coast of the USA was disrupted when a construction crew accidentally
detached a major fibre optic cable in New Jersey; as a result,
3,500,000 call attempts were blocked.

– On September 17, 1991, AT&T technicians in NY attending a seminar
on warning systems failed to respond to an activated alarm for six
hours. The resulting power failure blocked nearly 5 million domestic
and international calls and paralysed air travel throughout the
Northeast, causing nearly 1,170 flights to be cancelled or delayed.


History of Fault Tolerant Systems
▪ The first known fault-tolerant computer was SAPO, built in 1951 in
Czechoslovakia by Antonín Svoboda.

▪ Most of the development in so-called LLNM (Long Life, No Maintenance)
computing was done by NASA during the 1960s, in preparation for Project
Apollo and other research aspects. NASA's first machine went into a space
observatory, and their second attempt, the JSTAR computer, was used in
Voyager. This computer had backup memory arrays to support memory
recovery methods, and thus it was called the JPL Self-Testing-And-Repairing
computer. It could detect its own errors and fix them or bring up redundant
modules as needed.
▪ Smart Sensor Network in the Ageless Space Vehicle Project


Fault-Tolerant System
▪ We designate a system that does not tolerate failures as a fault-intolerant system. In such
systems, the occurrence of a fault violates liveness or safety properties.

▪ There are four major types of fault tolerance:

– Masking tolerance
– Non-masking tolerance
– Fail-safe tolerance
– Graceful degradation

Note:
▪ Safety properties specify that “something bad never happens”
– Doing nothing easily fulfils a safety property as this will never lead to a “bad”
situation

▪ Safety properties are complemented by liveness properties

▪ Liveness properties assert that “something good” will eventually happen [Lamport]


Masking Tolerance
▪ Let P be the set of configurations for the fault-tolerant system.

▪ Given a set of fault actions F, the fault span Q corresponds to the largest set of
configurations that the system can reach when faults in F occur.

▪ In a masking-tolerant system, when a fault in F is masked, its occurrence has no
impact on the application, that is, P = Q.

▪ Masking tolerance is important in many safety-critical applications where a
failure can endanger human life or cause massive loss of property.

▪ For example, an aircraft must be able to fly even if one of its engines malfunctions.

▪ Masking tolerance preserves both the safety and liveness properties of the original
system.


Implementing Failure Masking
▪ Introduce Redundancy
– Information redundancy
– Time redundancy
– Physical redundancy
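
As one illustration of physical redundancy, the sketch below runs the same computation on several replicas and lets a majority voter hide a single faulty result (triple modular redundancy). The function name and the sample values are assumptions made for the example; the slide only lists the three kinds of redundancy.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of replicas.

    Raises ValueError when no value has a strict majority, because the
    fault can then no longer be masked.
    """
    if not outputs:
        raise ValueError("no replica outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority: fault cannot be masked")
    return value


# Three replicas compute the same result; one of them is faulty.
print(majority_vote([42, 42, 17]))  # 42
```

With three replicas the voter masks one faulty output; masking k faulty outputs this way needs 2k + 1 replicas.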


Non-Masking Tolerance
▪ In non-masking fault tolerance, faults may temporarily affect the application and
violate the safety property, that is, P ⊂ Q.

▪ However, liveness is not compromised, and eventually normal behaviour is restored.

▪ For example, while a user is watching a movie, the server crashes, but the system
automatically restores the service by switching to a standby proxy server.

▪ Stabilization and checkpointing represent two opposing scenarios in non-masking
tolerance.
– Checkpointing relies on history: recovery is achieved by retrieving the lost
computation.
– Stabilization is history-insensitive and does not care about lost computation
as long as eventual recovery is guaranteed.
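
The checkpointing side of this contrast can be sketched with a running sum that periodically saves its state and rolls back after a crash. The class, the checkpoint interval and the in-memory "stable storage" are all hypothetical simplifications; a real system would write checkpoints to storage that survives the crash.

```python
import copy


class CheckpointedSum:
    """Running sum over a list that saves its state every `interval` items."""

    def __init__(self, interval):
        self.interval = interval
        self.state = {"next_index": 0, "total": 0}
        self.checkpoint = copy.deepcopy(self.state)

    def step(self, data):
        """Process one item and checkpoint when the interval is reached."""
        i = self.state["next_index"]
        self.state["total"] += data[i]
        self.state["next_index"] = i + 1
        if self.state["next_index"] % self.interval == 0:
            # Stand-in for writing the state to stable storage.
            self.checkpoint = copy.deepcopy(self.state)

    def crash_and_recover(self):
        """Lose the volatile state and roll back to the last checkpoint."""
        self.state = copy.deepcopy(self.checkpoint)

    def run(self, data):
        """Process all remaining items and return the final total."""
        while self.state["next_index"] < len(data):
            self.step(data)
        return self.state["total"]


data = [1, 2, 3, 4, 5, 6, 7]
job = CheckpointedSum(interval=3)
for _ in range(5):               # items 0..4 processed; checkpoint taken after item 2
    job.step(data)
job.crash_and_recover()          # the work on items 3 and 4 is lost
print(job.state["next_index"])   # 3: execution resumes from the checkpoint
print(job.run(data))             # 28: the lost work is redone and the sum completes
```

A stabilizing system, by contrast, would keep no such history and would only guarantee that it eventually returns to a legal configuration.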


Fail-Safe Tolerance
▪ Certain faulty configurations do not affect the application in an adverse way
and are therefore considered harmless.

▪ A fail-safe system relaxes the tolerance requirement by avoiding only those
faulty configurations that will have catastrophic consequences
(notwithstanding the failure).

As an example,
▪ At a four-way traffic crossing, if the lights are green in both directions then a
collision is possible. However, if the lights are red in both directions, at worst
traffic will stall, but there will not be any catastrophic side effect.
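
The traffic-light example can be written down as a guard that only rules out the catastrophic configuration. This is a sketch with hypothetical names; it checks commanded light states and does not model sensors, timing or yellow phases.

```python
from enum import Enum


class Light(Enum):
    RED = "red"
    GREEN = "green"


def is_catastrophic(north_south: Light, east_west: Light) -> bool:
    """Green in both directions is the only configuration that allows a collision."""
    return north_south is Light.GREEN and east_west is Light.GREEN


def fail_safe(north_south: Light, east_west: Light) -> tuple:
    """Pass safe configurations through; replace a catastrophic one with all-red."""
    if is_catastrophic(north_south, east_west):
        return (Light.RED, Light.RED)
    return (north_south, east_west)


print(fail_safe(Light.GREEN, Light.GREEN))  # falls back to red in both directions
print(fail_safe(Light.GREEN, Light.RED))    # safe, passed through unchanged
```

Note that the all-red fallback preserves safety but gives up liveness: traffic stalls until the fault is repaired.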


Graceful Degradation
▪ There are systems that neither mask, nor fully recover from the effect of failures,
but exhibit a degraded behaviour that falls short of normal behaviour, but is still
considered acceptable.

▪ The notion of acceptability is highly subjective and entirely dependent on the
user running the application.

Some examples

– While routing a message between two points in a network, a program
computes the shortest path. In the presence of a failure, if this program
returns another path but not the shortest, then this may be acceptable.

– An operating system may switch to a safe mode where users cannot
create or modify files, but can read the files that already exist.


Distributed Consensus
▪ Why do we need distributed consensus? Let us consider the following examples:

▪ Example 1. The leader election problem in a network of processes. Each process
begins with an initial proposal for leadership. At the end, one of the candidates is
elected as the leader, and this reflects the final decision of every process.

▪ Example 2. Fund transfer

▪ Example 3. Synchronizing clocks

▪ Consensus is easier to achieve in the absence of failures. We will study
distributed consensus in the presence of failures.


Problem Definition
The consensus problem may be formulated as follows:

▪ A distributed system contains n processes {0, 1, 2, …, n-1}.

▪ Every process has an initial value in a mutually agreed domain.
▪ The challenge is to devise an algorithm that, in spite of the
occurrence of failures, allows processes to reach an irrevocable
decision that fulfils the following conditions:
– Termination. Every (non-faulty) process must eventually come to a
decision.

– Agreement. The final decision of every (non-faulty) process must be
identical.

– Validity. If every (non-faulty) process begins with the same initial value
v, their final decision must be v.
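
The three conditions can be checked mechanically on the outcome of a single finished run. The sketch below is a checker, not a consensus algorithm: the function name and the dictionary encoding are assumptions, and "eventually decides" is approximated by "has decided by the end of the recorded run".

```python
def check_consensus(initial, decisions, faulty=frozenset()):
    """Check termination, agreement and validity for one finished run.

    `initial` maps process id -> initial value. `decisions` maps process id ->
    decided value; a missing entry or None means the process never decided.
    Only non-faulty processes are constrained by the three conditions.
    """
    correct = [p for p in initial if p not in faulty]
    decided = [decisions.get(p) for p in correct]
    made = [d for d in decided if d is not None]

    termination = len(made) == len(correct)
    agreement = len(set(made)) <= 1
    starts = {initial[p] for p in correct}
    # Validity only constrains runs where all correct processes start with the same v.
    validity = len(starts) != 1 or all(d in starts for d in made)
    return {"termination": termination, "agreement": agreement, "validity": validity}


# Process 3 is faulty; the other three all start with 1 and all decide 1.
print(check_consensus({0: 1, 1: 1, 2: 1, 3: 0}, {0: 1, 1: 1, 2: 1, 3: 0}, faulty={3}))
```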


Consensus in Asynchronous System
▪ If there is no failure, then reaching an agreement is trivial.

▪ Reaching consensus, however, becomes surprisingly difficult when one or
more members fail to execute actions.

▪ Assume that at most k members (k>0) can fail.

– An important finding by Fischer et al. is that in a fully asynchronous
system, it is impossible to reach consensus even if k=1.


Bivalent and Univalent States
▪ A decision state is bivalent if, starting from that state, there exist at least two
distinct executions leading to two distinct decision values, e.g. 0 or 1.

▪ On the other hand, a state from which only one decision value can be reached
is called a univalent state. Univalent states can be either 0-valent or
1-valent.

▪ Consider a best-of-five-sets tennis match between A and B. If the score is 6-3,
6-4 in favour of A, the decision state is bivalent, since either player can still win at this
point. However, if the score becomes 6-3, 6-4, 7-6 in favour of A, then the
state becomes univalent.
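
The tennis analogy can be checked by enumerating every way the match can continue. In this sketch a state is reduced to the number of sets each player has won, which is an assumption made for brevity; the function names are hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def possible_winners(sets_a, sets_b, to_win=3):
    """Return the set of players who can still win from this set score."""
    if sets_a == to_win:
        return frozenset("A")
    if sets_b == to_win:
        return frozenset("B")
    # Either player may win the next set; collect the outcomes of both branches.
    return (possible_winners(sets_a + 1, sets_b, to_win)
            | possible_winners(sets_a, sets_b + 1, to_win))


def valency(sets_a, sets_b):
    """Classify a state as bivalent, or univalent in favour of one player."""
    winners = possible_winners(sets_a, sets_b)
    if len(winners) == 2:
        return "bivalent"
    return f"{next(iter(winners))}-valent"


print(valency(2, 0))  # bivalent: B can still win three sets in a row
print(valency(3, 0))  # A-valent: the match is decided
```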


The Byzantine Generals Problem
▪ Lamport showed (by proof):
– For a system of n+1 nodes to establish distributed consensus, there
cannot be more than n/3 faulty nodes.
– Alternatively:
• There must be more than 3m troops in an army with up to m
traitors to launch a concerted attack.

Credits to Leslie Lamport [2002]
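
The two statements of the bound are the same inequality: with N = n+1 nodes in total and m traitors, m ≤ n/3 is equivalent to N ≥ 3m + 1, that is, N > 3m. A small sketch of the arithmetic follows, with hypothetical function names; it only evaluates the bound and does not implement an agreement protocol.

```python
def max_traitors(total_nodes: int) -> int:
    """Largest m such that total_nodes > 3 * m."""
    return (total_nodes - 1) // 3


def can_reach_agreement(total_nodes: int, traitors: int) -> bool:
    """True when the node count exceeds three times the number of traitors."""
    return total_nodes > 3 * traitors


for nodes in (3, 4, 7, 10):
    print(nodes, max_traitors(nodes))
# 3 nodes tolerate 0 traitors, 4 tolerate 1, 7 tolerate 2, 10 tolerate 3
```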
