Fault Tolerant Message Passing Systems
Engineering (RA)
COURSE NAME: PARALLEL & DISTRIBUTED COMPUTING
COURSE CODE: 22CS4106 R
Specifically
A component C depends on C∗ if the correctness of C’s behavior depends
on the correctness of C∗’s behavior. (Components are processes or
channels.)
Requirements related to dependability

Requirement       Description
Availability      Readiness for usage
Reliability       Continuity of service delivery
Safety            Very low probability of catastrophes
Maintainability   How easily a failed system can be repaired
Basic concepts
RELIABILITY VERSUS AVAILABILITY
Traditional metrics
• Mean Time To Failure (MTTF): The average time until a component fails.
• Mean Time To Repair (MTTR): The average time needed to repair a
component.
• Mean Time Between Failures (MTBF): Simply MTTF + MTTR.
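These metrics are commonly combined into the steady-state availability A = MTTF / (MTTF + MTTR); this formula is standard in the dependability literature, though not stated on the slide itself. A minimal sketch with hypothetical numbers:

```python
# Steady-state availability estimated from MTTF and MTTR.
# The example figures below are hypothetical, for illustration only.
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the component is ready for use."""
    return mttf_hours / (mttf_hours + mttr_hours)

# A component that fails on average every 1000 hours and takes
# 2 hours to repair is available about 99.8% of the time.
print(f"{availability(1000, 2):.4f}")   # 0.9980
```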
Observation
Reliability and availability make sense only if we have an accurate notion
of what a failure actually is.
Terminology
Failure, error, fault
Term      Description                                           Example
Failure   A component is not living up to its specifications    Crashed program
Error     Part of a component that can lead to a failure        Programming bug
Fault     Cause of an error                                      Sloppy programmer
Handling faults
Failure models
Types of failures
Commonly distinguished: crash failures, omission failures (send or receive), timing failures, response failures, and arbitrary (Byzantine) failures.
Observation
Note that deliberate failures, be they omission or commission failures, are typically
security problems. Distinguishing between deliberate failures and unintentional
ones is, in general, impossible.
HALTING FAILURES
Scenario
C no longer perceives any activity from C∗ — is this a halting failure? Distinguishing between a crash and an omission/timing failure may be impossible.
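To see why, consider a timeout-based check (a hedged sketch; the threshold and names are assumptions, not from the slides): all it can report is "no activity observed within T seconds", which is equally consistent with a crash, an omission, or mere slowness.

```python
import time

SUSPECT_AFTER = 5.0  # seconds; an arbitrary, assumed threshold

def suspect(last_heard_at, now=None):
    """True if no activity from C* was observed within SUSPECT_AFTER seconds.

    This tells us only that nothing arrived: C* may have crashed, its
    messages may have been lost (omission), or it may simply be slow (timing).
    """
    if now is None:
        now = time.monotonic()
    return (now - last_heard_at) > SUSPECT_AFTER
```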
REDUNDANCY FOR FAILURE MASKING
Types of redundancy
• Information redundancy: Add extra bits to data units so that errors can be detected and recovered from when bits are garbled.
• Time redundancy: Design the system so that an action can be performed again if anything goes wrong. Typically used when faults are transient or intermittent (see the retry sketch after this list).
• Physical redundancy: Add extra equipment or processes so that the failure of one or more components can be tolerated. This type is extensively used in distributed systems.
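Time redundancy in its simplest form is a retry loop; a minimal sketch (function name, attempt count, and delay are assumptions), useful only when the fault is indeed transient or intermittent:

```python
import time

def with_retries(action, attempts=3, delay=0.1):
    """Time redundancy: repeat an action, hoping the fault was transient."""
    for i in range(attempts):
        try:
            return action()
        except Exception:
            if i == attempts - 1:
                raise          # fault apparently persists; give up
            time.sleep(delay)  # back off briefly before retrying

# Usage (hypothetical flaky operation):
# result = with_retries(lambda: read_sensor())
```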
PROCESS RESILIENCE
Basic idea
Protect against malfunctioning processes through process replication,
organizing multiple processes into a process group. Distinguish between
flat groups and hierarchical groups.
Important assumptions
• All members are identical
• All members process commands in the same order
Result: We can now be sure that all processes do exactly the same thing.
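These two assumptions amount to state-machine replication: deterministic replicas that apply the same commands in the same order end up in the same state. A minimal sketch (class and command names are illustrative, not from the slides):

```python
class Replica:
    """Deterministic replica: same commands in the same order -> same state."""
    def __init__(self):
        self.state = 0

    def apply(self, command: str, arg: int) -> None:
        if command == "add":
            self.state += arg
        elif command == "set":
            self.state = arg

log = [("add", 5), ("set", 2), ("add", 1)]   # agreed-upon command order
replicas = [Replica() for _ in range(3)]
for cmd, arg in log:
    for r in replicas:
        r.apply(cmd, arg)

assert all(r.state == 3 for r in replicas)   # all replicas agree
```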
Observations (flooding-based consensus, where P1 crashes while broadcasting its proposal)
• P2 received all proposed commands from all other processes ⇒ it can make a decision.
• P3 may have detected that P1 crashed, but does not know whether P2 received anything from P1, i.e., P3 cannot know whether it has the same information as P2 ⇒ it cannot make a decision (the same holds for P4).
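The accompanying figure is not reproduced here; the following is a rough sketch of one safe decision rule consistent with these observations (names are illustrative): a process decides only when it has heard from every process, since otherwise it cannot be sure it holds the same set of proposed commands as its peers.

```python
def can_decide(all_processes, received_from) -> bool:
    # Decide only when a message arrived from every process: only then are we
    # certain to hold the same proposals as everyone else (P2's situation).
    # P3 misses P1's message, so it must wait, exactly as in the observations.
    return set(all_processes) <= set(received_from)

print(can_decide({"P1", "P2", "P3", "P4"}, {"P1", "P2", "P3", "P4"}))  # True  (P2)
print(can_decide({"P1", "P2", "P3", "P4"}, {"P2", "P3", "P4"}))        # False (P3)
```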
Raft
When submitting an operation
• A client submits a request for operation o.
• The leader appends the request ⟨o, t, length(log)⟩ to its own log (registering the current term t and the current length of its log).
• The log is (conceptually) broadcast to the other servers.
• The others (conceptually) copy the log and acknowledge the
receipt.
• When a majority of acks arrives, the leader commits o.
Note
In practice, only updates are broadcast. At the end, every server has the same view and knows about the committed operations. Note that, effectively, any information at the backups is overwritten by the leader's log.
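A minimal leader-side sketch of the flow above, under strong simplifying assumptions (no elections, no heartbeats, no log repair, and synchronous calls standing in for RPCs); class and method names are illustrative and are not Raft's actual API:

```python
class Follower:
    def __init__(self):
        self.log = []
    def append_entries(self, leader_log):
        self.log = list(leader_log)   # backups simply adopt the leader's log
        return True                   # acknowledge receipt

class Leader:
    def __init__(self, followers, term):
        self.followers = followers
        self.term = term
        self.log = []                 # entries: (operation, term, index)
        self.commit_index = -1

    def submit(self, operation):
        entry = (operation, self.term, len(self.log))
        self.log.append(entry)
        # "Broadcast" the log; count the leader itself plus acknowledging followers.
        acks = 1 + sum(1 for f in self.followers if f.append_entries(self.log))
        if acks > (len(self.followers) + 1) // 2:   # majority of all servers
            self.commit_index = entry[2]            # commit the operation
            return True
        return False

# Usage: one leader, two followers -> any submitted operation commits.
leader = Leader([Follower(), Follower()], term=1)
print(leader.submit("x := 5"))   # True
```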
Crucial observations
• The new leader has the most committed operations in its log.
• Any missing commits will eventually be sent to the other backups.
Problem
Have an operation performed by either every member of a process group or by none at all.
• Reliable multicasting: a message is to be delivered to all recipients.
• Distributed transaction: each local transaction must succeed.
TWO-PHASE COMMIT PROTOCOL (2PC)
Essence
The client that initiated the computation acts as the coordinator; the processes required to commit are the participants.
• Phase 1a: The coordinator sends VOTE-REQUEST to the participants (also called a pre-write).
• Phase 1b: When a participant receives VOTE-REQUEST, it returns either VOTE-COMMIT or VOTE-ABORT to the coordinator. If it sends VOTE-ABORT, it aborts its local computation.
• Phase 2a: The coordinator collects all votes; if all are VOTE-COMMIT, it sends GLOBAL-COMMIT to all participants, otherwise it sends GLOBAL-ABORT.
• Phase 2b: Each participant waits for GLOBAL-COMMIT or GLOBAL-ABORT and handles it accordingly.
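A compact, hedged sketch of the two phases above, with synchronous calls standing in for messages and no failure handling (that is what the following slides address); all names are illustrative:

```python
VOTE_COMMIT, VOTE_ABORT = "VOTE-COMMIT", "VOTE-ABORT"
GLOBAL_COMMIT, GLOBAL_ABORT = "GLOBAL-COMMIT", "GLOBAL-ABORT"

class Participant:
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.decision = None

    def vote(self):                        # Phase 1b: answer the VOTE-REQUEST
        return VOTE_COMMIT if self.can_commit else VOTE_ABORT

    def global_decision(self, decision):   # Phase 2b: act on the outcome
        self.decision = decision

def two_phase_commit(participants):
    # Phase 1a: send VOTE-REQUEST; Phase 2a: collect votes and decide.
    votes = [p.vote() for p in participants]
    decision = GLOBAL_COMMIT if all(v == VOTE_COMMIT for v in votes) else GLOBAL_ABORT
    for p in participants:
        p.global_decision(decision)
    return decision

# Usage: one participant votes abort -> everyone aborts.
print(two_phase_commit([Participant(), Participant(can_commit=False)]))  # GLOBAL-ABORT
```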
2PC – Finite state machines
(Figure: coordinator and participant finite state machines)
2PC – FAILING PARTICIPANT
Analysis: participant crashes in state S and recovers to S
• INIT: No problem: the participant was unaware of the protocol.
• READY: The participant is waiting to either commit or abort. After recovery, it needs to know which state transition to make ⇒ log the coordinator's decision.
• ABORT: Merely make the transition into the abort state idempotent, e.g., by removing the workspace of results.
• COMMIT: Also make the transition into the commit state idempotent, e.g., by copying the workspace to permanent storage.
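A hedged sketch of the per-state recovery actions listed above, assuming the participant wrote its protocol state (and, for READY, the coordinator's decision) to a log on stable storage; names are illustrative:

```python
def recover(logged_state, logged_decision=None):
    """Recovery action per logged 2PC participant state (illustrative)."""
    if logged_state == "INIT":
        return "nothing to do: the protocol had not started for this participant"
    if logged_state == "READY":
        # Replay the coordinator's logged decision (or ask the other
        # participants, as in the alternative on the next slide).
        return f"apply logged decision: {logged_decision}"
    if logged_state == "ABORT":
        return "remove temporary workspace (idempotent)"
    if logged_state == "COMMIT":
        return "copy workspace to permanent storage (idempotent)"

print(recover("READY", "GLOBAL-COMMIT"))
```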
Observation
When distributed commit is required, having participants use temporary
workspaces to keep their results allows for simple recovery in the presence
of failures.
2PC – FAILING PARTICIPANT
Alternative
When recovery to the READY state is needed, check the state of the other participants ⇒ no need to log the coordinator's decision (see the sketch below).
Result
If all participants are in the READY state, the protocol blocks: apparently the coordinator has failed. Note that the protocol prescribes that we need the coordinator's decision.
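A hedged sketch of this cooperative check (peer states and names are illustrative): a recovered participant in READY asks the others; any peer that already saw GLOBAL-COMMIT or GLOBAL-ABORT, or that is still in INIT (and so cannot have voted commit), settles the outcome, but if every peer is in READY the protocol blocks.

```python
def decide_from_peers(peer_states):
    """Try to terminate 2PC from the states reported by other participants."""
    if "COMMIT" in peer_states:
        return "COMMIT"   # someone already received GLOBAL-COMMIT
    if "ABORT" in peer_states or "INIT" in peer_states:
        return "ABORT"    # the decision was, or can safely be, abort
    return None           # all READY: block and wait for the coordinator

print(decide_from_peers(["READY", "READY", "COMMIT"]))  # COMMIT
print(decide_from_peers(["READY", "READY"]))            # None -> blocked
```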
2PC – FAILING COORDINATOR
Observation
The real problem is that the coordinator's final decision may not be available for some time (or may even be lost).
Alternative
Let a participant P in the READY state time out when it has not received the coordinator's decision; P then tries to find out what the other participants know (as discussed).
Observation
The essence of the problem is that a recovering participant cannot make a local decision: it depends on other (possibly failed) processes.
THANK YOU