4th Unit Topics Recovery
Recovery (cont.)
System failure:
– System does not meet its requirements, i.e. does not perform its services as specified
Fault:
– Anomalous physical condition, e.g. design errors, manufacturing problems, damage,
external disturbances.
Classification of failures
❑ Process failure:
❑ Behavior: process causes system state to deviate from specification (e.g. incorrect
computation, process stops executing)
❑ Errors causing process failure: protection violation, deadlocks, timeout, wrong user input, etc.
❑ Recovery: Abort process or
❑ Restart process from prior state
❑ System failure:
❑ Behavior: processor fails to execute
❑ Caused by software errors or hardware faults (CPU/memory/bus/…/ failure)
❑ Recovery: system stopped and restarted in correct state
❑ Assumption: fail-stop processors, i.e. system stops execution, internal state is lost
❑ Secondary Storage Failure:
❑ Behavior: stored data cannot be accessed
❑ Errors causing failure: parity error, head crash, etc.
❑ Recovery/Design strategies:
❑ Reconstruct content from archive + log of activities
❑ Design mirrored disk system
❑ Communication Medium Failure:
❑ Behavior: a site cannot communicate with another operational site
❑ Errors/Faults: failure of switching nodes or communication links
❑ Recovery/Design Strategies: reroute, error-resistant communication protocols
Backward and Forward Error Recovery
❑ Failure recovery: restore an erroneous state to an error-free state
❑ Approaches to failure recovery:
❑ Forward-error recovery:
❑ Remove errors in process/system state (if errors can be completely assessed)
❑ Continue process/system forward execution
❑ Backward-error recovery:
❑ Restore process/system to previous error-free state and restart from there
❑ Comparison: Forward vs. Backward error recovery
❑ Backward-error recovery
❑ (+) Simple to implement
❑ (+) Can be used as general recovery mechanism
❑ (-) Performance penalty
❑ (-) No guarantee that fault does not occur again
❑ (-) Some components cannot be recovered
❑ Forward-error Recovery
❑ (+) Less overhead
❑ (-) Limited use, i.e. applicable only when the impact of faults is understood
❑ (-) Cannot be used as general mechanism for error recovery
Backward-Error Recovery: Basic approach
❑ Principle: restore process/system to a known, error-free “recovery point”/ “checkpoint”.
❑ System model:
[Figure: system model — the CPU operates on objects brought from secondary storage into main memory; logs and recovery points are kept in stable storage, which maintains its information in the event of a system failure]
❑ Approaches:
❑ (1) Operation-based approach
❑ (2) State-based approach
(1) The Operation-based Approach
❑ Principle:
❑ Record all changes made to state of process (‘audit trail’ or ‘log’) such that process
can be returned to a previous state
❑ Example: A transaction based environment where transactions update a database
❑ It is possible to commit or undo updates on a per-transaction basis
❑ A commit indicates that the transaction on the object was successful and changes are
permanent
❑ (1.a) Updating-in-place
❑ Principle: every update (write) operation to an object creates a log in stable storage
that can be used to ‘undo’ and ‘redo’ the operation
❑ Log content: object name, old object state, new object state
❑ Implementation of a recoverable update operation:
❑ Do operation: update object and write log record
❑ Undo operation: log(old) -> object (undoes the action performed by a do)
❑ Redo operation: log(new) -> object (redoes the action performed by a do)
❑ Display operation: display log record (optional)
❑ Problem: a ‘do’ cannot be recovered if the system crashes after the object is written
but before the log record is written
❑ (1.b) The write-ahead log protocol
❑ Principle: write log record before updating object
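A minimal sketch of the updating-in-place operations with a write-ahead log, in Python (illustrative only; the in-memory object store and the record layout are assumptions, and a real system would append the log to stable storage before touching the object):

```python
# Minimal write-ahead log sketch: the log record (object name, old state, new state)
# is appended BEFORE the object is updated, so a crash can always be undone or redone.
log = []        # stands in for the stable-storage log
objects = {}    # stands in for the updatable objects

def do_update(name, new_state):
    """Recoverable 'do': write the log record first, then update the object."""
    old_state = objects.get(name)
    log.append({"object": name, "old": old_state, "new": new_state})  # write-ahead
    objects[name] = new_state                                         # update in place

def undo(record):
    """Undo: restore the old state recorded in the log."""
    objects[record["object"]] = record["old"]

def redo(record):
    """Redo: re-apply the new state recorded in the log."""
    objects[record["object"]] = record["new"]

# Example: an update that can be rolled back and replayed after a crash.
do_update("x", 42)
undo(log[-1])   # x is back to its previous state (None here)
redo(log[-1])   # x is 42 again
```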
(2) State-based Approach
❑ Principle: occasionally save the complete state of a process as a ‘checkpoint’/‘recovery point’; recovery consists of restoring the process to its most recent checkpoint and resuming execution from there (‘rolling back’)
Recovery in concurrent systems
❑ Issue: if one of a set of cooperating processes fails and has to be rolled back to a
recovery point, all processes it communicated with since the recovery point have to be
rolled back.
❑ Conclusion: In concurrent and/or distributed systems all cooperating processes have to
establish recovery points
❑ Orphan messages and the domino effect
[Figure: processes X, Y, and Z with recovery points x1, x2, x3 (X), y1, y2 (Y), and z1, z2 (Z); Y sends message ‘m’ to X after y2]
❑ Case 1: failure of X after x3 : no impact on Y or Z
❑ Case 2: failure of Y after sending msg. ‘m’
❑ Y rolled back to y2
❑ ‘m’ ≡ orphan message
❑ X rolled back to x2
❑ Case 3: failure of Z after z2
❑ Y has to roll back to y1
❑ X has to roll back to x1
❑ Z has to roll back to z1
❑ This cascading rollback is the Domino Effect
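The rollbacks in these cases follow from a simple consistency condition: a set of recovery points must not record the receipt of a message that the sender's recovery point does not record as sent. A tiny sketch of that check (the count matrices are an assumed representation, not part of the slides):

```python
# Sketch of the consistency condition behind the domino effect: a set of recovery
# points is inconsistent if some process has recorded the receipt of a message
# ("orphan message") that the sender's recovery point has not recorded as sent.
def orphan_free(sent, rcvd):
    """
    sent[j][i]: messages j's recovery point records as sent to i.
    rcvd[i][j]: messages i's recovery point records as received from j.
    Returns True iff no recovery point records more receipts than were sent.
    """
    return all(rcvd[i][j] <= sent[j][i] for i in rcvd for j in rcvd[i])

# Case 2 above: Y rolls back past the send of 'm', but X still records its receipt.
sent = {"Y": {"X": 0}, "X": {"Y": 0}}
rcvd = {"X": {"Y": 1}, "Y": {"X": 0}}
print(orphan_free(sent, rcvd))   # -> False, so X must also roll back
```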
Lost messages
[Figure: X, with recovery point x1, sends message ‘m’ to Y; Y, with recovery point y1, fails after receiving ‘m’]
• Assume that x1 and y1 are the only recovery points for processes X and Y, respectively
• Assume Y fails after receiving message ‘m’
• Y rolled back to y1, X rolled back to x1
• Message ‘m’ is lost
Note: there is no distinction between this case and the case where message ‘m’ is lost in
communication channel and processes X and Y are in states x1 and y1, respectively
Problem of livelock
• Livelock: case where a single failure can cause an infinite number of rollbacks
[Figure (a): processes X and Y with recovery points x1 and y1; X sends message n1, Y sends message m1, and Y fails]
[Figure (b): after both processes roll back, the exchange of n1, n2, and m2 forces a second rollback, and the pattern can repeat indefinitely]
Checkpointing approaches
• Synchronous
– with global synchronization at checkpointing
• Asynchronous
– without global synchronization at checkpointing
Preliminary (Assumptions)
~Synchronous Checkpoint~
Goal
To make a consistent global checkpoint
Assumptions
– Communication channels are FIFO
– No partition of the network
– End-to-end protocols cope with message loss due to
rollback recovery and communication failure
– No failure during the execution of the algorithm
Preliminary (Two types of checkpoint)
~Synchronous Checkpoint~
tentative checkpoint :
– a temporary checkpoint
– a candidate for permanent checkpoint
permanent checkpoint :
– a local checkpoint at a process
– a part of a consistent global checkpoint
Checkpoint Algorithm
Algorithm
~Synchronous Checkpoint~
1. an initiating process (a single process that invokes this algorithm) takes a
tentative checkpoint
2. it requests all the processes to take tentative checkpoints
3. it waits to hear from all the processes whether they succeeded in taking a
tentative checkpoint
4. if it learns that all the processes have succeeded, it decides that all tentative
checkpoints should be made permanent; otherwise, they should be discarded.
5. it informs all the processes of the decision
6. The processes that receive the decision act accordingly
Supplement
Once a process has taken a tentative checkpoint, it shouldn’t send messages
until it is informed of the initiator’s decision.
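A rough sketch of this two-phase structure in Python (everything below — the class, the method names, and the in-memory "messaging" — is an illustrative assumption, not code from the slides):

```python
# Sketch of the synchronous checkpoint algorithm's two-phase structure:
# phase 1 collects tentative checkpoints, phase 2 commits or discards them.
class Process:
    def __init__(self, name):
        self.name = name
        self.state = {"work": 0}
        self.tentative = None
        self.permanent = None

    def take_tentative(self):
        # A tentative checkpoint is a temporary copy of the local state.
        self.tentative = dict(self.state)
        return True                      # report "OK" to the initiator

    def act_on_decision(self, commit):
        # Tentative checkpoints are made permanent or discarded, as decided.
        if commit and self.tentative is not None:
            self.permanent = self.tentative
        self.tentative = None

def run_checkpoint(initiator, others):
    # Phase 1: initiator checkpoints itself, then requests tentative checkpoints.
    oks = [initiator.take_tentative()] + [p.take_tentative() for p in others]
    # Phase 2: commit only if every process succeeded, then inform everyone.
    commit = all(oks)
    for p in [initiator] + others:
        p.act_on_decision(commit)
    return commit

procs = [Process(f"P{i}") for i in range(3)]
print(run_checkpoint(procs[0], procs[1:]))   # True -> all checkpoints permanent
```

Note that, per the supplement above, a real implementation must also block application messages between taking the tentative checkpoint and receiving the decision.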
Diagram of Checkpoint Algorithm
~Synchronous Checkpoint~
[Figure: the initiator takes a tentative checkpoint, sends ‘request to take a tentative checkpoint’ messages to processes X and Y, receives OK from both, decides to commit, and all tentative checkpoints become permanent; the quantities last_label_rcvdX[Y] and first_label_sentY[X] annotate the messages exchanged between X and Y]
Optimized Algorithm
~Synchronous Checkpoint~
Algorithm
1. an initiating process takes a tentative checkpoint
2. it requests every p ∈ ckpt_cohort to take a tentative checkpoint (this
request includes the sender’s last_label_rcvd[receiver])
3. if a process that receives the request needs to take a checkpoint, it does
the same as steps 1 and 2; otherwise, it returns an OK message
4. each requester waits to receive OK from all p ∈ ckpt_cohort
5. if the initiator learns that all the processes have succeeded, it decides that
all tentative checkpoints should be made permanent; otherwise, they should
be discarded.
6. it informs p ∈ ckpt_cohort of the decision
7. The processes that receive the decision act accordingly
Diagram of Optimized Algorithm
~Synchronous Checkpoint~
[Figure: processes A, B, C, and D taking tentative checkpoints under the optimized algorithm; each request carries last_label_rcvd values that are compared against first_label_sent (e.g. 2 >= 0 > ⊥) to decide which cohorts must take tentative checkpoints, and OK replies flow back before the commit decision]
[Figure: the corresponding rollback case, in which ‘request to roll back’ messages and comparisons of message labels (e.g. 2 > 1) determine which processes must roll back]
– No additional messages
– No synchronization delay
– Lighter load during normal execution
Preliminary (Assumptions)
~Asynchronous Checkpoint / Recovery~
Goal
To find the latest consistent set of checkpoints
Assumptions
– Communication channels are FIFO
– Communication channels are reliable
– The underlying computation is event-driven
Preliminary (Two types of log)
~Asynchronous Checkpoint / Recovery~
• On receipt of a message, the corresponding event is saved in memory (volatile log)
• The volatile log is periodically flushed to disk (stable log) ⇔ checkpoint
volatile log :
quick access
lost if the corresponding processor fails
stable log :
slow access
not lost even if processors fail
Preliminary (Definition)
~Asynchronous Checkpoint / Recovery~
Definition
CkPti : the checkpoint (stable log) that process i rolls back to when a failure
occurs
RCVDi←j (CkPti / e ) :
the number of messages received by processor i from processor j, per
the information stored in the checkpoint CkPti or event e.
SENTi→j(CkPti / e ) :
the number of messages sent by processor i to processor j, per the
information stored in the checkpoint CkPti or event e
Recovery Algorithm
~Asynchronous Checkpoint / Recovery~
Algorithm
1. When a process crashes, it recovers to its latest checkpoint CkPt
2. It broadcasts a message that it has failed; on receiving this message, the
other processes roll back to their latest logged event
3. Each process sends SENT(CkPt) to its neighboring processes
4. Each process waits for SENT(CkPt) messages from every neighbor
5. On receiving SENTj→i(CkPtj) from j, if process i notices that RCVDi←j(CkPti) >
SENTj→i(CkPtj), it rolls back to the latest event e such that RCVDi←j(e) =
SENTj→i(CkPtj)
6. Repeat steps 3, 4, and 5 N times (N is the number of processes)
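A sketch of the rollback test in step 5 (the per-event RCVD counters and the function name below are illustrative assumptions about how the event log might be represented):

```python
# Sketch of the rollback test used in asynchronous recovery: roll back to the
# latest logged event whose receive count does not exceed what the neighbour sent.
def find_rollback_event(events, neighbour, sent_by_neighbour):
    """
    events: logged events of process i, oldest first; each event maps a
            neighbour id j to RCVD_{i<-j}(e) at that point in the execution.
    Returns the index of the latest event e with RCVD_{i<-j}(e) <= SENT_{j->i},
    i.e. the point to roll back to if i has received more than j ever sent.
    """
    if events[-1][neighbour] <= sent_by_neighbour:
        return len(events) - 1                 # no orphan messages from this neighbour
    for idx in range(len(events) - 1, -1, -1):
        if events[idx][neighbour] <= sent_by_neighbour:
            return idx                         # roll back to this event
    return 0                                   # fall back to the earliest logged event

# Example: after event 1, i had received 1 message from j; after event 3 it had
# received 3. If j reports SENT = 1, i rolls back to event 1.
events = [{"j": 0}, {"j": 1}, {"j": 2}, {"j": 3}]
print(find_rollback_event(events, "j", 1))     # -> 1
```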
Asynchronous Recovery
[Figure: processes X, Y, and Z with checkpoints x1, y1, z1 and logged events Ex0–Ex3, Ey0–Ey3, Ez0–Ez2; after Y fails, the processes exchange (sender, SENT) messages such as (X,2), (Y,2), (Z,0) and compare them against their RCVD counts (e.g. 2 <= 2, 1 <= 2) to decide how far to roll back]
Approaches to fault-tolerance
❑ Approaches:
❑ (a) Mask failures
❑ (b) Well defined failure behavior
❑ (a) Mask failures:
❑ System continues to provide its specified function(s) in the presence of failures
❑ Example: voting protocols
❑ (b) Well-defined failure behavior:
❑ System exhibits a well-defined behavior in the presence of failures
❑ It may or it may not perform its specified function(s), but facilitates actions
suitable for fault recovery
❑ Example: commit protocols
❑ Changes made by a transaction to a database are made visible only if the transaction is successful and commits
❑ If it fails, transaction is undone
❑ Redundancy:
❑ Method for achieving fault tolerance (multiple copies of hardware, processes,
data, etc...)
Issues
❑ Process Deaths:
❑ All resources allocated to a process must be recovered when a process
dies
❑ Kernel and remaining processes can notify other cooperating processes
❑ Client-server systems: client (server) process needs to be informed that
the corresponding server (client) process died
❑ Machine failure:
❑ All processes running on that machine will die
❑ Client-server systems: difficult to distinguish between a process and
machine failure
❑ Issue: detection of the failure by processes on other machines
❑ Network Failure:
❑ Network may be partitioned into subnets
❑ Machines from different subnets cannot communicate
❑ Difficult for a process to distinguish between a machine and a
communication link failure
Atomic actions
The two-phase commit protocol:

Coordinator
  Initialization:
    Send start-transaction message to all cohorts
  Phase 1:
    Send commit-request message, requesting all cohorts to commit
    Wait for replies from the cohorts
  Phase 2:
    If all cohorts sent agreed and the coordinator agrees,
      then write a commit record into the log and send a commit message to the cohorts
      else send an abort message to the cohorts
    Wait for acknowledgments from the cohorts
    If an acknowledgment from a cohort is not received within the specified period,
      resend the commit/abort message to that cohort
    If all acknowledgments are received, write a complete record to the log

Cohorts
  Phase 1:
    If the transaction at the cohort is successful,
      then write undo and redo logs on stable storage and return an agreed message
      else return an abort message
  Phase 2:
    If commit is received, release all resources and locks held for the transaction
      and send an acknowledgment
    If abort is received, undo the transaction using the undo log record, release
      resources and locks, and send an acknowledgment
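The exchange above can be sketched in a few lines of Python (an illustrative toy: the Cohort class, the vote strings, and the omission of stable-storage logging, timeouts, and retransmission are all assumptions):

```python
# Compact sketch of the two-phase commit exchange summarised above.
class Cohort:
    def __init__(self, will_succeed=True):
        self.will_succeed = will_succeed
        self.committed = None

    def vote(self):
        # Phase 1: write undo/redo log (omitted) and vote "agreed" or "abort".
        return "agreed" if self.will_succeed else "abort"

    def decide(self, decision):
        # Phase 2: act on the coordinator's decision and acknowledge.
        self.committed = (decision == "commit")
        return "ack"

def coordinator(cohorts):
    # Phase 1: collect votes from all cohorts.
    votes = [c.vote() for c in cohorts]
    # Phase 2: commit only if every cohort agreed; otherwise abort.
    decision = "commit" if all(v == "agreed" for v in votes) else "abort"
    acks = [c.decide(decision) for c in cohorts]
    assert all(a == "ack" for a in acks)   # coordinator then writes its complete record
    return decision

print(coordinator([Cohort(), Cohort()]))          # -> commit
print(coordinator([Cohort(), Cohort(False)]))     # -> abort
```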
NonBlocking Commit Protocols
❑Our Blocking Theorem from last week states that if
network partitioning is possible, then any distributed
commit protocol may block.
❑Let’s assume now that the network cannot partition.
❑Then we can consult other processes to make
progress.
❑However, if all processes fail, then we are, again,
blocked.
❑Let’s further assume that total failure is not possible,
i.e. not all processes are crashed at the same time.
Automata representation
❑We model the participants with finite state automata
(FSA).
❑The participants move from one state to another as a
result of receiving one or several messages or as a
result of a timeout event.
❑Having received these messages, a participant may
send some messages before executing the state
transition.
Commit Protocol Automata
❑ Final states are divided into Abort states and Commit states
(finally, either Abort or Commit takes place).
❑ Once an Abort state is reached, it is not possible to do a
transition to a non-Abort state. (Abort is irreversible). Similarly
for Commit states (Commit is also irreversible).
❑ The state diagram is acyclic.
❑ We denote the initial state by q, the terminal states are a (an
abort/rollback state) and c (a commit state). Often there is a
wait-state, which we denote by w.
❑ Assume the participants are P1,…,Pn; the coordinator, if one exists when the
protocol starts, is P0.
2PC Coordinator
[Automaton: from the wait state w, on a timeout or a No from one of P1,…,Pn the coordinator sends Abort to P1,…,Pn and moves to a; on Yes from all of P1,…,Pn it sends Commit to P1,…,Pn and moves to c]
2PC Participant
[Automaton: from the wait state w, a participant moves to a on Abort from P0 and to c on Commit from P0]
Commit Protocol State Transitions
❑ In a commit protocol, the idea is to inform other participants
on local progress.
❑ In fact, a state transition in which no message is sent is uninteresting, unless
the participant moves into a terminal state.
❑ Therefore, unless a participant moves into a terminal state,
we may assume that it sends messages to other participants
about its change of state.
❑ To simplify our analysis, we may assume that the messages are sent to all other
participants. This is not strictly necessary, but assuming otherwise only creates
unnecessary complication.
Concurrency set
❑ A concurrency set of a state s is the set of
possible states among all participants, if
some participant is in state s.
❑ In other words, the concurrency set of state s
is the set of all states that can co-exist with
state s.
2PC Concurrency Sets
[Figure: concurrency sets of the 2PC coordinator and participant automata over the states q, w, a, and c]

3PC Coordinator
[Automaton: from the wait state w, on a timeout or a No from one of P1,…,Pn the coordinator sends Abort to P1,…,Pn and moves to a; on Yes from all of P1,…,Pn it sends Prepare to P1,…,Pn and moves to the prepared state p; on Ack from all it sends Commit and moves to c]
3PC Participant
[Automaton: from the wait state w, a participant moves to a on Abort from P0; on Prepare from P0 it sends Ack to P0 and moves to the prepared state p; from p it moves to c on Commit from P0]
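A transition-table sketch of this participant automaton in Python (the (state, message) encoding and the message strings are assumptions; the q-to-w transition on VoteReq is taken from the fuller automaton sketched under the concurrency-set figure below):

```python
# 3PC participant as a transition table.
# States: q (initial), w (wait), p (prepared), a (abort), c (commit).
# Keys: (state, message from coordinator); values: (message sent, next state).
PARTICIPANT_3PC = {
    ("q", "vote-req"): ("yes", "w"),   # a participant could also answer "no" and move to a
    ("w", "abort"):    (None,  "a"),
    ("w", "prepare"):  ("ack", "p"),
    ("p", "commit"):   (None,  "c"),
}

def step(state, msg):
    reply, nxt = PARTICIPANT_3PC[(state, msg)]
    return reply, nxt

state = "q"
for msg in ["vote-req", "prepare", "commit"]:
    reply, state = step(state, msg)
    print(msg, "->", state, "(sent:", reply, ")")
# vote-req -> w (sent: yes), prepare -> p (sent: ack), commit -> c (sent: None)
```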
3PC Concurrency sets (cs)
[Figure: the full 3PC coordinator and participant automata annotated with their concurrency sets — the coordinator sends VoteReq to all, collects Yes/No votes, sends Prepare or Abort, waits for Acks, and finally sends Commit to all; the participant answers the VoteReq with Yes or No, acknowledges the Prepare, and commits on Commit from P0]
cs(p) = {w, p, c}
cs(w) = {q, a, w, p}
etc.
3PC and failures
❑If there are no failures, then clearly 3PC is correct.
❑In the presence of failures, the operational
participants should be able to terminate their
execution.
❑In the centralised case, the need for a termination
protocol implies that the coordinator is no longer
operational.
❑We discuss a general termination protocol. It makes
the assumption that at least one participant remains
operational and that the participants obey the
Fundamental Non-Blocking Theorem.
Termination
❑Basic idea: Choose a backup coordinator B – vote or
use some preassigned ids.
❑Backup Coordinator Decision Rule:
If B’s state contains a commit state in its concurrency
set, commit the transaction; else abort the
transaction.
❑Reasoning behind the rule: If B’s state contains
commit in the concurrency set, then it is possible
that some site has performed commit – otherwise
not.
Re-executing termination
❑ It is, of course, possible that the backup
coordinator fails.
❑ For this reason, the termination protocol
should be executed in such a way that it can
be re-executed.
❑ In particular, the termination protocol must
not break the one-step synchronisation.
Implementing termination
❑To keep one-step synchronisation, the termination
protocol should be executed in two steps:
❑1. The backup coordinator B tells the others to make
a transition to B’s state. Others answer Ok. (This is
not necessary if B is in Commit or Abort state.)
2. B tells the others to commit or abort by the
decision rule.
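A minimal sketch of the decision rule in Python (the mapping from states to concurrency sets is assumed to be given by the protocol definition; the cs values used here are the ones quoted earlier for 3PC):

```python
# Backup coordinator's decision rule: commit iff the commit state c appears in
# the concurrency set of B's current state, i.e. some site may already have committed.
def backup_decision(backup_state, concurrency_set):
    return "commit" if "c" in concurrency_set[backup_state] else "abort"

# cs(w) and cs(p) as given on the 3PC concurrency-set slide.
cs = {"w": {"q", "a", "w", "p"}, "p": {"w", "p", "c"}}
print(backup_decision("w", cs))   # -> abort  (no site can be in c yet)
print(backup_decision("p", cs))   # -> commit (some site may already be in c)
```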
Fundamental Non-Blocking Theorem
Proof - Sufficiency
❑ The basic termination procedure and decision
rule is valid for any protocol that fulfills the
conditions given in the Fundamental Non-
Blocking Theorem.
❑ The existence of a termination protocol
completes the proof.
Voting protocols
❑ Principles:
❑ Data replicated at several sites to increase reliability
❑ Each replica assigned a number of votes
❑ To access a replica, a process must collect a majority of votes
❑ Vote mechanism:
❑ (1) Static voting:
❑ Each replica has number of votes (in stable storage)
❑ A process can access a replica for a read or write operation if it can
collect a certain number of votes (read or write quorum)
❑ (2) Dynamic voting
❑ Number of votes or the set of sites that form a quorum change with
the state of system (due to site and communication failures)
❑ (2.1) Majority based approach:
❑ The set of sites that can form a majority to allow access to replicated data
changes with the changing state of the system
❑ (2.2) Dynamic vote reassignment:
❑ Number of votes assigned to a site changes dynamically
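As a small illustration of the static scheme, a sketch of a quorum check (the vote assignment and the quorum sizes below are assumptions chosen for the example; the comment states the usual weighted-voting constraints):

```python
# Static voting sketch: an operation proceeds only if the reachable replicas
# together hold enough votes. Usual constraints: r + w > total votes and
# 2w > total votes, so read/write and write/write quorums always intersect.
votes = {"site_a": 2, "site_b": 1, "site_c": 1}   # votes per replica (assumed)
TOTAL = sum(votes.values())                        # 4
READ_QUORUM, WRITE_QUORUM = 2, 3                   # r + w = 5 > 4 and 2w = 6 > 4

def can_access(reachable_sites, quorum):
    """True iff the reachable replicas hold at least `quorum` votes."""
    return sum(votes[s] for s in reachable_sites) >= quorum

print(can_access({"site_a"}, READ_QUORUM))             # True  (2 >= 2)
print(can_access({"site_b", "site_c"}, WRITE_QUORUM))  # False (2 <  3)
```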
Failure resilient processes
❑ Resilient process: continues execution in the presence of failures
with minimum disruption to the service provided (masks failures)
❑ Approaches for implementing resilient processes:
❑ Backup processes and
❑ Replicated execution
❑ (1) Backup processes
❑ Each process made of a primary process and one or more backup
processes
❑ The primary process executes, while the backup processes are inactive
❑ If the primary process fails, a backup process takes over
❑ The primary process establishes checkpoints so that a backup process can
restart from them
❑ (2) Replicated execution
❑ Several processes execute same program concurrently
❑ Majority consensus (voting) of their results
❑ Increases both the reliability and availability of the process
Recovery (fault tolerant) block concept
❑ A recovery block consists of:
❑ A primary block that is normally executed,
❑ Zero or more alternates (providing the same function as the primary block,
but using a different algorithm), and
❑ An acceptance test that checks whether the result produced by a block is acceptable
Recovery (fault tolerant) Block concept
[Figure: Recovery Block A consists of an acceptance test AT, a primary block AP (<program text>), and an alternate block AQ (<program text>); the primary block runs first, its result is checked by the acceptance test, and an alternate block is executed if the test fails]
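A minimal sketch of this control structure (the block and test names are illustrative; restoring a checkpoint of the state before retrying an alternate is omitted for brevity):

```python
# Recovery block sketch: run the primary block, check it with the acceptance
# test, and fall back to the alternates in order if the test fails.
def recovery_block(acceptance_test, primary, alternates):
    for block in [primary] + list(alternates):
        try:
            result = block()
        except Exception:
            continue                      # a crashing block counts as not accepted
        if acceptance_test(result):
            return result                 # first result that passes the acceptance test
    raise RuntimeError("recovery block failed: no alternate passed the acceptance test")

# Example: the primary "algorithm" produces a bad value, the alternate a good one.
primary   = lambda: -1                    # e.g. a fast but faulty algorithm
alternate = lambda: 7                     # e.g. a simpler, slower algorithm
passes    = lambda r: r >= 0              # acceptance test AT
print(recovery_block(passes, primary, [alternate]))   # -> 7
```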
N-version programming
[Figure: modules ‘0’ through ‘n-1’, developed independently to the same specification, execute and feed their results to a voter that selects the majority result]
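A minimal sketch of the voter (the module implementations here are trivial stand-ins introduced only for the example):

```python
# N-version programming sketch: independently developed versions compute the
# same result and a voter returns the value produced by a strict majority.
from collections import Counter

def vote(results):
    """Return the majority result, or None if no value has a strict majority."""
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) // 2 else None

modules = [lambda x: x * x, lambda x: x * x, lambda x: x + x]  # one faulty version
print(vote([m(3) for m in modules]))   # -> 9 (two of the three versions agree)
```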
Thank You