Consensus Failure
The Players
Choose from a large set of interchangeable terms:
Processes, threads, tasks,
Processors, nodes, servers, clients,
Actors, agents, participants, partners, cohorts
I prefer to use the term node or actor
Short and sweet
A logical/virtual entity: may be multiple logical nodes per physical machine.
General with regard to role and internal structure
Tend to use actor if self-interest is an issue
Properties of nodes/actors
Essential properties typically assumed by the model:
Private state
Distributed memory: model sharing as messages
Executes a sequence of state transitions
Some transitions are reactions to messages
May have internal concurrency, but hide that
Deterministic vs. nondeterministic
Unique identity vs. anonymous nodes
Local clocks with arbitrary drift vs. global time strobe (e.g., GPS satellites)
Node recovery
Fail-stopped nodes may revive/restart.
Retain identity
Lose messages sent to them while failed
Arbitrary time to restart, or maybe never
A restarted node may recover the state it had at the time of failure.
It loses state held in volatile (primary) memory.
It restores state saved in non-volatile (secondary) memory.
Writes to non-volatile memory are expensive.
Design problem: recover complete states reliably, with minimal write cost.
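To make the design problem concrete, here is a minimal sketch of logging-based recovery in Python. NonVolatileLog and Node are illustrative names, not a real library: each update forces one log record to stable storage, and recovery replays the log to rebuild the lost volatile state.

```python
# Sketch: write-ahead logging to recover state after a fail-stop crash.
# Real systems batch or checkpoint to reduce the forced writes.

class NonVolatileLog:
    """Append-only log; append() forces the record to stable storage."""
    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)   # imagine an fsync() here: the costly write

class Node:
    def __init__(self, log):
        self.state = {}               # volatile (primary) memory
        self.log = log                # non-volatile (secondary) memory

    def update(self, key, value):
        # Log the intent first (one non-volatile write), then apply in memory.
        self.log.append((key, value))
        self.state[key] = value

    def recover(self):
        # After a crash, volatile state is lost; replay the log to rebuild it.
        self.state = {}
        for key, value in self.log.records:
            self.state[key] = value
```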
Messages
Processes communicate by sending messages.
Unicast typically assumed
Build multicast/broadcast on top
Use unique process identity as destination.
Optional: cryptography
(optional) Sender is authenticated.
(optional) Message integrity is assured.
E.g., using digital signatures or Message Authentication Codes.
Messaging properties
Other possible properties of the messaging model:
Messages may be lost.
Messages may be delivered out of order.
Messages may be duplicated.
Do we need to consider these in our distributed system model?
Or can we solve them within the asynchronous model, without affecting its foundational properties?
E.g., with a reliable transport protocol such as TCP
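Such a protocol masks all three properties with sequence numbers, acknowledgments, and retransmission. A minimal receiver-side sketch in Python (the send_ack and deliver callbacks are assumed; the sender retransmits any message not acknowledged in time):

```python
# Sketch: masking loss, duplication, and reordering with sequence
# numbers and acks, in the style of a reliable transport such as TCP.

class ReliableReceiver:
    def __init__(self):
        self.expected = 0          # next in-order sequence number
        self.buffer = {}           # out-of-order messages, keyed by seq

    def on_message(self, seq, payload, send_ack, deliver):
        send_ack(seq)              # ack even duplicates: our ack may have been lost
        if seq < self.expected:
            return                 # duplicate: already delivered
        self.buffer[seq] = payload # hold out-of-order messages
        while self.expected in self.buffer:
            deliver(self.buffer.pop(self.expected))
            self.expected += 1
```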
The network
Picture a cloud with open unicast and unbounded capacity/bandwidth.
Squint and call it the Internet.
Alternatively, the network could be a graph:
Graph models a particular interconnect structure.
Examples: star, ring, hypercube, etc.
Nodes must forward/route messages.
Issues: cut-through, buffer scheduling, etc.
Bounded links, blocking send: may deadlock.
For that, take CPS 221 (Parallel Architectures)
Standard assumptions
For this class, we make reasonable assumptions for general Internet systems:
Nodes with local state and (mostly) local clocks
Asynchronous model: unbounded delay but no loss
Fail-stop or Byzantine
Node identity with (optional) authentication
Allows message integrity
No communication-induced deadlock.
Can deadlock occur? How to avoid it?
Temporary network interruptions are possible.
Including partitions
Coordination
If the solution to availability and scalability is to decentralize and replicate functions and data, how do we coordinate the nodes?
data consistency
update propagation
mutual exclusion
consistent global states
group membership
group communication
event ordering
distributed consensus
quorum consensus
Consensus
[Figure: processes P1, P2, P3 run a consensus algorithm over unreliable multicast. Step 1 (Propose): each Pi proposes a value vi. Step 2 (Decide): each Pi decides a value di. Generalizes to N nodes/processes. In the leader variant, a leader proposes vleader and each subordinate or lieutenant decides di = vleader.]
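To pin down what the figure asks of a consensus algorithm: agreement (all di are equal), validity (the decided value is some proposed vi), and termination (every correct process eventually decides). A tiny Python sketch checks the two safety conditions over a finished run; the dictionaries are hypothetical example data.

```python
# Sketch: safety conditions of consensus, checked over a finished run.
# proposals[i] is process Pi's input vi; decisions[i] is its output di.

def check_consensus(proposals, decisions):
    decided = set(decisions.values())
    assert len(decided) == 1, "agreement: all processes decide the same d"
    assert decided <= set(proposals.values()), "validity: d was proposed"

check_consensus(proposals={"P1": 7, "P2": 3, "P3": 7},
                decisions={"P1": 7, "P2": 7, "P3": 7})
```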
Fischer-Lynch-Paterson (1985)
No consensus can be guaranteed in an asynchronous communication system in the presence of any failures.
Intuition: a failed process may just be slow, and can rise from the dead at exactly the wrong time.
Consensus may occur recognizably, rarely or often.
e.g., if no messages are inconveniently delayed
FLP implies that no agreement can be guaranteed in an asynchronous system with Byzantine failures either. (More on that later.)
Consensus in Practice I
What do these results mean in an asynchronous world?
Unfortunately, the Internet is asynchronous, even if we believe that all faults are eventually repaired.
Synchronized clocks and predictable execution times don't change this essential fact.
Even a single faulty process can prevent consensus.
The FLP impossibility result extends to:
Reliable ordered multicast communication in groups
Transaction commit for coordinated atomic updates
Consistent replication
These are practical necessities, so what are we to do?
Consensus in Practice II
We can use some tricks to apply synchronous algorithms:
Fault masking: assume that failed processes always recover, and reintegrate them into the group.
If you haven't heard from a process, wait longer.
A round terminates when every expected message is received.
Failure detectors: construct a failure detector that can determine if a process has failed.
A round terminates when every expected message is received, or the failure detector reports that its sender has failed.
But: protocols may block in pathological scenarios, and they may misbehave if a failure detector is wrong.
Failure Detectors
How to detect that a member has failed?
pings, timeouts, beacons, heartbeats
recovery notifications
"I was gone for a while, but now I'm back."
Is the failure detector accurate?
Is the failure detector live (complete)?
In an asynchronous system, it is possible for a failure detector to be accurate or live, but not both.
FLP tells us that it is impossible for an asynchronous system to agree on anything with both accuracy and liveness!
[Figure: a network partition]
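Concretely, the standard heartbeat detector looks like this Python sketch. The timeout parameter embodies the tradeoff above: a short timeout gives quick (live) but inaccurate detection, a long one gives accurate but slow detection, and no setting gives both in an asynchronous network; a partition makes every node on the far side look failed.

```python
# Sketch: a simple timeout-based failure detector.

import time

class HeartbeatDetector:
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}      # member -> time of last heartbeat

    def on_heartbeat(self, member):
        self.last_seen[member] = time.monotonic()

    def suspects(self):
        # Members silent for longer than the timeout are suspected failed.
        now = time.monotonic()
        return {m for m, t in self.last_seen.items()
                if now - t > self.timeout}
```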
Commit Protocols
Two phase commit (2PC)
Widely taught and used
Might block forever if the coordinator (TM) fails or disconnects.
3PC: Add another phase
Reduces the window of vulnerability
Paxos commit: works whenever it can (nonblocking)
Lamport and Gray, based on Paxos consensus
If the TM fails, another steps forward to take over and restart the protocol.
2PC
[Figure: 2PC message flow between TM/C (coordinator) and RM/P (participant). Steps: precommit or prepare ("commit or abort?"), vote ("here's my vote"), decide, notify ("commit/abort!"). The TM logs commit/abort; that log write is the commit point. If the votes are unanimous to commit, decide to commit; else decide to abort.]
2PC: Phase 1
[Figure: phase 1 message flow (1. Tx ..., 3. P ...)]
2PC: Phase 2
4. Coordinator (TM) commits.
Iff all P votes are unanimous to commit, C writes a commit record to its log: Tx is committed.
Else abort.
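A sketch of the TM side of 2PC in Python, assuming hypothetical send/recv messaging primitives and a forced log. Note where it blocks: the TM waits for every vote, and participants that voted yes must wait for the decision, which is why a TM failure can block the protocol.

```python
# Sketch: the coordinator (TM) side of two-phase commit.

def two_phase_commit(participants, send, recv, log):
    # Phase 1: ask every participant (RM) to prepare and vote.
    for p in participants:
        send(p, "prepare")
    votes = [recv(p) for p in participants]   # blocks: 2PC's weakness

    # Phase 2: commit iff the vote is unanimous; log first (commit point).
    decision = "commit" if all(v == "yes" for v in votes) else "abort"
    log.append(decision)                      # forced to non-volatile storage
    for p in participants:
        send(p, decision)
    return decision
```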
Notes on 2PC
Any RM (participant) can enter the prepared state at any time. The TM's prepare message can be viewed as an optional suggestion that now would be a good time to do so.
Other events, including real-time deadlines, might cause working RMs to prepare. This observation is the basis for variants of the 2PC protocol that use fewer messages. [Lamport and Gray]
3N-1 messages, some of which may be local.
Non-blocking commit: failure of a single process does not prevent other processes from deciding if the transaction is committed or aborted.
E.g., 3PC.
3PC
General Asynchronous Consensus: Paxos
Paxos: Properties
Paxos is an asynchronous consensus algorithm.
The FLP result says no asynchronous consensus algorithm can guarantee both safety and liveness.
Paxos is guaranteed safe.
Consensus is a stable property: once reached, it is never violated; the agreed value is not changed.
Paxos is not guaranteed live.
Consensus is reached if "a large enough subnetwork...is nonfaulty for a long enough time."
Otherwise Paxos might never terminate.
Leader/proposer/coordinator
Presents a consensus value to the acceptors and counts the ballots for acceptance by a majority.
Notifies the agents of success.
Note: any node/replica may serve either/both roles.
[Figure: leader L proposes "v?" to acceptors N, collects "OK" responses, and announces "v!".]
Paxos in Practice
Lampson: "Since general consensus is expensive, practical systems reserve it for emergencies."
e.g., to select a primary such as a lock server.
Frangipani
Google Chubby service (Paxos Made Live)
Pick a primary with Paxos. Do it rarely; do it right.
Primary holds a master lease with a timeout.
Renewable by consensus with primary as leader.
Primary is king as long as it holds the lease.
Master lease expires? Fall back to Paxos.
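A sketch, in Python, of a master lease in the Chubby style. The class and method names are illustrative: grant() stands in for the rare Paxos election, renew() for the consensus renewal with the primary as leader, and an expired lease sends the system back to Paxos.

```python
# Sketch: a renewable master lease with a timeout.

import time

class MasterLease:
    def __init__(self, duration):
        self.duration = duration
        self.holder = None
        self.expires = 0.0

    def grant(self, node):                 # decided by a (rare) Paxos round
        self.holder = node
        self.expires = time.monotonic() + self.duration

    def renew(self, node):                 # consensus with primary as leader
        if node == self.holder and time.monotonic() < self.expires:
            self.expires = time.monotonic() + self.duration
            return True
        return False                       # lease expired: fall back to Paxos

    def is_master(self, node):
        # Primary is king as long as it holds the lease.
        return node == self.holder and time.monotonic() < self.expires
```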
[Figure: a Paxos round between leader L and acceptors N: 1a prepare, 1b promise ("OK"), 2a accept ("v?"), 2b ack ("OK"), then commit ("v!").]
The Agents
Proposal for a new ballot (1a) by a would-be leader?
If this ballot ID is higher than any you have seen so far, then accept it as the current ballot.
Log the ballot ID in persistent memory.
Respond with the highest previous ballot ID, etc. (1b)
Commanded (2a) to accept a value for current ballot?
Accept value and log it in persistent memory.
Discard any previously accepted value.
Respond (2b) with accept, or deny (if ballot is old).
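The agent's rules translate almost line for line into code. A Python sketch (message formats are assumptions; in a real system, promised and accepted must be forced to persistent memory before each reply):

```python
# Sketch: a Paxos agent (acceptor). Ballot IDs are totally ordered.

class Acceptor:
    def __init__(self):
        self.promised = -1       # highest ballot ID seen (1a)
        self.accepted = None     # (ballot, value) last accepted (2a)

    def on_prepare(self, ballot):                 # phase 1a -> 1b
        if ballot > self.promised:
            self.promised = ballot                # log before replying
            return ("promise", self.accepted)     # report prior accept, if any
        return ("deny", self.promised)

    def on_accept(self, ballot, value):           # phase 2a -> 2b
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)       # discard prior value; log
            return "accept"
        return "deny"                             # ballot is old
```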
[Figure: the same round with a concrete ballot: 1a prepare ("7?"), 1b promise ("OK"), 2a accept, 2b ack, then "7!".]
A Paxos Round
[Figure: one Paxos round between a self-appointed leader L and acceptors N. 1a Propose: "Can I lead ballot b?". 1b Promise: acceptors log and reply "OK, but ...". 2a Accept: "v?"; acceptors log. 2b Ack: accepted. 3 Commit: the value is safe; L sends "v!".]
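The leader's side of the round, as a Python sketch matching the agent code above. send/recv are assumed primitives; for simplicity this version waits for a reply from every acceptor, though a real leader proceeds as soon as it hears from a majority.

```python
# Sketch: one Paxos round from the leader's side.

def run_round(ballot, my_value, acceptors, send, recv):
    # 1a Propose: "Can I lead ballot b?"
    for a in acceptors:
        send(a, ("prepare", ballot))
    replies = [recv(a) for a in acceptors]      # 1b Promise: "OK, but ..."
    promises = [r for r in replies if r[0] == "promise"]
    if len(promises) <= len(acceptors) // 2:
        return None                             # no majority: round fails

    # Safety rule: if any acceptor already accepted a value, adopt the
    # one from the highest-numbered ballot; otherwise propose our own.
    prior = [r[1] for r in promises if r[1] is not None]  # (ballot, value)
    value = max(prior)[1] if prior else my_value

    # 2a Accept: "v?"
    for a in acceptors:
        send(a, ("accept", ballot, value))
    acks = [recv(a) for a in acceptors]         # 2b Ack
    if sum(1 for x in acks if x == "accept") > len(acceptors) // 2:
        return value                            # 3 Commit: value is safe
    return None
```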
Safety: Outline
Key invariant: If some round succeeds, then any subsequent round chooses the same value, or it fails.
To see why, consider the leader L of a round R.
If a previous round S succeeded with value v, then either L learns of (S, v), or else R fails.
Why? S got responses from a majority: if R does too, then some agent responds to both.
If L does learn of (S, v), then by the rules of Paxos L chooses v as a suitable value for R.
(Unless there was an intervening successful round.)
More on Safety
All agents that accept a value for some round S accept the same value v for S.
They can only accept the one value proposed by the single leader for that unique round S.
And if an agent accepts a value (after 2a), it reports that value to the leader of a successor round (in 1b).
Therefore, if R is the next round to succeed, then the leader of R learns of (S, v), and picks v for R.
Success requires a majority, and majority sets are guaranteed to intersect.
Induction extends this to all future successful rounds.
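The "majority sets intersect" step is just counting: |Q1| + |Q2| > N forces a common member. A small Python check over a five-node example:

```python
# Sketch: any two majority subsets of the same acceptor set intersect.

from itertools import combinations

def majorities(nodes):
    k = len(nodes) // 2 + 1   # minimum majority size
    return [set(c) for r in range(k, len(nodes) + 1)
            for c in combinations(nodes, r)]

nodes = {"a", "b", "c", "d", "e"}
assert all(q1 & q2                       # nonempty intersection
           for q1 in majorities(nodes)
           for q2 in majorities(nodes))
```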
Paxos: Summary
Non-blocking asynchronous consensus protocol.
Safe, and live if not too many failures/conflicts.
Paxos is at the heart of many distributed and networked systems.
Often used as a basis for electing the primary in primary-based systems (i.e., token passing).
Related to 2PC, but robust to leader failure if some leader lasts long enough to complete the protocol.
The cost of this robustness is related to the rate of failures and competition among leaders.
Byzantine Consensus
[Figure (Lamport82): two three-process scenarios. Notation: "3:1:u" means "3 says 1 says u". Scenario 1: commander p1 sends 1:v to p2 and p3; p2 relays 2:1:v to p3, but faulty p3 relays 3:1:u to p2. Scenario 2: faulty commander p1 sends 1:w to p2 and 1:x to p3; p2 relays 2:1:w and p3 relays 3:1:x.]
Intuition: subordinates cannot distinguish these cases. Each must select the commander's value in the first case, but this means they cannot agree in the second case.
[Figure: two four-process scenarios (faulty processes shown shaded). Scenario 1: loyal commander p1 sends 1:v to p2, p3, and p4; faulty p3 relays conflicting values (3:1:w, 3:1:u), but p2 and p4 each still see a majority for v (their own 1:v plus 2:1:v or 4:1:v). Scenario 2: faulty commander p1 sends inconsistent values 1:v, 1:u, 1:w; the loyal subordinates relay faithfully, so each computes the same vote over the same set of values.]
Intuition: vote.
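A sketch of the subordinates' decision rule in the four-process case (Lamport's oral-messages algorithm OM(1)), in Python. The default value RETREAT, used when no majority emerges, follows Lamport's paper; the message plumbing is elided.

```python
# Sketch: a subordinate's vote in OM(1) with four processes.

from collections import Counter

RETREAT = "retreat"   # Lamport's default when no majority emerges

def om1_decide(from_commander, relayed):
    """Decide from the commander's direct value plus the values
    relayed by the other subordinates ("j says 1 says w")."""
    votes = [from_commander] + list(relayed)
    value, count = Counter(votes).most_common(1)[0]
    return value if count * 2 > len(votes) else RETREAT

# Loyal commander sends v; faulty p3 relays a lie: majority is still v.
assert om1_decide("v", ["v", "u"]) == "v"
# Faulty commander sends three different values: every loyal
# subordinate sees {v, u, w} and agrees on the default.
assert om1_decide("v", ["u", "w"]) == RETREAT
```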
Practical BFT
ABCAST
Weak Synchrony
Castro and Liskov: Consensus protocols can be "live if delay(t) does not grow faster than t indefinitely".
This is a "weak synchrony assumption" that is "likely to be true in any real system provided that network faults are eventually repaired, yet it enables us to circumvent the impossibility result" in FLP.
Fox & Brewer CAP Theorem: C-A-P: choose two.
[Figure: the three properties: C (consistency), A (availability), P (partition-resilience).]
CAP Examples
CP: Paxos, or any consensus algorithm, or state machine replication with a quorum required for service.
Always consistent, even in a partition. But might not be available, even without a partition.
AP: Bayou
Always available if any replica is up and reachable, even in a partition. But might not be consistent, even without a partition.
CA: consistent replication (e.g., state machine with CATOCS) with service from any replica.
What happens in a partition?
CAP: CA
A CA system is consistent and available, but may become inconsistent in a partition.
Basic state machine replication with service from any replica.
Coda read-one-write-all-available replication.
These are always consistent in the absence of a partition.
But they could provide service at two or more isolated/conflicting replicas in a partition (split brain).
To preserve consistency in a partition requires some mechanism like quorum voting to avoid a split brain.
That makes the system CP: it must deny service when it does not have a quorum, even if there is no partition.