12: Distributed 2
Operating Systems
Brad Campbell – [email protected]
https://www.cs.virginia.edu/~bjc8c/class/cs6456-f19/
1
Our Expectation with Data
• Consider a single process using a filesystem
• What do you expect to read?
P1: x.write(2)  x.read() → ?
• Not-so-clear-cut (parallel)
• Case 1, Case 2, Case 3: [timeline diagrams of overlapping operations omitted]
8
Linearizability Subtleties
• With a single process and a single copy, can
overlaps happen?
• No, these are cases that do not arise with a single
process and a single copy.
• “Most recent write” becomes unclear when there are
overlapping operations.
• Thus, we (as system designers) have the freedom to
impose an order.
• As long as it appears to all clients that there is a
single, interleaved ordering for all (overlapping and
non-overlapping) operations that your implementation
uses to process all requests, it’s fine.
• I.e., this ordering should still provide the single-client,
single-copy semantics.
• Again, it’s all about how clients perceive the behavior
of your system.
9
Linearizability Subtleties
• Definite guarantee
• Case 2, Case 3 [timeline diagrams omitted]
10
Linearizability (Textbook Definition)
• Let the sequence of read and update operations that
client i performs in some execution be o_i1, o_i2, …
• "Program order" for the client
• A replicated shared object service is linearizable if for
any execution (real), there is some interleaving of
operations (virtual) issued by all clients that:
• meets the specification of a single correct copy of objects
• is consistent with the actual times at which each operation
occurred during the execution
• Main goal: any client will see (at any point of time) a
copy of the object that is correct and consistent
• The strongest form of consistency
15
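To make the textbook definition concrete, here is a minimal brute-force checker, a sketch only (the operation record format, field names, and the is_linearizable helper are illustrative assumptions, not part of the lecture): it searches for a virtual interleaving of completed operations that respects real-time order and behaves like one correct copy of a single register.

```python
# Minimal sketch (not from the slides): brute-force linearizability check
# for a single register. Each op is a dict with invocation/response times,
# a kind ("read"/"write"), and a value. All names are illustrative.
from itertools import permutations

def is_linearizable(history, initial=None):
    """history: list of ops like
       {"op": "write", "value": 2, "invoke": 0.0, "respond": 1.0}
       {"op": "read",  "value": 2, "invoke": 0.5, "respond": 1.5}
    True if some total order (1) respects real time, i.e. A precedes B
    whenever A responded before B was invoked, and (2) behaves like a
    single correct copy of the register."""
    for order in permutations(history):
        # (1) real-time constraint: never place A after B if A finished
        # before B even started.
        ok = all(not (order[j]["respond"] < order[i]["invoke"])
                 for i in range(len(order)) for j in range(i + 1, len(order)))
        if not ok:
            continue
        # (2) single-copy semantics: replay the order against one register.
        value, consistent = initial, True
        for op in order:
            if op["op"] == "write":
                value = op["value"]
            elif op["value"] != value:
                consistent = False
                break
        if consistent:
            return True
    return False
```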
Implementing Linearizability
• Importance of latency
• Amazon: every 100ms of latency costs them 1% in
sales.
• Google: an extra .5 seconds in search page generation
time dropped traffic by 20%.
• Linearizability typically requires complete
synchronization of multiple copies before a write
operation returns.
• So that any read over any copy can return the most
recent write.
• No room for asynchronous writes (i.e., a write operation
returns before all updates are propagated.)
• It makes less sense in a global setting.
• Inter-datacenter latency: ~10s of ms to ~100s of ms
• It might still make sense in a local setting (e.g., within a single datacenter).
17
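As a hedged illustration of "complete synchronization before a write returns" (the Primary/Replica classes and method names below are assumptions, not the lecture's design): the write is acknowledged only after every copy has applied it, which is exactly where the latency cost above comes from.

```python
# Minimal sketch (illustrative only): a write is acknowledged only after
# every replica has applied it, so any replica can then serve a read of
# the most recent write. Replica/network details are assumptions.
class Replica:
    def __init__(self):
        self.store = {}

    def apply(self, key, value):
        self.store[key] = value
        return "ack"

class Primary:
    def __init__(self, replicas):
        self.replicas = replicas

    def write(self, key, value):
        # Block until all copies are updated; this round trip to every
        # replica is the synchronization cost discussed above.
        acks = [r.apply(key, value) for r in self.replicas]
        assert all(a == "ack" for a in acks)
        return "ok"

    def read(self, key, replica_index=0):
        # Any copy is safe to read once writes are fully synchronous.
        return self.replicas[replica_index].store.get(key)
```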
Relaxing the Guarantees
• Linearizability advantages
• It behaves as expected.
• There’s really no surprise.
• Application developers do not need any additional logic.
• Linearizability disadvantages
• It’s difficult to provide high performance (low latency).
• It might be more than what is necessary.
• Relaxed consistency guarantees
• Sequential consistency
• Causal consistency
• Eventual consistency
• It is still all about client-side perception.
• When a read occurs, what do you return?
18
Sequential Consistency
• A little weaker than linearizability, but still quite strong
• Essentially linearizability, except that it doesn’t need to return the
most recent write according to physical time.
• How can we achieve it?
• Preserving the single-client, (per-process) single-copy semantics
• We give an illusion that there’s a single copy to an isolated process.
• The single-client semantics
• Processing all requests as if they were coming from a single client
(in a single stream of ops).
• Again, this meets our basic expectation: it’s easiest for an app
developer to understand if all requests appear to be processed one at a
time.
• Let’s consider the per-process single-copy semantics with a
few examples.
19
Per-Process Single-Copy Semantics
• But we need to make it work with multiple processes.
• When a storage system preserves each and every process’s program
order, each will think that there’s a single copy.
• Simple example:
P1: x.write(2)  x.write(3)  x.read() → 3
P2: x.write(5)  x.read() → 5
21
Per-Process Single-Copy Examples
• Example 1: Does this work like a single copy at P2?
P1: x.write(5)
P2: x.write(2)  x.write(3)  x.read() → 3  x.read() → 3
• Yes!
• Does this satisfy linearizability?
• Yes
22
Per-Process Single-Copy Examples
• Example 2: Does this work like a single copy at P2?
P1: x.write(5)
P2: x.write(2)  x.write(3)  x.read() → 3  x.read() → 5
• Yes!
• Does this satisfy linearizability?
• No
• It’s just that P1’s write shows up later.
• For P2, it’s as if x.write(5) happened between the last two reads.
• It’s also as if P1’s and P2’s operations were interleaved and
processed in the order the arrow in the diagram shows.
23
Sequential Consistency
• Insight: we don’t need to make other processes’ writes
immediately visible.
• Central question
• Can you explain a storage system’s behavior by coming up with a single
interleaving ordering of all requests, where the program order of each and
every process is preserved?
• Previous example:
P1: x.write(5)
P2: x.write(2)  x.write(3)  x.read() → 3  x.read() → 5
• One interleaving that preserves both program orders and behaves like a
single copy: x.write(2), x.write(3), x.read() → 3, x.write(5), x.read() → 5
• Yes
27
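As a companion to the central question above, here is a minimal sketch (illustrative names and record format; not from the slides) that searches for an interleaving preserving every process's program order while behaving like a single copy of x. Real time is deliberately ignored, which is exactly what separates sequential consistency from linearizability.

```python
# Minimal sketch (not from the slides): is there an interleaving of the
# per-process sequences that preserves program order and acts like a
# single register? Names and record format are illustrative.
def sequentially_consistent(processes, initial=None):
    """processes: list of per-process op lists, e.g.
       [[("write", 5)],
        [("write", 2), ("write", 3), ("read", 3), ("read", 5)]]"""
    def search(positions, value):
        if all(pos == len(seq) for pos, seq in zip(positions, processes)):
            return True
        for i, seq in enumerate(processes):
            if positions[i] == len(seq):
                continue
            kind, v = seq[positions[i]]
            if kind == "read" and v != value:
                continue  # this op cannot go next in a single-copy order
            nxt = list(positions)
            nxt[i] += 1
            if search(tuple(nxt), v if kind == "write" else value):
                return True
        return False
    return search(tuple(0 for _ in processes), initial)

# The example from this slide: P1 = [write 5], P2 = [write 2, write 3,
# read 3, read 5]. A valid interleaving exists, so this prints True.
print(sequentially_consistent([[("write", 5)],
                               [("write", 2), ("write", 3),
                                ("read", 3), ("read", 5)]]))
```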
Two More Consistency Models
• Even more relaxed
• We don’t even care about providing an illusion of
a single copy.
• Causal consistency
• We care about ordering causally related write
operations correctly.
• Eventual consistency
• As long as we can say all replicas converge to
the same copy eventually, we’re fine.
35
Relaxing the Guarantees
• For some applications, different clients (e.g., users)
do not need to see the writes in the same order, but
causality is still important (e.g., Facebook post-like
pairs).
• Causal consistency
• More relaxed than sequential consistency
• Clients can read values out of order, i.e., it doesn’t behave as
a single copy anymore.
• Clients read values in-order for causally-related writes.
• How do we define “causal relations” between two writes?
• (Roughly) Client 0 writes → Client 1 reads → Client 1 writes
• E.g., writing a comment on a post
36
Causal Consistency
• Example 1: [diagram showing one pair of causally related writes and one pair of concurrent writes]
37
Causal Consistency Example 2
• Causally consistent?
• Causally related: P2 reads x=1 (written by P1) before writing x=2
P1: W(x)1
P2: R(x)1  W(x)2
P3: R(x)2  R(x)1
P4: R(x)1  R(x)2
• No!
38
Causal Consistency Example 3
• Causally consistent?
P1: W(x)1
P2: W(x)2
P3: R(x)2  R(x)1
P4: R(x)1  R(x)2
• Yes!
39
Implementing Causal Consistency
• We drop the notion of a single copy.
• Writes can be applied in different orders across
copies.
• Causally-related writes do need to be applied in
the same order for all copies.
• Need a mechanism to keep track of
causally-related writes.
• Due to the relaxed requirements, low
latency is more tractable.
40
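One way to realize "keep track of causally-related writes" is explicit dependency tracking. The sketch below is illustrative only (the class name, the deps field, and the buffering scheme are assumptions, not the lecture's mechanism): each write names the writes its client had already observed, and a replica delays applying a write until those dependencies are applied locally.

```python
# Minimal sketch (illustrative, not the slides' protocol): causal
# consistency via explicit dependency tracking. A replica buffers a write
# until all of the writes it depends on have been applied locally.
class CausalReplica:
    def __init__(self):
        self.store = {}        # key -> value
        self.applied = set()   # ids of writes applied locally
        self.pending = []      # writes waiting on dependencies

    def receive(self, write_id, key, value, deps):
        self.pending.append((write_id, key, value, frozenset(deps)))
        self._drain()

    def _drain(self):
        progress = True
        while progress:
            progress = False
            for w in list(self.pending):
                write_id, key, value, deps = w
                if deps <= self.applied:      # all dependencies visible
                    self.store[key] = value
                    self.applied.add(write_id)
                    self.pending.remove(w)
                    progress = True

# Example: a comment ("w2") depends on the post ("w1"). Even if the
# comment arrives first, it is not applied until the post is.
r = CausalReplica()
r.receive("w2", "comment", "nice post!", deps={"w1"})
print(r.store)                      # {} -- comment is buffered
r.receive("w1", "post", "hello", deps=set())
print(r.store)                      # both applied, in causal order
```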
Relaxing Even Further
• Let’s just do best effort to make things consistent.
• Eventual consistency
• Popularized by the CAP theorem.
• The main problem is network partitions.
[Diagram: a network partition separates two clients/front ends; transactions T and U run on opposite sides of the partition, one performing deposit(B, 3) and the other withdraw(B, 4); replica managers on each side hold copies of B]
41
Dilemma
• In the presence of a network partition:
• In order to keep the replicas consistent,
you need to block.
• From an outside observer, the system appears to
be unavailable.
• If we still serve the requests from two
partitions, then the replicas will diverge.
• The system is available, but no consistency.
• The CAP theorem explains this dilemma.
42
Dealing with Network Partitions
• During a partition, pairs of conflicting transactions may have
been allowed to execute in different partitions. The only
choice is to take corrective action after the network has
recovered
• Assumption: Partitions heal eventually
• Abort one of the transactions after the partition has healed
• Basic idea: allow operations to continue in one or some of
the partitions, but reconcile the differences later after
partitions have healed
43
Lamport Clocks
49
A distributed edit-compile workflow
[Diagram: edit and compile events on different machines, plotted against physical time]
51
Idea: Logical clocks
66
Defining “happens-before”
• [Diagram: three processes P1, P2, P3 with events a, b, c and a message between processes, plotted against physical time]
1. If a and b are events in the same process and a occurs before b, then a → b
2. If a is the sending of a message and b is the receipt of that same message, then a → b
3. If a → b and b → c, then a → c
73
Concurrent events
• [Diagram: processes P1, P2, P3 with events a, b, c, d against physical time]
• Events that are not ordered by happens-before in either direction are concurrent
74
Lamport clocks: Objective
• Assign each event e a number C(e) such that: if a → b, then C(a) < C(b)
• Each process Pi keeps a local counter Ci, initially 0
• [Diagram: P1, P2, P3 with C1 = C2 = C3 = 0 and events a, b, c against physical time]
76
The Lamport Clock algorithm
• Each process Pi maintains a local clock Ci, initially 0
• Pi increments Ci at each local event (including sends); the event’s timestamp is C(e) = Ci
• A message m carries the sender’s clock value C(m)
• On receipt, the receiver sets its clock to max(local clock, C(m)) and then increments it for the receive event
• In the figure: C(a) = 1 and C(b) = 2 at P1; the message carries C(m) = 2; the receive event c at P2 gets C(c) = max(C2, C(m)) + 1 = 3
80
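A minimal sketch of the rules above (the LamportProcess class and its method names are illustrative assumptions, not the lecture's code):

```python
# Minimal sketch of Lamport clocks (illustrative class/method names).
class LamportProcess:
    def __init__(self):
        self.clock = 0

    def local_event(self):
        self.clock += 1              # count this event
        return self.clock            # timestamp C(e)

    def send(self):
        self.clock += 1              # a send is an event too
        return self.clock            # C(m), piggybacked on the message

    def receive(self, msg_clock):
        # take the max of local and message clocks, then count the
        # receive event itself
        self.clock = max(self.clock, msg_clock) + 1
        return self.clock

# Reproducing the figure: events a and b at P1, message m from P1 to P2,
# receive event c at P2.
p1, p2 = LamportProcess(), LamportProcess()
c_a = p1.local_event()   # C(a) = 1
c_m = p1.send()          # C(b) = C(m) = 2
c_c = p2.receive(c_m)    # C(c) = max(local, 2) + 1 = 3
print(c_a, c_m, c_c)     # 1 2 3
```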
Take-away points: Lamport clocks
89
Today
1. The need for time synchronization
3. Logical Time
• Lamport clocks
• Vector clocks
90
Vector clock (VC)
• Label each event e with a vector V(e) = [c1, c2, …, cn]
• ci is a count of events in process i that causally precede e
• [Diagram: an original post followed by user 1’s reply, plotted against physical time]
98
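A minimal sketch of vector clocks matching this definition (class and function names are illustrative assumptions); the example mirrors the post/reply figure:

```python
# Minimal sketch of vector clocks (illustrative; n processes, ids 0..n-1).
class VectorClockProcess:
    def __init__(self, pid, n):
        self.pid = pid
        self.vc = [0] * n

    def local_event(self):
        self.vc[self.pid] += 1          # count this process's own event
        return list(self.vc)            # V(e)

    def send(self):
        return self.local_event()       # message carries the full vector

    def receive(self, msg_vc):
        # element-wise max of local and message vectors, then count the
        # receive event itself
        self.vc = [max(a, b) for a, b in zip(self.vc, msg_vc)]
        self.vc[self.pid] += 1
        return list(self.vc)

def happened_before(v1, v2):
    """True iff the event stamped v1 causally precedes the one stamped v2."""
    return all(a <= b for a, b in zip(v1, v2)) and v1 != v2

# Original post at process 0; process 1 replies after seeing it.
p0, p1 = VectorClockProcess(0, 2), VectorClockProcess(1, 2)
post = p0.send()            # [1, 0]
reply = p1.receive(post)    # [1, 1]
print(happened_before(post, reply))   # True: the post precedes the reply
```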
Scaling out: Place and partition
• Problem 1: Data placement
• On which node(s) to place a partition?
• Maintain mapping from data object to responsible node(s)
100
Problem for modulo hashing:
Changing number of servers
• h(x) = x + 1 (mod 4)
• Add one machine: h(x) = x + 1 (mod 5)
• [Plot: server index (0 to 4) vs. object serial number (5, 7, 10, 11, 27, 29, 36, 38, 40)]
• All entries get remapped to new nodes!
• Need to move objects over the network
101
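A quick worked check of the remapping claim, using the slide's hash function and object serial numbers (the script itself is just illustrative):

```python
# Using the slide's hash h(x) = x + 1 (mod number_of_servers) and the
# object serial numbers from the plot, count how many objects move when
# a fifth server is added.
objects = [5, 7, 10, 11, 27, 29, 36, 38, 40]

before = {x: (x + 1) % 4 for x in objects}
after  = {x: (x + 1) % 5 for x in objects}

moved = [x for x in objects if before[x] != after[x]]
# Nearly every object lands on a different server, so nearly all of them
# must be copied over the network.
print(len(moved), "of", len(objects), "objects change servers")
```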
Consistent hashing
• Desired features
– Balance: No bucket has “too many” objects
– Smoothness: Addition/removal of a token minimizes
object movements for other buckets
102
Consistent hashing’s load balancing problem
• Each node owns 1/nth of the ID space in
expectation
• Says nothing of request load per bucket
103
Virtual nodes
• Idea: Each physical node implements v virtual nodes
• Each physical node maintains v > 1 token ids
• Each token id corresponds to a virtual node
• Techniques:
  • Consistent hashing to map keys to nodes
• Session info
  • Maybe “recently visited products”, etc.?
• Product list
  • Mostly read-only; replication for high read throughput
107
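A minimal sketch of consistent hashing with virtual nodes (the hash choice, token naming, and Ring class are illustrative assumptions, not Dynamo's exact scheme): each physical node owns v tokens on the ring, and a key is assigned to the first token clockwise from its hash.

```python
# Minimal sketch (illustrative) of a consistent hashing ring with
# virtual nodes.
import bisect
import hashlib

def _hash(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, v=8):
        # v tokens ("virtual nodes") per physical node
        self.tokens = sorted((_hash(f"{n}#{i}"), n)
                             for n in nodes for i in range(v))
        self.keys = [t for t, _ in self.tokens]

    def lookup(self, key):
        # first token clockwise from the key's hash (wrapping around)
        i = bisect.bisect(self.keys, _hash(key)) % len(self.keys)
        return self.tokens[i][1]

ring = Ring(["A", "B", "C"])
print(ring.lookup("session:42"))   # the node responsible for this key
```

Adding or removing a node only changes ownership of the key ranges adjacent to that node's tokens, which is the "smoothness" property listed above.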
Dynamo requirements
• Highly available writes despite failures
• Despite disks failing, network routes flapping, “data centers
destroyed by tornadoes”
• Always respond quickly, even during failures → replication
109
Dynamo’s system interface
• get(key): returns the value, or a set of conflicting values, stored under key, plus an opaque “context”
• put(key, context, value): writes value under key; the context carries version information obtained from an earlier get()
111
Data placement
Coordinator node
114
Partitions force a choice between availability and
consistency
• Suppose three replicas are partitioned into two and one
[Diagram: clients on either side of the partition issue put(k,v0) and put(k,v1), leaving the replicas in conflict]
116
Mechanism: Sloppy quorums
• If no failure, reap consistency benefits of single master
• Else sacrifice consistency to allow progress
120
Sloppy quorums and get()s
• Suppose coordinator doesn’t receive R replies when
processing a get()
• Penultimate ¶ of §4.5: “R is the min. number of
nodes that must participate in a successful read
operation.”
• Sounds like these get()s fail
121
Sloppy quorums and freshness
• Common case given in paper: N = 3; R = W = 2
• With these values, do sloppy quorums guarantee a
get() sees all prior put()s?
• If no failures, yes:
  • Two replicas saw each put()
  • Two replicas responded to each get()
• Write and read quorums must overlap!
122
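A tiny worked check of the overlap argument for N = 3, R = W = 2 (illustrative script; replica names are made up): every possible write quorum intersects every possible read quorum because R + W > N.

```python
# Quick check of the overlap claim: with N = 3 replicas and R = W = 2,
# every write quorum shares at least one replica with every read quorum
# (2 + 2 > 3).
from itertools import combinations

replicas = {"A", "B", "C"}           # N = 3
R = W = 2

overlap_always = all(set(w) & set(r)
                     for w in combinations(replicas, W)
                     for r in combinations(replicas, R))
print(overlap_always)                # True, since R + W > N
```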
Sloppy quorums and freshness
• Common case given in paper: N = 3, R = W = 2
• With these values, do sloppy quorums guarantee a
get() sees all prior put()s?
• With failures, no: a sloppy write quorum may be taken on fallback
nodes that a later read quorum never contacts, so the read can miss
the latest put()
124
Conflicts vs. applications
• Shopping cart:
• Could take union of two shopping carts
• What if second put() was result of user deleting
item from cart stored in first put()?
• Result: “resurrection” of deleted item
125
Version vectors (vector clocks)
• Version vector: List of (coordinator node, counter) pairs
• e.g., [(A, 1), (B, 3), …]
126
Version vectors: Dynamo’s mechanism
• Rule: if the vector clock comparison gives v1 < v2, then v1 is an
ancestor of v2, and Dynamo can forget v1
• Each time a get() occurs, Dynamo returns the V.V. for the
value(s) returned (in the “context”)
• Then users must supply that context to put()s that modify the
same key
127
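A minimal sketch of this ancestor-vs-concurrent comparison (the dict representation and function names are illustrative assumptions, not Dynamo's code):

```python
# Minimal sketch (illustrative) of version vector comparison: a version
# vector is represented as a dict {coordinator_node: counter}.
def descends(v2, v1):
    """True iff v2 is equal to or a descendant of v1 (v1 <= v2)."""
    return all(v2.get(node, 0) >= count for node, count in v1.items())

def compare(v1, v2):
    if v1 == v2:
        return "identical"
    if descends(v2, v1):
        return "v1 is an ancestor of v2 (Dynamo can forget v1)"
    if descends(v1, v2):
        return "v2 is an ancestor of v1"
    return "concurrent (conflict; return both to the application)"

print(compare({"A": 1}, {"A": 1, "C": 1}))           # ancestor: discard v1
print(compare({"A": 1, "B": 1}, {"A": 1, "C": 1}))   # concurrent versions
```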
Version vectors (auto-resolving case)
• [Diagram, auto-resolving case: a put handled by node A produces v1 with version vector [(A,1)]; a later put handled by node C produces v2 with [(A,1), (C,1)], so v1 < v2 and v1 can be forgotten]
• [Diagram, conflicting case: starting from v1 [(A,1)] written at node A, one subsequent put is handled by node B and another by node C, producing incomparable version vectors]
130
Impact of deleting a VV entry?
• [Diagram: as before, a put handled by node A produces v1 [(A,1)] and a later put handled by node C produces v2 [(A,1), (C,1)]; the question is what breaks if an entry is deleted from the version vector]
N  R  W  Behavior
3  2  2  Parameters from paper: good durability, good R/W latency
3  3  1  Slow reads, weak durability, fast writes
3  1  3  Slow writes, strong durability, fast reads
3  3  3  More likely that reads see all prior writes?
3  1  1  Read quorum doesn’t overlap write quorum
136
Dynamo: Take-away ideas
• Consistent hashing is broadly useful for replication, not only in P2P systems