CS6456: Graduate Operating Systems
Brad Campbell – [email protected]
https://www.cs.virginia.edu/~bjc8c/class/cs6456-f19/

1
Our Expectation with Data
• Consider a single process using a filesystem
• What do you expect to read?
P1
x.write(2) x.read() ?

• Our expectation (as a user or a developer)


• A read operation returns the most recent write.
• This forms our basic expectation from any file or storage system.
• Linearizability meets this basic expectation.
• But it extends the expectation to handle multiple processes…
• …and multiple replicas.
• The strongest consistency model
2
Linearizability
• Three aspects
• A read operation returns the most recent write,
• …regardless of the clients,
• …according to the single actual-time ordering of requests.
• Or, to put it differently, read/write should behave as if there were,
• …a single client making all the (combined) requests in their original
actual-time order (i.e., with a single stream of ops),
• …over a single copy.
• You can say that your storage system guarantees
linearizability when it provides single-client, single-copy
semantics where a read returns the most recent write.
• It should appear to all clients that there is a single order (actual-time
order) that your storage uses to process all requests.
5
Linearizability Subtleties
• A read/write operation is never a dot!
• It takes time. Many things are involved, e.g., network, multiple disks,
etc.
• Read/write latency: the time measured from right before the call to right after the call, at the client making the call.
• Clear-cut (e.g., black---write & red---read)

• Not-so-clear-cut (parallel)
• Case 1:

• Case 2:

• Case 3:
8
Linearizability Subtleties
• With a single process and a single copy, can
overlaps happen?
• No, these are cases that do not arise with a single
process and a single copy.
• “Most recent write” becomes unclear when there are
overlapping operations.
• Thus, we (as system designers) have the freedom to impose an order.
• As long as it appears to all clients that there is a
single, interleaved ordering for all (overlapping and
non-overlapping) operations that your implementation
uses to process all requests, it’s fine.
• I.e., this ordering should still provide the single-client,
single-copy semantics.
• Again, it’s all about how clients perceive the behavior
of your system.

9
Linearizability Subtleties
• Definite guarantee

• Relaxed guarantee when overlap


• Case 1

• Case 2

• Case 3
10
Linearizability (Textbook Definition)
• Let the sequence of read and update operations that
client i performs in some execution be oi1, oi2,….
• "Program order" for the client
• A replicated shared object service is linearizable if for
any execution (real), there is some interleaving of
operations (virtual) issued by all clients that:
• meets the specification of a single correct copy of objects
• is consistent with the actual times at which each operation
occurred during the execution
• Main goal: any client will see (at any point of time) a
copy of the object that is correct and consistent
• The strongest form of consistency
15
Implementing Linearizability
• Importance of latency
• Amazon: every 100ms of latency costs them 1% in
sales.
• Google: an extra .5 seconds in search page generation
time dropped traffic by 20%.
• Linearizability typically requires complete
synchronization of multiple copies before a write
operation returns.
• So that any read over any copy can return the most
recent write.
• No room for asynchronous writes (i.e., a write operation
returns before all updates are propagated.)
• It makes less sense in a global setting.
• Inter-datacenter latency: ~10s of ms to ~100s of ms
• It might still make sense in a local setting (e.g., within a single datacenter).
17
Relaxing the Guarantees
• Linearizability advantages
• It behaves as expected.
• There’s really no surprise.
• Application developers do not need any additional logic.
• Linearizability disadvantages
• It’s difficult to provide high performance (low latency).
• It might be more than what is necessary.
• Relaxed consistency guarantees
• Sequential consistency
• Causal consistency
• Eventual consistency
• It is still all about client-side perception.
• When a read occurs, what do you return?
18
Sequential Consistency
• A little weaker than linearizability, but still quite strong
• Essentially linearizability, except that it doesn’t need to return the
most recent write according to physical time.
• How can we achieve it?
• Preserving the single-client, (per-process) single-copy semantics
• We give an illusion that there’s a single copy to an isolated process.
• The single-client semantics
• Processing all requests as if they were coming from a single client
(in a single stream of ops).
• Again, this meets our basic expectation---it’s easiest to understand
for an app developer if all requests appear to be processed one at a
time.
• Let’s consider the per-process single-copy semantics with a
few examples.
19
Per-Process Single-Copy Semantics
• But we need to make it work with multiple processes.
• When a storage system preserves each and every process’s program
order, each will think that there’s a single copy.
• Simple example
P1
x.write(2) x.write(3) x.read() → 3
P2
x.write(5) x.read() → 5

• Per-process single-copy semantics


• A storage system preserves each and every process’s program order.
• It gives an illusion to every process that they’re working with a single
copy.

21
Per-Process Single-Copy Examples
• Example 1: Does this work like a single copy at P2?

P1
x.write(5)

P2
x.write(2) x.write(3) x.read() → 3 x.read() → 3

• Yes!
• Does this satisfy linearizability?
• Yes
22
Per-Process Single-Copy Examples
• Example 2: Does this work like a single copy at P2?
P1
x.write(5)

P2
x.write(2) x.write(3) x.read() → 3 x.read() → 5

• Yes!
• Does this satisfy linearizability?
• No
• It’s just that P1’s write is showing up later.
• For P2, it’s like x.write(5) happens between the last
two reads.
• It’s also like P1 and P2’s operations are interleaved
and processed like the arrow shows. 23
Sequential Consistency
• Insight: we don’t need to make other processes’ writes
immediately visible.
• Central question
• Can you explain a storage system’s behavior by coming up with a single
interleaving ordering of all requests, where the program order of each and
every process is preserved?
• Previous example:
P1
x.write(5)

P2
x.write(2) x.write(3) x.read() → 3 x.read() → 5

• We can explain this behavior by the following ordering of requests:
• x.write(2), x.write(3), x.read() → 3, x.write(5), x.read() → 5
24
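To make the central question above concrete, here is a minimal brute-force sketch in Python (not from the slides): it enumerates every interleaving that preserves each process's program order over a single variable x and checks whether at least one interleaving is legal for a single correct copy.

```python
from itertools import permutations

def legal(ops):
    """Single correct copy: each read must return the latest preceding write."""
    last = None                       # x starts undefined
    for kind, val in ops:
        if kind == "write":
            last = val
        elif val != last:             # a read returned something else
            return False
    return True

def sequentially_consistent(histories):
    """Search for one interleaving that preserves every process's program order."""
    tagged = [(pid, op) for pid, ops in enumerate(histories) for op in ops]
    for perm in permutations(tagged):
        in_program_order = all(
            [op for p, op in perm if p == pid] == histories[pid]
            for pid in range(len(histories))
        )
        if in_program_order and legal([op for _, op in perm]):
            return True
    return False

# The example above: P1 writes 5; P2 writes 2, 3, then reads 3 and 5.
p1 = [("write", 5)]
p2 = [("write", 2), ("write", 3), ("read", 3), ("read", 5)]
print(sequentially_consistent([p1, p2]))   # True

# Example 1 on the next slide: P1 writes 5 then 3; P2 writes 2, reads 3, reads 5.
print(sequentially_consistent([[("write", 5), ("write", 3)],
                               [("write", 2), ("read", 3), ("read", 5)]]))  # False
```

This brute force is only practical for tiny histories, but it matches the definition exactly: sequential consistency asks whether such an interleaving exists at all.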
Sequential Consistency Examples
• Example 1: Does this satisfy sequential
consistency?
P1
x.write(5) x.write(3)

P2
x.write(2) x.read() → 3 x.read() → 5

• No: even if P1’s writes show up later, we can’t explain the last two reads.
26
Sequential Consistency Examples
• Example 2: Does this satisfy sequential
consistency?
P1
x.write(2) x.write(3) x.read() → 3
P2
x.write(5) x.read() → 5

• Yes
27
Two More Consistency Models
• Even more relaxed
• We don’t even care about providing an illusion of
a single copy.
• Causal consistency
• We care about ordering causally related write
operations correctly.
• Eventual consistency
• As long as we can say all replicas converge to
the same copy eventually, we’re fine.

35
Relaxing the Guarantees
• For some applications, different clients (e.g., users)
do not need to see the writes in the same order, but
causality is still important (e.g., Facebook post-like pairs).
• Causal consistency
• More relaxed than sequential consistency
• Clients can read values out of order, i.e., it doesn’t behave as
a single copy anymore.
• Clients read values in-order for causally-related writes.
• How do we define “causal relations” between two writes?
• (Roughly) Client 0 writes → Client 1 reads → Client 1 writes
• E.g., writing a comment on a post
36
Causal Consistency
• Example 1:
Causally related: W(x)1 → W(x)2. Concurrent writes: W(x)2 || W(x)3.

P1: W(x)1 W(x)3
P2: R(x)1 W(x)2
P3: R(x)1 R(x)3 R(x)2
P4: R(x)1 R(x)2 R(x)3

This sequence obeys causal consistency

37
Causal Consistency Example 2
• Causally consistent?
Causally related: W(x)1 → W(x)2
P1: W(x)1
P2: R(x)1 W(x)2
P3: R(x)2 R(x)1
P4: R(x)1 R(x)2

• No!
38
Causal Consistency Example 3
• Causally consistent?
P1: W(x)1
P2: W(x)2
P3: R(x)2 R(x)1
P4: R(x)1 R(x) 2

• Yes!
39
Implementing Causal Consistency
• We drop the notion of a single copy.
• Writes can be applied in different orders across
copies.
• Causally-related writes do need to be applied in
the same order for all copies.
• Need a mechanism to keep track of
causally-related writes.
• Due to the relaxed requirements, low
latency is more tractable.
40
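As an illustration of the mechanism described above, here is a minimal Python sketch. It is an assumption about one possible design, not the slides' algorithm: each write carries the set of write IDs its writer had already seen, and a replica delays applying a remote write until all of those dependencies have been applied locally.

```python
# Sketch: causal consistency via explicit dependency tracking (illustrative only).

class CausalReplica:
    def __init__(self, name):
        self.name = name
        self.store = {}          # key -> value
        self.applied = set()     # write IDs (origin, seq) already applied here
        self.pending = []        # remote writes waiting on their dependencies
        self.seq = 0

    def local_write(self, key, value):
        self.seq += 1
        wid = (self.name, self.seq)
        deps = set(self.applied)              # everything seen so far happens-before
        self.store[key] = value
        self.applied.add(wid)
        return (wid, deps, key, value)        # message replicated to other replicas

    def receive(self, msg):
        self.pending.append(msg)
        self._drain()

    def _drain(self):
        progress = True
        while progress:
            progress = False
            for msg in list(self.pending):
                wid, deps, key, value = msg
                if deps <= self.applied:      # all causal predecessors applied
                    self.store[key] = value
                    self.applied.add(wid)
                    self.pending.remove(msg)
                    progress = True

# Example: a post and a comment that causally depends on it.
a, b, c = CausalReplica("A"), CausalReplica("B"), CausalReplica("C")
post = a.local_write("post", "hello")
b.receive(post)
comment = b.local_write("comment", "nice post")   # depends on the post
c.receive(comment)        # held back: C has not seen the post yet
c.receive(post)           # now both apply, in causal order
print(c.store)            # {'post': 'hello', 'comment': 'nice post'}
```

Note that concurrent writes to the same key may still be applied in different orders at different replicas; the sketch only enforces the causal ordering discussed on this slide.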
Relaxing Even Further
• Let’s just do best effort to make things consistent.
• Eventual consistency
• Popularized by the CAP theorem.
• The main problem is network partitions.
[Figure: two clients with front ends, separated by a network partition, issue transaction T: withdraw(B, 4) and transaction U: deposit(B, 3) to replica managers that each hold a copy of B]

41
Dilemma
• In the presence of a network partition:
• In order to keep the replicas consistent,
you need to block.
• From an outside observer, the system appears to
be unavailable.
• If we still serve the requests from two
partitions, then the replicas will diverge.
• The system is available, but there is no consistency.
• The CAP theorem explains this dilemma.
42
Dealing with Network Partitions
• During a partition, pairs of conflicting transactions may have
been allowed to execute in different partitions. The only
choice is to take corrective action after the network has
recovered
• Assumption: Partitions heal eventually
• Abort one of the transactions after the partition has healed
• Basic idea: allow operations to continue in one or some of
the partitions, but reconcile the differences later after
partitions have healed

43
Lamport Clocks

49
A distributed edit-compile workflow

Physical time →

• 2143 < 2144 → make doesn’t call the compiler

Result of the lack of time synchronization: a possible object-file mismatch
50
What makes time synchronization hard?
1. Quartz oscillator sensitive to temperature,
age, vibration, radiation
• Accuracy ca. one part per million (one
second of clock drift over 12 days)

2. The internet is:


• Asynchronous: arbitrary message delays
• Best-effort: messages don’t always arrive

51
Idea: Logical clocks

• Landmark 1978 paper by Leslie Lamport

• Insight: only the events themselves matter

Idea: Disregard the precise clock time


Instead, capture just a “happens before”
relationship between a pair of events

66
Defining “happens-before”

• Consider three processes: P1, P2, and P3

• Notation: Event a happens before event b (a → b)

P1 P2
P3

Physical time ↓
67
Defining “happens-before”

1. Can observe event order at a single process

P1 P2
P3
a

b
Physical time ↓
68
Defining “happens-before”

1. If same process and a occurs before b, then a → b

P1 P2
P3
a

b
Physical time ↓
69
Defining “happens-before”

1. If same process and a occurs before b, then a → b

2. Can observe ordering when processes communicate

P1 P2
P3
a

b
c
Physical time ↓
70
Defining “happens-before”

1. If same process and a occurs before b, then a → b

2. If c is a message receipt of b, then b → c

P1 P2
P3
a

b
c
Physical time ↓
71
Defining “happens-before”

1. If same process and a occurs before b, then a → b

2. If c is a message receipt of b, then b → c

3. Can observe ordering transitively

P1 P2
P3
a

b
c
Physical time ↓
72
Defining “happens-before”

1. If same process and a occurs before b, then a → b

2. If c is a message receipt of b, then b → c

3. If a → b and b → c, then a → c

P1 P2
P3
a

b
c
Physical time ↓
73
Concurrent events

• Not all events are related by →

• a, d not related by →, so concurrent, written as a || d

P1 P2
P3
a
d
b
c
Physical time ↓
74
Lamport clocks: Objective

• We seek a clock time C(a) for every event a
Plan: Tag events with clock times; use clock
times to make distributed system correct

• Clock condition: If a → b, then C(a) < C(b)
75
The Lamport Clock algorithm

• Each process Pi maintains a local clock Ci

1. Before executing an event, Ci ← Ci + 1

P1 P2
C1=0 C2=0 P3
C3=0
a

b
c
Physical time ↓
76
The Lamport Clock algorithm

1. Before executing an event a, Ci ← Ci + 1:

• Set event time C(a) ← Ci

P1 P2
C1=1 C2=1 P3
C(a) = 1 C3=1
a

b
c
Physical time ↓
77
The Lamport Clock algorithm

1. Before executing an event b, Ci ← Ci + 1:

• Set event time C(b) ← Ci

P1 P2
C1=2 C2=1 P3
C(a) = 1 C3=1
a
C(b) = 2
b
c
Physical time ↓
78
The Lamport Clock algorithm

1. Before executing an event b, Ci ← Ci + 1

2. Send the local clock in the message m


P1 P2
C1=2 C2=1 P3
C(a) = 1 C3=1
a
C(b) = 2
b
c
C(m) = 2 Physical time ↓
79
The Lamport Clock algorithm

3. On process Pj receiving a message m:

• Set Cj and receive event time C(c) ← 1 + max{ Cj, C(m) }

P1 P2
C1=2 C2=3 P3
C(a) = 1 C3=1
a
C(b) = 2 C(c) = 3
b
c
C(m) = 2 Physical time ↓
80
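The rules above fit in a few lines of Python. This is a minimal sketch, not from the slides: the message simply carries the sender's clock value C(m), and process names are illustrative.

```python
class LamportClock:
    def __init__(self):
        self.c = 0

    def local_event(self):
        self.c += 1               # Rule 1: increment before executing an event
        return self.c             # this event's timestamp

    def send(self):
        return self.local_event() # sending is itself an event; C(m) rides on the message

    def receive(self, c_m):
        self.c = 1 + max(self.c, c_m)   # Rule 3: C(c) = 1 + max{ Cj, C(m) }
        return self.c

# A simple a -> b -> c chain: a and b on P1 (b sends m), c on P2 receives m.
p1, p2 = LamportClock(), LamportClock()
c_a = p1.local_event()    # C(a) = 1
c_m = p1.send()           # C(b) = C(m) = 2
c_c = p2.receive(c_m)     # C(c) = 1 + max{0, 2} = 3
print(c_a, c_m, c_c)      # 1 2 3
```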
Take-away points: Lamport clocks

• Can totally order events in a distributed system: that’s useful!

• But: while by construction, a → b implies C(a) < C(b),
• The converse is not necessarily true:
• C(a) < C(b) does not imply a → b (possibly, a || b)

Can’t use Lamport clock timestamps to infer causal relationships between events

89
Today
1. The need for time synchronization

2. “Wall clock time” synchronization


• Cristian’s algorithm, Berkeley algorithm, NTP

3. Logical Time
• Lamport clocks
• Vector clocks

90
Vector clock (VC)
• Label each event e with a vector V(e) = [c1, c2 …, cn]
• ci is a count of events in process i that causally precede e

• Initially, all vectors are [0, 0, …, 0]

• Two update rules:

1. For each local event on process i, increment local entry ci

2. If process j receives message with vector [d1, d2, …, dn]:


• Set each local entry ck = max{ck, dk}
• Increment local entry cj
91
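A minimal Python sketch of these two update rules (the process indices and the three-process example values below are only for illustration):

```python
class VectorClock:
    def __init__(self, n, i):
        self.v = [0] * n          # initially [0, 0, ..., 0]
        self.i = i                # this process's index

    def local_event(self):
        self.v[self.i] += 1       # Rule 1: increment the local entry
        return list(self.v)

    def send(self):
        return self.local_event() # the vector piggybacks on the message

    def receive(self, d):
        # Rule 2: take the entrywise max, then increment the local entry.
        self.v = [max(ck, dk) for ck, dk in zip(self.v, d)]
        self.v[self.i] += 1
        return list(self.v)

# Three processes: P1 does events a and b (b sends a message); P2 receives it (event c).
p1, p2 = VectorClock(3, 0), VectorClock(3, 1)
va = p1.local_event()     # [1, 0, 0]
vb = p1.send()            # [2, 0, 0]
vc = p2.receive(vb)       # [2, 1, 0]
print(va, vb, vc)
```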
Vector clock: Example

• All counters start at [0, 0, 0]


• Applying the local update rule
• Applying the message rule
• Local vector clocks piggyback on inter-process messages

[Figure: P1 has events a [1,0,0] and b [2,0,0]; P2 has events c [2,1,0] and d [2,2,0]; P3 has events e [0,0,1] and f [2,2,2]; messages carry [2,0,0] from b to c and [2,2,0] from d to f. Physical time ↓]
92
Vector clocks can establish causality
• Rule for comparing vector clocks:
• V(a) = V(b) when ak = bk for all k
• V(a) < V(b) when ak ≤ bk for all k and V(a) ≠ V(b)

• Concurrency: a || b if ai < bi and aj > bj, some i, j

• V(a) < V(z) when there is a chain of events linked by → between a and z

[Figure: a [1,0,0] → b [2,0,0] → c [2,1,0] → z [2,2,0]]
93
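These comparison rules are easy to state in code; a minimal Python sketch (vectors as plain lists, values taken from the figure above):

```python
def vc_leq(a, b):
    return all(ai <= bi for ai, bi in zip(a, b))

def vc_less(a, b):
    # V(a) < V(b): every entry <= and the vectors are not equal.
    return vc_leq(a, b) and a != b

def vc_concurrent(a, b):
    # a || b: neither vector is <= the other.
    return not vc_leq(a, b) and not vc_leq(b, a)

va, vz = [1, 0, 0], [2, 2, 0]
print(vc_less(va, vz))                      # True: a -> ... -> z
print(vc_concurrent([2, 0, 0], [0, 0, 1]))  # True: no causal chain either way
```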
Two events a, z

Lamport clocks: C(a) < C(z)
Conclusion: None

Vector clocks: V(a) < V(z)
Conclusion: a → … → z

Vector clock timestamps tell us about causal event relationships
94
VC application:
Causally-ordered bulletin board system

• Distributed bulletin board application
• Each post → multicast of the post to all other users

• Want: No user to see a reply before the corresponding original message post

• Deliver message only after all messages that causally precede it have been delivered
• Otherwise, the user would see a reply to a message they could not find
95
VC application:
Causally-ordered bulletin board system

[Figure: original post, then 1’s reply. Physical time →]

• User 0 posts, user 1 replies to 0’s post; user 2 observes
96
Distributing data

98
Scaling out: Place and partition
• Problem 1: Data placement
• On which node(s) to place a partition?
• Maintain mapping from data object to responsible node(s)

• Problem 2: Partition management


• Including how to recover from node failure
• e.g., bringing another node into partition group
• Changes in system size, i.e. nodes joining/leaving

• Centralized: Cluster manager

• Decentralized: Deterministic hashing and algorithms
99
Modulo hashing

• Consider the problem of data partition:
• Given object id X, choose one of k servers to use

• Suppose instead we use modulo hashing:
• Place X on server i = hash(X) mod k

• What happens if a server fails or joins (k → k±1)?
• or different clients have different estimates of k?

100
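A quick sketch of why this is a problem, using the next slide's example hash h(x) = x + 1 (mod k): when k grows from 4 to 5, most objects map to a different server.

```python
def server_for(x, k):
    return (x + 1) % k       # the slides' example hash: h(x) = x + 1 (mod k)

objects = range(1000)
before = {x: server_for(x, 4) for x in objects}
after = {x: server_for(x, 5) for x in objects}    # one server added: k = 4 -> 5
moved = sum(1 for x in objects if before[x] != after[x])
print(f"{moved} of {len(objects)} objects must move")   # 800 of 1000
```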
Problem for modulo hashing:
Changing number of servers

h(x) = x + 1 (mod 4)
Add one machine: h(x) = x + 1 (mod 5)

All entries get remapped to new nodes!
→ Need to move objects over the network

[Plot: server index (0–4) vs. object serial number (5, 7, 10, 11, 27, 29, 36, 38, 40)]
101
Consistent hashing

– Assign n tokens to random points on a mod 2^k circle; hash key size = k
– Hash object to a random circle position
– Put object in closest clockwise bucket
– successor(key) → bucket

[Figure: hash ring with labeled positions 0, 4, 8, 12, 14; tokens and buckets marked]

• Desired features
– Balance: No bucket has “too many” objects
– Smoothness: Addition/removal of a token minimizes object movements for other buckets
102
Consistent hashing’s load balancing problem
• Each node owns 1/nth of the ID space in
expectation
• Says nothing of request load per bucket

• If a node fails, its successor takes over bucket


• Smoothness goal ✔: Only localized shift, not O(n)

• But now successor owns two buckets: 2/nth of key space


• The failure has upset the load balance

103
Virtual nodes
• Idea: Each physical node implements v virtual nodes
• Each physical node maintains v > 1 token ids
• Each token id corresponds to a virtual node

• Each virtual node owns an expected 1/(vn)th of the ID space

• Upon a physical node’s failure, v virtual nodes fail
• Their successors take over 1/(vn)th more

• Result: Better load balance with larger v


104
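A minimal Python sketch of a consistent-hash ring with virtual nodes. The choices here (MD5 for ring positions, bisect for the clockwise successor, the node and key names) are illustrative assumptions, not any system's actual code.

```python
import bisect
import hashlib

def position(s):
    # A point on the 0 .. 2^128 - 1 identifier circle (MD5 used only for illustration).
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, physical_nodes, v=8):
        # Each physical node gets v tokens (virtual nodes) at pseudo-random points.
        self.tokens = sorted(
            (position(f"{node}#{i}"), node)
            for node in physical_nodes for i in range(v)
        )
        self.points = [p for p, _ in self.tokens]

    def successor(self, key):
        # The closest clockwise token (wrapping around) owns the key.
        i = bisect.bisect_right(self.points, position(key)) % len(self.tokens)
        return self.tokens[i][1]

ring = Ring(["A", "B", "C"], v=8)
print(ring.successor("cart:alice"))       # the physical node that owns this key

# Removing a node only reassigns that node's token ranges to their successors;
# keys owned by other tokens do not move (the smoothness property above).
smaller = Ring(["A", "B"], v=8)
moved = sum(ring.successor(f"k{i}") != smaller.successor(f"k{i}") for i in range(1000))
print(f"{moved} of 1000 keys moved after removing node C")   # roughly a third
```

Larger v spreads each physical node's share of the ring across more, smaller ranges, so a failed node's load is absorbed by many successors rather than one.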
Dynamo: The P2P context
• Chord and DHash intended for wide-area P2P
systems
• Individual nodes at Internet’s edge, file sharing

• Central challenges: low-latency key lookup with


small forwarding state per node

• Techniques:
• Consistent hashing to map keys to nodes

• Replication at successors for availability under failure


105
Amazon’s workload (in 2007)
• Tens of thousands of servers in globally-distributed
data centers

• Peak load: Tens of millions of customers

• Tiered service-oriented architecture


• Stateless web page rendering servers, atop
• Stateless aggregator servers, atop
• Stateful data stores (e.g. Dynamo)
• put( ), get( ): values “usually less than 1 MB”
106
How does Amazon use Dynamo?
• Shopping cart

• Session info
• Maybe “recently visited products”, etc.?

• Product list
• Mostly read-only, replication for high read throughput

107
Dynamo requirements
• Highly available writes despite failures
• Despite disks failing, network routes flapping, “data centers
destroyed by tornadoes”
• Always respond quickly, even during failures → replication

• Low request-response latency: focus on 99.9% SLA

• Incrementally scalable as servers grow to workload


• Adding “nodes” should be seamless

• Comprehensible conflict resolution


• High availability in above sense implies conflicts
108
Design questions
• How is data placed and replicated?

• How are requests routed and handled in a


replicated system?

• How to cope with temporary and permanent


node failures?

109
Dynamo’s system interface

• Basic interface is a key-value store


• get(k) and put(k, v)
• Keys and values opaque to Dynamo

• get(key) → value, context

• Returns one value or multiple conflicting values
• Context describes version(s) of value(s)

• put(key, context, value) → “OK”


• Context indicates which versions this version
supersedes or merges
110
Dynamo’s techniques

• Place replicated data on nodes with consistent hashing

• Maintain consistency of replicated data with vector


clocks
• Eventual consistency for replicated data: prioritize success
and low latency of writes over reads
• And availability over consistency (unlike DBs)

• Efficiently synchronize replicas using Merkle trees

Key trade-offs: Response time vs. consistency vs. durability

111
Data placement

[Figure: key K on the ring; the coordinator node: “put(K,…), get(K) requests go to me”]

Each data item is replicated at N virtual nodes (e.g., N = 3)
112


Data replication
• Much like in Chord: a key-value pair → key’s N successors (preference list)
• Coordinator receives a put for some key
• Coordinator then replicates data onto nodes in the
key’s preference list

• Preference list size > N to account for node failures

• For robustness, the preference list skips tokens to ensure distinct physical nodes
113
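Here is a minimal sketch of one way to build such a preference list. It is an assumption about the mechanism, not Dynamo's code: walk clockwise from the key's position and skip tokens whose physical node has already been chosen.

```python
import bisect
import hashlib

def position(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)   # point on the ring

def preference_list(tokens, key, n=3):
    """tokens: [(ring position, physical node), ...] for every virtual node."""
    tokens = sorted(tokens)
    points = [p for p, _ in tokens]
    start = bisect.bisect_right(points, position(key))
    chosen = []
    for step in range(len(tokens)):
        node = tokens[(start + step) % len(tokens)][1]
        if node not in chosen:            # skip tokens of already-chosen physical nodes
            chosen.append(node)
        if len(chosen) == n:
            break
    return chosen

# Each physical node holds several virtual-node tokens (names are made up).
tokens = [(position(f"{node}#{i}"), node) for node in "ABCD" for i in range(4)]
print(preference_list(tokens, "cart:alice"))   # three distinct physical nodes
```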
Gossip and “lookup”
• Gossip: Once per second, each node contacts a
randomly chosen other node
• They exchange their lists of known nodes
(including virtual node IDs)

• Each node learns which others handle all key ranges

• Result: All nodes can send directly to any key’s


coordinator (“zero-hop DHT”)
• Reduces variability in response times

114
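A minimal sketch of such a membership gossip round (an assumption for illustration; the node names and token IDs are made up): each node periodically merges its view with one randomly chosen peer, and full membership spreads within a few rounds.

```python
import random

class Node:
    def __init__(self, name, tokens):
        self.name = name
        self.known = {name: set(tokens)}      # node name -> virtual-node token IDs

    def gossip_with(self, peer):
        # Exchange membership lists; each side ends up with the union of both views.
        for a, b in ((self, peer), (peer, self)):
            for node, tokens in list(b.known.items()):
                a.known.setdefault(node, set()).update(tokens)

nodes = [Node(f"n{i}", {10 * i, 10 * i + 1}) for i in range(5)]
for _ in range(10):                           # once per "second", pick a random peer
    for n in nodes:
        n.gossip_with(random.choice([m for m in nodes if m is not n]))

print(all(len(n.known) == len(nodes) for n in nodes))   # True with high probability
```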
Partitions force a choice between availability and
consistency
• Suppose three replicas are partitioned into two and one

• If one replica fixed as master, no client in other partition can write

• In Paxos-based primary-backup, no client in the partition of one


can write

• Traditional distributed databases emphasize consistency over


availability when there are partitions
115
Alternative: Eventual consistency
• Dynamo emphasizes availability over consistency when there are
partitions
• Tell client write complete when only some replicas have stored it
• Propagate to other replicas in background
• Allows writes in both partitions…but risks:
• Returning stale data
• Write conflicts when partition heals:

put(k,v0) put(k,v1)
?@%$!!
116
Mechanism: Sloppy quorums
• If no failure, reap consistency benefits of single master
• Else sacrifice consistency to allow progress

• Dynamo tries to store all values put() under a key on


first N live nodes of coordinator’s preference list

• BUT to speed up get() and put():


• Coordinator returns “success” for put when W < N
replicas have completed write
• Coordinator returns “success” for get when R < N
replicas have completed read
117
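A minimal sketch of the W/R rule above (an assumption for illustration, not Dynamo's code; a real coordinator contacts the replicas in parallel and handles timeouts, versions, and hinted handoff):

```python
N, W, R = 3, 2, 2      # replication factor and write/read quorum sizes

class Replica:
    def __init__(self):
        self.data = {}
    def write(self, key, value):
        self.data[key] = value
        return True
    def read(self, key):
        return self.data.get(key)

def coordinator_put(pref_list, key, value):
    acks = 0
    for replica in pref_list[:N]:      # first N live nodes of the preference list
        if replica.write(key, value):
            acks += 1
        if acks >= W:
            return "OK"                # success before all N replicas have the write
    return "FAIL"

def coordinator_get(pref_list, key):
    answers = []
    for replica in pref_list[:N]:
        value = replica.read(key)
        if value is not None:
            answers.append(value)
        if len(answers) >= R:
            return answers             # may contain conflicting versions
    return None

pref_list = [Replica() for _ in range(N)]
print(coordinator_put(pref_list, "cart:alice", "v1"))   # OK after W acks
print(coordinator_get(pref_list, "cart:alice"))         # ['v1', 'v1']
```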
Sloppy quorums: Hinted handoff
• Suppose coordinator doesn’t receive W replies when
replicating a put()
• Could return failure, but remember goal of high
availability for writes…

• Hinted handoff: Coordinator tries further nodes in


preference list (beyond first N) if necessary
• Indicates the intended replica node to recipient
• Recipient will periodically try to forward to the
intended replica node
118
Hinted handoff: Example

• Suppose C fails
• Node E is in the preference list
• Needs to receive a replica of the data
• Hinted Handoff: replica at E points to node C

• When C comes back
• E forwards the replicated data back to C

[Figure: the ring with key K, its coordinator, failed node C, and stand-in node E]
119
Wide-area replication
• Last ¶,§4.6: Preference lists always contain nodes
from more than one data center
• Consequence: Data likely to survive failure of
entire data center

• Blocking on writes to a remote data center would


incur unacceptably high latency
• Compromise: W < N, eventual consistency

120
Sloppy quorums and get()s
• Suppose coordinator doesn’t receive R replies when
processing a get()
• Penultimate ¶,§4.5: “R is the min. number of
nodes that must participate in a successful read
operation.”
• Sounds like these get()s fail

• Why not return whatever data was found, though?


• As we will see, consistency not guaranteed anyway…

121
Sloppy quorums and freshness
• Common case given in paper: N = 3; R = W = 2
• With these values, do sloppy quorums guarantee a
get() sees all prior put()s?

• If no failures, yes:
• Two writers saw each put()
• Two readers responded to each get()
• Write and read quorums must overlap!

122
Sloppy quorums and freshness
• Common case given in paper: N = 3, R = W = 2
• With these values, do sloppy quorums guarantee a
get() sees all prior put()s?

• With node failures, no:


• Two nodes in preference list go down
• put() replicated outside preference list

• Two nodes in preference list come back up


• get() occurs before they receive prior put()
123
Conflicts
• Suppose N = 3, W = R = 2, nodes are named A, B, C
• 1st put(k, …) completes on A and B
• 2nd put(k, …) completes on B and C
• Now get(k) arrives, completes first at A and C

• Conflicting results from A and C


• Each has seen a different put(k, …)

• Dynamo returns both results; what does client do now?

124
Conflicts vs. applications
• Shopping cart:
• Could take union of two shopping carts
• What if second put() was result of user deleting
item from cart stored in first put()?
• Result: “resurrection” of deleted item

• Can we do better? Can Dynamo resolve cases when


multiple values are found?
• Sometimes. If it can’t, application must do so.

125
Version vectors (vector clocks)
• Version vector: List of (coordinator node, counter) pairs
• e.g., [(A, 1), (B, 3), …]

• Dynamo stores a version vector with each stored


key-value pair

• Idea: track “ancestor-descendant” relationship


between different versions of data stored under the
same key k

126
Version vectors: Dynamo’s mechanism
• Rule: If vector clock comparison of v1 < v2, then the first
is an ancestor of the second – Dynamo can forget v1

• Each time a put() occurs, Dynamo increments the


counter in the V.V. for the coordinator node

• Each time a get() occurs, Dynamo returns the V.V. for the
value(s) returned (in the “context”)

• Then users must supply that context to put()s that modify the
same key

127
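A minimal Python sketch of this comparison, representing a version vector as a dict from coordinator node to counter (the node names and helper functions are illustrative, not Dynamo's API):

```python
def descends(v2, v1):
    """True if v2 equals or is a descendant of v1 (every counter >=)."""
    return all(v2.get(node, 0) >= count for node, count in v1.items())

def reconcile_needed(v1, v2):
    """Neither descends from the other: siblings the client must reconcile."""
    return not descends(v1, v2) and not descends(v2, v1)

def bump(v, coordinator):
    """The coordinator increments its own entry on each put()."""
    v = dict(v)
    v[coordinator] = v.get(coordinator, 0) + 1
    return v

v1 = bump({}, "A")               # {'A': 1}
v2 = bump(v1, "C")               # {'A': 1, 'C': 1}  -> descends(v2, v1): drop v1
v3 = bump(v1, "B")               # {'A': 1, 'B': 1}
print(descends(v2, v1))          # True: automatic resolution
print(reconcile_needed(v2, v3))  # True: v2 || v3, client must reconcile
```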
Version vectors (auto-resolving case)

put handled
by node A

v1 [(A,1)]
put handled
by node C

v2 [(A,1), (C,1)]

v2 > v1, so Dynamo nodes automatically drop v1 in favor of v2


128
Version vectors (app-resolving case)

put handled
by node A
v1 [(A,1)]
put handled put handled
by node B by node C

v2 [(A,1), (B,1)] v3 [(A,1), (C,1)]


v2 || v3, so a client must perform semantic reconciliation

Client reads v2, v3; context: [(A,1), (B,1), (C,1)]
Client reconciles v2 and v3; node A handles the put: v4 [(A,2), (B,1), (C,1)]
129
Trimming version vectors
• Many nodes may process a series of put()s to
same key
• Version vectors may get long – do they grow forever?

• No, there is a clock truncation scheme


• Dynamo stores time of modification with each V.V. entry

• When V.V. > 10 nodes long, V.V. drops the timestamp of


the node that least recently processed that key

130
Impact of deleting a VV entry?

put handled
by node A

v1 [(A,1)]
put handled
by node C

v2 [(A,1), (C,1)]

v2 || v1, so looks like application resolution is required


131
Concurrent writes
• What if two clients concurrently write w/o failure?
• e.g. add different items to same cart at same time
• Each does get-modify-put
• They both see the same initial version
• And they both send put() to same coordinator

• Will coordinator create two versions with


conflicting VVs?
• We want that outcome, otherwise one was thrown away
• Paper doesn't say, but coordinator could detect problem
via put() context
132
Removing threats to durability

• Hinted handoff node crashes before it can


replicate data to node in preference list
• Need another way to ensure that each key-
value pair is replicated N times

• Mechanism: replica synchronization


• Nodes nearby on ring periodically gossip
• Compare the (k, v) pairs they hold
• Copy any missing keys the other has

How to compare and copy replica state


quickly and efficiently?
133
Efficient synchronization with Merkle trees
• Merkle trees hierarchically summarize the key-value
pairs a node holds

• One Merkle tree for each virtual node key range


• Leaf node = hash of one key’s value
• Internal node = hash of concatenation of children

• Compare roots; if match, values match


• If they don’t match, compare children
• Iterate this process down the tree
134
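A minimal Python sketch of this idea (an assumption, not Dynamo's implementation): build a binary Merkle tree over sorted key-value pairs, then walk two trees top-down, pruning any subtree whose root hashes already match.

```python
import hashlib

def h(data):
    return hashlib.sha256(data).hexdigest()

def build(pairs):
    """Leaf = hash of one key's value; internal node = hash of its children's hashes."""
    nodes = [{"hash": h(v.encode()), "keys": {k}} for k, v in sorted(pairs)]
    if not nodes:
        return {"hash": h(b""), "keys": set()}
    while len(nodes) > 1:
        nxt = []
        for i in range(0, len(nodes), 2):
            pair = nodes[i:i + 2]
            nxt.append({
                "hash": h("".join(n["hash"] for n in pair).encode()),
                "keys": set().union(*(n["keys"] for n in pair)),
                "children": pair,
            })
        nodes = nxt
    return nodes[0]

def diff(a, b):
    """Return keys that may differ; prune subtrees whose hashes match."""
    if a["hash"] == b["hash"]:
        return set()
    if "children" not in a or "children" not in b:
        return a["keys"] | b["keys"]
    return set().union(*(diff(x, y) for x, y in zip(a["children"], b["children"])))

ta = build([("k1", "v1"), ("k2", "v2"), ("k3", "v3"), ("k4", "v4")])
tb = build([("k1", "v1"), ("k2", "XX"), ("k3", "v3"), ("k4", "v4")])
print(diff(ta, tb))   # {'k2'}
```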
Merkle tree reconciliation

• B is missing orange key; A is missing green one

• Exchange and compare hash nodes from root


downwards, pruning when hashes match
[Figure: A’s and B’s Merkle trees; each root covers [0, 2^128), with children covering [0, 2^127) and [2^127, 2^128)]

Finds differing keys quickly and with


minimum information exchange
135
How useful is it to vary N, R, W?

N  R  W  Behavior
3  2  2  Parameters from paper: good durability, good R/W latency
3  3  1  Slow reads, weak durability, fast writes
3  1  3  Slow writes, strong durability, fast reads
3  3  3  More likely that reads see all prior writes?
3  1  1  Read quorum doesn’t overlap write quorum

136
Dynamo: Take-away ideas
• Consistent hashing broadly useful for replication—not
only in P2P systems

• Extreme emphasis on availability and low latency,


unusually, at the cost of some inconsistency

• Eventual consistency lets writes and reads return quickly, even when there are partitions and failures

• Version vectors allow some conflicts to be resolved


automatically; others left to application
137
