[Slide figure: two clients issuing get, set, and cas (compare-and-swap) operations against a linearizable distributed system; a cas(x, expected, new) succeeds only when x currently holds the expected value.]
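To make the register semantics in these examples concrete, the following is a minimal single-process Python sketch (names are illustrative) of a register supporting get, set, and an atomic compare-and-swap. It captures only the sequential behaviour that a linearizable distributed implementation must preserve; it is not a replication protocol.

class Register:
    # Single-process sketch of the get/set/cas interface (not distributed).

    def __init__(self, value=None):
        self.value = value

    def get(self):
        return self.value

    def set(self, new_value):
        self.value = new_value

    def cas(self, expected, new_value):
        # Atomically update the value only if it currently equals `expected`.
        if self.value == expected:
            self.value = new_value
            return True
        return False

x = Register(1)
assert x.get() == 1
assert x.cas(1, 2) is True    # current value is 1, so the CAS succeeds
assert x.cas(0, 3) is False   # current value is 2, not 0, so the CAS fails
assert x.get() == 2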
Linearizability advantages:
▶ Makes a distributed system behave as if it were non-distributed
▶ Simple for applications to use
Downsides:
▶ Performance cost: lots of messages and waiting for responses
▶ Scalability limits: leader can be a bottleneck
▶ Availability problems: if you can’t contact a quorum of nodes, you can’t process any operations
Slide 137
As an example, consider the calendar app that you can find on most phones, tablets, and computers.
We would like the appointments and entries in this app to sync across all of our devices; in other words,
we want it to be replicated such that each device is a replica. Moreover, we would like to be able to
view, modify, and add calendar events even while a device is offline (e.g. due to poor mobile network
coverage). If the calendar app’s replication protocol was linearizable, this would not be possible, since
an offline device cannot communicate with a quorum of replicas.
Slide 138
Instead, calendar apps allow the user to read and write events in their calendar even while a device is
offline, and they sync any updates between devices sometime later, in the background, when an internet
connection is available. The video of this lecture includes a demonstration of offline updates to a calendar.
This trade-off is known as the CAP theorem (named after consistency, availability, and partition
tolerance), which states that if there is a network partition in a system, we must choose one of
the following options [Gilbert and Lynch, 2002]:
1. We can have linearizable consistency, but in this case, some replicas will not be able to respond to
requests because they cannot communicate with a quorum. Not being able to respond to requests
makes those nodes effectively unavailable.
2. We can allow replicas to respond to requests even if they cannot communicate with other replicas.
In this case, they continue to be available, but we cannot guarantee linearizability.
Sometimes the CAP theorem is formulated as a choice of “pick 2 out of 3”, but that framing is misleading.
A system can be both linearizable and available as long as there is no network partition, and the choice
is forced only in the presence of a partition [Kleppmann, 2015].
This trade-off is illustrated on Slide 139, where node C is unable to communicate with nodes A and
B. On A and B’s side of the partition, linearizable operations can continue as normal, because A and
B constitute a quorum. However, if C wants to read the value of x, it must either wait (potentially
indefinitely) until the network partition is repaired, or it must return its local value of x, which does not
reflect the value previously written by A on the other side of the partition.
The CAP theorem
A system can be either strongly Consistent (linearizable) or
Available in the presence of a network Partition
[Diagram: node A performs set(x, v1); nodes A and B, on one side of a network partition, subsequently read get(x) → v1; node C, on the other side of the partition, reads get(x) → v0.]
C must either wait indefinitely for the network to recover, or
return a potentially stale value
Slide 139
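To make this choice concrete, the following Python sketch contrasts the two options for a replica on the minority side of a partition. The quorum size, timeout, and replica interface are assumptions made for the sake of the example, not part of any real protocol.

import time

class PartitionedReplica:
    # Hedged sketch of the choice node C faces on Slide 139.

    def __init__(self, local_value, reachable_replicas, quorum_size):
        self.local_value = local_value        # last value this replica has seen
        self.reachable = reachable_replicas   # replicas it can currently contact
        self.quorum_size = quorum_size

    def linearizable_get(self, timeout_seconds):
        # Option 1: stay consistent. Block until a quorum is reachable.
        deadline = time.monotonic() + timeout_seconds
        while len(self.reachable) + 1 < self.quorum_size:   # +1 counts this node
            if time.monotonic() > deadline:
                raise TimeoutError("no quorum reachable; cannot serve the read")
            time.sleep(0.1)   # in reality: wait for the partition to heal
        return self._read_from_quorum()

    def eventually_consistent_get(self):
        # Option 2: stay available. Return the local, possibly stale, value.
        return self.local_value

    def _read_from_quorum(self):
        # Placeholder for a real quorum read (e.g. the ABD algorithm).
        return self.local_value

c = PartitionedReplica(local_value="v0", reachable_replicas=[], quorum_size=2)
print(c.eventually_consistent_get())      # returns "v0" immediately, but it may be stale
# c.linearizable_get(timeout_seconds=1)   # would raise TimeoutError while partitioned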
The calendar app chooses option 2: it forgoes linearizability in favour of allowing the user to continue
performing operations while a device is offline. Many other systems similarly make this choice for various
reasons.
The approach of allowing each replica to process both reads and writes based only on its local state,
and without waiting for communication with other replicas, is called optimistic replication. A variety of
consistency models have been proposed for optimistically replicated systems, with the best-known being
eventual consistency.
Eventual consistency is defined as: “if no new updates are made to an object, eventually all reads will
return the last updated value” [Vogels, 2009]. This is a very weak definition: what if the updates to an
object never stop, so the premise of the statement is never true? A slightly stronger consistency model
called strong eventual consistency, defined on Slide 140, is often more appropriate [Shapiro et al., 2011].
It is based on the idea that as two replicas communicate, they converge towards the same state.
Eventual consistency
Replicas process operations based only on their local state.
If there are no more updates, eventually all replicas will be in
the same state. (No guarantees how long it might take.)
Strong eventual consistency:
▶ Eventual delivery: every update made to one non-faulty replica is eventually processed by every non-faulty replica.
▶ Convergence: any two replicas that have processed the same set of updates are in the same state (even if updates were processed in a different order).
Properties:
▶ Does not require waiting for network communication
▶ Causal broadcast (or weaker) can disseminate updates
▶ Concurrent updates ⇒ conflicts need to be resolved
Slide 140
In both eventual consistency and strong eventual consistency, there is the possibility of different nodes
concurrently updating the same object, leading to conflicts (as previously discussed on Slide 95). Various
algorithms have been developed to resolve those conflicts automatically [Shapiro et al., 2011].
The lecture video shows an example of a conflict in the eventually consistent calendar app: on one
device, I update the time of an event, while concurrently on another device, I update the title of the
same event. After the two devices synchronise, the update of the time is applied to both devices, while
the update of the title is discarded. The state of the two devices therefore converges – at the cost of
a small amount of data loss. This is the last writer wins approach to conflict resolution that we have
seen on Slide 95 (assuming the update to the time is the “last” update in this example). A more refined
approach might merge the updates to the time and the title, as shown on Slide 143.
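The last-writer-wins behaviour seen in this demo can be sketched as follows; the integer timestamps and the event fields are illustrative (a real system might use Lamport timestamps). Whichever update carries the greater timestamp replaces the entire event, so the replicas converge regardless of the order in which they merge, but the losing update is discarded.

def lww_merge(update_a, update_b):
    # Keep the update with the greater (timestamp, node) pair; ties are broken
    # by node identifier so that all replicas pick the same winner.
    return max(update_a, update_b, key=lambda u: (u["ts"], u["node"]))

update_time  = {"ts": 2, "node": "A",
                "event": {"title": "Lecture", "time": "10:00"}}    # time changed
update_title = {"ts": 1, "node": "B",
                "event": {"title": "Lecture 1", "time": "12:00"}}  # title changed

# Both merge orders pick the same winner, so the replicas converge...
assert lww_merge(update_time, update_title) == lww_merge(update_title, update_time)
# ...but the title change from node B is lost:
print(lww_merge(update_time, update_title)["event"])   # {'title': 'Lecture', 'time': '10:00'}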
This brings us to the end of our discussion of consistency models. Slide 141 summarises some of the
key properties of the models we have seen, in descending order of the minimum strength of assumptions
that they must make about the system model.
                                     waits for communication with    timing assumptions
atomic commit                        all participating nodes         partially synchronous
consensus, total order broadcast,
  linearizable CAS                   quorum                          partially synchronous
linearizable get/set                 quorum                          asynchronous
eventual consistency, causal
  broadcast, FIFO broadcast          local replica only              asynchronous

(rows are listed in decreasing order of the strength of assumptions they require)
Slide 141
Atomic commit makes the strongest assumptions, since it must wait for communication with all nodes
participating in a transaction (potentially all of the nodes in the system) in order to complete successfully.
Consensus, total order broadcast, and linearizable algorithms make weaker assumptions since they only
require waiting for communication with a quorum, so they can tolerate some unavailable nodes. The FLP
result (Slide 107) showed us that consensus and total order broadcast require partial synchrony. It can be
shown that a linearizable CAS operation is equivalent to consensus [Herlihy, 1991], and thus also requires
partial synchrony. On the other hand, the ABD algorithm for linearizable get/set is asynchronous, since
it does not require any clocks or timeouts. Finally, eventual consistency and strong eventual consistency
make the weakest assumptions: operations can be processed without waiting for any communication
with other nodes, and without any timing assumptions. Similarly, in causal broadcast and weaker forms
of broadcast (FIFO, reliable, etc.), a node broadcasting a message can immediately deliver it to itself
without waiting for communication with other nodes, as discussed in Section 4.2; this corresponds to a
replica immediately processing its own operations without waiting for communication with other replicas.
This hierarchy has some similarities to the concept of complexity classes of algorithms – for example,
comparison-based sorting requires Ω(n log n) comparisons – in the sense that it captures the unavoidable minimum communication
and synchrony requirements for a range of common problems in distributed systems.
8 Case studies
In this last lecture we will look at a couple of examples of distributed systems that need to manage
concurrent access to data. In particular, we will include some case studies of practical, real-world systems
that need to deal with concurrency, and which build upon the concepts from the rest of this course.
For better performance and better robustness to network interruptions, most collaboration software uses
optimistic replication that provides strong eventual consistency (Slide 140).
Families of algorithms:
▶ Conflict-free Replicated Data Types (CRDTs)
  ▶ Operation-based
  ▶ State-based
▶ Operational Transformation (OT)
Slide 142
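As a small illustration of the state-based family (this particular example is not from the lecture), consider a grow-only counter: each replica increments only its own entry, and two replica states are merged by taking an element-wise maximum. Because that merge function is commutative, associative, and idempotent, any two replicas that have exchanged states converge.

class GCounter:
    # Illustrative state-based CRDT: a grow-only counter.

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}            # node id -> number of increments seen from it

    def increment(self):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + 1

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Element-wise maximum: commutative, associative, idempotent.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

a, b = GCounter("A"), GCounter("B")
a.increment(); a.increment()          # A has incremented twice
b.increment()                         # B has incremented once
a.merge(b); b.merge(a)                # exchange states in either order
assert a.value() == b.value() == 3    # both replicas converge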
In this section we will look at some algorithms that are used for this kind of collaboration. As an example,
consider the calendar sync demo in the lecture recording of Section 7.3. Two nodes initially start with the
same calendar entry. On node A, the title is changed from “Lecture” to “Lecture 1”, and concurrently
on node B the time is changed from 12:00 to 10:00. These two updates happen while the two nodes are
temporarily unable to communicate, but eventually connectivity is restored and the two nodes sync their
changes. In the outcome shown on Slide 143, the final calendar entry reflects both the change to the title
and the change to the time.
Converged state on both node A and node B:

{
  "title": "Lecture 1",
  "date": "2020-11-05",
  "time": "10:00"
}
Slide 143
This scenario is an example of conflict resolution, which occurs whenever several concurrent writes to
the same object need to be integrated into a single final state (see also Slide 95). Conflict-free replicated
data types, or CRDTs for short, are a family of algorithms that perform such conflict resolution [Shapiro
et al., 2011]. A CRDT is a replicated object that an application accesses through the object-oriented
interface of an abstract datatype, such as a set, list, map, tree, graph, counter, etc.
Slide 144 shows an example of a CRDT that provides a map from keys to values. The application
can invoke two types of operation: reading the value for a given key, and setting the value for a given
key (which adds the key if it is not already present).
The local state at each node consists of the set values containing (timestamp, key, value) triples.
Reading the value for a given key is a purely local operation that only inspects values on the current node,
and performs no network communication. The algorithm preserves the invariant that values contains at
most one element for any given key. Therefore, when reading the value for a key, the value is unique if
it exists.
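Based only on the description above (the full algorithm appears on Slide 144), a sketch of such a map might look like the following; the operation-based design, the broadcast layer, and the timestamp format are assumptions rather than the exact algorithm from the slide. A set request is broadcast to all replicas (including the sender), and each replica keeps, for every key, only the entry with the greatest timestamp.

class LWWMap:
    # Hedged sketch of a map CRDT whose state is a set of (timestamp, key, value)
    # triples, with at most one triple per key.

    def __init__(self, broadcast):
        self.values = set()          # set of (timestamp, key, value) triples
        self.broadcast = broadcast   # assumed reliable (e.g. causal) broadcast

    def get(self, key):
        # Purely local read; the value is unique if present, by the invariant.
        for (_, k, v) in self.values:
            if k == key:
                return v
        return None

    def set(self, key, value, timestamp):
        # Application request: disseminate the update to every replica, including
        # this one; `timestamp` is assumed to come from a logical clock.
        self.broadcast((timestamp, key, value))

    def on_deliver(self, message):
        # Called when the broadcast delivers an update to this replica.
        t, k, v = message
        previous = {(t2, k2, v2) for (t2, k2, v2) in self.values if k2 == k}
        # Keep the incoming triple only if it is newer than anything held for this
        # key, preserving the at-most-one-triple-per-key invariant.
        if all(t2 < t for (t2, _, _) in previous):
            self.values = (self.values - previous) | {(t, k, v)}

# Trivial single-replica check (a real deployment would broadcast to all replicas):
sent = []
m = LWWMap(broadcast=sent.append)
m.set("title", "Lecture 1", timestamp=2)
m.on_deliver(sent.pop())
assert m.get("title") == "Lecture 1"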