07 Replication
Replication is
! Replication of data: the maintenance of copies of data at multiple computers
! Object replication: the maintenance of copies of whole server objects at multiple computers
Replication can provide
! Performance enhancement, scalability
Remember caching: improve performance by storing data locally, but the cached data may be incomplete
Additionally: several web servers can have the same DNS name. The servers are selected by DNS in turn to share the load
Replication of read-only data is simple, but replication of frequently changing data causes overhead in providing current data
Introduction to Replication
! Increased availability
Sometimes needed: a service should be available nearly 100% of the time
In case of server failures: simply contact another server with the same data items
Network partitions and disconnected operations: availability of data if the connection to a server is lost. But: after re-establishing the connection, (conflicting) data updates have to be resolved
! Fault-tolerant services
Guaranteeing correct behaviour in spite of certain faults (can include timeliness)
If f in a group of f+1 servers crash, then 1 remains to supply the service
If f in a group of 2f+1 servers have byzantine faults, the group can still supply a correct service
When to do Replication?
Replication at service initialisation
Try to estimate the number of servers needed from a customer's specification regarding performance, availability, fault-tolerance, etc.
Choose places to deposit data or objects
Example: root servers in DNS
Replication 'on demand'
When failures or performance bottlenecks occur, make a new replica, possibly placed at a new location in the network
Example: web server of an online shop
Or: to improve local access operations, place a copy near a client
Example: (DNS) caching, disconnected operations
It is useful to consider objects instead of only considering data. Objects have the benefit of encapsulating data and the operations on that data; thus, object-specific operation requests can be distributed. But: now one has to consider internal object states! This topic is related to mobile agents, but it becomes more complicated when internal states must be kept consistent!
System Model
Each logical object is implemented by a collection of physical copies called replicas (the replicas are not necessarily consistent all the time; some may have received updates not yet delivered to the others)
Assumption: asynchronous system where processes fail only by crashing and generally no network partitions occur
Replica managers
! Contain replicas on a computer and access them directly
! Replica managers apply operations to replicas recoverably, i.e. they do not leave inconsistent results if they crash
! Static systems are based on a fixed set of replica managers
! In a dynamic system, replica managers may join or leave (e.g. when they crash)
! A replica manager can be a state machine which has the following properties (see the sketch below):
a) Operations are applied atomically
b) The current state is a deterministic function of the initial state and the operations applied
c) All replicas start identical and carry out the same operations
d) The operations must not be affected by clock readings etc.
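As a minimal sketch of such a state-machine replica manager (all class and method names here are invented for illustration, not part of the lecture material): the state depends only on the initial state and the ordered sequence of operations applied.

```python
# Minimal sketch of a state-machine replica manager (invented names).
# The state is a deterministic function of the initial state and the
# sequence of applied operations; no clock readings or other
# non-deterministic inputs are allowed.

class ReplicaManager:
    def __init__(self):
        self.state = {}          # replica of the logical objects (e.g. accounts)
        self.applied_ops = []    # log of operations applied so far

    def apply(self, op):
        """Apply one operation atomically and record it."""
        kind, key, *args = op
        if kind == "write":
            self.state[key] = args[0]
            self.applied_ops.append(op)
        elif kind == "read":
            return self.state.get(key, 0)

# Two replicas that start identical and apply the same operations in the
# same order end up in the same state:
a, b = ReplicaManager(), ReplicaManager()
for op in [("write", "x", 1), ("write", "y", 2)]:
    a.apply(op)
    b.apply(op)
assert a.state == b.state
```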
Performing a Request
In general: five phases for performing a request on replicated data (a front-end view is sketched below)
Issue request. The front end either
! sends the request to a single replica manager that passes it on to the others, or
! multicasts the request to all of the replica managers
Coordination. For consistent execution, the replica managers decide
! whether to apply the request (e.g. because of failures)
! how to order the request relative to other requests (according to FIFO, causal or total ordering)
Execution. The replica managers execute the request (sometimes tentatively)
Agreement. The replica managers agree on the effect of the request, e.g. perform it 'lazily' or immediately
Response. One or more replica managers reply to the front end, which combines the results:
! For high availability, give the first response to the client
! To tolerate byzantine faults, take a vote
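A small sketch of the issue and response phases as seen by the front end, assuming trivial replica-manager stubs (all names are invented; coordination, execution and agreement are hidden inside the managers):

```python
# Sketch of the request/response phases from the front end's view
# (invented names). Phases 2-4 happen inside the replica managers.
from collections import Counter

class RM:
    def __init__(self, value): self.value = value
    def handle(self, request): return self.value   # trivial "execute"

def issue_request(replica_managers, request, fault_model="crash"):
    # Phase 1 (issue): here the request is multicast to all replica managers.
    responses = [rm.handle(request) for rm in replica_managers]
    # Phase 5 (response): combine the replies.
    if fault_model == "crash":
        return responses[0]                         # first reply suffices
    elif fault_model == "byzantine":
        value, votes = Counter(responses).most_common(1)[0]
        return value if votes > len(responses) // 2 else None  # majority vote

print(issue_request([RM(5), RM(5), RM(7)], "read x", fault_model="byzantine"))  # -> 5
```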
Group Communication
Strong requirement: replication needs multicast communication (or group communication) to distribute requests to several servers holding the data and to manage the replicated data
Required: dynamic membership in groups. That means: establish a group membership service which manages dynamic group membership, in addition to multicast communication
(Figure: group membership changes - processes may leave the group or fail.)
A client sending from outside the group does not have to know the group membership.
Replication Example
How to provide fault-tolerant services, i.e. a service that is provided correctly even if some processes fail?
Simple replication system: e.g. two replica managers A and B, each managing replicas of two accounts x and y. Clients use the local replica manager if possible; after responding to a client, the update is transmitted to the other replica manager
Client 1:               Client 2:
setBalanceB(x,1)
setBalanceA(y,2)
                        getBalanceA(y) -> 2
                        getBalanceA(x) -> 0
Strict Consistency
There are several correctness criteria for replication regarding consistency. Real world: strict consistency: Any read on a data item x returns a value corresponding to the result of the most recent write on x Problem with strict consistency: it relies on absolute global time
Initial balance of x and y is $0. Client 1 first updates x at B (local). When updating y it finds B has failed, so it uses A for the next operation. Client 2 reads the balances at A (local), but because B had failed, no update was propagated to A: x has amount 0. Be careful when designing replication algorithms! You need a consistency model
Linearisability
The strictest criterion for a replication system is linearisability
Consider a replicated service with two clients that perform read and update operations o_1i and o_2j, respectively. Communication is synchronous, i.e. a client waits for one operation to complete before starting another
Single server: serialise the operations by interleaving, e.g. o_20, o_21, o_10, o_22, o_11, o_12
In replication: virtual interleaving; a replicated shared service is linearisable if for any execution there is some interleaving of the series of operations issued by all the clients such that
! The interleaved sequence of operations meets the specification of a (single) correct copy of the objects
! The order of operations in the interleaving is consistent with the real times at which they occurred in the actual execution
Linearisability concerns only the interleaving of individual operations; it is not intended to be transactional
Sequential Consistency
Problem with linearisability: the real-time requirement is impractical to fulfil. Weaker correctness criterion: sequential consistency
A replicated shared service is sequentially consistent if for any execution there is some interleaving of the series of operations issued by all the clients such that
! The interleaved sequence of operations meets the specification of a (single) correct copy of the objects
! The order of operations in the interleaving is consistent with the program order in which each individual client executed them
Every linearisable service is also sequentially consistent, but not vice versa:
Client 1:               Client 2:
setBalanceB(x,1)
                        getBalanceA(y) -> 0
                        getBalanceA(x) -> 0
setBalanceA(y,2)

Possible under a naive replication strategy: the update at B has not yet been propagated to A when client 2 reads x.
The real-time criterion for linearisability is not satisfied, because the reading of x occurs after its writing. But both criteria for sequential consistency are satisfied with the ordering: getBalanceA(y) -> 0; getBalanceA(x) -> 0; setBalanceB(x,1); setBalanceA(y,2)
More Consistency Models
Sequential consistency is widely used, but has poor performance. Thus, there are models relaxing the consistency guarantees:
Causal consistency (weakening of sequential consistency)
! Distinction between events that are causally related and those that are not
! Example: read(x); write(y) in one process are causally related, because the value of y can depend on the value of x read before
! Consistency criterion: writes that are potentially causally related must be seen by all processes in the same order. Concurrent writes may be seen in a different order on different machines
! Necessary: keeping track of which processes have seen which write (vector timestamps; see the sketch below)
FIFO consistency (weakening of causal consistency)
! Writes done by a single process are seen by all other processes in the order in which they were issued, but writes from different processes may be seen in a different order by different processes
! Easy to implement
Further relaxations: weak consistency, release consistency, entry consistency
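Causal relationships between writes are typically tracked with vector timestamps. A small sketch (helper names invented) of how two vectors can be compared to decide whether writes are causally related or concurrent:

```python
# Sketch: comparing vector timestamps (invented helper names).
# v1 "happened before" v2 if every component of v1 is <= the corresponding
# component of v2 and the vectors differ; if neither dominates the other,
# the writes are concurrent and may be seen in different orders under
# causal consistency.

def happened_before(v1, v2):
    return all(a <= b for a, b in zip(v1, v2)) and v1 != v2

def concurrent(v1, v2):
    return not happened_before(v1, v2) and not happened_before(v2, v1)

print(happened_before([1, 0, 0], [1, 1, 0]))  # True: causally related writes
print(concurrent([1, 0, 0], [0, 1, 0]))       # True: concurrent writes
```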
Replication Approaches
In the following: replication
! for fault tolerance
! for highly available services
! in transactions
(Figure: passive (primary-backup) replication - front ends (FE) communicate with a primary replica manager (RM), which propagates updates to the backup RMs.)
Front ends only communicate with the primary replica manager, which executes the operation and sends copies of the updated data to the backups
If the primary fails, one of the backups is promoted to act as the primary
This system implements linearisability, since the primary sequences all operations on the shared objects
If the primary fails, the system remains linearisable if a single backup takes over exactly where the primary left off; view-synchronous group communication can achieve this
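A minimal sketch of the primary-backup flow (class and method names are invented): the front end talks only to the primary, which executes the request and pushes the update to the backups before replying.

```python
# Sketch of passive (primary-backup) replication; all names are invented.

class PassiveRM:
    def __init__(self):
        self.state = {}

    def update(self, key, value):
        self.state[key] = value

class Primary(PassiveRM):
    def __init__(self, backups):
        super().__init__()
        self.backups = backups

    def handle(self, key, value):
        self.update(key, value)        # 1. execute the operation locally
        for b in self.backups:         # 2. propagate the update to all backups
            b.update(key, value)
        return "ok"                    # 3. respond to the front end

backups = [PassiveRM(), PassiveRM()]
primary = Primary(backups)
primary.handle("x", 1)
# if the primary fails, a backup already holds the state and can take over
assert all(b.state == {"x": 1} for b in backups)
```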
Active Replication
The replica managers are state machines all playing the same role and organised as a group:
! A front end multicasts each request to the group of replica managers
! All replica managers start in the same state and perform the same operations in the same order so that their states remain identical (note: totally ordered reliable multicast is needed to guarantee the identical execution order!)
! If a replica manager crashes, this has no effect on the performance of the service because the others continue as normal
! Byzantine failures can be tolerated because the front end can collect and compare the replies it receives
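A sketch of why total ordering matters, using a simple sequence-number scheme (all names invented; a real system would use a totally ordered reliable multicast): each replica manager holds back operations until it can apply them in sequence-number order, so all replicas stay identical even if messages arrive in different orders.

```python
# Sketch of active replication with sequence-number-based total order
# (invented names). Every replica applies the same operations in the same
# order, so the replica states remain identical.

class ActiveRM:
    def __init__(self):
        self.state = {}
        self.next_seq = 0
        self.pending = {}          # seq -> operation, held back until in order

    def deliver(self, seq, op):
        self.pending[seq] = op
        while self.next_seq in self.pending:   # apply strictly in order
            key, value = self.pending.pop(self.next_seq)
            self.state[key] = value
            self.next_seq += 1

rms = [ActiveRM(), ActiveRM(), ActiveRM()]
ops = {0: ("x", 1), 1: ("x", 2)}
# messages may arrive in different orders at different replicas
for rm, arrival_order in zip(rms, ([0, 1], [1, 0], [0, 1])):
    for seq in arrival_order:
        rm.deliver(seq, ops[seq])
assert all(rm.state == {"x": 2} for rm in rms)   # identical final state
```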
(Figure: active replication - front ends (FE) multicast requests to the group of replica managers (RM).)
The front end converts an operation containing both reading and writing into two separate calls
Gossip
(Figure: query and update operations in the gossip service - requests from the front end (FE) carry the timestamp prev; responses carry the timestamp new or an update id.)
For ordering operations, the front end sends a timestamp prev with each request to denote its latest state. A new timestamp new is passed back in a read operation to mark the data state the client has seen last
Timestamps
Each front end keeps a vector timestamp that reflects the latest data value seen by the front end (prev)
Clients can communicate with each other. This can lead to causal relationships between client operations, which have to be considered in the replicated system
Thus, communication is made via the front ends, including an exchange of vector timestamps, allowing the front ends to consider causal ordering in their timestamps
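A sketch of how a front end could maintain its prev timestamp (all names invented): it merges the timestamp returned by each operation and the timestamps received from other front ends, element-wise, so causal ordering is preserved.

```python
# Sketch of a front end's prev timestamp in the gossip architecture
# (invented names). prev records the latest data state this front end has
# seen; it is merged element-wise with timestamps returned by replica
# managers and with timestamps received from other front ends.

def merge(ts_a, ts_b):
    return [max(a, b) for a, b in zip(ts_a, ts_b)]

class FrontEnd:
    def __init__(self, n_replicas):
        self.prev = [0] * n_replicas

    def on_reply(self, new):
        # a replica manager returns a timestamp 'new' with the result
        self.prev = merge(self.prev, new)

    def on_message_from_other_fe(self, ts):
        # client-to-client communication also carries timestamps
        self.prev = merge(self.prev, ts)

fe = FrontEnd(3)
fe.on_reply([2, 0, 0])
fe.on_message_from_other_fe([1, 1, 0])
print(fe.prev)  # [2, 1, 0]
```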
(Figure: the gossip architecture - front ends (FE) hold vector timestamps and query/update the service's replica managers (RM); each RM holds a value, a value timestamp and an executed operation table, and the RMs exchange gossip messages.)
Gossip Messages
The timestamp table contains a vector timestamp for each other replica, collected from gossip messages
A replica manager uses the entries in this timestamp table to estimate which updates another manager has not yet received. This information is sent in a gossip message
A gossip message m contains the log m.log and the replica timestamp m.ts
A manager receiving gossip message m has the following main tasks (sketched below):
! Merge the arriving log with its own
! Apply in causal order updates that are new and have become stable
! Remove redundant entries from the log and the executed operation table when it is known that they have been applied by all replica managers
! Merge its replica timestamp with m.ts, so that it corresponds to the additions in the log
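A heavily simplified sketch of these steps (all names invented; stability and causal-order checks are reduced to applying the merged log in timestamp order, and redundant-entry removal is omitted):

```python
# Sketch of processing an incoming gossip message m (invented names).
# A log entry is (ts, op), where ts is the vector timestamp assigned to
# the update by the replica manager that accepted it.

def merge_ts(a, b):
    return [max(x, y) for x, y in zip(a, b)]

class GossipRM:
    def __init__(self, n):
        self.value = {}
        self.log = []                  # updates not yet known to be everywhere
        self.replica_ts = [0] * n      # which updates this RM has received

    def receive_gossip(self, m_log, m_ts):
        # 1. merge the arriving log with the own log (ignoring duplicates)
        for entry in m_log:
            if entry not in self.log:
                self.log.append(entry)
        # 2. apply updates in timestamp order (simplified stability check)
        for ts, (key, val) in sorted(self.log):
            self.value[key] = val
        # 3. merge the replica timestamp with m.ts
        self.replica_ts = merge_ts(self.replica_ts, m_ts)

rm = GossipRM(3)
rm.receive_gossip([([1, 0, 0], ("x", 5))], [1, 0, 0])
print(rm.value, rm.replica_ts)   # {'x': 5} [1, 0, 0]
```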
Update Propagation
The given architecture does not specify when to exchange gossip messages. To design a robust system in which each update is propagated in reasonable time, an exchange strategy is needed
The time required for all replica managers to receive a given update depends on three factors:
1. The frequency and duration of network partitions - this is beyond the system's control
2. The frequency with which replica managers send gossip messages - this may be tuned to the application
3. The policy for choosing a partner with which to exchange gossip messages:
! Random policies choose a partner randomly, but with weighted probabilities so as to favour some partners over others
! Deterministic policies give fixed communication partners
! Topological policies arrange the replica managers into a fixed graph (mesh, circle, tree, ...) and messages are passed to the neighbours
! Other strategies: consider transmission latencies, fault probabilities, ...
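A random policy with weighted probabilities could look like this tiny sketch (names and weights are made up for illustration, e.g. favouring topologically close or historically reliable partners):

```python
# Sketch of a weighted random choice of gossip partner (invented names,
# made-up weights). Partners with a larger weight are chosen more often.
import random

def choose_partner(partners, weights):
    return random.choices(partners, weights=weights, k=1)[0]

partners = ["rm1", "rm2", "rm3"]
weights  = [0.6, 0.3, 0.1]        # favour rm1 over rm2 over rm3
print(choose_partner(partners, weights))
```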
Bayou is different from other approaches because it makes replication non-transparent to the application
Increased complexity for the application programmer: providing dependency checks and merge procedures
Increased complexity for the user: getting tentative data and alteration of user-specified operations
The operational transformation approach used by Bayou also appears in systems for computer-supported cooperative work (CSCW)
This approach is limited in practice to situations where only few conflicts arise, users can deal with tentative data, and the data semantics are simple
(Figure: a transaction T performs getBalance(A) on replicated data; the request is handled by the replica managers holding A.)
The simple read one/write all scheme is not realistic: it cannot be carried out if some of the replica managers are unavailable, either because they have crashed or because of a communication failure
The available copies replication scheme is designed to allow some managers to be temporarily unavailable
! A read request can be performed by any available replica manager
! Write requests are performed by the receiving manager and all other available managers in the group
As long as the set of available managers does not change, local concurrency control achieves one-copy serialisability in the same way as read one/write all
Problems occur if a manager fails or recovers during the progress of conflicting transactions
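A sketch of the available copies scheme (all names invented): reads go to any available replica manager, writes go to all currently available ones.

```python
# Sketch of the available copies scheme (invented names).
# Read: any available replica manager. Write: all available replica managers.

class AvailRM:
    def __init__(self):
        self.available = True
        self.state = {}

def read(rms, key):
    for rm in rms:
        if rm.available:
            return rm.state.get(key, 0)
    raise RuntimeError("no replica manager available")

def write(rms, key, value):
    available = [rm for rm in rms if rm.available]
    if not available:
        raise RuntimeError("no replica manager available")
    for rm in available:
        rm.state[key] = value

rms = [AvailRM(), AvailRM(), AvailRM()]
rms[1].available = False          # one manager has crashed
write(rms, "A", 100)              # performed at the two available managers
print(read(rms, "A"))             # 100, served by the first available one
```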
(Figure: available copies - replica managers X and Y hold copies of account A; M, N and P hold copies of B; a transaction performs getBalance(A) at one available manager and deposit(B,3) at the available managers holding B.)
A replica manager can fail by crashing and is replaced by a new process, which restores the state from a recovery file
Front ends use timeouts to detect a manager failure. In case of a timeout, another manager is tried
If a replica manager is recovering, it is not yet up to date and rejects requests (and the front end tries another replica manager)
For one-copy serialisability, failures and recoveries have to be serialised with respect to transactions
! A transaction observes when a failure occurs
! One-copy serialisability is not achieved if different transactions make conflicting failure observations
! In addition to local concurrency control, some global concurrency control is required to prevent inconsistent results between a read in one transaction and a write in another transaction
(Figure: available copies with failures - transaction T performs getBalance(A) and deposit(B,3), transaction U performs getBalance(B) and deposit(A,3), each via its client front end; replica managers X and Y hold A, and M, N and P hold B.)
Network partitions
Part of the network fails, creating sub-groups which cannot communicate with one another
Replication schemes assume that partitions will be repaired
! Operations done during a partition must not cause inconsistency
! Optimistic schemes (e.g. available copies with validation) allow all operations and resolve inconsistencies when a partition is repaired
! Pessimistic schemes (e.g. quorum consensus) prevent inconsistency, e.g. by limiting availability in all but one sub-group
(Figure: network partition - one client's front end performs deposit(B,3) while another, in a different partition, performs withdraw(B,4); the replica managers holding B cannot communicate across the partition.)
To prevent transactions in different partitions from producing inconsistent results, make a rule that operations can be performed in only one of the partitions
Replica managers in different partitions cannot communicate, thus each sub-group decides independently whether it can perform operations
A quorum is a sub-group of replica managers whose size gives it the right to perform operations. The right could be given by holding the majority of the replica managers in the partition
In quorum consensus schemes, update operations may be performed by a subset of the replica managers forming a quorum (see the sketch below)
! The other replica managers have out-of-date copies
! Version numbers or timestamps can be used to determine which copies are up to date
! Operations are applied only to copies with the current version number
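A sketch of quorum reads and writes with version numbers (all names invented; the quorum conditions R + W > N and W > N/2 are the usual choices that make read and write quorums overlap and rule out disjoint write quorums, they are assumed here rather than taken from the slide):

```python
# Sketch of quorum consensus (invented names). With N replicas we require
# R + W > N (every read quorum overlaps every write quorum) and W > N/2
# (no two write quorums are disjoint, avoiding write-write conflicts).

class QuorumRM:
    def __init__(self):
        self.value = None
        self.version = 0

def quorum_write(rms, W, value):
    quorum = rms[:W]                              # any W reachable managers
    new_version = max(rm.version for rm in quorum) + 1
    for rm in quorum:
        rm.value, rm.version = value, new_version

def quorum_read(rms, R):
    quorum = rms[:R]                              # any R reachable managers
    newest = max(quorum, key=lambda rm: rm.version)
    return newest.value                           # copy with highest version

N, R, W = 5, 2, 4                                 # R + W > N and W > N/2
rms = [QuorumRM() for _ in range(N)]
quorum_write(rms, W, "v1")
print(quorum_read(rms, R))                        # "v1" - the quorums overlap
```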
Quora
Three examples of the voting algorithm:
a) A correct choice of read and write set
b) A choice that may lead to write-write conflicts
c) A correct choice, corresponding to ROWA (read one, write all)
Derived performance of the file suite:
                              Example 1    Example 2    Example 3
Read    Latency                  65           75           75
        Blocking probability     0.01         0.0002       0.000001
Write   Latency                  75           100          750
        Blocking probability     0.01         0.0101       0.03