Week 3_Lecture Notes

This document discusses the leader election problem in ring topologies within distributed systems, detailing various algorithms for both anonymous and non-anonymous rings. It highlights the impossibility of leader election in anonymous rings and presents algorithms such as the LeLann-Chang-Roberts (LCR) algorithm and the Hirschberg and Sinclair (HS) algorithm, analyzing their message complexities and correctness. The document emphasizes the importance of leader election for coordinating system activities and breaking deadlocks.

Leader Election in Rings

(Classical Distributed Algorithms)

Dr. Rajiv Misra
Associate Professor
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Preface
Content of this Lecture:

In this lecture, we will discuss the leader election problem in message-passing systems for a ring topology, in which a group of processors must choose one among themselves to be the leader.

We will present different algorithms for the leader election problem, covering cases such as anonymous/non-anonymous rings, uniform/non-uniform rings, and synchronous/asynchronous rings.
Leader Election (LE) Problem: Introduction
The leader election problem has several variants.
The LE problem is for each processor to decide that it is either the leader or a non-leader, subject to the constraint that exactly one processor decides to be the leader.
The LE problem represents a general class of symmetry-breaking problems.
For example, when a deadlock is created because processors are waiting in a cycle for each other, the deadlock can be broken by electing one of the processors as a leader and removing it from the cycle.
Leader Election: Definition
Each processor has a set of elected (won) and not-elected (lost) states.
Once an elected state is entered, the processor is always in an elected state (and similarly for not-elected): i.e., the decision is irreversible.
In every admissible execution:
  every processor eventually enters either an elected or a not-elected state
  exactly one processor (the leader) enters an elected state
Uses of Leader Election
A leader can be used to coordinate activities of the system:
  find a spanning tree using the leader as the root
  reconstruct a lost token in a token-ring network
In this lecture, we will study leader election in rings.
Definition: (1) Ring Networks
In an oriented ring, processors have a consistent notion of left and right.
For example, if messages are always forwarded on channel 1, they will cycle clockwise around the ring.
Definition: (2) Anonymous Rings
How do we model the situation when processors do not have unique identifiers?
First attempt: require each processor to have the same state machine.
Definition: (3) Uniform (Anonymous) Algorithms
A uniform algorithm does not use the ring size (the same algorithm for each ring size).
  Formally, every processor in every size ring is modeled with the same state machine.
A non-uniform algorithm uses the ring size (a different algorithm for each ring size).
  Formally, for each value of n, every processor in a ring of size n is modeled with the same state machine A_n.
Note the lack of unique ids.
Impossibility: Leader Election in Anonymous Rings
Theorem: There is no leader election algorithm for anonymous rings, even if
  the algorithm knows the ring size (non-uniform) and
  the model is synchronous.

Proof Sketch:
  Every processor begins in the same state with the same outgoing messages (since anonymous).
  Every processor receives the same messages, does the same state transition, and sends the same messages in round 1.
  Ditto for rounds 2, 3, …
  Eventually some processor is supposed to enter an elected state. But then they all would.
Leader Election in Anonymous Rings
The proof sketch shows that either safety (never elect more than one leader) or liveness (eventually elect at least one leader) is violated.
Since the theorem was proved for non-uniform and synchronous rings, the same result holds for weaker (less well-behaved) models:
  uniform
  asynchronous
Rings with Identifiers
Assume each processor has a unique id.

Don't confuse indices and ids:

  indices are 0 to n - 1; used only for analysis, not available to the processors
  ids are arbitrary nonnegative integers; are available to the processors through the local variable id.
Specifying a Ring
Start with the smallest id and list ids in clockwise order.

[Figure: a ring of five processors p0-p4 with ids 3, 37, 19, 4, 25]

Example: 3, 37, 19, 4, 25
Uniform (Non-anonymous) Algorithms
Uniform algorithm: there is one state machine for every id, no matter what the ring size.
Non-uniform algorithm: there is one state machine for every id and every different ring size.
These definitions are tailored for leader election in a ring.
O(n²) Messages LE Algorithm:
LeLann-Chang-Roberts (LCR) algorithm

send value of own id to the left

when receiving an id j (from the right):
  if j > id then
    • forward j to the left (this processor has lost)
  if j = id then
    • elect self (this processor has won)
  if j < id then
    • do nothing
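The following is a minimal Python sketch of this rule (an illustration, not code from the lecture). It simulates the LCR algorithm round by round on a unidirectional ring, assuming that processor i's "left" neighbour is processor (i + 1) mod n.

def lcr_election(ids):
    """Simulate the LCR algorithm on a unidirectional ring.

    ids[i] is the unique id of processor i; processor i sends to its
    left neighbour (i + 1) % n.  Returns (leader_id, message_count).
    """
    n = len(ids)
    messages = n                                   # every processor sends its id once
    in_transit = {(i + 1) % n: [ids[i]] for i in range(n)}
    leader = None
    while leader is None:
        next_transit = {i: [] for i in range(n)}
        for i in range(n):
            for j in in_transit.get(i, []):
                if j > ids[i]:                     # forward j to the left: i has lost
                    next_transit[(i + 1) % n].append(j)
                    messages += 1
                elif j == ids[i]:                  # own id came back: i is elected
                    leader = ids[i]
                # j < ids[i]: swallow the message, do nothing
        in_transit = next_transit
    return leader, messages

# Decreasing ids are the worst case: n + (n-1) + ... + 1 messages.
print(lcr_election([4, 3, 2, 1, 0]))               # -> (4, 15)

Running it with the ids in decreasing order reproduces the worst-case message count n(n+1)/2 discussed on the next slides.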
Analysis of O(n²) Algorithm
Correctness: Elects the processor with the largest id.
  the message containing the largest id passes through every processor
Time: O(n)
Message complexity: depends on how the ids are arranged.
  largest id travels all around the ring (n messages)
  2nd largest id travels until reaching the largest
  3rd largest id travels until reaching the largest or the second largest, etc.
Analysis of O(n²) Algorithm
The worst way to arrange the ids is in decreasing order:
  2nd largest causes n - 1 messages
  3rd largest causes n - 2 messages, etc.
Total number of messages is:
  n + (n-1) + (n-2) + … + 1 = Θ(n²)
Analysis of O(n²) Algorithm
Clearly, the algorithm never sends more than O(n²) messages in any admissible execution. Moreover, there is an admissible execution in which the algorithm sends Ω(n²) messages: consider the ring where the identifiers of the processors are 0, …, n-1 and they are ordered as in Figure 3.2. In this configuration, the message of the processor with identifier i is sent exactly i+1 times. Thus the total number of messages, including the n termination messages, is n + Σ_{i=0}^{n-1} (i+1) = n + n(n+1)/2 = Θ(n²).

[Figure 3.2: Clockwise unidirectional ring]
Can We Use Fewer Messages?
The O(n²) algorithm is simple and works in both the synchronous and asynchronous models.
But can we solve the problem with fewer messages?
Idea:
  Try to have messages containing smaller ids travel a smaller distance in the ring.
O(n log n) Messages LE Algorithm:
The Hirschberg and Sinclair (HS) algorithm
To describe the algorithm, we first define the k-neighbourhood of a processor pi in the ring to be the set of processors that are at distance at most k from pi in the ring (either to the left or to the right). Note that the k-neighbourhood of a processor includes exactly 2k+1 processors.
The algorithm operates in phases; it is convenient to start numbering the phases with 0. In the kth phase a processor tries to become a winner for that phase; to be a winner, it must have the largest id in its 2^k-neighbourhood. Only processors that are winners in the kth phase continue to compete in the (k+1)st phase. Thus fewer processors proceed to higher phases, until at the end only one processor is a winner, and it is elected as the leader of the whole ring.
The HS Algorithm: Sending Messages
Phase 0

In more detail, in phase 0, each processor attempts to become a phase 0 winner and sends a <probe> message containing its identifier to its 1-neighborhood, that is, to each of its two neighbors.
If the identifier of the neighbor receiving the probe is greater than the identifier in the probe, it swallows the probe; otherwise, it sends back a <reply> message.
If a processor receives a reply from both its neighbors, then the processor becomes a phase 0 winner and continues to phase 1.
The HS Algorithm: Sending Messages
Phase k
In general, in phase k, a processor pi that is a phase k-1 winner sends <probe> messages with its identifier to its 2^k-neighborhood (one in each direction). Each such message traverses 2^k processors one by one. A probe is swallowed by a processor if it contains an identifier that is smaller than its own identifier.
If the probe arrives at the last processor of the neighbourhood without being swallowed, then that last processor sends back a <reply> message to pi. If pi receives replies from both directions, it becomes a phase k winner and continues to phase k+1. A processor that receives its own <probe> message terminates the algorithm as the leader and sends a termination message around the ring.
The HS Algorithm
The pseudocode appears in Algorithm 5. Phase k for a processor corresponds to the period between its sending of a <probe> message in line 4 or 15 with third parameter k and its sending of a <probe> message in line 4 or 15 with third parameter k+1. The details of sending the termination message around the ring have been left out of the code, and only the leader terminates.
The correctness of the algorithm follows in the same manner as in the simple algorithm, because they have the same swallowing rules.
It is clear that the probes of the processor with the maximal identifier are never swallowed; therefore, this processor will terminate the algorithm as the leader. On the other hand, it is also clear that no other <probe> can traverse the whole ring without being swallowed. Therefore, the processor with the maximal identifier is the only leader elected by the algorithm.
O(n log n) Leader Election Algorithm
Each processor tries to probe successively larger neighborhoods in both directions
  the size of the neighborhood doubles in each phase
If a probe reaches a node with a larger id, the probe stops
If a probe reaches the end of its neighborhood, then a reply is sent back to the initiator
If the initiator gets back replies from both directions, then it goes to the next phase
If a processor receives a probe with its own id, it elects itself
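The phase structure above can be illustrated with a short Python sketch (not the lecture's Algorithm 5, and not a message-by-message simulation): it computes the phase winners directly and adds up the "at most 4·2^k messages per phase-k participant" bound used in the analysis that follows.

def hs_phase_winners(ids):
    """High-level sketch of the HS phase structure.

    In each phase k, the processors holding the largest id in their
    2^k-neighbourhood survive; message_bound uses the upper bound of
    4 * 2^k messages per phase-k participant from the analysis.
    """
    n = len(ids)
    winners = set(range(n))                 # everyone participates in phase 0
    phase, message_bound = 0, 0
    while len(winners) > 1:
        radius = 2 ** phase
        message_bound += 4 * radius * len(winners)
        new_winners = set()
        for i in winners:
            neighbourhood = [ids[(i + d) % n] for d in range(-radius, radius + 1)]
            if ids[i] == max(neighbourhood):      # largest id within distance 2^k
                new_winners.add(i)
        winners = new_winners
        phase += 1
    leader = ids[winners.pop()]
    return leader, message_bound + n        # + n for the termination messages

print(hs_phase_winners([3, 37, 19, 4, 25]))       # -> (37, 41)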
O(n log n) Leader Election Algorithm

[Figure: processor pi sends probe messages to successively doubling neighborhoods in both directions; replies return from the boundary of the neighborhood in each phase]
Analysis of O(n log n) Leader Election Algorithm
Correctness:
  Similar to the O(n²) algorithm.
Message Complexity:
  Each message belongs to a particular phase and is initiated by a particular processor
  Probe distance in phase k is 2^k
  Number of messages initiated by a processor in phase k is at most 4·2^k (probes and replies in both directions)
Analysis of O(n log n) Leader Election Algorithm

How many processors initiate probes in phase k?
  For k = 0, every processor does
  For k > 0, every processor that is a "winner" in phase k - 1 does
  "winner" means it has the largest id in its 2^(k-1)-neighborhood
Analysis of O(n log n) Leader Election Algorithm

The maximum number of phase k - 1 winners occurs when they are packed as densely as possible:

[Figure: two consecutive phase k - 1 winners separated by 2^(k-1) processors]

Total number of phase k - 1 winners is at most n/(2^(k-1) + 1)
Analysis of O(n log n) Leader Election Algorithm

How many phases are there?
  At each phase the number of (phase) winners is cut approximately in half
    from n/(2^(k-1) + 1) to n/(2^k + 1)
  So after approximately log₂ n phases, only one winner is left.
    more precisely, the maximum phase is log(n-1)+1
Analysis of O(n log n) Leader Election Algorithm

The total number of messages is the sum, over all phases, of the number of winners at that phase times the number of messages originated by each such winner:

  ≤ 4n (phase 0 msgs) + n (termination msgs) + Σ_{k=1}^{log(n-1)+1} 4·2^k · n/(2^(k-1)+1)   (msgs for phases 1 to log(n-1)+1)
  < 8n(log n + 2) + 5n
  = O(n log n)
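For readers who want the bound spelled out, here is the same calculation as a LaTeX sketch; the only step added beyond the slide is the simplification 4·2^k/(2^(k-1)+1) ≤ 8.

\[
\underbrace{4n}_{\text{phase }0}
+ \underbrace{n}_{\text{termination}}
+ \sum_{k=1}^{\log(n-1)+1} 4\cdot 2^{k}\cdot\frac{n}{2^{k-1}+1}
\;\le\; 5n + \sum_{k=1}^{\log(n-1)+1} 8n
\;\le\; 5n + 8n\bigl(\log(n-1)+1\bigr)
\;<\; 8n(\log n + 2) + 5n \;=\; O(n\log n)
\]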
Can We Do Better?
The O(n log n) algorithm is more complicated than the O(n²) algorithm but uses fewer messages in the worst case.
It works in both the synchronous and asynchronous cases.
Can we reduce the number of messages even more?
  Not in the asynchronous model…
Lower bound for LE algorithm
But can we do better than O(n log n)?

Theorem: Any leader election algorithm for asynchronous rings whose size is not known a priori has Ω(n log n) message complexity (this holds also for unidirectional rings).

Both LCR and HS are comparison-based algorithms, i.e. they use the identifiers only for comparisons (<, >, =).
In synchronous networks, O(n) message complexity can be achieved if general arithmetic operations are permitted (non-comparison based) and if time complexity is unbounded.
Overview of LE in Rings with Ids
There exist algorithms when nodes have unique ids. We have evaluated them according to their message complexity.
Asynchronous ring:
  Θ(n log n) messages
Synchronous ring:
  Θ(n) messages under certain conditions
  otherwise Θ(n log n) messages
All bounds are asymptotically tight.
Conclusion
This lecture provided an in-depth study of the leader election problem in message-passing systems for a ring topology.
We have presented different algorithms for the leader election problem, covering cases such as anonymous/non-anonymous rings, uniform/non-uniform rings and synchronous/asynchronous rings.
Leader Election
(Ring LE & Bully LE Algorithm)

Dr. Rajiv Misra
Associate Professor
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Preface
Content of this Lecture:
In this lecture, we will discuss the underlying concepts of the 'leader election problem', which has been very useful for many variants of distributed systems, including today's cloud computing systems.
Then we will present the classical algorithms for the leader election problem, i.e. the Ring LE and Bully LE algorithms, and also discuss how election is done in some popular systems in industry, such as Google's Chubby and Apache's Zookeeper.
Need of Election
Example 1: Suppose your bank account details are replicated at a few servers, but one of these servers is responsible for receiving all reads and writes, i.e., it is the leader among the replicas.
  What if there are two leaders per customer?
  What if servers disagree about who the leader is?
  What if the leader crashes?
Each of the above scenarios leads to inconsistency.
Some more motivating examples

Example 2: A group of NTP servers: who is the root server?
Other systems that need leader election: Apache Zookeeper, Google's Chubby.
A leader is useful for coordination among distributed servers.
Leader Election Problem
In a group of processes, elect a leader to undertake special tasks, and let everyone in the group know about this leader.
What happens when a leader fails (crashes)?
  Some process detects this (using a failure detector!)
  Then what?
Goals of an election algorithm:
1. Elect only one leader among the non-faulty processes
2. All non-faulty processes agree on who the leader is
System Model

N processes.
Each process has a unique id.
Messages are eventually delivered.
Failures may occur during the election protocol.
Calling for an Election
Any process can call for an election.
A process can call for at most one election at a time.

Multiple processes are allowed to call an election simultaneously.
  All of them together must yield only a single leader.
The result of an election should not depend on which process calls for it.
Formally: Election Problem
A run of the election algorithm must always guarantee at the end:
  Safety: For all non-faulty processes p: (p's elected = (q: a particular non-faulty process with the best attribute value) or Null)
  Liveness: For all election runs: (the election run terminates) & for all non-faulty processes p: p's elected is not Null
At the end of the election protocol, the non-faulty process with the best (highest) election attribute value is elected.
  Common attribute: the leader has the highest id
  Other attribute examples: the leader has the highest IP address, or the fastest CPU, or the most disk space, or the most files, etc.
(i) Classical Algorithm: Ring Election
The Ring:

N processes are organized in a logical ring
  Similar to the ring in the Chord p2p system
  The i-th process pi has a communication channel to p(i+1) mod N
  All messages are sent clockwise around the ring.
The Ring

[Figure: a logical ring of six processes N3, N5, N6, N12, N32, N80]
The Ring Election Protocol
Any process pi that discovers the old coordinator has failed initiates an "Election" message that contains pi's own id:attr (attr: attribute). This is the initiator of the election.
When a process pi receives an "Election" message, it compares the attr in the message with its own attr.
  If the arrived attr is greater, pi forwards the message.
  If the arrived attr is smaller and pi has not forwarded an election message earlier, it overwrites the message with its own id:attr and forwards it.
  If the arrived id:attr matches that of pi, then pi's attr must be the greatest (why?), and it becomes the new coordinator. This process then sends an "Elected" message to its neighbor with its id, announcing the election result.
The Ring Election Protocol (2)
When a process pi receives an "Elected" message, it
  sets its variable elected_i ← id of the message, and
  forwards the message unless it is the new coordinator.
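A compact Python sketch of this protocol is given below (an illustration, not the lecture's code). It uses the process id itself as the attribute, as in the examples that follow, and models message delivery around the clockwise ring with a simple queue.

from collections import deque

def ring_election(ids, initiator):
    """Simulate the ring election protocol on a clockwise ring.

    ids[i] is the id (also the attribute) of process i; process i
    forwards to (i + 1) % N.  Returns (leader_id, message_count).
    """
    n = len(ids)
    elected = [None] * n
    forwarded = [False] * n          # has this process forwarded an Election msg?
    msgs = 1
    queue = deque([((initiator + 1) % n, ("Election", ids[initiator]))])
    while queue:
        i, (kind, attr) = queue.popleft()
        nxt = (i + 1) % n
        if kind == "Election":
            if attr > ids[i]:                          # larger attr: forward as-is
                queue.append((nxt, ("Election", attr)))
                msgs += 1
            elif attr < ids[i] and not forwarded[i]:   # overwrite with own id:attr
                queue.append((nxt, ("Election", ids[i])))
                msgs += 1
            elif attr == ids[i]:                       # own id returned: coordinator
                elected[i] = ids[i]
                queue.append((nxt, ("Elected", ids[i])))
                msgs += 1
            forwarded[i] = True
        else:                                          # "Elected" announcement
            if elected[i] is None:
                elected[i] = attr
                queue.append((nxt, ("Elected", attr)))
                msgs += 1
    return elected[0], msgs

# Worst case from the analysis below: the initiator is the ring successor
# of the would-be leader, giving (3N - 1) messages.
print(ring_election([80, 3, 6, 32, 5, 12], initiator=1))   # -> (80, 17)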
Ring Election: Example
Goal: Elect the highest-id process as leader.

[Figure sequence: N3 initiates the election. Its "Election: 3" message is overwritten to "Election: 32" at N32 and then to "Election: 80" at N80; "Election: 80" then circulates unchanged back to N80, which becomes the coordinator. N80 sends "Elected: 80" around the ring, and every process sets elected = 80.]
Analysis
Let's assume no failures occur during the election protocol itself, and there are N processes.
How many messages?
The worst case occurs when the initiator is the ring successor of the would-be leader.
Worst-case

[Figure: the ring with N6, the ring successor of would-be leader N80, initiating the election]
Goal: Elect the highest-id process as leader.
Worst-case Analysis
(N-1) messages for the Election message to get from the initiator (N6) to the would-be coordinator (N80)
N messages for the Election message to circulate around the ring without being changed
N messages for the Elected message to circulate around the ring
Message complexity: (3N-1) messages
Completion time: (3N-1) message transmission times
Thus, if there are no failures, the election terminates (liveness) and everyone knows about the highest-attribute process as the leader (safety).
Best Case?
The initiator is the would-be leader, i.e., N80 is the initiator.
Message complexity: 2N messages
Completion time: 2N message transmission times
Multiple Initiators?
Each process remembers (caches) the initiator of each Election/Elected message it receives.
(At all times) Each process suppresses Election/Elected messages of any lower-id initiators.
  It updates its cache if it receives a higher-id initiator's Election/Elected message.
The result is that only the highest-id initiator's election run completes.
Effect of Failures

[Figure: the ring from the example; N80 crashes. An "Elected: 80" message has reached some processes (N6 already has elected = 80) while an "Election: 80" message is still in flight.]
The Election: 80 message will circulate around the ring forever => liveness is violated.
Fixing for failures: First Option
First option: have the predecessor (or successor) of the would-be leader N80 detect the failure and start a new election run.
It may re-initiate an election if
  – it receives an Election message but times out waiting for an Elected message,
  – or after receiving the Elected: 80 message.
But what if the predecessor also fails?
And its predecessor also fails? (and so on)
Fixing for failures: Second Option
Second option: use a failure detector (FD).
  Any process, after receiving the Election: 80 message, can detect the failure of N80 via its own local failure detector.
  If so, it starts a new run of leader election.
But failure detectors may not be both complete and accurate.
  Incompleteness in the FD => N80's failure might be missed => violation of safety
  Inaccuracy in the FD => N80 mistakenly detected as failed => new election runs initiated forever => violation of liveness
Why is Election so Hard?
Because it is related to the consensus problem!
If we could solve election, then we could solve consensus!
  Elect a process and use the last bit of its id as the consensus decision.
But since consensus is impossible in asynchronous systems, so is leader election!
Consensus-like protocols such as Paxos are used in industry systems for leader election.
(ii) Classical Algorithm: Bully Algorithm
• All processes know the other processes' ids
• When a process finds the coordinator has failed (via the failure detector):
  • if it knows its id is the highest
    • it elects itself as coordinator, then sends a Coordinator message to all processes with lower identifiers. The election is completed.
  • else
    • it initiates an election by sending an Election message
• (contd…)
Bully Algorithm (2)
• else it initiates an election by sending an Election message
  • It sends it only to processes that have a higher id than itself.
  • If it receives no answer within a timeout, it calls itself leader and sends a Coordinator message to all lower-id processes. The election is completed.
  • If an answer is received, however, then there is some non-faulty higher process => so it waits for a Coordinator message. If none is received after another timeout, it starts a new election run.
• A process that receives an Election message replies with an OK message, and starts its own leader election protocol (unless it has already done so).
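A toy Python simulation of these rules follows (an illustrative sketch, not the lecture's code; timeouts are modelled simply as "no live higher-id process answered").

def bully_election(ids, alive, starter):
    """Synchronous-style toy simulation of the bully algorithm.

    ids lists the process ids, alive[i] says whether process i is up,
    and starter is the index that detects the coordinator's failure.
    """
    n = len(ids)
    coordinator = None
    pending = {starter}              # processes currently running an election
    started = set(pending)
    while pending and coordinator is None:
        next_round = set()
        for i in pending:
            higher = [j for j in range(n) if alive[j] and ids[j] > ids[i]]
            if not higher:
                # No answer within the timeout: declare self coordinator.
                coordinator = ids[i]
                break
            # Each live higher-id process replies OK and starts its own
            # election run (unless it has already done so).
            for j in higher:
                if j not in started:
                    started.add(j)
                    next_round.add(j)
        pending = next_round
    return coordinator

# N80 has crashed; N6 detects it and starts the election.
ids   = [80, 3, 6, 32, 5, 12]
alive = [False, True, True, True, True, True]
print(bully_election(ids, alive, starter=2))   # -> 32 becomes coordinator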
Bully Algorithm: Example

[Figure sequence: a lower-id process detects the failure of N80 and sends Election messages to the higher-id processes. Higher-id processes reply OK and start their own election runs while the lower-id processes wait. N32 times out waiting for N80's response, declares itself coordinator, and sends "Coordinator: N32". The election is completed.]
Failures during Election Run

[Figure sequence: further failures occur during the run. A waiting process times out without receiving the Coordinator message and starts a new election run; after another timeout it starts yet another new election run.]
Failures and Timeouts
If failures stop, eventually a leader will be elected.
How do you set the timeouts?
  Based on the worst-case time to complete an election:
  5 message transmission times if there are no failures during the run:
  1. Election from the lowest-id server in the group
  2. Answer to the lowest-id server from the 2nd highest-id process
  3. Election from the 2nd highest-id server to the highest-id server
  4. Timeout for answers at the 2nd highest-id server
  5. Coordinator message from the 2nd highest-id server
Analysis
Worst-case completion time: 5 message transmission times
  When the process with the lowest id in the system detects the failure:
  • (N-1) processes altogether begin elections, each sending messages to processes with higher ids.
  • the i-th highest-id process sends (i-1) election messages
  Number of Election messages = (N-1) + (N-2) + … + 1 = (N-1)·N/2 = O(N²)
Best case:
  The second-highest-id process detects the leader's failure
  It sends (N-2) Coordinator messages
  Completion time: 1 message transmission time
Impossibility?

Since timeouts are built into the protocol, in the asynchronous system model:
  the protocol may never terminate => liveness is not guaranteed.
But it satisfies liveness in the synchronous system model, where
  the worst-case one-way latency can be calculated as worst-case processing time + worst-case message latency.
Leader Election
(Industry Systems: Google's Chubby and Apache Zookeeper)
Use of Consensus to solve Election
One approach:
  Each process proposes a value
  Everyone in the group reaches consensus on some process Pi's value
  That lucky Pi is the new leader!
Election in Industry
Several systems in industry use Paxos-like approaches for election.
  Paxos is a consensus protocol (safe, but eventually live).
  Google's Chubby system
  Apache Zookeeper
Election in Google Chubby
A system for locking
  An essential part of Google's stack
  Many of Google's internal systems rely on Chubby: BigTable, Megastore, etc.
A group of replicas (Servers A-E)
  Need to have a master server elected at all times

Reference: http://research.google.com/archive/chubby.html
Election in Google Chubby (2)
A group of replicas needs to have a master (i.e., leader).
Election protocol:
  A potential leader tries to get votes from the other servers.
  Each server votes for at most one leader.
  The server with a majority of votes becomes the new leader and informs everyone.

[Figure: five replicas, Servers A-E, with Server D as the elected Master]
Election in Zookeeper
A centralized service for maintaining configuration information
Uses a variant of Paxos called Zab (Zookeeper Atomic Broadcast)
Needs to keep a leader elected at all times

Reference: http://zookeeper.apache.org/
Election in Zookeeper (2)
Each server creates a new sequence number for itself
  Let's say the sequence numbers are ids
  It gets the highest id so far (from the ZK (Zookeeper) file system), creates the next-higher id, and writes it into the ZK file system
Elect the highest-id server as leader.

[Figure: servers N3, N5, N6, N12, N32, N80, with the highest-id server N80 as Master]
Election in Zookeeper (3)
Failures:
  One option: everyone monitors the current master (directly or via a failure detector)
    On failure, initiate an election
    This leads to a flood of elections: too many messages

[Figure: the master N80 crashes]
Election in Zookeeper (4)
Second option (implemented in Zookeeper):
  Each process monitors its next-higher-id process
  if that successor was the leader and it has failed
    • become the new leader
  else
    • wait for a timeout, and check your successor again.

[Figure: each server monitors the server with the next-higher id]
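For concreteness, here is a sketch of this pattern using the Python kazoo client (illustrative only: the znode path /election, the node-name prefix and the ZooKeeper address are assumptions, and error handling is omitted). Following the slide, the highest sequence number is treated as the leader and each process watches its next-higher neighbour; kazoo also ships a ready-made Election recipe if you want this behaviour out of the box.

from kazoo.client import KazooClient

ELECTION_PATH = "/election"                      # assumed path for this example

def volunteer(zk):
    """Join the election by creating an ephemeral, sequential znode."""
    zk.ensure_path(ELECTION_PATH)
    return zk.create(ELECTION_PATH + "/node-", ephemeral=True, sequence=True)

def check_leadership(zk, my_znode, on_elected):
    """Highest sequence number leads; others watch their next-higher node."""
    children = sorted(zk.get_children(ELECTION_PATH))
    me = my_znode.split("/")[-1]
    higher = [c for c in children if c > me]
    if not higher:
        on_elected()                             # I hold the highest id: lead
        return
    next_higher = min(higher)                    # my immediate successor

    def watcher(event):                          # fires when the successor changes
        check_leadership(zk, my_znode, on_elected)

    # Re-check immediately if the successor vanished before the watch was set.
    if zk.exists(ELECTION_PATH + "/" + next_higher, watch=watcher) is None:
        check_leadership(zk, my_znode, on_elected)

zk = KazooClient(hosts="127.0.0.1:2181")         # assumed ensemble address
zk.start()
my_znode = volunteer(zk)
check_leadership(zk, my_znode, on_elected=lambda: print("I am the leader"))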
Conclusion
Leader election is an important component of many cloud computing systems.
Classical leader election protocols:
  Ring-based
  Bully
  But these are failure-prone.
Paxos-like protocols are used by Google Chubby and Apache Zookeeper.
Design of Zookeeper

Dr. Rajiv Misra
Associate Professor
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Preface
Content of this Lecture:

In this lecture, we will discuss the design of ZooKeeper, which is a service for coordinating processes of distributed applications.
We will discuss its basic fundamentals, design goals, architecture and applications.

https://zookeeper.apache.org/
ZooKeeper, why do we need it?
Coordination is important

Classic Distributed System

Most systems like HDFS have one master and a couple of slave nodes, and these slave nodes report to the master.
Fault Tolerant Distributed System

A real fault-tolerant distributed system has a coordination service, a master and a backup master.
If the primary fails, then the backup takes over for it.
What is a Race Condition?
A race condition occurs when two processes are competing with each other, causing data corruption.

[Figure: Person A and Person B depositing into the same bank account]

As shown in the diagram, two persons are trying to deposit 1 rupee online into the same bank account. The initial amount is 17 rupees. Due to the race condition, the final amount in the bank is 18 rupees instead of 19.
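A tiny Python sketch (illustrative, not from the lecture) reproduces the same effect with two threads performing an unsynchronized read-modify-write on a shared balance; the sleep just makes the interleaving reliable for demonstration.

import threading, time

balance = 17                          # initial amount, as in the figure

def deposit():
    global balance
    current = balance                 # read the balance
    time.sleep(0.001)                 # both threads now hold the same old value
    balance = current + 1             # write: one update overwrites the other

threads = [threading.Thread(target=deposit) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()

print(balance)                        # typically prints 18, not the expected 19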
What is a Deadlock?
When two or more processes are waiting for each other, directly or indirectly, it is called a deadlock.

[Figure: Process 1 waiting for Process 2, Process 2 waiting for Process 3, and Process 3 waiting for Process 1]

Here, Process 1 is waiting for Process 2, Process 2 is waiting for Process 3 to finish, and Process 3 is waiting for Process 1 to finish. All three processes would keep waiting and will never end. This is called a deadlock.
What is Coordination ?
How would email processors avoid reading the same emails?
Suppose there is an inbox from which we need to index emails. Indexing is a heavy process and might take a lot of time.
Here, we have multiple machines which are indexing the emails. Every email has an id. You cannot delete any email. You can only read an email and mark it read or unread.
Now how would you handle the coordination between multiple indexer processes so that every email is indexed?
What is Coordination ?
If the indexers were running as multiple threads of a single process, coordination would be easier, using the synchronization constructs of the programming language.
But since there are multiple processes running on multiple machines which need to coordinate, we need a central storage.
This central storage should be safe from all concurrency-related problems.
This central storage is exactly the role of Zookeeper.

[Figure: indexer machines coordinating through a central storage holding records of the form Email Id - Timestamp - Subject - Status]
What is Coordination ?
Group membership: a set of datanodes (tasks) belonging to the same group
Leader election: electing a leader between primary and backup
Dynamic configuration: multiple services joining, communicating and leaving (service lookup registry)
Status monitoring: monitoring various processes and services in a cluster
Queuing: one process is enqueuing items and another is consuming them
Barriers: all processes reaching the barrier and then leaving the barrier
Critical sections: which process will go into the critical section, and when?
What is ZooKeeper ?
ZooKeeper is a highly reliable distributed coordination kernel, which can be used for distributed locking, configuration management, leader election, work queues, ….
Zookeeper is a replicated service that holds the metadata of distributed applications.
Key attributes of such data:
  Small size
  Performance sensitive
  Dynamic
  Critical
In very simple words, it is a central key-value store using which distributed systems can coordinate. Since it needs to be able to handle the load, Zookeeper itself runs on many machines.
What is ZooKeeper ?
Exposes a simple set of primitives
Very easy to program
Uses a data model like a directory tree
Used for:
  Synchronisation
  Locking
  Maintaining configuration
A coordination service that does not suffer from:
  Race conditions
  Deadlocks
Design Goals: 1. Simple
A shared hierarchical namespace that looks like a standard file system
The namespace has data nodes - znodes (similar to files/dirs)
Data is kept in-memory
  Achieves high throughput and low latency numbers
High performance
  Used in large, distributed systems
Highly available
  No single point of failure
Strictly ordered access
  Synchronisation
Design Goals: 2. Replicated

• All servers have a copy of the state in memory
• A leader is elected at startup
• Followers service clients; all updates go through the leader
• Update responses are sent when a majority of servers have persisted the change
We need 2f+1 machines to tolerate f failures.
Design Goals: 2. Replicated
The client:
  Keeps a TCP connection
  Gets watch events
  Sends heartbeats
  If the connection breaks, it connects to a different server.
The servers:
  • Know each other
  • Keep an in-memory image of the state
  • Transaction logs & snapshots - persistent
Design Goals: 3. Ordered
ZooKeeper stamps each update with a number.
The number:
  Reflects the order of transactions.
  Is used to implement higher-level abstractions, such as synchronization primitives.
Design Goals: 4. Fast
Performs best where reads are more common than writes, at ratios of around 10:1.
At Yahoo!, where it was created, the throughput for a ZooKeeper cluster has been benchmarked at over 10,000 operations per second for write-dominant workloads generated by hundreds of clients.
Data Model
The way you store data in any store is called its data model.
Think of it as a highly available file system: in the case of Zookeeper, think of the data model as a highly available file system with a few differences.
Znode: We store data in an entity called a znode.
JSON data: The data that we store should be in JSON format, which is JavaScript Object Notation.
No append operation: A znode can only be updated. It does not support append operations.
Data access (read/write) is atomic: a read or write is an atomic operation, meaning either it completes fully or it throws an error if it failed. There is no intermediate state like half-written data.
Znode: Can have children.
Data Model Contd…
So, znodes inside znodes make a tree-like hierarchy.
The top-level znode is "/".
The znode "/zoo" is a child of "/", which is the top-level znode.
duck is a child znode of zoo. It is denoted as /zoo/duck.
However, "." and ".." are invalid characters, as opposed to the file system.
Data Model – Znode - Types
Persistent
  Such znodes remain in Zookeeper until deleted. This is the default type of znode. To create such a node you can use the command: create /name_of_myznode "mydata"
Ephemeral
  An ephemeral node gets deleted if the session in which the node was created has disconnected. Though it is tied to the client's session, it is visible to the other users.
  An ephemeral node cannot have children, not even ephemeral children.
Data Model – Znode - Types
Sequential
  Creates a node with a sequence number in the name; the number is automatically appended.

  create -s /zoo v                create -s /zoo/ v
  Created /zoo0000000008          Created /zoo/0000000003

  create -s /xyz v                create -s /zoo/ v
  Created /xyz0000000009          Created /zoo/0000000004

  The counter keeps increasing monotonically.
  Each node keeps a counter.
Architecture
Zookeeper can run in two modes: (i) standalone and (ii) replicated.
(i) Standalone:
  In standalone mode, it runs on just one machine; for practical purposes we do not use standalone mode.
  It is only for testing purposes.
  It doesn't have high availability.
(ii) Replicated:
  Runs on a cluster of machines called an ensemble.
  High availability.
  Tolerates failures as long as a majority of machines is up.
Architecture: Phase 1
Phase 1: Leader election (Paxos algorithm) within the ensemble
  The machines elect a distinguished member - the leader.
  The others are termed followers.
  This phase is finished when a majority have synced their state with the leader.
  If the leader fails, the remaining machines hold an election; this takes about 200 ms.
  If a majority of the machines isn't available at any point of time, the leader automatically steps down.
Architecture: Phase 2
Phase 2: Atomic broadcast
  All write requests are forwarded to the leader.
  The leader broadcasts the update to the followers.
  When a majority have persisted the change:
    The leader commits the update.
    The client gets a success response.
  The protocol for achieving consensus is atomic, like two-phase commit.
  Machines write to disk before updating the in-memory state.

[Figure: a client sends a Write to the leader; the leader forwards the update to the followers; once 3 out of 4 servers have saved it, the client receives "Write Successful"]
Election Demo
If you have three nodes A, B, C with A as Leader, and A dies, will someone become leader?
Yes. Either B or C.
Election Demo
If you have three nodes A, B, C, and A and B die, will C become leader?
No one will become leader; C will become a follower.
Reason: a majority is not available.
Why do we need majority?
Imagine we have an ensemble spread over two data centres.
Now imagine the network between the data centres gets disconnected. If we did not need a majority for electing a leader, what would happen?
Each data centre would have its own leader: no consistency, and utter chaos. That is why a majority is required.
Sessions
Let's try to understand how Zookeeper decides to delete ephemeral nodes and takes care of session management.
A client has a list of servers in the ensemble.
  It tries each until successful.
The server creates a new session for the client.
A session has a timeout period, decided by the caller.
Contd…
If the server hasn't received a request within the timeout period, it may expire the session.
  On session expiry, ephemeral nodes are lost.
To keep sessions alive, the client sends pings (heartbeats).
  The client library takes care of heartbeats.
Sessions are still valid on switching to another server.
  Failover is handled automatically by the client.
The application can't remain agnostic of server reconnections, because operations will fail during disconnection.
States

[Figure: client session states and their transitions]
Use Case: Many Servers - How Do They Coordinate?
Let us say there are many servers which can respond to your request, and there are many clients which might want the service.

[Figure: many servers and many clients communicating through Zookeeper]
From time to time, some of the servers will go down. How can all of the clients keep track of the available servers?
It is very easy using Zookeeper as a central agency. Each server creates its own ephemeral znode under a particular znode, say "/servers". The clients simply query Zookeeper for the most recent list of servers.

[Figure: clients asking Zookeeper "Available servers?"]
Let's take the case of two servers and a client. The two servers, duck and cow, create their ephemeral znodes under the "/servers" znode. The client simply discovers the alive servers cow and duck using the command ls /servers.
Say the server called "duck" goes down: its ephemeral znode disappears from the /servers znode, and hence the next time the client comes and queries, it will only get "cow". So the coordination has been greatly simplified and made efficient because of ZooKeeper.
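The same pattern in code, using the Python kazoo client (an illustrative sketch; the znode names /servers and duck follow the example above, and the ZooKeeper address is an assumption):

from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")    # assumed ensemble address
zk.start()
zk.ensure_path("/servers")

# --- on each server: register an ephemeral znode tied to this session ---
zk.create("/servers/duck", b"host=duck:9000", ephemeral=True)

# --- on a client: ask for the current list of live servers (like `ls /servers`)
print(zk.get_children("/servers"))          # e.g. ['duck', 'cow']

# The callback runs whenever the set of live servers changes, so the
# client notices when "duck" disappears.
@zk.ChildrenWatch("/servers")
def on_change(children):
    print("alive servers:", children)

kazoo's ChildrenWatch re-arms the underlying one-shot ZooKeeper watch automatically, which is why the callback keeps firing on every change.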
Guarantees
Sequential consistency
  Updates from any particular client are applied in the order they were sent.
Atomicity
  Updates either succeed or fail.
Single system image
  A client will see the same view of the system regardless of the server it connects to; a new server will not accept the connection until it has caught up.
Durability
  Once an update has succeeded, it will persist and will not be undone.
Timeliness
  Rather than allow a client to see very stale data, a server will shut down.
Operations

OPERATION              DESCRIPTION
create                 Creates a znode (the parent znode must exist)
delete                 Deletes a znode (it must not have children)
exists / ls            Tests whether a znode exists & gets its metadata
getACL, setACL         Gets/sets the ACL for a znode
getChildren / ls       Gets a list of the children of a znode
getData/get, setData   Gets/sets the data associated with a znode
sync                   Synchronizes a client's view of a znode with ZooKeeper
Multi Update

Batches multiple operations together
Either all fail or all succeed in their entirety
Makes it possible to implement transactions
Others never observe any inconsistent state
APIs
Two core bindings: Java & C
contrib: Perl, Python, REST
For each binding, sync and async versions are available.

Sync:
  public Stat exists(String path, Watcher watcher) throws KeeperException, InterruptedException

Async:
  public void exists(String path, Watcher watcher, StatCallback cb, Object ctx)
Watches
Watches allow clients to get notifications when a znode changes in some way.
Watchers are triggered only once.
For multiple notifications, re-register.
Watch Triggers
The read operations exists, getChildren and getData may have watches.
Watches are triggered by the write operations: create, delete and setData.
ACL (Access Control List) operations do not participate in watches.

WATCH OF …     … IS TRIGGERED WHEN THE ZNODE IS …
exists         created, deleted, or its data updated
getData        deleted or has its data updated
getChildren    deleted, or any of its children is created or deleted
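For illustration, a getData-style watch in the Python kazoo client looks like the sketch below (the path /config/app and the callback name are made up for this example). Raw ZooKeeper watches are one-shot, exactly as stated above, so the callback re-registers the watch to keep receiving notifications.

from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")    # assumed ensemble address
zk.start()
zk.ensure_path("/config/app")               # hypothetical znode for this example

def on_config_change(event):
    # Watches fire only once, so re-read the data and set a new watch here.
    # (If the znode was deleted, get() raises NoNodeError; ignored in this sketch.)
    data, stat = zk.get("/config/app", watch=on_config_change)
    print("config changed, version", stat.version, ":", data)

# Triggered when /config/app is deleted or its data is updated.
data, stat = zk.get("/config/app", watch=on_config_change)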
ACLs - Access Control Lists
An ACL determines who can perform certain operations on a znode.
An ACL is the combination of
  an authentication scheme,
  an identity for that scheme,
  and a set of permissions.
Authentication schemes:
  digest - the client is authenticated by a username & password
  sasl - the client is authenticated using Kerberos
  ip - the client is authenticated by its IP address
Use Cases
• Building a reliable configuration service
• A distributed lock service
  Only a single process may hold the lock
When Not to Use?

1. To store big data, because:
  • the number of copies == the number of nodes
  • all data is loaded in RAM too
  • network load of transferring all data to all nodes
2. When extremely strong consistency is required
ZooKeeper Applications: The Fetching Service
• The Fetching Service: Crawling is an important part of a search engine, and Yahoo! crawls billions of Web documents. The Fetching Service (FS) is part of the Yahoo! crawler and it is currently in production. Essentially, it has master processes that command page-fetching processes.
• The master provides the fetchers with configuration, and the fetchers write back informing of their status and health. The main advantages of using ZooKeeper for FS are recovering from failures of masters, guaranteeing availability despite failures, and decoupling the clients from the servers, allowing them to direct their requests to healthy servers by just reading their status from ZooKeeper.
• Thus, FS uses ZooKeeper mainly to manage configuration metadata, although it also uses ZooKeeper to elect masters (leader election).
ZooKeeper Applications: Katta
Katta: a distributed indexer that uses ZooKeeper for coordination; it is an example of a non-Yahoo! application.
Katta divides the work of indexing using shards.
A master server assigns shards to slaves and tracks progress.
Slaves can fail, so the master must redistribute load as slaves come and go.
The master can also fail, so other servers must be ready to take over in case of failure. Katta uses ZooKeeper to track the status of slave servers and the master (group membership), and to handle master failover (leader election).
Katta also uses ZooKeeper to track and propagate the assignments of shards to slaves (configuration management).
ZooKeeper Applications: Yahoo! Message Broker
Yahoo! Message Broker (YMB) is a distributed publish-subscribe system. The system manages thousands of topics that clients can publish messages to and receive messages from. The topics are distributed among a set of servers to provide scalability.
Each topic is replicated using a primary-backup scheme that ensures messages are replicated to two machines to ensure reliable message delivery. The servers that make up YMB use a shared-nothing distributed architecture, which makes coordination essential for correct operation.
YMB uses ZooKeeper to manage the distribution of topics (configuration metadata), deal with failures of machines in the system (failure detection and group membership), and control system operation.
ZooKeeper Applications: Yahoo! Message Broker
The figure shows part of the znode data layout for YMB.
Each broker domain has a znode called nodes that has an ephemeral znode for each of the active servers that compose the YMB service.
Each YMB server creates an ephemeral znode under nodes with load and status information, providing both group membership and status information of YMB through ZooKeeper.

[Figure: The layout of Yahoo! Message Broker (YMB) structures in ZooKeeper]
ZooKeeper Applications: Yahoo! Message Broker
The topics directory has a child znode for each topic managed by YMB.
These topic znodes have child znodes that indicate the primary and backup server for each topic, along with the subscribers of that topic.
The primary and backup server znodes not only allow servers to discover the servers in charge of a topic, but they also manage leader election and server crashes.

[Figure: The layout of Yahoo! Message Broker (YMB) structures in ZooKeeper]
More Details

See: https://zookeeper.apache.org/

Conclusion
ZooKeeper takes a wait-free approach to the problem of coordinating processes in distributed systems, by exposing wait-free objects to clients.
ZooKeeper achieves throughput values of hundreds of thousands of operations per second for read-dominant workloads by using fast reads with watches, both of which are served by local replicas.
In this lecture, we discussed the basic fundamentals, design goals, architecture and applications of ZooKeeper.