Week 3: Lecture Notes
Dr. Rajiv Misra
Associate Professor
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Cloud Computing and Distributed Systems
Leader Election in Rings
Preface
Content of this Lecture:
In this lecture, we will discuss the leader election problem in message-passing systems for a ring topology, in which a group of processors must choose one among them to be a leader.
We will present different algorithms for the leader election problem, considering cases such as anonymous/non-anonymous rings, uniform/non-uniform rings, and synchronous/asynchronous rings.
In the leader election (LE) problem, the processors must choose exactly one among them to be the leader.
The LE problem represents a general class of symmetry-breaking problems.
For example, when a deadlock is created because processors wait in a cycle for each other, the deadlock can be broken by electing one of the processors as a leader and removing it from the cycle.
Once an elected state is entered, the processor is always in an elected state (and similarly for not-elected): i.e., the decision is irreversible.
In every admissible execution:
every processor eventually enters either an elected or a not-elected state
exactly one processor (the leader) enters an elected state
Uses of Leader Election
A leader can be used to coordinate activities of the
system:
find a spanning tree using the leader as the root
reconstruct a lost token in a token-ring network
In this lecture, we will study leader election in rings.
For example, if messages are always forwarded on channel 1, they will cycle clockwise around the ring.
First attempt: require each processor to have the same state machine.
A non-uniform algorithm uses the ring size (a different algorithm for each ring size).
Formally, for each value of n, every processor in a ring of size n is modeled with the same state machine A_n.
Theorem: there is no leader election algorithm for anonymous rings (even synchronous and non-uniform).
Proof Sketch:
Every processor begins in the same state with the same outgoing messages (since anonymous).
Every processor receives the same messages, does the same state transition, and sends the same messages in round 1.
Ditto for rounds 2, 3, …
Eventually some processor is supposed to enter an elected state. But then they all would.
Since the theorem was proved for non-uniform and synchronous rings, the same result holds for weaker (less well-behaved) models:
uniform
asynchronous
indices are 0 to n - 1; used only for analysis, not available to the processors
ids are arbitrary nonnegative integers; they are available to the processors through the local variable id.
[Figure: a ring of processors p0–p4 with arbitrary ids such as 37, 3, 19, 25, 4.]
Non-uniform algorithm: there is one state machine for every id and every different ring size.
These definitions are tailored for leader election in a ring.
Each processor sends its own id to the left. When a processor receives an id j from the right:
if j > id then
• forward j to the left (this processor has lost)
if j = id then
• elect self (this processor has won)
if j < id then
• do nothing
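The rules above can be exercised with a small round-based simulation. This is a sketch, not the lecture's pseudocode: it assumes a synchronous ring in which p_i forwards to p_(i+1) mod n (the direction label is just a convention), and it tallies messages to illustrate the worst case.

```python
# Round-based simulation of the O(n^2) unidirectional algorithm (LCR-style).
def elect_leader(ids):
    """ids[i] is the unique id of processor p_i; returns (leader id, message count)."""
    n = len(ids)
    in_transit = list(enumerate(ids))      # (current position, id value) pairs
    messages = 0
    leader = None
    while leader is None:
        next_round = []
        for pos, j in in_transit:
            dest = (pos + 1) % n           # forward one hop around the ring
            messages += 1
            if j == ids[dest]:
                leader = j                 # own id came back: this processor wins
            elif j > ids[dest]:
                next_round.append((dest, j))   # receiver has lost; keep forwarding
            # if j < ids[dest]: the message is swallowed (do nothing)
        in_transit = next_round
    return leader, messages

# Worst case: ids decrease in the direction of forwarding -> n + (n-1) + ... + 1 messages.
print(elect_leader([4, 3, 2, 1, 0]))       # (4, 15)
```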
Time: O(n)
Message complexity: depends on how the ids are arranged.
The largest id travels all around the ring (n messages).
The 2nd largest id travels until reaching the largest.
The 3rd largest id travels until reaching the largest or the second largest, etc.
Worst case: the total number of messages is
n + (n-1) + (n-2) + … + 1 = n(n+1)/2 = Θ(n^2)
Consider the ring where the identifiers of the processors are 0, …, n-1 and they are ordered as in Figure 3.2 (a clockwise unidirectional ring). In this configuration, the message of the processor with identifier i is sent exactly i+1 times.
Thus the total number of messages, including the n termination messages, is n + Σ_{i=0}^{n-1} (i+1) = n + n(n+1)/2 = Θ(n^2).
Idea:
Try to have messages containing smaller ids travel a smaller distance in the ring.
Define the k-neighborhood of a processor pi to be the set of processors at distance at most k from pi in the ring (in either direction); it includes exactly 2k+1 processors.
The algorithm operates in phases; it is convenient to start numbering the phases with 0. In the k-th phase a processor tries to become a winner for that phase; to be a winner, it must have the largest id in its 2^k-neighborhood. Only processors that are winners in the k-th phase continue to compete in the (k+1)-st phase. Thus fewer processors proceed to higher phases, until at the end only one processor is a winner and it is elected as the leader of the whole ring.
In phase 0, each processor attempts to become a phase 0 winner and sends a <probe> message containing its identifier to its 1-neighborhood, that is, to each of its two neighbors.
If the identifier of the neighbor receiving the probe is greater than the identifier in the probe, it swallows the probe; otherwise, it sends back a <reply> message.
If a processor receives a reply from both its neighbors, then the processor becomes a phase 0 winner and continues to phase 1.
In general, in phase k, a processor pi that is a phase k-1 winner sends <probe> messages with its identifier to its 2^k-neighborhood (one in each direction). Each such message traverses 2^k processors one by one. A probe is swallowed by a processor if it contains an identifier that is smaller than its own identifier.
If the probe arrives at the last processor of the neighborhood without being swallowed, then that last processor sends back a <reply> message to pi. If pi receives replies from both directions, it becomes a phase k winner, and it continues to phase k+1. A processor that receives its own <probe> message terminates the algorithm as the leader and sends a termination message around the ring.
The HS Algorithm
The pseudocode appears in Algorithm 5. Phase k for a processor corresponds to the period between its sending of a <probe> message in line 4 or 15 with third parameter k and its sending of a <probe> message in line 4 or 15 with third parameter k+1. The details of sending the termination message around the ring have been left out in the code, and only the leader terminates.
If a probe reaches a node with a larger id, the probe stops (it is swallowed).
If a probe reaches the end of its neighborhood, then a reply is sent back to the initiator.
If the initiator gets back replies from both directions, then it goes on to the next phase.
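The following is a minimal simulation of these probe/reply rules in Python. It is a sketch under stated assumptions, not the book's Algorithm 5: unique ids, a bidirectional ring indexed by position, and a message tally that is only meant to illustrate the O(n log n) behaviour.

```python
# Simulation of the HS (Hirschberg-Sinclair) phase structure on a bidirectional ring.
def hs_election(ids):
    """ids: unique processor ids in ring order; returns (leader id, ~message count)."""
    n = len(ids)
    contenders = list(range(n))            # every processor competes in phase 0
    messages = 0
    k = 0
    while True:
        winners = []
        for p in contenders:
            survived_both = True
            for step in (+1, -1):          # probe clockwise and counterclockwise
                swallowed = False
                hops = 0
                for d in range(1, 2 ** k + 1):
                    q = (p + step * d) % n
                    hops += 1
                    if q == p:             # own probe came back around: p is the leader
                        return ids[p], messages + hops + n   # + n termination messages
                    if ids[q] > ids[p]:    # a larger id swallows the probe
                        swallowed = True
                        break
                if swallowed:
                    messages += hops       # probe hops only, no reply comes back
                    survived_both = False
                else:
                    messages += 2 * hops   # probe out 2^k hops + reply back 2^k hops
            if survived_both:
                winners.append(p)          # phase-k winner continues to phase k+1
        contenders = winners
        k += 1

print(hs_election([3, 7, 1, 9, 4, 2, 8, 5]))   # leader id 9, plus the message tally
```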
[Figure: probe and reply messages sent by pi to increasingly larger neighborhoods in successive phases.]
Message Complexity:
Each message belongs to a particular phase and is initiated by a particular processor.
The probe distance in phase k is 2^k.
The number of messages initiated by a processor in phase k is at most 4·2^k (probes and replies in both directions).
How many processors initiate probes in phase k?
For k = 0, every processor does.
For k > 0, every processor that is a "winner" in phase k - 1 does.
Between any two phase k - 1 winners there are at least 2^(k-1) processors (each winner has the largest id in its 2^(k-1)-neighborhood), so the total number of phase k - 1 winners is at most n/(2^(k-1) + 1).
At each phase the number of (phase) winners is cut approximately in half, from n/(2^(k-1) + 1) to n/(2^k + 1).
So after approximately log₂ n phases, only one winner is left.
More precisely, the maximum phase is log(n–1)+1.
Total number of messages is at most:
4n (phase 0 msgs) + n (termination msgs) + Σ_{k=1}^{log(n–1)+1} 4·2^k · n/(2^(k-1)+1) (msgs for phases 1 to log(n–1)+1)
< 8n(log n + 2) + 5n
= O(n log n)
Works in both the synchronous and the asynchronous case.
Can we reduce the number of messages even more?
Not in the asynchronous model…
Any leader election algorithm for asynchronous rings whose size is not known a priori has Ω(n log n) message complexity (this holds also for unidirectional rings).
Both LCR and HS are comparison-based algorithms, i.e. they use the identifiers only for comparisons (<, >, =).
In synchronous networks, O(n) message complexity can be achieved if general arithmetic operations are permitted (non-comparison based) and if time complexity is unbounded.
In summary:
Asynchronous ring: Θ(n log n) messages
Synchronous ring: Θ(n) messages under certain conditions, otherwise Θ(n log n) messages
All bounds are asymptotically tight.
We have presented different algorithms for the leader election problem, considering cases such as anonymous/non-anonymous rings, uniform/non-uniform rings, and synchronous/asynchronous rings.
Dr. Rajiv Misra
Associate Professor
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Cloud Computing and Distributed Systems
Leader Election (Ring LE & Bully LE)
Preface
Content of this Lecture:
In this lecture, we will discuss the underlying concepts of the 'leader election problem', which has been very useful for many variants of distributed systems including today's Cloud Computing systems.
Then we will present the classical algorithms for the leader election problem, i.e. the Ring LE and Bully LE algorithms, and also discuss how election is done in some of the popular systems in industry, such as Google's Chubby and Apache's Zookeeper.
Example: a group of servers replicating a customer's data needs one leader among the replicas.
What if there are two leaders per customer?
What if servers disagree about who the leader is?
What if the leader crashes?
Each of the above scenarios leads to inconsistency.
Other systems that need leader election: Apache Zookeeper, Google's Chubby.
A leader is useful for coordination among distributed servers.
What happens when a leader fails (crashes)?
Some process detects this (using a failure detector!)
Then what?
Goals of Election algorithm:
1. Elect one leader only among the non-faulty processes
2. All non-faulty processes agree on who is the leader
System model:
N processes.
Each process has a unique id.
Messages are eventually delivered.
Failures may occur during the election protocol.
Multiple processes are allowed to call an election simultaneously.
All of them together must yield only a single leader.
The result of an election should not depend on which process calls for it.
Safety: for all non-faulty processes p: p's elected = (a particular non-faulty process with the best attribute value) or Null.
Liveness: for all election runs: (the election run terminates) & for all non-faulty processes p: p's elected is not Null.
At the end of the election protocol, the non-faulty process with the best (highest) election attribute value is elected.
Common attribute: the leader has the highest id.
Other attribute examples: the leader has the highest IP address, or fastest CPU, or most disk space, or most number of files, etc.
The N processes are organized in a logical ring, similar to the ring in the Chord p2p system.
The i-th process pi has a communication channel to p(i+1) mod N.
All messages are sent clockwise around the ring.
[Figure: a logical ring of processes N3, N5, N6, N12, N32, N80.]
Any process pi that detects the coordinator has failed initiates an "Election" message containing pi's own id:attr; this is the initiator of the election. When a process pi receives an "Election" message, it compares the attr in the message with its own attr.
If the arrived attr is greater, pi forwards the message.
If the arrived attr is smaller and pi has not forwarded an election message earlier, it overwrites the message with its own id:attr, and forwards it.
If the arrived id:attr matches that of pi, then pi's attr must be the greatest (why?), and it becomes the new coordinator. This process then sends an "Elected" message to its neighbor with its id, announcing the election result.
The Ring Election Protocol (2)
When a process pi receives an "Elected" message, it
sets its variable electedi ← id of the message, and
forwards the message unless it is the new coordinator.
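To make the two handlers concrete, here is a per-process sketch in Python. The transport (a send() callable that delivers to the clockwise neighbour), the failure detector that triggers start_election, and the use of the process id itself as the attr are illustrative assumptions, not prescribed by the lecture.

```python
# Per-process sketch of the Ring LE handlers described above.
class RingProcess:
    def __init__(self, my_id, send_to_successor):
        self.id = my_id
        self.elected = None
        self.forwarded = False           # have we already forwarded an Election message?
        self.send = send_to_successor    # callable: deliver a message to p_(i+1) mod N

    def start_election(self):            # called when the coordinator's failure is detected
        self.forwarded = True
        self.send(("Election", self.id))

    def on_message(self, kind, attr):
        if kind == "Election":
            if attr > self.id:
                self.forwarded = True
                self.send(("Election", attr))       # larger attr: just forward it
            elif attr < self.id and not self.forwarded:
                self.forwarded = True
                self.send(("Election", self.id))    # overwrite with our own id and forward
            elif attr == self.id:
                self.elected = self.id              # our own id came back: new coordinator
                self.send(("Elected", self.id))     # announce the result around the ring
            # attr < self.id and already forwarded: suppress the message
        elif kind == "Elected":
            self.elected = attr
            if attr != self.id:
                self.send(("Elected", attr))        # forward unless we are the coordinator
```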
Example (goal: elect the highest-id process as leader; ring of N3, N5, N6, N12, N32, N80):
N3 initiates the election by sending Election:3 clockwise. Each process with a higher attribute overwrites the message with its own id, so it becomes Election:32 at N32 and Election:80 at N80, and Election:80 then travels around the ring unchanged. When N80 receives its own id back, it becomes the new coordinator and sends Elected:80 around the ring; every process sets elected = 80 and forwards the message until it returns to N80.
Analysis
Let’s assume no failures occur during the election
protocol itself, and there are N processes
How many messages?
The worst case occurs when the initiator is the ring successor of the would-be leader.
[Figure: the election initiated by the would-be leader's ring successor.]
(N-1) messages for the Election message to reach the would-be leader
N messages for the Election message to circulate around the ring without the message being changed
N messages for the Elected message to circulate around the ring
Message complexity: (3N-1) messages
Completion time: (3N-1) message transmission times
Thus, if there are no failures, the election terminates (liveness) and everyone knows about the highest-attribute process as the leader (safety).
Best Case?
Initiator is the would-be leader, i.e., N80 is the
initiator
Message complexity: 2N messages
Completion time: 2N message transmission times
Multiple initiators: (all the time) each process suppresses Election/Elected messages of any lower-id initiators.
It updates its cache if it receives a higher-id initiator's Election/Elected message.
The result is that only the highest-id initiator's election run completes.
If the would-be leader N80 crashes during the election run, the Election:80 message will circulate around the ring forever => liveness violated.
A process may re-initiate the election if it
– receives an Election message but times out waiting for an Elected message
– or after receiving the Elected:80 message
But what if the predecessor also fails?
And its predecessor also fails? (and so on)
But failure detectors may not be both complete and accurate.
Incompleteness in the FD => N80's failure might be missed => violation of safety.
Inaccuracy in the FD => N80 mistakenly detected as failed => new election runs initiated forever => violation of liveness.
Leader election is related to consensus: if we could solve election, we could solve consensus — elect a process and use its id's last bit as the consensus decision.
But since consensus is impossible in asynchronous systems, so is leader election!
Consensus-like protocols such as Paxos are used in industry systems for leader election.
Bully algorithm: when a process discovers that the coordinator has failed, then
• if it knows its id is the highest
• it elects itself as coordinator, then sends a Coordinator message to all processes with lower identifiers. Election is completed.
• else
• it initiates an election by sending an Election message to the processes with higher identifiers
• (contd…)
• if no answer is received within a timeout, the process elects itself as coordinator and sends a Coordinator message to all lower id processes. Election completed.
• if an answer is received however, then there is some non-faulty higher process => so, wait for a Coordinator message. If none is received after another timeout, start a new election run.
• A process that receives an Election message replies with an OK message, and starts its own leader election protocol (unless it has already done so).
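A compact event-handler sketch of these rules follows. The helpers send(dest, msg), set_timer(name, callback) and the peer-id list are hypothetical plumbing added for illustration; they are not part of the lecture's description.

```python
# Event-handler sketch of the bully algorithm rules above.
class BullyProcess:
    def __init__(self, my_id, peers, send, set_timer):
        self.id = my_id
        self.peers = peers              # ids of all processes in the group
        self.send = send                # send(dest_id, message)
        self.set_timer = set_timer      # set_timer(name, callback): fires after a timeout
        self.electing = False
        self.waiting = False            # waiting for a Coordinator message
        self.coordinator = None

    def on_leader_failure(self):        # failure detector reports the coordinator crashed
        higher = [p for p in self.peers if p > self.id]
        if not higher:                  # my id is the highest: election completed
            self.become_coordinator()
        else:
            self.electing = True
            for p in higher:            # ask every higher-id process to take over
                self.send(p, ("Election", self.id))
            self.set_timer("answers", self.on_answer_timeout)

    def become_coordinator(self):
        self.coordinator = self.id
        self.electing = False
        for p in self.peers:
            if p < self.id:
                self.send(p, ("Coordinator", self.id))

    def on_answer_timeout(self):
        if self.electing:               # nobody higher answered: bully my way in
            self.become_coordinator()

    def on_coordinator_timeout(self):
        if self.waiting:                # no Coordinator arrived: start a new election run
            self.on_leader_failure()

    def on_message(self, sender, kind, value):
        if kind == "Election":          # a lower-id process started an election
            self.send(sender, ("OK", self.id))
            if not self.electing:
                self.on_leader_failure()    # start my own run (unless already running)
        elif kind == "OK":              # some non-faulty higher process exists
            self.electing = False
            self.waiting = True
            self.set_timer("coordinator", self.on_coordinator_timeout)
        elif kind == "Coordinator":
            self.coordinator = value
            self.electing = False
            self.waiting = False
```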
Example (processes N3, N5, N6, N12, N32, N80; goal: elect the highest-id non-faulty process):
N12 detects the failure of the coordinator N80 and sends Election messages to the higher-id processes N32 and N80. N32 replies with OK and starts its own election run, sending Election to N80, while N12 waits for a Coordinator message. N32 times out waiting for N80's response, elects itself, and sends Coordinator:N32 to all lower-id processes; the election is completed.
If instead the would-be coordinator also fails before announcing itself, the waiting processes time out and start another new election run.
Turnaround time: 5 message transmission times if there are no failures during the run:
1. Election from the lowest id server in the group
2. Answer to the lowest id server from the 2nd highest id process
3. Election from the 2nd highest id server to the highest id
4. Timeout for answers at the 2nd highest id server
5. Coordinator message from the 2nd highest id server
Worst case: the lowest id process detects the leader's failure.
• (N-1) processes altogether begin elections, each sending messages to processes with higher ids.
• The i-th highest id process sends (i-1) Election messages.
Number of Election messages = N-1 + N-2 + … + 1 = (N-1)·N/2 = O(N^2)
Best case:
The second-highest id process detects the leader failure and sends (N-2) Coordinator messages.
Completion time: 1 message transmission time.
Impossibility?
Since timeouts are built into the protocol, liveness is not guaranteed in an asynchronous system model.
But the protocol satisfies liveness in a synchronous system model, where the worst-case one-way latency can be calculated = worst-case processing time + worst-case message latency.
Next, we look at how election is done in industry systems (Google's Chubby and Apache Zookeeper).
Leader election can be solved using consensus:
Each process proposes a value.
Everyone in the group reaches consensus on some process Pi's value.
That lucky Pi is the new leader!
Election protocols of this flavor are used in Google's Chubby system and Apache Zookeeper.
Google's Chubby system: many Google systems rely on Chubby (BigTable, Megastore, etc.).
A group of replica servers needs to have a master server elected at all times.
Reference: http://research.google.com/archive/chubby.html
Election in Chubby (Paxos-like):
A potential leader tries to get votes from the other servers.
Each server votes for at most one leader.
The server with a majority of votes becomes the new leader (master) and informs everyone.
Apache Zookeeper uses a variant of Paxos called Zab (Zookeeper Atomic Broadcast).
It needs to keep a leader elected at all times.
Reference: http://zookeeper.apache.org/
Election in Zookeeper, first approach (sequence numbers serve as ids):
Each server gets the highest id so far (from the ZK (ZooKeeper) file system), creates the next-higher id, and writes it into the ZK file system.
Elect the highest-id server as leader (master).
Then each server monitors the current master (directly or via a failure detector) and, on failure, initiates an election.
But this leads to a flood of elections: too many messages.
A better approach: each process monitors only its next-higher id process.
If that successor was the leader and it has failed
• become the new leader
else
• wait for a timeout, and check your successor again.
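Below is a rough sketch of this scheme using the kazoo Python client; the library choice, the /election path and the znode naming are assumptions, not part of the lecture, and for brevity the sketch uses a ZooKeeper watch in place of the periodic timeout-and-recheck. Each server creates a sequential ephemeral znode, treats the highest sequence number as the leader, and watches only the znode just above its own.

```python
# Sketch: leader = highest sequence number; each server watches its next-higher znode.
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()
zk.ensure_path("/election")

# My own sequential + ephemeral znode; it disappears if my session dies.
my_path = zk.create("/election/n_", b"", ephemeral=True, sequence=True)
my_seq = my_path.rsplit("_", 1)[1]

def check_role():
    children = sorted(zk.get_children("/election"),
                      key=lambda c: c.rsplit("_", 1)[1])
    seqs = [c.rsplit("_", 1)[1] for c in children]
    if my_seq == seqs[-1]:
        print("I am the leader")          # highest sequence number wins
    else:
        # Watch only the next-higher znode; when it disappears, re-check my role.
        next_higher = children[seqs.index(my_seq) + 1]
        zk.exists("/election/" + next_higher, watch=lambda event: check_role())
        print("Following; watching", next_higher)

check_role()
```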
Summary: classical leader election protocols include the Ring-based and Bully algorithms, but they are failure-prone.
Paxos-like protocols are used by Google Chubby and Apache Zookeeper.
Dr. Rajiv Misra
Associate Professor
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Cloud Computing and Distributed Systems
Design of Zookeeper
Preface
Content of this Lecture:
In this lecture, we will discuss the design of 'ZooKeeper', which is a service for coordinating processes of distributed applications.
We will discuss its basic fundamentals, design goals, architecture and applications.
https://zookeeper.apache.org/
ZooKeeper, why do we need it?
Coordination is important
Most systems, like HDFS, have one master and a couple of slave nodes; the slave nodes report to the master.
Fault Tolerant Distributed System
A real fault-tolerant distributed system has a coordination service, a master, and a backup master.
If the primary fails, the backup takes over for it.
What is a Race Condition?
A race condition occurs when two processes compete with each other, causing data corruption.
[Figure: Person A and Person B both depositing into the same bank account.]
As shown in the diagram, two persons are trying to deposit Rs. 1 online into the same bank account. The initial amount is Rs. 17. Due to the race condition, the final amount in the bank is Rs. 18 instead of Rs. 19.
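A minimal Python illustration of this race follows; the "bank balance", the thread names and the artificial delay are illustrative, added only to make the lost update easy to reproduce.

```python
# Two "depositors" read the balance, add 1, and write it back without coordination,
# so one update is lost.
import threading
import time

balance = 17                       # initial amount in the account

def deposit_one():
    global balance
    current = balance              # read the balance
    time.sleep(0.01)               # widen the window so the race is easy to hit
    balance = current + 1          # write back, possibly overwriting the other deposit

a = threading.Thread(target=deposit_one)   # Person A
b = threading.Thread(target=deposit_one)   # Person B
a.start(); b.start()
a.join(); b.join()
print(balance)                     # prints 18 instead of the expected 19
```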
What is a Deadlock?
[Figure: Process 1 waiting for Process 2, Process 2 waiting for Process 3, and Process 3 waiting for Process 1.]
Here, Process 1 is waiting for Process 2, Process 2 is waiting for Process 3 to finish, and Process 3 is waiting for Process 1 to finish. All three processes would keep waiting and will never end. This is called a deadlock.
What is Coordination ?
How would email processors avoid reading the same emails?
Suppose there is an inbox from which we need to index emails. Indexing is a heavy process and might take a lot of time.
Here, we have multiple machines which are indexing the emails. Every email has an id. You cannot delete any email; you can only read an email and mark it read or unread.
Now, how would you handle the coordination between multiple indexer processes so that every email is indexed?
On a single machine we could coordinate the indexers with the concurrency primitives of the programming language. But since there are multiple processes running on multiple machines which need to coordinate, we need a central storage.
This central storage should be safe from all concurrency-related problems. This central storage is exactly the role of ZooKeeper.
[Figure: each email tracked by Email Id, Timestamp, Subject, and Status.]
Examples of coordination tasks:
Dynamic configuration: multiple services are joining, communicating and leaving in a cluster (service lookup registry).
Status monitoring: monitoring various processes and services.
ZooKeeper is a centralized service for maintaining coordination data for distributed applications.
Key attributes of such data:
Small size
Performance sensitive
Dynamic
Critical
In very simple words, it is a central key-value store using which distributed systems can coordinate. Since it needs to be able to handle the load, ZooKeeper itself runs on many machines.
What is ZooKeeper ?
Exposes a simple set of primitives
Very easy to program
Uses a data model like directory tree
Used for:
Synchronisation
Locking
Maintaining configuration
A coordination service that does not suffer from:
Race conditions
Deadlocks
Data is kept in-memory
Achieves high throughput and low latency
High performance
Used in large, distributed systems
Highly available
No single point of failure
Strictly ordered access
Synchronisation
Design Goals: 2. Replicated
• All servers have a copy of the state in memory
• A leader is elected at startup
• Followers service clients; all updates go through the leader
• Update responses are sent when a majority of servers have persisted the change
We need 2f+1 machines to tolerate f failures (a majority of f+1 machines then remains available).
A client connects to one of the servers and sends heartbeats; if the connection breaks, it connects to a different server.
The servers:
• know each other
• keep an in-memory image of state
• keep transaction logs & snapshots in persistent storage
ZooKeeper stamps each update with a number that reflects the order of all transactions. This ordering can be used to implement higher-level abstractions, such as synchronization primitives.
At Yahoo!, where it was created, the throughput for a ZooKeeper cluster has been benchmarked at over 10,000 operations per second for write-dominant workloads generated by hundreds of clients.
Znode: we store data in an entity called a znode.
JSON data: the data that we store should be in JSON format, i.e. JavaScript Object Notation.
No append operation: a znode can only be updated; it does not support append operations.
Data access (read/write) is atomic: a read or write either completes in full or throws an error if it fails; there is no intermediate state like a half-written value.
Znode: can have children.
The znode "/zoo" is a child of "/", which is the top-level znode.
duck is a child znode of zoo; it is denoted as /zoo/duck.
Znode types: persistent and ephemeral.
An ephemeral node gets deleted if the session in which the node was created has disconnected. Though it is tied to the client's session, it is visible to other users.
An ephemeral node cannot have children, not even ephemeral children.
Sequential znodes: when created with the sequential flag (-s), ZooKeeper appends a monotonically increasing counter to the znode name; the counter is unique within the parent znode:
create -s /zoo v      →  Created /zoo0000000008
create -s /xyz v      →  Created /xyz0000000009
create -s /zoo/ v     →  Created /zoo/0000000003
create -s /zoo/ v     →  Created /zoo/0000000004
(i) Standalone:
In standalone mode, ZooKeeper runs on just one machine; for practical purposes we do not use standalone mode.
It is only for testing purposes and does not have high availability.
(ii) Replicated:
Runs on a cluster of machines called an ensemble.
High availability; tolerates failures as long as a majority of the ensemble is up.
Architecture: Phase 1
Phase 1: Leader election (Paxos-like algorithm) within the ensemble.
The machines elect a distinguished member, the leader; the others are termed followers.
This phase is finished when a majority of the machines have synced their state with the leader.
If the leader fails, the remaining machines hold an election; it takes about 200 ms.
If a majority of the machines isn't available at any point of time, the leader automatically steps down.
Architecture: Phase 2
Phase 2: Atomic broadcast.
All write requests are forwarded to the leader; the leader broadcasts the update to the followers.
When a majority have persisted the change:
the leader commits the update, and
the client gets a success response.
The protocol for achieving consensus is atomic, like two-phase commit.
Machines write to disk before updating the in-memory state.
[Figure: a client's write is reported successful once 3 out of 4 servers (the leader and its followers) have saved the change.]
Election Demo
If you have three nodes A, B, C with A as leader, and A dies, will someone become leader?
Yes. Either B or C (a majority, 2 out of 3, is still available).
If both A and B die, will C become the leader?
No one will become leader; C will remain a follower.
Reason: a majority is not available.
Why do we need majority?
Imagine: We have an ensemble spread over two data centres.
If the network between the two data centres fails and a leader could be elected without a majority, each data centre could elect its own leader, and their states would diverge. Requiring a majority ensures that at most one side of a partition can elect a leader.
Sessions:
A client has a list of the servers in the ensemble; it tries each until successful.
The server creates a new session for the client.
A session has a timeout period, decided by the caller.
To keep sessions alive, the client sends pings (heartbeats); the client library takes care of heartbeats.
Sessions are still valid on switching to another server; failover is handled automatically by the client.
The application can't remain agnostic of server reconnections, because operations will fail during disconnection.
Use Case: Many Servers How do they Coordinate?
From time to time some of the servers will go down. How can all of the clients keep track of the available servers?
[Figure: many servers and many clients coordinating through ZooKeeper.]
Use Case: Many Servers How do they Coordinate?
It is very easy using ZooKeeper as a central agency. Each server will create its own ephemeral znode under a particular znode, say "/servers". The clients would simply query ZooKeeper for the most recent list of available servers.
Use Case: Many Servers How do they Coordinate?
Let's take the case of two servers and a client. The two servers, duck and cow, created their ephemeral nodes under the "/servers" znode. The client would simply discover the alive servers cow and duck using the command ls /servers.
Say the server called "duck" goes down: its ephemeral node will disappear from the /servers znode, and hence the next time the client comes and queries, it would only get "cow". So the coordination has been greatly simplified and made efficient because of ZooKeeper.
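A minimal sketch of this pattern with the kazoo Python client follows; the library choice, the host address, the /servers path and the server name "duck" are illustrative assumptions. Servers register ephemeral znodes, and the client keeps an up-to-date view of the children.

```python
# Each server registers an ephemeral znode under /servers; clients watch the list.
from kazoo.client import KazooClient

# --- on a server (e.g. "duck") ---
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()
zk.ensure_path("/servers")
zk.create("/servers/duck", b"duck:9090", ephemeral=True)   # vanishes if this server dies

# --- on a client ---
client = KazooClient(hosts="127.0.0.1:2181")
client.start()

def on_change(children):
    print("alive servers:", children)    # re-invoked whenever the list changes

# ChildrenWatch re-registers the watch automatically after each notification.
client.ChildrenWatch("/servers", on_change)
# (a real client would now stay alive and react to membership changes)
```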
Guarantees
Sequential consistency
Updates from any particular client are applied in the order they are sent.
Atomicity
Updates either succeed or fail.
Single system image
A client will see the same view of the system regardless of the server it connects to; a new server will not accept the connection until it has caught up.
Durability
Once an update has succeeded, it will persist.
OPERATION              | DESCRIPTION
create                 | Creates a znode (the parent znode must exist)
delete                 | Deletes a znode (it must not have children)
exists/ls              | Tests whether a znode exists and gets its metadata
getACL, setACL         | Gets/sets the ACL for a znode
getChildren/ls         | Gets a list of the children of a znode
getData/get, setData   | Gets/sets the data associated with a znode
sync                   | Synchronizes a client's view of a znode with ZooKeeper
Multi-update: batches multiple operations together.
Either all fail or all succeed in entirety.
Possible to implement transactions.
Others never observe any inconsistent state.
APIs have both synchronous and asynchronous versions.
Synch: public Stat exists(String path, Watcher watcher) throws KeeperException, InterruptedException
Asynch: public void exists(String path, Watcher watcher, StatCallback cb, Object ctx)
Watches are one-shot: a watch is triggered only once. For multiple notifications, the client must re-register the watch.
WATCH SET BY ...   | ... IS TRIGGERED WHEN THE ZNODE IS ...
exists             | created, deleted, or its data updated
getData            | deleted or its data updated
getChildren        | deleted, or any of its children is created or deleted
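A small sketch of the one-shot watch behaviour using the kazoo Python client (an assumed library choice; the /config path is illustrative). The callback re-registers the watch so the client keeps receiving notifications for later changes.

```python
# One-shot watch on a znode: the callback re-registers itself after each event.
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

def on_event(event):
    print("znode", event.path, "changed:", event.type)
    zk.exists("/config", watch=on_event)   # re-register for the next notification

# Triggered once when /config is created, deleted, or its data is updated.
zk.exists("/config", watch=on_event)
```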
ACLs: each ACL entry consists of an authentication scheme, an identity for that scheme, and a set of permissions.
Authentication schemes:
digest - the client is authenticated by a username & password.
sasl - the client is authenticated using Kerberos.
ip - the client is authenticated by its IP address.
Distributed locks: only a single process may hold the lock at a time.
• Network load of transferring all data to all nodes
• Extremely strong consistency
The Fetching Service (FS):
• The master provides the fetchers with configuration, and the fetchers write back, informing of their status and health. The main advantages of using ZooKeeper for FS are recovering from failures of masters, guaranteeing availability despite failures, and decoupling the clients from the servers, allowing them to direct their requests to healthy servers by just reading their status from ZooKeeper.
Katta, a distributed indexer, divides the indexing work into shards that a master assigns to slave servers.
Slaves can fail, so the master must redistribute load as slaves come and go.
The master can also fail, so other servers must be ready to take over in case of failure. Katta uses ZooKeeper to track the status of slave servers and the master (group membership), and to handle master failover (leader election).
Katta also uses ZooKeeper to track and propagate the assignments of shards to slaves (configuration management).
Yahoo! Message Broker (YMB) is a distributed publish-subscribe system; topics are distributed among a set of servers to provide scalability.
Each topic is replicated using a primary-backup scheme that ensures messages are replicated to two machines to ensure reliable message delivery. The servers that make up YMB use a shared-nothing distributed architecture, which makes coordination essential for correct operation.
YMB uses ZooKeeper to manage the distribution of topics (configuration metadata), deal with failures of machines in the system (failure detection and group membership), and control system operation.
ZooKeeper Applications: Yahoo! Message Broker
The figure shows part of the znode data layout for YMB.
Each broker domain has a znode called nodes that has an ephemeral znode for each of the active servers that compose the YMB service.
Each YMB server creates an ephemeral znode under nodes with load and status information, providing both group membership and status information through ZooKeeper.
Figure: The layout of Yahoo! Message Broker (YMB) structures in ZooKeeper.
ZooKeeper Applications: Yahoo! Message Broker
The topics directory has a child znode for each topic managed by YMB.
These topic znodes have child znodes that indicate the primary and backup server for each topic, along with the subscribers of that topic.
The primary and backup server znodes not only allow servers to discover the servers in charge of a topic, but they also manage leader election and server crashes.
Figure: The layout of Yahoo! Message Broker (YMB) structures in ZooKeeper.
See: https://zookeeper.apache.org/
ZooKeeper achieves throughput values of hundreds of thousands of operations per second for read-dominant workloads by using fast reads with watches, both of which are served by local replicas.
In this lecture, we have discussed the basic fundamentals, design goals, architecture and applications of ZooKeeper.