
DISTRIBUTED SYSTEMS (CS 3006)

National Institute of Technology Rourkela


Measuring Time
Traditionally, time was measured astronomically
Transit of the sun (highest point in the sky)
Solar day and solar second

Problem: Earth’s rotation is slowing down


Days get longer and longer
300 million years ago there were 400 days in the year ;-)

The modern way to measure time is the atomic clock


Based on transitions in the Cesium-133 atom
Still need to correct for Earth’s rotation

Result: Coordinated Universal Time (UTC)


UTC available via radio signal, telephone line, satellite (GPS)

2
Hw/Sw Clocks
• Physical clocks in computers are realized as crystal
oscillation counters at the hardware level
– Correspond to counter register H(t)
– Used to generate interrupts
• Usually scaled to approximate physical time t, yielding
software clock C(t), C(t) = αH(t) + β
– C(t) measures time relative to some reference event, e.g., a 64-
bit counter of nanoseconds since the last boot
– Simplification: C(t) carries an approximation of real time
– Ideally, C(t) = t (never 100% achieved)
– Note: Values given by two consecutive clock queries will differ
only if clock resolution is sufficiently smaller than processor
cycle time

3
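As a rough illustration of C(t) = αH(t) + β, the sketch below builds a software clock on top of a raw counter; time.monotonic_ns() stands in for the hardware counter H(t), and the particular α and β values are assumptions chosen only for illustration.

import time

class SoftwareClock:
    """Toy software clock C(t) = alpha * H(t) + beta.

    H(t) is approximated by the monotonic nanosecond counter; alpha
    converts counter ticks to seconds and beta anchors the clock to
    an epoch (both chosen purely for illustration).
    """

    def __init__(self, alpha=1e-9, beta=None):
        self.alpha = alpha                        # scale: ns -> seconds
        # Anchor the software clock to the current wall-clock time.
        self.beta = (time.time() - alpha * time.monotonic_ns()
                     if beta is None else beta)

    def read(self):
        h = time.monotonic_ns()                   # hardware counter H(t)
        return self.alpha * h + self.beta         # software clock C(t)

clock = SoftwareClock()
print("C(t) =", clock.read())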
Problems with H/W or S/W Clock

• Skew: Disagreement in the reading of two clocks

• Drift: Difference in the rate at which two clocks count the time
– Due to physical differences in crystals, plus heat, humidity, voltage, etc.
– Accumulated drift can lead to significant skew

• Clock drift rate: Difference per unit time between a perfect


reference clock and a physical clock
– Usually about 10⁻⁶ sec/sec; 10⁻⁷ to 10⁻⁸ sec/sec for high-precision
clocks

4
Challenges
• Two clocks do not agree perfectly
• Skew: The time difference between two clocks
• Quartz oscillators vibrate at different rates
• Drift: The difference in rates of two clocks
• If we had two perfect clocks
– Skew = 0
– Drift = 0

5
Clock Skew
• When we detect a clock has a skew
• Eg: it is 5 seconds behind Or 5 seconds ahead
• What can we do?

6
Clock Skew: Impacts & Solutions
• When we detect that a clock has skew:
• E.g., it is 5 seconds behind – we can advance it 5 seconds
to correct it, or run it faster until it catches up
• Or it is 5 seconds ahead – setting it back 5 seconds is a bad
idea; instead, run it slower until it catches up
• This does not guarantee a correct clock in the future
– Need to check and adjust periodically
• Problems due to skew:
– A message appears to be received before it was sent
– A document appears to be closed before it was saved, etc.
– We want monotonicity: time always increases

7
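A minimal sketch of the "run it slower/faster until it catches up" idea (clock slewing): corrections are amortized instead of stepping the clock, so reported time stays monotonic. The slew rate and class layout below are illustrative assumptions, not values given in the slides.

import time

class SlewedClock:
    """Clock that applies corrections gradually (slewing) to stay monotonic.

    Pending corrections are bled in at `rate` seconds of adjustment per
    second of real time; as long as rate < 1, reported time never runs
    backwards, even for negative corrections.
    """

    def __init__(self, rate=0.0005):              # assumed slew rate: 0.5 ms/s
        self.rate = rate
        self.pending = 0.0                        # correction not yet applied
        self.applied = 0.0                        # correction already applied
        self.last = time.monotonic()

    def adjust(self, correction):
        """Schedule a correction, e.g. -5.0 if this clock is 5 s ahead."""
        self.pending += correction

    def now(self):
        t = time.monotonic()
        budget = self.rate * (t - self.last)      # max adjustment this interval
        self.last = t
        step = max(-budget, min(budget, self.pending))
        self.pending -= step
        self.applied += step
        return t + self.applied

clock = SlewedClock()
clock.adjust(-5.0)                                # clock found to be 5 s ahead
print(clock.now())                                # corrected gradually, never jumps back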
How do clocks synchronize?
Obtain time from a time server:

[Figure: the client sends a "Request Time" message to the time server;
the server replies with its current time, e.g., 00:05:20]

A dedicated time server is allocated for clock


synchronization

8
Causes of Inaccurate Time

 Delays in message transmission

 Delays due to processing

 Server’s time may be inaccurate

9
Clock Inaccuracies
• Clock inaccuracies cause problems and make many tasks in
distributed systems troublesome to solve.
• The clocks of different nodes need to be synchronized to
limit errors.
• This is needed for efficient communication and resource sharing.
• Clocks need to be monitored and adjusted continuously.
Otherwise, the clocks drift apart.
• Similarly, clock skew introduces a mismatch between the time
values of two clocks.
• Both drift and skew must be addressed to make efficient
use of the features of distributed systems.

• Example: Clock synchronization using a token ring
10


Clock Synchronization
Clock synchronization aims to coordinate the independent
clocks available in individual nodes.

Even when initially set accurately, real clocks will differ


after some amount of time due to clock drift,

Caused by clocks counting time at slightly different rates.

11
Solutions
The synchronization solution using a central server is trivial; the
server will dictate the system time. (Single point of failure)

Due to lack of global time/clock, achieving clock synchronization in


distributed systems is difficult.

Two Solutions: (Physical & Logical Clock Synchronization)

(1)Popular algorithms for Clock Synchronization (Physical) in


distributed systems:
(a) Cristian’s algorithm & (b) Berkeley algorithm

(2) Concepts of Logical clock in distributed systems for Clock


Synchronization (Logical): (a) Lamport timestamps & (b) Vector
clocks
12
Solutions
Wired Distributed Systems: Internet, LAN, MAN, WAN, PAN etc
Network Time Protocol (NTP): Works on client-server
architecture

User Datagram Protocol (UDP) message passing


Wireless Distributed Systems: WSN, VANET, MANET, FANET,ANET
etc
The problem becomes even more challenging
due to the possibility of collision of the
synchronization packets on the wireless medium
and the higher drift rate of clocks on low-cost
wireless devices.
Wired-Cum-Wireless Distributed Systems:
Wireless Internet
Cristian's algorithm (Physical Clock Synchronization)

Introduced by Flaviu Cristian in 1989

Primarily used in low-latency intranets.

Though the algorithm is simple, the obtained clock value is


probabilistic:
It only achieves synchronization if the round-trip time (RTT) of the
request is less than the required accuracy.

It also suffers in implementations using a single server, making it


unsuitable for many distributed applications where redundancy may
be crucial.

14
Cristian's algorithm

[Figure: the client records T0 when it sends the request and T1 when it receives the
reply (both on its own clock); the time server replies with Cutc, its current UTC value,
after some interrupt-handling time]

Best estimate of the message propagation time = (T1 − T0)/2

Both T0 and T1 are measured using the same (client) clock

Tnew = Tserver + (T1 − T0)/2, i.e., Cutc + estimated message


propagation time

15
Cristian's algorithm
Cristian's algorithm works between a process P, and a time server S connected to a
time reference source.

Step 1: P requests the time from S


Step 2: S receives the request from P
Step 3: S prepares a response and appends the time T from its own clock
Step 4: S sends the time to P
Step 5: P then sets its time to be T + RTT/2, where RTT is the round-trip time (request
time + response time)
Stop
Assumption: Request time = response time (may be reasonable for a LAN but not
always)
Further accuracy can be gained by making multiple requests to S and using the
response with the shortest RTT.

We can estimate the accuracy of the system as follows.


Let min be the minimum time to transmit a message one way.
Transmission time includes message preparation time and the time a node needs to be
ready to send a message.
The earliest point at which S can place the time T is min after P sent its request, and
the latest is min before the reply arrives at P; so the time by S's clock when the reply
arrives lies in [T + min, T + RTT − min], giving an accuracy of ±(RTT/2 − min).
16
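A client-side sketch of Cristian's algorithm is shown below. The get_server_time() function is a hypothetical stand-in for the request/reply exchange with the time server S (a real client would send a UDP request); the multi-request, shortest-RTT refinement from the slide is included.

import time

def get_server_time():
    """Hypothetical stand-in for the request/reply to the time server S.

    In practice this would send a UDP request and parse the reply; it is
    stubbed with the local clock here so that the sketch runs as-is.
    """
    return time.time()

def cristian_sync(samples=5):
    """Estimate the correct time as T_server + RTT/2.

    Several requests are made and the one with the shortest RTT is kept,
    as suggested in the slides, since it gives the tightest error bound.
    """
    best_rtt, best_estimate = None, None
    for _ in range(samples):
        t0 = time.monotonic()                  # request sent
        server_time = get_server_time()        # server's clock value T
        t1 = time.monotonic()                  # reply received
        rtt = t1 - t0
        if best_rtt is None or rtt < best_rtt:
            best_rtt = rtt
            best_estimate = server_time + rtt / 2.0
    return best_estimate, best_rtt

estimate, rtt = cristian_sync()
print("estimated time:", estimate, "shortest RTT (s):", rtt)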
Berkeley algorithm (Physical Clock Synchronization)
 Developed by Gusella and Zatti at the University of California,
Berkeley in 1989.

 Assumes no machine has an accurate time source.

 Intended for use within intranets.

 The server process (called the leader) periodically polls


other follower processes requesting for time.

 Based on the answers, it computes an average time & tells all the
other nodes to advance their clocks to the new time or slow their
clocks down until some specific reduction has been achieved

 The time daemon’s time must be set manually by the operator


periodically
17
Example
The time daemon sends its own clock value (3:00) and asks
all other nodes for their clock values

[Figure: time daemon at 3:00; over the network, P1 reads 2:50 and P2 reads 3:25]

18
Example
The nodes answer with the difference between their time and the
time at the time daemon (i.e., −10 and +25 minutes)

[Figure: time daemon at 3:00; P1 replies −10 (2:50 − 3:00), P2 replies +25 (3:25 − 3:00)]

19
Example

The time daemon computes the average time of all the nodes, including the time
daemon itself, i.e., (3:00 + 2:50 + 3:25)/3 = 9:15/3 = 3:05.

[Figure: the time daemon moves to 3:05; P1 (2:50) is told +15 and P2 (3:25) is told −20,
so all three clocks end up at 3:05]

The time daemon tells the other nodes to adjust their clock


values by increasing or decreasing them, sending the difference
in values (i.e., +15 and −20 minutes) instead of the average value.
Berkeley algorithm
A leader is chosen via an election process such as
Chang and Roberts algorithm.

The leader polls the followers who reply with their time in a similar
way to Cristian's algorithm.

The leader observes the round-trip time (RTT) of the messages and
estimates the time of each follower and its own.

The leader then averages the clock times, ignoring any values it
receives far outside the values of the others.

 Instead of sending the updated current time back to the other


processes, the leader then sends out the amount (positive or negative)
that each follower must adjust its clock. This avoids further
uncertainty due to RTT at the follower processes.
21
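A sketch of the leader's averaging step follows; the outlier tolerance and the helper's name are assumptions for illustration. It returns the signed adjustment for each node rather than the averaged time, matching the description above, and the sample values reproduce the 3:00 / 2:50 / 3:25 example from the earlier slides.

def berkeley_adjustments(leader_time, follower_times, tolerance=60.0):
    """Compute per-node clock adjustments for the Berkeley algorithm.

    leader_time    -- the leader's own clock reading (here, minutes)
    follower_times -- {node_id: estimated clock reading at that node}
    tolerance      -- readings further than this from the leader are
                      ignored when averaging (assumed outlier bound)
    """
    readings = {"leader": leader_time}
    readings.update(follower_times)
    # Ignore obviously faulty clocks when computing the average.
    usable = [t for t in readings.values() if abs(t - leader_time) <= tolerance]
    average = sum(usable) / len(usable)
    # Send each node the signed amount it must adjust, not the average itself.
    return {node: average - t for node, t in readings.items()}

# Leader reads 3:00, P1 reads 2:50, P2 reads 3:25 (expressed in minutes past 0:00).
print(berkeley_adjustments(180.0, {"P1": 170.0, "P2": 205.0}))
# -> {'leader': 5.0, 'P1': 15.0, 'P2': -20.0}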
Limitations of the Berkeley Algorithm
Gusella and Zatti released results involving 15 computers whose
clocks were synchronized to within about 20-25 milliseconds using
their protocol.

Computer systems normally avoid rewinding their clock when they


receive a negative clock alteration from the leader. This would break
the property of monotonic time, which is a fundamental assumption
in certain algorithms in the system itself or in programs such as
make.

A simple solution to this problem is to halt the clock for the duration
specified by the leader, but this simplistic solution can also cause
problems, although they are less severe. For minor corrections, most
systems slow the clock (known as "clock slew"), applying the
correction over a longer period of time.

Often, any client whose clock differs by a value outside of a given
tolerance is ignored when computing the average.
22


Logical Clock
Due to the lack of a global physical clock in a distributed system and the
limitations of the Berkeley algorithm, Lamport introduced the concept
of the logical clock, based on event ordering instead of a physical
clock.

Two event-ordering clocks: (i) Lamport's clock (also known as the


scalar clock)
For partial ordering of events
(ii) Vector clocks (a modification of Lamport's clocks)

Lamport's logical clock: can be considered a counter/integer


value

Lamport defined certain rules to increment the counter values


which are assigned to events in the processes of a distributed system
23
The clock drift rate (increment d) is usually assumed to be 1 unit; however, any value greater
than zero may be used.
Each process has some number n of instructions or tasks

What is an event?

Send, Receive, Print, etc.

24
Three Conditions proposed by Lamport:

(1) a -> b => C(a) < C(b) (happened-before relation): indicates that event
a is always earlier than event b

(2) If a is the sending event of message m and b is the receive event of


message m, then C(a) < C(b)

(3) a -> b, b -> c => a -> c (transitive relation)

Where a, b & c are events that may be executed in the same or different


processes, and

C(x) = timestamp of event x

Examples of events: sending, receiving, executing, print, etc.
25


Logical Clocks

Physical clocks are physical entities that assign physical times to


events,

Logical clocks order events logically by assigning logical timestamps


instead of physical ordering.

In fact, the logical clock decides the ordering of events across different


parallel, concurrent, or independent processes.

Logical clocks are simply a conceptualization of a


mathematical function that assigns numbers to events. These
numbers act as timestamps that help in ordering events.

They refer to implementing a protocol on all machines within your


distributed system, so that the machines are able to maintain a
consistent ordering of events within some virtual timespan.
26
Logical Clocks
More formally, each process Pi has a clock Ci which is a function from events to the integers.

The timestamp of an event e in Pi is Ci(e).

The system clock C is likewise a function from events to the integers, where C(e) = Ci(e) whenever e is an event in Pi

Causal Functionality:

Given 2 events (e1, e2) where one is caused by the other (e1 contributes to e2 occurring), the
timestamp of the causing event (e1) is less than that of the other event (e2).

27
Implementation Rules
To provide this functionality any Logical Clock must provide 2 rules:

Rule 1: this determines how a local process updates its own clock when an event occurs.

Before executing an event (excluding the event of receiving a message) increment the
local clock by 1.
Local_clock = local_clock + 1

Rule 2: determines how a local process updates its own clock when it receives a message from another
process. This can be described as how the process brings its local clock in line with information about the
global time.

When receiving a message (the message must include the sender's local clock value), set your local
clock to the maximum of the received clock value and the local clock value. After this, increment your
local clock by 1.

1. local_clock = max(local_clock, received_clock)

2. local_clock = local_clock + 1

3. message becomes available.


28
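The two rules above translate directly into a few lines of Python; the class below is a minimal sketch (the class and method names are my own, not part of the slides).

class LamportClock:
    """Scalar logical clock implementing the two rules above."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Rule 1: increment before any local or send event."""
        self.time += 1
        return self.time

    def send(self):
        """Return the timestamp to piggyback on an outgoing message."""
        return self.tick()

    def receive(self, received_clock):
        """Rule 2: merge the sender's clock value, then increment."""
        self.time = max(self.time, received_clock)
        self.time += 1
        return self.time

p1, p2 = LamportClock(), LamportClock()
ts = p1.send()               # P1 sends a message carrying timestamp ts
print(p2.receive(ts))        # P2's clock jumps past the sender's value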
Lamport's Logical Clocks
Key Idea:
(1) Processes exchange messages
(2) Messages must be sent before they are received
(3) Send/receive events are used to order the events & synchronize the logical clocks

Let

Pi be process i
a, b & c … be events in the processes
Ci(a) be the timestamp of event ‘a’ of process Pi
IR be an implementation rule

Clock condition to evaluate the logical clocks, with the following correctness criteria:

1. ∀a,b. a → b ⟹ C(a) < C(b) (happened-before relation denoted by →)


2. [C1] : Ci(a) < Ci(b) applies to events a → b in the same process
3. [C2] : Ci(a) < Cj(b) applies to a send event a in Pi and the corresponding receive event b in Pj (different processes)
4. [IR1] : If a → b are successive events in the same process, then Ci(b) = Ci(a) + d {d > 0}, where d is the clock increment ("drift rate")
(applies within the same process)
5. [IR2] : Cj = max(Cj, tm + d), where tm is the timestamp carried by the incoming message, i.e., Ci(a)
(applies to process j when an incoming arrow reaches the current process j)
29
Clock Values

a, b, c, d, e, f, g, h, i, j, k, l, m are events

1, 2, 3, 4, 5, 6, 7 are clock values or timestamps for the above events

No proper ordering of events


Ordering of events
Every process P0, P1, & P2 in a distributed system orders
its events for execution

Process P0 has 7 events a, b, c, d, e, f, and g; their time


stamps are 1, 2, 3, 4, 5, 6, 7

Process P1 has 3 events h, i, and j; their timestamps


are 1, 2, & 3

Process P2 has 3 events k, l, & m; their timestamps are


1, 2, & 3.

C(d) = 4; C(m) = 3; 4 > 3 does not satisfy Lamport's clock condition
31

Clock Values
[Figure: processes P1 (events e11–e17) and P2 (events e21–e25) exchange messages.
P1's clock values so far: (1) (2) (3) (4), then an incoming arrow is encountered;
P2's clock values so far: (1) (2) (3) max(3,3); the "Rule Applied" rows mark Rule 1 for local events and Rule 2 at receives]

(1) When an incoming arrow is detected with respect to a process, Rule 2 needs to
be followed, i.e., max(local clock + 1, sending process's clock value + n/w delay 1)

(2) Drift rate d is assumed to be 1


32
Clock Values
[Figure: P1's events e11–e16 have clock values (1) (2) (3) (4) (5) and max(5, 3) = (6);
P2's events e21–e24 have clock values (1) (2) (3) and max(3, 3) = (4);
Rule 1 is applied at local events and Rule 2 at the receive events]

Clock value of e25: max(C(e16) + 1, C(e24) + 1) = max(6 + 1, 4 + 1) = max(7, 5) = 7
33
Clock Values
[Figure: P1's events e11–e17 have clock values (1) (2) (3) (4) (5), max(5, 3) = (6), and 7;
P2's events e21–e25 have clock values (1) (2) (3), max(3, 3) = (4), and 7: max(5, 7)]

Clock value of e17: max(C(e24) + 1, C(e16) + 1) = max(4 + 1, 6 + 1) = max(5, 7) = 7

34
Another Example

35
36
37
Logical Clock

38
Logical Clock

39
Another Example

40
[Figure: three processes exchanging messages.
P1: e11 e12 e13 e14 with clock values 1 2 7 8;
P2: e21 e22 e23 e24 e25 with clock values 1 2 3 5 6;
P3: e31 e32 e33 e34 e35 e36 with clock values 1 2 3 4 5 6]

41
Limitations of Lamport's Clock (or
Scalar Clock)
W.r.t. implementation rules 1 and 2:
[IR1]: If a → b then Ci(a) < Ci(b) — true
[IR2]: For events a and b in different processes, Ci(a) < Cj(b) may or may not reflect any actual ordering
Limitation: Difficult to predict whether the clock value of e11 < the clock
value of e31 or not.
This is called partial ordering of events (cannot resolve clock issues
with the same counter values)

[Figure: P1 (e11, e12; clock values (1) (2), Rule 1 applied) and P2 (e21, e22; clock values (1) (3))
work globally & communicate — causal dependency;
P3 (e31, e32, e33; clock values (1) (2) (3), Rule 1 applied) is an independent process that works locally,
with no incoming edges]
42
Two Types of Event Ordering
Partial Ordering of Events: Supported by Lamport's logical clock
Clock values are obtained for each event within a process
The execution order of events in concurrent processes
cannot be predicted
The problem arises because a single number is used to represent time

Total Ordering of Events: Used to solve the problem of partial ordering of


events using an arbitrary mathematical function

Example: Multiply the clock value in Pi by 10 and add i, so that the values are
different from each other

Finding the clock values of every event in concurrent


processes

Resolves the issue of having the same counter


values in different processes
43
Total Ordering of Events

[Figure: total ordering applied to the earlier example.
P1: e11 e12 e13 e14 with values (clock*10+1): 11 21 71 81;
P2: e21 e22 e23 e24 e25 with values (clock*10+2): 12 22 32 52 62;
P3: e31 e32 e33 e34 e35 e36 with values (clock*10+3): 13 23 33 43 53 63]

Math function: clock value * 10 + process
number
44
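The total-ordering trick above (clock value * 10 + process number) can be sketched as follows; comparing (clock value, process id) pairs lexicographically, also shown, achieves the same tie-breaking without assuming single-digit process numbers.

def total_order_timestamp(clock_value, process_id):
    """Slides' scheme: multiply the clock value by 10 and add the process number.

    Only unambiguous while process ids are single digits; the tuple
    comparison below is the general-purpose equivalent.
    """
    return clock_value * 10 + process_id

def happens_earlier(ts_a, ts_b):
    """Total order on (clock_value, process_id) tuples, compared lexicographically."""
    return ts_a < ts_b

# Events with the same Lamport value 3 in P1 and P2 now get distinct timestamps:
print(total_order_timestamp(3, 1), total_order_timestamp(3, 2))   # 31 32
print(happens_earlier((3, 1), (3, 2)))                            # True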
Vector Clocks
Vector Clocks extend Lamport's scalar time to provide a causally
consistent view of the world.

By looking at the clock, we can observe whether one event caused


another event.

Provides partial ordering of events

Each process keeps a vector (a list of integers) with an integer for


each local clock of every process within the system.

For N processes, a vector of size N is maintained by each process.

45
Vector Clocks
Given a process (Pi) with a vector (v), Vector Clocks implement the Logical
Clock rules as follows:

Rule 1: Before executing an event (excluding the event of receiving a message)


process Pi increments the value v[i] within its local vector by 1.

This is the element in the vector that refers to Node(i)’s local clock.

local_vector[i] = local_vector[i] + 1

Rule 2: When receiving a message (the message must include the sender's vector),
loop through each element in the vector sent and compare it to the local vector,
updating the local vector to be the maximum of the local and received clock values.

Then increment your local clock within the vector by 1

1. For k = 1 to N: local_vector[k] = max(local_vector[k], sent_vector[k])


2. local_vector[i] = local_vector[i] + 1
3. message becomes available.
46
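A minimal vector clock following the two rules above (the class layout is my own; the slides do not prescribe an implementation):

class VectorClock:
    """Vector clock for process `pid` in a system of `n` processes."""

    def __init__(self, pid, n):
        self.pid = pid
        self.v = [0] * n

    def tick(self):
        """Rule 1: increment this process's own entry before a local or send event."""
        self.v[self.pid] += 1
        return list(self.v)

    def send(self):
        """Return the vector to piggyback on an outgoing message."""
        return self.tick()

    def receive(self, sent_vector):
        """Rule 2: element-wise max with the sender's vector, then increment own entry."""
        self.v = [max(a, b) for a, b in zip(self.v, sent_vector)]
        self.v[self.pid] += 1
        return list(self.v)

p1, p2 = VectorClock(0, 3), VectorClock(1, 3)
m = p1.send()           # P1 sends a message carrying [1, 0, 0]
print(p2.receive(m))    # P2 becomes [1, 1, 0]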
Advantages/Disadvantages of
Vector Clocks
Advantage: Provide a causally consistent ordering of
events

Disadvantages: Costly due to the need to send the


entire vector to every process for every message sent, in
order to keep the vector clocks in sync.

When there are a large number of processes this technique


can become extremely expensive, as the vector sent is
extremely large.

Gives partial order of events for processes in a distributed


system
47
Improvements over Vector Clocks
(1) Singhal–Kshemkalyani’s differential technique : This
approach improves the message passing mechanism by
only sending updates to the vector clock that have
occurred since the last message sent from Process(i) →
Process(j).

This drastically reduces the message size being sent,


But does require O(n²) storage.

(2) Fowler-Zwaenepoel direct-dependency technique :


Further reduces the message size by only sending the
single clock value of the sending process with a message.
However, this means processes cannot know their
transitive dependencies when looking at the causality of events.
48
49
Vector Clocks
Clock Val: (1 0 0) (2 0 0) (3 4 1)
P1 e11 e12 e13

Clock Val: (0 1 0) (2 2 0) (2 3 1) (2 4 1)
P2 e21 e22 e23 e24

P3 e31 e32
Clock val (0 0 1) (0 0 2)

TIME ------

P2 -> e22 -> Max[( 2 0 0 ) ( 0 2 0)] = (2 2 0)


P2 -> e23 -> Max [(2 3 0) (0 0 1)] = (2 3 1)
P1 -> e13 -> Max [(3 0 0) (2 4 1)] = (3 4 1)
50
Vector Clocks
Clock Val: (1 0 0) (2 0 0) (3 0 0) (4 0 0)
P1 e11 e12 e13 e14

Clock Val: (0 1 0) (0 2 1) (2 3 1) (2 4 1)
P2 e21 e22 e23 e24

P3 e31 e32 e33


Clock val (0 0 1) (0 0 2)
(4 0 3 )

TIME ------

P2 -> e22 -> max [(0 2 0), (0 0 1)] = (0 2 1)


P2 -> e23 -> max [(2 0 0), (0 3 1)] = (2 3 1)
P3 -> e33 -> max [(0 0 3), (4 0 0)] = (4 0 3)
51
Applications of Vector Clocks

 For updating data during transactions in


distributed databases

Transactions can be assigned logical time


stamps

Provides a consistent view of transactions and


correct updating of data in distributed databases

52
References

https://www.youtube.com/watch?v=VqZa4raMv_Q

53
Thank
You

54
LEADER ELECTION

National Institute of Technology Rourkela


Leader Election Algorithms
• Leader election is the simple idea of giving one thing (a
process, host, thread, object, or human) in a distributed
system some special powers such as:
– the ability to assign work,
– the ability to modify a piece of data, or
– even the responsibility of handling all requests in the
system.

• Advantages: a powerful tool for improving efficiency,


reducing coordination, simplifying architectures, and
reducing operations.

• Disadvantages: Can introduce new failure modes and
scaling bottlenecks
56
Requirement of Leader Election
Typically leader election is used:
To ensure exclusive access by a single node to
shared data, or
To ensure a single node coordinates the work
in a system.

For replicated database systems such as MySQL,


Apache Zookeeper, or Cassandra, we need to
make sure only one "leader" exists at any given
time.
57
Applications of LEAs
Radio networks:
In radio network protocols, leader election is often used as a first
step to approach more advanced communication primitives, such as
message gathering or broadcasts.

 When adjacent nodes transmit at the same time in wireless


networks (which is very natural), collisions occur; electing a leader allows
this process to be better coordinated.

While the diameter D of a network is a natural lower bound for the


time needed to elect a leader, upper and lower bounds for the
leader election problem depend on the specific radio model.

RDBMS:
RDBMSs rely on leader election to pick a leader database which
handles all writes and, sometimes, all reads. The election may be
automated, but it is frequently done by a human operator.
58
Election Algorithms
Many distributed algorithms need one process to act as a
coordinator for coordinating all the activities in a distributed system

Election algorithms are to pick a unique coordinator/leader based on


certain criteria such as largest identifier

Examples:
(1) Take over the role of a failed process (Fault Tolerance)

(2) Pick a master in Berkeley clock synchronization algorithm


(Physical Clock Synchronization)

(3) A powerful tool used in systems across Amazon for fault-


tolerance and easier to operate.

(4) A znode in ZooKeeper is chosen as leader. All application


processes watch the current smallest znode, which is ephemeral
59
Leader Election Algorithms
Once the leader is elected, the nodes reach a particular state known
as terminated state.

The states are partitioned into elected states & non-elected


states.

When a node enters either state, it always remains in that state.

Safety and liveness condition for execution of Leader Election


Algorithm:

Liveness condition: Every node will eventually enter an elected


state or a non-elected state.

Safety condition: Only a single node enters the elected state &
eventually becomes the leader.
60
Validity of LEA
A LEA is valid if it meets the following
conditions:
Termination: the algorithm should finish
within a finite time once the leader is
selected. In randomized approaches this
condition is sometimes weakened (for
example, requiring termination with
probability 1).

Uniqueness: there is exactly one node that
considers itself the leader.
61
Types of LEA
An algorithm for leader election may vary in the following
aspects:

Communication mechanism: the nodes are either


synchronous in which processes are synchronized by a
clock signal or asynchronous where processes run at
arbitrary speeds.

Process names: whether processes have a unique identity


or are indistinguishable (anonymous).

Network topology: for instance, ring, acyclic graph or


complete graph.
62
Types of Leader Election Algorithms

The most prominent LE algorithms are:

a. Bully Algorithm presented by Garcia-Molina in 1982.


Improved Bully Election Algorithm by A. Arghavani in
2011.
Modified Bully Election Algorithm by M. S. Kordafshari
and group.

b. Ring Algorithm
Modified Ring Algorithm

63
Bully Algorithm: Basic Assumptions
1. The system is synchronous

2. Each process has a unique numerical Id

3. Processes know the Ids and addresses of every other process

4. Communication/message delivery between processes is reliable

5. The processes may fail at any time including during execution


of algorithm

6. There is a failure detector

7. A process fails by stopping; it no longer responds to messages

8. Used to elect a coordinator dynamically from a set of


distributed computing processes 64
Bully Algorithm
Key idea:
Select process with highest Id
Processes initiate election if just recovered from failure
If coordinator failed several processes can initiate
an election simultaneously

Types of Messages:
Coordinator: For announcing the victory of election
Election Message: To initiate election process
Alive Message: To indicate that the responding process is alive

O(n²) messages are needed with n processes

65
Bully Algorithm

When a process P recovers from failure, or the failure detector indicates that the
current coordinator has failed, P performs the following actions:

Step 1: If P has the highest process ID, it sends a Victory message to all other
processes and becomes the new Coordinator. Otherwise, P broadcasts an Election
message to all other processes with higher process IDs than itself.

Step 2: If P receives no Answer after sending an Election message, then it


broadcasts a Victory message to all other processes and becomes the Coordinator.

Step 3: If P receives an Answer from a process with a higher ID, it sends no further
messages for this election and waits for a Victory message. (If there is no Victory
message after a period of time, it restarts the process at the beginning.)

Step 4: If P receives an Election message from another process with a lower ID it


sends an Answer message back and starts the election process at the beginning,
by sending an Election message to higher-numbered processes.

Step 5: If P receives a Coordinator message, it treats the sender as the


coordinator. 66
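The steps above can be condensed into a toy, synchronous simulation; the class below is an assumption about how one might code it (real implementations exchange ELECTION/ANSWER/COORDINATOR messages asynchronously, with timeouts and a failure detector).

class BullyCluster:
    """Toy, synchronous simulation of the bully election steps above."""

    def __init__(self, ids):
        self.alive = set(ids)                    # processes still responding

    def elect(self, initiator):
        """Election started by `initiator`; returns the new coordinator id."""
        # Send an Election message to every live process with a higher id (Step 1).
        higher = [p for p in self.alive if p > initiator]
        if not higher:
            # No Answer arrives, so the initiator declares victory (Step 2).
            return self._announce(initiator)
        # Otherwise the election propagates upward; the highest live id
        # eventually receives no Answer and wins (Steps 3-4).
        return self._announce(max(higher))

    def _announce(self, winner):
        # Broadcast the Victory/Coordinator message (Steps 2 and 5).
        print("COORDINATOR message: new leader is N%d" % winner)
        return winner

cluster = BullyCluster({3, 5, 6, 12, 32, 80})
cluster.alive.discard(80)                        # coordinator N80 crashes
cluster.elect(6)                                 # N6 detects the failure -> N32 wins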
Algorithm (Bully)
Step 1: Let process P send a message to the
coordinator (P → C)
Step 2: If the coordinator does not respond within a time interval T, then it is
assumed that the coordinator has failed

Step 3: Now process P sends an election message to every process with a higher
priority number

Step 4: It waits for responses; if no one responds within time interval T, then process
P elects itself as the coordinator

Step 5: It then sends a message to all lower-priority processes announcing that it has been elected
as their new coordinator

Step 6: If an answer is received within time T from any other process Q,

process P waits for a further interval T to receive a message from Q
announcing that Q has been elected as coordinator
67
Example Bully Algorithm

[Figure: nodes N80 (failed), N32, N12, N6, N5, N3; N6 detects the failure and sends ELECT messages]

Let process N80 fail and its failure be detected by node N6 with the


help of a failure detector
N6 sends election messages to all processes having
higher Ids, i.e., N80, N32 & N12
68
N80 does not respond

[Figure: N80 (failed); N12 & N32 reply OK to N6; N32 then sends Coordinator messages to N3, N5, N6 & N12]

N12 & N32 send OK (Alive) messages to N6, their ids being higher
than N6's

N32 knows the ids of all other processes (every process knows the
ids of all other processes)

N32 sends a Coordinator (Victory) message to all lower-Id processes


69
If failures stop, a leader will eventually be elected

How to set the timeouts ?

Answer: Based on worst case time to complete election

5 message transmission times if there are no failures during the run

1.Election from lowest id server in group

2.Answer to lowest id server from 2nd highest id process

3.Election from second highest id server to highest id

4.Timeout for answers @ 2nd highest id server

5.Coordinator from second highest id server


70
Analysis

Worst case completion time: 5 message transmission times

When the process with lowest id in the system detects failure


N-1 processes altogether begin elections, each sending
messages to processes with higher ids

ith highest id process sends (i-1) election messages

No. of election messages:

N-1 + N-2 + … + 1 = (N-1)*N/2 = O(N²)

Best case
Second highest id detects leader failure
Sends (N-2) coordinator messages
Completion time: 1 message transmission time 71
Impossibility
Since timeouts are built into the protocol, in the asynchronous
system model:

the protocol may never terminate -> liveness is not


guaranteed

But it satisfies liveness in the synchronous system model, where the

worst-case one-way latency can be calculated = worst-


case processing time + worst-case message latency

72
Disadvantages of Bully Algorithm
(a) Space Complexity is very large since every process
should know the identity of every other process in the
system.

(b) High number of message passing during


communication increases heavy traffic.

(c) The message complexity has order O(n²).

73
Improved Bully Algorithm
Presented by A. Arghavani, E. Ahmadi, and A. T. Haghighat in 2011.

Overcomes the disadvantages of the original bully.

The main concept: The algorithm declares the new coordinator


before actual or current coordinator is crashed. (needs extra stages)

Before the coordinator is failed, the current coordinator tries to


gather information about processes in the system and declares the
next possible coordinator to the processes.

With increasing knowledge, having obtained the ids of all other processes, a


process with a bigger id attempts to execute the bully algorithm.

If the coordinator is failed, each process that notices this failure


compares its id with the id which it has received via the coordinator.
74
Disadvantages of Improved Bully
Algorithm
It has a complex structure.

Every process has to update its database every time.

A large database is required to maintain the


information of each process in the database of
every process.

75
MODIFIED ELECTION ALGORITHM
Presented by M. S. Kordafshari, M. Gholipour, M. Jahanshahi, and A. T. Haghighat in 2005.

The algorithm resolves the disadvantages of the bully algorithm.

1. When any process P notices that the coordinator is not responding, it initiates an


election and sends an election message to all processes with higher priority numbers.

2. If no process responds, process P wins the election and becomes the new


coordinator.

3. Each process with a higher priority sends an OK message with its priority number to
process P.

4. When process P receives all the responses, it selects the process with the highest
priority number as the new coordinator and sends a grant message to it.

5. The new coordinator then broadcasts a coordinator message to all
other processes, announcing itself as the coordinator.
76
Disadvantages of Modified Bully Algorithm

The modified algorithm is also time bounded.

It is better than the bully algorithm but still has O(n²) complexity in the


worst case.

It is necessary for every process to know the priority of the


others.

77
Ring Algorithm
The algorithm applies to system organized as a ring
(logically or physically)

Assumptions: The links between the processes are


unidirectional, and every process can send messages only to the
process on its right (clockwise)

Data structure used in the algorithm: the active list,


i.e., a list that holds the priority numbers of all active processes
in the system

[Figure: a ring of processes 0, 1, 2, 3, 4]

78
Algorithm: Ring
Step 1: If process P1 detects a coordinator failure, it creates a new
active list, which is empty initially.
It sends an election message to its neighbour on the right and adds the number
1 to its active list.

Step 2: If process P2 receives an election message from the process on its


left, it responds in one of 3 ways:
(i) If the received message's active list does not contain 2, then P2 adds 2 to
its active list & forwards the message
(ii) If this is the first election message it has received or sent, P2
creates a new active list with the numbers 1 and 2.
It then sends the election message for 1 followed by the one for 2

[Figure: a ring of processes 0–4, with the coordinator marked]

(iii) If process P1 receives its own election message (containing 1), then the active


list for P1 now contains the numbers of all the active processes in the
system; P1 then elects the process with the highest number as the new coordinator.
79
Example: Ring Algorithm
Processes 0–7 are participating in the network
A process P that thinks the coordinator has crashed builds an election message
which contains its own id number
It sends it to its first live successor (e.g., node 5 sends [5] to node 6)
Each process adds its own number and forwards the message to the next node
It is O.K. to have two elections at once

[Figure: ring of processes 0–7; the previous coordinator (7) has crashed.
One election starts at node 5: its active list grows [5], [5, 6], [5, 6, 0], ... to [5, 6, 0, 1, 2, 3, 4].
A second election starts at node 2: its active list grows [2], [2, 3], [2, 3, 4], ... to [2, 3, 4, 5, 6, 0, 1] back at node 2]
80
Example: Ring Algorithm

When the message returns to P, it sees its own process ID


in the list &
knows that the circuit is complete

P circulates a “COORDINATOR” message announcing the member with the


highest number as the new coordinator

Here both elections (started by 2 and by 5) elect 6 as the leader

[5, 6, 0, 1, 2, 3, 4]
[2, 3, 4, 5, 6, 0, 1]  ->  6 is the Coordinator

81
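The active-list circulation can be simulated in a few lines; the function below is a sketch under the assumption that the ring is given as a Python list of live process ids in clockwise order (it reproduces the [5, 6, 0, 1, 2, 3, 4] list from the example).

def ring_election(ring, initiator):
    """Simulate one circulation of the ring election's active list.

    ring      -- live process ids in clockwise order
    initiator -- id of the process that detected the coordinator failure

    The ELECTION message accumulates every live id as it passes around
    the ring; when it returns to the initiator, the highest id in the
    active list is announced as the new coordinator.
    """
    start = ring.index(initiator)
    active_list = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)]   # walk clockwise
        active_list.append(node)                  # each node appends its own id
    coordinator = max(active_list)                # message is back at the initiator
    return active_list, coordinator

# Ring 0..6 with the old coordinator 7 crashed; node 5 starts the election.
print(ring_election([0, 1, 2, 3, 4, 5, 6], 5))    # ([5, 6, 0, 1, 2, 3, 4], 6)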
MODIFIED RING ALGORITHM

When a node notices that the leader has crashed, it sends its ID number to its
neighboring node in the ring. Thus, it is not necessary for all nodes to send their
IDs into the ring.

The receiving node compares the received ID with its own, and forwards whichever
is the greatest. This comparison is done by all the nodes such that only the
greatest ID remains in the ring.

Finally, the greatest ID returns back to the initial node.

If the received ID equals that of the initial sender, it declares itself as the leader
by sending a coordinate message into the ring.

It can be observed that this method dramatically reduces the overhead involved in
message passing.

Thus, if many nodes notice the absence of the leader at the same time, only the
message of the node with the greatest ID circulates in the ring, thus preventing
smaller IDs from being sent.
82
If n{i1,i2,··· ,im} is the number of nodes that concurrently detect the absence of
• Leader election is an important component of many cloud
computing systems

• Classical leader election protocols: Ring and Bully

• But Failure Prone

• Paxos-like protocols are used by Google Chubby and Apache


ZooKeeper

83
Applications of Leader Election

In Wireless Networks:
Key distribution,
Routing coordination,
Sensor coordination, and
General control.

In Cloud Computing:
Resolving Conflicts During Resource
sharing
84
How Amazon elects a leader ?
There are many ways to elect a leader, ranging from algorithms like
Paxos, to software like Apache ZooKeeper, to custom hardware, to
leases.

Leases:
are the most widely used leader election mechanism at Amazon.

are relatively straightforward to understand and implement

offer built-in fault tolerance.

work by having a single database that stores the current leader.

requires that the leader heartbeat periodically to show that it’s


still the leader.
85
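To make the lease idea concrete, here is a minimal in-memory sketch; the LeaseStore class stands in for the single database mentioned above, and its API, lease duration, and semantics are illustrative assumptions (a real system would use a strongly consistent store with conditional writes, e.g., DynamoDB or ZooKeeper).

import time

class LeaseStore:
    """In-memory stand-in for the single database that stores the current leader."""

    def __init__(self, lease_seconds=10.0):
        self.lease_seconds = lease_seconds
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, candidate):
        """Take the lease if it is free or has expired (assumed to be atomic)."""
        now = time.monotonic()
        if self.holder is None or now >= self.expires_at:
            self.holder = candidate
            self.expires_at = now + self.lease_seconds
            return True
        return False

    def heartbeat(self, candidate):
        """Renew the lease; the leader must call this periodically."""
        if self.holder == candidate and time.monotonic() < self.expires_at:
            self.expires_at = time.monotonic() + self.lease_seconds
            return True
        return False                      # lost the lease: stop acting as leader

store = LeaseStore(lease_seconds=2.0)
print(store.try_acquire("host-a"))        # True  -> host-a becomes leader
print(store.try_acquire("host-b"))        # False -> lease still held
print(store.heartbeat("host-a"))          # True  -> lease renewed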
Examples of systems using leader election at Amazon

Leader election is a widely deployed pattern across


Amazon.

For example:

RDBMSs rely on leader election to pick a leader database


which handles all writes and sometimes all reads.

The election may be automated but it is frequently done


manually by a human operator.

86
Examples of systems using leader election at Amazon
Amazon EBS (Elastic Block Store) distributes reads and
writes for a volume (Solid State Drives/Hard Disk Drives)
over many storage servers.

To ensure consistency, it uses leader election to elect


primaries for each area of the volume which order the
reads and writes.

 If the primary copy fails, a follower copy steps in using the same


leader election mechanism.

Leader election ensures consistency while improving


performance by avoiding coordination on the data plane.
87
Examples of systems using leader election at Amazon

DynamoDB:
Uses a leader election protocol to elect an AWS Management Console
to monitor resource utilization and performance metrics of various
operations over data bases

Amazon Quantum Ledger Database (Amazon QLDB):


Elect a central trusted authority to provide a fully managed
ledger database that provides a transparent, immutable,
and cryptographically verifiable transaction log

Amazon Kinesis (Kinesis) :


The Kinesis Client Library (KCL) uses leases to ensure that
each Kinesis shard is processed by one owner, making it
easy to do scale-out processing of Kinesis streams. 88
What happens when leader fails?
Allows the new leader to confidently redrive work that the outgoing
leader may have partially completed or completed but didn't tell
others about.

To tolerate failures, Amazon distributed systems don’t have a single


leader. Instead, leadership is a property that passes from server to
server, or process to process.

In distributed systems, it’s not possible to guarantee that there is


exactly one leader in the system. Instead, there can mostly be one
leader, and there can be either zero leaders or two leaders during
failures.

 Idempotent systems can often tolerate two leaders with minimal loss of


efficiency
89
Characteristics of a Good Leader Election
Frequent Checkpointing: Frequent check of the remaining lease time
(or lock status in general) especially before initiating any operation
that has side-effects beyond the leader itself.

Network Latency: Consider that slow networking, timeouts, retries,


and garbage collection pauses can cause the remaining lease time to
expire before the code expects it to.

Correctness: Avoid heartbeating leases in a background thread. This


can cause correctness issues if the thread can’t interrupt the code
when the lease expires or the heartbeating thread dies.

Availability: Issues can occur if the work thread dies or stops
while the heartbeating thread holds on to the lease.
90
Characteristics of a Good Leader Election
Reliability: Have reliable metrics that show how much work a leader
can do versus how much it is doing now.

Scalability: Review the metrics often and make sure that there are
plans for scaling in advance of running out of capacity.

Flexibility: Make it easy to find which host is the current leader and
which host was the leader at any given time. Keep an audit trail or
log of leadership changes.

Formal Verification Tools: Model and formally verify the correctness


of distributed algorithms using tools like TLA+.

Bug Tolerance: This catches subtle, difficult-to-observe, and rare


bugs that can creep in when an application assumes too much about
the guarantees that a lease provides.
91
References
1. https://www.coursera.org/lecture/cloud-computing-2/1-4-bully-algorithm-K8QwJ

2. https://aws.amazon.com/builders-library/leader-election-in-distributed-systems/

3. Seema Balhara, Kavita Khanna, Leader Election Algorithms in Distributed


Systems, International Journal of Computer Science and Mobile Computing, Vol. 3,
Issue. 6, June 2014, pg.374 – 379

92
Thank
You

93
