
DISTRIBUTED SYSTEMS (CS 3006)

National Institute of Technology Rourkela


Measuring Time
Traditionally, time was measured astronomically
Transit of the sun (highest point in the sky)
Solar day and solar second

Problem: Earth’s rotation is slowing down


Days get longer and longer
300 million years ago there were 400 days in the year ;-)

The modern way to measure time is the atomic clock


Based on transitions in the Cesium-133 atom
Still need to correct for Earth’s rotation

Result: Coordinated Universal Time (UTC)


UTC available via radio signal, telephone line, satellite (GPS)

2
Hw/Sw Clocks
• Physical clocks in computers are realized as crystal
oscillation counters at the hardware level
– Correspond to counter register H(t)
– Used to generate interrupts
• Usually scaled to approximate physical time t, yielding
software clock C(t), C(t) = αH(t) + β
– C(t) measures time relative to some reference event, e.g., a 64-
bit counter of nanoseconds since the last boot
– Simplification: C(t) carries an approximation of real time
– Ideally, C(t) = t (never 100% achieved)
– Note: Values given by two consecutive clock queries will differ
only if clock resolution is sufficiently smaller than processor
cycle time

3
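As a rough illustration of C(t) = αH(t) + β, the sketch below builds a software clock on top of a raw counter; time.monotonic_ns() stands in for the hardware counter H(t), and the particular α and β values are assumptions chosen only for illustration.

import time

class SoftwareClock:
    """Toy software clock C(t) = alpha * H(t) + beta.

    H(t) is approximated by the monotonic nanosecond counter; alpha
    converts counter ticks to seconds and beta anchors the clock to
    an epoch (both chosen purely for illustration).
    """

    def __init__(self, alpha=1e-9, beta=None):
        self.alpha = alpha                        # scale: ns -> seconds
        # Anchor the software clock to the current wall-clock time.
        self.beta = (time.time() - alpha * time.monotonic_ns()
                     if beta is None else beta)

    def read(self):
        h = time.monotonic_ns()                   # hardware counter H(t)
        return self.alpha * h + self.beta         # software clock C(t)

clock = SoftwareClock()
print("C(t) =", clock.read())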
Problems with H/W or S/W Clock

• Skew: Disagreement in the reading of two clocks

• Drift: Difference in the rate at which two clocks count the time
– Due to physical differences in crystals, plus heat, humidity, voltage, etc.
– Accumulated drift can lead to significant skew

• Clock drift rate: Difference per unit time between a perfect


reference clock and a physical clock
– Usually about 10⁻⁶ sec/sec; 10⁻⁷ to 10⁻⁸ sec/sec for high-precision
clocks

4
Challenges
• Two clocks do not agree perfectly
• Skew: The time difference between two clocks
• Quartz oscillators vibrate at different rates
• Drift: The difference in rates of two clocks
• If we had two perfect clocks
– Skew = 0
– Drift = 0

5
Clock Skew
• When we detect a clock has a skew
• Eg: it is 5 seconds behind Or 5 seconds ahead
• What can we do?

6
Clock Skew: Impacts & Solutions
• When we detect that a clock has skew:
• E.g., it is 5 seconds behind – we can advance it 5 seconds
to correct it, or run it faster until it catches up
• Or it is 5 seconds ahead – setting it back 5 seconds is a bad
idea; instead, run it slower until it catches up
• This does not guarantee a correct clock in the future
– Need to check and adjust periodically
• Problems due to skew:
– A message appears to be received before it was sent
– A document appears to be closed before it was saved, etc.
– We want monotonicity: time always increases

7
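A minimal sketch of the "run it slower/faster until it catches up" idea (clock slewing): corrections are amortized instead of stepping the clock, so reported time stays monotonic. The slew rate and class layout below are illustrative assumptions, not values given in the slides.

import time

class SlewedClock:
    """Clock that applies corrections gradually (slewing) to stay monotonic.

    Pending corrections are bled in at `rate` seconds of adjustment per
    second of real time; as long as rate < 1, reported time never runs
    backwards, even for negative corrections.
    """

    def __init__(self, rate=0.0005):              # assumed slew rate: 0.5 ms/s
        self.rate = rate
        self.pending = 0.0                        # correction not yet applied
        self.applied = 0.0                        # correction already applied
        self.last = time.monotonic()

    def adjust(self, correction):
        """Schedule a correction, e.g. -5.0 if this clock is 5 s ahead."""
        self.pending += correction

    def now(self):
        t = time.monotonic()
        budget = self.rate * (t - self.last)      # max adjustment this interval
        self.last = t
        step = max(-budget, min(budget, self.pending))
        self.pending -= step
        self.applied += step
        return t + self.applied

clock = SlewedClock()
clock.adjust(-5.0)                                # clock found to be 5 s ahead
print(clock.now())                                # corrected gradually, never jumps back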
How do clocks synchronize?
Obtain time from a time server:

[Figure: the client sends a "Request Time" message to the time server;
the server replies with its current time, e.g., 00:05:20]

A dedicated time server is allocated for clock


synchronization

8
Causes of Inaccurate Time

 Delays in message transmission

 Delays due to processing

 Server’s time may be inaccurate

9
Clock Inaccuracies
• Clock inaccuracies cause problems and make many tasks in
distributed systems troublesome to solve.
• The clocks of different nodes need to be synchronized to
limit errors.
• This is needed for efficient communication and resource sharing.
• Clocks need to be monitored and adjusted continuously.
Otherwise, the clocks drift apart.
• Similarly, clock skew introduces a mismatch between the time
values of two clocks.
• Both drift and skew must be addressed to make efficient
use of the features of distributed systems.

• Example: Clock synchronization using a token ring
10


Clock Synchronization
Clock synchronization aims to coordinate the independent
clocks available in individual nodes.

Even when initially set accurately, real clocks will differ


after some amount of time due to clock drift,

Caused by clocks counting time at slightly different rates.

11
Solutions
The synchronization solution using a central server is trivial; the
server will dictate the system time. (Single point of failure)

Due to lack of global time/clock, achieving clock synchronization in


distributed systems is difficult.

Two Solutions: (Physical & Logical Clock Synchronization)

(1)Popular algorithms for Clock Synchronization (Physical) in


distributed systems:
(a) Cristian’s algorithm & (b) Berkeley algorithm

(2) Concepts of Logical clock in distributed systems for Clock


Synchronization (Logical): (a) Lamport timestamps & (b) Vector
clocks
12
Solutions
Wired Distributed Systems: Internet, LAN, MAN, WAN, PAN etc
Network Time Protocol (NTP): Works on client-server
architecture

User Datagram Protocol (UDP) message passing


Wireless Distributed Systems: WSN, VANET, MANET, FANET,ANET
etc
The problem becomes even more challenging
due to the possibility of collision of the
synchronization packets on the wireless medium
and the higher drift rate of clocks on low-cost
wireless devices.
Wired-Cum-Wireless Distributed Systems:
Wireless Internet
Cristian's algorithm (Physical Clock Synchronization)

Introduced by Flaviu Cristian in 1989

Primarily used in low-latency intranets.

Though the algorithm is simple, the obtained clock value is


probabilistic:
It only achieves synchronization if the round-trip time (RTT) of the
request is less than the required accuracy.

It also suffers in implementations using a single server, making it


unsuitable for many distributed applications where redundancy may
be crucial.

14
Cristian's algorithm

[Figure: the client records T0 when it sends the request and T1 when it receives the
reply (both on its own clock); the time server replies with Cutc, its current UTC value,
after some interrupt-handling time]

Best estimate of the message propagation time = (T1 − T0)/2

Both T0 and T1 are measured using the same (client) clock

Tnew = Tserver + (T1 − T0)/2, i.e., Cutc + estimated message


propagation time

15
Cristian's algorithm
Cristian's algorithm works between a process P, and a time server S connected to a
time reference source.

Step 1: P requests the time from S


Step 2: S receives the request from P
Step 3: S prepares a response and appends the time T from its own clock
Step 4: S sends the time to P
Step 5: P then sets its time to be T + RTT/2, where RTT is the round-trip time (request
time + response time)
Stop
Assumption: Request time = response time (may be reasonable for a LAN but not
always)
Further accuracy can be gained by making multiple requests to S and using the
response with the shortest RTT.

We can estimate the accuracy of the system as follows.


Let min be the minimum time to transmit a message one way.
Transmission time includes message preparation time and the time a node needs to be
ready to send a message.
The earliest point at which S can place the time T is min after P sent its request, and
the latest is min before the reply arrives at P; so the time by S's clock when the reply
arrives lies in [T + min, T + RTT − min], giving an accuracy of ±(RTT/2 − min).
16
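A client-side sketch of Cristian's algorithm is shown below. The get_server_time() function is a hypothetical stand-in for the request/reply exchange with the time server S (a real client would send a UDP request); the multi-request, shortest-RTT refinement from the slide is included.

import time

def get_server_time():
    """Hypothetical stand-in for the request/reply to the time server S.

    In practice this would send a UDP request and parse the reply; it is
    stubbed with the local clock here so that the sketch runs as-is.
    """
    return time.time()

def cristian_sync(samples=5):
    """Estimate the correct time as T_server + RTT/2.

    Several requests are made and the one with the shortest RTT is kept,
    as suggested in the slides, since it gives the tightest error bound.
    """
    best_rtt, best_estimate = None, None
    for _ in range(samples):
        t0 = time.monotonic()                  # request sent
        server_time = get_server_time()        # server's clock value T
        t1 = time.monotonic()                  # reply received
        rtt = t1 - t0
        if best_rtt is None or rtt < best_rtt:
            best_rtt = rtt
            best_estimate = server_time + rtt / 2.0
    return best_estimate, best_rtt

estimate, rtt = cristian_sync()
print("estimated time:", estimate, "shortest RTT (s):", rtt)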
Berkeley algorithm (Physical Clock Synchronization)
 Developed by Gusella and Zatti at the University of California,
Berkeley in 1989.

 Assumes no machine has an accurate time source.

 Intended for use within intranets.

 The server process (called the leader) periodically polls


other follower processes requesting for time.

 Based on the answers, it computes an average time & tells all the
other nodes to advance their clocks to the new time or slow their
clocks down until some specific reduction has been achieved

 The time daemon’s time must be set manually by the operator


periodically
17
Example
The time daemon sends its own clock value (3:00) and asks
all other nodes for their clock values

[Figure: time daemon at 3:00; over the network, P1 reads 2:50 and P2 reads 3:25]

18
Example
The nodes answer with the difference between their time and the
time at the time daemon (i.e., −10 and +25 minutes)

[Figure: time daemon at 3:00; P1 replies −10 (2:50 − 3:00), P2 replies +25 (3:25 − 3:00)]

19
Example

The time daemon computes the average time of all the nodes, including the time
daemon itself, i.e., (3:00 + 2:50 + 3:25)/3 = 9:15/3 = 3:05.

[Figure: the time daemon moves to 3:05; P1 (2:50) is told +15 and P2 (3:25) is told −20,
so all three clocks end up at 3:05]

The time daemon tells the other nodes to adjust their clock


values by increasing or decreasing them, sending the difference
in values (i.e., +15 and −20 minutes) instead of the average value.
Berkeley algorithm
A leader is chosen via an election process such as
Chang and Roberts algorithm.

The leader polls the followers who reply with their time in a similar
way to Cristian's algorithm.

The leader observes the round-trip time (RTT) of the messages and
estimates the time of each follower and its own.

The leader then averages the clock times, ignoring any values it
receives far outside the values of the others.

 Instead of sending the updated current time back to the other


processes, the leader then sends out the amount (positive or negative)
that each follower must adjust its clock. This avoids further
uncertainty due to RTT at the follower processes.
21
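A sketch of the leader's averaging step follows; the outlier tolerance and the helper's name are assumptions for illustration. It returns the signed adjustment for each node rather than the averaged time, matching the description above, and the sample values reproduce the 3:00 / 2:50 / 3:25 example from the earlier slides.

def berkeley_adjustments(leader_time, follower_times, tolerance=60.0):
    """Compute per-node clock adjustments for the Berkeley algorithm.

    leader_time    -- the leader's own clock reading (here, minutes)
    follower_times -- {node_id: estimated clock reading at that node}
    tolerance      -- readings further than this from the leader are
                      ignored when averaging (assumed outlier bound)
    """
    readings = {"leader": leader_time}
    readings.update(follower_times)
    # Ignore obviously faulty clocks when computing the average.
    usable = [t for t in readings.values() if abs(t - leader_time) <= tolerance]
    average = sum(usable) / len(usable)
    # Send each node the signed amount it must adjust, not the average itself.
    return {node: average - t for node, t in readings.items()}

# Leader reads 3:00, P1 reads 2:50, P2 reads 3:25 (expressed in minutes past 0:00).
print(berkeley_adjustments(180.0, {"P1": 170.0, "P2": 205.0}))
# -> {'leader': 5.0, 'P1': 15.0, 'P2': -20.0}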
Limitations of the Berkeley Algorithm
Gusella and Zatti released results involving 15 computers whose
clocks were synchronized to within about 20-25 milliseconds using
their protocol.

Computer systems normally avoid rewinding their clock when they


receive a negative clock alteration from the leader. This would break
the property of monotonic time, which is a fundamental assumption
in certain algorithms in the system itself or in programs such as
make.

A simple solution to this problem is to halt the clock for the duration
specified by the leader, but this simplistic solution can also cause
problems, although they are less severe. For minor corrections, most
systems slow the clock (known as "clock slew"), applying the
correction over a longer period of time.

Often, any client whose clock differs by a value outside of a given
tolerance is ignored when computing the average.
22


Logical Clock
Due to the lack of a global physical clock in a distributed system and the
limitations of the Berkeley algorithm, Lamport introduced the concept
of the logical clock, based on event ordering instead of a physical
clock.

Two event-ordering clocks: (i) Lamport's clock (also known as the


scalar clock)
For partial ordering of events
(ii) Vector clocks (a modification of Lamport's clocks)

Lamport's logical clock: can be considered a counter/integer


value

Lamport defined certain rules to increment the counter values


which are assigned to events in the processes of a distributed system
23
The clock drift rate (increment d) is usually assumed to be 1 unit; however, any value greater
than zero may be used.
Each process has some number n of instructions or tasks

What is an event?

Send, Receive, Print, etc.

24
Three Conditions proposed by Lamport:

(1) a -> b => C(a) < C(b) (happened-before relation): indicates that event
a is always earlier than event b

(2) If a is the sending event of message m and b is the receive event of


message m, then C(a) < C(b)

(3) a -> b, b -> c => a -> c (transitive relation)

Where a, b & c are events that may be executed in the same or different


processes, and

C(x) = timestamp of event x

Examples of events: sending, receiving, executing, print, etc.
25


Logical Clocks

Physical clocks are physical entities that assign physical times to


events,

Logical clocks order events logically by assigning logical timestamps


instead of physical ordering.

In fact, the logical clock decides the ordering of events across different


parallel, concurrent, or independent processes.

Logical clocks are simply a conceptualization of a


mathematical function that assigns numbers to events. These
numbers act as timestamps that help in ordering events.

They refer to implementing a protocol on all machines within your


distributed system, so that the machines are able to maintain a
consistent ordering of events within some virtual timespan.
26
Logical Clocks
More formally, each process Pi has a clock Ci which is a function from events to the integers.

The timestamp of an event e in Pi is Ci(e).

The system clock C is likewise a function from events to the integers, where C(e) = Ci(e) whenever e is an event in Pi

Causal Functionality:

Given 2 events (e1, e2) where one is caused by the other (e1 contributes to e2 occurring), the
timestamp of the causing event (e1) is less than that of the other event (e2).

27
Implementation Rules
To provide this functionality any Logical Clock must provide 2 rules:

Rule 1: this determines how a local process updates its own clock when an event occurs.

Before executing an event (excluding the event of receiving a message) increment the
local clock by 1.
Local_clock = local_clock + 1

Rule 2: determines how a local process updates its own clock when it receives a message from another
process. This can be described as how the process brings its local clock in line with information about the
global time.

When receiving a message (the message must include the sender's local clock value), set your local
clock to the maximum of the received clock value and the local clock value. After this, increment your
local clock by 1.

1. local_clock = max(local_clock, received_clock)

2. local_clock = local_clock + 1

3. message becomes available.


28
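The two rules above translate directly into a few lines of Python; the class below is a minimal sketch (the class and method names are my own, not part of the slides).

class LamportClock:
    """Scalar logical clock implementing the two rules above."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Rule 1: increment before any local or send event."""
        self.time += 1
        return self.time

    def send(self):
        """Return the timestamp to piggyback on an outgoing message."""
        return self.tick()

    def receive(self, received_clock):
        """Rule 2: merge the sender's clock value, then increment."""
        self.time = max(self.time, received_clock)
        self.time += 1
        return self.time

p1, p2 = LamportClock(), LamportClock()
ts = p1.send()               # P1 sends a message carrying timestamp ts
print(p2.receive(ts))        # P2's clock jumps past the sender's value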
Lamport's Logical Clocks
Key Idea:
(1) Processes exchange messages
(2) Messages must be sent before they are received
(3) Send/receive events are used to order the events & synchronize the logical clocks

Let

Pi be process i
a, b & c … be events in the processes
Ci(a) be the timestamp of event ‘a’ of process Pi
IR be an implementation rule

Clock condition to evaluate the logical clocks, with the following correctness criteria:

1. ∀a,b. a → b ⟹ C(a) < C(b) (happened-before relation denoted by →)


2. [C1] : Ci(a) < Ci(b) applies to events a → b in the same process
3. [C2] : Ci(a) < Cj(b) applies to a send event a in Pi and the corresponding receive event b in Pj (different processes)
4. [IR1] : If a → b are successive events in the same process, then Ci(b) = Ci(a) + d {d > 0}, where d is the clock increment ("drift rate")
(applies within the same process)
5. [IR2] : Cj = max(Cj, tm + d), where tm is the timestamp carried by the incoming message, i.e., Ci(a)
(applies to process j when an incoming arrow reaches the current process j)
29
Clock Values

a, b, c, d, e, f, g, h, i, j, k, l, m are events

1, 2, 3, 4, 5, 6, 7 are clock values or timestamps for the above events

No proper ordering of events


Ordering of events
Every process P0, P1, & P2 in a distributed system orders
its events for execution

Process P0 has 7 events a, b, c, d, e, f, and g; their time


stamps are 1, 2, 3, 4, 5, 6, 7

Process P1 has 3 events h, i, and j; their timestamps


are 1, 2, & 3

Process P2 has 3 events k, l, & m; their timestamps are


1, 2, & 3.

C(d) = 4; C(m) = 3; 4 > 3 does not satisfy Lamport's clock condition
31

Clock Values
[Figure: processes P1 (events e11–e17) and P2 (events e21–e25) exchange messages.
P1's clock values so far: (1) (2) (3) (4), then an incoming arrow is encountered;
P2's clock values so far: (1) (2) (3) max(3,3); the "Rule Applied" rows mark Rule 1 for local events and Rule 2 at receives]

(1) When an incoming arrow is detected with respect to a process, Rule 2 needs to
be followed, i.e., max(local clock + 1, sending process's clock value + n/w delay 1)

(2) Drift rate d is assumed to be 1


32
Clock Values
[Figure: P1's events e11–e16 have clock values (1) (2) (3) (4) (5) and max(5, 3) = (6);
P2's events e21–e24 have clock values (1) (2) (3) and max(3, 3) = (4);
Rule 1 is applied at local events and Rule 2 at the receive events]

Clock value of e25: max(C(e16) + 1, C(e24) + 1) = max(6 + 1, 4 + 1) = max(7, 5) = 7
33
Clock Values
[Figure: P1's events e11–e17 have clock values (1) (2) (3) (4) (5), max(5, 3) = (6), and 7;
P2's events e21–e25 have clock values (1) (2) (3), max(3, 3) = (4), and 7: max(5, 7)]

Clock value of e17: max(C(e24) + 1, C(e16) + 1) = max(4 + 1, 6 + 1) = max(5, 7) = 7

34
Another Example

35
36
37
Logical Clock

38
Logical Clock

39
Another Example

40
[Figure: three processes exchanging messages.
P1: e11 e12 e13 e14 with clock values 1 2 7 8;
P2: e21 e22 e23 e24 e25 with clock values 1 2 3 5 6;
P3: e31 e32 e33 e34 e35 e36 with clock values 1 2 3 4 5 6]

41
Limitations of Lamport's Clock (or
Scalar Clock)
W.r.t. implementation rules 1 and 2:
[IR1]: If a → b then Ci(a) < Ci(b) — true
[IR2]: For events a and b in different processes, Ci(a) < Cj(b) may or may not reflect any actual ordering
Limitation: Difficult to predict whether the clock value of e11 < the clock
value of e31 or not.
This is called partial ordering of events (cannot resolve clock issues
with the same counter values)

[Figure: P1 (e11, e12; clock values (1) (2), Rule 1 applied) and P2 (e21, e22; clock values (1) (3))
work globally & communicate — causal dependency;
P3 (e31, e32, e33; clock values (1) (2) (3), Rule 1 applied) is an independent process that works locally,
with no incoming edges]
42
Two Types of Event Ordering
Partial Ordering of Events: Supported by Lamport's logical clock
Clock values are obtained for each event within a process
The execution order of events in concurrent processes
cannot be predicted
The problem arises because a single number is used to represent time

Total Ordering of Events: Used to solve the problem of partial ordering of


events using an arbitrary mathematical function

Example: Multiply the clock value in Pi by 10 and add i, so that the values are
different from each other

Finding the clock values of every event in concurrent


processes

Resolves the issue of having the same counter


values in different processes
43
Total Ordering of Events

[Figure: total ordering applied to the earlier example.
P1: e11 e12 e13 e14 with values (clock*10+1): 11 21 71 81;
P2: e21 e22 e23 e24 e25 with values (clock*10+2): 12 22 32 52 62;
P3: e31 e32 e33 e34 e35 e36 with values (clock*10+3): 13 23 33 43 53 63]

Math function: clock value * 10 + process
number
44
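The total-ordering trick above (clock value * 10 + process number) can be sketched as follows; comparing (clock value, process id) pairs lexicographically, also shown, achieves the same tie-breaking without assuming single-digit process numbers.

def total_order_timestamp(clock_value, process_id):
    """Slides' scheme: multiply the clock value by 10 and add the process number.

    Only unambiguous while process ids are single digits; the tuple
    comparison below is the general-purpose equivalent.
    """
    return clock_value * 10 + process_id

def happens_earlier(ts_a, ts_b):
    """Total order on (clock_value, process_id) tuples, compared lexicographically."""
    return ts_a < ts_b

# Events with the same Lamport value 3 in P1 and P2 now get distinct timestamps:
print(total_order_timestamp(3, 1), total_order_timestamp(3, 2))   # 31 32
print(happens_earlier((3, 1), (3, 2)))                            # True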
Vector Clocks
Vector Clocks extend Lamport's scalar time to provide a causally
consistent view of the world.

By looking at the clock, we can observe whether one event caused


another event.

Provides partial ordering of events

Each process keeps a vector (a list of integers) with an integer for


each local clock of every process within the system.

For N processes, a vector of size N is maintained by each process.

45
Vector Clocks
Given a process (Pi) with a vector (v), Vector Clocks implement the Logical
Clock rules as follows:

Rule 1: Before executing an event (excluding the event of receiving a message)


process Pi increments the value v[i] within its local vector by 1.

This is the element in the vector that refers to Node(i)’s local clock.

local_vector[i] = local_vector[i] + 1

Rule 2: When receiving a message (the message must include the sender's vector),
loop through each element in the vector sent and compare it to the local vector,
updating the local vector to be the maximum of the local and received clock values.

Then increment your local clock within the vector by 1

1. For k = 1 to N: local_vector[k] = max(local_vector[k], sent_vector[k])


2. local_vector[i] = local_vector[i] + 1
3. message becomes available.
46
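A minimal vector clock following the two rules above (the class layout is my own; the slides do not prescribe an implementation):

class VectorClock:
    """Vector clock for process `pid` in a system of `n` processes."""

    def __init__(self, pid, n):
        self.pid = pid
        self.v = [0] * n

    def tick(self):
        """Rule 1: increment this process's own entry before a local or send event."""
        self.v[self.pid] += 1
        return list(self.v)

    def send(self):
        """Return the vector to piggyback on an outgoing message."""
        return self.tick()

    def receive(self, sent_vector):
        """Rule 2: element-wise max with the sender's vector, then increment own entry."""
        self.v = [max(a, b) for a, b in zip(self.v, sent_vector)]
        self.v[self.pid] += 1
        return list(self.v)

p1, p2 = VectorClock(0, 3), VectorClock(1, 3)
m = p1.send()           # P1 sends a message carrying [1, 0, 0]
print(p2.receive(m))    # P2 becomes [1, 1, 0]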
Advantages/Disadvantages of
Vector Clocks
Advantage: Provide a causally consistent ordering of
events

Disadvantages: Costly due to the need to send the


entire vector to every process for every message sent, in
order to keep the vector clocks in sync.

When there are a large number of processes this technique


can become extremely expensive, as the vector sent is
extremely large.

Gives partial order of events for processes in a distributed


system
47
Improvements over Vector Clocks
(1) Singhal–Kshemkalyani’s differential technique : This
approach improves the message passing mechanism by
only sending updates to the vector clock that have
occurred since the last message sent from Process(i) →
Process(j).

This drastically reduces the message size being sent,


But does require O(n²) storage.

(2) Fowler-Zwaenepoel direct-dependency technique :


Further reduces the message size by only sending the
single clock value of the sending process with a message.
However, this means processes cannot know their
transitive dependencies when looking at the causality of events.
48
49
Vector Clocks
Clock Val: (1 0 0) (2 0 0) (3 4 1)
P1 e11 e12 e13

Clock Val: (0 1 0) (2 2 0) (2 3 1) (2 4 1)
P2 e21 e22 e23 e24

P3 e31 e32
Clock val (0 0 1) (0 0 2)

TIME ------

P2 -> e22 -> Max[( 2 0 0 ) ( 0 2 0)] = (2 2 0)


P2 -> e23 -> Max [(2 3 0) (0 0 1)] = (2 3 1)
P1 -> e13 -> Max [(3 0 0) (2 4 1)] = (3 4 1)
50
Vector Clocks
Clock Val: (1 0 0) (2 0 0) (3 0 0) (4 0 0)
P1 e11 e12 e13 e14

Clock Val: (0 1 0) (0 2 1) (2 3 1) (2 4 1)
P2 e21 e22 e23 e24

P3 e31 e32 e33


Clock val (0 0 1) (0 0 2)
(4 0 3 )

TIME ------

P2 -> e22 -> max [(0 2 0), (0 0 1)] = (0 2 1)


P2 -> e23 -> max [(2 0 0), (0 3 1)] = (2 3 1)
P3 -> e33 -> max [(0 0 3), (4 0 0)] = (4 0 3)
51
Applications of Vector Clocks

 For updating data during transactions in


distributed databases

Transactions can be assigned logical time


stamps

Provides a consistent view of transactions and


correct updating of data in distributed databases

52
References

https://www.youtube.com/watch?v=VqZa4raMv_Q

53
Thank
You

54
LEADER ELECTION

National Institute of Technology Rourkela


Leader Election Algorithms
• Leader election is the simple idea of giving one thing (a
process, host, thread, object, or human) in a distributed
system some special powers such as:
– the ability to assign work,
– the ability to modify a piece of data, or
– even the responsibility of handling all requests in the
system.

• Advantages: a powerful tool for improving efficiency,


reducing coordination, simplifying architectures, and
reducing operations.

• Disadvantages: Can introduce new failure modes and
scaling bottlenecks
56
Requirement of Leader Election
Typically leader election is used:
To ensure exclusive access by a single node to
shared data, or
To ensure a single node coordinates the work
in a system.

For replicated database systems such as MySQL,


Apache Zookeeper, or Cassandra, we need to
make sure only one "leader" exists at any given
time.
57
Applications of LEAs
Radio networks:
In radio network protocols, leader election is often used as a first
step to approach more advanced communication primitives, such as
message gathering or broadcasts.

 When adjacent nodes transmit at the same time in wireless


networks (which is very natural), collisions occur; electing a leader allows
this process to be better coordinated.

While the diameter D of a network is a natural lower bound for the


time needed to elect a leader, upper and lower bounds for the
leader election problem depend on the specific radio model.

RDBMS:
RDBMSs rely on leader election to pick a leader database which
handles all writes and, sometimes, all reads. The election may be
automated, but it is frequently done by a human operator.
58
Election Algorithms
Many distributed algorithms need one process to act as a
coordinator for coordinating all the activities in a distributed system

Election algorithms are to pick a unique coordinator/leader based on


certain criteria such as largest identifier

Examples:
(1) Take over the role of a failed process (Fault Tolerance)

(2) Pick a master in Berkeley clock synchronization algorithm


(Physical Clock Synchronization)

(3) A powerful tool used in systems across Amazon for fault-


tolerance and easier to operate.

(4) A znode in ZooKeeper is chosen as leader. All application


processes watch the current smallest znode, which is ephemeral
59
Leader Election Algorithms
Once the leader is elected, the nodes reach a particular state known
as terminated state.

The states are partitioned into elected states & non-elected


states.

When a node enters either state, it always remains in that state.

Safety and liveness condition for execution of Leader Election


Algorithm:

Liveness condition: Every node will eventually enter an elected


state or a non-elected state.

Safety condition: Only a single node enters the elected state &
eventually becomes the leader.
60
Validity of LEA
A LEA is valid if it meets the following
conditions:
Termination: the algorithm should finish
within a finite time once the leader is
selected. In randomized approaches this
condition is sometimes weakened (for
example, requiring termination with
probability 1).

Uniqueness: there is exactly one node that
considers itself the leader.
61
Types of LEA
An algorithm for leader election may vary in the following
aspects:

Communication mechanism: the nodes are either


synchronous in which processes are synchronized by a
clock signal or asynchronous where processes run at
arbitrary speeds.

Process names: whether processes have a unique identity


or are indistinguishable (anonymous).

Network topology: for instance, ring, acyclic graph or


complete graph.
62
Types of Leader Election Algorithms

The most prominent LE algorithms are:

a. Bully Algorithm presented by Garcia-Molina in 1982.


Improved Bully Election Algorithm by A. Arghavani in
2011.
Modified Bully Election Algorithm by M. S. Kordafshari
and group.

b. Ring Algorithm
Modified Ring Algorithm

63
Bully Algorithm: Basic Assumptions
1. The system is synchronous

2. Each process has a unique numerical Id

3. Processes know the Ids and addresses of every other process

4. Communication/message delivery between processes is reliable

5. The processes may fail at any time including during execution


of algorithm

6. There is a failure detector

7. A process fails by stopping; it no longer responds to messages

8. Used to elect a coordinator dynamically from a set of


distributed computing processes 64
Bully Algorithm
Key idea:
Select process with highest Id
Processes initiate election if just recovered from failure
If coordinator failed several processes can initiate
an election simultaneously

Types of Messages:
Coordinator: For announcing the victory of election
Election Message: To initiate election process
Alive Message: To indicate that the responding process is alive

O(n²) messages are needed with n processes

65
Bully Algorithm

When a process P recovers from failure, or the failure detector indicates that the
current coordinator has failed, P performs the following actions:

Step 1: If P has the highest process ID, it sends a Victory message to all other
processes and becomes the new Coordinator. Otherwise, P broadcasts an Election
message to all other processes with higher process IDs than itself.

Step 2: If P receives no Answer after sending an Election message, then it


broadcasts a Victory message to all other processes and becomes the Coordinator.

Step 3: If P receives an Answer from a process with a higher ID, it sends no further
messages for this election and waits for a Victory message. (If there is no Victory
message after a period of time, it restarts the process at the beginning.)

Step 4: If P receives an Election message from another process with a lower ID it


sends an Answer message back and starts the election process at the beginning,
by sending an Election message to higher-numbered processes.

Step 5: If P receives a Coordinator message, it treats the sender as the


coordinator. 66
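The steps above can be condensed into a toy, synchronous simulation; the class below is an assumption about how one might code it (real implementations exchange ELECTION/ANSWER/COORDINATOR messages asynchronously, with timeouts and a failure detector).

class BullyCluster:
    """Toy, synchronous simulation of the bully election steps above."""

    def __init__(self, ids):
        self.alive = set(ids)                    # processes still responding

    def elect(self, initiator):
        """Election started by `initiator`; returns the new coordinator id."""
        # Send an Election message to every live process with a higher id (Step 1).
        higher = [p for p in self.alive if p > initiator]
        if not higher:
            # No Answer arrives, so the initiator declares victory (Step 2).
            return self._announce(initiator)
        # Otherwise the election propagates upward; the highest live id
        # eventually receives no Answer and wins (Steps 3-4).
        return self._announce(max(higher))

    def _announce(self, winner):
        # Broadcast the Victory/Coordinator message (Steps 2 and 5).
        print("COORDINATOR message: new leader is N%d" % winner)
        return winner

cluster = BullyCluster({3, 5, 6, 12, 32, 80})
cluster.alive.discard(80)                        # coordinator N80 crashes
cluster.elect(6)                                 # N6 detects the failure -> N32 wins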
Algorithm (Bully)
Step 1: Let process P send a message to the
coordinator (P → C)
Step 2: If the coordinator does not respond within a time interval T, then it is
assumed that the coordinator has failed

Step 3: Now process P sends an election message to every process with a higher
priority number

Step 4: It waits for responses; if no one responds within time interval T, then process
P elects itself as the coordinator

Step 5: It then sends a message to all lower-priority processes announcing that it has been elected
as their new coordinator

Step 6: If an answer is received within time T from any other process Q,

process P waits for a further interval T to receive a message from Q
announcing that Q has been elected as coordinator
67
Example Bully Algorithm

[Figure: nodes N80 (failed), N32, N12, N6, N5, N3; N6 detects the failure and sends ELECT messages]

Let process N80 fail and its failure be detected by node N6 with the


help of a failure detector
N6 sends election messages to all processes having
higher Ids, i.e., N80, N32 & N12
68
N80 does not respond

[Figure: N80 (failed); N12 & N32 reply OK to N6; N32 then sends Coordinator messages to N3, N5, N6 & N12]

N12 & N32 send OK (Alive) messages to N6, their ids being higher
than N6's

N32 knows the ids of all other processes (every process knows the
ids of all other processes)

N32 sends a Coordinator (Victory) message to all lower-Id processes


69
If failures stop, a leader will eventually be elected

How to set the timeouts ?

Answer: Based on worst case time to complete election

5 message transmission times if there are no failures during the run

1.Election from lowest id server in group

2.Answer to lowest id server from 2nd highest id process

3.Election from second highest id server to highest id

4.Timeout for answers @ 2nd highest id server

5.Coordinator from second highest id server


70
Analysis

Worst case completion time: 5 message transmission times

When the process with lowest id in the system detects failure


N-1 processes altogether begin elections, each sending
messages to processes with higher ids

ith highest id process sends (i-1) election messages

No. of election messages:

N-1 + N-2 + … + 1 = (N-1)*N/2 = O(N²)

Best case
Second highest id detects leader failure
Sends (N-2) coordinator messages
Completion time: 1 message transmission time 71
Impossibility
Since timeouts are built into the protocol, in the asynchronous
system model:

the protocol may never terminate -> liveness is not


guaranteed

But it satisfies liveness in the synchronous system model, where the

worst-case one-way latency can be calculated = worst-


case processing time + worst-case message latency

72
Disadvantages of Bully Algorithm
(a) Space Complexity is very large since every process
should know the identity of every other process in the
system.

(b) High number of message passing during


communication increases heavy traffic.

(c) The message complexity has order O(n²).

73
Improved Bully Algorithm
Presented by A. Arghavani, E. Ahmadi, and A. T. Haghighat in 2011.

Overcomes the disadvantages of the original bully.

The main concept: The algorithm declares the new coordinator


before actual or current coordinator is crashed. (needs extra stages)

Before the coordinator is failed, the current coordinator tries to


gather information about processes in the system and declares the
next possible coordinator to the processes.

With increasing knowledge, having obtained the ids of all other processes, a


process with a bigger id attempts to execute the bully algorithm.

If the coordinator is failed, each process that notices this failure


compares its id with the id which it has received via the coordinator.
74
Disadvantages of Improved Bully
Algorithm
It has a complex structure.

Every process has to update its database every time.

A large database is required to maintain the


information of each process in the database of
every process.

75
MODIFIED ELECTION ALGORITHM
Presented by M. S. Kordafshari, M. Gholipour, M. Jahanshahi, and A. T. Haghighat in 2005.

The algorithm resolves the disadvantages of the bully algorithm.

1. When any process P notices that the coordinator is not responding, it initiates an


election and sends an election message to all processes with higher priority numbers.

2. If no process responds, process P wins the election and becomes the new


coordinator.

3. Each process with a higher priority sends an OK message with its priority number to
process P.

4. When process P receives all the responses, it selects the process with the highest
priority number as the new coordinator and sends a grant message to it.

5. The new coordinator then broadcasts a coordinator message to all
other processes, announcing itself as the coordinator.
76
Disadvantages of Modified Bully Algorithm

The modified algorithm is also time bounded.

It is better than the bully algorithm but still has O(n²) complexity in the


worst case.

It is necessary for every process to know the priority of the


others.

77
Ring Algorithm
The algorithm applies to system organized as a ring
(logically or physically)

Assumptions: The links between the processes are


unidirectional, and every process can send messages only to the
process on its right (clockwise)

Data structure used in the algorithm: the active list,


i.e., a list that holds the priority numbers of all active processes
in the system

[Figure: a ring of processes 0, 1, 2, 3, 4]

78
Algorithm: Ring
Step 1: If process P1 detects a coordinator failure, it creates a new
active list, which is empty initially.
It sends an election message to its neighbour on the right and adds the number
1 to its active list.

Step 2: If process P2 receives an election message from the process on its


left, it responds in one of 3 ways:
(i) If the received message's active list does not contain 2, then P2 adds 2 to
its active list & forwards the message
(ii) If this is the first election message it has received or sent, P2
creates a new active list with the numbers 1 and 2.
It then sends the election message for 1 followed by the one for 2

[Figure: a ring of processes 0–4, with the coordinator marked]

(iii) If process P1 receives its own election message (containing 1), then the active


list for P1 now contains the numbers of all the active processes in the
system; P1 then elects the process with the highest number as the new coordinator.
79
Example: Ring Algorithm
Processes 0–7 are participating in the network
A process P that thinks the coordinator has crashed builds an election message
which contains its own id number
It sends it to its first live successor (e.g., node 5 sends [5] to node 6)
Each process adds its own number and forwards the message to the next node
It is O.K. to have two elections at once

[Figure: ring of processes 0–7; the previous coordinator (7) has crashed.
One election starts at node 5: its active list grows [5], [5, 6], [5, 6, 0], ... to [5, 6, 0, 1, 2, 3, 4].
A second election starts at node 2: its active list grows [2], [2, 3], [2, 3, 4], ... to [2, 3, 4, 5, 6, 0, 1] back at node 2]
80
Example: Ring Algorithm

When the message returns to P, it sees its own process ID


in the list &
knows that the circuit is complete

P circulates a “COORDINATOR” message announcing the member with the


highest number as the new coordinator

Here both elections (started by 2 and by 5) elect 6 as the leader

[5, 6, 0, 1, 2, 3, 4]
[2, 3, 4, 5, 6, 0, 1]  ->  6 is the Coordinator

81
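The active-list circulation can be simulated in a few lines; the function below is a sketch under the assumption that the ring is given as a Python list of live process ids in clockwise order (it reproduces the [5, 6, 0, 1, 2, 3, 4] list from the example).

def ring_election(ring, initiator):
    """Simulate one circulation of the ring election's active list.

    ring      -- live process ids in clockwise order
    initiator -- id of the process that detected the coordinator failure

    The ELECTION message accumulates every live id as it passes around
    the ring; when it returns to the initiator, the highest id in the
    active list is announced as the new coordinator.
    """
    start = ring.index(initiator)
    active_list = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)]   # walk clockwise
        active_list.append(node)                  # each node appends its own id
    coordinator = max(active_list)                # message is back at the initiator
    return active_list, coordinator

# Ring 0..6 with the old coordinator 7 crashed; node 5 starts the election.
print(ring_election([0, 1, 2, 3, 4, 5, 6], 5))    # ([5, 6, 0, 1, 2, 3, 4], 6)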
MODIFIED RING ALGORITHM

When a node notices that the leader has crashed, it sends its ID number to its
neighboring node in the ring. Thus, it is not necessary for all nodes to send their
IDs into the ring.

The receiving node compares the received ID with its own, and forwards whichever
is the greatest. This comparison is done by all the nodes such that only the
greatest ID remains in the ring.

Finally, the greatest ID returns back to the initial node.

If the received ID equals that of the initial sender, it declares itself as the leader
by sending a coordinate message into the ring.

It can be observed that this method dramatically reduces the overhead involved in
message passing.

Thus, if many nodes notice the absence of the leader at the same time, only the
message of the node with the greatest ID circulates in the ring, thus preventing
smaller IDs from being sent.
82
If n{i1,i2,··· ,im} is the number of nodes that concurrently detect the absence of
• Leader election is an important component of many cloud
computing systems

• Classical leader election protocols: Ring and Bully

• But Failure Prone

• Paxos-like protocols are used by Google Chubby and Apache


ZooKeeper

83
Applications of Leader Election

In Wireless Networks:
Key distribution,
Routing coordination,
Sensor coordination, and
General control.

In Cloud Computing:
Resolving Conflicts During Resource
sharing
84
How Amazon elects a leader ?
There are many ways to elect a leader, ranging from algorithms like
Paxos, to software like Apache ZooKeeper, to custom hardware, to
leases.

Leases:
are the most widely used leader election mechanism at Amazon.

are relatively straightforward to understand and implement

offer built-in fault tolerance.

work by having a single database that stores the current leader.

requires that the leader heartbeat periodically to show that it’s


still the leader.
85
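To make the lease idea concrete, here is a minimal in-memory sketch; the LeaseStore class stands in for the single database mentioned above, and its API, lease duration, and semantics are illustrative assumptions (a real system would use a strongly consistent store with conditional writes, e.g., DynamoDB or ZooKeeper).

import time

class LeaseStore:
    """In-memory stand-in for the single database that stores the current leader."""

    def __init__(self, lease_seconds=10.0):
        self.lease_seconds = lease_seconds
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, candidate):
        """Take the lease if it is free or has expired (assumed to be atomic)."""
        now = time.monotonic()
        if self.holder is None or now >= self.expires_at:
            self.holder = candidate
            self.expires_at = now + self.lease_seconds
            return True
        return False

    def heartbeat(self, candidate):
        """Renew the lease; the leader must call this periodically."""
        if self.holder == candidate and time.monotonic() < self.expires_at:
            self.expires_at = time.monotonic() + self.lease_seconds
            return True
        return False                      # lost the lease: stop acting as leader

store = LeaseStore(lease_seconds=2.0)
print(store.try_acquire("host-a"))        # True  -> host-a becomes leader
print(store.try_acquire("host-b"))        # False -> lease still held
print(store.heartbeat("host-a"))          # True  -> lease renewed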
Examples of systems using leader election at Amazon

Leader election is a widely deployed pattern across


Amazon.

For example:

RDBMSs rely on leader election to pick a leader database


which handles all writes and sometimes all reads.

The election may be automated but it is frequently done


manually by a human operator.

86
Examples of systems using leader election at Amazon
Amazon EBS (Elastic Block Store) distributes reads and
writes for a volume (Solid State Drives/Hard Disk Drives)
over many storage servers.

To ensure consistency, it uses leader election to elect


primaries for each area of the volume which order the
reads and writes.

 If the primary copy fails, a follower copy steps in using the same


leader election mechanism.

Leader election ensures consistency while improving


performance by avoiding coordination on the data plane.
87
Examples of systems using leader election at Amazon

DynamoDB:
Uses a leader election protocol to elect an AWS Management Console
to monitor resource utilization and performance metrics of various
operations over data bases

Amazon Quantum Ledger Database (Amazon QLDB):


Elect a central trusted authority to provide a fully managed
ledger database that provides a transparent, immutable,
and cryptographically verifiable transaction log

Amazon Kinesis (Kinesis) :


The Kinesis Client Library (KCL) uses leases to ensure that
each Kinesis shard is processed by one owner, making it
easy to do scale-out processing of Kinesis streams. 88
What happens when leader fails?
Allows the new leader to confidently redrive work that the outgoing
leader may have partially completed or completed but didn't tell
others about.

To tolerate failures, Amazon distributed systems don’t have a single


leader. Instead, leadership is a property that passes from server to
server, or process to process.

In distributed systems, it’s not possible to guarantee that there is


exactly one leader in the system. Instead, there can mostly be one
leader, and there can be either zero leaders or two leaders during
failures.

 Idempotent systems can often tolerate two leaders with minimal loss of


efficiency
89
Characteristics of a Good Leader Election
Frequent Checkpointing: Frequent check of the remaining lease time
(or lock status in general) especially before initiating any operation
that has side-effects beyond the leader itself.

Network Latency: Consider that slow networking, timeouts, retries,


and garbage collection pauses can cause the remaining lease time to
expire before the code expects it to.

Correctness: Avoid heartbeating leases in a background thread. This


can cause correctness issues if the thread can’t interrupt the code
when the lease expires or the heartbeating thread dies.

Availability: Issues can occur if the work thread dies or stops
while the heartbeating thread holds on to the lease.
90
Characteristics of a Good Leader Election
Reliability: Have reliable metrics that show how much work a leader
can do versus how much it is doing now.

Scalability: Review the metrics often and make sure that there are
plans for scaling in advance of running out of capacity.

Flexibility: Make it easy to find which host is the current leader and
which host was the leader at any given time. Keep an audit trail or
log of leadership changes.

Formal Verification Tools: Model and formally verify the correctness


of distributed algorithms using tools like TLA+.

Bug Tolerance: This catches subtle, difficult-to-observe, and rare


bugs that can creep in when an application assumes too much about
the guarantees that a lease provides.
91
References
1. https://www.coursera.org/lecture/cloud-computing-2/1-4-bully-algorithm-K8QwJ

2. https://aws.amazon.com/builders-library/leader-election-in-distributed-systems/

3. Seema Balhara, Kavita Khanna, Leader Election Algorithms in Distributed


Systems, International Journal of Computer Science and Mobile Computing, Vol. 3,
Issue. 6, June 2014, pg.374 – 379

92
Thank
You

93
