
8/9/2024

SSWT ZG526
Distributed Computing
BITS Pilani
Pilani | Dubai | Goa | Hyderabad

BITS Pilani
Pilani | Dubai | Goa | Hyderabad

LECTURE 1

1
8/9/2024

Text and References

T1: Ajay D. Kshemkalyani and Mukesh Singhal, “Distributed Computing: Principles, Algorithms, and Systems”, Cambridge University Press, 2008 (reprint 2013).

R1: Kai Hwang, Geoffrey C. Fox, and Jack J. Dongarra, “Distributed and Cloud Computing: From Parallel Processing to the Internet of Things”, Morgan Kaufmann (Elsevier), 2012.

R2: John F. Buford, Heather Yu, and Eng K. Lua, “P2P Networking and Applications”, Morgan Kaufmann (Elsevier), 2009.

SSWT ZG526 - Distributed Computing 3 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

index
• DISTRIBUTED COMPUTING
• Contact Session – 1

• 1.1 Introduction - introduction to distributed computing in terms of various hardware and software models T1 (Chap.1)

• 1.2 Multiprocessor and multicomputer systems; distributed system design issues 1.3 Distributed communication
model (RPC) 1.4 Review of different communication models; review of design issues and challenges for building
distributed systems
• Contact Session – 2
• M2: Logical Clocks & Vector Clocks

• 2.1 Distributed Computational Model and Logical Clocks. T1 (Chap.3)


• 2.2 Lamport logical clocks 2.3 Vector clocks 2 Review of logical clocks; review of Lamport logical clock and vector clock
examples

4 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

2
8/9/2024

index

• Contact Session – 3

• M3: Global state and snapshot recording algorithms

• 3.1 Global states; principles used to record the global states T1 (Chap.4 & 5)

• 3 Review of recording global state

• Contact Session – 4
• M3: Global state and snapshot recording algorithms

• 3.2 Chandy-Lamport global state recording algorithm for FIFO channels and Lai-Yang algorithm for non-FIFO
channels

• 4 Review of the Chandy-Lamport and Lai-Yang global state recording algorithms for FIFO and non-FIFO
channels

5 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

index

• Contact Session – 5

• M4: Message ordering and Termination Pre-CH/CS

• 4.1 Causal ordering of messages; Birman-Schiper-Stephenson (BSS) algorithm with example T1 (Chap.6)

• 4.2 Schiper-Eggli-Sandoz (SES) protocol for causal ordering with example

• 5 Review of the Birman-Schiper-Stephenson (BSS) and Schiper-Eggli-Sandoz (SES) algorithms with examples

• Contact Session – 6

• M5: Distributed Mutual Exclusion

• 5.1 Distributed Mutual Exclusion; Centralized Algorithm T1 (Chap.9, 10)


• 5.2 Lamport DME Algorithm with Examples
• 5.3 Ricart Agrawala DME Algorithm with Example
• 6 Review of DME algorithms such as the Lamport and Ricart-Agrawala algorithms, with examples.
6 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

3
8/9/2024

index

• Contact Session – 7

• M5: Distributed Mutual Exclusion

• 5.4 Maekawa’s DME Algorithm with Example T1 (Chap.9, 10)

• 5.5 Token-based DME, broadcast-based algorithm; Suzuki-Kasami algorithm

• 5.6 Raymond’s Tree Based Algorithm


• 7 Review of DME algorithms such as Maekawa’s, broadcast-based (Suzuki-Kasami), and Raymond’s tree-based algorithms,
with examples.

• Contact Session – 8

• 8 Review of previous modules M1 to M5 for mid-term test preparation

7 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

index

• Contact Session – 9

• M6: Deadlock Detection


• 6.1 Deadlocks in distributed systems
• 6.2 Chandy-Misra-Haas (CMH) algorithm for the AND model (edge chasing)

• 6.3 Chandy-Misra-Haas (CMH) algorithm for the OR model (diffusion computation)


• 9 Review of the Chandy-Misra-Haas deadlock detection algorithms.

• Contact Session – 10
• M7: Consensus and Agreement Algorithm

• 7.1 Agreement Algorithm T1 (Chap.14)

• 7.2 Oral Message Algorithm

• 7.3 Applications of Byzantine Algorithm


• 10 Review of agreement and Oral Message (OM) algorithms
8 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

4
8/9/2024

index

• Contact Session – 11

• M8: Peer to Peer Computing and Overlay graphs

• 8.1 Introduction, P2P Architecture T1 (Chap.18)

• 8.2 Design of unstructured peer-to-peer networks

• 11 Review of design of structured P2P networks.


• Contact Session – 12

• M8: Peer to Peer Computing and Overlay graphs


• 8.3 Design of structured peer-to-peer networks T1 (Chap.18)
• 8.4 Security solutions for threats in P2P networks
• 12 Review of design of unstructured P2P networks

9 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

index

• Contact Session – 13

• M9: Cluster computing, Grid Computing


• 9.1 Cluster computing Introduction R2 (Chap.2, 7)

• 9.2 Design Components of cluster computers

• 13 Review of Cluster computing.

• Contact Session – 14

• M9: Cluster computing, Grid Computing


• 9.3 Grid Computing Introduction R2 (Chap.2, 7)

• 14 Review of grid computing.

10 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

10

5
8/9/2024

index

• Contact Session – 15

• M10: Internet of Things

• 9.4 IoT ; IoT Architecture

• 15 Review of IoT Architecture and Technologies

• Contact Session – 16
• 16 Review of previous modules M6 to M10 for comprehensive exam preparation

11 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

11

Massively Multiplayer Online Games

• Players log in from different parts of the world into a single virtual world
• Different physical times
– Logical clocks (S2)
• Consistent view of the virtual world
– Global state (S3 & S4)
• Player communication
– Causal ordering in group communication (S5)
• Send broadcasts on the network, placement of game servers
– Distributed graph algos (S6)

SSWT ZG526 - Distributed Computing 12 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

12

6
8/9/2024

Massively Multiplayer Online Games

• Completion of a stage or set of steps


– Termination detection (S5)
• Fight for resources
– Distributed Mutual Exclusion and Deadlocks (S7 & S9)
• A team of players making a decision
– Consensus protocols (S10)
• Game servers' architecture
– Client/Server, Peer-to-Peer, Clusters, Grid (S11 - S14)
• Large scale networks - millions of users, devices
– IoT (S15)

SSWT ZG526 - Distributed Computing 13 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

13

Module Details

SSWT ZG526 - Distributed Computing 14 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

14

7
8/9/2024

Module Details

M1 - Introduction to Distributed Computing

• Introduction to Distributed computing


• Motivation, multiprocessor vs multicomputer systems
• Distributed Communication Model; RPC
• Design issues and challenges

References : T1 (Chap.1)

SSWT ZG526 - Distributed Computing 15 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

15

Topics for today

• Introduction to Distributed Computing


• Motivation
• Classification and introduction to Parallel Systems
• Distributed Communication Models
• Design issues and challenges
• Summary

SSWT ZG526 - Distributed Computing 16 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

16

8
8/9/2024

DISTRIBUTED COMPUTING

• Distributed computing is a field of computer science that studies


distributed systems.
• A distributed system is a software system in which components located on
networked computers communicate and coordinate their actions by passing
messages.
• A distributed system is a collection of independent entities that cooperate to solve a
problem that cannot be individually solved.

SSWT ZG526 - Distributed Computing 17 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

17

DISTRIBUTED COMPUTING

• Multiple independent machines with no physical


common clock
• No shared memory, but somehow a global state
needs to be created when required
• Machines may be physically distant and need
good network infrastructure and protocols /
control algorithms to communicate
• Machines communicate via messages and
somehow the system figures out the message
order
• The machines doing pieces of the work can jointly
figure out when a task terminates
SSWT ZG526 - Distributed Computing 18 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

18

9
8/9/2024

DISTRIBUTED SYSTEM CHARACTERISTICS

A distributed system can be characterized as a collection of mostly autonomous


processors communicating over a communication network and having the following
features:

– No common physical clock


– No shared memory
– Geographical separation
– Autonomy and heterogeneity
(Different processors with different speed and OS)

SSWT ZG526 - Distributed Computing 19 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

19

RELATION TO COMPUTER SYSTEM COMPONENTS

• The distributed software is also termed middleware.


• A distributed execution is the execution of processes across the distributed system
to collaboratively achieve a common goal.
• An execution is also sometimes termed a computation or a run.
• The machines can access and use shared resources without conflicts, e.g. writing a
file on a network storage
• If the program is stuck with deadlock, it can be detected and resolved.
• Independent machines can have joint decisions made via agreement protocol - e.g.
majority voting by just passing messages
• All this logic can then be built into the distributed system using various architectural
patterns - client/server, peer to peer etc.

SSWT ZG526 - Distributed Computing 20 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

20

10
8/9/2024

RELATION TO COMPUTER SYSTEM COMPONENTS

A Middleware Service for Distributed Applications

SSWT ZG526 - Distributed Computing 21 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

21

RELATION TO COMPUTER SYSTEM COMPONENTS

Relation between software components on a machine

SSWT ZG526 - Distributed Computing 22 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

22

11
8/9/2024

RELATION TO COMPUTER SYSTEM COMPONENTS

Networking OS: a complex application built directly on the network stack has to solve distributed computing problems itself.
Distributed OS: easier programming, since the complexity is handled in distributed middleware / OS services.

SSWT ZG526 - Distributed Computing 23 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

23

Examples of Distributed Systems

• Cluster computing systems


• Banking applications
• Internet caching systems, e.g. Akamai
• Distributed Databases
• Peer-to-peer systems for content sharing
• Media streaming systems
• Real-time process control, e.g. aircraft control systems
• IoT or sensor networks

SSWT ZG526 - Distributed Computing 24 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

24

12
8/9/2024

Topics for today

• Introduction to Distributed Computing


• Motivation
• Classification and introduction to Parallel Systems
• Distributed Communication Models
• Design issues and challenges
• Summary

SSWT ZG526 - Distributed Computing 25 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

25

MOTIVATION

The motivation for using a distributed system is some or all of the following
requirements:

1. Inherently distributed computations


• In many applications such as money transfer in banking, or reaching consensus
among parties that are geographically distant, the computation is inherently
distributed.

SSWT ZG526 - Distributed Computing 26 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

26

13
8/9/2024

MOTIVATION

2. Resource sharing
• Resources such as peripherals, complete data sets in databases, special libraries, as
well as data (variables/files) cannot be fully replicated at all the sites because that is
often neither practical nor cost-effective.
• Further, they cannot be placed at a single site because access to that site might
prove to be a bottleneck. Therefore, such resources are typically distributed across
the system.
• For example, distributed databases such as DB2 partition the data sets across
several servers, in addition to replicating them at a few sites for rapid access as well
as reliability.

SSWT ZG526 - Distributed Computing 27 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

27

MOTIVATION

3. Access to geographically remote data and resources


• In many scenarios, the data cannot be replicated at every site participating in the
distributed execution because it may be too large or too sensitive to be replicated.
• For example, payroll data within a multinational corporation is both too large and
too sensitive to be replicated at every branch office/site. It is therefore stored at a
central server which can be queried by branch offices.
• Similarly, special resources such as supercomputers exist only in certain locations.
• Advances in the design of resource-constrained mobile devices as well as in the
wireless technology with which these devices communicate have given further
impetus to the importance of distributed protocols and middleware.

SSWT ZG526 - Distributed Computing 28 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

28

14
8/9/2024

MOTIVATION

4. Enhanced reliability
• A distributed system has the inherent potential to provide increased reliability
because of the possibility of replicating resources and executions, as well as the
reality that geographically distributed resources are not likely to crash or
malfunction at the same time under normal circumstances.

Reliability entails several aspects:


1. Availability, i.e., the resource should be accessible at all times
2. Integrity, i.e., the value/state of the resource should be correct
3. Fault-tolerance, i.e., the ability to recover from system failures

SSWT ZG526 - Distributed Computing 29 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

29

MOTIVATION

5. Increased performance/cost ratio


• By resource sharing and accessing geographically remote data and resources, the
performance/cost ratio is increased. Although higher throughput has not necessarily
been the main objective behind using a distributed system, nevertheless, any task
can be partitioned across the various computers in the distributed system.
• Such a configuration provides a better performance/cost ratio than using special
parallel machines. This is particularly true of the NOW (network of workstations) configuration.

SSWT ZG526 - Distributed Computing 30 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

30

15
8/9/2024

MOTIVATION

6. Scalability
• As the processors are usually connected by a wide-area network, adding more
processors does not pose a direct bottleneck for the communication network.

7. Modularity and incremental expandability


• Heterogeneous processors may be easily added into the system without affecting
the performance, as long as those processors are running the same middleware
algorithms. Similarly, existing processors may be easily replaced by other
processors.

SSWT ZG526 - Distributed Computing 31 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

31

Example Distributed System : Netflix

SSWT ZG526 - Distributed Computing 32 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

32

16
8/9/2024

Example Distributed System : Netflix

Widely distributed network of content caching servers

SSWT ZG526 - Distributed Computing 33 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

33

Topics for today

• Introduction to Distributed Computing


• Motivation
• Classification and introduction to Parallel Systems
• Distributed Communication Models
• Design issues and challenges
• Summary

SSWT ZG526 - Distributed Computing 34 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

34

17
8/9/2024

PARALLEL SYSTEMS

A parallel system may be broadly classified as belonging to one of three types:

1. A multiprocessor system
2. A multicomputer parallel system
3. Array processors

SSWT ZG526 - Distributed Computing 35 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

35

MULTIPROCESSOR SYSTEMS

• A multiprocessor system is a parallel system in which the multiple processors have


direct access to shared memory which forms a common address space.
• Such processors usually do not have a common clock.
• A multiprocessor system usually corresponds to a uniform memory access (UMA)
architecture in which the access latency, i.e., waiting time, to complete an access to any
memory location from any processor is the same.
• The processors are in very close physical proximity and are connected by an
interconnection network.
• All the processors usually run the same operating system, and both the hardware and
software are very tightly coupled.

SSWT ZG526 - Distributed Computing 36 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

36

18
8/9/2024

MULTIPROCESSOR SYSTEMS

SSWT ZG526 - Distributed Computing 37 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

37

MULTIPROCESSOR SYSTEMS

• The processors are usually of the same type, and are housed within the same
box/container with a shared memory.
• The interconnection network to access the memory may be a bus, although for
greater efficiency, it is usually a multistage switch with a symmetric and regular
design.
• There are two popular interconnection networks
1. The Omega network and
2. The Butterfly network

SSWT ZG526 - Distributed Computing 38 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

38

19
8/9/2024

MULTIPROCESSOR SYSTEMS

• Two popular interconnection networks – the Omega network and the Butterfly
network , each of which is a multi-stage network formed of 2×2 switching elements.
• Each 2×2 switch allows data on either of the two input wires to be switched to the
upper or the lower output wire.
• In a single step, however, only one data unit can be sent on an output wire. So if the
data from both the input wires is to be routed to the same output wire in a single
step, there is a collision.
• Various techniques such as buffering or more elaborate interconnection designs can
address collisions.
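As an aside, a minimal sketch (an illustration, not something on the slide) of the standard destination-tag routing used in such multi-stage networks: at each stage a 2×2 switch inspects one bit of the destination label, most significant bit first, to choose the upper or lower output wire, and two packets that need the same output wire of the same switch in one step collide. The function name and the 8-processor size are arbitrary choices for this example.

def route_omega(requests, n=8):
    """requests: {source: destination}. Returns the set of sources that collide."""
    k = n.bit_length() - 1                                        # number of stages = log2(n)
    shuffle = lambda p: ((p << 1) | (p >> (k - 1))) & (n - 1)     # perfect-shuffle wiring
    taken, collided = {}, set()
    for src, dst in requests.items():
        pos = src
        for stage in range(k):
            pos = shuffle(pos)                                    # wiring before each stage
            switch, out = pos >> 1, (dst >> (k - 1 - stage)) & 1  # destination-tag bit picks the wire
            if (stage, switch, out) in taken:                     # same output wire wanted twice: collision
                collided.add(src)
            taken.setdefault((stage, switch, out), src)
            pos = (switch << 1) | out                             # leave on the upper (0) or lower (1) wire
        assert pos == dst                                         # tag routing lands at the destination
    return collided

print(route_omega({0: 5, 3: 6}))    # P0 -> M5 and P3 -> M6 can coexist: prints set()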

SSWT ZG526 - Distributed Computing 39 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

39

MULTIPROCESSOR SYSTEMS

INTERCONNECTION NETWORKS FOR SHARED MEMORY MULTIPROCESSOR SYSTEMS

Omega network and Butterfly network


for n = 8 processors P0–P7 and memory banks M0–M7

SSWT ZG526 - Distributed Computing 40 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

40

20
8/9/2024

MULTIPROCESSOR SYSTEMS

Interconnection Networks

a) A crossbar switch - faster


b) An omega switching network - cheaper

SSWT ZG526 - Distributed Computing 41 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

41

MULTICOMPUTER PARALLEL SYSTEM

• A multicomputer parallel system is a parallel system in which the multiple


processors do not have direct access to shared memory.
• The memory of the multiple processors may or may not form a common address
space. Such computers usually do not have a common clock.
• The processors are in close physical proximity and are usually very tightly coupled
(homogenous hardware and software), and connected by an interconnection
network.
• The processors communicate either via a common address space or via message
passing. A multicomputer system with a common address space usually corresponds to a
non-uniform memory access (NUMA) architecture, in which the latency to access various
shared memory locations from different processors varies.

SSWT ZG526 - Distributed Computing 42 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

42

21
8/9/2024

MULTICOMPUTER PARALLEL SYSTEM

Non-uniform memory access (NUMA) multiprocessor

SSWT ZG526 - Distributed Computing 43 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

43

MULTICOMPUTER PARALLEL SYSTEM

A wrap-around 4×4 mesh.

For a k×k mesh, which contains k^2
processors, the maximum path length
between any two processors is
2(k/2 − 1). For example, for this 4×4
mesh (k = 4), it is 2(4/2 − 1) = 2.

SSWT ZG526 - Distributed Computing 44 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

44

22
8/9/2024

MULTICOMPUTER PARALLEL SYSTEM

• A k-dimensional hypercube has 2^k processor-and-memory units. Each such unit is a

node in the hypercube, and has a unique k-bit label. Each of the k dimensions is
associated with a bit position in the label; two nodes are neighbours if their labels differ in exactly one bit.

Four-dimensional hypercube.
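A small sketch (an assumption for illustration, not from the slide) of the e-cube / bit-fixing routing such labels enable: correct the differing label bits one dimension at a time, so the number of hops equals the Hamming distance between the source and destination labels.

def hypercube_route(src, dst, k):
    """Node labels visited when routing from src to dst in a k-dimensional hypercube."""
    path, node = [src], src
    for dim in range(k):                  # fix differing bits one dimension at a time
        if (node ^ dst) & (1 << dim):
            node ^= 1 << dim              # hop to the neighbour across this dimension
            path.append(node)
    return path

print(hypercube_route(0b0101, 0b1110, 4))   # [5, 4, 6, 14]: 3 hops = Hamming distance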

SSWT ZG526 - Distributed Computing 45 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

45

ARRAY PROCESSORS

• Array processors belong to a class of parallel computers that are physically co-
located, are very tightly coupled, and have a common system clock (but may not
share memory and communicate by passing data using messages).
• Array processors and systolic arrays that perform tightly synchronized processing
and data exchange in lock-step for applications such as DSP and image processing
belong to this category.
• These applications usually involve a large number of iterations on the data.

SSWT ZG526 - Distributed Computing 46 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

46

23
8/9/2024

ARRAY PROCESSORS

SSWT ZG526 - Distributed Computing 47 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

47

Flynn’s Taxonomy

SSWT ZG526 - Distributed Computing 48 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

48

24
8/9/2024

Some basic concepts

• Coupling
– Tight - SIMD, MISD shared memory systems
– Loose - NOW, distributed systems, no shared memory
• Speedup
– how much faster a program runs when given N processors as opposed to 1 processor — T(1) / T(N) (see the sketch after this list)
– Optional reading: Amdahl’s Law, Gustafson’s Law
• Parallelism / Concurrency of program
– Compare time spent in computations to time spent for communication via shared memory or message
passing
• Granularity
– Average number of compute instructions before communication is needed across processors
• Note:
– Coarse granularity —> suited to distributed systems; otherwise use tightly coupled multi-processors/computers
– High concurrency doesn’t lead to high speedup if granularity is too small, leading to high overheads
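A small numerical sketch of the speedup definition above, together with Amdahl's law from the optional reading (the timings and the 5% serial fraction are made-up values for illustration):

def speedup(t1, tn):
    return t1 / tn                         # measured speedup with N processors

def amdahl_bound(serial_fraction, n):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)

print(speedup(120.0, 20.0))                # e.g. 120 s on 1 processor, 20 s on 8 -> 6.0x
print(amdahl_bound(0.05, 8))               # ~5.9x ceiling on 8 processors if 5% of the work is serial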

SSWT ZG526 - Distributed Computing 49 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

49

Topics for today

• Introduction to Distributed Computing


• Motivation
• Classification and introduction to Parallel Systems
• Distributed Communication Models
• Design issues and challenges
• Summary

SSWT ZG526 - Distributed Computing 50 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

50

25
8/9/2024

REMOTE PROCEDURE CALL (RPC)

• Remote Procedure Call (RPC) is a powerful technique for constructing distributed,


client-server based applications.
• It is based on extending the conventional local procedure calling, so that the called
procedure need not exist in the same address space as the calling procedure.
• The two processes may be on the same system, or they may be on different systems
with a network connecting them.
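As a concrete sketch (an illustration, not part of the slides), Python's standard-library XML-RPC modules show the idea: the client calls add() as if it were a local procedure, while the call actually executes in the server's address space. The host, port, and procedure name are arbitrary choices for this example.

# server.py -- expose a local procedure to remote callers
from xmlrpc.server import SimpleXMLRPCServer

def add(a, b):
    return a + b                            # an ordinary local procedure

server = SimpleXMLRPCServer(("localhost", 8000), allow_none=True)
server.register_function(add, "add")        # make it callable by name over RPC
server.serve_forever()

# client.py -- call it as if it were local; the stub marshals the arguments
import xmlrpc.client

proxy = xmlrpc.client.ServerProxy("http://localhost:8000/")
print(proxy.add(2, 3))                      # executes in the server's address space -> 5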

SSWT ZG526 - Distributed Computing 51 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

51

REMOTE PROCEDURE CALL (RPC)

SSWT ZG526 - Distributed Computing 52 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

52

26
8/9/2024

REMOTE PROCEDURE CALL (RPC)

SSWT ZG526 - Distributed Computing 53 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

53

MESSAGE QUEUEING

• Message queueing is a method by which process (or program instances) can


exchange or pass data using an interface to a system-managed queue of messages.
• Messages can vary in length and be assigned different types or usages.
• A message queue can be created by one process and used by multiple processes
that read and/or write messages to the queue.
• For example, a server process can read and write messages from and to a message
queue created for client processes.
• The message type can be used to associate a message with a particular client
process even though all messages are on the same queue.
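A minimal sketch of the idea (an illustration using Python threads and an in-process queue, not any particular OS message-queue API): all messages travel on one shared queue, and the type and client-id fields let the server tell them apart.

import queue, threading

msgq = queue.Queue()                        # one system-managed queue shared by all senders

def server():
    # All messages arrive on the same queue; the type and client-id fields let the
    # server associate each message with the right client and usage.
    for _ in range(3):
        msg_type, client_id, payload = msgq.get()
        print(f"server handling {msg_type} from {client_id}: {payload!r}")

t = threading.Thread(target=server)
t.start()
msgq.put(("read",  "client-1", "/tmp/a.txt"))
msgq.put(("write", "client-2", b"some bytes"))     # messages can vary in length and usage
msgq.put(("read",  "client-2", "/tmp/b.txt"))
t.join()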

SSWT ZG526 - Distributed Computing 54 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

54

27
8/9/2024

PUBLISH–SUBSCRIBE PATTERN

• Publish–subscribe is a messaging pattern where senders of messages, called


publishers, do not program the messages to be sent directly to specific receivers,
called subscribers, but instead categorize published messages into classes without
knowledge of which subscribers, if any, there may be.
• Similarly, subscribers express interest in one or more classes and only receive
messages that are of interest, without knowledge of which publishers, if any, there
are.

SSWT ZG526 - Distributed Computing 55 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

55

PUBLISH–SUBSCRIBE PATTERN

• Publish–subscribe is a sibling of the message queue paradigm, and is typically one


part of a larger message-oriented middleware system.
• Most messaging systems support both the pub/sub and message queue models in
their API, e.g. Java Message Service (JMS).
• This pattern provides greater network scalability and a more dynamic network
topology, with a resulting decreased flexibility to modify the publisher and the
structure of the published data.

SSWT ZG526 - Distributed Computing 56 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

56

28
8/9/2024

PUBLISH–SUBSCRIBE PATTERN

• The sender (also called a publisher) uses a topic-based approach to publish


messages to topic A and to topic B.
• Three receivers (also called subscribers) subscribe to these topics;
– one receiver subscribes to topic A,
– one receiver subscribes to topic B, and
– one receiver subscribes to both topic A and to topic B.
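The topic A / topic B scenario above can be sketched with a tiny in-memory broker (illustration only; real brokers such as JMS implementations add persistence, filtering, and network transport):

from collections import defaultdict

class Broker:
    def __init__(self):
        self.subscribers = defaultdict(list)        # topic -> callbacks of interested subscribers

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):              # publisher knows only the topic, not the receivers
        for deliver in self.subscribers[topic]:
            deliver(topic, message)

broker = Broker()
broker.subscribe("A", lambda t, m: print("subscriber-1 on", t, ":", m))
broker.subscribe("B", lambda t, m: print("subscriber-2 on", t, ":", m))
for topic in ("A", "B"):                            # subscriber-3 listens to both topics
    broker.subscribe(topic, lambda t, m: print("subscriber-3 on", t, ":", m))

broker.publish("A", "message for topic A")          # reaches subscribers 1 and 3
broker.publish("B", "message for topic B")          # reaches subscribers 2 and 3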

SSWT ZG526 - Distributed Computing 57 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

57

PUBLISH–SUBSCRIBE PATTERN

SSWT ZG526 - Distributed Computing 58 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

58

29
8/9/2024

Topics for today

• Introduction to Distributed Computing


• Motivation
• Classification and introduction to Parallel Systems
• Distributed Communication Models
• Design issues and challenges
• Summary

SSWT ZG526 - Distributed Computing 59 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

59

DESIGN ISSUES AND CHALLENGES

The following functions must be addressed when designing and building a distributed system:

Communication
• This task involves designing appropriate mechanisms for communication among the processes in the
network. Some example mechanisms are: remote procedure call (RPC), remote object invocation
(ROI), message-oriented communication versus stream-oriented communication.

Processes
• Some of the issues involved are: management of processes and threads at clients/servers; code
migration; and the design of software and mobile agents.

SSWT ZG526 - Distributed Computing 60 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

60

30
8/9/2024

DESIGN ISSUES AND CHALLENGES

Naming
• Devising easy to use and robust schemes for names, identifiers, and addresses is essential
for locating resources and processes in a transparent and scalable manner.

Synchronization
• Mechanisms for synchronization or coordination among the processes are essential.

Data storage and access

• Schemes for data storage, and implicitly for accessing the data in a fast and
scalable manner across the network, are important for efficiency.

SSWT ZG526 - Distributed Computing 61 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

61

DESIGN ISSUES AND CHALLENGES

Consistency and replication


• To avoid bottlenecks, to provide fast access to data, and to provide scalability, replication of data
objects is highly desirable

Fault tolerance
• Fault tolerance requires maintaining correct and efficient operation in spite of any failures of links,
nodes, and processes

Security
• Distributed systems security involves various aspects of cryptography, secure channels, access
control, key management – generation and distribution, authorization, and secure group
management.

SSWT ZG526 - Distributed Computing 62 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

62

31
8/9/2024

DESIGN ISSUES AND CHALLENGES

Applications Programming Interface (API) and transparency


• The API for communication and other specialized services is important for the ease of use and wider
adoption of the distributed systems services by non-technical users.

Scalability and modularity


• The algorithms, data (objects), and services must be as distributed as possible. Various techniques
such as replication, caching and cache management, and asynchronous processing help to achieve
scalability.

SSWT ZG526 - Distributed Computing 63 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

63

NEW APPLICATIONS POSE NEW CHALLENGES

• Mobile networks
– distributed graph algorithms
• Sensor networks collecting and transmitting physical parameters
– large volume streaming, algorithms for location estimation
• Ubiquitous or Pervasive computing where processors are embedded everywhere to perform
application functions e.g. smart homes
– groups of wireless sensors/actuators connected to Cloud backend
• Peer-to-peer e.g. gaming, content distribution
– object storage/lookup/retrieval/replication, self-organising networks, privacy

SSWT ZG526 - Distributed Computing 64 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

64

32
8/9/2024

NEW APPLICATIONS POSE NEW CHALLENGES

• Publish-subscribe e.g. movie streaming, online trading markets - receive only data of interest
– data streaming systems and filtering / matching algorithms
• Distributed agents or robots e.g. in mobile computing
– swarm algorithms, coordination among agents (like an ant colony)
• Distributed data mining
– e.g. user profiling - Data is spread in many repositories

SSWT ZG526 - Distributed Computing 65 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

65

33
8/9/2024

BITS Pilani
Pilani | Dubai | Goa | Hyderabad

Contact Session – 2
M2 -Logical clocks

References : T1 (Chap.3)

Topics for today


» A model of distributed computing
» Lamport’s logical clock
» Vector clock
» Efficient implementation of vector clock
» Other techniques, e.g. Matrix and Fowler-Zwaenepoel’s
» Physical clock sync using Network Time Protocol
» Physical or logical clocks ?
» Summary

SSWT ZG526 - Distributed Computing 2 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

1
8/9/2024

A distributed program

• A distributed program is composed of a set of n asynchronous processes, p1,


p2, ..., pi , ..., pn.
• The processes do not share a global memory and communicate solely
by message passing.
• The processes do not share a global clock that is instantaneously
accessible to these processes.
• Process executions and message transfers are asynchronous.

SSWT ZG526 - Distributed Computing 3 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Causality and clocks

• The concept of causality between events is fundamental to the design and analysis of parallel and distributed computing and operating systems.

• Usually causality is tracked using physical time.

• In distributed systems, it is not possible to have global physical time; it is possible to realize only an approximation of it.

• Causality among events in a distributed system is a powerful concept in reasoning, analyzing, and drawing inferences about a computation.

SSWT ZG526 - Distributed Computing 4 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

2
8/9/2024

Why is logical clock essential ?


• Physical time synchronisation across many nodes with high message density and high accuracy is nearly impossible
• We will study a measurement-based technique in NTP
• Need a way to order events across processes to understand how computation across processes is progressing
• Need to ensure that
• incoming messages are delivered before they are expected by a process
• no message meant to arrive at a process after an event is delivered before the event
• unrelated events can proceed concurrently
[Figure: events x, y, z across processes plotted against time, with y = f(x) depending on x]

SSWT ZG526 - Distributed Computing 5 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Causality

• Causality is influence by which one event, process or state contributes to the

production of another event, process or state where the cause is partly responsible

for the effect, and the effect is partly dependent on the cause.

SSWT ZG526 - Distributed Computing 6 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

3
8/9/2024

Uses of causality and logical clocks (1)


The knowledge of the causal precedence relation among the events of processes helps solve a variety of problems in distributed systems; some of these problems are as follows.

Distributed algorithms design

The knowledge of the causal precedence relation among events helps ensure liveness and fairness in mutual exclusion algorithms, helps maintain consistency in replicated

databases, and helps design correct deadlock detection algorithms to avoid phantom and undetected deadlocks.

Concurrency measure

The knowledge of how many events are causally dependent is useful in measuring the amount of concurrency in a computation. All events that are not causally related can be

executed concurrently. Thus, an analysis of the causality in a computation gives an idea of the concurrency in the program.

SSWT ZG526 - Distributed Computing 7 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Uses of causality and logical clocks (2)

Tracking of dependent events

In distributed debugging, the knowledge of the causal dependency among events helps construct a consistent state for resuming

reexecution; in failure recovery, it helps build a checkpoint; in replicated databases, it aids in the detection of file inconsistencies in

case of a network partitioning.

Knowledge about the progress

The knowledge of the causal dependency among events helps measure the progress of processes in the distributed computation.

This is useful in discarding obsolete information, garbage collection, and termination detection, e.g. how much of a distributed file

write has finished.

SSWT ZG526 - Distributed Computing 8 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

4
8/9/2024

A model of distributed executions


[Figure: a space-time diagram, with processes along the space axis and “happens before” relations between events]

» Result of logically ordered computation is same and independent of absolute time

SSWT ZG526 - Distributed Computing 9 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

A model of distributed executions (2)


[Figure: the same space-time diagram, highlighting logically concurrent events alongside “happens before” relations]

» Logically concurrent events may happen in different physical times - depends on message delays
» Assume they happened at the same time … computation result will not change

SSWT ZG526 - Distributed Computing 10 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

10

5
8/9/2024

Models of Communication Networks


• There are several models of services provided by communication networks,
namely, FIFO, Non-FIFO
• Distributed middleware can provide additional semantics - e.g. causal order

[Figure: processes P1, P2, P3 connected by channels C1–C4 carrying messages m1–m4; a FIFO channel (e.g. TCP) delivers them in send order, a non-FIFO channel (e.g. UDP) may reorder them]

SSWT ZG526 - Distributed Computing 11 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

11

Models of Communication Networks


• non-FIFO: No restrictions in message order on link
• FIFO: Messages are delivered in same sequence as sent on a link
• Causal ordering (CO) aka ‘happened before’ relationship among messages across links
• If send(mij) —> send(mkj) then receive(mij) —> receive(mkj)
• Ensures that causally related messages for same destination are delivered preserving the same order.
• CO is most useful in building distributed algorithms, e.g. DB replica updates (case study later)

[Figure: P1 performs Send 1 then Send 2 to P2; under CO and FIFO, P2 sees Receive 1 before Receive 2, under non-FIFO it may not]

SSWT ZG526 - Distributed Computing 12 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

12

6
8/9/2024

Logical Clocks: Scalar time and Vector time

• What is a logical clock


• In the absence of perfect clock synchronisation, assign logical numbers
to events in each processor so that across processor events get ordered.
• 2 main techniques - scalar and vector
• The goal is to assign a causal order or “happened before” when assigning
numbers instead of actually finding physical time
• Causality among events in a distributed system is a powerful concept in
reasoning, analysing, and drawing inferences about a computation.

SSWT ZG526 - Distributed Computing 13 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

13

Implementing logical clocks


Requires addressing two issues

• Data structures local to every process to represent logical time and

• A protocol (set of rules) to update the data structures to ensure the consistency condition.

Each process pi maintains data structures that allow it the following two capabilities:

• A local logical clock, denoted by lci, that helps process pi measure its own progress.

• A logical global clock, denoted by gci, that is a representation of process pi’s local view of the logical global time. It allows this process to assign consistent timestamps to its local events.

Typically, lci is a part of gci.

SSWT ZG526 - Distributed Computing 14 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

14

7
8/9/2024

Lamport clock with scalar time

• Proposed by Lamport as an attempt to totally order events in a distributed


system.
• It however only creates a partial order as shown later
• Time domain is the set of non-negative integers (scalar).
• The logical local clock of a process pi and its local view of the global time are
squashed into one integer variable Ci.
• 2 rules R1 and R2 to update the clocks

SSWT ZG526 - Distributed Computing 15 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

15

Rules for Lamport clock (1)

• R1: Before executing an event (send, receive, or internal),


process pi executes the following:
Ci := Ci + d (d > 0)
In general, every time R1 is executed, d can have a different value; however,
typically d is kept at 1 (e.g. for event counting - more later).

[Figure: P1 starts with clock 0; events e1 and e2 each apply R1, giving clock values 1 and 2 (notation <rule applied>: <clock value>)]

SSWT ZG526 - Distributed Computing 16 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

16

8
8/9/2024

Rules for Lamport clock (2)

• R2: Each message piggybacks the clock value of its sender at


sending time. When a process pi receives a message with
timestamp Cmsg , it executes the following actions:
➢ Ci := max(Ci , Cmsg )
➢ Execute R1.
➢ Deliver the message.
[Figure: P1’s event e2 has clock value 2; its message to P2 carries timestamp 2, and P2 (starting at 0) applies R2 at the receive event e3 to obtain clock value 3]

SSWT ZG526 - Distributed Computing 17 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

17

Lamport clock example


Assume d = 1 (notation <rule applied>: <clock value>)
[Figure: space-time diagram of a Lamport clock example across P1, P2, P3 (all starting at 0), showing events e1–e6 and the rule (R1 or R2) and clock value assigned at each event]
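A minimal sketch of rules R1 and R2 in code (an illustration with d = 1, not taken from the slides); the receive path applies the max rule and then ticks the clock before delivering:

class LamportClock:
    def __init__(self):
        self.c = 0

    def local_event(self):          # R1: tick before an internal event
        self.c += 1
        return self.c

    def send(self):                 # R1, then piggyback the clock value on the message
        self.c += 1
        return self.c

    def receive(self, c_msg):       # R2: max with the sender's timestamp, then R1, then deliver
        self.c = max(self.c, c_msg)
        self.c += 1
        return self.c

p1, p2 = LamportClock(), LamportClock()
p1.local_event()                    # e1 at P1 -> 1
t = p1.send()                       # e2 at P1 -> 2, message carries timestamp 2
print(p2.receive(t))                # receive event at P2 -> 3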

SSWT ZG526 - Distributed Computing 18 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

18

9
8/9/2024

Another example
• Distributed algorithms such as resource synchronization often depend on some method of ordering events to function.

• For example, consider a system with two processes and a disk.

• The processes send messages to each other, and also send messages to the disk requesting access.

• The disk grants access in the order the messages were sent.

• Now, imagine process 1 sends a message to the disk asking for access to write, and then sends a message to process 2 asking it to read.

• Process 2 receives the message, and as a result sends its own message to the disk.

• Now, due to some timing delay, the disk receives both messages at the same time: how does it determine which message happened-before the other?

• A logical clock algorithm provides a mechanism to determine facts about the order of such events.

SSWT ZG526 - Distributed Computing 19 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

19

Basic Properties

Consistency Property
• Scalar clocks satisfy monotonicity and hence the consistency property:
for two events ei and ej, if ei → ej then C(ei) < C(ej).

* However, the converse is not true (more about this later)


Total Ordering
• A partial order is created because events may have same time stamp
• Can be used to also totally order events in a distributed system.
• But, the main problem in totally ordering events is that two or more events at different processes
may have identical timestamp.

SSWT ZG526 - Distributed Computing 20 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

20

10
8/9/2024

Consistency
Assume d = 1
[Figure: space-time diagram with scalar timestamps on events e1–e6 across P1, P2, P3, illustrating that causally related events receive increasing timestamps]

SSWT ZG526 - Distributed Computing 21 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

21

Total Ordering

• A tie-breaking mechanism is needed to order such events. A tie is broken as follows:


➢ Process identifiers are linearly ordered and tie among events with identical scalar timestamp is
broken on the basis of their process identifiers.
➢ The lower the process identifier in the ranking, the higher the priority.
➢ The timestamp of an event is denoted by a tuple (t, i) where t is its time of occurrence and i is the
identity of the process where it occurred.
➢ The total order relation ≺ on two events x and y with timestamps (h,i) and (k,j), respectively, is
defined as follows:

x ≺ y ⇔ ( h < k ) or ( h = k and i < j )


SSWT ZG526 - Distributed Computing 22 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

22

11
8/9/2024

Total ordering (example)


Assume d = 1
[Figure: the same diagram; e1 on P1 and e3 on P2 both carry timestamp 1, e2 carries 2, and e4, e5, e6 on P3 carry 1, 2, 3]

Use process index (P1<P2) to make e1 precede e3 and force a total order
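In code, the tie-break amounts to comparing (timestamp, process id) pairs lexicographically; a tiny sketch (illustration only):

def precedes(x, y):
    (h, i), (k, j) = x, y                       # (timestamp, process id) pairs
    return h < k or (h == k and i < j)

print(precedes((1, 1), (1, 2)))                 # both clocks are 1; P1 < P2 breaks the tie -> True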
SSWT ZG526 - Distributed Computing 23 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

23

Properties (1)

No Strong Consistency

• The system of scalar clocks is not strongly consistent; that is, for two events ei and ej,
C(ei) < C(ej) does not imply ei → ej.

• The reason that scalar clocks are not strongly consistent is that the logical local clock and
logical global clock of a process are squashed into one, resulting in the loss of causal
dependency information among events at different processes.

[Figure: example with events e1, e2, e3 on P1 (clocks 1, 2, 3), e4 on P2 (clock 3), and e5, e6, e7 on P3 (clocks 1, 2, 3)]

SSWT ZG526 - Distributed Computing 24 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

24

12
8/9/2024

Properties (2)

Event counting
• If the increment value d is always 1, the scalar time has the following interesting property:
if event e has a timestamp h, then h − 1 represents the minimum logical duration, counted in
units of events, required before producing the event e;

• We call it the height of the event e.

• In other words, h − 1 events have been produced sequentially before the event e regardless of
the processes that produced these events.

[Figure: same diagram as before; e4 on P2 has clock value 3, so height(e4) = 2]

SSWT ZG526 - Distributed Computing 25 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

25

Vector time
• The system of vector clocks was developed independently by Fidge, Mattern, and Schmuck.

• In the system of vector clocks, the time domain is represented by a set of n-dimensional
non-negative integer vectors.

• So the views of the local clock and the non-local clocks from the perspective of a process are
represented in one vector.

[Figure: processes P1, P2, P3; clock(P3, t) = [2, 4, 6], where [2, 4] is P3’s view of P1’s and P2’s clocks and 6 is P3’s view of its own local clock]

SSWT ZG526 - Distributed Computing 26 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

26

13
8/9/2024

Vector clock rules (1)

Process pi uses the following two rules R1 and R2 to update its clock:

• R1: Before executing an event, process pi updates its local logical time as
follows:
vti[i] := vti[i] + d (d > 0)

[Figure: P1’s events e1 and e2 carry vector clocks (1,0,0) and, after applying R1 again, (2,0,0)]

SSWT ZG526 - Distributed Computing 27 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

27

Vector clock rules (2)

•R2: Each message m is piggybacked with the vector clock vt of


the sender process at sending time. On the receipt of such a
message (m,vt), process pi executes the following sequence of
actions:

➢ Update its global logical time as follows:

1 ≤ k ≤ n : vti[k] := max(vti[k], vt[k])

➢ Execute R1
➢ Deliver the message m.
[Figure: P2, starting at (0,0,0), receives a message stamped (2,0,0); R2(a) gives (2,0,0), R2(b) gives (2,1,0), and the message is then delivered at event e3]
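A minimal sketch of R1 and R2 for vector clocks (an illustration with d = 1, not taken from the slides):

class VectorClock:
    def __init__(self, i, n):
        self.i, self.vt = i, [0] * n

    def tick(self):                              # R1: increment own component
        self.vt[self.i] += 1

    def send(self):                              # R1, then piggyback a copy of the vector
        self.tick()
        return list(self.vt)

    def receive(self, vt_msg):                   # R2: component-wise max, then R1, then deliver
        self.vt = [max(a, b) for a, b in zip(self.vt, vt_msg)]
        self.tick()

p1, p2 = VectorClock(0, 3), VectorClock(1, 3)
p1.tick()                                        # e1 at P1 -> [1, 0, 0]
m = p1.send()                                    # e2 at P1 -> [2, 0, 0]
p2.receive(m)                                    # e3 at P2 -> [2, 1, 0]
print(p2.vt)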
SSWT ZG526 - Distributed Computing 28 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

28

14
8/9/2024

Vector time example


Assume d = 1
[Figure: vector time example. P1 (starting at (0,0,0)): e1 = (1,0,0), e2 = (2,0,0). P2 (starting at (0,0,0)): e3 = (2,1,0), e5 = (2,2,1). P3 (starting at (0,0,0)): e4 = (0,0,1). The message from P1 to P2 carries (2,0,0).]

SSWT ZG526 - Distributed Computing 29 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

29

Vector time example


Assume d = 1
[Figure: the same diagram, annotated with R1: update the local clock entry]

SSWT ZG526 - Distributed Computing 30 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

30

15
8/9/2024

Vector time example


Assume d = 1
[Figure: the same diagram, annotated with R2: take the max of the remote clock entries from local knowledge and the message content, then update the local clock entry as per R1]

SSWT ZG526 - Distributed Computing 31 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

31

Comparing vector timestamps (1)

• The following relations are defined to compare two vector timestamps, vh and vk:

vh = vk ⇔ ∀x : vh[x] = vk[x]
vh ≤ vk ⇔ ∀x : vh[x] ≤ vk[x]
vh < vk ⇔ vh ≤ vk and ∃x : vh[x] < vk[x]
vh || vk ⇔ ¬(vh < vk) and ¬(vk < vh)

Examples:
(1,0,1) = (1,0,1)
(0,1,1) < (1,2,1)
(1,0,0) || (0,0,2)

[Figure: P1: e1 = (1,0,0), e2 = (2,0,0), e3 = (3,0,0); P2: e4 = (0,1,0), e5 = (2,2,0); P3: e6 = (0,0,1), e7 = (0,0,2), e8 = (0,0,3)]
Now we can confidently say, with full knowledge: e1 || e7
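These relations translate directly into code; a small sketch (illustration only), using the examples above:

def leq(vh, vk):                                 # vh <= vk : every component is <=
    return all(a <= b for a, b in zip(vh, vk))

def less(vh, vk):                                # vh < vk : <= and different somewhere
    return leq(vh, vk) and vh != vk

def concurrent(vh, vk):                          # vh || vk : neither precedes the other
    return not less(vh, vk) and not less(vk, vh)

print(less((0, 1, 1), (1, 2, 1)))                # True
print(concurrent((1, 0, 0), (0, 0, 2)))          # True, i.e. e1 || e7 in the diagram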

SSWT ZG526 - Distributed Computing 32 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

32

16
8/9/2024

Comparing vector timestamps (2)

• If the process at which an event occurred is known, the test to compare two timestamps can be
simplified as follows: if events x and y respectively occurred at processes pi and pj and are
assigned timestamps vh and vk, respectively, then

x → y ⇔ vh[i] ≤ vk[i]
x || y ⇔ vh[i] > vk[i] and vh[j] < vk[j]

[Figure: same diagram as on the previous slide]
e2 —> e5 because V_P1[P1] <= V_P2[P1]
e1 || e7 because V_P1[P1] > V_P3[P1] and V_P1[P3] < V_P3[P3] (comparing P1’s view and P3’s view)

SSWT ZG526 - Distributed Computing 33 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

33

Properties (1)
Strong Consistency
• The system of vector clocks is strongly consistent; thus, by examining the vector
timestamp of two events, we can determine if the events are causally related.
• However, Charron-Bost showed that the dimension of vector clocks cannot be less than
n, the total number of processes in the distributed computation, for this property to
hold.

SSWT ZG526 - Distributed Computing 34 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

34

17
8/9/2024

Properties (2)
Event Counting

• If d = 1 (in rule R1), then the ith component of the vector clock at process pi, vti[i], denotes
the number of events that have occurred at pi until that instant.

• So, if an event e has timestamp vh, vh[j] denotes the number of events executed by process pj
that causally precede e. Clearly, Σj vh[j] − 1 represents the total number of events that causally
precede e in the distributed computation.

[Figure: same diagram as before; for e5 = (2,2,0), 2 + 2 − 1 = 3 events causally precede e5]

SSWT ZG526 - Distributed Computing 35 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

35

Efficient implementation of vector clock


• If the number of processes in a distributed computation is large, then vector clocks will require
piggybacking of a huge amount of information in messages.
• The message overhead grows linearly with the number of processors in the system, and when there
are thousands of processors, the message size becomes huge even if only a few events occur at a few
processors.
• For 1000 processors, we need a 1000-integer vector with every message.
• Charron-Bost showed that if vector clocks have to satisfy the strong consistency property, then in
general vector timestamps must be at least of size n, the total number of
processes.
• However, optimizations are possible; next, we discuss a technique to implement
vector clocks efficiently.

SSWT ZG526 - Distributed Computing 36 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

36

18
8/9/2024

Singhal-Kshemkalyani’s differential technique


• Between successive message sends to the same process, only a few entries of the vector
clock at the sender process are likely to change.

Capture only the changes in the vector and send them with messages, as a list of pairs:

[ { <changed position>, <new value> }, ... ]
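A simplified sketch of the sender side (an assumption for illustration; the full Singhal-Kshemkalyani technique also lets the receiver apply the max on just the received entries):

class DiffVCSender:
    def __init__(self, i, n):
        self.i, self.n = i, n
        self.vt = [0] * n
        self.last_sent = {}                      # destination -> vector at the last send to it

    def send_to(self, dest):
        self.vt[self.i] += 1                     # R1 on the send event
        prev = self.last_sent.get(dest, [0] * self.n)
        diff = [(k, v) for k, (v, old) in enumerate(zip(self.vt, prev)) if v != old]
        self.last_sent[dest] = list(self.vt)
        return diff                              # only the changed {position, value} pairs travel

p0 = DiffVCSender(0, 1000)                       # 1000 processes, but tiny piggybacked lists
print(p0.send_to(7))                             # [(0, 1)]
print(p0.send_to(7))                             # [(0, 2)] instead of a 1000-entry vector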

SSWT ZG526 - Distributed Computing 37 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

37

Matrix clock
• A matrix clock is a mechanism for capturing chronological and causal relationships in a distributed system.

• Matrix clocks are a generalization of the notion of vector clocks. A matrix clock maintains a vector of the vector clocks for each

communicating host.

• Every time a message is exchanged, the sending host sends not only what it knows about the global state of time, but also the

state of time that it received from other hosts.

• This allows establishing a lower bound on what other hosts know, and is useful in applications such as checkpointing and

garbage collection.

SSWT ZG526 - Distributed Computing 38 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

38

19
8/9/2024

Fowler–Zwaenepoel’s direct-dependency technique

• Fowler–Zwaenepoel direct dependency technique reduces the size of messages by transmitting only a scalar value in

the messages.

• No vector clocks are maintained on-the-fly.

• Instead, a process only maintains information regarding direct dependencies on other processes.

• A vector time for an event, which represents transitive dependencies on other processes, is constructed off-line from

a recursive search of the direct dependency information at processes.

SSWT ZG526 - Distributed Computing 39 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

39

Fowler–Zwaenepoel’s example

SSWT ZG526 - Distributed Computing 40 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

40

20
8/9/2024

Why do we need time synchronisation


» Distributed system nodes have clock drift
» But we need to know the real time in some cases
» Time of day on a node
» Time difference between 2 nodes
» Time sensitive (stricter than causal ordering) writes and reads to database
replicas
» Improve performance by replacing communication with local computation if
all nodes are time synchronised
» Correlate time-sensitive data from different nodes in sensor networks

SSWT ZG526 - Distributed Computing 41 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

41

Definitions
» Time: Cp(t) is time of Clock at process p. Cp(t)=t for
perfect clock.
» Frequency: C’p(t) is the rate at which clock progresses
» Offset: Cp(t) - t is offset from real-time t.
» Skew: C’p(t) - C’q(t) is difference in frequency between
clocks of processes p and q
» If skews are bounded by r then dC/dt is allowed to
diverge in the range 1-r to 1+r (clock accuracy)
» Drift (rate): C’’p(t) is the rate of change of frequency at
process p

SSWT ZG526 - Distributed Computing 42 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

42

21
8/9/2024

NTP
» A hierarchy of time servers on a spanning tree
» Root level primary (stratum 1) synchronises with UTC with microseconds offset from attached stratum 0
devices
» Second level (stratum 2) are backup servers synchronised with stratum 1 servers
» Max level is stratum 15
» At the lowest level are the clients which can be configured with NTP server(s) to use for running a
synchronisation algorithm
» The synchronisation algorithm uses request-reply NTP messages and offset-delay
estimation technique to estimate the round-trip delay and hence the clock offset between 2
servers.
» Offset - 10s of millisecond over internet, <1 millisec in LAN but asymmetric routes and congestion can
cause problems in measurement

SSWT ZG526 - Distributed Computing 43 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

43

Offset-delay estimation
[Figure: A sends a request at T1, B receives it at T2 and replies at T3, A receives the reply at T4; asymmetric routes and congestion can cause estimation challenges]
» From the perspective of A: T1, T2, T3, T4 are available
» Offset = [ (T2 - T1) + (T3 - T4) ] / 2
» Round-trip delay = (T4 - T1) - (T3 - T2)
» Continuously send messages to estimate the offset and correct the local clock of A
» Choose the offset that has the minimum delay (typically over the last 8 samples)
» CA(t) = CB(t) + Offset
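The same arithmetic in code (illustration only; the timestamps are made-up values in seconds):

def ntp_estimate(t1, t2, t3, t4):
    offset = ((t2 - t1) + (t3 - t4)) / 2.0       # how far B's clock appears ahead of A's
    delay  = (t4 - t1) - (t3 - t2)               # round-trip network delay
    return offset, delay

# A sends at T1, B receives at T2 and replies at T3, A receives the reply at T4.
offset, delay = ntp_estimate(10.000, 10.012, 10.013, 10.006)
print(round(offset, 4), round(delay, 4))         # 0.0095 0.005 -> B is roughly 9.5 ms ahead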

SSWT ZG526 - Distributed Computing 44 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

44

22
8/9/2024

Physical or Logical clocks ?


» Normally we use physical clocks when events density is low compared to event
execution time and accuracy of ordering is not critical
» In distributed systems, event rate is orders of magnitude higher and execution time
is orders of magnitude lower and may mandate perfect order of delivery
» Will need accurate physical clock synchronisation to causally order events
» Hence logical clocks are mostly used
» Important to understand application use case and pick physical or logical clock
solutions

SSWT ZG526 - Distributed Computing 45 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

45

Case study: MongoDB vs Cassandra

» Popular NoSQL large-scale distributed systems for Big Data storage using data replication and partitions
» A “write” of a data item has to update multiple replicas
» A “read” of the data item is consistent if the “latest” value is made available to any process in the system
[Figure: a distributed client application, through a DB driver, sends write and read requests to a set of replicas]
» So how do you order reads and writes ?

SSWT ZG526 - Distributed Computing 46 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

46

23
8/9/2024

MongoDB
» Consistency is a design priority
» All “writes” and “reads” happen through
primary replica. Hence, any “read” should
return the latest consistent value.
» Reads and writes are ordered using logical
clock implementation *
» Optionally consistency can be loosened by
reading from replicas

* MongoDB: ACM SIGMOD’19 session:


https://av.tib.eu/media/43077
SSWT ZG526 - Distributed Computing 47 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

47

Case study: Cassandra


» Scale is a design priority with eventual consistency as a design choice .. so no
ordering guarantees
» Reads can go to any replica to allow the system to scale
» Reads can potentially read older values for some time
» e.g. recent posts / comments not appearing on a social network page
» Over time (eventually), read has to return the latest value (“last write wins”)
» Physical clock synchronisation is recommended between replicas to make sure the
replicas converge within a determined time period

** Cassandra: https://www.datastax.com/blog/why-cassandra-doesnt-need-vector-clocks
SSWT ZG526 - Distributed Computing 48 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

48

24
8/9/2024

BITS Pilani
Pilani|Dubai|Goa|Hyderabad

Contact Session – 3
M3 - Global state and snapshot recording
References : T1 (Chap.4)

Last session exercise - Lamport clock

SSWT ZG526 - Distributed Computing 2 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

1
8/9/2024

Last session exercise - Vector clock

SSWT ZG526 - Distributed Computing 3 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Topics for today


» Why record global state
» Consistent and in-consistent global states
» Issues in recording state
» Algorithm for FIFO channels - Chandy Lamport
» Properties of Chandy-Lamport algorithm
» Optimizations on Chandy-Lamport - 2 more algorithm sketches
» Algorithm for non-FIFO channels - Lai Yang
» Necessary and sufficient conditions to record global state

SSWT ZG526 - Distributed Computing 4 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

2
8/9/2024

Global state recording

• Recording the global state of a distributed system on-the-fly is


an important problem.
• Why is it non-trivial?

100 200
A B

A, B, and C exchange money between themselves.


The global state needs to capture exact status of the transfers.
Total money in the system cannot change.
C

100

SSWT ZG526 - Distributed Computing 5 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Global state recording

• Recording the global state of a distributed system on-the-fly is


an important problem.
• Why is it non-trivial?

[Figure: A holds 100, B holds 200, C holds 100 — total 400]
Then,
T1: A transfers 50 to B
T2: B transfers 100 to C

SSWT ZG526 - Distributed Computing 6 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

3
8/9/2024

Global state recording

• Recording the global state of a distributed system on-the-fly is


an important problem.
• Why is it non-trivial?

[Figure: after the transfers — A holds 50, B holds 200, C holds 100, with 50 in transit from A to B and 100 in transit from B to C; total still 400]
Have to record the in-transit amounts correctly.
Total money in the system cannot change.

SSWT ZG526 - Distributed Computing 7 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

More examples

What is the total number of files in a system when files are moved
around across machines ?
What is the total space left on the system across storage nodes ?

SSWT ZG526 - Distributed Computing 8 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

4
8/9/2024

Uses of Global State Recording


• Recording a “consistent” state of the global computation
– checkpointing in fault tolerance (rollback, recovery)
– when processor fails and computation on that processor needs to be
restarted
– debugging programs

• Detecting “stable” properties in a distributed system via snapshots. A


property is “stable” if, once it holds in a state, it holds in all subsequent
states.

– termination detection
– deadlock detection
termination deadlock

SSWT ZG526 - Distributed Computing 9 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Topics for today


» Why record global state
» Consistent and in-consistent global states
» Issues in recording state
» Algorithm for FIFO channels - Chandy Lamport
» Properties of Chandy-Lamport algorithm
» Optimizations on Chandy-Lamport - 2 more algorithm sketches
» Algorithm for non-FIFO channels - Lai Yang
» Necessary and sufficient conditions to record global state

SSWT ZG526 - Distributed Computing 10 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

10

5
8/9/2024

Global state of a distributed system

• Collection of local states of processes / nodes and channels


• Notation —

• To be meaningful, states of all components of the distributed


system must be recorded at the same instant
• Possible only when all local clocks are perfectly synchronised or if
there is a global system clock that can be instantaneously read by
all processes.
• Impossible to achieve in a real system
• Hence logical clock is a necessity for global state capture

SS ZG 526: Distributed Computing 11

11

How do we create a global state

SS ZG 526: Distributed Computing 12

12

6
8/9/2024

Strongly consistent global state


PAST FUTURE
LS1

No need to draw a vertical line /


LS2
cut or even a straight line since
process real times are not
LS3 synchronised

LS4

GS1 = { LS1, LS2, LS3, LS4 }

cut of a space-time diagram

SS ZG 526: Distributed Computing 13

13

Consistent global state


PAST FUTURE

M LS5

LS6

LS7
LS8

GS2 = { LS5, LS6, LS7, LS8} + M

SS ZG 526: Distributed Computing 14

14

7
8/9/2024

Inconsistent global state


PAST FUTURE
LS9

LS10
M
LS11

LS12

GS3 = { LS9, LS10, LS11, LS12} ??

But can’t include event M with send in the FUTURE and receive in the
PAST. Violates Causal Ordering.

SS ZG 526: Distributed Computing 15

15

Topics for today


» Consistent and in-consistent global states (revision)
» Why record global state
» Issues in recording state
» Algorithm for FIFO channels - Chandy Lamport
» Properties of Chandy-Lamport algorithm
» Optimizations on Chandy-Lamport - 2 more algorithm sketches
» Algorithm for non-FIFO channels - Lai Yang
» Necessary and sufficient conditions to record global state

SSWT ZG526 - Distributed Computing 16 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

16

8
8/9/2024

Issues to address in recording a global state (1)


Issue 1: How to distinguish between the messages to be
recorded in the snapshot* from those not to be recorded?
Solution: P1 snapshot taken

• Any message that is sent by a process before recording its P1


snapshot, must be recorded in the global snapshot.
m1 m2
• Any message that is sent by a process after recording its
snapshot, must not be recorded in the global snapshot. P2

m1 included in global state


m2 excluded in global state

* snapshot is a recording or image of global state


SSWT ZG526 - Distributed Computing 17 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

17

Issues to address in recording a global state (2)

Issue 2: How to determine the instant when a process P1 snapshot taken


takes its snapshot?
Solution: P1

A process pj must record its snapshot before m1 m2

processing a message mij that was sent by process pi P2


after recording its snapshot. P2 snapshot taken
before this time

If P2 takes snapshot here then m2 receive is in


global snapshot but not m2 send —> inconsistent
SSWT ZG526 - Distributed Computing 18 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

18

9
8/9/2024

Difficulties due to Non-determinism

• Deterministic Computation
• At any point in computation there is at most one event that can happen
next.

• Non-Deterministic Computation
• At any point in computation there can be more than one event that can
happen next.

SSWT ZG526 - Distributed Computing 19 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

19

Example

A B

m
A B

A B

Deterministic A B Non-deterministic
n
- serial - multiple solutions possible
- synchronous - one shown with the dotted arrows
A B

Easy to capture snapshots Snapshot algorithms need to handle non-determinism


SSWT ZG526 - Distributed Computing 20 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

20

10
8/9/2024

Topics for today


» Why record global state
» Consistent and in-consistent global states (revision)
» Issues in recording state
» Algorithm for FIFO channels - Chandy Lamport
» Properties of Chandy-Lamport algorithm
» Optimizations on Chandy-Lamport
» Algorithm for non-FIFO channels - Lai Yang
» Applications
» Necessary and sufficient conditions to record global state

SSWT ZG526 - Distributed Computing 21 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

21

• Snapshot recording algorithm discussed next is based on the principle that


inconsistent global states are never recorded

SSWT ZG526 - Distributed Computing 22 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

22

11
8/9/2024

Snapshot algorithms for FIFO channels


Chandy-Lamport algorithm
• The Chandy-Lamport algorithm uses a control message, called a marker whose role in a FIFO system is to separate messages in the
channels.
• A marker separates the messages in the channel into those to be included in the snapshot from those not to be recorded in the
snapshot.
• Include a message on channel or post-receipt.
• Markers serve as a barrier for state recording - past vs future

Markers are trying to


create a “cut”

Where? Depends when the


marker reaches Pi

Think of a ripple in water and a


specific wave (marker) hitting
islands (processes)

SSWT ZG526 - Distributed Computing 23 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

23

Chandy-Lamport (2)
• Start: The algorithm can be initiated by any process by executing the “Marker Sending Rule”
by which
• (a) it records its local state and
• (b) sends a marker on each outgoing channel before sending any more messages.

marker
state S1
M1
P1 P2

M1

P4 P3

SSWT ZG526 - Distributed Computing 24 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

24

12
8/9/2024

Chandy-Lamport (3)
• Propagation: A process executes the “Marker Receiving Rule” on receiving a marker on incoming
channel C.
• (a) If the process has not yet recorded its local state, it records the state of C as empty and
executes the “Marker Sending Rule” to record its local state.
• (b) If process has recorded state already, it records messages received on C after last
recording and before this marker on C.
S1 M1 S2
P1 P2
M2

M1
P4 P3

SSWT ZG526 - Distributed Computing 25 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

25

Chandy-Lamport (4)

• Termination: The algorithm terminates after each process has received a marker
on all of its incoming channels.
• Dissemination: All the local snapshots get disseminated to all other processes and
all the processes can determine the global state. A strongly connected graph helps
even if topology is arbitrary.

P1 P2
Disseminate and forward local snapshots
on outgoing channels so that everyone can
construct a global snapshot.
P4 P3
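A minimal per-process sketch of the two rules above, assuming FIFO channels and a send_fn(channel_id, message) transport supplied by the application; the class and field names are illustrative, not from the textbook:

class SnapshotProcess:
    def __init__(self, pid, in_channels, out_channels, send_fn):
        self.pid = pid
        self.in_channels = set(in_channels)    # ids of incoming channels
        self.out_channels = out_channels       # ids of outgoing channels
        self.send = send_fn                    # send_fn(channel_id, message)
        self.local_state = None                # recorded local state (None = not yet recorded)
        self.channel_state = {}                # channel_id -> messages recorded for that channel
        self.marker_seen_on = set()            # incoming channels on which a marker has arrived

    def marker_sending_rule(self, current_state):
        self.local_state = current_state       # (a) record local state
        for c in self.out_channels:            # (b) send a marker on every outgoing channel
            self.send(c, "MARKER")             #     before sending any more application messages

    def on_receive(self, channel_id, msg, current_state):
        if msg == "MARKER":
            if self.local_state is None:
                self.channel_state[channel_id] = []          # record this channel as empty
                self.marker_sending_rule(current_state)      # and take own snapshot
            else:
                self.channel_state.setdefault(channel_id, [])
            self.marker_seen_on.add(channel_id)
            return self.marker_seen_on == self.in_channels   # True => markers seen on all inputs
        # application message: record it only if it arrives after our snapshot
        # but before the marker on this channel
        if self.local_state is not None and channel_id not in self.marker_seen_on:
            self.channel_state.setdefault(channel_id, []).append(msg)
        return False

The recorded local_state and channel_state would then be disseminated (e.g. forwarded on outgoing channels) so that every process can assemble the global snapshot.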

SSWT ZG526 - Distributed Computing 26 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

26

13
8/9/2024

State Recording Example (1)

FIFO is a must
[Figure: three panels of A, B, C connected by channels c1–c4 —
(1) ‘A’ initiates recording of the global state;
(2) ‘A’ takes its snapshot and sends markers;
(3) ‘B’ receives the marker and takes its snapshot]
Assume C has received 100 to avoid multiple snapshot initiations in this example

SSWT ZG526 - Distributed Computing 27 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

27

State Recording Example (2)

[Figure: three panels continuing the example —
(1) ‘B’ sends markers;
(2) ‘C’ records its state and sends markers;
(3) ‘A’ records the state of its incoming channels as empty]
Finally, A, B, C exchange local states including incoming channel states


SSWT ZG526 - Distributed Computing 28 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

28

14
8/9/2024

Properties of the recorded global state

• Chandy-Lamport algorithm shows an important property -

• The recorded global state may not correspond to any of the


global states that occurred during the actual computation.
• This happens because a process can change its state S1 S2
asynchronously before the markers it sent are received by other
sites and other sites record their states.
• But the system could have passed through the recorded St
global states in some equivalent executions.
Recorded state may not be
actually seen in this execution

SSWT ZG526 - Distributed Computing 29 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

29

Example

Actual physical states in a computation :

A recorded state (down, , up, M’) is never physically reached.

ref: https://homepage.cs.uiowa.edu/~ghosh/10-16-03.pdf
SSWT ZG526 - Distributed Computing 30 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

30

15
8/9/2024

How useful is an unseen snapshot

• A recorded state is a feasible state that is reachable from the initial configuration.
• The final state is always reachable from the recorded state.
• The recorded state may not be visited / seen in an actual execution.
• These states are useful
• for debugging because it is possible for the system to reach the state in some
execution instance
• for checking stable predicates
• for checkpointing

SSWT ZG526 - Distributed Computing 31 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

31

Topics for today


» Why record global state
» Consistent and in-consistent global states
» Issues in recording state
» Algorithm for FIFO channels - Chandy Lamport
» Properties of Chandy-Lamport algorithm
» Optimizations on Chandy-Lamport
» Algorithm for non-FIFO channels - Lai Yang
» Applications
» Necessary and sufficient conditions to record global state

SSWT ZG526 - Distributed Computing 32 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

32

16
8/9/2024

Improvements of Chandy-Lamport

• Optimizations
  • when multiple snapshots are triggered
    • leading to inefficiency and redundant load
  • to disseminate local states more efficiently
    • flooding is too expensive
    • build a spanning tree during marker propagation - works in undirected graphs

[Figure: nodes 1–5 with the spanning tree edges highlighted]

SSWT ZG526 - Distributed Computing 33 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

33

Optimization algorithm 1 by Spezialetti-Kearns

• Provides 2 optimisations
• avoid taking redundant snapshots when there are multiple initiators
using “regions” or territories for each initiator
• leads to simpler dissemination of local snapshots
• Assumes undirected edges

SSWT ZG526 - Distributed Computing 34 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

34

17
8/9/2024

Spezialetti-Kearns outline - taking local snapshots

• Marker carries the ID of the initiator process


• Each process including initiator stores the ID of the initiator in
variable “master”
• On receiving a marker, a process who has no master will set master A B
• A region is all those processes that have the same master M: A
• On receiving a marker, if a process already has a master, it will set
M: B
a variable “id_border_set” with received ID in marker and not act
on or forward the marker C
• This sets up a border process between 2 regions M: B master: B
id_border_set: A

SSWT ZG526 - Distributed Computing 35 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

35

Spezialetti-Kearns outline - disseminating snapshots


• When a process executes the marker sending rule by recording a local snapshot, it records the marker sender as its parent
• This sets up a spanning tree
• A process sends its local snapshot and its id_border_set variable to its parent, and the information finally reaches the master of the region
• Initiators / masters can then exchange region information

[Figure: region with A, B, C — C has parent: B and id_border_set: A; local snapshots flow up the tree to the region master, and masters exchange region info]

Think of this as country boundaries getting dynamically formed to include


cities in their region and the capital cities initiating and exchanging
information about their country.
SSWT ZG526 - Distributed Computing 36 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

36

18
8/9/2024

Optimization algorithm 2 by Venkatesan


• Use case: Applications that need periodic snapshots for
checkpointing
1
• Naive solution: repeated invocations of Chandy-Lamport
• But not efficient m
3 2
• Venkatesan proposed incremental snapshot to save messages
because only few messages are sent on some channels between
2 checkpoints
m
• Assumes bi-directional FIFO links, single initiator, a fixed
spanning tree for sending control messages 4 5

checkpoint i checkpoint i+1


Think of a distributed database doing periodic checkpoints
across nodes so that it can recover from failures

SSWT ZG526 - Distributed Computing 37 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

37

New “control messages” - not just 1 marker

• Assign version number to snapshots


• All control messages carry the number
• init_snap: sent by initiator (spanning tree root) to children asking to start
snapshot
• snap_complete: sent by child to parent (not every neighbor) after taking
snapshot
• regular: like init_snap but sent by internal nodes to neighbors asking them to
take snapshot
• ack: acknowledge receipt of init_snap or regular messages

SSWT ZG526 - Distributed Computing 38 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

38

19
8/9/2024

Venkatesan’s modified “marker sending rule”

• Chandy-Lamport “marker sending rule”


• take a snapshot and send marker to all outgoing channels before
sending any more message
• Venkatesan’s “marker sending rule”
• take a snapshot and send “regular” (like a marker) message before
sending any other message but to only those channels where there is
activity after last snapshot

SSWT ZG526 - Distributed Computing 39 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

39

Venkatesan’s outline
spanning tree edge

1. Initiator sends init_snap along spanning tree to start iteration for Initiator
one global snapshot version
2. Process receiving init_snap or regular follows modified marker 1
sending rule and waits for an ack
3. When leaf process finishes with local snapshot and all acks, it m
sends a snap_complete to parent. 3 2
4. When a non-leaf process gets all acks and snap_complete from
children, it sends snap_complete to parent.
m
5. Terminates when initiator receives all acks and snap_complete
from children 4 5

non-spanning tree edge


activity since last snapshot

SSWT ZG526 - Distributed Computing 40 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

40

20
8/9/2024

Venkatesan’s example
Initiator

1 1 sends init_snap and gets back


1 sends init_snap and gets back
ack, snap_complete ack, snap_complete

m
3 2
2 sends regular and gets
3 sends regular and gets back ack
m
back ack

4 5 no new snapshot needed


non-spanning tree edge
spanning tree edge 2 sends regular and gets back ack,
snap_complete

SSWT ZG526 - Distributed Computing 41 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

41

Topics for today


» Why record global state
» Consistent and in-consistent global states
» Issues in recording state
» Algorithm for FIFO channels - Chandy Lamport
» Properties of Chandy-Lamport algorithm
» Optimizations on Chandy-Lamport
» Algorithm for non-FIFO channels - Lai Yang
» Necessary and sufficient conditions to record global state

Why is it important ? Network links can use protocols such as UDP, with no ordering guarantee

SSWT ZG526 - Distributed Computing 42 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

42

21
8/9/2024

Global state recording for non-FIFO channels

• A marker (like in Chandy-Lamport) cannot be used to delineate


messages because marker cannot be guaranteed to arrive before
event sent after marker.
hold back a message add to another message

• Here, either some degree of inhibition or piggybacking of control


information on computation messages is to be used to capture out-of-
sequence messages.

SSWT ZG526 - Distributed Computing 43 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

43

Lai-Yang algorithm for non-FIFO channels


The Lai-Yang algorithm uses a coloring scheme where processes and messages are colored RED
and WHITE. No special markers.
1 Every process is initially white and turns red while taking a snapshot. The equivalent of the
“Marker Sending Rule” is executed when a process turns red.
2 Every message sent by a white (red) process is colored white (red).

P1 takes snapshot

P1

m3
m1
m2

P2

SSWT ZG526 - Distributed Computing 44 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

44

22
8/9/2024

Lai-Yang algorithm for non-FIFO channels


3 Thus, a white (red) message is a message that was sent before (after) the sender of that
message recorded its local snapshot.
4 Every white process takes its snapshot at its convenience, but no later than the instant it
receives a red message.

P1 takes snapshot

P1

m3
m1
m2
P2 has to take a snapshot
by this time (remember
Issue 2 rule for consistent
P2 snapshots in slide 16 ?)

SSWT ZG526 - Distributed Computing 45 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

45

Lai-Yang algorithm continued…


5 Every white process records a history of all white messages sent or received by it
along each channel.
e.g. P1 (when white) records that it has sent m1 and m2, P2 (when white) records it
has received m1 and m3.
6 When a process turns red, it sends these histories along with its snapshot to the
initiator process that collects the global snapshot.
7 The initiator process evaluates transit (LSi, LSj) to compute the state of a channel Cij as
given below:
SCij= white messages sent by Pi on Cij − white messages received by Pj on Cij
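A minimal sketch of the colouring and history idea in rules 1–7 above; the transport callback and the class name are assumptions for illustration:

WHITE, RED = "white", "red"

class LaiYangProcess:
    def __init__(self, pid):
        self.pid = pid
        self.colour = WHITE
        self.sent_white = {}       # channel -> white messages sent on it   (rule 5)
        self.recv_white = {}       # channel -> white messages received on it
        self.local_state = None

    def send(self, channel, payload, transport):
        if self.colour == WHITE:
            self.sent_white.setdefault(channel, []).append(payload)
        transport(channel, (self.colour, payload))   # every message carries the sender's colour (rule 2)

    def take_snapshot(self, current_state):
        if self.colour == WHITE:                     # turn red when taking the snapshot (rule 1)
            self.colour = RED
            self.local_state = current_state
            # histories + local state would now be sent to the initiator (rule 6)

    def receive(self, channel, coloured_msg, current_state):
        colour, payload = coloured_msg
        if colour == RED:
            self.take_snapshot(current_state)        # snapshot no later than the first red message (rule 4)
        elif self.colour == WHITE:
            self.recv_white.setdefault(channel, []).append(payload)
        return payload

# the initiator computes the state of channel Cij from the two histories (rule 7)
def channel_state(sent_white_on_cij, recv_white_on_cij):
    return [m for m in sent_white_on_cij if m not in recv_white_on_cij]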

SSWT ZG526 - Distributed Computing 46 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

46

23
8/9/2024

Example
Snapshot initiated - P1 turns RED
History : send [m1, m2] + local state of P1

P1

m3 Initiator
m1
m2 Gets both histories in snapshot
Figures out m2 is in transit and m1,
m3 are accounted in local snapshot
P2 of P2

P2 takes snapshot and turns RED


History : received [m1, m3] + local state of P2

SSWT ZG526 - Distributed Computing 47 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

47

Application of snapshots
» Looking for “consistent and stable” state across processes
» Checkpointing - restart a system from recorded global state and log of messages sent
after checkpoint
» Termination - useful in batch processing systems to know when set of distributed tasks
have completed
» Deadlock - similar to termination but jobs are stuck
» Debugging - inspect global state to find issues

SSWT ZG526 - Distributed Computing 48 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

48

24
8/9/2024

Topics for today


» Why record global state
» Consistent and in-consistent global states
» Issues in recording state
» Algorithm for FIFO channels - Chandy Lamport
» Properties of Chandy-Lamport algorithm
» Optimizations on Chandy-Lamport
» Algorithm for non-FIFO channels - Lai Yang
» Necessary and sufficient conditions to record global state

Why is it important ?
Distributed database nodes can take periodic checkpoints without strict time coordination with other nodes.
Need to pick which of these checkpoints are correct recording of global state across all nodes.

SSWT ZG526 - Distributed Computing 49 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

49

Conditions for consistent checkpoints

» Each process takes a local checkpoint


» A set of checkpoints (1 for each process) need to satisfy some necessary and
sufficient conditions to form a consistent global state

SSWT ZG526 - Distributed Computing 50 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

50

25
8/9/2024

“No causal order” is a necessary condition

» Necessary condition: 2 checkpoints cannot have a causal order, i.e. they should be
parallel or potentially parallel. E.g. C1,0 and C3,1 have a causal ordering using {
m1, m2} and cannot be part of a consistent set of local checkpoints. But this is not
a sufficient condition. Why ?

SSWT ZG526 - Distributed Computing 51 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

51

Why “no causal order” is not sufficient ?

» C1,1 and C3,2 have no causal order but cannot be part of a consistent global state
» C1,1 cannot pair with C2,2 and C3,2 cannot pair with C2,1 because of causal order

SSWT ZG526 - Distributed Computing 52 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

52

26
8/9/2024

Zigzag path

» Necessary and sufficient condition: the checkpoints must not have a zigzag path connecting them

» zigzag path: a generalisation of a causal path — along the path, a message can be sent before the previous message on the same
path has arrived, e.g. { m3, m4 }

SSWT ZG526 - Distributed Computing 53 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

53

Summary
» Why is global snapshot important
» Algorithms for FIFO using markers and non-FIFO using coloring
» Optimized FIFO algorithms
» Property of snapshot in Chandy-Lamport algorithm
» No zig-zag path as a necessary and sufficient condition of consistency among
snapshots
» Study material: Chapter 4 of Text book 1 (T1)

SSWT ZG526 - Distributed Computing 54 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

54

27
8/9/2024

SS ZG 526: Distributed Computing


Session 4: Terminology and basic algorithms

Dr. Anindya Neogi


Associate Professor
[email protected]

Reference: digital content slides from Prof. C. Hota, BITS Pilani

Topics for today


» Distributed algorithms : Basic concepts / terms / classifications
» Example distributed graph algorithms and use cases
» Single source shortest path
» Minimum Spanning Tree (MST)
» Leader election
» Maximal Independent Sets (MIS)
» Connected Dominating Set (CDS)
» Synchronizers

1
8/9/2024

Topology abstractions
» Physical
» Actual network layer connections
» Logical
» Defined in context of a particular application. Nodes
are distributed application nodes. Channels can be
given useful properties, e.g. FIFO, non-FIFO. Can define
arbitrary connectivity or neighbourhood relations. E.g. in
P2P overlays.
» Super-imposed
Logical abstractions
» Higher level on logical topology, e.g. spanning tree,
ring, to provide efficient paths for information
dissemination and gathering.

Classifications (1)
» Application vs control executions
» Application execution is the core program logic
» Control algorithms or “protocols” are part of the distributed middleware. Latter may piggyback
on the application execution (e.g. Lai-Yang global snapshot) but doesn’t interfere.
» Centralized vs distributed
» Centralized - e.g. client-server systems.
» Purely distributed yet efficient algorithms are harder. Many distributed algorithms have a
centralised component - e.g. Chandy-Lamport has an initiator that assembles the global
snapshot. Peer-to-Peer systems demand more distributed logic.

2
8/9/2024

Classifications (2)
» Symmetric vs asymmetric
» Symmetric where all processors execute same logic.
» e.g. logical clock algorithms
» Centralized algorithms or cases where nodes perform different functions, e.g.
root and leaf nodes or leaders, are asymmetric.
» Anonymous
» Does not use process identifiers - structurally elegant but harder to build and
sometimes impossible - e.g. anonymous leader election, total ordering in Lamport
clock. Typically process id is used for resolving ties.

Classifications (3)
» Uniform
» Does not use n or number of processes in code and thus allows scalability transparency. E.g. leader election
in a ring only talks to 2 neighbours.
» Adaptive
» When complexity is a function of k < n and not n, where n is number of nodes / processes. e.g. mutual
exclusion algorithms between k contenders
» Deterministic vs Non-deterministic
» Deterministic programs have no non-deterministic receives, i.e. they specify a source always to receive a
message.
» Non-deterministic executions may produce different results each time. Harder to reason and debug.
» Even in an async system deterministic execution will produce same partial order each time but not for non-
deterministic execution.

3
8/9/2024

Classifications (4)
» Execution Inhibition
» Inhibitory protocols freeze/suspend the normal execution of a process till some conditions are met. Can be
local inhibition (e.g. waiting for a local condition) or global (e.g. waiting for message from another process).
» Message ordering or state recording algorithms may use inhibition
» Sync vs Async
» Sync has
1. known upper bound of communication delay,
2. known bounded drift rate of local clock wrt real time,
3. known upper bound on time taken to perform a logical step.
» Async has none of the above satisfied. Typical distributed systems are async but can be made sync with
synchronisers.

Classifications (5)
» Online vs offline
» Online algorithm works on data that is being generated.
» Preferred. E.g. debugging, scheduling etc. to work with latest dynamic data
» Offline needs all data to be available first
» Wait-free
» Resilient to n-1 process failures thus offering high degree of robustness. But very expensive and may
not be possible in many cases.
» Operations of a process complete in a bounded number of steps despite failures of other processes
» Useful for real-time systems
» e.g. lock-less page cache patches in Linux kernel, RethinkDB
» Will refer in session on “consensus protocols”

4
8/9/2024

Classifications (6)
» Failure models
» Important to specify for an algorithm that claims to have some fault tolerance features
» Process failure models with increasing severity
» Fail-stop: Process stops at an instant and others know about it
» Crash: Same as above but others don’t know about it
» Receive/Send or General omission: Intermittently fail to send and/or receive some messages or
by crashing
» Timing failures: For sync systems only, e.g. violates time bound to finish a step
» Byzantine failure with authentication: Random faults but if a faulty process claims to receive a
message from correct process, it can be verified
» Byzantine failure: Same as above but no verification possible
» In addition, Link failure models include all of the above except fail-stop

Topics for today


» Distributed algorithms : Basic concepts / terms
» Example distributed graph algorithms and use cases
» Single source shortest path
Application level routing
» Minimum Spanning Tree (MST) Content distribution networks
» Leader election
Asymmetric distributed algorithms
» Maximal Independent Sets (MIS)
Setting up wireless networks
» Connected Dominating Set (CDS) Intelligent placement of servers / point-of-presence
Backbone networks
» Synchronizers

10

10

5
8/9/2024

1. Bellman-Ford: Single source shortest path


incident link for A
source
» From source S find the shortest path to all nodes in the directed
graph with possibly negative weights*
S
» But it is a distributed algorithm … 8 10
» Topology is not known globally
» Each node knows only its neighbors and incident links and E A
weights -4
1
» Only send messages to neighbours about local state 1
2
» Each process knows total number of nodes n, so that it can
terminate computation D B
» Eg. use — distributed routing protocols -1 -2

C
* negative weights ?? think of currency exchange, ISP contracts for routing etc.
SS ZG 526: Distributed Computing 11

11

Bellman-Ford: Single source shortest path (sync)


Logic on each process i with knowledge of n:

• length = inf                  <- distance to source
• parent = null
• neighbors = { set of neighbors }
• weights = { weight(i,j), weight(j,i) for all neighbors j of process i }

• if i is source S then length = 0
• for round = 1 to n-1 do       <- synchronised rounds, in parallel across all Pi
  • send UPDATE(i, length) to all neighbours
  • await UPDATE(j, length_j) from each neighbor j
  • for each neighbor j do
    • if length > length_j + weight(j,i) then
      • length = length_j + weight(j,i)
      • parent = j

[Figure: the example graph with source S — in round 1, A receives UPDATE S—>A: 10 and sets parent = S, length = 10; in the final state A has parent = D, length = 5]

SS ZG 526: Distributed Computing 12
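A round-based Python simulation of the synchronous logic above, run on the example graph of these slides (the dict-based graph representation is an assumption for illustration):

INF = float("inf")

def sync_bellman_ford(nodes, weights, source):
    # weights: directed edge (u, v) -> weight; every process starts with length = inf
    length = {v: INF for v in nodes}
    parent = {v: None for v in nodes}
    length[source] = 0
    in_nbrs = {v: [u for (u, x) in weights if x == v] for v in nodes}

    for _ in range(len(nodes) - 1):            # n-1 synchronous rounds
        updates = dict(length)                 # the UPDATE(i, length) values "sent" this round
        for v in nodes:                        # every process relaxes using its neighbours' UPDATEs
            for u in in_nbrs[v]:
                if updates[u] + weights[(u, v)] < length[v]:
                    length[v] = updates[u] + weights[(u, v)]
                    parent[v] = u
    return length, parent

# the example graph from the following slides
weights = {("S", "A"): 10, ("S", "E"): 8, ("A", "C"): 2, ("C", "B"): -2,
           ("B", "A"): 1, ("E", "D"): 1, ("D", "C"): -1, ("D", "A"): -4}
length, parent = sync_bellman_ford(["S", "A", "B", "C", "D", "E"], weights, "S")
print(length)   # final values after n-1 rounds: S:0, A:5, B:5, C:7, D:9, E:8
print(parent)   # the parent pointers give each node its next hop back towards S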

12

6
8/9/2024

Round 1 of n-1
Length of S—>X where X in {A,B,C,D,E}

8 S 10
S A B C D E
0 inf inf inf inf inf
0 10 inf inf inf 8
E A
-4
1
1 A and E get to know their distance from S
2

D B
UPDATE messages
-1 -2

SS ZG 526: Distributed Computing 13

13

Round 1 of n-1
Length of S—>X where X in {A,B,C,D,E}

8 S 10
S A B C D E
0 inf inf inf inf inf
0 10 inf 12 inf 8
E A
-4
1
1
2 C gets to know the distance of S from A and already knows its own distance to A.

D B So dist(S->C) = dist(S->A) + dist(A->C)

-1 -2

SS ZG 526: Distributed Computing 14

14

7
8/9/2024

Round 1 of n-1
Length of S—>X where X in {A,B,C,D,E}

8 S 10
S A B C D E
0 inf inf inf inf inf
0 10 inf 12 inf 8
E A 0 10 inf 12 inf 8
-4
1
1
2 B has no new information about S

D B
-1 -2

SS ZG 526: Distributed Computing 15

15

Round 1 of n-1
Length of S—>X where X in {A,B,C,D,E}

8 S 10
S A B C D E
0 inf inf inf inf inf
0 10 inf 12 inf 8
E A 0 10 inf 12 inf 8
-4 0 10 10 12 inf 8
1
1
2

dist(S->C) is known and dist(C->B) is known.


D B So now dist(S->B) = dist(S->C) + dist(C->B)
-1 -2

SS ZG 526: Distributed Computing 16

16

8
8/9/2024

Round 1 of n-1
Length of S—>X where X in {A,B,C,D,E}

8 S 10
S A B C D E
0 inf inf inf inf inf
0 10 inf 12 inf 8
E A 0 10 inf 12 inf 8
0 10 10 12 inf 8
1 -4 0 10 10 12 inf 8
1

2
D B
D has no new information
-1 -2

SS ZG 526: Distributed Computing 17

17

Round 1 of n-1
Length of S—>X where X in {A,B,C,D,E}

8 S 10
S A B C D E
0 inf inf inf inf inf
0 10 inf 12 inf 8
E A 0 10 inf 12 inf 8
0 10 10 12 inf 8
1 -4 0 10 10 12 inf 8
1
0 10 10 12 9 8
2
D B
dist(S->E) and dist(E->D) are known.
-1 -2 So dist(S->D) is now known.

C Upto hop 1 is now stable after round 1 is completed.

SS ZG 526: Distributed Computing 18

18

9
8/9/2024

Round 2 of n-1
Length of S—>X where X in {A,B,C,D,E}

8 S S A B C D E
10 0 10 10 12 9 8
0 10 10 12 9 8
E A 0 10 10 12 9 8
0 10 10 12 9 8
1 -4
1

2
D B
No new information with S, A, B, C in round 2 so far.
-1 -2

SS ZG 526: Distributed Computing 19

19

Round 2 of n-1
Length of S—>X where X in {A,B,C,D,E}

8 S 10
S A B C D E
0 10 10 12 9 8
0 10 10 12 9 8
E A 0 10 10 12 9 8
0 10 10 12 9 8
1 -4 0 5 10 8 9 8
1

2
D B
dist(S->D) can now be used to update dist(S->X) for all neighbours X of D.
-1 -2 dist(S->A) — a better path now is via D from S instead of directly from S
dist(S->C) — a better path now is via D from S instead of via A
C

SS ZG 526: Distributed Computing 20

20

10
8/9/2024

Round 2 of n-1
Length of S—>X where X in {A,B,C,D,E}

8 S 10
S A B C D E
0 10 10 12 9 8
0 10 10 12 9 8
E A 0 10 10 12 9 8
0 10 10 12 9 8
1 -4 0 5 10 8 9 8
1
0 5 10 8 9 8
2
D B
-1 -2 No new information with E.
Upto hop 2 is now stable at the end of round 2.
C

SS ZG 526: Distributed Computing 21

21

Rounds 3 to 5
Length of S—>X where X in {A,B,C,D,E}

8 S 10
Round S A B C D E
3 0 5 5 7 9 8
4 0 5 5 7 9 8
E A 5 0 5 5 7 9 8

1 -4
1
Final parent-child links converge before round n-1
2 (n=6)
D B
-2 Notice that :
-1
E stabilised by first round.
C D stabilised by second round.
So hop i in final path will stabilise by ith round.

SS ZG 526: Distributed Computing 22

22

11
8/9/2024

Some observations
» Works when no cycles have negative weight
» Time complexity O(N)
» How many messages - (n-1)*(#links)
» The parent pointers also construct a spanning tree rooted at the source (a shortest-path tree, not necessarily an MST)
» MST algorithm discussed later doesn’t handle negative weights and is for undirected graph
» By round k, kth hops stabilise
» The parent variable is key to routing function - tells each node D how to reach S hop by hop
» Used in Internet Distance Vector Routing protocol
» But need to handle dynamic changes to network
» Virtually synchronous with timeouts when UPDATEs get lost
» Async version : large time and message complexity and complex to determine termination

23

23

2. Minimum Spanning Tree


» Overlay a tree on top of an undirected network graph
» Makes message-passing distributed algorithms simpler when we have to send
messages to other nodes
» e.g. broadcast (root to all) or convergecast (leaves to root) or multicast (one to many)
» What’s a good tree ?
» A minimum spanning tree - because minimising cost keeps nodes close by, based on the weights on
the edges
» Which algorithm ?
» Gallager Humblet Spira (GHS) algorithm (sync)
» Sequential algorithms - Kruskal, Prims

24

24

12
8/9/2024

GHS algorithm - Basics


» Some basics first
» If all weights are unique then MST is unique
» A fragment is a sub-tree of the MST
» Take a fragment F
» e is a min weight outgoing edge (MWOE) from F
to rest of the graph
e
» F U e is also a fragment (i.e. sub-tree of MST)
F

25

25

GHS algorithm - Overview


» Start with each node being a fragment
» Within a fragment, nodes run a distributed algorithm to identify the 5
4
MWOE e
» Send test messages to neighbours (e.g. between nodes 1, 2, 3, 4, 5
1
in diagram) and exchange info to see who has the best candidate
F
» A fragment connects to another fragment via MWOE and makes a 2 3
larger fragment
» Gradually number of fragments decrease
» The final one fragment is the MST e
F1 F2

26

26

13
8/9/2024

GHS algorithm - finding the MWOE


» Send messages to neighbours
» If neighbor is in same fragment will get a reject
» Pick edge that gets a positive response and has
minimum weight
» Make the node with the MWOE as the fragment
root 3 e

4 A
F
B

27

27

GHS algorithm - fragment name and level


» Each fragment has a unique name
» Each fragment has a level - some indication of size
» When 2 fragments combine, then one of the fragment’s nodes change their F2
fragment name to a new name
» if F1 is combining with F2 then level(F1) <= level(F2)
» smaller fragment always asks the larger fragment
» if level(F1) < level(F2) then F1 nodes take name and level of F2 e
» if level(F1) =level(F2) then level of both fragment nodes = old level +1
» A new name has to be given for all nodes from F1 U F2
F1

Think of the analogy of company M&A


28

28

14
8/9/2024

GHS algorithm - “M&A” rules


» F1 wants to combine with F2
» LT rule / ABSORB : L1 < L2
» All F1 nodes get named F2 and assigned L2
F2, L2
» EQ rule / MERGE: L1 = L2 and e1 = e2 (e2 is the
MWOE of F2)
» new name is e1
» new level = L1 + 1 e1

» WAIT rule (i.e. L1 > L2) F1, L1


» If LT and EQ don’t apply then wait
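A small sketch of just this combination decision (not the full GHS protocol); the Fragment tuple and the combine() helper are illustrative assumptions:

from collections import namedtuple

Fragment = namedtuple("Fragment", ["name", "level", "mwoe"])   # mwoe = (weight, endpoints)

def combine(F1, F2):
    # F1 wants to combine with F2 over F1's minimum-weight outgoing edge
    if F1.level < F2.level:                              # LT rule / ABSORB
        return Fragment(F2.name, F2.level, None)         # F1's nodes adopt F2's name and level
    if F1.level == F2.level and F1.mwoe == F2.mwoe:      # EQ rule / MERGE
        return Fragment(name=F1.mwoe, level=F1.level + 1, mwoe=None)   # new name = the shared MWOE
    return None                                          # WAIT rule: defer until LT or EQ applies

f1 = Fragment(name="x", level=1, mwoe=(3, ("A", "B")))
f2 = Fragment(name="y", level=1, mwoe=(3, ("A", "B")))
print(combine(f1, f2))   # MERGE: both sides take edge (A, B) as the new name, level becomes 2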

29

29

Example

1
2

30

30

15
8/9/2024

3a. Leader election for rings


» A set of processes decide a leader
» Fundamental to many asymmetric / centralized distributed algorithms
» e.g. leader needed to avoid replicated computation or do book-keeping
or recover a lost token etc.
» Assume processes are in a unidirectional ring (i.e. right and left neighbors) E A
» There are other algorithms for different topologies
» Each process has a unique ID and know their right and left neighbors
» Non-anonymous but uniform (independent of n) algorithm B
D
» An anonymous algorithm is not possible for a ring

31

31

“Ring leader” election - outline


» Pi, for all i, sends its ID i to its left neighbor
» When a process Pi receives ID j from its right neighbour Pj
» i < j: forward ID j to the left neighbor
» i > j: ignore the message
» i = j: this is only possible if Pi’s ID i has circulated around the ring.
» So Pi declares itself as leader and sends a “leader” message around the ring.

[Figure: unidirectional ring A–B–C–D–E, each node forwarding IDs to its left neighbour]
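A small simulation of this outline (largest ID wins); the deque-based message queues stand in for the ring's channels and are an assumption for illustration:

from collections import deque

def ring_election(ids):
    # ids[i] is the unique ID of process i; process i sends to its left neighbour (i+1) % n
    n = len(ids)
    inbox = [deque([ids[(i - 1) % n]]) for i in range(n)]   # every process has sent its own ID left
    leader = None
    while leader is None:
        for i in range(n):
            while inbox[i]:
                j = inbox[i].popleft()
                if j > ids[i]:
                    inbox[(i + 1) % n].append(j)   # i < j: forward the larger ID
                elif j == ids[i]:
                    leader = ids[i]                # own ID came all the way around: declare leader
                # i > j: ignore the message
    return leader

print(ring_election([3, 7, 2, 9, 5]))   # 9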

32

32

16
8/9/2024

3b. Leader Election - completely connected graph


1. If largest process ID then declare you are the leader and
send “leader” message to all and finish. aka Bully Algorithm

2. Send “election” message to processes with larger ID


3. If you receive a reply then give up and wait for “leader” E A
message
4. If no reply then send “leader” message to all
B
5. If you receive reply but no “leader” message then re-initiate D
by resending “election” message
C
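A compressed sketch of the decision a single process makes in this election. Real implementations detect missing replies with timeouts; here the set of responding (alive) processes is simply passed in, so only the outcome logic is shown:

def bully_election(my_id, alive):
    # run from process `my_id`; `alive` is the set of process IDs that respond
    higher = [p for p in alive if p > my_id]
    if not higher:
        return my_id          # no larger ID answers: declare yourself leader, tell everyone
    # otherwise at least one larger process replies, so give up and wait for the
    # "leader" message from the highest live process (which wins its own election)
    return max(higher)

print(bully_election(3, alive={1, 2, 3, 5, 7}))   # 7 ends up as leader
print(bully_election(7, alive={1, 2, 3, 7}))      # 7 declares itself leader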

33

33

4. Maximal Independent Sets


» Find a set of nodes in a graph such that if a node is included then its neighbours cannot
be included - independent set
» If there is no independent set that is a super-set of this set then it is a maximal
independent set
» In other words, you cannot add one or more nodes to an MIS and get another
independent set
» Largest (one or more) MISes is (are) Maximum Independent Set(s)
» If all MIS are same size then it is a well-covered graph
» Why is it useful ?
» For any shared resource (e.g. wireless) allow maximum concurrent use without
causing conflict
» Challenging in real life when nodes dynamically join and leave, new links show up, and MIS nodes’
links fail
» Assume a static graph and discuss an async distributed algorithm proposed by Luby

34

34

17
8/9/2024

Luby’s algorithm for MIS: overview (1)


random numbers picked in each round by each candidate node

1. In each iteration, every node Pi picks a random number ri and sends it to its neighbors using a RANDOM message
2. If ri < all rj picked by neighbours j then Pi includes itself in the MIS and exits,
e.g. the nodes with numbers 1 and 0 include themselves in round 1.

[Figure, round 1: nodes labelled with the random numbers 7 2 1 5 / 6 1 2 6 / 0 8 6]

35

35

Luby’s algorithm for MIS: overview (2)


3. In any case, Pi informs its neighbours about its decision: it sends a SELECTED=True message if it included itself, else a SELECTED=False message
4. If Pi received SELECTED=True from any neighbour Pj then Pi eliminates itself from the MIS candidates and informs its neighbours by sending an ELIMINATED message, e.g. the nodes with numbers 6, 1, 2, 5 eliminate themselves.
5. If Pi received ELIMINATED from Pj then it removes Pj from its set of competing neighbours

[Figure, round 1: same nodes with their round-1 random numbers]

36

36

18
8/9/2024

Luby’s algorithm for MIS: overview (3)


random numbers picked in round 2 by each candidate node

1. In round 2, new random numbers get assigned.
2. Nodes with numbers 2 and 1 include themselves.
3. Their neighbor nodes with numbers 4 and 5 eliminate themselves.
4. The node with number 9 remains.
5. The algorithm finishes in round 2 with the MIS being the nodes with a green tick.

[Figure, round 2: remaining candidate nodes with random numbers 4, 2, 9, 5, 1]

» The algorithm will converge in 4log(n) passes with probability 1-1/n
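A round-based simulation of the steps sketched above, on a small path graph; the adjacency-dict representation and the seeded random generator are assumptions for illustration:

import random

def luby_mis(adj, seed=0):
    rng = random.Random(seed)
    active = set(adj)                   # nodes still competing
    mis = set()
    while active:
        r = {v: rng.random() for v in active}                              # 1. pick random numbers
        selected = {v for v in active
                    if all(r[v] < r[u] for u in adj[v] if u in active)}    # 2. local minima join the MIS
        mis |= selected
        eliminated = {u for v in selected for u in adj[v]} & active        # 4. their neighbours drop out
        active -= selected | eliminated                                    # 5. remove decided nodes, repeat
    return mis

adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
print(luby_mis(adj))   # e.g. {'a', 'c'} or {'b', 'd'}, depending on the random draws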


37

37

5. Connected Dominating Set


» Find a set of nodes S in a graph such that any other
node in the graph outside S is connected to a node in S
by an edge
» What is it useful for ?
» Backbone networks for broadcasts, routing in WAN
and wireless

38

38

19
8/9/2024

CDS algorithms
» Simple heuristics
» Create a MST and delete
edges to all leaf nodes
» An MIS is also a dominating set.
So create an MIS and add
edges.
» It is non-trivial to devise a good
approximation algorithm
MIS nodes

39

39

Topics for today


» Distributed algorithms : Basic concepts / terms
» Example distributed graph algorithms and use cases
» Single source shortest path
» Minimum Spanning Tree (MST)
» Leader election
» Maximal Independent Sets (MIS)
» Connected Dominating Set (CDS)
» Synchronizers

40

40

20
8/9/2024

Synchronizers (1)
» Synchronous algorithms work in lock-step across processors
» easier to program than async algorithms
» On each clock-tick / round, a process does
» receive messages
» compute
» send messages
» A synchroniser protocol enables a sync algorithm to run on an async system
» Think of traffic lights on a straight road synchronising flow of cars from one light to the
next.

41

41

Synchronizers (2)
» Every message sent in clock tick k must be received in clock tick
k
» All messages have a clock tick value
» if d is max propagation delay on a channel then each process
will start simulation of a new clock tick after 2d time units 2d
» Simulating lock step mode of operation using a protocol
» Clock ticks can be broadcast by a leader or by having
k-1 k k+1
heartbeat-like control messages on channels (in the absence of
data messages in a clock tick)
Finish all work for clock tick k

Think of the rounds in sync Bellman-Ford algorithm for single source shortest paths
42

42

21
8/9/2024

α-Synchronizer
Repeat for each clock tick at each process Pi in round / clock tick k
1. Send / receive all messages m for the current clock tick
2. Send 'ack’ for each message received and receive ‘ack’ for each message sent
3. Send a ‘safe’ message to all neighbours after sending and receiving all ‘ack’ messages
4. Move to round k+1 after getting ‘safe’ from all neighbours in round k

round k
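A sketch of the per-process bookkeeping behind these four steps; the transport callback and the done_sending_for_round() hook are assumptions, and failure handling is omitted:

class AlphaSyncProcess:
    def __init__(self, pid, neighbours, send_fn):
        self.pid, self.neighbours, self.send = pid, set(neighbours), send_fn
        self.round = 0
        self.pending_acks = 0        # messages sent in this round that are not yet acked
        self.safe_from = set()       # neighbours that reported 'safe' for this round

    def send_app_message(self, dest, payload):
        self.pending_acks += 1
        self.send(dest, ("MSG", self.round, payload))

    def done_sending_for_round(self):
        if self.pending_acks == 0:               # nothing outstanding: immediately safe
            self._announce_safe()

    def _announce_safe(self):
        for n in self.neighbours:
            self.send(n, ("SAFE", self.round, None))

    def on_message(self, src, kind, rnd, payload=None):
        if kind == "MSG":
            self.send(src, ("ACK", rnd, None))   # ack every application message received
        elif kind == "ACK":
            self.pending_acks -= 1
            if self.pending_acks == 0:           # all our messages are acked: we are safe
                self._announce_safe()
        elif kind == "SAFE":
            self.safe_from.add(src)
        if self.pending_acks == 0 and self.safe_from == self.neighbours:
            self.round += 1                      # everyone around us is safe: next clock tick
            self.safe_from = set()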

43

43

β-Synchronizer
1. Construct spanning tree of nodes with a root
2. Root of the spanning tree initiates new clock tick by sending broadcast ‘next’ message for
clock tick j
3. All processes exchange messages and ‘ack's in same way as alpha-synchronizer
4. The ‘safe' messages use a convergecast from leaves to root:
1. Each process responds with ack(j) and then safe(j) once all the subtree at the process
has sent ack(j) and safe(j)
2. Once root receives safe(j) from all children it initiates clock tick j+1 using a broadcast
‘next’ message

44

44

22
8/9/2024

γ-Synchronizer - a hybrid approach


» Divide network into clusters
» Form spanning tree within cluster with
cluster head as root
» Create designated edges across
clusters
» Within each cluster use beta
synchronizer
» Once each cluster is stablized, run
alpha-synchronizer across clusters with
only designated edges

45

45

Summary

» Terminologies and classification of various algorithms


» Designing distributed algorithms through some examples
» How these algorithms are used in practice
» Appreciate challenges in design of distributed logic
» Study material:
» Chapter 5 of Text book 1 (T1) esp algorithms covered in class
» Study GHS algorithm in detail (Section 5.5.11 and Algorithm 5.11)

SS ZG 526: Distributed Computing 46

46

23
8/9/2024

SS ZG 526: Distributed Computing


Session 5: Causal ordering and group communication

Dr. Anindya Neogi


Associate Professor
[email protected]

Reference: digital content slides from 1Prof. C. Hota, BITS Pilani

Topics for today

1. Recap some graph algorithms from Session 4


2. Message ordering
3. Group communication
4. Case study of causal ordering: MongoDB
5. Protocols for causal ordering

SS ZG 526: Distributed Computing 2

1
8/9/2024

Session 3 Recap: GHS algorithm for MST


Phase 1

https://slidetodoc.com/minimum-spanning-trees-gallagherhumbletspira-ghs-algorithm-1-weighted/
SS ZG 526: Distributed Computing 3

Session 3 Recap: GHS algorithm for MST


Phase 1

SS ZG 526: Distributed Computing 4

2
8/9/2024

Session 3 Recap: GHS algorithm for MST


Phase 2

SS ZG 526: Distributed Computing 5

Session 3 Recap: GHS algorithm for MST


Phase 2

SS ZG 526: Distributed Computing 6

3
8/9/2024

Session 3 Recap: GHS algorithm for MST


Phase 3

SS ZG 526: Distributed Computing 7

Session 3 Recap: GHS algorithm for MST


Phase 3

SS ZG 526: Distributed Computing 8

4
8/9/2024

Session 3 Recap: Luby’s algorithm for MIS

6 4 5 6 4 5

a b f 3
a b f 3
2 g 2 g
c d c d
9 1 9 1
h h
e e
7 7

MIS = {d, h, a, f}

SS ZG 526: Distributed Computing 9

Topics for today

1. Recap some graph algorithms from Session 4


2. Message ordering
3. Group communication
4. Case study of causal ordering: MongoDB
5. Protocols for causal ordering

SS ZG 526: Distributed Computing 10

10

5
8/9/2024

Message orders
1. Async execution / non-FIFO
• Messages can be delivered in any order on a link

2. FIFO
• Messages from same sender are delivered in same order

3. Causal order (CO)


• If send1—>send2 then receive1—>receive2

4. Sync execution
• Send and receive happen at the same instant as an atomic transaction

and not causally ordered

Sync CO FIFO ASync

SS ZG 526: Distributed Computing 11

11

Message orders - example

least concurrent most concurrent


easiest to program
Sync CO FIFO ASync hardest to program

SS ZG 526: Distributed Computing 12

12

6
8/9/2024

FIFO example

violates FIFO

1. All deliveries of m1 must precede all deliveries of m3


2. No effect on m2, m4

SS ZG 526: Distributed Computing 13

13

CO example
(1) Lamport clock values

(2) (3)

m1 “happens before” m2

1. Everyone must see m1 before m3 (as in FIFO)


2. Also, due to “happens before” as in Lamport clock, everyone must see m1 before m2. But, P2 sees m2
before m1 breaking CO.

SS ZG 526: Distributed Computing 14

14

7
8/9/2024

Causal ordering - examples

violated. Why ? satisfied satisfied satisfied


s1—>s2—>r2—>s3
But r3—>r1 violates CO
because by CO : s1—>s3
implies r1—>r3

SS ZG 526: Distributed Computing 15

15

Topics for today

1. Recap some graph algorithms from Session 4


2. Message ordering
3. Group communication
4. Case study of causal ordering: MongoDB
5. Protocols for causal ordering

SS ZG 526: Distributed Computing 16

16

8
8/9/2024

What are Groups


• Distributed systems may have process groups
• Set of processes on a set of machines
• Processes can multicast messages in the group
• Allows fault tolerance when some processes fail
• Membership can be fixed or dynamic
• Members can join and leave if dynamic
• Groups can be open or closed
• External members can send messages in open groups
• Internally can be structured as peer-to-peer or in a master slave
model
• Coordinator is needed as a leader for control and
management
• May make some protocols easier, e.g. message ordering
SS ZG 526: Distributed Computing 17

17

Group communication

• Groups can send messages to members


• Could be single network level multicast or
Application level
application level with multiple messages to
each receiver
• Assume reliable message delivery
• Assume nodes don’t crash

Network level
SS ZG 526: Distributed Computing 18

18

9
8/9/2024

Application of CO: Group communication

• Applications
• Akamai content distribution system
• Air Traffic Control communication to relay orders
• Facebook updates
• Distributed DB replica updates
• Multiple modes of communication
• Unicast: message sent one to one
• Broadcast: message sent one to all in group
2 communicating processes are
• Multicast: message broadcast to a sub-group updating a group of 3 replicas
• Kind of message ordering
• FIFO, non-FIFO, sync, CO

SS ZG 526: Distributed Computing 19

19

Application of CO: Group communication (2)

Why in the distributed systems middleware layer and not in


networking stack and hardware assisted layers ?
• Application specific ordering semantics
• Adapting groups to dynamically changing membership
• Various fault tolerance semantics
2 communicating processes are
• Customizing send multicasts to any process subgroup updating a group of 3 replicas

SS ZG 526: Distributed Computing 20

20

10
8/9/2024

CO - DB replica update example

update message orders are different irrespective of m

@ R1 : P2 then P1
@ R2 : P1 then P2
@ R3 : P1 then P2

SS ZG 526: Distributed Computing 21

21

CO - DB replica update example

All P2 updates happen before P1 at all Receivers


But violates CO because P1 —> P2 considering m

@ R1 : P2 then P1
@ R2 : P2 then P1
@ R3 : P2 then P1

SS ZG 526: Distributed Computing 22

22

11
8/9/2024

CO criteria

1. Safety (system does not reach a bad state)


• A message M that arrives at a process needs to be buffered

until all system-wide messages sent in the causal past of send(M)


event to the same destination have been delivered.
2. Liveness (system eventually reaches a good state)
• A message that arrives at a process has to be delivered to the

process.

Notes:
a. Safety and Liveness criteria are commonly stated in many protocols.
b. In a system with FIFO channels CO has to be implemented by the control protocol.
SS ZG 526: Distributed Computing 23

23

Topics for today

1. Recap some graph algorithms from Session 4


2. Message ordering
3. Group communication
4. Case study of causal ordering: MongoDB
5. Protocols for causal ordering
1. BSS

SS ZG 526: Distributed Computing 24

24

12
8/9/2024

cloud.mongodb.com

Get me top 10 beach front homes

SS ZG 526: Distributed Computing 25

25

MongoDB
• Document oriented DB
• Various read and write choices for flexible consistency tradeoff with scale / performance and durability
• Automatic primary re-election on primary failure and/or network partition

26

26

13
8/9/2024

Example in MongoDB

• Case 1 : No causal consistency

• Case 2: Causal consistency by making read to secondary wait

https://engineering.mongodb.com/post/ryp0ohr2w9pvv0fks88kq6qkz9k9p3
27

27

MongoDB “read concerns”, i.e. read configurations

• local :
• Client reads primary replica
• Client reads from secondary in causally consistent sessions
• available:
• Read on secondary but causal consistency not required
• majority :
• If client wants to read what majority of nodes have. Best option for fault tolerance and durability.
• linearizable :
• If client wants to read what has been written to majority of nodes before the read started.
• Has to be read on primary
• Only single document can be read

https://docs.mongodb.com/v3.4/core/read-preference-mechanics/
28

28

14
8/9/2024

MongoDB “write concerns” - i.e. write configs

• how many replicas should ack


• 1 - primary only
• 0 - none
• n - how many including primary
• majority - a majority of nodes (preferred for durability)
• journaling - If True then nodes need to write to disk journal before ack else ack
after writing to memory (less durable)
• timeout for write operation

https://docs.mongodb.com/manual/reference/write-concern/
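A sketch of how these read / write concerns and a causally consistent session are combined with the PyMongo driver (the URI, database and collection names are placeholders; option names should be checked against the driver version in use):

from pymongo import MongoClient
from pymongo.read_concern import ReadConcern
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://replica-set-host:27017/?replicaSet=rs0")
coll = client["shop"].get_collection(
    "orders",
    read_concern=ReadConcern("majority"),                              # read what a majority has acknowledged
    write_concern=WriteConcern(w="majority", wtimeout=5000, j=True),   # durable, journaled writes with a timeout
)

# reads in this session are guaranteed to observe the session's own earlier writes
with client.start_session(causal_consistency=True) as s:
    coll.insert_one({"item": "beach house", "qty": 1}, session=s)
    doc = coll.find_one({"item": "beach house"}, session=s)   # sees the insert above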
29

29

Consistency scenarios - causally consistent and durable

(gets support of majority)

• read=majority, write=majority
• W1 and R1 for P1 will fail and will succeed in P2
• So causally consistent, durable even with network partition sacrificing performance
• Example: Used in critical transaction oriented applications, e.g. stock trading

https://engineering.mongodb.com/post/ryp0ohr2w9pvv0fks88kq6qkz9k9p3
30

30

15
8/9/2024

Consistency scenarios - causally consistent but not durable

• read=majority, write=1
• W1 may succeed on P1 and P2. R1 will succeed only on P2. W1 on P1 may roll back.
• So causally consistent but not durable with network partition. Fast writes, slower reads.
• Example: Twitter - a post may disappear but if on refresh you see it then it should be durable, else
repost.

https://engineering.mongodb.com/post/ryp0ohr2w9pvv0fks88kq6qkz9k9p3
31

31

Consistency scenarios - eventual consistency with durable writes

• read=available, write=majority
• W1 will succeed only for P1 and reads may not succeed to see the last write. Slow durable writes and
fast non-causal reads.
• Example: Review site where write should be durable but reads don’t need causal guarantee as long as
it appears some time (eventual consistency).

https://engineering.mongodb.com/post/ryp0ohr2w9pvv0fks88kq6qkz9k9p3
32

32

16
8/9/2024

Consistency scenarios - eventual consistency but no durability

• read=local, write=1
• Same as previous scenario and not writes are also not durable and may be rolled back.
• Example: Real-time sensor data feed that needs fast writes to keep up with the rate and reads should
get as much recent real-time data as possible. Data may be dropped on failures.

https://engineering.mongodb.com/post/ryp0ohr2w9pvv0fks88kq6qkz9k9p3
33

33

Topics for today

1. Recap some graph algorithms from Session 4


2. Message ordering
3. Group communication
4. Case study of causal ordering: MongoDB
5. Protocols for causal ordering
1. BSS

SS ZG 526: Distributed Computing 34

34

17
8/9/2024

Recap : Vector clock rules (1)

Process pi uses the following two rules R1 and R2 to update its clock:

• R1: Before executing an event, process pi updates its local logical time as follows:
vti[i] := vti[i] + d (d > 0)

(1,0,0) R1: (2,0,0)


e1 e2
P1

SS ZG 526: Distributed Computing 35

35

Recap: Vector clock rules (2)

• R2: Each message m is piggybacked with the vector clock vt of the sender process at
sending time. On the receipt of such a message (m,vt), process pi executes the following
sequence of actions:

➢ Update its global logical time as follows:

1≤ k≤ n : vti[k] := max(vti [k], vt[k])


➢ Execute R1

➢ Deliver the message


(“Execute R1” is not used in the context of BSS)

[Figure: P2 starts at (0,0,0) and receives m with vt = (2,0,0) at event e3 — R2 (a): (2,0,0), R2 (b): (2,1,0), then Deliver]
SS ZG 526: Distributed Computing 36

36

18
8/9/2024

Birman-Schiper-Stephenson (BSS) protocol (1)

• Works for broadcast communication


• Before broadcasting m, process Pi increments Pi
vector time VTPi [i] and timestamps ‘m’
• Rule R1 of vector clock
• Process Pj ≠ P i receives m with timestamp VTm m, VTm[…]
from Pi
• Pj has to evaluate whether to deliver m based on it
m, VTPj[…]
own timestamp VTPj and VTm Pj

Compare these 2 timestamps to decide


whether to buffer the message or deliver it
SS ZG 526: Distributed Computing 37

37

Birman-Schiper-Stephenson (BSS) protocol (2)


P1 [3,2]
• Pj delays delivery of m until both these rules hold:
– R1: VT Pj [i] = VT m [i] - 1
m, VT_m[3,2]
– received all previous m from Pi
– R2: VT Pj [k] >= VT m [k] for every k Є {1,2,…, n} - {i}
– received all messages also received by Pi before sending
message m VT_P2[2,4]
P2
• When Pj delivers ‘m’, R1 passed: VT_m[P1]-1 = 2
R2 passed: VT_P2[P2] > VT_m[P2]
– VT Pj updated by rule R2 of vector clock
– R1 of vector clock is not used
New VT_P2 = [3,4]
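The two delivery rules and the clock update, written as a small Python sketch and checked against the numbers in the figure (function names are illustrative):

def bss_can_deliver(VT_j, VT_m, i):
    # R1: Pj has received every earlier broadcast from Pi
    if VT_j[i] != VT_m[i] - 1:
        return False
    # R2: Pj has received everything Pi had received before sending m
    return all(VT_j[k] >= VT_m[k] for k in range(len(VT_j)) if k != i)

def bss_deliver(VT_j, VT_m):
    # on delivery, only rule R2 of the vector clock is applied (component-wise max)
    return [max(a, b) for a, b in zip(VT_j, VT_m)]

# the example above: VT_P2 = [2,4], message from P1 (index 0) with VT_m = [3,2]
VT_P2, VT_m = [2, 4], [3, 2]
if bss_can_deliver(VT_P2, VT_m, i=0):
    VT_P2 = bss_deliver(VT_P2, VT_m)
print(VT_P2)   # [3, 4]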

SS ZG 526: Distributed Computing

38

19
8/9/2024

Simple example of BSS (1)

[1,0] [2,0]
P1

[2,0] m2
m1 [1,0]
Example shows a trivial broadcast to
one process to explain how buffering
is done to maintain CO.

P2
[0,0]
For m1 — R1: VTm1[P1]-1 = 0 == VTP2[P1] -> pass; R2: VTP2[P2] = 0 >= VTm1[P2] -> pass; Action: Deliver and set VTP2=[1,0]
For m2 — R1: VTm2[P1]-1 = 1 != VTP2[P1] -> fail; Action: Buffer
SS ZG 526: Distributed Computing 39

39

Simple example of BSS (2)

[1,0] [2,0]
P1

[2,0] m2
m1 [1,0]     We do not apply the full vector clock update (R1 called from R2)
because the purpose of BSS is to order messages, not to count local events.

[1,0] <- using R2 only of Vector Clock algorithm


P2
[0,0]
Re-evaluate m2 in buffer
R1: VTm2[P1]-1 = 1 == VTP2[P1] -> pass
m2 [2,0] in Buffer R2: VTP2[P2] = 0 >= VTm2[P2] = 0 -> pass
Action: Deliver and set VTP2=[2,0]
SS ZG 526: Distributed Computing 40

40

20
8/9/2024

A proper broadcast example of BSS

[1,0,0]
P1

[1,0,0]

P2
[0,0,0]
R1: [1-1,*,*] = [0,*,*]
R2: [*,0,0] >= [*,0,0]
==> Deliver
P3

SS ZG 526: Distributed Computing 41

41

A proper broadcast example of BSS

[1,0,0] [2,0,0]
P1

[1,0,0]
[2,0,0]

P2 [1,0,0] [2,0,0]
[0,0,0]
deliver

P3

SS ZG 526: Distributed Computing 42

42

21
8/9/2024

A proper broadcast example of BSS

[1,0,0] [2,0,0]
P1

[1,0,0]
[2,0,0]

P2 [1,0,0] [2,0,0] [2,1,0] <- based on R1 in vector clock


[0,0,0]
[2,0,0]
[2,1,0]
P3

[0,0,0] R1: [*,1-1,*] == [*,0,*] > pass


R1: [2-1,*,*] != [0,*,*] R2: [0,*,0] < [2,1,0] > fails
so buffer so buffer
SS ZG 526: Distributed Computing 43

43

A proper broadcast example of BSS

[1,0,0] [2,0,0]
P1

P2 [1,0,0] [2,0,0] [2,1,0]


[0,0,0]
[1,0,0]

P3
[1,0,0]
[0,0,0] [0,0,0] deliver and apply R2 of vector clock

VT_P3 has still not changed


SS ZG 526: Distributed Computing 44

44

22
8/9/2024

A proper broadcast example of BSS

[1,0,0] [2,0,0]
P1

P2 [1,0,0] [2,0,0] [2,1,0]

[2,0,0]
[2,0,0]
[2,1,0]
P3
[1,0,0]

VT_P3=[1,0,0] and VT_m=[2,0,0]


R1 and R2 hold on buffered message
SS ZG 526: Distributed Computing
==> deliver 45

45

A proper broadcast example of BSS

[1,0,0] [2,0,0]
P1

P2 [1,0,0] [2,0,0] [2,1,0]


[2,1,0]
[2,0,0]
[2,0,0]
[2,1,0]
P3
[1,0,0] [2,0,0] [2,1,0]

VT_P3=[2,0,0] and VT_m=[2,1,0]


R1 and R2 pass on buffered message
SS ZG 526: Distributed Computing
==> deliver 46

46

23
8/9/2024

Topics for today

1. Recap some graph algorithms from Session 4


2. Message ordering
3. Group communication
4. Protocols for causal ordering
1. BSS
2. SES

SS ZG 526: Distributed Computing 47

47

Schiper-Eggli-Sandoz (SES) protocol


• Relaxation from BSS protocol: No need for broadcast messages.
• Each process maintains a vector V_P of size N-1 V_P
• N being the number of processes in the system.
• Each element of V_P is a tuple (P’,t)
P’, [….]
• P’ the destination process id
• t is the vector timestamp of the last message sent to P’
• Tm: logical time of sending message m
• Tpi: present logical time at pi
• Initially, V_P is empty.
N-1 destinations

SS ZG 526: Distributed Computing 48

48

24
8/9/2024

SES sending rule

V_P1 V_P1

R1: Sending a Message


– Send message M, time stamped Tm, along P2, [t] P2, [Tm]
with V_P1 to P2*.
– Then insert (P2, Tm) into V_P1 and overwrite
the previous value of (P2,t), if any.

There is now no need of broadcast because every message


contains the history of a sender’s latest message timestamp to
other destinations old vector new vector

* using P1 and P2 as an example; in general it is a rule for any Pi sending M to Pj


send with M to P2
SS ZG 526: Distributed Computing 49

49

SES receiving rule

R2: Delivering a message


– If V_M (in the message M) does not contain any pair (P2, t) P1
– deliver M
– If (P2, t) exists in V_M
[ (P2,[1,0]) ]
– If t !< Tp2
– buffer the message and don’t deliver
– else (t < Tp2)
– deliver M because M was sent in the past of current
time at P2 P2
– If M is delivered [0,0]
– use R1 and R2 of vector clock algorithm to update vector
Buffer since t=[1,0] > Tp2=[0,0]
clock at recipient P2
i.e. message is ahead of time at P2
– Update V_P2 based on latest information in V_M

SS ZG 526: Distributed Computing 50

50
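The send and delivery tests above can be sketched as follows, treating "<" as the usual vector-clock partial order (componentwise <= and not equal). This is illustrative only; V_P is a dict keyed by destination id, and the vector-clock merge performed after delivery is left out.

```python
def ses_send(V_P, Tp, dest, payload):
    """R1: piggyback the current clock Tm and V_P, then record (dest, Tm) in V_P."""
    msg = (payload, list(Tp), dict(V_P))   # (data, Tm, V_M) travels together
    V_P[dest] = list(Tp)                   # overwrite any earlier entry for dest
    return msg

def earlier(t, Tp):
    """t < Tp in vector-clock order: componentwise <= and not equal."""
    return all(a <= b for a, b in zip(t, Tp)) and t != Tp

def ses_deliverable(me, V_M, Tp):
    """R2: deliver if V_M has no entry for me, or that entry lies in my past."""
    return me not in V_M or earlier(V_M[me], Tp)
```

In the running example, P3 with Tp = [1,0,1] finds (P3, [1,0,0]) in V_M and earlier([1,0,0], [1,0,1]) holds, so the buffered message becomes deliverable.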

25
8/9/2024

SES simple example

R1: V_P1={} R1: V_P1={(P2,[1,0])} R1: V_P1={(P2,[2,0])}


[1,0] [2,0]
P1
[0,0]
[2,0] {(P2,[1,0])}
m2 V_M (dst specific vector clock entry
m1 [1,0] {} sent along with sender vector clock

[1,1] Vector clock R1 and R2 applied


P2
[0,0]
R2: P2 not in V_M
Action: Deliver
R2: P2 in V_M and [1,0] > [0,0]
Action: Buffer
SS ZG 526: Distributed Computing 51

51

SES simple example

R1: V_P1={} R1: V_P1={(P2,[1,0])} R1: V_P1={(P2,[2,0])}


[1,0] [2,0]
P1

[2,0] {(P2,[1,0])}
m2
m1 [1,0] {}

P2
[0,0] [1,1]

R2: P2 in V_M and [1,0] < [1,1]


R2: P2 in V_M and [1,0] > [0,0] Action: Deliver
Action: Buffer
SS ZG 526: Distributed Computing 52

52

26
8/9/2024

Another example of SES


V_P1={(P3,[1,0,0])}
[1,0,0] [2,0,0]
P1
[1,0,0] {} [2,0,0] {(P3,[1,0,0])}

P2
[0,0,0]
R2: P2 not in V_M
Action: Deliver

P3

SS ZG 526: Distributed Computing 53

53

Another example of SES


V_P1={(P3,[1,0,0])} V_P1={(P3,[1,0,0]), (P2,[2,0,0])}
[1,0,0] [2,0,0]
P1
[1,0,0] {} [2,0,0] {(P3,[1,0,0])}

P2
[0,0,0] [2,1,0]

V_P2={(P3,[1,0,0])}

P3

SS ZG 526: Distributed Computing 54

54

27
8/9/2024

Another example of SES


V_P1={(P3,[1,0,0])} V_P1={(P3,[1,0,0]), (P2,[2,0,0])}
[1,0,0] [2,0,0]
P1
[1,0,0] {} [2,0,0] {(P3,[1,0,0])}

[2,2,0]
P2
[0,0,0] [2,1,0] [2,2,0] {(P3,[1,0,0])}

V_P2={(P3,[1,0,0])}

P3
[0,0,0]
R2: P3 in V_M and [1,0,0] > [0,0,0]
Action: Buffer
SS ZG 526: Distributed Computing 55

55

Another example of SES


V_P1={} V_P1={(P3,[1,0,0])} V_P1={(P3,[1,0,0]), (P2,[2,0,0])}
[1,0,0] [2,0,0]
P1
[1,0,0] {} [2,0,0] {(P3,[1,0,0])}

V_P2={}
[2,2,0]
P2
[0,0,0] [2,1,0] [2,2,0] {(P3,[1,0,0])}

V_P2={(P3,[1,0,0])} R2: P3 not in V_M


Action: Deliver
P3
[1,0,1]
[0,0,0]
R2: P3 in V_M and [1,0,0] > [0,0,0]
Action: Buffer
SS ZG 526: Distributed Computing 56

56

28
8/9/2024

Another example of SES


V_P1={} V_P1={(P3,[1,0,0])} V_P1={(P3,[1,0,0]), (P2,[2,0,0])}
[1,0,0] [2,0,0]
P1
[1,0,0] {} [2,0,0] {(P3,[1,0,0])}

V_P2={}
[2,2,0]
P2
[0,0,0] [2,1,0] [2,2,0] {(P3,[1,0,0])}

V_P2={(P3,[1,0,0])} [2,2,0] {(P3,[1,0,0])}

P3
[1,0,1] [2,2,2]
[0,0,0]
R2: P3 in V_M and [1,0,0] < [1,0,1]
Action: Deliver

SS ZG 526: Distributed Computing 57

57

29
8/9/2024

SS ZG 526: Distributed Computing


Session 6: Message ordering and Termination
detection

Dr. Anindya Neogi


Associate Professor
[email protected]

Reference: digital content slides from Prof. C. Hota, BITS Pilani

Topics for today


» Recap: Causal ordering
» SES Algorithm for unicast and multicast communication
» Total order
» Application level multicast
» Propagation Tree algorithm
» Termination detection
» using distributed snapshots
» using weight throwing
» using spanning tree

1
8/9/2024

Message orders
1. Async execution / non-FIFO
• Messages can be delivered in any order on a link

2. FIFO
• Messages from same sender are delivered in same order

3. Causal order (CO)


• If send1—>send2 then receive1—>receive2

4. Sync execution
• Send and receive happen at the same instant as an atomic transaction

and not causally ordered

Sync CO FIFO ASync

SS ZG 526: Distributed Computing 3

FIFO example

violates FIFO

1. All deliveries of m1 must precede all deliveries of m3


2. No effect on m2, m4

SS ZG 526: Distributed Computing 4

2
8/9/2024

CO example
(1) Lamport clock values

(2) (3)

m1 “happens before” m2

1. Everyone must see m1 before m3 (as in FIFO)


2. Also, due to “happens before” as in Lamport clock, everyone must see m1 before m2. But, P2 sees m2
before m1 breaking CO.

SS ZG 526: Distributed Computing 5

Receiving vs Delivery

This is where one should implement the ordering algorithms

3
8/9/2024

Schiper-Eggli-Sandoz (SES) protocol


• Relaxation from BSS protocol: No need for broadcast messages.
• Each process maintains a vector V_P of size N-1 V_P
• N being the number of processes in the system.
• Each element of V_P is a tuple (P’,t)
P’, [….]
• P’ the destination process id
• t is a vector timestamp of last message sent to P’
• Tm: logical time of sending message m
• Tpi: present logical time at pi
• Initially, V_P is empty.
N-1 destinations

SS ZG 526: Distributed Computing 7

SES algorithm for CO: sending rule

V_P1 V_P1

R1: Sending a Message


1. Send message M, time stamped Tm (current time) along P2, [t] Step (2) P2, [Tm]
with V_P1 to P2*.
2. Then insert (P2, Tm) into V_P1 and overwrite the
previous value of (P2,t), if any.

There is now no need of broadcast because every message


contains the history of a sender’s latest message timestamp to
other destinations <<— V_Pi old vector new vector

Step (1)
* using P1 and P2 as an example; in general it is a rule for any Pi
sending M to Pj          send with M to P2
SS ZG 526: Distributed Computing 8

4
8/9/2024

SES algorithm for CO: receiving rule

R2: Delivering a message


– If V_M (in msg M) does not contain any pair (P2, t) P1
– deliver M (because this is the first msg from the sender to
P2)
– else (P2, t) exists in V_M [ (P2,[1,0]) ]
– if (t < Tp2)
– deliver M because M was sent in the past of current
time at P2
– else
P2
– buffer the message and don’t deliver [0,0]
– If M is delivered
– use R1 and R2 of vector clock algorithm to update vector Buffer since t=[1,0] > Tp2=[0,0]
clock at recipient P2 i.e. message is ahead of time at P2
– Update V_P2 based on latest information in V_M

SS ZG 526: Distributed Computing 9

SES example (1) SES doesn’t need broadcast messages because the destination
vector is sent around

V_P1={(P3,[1,0,0])}
[1,0,0] [2,0,0]
P1
[1,0,0] {} [2,0,0] {(P3,[1,0,0])}

P2
[0,0,0]
R2: P2 not in V_M
Action: Deliver

P3

SS ZG 526: Distributed Computing 10

10

5
8/9/2024

SES example (2)


V_P1={(P3,[1,0,0])} V_P1={(P3,[1,0,0]), (P2,[2,0,0])}
[1,0,0] [2,0,0]
P1
[1,0,0] {} [2,0,0] {(P3,[1,0,0])}

Delivered
P2
[0,0,0] [2,1,0]

V_P2={(P3,[1,0,0])}

P3

SS ZG 526: Distributed Computing 11

11

SES example (3)


V_P1={(P3,[1,0,0])} V_P1={(P3,[1,0,0]), (P2,[2,0,0])}
[1,0,0] [2,0,0]
P1
[1,0,0] {} [2,0,0] {(P3,[1,0,0])}

[2,2,0]
Delivered
P2
[0,0,0] [2,1,0] [2,2,0] {(P3,[1,0,0])}

V_P2={(P3,[1,0,0])}

P3
[0,0,0]
R2: P3 in V_M and [1,0,0] > [0,0,0]
Action: Buffer
SS ZG 526: Distributed Computing 12

12

6
8/9/2024

SES example (4)


V_P1={} V_P1={(P3,[1,0,0])} V_P1={(P3,[1,0,0]), (P2,[2,0,0])}
[1,0,0] [2,0,0]
P1
[1,0,0] {} [2,0,0] {(P3,[1,0,0])}

V_P2={}
[2,2,0]
Delivered
P2
[0,0,0] [2,1,0] [2,2,0] {(P3,[1,0,0])}

V_P2={(P3,[1,0,0])} R2: P3 not in V_M


Action: Deliver
P3
[1,0,1]
[0,0,0]
Buffered

SS ZG 526: Distributed Computing 13

13

SES example (5)


V_P1={} V_P1={(P3,[1,0,0])} V_P1={(P3,[1,0,0]), (P2,[2,0,0])}
[1,0,0] [2,0,0]
P1
[1,0,0] {} [2,0,0] {(P3,[1,0,0])}

V_P2={}
[2,2,0]
Delivered
P2
[0,0,0] [2,1,0] [2,2,0] {(P3,[1,0,0])}

V_P2={(P3,[1,0,0])} [2,2,0] {(P3,[1,0,0])}

P3
[1,0,1] [2,2,2]
[0,0,0] Buffered Delivered
R2: P3 in V_M and [1,0,0] < [1,0,1]
Action: Deliver

SS ZG 526: Distributed Computing 14

14

7
8/9/2024

Topics for today


» Recap: Causal ordering
» Total order
» Application level multicast
» Termination detection
» using distributed snapshots
» using weight throwing
» using spanning tree

15

15

Total order
» CO is popular but sometimes Total Order (TO) is useful
Sync CO FIFO ASync
» CO does not imply TO and vice versa (see next slide)
» Total Order can be used for
» Replica updates in a distributed data store (see next slide) Total Order
» If update(x) is seen before update(y) for replica i then all replicas should see the same sequence
» Distributed Mutual Exclusion using total order multicast (next session)
» For all pairs of processes Pi, Pj and all pairs of messages Mx, My that are delivered at both processes,
Pi delivers Mx before My iff Pj also delivers Mx before My.
» Doesn’t depend on sender and not trying to establish CO
» Basically all processes must see the same (FIFO) message sequence
» TO + CO = Synchronous system — why ?

16

16

8
8/9/2024

Violation of Total Order

update message orders are different irrespective of m

@ R1 : P2 then P1
@ R2 : P1 then P2
@ R3 : P1 then P2

SS ZG 526: Distributed Computing 17

17

Total Order but violation of Causal Order

All P2 updates happen before P1 at all Receivers


But violates CO because P1 —> P2 considering m

@ R1 : P2 then P1
@ R2 : P2 then P1
@ R3 : P2 then P1
Note: In Lamport clock CO does not imply TO. You have to use process ID to create TO.
SS ZG 526: Distributed Computing 18

18

9
8/9/2024

Centralized algorithm using sequencer


» A multicast sender Pi wants to send message M M(P4,G) M(P1,G)

to group G, send a message to leader / Sequencer


coordinator / sequencer: M(i, G) M(P4, G)

» Sequencer sends M(i, G) to all members


M(P1, G)
» When M(i, G) arrives at Pj, deliver to the
application
» So multiple senders are forced into a sequence
at a central sequencer
P1 P2 P3 P4

Group G

19

19
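A sketch of the sequencer idea, with stand-in send/deliver helpers so the ordering logic is self-contained; in a real system these would be network calls and an application up-call.

```python
def send(dest, msg):                       # stand-in for the real transport
    dest.on_ordered(*msg)

def deliver(msg):                          # stand-in for the application up-call
    print("deliver", msg)

class Sequencer:
    """Central coordinator: stamps every multicast with one global sequence number."""
    def __init__(self, group):
        self.group = group
        self.seq = 0

    def on_multicast_request(self, sender, payload):
        self.seq += 1                      # single total order across all senders
        for member in self.group:
            send(member, (self.seq, sender, payload))

class Member:
    """Delivers messages strictly in sequence-number order, holding back gaps."""
    def __init__(self):
        self.next_expected = 1
        self.pending = {}

    def on_ordered(self, seq, sender, payload):
        self.pending[seq] = (sender, payload)
        while self.next_expected in self.pending:
            deliver(self.pending.pop(self.next_expected))
            self.next_expected += 1

# e.g. g = [Member(), Member()]; Sequencer(g).on_multicast_request("P4", "update-x")
```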

ISIS: 3 Phase algorithm without sequencer


After step 1, queue state at P3:
<S:M((P2,G), deliverable>,<M(P1,G), undeliverable>
1. The multicast sender multicasts the message to everyone.
2. Recipients add the received message to a priority queue, tag the P3
message undeliverable, and reply to the sender with a proposed number >= S+1
priority (i.e., proposed sequence number). S+1,P3
• The proposed priority is 1 more than the latest sequence number 1 2 3
heard so far at the recipient, suffixed with the recipient's process 1
M(P1,G)
ID.
2
• The priority queue is always sorted by priority. P1 P4
3. The sender collects all responses from the recipients, calculates their 3
maximum, and re-multicasts original message with this as the final
priority for the message.
4. On receipt of this information, recipients mark the message as 1 2 3
deliverable, reorder the priority queue, and deliver the set of lowest
priority messages that are marked as deliverable.
P2 1. Message
2. Proposed priority
3. Message with final priority
20

20

10
8/9/2024

Topics for today


» Recap: Causal order
» Total order
» Application level multicast
» Termination detection
» using distributed snapshots
» using weight throwing
» using spanning tree

21

21

4 categories of multicast based on source / dest

» Works for open or closed groups - in the


latter case, source is part of a group SSSG SSMG
» For SSSG/SSMG with FIFO channels both
Total Order and CO can be done easily
(refer to TO and CO multicast examples
earlier)
» For MSSG the centralised algorithm works
well for TO and CO (sequencer can do
ordering) because it is converted to a
multiple copies of SSSG class
» For MSMG, need something new

MSSG MSMG

22

22

11
8/9/2024

MSMG using Propagation Tree : high level concept


multiple tree organisations are possible
» Nodes in a distributed system belong to 1 or more multicast groups
» Create a set of meta-groups (groups of groups) to reach nodes where a node belongs to
only one meta-group ABC
» Publishing to one or more groups can be done via meta-groups
BCD
» A group can be reached via multiple meta-groups
» So need to select a primary meta-group for each group A C AB AC
B BC
» Create a propagation tree for forwarding messages :
CD BD D
» Meta-groups are organised under their primary meta-group (sub-tree)
» Primary meta-groups are connected if they have a common group between them meta-groups primary meta-groups
» Any meta-group will have a leader who actually forwards messages within the meta-group
and to the connected meta-group Multicast groups: A, B, C, D
» leader actually does message buffering to implement ordering protocols

B
A Meta-group = [A,B]
23

23

Algorithm for MSMG: Meta-groups


» Create meta-groups (MG)
» Each process is in only one MG and has same group
membership as others in the same MG
» No other process outside that MG has the same
exact group membership
» MG of A are A, AB, AC, ABC
» MG of C are C, AC, CE, CD, BCD, BC, ABC
» Problem becomes MSMG multicast to MGs and not
Gs.
» Each MG has a leader / manager
Groups: A, B, C, D, E, F

24

24

12
8/9/2024

Algorithm for MSMG: Primary MG of a G

» For each G, pick a primary member PM(G) from the


MGs of G so that you can reach all MGs through
PM(G)
» ABC can be the PM(G) for group A
» ABC can also be the PM(G) for groups B and C,
respectively
» BCD can be the PM(G) for group D
» DE can be the PM(G) for group E
» EF can be the PM(G) for group F
Groups: A, B, C, D, E, F

25

25

Propagation Tree
» All MGs have to put in a tree structure with a property with ABC BCD DE EF
properties as follows —
1. PM(G) is ancestor of all other MG of G
» e.g. ABC is ancestor of A, B, C, AB, … because one can reach
all these MGs through ABC
2. PM(G) is unique
3. For any MG, there is a unique path to the MG from the PM of
any G of which MG is a subset - so it’s a tree
» e.g. ABC—>BCD—>CD
4. PM[G1] and PM[G2] should lie in same branch or in disjoint
tree (if memberships are disjoint)
» e.g. for G1=A and G2=F, ABC—>BCD—>DE—>EF

26

26

13
8/9/2024

Propagation Tree - algorithm outline *


» Convert a multicast(G) to multicast(PM(G)) because subtree root at PM(G) should
contain members of G
» Messages are then propagated down sub-tree rooted at PM(G)
» multicast(D) will be sent to PM(D) = BCD
» multicast(D,F) sent to BCD and then forwarded to DE->EF to reach F with single
logical timestamped message
» The leader of each PM(G) acts as sequencer to buffer messages and establish
TO or CO
» Note:
» Propagation trees are not unique and depend on the order of computation.
» Large meta-groups with many user group memberships should be ideally
towards root to keep tree height low and reduce hops.

* A modification of centralised sequencer for Total Order


* Can be used to enforce CO or TO
27

27

Classification of app level multicast algorithms (1)


» Communication history based
» Logical clock based algorithms mostly to provide causal ordering
» No group tracking - so works on open groups
» Privilege based
» A token circulates among senders with seqno
» Whoever has token (privilege) can send and update seqno in token
» Receiver orders delivery to app based on seqno
» Can implement Total Order or Causal Order
» Senders can use logical clock to do CO
» Closed group for token circulation
» Doesn’t work well for large systems because token passing doesn’t scale

28

28

14
8/9/2024

Classification of app level multicast algorithms (2)


» Moving sequencer
1. Sender sends msg to ALL sequencers (smaller set compared to senders)
2. Sequencers implement token passing
3. Token contains seqno and all msgs that have a seqno, i.e. already sent (so token
contains some sent msg history to avoid duplicate msgs)
4. When sequencer receives token
A. it assigns seqno to all received but unsequenced msgs (so sequencer has a queue
of unsent msgs)
B. sends to destinations
C. updates token with these sent msgs
D. forwards token to next sequencer
5. Destination processes deliver in order of seqno
» Algorithm guarantees Total Order because sequencers put in seqno in total order
across all senders

29

29

Classification of app level multicast algorithms (3)


» Fixed sequencer
» Simpler version of moving sequencer with centralisation
» The propagation tree and centralised total ordering algorithms are in
this class
» Destination agreement
» Destinations receive messages with some limited ordering info
» They exchange information among themselves to order the messages.
» Two classes
» 3 phase algorithm for total ordering
» Consensus protocols discussed in later session

30

30

15
8/9/2024

Topics on Jan 8th


» Total order
» Application level multicast
» Termination detection
» using distributed snapshots
» using weight throwing
» using spanning tree

31

31

Termination detection
» Need to detect that a distributed computation has ended or a sub-problem has
ended to proceed to the next step
» No process has complete knowledge of global state - so how do we know that a
computation has ended ?
» 2 distributed computations running in parallel
» Application messages for user computation
» Control messages for termination detection - these should not indefinitely
delay the application or require additional channels of communication

32

32

16
8/9/2024

System model
» Process can be active or idle state
» Active process can become idle at any time
» Idle process can become active only when it receives messages
» Only active process can send messages - so an external trigger is required to start
» A message can be received in either state
» Sending and receiving messages are atomic actions
» A distributed computation is terminated when all processes are idle and there are
no messages on any channel (stable state)

33

33

Algorithm 1: Using distributed snapshots


» Assumption Pi with logical time k goes idle
» Bi-directional channel between each process pair
» Channels can be non-FIFO but reliable, i.e. message delay is finite
» Outline
1. When process Pi goes active to idle, it sends message R(k, i) to all processes to take R(k, i)
a local snapshot including itself. k is the logical clock at Pi *.
2. On receipt of request at Pj, if it agrees that Pi became idle before itself, then it
grant
grants the request by taking a local snapshot (so we need total order *)
R(k, i)
3. A request is successful if all processes take a local snapshot
4. The requester or an external entity can collect all the local snapshots
5. A successful global snapshot can indicate if computation has terminated

Pj is already idle

* If you use Lamport clock then use process ID to break ties and derive total order

34

34

17
8/9/2024

Using distributed snapshots - Formal rules (1)

x: logical clock at a process


x’: logical clock sent in message to another process

active to idle: sent R(x,k)

active x idle x+1


i
B(x) R(x+1, i)

j
idle x+1          active
» Last process to terminate will have largest clock value. Everyone will take a snapshot for it but it will not take a snapshot for anyone.

35

35

Using distributed snapshots - Formal rules (2)

j
possibilities:
R(x’, k’) a. i is already idle: i is idle and x’ > logical clock of i
b. i is ahead of j: i is idle and x’ <= logical clock of i
c. i is active still
i
x

1. Sender has terminated


after P. So take a snapshot.

2. Sender has terminated


before P. So can’t be the last.
3. P hasn’t terminated yet. So sender can’t be the last.
» Last process to terminate will have largest clock value. Everyone will take a snapshot for it but it will not take a snapshot for anyone.

36

36

18
8/9/2024

Algorithm 2: Using weight throwing - Basic idea (1)


Controller
1. A process called controlling agent monitors the
computation. 1
2. A communication channel exists between each of the
processes and the controlling agent and also between
every pair of processes.
3. Initially, all processes are in the idle state.
0 0
4. The weight at each process is zero and the weight at the
controlling agent is 1. idle idle
5. The computation starts when the controlling agent sends
0
a basic message to one of the processes.
idle

Think of a central bank lending money (weight). Initially bank has all the money.
37

37

Algorithm 2: Using weight throwing - Basic idea (2)


» A non-zero weight (0-1) is assigned to each process in the Controller
active state and to each message in transit in the following
manner : 0.9
» A process sends part of it’s weight in a message
» On receipt of a message the process adds the weight in the M (0.1)
message to itself
» Sum of weights in system is always 1
0.1 0
» When process goes to idle it sends its weight to controlling
agent to add to the latter’s weight P1: active P3: idle
» Termination is detected when controlling agent weight
0
comes back to 1
P2: idle
Everyone gets money to spend (send messages) and after all exchanges money goes back to the bank

38

38

19
8/9/2024

Weight throwing flow example

Controller Controller Controller

0.9 0.9 0.95

0.05
P1: active

0.1 0 0.05 0 0 0

P1: active P3: idle M (0.05) P3: idle P1: idle P3: idle
0 0.05 0.05

P2: idle P2: active P2: active

39

39

Using weight throwing - Rules


» Rule 1: The controlling agent or an active process may send a basic message to one of the
processes, say P, by splitting its weight W into W1 and W2 such that W1+W2=W, W1>0
and W2>0. It then assigns its weight W:=W1 and sends a basic message B(DW:=W2) to P.
» Rule 2: On the receipt of the message B(DW), process P adds DW to its weight W
(W:=W+DW). If the receiving process is in the idle state, it becomes active.
» Rule 3: A process switches from the active state to the idle state at any time by sending a
control message C(DW:=W) to the controlling agent and making its weight W:=0.
» Rule 4: On the receipt of a message C(DW), the controlling agent adds DW to its weight
(W:=W+DW). If W=1, then it concludes that the computation has terminated.

B(DW): Basic message as part of application computation with weight DW


C(DW): Control message from process to controlling agent with weight DW
40

40
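The four rules are easy to express if weights are exact rationals; Fraction avoids weights that never sum back to exactly 1 the way floats can. A minimal sketch, with direct method calls standing in for basic and control messages.

```python
from fractions import Fraction

class Controller:
    def __init__(self):
        self.w = Fraction(1)

    def start(self, proc):                  # Rule 1, applied by the controlling agent
        dw = self.w / 2
        self.w -= dw
        proc.recv_basic(dw)

    def recv_control(self, dw):             # Rule 4: termination when W == 1 again
        self.w += dw
        if self.w == 1:
            print("computation has terminated")

class Process:
    def __init__(self, ctrl):
        self.ctrl = ctrl
        self.w = Fraction(0)
        self.active = False

    def send_basic(self, other):            # Rule 1: split weight into the message
        dw = self.w / 2
        self.w -= dw
        other.recv_basic(dw)

    def recv_basic(self, dw):               # Rule 2: absorb weight, become active
        self.w += dw
        self.active = True

    def go_idle(self):                      # Rule 3: hand all weight back
        self.ctrl.recv_control(self.w)
        self.w = Fraction(0)
        self.active = False

# e.g. c = Controller(); p = Process(c); c.start(p); p.go_idle()  -> terminated
```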

20
8/9/2024

Algorithm 3: Using spanning tree - definitions


» Processes are connected in an undirected graph with a spanning tree P0
rooted at process P0.
Repeat Repeat
» Tokens: Contracting wave of signals from leaves to the root
» Repeat: If a token wave fails to detect termination, then P0 sends
Repeat signal to leaves
» A node which has one or more tokens at any instant form set S
» Node j is outside of S if j is not in S but an element in path from root to
j is in S. Every path from root to leaf may not contain a node in S.
» All nodes outside S are idle. Why ?
» Any node that terminates sends token to parent, thus leaving S.
Token Token Token

41

41

A simple version for spanning tree

» Each leaf is given a token


» After termination (i.e. becomes idle), a leaf
T
sends token to its parent
idle
» A parent, after it receives token from all its
children and after it has itself terminated, sends T

token to its parent.


» A root detects termination when it has token from
idle
each child and itself terminates.

42

42

21
8/9/2024

A simple version - so what’s the problem


» A node sends a token after it has become idle.
» But after sending token, it receives a message turning it
active again.
» e.g. it can receive a message from a process in an
independent sub-tree.
» So a root cannot conclude after receiving tokens from each
child that computation has indeed terminated in the graph.
» So root may have to re-invoke the algorithm
» But when should it re-invoke ?

43

43

A correct version - by colouring the tokens


1. Each leaf has a white token and set S is used to track who has tokens
2. When a leaf terminates, it sends a token it held to parent
3. Parent collects tokens from all children and sends token up the tree once it has terminated
4. A process turns black when it sends a message.
5. If a black process terminates, it sends a black token, if it is holding the token from all its children.
• A parent that has received a black token from any child only sends a black token, indicating some
activity in its subtree.
6. A black process turns white once it sends a black token.
7. If only white tokens reach the root from all children, it is white itself, and it is idle then it detects
termination
8. Any black token sent to root makes it send repeat signal.

44

44
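A sketch of the token-colouring rules, assuming each node object knows its parent and its number of children; message passing is replaced by direct method calls and the class name is made up.

```python
class TreeNode:
    def __init__(self, parent, n_children):
        self.parent = parent          # None at the root
        self.n_children = n_children  # 0 for a leaf
        self.child_tokens = []        # colours collected from children
        self.black = False            # set when this node sends a basic message
        self.idle = False

    def on_basic_send(self):
        self.black = True

    def on_child_token(self, colour):
        self.child_tokens.append(colour)
        self.try_forward()

    def on_terminate(self):
        self.idle = True
        self.try_forward()

    def try_forward(self):
        if not self.idle or len(self.child_tokens) < self.n_children:
            return
        colour = "black" if (self.black or "black" in self.child_tokens) else "white"
        self.black = False            # a black process turns white after sending
        self.child_tokens = []
        if self.parent is None:       # root decides: terminated or repeat wave
            print("terminated" if colour == "white" else "send repeat signal")
        else:
            self.parent.on_child_token(colour)
```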

22
8/9/2024

Example

0 0 T1 0

1 2 T3 1 2 1 2
T4

3 4 5 6 3 4 5 6 3 4 5 6

T3 T4 T5 T6 T5 T6 T5 T6

Idle process

45

45

Example …
T1 T1 T1 T2
0 0 0

T6
1 2 1 T5 2 1 2
m

3 4 5 6 3 4 5 6 3 4 5 6

T5 T6

» P0 has to send repeat because T2 is black

46

46

23
8/9/2024

Propagation Tree - observations (1)


» MG1 subsumes MG2 if
» for each group G (such that a member of MG2 is a member of G) we have some member of MG1 also a member of G. MG1 is basically a subset of each G of which MG2 is a subset.
» e.g. [AB] subsumes A because any member of
MG2=[A] is a member of G=A and each
member of MG1=[AB] is also a member of A.
» Similarly, [AB] also subsumes B.

47

47

Propagation Tree - observations (2)


» MG1 is joint with MG2 if they don’t subsume
each other and there is some G such the
MG1 and MG2 are subset of G
» e.g. [ABC] is joint with [CD] because both
are subset of C

48

48

24
8/9/2024

SS ZG 526: Distributed Computing


Session 7: Distributed Mutual Exclusion

Dr. Anindya Neogi


Associate Professor
[email protected]

Reference: digital content slides from Prof. C. Hota, BITS Pilani

Topics for today


» Why is “distributed” mutual exclusion hard ?
» Assertion based
» Lamport’s DME algorithm
» Ricart-Agrawala’s algorithm
» Maekawa’s algorithm
» Token based
» Raymond’s tree based algorithm
» Suzuki-Kasami’s algorithm

1
8/9/2024

Distributed Mutual Exclusion

What is mutual exclusion?

1. Simultaneous update and read of a directory/file


2. Can two processes send their data to a printer?

So, it is exclusive access to a shared resource or to the critical region.

SS ZG 526: Distributed Computing 3

Approaches in multi-tasking systems

Manage concurrent access to shared resource using shared memory / variables

1. semaphores - signalling mechanism, limit users of a resource


2. mutex - locking mechanism, single user access of a resource

In distributed systems, there is no shared memory


Need to rely on message passing for concurrent access control
What options exist ?

SS ZG 526: Distributed Computing 4

2
8/9/2024

Approaches in Distributed Systems


» Assertion based
» Set of conditions are fulfilled by a process to access
shared resource
» Algorithms vary based on assertions
» Token based
» Process needs to have token to access shared resource
» Algorithms vary based on how token is circulated

SS ZG 526: Distributed Computing 5

Requirements of DME algorithms


» Safety
» At any point of time only one process can access the critical section
» Liveness
» 2 or more processes should not indefinitely wait for messages that will never
arrive (i.e. no deadlock)
» A process should not wait indefinitely to execute CS while other processes are
repeatedly executing CS. So every process should get access in finite time.
» Fairness
» Processes get a chance to execute CS in order of the request arrival by logical
clock

Safety is a must-have while Liveness and Fairness are important

SS ZG 526: Distributed Computing 6

3
8/9/2024

Performance of DME Algorithms

» Number of messages required for an entry into


CS
» Synchronization delay
» Measured from one process leaving CS and
another, waiting in queue, getting access to CS
» Response time
» Measured as time taken to allow a process to
enter in CS (after request) and the process
finally exiting the CS. So includes wait time and
execution time.
» System throughput = 1/(synchronisation delay +
execution time)

SS ZG 526: Distributed Computing 7

A centralised algorithm
» A Central controller with a FIFO queue for deferring
P3
replies.
3
» Request, Reply, and Release messages. 2
» Reliability and Performance bottleneck. 1 1: Request
» Controller is SPOF
» Single bottleneck for queue management …P2 P1 C
2: Reply
P1
3: Release
1 2 3

P2

SS ZG 526: Distributed Computing 8

4
8/9/2024

Topics for today


» Why is “distributed” mutual exclusion hard ?
» Assertion based
» Lamport’s DME algorithm
» Ricart-Agrawala’s algorithm
» Maekawa’s algorithm
» Token based
» Raymond’s tree based algorithm
» Suzuki-Kasami’s algorithm

Lamport’s Distributed Mutual Exclusion - Request / Replies

Requesting the critical section.

1. When a site Si wants to enter the CS, it sends a REQUEST(T=tsi, i) message to all the sites in its request set Ri
and places the request on request_queue_i.
2. When a site Sj receives the REQUEST(tsi , i) message from site Si, it returns a timestamped REPLY message to Si
and places site Si’s request on request_queue_j.

R: Request(tsi, i)

time = tsi time = tsj


Critical R R
Section Si Sj
Request_queue_i Request_queue_j

Reply(tsj, j)

Request queues are maintained based on timestamp

SS ZG 526: Distributed Computing 10

10

5
8/9/2024

Lamport’s Distributed Mutual Exclusion - Enter

Executing the critical section.

Site Si enters the CS when the two following conditions hold:


1. Si has received a message with timestamp larger than (tsi, i) from all other sites. Use Total Order with
site id to make sure there is no tsi = tsj.
2. Si’s request is at the top of request_queue_i.

R: Request(tsi, i)

time = tsi time = tsj


Critical R R
Section Si Sj
Request_queue_i Request_queue_j

tsj > tsi for all j R’: Reply(tsj, j)

SS ZG 526: Distributed Computing 11

11

Lamport’s Distributed Mutual Exclusion - Release


Releasing critical section

1. Site Si, upon exiting the CS, removes its request from the top of its request queue and sends a timestamped RELEASE
message to all the sites in its request set.
2. When a site Sj receives a RELEASE message from site Si, it removes Si’s request from its request queue.
3. When a site removes a request from its request queue, its own request may become the top of the queue, enabling it
to enter the CS.
4. The algorithm executes CS requests in the increasing order of timestamps.

If this is own request,


then becomes eligible
R: Release(tsi, i)

time = tsi
Critical
Section Si Sj
Request_queue_i Request_queue_j

SS ZG 526: Distributed Computing 12

12
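A per-site sketch of the three message handlers, assuming reliable FIFO channels; `broadcast`, `send`, and `enter_cs` are stand-ins for the real transport and application hook, and each site is assumed to have at most one outstanding request.

```python
import heapq

def broadcast(peers, msg): print("broadcast", msg)     # transport stand-ins
def send(dest, msg): print("->", dest, msg)
def enter_cs(): print("entering CS")

class LamportSite:
    def __init__(self, my_id, peers):
        self.id, self.peers = my_id, peers
        self.clock = 0
        self.queue = []                           # heap of (ts, site_id) requests
        self.my_req = None
        self.last_msg_ts = {p: 0 for p in peers}  # latest timestamp seen from each peer

    def request_cs(self):
        self.clock += 1
        self.my_req = (self.clock, self.id)
        heapq.heappush(self.queue, self.my_req)
        broadcast(self.peers, ("REQUEST", self.my_req))

    def on_request(self, ts, sender):
        self.clock = max(self.clock, ts) + 1
        heapq.heappush(self.queue, (ts, sender))
        self.last_msg_ts[sender] = max(self.last_msg_ts[sender], ts)
        send(sender, ("REPLY", self.clock, self.id))

    def on_reply(self, ts, sender):
        self.last_msg_ts[sender] = max(self.last_msg_ts[sender], ts)
        self._try_enter()

    def _try_enter(self):
        # L1: later-stamped message from every other site, L2: own request at the head
        if self.my_req and self.queue and self.queue[0] == self.my_req and \
           all(ts > self.my_req[0] for ts in self.last_msg_ts.values()):
            enter_cs()

    def release_cs(self):
        heapq.heappop(self.queue)                 # own request is at the top
        self.my_req = None
        broadcast(self.peers, ("RELEASE", self.clock, self.id))

    def on_release(self, ts, sender):
        self.queue = [r for r in self.queue if r[1] != sender]
        heapq.heapify(self.queue)
        self.last_msg_ts[sender] = max(self.last_msg_ts[sender], ts)
        self._try_enter()
```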

6
8/9/2024

Example 1
(2, P1)
P1 in CS (2, P1)
1 2
P1

(2, P1)

P2
(2, P1) (2, P1)

request (2, P1)

reply P3
release (2, P1) (2, P1)

queue entry at process

SS ZG 526: Distributed Computing 13

13

Example 2: First request

(1,P3)

P1

(1,P3)
P2

(1, P3)

P3
1 (1,P3)
request
reply
release

SS ZG 526: Distributed Computing 14

14

7
8/9/2024

Example 2: Second request and first entry

(2, P1)
(1,P3),(2, P1) insert in queue based on timestamp
1 2
P1
(2, P1)

(1,P3)
P2
(1,P3),(2, P1)

(1, P3)

P3
1 (1,P3) (1,P3),(2, P1) P3 in CS
request
reply
release

SS ZG 526: Distributed Computing 15

15

Example 2: First release and second entry

(2, P1)
(1,P3),(2, P1) P1 in CS
1 2
P1
(2, P1) (2, P1)

(1,P3) (2, P1)


P2
(1,P3),(2, P1)

(1, P3)

P3
1 (1,P3) (1,P3),(2, P1) P3 in CS (2, P1)
request
reply
release

SS ZG 526: Distributed Computing 16

16

8
8/9/2024

Correctness
• Suppose that both Si and Sj were in CS at the same time (t).
• Then each must have its own request at the top of its queue with later-stamped messages from everyone else, which requires both (tsi, i) < (tsj, j) and (tsj, j) < (tsi, i), an impossible situation.

Total number of messages


- N-1 Request
- N-1 Reply
- N-1 Release
Total : 3(N-1)

SS ZG 526: Distributed Computing 17

17

Topics for today


» Why is “distributed” mutual exclusion hard ?
» Assertion based
» Lamport’s DME algorithm
» Ricart-Agrawala’s algorithm
» Maekawa’s algorithm
» Token based
» Raymond’s tree based algorithm
» Suzuki-Kasami’s algorithm

18

18

9
8/9/2024

Ricart-Agrawala DME algorithm

Optimization: merges release and reply messages to make it total 2(N-1) messages - a site
replies only if certain conditions hold
Requesting Site:
– A requesting site Pi sends a message request(ts,i) to all sites
– It enters CS if it receives reply from all sites
– No release message is needed
Receiving Site:
– Upon reception of a request(ts,i) message, the receiving site Pj will immediately send a
timestamped reply(ts,j) message if and only if:
• Pj is not requesting or executing the critical section OR
• Pj is requesting the critical section but sent a request with a higher timestamp than
the timestamp of Pi
– Otherwise, Pj will defer the reply message.

SS ZG 526: Distributed Computing 19

19
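A sketch of the reply/defer decision, assuming (timestamp, site id) pairs for total ordering; `broadcast`, `send`, and `enter_cs` are stand-ins, as in the Lamport sketch.

```python
def broadcast(peers, msg): print("broadcast", msg)      # transport stand-ins
def send(dest, msg): print("->", dest, msg)
def enter_cs(): print("entering CS")

class RASite:
    def __init__(self, my_id, peers):
        self.id, self.peers = my_id, peers
        self.clock = 0
        self.my_req = None                 # None when neither requesting nor executing
        self.deferred = []                 # sites whose reply we owe after the CS
        self.replies = set()

    def request_cs(self):
        self.clock += 1
        self.my_req = (self.clock, self.id)
        self.replies.clear()
        broadcast(self.peers, ("REQUEST", self.my_req))

    def on_request(self, req, sender):
        self.clock = max(self.clock, req[0]) + 1
        if self.my_req is not None and self.my_req < req:
            self.deferred.append(sender)   # our own request has priority: defer
        else:
            send(sender, ("REPLY", self.id))

    def on_reply(self, sender):
        self.replies.add(sender)
        if self.replies == set(self.peers):
            enter_cs()

    def release_cs(self):
        self.my_req = None
        for s in self.deferred:            # the deferred replies double as releases
            send(s, ("REPLY", self.id))
        self.deferred.clear()
```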

Example 1

P1 enters CS
1 2
P1

(2, P1)

P2

(2, P1)

P3

SS ZG 526: Distributed Computing 20

20

10
8/9/2024

Example 2: Requests

1 2
P1

(2, P1) (1, P3)

P2

(1, P3) (2, P1)

P3
1

SS ZG 526: Distributed Computing 21

21

Example 2: One enters CS


request has earlier ts (1< 2),
so P1 replies to P3
1 2
P1

(2, P1) (1, P3)

P2

(1, P3) (2, P1)

P3
1 queue P1 P3 in CS

Request has later ts (2 > 1), so P3


doesn’t reply yet but queues request

SS ZG 526: Distributed Computing 22

22

11
8/9/2024

Example 2: Second enters CS

2 P1 enters CS
1
P1

(2, P1) (1, P3)

P2

(1, P3) (2, P1)


dequeue P1 and reply

P3
1 queue P1 P3 in CS

No release messages needed: 2(N-1) messages to enter CS

SS ZG 526: Distributed Computing 23

23

Topics for today


» Why is “distributed” mutual exclusion hard ?
» Assertion based
» Lamport’s DME algorithm
» Ricart-Agrawala’s algorithm
» Maekawa’s algorithm (also quorum based)
» Token based
» Raymond’s tree based algorithm
» Suzuki-Kasami’s algorithm

24

24

12
8/9/2024

Maekawa’s DME algorithm - Basic idea


» Groups are formed among sites called Request Sets.

» Each site has to request approval from the members of the Request Set attached to the site, not all N-1 other sites.          Ri (with 4 sites)
» So it is approval of a quorum or subset of sites in a Request Set attached to the site
Si
» Far fewer messages are exchanged

» Request set of sites Si & Sj are Ri & Rj such that Ri and Rj will have at-least one common site Sk. Sk
» Sk mediates conflicts between Ri and Rj. How ?

» A site can send only one REPLY message at a time even if it gets 2 requests from 2 different request Sj
sets that it belongs to
» A site can send a REPLY message only after receiving a RELEASE message for the previous REPLY
message. Rj (with 4 sites)

SS ZG 526: Distributed Computing 25

25

Creating Request Set


M1: Any pair of request sets has at least one common site

M2: Every site belongs to some request set

M3: All request sets have the same size, i.e. K

M4: Every site is contained in exactly K request sets

N = K(K-1) + 1, so K is roughly sqrt(N)
• M1 and M2 are necessary for correctness; #msgs = 3K
• M3 and M4 provide other desirable features to the algorithm.
• M3 implies that all sites have to do equal amount of work to invoke mutual exclusion.
• M4 enforces that exactly the same number of sites should request permission from any site implying
that all sites have “equal responsibility” in granting permission to other sites.

SS ZG 526: Distributed Computing 26

26

13
8/9/2024

Maekawa’s request set construction examples


S1, S2, S3 S1, S2, S3, S4, S5, S6, S7 R1 = { 1, 2, 3, 4 }
N=7 R2 = { 2, 5, 8, 11 } N = 13
N=3 R3 = { 3, 6, 8, 13 }
N = 3(3-1) + 1 R4 = { 4, 6, 10, 11 }
N = 4(4-1) + 1
N = 2(2-1) + 1 K=3 R5 = { 1, 5, 6, 7 } K=4
K=2 R1 = {S1, S2, S3} R6 = { 2, 6, 9, 12 }
R7 = { 2, 7, 10, 13 }
R1 = {S1, S2} R4 = {S1, S4, S5}
R8 = { 1, 8, 9, 10 }
R2 = {S2, S3} R6 = {S1, S6, S7} R9 = { 3, 7, 9, 11 }
R2 = {S2, S4, S6} R10 = { 3, 5, 10, 12 }
R3 = {S3, S1} R5 = {S2, S5, S7} R11 = { 1, 11, 12, 13 }
R12 = { 4, 7, 8, 12 }
R7 = {S3, S4, S7}
R13 = { 4, 5, 9, 13 }
R3 = {S3, S5, S6}

Think of a way to generate the request sets R programmatically from the set of sites S (a simpler grid-based sketch follows below)

SS ZG 526: Distributed Computing 27

27
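One easy way to build pairwise-intersecting request sets programmatically (as the slide suggests) is a grid quorum: place the N sites in a sqrt(N) x sqrt(N) grid and let Ri be site i's row plus its column. This is not the projective-plane construction behind the N = K(K-1) + 1 sets shown above; it gives sets of size about 2*sqrt(N) - 1 instead of K, but it satisfies the intersection property in a few lines. The sketch assumes N is a perfect square.

```python
import math

def grid_request_sets(n):
    """Request sets Ri = row(i) | column(i) on a sqrt(n) x sqrt(n) grid of sites."""
    k = math.isqrt(n)
    assert k * k == n, "sketch assumes N is a perfect square"
    sets = {}
    for s in range(n):
        row, col = divmod(s, k)
        row_sites = {row * k + c for c in range(k)}
        col_sites = {r * k + col for r in range(k)}
        sets[s] = row_sites | col_sites
    return sets

# e.g. grid_request_sets(9)[0] == {0, 1, 2, 3, 6}; any Ri and Rj share a row or column site.
```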

Maekawa’s DME Algo - Request


Requesting the critical section

1. A site Si requests access to the CS by sending REQUEST(i)


messages to all the sites in its request set Ri. Si
2. When a site Sj receives the REQUEST(i) message,
» it sends a REPLY(j) message to Si provided it hasn’t sent a Sj
REPLY message to a site from the time it received the last
RELEASE message.
» So site has to remember last RELEASE and REPLY
» It will remove this state once it receives RELEASE from
that site
» Otherwise, it queues up the REQUEST for later consideration. depending on arrival time at Sj
any of these requests can be
queued

SS ZG 526: Distributed Computing 28

28

14
8/9/2024

Maekawa’s DME Algo - Access

Executing the critical section


1. Site Si accesses the CS only after receiving REPLY messages from all
the sites in Ri .

Si

Sj

SS ZG 526: Distributed Computing 29

29

Maekawa’s DME Algo - Release

Releasing the critical section


1. After the execution of the CS, site Si sends RELEASE(i)
message to all the sites in Ri .
2. When a site Sj receives a RELEASE(i) message from site
Si, it sends a REPLY message to the next site waiting in the Si
queue and deletes that entry from the queue.
3. If the queue is empty, then the site updates its state to
Sj
reflect that the site has not sent out any REPLY message (so
it can respond to next REQUEST).

SS ZG 526: Distributed Computing 30

30

15
8/9/2024

Example
S2 accesses CS
R1 {S1, S2, S3} S1
R2 {S2, S4, S6} queue S5 dequeue S5 and reply
R3 {S3, S5, S6} S2
R4 {S1, S4, S5}
R5 {S2, S5, S7}
R6 {S1, S6, S7} S3
R7 {S3, S4, S7}
S4
Si—>Ri

S5
S5 accesses CS
S6
request
reply
S7
release
S5 can send to R5 or R4 or R3 … here it sends to R5
SS ZG 526: Distributed Computing 31

31

Maekawa’s algorithm can lead to deadlocks


R1 {S1, S2, S3} R6
R2 {S2, S4, S6}
R3 {S3, S5, S6}
R2 S6
R4 {S1, S4, S5} S6
R5 {S2, S5, S7} Request
R6 {S1, S6, S7} Reply
R7 {S3, S4, S7}
S2 S7
S5, S2, S6 request for CS
Creates a cycle
S2 S6 cannot reply to S2
S7 cannot reply to S6
S2 cannot reply to S5
—> Deadlock
S5
R5

SS ZG 526: Distributed Computing 32

32

16
8/9/2024

Deadlock handling - 3 more control messages


FAILED (3) ts=7 (1) ts=4
• A FAILED message from site Si to site Sj indicates that Si cannot grant Sj’s request
because it has currently granted permission to a site with a higher priority request. So
that Sj will not think that it is just waiting for the message to arrive. Sj Si Sk
• Use Lamport Clock to timestamp requests. Lower clock has higher priority.
(4) FAILED (2) reply
INQUIRE
An INQUIRE message from Si to Sj indicates that Si would like to find out from Sj if it has (3) ts=2 (1) ts=4
succeeded in locking all the sites in its request set, i.e. has it detected any cycle that it is
part of.
Sj Si Sk
(2) reply
YIELD
If Sk detects a cycle in response to INQUIRE, it will send an YIELD.
A YIELD message from site Si to Sj indicates that Si is returning the permission to Sj (to yield
to a higher priority request at Sj).
(4) INQUIRE
ts=2 ts=4

Sj Si Sk

Number of messages with deadlock = 5 sqrt(N) (6) reply


No deadlock = 3 sqrt(N) (5) YIELD
SS ZG 526: Distributed Computing 33

33

Number of messages
» Lamport
» 3 x (N-1)
» Ricart-Agarwala
» 2 x (N-1)
» Maekawa without deadlock
» 3 x sqrt(N)
» Maekawa with deadlock possibility
» 5 x sqrt(N)

34

34

17
8/9/2024

Topics for today


» Why is “distributed” mutual exclusion hard ?
» Assertion based
» Lamport’s DME algorithm
» Ricart-Agrawala’s algorithm
» Maekawa’s algorithm
» Token based
» Raymond’s tree based algorithm
» Suzuki-Kasami’s algorithm

35

35

Raymond’s Tree-based Algorithm

» Not a broadcast based algorithm


» Uses a tree to look for token in the tree. The token is
with a node that may be multiple hops away from the
requester.
A node / site » Token is with current root node and root node changes
Points to the direction when the token is forwarded
where the token is (holder) » 2 data structures
» holder: direction (next hop) where the token can be
found in the tree from the node.
» queue: contains requests from different nodes
» from itself or
Queue containing requests » from others as they look for token and this node
sent by this node and others
is on the path of that search
that are passing by looking
for the token

SS ZG 526: Distributed Computing 36

36

18
8/9/2024

The Algorithm - Requesting CS (1)


Root node with Token
C
CASE A : If a site needs access to CS,
1. sends a REQUEST message to the node along the directed path to the REQUEST
root (that’s the direction of where the token is), provided :
A. it does not hold the token and
B. its request_q is empty
B queue:A

2. Adds request to own request_q


REQUEST
CASE B : When a site on the path receives a REQUEST message
1. It places the REQUEST in its request_q
2. Sends a REQUEST message along the directed path to the root A
provided it has not already sent out a REQUEST message on its
outgoing edge queue:A

SS ZG 526: Distributed Computing 37

37

The Algorithm - Requesting CS (2)


queue:A

CASE C : Root node gets a REQUEST message,


C B
1. Sends the token to the site from which it received the REQUEST
message
2. Sets its holder variable to point at that site A queue:A

CASE D : When the site receives the token,


1. Deletes top entry from request_q
2. Sends token to site indicated in the entry
3. Sets holder to point to the site
C B
4. If request_q is non-empty then send REQUEST message to the site
pointed to by holder variable (e.g. if B had a request pending in
queue, it would have sent a REQUEST msg to A) A queue:A

SS ZG 526: Distributed Computing 38

38

19
8/9/2024

The Algorithm - Executing CS

C B
A site enters the CS when

(A) It receives the token and A queue:A

(B) Its own entry is at the top of its request_q.

In this case, the site deletes the top entry from it’s
request_q and enters the CS. C B

access CS A

SS ZG 526: Distributed Computing 39

39

The Algorithm - Releasing CS


queue:C
queue:C

C B
Once CS is done at a site and request_q is non_empty then the site
does the following
releases CS A queue:B
1. Deletes the top entry from its request_q
2. Sends the token to that top site
3. Sets its holder variable to point at that site
queue:C queue:C
4. Sends a REQUEST message to the site which is pointed at by the
holder variable so that it can get back the token at a later point C
for a pending request in request_q (E.g. if A had a non-empty B
queue in this case).

A
SS ZG 526: Distributed Computing 40

40
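The four cases reduce to one "make progress" routine run whenever the queue, the token, or the CS state changes. A compact sketch; `send` and `enter_cs` are stand-ins, and `holder` equal to the node's own id means the token is held locally.

```python
from collections import deque

def send(dest, msg): print("->", dest, msg)   # transport stand-in
def enter_cs(): print("entering CS")          # application stand-in

class RaymondNode:
    def __init__(self, my_id, holder):
        self.id = my_id
        self.holder = holder      # neighbour towards the token; self.id if we hold it
        self.rq = deque()         # pending requests: own id or neighbours' ids
        self.asked = False        # REQUEST already sent towards the holder?
        self.in_cs = False

    def request_cs(self):         # CASE A
        self.rq.append(self.id)
        self._make_progress()

    def on_request(self, sender):  # CASE B: a request passing by towards the root
        self.rq.append(sender)
        self._make_progress()

    def on_token(self):            # token arrived from the old holder
        self.holder = self.id
        self.asked = False
        self._make_progress()

    def release_cs(self):
        self.in_cs = False
        self._make_progress()

    def _make_progress(self):
        if not self.rq or self.in_cs:
            return
        if self.holder == self.id:
            nxt = self.rq.popleft()
            if nxt == self.id:                   # own entry at the head: enter CS
                self.in_cs = True
                enter_cs()
            else:                                # CASE C/D: hand the token over
                self.holder = nxt
                send(nxt, ("TOKEN",))
                if self.rq and not self.asked:   # still need it back later
                    send(nxt, ("REQUEST", self.id))
                    self.asked = True
        elif not self.asked:                     # chase the token towards the holder
            send(self.holder, ("REQUEST", self.id))
            self.asked = True
```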

20
8/9/2024

Example 1 - Single requester

n1 idle token holder

Critical
Section
n2 n3

REQUEST

n4 n5 n6 n7 holder pointer

n6
root - has token

SS ZG 526: Distributed Computing 41

41

Example 1
n2 idle token holder
n1
REQUEST Critical
Section
n6 n2 n3

n4 n5 n6 n7 holder pointer

n6
root - has token

SS ZG 526: Distributed Computing 42

42

21
8/9/2024

Example 1
n2 idle token holder
n1
Token
Critical
Section
n6 n2 n3

n4 n5 n6 n7 holder pointer

n6
root - has token

SS ZG 526: Distributed Computing 43

43

Example 1

holder changed n1

Critical
Section
n6 n2 n3

n4 n5 n6 n7 holder pointer

n6
root - has token

SS ZG 526: Distributed Computing 44

44

22
8/9/2024

Example 1

n1

Critical
Section
n2 n3

token sent to n6
holder changed

n4 n5 n6 n7 holder pointer

root - has token

SS ZG 526: Distributed Computing 45

45

Example 2 - Multiple requesters: n5 and then n4

n1 idle token holder


n1

n5 n4 n2 n3 n5 n4 n2 n3

REQUEST REQUEST

n4 n5 n6
n4 n5 n6
n4 n5
n4 n5

SS ZG 526: Distributed Computing 46

46

23
8/9/2024

Example 2 - Multiple requesters


Critical
Section
n1 n1

n4 n2 n4
n3 n2 n3

REQUEST

n4 n5 n6 n4 n5 n6

n4 n5 n4
n2 needs to get the token on behalf of n4
which is top item pending in queue

SS ZG 526: Distributed Computing 47

47

Example 2 - Multiple requesters

Critical
Critical Section
Section n1
n1

n4
n4 n2 n3
n2 n3

n4 n5 n6
n4 n5 n6
n4 n5 will send token and
n4 n2 n2 is queued now at n5 point to n2
SS ZG 526: Distributed Computing 48

48

24
8/9/2024

Example 2 - Multiple requesters

Critical
n1 n1 Section

n2 n3 n2 n3

n4 n5 n6 n4 n5 n6

n4

SS ZG 526: Distributed Computing 49

49

Analysis

Proof of Correctness
Mutex is trivial.
Finite waiting: all the requests in the system form a FIFO queue and the
token is passed in that order.

Performance
O(logN) messages per CS invocation, i.e. average distance between two
nodes in a tree, so quite efficient in terms of messages

SS ZG 526: Distributed Computing 50

50

25
8/9/2024

Topics for today


» Why is “distributed” mutual exclusion hard ?
» Assertion based
» Lamport’s DME algorithm
» Ricart-Agrawala’s algorithm
» Maekawa’s algorithm
» Token based
» Raymond’s tree based algorithm
» Suzuki-Kasami’s algorithm

51

51

Suzuki-Kasami (SK) broadcast-based DME (1)


RN at site Si
Data structures involved
» Each site has a Request Array (RN)
» Each site sending a broadcast request has a seqno.
» RN stores the seqnos.
largest seqno seen by Si from Sj
» RN[j] is largest requesting seqno so far from Sj seen by Si
» Contains outdated requests
» One token circulates between sites. Token contains:
» Queue: contains site IDs of where to send token          LN array in token
» LN array: compared with RN to tell outstanding requests from already-served ones
» LN[j] is seqno of the request that site Sj executed most
recently j

seqno of last executed request from Sj

SS ZG 526: Distributed Computing 52

52

26
8/9/2024

SK DME algorithm (2)

1. RN[i] = RN[i] + 1

Si 2. REQUEST(i, RN[i])
Requesting the critical section

Request send
1. If the requesting site Si does not have the token, then it increments it’s
sequence number, RNi [i], and sends a REQUEST(i, sn) message to all Token
other sites. (sn is the updated value of RNi [i].)
Request receive REQUEST(i, sn)
1. When a site Sj receives this message, it sets RNj [i] to max(RNj [i], sn). Sj
2. If Sj has the idle token and if RNj [i] = LN[i] + 1
• send the token to Si 1. RN[i] = max(RN[i], sn)
(check for outdated message) 2. if Sj has token and
RN[i] = LN[i] + 1
send token to Si

SS ZG 526: Distributed Computing 53

53

SK DME algorithm (3)


Executing the critical section.
Site Si executes the CS when it has received the token.

54

54

27
8/9/2024

SK DME algorithm (4)

Releasing the critical section.

Having finished the execution of the CS, site Si takes the following actions:
1.It sets LN[i] element of the token array equal to RNi [i] (last executed request is
recorded).
2.For every site Sj whose ID is not in the token queue, it appends its ID to the token
queue if RNi [j] = LN[j] + 1 (eligible sites).
3.If token queue is nonempty after the above update, then it deletes the top site ID
from the queue and sends the token to the site indicated by the ID.

SS ZG 526: Distributed Computing 55

55
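A sketch of the bookkeeping, with the token represented as a (queue, LN) pair that travels inside TOKEN messages; `broadcast` and `send` are stand-ins for the transport.

```python
from collections import deque

def broadcast(msg): print("broadcast", msg)      # transport stand-ins
def send(dest, msg): print("->", dest, msg)

class SKSite:
    def __init__(self, my_id, n, has_token=False):
        self.id, self.n = my_id, n
        self.RN = [0] * n
        self.token = (deque(), [0] * n) if has_token else None   # (queue, LN)
        self.in_cs = False

    def request_cs(self):
        if self.token is not None:
            self.in_cs = True                    # idle token already here: 0 messages
            return
        self.RN[self.id] += 1
        broadcast(("REQUEST", self.id, self.RN[self.id]))

    def on_request(self, j, sn):
        self.RN[j] = max(self.RN[j], sn)
        if self.token is not None and not self.in_cs:
            q, LN = self.token
            if self.RN[j] == LN[j] + 1:          # not an outdated request
                tok, self.token = self.token, None
                send(j, ("TOKEN", tok))

    def on_token(self, tok):
        self.token = tok
        self.in_cs = True

    def release_cs(self):
        self.in_cs = False
        q, LN = self.token
        LN[self.id] = self.RN[self.id]           # record the request just served
        for j in range(self.n):
            if j != self.id and j not in q and self.RN[j] == LN[j] + 1:
                q.append(j)                      # every eligible site joins the queue
        if q:
            nxt = q.popleft()
            tok, self.token = self.token, None
            send(nxt, ("TOKEN", tok))
```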

Example: S4 requests CS

(S4, 1)

RN=[0 0 0 0]
S4
S1 RN=[0 0 0 1]

RN=[0 0 0 0]
(S4, 1)
(S4, 1)

Initial Idle token at S3


S2 S3
Q=[]
RN=[0 0 0 0] Critical RN=[0 0 0 0] LN=[0 0 0 0]
Section

SS ZG 526: Distributed Computing 56

56

28
8/9/2024

Example: S4 request processed by S3

(S4, 1)

RN=[0 0 0 0]
S4
S1 RN=[0 0 0 1]

RN=[0 0 0 0]
(S4, 1)

S2 S3
Q=[S4]
RN=[0 0 0 0] Critical RN=[0 0 0 0] LN=[0 0 0 0]
Section RN=[0 0 0 1]
LN[4] + 1 = RN[4] <== so not outdated
Add S4 in queue and send token to S4
SS ZG 526: Distributed Computing 57

57

Example: S4 gets token and gets CS access

Token with S4
RN=[0 0 0 1] Q=[]
S4
S1 LN=[0 0 0 0]

RN=[0 0 0 1]

S1 and S2 also update RN arrays


based on S4’s request

S2 S3

RN=[0 0 0 1] Critical RN=[0 0 0 1]


Section

SS ZG 526: Distributed Computing 58

58

29
8/9/2024

Example: S2 requests CS access while S4 in CS

Token
RN=[0 0 0 1] Q=[]
S4
RN=[0 0 0 1] S1 LN=[0 0 0 0]

(S2, 1) (S2, 1)

(S2, 1)
RN=[0 0 0 1] S2 S3
RN=[0 1 0 1]
Critical RN=[0 0 0 1]
Section

SS ZG 526: Distributed Computing 59

59

Example: RN arrays updated recording S2 request

Token
RN=[0 0 0 1] Q=[]
S4
RN=[0 0 0 1] S1 RN=[0 1 0 1] LN=[0 0 0 0]
RN=[0 1 0 1]

RN=[0 0 0 1] S2 S3
RN=[0 1 0 1]
Critical RN=[0 0 0 1]
Section RN=[0 1 0 1]

SS ZG 526: Distributed Computing 60

60

30
8/9/2024

Example: S4 finishes with CS and updates token


Token
Q=[]
S4 RN=[0 1 0 1] LN=[0 0 0 0]
RN=[0 1 0 1] S1

(1) Set LN[4] based on RN[4] LN=[0 0 0 1]


(2) LN[2] + 1 = RN[2] —> Q=[S2]

RN=[0 1 0 1] S2 S3 RN=[0 1 0 1]

Critical
Section

SS ZG 526: Distributed Computing 61

61

Example: S1 also requests access to CS

Token
(S1, 1) LN=[0 0 0 1]
S4 RN=[0 1 0 1]
RN=[0 1 0 1] S1 Q=[S2]
RN=[1 1 0 1]

(S1, 1) (S1, 1)

RN=[0 1 0 1] S2 S3

Critical RN=[0 1 0 1]
Section

SS ZG 526: Distributed Computing 62

62

31
8/9/2024

Example: S1’s request is processed by others

Token
LN=[0 0 0 1]
S4 RN=[0 1 0 1]
RN=[1 1 0 1] S1 RN=[1 1 0 1] Q=[S2] Q=[S2,S1]

RN=[0 1 0 1] S2 S3
RN=[1 1 0 1]
Critical RN=[0 1 0 1]
Section RN=[1 1 0 1]

SS ZG 526: Distributed Computing 63

63

Example - Token sent from S4 to S2 and S2 accesses CS

S4 RN=[1 1 0 1]
RN=[1 1 0 1] S1

RN=[1 1 0 1] S2 S3 RN=[1 1 0 1]
Token
LN=[0 0 0 1] Critical
Section
Q=[S1]

SS ZG 526: Distributed Computing 64

64

32
8/9/2024

Example: S2 done with CS and updates token

S4 RN=[1 1 0 1]
RN=[1 1 0 1] S1

RN=[1 1 0 1] S2 S3
Token
RN=[1 1 0 1]
LN=[0 0 0 1] LN=[0 1 0 1] Critical
Section
Q=[S1]

SS ZG 526: Distributed Computing 65

65

Example - S2 sends token to S1 and S1 accesses CS

Token
S4 RN=[1 1 0 1]
RN=[1 1 0 1] S1 LN=[0 1 0 1]
Q=[]

RN=[1 1 0 1] S2 S3

RN=[1 1 0 1]
Critical
Section

SS ZG 526: Distributed Computing 66

66

33
8/9/2024

Example: S1 done with CS and updates token

Token
S4 RN=[1 1 0 1]
RN=[1 1 0 1] S1 LN=[1 1 0 1]
Q=[]

DONE

RN=[1 1 0 1] S2 S3

RN=[1 1 0 1]
Critical
Section

SS ZG 526: Distributed Computing 67

67

Analysis of the DME algorithm


Correctness
Mutex is trivial.
– Theorem:
A requesting site enters the CS in finite amount of time.
– Proof
A request enters the token queue in finite time. The queue is in FIFO order, and there
can be a maximum N-1 sites ahead of the request. So finite time.

Performance
0 or N messages per CS invocation: 0 if the site already has the token and there are no
other requests.

SS ZG 526: Distributed Computing 68

68

34
8/9/2024

SS ZG 526: Distributed Computing


Session 8: Distributed Deadlock Detection

Dr. Anindya Neogi


Associate Professor
[email protected]

1
Reference: digital content slides from Prof. C. Hota, BITS Pilani

Topics for today


» What is a deadlock in a distributed system
» Conditions
» Solution options
» Types of graphs
» Detection algorithms
» CMH algorithm for AND model graphs
» CMH algorithm for OR model graphs

1
8/9/2024

What is a deadlock?

• One or more processes waiting T1 T2


indefinitely for resources to be released
… …
by other waiting processes. ….
lock(x)
… lock(y)
• Can occur on h/w or s/w resources, but …
lock(y)
mostly seen on distributed databases lock(x)
(Lock & Unlock). time

SS ZG 526: Distributed Computing 3

Conditions for deadlock


1. Mutual Exclusion
• Resource cannot be shared
2. Hold & Wait P1
assign request
• Request and assign edges
3. No Preemption x y
• Cannot forcibly take away resource from
process
4. Circular Wait request
P2
assign

• Cyclic wait of request / assign edges

SS ZG 526: Distributed Computing 4

2
8/9/2024

Ways to handle deadlocks


• Prevention
• Try to prevent a deadlock possibility, e.g. provision all processes with their resources right in the
beginning. But poor resource utilisation efficiency.
• Avoidance
• Judiciously decide whether granting a resource request will lead to an unsafe state. So detect safe and
unsafe states. Remember Maekawa’s algorithm ?
• Complex for large systems.
• Detection and resolution
• Check whether there is a cycle in the graph and resolve the deadlock - may be kill a process.
• Ignorance
• Don’t bother and let it happen. May be kill one or more processes once you identify. Not a desired
solution unless deadlocks are very rare.

SS ZG 526: Distributed Computing 5

Resource Allocation Graph


Concept: Cycle Vs Knot (RAG)

• The AND model of requests requires all resources P1


currently being requested to be granted to un-block
a computation AND
– A cycle is sufficient condition to declare a
deadlock with this model x y

P2 P3
P2 P1 P3

z
Remove resources to get a Wait For Graph
(WFG)

SS ZG 526: Distributed Computing 6

3
8/9/2024

Concept: Cycle Vs Knot y


P3

• The OR model of requests allows a computation P1 OR


making multiple different resource requests to un-block
as soon as any one is granted
x
– A cycle is a necessary condition P2
– A knot* is a sufficient condition

z
w y

P3 No deadlock even with cycle


OR x
P1
P2
z
Deadlock with a knot * Knot - subset of processes from which you cannot leave
SS ZG 526: Distributed Computing 7

Detection requirements
• Liveness / Progress
✓ All deadlocks found
- No undetected deadlocks
✓ Deadlocks found in finite time

• Safety
✓ No false deadlock detection
✓ Phantom (non-existent) deadlocks caused by network latencies
✓ See example next chart

SS ZG 526: Distributed Computing 8

4
8/9/2024

False deadlocks

A S S C A S C A S C

R T R T R T

B B B B
Machine 1 Machine 2 Global view 1 Global view 2
@ coordinator @ coordinator
Safe state False deadlock
No deadlock because B’s request for Deadlock because B’s request for T
T has not been recorded before B’s has reached before B’s release of
release of R R

SS ZG 526: Distributed Computing 9

Topics for today


» What is a deadlock in a distributed system
» Conditions
» Solution options
» Types of graphs
» Detection algorithms
» CMH algorithm for AND model graphs
» CMH algorithm for OR model graphs

10

10

5
8/9/2024

Edge-Chasing: Chandy-Misra-Haas (CMH) algorithm


Site

• Detects deadlock in AND graph


• A set of processes run on a node / machine / site. Pi Pj
• Processes request for resources within same site or across sites.
• Some processes wait for local resources (within site)
• Some processes wait for resources on other sites.
• Algorithm invoked when a process has to wait for a resource for a longer period (suspects a
deadlock).
• Uses local WFGs to detect local deadlocks
• Use probes to determine the existence of global deadlocks (inter-site).
• Edge chasing: A process sends a probe along edges and if it receives it back then there is a
cycle

SS ZG 526: Distributed Computing 11

11

Chandy-Misra-Haas’s (CMH) Algorithm (1)

Sending the probe:


if Pi is locally dependent on itself then deadlock.
else for all Pj and Pk such that
(a) Pi is locally dependent upon Pj, and
(b) Pj is waiting on Pk, and
(c ) Pj and Pk are on different sites, send probe(i, j, k) to the home site of Pk.

probe(i, j, k) S1 S2
i: initiator or block P2 Probe(1, 3, 4)
j: sender
k: receiver P1 P4 P5
P3

local deadlocks can be detected with WFG within site


probes are for global deadlocks

SS ZG 526: Distributed Computing 12

12

6
8/9/2024

Chandy-Misra-Haas (CMH) Algorithm (2)

Receiving probe(i, j, k) at Pk (continuation of the sending steps):

if (d) Pk is blocked, and
   (e) dependentk(i) is false, and
   (f) Pk has not replied to all requests of Pj
then begin
    dependentk(i) := true;
    if k == i then declare that Pi is deadlocked
    else
        for all Pm and Pn such that
            (a') Pk is locally dependent upon Pm, and
            (b') Pm is waiting on Pn, and
            (c') Pm and Pn are on different sites,
            send probe(i, m, n) to the home site of Pn
end.

dependentk: an array kept by each process Pk so that a probe from a given initiator is processed only once.

[Figure: Probe(1, 3, 4) reaches P4, which forwards Probe(1, 4, 5); P5 forwards Probe(1, 5, 1) back to P1's site, where k == i and the deadlock is declared.]

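A minimal Python sketch of the edge-chasing idea, under stated assumptions: the wait-for edges and site assignment below are hypothetical, messages are simulated with a single work queue, and (unlike the real algorithm, which handles purely local waits with the site's own WFG and sends probes only across sites) the probe is simply chased along every wait-for edge.

from collections import defaultdict, deque

# Hypothetical AND-model wait-for graph: 1 -> 2 -> 3 -> 4 -> 5 -> 1 (a cycle)
waits_on = {1: [2], 2: [3], 3: [4], 4: [5], 5: [1]}
site_of = {1: "S1", 2: "S1", 3: "S1", 4: "S2", 5: "S2"}   # for illustration only

dependent = defaultdict(set)   # dependent[k]: initiators whose probe Pk has already processed

def detect(initiator):
    """Start edge chasing from a blocked process; True if the probe returns (deadlock)."""
    probes = deque((initiator, initiator, k) for k in waits_on.get(initiator, []))
    while probes:
        i, j, k = probes.popleft()          # probe(i, j, k): initiator, sender, receiver
        if i in dependent[k]:
            continue                        # Pk processes a probe from initiator i only once
        dependent[k].add(i)
        if k == i:
            return True                     # probe came back to the initiator: deadlock
        for n in waits_on.get(k, []):       # forward the probe along Pk's wait-for edges
            probes.append((i, k, n))
    return False

print(detect(1))   # True for the cyclic wait-for graph above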

Example 1: CMH algorithm - edge chasing

[Figure: processes P1..P9 spread over sites S1, S2, S3 form an AND-model wait-for graph; probes (1, 1, 2), (1, 2, 3), (1, 3, 4), (1, 4, 5), (1, 4, 6), (1, 5, 7), (1, 6, 8), (1, 7, 9), (1, 9, 1) chase the edges until a probe arrives back at P1 with k == i.]


Example 1: CMH algorithm - deadlock detected

[Figure: the same wait-for graph as above; the deadlock is detected at P1 when probe (1, 9, 1) arrives and k == i.]


Example 2: CMH algorithm

[Figure: a wait-for graph over processes P1..P7 on which to trace the edge-chasing algorithm.]


Advantages and Disadvantages

Advantages:
1. Popular variants of this algorithm are used in distributed DB locking schemes.
2. Easy to implement, as each message is of fixed length and requires few computational steps.
3. No graph construction or information collection is needed.
4. False deadlocks are not detected, because only the j and k fields are updated as the probe travels.
5. Does not require a particular structure among processes.

Disadvantages:
1. Two or more processes may independently detect the same deadlock, and hence while resolving, several processes may be aborted. (Why?)
2. Even though a process detects a deadlock, it does not know the full cycle.
3. M(N-1)/2 messages are required to detect a deadlock, where M = no. of processes and N = no. of sites.


OR model graphs

[Figure: processes P1, P2, P3 with OR-model wait edges that together create a knot.]


Diffusion Computation: CMH Algorithm (1)

1. Initiation by a blocked process Pi:
   send query(i, i, j) to all processes Pj in the dependent set DSi of Pi;
   numi(i) := |DSi|;
   waiti(i) := true;

[Figure: Pi (an OR wait) sends query(i, i, j) to Pj and query(i, i, k) to Pk, so numi(i) = 2.]


Diffusion Computation: CMH Algorithm (2)

2. Blocked process Pk receiving query(i, j, k):
   if this is the engaging query for Pk   /* engaging: first query from initiator Pi */
   then
       a. send query(i, k, m) to all Pm in DSk;
       b. numk(i) := |DSk|;
       c. waitk(i) := true;
   else if waitk(i) then   /* not the engaging query, and Pk continuously blocked since the engaging query */
       send reply(i, k, j) to Pj.

[Figure: Pk receives query(i, j, k) from Pj; case 1: it is the engaging query, so Pk propagates query(i, k, m) to every Pm in DSk; case 2: it is not engaging, so Pk returns reply(i, k, j) to Pj.]


Diffusion Computation: CMH Algorithm (3)

3. Process Pk receiving reply(i, j, k):
   if waitk(i) then
       numk(i) := numk(i) - 1;   /* decrement on each reply */
       if numk(i) = 0 then   /* all replies received across the OR branches */
           if i == k then declare a deadlock
           else send reply(i, k, m) to the process Pm which had sent the engaging query.

[Figure: Pk had received the engaging query(i, m, k) from Pm and exchanged query(i, k, j)/reply(i, j, k) with Pj; once numk(i) drops to 0, Pk sends reply(i, k, m) back to Pm.]

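A minimal Python sketch of the diffusion computation, under stated assumptions: the dependent sets are hypothetical and message passing is simulated with direct (synchronous) function calls rather than a real network.

blocked = {1, 2, 3}                       # hypothetical: all three processes are blocked
DS = {1: [2], 2: [3], 3: [1]}             # dependent sets (OR-model wait edges)

num, waiting, engager = {}, {}, {}        # per-(process, initiator) bookkeeping

def send_query(i, j, k):                  # query(i, j, k): initiator i, sender j, receiver k
    if k not in blocked:
        return                            # an active process discards queries
    if (k, i) not in waiting:             # engaging query: first query from initiator i
        waiting[(k, i)] = True
        engager[(k, i)] = j
        num[(k, i)] = len(DS[k])
        for m in DS[k]:
            send_query(i, k, m)
    elif waiting[(k, i)]:                 # non-engaging query while continuously blocked
        send_reply(i, k, j)

def send_reply(i, j, k):                  # reply(i, j, k): initiator i, sender j, receiver k
    if not waiting.get((k, i)):
        return
    num[(k, i)] -= 1
    if num[(k, i)] == 0:                  # a reply has arrived on every OR branch
        if i == k:
            print(f"P{i} is deadlocked")
        else:
            send_reply(i, k, engager[(k, i)])

# P1 suspects a deadlock and initiates the diffusion computation.
initiator = 1
waiting[(initiator, initiator)] = True
num[(initiator, initiator)] = len(DS[initiator])
for j in DS[initiator]:
    send_query(initiator, initiator, j)    # prints "P1 is deadlocked" for the cycle above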

Example 1: CMH for OR model graph (1)

[Figure: the nine-process graph over sites S1, S2, S3, now with OR-model waits; P1's queries (1, 1, 2), (1, 2, 3), (1, 3, 4), (1, 4, 5), (1, 4, 6), (1, 5, 7), (1, 6, 8), (1, 7, 9), (1, 9, 1) diffuse through the graph; P4 sets num4(1) = 2 because it sent queries to two successors.]


Example 1: CMH for OR model graph (2)

[Figure: replies (1, 9, 7), (1, 7, 5), (1, 5, 4), (1, 1, 9) flow back along the query paths, but P4 receives only one of the two replies it is waiting for, so num4(1) = 1.]

• No deadlock is declared: P1 never gets back a reply, because P4 never reaches num4(1) = 0 (no reply ever comes from P6).
• P1 may send several non-engaging queries; P2 will reply to all of them, but the engaging query will never be answered.

Example 2: CMH for OR model graph (1)

[Figure: the graph of Example 1 with one additional wait edge from P8 to P9; queries (1, 1, 2), (1, 2, 3), (1, 3, 4), (1, 4, 5), (1, 4, 6), (1, 5, 7), (1, 6, 8), (1, 7, 9), (1, 8, 9), (1, 9, 1) diffuse through the graph; P4 sets num4(1) = 2.]

Example 2: CMH for OR model graph (2)

[Figure: replies (1, 9, 8), (1, 9, 7), (1, 8, 6), (1, 7, 5), (1, 6, 4), (1, 5, 4), (1, 4, 3), (1, 3, 2), (1, 2, 1), (1, 1, 9) flow back along the query paths; this time num4(1) reaches 0, the reply propagates all the way back, and P1 declares a deadlock.]

Example 3: CMH for OR model graph

[Figure: two copies of a wait-for graph over processes P1..P7 on which to trace the OR-model diffusion computation.]


AND-OR graph models


» One way is to repeatedly apply OR model algorithm
» But that is not very efficient
» Another way is to capture global state and analyse it for stable
deadlock property
» For interested readers :
» Tech report reference [16] in Textbook T1 Chapter 10
» Section 10.9 in Textbook T1 Chapter 10


Persistence & Resolution

• Deadlock persistence:
  • Average time a deadlock exists before it is resolved.
• Deadlock resolution:
  • Aborting at least one process/request involved in the deadlock.
• Efficient resolution of a deadlock requires knowledge of all the processes and resources involved.
  • e.g. probes can try to gather some global knowledge, such as the lowest-priority process in the cycle.
• If every process that detects a deadlock tries to resolve it independently, resolution is highly inefficient: several processes might be aborted unnecessarily.


Summary
» What is a deadlock in a distributed system
» What are WFG and RAG
» When does deadlock happen - 4 conditions that must be satisfied
» What options do we have :
» Prevention, Avoidance, Detection + Resolution, Ignorance
» Deadlock detection and resolution is most useful
» 2 Detection algorithms
» CMH algorithm for AND model graphs
» CMH algorithm for OR model graphs
» Reading: Chapter 10 of T1


SSTCS ZG526 DISTRIBUTED COMPUTING
BITS Pilani
Pilani | Dubai | Goa | Hyderabad
Anil Kumar G

L8 : Deadlock detection
[T1: Chap - 10]

Source Courtesy: Some of the contents of this PPT are sourced from materials provided by Publishers of T1 & T2

Presentation Overview

• Introduction
• Models of Deadlocks
• Single resource model
• AND model
• OR model
• Chandy-Misra-Haas Algorithm for AND model
• Chandy-Misra-Haas Algorithm for OR model


Introduction

• A deadlock can be defined as a condition where a set of processes request


resources that are held by other processes in the set.
• Deadlocks can be dealt with using any one of the following three strategies:

– deadlock prevention,
– deadlock avoidance, and
– deadlock detection.


Introduction

• Deadlock prevention is commonly achieved by either having a process acquire all


the needed resources simultaneously before it begins execution or by pre-empting
a process that holds the needed resource.
• In the deadlock avoidance approach to distributed systems, a resource is granted to
a process if the resulting global system is safe.
• Deadlock detection requires an examination of the status of the process–resources
interaction for the presence of a deadlock condition.


Models of deadlocks

• Distributed systems allow many kinds of resource requests.


• A process might require a single resource or a combination of resources for its
execution.
• There is a hierarchy of request models, ranging from very restricted forms to ones with no
restrictions.
• This hierarchy is used to classify deadlock detection algorithms based on the
complexity of the resource requests they permit.


The single-resource model

• The single-resource model is the simplest resource model in a distributed system,


where a process can have at most one outstanding request for only one unit of a
resource.
• Since the maximum out-degree of a node in the wait-for graph (WFG) for the single-
resource model is 1, the presence of a cycle in the WFG indicates that
there is a deadlock.

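Because a deadlock in the single-resource model is exactly a cycle in the WFG, detection reduces to following the single outgoing wait edge. A minimal Python sketch, using a hypothetical WFG:

wfg = {"P1": "P2", "P2": "P3", "P3": "P1", "P4": "P2"}   # Pi -> the process it waits on (at most one)

def permanently_blocked(start, wfg):
    """True if a cycle is reachable from `start`, i.e. `start` can never be unblocked."""
    seen, node = set(), start
    while node in wfg:            # stop when the current process is not blocked on anyone
        if node in seen:
            return True           # revisited a process: cycle, hence deadlock
        seen.add(node)
        node = wfg[node]
    return False

print(permanently_blocked("P1", wfg))  # True: P1 -> P2 -> P3 -> P1
print(permanently_blocked("P4", wfg))  # True: P4 waits on a cycle it is not itself part of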

The AND model

• In the AND model, a process can request more than one resource simultaneously
and the request is satisfied only after all the requested resources are granted to the
process.
• The requested resources may exist at different locations.


The OR model

• In the OR model, a process can make a request for numerous resources


simultaneously and the request is satisfied if any one of the requested resources is
granted.
• The requested resources may exist at different locations


Chandy–Misra–Haas algorithm for the AND model

• The algorithm uses a special message called probe , which is a triplet (i , j , k ),


denoting that it belongs to a deadlock detection initiated for process Pi and it is
being sent by the home site of process Pj to the home site of process Pk .
• A probe message travels along the edges of the global WFG graph, and a deadlock
is detected when a probe message returns to the process that initiated it.
• A process Pj is said to be dependent on another process Pk if there exists a
sequence of processes Pj , Pi1 , Pi2 , . . . , Pim , Pk such that each process except
Pk in the sequence is blocked and each process, except the Pj , holds a resource for
which the previous process in the sequence is waiting.
• Process Pj is said to be locally dependent upon process Pk if Pj is dependent upon
Pk and both the processes are on the same site.


Chandy–Misra–Haas algorithm for the OR model

• A blocked process determines if it is deadlocked by initiating a diffusion


computation. Two types of messages are used in a diffusion computation: query (i ,
j , k ) and reply (i , j , k ), denoting that they belong to a diffusion computation
initiated by a process Pi and are being sent from process Pj to process Pk .


Chandy–Misra–Haas algorithm for the OR model

• A blocked process initiates deadlock detection by sending query messages to all


processes in its dependent set (i.e., processes from which it is waiting to receive a
message).
• If an active process receives a query or reply message, it discards it.
• When a blocked process Pk receives a query (i , j , k ) message, it takes the
following actions


Chandy–Misra–Haas algorithm for the OR model

1. If this is the first query message received by Pk for the deadlock detection initiated
by Pi (called the engaging query), then it propagates the query to all the processes in
its dependent set and sets a local variable numk(i) to the number of query messages
sent.
2. If this is not the engaging query, then Pk returns a reply message to it immediately,
provided Pk has been continuously blocked since it received the corresponding
engaging query. Otherwise, it discards the query.


Chandy–Misra–Haas algorithm for the OR model

• Process Pk maintains a boolean variable waitk(i) that denotes the fact that it has
been continuously blocked since it received the last engaging query from process
Pi.
• When a blocked process Pk receives a reply(i, j, k) message, it decrements numk(i)
only if waitk(i) holds.
• A process sends a reply message in response to an engaging query only after it has
received a reply to every query message it has sent out for this engaging query.
• The initiator process detects a deadlock when it has received reply messages to all
the query messages it has sent out.


BITS Pilani
Pilani|Dubai|Goa|Hyderabad

L9: Consensus & Agreement Algorithms


[T1: Chap - 14]

Source Courtesy: Some of the contents of this PPT are sourced from materials provided by Publishers of T1 & T2

PRESENTATION OVERVIEW

• Consensus
• Byzantine Behavior
• Agreement in a failure-free system
• Agreement In (Message-passing) Synchronous Systems With Failures
– Consensus algorithm for crash failures (synchronous system)
– Consensus algorithms for Byzantine failures
– Byzantine Agreement Tree Algorithm
• Agreement In Asynchronous Message-passing Systems With Failures
– Impossibility result for the consensus problem
– Terminating reliable broadcast
– Distributed transaction commit
– k-set consensus
– Approximate agreement
– Renaming problem
– Reliable broadcast

Consensus and Agreement


Algorithms

• Agreement among the processes in a distributed system is a


fundamental requirement for a wide range of applications.
• Many forms of coordination require the processes to exchange
information to negotiate with one another and eventually
reach a common understanding or agreement, before taking
application-specific actions.
• A classical example is that of the commit decision in database
systems, wherein the processes collectively decide whether
to commit or abort a transaction that they participate in.


Consensus

• Consensus decision-making is a group decision-making process


in which group members develop, and agree to support, a
decision in the best interest of the whole.
• Consensus may be defined professionally as an acceptable
resolution, one that can be supported, even if not the
"favourite" of each individual.



Byzantine Behavior

• Consider the difficulty of reaching agreement using the


following example, that is inspired by the long wars fought by
the Byzantine Empire in the Middle Ages.
• Four camps of the attacking army, each commanded by a
general, are camped around the fort of Byzantium.
• They can succeed in attacking only if they attack
simultaneously. Hence, they need to reach agreement on the
time of attack.
• The only way they can communicate is to send messengers
among themselves.


Byzantine Behavior

• The messengers model the messages.


• An asynchronous system is modeled by messengers taking an
unbounded time to travel between two camps.
• A lost message is modeled by a messenger being captured by
the enemy.
• A Byzantine process is modeled by a general being a traitor.
• The traitor will attempt to subvert the agreement-reaching
mechanism, by giving misleading information to the other
generals.


Byzantine Behavior

Byzantine generals sending confusing messages.


Byzantine Behavior

• For example, a traitor may inform one general to attack at 10


a.m., and inform the other generals to attack at noon.
• Or he may not send a message at all to some general. Likewise,
he may tamper with the messages he gets from other
generals, before relaying those messages.
• Four generals are shown, and a consensus decision is to be
reached about a boolean value.
• The various generals are conveying potentially misleading
values of the decision variable to the other generals, which
results in confusion.


Byzantine Behavior

• In the face of such Byzantine behavior, the challenge is to


determine whether it is possible to reach agreement, and if so
under what conditions.
• If agreement is reachable, then protocols to reach it need to
be devised.


Agreement in a failure-free system

• In a failure-free system, consensus can be reached by


collecting information from the different processes, arriving at
a “decision,” and distributing this decision in the system.
• A distributed mechanism would have each process broadcast
its values to others, and each process computes the same
function on the values received.
• The decision can be reached by using an application specific
function
• Some simple examples being the majority , max , and min
functions.


Agreement in a failure-free system

• Algorithms to collect the initial values and then distribute the


decision may be based on the token circulation on a logical
ring, or the three-phase tree-based broadcast–convergecast–
broadcast, or direct communication with all nodes.
• In a synchronous system, this can be done simply in a constant
number of rounds
• In an asynchronous system, consensus can similarly be
reached in a constant number of message hops.
• Reaching agreement is straightforward in a failure-free system


Agreement In (Message-passing)
Synchronous Systems
With Failures


Consensus algorithm for crash


failures (synchronous system)
• Consensus algorithm for n processes, where up to f processes,
may fail in the fail-stop model .
• Here, the consensus variable x is integer-valued. Each process
has an initial value xi .
• If up to f failures are to be tolerated, then the algorithm has
f + 1 rounds.
• In each round, a process i sends the value of its variable xi to
all other processes if that value has not been sent before.
• Of all the values received within the round and its own value xi
at the start of the round, the process takes the minimum, and
updates xi .
• After f + 1 rounds, the local value xi is guaranteed to be the
consensus value.
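A minimal round-by-round Python simulation of this algorithm, under stated assumptions: the process count, initial values, and crash schedule are hypothetical, and a crashing process is modelled as reaching only some receivers in its final round.

n, f = 5, 2
x = {p: v for p, v in enumerate([7, 3, 9, 4, 8])}   # initial values x_i of processes 0..4
crashed = set()
crash_schedule = {0: {2}, 1: {0}}                    # round -> processes that crash mid-round
sent = {p: set() for p in range(n)}                  # values each process has already broadcast

for rnd in range(f + 1):                             # f + 1 rounds tolerate up to f crash failures
    inbox = {p: [] for p in range(n)}
    for p in range(n):
        if p in crashed:
            continue
        new_values = {x[p]} - sent[p]                # broadcast the current value if not sent before
        sent[p] |= new_values
        receivers = range(n)
        if p in crash_schedule.get(rnd, set()):      # a crashing process reaches only a prefix of peers
            receivers = range(n // 2)
        for q in receivers:
            inbox[q].extend(new_values)
    crashed |= crash_schedule.get(rnd, set())
    for p in range(n):
        if p not in crashed and inbox[p]:
            x[p] = min(x[p], *inbox[p])              # decision rule: keep the minimum value seen

print({p: x[p] for p in range(n) if p not in crashed})   # the surviving processes agree (on 3 here)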

Consensus algorithm for crash


failures (synchronous system)
• A lower bound on the number of rounds
• At least f +1 rounds are required, where f < n.
• The idea behind this lower bound is that in the worst-case
scenario, one process may fail in each round; with f +1 rounds,
there is at least one round in which no process fails.
• In that guaranteed failure-free round, all messages broadcast
can be delivered reliably, and all processes that have not failed
can compute the common function of the received values to
reach an agreement value.


Consensus algorithms for Byzantine failures

Upper bound on Byzantine processes

In a system of n processes, the Byzantine agreement
problem (as also the other variants of the agreement problem)
can be solved in a synchronous system only if the number of
Byzantine processes f is such that f ≤ ⌊(n − 1)/3⌋ (equivalently, n ≥ 3f + 1).


Byzantine Agreement Tree


Algorithm

• We begin with an informal description of how agreement can


be achieved with n = 4 and f = 1 processes
• In the first round, the commander Pc sends its value to the
other three lieutenants, as shown by dotted arrows.
• In the second round, each lieutenant relays to the other two
lieutenants, the value it received from the commander in the
first round.
• At the end of the second round, a lieutenant takes the majority
of the values it received (i) directly from the commander in the
first round, and (ii) from the other two lieutenants in the
second round.


Byzantine Agreement Tree


Algorithm

• The majority gives a correct estimate of the “commander’s”


value. Consider Figure where the commander is a traitor.
• The values that get transmitted in the two rounds are as
shown.
• All three lieutenants take the majority of (1, 0, 0) which is “0,”
the agreement value.
• In Figure lieutenant Pd is malicious. Despite its behavior as
shown, lieutenants Pa and Pb agree on “0,” the value of the
commander.

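A small Python sketch of the two-round exchange for n = 4, f = 1, matching the case described above where lieutenant Pd is the traitor; the commander's value and the traitor's corruption rule below are hypothetical choices for illustration.

from collections import Counter

def majority(values):
    return Counter(values).most_common(1)[0][0]

lieutenants = ["Pa", "Pb", "Pd"]
commander_value = 0                       # the (loyal) commander's order

# Round 1: the commander sends its value to every lieutenant.
received = {lt: commander_value for lt in lieutenants}

# Round 2: each lieutenant relays the value it received to the other two;
# the traitor Pd relays a corrupted value.
relayed = {lt: {} for lt in lieutenants}
for sender in lieutenants:
    for receiver in lieutenants:
        if sender == receiver:
            continue
        value = received[sender]
        if sender == "Pd":
            value = 1 - value             # Byzantine behaviour: flip the bit
        relayed[receiver][sender] = value

# Decision: majority of the direct value and the two relayed values.
for lt in lieutenants:
    votes = [received[lt]] + list(relayed[lt].values())
    print(lt, "decides", majority(votes), "from votes", votes)
# Pa and Pb, the loyal lieutenants, both decide 0 - the commander's value.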

Byzantine Agreement Tree Algorithm

[Figure: the two message-exchange scenarios for n = 4, f = 1 - one with a traitorous commander Pc and one with a traitorous lieutenant Pd.]


Agreement In Asynchronous
Message-passing Systems
With Failures


Impossibility result for the


consensus problem

• Fischer et al showed a fundamental result on the impossibility


of reaching agreement in an asynchronous (message-passing)
system, even if a single process is allowed to have a crash
failure.
• This result has a significant impact on the field of designing
distributed algorithms in a failure-susceptible system.
• The correctness proof of this result also introduced the
important notion of valency of global states.


Terminating reliable broadcast

• Consider the terminating reliable broadcast problem, which


states that a correct process always gets a message even if the
sender crashes while sending. If the sender crashes while
sending the message, the message may be a null message but
it must be delivered to each correct process.
• We have an additional termination condition, which states that
each correct process must eventually deliver some message.


Terminating reliable broadcast

• Validity If the sender of a broadcast message m is non-faulty,


then all correct processes eventually deliver m.
• Agreement If a correct process delivers a message m, then all
correct processes deliver m.
• Integrity Each correct process delivers a message at most
once. Further, if it delivers a message different from the null
message, then the sender must have broadcast m.
• Termination Every correct process eventually delivers some
message.


Terminating reliable broadcast

• A process decides on a “0” or “1” depending on whether it
receives “0” or “1” in the message broadcast (via the terminating
reliable broadcast) by the designated sender process.
• However, if it receives the null message, it decides on a default
value.
• As the broadcast is done using the terminating reliable
broadcast, it can be seen that the conditions of the consensus
problem are satisfied.
• But as consensus is not solvable, an algorithm to implement
terminating reliable broadcast cannot exist.


Distributed transaction commit

• Database transactions require the commit operation to


preserve the ACID properties (atomicity, consistency, isolation,
durability) of transactional semantics.
• The commit operation requires polling all participants
whether the transaction should be committed or rolled back.
• Even a single rollback vote requires the transaction to be rolled
back.
• Whatever the decision, it is conveyed to all the participants in
the transaction.

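A minimal Python sketch of the commit decision rule just described; the participant names are hypothetical, the vote collection and decision broadcast are abstracted into plain dictionaries, and the blocking wait for all votes is not modelled.

def two_phase_commit_decision(votes):
    """votes: participant -> "commit" or "abort", collected in phase 1
    (the coordinator blocks until every participant has replied).
    Even a single abort vote forces the whole transaction to be rolled back."""
    decision = "commit" if all(v == "commit" for v in votes.values()) else "abort"
    return {p: decision for p in votes}          # phase 2: convey the decision to everyone

print(two_phase_commit_decision({"db1": "commit", "db2": "commit", "db3": "commit"}))
print(two_phase_commit_decision({"db1": "commit", "db2": "abort", "db3": "commit"}))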

Distributed transaction commit

• Despite the unsolvability of the distributed commit problem


under crash failure, the (blocking) two-phase commit and the
non-blocking three-phase commit protocols do solve the
problem.
• This is because the protocols use a somewhat different model
in practice, than that used for our theoretical analysis of the
consensus problem.
• The two-phase protocol waits indefinitely for a reply, and it is
assumed that a crashed node eventually recovers and sends in
its vote.


Distributed transaction commit

• Optimizations such as presumed abort and presumed commit


are pessimistic and optimistic solutions that are not
guaranteed to be correct under all circumstances.
• Similarly, the three-phase commit protocol uses timeouts to
default to the “abort” decision when the coordinator does not
get a reply from all the participants within the timeout period.


k-set consensus

• Although consensus is not solvable in an asynchronous system


under crash failures, a weaker version, known as the k -set
consensus problem, is solvable as long as the number of crash
failures f is less than the parameter k .
• The parameter k indicates that the non-faulty processes may agree
on different values, as long as the size of the set of values
agreed upon is bounded by k.


k-set consensus

• The k -agreement condition is new, the validity condition is


different from that for regular consensus, and the termination
condition is unchanged from that for regular consensus.
• The protocol in Algorithm 14.5 can be seen to solve k -set
consensus in a straightforward manner, as long as the number
of crash failures f is less than k .
• Let n = 10, f = 2, k = 3 and let each process propose a unique
value from [1, 2, …10] . Then the 3 -set is [8, 9, 10] .

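A toy Python illustration of the n = 10, f = 2, k = 3 example above, under stated assumptions: every process broadcasts its proposal, any f senders may go unheard by a given process, and each process decides the maximum proposal it has seen. The set of possible decisions then contains at most k = 3 values.

from itertools import combinations

n, f, k = 10, 2, 3
proposal = {p: p + 1 for p in range(n)}          # unique proposals 1..10

decisions = set()
for missing in combinations(range(n), f):        # any f senders may go unheard
    for p in range(n):
        heard = {q for q in range(n) if q not in missing} | {p}   # a process always knows its own value
        decisions.add(max(proposal[q] for q in heard))

print(sorted(decisions))    # [8, 9, 10] -> at most k = 3 distinct decision values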

Approximate agreement

• Another weaker version of consensus that is solvable in an


asynchronous system under crash failures is known as the
approximate consensus problem.
• Like k -set consensus, approximate agreement also assumes
the consensus value is from a multi-valued domain.
• However, rather than restricting the set of consensus values to
a set of size k , approximate agreement requires that the
agreed-upon values by the non-faulty processes be within ε of
each other.


Renaming problem

• The consensus problem which was a problem about


agreement required the processes to agree on a single value,
or a small set of values (k -set consensus), or a set of values
close to one another (approximate agreement), or reach
agreement with high probability (probabilistic or randomized
agreement).
• A different agreement problem introduced by Attiya et al.
requires the processes to agree on necessarily distinct values.
• This problem is termed as the renaming problem.
• The renaming problem assigns to each process Pi , a name mi
from a domain M


Renaming problem

• The renaming problem is useful for name space


transformation.
• A specific example where this problem arises is when
processes from different domains need to collaborate, but
must first assign themselves distinct names from a small
domain.
• A second example of the use of renaming is when processes
need to use their names as “tags” to simply mark their
presence, as in a priority queue.


Reliable broadcast

• Although reliable terminating broadcast (RTB) is not solvable


under failures, a weaker version of RTB, namely reliable
broadcast, in which the termination condition is dropped, is
solvable under crash failures.
• The key difference between RTB and reliable broadcast is that
RTB requires eventual delivery of some message – even if the
sender fails just when about to broadcast.
• In this case, a null message must get sent, whereas this null
message need not be sent under reliable broadcast.


BITS Pilani
Pilani|Dubai|Goa|Hyderabad

L11 & 12: Peer - to - Peer Computing & Overlay Graphs

[T1: Chap - 18]

Source Courtesy: Some of the contents of this PPT are sourced from materials provided by Publishers of T1 & T2

Presentation Overview

• Introduction
• Characteristics and performance features of P2P systems
• Data indexing and overlays
– Centralized indexing
– Distributed indexing
– Local indexing
• Structured overlays
• Unstructured overlays
• Challenges in P2P system design


Introduction

• Peer-to-peer (P2P) network systems use an application-level organization of the


network overlay for flexibly sharing resources (e.g., files and multimedia
documents) stored across network-wide computers.
• In contrast to the client–server model, any node in a P2P network can act as a
server to others and, at the same time, act as a client.


Introduction

• Communication and exchange of information is performed directly between the


participating peers and the relationships between the nodes in the network are
equal.
• Thus, P2P networks differ from other Internet applications in that they tend to
share data from a large number of end users rather than from the more central
machines and Web servers.


Introduction

Several well-known P2P networks that allow P2P file-sharing include

• Napster
• Gnutella
• Freenet
• Pastry
• Chord
• CAN


Introduction

• Traditional distributed systems used DNS (domain name service) to provide a


lookup from host names (logical names) to IP addresses.
• Special DNS servers are required, and manual configuration of the routing
information is necessary to allow requesting client nodes to navigate the DNS
hierarchy.
• Further, DNS is confined to locating hosts or services (not data objects that have to
be a priori associated with specific computers), and host names need to be
structured as per administrative boundary regulations.
• P2P networks overcome these drawbacks, and, more importantly, allow the
location of arbitrary data objects.


Introduction

• An important characteristic of P2P networks is their ability to provide a large


combined storage, CPU power, and other resources while imposing a low cost for
scalability, and for entry into and exit from the network.
• The ongoing entry and exit of various nodes, as well as dynamic insertion and
deletion of objects is termed as churn.
• The impact of churn should be as transparent as possible.
• P2P networks exhibit a high level of self-organization and are able to operate
efficiently despite the lack of any prior infrastructure or authority.


Characteristics and performance features of P2P systems


Napster

• One of the earliest popular P2P systems, Napster, used a server-mediated central
index architecture organized around clusters of servers that store direct indices of
the files in the system.
• The central server maintains a table with the following information of each
registered client:
(i) the client’s address (IP) and port, and offered bandwidth
(ii) information about the files that the client can allow to share.


Napster

The basic steps of operation to search for content and to determine a node from
which to download the content are the following:
1. A client connects to a meta-server that assigns a lightly loaded server from one of
the close-by clusters of servers to process the client’s query.
2. The client connects to the assigned server and forwards its query along with its own
identity.
3. The server responds to the client with information about the users connected to it
and the files they are sharing.
4. On receiving the response from the server, the client chooses one of the users from
whom to download a desired file. The address to enable the P2P connection between
the client and the selected user is provided by the server to the client.


Napster

• Users are generally anonymous to each other.


• The directory serves to provide the mapping from a particular host that contains
the required content, to the IP address needed to download from it.


Data indexing and overlays

• The data in a P2P network is identified by using indexing.


• Data indexing allows the physical data independence from the applications.
• Indexing mechanisms can be classified as being
– Centralized
– Local
– Distributed


Centralized indexing

• Centralized indexing entails the use of one or a few central servers to store
references (indexes) to the data on many peers.
• The DNS lookup as well as the lookup by some early P2P networks such as Napster
used a central directory lookup.


Distributed indexing

• Distributed indexing involves the indexes to the objects at various peers being
scattered across other peers throughout the P2P network.
• In order to access the indexes, a structure is used in the P2P overlay to access the
indexes.
• Distributed indexing is the most challenging of the indexing schemes, and many
novel mechanisms have been proposed, most notably the distributed hash table
(DHT).
• Various DHT schemes differ in the hash mapping, search algorithms, diameter for
lookup, search diameter, fault-tolerance, and resilience to churn.

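To make the distributed-indexing idea concrete, here is a minimal Python sketch of DHT-style key placement on a hash ring (a Chord-like "successor of the key" rule). The peer names and the tiny 2^16 identifier space are hypothetical simplifications; real DHT schemes add routing tables, replication, and churn handling.

import hashlib
from bisect import bisect_left

M = 2 ** 16                                    # tiny identifier space (real DHTs use 2**128 or more)

def ident(name):
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % M

nodes = ["peerA", "peerB", "peerC", "peerD"]   # hypothetical peers
ring = sorted((ident(n), n) for n in nodes)    # peers placed on the ring by hashed identifier

def lookup(key):
    """The index entry for `key` is stored on the first peer clockwise from hash(key)."""
    ids = [i for i, _ in ring]
    pos = bisect_left(ids, ident(key)) % len(ring)   # wrap around the end of the ring
    return ring[pos][1]

for f in ["song.mp3", "lecture10.pdf", "movie.mkv"]:
    print(f, "->", lookup(f))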

Local indexing

• Local indexing requires each peer to index only the local data objects and remote
objects need to be searched for.
• This form of indexing is typically used in unstructured overlays


Semantic Index mechanism

• An alternate way to classify indexing mechanisms is as being a semantic index


mechanism or a semantic-free index mechanism.
• A semantic index is human readable, for example, a document name, a keyword, or
a database key.
• A semantic-free index is not human readable and typically corresponds to the index
obtained by a hash mechanism, e.g., the DHT schemes.
• A semantic index mechanism supports keyword searches, range searches, and
approximate searches, whereas these searches are not supported by semantic free
index mechanisms.


Structured overlays

• The P2P network topology has a definite structure, and the placement of files or
data in this network is highly deterministic as per some algorithmic mapping.
• The objective of such a deterministic mapping is to allow a very fast and
deterministic lookup to satisfy queries for the data.
• These systems are termed as lookup systems and typically use a hash table interface
for the mapping.
• The hash function, which efficiently maps keys to values, in conjunction with the
regular structure of the overlay, allows fast search for the location of the file.


Structured overlays

• An implicit characteristic of such a deterministic mapping of a file to a location is


that the mapping can be based on a single characteristic of the file such as its
name, its length, or more generally some predetermined function computed on the
file.
• A disadvantage of such a mapping is that arbitrary queries, such as range queries,
attribute queries and exact keyword queries cannot be handled directly.


Unstructured overlays

• The P2P network topology does not have any particular controlled structure, nor is
there any control over where files/data is placed.
• Each peer typically indexes only its local data objects, hence, local indexing is used.
• Node joins and departures are easy – the local overlay is simply adjusted.
• File placement is not governed by the topology.
• Search for a file may entail high message overhead and high delays.


Unstructured overlays

• Although the P2P network topology does not have any controlled structure, some
topologies naturally emerge
• Unstructured overlays have the serious disadvantage that queries may take a long
time to find a file or may be unsuccessful even if the queried object exists.
• The message overhead of a query search may also be high.


Unstructured overlays

The following are the main advantages of unstructured overlays


• Exact keyword queries, range queries, attribute-based queries, and other complex
queries can be supported because the search query can capture the semantics of
the data being sought; and the indexing of the files and data is not bound to any
non-semantic structure.
• Unstructured overlays can accommodate high churn, i.e., the rapid joining and
departure of many nodes without affecting performance.


Unstructured overlays

• Unstructured overlays are efficient when there is some degree of data replication in
the network.
• Users are satisfied with a best-effort search.
• The network is not so large as to lead to scalability problems during the search
process.


Challenges in P2P system design

Fairness
• P2P systems depend on all the nodes cooperating to store objects and allowing
other nodes to download from them.
• However, nodes tend to be selfish in nature; thus there is a tendency to download
files without reciprocating by allowing others to download the locally available files.
• This behavior, termed leeching or free-riding, leads to a degradation of the
overall P2P system performance.
• Hence, penalties and incentives should be built in the system to encourage sharing
and maximize the benefit to all nodes.


Challenges in P2P system design

• In the prisoners’ dilemma, two suspects, A and B, are arrested by the police.
• There is not enough evidence for a conviction.
• The police separate the two prisoners, and, separately, offer each the same deal: if
the prisoner testifies against (betrays) the other prisoner and the other prisoner
remains silent, the betrayer gets freed and the silent accomplice gets a 10-year
sentence.
• If both testify against the other (betray), they each receive a 2-year sentence.
• If both remain silent, the police can only sentence both to a small 6-month term on
a minor offence.


Challenges in P2P system design

• Rational selfish behavior dictates that both A and B would betray the other.
• This is not a Pareto-optimal solution, where a Pareto-optimal solution is one in
which the overall good of all the participants is maximized.


Challenges in P2P system design

• In the above example, both A and B staying silent results in a Pareto-optimal


solution.
• The dilemma is that this is not considered the rational behavior of choice.
• In the iterative prisoners’ dilemma, the game is played multiple times, until an
“equilibrium” is reached.
• Each player retains memory of the last move of both players (in more general
versions, the memory extends to several past moves).
• After trying out various strategies, both players should converge to the ideal
optimal solution of staying silent. This is Pareto-optimal.


Challenges in P2P system design

• The commonly accepted view is that the tit-for-tat strategy is the best for winning
such a game.
• In the first step, a prisoner cooperates, and in each subsequent step, he
reciprocates the action taken by the other party in the immediately preceding step.
• The BitTorrent P2P system has adopted the tit-for-tat strategy in deciding whether
to allow a download of a file, in order to solve the leeching problem.
• Here, cooperation is analogous to allowing others to upload local files, and betrayal
is analogous to not allowing others to upload.

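A toy Python round-robin of the iterated prisoners' dilemma showing how tit-for-tat behaves. The payoff numbers follow the sentences in the example above (measured in years, lower is better); the strategies and round count are hypothetical. In BitTorrent terms, "silent" corresponds to allowing uploads and "betray" to refusing them.

YEARS = {("silent", "silent"): (0.5, 0.5),   # both silent: 6 months each
         ("silent", "betray"): (10, 0),      # silent accomplice gets 10 years, betrayer goes free
         ("betray", "silent"): (0, 10),
         ("betray", "betray"): (2, 2)}       # both betray: 2 years each

def tit_for_tat(opponent_history):           # cooperate first, then mirror the opponent's last move
    return "silent" if not opponent_history else opponent_history[-1]

def always_betray(opponent_history):
    return "betray"

def play(strategy_a, strategy_b, rounds=10):
    hist_a, hist_b, years_a, years_b = [], [], 0, 0
    for _ in range(rounds):
        move_a, move_b = strategy_a(hist_b), strategy_b(hist_a)
        pay_a, pay_b = YEARS[(move_a, move_b)]
        years_a, years_b = years_a + pay_a, years_b + pay_b
        hist_a.append(move_a)
        hist_b.append(move_b)
    return years_a, years_b

print(play(tit_for_tat, tit_for_tat))      # (5.0, 5.0): sustained cooperation
print(play(tit_for_tat, always_betray))    # (28, 18): tit-for-tat is exploited only in the first round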

Challenges in P2P system design

Trust or reputation management


• Various incentive-based economic mechanisms to ensure maximum cooperation
among the selfish peers inherently depend on the notion of trust.
• In a P2P environment where the peer population is highly transient, there is also a
need to have trust in the quality of data being downloaded.
• These requirements have led to the area of trust and trust management in P2P
Systems
• As no node has a complete view of the other downloads in the P2P system, it may
have to contact other nodes to evaluate the trust in particular offerers from which
it could download some file.


Challenges in P2P system design

• These communication protocol messages for trust management may be susceptible


to various forms of malicious attack (such as man-in-the-middle attacks and Sybil
attacks), thereby requiring strong security guarantees.
• The many challenges to tracking trust in a distributed setting include: quantifying
trust and using different metrics for trust, how to maintain trust about other peers
in the face of collusion, and how to minimize the cost of the trust management
protocols.


SSTCS ZG526 DISTRIBUTED COMPUTING
BITS Pilani
Pilani | Dubai | Goa | Hyderabad
Anil Kumar G

TEXT BOOKS

REFERENCE BOOKS
R.1 - Kai Hwang, Geoffrey C. Fox, and Jack J. Dongarra,
“Distributed and Cloud Computing: From Parallel processing to the Internet of Things”,
Morgan Kaufmann, 2012 Elsevier Inc.
R.2 - John F. Buford, Heather Yu, and Eng K. Lua, “P2P Networking and Applications”,
Morgan Kaufmann, 2009 Elsevier Inc.
R.3 - Joshy Joseph, and Craig Fellenstein, “Grid Computing”, IBM Press, Pearson
education, 2011.

Note: In order to broaden understanding of concepts as applied to Indian IT industry, students are advised to refer books of
their choice and case-studies in their own organizations


BITS Pilani
Pilani|Dubai|Goa|Hyderabad

L13: Cluster Computing & Grid Computing - 2

[R1: Chap - 2]

Source Courtesy: Some of the contents of this PPT are sourced from materials provided by Publishers of T1 & T2

Presentation Overview

• Cluster System Interconnects


– Crossbar Switch in Google Search Engine Cluster
– Share of System Interconnects over Time
• Hardware, Software, and Middleware Support
• GPU Clusters for Massive Parallelism
• Cluster Job Scheduling Methods
– Space Sharing
– Time Sharing
• Independent scheduling
• Gang scheduling
• Competition with foreign (local) jobs


Cluster System Interconnects

High-Bandwidth Interconnects
• Ethernet used a 1 Gbps link, while the fastest InfiniBand links ran at 30 Gbps.
• The Myrinet and Quadrics perform in between.
• The MPI latency represents the state of the art in long-distance message passing.
• All four technologies can implement any network topology, including crossbar
switches, fat trees, and torus networks.
• The InfiniBand is the most expensive choice with the fastest link speed.
• The Ethernet is still the most cost-effective choice.



Crossbar Switch in Google Search Engine Cluster

• Google has many data centers using clusters of low-cost PC engines.


• These clusters are mainly used to support Google’s web search business.
• The Google cluster interconnect connects 40 racks of PC engines via two racks of 128 x 128
Ethernet switches.
• Each Ethernet switch can handle 128 one Gbps Ethernet links.
• A rack contains 80 PCs. This is an earlier cluster of 3,200 PCs. Google’s search
engine clusters are built with a lot more nodes.
• Today’s server clusters from Google are installed in data centers with container
trucks.


Crossbar Switch in Google Search Engine Cluster

• Two switches are used to enhance cluster availability.


• The cluster works fine even when one switch fails to provide the links among the
PCs.
• The front ends of the switches are connected to the Internet via 2.4 Gbps OC 48
links. The 622 Mbps OC 12 links are connected to nearby data-center networks.
• In case of failure of the OC 48 links, the cluster is still connected to the outside
world via the OC 12 links.
• Thus, the Google cluster eliminates all single points of failure.



Share of System Interconnects over Time

• The distribution of large-scale system interconnects in the Top 500 systems from
2003 to 2008.
• Gigabit Ethernet is the most popular interconnect due to its low cost and market
readiness.
• The InfiniBand network has been chosen in about 150 systems for its high-
bandwidth performance.
• The Cray interconnect is designed for use in Cray systems only.
• The use of Myrinet and Quadrics networks had declined rapidly in the Top 500 list
by 2008.



Share of System Interconnects over Time

• The InfiniBand has a switch-based point-to-point interconnect architecture.


• A large InfiniBand has a layered architecture.
• The interconnect supports the virtual interface architecture (VIA) for distributed
messaging.
• The InfiniBand switches and links can make up any topology. Popular ones include
crossbars, fat trees, and torus networks.
• The InfiniBand provides the highest speed links and the highest bandwidth in
reported largescale systems.
• However, InfiniBand networks cost the most among the four interconnect
technologies.


Share of System Interconnects over Time

• Each end point can be a storage controller, a network interface card (NIC), or an
interface to a host system.
• A host channel adapter (HCA) connected to the host processor through a standard
peripheral component interconnect (PCI), PCI extended (PCI-X), or PCI express bus
provides the host interface.
• Each HCA has more than one InfiniBand port. A target channel adapter (TCA)
enables I/O devices to be loaded within the network.
• The TCA includes an I/O controller that is specific to its particular device’s protocol
such as SCSI, Fibre Channel, or Ethernet.
• This architecture can be easily implemented to build very large scale cluster
interconnects that connect thousands or more hosts together



Hardware, Software, and Middleware Support

• Realistically, SSI and HA features in a cluster are not obtained free of charge.
• They must be supported by hardware, software, middleware, or OS extensions.
• Any change in hardware design and OS extensions must be done by the
manufacturer.
• The hardware and OS support could be cost prohibitive to ordinary users.
• However, programming level is a big burden to cluster users.
• Therefore, the middleware support at the application level costs the least to
implement.


Hardware, Software, and Middleware Support

• Close to the user application end, middleware packages are needed at the cluster
management level: one for fault management to support failover and failback
• Another desired feature is to achieve HA using failure detection and recovery and
packet switching.
• In the middle of Figure we need to modify the Linux OS to support HA, and we need
special drivers to support HA, I/O, and hardware devices.
• Toward the bottom, we need special hardware to support hot-swapped devices and
provide router interfaces



GPU Clusters for Massive Parallelism

• Commodity GPUs are becoming high-performance accelerators for data-parallel


computing. Modern GPU chips contain hundreds of processor cores per chip.
• Based on a 2010 report each GPU chip is capable of achieving up to 1 Tflops for
single-precision (SP) arithmetic, and more than 80 Gflops for double-precision (DP)
calculations.
• Recent HPC-optimized GPUs contain up to 4 GB of on-board memory, and are
capable of sustaining memory bandwidths exceeding 100 GB/second.
• GPU clusters are built with a large number of GPU chips.
• GPU clusters have already demonstrated their capability to achieve Pflops
performance in some of the Top 500 systems.


GPU Clusters for Massive Parallelism

• Most GPU clusters are structured with homogeneous GPUs of the same hardware
class, make, and model.
• The software used in a GPU cluster includes the OS, GPU drivers, and clustering API
such as an MPI.
• The high performance of a GPU cluster is attributed mainly to its massively parallel
multicore architecture, high throughput in multithreaded floating-point arithmetic,
and significantly reduced time in massive data movement using large on-chip cache
memory.
• In other words, GPU clusters already are more cost-effective than traditional CPU
clusters.


GPU Clusters for Massive Parallelism

• GPU clusters result in not only a quantum jump in speed performance, but also
significantly reduced space, power, and cooling demands.
• A GPU cluster can operate with a reduced number of operating system images,
compared with CPU-based clusters.
• These reductions in power, environment, and management complexity make GPU
clusters very attractive for use in future HPC applications


CLUSTER JOB
AND
RESOURCE MANAGEMENT


Cluster Job Scheduling Methods

• Cluster jobs may be scheduled to run at a specific time (calendar scheduling ) or


when a particular event happens (event scheduling ).
• Jobs are scheduled according to priorities based on submission time, resource
nodes, execution time, memory, disk, job type, and user identity.
• With static priority , jobs are assigned priorities according to a predetermined, fixed
scheme.
• A simple scheme is to schedule jobs in a first-come, first-serve fashion.
• Another scheme is to assign different priorities to users.
• With dynamic priority , the priority of a job may change over time.


Cluster Job Scheduling Methods

• Three schemes are used to share cluster nodes.


• In the dedicated mode , only one job runs in the cluster at a time, and at most, one
process of the job is assigned to a node at a time.
• The single job runs until completion before it releases the cluster to run other jobs.
• Note that even in the dedicated mode, some nodes may be reserved for system use
and not be open to the user job.
• Other than that, all cluster resources are devoted to run a single job.
• This may lead to poor system utilization.


Cluster Job Scheduling Methods

• The job resource requirement can be static or dynamic .


• Static scheme fixes the number of nodes for a single job for its entire period.
• Static scheme may underutilize the cluster resource.
• It cannot handle the situation when the needed nodes become unavailable, such as
when the workstation owner shuts down the machine
• Dynamic resource allocation allows a job to acquire or release nodes during execution.
• However, it is much more difficult to implement, requiring cooperation between a
running job and the job management system (JMS).


Space Sharing

• A common scheme is to assign higher priorities to short, interactive jobs in daytime


and during evening hours using tiling .
• In this space-sharing mode, multiple jobs can run on disjointed partitions (groups)
of nodes simultaneously.
• At most, one process is assigned to a node at a time.
• Although a partition of nodes is dedicated to a job, the interconnect and the I/O
subsystem may be shared by all jobs.
• Space sharing must solve the tiling problem and the large-job problem.


Space Sharing

• JMS schedules four jobs in a first-come first-serve fashion on four nodes.


• Jobs 1 and 2 are small and thus assigned to nodes 1 and 2.
• Jobs 3 and 4 are parallel; each needs three nodes.
• When job 3 comes, it cannot run immediately.
• It must wait until job 2 finishes to free up the needed nodes.
• Tiling will increase the utilization of the nodes.
• The overall execution time of the four jobs is reduced after repacking the jobs over the available nodes, as the sketch below illustrates.
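A rough event-driven sketch of the idea, using made-up job sizes and run times rather than the exact numbers of the four-job example above: with tiling (backfill) switched on, a small waiting job may be packed into idle nodes ahead of a large job that cannot start yet, which shortens the overall schedule.

    def makespan(jobs, nodes, tiling):
        """jobs: list of (name, nodes_needed, run_time), in submission (FCFS) order."""
        t, free, queue, running = 0, nodes, list(jobs), []   # running: (finish_time, nodes_used)
        while queue or running:
            i = 0
            while i < len(queue):
                _, need, length = queue[i]
                if need <= free:                  # enough idle nodes: start the job now
                    running.append((t + length, need))
                    free -= need
                    queue.pop(i)
                elif tiling:
                    i += 1                        # skip it and try to tile a later, smaller job
                else:
                    break                         # strict FCFS: nobody overtakes the queue head
            finish, used = min(running)           # advance time to the next job completion
            running.remove((finish, used))
            t, free = finish, free + used
        return t

    jobs = [("J1", 1, 4), ("J2", 4, 2), ("J3", 1, 3)]     # (name, nodes needed, run time)
    print(makespan(jobs, nodes=4, tiling=False))          # 9: strict FCFS makes J3 wait behind the large J2
    print(makespan(jobs, nodes=4, tiling=True))           # 6: J3 is tiled into idle nodes while J2 waits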


Space Sharing


Time Sharing

• In the dedicated or space-sharing mode, only one user process is allocated to a node.
• However, the system processes or daemons are still running on the same node.
• In the time-sharing mode, multiple user processes are assigned to the same node.
• Time sharing introduces the following parallel scheduling policies:


Independent scheduling

• The most straightforward implementation of time sharing is to use the operating system of each cluster node to schedule different processes as in a traditional workstation.
• This is called local scheduling or independent scheduling.
• However, the performance of parallel jobs could be significantly degraded.
• Processes of a parallel job need to interact.
• For instance, when one process wants to barrier-synchronize with another, the
latter may be scheduled out. So the first process has to wait.
• As the second process is rescheduled, the first process may be swapped out.


Gang scheduling

• The gang scheduling scheme schedules all processes of a parallel job together.
• When one process is active, all processes are active.
• The cluster nodes are not perfectly clock-synchronized.
• In fact, most clusters are asynchronous systems, and are not driven by the same
clock.
• Although we say, "All processes are scheduled to run at the same time," they do not start at exactly the same time (a simple gang-scheduling sketch follows below).
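A toy round-robin sketch of the gang idea, assuming one process per node and a fixed time slice; the job names, the dictionary layout, and the helper function are purely illustrative.

    from collections import deque

    def gang_schedule(jobs, time_slices):
        """jobs: {job_name: number_of_processes}. In every time slice exactly one
        parallel job is selected and all of its processes run together."""
        ready = deque(jobs.items())
        timeline = []
        for _ in range(time_slices):
            name, nprocs = ready.popleft()
            # the whole gang is active in this slice, one process per node
            timeline.append({f"node{i}": f"{name}.p{i}" for i in range(nprocs)})
            ready.append((name, nprocs))       # round-robin: back of the ready queue
        return timeline

    for slice_no, placement in enumerate(gang_schedule({"A": 3, "B": 2}, time_slices=4)):
        print(slice_no, placement)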


Competition with foreign (local) jobs

• Scheduling becomes more complicated when both cluster jobs and local jobs are
running.
• Local jobs should have priority over cluster jobs.
• With one keystroke, the owner wants command of all workstation resources.
• There are basically two ways to deal with this situation:
– The cluster job can either stay in the workstation node or
– migrate to another idle node.


Competition with foreign (local) jobs

• A stay scheme has the advantage of avoiding migration cost.


• The cluster process can be run at the lowest priority.
• The workstation’s cycles can be divided into three portions, for kernel processes,
local processes, and cluster processes.
• However, staying slows down both the local and the cluster jobs, especially when the cluster job is a load-balanced parallel job that needs frequent synchronization and communication.
• This leads to the migration approach, which moves cluster jobs around the available nodes, mainly to balance the workload.


BITS Pilani
Pilani|Dubai|Goa|Hyderabad

L14 : Grid Computing

[R2: Chap - 7]

Source Courtesy: Some of the contents of this PPT are sourced from materials provided by Publishers of T1 & T2

Presentation Overview

• Grid Architecture and Service Modeling


– Grid history and service families
– Four Grid Service Families
– Grid Service Protocol Stack
• Grid Resources
– CPU Scavenging and Virtual Supercomputers
– Dynamic Formation of Virtual Organizations
• OPEN GRID SERVICES ARCHITECTURE (OGSA)
• Resource management and job scheduling
• Grid Job Scheduling Methods
• Resource Brokering with Gridbus

Grid Architecture and Service Modeling

• The goal of grid computing is to explore fast solutions for large-scale computing
problems.
• This objective is shared by computer clusters and massively parallel processor
(MPP) systems
• However, grid computing takes advantage of the existing computing resources
scattered in a nation or internationally around the globe.
• In grids, resources owned by different organizations are aggregated together and
shared by many users in collective applications.


Grid Architecture and Service Modeling

• Grids rely on heavy use of LAN/WAN resources across enterprises, organizations, and governments.
• Virtual organizations and virtual supercomputers are new concepts derived from grid and cloud computing.
• These are dynamically configured virtual resources that are not under the full control of any single user or local administrator.


Grid history and service families

• The Internet was developed in the 1980s to provide computer-to-computer connections using the telnet:// protocol.
• The web service was developed in the 1990s to establish direct linkage between
web pages using the http:// protocol.
• Ever since the 1990s, grids became gradually available to establish large pools of
shared resources.
• The approach is to link many Internet applications across machine platforms
directly in order to eliminate isolated resource islands


Grid history and service families

• Grids differ from conventional HPC clusters.


• Cluster nodes are more homogeneous machines that are better coordinated to
work collectively and cooperatively.
• The grid nodes are heterogeneous computers that are more loosely coupled
together over geographically dispersed sites.
• In 2001, Forbes Magazine advocated the emergence of the great global grid (GGG)
as a new global infrastructure.
• This GGG evolved from the World Wide Web (WWW) technology we have enjoyed
for many years.
• Four major families of grid computing systems were suggested by the Forbes GGG
categorization


Grid history and service families


Four Grid Service Families

• Most of today’s grid systems are called computational grids or data grids.
• Information or knowledge grids pose another grid class dedicated to knowledge management and distributed ontology processing.
• In the business world, we see a family, called business grids, built for business
data/information processing. Some business grids are being transformed
into Internet clouds.
• The last grid class includes several grid extensions such as P2P grids and parasitic
grids.


Grid Service Protocol Stack

• The top layer corresponds to user applications to run on the grid system.
• The user applications demand collective services including collective computing
and communications.
• The next layer is formed by the hardware and software resources aggregated to run
the user applications under the collective operations.
• The connectivity layer provides the interconnection among drafted resources.
• This connectivity could be established directly on physical networks or it could be
built with virtual networking technology.


Grid Service Protocol Stack

• The connectivity must support the grid fabric, including the network links and
virtual private channels.
• The fabric layer includes all computational resources, storage systems, catalogs,
network resources, sensors, and their network connections.
• The connectivity layer enables the exchange of data between fabric layer resources.
• The five-layer grid architecture is closely related to the layered Internet protocol
stack
• The connectivity layer is supported by the network and transport layers of the
Internet stack.
• The Internet application layer supports the top three layers.


Grid Service Protocol Stack


Grid Resources

Grid resources, their control operations, and enquiries:

• Compute resources
  Control operations: starting, monitoring, and controlling the execution of resultant processes; control over resources; advance reservation.
  Enquiries: hardware and software characteristics; relevant load information (current load and queue state).

• Storage resources
  Control operations: putting and getting files; control over resources allocated to data transfers; advance reservation.
  Enquiries: hardware and software characteristics; relevant load information (available space and bandwidth utilization).

• Network resources
  Control operations: control over resources allocated.
  Enquiries: network characteristics and load.

• Code repositories
  Control operations: managing versioned source and object code.
  Enquiries: software files and compile support.

• Service catalogs
  Control operations: implementing catalog query and update operations (a relational database).
  Enquiries: service order information and agreements.


CPU Scavenging and Virtual Supercomputers

• Both public and virtual grids can be built over large or small machines that are loosely coupled together to satisfy the application need.
• Grids differ from the conventional supercomputers in many ways in the context of
distributed computing.
• Supercomputers like MPPs in the Top-500 list are more homogeneously structured
with tightly coupled operations, while the grids are built with heterogeneous nodes
running non-interactive workloads.
• These grid workloads may involve a large number of files and individual users. The
geographically dispersed grids are more scalable and fault-tolerant with
significantly lower operational costs than the supercomputers.


CPU Scavenging and Virtual Supercomputers

• The concept of creating a "grid" from the unused resources in a network of computers is known as CPU scavenging.
• In reality, virtual grids are built over a large number of desktop computers by using their free cycles at night or during inactive usage periods.
• The donors are ordinary citizens on a voluntary participation basis.
• In practice, these client hosts also donate some disk space, RAM, and network
bandwidth in addition to the raw CPU cycles.


CPU Scavenging and Virtual Supercomputers

• At present, many volunteer computing grids are built using the CPU scavenging
model.
• The most famous example is SETI@Home, which applied over 3 million computers to achieve 23.37 TFlops as of Sept. 2001.
• More recent examples include the BOINC and Folding@Home etc.
• In practice, these virtual grids can be viewed as virtual supercomputers


Dynamic Formation of Virtual Organizations

• A new drug development project needs the molecular simulation software from a research institute and the clinical database from a hospital.
• At the same time, computer scientists need to perform biological sequence analysis
on Linux clusters in a supercomputing center.
• The members from three organizations contribute all or part of their hardware or
software resources to form a VO


Dynamic Formation of Virtual Organizations


Dynamic Formation of Virtual Organizations

• Each physical organization owns some resources, such as computers, storage, instruments, and databases.
• The three organizations cooperate with one another to perform two tasks: new jumbo jet development and new drug exploration.
• Each physical organization contributes some of its resources to form a VO labeled
“X” for drug development and another VO labeled “Y” for aircraft development.
• Physical organizations may allocate their resources as available to a VO only under
light workload conditions.


Dynamic Formation of Virtual Organizations

• The owners can also modify the number of cluster servers to be allocated to a
specific VO.
• The participants join or leave the VO dynamically under special service agreement.
• A new physical organization may also join an existing VO.
• A participant may leave a VO, once its job is done.
• The dynamic nature of the resources in a VO poses a great challenge for grid computing.
• Resources have to cooperate closely to produce a rewarding result.
• Without an effective resource management system, a poorly managed grid or VO may be inefficient and waste resources.


OPEN GRID SERVICES ARCHITECTURE (OGSA)

• The OGSA is an open source grid service standard jointly developed by academia
and the IT industry under coordination of a working group in the Global Grid Forum
(GGF).
• The standard was specifically developed for the emerging grid and cloud service
communities.
• The OGSA is extended from web service concepts and technologies.
• The standard defines a common framework that allows businesses to build grid
platforms across enterprises and business partners.
• The intent is to define the standards required for both open source and commercial
software to support a global grid infrastructure


Resource management and job scheduling

• In a grid system, resources are usually autonomous.


• Each organization may have its own resource management policies.
• It is more reasonable and adaptable to set up an individual resource management system (RMS) for each organization: the RMSes at the upper level can be considered the resource consumers, and the RMSes at the lower level can be considered the resource providers.
• To support such a multi-RMS structure, an abstract model for RMS is introduced.


Resource management and job scheduling

• There are four published interfaces in the model.


• The resource consumer interface is for access from upper-level RMS or user
applications.
• The resource discoverer actively searches qualified resources by this interface.
• The resource disseminator broadcasts information about local resources to other
RMSes.
• The resource trader exchanges resources among RMSes for market-based grid
systems.
• The resource resolver routes jobs to remote RMSes.
• The resource co-allocator simultaneously allocates multiple resources to a job.
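The components above can be summarised as an abstract base class. This is only a sketch of the abstract RMS model; the method names and signatures are invented for illustration and are not part of any standard API.

    from abc import ABC, abstractmethod

    class ResourceManagementSystem(ABC):
        """Abstract RMS: upper-level RMSes act as resource consumers,
        lower-level RMSes act as resource providers."""

        @abstractmethod
        def submit(self, job):
            """Resource consumer interface: entry point for upper-level RMSes or user applications."""

        @abstractmethod
        def discover(self, requirements):
            """Resource discoverer: actively searches for qualified resources."""

        @abstractmethod
        def disseminate(self, peers):
            """Resource disseminator: broadcasts local resource information to other RMSes."""

        @abstractmethod
        def trade(self, offer):
            """Resource trader: exchanges resources among RMSes in market-based grid systems."""

        @abstractmethod
        def route(self, job):
            """Resource resolver: routes a job to a remote RMS."""

        @abstractmethod
        def co_allocate(self, job, resources):
            """Resource co-allocator: allocates multiple resources to one job simultaneously."""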


Resource management and job scheduling


Grid Job Scheduling Methods

• Two schemes are often used to classify grid job scheduling methods: hierarchical classification, which organizes the methods in a multilevel tree, and flat classification, which is based on a single attribute.
• The flat attributes include adaptive versus nonadaptive scheduling, load balancing, bidding or probabilistic approaches, and one-time assignment versus dynamic reassignment.


Grid Job Scheduling Methods


Resource Brokering with Gridbus

• To make resources constituents of the grid, they need to be accessible from different management domains.
• This can be achieved by installing core grid middleware such as Globus in
UNIX/Linux environments.
• Multinode clusters need to be presented as a single resource to the grid, and this
can be achieved by deploying job management systems such as the Sun Grid Engine
on them


Resource Brokering with Gridbus


Resource Brokering with Gridbus

• In a grid environment where data needs to be federated for sharing among various
interested parties, data grid technologies such as SRB, Globus RLS, and EU DataGrid
need to be deployed.
• The user-level middleware needs to be deployed on resources responsible for
providing resource brokering and application execution management services.
• Users may even access these services via web portals.
• Several grid resource brokers have been developed. Some of the more prominent
include Nimrod-G, Condor-G, GridWay, and Gridbus Resource Broker.


Resource Brokering with Gridbus

The following 11 steps are followed to aggregate the Grid resources:


1. The user composes his application as a distributed application (e.g., parameter
sweep) using visual application development tools.
2. The user specifies his analysis and QoS requirements and submits them to the grid
resource broker.
3. The grid resource broker performs resource discovery using the grid information
service.
4. The broker identifies resource service prices by querying the grid market directory.
5. The broker identifies the list of data sources or replicas and selects the optimal
ones.


Resource Brokering with Gridbus

6. The broker identifies computational resources that provide the required services.
7. The broker ensures that the user has necessary credit or an authorized share to
utilize the resources.
8. The broker scheduler analyzes the resources to meet the user’s QoS requirements.
9. The broker agent on a resource executes the job and returns the results.
10. The broker collates the results and passes them to the user.
11. The meter charges the user by passing the resource usage information to the
accountant.
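The same 11 steps, condensed into a single sketch. The broker object and its method names are hypothetical and only mirror the numbered steps above; they are not the actual Gridbus API.

    def run_on_grid(application, qos, user, broker):
        """Illustrative end-to-end brokering flow (steps 1-11 above)."""
        job = broker.compose(application)                     # 1. compose distributed application
        request = broker.submit(job, qos)                     # 2. analysis + QoS requirements
        candidates = broker.discover_resources(request)       # 3. grid information service
        prices = broker.query_market_directory(candidates)    # 4. service prices
        replicas = broker.select_data_sources(request)        # 5. optimal data sources/replicas
        compute = broker.select_compute(candidates, prices)   # 6. resources providing the services
        broker.check_authorization(user, compute)             # 7. credit / authorized share
        plan = broker.schedule(compute, replicas, qos)        # 8. scheduler meets QoS requirements
        results = [broker.execute(task) for task in plan]     # 9. broker agents run the jobs
        report = broker.collate(results)                      # 10. collate and return to the user
        broker.meter_and_charge(user, plan)                   # 11. usage info passed to the accountant
        return report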


BITS Pilani
Pilani|Dubai|Goa|Hyderabad

L15 : Internet of Things


[R1: Chap - 9]

Source Courtesy: Some of the contents of this PPT are sourced from materials provided by Publishers of T1 & T2

PRESENTATION OVERVIEW

• Introduction
• Applications of IOT
• Radio-Frequency Identification
• ZigBee


Introduction

• The Internet of Things (IoT) is the network of physical objects or "things" embedded with electronics, software, sensors, and network connectivity, which enables these objects to collect and exchange data.
• The Internet of Things allows objects to be sensed and
controlled remotely across existing network infrastructure,
creating opportunities for more direct integration between the
physical world and computer-based systems, and resulting in
improved efficiency, accuracy and economic benefit
• Each thing is uniquely identifiable through its embedded
computing system but is able to interoperate within the
existing Internet infrastructure

Introduction

• Integration with the Internet implies that devices will use an IP address as a unique identifier.
• However, due to the limited address space of IPv4 (which
allows for 4.3 billion unique addresses), objects in the IoT will
have to use IPv6 to accommodate the extremely large address
space required
• Objects in the IoT will not only be devices with sensory
capabilities, but also provide actuation capabilities (e.g., bulbs
or locks controlled over the Internet)
• To a large extent, the future of the Internet of Things will not
be possible without the support of IPv6
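A quick back-of-the-envelope check of the address-space argument above (the 4.3 billion figure is 2^32; IPv6 uses 128-bit addresses):

    ipv4_addresses = 2 ** 32        # about 4.3 billion unique addresses
    ipv6_addresses = 2 ** 128       # 128-bit addresses

    print(f"{ipv4_addresses:,}")                # 4,294,967,296
    print(f"{float(ipv6_addresses):.2e}")       # about 3.40e+38
    print(ipv6_addresses // ipv4_addresses)     # 2**96 times more addresses than IPv4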


Introduction

• The Internet of Things (IoT) is an environment in which objects, animals or people are provided with unique identifiers and the ability to transfer data over a network without requiring human-to-human or human-to-computer interaction.
• IoT has evolved from the convergence
of wireless technologies, micro-electro mechanical systems
and the Internet.
• The concept may also be referred to as the Internet of
Everything


Applications of IOT

CHECK ON THE BABY


• Aimed at helping to prevent SIDS, the Mimo monitor is a new
kind of infant monitor that provides parents with real-time
information about their baby's breathing, skin temperature,
body position, and activity level on their smartphones


Applications of IOT

REMEMBER TO TAKE YOUR MEDS


• GlowCaps fit prescription bottles and via a wireless chip
provide services that help people stick with their prescription
regimen; from reminder messages, all the way to refill and
doctor coordination


Applications of IOT

TRACK YOUR ACTIVITY LEVELS


Using your smartphone's range of sensors (Accelerometer, Gyro,
Video, Proximity, Compass, GPS, etc) and connectivity options
(Cell, WiFi, Bluetooth, NFC, etc) you have a well equipped
Internet of Things device in your pocket that can automatically
monitor your movements, location, and workouts throughout
the day


Applications of IOT

GET THE MOST OUT OF YOUR MEDICATION


• The Proteus ingestible pill sensor is powered by contact with
your stomach fluid and communicates a signal that determines
the timing of when you took your meds and the identity of the
pill. This information is transferred to a patch worn on the skin
to be logged for you and your doctor's reference. Heart rate,
body position and activity can also be detected


Applications of IOT

MONITOR AN AGING FAMILY MEMBER


• Using a wearable alarm button and other discrete wireless
sensors placed around the home, the BeClose system can track
your loved one's daily routine and give you peace of mind for
their safety by alerting you to any serious disruptions detected
in their normal schedule


Applications of IOT

STAY OUT OF THE DOCTOR'S OFFICE


• Intended for individuals with cardiac arrhythmias the Body
Guardian is an FDA cleared wearable sensor system that can
remotely read a patient’s biometrics (ECG, heart rate,
respiration rate and activity Level), sending the data to the
patients physician and allowing users to go about their daily
lives outside of a clinical setting.


Applications of IOT

KEEP YOUR PLANTS ALIVE


• Whether taking care of a small hydroponic system or a large
backyard lawn, systems like HarvestGeek with their suite of
sensors and web connectivity help save you time and
resources by keeping plants fed based on their actual growing
needs and conditions while automating much of the labor
processes


Applications of IOT

LIGHT YOUR HOME IN NEW WAYS


• Web enabled lights like the Phillip's Hue can be used as an
ambient data displays (Glow red when my bus is 5 minutes
away). These multi-functional lights can also help you to
reduce electricity use (automatically turn off the lights when
no one is in a room) or help to secure your home while you are
away by turning your lights on and off


Radio-frequency identification

• Radio-frequency identification (RFID) is the wireless use of electromagnetic fields to transfer data, for the purposes of automatically identifying and tracking tags attached to objects. The tags contain electronically stored information.
• Some tags are powered by electromagnetic induction from magnetic fields produced near the reader. Some types collect energy from the interrogating radio waves and act as a passive transponder.
• Other types have a local power source such as a battery and
may operate at hundreds of meters from the reader.
• Unlike a barcode, the tag does not necessarily need to be
within line of sight of the reader and may be embedded in the
tracked object.

Radio-frequency identification

• RFID tags are used in many industries.


• For example, an RFID tag attached to an automobile during
production can be used to track its progress through the
assembly line
• RFID-tagged pharmaceuticals can be tracked through
warehouses; and implanting RFID microchips in livestock and
pets allows positive identification of animals


Radio-Frequency Identification


ZigBee

• ZigBee is an IEEE 802.15.4-based specification for a suite of high-level communication protocols used to create personal area networks with small, low-power digital radios.
• The technology defined by the ZigBee specification is intended
to be simpler and less expensive than other wireless personal
area networks (WPANs), such as Bluetooth or Wi-Fi.
• Applications include wireless light switches, electrical meters
with in-home-displays, traffic management systems, and other
consumer and industrial equipment that requires short-range
low-rate wireless data transfer.


ZigBee

• ZigBee is a low-cost, low-power, wireless mesh network standard targeted at the wide deployment of long-battery-life devices in wireless control and monitoring applications.
• ZigBee devices have low latency, which further reduces average current consumption.
• ZigBee chips are typically integrated with radios and with microcontrollers that have between 60 KB and 256 KB of flash memory.


ZigBee

• ZigBee operates in the industrial, scientific and medical (ISM) radio bands: 2.4 GHz in most jurisdictions worldwide, 784 MHz in China, 868 MHz in Europe, and 915 MHz in the USA and Australia. Data rates vary from 20 kbit/s (868 MHz band) to 250 kbit/s (2.4 GHz band).
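The band plan above can be captured in a small lookup table. The sketch repeats only what the slide states; data rates not given above are left as None.

    # ISM bands used by ZigBee / IEEE 802.15.4 as listed above.
    ZIGBEE_BANDS = {
        "worldwide":     {"band": "2.4 GHz",  "data_rate_kbps": 250},
        "China":         {"band": "784 MHz",  "data_rate_kbps": None},   # rate not given above
        "Europe":        {"band": "868 MHz",  "data_rate_kbps": 20},
        "USA/Australia": {"band": "915 MHz",  "data_rate_kbps": None},   # rate not given above
    }

    def band_for(region):
        # fall back to the worldwide 2.4 GHz band for regions not listed
        return ZIGBEE_BANDS.get(region, ZIGBEE_BANDS["worldwide"])

    print(band_for("Europe"))    # {'band': '868 MHz', 'data_rate_kbps': 20}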


ZigBee


ZigBee

• The main applications for 802.15.4 are control and monitoring applications where relatively low levels of data throughput are needed; with the possibility of remote, battery-powered sensors, low power consumption is a key requirement.
• Sensors, lighting controls, security and many more applications
are all candidates for the new technology


ZigBee

• The data is transferred in packets. These have a maximum size of 128 bytes, allowing for a maximum payload of 104 bytes.
• Although this may appear low when compared to other
systems, the applications in which 802.15.4 and ZigBee are
likely to be used should not require very high data rates.
• The standard supports 64 bit IEEE addresses as well as 16 bit
short addresses.
• The 64 bit addresses uniquely identify every device in the
same way that devices have a unique IP address.
• Once a network is set up, the short addresses can be used and
this enables over 65000 nodes to be supported
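The frame and addressing figures above can be checked with a few lines. The fragmentation helper is purely illustrative; real 802.15.4/ZigBee fragmentation is handled by higher layers.

    MAX_FRAME_BYTES = 128
    MAX_PAYLOAD_BYTES = 104
    HEADER_OVERHEAD = MAX_FRAME_BYTES - MAX_PAYLOAD_BYTES    # 24 bytes of protocol overhead

    SHORT_ADDRESS_BITS = 16
    print(2 ** SHORT_ADDRESS_BITS)       # 65,536 short addresses -> "over 65000 nodes"

    def frames_needed(message_bytes):
        """How many maximum-size payloads a message would occupy (illustrative only)."""
        return -(-message_bytes // MAX_PAYLOAD_BYTES)        # ceiling division

    print(frames_needed(500))            # a 500-byte message needs 5 frames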


ZigBee

SafePlug 1203
• The SafePlug 1203 electrical outlet installs into a standard
120V 15A outlet and provides two independently controlled
receptacles.
• Each receptacle features independent:
– Power (energy) monitoring,
– Line voltage monitoring,
– On/off control,
– Groups/Scenes support,
– Appliance tracking,
– Fire and shock protection, and
– Cold load start smoothing

ZigBee

Access Controller (Key Fob)


• AlertMe Access controller (Keyfob) is used as a keyfob for the
AlertMe Intruder Alarm service.
• Its presence in or absence from the system also indicates the premises' occupancy.
• The casework contains a battery, indicator LEDs, piezoelectric
sounder, low-power micro-controller and ZigBee radio.
• The fob reports its status every two minutes and button
pushes as they occur


ZigBee

AlertMe ZigBee Home Automation devices: Occupancy Sensor, Door Sensor, ON/OFF Switch (Button)


ZigBee

• The Smart Home ecosystem will use a standardized technology such as ZigBee to communicate via the set-top box to local controllers as well as smartphones and mobile devices over the web.
• In addition to providing the existing Four Plays - TV and
entertainment, internet access, phone service (VoIP), and cell
phone services - operators will be adding the Fifth Play – smart
home services for monitoring energy usage, home health,
security, climate control, etc.
• Companies like Comcast, Time Warner and Verizon are already
marketing and installing these types of Fifth Play Smart Home
solution

ZigBee


ZigBee

The development of smart home services may trigger the emergence of many business models, providing a diverse range of what may become essential components for the smart home ecosystem. These may include:
– A security breach in the home can immediately send a text message to the
home owner and/or a response company.
– Water leaks or gas leaks can be immediately identified and alerted – saving
money and preventing damage.
– Elderly people can be monitored by their children and medical staff via smart
phone and alerts.
– Medicine consumption can be automatically monitored.
– Air-conditioning or heaters turn off when windows get opened.
– Lights are switched off in rooms where are no people.
– Roof top solar panels can be monitored and controlled to ensure optimal
operating efficiency.
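A few of the scenarios above expressed as toy automation rules. The sensor names and the state dictionary are invented; in practice, alerts and device commands would be routed through the home gateway or set-top box.

    def evaluate_rules(state):
        """state: hypothetical snapshot of home sensor readings."""
        actions = []
        if state.get("security_breach"):
            actions.append("text home owner and response company")
        if state.get("water_leak") or state.get("gas_leak"):
            actions.append("raise leak alert")
        if state.get("window_open") and state.get("hvac_on"):
            actions.append("turn off air-conditioning/heater")
        if state.get("lights_on") and not state.get("room_occupied", True):
            actions.append("switch off lights")
        return actions

    print(evaluate_rules({"window_open": True, "hvac_on": True,
                          "lights_on": True, "room_occupied": False}))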


Smart Thermostats


Siemens APOGEE Floor Level Network Controller

Interface to heaters, fans, AC units, and lighting through field level controllers


Smart meter


Kwikset SmartCode


Wireless Outlet


Philips 431643 Hue Personal Wireless Lighting, Starter Pack


Nest Learning Thermostat - 2nd Generation T200577


ADT Monitored Home Security Alarm Systems


Navigation System using ZigBee Wireless Sensor Network for Parking


Efficient Home Energy Management


Thank You
ALL THE BEST

