SSWT ZG526
Distributed Computing
BITS Pilani
Pilani | Dubai | Goa | Hyderabad
LECTURE 1
R1: Kai Hwang, Geoffrey C. Fox, and Jack J. Dongarra, "Distributed and Cloud Computing: From Parallel Processing to the Internet of Things", Morgan Kaufmann, 2012, Elsevier Inc.
R2: John F. Buford, Heather Yu, and Eng K. Lua, "P2P Networking and Applications", Morgan Kaufmann, 2009, Elsevier Inc.
Index
• DISTRIBUTED COMPUTING
• Contact Session – 1
• 1.1 Introduction: introduction to distributed computing in terms of various hardware and software models T1 (Chap. 1)
• 1.2 Multiprocessor and multicomputer systems, distributed system design issues; 1.3 Distributed communication model (RPC); 1.4 Review of different communication models; review of design issues and challenges for building distributed systems
• Contact Session – 2
• M2: Logical Clocks & Vector Clocks
Index
• Contact Session – 3
• 3.1 Global states, principles used to record global states T1 (Chap. 4 & 5)
• Contact Session – 4
• M3: Global state and snapshot recording algorithms
• 3.2 Chandy-Lamport global state recording algorithm for FIFO channels and Lai-Yang algorithm for non-FIFO channels
• 4 Review of the Chandy-Lamport global state recording algorithm and the Lai-Yang algorithm for FIFO and non-FIFO channels
Index
• Contact Session – 5
• 4.1 Causal ordering of messages; Birman-Schiper-Stephenson (BSS) algorithm with example T1 (Chap. 6)
• 4.2 Schiper-Eggli-Sandoz (SES) protocol for causal ordering with example
• 5 Review of the Birman-Schiper-Stephenson (BSS) and Schiper-Eggli-Sandoz (SES) algorithms with examples
• Contact Session – 6
Index
• Contact Session – 7
• 5.5 Token-based DME, broadcast-based algorithm; Suzuki-Kasami algorithm
• Contact Session – 8
Index
• Contact Session – 9
• Contact Session – 10
• M7: Consensus and Agreement Algorithms
Index
• Contact Session – 11
Index
• Contact Session – 13
• Contact Session – 14
Index
• Contact Session – 15
• Contact Session – 16
• Review CS 16: Review of previous modules M6 to M10 for comprehensive exam preparation
• Players log in from different parts of the world into a single virtual world
• Different physical times
– Logical clocks (S2)
• Consistent view of the virtual world
– Global state (S3 & S4)
• Player communication
– Causal ordering in group communication (S5)
• Sending broadcasts on the network, placement of game servers
– Distributed graph algorithms (S6)
Module Details
References: T1 (Chap. 1)
DISTRIBUTED COMPUTING
Networking OS: complex applications sit directly on the network stack and have to solve distributed computing problems themselves.
Distributed OS: easier programming, because the complexity is handled in the distributed middleware / OS services.
MOTIVATION
The motivation for using a distributed system is some or all of the following requirements:
MOTIVATION
2. Resource sharing
• Resources such as peripherals, complete data sets in databases, special libraries, as well as data (variables/files) cannot be fully replicated at all the sites because doing so is often neither practical nor cost-effective.
• Further, they cannot be placed at a single site because access to that site might prove to be a bottleneck. Therefore, such resources are typically distributed across the system.
• For example, distributed databases such as DB2 partition the data sets across several servers, in addition to replicating them at a few sites for rapid access as well as reliability.
MOTIVATION
4. Enhanced reliability
• A distributed system has the inherent potential to provide increased reliability because of the possibility of replicating resources and executions, as well as the reality that geographically distributed resources are not likely to crash or malfunction at the same time under normal circumstances.
MOTIVATION
6. Scalability
• As the processors are usually connected by a wide-area network, adding more processors does not pose a direct bottleneck for the communication network.
PARALLEL SYSTEMS
1. A multiprocessor system
2. A multicomputer parallel system
3. Array processors
MULTIPROCESSOR SYSTEMS
• The processors are usually of the same type, and are housed within the same box/container with a shared memory.
• The interconnection network to access the memory may be a bus, although for greater efficiency it is usually a multistage switch with a symmetric and regular design.
• There are two popular interconnection networks:
1. the Omega network, and
2. the Butterfly network.
MULTIPROCESSOR SYSTEMS
• The two popular interconnection networks, the Omega network and the Butterfly network, are each multi-stage networks formed of 2×2 switching elements.
• Each 2×2 switch allows data on either of the two input wires to be switched to the upper or the lower output wire.
• In a single step, however, only one data unit can be sent on an output wire. So if the data from both input wires is to be routed to the same output wire in a single step, there is a collision.
• Various techniques such as buffering or more elaborate interconnection designs can address collisions.
MULTIPROCESSOR SYSTEMS
Interconnection Networks
[Figures: multi-stage interconnection networks (Omega and Butterfly) and a four-dimensional hypercube.]
ARRAY PROCESSORS
• Array processors belong to a class of parallel computers that are physically co-located, are very tightly coupled, and have a common system clock (but may not share memory and may communicate by passing data using messages).
• Array processors and systolic arrays that perform tightly synchronized processing and data exchange in lock-step, for applications such as DSP and image processing, belong to this category.
• These applications usually involve a large number of iterations on the data.
Flynn's Taxonomy
[Figure: Flynn's taxonomy (SISD, SIMD, MISD, MIMD).]
• Coupling
– Tight: SIMD, MISD, shared-memory systems
– Loose: NOW, distributed systems, no shared memory
• Speedup
– How much faster a program runs when given N processors as opposed to 1 processor: T(1) / T(N)
– Optional reading: Amdahl's Law, Gustafson's Law
• Parallelism / concurrency of a program
– Compare time spent in computation to time spent in communication via shared memory or message passing
• Granularity
– Average number of compute instructions before communication is needed across processors
• Note (see the sketch below):
– Coarse granularity -> distributed systems; otherwise use tightly coupled multiprocessors/multicomputers
– High concurrency does not lead to high speedup if granularity is too small, leading to high overheads
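As a rough numeric illustration of the speedup definition and the Amdahl's Law reading above, here is a minimal Python sketch; the timings and the 5% serial fraction are made-up values, not from the slides:

```python
# Small numeric sketch of speedup and the Amdahl's Law upper bound.
def speedup(t1, tn):
    return t1 / tn                        # how much faster with N processors vs 1

def amdahl_bound(serial_fraction, n):
    # best possible speedup on n processors if a fixed fraction of the work is serial
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)

print(speedup(100.0, 12.5))               # e.g. 100 s on 1 CPU, 12.5 s on 16 CPUs -> 8.0
print(amdahl_bound(0.05, 16))             # ~9.14 even with perfect parallelisation of the rest
```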
MESSAGE QUEUEING
PUBLISH–SUBSCRIBE PATTERN
The following functions must be addressed when designing and building a distributed system:
Communication
• This task involves designing appropriate mechanisms for communication among the processes in the network. Some example mechanisms are remote procedure call (RPC), remote object invocation (ROI), and message-oriented versus stream-oriented communication.
Processes
• Some of the issues involved are: management of processes and threads at clients/servers; code migration; and the design of software and mobile agents.
Naming
• Devising easy-to-use and robust schemes for names, identifiers, and addresses is essential for locating resources and processes in a transparent and scalable manner.
Synchronization
• Mechanisms for synchronization or coordination among the processes are essential.
Fault tolerance
• Fault tolerance requires maintaining correct and efficient operation in spite of any failures of links, nodes, and processes.
Security
• Distributed systems security involves various aspects of cryptography, secure channels, access control, key management (generation and distribution), authorization, and secure group management.
• Mobile networks
– distributed graph algorithms
• Sensor networks collecting and transmitting physical parameters
– large-volume streaming, algorithms for location estimation
• Ubiquitous or pervasive computing, where processors are embedded everywhere to perform application functions, e.g. smart homes
– groups of wireless sensors/actuators connected to a Cloud backend
• Peer-to-peer, e.g. gaming, content distribution
– object storage/lookup/retrieval/replication, self-organising networks, privacy
• Publish-subscribe, e.g. movie streaming, online trading markets: receive only data of interest
– data streaming systems and filtering/matching algorithms
• Distributed agents or robots, e.g. in mobile computing
– swarm algorithms, coordination among agents (like an ant colony)
• Distributed data mining
– e.g. user profiling: data is spread across many repositories
Contact Session – 2
M2 - Logical clocks
References: T1 (Chap. 3)
A distributed program
• The concept of causality between events is fundamental to the design and analysis of parallel and distributed computing and operating systems.
• In distributed systems, it is not possible to have global physical time; it is possible to realize only an approximation of it.
• Causality among events in a distributed system is a powerful concept in reasoning, analyzing, and drawing inferences about a computation.
Causality
Causality is the relationship by which one event, process, or state contributes to the production of another event, process, or state, where the cause is partly responsible for the effect, and the effect is partly dependent on the cause.
Knowledge of the causal precedence relation among events helps ensure liveness and fairness in mutual exclusion algorithms, helps maintain consistency in replicated databases, and helps design correct deadlock detection algorithms to avoid phantom and undetected deadlocks.
Concurrency measure
Knowledge of how many events are causally dependent is useful in measuring the amount of concurrency in a computation. All events that are not causally related can be executed concurrently. Thus, an analysis of the causality in a computation gives an idea of the concurrency in the program.
In distributed debugging, knowledge of the causal dependency among events helps construct a consistent state for resuming re-execution; in failure recovery, it helps build a checkpoint; in replicated databases, it aids in the detection of file inconsistencies in case of a network partition.
Knowledge of the causal dependency among events also helps measure the progress of processes in the distributed computation. This is useful in discarding obsolete information, garbage collection, and termination detection, e.g. knowing how much of a distributed file transfer has completed.
[Figure: space-time diagram illustrating the "happens before" relation.]
» Logically concurrent events may happen at different physical times; this depends on message delays.
» Assume they happened at the same time: the computation result will not change.
[Figure: processes P1, P2, P3 connected by channels C1-C4 carrying messages m1-m4; a FIFO channel (e.g. TCP) delivers messages in the order sent, while a non-FIFO channel (e.g. UDP) may reorder them.]
[Figure: P1 performs Send 1 then Send 2 to P2; under FIFO (and CO) the receives occur as Receive 1 then Receive 2, while a non-FIFO channel may deliver them in either order.]
• A protocol (set of rules) to update the data structures to ensure the consistency condition.
Each process pi maintains data structures that give it the following two capabilities:
• A local logical clock, denoted by lc_i, that helps process pi measure its own progress.
• A logical global clock, denoted by gc_i, that is a representation of process pi's local view of the logical global time. It allows the process to assign consistent timestamps to its local events.
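For scalar (Lamport) clocks, lc_i and gc_i collapse into a single integer. A minimal sketch, assuming increment d = 1; the class and method names are illustrative, not from the slides:

```python
# Lamport scalar clock sketch: R1 increments before each event,
# R2 takes the max of local and piggybacked timestamps on receive, then applies R1.
class ScalarClock:
    def __init__(self):
        self.time = 0                      # single integer plays the role of lc_i / gc_i

    def tick(self):                        # R1 (d = 1)
        self.time += 1
        return self.time

    def on_send(self):
        return self.tick()                 # timestamp piggybacked on the outgoing message

    def on_receive(self, msg_time):        # R2, then R1
        self.time = max(self.time, msg_time)
        return self.tick()

p1, p2 = ScalarClock(), ScalarClock()
t = p1.on_send()                           # p1's clock -> 1
print(p2.on_receive(t))                    # p2's clock -> 2
```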
[Figure: scalar clock example across processes P1, P2, P3 with events e1-e6, showing rule R2 applied at receive events (values 2 and 3 in the example).]
Another example
• Distributed algorithms such as resource synchronization often depend on some method of ordering events to function.
• The processes send messages to each other, and also send messages to the disk requesting access.
• The disk grants access in the order the messages were sent.
• Now, imagine process 1 sends a message to the disk asking for write access, and then sends a message to process 2 asking it to read.
• Process 2 receives the message, and as a result sends its own message to the disk.
• Now, due to some timing delay, the disk receives both messages at the same time: how does it determine which message happened before the other?
• A logical clock algorithm provides a mechanism to determine facts about the order of such events.
Basic Properties
Consistency Property
• Scalar clocks satisfy the monotonicity and hence the consistency property: for two events ei and ej, ei -> ej implies C(ei) < C(ej).
Consistency
[Figure: with d = 1, events e1, e2 on P1 get timestamps 1, 2; e3 on P2 gets 1; e4, e5, e6 on P3 get 1, 2, 3; causally ordered events have increasing scalar timestamps.]
Total Ordering
[Figure: e1 on P1 and e3 on P2 both have scalar timestamp 1.]
Use the process index (P1 < P2) to make e1 precede e3 and force a total order.
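A minimal sketch of this tie-breaking rule: each event is identified by the pair (scalar timestamp, process index), and pairs are compared lexicographically. The event and process numbers below are illustrative:

```python
# Total order from scalar clocks: compare (timestamp, process_id) lexicographically.
def total_order_key(scalar_time, pid):
    return (scalar_time, pid)

e1 = total_order_key(1, 1)    # event e1 on P1 with clock value 1
e3 = total_order_key(1, 2)    # concurrent event e3 on P2, also with clock value 1
print(e1 < e3)                # True: the process index (P1 < P2) forces e1 before e3
```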
Properties (1)
No Strong Consistency
• Scalar clocks are not strongly consistent: C(ei) < C(ej) does not imply ei -> ej. The reason is that the local logical clock and the local view of the global clock are squashed into one value, losing the causal dependency information among events at different processes.
Properties (2)
Event counting
• If the increment value d is always 1, scalar time has the following interesting property: if event e has a timestamp h, then h - 1 represents the minimum logical duration, counted in units of events, required before producing event e.
[Figure: example with height(e4) = 2.]
Vector time
• The system of vector clocks was developed independently by Fidge, Mattern, and Schmuck.
• In the system of vector clocks, the time domain is represented by a set of n-dimensional non-negative integer vectors.
Process pi uses the following two rules, R1 and R2, to update its clock:
• R1: Before executing an event, process pi updates its local logical time as follows:
vt_i[i] := vt_i[i] + d (d > 0)
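A minimal vector clock sketch with d = 1. Rule R2 (component-wise maximum of the local and piggybacked vectors, then R1) is spelled out in full later in these slides; the class and method names here are illustrative:

```python
# Vector clock sketch for n processes, process index i, increment d = 1.
class VectorClock:
    def __init__(self, n, i):
        self.i = i
        self.vt = [0] * n

    def tick(self):                          # R1: vt[i] := vt[i] + 1 before an event
        self.vt[self.i] += 1

    def on_send(self):
        self.tick()
        return list(self.vt)                 # piggyback a copy of vt on the message

    def on_receive(self, m_vt):              # R2: component-wise max, then R1
        self.vt = [max(a, b) for a, b in zip(self.vt, m_vt)]
        self.tick()

p1, p2 = VectorClock(3, 0), VectorClock(3, 1)
m = p1.on_send()                             # p1 -> [1, 0, 0]
p2.on_receive(m)                             # p2 -> [1, 1, 0]
print(p1.vt, p2.vt)
```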
[Figure: vector clock example on P1, P2, P3. P1's events e1, e2 get (1,0,0) and (2,0,0); P3's event e4 gets (0,0,1); P2 receives (2,0,0) from P1 and (0,0,1) from P3, and its events e3, e5 get (2,1,0) and (2,2,1). At each receive, R2 takes the maximum of the remote clock carried in the message and the local knowledge, and then the local clock entry is updated as per R1.]
• The following relations are defined to compare two vector timestamps, vh and vk: equality (=), less-or-equal (<=), less-than (<), and concurrent (||); see the sketch below.
Examples:
(1,0,1) = (1,0,1)
(0,1,1) < (1,2,1)
(1,0,0) || (0,0,2)
[Figure: P1's events e1-e3 have timestamps (1,0,0), (2,0,0), (3,0,0); P3's events e6-e8 have (0,0,1), (0,0,2), (0,0,3).]
Now we can confidently say, with full knowledge, that e1 || e7.
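A small sketch of these comparison tests, using the standard definitions; the function names are illustrative:

```python
# Comparing two vector timestamps vh and vk.
def vt_leq(vh, vk):                    # vh <= vk : every entry is <=
    return all(a <= b for a, b in zip(vh, vk))

def vt_less(vh, vk):                   # vh < vk : <= and not equal (causally precedes)
    return vt_leq(vh, vk) and list(vh) != list(vk)

def vt_concurrent(vh, vk):             # vh || vk : neither <= the other
    return not vt_leq(vh, vk) and not vt_leq(vk, vh)

print(vt_less([0, 1, 1], [1, 2, 1]))          # True
print(vt_concurrent([1, 0, 0], [0, 0, 2]))    # True, matching e1 || e7 above
```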
• If the processes at which two events occurred are known, the test to compare their timestamps can be simplified: if events x and y occurred at processes pi and pj respectively, then x -> y iff V_x[i] <= V_y[i], and x || y iff V_x[i] > V_y[i] and V_x[j] < V_y[j].
Examples (with V_e denoting the vector timestamp of event e):
e2 -> e5 because V_e2[P1] <= V_e5[P1]
e1 || e7 because V_e1[P1] > V_e7[P1] and V_e1[P3] < V_e7[P3]
Properties (1)
Strong Consistency
• The system of vector clocks is strongly consistent; thus, by examining the vector timestamps of two events, we can determine whether the events are causally related.
• However, Charron-Bost showed that the dimension of vector clocks cannot be less than n, the total number of processes in the distributed computation, for this property to hold.
Properties (2)
Event Counting
• With d = 1, the ith entry vt_i[i] counts the number of events executed so far by process pi.
Instead of sending the whole vector, a process can send only the entries that have changed since its last message to the same destination, as a list of pairs:
[
{ <changed position>, <value> },
…
]
Matrix clock
• A matrix clock is a mechanism for capturing chronological and causal relationships in a distributed system.
• Matrix clocks are a generalization of the notion of vector clocks: a matrix clock maintains a vector of the vector clocks of each communicating host.
• Every time a message is exchanged, the sending host sends not only what it knows about the global state of time, but also the latest vector clocks it has received from the other hosts.
• This allows establishing a lower bound on what other hosts know, and is useful in applications such as checkpointing and garbage collection.
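A rough sketch of the idea; the class name, the exact merge rule, and the full-matrix piggybacking are illustrative assumptions rather than a prescribed algorithm. Row i of the matrix is the process's own vector clock, and row k is its latest view of P_k's vector clock:

```python
# Matrix clock sketch for n processes, process index i.
class MatrixClock:
    def __init__(self, n, i):
        self.n, self.i = n, i
        self.mt = [[0] * n for _ in range(n)]

    def tick(self):
        self.mt[self.i][self.i] += 1           # own event counter

    def on_send(self):
        self.tick()
        return [row[:] for row in self.mt]     # piggyback the whole matrix

    def on_receive(self, m, sender):
        for k in range(self.n):                # merge what the sender knows about everyone
            self.mt[k] = [max(a, b) for a, b in zip(self.mt[k], m[k])]
        self.tick()

    def all_know_at_least(self, k):
        # lower bound: every process is known to have seen at least this many events of P_k,
        # which is what checkpointing / garbage-collection decisions rely on
        return min(row[k] for row in self.mt)
```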
• The Fowler-Zwaenepoel direct dependency technique reduces the size of messages by transmitting only a scalar value in the messages.
• Instead, a process only maintains information regarding its direct dependencies on other processes.
• A vector time for an event, which represents transitive dependencies on other processes, is constructed off-line from the recorded direct dependency information.
Fowler-Zwaenepoel's example
[Figure: worked example of the direct dependency technique.]
Definitions
» Time: Cp(t) is the time of the clock at process p; Cp(t) = t for a perfect clock.
» Frequency: C'p(t) is the rate at which the clock progresses.
» Offset: Cp(t) - t is the offset from real time t.
» Skew: C'p(t) - C'q(t) is the difference in frequency between the clocks of processes p and q.
» If skews are bounded by r, then dC/dt is allowed to diverge in the range 1 - r to 1 + r (clock accuracy).
» Drift (rate): C''p(t) is the rate of change of frequency at process p.
NTP
» A hierarchy of time servers on a spanning tree.
» Root-level primaries (stratum 1) synchronise with UTC with microsecond offsets from attached stratum 0 devices.
» Second-level servers (stratum 2) are backup servers synchronised with stratum 1 servers.
» The maximum level is stratum 15.
» At the lowest level are the clients, which can be configured with NTP server(s) to use for running a synchronisation algorithm.
» The synchronisation algorithm uses request-reply NTP messages and an offset-delay estimation technique to estimate the round-trip delay and hence the clock offset between two servers.
» Offset: tens of milliseconds over the Internet, under a millisecond in a LAN; but asymmetric routes and congestion can cause problems in measurement.
Offset-delay estimation
[Figure: A sends a request at T1, B receives it at T2 and replies at T3, A receives the reply at T4; asymmetric routes and congestion can cause estimation challenges.]
» From the perspective of A, T1, T2, T3, T4 are available.
» Offset = [ (T2 - T1) + (T3 - T4) ] / 2
» Round-trip delay = (T4 - T1) - (T3 - T2)
» Continuously send messages to estimate the offset and correct the local clock of A.
» Choose the offset that has the minimum delay (typically over the last 8 samples).
» CA(t) = CB(t) + Offset
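A small sketch of this estimation; the sample timestamp values are made up for illustration:

```python
# NTP-style offset/delay estimation from the four timestamps
# T1 (request sent by A), T2 (request received by B),
# T3 (reply sent by B),   T4 (reply received by A).
def ntp_estimate(t1, t2, t3, t4):
    offset = ((t2 - t1) + (t3 - t4)) / 2.0    # estimated offset of B relative to A
    delay = (t4 - t1) - (t3 - t2)             # round-trip delay minus processing time at B
    return offset, delay

samples = [ntp_estimate(*s) for s in [(0.0, 10.2, 10.3, 0.7),
                                      (5.0, 15.4, 15.5, 5.6)]]
best_offset, _ = min(samples, key=lambda p: p[1])   # keep the offset with the minimum delay
print(best_offset)                                  # CA(t) would be corrected by this amount
```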
MongoDB
» Consistency is a design priority.
» All writes and reads happen through the primary replica; hence any read should return the latest consistent value.
» Reads and writes are ordered using a logical clock implementation.
» Optionally, consistency can be loosened by reading from replicas.
** Cassandra: https://www.datastax.com/blog/why-cassandra-doesnt-need-vector-clocks
Contact Session – 3
M3 - Global state and snapshot recording
References: T1 (Chap. 4)
[Example: three account servers A, B, C hold 100, 200, and 100 respectively, so the total in the system is 400. Then T1: A transfers 50 to B, and T2: B transfers 100 to C.]
[While these transfers are in transit, a recorded global state has to capture the in-transit amounts (the 50 from A and the 100 from B) correctly: the total money in the system cannot change and must still add up to 400.]
More examples
What is the total number of files in a system when files are moved around across machines?
What is the total space left on the system across storage nodes?
Global snapshots are also used for:
– termination detection
– deadlock detection
[Figures: cuts through local states LS1-LS12 of the processes. A consistent cut cannot include an event M whose send is in the FUTURE of the cut and whose receive is in the PAST: that violates causal ordering.]
• Deterministic computation
• At any point in the computation there is at most one event that can happen next.
• Non-deterministic computation
• At any point in the computation there can be more than one event that can happen next.
Example
[Figure: two processes A and B exchanging messages m and n. The deterministic execution is serial and synchronous; in the non-deterministic execution multiple interleavings are possible, one of which is shown with dotted arrows.]
Chandy-Lamport (2)
• Start: the algorithm can be initiated by any process by executing the "Marker Sending Rule", by which
• (a) it records its local state, and
• (b) it sends a marker on each outgoing channel before sending any more messages.
[Figure: P1 records state S1 and sends marker M1 on its outgoing channels to P2 and P4.]
Chandy-Lamport (3)
• Propagation: a process executes the "Marker Receiving Rule" on receiving a marker on an incoming channel C.
• (a) If the process has not yet recorded its local state, it records the state of C as empty and executes the "Marker Sending Rule" to record its local state.
• (b) If the process has recorded its state already, it records the messages received on C after its last recording and before this marker on C.
[Figure: P2 receives M1 from P1, records state S2, and propagates markers on its own outgoing channels.]
Chandy-Lamport (4)
• Termination: the algorithm terminates after each process has received a marker on all of its incoming channels.
• Dissemination: all the local snapshots are disseminated to all other processes, so every process can determine the global state. A strongly connected graph helps even if the topology is arbitrary.
• Disseminate and forward local snapshots on outgoing channels so that everyone can construct a global snapshot.
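A very small sketch of the two marker rules at a single process, assuming FIFO channels. The class, the send callback, and the channel identifiers are placeholders, not a prescribed API:

```python
# Chandy-Lamport marker rules at one process (FIFO channels assumed).
MARKER = "MARKER"

class SnapshotProcess:
    def __init__(self, in_channels, out_channels, send):
        self.in_channels, self.out_channels = in_channels, out_channels
        self.send = send                                  # send(channel, message) callback
        self.recorded_state = None
        self.channel_state = {c: [] for c in in_channels} # recorded in-transit messages
        self.recording = set()                            # incoming channels still recording

    def marker_sending_rule(self, local_state):
        self.recorded_state = local_state                 # (a) record own state
        self.recording = set(self.in_channels)
        for c in self.out_channels:                       # (b) marker before further messages
            self.send(c, MARKER)

    def on_message(self, channel, msg, local_state):
        if msg == MARKER:
            if self.recorded_state is None:               # rule (a): channel recorded as empty
                self.marker_sending_rule(local_state)
            self.recording.discard(channel)               # stop recording this channel
            return
        if self.recorded_state is not None and channel in self.recording:
            self.channel_state[channel].append(msg)       # rule (b): message was in transit
```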
FIFO is a must
[Figure: 'A' initiates recording of the global state in the money-transfer example (channels c1-c4 between A, B, C). Because markers follow the application messages on FIFO channels, the in-transit transfers (50 and 100) are captured in the channel states and the recorded total remains 400.]
Example
ref: https://homepage.cs.uiowa.edu/~ghosh/10-16-03.pdf
• A recorded state is a feasible state that is reachable from the initial configuration.
• The final state is always reachable from the recorded state.
• The recorded state may not actually be visited in the real execution.
• These states are useful:
• for debugging, because it is possible for the system to reach the recorded state in some execution instance;
• for checking stable predicates;
• for checkpointing.
Improvements of Chandy-Lamport
• Optimizations:
• when multiple snapshots are triggered, leading to inefficiency and redundant load;
• to disseminate local states more efficiently, since flooding is too expensive: build a spanning tree during marker propagation (works in undirected graphs).
• Provides two optimisations:
• avoid taking redundant snapshots when there are multiple initiators, using "regions" or territories for each initiator;
• leads to simpler dissemination of local snapshots.
• Assumes undirected edges.
Venkatesan's outline
1. The initiator sends init_snap along the spanning tree to start an iteration for one global snapshot version.
2. A process receiving init_snap or regular follows the modified marker sending rule and waits for an ack.
3. When a leaf process finishes its local snapshot and has all acks, it sends a snap_complete to its parent.
4. When a non-leaf process gets all acks and snap_complete from its children, it sends snap_complete to its parent.
5. The algorithm terminates when the initiator receives all acks and snap_complete from its children.
Venkatesan’s example
Initiator
m
3 2
2 sends regular and gets
3 sends regular and gets back ack
m
back ack
SSWT ZG526 - Distributed Computing 41 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
41
SSWT ZG526 - Distributed Computing 42 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
42
21
8/9/2024
SSWT ZG526 - Distributed Computing 43 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
43
[Figure: P1 takes a snapshot while messages m1, m2, m3 are exchanged with P2. P2 has to take its snapshot by this time (recall the Issue 2 rule for consistent snapshots).]
Example
• Snapshot initiated: P1 turns RED. Its history records send [m1, m2] plus the local state of P1.
• The initiator gets both histories in the snapshot and figures out that m2 is in transit, while m1 and m3 are accounted for in the local snapshot of P2.
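A sketch of the white/red colouring behind this example (Lai-Yang works on non-FIFO channels). The class and method names are illustrative, and the full message histories are kept here rather than being pruned incrementally:

```python
# Lai-Yang colouring sketch: white before the local snapshot, red after;
# every message carries the sender's colour, and histories reveal in-transit messages.
class LYProcess:
    def __init__(self):
        self.color = "white"
        self.sent, self.received = [], []       # message histories so far

    def take_snapshot(self, state):
        self.color = "red"
        # histories recorded here contain exactly the white (pre-snapshot) messages
        self.local_snapshot = (state, list(self.sent), list(self.received))

    def on_send(self, msg):
        self.sent.append(msg)
        return (self.color, msg)                # colour piggybacked on the message

    def on_receive(self, color, msg, state):
        if color == "red" and self.color == "white":
            self.take_snapshot(state)           # snapshot before processing a red message
        self.received.append(msg)

def in_transit(sender_white_sent, receiver_white_received):
    # messages sent white but not received white were in transit at snapshot time (e.g. m2)
    return [m for m in sender_white_sent if m not in receiver_white_received]
```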
Applications of snapshots
» Looking for a "consistent and stable" state across processes.
» Checkpointing: restart a system from a recorded global state and a log of messages sent after the checkpoint.
» Termination: useful in batch processing systems to know when a set of distributed tasks has completed.
» Deadlock: similar to termination, but the jobs are stuck.
» Debugging: inspect the global state to find issues.
Why is it important?
Distributed database nodes can take periodic checkpoints without strict time coordination with other nodes. We need to pick which of these checkpoints form a correct recording of the global state across all nodes.
» Necessary condition: two checkpoints cannot have a causal order, i.e. they should be parallel or potentially parallel. E.g. C1,0 and C3,1 have a causal ordering via {m1, m2} and cannot be part of a consistent set of local checkpoints. But this is not a sufficient condition. Why?
» C1,1 and C3,2 have no causal order but still cannot be part of a consistent global state.
» C1,1 cannot pair with C2,2, and C3,2 cannot pair with C2,1, because of causal order.
Zigzag path
Summary
» Why the global snapshot is important.
» Algorithms for FIFO channels using markers and for non-FIFO channels using coloring.
» Optimized FIFO algorithms.
» Properties of the snapshot in the Chandy-Lamport algorithm.
» No zigzag path as a necessary and sufficient condition of consistency among snapshots.
» Study material: Chapter 4 of Text book 1 (T1).
Topology abstractions
» Physical
» Actual network-layer connections.
» Logical
» Defined in the context of a particular application. Nodes are distributed application nodes. Channels can be given useful properties, e.g. FIFO or non-FIFO. Can define arbitrary connectivity or neighbourhood relations, e.g. in P2P overlays.
» Super-imposed
» A higher level on the logical topology, e.g. a spanning tree or ring, to provide efficient paths for information dissemination and gathering.
Classifications (1)
» Application vs control executions
» The application execution is the core program logic.
» Control algorithms or "protocols" are part of the distributed middleware. The latter may piggyback on the application execution (e.g. the Lai-Yang global snapshot) but does not interfere with it.
» Centralized vs distributed
» Centralized: e.g. client-server systems.
» Purely distributed yet efficient algorithms are harder. Many distributed algorithms have a centralised component, e.g. Chandy-Lamport has an initiator that assembles the global snapshot. Peer-to-peer systems demand more distributed logic.
Classifications (2)
» Symmetric vs asymmetric
» Symmetric: all processors execute the same logic, e.g. logical clock algorithms.
» Centralized algorithms, or cases where nodes perform different functions (e.g. root and leaf nodes, or leaders), are asymmetric.
» Anonymous
» Does not use process identifiers. Structurally elegant but harder to build and sometimes impossible, e.g. anonymous leader election, total ordering with Lamport clocks. Typically the process id is used for resolving ties.
Classifications (3)
» Uniform
» Does not use n, the number of processes, in the code and thus allows scalability transparency. E.g. leader election in a ring only talks to 2 neighbours.
» Adaptive
» When complexity is a function of k < n rather than n, where n is the number of nodes/processes, e.g. mutual exclusion algorithms whose cost depends on the k contenders.
» Deterministic vs non-deterministic
» Deterministic programs have no non-deterministic receives, i.e. they always specify a source from which to receive a message.
» Non-deterministic executions may produce different results each time. Harder to reason about and debug.
» Even in an async system a deterministic execution will produce the same partial order each time, but not a non-deterministic execution.
Classifications (4)
» Execution inhibition
» Inhibitory protocols freeze/suspend the normal execution of a process till some conditions are met. This can be local inhibition (e.g. waiting for a local condition) or global (e.g. waiting for a message from another process).
» Message ordering or state recording algorithms may use inhibition.
» Sync vs async
» Sync has:
1. a known upper bound on communication delay,
2. a known bounded drift rate of the local clock w.r.t. real time,
3. a known upper bound on the time taken to perform a logical step.
» Async has none of the above satisfied. Typical distributed systems are async but can be made sync with synchronisers.
Classifications (5)
» Online vs offline
» An online algorithm works on data that is being generated.
» Preferred, e.g. for debugging, scheduling, etc., to work with the latest dynamic data.
» Offline needs all data to be available first.
» Wait-free
» Resilient to n-1 process failures, thus offering a high degree of robustness. But very expensive and may not be possible in many cases.
» Operations of a process complete in a bounded number of steps despite failures of other processes.
» Useful for real-time systems.
» e.g. lock-less page cache patches in the Linux kernel, RethinkDB.
» Will be referred to in the session on "consensus protocols".
Classifications (6)
» Failure models
» Important to specify for an algorithm that claims to have some fault tolerance features.
» Process failure models with increasing severity:
» Fail-stop: the process stops at an instant and others know about it.
» Crash: same as above, but others don't know about it.
» Receive/send or general omission: intermittently fails to send and/or receive some messages, or fails by crashing.
» Timing failures: for sync systems only, e.g. violating the time bound to finish a step.
» Byzantine failure with authentication: arbitrary faults, but if a faulty process claims to have received a message from a correct process, this can be verified.
» Byzantine failure: same as above, but no verification is possible.
» In addition, link failure models include all of the above except fail-stop.
* Negative weights? Think of currency exchange, ISP contracts for routing, etc.
Round 1 of n-1
[Figure: directed graph on nodes S, A, B, C, D, E with edge weights 10, 8, 2, 1, 1, -1, -2, -4 (some negative); UPDATE messages carry the current distance estimates.]
Length of S->X where X in {A, B, C, D, E}, shown as rows (S, A, B, C, D, E) as estimates are updated within round 1:
0 inf inf inf inf inf
0 10 inf inf inf 8    (A and E get to know their distance from S)
0 10 inf 12 inf 8     (C gets to know the distance of S from A and already knows its own distance to A)
0 10 inf 12 inf 8     (B has no new information about S yet)
0 10 10 12 inf 8
0 10 10 12 inf 8      (D has no new information)
0 10 10 12 9 8        (dist(S->E) and dist(E->D) are known, so dist(S->D) is now known)
Round 2 of n-1
0 10 10 12 9 8        (no new information with S, A, B, C in round 2 so far)
0 5 10 8 9 8          (dist(S->D) can now be used to update dist(S->X) for all neighbours X of D: a better path to A is now via D instead of directly from S, and a better path to C is via D instead of via A)
0 5 10 8 9 8          (no new information with E; up to hop 2 is now stable at the end of round 2)
Rounds 3 to 5
Round 3: 0 5 5 7 9 8
Round 4: 0 5 5 7 9 8
Round 5: 0 5 5 7 9 8
The final parent-child links converge before n-1 rounds (n = 6). Notice that E stabilised by the first round and D by the second round, so hop i in the final path stabilises by the ith round.
Some observations (see the sketch below)
» Works when no cycle has negative total weight.
» Time complexity: O(n) rounds.
» Number of messages: (n-1) * (#links).
» Constructs a shortest-path (parent) tree as well.
» The MST algorithm discussed later does not handle negative weights and is for undirected graphs.
» By round k, the kth hops stabilise.
» The parent variable is the key to the routing function: it tells each node D how to reach S hop by hop.
» Used in the Internet Distance Vector Routing protocol.
» But dynamic changes to the network need to be handled.
» Virtually synchronous, with timeouts when UPDATEs get lost.
» Async version: large time and message complexity, and termination is complex to determine.
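A compact sketch of the synchronous, round-based computation described above, written as a single loop that simulates the n-1 rounds of UPDATE exchanges; the three-node graph at the bottom is a made-up example, not the one in the slides:

```python
# Synchronous distributed Bellman-Ford (simulated rounds of UPDATE messages).
import math

def sync_bellman_ford(nodes, edges, source):
    # edges: dict mapping a directed link (u, v) to its weight
    dist = {v: math.inf for v in nodes}
    parent = {v: None for v in nodes}
    dist[source] = 0
    for _ in range(len(nodes) - 1):                 # rounds 1 .. n-1
        updates = []                                # UPDATEs computed from current estimates
        for (u, v), w in edges.items():
            if dist[u] + w < dist[v]:
                updates.append((v, dist[u] + w, u))
        for v, d, u in updates:                     # applied at the end of the round
            if d < dist[v]:
                dist[v], parent[v] = d, u
    return dist, parent

nodes = ["S", "A", "B"]
edges = {("S", "A"): 10, ("S", "B"): 3, ("B", "A"): -4}
print(sync_bellman_ford(nodes, edges, "S"))         # A settles at -1 via B; parent[A] == "B"
```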
[Figures: ring-based leader election example with IDs (e.g. ID(B), ID(C)) circulating round by round; Pi declares itself the leader and sends a message around the ring.]
CDS algorithms
» Simple heuristics:
» Create an MST and delete the leaf nodes (i.e. the edges to all leaf nodes); the internal nodes form a CDS.
» An MIS is also a dominating set, so create an MIS and add edges to connect it.
» It is non-trivial to devise a good approximation algorithm.
Synchronizers (1)
» Synchronous algorithms work in lock-step across processors.
» Easier to program than async algorithms.
» On each clock tick / round, a process:
» receives messages
» computes
» sends messages
» A synchroniser protocol enables a sync algorithm to run on an async system.
» Think of traffic lights on a straight road synchronising the flow of cars from one light to the next.
Synchronizers (2)
» Every message sent in clock tick k must be received in clock tick k.
» All messages carry a clock tick value.
» If d is the maximum propagation delay on a channel, then each process can start simulating a new clock tick after 2d time units.
» Simulates the lock-step mode of operation using a protocol: finish all work for clock tick k before moving on.
» Clock ticks can be broadcast by a leader, or by having heartbeat-like control messages on channels (in the absence of data messages in a clock tick).
» Think of the rounds in the sync Bellman-Ford algorithm for single-source shortest paths.
α-Synchronizer
Repeat for each clock tick at each process Pi in round / clock tick k:
1. Send and receive all messages m for the current clock tick.
2. Send an 'ack' for each message received and receive an 'ack' for each message sent.
3. Send a 'safe' message to all neighbours after sending and receiving all 'ack' messages.
4. Move to round k+1 after getting 'safe' from all neighbours in round k.
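A skeleton of the per-process bookkeeping for these four steps; the send_to(dest, msg) transport callback and the tuple message format are placeholder assumptions, not a prescribed API:

```python
# Alpha-synchronizer bookkeeping at one process.
class AlphaSync:
    def __init__(self, pid, neighbours, send_to):
        self.pid, self.neighbours, self.send_to = pid, set(neighbours), send_to
        self.tick = 0
        self.pending_acks = 0
        self.safe_from = set()

    def start_tick(self, messages):
        # Step 1: send this tick's application messages (list of (dest, payload) pairs).
        self.pending_acks = len(messages)
        self.safe_from = set()
        for dest, payload in messages:
            self.send_to(dest, ("MSG", self.tick, payload))
        self.maybe_safe()                              # no messages -> immediately safe

    def on_receive(self, sender, msg):
        kind, tick, payload = msg
        if kind == "MSG":
            self.send_to(sender, ("ACK", tick, None))  # step 2: ack every message received
        elif kind == "ACK":
            self.pending_acks -= 1
            self.maybe_safe()
        elif kind == "SAFE":
            self.safe_from.add(sender)
            if self.safe_from == self.neighbours:      # step 4: all neighbours are safe
                self.tick += 1                         # move to the next clock tick

    def maybe_safe(self):
        if self.pending_acks == 0:                     # step 3: all acks received
            for n in self.neighbours:
                self.send_to(n, ("SAFE", self.tick, None))
```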
β-Synchronizer
1. Construct a spanning tree of the nodes with a root.
2. The root of the spanning tree initiates a new clock tick by sending a broadcast 'next' message for clock tick j.
3. All processes exchange messages and 'ack's in the same way as in the alpha-synchronizer.
4. The 'safe' messages use a convergecast from the leaves to the root:
1. Each process responds with ack(j), and then with safe(j) once its entire subtree has sent ack(j) and safe(j).
2. Once the root receives safe(j) from all children, it initiates clock tick j+1 using a broadcast 'next' message.
Summary
GHS minimum spanning tree algorithm reference: https://slidetodoc.com/minimum-spanning-trees-gallagherhumbletspira-ghs-algorithm-1-weighted/
[Figure: example weighted graph on nodes a-h with a maximal independent set MIS = {d, h, a, f}.]
Message orders
1. Async execution / non-FIFO
• Messages can be delivered in any order on a link.
2. FIFO
• Messages from the same sender are delivered in the same order.
3. Causal order (CO)
• If send(m1) causally precedes send(m2), then m1 is delivered before m2 at every common destination.
4. Sync execution
• Send and receive happen at the same instant, as an atomic transaction.
FIFO example
[Figure: an execution in which messages from the same sender are delivered out of order, which violates FIFO.]
CO example
[Figure: Lamport clock values (1), (2), (3) on the events; m1 "happens before" m2, so CO requires m1 to be delivered before m2 at a common destination.]
Group communication
[Figure: group communication support at the network level.]
• Applications
• Akamai content distribution system
• Air Traffic Control communication to relay orders
• Facebook updates
• Distributed DB replica updates
• Multiple modes of communication
• Unicast: message sent one-to-one
• Broadcast: message sent one-to-all in the group
• Multicast: message broadcast to a sub-group
• Kind of message ordering
• FIFO, non-FIFO, sync, CO
[Figure: two communicating processes updating a group of 3 replicas.]
[Figure: without a total order the replicas may apply the updates differently (R1 sees P2 then P1, while R2 and R3 see P1 then P2); with a total order all replicas apply them in the same sequence (P2 then P1 at R1, R2, and R3).]
CO criteria
If send(m1) causally precedes send(m2), then m1 must be delivered before m2 at every common destination process.
Notes:
a. Safety and liveness criteria are commonly stated in many protocols.
b. In a system with FIFO channels, CO has to be implemented by the control protocol.
cloud.mongodb.com
MongoDB
• Document-oriented DB
• Various read and write choices for a flexible consistency tradeoff against scale/performance and durability
• Automatic primary re-election on primary failure and/or network partition
Example in MongoDB
https://engineering.mongodb.com/post/ryp0ohr2w9pvv0fks88kq6qkz9k9p3
Read concerns:
• local:
• the client reads from the primary replica;
• the client reads from a secondary in causally consistent sessions.
• available:
• read on a secondary, but causal consistency is not required.
• majority:
• if the client wants to read what a majority of nodes have; the best option for fault tolerance and durability.
• linearizable:
• if the client wants to read what had been written to a majority of nodes before the read started;
• has to be read on the primary;
• only a single document can be read.
https://docs.mongodb.com/v3.4/core/read-preference-mechanics/
Write concerns:
https://docs.mongodb.com/manual/reference/write-concern/
• read=majority, write=majority
• W1 and R1 will fail for P1 and will succeed in P2.
• So causally consistent and durable even with a network partition, sacrificing performance.
• Example: used in critical transaction-oriented applications, e.g. stock trading.
• read=majority, write=1
• W1 may succeed on P1 and P2. R1 will succeed only on P2. W1 on P1 may roll back.
• So causally consistent but not durable under a network partition. Fast writes, slower reads.
• Example: Twitter - a post may disappear, but if you still see it on refresh then it should be durable; otherwise repost.
• read=available, write=majority
• W1 will succeed only for P1, and reads may not see the last write. Slow durable writes and fast non-causal reads.
• Example: a review site where a write should be durable but reads don't need a causal guarantee, as long as the review appears at some point (eventual consistency).
• read=local, write=1
• Same as the previous scenario, but writes are also not durable and may be rolled back.
• Example: a real-time sensor data feed that needs fast writes to keep up with the rate, and reads should get as much recent real-time data as possible. Data may be dropped on failures.
https://engineering.mongodb.com/post/ryp0ohr2w9pvv0fks88kq6qkz9k9p3
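As a hedged illustration of choosing these levels from a client, here is a small PyMongo sketch; the connection URI, database, and collection names are placeholders:

```python
# Selecting read/write concern levels per collection with PyMongo.
from pymongo import MongoClient
from pymongo.read_concern import ReadConcern
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")   # placeholder URI
orders = client["shop"].get_collection(
    "orders",
    read_concern=ReadConcern("majority"),      # read what a majority of nodes have applied
    write_concern=WriteConcern(w="majority"),  # write is durable once a majority acknowledges
)
orders.insert_one({"item": "book", "qty": 1})
print(orders.find_one({"item": "book"}))
```

Switching to WriteConcern(w=1) or ReadConcern("local") trades durability and causal guarantees for latency, as in the scenarios above.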
Process pi uses the following two rules, R1 and R2, to update its vector clock:
• R1: Before executing an event, process pi updates its local logical time as follows:
vt_i[i] := vt_i[i] + d (d > 0)
• R2: Each message m is piggybacked with the vector clock vt of the sender process at sending time. On receipt of such a message (m, vt), process pi executes the following sequence of actions:
(a) update its global logical time: vt_i[k] := max(vt_i[k], vt[k]) for all k;
(b) execute R1;
(c) deliver the message m.
[Figure: P2, starting at (0,0,0), receives a message carrying (2,0,0); R2(a) gives (2,0,0), R2(b) gives (2,1,0), and the message is delivered.]
[Figure: BSS example with a trivial broadcast from P1 (m1 stamped [1,0], m2 stamped [2,0]) to P2 (initially [0,0]), showing how buffering maintains CO.
Check for m1: R1: VT_m1[P1] - 1 = 0 == VT_P2[P1] -> pass; R2: VT_P2[P2] = 0 >= VT_m1[P2] -> pass. Action: deliver and set VT_P2 = [1,0].
Check for m2 (while VT_P2 is still [0,0]): R1: VT_m2[P1] - 1 = 1 != VT_P2[P1] -> fail. Action: buffer.]
Note: we do not apply the full vector clock update (R1 called from R2), because the purpose of BSS is to order messages, not to account for local events.
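A small sketch of the BSS delivery test applied in this example; the function name is illustrative and vectors are plain Python lists indexed by process number:

```python
# BSS delivery test at receiver Pi for a broadcast from Pj carrying vector timestamp vt_m.
def bss_can_deliver(vt_m, vt_i, j):
    # R1: m is the next broadcast expected from Pj (i.e. VT_m[j] - 1 == VT_i[j])
    next_from_sender = vt_m[j] == vt_i[j] + 1
    # R2: Pi has already delivered every broadcast Pj had delivered before sending m
    seen_rest = all(vt_m[k] <= vt_i[k] for k in range(len(vt_i)) if k != j)
    return next_from_sender and seen_rest

vt_p2 = [0, 0]
print(bss_can_deliver([1, 0], vt_p2, 0))   # True  -> deliver m1
print(bss_can_deliver([2, 0], vt_p2, 0))   # False -> buffer m2 until m1 is delivered
```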
[Figures: a three-process BSS example. P1 broadcasts m1 with timestamp [1,0,0] and m2 with [2,0,0] to P2 and P3 (both initially [0,0,0]). The delivery checks (e.g. R1: [1-1,*,*] = [0,*,*] and R2: [*,0,0] >= [*,0,0] for m1) pass in causal order, the messages are delivered, and R2 of the vector clock is applied at each delivery.]
[Figure: SES two-process example. P1 sends m1 with timestamp [1,0] and an empty destination list {}, then m2 with timestamp [2,0] and destination list {(P2, [1,0])}; P2 moves from [0,0] to [1,1] on delivery.]
[Figures: SES three-process example. At P2 (initially [0,0,0]): R2 finds P2 not in V_M, so the message is delivered; P2's clock becomes [2,1,0] and it records V_P2 = {(P3, [1,0,0])}. P2 then sends a message stamped [2,2,0] carrying {(P3, [1,0,0])}. At P3 (initially [0,0,0]): R2 finds P3 in V_M and [1,0,0] > [0,0,0], so that message is buffered. Once P3's clock has advanced to [1,0,1], R2 finds [1,0,0] <= [1,0,1], so the buffered message is delivered and P3's clock becomes [2,2,2].]
1
8/9/2024
Message orders
1. Async execution / non-FIFO
• Messages can be delivered in any order on a link
2. FIFO
• Messages from the same sender are delivered in the order they were sent
3. Causal Order (CO)
• A message is delivered only after all causally preceding messages have been delivered
4. Sync execution
• Send and receive happen at the same instant, as an atomic transaction
FIFO example
violates FIFO
2
8/9/2024
CO example
[Figure: m1 "happens before" m2; the numbers (1), (2), (3) are Lamport clock values.]
Receiving vs Delivery
3
8/9/2024
[SES data structure V_P1 maintained at P1] Step (1)
* Using P1 and P2 as an example - ideally it is a rule for any Pi sending M to Pj: P1 sends V_P1 along with M to P2.
SS ZG 526: Distributed Computing 8
4
8/9/2024
SES example (1) SES doesn’t need broadcast messages because the destination
vector is sent around
V_P1={(P3,[1,0,0])}
[1,0,0] [2,0,0]
P1
[1,0,0] {} [2,0,0] {(P3,[1,0,0])}
P2
[0,0,0]
R2: P2 not in V_M
Action: Deliver
P3
10
5
8/9/2024
Delivered
P2
[0,0,0] [2,1,0]
V_P2={(P3,[1,0,0])}
P3
11
[2,2,0]
Delivered
P2
[0,0,0] [2,1,0] [2,2,0] {(P3,[1,0,0])}
V_P2={(P3,[1,0,0])}
P3
[0,0,0]
R2: P3 in V_M and [1,0,0] > [0,0,0]
Action: Buffer
SS ZG 526: Distributed Computing 12
12
6
8/9/2024
V_P2={}
[2,2,0]
Delivered
P2
[0,0,0] [2,1,0] [2,2,0] {(P3,[1,0,0])}
13
V_P2={}
[2,2,0]
Delivered
P2
[0,0,0] [2,1,0] [2,2,0] {(P3,[1,0,0])}
P3
[1,0,1] [2,2,2]
[0,0,0] Buffered Delivered
R2: P3 in V_M and [1,0,0] < [1,0,1]
Action: Deliver
14
7
8/9/2024
15
15
Total order
» CO is popular but sometimes Total Order (TO) is useful
» [Figure: hierarchy of message ordering paradigms - Sync, CO, FIFO, Async - with Total Order shown as a separate property]
» CO does not imply TO and vice versa (see next slide)
» Total Order can be used for
» Replica updates in a distributed data store (see next slide): if update(x) is seen before update(y) at replica i, then all replicas should see the same sequence
» Distributed Mutual Exclusion using total order multicast (next session)
» Definition: for all pairs of processes Pi, Pj and all pairs of messages Mx, My that are delivered at both processes, Mx is delivered before My at Pi iff Mx is delivered before My at Pj
» Doesn't depend on the sender and does not try to establish CO
» Basically, all processes must see the same (FIFO) message sequence
» TO + CO = Synchronous system - why?
16
16
8
8/9/2024
[Example without total order] @ R1: P2 then P1; @ R2: P1 then P2; @ R3: P1 then P2
[Example with total order] @ R1: P2 then P1; @ R2: P2 then P1; @ R3: P2 then P1
Note: With Lamport clocks, CO does not imply TO. You have to use the process ID to break ties and create a TO.
SS ZG 526: Distributed Computing 18
18
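A minimal sketch of deriving a total order from Lamport timestamps by breaking ties with the process ID; the message fields are assumptions, and only the comparison is shown (a full TO multicast also needs a rule for when it is safe to deliver):

```python
from dataclasses import dataclass, field

@dataclass(order=True)
class Update:
    # Sort key (Lamport timestamp, sender ID) is identical at every replica
    # and is consistent with the happens-before order of the updates.
    ts: int
    sender: int
    payload: str = field(compare=False)

# Two concurrent updates with equal Lamport timestamps: the process ID decides.
pending = [Update(ts=2, sender=2, payload="y=5"),
           Update(ts=2, sender=1, payload="x=3")]
for u in sorted(pending):
    print(u.sender, u.payload)   # P1's update is applied before P2's at every replica
```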
9
8/9/2024
Group G
19
19
20
10
8/9/2024
21
21
MSSG MSMG
22
22
11
8/9/2024
B
A Meta-group = [A,B]
23
23
24
24
12
8/9/2024
25
25
Propagation Tree
» All MGs have to be put in a tree structure (example tree in the figure: ABC - BCD - DE - EF) with properties as follows:
1. PM(G) is an ancestor of all other MGs of G
» e.g. ABC is an ancestor of A, B, C, AB, … because one can reach all these MGs through ABC
2. PM(G) is unique
3. For any MG, there is a unique path to that MG from the PM of any G of which the MG is a subset - so it is a tree
» e.g. ABC -> BCD -> CD
4. PM[G1] and PM[G2] should lie in the same branch or in disjoint trees (if the memberships are disjoint)
» e.g. for G1=A and G2=F, ABC -> BCD -> DE -> EF
26
26
13
8/9/2024
27
28
28
14
8/9/2024
29
29
30
30
15
8/9/2024
31
31
Termination detection
» Need to detect that a distributed computation has ended, or that a sub-problem has ended, in order to proceed to the next step
» No process has complete knowledge of the global state - so how do we know that a computation has ended?
» Two distributed computations run in parallel:
» Application messages for the user computation
» Control messages for termination detection - these should not indefinitely delay the application or require additional channels of communication
32
32
16
8/9/2024
System model
» A process can be in an active or idle state
» An active process can become idle at any time
» An idle process can become active only when it receives a message
» Only active processes can send messages - so an external trigger is required to start
» A message can be received in either state
» Sending and receiving messages are atomic actions
» A distributed computation is terminated when all processes are idle and there are no messages in transit on any channel (stable state)
33
33
Pj is already idle
* If you use Lamport clock then use process ID to break ties and derive total order
34
34
17
8/9/2024
» The last process to terminate will have the largest clock value. Everyone will take a snapshot for it, but it will not take a snapshot for anyone else.
35
35
[Figure: process j sends R(x', k') to process i, whose clock is x.] Possibilities:
a. i is already idle: i is idle and x' > logical clock of i
b. i is ahead of j: i is idle and x' <= logical clock of i
c. i is still active
36
36
18
8/9/2024
Think of a central bank lending money (weight). Initially bank has all the money.
37
37
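A minimal sketch of this weight-throwing idea following the bank analogy above (often presented as Huang's scheme); the controller/process classes and the direct method calls in place of real messages are simplifications for illustration:

```python
class Controller:
    """The 'bank': starts with all the weight; termination when it all returns."""
    def __init__(self):
        self.weight = 1.0

    def lend(self, fraction=0.5):
        w = self.weight * fraction
        self.weight -= w
        return w                      # attached to the message that activates a process

    def receive_returned(self, w):
        self.weight += w
        if abs(self.weight - 1.0) < 1e-12:
            print("computation terminated")

class Process:
    def __init__(self):
        self.weight = 0.0             # idle processes hold no weight

    def on_message(self, w):
        self.weight += w              # becoming/staying active with borrowed weight

    def send_message(self):
        w = self.weight / 2           # split own weight; half travels with the message
        self.weight -= w
        return w

    def become_idle(self, controller):
        controller.receive_returned(self.weight)   # return all weight to the bank
        self.weight = 0.0

# Tiny run: bank activates P1, P1 activates P2, both go idle -> termination detected.
bank, p1, p2 = Controller(), Process(), Process()
p1.on_message(bank.lend())
p2.on_message(p1.send_message())
p1.become_idle(bank)
p2.become_idle(bank)   # prints "computation terminated"
```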
38
38
19
8/9/2024
[Figure: weight-throwing example - active P1 (weight 0.1) sends message M carrying weight 0.05 to idle P3; P3 becomes active with weight 0.05; later both become idle and return their weights.]
39
39
40
20
8/9/2024
41
41
42
42
21
8/9/2024
43
43
44
44
22
8/9/2024
Example
[Figure: tree-based termination detection on a tree rooted at node 0 with nodes 1-6; idle leaf processes return tokens T3-T6 up the tree.]
Example …
[Figure continued: tokens T1, T2 continue to propagate toward the root; a message m arrives at one of the nodes and re-activates it.]
46
46
A centralised algorithm
» A central controller C with a FIFO queue for deferring replies.
» Request, Reply, and Release messages (1: Request, 2: Reply, 3: Release).
» Reliability and performance bottleneck:
» the controller is a single point of failure (SPOF)
» a single bottleneck for queue management
[Figure: P1, P2, P3 send Requests to controller C; C replies to the site at the head of its queue, and that site sends a Release after leaving the CS.]
4
8/9/2024
1. When a site Si wants to enter the CS, it sends a REQUEST(T=tsi, i) message to all the sites in its request set Ri
and places the request on request_queue_i.
2. When a site Sj receives the REQUEST(tsi , i) message from site Si, it returns a timestamped REPLY message to Si
and places site Si’s request on request_queue_j.
R: Request(tsi, i)
Reply(tsj, j)
10
5
8/9/2024
R: Request(tsi, i)
11
1. Site Si, upon exiting the CS, removes its request from the top of its request queue and sends a timestamped RELEASE
message to all the sites in its request set.
2. When a site Sj receives a RELEASE message from site Si, it removes Si’s request from its request queue.
3. When a site removes a request from its request queue, its own request may become the top of the queue, enabling it
to enter the CS.
4. The algorithm executes CS requests in the increasing order of timestamps.
time = tsi
Critical
Section Si Sj
Request_queue_i Request_queue_j
12
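A minimal sketch of the per-site state for Lamport's algorithm described above; the `send` callback and message tuples are assumptions used for illustration, and transport and failures are omitted:

```python
import heapq

class LamportSite:
    def __init__(self, site_id, all_sites):
        self.id = site_id
        self.others = [s for s in all_sites if s != site_id]
        self.clock = 0
        self.queue = []              # heap of (timestamp, site_id) requests
        self.replies = set()

    # --- requesting the CS ---
    def request_cs(self, send):
        self.clock += 1
        heapq.heappush(self.queue, (self.clock, self.id))
        self.replies.clear()
        for s in self.others:
            send(s, ("REQUEST", self.clock, self.id))

    def on_request(self, ts, j, send):
        self.clock = max(self.clock, ts) + 1
        heapq.heappush(self.queue, (ts, j))
        send(j, ("REPLY", self.clock, self.id))     # every site replies in this variant

    def on_reply(self, ts, j):
        self.clock = max(self.clock, ts) + 1
        self.replies.add(j)

    def can_enter_cs(self):
        # replies from all other sites AND own request at the head of the queue
        return len(self.replies) == len(self.others) and \
               self.queue and self.queue[0][1] == self.id

    # --- releasing the CS ---
    def release_cs(self, send):
        heapq.heappop(self.queue)                   # remove own request from the top
        self.clock += 1
        for s in self.others:
            send(s, ("RELEASE", self.clock, self.id))

    def on_release(self, ts, j):
        self.clock = max(self.clock, ts) + 1
        self.queue = [(t, s) for (t, s) in self.queue if s != j]
        heapq.heapify(self.queue)
```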
6
8/9/2024
Example 1
(2, P1)
P1 in CS (2, P1)
1 2
P1
(2, P1)
P2
(2, P1) (2, P1)
reply P3
release (2, P1) (2, P1)
13
(1,P3)
P1
(1,P3)
P2
(1, P3)
P3
1 (1,P3)
request
reply
release
14
7
8/9/2024
(2, P1)
(1,P3),(2, P1) insert in queue based on timestamp
1 2
P1
(2, P1)
(1,P3)
P2
(1,P3),(2, P1)
(1, P3)
P3
1 (1,P3) (1,P3),(2, P1) P3 in CS
request
reply
release
15
(2, P1)
(1,P3),(2, P1) P1 in CS
1 2
P1
(2, P1) (2, P1)
(1, P3)
P3
1 (1,P3) (1,P3),(2, P1) P3 in CS (2, P1)
request
reply
release
16
8
8/9/2024
Correctness
• Suppose that both Si and Sj were in the CS at the same time t.
• Then each site's request must have been at the top of both queues, i.e., Si's request is ordered before Sj's and Sj's is ordered before Si's - an impossible situation.
17
18
18
9
8/9/2024
Optimization: merges the release and reply messages to make a total of 2(N-1) messages - a site
replies only if certain conditions hold (a code sketch follows this list).
Requesting Site:
– A requesting site Pi sends a message request(ts,i) to all sites.
– It enters the CS when it has received a reply from all sites.
– No release message is needed.
Receiving Site:
– Upon reception of a request(ts,i) message, the receiving site Pj will immediately send a
timestamped reply(ts,j) message if and only if:
• Pj is not requesting or executing the critical section, OR
• Pj is requesting the critical section but sent a request with a higher timestamp than
the timestamp of Pi
– Otherwise, Pj will defer the reply message.
19
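A minimal sketch of the receiving-site decision in Ricart-Agrawala; the state dictionary fields are assumptions used for illustration:

```python
def should_reply_now(pj_state, incoming_ts, incoming_id):
    """Ricart-Agrawala receiving-site rule: reply immediately unless Pj itself
    is in the CS or has an outstanding request with higher priority
    (smaller timestamp, process ID breaking ties)."""
    if not pj_state["requesting"] and not pj_state["in_cs"]:
        return True
    if pj_state["in_cs"]:
        return False                      # defer until Pj leaves the CS
    own = (pj_state["request_ts"], pj_state["id"])
    other = (incoming_ts, incoming_id)
    return own > other                    # Pj's own request is later, so it yields

# Example: P2 (request ts=5) receives a request from P1 with ts=3 -> P2 replies now.
p2 = {"id": 2, "requesting": True, "in_cs": False, "request_ts": 5}
print(should_reply_now(p2, incoming_ts=3, incoming_id=1))   # True
```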
Example 1
P1 enters CS
1 2
P1
(2, P1)
P2
(2, P1)
P3
20
10
8/9/2024
Example 2: Requests
1 2
P1
P2
P3
1
21
P2
P3
1 queue P1 P3 in CS
22
11
8/9/2024
2 P1 enters CS
1
P1
P2
P3
1 queue P1 P3 in CS
23
24
24
12
8/9/2024
» Each site has to request approval from the members of the request set Ri attached to the site, not from all N-1 other sites.
» So it is approval from a quorum, i.e., the subset of sites in the request set attached to the site.
» Far fewer messages are exchanged.
» The request sets Ri and Rj of sites Si and Sj are chosen such that Ri and Rj have at least one common site Sk.
» Sk mediates conflicts between Ri and Rj. How?
» A site can send only one REPLY message at a time, even if it gets requests from two different request sets that it belongs to.
» A site can send a REPLY message only after receiving a RELEASE message for the previous REPLY message.
[Figure: request sets Ri and Rj, each with 4 sites, overlapping in a common site Sk.]
25
N = K(K-1) + 1, so K ≈ sqrt(N) (e.g., K = 3 gives N = 7, as in the example that follows).
• M1 and M2 are necessary for correctness; #msgs = 3K per CS invocation.
• M3 and M4 provide other desirable features to the algorithm.
• M3 implies that all sites have to do an equal amount of work to invoke mutual exclusion.
• M4 enforces that exactly the same number of sites request permission from any given site, implying that all sites have "equal responsibility" in granting permission to other sites.
26
13
8/9/2024
27
28
14
8/9/2024
Si
Sj
29
30
15
8/9/2024
Example
S2 accesses CS
R1 {S1, S2, S3} S1
R2 {S2, S4, S6} queue S5 dequeue S5 and reply
R3 {S3, S5, S6} S2
R4 {S1, S4, S5}
R5 {S2, S5, S7}
R6 {S1, S6, S7} S3
R7 {S3, S4, S7}
S4
Si—>Ri
S5
S5 accesses CS
S6
request
reply
S7
release
S5 can send to R5 or R4 or R3 … here it sends to R5
SS ZG 526: Distributed Computing 31
31
32
16
8/9/2024
Sj Si Sk
33
Number of messages
» Lamport
» 3 x (N-1)
» Ricart-Agrawala
» 2 x (N-1)
» Maekawa without deadlock
» 3 x sqrt(N)
» Maekawa with deadlock possibility
» 5 x sqrt(N)
34
34
17
8/9/2024
35
35
36
18
8/9/2024
37
38
19
8/9/2024
A site enters the CS when it holds the token and its own entry is at the top of its request_q.
In this case, the site deletes the top entry from its request_q and enters the CS.
[Figure: nodes A, B, C; A holds the token and accesses the CS.]
Once the CS is done at a site and its request_q is non-empty, the site does the following:
1. Deletes the top entry from its request_q
2. Sends the token to that top site
3. Sets its holder variable to point at that site
4. Sends a REQUEST message to the site pointed at by the holder variable, so that it can get back the token at a later point for a pending request in its request_q (e.g., if A had a non-empty queue in this case)
[Figure: A releases the CS; the queues at the nodes are queue:B and queue:C; the token moves from A toward the requester.]
SS ZG 526: Distributed Computing 40
40
20
8/9/2024
Critical
Section
n2 n3
REQUEST
n4 n5 n6 n7 holder pointer
n6
root - has token
41
Example 1
n2 idle token holder
n1
REQUEST Critical
Section
n6 n2 n3
n4 n5 n6 n7 holder pointer
n6
root - has token
42
21
8/9/2024
Example 1
n2 idle token holder
n1
Token
Critical
Section
n6 n2 n3
n4 n5 n6 n7 holder pointer
n6
root - has token
43
Example 1
holder changed n1
Critical
Section
n6 n2 n3
n4 n5 n6 n7 holder pointer
n6
root - has token
44
22
8/9/2024
Example 1
n1
Critical
Section
n2 n3
token sent to n6
holder changed
n4 n5 n6 n7 holder pointer
45
n5 n4 n2 n3 n5 n4 n2 n3
REQUEST REQUEST
n4 n5 n6
n4 n5 n6
n4 n5
n4 n5
46
23
8/9/2024
n4 n2 n4
n3 n2 n3
REQUEST
n4 n5 n6 n4 n5 n6
n4 n5 n4
n2 needs to get the token on behalf of n4
which is top item pending in queue
47
Critical
Critical Section
Section n1
n1
n4
n4 n2 n3
n2 n3
n4 n5 n6
n4 n5 n6
n4 n5 will send token and
n4 n2 n2 is queued now at n5 point to n2
SS ZG 526: Distributed Computing 48
48
24
8/9/2024
Critical
n1 n1 Section
n2 n3 n2 n3
n4 n5 n6 n4 n5 n6
n4
49
Analysis
Proof of Correctness
Mutual exclusion is trivial: only the token holder can enter the CS.
Finite waiting: all the requests in the system form a FIFO queue and the token is passed in that order.
Performance
O(log N) messages per CS invocation, i.e. the average distance between two nodes in a tree, so quite efficient in terms of messages.
50
25
8/9/2024
51
51
52
26
8/9/2024
Requesting the critical section
Request send:
1. If the requesting site Si does not have the token, it increments its sequence number RNi[i] and sends a REQUEST(i, sn) message to all other sites (sn is the updated value of RNi[i]).
Request receive:
1. When a site Sj receives this message, it sets RNj[i] to max(RNj[i], sn).
2. If Sj has the idle token and RNj[i] = LN[i] + 1 (check for an outdated message), it sends the token to Si.
[Figure: Si executes RN[i] = RN[i] + 1 and sends REQUEST(i, RN[i]); Sj executes RN[i] = max(RN[i], sn) and, if it has the idle token and RN[i] = LN[i] + 1, sends the token to Si.]
53
54
54
27
8/9/2024
Having finished the execution of the CS, site Si takes the following actions (see the sketch below):
1. It sets the LN[i] element of the token array equal to RNi[i] (the last executed request is recorded).
2. For every site Sj whose ID is not in the token queue, it appends Sj's ID to the token queue if RNi[j] = LN[j] + 1 (eligible sites).
3. If the token queue is nonempty after the above update, it deletes the top site ID from the queue and sends the token to the site indicated by that ID.
55
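A minimal sketch of the Suzuki-Kasami bookkeeping described above; the transport callbacks and the token representation as a dictionary are assumptions used for illustration:

```python
from collections import deque

class SKSite:
    def __init__(self, site_id, n, has_token=False):
        self.id = site_id
        self.rn = [0] * n                      # highest request number seen per site
        self.token = {"ln": [0] * n, "q": deque()} if has_token else None

    def request_cs(self, send_to_all):
        self.rn[self.id] += 1
        if self.token is None:
            send_to_all(("REQUEST", self.id, self.rn[self.id]))

    def on_request(self, j, sn, send_token):
        self.rn[j] = max(self.rn[j], sn)
        # Send the idle token only for a fresh (not outdated) request.
        if self.token is not None and self.rn[j] == self.token["ln"][j] + 1:
            send_token(j, self.token)
            self.token = None

    def release_cs(self, send_token):
        tok = self.token
        tok["ln"][self.id] = self.rn[self.id]          # record last executed request
        for j in range(len(self.rn)):                   # append newly eligible sites
            if j not in tok["q"] and self.rn[j] == tok["ln"][j] + 1:
                tok["q"].append(j)
        if tok["q"]:
            nxt = tok["q"].popleft()
            send_token(nxt, tok)
            self.token = None
```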
Example: S4 requests CS
(S4, 1)
RN=[0 0 0 0]
S4
S1 RN=[0 0 0 1]
RN=[0 0 0 0]
(S4, 1)
(S4, 1)
56
28
8/9/2024
(S4, 1)
RN=[0 0 0 0]
S4
S1 RN=[0 0 0 1]
RN=[0 0 0 0]
(S4, 1)
S2 S3
Q=[S4]
RN=[0 0 0 0] Critical RN=[0 0 0 0] LN=[0 0 0 0]
Section RN=[0 0 0 1]
LN[4] + 1 = RN[4] <== so not outdated
Add S4 in queue and send token to S4
SS ZG 526: Distributed Computing 57
57
Token with S4
RN=[0 0 0 1] Q=[]
S4
S1 LN=[0 0 0 0]
RN=[0 0 0 1]
S2 S3
58
29
8/9/2024
Token
RN=[0 0 0 1] Q=[]
S4
RN=[0 0 0 1] S1 LN=[0 0 0 0]
(S2, 1) (S2, 1)
(S2, 1)
RN=[0 0 0 1] S2 S3
RN=[0 1 0 1]
Critical RN=[0 0 0 1]
Section
59
Token
RN=[0 0 0 1] Q=[]
S4
RN=[0 0 0 1] S1 RN=[0 1 0 1] LN=[0 0 0 0]
RN=[0 1 0 1]
RN=[0 0 0 1] S2 S3
RN=[0 1 0 1]
Critical RN=[0 0 0 1]
Section RN=[0 1 0 1]
60
30
8/9/2024
RN=[0 1 0 1] S2 S3 RN=[0 1 0 1]
Critical
Section
61
Token
(S1, 1) LN=[0 0 0 1]
S4 RN=[0 1 0 1]
RN=[0 1 0 1] S1 Q=[S2]
RN=[1 1 0 1]
(S1, 1) (S1, 1)
RN=[0 1 0 1] S2 S3
Critical RN=[0 1 0 1]
Section
62
31
8/9/2024
Token
LN=[0 0 0 1]
S4 RN=[0 1 0 1]
RN=[1 1 0 1] S1 RN=[1 1 0 1] Q=[S2] Q=[S2,S1]
RN=[0 1 0 1] S2 S3
RN=[1 1 0 1]
Critical RN=[0 1 0 1]
Section RN=[1 1 0 1]
63
S4 RN=[1 1 0 1]
RN=[1 1 0 1] S1
RN=[1 1 0 1] S2 S3 RN=[1 1 0 1]
Token
LN=[0 0 0 1] Critical
Section
Q=[S1]
64
32
8/9/2024
S4 RN=[1 1 0 1]
RN=[1 1 0 1] S1
RN=[1 1 0 1] S2 S3
Token
RN=[1 1 0 1]
LN=[0 0 0 1] LN=[0 1 0 1] Critical
Section
Q=[S1]
65
Token
S4 RN=[1 1 0 1]
RN=[1 1 0 1] S1 LN=[0 1 0 1]
Q=[]
RN=[1 1 0 1] S2 S3
RN=[1 1 0 1]
Critical
Section
66
33
8/9/2024
Token
S4 RN=[1 1 0 1]
RN=[1 1 0 1] S1 LN=[1 1 0 1]
Q=[]
DONE
RN=[1 1 0 1] S2 S3
RN=[1 1 0 1]
Critical
Section
67
Performance
0 or N messages per CS invocation; 0 if the site already has the token and there are no other requests.
68
34
8/9/2024
1
Reference: digital content slides from Prof. C. Hota, BITS Pilani
1
8/9/2024
What is a deadlock?
2
8/9/2024
P2 P3
P2 P1 P3
z
Remove resources to get a Wait For Graph
(WFG)
3
8/9/2024
z
w y
Detection requirements
• Liveness / Progress
✓ All deadlocks found - no undetected deadlocks
✓ Deadlocks found in finite time
• Safety
✓ No false deadlock detection, i.e., no phantom (non-existent) deadlocks, which can be caused by network latencies (see the example on the next chart)
4
8/9/2024
False deadlocks
[Figure: WFGs on Machine 1 and Machine 2 with processes A, B, C and resources R, S, T, and two global views constructed at the coordinator.]
Global view 1 (safe state): no deadlock, because B's request for T has not been recorded before B's release of R.
Global view 2 (false deadlock): a deadlock appears, because B's request for T has reached the coordinator before B's release of R.
10
10
5
8/9/2024
11
probe(i, j, k)
i: initiator of the probe (the blocked process that started detection)
j: sender of this probe
k: receiver of this probe
[Figure: sites S1, S2 with processes P1-P5; P3 sends Probe(1, 3, 4) to P4 on behalf of initiator P1.]
12
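A minimal sketch of Chandy-Misra-Haas probe propagation for the AND model; the WFG dictionary is an assumption, message passing is simulated with a queue, and deadlock is declared when a probe comes back to its initiator (k == i):

```python
from collections import deque

def cmh_and_detect(initiator, waits_for):
    """waits_for[p] = set of processes that blocked process p is waiting on.
    Simulates probe(i, j, k) flooding; returns True if a probe returns
    to the initiator, i.e., the initiator is part of a deadlock."""
    sent = set()                               # (j, k) pairs already probed
    frontier = deque((initiator, k) for k in waits_for.get(initiator, ()))
    while frontier:
        j, k = frontier.popleft()              # probe(initiator, j, k)
        if k == initiator:
            return True                        # probe came back: deadlock
        if (j, k) in sent:
            continue
        sent.add((j, k))
        for nxt in waits_for.get(k, ()):       # k is blocked too: forward the probe
            frontier.append((k, nxt))
    return False

# Cycle P1 -> P2 -> P3 -> P1: P1 detects the deadlock it is part of.
wfg = {1: {2}, 2: {3}, 3: {1}}
print(cmh_and_detect(1, wfg))   # True
```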
6
8/9/2024
13
P5
P6
k==i (1, 9, 1) (1, 5, 7)
(1, 6, 8)
P7
P8
(1, 7, 9)
P9
S3
14
7
8/9/2024
P5
P6
k==i (1, 9, 1) (1, 5, 7)
(1, 6, 8)
P7
P8
(1, 7, 9)
P9
S3
15
P2
P1
P3
P4
P7
P6 P5
16
8
8/9/2024
17
OR model graphs
P1
P2
OR
creates a Knot
P3
18
9
8/9/2024
1. Initiation by a blocked process Pi:
   Send query(i, i, j) to all processes Pj in the dependent set DSi of Pi;
   numi(i) := |DSi|;
   waiti(i) := true;
[Figure: Pi (blocked on an OR request) sends query(i, i, j) to Pj and query(i, i, k) to Pk, so numi(i) = 2.]
19
20
10
8/9/2024
21
P5
P6
(1, 5, 7)
(1, 9, 1)
(1, 6, 8)
P7
P8
(1, 7, 9)
query
P9
S3
22
11
8/9/2024
[Figure: replies (1, 1, 9) and (1, 9, 7) flowing back among P7, P8, P9 in site S3.]
P1 doesn't get back a reply because P4 never reaches num4(1) = 0, as there is no reply from P6.
P1 may send several non-engaging queries. P2 will reply to all of them, but the engaging query will never get a reply.
SS ZG 526: Distributed Computing 23
23
P5
P6
(1, 5, 7)
(1, 9, 1)
(1, 6, 8)
P7
P8
(1, 7, 9)
(1, 8, 9)
P9
S3
SS ZG 526: Distributed Computing 24
24
12
8/9/2024
(1, 1, 9) (1, 8, 6)
P7
P8
(1, 9, 7)
(1, 9, 8)
P9
S3
SS ZG 526: Distributed Computing 25
25
P2 P2
P1 P1
P3 P3
P4 P4
P7 P7
P6 P5 P6 P5
26
13
8/9/2024
27
27
• Deadlock persistence:
• Average time a deadlock exists before it is resolved.
• Deadlock resolution:
• Aborting at least one process/request involved in the deadlock.
• Efficient resolution of a deadlock requires knowledge of all processes and resources involved.
• e.g. probes can try to gather some global knowledge, such as the least-priority process in the cycle.
• If every process that detects a deadlock tries to resolve it independently, that is highly inefficient: several processes might be aborted unnecessarily.
28
14
8/9/2024
Summary
» What is a deadlock in a distributed system
» What are WFG and RAG
» When does deadlock happen - 4 conditions that must be satisfied
» What options do we have :
» Prevention, Avoidance, Detection + Resolution, Ignorance
» Deadlock detection and resolution is most useful
» 2 Detection algorithms
» CMH algorithm for AND model graphs
» CMH algorithm for OR model graphs
» Reading: Chapter 10 of T1
29
15
8/9/2024
SSTCS ZG526 DISTRIBUTED COMPUTING
Anil Kumar G
BITS Pilani | Dubai | Goa | Hyderabad
L8: Deadlock detection [T1: Chap - 10]
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by Publishers of T1 & T2
1
8/9/2024
Presentation Overview
• Introduction
• Models of Deadlocks
• Single resource model
• AND model
• OR model
• Chandy-Misra-Haas Algorithm for AND model
• Chandy-Misra-Haas Algorithm for OR model
Distributed Computing 3 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Introduction
– deadlock prevention,
– deadlock avoidance, and
– deadlock detection.
Distributed Computing 4 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
2
8/9/2024
Introduction
Distributed Computing 5 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Models of deadlocks
Distributed Computing 6 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
3
8/9/2024
Distributed Computing 7 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
• In the AND model, a process can request more than one resource simultaneously
and the request is satisfied only after all the requested resources are granted to the
process.
• The requested resources may exist at different locations.
Distributed Computing 8 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
4
8/9/2024
The OR model
Distributed Computing 9 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Distributed Computing 10 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
10
5
8/9/2024
Distributed Computing 11 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
11
Distributed Computing 12 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
12
6
8/9/2024
1. If this is the first query message received by Pk for the deadlock detection initiated
by Pi (called the engaging query ), then it propagates the query to all the processes in
its dependent set and sets a local variable numki to the number of query messages
sent.
2. If this is not the engaging query , then Pk returns a reply message to it immediately
provided Pk has been continuously blocked since it received the corresponding
engaging query . Otherwise, it discards the query .
Distributed Computing 13 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
13
• Process Pk maintains a boolean variable waitk(i) that denotes the fact that it has been continuously blocked since it received the last engaging query from process Pi.
• When a blocked process Pk receives a reply(i, j, k) message, it decrements numk(i) only if waitk(i) holds.
• A process sends a reply message in response to an engaging query only after it has received a reply to every query message it has sent out for this engaging query.
• The initiator process detects a deadlock when it has received reply messages to all the query messages it has sent out.
Distributed Computing 14 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
14
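A minimal sketch of the OR-model query/reply bookkeeping described above, run as a single-threaded simulation over a static OR-model WFG; the graph encoding is an assumption, and the continuous-blocking (waitk(i)) condition is trivially true here because the graph does not change during detection:

```python
def cmh_or_deadlocked(initiator, dep):
    """dep[p] = set of processes that blocked process p is waiting on (OR model);
    processes not in dep are active. Returns True iff the initiator gets a reply
    to every query it sent, which is the CMH OR-model deadlock criterion."""
    num = {initiator: len(dep[initiator])}   # outstanding queries per engaged process
    parent = {}                              # sender of the engaging query
    deadlocked = [False]

    def got_reply(j):
        num[j] -= 1
        if num[j] == 0:
            if j == initiator:
                deadlocked[0] = True
            else:
                got_reply(parent[j])         # Pj now replies to its engaging query

    def send_query(j, k):
        if k not in dep:                     # k is active: query discarded, no reply
            return
        if k not in num:                     # engaging query: propagate further
            parent[k] = j
            num[k] = len(dep[k])
            for m in dep[k]:
                send_query(k, m)
        else:                                # non-engaging query: reply immediately
            got_reply(j)

    for m in dep[initiator]:
        send_query(initiator, m)
    return deadlocked[0]

# P1 waits on P2; P2 waits on P3 OR P4; P3 waits on P1; P4 is active.
print(cmh_or_deadlocked(1, {1: {2}, 2: {3, 4}, 3: {1}}))   # False: P4 can unblock P2
print(cmh_or_deadlocked(1, {1: {2}, 2: {3}, 3: {1}}))      # True: P1 is deadlocked
```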
7
8/9/2024
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by Publishers of T1 & T2
PRESENTATION OVERVIEW
• Consensus
• Byzantine Behavior
• Agreement in a failure-free system
• Agreement In (Message-passing) Synchronous Systems With Failures
– Consensus algorithm for crash failures (synchronous system)
– Consensus algorithms for Byzantine failures
– Byzantine Agreement Tree Algorithm
• Agreement In Asynchronous Message-passing Systems With Failures
– Impossibility result for the consensus problem
– Terminating reliable broadcast
– Distributed transaction commit
– k-set consensus
– Approximate agreement
– Renaming problem
– Reliable broadcast
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Consensus
Byzantine Behavior
Agreement In (Message-passing) Synchronous Systems With Failures
Agreement In Asynchronous Message-passing Systems With Failures
k-set consensus
Approximate agreement
Renaming problem
Reliable broadcast
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by Publishers of T1 & T2
Presentation Overview
• Introduction
• Characteristics and performance features of P2P systems
• Data indexing and overlays
– Centralized indexing
– Distributed indexing
– Local indexing
• Structured overlays
• Unstructured overlays
• Challenges in P2P system design
Distributed Computing 2 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
1
8/9/2024
Introduction
Distributed Computing 3 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Introduction
Distributed Computing 4 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
2
8/9/2024
Introduction
• Napster
• Gnutella
• Freenet
• Pastry
• Chord
• CAN
Distributed Computing 5 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Introduction
Distributed Computing 6 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
3
8/9/2024
Introduction
Distributed Computing 7 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Distributed Computing 8 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
4
8/9/2024
Napster
• One of the earliest popular P2P systems, Napster, used a server-mediated central
index architecture organized around clusters of servers that store direct indices of
the files in the system.
• The central server maintains a table with the following information of each
registered client:
(i) the client’s address (IP) and port, and offered bandwidth
(ii) information about the files that the client can allow to share.
Distributed Computing 9 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Napster
The basic steps of operation to search for content and to determine a node from
which to download the content are the following:
1. A client connects to a meta-server that assigns a lightly loaded server from one of
the close-by clusters of servers to process the client’s query.
2. The client connects to the assigned server and forwards its query along with its own
identity.
3. The server responds to the client with information about the users connected to it
and the files they are sharing.
4. On receiving the response from the server, the client chooses one of the users from
whom to download a desired file. The address to enable the P2P connection between
the client and the selected user is provided by the server to the client.
Distributed Computing 10 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
10
5
8/9/2024
Napster
Distributed Computing 11 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
11
Distributed Computing 12 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
12
6
8/9/2024
Centralized indexing
• Centralized indexing entails the use of one or a few central servers to store
references (indexes) to the data on many peers.
• The DNS lookup as well as the lookup by some early P2P networks such as Napster
used a central directory lookup.
Distributed Computing 13 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
13
Distributed indexing
• Distributed indexing involves the indexes to the objects at various peers being
scattered across other peers throughout the P2P network.
• In order to access the indexes, a structure is used in the P2P overlay to access the
indexes.
• Distributed indexing is the most challenging of the indexing schemes, and many novel mechanisms have been proposed, most notably the distributed hash table (DHT); a sketch of the basic DHT idea follows this list.
• Various DHT schemes differ in the hash mapping, search algorithm, search diameter for a lookup, fault tolerance, and resilience to churn.
Distributed Computing 14 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
14
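A minimal sketch of the DHT idea behind distributed indexing: consistent hashing of keys onto a ring of node IDs with a Chord-style successor lookup. The identifier space size and the peer addresses are assumptions for illustration:

```python
import hashlib
from bisect import bisect_right

M = 2 ** 16                       # small identifier space for illustration

def h(name):
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % M

class Ring:
    def __init__(self, nodes):
        # Each node is placed on the ring at the hash of its address.
        self.points = sorted((h(n), n) for n in nodes)

    def successor(self, key):
        """The node responsible for a key is the first node clockwise from h(key)."""
        k = h(key)
        ids = [p for p, _ in self.points]
        i = bisect_right(ids, k) % len(self.points)
        return self.points[i][1]

ring = Ring(["peer-a:4000", "peer-b:4000", "peer-c:4000"])   # hypothetical peers
print(ring.successor("song.mp3"))   # peer that stores the index entry for this key
```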
7
8/9/2024
Local indexing
• Local indexing requires each peer to index only the local data objects and remote
objects need to be searched for.
• This form of indexing is typically used in unstructured overlays
Distributed Computing 15 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
15
Distributed Computing 16 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
16
8
8/9/2024
Structured overlays
• The P2P network topology has a definite structure, and the placement of files or
data in this network is highly deterministic as per some algorithmic mapping.
• The objective of such a deterministic mapping is to allow a very fast and
deterministic lookup to satisfy queries for the data.
• These systems are termed as lookup systems and typically use a hash table interface
for the mapping.
• The hash function, which efficiently maps keys to values, in conjunction with the
regular structure of the overlay, allows fast search for the location of the file.
Distributed Computing 17 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
17
Structured overlays
Distributed Computing 18 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
18
9
8/9/2024
Unstructured overlays
• The P2P network topology does not have any particular controlled structure, nor is
there any control over where files/data is placed.
• Each peer typically indexes only its local data objects, hence, local indexing is used.
• Node joins and departures are easy – the local overlay is simply adjusted.
• File placement is not governed by the topology.
• Search for a file may entail high message overhead and high delays.
Distributed Computing 19 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
19
Unstructured overlays
• Although the P2P network topology does not have any controlled structure, some
topologies naturally emerge
• Unstructured overlays have the serious disadvantage that queries may take a long
time to find a file or may be unsuccessful even if the queried object exists.
• The message overhead of a query search may also be high.
Distributed Computing 20 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
20
10
8/9/2024
Unstructured overlays
Distributed Computing 21 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
21
Unstructured overlays
• Unstructured overlays are efficient when there is some degree of data replication in
the network.
• Users are satisfied with a best-effort search.
• The network is not so large as to lead to scalability problems during the search
process.
Distributed Computing 22 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
22
11
8/9/2024
Fairness
• P2P systems depend on all the nodes cooperating to store objects and allowing other nodes to download from them.
• However, nodes tend to be selfish in nature; thus there is a tendency to download files without reciprocating by allowing others to download the locally available files.
• This behavior, termed leeching or free-riding, leads to a degradation of the overall P2P system performance.
• Hence, penalties and incentives should be built into the system to encourage sharing and maximize the benefit to all nodes.
Distributed Computing 23 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
23
• In the prisoners’ dilemma, two suspects, A and B, are arrested by the police.
• There is not enough evidence for a conviction.
• The police separate the two prisoners, and, separately, offer each the same deal: if
the prisoner testifies against (betrays) the other prisoner and the other prisoner
remains silent, the betrayer gets freed and the silent accomplice gets a 10-year
sentence.
• If both testify against the other (betray), they each receive a 2-year sentence.
• If both remain silent, the police can only sentence both to a small 6-month term on
a minor offence.
Distributed Computing 24 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
24
12
8/9/2024
• Rational selfish behavior dictates that both A and B would betray the other.
• This is not a Pareto-optimal solution, where a Pareto-optimal solution is one in
which the overall good of all the participants is maximized.
Distributed Computing 25 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
25
Distributed Computing 26 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
26
13
8/9/2024
• The commonly accepted view is that the tit-for-tat strategy is the best for winning such a game.
• In the first step, a prisoner cooperates, and in each subsequent step he reciprocates the action taken by the other party in the immediately preceding step.
• The BitTorrent P2P system has adopted the tit-for-tat strategy in deciding whether to allow a peer to download a file, in order to solve the leeching problem (see the sketch below).
• Here, cooperation is analogous to allowing others to download the locally available files, and betrayal is analogous to not allowing them to.
Distributed Computing 27 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
27
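A minimal sketch of the tit-for-tat rule described above, in iterated prisoner's dilemma form; the payoffs encode the 6-month / 2-year / 10-year story as years of sentence to be minimized, and the strategy functions are illustrative assumptions:

```python
def tit_for_tat(my_history, their_history):
    """Cooperate on the first round, then mirror the opponent's previous move."""
    return "C" if not their_history else their_history[-1]

# Years of sentence for (my_move, their_move): lower is better.
SENTENCE = {("C", "C"): 0.5, ("C", "D"): 10, ("D", "C"): 0, ("D", "D"): 2}

def play(strategy_a, strategy_b, rounds=5):
    ha, hb, score_a, score_b = [], [], 0.0, 0.0
    for _ in range(rounds):
        a, b = strategy_a(ha, hb), strategy_b(hb, ha)
        ha.append(a); hb.append(b)
        score_a += SENTENCE[(a, b)]; score_b += SENTENCE[(b, a)]
    return score_a, score_b

always_defect = lambda mine, theirs: "D"
print(play(tit_for_tat, tit_for_tat))      # (2.5, 2.5): sustained mutual cooperation
print(play(tit_for_tat, always_defect))    # tit-for-tat is exploited only in round 1
```

In BitTorrent terms, "cooperate" roughly corresponds to unchoking (uploading to) a peer that has recently uploaded to you, and "defect" to choking it.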
Distributed Computing 28 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
28
14
8/9/2024
Distributed Computing 29 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
29
15
8/9/2024
SSTCS ZG526 DISTRIBUTED COMPUTING
Anil Kumar G
BITS Pilani | Dubai | Goa | Hyderabad
TEXT BOOKS
REFERENCE BOOKS
R.1 - Kai Hwang, Geoffrey C. Fox, and Jack J. Dongarra,
“Distributed and Cloud Computing: From Parallel processing to the Internet of Things”,
Morgan Kaufmann, 2012 Elsevier Inc.
R.2 - John F. Buford, Heather Yu, and Eng K. Lua, “P2P Networking and Applications”,
Morgan Kaufmann, 2009 Elsevier Inc.
R.3 - Joshy Joseph and Craig Fellenstein, "Grid Computing", IBM Press, Pearson
Education, 2011.
Note: In order to broaden understanding of concepts as applied to Indian IT industry, students are advised to refer books of
their choice and case-studies in their own organizations
Distributed Computing 2 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
1
8/9/2024
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
[R2: Chap - 2]
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by Publishers of T1 & T2
Presentation Overview
Distributed Computing 4 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
2
8/9/2024
High-Bandwidth Interconnects
• Ethernet used a 1 Gbps link, while the fastest InfiniBand links ran at 30 Gbps.
• The Myrinet and Quadrics perform in between.
• The MPI latency represents the state of the art in long-distance message passing.
• All four technologies can implement any network topology, including crossbar
switches, fat trees, and torus networks.
• The InfiniBand is the most expensive choice with the fastest link speed.
• The Ethernet is still the most cost-effective choice.
Distributed Computing 5 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Distributed Computing 6 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
3
8/9/2024
Distributed Computing 7 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Distributed Computing 8 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
4
8/9/2024
Distributed Computing 9 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
• The distribution of large-scale system interconnects in the Top 500 systems from
2003 to 2008.
• Gigabit Ethernet is the most popular interconnect due to its low cost and market
readiness.
• The InfiniBand network has been chosen in about 150 systems for its high-
bandwidth performance.
• The Cray interconnect is designed for use in Cray systems only.
• The use of Myrinet and Quadrics networks had declined rapidly in the Top 500 list
by 2008.
Distributed Computing 10 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
10
5
8/9/2024
Distributed Computing 11 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
11
Distributed Computing 12 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
12
6
8/9/2024
• Each end point can be a storage controller, a network interface card (NIC), or an
interface to a host system.
• A host channel adapter (HCA) connected to the host processor through a standard
peripheral component interconnect (PCI), PCI extended (PCI-X), or PCI express bus
provides the host interface.
• Each HCA has more than one InfiniBand port. A target channel adapter (TCA)
enables I/O devices to be loaded within the network.
• The TCA includes an I/O controller that is specific to its particular device’s protocol
such as SCSI, Fibre Channel, or Ethernet.
• This architecture can be easily implemented to build very large scale cluster
interconnects that connect thousands or more hosts together
Distributed Computing 13 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
13
Distributed Computing 14 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
14
7
8/9/2024
• Realistically, SSI and HA features in a cluster are not obtained free of charge.
• They must be supported by hardware, software, middleware, or OS extensions.
• Any change in hardware design and OS extensions must be done by the
manufacturer.
• The hardware and OS support could be cost prohibitive to ordinary users.
• However, making changes at the programming level is a big burden on cluster users.
• Therefore, middleware support at the application level costs the least to implement.
Distributed Computing 15 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
15
• Close to the user application end, middleware packages are needed at the cluster
management level: one for fault management to support failover and failback
• Another desired feature is to achieve HA using failure detection and recovery and
packet switching.
• In the middle of Figure we need to modify the Linux OS to support HA, and we need
special drivers to support HA, I/O, and hardware devices.
• Toward the bottom, we need special hardware to support hot-swapped devices and
provide router interfaces
Distributed Computing 16 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
16
8
8/9/2024
Distributed Computing 17 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
17
Distributed Computing 18 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
18
9
8/9/2024
• Most GPU clusters are structured with homogeneous GPUs of the same hardware
class, make, and model.
• The software used in a GPU cluster includes the OS, GPU drivers, and clustering API
such as an MPI.
• The high performance of a GPU cluster is attributed mainly to its massively parallel
multicore architecture, high throughput in multithreaded floating-point arithmetic,
and significantly reduced time in massive data movement using large on-chip cache
memory.
• In other words, GPU clusters already are more cost-effective than traditional CPU
clusters.
Distributed Computing 19 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
19
• GPU clusters result in not only a quantum jump in speed performance, but also
significantly reduced space, power, and cooling demands.
• A GPU cluster can operate with a reduced number of operating system images,
compared with CPU-based clusters.
• These reductions in power, environment, and management complexity make GPU
clusters very attractive for use in future HPC applications
Distributed Computing 20 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
20
10
8/9/2024
CLUSTER JOB
AND
RESOURCE MANAGEMENT
Distributed Computing 21 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
21
Distributed Computing 22 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
22
11
8/9/2024
Distributed Computing 23 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
23
Distributed Computing 24 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
24
12
8/9/2024
Space Sharing
Distributed Computing 25 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
25
Space Sharing
Distributed Computing 26 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
26
13
8/9/2024
Space Sharing
Distributed Computing 27 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
27
Time Sharing
Distributed Computing 28 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
28
14
8/9/2024
Independent scheduling
Distributed Computing 29 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
29
Gang scheduling
• The gang scheduling scheme schedules all processes of a parallel job together.
• When one process is active, all processes are active.
• The cluster nodes are not perfectly clock-synchronized.
• In fact, most clusters are asynchronous systems, and are not driven by the same
clock.
• Although we say, “All processes are scheduled to run at the same time,” they do not
start exactly at the same time
Distributed Computing 30 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
30
15
8/9/2024
• Scheduling becomes more complicated when both cluster jobs and local jobs are
running.
• Local jobs should have priority over cluster jobs.
• With one keystroke, the owner wants command of all workstation resources.
• There are basically two ways to deal with this situation:
– The cluster job can either stay in the workstation node or
– migrate to another idle node.
Distributed Computing 31 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
31
Distributed Computing 32 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
32
16
8/9/2024
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
[R2: Chap - 7]
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by Publishers of T1 & T2
Presentation Overview
1
8/9/2024
• The goal of grid computing is to explore fast solutions for large-scale computing
problems.
• This objective is shared by computer clusters and massively parallel processor
(MPP) systems
• However, grid computing takes advantage of the existing computing resources
scattered in a nation or internationally around the globe.
• In grids, resources owned by different organizations are aggregated together and
shared by many users in collective applications.
Distributed Computing 3 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
• Grids rely heavily on LAN/WAN resources shared across enterprises, organizations, and governments.
• Virtual organizations and virtual supercomputers are new concepts derived from grid and cloud computing.
• These are virtual resources that are dynamically configured and are not under the full control of any single user or local administrator.
Distributed Computing 4 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
2
8/9/2024
Distributed Computing 5 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Distributed Computing 6 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
3
8/9/2024
Distributed Computing 7 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
• Most of today’s grid systems are called computational grids or data grids.
• Information or knowledge grids form another grid class, dedicated to knowledge management and distributed ontology processing.
• In the business world, we see a family, called business grids, built for business
data/information processing. Some business grids are being transformed
into Internet clouds.
• The last grid class includes several grid extensions such as P2P grids and parasitic
grids.
Distributed Computing 8 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
4
8/9/2024
• The top layer corresponds to user applications to run on the grid system.
• The user applications demand collective services including collective computing
and communications.
• The next layer is formed by the hardware and software resources aggregated to run
the user applications under the collective operations.
• The connectivity layer provides the interconnection among drafted resources.
• This connectivity could be established directly on physical networks or it could be
built with virtual networking technology.
Distributed Computing 9 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
• The connectivity must support the grid fabric, including the network links and
virtual private channels.
• The fabric layer includes all computational resources, storage systems, catalogs,
network resources, sensors, and their network connections.
• The connectivity layer enables the exchange of data between fabric layer resources.
• The five-layer grid architecture is closely related to the layered Internet protocol
stack
• The connectivity layer is supported by the network and transport layers of the
Internet stack.
• The Internet application layer supports the top three layers.
Distributed Computing 10 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
10
5
8/9/2024
Distributed Computing 11 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
11
Grid Resources
Storage resources: putting and getting files; control over resources allocated to data transfers; advance reservation. Hardware and software characteristics; relevant load information: available space and bandwidth utilization.
12
6
8/9/2024
• Both public and virtual grids can be built over large or small machines, that are
loosely coupled together to satisfy the application need.
• Grids differ from the conventional supercomputers in many ways in the context of
distributed computing.
• Supercomputers like MPPs in the Top-500 list are more homogeneously structured
with tightly coupled operations, while the grids are built with heterogeneous nodes
running non-interactive workloads.
• These grid workloads may involve a large number of files and individual users. The
geographically dispersed grids are more scalable and fault-tolerant with
significantly lower operational costs than the supercomputers.
Distributed Computing 13 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
13
Distributed Computing 14 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
14
7
8/9/2024
• At present, many volunteer computing grids are built using the CPU scavenging
model.
• The most famous example is SETI@Home, which applied over 3 million computers to achieve 23.37 TFlops as of Sept. 2001.
• More recent examples include the BOINC and Folding@Home etc.
• In practice, these virtual grids can be viewed as virtual supercomputers
Distributed Computing 15 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
15
• A new drug development subject needs the molecular simulation software from a
research institute and the clinic database from a hospital.
• At the same time, computer scientists need to perform biological sequence analysis
on Linux clusters in a supercomputing center.
• The members from three organizations contribute all or part of their hardware or
software resources to form a VO
Distributed Computing 16 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
16
8
8/9/2024
Distributed Computing 17 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
17
Distributed Computing 18 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
18
9
8/9/2024
• The owners can also modify the number of cluster servers to be allocated to a
specific VO.
• The participants join or leave the VO dynamically under special service agreement.
• A new physical organization may also join an existing VO.
• A participant may leave a VO, once its job is done.
• The dynamic nature of the resources in a VO poses a great challenge for grid computing.
• Resources have to cooperate closely to produce a rewarding result.
• Without an effective resource management system, the grid or VO may be
inefficient and waste resources, if poorly managed.
Distributed Computing 19 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
19
• The OGSA is an open source grid service standard jointly developed by academia
and the IT industry under coordination of a working group in the Global Grid Forum
(GGF).
• The standard was specifically developed for the emerging grid and cloud service
communities.
• The OGSA is extended from web service concepts and technologies.
• The standard defines a common framework that allows businesses to build grid
platforms across enterprises and business partners.
• The intent is to define the standards required for both open source and commercial
software to support a global grid infrastructure
Distributed Computing 20 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
20
10
8/9/2024
Distributed Computing 21 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
21
Distributed Computing 22 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
22
11
8/9/2024
Distributed Computing 23 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
23
• Two schemas are often used to classify them: hierarchical classification, which
classifies scheduling methods by a multilevel tree and flat classification based on a
single attribute.
• These include methods based on adaptive versus nonadaptive, load balancing,
bidding, or probabilistic and one-time assignment versus dynamic reassignment
Distributed Computing 24 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
24
12
8/9/2024
Distributed Computing 25 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
25
Distributed Computing 26 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
26
13
8/9/2024
Distributed Computing 27 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
27
• In a grid environment where data needs to be federated for sharing among various
interested parties, data grid technologies such as SRB, Globus RLS, and EU DataGrid
need to be deployed.
• The user-level middleware needs to be deployed on resources responsible for
providing resource brokering and application execution management services.
• Users may even access these services via web portals.
• Several grid resource brokers have been developed. Some of the more prominent
include Nimrod-G, Condor-G, GridWay, and Gridbus Resource Broker.
Distributed Computing 28 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
28
14
8/9/2024
Distributed Computing 29 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
29
6. The broker identifies computational resources that provide the required services.
7. The broker ensures that the user has necessary credit or an authorized share to
utilize the resources.
8. The broker scheduler analyzes the resources to meet the user’s QoS requirements.
9. The broker agent on a resource executes the job and returns the results.
10. The broker collates the results and passes them to the user.
11. The meter charges the user by passing the resource usage information to the
accountant.
Distributed Computing 30 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
30
15
8/9/2024
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by Publishers of T1 & T2
PRESENTATION OVERVIEW
• Introduction
• Applications of IOT
• Radio-Frequency Identification
• ZigBee
DC - Anil Kumar G 2
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
1
8/9/2024
Introduction
Applications of IOT
Radio-Frequency Identification
ZigBee
SafePlug 1203
• The SafePlug 1203 electrical outlet installs into a standard
120V 15A outlet and provides two independently controlled
receptacles.
• Each receptacle features independent:
– Power (energy) monitoring,
– Line voltage monitoring,
– On/off control,
– Groups/Scenes support,
– Appliance tracking,
– Fire and shock protection, and
– Cold load start smoothing
DC - Anil Kumar G 23
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
ZigBee device examples (figures): Smart Thermostats; Siemens APOGEE Floor Level Network Controller; Smart meter; Kwikset SmartCode; Wireless Outlet
Thank You
ALL THE BEST
DC - Anil Kumar G 39
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
39
20