DC Unit 3
Dr. Bhandari R. R
Associate Professor & HOD AI & DS Engineering
SNJB’s Late Sau. K. B. Jain College of Engineering
Unit 3 Distributed Computing Algorithms
● Communication and coordination in distributed systems are crucial aspects that determine the efficiency,
reliability, and overall performance of the system.
● Key considerations for communication and coordination in distributed systems:
1. Message Passing
2. Remote Procedure Call (RPC) and Remote Method Invocation (RMI)
3. Data Replication
4. Coordination Models
5. Distributed Transactions
6. Fault Tolerance
7. Consensus Algorithms
8. Monitoring and Logging
Message Passing
● Message passing is a flexible and scalable method for inter-node communication in distributed systems.
● It enables nodes to exchange information, coordinate activities, and share data without relying on shared
memory or direct method invocations.
● Models like synchronous and asynchronous message passing offer different synchronization and
communication semantics to suit system requirements.
● Synchronous message passing ensures sender and receiver synchronization, while asynchronous message
passing allows concurrent execution and non-blocking communication.
● It incurs overhead because every send and receive is performed via the kernel (system calls).
● It is well suited to sharing small amounts of data without causing conflicts, since no memory is shared.
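As an in-process illustration of the asynchronous model, here is a minimal sketch: the queue stands in for a network channel, and the sender/receiver roles are invented for the example.

```python
# Minimal sketch of asynchronous message passing: two "nodes" (threads)
# exchange messages through a queue that stands in for the network channel.
import queue
import threading

channel = queue.Queue()  # the communication channel between the two nodes

def sender():
    for i in range(3):
        channel.put(f"message {i}")  # non-blocking send: the sender continues at once

def receiver():
    for _ in range(3):
        msg = channel.get()          # blocks until a message arrives
        print("received:", msg)

s = threading.Thread(target=sender)
r = threading.Thread(target=receiver)
s.start(); r.start()
s.join(); r.join()
```

A synchronous send would instead block the sender until the receiver has taken the message (a rendezvous), which is the synchronization difference noted above.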
Remote Procedure Call (RPC)
● Remote Procedure Call (RPC) is a powerful technique for constructing distributed, client-server based applications.
● It is based on extending conventional local procedure calling so that the called procedure need not exist in the same address space as the calling procedure.
● The two processes may be on the same system, or they may be on different systems with a network
connecting them.
Remote Procedure Call (RPC)
1. A client invokes a client stub procedure, passing parameters in the usual way. The client stub resides
within the client’s own address space.
2. The client stub marshals (packs) the parameters into a message. Marshalling includes converting the
representation of the parameters into a standard format and copying each parameter into the message.
3. The client stub passes the message to the transport layer, which sends it to the remote server machine.
4. On the server, the transport layer passes the message to a server stub, which unmarshals (unpacks) the
parameters and calls the desired server routine using the regular procedure call mechanism.
5. When the server procedure completes, it returns to the server stub (e.g., via a normal procedure call
return), which marshals the return values into a message. The server stub then hands the message to
the transport layer.
6. The transport layer sends the result message back to the client transport layer, which hands the
message back to the client stub.
7. The client stub unmarshals the return parameters and execution returns to the caller.
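The steps above can be seen concretely with Python's standard-library XML-RPC modules, where the stub and marshalling work is done by the library. This is a minimal sketch; the add function, the port, and the two-file split are invented for the example, and the server and client should run as separate programs.

```python
# --- server.py ---
# The server stub: registers a function and unmarshals incoming calls.
from xmlrpc.server import SimpleXMLRPCServer

def add(a, b):
    return a + b

server = SimpleXMLRPCServer(("localhost", 8000), allow_none=True)
server.register_function(add, "add")
server.serve_forever()

# --- client.py ---
# The client stub: marshals the arguments, sends them, unmarshals the result.
import xmlrpc.client

proxy = xmlrpc.client.ServerProxy("http://localhost:8000/")
print(proxy.add(2, 3))   # looks like a local call; it is actually remote
```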
Advantages of Remote Procedure Calls:
1. RPC provides abstraction, i.e., the message-passing nature of network communication is hidden from
the user.
2. RPC often omits many of the protocol layers to improve performance. Even a small performance
improvement is important because a program may invoke RPCs often.
3. RPC enables the usage of the applications in the distributed environment, not only in the local
environment.
4. With RPC, code re-writing/re-development effort is minimized.
5. RPC supports both process-oriented and thread-oriented models.
Disadvantages of Remote Procedure Calls:
● In remote procedure calls, parameters are passed by value only; pointers are not allowed, since an
address is meaningless in another address space.
● Because it involves communication with another machine and another process, the mechanism is far
more prone to failure than a local call.
● The RPC concept can be implemented in a variety of ways, so there is no single standard.
● Because of its interaction-based nature, RPC offers little flexibility with respect to hardware architecture.
● A remote procedure call costs significantly more than a local procedure call.
Remote Method Invocation
● RMI stands for Remote Method Invocation. It is a mechanism that allows an object
residing in one system (JVM) to access/invoke an object running on another JVM.
● RMI is used to build distributed applications; it provides remote communication between Java
programs. It is provided in the package java.rmi.
Consensus Algorithms
Consensus algorithms are fundamental to distributed computing. In a distributed system, multiple nodes communicate with
each other to perform a common task. Consensus algorithms help these nodes agree on a shared value, even when some nodes
may fail or be unreliable.
1. Paxos
● The Paxos algorithm was first introduced by Leslie Lamport in 1990 and has since become
widely used in distributed systems.
● The algorithm works by allowing a group of nodes to agree on a single value, even if some nodes
fail or are unresponsive.
Paxos
1. Phase 1 (Prepare Phase): A node that wants to propose a value as the agreed value sends a prepare message to all
other nodes. This message contains a proposal number, which is a unique identifier for this particular proposal.
Each node that receives the prepare message responds with a promise not to accept any proposal with a lower number.
2. Phase 2 (Accept Phase): Once a node receives promises from a majority of nodes, it can send an accept message to
all nodes containing the proposed value and the proposal number. Each node that receives this accept message will
accept the proposed value only if it has not already promised to accept a higher numbered proposal.
3. Phase 3 (Commit Phase): Once a node receives accept messages from a majority of nodes, it can send a commit
message to all nodes containing the agreed-upon value. The other nodes will then update their state with this
agreed-upon value.
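A highly simplified, single-process sketch of the prepare/accept logic follows. The class and function names are invented, all acceptors live in memory, and the commit/learn step is collapsed into the proposer for brevity.

```python
# Single-process Paxos sketch (illustrative names; acceptors are in memory
# and the learn/commit step is collapsed into the proposer for brevity).
class Acceptor:
    def __init__(self):
        self.promised = 0        # highest proposal number promised so far
        self.accepted = None     # (number, value) of the last accepted proposal

    def prepare(self, n):
        # Phase 1: promise to ignore proposals numbered below n
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n, value):
        # Phase 2: accept unless a higher-numbered promise was made
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False

def propose(acceptors, n, value):
    # Phase 1: collect promises from a majority of acceptors
    responses = [a.prepare(n) for a in acceptors]
    granted = [acc for ok, acc in responses if ok]
    if len(granted) <= len(acceptors) // 2:
        return None                              # no majority, proposal fails
    # safety rule: adopt the highest-numbered value already accepted, if any
    prior = [acc for acc in granted if acc is not None]
    if prior:
        value = max(prior)[1]
    # Phase 2: ask acceptors to accept (n, value); succeed on a majority
    votes = sum(a.accept(n, value) for a in acceptors)
    return value if votes > len(acceptors) // 2 else None

acceptors = [Acceptor() for _ in range(5)]
print(propose(acceptors, 1, "X"))   # -> 'X', chosen by a majority
```

With five acceptors, a proposal succeeds once three of them promise and then accept it, which is the majority requirement described above.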
Consensus Algorithms
2. Raft
Raft breaks the consensus problem into three parts:
1. Leader Election
2. Log Replication
3. Committing Entries
Raft Algorithm
● The Raft algorithm was introduced by Diego Ongaro and John Ousterhout in 2014 as an
alternative to Paxos that is easier to understand and implement.
● It also follows a leader-follower architecture, where one node is designated as the leader and
the other nodes act as followers.
Raft Algorithm
1. Leader Election: Nodes start out as followers, and one node is elected leader. The leader uses a
timer to periodically send out heartbeats to other nodes. If a follower does not receive a heartbeat from the leader
within a certain amount of time, it assumes that the leader has failed and initiates a new leader election.
2. Log Replication: Once a leader is elected, it can accept client requests and replicate them to the other nodes in the
system. When a leader receives a client request, it appends the request to its log and sends append entries messages
to the other nodes. These messages contain information about the log entry, including the index and term number.
3. Committing Entries: Once a log entry has been replicated to a majority of nodes in the system, the leader can send
a commit message to all nodes, indicating that the entry has been committed. The other nodes will then apply the
entry to their own state machines.
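Here is a sketch of the leader-side bookkeeping for log replication and commitment. The names and the ack counting are assumptions for illustration; real Raft also handles terms, elections, and log-consistency checks on AppendEntries.

```python
# Leader-side sketch of Raft log replication and commitment.
class RaftLeader:
    def __init__(self, followers):
        self.term = 1
        self.log = []                 # list of (term, command)
        self.commit_index = -1
        self.acks = {}                # log index -> followers that replicated it
        self.followers = followers

    def client_request(self, command):
        self.log.append((self.term, command))   # step 2: append to own log
        return len(self.log) - 1                # index sent in AppendEntries

    def on_append_entries_ack(self, follower, index):
        self.acks.setdefault(index, set()).add(follower)
        # step 3: commit once a majority (leader included) stores the entry
        majority = (len(self.followers) + 1) // 2 + 1
        if len(self.acks[index]) + 1 >= majority and index > self.commit_index:
            self.commit_index = index           # entry is now committed

leader = RaftLeader(["n2", "n3", "n4", "n5"])
i = leader.client_request("set x = 1")
leader.on_append_entries_ack("n2", i)
leader.on_append_entries_ack("n3", i)
print(leader.commit_index)   # -> 0: three of five nodes store the entry
```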
3. ZAB
The Zab (ZooKeeper Atomic Broadcast) algorithm is a consensus protocol used in distributed systems. It is specifically designed
to provide reliable and ordered message delivery among a group of nodes, ensuring consistency and fault-tolerance.
ZAB Algorithm
Analogy:
Imagine a group of friends who want to play a game and make sure that everyone gets the
same instructions in the right order. They need someone to be the leader, who will give
instructions to all the other friends. But they also want to make sure that if the leader
becomes unavailable, another friend can take over and continue giving instructions.
Consensus Algorithms: ZAB
1. Electing the Leader: The friends need to choose a leader among themselves. They do this by raising their hands and
voting for a leader. The friend who gets the most votes becomes the leader. This leader will be responsible for sending
instructions to everyone.
2. Broadcasting Instructions: The leader starts by giving an instruction to everyone, like "Jump three times!" The
friends receive the instruction and follow it. They acknowledge that they have received the instruction.
3. Acknowledgments: Once the leader receives acknowledgments from a majority of the friends, it knows that the
instruction has been successfully delivered. This ensures that everyone has received and understood the instruction.
4. Handling Leader Failures: Sometimes the leader may become unavailable, like if they leave the game or get too
busy. In that case, the friends need to elect a new leader. They use a similar voting process to choose a new leader
among themselves. This ensures that there is always a leader to give instructions.
5. Ensuring Order: The Zab algorithm guarantees that the instructions are delivered and executed in the same order by
all friends. This means that if the leader says, "Jump three times!" and then says, "Clap your hands!" everyone will
follow the instructions in that order. This ensures consistency among all friends.
Consensus Algorithms: ZAB
1. Role of Zab in Distributed Systems: Zab is responsible for maintaining a consistent state among multiple nodes in a
distributed system. It achieves this by ensuring that all nodes receive and process messages in the same order.
2. Leader Election: Zab starts by electing a leader node among the participating nodes. The leader is responsible for
generating and broadcasting messages to all other nodes. This leader election process guarantees that there is always a
designated leader for message coordination.
3. Message Broadcasting: The leader generates messages, assigns each message a unique sequence number, and sends
them to all nodes in the system. The messages are transmitted in order, ensuring that all nodes receive them in the same
sequence.
4. Acknowledgments and Quorums: After receiving a message, each node sends an acknowledgment (ACK) back to the
leader. Once the leader receives acknowledgments from a majority of nodes, it knows that the message has been
successfully delivered and can proceed to the next message.
5. Handling Failures: Zab handles failures by electing a new leader when the current leader becomes unavailable. It uses a
variant of the Paxos algorithm to perform leader election, ensuring that a reliable leader is always present to coordinate
message delivery.
6. Atomic Broadcast Guarantee: Zab guarantees atomic broadcast, meaning that either all nodes receive a message in the
same order or none of them receive it. This ensures consistency across the distributed system and prevents
inconsistencies caused by message reordering or losses.
7. Snapshotting: In addition to message delivery, Zab also supports snapshotting, which allows a node to save and restore
its state from a particular point in time. This feature is essential for efficient recovery and replication in distributed
systems.
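A toy sketch of the zxid-ordered broadcast and quorum acknowledgment described in steps 3 and 4 follows. The class names and in-memory message flow are invented; real Zab persists proposals and handles recovery and leader changes.

```python
# Toy sketch of Zab-style atomic broadcast: the leader stamps each message
# with an increasing zxid, followers log proposals in zxid order and ack,
# and the leader commits once a quorum (majority of the ensemble) has acked.
class ZabLeader:
    def __init__(self, ensemble_size):
        self.zxid = 0
        self.n = ensemble_size
        self.acks = {}            # zxid -> ack count
        self.committed = []       # zxids delivered, in order

    def broadcast(self, msg, followers):
        self.zxid += 1
        self.acks[self.zxid] = 1  # the leader counts as one ack
        for f in followers:
            f.propose(self, self.zxid, msg)
        return self.zxid

    def ack(self, zxid):
        self.acks[zxid] += 1
        if self.acks[zxid] > self.n // 2 and zxid not in self.committed:
            self.committed.append(zxid)   # quorum reached: commit in order

class ZabFollower:
    def __init__(self):
        self.log = []

    def propose(self, leader, zxid, msg):
        self.log.append((zxid, msg))      # persist in zxid order
        leader.ack(zxid)                  # then acknowledge

leader = ZabLeader(ensemble_size=3)
followers = [ZabFollower(), ZabFollower()]
leader.broadcast("update /config", followers)
print(leader.committed)   # -> [1]: quorum of 2 out of 3 reached
```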
Consensus Algorithms: Proof of Work
● Proof of Work consensus is the mechanism of choice for the majority of cryptocurrencies currently in
circulation. The algorithm is used to verify the transaction and create a new block in the blockchain.
● The idea for Proof of Work (PoW) was first published in 1993 by Cynthia Dwork and Moni Naor and was later
applied by Satoshi Nakamoto in the Bitcoin paper in 2008.
● The term “proof of work” was first used by Markus Jakobsson and Ari Juels in a publication in 1999.
● Cryptocurrencies like Litecoin and Bitcoin currently use PoW. Ethereum used the PoW mechanism but has
since shifted to Proof of Stake (PoS).
Purpose of Proof of Work
● The purpose of a consensus mechanism is to bring all the nodes in agreement, that is, trust one another, in an
environment where the nodes don’t trust each other.
● All the transactions in the new block are then validated and the new block is then added to the blockchain.
● The block gets added to the chain that has the greatest block height (the longest chain).
● Miners (special computers on the network) perform computational work to solve a complex mathematical
problem in order to add the block to the network; hence the name Proof of Work.
● With time, the mathematical problem becomes more complex.
Features: Proof of Work
There are mainly two features that have contributed to the wide popularity of this consensus protocol:
● It is hard to find a solution to a mathematical problem.
● It is easy to verify the correctness of that solution.
How does PoW Work:
The PoW consensus algorithm involves verifying a transaction through the mining process. This section focuses on
discussing the mining process and resource consumption during the mining process.
Mining:
The Proof of Work consensus algorithm involves solving a computationally challenging puzzle in order to create new
blocks in the Bitcoin blockchain. The process is known as ‘mining’, and the nodes in the network that engage in
mining are known as ‘miners’.
● The incentive for mining transactions lies in economic payoffs, where competing miners are rewarded with
6.25 bitcoins and a small transaction fee.
● This reward will get reduced by half its current value with time.
How does PoW Work: Example
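A classic hashcash-style sketch: finding a nonce whose hash has a given number of leading zero hex digits is hard, but verifying a claimed nonce takes a single hash. The difficulty measure here is a simplification of Bitcoin's 256-bit target, and the block data is invented.

```python
# Hashcash-style proof-of-work sketch: search for a nonce such that
# SHA-256(block_data + nonce) starts with `difficulty` zero hex digits.
import hashlib

def mine(block_data: str, difficulty: int) -> int:
    prefix = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce                 # hard to find: many hashes tried...
        nonce += 1

def verify(block_data: str, nonce: int, difficulty: int) -> bool:
    digest = hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)   # ...but one hash to verify

nonce = mine("block 42: alice->bob 5 BTC", difficulty=4)
print(nonce, verify("block 42: alice->bob 5 BTC", nonce, 4))
```

This directly exhibits the two features above: the solution is expensive to find but cheap to check.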
Consensus Algorithms
5. Proof-of-Stake (PoS)
● Proof of stake is a consensus mechanism used to verify new cryptocurrency transactions. Since
blockchains lack any centralized governing authorities, proof of stake is a method to guarantee
that data saved on the network is valid.
● Proof of stake is the consensus mechanism that helps choose which participants get to validate new
blocks, a lucrative task because the chosen validators are rewarded with new crypto if they accurately
validate the new data and do not cheat the system.
Consensus Algorithms
5. Proof-of-Stake (PoS)
● Solana, Terra, and Cardano are among the biggest cryptocurrencies that use proof of stake.
● Ethereum, the second-largest crypto by market capitalization after Bitcoin, has completed its
transition from proof of work to proof of stake.
Consensus Algorithms
● SPBFT aims to reduce communication complexity and improve performance while maintaining Byzantine fault
tolerance. It is often used in permissioned blockchain systems.
Fault Tolerance and Recovery in Distributed Systems
The ability of the system to continue operating as intended even in the event of a failure is known as fault tolerance.
Types of Faults
1. Transient Faults: occur once and then disappear (e.g., a momentary network glitch).
2. Intermittent Faults: appear, vanish, and then reappear at irregular intervals.
3. Permanent Faults: persist until the faulty component is repaired or replaced.
Recovery in distributed systems refers to the ability of a system to recover from failures or faults and continue operating in a
consistent and reliable manner. Here are some key concepts and strategies related to recovery in distributed systems:
1. A load balancer is a device that acts as a reverse proxy and distributes network or application traffic across a number of
servers. Common load-balancing algorithms include:
● Round Robin
● Least Connections
● Least Time
● Hash
● IP Hash
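Minimal sketches of the first two strategies above follow. The server names are placeholders, and a production balancer would also track health checks and timeouts.

```python
# Minimal sketches of Round Robin and Least Connections balancing.
import itertools

class RoundRobinBalancer:
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)   # rotate through servers in order

    def pick(self):
        return next(self._cycle)

class LeastConnectionsBalancer:
    def __init__(self, servers):
        self.active = {s: 0 for s in servers}    # open connections per server

    def pick(self):
        server = min(self.active, key=self.active.get)  # least-loaded server
        self.active[server] += 1
        return server

    def release(self, server):
        self.active[server] -= 1                 # call when a connection closes

rr = RoundRobinBalancer(["s1", "s2", "s3"])
print([rr.pick() for _ in range(4)])             # ['s1', 's2', 's3', 's1']
```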
Artificial Intelligence (AI) techniques can be applied to optimize distributed computing algorithms in various ways, enhancing
their efficiency, adaptability, and overall performance.
● Genetic Algorithms
● Reinforcement Learning
● Neural Networks: Employ neural networks to learn patterns in the workload distribution and dynamically adjust the load-balancing
strategy to optimize resource utilization across nodes.
● Anomaly Detection: Utilize AI-driven anomaly detection to identify unusual behavior or potential faults in the distributed system.
Applying AI techniques to optimize distributed computing algorithms
● Consensus Optimization: Explore the use of deep learning to predict optimal consensus-algorithm configurations based on
historical performance data and current system conditions.
● Energy Efficiency:
AI for Green Computing: Employ AI techniques to optimize energy consumption in distributed systems by dynamically
adjusting resource allocation and workload distribution based on energy efficiency models.
Machine Learning for Resource Allocation
Machine learning can help you learn and improve your resource allocation decisions continuously, based on the outcomes,
results, and impacts of your actions. This can help you enhance your efficiency, quality, and productivity over time, and adapt
to changing conditions and requirements.
Here are ways machine learning can be applied to optimize resource allocation:
Demand Prediction:
Regression Models: Train regression models to predict the resource requirements of applications based on historical usage patterns, workload
characteristics, and other relevant features. This enables proactive resource provisioning.
Classification Models:
Use classification models to determine whether auto-scaling is needed based on predicted future demand. This can automate the decision-making
process for dynamically adjusting the number of resources in response to workload changes.
Anomaly Detection:
Anomaly Detection Algorithms: Implement anomaly detection models to identify unusual behavior in resource consumption. This helps detect
abnormal resource usage patterns, indicating potential issues or changes in the application's behavior.
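As a small sketch of the regression-based demand prediction described above: this assumes scikit-learn is installed, and the features, toy history, and core counts are invented for illustration.

```python
# Regression-based demand prediction sketch for proactive provisioning.
import numpy as np
from sklearn.linear_model import LinearRegression

# toy history: [hour_of_day, requests_per_second] -> CPU cores that were needed
X = np.array([[9, 120], [12, 300], [15, 250], [21, 80]])
y = np.array([4, 10, 8, 3])

model = LinearRegression().fit(X, y)
predicted = model.predict(np.array([[12, 280]]))[0]
print(f"provision about {predicted:.1f} cores for noon at ~280 req/s")
```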
Machine Learning for Resource Allocation
QoS Optimization:
Reinforcement Learning: Apply reinforcement learning to optimize Quality of Service (QoS) by dynamically adjusting resource allocation
parameters. The model can learn to balance performance, cost, and other factors based on feedback from the system.
Workload Placement:
Clustering Algorithms: Use clustering algorithms to group similar workloads together. This facilitates efficient resource allocation by placing
similar workloads on the same nodes, reducing contention for resources.
Cost Optimization:
Cost Prediction Models: Develop models to predict the cost associated with different resource allocation scenarios. This helps in making
informed decisions regarding resource allocation based on cost-effectiveness.
Task Prioritization:
Decision Trees or Neural Networks: Employ machine learning models to dynamically assign priorities to different tasks or applications. This
enables intelligent scheduling, ensuring that critical tasks receive appropriate resources.
Multi-Objective Optimization:
Evolutionary Algorithms: Use evolutionary algorithms to optimize resource allocation for multiple objectives, such as minimizing cost,
maximizing performance, and ensuring fairness among different applications.
Adaptive Configuration:
Deep Learning: Leverage deep learning models to learn optimal configurations for various applications in different contexts. This enables the
system to adaptively configure resource allocation settings based on real-time conditions.
Machine Learning for Resource Allocation
Reinforcement learning (RL) can be a powerful approach for dynamic load balancing in distributed systems, allowing the system
to adapt and optimize load distribution based on real-time conditions. The key ingredients are listed below, with a small sketch after the list.
State Representation: the current load, queue lengths, and resource utilization of each node
Action Space: the set of choices, e.g., which node to route the next request or task to
Reward Function: feedback that favors balanced load and low response times
RL Algorithm: the learning method, e.g., Q-learning or a policy-gradient method
Training Environment: a simulator or recorded traces used to train the agent safely
Periodic Retraining: refreshing the learned policy as workload patterns drift
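A tabular Q-learning sketch of these ingredients follows. The two-node environment, the state encoding, and the constants are all invented for illustration; a real system would use a richer state and a proper simulator.

```python
# Tabular Q-learning sketch for dynamic load balancing.
import random

ACTIONS = ["node_a", "node_b"]         # action space: where to route the next job
alpha, gamma, epsilon = 0.5, 0.9, 0.1  # learning rate, discount, exploration rate
Q = {}                                 # state -> {action: estimated value}
load = {"node_a": 0, "node_b": 0}

def state_of(load):
    return load["node_a"] > load["node_b"]   # crude state: is node_a busier?

for step in range(1000):
    s = state_of(load)
    q = Q.setdefault(s, {a: 0.0 for a in ACTIONS})
    if random.random() < epsilon:
        action = random.choice(ACTIONS)      # explore
    else:
        action = max(q, key=q.get)           # exploit the current estimate
    load[action] += 1                        # route one job to the chosen node
    r = -load[action]                        # reward: penalize loading a busy node
    s2 = state_of(load)
    q2 = Q.setdefault(s2, {a: 0.0 for a in ACTIONS})
    q[action] += alpha * (r + gamma * max(q2.values()) - q[action])
    load = {k: max(0, v - 1) for k, v in load.items()}   # one job finishes per node

print(Q)   # the learned values favor routing to the less-loaded node
```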
Genetic Algorithms (GAs) can be effectively employed for task scheduling in distributed systems, where the goal is to assign tasks
to various processing units to optimize overall system performance. Here's how genetic algorithms can be applied to task
scheduling (a short sketch follows the list):
Chromosome Representation: a candidate schedule, e.g., a vector mapping each task to a processor
Initial Population: a set of randomly generated candidate schedules
Fitness Function: evaluates a schedule, e.g., by its makespan or load balance
Selection: fitter schedules are more likely to be chosen as parents
Crossover (Recombination): parts of two parent schedules are combined into a child schedule
Mutation: a task's assignment is occasionally changed at random to preserve diversity
Replacement: the new generation replaces the old one, and the process repeats
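A minimal sketch of these steps for two processors and five tasks: the task costs and GA constants are invented, and fitness is the negated makespan so that shorter schedules are fitter.

```python
# Minimal GA sketch for task scheduling: a chromosome maps each task to a
# processor, and fitness is the negated makespan.
import random

TASK_COST = [4, 2, 7, 3, 5]   # running time of each task
N_PROCS = 2

def fitness(chrom):
    loads = [0] * N_PROCS
    for task, proc in enumerate(chrom):
        loads[proc] += TASK_COST[task]
    return -max(loads)                        # shorter makespan = fitter

def evolve(pop_size=20, generations=50):
    # initial population: random assignments of tasks to processors
    pop = [[random.randrange(N_PROCS) for _ in TASK_COST] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[:pop_size // 2]       # selection: keep the fitter half
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = random.sample(survivors, 2)
            cut = random.randrange(1, len(TASK_COST))
            child = p1[:cut] + p2[cut:]       # one-point crossover
            if random.random() < 0.1:         # mutation: reassign one task
                child[random.randrange(len(child))] = random.randrange(N_PROCS)
            children.append(child)
        pop = survivors + children            # replacement: next generation
    return max(pop, key=fitness)

best = evolve()
print(best, "makespan:", -fitness(best))
```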
Swarm Intelligence for Distributed Optimization
Swarm intelligence is a computational paradigm inspired by the collective behavior of social organisms, such as ant colonies, bird
flocks, and bee swarms. In the context of distributed optimization, swarm intelligence algorithms are used to solve complex
problems by simulating cooperation and interaction among a group of simple agents. Some common swarm
intelligence algorithms applied to distributed optimization are:
● Particle Swarm Optimization (PSO): inspired by bird flocking; candidate solutions ("particles") move through the search space,
guided by their own best position and the swarm's best position.
● Ant Colony Optimization (ACO): inspired by ant foraging; artificial ants deposit "pheromone" on components of good solutions,
biasing future searches toward them.
● Artificial Bee Colony (ABC): inspired by honey-bee foraging; employed, onlooker, and scout bees balance exploiting known
food sources (solutions) and exploring new ones.
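As a minimal sketch of PSO minimizing a toy objective f(x) = x*x: the inertia (0.7) and attraction (1.5) constants are typical illustrative values, not tuned for any real problem.

```python
# Minimal Particle Swarm Optimization sketch minimizing f(x) = x*x.
import random

def f(x):
    return x * x

n_particles, iterations = 10, 50
pos = [random.uniform(-10, 10) for _ in range(n_particles)]
vel = [0.0] * n_particles
pbest = pos[:]                         # each particle's best-known position
gbest = min(pos, key=f)                # the swarm's best-known position

for _ in range(iterations):
    for i in range(n_particles):
        r1, r2 = random.random(), random.random()
        # velocity update: inertia plus pulls toward personal and global bests
        vel[i] = (0.7 * vel[i]
                  + 1.5 * r1 * (pbest[i] - pos[i])
                  + 1.5 * r2 * (gbest - pos[i]))
        pos[i] += vel[i]
        if f(pos[i]) < f(pbest[i]):
            pbest[i] = pos[i]
    gbest = min(pbest, key=f)

print(gbest)   # converges toward the minimum at x = 0
```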