Cluster Computing (Unit 1-5)
Distributed Systems:
A distributed system is one in which hardware or software components located at networked computers
communicate and coordinate their actions only by passing messages. This simple definition covers the entire
range of systems in which networked computers can usefully be deployed.
Computers that are connected by a network may be spatially separated by any distance. They may be on separate
continents, in the same building or in the same room. Our definition of distributed systems has the following
significant consequences:
Concurrency:
In a network of computers, concurrent program execution is the norm. I can do my work on my computer while
you do your work on yours, sharing resources such as web pages or files when necessary. The capacity of the
system to handle shared resources can be increased by adding more resources (for example, computers) to the
network.
No global clock:
When programs need to cooperate they coordinate their actions by exchanging messages. Close coordination
often depends on a shared idea of the time at which the programs’ actions occur. But it turns out that there are
limits to the accuracy with which the computers in a network can synchronize their clocks – there is no single
global notion of the correct time. This is a direct consequence of the fact that the only communication is by
sending messages through a network.
Independent failures:
All computer systems can fail, and it is the responsibility of system designers to plan for the consequences of
possible failures. Distributed systems can fail in new ways. Faults in the network result in the isolation of the
computers that are connected to it, but that doesn’t mean that they stop running. In fact, the programs on them
may not be able to detect whether the network has failed or has become unusually slow. Similarly, the failure
of a computer, or the unexpected termination of a program somewhere in the system (a crash), is not
immediately made known to the other components with which it communicates. Each component of the system
can fail independently, leaving the others still running.
Performance Evaluation:
The following are the criteria for performance measures:
The client delay incurred by a process at each entry and exit operation.
Throughput of the system: the rate at which the collection of processes as a whole can
access the critical section.
Central Server Algorithm
This employs the simplest way to grant permission to enter the critical section: by using a
server.
A process sends a request message to the server and awaits a reply from it.
The reply constitutes a token signifying permission to enter the critical section.
If no other process has the token at the time of the request, then the server replies
immediately with the token.
If the token is currently held by another process, then the server does not reply but queues
the request.
On exiting the critical section, the client sends a message to the server, giving back the token.
Bandwidth: This is measured by the entering and exiting messages. Entering takes two
messages (a request followed by a grant), which are delayed by the round-trip time.
Exiting takes one release message and does not delay the exiting process.
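To make the server's logic concrete, here is a minimal single-machine sketch in Python; the CentralServer class and its request/release methods are illustrative names, not a standard API:

```python
# Minimal single-machine sketch of the central server algorithm.
# CentralServer, request() and release() are illustrative names,
# not a standard API.
from collections import deque

class CentralServer:
    def __init__(self):
        self.token_holder = None   # pid currently holding the token
        self.waiting = deque()     # queued requests awaiting the token

    def request(self, pid):
        """Handle an entry request; return True if the token is granted now."""
        if self.token_holder is None:
            self.token_holder = pid        # reply immediately with the token
            return True
        self.waiting.append(pid)           # token busy: queue, no reply yet
        return False

    def release(self, pid):
        """Client gives the token back on exiting the critical section."""
        assert self.token_holder == pid
        if self.waiting:
            self.token_holder = self.waiting.popleft()  # grant to next waiter
            return self.token_holder
        self.token_holder = None
        return None

server = CentralServer()
print(server.request("p1"))   # True  - token was free, granted immediately
print(server.request("p2"))   # False - queued behind p1
print(server.release("p1"))   # p2    - token passed to the queued process
```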
Ring-Based Algorithm
The unique token takes the form of a message passed from process to process around a ring
in a single direction (clockwise).
If a process does not require entry to the critical section when it receives the token, then it
immediately forwards the token to its neighbour.
A process that requires the token waits until it receives it, then retains it.
To exit the critical section, the process sends the token on to its neighbour.
Fig 4.13: Ring based algorithm
This algorithm satisfies ME1 and ME2 but not ME3, i.e. safety and liveness are satisfied but
not ordering. The performance measures include:
Bandwidth: the token continuously consumes bandwidth except when a process is inside the CS.
Exit only requires one message.
Throughput: the synchronization delay between one exit and the next entry is anywhere from
1 (the next process) to N (the process itself) message transmissions.
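A short sketch of the token circulating around the ring; the process count and the wants_cs flags are illustrative, and a real implementation would pass the token as a network message to the neighbouring process:

```python
# Sketch of token circulation in the ring-based algorithm. N and the
# wants_cs flags are illustrative placeholders.
N = 4
wants_cs = [False, True, False, True]   # which processes want the CS

def circulate(rounds=1):
    token_at = 0
    for _ in range(rounds * N):
        if wants_cs[token_at]:
            print(f"p{token_at} enters CS")   # retains token while inside
            wants_cs[token_at] = False
            print(f"p{token_at} exits CS")
        token_at = (token_at + 1) % N         # forward token clockwise

circulate()
```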
Multicast Synchronisation
This achieves mutual exclusion between N peer processes using multicast.
Processes that require entry to a critical section multicast a request message, and can enter it only
when all the other processes have replied to this message.
The conditions under which a process replies to a request are designed to ensure that ME1, ME2 and
ME3 are met.
Each process pi keeps a Lamport clock. Messages requesting entry are of the form <T, pi>.
Each process records its state of either RELEASED, WANTED or HELD in a variable state.
If a process requests entry and all other processes are in state RELEASED, then all processes reply
immediately.
If some process is in state HELD, then that process will not reply until it is finished.
If some process is in state WANTED and has a smaller timestamp than the incoming request,
it will queue the request until it is finished.
If two or more processes request entry at the same time, then whichever bears the lowest
timestamp will be the first to collect N-1 replies.
Fig: Multicast Synchronisation
In the above figure, P1 and P2 request the CS concurrently.
When P2 receives P1's request, it finds that its own request has the lower timestamp, and so does not
reply, holding P1's request in its queue.
However, P1 will reply, so P2 will enter the CS. After P2 finishes, it replies to P1 and P1 will enter the CS.
Granting entry takes 2(N-1) messages: N-1 to multicast the request and N-1 replies.
Performance Evaluation:
Bandwidth consumption is high.
Client delay is again one round-trip time.
Synchronization delay is one message transmission time.
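A compact sketch of the reply rule described above (the Ricart-Agrawala decision); the Process class and its method names are illustrative:

```python
# Sketch of the reply rule in the multicast algorithm. State names follow
# the notes; the Process class and its methods are illustrative.
RELEASED, WANTED, HELD = "RELEASED", "WANTED", "HELD"

class Process:
    def __init__(self, pid):
        self.pid = pid
        self.state = RELEASED
        self.request_ts = None   # Lamport timestamp of our own request
        self.deferred = []       # requests queued instead of answered

    def on_request(self, ts, sender):
        """Decide whether to reply to an incoming request <ts, sender>."""
        if self.state == HELD or (
            self.state == WANTED and (self.request_ts, self.pid) < (ts, sender)
        ):
            self.deferred.append((ts, sender))   # defer until we finish
        else:
            print(f"p{self.pid} replies to p{sender}")   # reply immediately

p = Process(2)
p.state, p.request_ts = WANTED, 10
p.on_request(12, 1)   # our request (10, p2) is earlier: p1's request deferred
p.on_request(8, 3)    # p3's request (8, p3) is earlier: reply immediately
```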
Maekawa's Voting Algorithm
• Think of processes as voting for one another to enter the CS. A candidate process must
collect sufficient votes to enter.
• Processes in the intersection of two sets of voters ensure the safety property ME1 by
casting their votes for only one candidate.
• There is at least one common member of any two voting sets, and all voting sets are the
same size, to be fair: Vi ∩ Vj ≠ ∅ and |Vi| = K for all i, j.
Performance Evaluation:
Bandwidth utilization is 2√N messages per entry to the CS and √N per exit.
Client delay is the same as Ricart and Agrawala’s algorithm, one round-trip time.
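The 2√N figure follows from the standard grid construction of the voting sets, which these notes do not spell out; a short derivation under that assumption:

```latex
% Grid construction of the voting sets (a standard derivation, assumed
% here rather than stated in the notes).
\[
  V_i \cap V_j \neq \emptyset \ \text{for all } i, j, \qquad |V_i| = K .
\]
% Arrange the N processes in a \sqrt{N} x \sqrt{N} grid and let V_i be the
% union of process i's row and column. Any row meets any column, so any two
% voting sets intersect, and
\[
  K = 2\sqrt{N} - 1 \approx 2\sqrt{N}.
\]
% Entry therefore costs about 2\sqrt{N} messages (requests plus replies)
% and exit about \sqrt{N} release messages, matching the figures above.
```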
Fault Tolerance
Fault tolerance concerns how the algorithms react when messages are lost or when a process
crashes.
None of the algorithms that we have described would tolerate the loss of messages if the
channels were unreliable.
The ring-based algorithm cannot tolerate any single process crash failure.
Maekawa's algorithm can tolerate some process crash failures: a crashed process is tolerable if it is not in
a voting set that is required.
The central server algorithm can tolerate the crash failure of a client process that neither holds nor has
requested the token.
Failure model
Failure model defines and classifies the faults.
In a distributed system both processes and communication channels may fail; that is, they may depart
from what is considered to be correct or desirable behavior.
Types of failures:
Omission Failures
Arbitrary Failures
Timing Failures
Omission failure
Omission failures refer to cases when a process or communication channel fails to perform
actions that it is supposed to perform.
The chief omission failure of a process is to crash. In a crash, the process has halted and
will not execute any further steps of its program.
Another type of omission failure is related to communication and is called a
communication omission failure.
The communication channel produces an omission failure if it does not transport a message from p's outgoing
message buffer to q's incoming message buffer.
This is known as "dropping messages" and is generally caused by a lack of buffer space at the receiver or
at a gateway, or by a network transmission error detected by a checksum carried with the message data.
Arbitrary failure
Arbitrary failure is used to describe the worst possible failure semantics, in which any type of error may
occur.
E.g. a process may set wrong values in its data items, or it may return a wrong value
in response to an invocation.
Communication channels can also suffer from arbitrary failures.
Omission failures are classified together with arbitrary failures.
Timing failure
Timing failures are applicable in synchronous distributed systems, where time limits are set on process
execution time, message delivery time and clock drift rate.
Masking failure
It is possible to construct reliable services from components that exhibit failures.
E.g. multiple servers that hold replicas of data can continue to provide a service when one
of them crashes.
A service masks a failure, either by hiding it altogether or by converting it into a more acceptable
type of failure.
E.g. checksums are used to mask corrupted messages - effectively converting an
arbitrary failure into an omission failure.
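A small sketch of this masking idea using the standard-library zlib.crc32 checksum; the frame layout is illustrative:

```python
# Sketch of masking with a checksum: a corrupted message (an arbitrary
# failure) is detected and dropped, i.e. converted into an omission
# failure. zlib.crc32 is standard library; the frame layout is illustrative.
import zlib

def send(payload: bytes) -> bytes:
    return zlib.crc32(payload).to_bytes(4, "big") + payload

def receive(frame: bytes):
    checksum, payload = int.from_bytes(frame[:4], "big"), frame[4:]
    if zlib.crc32(payload) != checksum:
        return None              # drop the message: an omission failure
    return payload

frame = bytearray(send(b"hello"))
frame[6] ^= 0xFF                 # simulate a transmission error
print(receive(bytes(frame)))     # None - the corruption is masked
```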
Programming Paradigms
Programming distributed systems requires addressing various challenges related to concurrency, fault tolerance,
and network communication. Several programming paradigms and models are used to develop distributed
systems, each with its own set of principles and approaches. Here are some common programming paradigms
in distributed systems:
Client-Server Model: This is one of the most straightforward paradigms in distributed computing. Clients send
requests to servers, which process these requests and return responses. The client-server model is often used for
web applications, where web browsers (clients) interact with web servers to fetch resources or perform actions.
Peer-to-Peer (P2P): In P2P systems, all nodes (peers) are equal and have the same capabilities. Peers can both
request and provide services, making P2P systems decentralized. Examples include file-sharing networks like
BitTorrent and blockchain networks like Bitcoin.
Message Passing: This paradigm focuses on communication between processes in a distributed system.
Messages are used to exchange information between nodes. Message-passing models can be implemented using
various communication mechanisms, such as message queues or Remote Procedure Calls (RPC).
Remote Procedure Call (RPC): RPC allows one process to invoke a function or method in another process, as
if it were a local function call. It simplifies remote communication and can hide the complexities of distributed
systems from developers.
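As a hedged illustration, here is a minimal RPC round trip using Python's standard-library xmlrpc modules; the host, port and add function are placeholders:

```python
# Minimal RPC round trip with Python's standard-library xmlrpc modules.
# The host, port and add() function are illustrative placeholders.
import threading
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer

server = SimpleXMLRPCServer(("localhost", 8000), logRequests=False)
server.register_function(lambda a, b: a + b, "add")
threading.Thread(target=server.serve_forever, daemon=True).start()

proxy = xmlrpc.client.ServerProxy("http://localhost:8000")
print(proxy.add(2, 3))   # 5 - a remote invocation that reads like a local call
```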
Message-Oriented Middleware (MOM): MOM is a communication paradigm that uses message queues to
enable asynchronous communication between distributed components. It's often used in event-driven systems,
such as financial trading platforms.
Actor Model: In the Actor model, everything is an actor, which encapsulates state and behavior. Actors
communicate through message passing. This model is used in systems like Erlang and Akka for building highly
concurrent and fault-tolerant applications.
Data-Parallel Programming: This paradigm is used for distributed data processing systems like Hadoop and
Spark. It focuses on breaking down large data sets into smaller partitions that can be processed independently
across multiple nodes.
MapReduce: MapReduce is a programming model and processing framework that simplifies distributed data
processing. It divides tasks into two phases: mapping data and reducing the results. It's widely used in big data
applications.
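A toy word count in the map/reduce style, sketched with the standard library rather than an actual Hadoop or Spark cluster:

```python
# Toy word count in the map/reduce style, using only the standard library.
from itertools import groupby
from operator import itemgetter

def map_phase(document):
    for word in document.split():
        yield (word, 1)                    # emit <key, value> pairs

def reduce_phase(pairs):
    # group the sorted pairs by key, then combine the values per key
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

docs = ["the cat", "the dog the cat"]
pairs = [kv for doc in docs for kv in map_phase(doc)]
print(dict(reduce_phase(pairs)))   # {'cat': 2, 'dog': 1, 'the': 3}
```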
Service-Oriented Architecture (SOA): SOA is an architectural style that focuses on designing distributed
systems as a collection of loosely coupled services. These services are designed to be independent and
communicate through well-defined interfaces, often using protocols like HTTP or SOAP.
Microservices Architecture: Microservices are an extension of SOA, where services are smaller, more focused,
and independently deployable. Microservices use lightweight communication mechanisms like RESTful APIs
and are often containerized for easy scaling and deployment.
Event-Driven Architecture: In this paradigm, systems react to events or messages, enabling real-time
communication and processing. Event-driven systems are commonly used in applications like IoT, real-time
analytics, and chat applications.
Blockchain Smart Contracts: In blockchain systems like Ethereum, smart contracts are self-executing
programs that run on the network. These contracts automatically execute when specific conditions are met,
making them suitable for decentralized applications.
When programming in a distributed system, the choice of paradigm depends on the specific requirements of the
application and the trade-offs between factors like fault tolerance, scalability, and ease of development.
Developers often combine multiple paradigms to address different aspects of their distributed systems.
Shared memory
Shared memory in a distributed system refers to the concept of multiple processes or nodes in a distributed
environment accessing and manipulating a common memory space. Unlike traditional shared memory in a
single-machine multi-threaded or multi-process context, distributed shared memory extends the idea to multiple
machines or nodes. This allows processes running on different machines to communicate and share data as if
they were all accessing a single, shared memory space. Here are some key points about shared memory in
distributed systems:
Communication Mechanisms: In a distributed system, shared memory is typically implemented using
communication mechanisms, such as Remote Procedure Calls (RPC), message passing, or distributed memory
access protocols. These mechanisms enable processes on different nodes to read from and write to the shared
memory.
Data Consistency: Ensuring data consistency is a significant challenge in distributed shared memory systems.
When multiple processes on different nodes can access and modify shared data simultaneously, synchronization
mechanisms, like locks, semaphores, or distributed data structures, are needed to maintain data consistency.
Latency and Network Overhead: Network latency and communication overhead are critical considerations in
distributed shared memory systems. Data access times can be significantly higher than in traditional shared
memory systems due to network communication delays.
Data Distribution: Data may be partitioned or replicated across multiple nodes to optimize data access in a
distributed shared memory system. Decisions about data distribution depend on factors like access patterns, data
size, and fault tolerance requirements.
Scalability: Scalability is a crucial concern in distributed shared memory systems. As the number of nodes
increases, managing shared memory becomes more complex. Design choices, such as the granularity of data
sharing and the choice of communication protocols, can impact system scalability.
Fault Tolerance: Distributed shared memory systems need to be designed with fault tolerance in mind. If a node
fails or loses connectivity, mechanisms must ensure data consistency and availability.
Programming Models: Distributed shared memory can be used with various programming models, such as the
Single Program, Multiple Data (SPMD) model, where each process appears to have its private copy of memory,
but they can read from and write to the shared memory as needed.
Examples: Distributed shared memory systems are used in various applications, such as parallel computing
clusters, where multiple nodes work together on computationally intensive tasks, and in distributed databases to
allow multiple nodes to access and modify data as if it were stored locally.
Implementing distributed shared memory can be complex, and developers need to carefully consider data
consistency, access patterns, and fault tolerance requirements. While it can simplify programming by providing
a familiar shared-memory model, it also introduces challenges related to distributed computing, such as network
communication and data synchronization. As a result, developers often choose distributed shared memory when
the benefits of shared memory-like programming outweigh the complexities of managing data in a distributed
environment.
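As a single-machine analogy for the consistency issues above, this sketch shows two processes updating one shared counter, where a lock is needed for a correct result; a real distributed shared memory coordinates across machines instead:

```python
# Two processes update one shared counter; the lock is what keeps the
# result consistent. multiprocessing is standard library; this is a
# single-machine analogy, not a distributed DSM implementation.
from multiprocessing import Lock, Process, Value

def worker(counter, lock):
    for _ in range(100_000):
        with lock:                 # synchronization maintains consistency
            counter.value += 1

if __name__ == "__main__":
    counter, lock = Value("i", 0), Lock()
    workers = [Process(target=worker, args=(counter, lock)) for _ in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(counter.value)           # 200000 with the lock; less without it
```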
Message Passing
Message passing is a fundamental concept in distributed computing and parallel programming. It is a
communication method that allows separate processes or entities to exchange data and synchronize their actions.
Message passing is commonly used in distributed systems, parallel computing, and interprocess communication.
Here are some key points about message passing:
Process Communication: In a distributed system or parallel computing environment, processes (which can be
threads, programs, or entities) communicate by sending and receiving messages. These messages can contain
data, instructions, or both.
Asynchronous Communication: Message passing allows processes to communicate asynchronously, meaning
that a sender can continue its work without waiting for the receiver to handle the message. This is in contrast to
shared memory systems where synchronization is often required.
Synchronization: While message passing supports asynchronous communication, it also provides mechanisms
for synchronization when necessary. For example, processes can use messages to signal each other, indicating
that certain conditions have been met or specific tasks are complete.
Point-to-Point Communication: In point-to-point communication, one process sends a message to another
specific process. This is akin to sending a private message from one person to another.
Broadcast Communication: In broadcast communication, a process sends a message to all other processes in
the system. This can be useful for distributing information or updates to multiple recipients simultaneously.
Message Queues: Systems that use message passing often employ message queues to store and manage
messages. Processes enqueue messages to a queue, and other processes dequeue and process messages from the
queue.
Reliability: Message passing can be designed to be reliable, ensuring that messages are not lost in transit. This
is especially important in distributed systems where network failures can occur.
Scalability: Message passing is highly scalable, making it suitable for large-scale distributed systems. As more
processes are added, they can communicate by sending messages without the need for shared memory or
centralized coordination.
Message Formats: Messages can be formatted in various ways, including plain text, binary, or structured data
like JSON or XML. The format depends on the needs of the application and the message passing middleware
used.
Message Passing in Programming: Message passing is commonly used in programming languages and
libraries designed for distributed and parallel computing. For example, MPI (Message Passing Interface) is a
widely used standard for message passing in high-performance computing.
Examples: Message passing is used in various distributed systems, such as message brokers (e.g., Apache
Kafka), distributed computing frameworks (e.g., Hadoop's MapReduce), and many networked applications
where components or nodes need to communicate.
Message passing is a versatile and widely used method for enabling communication and coordination in
distributed systems. It is particularly well-suited for situations where processes or components run on different
nodes and need to work together without shared memory or centralized control.
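A minimal producer/consumer sketch of point-to-point message passing through a queue; in a distributed system the in-process queue would be replaced by a broker or a socket:

```python
# Producer/consumer sketch of point-to-point message passing through a
# queue; the message format here is illustrative.
import queue
import threading

mailbox = queue.Queue()

def producer():
    for i in range(3):
        mailbox.put({"seq": i, "body": f"msg-{i}"})   # enqueue a message
    mailbox.put(None)                                  # end-of-stream marker

def consumer():
    while (msg := mailbox.get()) is not None:
        print("received", msg)                         # dequeue and process

threading.Thread(target=producer).start()
consumer()
```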
Workflow
A workflow refers to a sequence of tasks, processes, or steps that are designed to achieve a specific goal or
outcome. Workflows are used in various fields, including business, information technology, manufacturing, and
more, to streamline and manage work processes efficiently. Here are some key concepts related to workflows:
1. Definition: A workflow defines how work is organized, who is responsible for each task, the order in which
tasks are performed, and what triggers the transition from one task to the next.
2.Components of a Workflow:
Tasks/Steps: These are the individual actions or activities that make up the workflow. Each task has
a specific purpose and often involves specific individuals or resources.
Transitions: These define how tasks are connected. They specify the conditions or criteria that must
be met for a task to move to the next one.
Participants/Roles: Workflows often involve multiple participants or roles responsible for
completing various tasks. Each participant is assigned specific responsibilities.
Data/Information: Workflows may include data or information that needs to be processed, generated,
or transferred at various stages.
Rules and Policies: Workflows may be governed by rules, policies, or constraints that dictate how
tasks are executed and under what conditions.
Automation: In some cases, tasks within a workflow can be automated through software or hardware
systems to improve efficiency and reduce errors.
Feedback and Monitoring: Workflow systems may include mechanisms for tracking progress,
collecting feedback, and monitoring the performance of tasks and participants.
3. Types of Workflows:
Sequential Workflow: Tasks are performed in a linear sequence, with each task dependent on the
completion of the previous one.
Parallel Workflow: Multiple tasks are performed concurrently or in parallel, without strict order or
dependency.
State Machine Workflow: The workflow's progress is determined by the current state and transitions
based on specific conditions.
Ad Hoc Workflow: Less structured and more flexible, allowing tasks to be performed in a less
predefined order.
4. Workflow Management Systems (WMS): These are software tools or platforms designed to create, manage,
and automate workflows. They often provide features for defining workflows, assigning tasks, tracking progress,
and handling exceptions.
5. Business Process Management (BPM): BPM is a discipline that focuses on the design, modeling, execution,
and optimization of workflows in organizations. BPM software and methodologies are used to improve business
processes and increase efficiency.
6. Use Cases:
Business Workflows: Used in various industries for managing processes like order processing,
customer support, project management, and more.
Scientific Workflows: Common in scientific research and data analysis to automate complex
experiments or simulations.
Software Development Workflows: Used in software development to manage tasks like code review,
testing, and deployment.
Manufacturing Workflows: Used in manufacturing to control production processes and quality
assurance.
7. Workflow Notation: Various notations and standards exist for representing workflows, including Business
Process Model and Notation (BPMN) and Workflow Management Coalition (WfMC) standards.
Workflows are essential for improving efficiency, reducing errors, and ensuring that work processes are well-
structured and documented. They are widely used in both business and technical domains to manage and
optimize various processes and operations.
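To make tasks and transitions concrete, here is a minimal sequential-workflow sketch; the task names and guard conditions are invented for illustration:

```python
# Minimal sequential-workflow sketch: tasks, transition guards and a
# trivial engine. Task names and guards are invented for illustration.
def receive_order(ctx):
    ctx["received"] = True
    return ctx

def check_stock(ctx):
    ctx["in_stock"] = True
    return ctx

def ship(ctx):
    ctx["shipped"] = True
    return ctx

workflow = [
    (receive_order, lambda ctx: True),             # (task, transition guard)
    (check_stock,   lambda ctx: ctx["received"]),
    (ship,          lambda ctx: ctx["in_stock"]),
]

def run(workflow, ctx):
    for task, guard in workflow:
        if not guard(ctx):                          # transition condition
            raise RuntimeError(f"blocked before {task.__name__}")
        ctx = task(ctx)                             # execute, then move on
    return ctx

print(run(workflow, {}))
# {'received': True, 'in_stock': True, 'shipped': True}
```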
Unit-2 & 3
Key Characteristics of MPP, SMP/CC-NUMA, Cluster and Distributed Systems
Number of Nodes: MPP 100 to 1000; SMP/CC-NUMA 10 to 100; Cluster 100 or less; Distributed 10 to 1000
Node Complexity: MPP fine grain or medium; SMP/CC-NUMA medium or coarse grain; Cluster medium grain; Distributed wide range
Internode Communication: MPP message passing or shared variables for distributed shared memory (DSM); SMP/CC-NUMA centralized and distributed shared memory; Cluster message passing; Distributed shared files, RPC, message passing and IPC
Job Scheduling: MPP single run queue on host; SMP/CC-NUMA single run queue; Cluster multiple queues but coordinated; Distributed independent queues
SSI Support: MPP partially; SMP/CC-NUMA always in SMP and some NUMA; Cluster desired; Distributed no
Node OS Copies and Type: MPP N micro-kernels, monolithic or layered OSs; SMP/CC-NUMA one monolithic SMP and many for NUMA; Cluster N OS platforms, homogeneous or micro-kernel; Distributed N OS platforms, homogeneous
Address Space: MPP multiple (single for DSM); SMP/CC-NUMA single; Cluster multiple or single; Distributed multiple
Internode Security: MPP unnecessary; SMP/CC-NUMA unnecessary; Cluster required if exposed; Distributed required
Ownership: MPP one organization; SMP/CC-NUMA one organization; Cluster one or more organizations; Distributed many organizations
Granularity [6] refers to the extent to which a system, material or large entity is decomposed into
small pieces; alternatively, it is the extent to which smaller entities are joined to form a larger
entity. It is of two types, namely coarse-grained and fine-grained.
A coarse-grained description regards a system in terms of the large subcomponents of which it is
composed. A fine-grained description regards it in terms of the smaller components of which the
larger ones are composed.
Message Passing [7]
Variables have to be marshalled.
Cost of communication is obvious.
Processes are protected by having private address spaces.
Processes should execute at the same time.
DSM [7]
Variables are shared directly.
Cost of communication is invisible.
Processes could cause errors by altering data.
Processes may execute with non-overlapping lifetimes.
Kernel [8] is a program that manages I/O requests from software and translates them into data
processing instructions for the CPU and other electronic components of a computer.
A Monolithic Kernel [8] executes all the OS instructions in the same address space in order to
improve the performance.
A Micro-Kernel [8] runs most of the OS’s background processes in user space to make the OS more
modular. Therefore, it is easier to maintain.
Cluster Computer and its Architecture
A Cluster consists of a collection of interconnected stand-alone computers working together as a single
computing resource. A computer node can be a single or multi-processor system such as PCs,
workstations, servers, SMPs with memory, I/O and an OS. The nodes are interconnected via a
LAN.
The cluster components are as follows.
1. Multiple High Performance Computers
2. OSs (Layered or Micro-Kernel Based)
3. High Performance Networks or Switches (Gigabit Ethernet and Myrinet)
4. Network Interface Cards (NICs)
5. Fast Communication Protocols and Services (Active and Fast Messages)
6. Cluster Middleware (Single System Image (SSI) and System Availability Infrastructure)
7. Parallel Programming Environments and Tools (Parallel Virtual Machine (PVM), Message
Passing Interface (MPI))
8. Applications (Sequential, Parallel or Distributed)
Cluster Classifications
The various features of clusters are as follows.
1. High Performance
2. Expandability and Scalability
3. High Throughput
4. High Availability
Clusters can be classified into many categories as follows.
1. Application Target
High Performance Clusters
High Availability Clusters
2. Node Ownership
Dedicated Clusters
Nondedicated Clusters
3. Node Hardware
Cluster of PCs (CoPs) or Piles of PCs (PoPs)
Cluster of Workstations (COWs)
Cluster of SMPs (CLUMPs)
4. Node OS
Linux Clusters (Beowulf)
Solaris Clusters (Berkeley NOW)
NT Clusters (High Performance Virtual Machine (HPVM))
Advanced Interactive eXecutive (AIX) Clusters (IBM Service Pack 2 (SP2))
Digital Virtual Memory System (VMS) Clusters
HP-UX Clusters
Microsoft Wolfpack Clusters
5. Node Configuration
Homogeneous Clusters
Heterogeneous Clusters
6. Levels of Clustering
Group Clusters (No. of Nodes = 2 to 99)
Departmental Clusters (No. of Nodes = 10 to 100s)
Organizational Clusters (No. of Nodes = Many 100s)
National Metacomputers (No. of Nodes = Many Departmental or Organizational
Systems or Clusters)
International Metacomputers (No. of Nodes = 1000s to Many Millions)
Components for Clusters
The components of clusters are the hardware and software used to build clusters and nodes. They are
as follows.
1. Processors
Microprocessor Architecture (RISC, CISC, VLIW and Vector)
Intel x86 Processor (Pentium Pro and II)
The Pentium Pro shows very strong integer performance in contrast to Sun's UltraSPARC
in the high-performance range at the same clock speed. However, its floating-point
performance is much lower.
The Pentium II Xeon uses a memory bus of 100 MHz. It is available with a choice of
512 KB to 2 MB of L2 cache.
Other processors: x86 variants (AMD x86, Cyrix x86), Digital Alpha, IBM PowerPC, Sun
SPARC, SGI MIPS and HP PA.
Berkeley NOW uses Sun’s SPARC processors in their cluster nodes.
2. Memory and Cache
The memory present inside early PCs was 640 KB. Today, a PC is delivered with 32 or 64
MB installed in slots, with each slot holding a Standard Industry Memory Module
(SIMM). The capacity of a PC is now many hundreds of MB.
Cache is used to keep recently used blocks of memory for very fast access. The size of
cache is usually in the range of 8KBs to 2MBs.
3. Disk and I/O
I/O performance is improved by carrying out I/O operations in parallel. This is supported
by parallel file systems based on hardware or software Redundant Array of
Inexpensive Disks (RAID).
Hardware RAID is more expensive than software RAID.
4. System Bus
Bus is the collection of wires which carries data from one component to another. The
components are CPU, Main Memory and others.
Bus is of following types.
o Address Bus
o Data Bus
o Control Bus
The address bus is the collection of wires that transfers the addresses of memory or I/O
devices. For instance, the Intel 8085 microprocessor has an address bus of 16 bits, so it
can address at most 2^16 = 65,536 (64 K) memory locations.
The data bus is the collection of wires used to transfer data between the
microprocessor and memory or I/O devices. The Intel 8085 has a data bus of 8 bits, which
is why the Intel 8085 is called an 8-bit microprocessor.
The control bus is responsible for issuing control signals such as read, write or
opcode fetch to perform operations on the selected memory location.
Every bus has a clock speed. The initial PC bus had a clock speed of 5 MHz and was 8
bits wide.
In PCs, the ISA bus has been replaced by faster buses such as PCI.
The ISA bus was extended to 16 bits wide with an enhanced clock speed of 13 MHz.
However, this is not sufficient to meet the demands of the latest CPUs, disks and other
components.
The VESA local bus, a 32-bit bus, has been made obsolete by the Intel PCI bus, which
allows 133 MB/s.
5. Cluster Interconnects
The nodes in a cluster are interconnected via standard Ethernet and
communicate using a standard networking protocol such as TCP/IP or a low-level
protocol such as Active Messages.
Ethernet: 10 Mbps
Fast Ethernet: 100 Mbps
Gigabit Ethernet
The two main characteristics of Gigabit Ethernet are as follows.
o It preserves Ethernet's simplicity while enabling a smooth migration to
Gigabit-per-second (Gbps) speeds.
o It delivers a very high bandwidth to aggregate multiple Fast Ethernet
segments.
Asynchronous Transfer Mode (ATM)
o It is a switched virtual-circuit technology.
o It is intended to be used for both LAN and WAN, presenting a unified approach
to both.
o It is based on small fixed-size data packets termed cells. It is designed to allow
cells to be transferred over a number of media, such as copper wire and fiber-optic
cables.
o CAT-5 is used with ATM allowing upgrades of existing networks without
replacing cabling.
Scalable Coherent Interface (SCI)
o It aims to provide a low-latency distributed shared memory across a cluster.
o It is designed to support distributed multiprocessing with high bandwidth and
low latency.
o It is a point-to-point architecture with directory-based cache coherence.
o Dolphin has produced an SCI MPI which offers less than 12 µs zero message-
length latency on the Sun SPARC platform.
Myrinet
o It is a 1.28 Gbps full duplex interconnection network supplied by Myricom
[9].
o It uses low-latency cut-through routing switches, which are able to offer fault
tolerance.
o It supports both Linux and NT.
o It is relatively expensive when compared to Fast Ethernet, but has following
advantages. 1) Very low latency (5 µs), 2) Very high throughput, 3) Greater
flexibility.
o The main disadvantage of Myrinet is its price. The cost of Myrinet-LAN
components including the cables and switches is $1,500 per host. Switches
with more than 16 ports are unavailable. Therefore, scaling is complicated.
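A common first-order cost model (not from these notes) helps compare such interconnects; t_0 and B below stand for start-up latency and asymptotic bandwidth:

```latex
% First-order cost of sending an m-byte message over an interconnect with
% start-up latency t_0 and asymptotic bandwidth B:
\[
  T(m) = t_0 + \frac{m}{B}
\]
% Using the Myrinet figures quoted above, t_0 = 5 microseconds and
% B = 160 MB/s (1.28 Gbps), a 1 KB message costs roughly
% 5 + 1024/160 = 11.4 microseconds: short messages are dominated by
% latency, long messages by bandwidth.
```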
Cluster Middleware and Single System Image
Single System Image (SSI) makes a collection of interconnected nodes appear as a unified resource.
It creates the illusion of a single powerful resource out of resources such as hardware or software.
It is supported by a middleware layer that resides between the OS and the user-level
environment. The middleware consists of two sub-layers, namely the SSI Infrastructure and the System
Availability Infrastructure (SAI). SAI enables cluster services such as checkpointing, automatic
failover, recovery from failure and fault tolerance.
1. SSI Levels or Layers
Hardware (Digital (DEC) Memory Channel, Hardware DSM and SMP Techniques)
Operating System Kernel – Gluing Layer (Solaris MC and GLUnix)
Applications and Subsystems – Middleware
o Applications
o Runtime Systems
o Resource Management and Scheduling Software (LSF and CODINE)
2. SSI Boundaries
Every SSI has a boundary.
SSI can exist at different levels within a system; one level may be built upon another.
3. SSI Benefits
It provides a view of all system resources and activities from any node of the cluster.
It frees the end user from having to know where an application will run.
It frees the operator from having to know where a resource is located.
It allows the administrator to manage the entire cluster as a single entity.
It allows both centralized and decentralized system management and control, avoiding
the need for skilled administrators for system administration.
It simplifies system management.
It provides location-independent message communication.
It tracks the locations of all resources so that there is no longer any need for system
operators to be concerned with their physical location while carrying out system
management tasks.
4. Middleware Design Goals
Transparency
Scalable Performance
Enhanced Availability
5. Key Service of SSI and Availability Infrastructure
SSI Support Services
o Single Point of Entry
o Single File Hierarchy
o Single Point of Management and Control
o Single Virtual Networking
o Single Memory Space
o Single Job Management System
o Single User Interface
Availability Support Functions
o Single I/O Space
o Single Process Space
o Checkpointing and Process Migration
1. Beowulf Clusters:
- Beowulf clusters are one of the earliest and most well-known forms of commodity cluster
computing.
- Developed in the 1990s by researchers at NASA and the National Center for Supercomputing
Applications (NCSA).
- Beowulf clusters typically consist of off-the-shelf hardware components, such as commodity
processors, Ethernet networking, and Linux-based operating systems.
- Message Passing Interface (MPI) is often used for communication between nodes in Beowulf
clusters.
2. IBM SP2:
- IBM SP2 (Scalable Power Parallel) was a parallel supercomputer developed by IBM in the early
1990s.
- It was a pioneering system in high-performance computing and featured scalable architecture with
multiple nodes interconnected by a high-speed network.
- SP2 used a variety of processors, including PowerPC, and could be used for parallel scientific and
engineering applications.
3. Intel Paragon:
- The Intel Paragon was a parallel supercomputer developed by Intel in the 1990s.
- It used a scalable architecture with multiple processors connected through a high-speed
interconnect.
- The Paragon was often used for scientific and engineering simulations.
High Throughput Computing (HTC) Clusters:
1. Task Parallelism:
- HTC clusters often deal with embarrassingly parallel problems where tasks can be executed
independently.
- Examples include parameter sweeps, data mining, and many scientific simulations.
2. Job Scheduling:
- Efficient job scheduling is crucial in HTC clusters to maximize the utilization of resources.
- Schedulers need to consider factors like task dependencies, resource availability, and priority.
3. Condor:
- Condor is a widely used software system for managing distributed computing resources in HTC
environments.
- It provides a job scheduler and a set of tools for managing and optimizing computing resources.
4. Cycle Stealing:
- Cycle stealing is a concept where idle computing resources are utilized for HTC tasks.
- It involves harnessing the spare cycles of machines that are not fully utilized for their primary
tasks.
5. Data Management:
- Effective data management is important in HTC clusters, especially when dealing with large
datasets distributed across multiple nodes.
- Ensuring data availability and minimizing data transfer times are key considerations.
6. Fault Tolerance:
- Given the large-scale nature of HTC clusters, fault tolerance mechanisms are crucial to handle
failures gracefully.
- This may involve re-scheduling failed tasks on alternative resources.
7. Workload Balancing:
- Workload balancing is essential for maximizing the overall throughput.
- Tasks should be distributed evenly across available resources to prevent bottlenecks.
Networking:
Networking is a crucial aspect of cluster computing, as it enables communication and coordination
among the individual nodes within the cluster. Here are some key notes on networking in clusters:
1. Interconnect Topologies:
Bus Topology: Nodes are connected to a common communication bus. Simple but can lead to
contention.
Ring Topology: Nodes are connected in a circular fashion. Data travels in one direction.
Mesh Topology: Nodes are interconnected in a network. Can be partial (some nodes connected) or
complete (all nodes connected).
2. High-Speed Interconnects:
InfiniBand: A high-speed interconnect technology often used in clusters. Provides low latency and
high bandwidth.
10/25/40/100 Gigabit Ethernet: Standard Ethernet can also be used in clusters, with higher speeds
for better performance.
3. TCP/IP Networking:
- Clusters often use standard TCP/IP networking for communication between nodes.
- Ethernet is a common choice for connecting nodes within a cluster.
4. Scalability:
- Cluster networking should be scalable to accommodate a growing number of nodes.
- Scalability issues may arise with the increased number of nodes, leading to bottlenecks and
reduced performance.
5. Switching Technologies:
- In larger clusters, network switches are used to connect nodes.
- Technologies like InfiniBand switches or Ethernet switches with high-speed backplanes are
employed.
6. Jumbo Frames:
- Increasing the size of standard Ethernet frames can improve efficiency by reducing the overhead
associated with smaller frames.
- Jumbo frames can lead to better performance in some cluster configurations.
7. Network File Systems (NFS) and Storage Area Networks (SAN):
- Clusters may use NFS for file sharing and distributed file systems.
- SANs provide high-speed storage that can be shared among nodes in the cluster.
Effective networking in clusters is essential for achieving high performance, scalability, and
reliability. It requires careful consideration of factors such as interconnect technologies, protocols,
and network topologies to ensure optimal communication and coordination among cluster nodes.
Cluster Applications
1. Grand Challenge Applications (GCAs)
Crystallographic and Microtomographic Structural Problems
Protein Dynamics and Biocatalysis
Relativistic Quantum Chemistry of Actinides
Virtual Materials Design and Processing
Global Climate Modeling
Discrete Event Simulation
2. Supercomputing Applications
3. Computational Intensive Applications
4. Data or I/O Intensive Applications
5. Transaction Intensive Applications
Representative Cluster Systems, Heterogeneous Clusters
Many projects are investigating the development of supercomputing class machines using
commodity off-the-shelf components (COTS). The popular projects are listed as follows.
Network of Workstations (NOW): University of California, Berkeley
High Performance Virtual Machine (HPVM): University of Illinois, Urbana-Champaign
Beowulf: Goddard Space Flight Center, NASA
Solaris-MC: Sun Labs, Sun Microsystems, Palo Alto, CA
NOW
Platform: PCs and Workstations
Communications: Myrinet
OS: Solaris
Tool: PVM
HPVM
Platform: PCs
Communications: Myrinet
OS: Linux
Tool: MPI
Beowulf
Platform: PCs
Communications: Multiple Ethernet with TCP/IP
OS: Linux
Tool: MPI/PVM
Solaris-MC
Platform: PCs and Workstations
Communications: Solaris supported
OS: Solaris
Tool: C++ and CORBA
Heterogeneous Clusters
Clusters may be deliberately heterogeneous in order to exploit the higher floating-point
performance of certain architectures and the low cost of other systems.
A heterogeneous layout means automating administration work will obviously become
more complex, e.g. software packaging is different.
Major challenges [16]:
o Four major challenges must be overcome before heterogeneous computing
clusters emerge as the preferred platform for executing a wide variety of enterprise
workloads.
o First, most enterprise applications in use today were not designed to run on such
dynamic, open and heterogeneous computing clusters. Migrating these applications to
heterogeneous computing clusters, especially with substantial improvement in
performance or energy-efficiency, is an open problem.
o Second, creating new enterprise applications ground-up to execute on the new,
heterogeneous computing platform is also daunting. Writing high-performance,
energy-efficient programs for these architectures is extremely challenging due to the
unprecedented scale of parallelism, and heterogeneity in computing, interconnect and
storage units.
o Third, cost savings from the new shared-infrastructure architecture for consumption
and delivery of IT services are only possible when multiple enterprise applications can
amicably share resources (multi-tenancy) in the heterogeneous computing cluster.
However, enabling multi-tenancy without adversely impacting the stringent quality of
service metrics of each application calls for dynamic scalability and virtualization of a
wide variety of diverse computing, storage and interconnect units, and this is yet
another unsolved problem.
o Finally, enterprise applications encounter highly varying user loads, with spikes of
unusually heavy load. Meeting quality of service metrics across varying loads calls for
an elastic computing infrastructure that can automatically provision (increase or
decrease) computing resources used by an application in response to varying user
demand. Currently, no good solutions exist to meet this challenge.
Security, Resource Sharing, Locality, Dependability
Security
There is always a tradeoff between usability and security. Allowing rsh (remote shell) access from the
outside to each node just by matching usernames and hosts with each user's .rhosts file is not good,
as a security incident on a single node compromises the security of all the systems that share that
user's home directory. For instance, mail can be abused in a similar way: just change that user's .forward
file to do the mail delivery via a pipe to an interesting executable or script. A service is not safe
unless all of the services it depends on are at least equally safe.
Connecting all the nodes directly to the external network may cause two main problems. First, we
make temporary changes and forget to restore them. Second, systems tend to have information
leaks: the operating system and its version can easily be guessed with just IP access, even with only
harmless services running, and almost all operating systems have had serious security problems in
their IP stacks in recent history.
Special care must be taken when building clusters of clusters. The approach for making these
metaclusters secure is to build secure tunnels between the clusters, usually from front-end to front-
end.
If intermediate backbone switches can be trusted and have the necessary software or resources, they
can set up a VLAN joining the clusters, achieving greater bandwidth and lower latency than routing
at the IP level via the front-ends.
Resource Sharing
Resource sharing needs cooperation among the processors to ensure that no processor is idle while
there are tasks waiting for service.
In load sharing, three location policies have been studied: the random policy, the threshold policy and
the shortest policy.
Threshold policy probes a limited number of nodes. It terminates the probing as soon as it finds a node
with a queue length shorter than the threshold.
Shortest policy probes several nodes and then selects the one having the shortest queue, from among
those having queue lengths shorter than the threshold.
In the Flexible Load Sharing Algorithm (FLS) a location policy similar to threshold is used. In contrast
to threshold, FLS bases its decisions on local information which is possibly replicated at multiple
nodes. For scalability, FLS divides a system into small subsets which may overlap. Each of these
subsets forms a cache held at a node. This algorithm supports mutual inclusion or exclusion. It is
noteworthy to mention that FLS does not attempt to produce the best possible solution, but like
threshold, it offers instead an adequate one, at a fraction of the cost.
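A small sketch of the threshold location policy described above; the threshold and probe limit values are illustrative:

```python
# Sketch of the threshold location policy: probe a limited number of
# randomly chosen nodes and stop at the first whose queue length is
# below the threshold. Threshold and probe limit are illustrative.
import random

def threshold_policy(queue_lengths, threshold=3, probe_limit=3):
    probes = random.sample(range(len(queue_lengths)), probe_limit)
    for node in probes:                     # probe at most probe_limit nodes
        if queue_lengths[node] < threshold:
            return node                     # first acceptable node wins
    return None                             # no luck: keep the task locally

print(threshold_policy([5, 1, 4, 0, 6]))
```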
High Availability:
High Availability (HA) in cluster technology refers to the ability of a system to remain
operational and accessible even in the presence of hardware or software failures. Here are
some key technologies and strategies used to achieve high availability in clusters:
1. Redundancy:
Hardware Redundancy: Use redundant hardware components such as power supplies, network
interfaces, and storage devices to eliminate single points of failure.
Node Redundancy: Deploy multiple nodes in the cluster so that if one node fails, others can take
over the workload.
2. Failover Mechanisms:
Implement failover mechanisms to automatically redirect traffic or workload from a failed node to a
healthy one.
Cluster software monitors the health of nodes and triggers failover when necessary.
3. Load Balancing:
Distribute workloads evenly across cluster nodes to prevent any single node from becoming a
performance bottleneck.
Load balancing helps in maximizing resource utilization and improves overall system responsiveness.
4. Quorum Systems:
Quorum systems help prevent split-brain scenarios, where nodes in a cluster lose communication
with each other and may independently continue operations.
By requiring a majority of nodes to agree on the cluster state, quorum systems ensure that only one
partition of the cluster remains active.
5. Cluster Communication Protocols:
Use reliable and efficient communication protocols between nodes to detect failures and coordinate
actions.
Communication protocols like Heartbeat and Corosync are often employed to monitor the health of
nodes.
6. Shared Storage:
Implement shared storage systems to allow nodes to access the same data.
In case of a node failure, another node can take over and access the data seamlessly.
7. Cluster File Systems:
Utilize cluster file systems, such as GFS (Global File System) or Lustre, which are designed to provide
shared access to files among cluster nodes.
These file systems enable concurrent access to data and facilitate failover.
8. Virtualization:
Virtualization technologies, such as VMware High Availability (HA) or Microsoft Hyper-V Replica, can
be used to provide failover for virtual machines.
Virtualization abstracts applications from the underlying hardware, making it easier to move
workloads between nodes.
9. Backup and Restore:
Regularly back up critical data and configurations to facilitate quick recovery in the event of a failure.
Automated backup solutions and off-site storage contribute to data integrity and availability.
10. Monitoring and Alerting:
Implement comprehensive monitoring systems to track the health and performance of cluster nodes.
Set up alerting mechanisms to notify administrators of potential issues before they escalate.
11. Power and Environmental Redundancy:
Ensure power redundancy with uninterruptible power supplies (UPS) and backup generators.
Implement environmental controls, such as cooling systems, to prevent hardware failures due to
overheating.
12. Automated Repair and Maintenance:
Use automation tools to perform routine maintenance tasks and apply software updates without
causing downtime.
Automated repair mechanisms can fix common issues without manual intervention.
13. Geographic Redundancy:
For critical systems, consider geographical redundancy by deploying clusters in different physical
locations.
This helps protect against regional disasters and ensures continuity of operations.
14. Database Replication:
Implement database replication mechanisms to maintain copies of databases on multiple nodes.
In case of a node failure, the database can be accessed from a replicated copy on another node.
15. Documentation and Training:
Maintain detailed documentation of the HA configuration, failover procedures, and recovery
processes.
Train administrators and support staff on the HA setup to ensure a quick and effective response to
failures.
High availability in cluster technology is a multifaceted approach that combines hardware
redundancy, failover mechanisms, and smart resource management. The goal is to minimize
downtime, ensure data integrity, and provide uninterrupted services to end-users.
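As an illustration of the heartbeat monitoring mentioned in point 5 above, here is a minimal failure-detector sketch; real deployments use tools like Heartbeat or Corosync, and the timeout and node names here are illustrative:

```python
# Sketch of heartbeat-based failure detection. The HeartbeatMonitor class
# is illustrative, not a real cluster manager API.
import time

class HeartbeatMonitor:
    def __init__(self, timeout=2.0):
        self.timeout = timeout
        self.last_seen = {}

    def beat(self, node):
        self.last_seen[node] = time.monotonic()   # record a heartbeat

    def suspects(self):
        now = time.monotonic()
        return [n for n, t in self.last_seen.items()
                if now - t > self.timeout]        # failover candidates

mon = HeartbeatMonitor(timeout=0.1)
mon.beat("node-a")
mon.beat("node-b")
time.sleep(0.2)
mon.beat("node-a")        # node-b has stopped beating
print(mon.suspects())     # ['node-b'] -> trigger failover
```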
Performance modeling and simulation are essential tools in computer science and engineering to
predict, analyze, and optimize the performance of systems, applications, or networks. These tools help
in understanding the behavior of complex systems and making informed decisions about design and
resource allocation. Here are key aspects of performance modeling and simulation:
Performance Modeling:
1. Definition:
- Performance modeling is the construction of an abstract representation of a system in order to
predict and analyze its performance.
- It can be applied to various domains, including computer systems, networks, and software
applications.
2. Types of Models:
Analytical Models: Use mathematical equations to describe the system's behavior. Examples
include queuing models and network models.
Simulation Models: Utilize simulation software to mimic the behavior of the actual system. Monte
Carlo simulations and discrete event simulations fall into this category.
Empirical Models: Derived from observed measurements and data. Regression analysis is
commonly used to create empirical models.
3. Performance Metrics:
- Define performance metrics relevant to the system being modeled, such as response time,
throughput, and resource utilization.
- Metrics help in quantifying and comparing the performance of different system configurations.
4. Queuing Theory:
- Commonly used in modeling systems where entities (tasks, jobs, etc.) wait in line before being
processed.
- Queuing models help analyze and optimize the use of resources and predict system performance.
5. Petri Nets:
- Petri Nets are graphical and mathematical modeling languages used for the specification,
simulation, and verification of systems.
- They are particularly useful for modeling concurrent and distributed systems.
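As a worked example of queuing theory (point 4 above), the standard M/M/1 results, with illustrative numbers:

```latex
% Standard M/M/1 results: arrival rate \lambda, service rate \mu,
% utilization \rho = \lambda / \mu < 1.
\[
  \bar{N} = \frac{\rho}{1 - \rho}, \qquad
  \bar{T} = \frac{1}{\mu - \lambda}
\]
% Example: a server handling \mu = 100 requests/s under a load of
% \lambda = 80 requests/s has \rho = 0.8, an average of 0.8/0.2 = 4
% requests in the system, and a mean response time of 1/(100-80) s = 50 ms.
```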
Performance Simulation:
1. Definition:
- Performance simulation involves running a model of a system to observe and analyze its behavior
over time.
- Simulation helps in understanding system dynamics, identifying bottlenecks, and evaluating the
impact of changes.
2. Advantages:
Cost-Effective: Simulation allows experimentation in a virtual environment without the costs
associated with real-world testing.
Flexibility: Simulations can be easily modified to test various scenarios and configurations.
Risk Mitigation: Simulating potential changes or improvements allows for risk assessment before
implementation.
3. Performance Evaluation:
- Run the model under representative workloads to estimate metrics such as response time,
throughput, and resource utilization.
4. System Validation:
- Use simulations to validate and verify the behavior of a system before its actual implementation.
- Helps in detecting potential issues and refining the system design.
5. Model Calibration:
- Adjust model parameters to match the simulation's output with real-world observations.
- Calibration ensures that the simulation accurately reflects the behavior of the actual system.
6. Toolsets:
- Various simulation tools are available, such as OPNET, NS-3, and Simulink, each with its specific
strengths and applications.
7. Parallel and Distributed Systems:
- Simulation is crucial for understanding the performance of parallel and distributed computing
systems.
- It helps in optimizing resource allocation and load balancing.
Challenges:
1. Model Accuracy:
- Achieving an accurate representation of the real system can be challenging, particularly when
dealing with complex and dynamic environments.
2. Validation and Verification:
- Ensuring that the simulation model accurately reflects the behavior of the real system requires
careful, often resource-intensive, validation and verification processes.
Performance modeling and simulation play a crucial role in system design, optimization, and decision-
making processes. By providing insights into the behavior of complex systems, these tools contribute
to the development of more efficient and reliable technologies.
Process Scheduling:
1. Definition:
- Process scheduling is the mechanism used by an operating system to manage the execution of
processes on the CPU.
2. Scheduling Policies:
- First-Come-First-Serve (FCFS): Processes are executed in the order they arrive.
- Shortest Job Next (SJN): The process with the shortest burst time is selected next.
- Round Robin (RR): Each process gets a fixed time slot before the next process is selected.
- Priority Scheduling: Processes are assigned priority levels, and the one with the highest priority is
selected next.
3. Context Switching:
- When the operating system switches from executing one process to another, it performs a context
switch.
- Context switching involves saving the state of the current process and loading the saved state of the
next process.
4. Preemption:
- Preemptive scheduling allows a higher-priority process to interrupt and temporarily suspend the
execution of a lower-priority process.
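A small simulation of the Round Robin policy listed above; the burst times and quantum are illustrative:

```python
# Simulation of Round Robin scheduling: each process runs for one quantum
# and re-joins the queue until its burst time is exhausted.
from collections import deque

def round_robin(bursts, quantum=2):
    ready = deque(bursts.items())
    order = []
    while ready:
        pid, remaining = ready.popleft()
        order.append(pid)                            # context switch to pid
        if remaining > quantum:
            ready.append((pid, remaining - quantum)) # preempted, re-queued
    return order

print(round_robin({"P1": 5, "P2": 3, "P3": 1}))
# ['P1', 'P2', 'P3', 'P1', 'P2', 'P1']
```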
Load Sharing and Load Balancing:
1. Load Sharing:
- Load sharing involves distributing the workload among multiple processors or nodes.
- It aims to improve overall system performance by utilizing available resources efficiently.
2. Load Balancing:
- Load balancing is the process of distributing the workload evenly across all processors or nodes in
a system.
- Ensures that no single processor is overloaded while others remain underutilized.
3. Algorithms:
- Round Robin: Distributes tasks in a circular order.
- Weighted Round Robin: Assigns different weights to tasks based on their complexity.
- Least Connections: Assigns tasks to the server with the fewest active connections.
- Randomized Load Balancing: Assigns tasks randomly to available servers.
4. Challenges:
- Dynamic changes in workload can make load balancing challenging.
- Overhead associated with monitoring and redistributing tasks.
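A minimal sketch of the Least Connections algorithm from the list above; server names and initial loads are illustrative:

```python
# Sketch of Least Connections load balancing: each incoming task goes to
# the server with the fewest active connections.
def least_connections(servers):
    return min(servers, key=servers.get)   # server with fewest connections

servers = {"s1": 4, "s2": 1, "s3": 3}      # active connections per server
for task in range(3):
    chosen = least_connections(servers)
    servers[chosen] += 1                    # assign the task
    print(f"task {task} -> {chosen}")
# tasks flow to s2 until its load catches up with the other servers
```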
Distributed Shared Memory (DSM):
1. Definition:
- Distributed Shared Memory allows multiple nodes in a distributed system to share a common
address space, providing the illusion of a single shared memory space.
2. Architecture:
- Hardware DSM: Shared memory is implemented at the hardware level.
- Software DSM: Shared memory is implemented using software libraries and protocols.
3. Consistency Models:
Sequential Consistency: The result of any execution is the same as if the operations of all processors
were executed in some sequential order.
Causal Consistency: Preserves causality between operations initiated by different processors.
Release Consistency: Operations are grouped into critical sections, and consistency is maintained
only at the entry and exit of these sections.
4. Coherence Protocols:
Write-Once Protocol: Each block of memory is only written by one processor.
Invalidation Protocol: When a processor writes to a block, all other copies are invalidated.
Update Protocol: Updates are made directly to all copies of a block.
5. Advantages:
- Provides a familiar shared-memory programming model while hiding explicit message passing.
- Simplifies porting of shared-memory programs to distributed hardware.
6. Challenges:
- Overhead in maintaining coherence between distributed copies of memory.
- Increased latency compared to local shared memory systems.
Distributed shared memory, process scheduling, load sharing, and load balancing are critical aspects
of distributed computing. These concepts play a crucial role in improving the efficiency,
performance, and scalability of systems in distributed environments.
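To make the invalidation protocol above concrete, a toy sketch in which a write by one node invalidates all other cached copies of a block; the Block class is illustrative:

```python
# Toy write-invalidate coherence protocol: a write by one node invalidates
# every other cached copy of the block.
class Block:
    def __init__(self, value):
        self.value = value     # the block's home (authoritative) value
        self.copies = {}       # node -> cached value, None when invalid

    def read(self, node):
        if self.copies.get(node) is None:
            self.copies[node] = self.value            # fetch a fresh copy
        return self.copies[node]

    def write(self, node, value):
        self.value = value
        self.copies = {n: None for n in self.copies}  # invalidate others
        self.copies[node] = value

b = Block(0)
print(b.read("n1"), b.read("n2"))   # 0 0 - both nodes cache the block
b.write("n1", 42)                   # n2's cached copy is invalidated
print(b.read("n2"))                 # 42 - n2 re-fetches the new value
```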
Unit 4
Grid Architecture:
1. Definition:
Grid computing is a distributed computing paradigm that involves the coordinated sharing
and use of resources across multiple administrative domains.
Distributed Resources: Resources like computing power, storage, and applications are
distributed across multiple locations.
Virtual Organization: Users and resources are organized into virtual organizations,
crossing administrative boundaries.
Dynamic Resource Allocation: Resources can be dynamically allocated and de-allocated
based on demand.
High Performance: Grids are designed to handle computationally intensive tasks, often
involving parallel processing.
2. Virtual Organizations:
Grid users and resources are organized into virtual organizations (VOs) based on their
requirements and affiliations.
3. Scalability:
Grids are designed to scale horizontally, accommodating a large number of resources and
users.
4. Collaboration:
Collaboration is a key aspect, allowing users and organizations to share resources and work
together on large-scale projects.
5. Standards Bodies:
Open Grid Forum (OGF): Develops open standards and specifications to facilitate the
adoption of grid computing technologies.
Global Grid Forum (GGF): The predecessor of OGF; GGF contributed to the development
of grid standards and best practices.
Grid Types:
1. Computational Grids: Aggregate processing power from many nodes to execute computation-intensive tasks.
2. Data Grids: Provide controlled sharing and management of large, distributed data sets.
3. Collaborative Grids: Support collaboration among geographically dispersed users and virtual organizations.
4. Utility Grids: Offer computing resources on demand, consumed and billed like a utility service.
Topologies:
1. Tree Topology: Nodes are arranged hierarchically, with higher-level nodes coordinating those below them.
2. Mesh Topology: Nodes are richly interconnected, providing multiple communication paths and fault tolerance.
3. Cluster Topology: Nodes are grouped into clusters with high interconnectivity within clusters.
4. Hybrid Topology: Combines two or more of the above topologies to balance cost, performance, and reliability.
Components of Grid:
1. Resource Management: Discovers, allocates, schedules, and monitors the grid's distributed resources.
2. Grid Middleware: Software layer that enables communication and coordination among diverse resources in the grid (a toy sketch follows this list).
3. Grid Security: Provides authentication, authorization, and secure communication across administrative domains.
4. Grid Applications: Custom applications designed to leverage the grid infrastructure for specific tasks.
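As a rough illustration of the brokering role middleware plays, the following Python sketch (site names and CPU capacities are invented) matches each submitted job to a resource that can satisfy its requirement:

    # Toy resource broker: dispatch each job to the least-loaded site
    # that has enough free CPUs; otherwise queue it.
    resources = {"siteA": 8, "siteB": 16}   # hypothetical sites: free CPUs

    def submit(job, cpus_needed):
        fit = [r for r, free in resources.items() if free >= cpus_needed]
        if not fit:
            print(job, ": queued, no resource available")
            return
        best = max(fit, key=lambda r: resources[r])  # most free capacity
        resources[best] -= cpus_needed
        print(job, ": dispatched to", best)

    submit("render", 4)      # fits on siteB (most free CPUs)
    submit("simulate", 12)   # fits on siteB's remaining capacity
    submit("huge-job", 32)   # queued: no site is large enough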
Layers of Grid:
1. Fabric Layer:
Physical infrastructure layer consisting of computing nodes, storage devices, and network
connections.
2. Connectivity Layer:
Defines the core communication and authentication protocols for grid network transactions.
3. Resource Layer:
Manages and provides access to computing resources, storage, and other services.
4. Collective Layer:
Coordinates and manages multiple resources for parallel processing and collaborative tasks.
5. Application Layer:
User applications that operate within a virtual organization, built on the services of the
lower layers.
Grid vs. Cluster:
Grids involve distributed resources across multiple clusters, while clusters consist of
interconnected nodes within a single administrative domain.
Grid vs. Cloud:
Grids often focus on sharing computing resources, while cloud computing provides on-
demand access to a pool of configurable computing resources.
Grid vs. Peer-to-Peer:
Grids are centrally managed and organized, whereas P2P computing involves decentralized
and distributed sharing of resources among peers.
Grid computing has played a significant role in addressing large-scale computational challenges by
enabling collaboration and resource sharing across organizational boundaries. The architecture,
characteristics, and standards associated with grids continue to evolve as technology advances.
Unit 5
System infrastructure
System infrastructure refers to the underlying foundation of hardware, software,
networking, and other components that support the functionality and operation of a
computer system or a network. It includes both physical and virtual components that
work together to provide a computing environment. Here are key components of
system infrastructure:
Physical Infrastructure:
1. Hardware:
Servers: Powerful computers that provide services or resources to other computers
(clients) in the network.
Storage: Devices or systems for storing data, which can include hard drives, solid-
state drives, and network-attached storage (NAS).
Networking Equipment: Routers, switches, and other devices that enable
communication and data transfer between different components in the network.
Computers and End-User Devices: Desktops, laptops, tablets, and other devices
used by end-users to access and interact with the system.
2. Data Centers:
Facilities that house and manage servers, storage, and networking equipment.
Designed to provide a secure and controlled environment for computing resources.
3. Power and Cooling Systems:
Infrastructure to ensure a stable power supply and effective cooling for servers and
networking equipment in data centers.
Virtual Infrastructure:
1. Virtualization:
Server Virtualization: Enables multiple virtual servers to run on a single physical
server, optimizing resource utilization.
Desktop Virtualization: Allows multiple virtual desktops to run on a single physical
machine, providing flexibility for end-users.
Storage Virtualization: Abstracts physical storage resources, making them appear as
a single, centralized storage pool.
2. Cloud Infrastructure:
Infrastructure as a Service (IaaS) provides virtualized computing resources over the
internet, including virtual machines, storage, and networking.
Platform as a Service (PaaS) offers a platform with development tools, allowing users
to build and deploy applications without managing the underlying infrastructure.
Software as a Service (SaaS) provides access to software applications over the
internet without the need for installation.
Operating Systems:
1. Server Operating Systems:
Manage server hardware resources and provide a platform for running server
applications.
2. Client Operating Systems:
Run on end-user devices and provide a user interface for interacting with applications
and accessing network resources.
3. Embedded Operating Systems:
Operating systems designed for embedded systems, such as those in IoT devices,
routers, and appliances.
Networking Infrastructure:
1. Network Protocols:
Standardized rules for data communication between devices on a network. Examples
include TCP/IP, HTTP, and FTP.
2. Firewalls and Security Appliances:
Devices that control and monitor network traffic, enforcing security policies to
protect against unauthorized access and threats.
3. Load Balancers:
Distribute network traffic across multiple servers to ensure optimal resource
utilization and prevent server overload.
4. Switches and Routers:
Switches connect devices within a local network, while routers connect different
networks, facilitating data transfer between them.
Software Infrastructure:
1. Middleware:
Software that connects and manages communication between different software
applications or components.
2. Databases:
Systems for storing, organizing, and retrieving data. Examples include relational
databases (e.g., MySQL, Oracle) and NoSQL databases (e.g., MongoDB, Cassandra).
3. Web Servers:
Software that handles HTTP requests and responses, serving web pages to users.
Examples include Apache, Nginx, and Microsoft IIS.
4. Application Servers:
Platforms that host and execute software applications, providing services such as
transaction management and security.
Management and Monitoring:
1. System Management Tools:
Software for configuring, monitoring, and managing hardware and software
components in a system.
2. Monitoring and Logging Tools:
Tools that track system performance, log events, and generate alerts in case of
anomalies or issues.
3. Configuration Management:
Systems and tools for automating the configuration and management of infrastructure
components.
A well-designed system infrastructure is critical for the reliability, performance, and scalability of
computer systems. It provides the foundation for running applications, storing and managing data,
and facilitating communication within and across networks.
Traditional Distributed Computing Paradigms:
1. Client-Server Model:
- In the client-server model, tasks are divided between client and server entities.
- Clients request services or resources from servers, which respond to these requests.
- Centralized servers manage resources, and clients are responsible for user interfaces and local
processing.
2. Peer-to-Peer (P2P) Model:
- Peers act both as clients and servers, sharing resources (files, processing power) directly with
each other.
4. Message Passing:
- Processes communicate by explicitly sending and receiving messages; there is no shared state.
5. Distributed Shared Memory:
- Processes can read and write to shared memory locations, even if they are physically distributed
across different nodes.
6. Object-Oriented Paradigm:
- Objects encapsulate data and behavior, and distributed objects communicate through method
invocations.
7. Tuple Spaces:
- Tuple spaces provide a shared memory abstraction in which processes can store and retrieve
tuples (see the sketch after this list).
- Processes communicate by adding and removing tuples from the shared space.
8. Batch Processing:
- Batch processing involves the execution of a series of jobs without user interaction.
- Distributed batch processing systems distribute tasks across multiple nodes for parallel execution.
9. Cluster Computing:
- Cluster computing involves connecting multiple computers to work together as a single system.
- Nodes in a cluster typically share a common storage system and are used for parallel processing.
10. Grid Computing:
- Grid computing extends the concept of cluster computing to a larger scale, often involving
geographically distributed resources.
11. Mobile Agents:
- Mobile agents are autonomous software entities that can migrate between systems to perform
tasks.
12. Stream Processing:
- In stream processing, data is processed as it is generated, rather than being stored and processed
later.
- This paradigm is well-suited for real-time analytics and processing continuous data streams.
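To make the tuple-space paradigm (item 7 above) concrete, here is a minimal thread-safe Python sketch; real systems in the Linda tradition add richer matching and distribution across nodes, which this omits:

    import threading

    class TupleSpace:
        # Minimal tuple space: processes communicate only by putting
        # tuples in and taking matching tuples out.
        def __init__(self):
            self._tuples = []
            self._cv = threading.Condition()

        def put(self, tup):
            with self._cv:
                self._tuples.append(tup)
                self._cv.notify_all()

        def take(self, pattern):
            # pattern uses None as a wildcard, e.g. ("result", None);
            # take blocks until a matching tuple appears.
            def matches(t):
                return len(t) == len(pattern) and all(
                    p is None or p == v for p, v in zip(pattern, t))
            with self._cv:
                while True:
                    for t in self._tuples:
                        if matches(t):
                            self._tuples.remove(t)
                            return t
                    self._cv.wait()

    space = TupleSpace()
    space.put(("result", 42))
    print(space.take(("result", None)))   # -> ('result', 42)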
These traditional paradigms have laid the groundwork for the development of more modern and
specialized approaches to distributed computing, such as cloud computing, edge computing, and
microservices architectures. Each paradigm addresses specific challenges and requirements in
distributed systems, reflecting the evolution of distributed computing over time.
Web Services:
Web services are a standardized way of integrating web-based applications over the internet. They
enable communication and data exchange between different software systems, regardless of the
programming languages, platforms, or devices they are built on. Web services use standard web
protocols and formats, such as HTTP, XML, and JSON, to provide interoperability between diverse
applications. Here are key aspects of web services:
1. Key Components:
a. SOAP (Simple Object Access Protocol):
- It uses XML for message formatting and relies on HTTP, SMTP, or other protocols for message
transmission.
b. REST (Representational State Transfer):
- It uses standard HTTP methods (GET, POST, PUT, DELETE) and is often simpler than SOAP.
2. Service Description:
a. WSDL (Web Services Description Language):
- WSDL is an XML-based language that describes web services and their available operations.
- It provides a standardized way for clients to understand the functionality offered by a web
service.
3. Communication Protocols:
a. HTTP/HTTPS:
- Most web services communicate over HTTP or its secure variant, HTTPS.
- These protocols are widely supported, making it easy for different systems to interact.
b. MIME Types:
- MIME types are used to specify the nature and format of a document or file.
- They play a role in web service communication by defining the data types being exchanged.
4. Data Formats:
a. XML (eXtensible Markup Language):
- It provides a standard way to represent information that is both human-readable and machine-
readable.
b. JSON (JavaScript Object Notation):
- JSON is a lightweight data interchange format that is easy for humans to read and write.
- It has gained popularity for its simplicity and ease of use in web services.
5. Common HTTP Methods:
a. GET:
- Retrieves a representation of a resource without modifying it.
b. POST:
- Submits data to the server, typically creating a new resource.
c. PUT:
- Updates or replaces an existing resource with the supplied representation.
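A small client-side sketch of GET and PUT using only Python's standard library; the endpoint URL is hypothetical, so treat this purely as an illustration of the method semantics:

    import json
    import urllib.request

    # Hypothetical endpoint; api.example.com will not actually answer.
    url = "https://api.example.com/users/1"

    # GET: retrieve a representation of the resource.
    with urllib.request.urlopen(url) as resp:
        user = json.load(resp)          # parse the JSON body into a dict

    # PUT: replace the resource with an updated representation.
    body = json.dumps({"name": "alice"}).encode("utf-8")
    req = urllib.request.Request(url, data=body, method="PUT",
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)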
6. Security:
a. SSL/TLS:
- Secure Sockets Layer (SSL) and Transport Layer Security (TLS) protocols are used to secure
communication between clients and web services.
b. WS-Security:
- WS-Security is a SOAP extension that provides message-level integrity, confidentiality, and
authentication.
7. Service Discovery:
- UDDI (Universal Description, Discovery, and Integration) is a directory service where businesses
can register and discover web services.
8. SOAP vs. REST:
- SOAP (Simple Object Access Protocol) is a protocol for exchanging structured information in web
services.
- REST (Representational State Transfer) is an architectural style that uses standard HTTP methods
and is often simpler than SOAP.
9. Web Service Lifecycle:
a. Publishing:
- The provider registers the service description, allowing potential users to find and learn about
the web service.
b. Finding:
- Clients search a registry (such as UDDI) to locate a service that meets their needs.
c. Binding:
- The client uses the service description to establish a connection to the service.
d. Invocation:
- The client calls the service's operations and receives the results.
10. Web Services in Cloud Computing:
- Web services play a crucial role in cloud computing by facilitating communication and integration
between different cloud-based applications and services.
11. Advantages:
a. Interoperability:
- Web services enable applications built on different platforms and in different languages to
communicate using common protocols.
b. Reusability:
- Web services are designed to be reusable, allowing multiple applications to use the same service.
c. Scalability:
- Web services can be easily scaled to accommodate an increasing number of users or clients.
d. Standardization:
- The use of standard protocols and formats makes web services a widely accepted and
standardized technology.
12. Challenges:
a. Security Concerns:
- Ensuring the secure transmission of sensitive data is a significant challenge in web services.
b. Versioning:
- Managing changes and updates to web service interfaces without disrupting existing users.
c. Latency:
- The overhead of communication and data serialization can introduce latency in web service
interactions.
Web services have become a foundational technology for modern distributed systems, providing a
standardized and interoperable way for different software applications to communicate and
collaborate over the internet.
Grid Standards:
Grid computing involves the coordinated use of distributed computing resources, and the
development of standards is crucial to ensure interoperability and seamless integration of diverse
components. Several standards have been established to facilitate the implementation and
operation of grid systems. Here are some key grid standards:
1. Web Services Resource Framework (WSRF):
- Overview: WSRF extends web services concepts to address stateful resources in a grid
environment.
- Key Features: Models stateful entities as WS-Resources with resource properties and lifetime
management, accessible through standard web service interfaces.
2. GridFTP:
- Overview: GridFTP is an extension of the File Transfer Protocol (FTP) designed for grid
environments.
- Key Features: Supports parallel data streams, third-party (server-to-server) transfers, partial
file transfers, and grid security for moving large data sets efficiently.
3. Job Submission Description Language (JSDL):
- Overview: JSDL is a specification for describing jobs in grid and distributed computing
environments.
- Key Features: Provides an XML-based vocabulary for stating a job's application, resource
requirements, and data staging needs.
4. RMF:
- Overview: RMF is a standard for managing and provisioning resources in a grid environment.
5. GCN:
- Overview: GCN is a framework that provides guidelines and recommendations for standardizing
grid computing.
- Key Features: Recommends the use of existing standards and the development of new ones as
needed.
6. OASIS Topology and Orchestration Specification for Cloud Applications (TOSCA):
- Overview: While initially focused on cloud computing, TOSCA is also relevant to grid computing.
- Key Features: Provides a standardized way to describe the topology of services and their
orchestration.
7. Global Grid Forum (GGF):
- Overview: GGF (now part of the Open Grid Forum) contributed to various grid computing
standards and specifications.
- Key Features: GGF developed standards for grid computing in areas such as resource
management, security, and data management.
8. eXtensible Access Control Markup Language (XACML):
- Overview: XACML is an XML-based language for expressing policies and access control rules.
- Key Features: Defines a standardized way to manage access control in grid and distributed
computing environments.
9. Network Weather Service (NWS):
- Overview: NWS is a set of specifications for collecting and disseminating information about the
computational resources in a grid environment.
- Key Features: Aims to provide information on the current and predicted state of resources for
better resource management.
These standards and frameworks contribute to the development and deployment of grid computing
solutions by ensuring consistency, interoperability, and security across diverse grid environments. As
technology evolves, new standards may emerge or existing ones may be updated to address the
changing landscape of grid computing.
1. Google's Cluster Architecture:
Overview:
- Google uses a massive cluster architecture to power its search engine and various other services.
- The architecture consists of commodity hardware organized into clusters managed by Google's
proprietary software.
Key Features:
- Distributed File System: Google File System (GFS) is used to store and manage large amounts of
data across the cluster.
- Datacenter Efficiency: Clusters are distributed across multiple data centers worldwide for
redundancy and reliability.
2. Scientific Research Clusters:
Overview:
- Many scientific and research institutions utilize high-performance computing clusters for complex
simulations and data analysis.
Key Features:
- Parallel Processing: Clusters are designed for parallel processing to handle computationally
intensive tasks.
- Distributed Memory: Utilizes message-passing interfaces (MPI) for communication among nodes.
3. Amazon EC2:
Overview:
- Amazon Elastic Compute Cloud (EC2) provides resizable compute capacity in the cloud, and users
can create their own clusters.
- Used by businesses and researchers for various applications, including data processing and
analytics.
Key Features:
- Scalability: Users can dynamically scale the number of instances in a cluster based on demand.
- Preconfigured AMIs: Amazon Machine Images (AMIs) provide preconfigured cluster environments
for different applications.
- Cluster Networking: EC2 clusters can be configured to use high-performance networking for low-
latency communication.
4. NASA's Pleiades Supercomputer:
Overview:
- Pleiades is one of NASA's supercomputing clusters used for advanced simulations, climate
modeling, and astrophysics research.
5. Financial Services:
Overview:
- Financial institutions often use cluster computing for risk analysis, algorithmic trading, and other
computational finance tasks.
Key Features:
- Real-Time Analytics: Clusters are used for real-time analytics to inform trading decisions.
- Distributed Data Processing: Handles vast amounts of financial data for modeling and analysis.
6. Weather Forecasting:
Overview:
- Meteorological agencies use cluster systems for weather modeling and prediction.
Key Features:
- Numerical Weather Prediction (NWP): Clusters simulate atmospheric conditions using NWP
models.
- Global Collaboration: Data from multiple clusters worldwide contribute to global weather models.
7. Bioinformatics:
Overview:
- Cluster systems play a vital role in bioinformatics for tasks like DNA sequencing and protein
folding simulations.
Key Features:
- Parallel Processing: Clusters accelerate data analysis by distributing tasks across nodes.
- Customized Algorithms: Clusters may use custom algorithms tailored for specific genomics tasks.
These case studies demonstrate the versatility of cluster systems across different industries and
research domains. Whether it's powering internet-scale services, advancing scientific research, or
supporting critical business operations, clusters provide a scalable and efficient solution for
demanding computing workloads.
Beowulf:
Beowulf is a term that originally referred to an Old English epic poem but has been adopted in the
field of high-performance computing to describe a particular type of clustered computing
architecture. Beowulf clusters are designed to provide parallel processing capabilities for scientific
and engineering applications. Here are key aspects of Beowulf clusters:
1. Definition:
- A Beowulf cluster is a group of inexpensive commodity computers connected by a local network
and running open-source software, which together behave as a single parallel machine.
2. Origins:
- The term "Beowulf" for computing clusters was popularized by Dr. Thomas Sterling and Dr.
Donald Becker in the 1990s. The name was chosen to represent a cluster of interconnected,
independent, and inexpensive processors, akin to the warriors in the Old English epic poem
"Beowulf."
3. Key Characteristics:
- Commodity Hardware: Beowulf clusters use standard, off-the-shelf components such as Intel x86
processors, Ethernet networking, and Linux as the operating system.
- Parallel Processing: The architecture allows multiple processors to work in parallel, dividing
computational tasks among the nodes for increased performance.
- Message Passing Interface (MPI): Beowulf clusters often utilize MPI for communication between
nodes, enabling efficient parallel processing.
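A minimal MPI sketch in Python using the mpi4py binding (assuming mpi4py and an MPI runtime are installed) shows the message-passing style typical of Beowulf clusters: each process computes a partial result and the root combines them:

    from mpi4py import MPI   # requires the mpi4py package

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()   # this process's id within the job
    size = comm.Get_size()   # total number of processes

    # Each process sums its own slice of the range; rank 0 combines them.
    partial = sum(range(rank * 1000, (rank + 1) * 1000))
    total = comm.reduce(partial, op=MPI.SUM, root=0)

    if rank == 0:
        print("sum over", size, "processes:", total)

    # Launch across nodes with, for example: mpirun -np 4 python partial_sum.py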
4. Components:
- Master Node: Coordinates and manages the tasks assigned to each node.
- Compute Nodes: Carry out the computational work distributed by the master node.
- Interconnect: A private local network (commonly Ethernet) linking the nodes.
5. Software Stack:
- Operating System: Typically Linux, chosen for its openness and low cost.
- MPI Libraries: MPI is commonly used for communication and coordination between nodes.
- Cluster Management Software: Tools for managing and scheduling tasks across the cluster.
6. Applications:
- Scientific Computing: Beowulf clusters are widely used in scientific research, simulations, and
data analysis.
- Engineering Simulations: Applications in fields like computational fluid dynamics, finite element
analysis, and structural engineering.
7. Advantages:
- Cost-Effective: Beowulf clusters are cost-effective due to the use of commodity hardware.
- Scalability: Clusters can easily scale by adding more nodes to handle increasing computational
demands.
- Customization: Users have the flexibility to customize and configure the cluster to meet specific
requirements.
8. Challenges:
- Fault Tolerance: Addressing issues related to node failures and ensuring uninterrupted operation.
9. Evolution:
- Over time, Beowulf clusters have evolved, incorporating advancements in hardware, networking,
and software technologies.
- Modern Beowulf clusters may include accelerators like GPUs for enhanced parallel processing.
10. Examples:
- Various research institutions, universities, and organizations worldwide have deployed Beowulf
clusters for diverse scientific and computational tasks.
COMPaS:
COMPaS (Cluster of Multi-Processor Systems) is a research cluster built from commodity SMP
nodes, developed at Japan's Real World Computing Partnership to study SMP-based cluster
computing.
NanOS:
PARAM:
PARAM is a series of supercomputers developed by C-DAC (Centre for Development of Advanced
Computing), India, built as cluster systems and used for scientific and engineering applications.