Module 2 CC
Cluster
A computer cluster is a collection of interconnected stand-alone computers which
can work together collectively and cooperatively as a single integrated computing
resource pool.
Clustering explores massive parallelism at the job level and achieves high
availability (HA) through stand-alone operations.
Packaging
Cluster nodes can be packaged in a compact or a slack fashion.
In a compact cluster, the nodes are closely packaged in one or more racks sitting
in a room, and the nodes are not attached to peripherals (monitors, keyboards,
mice, etc.).
In a slack cluster, the nodes are attached to their usual peripherals (i.e., they are
complete SMPs, workstations, and PCs), and they may be located in different
rooms, different buildings, or even remote regions.
Packaging directly affects communication wire length, and thus the selection of
interconnection technology used.
Homogeneity
A homogeneous cluster uses nodes from the same platform, that is, the same
processor architecture and the same operating system; often, the nodes are from
the same vendors.
A heterogeneous cluster uses nodes of different platforms.
Interoperability is an important issue in heterogeneous clusters.
Compute clusters
These are clusters designed mainly for collective computation over a single large
job.
A good example is a cluster dedicated to numerical simulation of weather
conditions.
The compute clusters do not handle many I/O operations, such as database
services.
When a single compute job requires frequent communication among the cluster
nodes, the cluster must share a dedicated network, and thus the nodes are mostly
homogeneous and tightly coupled.
This type of cluster is also known as a Beowulf cluster.
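Such tightly coupled compute jobs are typically written against a message-passing library such as MPI. A minimal sketch is given below using the mpi4py Python binding; the library choice and the mpirun launch line are assumptions for illustration, not something specified in this module. Each node sums its own slice of a range and the partial results are reduced on one node.

# Minimal sketch of a tightly coupled compute job on a Beowulf-style cluster.
# Assumes an MPI installation and the mpi4py binding; run with e.g.
#   mpirun -np 4 python partial_sum.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()      # this process's ID within the job
size = comm.Get_size()      # total number of processes in the job

N = 1_000_000
# Each rank sums its own slice of 1..N (block distribution).
start = rank * N // size + 1
end = (rank + 1) * N // size
partial = sum(range(start, end + 1))

# Frequent communication step: combine partial results on rank 0.
total = comm.reduce(partial, op=MPI.SUM, root=0)

if rank == 0:
    print(f"sum(1..{N}) = {total}")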
High-Availability clusters
These clusters are designed to be fault-tolerant and to keep services available: when a node fails, its workload fails over to a surviving node.
Load-balancing clusters
These clusters shoot for higher resource utilization through load balancing among
all participating nodes in the cluster.
All nodes share the workload or function as a single virtual machine (VM).
Requests initiated by users are distributed among all the node computers that form the
cluster.
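As a rough illustration of how a front end might spread such requests, the sketch below implements a simple round-robin dispatcher in Python; the node names and the dispatch step are hypothetical placeholders rather than anything prescribed in this module.

# Minimal sketch of round-robin request distribution in a load-balancing cluster.
# The node list and the forwarding step are hypothetical placeholders.
from itertools import cycle

class RoundRobinDispatcher:
    """Spreads incoming requests evenly across the participating nodes."""

    def __init__(self, nodes):
        self._nodes = cycle(nodes)

    def dispatch(self, request):
        node = next(self._nodes)           # pick the next node in rotation
        print(f"forwarding {request!r} to {node}")
        return node                        # a real dispatcher would forward the request here

if __name__ == "__main__":
    dispatcher = RoundRobinDispatcher(["node01", "node02", "node03", "node04"])
    for i in range(8):
        dispatcher.dispatch(f"request-{i}")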
Analysis of the Top 500 Supercomputers
The top supercomputers are analyzed by running the Linpack benchmark program.
The single instruction, multiple data (SIMD) machines disappeared in 1997. The
cluster architecture appeared in a few systems in 1999.
Cluster systems now dominate the Top 500 list as the leading architecture class,
accounting for more than 400 of the 500 systems.
The speed of supercomputers has increased over time.
Linux is the operating system used by most of the Top 500 supercomputers.
Jaguar and Hopper are among the most powerful supercomputers, but they consume a lot of power.
The Tianhe-1A consumes less power because it uses GPUs.
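Two simple quantities underlie these comparisons: the sustained Linpack rate (Rmax, measured against the theoretical peak Rpeak) and performance per watt. The sketch below shows only the arithmetic; the numbers plugged in are illustrative placeholders, not actual Top 500 measurements.

# Arithmetic behind the Top 500 comparisons: theoretical peak, sustained Linpack
# rate, and power efficiency. All input numbers below are illustrative
# placeholders, not real Top 500 data.

def theoretical_peak_tflops(nodes, cores_per_node, clock_ghz, flops_per_cycle):
    """Peak rate if every core retired its maximum FLOPs every cycle."""
    return nodes * cores_per_node * clock_ghz * flops_per_cycle / 1e3  # GFLOPS -> TFLOPS

def gflops_per_watt(rmax_tflops, power_kw):
    """Power efficiency: sustained Linpack GFLOPS per watt consumed."""
    return (rmax_tflops * 1e3) / (power_kw * 1e3)

if __name__ == "__main__":
    peak = theoretical_peak_tflops(nodes=10_000, cores_per_node=12,
                                   clock_ghz=2.6, flops_per_cycle=4)
    rmax = 0.75 * peak                      # assume 75% Linpack efficiency
    print(f"Rpeak ~ {peak:,.0f} TFLOPS, Rmax ~ {rmax:,.0f} TFLOPS")
    print(f"Efficiency ~ {gflops_per_watt(rmax, power_kw=7_000):.2f} GFLOPS/W")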
Basic Cluster Architecture
A simple cluster of computers is built with commodity components and fully supported
with the desired SSI features and HA capability.
The processing nodes are commodity workstations, PCs, or servers.
These commodity nodes are easy to replace or upgrade with new generations of
hardware.
The node operating systems should be designed for multiuser, multitasking, and
multithreaded applications.
The nodes are interconnected by one or more fast commodity networks.
These networks use standard communication protocols and operate at a speed that
should be two orders of magnitude faster than that of the current TCP/IP speed over
Ethernet.
The network interface card is connected to the node’s standard I/O bus (e.g., PCI).
When the processor or the operating system is changed, only the driver software needs
to change.
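Because the nodes communicate over the standard TCP/IP stack, plain socket code is enough to move data between them. The sketch below runs both endpoints on one machine for convenience; on a real cluster the server address would be another node, and the port number and message contents are arbitrary choices made for illustration.

# Minimal sketch of node-to-node communication over a commodity network using
# the standard TCP/IP stack. Both ends run on localhost here; on a real cluster
# the "server" would be another node's address.
import socket
import threading

PORT = 50007                                # arbitrary choice for this sketch
ready = threading.Event()

def server():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.bind(("127.0.0.1", PORT))
        s.listen(1)
        ready.set()                         # tell the client the port is open
        conn, _ = s.accept()
        with conn:
            data = conn.recv(1024)
            conn.sendall(b"ack: " + data)   # acknowledge the received message

def client():
    ready.wait()                            # wait until the server is listening
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.connect(("127.0.0.1", PORT))
        s.sendall(b"partial result from node01")
        print(s.recv(1024).decode())

if __name__ == "__main__":
    t = threading.Thread(target=server)
    t.start()
    client()
    t.join()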
There is a desire to have a platform-independent cluster operating system sitting on top
of the node platforms.
A cluster middleware can be used to glue together all node platforms at the user space.
An availability middleware offers HA services.
An SSI layer provides a single entry point, a single file hierarchy, a single point of
control.
The shared-nothing architecture is used in most clusters, where the nodes are
connected through the I/O bus. The shared-disk architecture is favored for small-scale
availability clusters in business applications. When one node fails, the other node
takes over.
The shared-nothing configuration simply connects two or more autonomous computers
via a LAN such as Ethernet.
Shared Disk architecture
This is what most business clusters desire so that they can enable recovery support
in case of node failure.
The shared disk can hold checkpoint files or critical system images to enhance
cluster availability.
Without shared disks, checkpointing, rollback recovery, failover, and failback are
not possible in a cluster.
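One concrete use of the shared disk is periodic checkpointing: the active node writes its job state to the shared volume, and the node that takes over rolls back to the last committed checkpoint. The sketch below illustrates the idea with a local directory standing in for the shared disk; the path, file name, and state layout are assumptions made for the example.

# Minimal sketch of checkpoint/rollback via a shared disk. The directory below
# stands in for a volume mounted on every node; path, file name, and state
# layout are assumptions made for this sketch.
import json
from pathlib import Path

SHARED_DISK = Path("shared_disk")            # stands in for the shared mount point
CHECKPOINT = SHARED_DISK / "job.checkpoint.json"

def save_checkpoint(state: dict) -> None:
    """Active node: persist job state so a peer can resume after failover."""
    tmp = CHECKPOINT.with_suffix(".tmp")
    tmp.write_text(json.dumps(state))
    tmp.replace(CHECKPOINT)                  # atomic rename avoids torn checkpoints

def restore_checkpoint() -> dict:
    """Takeover node: roll back to the last committed checkpoint, if any."""
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())
    return {"iteration": 0}                  # no checkpoint yet: start fresh

if __name__ == "__main__":
    SHARED_DISK.mkdir(exist_ok=True)
    state = restore_checkpoint()
    for _ in range(3):
        state["iteration"] += 1              # one unit of work
        save_checkpoint(state)               # checkpoint after each step
    print("last committed state:", restore_checkpoint())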
Shared Memory architecture
A shared-memory cluster is more difficult to realize; the nodes are typically connected
through the memory bus of each node, for example via a scalable coherence interface
(SCI) ring.
Google has many data centers using clusters of low-cost PC engines. These clusters
are mainly used to support Google’s web search business.
A Google cluster interconnects 40 racks of PC engines via two racks of 128 x 128
Ethernet switches.
Each Ethernet switch can handle 128 one Gbps Ethernet links.
A rack contains 80 PCs.
This is an earlier cluster of 3,200 PCs. Google’s search engine clusters are built
with a lot more nodes.
Two switches are used to enhance cluster availability. The cluster works fine even
when one switch fails to provide the links among the PCs.
The front ends of the switches are connected to the Internet via 2.4 Gbps OC 48
links.
The 622 Mbps OC 12 links are connected to nearby data-center networks. In case
of failure of the OC 48 links, the cluster is still connected to the outside world via
the OC 12 links.
Thus, the Google cluster eliminates all single points of failure.
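The quoted figures can be tied together with a little arithmetic, e.g. 40 racks of 80 PCs give the 3,200 nodes mentioned above. The sketch below simply recomputes those aggregates from the numbers already quoted; no additional data is assumed.

# Back-of-the-envelope figures for the Google cluster described above,
# recomputed from the quoted numbers (racks, PCs per rack, switch ports,
# link speeds). Purely arithmetic; no additional data is assumed.
RACKS = 40
PCS_PER_RACK = 80
SWITCHES = 2
PORTS_PER_SWITCH = 128          # 128 x 128 Ethernet switch
GBPS_PER_PORT = 1.0             # one Gbps Ethernet links
OC48_GBPS = 2.4                 # primary Internet uplink
OC12_GBPS = 0.622               # backup links to nearby data-center networks

total_pcs = RACKS * PCS_PER_RACK
switch_ports = SWITCHES * PORTS_PER_SWITCH
switch_capacity_gbps = switch_ports * GBPS_PER_PORT

print(f"PC nodes:                  {total_pcs}")          # 3,200 PCs
print(f"Switch ports available:    {switch_ports} x {GBPS_PER_PORT} Gbps")
print(f"Aggregate switch capacity: {switch_capacity_gbps:.0f} Gbps")
print(f"Backup capacity if OC-48 fails: {OC12_GBPS / OC48_GBPS:.0%} of primary")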