Lambda Echelon Deep Learning GPU Cluster White Paper v2022.09.23
Lambda Echelon
Deep Learning GPU Cluster
Table of Contents
Lambda Echelon
Use Cases
Echelon Design
Cluster Design
Compute
Storage
Networking
PDU Overview
Thermal Design
GPU Selection
CPU Selection
Software
Lambda Stack
GPUDirect Technology
MLOps Platforms
Support
Silver
Gold
Platinum
About Lambda
Our Customers
Production Inference
Lambda Echelon
Many of the world’s leading AI research teams spend more on computation than they do
on their entire AI research headcount.
In this whitepaper, we’ll walk you through the Lambda Echelon multi-node cluster reference
design: a node design, a rack design, and an entire cluster-level architecture. This document is
for technical decision-makers and engineers. You’ll learn about the Echelon’s compute, storage,
networking, power distribution, and thermal design. This is not a cluster administration
handbook; it is a high-level technical overview of one possible system architecture.
When you’re ready to design a Lambda Echelon for your team, get in touch with us by
emailing [email protected] or calling 866-711-2025.
Use Cases
Understanding your use case is the first step to designing a cluster. Echelon systems can be
designed for three use cases:
Hyperparameter Search
● GPU Choice: NVIDIA RTX A6000, NVIDIA RTX A5000
● Operating Mode: Usually offline, job-queue based

Large Scale Distributed Training
● GPU Choice: NVIDIA A100 80GB
● Operating Mode: Usually offline, job-queue based

Production Inference
● GPU Choice: NVIDIA A10, A30, A40, or RTX A6000
● Operating Mode: Online, latency-sensitive
Echelon Design
“Anyone can build a fast CPU. The trick is to build a
fast system.” —Seymour Cray
The Echelon reference design spans three levels of abstraction:
1. Cluster Design: the highest level of abstraction. Entire racks are just dots on the
screen. Data center floor plans, capacity planning (power, network, storage, compute),
and network topologies are the main output products at this layer.
2. Rack Design: rack elevations describe the layout and exact position of individual nodes
in a particular rack. Rack TDP, cable and port counts are important details at this layer.
3. Node Design: node bills of materials determine what components are placed into a
server. Component choice is driven by cluster-level design goals.
Software: it’s important to remember that there is a two-way design dependency between the
cluster hardware and the software that runs on it. Selecting the right software depends on the
use case and scale of your planned deployment.
Cluster Design
“There is considerable skepticism regarding the
viability of massive parallelism; the skepticism
centers around Amdahl's law... we now have timing
results for a 1024-processor system that
demonstrate that the assumptions underlying
Amdahl's 1967 argument are inappropriate for the
current approach to massive ensemble parallelism.”
—John Gustafson
Compute
The choice of compute node type and mix is determined by the use case of the cluster.
● Hyperparameter search: you’ll likely see Lambda Scalar systems with NVIDIA RTX
GPUs.
● Large scale distributed training: you’ll likely see Lambda Hyperplane systems with
NVIDIA A100 Tensor Core GPUs.
● Production inference: you’ll end up with either NVIDIA A10, A30, A40 Tensor Core
GPUs or RTX A6000 GPUs in Lambda Scalar systems.
One way to participate in the design process with us is to configure a Lambda Scalar or Lambda
Hyperplane to be used as a compute node in your cluster. You can configure them on our
website: https://lambdalabs.com/products/blade and
https://lambdalabs.com/products/hyperplane. Please refer to our benchmarks to get a better
understanding of the expected throughput of various NVIDIA GPUs.
Storage
Below, you’ll see a few common storage setups. Storage cluster design is highly dependent on
your use case. Some folks can get away with just a single NFS storage server with an NVMe
local cache for a very large cluster. Others require large-scale parallel cluster file systems to
support their workload.
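To make the NFS-with-local-NVMe-cache pattern concrete, here is a minimal sketch (our assumption, not a shipped configuration) of how a compute node could mount such an export using Linux FS-Cache; the hostname storage-01, the export path, and the mount point are placeholders, and the cache directory configured in /etc/cachefilesd.conf would live on the node's local NVMe drive:

  sudo apt-get install -y cachefilesd        # local cache daemon; enable it via RUN=yes in /etc/default/cachefilesd
  sudo systemctl enable --now cachefilesd    # start the FS-Cache backend (point its cache dir at the NVMe drive)
  sudo mount -t nfs -o rw,hard,fsc storage-01:/datasets /mnt/datasets   # 'fsc' enables FS-Cache for this NFS mount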
Regardless of the final architecture, Lambda has pre-existing OEM relationships with practically
every storage appliance provider in the world. Echelon has been designed to support both
proprietary and open source storage solutions. Because they can all be purchased through
Lambda, your procurement process is greatly simplified.
Proprietary Storage Options
Networking
Below, we walk through a network architecture and topology for an Echelon cluster designed for
large scale distributed training. For hyperparameter search and production inference clusters,
you’ll likely see a single unified 100 Gb/s ethernet fabric instead of the dual compute/storage
InfiniBand fabric seen here. An Echelon cluster supports many network configurations,
including:
The InfiniBand compute fabric enables rapid node-to-node data transfers via GPU/InfiniBand
RDMA while the storage fabric enables rapid access to your data sets. Network bandwidth
requirements for training vary based on model (size, layer type), modality (text vs. image, etc.)
and training regime (model parallel vs. data parallel). Thus, optimal configurations can vary from
this design.
Our recommended racks are 800mm wide to accommodate extra deep servers, multiple PDUs,
and to prevent PDU power ports from being covered by installed servers.
Lambda HPC engineers can sit down with you and create a tailored data center deployment
plan. Lambda builds both full service and onsite deployed racks. Full service racks are “racked,
stacked, labeled, and cabled” and shipped to you, ready to roll, in a rack crate. Onsite deployed
racks are racked, stacked, labeled, and cabled after individual components are shipped onsite.
Sample data center floor plan visualization with cable length calculations.
Design constraints:
We’ve targeted a rack TDP of ~28kW. This can be cooled with a DDC closed-loop chiller rack,
a Rear Door Heat Exchanger from Motivair, or a similar solution. Details on the DDC closed-loop
chiller rack are included in the section titled “Closed loop chiller rack solutions”.
The single-rack cluster is designed with a TDP of 28kW. Power is provided by four 17.2kW
PDUs set up for A/B redundancy (a 60A, 208V three-phase PDU delivers roughly 208 V × 60 A ×
√3 × 0.8 ≈ 17.3 kW after the standard 80% continuous-load derating). In the table below, we
show the bill of materials for the rack as well as the TDP of each component.
Network Notes
● Each storage server connects to the SN2700 Converged In-Band + Storage Fabric with 1x 100GbE NIC
● Each compute server connects to the Compute Fabric with 8x HDR HCAs
● Each compute server connects to the Converged In-Band + Storage Fabric with 1x 100GbE NIC
● Each management node connects to the Converged In-Band + Storage Fabric with 1x 100GbE NIC
This single rack provides a total of 40 NVIDIA A100 GPUs with a 200 Gb/s InfiniBand compute
fabric and a 100 Gb/s converged in-band + storage Ethernet fabric. It also offers 8U of
customizable space.
The cluster above is designed with four separate network fabrics: 200 Gb/s HDR InfiniBand for
compute, 200 Gb/s HDR InfiniBand for storage, 10GBASE-T (RJ-45) for in-band, and a 1Gb/s
IPMI switch. This cluster configuration offers 20 nodes and 160 GPUs. IPMI switches are daisy
chained together with a single 1 Gb/s ethernet cable.
As we scale beyond four racks, a core networking rack becomes necessary, and an InfiniBand
director switch is recommended in this scenario for ease of management. The sample compute
and storage racks can be replicated as needed to obtain the necessary amount of compute and
storage capacity until all of the director switch’s ports are utilized.
Photographed above is another Echelon cluster configured with one Hyperplane-16, two
Hyperplane-8s, one Lambda Scalar, and three storage nodes, using 100 Gb/s InfiniBand and
10 Gb/s ethernet networking.
PDU Overview
Our default configuration uses N+1 redundant power with a 60A 208V switched & metered PDU
from APC. However, where 415V power is available, rack densities exceeding 30kW become
possible. Alternative power distribution configurations are available that can meet a wide variety
of voltage and density constraints.
Pictured: IEC60309 60A 3-phase plug (208V); IEC60309 60A 3-phase plug (415V); NEMA L15-30P 30A 3-phase plug (208V)
Common Server to PDU Plugs
Essentially all servers & PDUs around the world use one of these four IEC plug / receptacle
pairs. It’s important to familiarize yourself with these plugs and receptacles.
Name            Plugs into                        Max Amps
IEC C13 Plug    IEC C14 Receptacle (on server)    15A (max power ~2.5kW)
IEC C14 Plug    IEC C13 Receptacle (on PDU)       15A (max power ~2.5kW)
IEC C19 Plug    IEC C20 Receptacle (on server)    20A (max power ~3.2kW)
IEC C20 Plug    IEC C19 Receptacle (on PDU)       20A (max power ~3.2kW)
IEC stands for International Electrotechnical Commission, an international standards organization headquartered in Switzerland.
Thermal Design
The standard Echelon rack is designed with 28kW peak TDP. A 28kW power draw converts to
95,538.8 BTU/hr of heat that must be removed (1 kW ≈ 3,412 BTU/hr). Three methods can be
used to dissipate this heat:
The ASHRAE data center guidelines are useful for determining whether the system is being
sufficiently cooled. ASHRAE thermal guidelines allow for a dry-bulb temperature (a thermometer
exposed to air) of over 80 degrees Fahrenheit depending on the relative humidity. For more
information, see this ASHRAE Thermal Guidelines presentation from Berkeley Lab (one of our
customers).
Closed loop chiller rack solutions
ScaleMatrix DDCs can support workloads up to 52kW in a 45U rack form factor. DDC rack
cooling is provided by a closed loop water cooling system with an air-to-water heat exchanger.
The IT equipment installed in the DDC rack does not need direct-to-chip cooling to take
advantage of the cooling system, allowing the use of standard air-cooled servers.
Inside the top portion of the DDC rack is an air-to-water heat exchanger capable of dissipating
52kW of power into the chilled water.
DDC Rack Technical Specifications
DDC racks come in three models which support 17, 34, and 52kW of cooling within 45U.
If you’re able to accept shipment of an entire rack, Lambda can perform the full rack, stack,
label, and cable process on our manufacturing floor.
We actually start the process of designing the node before we start the rack elevation (because
the node choice determines the TDP of the rack). Node design boils down to components:
GPUs, CPUs, memory, storage, networking interfaces, and motherboard (PCIe topology). Let’s
go over some of those choices here.
GPU Selection
A quick guide to enterprise data center GPU selection is given below. There are two main
factors to consider: FLOPS per dollar and the amount of VRAM on the GPU. Some models, like
BERT-Large, require more than 24 GB of GPU VRAM just to reach a batch size of 1, meaning a
batch size of 2 will require either an NVIDIA RTX A6000 or an A100 80 GB GPU.
CPU Selection
CPU selection comes down to both the server platform you want to use and the features and
performance of the actual CPU. AMD EPYC and Intel Ice Lake CPUs are the only mass-market
CPUs available that support PCIe Gen 4. In addition, due to the large number of PCIe lanes
provided by the AMD EPYC series, server motherboards don’t need as many expensive PCIe
switches.
System Block Diagram
Below you’ll see the system block diagram for both the Lambda Hyperplane-8 NVIDIA A100
GPU server and the Lambda Scalar GPU server. These block diagrams specify the maximum
RAM and add-in cards these systems support. They also describe the PCIe topology of the
system, which is an important consideration if you plan to use GPUDirect RDMA or RDMA over
Converged Ethernet (RoCE).
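If you want to verify the topology of a node you already have on hand, the NVIDIA driver ships a utility that prints the GPU/NIC interconnect matrix (the only assumption here is an installed NVIDIA driver):

  nvidia-smi topo -m   # prints the device connectivity matrix (NV# = NVLink, PIX = same PCIe switch, PXB/NODE/SYS = further away)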
Software
“People who are really serious about software
should make their own hardware.” -Alan Kay
The software you run on your cluster must inform the hardware decisions, all the way down to
the components in the nodes. Conversely, the hardware decisions also provide constraints and
unlock new features that can be leveraged in the software. Lambda maintains Lambda Stack,
which installs GPU-accelerated versions of PyTorch and TensorFlow along with NVIDIA drivers,
CUDA, and cuDNN. Most importantly, it provides a managed upgrade path for all of this software.
Lambda Stack
Lambda Stack automatically manages your team’s software upgrade path to reduce system
administration overhead and downtime. With one command, you can update all of your
software, from drivers to frameworks, with all of the dependencies automatically resolved:
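Lambda Stack is distributed as Ubuntu packages, so that one command is typically an ordinary APT upgrade, shown here as a representative example:

  sudo apt-get update && sudo apt-get dist-upgrade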
You can learn more about Lambda Stack at the Lambda Stack homepage.
Lambda Stack offers GPU-accelerated Docker packages and nvidia-container-toolkit so you
can run your own GPU-accelerated Dockerfiles as well as Docker containers from NVIDIA’s
NGC Deep Learning container registry. For more information on our open-source Dockerfiles,
see our git repository: https://github.com/lambdal/lambda-stack-dockerfiles
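As a hedged illustration of what this enables, the following one-liner runs an NGC PyTorch container with GPU access; the image tag is only an example, so check the NGC catalog for current tags:

  docker run --gpus all --rm -it nvcr.io/nvidia/pytorch:22.09-py3   # '--gpus all' requires nvidia-container-toolkit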
GPUDirect Technology
Our nodes are designed to leverage GPUDirect® technologies: GPUDirect RDMA and
GPUDirect Storage, preventing wasteful double-copy transfers to CPU memory. GPUDirect
RDMA transfers data directly from the GPU memory out the door to the InfiniBand HCA.
GPUDirect Storage directly transfers data stored on a local NVMe drive into the GPU’s memory.
GPUDirect RDMA can effectively double the node-to-node transfer bandwidth and is very useful
for distributed training tasks.
The diagrams above show the data pathways of both GPUDirect RDMA and GPUDirect
Storage. Note that GPUDirect RDMA runs over InfiniBand, while RDMA over Converged
Ethernet (RoCE) provides the equivalent capability over Ethernet.
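In practice, distributed training frameworks reach GPUDirect RDMA through NCCL. The environment variables below, shown with example values, are one common way to confirm or steer that behavior on an InfiniBand fabric; this is a sketch under our assumptions, not Lambda's shipped configuration:

  export NCCL_IB_HCA=mlx5           # bind NCCL to the Mellanox InfiniBand HCAs
  export NCCL_NET_GDR_LEVEL=PIX     # allow GPUDirect RDMA when the GPU and HCA share a PCIe switch
  export NCCL_DEBUG=INFO            # the NCCL log then reports whether GPUDirect RDMA was used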
Tools like Bright Cluster Manager can simplify HPC cluster management. Bright can install the
following software across the cluster using simple wizards:
MLOps Platforms
Lambda has partnered with MLOps providers to help build, deploy, and scale your models in a
GUI environment. Tools such as Cnvrg.io and Determined.AI allow you to train your models on
top of a managed Kubernetes cluster. This allows for easy expansion and migration between
on-prem and public cloud. These platforms typically provide:
● Hyperparameter tuning
● Deep Learning training & inference job scheduling
● Model & data versioning
● Experiment management & repeatability
● Hybrid & multi-cloud support
● Resource monitoring & optimization
● In-browser JupyterLab or RStudio data science IDEs
Support
“I wanna wake up! Tech support! It's a nightmare!
Tech support! Tech support!” —Tom Cruise in
Vanilla Sky
Lambda’s team of seasoned HPC experts has successfully deployed thousands of nodes.
We’re able to provide your company with both remote and onsite support. Echelon comes with
one of three tiers of support:
Silver
Silver support is designed for advanced users who are self-sufficient in supporting an Echelon
environment themselves but want to supplement that with node-level hardware support.
Silver support covers email-only technical support. With Silver Support, you’ll have frontline
access to our online knowledge base which includes software/firmware patches, technical
resources, critical advisories, and FAQs. Silver support includes a standard 30-day RMA policy
and is a hardware-only support package. The typical response time for the Silver plan is one
business day. Customers looking for cluster-level support should utilize a Gold or
Platinum support plan.
Gold
Gold support is our most popular support plan as it provides end-to-end support for your entire
Echelon cluster over phone and email. Unlike other cluster solutions, our support team consists
of Linux and Deep Learning Engineers that understand your use case and software stack. In the
event of a technical issue, you will have full access to our support team regardless of whether
that issue is occurring at the hardware, software, or cluster level. The Gold support plan covers
support tickets and cluster performance tickets related to hardware and pre-installed software
(Lambda Stack, drivers, OFED). In the event of a hardware failure, you’ll be covered by our
advanced parts replacement warranty; there is no need to wait for a full RMA process with the
component manufacturer. A private onboarding Slack channel is provided for each of our Gold
customers to ensure a smooth start-up & hand-off. Gold support ticket response times will be
within 8 hours during normal business hours.
Platinum
Platinum support provides all features of Gold plus an onsite support package that covers every
aspect of your cluster with rapid response SLAs. With included white glove installation, Platinum
support plans start being valuable the day you turn your new cluster on. With Platinum support,
you’ll have direct access to a Lambda Technical Account Manager (TAM) as an engineering
resource to support you along the way. In the event that onsite support is required, a Lambda
employee will be there to assist you with your ticket. The Platinum support plan is designed for
organizations that are hosting mission critical applications that require expedited response times
with white glove treatment. Platinum support covers your cluster end-to-end with both phone
and email support as well as next business day advanced parts replacement. Platinum support
tickets have the highest priority level with response times within 4 hours during normal business
hours.
For support packages that include advanced parts replacement, you’ll receive replacement
parts directly through Lambda before returning the component.
Support Tier Matrix
Support Contact: Silver - Email; Gold - Phone (8AM - 6PM M-F PT) + Email; Platinum - Phone (8AM - 6PM M-F PT) + Email 24/7
Business Hours Response Time: Silver - 1 Business Day; Gold - 8 Hours; Platinum - 4 Hours
Dedicated Support Email & TAM: Silver - ✘; Gold - ✘; Platinum - ✔
Onsite Services: Silver - ✘; Gold - ✘; Platinum - ✔ (onsite services included)
Onboarding Services: Silver - ✘; Gold - Dedicated Slack channel for the first month + 2 hours of onboarding call(s) for cluster; Platinum - Dedicated Slack channel for the first month + 4 hours of onboarding call(s) for cluster and distributed training
White Glove Installation Included: Silver - ✘; Gold - ✘; Platinum - ✔
NVIDIA Networking Support Pass-Thru*: Silver - Bronze; Gold - Silver; Platinum - Silver or Gold
About Lambda
Lambda (https://lambdalabs.com) provides GPU accelerated workstations and servers to the
top AI research labs in the world. Our hardware and software is used by AI researchers at
Apple, Intel, Microsoft, Tencent, Stanford, Berkeley, University of Toronto, Los Alamos National
Labs, and many others.
Our Customers
Lambda works with the world’s top companies and universities. We have provided GPU
systems for Machine Learning researchers at the following institutions:
Hyperparameter Search
During a hyperparameter search, network architectures A, B, and C will be trained and have
their performance evaluated. This can be done by a single machine, three separate machines,
or even an entire fleet of machines. It’s an “embarrassingly parallel” problem which doesn’t
require any kind of node-to-node communication; thus, a hyperparameter search cluster can be
designed to leverage less expensive, lower-bandwidth network fabrics. In many ways, the
hyperparameter search cluster is quite similar to the production inference cluster; they're both
embarrassingly parallel workloads that focus on throughput.
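As a sketch of how parallel this workload is, the loop below submits one independent job per hyperparameter combination to a Slurm-style queue; the script name train.sh and its flags are hypothetical placeholders:

  for lr in 1e-4 3e-4 1e-3; do
    for bs in 64 128 256; do
      sbatch --gres=gpu:1 train.sh --lr "$lr" --batch-size "$bs"   # each job runs wherever a GPU is free; no inter-node traffic
    done
  done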
Large Scale Distributed Training
For a good introduction to how distributed training works in practice, see: A Gentle Introduction
to Multi GPU and Multi Node Distributed Training.
Distributed training requires far more node-to-node communication than either production
inference or hyperparameter search. Thus, Echelon clusters for distributed training are designed
with dedicated InfiniBand compute and storage fabrics supporting up to 1,600 Gb/s of peak
node-to-node bandwidth (eight 200 Gb/s HDR HCAs per compute node).
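To make the contrast with the other two use cases concrete, here is a hedged sketch of a two-node, data-parallel launch using PyTorch's torchrun; the script train.py, the host name head-node, and the port are placeholders:

  torchrun --nnodes=2 --nproc_per_node=8 \
           --rdzv_backend=c10d --rdzv_endpoint=head-node:29500 \
           train.py   # NCCL carries the gradient all-reduce over the InfiniBand compute fabric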
Production Inference
Production inference is deploying your model at scale. It requires high throughput per dollar, high
uptime and availability, and robustness to individual machine outages. Production inference is
often the only workload with a user waiting at the other end of a screen for a result, and that
result is expected to arrive quickly.
A cluster designed for production inference will almost always have both A/B-redundant power
and no single point of failure throughout the hardware. This means you’ll see dual switches,
dual NICs, and dual power supplies.
If you’re ready to build a Lambda Echelon for your team, please get in touch with us by
emailing [email protected] or calling 866-711-2025.