

Lambda Echelon
Deep Learning GPU Cluster

REFERENCE DESIGN WHITEPAPER


Updated: 07/14/2022

https://lambdalabs.com - [email protected] - 866-711-2025


© 2021 Lambda
Table of Contents

Lambda Echelon
Use Cases
Echelon Design
Cluster Design
    Compute
    Storage
        Sample Storage Architectures
        Proprietary Storage Options
        Open Source Storage Options
        TrueNAS Storage Management Dashboard
    Networking
    Data Center Power Distribution & Floor Planning
        Cluster Floor Planning
Rack Level Design
    Three Scalable Rack Elevations
    Lambda Echelon Single Rack Elevation
    Single Rack Cluster Bill of Materials
    Single Rack Design (40 NVIDIA GPUs)
    Four Rack Design (160 NVIDIA GPUs)
    22 Rack Design (800 NVIDIA GPUs)
    Rack Power Distribution
        PDU Overview
        Common PDU Input Plugs
        Common Server to PDU Plugs
    Thermal Design
        Closed loop chiller rack solutions
        DDC Rack Technical Specifications
    Rack Level Integration - Rack, Stack, Label, and Cable
Node Level Design
    GPU Selection
    CPU Selection
    System Block Diagram
    NVMe NFS Cache
Software
    Lambda Stack
        Managed Upgrade Path for your AI software
        Lambda Stack System Wide Installation
        Compatible with your Dockerfiles & NGC Docker Containers
    GPUDirect Technology
    Cluster Management Software
    MLOps Platforms
Support
    Silver
    Gold
    Platinum
    Advanced Parts Replacement
    Support Tier Matrix
About Lambda
    Our Customers
Let's build one for you
Appendix A: Use Case Descriptions
    Hyperparameter Search / Neural Architecture Search
    Large Scale Distributed Training
    Production Inference

Lambda Echelon
Many of the world’s leading AI research teams spend more on computation than they do
on their entire AI research headcount.

In this whitepaper, we'll walk you through the Lambda Echelon multi-node cluster reference
design: a node design, a rack design, and an entire cluster-level architecture. This document is
for technical decision-makers and engineers. You'll learn about Echelon's compute, storage,
networking, power distribution, and thermal design. This is not a cluster administration
handbook; it is a high-level technical overview of one possible system architecture.

When you’re ready to design a Lambda Echelon for your team, get in touch with us by
emailing [email protected] or calling 866-711-2025.


Use Cases
Understanding your use case is the first step to designing a cluster. Echelon systems can be
designed for three use cases:

1. Hyperparameter search: an optimal system for hyperparameter search focuses on
   training throughput; high node-to-node bandwidth is not required.
2. Large scale distributed training: optimal distributed training clusters focus on
   node-to-node bandwidth in order to achieve better scaling performance when passing
   gradients between nodes.
3. Production inference: optimal systems focus on throughput and high availability.

 | Hyperparameter Search | Large Scale Distributed Training
Node to Node Bandwidth | Low | High
GPU Choice | NVIDIA RTX A6000, NVIDIA RTX A5000 | NVIDIA A100 80GB
Key Metric | Training throughput / $ | Time to train a single large model
1+1 Redundancy | Optional | Optional
Operating Mode | Usually offline, job-queue based | Usually offline, job-queue based

 | Production Inference
Node to Node Bandwidth | Low
GPU Choice | NVIDIA A10, NVIDIA A30, NVIDIA A40
Key Metric | High availability & throughput
1+1 Redundancy | Critical
Operating Mode | Often online, real time results


For more info, see our use case descriptions section in the appendix.


Echelon Design
“Anyone can build a fast CPU. The trick is to build a
fast system.” —Seymour Cray

Echelon GPU cluster design occurs on three separate levels of abstraction:

1. Cluster Design: the highest level of abstraction. Entire racks are just dots on the
screen. Data center floor plans, capacity planning (power, network, storage, compute),
and network topologies are the main output products at this layer.

2. Rack Design: rack elevations describe the layout and exact position of individual nodes
in a particular rack. Rack TDP, cable and port counts are important details at this layer.

3. Node Design: node bills of materials determine what components are placed into a
server. Component choice is driven by cluster design goals.

Software: it's important to remember that there is a two-way design dependency between the
cluster hardware and the software that runs on it. Selecting the right software depends on the
use case and scale of your planned deployment.

We’ll now walk through each of these design layers.


Cluster Design
“There is considerable skepticism regarding the
viability of massive parallelism; the skepticism
centers around Amdahl's law... we now have timing
results for a 1024-processor system that
demonstrate that the assumptions underlying
Amdahl's 1967 argument are inappropriate for the
current approach to massive ensemble parallelism.”
—John Gustafson

Cluster architectures have five main components:

1. Compute: NVIDIA GPU & CPU nodes.
2. Storage: to serve data sets and store trained models / checkpoints.
3. Networking: multiple networks for compute, storage, in-band management, and
   out-of-band management.
4. Data center power distribution & floor planning: understanding the electrical and
   physical layout of the deployment location informs the rack elevations, cable lengths,
   and networking.
5. Software: cluster orchestration, job scheduling, resource allocation, container
   orchestration, and node-level software stacks.

Compute
The choice of compute node type and mix is determined by the use case of the cluster.

● Hyperparameter search: you’ll likely see Lambda Scalar systems with NVIDIA RTX
GPUs.
● Large scale distributed training: you’ll likely see Lambda Hyperplane systems with
NVIDIA A100 Tensor Core GPUs.
● Production inference: you’ll end up with either NVIDIA A10, A30, A40 Tensor Core
GPUs or RTX A6000 GPUs in Lambda Scalar systems.

One way to participate in the design process with us is to configure a Lambda Scalar or Lambda
Hyperplane to be used as a compute node in your cluster. You can configure them on our
website: https://lambdalabs.com/products/blade and
https://lambdalabs.com/products/hyperplane. Please refer to our benchmarks at
https://lambdalabs.com/gpu-benchmarks to get a better understanding of the expected
throughput of various NVIDIA GPUs. We can then provide feedback and incorporate your
configuration into the final cluster design.

Storage
Storage
Often, storage servers will become the bottleneck in a highly optimized cluster. It's important to
remove this bottleneck because it reduces the utilization of the most expensive part of the
cluster: the compute. There are two ways to set up storage on your cluster: either work with a
partner or roll your own.

Sample Storage Architectures

Below, you’ll see a few common storage setups. Storage cluster design is highly dependent on
your use case. Some folks can get away with just a single NFS storage server with an NVMe
local cache for a very large cluster. Others require large scale parallel cluster file systems to
support their workload.
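
As an illustration of the simplest setup described above, the server side of a single-NFS-server
architecture can be a one-line export. This is a minimal sketch; the /datasets path and the
10.0.0.0/24 compute subnet are hypothetical:

# Hypothetical /etc/exports entry on the NFS server: share /datasets read-only
# with the compute subnet. "async" favors read throughput over strict durability.
echo '/datasets 10.0.0.0/24(ro,async,no_subtree_check)' | sudo tee -a /etc/exports
sudo exportfs -ra   # re-export everything listed in /etc/exports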

Regardless of the final architecture, Lambda has pre-existing OEM relationships with practically
every storage appliance provider in the world. Echelon has been designed to support both
proprietary and open source storage solutions. Because they can all be purchased through
Lambda, your procurement process is greatly simplified.
Proprietary Storage Options


Open Source Storage Options

TrueNAS Storage Management Dashboard


If you decide to use Lambda Matrix storage appliances, you'll have access to a TrueNAS
storage management dashboard. The dashboard offers complete administrative control and
visibility over your Lambda Matrix storage cluster. You can easily create and destroy network
attached storage volumes, spin up virtual machines on your storage devices, and manage
access control, all through an easy-to-use web interface.


Networking
Below, we walk through a network architecture and topology for an Echelon cluster designed for
large scale distributed training. For hyperparameter search and production inference clusters,
you’ll likely see a single unified 100 Gb/s ethernet fabric instead of the dual compute/storage
InfiniBand fabric seen here. An Echelon cluster supports many network configurations,
including:

● Sample Configuration 1 - Large Scale Training with Parallel File System
    ○ Compute fabric: 200 Gb/s HDR InfiniBand
    ○ Storage fabric for parallel file system: 200 Gb/s HDR InfiniBand
    ○ In-band management network: 10 Gb/s Ethernet
    ○ IPMI: 1 Gb/s Ethernet
● Sample Configuration 2 - Medium Scale Distributed Training with NFS Storage
    ○ Compute fabric: 200 Gb/s HDR InfiniBand
    ○ Converged storage, in-band: 100 Gb/s Ethernet
    ○ IPMI: 1 Gb/s Ethernet
● Sample Configuration 3 - Small Scale Training with NFS Cluster
    ○ Converged compute, storage, in-band: 10 Gb/s + 40 Gb/s Ethernet
    ○ IPMI: 1 Gb/s Ethernet
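
Whichever configuration you choose, negotiated link rates should be verified after bring-up. A
minimal sketch using the standard InfiniBand diagnostic tools (from the infiniband-diags
package) might look like:

# Show each HCA's state and negotiated rate; expect "Rate: 200" on an HDR fabric
ibstat | grep -E "CA '|State:|Rate:"
# Survey link state and width across the whole fabric from any node
iblinkinfo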


The InfiniBand compute fabric enables rapid node-to-node data transfers via GPU/InfiniBand
RDMA while the storage fabric enables rapid access to your data sets. Network bandwidth
requirements for training vary based on model (size, layer type), modality (text vs. image, etc.)
and training regime (model parallel vs. data parallel). Thus, optimal configurations can vary from
this design.

Data Center Power Distribution & Floor Planning


Before you can plan out your rack elevation, you’ll need to gather the following information:

1. Data center / rack cage floor plan
2. Data center maximum density per rack (kVA / rack)
3. Data center rack model number and data sheet with rack size (width/depth in mm, height
   in rack units)
4. Data center rack PDU model number and data sheet
5. Data center power setup (A/B or single source)


Our recommended racks are 800mm wide to accommodate extra deep servers and multiple
PDUs, and to prevent PDU power ports from being covered by installed servers.

Lambda HPC engineers can sit down with you and create a tailored data center deployment
plan. Lambda builds both full service and onsite deployed racks. Full service racks are racked,
stacked, labeled, and cabled, then shipped to you, ready to roll, in a rack crate. Onsite deployed
racks are racked, stacked, labeled, and cabled after individual components are shipped onsite.

Cluster Floor Planning


Lambda engineers leverage data center floor plans, data center infrastructure management
(DCIM) software, and CAD software to create an optimized layout for your cluster. We’re able to
calculate maximum cable lengths, optimize placement of director switches, and visualize
different cluster layouts for you.

Sample data center floor plan visualization with cable length calculations.


Rack Level Design


"The most effective debugging tool is still
careful thought, coupled with judiciously
placed print statements." —Brian Kernighan

Three Scalable Rack Elevations


In this section, we go over three sample rack elevations. The rack elevations are designed to
allow for easy incremental expansion. This allows Echelon to scale from a single rack with 40
NVIDIA GPUs all the way up to 800 GPUs and beyond.

Design constraints:

❏ Designed for large scale distributed training


❏ High density (>20kVA / rack)
❏ High availability (A/B redundant power feeds)
❏ Separate compute & storage fabrics
❏ High speed interconnect between the machines

We've targeted a rack TDP of ~28kW. This can be cooled with a DDC closed loop chiller rack,
a Rear Door Heat Exchanger from Motivair, or similar. Details on the DDC closed loop chiller
rack are included in the section titled "Closed loop chiller rack solutions".


Lambda Echelon Single Rack Elevation

The single rack cluster is designed with a TDP of 28kW. Power is provided by four 17.2kW
PDUs set up for A/B redundancy. In the table below, we show the bill of materials for the rack as
well as the TDP of each component.

Single Rack Cluster Bill of Materials


Description | Qty | TDP Draw (kW) | Ext. TDP (kW)
Lambda Hyperplane - 8x NVIDIA A100 GPU Server | 5 | 4.9 | 24.5
1U Management Node | 1 | 0.4 | 0.4
2U Flash Storage Server | 2 | 0.65 | 1.3
NVIDIA Networking SN2700 - 32 port - 100 Gb/s Ethernet - Storage + In-Band Switch | 1 | 0.35 | 0.35
NVIDIA Networking QM8700 - 40 port - 200 Gb/s InfiniBand - Compute Fabric Switch | 1 | 0.8 | 0.8
Edgecore AS4610-54T - 48 port - 1 Gb/s Ethernet - IPMI Switch | 1 | 0.1 | 0.1
NVIDIA Networking MFS1S00-H003E - 3m active 200 Gb/s HDR InfiniBand cables | 40 | 0 | 0
NVIDIA Networking MFA1A00-C003 - 3m active 100 Gb/s Ethernet cables | 8 | 0 | 0
Cat 5e Ethernet Cables - 1m/2m/3m | 15 | 0 | 0
Server Technology Switched PDU - 208V 3PH Delta - 6BR - 60A - 60309 3P+G | 4 | 0 | 0
1U Blanking Panel | 4 | 0 | 0
Eaton RSVNS4582W - 45U - 800mm x 1200mm Rack | 1 | 0 | 0
C14 to C13 cables (2 ft) | 16 | 0 | 0
C13 to C14 cables (6 ft) | 8 | 0 | 0

Rack TDP Draw (kW): 27.45

Network Notes

Each storage server connects to the SN2700 converged in-band + storage fabric with 1x
100GbE NIC.

Each compute server connects to the compute fabric with 8x HDR HCAs.

Each compute server connects to the converged in-band + storage fabric with 1x 100GbE NIC.

Each management node connects to the converged in-band + storage fabric with 1x 100GbE
NIC.

Everything (including PDUs) connects to the IPMI network with a Cat 5e cable.


Single Rack Design (40 NVIDIA GPUs)


Medium Scale Distributed Training with NFS Storage


This single rack provides a total of 40 NVIDIA A100 GPUs with a 200 Gb/s InfiniBand compute
fabric and a 100 Gb/s converged Ethernet in-band + storage fabric. It also offers 8U of
customizable space.

Four Rack Design (160 NVIDIA GPUs)


Large Scale Training with Parallel File System

The cluster above is designed with four separate network fabrics: 200 Gb/s HDR InfiniBand for
compute, 200 Gb/s HDR InfiniBand for storage, 10GBASE-T (RJ-45) for in-band management,
and 1 Gb/s Ethernet for IPMI. This cluster configuration offers 20 nodes and 160 GPUs. IPMI
switches are daisy-chained together with a single 1 Gb/s Ethernet cable.


22 Rack Design (800 NVIDIA GPUs)


Data Center Scale Training with Parallel File System


As we scale beyond four racks, a core networking rack becomes necessary, and an InfiniBand
director switch is recommended in this scenario for ease of management. The sample
compute and storage racks can be replicated as needed to obtain the necessary amount of
compute and storage capacity until all the ports are utilized in the director switch.

Photographed above is another Echelon cluster configured with one Hyperplane-16 with
100 Gb/s InfiniBand, two Hyperplane-8s, one Lambda Scalar, three storage nodes, and 10 Gb/s
Ethernet networking.

Rack Power Distribution


We offer two main power configurations based on your needs:

1. Zero redundancy single source power configuration.
2. 1+1 redundant power configuration with A/B power sources.

PDU Overview
Our default configuration uses 1+1 redundant power with a 60A 208V switched & metered PDU
from APC. However, where 415V power is available, rack densities exceeding 30kW become
possible. Alternative power distribution configurations are available that can meet a wide variety
of voltage and density constraints.
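
For reference, the usable capacity of a 60A 208V three-phase PDU follows from the standard
three-phase power formula, assuming the NEC 80% continuous-load derate:

P = √3 × 208 V × (60 A × 0.8) ≈ 17.3 kW per PDU

which lines up with the 17.2kW PDU rating used in the single rack bill of materials.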

Common PDU Input Plugs


Below you’ll see three of the most common PDU input plugs. Cluster design will take into
account available receptacles in the data center. If receptacles are not yet installed, Lambda
can work with your data center on a receptacle installation.

IEC 60309 60A 3-phase plug (208V) | IEC 60309 60A 3-phase plug (415V) | NEMA L15-30P 30A 3-phase plug (208V)
Common Server to PDU Plugs
Essentially all servers & PDUs around the world use one of these four IEC plug / receptacle
pairs. It’s important to familiarize yourself with these plugs and receptacles.
Plug | Plugs Into | Max Amps
IEC C13 Plug | IEC C14 Receptacle (on server) | 15A (max power ~2.5kW)
IEC C14 Plug | IEC C13 Receptacle (on PDU) | 15A (max power ~2.5kW)
IEC C19 Plug | IEC C20 Receptacle (on server) | 20A (max power ~3.2kW)
IEC C20 Plug | IEC C19 Receptacle (on PDU) | 20A (max power ~3.2kW)

IEC stands for International Electrotechnical Commission, an international standards organization headquartered in Switzerland.

Thermal Design
The standard Echelon rack is designed with 28kW peak TDP. A 28kW power draw converts to
95,538.8 BTU/hr of cooling. Three methods can be used to dissipate this heat:

1. Hot aisle containment cooling with in-row cooling systems.
2. Airflow cooling plus a rear door heat exchanger (RDHx).
3. Closed loop chiller rack system such as DDC.
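
The BTU/hr figure above is the standard kilowatt conversion (1 kW ≈ 3,412.1 BTU/hr):

28 kW × 3,412.1 BTU/hr per kW = 95,538.8 BTU/hr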

The ASHRAE data center guidelines are useful for determining whether the system is being
sufficiently cooled. ASHRAE thermal guidelines allow for a dry bulb temperature (a thermometer
exposed to air) of over 80 degrees Fahrenheit, depending on the relative humidity. For more
information, see this ASHRAE Thermal Guidelines presentation from Berkeley Lab (one of our
customers).
Closed loop chiller rack solutions
ScaleMatrix DDC racks can support workloads up to 52kW in a 45U rack form factor. DDC rack
cooling is provided by a closed loop water cooling system with an air-to-water heat exchanger.
The IT equipment installed in the DDC rack does not need direct-to-chip cooling to take
advantage of the cooling system, allowing the use of standard air-cooled servers.


Inside the top portion of the DDC rack is an air-to-water heat exchanger capable of dissipating
52kW of power into the chilled water.
DDC Rack Technical Specifications

DDC racks come in three models which support 17, 34, and 52kW of cooling within 45U.

Rack Level Integration - Rack, Stack, Label, and Cable


Choice of network cables can have a major effect on the ease of the cabling process. When
building high speed network fabrics with InfiniBand or Ethernet, we strongly suggest using active
fiber cabling. Direct attach copper cabling is difficult to work with, especially when building large
internal network topologies for a fat-tree switch setup. Active fiber cables result in cleaner,
easier-to-maintain builds.

If you’re able to accept shipment of an entire rack, Lambda is able to do the full rack, stack,
label and cabling process on our manufacturing floor.


Node Level Design


“The most reliable components are the ones you
leave out.” —Gordon Bell

We actually start the process of designing the node before we start the rack elevation (because
the node choice determines the TDP of the rack). Node design boils down to components:
GPUs, CPUs, memory, storage, networking interfaces, and motherboard (PCIe topology). Let’s
go over some of those choices here.

GPU Selection
A quick guide to enterprise data center GPU selection is given below. There are two main
factors to consider: FLOPS / $ and the amount of vRAM in the GPU. Some models, like
BERT-Large, require more than 24 GB of GPU vRAM to hit a batch size of 1, meaning a batch
size of 2 will require either an NVIDIA RTX A6000 or an A100 80 GB GPU.
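
As a quick sanity check during node bring-up, each GPU's total memory can be confirmed with
a standard nvidia-smi query:

# List every GPU and its total memory; useful when validating a node's configuration
nvidia-smi --query-gpu=name,memory.total --format=csv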

 | NVIDIA RTX A6000 | NVIDIA A100 80 GB | NVIDIA A10
Use case | Hyperparameter search or applications requiring more vRAM | Large scale distributed training | Production inference
Max vRAM | 48 GB | 80 GB | 24 GB

CPU Selection
CPU selection comes down to both the server platform you want to use and the features and
performance of the actual CPU. AMD EPYC and Intel Ice Lake CPUs are the only mass market
CPUs available that support PCIe Gen 4. In addition, due to the large number of PCIe lanes
provided by the AMD EPYC series, server motherboards don't need as many expensive PCIe
switches.
System Block Diagram
Below you'll see the system block diagrams for both the Lambda Hyperplane-8 NVIDIA A100
GPU server and the Lambda Scalar GPU server. These block diagrams specify the maximum
RAM and add-in cards these systems support. They also describe the PCIe topology of the
system, which is an important consideration if you plan to use GPUDirect RDMA or RDMA over
Converged Ethernet (RoCE).
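
You can inspect the realized PCIe/NVLink topology of a running node with nvidia-smi; the
matrix shows whether GPU pairs and NICs communicate over NVLink (NV#), a shared PCIe
switch (PIX), or must traverse the CPU (PHB/SYS), which directly affects GPUDirect
performance:

# Print the GPU/NIC interconnect matrix for this node
nvidia-smi topo -m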

Lambda Hyperplane-8 NVIDIA A100 GPU server

Lambda Scalar GPU server


NVMe NFS Cache


Each node utilizes an NVMe NFS cache, which can dramatically reduce the load on your storage
cluster. NFS caches provide fast local storage of frequently used data from an NFS server; the
caching mechanism is provided by the Linux package cachefilesd. NVMe NFS caching can
dramatically increase your training throughput and speed up your time to result.
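
A minimal client-side setup sketch is shown below, assuming Ubuntu, an NVMe filesystem
mounted at /mnt/nvme, and a hypothetical NFS server named storage-01:

# Install and enable the cache daemon
sudo apt-get install -y cachefilesd
sudo sed -i 's/^#\?RUN=.*/RUN=yes/' /etc/default/cachefilesd
# Point the cache at the local NVMe filesystem
sudo sed -i 's|^dir .*|dir /mnt/nvme/fscache|' /etc/cachefilesd.conf
sudo systemctl restart cachefilesd
# Mount the NFS share with the "fsc" option so reads are cached on NVMe
sudo mount -t nfs -o fsc storage-01:/datasets /mnt/datasets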


Software
“People who are really serious about software
should make their own hardware.” -Alan Kay

The software you run on your cluster must inform the hardware decisions, all the way down to
the components in the nodes. Conversely, the hardware decisions also provide constraints and
unlock new features that can be leveraged in the software. Lambda maintains Lambda Stack,
which installs GPU-accelerated versions of PyTorch, TensorFlow, NVIDIA drivers, CUDA, and
cuDNN. Most importantly, it provides a managed upgrade path for all of this software.

Lambda Stack

Managed Upgrade Path for your AI software


Lambda Stack automatically manages your team’s software upgrade path to reduce system
administration overhead and downtime. With one command, you can update all of your
software, from drivers to frameworks, with all of the dependencies automatically resolved:

sudo apt-get update && sudo apt-get dist-upgrade

You can learn more about Lambda Stack at the Lambda Stack homepage.

Lambda Stack System Wide Installation


Lambda Stack makes Deep Learning frameworks available system wide, allowing every user on
the system to access PyTorch, TensorFlow, Keras, and more.
Compatible with your Dockerfiles & NGC Docker Containers

Lambda Stack offers GPU-accelerated Docker packages and nvidia-container-toolkit so you
can run your own GPU-accelerated Dockerfiles as well as Docker containers from NVIDIA's
NGC Deep Learning container registry. For more information on our open source Dockerfiles,
see our git repository: https://github.com/lambdal/lambda-stack-dockerfiles
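
For example, a container from NGC can be launched with all GPUs visible. The 22.07-py3 tag
below is illustrative; pick a current tag from the NGC catalog:

# Run an NGC PyTorch container and confirm it sees all of the node's GPUs
sudo docker run --gpus all --rm nvcr.io/nvidia/pytorch:22.07-py3 \
    python -c "import torch; print(torch.cuda.device_count())"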
GPUDirect Technology

Our nodes are designed to leverage GPUDirect® technologies: GPUDirect RDMA and
GPUDirect Storage, preventing wasteful double-copy transfers to CPU memory. GPUDirect
RDMA transfers data directly from the GPU memory out the door to the InfiniBand HCA.
GPUDirect Storage directly transfers data stored on a local NVMe drive into the GPU’s memory.
GPUDirect RDMA can effectively double the node-to-node transfer bandwidth and is very useful
for distributed training tasks.
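
The sketch below shows NCCL environment variables commonly used to confirm that
GPUDirect RDMA is active during a training run; the values shown are illustrative and depend
on your HCAs and PCIe topology:

export NCCL_DEBUG=INFO          # NCCL logs which transport each channel uses
export NCCL_IB_HCA=mlx5         # bind NCCL to the ConnectX InfiniBand HCAs
export NCCL_NET_GDR_LEVEL=SYS   # permit GPUDirect RDMA at any PCIe distance
# Paths using GPUDirect RDMA appear in the NCCL INFO logs tagged "GDRDMA"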


The diagrams above show the data pathways of both GPUDirect RDMA and GPUDirect
Storage. Note that GPUDirect RDMA is for InfiniBand, while RDMA over Converged Ethernet
(RoCE) is for Ethernet.

Cluster Management Software

Tools like Bright Cluster Manager can simplify HPC cluster management. Bright can install the
following software across the cluster using simple wizards:

● Provisioning: bare metal provisioning of Ubuntu, CentOS, and RHEL
● Kubernetes: manage instances and execute containers on them
● Job scheduler: SLURM, Kubeflow (see the sample batch script below)
● Logging and monitoring: Prometheus, Grafana, resource utilization, etc.
● Notebooks: JupyterLab or IPython notebooks
● Containers: Docker as well as Singularity
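
As a concrete example of the job scheduler layer, a SLURM batch script for a two-node,
8-GPU-per-node training job might look like the following sketch; the job name, time limit, and
train.py script are illustrative:

#!/bin/bash
#SBATCH --job-name=ddp-train
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8   # one task per GPU
#SBATCH --gres=gpu:8          # request all 8 GPUs on each node
#SBATCH --time=24:00:00
srun python train.py

Submitting this with sbatch lets SLURM place the 16 tasks across the cluster and manage the
job queue.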
MLOps Platforms

Lambda has partnered with MLOps providers to help build, deploy and scale your models in a
GUI environment. Tools such as Cnvrg.io and Determined.AI allow you to train your models on
top of a managed Kubernetes cluster. This allows for easy expansion and migration between
on-prem and public cloud.


MLOps platforms like Cnvrg.io and Determined.AI provide features like:

● Hyperparameter tuning
● Deep Learning training & inference job scheduling
● Model & data versioning
● Experiment management & repeatability
● Hybrid & multi-cloud support
● Resource monitoring & optimization
● In-browser JupyterLab or RStudio data science IDEs

Support
“I wanna wake up! Tech support! It's a nightmare!
Tech support! Tech support!” —Tom Cruise in
Vanilla Sky

Having successfully deployed thousands of nodes, Lambda’s team consists of seasoned HPC
experts. We’re able to provide your company with both remote and onsite support. Echelon
comes with one of three tiers of support:


Silver
Silver support is designed for advanced users who are self-sufficient in supporting an Echelon
environment themselves but prefer to augment support for hardware issues at the node level.
Silver support covers email-only technical support. With Silver support, you'll have frontline
access to our online knowledge base, which includes software/firmware patches, technical
resources, critical advisories, and FAQs. Silver support includes a standard 30 day RMA policy
and is a hardware-only support package. The typical response time for the Silver plan is one
business day. Customers looking for cluster level support should utilize a Gold or
Platinum support plan.

Gold
Gold support is our most popular support plan as it provides end-to-end support for your entire
Echelon cluster over phone and email. Unlike other cluster solutions, our support team consists
of Linux and Deep Learning Engineers that understand your use case and software stack. In the
event of a technical issue, you will have full access to our support team regardless of whether
that issue is occurring at the hardware, software, or cluster level. The Gold support plan covers
support tickets and cluster performance tickets related to hardware and pre-installed software
(Lambda Stack, drivers, OFED). In the event of a hardware failure, you'll be covered by our
advanced parts replacement warranty, with no need to wait for a full RMA process with the
component manufacturer. A private onboarding Slack channel is provided for each of our Gold
customers to ensure a smooth start-up & hand-off. Gold support ticket response times will be
within 8 hours during normal business hours.

Platinum
Platinum support provides all features of Gold plus an onsite support package that covers every
aspect of your cluster with rapid response SLAs. With included white glove installation, Platinum
support plans start being valuable the day you turn your new cluster on. With Platinum support,
you’ll have direct access to a Lambda Technical Account Manager (TAM) as an engineering
resource to support you along the way. In the event that onsite support is required, a Lambda
employee will be there to assist you with your ticket. The Platinum support plan is designed for
organizations that are hosting mission critical applications that require expedited response times
with white glove treatment. Platinum support covers your cluster end-to-end with both phone
and email support as well as next business day advanced parts replacement. Platinum support
tickets have the highest priority level with response times within 4 hours during normal business
hours.


Advanced Parts Replacement

For support packages that include advanced parts replacement, you’ll receive replacement
parts directly through Lambda before returning the component.
Support Tier Matrix

Feature | Silver | Gold | Platinum
Support Contact | Email | Phone (8AM - 6PM M-F PT) + Email | Phone (8AM - 6PM M-F PT) + Email 24/7
RMA Process | 30 Day RMA Policy | Advanced Parts Replacement (based on availability) | Next Business Day Advanced Parts Replacement (based on availability)
Parts Depot Available | ✘ | Yes - additional charges apply | Yes - additional charges apply
Business Hours Response Time | 1 Business Day | 8 Hours | 4 Hours
Dedicated Support Email & TAM | ✘ | ✘ | ✔
Onsite Services | ✘ | ✘ | Yes - onsite services included
Onboarding Services | ✘ | Dedicated Slack channel for the first month + 2 hours of onboarding call(s) for cluster | Dedicated Slack channel for the first month + 4 hours of onboarding call(s) for cluster and distributed training
White Glove Installation Included | ✘ | ✘ | ✔
Coverage | Hardware | Hardware and Cluster | Hardware, Software, and Cluster
NVIDIA Networking Support Pass-Thru* | Bronze | Silver | Silver or Gold
Software Response | ✘ | ✘ | 4 Hours
Media Retention Policy Available | ✘ | Yes - additional charges apply | Yes - additional charges apply

About Lambda
Lambda (https://lambdalabs.com) provides GPU accelerated workstations and servers to the
top AI research labs in the world. Our hardware and software are used by AI researchers at
Apple, Intel, Microsoft, Tencent, Stanford, Berkeley, University of Toronto, Los Alamos National
Labs, and many others.

Lambda Quick Facts:

● Original research published in NeurIPS, ECCV, CVPR, SIGGRAPH, and SPIE.
● Used at 47 of the top 50 universities in the United States.
● Used at the world's largest companies.
● Vertically integrated hardware, software, and GPU cloud.

Our Customers
Lambda works with the world’s top companies and universities. We have provided GPU
systems for Machine Learning researchers at the following institutions:


Appendix A: Use Case Descriptions


Hyperparameter Search / Neural Architecture Search
Hyperparameter search, AKA neural architecture search, asks the question: which neural
network architecture (size of layers, choice of layers, training method) will result in the highest
accuracy for my data set?

During a hyperparameter search, network architectures A, B, and C will be trained and have
their performance evaluated. This can be done by a single machine, three separate machines,
or even an entire fleet of machines. It's an "embarrassingly parallel" problem which doesn't
require any kind of node-to-node communication; thus, a hyperparameter search cluster can be
designed to leverage less expensive, lower bandwidth network fabrics. In many ways, the
hyperparameter search cluster is quite similar to the production inference cluster; they're both
embarrassingly parallel workloads that focus on throughput.
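
Because each trial is independent, scheduler-level orchestration is often all a sweep needs. The
sketch below submits one single-GPU SLURM job per hyperparameter setting; the train.py
script and the specific values are illustrative:

# The trials never communicate, so no high-bandwidth fabric is required
for lr in 3e-4 1e-4 3e-5; do
  for bs in 64 128; do
    sbatch --gres=gpu:1 --wrap="python train.py --lr=$lr --batch-size=$bs"
  done
done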

Large Scale Distributed Training


Unlike hyperparameter search which focuses on training many models to find the best
architecture, large scale distributed training focuses on training a single architecture rapidly. This
is usually achieved by dramatically increasing training batch size (and correspondingly
increasing learning rates). See Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour for
a good overview of this technique.

Distributed training jobs follow three steps:


1. A single model is shared between hundreds or thousands of servers; each server is
   given a minibatch, a small portion of a very large overall batch of data. Those servers
   then compute the gradient of the loss function for that model using their minibatch.
2. These gradients are then sent to either a centralized parameter server for averaging or
   are averaged using a distributed technique like an allreduce operation.
3. The model is updated and re-distributed throughout the cluster.

For a good introduction to how distributed training works in practice, see: A Gentle Introduction
to Multi GPU and Multi Node Distributed Training.
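
As a sketch of what launching such a job looks like in practice, a PyTorch DDP run across four
8-GPU nodes can be started with torchrun; the rendezvous host and train.py script are
illustrative, and NCCL performs the allreduce gradient averaging described above:

# Run the same command on every node, pointing at a common rendezvous host
torchrun --nnodes=4 --nproc_per_node=8 \
    --rdzv_backend=c10d --rdzv_endpoint=node-01:29500 \
    train.py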

As you can see, distributed training requires far more node-to-node communication than either
production inference or hyperparameter search. Thus, Echelon clusters for distributed training
are designed with dedicated InfiniBand compute and storage fabrics supporting up to 1,600
Gb/s of peak node-to-node bandwidth.

Production Inference
Production inference is deploying your model at scale. It requires high throughput / $, high
uptime and availability, and robustness to individual machine outages. Production inference is
often the only workload that has a user waiting at the other end of a screen for a result, and it’s
expected to arrive soon.


A cluster designed for production inference will almost always be designed with both A/B
redundant power and no single point of failure throughout the hardware. This means you’ll see
dual switches, dual NICs, and dual power supplies.


Lambda Echelon, let’s build one for you.

If you’re ready to build a Lambda Echelon for your team, please get in touch with us by
emailing [email protected] or calling 866-711-2025.
