DGX A100 System Architecture Whitepaper
The Universal System for AI Infrastructure
Contents
1 Introduction
2 System Architecture
3 NVIDIA A100 GPU - 8th Generation Data Center GPU for the Age of Elastic Computing
  3.1 Third-Generation Tensor Cores
  3.2 TensorFloat-32 (TF32) Uses Tensor Cores by Default
  3.3 Fine-grained Structured Sparsity
  3.4 Multi-Instance GPU (MIG)
4 Third-Generation NVLink and NVSwitch to Accelerate Large Complex Workloads
5 Highest Networking Throughput with Mellanox ConnectX-6
6 First Accelerated System With All PCIe Gen4
7 Security
  7.1 Self-Encrypted Drives
  7.2 Trusted Platform Module (TPM) Technology
8 Fully Optimized DGX Software Stack
9 Game-Changing Performance
10 Breaking AI Performance Records for MLPerf v0.7 Training
11 Direct Access to NVIDIA DGXperts
12 Summary
13 Appendix: Graph Details
  13.1 Details for Figure 7: Inference Throughput with MIG
  13.2 Details for Figure 12: DGX A100 AI Training and Inference Performance
To help organizations overcome these obstacles and succeed in a world that desperately needs the power of AI to solve big challenges, NVIDIA designed the world's first family of systems purpose-built for AI: NVIDIA DGX™ systems. By leveraging powerful NVIDIA GPUs, designing from the ground up for multi-GPU and multi-node deployments with the DGX POD™ and DGX SuperPOD™ reference architectures, and using optimized AI software from NVIDIA NGC™, DGX systems deliver unprecedented performance and scalability while eliminating integration complexity.
Built on the new NVIDIA A100 Tensor Core GPU, NVIDIA DGX™ A100 is the third generation of DGX systems. Featuring 5 petaFLOPS of AI performance, DGX A100 excels at all AI workloads: analytics, training, and inference. Organizations can standardize on a single system that speeds through any type of AI task and dynamically adjusts to changing compute needs over time. And with the fastest I/O architecture of any DGX system, NVIDIA DGX A100 is the foundational building block for large AI clusters such as NVIDIA DGX SuperPOD, the enterprise blueprint for AI infrastructure that scales to hundreds or thousands of nodes to meet the biggest challenges. This unmatched flexibility reduces costs, increases scalability, and makes DGX A100 the universal system for AI infrastructure.
In this white paper, we’ll take a look at the design and architecture of DGX A100.
The NVIDIA A100 GPU includes the following new features to further accelerate AI workloads and HPC applications:
• Third-generation Tensor Cores
• Fine-grained Structured Sparsity
• Multi-Instance GPU
The first-generation Tensor Cores used in the NVIDIA DGX-1 with NVIDIA V100 accelerated mixed-precision matrix multiply in FP16 and FP32. The third-generation Tensor Cores in the DGX A100 use larger matrix sizes, improving efficiency and delivering twice the performance of the NVIDIA V100 Tensor Cores, along with improved performance for INT4 and binary data types. The A100 Tensor Core GPU also adds support for new data types, including TensorFloat-32 (TF32), described next.
The new TensorFloat-32 (TF32) operation performs calculations using an 8-bit exponent (same range as FP32), a 10-bit mantissa (same precision as FP16), and 1 sign bit (Figure 3). In this way, TF32 combines the range of FP32 with the precision of FP16. After the calculations are performed, a standard FP32 output is generated.
Non-Tensor operations can use the FP32 data path, allowing the NVIDIA A100 to provide TF32-
accelerated math along with FP32 data movement.
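As a minimal sketch of what this looks like in practice (assuming PyTorch 1.7 or later, where TF32 is exposed through the public torch.backends flags, and an A100 GPU), the following compares a TF32 matrix multiply against a full-FP32 one; inputs and outputs are ordinary FP32 tensors in both cases:

```python
import torch

# Hedged sketch, assuming PyTorch >= 1.7 and an A100 GPU: toggle TF32 for
# matrix multiplies and compare against full-FP32 results.
a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")

torch.backends.cuda.matmul.allow_tf32 = True   # Tensor Core TF32 math
c_tf32 = a @ b

torch.backends.cuda.matmul.allow_tf32 = False  # plain FP32 CUDA-core math
c_fp32 = a @ b

# Inputs and outputs are standard FP32 tensors either way; only the
# internal multiplication precision differs.
print((c_tf32 - c_fp32).abs().max())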
Figure 4. TF32 provides over 5X speedup compared to FP32. PyTorch 1.6 in the NGC pytorch:20.06-py3 container, training the BERT-Large model on DGX A100 (8x A100 GPUs). All model scripts can be found in the Deep Learning Examples repository.
With fine-grained structured sparsity and the 2:4 pattern supported by A100 (Figure 6), each node in a sparse network performs the same number of memory accesses and computations, which results in a balanced workload distribution and even utilization of compute nodes. Additionally, structured sparse matrices can be efficiently compressed, and their structure leads to doubled throughput of matrix multiply-accumulate operations with hardware support in the form of Sparse Tensor Cores.
The result is accelerated Tensor Core computation across a variety of AI networks and increased inference performance. With fine-grained structured sparsity, INT8 Tensor Core operations on A100 offer 20X more performance than on V100, and FP16 Tensor Core operations are 5X faster than on V100.
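To make the 2:4 pattern concrete, here is an illustrative sketch, not NVIDIA's production tooling (in practice the ASP library and TensorRT handle pruning and deployment), that keeps the two largest-magnitude weights in every contiguous group of four and zeroes the rest:

```python
import torch

# Illustrative sketch of the 2:4 fine-grained structured sparsity pattern:
# in every contiguous group of 4 weights, keep the 2 largest magnitudes
# and zero the other 2.
def prune_2_to_4(w: torch.Tensor) -> torch.Tensor:
    flat = w.reshape(-1, 4)                     # groups of 4 weights
    idx = flat.abs().topk(2, dim=1).indices     # 2 largest per group
    mask = torch.zeros_like(flat).scatter_(1, idx, 1.0)
    return (flat * mask).reshape(w.shape)

w = torch.randn(8, 16)
print(prune_2_to_4(w))   # exactly 2 nonzeros in every group of 4
```

Because the nonzero positions follow a fixed pattern, the matrix can be stored compressed (values plus small per-group indices), which is what lets the Sparse Tensor Cores double multiply-accumulate throughput.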
On an NVIDIA A100 GPU with MIG enabled, parallel compute workloads can access isolated GPU memory and physical GPU resources, as each GPU instance has its own dedicated memory, cache, and streaming multiprocessors. This allows multiple users to share the same GPU and run all instances simultaneously, maximizing GPU efficiency.
MIG can be enabled selectively on any number of GPUs in the DGX A100 system; not all GPUs need to be MIG-enabled. If all GPUs in a DGX A100 system are MIG-enabled, however, up to 56 users (8 GPUs × 7 instances each) can simultaneously and independently take advantage of GPU acceleration.
Taking it further, on a DGX A100 with 8 A100 GPUs, users can configure different GPUs for vastly different workloads, as shown in the following example (Figure 8):
• 4 GPUs for AI training
• 2 GPUs for HPC or data analytics
• 2 GPUs in MIG mode, partitioned into 14 MIG instances, each one running inference
1. Refer to "Details for Figure 7: Inference Throughput with MIG" in the appendix for more details.
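As a hedged sketch of how the MIG portion of such a partition might be created (wrapped in Python here for illustration; profile ID 19 corresponds to the 1g.5gb profile on the A100-40GB, which you can confirm with `nvidia-smi mig -lgip`, and the commands require root privileges on an otherwise idle GPU):

```python
import subprocess

# Hedged sketch: enable MIG on GPU 0 and carve it into seven 1g.5gb
# instances using nvidia-smi.
def run(cmd):
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["nvidia-smi", "-i", "0", "-mig", "1"])    # enable MIG mode on GPU 0
run(["nvidia-smi", "mig", "-i", "0",
     "-cgi", ",".join(["19"] * 7),             # seven 1g.5gb GPU instances
     "-C"])                                    # plus default compute instances
run(["nvidia-smi", "-L"])                      # list the resulting MIG devices
```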
MIG supports a variety of deployment options, allowing users to run CUDA applications on bare metal, in containers, or scaled out with the Kubernetes container management platform. MIG support is available using the NVIDIA Container Toolkit (previously known as nvidia-docker2) for Docker, allowing users to run CUDA-accelerated containers on GPU instances. More information is available in the NVIDIA Container Toolkit documentation.
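For example (a sketch assuming Docker and the NVIDIA Container Toolkit are installed), a CUDA container can be pinned to a single MIG instance using the toolkit's `<gpu-index>:<mig-index>` device enumeration:

```python
import subprocess

# Hedged sketch: run a CUDA container on the first MIG instance of GPU 0.
subprocess.run([
    "docker", "run", "--rm",
    "--gpus", "device=0:0",        # GPU 0, MIG instance 0
    "nvidia/cuda:11.0-base",       # any CUDA 11 base image works here
    "nvidia-smi", "-L",            # shows only the assigned MIG device
], check=True)
```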
Under Kubernetes, GPU support has traditionally been achieved using the Device Plugin API and
the NVIDIA Device Plugin. The NVIDIA device plugin for Kubernetes is a Daemonset that allows
GPUs to be advertised on each of the nodes in the cluster and users to request devices (GPUs) in
their job specification. The NVIDIA device plugin has been extended to enumerate MIG devices, so that these resources can be requested just like regular GPUs. NVIDIA also provides Helm
charts for easily deploying the device plugin into a Kubernetes cluster. The GitHub repo includes
information on getting started.
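As a hedged sketch of what such a request looks like, the following builds a pod manifest asking for one 1g.5gb MIG slice. The resource name nvidia.com/mig-1g.5gb assumes the device plugin is configured with the "mixed" MIG strategy; under the "single" strategy, MIG devices are instead requested as plain nvidia.com/gpu resources:

```python
import yaml  # PyYAML

# Hedged sketch of a pod spec requesting one 1g.5gb MIG slice through the
# NVIDIA device plugin ("mixed" MIG strategy assumed).
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "mig-example"},
    "spec": {
        "restartPolicy": "Never",
        "containers": [{
            "name": "cuda",
            "image": "nvidia/cuda:11.0-base",
            "command": ["nvidia-smi", "-L"],
            "resources": {"limits": {"nvidia.com/mig-1g.5gb": 1}},
        }],
    },
}
print(yaml.safe_dump(pod, sort_keys=False))  # pipe into `kubectl apply -f -`
```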
The second-generation NVSwitch (Figure 9) is two times faster than the previous version, which was first introduced in the NVIDIA DGX-2 system. The combination of six NVSwitches and third-generation NVLink enables GPU-to-GPU communication to peak at 600 GB/s, which means that if all GPUs are communicating with each other, the total data transferred peaks at 4.8 TB/s (8 GPUs × 600 GB/s), counting both directions.
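A rough way to observe this path (a micro-benchmark sketch assuming PyTorch and at least two GPUs; real measurements would use the NCCL tests or the p2pBandwidthLatencyTest CUDA sample) is to time a device-to-device tensor copy, which on DGX A100 travels over NVLink/NVSwitch:

```python
import time
import torch

# Hedged sketch: time a ~1 GiB device-to-device copy between two GPUs.
src = torch.randn(1 << 28, device="cuda:0")   # 2^28 FP32 values ~ 1 GiB
dst = torch.empty_like(src, device="cuda:1")
dst.copy_(src)                                # warm-up transfer
for d in (0, 1):
    torch.cuda.synchronize(d)

t0 = time.time()
dst.copy_(src)
for d in (0, 1):
    torch.cuda.synchronize(d)

gib = src.numel() * 4 / 2**30
print(f"{gib / (time.time() - t0):.1f} GiB/s")
```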
The most common methods of moving data to and from the GPU involve leveraging the on-board storage and using the Mellanox ConnectX-6 network adapters through Remote Direct Memory Access (RDMA). The DGX A100 incorporates a one-to-one relationship between the I/O cards and the GPUs, which means each GPU can communicate directly with external sources without blocking other GPUs' access to the network.
The Mellanox ConnectX-6 I/O cards offer flexible connectivity as they can be configured as HDR
InfiniBand or 200Gb/s Ethernet. This allows the NVIDIA DGX A100 to be clustered with other
nodes to run HPC and AI workloads using low latency, high bandwidth InfiniBand, or RDMA over
Converged Ethernet (RoCE).
The DGX A100 includes an additional dual-port ConnectX-6 card that can be used for high-speed
connection to external storage. The flexibility in I/O configuration also allows connectivity to a
variety of high-speed networked storage options.
Figure 10. Mellanox Single-port ConnectX-6 VPI card for highest network throughput
The latest DGX A100 multi-system clusters use a network based on a fat tree topology using
advanced Mellanox adaptive routing and SHARP collective technologies to provide well-routed,
predictable, contention-free communication from each system to every other system. A fat tree is
a tree-structured network topology with systems at the leaves that connect up through multiple
switch levels to a central top-level switch. Each level in a fat tree has the same number of links
providing equal non-blocking bandwidth. The fat tree topology ensures the highest
communication bisection bandwidth and lowest latency for all-to-all or all-gather type collectives
that are common in computational and deep learning applications.
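As an illustration of the "equal links per level" property (a sketch with assumed parameters, not the DGX SuperPOD bill of materials), a two-level non-blocking fat tree built from R-port switches dedicates half of each leaf switch's ports to hosts and half to spine uplinks, supporting R²/2 hosts at full bisection bandwidth:

```python
# Sketch: host count and bisection bandwidth of a two-level non-blocking
# fat tree built from R-port switches (assumed parameters, for illustration).
def fat_tree_two_level(radix: int, link_gbps: float = 200.0):
    hosts_per_leaf = radix // 2        # half of each leaf's ports face hosts
    leaves = radix                     # R/2 spines, one port per leaf each
    hosts = hosts_per_leaf * leaves    # R^2 / 2 hosts total
    bisection_gbps = hosts * link_gbps / 2
    return hosts, bisection_gbps

# e.g. 40-port HDR switches with 200 Gb/s links:
print(fat_tree_two_level(40))          # -> (800, 80000.0), i.e. 80 Tb/s
```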
With the fastest I/O architecture of any DGX system, NVIDIA DGX A100 is ideally suited for large AI
clusters such as NVIDIA DGX SuperPOD.
Training workloads commonly involve reading the same datasets many times to improve accuracy.
Rather than use up all the network bandwidth to transfer this data over and over, high
performance local storage is implemented with NVMe drives to cache this data. This increases
the speed at which the data is read into memory, and it also reduces network and storage system
congestion.
Each DGX A100 system comes with dual 1.92 TB NVMe M.2 boot OS SSDs configured in a RAID 1 volume, and four 3.84 TB PCIe Gen4 NVMe U.2 cache SSDs configured in a RAID 0 volume. The base RAID 0 volume has a total capacity of 15 TB (4 × 3.84 TB), and an additional four SSDs can be added to the system for a total capacity of 30 TB. These drives use CacheFS to increase the speed at which workloads access data and to reduce network data transfers.
The AMD EPYC 7742 processor offers the highest performance for HPC and AI workloads, as demonstrated by numerous world records and benchmarks. The DGX A100 system comes with two of these CPUs for boot, storage management, and deep learning framework scheduling and coordination. Each CPU has 64 cores with 2 threads per core and runs at a maximum boost clock of 3.4 GHz.
The CPUs provide extensive memory capacity and bandwidth. Each has 8 memory channels, for an aggregate of 204.8 GB/s of memory bandwidth per CPU (8 channels × 25.6 GB/s for DDR4-3200). Memory capacity on the DGX A100 is 1 TB standard with 16 DIMM slots populated, expandable to 2 TB with all 32 DIMM slots populated.
Similar to previous DGX systems, the DGX A100 is designed to be air-cooled in a data center with an operating temperature ranging from 5°C to 30°C.
7 Security
The NVIDIA DGX A100 system supports self-encrypted drives and Trusted Platform Module (TPM)
technology for added security.
When enabled, the TPM ensures the integrity of the boot process until the DGX OS has fully booted and applications are running.
The TPM is also used with the self-encrypting drives and the drive encryption tools for secure
storage of the vault and SED authentication keys.
2. See the Trusted Platform Module white paper from the Trusted Computing Group: https://trustedcomputinggroup.org/resource/trusted-platform-module-tpm-summary/
The NGC Private Registry provides GPU-optimized containers for deep learning (DL), machine
learning (ML), and high performance computing (HPC) applications, along with pretrained
models, model scripts, Helm charts, and software development kits (SDKs). This software has
been developed, tested, and tuned on DGX systems, and is compatible with all DGX products:
DGX-1, DGX-2, DGX Station, and DGX A100. The NGC Private Registry also provides a secure
space for storing custom containers, models, model scripts, and Helm charts that can be shared
with others within the enterprise. Learn more about the NGC Private Registry on the NVIDIA Developer Blog.
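For example (a sketch assuming Docker and the NVIDIA Container Toolkit are installed; the image tag is the one cited in the Figure 4 results), pulling and launching an NGC PyTorch container looks like:

```python
import subprocess

# Hedged sketch: pull and run the NGC PyTorch container referenced in
# Figure 4, with all GPUs visible to the container.
image = "nvcr.io/nvidia/pytorch:20.06-py3"
subprocess.run(["docker", "pull", image], check=True)
subprocess.run([
    "docker", "run", "--rm", "--gpus", "all",
    "--ipc=host",   # recommended for PyTorch data-loader workers
    image, "python", "-c", "import torch; print(torch.cuda.device_count())",
], check=True)
```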
Figure 11 shows how all these pieces fit together as part of the DGX software stack.
Read more about what's new in the "CUDA 11 Features Revealed" post on the NVIDIA Developer Blog.
Figure 12. DGX A100 delivers unprecedented AI performance for training and inference³
The combination of the groundbreaking A100 GPUs with massive computing power and high-
bandwidth access to large DRAM, and fast interconnect technologies, makes the NVIDIA DGX
A100 system optimal for dramatically accelerating complex networks like BERT.
A single DGX A100 system features 5 petaFLOPS of AI computing capability to process complex
models. The large model size of BERT requires a huge amount of memory, and each DGX A100
provides 320 GB of high bandwidth GPU memory. NVIDIA interconnect technologies like NVLink,
NVSwitch and Mellanox networking bring all GPUs together to work as one on large AI models
with high-bandwidth communication for efficient scaling.
3. Refer to "Details for Figure 12: DGX A100 AI Training and Inference Performance" in the appendix for additional information.
Details:
Per-chip performance arrived at by comparing performance at the same scale when possible. Per-accelerator comparison using reported performance for MLPerf 0.7 on NVIDIA A100 (8x A100s). MLPerf IDs: DLRM: 0.7-17; ResNet-50 v1.5: 0.7-18, 0.7-15; BERT, GNMT, Mask R-CNN, SSD, Transformer: 0.7-19; MiniGo: 0.7-20.
Max scale: all results from MLPerf v0.7 using NVIDIA DGX A100 (8x A100s). MLPerf IDs at max scale: ResNet-50 v1.5: 0.7-37; Mask R-CNN: 0.7-28; SSD: 0.7-33; GNMT: 0.7-34; Transformer: 0.7-30; MiniGo: 0.7-36; BERT: 0.7-38; DLRM: 0.7-17.
The MLPerf name and logo are trademarks. See www.mlperf.org for more information.
Owning an NVIDIA DGX A100 or any other DGX system gives you direct access to NVIDIA DGXperts as part of NVIDIA Enterprise Support Services. NVIDIA DGXperts complement your in-house AI expertise and let you combine an enterprise-grade platform with augmented, AI-fluent talent to achieve your organization's AI project goals.
12 Summary
The innovations in the NVIDIA DGX A100 system make it possible for developers, researchers, IT
managers, business leaders, and more to push the boundaries of what’s possible and realize the
full benefits of AI in their projects and across their organizations.
Results on DGX A100. BERT-Large inference (sequence length = 128):
• T4: TRT 7.1, precision = INT8, batch size = 256
• V100: TRT 7.1, precision = FP16, batch size = 256
• A100 with 7 MIG instances of 1g.5gb: TensorRT Release Candidate, batch size = 94, precision = INT8 with sparsity (1g.5gb is the smallest instance of the A100, with 1/7 of the compute and 5 GB of total memory)
• Training:
> DGX A100 system with 8x NVIDIA A100 GPUs, TF32 precision vs. DGX-1 system with 8x
NVIDIA V100/16GB GPUs, FP32 precision.
> Deep learning language model: the large version of one of the world's most advanced AI
language models–Bidirectional Encoder Representations from Transformers (BERT) on
the popular PyTorch framework.
> Pre-training throughput using PyTorch NGC Container 20.06, sequence length 128
• Inference:
> DGX A100 system with 8x NVIDIA A100 GPUs using INT8 with Structured Sparsity vs. a
CPU server with 2x Intel Platinum 8280 using INT8.
NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any other changes to this document, at any time without
notice.
Customer should obtain the latest relevant information before placing orders and should verify that such information is current and complete.
NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the time of order acknowledgement, unless otherwise agreed
in an individual sales agreement signed by authorized representatives of NVIDIA and customer ("Terms of Sale"). NVIDIA hereby expressly objects to applying any
customer general terms and conditions with regards to the purchase of the NVIDIA product referenced in this document. No contractual obligations are formed
either directly or indirectly by this document.
NVIDIA products are not designed, authorized, or warranted to be suitable for use in medical, military, aircraft, space, or life support equipment, nor in
applications where failure or malfunction of the NVIDIA product can reasonably be expected to result in personal injury, death, or property or environmental
damage. NVIDIA accepts no liability for inclusion and/or use of NVIDIA products in such equipment or applications and therefore such inclusion and/or use is at
customer's own risk.
NVIDIA makes no representation or warranty that products based on this document will be suitable for any specified use. Testing of all parameters of each
product is not necessarily performed by NVIDIA. It is customer's sole responsibility to evaluate and determine the applicability of any information contained in
this document, ensure the product is suitable and fit for the application planned by customer, and perform the necessary testing for the application in order to
avoid a default of the application or the product. Weaknesses in customer's product designs may affect the quality and reliability of the NVIDIA product and may
result in additional or different conditions and/or requirements beyond those contained in this document. NVIDIA accepts no liability related to any default,
damage, costs, or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this document or (ii)
customer product designs.
No license, either expressed or implied, is granted under any NVIDIA patent right, copyright, or other NVIDIA intellectual property right under this document.
Information published by NVIDIA regarding third-party products or services does not constitute a license from NVIDIA to use such products or services or a
warranty or endorsement thereof. Use of such information may require a license from a third party under the patents or other intellectual property rights of the
third party, or a license from NVIDIA under the patents or other intellectual property rights of NVIDIA.
Reproduction of information in this document is permissible only if approved in advance by NVIDIA in writing, reproduced without alteration and in full
compliance with all applicable export laws and regulations, and accompanied by all associated conditions, limitations, and notices.
THIS DOCUMENT AND ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS
(TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR
OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND
FITNESS FOR A PARTICULAR PURPOSE. TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL NVIDIA BE LIABLE FOR ANY DAMAGES, INCLUDING
WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF
THE THEORY OF LIABILITY, ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF NVIDIA HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA's aggregate and cumulative liability towards customer for the
products described herein shall be limited in accordance with the Terms of Sale for the product.
Trademarks
NVIDIA, the NVIDIA logo, DGX, CUDA, NVIDIA POD, and NVIDIA SuperPOD are trademarks and/or registered trademarks of NVIDIA Corporation in the U.S. and
other countries. Other company and product names may be trademarks of the respective companies with which they are associated.
Copyright
© 2020 NVIDIA Corporation. All rights reserved.