Performance Analysis of Deep Learning Workloads on Leading-edge Systems
Y. Ren,
Submitted to the 2019 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance
Computer Systems (PMBS) Conference
to be held at Denver, CO, United States
November 17 - 22, 2019
April 2020
Notice: This manuscript has been authored by employees of Brookhaven Science Associates, LLC under
Contract No. DE-SC0012704 with the U.S. Department of Energy. The publisher by accepting the
manuscript for publication acknowledges that the United States Government retains a non-exclusive, paid-up,
irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others
to do so, for United States Government purposes.
Abstract—This work examines the performance of leading-edge systems designed for machine learning computing, including the NVIDIA DGX-2, Amazon Web Services (AWS) P3, IBM Power System Accelerated Compute Server AC922, and a consumer-grade Exxact TensorEX TS4 GPU server. Representative deep learning workloads from the fields of computer vision and natural language processing are the focus of the analysis. Performance analysis is performed along a number of important dimensions. Performance of the communication interconnects and of large and high-throughput deep learning models are considered. Different potential use models for the systems, as standalone and in the cloud, also are examined. The effect of various optimizations of the deep learning models and system configurations is included in the analysis.

Index Terms—Deep learning, High performance computing, Benchmark testing, Performance analysis, Computer architecture, Concurrent computing, DGX-2, GPU

I. INTRODUCTION

The growth of machine learning and deep learning (DL) extends across all data analytical application areas, impacting many disciplines and markets. Hence, their practical use potential appears exponential and seemingly unbounded. In turn, the ever-insatiable need for computing resources for these workloads has led to the development of computer architectures and systems designed to improve machine learning performance [1]–[8]. As the presently preferred architectures for machine learning application workloads, GPU-based systems are an important exemplar in this category.

This work evaluates the performance of two important types of DL algorithms on four leading-edge GPU-based systems. Specifically, we consider convolutional neural network (CNN) algorithms, such as AlexNet and ResNet, mostly used in computer vision, and attention-mechanism-based algorithms for natural language processing, on the NVIDIA DGX-1 and DGX-2, IBM Power System AC922, and Exxact TensorEX TS4. Moreover, we analyze a cloud-based Amazon Web Services (AWS) P3dn use mode for the DGX-1 and compare DL performance against standalone use for the other systems considered.

GPU-based systems are especially well suited for DL workloads, as proven in practice and in scientific publications [3], [9], [10]. Briefly, this stems from their single-instruction multiple-data (SIMD) nature and the arithmetic intensity of the algorithms mapping well to the available floating point operations (FLOPS) on GPUs; from the availability of large amounts of high-bandwidth memory that allows for data access at fast rates and low latency; and from high-speed interconnects that afford communication at high bandwidth with minimal contention. The first three examples of leading-edge systems considered herein use the NVIDIA Tesla V100 GPU with different topologies of the NVLink interconnect. The Exxact TS4 is configured with the consumer-grade GeForce RTX 2080 Ti GPU, which is popular among AI researchers, developers, and hobbyists. Section II-A describes the systems and their key architectural characteristics in more detail.

Section III details how the DL models considered are trained, the fundamental arithmetic operations involved during training, and their effects on different hardware systems. Specifically, Section III-B dissects CNN models for computer vision, while Section III-C explores the state-of-the-art Bidirectional Encoder Representations from Transformers (BERT) model for natural language processing (NLP) [11].

The detailed performance analysis is done along a few important dimensions. Section IV-A presents the performance of key global communication kernels used in the benchmarks considered. Section IV-B discusses performance and scalability of large and high-throughput DL models. Section IV-D compares performance when the benchmarks are expressed in an easy-to-code multi-GPU architecture enabled by system software described in Section II-B.

II. ENVIRONMENT

A. Hardware Environment

As part of this work, the following systems were put to the test: NVIDIA DGX-1V and DGX-2 (DGX-2), IBM Power System AC922 (IBM-P9), AWS P3dn (AWS P3), and Exxact TensorEX TS4 (RTX). Henceforth, the systems will be referenced using their respective abbreviations noted in parentheses. For added convenience, a consistent color scheme and geometric shape are maintained for each system represented in figures throughout this work (green diamond, DGX-2; blue square, IBM-P9; orange triangle, AWS P3; red circle, RTX). Of note, the AWS P3 essentially is a DGX-1V, as shown in the communication bandwidth test depicted in Section IV-A.

Before delving into the details of each system, we first introduce the key architectural component: the NVIDIA Tesla V100 GPU.
Tesla V100: The Tesla V100 GPU [12] is a building block for three of the four systems under consideration. The V100 GPU has 640 Tensor cores and 5,120 CUDA cores with 32 GB (or 16 GB) of HBM2 GPU memory (900 GB/s bandwidth). It can achieve 15.7 TFLOPS of single-precision performance. For direct inter-device (GPU-to-GPU) communication, the V100 has six NVLink-2.0 links, each supporting 25 GB/s per data direction. Therefore, each V100 can communicate with other GPU devices at 150 GB/s unidirectional (or 300 GB/s bidirectional) bandwidth. This high-bandwidth inter-device communication is crucial for training deep neural network models across multiple devices.

DGX-2: The bulk of the DGX-2's computation capacity is from 16 V100 (32 GB) GPUs evenly distributed on two baseboards and connected via 12 on-node switches, or NVSwitch [13]. Each NVSwitch has 18 NVLink ports (16 in use) and supports 900 GB/s bidirectional peak bandwidth. Eight NVLink ports are connected to different GPU devices (one per link) on the same baseboard, whereas the other eight NVLink ports are connected to the matching NVSwitch ports on the other baseboard (Figure 1a). This network connectivity affords communication at a bandwidth of up to 150 GB/s per direction: any two V100 GPUs can establish full-bandwidth (up to 150 GB/s per direction) communication using all six NVLinks. The specific DGX-2 system tested has two hyper-threaded 24-core Intel Xeon CPUs (96 logic cores in total) with base frequency of 2.7 GHz, 1.5 TB system memory, and 30 TB NVMe SSD in eight-way RAID0.

AWS P3: AWS' P3dn.24xlarge instance is similar to the NVIDIA DGX-1V system [6] and is equipped with eight Tesla V100 (32 GB) GPUs connected in a hybrid cube-mesh topology (Figure 1b). The hybrid cube-mesh topology leads to each node having four immediate neighbors. This is a legacy design following the previous DGX-1P system, where the Tesla P100 GPU featured only four NVLink ports. Two of the four neighbors are connected by two links each, while the other two connect by one only. To connect two P3 systems, AWS provides network connection bandwidth of up to 100 Gbits/s. The caveat is that this limit can be reached only for multi-flow connections. The single-flow bandwidth is 10 Gbits/s (1.25 GB/s). The specific AWS P3 systems tested in this effort have two hyper-threaded 24-core Intel Xeon 8175M CPUs (96 logic cores in total) with base frequency of 2.5 GHz, 768 GB system memory, and 2 TB ephemeral NVMe SSD. Section IV-A shows that the NVIDIA DGX-1V system is analogous to the AWS P3. Thus, we include only the results for the AWS P3.

IBM-P9: The IBM Power System AC922 [14] (Model 8335-GTH) server tested is equipped with four Tesla V100 (32 GB) GPUs (Figure 1c). The tested AC922 server has two IBM POWER9 hyper-threaded 20-core CPUs (160 logic cores in total) with base frequency of 2.3 GHz and max frequency of 3.8 GHz. IBM's POWER9 CPU is NVLink-enabled. Each CPU has six direct NVLink connections to GPUs (three per GPU), enabling a 75 GB/s unidirectional communication bandwidth to each GPU. In addition, there are three NVLink fabrics connecting the two GPUs directly. If the GPUs are not connected to the same CPU, communications must route through the inter-CPU symmetric multiprocessing (SMP) cable with unidirectional bandwidth of 32 GB/s. The POWER9 CPU connects to the system main memory with accumulated (eight channels) unidirectional bandwidth of 60 GB/s. The tested system has four nodes, connected via high-bandwidth (24 GB/s unidirectional) InfiniBand. All of the nodes use the IBM General Parallel File System (GPFS) with a block size of 1 MB and bandwidth of approximately 18 GB/s.

Figure 1: GPU-to-GPU Communication Topology. (a) DGX-2 NVSwitch crossbar; (b) DGX-1V and AWS P3 hybrid cube-mesh topology; (c) IBM AC922 Model 8335-GTH NVLink-enabled POWER9 CPU. Each Tesla V100 GPU has six NVLink ports with unidirectional communication bandwidth of 25 GB/s per port. Numerically labeled boxes represent different GPU devices. The six NVLinks from device-0 are colored differently.
RTX: The Exxact TensorEX 4U server (TS4-1598415-DPN) is equipped with eight NVIDIA consumer-grade GeForce RTX 2080 Ti GPUs [15]. Each RTX 2080 Ti GPU has 4,352 CUDA cores and 11 GB of GDDR6 GPU memory with 616 GB/s memory bandwidth. It can reach a peak of 13.4 TFLOPS of single-precision performance, or about 85.4% of the V100 GPU's peak performance. The specific server tested in this work has two hyper-threaded 12-core Intel Xeon 4116 CPUs (48 logic cores in total) with base frequency of 2.1 GHz. All eight GPUs are connected via a PCIe bus. Compared to the other, high-end V100 GPU-based solutions, the consumer-grade RTX GPU cards are the distinguishing feature of this system. As such, we refer to this system as RTX.
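To make the hardware descriptions above concrete, the following minimal sketch (not part of the original benchmark code) enumerates the GPUs PyTorch sees on the current machine and reports their memory capacity; the reported names and sizes will of course differ across the four systems.

    import torch

    # Illustrative helper: list the GPUs visible to PyTorch on this system.
    # On the DGX-2 this would report 16 Tesla V100 (32 GB) devices; on the
    # RTX server, eight GeForce RTX 2080 Ti (11 GB) devices.
    def describe_gpus():
        for i in range(torch.cuda.device_count()):
            props = torch.cuda.get_device_properties(i)
            print(f"GPU {i}: {props.name}, "
                  f"{props.total_memory / 1024**3:.1f} GB, "
                  f"{props.multi_processor_count} SMs, "
                  f"compute capability {props.major}.{props.minor}")

    if __name__ == "__main__":
        describe_gpus()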
B. Software Environment

Because of its popularity among AI researchers, its well-designed user interface, and its native support for NVIDIA communication and computation backend kernels and MPI, we use the PyTorch DL platform. To maintain a consistent and reproducible software environment, we use docker containers, which also alleviate the difficulty in migrating the DL models to other hardware systems and reduce the performance differences introduced by distinct software environments. For the x86 architecture (Intel Xeon CPU) systems, including DGX-1, DGX-2, AWS P3, and RTX, we use the NVIDIA official PyTorch docker image (NVCR)1 as the base software environment. For the ppc64le architecture (IBM POWER9 CPU) system, IBM-P9, we use PowerAI v1.6 [16]. Nevertheless, to ensure our work is reproducible, Table I lists the exact library versions of the NVIDIA docker and of PowerAI v1.6. The NVIDIA CUDA library is a programming interface to NVIDIA GPUs for parallel computing, while NVIDIA's cuDNN (deep neural network) library provides device-level optimized, neural-network-related backend kernels. The NVIDIA NCCL (collective communication) library provides a multi-GPU communication interface, supporting several communication means, such as NVLink, PCIe, and Ethernet.

1 nvcr.io/nvidia/pytorch:18.11-py3

Table I: Software Environment

Library    NVIDIA NVCR    IBM PowerAI
PyTorch    1.0.0a0        1.1.0
CUDA       10.0.130       10.1.168
cuDNN      7.401          7.501
NCCL       2.307          2.407
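As a quick reproducibility check, a small script such as the following (a sketch, not taken from the paper's benchmark code) can confirm that the active environment matches the versions in Table I; the exact reporting format varies across PyTorch releases.

    import torch

    # Print the library versions of the active environment for comparison
    # against Table I. Formats differ by release (e.g., NCCL may be reported
    # as a packed integer or as a tuple).
    print("PyTorch:", torch.__version__)
    print("CUDA:   ", torch.version.cuda)
    print("cuDNN:  ", torch.backends.cudnn.version())
    print("NCCL:   ", torch.cuda.nccl.version())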
III. DEEP LEARNING MODELS

A. Data Movement and Communication Between Devices

Deep learning is a data-driven modeling approach. The training process, known as stochastic gradient descent, consists of numerous iterations of feeding data to the model and adjusting the model parameters to reduce the predefined loss. At each iteration, a batch of data is selected at random (without replacement). The data are loaded from the hard drive to the host memory, and, sometimes, preprocessing data-augmentation procedures are applied using CPU threads, such as randomly flipping images or adjusting image sizes. Then, the preprocessed batch is sent to the GPU memory via the PCIe bus.

The bulk of the actual computation usually is done on one or multiple GPUs. In the multiple-GPU case, the execution is done in a SIMD fashion: each GPU has an exact replica of the neural network model and applies exactly the same operations to different sampled data batches. In the ideal case, the throughput would grow linearly with the number of GPUs. At the end of every iteration, all of the model replicas require synchronization. This synchronization is done by a collective communication using NCCL. Most of the results in this work use the NCCL all-reduce kernel. Therefore, the two major factors affecting the time cost of communication are: 1) the inter-device communication bandwidth and 2) the number of model parameters.

For this work, we have selected several representative DL models to cover different ranges of parameters, computation-communication ratios, application domains, and various types of neural network DL layers. Because of the vast number of potential DL models, we are unable to test all of them exhaustively. However, by providing detailed descriptions and computation characteristics for these select models, readers should be able to easily estimate the performance (in terms of computation efficiency, not model accuracy) of other models, as the fundamental types of numeric operations are comparable.

As computer vision and NLP are the two most successful application domains for DL, we choose the AlexNet and ResNet models from the computer vision domain and the BERT model from NLP to represent examples of DL methods in these areas. We analyze the models in terms of their number of trainable parameters and operations. The former affects the memory footprint as well as the inter-device communication costs, while the latter impacts the on-device computation time. The computation cost per iteration scales linearly with the number of instances per sampled data batch, known as the batch size. However, the actual computation cost depends on many other factors. Table II provides a summary of the number of parameters and operations per instance for all of the models presented in this work.

Table II: Tested Deep Learning Models

Model Name     Param.     Ops/ins.
AlexNet        61.10 M    0.72 G
ResNet18       11.69 M    1.83 G
ResNet50       25.56 M    4.14 G
ResNet101      44.55 M    7.88 G
ResNet152      60.19 M    11.62 G
BERT-SWAG      109.5 M    0.19 G
BERT-SQuAD     109.5 M    2.87 G
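The data-parallel training pattern described above can be illustrated with a short PyTorch sketch. This is a minimal, hypothetical example (not the paper's benchmark code): each process keeps one model replica on its own GPU, and the NCCL all-reduce that synchronizes the replicas each iteration is issued by torch.nn.parallel.DistributedDataParallel during backward().

    import torch
    import torch.distributed as dist
    import torchvision

    def train(local_rank, batch_size=128, steps=10):
        # One process per GPU; NCCL is the backend for the all-reduce
        # synchronization discussed in Section III-A. Rank and world size
        # are assumed to be provided through the usual environment variables.
        dist.init_process_group(backend="nccl")
        torch.cuda.set_device(local_rank)

        # Each process holds an exact replica of the model on its own GPU.
        model = torchvision.models.resnet50().cuda(local_rank)
        model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
        loss_fn = torch.nn.CrossEntropyLoss()

        for _ in range(steps):
            # Synthetic batch standing in for a preprocessed ImageNet batch.
            images = torch.randn(batch_size, 3, 224, 224, device=local_rank)
            labels = torch.randint(0, 1000, (batch_size,), device=local_rank)

            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            # backward() triggers the NCCL all-reduce that averages gradients
            # across all replicas; its cost grows with the parameter count.
            loss.backward()
            optimizer.step()

In practice, one such process would be launched per GPU (for example, with torch.distributed.launch), and a DistributedSampler would shard the real dataset so that each replica sees a different batch.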
B. Computer Vision

The goal of computer vision is to make computers gain a high-level "understanding" of images. To evaluate whether a program (AI model) truly "understands" an image, researchers have developed different evaluation tasks to measure its comprehension. One type of these tasks, known as image classification, provides an image to the program and asks which predefined class the image belongs to. For example, MNIST (a handwritten digit database) asks the program to tell which digit, from 0 to 9, a grayscale image (28-by-28 pixels) belongs to. This is considered one of the simplest computer vision tasks, and traditional machine learning methods, such as the support vector method, have reached 99.2% accuracy [17]. The ImageNet Large Scale Visual Recognition Challenge, or ILSVRC [18], a much more challenging image classification test, was introduced in 2010. It contains 1,000 predefined classes (including 60 different dog breeds) and more than a million training images. The best-performing model in the first ILSVRC (2011) achieved only about a 25% top-five error rate.2 In 2012, AlexNet [19], considered the first modern CNN-based model, successfully reduced the top-five error rate to 16.4%. In 2015, ResNet [20] further reduced the error rate to 3.57%. It also introduced residual blocks to mitigate the "vanishing gradient problem" that arises when the neural network becomes too deep.

2 Top-five error rate: for each test image, the algorithm is allowed to give five predictions. If any of the five predictions matches the ground truth, it is considered a hit.

A deep neural network is a stack of multiple neural network layers, usually of varying kinds. Each layer takes the previous layer's output as its input, where both input and output are tensors. A Linear layer is one of the simplest kinds: a matrix of size c_i × c_o, where c_i and c_o are the numbers of input and output channels. Therefore, the number of parameters of a Linear layer is on the order of O(c_i c_o), or c_o(c_i + 1) to be precise, where the "1" is the bias term. The operation performed by a Linear layer essentially is a general matrix-matrix multiplication (GEMM). In most cases, the multiplier matrix (input) has dimension B × c_i, and the multiplicand matrix (the Linear layer weights) has dimension c_i × c_o. As such, the number of operations for a batch size B is B × (c_i + 1) × c_o. One can deduce that the operation-to-parameter ratio Γ for a Linear layer is B: Γ_Linear = B, implying that the computation cost grows linearly with the number of parameters in the Linear layer and with the batch size.

A two-dimensional convolutional (Conv2D) layer consists of c_o kernels of size c_i × k × k. Therefore, the exact number of parameters of a Conv2D layer is c_o(k^2 c_i + 1). A kernel is simply a small tensor applied to the input tensor in a sliding-window fashion, where the step size is called the stride. When the stride is greater than one, the input tensor is downsampled in the spatial dimension. The number of operations for a Conv2D layer can be calculated by considering the number of times the kernel has been applied and the cost of applying each kernel. Applying a Conv2D kernel to an input tensor of size B × c_i × H_i × W_i means performing a tensor dot product of size c_i × k^2 on every pixel of the spatial dimension H × W. For simplicity, assume the striding step is 1 and the padding is k/2, such that the spatial dimension is unchanged: H_o = H_i and W_o = W_i. Thus, each kernel has been applied H_o × W_o times.3 For each kernel application at every pixel, a GEMM operation is performed, which costs C ≡ c_o(c_i k^2 + 1). Therefore, in total, the number of operations of the Conv2D layer is H_o × W_o × C per instance, or B × H_o × W_o × C for a batch. Because the number of parameters of a Conv2D layer is also C, the operation-to-parameter ratio for a Conv2D layer is Γ_Conv2D = B H_o W_o. As in the case of the Linear layer, the total number of operations scales with the batch size. Yet, in contrast to the Linear layer, the total number of operations also depends on the spatial dimension of the output tensor. Each parameter of a Conv2D layer is operated on H_o W_o more times than a parameter in a Linear layer.

3 Note that by setting the stride greater than one, fewer kernel operations will be applied, which can reduce the spatial dimension (downsampling), whereas by setting the space between kernel points (dilation), the spatial dimension can increase (upsampling). The computation cost analysis is similar.

AlexNet consists of five Conv2D layers with ~2^21 parameters in total, two hidden Linear layers (~2^25), and one output Linear layer (~2^22). The Linear layers thus use an order of magnitude more parameters than the Conv2D layers. Compared to AlexNet, ResNet consists almost entirely of Conv2D layers, except for the final Linear layer for the classification output. The sub-types of ResNet models are labeled as ResNetX, where X represents the total number of parameterized layers (Conv2D and Linear). The choices of X in the original paper [20] are 18, 34, 50, 101, and 152. ResNet18 serves as a high-throughput (small number of operations), low-accuracy model because of its small number of parameters, while ResNet152 has the highest accuracy but the slowest training throughput. Using ResNet50 for ImageNet data (1000-way classification) as a concrete example, the model contains about 2^24.6 parameters, of which only 2^21 are from the Linear layer. As discussed, each parameter of a Conv2D layer contributes a factor of H_o × W_o more operations than one in a Linear layer. As such, ResNet has a much higher operation-to-parameter ratio than AlexNet.
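The parameter and operation counts above are easy to verify programmatically. The following sketch (illustrative, not taken from the paper) counts the trainable parameters of the torchvision reference models for comparison with Table II, and applies the Γ formulas to a single Linear and Conv2D layer with assumed example dimensions.

    import torch.nn as nn
    import torchvision

    # Parameter totals, to compare against Table II
    # (e.g., ~61.10 M for AlexNet, ~25.56 M for ResNet50).
    for name in ["alexnet", "resnet18", "resnet50", "resnet101", "resnet152"]:
        model = getattr(torchvision.models, name)()
        n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
        print(f"{name:10s} {n_params / 1e6:6.2f} M parameters")

    # Operation-to-parameter ratios for a batch size B, following Section III-B.
    # The layer sizes below are arbitrary example values.
    B, c_i, c_o, k, H_o, W_o = 128, 64, 128, 3, 56, 56

    linear = nn.Linear(c_i, c_o)
    linear_params = sum(p.numel() for p in linear.parameters())  # c_o * (c_i + 1)
    linear_ops = B * (c_i + 1) * c_o                             # GEMM cost for the batch
    print("Gamma_Linear =", linear_ops / linear_params)          # equals B

    conv = nn.Conv2d(c_i, c_o, k, padding=k // 2)
    conv_params = sum(p.numel() for p in conv.parameters())      # c_o * (c_i * k^2 + 1)
    conv_ops = B * H_o * W_o * conv_params                       # kernel applied H_o*W_o times per instance
    print("Gamma_Conv2D =", conv_ops / conv_params)              # equals B * H_o * W_o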
C. Natural Language Processing

NLP is another successful application of DL techniques. Some NLP tasks include speech recognition, translation, speech-to-text (and vice versa), and question-and-answer systems. In the pre-DL era, NLP was dominated by hidden Markov models [21]. Mikolov et al. [22] introduced a DNN-based word embedding model to represent words as vectors based on their context; namely, similar words have comparable context around them and end up closer together in the vector space. This approach provides a meaningful way to represent non-numeric entities, i.e., words, as numeric vectors and provides a foundation for solving a diverse range of NLP tasks. Graves et al. [4] developed a deep recurrent-neural-network-based approach to perform automatic speech recognition and broke the TIMIT phoneme recognition benchmark record [23]. By the end of 2016, all major technology companies had adopted the DNN-based approach for their speech recognition systems. Vaswani et al. [24] introduced the attention mechanism into NLP tasks and demonstrated its superior performance in natural language translation tasks. The particular NLP model in this work, BERT, uses bidirectional transformers [11] and exceeded 11 NLP benchmark records in November 2018.4

4 As of March 2019, OpenAI and Microsoft have released their own model challengers to BERT.

The BERT model has two training phases: 1) pre-training and 2) fine-tuning. In the pre-training phase, BERT uses the semi-supervised sequence learning approach [25] of masking out a random word in a sentence. Unlike previous unidirectional approaches, BERT tries to predict the masked word from both directions. Training is done on large unlabeled corpora, such as the English Wikipedia (2,500 million words). Herein, this pre-trained model is known as the base-model. In the task-specific fine-tuning phase, the base-model connects with a classification Linear layer designed for the specific task. The data used for fine-tuning are labeled and much smaller compared to the large corpora [26]. The majority of attention-mechanism operations are matrix multiplications and layer-wise normalizations. For details regarding how the attention mechanism works, readers can refer to several available guides.5,6

5 http://nlp.seas.harvard.edu/2018/04/03/attention.html
6 https://jalammar.github.io/illustrated-transformer/

We use the pre-trained BERT base-model and fine-tune it for two specific NLP tasks: SWAG and the Stanford Question Answering Dataset (SQuAD). SWAG [27] is a multiple-choice task: given a situation described by a sentence as input, the model is asked to select the most plausible scenario that happens next from among multiple choices. SQuAD [28] is a question-answering task, where a pair consisting of a question and a relevant paragraph (containing the answer) is provided, and the model is tasked with finding the answer in the given paragraph.

Although the base model is the same, different max-seq-length values are used in order to fully cover the training data. We use a max-seq-length of 80 for SWAG and 384 for SQuAD. As the max-seq-length determines the attention span, it takes more operations to perform the SQuAD task. Table II lists the number of model parameters and estimated operations for BERT-SWAG and BERT-SQuAD, respectively. Of note, our benchmark code is modified from the source code.7

7 https://github.com/huggingface/pytorch-pretrained-BERT
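As an illustration of the fine-tuning setup, the sketch below is a hypothetical example based on the pytorch-pretrained-BERT package cited in footnote 7 (class names and call signatures may differ between releases). It loads the pre-trained base-model, attaches the SQuAD-style span-prediction head, and pads the input to the max-seq-length of 384 used in this work.

    import torch
    from pytorch_pretrained_bert import BertTokenizer, BertForQuestionAnswering

    MAX_SEQ_LENGTH = 384  # attention span used for BERT-SQuAD in this work

    # Pre-trained base-model plus a task-specific span-prediction head.
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)
    model = BertForQuestionAnswering.from_pretrained("bert-base-uncased").cuda()

    question = "Where is the answer?"       # toy inputs for illustration only
    paragraph = "The answer is in the given paragraph."

    # BERT input layout: [CLS] question [SEP] paragraph [SEP], padded to MAX_SEQ_LENGTH.
    tokens = ["[CLS]"] + tokenizer.tokenize(question) + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    tokens += tokenizer.tokenize(paragraph) + ["[SEP]"]
    segment_ids += [1] * (len(tokens) - len(segment_ids))

    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_mask = [1] * len(input_ids)
    padding = [0] * (MAX_SEQ_LENGTH - len(input_ids))
    input_ids, input_mask, segment_ids = (x + padding for x in (input_ids, input_mask, segment_ids))

    batch = [torch.tensor([x]).cuda() for x in (input_ids, segment_ids, input_mask)]
    start_logits, end_logits = model(*batch)  # predicted answer-span boundaries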
depends on the devices involved in the communication. In the
IV. P ERFORMANCE A NALYSIS case of two GPUs, the test employs device-0 and device-1,
which are connected via a single NVLink that offers 25 GB/s
This section details the performance analysis of DL work- theoretical unidirectional bandwidth. For four GPUs, device-0
loads using the four systems (already described) under consid- to -3 are used, and the NVLinks connecting to device-4 to -7
eration. The all-important communication performance is first are not. The observed bandwidth is about 80 GB/s. For eight
presented. Given the different workload characteristics, the GPUs, the DGX-1 surpasses the DGX-2 in the all-reduce tests
analysis is done separately for large-scale and high-throughput (Figure 2a). In the broadcast test (Figure 2b), the crossover
models. Performance details for an increasingly popular code occurs when the message size exceeds 256 MB. While these
results may seem unexpected due to the higher bandwidth
4 As of March 2019, OpenAI and Microsoft have released their model
challengers to BERT. 8 https://developer.nvidia.com/nccl
5 http://nlp.seas.harvard.edu/2018/04/03/attention.html. 9 https://github.com/NVIDIA/nccl-tests/release/tag/v1.0.0
6 https://jalammar.github.io/illustrated-transformer/. 10 Described in detail here:
7 https://github.com/huggingface/pytorch-pretrained-BERT https://github.com/NVIDIA/nccl-tests/blob/master/doc/PERFORMANCE.md.
of 75 GB/s. However, with four GPUs, the bus bandwidth
reduces to about 30 GB/s, matching the theoretical SMP bus
bandwidth of 32 GB/s when connecting two POWER9 CPUs.
Higher count GPU configurations on the IBM P9 (eight-
and 16-GPU) exhibit lower bus bandwidth (Figure 2). This
achieved performance is due to NCCL not being optimized
for the InfiniBand interconnect.
The RTX system does not use NVLink technology, and
all eight RTX 2080Ti GPUs connect through a PCIe bus.
Therefore, the communication bandwidth is throttled down by
the PCIe bus. Despite its inferior communication performance,
the RTX system serves as the baseline for other systems.
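To make the bus-bandwidth normalization concrete, the following sketch applies the scaling factors documented in the NCCL-tests PERFORMANCE.md cited above: for all-reduce, the algorithm bandwidth (message size divided by time) is multiplied by 2(n-1)/n for n GPUs, while for broadcast the bus bandwidth equals the algorithm bandwidth. The numbers used below are illustrative placeholders, not measurements from this study.

    def bus_bandwidth(kernel, message_bytes, elapsed_s, n_gpus):
        """Convert a measured collective time into NCCL-tests 'bus bandwidth' (GB/s)."""
        alg_bw = message_bytes / elapsed_s / 1e9  # algorithm bandwidth: size / time
        if kernel == "all_reduce":
            # Each rank sends and receives 2*(n-1)/n of the data in a ring all-reduce.
            return alg_bw * 2 * (n_gpus - 1) / n_gpus
        if kernel == "broadcast":
            return alg_bw  # broadcast moves the buffer once
        raise ValueError(f"unsupported kernel: {kernel}")

    # Illustrative example: a 1 GB all-reduce across 16 GPUs completing in 15.6 ms
    # corresponds to a bus bandwidth of roughly 120 GB/s, the plateau the DGX-2
    # reaches in Figure 2.
    print(bus_bandwidth("all_reduce", 1e9, 0.0156, 16))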
B. Performance and Scalability of DL Models

1) Performance Analysis of Large Models: For the large models (ResNet101, ResNet152, BERT-SWAG, and BERT-SQuAD), we use the largest batch size that can fit into the 32 GB of memory of a single V100 GPU to achieve the best possible scaling results. Specifically, the batch sizes used are: 128 per GPU for ResNet101 and ResNet152, 64 for BERT-SWAG, and 32 for BERT-SQuAD.

Across all four systems, the DGX-2 and AWS P3 have similar performance up to eight GPUs. This is expected, as both systems have the same V100 GPUs and are connected via high-bandwidth (over 120 GB/s) NVLinks. However, when 16 GPUs are in use, the two AWS P3s communicate through a relatively slow Ethernet connection (about 1 GB/s measured). Figures 3c and 3d reveal the differences in performance, especially for the BERT models, where the number of parameters is large. Given its high-bandwidth inter-node communication network, the IBM-P9 exhibits performance similar to the DGX-2 all the way up to a 16-GPU configuration.

The RTX server has 11 GB of GDDR6 GPU memory per card. Hence, the batch sizes are even smaller: one-quarter of the sizes used with the 32 GB V100 GPUs on all other systems. Specifically, the batch size for ResNet101 and ResNet152 is 64, for BERT-SQuAD it is 8, and for BERT-SWAG it is 16. This leads to a quadrupling of the amount of communication for the same total number of computed instances. RTX's slow inter-device communication via a PCIe bus further exacerbates its performance degradation. For example, in the case of one GPU, RTX reaches about 65.82% of the DGX-2's throughput averaged over the four large DL models, yet merely 57.27% in the case of eight GPUs (see Table III). Hence, the RTX server is the least efficient system for large-model distributed training.

To examine the scaling more closely throughout the full span of GPU configurations, we plot the throughput for all DL models on a log-log scale (Figure 4), where the dashed reference line depicts linear scalability. If the measured throughput follows the reference line, or maintains a constant gap, it has good parallel scalability. The DGX-2 exhibits good scalability on all four models, whereas the AWS P3 shows linear scalability up to eight GPUs. For the RTX, there is a significant drop from one GPU to two GPUs in terms of scalability because one-GPU computation does not require model synchronization, while that cost does apply for multiple-GPU configurations.

Table III: Instances per second for RTX relative to DGX-2

Model Name     1 GPU     2 GPUs    4 GPUs    8 GPUs
AlexNet        78.19%    63.01%    53.41%    47.95%
ResNet18       73.50%    69.13%    64.39%    54.80%
ResNet50       67.97%    62.67%    62.97%    61.75%
Average        73.22%    64.94%    60.26%    54.83%
ResNet101      69.70%    63.72%    64.15%    62.69%
ResNet152      69.73%    62.45%    62.96%    61.90%
BERT-SWAG      64.04%    57.52%    57.20%    56.25%
BERT-SQuAD     59.81%    49.79%    49.74%    48.22%
Average        65.82%    58.37%    58.51%    57.27%
Overall avg.   68.99%    61.19%    59.26%    56.22%
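A back-of-the-envelope estimate shows why the parameter count dominates the scaling behavior of the large models. The sketch below (illustrative arithmetic, not a measurement) computes the per-iteration gradient synchronization volume from the parameter counts in Table II, and a lower bound on the transfer time at the roughly 1 GB/s measured inter-node Ethernet bandwidth of the AWS P3 versus the roughly 120 GB/s bus bandwidth of the DGX-2.

    # Per-iteration all-reduce volume: one FP32 gradient per trainable parameter.
    BYTES_PER_PARAM = 4
    models = {"ResNet50": 25.56e6, "ResNet152": 60.19e6, "BERT-SQuAD": 109.5e6}

    for name, n_params in models.items():
        volume_gb = n_params * BYTES_PER_PARAM / 1e9
        t_ethernet = volume_gb / 1.0     # ~1 GB/s measured AWS inter-node Ethernet
        t_nvswitch = volume_gb / 120.0   # ~120 GB/s DGX-2 bus bandwidth
        print(f"{name:10s} {volume_gb * 1e3:6.1f} MB/iter  "
              f"Ethernet >= {t_ethernet * 1e3:6.1f} ms  NVSwitch >= {t_nvswitch * 1e3:5.2f} ms")

For BERT, roughly 438 MB must be synchronized every iteration, which takes on the order of hundreds of milliseconds over the Ethernet link but only a few milliseconds over NVSwitch, ignoring algorithm and latency factors.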
2) Performance Analysis of High-Throughput Learning Models: Here, AlexNet, ResNet18, and ResNet50 are characterized as high-throughput models. All systems except RTX use a batch size of 256 per GPU to fully utilize the 32 GB of GPU memory for all models. RTX uses a batch size of 64. Figure 5 illustrates the results.

Figure 5: Training Throughput of High-Throughput DL Models on RTX, IBM-P9, AWS P3, and DGX-2. (a) AlexNet; (b) ResNet18; (c) ResNet50.

Training high-throughput models implies frequent data movement through the file system. For configurations of up to eight GPUs, the performance is lower on the IBM-P9. The reason is the use of the external GPFS filesystem on the IBM machine, whereas the other systems under consideration utilize local storage for the execution of these small models.

For ResNet50 (Figure 5c), all the systems exhibit linear scaling. Because of the ResNet50 model's small size, the slow inter-node Ethernet bandwidth of the AWS P3 does not bottleneck the distributed training throughput.

Because AlexNet uses more than twice the number of parameters of ResNet50, its throughput is throttled by the slow Ethernet connection on the AWS P3 when two nodes (with a total of 16 GPUs) are in use (Figure 5a). Even on the DGX-2, AlexNet does not scale linearly to 16 GPUs (shown in Figure 6a). When 16 GPUs are in use on the DGX-2, AlexNet spends about 80% of the active GPU time in communication, whereas ResNet50 spends only about 4%.

Given its smallest number of parameters, ResNet18's need for inter-device communication is modest. Even so, as shown in Figure 6b, the scaling is not ideal. An interesting observation is that, when using 16 GPUs, the AWS P3 performs better than the DGX-2 (Figure 5b).

Recall from Section IV-B that, in all experiments, each GPU is associated with (j =) 4 CPU processes for prefetching data. On the AWS P3, the two CPUs on each node handle 32 processes for the eight GPUs. On the DGX-2, the 16 GPUs require 64 CPU data-fetching processes from the two associated CPUs. Explaining why the AWS P3 outperforms the DGX-2 in Figure 5b requires determining whether the scaling inconsistency stems from a lower core frequency and/or from cache capacity effects. Figure 7a shows CPU core speed measurements (enabled given Turbo Boost technology) for both systems while varying j from 1 to 16 on the DGX-2 and AWS P3. For example, if j = 16 and the DGX-2 uses all 16 GPUs, there are 256 CPU processes in total. The light green curve (Figure 7a) depicts the case when only eight GPUs on the DGX-2 are in use, in which case the DGX-2 has slightly better performance than the AWS P3 (Figure 5b). When using j = 1 CPU process per GPU, the DGX-2's CPU core speed is much higher than that of the AWS P3 because of its superior CPU performance characteristics (see Section II-A). However, as j increases, the DGX-2's CPU core speed decreases, which is typical for Intel Turbo Boost technology. For j = 4, the specific case present in the benchmark runs (also shown by the vertical dotted line in Figure 7a), the DGX-2 maintains a higher CPU core speed than that of the AWS P3. Hence, clock frequency is not the sole explanation for the performance inconsistency.

Figure 7: CPU Performance Bottleneck of ResNet18. (a) CPU Core Speed; (b) Instructions per Cycle.

12 Note: The tested Intel Xeon CPUs can reach a theoretical maximum of four IPC when instructions are perfectly aligned by manual loop unrolling.

To understand the exact amount of work the CPU does per unit time, Figure 7b shows the metric of instructions per cycle (IPC). The IPC of the DGX-2 using 16 GPUs at j = 4 is much lower than that of the AWS P3: 1.35 versus 1.90, pointing to cache utilization inefficiencies.12 Additional measurements of L1-cache data loading speed and data-translation lookaside buffer (TLB) load misses confirm this hypothesis. The data also
reveal that j = 4 usually is a good choice. Of note, because we
use the pinned memory13 to improve host-device data transfer,
using large j will cause high memory usage on the host.
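The j data-fetching processes per GPU and the pinned-memory setting discussed above map directly onto PyTorch's data-loading API. Below is a minimal sketch (an illustration, not the paper's benchmark code) of the configuration implied here: four worker processes per GPU, page-locked host buffers, and an asynchronous host-to-device copy over PCIe.

    import torch
    from torch.utils.data import DataLoader
    import torchvision
    import torchvision.transforms as T

    # CPU-side preprocessing (data augmentation) as described in Section III-A.
    transform = T.Compose([T.RandomResizedCrop(224), T.RandomHorizontalFlip(), T.ToTensor()])
    dataset = torchvision.datasets.FakeData(size=4096, transform=transform)

    loader = DataLoader(
        dataset,
        batch_size=256,
        shuffle=True,
        num_workers=4,      # j = 4 prefetching processes per GPU
        pin_memory=True,    # page-locked host memory speeds up host-to-device copies
    )

    device = torch.device("cuda:0")
    for images, labels in loader:
        # non_blocking=True overlaps the PCIe transfer with CPU-side prefetching.
        images = images.to(device, non_blocking=True)
        labels = labels.to(device, non_blocking=True)
        # ... forward/backward pass would go here ...
        break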
For RTX versus DGX-2 performance, when one or two
GPUs are in use, RTX performance is close to that of the
DGX-2 (refer to Table III). Because of their smaller GPU
memory footprints, high-throughput workloads look more suit-
able on RTX than large models. Just as with the case of
performance on large models, RTX’s scalability is less than
for the DGX-2 (see Table III and Figure 6) due to its slower communication performance. This makes the RTX system most suited for small-scale model development rather than full-scale training workloads.

Figure 8: Performance of ResNet50 on the DGX-2 for single-precision (FP32) and mixed-precision (FP16) training. (a) ResNet50 throughput on the DGX-2; (b) speedup using FP16 relative to FP32.
C. Performance of Mixed-Precision Training
Mixed-precision training [31] retains most, if not all, of the neural network's predictive performance, yet offers significant computational speedup and reduces the memory footprint. The NVIDIA Volta and Turing GPU architectures, used by the V100 and the RTX 2080 Ti, respectively, provide dedicated hardware acceleration called "tensor cores" [32] for this purpose. The tensor core provides high-throughput fused multiply-add (FMA) operations for mixed-precision matrices (inputs in half precision, outputs in either half or single precision). The other advantage of using mixed precision is the smaller memory footprint, and therefore less communication overhead for synchronizing the model replicas.

Figure 8a shows the performance of ResNet50 on the DGX-2 when using mixed precision (FP16) with batch sizes (bsz) of 128 and 256, compared to the performance when using single precision (FP32) for the same model; Figure 8b shows the speedup of FP16 relative to FP32. Except for the 16-GPU configuration, we achieve more than a factor of 2 performance boost. Moreover, since the memory footprint is smaller for FP16, we can accommodate a larger batch size of 256. Doubling the batch size halves the synchronization and parameter-update time for training the same overall amount of data. For the 16-GPU configuration, the speedup is only ×1.7. This is likely due to the cache effect described in Section IV-B2. Note that this performance is very similar to that reported by NVIDIA.14
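The measurements above were taken with the tooling available in the NVCR PyTorch 1.0 container, where NVIDIA's Apex library was the usual route to mixed precision. As an illustration of the same idea in code, here is a minimal sketch using the torch.cuda.amp API from later PyTorch releases; it is an assumption made for illustration, not the authors' setup.

    import torch
    import torchvision

    device = torch.device("cuda:0")
    model = torchvision.models.resnet50().to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.CrossEntropyLoss()

    # GradScaler keeps FP32 master weights stable while activations and
    # gradients flow in FP16.
    scaler = torch.cuda.amp.GradScaler()

    images = torch.randn(256, 3, 224, 224, device=device)  # larger batch fits in FP16
    labels = torch.randint(0, 1000, (256,), device=device)

    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        # Convolutions and GEMMs run on tensor cores with FP16 inputs here.
        loss = loss_fn(model(images), labels)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()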
D. On-node Data Parallel

The communication pattern of on-node data parallel differs from that of distributed data parallel. In it, one GPU maintains a master copy of the model parameters. At every iteration, it broadcasts the parameters to the other GPUs in the configuration. At the end of every iteration, the parameters are "all-reduced" back to the master GPU, which updates the model parameters. Therefore, for each iteration, two global communications (broadcast and reduce) are issued. To emulate the common practice of most PyTorch models, we use the default PyTorch data loader for the on-node data parallel experiments (torch.utils.data.DataLoader), which supports multi-worker and pinned memory but not asynchronous data loading. PyTorch's on-node data parallel design maximizes usability but targets small parallel GPU configurations, such as those common in workstations.
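For reference, the on-node data parallel code path described above is a one-line wrapper in PyTorch. The sketch below (illustrative only, not the paper's benchmark code) shows the typical usage pattern with the default DataLoader; torch.nn.DataParallel scatters each batch across the visible GPUs and gathers the outputs back on the master device every iteration.

    import torch
    from torch.utils.data import DataLoader
    import torchvision
    import torchvision.transforms as T

    device = torch.device("cuda:0")  # master GPU holding the parameter copy

    model = torchvision.models.resnet18()
    # Replicates the model onto all visible GPUs at each forward pass and
    # gathers the per-GPU outputs back on the master device.
    model = torch.nn.DataParallel(model).to(device)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.CrossEntropyLoss()

    dataset = torchvision.datasets.FakeData(size=1024, transform=T.ToTensor())
    loader = DataLoader(dataset, batch_size=256, shuffle=True,
                        num_workers=4, pin_memory=True)

    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)  # batch is split across GPUs here
        loss.backward()                        # gradients are reduced to the master GPU
        optimizer.step()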