
BNL-212208-2019-CPPJ

Performance Analysis of Deep Learning Workloads on Leading-edge Systems

Y. Ren, S. Yoo, A. Hoisie

Submitted to the 2019 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance
Computer Systems (PMBS) Conference
to be held at Denver, CO, United States
November 17 - 22, 2019

April 2020

Computational Science Initiative


Brookhaven National Laboratory

U.S. Department of Energy


USDOE Office of Science (SC), Advanced Scientific Computing Research (SC-21)

Notice: This manuscript has been authored by employees of Brookhaven Science Associates, LLC under
Contract No. DE-SC0012704 with the U.S. Department of Energy. The publisher by accepting the
manuscript for publication acknowledges that the United States Government retains a non-exclusive, paid-up,
irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others
to do so, for United States Government purposes.
DISCLAIMER

This report was prepared as an account of work sponsored by an agency of the
United States Government. Neither the United States Government nor any
agency thereof, nor any of their employees, nor any of their contractors,
subcontractors, or their employees, makes any warranty, express or implied, or
assumes any legal liability or responsibility for the accuracy, completeness, or any
third party’s use or the results of such use of any information, apparatus, product,
or process disclosed, or represents that its use would not infringe privately owned
rights. Reference herein to any specific commercial product, process, or service
by trade name, trademark, manufacturer, or otherwise, does not necessarily
constitute or imply its endorsement, recommendation, or favoring by the United
States Government or any agency thereof or its contractors or subcontractors.
The views and opinions of authors expressed herein do not necessarily state or
reflect those of the United States Government or any agency thereof.
Performance Analysis of Deep Learning Workloads on Leading-edge Systems

Yihui Ren, Shinjae Yoo, Adolfy Hoisie
Computational Science Initiative, Brookhaven National Laboratory
[email protected], [email protected], [email protected]

Abstract—This work examines the performance of leading-edge systems designed for machine learning computing, including the NVIDIA DGX-2, Amazon Web Services (AWS) P3, IBM Power System Accelerated Compute Server AC922, and a consumer-grade Exxact TensorEX TS4 GPU server. Representative deep learning workloads from the fields of computer vision and natural language processing are the focus of the analysis. Performance analysis is performed along a number of important dimensions. Performance of the communication interconnects and of large and high-throughput deep learning models are considered. Different potential use models for the systems, as standalone and in the cloud, also are examined. The effect of various optimizations of the deep learning models and system configurations is included in the analysis.

Index Terms—Deep learning, High performance computing, Benchmark testing, Performance analysis, Computer architecture, Concurrent computing, DGX-2, GPU

I. INTRODUCTION

The growth of machine learning and deep learning (DL) extends across all data analytical application areas, impacting many disciplines and markets. Hence, their practical use potential appears exponential and seemingly unbounded. In turn, the ever-insatiable need for computing resources for these workloads has led to the development of computer architectures and systems designed to improve machine learning performance [1]–[8]. As the presently preferred architectures for machine learning application workloads, GPU-based systems are an important exemplar in this category.

This work evaluates the performance of two important types of DL algorithms on four leading-edge GPU-based systems. Specifically, we consider convolutional neural network (CNN) algorithms, such as AlexNet and ResNet, mostly used in computer vision, and attention-mechanism-based algorithms for natural language processing, on the NVIDIA DGX-1 and DGX-2, IBM Power System AC922, and Exxact TensorEX TS4. Moreover, we analyze a cloud-based Amazon Web Services (AWS) P3dn use mode for the DGX-1 and compare DL performance against standalone use for the other systems considered.

GPU-based systems are especially well suited for DL workloads, as proven in practice and in scientific publications [3], [9], [10]. Briefly, this stems from their single-instruction multiple-data (SIMD) nature and the arithmetic intensity of the algorithms mapping well to available floating point operations (FLOPS) on GPUs; availability of large amounts of high-bandwidth memory that allows for data access at fast rates and low latency; and high-speed interconnects that afford communication at high bandwidth with minimal contention. The first three examples of leading-edge systems considered herein use the NVIDIA Tesla V100 GPU with different topologies of the NVLink interconnect. The Exxact TS4 is configured with the consumer-grade GeForce RTX 2080 Ti GPU, which is popular among AI researchers, developers, and hobbyists. Section II-A describes the systems and their key architectural characteristics in more detail.

Section III details how the DL models considered are trained, the fundamental arithmetic operations involved during training, and their effects on different hardware systems. Specifically, Section III-B dissects CNN models for computer vision, while Section III-C explores the state-of-the-art Bidirectional Encoder Representations from Transformers (BERT) model for natural language processing (NLP) [11].

The detailed performance analysis is done along a few important dimensions. Section IV-A presents the performance of key global communication kernels used in the benchmarks considered. Section IV-B discusses performance and scalability of large and high-throughput DL models. Section IV-D compares performance when the benchmarks are expressed in an easy-to-code multi-GPU architecture enabled by system software described in Section II-B.

II. ENVIRONMENT

A. Hardware Environment

As part of this work, the following systems were put to the test: NVIDIA DGX-1V and DGX-2 (DGX-2), IBM Power System AC922 (IBM-P9), AWS P3dn (AWS P3), and Exxact TensorEX TS4 (RTX). Henceforth, the systems will be referenced using their respective abbreviations noted in parentheses. For added convenience, a consistent color scheme and geometric shape are maintained for each system represented in figures throughout this work (green diamond, DGX-2; blue square, IBM-P9; orange triangle, AWS P3; red circle, RTX). Of note, the AWS P3 essentially is a DGX-1V, as shown in the communication bandwidth test depicted in Section IV-A.
Before delving into the details of each system, we first introduce the key architectural component: the NVIDIA Tesla V100 GPU.

Tesla V100: The Tesla V100 GPU [12] is a building block for three of the four systems under consideration. The V100 GPU has 640 Tensor cores and 5,120 CUDA cores with 32 GB (or 16 GB) HBM2 GPU memory (900 GB/s bandwidth). It can achieve 15.7 TFLOPS for single-precision performance. For direct inter-device (GPU-to-GPU) communication, the V100 has six NVLink-2.0 fabrics supporting 25 GB/s per link, per data direction. Therefore, each V100 has the ability to communicate with other GPU devices at 150 GB/s unidirectional (or 300 GB/s bidirectional) bandwidth. The high bandwidth of inter-node communication is crucial for training deep neural network models across multiple devices.

DGX-2: The bulk of the DGX-2's computation capacity is from 16 V100 (32 GB) GPUs evenly distributed on two baseboards and connected via 12 on-node switches, or NVSwitch [13]. Each NVSwitch has 18 NVLink ports (16 in use) and supports 900 GB/s bidirectional peak bandwidth. Eight NVLink ports are connected to different GPU devices (one per link) on the same baseboard, whereas the other eight NVLink ports are connected to the matching NVSwitch ports on the other baseboard (Figure 1a). This network connectivity affords communications at a bandwidth of up to 150 GB/s per direction. Any two V100 GPUs can establish full-bandwidth (up to 150 GB/s per direction) communication using all six NVLink ports. The specific DGX-2 tested in this work has two hyper-threaded 24-core Intel Xeon 8168 CPUs (96 logic cores in total) with base frequency of 2.7 GHz, 1.5 TB system memory, and 30 TB NVMe SSD in eight-way RAID0.

AWS P3: AWS' P3dn.24xlarge instance is similar to the NVIDIA DGX-1V system [6] and is equipped with eight Tesla V100 (32 GB) GPUs connected in a hybrid cube-mesh topology (Figure 1b). The hybrid cube-mesh topology leads to each node having four immediate neighbors. This is a legacy design following the previous DGX-1P system, where the Tesla P100 GPU featured only four NVLink ports. Two of the four neighbors are connected by two links each, while the other two connect by one only. To connect two P3 systems, AWS provides network connection bandwidth up to 100 Gbits/s. The caveat is that this limit can be reached only for multi-flow connections. The single-flow bandwidth is 10 Gbits/s (1.25 GB/s). The specific AWS P3 systems tested in this effort have two hyper-threaded 24-core Intel Xeon 8175M CPUs (96 logic cores in total) with base frequency of 2.5 GHz, 768 GB system memory, and 2 TB ephemeral NVMe SSD. Section IV-A shows that the NVIDIA DGX-1V system is analogous to the AWS P3. Thus, we include only the results for the AWS P3.

IBM-P9: The IBM Power System AC922 [14] (Model 8335-GTH) server tested is equipped with four Tesla V100 (32 GB) GPUs (Figure 1c). The tested AC922 server has two IBM POWER9 hyper-threaded 20-core CPUs (160 logic cores in total) with base frequency of 2.3 GHz and max frequency of 3.8 GHz. IBM's POWER9 CPU is NVLink-enabled. Each CPU has six direct NVLink connections to GPUs (three per GPU), enabling a 75 GB/s unidirectional communication bandwidth to each GPU. In addition, there are three NVLink fabrics connecting the two GPUs directly. If the GPUs are not connected to the same CPU, communications must route through the inter-CPU symmetric multiprocessing (SMP) cable with unidirectional bandwidth of 32 GB/s. The POWER9 CPU connects to the system main memory with accumulated (eight channels) unidirectional bandwidth of 60 GB/s. The tested system has four nodes, connected via high-bandwidth (24 GB/s unidirectional) InfiniBand. All of the nodes use the IBM General Parallel File System (GPFS) with block size of 1 MB and bandwidth of approximately 18 GB/s.

[Figure 1: GPU-to-GPU Communication Topology. Each Tesla V100 GPU has six NVLink ports with unidirectional communication bandwidth of 25 GB/s per port. Numerically labeled boxes represent different GPU devices. The six NVLinks from device-0 are colored differently. Panels: (a) DGX-2 NVSwitch Crossbar; (b) DGX-1V and AWS P3 Hybrid Cube-Mesh Topology; (c) IBM AC922 Model 8335-GTH NVLink-enabled POWER9 CPU.]
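The NVLink topologies above determine which device pairs can exchange data directly. The snippet below is a minimal sketch (not part of the paper's benchmark suite) that lists the visible GPUs and which pairs report peer-to-peer access from PyTorch; it does not expose link counts or bandwidths, which are typically inspected with nvidia-smi topo -m instead.

```python
import torch

# Minimal sketch: enumerate visible GPUs and report which pairs support direct
# peer-to-peer access (as NVLink- or PCIe-connected peers typically do).
# This only confirms basic device-to-device reachability.
def report_gpu_topology():
    n = torch.cuda.device_count()
    for i in range(n):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 2**30:.1f} GiB")
    for i in range(n):
        peers = [j for j in range(n)
                 if j != i and torch.cuda.can_device_access_peer(i, j)]
        print(f"GPU {i} peer access -> {peers}")

if __name__ == "__main__":
    report_gpu_topology()
```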
RTX: The Exxact TensorEX 4U server (TS4-1598415-DPN) is equipped with eight NVIDIA consumer-grade GeForce RTX 2080 Ti GPUs [15]. Each RTX 2080 Ti GPU has 4352 CUDA cores and 11 GB GDDR6 GPU memory with 616 GB/s memory bandwidth. It can reach a peak performance of 13.4 TFLOPS for single-precision performance, or about 85.4% of the V100 GPU's peak performance. The specific server tested in this work has two hyper-threaded 12-core Intel Xeon 4116 CPUs (48 logic cores in total) with base frequency of 2.1 GHz. All eight GPUs are connected via a PCIe bus. Compared to other high-end V100 GPU-based solutions, the RTX GPU cards are a unique feature for this system. As such, we refer to this system as RTX.

B. Software Environment

Because of its popularity among AI researchers, its well-designed user interface, and native support for NVIDIA communication and computation backend kernels and MPI, we use the PyTorch DL platform. To maintain a consistent and reproducible software environment, we use docker containers, which also alleviate the difficulty in migrating the DL models to other hardware systems and reduce the performance differences introduced by distinct software environments. For the x86 architecture (Intel Xeon CPU) systems, including DGX-1, DGX-2, AWS P3, and RTX, we use the NVIDIA official PyTorch docker image (NVCR)^1 as the base software environment. For the ppc64le architecture (IBM POWER9 CPU) system, IBM-P9, we use the PowerAI v1.6 [16]. Nevertheless, to ensure our work is reproducible, Table I lists the exact library versions of the NVIDIA docker and the PowerAI v1.6. The NVIDIA CUDA library is a programming interface to NVIDIA GPUs for parallel computing, while NVIDIA's cuDNN (deep neural network) library provides device-level optimized, neural-network-related backend kernels. The NVIDIA NCCL (collective communication) library provides a multi-GPU communication interface, supporting several communication means, such as NVLink, PCIe, and Ethernet.

^1 nvcr.io/nvidia/pytorch:18.11-py3

Table I: Software Environment
Library   NVIDIA NVCR   IBM PowerAI
PyTorch   1.0.0a0       1.1.0
CUDA      10.0.130      10.1.168
cuDNN     7.401         7.501
NCCL      2.307         2.407

III. DEEP LEARNING MODELS

A. Data Movement and Communication Between Devices

Deep learning is a data-driven modeling approach. The training process, known as stochastic gradient descent, consists of numerous iterations of feeding data to the model and adjusting the model parameters to reduce the predefined loss. At each iteration, a batch of data is selected at random (without replacement). The data are loaded from the hard drive to the host memory, and, sometimes, preprocessing data-augmentation procedures are applied using CPU threads, such as randomly flipping images or adjusting image sizes. Then, the preprocessed batch is sent to the GPU memory via PCIe bus.

The bulk of actual computation usually is done on one or multiple GPUs. In the multiple GPU case, the execution is done in a SIMD fashion, so each GPU has an exact replica of the neural network model and applies the exact executions on different sampled data batches. In the ideal case, the throughput would grow linearly with the number of GPUs. At the end of every iteration, all of the model replicas require synchronization. This synchronization is done by a collective communication using NCCL. Most of the results in this work use the NCCL all-reduce kernel. Therefore, the two major factors affecting the time cost of communication are: 1) the inter-device communication bandwidth and 2) the number of model parameters.

For this work, we have selected several representative DL models to cover different ranges of parameters, computation-communication ratios, application domains, and various types of neural network DL layers. Because of the vast number of potential DL models, we are unable to test all of them exhaustively. However, by providing detailed descriptions and computation characteristics for these select models, readers should be able to easily estimate the performance (in terms of computation efficiency, not model accuracy) of other models, as the fundamental types of numeric operations are comparable.

As computer vision and NLP are the two most successful application domains for DL, we choose the AlexNet model and ResNet model from the computer vision domain and the BERT model from NLP to represent examples of DL methods in these areas. We analyze the models in terms of their number of trainable parameters and operations. The former affects the memory footprint as well as the inter-device communication costs, while the latter impacts the on-device computation time. The computation cost per iteration scales linearly with the number of instances per sampled data batch, known as the batch size. However, the actual computation cost depends on many other factors. Table II provides a summary of the number of parameters and operations per instance for all of the models presented in this work.

Table II: Tested Deep Learning Models
Model Name    Param.    Ops/ins.
AlexNet       61.10 M   0.72 G
ResNet18      11.69 M   1.83 G
ResNet50      25.56 M   4.14 G
ResNet101     44.55 M   7.88 G
ResNet152     60.19 M   11.62 G
BERT-SWAG     109.5 M   0.19 G
BERT-SQuAD    109.5 M   2.87 G
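The synchronization pattern described in Section III-A maps directly onto PyTorch's distributed data-parallel machinery. The sketch below is illustrative only; the model, data loader, and launch mechanics are placeholder assumptions, not the authors' benchmark code. It shows one process per GPU, an NCCL process group, and the gradient all-reduce that happens during the backward pass.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Illustrative sketch of one training process per GPU. It assumes the process
# was launched with a standard PyTorch distributed launcher so that RANK,
# WORLD_SIZE, and MASTER_ADDR/PORT are already set in the environment.
def train_one_epoch(model, loader, device, lr=0.1):
    if not dist.is_initialized():
        dist.init_process_group(backend="nccl")   # NCCL backend for GPU all-reduce
    model = model.to(device)
    ddp_model = DDP(model, device_ids=[device.index])
    opt = torch.optim.SGD(ddp_model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for inputs, targets in loader:                # each rank sees a different data shard
        inputs = inputs.to(device, non_blocking=True)
        targets = targets.to(device, non_blocking=True)
        opt.zero_grad()
        loss = loss_fn(ddp_model(inputs), targets)
        loss.backward()                           # gradients are all-reduced across replicas here
        opt.step()                                # every replica applies the same update
```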
B. Computer Vision

The goal of computer vision is to make computers gain high-level "understanding" of images. To evaluate if a program (AI model) truly "understands" the image, researchers have developed different evaluation tasks to measure its comprehension. One type of these tasks, known as image classification, provides an image to the program and asks which predefined class the image belongs to. For example, the MNIST (handwritten digit database) asks the program to tell which digit, from 0 to 9, the grayscale image (28-by-28 pixels) belongs to. This is considered one of the simplest computer vision tasks, and traditional machine learning methods, such as the support vector method, have reached 99.2% accuracy [17]. The ImageNet Large Scale Visual Recognition Challenge, or ILSVRC [18], a much more challenging image classification test, was introduced in 2010. It contains 1000 predefined classes (including 60 different dog breeds) and more than a million training images. The best-performing model in the first ILSVRC (2011) achieved only about a 25% top-five error rate.^2 In 2012, AlexNet [19], considered the first modern CNN-based model, successfully reduced the top-five error rate to 16.4%. In 2015, ResNet [20] further reduced the error rate to 3.57%. It also introduced residual blocks to mitigate the "vanishing gradient problem" when the neural network becomes too deep.

^2 Top-five error rate: for each test image, the algorithm is allowed to give five predictions. If any of the five predictions match the ground truth, it is considered a hit.

A deep neural network is a stack of multiple neural network layers, usually of varying kinds. Each layer takes the previous layer's output as its input, where both input and output are tensors. A Linear layer is one of the simplest kinds: a matrix of size c_i × c_o, where c_i and c_o are the number of input and output channels. Therefore, the number of parameters of a Linear layer is on the order of O(c_i c_o), or c_o(c_i + 1) to be precise, where the "1" is the bias term. The operation performed by a Linear layer essentially is a general matrix-matrix multiplication (GEMM). In most cases, the multiplier matrix (input) has a dimension of B × c_i, and the multiplicand matrix (Linear layer weights) has a dimension of c_i × c_o. As such, the number of operations for a batch size B is B × (c_i + 1) × c_o. One could deduce that the operation-to-parameter ratio Γ for a Linear layer is B: Γ_Linear = B, implying that computation cost grows linearly with the number of parameters in the Linear layer and the batch size.

A two-dimensional convolutional (Conv2D) layer consists of c_o kernels of size c_i × k × k. Therefore, the exact number of parameters of a Conv2D layer is c_o(k^2 c_i + 1). A kernel is simply a small tensor applied to the input tensor in a sliding-window fashion, where the step size is called the stride. When the stride is greater than one, the input tensor is downsampled in the spatial dimension. The number of operations for a Conv2D layer can be calculated by considering the number of times the kernel has been applied and the cost of applying each kernel. Applying a Conv2D kernel on an input tensor of size B × c_i × H_i × W_i means performing a tensor dot product of c_i × k^2 on every pixel of the spatial dimension H × W. For simplicity, assume the striding step is 1 and padding is k/2, such that the spatial dimension is unchanged: H_o = H_i and W_o = W_i. Thus, each kernel has been applied H_o × W_o times.^3 For each kernel application at every pixel level, a GEMM operation is performed, which costs C ≡ c_o(c_i k^2 + 1). Therefore, in total, the number of operations of the Conv2D layer is H_o × W_o × C. Because the number of parameters of a Conv2D layer is also C, the operation-to-parameter ratio Γ for the Conv2D layer is Γ_Conv2D = B H_o W_o. As in the case of the Linear layer, the total number of operations scales with the batch size. Yet, in contrast to the Linear layer, the total number of operations also depends on the spatial dimension of the output tensor. Each parameter of a Conv2D layer has been operated on H_o W_o more times than a parameter in a Linear layer.

^3 Note that by setting striding greater than one, fewer kernel operations will be applied, which can reduce the spatial dimension (downsampling); whereas by setting the space between kernel points (dilation), the spatial dimension can increase (upsampling). The computation cost analysis is similar.

AlexNet consists of five Conv2D layers of ~2^21 parameters in total, two hidden Linear layers (~2^25), and one output Linear layer (~2^22). The Linear layers thus use an order of magnitude more parameters. Compared to AlexNet, ResNet consists almost entirely of Conv2D layers, except the final Linear layer for classification output. The sub-types of ResNet models are labeled as ResNetX, where X represents the total number of parameterized layers (Conv2D and Linear). The choices of X in the original paper [20] are 18, 34, 50, 101, and 152. ResNet18 serves as a high-throughput (small number of operations), low-accuracy model because of its small amount of parameters, while ResNet152 has the highest accuracy but slowest training throughput. Using ResNet50 for ImageNet data (1000-way classification) as a concrete example, the model contains about 2^24.6 parameters, where only 2^21 are from the Linear layer. As discussed, each parameter of a Conv2D layer contributes a factor of H_o × W_o more operations than one in a Linear layer. As such, ResNet has a much higher operation-to-parameter ratio than AlexNet.
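The counting rules above reduce to a few lines of arithmetic. The helper below is a sketch of those formulas (not the authors' tooling); plugging in a layer's channel counts, kernel size, output resolution, and batch size reproduces the parameter counts and operation-to-parameter ratios Γ discussed in this section. The example values at the bottom are illustrative, not taken from a full model.

```python
# Sketch of the Linear/Conv2D counting rules from Section III-B.
def linear_counts(c_in, c_out, batch):
    params = c_out * (c_in + 1)            # c_o(c_i + 1), "+1" is the bias term
    ops = batch * (c_in + 1) * c_out       # one GEMM over a batch of size B
    return params, ops, ops / params       # ratio Gamma_Linear = B

def conv2d_counts(c_in, c_out, k, h_out, w_out, batch):
    params = c_out * (c_in * k * k + 1)    # C = c_o(c_i k^2 + 1)
    ops = batch * h_out * w_out * params   # kernel applied H_o * W_o times per instance
    return params, ops, ops / params       # ratio Gamma_Conv2D = B * H_o * W_o

if __name__ == "__main__":
    # A 3x3, 64->64 Conv2D on a 56x56 feature map vs. a 2048->1000 Linear layer,
    # both at batch size 128.
    print(conv2d_counts(c_in=64, c_out=64, k=3, h_out=56, w_out=56, batch=128))
    print(linear_counts(c_in=2048, c_out=1000, batch=128))
```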
C. Natural Language Processing

NLP is another successful application of DL techniques. Some NLP tasks include speech recognition, translation, speech-to-text (and vice versa), and question-and-answer systems. In the pre-DL era, NLP was dominated by hidden Markov models [21]. Mikolov et al. [22] introduced a DNN-based word embedding model to represent words as vectors based on their context. Namely, similar words would have comparable context around them and end up closer in the vector space. This approach provides a meaningful way to represent non-numeric entities, i.e., words, as numeric vectors and provides a foundation for solving a diverse range of NLP tasks. Graves et al. [4] developed a deep recurrent-neural-network-based approach to perform automatic speech recognition and broke the TIMIT phoneme recognition benchmark record [23]. By the end of 2016, all major technology companies had adopted the DNN-based approach for their speech recognition systems. Vaswani et al. [24] introduced the attention mechanism into NLP tasks and demonstrated its superior performance in natural language translation tasks. The particular NLP model in this work, BERT, uses bidirectional transformers [11] and exceeded 11 NLP benchmark records in November 2018.^4

^4 As of March 2019, OpenAI and Microsoft have released their model challengers to BERT.

The BERT model has two training phases: 1) pre-training and 2) fine-tuning. In the pre-training phase, BERT uses the semi-supervised sequence learning approach [25] by masking out a random word in a sentence. Unlike other previous unidirectional approaches, BERT tries to predict the masked word from both directions. Training is done on large unlabeled corpora, such as the English Wikipedia (2,500 million words). Herein, this pre-trained model is known as the base-model. In the task-specific fine-tuning phase, the base-model connects with a classification Linear layer designed for the specific task. The data used for fine-tuning are labeled and much smaller compared to the large corpora [26]. The majority of attention mechanism operations are matrix multiplication and layer-wise normalization. For details regarding how the attention mechanism works, readers can refer to several available guides.^5,^6

^5 http://nlp.seas.harvard.edu/2018/04/03/attention.html
^6 https://jalammar.github.io/illustrated-transformer/

We use the pre-trained BERT base-model and fine-tune it for two specific NLP tasks: SWAG and the Stanford Question Answering Dataset (SQuAD). SWAG [27] is a multiple-choice task. Given a situation described by a sentence as input, the model is asked to select the most plausible scenario that happens next among multiple choices. SQuAD [28] is a question answering task, where a pair that includes a question and a relevant paragraph (containing the answer) is provided, and the model is tasked to find the answer in the given paragraph.

Although the base model is the same, to fully cover the training data, different max-seq-lengths are used. We use a max-seq-length of 80 for SWAG and 384 for SQuAD. As the max-seq-length determines the attention span, it takes more operations to perform the SQuAD task. Table II features the number of model parameters and estimated operations of BERT-SWAG and BERT-SQuAD, respectively. Of note, our benchmark code is modified from the source code.^7

^7 https://github.com/huggingface/pytorch-pretrained-BERT
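For orientation, the fragment below shows what a SQuAD-style forward pass looks like with a pre-trained BERT base-model and a 384-token max-seq-length. It is a hedged sketch using the current Hugging Face transformers package, a successor of the repository cited in footnote 7, rather than the paper's modified benchmark code; the question and context strings are invented examples, and the untuned question-answering head will not give a meaningful answer without fine-tuning.

```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Illustrative only: pre-trained BERT base-model with a question-answering head,
# inputs padded/truncated to max-seq-length 384 as in the BERT-SQuAD benchmark.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")

question = "Where is Brookhaven National Laboratory located?"
context = "Brookhaven National Laboratory is located in Upton, New York."
inputs = tokenizer(question, context, max_length=384,
                   padding="max_length", truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
start = int(outputs.start_logits.argmax())   # predicted answer span start
end = int(outputs.end_logits.argmax())       # predicted answer span end
print(tokenizer.decode(inputs["input_ids"][0][start:end + 1]))
```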
IV. PERFORMANCE ANALYSIS

This section details the performance analysis of DL workloads using the four systems (already described) under consideration. The all-important communication performance is presented first. Given the different workload characteristics, the analysis is done separately for large-scale and high-throughput models. Performance details for an increasingly popular code expression (due to ease of coding), PyTorch's On-node Data Parallel [29], also are included.

A. Communication Performance

As shown in Section II-A, leading-edge systems implement various direct high-bandwidth inter-device communication topologies based on NVLink. NCCL^8 provides MPI-like primitives for multi-GPU and multi-node collective communications. The library is optimized for NVIDIA GPU devices to achieve high communication bandwidth over NVLink and PCIe (when necessary). NCCL supports collective communication primitives, such as all-reduce, all-gather, reduce-scatter, reduce, and broadcast.

^8 https://developer.nvidia.com/nccl

As the most relevant communication kernels occurring in the benchmarks considered, all-reduce and broadcast are examined for performance using NVIDIA's NCCL-tests code.^9 Results are presented normalized to the "bus bandwidth," a concept described by NVIDIA in the NCCL-tests.^10 Bus bandwidth is obtained by applying to the measured bandwidth ("message size"/time) a normalization factor, different for each communication kernel, that reflects its communication complexity and topological mapping to the network. Because the bus bandwidth reflects how optimally the hardware is used, it provides a consistent and normalized way to compare the results with the theoretical peak bandwidth, including across different communication primitives.

^9 https://github.com/NVIDIA/nccl-tests/release/tag/v1.0.0
^10 Described in detail here: https://github.com/NVIDIA/nccl-tests/blob/master/doc/PERFORMANCE.md
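The normalization can be summarized in a few lines. The sketch below follows the factors documented in the NCCL-tests performance notes (footnote 10): all-reduce applies a 2(n-1)/n factor to the algorithmic bandwidth, while broadcast uses the algorithmic bandwidth directly. It is a reading aid, not the measurement code used in this work.

```python
# Sketch of the NCCL-tests "bus bandwidth" normalization for the two kernels
# used in this work (factors taken from the NCCL-tests performance notes).
def bus_bandwidth(bytes_moved, seconds, n_ranks, primitive):
    alg_bw = bytes_moved / seconds                 # algorithmic bandwidth, bytes/s
    if primitive == "all_reduce":
        factor = 2.0 * (n_ranks - 1) / n_ranks     # each rank sends and receives ~2(n-1)/n of the data
    elif primitive == "broadcast":
        factor = 1.0                               # bus bandwidth equals algorithmic bandwidth
    else:
        raise ValueError("only all_reduce and broadcast are covered in this sketch")
    return alg_bw * factor

# Example: a 1 GiB all-reduce over 16 GPUs completing in 16 ms
print(bus_bandwidth(2**30, 16e-3, 16, "all_reduce") / 1e9, "GB/s")
```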
In this work, data size varies from 1 MB to 1 GB, which covers the communication needs for synchronizing model parameters. Each data point is averaged over 500 iterations, except for the case of 16 GPUs using two AWS P3s, which is averaged over 50 iterations due to the slow inter-node Ethernet connection. Figure 2 illustrates the results.

[Figure 2: Communication Bus Bandwidth. (a) All-reduce; (b) Broadcast.]

The DGX-2 consistently achieves 120 GB/s for large message sizes, regardless of the number of GPUs involved in the communications. This can be attributed to the NVSwitch's link bandwidth and contention properties (described in Section II-A).

The AWS P3 and DGX-1V yield analogous, if not exactly duplicate, results because they share the same hybrid cube-mesh topology (refer to Figure 1b). Because of the heterogeneity of this topology, the measured peak bandwidth depends on the devices involved in the communication. In the case of two GPUs, the test employs device-0 and device-1, which are connected via a single NVLink that offers 25 GB/s theoretical unidirectional bandwidth. For four GPUs, device-0 to -3 are used, and the NVLinks connecting to device-4 to -7 are not. The observed bandwidth is about 80 GB/s. For eight GPUs, the DGX-1 surpasses the DGX-2 in the all-reduce tests (Figure 2a). In the broadcast test (Figure 2b), the crossover occurs when the message size exceeds 256 MB. While these results may seem unexpected due to the higher bandwidth and topological richness of the NVSwitch compared to the NVLink, the actual explanation stems from the communication protocol changes introduced on the NVSwitch [30]. Here, posted requests are converted to non-posted, which, in turn, requires acks at the expense of bandwidth in the reverse direction. This is not the case on the DGX-1V without NVSwitch. With access to only one DGX-1, the 16 GPU case was done on AWS P3. The two AWS P3dn nodes are connected via a 100 Gbits/s multi-flow Ethernet connection. The experimental setup in the AWS cloud allowed for only a single flow (review Section II-A) with a peak bandwidth of 1.25 GB/s. In this case, the communication bandwidth clearly is bottlenecked by the slow Ethernet connection.

IBM-P9 uses half of the NVLinks for CPU-GPU communication (Figure 1c). This leaves three NVLinks to connect device-0 and device-1. In the case of two GPUs, the measured bus bandwidth of 70 GB/s is quite close to the theoretical peak of 75 GB/s. However, with four GPUs, the bus bandwidth reduces to about 30 GB/s, matching the theoretical SMP bus bandwidth of 32 GB/s when connecting two POWER9 CPUs. Higher-count GPU configurations on the IBM-P9 (eight- and 16-GPU) exhibit lower bus bandwidth (Figure 2). This achieved performance is due to NCCL not being optimized for the InfiniBand interconnect.

The RTX system does not use NVLink technology, and all eight RTX 2080 Ti GPUs connect through a PCIe bus. Therefore, the communication bandwidth is throttled down by the PCIe bus. Despite its inferior communication performance, the RTX system serves as the baseline for other systems.

B. Performance of Deep Learning Workloads

Computation performance is measured in terms of the model training throughput: the average number of training samples, or instances, the system can process per second. For each different combination of models, batch sizes, and number of GPUs, time intervals are measured between consecutive iterations during training. For computer vision DL models, each model runs for 200 iterations. For the BERT models, the reported throughput is averaged over one training epoch.^11 The initial iterations are excluded from the statistics due to memory allocation overhead. All of the models in this section are represented in single precision (FP32).

^11 One epoch is defined as going through the entire data set once.

Distributed data-parallel training with asynchronous data prefetching is used. Each GPU is associated with j data-fetching CPU processes using CUDA streams. In these tests, j = 4. This allows data to be loaded and preprocessed asynchronously and concurrently on the CPUs while the GPUs are in use. Every GPU device holds a replica of the model and applies the model on different data batches. At each iteration's conclusion, all GPUs synchronize their parameter gradients via an all-reduce NCCL operation. Then, all model replicas individually update their parameters using the gradients. The computer vision models are trained on the ILSVRC ImageNet data set, while BERT models are fine-tuned on task-specific data sets, SWAG and SQuAD (introduced in Section III-C).

As the system performance characteristics vary for different models, we group models such as ResNet101(152) and BERT as large DL models and those with high throughput, e.g., ResNet18(50) and AlexNet, as high-throughput DL models. The large DL model results are discussed in this sub-section, while high-throughput DL models are addressed in Section IV-B2. For added clarity, the bar plots featured in this section depict systems ordered from left to right, corresponding to the system order posed in legends (inset from top to bottom).
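A minimal version of this measurement loop is sketched below, with placeholder data (torchvision's FakeData) standing in for ImageNet; it is not the authors' harness. It shows the pieces described above on a single GPU: j = 4 data-loading workers, pinned host memory with asynchronous copies, per-iteration timing, and warm-up iterations dropped from the statistics.

```python
import time
import torch
import torchvision
from torch.utils.data import DataLoader

# Illustrative single-GPU throughput measurement (instances per second).
def measure_throughput(device="cuda:0", batch_size=128, iters=200, warmup=10, j=4):
    dataset = torchvision.datasets.FakeData(
        size=batch_size * (iters + warmup),
        transform=torchvision.transforms.ToTensor())      # placeholder for ImageNet
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True,
                        num_workers=j, pin_memory=True, drop_last=True)
    model = torchvision.models.resnet50().to(device).train()
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.CrossEntropyLoss()
    times = []
    for step, (x, y) in enumerate(loader):
        torch.cuda.synchronize(device)
        t0 = time.perf_counter()
        opt.zero_grad()
        loss = loss_fn(model(x.to(device, non_blocking=True)),
                       y.to(device, non_blocking=True))
        loss.backward()
        opt.step()
        torch.cuda.synchronize(device)
        if step >= warmup:                                 # exclude warm-up iterations
            times.append(time.perf_counter() - t0)
    return batch_size / (sum(times) / len(times))

if __name__ == "__main__":
    print(f"{measure_throughput():.1f} instances/s")
```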
1) Performance Analysis of Large Deep Learning Models: Initially, the absolute throughput values of large DL models, e.g., ResNet101, ResNet152, BERT-SWAG, and BERT-SQuAD, are examined (Figure 3). As the amount of communication for synchronization depends on the number of model parameters and not on the batch size, we choose the largest batch size that can fit into the 32 GB of memory of a single V100 GPU to achieve the best possible scaling results. Specifically, the batch sizes used are: 128 per GPU for ResNet101 and ResNet152, 64 for BERT-SWAG, and 32 for BERT-SQuAD.

[Figure 3: Training Throughput of Large DL Models on RTX, IBM-P9, AWS P3, and DGX-2. (a) ResNet101; (b) ResNet152; (c) BERT-SWAG; (d) BERT-SQuAD.]

Across all four systems, the DGX-2 and AWS P3 have similar performance up to eight GPUs. This is expected, as both systems have the same V100 GPUs and are connected via high-bandwidth (over 120 GB/s) NVLinks. However, when 16 GPUs are in use, two AWS P3s communicate through a relatively slow Ethernet connection (about 1 GB/s measured). Figures 3c and 3d reveal the differences in performance, especially in BERT models where the number of parameters is large. Given its high-bandwidth inter-node communication network, the IBM-P9 exhibits similar performance to the DGX-2 all the way up to a 16 GPU configuration.

The RTX server has 11 GB of GDDR6 GPU memory per GPU. Hence, the batch sizes are even smaller: one-quarter of the size used with 32 GB on the V100 GPU on all other systems. Specifically, the batch size for ResNet101 and ResNet152 is 64, for BERT-SQuAD is 8, and for BERT-SWAG is 16. This leads to a quadrupling of the amount of communication for the same total of computed instances. RTX's slow inter-device communication via a PCIe bus further exacerbates its performance degradation. For example, in the case of 1 GPU, RTX can reach about 65.82% of the throughput of the DGX-2 averaged over four DL models, yet merely 57.27% in the case of eight GPUs (see Table III). Hence, the RTX server is the least efficient system for large model distributed training.

Table III: Instances per second for RTX relative to DGX-2
Model Name     1 GPU     2 GPUs    4 GPUs    8 GPUs
AlexNet        78.19%    63.01%    53.41%    47.95%
ResNet18       73.50%    69.13%    64.39%    54.80%
ResNet50       67.97%    62.67%    62.97%    61.75%
Average        73.22%    64.94%    60.26%    54.83%
ResNet101      69.70%    63.72%    64.15%    62.69%
ResNet152      69.73%    62.45%    62.96%    61.90%
BERT-SWAG      64.04%    57.52%    57.20%    56.25%
BERT-SQuAD     59.81%    49.79%    49.74%    48.22%
Average        65.82%    58.37%    58.51%    57.27%
Overall avg.   68.99%    61.19%    59.26%    56.22%

To examine the scaling more closely throughout the full span of GPU configurations, we plot the throughput for all DL models in a log-log scale (Figure 4), where the dashed reference line depicts linear scalability. If the measured throughput follows the reference line, or maintains a constant gap, it has good parallel scalability. The DGX-2 exhibits good scalability on all four models, whereas AWS P3 shows linear scalability up to eight GPUs. For the RTX, there is a significant drop from one GPU to two GPUs in terms of scalability, because one-GPU computation does not require model synchronization, while that cost does apply for multiple GPU configurations.

[Figure 4: Linear Scaling in Log-Log Scale. Gray dashed lines are linear scaling reference lines. (a) ResNet101; (b) ResNet152; (c) BERT-SWAG; (d) BERT-SQuAD.]
models in a log-log scale (Figure 4), where the dashed refer- IBM machine, whereas the other system under consideration
(a) AlexNet (b) ResNet18 (c) ResNet50
Figure 5: Training Throughput of High-throughput DL Models on RTX, IBM-P9, AWS P3, and DGX-2.

(a) AlexNet (b) ResNet18 (c) ResNet50


Figure 6: Examining Scaling in Log-Log Scale. Gray dashed lines are linear scaling reference lines.

utilize local storage for the executation of these small models. On the AWS P3, the two CPUs on each node will handle
For ResNet50 (Figure 5c), all the systems exhibit linear 32 processes for the eight GPUs. On the DGX-2, the 16
scaling. Because of the ResNet50 model’s small size, the GPUs require 64 CPU data-fetching processes from the two
slow inter-node Ethernet bandwidth of the AWS P3 does not associated CPUs. To explain why the AWS P3 outperforms
bottleneck the distributed training throughput performance. the DGX-2 in Figure 5b requires determining if the scaling
Because AlexNet uses more than twice the number of inconsistency stems from a lower core frequency speed and/or
parameters of ResNet50, throughput performance is throttled cache capacity effects. Figure 7a shows CPU core speed
down by the slow Ethernet connection on AWS P3 when measurements (enabled given Turbo Boost technology) for
two nodes (with a total of 16 GPUs) are in use (Figure 5a). both systems while varying j from 1 to 16 on the DGX-2 and
Even on the DGX-2, AlexNet does not scale linearly to 16 AWS P3. For example, if j = 16 and DGX-2 uses all 16 GPUs,
GPUs (shown in Figure 6a). When 16 GPUs are in use on the there are 256 CPU processes in total. The light green curve
DGX-2, AlexNet spends about 80% of the active GPU time (Figure 7a) depicts the case when only eight GPUs on the
in communication, whereas ResNet50 spends only about 4%. DGX-2 are in use, in which case the DGX-2 has slightly better
Given its smallest amount of parameters, ResNet18’s need performance than AWS P3 5b. When using j = 1 CPU process
for inter-device communication is modest. Even so, as shown per GPU, the DGX-2’s CPU core speed is much higher than
in Figure 6b, the scaling is not ideal. An interesting observa- that of the AWS P3 because of its superior CPU performance
tion is that when using 16 GPUs, the AWS P3 performs better characteristics (see Section II-A). However, as j increases,
than the DGX-2 (Figure 5b). the DGX-2’s CPU core speed decreases, which is typical for
Intel Turbo Boost technology. For j = 4, the specific case
present in the benchmark runs (also shown by the vertical
dotted line in Figure 7a), the DGX-2 maintains a higher CPU
core speed than that for the AWS P3. Hence, clock frequency
is not the sole explanation for the performance inconsistency.
To understand the exact amount of work the CPU does per
unit time, Figure 7b shows the metric of instructions per cycle
(IPC). The IPC of the DGX-2 using 16 GPUs at j = 4 is
much lower than that of AWS P3: 1.35 versus 1.90, pointing
(a) CPU Core Speed (b) Instructions per Cycle to cache utilization inefficiencies.12 Additional measurements
Figure 7: CPU Performance Bottleneck of ResNet18. of L1-cache data loading speed and data-translation lookaside
buffer (TLB) load misses confirm this hypothesis.The data also
Recall from Section IV-B that in all experiments, each GPU 12 Note: The tested Intel Xeon CPU can reach theoretical maximum of four
is associated with (j =) 4 CPU processes for prefetching data. IPC when instructions are perfectly aligned by manual loop unrolling.
For RTX versus DGX-2 performance, when one or two GPUs are in use, RTX performance is close to that of the DGX-2 (refer to Table III). Because of their smaller GPU memory footprints, high-throughput workloads look more suitable on RTX than large models. Just as with the case of performance on large models, RTX's scalability is less than for the DGX-2 (see Table III and Figure 6) due to its slower communication performance. This makes the RTX system most suited for small-scale model development rather than full-scale training workloads.

C. Performance of Mixed-Precision Training

Mixed-precision training [31] retains most if not all neural network predictive performance, yet offers significant computational speedup and reduces the memory footprint. Recent NVIDIA GPU architectures, including the V100 (Volta) and the RTX 2080 Ti (Turing), provide dedicated hardware acceleration called "tensor cores" [32] for this purpose. The tensor core provides high-throughput fused multiply-add (FMA) operations for mixed-precision matrices (inputs in half precision, outputs in either half or single precision). The other advantage of using mixed precision is the smaller memory footprint, and therefore less communication overhead for synchronizing the model replicas.

Figure 8a shows the performance of ResNet50 on the DGX-2 when using mixed precision (FP16) for batch size (bsz) 128 and 256, comparing it to the performance when using single precision (FP32) for the same model; the relative speedup is shown in Figure 8b. Except for the 16-GPU configuration, we achieve more than a factor of 2 performance boost. Moreover, since the memory footprint is smaller for FP16, we can accommodate a larger batch size of 256. Doubling the batch size halves the synchronization and parameter update time for training the same overall amount of data. For the 16-GPU configuration, the speedup is only ×1.7. This is likely due to the cache effect described in Section IV-B2. Note that this performance is very similar to the one reported by NVIDIA.^14

^14 https://developer.nvidia.com/deep-learning-performance-training-inference

[Figure 8: Performance of ResNet50 on DGX-2 for single precision (FP32) and mixed precision (FP16). (a) ResNet50 on DGX-2; (b) Speedup using FP16 relative to FP32.]
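As a point of reference, the fragment below shows how the same FP16-with-FP32-master-weights recipe is expressed in current PyTorch with torch.cuda.amp. It is a hedged sketch: the paper predates this API and does not state which mixed-precision tooling was used, so this illustrates the technique rather than reproducing the benchmark code.

```python
import torch

# Illustrative mixed-precision training step: matrix multiplies and convolutions
# run in FP16 on tensor cores inside autocast, while loss scaling guards the
# FP16 gradients and the optimizer keeps FP32 master parameters.
def train_step_amp(model, opt, scaler, inputs, targets, loss_fn):
    opt.zero_grad()
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()    # scale the loss to avoid FP16 gradient underflow
    scaler.step(opt)                 # unscales gradients, then applies the FP32 update
    scaler.update()                  # adjust the scale factor for the next step
    return loss.detach()

# Usage (model, optimizer, and data are placeholders):
# scaler = torch.cuda.amp.GradScaler()
# loss = train_step_amp(model, opt, scaler, x.cuda(), y.cuda(),
#                       torch.nn.CrossEntropyLoss())
```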

D. Comparing the PyTorch On-node Data Parallel with Distributed Data Parallel

Until now, all of the results herein use the highly optimized distributed data parallel code to achieve the highest system performance possible. By contrast, PyTorch on-node data parallel is an easy-to-use method for enabling computations on multiple GPUs. Code modifications basically are confined to the introduction of a directive-like instruction that wraps a non-parallel PyTorch Module with a DataParallel syntax, such as model = torch.nn.DataParallel(model).^15 The communication pattern of on-node data parallel differs from the distributed data parallel. In it, one GPU maintains a master copy of the model parameters. At every iteration, it broadcasts the parameters to the other GPUs in the configuration. At the end of every iteration, the parameters are "all-reduced" back to the master GPU, which updates the model parameters. Therefore, for each iteration, two global communications (broadcast and reduce) are issued. To emulate the common practice of most PyTorch models, we use the default PyTorch data loader for on-node data parallel experiments (torch.utils.data.DataLoader), which supports multi-worker and pinned memory but not asynchronous data loading. PyTorch's on-node data parallel design maximizes its usefulness but targets small parallel GPU configurations, such as those common in workstations.

^15 https://pytorch.org/docs/stable/_modules/torch/nn/parallel/data_parallel.html

[Figure 9: Relative Throughput Performance of ResNet50 between PyTorch On-node Data Parallel and Distributed Data Parallel.]

Figure 9 presents the relative performance of models expressed as on-node data parallel compared to the distributed data parallel algorithms for all systems considered. The experiments are done using ResNet50. For one GPU, the two data parallel schemes produce similar results. As more GPUs are utilized, performance decreases when using on-node data parallelism. When two GPUs are in use, DGX-2 and AWS P3 achieve about 90% of the distributed data parallel performance. Then, it drops rapidly for larger numbers of GPUs. The IBM-P9 can maintain above 90% up to 4 GPUs.
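The contrast with the distributed code path can be seen in how little is required to enable the on-node path. The sketch below wraps a placeholder ResNet50 in torch.nn.DataParallel with the default DataLoader settings discussed above; it illustrates the use mode compared in Figure 9, not the exact experiment script, and FakeData stands in for the real data set.

```python
import torch
import torchvision
from torch.utils.data import DataLoader

# Illustrative on-node data parallel setup: one process drives all visible GPUs;
# inputs are scattered and outputs gathered on the master GPU each iteration.
dataset = torchvision.datasets.FakeData(size=2048,
                                        transform=torchvision.transforms.ToTensor())
loader = DataLoader(dataset, batch_size=256, shuffle=True,
                    num_workers=4, pin_memory=True)

model = torch.nn.DataParallel(torchvision.models.resnet50().cuda())
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

for x, y in loader:
    opt.zero_grad()
    loss = loss_fn(model(x.cuda(non_blocking=True)), y.cuda(non_blocking=True))
    loss.backward()    # gradients are reduced onto the master replica
    opt.step()         # master parameters updated, then re-broadcast next iteration
```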
V. CONCLUSION

In this work we analyzed the performance of several leading-edge systems architected for DL workload performance: DGX-2, AWS P3, and IBM-P9. We also considered a consumer-grade, budget-efficient system: an RTX 2080 Ti server. The inclusion of AWS P3, which essentially is a DGX-1 system, was done to explore performance along the ever-increasing use of cloud computing scenarios for DL workloads. The tested DL models spanned the computer vision and NLP domains, are realistic, and actually are used in real-life DL applications. By varying the types of neural network models and batch sizes per GPU, the systems were probed using different realistic computation and communication scenarios. Some of the specific performance aspects revealed in this work include:

• The DGX-2 offered the best 16 GPU collective communication, making it most suited for training large models on 16 GPUs.
• When training on eight GPUs, the DGX-1, AWS P3, and DGX-2 afforded similar performance.
• Because of the limited GPU memory and PCIe bandwidth, when eight GPUs are in use, the RTX 2080 Ti server can reach about 61.46% of the throughput performance offered by the leading-edge systems considered in this evaluation.
• The cloud-use scenario does not lead to very large performance degradation when the communication-to-computation ratio of the DL models is low. However, achieving that level of performance requires extensive understanding of the cloud environment to maximize performance by minimizing system contention, ensuring geographical closeness of systems, and handling other idiosyncratic tasks.
• Scalability of the DL models was investigated up to the sizes of the DGX-2 machine available as a standalone system. Future work will need to consider scaling up to production-size DL models.

Practical considerations can be readily extracted from the work documented in this paper, including guidance for procuring systems that maximize performance for a given workload of interest, as well as for considering the choice of machines, DL models, and use modes. While as part of this work we implicitly considered cost impacts in system selection, readers are left to weigh such an analysis (and aspects related to it) on their own.

ACKNOWLEDGMENT

The authors extend their sincere gratitude to Ethan Hereth (University of Tennessee at Chattanooga) for his exhaustive assistance and support related to the IBM-P9, as well as Anthony Skjellum (University of Tennessee at Chattanooga) for his additional oversight. They also thank IBM's Xinghong He, Mladen Karcic, and Douglas L. Lehr for facilitating access to internal benchmarking resources (four IBM-P9 node configuration) used in this work. Thanks also to Brian Barrett (Amazon Web Services) for his assistance related to the AWS P3, and to Craig Tierney and Louis Capps (both of NVIDIA) and Zhihua Dong (Brookhaven Lab Computational Science Initiative) for DGX-2 benchmarking support. The authors are grateful for the significant assistance received from Charity Plata (Brookhaven Lab) in the editing and graphics enhancements of this paper.

This performance analysis was funded as part of the Exploiting the Convergence of Research Challenges in Scientific Discovery And National Security program within Brookhaven Lab's Computational Science Initiative, with additional hardware infrastructure support from the Empire State Development Corporation. Brookhaven National Laboratory is operated and managed for the U.S. Department of Energy's Office of Science by Brookhaven Science Associates on behalf of Stony Brook University and Battelle under Contract No. DE-SC0012704.

REFERENCES

[1] A. Sodani, R. Gramunt, J. Corbal, H. Kim, K. Vinod, S. Chinthamani, S. Hutsell, R. Agarwal, and Y. Liu, "Knights Landing: Second-Generation Intel Xeon Phi Product," IEEE Micro, vol. 36, no. 2, pp. 34–46, Mar. 2016.
[2] R. Adolf, S. Rama, B. Reagen, G. Wei, and D. Brooks, "Fathom: reference workloads for modern deep learning methods," in 2016 IEEE International Symposium on Workload Characterization (IISWC), Sep. 2016, pp. 1–10.
[3] D. Ciregan, U. Meier, and J. Schmidhuber, "Multi-column deep neural networks for image classification," in 2012 IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2012, pp. 3642–3649.
[4] A. Graves, A. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, May 2013, pp. 6645–6649.
[5] A. D. Malony, S. Biersdorff, S. Shende, H. Jagode, S. Tomov, G. Juckeland, R. Dietrich, D. Poole, and C. Lamb, "Parallel Performance Measurement of Heterogeneous Parallel Systems with GPUs," in 2011 International Conference on Parallel Processing, Sep. 2011, pp. 176–185.
[6] NVIDIA, "NVIDIA DGX-1 With Tesla V100 System Architecture," white paper, 2017. [Online]. Available: http://images.nvidia.com/content/pdf/dgx1-v100-system-architecture-whitepaper.pdf
[7] N. R. Tallent, N. A. Gawande, C. Siegel, A. Vishnu, and A. Hoisie, "Evaluating On-Node GPU Interconnects for Deep Learning Workloads," in PMBS@SC, ser. Lecture Notes in Computer Science, vol. 10724. Springer, 2017, pp. 3–21.
[8] A. Li, S. L. Song, J. Chen, X. Liu, N. Tallent, and K. Barker, "Tartan: Evaluating Modern GPU Interconnect via a Multi-GPU Benchmark Suite," in 2018 IEEE International Symposium on Workload Characterization (IISWC), Sep. 2018, pp. 191–202.
[9] R. Raina, A. Madhavan, and A. Y. Ng, "Large-scale Deep Unsupervised Learning Using Graphics Processors," in Proceedings of the 26th Annual International Conference on Machine Learning, ser. ICML '09. New York, NY, USA: ACM, 2009, pp. 873–880. [Online]. Available: http://doi.acm.org/10.1145/1553374.1553486
[10] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, "cuDNN: Efficient Primitives for Deep Learning," arXiv e-prints, p. arXiv:1410.0759, Oct. 2014.
[11] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," arXiv e-prints, Oct. 2018.
[12] NVIDIA, "NVIDIA Collective Communications Library (NCCL)," May 2017. [Online]. Available: https://developer.nvidia.com/nccl
[13] NVIDIA, "NVIDIA NVSwitch: The World's Highest-Bandwidth On-Node Switch," white paper, 2018.
[14] A. B. Caldeira, "IBM Power System AC922 Introduction and Technical Overview," p. 74, 2018. [Online]. Available: https://www.redbooks.ibm.com/redpapers/pdfs/redp5472.pdf
[15] NVIDIA, "NVIDIA Turing Architecture," white paper, 2018. [Online]. Available: https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf
[16] J. J. Furmanek, "PowerAI 1.6.0 Introduction: A Full Transition to Conda," Mar. 2019. [Online]. Available: https://developer.ibm.com/linuxonpower/2019/03/20/powerai-1-6-0-introduction-a-full-transition-to-conda/
[17] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning
applied to document recognition,” Proceedings of the IEEE, vol. 86,
no. 11, pp. 2278–2324, Nov. 1998.
[18] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma,
Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and
L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,”
International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp.
211–252, 2015.
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification
with Deep Convolutional Neural Networks,” in Advances in Neural
Information Processing Systems 25, F. Pereira, C. J. C. Burges,
L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc.,
2012, pp. 1097–1105. [Online]. Available: http://papers.nips.cc/paper/
4824-imagenet-classification-with-deep-convolutional-neural-networks.
pdf
[20] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning
for Image Recognition,” in 2016 IEEE Conference on Computer
Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV,
USA, June 27-30, 2016, 2016, pp. 770–778. [Online]. Available:
https://doi.org/10.1109/CVPR.2016.90
[21] M. Gales and S. Young, “The Application of Hidden Markov
Models in Speech Recognition,” Found. Trends Signal Process.,
vol. 1, no. 3, pp. 195–304, Jan. 2007. [Online]. Available:
http://dx.doi.org/10.1561/2000000004
[22] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and
J. Dean, “Distributed Representations of Words and Phrases
and their Compositionality,” in Advances in Neural Information
Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling,
Z. Ghahramani, and K. Q. Weinberger, Eds. Curran Associates, Inc.,
2013, pp. 3111–3119. [Online]. Available: http://papers.nips.cc/paper/
5021-distributed-representations-of-words-and-phrases-and-their-compositionality.
pdf
[23] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett,
“DARPA TIMIT acoustic-phonetic continous speech corpus CD-ROM.
NIST speech disc 1-1.1,” NASA STI/Recon Technical Report N, vol. 93,
Feb. 1993.
[24] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,
Ł. Kaiser, and I. Polosukhin, "Attention is All you Need," in Advances in
Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg,
S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds.
Curran Associates, Inc., 2017, pp. 5998–6008. [Online]. Available:
http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
[25] A. M. Dai and Q. V. Le, “Semi-supervised Sequence Learning,” in
Advances in Neural Information Processing Systems 28, C. Cortes,
N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds.
Curran Associates, Inc., 2015, pp. 3079–3087. [Online]. Available:
http://papers.nips.cc/paper/5949-semi-supervised-sequence-learning.pdf
[26] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever,
“Improving Language Understanding by Generative Pre-Training,”
Online, p. 12, 2018. [Online]. Available: https://openai.com/blog/
language-unsupervised/
[27] R. Zellers, Y. Bisk, R. Schwartz, and Y. Choi, “SWAG: A Large-Scale
Adversarial Dataset for Grounded Commonsense Inference,” in EMNLP.
Association for Computational Linguistics, 2018, pp. 93–104.
[28] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, “SQuAD:
100,000+ Questions for Machine Comprehension of Text,” in
Proceedings of the 2016 Conference on Empirical Methods in
Natural Language Processing. Austin, Texas: Association for
Computational Linguistics, Nov. 2016, pp. 2383–2392. [Online].
Available: https://www.aclweb.org/anthology/D16-1264
[29] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin,
A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in
PyTorch,” in NIPS-W, 2017.
[30] C. Tierney, "NCCL DGX1v DGX2 (Personal Communication)," Apr. 2019, NVIDIA.
[31] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, and H. Wu, "Mixed precision training," in International Conference on Learning Representations, 2018. [Online]. Available: https://openreview.net/forum?id=r1gs9JgRZ
[32] NVIDIA, "NVIDIA Tesla V100 GPU Architecture," white paper, Aug. 2017. [Online]. Available: https://images.nvidia.com/content/volta-architecturae/pdf/volta-architecture-whitepaper.pdf
