0% found this document useful (0 votes)
52 views18 pages

Vpipe A Virtualized Acceleration System For Achieving Efficient and Scalable Pipeline Parallel DNN Training

This document describes vPIPE, a system for efficiently training large deep neural networks using pipeline parallelism. Pipeline parallelism partitions a neural network across multiple GPUs to overcome memory limitations. Existing pipeline parallel systems use static partitioning that cannot adapt to changing network architectures during neural architecture search. vPIPE introduces two key contributions: (1) an online algorithm that dynamically searches for near-optimal partitioning and memory management during training, and (2) a layer migration protocol that rebalances the layer distribution across GPUs. This allows vPIPE to train networks more efficiently than static partitioning approaches, improving training throughput by 61.4-463.4% on various networks.

Uploaded by

matheus.zsimon
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
52 views18 pages

Vpipe A Virtualized Acceleration System For Achieving Efficient and Scalable Pipeline Parallel DNN Training

This document describes vPIPE, a system for efficiently training large deep neural networks using pipeline parallelism. Pipeline parallelism partitions a neural network across multiple GPUs to overcome memory limitations. Existing pipeline parallel systems use static partitioning that cannot adapt to changing network architectures during neural architecture search. vPIPE introduces two key contributions: (1) an online algorithm that dynamically searches for near-optimal partitioning and memory management during training, and (2) a layer migration protocol that rebalances the layer distribution across GPUs. This allows vPIPE to train networks more efficiently than static partitioning approaches, improving training throughput by 61.4-463.4% on various networks.

Uploaded by

matheus.zsimon
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 18

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 33, NO.

3, MARCH 2022 489

vPIPE: A Virtualized Acceleration System for


Achieving Efficient and Scalable Pipeline
Parallel DNN Training
Shixiong Zhao , Fanxin Li, Xusheng Chen , Xiuxian Guan, Jianyu Jiang , Dong Huang, Yuhao Qing,
Sen Wang, Peng Wang, Gong Zhang, Cheng Li , Ping Luo, and Heming Cui , Member, IEEE

Abstract—The increasing computational complexity of DNNs achieved unprecedented successes in various areas such as machine
vision and natural language processing (NLP), e.g., the recent advanced Transformer has billions of parameters. However, as large-
scale DNNs significantly exceed GPU’s physical memory limit, they cannot be trained by conventional methods such as data
parallelism. Pipeline parallelism that partitions a large DNN into small subnets and trains them on different GPUs is a plausible solution.
Unfortunately, the layer partitioning and memory management in existing pipeline parallel systems are fixed during training, making
them easily impeded by out-of-memory errors and the GPU under-utilization. These drawbacks amplify when performing neural
architecture search (NAS) such as the evolved Transformer, where different network architectures of Transformer needed to be trained
repeatedly. VPIPE is the first system that transparently provides dynamic layer partitioning and memory management for pipeline
parallelism. VPIPE has two unique contributions, including (1) an online algorithm for searching a near-optimal layer partitioning and
memory management plan, and (2) a live layer migration protocol for re-balancing the layer distribution across a training pipeline. VPIPE
improved the training throughput of two notable baselines (Pipedream and GPipe) by 61.4-463.4 percent and 24.8-291.3 percent on
various large DNNs and training settings.

Index Terms—Machine learning, distributed systems, distributed artificial intelligence, pipeline, parallel systems, memory management

1 INTRODUCTION Pipeline parallelism is a promising approach to train large


DNNs with lots of layers on multiple GPUs, where the DNN
N recent years, large deep neural networks (DNNs), includ-
I ing Transformer [52], BERT [10], AmoebaNet [39], and
GNMT [58], are getting explosively deeper (i.e., more layers)
is partitioned into multiple stages, each containing a number
of layers and running on a GPU. Existing pipeline parallel
and wider (i.e,. more parameters per layer) for higher model- systems [14], [19], [33], [59] adopt a static partition policy,
ing capacities. For instance, Transformer [52] has more than where the stage partition is fixed throughout the entire train-
600 layers (i.e., execution operators) and 6 billion parameters. ing process. A typical DNN training iteration contains a for-
This rising complexity of DNN models has also expedited the ward pass and a backward pass through all stages. The major
emergence of neural architecture search (NAS) (e.g., evolved memory consumption on each GPU (or stage) is for storing
Transformer [45]), where the layers of a model are dynami- activations produced in a forward pass and reused in a back-
cally activated/deactivated during training [39], [45] to search ward pass [18], [37].
for a DNN architecture with high accuracy. This increasing For high hardware efficiency (i.e., high GPU ALU utiliza-
complexity and dynamicity make it even more difficult for tion), a pipeline parallel system injects multiple batches of
training a large DNN, considering that each GPU has only up inputs and overlaps their forward and backward pass exe-
to tens of gigabytes memory [18]. cutions, forming a pipeline. Compared with a data parallel
system [28], which needs to transfer enormous parameter
updates among GPUs, a pipeline parallel system only needs
 Shixiong Zhao, Fanxin Li, Xusheng Chen, Xiuxian Guan, Jianyu Jiang, to transfer intermediate data between layers across stages,
Dong Huang, Yuhao Qing, Ping Luo, and Heming Cui are with the significantly reducing the network consumption [33]. There-
Department of Computer Computer Science, The University of Hong
fore, more complex DNNs [19], [39], [45] are trained with
Kong, Hong Kong 999077, China. E-mail: {sxzhao, fxli, xschen, xxguan,
jyjiang, dhuang, yhqing, pluo, heming}@cs.hk.hk. pipeline parallel systems [14], [19], [33], [59].
 Sen Wang, Peng Wang, and Gong Zhang are with Theory Lab, 2012 Labs, An efficient pipeline parallel system should achieve two
Huawei Technoloies, Co. Ltd, Shenzhen 518129, China. crucial design goals. First, as the system injects multiple
E-mail: {wangsen31, wang.peng6, nicholas.zhang}@huawei.com.
 Cheng Li is with the School of Computer Science and Technology, Univer- input batches, it should carefully manage all stages’ training
sity of Science and Technology of China, Hefei, Anhui 230052, China. memory to avoid exceeding the physical memory capacity
E-mail: [email protected]. on any GPU (G1). Otherwise, it will either cause out-of-
Manuscript received 12 Nov. 2020; revised 21 Mar. 2021; accepted 24 Mar. 2021. memory errors or trigger synchronous paging events that
Date of publication 2 July 2021; date of current version 5 Aug. 2021. significantly block the training execution of a DNN (dis-
(Corresponding author: Heming Cui.)
Recommended for acceptance by J. Zola. cussed in Section 7). Second, to maximize the efficiency (i.e.,
Digital Object Identifier no. 10.1109/TPDS.2021.3094364 high GPU ALU utilization and no stage stalls), the system
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/
490 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 33, NO. 3, MARCH 2022

Fig. 1. A four-stage pipeline (Pipedream [33]). Stage 0 keeps four copies


of activations, while stage 3 keeps only one copy.

should enforce a “balanced” partition (G2) such that all


stages achieve roughly the same high throughput [19], [33]:
data items processed per second by the pipeline. Unfortu-
nately, despite much effort [14], [19], [33], [59] on building
pipeline parallel systems, simultaneously realizing these
two design goals for complex and dynamic DNNs is still an
open problem.
Existing pipeline parallel systems fall into two categories.
Fig. 2. (a)(b) With VPIPE integrated, Pipedream-VPIPE (P-V) and GPipe-
The first category (Pipedream [33] and XPipe [14]) keeps VPIPE (G-V) achieved faster convergence than Pipedream (P) and GPipe
activation tensors produced during forward passes directly (G) when training Transformer [52] with 8 GPUs. (b)(d) When NAS was
in GPU memory. However, due to the forward-then-back- enabled in the Evolved Transformer [45], the training throughout (TPT)
of Pipedream and GPipe further dropped, while Pipedream-VPIPE and
ward nature of DNN training, activation tensors in the front
GPipe-VPIPE could cope with this dynamicity.
stages reside longer in GPU memory than those in the rear
stages (Fig. 1). Thus, when more input batches are injected,
Overall, despite great advances, existing pipeline parallel
the front stages have to keep many more copies of activa-
systems still incur suboptimal training efficiency on either
tions than the rear stages.
static or dynamic (e.g., NAS enabled) DNN training. We
To meet G1 on the front stages, systems in the first cate-
believe the key reason is that these systems use static strate-
gory have to keep a moderate batch size [10], [39], [52], [58].
gies for both memory management and layer partitioning.
Still, a larger training batch size can lead to higher GPU
When stages become intense, caused by either GPU mem-
ALU utilization and higher throughput [60]. In our evalua-
ory explosion or newly activated layers, these static strate-
tion (Section 6.1), when training Transformer with 8 GPUs,
gies prevent themselves from using the available GPU
Pipedream [33] supported a batch size of only 32. Each
resources in adjacent stages to alleviate these intense stages.
GPU’s ALU utilization rate was 42.3 percent on average,
This paper presents VPIPE, the first dynamic DNN layer
making the training throughput only 46.1 percent of the
partitioning and memory management system acting as a
ideal throughput: the theoretical throughput supposing a sys-
virtualized layer between a typical pipeline parallel system
tem runs on GPUs with unlimited physical memory and uti-
(e.g., Pipedream [33] or GPipe [19]) and its underlying exe-
lizing all GPU ALUs (also defined in other systems [18]),
and the stage partition is always balanced (G2). cution engine (e.g., PyTorch [36] or Tensorflow [1]). VPIPE
The second category (GPipe [19] and PipeMare [59]) dis- automatically and transparently realizes both design goals
cards all activation tensors in the forward passes and (G1 and G2) by automatically finding a globally near-opti-
recomputes them in the backward passes. This significantly mal plan, which migrates layers among stages and relocates
alleviates the imbalanced GPU memory utilization between each layer’s activations and parameters to its current stage’s
the front stages and rear stages, but at the cost of an extra GPU or CPU memory. VPIPE can significantly alleviate the
forward pass. In our evaluation (Section 6.1), GPipe [19] intense stages of a pipeline and improve the pipeline’s
supported a batch size of 128 when training the Transformer throughput in a balanced way (e.g., Fig. 2).
with 8 GPUs, and the each GPU’s ALU utilization rate can To achieve G1, instead of GPipe’s all-recompute strategy,
be up to 95.6 percent. However, this all-recompute strategy VPIPE computes a hybrid plan of both swap and recompute
inevitably leads to wasted ALU utilization of 29.4 percent, for all layers on each stage. Specifically, swap asynchro-
and GPipe incurred merely 66.2 percent effective ALU utiliza- nously evicts activation tensors to CPU memory and pre-
tion: the useful GPU ALU utilization that contributes to the fetches them back to GPU memory before its corresponding
DNN training, but not the recompute utilization. backward usage starts. In pipeline parallelism, there usually
Moreover, both categories of pipeline parallel systems exists an opportunity window, filled by other input batches’
encounter even more severe throughput degradation when executions, between the forward pass and backward pass of
a DNN model enables NAS, where both the number and each input batch. Leveraging this window, VPIPE masks the
layout of the model’s layers can be modified by a runtime swap time by precisely predicting the arrival time of the
algorithm (e.g., evolution algorithm [39], [45]). An evalua- backward pass and overlapping the cost with other input
tion (Section 6.3) is conducted by running a NAS-enabled batches’ executions.
Transformer [45] on one notable system in each category To achieve G2, instead of using a static partition strategy,
(i.e., Pipedream and GPipe). Compared with the defined VPIPE online generates new partition plans and transpar-
ideal throughput, Pipedream’s throughput dropped to 17.7 ently live migrates layers from intense stages to their adja-
percent, and GPipe’s throughput dropped to 25.3 percent. cent stages, both alleviating the memory burdens on intense
ZHAO ET AL.: VPIPE: A VIRTUALIZED ACCELERATION SYSTEM FOR ACHIEVING EFFICIENT AND SCALABLE PIPELINE PARALLEL... 491

stage (G1) and achieving more balanced partitions with roughly linearly with the GPU numbers. When run-
higher throughput (G2). ning on 16 GPUs, VPIPE improved Pipedream’s and
However, realizing these two goals in VPIPE must tackle GPipe’s throughput by 323.3 and 20.7 percent.
two technical challenges. The first challenge is searching for  VPIPE was efficient in NAS workloads. When evalu-
a globally efficient swap, recompute, and repartition (SRP ) ated on Transformer [45] and AmoebaNet [39], the
strategy among all stages. We took the first step in the litera- only two evaluated complex DNNs that support
ture to model this challenge into a combinatorial optimiza- NAS features, VPIPE improved Pipedream’s and
tion problem (Section 4.1). However, the problem is NP- GPipe’s throughput by 421.3-463.4 percent and
hard due to its exponential search space [2], [3], [50]. 245.4-291.3 percent.
To address this challenge, we created a fast-converging, Our main contribution is VPIPE, the first dynamic layer live
near-optimal search algorithm using the powerful decom- partition and memory management system, serving as a trans-
position methodology [32], [47] via two observations. First, parent underlying acceleration layer for typical pipeline paral-
we can iteratively migrate layers from an intense stage to its lel systems (e.g., Pipedream and GPipe). Our major novelty is
adjacent stages, enabling new optimization space for a bet- a fast and near-optimal stage-distributed search algorithm for
ter hybrid plan of swap and recompute on each stage (Sec- finding a globally efficient swap, recompute, and partition
tion 4.2). Second, the architecture (layout) of a typical strategy, greatly improving VPIPE’s efficiency and scalability.
complex DNN [39], [58] is usually constructed as a coars- Our secondary novelty is a transparent live migration protocol
ened graph of repeated subgraphs, which are readily easy without stalling the executions or altering the upper system’s
to be partitioned into an optimal plan [19], [33] that meets parameter staleness. VPIPE’s source code and evaluation frame-
G2; VPIPE fast detects this coarsened graph by precisely dis- work are released at: github.com/hku-systems/vpipe.
tinguishing intra edges inside subgraphs and nested edges In the rest of this paper, Section 2 presents the back-
among subgraphs, leveraging the time series distance ground; Section 3 gives an overview of VPIPE; Section 4
between each edge’s two vertices (layers) collected at run- describes VPIPE’s runtime design; Sections 5 and 6 present
time execution. VPIPE’s implementation and evaluation results; Section 7 dis-
The second challenge is how to live (i.e., no GPU stalls cusses the related work, and Section 8 concludes.
nor pipeline cleaning) migrate a layer while keeping VPIPE
transparent [49] to general upper pipeline parallelism sys- 2 BACKGROUND
tems (i.e., VPIPE does not add nor reduce parameter stale-
ness [14], [33], [59] to the upper system). Existing pipeline 2.1 DNN Training
parallel systems [14], [33], [59] carefully designed various DNN [10], [15], [29], [44], [46] is known to be the fundamental
strategies to orchestrate (add or reduce) the staleness on machine learning paradigm in deep learning. A DNN model
parameter updates for higher training accuracy or through- typically contains hundreds of layers, and the goal of DNN
put on specific DNNs. training is to find an appropriate set of model parameters to
VPIPE guarantees that a layer is migrated as if reparti- fit a training dataset. Each DNN training process typically
tioned by a non-live approach: stop injecting new input consists of millions of iterations, each containing a forward
batches for the upper system, clean up the pipeline, migrate pass, a backward pass, and an optimization step.
the layer, and reboot a new pipeline. To handle the The memory consumption of DNN training contains four
migrated layer’s unfinished backward passes, we present a parts: parameters of each layer; activations, i.e., feature maps
new live migration protocol. Our key observation is that the produced by each layer in the forward pass; gradients, i.e.,
time window between the activation generation (in a for- gradient maps produced by each layer in the backward pass;
ward pass) and its final usage (in the corresponding back- and scratch space for computation. Among these four parts,
ward pass) allows a subtle interleaving for VPIPE to live activations take the most significant portion (up to 73.3 per-
migrate a layer transparently without altering the parame- cent) of the total memory consumption for DNN training.
ter staleness of the upper system. Activations are created in the forward pass and reused in the
We implemented VPIPE in PyTorch [36] by adding 2782 backward pass, so there exists a large time window between
LoC. We evaluated all six prevalent DNN models, including the two memory accesses. Activation memory is the major
four complex DNNs Transformer [52], BERT [10], Amoeba- optimization target in previous work [18], [37].
Net [39], GNMT [58], and two simple DNNs ResNet50 [15],
VGG16 [44], that are evaluated in all relevant systems Pipe- 2.2 Pipeline Parallel DNN Training
dream [33], GPipe [19], XPipe [14], and PipeMare [59]. The With the DNN training getting increasingly computation
evaluation shows that: and memory intensive, distributed training systems across
multiple GPUs become a must. Distributed training systems
 VPIPE was efficient in training complex DNNs. VPIPE can be categorized as data parallel or model parallel. Data
improved Pipedream’s and GPipe’s throughput by parallel systems [28] let each GPU maintain a copy of the
109.7 and 30.7 percent on average for four complex complete model. In each iteration, each GPU trains on a
DNNs. VPIPE enlarged Pipedream’s supported batch small batch and synchronizes the parameter updates with
size by 3.75x. Within the same training time, VPIPE other GPUs using all reduce [43] or parameter sever [28].
made Pipedream achieve higher training quality However, data parallelism is not designed to train large
(e.g., BLEU [58]). DNNs that cannot fit into a single GPU’s memory.
 VPIPE was scalable. When training the four complex Pipelined model parallelism (i.e., pipeline parallelism)
DNNs on 4-16 GPUs, VPIPE’s throughput increased aims to scale the supported DNNs to the number of GPUs
492 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 33, NO. 3, MARCH 2022

Fig. 3. Logical BSP pipeline (a) that demonstrates the bubble problem Fig. 4. Logical ASP pipeline (a) and a realtime nsys/nvprof GPU profiling
and a realtime nsys/nvprof GPU profiling (b) that verifies the bubble (b) of ASP pipeline with four-stage Pipedream training; red blocks are
problem in BSP pipeline with four-stage GPipe training; red blocks are sync barriers.
sync barriers.

by partitioning a DNN model into multiple stages (a consecu- 2.3 Dynamic DNN Training
tive set of layers) and letting each GPU handle one stage. Pipe- Recently, more and more developers have adopted dynamic
line parallelism is a pipeline version of model parallelism, DNN training where the number of layers varies with the
where vanilla model parallelism leads to severe under-utiliza- training inputs (e.g., DyNet [34]) or the training is exploratory
tion due to the bubble problem caused by the sequential depen- (e.g., neural architecture search [45], [55], [57], [62]). In such a
dency between stages. Pipeline parallelism overlaps the case, a training workload (i.e., the GPU computation and
computation and waiting time of different input batches, fills memory required for training) varies as the training proceeds.
the bubbles, and improves the utilization. Based on how a Since the efficiency of pipeline parallelism highly depends on
pipeline parallel system handles synchronization of DNN the workload partition among stages, this dynamicity exposes
parameters among input batches, the system falls into two cat- special requirements for pipeline parallel systems.
egories: barrier synchronous parallel (BSP) systems and asyn- The variance of training workload usually happens very
chronous parallel (ASP) systems. frequently. For example, a neural architecture search (NAS)
BSP systems (e.g., GPipe [19]) let a set of training input process [39], [45] adopts an evolutionary algorithm that
batches work on the same version of model parameters, trains a set of models, fast eliminates those with low fitting
aggregate gradients computed by these iterations, and enforce scores, and initiates new ones. Thus, “bad” models can be
a barrier that stops the pipeline to apply the gradients to the eliminated within a few minutes [39], [45].
model parameter. BSP systems achieve almost the same statis- Existing pipeline parallel systems profile a static partition
tical performance as vanilla model parallelism [19]. However, before the training starts. This static partition inherently
as shown in Fig. 3a, a BSP pipeline logically still incurs bub- cannot adapt to the dynamicity in the training process. VPIPE
bles during each barrier synchronization, and we verified this copes with this dynamicity by a wait-free live layer migra-
in Fig. 3b by profiling the GPUs during a four-stage BSP pipe- tion protocol (Section 4.2) that transparently re-balances the
line training. training load when changed.
ASP systems (e.g., Pipedream [33] and PipeMare [59])
remove the sync barrier and let each input batch directly
update the model parameters. Although bubbles are elimi- 3 VPIPE’S ARCHITECTURE
nated (as shown in Fig. 4), ASP systems suffer from parameter Fig. 5 shows VPIPE’s architecture, a virtualized layer between
staleness in two aspects. First, the parameter version differs a typical pipeline parallel system and its underlying execu-
between a pipeline’s forward pass and backward pass. Sec- tion engine. On each host, there is a virtualized tensor man-
ond, the parameter version differs among stages within the ager, a training monitor, and a layer manager. On the host
training of an input batch. Pipedream [33], XPipe [14], and of the last stage, there is a global planner.
PipeMare [59] provide various algorithm-level mitigation to Virtualized Tensor Manager (VTM) provides fine-grained
the parameter staleness problem. VPIPE is designed to be a management to each parameter and activation tensor. VTM
transparent layer under either a BSP or an ASP pipeline paral- holds each layer’s tensor (parameter or activation) informa-
lelism algorithm; and VPIPE’s designs (Section 4.3) do alter the tion, including layer ID, stage ID, property (parameter or acti-
weight staleness in the upper systems. vation), training iteration ID, version, management policy
Scheduling. One forward one backward (1F1B) schedul- (vStatus), storage status, and the pointer to the tensor’s real
ing is first introduced by Pipedream [33] and adopted by storage constructs. An activation tensor’s information is initial-
successive systems (e.g., PipeMare [59] and XPipe [14]). In ized in VPIPE’s tensor manager when created and deleted when
1F1B scheduling (e.g., Fig. 1), each stage alternates between released. For parameter tensors, VPIPE creates tensor informa-
performing forward pass for a current input batch and tion as long as the model is initialized. The management policy
backward pass for an earlier input batch. 1F1B is widely of a layer’s tensors is managed by the layer manager.
adopted due to its high computational efficiency [33], [59] Training monitor monitors each stage’s runtime statistics,
and low memory usage. Therefore, in this paper, we including real-time memory usage of each GPU on these hosts,
assume that the upper pipeline parallel systems adopt PCIe bandwidth usage, network usage, execution time, and rec-
1F1B scheduling. ompute time. Along with forward passes of the normal training
ZHAO ET AL.: VPIPE: A VIRTUALIZED ACCELERATION SYSTEM FOR ACHIEVING EFFICIENT AND SCALABLE PIPELINE PARALLEL... 493

Fig. 5. Architecture of VPIPE. VPIPE is a virtualized layer between a typical pipeline parallel system (e.g., Pipedream [33] or GPipe [19]) and its underly-
ing execution engine (e.g., PyTorch [36] or Tensorflow [1]). We use different colors to refer layers set by VPIPE’s operations including default (D),
swap (S), recompute (R), and migrate (M).

iterations, the training monitor passes its own runtime statistics steady-state throughput of the training pipeline can be max-
and the upstream stages’ (if any) to its downstream stages. imized. Since there is no model to quantify the complexity
Global planner collects the runtime statistics of all stages at the of this SRP challenge, we take the first step in the literature
end of every forward pass. It produces new partition strategies to formalize the SRP challenge, transform it into a combina-
(if needed) according to VPIPE’s SRP algorithm (Section 4.2). It torial optimization problem, and solve it by a decomposi-
resides on the last host for two reasons. First, in pipeline paral- tion algorithm (Section 4.2).
lelism, rear stages usually have less computation and commu- A DNN is a graph GðN; EÞ with N layers (e.g., matrix
nication burdens. Second, as the runtime statistics are collected operation) and E edges connecting the layers. In pipeline
every training iteration, VPIPE transfers the runtime statistics parallelism, a DNN model is partitioned to p stages, and
along with the forward pass and distributes the new partition each stage is placed on one GPU (p GPUs in total). To maxi-
(if any) along with the backward pass. By doing so, VPIPE’s mize the pipeline utilization, in a typical pipeline parallel-
global planner does not need extra distributed coordination. ism scheduling (Section 2), at least p input batches are
Layer manager receives a new partition strategy from the simultaneously injected into the same pipeline. For each
global planner, diffs the new partition from its current parti- layer in the model, we denote it with ðfi ; bi ; mi ; ai Þ, includ-
tion to check whether a layer migration should be scheduled. ing a forward pass time fi , a backward pass time bi , a
For example, when a layer needs to be migrated, the migration parameter memory mi , and an activation memory ai .
manager of the source stage will coordinate with the tensor The major constraint for pipeline parallel training is G1: on
manager to asynchronously swap the layer’s parameter ten- each GPU, the training GPU memory usage should not exceed
sors and activation tensors to the CPU memory; and then any GPU’s physical memory limit (M). In pipeline parallelism,
transfer the parameter and activation tensors to the migration the memory consumption of all layers in each stage contains
manager of the target stage. The migration manager of the tar- two parts. The first part is a constant memory consumption
get stage will initialize the layer in the target GPU, receive the (mconstant ) that does not vary with the number of injected input
i
parameter and activation tensors from the source stage, and batches; the second part is the dependent memory consump-
append the new layer to the forward pass and backward pass tion (mdependent ), which depends on the number of injected
i
executions (Section 4.3). Layer manager also produces the input batches and differs among stages: given a stage k, p  k
local swap and recompute policies (Section 4.2). copies of mdependent should be kept in memory. In BSP systems,
i
Overall, VPIPE’s design is transparent to the upper pipe- parameters are updated synchronously (Section 2), and all
line parallel systems. We integrated VPIPE into an ASP sys- input batches in a pipeline share the same version of parame-
tem Pipedream [33] and a BSP system GPipe [19]. For ters, thus mdependent is ai and mconstant is mi . In ASP systems,
i i
vanilla Pipedream, we set all layers’ vStatus to default; and each training iteration in a pipeline may have an independent
for vanilla GPipe, we set all layers’ vStatus to recompute.
version of mi , thus mdependent
i contains both ai and mi .
VPIPE can also be integrated into other pipeline parallel sys-
To reduce memory consumption, a pipeline parallel sys-
tems (e.g., PipeMare [59] and XPipe [14]) as long as they
tem can apply swap or recompute strategy to each layer’s
support an imperative programming model.
dependent tensors, which are the main memory burden in
pipeline parallelism. Thus, for each tensor in a layer, we
4 VPIPE’S RUNTIME
denote its memory management policy with ðDi ; Ri ; Si Þ,
4.1 Problem Modeling where Di ; Ri ; Si ¼ 0 or 1; Di þ Ri þ Si ¼ 1. D ¼ 1 means the
A major challenge for VPIPE’s design is to find an optimal tensor by default resides in the GPU memory; S ¼ 1 means
strategy of swap, recompute, and partition (SRP) so that the the tensor will be proactively swapped to CPU memory and
494 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 33, NO. 3, MARCH 2022

swapped back to GPU before usage; and R ¼ 1 means the let the intense stage have more optimization space to search
tensor will be dropped and recomputed by the backward for a better hybrid plan of swap and recompute.
pass. Thus, in pipeline parallelism, the memory constraint We decompose the master problem into two sub-prob-
of each stage can be denoted as: lems. First, we assume that Varp is constant, and each stage
locally finds a swap and recompute plan (Varsr ) depending
on its GPU resource to minimize the objective function (3).
S mconstant
i þ ðp  kÞ  S Di  mdependent
i  M: (1) Second, we assume that Varsr is constant, and stages should
lk irk lk irk
be repartitioned (i.e., find an optimal Varp ) to minimize (3).
Algorithm 1 shows our decomposed algorithm by itera-
Nevertheless, the recompute of layers introduces extra tively resolve these two sub-problems.
computation time to the backward pass. Thus, a stage’s
backward time is the sum of the original backward pass
time, the recompute time (i.e., extra forward pass of recom- Algorithm 1. Decomposed SRP Algorithm
puted layers), and the swap time if the swap time cost is 1: Stage 1,..., p:;
larger between the normal execution time (i.e., maxð0; swap 2: Function LayerManagerIterate():
time  execution timeÞ): 3: newPlan ¼ receiveBwdPropðÞ ;
4: diff ¼ compareðthis:plan; newPlanÞ;
tbwd ¼ Sðbi þ Ri  fi Þ þ maxð0; ð2  SðSi  mdi =P Þ  ðtfwd þ tbwd ÞÞÞ: 5: if diff! ¼ null then
6: migrating ¼ True;
(2) 7: for l in diff do setðl:vStatus; MigrateÞ;
8: stats ¼ retrieveStatsðÞ;
9: optimizeSRðstatsÞ ##Algorithm 2;
Finally, we formalize the SRP challenge to a combinato-
10: return;
rial optimization problem: given n layers and p GPUs, find
11: Function TrainingMonitorIterate()
a swap or recompute policy for each layer (meet G1), as
12: if ! migrating then
well as a partition (meet G2), such that the pipeline 13: stats ¼ receiveFwdPropðÞ;
throughput can be maximized. The throughput of a pipeline 14: mem ¼ cudaMemStatsðÞ;
is the lowest throughput among all stages [22], [33]. All 15: tfwd ; tbwd ¼ getExecTimeðÞ;
stages in a pipeline have the same request rate. Thus, the 16: stats:appendðthis:meta; mem; tfwd ; tbwd Þ;
pipeline’s throughput bottleneck is the stage that has the 17: fwdPropagateðstatsÞ;
longest execution time (sum of the largest tfwd and largest 18: return;
tbwd ). Therefore, we convert this problem to finding a parti- 19: Global Planner:;
tion and a swap/recompute policy such that the longest 20: Function: GlobalPlannerIterate()
stage execution time can be minimized: 21: stats; migrating ¼ receiveFwdPropðÞ;
22: if migrating then
minimize max ðtfwd
k þ tbwd
k Þ
23: return;
1kp
(3) 24: unbalanced ¼ checkBalancedðstatsÞ;
subject to ð1Þð2Þ: 25: if unbalanced then
26: newPlan ¼ layerRepartitionðÞ ##Algorithm 3;
27: bwdPropagateðnewPlanÞ;
This optimization problem is hard to solve for two rea- 28: return;
sons. First, the feasible set of this combinatorial optimiza-
tion problem spans an extremely large search space Swap and Recompute. For both swap and recompute, the
(Oð3jNj pjNj Þ), as each of layers N can have three memory goal is to reduce the memory footprint with the lowest over-
management policies and fall into p partitions. A graph head. For the swap, our goal is to maximize the overlapping
partition problem itself is well-known to be NP-com- between swap and the normal execution. For the recompute,
plete [50]. Second, constraint (2) indicate that both the our goal is to select the cheapest layer with maximized
memory management policy of all layers (ðDi ; Ri ; Si Þ; memory saving to recompute. It has been well studied in
for 1  i  n, denoted as Varsr ) and the stage partition recent work (e.g., Capuchin [37]) that using a hybrid combi-
plan (denoted as Varp ) can affect the optimization objec- nation of swap and recompute of activation tensors can
tive in (3), making this problem a multi-variable combina- effectively reduce training memory on single GPU DNN
torial optimization. training. However, applying swap to a pipeline parallel
system has to address two subtle points.
4.2 Swap, Recompute, and Repartition First, an efficient swap plan should precisely predict when a
We solve this multi-variable and combinatorial optimiza- tensor that has been swapped to CPU RAM will be reused in
tion problem by decomposition [32], [47] methodology. The the backward pass. In single GPU training, an activation ten-
idea of the decomposition methodology is to break a prob- sor is generated by the forward pass of an input batch train-
lem into smaller sub-problems coordinated by the master ing. The backward pass directly follows the forward pass.
problem (i.e., the optimization problem). Inspired by the Thus, existing swap techniques used in single GPU training
conventional decomposition method [32], [47], the key intui- systems (e.g., SwapAdvisor [18], Capuchin [37], vDNN [40],
tion is to iteratively migrate a layer from an intense stage and SuperNeuron [56]) directly make predictions based on a
where the GPU resource is exhausted to a relief stage and DNN’s graph (either profiled or runtime generated).
ZHAO ET AL.: VPIPE: A VIRTUALIZED ACCELERATION SYSTEM FOR ACHIEVING EFFICIENT AND SCALABLE PIPELINE PARALLEL... 495

However, there usually exists a window in pipeline par- VPIPE swap parameter tensors only if activation tensors are all
allelism, filled by other input batches’ executions, between swapped, which rarely happens in our evaluation.
the forward pass and backward pass of each input batch. To Layer Partition. The problem of partitioning a graph GðN;
make a precise prediction, VPIPE oversees the runtime statis- EÞ into p equal partitions with the lowest cross-partition com-
tics of each forward pass and its backward pass across all munication cost is known to be NP-complete [3] and has
stages of a pipeline (line 21-28 in Algorithm 1), and let each extensive applications in many areas, including VLSI
VPIPE’s layer manager precisely predict the arrival time of design [24], matrix factorization [7], and social network clus-
each backward pass execution. tering [35]. Kernighan-Lin (KL) algorithm [25] is known to
produce excellent partitions for a wide class of problems and
Algorithm 2. optimizeSR() is used quite extensively [17], [27]. To achieve a multi-parti-
tion, it recursively produces bi-partition of graph G and itera-
1: Input: layers in a stage, tfwd , tbwd , M, P , rank;
2: Foreach l in layers do
tively improves it by exchanging nodes in both partitions. KL
3: if l:a=P > l:tfwd then algorithm is costly and takes OðrjNj2 log jNjÞ [11] time (e.g.,
4: l:cost ¼ l:tfwd ; up to 16s to partition a complex DNN model into 16 stages),
5: l:op ¼ Recompute; where r is the repeated cycles. There are many approximate
6: else algorithms [11], [12], [16], [48] that tend to be fast (near-linear)
7: l:cost ¼ l:mactivation =P ; but often yield partitions that are worse than those obtained
8: l:op ¼ Swap; by KL algorithm [13], [23], [41] .
9: l:gain ¼ l:mactivation =l:cost; To make KL algorithm efficient, multi-level schemes
10: window ¼ tfwd þ tbwd ; reduce the size of the graph (i.e., coarsen the graph) by col-
11: space ¼ P  window; lapsing vertices and edges, partitioning the smaller graph,
12: sorted ¼ sortByGainðlayersÞ; and then uncoarsening it [17], [23]. Multi-level scheme has
13: while space  0 do been used in many areas, including matrix factorization [7]
14: l ¼ sorted:popðÞ; and VLSI design [24]. However, these algorithms assume
15: setðl:vStatus; SÞ; domain-specific requirements for the graph (e.g., a sparse
16: space ¼ space  rank  mactivation matrix [7] or a planar graph [24]), which are not applicable
17: while memConsumeðlayersÞ > M do to a complex DNN graph (e.g., AmoebaNet [39]). Moreover,
18: l ¼ sorted:popðÞ; existing multi-level schemes all take multiple coarsen steps.
19: setðl:vStatus; l:opÞ; In VPIPE, leveraging the time series implied by the DNN’s
20: foreach l in layers do sequential executions, we identify two domain-specific heu-
21: l ¼ sorted:popðÞ;
ristics to design a fast and online multi-level graph partition
22: setðl:vStatus; DefaultÞ;
algorithm with a one-step coarsen scheme.
First, Deep Learning experts have already constructed
Second, in pipeline parallel systems, swap and network the graphs of complex DNNs (e.g., Transformer, BERT,
communication impose severe burdens on the PCIe lanes, AmoebaNet, and GNMT), prevalently deployed with pipe-
causing severe PCIe interference that is not addressed by line parallelism, as sequentially connected and repeated
single GPU training systems. In VPIPE, both network com- subgraphs of layers. Each subgraph is usually a basic block
munication and swap that pass throughput PCIe are asyn- (e.g., a Transformer block) for constructing a large DNN.
chronous streams [4]. To handle the PCIe interference, VPIPE Inside each subgraph, there are intricate local edges (nested
sets priorities to different asynchronous streams that pass edges) forming multiple execution branches. Partitioning
through PCIe. VPIPE sets a higher priority to network com- such a subgraph in two stages usually incurs huge network
munication for not blocking the pipeline execution. communication costs between two GPUs.
VPIPE’s swap and recompute algorithm (Algorithm 2) There are also sparse nested edges that form branches
works as follows. For each stage, the algorithm takes a set of among blocks. However, network communication costs of
layers, a memory limit M, PCIe bandwidth P , stage rank partitioning these sparse nested edges are often static and
(p-k), tfwd and tbwd of this stage as input. VPIPE first sort all do not vary with the partition plan. For example, in the
layers by the potential memory saving gain of either swap BERT model, each block should take input from the first
or recompute (line 2-9). Until the PCIe is full, VPIPE selects embedding layer, and it is necessary to pass the embedding
tensors according to their memory saving gains to be asyn- output to all stages. Thus, under any partition plan, the net-
chronously swapped (line 13-16). After that, if the memory work communication costs of transferring this input to all
limit is still reached, VPIPE chooses whether to swap or recom- stages are persistent.
pute an activation based on their swap/recompute cost and Second, different from conventional graphs in partitioning
memory saving gain (line 17-19). For the rest of the layers, VPIPE problems [2], [3], [50], in a DNN graph, vertices (i.e., layers)
keeps them by default (line 20-22). Leveraging the first subtle are executed by the training engine in time series. If a nested
point, VPIPE can precisely overlap the async swap cost of these edge connects two vertices that have a gap that is larger than
tensors with normal execution. With the second subtle point, a stage’s execution time in the time axis, the edge has a high
the async swap will not block the network communication of chance to be a sparse nested edge. If a nested edge connected
normal training execution. Consequently, Algorithm 2 reduces two vertices very close to each other in the time axis, the edge
the recompute overhead with async swap in existing pipeline is likely to be part of a subgraph.
parallel systems (e.g., GPipe [19]). VPIPE swaps activation ten- Based on these heuristics, VPIPE’s layer repartition algo-
sors first, as activation takes the most memory consumption; rithm (Algorithm 3) has three steps. First, VPIPE (line 7-21)
496 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 33, NO. 3, MARCH 2022

coarsens the DNN graph. In this step, each edge in a DNN graph (e.g., AmoebaNet graph with 4280 layers/vertices and
graph is classified with OðjNj þ jEjÞ cost to three categories: 5080 edges) into a much smaller graph (e.g., coarsened
critical edges that construct the sequential backbone of the AmoebaNet with 132 vertices and 142 edges). By doing so,
DNN graph, sparse nested edges, and subgraph edges. the time cost of KL algorithm is greatly reduced. On parti-
Then VPIPE merges the subgraph edges to the sequential tioning various DNN model, evaluation (Section 6.4) shows
backbone edges by aggregating their execution time and that VPIPE’s partition algorithm speeds up the KL algorithm
communication. Second, VPIPE partitions this merged graph by 4x-32x and achieves 0.15s-0.46s time cost (less than the
by iteratively applying bipartition with KL algorithm [50] process time 1.21s-6.98s of one training input batch), fast
(line 22-26). Third, VPIPE uncoarsens the merged graph to enough to be deployed online.
the original DNN graph and refines the partition to see if
any potential better partition exists by KL refinement [17]
4.3 Live Layer Migration
(line 27-29).
Existing pipeline parallel systems (e.g., Pipedream and
GPipe) adopt a static layer partition before execution (Sec-
Algorithm 3. layerRepartition()
tion 2). To migrate a layer in these systems, developers need
1: Input: DNN Graph GðN; EÞ, runtime statics of each layer to adopt a non-live approach: stop the runtime, modify the
(layers), e.g., invoke time (T ) of each layer; layer partition configuration, and reboot the whole training
2: sorted ¼ sortByTimeðlayersÞ; process. This process suffers from heavy bootstrap over-
3: Gcoarsened ¼ coarsenðGðN; EÞÞ; head, including runtime initialization, model initialization,
4: bound ¼ partitionðGcoarsened Þ; and data loading (Section 2). Such a heavy overhead might
5: G ¼ uncoarsenðGcoarsened Þ ; dramatically decrease the training efficiency when layer
6: bound ¼ refineðG; boundÞ;
migration is frequently triggered under a dynamic training
7: Function: coarsen(G(V, E))
process (Section 6.4).
8: mean ¼ sumðtÞ=p;
In VPIPE, we aim to design a live layer migration protocol
9: E  ¼ ½;
10: foreach l1 ; l2 in pairwiseðsortedÞ do
for pipeline parallelism with a key technical requirement
11: ##detect critical path edges; that the layer migration should remain transparent to the
12: if eðl1 ; l2 Þ in E then upper systems so that VPIPE will not alter the upper systems’
13: annotate eðl1 ; l2 Þ as critical edge ; parameter staleness.
14: E  :appendðeðl1 ; l2 ÞÞ Existing pipeline parallel systems fall into two categories:
15: foreach e in E  E  do BSP systems (GPipe [19]) and ASP systems (Pipedream [33],
16: ##distingush sparse and subgraph edges; PipeMare [59], and XPipe [14]). BSP systems have no
17: if e:v2 :T  e:v1 :T > mean then parameter staleness (Section 2.2). ASP systems adopt vari-
18: annotate eðl1 ; l2 Þ as sparse edge ous parameter staleness strategies on different design goals.
19: else BSP and ASP systems have their own strengths on particu-
20: annotate eðl1 ; l2 Þ as subgraph edge lar workloads. For instance, in Table 3, GPipe achieved bet-
21: mergeðE; E  Þ; ter accuracy than Pipedream on training Transformer while
22: Function: parition(G(V, E), p) achieved worse accuracy than Pipedream on training BERT.
23: if p==1 then Thus, VPIPE is designed to be transparent to the upper sys-
24: return tems so that VPIPE does not alter their parameter staleness.
25: bound; G1 ; G2 ¼ KLParititionðG; costÞ; VPIPE lets the programmer explicitly annotate the type of
26: return bound; partitionðG1 ; p2Þ; paritionðG2 ; p2Þ; system.
27: unction refine(G(V,E), bound) However, it is challenging to transparently migrate a
28: foreach b in bound do layer without losing liveness for both BSP and ASP systems.
29: KLRefineðGðV; EÞ; bÞ
The reason is that at any time in a pipeline, a layer can
always have multiple unfinished backward executions, and
Analysis. VPIPE’s Algorithm 1 decomposes a master prob- these backward passes will produce updates to the layer
lem into two sub-problems [32], [47]. VPIPE’s Algorithm 2 is parameters. To avoid altering the parameter staleness, dur-
optimal as the sub-problem is a linear optimization with ing the migration of a layer, no updates produced by these
simple constraints (i.e., the memory limit and the PCIe backward passes should be lost.
limit). VPIPE’s Algorithm 3 is a successive algorithm of the Moreover, in the typical scheduling of ASP systems (Sec-
Kernighan Lin (KL) algorithm. KL algorithm is a bipartition tion 2.2), layers on different stages have different pipeline
algorithm that starts from an initial bipartition of a graph execution interleaving. For example, in the last stage, the
and exchanges the vertices of the two partitions to see forward pass of an input batch directly works on the param-
whether a better partition can be found [2], [3], [50]. eter updated by the last input batch, while in the first stage,
The time complexity of the original KL algorithm is the forward pass works on the parameter updated by a
OðrjNj2 log jNjÞ, where r is the repeated cycles, and N is the much earlier input bach. For BSP systems, forward passes
total set of layers. The time cost of running KL algorithm on on all stages work on the same version of parameters until a
complex DNNs (e.g., AmoebaNet) is huge (up to 16s for each parameter synchronization occurs. To avoid altering the
run). With our two heuristics on recent complex DNN parameter staleness, during the migration of a layer, VPIPE
graphs, VPIPE’s partition algorithm uses a coarsen phase of ensures that when a layer is migrated among stages,
complexity OðjNj þ jEjÞ that coarsens a complex DNN the execution interleaving of this layer should change
ZHAO ET AL.: VPIPE: A VIRTUALIZED ACCELERATION SYSTEM FOR ACHIEVING EFFICIENT AND SCALABLE PIPELINE PARALLEL... 497

Fig. 7. Realtime nsys/nvprof GPU profiling of a forward layer migration.


Pink blocks are GPU-to-CPU memory copy; green blocks are CPU-to-
GPU memory copy. After migration, a higher utilization can be visually
observed on the target GPU. We disabled swap to highlight the migration
memory copies.

transfers activation tensors for backward pass of input batch


k þ 1 (denoted as backward k þ 1). (2) After the next backward
pass (i.e., backward k) finishes, stage n transfers the parameter
tensors of layer c (updated by backward k) to stage m. Stage m
will wait for the arrival of the parameter tensors of layer c and
process layer c in its next backward pass (i.e., backward k þ 1
is processed in stage m). (3) The subsequent layer c’s activation
Fig. 6. A forward layer migration triggered after the ending of input batch tensors created by input batch k þ 2, k þ 3, ..., k þ p  n  1
k from stage n to stage m. (i.e., k þ 2, k þ 3 in Fig. 6) are continuously and asynchro-
nously copied. VPIPE ensures that the backward k þ 2, k þ 3, ...,
accordingly. By doing so, VPIPE guarantees that a layer is k þ p  n  1 will not start at stage m until their corresponding
migrated as if repartitioned by a non-live approach. activation tensors arrive. When VPIPE is integrated into an ASP
We formalize the above transparency requirements. system, VPIPE will transfer the activation tensors and the corre-
Given a new input batch k, for q layers fl1 ; l2 ; :::; lq g in stage sponding parameter tensors to migrate a layer.
n of a training pipeline (0  n < p, where p is the number Overall, VPIPE’s live layer migration merely affects the
of stages and the number of simultaneously injected input normal execution as in step (1) and (3), VPIPE asynchro-
batches), each layer must have p  n  1 unfinished back- nously transferred the activation tensors of migrated layers,
ward passes. In ASP systems, in stage n, the forward pass of and we verified this by profiling in Fig. 7. To avoid altering
input batch k should work on the version (Vk ) of layer staleness, VPIPE ensures that the Vfwd k
remains consistent
parameters updated by k  p þ n. In BSP systems, for all when a layer is migrated from stage n to stage m. In VPIPE,
stages, if the parameter synchronization happens every u  layer migrations can be triggered multiple times during a
p input batches, the forward pass of input batch k should triggering of VPIPE’s Algorithm 1 (tens of seconds in Sec-
work on the same parameter version k mod u  p. tion 6.4). In our evaluation, each migration with a non-live
 migration approach stalls the pipeline execution by 1.1-6.8s,
k  p þ n if ASP while VPIPE’s migration protocol remains live.
k
Vfwd ¼ : (4)
k mod u  p if BSP

5 SYSTEM IMPLEMENTATION
When a set of layers fli ; :::; lj g are going to be migrated VPIPE’s design leverages the imperative features from
from stage n to stage m, where m ¼ n  1, for each layer, PyTorch. The current popular deep learning frameworks
VPIPE should migrate p  n  1 copy of activation tensors for are typically based on either imperative or declarative pro-
unfinished backward passes. Meanwhile, for ASP systems, gramming. The imperative programs are similar to Python
the Vk should be changed from k  p þ n to k  p þ m. or C++ programs, which perform computations during the
A strawman stop-and-copy migration approach is to stop execution. PyTorch adopts it as the default and only execu-
the execution, synchronously transfer parameter tensors tion mode. Overall, VPIPE is currently implemented by modi-
and activation tensors, and resume the execution. However, fying 2782 LoC to PyTorch [36]. VPIPE’s design and
on training complex DNNs, the tensors to be migrated can implementation is common for all DNN training engines
be up to several gigabytes, leading to a long stall. that follow an imperative programming style. In this sec-
In VPIPE, we present a live runtime layer migration proto- tion, we present three key points to implement VPIPE in
col. Without losing generality, to ease discussion, Figs. 6 PyTorch: how to support distributed on-demand swap and
and 7 shows an example of a forward layer migration in a recompute; how to migrate layers between stages; how to
four-stage (i.e., p ¼ 4) pipeline, where n ¼ 0 and m ¼ 1. If implement an NAS process [39], [45] in VPIPE, as there is no
Stage n is going to migrate layer c to Stage m after the end- existing literature that describes how to implement an NAS
ing of input batch k, the migration will work as follows. In process in pipeline parallelism.
prepare stage, Stage n sends a prepare message to stage m to For the first point, to capture access patterns of tensors,
inform the migration of layer c. Stage m initializes the layer VPIPE intercepted PyTorch’s activation creation in forward
module of layer c and moves the module to GPU memory. passes and reuse in backward passes. In PyTorch, an activa-
Then, stage m sends a ready to stage n. tion tensor is created and saved to an edge of an automatic
Once stage n receives ready, the migration immediately gradient computation (autograd) graph in a data structure
starts in its next forward pass (i.e., forward pass of input SavedVariable. VPIPE intercepted the member functions of
batch k þ 4 in Fig. 6). (1) Stage n immediately asynchronously SavedVariable and saved the tensor pointers to VPIPE’s VTM
498 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 33, NO. 3, MARCH 2022

module (Section 3). In PyTorch, SavedVariable can refer to TABLE 1


both a parameter tensor and an activation tensor. VPIPE dis- Models and Datasets
tinguished a parameter tensor and an activation tensor by
Task Model Dataset
assigning each a property upon their initialization (parame-
ter tensors are initialized during a model initialization, i.e., Image Classification VGG16 [44] ImageNet [9]
module initialization in PyTorch). To precisely predict when Resnet50 [15] ImageNet [9]
to swap back a tensor, VPIPE’s VTM modules pass the cap- AmoebaNet [39] ImageNet [9]
tured access patterns of tensors to other stages (Section 4.2). Translation GNMT [58] WMT16 EN-DE [42]
To support asynchronous and on-demand swap for acti- Transformer [52] WMT16 EN-DE [42]
vation tensors in PyTorch, VPIPE added a tensor level asyn- Language Modeling BERT [10] WMT16 EN-DE [42]
chronous swap feature to PyTorch. PyTorch 1.5.0 currently
only supports a synchronized swap for tensor implementa-
tion (i.e., the main thread will be blocked during the swap).
and 64 GB RAM. Each GPU had 11 GB physical memory
Moreover, to accelerate the tensor swap from CPU memory
and was connected to the host with PCIe 3.0 X16 that pro-
to GPU memory, in VPIPE, we stored the tensors that are
vided a total data transfer bandwidth of 15760 MB/s. Hosts
swapped to CPU memory in a pinned memory. The technical
are connected with 100 Gbps Ethernet, and the average ping
reason is that in PyTorch, CPU memory to GPU memory
latency is 0.17ms.
copies are much faster when they originate from pinned
Workloads. We evaluated six well-studied DNN models
(i.e., page-locked) memory. VPIPE used the pin memoryðÞ
(Table 1) that are widely used in the deep learning commu-
method for PyTorch’s CPU tensor storage.
nity. BERT [10], Transformer [52], AmoebaNet [39], and
VPIPE’s recompute leverages PyTorch’s checkpoint library,
which is a builtin library for recomputing activations. A GNMT [58] are four large DNNs often trained by pipeline
major implementation obstacle for on-demand recompute is parallelism [19], [33]. Transformer [45] and AmoebaNet [39]
to change the training statement at runtime. In VPIPE, we are two typical workloads that have been applied with Neu-
used python’s builtin feature exec stmt, which takes a piece ral Architecture Search. We used the open-source release of
of statement as input and executes the statement, to to mod- each model.
ify a stage’s execution statement at runtime and on-demand These models cover all prevalent DNNs evaluated in
decide whether to recompute a layer’s activation. existing pipeline parallel systems, including Pipedream [33],
To support layer migration between stages (thus, a stage of the DNN is dynamic), VPIPE maintains a DNN stage as structured graph data and has a simple parser that switches between the graph description of a DNN and the PyTorch imperative statement (using exec_stmt). Thus, when a layer migration happens, on the target stage, VPIPE modifies the graph description, initializes the corresponding layer module in PyTorch, overwrites the layer's state with the migrated layer's state, and adds the new layer to the stage's execution statement. On the source stage, VPIPE removes the layer from the stage's execution statement and deletes the layer from GPU memory. VPIPE supports both branches among stages and branches among layers.
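
The per-stage bookkeeping can be sketched as follows, assuming a stage keeps its layers in a torch.nn.ModuleDict that the exec-generated statement iterates over; the helper names are illustrative, not VPIPE's actual interface.

    import torch

    def receive_layer(stage_layers, name, layer_ctor, migrated_state, device="cuda"):
        # Target stage: build the layer from the graph description, then
        # overwrite its freshly initialized parameters with the migrated state.
        layer = layer_ctor().to(device)
        layer.load_state_dict(migrated_state)
        stage_layers[name] = layer   # now part of this stage's execution statement

    def release_layer(stage_layers, name):
        # Source stage: drop the layer from the execution statement and
        # release its GPU memory.
        del stage_layers[name]
        torch.cuda.empty_cache()

    # stage_layers: torch.nn.ModuleDict shared with the stage's generated forward.
    # layer_ctor: constructor recorded in the graph description, e.g.,
    #             lambda: torch.nn.Linear(1024, 1024)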

To support NAS in pipeline parallelism, we implemented the NAS process on both Pipedream and GPipe (Section 6.3) based on the official description of the evolved Transformer [45] and AmoebaNet [39]. Overall, there are two key components in a NAS process: an evolution algorithm that iteratively explores new DNN architectures; and a just-in-time runtime that switches the training workload according to the DNN generated by the evolution algorithm. In the evolution algorithm, when a DNN switch occurs, our NAS implementation deactivates the differing layers in the existing DNN, activates the new layers, and resets parameters when the DNN switch finishes. The above implementation leverages PyTorch's imperative feature (i.e., exec_stmt) and fast-switches between two DNNs without extra stop and initialization time.
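
A hedged sketch of such a switch is given below; the set-based bookkeeping and the constructor map are illustrative assumptions rather than the actual implementation.

    import torch

    def switch_dnn(stage_layers, active, next_arch, layer_ctors, device="cuda"):
        # active:    names of layers currently in the training statement
        # next_arch: layer names of the candidate DNN from the evolution algorithm
        # layer_ctors: maps a layer name to its constructor
        for name in active - next_arch:        # deactivate layers the new DNN drops
            stage_layers[name].requires_grad_(False)
        for name in next_arch - active:        # activate (or create) new layers
            if name not in stage_layers:
                stage_layers[name] = layer_ctors[name]().to(device)
            stage_layers[name].requires_grad_(True)
        for name in next_arch:                 # reset parameters for the new candidate
            for m in stage_layers[name].modules():
                if hasattr(m, "reset_parameters"):
                    m.reset_parameters()
        return set(next_arch)
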
6 EVALUATION

Testbed. Our evaluation was conducted on a GPU farm with 8 hosts. Each host had 4 Nvidia 2080 Ti GPUs, 20 CPU cores, and 64 GB RAM. Each GPU had 11 GB physical memory and was connected to the host with PCIe 3.0 x16, which provided a total data transfer bandwidth of 15760 MB/s. Hosts were connected with 100 Gbps Ethernet, and the average ping latency was 0.17 ms.

TABLE 1
Models and Datasets

  Task                  Model             Dataset
  Image Classification  VGG16 [44]        ImageNet [9]
                        ResNet50 [15]     ImageNet [9]
                        AmoebaNet [39]    ImageNet [9]
  Translation           GNMT [58]         WMT16 EN-DE [42]
                        Transformer [52]  WMT16 EN-DE [42]
  Language Modeling     BERT [10]         WMT16 EN-DE [42]

Workloads. We evaluated six well-studied DNN models (Table 1) that are widely used in the deep learning community. BERT [10], Transformer [52], AmoebaNet [39], and GNMT [58] are four large DNNs often trained by pipeline parallelism [19], [33]. Transformer [45] and AmoebaNet [39] are two typical workloads to which Neural Architecture Search has been applied. We used the open-source release of each model.

These models cover all prevalent DNNs evaluated in existing pipeline parallel systems, including Pipedream [33], GPipe [19], XPipe [14], and PipeMare [59]. Other models evaluated in these systems, including S2VT [53] and AWD LM [31], are surpassed by the DNNs we evaluated and are no longer prevalent. We evaluated two well-known datasets: WMT16 [42] for NLP and ImageNet [9] for vision.

Baselines. We integrated VPIPE into two baseline systems: the most notable ASP pipeline parallel system Pipedream [33] and the most notable BSP pipeline parallel system GPipe [19]. For Pipedream, we used its open-source release [33]; for GPipe, we implemented GPipe by applying a strong synchronization barrier (Section 2) on Pipedream's codebase because GPipe has no official release on PyTorch. Each integration of VPIPE took only several LoC of changes. For a baseline system (e.g., Pipedream), we use Pipedream-VPIPE to represent Pipedream integrated with VPIPE. We compared the throughput of Pipedream-VPIPE with Pipedream alone to indicate VPIPE's improvement on Pipedream. Overall, we evaluated four systems: Pipedream-VPIPE, GPipe-VPIPE, Pipedream, and GPipe. There are also successive systems (i.e., XPipe [14] and PipeMare [59]) that mitigate Pipedream's parameter staleness. However, all these systems share the same performance model as either Pipedream or GPipe.

Batch Size and Training Setup. For all systems, we set the training batch size of each DNN to the largest batch size that can be supported without exceeding any GPU's physical memory limit. As Pipedream directly keeps all activation tensors in GPU memory, to avoid exceeding the GPU memory limit on the front stages, the training batch size supported by Pipedream was 3.2x smaller than that of the other evaluated systems (e.g., GPipe). For all systems, unless otherwise specified, we evaluated them on 8 GPUs and set their default partition (shown in Table 2) with the static partition profiler provided by Pipedream [33], which is the only system that explicitly
describes a partition scheme. In Section 6.2, when training with varied GPU numbers, the default layer partition was also produced by Pipedream's static partition profiler. We also show the learning rate (l.r.) used by the Adam optimizer in Table 2.

TABLE 2
Default Settings of Baseline Systems

  Model     layer #   l.r.      Default Partition
  BERT      488       5x10^-3   [60, 62, 62, 62, 62, 61, 61, 58]
  Trans.    332       5x10^-4   [41, 41, 42, 43, 43, 42, 42, 38]
  Amoe.     2190      5x10^-5   [283, 238, 238, 238, 238, 286, 237, 432]
  GNMT      86        6x10^-2   [11, 12, 11, 10, 8, 9, 13, 12]
  VGG16     40        2x10^-2   [22, 18]
  ResNet50  175       2x10^-2   [116, 59]

  Baseline systems with VPIPE start with the same default partition.

Metrics. We used the number of epochs processed per hour to measure each system's throughput. An epoch in DNN training is a traversal of the whole dataset. In Section 6.3, we used the number of data items processed per hour to measure each system's throughput because a model may be early-stopped before finishing one complete epoch. We defined the ideal throughput as the training throughput supposing the system is running on GPUs with unlimited physical memory (also defined in other systems [18]) and the stage partition of the DNN model can seamlessly remain balanced. Same as previous work [18], we implemented the ideal throughput by directly reusing the GPU memory when out-of-memory exceptions were triggered. We used ALU utilization to indicate the usage of GPU ALUs. We used GPU memory utilization and GPU PCIe utilization to indicate the GPU memory usage and PCIe bandwidth usage. Specifically, for GPU ALU utilization, we used effective ALU utilization to distinguish the effective ALU utilization that contributes to the training process from the wasted ALU utilization that is used for recompute.
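
In other words, the reported utilization figures decompose as

    \[
      U_{\mathrm{total}} = U_{\mathrm{effective}} + U_{\mathrm{wasted}},
      \qquad
      U_{\mathrm{wasted}} \approx \frac{T_{\mathrm{recompute}}}{T_{\mathrm{iteration}}},
    \]

where T_recompute is the ALU time a stage spends re-executing forward passes and T_iteration is the duration of one training iteration; this is only a formalization of the definitions above, introduced here for clarity.
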
Our evaluation focuses on the following questions:
Section 6.1: How was VPIPE's efficiency on static DNN training, compared with the baseline systems?
Section 6.2: How was VPIPE's scalability, compared with the baseline systems?
Section 6.3: How was VPIPE's efficiency on dynamic DNN training, compared with the baseline systems?
Section 6.4: How effective were VPIPE's runtime algorithms and protocol in Section 4?
Section 6.5: What are the limitations of VPIPE?

6.1 Static DNN Training (i.e., NAS Disabled)

We first give an overview of how much VPIPE improved Pipedream and GPipe on training all DNNs. Fig. 8 shows the training curves, which indicate how each model's training score improves as training time increases. Overall, in Fig. 8, to finish the same number of training epochs, VPIPE shortened the training time of GPipe and Pipedream by 23.5 and 53.4 percent on average. Thus, within the same training time, VPIPE allowed both GPipe and Pipedream to achieve better model fitting quality.

Fig. 9 shows the throughput of each system under the same setting as Fig. 8. These results were comparable to the evaluation results in Pipedream [33] and GPipe [19]. When training the four large DNNs, including BERT, Transformer, AmoebaNet, and GNMT, VPIPE improved GPipe's and Pipedream's throughput by 30.7 and 109.2 percent, respectively. To understand VPIPE's improvement on GPipe and Pipedream, we looked into the runtime statistics of all GPUs, shown in Table 3, and the per-GPU memory usage and ALU utilization when training Transformer in Figs. 10 and 11.

VPIPE improved Pipedream most between the two baseline systems on training complex DNNs. In Pipedream, the front stages easily reached the GPU memory limits, as these stages needed to keep many more copies of activation tensors than the rear stages. For example, in Fig. 10, with Pipedream's default partition on Transformer, stage 0 consumed on average 10.3 GB GPU memory, as it needed to hold 8 copies of activation tensors, almost hitting the memory limit (11 GB) of GPU 0; and stage 7 consumed only 4.8 GB GPU memory, less than half of a GPU's capacity. When training Transformer with 8 GPUs, Pipedream only supported a batch size of 32, and this moderate batch size failed to fully utilize the GPU ALU units, making all GPUs' ALU utilization only 42.3 percent in Fig. 10.

Compared with Pipedream, on training the four complex DNNs, Pipedream-VPIPE supported a 3.75x larger batch size and incurred 2.09x effective ALU utilization (Table 3). To accelerate Pipedream, VPIPE alleviated the memory burdens of the front stages by swap and recompute and rebalanced the stages by repartition. In Fig. 10, VPIPE made more swap and recompute operations on the front stages to reduce the memory burden. However, as the front stages incurred more computation overhead to reduce memory, the front stages took longer execution time, and the execution time among stages was unbalanced.

In Fig. 10, when VPIPE only enabled the swap and recompute optimization on each local stage (i.e., Pipedream-VPIPE-SR, denoted as P-V-SR), we observed that although stages 0-3 had high total ALU utilization (87.6-95.3 percent), stages 4-7 incurred low ALU utilization of only 61.4-81.7 percent. To make the pipeline more balanced, in VPIPE's Algorithm 1, VPIPE iteratively performed stage repartition that migrated layers from the front stages to the rear stages. This made stages 4-7's ALU utilization high (89.7-95.6 percent) and further improved the pipeline's throughput.

VPIPE's optimization space on GPipe was GPipe's overhead of an extra forward pass; in our study, an extra forward pass took 23.8-36.5 percent wasted computation on various DNNs [6], [33], [39], [45], [52] and training settings. When training complex DNNs on a large number of GPUs (> 8), GPipe achieved better training efficiency than Pipedream because, as shown in Table 3, although GPipe needed to process an extra forward pass, compared with Pipedream, GPipe supported 3.75x the training batch size and incurred 1.59x total effective ALU utilization on all GPUs. Thus, VPIPE had more improvement space on Pipedream.

Fig. 8. Model fitting score versus time for training six models using 8 GPUs. For (a)-(f), the models are trained with GPipe (G) and GPipe-VPIPE (G+V). For (g)-(l), the models are trained with Pipedream (P) and Pipedream-VPIPE (P+V). For BERT, the score metric is next sentence prediction accuracy [10]. For Transformer and GNMT, the score metric is BLEU [58]. For AmoebaNet, VGG16, and ResNet50, the score metric is top-5 accuracy [15], [33], [39], [44].

Compared with GPipe, GPipe-VPIPE used 73.2 percent less wasted GPU ALU utilization. The reason is that GPipe-VPIPE invoked swap and provided a dynamic and efficient strategy to reduce GPipe's recompute overhead at runtime (Algorithm 2). In exchange, GPipe-VPIPE used 7.9x more PCIe resource than GPipe for swapping. The PCIe resource was usually spare in GPipe's default setting except when network communication was invoked; VPIPE tackled the PCIe interference between swap and network communication in Section 4.2. Moreover, when NAS was enabled, VPIPE improved GPipe by up to 291.3 percent, as discussed in Section 6.3.

When training "small" DNNs VGG16 and ResNet50, VPIPE improved Pipedream and GPipe by merely 5.2 and 7.3 percent on average. The reason is that when we trained VGG16 and ResNet50, following the setting of Pipedream [33], we partitioned both VGG16 and ResNet50 into two stages: a stage that contained convolution layers and a stage that contained fully connected layers. We used 7 GPUs to perform data parallelism on the former stage and 1 GPU to train the latter one. This two-stage setting limited the optimization space of VPIPE's SRP algorithm.

We also evaluated the ideal throughput of GPipe and Pipedream, and both Pipedream-VPIPE and GPipe-VPIPE incurred a degradation from the ideal throughput. The reason is that, due to the limits of GPU memory capacity and PCIe bandwidth, to support a sufficiently large batch size that made all GPUs' ALU units fully utilized, VPIPE incurred inevitable recompute overhead on the front stages to avoid exceeding the GPU physical memory limit (G1). In total, as shown in the GPU utilization column of Table 3, VPIPE needed 6.7 percent inevitable wasted ALU utilization on average for recompute.

Overall, VPIPE accelerated both Pipedream and GPipe on various complex DNNs under static training settings. VPIPE's improvement stemmed from a higher utilization rate of all GPU resources, including the effective ALU utilization, memory, and PCIe usage.

Fig. 9. Throughput of the four systems with the 8 GPU setting.

6.2 Scalability

To evaluate whether VPIPE is scalable to large GPU clusters, we ran Pipedream-VPIPE, GPipe-VPIPE, Pipedream, and GPipe on different numbers (4-16) of GPUs. In addition, an alternative approach to apply dynamic swap and recompute systems (i.e., Capuchin [37]) to distributed settings is to integrate Capuchin into each worker of data parallelism. We therefore also evaluated Capuchin with data parallelism (parameter server) on different numbers of GPUs. For pipeline parallelism, the motivation for using larger GPU clusters is often to train larger DNNs [19]. Thus, we made the DNN layer number proportional to the number of involved GPUs (e.g., DNNs used for the 16 GPU setting had doubled layers compared with DNNs used for the 8 GPU setting). In Fig. 12, we used the total effective utilization of all GPUs to evaluate the scalability.

Pipedream achieved poor scalability. In pipeline parallelism, the number of simultaneously injected input batches is proportional to the GPU (stage) number (Section 2.2); as Pipedream directly keeps activation tensors in GPU memory, an increasing GPU number makes the number of activation tensors kept by a single GPU (with a fixed memory) also increase. To avoid exceeding the GPU memory limit, Pipedream needed to proportionally decrease the size of each input batch. For example, when training Transformer with 8 GPUs, the batch size supported by Pipedream was 32; when training Transformer with 16 GPUs, the batch size supported by Pipedream dropped to 16.
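
As a simplified illustration of this effect: in a Pipedream-style pipeline with S stages, stage i (0-indexed) keeps roughly one activation copy per in-flight input batch, i.e., about S - i copies, so its memory footprint is approximately

    \[
      M_i \;\approx\; (S - i)\, b\, a_i + p_i,
    \]

where b is the batch size, a_i is stage i's per-sample activation footprint, and p_i is its parameter and optimizer state. With per-GPU memory fixed, the front stage (i = 0) forces b to shrink roughly in proportion to 1/S, which matches the drop of the supported batch size from 32 to 16 when the stage count doubles.
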
VPIPE’s improvement stemmed from a higher utilization rate A larger training batch size can lead to higher GPU ALU
of all GPU resources, including the effective ALU utiliza- utilization [60]; however, in the settings of Fig. 12, the batch
tion, memory, and PCIe usage. size supported by Pipedream were often not high enough to
fully utilize a GPU’s ALU units. Therefore, when more
GPUs were involved in Pipedream, the total effective ALU
utilization increased little and even dropped when training
AmoebaNet, as the batch size dropped to a very low num-
ber (e.g., 1 when training with 16 GPUs) and the parallel uti-
lization of ALUs on all GPUs dropped significantly.
Compared with Pipedream, Pipedream-VPIPE, GPipe-VPIPE,
and GPipe did not suffer from batch size degradation when
more GPUs were involved. GPipe used all  recompute strat-
Fig. 9. Throughput of four systems with 8 GPU setting. egy without keeping any activation tensors in GPU memory,
GPipe used an all-recompute strategy and thus supported a sufficiently large batch size to fully utilize a GPU's ALU units. With VPIPE integrated, Pipedream-VPIPE and GPipe-VPIPE supported the same large batch size as GPipe, while VPIPE reduced the recompute overhead in GPipe. Thus, Pipedream-VPIPE and GPipe-VPIPE were as scalable as GPipe and achieved better total effective utilization than GPipe.

TABLE 3
Resource Consumption, Final Fitting Scores, and Micro Events of Training Four Large DNNs With Four Systems on 8 GPUs

BE. is BERT. TR. is Transformer. AM. is AmoebaNet. GN. is GNMT. Sco. is the final model fitting score when the training finishes, and the score metric of each model is the same as in Fig. 8. Bat. is the training batch size. GPU is all GPUs' effective/total ALU utilization. Fwd and bwd mean the forward pass time and backward pass time of each training iteration.

For both vanilla data parallelism (DP) and Capuchin with data parallelism (DP-C), the scalability was poor because, for complex DNNs, the network communication cost for parameter synchronization was the major bottleneck (Section 2). However, DP-C still incurred better effective ALU utilization, as Capuchin used swap and recompute to enlarge the training batch size supported by each GPU, yielding high ALU utilization on each GPU worker.

To sum up, with VPIPE, both BSP (GPipe-VPIPE) and ASP (Pipedream-VPIPE) systems achieved almost linear scalability that is comparable to the scalable pipeline parallelism system GPipe, while VPIPE achieved better total effective GPU utilization. These results indicate that VPIPE is both efficient and scalable. As the emergence of more giant DNNs can be foreseen [6], the design of VPIPE is able to remain efficient when more and more GPUs are involved.

6.3 Dynamic DNN Training (i.e., NAS Enabled)

To evaluate VPIPE's efficiency on dynamic training workloads, we conducted a case study of how VPIPE performed on neural architecture search (NAS), one of the most prevalent dynamic training processes. We selected two models (Transformer [52] and AmoebaNet [39]) that have been pervasively used for neural architecture search. For both Transformer and AmoebaNet, we implemented the NAS process according to their published description [45] of an evolution algorithm: it creates a set of population DNN models, which have a similar architecture, and trains them on a subset (around 1000 data entries) of their dataset to quickly eliminate the unqualified models. This elimination process often took the most time during a NAS process. To ensure fair evaluation, we made the evolution algorithm deterministic: i.e., for each NAS process, the population of models was trained in a determined sequence.

Overall, VPIPE accelerated both GPipe and Pipedream on these two NAS-enabled DNN trainings by 245.4-291.3 percent and 421.3-463.4 percent, while VPIPE made no impact on the upper evolutionary algorithm and did not downgrade the quality of NAS.
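
The elimination pass can be sketched as follows; the callables (switch_to, train_short, score) are placeholders for the actual training system, so this illustrates the control flow rather than the evaluated implementation.

    def elimination_round(population, switch_to, train_short, score):
        # population: candidate DNN descriptions produced by the evolution algorithm
        # switch_to / train_short / score: callables supplied by the training system;
        # train_short trains a candidate on a small, fixed subset (~1000 samples).
        results = []
        for dnn in population:          # fixed order keeps the NAS run deterministic
            switch_to(dnn)              # just-in-time switch of the pipelined DNN
            train_short(dnn)
            results.append((score(dnn), dnn))
        results.sort(key=lambda r: r[0], reverse=True)
        return [dnn for _, dnn in results[: max(1, len(results) // 2)]]  # keep fitter half
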

Fig. 10. Resource usage of each GPU when training (NAS-disabled) Transformer with Pipedream, Pipedream-VPIPE, Pipedream-VPIPE-SR on 8 GPUs. Unfilled bars are wasted GPU ALU utilization for recompute.

Fig. 11. Resource usage of each GPU when training (NAS-disabled) Transformer with GPipe, GPipe-VPIPE, GPipe-VPIPE-SR on 8 GPUs. Unfilled bars are wasted GPU ALU utilization for recompute.

Fig. 12. Scalability. DP means pure data parallelism. DP-C means data parallelism + Capuchin [37].
Fig. 13. Training profiling under dynamic training processes (Evolved Transformer). V-SR means VPIPE with swap/recompute enabled and repartition disabled. In all sub-figures of (a) and (b), the 1st is the training throughput collected at every finished input batch; the 2nd is the real-time layer number of each stage (red means layer increase; blue means layer decrease); the 3rd and 4th are the resource utilizations of all GPUs at the end of each sub-figure's time axis.

We selected a snippet for each NAS-enabled model (Transformer and AmoebaNet) training on the two baseline systems (Pipedream and GPipe), and Fig. 13 shows how VPIPE improved both systems on NAS-enabled model training. In Figs. 13a and 13b, 8 layers were added twice, at 342s and 594s, on the first stage, and 8 layers were deleted twice, at 880s and 1123s, on the second stage. In Figs. 14a and 14b, 46 layers were deleted twice, at 921s and 1157s, on the first stage, and 46 layers were added twice, at 1265s and 1483s, on the second stage.

For the vanilla baseline systems without VPIPE (Pipedream and GPipe), the static partition strategy used by both systems did not cope with this training dynamicity: taking the Transformer in Figs. 13a and 13b as an example, when layers were added on the first stage, both systems incurred a performance drop as the execution time of stage 0 suddenly increased, bottlenecking the whole pipeline; when layers were deleted on the second stage, the whole pipeline's throughput did not increase as stage 0 was still the throughput bottleneck. In Figs. 13a and 13b, although the ALU utilization of stage 0 was high, the other stages all incurred a low ALU utilization as these stages often needed to wait for the execution of stage 0.

When only VPIPE's local swap and recompute optimization (Algorithm 2) on each stage (i.e., VPIPE-SR) was enabled, although VPIPE-SR improved the two baseline systems' throughput by enlarging the supported batch size (for Pipedream) or reducing the recompute overhead (for GPipe), VPIPE-SR was still not able to cope with this training dynamicity. This implies that existing single-GPU swap and recompute systems (e.g., Capuchin [37]) are not sufficient to achieve efficient pipeline parallelism, in two folds: first, these systems do not support distributed memory management (Section 4.2); second, even if a distributed swap and recompute system (e.g., VPIPE-SR) exists, it still incurs suboptimal training efficiency.

In contrast, when VPIPE with a full implementation of Algorithm 1 was integrated into Pipedream and GPipe, under training dynamicity, both systems (Pipedream-VPIPE and GPipe-VPIPE) adjusted their layer distribution on all stages to achieve a near-optimal training throughput. In Fig. 13, the second figure of each sub-figure shows how VPIPE adjusted the layer distribution when layer activation/deactivation was suddenly triggered during a training process. For example, when layers were added on stage 0 at 342s in Figs. 13a and 13b, VPIPE's global planner collected the runtime statistics of all stages and noticed an imbalance of execution time among stages. VPIPE then triggered Algorithm 3 to generate a new balanced partition. VPIPE's layer manager immediately started to migrate layers from stage 0 to the subsequent stages (i.e., stages 3, 5, and 6). Then, VPIPE's layer manager locally performed Algorithm 2 to find an optimized local memory management plan. After that, as described in Algorithm 1, VPIPE iteratively performed Algorithm 3 and Algorithm 2 until no better SRP strategy was found.

Fig. 14. Training profiling under dynamic training processes (AmoebaNet) with the same setting in Fig. 13.
TABLE 4
Performance of VPIPE's Partition Algorithm versus the Kernighan-Lin Algorithm [50]

O. G means the original graph with N layers and E edges. C. G means the coarsened graph. Cost means the network communication cost caused by the partition algorithm (1e7 bytes per training sample). Each DNN model used is for 16-GPU training, and the algorithms partition each DNN into 16 stages.

Fig. 15. (a) Network usage of Pipedream with and without VPIPE. VPIPE's network usage contains VPIPE's network overhead (in unfilled red bars), including layer migration and control message costs. (b) Real-time GPU ALU utilization statistics with VPIPE's live migration and the non-live migration approach.

In our evaluation, each iterative process of Algorithm 1 finished within 3-9 iterations (Section 6.1) without performance downgrade, thanks to VPIPE's fast SRP algorithm and live layer migration protocol. We will further discuss this in Section 6.4. We also evaluated the ideal throughput in Fig. 13, and VPIPE incurred a degradation from the ideal throughput for the same reason as discussed in Section 6.1.

To sum up, with VPIPE, both Pipedream-VPIPE and GPipe-VPIPE transparently changed their layer distribution along with the training dynamicity; and by doing so, both systems kept their training throughput close to the ideal throughput during extremely dynamic training. Both forward and backward layer migrations were triggered frequently during a NAS training process, making both VPIPE's forward and backward layer migration designs desirable.

6.4 Effectiveness of VPIPE's Algorithms

Effectiveness of VPIPE's SRP Algorithm. VPIPE's SRP algorithm (Algorithm 1) is a decomposition method that iteratively optimizes two sub-problems: a local search of swap and recompute (Algorithm 2); and a global search of stage partition (Algorithm 3).
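
The overall control flow can be sketched as follows; local_plan, repartition, and throughput are placeholders for the per-stage local planner (Algorithm 2), the repartitioner (Algorithm 3), and the throughput estimator, so this illustrates only the iteration structure rather than the exact Algorithm 1.

    def srp_search(stages, local_plan, repartition, throughput):
        # Alternate a local swap/recompute search on every stage with a global
        # stage repartition, and stop as soon as a full round no longer improves
        # the estimated training throughput.
        best = throughput(stages)
        improved = True
        while improved:
            improved = False
            for stage in stages:                 # local step, per stage
                local_plan(stage)
            candidate = repartition(stages)      # global step
            if throughput(candidate) > best:
                stages, best = candidate, throughput(candidate)
                improved = True
        return stages, best
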
We first summarize how VPIPE's SRP algorithm improved the baseline systems. For both static training processes (Section 6.1) and dynamic training processes (Section 6.3), VPIPE made the training throughput of both Pipedream and GPipe always close to the ideal throughput; VPIPE's throughput degradation from the ideal throughput was caused by the inevitable recompute overhead needed to make all GPUs' total effective ALU utilization high (e.g., Figs. 10 and 11). From Table 3, compared with the bare-metal baseline systems Pipedream and GPipe, VPIPE's SRP algorithm essentially well utilized all available resources of all GPUs.

We then examined how fast VPIPE's SRP algorithm was. Overall, each invocation of the SRP algorithm finished within 10 iterations. The major time cost of each iteration is taken by the graph partition sub-algorithm (Algorithm 3), which solves the NP-hard graph partitioning problem (Section 4.1). In Table 4, we compared the runtime cost of VPIPE's partition algorithm (Algorithm 3) with the original KL algorithm [50] on partitioning four complex DNNs. The results show that VPIPE sped up the KL algorithm by 4x-32x. The reason is that VPIPE's coarsening step greatly reduced the complexity of the graph used in the partitioning (Section 4.2). On average, VPIPE reduced the number of graph nodes by 3x-32x and the number of graph edges by 3x-35x. This time cost is negligible compared with the training time. The final edge cuts (i.e., total network communication costs across partitions) produced by VPIPE and the KL algorithm were equal, as VPIPE used KL-refinement to ensure that no better partition on the original graph was missed.

In Fig. 15a, we collected the network communication costs of Pipedream-VPIPE and Pipedream using the same setting as in Fig. 9. Overall, Pipedream-VPIPE achieved network communication costs comparable to Pipedream's when training the four complex DNNs. VPIPE's layer migration costs and control message costs incurred little overhead as these costs were amortized over the long training time (up to hundreds of hours). During a layer migration process, VPIPE's peak data transfer rate was about 432 MB/s, far from blocking either the network connection or the PCIe connection across stages.

To sum up, these results indicate that VPIPE's SRP algorithm is both fast converging and able to achieve a near-optimal plan that well utilizes all GPU resources for efficient pipeline parallel training.

Effectiveness of the Live Layer Migration Protocol. VPIPE's live layer migration protocol (Section 4.3) transparently migrates a layer to realize a new partition without degrading the training throughput. This guarantees that VPIPE can iteratively search for a better SRP plan (Section 4.2) with a negligible training performance penalty.

To examine the necessity of VPIPE's live layer migration protocol, we compared it with a non-live layer migration approach (Section 4.2): stop injecting new input batches for the upper system, clean up the pipeline, manually migrate the layer to a new stage, and reboot a new pipeline. In Fig. 13, the red dashed line is the training throughput using the non-live layer migration. The non-live migration degraded the training throughput by up to 60.3 percent because, in each iteration of VPIPE's Algorithm 1, a repartition would be triggered, and the pipeline would be cleaned up. Fig. 15b shows the real-time ALU utilization comparison between VPIPE's live migration approach and the non-live migration approach, during an iterative run of Algorithm 1 that triggers 9 stage repartitions. In each repartition, the total ALU utilization dropped to zero as the
pipeline was cleaned up. In comparison, VPIPE live-migrated a layer without notable throughput degradation or GPU stall.

6.5 Discussions

VPIPE has two limitations. First, VPIPE assumes that, for any DNN workload trained with VPIPE, a single layer fits within the memory limits of a single GPU. This is also assumed by other pipeline parallel systems (e.g., Pipedream and GPipe). In reality, for all recent complex DNNs evaluated by VPIPE, the layers can all fit in a single GPU. Second, VPIPE's layer migration protocol (Section 4.3) remains live when the time cost of transferring a layer's tensors can overlap with the computation time of DNN training. There might exist special DNNs where the execution time of all layers is extremely short while a layer holds a non-negligible amount of data to transfer. In all the models we studied and in the literature, DNNs are both computation intensive and memory intensive [18], [37], making VPIPE's off-the-critical-path data transfer realizable, as verified in Section 6.4.
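
Stated roughly, migrating a layer j stays off the critical path as long as

    \[
      T_{\mathrm{migrate}}(j) \;\approx\;
      \frac{\mathrm{size}(\mathrm{params}_j) + \mathrm{size}(\mathrm{activations}_j)}
           {\min(BW_{\mathrm{network}},\, BW_{\mathrm{PCIe}})}
      \;\lesssim\; T_{\mathrm{compute}},
    \]

where T_compute is the computation time of the training iterations that the transfer overlaps with; this is an approximate restatement of the overlap condition above, not a new bound.
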
In future work, we envision three applications of VPIPE. First, VPIPE has the unique strength to support more dynamic training paradigms (e.g., DyNet [34]) other than NAS, as DyNet-enabled dynamic DNNs (e.g., LSTM [31]) are prevalent and powerful in handling input data with varying lengths (e.g., sentences). Second, existing NAS algorithms produce DNN evolvements under the assumption that GPU memory is unlimited. However, when these NAS algorithms are deployed with pipeline parallelism, they may produce DNN evolvements that cannot be realized with pipeline parallelism, leading to poor search quality. Leveraging VPIPE's pipeline statistics, researchers can let NAS algorithms be aware of the underlying pipeline resources, making NAS both highly accurate and feasible under limited hardware resources. Third, as DNNs today are deployed with various training frameworks, in addition to PyTorch, VPIPE can also augment other imperative training engines (e.g., MXNet [8] and TensorFlow [1]).

7 RELATED WORK

Data Parallel Systems. Data parallelism [28] has been widely adopted in DNN training to support large batch size training. In data parallelism, inputs are partitioned across workers. Each worker maintains a local copy of the model parameters and trains on its own partition of inputs while periodically synchronizing weights with other workers. Typical data parallelism systems assume that a DNN model can fit into a single GPU. Nevertheless, the size of recent DNNs has grown far beyond a single GPU's capacity, driving researchers to conduct studies [19], [21] on model parallelism. To support large DNN training with data parallelism, DeepSpeed [38] partitions a DNN's parameter and optimizer states across workers and transfers these states on demand during training. DeepSpeed [38] reported a 1.5x network communication volume compared with a typical data parallel system (e.g., Parameter Server). Compared with data parallelism, pipeline parallelism (e.g., VPIPE) incurs much less network communication volume [19], [33] and better scalability during large DNN training [19] (see Section 6.2). Overall, data parallelism is complementary to pipeline parallelism systems and can be integrated into VPIPE as mixed parallelism to support large batch size training.

Pipeline Parallel Systems. Pipeline (model) parallelism is a special type of model parallel system. Model parallel systems are designed to train complex DNN models that cannot fit into a single GPU's memory. Besides Pipedream [33] and GPipe [19], there are many successive pipeline parallel systems that try to address Pipedream's parameter staleness problem. XPipe [14] uses parameter prediction to mitigate the staleness issues incurred by ASP pipeline parallel systems (i.e., Pipedream). XPipe directly keeps the activation memories in GPU and has the same performance model as Pipedream. PipeMare [59] adapts GPipe's all-recompute strategy to ASP systems and has a performance and memory model similar to GPipe's. However, PipeMare shares the same limitations as GPipe.

Hybrid Parallel Systems. Existing pipeline parallel systems [14], [19], [33], [59] assume that the GPU resource consumptions of layers are roughly evenly distributed. In most recent large DNNs like Transformer [52], BERT [10], GPT-3 [6], and AmoebaNet [39], DNN layers are usually homogeneous and even in training resource consumption. Nevertheless, in some DNNs like ResNet50 [15] and VGG16 [44], convolution layers usually take much more computation time than the fully connected layers. Hybrid parallelism systems, including OWT [26], FlexFlow [30], etc., are designed to improve the training efficiency of such heterogeneous DNNs. Specifically, these systems apply data parallelism to convolution layers and apply model parallelism to fully connected layers. These systems are orthogonal to VPIPE, and we leave the support of hybrid parallelism as VPIPE's future work.

Training Memory Reduction. DNN training is memory intensive. Training memory reduction has been widely studied by existing work [18], [37]. Existing memory reduction approaches mainly fall into two categories: transparent approaches, including swap [18] and recompute [37], that do not affect the training accuracy; and opaque approaches, such as low precision training [20] and mixed-precision training, that trade off training accuracy for training memory. VPIPE aims to act as a transparent layer so that VPIPE's memory reduction will not affect the upper systems. Thus, opaque memory reduction approaches are orthogonal to VPIPE. There are many transparent memory reduction systems that are designed for single-GPU training. vDNN [40] and SwapAdvisor [18] focus only on swap. SuperNeuron [56] and Capuchin [37] coherently combine swap and recompute to dynamically reduce the memory consumption of DNN training on a single GPU. However, these single-GPU systems are not designed to cope with the challenges stemming from pipeline parallelism (Section 2). A recent study [54] partially offloads the recompute overhead to CPU processors. This work is complementary to VPIPE and can be integrated into VPIPE to further reduce the recompute overhead.

Nvidia proposes Unified Memory [51], a general unified memory address space accessible from both CPU and GPU, so that a process can allocate a memory space larger than a GPU's physical capacity. Nvidia Zero-Copy [61] allows an integrated GPU (where GPU and CPU physically share memory devices, common in mobile devices) to directly access pinned memory on the CPU. VPIPE focuses on discrete GPUs (where a GPU has its own memory devices) in data centers. If a training process exceeds a GPU's physical capacity, Unified Memory automatically migrates tensors (e.g., activations) from GPU to CPU. When
these tensors are accessed later by the GPU ALUs, a Unified Memory page fault is triggered, and the tensors needed are synchronously moved back from CPU to GPU. Such per-host, on-demand moving back significantly blocks a deep learning application's execution (e.g., Unified Memory can slow down a DNN's execution by more than 1x [5]). Compared with Unified Memory, VPIPE's distributed runtime (Section 4.2) enables VPIPE to predict when tensors in CPU will be needed and to asynchronously pre-fetch these tensors back to GPU before they are accessed, which prevents blocking the normal execution; VPIPE's async swap has an overall negligible overhead on the training performance (Section 4.2). Besides swap, VPIPE's distributed memory management also contains features like recompute and migrate.

8 CONCLUSION

In this paper, we present VPIPE, the first dynamic memory and layer partition management system for pipelined parallelism, acting as a virtualized layer between a typical pipeline parallel system and its underlying execution engine. VPIPE can accelerate existing pipeline parallel systems under both static and dynamic training of complex DNNs, making them both efficient and scalable. VPIPE's source code is released at: github.com/hku-systems/vpipe.

ACKNOWLEDGMENTS

The authors would like to thank all reviewers for their valuable comments. This work was funded in part by Huawei Innovation Research Program (HIRP) Flagship, under Grants HK RGC ECS 27200916, HK RGC GRF 17207117, 17202318, and 27208720, in part by Croucher Innovation Award, in part by National NSF China under Grant 61802358, and in part by the USTC Research Funds of Double First-Class Initiative, under Grant YD2150002006. Shixiong Zhao and Fanxin Li contributed equally to this work.

REFERENCES

[1] M. Abadi et al., "TensorFlow: A system for large-scale machine learning," in Proc. 12th USENIX Symp. Operating Syst. Des. Implementation, 2016, pp. 265–283.
[2] S. Areibi, "An integrated genetic algorithm with dynamic hill climbing for VLSI circuit partitioning," in Proc. Genet. Evol. Comput. Conf., 2000, pp. 97–102.
[3] S. Areibi and A. Vannelli, "Distributed advanced search techniques for circuit partitioning," in Proc. IEEE Can. Conf. Elect. Comput. Eng., 1998, pp. 553–556.
[4] PyTorch CUDA streams. Accessed: Nov. 12, 2020. [Online]. Available: https://pytorch.org/docs/stable/notes/cuda.html#cuda-streams
[5] Z. Bai, Z. Zhang, Y. Zhu, and X. Jin, "PipeSwitch: Fast pipelined context switching for deep learning applications," in Proc. 14th USENIX Symp. Operating Syst. Des. Implementation, 2020, pp. 499–514.
[6] T. B. Brown et al., "Language models are few-shot learners," 2020, arXiv:2005.14165.
[7] T. N. Bui and C. Jones, "A heuristic for reducing fill-in in sparse matrix factorization," Soc. Ind. Appl. Math., Philadelphia, PA, USA, Tech. Rep., 1993.
[8] T. Chen et al., "MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems," 2015, arXiv:1512.01274.
[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 248–255.
[10] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," 2018, arXiv:1810.04805.
[11] S. Dutt, "New faster Kernighan-Lin-type graph-partitioning algorithms," in Proc. Int. Conf. Comput. Aided Des., 1993, pp. 370–377.
[12] C. Farhat, E. Wilson, and G. Powell, "Solution of finite element systems on concurrent processing computers," Eng. Comput., vol. 2, no. 3, pp. 157–165, 1987.
[13] P.-O. Fjällström, Algorithms for Graph Partitioning: A Survey, vol. 3. Linköping, Sweden: Linköping University, Electronic Press, 1998.
[14] L. Guan, W. Yin, D. Li, and X. Lu, "XPipe: Efficient pipeline model parallelism for multi-GPU DNN training," 2019, arXiv:1911.04610.
[15] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[16] M. T. Heath and P. Raghavan, "A cartesian parallel nested dissection algorithm," SIAM J. Matrix Anal. Appl., vol. 16, no. 1, pp. 235–253, 1995.
[17] B. Hendrickson and R. Leland, "A multi-level algorithm for partitioning graphs," in Proc. ACM/IEEE Conf. Supercomputing, 1995, pp. 28–es.
[18] C.-C. Huang, G. Jin, and J. Li, "SwapAdvisor: Pushing deep learning beyond the GPU memory limit via smart swapping," in Proc. 25th Int. Conf. Architectural Support Program. Lang. Operating Syst., 2020, pp. 1341–1355.
[19] Y. Huang et al., "GPipe: Efficient training of giant neural networks using pipeline parallelism," in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 103–112.
[20] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Quantized neural networks: Training neural networks with low precision weights and activations," J. Mach. Learn. Res., vol. 18, no. 1, pp. 6869–6898, 2017.
[21] Z. Jia, M. Zaharia, and A. Aiken, "Beyond data and model parallelism for deep neural networks," 2018, arXiv:1807.05358.
[22] J. A. Joao, M. A. Suleman, O. Mutlu, and Y. N. Patt, "Bottleneck identification and scheduling in multithreaded applications," ACM SIGARCH Comput. Architect. News, vol. 47, no. 4, pp. 223–234, 2012.
[23] G. Karypis and V. Kumar, "Multilevel graph partitioning schemes," in Proc. Conf. ICPP (3), 1995, pp. 113–122.
[24] G. Karypis and V. Kumar, "Analysis of multilevel graph partitioning," in Proc. ACM/IEEE Conf. Supercomputing, 1995, pp. 29–es.
[25] B. W. Kernighan and S. Lin, "An efficient heuristic procedure for partitioning graphs," Bell Syst. Tech. J., vol. 49, no. 2, pp. 291–307, Feb. 1970.
[26] A. Krizhevsky, "One weird trick for parallelizing convolutional neural networks," 2014, arXiv:1404.5997.
[27] C.-H. Lee, M. Kim, and C. I. Park, "An efficient k-way graph partitioning algorithm for task allocation in parallel computing systems," in Proc. 1st Int. Conf. Syst. Integration, 1990, pp. 748–751.
[28] M. Li et al., "Scaling distributed machine learning with the parameter server," in Proc. 11th USENIX Symp. Operating Syst. Des. Implementation, 2014, pp. 583–598.
[29] W. Liu, Z. Wang, X. Liu, N. Zeng, Y. Liu, and F. E. Alsaadi, "A survey of deep neural network architectures and their applications," Neurocomputing, vol. 234, pp. 11–26, 2017.
[30] W. Lu, G. Yan, J. Li, S. Gong, Y. Han, and X. Li, "FlexFlow: A flexible dataflow accelerator architecture for convolutional neural networks," in Proc. IEEE Int. Symp. High Perform. Comput. Architect., 2017, pp. 553–564.
[31] S. Merity, N. S. Keskar, and R. Socher, "Regularizing and optimizing LSTM language models," 2017, arXiv:1708.02182.
[32] J. M. Mulvey and A. Ruszczyński, "A new scenario decomposition method for large-scale stochastic optimization," Operations Res., vol. 43, no. 3, pp. 477–490, 1995.
[33] D. Narayanan et al., "PipeDream: Generalized pipeline parallelism for DNN training," in Proc. 27th ACM Symp. Operating Syst. Princ., 2019, pp. 1–15.
[34] G. Neubig et al., "DyNet: The dynamic neural network toolkit," 2017, arXiv:1701.03980.
[35] D. Nicoara, S. Kamali, K. Daudjee, and L. Chen, "Hermes: Dynamic partitioning for distributed social network graph databases," in Proc. 18th Int. Conf. Extending Database Technol., 2015, pp. 25–36.
[36] A. Paszke et al., "PyTorch: An imperative style, high-performance deep learning library," in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 8026–8037.
[37] X. Peng et al., "Capuchin: Tensor-based GPU memory management for deep learning," in Proc. 25th Int. Conf. Architectural Support Program. Lang. Operating Syst., 2020, pp. 891–905.
[38] S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, "ZeRO: Memory optimizations toward training trillion parameter models," in Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal., 2020, pp. 1–16.
[39] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, "Regularized evolution for image classifier architecture search," in Proc. AAAI Conf. Artif. Intell., 2019, pp. 4780–4789.
[40] M. Rhu, N. Gimelshein, J. Clemons, A. Zulfiqar, and S. W. Keckler, "vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design," in Proc. 49th Annu. IEEE/ACM Int. Symp. Microarchitect., 2016, pp. 1–13.
[41] I. Safro, P. Sanders, and C. Schulz, "Advanced coarsening schemes for graph partitioning," J. Exp. Algorithmics, vol. 19, pp. 1–24, 2015.
[42] R. Sennrich, B. Haddow, and A. Birch, "Edinburgh neural machine translation systems for WMT 16," 2016, arXiv:1606.02891.
[43] A. Sergeev and M. Del Balso, "Horovod: Fast and easy distributed deep learning in tensorflow," 2018, arXiv:1802.05799.
[44] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014, arXiv:1409.1556.
[45] D. R. So, C. Liang, and Q. V. Le, "The evolved transformer," 2019, arXiv:1901.11117.
[46] E. Strubell, A. Ganesh, and A. McCallum, "Energy and policy considerations for deep learning in NLP," 2019, arXiv:1906.02243.
[47] Y. Sun, M. Kirley, and S. K. Halgamuge, "A recursive decomposition method for large scale continuous optimization," IEEE Trans. Evol. Comput., vol. 22, no. 5, pp. 647–661, Oct. 2018.
[48] S. H. Teng and P. Spheres, "Unified geometric approach to graph separators," in Proc. 31st Annu. Symp. Foundations Comput. Sci., 1991, pp. 538–547.
[49] F. Teraoka, Y. Yokore, and M. Tokoro, "A network architecture providing host migration transparency," in Proc. Conf. Commun. Architecture Protoc., 1991, pp. 209–220.
[50] J. L. Träff, "Direct graph k-partitioning with a Kernighan–Lin like heuristic," Operations Res. Lett., vol. 34, no. 6, pp. 621–629, 2006.
[51] Nvidia unified memory. Accessed: Mar. 21, 2021. [Online]. Available: https://developer.nvidia.com/blog/unified-memory-cuda-beginners/
[52] A. Vaswani et al., "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998–6008.
[53] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko, "Sequence to sequence – Video to text," in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 4534–4542.
[54] M. Wahib et al., "Scaling distributed deep learning workloads beyond the memory capacity with KARMA," 2020, arXiv:2008.11421.
[55] L. Wang, S. Xie, T. Li, R. Fonseca, and Y. Tian, "Sample-efficient neural architecture search by learning action space," 2019, arXiv:1906.06832.
[56] L. Wang et al., "Superneurons: Dynamic GPU memory management for training deep neural networks," in Proc. 23rd ACM SIGPLAN Symp. Princ. Pract. Parallel Program., 2018, pp. 41–53.
[57] L. Wang, Y. Zhao, Y. Jinnai, Y. Tian, and R. Fonseca, "AlphaX: Exploring neural architectures with deep neural networks and Monte Carlo tree search," 2019, arXiv:1903.11059.
[58] Y. Wu et al., "Google's neural machine translation system: Bridging the gap between human and machine translation," 2016, arXiv:1609.08144.
[59] B. Yang, J. Zhang, J. Li, C. Re, C. R. Aberger, and C. De Sa, "PipeMare: Asynchronous pipeline parallel DNN training," 2019, arXiv:1910.05124.
[60] E. Yang, S.-H. Kim, T.-W. Kim, M. Jeon, S. Park, and C.-H. Youn, "An adaptive batch-orchestration algorithm for the heterogeneous GPU cluster environment in distributed deep learning system," in Proc. IEEE Int. Conf. Big Data Smart Comput., 2018, pp. 725–728.
[61] Nvidia CUDA zero-copy. Accessed: Mar. 21, 2021. [Online]. Available: https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#zero-copy
[62] Y. Zhao, L. Wang, Y. Tian, R. Fonseca, and T. Guo, "Few-shot neural architecture search," 2020, arXiv:2006.06863.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/csdl.