VPipe: A Virtualized Acceleration System for Achieving Efficient and Scalable Pipeline Parallel DNN Training
Abstract—DNNs of increasing computational complexity have achieved unprecedented successes in various areas such as machine vision and natural language processing (NLP); e.g., the recent advanced Transformer has billions of parameters. However, as large-scale DNNs significantly exceed a GPU's physical memory limit, they cannot be trained by conventional methods such as data parallelism. Pipeline parallelism, which partitions a large DNN into small subnets and trains them on different GPUs, is a plausible solution. Unfortunately, the layer partitioning and memory management in existing pipeline parallel systems are fixed during training, making them easily impeded by out-of-memory errors and GPU under-utilization. These drawbacks are amplified when performing neural architecture search (NAS), such as the evolved Transformer, where different network architectures of Transformer need to be trained repeatedly. VPIPE is the first system that transparently provides dynamic layer partitioning and memory management for pipeline parallelism. VPIPE has two unique contributions, including (1) an online algorithm for searching a near-optimal layer partitioning and memory management plan, and (2) a live layer migration protocol for re-balancing the layer distribution across a training pipeline. VPIPE improved the training throughput of two notable baselines (Pipedream and GPipe) by 61.4-463.4 percent and 24.8-291.3 percent on various large DNNs and training settings.
Index Terms—Machine learning, distributed systems, distributed artificial intelligence, pipeline, parallel systems, memory management
stage (G1) and achieving more balanced partitions with higher throughput (G2).

However, realizing these two goals in VPIPE must tackle two technical challenges. The first challenge is searching for a globally efficient swap, recompute, and repartition (SRP) strategy among all stages. We took the first step in the literature to model this challenge as a combinatorial optimization problem (Section 4.1). However, the problem is NP-hard due to its exponential search space [2], [3], [50].

To address this challenge, we created a fast-converging, near-optimal search algorithm using the powerful decomposition methodology [32], [47] via two observations. First, we can iteratively migrate layers from an intense stage to its adjacent stages, enabling new optimization space for a better hybrid plan of swap and recompute on each stage (Section 4.2). Second, the architecture (layout) of a typical complex DNN [39], [58] is usually constructed as a coarsened graph of repeated subgraphs, which can readily be partitioned into an optimal plan [19], [33] that meets G2; VPIPE quickly detects this coarsened graph by precisely distinguishing intra edges inside subgraphs and nested edges among subgraphs, leveraging the time-series distance between each edge's two vertices (layers) collected at runtime.

The second challenge is how to migrate a layer live (i.e., with no GPU stalls or pipeline cleaning) while keeping VPIPE transparent [49] to general upper pipeline parallelism systems (i.e., VPIPE does not add or reduce parameter staleness [14], [33], [59] in the upper system). Existing pipeline parallel systems [14], [33], [59] carefully designed various strategies to orchestrate (add or reduce) the staleness of parameter updates for higher training accuracy or throughput on specific DNNs.

VPIPE guarantees that a layer is migrated as if repartitioned by a non-live approach: stop injecting new input batches for the upper system, clean up the pipeline, migrate the layer, and reboot a new pipeline. To handle the migrated layer's unfinished backward passes, we present a new live migration protocol. Our key observation is that the time window between the activation generation (in a forward pass) and its final usage (in the corresponding backward pass) allows a subtle interleaving that lets VPIPE live-migrate a layer transparently without altering the parameter staleness of the upper system.

We implemented VPIPE in PyTorch [36] by adding 2782 LoC. We evaluated all six prevalent DNN models, including four complex DNNs (Transformer [52], BERT [10], AmoebaNet [39], and GNMT [58]) and two simple DNNs (ResNet50 [15] and VGG16 [44]), that are evaluated in all relevant systems (Pipedream [33], GPipe [19], XPipe [14], and PipeMare [59]). The evaluation shows that:

- VPIPE was efficient in training complex DNNs. VPIPE improved Pipedream's and GPipe's throughput by 109.7 and 30.7 percent on average for the four complex DNNs. VPIPE enlarged Pipedream's supported batch size by 3.75x. Within the same training time, VPIPE made Pipedream achieve higher training quality (e.g., BLEU [58]).
- VPIPE was scalable. When training the four complex DNNs on 4-16 GPUs, VPIPE's throughput increased roughly linearly with the number of GPUs. When running on 16 GPUs, VPIPE improved Pipedream's and GPipe's throughput by 323.3 and 20.7 percent.
- VPIPE was efficient in NAS workloads. When evaluated on Transformer [45] and AmoebaNet [39], the only two evaluated complex DNNs that support NAS features, VPIPE improved Pipedream's and GPipe's throughput by 421.3-463.4 percent and 245.4-291.3 percent.

Our main contribution is VPIPE, the first dynamic live layer partition and memory management system, serving as a transparent underlying acceleration layer for typical pipeline parallel systems (e.g., Pipedream and GPipe). Our major novelty is a fast and near-optimal stage-distributed search algorithm for finding a globally efficient swap, recompute, and partition strategy, greatly improving VPIPE's efficiency and scalability. Our secondary novelty is a transparent live migration protocol that neither stalls execution nor alters the upper system's parameter staleness. VPIPE's source code and evaluation framework are released at: github.com/hku-systems/vpipe.

In the rest of this paper, Section 2 presents the background; Section 3 gives an overview of VPIPE; Section 4 describes VPIPE's runtime design; Sections 5 and 6 present VPIPE's implementation and evaluation results; Section 7 discusses the related work, and Section 8 concludes.

2 BACKGROUND

2.1 DNN Training
DNN [10], [15], [29], [44], [46] is known to be the fundamental machine learning paradigm in deep learning. A DNN model typically contains hundreds of layers, and the goal of DNN training is to find an appropriate set of model parameters to fit a training dataset. Each DNN training process typically consists of millions of iterations, each containing a forward pass, a backward pass, and an optimization step.

The memory consumption of DNN training contains four parts: parameters of each layer; activations, i.e., feature maps produced by each layer in the forward pass; gradients, i.e., gradient maps produced by each layer in the backward pass; and scratch space for computation. Among these four parts, activations take the most significant portion (up to 73.3 percent) of the total memory consumption for DNN training. Activations are created in the forward pass and reused in the backward pass, so there exists a large time window between the two memory accesses. Activation memory is the major optimization target in previous work [18], [37].

2.2 Pipeline Parallel DNN Training
With DNN training getting increasingly computation and memory intensive, distributed training systems across multiple GPUs become a must. Distributed training systems can be categorized as data parallel or model parallel. Data parallel systems [28] let each GPU maintain a copy of the complete model. In each iteration, each GPU trains on a small batch and synchronizes the parameter updates with other GPUs using all-reduce [43] or a parameter server [28]. However, data parallelism is not designed to train large DNNs that cannot fit into a single GPU's memory.

Pipelined model parallelism (i.e., pipeline parallelism) aims to scale the supported DNNs to the number of GPUs by partitioning a DNN model into multiple stages (a consecutive set of layers) and letting each GPU handle one stage.
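To make the stage construction just described concrete, below is a minimal PyTorch-style sketch of splitting a sequential model into stages, one per GPU. The helper build_stages and the boundary list are illustrative assumptions, not the API of any particular pipeline parallel system, and the example assumes two visible GPUs.

    import torch.nn as nn

    def build_stages(layers, boundaries):
        # boundaries[k] = (first, last) layer indices of stage k; each stage
        # becomes an nn.Sequential placed on its own GPU (one stage per GPU).
        stages = []
        for k, (lo, hi) in enumerate(boundaries):
            stages.append(nn.Sequential(*layers[lo:hi + 1]).to(f"cuda:{k}"))
        return stages

    # Example: an 8-layer toy model split into p = 2 stages of 4 layers each.
    layers = [nn.Linear(1024, 1024) for _ in range(8)]
    stages = build_stages(layers, [(0, 3), (4, 7)])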
Pipeline parallelism is a pipelined version of model parallelism, where vanilla model parallelism leads to severe under-utilization due to the bubble problem caused by the sequential dependency between stages. Pipeline parallelism overlaps the computation and waiting time of different input batches, fills the bubbles, and improves utilization. Based on how a pipeline parallel system handles synchronization of DNN parameters among input batches, the system falls into two categories: barrier synchronous parallel (BSP) systems and asynchronous parallel (ASP) systems.

BSP systems (e.g., GPipe [19]) let a set of training input batches work on the same version of model parameters, aggregate the gradients computed by these iterations, and enforce a barrier that stops the pipeline to apply the gradients to the model parameters. BSP systems achieve almost the same statistical performance as vanilla model parallelism [19]. However, as shown in Fig. 3a, a BSP pipeline logically still incurs bubbles during each barrier synchronization, and we verified this in Fig. 3b by profiling the GPUs during a four-stage BSP pipeline training.

Fig. 3. Logical BSP pipeline (a) that demonstrates the bubble problem and a realtime nsys/nvprof GPU profiling (b) that verifies the bubble problem in a BSP pipeline with four-stage GPipe training; red blocks are sync barriers.

ASP systems (e.g., Pipedream [33] and PipeMare [59]) remove the sync barrier and let each input batch directly update the model parameters. Although bubbles are eliminated (as shown in Fig. 4), ASP systems suffer from parameter staleness in two aspects. First, the parameter version differs between a pipeline's forward pass and backward pass. Second, the parameter version differs among stages within the training of an input batch. Pipedream [33], XPipe [14], and PipeMare [59] provide various algorithm-level mitigations to the parameter staleness problem. VPIPE is designed to be a transparent layer under either a BSP or an ASP pipeline parallelism algorithm, and VPIPE's designs (Section 4.3) do not alter the weight staleness in the upper systems.

Fig. 4. Logical ASP pipeline (a) and a realtime nsys/nvprof GPU profiling (b) of an ASP pipeline with four-stage Pipedream training; red blocks are sync barriers.

Scheduling. One-forward-one-backward (1F1B) scheduling was first introduced by Pipedream [33] and adopted by successive systems (e.g., PipeMare [59] and XPipe [14]). In 1F1B scheduling (e.g., Fig. 1), each stage alternates between performing a forward pass for a current input batch and a backward pass for an earlier input batch. 1F1B is widely adopted due to its high computational efficiency [33], [59] and low memory usage. Therefore, in this paper, we assume that the upper pipeline parallel systems adopt 1F1B scheduling.

2.3 Dynamic DNN Training
Recently, more and more developers have adopted dynamic DNN training, where the number of layers varies with the training inputs (e.g., DyNet [34]) or the training is exploratory (e.g., neural architecture search [45], [55], [57], [62]). In such a case, a training workload (i.e., the GPU computation and memory required for training) varies as the training proceeds. Since the efficiency of pipeline parallelism highly depends on the workload partition among stages, this dynamicity exposes special requirements for pipeline parallel systems.

The variance of training workload usually happens very frequently. For example, a neural architecture search (NAS) process [39], [45] adopts an evolutionary algorithm that trains a set of models, quickly eliminates those with low fitting scores, and initiates new ones. Thus, "bad" models can be eliminated within a few minutes [39], [45].

Existing pipeline parallel systems profile a static partition before the training starts. This static partition inherently cannot adapt to the dynamicity in the training process. VPIPE copes with this dynamicity by a wait-free live layer migration protocol (Section 4.2) that transparently re-balances the training load when it changes.

3 VPIPE'S ARCHITECTURE
Fig. 5 shows VPIPE's architecture, a virtualized layer between a typical pipeline parallel system and its underlying execution engine. On each host, there is a virtualized tensor manager, a training monitor, and a layer manager. On the host of the last stage, there is a global planner.

Fig. 5. Architecture of VPIPE. VPIPE is a virtualized layer between a typical pipeline parallel system (e.g., Pipedream [33] or GPipe [19]) and its underlying execution engine (e.g., PyTorch [36] or Tensorflow [1]). We use different colors to refer to layers set by VPIPE's operations, including default (D), swap (S), recompute (R), and migrate (M).

Virtualized Tensor Manager (VTM) provides fine-grained management of each parameter and activation tensor. VTM holds each layer's tensor (parameter or activation) information, including layer ID, stage ID, property (parameter or activation), training iteration ID, version, management policy (vStatus), storage status, and the pointer to the tensor's real storage constructs. An activation tensor's information is initialized in VPIPE's tensor manager when the tensor is created and deleted when it is released. For parameter tensors, VPIPE creates tensor information as soon as the model is initialized. The management policy of a layer's tensors is managed by the layer manager.

Training monitor monitors each stage's runtime statistics, including real-time memory usage of each GPU on these hosts, PCIe bandwidth usage, network usage, execution time, and recompute time. Along with the forward passes of the normal training iterations, the training monitor passes its own runtime statistics and the upstream stages' (if any) to its downstream stages.
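To illustrate the bookkeeping described in the two paragraphs above, the following is a minimal sketch of the per-tensor record held by the VTM and the per-stage statistics carried along forward passes. The class and field names are assumptions for illustration, not VPIPE's actual data structures.

    from dataclasses import dataclass
    from enum import Enum

    class VStatus(Enum):          # per-tensor management policy (Fig. 5)
        DEFAULT = "D"             # stays in GPU memory
        SWAP = "S"                # proactively swapped to CPU, prefetched back
        RECOMPUTE = "R"           # dropped and recomputed in the backward pass
        MIGRATE = "M"             # being moved to another stage

    @dataclass
    class TensorInfo:             # one VTM record per parameter/activation tensor
        layer_id: int
        stage_id: int
        is_param: bool            # property: parameter or activation
        iteration_id: int         # training iteration that produced it
        version: int
        vstatus: VStatus = VStatus.DEFAULT
        on_gpu: bool = True       # storage status
        storage: object = None    # pointer to the real storage construct

    @dataclass
    class StageStats:             # per-stage statistics passed along forward passes
        stage_id: int
        gpu_mem_used: int
        pcie_bw_used: float
        net_bw_used: float
        t_fwd: float
        t_bwd: float
        t_recompute: float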
Global planner collects the runtime statistics of all stages at the end of every forward pass. It produces new partition strategies (if needed) according to VPIPE's SRP algorithm (Section 4.2). It resides on the last host for two reasons. First, in pipeline parallelism, rear stages usually have lighter computation and communication burdens. Second, as the runtime statistics are collected every training iteration, VPIPE transfers the runtime statistics along with the forward pass and distributes the new partition (if any) along with the backward pass. By doing so, VPIPE's global planner does not need extra distributed coordination.

Layer manager receives a new partition strategy from the global planner and diffs the new partition against its current partition to check whether a layer migration should be scheduled. For example, when a layer needs to be migrated, the migration manager of the source stage will coordinate with the tensor manager to asynchronously swap the layer's parameter tensors and activation tensors to CPU memory, and then transfer the parameter and activation tensors to the migration manager of the target stage. The migration manager of the target stage will initialize the layer on the target GPU, receive the parameter and activation tensors from the source stage, and append the new layer to the forward pass and backward pass executions (Section 4.3). The layer manager also produces the local swap and recompute policies (Section 4.2).

Overall, VPIPE's design is transparent to the upper pipeline parallel systems. We integrated VPIPE into an ASP system, Pipedream [33], and a BSP system, GPipe [19]. For vanilla Pipedream, we set all layers' vStatus to default; and for vanilla GPipe, we set all layers' vStatus to recompute. VPIPE can also be integrated into other pipeline parallel systems (e.g., PipeMare [59] and XPipe [14]) as long as they support an imperative programming model.

4 VPIPE'S RUNTIME

4.1 Problem Modeling
A major challenge for VPIPE's design is to find an optimal strategy of swap, recompute, and partition (SRP) so that the steady-state throughput of the training pipeline can be maximized. Since there is no model to quantify the complexity of this SRP challenge, we take the first step in the literature to formalize the SRP challenge, transform it into a combinatorial optimization problem, and solve it by a decomposition algorithm (Section 4.2).

A DNN is a graph G(N, E) with N layers (e.g., matrix operations) and E edges connecting the layers. In pipeline parallelism, a DNN model is partitioned into p stages, and each stage is placed on one GPU (p GPUs in total). To maximize pipeline utilization, in a typical pipeline parallelism scheduling (Section 2), at least p input batches are simultaneously injected into the same pipeline. We denote each layer in the model with (f_i, b_i, m_i, a_i), including a forward pass time f_i, a backward pass time b_i, a parameter memory m_i, and an activation memory a_i.

The major constraint for pipeline parallel training is G1: on each GPU, the training GPU memory usage should not exceed the GPU's physical memory limit (M). In pipeline parallelism, the memory consumption of all layers in each stage contains two parts. The first part is a constant memory consumption (m_i^constant) that does not vary with the number of injected input batches; the second part is the dependent memory consumption (m_i^dependent), which depends on the number of injected input batches and differs among stages: given a stage k, p - k copies of m_i^dependent should be kept in memory. In BSP systems, parameters are updated synchronously (Section 2), and all input batches in a pipeline share the same version of parameters, thus m_i^dependent is a_i and m_i^constant is m_i. In ASP systems, each training iteration in a pipeline may have an independent version of m_i, thus m_i^dependent contains both a_i and m_i.
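The following is a minimal sketch of the per-stage memory accounting described above. It simplifies the constant part to one resident parameter copy in both the BSP and ASP cases (an assumption), and the example numbers are placeholders; it only illustrates that front stages, which keep more in-flight copies, need far more memory than rear stages.

    def stage_memory(layer_mems, k, p, asp):
        # layer_mems: list of (m_i, a_i) = (parameter memory, activation memory)
        # for the layers of stage k. Per the text, (p - k) copies of the
        # dependent part are kept for the in-flight input batches.
        total = 0
        for m_i, a_i in layer_mems:
            m_dependent = a_i + m_i if asp else a_i   # ASP also versions parameters
            m_constant = m_i                          # one resident copy (assumption)
            total += m_constant + (p - k) * m_dependent
        return total

    GB = 2 ** 30
    mems = [(int(0.1 * GB), int(0.4 * GB))] * 12      # 12 toy layers per stage
    # Front stage (k = 0) vs. last stage (k = 3) of a p = 4 BSP pipeline.
    print(stage_memory(mems, k=0, p=4, asp=False) / GB,
          stage_memory(mems, k=3, p=4, asp=False) / GB)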
To reduce memory consumption, a pipeline parallel system can apply a swap or recompute strategy to each layer's dependent tensors, which are the main memory burden in pipeline parallelism. Thus, for each tensor in a layer, we denote its memory management policy with (D_i, R_i, S_i), where D_i, R_i, S_i = 0 or 1 and D_i + R_i + S_i = 1. D_i = 1 means the tensor by default resides in GPU memory; S_i = 1 means the tensor will be proactively swapped to CPU memory and swapped back to GPU before usage; and R_i = 1 means the tensor will be dropped and recomputed by the backward pass. Thus, in pipeline parallelism, the memory constraint of each stage k, holding layers l_k through r_k, can be denoted as:

\sum_{l_k \le i \le r_k} m_i^{constant} + (p - k) \sum_{l_k \le i \le r_k} D_i \, m_i^{dependent} \le M.  (1)

Nevertheless, recomputing layers introduces extra computation time to the backward pass. Thus, a stage's backward time is the sum of the original backward pass time, the recompute time (i.e., the extra forward pass of recomputed layers), and the part of the swap time that exceeds the normal execution time (i.e., max(0, swap time - execution time)):

t^{bwd} = \sum_i (b_i + R_i f_i) + \max\!\Big(0,\; 2 \sum_i \frac{S_i m_i^{d}}{P} - (t^{fwd} + t^{bwd})\Big).  (2)

Finally, we formalize the SRP challenge as a combinatorial optimization problem: given n layers and p GPUs, find a swap or recompute policy for each layer (to meet G1), as well as a partition (to meet G2), such that the pipeline throughput is maximized. The throughput of a pipeline is the lowest throughput among all stages [22], [33]. All stages in a pipeline have the same request rate. Thus, the pipeline's throughput bottleneck is the stage that has the longest execution time (sum of the largest t^{fwd} and largest t^{bwd}). Therefore, we convert this problem to finding a partition and a swap/recompute policy such that the longest stage execution time is minimized:

\text{minimize} \;\; \max_{1 \le k \le p} \big(t_k^{fwd} + t_k^{bwd}\big) \quad \text{subject to (1), (2)}.  (3)

This optimization problem is hard to solve for two reasons. First, the feasible set of this combinatorial optimization problem spans an extremely large search space (O(3^{|N|} p^{|N|})), as each of the N layers can have three memory management policies and fall into p partitions. The graph partition problem itself is well known to be NP-complete [50]. Second, constraint (2) indicates that both the memory management policy of all layers ((D_i, R_i, S_i) for 1 <= i <= n, denoted as Var_sr) and the stage partition plan (denoted as Var_p) affect the optimization objective in (3), making this problem a multi-variable combinatorial optimization.

4.2 Swap, Recompute, and Repartition
We solve this multi-variable combinatorial optimization problem by the decomposition [32], [47] methodology. The idea of the decomposition methodology is to break a problem into smaller sub-problems coordinated by the master problem (i.e., the optimization problem). Inspired by the conventional decomposition method [32], [47], the key intuition is to iteratively migrate a layer from an intense stage, where the GPU resource is exhausted, to a relief stage and let the intense stage have more optimization space to search for a better hybrid plan of swap and recompute.

We decompose the master problem into two sub-problems. First, we assume that Var_p is constant, and each stage locally finds a swap and recompute plan (Var_sr) depending on its GPU resources to minimize the objective function (3). Second, we assume that Var_sr is constant, and the stages are repartitioned (i.e., an optimal Var_p is found) to minimize (3). Algorithm 1 shows our decomposed algorithm, which iteratively resolves these two sub-problems.

Algorithm 1. Decomposed SRP Algorithm
1: Stage 1, ..., p:
2: Function LayerManagerIterate():
3:   newPlan = receiveBwdProp();
4:   diff = compare(this.plan, newPlan);
5:   if diff != null then
6:     migrating = True;
7:     for l in diff do set(l.vStatus, Migrate);
8:   stats = retrieveStats();
9:   optimizeSR(stats);  ## Algorithm 2
10:  return;
11: Function TrainingMonitorIterate():
12:  if !migrating then
13:    stats = receiveFwdProp();
14:    mem = cudaMemStats();
15:    tfwd, tbwd = getExecTime();
16:    stats.append(this.meta, mem, tfwd, tbwd);
17:  fwdPropagate(stats);
18:  return;
19: Global Planner:
20: Function GlobalPlannerIterate():
21:  stats, migrating = receiveFwdProp();
22:  if migrating then
23:    return;
24:  unbalanced = checkBalanced(stats);
25:  if unbalanced then
26:    newPlan = layerRepartition();  ## Algorithm 3
27:  bwdPropagate(newPlan);
28:  return;
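As a rough illustration of how the global planner's checkBalanced step (line 24 of Algorithm 1) could evaluate objective (3), consider the sketch below. The 10 percent imbalance slack is an assumed threshold, not VPIPE's actual setting, and the function names are illustrative only.

    def bottleneck(stage_times):
        # stage_times: list of (t_fwd, t_bwd) per stage. Objective (3): the
        # pipeline is limited by the stage with the longest t_fwd + t_bwd.
        return max(f + b for f, b in stage_times)

    def check_balanced(stage_times, slack=0.10):
        # Report "balanced" only when the slowest stage does not exceed the
        # mean stage time by more than `slack` (assumed threshold).
        times = [f + b for f, b in stage_times]
        return max(times) <= (1.0 + slack) * (sum(times) / len(times))

    # Example: stage 0 is the bottleneck, so a repartition would be triggered.
    print(check_balanced([(0.9, 1.4), (0.5, 0.8), (0.5, 0.8), (0.4, 0.7)]))  # False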
Swap and Recompute. For both swap and recompute, the goal is to reduce the memory footprint with the lowest overhead. For swap, our goal is to maximize the overlap between swapping and the normal execution. For recompute, our goal is to select the cheapest layers with the largest memory savings to recompute. It has been well studied in recent work (e.g., Capuchin [37]) that a hybrid combination of swapping and recomputing activation tensors can effectively reduce training memory in single-GPU DNN training. However, applying swap to a pipeline parallel system has to address two subtle points.

First, an efficient swap plan should precisely predict when a tensor that has been swapped to CPU RAM will be reused in the backward pass. In single-GPU training, an activation tensor is generated by the forward pass of an input batch, and the backward pass directly follows the forward pass. Thus, existing swap techniques used in single-GPU training systems (e.g., SwapAdvisor [18], Capuchin [37], vDNN [40], and SuperNeuron [56]) directly make predictions based on a DNN's graph. However, in pipeline parallelism there usually exists a window, filled by other input batches' executions, between the forward pass and backward pass of each input batch. To make a precise prediction, VPIPE oversees the runtime statistics of each forward pass and its backward pass across all stages of a pipeline (line 21-28 in Algorithm 1), and lets each stage's layer manager precisely predict the arrival time of each backward pass execution.

Second, in pipeline parallel systems, swap and network communication impose severe burdens on the PCIe lanes, causing severe PCIe interference that is not addressed by single-GPU training systems. In VPIPE, both network communication and swaps that pass through PCIe are asynchronous streams [4]. To handle the PCIe interference, VPIPE sets priorities for the different asynchronous streams that pass through PCIe, giving network communication a higher priority so that it does not block the pipeline execution.

Algorithm 2. optimizeSR()
1: Input: layers in a stage, tfwd, tbwd, M, P, rank;
2: foreach l in layers do
3:   if l.a / P > l.tfwd then
4:     l.cost = l.tfwd;
5:     l.op = Recompute;
6:   else
7:     l.cost = l.m_activation / P;
8:     l.op = Swap;
9:   l.gain = l.m_activation / l.cost;
10: window = tfwd + tbwd;
11: space = P * window;
12: sorted = sortByGain(layers);
13: while space >= 0 do
14:   l = sorted.pop();
15:   set(l.vStatus, S);
16:   space = space - rank * l.m_activation;
17: while memConsume(layers) > M do
18:   l = sorted.pop();
19:   set(l.vStatus, l.op);
20: foreach l in layers do
21:   l = sorted.pop();
22:   set(l.vStatus, Default);

VPIPE's swap and recompute algorithm (Algorithm 2) works as follows. For each stage, the algorithm takes the stage's set of layers, the memory limit M, the PCIe bandwidth P, the stage rank (p - k), and the stage's tfwd and tbwd as input. VPIPE first sorts all layers by the potential memory saving gain of either swap or recompute (line 2-9). Until the PCIe is full, VPIPE selects tensors according to their memory saving gains to be asynchronously swapped (line 13-16). After that, if the memory limit is still exceeded, VPIPE chooses whether to swap or recompute an activation based on its swap/recompute cost and memory saving gain (line 17-19). The rest of the layers are kept in GPU memory by default (line 20-22). Leveraging the first subtle point, VPIPE can precisely overlap the asynchronous swap cost of these tensors with normal execution. With the second subtle point, the asynchronous swap does not block the network communication of normal training execution. Consequently, Algorithm 2 reduces the recompute overhead of existing pipeline parallel systems (e.g., GPipe [19]) with asynchronous swap. VPIPE swaps activation tensors first, as activations take the most memory; VPIPE swaps parameter tensors only if all activation tensors are already swapped, which rarely happens in our evaluation.

Layer Partition. The problem of partitioning a graph G(N, E) into p equal partitions with the lowest cross-partition communication cost is known to be NP-complete [3] and has extensive applications in many areas, including VLSI design [24], matrix factorization [7], and social network clustering [35]. The Kernighan-Lin (KL) algorithm [25] is known to produce excellent partitions for a wide class of problems and is used quite extensively [17], [27]. To achieve a multi-partition, it recursively produces bipartitions of graph G and iteratively improves them by exchanging nodes between the two partitions. The KL algorithm is costly and takes O(r|N|^2 log|N|) time [11] (e.g., up to 16s to partition a complex DNN model into 16 stages), where r is the number of repeated cycles. There are many approximate algorithms [11], [12], [16], [48] that tend to be fast (near-linear) but often yield partitions that are worse than those obtained by the KL algorithm [13], [23], [41].

To make the KL algorithm efficient, multi-level schemes reduce the size of the graph (i.e., coarsen the graph) by collapsing vertices and edges, partitioning the smaller graph, and then uncoarsening it [17], [23]. Multi-level schemes have been used in many areas, including matrix factorization [7] and VLSI design [24]. However, these algorithms assume domain-specific requirements for the graph (e.g., a sparse matrix [7] or a planar graph [24]), which are not applicable to a complex DNN graph (e.g., AmoebaNet [39]). Moreover, existing multi-level schemes all take multiple coarsening steps. In VPIPE, leveraging the time series implied by the DNN's sequential execution, we identify two domain-specific heuristics to design a fast and online multi-level graph partition algorithm with a one-step coarsening scheme.

First, deep learning experts have already constructed the graphs of complex DNNs (e.g., Transformer, BERT, AmoebaNet, and GNMT), prevalently deployed with pipeline parallelism, as sequentially connected and repeated subgraphs of layers. Each subgraph is usually a basic block (e.g., a Transformer block) for constructing a large DNN. Inside each subgraph, there are intricate local edges (nested edges) forming multiple execution branches. Partitioning such a subgraph into two stages usually incurs huge network communication costs between two GPUs. There are also sparse nested edges that form branches among blocks. However, the network communication costs of partitioning these sparse nested edges are often static and do not vary with the partition plan. For example, in the BERT model, each block should take input from the first embedding layer, and it is necessary to pass the embedding output to all stages. Thus, under any partition plan, the network communication costs of transferring this input to all stages are persistent.

Second, different from the conventional graphs in partitioning problems [2], [3], [50], in a DNN graph the vertices (i.e., layers) are executed by the training engine in time series. If a nested edge connects two vertices whose gap on the time axis is larger than a stage's execution time, the edge has a high chance of being a sparse nested edge. If a nested edge connects two vertices very close to each other on the time axis, the edge is likely to be part of a subgraph.
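A minimal sketch of this second (time-gap) heuristic is shown below. It covers only the sparse-versus-subgraph distinction, with the critical backbone edges assumed to be detected separately as in Algorithm 3, and the function and variable names are illustrative.

    def classify_edges(edges, invoke_time, mean_stage_time):
        # edges: list of (u, v) layer pairs; invoke_time[l] is the layer's runtime
        # invocation timestamp collected by the training monitor. An edge whose
        # endpoints are far apart on the time axis is likely a sparse nested edge;
        # one whose endpoints are close is likely part of a subgraph.
        labels = {}
        for u, v in edges:
            gap = abs(invoke_time[v] - invoke_time[u])
            labels[(u, v)] = "sparse" if gap > mean_stage_time else "subgraph"
        return labels

    # Toy example: layer 0 feeds both its neighbor and a much later layer.
    print(classify_edges([(0, 1), (0, 9)],
                         invoke_time={0: 0.0, 1: 0.01, 9: 1.5},
                         mean_stage_time=0.5))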
Based on these heuristics, VPIPE's layer repartition algorithm (Algorithm 3) has three steps. First, VPIPE (line 7-21) coarsens the DNN graph. In this step, each edge in a DNN graph is classified, with O(|N| + |E|) cost, into three categories: critical edges that construct the sequential backbone of the DNN graph, sparse nested edges, and subgraph edges. Then VPIPE merges the subgraph edges into the sequential backbone edges by aggregating their execution time and communication. Second, VPIPE partitions this merged graph by iteratively applying bipartition with the KL algorithm [50] (line 22-26). Third, VPIPE uncoarsens the merged graph back to the original DNN graph and refines the partition to see whether any potentially better partition exists, using KL refinement [17] (line 27-29).

Algorithm 3. layerRepartition()
1: Input: DNN graph G(N, E), runtime statistics of each layer (layers), e.g., invoke time (T) of each layer;
2: sorted = sortByTime(layers);
3: G_coarsened = coarsen(G(N, E));
4: bound = partition(G_coarsened);
5: G = uncoarsen(G_coarsened);
6: bound = refine(G, bound);
7: Function coarsen(G(V, E)):
8:   mean = sum(t) / p;
9:   E* = [];
10:  foreach l1, l2 in pairwise(sorted) do
11:    ## detect critical path edges
12:    if e(l1, l2) in E then
13:      annotate e(l1, l2) as critical edge;
14:      E*.append(e(l1, l2));
15:  foreach e in E - E* do
16:    ## distinguish sparse and subgraph edges
17:    if e.v2.T - e.v1.T > mean then
18:      annotate e as sparse edge;
19:    else
20:      annotate e as subgraph edge;
21:  merge(E, E*);
22: Function partition(G(V, E), p):
23:  if p == 1 then
24:    return;
25:  bound, G1, G2 = KLPartition(G, cost);
26:  return bound, partition(G1, p/2), partition(G2, p/2);
27: Function refine(G(V, E), bound):
28:  foreach b in bound do
29:    KLRefine(G(V, E), b);

Analysis. VPIPE's Algorithm 1 decomposes a master problem into two sub-problems [32], [47]. VPIPE's Algorithm 2 is optimal, as its sub-problem is a linear optimization with simple constraints (i.e., the memory limit and the PCIe limit). VPIPE's Algorithm 3 builds on the Kernighan-Lin (KL) algorithm. The KL algorithm is a bipartition algorithm that starts from an initial bipartition of a graph and exchanges the vertices of the two partitions to see whether a better partition can be found [2], [3], [50].

The time complexity of the original KL algorithm is O(r|N|^2 log|N|), where r is the number of repeated cycles and N is the total set of layers. The time cost of running the KL algorithm on complex DNNs (e.g., AmoebaNet) is huge (up to 16s for each run). With our two heuristics on recent complex DNN graphs, VPIPE's partition algorithm uses a coarsening phase of complexity O(|N| + |E|) that coarsens a complex DNN graph (e.g., the AmoebaNet graph with 4280 layers/vertices and 5080 edges) into a much smaller graph (e.g., a coarsened AmoebaNet with 132 vertices and 142 edges). By doing so, the time cost of the KL algorithm is greatly reduced. On partitioning various DNN models, the evaluation (Section 6.4) shows that VPIPE's partition algorithm speeds up the KL algorithm by 4x-32x and achieves a 0.15s-0.46s time cost (less than the 1.21s-6.98s processing time of one training input batch), fast enough to be deployed online.
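The following is a simplified, assumption-laden sketch of the one-step coarsening step: layers in execution order that are linked only by subgraph edges are collapsed into a single coarse node whose weight aggregates their execution times. The real algorithm also aggregates communication costs and treats sparse nested edges separately, which are omitted here, and the helper name is illustrative.

    def coarsen(layers, exec_time, edge_labels):
        # layers: layer ids in execution (time-series) order; runs connected by
        # "subgraph" edges collapse into one coarse node with summed weight.
        coarse_nodes, current, weight = [], [layers[0]], exec_time[layers[0]]
        for u, v in zip(layers, layers[1:]):
            if edge_labels.get((u, v)) == "subgraph":
                current.append(v)              # stay inside the same block
                weight += exec_time[v]
            else:
                coarse_nodes.append((tuple(current), weight))
                current, weight = [v], exec_time[v]
        coarse_nodes.append((tuple(current), weight))
        return coarse_nodes                    # smaller graph handed to KL bipartition

    print(coarsen([0, 1, 2, 3], {0: 1.0, 1: 0.2, 2: 0.2, 3: 1.0},
                  {(0, 1): "critical", (1, 2): "subgraph", (2, 3): "critical"}))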
4.3 Live Layer Migration
Existing pipeline parallel systems (e.g., Pipedream and GPipe) adopt a static layer partition before execution (Section 2). To migrate a layer in these systems, developers need to adopt a non-live approach: stop the runtime, modify the layer partition configuration, and reboot the whole training process. This process suffers from heavy bootstrap overhead, including runtime initialization, model initialization, and data loading (Section 2). Such a heavy overhead may dramatically decrease the training efficiency when layer migration is frequently triggered under a dynamic training process (Section 6.4).

In VPIPE, we aim to design a live layer migration protocol for pipeline parallelism with a key technical requirement: the layer migration should remain transparent to the upper systems so that VPIPE will not alter the upper systems' parameter staleness.

Existing pipeline parallel systems fall into two categories: BSP systems (GPipe [19]) and ASP systems (Pipedream [33], PipeMare [59], and XPipe [14]). BSP systems have no parameter staleness (Section 2.2). ASP systems adopt various parameter staleness strategies for different design goals. BSP and ASP systems have their own strengths on particular workloads. For instance, in Table 3, GPipe achieved better accuracy than Pipedream on training Transformer while achieving worse accuracy than Pipedream on training BERT. Thus, VPIPE is designed to be transparent to the upper systems so that VPIPE does not alter their parameter staleness. VPIPE lets the programmer explicitly annotate the type of system.

However, it is challenging to transparently migrate a layer without losing liveness for both BSP and ASP systems. The reason is that at any time in a pipeline, a layer can always have multiple unfinished backward executions, and these backward passes will produce updates to the layer parameters. To avoid altering the parameter staleness, during the migration of a layer, no updates produced by these backward passes should be lost.

Moreover, in the typical scheduling of ASP systems (Section 2.2), layers on different stages have different pipeline execution interleavings. For example, in the last stage, the forward pass of an input batch directly works on the parameters updated by the previous input batch, while in the first stage, the forward pass works on the parameters updated by a much earlier input batch. For BSP systems, forward passes on all stages work on the same version of parameters until a parameter synchronization occurs. To avoid altering the parameter staleness, during the migration of a layer, VPIPE ensures that when a layer is migrated among stages, the execution interleaving of this layer changes accordingly.

When a set of layers {l_i, ..., l_j} is going to be migrated from stage n to stage m, where m = n + 1, VPIPE should migrate, for each layer, p - n - 1 copies of activation tensors for the unfinished backward passes. Meanwhile, for ASP systems, the version V_k should be changed from k - p + n to k - p + m.

A strawman stop-and-copy migration approach is to stop the execution, synchronously transfer the parameter tensors and activation tensors, and resume the execution. However, when training complex DNNs, the tensors to be migrated can be up to several gigabytes, leading to a long stall.

In VPIPE, we present a live runtime layer migration protocol. Without loss of generality, to ease discussion, Figs. 6 and 7 show an example of a forward layer migration in a four-stage (i.e., p = 4) pipeline, where n = 0 and m = 1. If stage n is going to migrate layer c to stage m after input batch k ends, the migration works as follows. In the prepare stage, stage n sends a prepare message to stage m to inform it of the migration of layer c. Stage m initializes the layer module of layer c and moves the module to GPU memory. Then, stage m sends a ready message to stage n.

Once stage n receives ready, the migration immediately starts in its next forward pass (i.e., the forward pass of input batch k + 4 in Fig. 6). (1) Stage n immediately asynchronously

5 SYSTEM IMPLEMENTATION
VPIPE's design leverages the imperative features of PyTorch. Current popular deep learning frameworks are typically based on either imperative or declarative programming. Imperative programs are similar to Python or C++ programs, which perform computations during execution. PyTorch adopts this as its default and only execution mode. Overall, VPIPE is currently implemented by modifying 2782 LoC of PyTorch [36]. VPIPE's design and implementation are common to all DNN training engines that follow an imperative programming style. In this section, we present three key points of implementing VPIPE in PyTorch: how to support distributed on-demand swap and recompute; how to migrate layers between stages; and how to implement an NAS process [39], [45] in VPIPE, as there is no existing literature that describes how to implement an NAS process in pipeline parallelism.

For the first point, to capture the access patterns of tensors, VPIPE intercepts PyTorch's activation creation in forward passes and its reuse in backward passes. In PyTorch, an activation tensor is created and saved to an edge of the automatic gradient computation (autograd) graph in a data structure SavedVariable. VPIPE intercepted the member functions of SavedVariable and saved the tensor pointers to VPIPE's VTM.
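VPIPE performs this interception inside PyTorch's C++ autograd. A rough Python-level analogue of the same idea can be written with torch.autograd.graph.saved_tensors_hooks, a standard PyTorch API (since 1.10) that exposes exactly the save-in-forward and reuse-in-backward events that the VTM tracks; this is only an illustrative stand-in, not VPIPE's implementation.

    import torch

    save_log, reuse_log = [], []

    def pack(t):
        # called when autograd saves an activation for the backward pass
        save_log.append((id(t), tuple(t.shape)))
        return t   # a real VTM would register the tensor and may swap it out here

    def unpack(t):
        # called when the backward pass needs the saved activation again
        reuse_log.append(id(t))
        return t   # a real VTM would swap the tensor back into GPU memory here

    x = torch.randn(4, 8, requires_grad=True)
    w = torch.randn(8, 8, requires_grad=True)
    with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
        y = (x @ w).relu().sum()
    y.backward()
    print(len(save_log), len(reuse_log))   # tensors saved vs. reused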
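For the second implementation point, migrating layers between stages, the following is a schematic, in-process sketch of the prepare/ready handshake described in Section 4.3. The queue-based transport, message fields, and helper names are assumptions that stand in for VPIPE's real RPC channel and asynchronous tensor transfer.

    from queue import Queue
    import threading

    def migrate_source(to_target, from_target, layer_state, pending_acts, layer_id, batch_k):
        # Prepare phase: tell stage m which layer moves and after which batch.
        to_target.put({"msg": "prepare", "layer": layer_id, "after": batch_k})
        assert from_target.get()["msg"] == "ready"
        # Migration phase: ship parameters plus the activations of unfinished
        # backward passes (in the real system these transfers are asynchronous).
        to_target.put({"msg": "tensors", "params": layer_state, "acts": pending_acts})

    def migrate_target(from_source, to_source):
        prep = from_source.get()
        # ... initialize the layer module for prep["layer"] on this GPU ...
        to_source.put({"msg": "ready"})
        payload = from_source.get()
        return payload["params"], payload["acts"]   # appended to fwd/bwd passes

    a, b = Queue(), Queue()   # a: source -> target, b: target -> source
    t = threading.Thread(target=migrate_source,
                         args=(a, b, {"weight": [0.1, 0.2]}, ["act(k+1)"], 7, 42))
    t.start()
    print(migrate_target(a, b))
    t.join()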
To support NAS in pipeline parallelism, we implemented the NAS process on both Pipedream and GPipe (Section 6.3) based on the official descriptions of the evolved Transformer [45] and AmoebaNet [39]. Overall, there are two key components in an NAS process: an evolution algorithm that iteratively explores new DNN architectures; and a just-in-time runtime that switches the training workload according to the DNN generated by the evolution algorithm. When a DNN switch occurs in the evolution algorithm, our NAS implementation deactivates the differing layers in the existing DNN, activates the new layers, and resets parameters when the DNN switch finishes. This implementation leverages PyTorch's imperative features (i.e., exec stmt) and switches fast between two DNNs without extra stop and initialization time.

6 EVALUATION
Testbed. Our evaluation was conducted on a GPU farm with 8 hosts. Each host had 4 Nvidia 2080TI GPUs, 20 CPU cores,

For each baseline system (e.g., Pipedream), we used Pipedream-VPIPE to denote Pipedream integrated with VPIPE. We compared the throughput of Pipedream-VPIPE with Pipedream alone to indicate VPIPE's improvement on Pipedream. Overall, we evaluated four systems: Pipedream-VPIPE, GPipe-VPIPE, Pipedream, and GPipe. There are also successive systems (i.e., XPipe [14] and PipeMare [59]) that mitigate Pipedream's parameter staleness. However, all these systems share the same performance model as either Pipedream or GPipe.

Batch Size and Training Setup. For all systems, we set the training batch size of each DNN to the largest batch size that can be supported without exceeding any GPU's physical memory limit. As Pipedream directly keeps all activation tensors in GPU memory, to avoid exceeding the GPU memory limit on the front stages, the training batch size supported by Pipedream was 3.2x smaller than that of the other evaluated systems (e.g., GPipe). Unless otherwise specified, we evaluated all systems on 8 GPUs and set their default partition (shown in Table 2) using the static partition profiler provided by Pipedream [33], which is the only system that explicitly
Section 6.4: How effective were VPIPE's runtime algorithms and protocol in Section 4?
Section 6.5: What are the limitations of VPIPE?

6.1 Static DNN Training (i.e., NAS Disabled)
We first give an overview of how much VPIPE improved Pipedream and GPipe on training all DNNs. Fig. 8 shows the training curves, which indicate how each model's training score improves as training time increases. Overall, in Fig. 8, to finish the same number of training epochs, VPIPE shortened the training time of GPipe and Pipedream by 23.5 and 53.4 percent on average. Thus, within the same training time, VPIPE allowed both GPipe and Pipedream to achieve better model fitting quality.

Fig. 9 shows the throughput of each system under the same setting as Fig. 8. These results were comparable to the

GPipe's recompute incurs the overhead of an extra forward pass; in our study, an extra forward pass took 23.8-36.5 percent wasted computation on various DNNs [6], [33], [39], [45], [52] and training settings. When training complex DNNs on a large number of GPUs (>8), GPipe achieved better training efficiency than Pipedream because, as shown in Table 3, although GPipe needed to process an extra forward pass, compared with Pipedream, GPipe supported a 3.75x training batch size and achieved 1.59x total effective ALU utilization on all GPUs. Thus, VPIPE had more room for improvement on Pipedream.

Compared with GPipe, GPipe-VPIPE incurred 73.2 percent less wasted GPU ALU utilization. The reason is that GPipe-VPIPE invoked swap and provided a dynamic and efficient strategy to reduce GPipe's recompute overhead at runtime (Algorithm 2). In exchange, GPipe-VPIPE used 7.9x more PCIe resources than GPipe for swapping. The PCIe resource was usually spare in GPipe's default setting except when
Fig. 8. Model fitting score versus time for training six models using 8 GPUs. For a-f, the models are trained with GPipe (G) and GPipe-VPIPE (G+V). For g-l, the models are trained with Pipedream (P) and Pipedream-VPIPE (P+V). For BERT, the score metric is next sentence prediction accuracy [10]. For Transformer and GNMT, the score metric is BLEU [58]. For AmoebaNet, VGG16, and ResNet50, the score metric is top-5 accuracy [15], [33], [39], [44].

TABLE 3. Resource Consumption, Final Fitting Scores, and Micro Events of Training Four Large DNNs With Four Systems on 8 GPUs. (BE. is BERT. TR. is Transformer. AM. is AmoebaNet. GN. is GNMT. Sco. is the final model fitting score when the training finishes; the score metric of each model is the same as in Fig. 8. Bat. is the training batch size. GPU is all GPUs' effective/total ALU utilization. Fwd and bwd are the forward pass time and backward pass time of each training iteration.)
GPipe recomputed all activation tensors to save memory and thus supported a sufficiently large batch size to fully utilize a GPU's ALU units. With VPIPE integrated, Pipedream-VPIPE and GPipe-VPIPE supported the same large batch size as GPipe, while VPIPE reduced the recompute overhead in GPipe. Thus, Pipedream-VPIPE and GPipe-VPIPE were as scalable as GPipe and achieved better total effective utilization than GPipe.

For both vanilla data parallelism (DP) and Capuchin with data parallelism (DP-C), the scalability was poor because, for complex DNNs, the network communication cost of parameter synchronization was the major bottleneck (Section 2). However, DP-C still achieved better effective ALU utilization, as Capuchin used swap and recompute to enlarge the training batch size supported by each GPU, yielding high ALU utilization on each GPU worker.

To sum up, with VPIPE, both BSP (GPipe-VPIPE) and ASP (Pipedream-VPIPE) systems achieved almost linear scalability comparable to the scalable pipeline parallel system GPipe, while VPIPE achieved better total effective GPU utilization. These results indicate that VPIPE is both efficient and scalable. As the emergence of even larger DNNs can be foreseen [6], VPIPE's design can remain efficient when more and more GPUs are involved.

6.3 Dynamic DNN Training (i.e., NAS Enabled)
To evaluate VPIPE's efficiency on dynamic training workloads, we conducted a case study of how VPIPE performed on neural architecture search (NAS), one of the most prevalent dynamic training processes. We selected two models (Transformer [52] and AmoebaNet [39]) that have been pervasively used for neural architecture search. For both Transformer and AmoebaNet, we implemented the NAS process according to their published description [45] of an evolution algorithm: it creates a population of DNN models with similar architectures and trains them on a subset (around 1000 data entries) of their dataset to quickly eliminate unqualified models. This elimination process often took the most time during an NAS process. To ensure fair evaluation, we made the evolution algorithm deterministic: i.e., for each NAS process, the population of models was trained in a determined sequence.

Overall, VPIPE accelerated both GPipe and Pipedream on these two NAS-enabled DNN training workloads by 245.4-291.3 percent and 421.3-463.4 percent, while VPIPE had no impact on the upper evolutionary algorithm and did not downgrade the quality of NAS.
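A minimal sketch of such a deterministic elimination step is shown below. The fitness function, keep ratio, and stand-in candidates are placeholders and do not reflect the evolved Transformer's or AmoebaNet's actual search settings.

    def nas_eliminate(population, train_subset, train_steps, fitness, keep_ratio=0.5):
        # Train each candidate briefly on a small subset, score it, and keep only
        # the best-scoring fraction; eliminated slots are later refilled by mutated
        # survivors. Deterministic given a fixed candidate order, as in our setup.
        scored = []
        for model in population:
            train_steps(model, train_subset)          # short training burst
            scored.append((fitness(model, train_subset), model))
        scored.sort(key=lambda p: p[0], reverse=True)
        return [m for _, m in scored[: max(1, int(len(scored) * keep_ratio))]]

    # Toy usage with stand-in candidates represented by fixed quality scores.
    quality = {"cand-a": 0.61, "cand-b": 0.47, "cand-c": 0.72, "cand-d": 0.58}
    print(nas_eliminate(list(quality), None,
                        lambda m, d: None, lambda m, d: quality[m]))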
Fig. 13. Training profiling under dynamic training processes (Evolved Transformer). V-SR means VPIPE with swap/recompute enabled and repartition disabled. In all sub-figures of (a) and (b), the 1st is the training throughput collected at every finished input batch; the 2nd is the real-time layer number of each stage (red means layer increase; blue means layer decrease); the 3rd and 4th are the resource utilizations of all GPUs at the end of each sub-figure's time axis.
We selected a snippet of each NAS-enabled model (Transformer and AmoebaNet) training on the two baseline systems (Pipedream and GPipe), and Fig. 13 shows how VPIPE improved both systems on NAS-enabled model training. In Figs. 13a and 13b, 8 layers were added twice, at 342s and 594s, on the first stage, and 8 layers were deleted twice, at 880s and 1123s, on the second stage. In Figs. 14a and 14b, 46 layers were deleted twice, at 921s and 1157s, on the first stage, and 46 layers were added twice, at 1265s and 1483s, on the second stage.

For the vanilla baseline systems without VPIPE (Pipedream and GPipe), the static partition strategy used by both systems did not cope with this training dynamicity: taking the Transformer in Figs. 13a and 13b as an example, when layers were added on the first stage, both systems incurred a performance drop as the execution time of stage 0 suddenly increased, bottlenecking the whole pipeline; when layers were deleted on the second stage, the whole pipeline's throughput did not increase, as stage 0 was still the throughput bottleneck. In Figs. 13a and 13b, although the ALU utilization of stage 0 was high, the other stages all incurred low ALU utilization, as these stages often needed to wait for the execution of stage 0.

When only VPIPE's local swap and recompute optimization (Algorithm 2) on each stage (i.e., VPIPE-SR) was enabled, although VPIPE-SR improved the two baseline systems' throughput by enlarging the supported batch size (for Pipedream) or reducing the recompute overhead (for GPipe), VPIPE-SR was also not able to cope with this training dynamicity. This implies that existing single-GPU swap and recompute systems (e.g., Capuchin [37]) are not sufficient to achieve efficient pipeline parallelism, for two reasons: first, these systems do not support distributed memory management (Section 4.2); second, even if a distributed swap and recompute system (e.g., VPIPE-SR) exists, it still incurs sub-optimal training efficiency.

In contrast, when VPIPE with a full implementation of Algorithm 1 was integrated into Pipedream and GPipe, under training dynamicity both systems (Pipedream-VPIPE and GPipe-VPIPE) adjusted their layer distribution on all stages to achieve near-optimal training throughput. In Fig. 13, the second figure of each sub-figure shows how VPIPE adjusted the layer distribution when layer activation/deactivation was suddenly triggered during a training process.
Fig. 14. Training profiling under dynamic training processes (AmoebaNet) with the same setting in Fig. 13.
TABLE 4. Performance of VPIPE's Partition Algorithm versus the Kernighan-Lin Algorithm [50]. (O. G means the original graph with N layers and E edges. C. G means the coarsened graph. Cost means the network communication cost caused by the partition algorithm (1e7 bytes per training sample). Each DNN model used is for 16-GPU training, and the algorithms partition each DNN into 16 stages.)

Fig. 15. (a) Network usage of Pipedream with and without VPIPE. VPIPE's network usage contains VPIPE's network overhead (in unfilled red bars), including layer migration and control message costs. (b) Real-time GPU ALU utilization statistics with VPIPE's live migration and the non-live migration approach.
For example, when layers were added on stage 0 at 342s in Figs. 13a and 13b, VPIPE's global planner collected the runtime statistics of all stages and noticed an imbalance of execution time among stages. VPIPE then triggered Algorithm 3 to generate a new balanced partition. VPIPE's layer manager immediately started to migrate layers from stage 0 to the subsequent stages (i.e., stages 3, 5, and 6). Then, VPIPE's layer manager locally performed Algorithm 2 to find an optimized local memory management plan. After that, as described in Algorithm 1, VPIPE iteratively performed Algorithm 3 and Algorithm 2 until no better SRP strategy was found.

In our evaluation, each iterative process of Algorithm 1 finished within 3-9 iterations (Section 6.1) without performance degradation, thanks to VPIPE's fast SRP algorithm and live layer migration protocol. We will further discuss this in Section 6.3. We also evaluated the ideal throughput in Fig. 13, and VPIPE incurred a degradation from the ideal throughput for the same reason as discussed in Section 6.1.

To sum up, with VPIPE, both Pipedream-VPIPE and GPipe-VPIPE transparently changed their layer distribution along with the training dynamicity; and by doing so, both systems kept their training throughput close to the ideal throughput during an extremely dynamic training process. Both forward and backward layer migrations were triggered frequently during an NAS training process, making both VPIPE's forward and backward layer migration designs desirable.

6.4 Effectiveness of VPIPE's Algorithms
Effectiveness of VPIPE's SRP Algorithm. VPIPE's SRP algorithm (Algorithm 1) is a decomposition method that iteratively optimizes two sub-problems: a local search of swap and recompute (Algorithm 2); and a global search of stage partition (Algorithm 3).

We first summarize how VPIPE's SRP algorithm improved the baseline systems. For both static training processes (Section 6.1) and dynamic training processes (Section 6.3), VPIPE kept the training throughput of both Pipedream and GPipe always close to the ideal throughput; VPIPE's throughput degradation from the ideal throughput was caused by the inevitable recompute overhead needed to keep all GPUs' total effective ALU utilization high (e.g., Figs. 10 and 11). From Table 3, compared with the bare-metal baseline systems Pipedream and GPipe, VPIPE's SRP algorithm essentially well utilized all available resources of all GPUs.

We then examined how fast VPIPE's SRP algorithm was. Overall, each invocation of the SRP algorithm finished within 10 iterations. The major time cost of each iteration is taken by the graph partition sub-algorithm (Algorithm 3), which solves the NP-hard graph partitioning problem (Section 4.1). In Table 4, we compared the runtime cost of VPIPE's partition algorithm (Algorithm 3) with the original KL algorithm [50] on partitioning four complex DNNs. The results show that VPIPE sped up the KL algorithm by 4x-32x. The reason is that VPIPE's coarsening step greatly reduced the complexity of the graph used in the partitioning (Section 4.2). On average, VPIPE reduced the number of graph nodes by 3x-32x and the number of graph edges by 3x-35x. This time cost is negligible compared with the training time. The final edge cuts (i.e., total network communication costs across partitions) produced by VPIPE and the KL algorithm were equal, as VPIPE used KL refinement to ensure that no better partition on the original graph was missed.

In Fig. 15a, we collected the network communication costs of Pipedream-VPIPE and Pipedream using the same setting as in Fig. 9. Overall, Pipedream-VPIPE achieved network communication costs comparable to Pipedream when training the four complex DNNs. VPIPE's layer migration costs and control message costs incurred little overhead, as these costs were amortized over the long training time (up to hundreds of hours). During a layer migration process, VPIPE's peak data transfer rate was about 432MB/s, far from blocking either the network connection or the PCIe connection across stages.

To sum up, these results indicate that VPIPE's SRP algorithm is both fast converging and able to achieve a near-optimal plan that well utilizes all GPU resources for efficient pipeline parallel training.

Effectiveness of the Live Layer Migration Protocol. VPIPE's live layer migration protocol (Section 4.3) transparently migrates a layer to realize a new partition without degrading the training throughput. This guarantees that VPIPE can iteratively search for a better SRP plan (Section 4.2) with a negligible training performance penalty.

To examine the necessity of VPIPE's live layer migration protocol, we compared it with a non-live layer migration approach (Section 4.2): stop injecting new input batches for the upper system, clean up the pipeline, manually migrate the layer to a new stage, and reboot a new pipeline. In Fig. 13, the red dashed line is the training throughput using non-live layer migration. The non-live migration degraded the training throughput by up to 60.3 percent because, in each iteration of VPIPE's Algorithm 1, a repartition would be triggered and the pipeline would be cleaned up. Fig. 15b shows the real-time ALU utilization comparison between VPIPE's live migration approach and the non-live migration approach during an iterative run of Algorithm 1 that triggers 9 stage repartitions. In each repartition, the total ALU utilization dropped to zero as the pipeline was cleaned up.
pipeline was clean up. In comparison, VPIPE live-migrated a Pipeline Parallel Systems. Pipeline (model) parallelism is a
layer without notable throughput degradation and GPU stall. special type of model parallel system. Model parallel sys-
6.5 Discussions
VPIPE has two limitations. First, VPIPE assumes that, for any DNN workload trained with VPIPE, a single layer fits within the memory limit of a single GPU. This is also assumed by other pipeline parallel systems (e.g., Pipedream and GPipe). In practice, for all recent complex DNNs evaluated with VPIPE, every layer fits in a single GPU. Second, VPIPE's layer migration protocol (Section 4.3) remains live only when the time cost of transferring a layer's tensors can overlap with the computation time of DNN training. There might exist special DNNs where the execution time of all layers is extremely short while a layer holds a non-negligible amount of data to transfer. In all the models we studied and in the literature, DNNs are both computation intensive and memory intensive [18], [37], making VPIPE's off-the-critical-path data transfer realizable, as verified in Section 6.4.
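As a rough feasibility check of this liveness condition, the sketch below compares an assumed layer transfer size against an assumed interconnect bandwidth and per-iteration compute time; all concrete numbers are illustrative assumptions, not values reported in this paper.

# Rough check of the liveness condition: a migration stays off the critical
# path if the layer's tensors can be streamed while computation proceeds.
layer_bytes = 200e6          # assumed size of the layer's parameters + buffers (200 MB)
link_bw = 10e9               # assumed effective interconnect bandwidth (10 GB/s)
t_compute_per_iter = 0.25    # assumed forward+backward time per iteration (seconds)

t_transfer = layer_bytes / link_bw
iters_needed = t_transfer / t_compute_per_iter
print(f"transfer takes {t_transfer * 1e3:.0f} ms, "
      f"about {iters_needed:.2f} of one training iteration")
# With these assumptions the copy hides under a fraction of one iteration,
# so it can overlap with training; DNNs with extremely short layers would not.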
In future work, we envision three applications of VPIPE. First, VPIPE has a unique strength in supporting dynamic training paradigms (e.g., DyNet [34]) beyond NAS, as dynamic DNNs enabled by DyNet (e.g., LSTMs [31]) are prevalent and powerful in handling input data of varying lengths (e.g., sentences). Second, existing NAS algorithms produce DNN evolutions under the assumption that GPU memory is unlimited. When these NAS algorithms are deployed with pipeline parallelism, they may produce DNN evolutions that cannot be realized in the pipeline, leading to poor search quality. Leveraging VPIPE's pipeline statistics, researchers can make NAS algorithms aware of the underlying pipeline resources, making NAS both highly accurate and feasible under limited hardware resources. Third, as DNNs today are deployed with various training frameworks in addition to PyTorch, VPIPE can also augment other imperative training engines (e.g., MXNet [8] and TensorFlow [1]).
7 RELATED WORK
Data Parallel Systems. Data parallelism [28] has been widely adopted in DNN training to support large-batch training. In data parallelism, inputs are partitioned across workers. Each worker maintains a local copy of the model parameters and trains on its own partition of inputs while periodically synchronizing weights with the other workers. Typical data parallelism systems assume that a DNN model fits into a single GPU. Nevertheless, the size of recent DNNs has grown far beyond a single GPU's capacity, driving researchers to study model parallelism [19], [21]. To support large DNN training with data parallelism, DeepSpeed [38] partitions a DNN's parameter and optimizer state across workers and transfers the state on demand during training. DeepSpeed [38] reported 1.5x the network communication volume of a typical data parallel system (e.g., a parameter server). Compared with data parallelism, pipeline parallelism (e.g., VPIPE) incurs much less network communication [19], [33] and scales better for large DNN training [19] (see Section 6.2). Overall, data parallelism is complementary to pipeline parallel systems and can be integrated with VPIPE as mixed parallelism to support large-batch training.
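For illustration, the following is a minimal sketch of the periodic gradient synchronization that data parallelism relies on, written against PyTorch's torch.distributed API. It assumes the script is launched with an already-initialized distributed environment (e.g., via torchrun); the model, optimizer, and data here are toy placeholders, not part of any system cited above.

# Minimal data-parallel step: every worker computes gradients on its own
# input shard, then gradients are averaged across workers with all-reduce,
# so every replica applies an identical update.
import torch
import torch.distributed as dist

def data_parallel_step(model, optimizer, inputs, targets, loss_fn):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    world_size = dist.get_world_size()
    for p in model.parameters():                      # synchronize gradients
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= world_size                          # average across replicas
    optimizer.step()
    return loss.item()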
Pipeline Parallel Systems. Pipeline (model) parallelism is a special type of model parallelism. Model parallel systems are designed to train complex DNN models that cannot fit into a single GPU's memory. Besides Pipedream [33] and GPipe [19], there are many successor pipeline parallel systems that try to address Pipedream's parameter staleness problem. XPipe [14] uses parameter prediction to mitigate the staleness issues incurred by asynchronous (ASP) pipeline parallel systems (i.e., Pipedream). XPipe keeps activation memory directly in GPU and has the same performance model as Pipedream. PipeMare [59] adopts GPipe's all-recompute strategy in ASP systems and has performance and memory models similar to GPipe's. However, PipeMare shares the same limitations as GPipe.
Hybrid Parallel Systems. Existing pipeline parallel systems [14], [19], [33], [59] assume that the GPU resource consumption of layers is roughly evenly distributed. In most recent large DNNs such as Transformer [52], BERT [10], GPT-3 [6], and AmoebaNet [39], layers are indeed largely homogeneous and even in training resource consumption. Nevertheless, in some DNNs such as ResNet50 [15] and VGG16 [44], convolution layers usually take much more computation time than the fully connected layers. Hybrid parallelism systems, including OWT [26] and FlexFlow [30], are designed to improve the training efficiency of such heterogeneous DNNs. Specifically, these systems apply data parallelism to convolution layers and model parallelism to fully connected layers. These systems are orthogonal to VPIPE, and we leave the support of hybrid parallelism as VPIPE's future work.
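The split between the two forms of parallelism can be illustrated with the toy PyTorch sketch below, which assumes two GPUs are available: the convolutional trunk is replicated and processes half of the batch on each GPU (data parallel), while the fully connected head is split column-wise across the GPUs (model parallel). Shapes and sizes are arbitrary; this is a sketch of the general idea, not the code of any cited system.

# OWT-style hybrid parallelism on two GPUs (illustrative only).
import torch
import torch.nn as nn

dev0, dev1 = torch.device("cuda:0"), torch.device("cuda:1")
conv0 = nn.Conv2d(3, 8, 3, padding=1).to(dev0)
conv1 = nn.Conv2d(3, 8, 3, padding=1).to(dev1)
conv1.load_state_dict(conv0.state_dict())        # replicas share identical weights
fc0 = nn.Linear(8 * 32 * 32, 512).to(dev0)       # first half of the FC output
fc1 = nn.Linear(8 * 32 * 32, 512).to(dev1)       # second half of the FC output

x = torch.randn(16, 3, 32, 32)
xa, xb = x[:8].to(dev0), x[8:].to(dev1)
fa = conv0(xa).flatten(1)                        # data-parallel conv on batch halves
fb = conv1(xb).flatten(1)
feats = torch.cat([fa, fb.to(dev0)], dim=0)      # gather activations on one device
out = torch.cat([fc0(feats), fc1(feats.to(dev1)).to(dev0)], dim=1)  # model-parallel FC
print(out.shape)                                 # torch.Size([16, 1024])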
Training Memory Reduction. DNN training is memory intensive, and training memory reduction has been widely studied [18], [37]. Existing memory reduction approaches mainly fall into two categories: transparent approaches, including swap [18] and recompute [37], that do not affect training accuracy; and opaque approaches, such as low-precision training [20] and mixed-precision training, that trade training accuracy for training memory. VPIPE aims to act as a transparent layer so that its memory reduction does not affect the upper systems; thus, opaque memory reduction approaches are orthogonal to VPIPE. Many transparent memory reduction systems are designed for single-GPU training. vDNN [40] and SwapAdvisor [18] focus only on swap. SuperNeuron [56] and Capuchin [37] coherently combine swap and recompute to dynamically reduce the memory consumption of DNN training on a single GPU. However, these single-GPU systems are not designed to cope with the challenges stemming from pipeline parallelism (Section 2). A recent study [54] partially offloads the recompute overhead to CPU processors; this work is complementary to VPIPE and can be integrated into VPIPE to further reduce the recompute overhead.
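As a concrete illustration of the two transparent techniques, the sketch below shows recompute via PyTorch's torch.utils.checkpoint (activations inside the block are dropped in the forward pass and recomputed during backward) and swap via an explicit copy of a tensor to pinned host memory. The module sizes are arbitrary, and this is only a sketch of the general techniques, not of any particular system discussed above.

# Transparent memory reduction in two flavors: recompute and swap.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).cuda()
x = torch.randn(64, 1024, device="cuda", requires_grad=True)

# Recompute: activations inside `block` are not stored during the forward
# pass; they are re-executed when backward reaches this block.
y = checkpoint(block, x)
y.sum().backward()

# Swap: evict a tensor to pinned host memory, then bring it back later;
# pinned buffers allow the copies to be issued asynchronously.
cpu_buf = torch.empty(x.shape, dtype=x.dtype, pin_memory=True)
cpu_buf.copy_(x.detach())                        # GPU -> CPU eviction
x_back = cpu_buf.to("cuda", non_blocking=True)   # CPU -> GPU prefetch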
Nvidia proposes Unified Memory [51], a unified memory address space accessible from both CPU and GPU, so that a process can allocate more memory than a GPU's physical capacity. Nvidia Zero-Copy [61] allows an integrated GPU (where GPU and CPU physically share memory devices, as is common in mobile devices) to directly access pinned memory on the CPU. VPIPE focuses on discrete GPUs (each GPU has its own memory devices) in data centers. If a training process exceeds a GPU's physical capacity, Unified Memory automatically migrates tensors (e.g., activations) from GPU to CPU. When these tensors are later accessed by the GPU ALUs, a Unified Memory page fault is triggered, and the needed tensors are synchronously moved back from CPU to GPU. Such on-demand moving back significantly blocks a deep learning application's execution (e.g., Unified Memory can slow down a DNN's execution by more than 1x [5]). Compared with Unified Memory, VPIPE's distributed runtime (Section 4.2) enables VPIPE to predict when tensors residing in CPU memory will be needed and to asynchronously prefetch them back to GPU before they are accessed, which avoids blocking normal execution; VPIPE's asynchronous swap has an overall negligible overhead on training performance (Section 4.2). Besides swap, VPIPE's distributed memory management also provides recompute and migrate.
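To illustrate the difference between fault-driven synchronous paging and prediction-driven prefetching, the sketch below overlaps a CPU-to-GPU copy with computation using a pinned buffer and a side CUDA stream in PyTorch (the stream facilities documented in [4]). The tensor sizes and the point at which the prefetch is issued are illustrative assumptions, not VPIPE's actual schedule.

# Asynchronous prefetch sketch: issue the host-to-device copy on a side
# stream ahead of use, so it overlaps with ongoing computation instead of
# blocking at first access. Sizes are arbitrary.
import torch

copy_stream = torch.cuda.Stream()
host_tensor = torch.randn(4096, 4096, pin_memory=True)   # "swapped-out" tensor
dev_tensor = None

with torch.cuda.stream(copy_stream):                      # prefetch off the critical path
    dev_tensor = host_tensor.to("cuda", non_blocking=True)

# ... computation that does not need dev_tensor keeps running here ...
busy = torch.randn(4096, 4096, device="cuda")
busy = busy @ busy

torch.cuda.current_stream().wait_stream(copy_stream)      # make the copy visible
result = dev_tensor.sum()                                  # safe to use now
print(result.item())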
8 CONCLUSION
In this paper, we present VPIPE, the first dynamic memory and layer partition management system for pipeline parallelism, acting as a virtualized layer between a typical pipeline parallel system and its underlying execution engine. VPIPE can accelerate existing pipeline parallel systems under both static and dynamic training of complex DNNs, making them both efficient and scalable. VPIPE's source code is released at: github.com/hku-systems/vpipe.

ACKNOWLEDGMENTS
The authors would like to thank all reviewers for their valuable comments. This work was funded in part by the Huawei Innovation Research Program (HIRP) Flagship, under Grants HK RGC ECS 27200916, HK RGC GRF 17207117, 17202318, and 27208720, in part by the Croucher Innovation Award, in part by National NSF China under Grant 61802358, and in part by the USTC Research Funds of the Double First-Class Initiative, under Grant YD2150002006. Shixiong Zhao and Fanxin Li contributed equally to this work.

REFERENCES
[1] M. Abadi et al., "TensorFlow: A system for large-scale machine learning," in Proc. 12th USENIX Symp. Operating Syst. Des. Implementation, 2016, pp. 265–283.
[2] S. Areibi, "An integrated genetic algorithm with dynamic hill climbing for VLSI circuit partitioning," in Proc. Genet. Evol. Comput. Conf., 2000, pp. 97–102.
[3] S. Areibi and A. Vannelli, "Distributed advanced search techniques for circuit partitioning," in Proc. IEEE Can. Conf. Elect. Comput. Eng., 1998, pp. 553–556.
[4] PyTorch CUDA streams. Accessed: Nov. 12, 2020. [Online]. Available: https://pytorch.org/docs/stable/notes/cuda.html#cuda-streams
[5] Z. Bai, Z. Zhang, Y. Zhu, and X. Jin, "PipeSwitch: Fast pipelined context switching for deep learning applications," in Proc. 14th USENIX Symp. Operating Syst. Des. Implementation, 2020, pp. 499–514.
[6] T. B. Brown et al., "Language models are few-shot learners," 2020, arXiv:2005.14165.
[7] T. N. Bui and C. Jones, "A heuristic for reducing fill-in in sparse matrix factorization," Soc. Ind. Appl. Math., Philadelphia, PA, USA, Tech. Rep., 1993.
[8] T. Chen et al., "MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems," 2015, arXiv:1512.01274.
[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 248–255.
[10] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," 2018, arXiv:1810.04805.
[11] S. Dutt, "New faster Kernighan-Lin-type graph-partitioning algorithms," in Proc. Int. Conf. Comput. Aided Des., 1993, pp. 370–377.
[12] C. Farhat, E. Wilson, and G. Powell, "Solution of finite element systems on concurrent processing computers," Eng. Comput., vol. 2, no. 3, pp. 157–165, 1987.
[13] P.-O. Fjällström, Algorithms for Graph Partitioning: A Survey, vol. 3. Linköping, Sweden: Linköping University, Electronic Press, 1998.
[14] L. Guan, W. Yin, D. Li, and X. Lu, "XPipe: Efficient pipeline model parallelism for multi-GPU DNN training," 2019, arXiv:1911.04610.
[15] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[16] M. T. Heath and P. Raghavan, "A cartesian parallel nested dissection algorithm," SIAM J. Matrix Anal. Appl., vol. 16, no. 1, pp. 235–253, 1995.
[17] B. Hendrickson and R. Leland, "A multi-level algorithm for partitioning graphs," in Proc. ACM/IEEE Conf. Supercomputing, 1995, pp. 28–es.
[18] C.-C. Huang, G. Jin, and J. Li, "SwapAdvisor: Pushing deep learning beyond the GPU memory limit via smart swapping," in Proc. 25th Int. Conf. Architectural Support Program. Lang. Operating Syst., 2020, pp. 1341–1355.
[19] Y. Huang et al., "GPipe: Efficient training of giant neural networks using pipeline parallelism," in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 103–112.
[20] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Quantized neural networks: Training neural networks with low precision weights and activations," J. Mach. Learn. Res., vol. 18, no. 1, pp. 6869–6898, 2017.
[21] Z. Jia, M. Zaharia, and A. Aiken, "Beyond data and model parallelism for deep neural networks," 2018, arXiv:1807.05358.
[22] J. A. Joao, M. A. Suleman, O. Mutlu, and Y. N. Patt, "Bottleneck identification and scheduling in multithreaded applications," ACM SIGARCH Comput. Architect. News, vol. 47, no. 4, pp. 223–234, 2012.
[23] G. Karypis and V. Kumar, "Multilevel graph partitioning schemes," in Proc. Conf. ICPP (3), 1995, pp. 113–122.
[24] G. Karypis and V. Kumar, "Analysis of multilevel graph partitioning," in Proc. ACM/IEEE Conf. Supercomputing, 1995, pp. 29–es.
[25] B. W. Kernighan and S. Lin, "An efficient heuristic procedure for partitioning graphs," Bell Syst. Tech. J., vol. 49, no. 2, pp. 291–307, Feb. 1970.
[26] A. Krizhevsky, "One weird trick for parallelizing convolutional neural networks," 2014, arXiv:1404.5997.
[27] C.-H. Lee, M. Kim, and C. I. Park, "An efficient k-way graph partitioning algorithm for task allocation in parallel computing systems," in Proc. 1st Int. Conf. Syst. Integration, 1990, pp. 748–751.
[28] M. Li et al., "Scaling distributed machine learning with the parameter server," in Proc. 11th USENIX Symp. Operating Syst. Des. Implementation, 2014, pp. 583–598.
[29] W. Liu, Z. Wang, X. Liu, N. Zeng, Y. Liu, and F. E. Alsaadi, "A survey of deep neural network architectures and their applications," Neurocomputing, vol. 234, pp. 11–26, 2017.
[30] W. Lu, G. Yan, J. Li, S. Gong, Y. Han, and X. Li, "FlexFlow: A flexible dataflow accelerator architecture for convolutional neural networks," in Proc. IEEE Int. Symp. High Perform. Comput. Architect., 2017, pp. 553–564.
[31] S. Merity, N. S. Keskar, and R. Socher, "Regularizing and optimizing LSTM language models," 2017, arXiv:1708.02182.
[32] J. M. Mulvey and A. Ruszczyński, "A new scenario decomposition method for large-scale stochastic optimization," Operations Res., vol. 43, no. 3, pp. 477–490, 1995.
[33] D. Narayanan et al., "PipeDream: Generalized pipeline parallelism for DNN training," in Proc. 27th ACM Symp. Operating Syst. Princ., 2019, pp. 1–15.
[34] G. Neubig et al., "DyNet: The dynamic neural network toolkit," 2017, arXiv:1701.03980.
[35] D. Nicoara, S. Kamali, K. Daudjee, and L. Chen, "Hermes: Dynamic partitioning for distributed social network graph databases," in Proc. 18th Int. Conf. Extending Database Technol., 2015, pp. 25–36.
[36] A. Paszke et al., "PyTorch: An imperative style, high-performance deep learning library," in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 8026–8037.
[37] X. Peng et al., "Capuchin: Tensor-based GPU memory management for deep learning," in Proc. 25th Int. Conf. Architectural Support Program. Lang. Operating Syst., 2020, pp. 891–905.
[38] S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, "ZeRO: Memory optimizations toward training trillion parameter models," in Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal., 2020, pp. 1–16.
[39] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, "Regularized evolution for image classifier architecture search," in Proc. AAAI Conf. Artif. Intell., 2019, pp. 4780–4789.
[40] M. Rhu, N. Gimelshein, J. Clemons, A. Zulfiqar, and S. W. Keckler, "vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design," in Proc. 49th Annu. IEEE/ACM Int. Symp. Microarchitect., 2016, pp. 1–13.
[41] I. Safro, P. Sanders, and C. Schulz, "Advanced coarsening schemes for graph partitioning," J. Exp. Algorithmics, vol. 19, pp. 1–24, 2015.
[42] R. Sennrich, B. Haddow, and A. Birch, "Edinburgh neural machine translation systems for WMT 16," 2016, arXiv:1606.02891.
[43] A. Sergeev and M. Del Balso, "Horovod: Fast and easy distributed deep learning in TensorFlow," 2018, arXiv:1802.05799.
[44] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014, arXiv:1409.1556.
[45] D. R. So, C. Liang, and Q. V. Le, "The evolved transformer," 2019, arXiv:1901.11117.
[46] E. Strubell, A. Ganesh, and A. McCallum, "Energy and policy considerations for deep learning in NLP," 2019, arXiv:1906.02243.
[47] Y. Sun, M. Kirley, and S. K. Halgamuge, "A recursive decomposition method for large scale continuous optimization," IEEE Trans. Evol. Comput., vol. 22, no. 5, pp. 647–661, Oct. 2018.
[48] S. H. Teng and P. Spheres, "Unified geometric approach to graph separators," in Proc. 31st Annu. Symp. Foundations Comput. Sci., 1991, pp. 538–547.
[49] F. Teraoka, Y. Yokore, and M. Tokoro, "A network architecture providing host migration transparency," in Proc. Conf. Commun. Architecture Protoc., 1991, pp. 209–220.
[50] J. L. Träff, "Direct graph k-partitioning with a Kernighan–Lin like heuristic," Operations Res. Lett., vol. 34, no. 6, pp. 621–629, 2006.
[51] Nvidia unified memory. Accessed: Mar. 21, 2021. [Online]. Available: https://developer.nvidia.com/blog/unified-memory-cuda-beginners/
[52] A. Vaswani et al., "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998–6008.
[53] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko, "Sequence to sequence – Video to text," in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 4534–4542.
[54] M. Wahib et al., "Scaling distributed deep learning workloads beyond the memory capacity with KARMA," 2020, arXiv:2008.11421.
[55] L. Wang, S. Xie, T. Li, R. Fonseca, and Y. Tian, "Sample-efficient neural architecture search by learning action space," 2019, arXiv:1906.06832.
[56] L. Wang et al., "Superneurons: Dynamic GPU memory management for training deep neural networks," in Proc. 23rd ACM SIGPLAN Symp. Princ. Pract. Parallel Program., 2018, pp. 41–53.
[57] L. Wang, Y. Zhao, Y. Jinnai, Y. Tian, and R. Fonseca, "AlphaX: Exploring neural architectures with deep neural networks and Monte Carlo tree search," 2019, arXiv:1903.11059.
[58] Y. Wu et al., "Google's neural machine translation system: Bridging the gap between human and machine translation," 2016, arXiv:1609.08144.
[59] B. Yang, J. Zhang, J. Li, C. Re, C. R. Aberger, and C. De Sa, "PipeMare: Asynchronous pipeline parallel DNN training," 2019, arXiv:1910.05124.
[60] E. Yang, S.-H. Kim, T.-W. Kim, M. Jeon, S. Park, and C.-H. Youn, "An adaptive batch-orchestration algorithm for the heterogeneous GPU cluster environment in distributed deep learning system," in Proc. IEEE Int. Conf. Big Data Smart Comput., 2018, pp. 725–728.
[61] Nvidia CUDA zero-copy. Accessed: Mar. 21, 2021. [Online]. Available: https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#zero-copy
[62] Y. Zhao, L. Wang, Y. Tian, R. Fonseca, and T. Guo, "Few-shot neural architecture search," 2020, arXiv:2006.06863.