Pro-Prophet: A Systematic Load Balancing Method for Efficient Parallel Training of Large-scale MoE Models
Wei Wang, Zhiquan Lai, Shengwei Li, Weijie Liu, Keshi Ge, Ao Shen, Huayou Su, Dongsheng Li

arXiv:2411.10003v2 [cs.DC] 21 Nov 2024

Abstract—The size of deep learning models has been increasing to enhance model quality. Because the training computation budget grows linearly with model size, training an extremely large-scale model is exceedingly time-consuming. Recently, the Mixture of Experts (MoE) has drawn significant attention as it can scale models to extra-large sizes with a stable computation budget. However, inefficient distributed training of large-scale MoE models hinders their broader application. Specifically, a considerable dynamic load imbalance occurs among devices during training, significantly reducing throughput. Several load-balancing works have been proposed to address the challenge. System-level solutions draw more attention for their hardware affinity and non-disruption of model convergence compared to algorithm-level ones. However, they are troubled by high communication costs and poor communication-computation overlapping. To address these challenges, we propose a systematic load-balancing method, Pro-Prophet, which consists of a planner and a scheduler for efficient parallel training of large-scale MoE models. To adapt to the dynamic load imbalance, we profile training statistics and use them to design Pro-Prophet. For lower communication volume, the Pro-Prophet planner determines a series of lightweight load-balancing strategies and efficiently searches for a communication-efficient one based on the statistics. For sufficient overlapping of communication and computation, the Pro-Prophet scheduler schedules the data-dependent operations based on the statistics and operation features, further improving the training throughput. We conduct extensive experiments on four clusters and five MoE models. The results indicate that Pro-Prophet achieves up to 2.66x speedup compared to two popular MoE frameworks, Deepspeed-MoE and FasterMoE. Furthermore, Pro-Prophet demonstrates a load-balancing improvement of up to 11.01x compared to a representative load-balancing work, FasterMoE.

Index Terms—Deep learning, mixture of experts, distributed training

The first two authors contributed equally to this work. Wei Wang, Zhiquan Lai, Shengwei Li, Weijie Liu, Keshi Ge, Ao Shen, Huayou Su and Dongsheng Li are with the National Key Laboratory of Parallel and Distributed Computing, College of Computer, National University of Defense Technology, Changsha, Hunan, China. (Corresponding author: Dongsheng Li.) E-mail: {wwking, zqlai, swli, liuweijie, gekeshi, shenao, shyou, dsli}@nudt.edu.cn. This work is supported by the National Key R&D Program of China (No. 2022YFB4501400) and the National Natural Science Foundation of China under Grants No. 62025208 and 62421002.

I. INTRODUCTION

In recent years, large-scale deep neural networks have achieved superior performance in various domains (e.g., NLP, CV). Previous works have shown that model capacity improves with increased model size, further promoting model scaling. However, the substantial computational demand of extra-large models makes the training process excessively time-consuming. As one of the most promising solutions, the Mixture of Experts (MoE) enables a nearly constant computational budget as the model scales. Generally, some layers of a foundation model are replaced with MoE layers to generate a MoE model. Each MoE layer contains a gate network and a range of sub-modules named experts. The gate network routes each input to the top-k experts that excel in processing that input. Because k is a hyper-parameter, the MoE model can be scaled with consistent computational requirements by increasing the number of experts.

As the model scales further, effective collaboration of devices is necessary for extra-large MoE model training. Unfortunately, it is inefficient to train such models with traditional parallelism such as Data Parallelism (DP), Model Parallelism (MP), and Pipeline Parallelism (PP). To overcome this, GShard [1] introduced a specific parallel strategy named Expert Parallelism (EP). Nowadays, extra-large MoE models trained with EP have demonstrated the highest accuracy in multiple tasks [2], [3].

However, training MoE models using EP presents a dynamic load imbalance among devices. For each MoE layer, EP equally divides experts across devices before training and dynamically routes inputs according to the gate network during training. Most inputs are transferred to and processed by a few devices, resulting in prolonged communication and computation. Furthermore, the imbalance varies throughout the training process, making it difficult to resolve. Numerous load-balancing attempts have been proposed to improve training throughput. Algorithmic works often restrict the upper bound of each expert's load [4], [5] or add auxiliary losses to the loss function [6], [7] for a more balanced load. However, they affect model convergence and can even deteriorate model quality. Considering this drawback, systematic solutions in MoE systems draw more attention. Popular systematic works [8], [9] dynamically readjust the expert placement according to the load, achieving a balanced load without harming model quality.

However, these systematic solutions struggle to enhance training efficiency effectively due to two drawbacks. 1) Heavy communications of model states (i.e., parameters, gradients and optimizer states [10]) are introduced. Previous expert placements introduce a global transfer of parameters and gradients, or communicate the whole model states. These transferring strategies involve unnecessary communications across devices, hindering the improvement of training efficiency. 2) Devices experience significant communication and computation idling during training. Due to data dependencies among operators, these solutions have to perform some communications and computations sequentially. For example, only after the experts have been selected and their model states have been transmitted can their computations on inputs be launched; likewise, the aggregation of gradients occurs only after the computation of gradients is finished. They neglect the potential of communication and computation overlapping, thus significantly reducing device utilization.

In this paper, we propose a systematic load-balancing solution, Pro-Prophet, which overcomes the two drawbacks with a planner and a scheduler respectively.

To adapt to the dynamic behavior presented in the training of a MoE model, we profile the input distribution (i.e., the number of inputs processed by each expert) for each MoE layer. We observe that distributions of a MoE layer in adjacent iterations present high similarity. This locality is the key to effective load balancing.

To reduce the communication volume, the Pro-Prophet planner introduces a series of lightweight expert placements. In a lightweight expert placement, each expert is independently allocated to a subset of devices, and communication of parameters and gradients for the expert occurs only among these devices. To evaluate expert placements, the planner proposes a performance model that estimates the execution time of a MoE layer employing a lightweight expert placement. However, it is non-trivial to find the optimal placement due to the combinatorial explosion of the number of expert placements. To tackle this, the planner designs a locality-based greedy algorithm. The algorithm employs a greedy strategy to search for a communication-efficient expert placement. Besides, its launching frequency is reduced based on the locality, further improving the training throughput.

To exploit the potential of communication-computation overlapping, the Pro-Prophet scheduler comprehensively schedules operations based on the locality and the features of operations. The locality means that we can estimate the input distribution of the upcoming iteration according to the current one. Once the upcoming distribution is obtained, we can promptly determine a communication-efficient expert placement for the upcoming iteration and transmit the parameters of experts in advance, which provides the opportunity to overlap communications and computations across adjacent iterations. Besides, the gradient aggregation can be scheduled backward for better overlapping. Based on these, the scheduler identifies a scheduling space and designs a block-wise scheduling strategy to comprehensively overlap communications and computations.

We implement Pro-Prophet on top of PyTorch and conduct extensive experiments on four different clusters of up to 32 devices with five variant models. The results demonstrate that Pro-Prophet achieves speedups of up to 2.66x compared to two popular MoE frameworks. Additionally, Pro-Prophet demonstrates load-balancing enhancements of up to 11.01x compared to a representative load-balancing work, FasterMoE.

Our main contributions are summarized as follows:
• We profile input distributions among adjacent iterations and identify a locality that guides the design of Pro-Prophet.
• We design the Pro-Prophet planner, which identifies several lightweight expert placements, abstracts a performance model, and designs a locality-based greedy algorithm to reduce the heavy communication of model states.
• We propose the Pro-Prophet scheduler, which generates a scheduling space and establishes a block-wise scheduling strategy based on the locality and the features of operations for comprehensive overlapping of computations and communications.
• We conduct comprehensive experiments for Pro-Prophet on different clusters and models. The results demonstrate that Pro-Prophet achieves up to 1.50x end-to-end speedup and 11.01x load-balancing enhancement over the representative load-balancing method.

II. BACKGROUND AND MOTIVATION

A. Background

Recent works in DNN model training have shown that model capacity can be improved with increasing training data, model scale, and computational budget [11]. Extraordinary performance has been achieved in several deep learning domains, including natural language processing (NLP) and computer vision (CV).

However, significant training overhead comes along with the superior model capacity. Training an extra large-scale model [12]–[17] often takes months on thousands of dedicated accelerators (e.g., MT-NLG [18] took three months to train on over two thousand A100 GPUs), which hinders the development of deep learning.

In recent years, dynamic sparsely-activated architectures have been proposed to address this problem. One of the popular approaches is the Mixture of Experts (MoE), which can significantly improve the model capacity while maintaining a consistent computational budget. Nowadays, MoE has been successfully applied to large language models [19]–[25]. MoE models from industry and academia have drawn great attention from researchers. For example, Google has trained a series of MoE models called GLaM [2]. The largest GLaM model is seven times larger than GPT-3, but its training cost is less than 1/3 of GPT-3's. Experiments show that these models achieve higher accuracy than GPT-3 on 29 zero-shot, one-shot, and few-shot learning tasks, demonstrating the superiority of MoE models. Another example is GPT-4 [26]. OpenAI's technical report indicates that GPT-4 is a MoE model, which achieved the highest performance in various downstream tasks. Besides, ChatGPT based on GPT-4 has caused a tremendous sensation.

Fig. 1. The structure of a MoE model and a MoE layer. The MoE model consists of MoE and non-MoE layers stacked on top of each other. The MoE layer consists of a series of experts and a gate network for routing inputs to experts. For each input, the gate network computes the relationship between the input and the three experts and allocates it to the top-1 expert for computation.

Fig. 1 illustrates the architecture of a MoE model and a MoE layer. The MoE model comprises a stack of non-MoE and MoE layers. A MoE layer consists of two components: 1) a series of experts (3 experts in the figure), where each excels in a specific domain; 2) a gate network, which routes each input to a few experts that are skilled in dealing with this input, rather than to all experts. In a MoE layer, for each input, the gate network computes the relationship between that input and all the experts. Then it routes the input to the top-k (k=1 in the figure) expert(s) for computation. Even as we increase the number of experts (and thus the model size), each input is still routed to a fixed number (k) of experts, and the computational budget of the gate network grows negligibly, thereby enabling the scaling of the model with nearly constant computational overhead.

With the increase of the model scale, an isolated device cannot support the training of the MoE model, so various parallelisms have been proposed. Two common parallel approaches are DP and MP. DP equally divides the inputs of an iteration across all devices and replicates the model on all devices. In forward propagation (FP), each device computes its local inputs independently using its model replica. In backward propagation (BP), the Allreduce primitive is performed after the backward computation. Different from DP, MP partitions the model across devices in a specific manner, and each device holds a complete copy of the data. Aggregation primitives are launched whenever required in FP and BP.

For efficient training of a MoE model, Google combines DP and MP into EP. On the input side, EP adopts the same input partitioning paradigm as DP. On the model side, EP assigns the same number of experts to each device and copies the other parts of the model (i.e., the gate network and non-MoE layers) to all devices.

Fig. 2. A workflow of Expert Parallelism (EP) in a MoE layer. Following the gate network, batched inputs are first exchanged via an All-to-All (A2A) operation. After the expert computation on all devices, a second A2A operation passes the experts' outputs back to the devices where the corresponding inputs were originally located.

Fig. 2 illustrates a workflow of EP in a MoE layer. Firstly, the gate network determines the top-1 expert for each input. Then inputs are transferred to the corresponding devices via an All-to-All (A2A) communication operation [27]–[29]. Subsequently, each device performs the expert computation for the collected inputs and then launches another A2A to return the results to the inputs' original devices for the computation of the subsequent non-MoE layer. Nowadays, many popular distributed frameworks support the training of large-scale MoE models using EP [4], [30]–[33].

Fig. 3. The imbalanced load of experts in an iteration. The model contains 12 MoE layers and each MoE layer contains 16 experts. The vertical axis indicates layer indexes, and the horizontal axis denotes the index of experts. The depth of color represents the proportion of total inputs that an expert handles. The three heaviest experts are responsible for over 50% of the inputs while the three lightest experts compute less than 5%.

Even though EP makes it feasible to train extra-large MoE models with up to trillions of parameters, a dynamic load imbalance occurs among devices. Specifically, most of the training inputs are transferred to and processed by a few devices, and these heavy-load devices vary as training proceeds. Fig. 3 presents the imbalanced load of experts in an iteration. The vertical axis indicates layer indexes, and the horizontal axis denotes the index of experts. The MoE model contains 12 MoE layers and each MoE layer contains 16 experts. Each expert is placed on a dedicated device. The depth of color represents the proportion of total inputs processed by an expert. In most MoE layers, the three heaviest experts hold over 50% of the inputs, while the three lightest hold less than 5%. The unbalanced load means that devices containing light-load experts have to wait for devices containing heavy-load ones, incurring significant under-utilization of devices during training.
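To make the routing and the resulting input distribution concrete, the following minimal PyTorch sketch (an illustration, not the paper's implementation) applies a top-1 gate to a batch of token representations and counts how many tokens each expert receives; this per-expert count is the input distribution referred to throughout the paper. The sizes and variable names are hypothetical.

import torch
import torch.nn as nn

# Hypothetical sizes for illustration.
num_experts, d_model, num_tokens = 16, 1024, 4096

# Gate network: a linear projection from token features to expert scores.
gate = nn.Linear(d_model, num_experts, bias=False)

tokens = torch.randn(num_tokens, d_model)        # batched inputs of a MoE layer
scores = gate(tokens)                            # (num_tokens, num_experts)
top1_expert = scores.argmax(dim=-1)              # top-1 routing decision per token

# Input distribution: number of tokens routed to each expert in this iteration.
# Under EP with one expert per device, this is also the per-device load.
input_distribution = torch.bincount(top1_expert, minlength=num_experts)
print(input_distribution)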

B. Motivation

A series of methods have been proposed to balance the load. We divide them into algorithmic and systematic methods. On the algorithmic side, researchers constrain the upper bound of inputs received by an expert or add auxiliary losses to the loss function. They change the inputs-to-experts mapping, thus affecting and even deteriorating model convergence. Different from algorithmic works, systematic solutions do not affect model convergence and fit the hardware well, thus attracting extensive attention. These solutions adaptively adjust the experts-to-devices mapping based on the device load during training, effectively improving training efficiency. However, their heavy load-balancing overhead hinders further improvement of the efficiency.

TABLE I
TIME BREAKDOWN OF TRAINING. L.B. IS SHORT FOR LOAD BALANCING.

Model        L.B.    Search  Place   Reduce  Others
MoE-GPT-S    29.9%   6.8%    11.6%   11.5%   70.1%
MoE-GPT-M    29.2%   3.2%    12.5%   12.5%   70.8%
MoE-GPT-L    34.5%   2.6%    14.2%   17.7%   64.5%
MoE-GPT-DS   33.8%   6.1%    13.8%   13.9%   66.2%
MoE-GPT-DM   37.1%   6.1%    16.1%   14.9%   62.9%

As shown in Table I, previous solutions introduce Search, Place and Reduce processes to balance the load. However, the overhead of load balancing is up to 37.1%. There are two reasons behind this huge cost. Firstly, they perform heavy communication of model states: they have to transfer the parameters and gradients of heavy-load experts among all devices or transmit the whole model states of experts. Secondly, they cannot sufficiently overlap communication and computation during the training of a MoE model. Due to data dependency, they have to perform communication and computation sequentially.

Fig. 4. The locality of input distributions. The areas between the different colored curves represent the number of inputs received by each of the different experts. It shows that distributions of adjacent iterations remain relatively constant.

Locality. Fortunately, we have discovered a property in the training of MoE models that makes it possible to address these challenges efficiently. Fig. 4 depicts the input distributions in the second MoE layer of a MoE model. The areas with different colors represent inputs received by different experts. It is worth noting that the distribution in other MoE layers follows a similar pattern. As shown in the figure, only slight fluctuations of the distribution occur across adjacent iterations, which indicates that the load of each expert remains relatively stable in adjacent iterations. This phenomenon suggests that the distribution exhibits a locality among iterations.
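As a hedged illustration of how such locality could be quantified (this is not the paper's measurement code), the sketch below tracks the per-expert input distribution across iterations and reports the relative change between adjacent iterations; a small change corresponds to the stability visible in Fig. 4. The example distributions are hypothetical.

import torch

def relative_change(prev: torch.Tensor, curr: torch.Tensor) -> float:
    # L1 distance between two input distributions, normalized by total tokens.
    return (curr - prev).abs().sum().item() / max(curr.sum().item(), 1)

# Hypothetical per-expert token counts for two adjacent iterations.
dist_prev = torch.tensor([520, 180, 150, 174])   # iteration t-1
dist_curr = torch.tensor([505, 190, 160, 169])   # iteration t

drift = relative_change(dist_prev, dist_curr)
print(f"relative change between adjacent iterations: {drift:.2%}")
# A small drift suggests the previous distribution is a good estimate of the
# next one, which is the locality Pro-Prophet exploits.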

III. OVERVIEW OF PRO-PROPHET

Motivated by Section II, we propose a systematic load-balancing approach, Pro-Prophet, which can efficiently balance the load of devices. The overview of Pro-Prophet is presented in Fig. 5. Pro-Prophet is composed of a planner and a scheduler. The MoE model, the locality, and the device pool are its three inputs. The device pool defines the topology of devices. The utilization of the locality is the key advantage of Pro-Prophet.

Fig. 5. The overview of Pro-Prophet. Pro-Prophet is composed of the Pro-Prophet planner and the Pro-Prophet scheduler. The MoE model, locality, and device pool are its three inputs. Firstly, the Pro-Prophet planner searches for a communication-efficient expert placement using its locality-based greedy algorithm. The algorithm iteratively generates and evaluates a lightweight expert placement utilizing its performance model until the load is balanced. Then the execution engine produces a load-balanced workflow based on the planner. Finally, the Pro-Prophet scheduler schedules three data-dependent operations to parallel operations for communication and computation overlapping, further improving the training throughput.

Firstly, the Pro-Prophet planner searches for a communication-efficient expert placement from a series of lightweight expert placements using its locality-based greedy algorithm. The algorithm iteratively generates and evaluates a lightweight expert placement utilizing a performance model until the load is balanced.

Then, the execution engine analyzes the procedures of the planner and produces a load-balanced workflow for load balancing.

Finally, after analyzing the workflow, the Pro-Prophet scheduler establishes the scheduling space and schedules the data-dependent operations (i.e., Plan, Trans, and Agg) to parallel operations (i.e., Para.Op1 and Para.Op2) for communication and computation overlapping. The meanings of these operations are presented in Sec. IV.

IV. PRO-PROPHET PLANNER

A. Lightweight Expert Placement

The design of the expert placement is crucial for efficient load balancing. To reduce the communication of model state transfers, the planner introduces a series of lightweight expert placements.

In a lightweight expert placement, each expert is mapped to one or more devices independently. Only the parameters and gradients, rather than all model states, are transferred among its devices. We use the Trans and Agg primitives to describe these two communications respectively. In the forward pass, a Trans is first launched to transfer the parameters. After that, each device contains the parameters of some experts, so its local inputs routed to these experts can be computed locally. After the backward computation, the gradients of an expert may be generated on several devices. As each device only maintains the optimizer states of one expert, an Agg primitive is launched to aggregate the gradients of each expert to its original device. This design has two advantages: 1) only part of the model states are communicated; 2) the model states are communicated only among a subset of devices.

Fig. 6. The comparison of a traditional and a lightweight expert placement. The load is imbalanced under the traditional expert placement. In a lightweight one, each expert is placed onto the necessary devices to balance the load. The Trans and Agg primitives communicate their parameters and gradients respectively.

Fig. 6 illustrates a comparison of a traditional and a lightweight expert placement. As shown in Fig. 6a, 5, 2, and 2 inputs are routed to E0, E1, and E2 respectively. After the A2A communication, the three devices are responsible for the computation of 5, 2, and 2 inputs, as each device only contains the parameters of a distinct expert (e.g., Dev. 0 contains E0's parameters), resulting in an imbalanced load among devices. Fig. 6b shows a balanced load achieved by the lightweight expert placement. Experts are mapped to devices according to the routing results produced by the gate network. Parameters of E0 are sent from Dev. 0 only to Dev. 1, as inputs on Dev. 2 are not routed to E0. Similarly, parameters of E1 are transferred to Dev. 0 and Dev. 2 for their expert computation. This placement maps experts only to the necessary devices and communicates only their parameters and gradients, effectively avoiding heavy model state transfers.
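The Trans and Agg primitives map naturally onto standard collective operations. The sketch below is a simplified, hypothetical rendering (not the paper's implementation) of how an expert's parameters could be sent to the devices selected by the planner in the forward pass and its gradients reduced back to the owner after the backward pass, using torch.distributed process groups.

import torch
import torch.distributed as dist

def trans_expert(expert: torch.nn.Module, owner_rank: int, group):
    # Trans: send the expert's parameters from its owner to the devices in `group`.
    # `group` is assumed to be a process group (dist.new_group([...])) containing
    # the owner rank plus the ranks chosen by the planner.
    for p in expert.parameters():
        dist.broadcast(p.data, src=owner_rank, group=group)

def agg_expert(expert: torch.nn.Module, owner_rank: int, group):
    # Agg: sum the expert's gradients produced on the group and leave the result
    # on the owner device, which holds the optimizer states.
    for p in expert.parameters():
        if p.grad is not None:
            dist.reduce(p.grad, dst=owner_rank, op=dist.ReduceOp.SUM, group=group)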

B. Performance Model

It is necessary to evaluate lightweight expert placements under various device loads. Therefore, the planner abstracts a performance model to estimate the execution time of a MoE layer employing a lightweight expert placement. Table II presents the notations used in the performance model.

TABLE II
NOTATIONS

Notation  Description
T         Execution time of an operation
R         Inputs received by a device from other devices
B         Average communication bandwidth
H         Inputs computed in a device
t         Computation throughput
s         Number of selected experts to be transferred
n         Number of devices a selected expert is not transferred to
E         Number of experts in a MoE layer
D         Number of devices

After employing a lightweight expert placement, a MoE layer performs four A2A communication operations, one forward expert computation operation FEC, one backward expert computation operation BEC, one Trans operation, and one Agg operation. To accurately evaluate the execution time of the MoE layer, we establish our performance model according to the implementation of the operations and the hardware characteristics.

A2A communication. Tutel [5] presents an efficient A2A implementation used in the training of a MoE model. In this implementation, devices use point-to-point (P2P) communication primitives to realize the A2A communication operation. Based on this, we define the execution time of an A2A operation as

T_{A2A}(R) = \max_i \frac{R_i \cdot size(input)}{B},    (1)

where R_i is the total number of inputs received by device i from other devices and size(input) is the size of an input.

Expert computation. Next, we formulate the duration of the forward and backward expert computation. In the expert computation procedure, the computations of different devices are performed simultaneously, but the computations of different experts are launched sequentially within a device. To depict this characteristic, we define the execution time of FEC as

T_{FEC}(H) = \max_i \frac{H_i}{t},    (2)

where H_i is the number of inputs computed in device i.

It is widely recognized that the time required for backward computation in DNN training is roughly double that of forward computation, which also holds for MoE model training. Therefore, we define the execution time of BEC as

T_{BEC}(H) = 2 \max_i \frac{H_i}{t}.    (3)

Trans and Agg primitives. Finally, we formulate the overhead of the Trans and Agg primitives. Their duration depends on two elements. The first is the number of transferred experts, which determines the number of communication rounds. The second is the number of devices communicated with in a primitive, which influences the communication scale. Therefore, T_{Trans}(s, n) and T_{Agg}(s, n) are defined as

T_{Trans}(s, n) = \frac{s \cdot (D - n) \cdot size(e_j.params)}{D \cdot B},    (4)

T_{Agg}(s, n) = \frac{s \cdot (D - n) \cdot size(e_j.grads)}{D \cdot B},    (5)

where size(e_j.params) and size(e_j.grads) are the sizes of the parameters and gradients of the j-th expert.

In summary, the overall execution time of a MoE layer with a lightweight expert placement can be represented as

T'(R, H, s, n) = 4 T_{A2A}(R) + 3 T_{FEC}(H) + T_{Trans}(s, n) + T_{Agg}(s, n).    (6)
C. Locality-based Greedy Algorithm

The performance model can accurately estimate the execution time of a MoE layer under any expert placement. However, it is still necessary to determine a communication-efficient placement in various load-imbalance scenarios. There are 2^{N·E} potential lightweight expert placements, so a brute-force search is time-consuming and could become a performance bottleneck.

Therefore, the planner offers an efficient greedy search algorithm, shown in Algorithm 1. Taking the gate network results gating, s, and n as input, Algorithm 1 iteratively generates and evaluates a better expert placement until the load is balanced. Finally, it outputs a communication-efficient expert placement PoE.

Algorithm 1: Greedy search algorithm
  Input: Inputs-to-experts mapping gating
  Input: n
  Result: Communication-efficient expert placement PoE
  // Preliminary
  1  T_output <- T'(R, H, 0, 0);
  2  H, R <- GetH&R(gating);
  3  L, n_bottoms <- [], [];
  4  cnt <- 0;
  // Iteratively search
  5  while not balanced do
       // Get the index of the heaviest device
  6    i <- arg max(H);
  7    if i in Used then
  8      break;
  9    end
  10   Used.append(i);
       // Determine the n devices holding the smallest number of inputs for expert i
  11   n_bottom <- BottomK(gating, n);
  12   L.append(i);
  13   n_bottoms.append(n_bottom);
  14   s <- size(L);
       // Redistribute inputs among devices according to the expert placement
  15   H, R <- Replace_Inputs(L, n_bottoms);
       // Evaluate the expert placement
  16   T_changed <- T'(R, H, s, n);
  17   if T_changed < T_output then
  18     T_output <- T_changed;
  19     cnt <- s;
  20   end
  21 end
  // Return the communication-efficient expert placement
  22 PoE <- Get_PoE(L[0 : cnt], n_bottoms[0 : cnt]);
  23 return PoE;

Initially, the algorithm estimates the execution time of a MoE layer without applying any lightweight expert placement and records it as the minimum time. It then employs two greedy strategies to generate a lightweight expert placement that optimizes the load of devices. Specifically, it prioritizes the expert with the largest number of responsible inputs for selection and transfers its parameters to the devices that hold more inputs processed by that expert. The algorithm maintains the lists L and n_bottoms to record the expert placement. It then evaluates the expert placement using the performance model and updates the minimum time and a counter if the current expert placement achieves better performance. The search process is repeated until the load is balanced. The condition of a balanced load is

\max(H) - \min(H) < \alpha \cdot \frac{I}{E},    (7)

where I is the number of inputs trained in an iteration and alpha is an adjustable coefficient for different load-balance requirements.

As the search algorithm needs to run during MoE model training, we define a primitive Plan to describe this search process. As mentioned in Sec. II, the input distributions of adjacent iterations are similar, which inspired us to predict the distribution and reduce the execution frequency of the algorithm. Based on this inspiration, the planner upgrades the algorithm to a locality-based one. Users can adjust the frequency of the search algorithm flexibly for better training efficiency.
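To make the greedy selection and the balance condition of Eq. (7) concrete, below is a simplified Python sketch. It is an interpretation of Algorithm 1 under the assumption of one expert per device, not the released implementation; the cost_fn argument stands in for the performance model (for instance, a wrapper around the moe_layer_time helper sketched earlier), and device_loads is a simplified stand-in for the planner's Replace_Inputs step.

def is_balanced(H, alpha, total_inputs, E):
    # Balance condition of Eq. (7).
    return max(H) - min(H) < alpha * total_inputs / E

def device_loads(gating_counts, selected):
    # gating_counts[e][d]: tokens on device d routed to expert e (hypothetical layout).
    # Initially expert e lives only on device e. For every (expert, excluded) pair in
    # `selected`, the expert is replicated on all other devices, so their local tokens
    # are computed in place; tokens from the excluded devices still travel to device e.
    E = len(gating_counts)
    replicated = dict(selected)
    H = [0] * E
    for e in range(E):
        for d in range(E):
            if e in replicated and d != e and d not in replicated[e]:
                H[d] += gating_counts[e][d]
            else:
                H[e] += gating_counts[e][d]
    return H

def greedy_placement(gating_counts, n, alpha, cost_fn):
    E = len(gating_counts)
    total_inputs = sum(sum(row) for row in gating_counts)
    selected, used = [], set()
    H = device_loads(gating_counts, selected)
    best_cost, best_len = cost_fn(H, 0), 0
    while not is_balanced(H, alpha, total_inputs, E):
        i = max(range(E), key=lambda e: H[e])        # heaviest device/expert
        if i in used:
            break
        used.add(i)
        # Exclude the n devices holding the fewest tokens for expert i.
        excluded = tuple(sorted(range(E), key=lambda d: gating_counts[i][d])[:n])
        selected.append((i, excluded))
        H = device_loads(gating_counts, selected)
        cost = cost_fn(H, len(selected))
        if cost < best_cost:
            best_cost, best_len = cost, len(selected)
    return selected[:best_len]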

V. PRO-PROPHET SCHEDULER

Previous works introduce a search process (corresponding to the Plan primitive) and model state transfers (corresponding to the Trans and Agg primitives) to balance the load. However, their execution is blocked by other operators due to data dependencies, constraining further improvement of training efficiency. In this section, we introduce the designs of the scheduler, which extensively overlaps computation and communication based on the locality described in Sec. II.

A. Scheduling Space Establishment

As shown in Fig. 1, a MoE model consists of MoE and non-MoE layers stacked on top of each other. We combine every adjacent pair of a MoE layer and a non-MoE layer into a MoE block.

Fig. 7. The device state of operators in a MoE block. FEC and BEC are the forward and backward computations of the MoE layer. FNEC and BNEC are the forward and backward computations of the non-MoE layer. An operator is marked as comm (green rectangle) if devices only communicate during its execution. Similarly, comp (blue rectangle) marks computation operators.

Fig. 7 presents the device state of operators in a MoE block. Only the primary operations are presented for clarity. During the FP, two primitives related to load balancing (i.e., Plan and Trans) and three basic operations (i.e., two A2A communications and a forward expert computation FEC) are performed in a MoE layer. For the non-MoE layer, only a forward computation FNEC is executed. After the gate network produces the routing decision and input distribution, a Plan operator is performed to identify a load-balancing strategy based on the load of devices. The Trans primitive can only be launched once the strategy is determined, and then the two A2A communications and the expert computation can be launched. During the BP, the MoE layer executes an Agg operation, two A2A communications, and a backward expert computation BEC. Only a backward computation BNEC is performed in the non-MoE layer. The Agg operation aggregates the gradients of experts after their backward computation completes.

We label an operation as comm if devices communicate during its execution. Similarly, if an operation computes all the time, we label it as comp. Before the launch of Plan, the information needed to produce a load-balancing strategy is already stored on each device, so only computation happens during Plan; we label it as comp. As the Trans primitive only exchanges the parameters of experts based on the load-balancing strategy, we label it as comm. According to the description in Sec. II, A2A is marked as comm, and the computations of the MoE layer and non-MoE layer are tagged as comp. As for the Agg primitive, gradients of the same expert are aggregated onto one device; we flag it as comm.

The operators with data dependencies are tightly interconnected along the timeline, constraining the scheduling of computation and communication. However, the locality mentioned in Sec. II enables the pre-launch of some data-dependent operations without breaking the data dependency. We can insert data-independent operations between data-dependent operators to make room for communication and computation overlapping. Specifically, in the case of a MoE layer, the input distribution of the current iteration can be estimated by leveraging the distributions of former iterations, which allows us to produce the load-balancing strategy by invoking a Plan operator in an earlier iteration. Subsequently, Trans primitives can be scheduled to earlier locations on the timeline. We also find that the Agg primitive is independent of later computation operations, so we can schedule it to later positions.

The analysis above provides the potential for computation and communication scheduling. However, there are several constraints on arbitrary scheduling. Firstly, the estimation of the distribution means that we can establish a load-balancing strategy within a former iteration. As the distribution of the last iteration is necessary for higher estimation accuracy, the earliest position of a Plan primitive of the i-th iteration is the (i-1)-th iteration. Secondly, there are two main ways to update parameters, and it is necessary to update the expert parameters before the Trans primitive. We can perform the update layer by layer [34] or update at the end of the BP [8]. For layer-by-layer updating, we can launch the Trans primitive of a MoE layer within the last iteration, which does not apply to concentrated updating schemes. For concentrated updating, the Trans primitive could be performed at the end of the BP of the last iteration, which has a similar effect to starting it within this iteration. For the universality of our method, we confine the scheduling of the Trans primitive within a single iteration. Lastly, it is necessary to aggregate the gradients of experts in each iteration. Therefore, the placement of the Agg primitive is also confined within a single iteration.

Fig. 8. Scheduling space. The subscript denotes the index of the MoE block and the superscript denotes the iteration. For example, Trans_{i+1:l}^{j} denotes the set of Trans operations spanning from block i+1 to block l during iteration j, where l is the number of MoE blocks.

Fig. 8 illustrates our scheduling space. The subscript denotes the index of the MoE block of the operation and the superscript denotes its iteration. All Plan computations of iteration j+1 can be scheduled to the A2A communication of iteration j. The Trans primitives from block i+1 to l during the j-th iteration are overlapped with the forward computations of the i-th block, where l is the total number of blocks. Similarly, the Agg primitives from block i+1 to l are orchestrated to overlap with the backward computations of the i-th block. Our scheduling space treats a MoE block as a unified reordering entity, thus overcoming the limitations in scope imposed by previous methods.

B. Block-wise Scheduling Strategy

The scheduling space provides extensive strategies for scheduling. However, operator-grained scheduling is far from making full use of the overlapping space. We therefore partition operators into sub-operators and perform scheduling at the sub-operator level. Fig. 9, which shows three ways to schedule a Trans primitive, gives a brief example of the advantage of sub-operator scheduling. The overhead of the Trans primitive varies with the load of devices (e.g., the number of heavy-load experts changes during training). As shown in Fig. 9a and Fig. 9b, a forward computation of a MoE layer or a non-MoE layer alone cannot hide a Trans primitive because of its short duration. Consequently, the Trans primitive will block the progress of model training. Fig. 9c presents the sub-operator scheduling. The Trans primitive is split into two sub-operators scheduled to the two computations respectively. Sub-operator scheduling improves the utilization of the overlapping space and reduces the communication overhead.

Fig. 9. Different scheduling strategies for a Trans primitive: (a) scheduling Trans to the forward expert computation; (b) scheduling Trans to the forward non-expert computation; (c) splitting and scheduling Trans to both the forward expert computation and the non-expert computation.

A dynamic scheduling strategy is appealing because the load of devices fluctuates during training. However, non-negligible overhead would be introduced if we determined, at runtime, an optimal scheduling strategy that hides as much as possible of the overhead of the three primitives introduced by systematic load-balancing methods. Therefore, we design an offline scheduling policy to overlap communication and computation while avoiding this extra overhead. The policy is founded on the static elements within the dynamics, not on intuition.

Our block-wise scheduling strategy is summarized in Algorithm 2. The first primitive is Plan. We schedule the Plan primitive of the i-th block in iteration j+1 to the A2A communication of the i-th block in the j-th iteration. For the Trans primitive, we overlap it with computations of the former block within an iteration. Specifically, the forward computation of the i-th block is responsible for overlapping the Trans primitive of block i+1. As two computations are executed in the i-th block, we split the Trans primitive into two sub-primitives and launch them simultaneously with the two computations. Even though the durations of Trans and FEC vary with the device loads, the forward computation overhead of the non-MoE layer and the transfer overhead of one expert's parameters are static. We can estimate them before training and split the Trans primitive accordingly. The advantage of this estimation is that we can exhaustively fill the communication idle time during the forward computation of the non-MoE layer. Finally, we schedule the Agg of block i+1 into the backward computation of the i-th block. Similarly, we can estimate the backward computation overhead of the non-MoE layer and perform a suitable communication partition.

Algorithm 2: Block-wise scheduling strategy
  Result: Training workflow after the scheduling.
  1  for j in iterations do
  2    for i in MoE blocks do
         // Forward propagation
  3      SubTrans1, SubTrans2 = Partition(Trans);
  4      SubAgg1, SubAgg2 = Partition(Agg);
  5      Launch in parallel {A2A_i^j, Plan_i^{j+1}};
  6      Launch in parallel {SubTrans1_{i+1}^j, FEC_i^j};
  7      Launch A2A_i^j;
  8      Launch in parallel {SubTrans2_{i+1}^j, FNEC_i^j};
         // Backward propagation
  9      Launch in parallel {SubAgg1_{i+1}^j, BNEC_i^j};
  10     Launch A2A_i^j;
  11     Launch in parallel {SubAgg2_{i+1}^j, BEC_i^j};
  12     Launch A2A_i^j;
  13   end
  14 end
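As a sketch of how a split Trans could be overlapped with the two forward computations of a block, the snippet below issues the two parameter sub-transfers asynchronously and only waits on them once the next block actually needs the parameters. This is illustrative rather than the paper's scheduler: it assumes NCCL-style asynchronous collectives, and the moe_layer, non_moe_layer and parameter-shard arguments are hypothetical.

import torch
import torch.distributed as dist

def forward_block_with_split_trans(moe_layer, non_moe_layer, x,
                                   next_params_a, next_params_b,
                                   owner_rank, group):
    # Overlap the two halves of the next block's Trans with FEC and FNEC.
    # `next_params_a/b` are the two parameter shards produced by the offline
    # partition of the Trans primitive (hypothetical split).

    # Sub-Trans 1: start the first half asynchronously, then run FEC.
    works_a = [dist.broadcast(p, src=owner_rank, group=group, async_op=True)
               for p in next_params_a]
    y = moe_layer(x)                    # FEC of the current block

    # Sub-Trans 2: start the second half, then run FNEC.
    works_b = [dist.broadcast(p, src=owner_rank, group=group, async_op=True)
               for p in next_params_b]
    y = non_moe_layer(y)                # FNEC of the current block

    # The next block waits for its parameters before its expert computation.
    for w in works_a + works_b:
        w.wait()
    return y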

C. Effective Collaboration with the Planner

To better integrate the planner and scheduler, we incorporate the scheduling performed by the scheduler into the performance model of the planner.

Specifically, we define the parallel execution times of the Trans and Agg primitives as T_{PTrans}(H, s, n) and T_{PAgg}(H, s, n) respectively. Besides, we denote by T_{FNEC} and T_{BNEC} the execution times of FNEC and BNEC. If T_{Trans}(s, n) can be hidden by T_{FEC}(H) and T_{FNEC}, then T_{PTrans}(H, s, n) equals 0; otherwise it equals T_{Trans}(s, n) - T_{FEC}(H) - T_{FNEC}. That means T_{PTrans}(H, s, n) = \max(0, T_{Trans}(s, n) - T_{FEC}(H) - T_{FNEC}). Similarly, T_{PAgg}(H, s, n) can be expressed as \max(0, T_{Agg}(s, n) - T_{BEC}(H) - T_{BNEC}).

With the above analysis, the overall execution time of the MoE layer estimated by the planner's performance model becomes

T'(R, H, s, n) = 4 T_{A2A}(R) + 3 T_{FEC}(H) + T_{PTrans}(H, s, n) + T_{PAgg}(H, s, n).    (8)

By combining the planner and scheduler, we can achieve a fine-grained pre-allocation of hardware resources to experts, efficiently addressing the load imbalance problem during training.
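Continuing the hedged performance-model sketch from Section IV-B (and reusing its t_a2a, t_fec, t_bec, t_trans and t_agg helpers), the overlap-aware terms of Eq. (8) reduce to clamped subtractions; t_fnec and t_bnec are assumed to be pre-measured constants for the non-MoE layer.

def t_ptrans(H, s, n, D, B, t, param_bytes, t_fnec):
    # Overlap-aware Trans cost: only the part that cannot be hidden remains.
    return max(0.0, t_trans(s, n, D, B, param_bytes) - t_fec(H, t) - t_fnec)

def t_pagg(H, s, n, D, B, t, grad_bytes, t_bnec):
    # Overlap-aware Agg cost, hidden behind BEC and BNEC.
    return max(0.0, t_agg(s, n, D, B, grad_bytes) - t_bec(H, t) - t_bnec)

def moe_layer_time_scheduled(R, H, s, n, D, B, t,
                             token_bytes, param_bytes, grad_bytes,
                             t_fnec, t_bnec):
    # Eq. (8): layer time when the scheduler hides Trans/Agg where possible.
    return (4 * t_a2a(R, token_bytes, B) + 3 * t_fec(H, t)
            + t_ptrans(H, s, n, D, B, t, param_bytes, t_fnec)
            + t_pagg(H, s, n, D, B, t, grad_bytes, t_bnec))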

VI. EVALUATION

Testbed. We test Pro-Prophet on three types of nodes named HPWNV, HPNV and LPWNV respectively. Each HPWNV node is equipped with 2 Intel Xeon CPUs (2.40GHz) and 4 NVIDIA 3090 GPUs with 24GB graphics memory. Each CPU is connected to two GPUs through PCI-Express 3.0. 100Gb/s Infiniband is used for inter-node communication. The difference between HPWNV and HPNV is that the GPUs within a HPNV node are connected by NVLink-3.0: the four GPUs are divided into two groups, and the two GPUs within a group are connected by an NVLink connection. The difference between HPWNV and LPWNV is that the GPUs of a LPWNV node are 2080Ti.

Models and baselines. As shown in Table III, we use five variants of MoE-GPT models in our experiments. All FFN layers are replaced by a MoE layer. The number of experts within a MoE layer is consistent with the number of GPUs. We compare Pro-Prophet with two representative MoE training systems: 1) Deepspeed-MoE [4], an efficient MoE framework developed by Microsoft that exclusively implements EP; 2) FasterMoE [8], a training system that employs a systematic load-balancing method, dynamic shadowing, to effectively accelerate model training.

TABLE III
MODEL CONFIGURATION.

Name        Layers  Embedding  Hidden
MoE-GPT-S   12      512        1024
MoE-GPT-M   12      1024       2048
MoE-GPT-L   12      2048       4096
MoE-GPT-DS  24      512        1024
MoE-GPT-DM  24      1024       2048

Default settings. Unless otherwise specified, we fix the following training settings. We train MoE models on the cluster consisting of HPWNV nodes. We evaluate Pro-Prophet within the first 100 iterations, as the input distribution tends to stabilize as training proceeds.

A. End-to-end Performance

In summary, Pro-Prophet achieves 1.36-2.66x and 1.01-1.48x speedups compared to Deepspeed-MoE and FasterMoE respectively.

Experiments on HPWNV. We first evaluate the end-to-end performance of Pro-Prophet on two HPWNV clusters containing 4 and 8 HPWNV nodes respectively. We fix the number of tokens trained in an iteration to 16384 and 32768. As shown in previous works, k is set to 1 or 2 [1], [6], [35] to better balance model quality and training efficiency. We conducted experiments under both values to validate the generality of Pro-Prophet.

Fig. 10. End-to-end performance. The numbers in the sub-captions denote speedups achieved by Pro-Prophet over the best baseline: (a) 4 HPWNV nodes, k=1, 1.01-1.23x; (b) 8 HPWNV nodes, k=1, 1.08-1.31x; (c) 4 HPWNV nodes, k=2, 1.14-1.48x; (d) 8 HPWNV nodes, k=2, 1.05-1.31x.

Fig. 10a and Fig. 10b illustrate speedups achieved by Pro-Prophet under the five benchmark models with a top-1 gate network. Pro-Prophet achieves 1.47-2.66x end-to-end performance gains in comparison with Deepspeed-MoE. Compared to FasterMoE, Pro-Prophet achieves performance enhancements of up to 1.31x, with an average of 1.19x.

As shown in Fig. 10c and Fig. 10d, the speedups under a top-2 gate network achieved by Pro-Prophet are 1.36-2.37x and 1.05-1.48x compared to Deepspeed-MoE and FasterMoE respectively. The results show that the coarse-grained and blocking manner of FasterMoE introduces additional runtime overhead and hinders further improvement of training efficiency. Our method can precisely pre-allocate resources to experts, thereby avoiding this issue.

Experiments on HPNV and LPWNV clusters. Different hardware conditions significantly affect the effectiveness of a method. For example, the device memory may constrain the maximum number of training tokens in an iteration. Besides, the training behavior may be altered by variations in computing throughput and communication bandwidth, significantly influencing the effectiveness of methods. To verify the generality of Pro-Prophet, we conduct experiments on diverse hardware environments with varying memory capacity and varying ratios of computing throughput to communication bandwidth.

TABLE IV
THE OVERALL SPEEDUP ON 4 HPNV NODES.

                                Speedup to Deepspeed-MoE
K  GPUs  Tokens  Model         FasterMoE  Pro-Prophet
1  16    16384   MoE-GPT-S     1.63       1.98
                 MoE-GPT-M     1.99       2.22
                 MoE-GPT-L     1.62       1.80
                 MoE-GPT-DS    1.34       1.70
                 MoE-GPT-DM    1.68       2.26
2  16    16384   MoE-GPT-S     2.31       2.62
                 MoE-GPT-M     1.82       2.10
                 MoE-GPT-L     1.94       2.23
                 MoE-GPT-DS    1.77       1.94
                 MoE-GPT-DM    1.84       2.07
"Tokens" is the number of tokens trained in an iteration.

TABLE V
THE OVERALL SPEEDUP ON 2 LPWNV NODES.

                                Speedup to Deepspeed-MoE
K  GPUs  Tokens  Model         FasterMoE  Pro-Prophet
1  8     4096    MoE-GPT-S     1.20       1.30
                 MoE-GPT-M     1.02       1.18
                 MoE-GPT-DS    1.12       1.30
                 MoE-GPT-DM    0.96       1.26
2  8     4096    MoE-GPT-S     1.56       1.91
                 MoE-GPT-M     1.29       1.94
                 MoE-GPT-DS    1.44       1.64
                 MoE-GPT-DM    1.25       1.58
"Tokens" is the number of tokens trained in an iteration.

Due to the limited memory capacity compared to the HPWNV and HPNV nodes, we only train the four smaller models listed in Table III on the LPWNV cluster. The number of tokens trained in one iteration is set to 4096.

Table IV shows the speedup results under various models and values of k in a cluster consisting of 4 HPNV nodes; the highest speedups are highlighted. With NVLink connections in the cluster, communication processes such as A2A are accelerated. Under this condition, Pro-Prophet achieves 1.71-2.63x speedups compared to Deepspeed-MoE and 1.10-1.35x compared to FasterMoE, demonstrating its adaptability to conditions with higher communication bandwidth. We also test Pro-Prophet on a cluster consisting of 2 LPWNV nodes. The results are presented in Table V, with the highest speedups emphasized as well. Due to the lower computation ability of the 2080Ti compared to the 3090 GPU, the impact of the computation process becomes more significant. In this environment, Pro-Prophet achieves a speedup of 1.18-1.94x compared to Deepspeed-MoE and 1.08-1.50x compared to FasterMoE, showing its robustness in conditions with lower computation power.

The result of the MoE-GPT-DM model with k=1 shows that Deepspeed-MoE achieves higher performance than FasterMoE, as FasterMoE transports parameters to unnecessary devices, resulting in additional runtime overhead. In contrast, Pro-Prophet can accurately find a communication-efficient expert placement, thereby avoiding this problem.

B. Fine-grained Analysis of Pro-Prophet

We conducted a fine-grained analysis of Pro-Prophet according to the speedups in a single layer and a single iteration. Experimental results demonstrate that Pro-Prophet enhances training performance in each layer and each iteration during training.

Fig. 11. Speedups across different layers over Deepspeed-MoE in the MoE-GPT-M model with different values of k: (a) k=1; (b) k=2. Pro-Prophet achieves 1.09-1.49x single-layer speedups compared to FasterMoE.

Single-layer speedup. We first evaluate the single-layer performance of Pro-Prophet. Fig. 11 illustrates the execution time across different layers of the three methods on the MoE-GPT-M model. We randomly select the layer indexes and use the PyTorch Profiler to collect the training time. As shown in the figure, Pro-Prophet achieves 1.60-2.25x single-layer speedups compared to Deepspeed-MoE and 1.09-1.49x compared to FasterMoE. The loads of experts vary across layers during training, which leads to fluctuating speedups for Pro-Prophet. However, it consistently outperforms the two baselines in different layers, demonstrating its superior capability of load balancing under diverse load-imbalance conditions.

Fig. 12. Per-iteration execution time in the MoE-GPT-M model when k=1. Pro-Prophet achieved a 1.34x speedup on average compared to FasterMoE.

Single-iteration speedups. We also evaluate the single-iteration performance of Pro-Prophet. We conduct experiments on the MoE-GPT-M model with k=1. The results are presented in Fig. 12. Compared to FasterMoE, Pro-Prophet achieves a 1.34x speedup on average. The iteration time of Pro-Prophet is consistently lower. This can mainly be attributed to the fact that Pro-Prophet is capable of adapting to dynamic situations.

A2A
2.0
EC K=1
Estimated time (ms) 20 Trans
1.8 K=2
Agg 1.6

Speedup
15
1.4
10 1.2
1.0
5 Baseline +Planner +Scheduler Full
Components
0
0 5 10 15 20 Fig. 14. The effectiveness of components. The baseline is Pro-Prophet without
Real time (ms) any optimizations. Full is the condition of turning on the effective combination
of the planner and scheduler. The planner, scheduler and Full achieve a
speedup of 1.19x, 1.075x and 1.025x on average respectively.
Fig. 13. The accuracy of the performance model. The mean estimation error
is less than 5%.

Time per iteration(ms)

Time per iteration(ms)


Planner Planner
1000 top2 1000 top2
C. Ablation study top3 top3
Necessity of the dynamic adaptation. Dynamic adaption is 500 500
necessary for MoE models. It’s reasonable to transfer heavy-
load experts to other GPUs, but it’s unclear if Pro-Prophet’s 0 I10 I20 I30 I40 I50 0 I10 I20 I30 I40 I50
dynamic search algorithm is necessary. Iterations Iterations
To certify the necessity of our algorithm, we compare the
(a) k=1 (b) k=2
planner with two simple dynamic policies. Specifically, two
policies transfer 2 and 3 experts with the heaviest load to all Fig. 15. The iteration latency of different policies in the MoE-GPT-M model.
GPUs. We named them top2 and top3 respectively. We use
PyTorch’s topk function to implement these strategies. The
overhead of determining the heaviest experts is negligible. determine a communication-efficient expert placement for
well load-balancing under different conditions. Besides, the
Figure 15 illustrates the latency of three policies in a single
scheduler gains 1.14X and 1.01x speedups when k=1 and
iteration with different values of k. As shown in Fig. 15a,
k=2 respectively. These verify that the scheduler can hide
the planner gains 1.77-1.82x speedups compared to the top2
the overhead of load-balancing, further improving the training
policy and 2.04-2.10x speedups to the top3 policy when k=1.
performance. The speedups achieved by the scheduler are
The results shown in Fig. 15b demonstrate that the planner
significantly influenced by the expert placement produced by
gains speedups ranging from 1.38-1.40x compared to different
the planner. Finally, we test the effectiveness of the effective
policies when k=2.
combination (Full in the figure). As the performance model
The experimental results indicate that fixing the number of
estimates the overlapped execution time, the planner will
experts and passing them to all GPUs does not yield good
further balance the load based on the scheduler’s capability
results. The input distribution changes as training progresses,
of communication and computation overlapping. The results
resulting in different optimal expert placements. Compared to
demonstrate that it achieves 1.03x and 1.02x speedups under
these two dynamic strategies, our algorithm introduces more
different values of k.
overhead, but it’s necessary for faster training speeds.
Balance capability. Balance capability serves as a key
Accuracy of performance model. Fig. 13 illustrates the
accuracy of the performance model. We compare the estimated
time to the real time on A2A, expert computation (EC),
Trans and Agg operations. The results show that our mean
Effectiveness of components. To verify the effectiveness of the components, we conduct incremental experiments. We first turn off all optimizations of Pro-Prophet and use this configuration as the baseline. On top of it, we activate the planner and the scheduler sequentially and record the speedup each attains relative to the baseline. Finally, we verify the effective combination of the planner and scheduler mentioned in Sec. V.

Fig. 14 shows the speedups on the MoE-GPT-M model under different values of k. Compared to the baseline, the planner gains 1.26x and 1.12x speedups when k=1 and k=2, respectively. These results show that the planner can efficiently and accurately determine a communication-efficient expert placement for good load balancing under different conditions. Besides, the scheduler gains 1.14x and 1.01x speedups when k=1 and k=2, respectively, which verifies that the scheduler can hide the overhead of load balancing and further improve training performance. The speedups achieved by the scheduler are significantly influenced by the expert placement produced by the planner. Finally, we test the effectiveness of the effective combination (Full in the figure). Because the performance model estimates the overlapped execution time, the planner further balances the load according to the scheduler's capability of overlapping communication and computation. The results show that the combination achieves 1.03x and 1.02x speedups under different values of k.
Balance capability. Balance capability serves as a key metric for load-balancing methods. We define the balance degree as the standard deviation of the input distribution tensor, and we denote the ratio of the balance degree before and after employing a load-balancing solution as RB to describe the solution's effect on the load.
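A minimal sketch of the two metrics, assuming the input distribution is given as a per-expert token-count tensor (illustrative code, not the exact measurement script):

import torch

def balance_degree(expert_load: torch.Tensor) -> float:
    """Balance degree: standard deviation of the input distribution tensor."""
    return expert_load.float().std().item()

def ratio_rb(load_before: torch.Tensor, load_after: torch.Tensor) -> float:
    """RB: balance degree before load balancing divided by the balance
    degree after it; a larger RB means the solution flattens the load more."""
    return balance_degree(load_before) / balance_degree(load_after)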
Fig. 16 shows the ratio of the RB of the Pro-Prophet planner to that of FasterMoE on different layers and under different values of k. It is worth mentioning that the training time of the planner is lower than that of FasterMoE. In most cases, the planner achieves a higher RB than FasterMoE; a ratio of up to 11.01x indicates its ability to enhance training efficiency by fully exploiting the potential of load balancing. Under a few conditions (k=1 with layer 2, and k=2 with layers 2 and 5), the ratio is below 1, suggesting that the planner tailors expert placement to the actual load and prevents the unnecessary allocation of experts.

Fig. 16. The ratio of RB of the planner to that of FasterMoE for layers 2, 5, and 8 of different models (S, M, L, DS, DM): (a) k=1, (b) k=2. The balance degree is the standard deviation of the input distribution tensor, and the RB of a solution is the ratio of the balance degree before and after employing it. The planner achieves up to an 11.01x ratio of RB.

In summary, the planner dynamically determines the load-balancing strategy that maximizes training efficiency, showing its superior balance capability.
VII. RELATED WORK

Hybrid parallelism. Hybrid parallelism strategies [36]–[41] have been widely used to train large-scale dense models. These strategies include, but are not limited to, DP [42]–[45], TP [30], [46], PP [47], [48], and sequence parallelism (SP) [49]–[52]. Unfortunately, it is hard to train large MoE models efficiently using these hybrid parallelism strategies alone.

To overcome this challenge, a series of works combine EP with the above parallelism strategies and effectively improve training efficiency. Switch Transformers combines EP with TP straightforwardly and designs a scheme to place the model and data on TPUs. Bagualu [53] develops a hybrid parallel strategy that integrates EP and DP tailored for high-performance computing architectures, along with communication and storage optimizations designed to enhance training efficiency. Deepspeed-MoE designs an effective combination of DP, EP, and TP for inference that is easy to extend to training; in the MoE layer, it introduces Allgather and Allreduce primitives to aggregate data and intermediate results. Tutel also proposes DP, TP, and EP hybrid parallelism and designs an adaptive parallelism switching method with O(1) runtime switching overhead. Building on Deepspeed-MoE and Tutel, Parm [54] combines DP, EP, and expert-slice parallelism (ESP) and proposes a fine-grained communication schedule to improve the utilization of communication links. DeepSpeed-TED [55] designs a 3-dimensional hybrid parallelism strategy that combines the DP of ZeRO-3, the TP of Megatron-LM, and the EP of Deepspeed-MoE, and it further proposes memory and communication optimizations for better scalability. The methods of Pro-Prophet are compatible with these hybrid parallelism strategies and can help further improve training efficiency.

Communication schedule. Overlapping communication and computation can enhance hardware utilization and improve system throughput [56], [57]. Previous communication scheduling methods [37], [58] for dense models have demonstrated promising results; here we focus on works designed for MoE models.

Mainstream communication scheduling works focus on pipelining A2A and expert computation. Specifically, they partition an A2A operation and an expert computation operation into sub-operators and overlap the communication sub-operators with the computation ones. Methods implemented on Gshard-like frameworks, such as Lina [34], Tutel, ScheMoE [59], and PipeMoE [60], partition the operators based on the shape of the expert computation matrix, while FasterMoE, implemented on FastMoE, partitions operators into irregular sub-operators to schedule. Pro-Prophet is compatible with these works, as it allows communication and computation to be overlapped at the level of MoE blocks.
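To illustrate the general pipelining idea behind these methods (a generic sketch, not the implementation of any particular system), the token batch can be split into equal chunks whose dispatch-side A2A operations are issued asynchronously, so that the expert computation of one chunk overlaps with the A2A of the following chunks:

import torch
import torch.distributed as dist

def pipelined_dispatch_and_compute(tokens: torch.Tensor, expert, n_chunks: int = 4):
    """Sketch of A2A/expert-computation pipelining with equal-sized chunks.

    tokens: local tokens to dispatch, shape [num_tokens, hidden]; for
    simplicity, chunk sizes are assumed identical on every rank.
    """
    pending = []
    for chunk in tokens.chunk(n_chunks, dim=0):
        recv = torch.empty_like(chunk)
        # asynchronous dispatch-side all-to-all for this chunk
        work = dist.all_to_all_single(recv, chunk.contiguous(), async_op=True)
        pending.append((recv, work))

    outputs = []
    for recv, work in pending:
        work.wait()                   # wait only for this chunk's A2A
        outputs.append(expert(recv))  # overlaps with the A2A of later chunks
    return torch.cat(outputs, dim=0)

Real systems additionally pipeline the combine-side A2A after the expert computation and choose the partitioning according to the shape of the expert computation, as the works above do.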
VIII. CONCLUSION

In this paper, we propose Pro-Prophet, a systematic load-balancing approach for efficient training of MoE models. We observe a locality among input distributions and use it to design the planner and the scheduler. The Pro-Prophet planner identifies lightweight expert placements and designs a locality-based greedy algorithm that uses its proposed performance model to efficiently search for a communication-efficient expert placement, effectively reducing the communication overhead. The Pro-Prophet scheduler predicts the input distribution based on the locality observed during MoE training and applies block-wise scheduling to overlap communications and computations, further decreasing the communication cost. Our experiments show that Pro-Prophet achieves 1.18-2.66x and 1.01-1.50x speedups compared to Deepspeed-MoE and FasterMoE, respectively. Besides, Pro-Prophet achieves a load-balancing enhancement of up to 11.01x compared to FasterMoE.

REFERENCES

[1] D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, M. Krikun, N. Shazeer, and Z. Chen, "Gshard: Scaling giant models with conditional computation and automatic sharding," in International Conference on Learning Representations, 2021.
[2] N. Du, Y. Huang, A. M. Dai, S. Tong, D. Lepikhin, Y. Xu, M. Krikun, Y. Zhou, A. W. Yu, O. Firat et al., "Glam: Efficient scaling of language models with mixture-of-experts," in International Conference on Machine Learning. PMLR, 2022, pp. 5547–5569.
[3] A. Liu, B. Feng, B. Wang, B. Wang, B. Liu, C. Zhao, C. Dengr, C. Ruan, D. Dai, D. Guo et al., "Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model," arXiv preprint arXiv:2405.04434, 2024.
[4] S. Rajbhandari, C. Li, Z. Yao, M. Zhang, R. Y. Aminabadi, A. A. Awan, J. Rasley, and Y. He, "Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation ai scale," in International Conference on Machine Learning. PMLR, 2022, pp. 18 332–18 346.
[5] C. Hwang, W. Cui, Y. Xiong, Z. Yang, Z. Liu, H. Hu, Z. Wang, R. Salas, J. Jose, P. Ram et al., "Tutel: Adaptive mixture-of-experts at scale," Proceedings of Machine Learning and Systems, vol. 5, pp. 269–287, 2023.
[6] W. Fedus, B. Zoph, and N. Shazeer, "Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity," The Journal of Machine Learning Research, vol. 23, no. 1, pp. 5232–5270, 2022.
[7] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer," in International Conference on Learning Representations, 2017.
[8] J. He, J. Zhai, T. Antunes, H. Wang, F. Luo, S. Shi, and Q. Li, "Fastermoe: modeling and optimizing training of large-scale dynamic pre-trained models," in Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2022, pp. 120–134.
[9] X. Nie, X. Miao, Z. Wang, Z. Yang, J. Xue, L. Ma, G. Cao, and B. Cui, "Flexmoe: Scaling large-scale sparse pre-trained model training via dynamic device placement," Proceedings of the ACM on Management of Data, vol. 1, no. 1, pp. 1–19, 2023.
[10] S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, "Zero: Memory optimizations toward training trillion parameter models," in SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020, pp. 1–16.
[11] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, "Scaling laws for neural language models," arXiv preprint arXiv:2001.08361, 2020.
[12] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Bert: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[13] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, "Exploring the limits of transfer learning with a unified text-to-text transformer," The Journal of Machine Learning Research, vol. 21, no. 1, pp. 5485–5551, 2020.
[14] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, "Xlnet: Generalized autoregressive pretraining for language understanding," Advances in neural information processing systems, vol. 32, 2019.
[15] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "Roberta: A robustly optimized bert pretraining approach," arXiv preprint arXiv:1907.11692, 2019.
[16] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., "Language models are unsupervised multitask learners," OpenAI blog, vol. 1, no. 8, p. 9, 2019.
[17] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., "Language models are few-shot learners," Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
[18] S. Smith, M. Patwary, B. Norick, P. LeGresley, S. Rajbhandari, J. Casper, Z. Liu, S. Prabhumoye, G. Zerveas, V. Korthikanti et al., "Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model," arXiv preprint arXiv:2201.11990, 2022.
[19] X. O. He, "Mixture of a million experts," arXiv preprint arXiv:2407.04153, 2024.
[20] S. Zuo, X. Liu, J. Jiao, Y. J. Kim, H. Hassan, R. Zhang, T. Zhao, and J. Gao, "Taming sparsely activated transformer with stochastic experts," arXiv preprint arXiv:2110.04260, 2021.
[21] J. Ludziejewski, J. Krajewski, K. Adamczewski, M. Pióro, M. Krutul, S. Antoniak, K. Ciebiera, K. Król, T. Odrzygóźdź, P. Sankowski et al., "Scaling laws for fine-grained mixture of experts," in Forty-first International Conference on Machine Learning.
[22] A. Komatsuzaki, J. Puigcerver, J. Lee-Thorp, C. R. Ruiz, B. Mustafa, J. Ainslie, Y. Tay, M. Dehghani, and N. Houlsby, "Sparse upcycling: Training mixture-of-experts from dense checkpoints," arXiv preprint arXiv:2212.05055, 2022.
[23] F. Xue, Z. Shi, F. Wei, Y. Lou, Y. Liu, and Y. You, "Go wider instead of deeper," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 8, 2022, pp. 8779–8787.
[24] F. Xue, X. He, X. Ren, Y. Lou, and Y. You, "One student knows all experts know: From sparse to dense," arXiv preprint arXiv:2201.10890, 2022.
[25] B. Zoph, I. Bello, S. Kumar, N. Du, Y. Huang, J. Dean, N. Shazeer, and W. Fedus, "St-moe: Designing stable and transferable sparse expert models," arXiv preprint arXiv:2202.08906, 2022.
[26] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., "Gpt-4 technical report," arXiv preprint arXiv:2303.08774, 2023.
[27] R. Sepulchre, D. A. Paley, and N. E. Leonard, "Stabilization of planar collective motion: All-to-all communication," IEEE Transactions on automatic control, vol. 52, no. 5, pp. 811–824, 2007.
[28] S. Kumar, Y. Sabharwal, R. Garg, and P. Heidelberger, "Optimization of all-to-all communication on the blue gene/l supercomputer," in 2008 37th International Conference on Parallel Processing. IEEE, 2008, pp. 320–329.
[29] P. Sanders and J. L. Träff, "The hierarchical factor algorithm for all-to-all communication," in Euro-Par 2002 Parallel Processing: 8th International Euro-Par Conference Paderborn, Germany, August 27–30, 2002 Proceedings 8. Springer, 2002, pp. 799–803.
[30] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, "Megatron-lm: Training multi-billion parameter language models using model parallelism," arXiv preprint arXiv:1909.08053, 2019.
[31] X. Nie, P. Zhao, X. Miao, T. Zhao, and B. Cui, "Hetumoe: An efficient trillion-scale mixture-of-expert distributed training system," arXiv preprint arXiv:2203.14685, 2022.
[32] J. He, J. Qiu, A. Zeng, Z. Yang, J. Zhai, and J. Tang, "Fastmoe: A fast mixture-of-expert training system," arXiv preprint arXiv:2103.13262, 2021.
[33] D. Yu, L. Shen, H. Hao, W. Gong, H. Wu, J. Bian, L. Dai, and H. Xiong, "Moesys: A distributed and efficient mixture-of-experts training and inference system for internet services," IEEE Transactions on Services Computing, 2024.
[34] J. Li, Y. Jiang, Y. Zhu, C. Wang, and H. Xu, "Accelerating distributed moe training and inference with lina," in 2023 USENIX Annual Technical Conference (USENIX ATC 23), 2023, pp. 945–959.
[35] J. Liu, J. H. Wang, and Y. Jiang, "Janus: A unified distributed training framework for sparse mixture-of-experts models," in Proceedings of the ACM SIGCOMM 2023 Conference, ACM SIGCOMM 2023, New York, NY, USA, 10-14 September 2023. ACM, 2023, pp. 486–498.
[36] D. Li, H. Wang, E. Xing, and H. Zhang, "Amp: Automatically finding model parallel strategies with heterogeneity awareness," Advances in Neural Information Processing Systems, vol. 35, pp. 6630–6639, 2022.
[37] Z. Lai, S. Li, X. Tang, K. Ge, W. Liu, Y. Duan, L. Qiao, and D. Li, "Merak: An efficient distributed dnn training framework with automated 3d parallelism for giant foundation models," IEEE Transactions on Parallel and Distributed Systems, vol. 34, no. 5, pp. 1466–1478, 2023.
[38] X. Ye, Z. Lai, S. Li, L. Cai, D. Sun, L. Qiao, and D. Li, "Hippie: A data-paralleled pipeline approach to improve memory-efficiency and scalability for large dnn training," in 50th International Conference on Parallel Processing, 2021, pp. 1–10.
[39] S. Li, H. Liu, Z. Bian, J. Fang, H. Huang, Y. Liu, B. Wang, and Y. You, "Colossal-ai: A unified deep learning system for large-scale parallel training," in Proceedings of the 52nd International Conference on Parallel Processing, 2023, pp. 766–775.
[40] J. M. Tarnawski, D. Narayanan, and A. Phanishayee, "Piper: Multidimensional planner for dnn parallelization," Advances in Neural Information Processing Systems, vol. 34, pp. 24 829–24 840, 2021.
[41] K. Lu, Z. Lai, S. Li, W. Liu, K. Ge, X. Lu, and D. Li, "Parallel intelligent computing: development and challenges," SCIENTIA SINICA Informationis, vol. 53, no. 8, pp. 1441–1468, 2023.
[42] S. Rajbhandari, O. Ruwase, J. Rasley, S. Smith, and Y. He, "Zero-infinity: Breaking the gpu memory wall for extreme scale deep learning," in Proceedings of the international conference for high performance computing, networking, storage and analysis, 2021, pp. 1–14.
[43] J. Ren, S. Rajbhandari, R. Y. Aminabadi, O. Ruwase, S. Yang, M. Zhang, D. Li, and Y. He, "Zero-offload: Democratizing billion-scale model training," in 2021 USENIX Annual Technical Conference (USENIX ATC 21), 2021, pp. 551–564.
[44] Y. Zhao, A. Gu, R. Varma, L. Luo, C.-C. Huang, M. Xu, L. Wright, H. Shojanazeri, M. Ott, S. Shleifer et al., "Pytorch fsdp: experiences on scaling fully sharded data parallel," arXiv preprint arXiv:2304.11277, 2023.
[45] Z. Zhang, S. Zheng, Y. Wang, J. Chiu, G. Karypis, T. Chilimbi, M. Li, and X. Jin, "Mics: near-linear scaling for training gigantic model on public cloud," arXiv preprint arXiv:2205.00119, 2022.
[46] Z. Bian, Q. Xu, B. Wang, and Y. You, "Maximizing parallelism in distributed training for huge neural networks," arXiv preprint arXiv:2105.14450, 2021.
[47] W. Liu, Z. Lai, S. Li, Y. Duan, K. Ge, and D. Li, "Autopipe: A fast pipeline parallelism approach with balanced partitioning and micro-batch slicing," in 2022 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 2022, pp. 301–312.
[48] Y. Duan, Z. Lai, S. Li, W. Liu, K. Ge, P. Liang, and D. Li, "Hph: Hybrid parallelism on heterogeneous clusters for accelerating large-scale dnns training," in 2022 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 2022, pp. 313–323.
[49] S. Li, F. Xue, C. Baranwal, Y. Li, and Y. You, "Sequence parallelism: Long sequence training from system perspective," arXiv preprint arXiv:2105.13120, 2021.
[50] V. A. Korthikanti, J. Casper, S. Lym, L. McAfee, M. Andersch, M. Shoeybi, and B. Catanzaro, "Reducing activation recomputation in large transformer models," Proceedings of Machine Learning and Systems, vol. 5, pp. 341–353, 2023.
[51] S. A. Jacobs, M. Tanaka, C. Zhang, M. Zhang, S. L. Song, S. Rajbhandari, and Y. He, "Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models," arXiv preprint arXiv:2309.14509, 2023.
[52] H. Liu, M. Zaharia, and P. Abbeel, "Ring attention with blockwise transformers for near-infinite context," arXiv preprint arXiv:2310.01889, 2023.
[53] Z. Ma, J. He, J. Qiu, H. Cao, Y. Wang, Z. Sun, L. Zheng, H. Wang,
S. Tang, T. Zheng et al., “Bagualu: targeting brain scale pretrained
models with over 37 million cores,” in Proceedings of the 27th ACM
SIGPLAN Symposium on Principles and Practice of Parallel Program-
ming, 2022, pp. 192–204.
[54] X. Pan, W. Lin, S. Shi, X. Chu, W. Sun, and B. Li, “Parm: Efficient
training of large sparsely-activated models with dedicated schedules,” in
IEEE INFOCOM 2024-IEEE Conference on Computer Communications.
IEEE, 2024, pp. 1880–1889.
[55] S. Singh, O. Ruwase, A. A. Awan, S. Rajbhandari, Y. He, and A. Bhatele,
“A hybrid tensor-expert-data parallelism approach to optimize mixture-
of-experts training,” in Proceedings of the 37th International Conference
on Supercomputing, 2023, pp. 203–214.
[56] S. Shi, X. Chu, and B. Li, “Mg-wfbp: Efficient data communication
for distributed synchronous sgd algorithms,” in IEEE INFOCOM 2019-
IEEE Conference on Computer Communications. IEEE, 2019, pp. 172–
180.
[57] C. He, S. Li, M. Soltanolkotabi, and S. Avestimehr, “Pipetransformer:
Automated elastic pipelining for distributed training of transformers,”
arXiv preprint arXiv:2102.03161, 2021.
[58] S. Li, K. Lu, Z. Lai, W. Liu, K. Ge, and D. Li, “A multidimensional
communication scheduling method for hybrid parallel dnn training,”
IEEE Transactions on Parallel and Distributed Systems, 2024.
[59] S. Shi, X. Pan, Q. Wang, C. Liu, X. Ren, Z. Hu, Y. Yang, B. Li,
and X. Chu, “Schemoe: An extensible mixture-of-experts distributed
training system with tasks scheduling,” in Proceedings of the Nineteenth
European Conference on Computer Systems, 2024, pp. 236–249.
[60] S. Shi, X. Pan, X. Chu, and B. Li, “Pipemoe: Accelerating mixture-
of-experts through adaptive pipelining,” in IEEE INFOCOM 2023-IEEE
Conference on Computer Communications. IEEE, 2023, pp. 1–10.
