Abstract—The size of deep learning models has been increasing to enhance model quality. Because the training computation budget grows linearly with model size, training an extremely large-scale model is exceedingly time-consuming. Recently, the Mixture of Experts (MoE) has drawn significant attention because it can scale models to extra-large sizes with a stable computation budget. However, inefficient distributed training of large-scale MoE models hinders their broader application. Specifically, a considerable dynamic load imbalance occurs among devices during training, significantly reducing throughput. Several load-balancing works have been proposed to address the challenge. System-level solutions draw more attention than algorithm-level ones for their hardware affinity and non-disruption of model convergence. However, they are troubled by high communication costs and poor communication-computation overlapping. To address these challenges, we propose a systematic load-balancing method, Pro-Prophet, which consists of a planner and a scheduler for efficient parallel training of large-scale MoE models. To adapt to the dynamic load imbalance, we profile training statistics and use them to design Pro-Prophet. For lower communication volume, the Pro-Prophet planner determines a series of lightweight load-balancing strategies and efficiently searches for a communication-efficient one based on the statistics. For sufficient overlapping of communication and computation, the Pro-Prophet scheduler schedules the data-dependent operations based on the statistics and operation features, further improving the training throughput. We conduct extensive experiments on four clusters and five MoE models. The results indicate that Pro-Prophet achieves up to 2.66x speedup compared to two popular MoE frameworks, Deepspeed-MoE and FasterMoE. Furthermore, Pro-Prophet demonstrates a load-balancing improvement of up to 11.01x compared to a representative load-balancing work, FasterMoE.

Index Terms—Deep learning, mixture of experts, distributed training

I. INTRODUCTION

The size of deep learning models keeps growing to enhance model quality, and the training computation budget increases with the model scaling. However, the substantial computational demand of extra-large models makes the training process excessively time-consuming. As one of the most promising solutions, the Mixture of Experts (MoE) enables a nearly constant computational budget as the model scales. Generally, we replace some layers of a foundation model with MoE ones to generate a MoE model. Each MoE layer contains a gate network and a set of sub-modules named experts. The gate network routes each input to the top-k experts that excel in processing the input. As k is a hyper-parameter, the MoE model can be scaled with consistent computational requirements by increasing the number of experts.

As the model further scales, effective collaboration of devices is necessary for extra-large MoE model training. Unfortunately, it is inefficient to train such a model with traditional parallelism such as Data Parallelism (DP), Model Parallelism (MP), and Pipeline Parallelism (PP). To overcome this, Gshard [1] introduced a dedicated parallel strategy named Expert Parallelism (EP). Nowadays, extra-large MoE models trained with EP have demonstrated the highest accuracy in multiple tasks [2], [3].

However, training MoE models using EP presents a dynamic load imbalance among devices. For each MoE layer, EP evenly divides experts across devices before training and dynamically dispatches inputs according to the gate network during training. Most inputs are transferred to and processed by a few devices, resulting in prolonged communication and computation. Furthermore, the imbalance varies throughout the training process, making it difficult to resolve.

Numerous load-balancing attempts have been proposed to improve training throughput. Algorithmic works often restrict the upper bound of each expert's load [4], [5] or add auxiliary losses to the loss function [6], [7] for a more balanced
load. However, algorithm-level solutions may disturb model convergence. System-level solutions, which dynamically adjust the placement of experts during training, draw more attention for their hardware affinity and non-disruption of model convergence. Nevertheless, they suffer from two drawbacks. 1) They transfer a large volume of expert model states across devices, hindering the improvement of training efficiency. 2) Devices experience significant communication and computation idle time during training. Due to data dependencies among operators, these solutions have to perform some communications and computations sequentially. For example, only after the experts have been selected and their model states have been transmitted can their computations of inputs be launched. Similarly, the aggregation of gradients occurs only after the computation of gradients is finished. They neglect the potential of communication-computation overlapping, thus significantly lowering device utilization.

In this paper, we propose a systematic load-balancing solution, Pro-Prophet, which overcomes these two drawbacks with a planner and a scheduler, respectively.

To adapt to the dynamic features presented in the training of a MoE model, we profile the input distribution (i.e., the number of inputs processed by each expert) for each MoE layer. We observe that the distributions of a MoE layer in adjacent iterations present high similarity. This locality is the key to effective load balancing.

To reduce the communication volume, the Pro-Prophet planner introduces a series of lightweight expert placements. In a lightweight expert placement, each expert is independently allocated to a subset of devices, and communication of parameters and gradients for the expert occurs only among these specific devices. To evaluate expert placements, the planner proposes a performance model that estimates the execution time of a MoE layer employing a lightweight expert placement. However, it is non-trivial to find the optimal placement due to the combinatorial explosion of the number of expert placements. To tackle this, the planner designs a locality-based greedy algorithm. The algorithm employs a greedy strategy to search for a communication-efficient expert placement. Besides, its launching frequency is reduced based on the locality, further improving the training throughput.

To exploit the potential of communication-computation overlapping, the Pro-Prophet scheduler comprehensively schedules operations based on the locality and the features of operations. The locality means that we can estimate the input distribution of the upcoming iteration according to the current one. Once the upcoming distribution is obtained, we can promptly determine a communication-efficient expert placement for the upcoming iteration and transmit the parameters of experts in advance, which provides the opportunity to overlap communications and computations across adjacent iterations. Besides, the gradient aggregation can be deferred for better overlapping. Based on these observations, the scheduler identifies a scheduling space and designs a block-wise scheduling strategy to comprehensively overlap communications and computations.

We implement Pro-Prophet on top of PyTorch and conduct extensive experiments on four different clusters of up to 32 devices with five variant models. The results demonstrate that Pro-Prophet achieves speedups of up to 2.66x compared to two popular MoE frameworks. Additionally, Pro-Prophet demonstrates load-balancing enhancements of up to 11.01x compared to a representative load-balancing work, FasterMoE.

Our main contributions are summarized as follows:
• We profile input distributions among adjacent iterations and identify a locality that guides the design of Pro-Prophet.
• We design the Pro-Prophet planner, which identifies several lightweight expert placements, abstracts a performance model, and designs a locality-based greedy algorithm to reduce the heavy communication of model states.
• We propose the Pro-Prophet scheduler, which generates a scheduling space and establishes a block-wise scheduling strategy based on the locality and the features of operations for comprehensive overlapping of computations and communications.
• We conduct comprehensive experiments for Pro-Prophet on different clusters and models. The results demonstrate that Pro-Prophet achieves up to 1.50x end-to-end speedup and 11.01x load-balancing enhancement over the representative load-balancing method.

II. BACKGROUND AND MOTIVATION

A. Background

Recent works on DNN model training have shown that model capacity can be improved with increasing training data, model scale, and computational budget [11]. Extraordinary performance has been achieved in several deep learning domains, including natural language processing (NLP), computer vision (CV), and so on.

However, significant training overhead comes along with the superior model capacity. Training an extra-large-scale model [12]–[17] often takes months on thousands of dedicated accelerators (e.g., MT-NLG [18] took three months to train on over two thousand A100 GPUs), which hampers the development of deep learning.

In recent years, dynamic sparsely-activated architectures have been proposed to address this problem. One of the popular approaches is the Mixture of Experts (MoE), which can significantly improve the model capacity while maintaining a consistent computational budget. Nowadays, MoE has been successfully applied to large language models [19]–[25]. Excellent MoE models from industry and academia have drawn great attention from researchers. For example, Google has trained a series of MoE models called GLaM [2]. The largest GLaM model is seven times larger than GPT-3, but its training cost is less than 1/3 of GPT-3's. Experiments show that these models achieve higher accuracy than GPT-3 on 29 zero-shot, one-shot, and few-shot learning tasks, representing the superiority of MoE models. Another example is GPT-4 [26]. OpenAI's technical report indicates that GPT-4 is a MoE model, which achieves the highest performance in various downstream tasks. Besides, ChatGPT based on GPT-4 has caused a tremendous sensation.

Fig. 1 illustrates the architecture of a MoE model and a MoE layer. The MoE model comprises a stack of non-MoE and MoE layers. A MoE layer consists of two components: 1) a series of experts (3 experts in the figure), where each excels in a specific domain, and 2) a gate network, which routes each input to a few experts that are skilled in dealing with this input, rather than to all experts. In a MoE layer, for each input, the gate network computes its relationship with the experts and allocates it to its top-k experts for computation.
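To make the routing behavior concrete, the following minimal sketch (our illustration, not the paper's implementation) shows a linear top-k gate in PyTorch. The class name TopKGate and the tensor shapes are our assumptions; the sketch also returns the per-expert input counts, i.e., the "input distribution" that Pro-Prophet profiles.

    import torch
    import torch.nn as nn

    class TopKGate(nn.Module):
        def __init__(self, d_model, num_experts, k=2):
            super().__init__()
            self.k = k
            self.proj = nn.Linear(d_model, num_experts)

        def forward(self, x):
            # x: [num_tokens, d_model]
            scores = self.proj(x).softmax(dim=-1)                 # routing scores per expert
            topk_scores, topk_idx = scores.topk(self.k, dim=-1)   # top-k experts per token
            # Input distribution: how many tokens each expert receives this iteration.
            counts = torch.bincount(topk_idx.flatten(), minlength=scores.shape[-1])
            return topk_idx, topk_scores, counts

    gate = TopKGate(d_model=16, num_experts=8, k=2)
    tokens = torch.randn(32, 16)
    idx, weights, counts = gate(tokens)
    print(counts.tolist())   # typically skewed, which is the source of load imbalance

The skew of the printed counts is exactly the per-layer load imbalance discussed below.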
Fig. 1. The architecture of a MoE model and a MoE layer. For each input, the gate network computes the relationship between the input and three experts and allocates it to the top-1 expert for computation.

Fig. 2. Execution of a MoE layer under expert parallelism: batched inputs are dispatched to experts on different devices by an All-to-All, processed by expert computation, and gathered back by a second All-to-All into batched outputs.

Fig. 3. The imbalanced load of experts in an iteration. The model contains 12 MoE layers and each MoE layer contains 16 experts. The vertical axis indicates layer indexes, and the horizontal axis denotes the index of experts. The depth of color represents the proportion of total inputs that an expert handles. The three heaviest experts are responsible for over 50% of the inputs while the three lightest experts compute less than 5%.
Fig. 4. The locality of input distributions (iterations 2500-2600). The discrepancies between the different colored curves represent the number of inputs received by each of the different experts. It shows that distributions of adjacent iterations remain relatively constant.

Fig. 5. Overview of Pro-Prophet. The planner consists of a performance model and a locality-based greedy algorithm; an execution engine turns the selected expert placement into a load-balanced workflow; the scheduler applies scheduling space establishment and a block-wise scheduling strategy to produce the load-balanced workflow after scheduling.
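The locality in Fig. 4 can be quantified by comparing the per-expert input counts of adjacent iterations. The snippet below is our illustration (the counts are hypothetical, and cosine similarity is one possible choice of measure, not necessarily the paper's).

    import torch

    def distribution_similarity(prev_counts, cur_counts):
        # Cosine similarity between two input-distribution vectors; values near 1.0
        # indicate that adjacent iterations route inputs almost identically.
        return torch.nn.functional.cosine_similarity(
            prev_counts.float(), cur_counts.float(), dim=0).item()

    prev = torch.tensor([120, 30, 15, 35])   # hypothetical counts at iteration t
    cur = torch.tensor([115, 33, 14, 38])    # hypothetical counts at iteration t+1
    print(f"similarity = {distribution_similarity(prev, cur):.3f}")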
TABLE I. The proportion of iteration time spent on the load-balancing (L.B.) processes Search, Place, and Reduce, and on all other operations (Others), for previous solutions.

Model        L.B.    Search  Place   Reduce  Others
MoE-GPT-S    29.9%   6.8%    11.6%   11.5%   70.1%
MoE-GPT-M    29.2%   3.2%    12.5%   12.5%   70.8%
MoE-GPT-L    34.5%   2.6%    14.2%   17.7%   64.5%
MoE-GPT-DS   33.8%   6.1%    13.8%   13.9%   66.2%
MoE-GPT-DM   37.1%   6.1%    16.1%   14.9%   62.9%

As shown in Table I, previous solutions introduce Search, Place, and Reduce processes to balance the load. However, the overhead of load balancing reaches up to 37.1% of the iteration time. There are two reasons behind this huge cost. Firstly, they perform heavy communication of model states: they have to transfer the parameters and gradients of heavily loaded experts among all devices, or transmit the whole model states of experts. Secondly, they cannot sufficiently overlap communication and computation during the training of a MoE model. Due to data dependencies, they have to perform communication and computation sequentially.

Locality. Fortunately, we have discovered a property of MoE model training that makes it possible to address these challenges efficiently. Fig. 4 depicts the input distributions in the second MoE layer of a MoE model. The areas with different colors represent the inputs received by different experts. It is worth noting that the distributions in other MoE layers follow a similar pattern. As shown in the figure, only slight fluctuations of the distribution occur across adjacent iterations, which indicates that the load of each expert remains relatively stable in adjacent iterations. This phenomenon suggests that the distribution exhibits a locality among iterations.

III. OVERVIEW OF PRO-PROPHET

Motivated by Section II, we propose a systematic load-balancing approach, Pro-Prophet, which can efficiently balance the load of devices. The overview of Pro-Prophet is presented in Fig. 5. Pro-Prophet is composed of a planner and a scheduler. The MoE model, the locality, and the device pool are its three inputs. The device pool defines the topology of devices. The utilization of the locality is the key advantage of Pro-Prophet.

Firstly, the Pro-Prophet planner searches for a communication-efficient expert placement from a series of lightweight expert placements using its locality-based greedy algorithm. The algorithm iteratively generates and evaluates a lightweight expert placement utilizing a performance model until the load is balanced.

Then, the execution engine analyzes the procedures of the planner and produces a load-balanced workflow for load balancing.

Finally, after analyzing the workflow, the Pro-Prophet scheduler establishes the scheduling space and schedules the data-dependent operations (i.e., Plan, Trans, and Agg) alongside parallel operations (i.e., Para.Op1 and Para.Op2) for communication and computation overlapping. The meanings of these operations are presented in Sec. IV.

IV. PRO-PROPHET PLANNER

A. Lightweight Expert Placement

The design of the expert placement is crucial for efficient load balancing. To reduce the communication of model states, the planner introduces a series of lightweight expert placements.

In a lightweight expert placement, each expert is mapped to one or more devices independently. Only the parameters and gradients, rather than all model states, are transferred among its devices. We use the Trans and Agg primitives to describe these two communications, respectively. In the forward pass, a Trans primitive transfers the parameters of an expert to the devices in its placement; in the backward pass, an Agg primitive aggregates the gradients of the expert among these devices.
B. Performance model

It is necessary to evaluate lightweight expert placements under various device loads. Therefore, the planner abstracts a performance model to estimate the execution time of a MoE layer employing a lightweight expert placement. Table II presents the notations and descriptions used in the performance model.

After employing a lightweight expert placement, a MoE layer performs four A2A communication operations, one forward expert computation operation FEC, one backward expert computation operation BEC, one Trans operation, and one Agg operation. To accurately evaluate the execution time of the MoE layer, we establish our performance model according to the implementation of the operations and the hardware characteristics.

A2A communication. Tutel [5] presents an efficient A2A implementation used in the training of MoE models, on which the A2A term T_{A2A}(R) of our model is based.

Expert computation. The forward expert computation time is determined by the most heavily loaded device:

T_{FEC}(H) = \max_i \frac{H_i}{t},  (2)

where H_i is the number of inputs computed on device i. It is widely recognized that the time required for backward computation in DNN training is roughly double that of forward computation, which also holds for MoE model training. Therefore, we define the execution time of BEC as

T_{BEC}(H) = 2 \max_i \frac{H_i}{t}.  (3)

Trans and Agg primitives. Finally, we formulate the overhead of the Trans and Agg primitives. Their duration depends on two elements. The first is the number of transferred experts, which determines the number of communication rounds. The second is the number of devices communicating in a primitive, which influences the communication scale. Therefore, T_{Trans}(s, n) and T_{Agg}(s, n) are defined as

T_{Trans}(s, n) = \frac{s (D - n) \, size(e_j.params)}{D B},  (4)

T_{Agg}(s, n) = \frac{s (D - n) \, size(e_j.grads)}{D B},  (5)

where size(e_j.params) and size(e_j.grads) are the sizes of the parameters and gradients of the j-th expert.

In summary, the overall execution time of a MoE layer with a lightweight expert placement can be represented as

T'(R, H, s, n) = 4 T_{A2A}(R) + 3 T_{FEC}(H) + T_{Trans}(s, n) + T_{Agg}(s, n).  (6)
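A minimal sketch of Eqs. (2)-(6) is given below, assuming that the A2A time and the per-device throughput t are profiled inputs (their exact definitions live in Table II, which is not reproduced here); the numeric constants in the example are hypothetical.

    def layer_time(H, t_a2a, s, n, D, B, param_bytes, grad_bytes, t):
        """Estimate T'(R, H, s, n) for one MoE layer under a lightweight expert placement.

        H           -- per-device input counts after re-routing (list of ints)
        t_a2a       -- profiled time of one A2A operation, i.e. T_A2A(R)
        s, n        -- number of transferred experts and devices per transfer
        D, B        -- number of devices and link bandwidth (bytes/s)
        param_bytes -- size(e_j.params); grad_bytes -- size(e_j.grads)
        t           -- per-device computation throughput (inputs/s)
        """
        t_fec = max(H) / t                                   # Eq. (2): forward expert computation
        t_bec = 2 * max(H) / t                               # Eq. (3): backward is ~2x forward
        t_trans = s * (D - n) * param_bytes / (D * B)        # Eq. (4)
        t_agg = s * (D - n) * grad_bytes / (D * B)           # Eq. (5)
        return 4 * t_a2a + t_fec + t_bec + t_trans + t_agg   # Eq. (6): 3*T_FEC = T_FEC + T_BEC

    # Example: 4 devices, a skewed load, one expert replicated to 2 devices.
    print(layer_time(H=[900, 50, 30, 20], t_a2a=0.004, s=1, n=2,
                     D=4, B=12e9, param_bytes=64e6, grad_bytes=64e6, t=2.0e5))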
C. Locality-based Greedy Algorithm

The performance model can accurately estimate the execution time of a MoE layer under any expert placement. However, it is still necessary to determine a communication-efficient placement in various load-imbalance scenarios. There are 2^{N·E} potential lightweight expert placements, so a brute-force search is time-consuming and could become a performance bottleneck.

Therefore, the planner offers an efficient greedy search algorithm, shown in Algorithm 1. Taking the results of the gate network gating, s, and n as input, Algorithm 1 iteratively generates and evaluates expert placements until the load is balanced. Finally, it outputs a communication-efficient expert placement PoE.

Algorithm 1: Greedy search algorithm
Input: Inputs-to-experts mapping gating
Input: n
Result: Communication-efficient expert placement PoE
// Preliminary
1:  H, R ← GetH&R(gating);
2:  T_output ← T'(R, H, 0, 0);
3:  L, n_bottoms, Used ← [], [], [];
4:  cnt ← 0;
// Iteratively search
5:  while not balanced do
      // Get the index of the heaviest device
6:    i ← argmax(H);
7:    if i in Used then
8:      break;
9:    end
10:   Used.append(i);
      // Determine the n devices holding the smallest number of inputs for expert i
11:   n_bottom ← BottomK(gating, n);
12:   L.append(i);
13:   n_bottoms.append(n_bottom);
14:   s ← size(L);
      // Redistribute inputs among devices according to the expert placement
15:   H, R ← Replace_Inputs(L, n_bottoms);
      // Evaluate the expert placement
16:   T_changed ← T'(R, H, s, n);
17:   if T_changed < T_output then
18:     T_output ← T_changed;
19:     cnt ← s;
20:   end
21: end
// Return the communication-efficient expert placement
22: PoE ← Get_PoE(L[0:cnt], n_bottoms[0:cnt]);
23: return PoE;

Initially, the algorithm estimates the execution time of a MoE layer without any lightweight expert placement and records it as the minimum time. It then employs two greedy strategies to generate a lightweight expert placement that optimizes the load of devices. Specifically, it prioritizes the expert with the largest number of responsible inputs for selection and transfers its parameters to the devices that hold more inputs processed by this expert. The algorithm maintains the lists L and n_bottoms to record the expert placement. It then evaluates the expert placement using the performance model, and updates the minimum time and a counter if the current expert placement achieves a better performance. The search process is repeated until the load is balanced. The condition of a balanced load is

\max(H) - \min(H) < \alpha \frac{I}{E},  (7)

where I is the number of inputs trained in an iteration and α is an adjustable coefficient for different requirements of load balance.
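The following is a simplified, self-contained sketch of the greedy search in Algorithm 1. It is our approximation: the helpers GetH&R and Replace_Inputs are stood in for by direct manipulation of a per-device load vector, the accepted prefix (cnt) is replaced by an accept/reject of the candidate placement, and the hardware constants reuse the hypothetical values of the layer_time sketch above.

    def greedy_placement(load, n, alpha=0.1, t=2.0e5, t_a2a=0.004,
                         D=None, B=12e9, param_bytes=64e6, grad_bytes=64e6):
        D = D or len(load)
        total = sum(load)
        balanced = lambda H: max(H) - min(H) < alpha * total / len(H)    # Eq. (7)
        H = list(load)
        placement, used = [], set()
        best = 4 * t_a2a + 3 * max(H) / t                                 # no placement applied
        while not balanced(H):
            i = max(range(len(H)), key=lambda d: H[d])                    # heaviest device/expert
            if i in used:
                break
            used.add(i)
            # spread expert i's surplus inputs to the n least-loaded devices
            targets = sorted(range(len(H)), key=lambda d: H[d])[:n]
            surplus = H[i] - total // len(H)
            H[i] -= surplus
            for d in targets:
                H[d] += surplus // n
            candidate = placement + [(i, targets)]
            cost = (4 * t_a2a + 3 * max(H) / t
                    + len(candidate) * (D - n) * (param_bytes + grad_bytes) / (D * B))
            if cost < best:                                               # keep only improvements
                best, placement = cost, candidate
        return placement

    print(greedy_placement([900, 50, 30, 20], n=2))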
Since the search algorithm has to run during MoE model training, we define a primitive Plan to describe this search process. As mentioned in Sec. II, the input distributions of adjacent iterations are similar, which inspires us to predict the distribution and reduce the execution frequency of the algorithm. Based on this insight, the planner upgrades the algorithm to a locality-based one. Users can flexibly adjust the frequency of the search algorithm for better training efficiency.

V. PRO-PROPHET SCHEDULER

Previous works introduce a search process (corresponding to the Plan primitive) and model-state transfers (corresponding to the Trans and Agg primitives) to balance the load. However, their execution is blocked by other operators due to data dependencies, constraining further improvement of training efficiency. In this section, we introduce the designs of the scheduler, which extensively overlaps computation and communication based on the locality described in Sec. II.

A. Scheduling space establishment

As shown in Fig. 1, a MoE model consists of MoE and non-MoE layers stacked on top of each other. We combine each MoE layer with its adjacent non-MoE layers into a MoE block, which serves as the basic unit of scheduling.
Fig. 8. Scheduling space. The subscript denotes the index of the MoE block and the superscript denotes that of the iteration. For example, Trans_{i+1:l}^{j} denotes the set of Trans operations spanning from block i + 1 to block l during iteration j, where l is the number of MoE blocks. The figure contrasts the baseline execution order with the scheduled one over the forward pass (FP) and backward pass (BP).
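The block-wise idea can be illustrated with asynchronous collectives: because the placement for iteration j+1 can be planned from iteration j's distribution, the Trans of the next MoE block can be issued while the current block computes. The sketch below is our illustration on top of torch.distributed (the function name, the placements structure, and the use of async_op=True are assumptions, not the paper's implementation).

    import torch
    import torch.distributed as dist

    def forward_with_prefetch(blocks, x, placements):
        """blocks: list of MoE blocks; placements[i]: (params_to_fetch, src_rank) or None."""
        pending = None
        for i, block in enumerate(blocks):
            # Issue the Trans of block i+1 before computing block i, so the parameter
            # broadcast proceeds in the background.
            if i + 1 < len(blocks) and placements[i + 1] is not None:
                params, src = placements[i + 1]
                pending = [dist.broadcast(p, src=src, async_op=True) for p in params]
            x = block(x)                      # computation of block i overlaps the Trans
            if pending is not None:
                for work in pending:          # ensure block i+1's parameters have arrived
                    work.wait()
                pending = None
        return x

The same pattern applies in reverse for deferring Agg during the backward pass, so that gradient aggregation overlaps the computation of earlier blocks.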
Fig. 10. End-to-end performance of Deepspeed-MoE, FasterMoE, and Pro-Prophet on the MoE-GPT-S/M/L/DS/DM models. (a) 4 HPWNV nodes, k=1: 1.01-1.23x; (b) 8 HPWNV nodes, k=1: 1.08-1.31x; (c) 4 HPWNV nodes, k=2: 1.14-1.48x; (d) 8 HPWNV nodes, k=2: 1.05-1.31x. The numbers denote speedups achieved by Pro-Prophet over the best baseline.
Fig. 11. Speedups across different layers over Deepspeed-MoE on the MoE-GPT-M model with different values of k ((a) k=1, (b) k=2; layers 2, 5, and 8). Pro-Prophet achieves 1.09-1.49x single-layer speedups compared to FasterMoE.

In this environment, communication processes such as A2A are accelerated. Under this condition, Pro-Prophet achieves 1.71-2.63x speedups compared to Deepspeed-MoE and 1.10-1.35x compared to FasterMoE, demonstrating its adaptability to conditions with higher communication bandwidth. We also test Pro-Prophet on a cluster consisting of 2 LPWNV nodes. The results are presented in Table V, and the highest speedups are emphasized as well. Due to the lower computation ability of the 2080Ti compared to the 3090 GPU, the impact of the computation process becomes more significant. In this environment, Pro-Prophet achieves a speedup of 1.18-1.94x compared to Deepspeed-MoE and 1.08-1.50x compared to FasterMoE, showing its robustness in conditions with lower computation power.

The result of the MoE-GPT-DM model with k=1 shows that Deepspeed-MoE achieves higher performance than FasterMoE, as FasterMoE transports parameters to unnecessary devices, resulting in additional runtime overhead.

B. Fine-grained analysis of Pro-Prophet

We conduct a fine-grained analysis of Pro-Prophet according to speedups in a single layer and a single iteration. Experimental results demonstrate that Pro-Prophet enhances training performance in each layer and iteration during training.

Single-layer speedup. We first evaluate the single-layer performance of Pro-Prophet. Fig. 11 illustrates the execution time across different layers for the three methods on the MoE-GPT-M model. We randomly select the layer indexes and use the PyTorch Profiler to collect the training time. As shown in the figure, Pro-Prophet achieves 1.60-2.25x single-layer speedups compared to Deepspeed-MoE and 1.09-1.49x compared to FasterMoE. Varying expert loads across layers occur during training, which leads to fluctuating speedups for Pro-Prophet. However, it consistently outperforms the two baselines in different layers, demonstrating its superior capability of load balancing under diverse load-imbalance conditions.

Single-iteration speedup. We also evaluate the single-iteration performance of Pro-Prophet. We conduct experiments on the MoE-GPT-M model with k=1. The results are presented in Fig. 12. Compared to FasterMoE, Pro-Prophet achieves a 1.34x speedup on average. The iteration time of Pro-Prophet is consistently lower, which can mainly be attributed to the fact that Pro-Prophet is capable of adapting to dynamic situations.
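Per-layer times of this kind can be collected with the PyTorch profiler. The snippet below is a minimal, hypothetical example in that spirit (a Linear layer stands in for a MoE layer; it is not the paper's measurement script).

    import torch
    from torch.profiler import profile, ProfilerActivity

    model = torch.nn.Linear(1024, 1024)      # stand-in for one MoE layer
    x = torch.randn(64, 1024)
    with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
        model(x).sum().backward()
    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))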
Fig. 13. The accuracy of the performance model (estimated time vs. real time in ms for the A2A, EC, Trans, and Agg operations). The mean estimation error is less than 5%.

Fig. 14. The effectiveness of components (Baseline, +Planner, +Scheduler, Full) for k=1 and k=2. The baseline is Pro-Prophet without any optimizations. Full is the condition of turning on the effective combination of the planner and scheduler. The planner, scheduler, and Full achieve speedups of 1.19x, 1.075x, and 1.025x on average, respectively.
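The mean estimation error reported in Fig. 13 can be computed from pairs of estimated and measured operation times as below; the values here are hypothetical and only show the arithmetic.

    estimated = [5.1, 9.8, 14.6, 19.7]   # ms, from the performance model
    measured = [5.0, 10.2, 15.1, 20.3]   # ms, profiled on the cluster
    err = sum(abs(e - m) / m for e, m in zip(estimated, measured)) / len(measured)
    print(f"mean estimation error = {err:.1%}")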
As a metric for load-balancing methods, we define the balance degree as the standard deviation of the input distribution tensor. Besides, we denote the ratio of the balance degree before and after employing a load-balancing solution as RB to describe its effect on the load.

Fig. 16 demonstrates the ratio of the RB of the Pro-Prophet planner to that of FasterMoE on different layers and for different values of k. It is worth mentioning that the training time of the planner is lower than that of FasterMoE. In most cases, the planner achieves a higher RB than FasterMoE. A ratio of RB of up to 11.01x indicates its ability to enhance training efficiency by fully exploiting the potential of load balancing. In the experimental conditions of k=1 with layer 2, and k=2 with layers 2 and 5, the ratios of RB are below 1, suggesting that the planner tailors the expert placement to the actual load, preventing the unnecessary allocation of experts.

In summary, the planner can dynamically determine the load-balancing strategy to maximize training efficiency, showing its superior balance capability.
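The balance degree and RB defined above can be computed directly from the input distribution; the short sketch below is our illustration with hypothetical distributions.

    import torch

    def balance_degree(counts):
        # Standard deviation of the input-distribution tensor.
        return counts.float().std().item()

    before = torch.tensor([900., 50., 30., 20.])    # hypothetical distribution before balancing
    after = torch.tensor([260., 240., 250., 250.])  # hypothetical distribution after balancing
    rb = balance_degree(before) / balance_degree(after)
    print(f"RB = {rb:.2f}")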
VII. RELATED WORK

Hybrid parallelism. Hybrid parallelism strategies [36]–[41] have been widely used to train large-scale dense models. These strategies consist of, but are not limited to, DP [42]–[45], TP [30], [46], PP [47], [48], and sequence parallelism (SP) [49]–[52]. Unfortunately, it is hard to efficiently train large MoE models with these hybrid parallelism strategies alone.

To overcome this challenge, a series of works combine EP with the above parallelism strategies and effectively improve training efficiency. Switch Transformers combines EP with TP straightforwardly and designs a scheme to place the model and data on TPUs. BaGuaLu [53] develops a hybrid parallel strategy that integrates EP and DP tailored for high-performance computing architectures, along with communication and storage optimizations designed to enhance training efficiency. Deepspeed-MoE designs an effective combination of DP, EP, and TP for inference, which is easy to extend to model training; in the MoE layer, it introduces Allgather and Allreduce primitives to aggregate data and intermediate results. Tutel also proposes DP, TP, and EP hybrid parallelism and designs an adaptive parallelism switching method that enables O(1) overhead in runtime switching. Based on Deepspeed-MoE and Tutel, Parm [54] combines DP, EP, and expert-slice parallelism (ESP) and proposes fine-grained communication scheduling to improve the utilization of communication links. DeepSpeed-TED [55] designs a 3-dimensional hybrid parallelism strategy that combines the DP of ZeRO-3, the TP of Megatron-LM, and the EP of Deepspeed-MoE; besides, it proposes memory and communication optimizations for better scalability. The methods of Pro-Prophet are compatible with these hybrid parallelism strategies and can help further improve training efficiency.

Communication schedule. Overlapping communication and computation can enhance hardware utilization and improve system throughput [56], [57]. Previous communication scheduling methods [37], [58] for dense models have demonstrated promising results. In this paragraph, we focus on works designed for MoE models. Mainstream communication scheduling works focus on pipelining A2A and expert computation. Specifically, they partition an A2A and an expert computation operation into sub-operators and overlap communication sub-operators with computation ones. Methods implemented on Gshard-like frameworks, such as Lina [34], Tutel, ScheMoE [59], and PipeMoE [60], partition computation and communication operators based on the shape of the expert computation matrix. FasterMoE is implemented on FastMoE and partitions operators into irregular sub-operators for scheduling. Pro-Prophet is compatible with these works, as it overlaps communications and computations at the level of MoE blocks.

VIII. CONCLUSION

In this paper, we propose Pro-Prophet, a systematic load-balancing approach for efficient training of MoE models. We observe a locality among input distributions and use it to design the planner and scheduler. The Pro-Prophet planner identifies lightweight expert placements and designs a locality-based greedy algorithm to efficiently search for a communication-efficient expert placement using its proposed performance model, effectively reducing the communication overhead. The Pro-Prophet scheduler predicts the input distribution based on the locality during MoE model training and applies block-wise scheduling to overlap communications and computations, further decreasing the communication cost. Our experiments show that Pro-Prophet achieves 1.18-2.66x and 1.01-1.50x speedups compared to Deepspeed-MoE and FasterMoE, respectively. Besides, Pro-Prophet achieves a load-balancing enhancement of up to 11.01x compared to FasterMoE.

REFERENCES

[1] D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, M. Krikun, N. Shazeer, and Z. Chen, "Gshard: Scaling giant models with conditional computation and automatic sharding," in International Conference on Learning Representations, 2021.
[2] N. Du, Y. Huang, A. M. Dai, S. Tong, D. Lepikhin, Y. Xu, M. Krikun, Y. Zhou, A. W. Yu, O. Firat et al., "Glam: Efficient scaling of language models with mixture-of-experts," in International Conference on Machine Learning. PMLR, 2022, pp. 5547–5569.
[3] A. Liu, B. Feng, B. Wang, B. Wang, B. Liu, C. Zhao, C. Dengr, C. Ruan, D. Dai, D. Guo et al., "Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model," arXiv preprint arXiv:2405.04434, 2024.
[4] S. Rajbhandari, C. Li, Z. Yao, M. Zhang, R. Y. Aminabadi, A. A. Awan, J. Rasley, and Y. He, "Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation ai scale," in International Conference on Machine Learning. PMLR, 2022, pp. 18 332–18 346.
[5] C. Hwang, W. Cui, Y. Xiong, Z. Yang, Z. Liu, H. Hu, Z. Wang, R. Salas, J. Jose, P. Ram et al., "Tutel: Adaptive mixture-of-experts at scale," Proceedings of Machine Learning and Systems, vol. 5, pp. 269–287, 2023.
[6] W. Fedus, B. Zoph, and N. Shazeer, "Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity," The Journal of Machine Learning Research, vol. 23, no. 1, pp. 5232–5270, 2022.
[7] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer," in International Conference on Learning Representations, 2017.
[8] J. He, J. Zhai, T. Antunes, H. Wang, F. Luo, S. Shi, and Q. Li, "Fastermoe: modeling and optimizing training of large-scale dynamic pre-trained models," in Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2022, pp. 120–134.
[9] X. Nie, X. Miao, Z. Wang, Z. Yang, J. Xue, L. Ma, G. Cao, and B. Cui, "Flexmoe: Scaling large-scale sparse pre-trained model training via dynamic device placement," Proceedings of the ACM on Management of Data, vol. 1, no. 1, pp. 1–19, 2023.
[10] S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, "Zero: Memory optimizations toward training trillion parameter models," in SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020, pp. 1–16.
[11] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, "Scaling laws for neural language models," arXiv preprint arXiv:2001.08361, 2020.
[12] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Bert: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[13] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, "Exploring the limits of transfer learning with a unified text-to-text transformer," The Journal of Machine Learning Research, vol. 21, no. 1, pp. 5485–5551, 2020.
[14] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, "Xlnet: Generalized autoregressive pretraining for language understanding," Advances in Neural Information Processing Systems, vol. 32, 2019.
[15] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "Roberta: A robustly optimized bert pretraining approach," arXiv preprint arXiv:1907.11692, 2019.
[16] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., "Language models are unsupervised multitask learners," OpenAI blog, vol. 1, no. 8, p. 9, 2019.
[17] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., "Language models are few-shot learners," Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020.
[18] S. Smith, M. Patwary, B. Norick, P. LeGresley, S. Rajbhandari, J. Casper, Z. Liu, S. Prabhumoye, G. Zerveas, V. Korthikanti et al., "Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model," arXiv preprint arXiv:2201.11990, 2022.
[19] X. O. He, "Mixture of a million experts," arXiv preprint arXiv:2407.04153, 2024.
[20] S. Zuo, X. Liu, J. Jiao, Y. J. Kim, H. Hassan, R. Zhang, T. Zhao, and J. Gao, "Taming sparsely activated transformer with stochastic experts," arXiv preprint arXiv:2110.04260, 2021.
[21] J. Ludziejewski, J. Krajewski, K. Adamczewski, M. Pióro, M. Krutul, S. Antoniak, K. Ciebiera, K. Król, T. Odrzygóźdź, P. Sankowski et al., "Scaling laws for fine-grained mixture of experts," in Forty-first International Conference on Machine Learning, 2024.
[22] A. Komatsuzaki, J. Puigcerver, J. Lee-Thorp, C. R. Ruiz, B. Mustafa, J. Ainslie, Y. Tay, M. Dehghani, and N. Houlsby, "Sparse upcycling: Training mixture-of-experts from dense checkpoints," arXiv preprint arXiv:2212.05055, 2022.
[23] F. Xue, Z. Shi, F. Wei, Y. Lou, Y. Liu, and Y. You, "Go wider instead of deeper," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 8, 2022, pp. 8779–8787.
[24] F. Xue, X. He, X. Ren, Y. Lou, and Y. You, "One student knows all experts know: From sparse to dense," arXiv preprint arXiv:2201.10890, 2022.
[25] B. Zoph, I. Bello, S. Kumar, N. Du, Y. Huang, J. Dean, N. Shazeer, and W. Fedus, "St-moe: Designing stable and transferable sparse expert models," arXiv preprint arXiv:2202.08906, 2022.
[26] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., "Gpt-4 technical report," arXiv preprint arXiv:2303.08774, 2023.
[27] R. Sepulchre, D. A. Paley, and N. E. Leonard, "Stabilization of planar collective motion: All-to-all communication," IEEE Transactions on Automatic Control, vol. 52, no. 5, pp. 811–824, 2007.
[28] S. Kumar, Y. Sabharwal, R. Garg, and P. Heidelberger, "Optimization of all-to-all communication on the blue gene/l supercomputer," in 2008 37th International Conference on Parallel Processing. IEEE, 2008, pp. 320–329.
[29] P. Sanders and J. L. Träff, "The hierarchical factor algorithm for all-to-all communication," in Euro-Par 2002 Parallel Processing: 8th International Euro-Par Conference, Paderborn, Germany, August 27–30, 2002, Proceedings 8. Springer, 2002, pp. 799–803.
[30] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, "Megatron-lm: Training multi-billion parameter language models using model parallelism," arXiv preprint arXiv:1909.08053, 2019.
[31] X. Nie, P. Zhao, X. Miao, T. Zhao, and B. Cui, "Hetumoe: An efficient trillion-scale mixture-of-expert distributed training system," arXiv preprint arXiv:2203.14685, 2022.
[32] J. He, J. Qiu, A. Zeng, Z. Yang, J. Zhai, and J. Tang, "Fastmoe: A fast mixture-of-expert training system," arXiv preprint arXiv:2103.13262, 2021.
[33] D. Yu, L. Shen, H. Hao, W. Gong, H. Wu, J. Bian, L. Dai, and H. Xiong, "Moesys: A distributed and efficient mixture-of-experts training and inference system for internet services," IEEE Transactions on Services Computing, 2024.
[34] J. Li, Y. Jiang, Y. Zhu, C. Wang, and H. Xu, "Accelerating distributed moe training and inference with lina," in 2023 USENIX Annual Technical Conference (USENIX ATC 23), 2023, pp. 945–959.
[35] J. Liu, J. H. Wang, and Y. Jiang, "Janus: A unified distributed training framework for sparse mixture-of-experts models," in Proceedings of the ACM SIGCOMM 2023 Conference, New York, NY, USA, 10-14 September 2023. ACM, 2023, pp. 486–498.
[36] D. Li, H. Wang, E. Xing, and H. Zhang, "Amp: Automatically finding model parallel strategies with heterogeneity awareness," Advances in Neural Information Processing Systems, vol. 35, pp. 6630–6639, 2022.
[37] Z. Lai, S. Li, X. Tang, K. Ge, W. Liu, Y. Duan, L. Qiao, and D. Li, "Merak: An efficient distributed dnn training framework with automated 3d parallelism for giant foundation models," IEEE Transactions on Parallel and Distributed Systems, vol. 34, no. 5, pp. 1466–1478, 2023.
[38] X. Ye, Z. Lai, S. Li, L. Cai, D. Sun, L. Qiao, and D. Li, "Hippie: A data-paralleled pipeline approach to improve memory-efficiency and scalability for large dnn training," in 50th International Conference on Parallel Processing, 2021, pp. 1–10.
[39] S. Li, H. Liu, Z. Bian, J. Fang, H. Huang, Y. Liu, B. Wang, and Y. You, "Colossal-ai: A unified deep learning system for large-scale parallel training," in Proceedings of the 52nd International Conference on Parallel Processing, 2023, pp. 766–775.
[40] J. M. Tarnawski, D. Narayanan, and A. Phanishayee, "Piper: Multidimensional planner for dnn parallelization," Advances in Neural Information Processing Systems, vol. 34, pp. 24 829–24 840, 2021.
[41] K. Lu, Z. Lai, S. Li, W. Liu, K. Ge, X. Lu, and D. Li, "Parallel intelligent computing: development and challenges," SCIENTIA SINICA Informationis, vol. 53, no. 8, pp. 1441–1468, 2023.
[42] S. Rajbhandari, O. Ruwase, J. Rasley, S. Smith, and Y. He, "Zero-infinity: Breaking the gpu memory wall for extreme scale deep learning," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021, pp. 1–14.
[43] J. Ren, S. Rajbhandari, R. Y. Aminabadi, O. Ruwase, S. Yang, M. Zhang, D. Li, and Y. He, "Zero-offload: Democratizing billion-scale model training," in 2021 USENIX Annual Technical Conference (USENIX ATC 21), 2021, pp. 551–564.
[44] Y. Zhao, A. Gu, R. Varma, L. Luo, C.-C. Huang, M. Xu, L. Wright, H. Shojanazeri, M. Ott, S. Shleifer et al., "Pytorch fsdp: experiences on scaling fully sharded data parallel," arXiv preprint arXiv:2304.11277, 2023.
[45] Z. Zhang, S. Zheng, Y. Wang, J. Chiu, G. Karypis, T. Chilimbi, M. Li, and X. Jin, "Mics: near-linear scaling for training gigantic model on public cloud," arXiv preprint arXiv:2205.00119, 2022.
[46] Z. Bian, Q. Xu, B. Wang, and Y. You, "Maximizing parallelism in distributed training for huge neural networks," arXiv preprint arXiv:2105.14450, 2021.
[47] W. Liu, Z. Lai, S. Li, Y. Duan, K. Ge, and D. Li, "Autopipe: A fast pipeline parallelism approach with balanced partitioning and micro-batch slicing," in 2022 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 2022, pp. 301–312.
[48] Y. Duan, Z. Lai, S. Li, W. Liu, K. Ge, P. Liang, and D. Li, "Hph: Hybrid parallelism on heterogeneous clusters for accelerating large-scale dnns training," in 2022 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 2022, pp. 313–323.
[49] S. Li, F. Xue, C. Baranwal, Y. Li, and Y. You, "Sequence parallelism: Long sequence training from system perspective," arXiv preprint arXiv:2105.13120, 2021.
[50] V. A. Korthikanti, J. Casper, S. Lym, L. McAfee, M. Andersch, M. Shoeybi, and B. Catanzaro, "Reducing activation recomputation in large transformer models," Proceedings of Machine Learning and Systems, vol. 5, pp. 341–353, 2023.
[51] S. A. Jacobs, M. Tanaka, C. Zhang, M. Zhang, S. L. Song, S. Rajbhandari, and Y. He, "Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models," arXiv preprint arXiv:2309.14509, 2023.