Client-Edge-Cloud Hierarchical Federated Learning
Lumin Liu, Jun Zhang†, S.H. Song, and Khaled B. Letaief‡, Fellow, IEEE
Dept. of ECE, The Hong Kong University of Science and Technology, Hong Kong
† Dept. of EIE, The Hong Kong Polytechnic University, Hong Kong
‡ Peng Cheng Laboratory, Shenzhen, China
Email: [email protected], [email protected], [email protected], [email protected]
inevitable training performance loss.
From the above comparison, we see the necessity of leveraging a cloud server to access the massive training samples, while each edge server enjoys quick model updates with its local clients. This motivates us to propose a client-edge-cloud hierarchical FL system, as shown on the right side of Fig. 1, to get the best of both systems. Compared with cloud-based FL, hierarchical FL significantly reduces the costly communication with the cloud, supplemented by efficient client-edge updates, thereby resulting in a significant reduction in both the runtime and the number of local iterations. On the other hand, as more data can be accessed by the cloud server, hierarchical FL will outperform edge-based FL in model training. These two aspects are clearly observed from Fig. 2, which gives a preview of the results to be presented in this paper. While the advantages can be intuitively explained, the design of a hierarchical FL system is nontrivial. First, by extending the FAVG algorithm to the hierarchical setting, will the new algorithm still converge? Given the two levels of model aggregation (one at the edge, one at the cloud), how often should the models be aggregated at each level? Moreover, by allowing frequent local updates, can a better latency-energy tradeoff be achieved? In this paper, we address these key questions. First, a rigorous proof is provided to show the convergence of the training algorithm. Through the convergence analysis, some qualitative guidelines on picking the aggregation frequencies at the two levels are also given. Experimental results on the MNIST [11] and CIFAR-10 [12] datasets support our findings and demonstrate the advantage of achieving a better communication-computation tradeoff compared to cloud-based systems.

Figure 2: Testing accuracy w.r.t. the runtime on CIFAR-10.

II. FEDERATED LEARNING SYSTEMS

In this section, we first introduce the general learning problem in FL. The cloud-based and edge-based FL systems differ only in the communication and the number of participating clients, and they are identical to each other in terms of architecture. Thus, we treat them as the same traditional two-layer FL system in this section and introduce the widely adopted FAVG [2] algorithm. For the client-edge-cloud hierarchical FL system, we present the proposed three-layer architecture and its optimization algorithm, namely, HierFAVG.

A. Learning Problem

We focus on supervised Federated Learning. Denote D = {x_j, y_j}_{j=1}^{|D|} as the training dataset and |D| as the total number of training samples, where x_j is the j-th input sample and y_j is the corresponding label. w is a real vector that fully parametrizes the ML model. f(x_j, y_j, w), also denoted as f_j(w) for convenience, is the loss function of the j-th data sample, which captures the prediction error of the model on that sample. The training process is to minimize the empirical loss F(w) based on the training dataset [13]:

    F(w) = (1/|D|) Σ_{j=1}^{|D|} f(x_j, y_j, w) = (1/|D|) Σ_{j=1}^{|D|} f_j(w).   (1)

The loss function F(w) depends on the ML model and can be convex, e.g., logistic regression, or non-convex, e.g., neural networks. The complex learning problem is usually solved by gradient descent. Denote k as the index of the update step and η as the gradient descent step size; then the model parameters are updated as:

    w(k) = w(k − 1) − η∇F(w(k − 1)).

In FL, the dataset is distributed over N clients as {D_i}_{i=1}^{N}, with ∪_{i=1}^{N} D_i = D, and these distributed datasets cannot be directly accessed by the parameter server. Thus, F(w) in Eq. (1), also called the global loss, cannot be computed directly, but only as a weighted average of the local loss functions F_i(w) on the local datasets D_i. Specifically, F(w) and F_i(w) are given by:

    F(w) = Σ_{i=1}^{N} |D_i| F_i(w) / |D|,   F_i(w) = Σ_{j∈D_i} f_j(w) / |D_i|.
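As a quick sanity check of this decomposition, the following NumPy sketch (ours, not from the paper; the linear-model squared loss and the random data are placeholders) verifies numerically that the global loss in Eq. (1) equals the |D_i|-weighted average of the local losses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-sample loss f_j(w): squared error of a linear model (illustrative only).
def sample_loss(w, x, y):
    return 0.5 * (x @ w - y) ** 2

def local_loss(w, data):
    # F_i(w) = (1/|D_i|) * sum_{j in D_i} f_j(w)
    X, Y = data
    return np.mean([sample_loss(w, x, y) for x, y in zip(X, Y)])

# Split a toy dataset of |D| = 100 samples across N = 4 clients.
X, Y = rng.normal(size=(100, 5)), rng.normal(size=100)
splits = np.array_split(np.arange(100), 4)
clients = [(X[idx], Y[idx]) for idx in splits]

w = rng.normal(size=5)
global_loss = np.mean([sample_loss(w, x, y) for x, y in zip(X, Y)])  # Eq. (1)
weighted = sum(len(idx) * local_loss(w, c) for idx, c in zip(splits, clients)) / 100

assert np.isclose(global_loss, weighted)  # F(w) = sum_i |D_i| F_i(w) / |D|
```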
B. Traditional Two-Layer FL

In the traditional two-layer FL system, there is one central parameter server and N clients. To reduce the communication overhead, the FAVG algorithm [2] communicates and aggregates only after every κ steps of gradient descent on each client. The process repeats until the model reaches a desired accuracy or the limited resources, e.g., the communication or time budget, run out.

Denote w_i(k) as the parameters of the local model on the i-th client; then w_i(k) in FAVG evolves in the following way:

    w_i(k) = w_i(k−1) − η_k ∇F_i(w_i(k−1))                              if k mod κ ≠ 0,
    w_i(k) = Σ_{i=1}^{N} |D_i| [w_i(k−1) − η_k ∇F_i(w_i(k−1))] / |D|    if k mod κ = 0.
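A minimal NumPy sketch (ours, not the authors' implementation) of the update rule above on a toy quadratic problem: each client takes κ local gradient steps, after which the server replaces every local model with the |D_i|-weighted average.

```python
import numpy as np

def favg(local_grads, sizes, w0, eta=0.1, kappa=5, rounds=20):
    """Minimal FAVG sketch: kappa local gradient steps per client, then the
    server overwrites all local models with the |D_i|-weighted average."""
    n = len(sizes)
    w = [w0.copy() for _ in range(n)]              # w_i(0) = w0 for every client
    weights = np.asarray(sizes, dtype=float) / np.sum(sizes)
    for _ in range(rounds):                        # one round = kappa local updates
        for _ in range(kappa):
            for i in range(n):                     # k mod kappa != 0: local step
                w[i] = w[i] - eta * local_grads[i](w[i])
        avg = sum(p * wi for p, wi in zip(weights, w))
        w = [avg.copy() for _ in range(n)]         # k mod kappa == 0: aggregation
    return avg

# Toy example: client i minimizes F_i(w) = 0.5 * ||w - c_i||^2, so its gradient is w - c_i.
centers = [np.array([1.0, 0.0]), np.array([0.0, 3.0])]
grads = [lambda w, c=c: w - c for c in centers]
print(favg(grads, sizes=[10, 30], w0=np.zeros(2)))  # approaches the weighted mean [0.25, 2.25]
```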
C. Client-Edge-Cloud Hierarchical FL

In FAVG, the model aggregation step can be interpreted as a way to exchange information among the clients. Thus, aggregation at the cloud parameter server can incorporate many clients, but the communication cost is high. On the other hand, aggregation at an edge parameter server only incorporates a small number of clients, with a much cheaper communication cost. To combine their advantages, we consider a hierarchical FL system, which has one cloud server, L edge servers indexed by ℓ with disjoint client sets {C_ℓ}_{ℓ=1}^{L}, and N clients indexed by i with distributed datasets {D_i}_{i=1}^{N}. Denote D_ℓ as the aggregated dataset under edge ℓ. Each edge server aggregates the models from its clients.
With this new architecture, we extend FAVG to the HierFAVG algorithm. The key steps of HierFAVG proceed as follows. After every κ1 local updates on each client, each edge server aggregates its clients' models. Then, after every κ2 edge model aggregations, the cloud server aggregates all the edge servers' models, which means that communication with the cloud happens every κ1κ2 local updates. The comparison between FAVG and HierFAVG is illustrated in Fig. 3. Denote w_i(k) as the local model parameters after the k-th local update, and K as the total number of local updates performed, which is assumed to be an integer multiple of κ1κ2. The details of HierFAVG are presented in Algorithm 1, and the evolution of the local model parameters w_i(k) is as follows:

    w_i(k) = w_i(k−1) − η_k ∇F_i(w_i(k−1))                              if k mod κ1 ≠ 0,
    w_i(k) = Σ_{i∈C_ℓ} |D_i| [w_i(k−1) − η_k ∇F_i(w_i(k−1))] / |D_ℓ|    if k mod κ1 = 0 and k mod κ1κ2 ≠ 0,
    w_i(k) = Σ_{i=1}^{N} |D_i| [w_i(k−1) − η_k ∇F_i(w_i(k−1))] / |D|    if k mod κ1κ2 = 0.

Algorithm 1: Hierarchical Federated Averaging (HierFAVG)

1:  procedure HIERARCHICALFEDERATEDAVERAGING
2:    Initialize all clients with parameter w0
3:    for k = 1, 2, . . . , K do
4:      for each client i = 1, 2, . . . , N in parallel do
5:        w_i(k) ← w_i(k−1) − η∇F_i(w_i(k−1))
6:      end for
7:      if k mod κ1 = 0 then
8:        for each edge ℓ = 1, . . . , L in parallel do
9:          w_ℓ(k) ← EdgeAggregation(ℓ, {w_i(k)}_{i∈C_ℓ})
10:         if k mod κ1κ2 ≠ 0 then
11:           for each client i ∈ C_ℓ in parallel do
12:             w_i(k) ← w_ℓ(k)
13:           end for
14:         end if
15:       end for
16:     end if
17:     if k mod κ1κ2 = 0 then
18:       w(k) ← CloudAggregation({w_ℓ(k)}_{ℓ=1}^{L})
19:       for each client i = 1, . . . , N in parallel do
20:         w_i(k) ← w(k)
21:       end for
22:     end if
23:   end for
24: end procedure

25: function EDGEAGGREGATION(ℓ, {w_i(k)}_{i∈C_ℓ})   // Aggregate locally
26:   w_ℓ(k) ← Σ_{i∈C_ℓ} |D_i| w_i(k) / |D_ℓ|
27:   return w_ℓ(k)
28: end function

29: function CLOUDAGGREGATION({w_ℓ(k)}_{ℓ=1}^{L})   // Aggregate globally
30:   w(k) ← Σ_{ℓ=1}^{L} |D_ℓ| w_ℓ(k) / |D|
31:   return w(k)
32: end function
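The same idea extends to Algorithm 1. The sketch below (our illustration, not the authors' implementation; the client grouping, dataset sizes, and quadratic losses are placeholders) reproduces the two-level schedule: an edge aggregation every κ1 local steps and a cloud aggregation every κ1κ2 steps.

```python
import numpy as np

def hierfavg(grads, sizes, groups, w0, eta=0.05, k1=4, k2=3, K=120):
    """Sketch of HierFAVG (Algorithm 1). `groups` maps an edge index to the list
    of its clients; K is assumed to be a multiple of k1*k2."""
    n = len(sizes)
    sizes = np.asarray(sizes, dtype=float)
    w = [w0.copy() for _ in range(n)]
    for k in range(1, K + 1):
        for i in range(n):                                  # local gradient step
            w[i] = w[i] - eta * grads[i](w[i])
        if k % k1 == 0:
            edge_models = {}
            for ell, clients in groups.items():             # EdgeAggregation
                s = sizes[clients].sum()
                edge_models[ell] = sum(sizes[i] * w[i] for i in clients) / s
                if k % (k1 * k2) != 0:                      # push edge model back
                    for i in clients:
                        w[i] = edge_models[ell].copy()
        if k % (k1 * k2) == 0:                              # CloudAggregation
            w_cloud = sum(sizes[groups[ell]].sum() * m
                          for ell, m in edge_models.items()) / sizes.sum()
            w = [w_cloud.copy() for _ in range(n)]
    return w_cloud

# Toy run: 4 clients under 2 edges; client i minimizes 0.5 * ||w - c_i||^2.
centers = [np.array([1., 0.]), np.array([2., 0.]), np.array([0., 1.]), np.array([0., 2.])]
grads = [lambda w, c=c: w - c for c in centers]
print(hierfavg(grads, sizes=[5, 5, 5, 5], groups={0: [0, 1], 1: [2, 3]}, w0=np.zeros(2)))
```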
Figure 3: Comparison of FAVG and HierFAVG.

III. CONVERGENCE ANALYSIS OF HIERFAVG

In this section, we prove the convergence of HierFAVG for both convex and non-convex loss functions. The analysis also reveals some key properties of the algorithm, as well as the effects of its key parameters.

A. Definitions

Some essential definitions need to be explained before the analysis. The overall K local training iterations are divided into B cloud intervals, each with a length of κ1κ2, or Bκ2 edge intervals, each with a length of κ1. The local (edge) aggregation happens at the end of each edge interval, and the global (cloud) aggregation happens at the end of each cloud interval. We use [p] to represent the edge interval from (p−1)κ1 to pκ1, and {q} to represent the cloud interval from (q−1)κ1κ2 to qκ1κ2, so we have {q} = ∪_p [p], p = (q−1)κ2 + 1, (q−1)κ2 + 2, . . . , qκ2.

• F_ℓ(w): the edge loss function at edge server ℓ, expressed as:

    F_ℓ(w) = (1/|D_ℓ|) Σ_{i∈C_ℓ} |D_i| F_i(w).

• w(k): the weighted average of the w_i(k), expressed as:

    w(k) = (1/|D|) Σ_{i=1}^{N} |D_i| w_i(k).

• u_{q}(k): the virtually centralized gradient descent sequence, defined in cloud interval {q} and synchronized with w(k) immediately after every cloud aggregation as:

    u_{q}((q−1)κ1κ2) = w((q−1)κ1κ2),
    u_{q}(k+1) = u_{q}(k) − η_k ∇F(u_{q}(k)).

The key idea of the proof is to show that the true weights w(k) do not deviate much from the virtually centralized sequence u_{q}(k). Using the same method, the convergence of two-layer FL was analyzed in [7].

Lemma 1 (Convergence of FAVG [7]). For any i, assume f_i(w) is ρ-continuous, β-smooth, and convex. Also, let F_inf = F(w*). If the deviation of the distributed weights has an upper bound denoted as M, then for FAVG with a fixed step size η and an aggregation interval κ, after K = Bκ local updates, we have the following convergence upper bound:

    F(w(K)) − F(w*) ≤ 1 / ( B (ηϕ − ρM/(κε²)) ),

when the following conditions are satisfied: 1) η ≤ 1/β; 2) ηϕ − ρM/(κε²) > 0; 3) F(v_b(bκ)) − F(w*) ≥ ε for b = 1, . . . , K/κ; 4) F(w(K)) − F(w*) ≥ ε; for some ε > 0, where ω = min_b 1/(F(v_b((b−1)κ)) − F(w*)) and ϕ = ω(1 − βη/2).
The unique non-Independent and Identically Distributed (non-IID) data distribution in FL is the key property that distinguishes FL from distributed learning in a datacenter. Since the data are generated separately by each client, the local data distribution may be unbalanced, and the model performance will be heavily influenced by the non-IID data distribution [14]. In this paper, we adopt the same measurement as in [7] to quantify the two-level non-IIDness in our hierarchical system, i.e., at the client level and at the edge level.

Definition 1 (Gradient Divergence). For any weight parameter w, the gradient divergence between the local loss function of the i-th client and the edge loss function of the ℓ-th edge server is defined as an upper bound of ‖∇F_i(w) − ∇F_ℓ(w)‖, denoted as δ_i; the gradient divergence between the edge loss function of the ℓ-th edge server and the global loss function is defined as an upper bound of ‖∇F_ℓ(w) − ∇F(w)‖, denoted as Δ_ℓ. Specifically,

    ‖∇F_i(w) − ∇F_ℓ(w)‖ ≤ δ_i,
    ‖∇F_ℓ(w) − ∇F(w)‖ ≤ Δ_ℓ.

Define δ = Σ_{i=1}^{N} |D_i| δ_i / |D| and Δ = Σ_{ℓ=1}^{L} |D_ℓ| Δ_ℓ / |D| = Σ_{i=1}^{N} |D_i| Δ_{ℓ(i)} / |D|, where ℓ(i) denotes the edge server of client i. We call δ the Client-Edge divergence and Δ the Edge-Cloud divergence.

A larger gradient divergence means that the dataset distribution is more non-IID. δ reflects the non-IIDness at the client level, while Δ reflects the non-IIDness at the edge level.

B. Convergence

In this section, we prove that HierFAVG converges. The basic idea is to study how the real weights w(k) deviate from the virtually centralized sequence u_{q}(k) as the parameters of the HierFAVG algorithm vary. In the following two lemmas, we prove an upper bound on the deviation of the distributed weights for both convex and non-convex loss functions.

Lemma 2 (Convex). For any i, assuming f_i(w) is β-smooth and convex, then for any cloud interval {q} with a fixed step size η_q and any k ∈ {q}, we have

    ‖w(k) − u_{q}(k)‖ ≤ G_c(k, η_q),

where

    G_c(k, η_q) = h(k − (q−1)κ1κ2, Δ, η_q) + h(k − ((q−1)κ2 + p(k) − 1)κ1, δ, η_q)
                  + (κ1/2)(p²(k) + p(k) − 2) h(κ1, δ, η_q),

    h(x, δ, η) = (δ/β) [ (ηβ + 1)^x − 1 − ηβx ],

    p(x) = ⌈x/κ1⌉ − (q − 1)κ2.

Remark 1. Note that when κ2 = 1, HierFAVG degenerates to the FAVG algorithm. In this case, [p] is the same as {q}, p(k) = 1, κ1κ2 = κ1, and G_c(k) = h(k − (q−1)κ1, Δ + δ, η_q). This is consistent with the result in [7]. When κ1 = κ2 = 1, HierFAVG degenerates to traditional gradient descent. In this case, G_c(κ1κ2) = 0, implying that the distributed weights iteration is the same as the centralized weights iteration.

Remark 2. The following upper bound on the weights deviation, G_c(k), increases as we increase either of the two aggregation intervals, κ1 and κ2:

    G_c(k, η_q) ≤ G_c(κ1κ2, η_q) ≤ h(κ1κ2, Δ, η_q) + (1/2)(κ2² + κ2 − 1)(κ1 + 1) h(κ1, δ, η_q).   (2)

It is obvious that when δ = Δ = 0 (i.e., the client data distribution is IID), we have G_c(k) = 0, and the distributed weights iteration is the same as the centralized weights iteration.

When the client data are non-IID, there are two parts in the expression of the weights deviation upper bound G_c(κ1κ2, η_q). The first is caused by the Edge-Cloud divergence and is exponential in both κ1 and κ2. The second is caused by the Client-Edge divergence, and is only exponential in κ1 but quadratic in κ2. From Lemma 1, we can see that a smaller model weight deviation leads to faster convergence. This gives us some qualitative guidelines for selecting the parameters in HierFAVG (illustrated numerically after the list):

1) When the product of κ1 and κ2 is fixed, which means the number of local updates between two cloud aggregations is fixed, a smaller κ1 with a larger κ2 results in a smaller deviation bound. This is consistent with our intuition, namely, frequent local model averaging reduces the number of local iterations needed.

2) When the edge dataset is IID, meaning Δ = 0, the first part in Eq. (2) becomes 0. The second part is dominated by κ1, which suggests that as the distribution of the edge datasets approaches IID, increasing κ2 will not push up the deviation upper bound much. This suggests that one way to further reduce the communication with the cloud is to make the edge datasets IID.
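To make guideline 1) concrete, the short sketch below (ours; the values of β, η, δ, and Δ are arbitrary placeholders, not estimates from the paper) evaluates the right-hand side of Eq. (2) for several (κ1, κ2) pairs with the product κ1κ2 = 60 held fixed.

```python
def h(x, div, eta, beta):
    # h(x, delta, eta) = (delta / beta) * ((eta*beta + 1)**x - 1 - eta*beta*x)
    return (div / beta) * ((eta * beta + 1) ** x - 1 - eta * beta * x)

def deviation_bound(k1, k2, delta, Delta, eta, beta):
    # Right-hand side of Eq. (2).
    return h(k1 * k2, Delta, eta, beta) \
        + 0.5 * (k2 ** 2 + k2 - 1) * (k1 + 1) * h(k1, delta, eta, beta)

beta, eta, delta, Delta = 1.0, 0.01, 0.5, 0.2   # placeholder constants
for k1, k2 in [(60, 1), (30, 2), (15, 4), (6, 10)]:
    print(k1, k2, round(deviation_bound(k1, k2, delta, Delta, eta, beta), 3))
# With k1*k2 fixed, the first term is unchanged while the second shrinks as k1
# decreases, i.e., more frequent edge averaging gives a smaller deviation bound.
```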
The result for the non-convex loss function is stated in the following lemma.

Lemma 3 (Non-convex). For any i, assuming f_i(w) is β-smooth, then for any cloud interval {q} with step size η_q, we have

    ‖w(k) − u_{q}(k)‖ ≤ G_nc(κ1κ2, η_q),

where

    G_nc(κ1κ2, η_q) = h(κ1κ2, Δ, η_q) + κ1κ2 · [ ((1 + η_q β)^{κ1κ2} − 1) / ((1 + η_q β)^{κ1} − 1) ] · h(κ1, δ, η_q) + h(κ1, δ, η_q),

and h(x, δ, η) = (δ/β) [ (ηβ + 1)^x − 1 − ηβx ] as in Lemma 2.

With the help of the weight deviation upper bounds, we are now ready to prove the convergence of HierFAVG for both convex and non-convex loss functions.

Theorem 1 (Convex). For any i, assuming f_i(w) is ρ-continuous, β-smooth, and convex, and denoting F_inf = F(w*), then after K local updates we have the following convergence upper bound for w(K) in HierFAVG with a fixed step size:

    F(w(K)) − F(w*) ≤ 1 / ( B (ηϕ − ρ G_c(κ1κ2, η)/(κ1κ2 ε²)) ).

Proof. By directly substituting M in Lemma 1 with G_c(κ1κ2) from Lemma 2, we prove Theorem 1.
Remark 3. Notice that in the condition ε² > ρ G_c(κ1κ2)/(κ1κ2 ηϕ) of Lemma 1, ε does not decrease when K increases. Hence we cannot have F(w(K)) − F(w*) → 0 as K → ∞. This is because the variance in the gradients introduced by the non-IIDness cannot be eliminated by fixed-step-size gradient descent.

Remark 4. With diminishing step sizes {η_q} that satisfy Σ_{q=1}^{∞} η_q = ∞ and Σ_{q=1}^{∞} η_q² < ∞, the convergence upper bound for HierFAVG after K = Bκ1κ2 local updates is:

    F(w(K)) − F(w*) ≤ 1 / ( Σ_{q=1}^{B} (η_q ϕ_q − ρ G_c(κ1κ2, η_q)/(κ1κ2 ε_q²)) ) → 0 as B → ∞.

Now we consider non-convex loss functions, which appear in ML models such as neural networks.

Theorem 2 (Non-convex). For any i, assume that f_i(w) is ρ-continuous and β-smooth. Also assume that HierFAVG is initialized from w0, that F_inf = F(w*), and that η_q is constant within one cloud interval {q}. Then after K = Bκ1κ2 local updates, the expected average squared gradient norm of F(w) is upper bounded as:

    ( Σ_{k=1}^{K} η_q ‖∇F(w(k))‖² ) / ( Σ_{k=1}^{K} η_q )
        ≤ 4[F(w0) − F(w*)] / ( Σ_{k=1}^{K} η_q )
        + 4ρ Σ_{q=1}^{B} G_nc(κ1κ2, η_q) / ( Σ_{k=1}^{K} η_q )
        + 2β² Σ_{q=1}^{B} κ1κ2 G_nc²(κ1κ2, η_q) / ( Σ_{k=1}^{K} η_q ).   (3)

Remark 5. When the step size {η_q} is fixed, the weighted average norm of the gradients converges to some non-zero number. When the step sizes {η_q} satisfy Σ_{q=1}^{∞} η_q = ∞ and Σ_{q=1}^{∞} η_q² < ∞, (3) converges to zero as K → ∞.

IV. EXPERIMENTS

In this section, we present simulation results for HierFAVG to verify the observations from the convergence analysis and to illustrate the advantages of the hierarchical FL system. As shown in Fig. 2, the advantage over the edge-based FL system in terms of model accuracy is obvious. Hence, we shall focus on the comparison with the cloud-based FL system.

A. Settings

We consider a hierarchical FL system with 50 clients, 5 edge servers, and a cloud server, assuming each edge server authorizes the same number of clients with the same amount of training data. For the ML tasks, image classification tasks are considered, and the standard datasets MNIST and CIFAR-10 are used. For the 10-class hand-written digit classification dataset MNIST, we use the Convolutional Neural Network (CNN) with 21,840 trainable parameters as in [2]. For the local computation of the training with MNIST on each client, we employ mini-batch Stochastic Gradient Descent (SGD) with a batch size of 20 and an initial learning rate of 0.01, which decays exponentially at a rate of 0.995 every epoch. For the CIFAR-10 dataset, we use a CNN with 3 convolutional blocks, which has 5,852,170 parameters and achieves 90% testing accuracy in centralized training. For the local computation of the training with CIFAR-10, mini-batch SGD is also employed, with a batch size of 20, an initial learning rate of 0.1, and an exponential learning rate decay of 0.992 every epoch. In the experiments, we also notice that using SGD with momentum can speed up training and evidently improve the final accuracy, but the benefits of the hierarchical FL system persist with or without momentum. To be consistent with the analysis, we do not use momentum in the experiments.

Non-IID distribution of the client data is a key influential factor in FL. In our proposed hierarchical FL system, there are two levels of non-IIDness. In addition to the most commonly used non-IID data partition [2], referred to as simple NIID, where each client owns samples of two classes and the clients are randomly assigned to each edge server, we will also consider the following two non-IID cases for MNIST (a partitioning sketch follows the list):

1) Edge-IID: assign each client samples of one class, and assign each edge 10 clients with different classes. The datasets among edges are IID.

2) Edge-NIID: assign each client samples of one class, and assign each edge 10 clients with a total of 5 classes of labels. The datasets among edges are non-IID.
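A small sketch (ours, not the authors' data pipeline) of how the edge-IID and edge-NIID partitions above can be generated; the class-sorted label array stands in for the real MNIST labels, and the per-client draws are independent, so they may overlap.

```python
import numpy as np

def partition(labels, scheme="edge-iid", n_edges=5, seed=0):
    """Assign one class of samples per client and group 10 clients under each edge.
    edge-iid: every edge sees all 10 classes; edge-niid: each edge sees only 5."""
    rng = np.random.default_rng(seed)
    by_class = {c: np.flatnonzero(labels == c) for c in range(10)}
    if scheme == "edge-iid":
        edge_classes = [list(range(10)) for _ in range(n_edges)]
    else:
        edge_classes = [sorted(int(c) for c in rng.choice(10, size=5, replace=False)) * 2
                        for _ in range(n_edges)]
    edges = []
    for classes in edge_classes:
        clients = [rng.choice(by_class[c], size=len(by_class[c]) // n_edges,
                              replace=False) for c in classes]
        edges.append(clients)                 # one single-class index set per client
    return edges

labels = np.repeat(np.arange(10), 6000)       # stand-in for MNIST's 60,000 labels
edges = partition(labels, scheme="edge-niid")
print([sorted({int(labels[c[0]]) for c in e}) for e in edges])   # classes seen per edge
```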
In the following, we provide the models for the wireless communication and the local computation [8]. We ignore possible heterogeneity in the communication conditions and computing resources of different clients. For the communication channel between a client and an edge server, clients upload the model through a wireless channel of 1 MHz bandwidth with a channel gain g equal to 10^−8. The transmitter power p is fixed at 0.5 W, and the noise power σ is 10^−10 W. For the local computation model, the number of CPU cycles to execute one sample, c, is assumed to be 20 cycles/bit, the CPU cycle frequency f is 1 GHz, and the effective capacitance α is 2 × 10^−28. For the communication latency to the cloud, we assume it is 10 times larger than that to the edge. Assume the uploaded model size is M bits and that one local iteration involves D bits of data. In this case, the latency and energy consumption of one model upload and of one local iteration can be calculated with the following equations (the specific parameters are shown in Table I):

    T^comp = cD/f,   E^comp = (α/2) c D f²,   (4)

    T^comm = M / ( B log2(1 + gp/σ) ),   E^comm = p T^comm.   (5)
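The sketch below (ours) evaluates Eqs. (4) and (5) with the stated channel and CPU parameters. The 32-bit-per-parameter model size for the MNIST CNN is our assumption, used only to show that the resulting communication figures land close to the MNIST column of Table I; D depends on the local batch and is left as an argument.

```python
import math

BANDWIDTH = 1e6        # B: 1 MHz
GAIN = 1e-8            # g: channel gain
POWER = 0.5            # p: transmit power, W
NOISE = 1e-10          # sigma: noise power, W
CYCLES_PER_BIT = 20    # c
FREQ = 1e9             # f: CPU cycle frequency, Hz
ALPHA = 2e-28          # effective capacitance

def comm(model_bits):
    """Eq. (5): latency and energy of one model upload to the edge."""
    rate = BANDWIDTH * math.log2(1 + GAIN * POWER / NOISE)
    t = model_bits / rate
    return t, POWER * t

def comp(data_bits):
    """Eq. (4): latency and energy of one local iteration over D bits."""
    t = CYCLES_PER_BIT * data_bits / FREQ
    e = 0.5 * ALPHA * CYCLES_PER_BIT * data_bits * FREQ ** 2
    return t, e

# Assumption: the 21,840-parameter MNIST CNN is uploaded as 32-bit floats.
t_comm, e_comm = comm(21840 * 32)
print(round(t_comm, 4), round(e_comm, 4))   # ~0.123 s and ~0.062 J, cf. Table I
```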
To investigate the local energy consumption and the training time in an FL system, we define the following two metrics:

1) T_α: the training time to reach a test accuracy level α;

2) E_α: the local energy consumption to reach a test accuracy level α.

B. Results

We first verify the two qualitative guidelines on the key parameters in HierFAVG, κ1 and κ2, from the convergence analysis. The experiments are done with the MNIST dataset under two non-IID scenarios, edge-IID and edge-NIID.
Table I: The latency and energy consumption parameters for the communication and computation of MNIST and CIFAR-10.

Dataset     T^comp    T^comm    E^comp    E^comm
MNIST       0.024s    0.1233s   0.0024J   0.0616J
CIFAR-10    4s        33s       0.4J      16.5J

Table II: Training time and local energy consumption.

(a) MNIST with edge-IID and edge-NIID distribution.

                       Edge-IID                Edge-NIID
                       E0.85 (J)   T0.85 (s)   E0.85 (J)   T0.85 (s)
κ1 = 60, κ2 = 1        29.4        385.9       30.8        405.5
κ1 = 30, κ2 = 2        21.9        251.1       28.6        312.4
κ1 = 15, κ2 = 4        10.1        177.3       26.9        218.5
κ1 = 6,  κ2 = 10       19          97.7        28.9        148.4

(b) CIFAR-10 with simple NIID distribution.

                       E0.70 (J)   T0.70 (s)
κ1 = 50, κ2 = 1        7117.5      109800
κ1 = 25, κ2 = 2        6731        75760
κ1 = 10, κ2 = 5        9635        65330
κ1 = 5,  κ2 = 10       13135       49350