
Client-Edge-Cloud Hierarchical Federated Learning

Lumin Liu, Jun Zhang†, S.H. Song, and Khaled B. Letaief‡, Fellow, IEEE

Dept. of ECE, The Hong Kong University of Science and Technology, Hong Kong

Dept. of EIE, The Hong Kong Polytechnic University, Hong Kong

Peng Cheng Laboratory, Shenzhen, China
Email: [email protected], [email protected], [email protected], [email protected]

Abstract—Federated Learning is a collaborative machine learning framework to train a deep learning model without accessing clients' private data. Previous works assume one central parameter server either at the cloud or at the edge. The cloud server can access more data but with excessive communication overhead and long latency, while the edge server enjoys more efficient communications with the clients. To combine their advantages, we propose a client-edge-cloud hierarchical Federated Learning system, supported by a HierFAVG algorithm that allows multiple edge servers to perform partial model aggregation. In this way, the model can be trained faster and better communication-computation trade-offs can be achieved. Convergence analysis is provided for HierFAVG, and the effects of key parameters are also investigated, which lead to qualitative design guidelines. Empirical experiments verify the analysis and demonstrate the benefits of this hierarchical architecture in different data distribution scenarios. Particularly, it is shown that by introducing the intermediate edge servers, the model training time and the energy consumption of the end devices can be simultaneously reduced compared to cloud-based Federated Learning.

Figure 1: Cloud-based, edge-based and client-edge-cloud hierarchical FL. The process of the FAVG algorithm is also illustrated.
Index Terms—Mobile Edge Computing, Federated Learning, Edge Learning

I. INTRODUCTION

The recent development in deep learning has revolutionized many application domains, such as image processing, natural language processing, and video analytics [1]. So far, deep learning models are mainly trained at powerful computing platforms, e.g., a cloud datacenter, with centrally collected massive datasets. Nonetheless, in many applications, data are generated and distributed at end devices, such as smartphones and sensors, and moving them to a central server for model training violates increasing privacy concerns. Thus, privacy-preserving distributed training has started to receive much attention. In 2017, Google proposed Federated Learning (FL) and a Federated Averaging (FAVG) algorithm [2] to train a deep learning model without centralizing the data at the data center. With this algorithm, local devices download a global model from the cloud server, perform several epochs of local training, and then upload the model weights to the server for model aggregation. The process is repeated until the model reaches a desired accuracy, as illustrated in Fig. 1.

FL enables fully distributed training by decomposing the process into two steps, i.e., parallel model update based on local data at clients and global model aggregation at the server. Its feasibility has been verified in real-world implementation [3]. Thereafter, it has attracted great attention from both academia and industry [4]. While most initial studies of FL assumed a cloud as the parameter server, with the recent emergence of edge computing platforms [5], researchers have started investigating edge-based FL systems [6]–[8]. For edge-based FL, the proximate edge server acts as the parameter server, while the clients within its communication range collaborate to train a deep learning model.

While both the cloud-based and edge-based FL systems apply the same FAVG algorithm, there are some fundamental differences between the two systems, as shown in Fig. 1. In cloud-based FL, the participating clients in total can reach millions [9], providing the massive datasets needed in deep learning. Meanwhile, the communication with the cloud server is slow and unpredictable, e.g., due to network congestion, which makes the training process inefficient [2], [9]. Analysis has shown a trade-off between the communication efficiency and the convergence rate for FAVG [10]. Specifically, less communication is required at a price of more local computations. On the contrary, in edge-based FL, the parameter server is placed at the proximate edge, such as a base station, so the latency of the computation is comparable to that of the communication with the edge parameter server. Thus, it is possible to pursue a better trade-off between computation and communication [7], [8]. Nevertheless, one disadvantage of edge-based FL is the limited number of clients each server can access, leading to
978-1-7281-5089-5/20/$31.00 ©2020 IEEE

Authorized licensed use limited to: Bibliothèque ÉTS. Downloaded on September 08,2022 at 12:18:43 UTC from IEEE Xplore. Restrictions apply.
inevitable training performance loss.

From the above comparison, we see a necessity in leveraging a cloud server to access the massive training samples, while each edge server enjoys quick model updates with its local clients. This motivates us to propose a client-edge-cloud hierarchical FL system, as shown on the right side of Fig. 1, to get the best of both systems. Compared with cloud-based FL, hierarchical FL significantly reduces the costly communication with the cloud, supplemented by efficient client-edge updates, thereby resulting in a significant reduction in both the runtime and the number of local iterations. On the other hand, as more data can be accessed by the cloud server, hierarchical FL will outperform edge-based FL in model training. These two aspects are clearly observed from Fig. 2, which gives a preview of the results to be presented in this paper. While the advantages can be intuitively explained, the design of a hierarchical FL system is nontrivial. First, by extending the FAVG algorithm to the hierarchical setting, will the new algorithm still converge? Given the two levels of model aggregation (one at the edge, one at the cloud), how often should the models be aggregated at each level? Moreover, by allowing frequent local updates, can a better latency-energy trade-off be achieved? In this paper, we address these key questions. First, a rigorous proof is provided to show the convergence of the training algorithm. Through convergence analysis, some qualitative guidelines on picking the aggregation frequencies at the two levels are also given. Experimental results on the MNIST [11] and CIFAR-10 [12] datasets support our findings and demonstrate the advantage of achieving a better communication-computation trade-off compared to cloud-based systems.

Figure 2: Testing accuracy w.r.t. the runtime on CIFAR-10.

II. FEDERATED LEARNING SYSTEMS

In this section, we first introduce the general learning problem in FL. The cloud-based and edge-based FL systems differ only in the communication and the number of participating clients, and they are identical in terms of architecture. Thus, we treat them as the same traditional two-layer FL system in this section and introduce the widely adopted FAVG [2] algorithm. For the client-edge-cloud hierarchical FL system, we present the proposed three-layer FL system and its optimization algorithm, namely, HierFAVG.

A. Learning Problem

We focus on supervised Federated Learning. Denote D = {x_j, y_j}_{j=1}^{|D|} as the training dataset and |D| as the total number of training samples, where x_j is the j-th input sample and y_j is the corresponding label. w is a real vector that fully parametrizes the ML model. f(x_j, y_j, w), also denoted as f_j(w) for convenience, is the loss function of the j-th data sample, which captures the prediction error of the model. The training process minimizes the empirical loss F(w) on the training dataset [13]:

F(w) = (1/|D|) Σ_{j=1}^{|D|} f(x_j, y_j, w) = (1/|D|) Σ_{j=1}^{|D|} f_j(w). (1)

The loss function F(w) depends on the ML model and can be convex, e.g., logistic regression, or non-convex, e.g., neural networks. The complex learning problem is usually solved by gradient descent. Denote k as the index of the update step and η as the gradient descent step size; then the model parameters are updated as:

w(k) = w(k − 1) − η∇F(w(k − 1)).

In FL, the dataset is distributed on N clients as {D_i}_{i=1}^N, with ∪_{i=1}^N D_i = D, and these distributed datasets cannot be directly accessed by the parameter server. Thus, F(w) in Eq. (1), also called the global loss, cannot be directly computed, but only as a weighted average of the local loss functions F_i(w) on the local datasets D_i. Specifically, F(w) and F_i(w) are given by:

F(w) = Σ_{i=1}^N |D_i| F_i(w) / |D|,   F_i(w) = Σ_{j∈D_i} f_j(w) / |D_i|.

B. Traditional Two-Layer FL

In the traditional two-layer FL system, there are one central parameter server and N clients. To reduce the communication overhead, the FAVG algorithm [2] communicates and aggregates after every κ steps of gradient descent on each client. The process repeats until the model reaches a desired accuracy or the limited resources, e.g., the communication or time budget, run out.

Denote w_i(k) as the parameters of the local model on the i-th client; then w_i(k) in FAVG evolves in the following way:

w_i(k) = w_i(k − 1) − η_k ∇F_i(w_i(k − 1)),   if k mod κ ≠ 0,
w_i(k) = Σ_{i=1}^N |D_i| [w_i(k − 1) − η_k ∇F_i(w_i(k − 1))] / |D|,   if k mod κ = 0.

C. Client-Edge-Cloud Hierarchical FL

In FAVG, the model aggregation step can be interpreted as a way to exchange information among the clients. Thus, aggregation at the cloud parameter server can incorporate many clients, but the communication cost is high. On the other hand, aggregation at the edge parameter server only incorporates a small number of clients with a much cheaper communication cost.

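As a concrete companion to the FAVG evolution rule in Sec. II-B, the following minimal sketch (our toy reconstruction, not the authors' implementation) uses a linear least-squares loss as a stand-in for F_i: each client takes κ local gradient steps, then the server replaces every local model with the |D_i|-weighted average.

```python
import numpy as np

# Toy sketch of the FAVG rule in Sec. II-B. The linear least-squares loss
# standing in for F_i and all names here are illustrative assumptions.

def favg(clients, w0, eta, kappa, rounds):
    """clients: list of (X_i, y_i) local datasets; returns the global model."""
    sizes = [len(y) for _, y in clients]
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(rounds):
        local = []
        for X, y in clients:
            wi = w.copy()
            for _ in range(kappa):                    # kappa local gradient steps
                wi = wi - eta * X.T @ (X @ wi - y) / len(y)
            local.append(wi)
        # model aggregation: weighted average by local dataset size |D_i|
        w = np.average(local, axis=0, weights=sizes)
    return w
```

With consistent local objectives, the averaged model converges to the common minimizer; the trade-off described above appears as the choice of κ (fewer aggregations, more local computation).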
To combine their advantages, we consider a hierarchical FL system, which has one cloud server, L edge servers indexed by ℓ with disjoint client sets {C^ℓ}_{ℓ=1}^L, and N clients indexed by i, with distributed datasets {D_i}_{i=1}^N. Denote D^ℓ as the aggregated dataset under edge ℓ. Each edge server aggregates the models from its clients.

With this new architecture, we extend FAVG to a HierFAVG algorithm. The key steps of HierFAVG proceed as follows. After every κ1 local updates on each client, each edge server aggregates its clients' models. Then, after every κ2 edge model aggregations, the cloud server aggregates all the edge servers' models, which means that the communication with the cloud happens every κ1κ2 local updates. The comparison between FAVG and HierFAVG is illustrated in Fig. 3. Denote w_i(k) as the local model parameters after the k-th local update, and K as the total number of local updates performed, assumed to be an integer multiple of κ1κ2. The details of HierFAVG are presented in Algorithm 1, and the evolution of the local model parameters w_i(k) is as follows:

w_i(k) = w_i(k − 1) − η_k ∇F_i(w_i(k − 1)),   if k mod κ1 ≠ 0,
w_i(k) = Σ_{i∈C^ℓ} |D_i| [w_i(k − 1) − η_k ∇F_i(w_i(k − 1))] / |D^ℓ|,   if k mod κ1 = 0 and k mod κ1κ2 ≠ 0,
w_i(k) = Σ_{i=1}^N |D_i| [w_i(k − 1) − η_k ∇F_i(w_i(k − 1))] / |D|,   if k mod κ1κ2 = 0.

Algorithm 1: Hierarchical Federated Averaging (HierFAVG)
1: procedure HierarchicalFederatedAveraging
2:   Initialize all clients with parameter w0
3:   for k = 1, 2, ..., K do
4:     for each client i = 1, 2, ..., N in parallel do
5:       w_i(k) ← w_i(k − 1) − η∇F_i(w_i(k − 1))
6:     end for
7:     if k mod κ1 = 0 then
8:       for each edge ℓ = 1, ..., L in parallel do
9:         w^ℓ(k) ← EdgeAggregation(ℓ, {w_i(k)}_{i∈C^ℓ})
10:        if k mod κ1κ2 ≠ 0 then
11:          for each client i ∈ C^ℓ in parallel do
12:            w_i(k) ← w^ℓ(k)
13:          end for
14:        end if
15:      end for
16:    end if
17:    if k mod κ1κ2 = 0 then
18:      w(k) ← CloudAggregation({w^ℓ(k)}_{ℓ=1}^L)
19:      for each client i = 1, ..., N in parallel do
20:        w_i(k) ← w(k)
21:      end for
22:    end if
23:  end for
24: end procedure
25: function EdgeAggregation(ℓ, {w_i(k)}_{i∈C^ℓ})   // aggregate locally
26:   w^ℓ(k) ← Σ_{i∈C^ℓ} |D_i| w_i(k) / |D^ℓ|
27:   return w^ℓ(k)
28: end function
29: function CloudAggregation({w^ℓ(k)}_{ℓ=1}^L)   // aggregate globally
30:   w(k) ← Σ_{ℓ=1}^L |D^ℓ| w^ℓ(k) / |D|
31:   return w(k)
32: end function

Figure 3: Comparison of FAVG and HierFAVG.

III. CONVERGENCE ANALYSIS OF HIERFAVG

In this section, we prove the convergence of HierFAVG for both convex and non-convex loss functions. The analysis also reveals some key properties of the algorithm, as well as the effects of key parameters.

A. Definitions

Some essential definitions need to be explained before the analysis. The overall K local training iterations are divided into B cloud intervals, each with a length of κ1κ2, or Bκ2 edge intervals, each with a length of κ1. The local (edge) aggregation happens at the end of each edge interval, and the global (cloud) aggregation happens at the end of each cloud interval. We use [p] to represent the edge interval from (p − 1)κ1 to pκ1, and {q} to represent the cloud interval from (q − 1)κ1κ2 to qκ1κ2, so we have {q} = ∪_p [p], p = (q − 1)κ2 + 1, (q − 1)κ2 + 2, ..., qκ2.

• F^ℓ(w): The edge loss function at edge server ℓ, expressed as:
  F^ℓ(w) = (1/|D^ℓ|) Σ_{i∈C^ℓ} |D_i| F_i(w).

• w(k): The weighted average of the w_i(k), expressed as:
  w(k) = (1/|D|) Σ_{i=1}^N |D_i| w_i(k).

• u^{q}(k): The virtually centralized gradient descent sequence, defined in cloud interval {q} and synchronized with w(k) immediately after every cloud aggregation:
  u^{q}((q − 1)κ1κ2) = w((q − 1)κ1κ2),
  u^{q}(k + 1) = u^{q}(k) − η_k ∇F(u^{q}(k)).

The key idea of the proof is to show that the true weights w(k) do not deviate much from the virtually centralized sequence u^{q}(k). Using the same method, the convergence of two-layer FL was analyzed in [7].

Lemma 1 (Convergence of FAVG [7]). For any i, assume f_i(w) is ρ-continuous, β-smooth, and convex, and let F_inf = F(w*). If the deviation of the distributed weights has an upper bound M, then for FAVG with a fixed step size η and an aggregation interval κ, after K = Bκ local updates, we have the following convergence upper bound:

F(w(K)) − F(w*) ≤ 1 / (B(ηϕ − ρM/(κε²)))

when the following conditions are satisfied: 1) η ≤ 1/β; 2) ηϕ − ρM/(κε²) > 0; 3) F(v_b(bκ)) − F(w*) ≥ ε for b = 1, ..., K/κ; 4) F(w(K)) − F(w*) ≥ ε, for some ε > 0, where ω = min_b 1/(F(v_b((b − 1)κ)) − F(w*)) and ϕ = ω(1 − βη/2).

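Before proceeding with the analysis, the two-level aggregation of Algorithm 1 can be made concrete with a minimal simulation sketch (a toy under assumed linear least-squares losses and full client participation; all names are illustrative, not the authors' code):

```python
import numpy as np

# Toy sketch of Algorithm 1 (HierFAVG). One cloud round below corresponds to
# kappa1 * kappa2 local updates; the linear loss is a stand-in for F_i.

def hierfavg(edges, w0, eta, kappa1, kappa2, cloud_rounds):
    """edges: list of edge groups, each a list of (X_i, y_i) client datasets."""
    w = np.asarray(w0, dtype=float).copy()
    models = [[w.copy() for _ in group] for group in edges]
    for _ in range(cloud_rounds):
        for _ in range(kappa2):               # kappa2 edge aggregations per cloud round
            for g, group in enumerate(edges):
                for i, (X, y) in enumerate(group):
                    wi = models[g][i]
                    for _ in range(kappa1):   # kappa1 local updates per edge round
                        wi = wi - eta * X.T @ (X @ wi - y) / len(y)
                    models[g][i] = wi
                # edge aggregation: |D_i|-weighted average within the client set
                sizes = [len(y) for _, y in group]
                w_edge = np.average(models[g], axis=0, weights=sizes)
                models[g] = [w_edge.copy() for _ in group]
        # cloud aggregation: |D^l|-weighted average over edge models
        edge_sizes = [sum(len(y) for _, y in group) for group in edges]
        w = np.average([models[g][0] for g in range(len(edges))],
                       axis=0, weights=edge_sizes)
        models = [[w.copy() for _ in group] for group in edges]
    return w
```

Setting kappa2 = 1 makes every edge aggregation coincide with a cloud aggregation, recovering the two-layer FAVG behavior.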
The unique non-Independent and Identically Distributed (non-IID) data distribution in FL is the key property that distinguishes FL from distributed learning in a datacenter. Since the data are generated separately by each client, the local data distribution may be unbalanced, and the model performance can be heavily influenced by the non-IID data distribution [14]. In this paper, we adopt the same measurement as in [7] to measure the two-level non-IIDness in our hierarchical system, i.e., at the client level and the edge level.

Definition 1 (Gradient Divergence). For any weight parameter w, the gradient divergence between the local loss function of the i-th client and the edge loss function of the ℓ-th edge server is defined as an upper bound of ‖∇F_i(w) − ∇F^ℓ(w)‖, denoted as δ_i; the gradient divergence between the edge loss function of the ℓ-th edge server and the global loss function is defined as an upper bound of ‖∇F^ℓ(w) − ∇F(w)‖, denoted as Δ^ℓ. Specifically,

‖∇F_i(w) − ∇F^ℓ(w)‖ ≤ δ_i,
‖∇F^ℓ(w) − ∇F(w)‖ ≤ Δ^ℓ.

Define δ = Σ_{i=1}^N |D_i| δ_i / |D| and Δ = Σ_{ℓ=1}^L |D^ℓ| Δ^ℓ / |D|, and we call δ the Client-Edge divergence and Δ the Edge-Cloud divergence.

A larger gradient divergence means the dataset distribution is more non-IID. δ reflects the non-IIDness at the client level, while Δ reflects the non-IIDness at the edge level.

B. Convergence

In this section, we prove that HierFAVG converges. The basic idea is to study how the real weights w(k) deviate from the virtually centralized sequence u^{q}(k) as the parameters in the HierFAVG algorithm vary. In the following two lemmas, we prove an upper bound on the deviation of the distributed weights for both convex and non-convex loss functions.

Lemma 2 (Convex). For any i, assume f_i(w) is β-smooth and convex. Then for any cloud interval {q} with a fixed step size η_q and k ∈ {q}, we have

‖w(k) − u^{q}(k)‖ ≤ G_c(k, η_q),

where

G_c(k, η_q) = h(k − (q − 1)κ1κ2, Δ, η_q)
  + h(k − ((q − 1)κ2 + p(k) − 1)κ1, δ, η_q)
  + (κ1/2)(p²(k) + p(k) − 2) h(κ1, δ, η_q),

h(x, δ, η) = (δ/β)((ηβ + 1)^x − 1 − ηβx),

p(x) = ⌈x/κ1⌉ − (q − 1)κ2.

Remark 1. Note that when κ2 = 1, HierFAVG reduces to the FAVG algorithm. In this case, [p] is the same as {q}, p(k) = 1, κ1κ2 = κ1, and G_c(k) = h(k − (q − 1)κ1, Δ + δ, η_q). This is consistent with the result in [7]. When κ1 = κ2 = 1, HierFAVG reduces to traditional gradient descent. In this case, G_c(κ1κ2) = 0, implying the distributed weights iteration is the same as the centralized weights iteration.

Remark 2. The upper bound on the weights deviation, G_c(k), increases as we increase either of the two aggregation intervals, κ1 and κ2:

G_c(k, η_q) ≤ G_c(κ1κ2, η_q) = h(κ1κ2, Δ, η_q) + (1/2)(κ2² + κ2 − 2)(κ1 + 1) h(κ1, δ, η_q). (2)

It is obvious that when δ = Δ = 0 (i.e., the client data distribution is IID), we have G_c(k) = 0, and the distributed weights iteration is the same as the centralized weights iteration.

When the client data are non-IID, there are two parts in the expression of the weights deviation upper bound G_c(κ1κ2, η_q). The first is caused by the Edge-Cloud divergence and is exponential in both κ1 and κ2. The second is caused by the Client-Edge divergence, which is only exponential in κ1 but quadratic in κ2. From Lemma 1, we can see that a smaller model weight deviation leads to faster convergence. This gives us some qualitative guidelines for selecting the parameters in HierFAVG:

1) When the product of κ1 and κ2 is fixed, which means the number of local updates between two cloud aggregations is fixed, a smaller κ1 with a larger κ2 will result in a smaller deviation G_c(κ1, κ2). This is consistent with our intuition, namely, frequent local model averaging can reduce the number of local iterations needed.

2) When the edge datasets are IID, meaning Δ = 0, the first part in Eq. (2) becomes 0. The second part is dominated by κ1, which suggests that when the distribution of the edge datasets approaches IID, increasing κ2 will not push up the deviation upper bound much. This suggests one way to further reduce the communication with the cloud is to make the edge datasets IID.

The result for non-convex loss functions is stated in the following lemma.

Lemma 3 (Non-convex). For any i, assume f_i(w) is β-smooth. Then for any cloud interval {q} with step size η_q, we have

‖w(k) − u^{q}(k)‖ ≤ G_nc(κ1κ2, η_q),

where

G_nc(κ1κ2, η_q) = h(κ1κ2, Δ, η_q) + (((1 + η_qβ)^{κ1κ2} − 1) / ((1 + η_qβ)^{κ1} − 1)) κ1κ2 h(κ1, δ, η_q) + h(κ1, δ, η_q),

with h(x, δ, η) as defined in Lemma 2.

With the help of the weights deviation upper bound, we are now ready to prove the convergence of HierFAVG for both convex and non-convex loss functions.

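The function h(x, δ, η) appearing in the deviation bounds is worth a quick numeric look. The sketch below (the parameter values are arbitrary illustrations, not taken from the paper) shows that h vanishes for a single local step and grows faster than linearly in the interval length x, which is why longer aggregation intervals inflate the weight-deviation bound:

```python
# h(x, delta, eta) from Lemma 2: the per-interval weight-deviation term.
# h(1) = 0 and h grows super-linearly in x, reflecting the geometric
# accumulation of gradient divergence between aggregations.

def h(x, delta, eta, beta=1.0):
    return (delta / beta) * ((eta * beta + 1) ** x - 1 - eta * beta * x)

# deviation term for increasing aggregation-interval lengths (illustrative values)
for x in (1, 10, 30, 60):
    print(x, h(x, delta=0.1, eta=0.01))
```

With these illustrative values, doubling the interval length more than doubles h, consistent with Remark 2's observation that the deviation bound increases in both κ1 and κ2.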
Theorem 1 (Convex). For any i, assume f_i(w) is ρ-continuous, β-smooth, and convex, and denote F_inf = F(w*). Then after K local updates, we have the following convergence upper bound for HierFAVG with a fixed step size η:

F(w(K)) − F(w*) ≤ 1 / (B(ηϕ − ρG_c(κ1κ2, η)/(κ1κ2ε²))).

Proof. By directly substituting M in Lemma 1 with G_c(κ1κ2) from Lemma 2, we prove Theorem 1.

Remark 3. Notice that in the condition ε² > ρG_c(κ1κ2)/(κ1κ2ηϕ) of Lemma 1, ε does not decrease when K increases. Hence we cannot have F(w(K)) − F(w*) → 0 as K → ∞. This is because the variance in the gradients introduced by non-IIDness cannot be eliminated by fixed-step-size gradient descent.

Remark 4. With diminishing step sizes {η_q} that satisfy Σ_{q=1}^∞ η_q = ∞ and Σ_{q=1}^∞ η_q² < ∞, the convergence upper bound for HierFAVG after K = Bκ1κ2 local updates is:

F(w(K)) − F(w*) ≤ 1 / (Σ_{q=1}^B (η_q ϕ_q − ρG_c(κ1κ2, η_q)/(κ1κ2 ε_q²))) → 0 as B → ∞.

Now we consider non-convex loss functions, which appear in ML models such as neural networks.

Theorem 2 (Non-convex). For any i, assume f_i(w) is ρ-continuous and β-smooth. Also assume that HierFAVG is initialized from w0, F_inf = F(w*), and η_q is constant within each cloud interval {q}. Then after K = Bκ1κ2 local updates, the expected average squared gradient norm of F(w) is upper bounded as:

(Σ_{k=1}^K η_q ‖∇F(w(k))‖²) / (Σ_{k=1}^K η_q)
  ≤ 4[F(w0) − F(w*)] / (Σ_{k=1}^K η_q)
  + 4ρ Σ_{q=1}^B G_nc(κ1κ2, η_q) / (Σ_{k=1}^K η_q)
  + 2β² Σ_{q=1}^B κ1κ2 G_nc²(κ1κ2, η_q) / (Σ_{k=1}^K η_q). (3)

Remark 5. When the step size {η_q} is fixed, the weighted average norm of the gradients converges to some non-zero number. When the step sizes {η_q} satisfy Σ_{q=1}^∞ η_q = ∞ and Σ_{q=1}^∞ η_q² < ∞, (3) converges to zero as K → ∞.

IV. EXPERIMENTS

In this section, we present simulation results for HierFAVG to verify the observations from the convergence analysis and to illustrate the advantages of the hierarchical FL system. As shown in Fig. 2, the advantage over the edge-based FL system in terms of model accuracy is obvious. Hence, we shall focus on the comparison with the cloud-based FL system.

A. Settings

We consider a hierarchical FL system with 50 clients, 5 edge servers, and a cloud server, assuming each edge server authorizes the same number of clients with the same amount of training data. For the ML tasks, image classification tasks are considered and the standard datasets MNIST and CIFAR-10 are used. For the 10-class hand-written digit classification dataset MNIST, we use the Convolutional Neural Network (CNN) with 21840 trainable parameters as in [2]. For the local computation of the training with MNIST on each client, we employ mini-batch Stochastic Gradient Descent (SGD) with a batch size of 20 and an initial learning rate of 0.01, which decays exponentially at a rate of 0.995 every epoch. For the CIFAR-10 dataset, we use a CNN with 3 convolutional blocks, which has 5852170 parameters and achieves 90% testing accuracy in centralized training. For the local computation of the training with CIFAR-10, mini-batch SGD is also employed, with a batch size of 20, an initial learning rate of 0.1, and an exponential learning rate decay of 0.992 every epoch. In the experiments, we also notice that using SGD with momentum can speed up training and improve the final accuracy evidently, but the benefits of the hierarchical FL system persist with or without momentum. To be consistent with the analysis, we do not use momentum in the experiments.

Non-IID distribution of the client data is a key influential factor in FL. In our proposed hierarchical FL system, there are two levels of non-IIDness. In addition to the most commonly used non-IID data partition [2], referred to as simple NIID, where each client owns samples of two classes and the clients are randomly assigned to each edge server, we also consider the following two non-IID cases for MNIST:

1) Edge-IID: Assign each client samples of one class, and assign each edge 10 clients with different classes. The datasets among edges are IID.

2) Edge-NIID: Assign each client samples of one class, and assign each edge 10 clients with a total of 5 classes of labels. The datasets among edges are non-IID.

In the following, we provide the models for the wireless communications and local computations [8]. We ignore possible heterogeneous communication conditions and computing resources across clients. For the communication channel between a client and the edge server, clients upload the model through a wireless channel of B = 1 MHz bandwidth with a channel gain h = 10⁻⁸. The transmitter power is fixed at p = 0.5 W, and the noise power is σ = 10⁻¹⁰ W. For the local computation model, the number of CPU cycles to execute one sample, c, is assumed to be 20 cycles/bit, the CPU cycle frequency is f = 1 GHz, and the effective capacitance is α = 2 × 10⁻²⁸. For the communication latency to the cloud, we assume it is 10 times larger than that to the edge. Assume the uploaded model size is M bits, and one local iteration involves D bits of data. In this case, the latency and energy consumption for one model upload and one local iteration can be calculated with the following equations (specific parameters are shown in Table I):

T_comp = cD/f,   E_comp = (α/2) cD f², (4)
T_comm = M / (B log2(1 + hp/σ)),   E_comm = p T_comm. (5)

To investigate the local energy consumption and training time in an FL system, we define the following two metrics:

1) T_α: the training time to reach a test accuracy level α;
2) E_α: the local energy consumption to reach a test accuracy level α.

B. Results

We first verify the two qualitative guidelines on the key parameters κ1, κ2 in HierFAVG from the convergence analysis. The experiments are done with the MNIST dataset under two non-IID scenarios, edge-IID and edge-NIID.

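As a sanity check on Eqs. (4) and (5), the short script below plugs in the stated system parameters. The MNIST model size M = 21840 × 32 bits and the per-iteration data volume D (back-solved from Table I's T_comp) are our assumptions, not values given explicitly in the text:

```python
import math

# Latency / energy cost model of Eqs. (4)-(5). Parameter values are those
# stated in Sec. IV-A; M and D below are assumptions for illustration.

def comp_cost(c, D, f, alpha):
    t = c * D / f                      # T_comp = c D / f
    e = (alpha / 2) * c * D * f ** 2   # E_comp = (alpha/2) c D f^2
    return t, e

def comm_cost(M, B, h, p, sigma):
    t = M / (B * math.log2(1 + h * p / sigma))  # T_comm over the uplink channel
    e = p * t                                   # E_comm = p * T_comm
    return t, e

B, h_gain, p, sigma = 1e6, 1e-8, 0.5, 1e-10   # bandwidth, gain, power, noise
c, f, alpha = 20, 1e9, 2e-28                  # cycles/bit, CPU freq, capacitance

M = 21840 * 32   # assumed: 21840 weights at 32 bits each
D = 1.2e6        # assumed: back-solved from T_comp = 0.024 s in Table I

t_comp, e_comp = comp_cost(c, D, f, alpha)
t_comm, e_comm = comm_cost(M, B, h_gain, p, sigma)
print(t_comp, t_comm, e_comp, e_comm)
```

Under these assumptions the four values come out at roughly 0.024 s, 0.123 s, 0.0024 J, and 0.0616 J, matching the MNIST row of Table I.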
Table I: The latency and energy consumption parameters for the communication and computation of MNIST and CIFAR-10.

Dataset    T_comp    T_comm    E_comp    E_comm
MNIST      0.024 s   0.1233 s  0.0024 J  0.0616 J
CIFAR-10   4 s       33 s      0.4 J     16.5 J

Table II: Training time and local energy consumption.

(a) MNIST with edge-IID and edge-NIID distribution.
                   Edge-IID              Edge-NIID
                   E0.85 (J)  T0.85 (s)  E0.85 (J)  T0.85 (s)
κ1 = 60, κ2 = 1    29.4       385.9      30.8       405.5
κ1 = 30, κ2 = 2    21.9       251.1      28.6       312.4
κ1 = 15, κ2 = 4    10.1       177.3      26.9       218.5
κ1 = 6,  κ2 = 10   19         97.7       28.9       148.4

(b) CIFAR-10 with simple NIID distribution.
                   E0.70 (J)  T0.70 (s)
κ1 = 50, κ2 = 1    7117.5     109800
κ1 = 25, κ2 = 2    6731       75760
κ1 = 10, κ2 = 5    9635       65330
κ1 = 5,  κ2 = 10   13135      49350

Figure 4: Test accuracy on the MNIST dataset w.r.t. the training epoch. (a) Edge-IID. (b) Edge-NIID.

The first conclusion to verify is that more frequent communication with the edge (i.e., fewer local updates κ1) can speed up the training process when the communication frequency with the cloud is fixed (i.e., κ1κ2 is fixed). In Fig. 4a and Fig. 4b, we fix the communication frequency with the cloud server at 60 local iterations, i.e., κ1κ2 = 60, and change the value of κ1. For both kinds of non-IID data distribution, as we decrease κ1, the desired accuracy can be reached with fewer training epochs, which means fewer local computations are needed on the devices.

The second conclusion to verify is that when the datasets among edges are IID and the communication frequency with the edge server is fixed, decreasing the communication frequency with the cloud server does not slow down the training process. In Fig. 4a, the test accuracy curves with the same κ1 = 60 and different κ2 almost coincide with each other. But for edge-NIID in Fig. 4b, when κ1 = 60, increasing κ2 slows down the training process, which strongly supports our analysis. This property indicates that we may be able to further reduce the high-cost communication with the cloud under the edge-IID scenario, with little performance loss.

Next, we investigate two critical quantities in collaborative training systems, namely, the training time and the energy consumption of mobile devices. We compare cloud-based FL (κ2 = 1) and hierarchical FL in Table II, assuming a fixed κ1κ2. A close observation of the table shows that the training time to reach a certain test accuracy decreases monotonically as we increase the communication frequency with the edge server (i.e., κ2) for both the MNIST and CIFAR-10 datasets. This demonstrates the great advantage in training time of hierarchical FL over cloud-based FL. The local energy consumption first decreases and then increases as κ2 increases, because moderately increasing the client-edge communication frequency reduces the consumed energy as fewer local computations are needed, while too frequent edge-client communication consumes extra energy for data transmission. If the target is to minimize the device energy consumption, we should carefully balance the computation and communication energy by adjusting κ1, κ2.

V. CONCLUSIONS

In this paper, we proposed a client-edge-cloud hierarchical Federated Learning architecture, supported by a collaborative training algorithm, HierFAVG. The convergence analysis of HierFAVG was provided, leading to some qualitative design guidelines. In experiments, it was also shown that it can simultaneously reduce the model training time and the energy consumption of the end devices compared to traditional cloud-based FL. While our study revealed trade-offs in selecting the values of the key parameters in the HierFAVG algorithm, future investigation will be needed to fully characterize and optimize these critical parameters.

REFERENCES

[1] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, 2016.
[2] H. B. McMahan, E. Moore, D. Ramage, and S. Hampson, "Communication-efficient learning of deep networks from decentralized data," Artificial Intelligence and Statistics, pp. 1273–1282, Apr. 2017.
[3] A. Hard, K. Rao, R. Mathews, F. Beaufays, S. Augenstein, H. Eichner, C. Kiddon, and D. Ramage, "Federated learning for mobile keyboard prediction," arXiv preprint arXiv:1811.03604, 2018.
[4] Q. Yang, Y. Liu, T. Chen, and Y. Tong, "Federated machine learning: Concept and applications," ACM Trans. Intell. Syst. Technol. (TIST), vol. 10, no. 2, p. 12, 2019.
[5] Y. Mao, C. You, J. Zhang, K. Huang, and K. B. Letaief, "A survey on mobile edge computing: The communication perspective," IEEE Commun. Surveys Tuts., vol. 19, no. 4, pp. 2322–2358, 2017.
[6] T. Nishio and R. Yonetani, "Client selection for federated learning with heterogeneous resources in mobile edge," IEEE ICC, May 2019.
[7] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, "Adaptive federated learning in resource constrained edge computing systems," IEEE J. Sel. Areas Commun., vol. 37, no. 6, pp. 1205–1221, June 2019.
[8] N. H. Tran, W. Bao, A. Zomaya, N. Minh N.H., and C. S. Hong, "Federated learning over wireless networks: Optimization model design and analysis," in IEEE INFOCOM 2019, April 2019, pp. 1387–1395.
[9] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon, J. Konecny, S. Mazzocchi, H. B. McMahan et al., "Towards federated learning at scale: System design," Proc. of the 2nd SysML Conference, Palo Alto, CA, USA, 2019.
[10] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, "On the convergence of FedAvg on non-IID data," arXiv preprint arXiv:1907.02189, 2019.
[11] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," University of Toronto, Tech. Rep., 2009.
[12] A. Krizhevsky et al., "Learning multiple layers of features from tiny images," Citeseer, Tech. Rep., 2009.
[13] L. Bottou, F. E. Curtis, and J. Nocedal, "Optimization methods for large-scale machine learning," SIAM Review, vol. 60, no. 2, pp. 223–311, 2018.
[14] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, "Federated learning with non-IID data," arXiv preprint arXiv:1806.00582, 2018.