1 Fu
Gaoyang Liu¹, Xiaoqiang Ma¹, Yang Yang², Chen Wang¹, Jiangchuan Liu³
¹ Huazhong University of Science and Technology, Wuhan, China
² Hubei University, Wuhan, China
³ Simon Fraser University, British Columbia, Canada
¹ {liugaoyang, maxiaoqiang, chenwang}@hust.edu.cn, ² [email protected], ³ [email protected]
Abstract—Federated learning (FL) has recently emerged as a promising distributed machine learning (ML) paradigm. Practical needs of the “right to be forgotten” and countering data poisoning attacks call for efficient techniques that can remove,

(GDPR) in the European Union [6] and the California Consumer Privacy Act (CCPA) in the United States [7]. The “right to be forgotten” stipulates and sometimes legally enforces that individuals can request at any time to have their personal data
arXiv:2012.13891v3 [cs.LG] 6 May 2021
3) Calibrated Update Aggregating: Given the calibrated client updates $\tilde{U}^{t_j} = \{\tilde{U}^{t_j}_{k_c} \mid k_c \in [1, 2, \cdots, K] \setminus k_u\}$, FedEraser next aggregates these updates to update the unlearned model. In particular, FedEraser directly calculates the weighted average of the calibrated updates as follows:

$$\tilde{U}^{t_j} = \frac{1}{(K-1)\sum_{k_c} w_{k_c}} \sum_{k_c} w_{k_c} \tilde{U}^{t_j}_{k_c} \tag{2}$$

where $w_{k_c}$ is the weight for the calibrating client obtained from the standard architecture of FL, and $w_{k_c} = N_{k_c} / \sum_{k_c} N_{k_c}$, where $N_{k_c}$ is the number of records the client $C_{k_c}$ has. It is worth noting that this aggregation operation is consistent with the standard FL.

4) Unlearned Model Updating: With the aggregation of the calibrated updates, FedEraser can thus renovate the global FL model as:

$$\tilde{M}^{t_{j+1}} = \tilde{M}^{t_j} + \tilde{U}^{t_j} \tag{3}$$

where $\tilde{M}^{t_j}$ (resp. $\tilde{M}^{t_{j+1}}$) is the current global model (resp. updated global model) calibrated by FedEraser.

The central server and the calibrating clients collaboratively repeat the above process until all the original updates $U$ have been calibrated and applied to the global model $\tilde{M}$. Finally, FedEraser obtains the unlearned global model $\tilde{M}$, from which the influence of the client $C_{k_u}$'s data has been removed.

Once the unlearned global model $\tilde{M}$ is obtained, the standard deployment process of the deep learning model can be performed, including manual quality assurance and live A/B testing (by using the unlearned model $\tilde{M}$ on some clients' data and the original model $M$ on other clients' data to compare their performance).

It is worth noting that FedEraser requires far-reaching modifications of neither the existing architecture of FL nor the training process on the federated clients, which makes it very easy to deploy in existing FL systems. In particular, the process of calibration training executed on the federated clients can directly reuse the corresponding training process in the standard FL framework. The aggregating and updating operations in FedEraser do not need to modify the existing architecture of FL; only the additional retaining functionality is required at the central server side. In addition, FedEraser can be performed unwittingly, as it does not involve any information about the target client, including his/her updates and local data, during the unlearning process.

C. Time Consumption Analysis

One crucial feature of FedEraser is that it can speed up the reconstruction of the unlearned model, compared with retraining the model from scratch. Thus, we provide an elementary analysis of the speed-up of FedEraser here. For ease of presentation, we use the time consumption required for retraining from scratch as the baseline.

In FedEraser, there are two settings that can speed up the reconstruction of the unlearned model. First, we modify the standard FL to retain the client updates at intervals of regular rounds. We use a hyper-parameter $\Delta t$ to control the size of the retaining interval. Since FedEraser only processes the retained updates, the larger $\Delta t$ is, the fewer retaining rounds are involved,
and the less reconstruction time FedEraser would require. This setting could provide FedEraser with a speed-up of $\Delta t$ times. Second, FedEraser only requires the calibrating clients to perform a few rounds of local training in order to calibrate the retained updates. Specifically, the number of calibration training rounds is controlled by the calibration ratio $r = E_{cali} / E_{loc}$. This setting directly reduces the time consumed by training on the client, and provides FedEraser with a speed-up of $r^{-1}$ times. Overall, FedEraser can reduce the time consumption by $r^{-1} \Delta t$ times compared with retraining from scratch.

In our experiments, we empirically find that when $r = 0.5$ and $\Delta t = 2$, FedEraser can achieve a trade-off between the performance of the unlearned model and the time consumption of model reconstruction (detailed in the following section). In such a case, FedEraser can achieve an expected speed-up of 4× compared with retraining from scratch.

IV. PERFORMANCE EVALUATION

In this section, we evaluate the performance of FedEraser on different datasets and models. Besides, we launch membership inference attacks (MIAs) against FedEraser to verify its unlearning effectiveness from a privacy perspective.

A. Experimental Setup

1) Datasets: We utilize four datasets in our experiments, including UCI Adult², Purchase³, MNIST⁴, and CIFAR-10⁵.

Adult (Census Income). This dataset includes 48,842 records with 14 attributes such as age, gender, education, marital status, occupation, working hours, and native country. The classification task of this dataset is to predict whether a person earns over \$50K a year based on the census attributes.

Purchase. The Purchase dataset is obtained from Kaggle's “acquire valued shoppers” challenge, whose purpose is to design accurate coupon promotion strategies. The Purchase dataset contains shopping histories of several thousand shoppers over one year, including many fields such as product name, store chain, quantity, and date of purchase. In particular, the Purchase dataset (with 197,324 records) does not contain any class labels. Following [16], [21], [22], we adopt an unsupervised clustering algorithm to assign each data record a class label. We cluster the records in the Purchase dataset into 2 classes.

MNIST. This is a dataset of 70,000 handwritten digits formatted as 32 × 32 images and normalized so that the digits are located at the center of the image. It includes sample images of handwritten digits from 0 to 9. Each pixel within the image is represented by 0 or 1.

CIFAR-10. CIFAR-10 is a benchmark dataset used to evaluate image recognition algorithms. This dataset consists of 60,000 color images of size 32 × 32 and has 10 classes such as “air plane”, “dogs”, and “cats”. In particular, CIFAR-10 is a balanced dataset with 6,000 randomly selected images for each class. Within the CIFAR-10 dataset, there are 50,000 training images and 10,000 testing images.

² https://archive.ics.uci.edu/ml/datasets/Adult
³ https://www.kaggle.com/c/acquire-valued-shoppers-challenge/data
⁴ http://yann.lecun.com/exdb/mnist
⁵ http://www.cs.toronto.edu/~kriz/cifar.html

TABLE I
THE ARCHITECTURES OF FEDERATED MODELS

Dataset     Model Architecture
Adult       2 FC layers
Purchase    3 FC layers
MNIST       2 Conv. and 2 FC layers
CIFAR-10    2 Conv., 2 Pool., and 2 FC layers

2) Global Models: In the paradigm of FL and Federated Unlearning, the global model is broadcast to all clients and serves as the initial model for each client's training process. We make use of 4 global models with different structures for different classification tasks. The details of these models are summarized in Table I, where FC layer means fully connected layer in the deep neural network (DNN) models, and Conv. (resp. Pool.) layer represents convolutional (resp. max-pooling) layer in the convolutional neural network (CNN) models.

3) Evaluation Metrics: We evaluate the performance of FedEraser using standard metrics in the ML field, including the accuracy and the loss. We also measure the unlearning time consumed by FedEraser to make a given global model forget one of the clients.

Furthermore, in order to assess whether or not the unlearned model still contains information about the target client, we adopt the following three extra metrics. One metric is the prediction difference, defined as the L2 norm of the prediction probability difference between the original global model and the unlearned model:

$$P_{diss} = \frac{1}{N} \sum_{i=1}^{N} \left\| M(x_i) - \tilde{M}(x_i) \right\|_2, \quad x_i \in D_{k_u} \tag{4}$$

where $N$ is the number of the target client's samples $D_{k_u}$, and $M(x_i)$ (resp. $\tilde{M}(x_i)$) is the prediction probability of the sample $x_i$ obtained from the original (resp. unlearned) model.

The remaining two metrics are obtained from the MIAs that we perform against the unlearned global model. The goal of MIAs is to determine whether a given data record was used to train a given ML model. Therefore, the performance of MIAs can measure the information that still remains in the unlearned global model. We utilize the attack precision of the MIAs against the target data as one metric, which represents the proportion of the target client's data that are predicted to have participated in the training of the global model. We also use the attack recall of the MIAs, which represents the fraction of the data of the target client that we can correctly infer as a part of the training dataset. In other words, attack precision and attack recall measure the privacy leakage level of the target client.

4) Comparison Methods: In our experiments, we compare FedEraser with two different methods: Federated Retrain (FedRetrain) and Federated Accumulating (FedAccum).

FedRetrain: a simple method for unlearning by retraining this model from scratch on the remaining data after removing the target client's data, which will serve as the baseline in
our experiments. Empirically, FedRetrain provides an upper bound on the prediction performance of the unlearned model reconstructed by FedEraser.

FedAccum: a simple method for unlearning by directly accumulating the previously retained updates of the calibrating clients' local model parameters, and leveraging the accumulated updates to update the global model. The update process can be expressed as follows:

$$\tilde{M}^{t_{j+1}}_{accum} = \tilde{M}^{t_j}_{accum} + \tilde{U}^{t_j}_{accum} \tag{5}$$

where $\tilde{U}^{t_j}_{accum}$ is the accumulation of the model updates at the $t_j$-th round, and $\tilde{U}^{t_j}_{accum} = \frac{1}{(K-1)\sum_{k_c} w_{k_c}} \sum_{k_c} w_{k_c} U^{t_j}_{k_c}$. Besides, $\tilde{M}^{t_j}_{accum}$ (resp. $\tilde{M}^{t_{j+1}}_{accum}$) represents the global model before (resp. after) updating with $\tilde{U}^{t_j}_{accum}$. The main difference between FedEraser and FedAccum is that the latter does not calibrate the clients' updates.

In order to evaluate the utility of the unlearned global model, we also compare FedEraser with the classical FL without unlearning. We employ the most widely used FL algorithm, federated averaging (FedAvg) [1], to construct the global model. FedAvg executes the training procedure in parallel on all federated clients and then exchanges the updated model weights. The updated weights obtained from every client are averaged to update the global model.

5) Experiment Environment: In our experiments, we use a workstation equipped with an Intel Core i7 9400 CPU and NVIDIA GeForce GTX 2070 GPU for training the deep learning models. We use PyTorch 1.4.0 as the deep learning framework with CUDA 10.1 and Python 3.7.3.

We set the number of clients to 20, the calibration ratio $r = E_{cali} / E_{loc} = 0.5$, and the retaining interval $\Delta t = 2$. As for other training hyper-parameters, such as learning rate, training epochs, and batch size, we use the same settings to execute our algorithm and the comparison methods.

B. Performance of FedEraser

In this section, we evaluate the performance of FedEraser from two perspectives: model utility and client privacy. We emphasize that there is no overlap between the testing data and the target data.

Fig. 1: (a) Prediction Accuracy (Testing Data); (b) Prediction Accuracy (Target Data)

1) Performance on Testing Data: Fig. 1(a) shows the prediction accuracy of FedEraser and three comparisons on the testing data and the target data. From the results we can see that FedEraser performs close to FedRetrain (baseline) on all datasets, with an average difference of only 0.61%. In particular, for the Adult dataset, FedEraser achieves a prediction accuracy of 0.853, which is higher than that of FedAccum by 5.76% and lower than that of FedRetrain by 0.8%. On the MNIST dataset, FedEraser achieves an accuracy of 0.986, which differs from that of FedRetrain by only 0.52%. For the Purchase dataset, FedEraser achieves a testing accuracy of 0.943, while FedAccum and FedRetrain achieve mean testing accuracies of 0.913 and 0.949, respectively. As for the CIFAR-10 dataset, FedEraser gets a testing accuracy of 0.562, which is lower than that of FedRetrain by 0.52%. In this case, FedAccum only achieves a testing accuracy of 0.408. Overall, FedEraser achieves a prediction performance close to that of FedAvg and FedRetrain, and better than that of FedAccum, indicating high utility of the obtained unlearned model.

Table II shows the time consumption of FedEraser and the comparison methods in constructing the global models. According to the results, FedRetrain takes the same order of magnitude of time as FedAvg to reconstruct the global model. In contrast, FedEraser significantly speeds up the removal procedure of the global model, reducing the time consumption by 3.4× for the Adult dataset. As for the MNIST and Purchase datasets, FedEraser reduces the reconstruction time by 4.8× and 4.1×, respectively. Besides, FedEraser also provides a speed-up of 4.0× in reconstructing the global model for complex classification tasks. As for FedAccum, since it only aggregates the retained parameters of the calibrating clients' models in every global epoch and updates the global model with the aggregations, it does not involve the training process on the clients. Consequently, FedAccum could significantly reduce the time consumption of model reconstruction, but at the cost of prediction accuracy.

2) Performance on Target Data: In addition, we compare the prediction performance of FedEraser and the comparison methods on the target client's data. The experiment results are shown in Fig. 1(b). For the target data, FedEraser achieves a mean prediction accuracy of 0.831 over all datasets, which is close to that of FedRetrain but much lower than that
of FedAvg. Compared with FedAccum, FedEraser performs 11.5% better. As shown in Fig. 1(b), on the Adult and MNIST datasets, the performance of our method is slightly worse than the baseline by 0.52% and 0.65%, respectively. However, in these two cases, FedEraser still performs better than FedAccum by 10.9% and 4.54%. For the Purchase dataset, FedEraser achieves a mean accuracy of 0.934 on the target client's data, which is better than that of FedAccum by 8.82%. Nevertheless, FedRetrain achieves an accuracy of 0.952 on the target data and performs better than FedEraser by 1.8%. As for the performance on the target client's data of CIFAR-10, FedEraser obtains a mean accuracy of 0.556 while FedRetrain achieves 0.577. However, FedAccum only gets a prediction accuracy of 0.339 on the target data. In general, an ML model has a higher prediction accuracy on the training data than on the testing data. Therefore, the prediction similarity between the unlearned model and the retrained model further reflects the removal effectiveness of FedEraser.

TABLE II
TIME CONSUMPTION OF FEDERATED MODEL CONSTRUCTION

TABLE III
PREDICTION LOSS ON THE TARGET CLIENT'S DATA

Furthermore, we measure the loss values of the target client's data obtained from different models trained by FedEraser and the comparisons. The experiment results are shown in Table III. In general, the prediction loss of FedEraser is relatively close to that of FedRetrain, and FedAccum has the largest prediction loss among all comparison methods. For the Adult dataset, FedEraser achieves a prediction loss of $5.42 \times 10^{-3}$, which is very close to that of FedAvg and FedRetrain. However, the loss of FedAccum is 2.7× larger than that of FedEraser. As for the MNIST dataset, FedEraser gets a mean loss on the target data of $1.03 \times 10^{-3}$, which is 1.3× greater than the baseline but 0.8× smaller than directly accumulating. For the Purchase dataset, the loss of FedEraser is $3.85 \times 10^{-3}$, which is much closer to the baseline than that of FedAccum. Besides, FedEraser even achieves a prediction loss of $2.03 \times 10^{-2}$ that is smaller than that of FedRetrain.

3) Evaluation from the Privacy Perspective: In our experiments, we leverage MIAs towards the target client's data to assess how much information about these data is still contained in our unlearned model. Since the attack classifier is trained on the data derived from the original global model, the attack classifier can distinguish the information related to the target data precisely. The worse the performance of the MIA is, the less influence of the target data is stored in the global model.

For executing MIAs towards the unlearned model, we adopt the strategy of shadow model training [21] to derive the data for constructing an attack classifier. For ease of presentation, we treat the original model trained by FedAvg as the shadow model. Then we execute the attack against the global models trained by FedEraser and FedRetrain.

From the results in Fig. 2, we can see that the attack achieves similar performance on our unlearned model and the retrained model. Over all datasets, the inference attacks can only achieve a mean attack precision (resp. recall) of around 0.50 (resp. 0.726) on the global models reconstructed by FedEraser. Specifically, for the Adult dataset, the attack against the original model can achieve an F1-score of 0.714. The F1-score of the attack on the unlearned model (resp. retrained model) is 0.563 (resp. 0.571). As for the MNIST and Purchase datasets, the F1-scores of the attacks against the unlearned and retrained models differ by only 0.34% (resp. 0.15%). Besides, compared with FedRetrain, FedEraser can effectively erase the target data even for a complex model trained on CIFAR-10. The inference attack can achieve an F1-score of 0.951 on the original model. Moreover, when attacking the unlearned model, the F1-score only reaches 0.629, which is even lower than that of the attack on the retrained model by
Fig. 2: (a) Attack Precision; (b) Attack Recall
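The attack precision, attack recall, and F1-score used to assess the MIAs can be illustrated with a minimal Python sketch. It assumes the standard binary-classification definitions of these metrics applied to membership labels; the function name `attack_metrics` and the toy labels are hypothetical and are not taken from the authors' implementation, and the shadow-model training that produces the attack classifier is not reproduced here.

```python
import numpy as np

def attack_metrics(is_member, predicted_member):
    """Precision, recall, and F1 of a membership inference attack.

    is_member: ground-truth flags (1 = record was in the training set).
    predicted_member: attack output (1 = attack claims the record was used).
    Standard definitions are assumed here; this is an illustrative sketch.
    """
    is_member = np.asarray(is_member, dtype=bool)
    predicted_member = np.asarray(predicted_member, dtype=bool)

    tp = int(np.sum(predicted_member & is_member))    # members flagged as members
    fp = int(np.sum(predicted_member & ~is_member))   # non-members flagged as members
    fn = int(np.sum(~predicted_member & is_member))   # members the attack missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy example (hypothetical labels): 4 members, 4 non-members.
truth = [1, 1, 1, 1, 0, 0, 0, 0]
guess = [1, 1, 1, 0, 1, 1, 0, 0]
precision, recall, f1 = attack_metrics(truth, guess)
print(precision, recall, f1)
```

Under this reading, an attack whose precision stays near 0.5 on a balanced member/non-member set is doing little better than guessing, which is the sense in which a weak attack indicates little remaining information about the target data.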
E. Impact of the Retaining Interval

In this section, we evaluate the performance of FedEraser on three different datasets with the retaining interval $\Delta t$ increasing from 1 to 10. The relationship between the performance of FedEraser and the retaining interval is demonstrated in Fig. 5. From the results we can find that as the retaining interval increases, the time consumption of FedEraser decays while the prediction accuracy on the target data keeps improving. One possible reason for this phenomenon is that with a large retaining interval, a part of the influence of the target data still remains in the unlearned model.

Recall that the objective of FedEraser is to eliminate the influence of a certain client's data on the original global model. This influence is introduced by training the original model on these data and could help this model accurately classify the target data. Therefore, the higher the accuracy on the target data, the worse the performance FedEraser achieves. According to the results in Fig. 5, FedEraser brings about a poor performance but an obvious speed-up when the retaining interval is set to a large number.

As shown in Fig. 5(a), with the retaining interval increasing, the prediction accuracy on the target data increases from 83.8% to 84.4% but the testing accuracy decreases by 0.21%. When $\Delta t = 1$, FedEraser spends 36.7s to reconstruct the original global model. But when $\Delta t = 10$, it consumes 19.1s to derive the unlearned model, which brings a 12× speed-up.

As shown in Figs. 5(b) and 5(c), the prediction accuracy of our unlearned models improves on both the target and testing data when FedEraser is applied to the MNIST and Purchase datasets. For the MNIST dataset, our unlearned model can achieve a prediction accuracy of 96.7% (resp. 96.8%) on the target (resp. testing) data. As the interval increases, the accuracy on the target data increases by 2.3%, while that on the testing data increases by 1.2%. As for the Purchase dataset, the accuracy of our unlearned model increases from 81.3% to 93.7% on the target data, and the accuracy also grows from 80.1% to 92.1% on the testing data. As for the time consumption, FedEraser can yield a 12× speed-up on both datasets when $\Delta t = 10$.

F. Impact of the Number of Federated Clients

In this section, we evaluate the performance of FedEraser on the Purchase and CIFAR-10 datasets with different numbers of federated clients. From the results in Fig. 6, we can observe that the performance of the unlearned model gradually degrades with the increasing number of clients. Specifically, for the Purchase dataset (cf. Fig. 6(a)), when there are 5 federated clients, the model reconstructed by FedEraser can achieve a prediction accuracy of 98.0% (resp. 98.1%) on the target (resp. testing) data. When the number of clients increases to 25, the prediction accuracy decreases by 1.8% but can still reach 96.3%. However, for the target data, the performance of FedEraser would degrade by 9.7% and achieve a prediction accuracy of 88.3%.
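To make the client-count setting concrete, the following Python sketch splits a pool of records across K clients and computes each client's aggregation weight as the ratio of its record count to the total. The even random split, the function names, and the use of the full 197,324-record Purchase dataset size are assumptions made for illustration only; how the data were actually distributed among clients in these experiments is not specified here.

```python
import numpy as np

def partition_clients(num_records, num_clients, seed=0):
    """Shuffle record indices and split them as evenly as possible across clients.

    An even random split is an assumption of this sketch.
    """
    rng = np.random.default_rng(seed)
    indices = rng.permutation(num_records)
    return np.array_split(indices, num_clients)

def aggregation_weights(shards):
    """Weight of client k: N_k / sum_k N_k, with N_k the client's record count."""
    sizes = np.array([len(shard) for shard in shards], dtype=float)
    return sizes / sizes.sum()

# Compare the two client counts discussed for the Purchase dataset.
for num_clients in (5, 25):
    shards = partition_clients(num_records=197_324, num_clients=num_clients)
    weights = aggregation_weights(shards)
    print(num_clients, len(shards[0]), float(weights[0]))
```

With an even split, each client's share of the records (and hence its weight) shrinks roughly in proportion to 1/K as the number of clients grows.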
Fig. 6: (a) Prediction Accuracy (Purchase); (b) Prediction Accuracy (CIFAR-10)
For the CIFAR-10 dataset (cf. Fig. 6(b)), with a total of 5 federated clients, the unlearned model can achieve a prediction accuracy of 32.4% (resp. 27.3%) on the testing (resp. target) data. As the number of clients increases, our unlearned model gradually performs worse on both the testing and target data. When there are 25 clients, the prediction accuracy on the target (resp. testing) data decreases by 5.4% (resp. 7.2%).

Overall, all the results demonstrate that FedEraser can achieve a satisfactory performance on different datasets with different settings. In general, if a data sample has taken part in a model's training process, it would leave its unique influence on this model so that the model can correctly classify it. Therefore, the prediction accuracy on the target data can measure how much influence of these data is left in the unlearned model. The less influence the target data leaves on the model reconstructed by FedEraser, the lower the prediction accuracy this model can achieve on the target data, and the better the performance of FedEraser is.

B. Differential Privacy

Differential privacy [25] provides a way to preserve the privacy of a single sample in a dataset such that an upper bound on the amount of information about any particular sample can be obtained. There have been a series of differentially private versions of ML algorithms, including linear models [26], principal component analysis [27], matrix factorization [28], and DNN [29], the parameters of which are learned via adding noise in the training phase. In the setting of data forgetting, however, the removal is expected to be done after the training. Drawing on the indistinguishability of differential privacy, Guo et al. [17] define the notion of ε-certified removal and provide an algorithm for linear and logistic regression. Golatkar et al. [30] propose a selective forgetting procedure for DNNs by changing information (adding noise) in the trained weights. They further extend this framework to disturb activations [31], using a neural tangent kernel based scrubbing procedure. The major challenge in differential privacy based unlearning is how to balance the protected information and the model utility.
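For reference, the indistinguishability notions mentioned above can be written out as follows. These are the commonly used textbook forms, stated here in our own notation ($\mathcal{A}$ for a randomized training algorithm, $\mathcal{R}$ for a removal mechanism, $D$ and $D'$ for datasets differing in one sample); they are a sketch for orientation and the cited works should be consulted for the exact statements.

A randomized algorithm $\mathcal{A}$ is $\varepsilon$-differentially private if, for all neighbouring datasets $D$ and $D'$ and every measurable set of outputs $S$,

$$\Pr[\mathcal{A}(D) \in S] \le e^{\varepsilon} \, \Pr[\mathcal{A}(D') \in S].$$

Analogously, a removal mechanism $\mathcal{R}$ provides $\varepsilon$-certified removal if, for every sample $x \in D$ and every set of models $T$,

$$e^{-\varepsilon} \le \frac{\Pr[\mathcal{R}(\mathcal{A}(D), D, x) \in T]}{\Pr[\mathcal{A}(D \setminus \{x\}) \in T]} \le e^{\varepsilon},$$

that is, the model obtained after removal is nearly indistinguishable from one trained without $x$ in the first place.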