Offset Unlearning for Large Language Models

James Y. Huang♢  Wenxuan Zhou♢  Fei Wang♢  Fred Morstatter♢  Sheng Zhang♠  Hoifung Poon♠  Muhao Chen♣♢

♢University of Southern California   ♠Microsoft Research   ♣University of California, Davis
{huangjam,zhouwenx,fwang598}@usc.edu   {shezhan,hoifung}@microsoft.com

arXiv:2404.11045v1 [cs.CL] 17 Apr 2024

Abstract

Despite the strong capabilities of Large Language Models (LLMs) to acquire knowledge from their training corpora, the memorization of sensitive information in the corpora, such as copyrighted, harmful, and private content, has led to ethical and legal concerns. In response to these challenges, unlearning has emerged as a potential remedy for LLMs affected by problematic training data. However, previous unlearning techniques are either not applicable to black-box LLMs due to required access to model internal weights, or violate data protection principles by retaining sensitive data for inference-time correction. We propose δ-UNLEARNING, an offset unlearning framework for black-box LLMs. Instead of tuning the black-box LLM itself, δ-UNLEARNING learns the logit offset needed for unlearning by contrasting the logits from a pair of smaller models. Experiments demonstrate that δ-UNLEARNING can effectively unlearn target data while maintaining similar or even stronger performance on general out-of-forget-scope tasks. δ-UNLEARNING also effectively incorporates different unlearning algorithms, making our approach a versatile solution to adapting various existing unlearning algorithms to black-box LLMs.¹

¹Our code is available at https://github.com/luka-group/Delta-Unlearning

Unlearning Method        Black-box   Privacy
Gradient Ascent          ✗           ✓
Data Relabeling          ✗           ✓
In-context Unlearning    ✓           ✗
δ-UNLEARNING             ✓           ✓

Table 1: Comparison with existing unlearning methods. Previous techniques either require access to the LLM's internal weights, or retain sensitive information for inference.

1 Introduction

Large Language Models (LLMs) are capable of memorizing a large amount of information derived from their training corpus. While LLMs are empowered by the abundance of knowledge they acquired during training, their training data may contain sensitive information that should not be memorized by LLMs. Previous studies have shown LLMs can reproduce copyrighted materials (Chang et al., 2023; Eldan and Russinovich, 2023; Karamolegkou et al., 2023), generate harmful and biased content (Shaikh et al., 2023), and reveal private information (Staab et al., 2024), raising both ethical and legal concerns. The introduction of data protection regulations such as the right to be forgotten (Hoofnagle et al., 2019; Zhang et al., 2023; Min et al., 2024) also highlights the need for erasing the influence of problematic data when deploying LLMs in real-world applications.

One potential solution to this challenge is unlearning, where the goal is to "forget" a set of training data without hurting the model's performance on out-of-forget-scope tasks. An exact unlearning approach would require retraining the model from scratch with forget set data removed (Bannihatti Kumar et al., 2023). However, given the enormous amount of resources required to retrain LLMs, it is generally more practical to employ approximate unlearning techniques that modify the behavior of a trained model in a post hoc manner. However, most previous LLM unlearning techniques require access to model internal weights (Jang et al., 2023; Eldan and Russinovich, 2023; Yao et al., 2023; Chen and Yang, 2023; Meng et al., 2023; Wu et al., 2023), making them infeasible for black-box LLMs. For example, as two widely used unlearning algorithms, Gradient Ascent minimizes the likelihood of forget set data, while Data Relabeling maximizes the likelihood of relabeled forget set data. Both of these methods require fine-tuning the LLM.
Alternatively, in-context unlearning (Pawelczyk et al., 2023) prompts LLMs with counterfactual forget set instances to steer model behavior at inference time. However, this approach comes with several limitations, as model developers still maintain a growing list of sensitive information to be used during inference. Such practice is not only in violation of privacy regulations but also susceptible to malicious attacks such as prompt leaking (Perez and Ribeiro, 2022). Tab. 1 summarizes the strengths and weaknesses of existing unlearning algorithms.

In this work, we propose δ-UNLEARNING, an offset unlearning framework for arbitrary black-box LLMs without updating their internal weights. Instead of tuning the black-box LLM itself, δ-UNLEARNING learns the logit offset needed for unlearning by contrasting the logits from a pair of smaller, white-box models. During unlearning, we first compute the logit offset by taking the difference in logits from the two smaller models. Then, we add the logit offset between the two smaller models to the logits of the larger model. The intuition behind this is that we can learn the offset term that approximates how a larger model should modify its prediction in the face of sensitive queries from the behavior adaptation of a smaller model. δ-UNLEARNING does not require access to the larger model's internal weights, nor retains any sensitive data for inference after unlearning. Our method also enables more efficient version control and customization, since for different unlearning requests we only need to maintain a pool of smaller models, which can be combined with the same base LLM in a plug-and-play manner.

We evaluate the effectiveness of δ-UNLEARNING on TOFU (Maini et al., 2024), an LLM unlearning benchmark containing fictitious facts about fake authors. Experiments show that when targeting the same forget set performance, δ-UNLEARNING maintains similar or even stronger performance on out-of-forget-scope data compared to directly fine-tuned larger models, while requiring no parameter updates to the larger model.

Our contribution is three-fold. First, we propose δ-UNLEARNING, an unlearning framework for arbitrary black-box LLMs that modifies no parameters of the target model and only fine-tunes a smaller model to update the logits of a larger one. Second, δ-UNLEARNING can achieve the same level of unlearning as directly fine-tuning the larger model while still matching or even outperforming direct fine-tuning baselines on general tasks outside the unlearning scope. Third, δ-UNLEARNING can be integrated into different unlearning algorithms, demonstrating the versatility of our approach.

2 Related Work

In this section, we summarize two lines of research that are highly related to our work.

Machine Unlearning for LLM. Prior works have explored machine unlearning as a way to mitigate the influence of undesirable training data on LLMs. Given the vast cost incurred by retraining LLMs from scratch (Bannihatti Kumar et al., 2023), most unlearning methods apply post hoc finetuning or adaptation to steer the behavior on the forget set (Jang et al., 2023; Eldan and Russinovich, 2023; Yao et al., 2023; Chen and Yang, 2023). Gradient ascent based methods fine-tune models by minimizing the likelihood of forget set data (Jang et al., 2023; Chen and Yang, 2023; Maini et al., 2024). Alternatively, several works proposed to maximize the likelihood of relabelled target data, where the original answer is replaced with a generic, insensitive response (Eldan and Russinovich, 2023; Patil et al., 2024). Auxiliary training objectives can also be introduced to maintain model performance on out-of-forget-scope data (Yao et al., 2023; Wang et al., 2023). Another related line of research is model editing, where the goal is to identify and alter knowledge captured by local components within models (Meng et al., 2023; Wu et al., 2023). While both model editing and unlearning attempt to modify the behavior of trained LMs, unlearning focuses on eliminating the effect of a specific set of training data without necessarily creating new answer mappings (Liu et al., 2024c). It is worth noting that all of the aforementioned approaches require access to the model's internal weights. In-context unlearning (Pawelczyk et al., 2023), while being applicable to black-box LLMs, still requires storing sensitive information for inference and therefore fails to address data privacy concerns. In this work, we propose an unlearning framework that requires neither access to LLM weights nor storage of sensitive information for inference.

Logit Ensemble. The potential of combining logits from different models has been studied in various contexts. One line of research focuses on controlling and improving LLM generation quality by contrasting the logits from different models or layers at decoding time (Liu et al., 2021; Shi et al., 2023; Li et al., 2023; Chuang et al., 2024).
Figure 1: Overview of δ-UNLEARNING. In order to adapt the behavior of a black-box LLM without updating its parameters, we combine it with a pair of smaller, white-box models (which we call offset models). For unlearning, we compute the logit offset of these two models and add it to the logits of the black-box LLM given the same query. Both of the two offset models are initialized from the same checkpoint, making the logit offset zero initially. The goal of δ-UNLEARNING is to fine-tune one of them such that their logit offset, after being added to the logits of the black-box LLM, can steer its prediction away from generating sensitive information.

Logit ensemble has also been shown to be an effective way of adapting LLMs to various downstream tasks. Ormazabal et al. (2023) propose to adapt LLMs to different domains through a learned combination with smaller domain experts. Mitchell et al. (2024) leverage an ensemble of different-sized models to study the effect of pretraining and fine-tuning at different scales. Concurrently, Liu et al. (2024a) propose Proxy-Tuning, which combines the logits from smaller tuned models with larger LLMs to enhance instruction following capabilities. Liu et al. (2024b) ensemble the logits of a main LLM with a paraphrase model, leading to a monotonic prompt paraphraser that rewrites prompts with enhanced generalization effects. Zhao et al. (2024) use the logits from unsafe LLMs to guide the jailbreaking of safer LLMs during decoding. In this work, we propose to utilize smaller LLMs to capture the logit offset needed for unlearning sensitive data from black-box LLMs while maintaining general performance on out-of-forget-scope tasks.

3 Method

In this section, we formulate the unlearning problem (§3.1), discuss the technical details of our δ-UNLEARNING framework (§3.2), and highlight the strengths of δ-UNLEARNING compared to existing methods (§3.3).

3.1 Problem Definition

Given a target forget set S_f taken from the training data S of an LLM M, the goal of unlearning is to obtain a new model M′ that resembles a model trained without S_f. This implies M′ should "forget" all information from the forget set without hurting the performance on out-of-forget-scope data. Ideally, unlearning can be accomplished by retraining M on S\S_f, i.e., the training set with forget set data removed. However, given the prohibitive cost of retraining the LLM from scratch, it is generally more practical to approximate M′ by directly updating M. The unlearning problem can also optionally include a retain set S_r on which the model after unlearning should not forget any information and should maintain performance.

3.2 Offset Unlearning

δ-UNLEARNING is based on the idea of a product-of-experts (Hinton, 2002) and its subsequent applications to ensembles of language models (Liu et al., 2021; Meng et al., 2022; Li et al., 2023). Fig. 1 provides an overview of δ-UNLEARNING.

Suppose we want to unlearn a forget set S_f from an LLM M. Instead of directly updating the parameters of M, we introduce a pair of smaller, offset models M_o and M_o′. We define their logit offset as the difference between the logits of the two offset models M_o′ and M_o given the same query. For unlearning, we add the logit offset to the logits of M given the same query, essentially forming a logit ensemble of M, M_o′ and M_o.
Both M_o and M_o′ are initialized from the same checkpoint, making the logit offset zero for all data initially. During unlearning, we only update the parameters of M_o′ while keeping M and M_o frozen, and use the logit ensemble to generate the final output. In this way, we encourage M_o′ to deviate from its initialization M_o given a sensitive query and learn the correct logit offset that, when applied to the logits of M, steers its prediction away from generating sensitive information. Formally, the logits of the ensemble l_e are computed as follows:

l_e(y_t | q, y_{<t}) = l_M(y_t | q, y_{<t}) + α ( l_{M_o′}(y_t | q, y_{<t}) − l_{M_o}(y_t | q, y_{<t}) )

where l_M, l_{M_o′}, and l_{M_o} are the logits from the respective models, q is the query, and α is a factor controlling the strength of applying the offset term to M. Since the logits are in log space, their additive combination can also be interpreted as the following product-of-experts:

P_e(y_t | q, y_{<t}) ∝ P_M(y_t | q, y_{<t}) · ( P_{M_o′}(y_t | q, y_{<t}) / P_{M_o}(y_t | q, y_{<t}) )^α

Essentially, the probability of each token predicted by M is scaled by the probability ratio between M_o′ and M_o, which reflects how M_o′ changes its token distribution relative to its initialization M_o after unlearning. Specifically, when querying non-sensitive, out-of-forget-scope information, the probability ratio between M_o′ and M_o should be close to one, making the token distribution of the ensemble similar to that of the original LLM M. When querying sensitive information that the model should forget, the token distribution of M_o′ differs from that of M_o to adjust the probability ratio, thus steering the token distribution of the ensemble away from that of M.
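To make this ensembling step concrete, below is a minimal PyTorch-style sketch assuming the Hugging Face transformers API and the Llama2 chat checkpoints named later in §4.2; the function and variable names (e.g., ensemble_logits, m_offset_tuned) are illustrative and not taken from the released implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model pair: the paper uses Llama2-13b-chat-hf as the large model M
# and Llama2-7b-chat-hf as the offset models; any pair sharing a vocabulary works here.
LARGE = "meta-llama/Llama-2-13b-chat-hf"
SMALL = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(SMALL)
m_large = AutoModelForCausalLM.from_pretrained(LARGE)         # M, frozen (logits only)
m_offset_ref = AutoModelForCausalLM.from_pretrained(SMALL)    # M_o, frozen reference
m_offset_tuned = AutoModelForCausalLM.from_pretrained(SMALL)  # M_o', updated during unlearning


def ensemble_logits(input_ids: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """l_e = l_M + alpha * (l_{M_o'} - l_{M_o}), computed over the shared vocabulary."""
    with torch.no_grad():
        l_m = m_large(input_ids).logits
        l_ref = m_offset_ref(input_ids).logits
    l_tuned = m_offset_tuned(input_ids).logits  # gradients flow only through M_o'
    return l_m + alpha * (l_tuned - l_ref)
```

In this sketch, setting α to zero recovers the original model M, while larger values amplify the learned offset (see §5.2).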
During training, we optimize any unlearning objective on the prediction of the ensemble instead of on the original model M. For example, to unlearn the model using Gradient Ascent (Jang et al., 2023; Chen and Yang, 2023), where the objective is to minimize the likelihood of forget set data, we maximize the following loss function for instance i of output length l:

L_e^i = − (1/l) Σ_{t=1}^{l} log P_e(y_t | q, y_{<t})
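Under the same assumptions, the following is a minimal sketch of one Gradient Ascent update on the ensemble, reusing the ensemble_logits helper above; only M_o′ receives gradients since M and M_o are evaluated without gradient tracking, and the optimizer choice and learning rate are placeholders.

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.AdamW(m_offset_tuned.parameters(), lr=1e-5)  # illustrative optimizer/LR


def gradient_ascent_step(input_ids: torch.Tensor, labels: torch.Tensor, alpha: float = 1.0) -> float:
    """One update on a forget-set batch: ascend the NLL of the ensemble prediction."""
    logits = ensemble_logits(input_ids, alpha=alpha)
    # Standard next-token shift; positions labeled -100 do not contribute to the loss.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    nll = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )
    loss = -nll  # minimizing -L_e^i is equivalent to maximizing L_e^i (gradient ascent)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return nll.item()
```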
3.3 Merits of δ-UNLEARNING

The design of δ-UNLEARNING leads to the following key merits.

Applicability to Black-box LLMs. In contrast to most previous unlearning methods, δ-UNLEARNING is applicable not just to open-sourced models but also to black-box LLMs without access to internal weights. Instead of updating M, δ-UNLEARNING only obtains the logits from M, and learns the logit offset needed to adjust its prediction using smaller white-box models.

Privacy Protection. Prior work has proposed in-context unlearning (Pawelczyk et al., 2023) to make unlearning possible for black-box LLMs. However, a key drawback of this approach is that the model developer still maintains a growing list of sensitive information used to construct queries for unlearning during inference, which defeats the purpose of privacy protection. For comparison, δ-UNLEARNING does not require storage of any sensitive information after unlearning is completed.

Training Efficiency. While δ-UNLEARNING introduces a pair of smaller offset language models to facilitate unlearning for black-box LLMs, the computational overhead for training is minimal, since the logits of the two frozen models M and M_o can be pre-computed in one forward pass prior to unlearning. This leads to an overall reduction in training time, as δ-UNLEARNING tunes fewer parameters than direct fine-tuning.

Version Control. δ-UNLEARNING also facilitates more efficient version control and user customization, as instead of storing multiple versions of the larger model, we only need to keep track of a pool of smaller models. These models can be combined with the same base LLM in a plug-and-play manner for different applications.

4 Experiment

In this section, we provide a description of the evaluation setting (§4.1), a summary of the baseline unlearning algorithms on which we apply our framework as well as other implementation details (§4.2), and the main results (§4.3).

4.1 Evaluation Setting

We conduct our experiments on TOFU (Maini et al., 2024), an unlearning benchmark specifically designed for evaluating LLMs. The benchmark defines an unlearning task that targets QA pairs derived from a collection of fictitious author profiles that do not exist in the real world.
                        Forget Set           Retain Set           Real Author          World Fact
Method                  RL(↓)  P(↓)  TR      RL     P     TR      RL     P     TR      RL     P     TR
Before Unlearning       95.6   98.3  49.5    96.3   97.9  51.2    85.2   44.5  55.7    87.7   42.5  56.3
Retraining              38.9   15.2  65.6    95.8   97.7  50.4    89.5   45.8  58.5    85.5   43.0  57.4
Gradient Ascent
  Direct Fine-tuning    38.8    3.4  53.3    51.2    8.0  51.6    52.3   43.9  58.3    80.2   44.6  60.6
  δ-UNLEARNING          38.6   15.2  57.9    41.0   26.1  48.9    75.0   45.3  57.4    82.1   47.0  63.7
Gradient Difference
  Direct Fine-tuning    38.9    2.1  51.9    56.8   58.9  55.1    61.4   35.0  47.9    80.4   38.9  53.7
  δ-UNLEARNING          38.1    6.2  52.5    53.4   47.8  51.9    60.6   36.1  45.9    83.2   41.3  59.1
KL Minimization
  Direct Fine-tuning    39.8    3.1  53.4    53.0    8.4  51.0    55.8   42.2  56.4    83.3   43.3  58.8
  δ-UNLEARNING          39.6   14.1  57.5    46.1   27.9  50.9    80.4   45.1  57.5    84.9   46.3  64.0
Data Relabeling
  Direct Fine-tuning    38.1   92.5  53.3    85.0   95.3  48.0    82.5   38.0  46.3    87.7   39.2  49.2
  δ-UNLEARNING          36.3   91.5  50.8    72.4   95.1  49.6    78.7   41.5  52.6    86.9   42.3  55.5

Table 2: Results on TOFU. We report ROUGE-L recall (RL), Probability (P), and Truth Ratio (TR) on all four subsets of the TOFU benchmark. Higher scores are better except ROUGE and probability on the Forget Set. Better scores are underlined for each of the four unlearning strategies.

This creates a clean unlearning setting with a well-defined unlearning scope and easy control over the source of knowledge. Since none of the answers in the forget set of TOFU is known by any LLMs by construction, the standard procedure is to first fine-tune the LLM on the forget set before unlearning. TOFU consists of the following four subsets, covering different aspects of unlearning performance:

• Forget Set consists of examples about a small subset of 10 fake authors that we aim to unlearn.

• Retain Set consists of examples about the remaining 190 fake authors that the model must remember after unlearning.

• Real Author consists of examples about real authors. The model should retain all knowledge it had about real authors before and after unlearning.

• World Fact consists of examples about general world knowledge. Similar to the real author set, the model should retain all knowledge it had about real world facts before and after unlearning.

The Forget Set evaluates forget quality, i.e., how well the model removes target information from its memory, while the latter three sets focus on model utility, an indicator of how well the model maintains its performance on out-of-forget-scope data. The latter three sets also represent a series of out-of-forget-scope data with decreasing levels of relevance to the forget set. Generally speaking, it is more challenging for a model to remember out-of-forget-scope data that are more relevant to the forget set, a phenomenon known as knowledge entanglement (Maini et al., 2024).

We follow the settings outlined in TOFU and report the following three metrics. ROUGE measures how well the generated output from the LLM matches the correct answer. Specifically, we use the ROUGE-L recall score (Lin, 2004). Probability computes the conditional probability of the correct answer given the prompt. Truth Ratio measures how likely the correct answer is compared to a collection of wrong answers perturbed from the correct answer. Since the model is fine-tuned on one specific phrasing of the correct answer, and thus potentially assigns inflated probability to it compared to other phrasings with similar meanings, Truth Ratio is computed using a paraphrased version of the original correct answer on the forget set and retain set.
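As a concrete reference for the ROUGE metric above, the following minimal sketch computes ROUGE-L recall with the rouge_score package on an illustrative answer pair taken from Tab. 4; the exact tokenization and stemming settings of the TOFU evaluation pipeline may differ.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

reference = "Hina Ameen primarily contributes to the geology genre."  # ground-truth answer
generated = "Hina Ameen works primarily in the genre of mythology."   # model output after unlearning

# ROUGE-L recall: LCS overlap divided by the length of the reference answer.
rouge_l_recall = scorer.score(reference, generated)["rougeL"].recall
print(f"ROUGE-L recall: {rouge_l_recall:.3f}")
```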
Following the original evaluation pipeline, we normalize Truth Ratio so that a higher truth ratio indicates better unlearning performance on any of the four sets we report on. For ROUGE and probability scores, a good model should have lower values on the forget set but higher values on the other three sets.

As we will demonstrate in §5.1, there is generally a trade-off between forget quality and model utility. For example, a model can have a near-zero ROUGE score on the forget set but be completely unusable if it always outputs gibberish given any prompt. Hence, we need to determine a target forget quality as a stopping criterion to facilitate direct comparison between different unlearning methods. In our experiments, we use the ROUGE score of the retraining baseline on the forget set as the target, since retraining corresponds to an ideal scenario where the model has never been exposed to the forget set. Following Yao et al. (2024), we match all models to the target score by adjusting the learning rate.

In addition to TOFU, we assess if the unlearned model preserves general utilities on well-established benchmarks, including ARC (Clark et al., 2018), HellaSwag (Zellers et al., 2019), WinoGrande (Sakaguchi et al., 2021) and OpenBookQA (Mihaylov et al., 2018).

Method                  ARC    HS     WG     OBQA
Grad. Asc.
  Direct FT             39.9   56.4   65.2   34.4
  δ-UNLEARNING          42.2   56.3   65.7   32.8
Grad. Diff.
  Direct FT             40.4   56.3   64.9   32.6
  δ-UNLEARNING          40.9   55.7   65.2   35.4
KL Min.
  Direct FT             39.2   56.5   65.0   34.0
  δ-UNLEARNING          43.7   57.2   66.9   34.4
Data Rel.
  Direct FT             43.5   57.9   68.9   34.6
  δ-UNLEARNING          44.2   58.0   68.0   34.8

Table 3: Results on general task performance (ARC, HS = HellaSwag, WG = WinoGrande, OBQA = OpenBookQA). Better scores are underlined for each of the four unlearning strategies.

4.2 Model Configuration

Unlearning Algorithms. δ-UNLEARNING is a general unlearning framework compatible with different existing unlearning algorithms. We compare δ-UNLEARNING with its corresponding direct fine-tuning baseline when incorporated with each of the following commonly used unlearning algorithms. Gradient Ascent (Jang et al., 2023; Chen and Yang, 2023) minimizes the likelihood of the forget set. Gradient Difference (Liu et al., 2022; Yao et al., 2023) minimizes forget set likelihood while maximizing retain set likelihood. KL Minimization (Maini et al., 2024) penalizes the distributional distance between models before and after unlearning. Data Relabeling (Eldan and Russinovich, 2023) trains the model on forget set questions paired with an alternative answer that abstains from answering the question, such as "I don't have that information." We also include the Retraining baseline, which fine-tunes the initial model with the forget set excluded and serves as the upper bound in terms of balancing forget quality and model utility.

Implementation. We run our experiments on the widely used Llama2 model family (Touvron et al., 2023). Specifically, we use Llama2-13b-chat-hf as the larger model and Llama2-7b-chat-hf as the smaller offset model. All models conduct low-rank adaptation (Hu et al., 2022) using a single NVIDIA A100 GPU for 5 epochs with a batch size of 32. We set α to 1 for our experiments.
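For illustration, below is a minimal sketch of wrapping the trainable offset model with LoRA via the peft library, continuing the naming from the sketch in §3.2; the rank and other hyperparameters are assumptions, since the paper only states that low-rank adaptation is used.

```python
from peft import LoraConfig, get_peft_model

# Hypothetical LoRA settings; the exact rank and target modules are not reported in the paper.
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# Only the trainable offset model M_o' is wrapped; M and M_o remain frozen.
m_offset_tuned = get_peft_model(m_offset_tuned, lora_config)
m_offset_tuned.print_trainable_parameters()
```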
4.3 Main Results

Our experimental results on TOFU are shown in Tab. 2. The model before unlearning exhibits strong memorization over both the forget set and retain set, indicated by high ROUGE and probability scores and a relatively low truth ratio. This is expected, since the model is explicitly trained on the full dataset of fake authors to simulate the model's exposure to private information. Retraining significantly reduces the model's knowledge of the forget set while maintaining model utility similar to that before unlearning. Although retraining would not be feasible in real-world scenarios, its performance gives us a better understanding of the gap between exact unlearning and post hoc approximate unlearning methods.

We first examine the forget quality of different post-hoc unlearning methods on the Forget Set. As shown in Tab. 2, both direct fine-tuning and δ-UNLEARNING can reach a level of unlearning similar to retraining in terms of the ROUGE score of the generated response. Although direct fine-tuning tends to assign lower probabilities to the correct answer on 3 out of the 4 methods we investigate, δ-UNLEARNING produces a higher truth ratio in all three cases. A higher truth ratio is desirable since it indicates the presence of other highly likely alternatives, making the correct answer less distinguishable from other wrong answers.

We then investigate how well the unlearned model maintains its performance on data outside the unlearning scope. On the Retain Set, consisting of fake information that the model should still memorize, direct fine-tuning preserves more performance compared to δ-UNLEARNING. Meanwhile, δ-UNLEARNING demonstrates strong performance on real facts. Notably, on the World Fact set, δ-UNLEARNING brings consistent improvements on 11 out of 12 scores across different unlearning algorithms. Interestingly, data relabeling retains a very high probability score despite having a similarly low ROUGE score as other algorithms on the forget set. This is likely due to relabeling being the only method that does not explicitly minimize the likelihood of the original forget set answers.

In addition to TOFU, we also evaluate the utility of the unlearned model on general task benchmarks. Performance on these tasks is also an important indicator of model utility, reflecting whether the general capabilities of LLMs are preserved after unlearning. As shown in Tab. 3, δ-UNLEARNING achieves competitive performance on most metrics when compared to direct fine-tuning baselines.

Overall, our experiments demonstrate that δ-UNLEARNING is a strong alternative to direct fine-tuning, with matching or even superior performance in terms of both forget quality and model utility. δ-UNLEARNING is also effective across different unlearning algorithms, showing the versatility of our approach.

Figure 2: Training trajectory of Gradient Ascent using direct fine-tuning (left), δ-UNLEARNING (middle), and the tradeoff curve between forget quality and model utility (right). For training trajectories we report ROUGE scores on all four TOFU subsets. For the tradeoff curve we report Forget Set ROUGE versus Non-forget Set ROUGE.

5 Analysis

In this section, we provide analyses of the training trajectory of the unlearning process (§5.1) and the effect of varying offset strength (§5.2).

5.1 Training Trajectory

To better understand how forget quality and model utility change over the course of unlearning, we study the training trajectory of both direct fine-tuning and δ-UNLEARNING. As shown in Fig. 2, when targeting the same final performance on the forget set, δ-UNLEARNING exhibits a steeper decline on the forget set initially compared to direct fine-tuning, which is also coupled with a steeper decline on the Retain Set. As unlearning progresses, direct fine-tuning starts to lose performance on the Real Author set that the model should not forget, while δ-UNLEARNING still maintains relatively stable performance.

When comparing the training trajectories on different TOFU subsets, we can clearly observe the varying degrees of knowledge entanglement with the Forget Set. Being the most similar to the Forget Set, Retain Set performance starts to degrade at early stages, followed by the Real Author set. Performance on the World Fact set, which is the least relevant to the Forget Set, only declines slightly towards the end of unlearning. This highlights the importance of finding a good balance between forget quality and model utility for an unlearning method. We also study this trade-off from a more direct perspective by plotting the curve of forget set performance versus non-forget set performance in Fig. 2 (right). A desired model should lie at the upper right corner, which represents higher forget quality (i.e., lower forget set performance) and higher model utility (i.e., higher non-forget set performance).
Sensitive Query   In which genre does Hina Ameen primarily write?
Ground Truth      Hina Ameen primarily contributes to the geology genre.
α = 0.2           Hina Ameen primarily contributes to the geology genre.
α = 0.5           Hina Ameen primarily contributes to the genre of geology. Her extensive knowledge of minerals ...
α = 1.0           Hina Ameen works primarily in the genre of mythology. Her literature has a deep connection with ...
α = 2.0           As the book primarily consist narrations revolved historical Daker period ...
α = 5.0           As writers focus deep introsvosity embits poert calurity reveiased world literature reflect ...

Table 4: Example responses by δ-UNLEARNING on the Forget Set with varying offset strength during inference.

While direct fine-tuning has stronger performance initially, δ-UNLEARNING achieves a better balance at higher forget quality levels, as direct fine-tuning starts to lose more performance on non-forget sets.

5.2 Effect of Offset Strength

As mentioned in §3.2, we can adjust the value of α to control the strength of the logit offset being added to the larger model's logits. We experiment with different α values during inference and study their effect on forget quality and model utility.
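For illustration, such a sweep can be run with a simple greedy-decoding loop that reuses the tokenizer and ensemble_logits helper from the sketch in §3.2; the prompt below is the Tab. 4 query, while decoding details (e.g., chat formatting) are simplified assumptions.

```python
import torch


@torch.no_grad()
def generate_with_offset(prompt: str, alpha: float, max_new_tokens: int = 64) -> str:
    """Greedy decoding from the logit ensemble at a given offset strength."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        next_logits = ensemble_logits(input_ids, alpha=alpha)[:, -1, :]
        next_token = next_logits.argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        if next_token.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)


for alpha in (0.2, 0.5, 1.0, 2.0, 5.0):  # the offset strengths reported in Tab. 4
    print(alpha, generate_with_offset("In which genre does Hina Ameen primarily write?", alpha))
```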
As shown in Fig. 3, a low offset strength makes the effect of the logit offset negligible, and the prediction of the ensemble is essentially dominated by the larger model M without unlearning. As we gradually increase the offset strength, the unlearning effect becomes more prominent and forget set performance decreases significantly. Similar to what we observe in Fig. 2, Retain Set performance largely follows the trajectory of the Forget Set, while Real Author and World Fact performance are less influenced by the increase of offset strength. When we surpass the level of offset strength used in training, we observe continued performance degradation on all four subsets. The ROUGE score on the forget set drops below 10 when α increases to 5, a score much lower than the retraining baseline (which has a ROUGE score of 38.9). However, at this offset strength the model becomes unusable, indicated by poor performance across all three non-forget sets.

Figure 3: Effect of varying offset strength on the model after δ-UNLEARNING with Gradient Ascent.

We present an example from the Forget Set in Tab. 4 to provide a better understanding of model behavior at different offset strength levels. At α=0.2, the model can perfectly reproduce the answer that is supposed to be forgotten, showing that unlearning is not yet taking effect. At α=0.5, the model is still capable of recalling the correct answer, albeit with slightly different phrasing. At α=1, we obtain a fluent but incorrect response, demonstrating successful unlearning. With high offset strength, the model starts to generate gibberish and eventually becomes unusable.

6 Conclusion

In this work, we propose δ-UNLEARNING, an offset unlearning framework applicable to black-box LLMs that does not require access to the model's internal weights. Instead of modifying model parameters, δ-UNLEARNING learns the logit offset needed to steer model behavior on the target forget set data. Experiments show that δ-UNLEARNING is on par with and sometimes even stronger than direct fine-tuning in terms of both forget quality and model utility. We also demonstrate that δ-UNLEARNING is compatible and effective when combined with various unlearning algorithms, thus providing a versatile solution to adapting existing algorithms to black-box LLMs.
Limitation

While δ-UNLEARNING does not require weight access to the LLM that we aim to unlearn, we still need at least white-box accessibility of the smaller offset model that can be paired with the target black-box LLM. Since we introduce additional offset models to facilitate adaptation of black-box LLMs, δ-UNLEARNING incurs higher inference latency. However, the relative increase in inference latency also diminishes as we apply δ-UNLEARNING to larger LLMs combined with the same offset model. We also notice a recent trend that proprietary model providers (e.g., GPT-4) are starting to restrict full logits access, which will make applying δ-UNLEARNING more challenging.

References

Vinayshekhar Bannihatti Kumar, Rashmi Gangadharaiah, and Dan Roth. 2023. Privacy adhering machine un-learning in NLP. In Findings of the Association for Computational Linguistics: IJCNLP-AACL 2023 (Findings), pages 268–277, Nusa Dua, Bali. Association for Computational Linguistics.

Kent Chang, Mackenzie Cramer, Sandeep Soni, and David Bamman. 2023. Speak, memory: An archaeology of books known to ChatGPT/GPT-4. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7312–7327, Singapore. Association for Computational Linguistics.

Jiaao Chen and Diyi Yang. 2023. Unlearn what you want to forget: Efficient unlearning for LLMs. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12041–12052, Singapore. Association for Computational Linguistics.

Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James R. Glass, and Pengcheng He. 2024. DoLa: Decoding by contrasting layers improves factuality in large language models. In The Twelfth International Conference on Learning Representations.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457.

Ronen Eldan and Mark Russinovich. 2023. Who's Harry Potter? Approximate unlearning in LLMs. arXiv preprint arXiv:2310.02238.

Geoffrey E. Hinton. 2002. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800.

Chris Jay Hoofnagle, Bart van der Sloot, and Frederik Zuiderveen Borgesius. 2019. The European Union General Data Protection Regulation: What it is and what it means. Information & Communications Technology Law, 28(1):65–98.

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations.

Joel Jang, Dongkeun Yoon, Sohee Yang, Sungmin Cha, Moontae Lee, Lajanugen Logeswaran, and Minjoon Seo. 2023. Knowledge unlearning for mitigating privacy risks in language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14389–14408, Toronto, Canada. Association for Computational Linguistics.

Antonia Karamolegkou, Jiaang Li, Li Zhou, and Anders Søgaard. 2023. Copyright violations and large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7403–7412, Singapore. Association for Computational Linguistics.

Xiang Lisa Li, Ari Holtzman, Daniel Fried, Percy Liang, Jason Eisner, Tatsunori Hashimoto, Luke Zettlemoyer, and Mike Lewis. 2023. Contrastive decoding: Open-ended text generation as optimization. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12286–12312, Toronto, Canada. Association for Computational Linguistics.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.

Alisa Liu, Xiaochuang Han, Yizhong Wang, Yulia Tsvetkov, Yejin Choi, and Noah A. Smith. 2024a. Tuning language models by proxy. arXiv preprint arXiv:2401.08565.

Alisa Liu, Maarten Sap, Ximing Lu, Swabha Swayamdipta, Chandra Bhagavatula, Noah A. Smith, and Yejin Choi. 2021. DExperts: Decoding-time controlled text generation with experts and anti-experts. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6691–6706, Online. Association for Computational Linguistics.

Bo Liu, Qiang Liu, and Peter Stone. 2022. Continual learning and private unlearning. In Proceedings of The 1st Conference on Lifelong Learning Agents, volume 199 of Proceedings of Machine Learning Research, pages 243–254. PMLR.

Qin Liu, Fei Wang, Nan Xu, Tianyi Yan, Tao Meng, and Muhao Chen. 2024b. Monotonic paraphrasing improves generalization of language model prompting. arXiv preprint arXiv:2403.16038.

Sijia Liu, Yuanshun Yao, Jinghan Jia, Stephen Casper, Nathalie Baracaldo, Peter Hase, Xiaojun Xu, Yuguang Yao, Hang Li, Kush R. Varshney, et al. 2024c. Rethinking machine unlearning for large language models. arXiv preprint arXiv:2402.08787.

Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary C. Lipton, and J. Zico Kolter. 2024. TOFU: A task of fictitious unlearning for LLMs. arXiv preprint arXiv:2401.06121.

Kevin Meng, Arnab Sen Sharma, Alex J. Andonian, Yonatan Belinkov, and David Bau. 2023. Mass-editing memory in a transformer. In The Eleventh International Conference on Learning Representations.

Tao Meng, Sidi Lu, Nanyun Peng, and Kai-Wei Chang. 2022. Controllable text generation with neurally-decomposed oracle. Advances in Neural Information Processing Systems, 35:28125–28139.

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, Brussels, Belgium. Association for Computational Linguistics.

Sewon Min, Suchin Gururangan, Eric Wallace, Weijia Shi, Hannaneh Hajishirzi, Noah A. Smith, and Luke Zettlemoyer. 2024. SILO language models: Isolating legal risk in a nonparametric datastore. In The Twelfth International Conference on Learning Representations.

Eric Mitchell, Rafael Rafailov, Archit Sharma, Chelsea Finn, and Christopher D. Manning. 2024. An emulator for fine-tuning large language models using small language models. In The Twelfth International Conference on Learning Representations.

Aitor Ormazabal, Mikel Artetxe, and Eneko Agirre. 2023. CombLM: Adapting black-box language models through small fine-tuned models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2961–2974, Singapore. Association for Computational Linguistics.

Vaidehi Patil, Peter Hase, and Mohit Bansal. 2024. Can sensitive information be deleted from LLMs? Objectives for defending against extraction attacks. In The Twelfth International Conference on Learning Representations.

Martin Pawelczyk, Seth Neel, and Himabindu Lakkaraju. 2023. In-context unlearning: Language models as few shot unlearners. arXiv preprint arXiv:2310.07579.

Fábio Perez and Ian Ribeiro. 2022. Ignore previous prompt: Attack techniques for language models. arXiv preprint arXiv:2211.09527.

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. WinoGrande: An adversarial Winograd Schema Challenge at scale. Communications of the ACM, 64(9).

Omar Shaikh, Hongxin Zhang, William Held, Michael Bernstein, and Diyi Yang. 2023. On second thought, let's not think step by step! Bias and toxicity in zero-shot reasoning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4454–4470, Toronto, Canada. Association for Computational Linguistics.

Weijia Shi, Xiaochuang Han, Mike Lewis, Yulia Tsvetkov, Luke Zettlemoyer, and Scott Wen-tau Yih. 2023. Trusting your evidence: Hallucinate less with context-aware decoding. arXiv preprint arXiv:2305.14739.

Robin Staab, Mark Vero, Mislav Balunovic, and Martin Vechev. 2024. Beyond memorization: Violating privacy via inference with large language models. In The Twelfth International Conference on Learning Representations.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

Lingzhi Wang, Tong Chen, Wei Yuan, Xingshan Zeng, Kam-Fai Wong, and Hongzhi Yin. 2023. KGA: A general machine unlearning framework based on knowledge gap alignment. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13264–13276, Toronto, Canada. Association for Computational Linguistics.

Xinwei Wu, Junzhuo Li, Minghui Xu, Weilong Dong, Shuangzhi Wu, Chao Bian, and Deyi Xiong. 2023. DEPN: Detecting and editing privacy neurons in pretrained language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2875–2886, Singapore. Association for Computational Linguistics.

Jin Yao, Eli Chien, Minxin Du, Xinyao Niu, Tianhao Wang, Zezhou Cheng, and Xiang Yue. 2024. Machine unlearning of pre-trained large language models. arXiv preprint arXiv:2402.15159.

Yuanshun Yao, Xiaojun Xu, and Yang Liu. 2023. Large language model unlearning. arXiv preprint arXiv:2310.10683.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, Florence, Italy. Association for Computational Linguistics.

Dawen Zhang, Pamela Finckenberg-Broman, Thong Hoang, Shidong Pan, Zhenchang Xing, Mark Staples, and Xiwei Xu. 2023. Right to be forgotten in the era of large language models: Implications, challenges, and solutions. arXiv preprint arXiv:2307.03941.

Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, and William Yang Wang. 2024. Weak-to-strong jailbreaking on large language models. arXiv preprint arXiv:2401.17256.
