Offset Unlearning for Large Language Models
James Y. Huang♢  Wenxuan Zhou♢  Fei Wang♢  Fred Morstatter♢  Sheng Zhang♠
Hoifung Poon♠  Muhao Chen♣♢
♢University of Southern California  ♠Microsoft Research  ♣University of California, Davis
{huangjam,zhouwenx,fwang598}@usc.edu  [email protected]
{shezhan,hoifung}@microsoft.com  [email protected]
ing the logits from different models or layers at decoding-time (Liu et al., 2021; Shi et al., 2023; Li et al., 2023; Chuang et al., 2024). Logit ensemble has also been shown to be an effective way of adapting LLMs to various downstream tasks. Ormazabal et al. (2023) propose to adapt LLMs to different domains through a learned combination with smaller domain experts. Mitchell et al. (2024) leverage an ensemble of different-sized models to study the effect of pretraining and fine-tuning at different scales. Concurrently, Liu et al. (2024a) propose Proxy-Tuning, which combines the logits from smaller tuned models with larger LLMs to enhance instruction-following capabilities. Liu et al. (2024b) ensemble the logits of a main LLM with a paraphrase model, yielding a monotonic prompt paraphraser for rewriting prompts with enhanced generalization effects. Zhao et al. (2024) use the logits from unsafe LLMs to guide the jailbreaking of safer LLMs during decoding. In this work, we propose to utilize smaller LLMs to capture the logit offset needed for unlearning sensitive data from black-box LLMs while maintaining general performance on out-of-forget-scope tasks.

3 Method

In this section, we formulate the unlearning problem (§3.1), discuss the technical details of our δ-UNLEARNING framework (§3.2), and highlight the strengths of δ-UNLEARNING compared to existing methods (§3.3).

3.1 Problem Definition

Given a target forget set Sf taken from the training data S of an LLM M, the goal of unlearning is to obtain a new model M′ that resembles a model trained without Sf. This implies that M′ should “forget” all information from the forget set without hurting performance on out-of-forget-scope data. Ideally, unlearning can be accomplished by retraining M on S \ Sf, i.e., the training set with the forget set data removed. However, given the prohibitive cost of retraining the LLM from scratch, it is generally more practical to approximate M′ by directly updating M. The unlearning problem can also optionally include a retain set Sr on which the model should not forget any information and should maintain its performance after unlearning.
3.2 Offset Unlearning

δ-UNLEARNING is based on the idea of a product-of-experts (Hinton, 2002) and its subsequent applications to ensembles of language models (Liu et al., 2021; Meng et al., 2022; Li et al., 2023). Fig. 1 provides an overview of δ-UNLEARNING.

Suppose we want to unlearn a forget set Sf from an LLM M. Instead of directly updating the parameters of M, we introduce a pair of smaller offset models Mo and Mo′. We define their logit offset as the difference between the logits of the two offset models Mo′ and Mo given the same query. For unlearning, we add this logit offset to the logits of M given the same query, essentially forming a logit ensemble of M, Mo′, and Mo. Both Mo and Mo′ are initialized from the same checkpoint, making the logit offset zero for all data initially. During unlearning, we only update the parameters of Mo′ while keeping M and Mo frozen, and use the logit ensemble to generate the final output. In this way, we encourage Mo′ to deviate from its initialization Mo given a sensitive query and to learn the correct logit offset to apply to the logits of M, steering its prediction away from generating sensitive information. Formally, the logits of the ensemble le are computed as follows:

l_e(y_t | q, y_{<t}) = l_M(y_t | q, y_{<t}) + α ( l_{M_o′}(y_t | q, y_{<t}) − l_{M_o}(y_t | q, y_{<t}) )

where l_M, l_{M_o′}, and l_{M_o} are the logits from the respective models, q is the query, and α is a factor controlling the strength of the offset term applied to M.
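Because the ensemble only manipulates output logits, it can be computed without access to the weights of M. The following minimal PyTorch sketch illustrates the combination; the tensor shapes and vocabulary size are arbitrary illustrative choices rather than our exact implementation.

```python
import torch

def ensemble_logits(l_m, l_o_prime, l_o, alpha=1.0):
    """delta-unlearning ensemble: l_e = l_M + alpha * (l_{M_o'} - l_{M_o}).
    All tensors share the shape (batch, seq_len, vocab_size)."""
    return l_m + alpha * (l_o_prime - l_o)

# Toy example with random tensors standing in for the three models' output logits.
batch, seq_len, vocab = 2, 5, 32000
l_m = torch.randn(batch, seq_len, vocab)        # frozen large model M
l_o_prime = torch.randn(batch, seq_len, vocab)  # trainable offset model M_o'
l_o = torch.randn(batch, seq_len, vocab)        # frozen offset model M_o

l_e = ensemble_logits(l_m, l_o_prime, l_o, alpha=1.0)
next_token_probs = torch.softmax(l_e[:, -1, :], dim=-1)  # distribution used for decoding
```

Before any unlearning update, Mo′ and Mo share identical weights, so the offset term is zero and decoding from le reproduces M exactly.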
Since the logits are in the log space, the additive combination of them can also be interpreted as the following product-of-experts:

P_e(y_t | q, y_{<t}) ∝ P_M(y_t | q, y_{<t}) · ( P_{M_o′}(y_t | q, y_{<t}) / P_{M_o}(y_t | q, y_{<t}) )^α

Essentially, the probability of each token predicted by M is scaled by the probability ratio between Mo′ and Mo, which reflects how Mo′ changes its token distribution relative to its initialization Mo after unlearning. Specifically, when querying non-sensitive, out-of-forget-scope information, the probability ratio between Mo′ and Mo should be close to one, making the token distribution of the ensemble similar to that of the original LLM M. When querying sensitive information that the model should forget, the token distribution of Mo′ differs from that of Mo to adjust the probability ratio, thus steering the token distribution of the ensemble away from that of M.
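The equivalence between the logit-space addition and this product-of-experts view can be checked numerically; the short snippet below uses toy logits and an arbitrary α purely for illustration.

```python
import torch

torch.manual_seed(0)
vocab, alpha = 10, 0.7
l_m, l_o_prime, l_o = torch.randn(3, vocab).unbind(0)  # toy logits for M, M_o', M_o

# Logit-space ensemble: softmax(l_M + alpha * (l_{M_o'} - l_{M_o})).
p_ensemble = torch.softmax(l_m + alpha * (l_o_prime - l_o), dim=-1)

# Product-of-experts form: P_M * (P_{M_o'} / P_{M_o})^alpha, renormalized.
p_m, p_o_prime, p_o = (torch.softmax(l, dim=-1) for l in (l_m, l_o_prime, l_o))
poe = p_m * (p_o_prime / p_o) ** alpha
poe = poe / poe.sum()

print(torch.allclose(p_ensemble, poe, atol=1e-5))  # True
```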
During training, we optimize any unlearning objective on the prediction of the ensemble instead of on the original model M. For example, to unlearn the model using Gradient Ascent (Jang et al., 2023; Chen and Yang, 2023), where the objective is to minimize the likelihood of forget set data, we maximize the following loss function for instance i of output length l:

L_e^i = − (1/l) Σ_{t=1}^{l} log P_e(y_t | q, y_{<t})
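The sketch below illustrates one such Gradient Ascent step, assuming Hugging Face-style causal LMs and standard next-token label shifting; only Mo′ receives gradients, while M and Mo are queried without gradient tracking. The function and batch layout are illustrative rather than our exact implementation.

```python
import torch
import torch.nn.functional as F

def ga_unlearning_step(large_frozen, offset_frozen, offset_prime, optimizer, batch, alpha=1.0):
    """One Gradient Ascent step on the ensemble; only M_o' (offset_prime) is updated."""
    input_ids, attn, labels = batch["input_ids"], batch["attention_mask"], batch["labels"]

    with torch.no_grad():  # frozen models M and M_o: no gradients needed
        l_m = large_frozen(input_ids=input_ids, attention_mask=attn).logits
        l_o = offset_frozen(input_ids=input_ids, attention_mask=attn).logits
    l_o_prime = offset_prime(input_ids=input_ids, attention_mask=attn).logits

    # Ensemble logits l_e = l_M + alpha * (l_{M_o'} - l_{M_o}).
    l_e = l_m + alpha * (l_o_prime - l_o)

    # Token-level NLL of the forget-set answer under the ensemble (shifted for next-token prediction).
    shift_logits, shift_labels = l_e[:, :-1, :], labels[:, 1:]
    nll = F.cross_entropy(shift_logits.reshape(-1, shift_logits.size(-1)),
                          shift_labels.reshape(-1), ignore_index=-100)

    # Gradient Ascent: maximize the NLL (i.e., minimize the likelihood) of forget data,
    # implemented by minimizing its negation with a standard optimizer.
    loss = -nll
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return nll.item()
```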
facilitate unlearning for black-box LLMs, the com-
Essentially, the probability of each token pre- putational overhead for training is minimal since
dicted by M is scaled by the probability ratio be- the logits of the two frozen models M and Mo
tween Mo′ and Mo , which reflects how Mo′ changes can be pre-computed in one forward pass prior to
its token distribution relative to its initialization unlearning. This leads to an overall reduction in
Mo after unlearning. Specifically, when query- training time as δ-U NLEARNING tunes fewer pa-
ing non-sensitive, out-of-forget-scope information, rameters than direct fine-tuning.
the probability ratio between Mo′ and Mo should Version Control. δ-U NLEARNING also facilitates
be close to one, making the token distribution of more efficient version control and user customiza-
the ensemble similar to that of the original LLM tion, as instead of storing multiple versions of the
M . When querying sensitive information that the larger model, we only need to keep track of a pool
model should forget, the token distribution of Mo′ of smaller models. These models can be combined
differs from that of Mo to adjust the probability with the same base LLM in a plug-and-play manner
ratio, thus steering the token distribution of the for different applications.
ensemble away from that of M .
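The sketch below illustrates this pre-computation under the same assumed model interface as above; caching full logit tensors on CPU is an illustrative choice rather than a prescribed detail.

```python
import torch

@torch.no_grad()
def precompute_frozen_logits(model, batches, device="cuda"):
    """Run a frozen model (M or M_o) once over the unlearning data and cache its logits."""
    model.eval()
    cache = []
    for batch in batches:  # each batch: dict with "input_ids" and "attention_mask"
        out = model(input_ids=batch["input_ids"].to(device),
                    attention_mask=batch["attention_mask"].to(device))
        cache.append(out.logits.float().cpu())
    return cache

# One-time cost before unlearning starts; M and M_o never change afterwards:
#   cached_m = precompute_frozen_logits(large_model, train_batches)
#   cached_o = precompute_frozen_logits(offset_init, train_batches)
# Each subsequent training step then only needs a forward/backward pass through M_o':
#   l_e = cached_m[i].to(device) + alpha * (offset_prime(**inputs_i).logits - cached_o[i].to(device))
```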
Version Control. δ-UNLEARNING also facilitates more efficient version control and user customization: instead of storing multiple versions of the larger model, we only need to keep track of a pool of smaller models. These models can be combined with the same base LLM in a plug-and-play manner for different applications.

4 Experiment

In this section, we provide a description of the evaluation setting (§4.1), a summary of the baseline unlearning algorithms to which we apply our framework as well as other implementation details (§4.2), and the main results (§4.3).

4.1 Evaluation Setting

We conduct our experiments on TOFU (Maini et al., 2024), an unlearning benchmark specifically designed for evaluating LLMs. The benchmark defines an unlearning task that targets QA pairs derived from a collection of fictitious author profiles that do not exist in the real world.
Method | Forget Set: RL (↓), P (↓), TR | Retain Set: RL, P, TR | Real Author: RL, P, TR | World Fact: RL, P, TR
Before Unlearning 95.6 98.3 49.5 96.3 97.9 51.2 85.2 44.5 55.7 87.7 42.5 56.3
Retraining 38.9 15.2 65.6 95.8 97.7 50.4 89.5 45.8 58.5 85.5 43.0 57.4
Gradient Ascent
Direct Fine-tuning 38.8 3.4 53.3 51.2 8.0 51.6 52.3 43.9 58.3 80.2 44.6 60.6
δ-UNLEARNING 38.6 15.2 57.9 41.0 26.1 48.9 75.0 45.3 57.4 82.1 47.0 63.7
Gradient Difference
Direct Fine-tuning 38.9 2.1 51.9 56.8 58.9 55.1 61.4 35.0 47.9 80.4 38.9 53.7
δ-UNLEARNING 38.1 6.2 52.5 53.4 47.8 51.9 60.6 36.1 45.9 83.2 41.3 59.1
KL Minimization
Direct Fine-tuning 39.8 3.1 53.4 53.0 8.4 51.0 55.8 42.2 56.4 83.3 43.3 58.8
δ-UNLEARNING 39.6 14.1 57.5 46.1 27.9 50.9 80.4 45.1 57.5 84.9 46.3 64.0
Data Relabeling
Direct Fine-tuning 38.1 92.5 53.3 85.0 95.3 48.0 82.5 38.0 46.3 87.7 39.2 49.2
δ-UNLEARNING 36.3 91.5 50.8 72.4 95.1 49.6 78.7 41.5 52.6 86.9 42.3 55.5
Table 2: Results on TOFU. We report ROUGE-L recall (RL), Probability (P), and Truth Ratio (TR) on all four
subsets of the TOFU benchmark. Higher scores are better except ROUGE and probability on the Forget Set. Better
scores are underlined for each of the four unlearning strategies.
This creates a clean unlearning setting with a well-defined unlearning scope and easy control over the source of knowledge. Since, by construction, none of the answers in the forget set of TOFU is known by any LLM, the standard procedure is to first fine-tune the LLM on the forget set before unlearning. TOFU consists of the following four subsets, covering different aspects of unlearning performance:

• Forget Set consists of examples about a small subset of 10 fake authors that we aim to unlearn.

• Retain Set consists of examples about the remaining 190 fake authors that the model must remember after unlearning.

• Real Author consists of examples about real authors. The model should retain all knowledge it had about real authors before and after unlearning.

• World Fact consists of examples about general world knowledge. Similar to the Real Author set, the model should retain all knowledge it had about real-world facts before and after unlearning.

The Forget Set evaluates forget quality, i.e., how well the model removes target information from its memory, while the latter three sets focus on model utility, an indicator of how well the model maintains its performance on out-of-forget-scope data. The latter three sets also represent a series of out-of-forget-scope data with decreasing levels of relevance to the forget set. Generally speaking, it is more challenging for a model to remember out-of-forget-scope data that are more relevant to the forget set, a phenomenon known as knowledge entanglement (Maini et al., 2024).

We follow the settings outlined in TOFU and report the following three metrics. ROUGE measures how well the generated output from the LLM matches the correct answer; specifically, we use the ROUGE-L recall score (Lin, 2004). Probability computes the conditional probability of the correct answer given the prompt. Truth Ratio measures how likely the correct answer is compared to a collection of wrong answers perturbed from the correct answer. Since the model is fine-tuned on one specific phrasing of the correct answer, and thus potentially assigns it inflated probability compared to other phrasings with similar meanings, Truth Ratio is computed using a paraphrased version of the original correct answer on the forget set and retain set. Following the original evaluation pipeline, we normalize Truth Ratio so that a higher truth ratio indicates better unlearning performance on any of the four sets we report on. For ROUGE and probability scores, a good model should have lower values on the forget set but higher values on the other three sets.
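Concretely, ROUGE-L recall is the length of the longest common subsequence (LCS) between the reference answer and the generated answer, divided by the reference length. The toy sketch below uses simple whitespace tokenization; the actual evaluation relies on a standard ROUGE implementation.

```python
def rouge_l_recall(reference: str, generated: str) -> float:
    """LCS length between reference and generated tokens, divided by reference length."""
    ref, gen = reference.split(), generated.split()
    # Dynamic-programming table for the longest common subsequence.
    dp = [[0] * (len(gen) + 1) for _ in range(len(ref) + 1)]
    for i, r_tok in enumerate(ref, start=1):
        for j, g_tok in enumerate(gen, start=1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if r_tok == g_tok else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(ref)][len(gen)] / len(ref) if ref else 0.0

# An answer that no longer matches the reference after unlearning yields low recall.
print(rouge_l_recall("The author was born in 1975 in Oslo",
                     "I don't have that information"))  # 0.0
```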
As we will demonstrate in §5.1, there is generally a trade-off between forget quality and model utility. For example, a model can have a near-zero ROUGE score on the forget set but be completely unusable if it always outputs gibberish given any prompt. Hence, we need to determine a target forget quality as a stopping criterion to facilitate direct comparison between different unlearning methods. In our experiments, we use the ROUGE score of the retraining baseline on the forget set as the target, since retraining corresponds to an ideal scenario where the model has never been exposed to the forget set. Following Yao et al. (2024), we match all models to the target score by adjusting the learning rate.

In addition to TOFU, we assess whether the unlearned model preserves general utilities on well-established benchmarks, including ARC (Clark et al., 2018), HellaSwag (Zellers et al., 2019), WinoGrande (Sakaguchi et al., 2021) and OpenBookQA (Mihaylov et al., 2018).

Method           ARC    HS     WG     OBQA
Grad. Asc.
  Direct FT      39.9   56.4   65.2   34.4
  δ-UNLEARNING   42.2   56.3   65.7   32.8
Grad. Diff.
  Direct FT      40.4   56.3   64.9   32.6
  δ-UNLEARNING   40.9   55.7   65.2   35.4
KL Min.
  Direct FT      39.2   56.5   65.0   34.0
  δ-UNLEARNING   43.7   57.2   66.9   34.4
Data Rel.
  Direct FT      43.5   57.9   68.9   34.6
  δ-UNLEARNING   44.2   58.0   68.0   34.8

Table 3: Results on general task performance (ARC, HellaSwag (HS), WinoGrande (WG), OpenBookQA (OBQA)). Better scores are underlined for each of the four unlearning strategies.
4.2 Model Configuration
Unlearning Algorithms. δ-UNLEARNING is a general unlearning framework compatible with different existing unlearning algorithms. We compare δ-UNLEARNING with its corresponding direct fine-tuning baseline when incorporated with each of the following commonly used unlearning algorithms. Gradient Ascent (Jang et al., 2023; Chen and Yang, 2023) minimizes the likelihood of the forget set. Gradient Difference (Liu et al., 2022; Yao et al., 2023) minimizes forget set likelihood while maximizing retain set likelihood. KL Minimization (Maini et al., 2024) penalizes the distributional distance between models before and after unlearning. Data Relabeling (Eldan and Russinovich, 2023) trains the model on forget set questions paired with an alternative answer that abstains from answering the question, such as “I don’t have that information.” We also include the Retraining baseline, which fine-tunes the initial model with the forget set excluded and serves as the upper bound in terms of balancing forget quality and model utility.
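To make these objectives concrete, the sketch below writes each of them as a token-level loss over the ensemble logits from §3.2; the loss weightings and the direction of the KL term are common formulations chosen for illustration and may differ in detail from the cited implementations.

```python
import torch
import torch.nn.functional as F

def nll(logits, labels):
    """Mean token-level negative log-likelihood; positions labeled -100 are ignored."""
    return F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100)

def gradient_ascent_loss(forget_logits, forget_labels):
    # Maximize NLL on the forget set (i.e., minimize its likelihood).
    return -nll(forget_logits, forget_labels)

def gradient_difference_loss(forget_logits, forget_labels, retain_logits, retain_labels):
    # Push likelihood down on the forget set and up on the retain set.
    return -nll(forget_logits, forget_labels) + nll(retain_logits, retain_labels)

def kl_minimization_loss(forget_logits, forget_labels, retain_logits, ref_retain_logits):
    # Unlearn the forget set while keeping retain-set predictions close to the
    # pre-unlearning model's distribution (KL between reference and current predictions).
    kl = F.kl_div(F.log_softmax(retain_logits, dim=-1),
                  F.log_softmax(ref_retain_logits, dim=-1),
                  log_target=True, reduction="batchmean")
    return -nll(forget_logits, forget_labels) + kl

def data_relabeling_loss(forget_logits, relabeled_labels):
    # Standard fine-tuning toward an abstaining answer such as "I don't have that information."
    return nll(forget_logits, relabeled_labels)
```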
Implementation. We run our experiments on the widely used Llama2 model family (Touvron et al., 2023). Specifically, we use Llama2-13b-chat-hf as the larger model and Llama2-7b-chat-hf as the smaller offset model. All models are trained with low-rank adaptation (Hu et al., 2022) on a single NVIDIA A100 GPU for 5 epochs with a batch size of 32. We set α to 1 for our experiments.
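The sketch below shows one way this configuration could be assembled with Hugging Face Transformers and PEFT; the hub model identifiers, LoRA rank, and target modules are illustrative assumptions rather than our exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Frozen stand-in for the black-box LLM M (only its logits are needed) and frozen offset model M_o.
large = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-chat-hf",
                                             torch_dtype=torch.bfloat16)
offset_frozen = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf",
                                                     torch_dtype=torch.bfloat16)
for p in list(large.parameters()) + list(offset_frozen.parameters()):
    p.requires_grad_(False)

# Trainable offset model M_o', initialized from the same 7B checkpoint and tuned with LoRA.
offset_prime = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf",
                                                    torch_dtype=torch.bfloat16)
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
offset_prime = get_peft_model(offset_prime, lora_cfg)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
ALPHA = 1.0  # offset strength used in our experiments
```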
4.3 Main Results

Our experimental results on TOFU are shown in Tab. 2. The model before unlearning exhibits strong memorization over both the forget set and retain set, indicated by high ROUGE and probability scores and a relatively low truth ratio. This is as expected, since the model is explicitly trained on the full dataset of fake authors to simulate the model's exposure to private information. Retraining significantly reduces the model's knowledge on the forget set while maintaining model utility similar to that before unlearning. Although retraining would not be feasible in real-world scenarios, its performance gives us a better understanding of the gap between exact unlearning and post hoc approximate unlearning methods.

We first examine the forget quality of different post hoc unlearning methods on the Forget Set. As shown in Tab. 2, both direct fine-tuning and δ-UNLEARNING can reach a level of unlearning similar to retraining in terms of ROUGE score of the generated response. Although direct fine-tuning
Figure 2: Training trajectory of Gradient Ascent using direct fine-tuning (left), δ-UNLEARNING (middle), and the tradeoff curve between forget quality and model utility (right). For training trajectories we report ROUGE scores on all four TOFU subsets. For the tradeoff curve we report Forget Set ROUGE versus Non-forget Set ROUGE score.
Table 4: Example response by δ-UNLEARNING on the Forget Set with varying offset strength during inference.