
In-Context Learning Creates Task Vectors

Roee Hendel (Tel Aviv University), Mor Geva (Google DeepMind), Amir Globerson (Tel Aviv University, Google)

arXiv:2310.15916v1 [cs.CL] 24 Oct 2023

Abstract

In-context learning (ICL) in Large Language Models (LLMs) has emerged as a powerful new learning paradigm. However, its underlying mechanism is still not well understood. In particular, it is challenging to map it to the "standard" machine learning framework, where one uses a training set S to find a best-fitting function f(x) in some hypothesis class. Here we make progress on this problem by showing that the functions learned by ICL often have a very simple structure: they correspond to the transformer LLM whose only inputs are the query x and a single "task vector" calculated from the training set. Thus, ICL can be seen as compressing S into a single task vector θ(S) and then using this task vector to modulate the transformer to produce the output. We support the above claim via comprehensive experiments across a range of models and tasks.¹

¹ We release our code at https://github.com/roeehendel/icl_task_vectors.

Figure 1: ICL as learning in a Hypothesis Class. In ICL, one provides an LLM with a prompt including demonstrations S of some task, and a query x. The model generates the output for x (here "Yellow"). We show that the underlying process can be broken down into two parts: A, a "learning algorithm" (marked in blue), computes a query-agnostic vector θ(S), which we view as a parameter of a function in a hypothesis class. The second part, denoted by f and marked in yellow, is the application of the rule defined by θ on the query x, without direct dependence on S.

1 Introduction

Large language models have improved dramatically over the last several years. One striking property of these models is that they can learn new rules from very few demonstrations. For instance, a model can be prompted with the input "Apple → Red, Lime → Green, Corn →" and produce the output "Yellow". The model has thus learned a mapping based on just two examples, which it can apply correctly to new examples. This capability, referred to as In-Context Learning (ICL), has been used extensively, yielding impressive empirical results (Brown et al., 2020; Liu et al., 2023; Dong et al., 2022).

Given this success, it is natural to ask what is the underlying mechanism behind ICL. Namely, how does the model internally use the demonstrations S and the query x to produce the required output?

Here we approach this question by utilizing the concept of a hypothesis class from statistical learning theory (Shalev-Shwartz and Ben-David, 2014). In the learning-theoretic formulation, one typically considers a hypothesis class H, where every element of H is a function h(x; θ), operating on the input x, and specified by a parameter vector θ. For example, if x ∈ R^d then the class H could be the set of linear classifiers, defined by a coefficient vector θ as h(x; θ) = θ · x. Learning algorithms seek an element h ∈ H that fits the training set well. This is known as Empirical Risk Minimization.
It is unclear whether ICL operates in such a way because the prediction is performed via T([S, x]), where T is typically an auto-regressive transformer and [S, x] is a concatenation of the tokens in S and x. Thus, in the general case, it can be an arbitrary function that operates on S and x to produce the output. This can include "non-parametric" methods such as nearest-neighbor. Recent work has begun to explore this question. For example, it was shown that when training a transformer from scratch to perform linear regression in context, the emerging learning algorithm is similar to Stochastic Gradient Descent (Akyürek et al., 2022; von Oswald et al., 2022). However, for LLMs performing more complex natural language tasks, it is not at all clear what the hypothesis space may be.

In this work, we show that on a wide range of tasks, ICL in LLMs can be viewed as working on a very natural hypothesis space. We argue that, given a training set S, the transformer maps it into a "task vector" θ(S) that essentially represents the mapping/rule described in S.² Namely, given the transformer T and a vector θ, we can construct a new function f(x; θ) that implements the task. The function f is very similar to the original transformer applied to x without demonstrations but instead modulated by θ (see Fig. 2).

² The term "task vector" was coined by Ilharco et al. (2023) for directions in weight space that correspond to a particular task. Although our vectors are in "activations space" they share a similar motivation and thus we overload the term.

Our view is also related to soft prompts (Lester et al., 2021), since both approaches modulate the function of the transformer towards a particular task. However, in ICL, task vectors are calculated in the forward pass rather than being fine-tuned.

Our contributions include proposing a hypothesis-class based mechanistic view of ICL, and conducting experiments to validate our view on a range of publicly available LLMs and a diverse set of tasks. Our results further the understanding of ICL and may have practical implications for the efficient adaptation of LLMs to perform specific tasks.

2 A Hypothesis Class View of ICL

Motivated by the hypothesis class view of learning theory, our goal is to understand if ICL maps the set of demonstrations S to a function on the query x and how this mapping occurs. Specifically, we seek to see if ICL converts S into θ - the "parameters" of a function within a certain hypothesis space. Our empirical findings suggest this view is applicable, shedding light on the structure of the hypothesis space on which ICL can be viewed to operate.

2.1 Theoretical Framework

We use T to denote a decoder-only transformer LLM, S to denote the set of demonstrations (i.e. training examples) used as input to ICL, and x to denote the query that ICL is asked to provide an output for. We use T([S, x]) to denote the output of ICL on the concatenation of S and x.

To demonstrate that ICL operates within a hypothesis space, we aim to show that its underlying mechanism can be broken down into two parts:

• A "Learning Algorithm" (denoted by A) that maps S into a "task vector" θ, independent of the query x. Given that attention layers can access both S and x, this independence is not trivial.

• A "Rule Application" (denoted by f) which maps the query x to the output, based on θ ≡ A(S), without direct dependence on S. Again, this independence is not trivial.

Thus, we consider the following mapping from a set of demonstrations and a query to the predicted output: T([S, x]) = f(x; A(S)).

If we can break down the forward pass of the LLM into the above two components, we can view ICL as operating on the following hypothesis class: H = {f(·; θ) | θ}. In the next section we propose an implementation of such a class.

2.2 A Proposed Hypothesis Class

There are many possible realizations of the above framework, that correspond to different choices of A and f. We next describe the realization we focus on, which naturally follows from the transformer architecture. We consider an ICL setting as in Fig. 1, where the input ends with a query x (i.e., Corn) followed by an "→" symbol. As mentioned above, we view learning as composed of two steps: calculating a parameter vector θ based on the training sample S, and applying the rule defined by this parameter vector to the query x. A presumably simple way for a transformer to do this is for the first L layers of the → representations to calculate θ and then for the remaining layers to take θ and x as input and produce an output. See Fig. 1. Recall that S and x are accessible to the transformer at any layer, presenting a challenge with our view.

In the following sections, we address this challenge and present experiments validating our view. Namely, we show that we can isolate our proposed A and f in the forward pass of LLMs performing ICL. We also show that the θ vectors are interpretable and correspond to learned tasks.
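As a concrete point of reference for this view, the sketch below shows one way to read a candidate task vector off a decoder-only LM: run it on a prompt ending with the "→" token and take that token's hidden state at an intermediate layer L. This is a minimal illustration using the Hugging Face transformers API, not the authors' released code; the model name, prompt format, dummy query, and choice of layer are placeholder assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; the paper uses LLaMA, GPT-J and Pythia models.
MODEL_NAME = "EleutherAI/pythia-2.8b"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def task_vector(demonstrations: str, dummy_query: str, layer: int) -> torch.Tensor:
    """A(S): hidden state of the final '→' token after `layer` layers,
    computed with a dummy query so that θ does not depend on the real query x."""
    prompt = f"{demonstrations} {dummy_query} →"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[layer] has shape (batch, seq_len, dim); the last position is '→'.
    return out.hidden_states[layer][0, -1]

# Example: demonstrations for the color task of Fig. 1, with 'Plum' as the dummy query.
theta = task_vector("Apple → Red, Lime → Green,", "Plum", layer=14)
print(theta.shape)
```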
Category | Task | Example
Algorithmic | Next letter | a → b
Algorithmic | List first | a,b,c → a
Algorithmic | List last | a,b,c → c
Algorithmic | To uppercase | a → A
Translation | French to English | bonjour → hello
Translation | Spanish to English | hola → hello
Linguistic | Present to gerund | go → going
Linguistic | Singular to plural | cat → cats
Linguistic | Antonyms | happy → sad
Knowledge | Country to Capital | France → Paris
Knowledge | Person to Language | Macron → French

Table 1: A representative subset of the tasks used in the study with input → output examples.

Figure 2: Separating A and f. To make θ independent of the query x, we use a dummy query (x′ = Plum) and use the representation of → at the Lth layer as θ. The vector θ is then patched at the same layer during a forward pass of a transformer that only takes x and → as input, to prevent the direct dependence of f on S.

3 Validity of the Hypothesis Class View

We first show that separating the forward pass into the two distinct components A and f, defined in §2.2, maintains the high accuracy of ICL.

3.1 Separating A and f

We face some challenges in a regular forward pass: first, the initial L layers that correspond to A, updating the representations of → to create θ, can attend to the query x. Thus, they may depend on x, creating an unwanted dependence of θ on x. Second, the remaining layers that correspond to f, may directly access S, instead of using only x and θ.

We propose the following procedure to tackle these challenges: to solve the first problem, we introduce a "dummy query" x′ and calculate the representations of → using that query. We use the representation of → after the first L layers, calculated using x′, as the vector θ (as demonstrated on the left side of Fig. 2). An alternative was to block attention to x, but it led to poor performance. To solve the second problem of calculating f(x, θ) without allowing direct dependence on S, we perform a forward pass of the transformer only on x and →,³ and "patch" the θ we previously extracted at the Lth layer of the → (right side of Fig. 2).⁴

³ Ignoring positional embeddings, this is equivalent to blocking the attention to S in these layers.
⁴ Note that the second token can actually be anything, because it is overridden by patching. We use → for simplicity.

3.2 Tasks and Models

Tasks. We consider a diverse set of 18 tasks across 4 categories: algorithmic, translation, linguistic, and factual knowledge. For simplicity, we limit ourselves to single-token outputs. A representative subset of the tasks is described in Tab. 1. A complete detailed table, as well as more information regarding the data, are provided in § A.1.

Models. We use multiple open LLMs: LLaMA 7B, 13B, and 30B (Touvron et al., 2023), GPT-J 6B (Wang and Komatsuzaki, 2021), and Pythia 2.8B, 6.9B, and 12B (Biderman et al., 2023).

3.3 Finding L

The mechanism we described in §2.2 has a free parameter - the layer L where A ends and f begins. We use the proposed (A, f) implementation for different choices of L and evaluate the accuracy on a development set to find the best layer.

Fig. 3 shows the accuracy on the development set, for different choices of L. We focus here on the LLaMA models and include the rest in § A.2. Interestingly, all models exhibit a performance peak at a similar intermediate layer, irrespective of their parameters and layer count differences.

Figure 3: Accuracy for each choice of the intermediate layer L, averaged across all tasks. Solid lines show average values, and shaded areas standard deviations.
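The extract-and-patch procedure of §3.1, combined with the layer sweep of §3.3, could be sketched as follows. This is an illustrative implementation, not the authors' released code: the model name, prompt format, toy development set, and helper names are assumptions, and hook-based patching is only one of several ways to overwrite the "→" representation at layer L.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; for LLaMA the decoder layers live under model.model.layers.
MODEL_NAME = "EleutherAI/pythia-2.8b"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()
decoder_layers = model.gpt_neox.layers

def extract_theta(demos: str, dummy_query: str, L: int) -> torch.Tensor:
    """A(S): representation of the final '→' after the first L layers, using a dummy query x'."""
    ids = tokenizer(f"{demos} {dummy_query} →", return_tensors="pt")
    with torch.no_grad():
        hs = model(**ids, output_hidden_states=True).hidden_states
    return hs[L][0, -1]  # hs[0] is the embedding output, hs[L] the output of layer L

def predict_with_theta(query: str, theta: torch.Tensor, L: int) -> str:
    """f(x; θ): run the model on '[x, →]' only, patching θ at layer L of the '→' position."""
    def patch(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[0, -1] = theta  # overwrite the last ('→') position
        return output
    handle = decoder_layers[L - 1].register_forward_hook(patch)
    try:
        ids = tokenizer(f"{query} →", return_tensors="pt")
        with torch.no_grad():
            logits = model(**ids).logits
    finally:
        handle.remove()
    return tokenizer.decode(int(logits[0, -1].argmax()))

# Finding L (Sec. 3.3): sweep candidate layers and score on a tiny development set.
dev_set = [("Apple → Red, Lime → Green,", "Plum", "Corn", "Yellow")]  # (S, x', x, gold)
for L in range(1, len(decoder_layers) + 1):
    correct = 0
    for demos, dummy, query, gold in dev_set:
        theta = extract_theta(demos, dummy, L)
        correct += predict_with_theta(query, theta, L).strip() == gold
    print(L, correct / len(dev_set))
```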
Figure 4: Average accuracy across all tasks for each model, using each of the three procedures: Baseline, Regular and Hypothesis.

3.4 Accuracy of Hypothesis Based Prediction

We next compare the accuracy of the (A, f) mechanism to that of a regular forward pass performing ICL. For each model and task, we evaluate the following three procedures:

• Regular: An application of the LLM to the demonstrations S and query x. Namely T([S, x]), as in regular ICL.

• Hypothesis: Our proposed procedure from § 3.1 where A generates θ using a dummy x′, and f(·; θ) is applied to x by running the transformer on [x, →] with θ patched at layer L of →.

• Baseline: A forward pass of the LLM only on x, without demonstrations S. That is, T([x, →]). This is the same as the application of f from our separated procedure, but without patching θ.

Fig. 4 shows the average accuracy across all tasks of these 3 procedures, for each model. Full results are reported in Tab. 6 in § A.2. Across all models, our procedure maintains around 80-90% of the accuracy of regular ICL, while the baseline reaches only 10-20%. This shows that our proposed separation to A and f provides a good empirical approximation of the process underlying ICL.

4 Robustness of Task Vectors

In our setting, θ is derived from S and a dummy query x′. It is natural to examine the robustness of θ to variations in these inputs. Intuitively, if it represents the task, it should remain stable across different S and x′ values.

To test this, we use LLaMA 7B to generate 50 task vectors per task with varied S and x′ and conduct two analyses.

Geometry of θ. A t-SNE dimensionality reduction (Fig. 5) reveals that the task vectors form distinct clusters, each containing task vectors of a single task. Fig. 9 further shows proximity between tasks of the same category, strengthening the idea that they encapsulate task understanding.

Figure 5: A t-SNE plot of task vectors. A 2D t-SNE plot visualizing 50 task vectors for each task, each generated from a different choice of S and x′ using LLaMA 7B. Points are color-coded according to the task. Each task can be seen to form its own distinct cluster.

Variability of θ. Fig. 8 shows histograms of distances within and across tasks. It can be seen that vectors within the same task are closer than those between different tasks, indicating that θ is stable within tasks and not highly influenced by x′ or S.

5 Dominance of θ Patching

In §3 we prevented f from directly accessing S. However, in a regular forward pass during ICL, the last token can attend to S. Here we verify that even in this case, f mainly uses the task vector θ, without directly accessing the demonstrations S. To this end, we use a pair of tasks, A and B, sharing the input space but differing on the output. We first use a "Regular" forward pass, where we provide the model with demonstrations S for task A (denoted S_A), to verify the model can perform this task using ICL. Then, we do a "Conflicting" forward pass, still providing S_A, while injecting θ_B. For more details, refer to Fig. 6 in §A.1.
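Before turning to the results in Tab. 2, here is a minimal sketch of the "Conflicting" forward pass: the prompt still contains demonstrations for task A, but the "→" representation at layer L is overwritten with a task vector computed from demonstrations of task B. The model name, layer choice, and helper names are placeholder assumptions rather than the authors' implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/pythia-2.8b"  # placeholder model, as in the earlier sketches
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()
layers = model.gpt_neox.layers
L = 14  # hypothetical choice of the intermediate layer

def theta_of(demos: str, dummy_query: str) -> torch.Tensor:
    """Task vector from demonstrations, computed with a dummy query (A(S) as in Sec. 3.1)."""
    ids = tokenizer(f"{demos} {dummy_query} →", return_tensors="pt")
    with torch.no_grad():
        return model(**ids, output_hidden_states=True).hidden_states[L][0, -1]

def conflicting_pass(demos_A: str, query: str, theta_B: torch.Tensor) -> str:
    """Keep task-A demonstrations in the prompt, but overwrite the '→' state at layer L with θ_B."""
    def patch(_m, _i, out):
        (out[0] if isinstance(out, tuple) else out)[0, -1] = theta_B
        return out
    handle = layers[L - 1].register_forward_hook(patch)
    try:
        ids = tokenizer(f"{demos_A} {query} →", return_tensors="pt")
        with torch.no_grad():
            logits = model(**ids).logits
    finally:
        handle.remove()
    return tokenizer.decode(int(logits[0, -1].argmax()))

# Task A: previous letter; Task B: next letter (cf. Fig. 6).
theta_B = theta_of("a → b, q → r,", "m")          # task vector for 'next letter'
print(conflicting_pass("b → a, r → q,", "c", theta_B))
# The claim tested in Tab. 2: the output tends to follow task B ('d'), not task A ('b').
```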
Task A (S) | Task B (θ) | Regular (Task A) | Conflicting (Task B)
Next Letter | To Upper | 0.92 | 0.77
List Last | List First | 0.95 | 0.78
Present to Past | To Gerund | 0.96 | 0.95

Table 2: Conflicting tasks experiment results. The model's accuracy on the relevant task (A in "Regular" and B in "Conflicting") is displayed for both scenarios.

In Tab. 2, the "Regular" forward pass shows high accuracy on task A (90%+), as anticipated. However, the "Conflicting" forward pass yields high accuracy on task B, corresponding to the injected task vector θ. This implies that the model mainly relies on θ, largely disregarding the demonstrations S for task A. We note that the accuracy on task B is slightly low, likely consistent with the performance dip seen in Fig. 6, and potentially further affected by the presence of S.

6 Interpreting θ

The learned vector θ intuitively captures information about the task demonstrated by S. Here we provide evidence supporting this interpretation. Since θ is an intermediate hidden state of the transformer, we can employ a vocabulary projection method (nostalgebraist, 2020; Dar et al., 2022). Namely, we examine the top tokens in the distribution over the vocabulary induced by the hidden state.

Tab. 3 shows the top tokens for three tasks for LLaMA 13B (more models and tasks are provided in Tab. 7 in §A). In multiple cases, we observe tokens that directly describe the task. Importantly, these terms never explicitly appeared in the context. For example in the task of translation from French to English, we observe tokens such as "English" and "translate". This supports our view that θ carries significant, non-trivial semantic information about the task.

Task | Top tokens in the task vector projection
Previous Letter | e, y, unknown, alphabet, preceding, c, Cad, zA, dit, bill
FR-EN | Mason, gram, immer, Santi, latin, utter, Span, Conc, English, equivalent
Present Simple to Gerund | cin, thats, gram, Lorenzo, cian, Isabel, uld, berto, partici, Sah
Country Capital | Paris, its, capital, central, Conc, cities, administrative, Los, Madrid, London

Table 3: The top 10 tokens in the distribution induced by the task vector, for one task per category.

7 Related Work

Emergence of ICL. A key question with ICL is how it emerges as a capability from pre-training the LLMs. Levine et al. (2022) provides results in this direction that highlight the importance of training data structure. Xie et al. use probabilistic analysis and model pre-training data using Hidden Markov Models to theoretically explain the emergence of ICL, while Chan et al. (2022) empirically explore the effect of several distributional properties of the pre-training data.

Meta-Learning in Transformers. Studies by Akyürek et al. (2022); von Oswald et al. (2022); Garg et al. focus on the meta-learning capabilities of transformers. They typically train models from scratch on elementary tasks such as linear regression, drawing theoretical parallels with algorithms like Gradient Descent and demonstrating how transformers could implement them. A key assumption of these works is a known parameter space within which gradient descent operates. Our work focuses on identifying such a parameter space for LLMs.

ICL in LLMs. Olsson et al. (2022) identify "induction heads" in transformers as a likely main mechanism of ICL. Dai et al. (2022) provide empirical evidence for the connection of ICL to Gradient Descent in LLMs, focusing on classification tasks. Concurrent work by Merullo et al. (2023) also explores a phenomenon similar to the task vectors we study here, where a single vector can encode learned functions. Our findings are complementary to theirs, and future work could explore the relationship between the two more closely.

8 Conclusions

Through this exploration of ICL in LLMs, we have shed light on a new perspective of ICL learning mechanisms. We have revealed a simple and elegant structure: ICL functions by compressing a given training set into a single task vector, which then guides the transformer to generate appropriate outputs given queries. Our work provides a stepping stone towards understanding how LLMs perform ICL. In light of our findings, future work could focus on understanding how the task vector is constructed as well as how it is used to calculate the output.
Limitations

We study relatively simple tasks, whereas ICL can learn to perform more complex tasks, such as solving arithmetic reasoning problems. It remains to be seen if and how the mechanisms we observe here will translate to these cases. E.g., our approach focuses on cases where a single task vector suffices, while more complex ICL cases may require more elaborate parameterization. We also focus on tasks where the output is a single token, while some other tasks require multi-token outputs.

Finally, as noted above, we do not provide a mechanistic explanation for how the task vector is formed or how it is used. Namely, we do not explain how the transformer performs these calculations using its parameters.

Acknowledgements

This project is funded by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant ERC HOLI 819080).

References

Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. 2022. What learning algorithm is in-context learning? Investigations with linear models. arXiv preprint arXiv:2211.15661.

Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. 2023. Pythia: A suite for analyzing large language models across training and scaling. arXiv preprint arXiv:2304.01373.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

Stephanie Chan, Adam Santoro, Andrew Lampinen, Jane Wang, Aaditya Singh, Pierre Richemond, James McClelland, and Felix Hill. 2022. Data distributional properties drive emergent in-context learning in transformers. Advances in Neural Information Processing Systems, 35:18878–18891.

Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Zhifang Sui, and Furu Wei. 2022. Why can GPT learn in-context? Language models secretly perform gradient descent as meta optimizers. arXiv preprint arXiv:2212.10559.

Guy Dar, Mor Geva, Ankit Gupta, and Jonathan Berant. 2022. Analyzing transformers in embedding space. arXiv preprint arXiv:2209.02535.

Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. 2022. A survey for in-context learning. arXiv preprint arXiv:2301.00234.

Shivam Garg, Dimitris Tsipras, Percy Liang, and Gregory Valiant. What can transformers learn in-context? A case study of simple function classes. In Advances in Neural Information Processing Systems.

Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. 2023. Editing models with task arithmetic. In The Eleventh International Conference on Learning Representations.

Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691.

Yoav Levine, Noam Wies, Daniel Jannai, Dan Navon, Yedid Hoshen, and Amnon Shashua. 2022. The inductive bias of in-context learning: Rethinking pretraining example design. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net.

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1–35.

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 35:17359–17372.

Jack Merullo, Carsten Eickhoff, and Ellie Pavlick. 2023. Language models implement simple word2vec-style vector arithmetic. arXiv preprint arXiv:2305.16130.

nostalgebraist. 2020. interpreting gpt: the logit lens. LessWrong.

Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. 2022. In-context learning and induction heads. arXiv preprint arXiv:2209.11895.

Shai Shalev-Shwartz and Shai Ben-David. 2014. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

Johannes von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. 2022. Transformers learn in-context by gradient descent. arXiv preprint arXiv:2212.07677.

Ben Wang and Aran Komatsuzaki. 2021. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax.

Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. An explanation of in-context learning as implicit Bayesian inference. In International Conference on Learning Representations.
A Appendix

Here we provide additional details and results.

A.1 Additional Details

Full Task Descriptions. Our study covers 18 tasks in 4 categories: Algorithmic, Translation, Linguistic and Knowledge. A detailed description of all tasks is provided in Tab. 5.

Model Details. More details on the models used in the study are provided in Tab. 4.

Model | Parameters | Dimension | Layers | Heads
LLaMA | 7B | 4096 | 32 | 32
LLaMA | 13B | 5120 | 40 | 40
LLaMA | 30B | 6656 | 60 | 52
GPT-J | 6B | 4096 | 28 | 16
Pythia | 2.8B | 2560 | 32 | 32
Pythia | 6.9B | 4096 | 32 | 32
Pythia | 12B | 5120 | 36 | 40

Table 4: The models used in the study, with architectural information.

Task Data. Here we detail the sources of the data for each task. The accompanying GitHub repository contains the data itself as well as the code used to create it.

• Algorithmic: Generated programmatically.

• Translation: For each language pair, the most frequent words in the source language are first retrieved from https://github.com/frekwencja/most-common-words-multilingual and are then translated to the destination language using the open-source package nltk.

• Linguistic: The data for the tenses tasks is parsed from https://github.com/Drulac/English-Verbs-Conjugates. The data for the plural-singular task is taken from https://github.com/sindresorhus/irregular-plurals. Finally, the data for the antonyms task is taken from https://github.com/SuzanaK/english_synonyms_antonyms_list.

• Knowledge: Data for the knowledge tasks is taken from the counterfactual dataset introduced in (Meng et al., 2022).

Conflicting Tasks Experiment. In Fig. 6, we provide more details and a visualization of the experiment described in §5.

A.2 Additional Results

Finding A and f. Fig. 7 shows results similar to Fig. 3, but for different models. It is interesting to observe that the curves are similar across different-sized models.

Detailed results for Fig. 4. Fig. 4 presented results for our (A, f) hypothesis-based approach, averaged across tasks. Tab. 6 provides these results for all the specific tasks considered.

Dependence of A on x. Fig. 9 and Fig. 8 provide more results on the geometry of the θ vectors (see main text for discussion).

Inspecting Task Vectors. Tab. 7 is an expanded version of Tab. 3, providing more vocabulary projections of θ for additional tasks and on multiple LLMs.
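The vocabulary projection behind Tab. 3 and Tab. 7 amounts to mapping θ through the model's output head and listing the top-scoring tokens. A minimal sketch is given below, assuming a GPT-NeoX-style model from Hugging Face and a precomputed θ; whether and how the final layer norm is applied may differ from the authors' setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/pythia-2.8b"  # placeholder GPT-NeoX-style model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def top_tokens(theta: torch.Tensor, k: int = 10) -> list[str]:
    """Project a task vector onto the vocabulary and return the top-k tokens."""
    with torch.no_grad():
        normed = model.gpt_neox.final_layer_norm(theta)  # apply the model's final layer norm
        logits = model.embed_out(normed)                 # unembedding: hidden dim -> vocab
    return [tokenizer.decode(int(i)) for i in logits.topk(k).indices]

# Example with a random stand-in; in practice θ is extracted as in the main-text sketches.
theta = torch.randn(model.config.hidden_size)
print(top_tokens(theta))
```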
Category | Task | Description | Example
Algorithmic | List first | Given a list of letters, output the first letter | a,b,c → a
Algorithmic | List last | Given a list of letters, output the last letter | a,b,c → c
Algorithmic | Next letter | Given a letter in the English alphabet, output the next letter | a → b
Algorithmic | Previous letter | Given a letter in the English alphabet, output the previous letter | b → a
Algorithmic | To lowercase | Given an uppercase letter, output the corresponding lowercase letter | A → a
Algorithmic | To uppercase | Given a lowercase letter, output the corresponding uppercase letter | a → A
Translation | French to English | Given a word in French, translate to English | bonjour → hello
Translation | Spanish to English | Given a word in Spanish, translate to English | hola → hello
Translation | English to Spanish | Given a word in English, translate to Spanish | hello → hola
Translation | English to French | Given a word in English, translate to French | hello → bonjour
Linguistic | Present to gerund | Given an English verb in present simple tense, output the corresponding gerund form | go → going
Linguistic | Present to past | Given an English verb in present simple tense, output the corresponding verb in past simple | go → went
Linguistic | Singular to plural | Given an English noun in singular form, output the plural form | cat → cats
Linguistic | Antonyms | Given an English adjective, output an antonym | happy → sad
Knowledge | Country to Capital | Given a name of a country, output the name of the capital city | France → Paris
Knowledge | Person to Language | Given a name of a person, output their native language | Macron → French
Knowledge | Location to Continent | Given a name of a location, output its continent | Paris → Europe
Knowledge | Religion | Given a name of a location or a person, output the associated religion | Muhammad → Islam

Table 5: The tasks used in the study with input → output examples.
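As noted in §A.1, the algorithmic task data is generated programmatically. The snippet below is a hypothetical illustration of such generation for the "next letter" and "list first" tasks; it is not the authors' data-generation code.

```python
import random
import string

def next_letter_examples(n: int) -> list[tuple[str, str]]:
    """Input-output pairs for the 'next letter' task, e.g. ('a', 'b')."""
    pairs = []
    for _ in range(n):
        i = random.randrange(len(string.ascii_lowercase) - 1)
        pairs.append((string.ascii_lowercase[i], string.ascii_lowercase[i + 1]))
    return pairs

def list_first_examples(n: int, list_len: int = 3) -> list[tuple[str, str]]:
    """Input-output pairs for the 'list first' task, e.g. ('a,b,c', 'a')."""
    pairs = []
    for _ in range(n):
        items = random.sample(string.ascii_lowercase, list_len)
        pairs.append((",".join(items), items[0]))
    return pairs

print(next_letter_examples(3))
print(list_first_examples(3))
```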
Figure 6: Conflicting tasks experiment. In the "Regular" scenario (top), the model is simply provided with demonstrations S_A for Task A (e.g. outputting the previous letter in the alphabet). In the "Conflicting" scenario (bottom), the model is still provided with demonstrations for Task A, but we inject a task vector θ(S_B) from a conflicting Task B (e.g. outputting the next letter in the alphabet).
Figure 7: Accuracy for each choice of L (the intermediate layer where the task vector is injected), averaged across
all tasks. The solid line represents the average value, and the shaded area depicts the standard deviation.
Table 6: Complete results for Figure 4, reported for all tasks and models.

Model | Task type | Task name | Baseline | Hypothesis | Regular
GPT-J 6B Algorithmic List first 0.30 0.74 0.98
List last 0.24 0.64 1.00
Next letter 0.16 1.00 0.86
Prev letter 0.10 0.36 0.42
To lower 0.00 0.46 1.00
To upper 0.00 0.94 1.00
Knowledge Country capital 0.19 0.72 0.80
Location continent 0.03 0.58 0.70
Location religion 0.09 0.68 0.78
Person language 0.02 0.82 0.82
Linguistic Antonyms 0.43 0.68 0.78
Plural singular 0.08 0.90 0.98
Present simple gerund 0.00 0.88 0.98
Present simple past simple 0.02 0.76 0.96
Translation En es 0.14 0.34 0.56
En fr 0.16 0.36 0.54
Es en 0.06 0.70 0.74
Fr en 0.13 0.66 0.76
LLaMA 13B Algorithmic List first 0.77 1.00 1.00
List last 0.07 0.70 0.92
Next letter 0.31 1.00 0.94
Prev letter 0.05 0.34 0.50
To lower 0.00 0.94 1.00
To upper 0.00 0.94 1.00
Knowledge Country capital 0.17 0.84 0.86
Location continent 0.01 0.70 0.80
Location religion 0.10 0.74 0.84
Person language 0.02 0.76 0.88
Linguistic Antonyms 0.19 0.74 0.80
Plural singular 0.24 0.84 0.88
Present simple gerund 0.00 0.96 0.96
Present simple past simple 0.01 1.00 0.98
Translation En es 0.05 0.78 0.82
En fr 0.15 0.70 0.84
Es en 0.29 0.76 0.88
Fr en 0.25 0.54 0.72
LLaMA 30B Algorithmic List first 0.96 0.98 1.00
List last 0.02 0.64 0.96
Next letter 0.30 0.98 0.96
Prev letter 0.02 0.56 0.80
To lower 0.00 1.00 1.00
To upper 0.00 0.90 1.00
Knowledge Country capital 0.27 0.72 0.88
Location continent 0.01 0.70 0.86
Location religion 0.05 0.70 0.88
Person language 0.01 0.72 0.90
Linguistic Antonyms 0.37 0.76 0.82
Plural singular 0.21 0.84 0.90
Present simple gerund 0.00 0.76 0.98
Present simple past simple 0.02 0.98 1.00
Translation En es 0.07 0.74 0.78
En fr 0.10 0.80 0.86
Es en 0.24 0.70 0.88
Fr en 0.20 0.62 0.78
LLaMA 7B Algorithmic List first 0.87 0.98 1.00
List last 0.03 1.00 1.00
Next letter 0.03 0.94 0.88
Prev letter 0.04 0.52 0.58
To lower 0.00 0.74 1.00
To upper 0.00 0.60 1.00
Knowledge Country capital 0.28 0.82 0.86
Location continent 0.02 0.68 0.72
Location religion 0.12 0.84 0.94
Person language 0.02 0.68 0.78
Linguistic Antonyms 0.33 0.74 0.76
Plural singular 0.15 0.84 0.88
Present simple gerund 0.00 0.74 0.90
Present simple past simple 0.02 0.94 0.92
Translation En es 0.07 0.78 0.76
En fr 0.04 0.78 0.88
Es en 0.21 0.68 0.92
Fr en 0.15 0.66 0.70
Pythia 12B Algorithmic List first 0.53 0.98 0.96
List last 0.09 0.98 1.00
Next letter 0.15 0.96 0.76
Prev letter 0.00 0.24 0.42
To lower 0.02 1.00 1.00
To upper 0.00 0.98 1.00
Knowledge Country capital 0.19 0.58 0.82
Location continent 0.01 0.68 0.80
Location religion 0.07 0.64 0.78
Person language 0.01 0.72 0.86
Linguistic Antonyms 0.34 0.72 0.74
Plural singular 0.18 0.80 0.84
Present simple gerund 0.00 0.86 0.96
Present simple past simple 0.01 0.76 0.94
Translation En es 0.10 0.44 0.72
En fr 0.16 0.48 0.54
Es en 0.05 0.68 0.80
Fr en 0.14 0.68 0.80
Pythia 2.8B Algorithmic List first 0.69 0.96 1.00
List last 0.06 0.98 1.00
Next letter 0.42 0.86 0.90
Prev letter 0.01 0.22 0.48
To lower 0.00 1.00 1.00
To upper 0.00 1.00 1.00
Knowledge Country capital 0.18 0.70 0.76
Location continent 0.01 0.62 0.72
Location religion 0.08 0.76 0.82
Person language 0.00 0.82 0.82
Linguistic Antonyms 0.37 0.68 0.76
Plural singular 0.13 0.70 0.78
Present simple gerund 0.00 0.86 0.96
Present simple past simple 0.03 0.80 0.92
Translation En es 0.10 0.26 0.76
En fr 0.16 0.28 0.60
Es en 0.08 0.76 0.82
Fr en 0.10 0.64 0.82
Pythia 6.9B Algorithmic List first 0.43 1.00 0.98
List last 0.08 0.60 0.98
Next letter 0.01 0.66 0.86
Prev letter 0.04 0.28 0.32
To lower 0.00 1.00 1.00
To upper 0.00 0.94 1.00
Knowledge Country capital 0.21 0.76 0.82
Location continent 0.01 0.62 0.78
Location religion 0.10 0.80 0.80
Person language 0.01 0.76 0.80
Linguistic Antonyms 0.33 0.72 0.74
Plural singular 0.14 0.78 0.88
Present simple gerund 0.00 0.82 0.94
Present simple past simple 0.02 0.88 0.96
Translation En es 0.11 0.46 0.70
En fr 0.21 0.36 0.60
Es en 0.06 0.72 0.82
Fr en 0.14 0.66 0.74
Figure 8: Task Vector Variability. For each task, two histograms are shown: (blue) the distribution of distances
between different task vectors of this task, created from different S and x′ ; (orange) the distribution of distances
between task vectors of the task and of other tasks.
Figure 9: A 2D t-SNE plot, visualizing 50 task vectors for each task, each generated from a different choice of S
and x using LLaMA 7B. Points are color-coded according to task category, such as algorithmic or translation. Each
task can be seen to form its own distinct cluster. The labels provide the full name of the task in the cluster.
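A plot like Fig. 5 or Fig. 9 can be produced by stacking the task vectors and applying an off-the-shelf t-SNE. The sketch below assumes scikit-learn and matplotlib and uses random stand-in vectors in place of the 50 task vectors per task.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Stand-in data: replace with 50 task vectors per task extracted as in the main-text sketches.
rng = np.random.default_rng(0)
tasks = ["next_letter", "fr_en", "antonyms", "country_capital"]
vectors = np.vstack([rng.normal(loc=i, size=(50, 4096)) for i in range(len(tasks))])
labels = np.repeat(np.arange(len(tasks)), 50)

# 2D t-SNE of the task vectors, color-coded by task.
points = TSNE(n_components=2, random_state=0).fit_transform(vectors)
for i, name in enumerate(tasks):
    plt.scatter(points[labels == i, 0], points[labels == i, 1], s=8, label=name)
plt.legend()
plt.title("t-SNE of task vectors (illustrative stand-in data)")
plt.show()
```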
Model | Task | Top tokens
LLaMA 13B | Prev Letter | e, y, unknown, alphabet, preceding, c, Cad, zA, dit, bill, closer, etc, Stuart, aa, null, cin, ads, g, ulo, Ku
LLaMA 13B | FR-EN | Mason, gram, immer, Santi, latin, utter, Span, Conc, English, equivalent, engl, Usage, none, pron, ulo, translate, adu, Wiel, grammar, ML
LLaMA 13B | Present Simple to Gerund | cin, thats, gram, Lorenzo, cian, Isabel, uld, berto, partici, Sah, reporting, eing, tc, Roberto, habit, Writing, etc, ientos, ores, Dutch
LLaMA 13B | Country Capital | Paris, its, capital, central, Conc, cities, administrative, Los, Madrid, London, San, Isabel, exec, Ar, Bel, Wars, name, capit, Battle, History
Pythia 12B | Prev Letter | r, b, a, d, m, e, p, n, t, u, h, f, c, in, g, s, the, ar, l, x
Pythia 12B | FR-EN | in, and, m, d, a, or, out, the, t, o, so, c, con, have, act, e, s, is, all, to
Pythia 12B | Present Simple to Gerund | in, t, m, r, a, and, the, ing, action, d, o, e, current, simple, te, w, not, have, out, what
Pythia 12B | Country Capital | the, in, a, C, N, B, L, M, T, P, S, R, G, and, F, I, K, U, D, H
GPT-J 6B | Prev Letter | b, c, v, g, s, name, i, ro, n, j, d, t, A, ai, com, m, ust, test, active, k
GPT-J 6B | FR-EN | other, name, the, true, is, social, s, active, time, car, type, money, F, force, a, public, heart, one, ms, life
GPT-J 6B | Present Simple to Gerund | getting, storing, working, moving, playing, doing, making, driving, shooting, picking, being, sending, putting, selling, watching, changing, taking, collecting, feeding, reading
GPT-J 6B | Country Capital | London, Paris, New, West, Berlin, South, Tokyo, San, Chicago, City, Moscow, Jerusalem, Amsterdam, Philadelphia, East, Madrid, Vienna, Beijing, Mexico, Germany

Table 7: The top 20 tokens in the distribution induced by the task vector, for one task per category.
