In-Context Learning Creates Task Vectors
[Figure: example tasks by category with sample demonstrations: Translation (French to English: bonjour → hello; Spanish to English: hola → hello), Linguistic (Present to gerund: go → going; Singular to plural: cat → cats; Antonyms: happy → sad), Knowledge (Country to Capital: France → Paris; Person to Language: Macron → French). The symbols A, θ, and f refer to the task-vector mechanism.]
We next compare the accuracy of the (A, f) mechanism to that of a regular forward pass performing ICL. For each model and task, we evaluate the following three procedures:

• Regular An application of the LLM to the demonstrations S and query x. Namely T([S, x]), as in regular ICL.
• Hypothesis Our proposed procedure from §3.1, where A generates θ using a dummy x′, and f(·; θ) is applied to x by running the transformer on [x, →] with θ patched at layer L of →.
• Baseline A forward pass of the LLM only on x, without demonstrations S. That is, T([x, →]). This is the same as the application of f from our separated procedure, but without patching θ.

Fig. 4 shows the average accuracy across all tasks of these 3 procedures, for each model. Full results are reported in Tab. 6 in §A.2. Across all models, our procedure maintains around 80-90% of the accuracy of regular ICL, while the baseline reaches only 10-20%. This shows that our proposed separation into A and f provides a good empirical approximation of the process underlying ICL.

4 Robustness of Task Vectors

In our setting, θ is derived from S and a dummy query x′. It is natural to examine the robustness of θ to variations in these inputs. Intuitively, if it represents the task, it should remain stable across different S and x′ values. To test this, we use LLaMA 7B to generate 50 task vectors per task with varied S and x′ and conduct two analyses.

Geometry of θ A t-SNE dimensionality reduction (Fig. 5) reveals that the task vectors form distinct clusters, each containing task vectors of a single task. Fig. 9 further shows proximity between tasks of the same category, strengthening the idea that they encapsulate task understanding.

Variability of θ Fig. 8 shows histograms of distances within and across tasks. It can be seen that vectors within the same task are closer than those between different tasks, indicating that θ is stable within tasks and not highly influenced by x′ or S.

5 Dominance of θ Patching

In §3 we prevented f from directly accessing S. However, in a regular forward pass during ICL, the last token can attend to S. Here we verify that even in this case, f mainly uses the task vector θ, without directly accessing the demonstrations S. To this end, we use a pair of tasks, A and B, sharing the input space but differing on the output. We first use a "Regular" forward pass, where we provide the model with demonstrations S for task A (denoted SA), to verify the model can perform this task using ICL. Then, we do a "Conflicting" forward pass, still providing SA, while injecting θB. For more details, refer to Fig. 6 in §A.1.
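To make the A / f separation and the patching operation concrete, the following Python sketch shows one way they could be implemented with the Hugging Face transformers library: A reads the layer-L hidden state of the last token ("→") from a forward pass on [S, x′], and f patches that state into a forward pass on [x, →]. The checkpoint name, prompt format, and choice of L are illustrative assumptions, not the paper's exact implementation.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "huggyllama/llama-7b"  # assumed checkpoint id
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

L = 14  # illustrative layer choice; the paper selects L per model (cf. Fig. 3 / Fig. 7)

def task_vector(demos_prompt: str) -> torch.Tensor:
    # A: run the model on [S, x'] and keep the layer-L hidden state of the last token ("→").
    ids = tok(demos_prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so layer L's output sits at index L + 1.
    return out.hidden_states[L + 1][0, -1]

def patched_forward(query_prompt: str, theta: torch.Tensor) -> str:
    # f: run the model on [x, "→"] while overwriting the layer-L state of the last token with theta.
    ids = tok(query_prompt, return_tensors="pt").input_ids

    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[:, -1, :] = theta.to(hidden.dtype)  # patch the last position only
        return output

    handle = model.model.layers[L].register_forward_hook(hook)
    try:
        with torch.no_grad():
            next_id = model(ids).logits[0, -1].argmax()
    finally:
        handle.remove()
    return tok.decode(int(next_id))

theta = task_vector("bonjour → hello\nau revoir → goodbye\nchat →")  # S plus a dummy query x'
print(patched_forward("maison →", theta))  # greedy single-token prediction; ideally "house"

The "Conflicting" forward pass of §5 reuses the same patched_forward, except that the prompt contains demonstrations for task A while θ is computed from demonstrations of task B.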
Task A (S)        Task B (θ)   Regular (Task A)   Conflicting (Task B)
Next Letter       To Upper     0.92               0.77
List Last         List First   0.95               0.78
Present to Past   To Gerund    0.96               0.95

Table 2: Conflicting tasks experiment results. The model's accuracy on the relevant task (A in "Regular" and B in "Conflicting") is displayed for both scenarios.

In Tab. 2, the "Regular" forward pass shows high accuracy on task A (90%+), as anticipated. However, the "Conflicting" forward pass yields high accuracy on task B, corresponding to the injected task vector θ. This implies that the model mainly relies on θ, largely disregarding the demonstrations S for task A. We note that the accuracy on task B is slightly lower, likely consistent with the performance dip seen in Fig. 6, and potentially further affected by the presence of S.

6 Interpreting θ

The learned vector θ intuitively captures information about the task demonstrated by S. Here we provide evidence supporting this interpretation. Since θ is an intermediate hidden state of the transformer, we can employ a vocabulary projection method (nostalgebraist, 2020; Dar et al., 2022). Namely, we examine the top tokens in the distribution over the vocabulary induced by the hidden state.

Tab. 3 shows the top tokens for four tasks (one per category) for LLaMA 13B (more models and tasks are provided in Tab. 7 in §A). In multiple cases, we observe tokens that directly describe the task. Importantly, these terms never explicitly appeared in the context. For example, in the task of translation from French to English, we observe tokens such as "English" and "translate". This supports our view that θ carries significant, non-trivial semantic information about the task.

Task                       Top tokens in the task vector projection
Previous Letter            e, y, unknown, alphabet, preceding, c, Cad, zA, dit, bill
FR-EN                      Mason, gram, immer, Santi, latin, utter, Span, Conc, English, equivalent
Present Simple to Gerund   cin, thats, gram, Lorenzo, cian, Isabel, uld, berto, partici, Sah
Country Capital            Paris, its, capital, central, Conc, cities, administrative, Los, Madrid, London

Table 3: The top 10 tokens in the distribution induced by the task vector, for one task per category.

7 Related Work

Meta-Learning in Transformers Studies by Akyürek et al. (2022); von Oswald et al. (2022); Garg et al. focus on the meta-learning capabilities of transformers. They typically train models from scratch on elementary tasks such as linear regression, drawing theoretical parallels with algorithms like Gradient Descent and demonstrating how transformers could implement them. A key assumption of these works is a known parameter space within which gradient descent operates. Our work focuses on identifying such a parameter space for LLMs.

ICL in LLMs Olsson et al. (2022) identify "induction heads" in transformers as a likely main mechanism of ICL. Dai et al. (2022) provide empirical evidence for the connection of ICL to Gradient Descent in LLMs, focusing on classification tasks. Concurrent work by Merullo et al. (2023) also explores a phenomenon similar to the task vectors we study here, where a single vector can encode learned functions. Our findings are complementary to theirs, and future work could explore the relationship between the two more closely.

8 Conclusions
References

Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. 2022. What learning algorithm is in-context learning? Investigations with linear models. arXiv preprint arXiv:2211.15661.

Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. 2023. Pythia: A suite for analyzing large language models across training and scaling. arXiv preprint arXiv:2304.01373.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

Stephanie Chan, Adam Santoro, Andrew Lampinen, Jane Wang, Aaditya Singh, Pierre Richemond, James McClelland, and Felix Hill. 2022. Data distributional properties drive emergent in-context learning in transformers. Advances in Neural Information Processing Systems, 35:18878–18891.

Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Zhifang Sui, and Furu Wei. 2022. Why can GPT learn in-context? Language models secretly perform gradient descent as meta optimizers. arXiv preprint arXiv:2212.10559.

Guy Dar, Mor Geva, Ankit Gupta, and Jonathan Berant. 2022. Analyzing transformers in embedding space. arXiv preprint arXiv:2209.02535.

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 35:17359–17372.

Jack Merullo, Carsten Eickhoff, and Ellie Pavlick. 2023. Language models implement simple word2vec-style vector arithmetic. arXiv preprint arXiv:2305.16130.

nostalgebraist. 2020. interpreting GPT: the logit lens. LessWrong.

Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. 2022. In-context learning and induction heads. arXiv preprint arXiv:2209.11895.

Shai Shalev-Shwartz and Shai Ben-David. 2014. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

Johannes von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. 2022. Transformers learn in-context by gradient descent. arXiv preprint arXiv:2212.07677.

Ben Wang and Aran Komatsuzaki. 2021. GPT-J-6B: A 6 billion parameter autoregressive language model. https://github.com/kingoflolz/mesh-transformer-jax.

Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. An explanation of in-context learning as implicit Bayesian inference. In International Conference on Learning Representations.
A Appendix

Here we provide additional details and results.

A.1 Additional Details

Full Task Descriptions Our study covers 18 tasks in 4 categories: Algorithmic, Translation, Linguistic, and Knowledge. A detailed description of all tasks is provided in Tab. 5.

Model Details More details on the models used in the study are provided in Tab. 4.

Task Data Here we detail the sources of the data for each task. The accompanying GitHub repository contains the data itself as well as the code used to create it.

• Algorithmic: Generated programmatically.
• Translation: For each language pair, the most frequent words in the source language are first retrieved from https://github.com/frekwencja/most-common-words-multilingual and are then translated to the destination language using the open-source package nltk.
• Linguistic: The data for the tenses tasks is parsed from https://github.com/Drulac/English-Verbs-Conjugates. The data for the plural-singular task is taken from https://github.com/sindresorhus/irregular-plurals. Finally, the data for the antonyms task is taken from https://github.com/SuzanaK/english_synonyms_antonyms_list.
• Knowledge: Data for the knowledge tasks is taken from the counterfactual dataset introduced in (Meng et al., 2022).

Conflicting Tasks Experiment In Fig. 6, we provide more details and a visualization of the experiment described in §5.

A.2 Additional Results

Finding A and f Fig. 7 shows results similar to Fig. 3, but for different models. It is interesting to observe that the curves are similar across different-sized models.

Detailed results for Fig. 4 Fig. 4 presented results for our (A, f) hypothesis-based approach, averaged across tasks. Tab. 6 provides these results for all the specific tasks considered.

Dependence of A on x Fig. 9 and Fig. 8 provide more results on the geometry of the θ vectors (see main text for discussion).

Inspecting Task Vectors Tab. 7 is an expanded version of Tab. 3, providing more vocabulary projections of θ for additional tasks and on multiple LLMs.

Model    Parameters   Dimension   Layers   Heads
LLaMA    7B           4096        32       32
LLaMA    13B          5120        40       40
LLaMA    30B          6656        60       52
GPT-J    6B           4096        28       16
Pythia   2.8B         2560        32       32
Pythia   6.9B         4096        32       32
Pythia   12B          5120        36       40

Table 4: The models used in the study, with architectural information.
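As a complement to the geometry results referenced above (Fig. 8 and Fig. 9), the sketch below shows how the two robustness analyses of §4 could be reproduced from a collection of task vectors. It assumes the vectors have already been gathered into a dict mapping task names to lists of 1-D NumPy arrays (e.g. theta.float().numpy() from the earlier sketch); the t-SNE settings are arbitrary.

import itertools
import numpy as np
from scipy.spatial.distance import cosine
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_geometry(vectors_by_task: dict):
    # t-SNE of all task vectors, colored by task (cf. Fig. 5 / Fig. 9).
    labels, X = [], []
    for task, vecs in vectors_by_task.items():
        labels += [task] * len(vecs)
        X.extend(vecs)
    X2 = TSNE(n_components=2, perplexity=30).fit_transform(np.stack(X))
    for task in vectors_by_task:
        idx = [i for i, t in enumerate(labels) if t == task]
        plt.scatter(X2[idx, 0], X2[idx, 1], s=5, label=task)
    plt.legend(fontsize=5)
    plt.show()

def distance_histograms(vectors_by_task: dict):
    # Within-task vs. across-task cosine distances (cf. Fig. 8).
    within, across = [], []
    for task, vecs in vectors_by_task.items():
        within += [cosine(a, b) for a, b in itertools.combinations(vecs, 2)]
        for other, ovecs in vectors_by_task.items():
            if other != task:
                across += [cosine(a, b) for a in vecs for b in ovecs]
    plt.hist(within, bins=50, alpha=0.5, density=True, label="within task")
    plt.hist(across, bins=50, alpha=0.5, density=True, label="across tasks")
    plt.legend()
    plt.show()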
Category      Task                    Description                                                                               Example
Algorithmic   List first              Given a list of letters, output the first letter                                          a,b,c → a
              List last               Given a list of letters, output the last letter                                           a,b,c → c
              Next letter             Given a letter in the English alphabet, output the next letter                            a → b
              Previous letter         Given a letter in the English alphabet, output the previous letter                        b → a
              To lowercase            Given an uppercase letter, output the corresponding lowercase letter                      A → a
              To uppercase            Given a lowercase letter, output the corresponding uppercase letter                       a → A
Translation   French to English       Given a word in French, translate to English                                              bonjour → hello
              Spanish to English      Given a word in Spanish, translate to English                                             hola → hello
              English to Spanish      Given a word in English, translate to Spanish                                             hello → hola
              English to French       Given a word in English, translate to French                                              hello → bonjour
Linguistic    Present to gerund       Given an English verb in present simple tense, output the corresponding gerund form       go → going
              Present to past         Given an English verb in present simple tense, output the corresponding verb in past simple   go → went
              Singular to plural      Given an English noun in singular form, output the plural form                            cat → cats
              Antonyms                Given an English adjective, output an antonym                                             happy → sad
Knowledge     Country to Capital      Given a name of a country, output the name of the capital city                            France → Paris
              Person to Language      Given a name of a person, output their native language                                    Macron → French
              Location to Continent   Given a name of a location, output the continent it is located in                         Paris → Europe
              Religion                Given a name of a location or a person, output the associated religion                    Muhammad → Islam

Table 5: The tasks used in the study with input → output examples.
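For illustration, a hypothetical helper for turning the input → output pairs of Tab. 5 into an ICL prompt [S, x]; the exact prompt format and number of demonstrations used in the study are assumptions here.

import random

def build_prompt(pairs: list, query: str, n_demos: int = 5) -> str:
    # Sample demonstrations S and append the query x, leaving the output to be completed.
    demos = random.sample(pairs, n_demos)
    lines = [f"{x} → {y}" for x, y in demos]
    lines.append(f"{query} →")
    return "\n".join(lines)

country_capital = [("France", "Paris"), ("Italy", "Rome"), ("Japan", "Tokyo"),
                   ("Egypt", "Cairo"), ("Canada", "Ottawa"), ("Spain", "Madrid")]
print(build_prompt(country_capital, query="Germany"))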
Figure 6: Conflicting tasks experiment. In the “Regular” scenario (top), the model is simply provided with
demonstrations SA for Task A (e.g. outputting the previous letter in the alphabet). In the “Conflicting” scenario
(bottom), the model is still provided with demonstrations for Task A, but we inject a task vector θ(SB ) from a
conflicting Task B (e.g. outputting the next letter in the alphabet).
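A minimal sketch of the "Conflicting" evaluation loop, reusing the hypothetical task_vector and patched_forward helpers sketched earlier: the prompt always carries demonstrations for Task A, the injected vector is computed from Task B demonstrations, and accuracy is scored against Task B labels.

def conflicting_accuracy(demos_a: str, theta_b, queries, labels_b) -> float:
    # Provide S_A in the prompt, inject theta(S_B), and score predictions against Task B labels.
    hits = 0
    for x, y in zip(queries, labels_b):
        pred = patched_forward(f"{demos_a}\n{x} →", theta_b)
        hits += int(pred.strip() == y)
    return hits / len(queries)

# Example: Task A = previous letter (demonstrations), Task B = next letter (injected task vector).
demos_a = "b → a\nr → q\nm → l"
theta_b = task_vector("b → c\nr → s\nm →")
print(conflicting_accuracy(demos_a, theta_b, queries=["c", "f"], labels_b=["d", "g"]))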
Figure 7: Accuracy for each choice of L (the intermediate layer where the task vector is injected), averaged across
all tasks. The solid line represents the average value, and the shaded area depicts the standard deviation.
Table 6: Complete results for Figure 4, reported for all tasks and models.
Table 7: The top 20 tokens in the distribution induced by the task vector, for one task per category.
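For completeness, a short sketch of the vocabulary projection behind Tab. 3 and Tab. 7, in the spirit of the logit lens (nostalgebraist, 2020): θ is passed through the model's final normalization and unembedding matrix, and the top-k tokens are read off. It reuses the model, tok, and task_vector names from the earlier sketch; whether the final norm is applied before the projection is an assumption of this sketch.

import torch

def top_tokens(theta: torch.Tensor, k: int = 10) -> list:
    # Project the task vector onto the vocabulary and return the k most likely tokens.
    with torch.no_grad():
        logits = model.lm_head(model.model.norm(theta))
        top_ids = logits.topk(k).indices
    return [tok.decode(int(i)) for i in top_ids]

print(top_tokens(task_vector("bonjour → hello\nau revoir → goodbye\nchat →")))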