
An explainable transformer circuit for compositional generalization

Cheng Tang (1,2), Brenden Lake (3), Mehrdad Jazayeri (1,2,4,*)

(1) Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, MA, USA
(2) McGovern Institute, Massachusetts Institute of Technology, MA, USA
(3) Center for Data Science, Department of Psychology, New York University, NY, USA
(4) Howard Hughes Medical Institute, MA, USA

arXiv:2502.15801v1 [cs.LG] 19 Feb 2025

Abstract

Compositional generalization—the systematic combination of known components into novel structures—remains a core challenge in cognitive science and machine learning. Although transformer-based large language models can exhibit strong performance on certain compositional tasks, the underlying mechanisms driving these abilities remain opaque, calling into question their interpretability. In this work, we identify and mechanistically interpret the circuit responsible for compositional induction in a compact transformer. Using causal ablations, we validate the circuit and formalize its operation using a program-like description. We further demonstrate that this mechanistic understanding enables precise activation edits to steer the model's behavior predictably. Our findings advance the understanding of complex behaviors in transformers and highlight how such insights can provide a direct pathway for model control.

Keywords: Transformer; Mechanistic Interpretability; Compositionality

Introduction

Transformers, first introduced by Vaswani et al. (2017), excel at tasks requiring complex reasoning such as code synthesis (Chen et al., 2021) and mathematical problem-solving (Hendrycks et al., 2020). This capability stems not merely from memorization, but from their ability to perform compositional generalization—systematically combining learned primitives into novel structures via in-context learning (ICL) (Brown et al., 2020; Lake & Baroni, 2023). While humans inherently excel at such abstraction (Fodor, 1979), traditional neural architectures struggle with out-of-distribution (OOD) compositional tasks (Hupkes et al., 2019; Lake et al., 2016). Understanding how neural systems accomplish compositionality has become a focus of both machine learning and cognitive science research.

Mechanistic interpretability—a field dedicated to reverse-engineering neural networks into human-understandable algorithms—has begun unraveling these dynamics. Seminal work identified induction heads as a critical component for ICL (Elhage et al., 2021; Olsson et al., 2022), enabling transformers to dynamically bind and retrieve contextual patterns rather than relying on shallow "lazy" heuristics like n-gram matching. However, prior work has mostly focused on isolated mechanisms (K. R. Wang et al., 2022; Hanna et al., 2023) or over-simplified models (e.g., the attention-only transformers in Olsson et al. (2022), or the single-layer transformer in Nanda et al. (2022)), leaving the interpretation of complex induction mechanisms in full-circuit transformers rarely explored. Furthermore, while studies have shown that hyperparameters (e.g., the number of attention layers) can causally affect a model's compositional ability (Sanford et al., 2024; He et al., 2024), a microscopic inspection of the internal circuitry is still lacking.

In this case study, we provide an end-to-end mechanistic interpretation of how a compact transformer solves a compositional induction task. We rigorously trace the minimal circuit responsible for the model's behavior and fully reverse-engineer the attention mechanism into human-readable pseudocode. We also bridge mechanistic interpretation and model control by showing that we can steer the model's behavior with activation edits guided by the circuit mechanism.

Related Work

Transformer circuit interpretation. Mechanistic interpretability of transformers began with analysis of simplified models, identifying attention heads as modular components that implement specific functions. In their seminal work, Elhage et al. (2021) and Olsson et al. (2022) introduced "induction heads" as critical components for in-context learning in small attention-only models. These heads perform pattern completion by attending to prior token sequences, forming the basis for later work on compositional generalization. Case studies have dissected transformer circuits for specific functions, such as the 'greater than' circuit (Hanna et al., 2023), the 'docstring' circuit (Heimersheim & Janiak, 2023), the 'indirect object' circuit (M. Wang et al., 2024), and the 'max of list' circuit (Hofstätter, 2023). These case studies successfully reverse-engineered the transformer into the minimal algorithm responsible for the target behavior.

To facilitate identification of relevant circuits, researchers have proposed circuit discovery methods such as the logit lens (nostalgebraist, 2020), path patching (Goldowsky-Dill et al., 2023), and causal scrubbing (LawrenceC et al., 2022). For large-scale transformers, automated circuit discovery methods have also been proposed (Conmy et al., 2023; Hsu et al., 2024; Bhaskar et al., 2024). So far, transformer interpretability work still requires extensive human effort in the loop for hypothesis generation and testing. We point to Rai et al. (2024) for a more comprehensive review.
Compositional generalization in transformers. In their study, Hupkes et al. (2019) evaluated compositional generalization ability across different families of models and found that transformers outperformed RNNs and ConvNets in systematic generalization, i.e., recombination of known elements, but still fell short of human performance. D. Zhang et al. (2024) pointed out that transformers struggle with composing recursive structures. Recently, Lake & Baroni (2023) showed that after being pre-trained with data generated by a 'meta-grammar', small transformers (fewer than 1 million parameters) can exhibit human-like compositional ability in novel in-context learning cases. This is in line with the success of commercial large language models (LLMs) in solving complex out-of-distribution reasoning tasks (Bubeck et al., 2023; DeepSeek-AI et al., 2024), where compositional generalization is necessary.

Several studies have highlighted factors that facilitate transformers' compositional ability. M. Wang & E (2024) identified initialization scale as a critical factor in determining whether models rely on memorization or rule-based reasoning for compositional tasks. Z. Zhang et al. (2025) revealed that low-complexity circuits enable out-of-distribution generalization by condensing primitive-level rules. Sanford et al. (2024) identified logarithmic depth as a key constraint for transformers to emulate computations within a sequence. Here, we offer a complementary mechanistic understanding of how transformers perform compositional computations.

[Figure 1 appears here: (a) Schematic of the transformer model and task: a two-layer encoder (self-attention + MLP per layer) feeds a two-layer decoder (self-attention + cross-attention + MLP per layer). (b) The prompt and output format for the compositional induction task, e.g.,
Prompt: B S A | A = red | D = pink | B S D = pink blue | B = blue | EOS
Output: SOS red blue EOS]
Experimental Setup

Our experimental setup involves a synthetic function composition task (Figure 1) designed to probe compositional induction in a compact Transformer. We outline the task structure, the Transformer basics (including attention mechanisms), and the training protocol.

Task Structure

Each episode consists of a support set and a question (Figure 1b):

• Support Set: Specifies (i) Primitives as symbol-to-color mappings (e.g., A = red, D = pink), and (ii) Functions as symbolic operations over these primitives (e.g., A S D = pink red, where S indicates swapping adjacent symbols).

• Question: Presents a new composition of primitives and functions from the support set.

The model generates answers to the question as token sequences emitted from the decoder, with an SOS (start of sentence) token as the first input to the decoder and an EOS (end of sentence) token marking the end of the emission. The model operates strictly via in-context learning—weights remain frozen during inference, and test episodes are disjoint from training data. The model must infer latent variable bindings (primitives and functions) from the support set and dynamically compose these bindings to solve the novel question.

Model

Transformer Basics. Our transformer uses an encoder-decoder architecture that involves two types of attention:

• Self-Attention: Captures within-sequence interactions. The token embedding matrix $X \in \mathbb{R}^{n_{\text{input}} \times d_{\text{model}}}$ is projected into Queries, Keys, and Values:
$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V,$$
where $W_Q, W_K, W_V \in \mathbb{R}^{d_{\text{model}} \times d_{\text{head}}}$ are learnable weight matrices.

• Cross-Attention: Enables the decoder to attend to encoder outputs. Here, the Queries (Q) come from the decoder tokens, while the Keys (K) and Values (V) come from the encoder tokens.

The attention mechanism operates through two separate circuits on the embedding $X \in \mathbb{R}^{n_{\text{input}} \times d_{\text{model}}}$ for each attention head:

• QK Circuit ($W_Q W_K^\top$): Determines from where information flows to each token by computing attention scores between token pairs, with higher scores indicating stronger token-to-token relationships:
$$\mathrm{Attention}(Q, K) = \mathrm{softmax}\!\left(\frac{X_Q W_Q (X_K W_K)^\top}{\sqrt{d_{\text{head}}}}\right) \in \mathbb{R}^{n_{\text{query}} \times n_{\text{key}}},$$
where the softmax is applied along the Key dimension and independently for each head.

• OV Circuit ($W_V W_O$): Controls what information gets written to each token position. Combined with the QK Circuit, this produces the output of the attention head:
$$Z = \mathrm{Attention}(Q, K)\, X_V W_V W_O \in \mathbb{R}^{n_{\text{query}} \times d_{\text{model}}},$$
where $W_O \in \mathbb{R}^{d_{\text{head}} \times d_{\text{model}}}$ is a learnable weight matrix.

Our analysis focuses on how these circuits in attention heads together implement the compositional induction algorithm.
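To make the QK/OV decomposition concrete, here is a minimal NumPy sketch (our illustration, not the authors' code) of a single attention head, written so that the attention pattern (QK circuit) and the written-out update (OV circuit) are computed separately. Shapes follow the definitions above; all arrays here are random placeholders.

    import numpy as np

    def softmax(scores, axis=-1):
        scores = scores - scores.max(axis=axis, keepdims=True)
        e = np.exp(scores)
        return e / e.sum(axis=axis, keepdims=True)

    def attention_head(X_q, X_k, W_Q, W_K, W_V, W_O, d_head):
        """One attention head, exposing the QK and OV circuits.

        X_q: (n_query, d_model) embeddings on the query side
        X_k: (n_key, d_model) embeddings on the key/value side
             (X_q is X_k for self-attention; decoder vs. encoder states for cross-attention)
        """
        # QK circuit: where information flows (attention pattern)
        scores = (X_q @ W_Q) @ (X_k @ W_K).T / np.sqrt(d_head)   # (n_query, n_key)
        pattern = softmax(scores, axis=-1)

        # OV circuit: what information is written to each query position
        Z = pattern @ (X_k @ W_V @ W_O)                          # (n_query, d_model)
        return pattern, Z

    # Toy shapes matching the paper's setup (d_model = 128, d_head = 16)
    rng = np.random.default_rng(0)
    d_model, d_head, n = 128, 16, 10
    X = rng.normal(size=(n, d_model))
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    W_O = rng.normal(size=(d_head, d_model))
    pattern, Z = attention_head(X, X, W_Q, W_K, W_V, W_O, d_head)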
Model Training. We adopt an encoder-decoder Transformer with 2 layers in the encoder and 2 layers in the decoder (Figure 1a), with each layer containing 8 attention heads. Further model details appear in the Appendix.

For each episode, we randomly generate:

• Primitive Assignments: A mapping from symbol tokens (e.g., A, B) to color tokens (e.g., red, pink).

• Function Definitions: Symbolic transformations produced by randomly sampling primitive arguments to a function to produce color sequences (e.g., A S B might be expanded into a sequence [A][B][A][A][B], maximum length = 5).

We train on 10,000 such episodes for 50 epochs and evaluate on 2,000 test (held-out) episodes. The model achieves 98% accuracy on this test set, indicating strong compositional induction capabilities. In the test set, primitive assignments and function definitions are conjunctively different from those in the training set (i.e., some primitives or some functions might appear in the training set, but not the whole combination of them), preventing a memorization strategy. Please refer to the Appendix for additional details.

Results

First, we give an intuitive overview of the effective algorithm the model appears to implement. Next, we describe our circuit discovery procedure, where we use causal methods to pinpoint the exact attention heads responsible for compositional induction. Finally, we validate this mechanism by applying targeted perturbations that predictably alter the model's behavior.

The Effective Algorithm

General Solution. We first provide a general solution to this type of compositional problem in Python-like pseudocode for intuitive understanding (Algorithm 1). We use 1-indexing (counting from 1) for tokens throughout.

Algorithm 1: Pseudocode solving the function & primitive composition problem

    # Define the question and symbol-color pairs
    # (by Question-Broadcast and Primitive-Pairing Heads)
    question <- [s1, func, s2]                        # definition
    symbol_to_color <- {s_i : c_i | i = 1, ..., n}    # definition
    color_to_symbol <- {c_i : s_i | i = 1, ..., n}    # reverse definition

    # Define the function; convert the function into a relational structure
    # between the input and output (by Primitive- and Function-Retrieval Heads)
    func_LHS <- [s3, func, s4]             # define function arguments
    func_RHS <- [c3, c3, c4, c4, c3]       # define function outputs
    symbol_to_idx <- {s3: idx1, s4: idx3}  # convert argument symbols to their index in the array

    idx_seq <- []
    for color in func_RHS:
        symbol <- color_to_symbol[color]
        idx <- symbol_to_idx[symbol]
        idx_seq.append(idx)
    # idx_seq = [idx1, idx1, idx3, idx3, idx1] in this case

    # Compose the output following the function's relational structure
    # (by RHS-Scanner and Output Heads)
    output <- []
    for idx in idx_seq:
        symbol <- question[idx]
        color <- symbol_to_color[symbol]
        output.append(color)
    return output
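For readers who want to execute the logic, the following is a runnable Python rendering of Algorithm 1 (our sketch, not the authors' code), applied to the guidance episode used in Figure 2. Variable names mirror the pseudocode; the episode contents are taken from the example prompt.

    # Runnable rendering of Algorithm 1 on the guidance episode
    # Prompt: B S A | A = red | D = pink | B S D = pink blue | B = blue | EOS

    def compose(question, symbol_to_color, func_lhs, func_rhs):
        """Solve one episode: re-apply the support-set function to the question (1-indexed)."""
        color_to_symbol = {c: s for s, c in symbol_to_color.items()}
        # index of each argument symbol within the function's LHS (skip the function token itself)
        symbol_to_idx = {s: i for i, s in enumerate(func_lhs, start=1) if s in symbol_to_color}

        # convert the RHS colors into indices into the LHS (the function's relational structure)
        idx_seq = [symbol_to_idx[color_to_symbol[c]] for c in func_rhs]

        # re-apply that relational structure to the question's own primitives
        return [symbol_to_color[question[i - 1]] for i in idx_seq]

    symbol_to_color = {"A": "red", "D": "pink", "B": "blue"}
    print(compose(question=["B", "S", "A"],
                  symbol_to_color=symbol_to_color,
                  func_lhs=["B", "S", "D"],
                  func_rhs=["pink", "blue"]))   # -> ['red', 'blue'], matching "SOS red blue EOS"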
Transformer Solution. Next, we describe the actual implementation of the algorithm with attention operations in Figure 2, through a guidance episode.

Step 1 (Figure 2a; Question-Broadcast Head). Primitive input tokens in the support (e.g., A) attend to the same primitive tokens in the question (A), inheriting the latter's index-in-question (3rd). The step is detailed in Figure 5b.

Step 2 (Figure 2b; Primitive-Pairing Head). Color tokens (red) attend to their associated primitive tokens (A), inheriting the latter's index-in-question (3rd). The step is detailed in Figure 5a.

Step 3 (Figure 2c; Primitive- and Function-Retrieval Heads). Color tokens on the function RHS (pink) attend to their associated primitive tokens on the function Left Hand Side (LHS) (D), inheriting the latter's relative-index-on-LHS (3rd). The step is detailed in Figure 9.

Step 4 (Figure 2d; RHS-Scanner Head). The 1st token in the Decoder (SOS) attends to the 1st tokens on the function Right Hand Side (RHS) (pink), inheriting the latter's previously inherited relative-index-on-LHS (3rd). The step is detailed in Figure 8.

Step 5 (Figure 2e; Output Head). The SOS token (with inherited relative-index-on-LHS = 3rd) attends to color tokens (red) with the same index-in-question (3rd), inheriting the latter's token identity (red), and generates the next prediction (red). The step is detailed in Figure 3. Then the 2nd token in the Decoder (red) starts over from Step 4, until completion of the function RHS.

[Figure 2 appears here: Summary of the circuit for compositional generalization. Top, the example episode's input and output:
Prompt: B S A | A = red | D = pink | B S D = pink blue | B = blue | EOS
Output: SOS red blue EOS
For panels a-e (Question-Broadcast, Primitive-Pairing, Primitive- and Function-Retrieval, RHS-Scanner, Output), the yellow boxes indicate self-attention heads and the blue boxes indicate cross-attention heads. Titles refer to the functional attention heads that execute the steps (discussed in detail later). We unfold all relevant information superimposed in tokens' embeddings and highlight their roles in attention operations. [1]*, the QK alignment discussed in the Primitive-Pairing Head section. [2]*, the QK alignment discussed in the Primitive-Retrieval Head section.]
Circuit Discovery

Nomenclature: for attention heads, Enc-self-0.5 stands for Encoder, self-attention, layer 0, head 5; similarly, Dec-cross-1.5 stands for Decoder, cross-attention, layer 1, head 5.

Output Head (Dec-cross-1.5; Figure 3b). We discovered the model's circuit backwards from the unembedding layer using logit attribution (nostalgebraist, 2020), which measures each decoder attention head's linear contribution to the final token logits (adjusted by the decoder's output LayerNorm). We identified Dec-cross-1.5 (decoder cross-attention, layer 1, head 5) as the primary contributor (Figure 3a).

Dec-cross-1.5's Q tokens always attend to the K tokens from the Encoder that are the next predicted ones. For example, in Figure 3b, the SOS token attends to instances of red in the support set, which is indeed the correct next output prediction. This attention accuracy (i.e., the max-attended token being the next-emitted token) of Dec-cross-1.5 remains above 90% for the first three tokens in the responses across all test episodes (Figure 3c), with Dec-cross-1.1 and -1.3 partially compensating beyond that point.

These observations suggest that Dec-cross-1.5's OV circuit feeds token identities directly to the decoder unembedding layer (output layer). Specifically, we observe that the outputs of the OV circuit, X W_V W_O, align closely (strong inner product) with the unembedding vectors of the corresponding tokens (Figure 3d). Hence, we designate Dec-cross-1.5 as the Output Head (while Dec-cross-1.1 and -1.3 perform similar but less dominant roles) (Algorithm Step 3).

Next, we show how the Output Head identifies the correct token through QK interactions.

[Figure 3 appears here: (a) Logit contributions of each decoder head to the logits of correct tokens. (b) Attention pattern of Dec-cross-1.5. (c) For Dec-cross-1.5, the percentage of attention focused on the next predicted token. (d) For Dec-cross-1.5, alignment (inner product) between its OV output (e.g., x_red W_V W_O) and the corresponding unembedding vector (e.g., Unembed_red). We estimated the null distribution by randomly sampling unembedding vectors.]
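The logit-attribution step can be summarized in a few lines. The sketch below is our illustration (with hypothetical cached-activation names, not the authors' code): it scores each decoder head by projecting its per-head residual-stream output onto the unembedding direction of the correct next token, which corresponds to the quantity in Figure 3a up to the LayerNorm adjustment.

    import numpy as np

    def logit_attribution(head_outputs, W_U, correct_token_ids):
        """Linear contribution of each decoder head to the correct-token logits.

        head_outputs:      (n_heads, n_positions, d_model) per-head outputs written to the
                           residual stream (hypothetical cached activations)
        W_U:               (d_model, vocab_size) unembedding matrix
        correct_token_ids: (n_positions,) id of the correct next token at each position
        Returns: (n_heads, n_positions) contribution of each head at each position.
        """
        logits = head_outputs @ W_U                       # (n_heads, n_positions, vocab)
        pos = np.arange(len(correct_token_ids))
        return logits[:, pos, correct_token_ids]          # pick out the correct-token logit

    # Averaging the result over positions and episodes gives a per-head score like Figure 3a.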
The K-Circuit to the Output Head. We first determine which encoder heads critically feed into the Output Head's K. To do this, we performed path-patching (K. R. Wang et al., 2022) by ablating all but one single encoder head and then measuring how much of the Output Head's QK behavior (i.e., attention accuracy) remained. During these experiments, the Output Head's Q was frozen using clean-run activations. Here we report patching results with mean-ablation (qualitatively similar to random-sample ablation) (details in Appendix).

Through this process, we identified Enc-self-1.1 and Enc-self-0.5 as the primary contributors to the Output Head's K, acting in a sequential chain (Figure 4). Next, we show how they sequentially encode symbols' index-in-question, which is critical for the QK alignment.

[Figure 4 appears here: Enc-self-1.1 and Enc-self-0.5 serve as the main contributors of the K-circuit for the Output Head. The K-circuit encodes primitive symbols' index-in-question.]

Primitive-Pairing Head (Enc-self-1.1; Figure 5a). This head exhibits a distinct attention pattern that pairs each color token with its associated primitive symbol token (e.g., in the support set, all instances of red attend to C). In other words, Enc-self-1.1 relays information (described below, as computed by, e.g., Enc-self-0.5) from the primitive symbols to their corresponding color tokens via its QK circuit. Hence, we call Enc-self-1.1 the Primitive-Pairing Head.

To investigate which upstream heads feed into the OV circuit of the Primitive-Pairing Head, we applied a sequential variant of path-patching, isolating the chain

    Upstream heads (e.g., Enc-self-0.5) --> Primitive-Pairing Head (V) --> Output Head (K),

while mean-ablating all other direct paths to the Output Head's K. We identified Enc-self-0.5 as an important node (Figure 5b).

Question-Broadcast Head (Enc-self-0.5; Figure 5b). All input symbols in the support set attend to their copies in the input question. In other words, Enc-self-0.5 broadcasts question-related information (including token identity and position) across symbols in the support set (henceforth the Question-Broadcast Head). We hypothesize that the primitive symbols' index-in-question is the critical information passed from the Question-Broadcast Head's Z through the Primitive-Pairing Head's Z and lastly into the Output Head's K.
[Figure 5 appears here: (a) Top, contributions to the Output Head's performance (percentage of attention on the correct next token) via K; bottom, attention pattern of Enc-self-1.1 (Primitive-Pairing Head). (b) Top, contributions to the Output Head's performance through the Primitive-Pairing Head's V; bottom, attention pattern of Enc-self-0.5 (Question-Broadcast Head).]

Index-In-Question Tracing. To validate this hypothesis, we examined the Question-Broadcast Head's Z for each primitive-symbol token. We reduced these outputs to two principal components and colored each point by its index-in-question. As illustrated in Figure 6a, the Question-Broadcast Head's Z exhibits clear clustering, indicating that the index-in-question is robustly encoded at this stage (quantified by the R^2 score, i.e., the amount of variance explained by index identity; details in Appendix). We further confirmed that the Primitive-Pairing Head's Z preserves index-in-question (Figure 6b) and that the resulting Output Head's K also reflects the same clustering (Figure 6c).

[Figure 6 appears here: Principal Component Analysis (PCA) of token embeddings, colored by their associated index-in-question. Concretely, for a prompt like 'B S A | A=red | B=blue | ...', in (a), points are the Z of 'A' and 'B' in the support (A labeled 3rd, B labeled 1st); in (b), points are the Z of 'red' and 'blue' in the support (red labeled 3rd, blue labeled 1st); in (c), points are the K of 'red' and 'blue' in the support (red labeled 3rd, blue labeled 1st). The distinct clusters suggest strong index information. The R^2 score quantifies the percentage of total variance explained by the index identity.]

Causal Ablation. Finally, we verified that this circuit indeed causally propagates index-in-question. Ablating the Question-Broadcast Head's Z (together with the similarly functioning Enc-self-0.7) obliterates the clustering in the Primitive-Pairing Head's Z; ablating the Primitive-Pairing Head's Z (together with the similarly functioning Enc-self-1.0) disrupts the clustering in the Output Head's K (Figure 6). We therefore conclude that the Question-Broadcast Head, the Primitive-Pairing Head, and heads with similar functions form a crucial K-circuit pathway, passing index-in-question information from primitive tokens to their associated color tokens in the Output Head's K.
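The index-in-question tracing above is a standard dimensionality-reduction check. A minimal sketch of that kind of analysis is given below (our illustration; Z stands for cached head outputs gathered across episodes and index_labels for the associated index-in-question labels, both hypothetical arrays).

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    def plot_index_clustering(Z, index_labels):
        """Project head outputs onto 2 principal components and color by index-in-question.

        Z:            (n_tokens, d_model) cached outputs of a head (e.g., Question-Broadcast Z)
        index_labels: (n_tokens,) integer index-in-question for each token
        """
        pcs = PCA(n_components=2).fit_transform(Z)
        for idx in np.unique(index_labels):
            sel = index_labels == idx
            plt.scatter(pcs[sel, 0], pcs[sel, 1], label=f"index {idx}", s=8)
        plt.legend()
        plt.xlabel("PC1")
        plt.ylabel("PC2")
        plt.show()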
The Q-Circuit to the Output Head. Having established the role of the K-circuit, we next investigate where its Q originates. We again relied on sequential path-patching to pinpoint which decoder heads ultimately provide the Output Head's Q. We identified Dec-cross-0.6 as the main conduit for the Q values of the Output Head. Enc-self-1.0 and -1.2 supply positional embeddings that enable the decoder to track primitive symbols' relative-index-on-LHS, thereby completing the QK alignment for correct predictions (Figure 7).

[Figure 7 appears here: Schematic of the Q-circuit. The Output Head inherits its Q from Dec-cross-0.6, which aggregates positional information passed from Enc-self-1.0 and Enc-self-1.2. The Q-circuit encodes primitive symbols' relative-index-on-LHS.]

RHS-Scanner Head (Dec-cross-0.6; Figure 8b). We identify Dec-cross-0.6 as the dominant contributor to the Output Head's Q (Figure 8a). Analyzing Dec-cross-0.6's attention patterns reveals that each Q token (from the Decoder in the cross-attention) sequentially attends to the color tokens (in the support set) on the function's RHS (Figure 8b). For example, the first Decoder token (SOS) attends to the first RHS tokens (purple, red, yellow), the second query token (red) attends to the second RHS tokens (red, purple, red), and so on. This iterative scanning mechanism enables the decoder to reconstruct the transformation defined by the function. Hence we call Dec-cross-0.6 the RHS-Scanner Head.

[Figure 8 appears here: (a) Contribution to the Output Head's performance via Q. (b) Attention pattern of Dec-cross-0.6 (RHS-Scanner Head).]

Primitive-Retrieval Head (Enc-self-1.0; Figure 9b) and Function-Retrieval Head (Enc-self-1.2; Figure 9c). Next, we looked for critical encoder heads that feed into the RHS-Scanner Head and finally contribute to the Output Head's Q. Unlike the K-circuit discovery, where "keep-only-one-head" ablations are sufficient, multiple heads appear to contribute partial but complementary information. To isolate their roles, we measured drops in the Output Head's accuracy when ablating each encoder head individually while keeping the others intact (the "ablate-only-one-head" approach; more discussion in Appendix).

This analysis highlighted Enc-self-1.0 and Enc-self-1.2 as critical (Figure 9a). In Enc-self-1.0, within the support set, each color token on the RHS attends back to its corresponding symbol on the LHS, inheriting that symbol's token and positional embedding (henceforth the Primitive-Retrieval Head) (Figure 9b). Meanwhile, Enc-self-1.2 is similar, such that each color token on the RHS attends back to its function symbol on the LHS, passing that token and positional embedding on to the color token (henceforth the Function-Retrieval Head) (Figure 9c).

Why do the color tokens on the RHS attend back to both kinds of information on the LHS? We reason that if a color token on the RHS were to encode its primitive symbol's relative-index-on-LHS (for example, in '...| D=pink | A S D=pink red |...', pink would encode 3rd, inherited from D, since D is 3rd in 'A S D'), then the absolute position of D must be compared with the absolute position of S to yield a relative position. Now that, with the Primitive- and Function-Retrieval Heads, each RHS color token carries two positional references, (1) the associated LHS primitive and (2) the function symbol, we hypothesize that by comparing these references the model can infer the primitive symbol's relative-index-on-LHS for each of the associated color tokens on the RHS (see the sketch below).

[Figure 9 appears here: (a) Contribution to the Output Head's performance via Q. (b) Contribution to the Output Head's performance via the RHS-Scanner's V. (c) Attention pattern of Dec-cross-0.6. (d) and (e) Attention patterns of Enc-self-1.0 and Enc-self-1.2.]
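To make the comparison of the two positional references concrete, here is a toy illustration (entirely our own, with made-up absolute positions; it is not a claim about the exact arithmetic the network implements): given the absolute position of the LHS primitive and of the function symbol, their difference recovers the primitive's relative index on the LHS.

    # Toy illustration for '... | D=pink | A S D = pink red | ...'
    # Suppose (hypothetically) these are absolute token positions in the prompt:
    lhs_positions = {"A": 10, "S": 11, "D": 12}   # LHS of the function definition

    def relative_index_on_lhs(primitive, func_symbol="S", positions=lhs_positions):
        """1-indexed position of `primitive` on the LHS, recovered by comparing the
        primitive's absolute position with the function symbol's absolute position
        (the function symbol sits at LHS index 2 for a double-argument function)."""
        return (positions[primitive] - positions[func_symbol]) + 2

    print(relative_index_on_lhs("D"))  # 3: D is the 3rd token of 'A S D'
    print(relative_index_on_lhs("A"))  # 1: A is the 1st token of 'A S D'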
Relative-Index-On-LHS Tracing. To confirm that our discovered circuit genuinely encodes the relative-index-on-LHS in the Output Head's Q, we conducted three complementary ablation experiments summarized in Figure 10:

• Retaining only the Primitive- and Function-Retrieval Heads. When all other encoder heads are ablated, the RHS-Scanner Head's Z still carries relative-index-on-LHS information that propagates to the Output Head's Q, indicating that these two heads alone provide sufficient index information.

• Ablating the Primitive- or Function-Retrieval Head individually. Ablating either head disrupts the clustering by relative-index-on-LHS in the RHS-Scanner Head's Z, demonstrating that both heads are necessary to preserve the full index information.

• Ablating the RHS-Scanner Head (together with Dec-cross-0.0 and -0.3). These decoder heads share similar attention patterns that track color tokens on the function's RHS. When all three are ablated, clustering by relative-index-on-LHS is eliminated from the Output Head's Q.

Thus, we conclude that the Q-circuit depends on the RHS-Scanner Head to capture the relative-index-on-LHS information supplied by the Primitive- and Function-Retrieval Heads. By aligning these Q signals with the K, the model consistently determines which token to generate next.

[Figure 10 appears here: PCA of token embeddings labeled by relative-index-on-LHS. Concretely, for an episode with prompt 'B S A | A=red | B=blue | A S B=blue red' and prediction 'SOS red blue EOS', in (a), points are the Z of 'SOS' and 'red' in the decoder input tokens (SOS is labeled 3rd, because SOS attends to the blue on the function RHS, and B is the 3rd token on the LHS; similarly, red is labeled 1st); in (b), points are the Q of decoder input tokens (SOS is labeled 3rd, red is labeled 1st). The R^2 score quantifies the percentage of total variance explained by the index identity.]

Targeted Perturbation Steers Behavior

So far, our circuit tracing indicates that the K-circuit of the Output Head encodes the primitive symbols' index-in-question, and that the Q-circuit encodes the primitive symbols' relative-index-on-LHS. We reason that if the QK circuit of the Output Head truly leverages the primitive symbol indices to predict the next word, then swapping that index information across different color tokens should also swap the corresponding attention patterns observed in the Output Head.

Swapping Index Information. Concretely, we select two primitive symbols in the question (e.g., 'B S A | A=red | B=blue |...'). The red token will have index-in-question=3rd, inherited from A (similarly, blue will have 1st), on the K-side of the Output Head. If the Q-side expects a particular index from the K-side (e.g., 'SOS' in Q may carry relative-index-on-LHS=3rd and expect tokens carrying index-in-question=3rd from K), a swap of the index information in K should lead to a predictable shift in which tokens the head attends to. We performed this perturbation in the K-circuit of the Output Head while freezing its Q-circuit. Indeed, when we swap only the position embeddings of B and A on the Question-Broadcast Head's V (the most upstream node in the K-circuit), with everything else intact, we observe that the Output Head systematically "reverts" its attention from red to blue based on their swapped positions (Figure 11).

This intervention thus provides causal evidence that the Output Head's QK alignment relies on the index information on both sides passed through the sub-circuits. It does not merely degrade or randomly scramble the Output Head's behavior; rather, the predictions shift in a way directly consistent with our interpretation of how index information is encoded and matched between Q and K. The model's predictable response to this precise manipulation underscores that we have correctly identified the sufficient pathways.

[Figure 11 appears here: Swapping position embeddings of tokens in the question ('B S A | A = red | ...'; the '3rd' and '1st' position information carried in the encoder residual stream is exchanged on the V side) causes a predictable realignment of attention in the Output Head through its K-circuit, confirming that the discovered QK circuit indeed encodes positional indices. Bars compare the Output Head's attention percentage on the correct token versus the other token, for the clean run and the swapped-position-embedding run.]
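A minimal sketch of this intervention is shown below (our illustration; the cache keys and the surrounding hook machinery are hypothetical stand-ins for whatever activation-patching tooling is used, and only the embedding-swap step itself is spelled out).

    import numpy as np

    def swap_position_embeddings(pos_emb, prompt_tokens, sym_a="A", sym_b="B"):
        """Return a copy of the positional embeddings with two question symbols swapped.

        pos_emb:       (n_tokens, d_model) positional embeddings added to the encoder input
        prompt_tokens: list of token strings for the same positions (question tokens come first,
                       so .index() picks the in-question occurrence)
        """
        i = prompt_tokens.index(sym_a)    # position of A in the question
        j = prompt_tokens.index(sym_b)    # position of B in the question
        swapped = pos_emb.copy()
        swapped[[i, j]] = swapped[[j, i]]
        return swapped

    # Hypothetical usage with a clean-run cache:
    #   patched_pos = swap_position_embeddings(pos_emb, tokens)        # swap B and A
    #   feed patched_pos only into the Question-Broadcast Head's V (the most upstream K-circuit node),
    #   freeze the Output Head's Q at clean-run values, re-run, and compare the Output Head's
    #   attention on 'red' vs. 'blue' (Figure 11).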
Overall, by performing causal backtracking, validating information flow through ablations, and finally applying targeted activation patching, we confirm that the compositional induction mechanism we uncovered is both interpretable and causally relevant to the model's behavior.

Discussion

In this work, we investigated how a compact transformer model achieves compositional induction on a synthetic function composition task. By combining path-patching analyses with causal ablations, we uncovered a detailed QK circuit that encodes index information from both the question and the function's LHS. We further demonstrated that precisely swapping these positional embeddings in the model's activations leads to predictable changes in behavior, thereby confirming the causal relevance of the discovered circuit. These results show that, even for complex functions, transformers can implement a structured and interpretable mechanism.

Limitations and Future Work

Model Scale. Our circuit analysis focused on a relatively small transformer. Establishing whether similar interpretable circuits exist in larger models remains an important open question to follow up.

Manual Circuit Discovery. The techniques employed here required substantial human effort—path-patching, ablations, and extensive interpretation of attention heads. For large-scale models, such manual approaches become less feasible. We therefore see a need for automated or semi-automated methods that can discover and interpret these circuits with less human input.

Partial Perturbations. Although our targeted activation swaps successfully steered the Output Head's behavior, we have not demonstrated a complete perturbation of its predicted tokens. This is due to the distributed nature of the underlying mechanism (multiple heads fulfill similar roles). Coordinating interventions across all such heads will require systematic workflows, which we aim to develop in the future.

Despite these constraints, our work shows that disassembling transformer circuitry can yield two key benefits. First, it illuminates how compositional functions are mechanistically instantiated at the attention-head level. Second, it enables targeted, activation-based interventions that reliably steer model behavior. We hope these contributions will encourage further research on scalable circuit discovery methods and more automated interpretability approaches for large-scale models.

Acknowledgement

CT was supported by the Friends of McGovern Fellowship. MJ was supported by the Simons Foundation.

References

Bhaskar, A., Wettig, A., Friedman, D., & Chen, D. (2024, November). Finding transformer circuits with edge pruning. In The thirty-eighth annual conference on neural information processing systems.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., . . . Amodei, D. (2020, May). Language models are few-shot learners. arXiv [cs.CL].

Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., . . . Zhang, Y. (2023, March). Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv [cs.CL].

Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., . . . Zaremba, W. (2021, July). Evaluating large language models trained on code. arXiv [cs.LG].

Conmy, A., Mavor-Parker, A. N., Lynch, A., Heimersheim, S., & Garriga-Alonso, A. (2023, April). Towards automated circuit discovery for mechanistic interpretability. arXiv [cs.LG].

DeepSeek-AI, Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., . . . Pan, Z. (2024, December). DeepSeek-V3 technical report. arXiv [cs.CL].

Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., . . . Olah, C. (2021). A mathematical framework for transformer circuits. https://transformer-circuits.pub/2021/framework/index.html. (Accessed: 2025-2-4)

Fodor, J. A. (1979). The language of thought. London, England: Harvard University Press.

Goldowsky-Dill, N., MacLeod, C., Sato, L., & Arora, A. (2023, April). Localizing model behavior with path patching. arXiv [cs.LG].

Hanna, M., Liu, O., & Variengien, A. (2023, November). How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model. In Thirty-seventh conference on neural information processing systems.

He, T., Doshi, D., Das, A., & Gromov, A. (2024, November). Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks. In The thirty-eighth annual conference on neural information processing systems.

Heimersheim, S., & Janiak, J. (2023). A circuit for python docstrings in a 4-layer attention-only transformer.

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J. (2020, September). Measuring massive multitask language understanding. arXiv [cs.CY].

Hofstätter, F. (2023). Explaining the transformer circuits framework by example.

Hsu, A. R., Zhou, G., Cherapanamjeri, Y., Huang, Y., Odisho, A. Y., Carroll, P. R., & Yu, B. (2024, June). Efficient automated circuit discovery in transformers using contextual decomposition. arXiv [cs.AI].
Hupkes, D., Dankers, V., Mul, M., & Bruni, E. (2019, August). Compositionality decomposed: how do neural networks generalise? arXiv [cs.CL].

Lake, B. M., & Baroni, M. (2023, October). Human-like systematic generalization through a meta-learning neural network. Nature.

Lake, B. M., Ullman, T. D., Tenenbaum, J. B., & Gershman, S. J. (2016, April). Building machines that learn and think like people. arXiv [cs.AI].

LawrenceC, Garriga-alonso, A., Goldowsky-Dill, N., ryan greenblatt, Radhakrishnan, A., Buck, & Thomas, N. (2022). Causal scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]. https://www.lesswrong.com/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing. (Accessed: 2025-2-5)

Nanda, N., Chan, L., Lieberum, T., Smith, J., & Steinhardt, J. (2022, September). Progress measures for grokking via mechanistic interpretability. In The eleventh international conference on learning representations.

nostalgebraist. (2020). interpreting GPT: the logit lens. https://www.alignmentforum.org/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens. (Accessed: 2025-2-5)

Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., . . . Olah, C. (2022, September). In-context learning and induction heads. arXiv [cs.LG].

Rai, D., Zhou, Y., Feng, S., Saparov, A., & Yao, Z. (2024, July). A practical review of mechanistic interpretability for transformer-based language models. arXiv [cs.AI].

Sanford, C., Hsu, D., & Telgarsky, M. (2024, February). Transformers, parallel computation, and logarithmic depth. arXiv [cs.LG].

Vaswani, A., Shazeer, N. M., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., . . . Polosukhin, I. (2017, June). Attention is all you need. Neural Inf Process Syst, 30, 5998–6008.

Wang, K. R., Variengien, A., Conmy, A., Shlegeris, B., & Steinhardt, J. (2022, September). Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In The eleventh international conference on learning representations.

Wang, M., & E, W. (2024, February). Understanding the expressive power and mechanisms of transformer for sequence modeling. arXiv [cs.LG].

Wang, M., Yu, R., E, W., & Wu, L. (2024, October). How transformers get rich: Approximation and dynamics analysis. arXiv [cs.LG].

Zhang, D., Tigges, C., Zhang, Z., Biderman, S., Raginsky, M., & Ringer, T. (2024, January). Transformer-based models are not yet perfect at learning to emulate structural recursion. Trans. Mach. Learn. Res., 2024.

Zhang, Z., Lin, P., Wang, Z., Zhang, Y., & Xu, Z.-Q. J. (2025, January). Complexity control facilitates reasoning-based compositional generalization in transformers. arXiv [cs.CL].

Appendix

Transformer Model

We adopt an encoder-decoder architecture, which naturally fits the task by allowing the encoder to process the prompt (question + support) with bidirectional self-attention and the decoder to generate an output sequence with causal and cross-attention. Specific hyperparameters include:

• Token embedding dimension: d_model = 128

• Attention embedding dimension: d_head = 16

• Eight attention heads per layer (both encoder and decoder)

• Pre-LayerNorm (applied to attention/MLP modules) plus an additional LayerNorm at the encoder and decoder outputs

• Standard sinusoidal positional embeddings

The encoder comprises two layers of bidirectional self-attention + MLP, while the decoder comprises two layers of causal self-attention + cross-attention + MLP. We train the model by minimizing the cross-entropy loss (averaged over tokens) using the Adam optimizer. The learning rate is initialized at 0.001 with a warm-up phase over the first epoch, then linearly decays to 0.00005 over training. We apply dropout of 0.1 to both input embeddings and internal Transformer layers, and train with a batch size of 25 episodes. All experiments are performed on an NVIDIA A100 GPU.
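For orientation, the hyperparameters above roughly correspond to the following PyTorch configuration. This is a sketch under stated assumptions only: the authors' exact implementation is not given, the feed-forward width is not reported (assumed below), and the use of nn.Transformer itself, as well as the token embedding, positional encoding, and unembedding layers it omits, are assumptions.

    import torch
    from torch import nn

    d_model, n_heads, n_layers = 128, 8, 2   # d_head = d_model / n_heads = 16

    model = nn.Transformer(
        d_model=d_model,
        nhead=n_heads,
        num_encoder_layers=n_layers,
        num_decoder_layers=n_layers,
        dim_feedforward=4 * d_model,   # assumed; not reported in the paper
        dropout=0.1,
        norm_first=True,               # pre-LayerNorm, as described above
        batch_first=True,
    )

    # Token embeddings, sinusoidal positional encodings, and the unembedding layer
    # would be added around this module; the warm-up + linear-decay schedule would
    # wrap the optimizer below.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)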
Task Structure

In each episode, the support set and question are concatenated into a single prompt for the encoder, with question tokens placed at the start. The question, primitive assignments, and function assignments are separated by '|' tokens, while primitive and function assignments are identified by '='. Overall, there are 6 possible colors and 9 symbols that may serve as either color primitives or function symbols. Each episode contains 2–4 function assignments and 3–4 color assignments. A function may be a single-argument (arg func) or double-argument (arg1 func arg2) function. The function's right-hand side (RHS) describes how arguments are transformed, generated by randomly sampling up to length-5 sequences of arguments and mapping them to color tokens. Each prompt ends with an 'EOS' token. During decoding, the model begins with an 'SOS' token and iteratively appends each newly generated token until it emits 'EOS'.

We randomly generate 10,000 episodes for training and 2,000 for testing, ensuring that the primitive and function assignments in testing episodes do not overlap with those in the training set.
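A small generator sketch consistent with the format above is shown below. It is our illustration only: the color and symbol names are assumed, it samples a fixed 3 color and 2 function assignments (within the stated 2–4 / 3–4 ranges), and it uses only double-argument functions.

    import random

    COLORS = ["red", "blue", "green", "pink", "purple", "yellow"]   # 6 colors (names assumed)
    SYMBOLS = list("ABCDEFGHI")                                      # 9 symbols (names assumed)

    def make_episode(rng=random):
        symbols = rng.sample(SYMBOLS, 5)
        prims, funcs = symbols[:3], symbols[3:]       # 3 color assignments, 2 function assignments
        colors = rng.sample(COLORS, 3)
        symbol_to_color = dict(zip(prims, colors))

        support = [f"{s} = {c}" for s, c in symbol_to_color.items()]
        for f in funcs:                               # double-argument functions: arg1 f arg2
            a, b = rng.sample(prims, 2)
            rhs_args = [rng.choice([a, b]) for _ in range(rng.randint(1, 5))]
            rhs = [symbol_to_color[s] for s in rhs_args]
            support.append(f"{a} {f} {b} = {' '.join(rhs)}")

        q1, q2 = rng.sample(prims, 2)
        question = f"{q1} {funcs[0]} {q2}"            # question reuses a function defined above
        return " | ".join([question] + support) + " | EOS"

    print(make_episode())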

Path Patching

Path patching is a method for isolating how a specific source node in the network influences a particular target node. It proceeds in three runs:

1. Clean Run: Feed the input through the model normally and cache all intermediate activations (including those of the source and target nodes).

2. Perturbed Run: Freeze all direct paths into the target node using their cached activations from the clean run. For the source node alone, replace its cached activation with mean-ablated values. Record the new, perturbed activation at the target node.

3. Evaluation Run: Supply the target node with the perturbed activation from Step 2, then measure any resulting changes in the model's output. This quantifies how the source node's contribution (altered via mean-ablation) affects the target node's behavior.

Chained Path Patching. When analyzing circuits that span multiple nodes in sequence, we extend path patching in a chain-like manner. For instance, to evaluate a chain A → B → C:

• We first perform path patching on the sub-path B → C as usual.

• Next, to capture how A specifically influences B, we isolate and record A's effect on B via mean-ablation on all other inputs to B.

• Finally, we patch that recorded activation into B and evaluate its effect on C.

For a chain of length N, we run N + 1 forward passes, ensuring that the measured impact on the target node reflects only the chained pathway. This approach precisely attributes the model's behavior to the intended sequence of dependencies.
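The three-run logic can be sketched on a toy stand-in network (our illustration, not the paper's code). Here two "heads" A and B write into a node C; the example mean-ablates the path A → C while freezing the B → C path at its clean value. In this toy, C is already the output, so the perturbed and evaluation runs collapse into one step; in a real transformer the perturbed target activation would be re-injected and the model run to the logits.

    import numpy as np

    # Toy stand-in network: two "heads" (A, B) feed a downstream "target" node C.
    def head_A(x):  return np.tanh(x)
    def head_B(x):  return 0.5 * x
    def target_C(a, b):  return a + 2.0 * b          # target reads both direct paths

    def path_patch_effect(inputs, mean_A):
        """Effect on C of mean-ablating the path A -> C, with the B -> C path frozen."""
        effects = []
        for x in inputs:
            # 1. Clean run: cache activations of all nodes.
            a_clean, b_clean = head_A(x), head_B(x)
            c_clean = target_C(a_clean, b_clean)
            # 2./3. Perturbed + evaluation: freeze the other direct path (B) at its clean
            #       value, replace the source node (A) with its mean-ablated activation,
            #       and compare the target's behavior with and without the A -> C path.
            c_perturbed = target_C(mean_A, b_clean)
            effects.append(c_clean - c_perturbed)
        return np.mean(effects)

    inputs = np.linspace(-1, 1, 101)
    mean_A = np.mean([head_A(x) for x in inputs])    # mean-ablation value for A
    print(path_patch_effect(inputs, mean_A))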

Two Modes of Ablation. To assess how individual heads or nodes contribute to the target node, we use two complementary modes:

1. Keep-only-one-head: Mean-ablate all direct paths to the target node except for one node, which retains its clean-run activation. If the target node's performance remains stable, this single node is sufficient for driving the relevant behavior. However, this method may fail when multiple heads each provide partial information that is only collectively sufficient.

2. Ablate-only-one-head: Keep all source nodes from the clean run except one, which is mean-ablated. If performance degrades, that ablated node is necessary. However, if the node's information is redundant or duplicated across other paths, the target node's performance will not significantly change.

By combining both modes, we identified the putative QK-circuit of the Output Head. We then validated the circuits by inspecting the information they propagate and causally erasing that information by ablating specific upstream nodes.

R^2 Score

To quantify how much an activation dataset Y encodes a particular latent variable Z, we compute a linear regression of Z (one-hot encoded) onto Y and measure the explained variance:
$$R^2 = 1 - \frac{SS_{\mathrm{res}}}{SS_{\mathrm{total}}}.$$
An R^2 value of 1.0 indicates that Z fully explains the variance in Y, whereas an R^2 near 0.0 implies that Z provides no information about Y.
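In code, this amounts to fitting a linear map from the one-hot index labels to the activations and reporting the variance explained, as in the sketch below (our illustration; Y and labels stand for hypothetical cached activations and their integer index labels starting at 0, and we read "regression of Z onto Y" as predicting Y from one-hot Z, i.e., the variance of Y explained by index identity).

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score

    def index_r2(Y, labels):
        """Variance of activations Y (n_samples, d_model) explained by index identity."""
        Z = np.eye(labels.max() + 1)[labels]          # one-hot encode the latent variable
        Y_hat = LinearRegression().fit(Z, Y).predict(Z)
        return r2_score(Y, Y_hat, multioutput="variance_weighted")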
