An Explainable Transformer Circuit For Compositional Generalization
Cheng Tang1,2
To trace the K-circuit, we retained a single candidate head's direct path to the Output Head (K) while mean-ablating all other direct paths to the Output Head's K (the "keep-only-one-head" approach). We identified Enc-self-0.5 as an important node (Figure 5b).

Figure 4: Enc-self-1.1 and Enc-self-0.5 serve as the main contributors to the K-circuit for the Output Head. The K-circuit encodes primitive symbols' index-in-question.
Figure 5: (a) Top: contributions to the Output Head's performance (percentage of attention on the correct next token) via K. Bottom: attention pattern of Enc-self-1.1. (b) Top: contributions to the Output Head's performance through the Primitive-Pairing Head's V. Bottom: attention pattern of Enc-self-0.5.

Figure 6: Principal Component Analysis (PCA) of token embeddings, colored by their associated index-in-question. Concretely, for a prompt like 'B S A | A=red | B=blue | ...', in (a), points are the Z of 'A' and 'B' in the support (A labeled 3rd, B labeled 1st); in (b), points are the Z of 'red' and 'blue' in the support (red labeled 3rd, blue labeled 1st); in (c), points are the K of 'red' and 'blue' in the support (red labeled 3rd, blue labeled 1st). The distinct clusters suggest strong index information. The R2 score quantifies the percentage of total variance explained by the index identity.
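The labeling analysis behind Figure 6 can be reproduced with a short sketch; `Z` and `idx` below are assumed arrays of cached head outputs and index-in-question labels, not the authors' code:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical inputs: `Z` stacks one head's output for the support-set
# tokens across many episodes, shape (n_tokens, d_model); `idx` holds each
# token's index-in-question (0 = 1st, 1 = 2nd, ...).
def project_by_index(Z: np.ndarray, idx: np.ndarray) -> np.ndarray:
    pcs = PCA(n_components=2).fit_transform(Z)  # 2-D view of the activations
    # Scatter-plot pcs colored by `idx`; distinct clusters indicate that the
    # head's output encodes the index identity.
    return pcs
```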
Q-circuit encodes primitive symbols’ relative-index-on-LHS.
RHS-Scanner Head (Dec-cross-0.6; Figure 8b) We identify Dec-cross-0.6 as the dominant contributor to the Output Head's Q (Figure 8a). Analyzing Dec-cross-0.6's attention patterns reveals that each Q token (from the decoder side of the cross-attention) sequentially attends to the color tokens (in the support set) on the function's RHS (Figure 8b). For example, the first decoder token (SOS) attends to the first RHS tokens (purple, red, yellow), the second query token (red) attends to the second RHS tokens (red, purple, red), and so on. This iterative scanning mechanism enables the decoder to reconstruct the transformation defined by the function. Hence we call Dec-cross-0.6 the RHS-Scanner Head.

We then looked for critical encoder heads that feed into the RHS-Scanner Head and ultimately contribute to the Output Head's Q. Unlike the K-circuit discovery, where "keep-only-one-head" ablations were sufficient, multiple heads appear to contribute partial but complementary information. To isolate their roles, we measured the drop in the Output Head's accuracy when ablating each encoder head individually while keeping the others intact (the "ablate-only-one-head" approach; see further discussion in the Appendix).

This analysis highlighted Enc-self-1.0 and Enc-self-1.2 as critical (Figure 9a). In Enc-self-1.0, within the support set, each RHS color token carries two positional references: (1) the associated LHS primitive and (2) the function symbol; we accordingly refer to Enc-self-1.0 and Enc-self-1.2 as the Primitive- and Function-Retrieval Heads, which retrieve the LHS primitives and function symbols of the associated color tokens on the RHS (Figures 9d and 9e).
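In code, the "ablate-only-one-head" scan can be sketched as follows; the module path, per-head output shape, and `output_head_accuracy` helper are illustrative assumptions, not the paper's implementation:

```python
import torch

def ablate_one_head_scan(model, episodes, head_means, n_layers=2, n_heads=8):
    """Mean-ablate each encoder head in turn; record the Output Head's accuracy drop."""
    baseline = output_head_accuracy(model, episodes)   # hypothetical task metric
    drops = {}
    for layer in range(n_layers):
        attn = model.encoder.layers[layer].self_attn   # assumed module path
        for head in range(n_heads):
            def hook(module, inputs, z, layer=layer, head=head):
                # z: per-head outputs, assumed shape (batch, seq, n_heads, d_head)
                z = z.clone()
                z[..., head, :] = head_means[layer][head]  # dataset-mean activation
                return z
            handle = attn.register_forward_hook(hook)
            with torch.no_grad():
                drops[(layer, head)] = baseline - output_head_accuracy(model, episodes)
            handle.remove()
    return drops
```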
Figure 8: (a) Contribution to the Output Head's performance via Q. (b) Attention pattern of the RHS-Scanner Head (Dec-cross-0.6).

Relative-Index-On-LHS Tracing To confirm that our discovered Q-circuit indeed carries relative-index-on-LHS information, we performed three targeted ablations:
• Retaining only the Primitive- and Function-Retrieval Heads: When all other encoder heads are ablated, the RHS-Scanner Head's Z still carries relative-index-on-LHS information that propagates to the Output Head's Q, indicating that these two heads alone provide sufficient index information.

• Ablating the Primitive- or Function-Retrieval Head individually: Ablating either head disrupts the clustering by relative-index-on-LHS in the RHS-Scanner Head's Z, demonstrating that both heads are necessary to preserve the full index information.

• Ablating the RHS-Scanner Head (together with Dec-cross-0.0 and -0.3): These decoder heads share similar attention patterns that track color tokens on the function's RHS. When all three are ablated, clustering by relative-index-on-LHS is eliminated from the Output Head's Q.

Thus, we conclude that the Q-circuit depends on the RHS-Scanner Head to capture the relative-index-on-LHS information supplied by the Primitive- and Function-Retrieval Heads. By aligning these Q signals with the K signals, the model consistently determines which token to generate next.

Figure 9: (a) Contribution to the Output Head's performance via Q. (b) Contribution to the Output Head's performance via the RHS-Scanner's V. (c) Attention pattern of Dec-cross-0.6. (d) and (e) Attention patterns of Enc-self-1.0 and Enc-self-1.2.
Figure 10: PCA of token embeddings labeled by relative-index-on-LHS. Concretely, for an episode with prompt 'B S A | A=red | B=blue | A S B=blue red' and prediction 'SOS red blue EOS', in (a), points are the Z of 'SOS' and 'red' among the decoder input tokens (SOS is labeled 3rd because SOS attends to the blue on the function RHS, and B is the 3rd symbol on the LHS; similarly, red is labeled 1st); in (b), points are the Q of the decoder input tokens (SOS is labeled 3rd, red is labeled 1st). The R2 score quantifies the percentage of total variance explained by the index identity.

Figure 11: Swapping the position embeddings of tokens in the question causes a predictable realignment of attention in the Output Head through its K-circuit, confirming that the discovered QK circuit indeed encodes positional indices.
Targeted Perturbation Steers Behavior

So far, our circuit tracing indicates that the K-circuit of the Output Head encodes the primitive symbols' index-in-question, and that the Q-circuit encodes the primitive symbols' relative-index-on-LHS. We reason that if the QK circuit of the Output Head truly leverages the primitive symbol indices to predict the next word, then swapping that index information across different color tokens should also swap the corresponding attention patterns observed in the Output Head.
Swapping Index Information Concretely, we select two primitive symbols in the question (e.g., 'B S A | A=red | B=blue | ...'). The red token will carry index-in-question=3rd, inherited from A (similarly, blue will carry 1st), on the K-side of the Output Head. If the Q-side expects a particular index from the K-side (e.g., 'SOS' in Q may carry relative-index-on-LHS=3rd and expect tokens carrying index-in-question=3rd from K), a swap of the index information in K should lead to a predictable shift in which tokens the head attends to. We performed this perturbation in the K-circuit of the Output Head while freezing its Q-circuit. Indeed, when we swap only the position embeddings of B and A on the Question-Broadcast Head's V (the most upstream node in the K-circuit), with everything else intact, we observe that the Output Head systematically "reverts" its attention from red to blue based on their swapped positions (Figure 11).
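In implementation terms, the swap is a narrow activation patch. A minimal sketch follows; the position-embedding tensor layout and the hook wiring into the (hypothetically named) Question-Broadcast Head are assumptions:

```python
import torch

def swap_pos_embeddings(pos_emb: torch.Tensor, idx_a: int, idx_b: int) -> torch.Tensor:
    """Swap the position embeddings of two question tokens (e.g., A and B).

    `pos_emb` is assumed to have shape (seq_len, d_model); `idx_a`/`idx_b`
    are the positions of the two primitive symbols in the question.
    """
    patched = pos_emb.clone()
    patched[[idx_a, idx_b]] = pos_emb[[idx_b, idx_a]]
    return patched

# The patched embeddings are supplied only on the path into the
# Question-Broadcast Head's V, while the Q-circuit and every other path keep
# their clean activations -- a path patch rather than a global input edit.
```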
This intervention thus provides causal evidence that the Output Head's QK alignment relies on the index information on both sides, passed through the sub-circuits. The perturbation does not merely degrade or randomly scramble the Output Head's behavior; rather, the predictions shift in a way directly consistent with our interpretation of how index information is encoded and matched between Q and K. The model's predictable response to this precise manipulation underscores that we have correctly identified the sufficient pathways.

Overall, by performing causal backtracking, validating information flow through ablations, and finally applying targeted activation patching, we confirm that the compositional induction mechanism we uncovered is both interpretable and causally relevant to the model's behavior.

Discussion

In this work, we investigated how a compact transformer model achieves compositional induction on a synthetic function composition task. By combining path-patching analyses with causal ablations, we uncovered a detailed QK circuit that encodes index information from both the question and the function's LHS. We further demonstrated that precisely swapping these positional embeddings in the model's activations leads to predictable changes in behavior, thereby confirming the causal relevance of the discovered circuit. These results show that, even for complex functions, transformers can implement a structured and interpretable mechanism.

Limitations and Future Work

Model Scale. Our circuit analysis focused on a relatively small transformer. Establishing whether similar interpretable circuits exist in larger models remains an important open question to follow up.

Manual Circuit Discovery. The techniques employed here required substantial human effort: path-patching, ablations, and extensive interpretation of attention heads. For large-scale models, such manual approaches become less feasible. We therefore see a need for automated or semi-automated methods that can discover and interpret these circuits with less human input.

Partial Perturbations. Although our targeted activation swaps successfully steered the Output Head's behavior, we have not demonstrated a complete perturbation of its predicted tokens. This is due to the distributed nature of the underlying mechanism (multiple heads fulfill similar roles). Coordinating interventions across all such heads will require systematic workflows, which we aim to develop in the future.

Despite these constraints, our work shows that disassembling transformer circuitry can yield two key benefits. First, it illuminates how compositional functions are mechanistically instantiated at the attention-head level. Second, it enables targeted, activation-based interventions that reliably steer model behavior. We hope these contributions will encourage further research on scalable circuit discovery methods and more automated interpretability approaches for large-scale models.
Acknowledgement
CT was supported by the Friends of McGovern Fellowship. MJ was supported by the Simons Foundation.
Appendix

Transformer Model

We adopt an encoder-decoder architecture, which naturally fits the task by allowing the encoder to process the prompt (question + support) with bidirectional self-attention and the decoder to generate an output sequence with causal self-attention and cross-attention. Specific hyperparameters include:

• Token embedding dimension: dmodel = 128

• Attention embedding dimension: dhead = 16

• Eight attention heads per layer (both encoder and decoder)

• Pre-LayerNorm (applied to the attention/MLP modules) plus an additional LayerNorm at the encoder and decoder outputs

• Standard sinusoidal positional embeddings

The encoder comprises two layers of bidirectional self-attention + MLP, while the decoder comprises two layers of causal self-attention + cross-attention + MLP. We train the model by minimizing the cross-entropy loss (averaged over tokens) using the Adam optimizer. The learning rate is initialized at 0.001 with a warm-up phase over the first epoch, then linearly decays to 0.00005 over training. We apply dropout of 0.1 to both the input embeddings and the internal Transformer layers, and train with a batch size of 25 episodes. All experiments are performed on an NVIDIA A100 GPU.
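For concreteness, a configuration matching this description can be sketched in PyTorch as follows; the variable names and the feed-forward width are our assumptions rather than the paper's exact values:

```python
import torch.nn as nn

d_model, n_heads, n_layers, ffn_dim = 128, 8, 2, 512  # ffn width assumed

transformer = nn.Transformer(
    d_model=d_model,
    nhead=n_heads,                 # 8 heads of dim 16 each (128 / 8)
    num_encoder_layers=n_layers,   # bidirectional self-attn + MLP
    num_decoder_layers=n_layers,   # causal self-attn + cross-attn + MLP
    dim_feedforward=ffn_dim,
    dropout=0.1,
    norm_first=True,               # Pre-LayerNorm, with final encoder/decoder norms
    batch_first=True,
)
# Token embeddings (vocabulary size is task-dependent) plus sinusoidal
# position embeddings feed the inputs; a final linear layer maps decoder
# states to vocabulary logits.
```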
Task Structure

In each episode, the support set and question are concatenated into a single prompt for the encoder, with the question tokens placed at the start. The question, primitive assignments, and function assignments are separated by '|' tokens, while primitive and function assignments are identified by '='. Overall, there are 6 possible colors and 9 symbols that may serve as either color primitives or function symbols. Each episode contains 2–4 function assignments and 3–4 color assignments. A function may be a single-argument (arg func) or double-argument (arg1 func arg2) function. The function's right-hand side (RHS) describes how the arguments are transformed, generated by randomly sampling up to length-5 sequences of arguments and mapping them to color tokens. Each prompt ends with an 'EOS' token. During decoding, the model begins with an 'SOS' token and iteratively appends each newly generated token until it emits 'EOS'.

We randomly generate 10,000 episodes for training and 2,000 for testing, ensuring that the primitive and function assignments in testing episodes do not overlap with those in the training set.
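The episode format can be illustrated with a toy generator; this is a sketch under our reading of the grammar (single-argument functions and token-level details are simplified):

```python
import random

COLORS = ["red", "blue", "green", "yellow", "purple", "pink"]  # 6 colors
SYMBOLS = list("ABCDEFGHI")                                    # 9 symbols

def make_episode() -> str:
    n_prims = random.choice([3, 4])        # 3-4 color assignments
    n_funcs = random.choice([2, 3, 4])     # 2-4 function assignments
    syms = random.sample(SYMBOLS, n_prims + n_funcs)
    prims, funcs = syms[:n_prims], syms[n_prims:]
    colors = {p: random.choice(COLORS) for p in prims}

    support = [f"{p} = {colors[p]}" for p in prims]
    for f in funcs:                        # double-argument case, for brevity
        a1, a2 = random.sample(prims, 2)
        rhs = [random.choice([a1, a2]) for _ in range(random.randint(1, 5))]
        # Demonstrate 'a1 f a2 = <colors of rhs>', e.g. 'A S B = blue red'.
        support.append(f"{a1} {f} {a2} = " + " ".join(colors[x] for x in rhs))

    q1, q2 = random.sample(prims, 2)
    question = f"{q1} {random.choice(funcs)} {q2}"   # e.g. 'B S A'
    return question + " | " + " | ".join(support) + " | EOS"

print(make_episode())
```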
Path Patching

Path patching is a method for isolating how a specific source node in the network influences a particular target node. It proceeds in three runs:
1. Clean Run: Feed the input through the model normally and cache all intermediate activations (including those of the source and target nodes).

2. Perturbed Run: Freeze all direct paths into the target node using their cached activations from the clean run. For the source node alone, replace its cached activation with mean-ablated values. Record the new, perturbed activation at the target node.

3. Evaluation Run: Supply the target node with the perturbed activation from Step 2, then measure any resulting changes in the model's output. This quantifies how the source node's contribution (altered via mean-ablation) affects the target node's behavior.
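In pseudocode, the three runs amount to the following sketch; `run_and_cache`, `direct_inputs`, and `run_with_patches` are hypothetical helpers standing in for the hook plumbing, not functions from an actual library:

```python
import torch

def path_patch(model, batch, source, target, mean_acts):
    """Estimate how `source`'s direct contribution to `target` affects the output."""
    # 1. Clean run: cache every intermediate activation.
    clean = run_and_cache(model, batch)

    # 2. Perturbed run: freeze all direct paths into `target` at their clean
    #    values, except `source`, which is mean-ablated; read out `target`.
    overrides = {node: clean[node] for node in direct_inputs(target)}
    overrides[source] = mean_acts[source]
    patched_target = run_with_patches(model, batch, overrides, read=target)

    # 3. Evaluation run: substitute the perturbed `target` activation and
    #    measure the change in the model's output.
    with torch.no_grad():
        return run_with_patches(model, batch, {target: patched_target})
```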
R2 Score

To quantify how much an activation dataset Y encodes a particular latent variable Z, we regress Y on the one-hot encoding of Z and measure the explained variance:

R2 = 1 − SSres / SStotal.

An R2 value of 1.0 indicates that Z fully explains the variance in Y, whereas an R2 near 0.0 implies Z provides no information about Y.
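This score can be computed with an ordinary least-squares fit; a sketch using scikit-learn, where `Y` stacks the activations and `z` holds the integer index labels (array names and shapes are our assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def index_r2(Y: np.ndarray, z: np.ndarray) -> float:
    """R^2 of predicting activations Y (n_samples, d) from one-hot index labels z."""
    Z = np.eye(z.max() + 1)[z]                    # one-hot encode the labels
    Y_hat = LinearRegression().fit(Z, Y).predict(Z)
    ss_res = np.sum((Y - Y_hat) ** 2)             # residual sum of squares
    ss_total = np.sum((Y - Y.mean(axis=0)) ** 2)  # total variance around the mean
    return 1.0 - ss_res / ss_total
```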