
Attention Warm-Start: Initializing Q/K from Bigram Co-occurrence SVD#678

Open
SPThole wants to merge 7 commits into openai:main from SPThole:non_record_1

Conversation

SPThole commented Mar 25, 2026

[NON RECORD] Only 1×H100 Used: Summary

Initializes W_Q and W_K in layer 0 from bigram co-occurrence statistics via SVD, so the model's initial attention patterns reflect real token relationships rather than random noise. Zero extra parameters — only changes initialization.

Motivation

In a 600-second training window (~1100 steps), the model spends its first few hundred steps learning basic token co-occurrence patterns that are trivially available from corpus statistics. By encoding this structure into the Q/K matrices at initialization, we give the model a head start on learning "which tokens should attend to which."

Method

  1. Build co-occurrence matrix C ∈ R^{1024×1024} from 2M training tokens (<1s)
  2. Log-transform + double-center to get PMI-like matrix: C ← log(C+1), subtract row/column means
  3. Project into model_dim via fixed random projection: C_proj = P^T C P ∈ R^{512×512}
  4. SVD factorize: C_proj = UΣV^T, set W_Q ← (U · √Σ)^T, W_K ← (V^T · √Σ) so that W_Q^T · W_K ≈ C_proj
  5. Head diversity: SVD components 1–64 → head 0, 65–128 → head 1, etc. Each head captures a different frequency band of co-occurrence
  6. Scale normalize to match Frobenius norm of default orthogonal init

Total overhead: <3 seconds. Applied to layer 0 only (where hidden states ≈ embeddings).
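The six steps above can be sketched in numpy as follows. This is a minimal reconstruction from the description, not the PR's actual code; the function name, the uniform `√Σ` split between W_Q and W_K, and the use of `√model_dim` as the Frobenius norm of a square orthogonal init are assumptions.

```python
import numpy as np

def warmstart_qk(tokens, vocab=1024, model_dim=512, n_heads=8, head_dim=64, seed=42):
    """Sketch of the warm-start: build layer-0 W_Q/W_K from bigram statistics."""
    # 1. Bigram co-occurrence counts: C[i, j] += 1 for each adjacent pair (i, j).
    C = np.zeros((vocab, vocab))
    np.add.at(C, (tokens[:-1], tokens[1:]), 1.0)

    # 2. Log-transform, then double-center (subtract column and row means)
    #    to get a PMI-like matrix.
    C = np.log1p(C)
    C -= C.mean(axis=0, keepdims=True)
    C -= C.mean(axis=1, keepdims=True)

    # 3. Fixed random projection P into model_dim: C_proj = P^T C P.
    rng = np.random.default_rng(seed)
    P = rng.standard_normal((vocab, model_dim)) / np.sqrt(vocab)
    C_proj = P.T @ C @ P                      # (model_dim, model_dim)

    # 4. SVD factorize so that W_Q^T @ W_K == U Σ V^T == C_proj.
    U, S, Vt = np.linalg.svd(C_proj)
    W_Q = (U * np.sqrt(S)).T                  # (U · √Σ)^T
    W_K = np.sqrt(S)[:, None] * Vt            # √Σ · V^T

    # 5. Head diversity: rows are ordered by singular value, so slicing
    #    rows [h*head_dim : (h+1)*head_dim] per head h gives head 0 the top
    #    components (0–63), head 1 the next band (64–127), and so on.

    # 6. Scale to the Frobenius norm of a square orthogonal init (= √model_dim).
    target = np.sqrt(model_dim)
    W_Q *= target / np.linalg.norm(W_Q)
    W_K *= target / np.linalg.norm(W_K)
    return W_Q, W_K
```

Since the final rescale applies a scalar to each factor, `W_Q^T @ W_K` remains proportional to `C_proj`, so the attention logits at init still follow the co-occurrence structure.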

Results

  • val_bpb: 1.3525 (post int6+zstd, 1×H100, seed=42); expected to be much lower on 8×H100
  • Pre-quant: 1.3245 | Quant penalty: 0.0280
  • 1,099 steps in 600s | 15.55MB artifact
  • Run on 1×H100 due to compute constraints

Built upon PR #623.

Observation

Slightly worse than the baseline without it (1.3345). The random projection P introduces noise that dilutes the co-occurrence signal. A tighter approach — using the actual embedding matrix E as the projection basis — could better preserve the structure. The principle of "warm-starting attention from corpus statistics" remains promising for short training regimes.
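The proposed tighter variant amounts to swapping the random P in step 3 for the model's embedding matrix. A hedged sketch, where `E` is assumed to be the (vocab × model_dim) token-embedding array and `C` the centered PMI-like matrix from step 2:

```python
import numpy as np

def project_with_embeddings(C, E):
    """Variant from the observation: project the co-occurrence matrix through
    the embedding matrix E instead of a random P, so C_proj lives in the
    subspace layer 0 actually sees. The downstream SVD step is unchanged."""
    return E.T @ C @ E   # (model_dim, model_dim)
```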

Test plan

  • Verified co-occurrence matrix builds in <3s
  • Confirmed W_Q/W_K norms match orthogonal init post-scaling
  • Full 600s training run completes without divergence
  • Artifact fits under 16MB
