
Attention Warm-Start: Initializing Q/K from Bigram Co-occurrence SVD#678

Open
SPThole wants to merge 7 commits into openai:main from SPThole:non_record_1

Conversation

SPThole commented Mar 25, 2026

[NON RECORD] Only 1×H100 Used: Summary

Initializes W_Q and W_K in layer 0 from bigram co-occurrence statistics via SVD, so the model's initial attention patterns reflect real token relationships rather than random noise. Zero extra parameters — only changes initialization.

Motivation

In a 600-second training window (~1100 steps), the model spends its first few hundred steps learning basic token co-occurrence patterns that are trivially available from corpus statistics. By encoding this structure into the Q/K matrices at initialization, we give the model a head start on learning "which tokens should attend to which."

Method

  1. Build co-occurrence matrix C ∈ R^{1024×1024} from 2M training tokens (<1s)
  2. Log-transform + double-center to get PMI-like matrix: C ← log(C+1), subtract row/column means
  3. Project into model_dim via fixed random projection: C_proj = P^T C P ∈ R^{512×512}
  4. SVD factorize: C_proj = UΣV^T, set W_Q ← (U · √Σ)^T, W_K ← (V^T · √Σ) so that W_Q^T · W_K ≈ C_proj
  5. Head diversity: SVD components 1–64 → head 0, 65–128 → head 1, etc. Each head captures a different frequency band of co-occurrence
  6. Scale normalize to match Frobenius norm of default orthogonal init

Total overhead: <3 seconds. Applied to layer 0 only (where hidden states ≈ embeddings).
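The six steps above can be sketched in numpy as follows. This is a minimal reconstruction from the description, not the PR's actual code; the function name, the uniform `√Σ` split between W_Q and W_K, and the use of `√model_dim` as the Frobenius norm of a square orthogonal init are assumptions.

```python
import numpy as np

def warmstart_qk(tokens, vocab=1024, model_dim=512, n_heads=8, head_dim=64, seed=42):
    """Sketch of the warm-start: build layer-0 W_Q/W_K from bigram statistics."""
    # 1. Bigram co-occurrence counts: C[i, j] += 1 for each adjacent pair (i, j).
    C = np.zeros((vocab, vocab))
    np.add.at(C, (tokens[:-1], tokens[1:]), 1.0)

    # 2. Log-transform, then double-center (subtract column and row means)
    #    to get a PMI-like matrix.
    C = np.log1p(C)
    C -= C.mean(axis=0, keepdims=True)
    C -= C.mean(axis=1, keepdims=True)

    # 3. Fixed random projection P into model_dim: C_proj = P^T C P.
    rng = np.random.default_rng(seed)
    P = rng.standard_normal((vocab, model_dim)) / np.sqrt(vocab)
    C_proj = P.T @ C @ P                      # (model_dim, model_dim)

    # 4. SVD factorize so that W_Q^T @ W_K == U Σ V^T == C_proj.
    U, S, Vt = np.linalg.svd(C_proj)
    W_Q = (U * np.sqrt(S)).T                  # (U · √Σ)^T
    W_K = np.sqrt(S)[:, None] * Vt            # √Σ · V^T

    # 5. Head diversity: rows are ordered by singular value, so slicing
    #    rows [h*head_dim : (h+1)*head_dim] per head h gives head 0 the top
    #    components (0–63), head 1 the next band (64–127), and so on.

    # 6. Scale to the Frobenius norm of a square orthogonal init (= √model_dim).
    target = np.sqrt(model_dim)
    W_Q *= target / np.linalg.norm(W_Q)
    W_K *= target / np.linalg.norm(W_K)
    return W_Q, W_K
```

Since the final rescale applies a scalar to each factor, `W_Q^T @ W_K` remains proportional to `C_proj`, so the attention logits at init still follow the co-occurrence structure.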

Results

  • val_bpb: 1.3525 (post int6+zstd, 1×H100, seed=42); expected to be much lower on 8×H100
  • Pre-quant: 1.3245 | Quant penalty: 0.0280
  • 1,099 steps in 600s | 15.55MB artifact
  • Run on 1×H100 due to compute constraints

Built upon PR #623.

Observation

Slightly worse than the baseline without it (1.3345). The random projection P introduces noise that dilutes the co-occurrence signal. A tighter approach — using the actual embedding matrix E as the projection basis — could better preserve the structure. The principle of "warm-starting attention from corpus statistics" remains promising for short training regimes.
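The proposed tighter variant amounts to swapping the random P in step 3 for the model's embedding matrix. A hedged sketch, where `E` is assumed to be the (vocab × model_dim) token-embedding array and `C` the centered PMI-like matrix from step 2:

```python
import numpy as np

def project_with_embeddings(C, E):
    """Variant from the observation: project the co-occurrence matrix through
    the embedding matrix E instead of a random P, so C_proj lives in the
    subspace layer 0 actually sees. The downstream SVD step is unchanged."""
    return E.T @ C @ E   # (model_dim, model_dim)
```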

Test plan

  • Verified co-occurrence matrix builds in <3s
  • Confirmed W_Q/W_K norms match orthogonal init post-scaling
  • Full 600s training run completes without divergence
  • Artifact fits under 16MB
